ARM Cortex-A76AE Reliably Stays in Lock Step

“A man with a watch always knows what time it is. A man with two is never certain.” – Segal’s Law

Another day, another new ARM processor. These guys are like Taco Bell. They keep rearranging the same three or four ingredients to create a surprising assortment of different products. A pipeline here, a new cache there, and you’ve got a whole menu of options. There’s something for everyone, even if it does all taste the same.

One new ingredient in ARM’s kitchen, though, is a reliability feature called split-lock. It’s aimed at autonomous vehicles (self-driving cars) but it could also be relevant to robotics, aerospace, and other high-reliability applications. It’s not an entirely new concept, but it does add a bit of spice to an otherwise familiar bill of fare.

You can’t tell much by just looking at the menu, so I ducked behind the counter and talked directly to the cooks in the kitchen. Specifically, I spent some quality time with members of ARM’s CPU design team, who were candid, friendly, and helpful. It was a refreshing change from the usual corporate briefing and death by PowerPoint.

The processor in question is the Cortex-A76AE, and it’s all-new… but it’s not. As the name suggests, it’s based on the Cortex-A76, which we covered during its announcement in June. The -A76 is ARM’s current top of the line: a superscalar, out-of-order, 64-bit beast with a projected 3.0-GHz maximum clock frequency. The new -A76AE is all of that plus “AE,” which stands for “automotive enhancements.” Despite the name, the enhancements aren’t really specific to cars, although that’s clearly the sexiest market niche for now. Who would want to hear about a processor for industrial automation?

There are a number of goodies rolled up into that AE moniker, some more subtle than others. For example, ARM provides licensees of the -A76AE with documentation and testing info that will help with their eventual certification under ISO 26262, the functional-safety standard for road vehicles. You can’t usually certify IP itself, but, if you’re clever, you can remove roadblocks to your customers’ ultimate certification once they put it in an SoC. Imagination Technologies did a similar thing a year ago with its MIPS I6500-F processor.

ARM’s design team also certified themselves, in a way, as they were designing the -A76AE. As with ISO 9001 certification, you can show that your internal processes are documented, understood, and adhered to, and that helps both you and your customers clear certain regulatory hurdles. ARM was looking ahead with the -A76AE, knowing that its customers would be facing a regulatory decathlon.

The most interesting hardware component of the AE, though, is the split-lock feature. This allows two CPU cores to run in lockstep, each one double-checking the other on a cycle-by-cycle basis. If one CPU somehow disagrees with the other, the -A76AE flags a lockstep fault. The idea is to run safety-critical code on two CPUs in lockstep for maximum reliability. One CPU core might conceivably suffer a random failure, but two CPUs failing in the same way at the same time is statistically unlikely.
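As a rough illustration of that cycle-by-cycle checking, the Python sketch below models each core as a list of per-cycle output values and reports the first cycle where the two disagree. It is a toy model under stated assumptions, not ARM's hardware: the function name, the trace values, and the injected bit flip are all hypothetical.

```python
from typing import Optional, Sequence


def first_lockstep_mismatch(core_a: Sequence[int],
                            core_b: Sequence[int]) -> Optional[int]:
    """Return the first cycle at which the two cores' outputs differ, or None."""
    if len(core_a) != len(core_b):
        raise ValueError("lockstep cores must run for the same number of cycles")
    for cycle, (out_a, out_b) in enumerate(zip(core_a, core_b)):
        if out_a != out_b:
            return cycle  # real hardware would flag a lockstep fault here
    return None


# Both cores run the same code, so their outputs normally match every cycle.
healthy = [0x10, 0x24, 0x38, 0x4C]

# Hypothetical random fault: one bit flips in core B's output on cycle 2.
faulty = list(healthy)
faulty[2] ^= 0x01

print(first_lockstep_mismatch(healthy, healthy))  # None
print(first_lockstep_mismatch(healthy, faulty))   # 2
```

Note that the comparison only reveals that the cores disagree, not which one is wrong, which is why neither core can be treated as the authority.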

Neither CPU in this twinning arrangement is master or slave. They both run the same code at the same speed, and both are treated equally. Most of the time (in fact, all the time under normal circumstances), both CPUs will produce identical results. Their behavior will be indistinguishable. That’s what you want.

However, running in lockstep does obviously mean that you’re giving up half of your processing resources. A dual-core -A76AE running in lockstep is only as fast as a single-core version, and a four-core configuration will perform like a dual-core system – but with the added insurance of backup hardware.
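The trade-off is simple arithmetic, sketched here in Python. The function name is made up for illustration, and the model ignores everything except core count, so treat it as a back-of-the-envelope aid rather than a performance estimate.

```python
def effective_cores(physical_cores: int, lockstep: bool) -> int:
    """Cores' worth of throughput visible to software (toy model)."""
    if physical_cores < 1:
        raise ValueError("need at least one core")
    if not lockstep:
        return physical_cores
    if physical_cores % 2 != 0:
        raise ValueError("lockstep pairs need an even number of cores")
    # Each pair does the work of one core; the twin only cross-checks.
    return physical_cores // 2


print(effective_cores(2, lockstep=True))   # 1
print(effective_cores(4, lockstep=True))   # 2
print(effective_cores(4, lockstep=False))  # 4
```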

That’s your call, and enough high-reliability customers are interested in that configuration to make it worth ARM’s time to produce a special version of its top-flight CPU for just that purpose.

You don’t have to run the -A76AE in lockstep if you don’t want to. It’s an optional feature, and if it’s disabled, you essentially have a standard Cortex-A76 with better documentation. Lockstep has to be enabled/disabled at reset, so it’s not something you’d switch on or off on the fly. Which makes sense. This isn’t like toggling between 16/32-bit modes on an x86 processor. You either really want to use lockstep or you don’t.
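To make the reset-only behavior concrete, here is a small Python model in which the mode is latched at reset and any attempt to change it later is refused. How the real part selects its mode (a pin, a register, something else) wasn't discussed, so the `Cluster` class, its methods, and its default mode are all assumptions made for this sketch.

```python
from enum import Enum


class Mode(Enum):
    SPLIT = "split"  # cores run independently
    LOCK = "lock"    # cores run as lockstep pairs


class Cluster:
    """Toy model of a cluster whose mode is sampled only at reset."""

    def __init__(self) -> None:
        self._mode = Mode.SPLIT  # assumed default, for this sketch only

    @property
    def mode(self) -> Mode:
        return self._mode

    def reset(self, requested: Mode) -> None:
        # The only place in this model where the mode can change.
        self._mode = requested

    def set_mode_at_runtime(self, requested: Mode) -> None:
        # Always refused: there is no on-the-fly switch.
        raise RuntimeError("split/lock is selected at reset, not on the fly")


cluster = Cluster()
cluster.reset(Mode.LOCK)
print(cluster.mode.value)  # lock

try:
    cluster.set_mode_at_runtime(Mode.SPLIT)
except RuntimeError as err:
    print(err)  # split/lock is selected at reset, not on the fly

print(cluster.mode.value)  # lock
```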

If you don’t, that’s the “split” part of ARM’s split-lock feature description. Split mode is simply what we used to call normal operation. Like “World War I,” it’s a retronym: a new name for an old thing, coined only after a newer development makes the original name ambiguous. Running in split mode, as opposed to lock mode, means that all the CPU cores in your -A76AE implementation run independently, like normal. There’s almost no difference between an -A76AE running in split (normal) mode and a standard -A76.

The “almost” prompts a small footnote. There is obviously some additional hardware within the -A76AE to implement the lock-mode safety checking, and, like most gratuitous hardware, it exacts a small performance penalty. The design team told me that an -A76AE runs about 5% slower than a standard -A76, all things being equal. Part of that minor performance hit comes from a slightly slower maximum clock frequency, and part from some slight differences in IPC (instructions per clock cycle). Running the processor in split mode (i.e., with lockstep disabled) claws back a small portion of that difference – perhaps a few percent – but not all of it. Thus, an -A76AE can never be as fast as a standard -A76. But if you’re really worried about that last 5% of theoretical maximum performance, you’re probably using the wrong processor or writing really bad code.
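Because throughput is roughly clock frequency times IPC, the two penalties multiply rather than add. The Python sketch below works through one possible breakdown. The 3%/2% split is purely an assumption for illustration; the design team gave only the overall figure of about 5%.

```python
def relative_performance(clock_ratio: float, ipc_ratio: float) -> float:
    """Throughput relative to a baseline core: clock ratio times IPC ratio."""
    return clock_ratio * ipc_ratio


# Assumed split of the penalty (NOT from ARM): 3% lower clock, 2% lower IPC.
ae = relative_performance(clock_ratio=0.97, ipc_ratio=0.98)
slowdown_pct = (1.0 - ae) * 100.0

print(f"relative performance: {ae:.3f}")  # relative performance: 0.951
print(f"slowdown: {slowdown_pct:.1f}%")   # slowdown: 4.9%
```

Any other pair of small ratios whose product is near 0.95 would fit the quoted figure equally well.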

Twinning processors for reliability isn’t an entirely new concept. It’s not even new to ARM, having first appeared in the Cortex-R8. NASA’s Space Shuttle ganged six computers together, with four cross-checking one another, a fifth for backup, and one cold spare. Big fault-tolerant systems have used similar techniques for decades. Plenty of designers have also crafted their own solutions, usually with standalone processors and external hardware with lots of registers and comparators. ARM has made the process a whole lot easier. Just toggle a few configuration bits at bootup and stand back.

How does it work? An extreme approach might have been to scatter comparators all throughout the processor’s pipeline, checking every register, data bus signal, and address line. That’s overkill, according to the designers I spoke with, and it’s unnecessary. There’s no need to burden the entire microarchitecture. All you really care about is what happens outside the CPU core. It’s enough to compare “external” bus signals (external to the CPU core; internal to the SoC). The -A76AE monitors the interfaces between the CPU core, its caches, and its coherent buses. If it detects anything untoward, it raises a lockstep fault. What you do from that point is up to you.
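The design choice of checking only the core's boundary can be sketched in Python as follows. Each core's per-cycle state is split into internal registers, which the checker ignores, and interface signals, which it compares. The signal names and values are invented for the example and don't correspond to ARM's actual interfaces.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CycleSnapshot:
    internal_regs: Tuple[int, ...]  # pipeline/register state: never compared
    bus_addr: int                   # hypothetical interface signals below
    bus_data: int
    bus_write: bool


def boundary_signals(snap: CycleSnapshot) -> Tuple[int, int, bool]:
    """Only what leaves the core is visible to the checker."""
    return (snap.bus_addr, snap.bus_data, snap.bus_write)


def lockstep_fault(a: CycleSnapshot, b: CycleSnapshot) -> bool:
    return boundary_signals(a) != boundary_signals(b)


core_a = CycleSnapshot((1, 2, 3), 0x8000, 0xCAFE, True)
# Core B has a corrupted internal register, but its bus activity still matches.
core_b_internal_only = CycleSnapshot((1, 2, 7), 0x8000, 0xCAFE, True)
# Later the corruption reaches the interface as a wrong data value.
core_b_bad_write = CycleSnapshot((1, 2, 7), 0x8000, 0xCAFF, True)

print(lockstep_fault(core_a, core_b_internal_only))  # False
print(lockstep_fault(core_a, core_b_bad_write))      # True
```

In this model an internal upset goes unnoticed until it affects something outside the core, which is the point at which it starts to matter to the rest of the SoC.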

What the -A76AE doesn’t require is any extra software. There’s no software component at all to the lockstep feature; it’s all done in hardware. The only code you’ll need to write is the handler in case of a fault. Apart from a few configuration bits, the entire process is software-invisible.

Like all recent ARM processors, the -A76AE is designed to be used in either homogeneous processor groupings (i.e., all CPU cores the same) or heterogeneous arrangements, preferably a “big.LITTLE” pairing with Cortex-A55. Additionally, you can choose to implement one, two, or four CPU cores per “cluster,” which is a specific ARM-defined grouping. You can have as many clusters as you want within your SoC. For redundancy to work, you’d obviously want an even number of CPUs per cluster (two or four). Only CPUs within a cluster can be paired for lockstep operation; you can’t pair a CPU from one cluster with one from another cluster. Also, with a four-core cluster, you must enable lockstep for both pairs. In other words, lockstep is a cluster-level feature. You can’t enable it for half the cluster but not the other half.
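Those cluster rules can be captured in a few lines of Python. This is a sketch of the constraints as described here, not a configuration tool: the function name is hypothetical, and pairing adjacent core numbers (0 with 1, 2 with 3) is an assumption, since which cores pair up wasn't specified.

```python
from typing import List, Tuple

ALLOWED_CORES_PER_CLUSTER = (1, 2, 4)


def lockstep_pairs(cores_per_cluster: int, lockstep: bool) -> List[Tuple[int, int]]:
    """Return the lockstep pairs within one cluster, or [] in split mode."""
    if cores_per_cluster not in ALLOWED_CORES_PER_CLUSTER:
        raise ValueError("a cluster holds one, two, or four cores")
    if not lockstep:
        return []  # split mode: every core runs independently
    if cores_per_cluster % 2 != 0:
        raise ValueError("lockstep needs two or four cores in the cluster")
    # Lockstep is cluster-wide, so every core in the cluster gets a twin.
    # Adjacent numbering is an assumption made for this sketch.
    return [(core, core + 1) for core in range(0, cores_per_cluster, 2)]


print(lockstep_pairs(2, lockstep=True))   # [(0, 1)]
print(lockstep_pairs(4, lockstep=True))   # [(0, 1), (2, 3)]
print(lockstep_pairs(4, lockstep=False))  # []
```

Because the function takes a single cluster at a time, cross-cluster pairing and half-locked clusters simply can't be expressed.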

The changes between the standard Cortex-A76 and the -A76AE seem minor, but they were planned far in advance. The lockstep feature wasn’t just grafted on at the last minute, and the certification requirements had to be in place early on, before work on the CPU core even began. So, even though the standard -A76 doesn’t make use of any of that extra work, its design was informed by the requirements of its younger twin. ARM’s design team had to avoid shooting themselves in the foot with the -A76 so they’d be in a good position when it came time to certify the -A76AE. “The overall design process would’ve looked substantially different,” had they not planned for redundancy from the outset, one engineer told me.

It’s gratifying to know that CPU cycles – and silicon transistors in general – are so plentiful that we can casually throw half of them away by duplicating an entire CPU and its caches. There was a time when a multimillion-transistor processor was an incredible (and incredibly expensive) achievement. Now we double them up and fold them over. At the same time, it’s nice to know that future autonomous vehicles and other high-reliability systems will have that kind of redundancy built in. Hey, Cortex-A76AE, looks like you’ve got my back.