Apple’s upcoming ARM-based Macs will run existing x86 software alongside ARM native code. How does that work, how good is it, and what magic tricks is Apple using? We asked around to find out.
I spoke with some current and former Apple employees and microprocessor designers, who shall remain anonymous. They hinted at certain hardware/software tradeoffs that hold the secret to Apple’s remarkably good performance in emulation.
First, the basics. Apple has gone through at least five different processor generations by my count. The company started with the 6502, then 68K, PowerPC, x86, and now ARM. It’s always used ARM for its iDevices – iPad, iPhone, iPod, etc. – since the very first iPod almost 20 years ago. In parallel with that, Apple swapped out the moribund 68K processor family for PowerPC chips (which it helped design) in its Macintosh product line. Those later gave way to Intel x86-based Macs, and, about a year from now, they’ll switch again to the first ARM-based Macs. That makes four completely unrelated processor families in one product line. It also neatly closes the circle, since Apple was among the very first investors (along with VLSI) in ARM back when it was still Acorn/Advanced RISC Machines, Ltd.
What’s remarkable is that all three architecture changes supported the previous generation’s binaries. PowerMacs could run unmodified 68K code, x86-based Macs could run existing PowerPC code, and the promised ARM-based Macs will run today’s x86 code. Users won’t be able to tell the difference. It just works.
That’s a good trick, as anyone who’s tried to write a binary translator or emulator will tell you. Oh, it seems simple enough, but the devil is in the details. You can get 80% or so of your code running early on, and that leads to a false sense of accomplishment. The remaining 20% takes approximately… forever. Most binary translation projects either get abandoned, run painfully slowly, are fraught with bugs, or have unsupported corner cases. Apple is among the few companies (including IBM and the old Digital Equipment Corp.) to get it working reliably. And now the company has pulled off the same trick three times.
How well does it work? Well enough that Apple’s emulation beats other systems running natively.
Some early benchmark results show Apple’s ARM-based hardware running x86 benchmark code in emulation. The hardware in question is Apple’s Developer Transition Kit (DTK), a special developer-only box intended to give programmers a head start on porting their x86 MacOS applications to the new ARM-based hardware. The DTK uses the same A12Z chip found in the latest (2020 model) iPad Pro. The A12Z is, in turn, a slight variation on the earlier A12X processor used in previous-generation iPad Pro models. In fact, the A12X and A12Z are essentially identical except for the number of GPU cores: the A12X has seven while the later A12Z has eight. (They are the exact same die, with one GPU core disabled on the A12X.) Both have eight CPU cores in a 4+4 “big.LITTLE” arrangement, with four big high-performance cores and four smaller power-efficient cores. The whole cluster is clocked at about 2.5 GHz.
It’s worth noting that the A12 generation of iPad Pro tablets has been shipping for about two years, so the A12X and A12Z aren’t cutting edge anymore. The A12Z isn’t the chip that will power next year’s ARM-based Macs, but it’s as close as we (or Apple) can get today. Production machines will be powered by the upcoming A14X, which should be considerably faster, with 12 CPU cores (eight fast and four slow) and built in TSMC’s leading-edge 5nm process.
Geekbench identifies the DTK’s processor as “VirtualApple 2400 MHz (4 cores).” That tells us two things. First, that it’s running the benchmarks in x86 emulation mode. Otherwise, it’d report the A12Z as an ARM processor. Geekbench also reports the operating system as MacOS, which is available only in x86 form (for now, anyway). Second, that it’s using only four of the chip’s eight CPU cores, which suggests that the four low-power cores aren’t being exercised at all.
Benchmark scores are nice, but they’re meaningless in a vacuum. What makes them interesting is how they stack up against other machines. (For the record, these results were leaked in violation of the NDA that developers are required to sign before getting access to a DTK, but that’s no reason to believe they aren’t accurate.)
Surprisingly, the DTK’s x86 emulation is faster than some real x86 processors. Twitter user _rogame reports that it beats Microsoft’s Surface Book 3 (based on Intel’s Core i7-1065G7) in OpenCL benchmarks. It also beat an AMD Ryzen–based HP laptop in the same test.
Embarrassingly, the DTK also outscores a Surface Pro X, the ARM-based Windows laptop that Microsoft introduced less than a year ago, and which we covered previously. Like the DTK, the Surface Pro X has an ARM processor, but unlike the DTK, it’s running everything – the Windows operating system as well as the benchmark code – in native mode, not in emulation. Thus, Apple’s emulation of an x86 running on ARM is faster than Microsoft’s native ARM implementation.
(A bug in the Geekbench dashboard reports the Surface Pro X’s processor as a “Pentium II/III,” which is clearly wrong. It’s actually the SQ1 chip, a Microsoft-specific tweak of Qualcomm’s Snapdragon 8cx. The benchmarks are native ARM code for SQ1, not translated or emulated x86 code. Confused?)
All of this looks pretty good for Mac users hoping to switch to ARM-based systems while hanging on to their old Mac apps. So, how did Apple pull this off?
Has Apple redesigned ARM to run x86 binaries? Is there an x86 coprocessor or hardware accelerator of some sort to extend the basic ARM processor cluster? Special instructions, maybe? A big FPGA that converts binary code?
On one hand, Apple clearly has the resources to design anything it wants to, so it could have followed any of these paths. The company also has plenty of experience in this area, creating series after series of processors for iDevices for almost 20 years. On the other hand, there are limits to what its ARM license allows. Even Apple can’t radically alter the ARM instruction set or the programmer’s model without permission from ARM, which isn’t likely forthcoming. So, what’s the trick?
Apple insiders tell me the company has spent years profiling its code and looking for performance bottlenecks. It knows what MacOS code traces look like, and how third-party apps behave. They’ve got a decade of experience profiling and optimizing x86 binaries, too, so the team knows where the pain points are.
The consensus is that Apple didn’t need to mess around with the ARM architecture – at least, not the visible parts of it. Instead, the company likely optimized its microarchitecture: the underlying circuitry that implements the programmer-visible parts. All the x86 translation is done in software (called Rosetta 2), not in hardware. Unlike other x86 clone chips like Transmeta or nVidia’s Project Denver, there are no special x86-emulation opcodes, no shadow register sets to mimic the x86, no coprocessors, and no massive hardware assists.
Instead, Apple’s A14X remains a conventional ARM-based processor that’s optimized to run Rosetta 2 translation code really well. Caches should be quite large, because it needs to cache chunks of x86 code (as data) as well as the Rosetta 2 executable. The chip’s prefetch logic would also need to be optimized for misaligned x86 binaries, as well as for fusing operations that aren’t native to ARM’s instruction set. Task switching is vitally important, as the processor will be alternating between executing Rosetta 2 and the code it translates.
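To make that alternation between translator and translated code concrete, here is a toy sketch in Python of a translate-then-cache loop. It is not Rosetta 2’s design, which Apple hasn’t described in this kind of detail. The guest “instruction set,” the block layout, and the names translate and run are all invented for illustration.

```python
# Toy translate-and-cache loop. Everything here is hypothetical: the guest
# "instruction set", the block layout, and the function names.

# Guest program: block address -> list of ops. The last op in each block
# decides which block runs next.
GUEST_BLOCKS = {
    0: [("add", 5), ("add", 7), ("branch_if_below", 100, 0, 1)],
    1: [("add", 1), ("halt",)],
}


def translate(addr):
    """Turn one guest block into a host function: acc -> (acc, next_addr).

    A real translator would emit host machine code; a Python closure
    stands in for that here.
    """
    ops = GUEST_BLOCKS[addr]

    def host_block(acc):
        for op in ops:
            if op[0] == "add":
                acc += op[1]
            elif op[0] == "branch_if_below":
                _, limit, taken, fallthrough = op
                return acc, (taken if acc < limit else fallthrough)
            elif op[0] == "halt":
                return acc, None
        raise ValueError("guest block has no terminator")

    return host_block


def run(start=0):
    """Run the guest program, translating each block only on first use."""
    cache = {}          # guest block address -> translated host function
    translations = 0    # times the translator ran
    executions = 0      # times translated code ran
    acc, addr = 0, start
    while addr is not None:
        block = cache.get(addr)
        if block is None:            # cache miss: switch to the translator
            block = translate(addr)
            cache[addr] = block
            translations += 1
        acc, addr = block(acc)       # run the translated block
        executions += 1
    return acc, translations, executions


acc, translations, executions = run()
print(acc, translations, executions)
```

In this toy, the translator runs twice while translated blocks run ten times. The gap widens as loops get hotter, which is one reason a cache large enough to hold frequently used translations would pay off.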
Rosetta will likely produce certain repeating patterns, as x86 instructions will convert to similar sequences, so fetching, decoding, and dispatching those patterns would be a priority. ARM compilers tend to produce certain common patterns, and x86 compilers will produce different ones, so it’s important to optimize the front-end logic for both kinds, not just conventional ARM code. Fusing multiple ARM instructions in the decode and dispatch phases would provide a big performance boost.
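As a rough illustration of what fusing a recurring pattern means, the Python sketch below scans an instruction stream and merges adjacent pairs into a single unit. The pair table is an assumption made up for this example. It says nothing about which pairs, if any, Apple’s decoders actually fuse, and real fusion happens in hardware rather than in a software pass like this one.

```python
# Toy pattern fusion over a stream of instruction mnemonics.
# FUSIBLE_PAIRS is a hypothetical table chosen only for illustration.
FUSIBLE_PAIRS = {
    ("cmp", "b.cond"): "cmp+b.cond",   # compare followed by conditional branch
    ("adrp", "add"): "adrp+add",       # two-step address formation
}


def fuse(stream):
    """Return a new stream with each adjacent fusible pair merged into one entry."""
    fused = []
    i = 0
    while i < len(stream):
        pair = tuple(stream[i:i + 2])
        if pair in FUSIBLE_PAIRS:
            fused.append(FUSIBLE_PAIRS[pair])
            i += 2                     # both instructions consumed as one unit
        else:
            fused.append(stream[i])
            i += 1
    return fused


translated = ["ldr", "cmp", "b.cond", "adrp", "add", "cmp", "b.cond"]
print(fuse(translated))
```

Here seven instructions shrink to four dispatch units. If a translator emits the same few sequences over and over, even a short list of fused patterns could cover a large share of the stream.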
The x86 architecture supports more memory-addressing schemes than ARM does, so Rosetta is likely to encounter oddball memory references involving register-to-register arithmetic, offsets, and scaling. A little hardware assist here would yield dividends in performance. A few extra adders, shifters, or integer multipliers in the decode path could help this along.
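For a sense of the arithmetic involved, an x86 memory operand can combine a base register, an index register scaled by 1, 2, 4, or 8, and a constant displacement. The Python sketch below computes that address directly and then again as two simpler steps of the kind a translator might emit. The function names are hypothetical, and the ARM instructions mentioned in the comments are only loose analogies, not a claim about what Rosetta 2 generates.

```python
MASK64 = (1 << 64) - 1  # addresses wrap around at 64 bits


def x86_effective_address(base, index, scale, disp):
    """Compute base + index * scale + disp, as one x86 operand can express."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & MASK64


def lowered_in_two_steps(base, index, scale, disp):
    """The same address, split into two simpler operations."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale must be 1, 2, 4, or 8")
    shift = scale.bit_length() - 1           # 1, 2, 4, 8 -> 0, 1, 2, 3
    # Step 1: add with a shifted register, loosely like
    #   add tmp, base, index, lsl #shift
    tmp = (base + (index << shift)) & MASK64
    # Step 2: apply the constant displacement, loosely like a load
    # with an immediate offset from tmp.
    return (tmp + disp) & MASK64


addr = x86_effective_address(0x1000, 3, 8, -8)
print(hex(addr), hex(lowered_in_two_steps(0x1000, 3, 8, -8)))
```

Both calls give 0x1010. The point is that one x86 operand can turn into an extra arithmetic step on the ARM side, which is where a spare adder or shifter in the right place might help.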
Did Apple include special hardware to speed x86 translation? Yes, but not the kind of massive hardware-emulation machines that other companies have tried. Instead, the company has taken the typical Apple approach: do as much as possible in software and optimize the hardware to run that code really well. Software is more flexible, easier to patch and update, and oftentimes more capable.
Finally, it bears repeating that the early benchmark results are all from the A12Z chip, a two-year-old design with none of these enhancements. The first ARM-based Macs are still a year away and will be based on A14X, a chip that’s not only faster and newer but that was designed from the start to be a killer Rosetta 2 engine. Should Apple decide it needs more aggressive x86 hardware assists, it can always design those into the A15 generation.
Apple is uniquely qualified to create an effective binary translator. The company has done it twice before, always in software. The initial results suggest that it’s pulled off a hat trick, even without any hardware assists. Rosetta 2 running on A14X should be even better.