“My strength is translating emotion because I’m such a feeler.” – Selena Gomez
Apple’s upcoming ARM-based Macs will run existing x86 software alongside ARM native code. How does that work, how good is it, and what magic tricks is Apple using? We asked around to find out.
I spoke with some current and former Apple employees and microprocessor designers, who shall remain anonymous. They hinted at certain hardware/software tradeoffs that hold the secret to Apple’s remarkably good performance in emulation.
First, the basics. Apple has gone through at least five different processor generations by my count. The company started with the 6502, then 68K, PowerPC, x86, and now ARM. It’s always used ARM for its iDevices – iPad, iPhone, iPod, etc. – since the very first iPod almost 20 years ago. In parallel with that, Apple swapped out the moribund 68K processor family for PowerPC chips (which it helped design) in its Macintosh product line. Those later gave way to Intel x86-based Macs, and, about a year from now, they’ll switch again to the first ARM-based Macs. That makes four completely unrelated processor families in one product line. It also neatly closes the circle, since Apple was among the very first investors (along with VLSI) in ARM back when it was still Acorn/Advanced RISC Machines, Ltd.
What’s remarkable is that all three architecture changes supported the previous generation’s binaries. PowerMacs could run unmodified 68K code, x86-based Macs could run existing PowerPC code, and the promised ARM-based Macs will run today’s x86 code. Users won’t be able to tell the difference. It just works.
That’s a good trick, as anyone who’s tried to write a binary translator or emulator will tell you. Oh, it seems simple enough, but the devil is in the details. You can get 80% or so of your code running early on, and that leads to a false sense of accomplishment. The remaining 20% takes approximately… forever. Most binary translation projects either get abandoned, run painfully slowly, are fraught with bugs, or have unsupported corner cases. Apple is among the few companies (including IBM and the old Digital Equipment Corp.) to get it working reliably. And now the company has pulled off the same trick three times.
How well does it work? Well enough that Apple’s emulation beats other systems running natively.
Some early benchmark results show Apple’s ARM-based hardware running x86 benchmark code in emulation. The hardware in question is Apple’s Developer Transition Kit (DTK), a special developer-only box intended to give programmers an early head start on porting their x86 MacOS applications to the new ARM-based hardware. The DTK uses the same A12Z chip found in the latest (2020 model) iPad Pro. The A12Z is, in turn, a slight variation on the earlier A12X processor used in previous generation iPad Pro models. In fact, the A12X and A12Z are essentially identical except for the number of GPU cores. A12X has seven GPU cores while the later A12Z has eight. (They are, in fact, the exact same die, with one GPU core disabled on the A12X.) Both have eight CPU cores in a 4+4 “big.LITTLE” arrangement: four big high-performance CPUs and four smaller power-efficient CPUs. The whole cluster is clocked at about 2.5 GHz.
It’s worth noting that the A12 generation of iPad Pro tablets has been shipping for about two years, so the A12X and A12Z aren’t cutting edge anymore. The A12Z isn’t the chip that will power next year’s ARM-based Macs, but it’s as close as we (or Apple) can get today. Production machines will be powered by the upcoming A14X, which should be considerably faster, with 12 CPU cores (eight fast and four slow) and built in TSMC’s leading-edge 5nm process.
Geekbench identifies the DTK’s processor as “VirtualApple 2400 MHz (4 cores).” That tells us two things. First, that it’s running the benchmarks in x86 emulation mode. Otherwise, it’d report the A12Z as an ARM processor. Geekbench also reports the operating system as MacOS, which is available only in x86 form (for now, anyway). Second, that it’s using only four of the chip’s eight CPU cores, which suggests that the four low-power cores aren’t being exercised at all.
Benchmark scores are nice, but they’re meaningless in a vacuum. What makes them interesting is how they stack up against other machines. (For the record, these results were leaked in violation of the NDA that developers are required to sign before getting access to a DTK, but that’s no reason to believe they aren’t accurate.)
Surprisingly, the DTK’s x86 emulation is faster than some real x86 processors. Twitter user _rogame reports that it beats Microsoft’s Surface Book 3 (based on Intel’s Core i7-1065G7) in OpenCL benchmarks. It also beat an AMD Ryzen–based HP laptop in the same test.
Embarrassingly, DTK also outscores a Surface Pro X, the ARM-based Windows laptop that Microsoft introduced less than a year ago, and which we covered previously. Like DTK, the Surface Pro X has an ARM processor, but unlike DTK, it’s running everything – the Windows operating system as well as the benchmark code – in native mode, not in emulation. Thus, Apple’s emulation of an x86 running on ARM is faster than Microsoft’s native ARM implementation.
(A bug in the Geekbench dashboard reports the Surface Pro X’s processor as a “Pentium II/III,” which is clearly wrong. It’s actually the SQ1 chip, a Microsoft-specific tweak of Qualcomm’s Snapdragon 8cx. The benchmarks are native ARM code for SQ1, not translated or emulated x86 code. Confused?)
All of this looks pretty good for Mac users hoping to switch to ARM-based systems while hanging on to their old Mac apps. So, how did Apple pull this off?
Has Apple redesigned ARM to run x86 binaries? Is there an x86 coprocessor or hardware accelerator of some sort to extend the basic ARM processor cluster? Special instructions, maybe? A big FPGA that converts binary code?
On one hand, Apple clearly has the resources to design anything it wants to, so it could have followed any of these paths. The company also has plenty of experience in this area, creating series after series of processors for iDevices for almost 20 years. On the other hand, there are limits to what its ARM license allows. Even Apple can’t radically alter the ARM instruction set or the programmer’s model without permission from ARM, which isn’t likely forthcoming. So, what’s the trick?
Apple insiders tell me the company has spent years profiling its code and looking for performance bottlenecks. It knows what MacOS code traces look like, and how third-party apps behave. They’ve got a decade of experience profiling and optimizing x86 binaries, too, so the team knows where the pain points are.
The consensus is that Apple didn’t need to mess around with the ARM architecture – at least, not the visible parts of it. Instead, the company likely optimized its microarchitecture: the underlying circuitry that implements the programmer-visible parts. All the x86 translation is done in software (called Rosetta 2), not in hardware. Unlike other x86 clone chips like Transmeta or nVidia’s Project Denver, there are no special x86-emulation opcodes, no shadow register sets to mimic the x86, no coprocessors, and no massive hardware assists.
Instead, Apple’s A14X remains a conventional ARM-based processor that’s optimized to run Rosetta 2 translation code really well. Caches should be quite large, because it needs to cache chunks of x86 code (as data) as well as the Rosetta 2 executable. The chip’s prefetch logic would also need to be optimized for misaligned x86 binaries, as well as for fusing operations that aren’t native to ARM’s instruction set. Task switching is vitally important, as the processor will be alternating between executing Rosetta 2 and the code it translates.
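To see why caching and fast switching between translator and translated code matter, here is a minimal sketch of a translate-once, run-many block cache. It assumes translation happens at run time, which Apple has not confirmed. Every name in it (TranslationCache, fake_translate, the placeholder block strings) is hypothetical and has nothing to do with Rosetta 2's real internals.

```python
from typing import Callable, Dict, List


class TranslationCache:
    """Toy model of a translate-once, run-many block cache (hypothetical)."""

    def __init__(self, translate: Callable[[int], List[str]]) -> None:
        self._translate = translate
        self._blocks: Dict[int, List[str]] = {}
        self.misses = 0  # times the translator had to run
        self.hits = 0    # times an already-translated block was reused

    def lookup(self, guest_addr: int) -> List[str]:
        """Return the host block for a guest address, translating on first use."""
        block = self._blocks.get(guest_addr)
        if block is None:
            self.misses += 1
            block = self._translate(guest_addr)
            self._blocks[guest_addr] = block
        else:
            self.hits += 1
        return block


def fake_translate(guest_addr: int) -> List[str]:
    # Stand-in for a real translator; returns a placeholder "host" block.
    return [f"host_block_for_{guest_addr:#x}"]


cache = TranslationCache(fake_translate)
for addr in (0x1000, 0x1040, 0x1000, 0x1000):
    cache.lookup(addr)
print(cache.misses, cache.hits)  # prints: 2 2
```

Every miss is a detour into the translator and every hit is a straight run of already-translated code, so the bigger the cache and the cheaper the switch between the two, the less the translation overhead shows.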
Rosetta will likely produce certain repeating patterns, as x86 instructions will convert to similar sequences, so fetching, decoding, and dispatching those patterns would be a priority. ARM compilers tend to produce certain common patterns, and x86 compilers will produce different ones, so it’s important to optimize the front-end logic for both kinds, not just conventional ARM code. Fusing multiple ARM instructions in the decode and dispatch phases would provide a big performance boost.
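Fusion is easier to picture with a toy model. The sketch below is a software stand-in for what a decoder might do in hardware: it scans an instruction stream and merges each compare that is immediately followed by a conditional branch into one operation. The choice of compare-plus-branch is an assumption for illustration only; nobody I spoke with said which pairs Apple's cores fuse. The tuple encoding and the fuse_compare_branch name are invented here.

```python
from typing import List, Tuple

Instr = Tuple[str, ...]


def fuse_compare_branch(stream: List[Instr]) -> List[Instr]:
    """Merge each compare immediately followed by a conditional branch."""
    fused: List[Instr] = []
    i = 0
    while i < len(stream):
        cur = stream[i]
        nxt = stream[i + 1] if i + 1 < len(stream) else None
        if cur[0] == "cmp" and nxt is not None and nxt[0].startswith("b."):
            # One fused op carries the compare operands, condition, and target.
            fused.append(("cmp+" + nxt[0],) + cur[1:] + nxt[1:])
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused


stream = [
    ("add", "x0", "x1", "x2"),
    ("cmp", "x0", "x3"),
    ("b.ne", "loop"),
    ("ret",),
]
print(fuse_compare_branch(stream))
# prints: [('add', 'x0', 'x1', 'x2'), ('cmp+b.ne', 'x0', 'x3', 'loop'), ('ret',)]
```

Four instructions go in and three operations come out. If translated x86 code keeps producing the same adjacent pairs, a decoder that recognizes them saves a slot on every occurrence.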
The x86 architecture supports more memory-addressing schemes than ARM does, so Rosetta is likely to encounter oddball memory references involving register-to-register arithmetic, offsets, and scaling. A little hardware assist here would yield dividends in performance. A few extra adders, shifters, or integer multipliers in the decode path could help this along.
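As a rough illustration of the addressing problem, the sketch below computes an x86-style effective address (base plus scaled index plus displacement) and shows one way a translator might split it into ARM-style steps. The lowering is an assumption for illustration, not Rosetta 2's actual output: the register names are placeholders, and it ignores the range limits on ARM immediates and negative displacements.

```python
from typing import List

MASK64 = (1 << 64) - 1


def effective_address(base: int, index: int, scale: int, disp: int) -> int:
    """x86-style address: base + index*scale + disp, wrapped to 64 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factors are 1, 2, 4, or 8")
    return (base + index * scale + disp) & MASK64


def lower_to_arm_like(scale: int, disp: int) -> List[str]:
    """Illustrative lowering into ARM-style steps (hypothetical output)."""
    shift = scale.bit_length() - 1  # scale 1, 2, 4, 8 -> shift 0, 1, 2, 3
    ops = [f"add tmp, base, index, lsl #{shift}"]
    if disp:
        # The displacement needs its own step in this simplified lowering.
        ops.append(f"add tmp, tmp, #{disp}")
    ops.append("ldr dst, [tmp]")
    return ops


print(hex(effective_address(0x1000, 3, 8, 0x20)))  # prints: 0x1038
for op in lower_to_arm_like(8, 0x20):
    print(op)
```

One x86 memory operand turns into three steps in this simplified lowering, which is the kind of sequence that extra adders and shifters near the front end could collapse.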
Did Apple include special hardware to speed x86 translation? Yes, but not the kind of massive hardware-emulation machines that other companies have tried. Instead, the company has taken the typical Apple approach: do as much as possible in software and optimize the hardware to run that code really well. Software is more flexible, easier to patch and update, and oftentimes more capable.
Finally, it bears repeating that the early benchmark results are all from the A12Z chip, a two-year-old design with none of these enhancements. The first ARM-based Macs are still a year away and will be based on A14X, a chip that’s not only faster and newer but that was designed from the start to be a killer Rosetta 2 engine. Should Apple decide it needs more aggressive x86 hardware assists, it can always design those into the A15 generation.
Apple is uniquely qualified to create an effective binary translator. The company has done it twice before, always in software. The initial results suggest that it’s pulled off a hat trick, even without any hardware assists. Rosetta 2 running on A14X should be even better.
2 thoughts on “What’s Inside Apple Silicon Processors?”
In the article, you describe Apple needing large caches, storing x86 code as data and Rosetta as instructions. This implies dynamic translation. My understanding is that it’s static: everything is converted ahead of time to run as ARM code. Am I mistaken?
Since Rosetta only runs user code, it appears that a lot of the more complex OS-level instructions aren’t translated. This makes a lot of sense, if it’s true.
Others have commented that Apple might struggle with high-end machines, like Mac Pro. I don’t think so. Unlike Intel, Apple has a very good idea what apps people run on their machines. My guess is that Apple will have hardware for the Mac Pros that assists with video editing and similar tasks. I think they have an opportunity for a massive performance benefit for a few areas, and Apple knows where that makes sense.
The interesting thing we don’t know about is external GPUs, expansion bus, things like that. Apple might do something crazy like let Mac Pro users have CPU cards where they can add more compute resources to their machine. Apple can now make their own architecture decisions, they aren’t tied to Intel.
AFAIK, Rosetta 2’s main mode of operation is to transcode Intel binaries to Apple silicon binaries on install, storing the AArch64 code in the same manner as a Universal 2 compile will do (whether that’s in a separate resource or file I don’t know).
Dynamically loaded code is treated differently, though Apple wasn’t clear about whether the x86 code would be interpreted or dynamically transcoded.
To put it simply, Apple’s silicon team has been designing custom silicon for a decade now, starting with the A4 when they were partnered with Samsung (which created the first Exynos), then continuing on after a break from Samsung (when they became frenemies) up until the current day. Samsung dropped back and continued Exynos (like Qualcomm) using pretty much standard ARM high performance and high efficiency cores.
The Apple silicon team OTOH, kept on with their silicon and CPU design implementing out-of-order execution units, parallel arithmetic units, branch prediction, improved arithmetic and vector logic, wider and faster data paths, improved caching, and all the goodies contained in modern, high performance processors. Their silicon team is among the best in the world, bar none.
The Mac SoCs will be built as an A14 class SoC on a 5nm TSMC node, and the mask shrink from 7nm to 5nm should yield about a 30% increase in speed and efficiency with the silicon team’s annual optimizations yielding additional benefit.
Here’s an article from Anandtech comparing the A13 Lightning high performance cores in the iPhone 11 actively cooled against competing cores: https://www.anandtech.com/show/15875/apple-lays-out-plans-to-transition-macs-from-x86-to-apple-socs.
Note that Skylake 10900K has the edge in integer, and 10900K, Ryzen 3950X, and Cortex-X1 (already at 5nm) have the edge in floating point. This benchmark – unlike Geekbench – runs on the metal of the CPU vs. function tests which can be aided by other Bionic IP blocks.
10900K is the new Intel Core i9 14nm part with 10 cores and 20 threads. Intel added two cores attempting to compete with Ryzen, but this part is reaching the limits of physics – they had to shave the top of the chip to get enough surface area to cool the chip. It’s advertised at 125W TDP, but in real life can draw over 300W.
When Apple creates the Mac SoCs, they can add as many Firestorm high performance cores, Icestorm high efficiency cores, neural engine cores, and graphics cores as they deem fit – and they’ve stated they intend to design the SoC to match the applicable chassis of the Mac they’re going in to. They won’t have to pick an Intel part and configuration and cobble together a system to work around that combination.
To give you an idea of the silicon team’s annual work, last year’s A13 seemed to have a primary focus on efficiency, so they installed hundreds of voltage domains to shut down parts of the SoC not in use. While they were there, they sped up matrix multiplication and division (used by machine learning) by 6x, and added other optimizations which sped up Lightning and Thunder by 20% (despite no shrink in the mask size).
Note also that when putting together the Mac SoC, they will include as many IP blocks from the iPhone as they want including a world class image signal processor, a neural engine capable of 5 trillion ops/sec (if they keep the A13 configuration – which should actually be 30% faster), a secure enclave capable of securely storing and comparing fingerprint or facial geometry data for biometric authentication, H.264 and H.265 process blocks, encryption/decryption blocks, and a whole host of IP blocks. They can also add new blocks – say for instance for PCIe lanes which will be needed for more modular Macs and Thunderbolt support.
I fully expect Firestorm cores to match or exceed the speed of 10900K, though it’s unclear at this juncture whether A14 will support SMT, which of course can almost double your multiprocessing workload on a per core basis.
The first rumored Mac SoC will have 8 Firestorm and 4 Icestorm cores and appears to be targeted at some kind of laptop. It is rumored to be one of the first implementations of ARMv9 and one of the first to implement a core clock of over 3 GHz, though I suppose with actual active cooling they could push clocks and core counts to any level they desire.
The A13 SoC had 8.5 billion transistors, and A14 SoC (the iPhone 12 version with two Firestorm and four Icestorm cores) is said to have 15 billion. The Mac SoC will probably have much more.