It’s not a bug, it’s a feature, if you publish it in the manual, right? AMD has taken a “white hat” approach to a possible security risk in its newest Zen 3 processors by publishing a white paper that describes the problem. Although there’s no known exploit in the field, AMD appears to be heading off any problems by detailing how, when, and where the problem might occur, and what you can do about it. Kudos to the company for transparency.
The problem lies in a hidden feature called predictive store forwarding (PSF). It’s a performance enhancement first introduced in Zen 3 chips (i.e., those Ryzen 5900X and 5950X CPUs that are impossible to buy right now) that speculatively feeds data to the processor before it’s actually available. The concept is not entirely novel in the CPU world, but it’s the first time AMD has implemented it in this particular way, and it comes with one wee small caveat.
First, what is PSF? It’s a way to speed up loads from memory by speculating – guessing, really – what data you’re going to load. Memory loads are among the slowest activities any processor can perform because external DRAM is so #$@% slow compared to your 3-GHz multicore CPU. Waiting on data to come back from memory eats up a lot of time. That’s why we have caches, but even caches are slower than the logic inside the CPU. If only there was some way to know ahead of time what data you were going to load and just work with that…
Well, there is. Mostly. Chances are, the data you load from memory is the same as whatever you stored there earlier. In a simplistic system with a single processor, that’s how it always works. Memory doesn’t change all by itself (you hope), so any data you store will still be there, safe and sound, when you retrieve it later. The complication starts when you have multiple processors, or intelligent peripherals, or a DMA controller, or shared memory, or even a single multicore processor running multiple threads. Then, there’s no guarantee that the data in DRAM hasn’t been molested a dozen times since you last wrote to it. So, the CPU takes its best guess.
Working on the theory that the data in DRAM probably hasn’t changed since the last time you wrote to it, AMD’s Zen 3 chips will simply reuse the data from the last store instruction to that location. During write cycles, the processor copies the outgoing data into an internal buffer and tags it with a hash of the destination address. Then, when a load instruction requests data from that same address, the processor uses the data from the buffer instead of waiting around for the DRAM.
This is much faster than waiting for external memory; it’s even faster than waiting for the data cache (assuming the cache hits). It really pays off when the load instruction comes right after the store instruction, because the processor doesn’t have to wait for the write cycle to complete, then wait some more for the read cycle (from the same address) to complete, then wait while the results are passed into the execution pipeline.
To be clear, the processor still completes a normal read cycle from external memory. It just doesn’t wait for the results before getting a head start on executing the load and then proceeding with the next several instructions.
So far, so good. This technique is used in several companies’ processors and is generally known as store-to-load forwarding (STLF). That part’s not new. What’s new is AMD’s tweak to the process that makes it even faster by also bypassing the MMU.
Standard run-of-the-mill STLF depends on matching the memory address of the store with the memory address of the load. Which, in turn, depends on your MMU. If you’ve got memory translation enabled, the address pointers in your code bear little relation to the physical memory addresses the CPU reads and writes. Until you know the physical address, you can’t match up the loads with the stores. You have to wait for the MMU to perform its address translation. Unless you’re AMD.
Zen 3 processors skip the address translation lookup and try to divine which loads are paired with which stores simply by observing your code’s behavior. Using a proprietary algorithm (which the company does not reveal), it matches up load/store pairs. It then buffers the data values on stores. If it’s right, and if a store is followed closely by a load from the same address, Zen 3 chips will happily supply the buffered data immediately without waiting for the DRAM, the cache, or the MMU. Result: performance increase.
What could go wrong? Not much, really. If the pairing algorithm errs and somehow manages to match the wrong load with the wrong store, it will flush the incorrect data from its buffer. That can happen with any hashing algorithm, where two unrelated addresses happen to produce the same hash. No harm done.
It also flushes the data if the MMU setup has changed, breaking the pairing. This is kind of a corner case, but it’s important to catch anyway. Your code might read and write the same address for a while, but then a change to the MMU’s translation tables might relocate one but not the other.
There’s also the case where indexed arrays can fool the processor into thinking you’re accessing the same addresses when you’re not. All x86 processors can do indexed array addressing in assembly language as part of their native instruction set. This makes them great targets for C compilers (and for hardcore assembly-language programmers), and it makes it deceptively easy for the hardware to second-guess what address you’re really asking for. AMD admits that its pairing algorithm can be fooled if you repeatedly access an array with one index pointer, then switch to a different one. The hardware catches this mistake, too, and flushes the PSF buffer accordingly. In all cases, the right thing happens.
But that might depend on what you consider “the right thing.” AMD’s Zen 3 chips never supply incorrect data, but they do initiate memory accesses that aren’t strictly necessary – and that may provide the kernel of a malicious exploit.
The Spectre and Meltdown bugs showed us that the most seemingly trivial of side effects can sometimes be used to compromise a computer system. They both leveraged the subtle side effects of speculative execution to tease out sensitive information. Spectre and Meltdown fetched instructions that were later flushed and unused, but those speculative fetches had an effect on caches. Similarly, PSF reads data that will be flushed and unused, and it also has an observable effect on caches. Thus, PSF’s hidden behavior may have a (barely) detectable effect on a computer system.
Like Spectre and Meltdown, exploiting the effects of a speculative PSF memory access takes some real effort. And, even when it’s successful, it only exfiltrates data from an already-infected system. It’s not a way in; it’s a way out.
In a great example of burying the lede, AMD’s whitepaper saves this warning for the very last paragraph: “Predictive Store Forwarding is a new feature in AMD Zen 3 CPUs which may improve application performance but also has security implications.”
The company warns that this might be exploited in software-only sandboxing, including browsers. Fortunately, PSF can be turned off, unlike the vulnerabilities underlying Spectre and Meltdown. AMD’s chips provide two system-level configuration bits that can disable STLF entirely or just PSF by itself. Either way, the settings are per-thread, so you can disable it for some tasks and leave it enabled for others.
AMD’s chips automatically flush PSF buffers whenever there’s a change in code privilege, an interrupt, an exception, an intersegment (far) call, a system call, or any of several other circumstances. That makes it pretty hard to exploit across threads, tasks, or code segments. Again, you’d need to work pretty hard to find a way to exploit this potential bug, but that doesn’t mean someone won’t try. Especially now that it’s been publicly documented. Still, better to be upfront about it and highlight the potential weakness now than have to apologize for it later.
Speculative execution and speculative load/store behavior are a fact of life with modern CPUs. They’re just two of the many tools that CPU designers use to deliver the performance we all crave. The surprise is not that there are unintended consequences to these often obscure and nonintuitive tweaks. The surprise is that there aren’t more of them.