“When one has taken root, one puts out branches.” – Jules Verne
What do microprocessors, pine trees, and public libraries all have in common? They have lots of branches. Branch instructions are fundamental to all von Neumann machines, along with arithmetic and logic operations. Without branches – specifically, conditional branches – you can’t really write any software, and you don’t really have a computer.
Trouble is, branches are also the biggest thieves of performance, stealing clock cycles that could be put to good use. Your CPU might be able to calculate huge sums or do elaborate bit manipulations in a single cycle, but that same chip probably needs a dozen cycles to decide whether it’s going left or right. Indecision wastes time.
Therein lies an opportunity for improvement. CPU designers have been working mightily for the past 20-odd years to improve the miserable performance of branches. Shaving just a few clock cycles here and there would pay big dividends, given how much time most processors waste dithering over branches. Some programs contain 30% or more in branch instructions; save a little time there and you get a big benefit overall.
Worse, it’s the branches that led us to security holes like Spectre, Meltdown, and others. These bugs are hard to fix – in fact, probably impossible to fix without exacting a heavy toll on performance – and deliciously subtle. So, we’re faced with a tradeoff: make the processor fast or make it reliable. Most of us would opt for the latter, but it would sure be nice to have both, wouldn’t it?
The Root of the Problem
The whole reason branches hurt performance is because processors have pipelines. And the whole reason processors have pipelines is to increase performance. So, right off the bat, we’ve got a conflict.
The shortest practical processor pipeline has three stages (fetch, decode, execute), while high-end processors have pipelines that are 20+ stages long. Some chips have variable-length pipelines, depending on what the CPU is doing (integer versus floating-point math, for example). That means you’ve got anywhere from three to 20+ instructions “in flight” as they work their way through the CPU’s pipeline circuitry. Great stuff, and it improves performance dramatically.
But, like railroad trains, microprocessors work best when they stay on track, running at full speed in a straight line. A program composed of nothing but an unbroken string of arithmetic and logical operations would see maximal performance. The pipeline would stay full and the chip would crank out new results on every clock cycle. In an ideal world.
Realistically, though, we’re going to have branches, and lots of them. And every single branch means the CPU has to abandon its current path and start down a different track. All the instructions that were in-flight will be abandoned, the pipeline flushed of partial results, and a whole new string of instructions fetched from somewhere new. Just as the processor was building up momentum, it all comes to a sparking and sputtering halt as the machine abruptly changes direction. Not a recipe for efficiency. But what can you do?
Well… there are a couple of clever tricks. For one, you can design the CPU to recognize an unconditional branch (one that’s taken every time, like at the bottom of some loops), figure out where the branch points to, and start fetching from that location – all before you get to the branch itself. Plenty of real-world processors do exactly that: they have an early-warning system that detects unconditional branches early in the pipeline and redirects the fetching mechanism to swerve from its straight-ahead path and start fetching from the branch-target address. Done right, the whole direction change takes place without any wasted time.
Conditional Branches
But that’s only in the best case. More often, we pepper our programs with conditional branch instructions – branch if zero, branch if nonzero, branch if this number is negative, or bigger than that one, and so on. Now you have a different problem. The CPU knows there’s a branch in its pipeline, but it doesn’t yet know the outcome of the test (zero, nonzero, etc.) and therefore whether the branch should be taken or not. It’s helpless to decide, and it has to wait until the absolute last moment to do so. We almost always write code with the comparison followed immediately by the branch. Who ever does a compare, waits for a while, and then branches?
Unfortunately, this presents the worst-case scenario for the processor, and the longer the pipeline, the worse it gets. One ameliorating trick is called a branch hint. Some processors (ARM’s recently announced Helium is one example) have two versions of every conditional-branch instruction: a likely branch and an unlikely branch. You get to pick which one to use, but this usually requires dabbling in assembly-language code. For example, if you’ve written a do-while loop that will iterate a thousand times before it terminates or falls though, you might want to use your processor’s version of the “branch if counter is nonzero, probably” instruction. That tells the CPU hardware that it’s a conditional branch and therefore unknowable until the last nanosecond, but that the Omniscient Software Engineer has hinted that the branch should probably be taken anyway. The processor takes your suggestion on faith and starts fetching code from the branch-target address instead of the address following the branch as it normally would. If your hint was correct, you’ve saved yourself a big performance hit on 999 of those thousand iterations. On the last one, the hint will prove wrong and the processor will have to flush its pipeline and start over on the correct path. But you’re still ahead of the game.
Branch hints work, but they obviously require software intervention. They also can’t be updated at runtime. Once your code is compiled, your hints are permanently cast in, uh, silicon. And, they require a CPU instruction set that supports them. You can’t retrofit branch hinting onto an existing ISA.
Hardware Branch Prediction
One step up the sophistication ladder is branch prediction, and this is where Spectre and Meltdown creep into the picture. Instead of statically encasing hints in your object code, branch prediction works entirely in hardware, and it can theoretically be added to any CPU architecture at all without tweaking the ISA. Its operation is invisible – that’s part of the problem – but it’s also very effective. And widespread. And difficult to debug. And it wastes energy. But despite all these shortcomings, branch prediction is almost ubiquitous among processors precisely because it patches up so many performance holes.
With branch prediction, the CPU hardware itself tries to guess which way a conditional branch will go. The exact methods change from CPU to CPU and are often closely guarded secrets, but there is some commonality. For example, branches that point backwards are assumed to be taken, while branches pointing forward are assumed not-taken. That’s based on statistical analysis of thousands of programs. Backward-pointing branches are often found at the bottom of iterative loops and are therefore likely to be taken more often than not. It’s a simple test (is the address offset positive or negative?) and yields good results.
Many CPUs also maintain a branch-history table (BHT), a collection of simple saturating counters that record the last few times each branch instruction was encountered. If the branch was taken, the hardware increments a counter. If not taken, it decrements the counter. The next time it encounters this branch, the CPU will make a guess based on its history whether it should be taken or not. It’s an imperfect system, but it’s yet another way to sidestep the pain of mispredicted branches.
To improve accuracy, some CPUs employ two-level branch prediction, which is smarter about nested branches or branches that are taken/not-taken at arbitrary intervals. Other processors will keep a “local” branch history separate from its global branch history. Since even the largest BHT can’t record the history of every single branch, some hashing is involved, and this can trick the CPU into confusing one branch with another one in a completely different part of the program. Some replace the two-bit saturating counter with a three-bit counter or with more elaborate tallies. Some processors even have multiple branch predictors working independently and then weight the outcomes. The list goes on.
Speculative Execution
Yet another side effect of branches combined with long pipelines is that your processor will almost certainly execute instructions that it shouldn’t, and then throw the results away. It simply can’t be avoided. Unless you want your CPU to freeze up and do nothing at all until it knows the outcome of every branch, it’s inevitable that the chip will do something between the time it fetches a branch and resolves its outcome. This is known as speculative execution.
The trouble with speculative execution – okay, just one of the troubles – is that it’s very hard to unwind all the unintended side effects. With a 20-stage pipeline, you’ve got 20 different instructions in flight, most of which are updating registers, twiddling bits, loading and storing to/from memory, decrementing counters, toggling flags, changing privilege levels, performing arithmetic, tinkering with status registers, poking at I/O ports, rounding results, and myriad other things that all have to be undone and erased without a trace, and all because the chip mispredicted which way the last branch was going to go. It’s a lot of work to undo. Not surprising, then, that tiny traces remain.
One small side-effect that most processors don’t try to erase is the effect on data caches. A memory read or write transaction will probably go though the on-chip cache, potentially updating a cache entry here or there. If that read/write turns out to have been executed speculatively and needs to be undone, the CPU hardware takes care of that. But it doesn’t undo the effect on the cache. This isn’t normally considered a bug, since cache contents aren’t deterministic anyway. You get what you get.
The same vulnerability applies to some privileged system functions. Your CPU might mispredict a branch, speculatively execute something that triggers a system fault, and then fail to completely mask its side effects. Spectre and Meltdown combined these two tiny, but related, vulnerabilities to create a two-part bucket brigade that hand-carried data out of affected systems, one bit at a time. It’s tedious, but it works.
The software patches to defang Specture and Meltdown have been almost as creative as the bugs themselves, but they do exact a performance penalty because they selectively disable speculative execution and/or branch prediction. Those are big performance hits to take, especially in a high-end processor with a long pipeline. Most users report percentage slowdowns in the high single digits to low double digits, depending on what software they’re running. That’s a high price to guarantee vaccination against a very rare bug.
The ultimate solution may be to keep processor microarchitecture largely as it is, while improving the “unwinding” mechanisms for reversing the effects of speculative execution. If there are no detectable side effects, there’s no vulnerability to exploit (we think). On the other hand, erasing all trace of any conceivable sequence of 20-plus instructions will be exceedingly difficult to design and verify. There’s a reason these bugs lie dormant for years – decades, even – before being discovered: they’re extraordinarily subtle.
The other solution is to just stop writing code with jumps and branches. C’mon, programmers! Do your part! If you can’t decide which way your code is supposed to flow, maybe it’s time to go into marketing. Or journalism.
Or you can get the compilers to fix the branch prediction in the code that’s running. This idea comes out of trying to make simulation run faster –
https://youtu.be/Bh5axlxIUvM
– that also fixes the Spectre/Meltdown weakness.
RTL simulation is all state-machines, if you know the state you now which way the branches (are likely) go for a given change. You can buy the patent –
https://patents.google.com/patent/US20150256484A1/en