Small Change, Big Difference

They say football is a game of inches, and League of Legends is a game of millimeters. Well, making microprocessor chips is a game of atoms.

Most of us aren’t very good at statistics, but we do have an intuitive kind of understanding about the law of averages and the law of large numbers. If you have a big truckload of oranges, losing one or two oranges doesn’t make much difference. But if you have only three oranges in your hand, losing one (or gaining one) is a big deal. Obvious, right?

Semiconductor fabrication works in a similar vein. If your chip was built in the 1980s using 1.0-micron technology, a few extra atoms here or there didn’t make a lot of difference to its size, weight, performance, reliability, or power consumption. A bit of stray copper or polysilicon wouldn’t have affected the device’s operation. Indeed, back then it probably wouldn’t have been measurable or even detectable. Ignorance was silicon bliss.

Now our chips are made to astonishing tolerances. Atoms count. And because there are fewer of them – because the tolerances are so small – a few extra atoms one way or the other can make a significant difference to a chip’s behavior. We already know that tiny contaminants or imperfections can render a chip useless. But even tinier changes can alter your system’s power consumption by double-digit percentages. And there’s nothing you can do about it.

Quite a number of people are starting to notice, and measure, the difference. A study done at the University of California at San Diego (Go, Tritons!) took a set of six presumably identical Intel Core i5-540M processors, hooked them up to the same power supply and peripheral components, ran the same benchmark tests, measured them with the same instruments, and found… not the same power consumption. In fact, power usage varied by 12% to 17%. To be clear, the power consumption varied by at least 12% — not less. They weren’t able to get six apparently identical chips to produce anything like identical power numbers.

To factor out the possibility that some other hardware might somehow be contributing to the observed variations, the group used two different motherboards, but reported no difference in their readings. The power variations followed the processor, not the support logic.

Furthermore, chips that ran “hot” on one test did so across all the tests, clearly fingering the silicon and not the software as the culprit. Using the 19 separate benchmark tests within SPEC2006 as their test suite, the team saw that processor #2 (for instance) consistently used more power than processor #3.

Not surprisingly, slowing down the processors reduced the absolute variation in power consumption, though not their relative relationships. Speeding the chips up increased power differences, sometimes dramatically. One processor on one test sucked 20% more power than its peers. All for identical chips.

Lest we think that this effect is limited to Intel processors, a team at UCLA (Go, Bruins!) found that ARM-based microcontrollers do essentially the same thing. Their study focused on the opposite end of the spectrum, where the test subjects were in sleep mode. They started out by testing ten (theoretically) identical Atmel SAM3U MCUs while running in their active state and found power variations of less than 10%. This seemed unremarkable and is mentioned almost in passing, because the purpose of the study was to monitor sleep-mode power. In reality, a 5–10% difference in power consumption among identical devices running identical code would freak out most developers and have them questioning their instruments. But let’s move on.

In sleep mode, each chip behaved very differently, indeed. Some consumed 3x to more than 5x more power than their peers under the same conditions. The team then varied the temperature of each device and, as expected, power consumption scaled with temperature, but not in the same way for each chip. Oddly, the individual SAM3U devices didn’t all maintain their rankings as high- or low-power leaders. While they all consumed more sleep-mode power as the temperature rose, some increased more rapidly than others, sometimes overtaking their siblings. A low-power leader when cool (22°C/72°F, in the testing) might finish up mid-pack at 60°C. One outlier was nearly off the charts at room temperature (145 µW versus about 40 µW for all of its rivals), but then it changed hardly at all as the mercury rose. Some chips exhibited hockey-stick curves while others followed a more-gentle slope. Go figure.

The UCLA researchers make the not-unreasonable assumption that an MCU-based system will likely spend most of its life in sleep mode, possibly waking periodically to sample a sensor and process the results before hibernating again. In such cases, the chip’s sleep-mode power consumption, and the way it varies with temperature, is vitally important. But based on just the ten samples they examined, the duty cycle of such a system might have to change by as much as 50% to get the battery life you want. Conversely, your battery life might change drastically based on the individual chip you happen to get in the mail. And imagine trying to engineer a reliable way to accommodate those random variations.

It gets weirder. A completely unrelated group working in the Czech Republic found that benchmark performance can vary considerably because of… magic, as far as they can tell. They ran the same FFT benchmark over and over on an x86 machine running Fedora, and they got the same score every time, as you would expect. Then they rebooted the machine and ran the test a few thousand more times – and got different numbers. The results in the second test run were all the same – just different from the first test run. Rebooting a third time produced a third set of numbers, again all self-consistent. Another few thousand test runs, and another cluster of similar scores. Strange. Within each test run, the scores were nearly identical, but between reboots, they varied by as much as 4% from previous runs.

Granted, four percent isn’t a lot, but what’s remarkable is that the scores changed at all. Two thousand consecutive test runs all produce identical scores, then a simple reboot changes the next two thousand scores, all by exactly the same amount and in the same direction, for no apparent reason. Even when you think you’re running a controlled experiment, you’re not really in control.

Imagine the consternation when you say, “Hey, boss, I ran that benchmark you asked for and got the same result a thousand separate times. Those numbers are rock-solid. Go ahead and publish them.” Then some n00b runs the same binaries on the same hardware on the same day and gets a significantly different number. Thousands of times. Also completely repeatable. Who’s correct?

If you need any worse news, the Czech team also found that compiling the same source code with the same gcc compiler and the same compile-time switches – in short, precisely the same circumstances – produced different binaries that varied in performance by as much as 13%. That’s with no changes whatsoever from the programmer, the tools, or the development system. They pegged the variation to nondeterminism in the gnu tools. Good luck with that.