My Ruler Must Be Broken

“The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny…’” – Isaac Asimov

According to Adam Savage, the difference between science and just screwing around is writing it down. It’s the measurement – the annotation, the calibration, the methodical note-taking – that separates good science (and engineering) from mere hacking and tinkering. Without good measurements there can be no good science.

So raise a caliper and spare a thought for the measurers in our industry – the ones wielding the oscilloscope probes, the voltmeters, the electron microscopes. For they are the ones who enable us to produce better, faster, and more reliable electronics.

They’re also the ones who get to say, “That’s funny…”

It’s that single unexpected measurement that marks the difference between dreary routine and technological breakthrough. When you find the outlier you begin to discover that things aren’t as they seem. And you start down the path of discovery.

Okay, enough with the bromides. What we’re here to discuss is the work that EEMBC does in benchmarking low-power MCUs. That’s a tough task – tougher than I would have thought at first – but it’s made even tougher by the pesky nature of reality. As we mentioned a few weeks ago, today’s microprocessors and microcontrollers aren’t as stable and predictable as we like to think. Even mass-produced chips can vary from batch to batch, week to week, or chip to chip. Things aren’t always what they seem.

That’s why we need benchmarks: calibrated tests that tell us exactly what the chip is doing and how it compares to other chips doing the same work. Normally, benchmarks measure performance, but EEMBC does power-consumption benchmarks, too. Very useful stuff when you’re trying to decide which low-power MCU to put in your next wearable device or battery-powered gizmo.

EEMBC’s ULPbench (ultra-low-power benchmark) is designed for exactly that purpose. Take a bunch of competing MCUs and run them all through the same suite of tests, then measure both their active-mode and their sleep-mode energy usage. Voilà! At the end you’ve got a big table of numbers to help you pick the best and most power-efficient chip for your design. How hard can that be?

Well… EEMBC’s lab technicians noticed a funny thing. The benchmark results weren’t very repeatable. Which sort of violates the Prime Directive of benchmark testing: all results should be objective, verifiable, and repeatable. Otherwise, you might as well be picking random lottery numbers. But the EEMBC folks are smart. These guys design and run benchmarks all day long. They know a thing or two about calibrating instruments, isolating variables, and generating repeatable results. So what gives?

In some cases, the exact same benchmark test running on the exact same hardware produced results that were 5% or 10% or even 15% different from test to test. That’s not supposed to happen.
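To put a number on "different from test to test," one simple yardstick is the gap between the highest and lowest reading, expressed as a percentage of the average. Here is a minimal sketch in Python. The function name `percent_spread` and the readings are made up for illustration; they are not EEMBC data, and this is not necessarily how EEMBC computes its own figures.

```python
def percent_spread(readings):
    """Return (max - min) as a percentage of the mean reading."""
    if not readings:
        raise ValueError("need at least one reading")
    mean = sum(readings) / len(readings)
    return 100.0 * (max(readings) - min(readings)) / mean


# Hypothetical energy readings (in microjoules) from five runs of the same
# test on the same board. Invented values, for illustration only.
runs_uj = [10.0, 10.4, 9.6, 10.7, 9.3]

print(f"run-to-run spread: {percent_spread(runs_uj):.1f}%")
```

A spread in the low single digits is ordinary instrument noise. A double-digit spread on identical hardware is the kind of result that makes you check your ruler.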

One hint was that the sleep-mode power varied more than the active-mode power. ULPbench exercises both characteristics separately, on the theory that many MCU applications are asleep most of the time, waiting for an interrupt or an outside stimulus. For these designs, sleep-mode power affects battery life more than active-mode current consumption. And the sleep-mode numbers were all over the map, even on the same hardware.

CMOS circuitry leaks. It leaks a lot. Electrons fairly fall out of your chips when they’re not working. It’s not unusual for a CPU or MCU to leak more current when it’s not running than it uses when it is running. Your MCU is like the old SR-71 spy plane, which was said to leak fuel on the runway because it was designed for supersonic speeds, where the air friction would heat the plane and cause it to expand, sealing the gaps. It was great at moving fast but leaked badly when sitting still.

And since CMOS circuits leak when they’re sitting still, the passive-mode power measurements were like trying to catch rain in a thimble. There’s no way to control leakage current. At least, not if you aren’t controlling the fab line that manufactures the chip. And even then, a difference of a few atoms here or there can make a measurable difference to a chip’s power characteristics. Particularly in sleep mode.

According to EEMBC’s tests, the biggest determinants of power consumption are temperature and wafer lot. Temperature makes a big difference, as many of us have probably already noticed. The hotter a chip gets, the more power it consumes – even if it’s sleeping. Especially if it’s sleeping. The correlation isn’t linear, either; it’s exponential. At room temperature (say, 25°C), a difference of a few degrees will barely make a detectable difference in power. But at the upper end of the chip’s rated temperature range, a few degrees can make a big difference. Good to know if you’re designing monitoring equipment that’s going to sit outside in the sun or be buried inside a fanless enclosure.
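Why does an exponential curve behave that way? A rough sketch in Python shows it. The model below assumes leakage doubles for every 10°C rise and starts from 1 µA at 25°C. Both numbers are placeholder assumptions for illustration, not EEMBC measurements or any vendor's datasheet values, and `leakage_ua` is a made-up helper.

```python
def leakage_ua(temp_c, ref_ua=1.0, ref_temp_c=25.0, doubling_c=10.0):
    """Toy exponential leakage model (all parameters are assumptions).

    Leakage equals ref_ua at ref_temp_c and doubles every doubling_c degrees.
    """
    return ref_ua * 2.0 ** ((temp_c - ref_temp_c) / doubling_c)


# Compare the effect of the same 3-degree rise at room temperature
# and near the top of a hypothetical rated range.
for base_c in (25.0, 85.0):
    before = leakage_ua(base_c)
    after = leakage_ua(base_c + 3.0)
    print(f"{base_c:.0f}C -> {base_c + 3.0:.0f}C: "
          f"{before:.2f} uA -> {after:.2f} uA (+{after - before:.2f} uA)")
```

The percentage increase is identical in both cases, but the absolute jump at 85°C is 64 times larger than at 25°C under these assumptions. A change that disappears into the noise on the lab bench becomes a real chunk of your power budget inside a hot enclosure.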

The other big variable is wafer lot, and there’s nothing we can do about that. If two chips are “sisters” – that is, they’re cut from the same silicon wafer and processed simultaneously – they’ll probably vary by no more than 5%, according to EEMBC’s observations. But if the two chips were processed on different wafers on different days, the variation can be as much as 15%. For supposedly identical chips. Same part number, same manufacturer, same software, same everything. It’s a total crap shoot.

Again, the difference is most noticeable in sleep mode. The more your chip sleeps (or hibernates, or dozes, or snoozes, etc.) the more its power will vary from the norm. For some very-low-duty-cycle applications like wireless utility meters, this is a big deal. Those devices are supposed to last 10 years on a single battery because they spend 99.99% of their time in sleep mode. Their active duty cycle is something like 1 in 10,000. Paradoxically, their power consumption would be more stable if they ran more often. Battery life might be shorter, but it would be more predictable.
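The arithmetic behind that paradox is easy to sketch. Average current is just the duty-cycle-weighted mix of active and sleep current, so the more time the chip spends asleep, the more any wobble in sleep current shows up in the total. The Python below uses invented numbers (5 mA active, 1 µA nominal sleep, a 15% swing in sleep current); none of them come from EEMBC, and the function names are hypothetical.

```python
def average_current_ua(duty, active_ua, sleep_ua):
    """Duty-cycle-weighted average current, in microamps."""
    return duty * active_ua + (1.0 - duty) * sleep_ua


def sensitivity_pct(duty, active_ua, sleep_ua, sleep_swing=0.15):
    """Percent change in average current when sleep current rises by sleep_swing."""
    nominal = average_current_ua(duty, active_ua, sleep_ua)
    leaky = average_current_ua(duty, active_ua, sleep_ua * (1.0 + sleep_swing))
    return 100.0 * (leaky - nominal) / nominal


ACTIVE_UA = 5000.0  # assumed active-mode current (5 mA)
SLEEP_UA = 1.0      # assumed nominal sleep-mode current (1 uA)

for label, duty in (("1 in 10,000", 1 / 10_000), ("1 in 100", 1 / 100)):
    avg = average_current_ua(duty, ACTIVE_UA, SLEEP_UA)
    swing = sensitivity_pct(duty, ACTIVE_UA, SLEEP_UA)
    print(f"duty {label}: average {avg:.2f} uA, "
          f"+15% sleep current shifts it by {swing:.2f}%")
```

With these made-up figures, the meter-style device that wakes once in 10,000 sees its average current move by about 10% when sleep current drifts by 15%. The device that wakes once in 100 burns far more current overall but barely notices the same drift.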

Sadly, this problem only gets worse over time. CMOS circuits leak, and small-geometry CMOS leaks more than the older-generation stuff. So the more “advanced” your chip manufacturing is, the more it’s going to leak and the more unpredictable it’s going to be. A chip built in 90-nanometer silicon is going to leak more than one built in 180nm, and so on. Granted, newer process geometries lead to all sorts of advantages, including lower active-mode current consumption, but deteriorating passive-mode leakage is the price we pay.

That means you’re sometimes better off using a chip built with “outdated” technology. A 0.35-micron device won’t leak nearly as much as one built on a newer fab. It might be physically larger (okay, it will definitely be larger), and it might not run as fast, but it’ll leak less and have a more stable and predictable power profile. If you’re in the 99% sleep camp, that’s probably the direction you want to be heading.

There are a lot of variables in play here, but one good rule of thumb at EEMBC is that the crossover point is around the 1:5000 duty cycle. In other words, if your device is active less than 0.02% of the time, you’re probably better off with a chip built on older technology (180nm vs. 90nm, for instance). If your device is active more often than that, sleep-mode power starts to get overwhelmed by active power, so you’ll want a newer device.
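Where does a crossover like that come from? Set the average current of the two chips equal and solve for the duty cycle. The Python sketch below does that for two hypothetical parts: an "old" chip drawing 10 mA active and 1 µA asleep, and a "new" chip drawing 5 mA active and 2 µA asleep. Those figures were picked purely so the answer lands near 1:5000; they are not EEMBC's data, and the real rule of thumb folds in more variables than this toy model does.

```python
def average_current_ua(duty, active_ua, sleep_ua):
    """Duty-cycle-weighted average current, in microamps."""
    return duty * active_ua + (1.0 - duty) * sleep_ua


def crossover_duty(old_active_ua, old_sleep_ua, new_active_ua, new_sleep_ua):
    """Duty cycle at which both chips draw the same average current.

    Derived from d*Ia_old + (1-d)*Is_old = d*Ia_new + (1-d)*Is_new.
    Assumes the new chip saves active current but leaks more in sleep.
    """
    extra_sleep = new_sleep_ua - old_sleep_ua       # penalty of the new process
    saved_active = old_active_ua - new_active_ua    # benefit of the new process
    return extra_sleep / (saved_active + extra_sleep)


# Hypothetical chips, numbers chosen for illustration only.
OLD = {"active_ua": 10_000.0, "sleep_ua": 1.0}
NEW = {"active_ua": 5_000.0, "sleep_ua": 2.0}

d = crossover_duty(OLD["active_ua"], OLD["sleep_ua"],
                   NEW["active_ua"], NEW["sleep_ua"])
print(f"crossover duty cycle: 1 in {1 / d:.0f} ({100 * d:.3f}% active)")
```

Below the crossover, the old chip's lower leakage wins; above it, the new chip's lower active current wins. Plug in your own datasheet numbers and the crossover will move, which is exactly why it's a rule of thumb and not a law.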

Tests and benchmarks are supposed to be about repeatability. They’re supposed to be stable, reliable, and trustworthy. And normally, they are. Performance benchmarks are generally rock-solid. If you test a Core i7-2270K or a Cortex-A25 over and over, you get the same performance scores. It’s digital. Deterministic. Repeatable. It’s when someone gets different performance numbers that you suspect cheating.

But power benchmarks aren’t like that. Unlike traditional performance benchmarks, they’re measuring an analog quantity, not a quantized one. You expect some variation just because of the instruments and the nature of power itself. But not big 10% and 15% differences. And not because of the humidity in the lab.

EEMBC set itself a daunting task when it undertook ULPbench. But the results are actually better than many people expected. Their numbers don’t vary because of any flaw or limitation in the benchmark. On the contrary: they vary precisely because the benchmark is so accurate and well designed. The variations they’re seeing in ULPbench scores are real. The chips really are exhibiting random variations in behavior. It’s not measurement error; it’s physics.

Going forward, users of EEMBC’s ULPbench will have to calibrate their own expectations to align with reality. You’re simply not going to be able to duplicate EEMBC’s results – or anyone’s results – with any degree of accuracy. Want to drive your lab partner nuts? Ask him to benchmark the power consumption of the same MCU development board every day for a week. See how long it takes before he starts to say, “That’s funny…”