“Just because your voice reaches halfway around the world doesn’t mean you are wiser than when it reached only to the end of the bar.” — Edward R. Murrow
To twist an old cliché, there are three kinds of lies: lies, damned lies, and benchmarks. EEMBC aims to improve all three.
For more than 20 years, EEMBC has been in the unenviable business of creating, testing, and distributing benchmarks for embedded devices. (The name once stood for EDN Embedded Microprocessor Benchmark Consortium, but now it’s just an unattributed acronym, like MIPS.) They’ve got benchmarks to measure performance, benchmarks for power consumption, benchmarks for security – you name it. EEMBC’s various benchmarks have become the de facto standard yardstick for devices that aren’t computers. That’s largely by default, since there aren’t many other benchmarks around that aren’t tuned for narrow applications or specific runtime environments. Java benchmarks are plentiful; benchmarks for measuring an MCU’s power consumption while sending Wi-Fi packets or doing real-time motor control, not so much.
But with success comes responsibility, as well as certain challenges. EEMBC’s ULPmark (the name is a mashup of ultra-low-power and benchmark) has been around for years, and it’s widely used to “prove” that one maker’s MCUs are more power efficient than their competitors’ chips. We’ll pause here while the alarm bells going off in your head have a chance to subside.
Still with us? Grand. Measuring CPU performance is hard enough, even among processors with similar architectures. Just look at any Intel-versus-AMD showdown. But that’s a cakewalk compared to measuring a chip’s power consumption. Measured doing what? Running at full speed with all peripherals active? Napping in low-power mode? Loping along at twenty furlongs per fortnight? Embedded systems and MCUs have lots of peripherals combined with lots of power-saving modes that can switch on and off on short notice. How do you create a level playing field, even among chips from the same maker or the same architectural family?
The good news is, EEMBC anticipated most of those complications long ago, when ULPmark was created. Lab technicians striving for the ultimate ULPmark score were free to use whatever power-saving measures a chip might provide. And why not? That’s why they’re there. Use ’em if you’ve got ’em. But you had to document the conditions and the configuration, and if you wanted EEMBC to officially certify your results, they had to be reproducible. No miraculous one-off scores permitted; no exaggerated claims.
The bad news is, there were still some loopholes. ULPmark measures power consumption, while EEMBC’s classic CoreMark benchmark gauges performance – but there is no correlation between the two. On the surface, that seems natural enough. They’re two different tests, after all. But, in practice, it led to customer confusion. Everyone kinda, sorta understood that published ULPmark scores represented one extreme corner case (low power, low performance) while CoreMark scores represented the diagonally opposite corner. Both are useful, but both are just floating data points.
Tempting as it is, you can’t simply draw a line between those two points and naively assume that your chip’s power/performance characteristics lie anywhere along that line. Performance is rarely linear, and power consumption never is. Without laboriously testing and plotting every step between the two extremes, there’s no way to know what the curve/line/wiggle/Brownian motion for your chip really looks like.
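To make the point concrete, here’s a minimal sketch in Python. The voltage table and power model below are invented for illustration (a generic leakage-plus-CV²f CMOS approximation, not any real chip’s numbers); the point is only that a straight line drawn between two published corner scores can badly misestimate the operating points in between.

```python
# Sketch: why a straight line between two measured operating points
# misestimates power. All numbers here are hypothetical.

def supply_voltage(freq_mhz):
    # Hypothetical DVFS table: higher clocks need higher core voltage.
    return 1.0 + 0.8 * (freq_mhz / 100.0)  # volts

def power_mw(freq_mhz):
    # Simplified CMOS model: static leakage plus a dynamic C*V^2*f term.
    leakage = 0.5                                            # mW, always on
    dynamic = 0.02 * freq_mhz * supply_voltage(freq_mhz) ** 2
    return leakage + dynamic

lo, hi = 10.0, 100.0        # MHz: the two published corner cases
mid = (lo + hi) / 2.0

# Naive straight-line estimate between the two corner measurements...
linear_guess = (power_mw(lo) + power_mw(hi)) / 2.0
# ...versus what the (model's) actual curve does at the midpoint.
actual = power_mw(mid)

print(f"linear estimate at {mid:.0f} MHz: {linear_guess:.2f} mW")
print(f"model value at {mid:.0f} MHz:    {actual:.2f} mW")
```

In this toy model the straight-line guess overshoots the midpoint by well over a third, because the dynamic term grows faster than linearly with frequency once voltage scaling kicks in. A real chip’s curve could bend the other way; that’s exactly why you can’t know without measuring.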
EEMBC’s newest version of ULPmark, called ULPmark-CM, fixes some of that. It doesn’t change the code of either benchmark by very much, but it does alter the reporting rules. Henceforth, if you report power numbers using ULPmark, you must also report the CoreMark performance numbers that go with it, measured with the exact same hardware and software configuration. At last, we have two benchmarks that are correlated! A report of x iterations per millijoule (the way ULPmark has always worked) comes bundled with a score of y CoreMarks per second under the same conditions.
Crucially, this provides an x to go with the y on a power/performance graph. Before, an MCU’s power and performance numbers could be (and often were) plotted on the same graph, even though they were largely unrelated. Now, it’s possible to build a proper x/y plot of a chip’s performance versus power needs.
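A quick sketch of what that buys you. The configuration labels and scores below are made up for illustration (real numbers come from certified EEMBC runs), but the structure is the point: under the new rules, every reported result is a (performance, efficiency) pair, so each one lands as a proper point on the x/y plot.

```python
# Hypothetical ULPmark-CM style reports: each run now carries BOTH a
# CoreMark performance score and a ULPmark efficiency score, measured
# with the same hardware and software configuration.

reports = [
    # (configuration label,     CoreMarks/sec, iterations per mJ)
    ("low-power mode, 8 MHz",    21.4,          320.0),
    ("balanced, 48 MHz",        128.0,          210.0),
    ("full throttle, 96 MHz",   255.0,           95.0),
]

# Each report yields one (x, y) point -- no more guessing at the curve
# between two uncorrelated corner scores.
points = sorted((coremarks, ulp) for _, coremarks, ulp in reports)
for x, y in points:
    print(f"{x:7.1f} CoreMarks/s  ->  {y:6.1f} iterations/mJ")
```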
That still requires multiple data points, of course, and EEMBC has always encouraged testers to benchmark their devices at multiple points along that continuum. Marketing departments, however, tend to resist any attempt to fully characterize their devices, preferring to cherry-pick the one or two operating points that make their MCU look good. That’s still the case; EEMBC can’t force anyone to benchmark their chips. It can only tighten up the rules for reporting the results.
This change is EEMBC’s first major public pronouncement since appointing its new president and CTO, Peter Torelli. Peter follows Markus Levy, who founded EEMBC almost single-handedly back in 1997 and had headed the nonprofit group ever since. Coincidentally, Peter joined Intel at about the same time that Markus left Intel to work as an editor at EDN and later, EEMBC.
Peter says there’d been a lot of pressure from both vendors and developers to “rationalize” benchmark scores. The trouble was, everyone had a different idea of what was rational. Oddly, the definition seemed to coincide with whatever would prove most flattering to their own product line. Some wanted to mandate maximum-speed tests; some wanted different low-power settings; some felt the midpoint would be most fair (midpoint of what?). He says the technical subcommittee at EEMBC seriously considered a three-point spread: one at maximum clock speed, one at minimum practical clock speed, and one at 3.0V supply voltage – this last one because 3.0V is the baseline for some of EEMBC’s other tests. The group felt it was a good engineering solution, but the marketing people hated it.
Ideally, of course, everyone wanted a single number, the One True Measure of Goodness, but that’s not how benchmarks work. In the end, EEMBC decided to let everyone test and report as many configurations as they like, just so long as they report the performance numbers that go with their power numbers. (EEMBC does not require the reverse; CoreMark performance scores don’t require corresponding power numbers.)
You can almost hear Peter Torelli roll his eyes as he talks about pressure from various groups petitioning him to create new benchmarks for this, or to tweak the benchmarks for that. (Markus Levy dealt with similar appeals for more than 20 years. It seems to go with the territory.) Right now, one subcommittee is working out the parameters for “typical” Wi-Fi activity for edge-node IoT devices. Good luck with that.
He also talks about the gradual shift in EEMBC’s customer base and usage. Early on, chip vendors gobbled up EEMBC benchmarks and published (selected) scores as proof of their technical superiority. They still do, but Peter says that, nowadays, internal engineering groups are their primary customers. Everyone understands that benchmarks are just the opening line of a novel; they never tell the whole story. Development teams will draw upon EEMBC’s broad suite of tests to do their own internal testing and decide what and how they want to punish different chips. “It’s more of an in-lab analysis tool and less of a pushbutton ‘give me a number’ tool,” he says. The engineer in him seems pleased at the change.