
A Mark on the Bench

EEMBC Benchmark Scores Improve, But What Does It Mean?

Writing benchmarks is a lonely endeavor. It’s kind of like being a referee or an umpire. Everybody wants a good and fair benchmark, but “good” and “fair” are both open to interpretation, and whoever comes out on the short end of the evaluation is sure to howl and squeal.

The patient souls at EEMBC (Embedded Microprocessor Benchmark Consortium) have been dealing with this problem for well over a decade. They’ve produced a range of benchmarks that measure all sorts of vital system parameters, all with the goal of helping programmers and engineers choose the best chip for their next project. Will that new Atmel chip be fast enough to do what you want? Does the latest AMD processor have enough oomph to get the job done? Can that little $2 part handle decryption in under a microsecond? There’s an EEMBC benchmark for that.

Perhaps the purest and simplest of these is the CoreMark benchmark. CoreMark is intended to test only the processor’s core: the internal CPU architecture, independent of on-chip memory, peripherals, or pin-out. As such, it’s not the most practical of benchmarks, but that’s the point. CoreMark is intended to be a CPU designer’s litmus test. It measures how quickly code sluices through the internal plumbing, not how fast the chip can wiggle its I/O lines. Anyone can download and run CoreMark, and anyone can post their scores. So far, there are 388 CoreMark scores posted at www.coremark.org.
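
For a flavor of what that means in practice: the real benchmark (downloadable from www.coremark.org) mixes linked-list manipulation, matrix arithmetic, and state-machine processing, with a CRC to validate the results. The fragment below is not CoreMark, just a minimal sketch in the same spirit: pure integer work that exercises the core and nothing else.

    /* Not CoreMark itself, just a sketch of the kind of pure-integer,
       core-only work the real benchmark does: walk a linked list and
       fold every element into a CRC that validates the result. */
    #include <stdint.h>
    #include <stdio.h>

    #define NODES 64

    struct node { uint16_t data; struct node *next; };

    /* Bit-serial CRC-16; the real CoreMark also uses a CRC to self-check. */
    static uint16_t crc16(uint16_t crc, uint16_t data) {
        for (int i = 0; i < 16; i++) {
            unsigned carry = (crc ^ data) & 1u;
            crc >>= 1;
            data >>= 1;
            if (carry) crc ^= 0xA001;
        }
        return crc;
    }

    int main(void) {
        struct node pool[NODES], *head = NULL;

        /* Build a small linked list, newest node first. */
        for (int i = 0; i < NODES; i++) {
            pool[i].data = (uint16_t)(i * 7 + 1);
            pool[i].next = head;
            head = &pool[i];
        }

        /* Walk the list, checksumming as we go. */
        uint16_t crc = 0;
        for (struct node *p = head; p != NULL; p = p->next)
            crc = crc16(crc, p->data);

        printf("checksum: 0x%04X\n", crc);
        return 0;
    }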

Each score is listed both “straight up” (CoreMarks) and normalized for clock speed (CoreMarks/MHz). The former tells you how fast the chip is. The latter tells you how efficient it is. Engineering managers like the first number; nerds like the second number.

Which means we’ll be focusing here on the second number.
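
The arithmetic behind the two numbers is simple: the raw score is iterations of the benchmark workload completed per second, and dividing that by the clock rate in MHz gives the normalized figure. A minimal sketch, with invented numbers:

    /* Sketch of the scoring arithmetic; all numbers are invented. */
    #include <stdio.h>

    int main(void) {
        double iterations = 2500.0;  /* benchmark iterations completed */
        double seconds    = 10.0;    /* measured wall-clock time */
        double clock_mhz  = 100.0;   /* core clock during the run */

        double coremarks = iterations / seconds;   /* 250.0: how fast */
        double per_mhz   = coremarks / clock_mhz;  /* 2.50: how efficient */

        printf("%.1f CoreMarks, %.2f CoreMark/MHz\n", coremarks, per_mhz);
        return 0;
    }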

Before we do, it’s worth pointing out that the range of CoreMark scores is impressive. From an all-time low score of 0.37 to a high of over 336,000, the scores span almost six orders of magnitude. Naturally, the low scorers are little 8-bit microcontrollers that sell for next to nothing, while the top scorers are fire-breathing, massively multicore supercomputers-on-a-chip, such as IBM’s Power7. Between those two extremes, there’s probably a processor for every need and every budget, don’t you think?

When we look at CoreMark/MHz scores, however, the ranking is a bit different. Normalizing for clock frequency eliminates the “brute force” aspect of the test and drops some of the faster processors down the rankings a bit. Now, one could argue that that’s pointless: we don’t buy normalized processors, we buy real chips, and if that IBM chip runs faster than that Intel chip, so be it. We don’t normalize a car’s MPG rating by its weight, or a golfer’s handicap by his age. But CoreMark/MHz is more like a baseball player’s batting average: hits (CoreMark) divided by at-bats (clock cycles). In engineering terms, it’s a measure of the processor’s internal efficiency. How much work can it get done in a given clock cycle? The ratio is interesting in its own right, but it’s also a good window into the processor’s potential power consumption. The higher the CoreMark/MHz ratio, the more efficient the device and the less juice it should consume.

Here, too, we see a big spread of scores, from a low of 0.03 to a high of over 167. That’s a ratio of more than 5,000 to 1, well over three orders of magnitude, which seems surprising now that we’ve taken clock frequency out of the equation. A 5,000-to-1 difference in CPU efficiency hardly seems possible. We don’t see cars with 5,000 times better gas mileage than their peers, or baseball players who are 5,000 times more likely to hit the ball (except maybe on the playground). Yet here we’ve got some pretty clear evidence that some processors are thousands of times more productive than others. What’s going on?

Quite a number of things, as it happens, and as you probably expected. For starters, the faster chips have multiple CPU cores, so they’re cheating, in the sense that they’re running multiple copies of the benchmark and combining their scores. But all’s fair in love and benchmarking, and double-teaming the benchmark code is actually a pretty fair representation of how the chip would behave in real life. It would be worse if the benchmark scores didn’t improve; what would be the point of multicore in that case?
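
Here’s a minimal sketch of that effect, assuming a POSIX system rather than the actual CoreMark harness: each thread stands in for one copy of the benchmark, the wall-clock time barely grows as threads are added (up to the core count), and the combined iterations-per-second figure, i.e., the score, scales roughly with the number of cores.

    /* Minimal sketch of multicore scaling; compile with something like
       cc -O2 -pthread (invocation varies by toolchain). */
    #include <pthread.h>
    #include <stdio.h>

    #define THREADS 4            /* pretend it's a quad-core part */
    #define ITERS   10000000UL   /* iterations per thread */

    /* Dummy workload standing in for one copy of the benchmark. */
    static void *worker(void *arg) {
        (void)arg;
        volatile unsigned long sink = 0;  /* volatile keeps the loop alive */
        for (unsigned long i = 0; i < ITERS; i++)
            sink += i;
        return NULL;
    }

    int main(void) {
        pthread_t t[THREADS];
        for (int i = 0; i < THREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < THREADS; i++)
            pthread_join(t[i], NULL);

        /* Four copies finish in roughly the time one would take, so the
           combined iterations-per-second score is roughly four times higher. */
        printf("%lu total iterations across %d threads\n",
               THREADS * ITERS, THREADS);
        return 0;
    }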

Then there are compiler differences. CoreMark is delivered as C source code, so it has to be compiled. Any first-year programming student knows that the choice of compiler makes a big difference to the quality (and speed) of the resulting code. Benchmark tests are no different. In fact, some benchmarks (like NullStone) are actually tests of the compiler, not of the hardware.

So, for instance, we’ve got one example of an NXP LPC2939 chip that delivers 0.54 CoreMark/MHz, and another example of the same chip that gets 1.18. That’s better than a 2:1 difference for the exact same device. The only difference between them? One runs its code out of flash memory and the other out of RAM.
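
Mechanically, running code out of RAM is usually a linker trick. Here’s a hedged sketch of one common GCC-style idiom; the “.ramfunc” section name and the assumption that startup code copies the section from flash into RAM before main() runs are widespread toolchain conventions, not anything specific to NXP’s part.

    #include <stdint.h>

    /* Hypothetical setup: the linker script defines a ".ramfunc" section
       located in RAM, and the startup code copies its contents out of
       flash before main() runs. Both details vary by toolchain. */
    __attribute__((section(".ramfunc"), noinline))
    uint32_t hot_loop(const uint32_t *buf, uint32_t n) {
        uint32_t acc = 0;
        for (uint32_t i = 0; i < n; i++)
            acc += buf[i];
        /* Identical machine code either way; the only difference is
           whether instruction fetches pay flash wait states. */
        return acc;
    }

    int main(void) {
        static const uint32_t data[4] = { 1, 2, 3, 4 };
        return (int)hot_loop(data, 4);   /* returns 10 */
    }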

In other head-to-head examples, the cause of different benchmark scores is the compiler. Is that fair? Probably, since the compilers themselves are commercial products that you or I could use on our own projects. If Compiler A really does produce code that runs twice as fast as Compiler B’s, I know which one moves to the top of my shopping list.
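
To get a feel for how much leverage the compiler has, here’s a toy loop (not a real benchmark) you can build twice and time. With GCC or Clang, an unoptimized build (-O0) and an optimized one (-O3) of the same source can differ severalfold on the same hardware; the exact ratio is compiler- and target-dependent.

    #include <stdio.h>
    #include <time.h>

    #define N 100000000UL   /* enough iterations to be measurable */

    int main(int argc, char **argv) {
        (void)argv;
        /* mask is a runtime value, so the loop can't be folded away
           at compile time. */
        unsigned long mask = (unsigned long)argc;
        unsigned long sum = 0;

        clock_t start = clock();
        for (unsigned long i = 0; i < N; i++)
            sum += (i ^ mask) & 0xFF;
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        /* Printing sum keeps the optimizer from deleting the loop entirely. */
        printf("sum=%lu in %.2f s\n", sum, secs);
        return 0;
    }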

Memory can make a difference, but only sometimes. Most low-end microcontrollers have enough on-chip memory to hold the little CoreMark test entirely within their confines. Faster 32-bit and 64-bit processors typically don’t have on-chip RAM, but they do have big caches. So even though the processor has to fetch the benchmark out of slow external memory the first time, the program will run from the cache from then on, negating any difference in memory speed or latency. That might not be a good reflection of how your real code works (in fact, it’s probably not), but hey, a benchmark can only do so much.
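
A sketch of that effect, using a plain timed loop rather than CoreMark itself: the first pass over a buffer pays for fetches from main memory, while the second pass, assuming the buffer fits in cache, is served from cache. clock() is coarse, so treat the output as illustrative only.

    #include <stdio.h>
    #include <time.h>

    #define WORDS (1 << 16)   /* 64K ints (256 KB): fits many last-level caches */

    static int buf[WORDS];

    static double timed_pass(void) {
        clock_t start = clock();
        volatile long sum = 0;   /* volatile keeps the loop from vanishing */
        for (int i = 0; i < WORDS; i++)
            sum += buf[i];
        (void)sum;
        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    int main(void) {
        double cold = timed_pass();  /* first touch: fetched from main memory */
        double warm = timed_pass();  /* second pass: served from the cache */
        printf("cold pass: %.6f s, warm pass: %.6f s\n", cold, warm);
        return 0;
    }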

Then there’s the human element. Some benchmark testers have more… enthusiasm for their job than others. Pretty much any benchmark is subject to unethical optimization, although CoreMark is pretty hack-proof. As far as the people running EEMBC can tell, none of the CoreMark scores published on the website are bogus. They even certify some of the scores (if the tester is a member of EEMBC) by duplicating the test in their own labs. Currently, only about 10% of CoreMark scores are certified; the rest are “self-certified.” Peer pressure keeps people from making egregious claims. A wildly out-of-whack score would quickly be challenged by competitors, as well as current customers. “Hey, how come the chip you sold me doesn’t go that fast?”

Finally, there’s good old-fashioned progress. CoreMark scores have been improving over time, some by as much as 30–40% in a single year. It’s not entirely clear where that gain is coming from. I spoke with a number of EEMBC members who’d recently posted upgraded CoreMark scores, and their general response was, “we’re always striving to improve our compiler tools to better serve our valued customers,” blah, blah, blah. No improvements to the chips, in other words. Just improvements in the compiler or—and this is a real possibility—a better understanding of which compiler switches produce the most flattering results.

And this brings us to the final paradox of benchmark testing. In real-world development, the compiler setting that gives you the best benchmark score isn’t always the one you really want. Benchmarks like CoreMark test speed, and nothing else. Specifically, they don’t grade code density or code size, nor do they measure power consumption or debug accessibility. The hardware and software setup that produces the fastest code may also produce a gigantic binary image that hogs RAM, ROM, and power. Sports cars that post astonishing 0–60 times rarely have big trunks for groceries and other practical cargo.
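
The trade is easy to see in miniature. Both functions below compute the same sum; the second is unrolled by hand, the shape a speed-oriented optimizer (GCC’s -funroll-loops, say) gives loops, trading code size for fewer branches. A size-oriented build (-Os) keeps the compact form. Which one you want depends on whether you’re short of cycles or short of flash.

    #include <stddef.h>
    #include <stdio.h>

    /* Compact form: what a size-oriented build (-Os) tends to keep. */
    long sum_small(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Hand-unrolled, four elements per trip: fewer loop branches, but
       several times the code footprint. */
    long sum_fast(const long *a, size_t n) {
        long s = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4)
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < n; i++)   /* mop up the leftovers */
            s += a[i];
        return s;
    }

    int main(void) {
        long data[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
        printf("%ld %ld\n", sum_small(data, 10), sum_fast(data, 10));  /* 55 55 */
        return 0;
    }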

So we’re back to where we started, rating and ranking chips based on our own personal mix of different criteria, most of them subjective. As nice as it seems to have a single, objective number to measure “goodness,” no number is going to tell you what you want. That’s what engineers are for. 
