feature article

A Mark on the Bench

EEMBC Benchmark Scores Improve, But What Does It Mean?

Writing benchmarks is a lonely endeavor. It’s kind of like being a referee or an umpire. Everybody wants a good and fair benchmark, but “good” and “fair” are both open to interpretation, and whoever comes out on the short end of the evaluation is sure to howl and squeal.

The patient souls at EEMBC (Embedded Microprocessor Benchmark Consortium) have been dealing with this problem for well over a decade. They’ve produced a number of different benchmarks that measure any number of vital system parameters, all with the goal of helping programmers and engineers choose the best chip for their next project. Will that new Atmel chip be fast enough to do what you want? Does the latest AMD processor have enough oomph to get the job done? Can that little $2 part handle decryption in under a microsecond? There’s an EEMBC benchmark for that.

Perhaps the purest and simplest of these is the CoreMark benchmark. CoreMark is intended to test only the processor’s core: the internal CPU architecture, independent of on-chip memory, peripherals, or pin-out. As such, it’s not the most practical of benchmarks, but that’s the point. CoreMark is intended to be a CPU designer’s litmus test. It measures how quickly code sluices through the internal plumbing, not how fast the chip can wiggle its I/O lines. Anyone can download and run CoreMark, and anyone can post their scores. So far, there are 388 CoreMark scores posted at www.coremark.org.

Each score is listed both “straight up” (CoreMarks) and normalized for clock speed (CoreMarks/MHz). The former tells you how fast the chip is. The latter tells you how efficient it is. Engineering managers like the first number; nerds like the second number.

Which means we’ll be focusing here on the second number.

Before we do, it’s worth pointing out that the range of CoreMark scores is impressive. From an all-time low score of 0.37 to a high of over 336,000, the scores span almost six orders of magnitude. Naturally, the low scorers are little 8-bit microcontrollers that sell for next to nothing, while the top scorers are fire-breathing, massively multicore supercomputers-on-a-chip, such as IBM’s Power7. Between those two extremes, there’s probably a processor for every need and every budget, don’t you think?

When we look at CoreMark/MHz scores, however, the ranking is a bit different. Normalizing for clock frequency eliminates the “brute force” aspect of the test and drops some of the faster processors down the rankings a bit. Now, one could argue that that’s pointless: We don’t buy normalized processors, we buy real chips, and if that IBM chip runs faster than that Intel chip, so be it. We don’t normalize a car’s MPG rating by its weight, or a golfer’s handicap by his age. So CoreMark/MHz is more like a baseball player’s batting average. The hits (CoreMark) divided by the number of at-bats (clock cycles). In engineering terms, it’s a measure of the processor’s internal efficiency. How much work can it get done in a given clock cycle? The ratio is interesting in its own right, but it’s also a good window into the processor’s potential power consumption. The higher the CoreMark/MHz ratio, the more efficient the device and the less juice it should consume.

Here, too, we see a big spread of scores, from a low of 0.03 to a high of over 167. That’s nearly four orders of magnitude, which seems surprising now that we’ve taken clock frequency out of the equation. That represents a difference of 5-thousand-to-1 in terms of CPU efficiency, which hardly seems possible. We don’t see cars with 5 thousand times better gas mileage than their peers, or baseball players that are 5 thousand times more likely to hit the ball (except maybe on the playground). Yet here we’ve got some pretty clear evidence that some processors are thousands of times more productive than others. What’s going on?

Quite a number of things, as it happens, and as you probably expected. For starters, the faster chips have multiple CPU cores, so they’re cheating, in the sense that they’re running multiple copies of the benchmark and combining their scores. But all’s fair in love and benchmarking, and double-teaming the benchmark code is actually a pretty fair representation of how the chip would behave in real life. It would be worse if the benchmark scores didn’t improve; what would be the point of multicore in that case?

Then there are compiler differences. CoreMark is delivered as C source code, so it has to be compiled. Any first-year computer programmer knows that the choice of compiler makes a big difference to the quality (and speed) of your code. Benchmark tests are no different. In fact, some benchmarks (like NullStone) are actually tests of the compiler, not of the hardware.

So, for instance, we’ve got one example of an NXP LPC2939 chip that delivers 0.54 CoreMark/MHz, and another example of the same chip that gets 1.18. That’s a 2:1 difference for the exact same device. The only difference between them? One runs from flash memory and the other from RAM.

In other head-to-head examples, the cause of different benchmark scores is the compiler. Is that fair? Probably, since the compilers themselves are commercial products that you or I could use on our own projects. If Compiler A really does produce code that runs twice as fast as Compiler B’s, I know which one moves to the top of my shopping list.

Memory can make a difference, but only sometimes. Most low-end microcontrollers have enough on-chip memory to hold the little CoreMark test entirely within their confines. Faster 32-bit and 64-bit processors typically don’t have on-chip RAM, but they do have big caches. So even though the processor has to fetch the benchmark out of slow external memory, the program will run from the cache from then on, negating any difference in memory speed or latency. That might not be a good reflection of how your real code works (in fact, it’s probably not), but hey—a benchmark can only do so much.

Then there’s the human element. Some benchmark testers have more… enthusiasm for their job than others. Pretty much any benchmark is subject to unethical optimization, although CoreMark is pretty hack-proof. As far as the people running EEMBC can tell, none of the CoreMark scores published on the website are bogus. They even certify some of the scores (if the tester is a member of EEMBC) by duplicating the test in their own labs. Currently, only about 10% of CoreMark scores are certified; the rest are “self-certified.” Peer pressure keeps people from making egregious claims. A wildly out-of-whack score would quickly be challenged by competitors, as well as current customers. “Hey, how come the chip you sold me doesn’t go that fast?”

Finally, there’s good old-fashioned progress. CoreMark scores have been improving over time, some by as much as 30–40% in a single year. It’s not entirely clear where that gain is coming from. I spoke with a number of EEMBC members who’d recently posted upgraded CoreMark scores, and their general response was, “we’re always striving to improve our compiler tools to better serve our valued customers,” blah, blah, blah. No improvements to the chips, in other words. Just improvements in the compiler or—and this is a real possibility—a better understanding of which compiler switches produce the most flattering results.

And this brings us to the final paradox of benchmark testing. In real-world development, the compiler setting that gives you the best benchmark score isn’t always the one you really want. Benchmarks like CoreMark test speed, and nothing else. Specifically, they don’t grade code density or code size, nor do they measure power consumption or debug accessibility. The hardware and software setup that produces the fastest code may also produce a gigantic binary image that hogs RAM, ROM, and power. Sports cars that post astonishing 0–60 times rarely have big trunks for hauling groceries and other practical uses.

So we’re back to where we started, rating and ranking chips based on our own personal mix of different criteria, most of them subjective. As nice as it seems to have a single, objective number to measure “goodness,” no number is going to tell you what you want. That’s what engineers are for. 
