
Lies, Damn Lies, Benchmarks and ML Perf Auto?

MLCommons, in collaboration with AVCC, the Autonomous Vehicle Compute Consortium, recently released the ML Perf Auto 0.5 benchmark suite. The suite is intended to compare the performance of different AI engines in real-world vision/perception applications for ADAS and autonomous driving.

Will it live up to that ambitious goal? Let’s go back a few years…

Early in my career, I was a working group member of the Programmable Electronics Performance (PREP) Corporation, which was chartered to establish a suite of benchmarks for comparing programmable logic devices from different vendors in terms of performance and density. With so many different architectures and so many bold claims regarding “equivalent gates” and the performance of a given product family, EE Times felt it was their duty, and opportunity, to establish an impartial way to compare the dozen and a half or so programmable logic vendors that existed at the time.

It was clear from my own personal experience as a designer who used Xilinx XC3000-series FPGAs that some form of benchmarking would indeed be very useful. At the time, the performance and the different speed grades of the XC3000 family were based on the toggle rate of a flip-flop within the confines of the Configurable Logic Block (CLB); no routing resources were included in the performance figures. The claimed device speeds ranged from 50 MHz up to as high as 125 MHz (if my memory serves me correctly). I used the 70 MHz device, only to learn that when I filled up the device and used routing resources, as one would, the actual operation I was able to achieve was roughly 7 MHz, one tenth of the claimed device performance. At the time, my boss, who wasn’t familiar with FPGAs, thought that I must be incompetent; clearly something had to be wrong with my design if I could achieve only 7 MHz from a 70 MHz device.

Xilinx wasn’t alone in its specsmanship; each vendor had its own nuanced way of profiling the performance of its devices in a manner that showcased them in the best possible light, regardless of the applicability and relevance of that specification to real-world designs. So impartial benchmarking clearly was in order.

As mentioned, PREP set out to measure both the equivalent density and the performance of the various PLD architectures through a suite of nine benchmarks intended to represent the kinds of functionality found in a typical design (if there is such a thing). The benchmark suite was deliberately balanced between FPGA and CPLD architectures: there were register-rich benchmarks that favored FPGAs, and wide fan-in functions that favored product-term (P-term) architectures, i.e., CPLDs. The objective was to determine how many instances of a given benchmark could be instantiated in a device and how fast each instance could operate. In this manner, a general idea of the density and performance of a given device could be determined.
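To make that bookkeeping concrete, here is a minimal Python sketch of the instances-times-speed arithmetic described above. Every device and benchmark number in it is hypothetical, invented purely for illustration; none of it is actual PREP data.

```python
# A minimal sketch of the PREP-style arithmetic: how many copies of a
# benchmark circuit fit in a device, and how fast each copy runs.
# All device and benchmark numbers are hypothetical, for illustration only.

def prep_figures(device_logic_cells, cells_per_instance, mhz_per_instance):
    """Return (instances that fit, per-instance speed in MHz)."""
    instances = device_logic_cells // cells_per_instance
    return instances, mhz_per_instance

# Two made-up devices running the same made-up register-rich benchmark.
for name, cells, cells_needed, speed in [
    ("Vendor A FPGA", 400, 25, 40.0),   # register-rich fabric, modest clock
    ("Vendor B CPLD", 128, 16, 66.0),   # fewer resources, faster P-term paths
]:
    n, mhz = prep_figures(cells, cells_needed, speed)
    print(f"{name}: {n} instances at {mhz} MHz each")
```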

I am explaining this benchmarking process at this level of detail so the reader can appreciate how something that was expected to be simple, benchmarking a device, actually turned out. First and foremost, even though every company verified the others’ results for accuracy, once the results were verified and published, each vendor had its own spin on how or why it “won” and why it had the superior product or architecture. So, even with published, verified results, the final message was still quite muddy. Perhaps one of the unforeseen consequences is that the introduction of the PREP benchmarks led to physical silicon tweaks that didn’t necessarily benefit the end user but significantly improved PREP results. Similarly, vendors implemented software and synthesis tweaks to show best-in-class results for a series of contrived benchmarks. However, if a design deviated even slightly from those benchmarks, all bets were off. So, from my previous experience of seeing exactly “how the sausage was made,” I have developed a somewhat jaundiced view of benchmarking in general. That said, with the right knowledge of the benchmarking process, there are meaningful insights into device capabilities to be gained that would otherwise not be possible.

Enter ML Perf Auto 0.5, the recently released benchmark suite from MLCommons, developed in collaboration with AVCC, the Autonomous Vehicle Compute Consortium. These benchmarks focus on AI perception workloads and are intended to compare different companies’ AI engines when used in “real world” vision/perception applications, specifically for ADAS and autonomous driving. Where several decades ago there was considerable chicanery from programmable logic vendors regarding performance and gate-count claims, warranting the need for PREP, the AI semiconductor industry is perhaps even richer in contradictory, unclear claims, where apparently not all TOPS are created equal.

Similar to PREP, one of the intentions of the ML Perf Auto benchmarks is to provide a means of measuring device performance using a consistent suite of datasets under well-prescribed conditions, i.e., an apples-to-apples comparison. In this case, the dataset used to benchmark performance is from Cognata, which is generally considered an industry standard. This dataset, which was established to accelerate the development of self-driving systems, is a photorealistic synthetic dataset that also contains augmented real-world data. However, even with the Cognata suite, it is recognized that many real-world scenarios aren’t represented, including the broad spectrum of possible weather conditions, lighting conditions, and sensor configurations. This suite of images is applied to a predefined vehicle architecture containing multiple cameras and sensors. On that point, a basic survey of actual ADAS sensor configurations, the number of cameras, their resolution, field of view, and frame rate, plus the number of radar sensors and the controversial addition of LIDAR employed by a given automobile, further muddies the water.
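To illustrate just how much the sensor configuration alone moves the goalposts, here is a back-of-the-envelope Python sketch comparing the raw camera pixel throughput of two hypothetical configurations. The camera counts, resolutions, and frame rates are invented for illustration and are not the AVCC-specified architecture.

```python
# A back-of-the-envelope sketch of how much the camera configuration alone
# changes the perception workload. Both configurations below are
# hypothetical examples, not the AVCC-defined vehicle architecture.

def pixel_rate_gpix_s(cameras, width, height, fps):
    """Aggregate camera throughput in gigapixels per second."""
    return cameras * width * height * fps / 1e9

configs = {
    "modest camera setup":  dict(cameras=5,  width=1280, height=960,  fps=30),
    "camera-heavy setup":   dict(cameras=11, width=3840, height=2160, fps=30),
}

for name, cfg in configs.items():
    print(f"{name}: {pixel_rate_gpix_s(**cfg):.2f} Gpix/s")

# The second configuration pushes roughly 15x the pixels of the first,
# before radar or LIDAR are even considered, so a single headline
# frames-per-second number says little without the sensor context.
```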

The AVCC has established a framework that specifies the sensor configuration associated with a given ADAS level. The framework was established by a working group composed of representatives from a broad range of automotive OEMs, Tier 1s, and semiconductor and sensor suppliers. In practice, however, it’s unclear how accurately this framework reflects the real world, given the disparity between these specifications and those of the actual vehicles on the road today.

To further complicate matters, there is no universally accepted definition of what specific functionality is included in a given ADAS level; to date, this remains a grey area. Furthermore, new, derivative ADAS levels (e.g., Level 2+) have been introduced, implying not quite Level 3 (i.e., “eyes off”) but getting close, as OEMs that are unable to achieve Level 3 still look for some level (pun intended) of marketing clout. So, any consistency in claiming a given ADAS level based upon performance against a given set of benchmarks will be a stretch at best, and, not unlike the PREP results, when the final verified results are completed, it’s probably safe to assume that every vendor that participated will claim to have been “the best.”

Clearly, the complexity of this benchmarking endeavor can’t be overstated, especially when compared to the relative simplicity of PREP. Not only are the datasets and associated constraints significant, the suite of results that are measured and benchmarked is also complex and extensive. The measured results include latency, determinism, throughput, accuracy, false positives, and the ability to keep up with a given frame rate under power constraints and at a given pixel density. PREP simply measured how many instances of a benchmark could be stepped and repeated and how fast each of those instances ran, and even that proved difficult and fraught with complexity and chicanery.
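As a rough illustration of how such results might be weighed against one another, here is a simplified Python sketch that checks two hypothetical AI engines against a frame-time latency budget and then compares throughput per watt. The engines, the numbers, and the particular pass/fail criterion are all assumptions for illustration, not part of the ML Perf Auto methodology.

```python
# A simplified sketch of combining reported metrics when comparing two AI
# engines: check each against a frame-time latency budget, then compare
# throughput per watt. All devices, numbers, and the p99 criterion are
# hypothetical, for illustration only.

FRAME_BUDGET_MS = 33.3   # one frame period at 30 fps

results = [
    # name, p99 latency (ms), throughput (inferences/s), power (W), accuracy
    ("Engine A", 21.0, 310, 45, 0.71),
    ("Engine B", 38.5, 520, 60, 0.74),  # higher throughput, but misses the budget
]

for name, p99_ms, ips, watts, acc in results:
    meets = p99_ms <= FRAME_BUDGET_MS
    print(f"{name}: meets {FRAME_BUDGET_MS} ms budget: {meets}, "
          f"{ips / watts:.1f} inf/s per W, accuracy {acc}")
```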

With all that in mind, MLCommons has taken a progressive and pragmatic view of what benefits can be derived from this benchmarking effort. In the short term, the expectation is that it will allow purchasing departments to take a more consistent approach to responding to RFQs and RFIs, though again, user beware: the manner in which benchmark results are achieved should be scrutinized.

In the next three years or less, the expectation is that the benchmarks will lead to architectural changes in silicon. Hopefully those changes will reflect benefits that apply to real-world conditions rather than optimizations that simply yield good results for a set of contrived benchmarks. Unfortunately, as I mentioned earlier, in my PREP experience I witnessed the latter.

Eventually, MLCommons has ambitions of being part of some form of safety certification body. Having had direct experience developing functional safety concepts and designing functional safety into products, I find this goal far too ambitious, if not entirely impractical. Safety is assessed and qualified at the system level, not strictly the device level. To assume that an automobile will employ the exact same architecture as specified by AVCC is too much of a leap and, to be blunt, not realistic. Perhaps there are interim, lesser ways of assessing safety compliance that I’m not seeing at the moment. That’s entirely possible.

As the sentiment goes, there are lies, damn lies, and benchmarks. This isn’t to disparage the effort, because there’s real value in what’s being embarked upon. Just, as the saying goes: warning! Objects in the mirror may be closer than they appear!

One thought on “Lies, Damn Lies, Benchmarks and ML Perf Auto?”

  1. Robert, agreed, the PREP benchmarks led to specsmanship, with nuanced performance profiles showcased in the best possible light. Devices and tools were tweaked to best advantage. Everybody won at something, confusing customers.
    On the plus side, the benchmarks provided a “hello world” for designers to get started with new designs and demonstrate what programmable logic could do. Simple examples such as a counter, an ALU, and a state machine written in both Verilog and VHDL provided a Rosetta Stone of logic to teach logic design to designers, tool vendors, and semiconductor suppliers.
    Hopefully, the ML Perf Auto benchmark suite can provide similar benefits.
