Benchmarking Battlefield

In our previous article “Terminology Tango 101” we poked fun at the myriad metrics given by programmable logic companies in their publications and data sheets. While it’s fun to make fun of the confusion induced by this dizzying data, it is also interesting and useful to dig past the difficulties of agreeing on units and dimensions and to take a look at the actual processes that are used to test our tools and evaluate our architectures. While there is a good deal of obfuscation built into the numbers that are presented to the public, the companies that publish them do make a significant effort to measure accurately and inform realistically. The problem is that constructing and conducting an accurate and reasonable test of FPGA tool and architecture performance is incredibly complex.

Domain Specific Dilemma

First, we need to understand that the range of applications being targeted to programmable logic is growing broader every day. In the past, it was interesting and relevant to publish an estimated gate count and a rough idea of maximum operating frequency as a means of informing the design community of the capabilities of a particular FPGA architecture. Today, however, with FPGAs containing a wealth of embedded hard IP and being used for everything from digital signal processing, to embedded and reconfigurable computing, to high-speed serial I/O, it is virtually impossible to measure and report on what’s relevant to every design team considering a particular tool or technology for their project.

FPGA companies have mostly abandoned counting equivalent ASIC gates or “system gates” in favor of reporting the number of logic cells or look-up tables (LUTs) contained in a particular architecture. Now, however, with devices containing vast numbers of embedded multipliers, huge amounts of block and distributed RAM, high-performance RISC processors, and a host of complex I/O hardware, the LUT count is only a small part of the story. Likewise, measuring the clock speed of the programmable LUT fabric misses much of the relevant performance potential of a device in a real application. Factors such as the serial I/O bandwidth, the equivalent DSP performance, and the processing power are probably even more meaningful today than old-school FPGA metrics.

“FPGA vendors have long used large suites of customer designs to measure the performance of each architecture,” says Steve Sharp, senior marketing manager of corporate solutions at Xilinx. “Unfortunately these methods give results that are analogous to horsepower ratings on a car. They don’t reflect the performance you’ll see with the device in your system.” Recognizing that new customers like DSP and embedded systems designers are migrating to FPGA platforms in large numbers, Xilinx is working to redefine benchmarking to better measure what today’s designers are looking for. “We can measure basic things like DMIPS for processors or GMACs for DSP, but that’s just the tip of the iceberg,” Sharp continues. “The system architect wants to know how, for example, using an FPGA to accelerate DSP performance affects the whole system. We’re working to find robust ways to measure these things.”

EDA companies have always been reluctant to publish benchmark results, except with reference to their own previous generations. In fact, most EDA companies’ license agreements prohibit licensees from publishing any competitive benchmark results obtained from their tools. As we’ll discuss later, the difficulty of running a fair and meaningful benchmark is such that few, if any, EDA customers have the resources, ability, and willingness to invest the effort to conduct an accurate evaluation. For EDA companies, the risk of bogus numbers being published is too great, and the likelihood of one corner-case result being taken out of context is too high. They’ve chosen to simply prohibit benchmark publication altogether.

Don’t try this at home

In researching this article, we talked to almost every major EDA and programmable logic vendor about their current benchmarking processes and purposes. We discovered that, without exception, the companies we talked to maintain a significant ongoing investment in benchmarking and take the process very seriously. The cost is sobering. Most have large investments in networked server farms with tens to hundreds of dedicated, high-performance processors running almost non-stop cranking out benchmark results. Most reported that a “complete” run of their benchmark suites takes on the order of days to weeks of 24-hour-per-day crunching on a passel of Pentiums, repeatedly running hundreds of reference designs through synthesis, place-and-route, timing analysis, and logic verification.

Only by examining the results of a wide array of designs can any reasonable general conclusions be reached on the real-world performance of any tool suite or FPGA architecture. The variation from design to design is usually much larger than the difference between tools or technologies, so a considerable amount of data is required to reach any statistically significant conclusions. If you have visions of doing a meaningful evaluation on your laptop with a handful of designs, your efforts will probably serve to mislead you more than guide you.
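To see why a handful of designs settles so little, consider a minimal sketch of a paired comparison. It assumes each design has been run through two flows and that a single figure (here, achieved clock frequency) was recorded for each run. The function name and the use of a geometric mean with a rough normal-approximation interval are our own illustrative choices, not any vendor's published method.

```python
import math
import statistics


def paired_ratio_summary(baseline_mhz, candidate_mhz):
    """Summarize per-design candidate/baseline ratios.

    Returns (geometric_mean, low, high), where low..high is a rough
    95% interval on the geometric mean ratio. The interval uses the
    normal approximation (1.96 standard errors); with very few designs
    a t-distribution would give an even wider range.
    """
    if len(baseline_mhz) != len(candidate_mhz) or len(baseline_mhz) < 2:
        raise ValueError("need at least two paired results")

    # Work in log space so that a 10% gain and a 10% loss are symmetric.
    logs = [math.log(c / b) for b, c in zip(baseline_mhz, candidate_mhz)]
    mean = statistics.fmean(logs)
    std_err = statistics.stdev(logs) / math.sqrt(len(logs))

    return (
        math.exp(mean),
        math.exp(mean - 1.96 * std_err),
        math.exp(mean + 1.96 * std_err),
    )


if __name__ == "__main__":
    # Invented numbers, purely for illustration: four designs, two flows.
    flow_a = [100.0, 200.0, 150.0, 80.0]
    flow_b = [110.0, 190.0, 160.0, 78.0]
    gm, low, high = paired_ratio_summary(flow_a, flow_b)
    print(f"geometric mean ratio {gm:.3f}, rough 95% interval {low:.3f}..{high:.3f}")
```

With only four designs whose individual ratios swing by several percent in either direction, the interval spans 1.0, so the data cannot say which flow is better. Narrowing that interval is what the hundreds of designs and the server farms are for.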

Surveying the situation

One thing that struck us in examining the benchmarking methods used by FPGA and EDA vendors was the considerable effort put forth to make tests balanced and fair. While it would be easy to run a thousand designs through your own synthesis tool and report on only the hundred or so where your solution performed best, we saw no evidence of such charades in the vendors we interviewed. Vendors seem to go to great pains to choose a variety of designs that are representative of real customer use of their products and to measure the results of those designs in a meaningful way. The reasons for this are evident when we examine the motivations behind running benchmarks in the first place.

“We benchmark with a variety of purposes in mind,” says Actel’s Saloni Howard-Sarin. “Of course there is marketing’s need to discuss where and on what we perform well, but we also have internal needs to look for deficiencies in our tools and to drive engineering by finding out specifically what we need to improve. We also want to be able to give direction to third-party vendors who supply tools to us and to monitor release-to-release improvement of our overall tool suite. Finally, we use the benchmarking process to test out new architectures under development.”

Actel’s investment in benchmarking is mostly for the purpose of tracking the progress from generation to generation of their architectures and tools. Even so, they maintain a test suite with thousands of designs to evaluate almost every aspect of their device and tool performance. Because they primarily offer technologies like flash and antifuse, direct benchmarking against competitors with similar devices is not much of a priority. For the SRAM-based FPGAs, however, competition is fierce, and the vendors are looking over their shoulders, benchmarking themselves against the others on a regular basis.

Altera has created an impressive methodology for measuring and monitoring key performance characteristics of their own and their competitors’ devices and tools. Designing a fair and accurate test methodology that yields reasonable metrics is a daunting task, and Altera has clearly made a considerable investment that has paid off in this area. Their system includes a comprehensive and representative test suite, automatic compensation for technology and IP differences between various FPGA families, and automatic generation of design constraints to get optimal results from synthesis and layout tools.

Altera’s resulting benchmark run requires one week with more MIPS than mission control on a moon landing to grind out final metrics for the test suite. They did an excellent job sorting out precisely what to measure and what to nail down. Third-party synthesis was chosen, for example, to give consistent, high-quality results and to level the playing field between vendors, while translators were used to substitute each vendor’s cores into the test designs so that proprietary IP performance differences would be reflected in the measured results.

Lattice Semiconductor has made competitive benchmarking less of a priority but still runs an impressive program. Their benchmark efforts are focused primarily on improving software and silicon from release to release. Like Altera, they work to isolate the contribution of synthesis and place-and-route from the capabilities of the device itself. “Our suite of benchmark functions is very large,” says Stan Kopec, Lattice’s VP of marketing. “We have designs with various grades of detail, including both specific tests to measure device and tool performance on known functions and general customer designs that contain difficult structures or that exercise specific corner situations.” Lattice does exhaustive benchmarking early in the life of new architectures, building a virtual prototype and running iterative tests to see how many connectivity resources, such as routing channels, are required to allow satisfactory utilization while keeping silicon area to a minimum.

On the competitive front, Lattice chooses to emphasize benchmarking IP cores rather than generic architecture characterization. In terms of being easy to understand and consume, their approach makes a lot of sense. The performance of a timing-critical IP block such as a PCI core, for example, might be the single most useful piece of data to a design team for whom that core is key. “The scale of functions now being implemented in programmable logic,” Kopec continues, “has grown so complex that universal benchmark results are not terribly useful. Lattice does not publish competitive benchmark results for that reason. For any particular designer, they might not be accurate, or they just might not apply to their world.”

Industry leader Xilinx approaches benchmarking with a huge investment, a great deal of experience, and very realistic expectations. Xilinx has a suite of between 50 and 100 designs to evaluate each architecture, and they work to keep the benchmark process as close to the actual customer experience as possible. “The way you constrain the design or the way you run the tools can affect the results by a factor of two or three,” says Xilinx’s Sharp. “Even in benchmarking just the logic [and not DSP, I/O, or processing features], it doesn’t make sense to run designs unconstrained or with unrealistic constraints.”

Different tools also have different “default” modes with different approaches to optimization. Some tools are constraint-driven and optimize only until the designers’ constraints are met. Others simply optimize until they reach diminishing returns or until some specified set of algorithms has been run to completion. The tradeoff between tool run time, design performance, and design area hangs in the balance, and arbitrary benchmarking decisions can lead to big variances in results. In light of this fuzzy science, Xilinx’s focus on system-level performance with today’s new breed of designs is very forward-thinking.

Even though EDA vendors don’t publish results, they collect a lot of them. “In order to accurately evaluate the impact of a change in a synthesis algorithm, we need at least 50 and probably 100 designs,” says Jeff Garrison, director of FPGA marketing at Synplicity. “We have a huge infrastructure that leverages probably 100 machines and runs for days.” Synplicity is aggressive in seeking out new designs to add to their benchmark suite. “If you always run the same designs,” Garrison continues, “your tools may get very good at those designs, but not perform well on others. Your tools tend to learn your suite. We’re constantly working with customers to bring in fresh designs to keep up our improvement goals.” Synplicity is serious about their benchmarks, working to achieve a specific target performance improvement with each release. Without such a fair and accurate test mechanism, it would be extremely difficult for them to assess their own progress.

What to measure

Even the LUT fabric benchmarking problem isn’t solved once you have a large number of machines and a wide range of designs. It is also very difficult to decide what to measure. On timing performance, for example, most designs have a number of clock domains operating at different frequencies. Any one of those could be the critical clock domain for the design, depending upon the design goals. Setting up an automated suite that reports the frequency of only the fastest clock or the slowest clock may miss the real problem altogether. Also, in some applications, the total slack may be a more informative metric than just the frequency. Many times, the performance is determined by timing from an I/O to the core of the device or from the core to an output instead of by an internal critical path. Deciding on a general metric that can evaluate a large number of designs and yield meaningful results is not a trivial exercise.
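As a rough illustration of how these choices change the answer, the sketch below summarizes a timing report per clock domain rather than as one headline frequency. The input format (a list of domain and slack pairs plus a table of target periods) is a hypothetical simplification of what a timing analyzer reports, and the frequency estimate assumes simple single-clock paths where the achievable period is the target period minus the worst slack.

```python
from collections import defaultdict


def summarize_timing(paths, target_period_ns):
    """Summarize slack per clock domain.

    paths: iterable of (clock_domain, slack_ns) tuples, one per timing path.
    target_period_ns: dict mapping clock_domain -> constrained period in ns.
    Returns a dict: domain -> worst slack, total negative slack, and an
    estimated achievable frequency in MHz.
    """
    worst = {}
    total_negative = defaultdict(float)

    for domain, slack in paths:
        # Track the single worst path in each domain.
        worst[domain] = min(worst.get(domain, slack), slack)
        # Accumulate only failing paths; this shows how widespread a miss is.
        if slack < 0:
            total_negative[domain] += slack

    report = {}
    for domain, worst_slack in worst.items():
        achievable_period = target_period_ns[domain] - worst_slack
        report[domain] = {
            "worst_slack_ns": worst_slack,
            "total_negative_slack_ns": total_negative[domain],
            "estimated_mhz": 1000.0 / achievable_period,
        }
    return report


if __name__ == "__main__":
    # Invented example: a fast system clock that misses timing slightly
    # and a slow I/O clock with margin to spare.
    example_paths = [("sys", 1.0), ("sys", -0.5), ("sys", -0.25), ("io", 2.0)]
    periods = {"sys": 10.0, "io": 20.0}
    for name, row in summarize_timing(example_paths, periods).items():
        print(name, row)
```

A suite that reported only the fastest or slowest estimated frequency from this example would say nothing about which domain missed its target or how many paths were involved, which is exactly the information a design team needs.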

With the new emphasis on hard IP, particularly on high-end FPGA families like Xilinx’s Virtex-4 and Altera’s Stratix II, perhaps the most meaningful metrics are domain- or application-specific. As Xilinx says, the best measurements for the current generation of FPGA devices have yet to be defined. With high-speed SERDES I/O, for example, there is not only a challenge in quantifying the performance of the device; the designer must be convinced that the FPGA can function at speed on a real board mounted in a real backplane. For this type of proof, we can look less to benchmark results and more to live hardware demonstrations such as Xilinx’s 11Gb/s demonstration boards that have been making the rounds at trade shows.

Constraining the tools

Compounding the complexity of measurement is the bad benchmarking behavior of good design tools. Many synthesis and place-and-route tools will work to meet only the timing constraints specified by the user, and then stop their optimization efforts. The reason is understandable – many of the timing optimizations available increase the size of the design by adding logic and can cause extra power dissipation or slow down the overall circuit to improve a single path. When running a benchmark for performance, it is important not to over-constrain the tools. Doing so can result in a larger design that actually runs slower than it would have if properly constrained.

Almost every vendor we interviewed uses Synplicity’s tools as their “gold standard” for synthesis. FPGA vendors typically cite the large installed base of Synplify and Synplify Pro in the market as well as the vendor-neutral basis for comparing FPGA architectures. Synplicity has added benchmarking-specific features to their tools to accommodate this. Their auto-constraints mode does what many FPGA vendors’ benchmark flows have done for years: iteratively run the synthesis process with increasingly tight timing constraints until the tool cannot meet timing. This incremental methodology allows the tool to approach its optimal result from below without over-constraining.
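The general shape of such an iterative flow can be sketched in a few lines. This is not Synplicity's implementation, which we have not seen; it is a minimal loop under our own assumptions. Here `run_flow` is a hypothetical stand-in for a real synthesis and place-and-route run that takes a target frequency and returns the frequency actually achieved, and `toy_flow` is an invented model of a constraint-driven tool used only to make the example runnable.

```python
def tighten_until_failure(run_flow, start_mhz, step_mhz, limit_mhz):
    """Raise the timing target step by step until the flow misses it.

    run_flow: callable taking a target in MHz and returning achieved MHz
              (hypothetical wrapper around a real tool run).
    Returns (target_mhz, achieved_mhz) for the last target that was met,
    or None if even the starting target failed.
    """
    best = None
    target = start_mhz
    while target <= limit_mhz:
        achieved = run_flow(target)
        if achieved < target:
            break  # the tool can no longer meet timing; keep the previous result
        best = (target, achieved)
        target += step_mhz
    return best


def toy_flow(target_mhz, ceiling_mhz=180.0):
    """Invented model of a constraint-driven tool, for illustration only.

    Below the ceiling the tool stops optimizing once the target is met.
    Above it, the extra logic added while chasing an impossible target
    makes the result worse than the ceiling itself.
    """
    if target_mhz <= ceiling_mhz:
        return target_mhz
    return ceiling_mhz - 0.2 * (target_mhz - ceiling_mhz)


if __name__ == "__main__":
    print(tighten_until_failure(toy_flow, 100.0, 10.0, 400.0))
    print(toy_flow(400.0))  # a single badly over-constrained run
```

In the toy model, a single run at an unrealistic target lands well below what the stepwise approach reaches, which mirrors the over-constraining penalty described above. The price of the stepwise approach is one full tool run per step, which is part of why benchmark suites take days.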

Making sense of it

Considering all the confusion in benchmark methodologies, the best answer is probably the one most often cited as the wrong one. It is true that taking your own one or two designs for a test-drive is a terrible way to evaluate one design tool against another or one FPGA architecture against another for purposes of making general conclusions about the relative merits of those products. It is also true, however, that nothing will better predict the results of your design than your design itself. If you have a set of designs that accurately reflect the types of IP, the size, the complexity, and the overall architecture of the project you’re contemplating, they may be far superior to any vendor-supplied benchmarks at answering your particular questions.