Hardware Emulation: The Clash of The Titans

Editor’s note: Lauro Rizzatti may be best known for his association with EVE, now part of Synopsys. Just to be clear, he is no longer with Synopsys, so this piece doesn’t reflect an official Synopsys position.

Hardware emulation has moved from the dusty back room of the engineering department, where it was used by a select few, to numerous desktops and, along the way, has become an important component of the verification blueprint. The numbers tell part of the story: the hardware emulation market hovers around $350 million, and Gary Smith EDA predicts that it will rise to $1 billion by 2017, only a few short years from now.

I’m able to tell the rest of the story. As EVE’s vice president of marketing and general manager of EVE-USA, I had a ringside seat to this metamorphosis over the years, trotting the globe to meet with engineering organizations and better understand their verification requirements. EVE, now part of Synopsys, competed with Cadence and Mentor Graphics for a slice of this market. All three vendors claim their emulation offering is the best on the market. Judging from a quick glance at the published specs, one might as well roll the dice to make a selection. After all, vendors create datasheets to highlight the best capabilities of their tools and either ignore or downplay the limitations.

A careful and critical investigation of the published data, substantiated by the word on the street and moderated by common sense, opens a window on the competitive landscape in emulation and allows an appreciation of what each tool offers and where the differences lie.

Cadence’s Palladium

The latest implementation, Palladium-XP2, was introduced in 2013. It appears to be an improvement of the hardware and software of the previous Palladium-XP version rather than a brand-new emulator based on a re-spin of the proven and successful custom processor chip technology.

Palladium-XP2 continues to excel in very fast compilation time, an inherent benefit of the custom-processor technology. While overall design capacity is more than adequate for most system-on-chip (SoC) designs, it requires the largest number of boxes of the three offerings to achieve a comparable capacity. Just consider that, for the maximum capacity of 2.3 billion ASIC-equivalent gates, it requires 32 boxes. See table 1.

Metric	Palladium-XP2	Veloce2	ZeBu-Server3
Capacity of a single box (ASIC-equivalent gates)	72M	256M	300M
Maximum capacity / boxes needed	2.3B gates / 32 boxes	2B gates / 8 boxes	3B gates / 10 boxes

Table 1: Palladium-XP2 requires 32 boxes for comparable capacity to its competitors.
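The box counts in Table 1 follow from simple division. The short Python sketch below reproduces that arithmetic, assuming capacity scales linearly with the number of boxes (an assumption that matches the table's maximum figures but is not a vendor statement). The function name and the 2-billion-gate target are illustrative, not taken from any datasheet.

```python
import math

# Single-box capacities in millions of ASIC-equivalent gates, from Table 1.
BOX_CAPACITY_MGATES = {
    "Palladium-XP2": 72,
    "Veloce2": 256,
    "ZeBu-Server3": 300,
}


def boxes_needed(target_mgates: float, box_mgates: float) -> int:
    """Smallest number of boxes whose combined capacity covers the target.

    Assumes capacity adds up linearly across boxes (illustrative assumption).
    """
    return math.ceil(target_mgates / box_mgates)


if __name__ == "__main__":
    target = 2000  # hypothetical design size: 2 billion gates
    for name, capacity in BOX_CAPACITY_MGATES.items():
        print(f"{name}: {boxes_needed(target, capacity)} boxes")
```

For the hypothetical 2-billion-gate target, the sketch gives 28 boxes for Palladium-XP2 against 8 for Veloce2 and 7 for ZeBu-Server3.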

More boxes translate into larger dimensions, heavier weight, and more power consumption. It is worth mentioning that Palladium-XP2 is the only liquid-cooled system. One negative of the processor-based technology is that it consumes significantly more energy than an FPGA-based emulator of equivalent capacity. It shouldn’t come as a surprise that Palladium-XP2 lags behind its competitors in system reliability.

Capacity scalability of about 4.5 million ASIC-equivalent gates permits ample flexibility in deploying the emulator resources in a multi-user environment. Strictly speaking, this is correct, but it may not be that advantageous anymore. Design teams select the emulator configuration to accommodate their largest designs, not the largest number of users. With design sizes growing inexorably, a granularity of 4.5 million gates may be a bit small: good for IP blocks, perhaps, but not for full or even partial SoC designs. At a capacity of 4.5 million gates, a modern register transfer level (RTL) simulator may be adequate for the job at a fraction of the cost of the emulator.

Historically, emulators from Cadence (and from Quickturn, its precursor) were essentially used in in-circuit emulation (ICE) mode, where the design-under-test (DUT) mapped inside the box was connected to the target system where the taped-out chip would ultimately reside. This approach required a speed adapter to match the fast clock rate of the target system (hundreds of megahertz or even gigahertz) to the slow clock rate of the emulator (one or a few megahertz). The long history of Palladium fostered the creation of the largest library of speed adapters. This is definitely a plus for Cadence.

The alternative to ICE is to replace the physical testbed with a software test environment. This could come either in the form of a synthesizable testbench or embedded software, or in the form of a behavioral testbench driving the DUT via transactors. Different vendors call it transaction-based acceleration (TBX) mode or transaction-based verification (TBV) mode.

Regardless of the name, this verification mode is the emerging trend in the industry. It does not require manned supervision to plug/unplug speed adapters when the user switches from one design to the next. As such, it is the mandatory choice for remote access to allow for the creation of large design processing centers accessible 24/7 from anywhere in the world.

Palladium-XP2 supports transaction-based verification, but it is rumored that its throughput is significantly lower than that of its competitors. Why would that be?

High clock speed, as important as it is, is neither the only nor the primary parameter critical to achieving high throughput, and this is especially true in transaction mode. In transaction mode, the communication channel between the emulator and the host workstation processing the testbench must have high bandwidth and low latency.

Palladium-XP2 connects to the host server via separate channels for design download and waveform upload (one 16-lane PCIe) and for testbench communication (four 16-lane PCIe) per chassis. For an equivalent design capacity, Palladium-XP2 has more channels for testbench communication than Veloce2 and the same number of channels for design/waveform services, and more of both than ZeBu-Server. See Table 2.

Metric	Palladium-XP2	Veloce2	ZeBu-Server3
Design capacity	72M gates	256M gates	300M gates
Communication channels for testbench	Four 16-lane PCIe	16 4-lane PCIe	Five 4-lane PCIe
Communication channels for design/waveform services	One 16-lane PCIe	Four 4-lane PCIe	Shared with testbench

Table 2: For an equivalent design capacity, Palladium-XP2 has more channels for testbench communication than Veloce2, the same number of channels for design/waveform services, and more of both than ZeBu-Server.
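To put the Table 2 figures on a common footing, the following Python sketch totals the PCIe lanes per box and normalizes them per 100 million gates of capacity. Lane count is only a rough proxy for bandwidth: the table does not state PCIe generations, so the sketch assumes all lanes are equivalent, and it says nothing about latency. The dictionary layout and function names are illustrative.

```python
# Per-box figures from Table 2: capacity in millions of gates, and
# (number of channels, lanes per channel) for each kind of traffic.
LINKS = {
    "Palladium-XP2": {"mgates": 72, "testbench": (4, 16), "services": (1, 16)},
    "Veloce2": {"mgates": 256, "testbench": (16, 4), "services": (4, 4)},
    # ZeBu-Server3 shares the testbench channels for design/waveform services.
    "ZeBu-Server3": {"mgates": 300, "testbench": (5, 4), "services": None},
}


def total_lanes(link):
    """Total PCIe lanes for one kind of traffic; 0 if no dedicated channels."""
    if link is None:
        return 0
    channels, lanes_per_channel = link
    return channels * lanes_per_channel


def lanes_per_100m_gates(name: str, kind: str) -> float:
    """Lanes of the given kind, normalized per 100M gates of box capacity."""
    entry = LINKS[name]
    return total_lanes(entry[kind]) * 100 / entry["mgates"]


if __name__ == "__main__":
    for name in LINKS:
        lanes = total_lanes(LINKS[name]["testbench"])
        density = lanes_per_100m_gates(name, "testbench")
        print(f"{name}: {lanes} testbench lanes, {density:.1f} per 100M gates")
```

Per box, Palladium-XP2 and Veloce2 both total 64 testbench lanes; the difference appears only once the figures are normalized by capacity.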

From the bandwidth point of view, the advantage seems to favor Palladium-XP2, but rumors abound to the contrary. Does this have to do with latency? It’s hard to say.

No emulation vendor publishes latency specs. Consider an application field rather different from EDA, namely the fast stock trading industry, recently the object of virulent attacks in print, public scorn, and a successful book. There, FPGAs are leading the charge to replace conventional computing machines. Low latency is the key ingredient.

With FPGAs, the response to a stimulus is generated in a fraction of the time required by even the fastest computer on earth, by virtue of the massive parallelism intrinsic to the FPGA. The emulation engines in Palladium-XP2 and its ancestors use Boolean computing cores in very large quantities, similar to a computer with a vast number of CPU cores. They are still no match for the low latency of an FPGA. Whether or not this is the real cause of the lower throughput of Palladium-XP2 compared to FPGA-based emulators, I’ll let Cadence answer.

Hardware emulators are becoming mandatory to clear a large design of all residual bugs that RTL simulation and formal analysis fail to uncover, before silicon is available. Needless to say, design debugging must be efficient, which means easy to use, effective, and fast.

Palladium-XP2 features “FullVision,” defined as “at-speed full visibility of any nets for typically two-million samples during runtime,” and “InfiniTrace,” defined as “enables unlimited trace-capture depth and allows users to revert back to any checkpoint and restart emulation from that point.” Further, its “Dynamic Probes” allow for “fast waveform upload of up to 80-million samples of selected signals before run.” All of this sounds impressive, but these definitions do not clearly state that extending the timing window from two million cycles to 80 million cycles trades full vision for a partial vision of 50,000 signals that must be selected at compile time.

Cadence enjoys the longest list of customers in all market segments of the semiconductor industry, accumulated over two decades. However, for embedded software validation, it may not be the best choice. In ICE mode, software debugging is carried out via a physical JTAG connection, a rather slow and unfriendly method. The alternative for software debugging is to use a virtual JTAG, based on transactors. However, as discussed above, Palladium-XP2 in transaction mode does not have the same throughput as its two competitors.

One final comment regarding technology node refreshment. Looking at historical data, the interval between Palladium versions based on a new, smaller technology node has been well over four years. Palladium-XP was launched in April 2010 and, four years later, the new Palladium-XP2 is still based on the same technology node.

Mentor Graphics’ Veloce

In 2012, Mentor Graphics launched a new emulator called Veloce2, based on an evolution of the custom emulator-on-chip design first introduced with Veloce. The new version has been substantially improved in the hardware and has benefitted from several software enhancements. Recently, Mentor announced Veloce OS3, which makes the emulator a data-center-hosted global resource.

Veloce2 doubles the capacity of Veloce (launched in April 2007) to two billion ASIC-equivalent gates with two interconnected Maximus cabinets, each accommodating four Quattro chassis of 256 million gates each. This level of capacity meets all of today’s requirements in a reasonably sized configuration.

A scalability of 16 million gates leads to up to 16 concurrent users per Quattro, which seems to be the right balance for today’s design sizes and is still efficient in a multi-user environment.

Forced-air cooling makes Veloce2 easier and less expensive to install than Palladium-XP2. It also increases its reliability.

Veloce2 supports the ICE mode with an adequate library of speed adapters, but Mentor is actively promoting the acceleration mode, called TBX for Transaction-Based-Acceleration, highlighting the benefits that enable the creation of design processing centers.

In fact, Mentor pushed the envelope even further, by introducing the VirtuaLab concept. The idea is to create virtual peripherals and virtual host systems that surround the DUT mapped inside the emulator and connect them to the DUT via transactors. Mentor calls the connection “co-model channel technology.” Virtual devices offer the same functionality as traditional ICE solutions, but without the need for cables and additional hardware units.

The throughput in TBX has been enhanced by increasing the number of communication channels between the emulator and the server for the testbench (16 4-lane PCIe) and for design/waveform services (four 4-lane PCIe) per Quattro. See Table 2.

Veloce2 supports 100% visibility without compilation. On-board memories coupled to the emulator-on-chip devices store up to 500k samples of “compressed” data, including registers and memory contents. The data is uploaded to the host workstation via four dedicated channels, and a reconstruction mechanism running on the host computer rebuilds the waveforms of all combinational design nodes.

While the above debugging capability is similar to that of Palladium-XP2, Mentor devised a new and faster debugging approach based on FSDB transactors that creates on-the-fly waveforms of a few selected signals without requiring compilation. A debug process called “back-replay debug” consists of rewinding and re-running a test with added debug visibility, such as assertions, monitors, trackers, $display, and waveform capture. This debugging process removes the testbench and reduces the amount of data sent to the host, providing a boost in time-to-visibility in a fully deterministic environment.

Another area of improvement has targeted embedded software validation at the system level. Called WarpCore, the approach replaces the RTL processing cores in an SoC design with QEMU-based cores and moves them into the host connected to Veloce2 via transactors. The emulator continues to execute the remaining synthesizable portion of the SoC, pushing performance up to a limit of 100 MIPS, from the one to three MIPS achieved when the entire SoC is mapped inside Veloce2. Apparently, some Veloce2 users are booting an operating system such as Android and then running applications such as AnTuTu for performance characterization prior to silicon.

Today, Mentor Graphics cannot match the sheer number of customers claimed by Cadence but, in the past two to three years, it has grown its customer base and taken a bite out of Cadence’s. It is rumored that Mentor owns the networking and storage market segments, and is one of two choices for one of the largest processor companies in the world.

Synopsys’ ZeBu

Earlier in 2014, Synopsys launched ZeBu-Server3. The name ZeBu is a combination of “zero” and “bugs.” Yes, EVE’s goal with ZeBu was to ensure the design being debugged had zero bugs. ZeBu-Server3 is based on the largest commercial FPGA from Xilinx, the Virtex7-LX2000T. This gives Synopsys the advantage of offering the emulator with the largest capacity, three billion ASIC-equivalent gates, in the smallest footprint in terms of physical dimensions, weight, and power consumption, all of which contribute to its high reliability.

Capacity scalability, or granularity (the minimum increment by which you can increase capacity), of 60 million gates puts ZeBu-Server3 at a disadvantage compared to Palladium-XP2 at 4.5 million and Veloce2 at 16 million.
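The effect of granularity can be made concrete with a little arithmetic. The Python sketch below assumes that each user occupies a whole number of granules, so the maximum number of concurrent users per box is the box capacity divided by the granularity (the same reasoning that yields 16 users per Quattro above). The 70-million-gate design is a hypothetical example, and the function names are illustrative.

```python
import math

# (single-box capacity, granularity), both in millions of ASIC-equivalent
# gates, as quoted in the article.
EMULATORS = {
    "Palladium-XP2": (72, 4.5),
    "Veloce2": (256, 16),
    "ZeBu-Server3": (300, 60),
}


def max_partitions(box_mgates: float, granularity_mgates: float) -> int:
    """Most concurrent users per box if each user gets at least one granule."""
    return int(box_mgates // granularity_mgates)


def allocated_mgates(design_mgates: float, granularity_mgates: float) -> float:
    """Capacity actually reserved for a design, rounded up to whole granules."""
    return math.ceil(design_mgates / granularity_mgates) * granularity_mgates


if __name__ == "__main__":
    design = 70  # hypothetical design size in millions of gates
    for name, (box, granule) in EMULATORS.items():
        users = max_partitions(box, granule)
        reserved = allocated_mgates(design, granule)
        print(f"{name}: up to {users} users per box, "
              f"{reserved:g}M gates reserved for a {design}M-gate design")
```

Under these assumptions, a coarse 60-million-gate granule reserves 120 million gates for the 70-million-gate design, while finer granules waste far less.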

Design compilation speed is also at a disadvantage vis-à-vis its competitors. The main hurdle is the place and route (P&R) of the FPGAs. Synopsys does not publish data, but it is common knowledge that the P&R of a Virtex7-LX2000T may take several hours, even when resource utilization is limited to 50% or so.

It seems Synopsys today continues EVE’s approach of not actively promoting ICE, and instead puts all its eggs in the TBV (transaction-based verification) basket. Up to five 4-lane PCIe channels connect the testbench to the DUT, providing a reasonable throughput in transaction mode. The same channels are shared between testbench communication, design download, and waveform upload.

Design debugging is supported by 100% visibility via dynamic probing, a feature that takes advantage of the built-in scan chain of the Xilinx Virtex FPGAs. While dynamic probing does not require compilation, it comes with a drawback: retrieving data takes a long time, at a speed of a few tens of hertz. The sequential data activity is not stored in on-board memories; rather, it is sent directly to the host server, where the combinational activity is recreated via a proprietary mechanism. EVE/Synopsys points out that the overall performance, including data retrieval via dynamic probing, data transfer to the server, and reconstruction of combinational data, is in the same ballpark as that of its competitors. Rumors on the street do not substantiate this.

ZeBu-Server also supports static probing and flexible probing, both of which require compilation but do not suffer the dramatic speed drop of dynamic probing. Static probing is used in ICE mode, but is rather limited in space depth (number of signals) and in time depth (number of cycles) by the on-board trace memory capacity. As stated earlier, the ICE mode is not actively promoted by EVE/Synopsys. Hence, this limitation may not be noteworthy. Flexible probing vastly extends the space limits and is used to home in quickly on a time frame of a few tens of thousands of cycles, at which point dynamic probing can be engaged, opening the entire design space to bug searching.
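A back-of-the-envelope calculation shows why the time window needs to be narrowed before dynamic probing is engaged. The Python sketch below assumes that "a few tens of hertz" means design cycles retrieved per second; the 30,000-cycle window and the 20, 30, and 50 Hz rates are illustrative picks within the ranges quoted above, not published figures.

```python
def readout_seconds(cycles: int, readout_hz: float) -> float:
    """Wall-clock time to retrieve `cycles` design cycles at `readout_hz`
    cycles per second (assumed reading of the quoted dynamic-probing speed)."""
    if readout_hz <= 0:
        raise ValueError("readout_hz must be positive")
    return cycles / readout_hz


if __name__ == "__main__":
    window = 30_000  # hypothetical window: a few tens of thousands of cycles
    for hz in (20.0, 30.0, 50.0):  # hypothetical rates: a few tens of hertz
        secs = readout_seconds(window, hz)
        print(f"{window} cycles at {hz:.0f} Hz: "
              f"{secs:.0f} s ({secs / 60:.1f} min)")
```

Even a window this small takes on the order of ten to thirty minutes to retrieve at these assumed rates, which is why a much longer window would be impractical.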

ZeBu-Server leads the pack with the highest clock speed, bordering on the performance of FPGA prototyping for designs in excess of 100 million gates. This makes the tool a good choice for embedded software validation when the software debugger is connected via transactors.

By all indications, ZeBu-Server is trailing its competitors in number of customers. Monitoring the quarterly earnings calls reveals that Mentor and Cadence boast success after success in the emulation space. Not so for Synopsys. It may be company policy, or it may reflect difficulty in finding wins to report. Supposedly, Synopsys shares with Mentor the substantial emulation investment of one of the largest processor companies in the world. Read John Cooley’s Deepchip for more information.

Wrap up and conclusion

The 51st Design Automation Conference (DAC) kicks off Monday, June 2, and runs through Wednesday, June 4, at the Moscone Center in San Francisco. Palladium, Veloce and ZeBu will be on display and demonstrating their full capabilities.

About Lauro Rizzatti

Lauro Rizzatti is a verification consultant. He was formerly general manager of EVE-USA and its vice president of marketing before Synopsys’ acquisition of EVE. Previously, he held positions in management, product marketing, technical marketing, and engineering. He can be reached at lauro@rizzatti.com.