High-End FPGA Showdown – Part 1

Intel announced this week that they have begun shipping the first of their new Agilex FPGAs to early-access customers. This moves us into what we historically think of as the “head-to-head” phase of the competition between the two biggest FPGA suppliers. Xilinx shipped their first “Versal ACAP” FPGAs back in June, so, after a very long and contentious “who is going to ship first?” battle, it turns out the two rival companies began shipping their comparable FPGA lines within about two months of each other. This means that, unlike other recent races to be first on a process node, neither company had any significant time to scoop up design wins with a new, superior technology uncontested by a rival.

This time around, though, the competitive field is one larger, with interloper Achronix claiming they are set to ship the first samples of their new Speedster 7t FPGAs by the end of this year. For development teams, this means that, by the end of the year, there will be three well-differentiated high-end FPGA offerings to choose from – all on comparable process technology, and all with intriguing unique features and capabilities.

This will be the first of a multi-part series comparing the new high-end FPGA families from these three vendors. We’ll look at the underlying process technology, the FPGA logic (LUT) fabric itself, the hardened resources for accelerating processing and networking, memory architectures, chip/package/customization architecture, IO resources, design tool strategy, unique and novel features and capabilities of each offering, and marketing strategy. Buckle up, it’ll be an exciting ride. Uh, if you’re the type who gets a thrill out of enormous numbers of FLOPS, crazy bandwidths, and some of the most interesting and capable semiconductor devices ever designed.

A note – both Intel and Achronix weighed in and provided info for this article. Xilinx did not respond to our request for information.

This time, the prize for high-end FPGA supremacy has morphed somewhat. In the past, the largest market for high-end FPGAs was in networking, and the market share line shifted ever-so-gradually – primarily based on who could capture the richest set of design wins with a new-generation family from the customers deploying the latest round of wired and wireless networking hardware. The timing of the 5G rollout has changed that dynamic, however. 5G began ramping to scale prior to the arrival of the current wave of FPGA technology, so the backbone of the first round of 5G is built on previous-generation programmable logic. These devices will flow into an already robust 5G ecosystem, so we don’t have alignment between the clean-slate revolution of 5G and the dawn of a new generation of FPGAs. These FPGAs were designed with the mechanics of 5G already pretty well understood. Do not underestimate the importance of FPGAs to 5G, or of 5G to the FPGA market, however. When you use your cell phone today, there’s about a 99% chance that your call is going through some FPGAs. With 5G, the impact of FPGAs will be even greater.

That fact plays an interesting game with the rapid expansion of the emerging market for data center acceleration – primarily for AI workloads. Estimates are that the market for AI acceleration will skyrocket over the next few years, and these devices – with their impressive price tags and non-trivial power budgets – are set to compete mostly for the data center portion of that market, although all three vendors claim to be offering solutions that help all the way to the edge/endpoint. Each of these vendors makes it very clear that capturing those AI acceleration sockets is a priority, and they’ve all architected their new chips around that idea. The combination of these factors has set the stage for these three companies to be competing ferociously on both the 5G and the AI acceleration fronts – meaning that these devices need to have robust AI acceleration features, stellar networking performance, a robust set of development tools for deploying these ultra-complex chips, and a cunning marketing strategy.

Let’s look at all those factors, shall we?

Starting with the underlying process technology, Xilinx and Achronix FPGA families are fabricated on TSMC 7nm, and Intel Agilex is fabricated on the similar-capability Intel 10nm process. Don’t be confused by the 7/10 nomenclature difference. We long ago reached the point where semiconductor marketing groups name the nodes based on what sounds good to the market, rather than deriving them from any discernible feature of the transistors themselves. By our estimation, TSMC’s 7nm and Intel’s 10nm are roughly equivalent processes, and vendors using both processes basically agree. This means that Intel’s long-held lead in process technology seems to have withered to a vapor, but, as we approach the dusk of Moore’s Law, it is inevitable that the competitive field on silicon process will level itself.

All three vendors get a modest boost from jumping to the latest semiconductor process node. This jump is not likely to be up to historical Moore’s Law standards, however, as the incremental benefits from each new process update have been steadily declining over the past several nodes. Everyone got a one-time temporary boost when FinFET technology came along, and now we will probably see a continuation of the trend of diminishing marginal return as we move forward toward the coming economic end of Moore’s Law.

In the old days, each new node brought higher density, better performance, and lower power consumption, all in copious quantities, due to the reduction in transistor size. Now, vendors have to trade off between the three and are often left with smaller returns, even on the metrics they are favoring. At the same time, the non-recurring cost of moving to a new process node has continued an exponential climb. This means that the stakes for FPGA companies have risen dramatically, as they are required to invest steadily more for ever-decreasing benefits in order to remain competitive. It also means that we are entering an era where the architecture and features of the FPGAs themselves, the tools used to make them work, and the marketing strategies of the three companies will be the key factors, rather than the timing of who gets to a new process node first.

Considering the process technology to be essentially a wash, let’s look at the capabilities and features of each vendor’s offering, starting with the most basic FPGA feature – the LUT fabric. We’ve often lamented the fact that every company counts LUTs differently, and that game has gotten even more complex with each generation. Xilinx and Achronix currently use something like a 6-input LUT, and Intel’s ALM is essentially an 8-input LUT. The vendors more-or-less agree that we could convert these numbers into equivalent numbers of 4-input LUTs using factors of 2.2 LUT4s per LUT6, and 2.99 LUT4s per LUT8.

Using that math, the Achronix Speedster 7t family leads the field with 363K to 2.6M LUT6s (translated into 800K-5.76M LUT4-equivalents), Intel Agilex weighs in with 132K-912K ALMs (translated into 395K-2.7M LUT4-equivalents), and Xilinx’s Versal family packs about 246K-984K CLBs (translated into 541K-2.2M LUT4-equivalents). Each vendor makes claims that their architecture is superior, highlighting design features that may improve logic density, performance, or routability in certain specific applications or configurations. It isn’t clear to us at this point that any vendor’s LUT is significantly superior to any other’s.
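For readers who want to rerun that conversion, here is a minimal sketch of the arithmetic. The counts and factors are the ones quoted above; the dictionary layout and function names are ours, made up for illustration. Note that the rounded 2.6M figure for Speedster yields 5.72M rather than the quoted 5.76M, which suggests (our inference only) that the unrounded count is a little higher.

```python
# Conversion factors quoted in the article (LUT4-equivalents per native cell).
LUT4_PER_LUT6 = 2.2    # applied to the Xilinx and Achronix counts
LUT4_PER_ALM = 2.99    # Intel ALM, treated as an 8-input LUT

# (smallest device, largest device, conversion factor) as quoted above.
FAMILIES = {
    "Achronix Speedster 7t": (363_000, 2_600_000, LUT4_PER_LUT6),
    "Intel Agilex": (132_000, 912_000, LUT4_PER_ALM),
    "Xilinx Versal": (246_000, 984_000, LUT4_PER_LUT6),
}


def lut4_equivalents(native_count, factor):
    """Scale a native LUT/ALM count into LUT4-equivalents."""
    return native_count * factor


def lut4_range(family):
    """Return (low, high) LUT4-equivalents for a family in FAMILIES."""
    low, high, factor = FAMILIES[family]
    return lut4_equivalents(low, factor), lut4_equivalents(high, factor)


for name in FAMILIES:
    low, high = lut4_range(name)
    print(f"{name}: {low / 1e3:.0f}K to {high / 1e6:.2f}M LUT4-equivalents")
```

The point of keeping the factor next to each family is that the comparison is only as good as those two multipliers, and they are rough vendor consensus rather than a measured quantity.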

The amount you can do with an FPGA is only partially determined by the LUT count, however. One also must factor in the challenge of actually using a meaningful percentage of those LUTs (which we will discuss later in looking at the design tool landscape), and the amount of capability packed into hardened logic blocks that allows design capability to be implemented with minimal involvement from the LUT fabric. Depending on your design, you may find that you can cram a lot more into one FPGA or another – independent of the LUT count.

The primary reason FPGAs are adept at AI inferencing is the massive numbers of arithmetic operations (primarily multiply-accumulate at various precisions) that can be accomplished in parallel, thanks to the huge arrays of “DSP” blocks woven into the programmable logic fabric. These allow FPGAs to perform matrix operations such as convolution much more efficiently than conventional von Neumann processors.
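To make the multiply-accumulate point concrete, here is a small sketch of a one-dimensional sliding dot product (what machine learning frameworks usually call convolution) that counts its MAC operations. The function name and sample data are ours. Every output element depends only on the inputs, never on another output, which is why an array of hardware multipliers can work on many of them at once.

```python
def conv1d_valid(signal, kernel):
    """Return (outputs, mac_count) for a 'valid' 1-D sliding dot product."""
    n_out = len(signal) - len(kernel) + 1
    outputs = []
    macs = 0
    for i in range(n_out):
        acc = 0
        for j, weight in enumerate(kernel):
            acc += signal[i + j] * weight  # one multiply-accumulate
            macs += 1
        outputs.append(acc)
    return outputs, macs


outputs, macs = conv1d_valid([1, 2, 3, 4, 5], [1, 0, -1])
print(outputs, macs)  # [-2, -2, -2] 9
```

The MAC count is simply (number of outputs) times (kernel length), so it grows quickly with layer size, and a sequential processor has to grind through them a few at a time.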

Looking at the hardware multipliers critical to AI inferencing, Achronix’s variable-precision multipliers yield 41K int8 units, or 82K int4 units. Intel Agilex has 2K-17K 18×19 multipliers, and Xilinx Versal brings about 500-3K “DSP Engines,” which are presumably “DSP58 slices,” that include 27×24 multipliers and new hardware floating-point capability. This comparison is decidedly “apples to oranges to mangoes,” and it must be somewhat “caveat designor” as to which fruit is preferable for your application.

All three vendors now offer hardened support for floating point. Achronix has a completely new architecture for their DSP blocks, which they call “Machine Learning Processors” (MLPs). Each MLP contains up to 32 multiplier/accumulators (MACs), 4- to 24-bit integer modes, and various floating-point modes, including native support for TensorFlow’s bfloat16 format as well as block floating-point format. Most importantly, the Achronix MLP couples embedded memory blocks tightly with the arithmetic units, allowing the MAC operations to run at full 750 MHz without getting hung up waiting for memory through the FPGA fabric.
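Since bfloat16 and block floating point are easy to confuse, here is a rough sketch of both. The bfloat16 layout (1 sign bit, 8 exponent bits, 7 mantissa bits, i.e. the top half of an IEEE float32) is the standard definition. The block floating-point encoder is a generic textbook form with one shared exponent per block; it is our illustration and makes no claim about the exact format inside any vendor's DSP block. All function names are hypothetical.

```python
import math
import struct


def to_bfloat16_bits(x):
    """Truncate a float32 to bfloat16 by keeping its upper 16 bits.

    Real hardware usually rounds to nearest; truncation keeps the sketch short.
    """
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16


def from_bfloat16_bits(bits16):
    """Expand bfloat16 bits back to a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def block_float_encode(values, mantissa_bits=8):
    """Quantize a block to signed integer mantissas sharing one exponent."""
    max_mag = max(abs(v) for v in values)
    if max_mag == 0:
        return 0, [0] * len(values)
    # Pick the exponent so the largest magnitude fits the signed mantissa range.
    shared_exp = math.frexp(max_mag)[1] - (mantissa_bits - 1)
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = [
        max(-limit, min(limit, round(v / 2.0 ** shared_exp))) for v in values
    ]
    return shared_exp, mantissas


def block_float_decode(shared_exp, mantissas):
    """Rebuild approximate values from the shared exponent and mantissas."""
    return [m * 2.0 ** shared_exp for m in mantissas]


print(hex(to_bfloat16_bits(1.0)))                      # 0x3f80
print(from_bfloat16_bits(to_bfloat16_bits(3.14159)))   # 3.140625
print(block_float_encode([1.0, 0.5, -0.25, 0.125]))    # (-6, [64, 32, -16, 8])
```

The practical difference: bfloat16 carries an exponent with every number, while block floating point stores one exponent for the whole block and then multiplies plain integers, which is why a device can field far more block floating-point multipliers than bfloat16 ones.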

Intel also uses variable-precision DSP blocks with hardware floating-point (essentially like they’ve offered for years in their FPGAs). Intel’s floating-point support is perhaps the broadest and most mature of the three. With Agilex, they have introduced two new floating-point modes, half-precision floating point (FP16) and bfloat16 (the “brain floating-point” format, not to be confused with block floating point), and they have made architectural adjustments to make their DSP operations even more efficient.

Xilinx has upgraded their previous DSP48 slices to DSP58 – presumably because they now include hardware floating-point, and their multipliers are upgraded to 27×24. So, with this generation, the other two vendors have joined Intel in offering hardware floating-point support. This is a reversal for Xilinx, who previously claimed that floating point support in FPGAs was a bad idea because floating point is primarily used for training, and FPGAs will primarily target inference.

In terms of what floating-point formats are available, FP32 is supported by Versal (with up to 2.1K multipliers) and Agilex (with up to 8.7K multipliers). Half-precision (FP16) is supported by all three families – Versal with up to 2.1K multipliers, Agilex with up to 17.1K multipliers, and Speedster with up to 5.1K multipliers. Bfloat16 is supported by Agilex (with up to 17.1K multipliers) and Speedster (with up to 5.1K multipliers). For FP24, Versal and Agilex would presumably use the FP32 units, and Speedster has up to 2.6K multipliers. Achronix Speedster also supports up to 81.9K multipliers for block floating point.

Xilinx also brings a new software-programmable vector processor – an array of up to 400 1GHz+ VLIW-SIMD vector processing cores with hardened compute and tightly coupled memory. This offers a much simpler programming model for parallelizing complex vector operations and taking advantage of the FPGA’s copious compute resources. Overall, this checks the “GPU/inference engine” box on Xilinx’s apparent “kitchen sink” competitive strategy. We’ll talk more about that in a bit.

Intel’s answer to the Achronix MLP and Xilinx vector processor is an evolution of old-school. They point out that the Agilex DSP blocks achieve the same functionality as the other vendors’ new DSP features, using established and well-understood FPGA design development flows, and without requiring customers to partition their design among the various architectural blocks of the device. If your team has FPGA/RTL design expertise, this is a good thing. If you’re in an application where software engineers are doing the DSP, Xilinx’s software-programmable approach might have advantages.

Besides simply counting the multipliers, we can also compare the capabilities by looking at the vendors’ claims regarding total theoretical performance. One caveat here, though. These claims are grossly exaggerated and deliberately hard to define precisely. Vendors typically arrive at a figure by multiplying the number of multipliers on the chip by the maximum operating frequency of those multipliers to come up with an “up to XX TOPS or TFLOPS” figure. Obviously, no real-world design will use 100% of the available multipliers, none will be able to achieve the maximum theoretical clock rate of those multipliers, none will be able to keep those multipliers supplied with input data at the appropriate rate, and the precision of those operations varies from vendor to vendor. In other words, it’s a terrible metric, but it’s the best we’ve got for comparison.
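Here is that back-of-the-envelope calculation as a sketch. The multiplier count and clock rate are the Achronix figures quoted earlier (41K int8 units, 750 MHz). The factor of two operations per MAC (one multiply plus one add) is our assumption about the usual counting convention, not something any vendor confirmed to us, and the function name is ours.

```python
def peak_tops(num_multipliers, fmax_hz, ops_per_mac=2):
    """Theoretical peak in tera-operations per second.

    ops_per_mac=2 assumes a multiply and an add are counted separately.
    """
    return num_multipliers * fmax_hz * ops_per_mac / 1e12


speedster_int8 = peak_tops(41_000, 750e6)
print(f"{speedster_int8:.1f} TOPS")  # 61.5 TOPS
```

With that convention the result lands close to the roughly 61 int8 TOPS attributed to the Speedster MLPs in the comparison that follows, which at least suggests the convention is in the right neighborhood. Nothing in the formula accounts for utilization, achievable clock rate, or memory bandwidth, which is exactly the problem with the metric.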

If we had to estimate, we’d say that FPGAs could realistically achieve something like 50-90% of their theoretical maximum in real-world designs. This is considerably better than GPUs, for example, which are thought to be capable of only 10-20% of their theoretical maximum in the real world.

Extrapolating TOPS numbers for int8 operations, Xilinx Versal tops the list with about 171 TOPS if we include 133 from their vector processor, 12 from their DSP blocks, and 26 from their logic fabric. Agilex follows with 92 int8 TOPS, including 51 from DSP blocks and 41 from logic fabric, and Speedster weighs in with about 86 TOPS, including 61 from their MLP and 25 from their logic fabric. Looking at bfloat16 TFLOPS, Agilex leads with 40, Versal follows with 9, and Speedster with 8. Speedster gets a big advantage for block floating point, however, with 123 TFLOPS, followed by Agilex with 41 and Versal with 15.
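The int8 totals are just sums of the per-resource figures, so they are easy to check and rank. The sketch below does that and then applies our own 50-90% real-world estimate to each peak. Applying one blanket range to DSP, vector, and fabric contributions alike is a simplifying assumption on our part, and the names in the code are ours.

```python
# Per-resource int8 TOPS as quoted in the article.
INT8_TOPS = {
    "Xilinx Versal": {"vector processor": 133, "DSP blocks": 12, "logic fabric": 26},
    "Intel Agilex": {"DSP blocks": 51, "logic fabric": 41},
    "Achronix Speedster 7t": {"MLP": 61, "logic fabric": 25},
}


def total_tops(components):
    """Sum the per-resource contributions for one family."""
    return sum(components.values())


def realistic_range(peak, low=0.5, high=0.9):
    """Scale a theoretical peak by an assumed utilization range."""
    return peak * low, peak * high


ranking = sorted(
    INT8_TOPS, key=lambda name: total_tops(INT8_TOPS[name]), reverse=True
)
for name in ranking:
    peak = total_tops(INT8_TOPS[name])
    low, high = realistic_range(peak)
    print(f"{name}: {peak} peak, roughly {low:.1f} to {high:.1f} at 50-90%")
```

Sorted by total, the order comes out Versal (171), Agilex (92), Speedster (86). Note how much of each total comes from logic fabric rather than hardened multipliers; fabric-based arithmetic competes with the rest of your design for LUTs, so that portion is the least likely to be fully available.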

These numbers are all derived from the companies’ own data sheets and, as we mentioned, are theoretical maximums that are not likely achievable in real-world practical applications. Achronix’s “usable” claim carries some merit, as their MLP is a unique design that aims to keep the variable-precision multiplication operations within the block itself and operating at the maximum clock rate without requiring round trips into the logic fabric to accomplish the most common operations for AI inference. Similarly, Xilinx’s vector processor architecture should do a good job keeping the data flowing smoothly through the arithmetic pipe. That said, we have yet to see a benchmark or reference design that shakes out the companies’ claims in any meaningful way.

Of course, using all those LUTs and multipliers requires getting your design to actually place and route and meet timing in your chosen chip. As FPGAs have grown, this has become an increasingly difficult challenge. Single-bit nets and logic paths fanning out over enormous chips with limited routing resources make traditional timing closure a nightmare. One by one, conventional techniques for achieving timing closure on synchronous designs hit the wall and failed to scale. Both Xilinx and Achronix tackled this issue in this new generation of their FPGAs by adding a network on chip (NoC) that overlays the traditional logic and routing fabric. The NoC essentially changes the rules of the game, as the entire chip no longer needs to achieve timing closure in one giant magical confluence. Now, smaller synchronous chunks can pass data via the NoC, relieving the burden on the traditional routing fabric and decomposing what was an enormous design automation tool problem into smaller, manageable chunks.

Intel had already taken a different approach several generations ago – paving over their entire logic fabric with a massive array of micro-registers called “HyperFlex registers”. These registers allow longer, more complex logic paths to be retimed and pipelined – with the effect that the overall design becomes essentially asynchronous. Interestingly, this is also the net effect of the NoCs employed by Xilinx and Achronix. There are challenges with each approach, as both add a significant amount of complexity to the chip design and to the design tools we use with them. In Intel’s case, it is also reported that the HyperFlex registers have some negative impact on the overall speed that can be achieved with the logic fabric. Intel says that the HyperFlex architecture in Agilex FPGAs is 2nd generation and has improvements/enhancements over the prior generation HyperFlex architecture to improve performance and ease timing closure. We’ll have to wait and see what users report after Agilex has some miles on it.
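To see why extra registers along a path help, consider this idealized sketch. The achievable clock period is bounded by the longest register-to-register delay, so dropping a register into the middle of a long path shortens the worst segment. The delay values are invented for illustration, and the model ignores register setup and clock-to-output times, routing, and clock skew. It is not a description of how Intel's tools actually retime a design.

```python
def max_stage_delay(delays_ns, register_after=()):
    """Longest register-to-register delay along a chain of logic elements.

    delays_ns: combinational delay of each element, in order.
    register_after: indices of elements followed by a pipeline register.
    """
    cuts = set(register_after)
    worst = current = 0.0
    for index, delay in enumerate(delays_ns):
        current += delay
        if index in cuts:
            worst = max(worst, current)
            current = 0.0
    return max(worst, current)


def fmax_mhz(delays_ns, register_after=()):
    """Upper bound on clock frequency in MHz for the given register placement."""
    return 1e3 / max_stage_delay(delays_ns, register_after)


path = [1.0, 0.5, 1.5, 1.0]       # hypothetical element delays in ns
print(fmax_mhz(path))             # 250.0
print(fmax_mhz(path, [1]))        # 400.0
```

The catch, of course, is that each added register also adds a cycle of latency, so the surrounding logic has to tolerate the extra pipeline stage.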

Of the two vendors who went the NoC route, Xilinx and Achronix, Achronix claims to have the fastest NoC with their two-dimensional cross-chip AXI implementation. Each row or column in the NoC is implemented as two 256-bit, unidirectional AXI channels operating at 2 GHz, delivering 512 Gbps of data traffic in each direction simultaneously. Speedster’s NoC has a total of 197 endpoints, yielding 27 Tbps aggregate bandwidth, pulling a ton of weight off of the conventional bit-wise routing resources of the FPGA. Xilinx’s Versal NoC performance is not published as far as we can tell, but with around 28 endpoints, we’d guess something like 1.5 Tbps.
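The per-channel figure is straightforward width-times-clock arithmetic, sketched below. The second function simply divides an aggregate figure by the endpoint count to get an average per endpoint. That division is our own illustration for putting the two NoCs on a common footing; it is not how either vendor derives its aggregate number, and the Versal inputs are our guess from above.

```python
def channel_gbps(width_bits, clock_ghz):
    """Raw bandwidth of one unidirectional channel in Gbps."""
    return width_bits * clock_ghz


def average_per_endpoint_gbps(aggregate_tbps, endpoints):
    """Aggregate bandwidth spread evenly over the endpoints, in Gbps."""
    return aggregate_tbps * 1000 / endpoints


print(channel_gbps(256, 2.0))                          # 512.0
print(f"{average_per_endpoint_gbps(27, 197):.1f}")     # 137.1
print(f"{average_per_endpoint_gbps(1.5, 28):.1f}")     # 53.6
```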

Well, we’re running out of virtual ink for this week, but we’ll be back next week with a continuation – looking at the fascinating and flexible memory architectures these FPGA families bring to the table, unique packaging and customization capabilities of each family, insane SerDes IO capabilities, embedded processing subsystems, design tool flows, and more.

3 thoughts on “High-End FPGA Showdown – Part 1”

Karl Stevens says:

September 6, 2019 at 6:36 am

” These registers allow longer, more complex logic paths to be retimed and pipelined – with the effect that the overall design becomes essentially asynchronous.”
Retiming and pipelining, yes. Asynchronous, not even close: BUT the ability to change path timing certainly is useful in asynch design.
Now, the tools and H”Description”Ls need to become H”Design”Ls.
I spent 30 years in Design, Debug, Modify, Troubleshoot, and Architect. Thank goodness that was before Verilog, VHDL, State Machines, and the insanity that Verilog can be simulated therefore it must be used for design. Digital design is Boolean Logic Design that must be done and DEBUGGED before simulation is done for VERIFICATION.
