High-End FPGA Showdown – Part 2

In Part 1 of this series, we looked at the new high-end FPGA families from Achronix, Intel, and Xilinx. We compared the underlying semiconductor processes, the type and amount of programmable logic LUT fabric, the type and amount of DSP/arithmetic resources and their applicability to AI inference acceleration tasks, the claimed TOPS/FLOPS performance capabilities, and on-chip interconnect such as FPGA routing resources and networks-on-chip (NOCs). From those comparisons it is clear that each of these vendors’ offerings has unique and interesting capabilities that would make them shine in particular application areas. We also inadvertently highlighted just how difficult it is to do a meaningful analysis of such complex semiconductor devices.

All three vendors – Xilinx, Intel, and Achronix – discussed our assumptions and analysis with us and provided invaluable insight for this series.

This week, we’ll tackle memory architectures, in-package integration architecture, and high-speed serial IO capabilities. Here again, we’ll see that this generation of FPGAs far outshines even their immediate predecessors, and we’ll show further evidence that these are probably the most sophisticated chips ever created. We are at a fascinating time in the history of semiconductor evolution, with Moore’s Law coming to an economic end, a new generation of AI technology and applications demanding an entirely new approach to computing, and enormous competitive stakes at play with vast new markets opening up for these amazing devices.

The real-world performance of FPGAs is just as dependent on memory architecture as on compute resources and internal bandwidth. In today’s computation environment, the data is the thing – and moving, processing, and storing that data efficiently in the flow of the computation is key. Today, the global data infrastructure spans a landscape from small, sensor-laden endpoints to the network edge, local storage and computing, back to cloud data centers with vast computation and storage resources, and then back through the whole thing to the edge again. The role of FPGAs in that round trip is enormous – with FPGAs contributing heavily to storage, networking, memory, and computation.

We should point out that Xilinx maintains that their Versal ACAP series of devices is a separate category from FPGAs – one they have dubbed “ACAP” for “Adaptive Compute Acceleration Platform.” As we understand it, the lynchpin of that claim is that Versal is aimed at a different audience from traditional FPGAs – an audience of application developers who may not have FPGA expertise, and who require an interaction model that does not begin with configuring FPGA fabric with a bitstream. In fact, they point out, Versal can be booted and operated without ever configuring the FPGA fabric at all. This, combined with features such as the vector processing engines and the network-on-chip (NoC), is the basis for their argument that Versal devices are “ACAPs” rather than “FPGAs.”

For our purposes here, however, we will continue to evaluate Versal ACAP against these other, very similar FPGA families. We believe these three offerings will frequently be competing for the same sockets. Furthermore, our audience has always contained a large contingent of FPGA design experts, dating back to pre-2009 when we were known as “FPGA Journal”. We understand the motivation behind Xilinx’s marketing position. They want to attract a new market – with customers for whom “FPGA” may be an intimidating or confusing label. Xilinx took a similar strategy with their “Zynq” families of devices – referring to them as “SoCs” rather than “FPGAs”. But, “ACAP” is a harder sell, because the SoC category already existed and had a large number of competing offerings. Creating a new category of one is a tall order. We’ll see if it catches on. We are waiting for the first competitor to build a device they identify as an “ACAP.”

Each of these competing families takes a different and interesting stab at optimizing the memory architecture for the target applications they envision. Unlike conventional CPU or GPU architectures, FPGAs uniquely allow the reconfiguration of the memory hierarchy to match the task at hand. This can have a staggering impact on the throughput, latency, and power-efficiency of the resulting application. FPGA memory architectures allow us to partition our application so that each use of memory has the best trade-off between locality/bandwidth and density.

Starting at the lowest density but highest bandwidth are memory resources within the LUTs themselves. There, logic has direct, hard-wired access to small amounts of stored data, creating the most efficient path possible for data flow. All FPGA architectures have LUT-based memory as a core feature. The amount of LUT memory is roughly proportional to the LUT count, which we discussed last week. While this storage is hyper-local and delivers essentially optimal bandwidth for the associated logic, most applications have memory requirements that far outstrip the meager but precious LUT memory resources.

Moving up one level in density and down one in bandwidth, then, we have various architectures for “block” memory in the FPGA fabric. Block structures, as the name implies, are dedicated, hardened memory areas within the FPGA fabric which require data paths to span more FPGA interconnect. Each vendor has their own strategy for partitioning these on-chip memory resources. They have exhaustively modeled various types of applications and their memory needs, weighed the tradeoffs between distribution and density, and come up with a tiered approach that they feel best solves the broadest set of problems, with particular emphasis on the primary targeted application types.

Beginning with Achronix, Speedster 7t offers up to 385 Mb of embedded memory, divided between LRAM2k, BRAM72k, and MLP blocks. Intel Agilex offers over 300 Mb of embedded RAM across three types of embedded memory block – MLABs, M20K blocks, and eSRAM memory blocks. Xilinx Versal offers block RAM, “UltraRAM”, and Accelerator RAM – totaling as much as around 294 Mb in their largest “AI Core” devices. Each of these architectures is the vendor’s best guess at what size chunks and what proximity to other resources will give the best performance across a large range of target applications.

Moving one more level up the hierarchy we have memory that is included in the package with the FPGA. This is generally implemented in a high-density, high-bandwidth, high-cost technology such as HBM. Since we are going off-chip to reach it (via interposer or EMIB or other packaging link), latency is higher and bandwidth is lower than for embedded memory, but both are better than what we’d get going off-package across a PCB through conventional memory interfaces (which we’ll also tackle in a bit). The goal at this level is to bring great density and great bandwidth together – with far more data than can be managed on-chip, with much higher bandwidth than we’d get with external memory.

Before we can discuss in-package memory, however, we should shine some light on fundamental differences in the three vendors’ approach to integration at the package level. Here, we give Intel Agilex the nod for maximum flexibility with smallest end-user investment. Intel’s Agilex is designed from the ground up for in-package integration flexibility. Intel uses a proprietary technology called EMIB (embedded multi-die interconnect bridge) to connect chiplets within the package. The FPGA fabric itself is one chiplet, the SerDes transceivers are another, in-package memory such as HBM is another, and other optional peripherals can be added as further chiplets. Each of these chiplets could be implemented in a different process technology, meaning Intel can update or refresh any chiplet at any time without having to re-design their entire FPGA (as they would with a monolithic approach). The additional feather in Intel’s cap here is their ability to clip in custom chiplets based on their recently-acquired eASIC technology. That means users’ custom logic can be added into the package of their FPGA with minimal NRE and design overhead. eASIC allows designs originally implemented in FPGA fabric (for example) to be hardened into chiplets – giving ASIC-like performance, density, and power efficiency.

Achronix has announced Speedster 7t as a family of stand-alone chips, but a Speedcore embedded FPGA version is also available. It includes the same resources as Speedster 7t and can also include custom instructions to further optimize for a specific class of application; these could be dedicated packet processing, TCAM, or signal processing functions. In that scenario, the integration decisions on what goes into the chip versus the package, and what hardened IP is included in the same chunk of silicon with the FPGA fabric, are entirely up to their customers’ design teams. This approach gives the maximum flexibility and control to the end user, but with much higher cost, risk, and design expertise required on the customer’s end.

Achronix is also in the chiplet business, and participates in the Open Compute Project’s (OCP’s) Open Domain-Specific Architecture (ODSA) initiative. ODSA is working to establish standards to drive an open chiplet ecosystem that will facilitate the creation of SiPs that mix and match chiplets from multiple vendors. This would allow package-level customization similar to Intel’s – but not using Intel’s proprietary EMIB interconnect technology. Achronix’s view is that design teams will often go with a stand-alone FPGA solution initially and, once the design is proven, do a cost-reduction round that might include hardening some logic into standard cell ASIC designs that also include programmable FPGA IP blocks, or building a custom SiP using chiplets.

Xilinx offers the least device customization flexibility of the three, but with by far the most “out of the box” options. Xilinx was a pioneer in multi-die integration for FPGAs, using interposers to stitch together multiple chiplets for three product generations now. Interestingly, however, Xilinx has backed off that strategy as others have pushed forward with it. Xilinx now builds much more of the functionality of their device into one monolithic die. This brings advantages in speed, cost, and reliability, but reduces the ability to mix-and-match chiplets to customize the in-package integration. Offsetting this, Xilinx is planning a very large offering of Versal families, hoping to provide off-the-shelf devices with appropriate sets of resources to match various types of applications. Xilinx also continues to maintain a broad offering of previous-generation traditional FPGAs that may be the best fit for many applications and markets.

Moving back to in-package memory, then, both Xilinx and Intel offer comparable in-package HBM stacks as far as we can tell. Xilinx says there will be a Versal HBM series but hasn’t formally announced details yet; we can surmise what it will look like based on their HBM support in other families. With Agilex, Intel offers the ability to put up to 16 GB of HBM2, as well as other types of memory resources, into the package. Achronix offers no in-package memory option, but instead claims that their use of up to 8 GDDR6 memory controllers, each capable of supporting 512 Gbps of bandwidth, gives their devices an aggregate GDDR6 bandwidth of 4 Tbps, which is comparable to what the other vendors offer with their HBM options but at a lower cost. The tradeoffs there are more power consumption and more PCB design complexity versus in-package HBM. It is likely that the availability of GDDR6 will be more immediate (given the target mass market for graphics subsystem use) while HBM2 is taking some time to ramp to volume production.
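To make the GDDR6 arithmetic concrete, here is a minimal Python sketch that multiplies out the two figures quoted above (8 controllers, 512 Gbps each). The function and variable names are ours, chosen for illustration, and the result is a peak number that assumes every controller runs flat out at the same time.

```python
# Figures taken from the text: up to 8 GDDR6 controllers at 512 Gbps each.
GDDR6_CONTROLLERS = 8
GBPS_PER_CONTROLLER = 512


def aggregate_bandwidth_gbps(channels: int, gbps_per_channel: float) -> float:
    """Peak aggregate bandwidth in Gbps, assuming all channels run at full rate."""
    return channels * gbps_per_channel


total_gbps = aggregate_bandwidth_gbps(GDDR6_CONTROLLERS, GBPS_PER_CONTROLLER)
total_tbps = total_gbps / 1000        # decimal terabits per second
total_gbytes_per_s = total_gbps / 8   # gigabytes per second (8 bits per byte)

print(f"{total_gbps:.0f} Gbps = {total_tbps:.3f} Tbps = {total_gbytes_per_s:.0f} GB/s")
```

The product is 4,096 Gbps, which is where the rounded “4 Tbps” figure comes from, and it works out to 512 GB/s. Sustained bandwidth in a real design will depend on access patterns and controller efficiency, which this sketch does not model.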

Looking at support for on-board memory, all vendors support DDR4 and will support DDR5.

Intel Agilex continues their approach of offering a hardened DDRx memory controller (HMC – but not “Hybrid Memory Cube”). Intel has been on the HMC track for several years, dating back to the 28nm Altera Arria V line. Intel says their FPGA-integrated hard memory controller facilitates close core-to-periphery and periphery-to-core timing transfers in the hard PHY, effectively guaranteeing timing closure and reducing compile time, as well as reducing read and write memory latency in half-rate mode. Intel also supports their Optane persistent memory, which offers RAM-like performance in a non-volatile technology.

Xilinx Versal AI Core series (also using a hardened memory controller) delivers up to 1.2 Tbps DDR4 bandwidth, and up to 1.6 Tbps LPDDR4 bandwidth, as well as support for CCIX.

Interestingly, Intel is also supporting low-latency / coherent memory hierarchy access through UPI/CXL protocols to Intel Xeon Scalable processors. We will discuss more of this aspect of Intel’s approach compared with the other vendors in a future section talking about integration into heterogeneous computing environments.

Achronix – also utilizing hardened memory controllers – additionally supports GDDR6, as mentioned above. In external memory, the number of ports is a key consideration for many applications, as the ability to do multiple reads/writes simultaneously from a shared memory resource can eliminate the performance bottleneck associated with memory bandwidth.

Of course, great chips can’t do great work unless data can be moved in and out of them efficiently. When it comes to data movement, FPGAs are undisputed kings – making their living for decades bridging and moving and routing vast amounts of data between disparate systems and protocols with their flexible logic and high-speed IO capabilities. Now, all of these vendors have moved to higher-throughput PAM4 technology for their fastest high-speed serial IO transceivers. PAM4 defines four voltage levels for signaling, rather than the usual two, so each symbol carries two bits instead of one, packing twice the data into every symbol period.
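As a rough illustration of why four levels double the data per symbol, the Python sketch below maps bit pairs onto four level indices and compares the symbol count with two-level (NRZ) signaling. The Gray-coded mapping table and all function names are our own illustrative choices, not any vendor’s implementation, and real transceivers add equalization, FEC, and much more.

```python
import math

# Illustrative Gray-coded mapping from bit pairs to four level indices (0..3).
# This table is an assumption for the sketch, not taken from the article.
PAM4_LEVELS = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def bits_per_symbol(levels: int) -> int:
    """Number of bits one symbol can carry with the given number of levels."""
    return int(math.log2(levels))


def pam4_encode(bits):
    """Pack consecutive bit pairs into PAM4 level indices."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


def symbol_rate_gbaud(line_rate_gbps: float, levels: int) -> float:
    """Symbol rate needed to carry a given line rate."""
    return line_rate_gbps / bits_per_symbol(levels)


payload = [1, 0, 0, 1, 1, 1, 0, 0]
pam4_symbols = pam4_encode(payload)   # 4 symbols for 8 bits
nrz_symbols = list(payload)           # NRZ: one symbol per bit, 8 symbols

print(pam4_symbols, len(nrz_symbols), len(pam4_symbols))
print(symbol_rate_gbaud(112, 4), symbol_rate_gbaud(112, 2))
```

With these definitions, a 112 Gbps PAM4 lane runs at 56 GBd, where the same line rate in NRZ would need 112 GBd. The price is smaller spacing between voltage levels, which is part of why signal integrity gets harder at these rates.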

Xilinx Versal ACAP supports up to 44 GTY transceivers (32.75 Gbps), plus up to 52 GTM transceivers (58 Gbps), with an aggregate total IO bandwidth of up to about 1.31 Tbps. Intel’s Agilex families can include a bewildering array of options, as the SerDes transceivers are included in separate “Tile” chiplets that vary by application area, with up to 8x PAM4 112 Gbps, and up to 48x PAM4 58 Gbps. Achronix Speedster 7t offers a staggering 72x PAM4 112 Gbps transceivers. All of these are impressive numbers, but keep in mind that SerDes transceivers pose some of the most daunting design challenges including board- and system-level signal integrity. They also are huge contributors to the cost of the chip, so selection of a device with a set of transceivers that meets your application needs is worth careful consideration.
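One simple way to start matching transceivers to an application is to work backward from the bandwidth you need to the number of lanes required at each available line rate. The Python sketch below does that for the three lane rates quoted above. The 400 Gbps requirement and the helper names are hypothetical, and the calculation uses raw line rates only, ignoring encoding and protocol overhead.

```python
import math

# Per-lane line rates quoted in the text, in Gbps.
LANE_RATES_GBPS = {"32.75 Gbps": 32.75, "58 Gbps": 58.0, "112 Gbps": 112.0}


def lanes_needed(required_gbps: float, lane_rate_gbps: float) -> int:
    """Smallest whole number of lanes whose raw line rate covers the requirement."""
    return math.ceil(required_gbps / lane_rate_gbps)


def raw_aggregate_gbps(lane_count: int, lane_rate_gbps: float) -> float:
    """Raw one-direction line rate summed over identical lanes."""
    return lane_count * lane_rate_gbps


required = 400.0  # hypothetical application requirement, in Gbps
sizing = {name: lanes_needed(required, rate) for name, rate in LANE_RATES_GBPS.items()}
print(sizing)

# Raw total for the 72 x 112 Gbps configuration quoted for Speedster 7t.
speedster_raw_gbps = raw_aggregate_gbps(72, 112.0)
print(f"{speedster_raw_gbps / 1000:.3f} Tbps raw")
```

For the hypothetical 400 Gbps target, this gives 13 lanes at 32.75 Gbps, 7 at 58 Gbps, and 4 at 112 Gbps. Note that the “up to” transceiver counts quoted for each family may not all be available on the same device, so per-device data sheets remain the authority for any real sizing exercise.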

For crazy-fast Ethernet, Xilinx Versal ACAP debuts the company’s new internally-developed multi-rate MAC, which handles a wide range of configurations including 4x10GE, 1x40GE, 4x25GE, 2x50GE, or 1x100GE. Versal Prime ACAP devices include up to four of these multi-rate MACs. Intel Agilex includes hard Ethernet MACs with PCS and FEC, supporting 16 x 10/25GE, 8 x 50GE, 4 x 100GE, 2 x 200GE, or 1 x 400GE. This allows up to 4 x 400Gb Ethernet network interface connectivity. Achronix Speedster 7t has 16 lanes of Ethernet in their 7t1500 and 32 in 7t6000. This gives four and eight 400Gb Ethernet connections respectively, as well as support for lower rates.
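The port configurations above are easier to compare once they are multiplied out. The Python sketch below computes the aggregate rate of each listed mode and derives the 400Gb port counts for the two Achronix devices. The mode lists are transcribed from the text; the figure of four lanes per 400Gb port is our inference from “16 lanes gives four 400Gb connections,” not a vendor specification.

```python
# (ports, Gbps per port) for each mode listed in the text.
XILINX_MULTIRATE_MAC_MODES = [(4, 10), (1, 40), (4, 25), (2, 50), (1, 100)]
INTEL_AGILEX_MODES = [(16, 25), (8, 50), (4, 100), (2, 200), (1, 400)]

# Assumption inferred from the text: 16 lanes -> four 400Gb ports.
LANES_PER_400G_PORT = 4


def aggregate_gbps(modes):
    """Total bandwidth of each (ports, rate) configuration, in Gbps."""
    return [ports * rate for ports, rate in modes]


def ports_400g(ethernet_lanes: int) -> int:
    """Number of 400Gb ports that a given lane count can form."""
    return ethernet_lanes // LANES_PER_400G_PORT


xilinx_totals = aggregate_gbps(XILINX_MULTIRATE_MAC_MODES)
intel_totals = aggregate_gbps(INTEL_AGILEX_MODES)
achronix_ports = {"7t1500": ports_400g(16), "7t6000": ports_400g(32)}

print(xilinx_totals)   # per-MAC totals for each Xilinx mode
print(intel_totals)    # totals for each Intel mode
print(achronix_ports)
```

Every Intel mode at its top rate multiplies out to the same 400 Gbps, while a single Xilinx multi-rate MAC tops out at 100 Gbps, so four of them on a Versal Prime device reach 400 Gbps in total.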

For PCIe, Xilinx Versal ACAP offers 1 x Gen4 x16 cache-coherent interconnect for accelerators (CCIX), which operates over standard PCIe links, up to 4x Gen4 x8 PCIe, and up to 2 multi-rate Ethernet MACs. Intel’s Agilex offers PCIe Gen4 x16 (up to 16 Gbps per lane) and Gen5 x16 (up to 32 Gbps per lane). Achronix Speedster supports up to 2x PCIe Gen5 x16.
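For a sense of what those PCIe link widths mean in bandwidth terms, the Python sketch below estimates per-direction throughput for a x16 link. The per-lane rates are the ones quoted above; the 128b/130b line encoding is a detail of the PCIe specification for these generations rather than something stated in this article, and packet and protocol overhead is ignored, so treat the results as upper bounds.

```python
# Per-lane signaling rates quoted in the text, in Gbps.
PCIE_LANE_RATE_GBPS = {"Gen4": 16.0, "Gen5": 32.0}

# 128b/130b line encoding (from the PCIe specification, not from the article).
ENCODING_EFFICIENCY = 128 / 130


def link_bandwidth_gbytes_per_s(generation: str, lanes: int) -> float:
    """Approximate one-direction link bandwidth in GB/s after line encoding."""
    raw_gbps = PCIE_LANE_RATE_GBPS[generation] * lanes
    return raw_gbps * ENCODING_EFFICIENCY / 8


gen4_x16 = link_bandwidth_gbytes_per_s("Gen4", 16)
gen5_x16 = link_bandwidth_gbytes_per_s("Gen5", 16)

print(f"Gen4 x16: {gen4_x16:.1f} GB/s per direction")
print(f"Gen5 x16: {gen5_x16:.1f} GB/s per direction")
```

Under these assumptions a Gen4 x16 link carries roughly 31.5 GB/s in each direction and a Gen5 x16 link roughly 63 GB/s, so a device with two Gen5 x16 interfaces has about four times the PCIe bandwidth of one with a single Gen4 x16.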

In short, all of these families carry on the FPGA tradition of massive, flexible IO – and this discussion barely scratches the surface. We could dedicate several articles to the subtle but critical nuance of high-speed interfaces on these devices, so take the time to understand the details of any family you are planning to use against your application needs. Consider what is hardened in the total solution and what needs to be implemented or supported in the LUT fabric. Buy only the bandwidth you actually need, as there’s no reason to pay for expensive high-performance transceivers if your application doesn’t need them. In general, Ferraris aren’t meant for idling down to the supermarket.

In the next installment of this series, we’ll be tackling processing subsystems and integration with external processors, hardware ecosystems such as accelerator cards using these devices, and – perhaps most important of all – design and application development tool support that gets us from the world of the target application developer – whether it’s C/C++ code, TensorFlow, OpenCL, SystemVerilog, or some other language or dialect – into something that can harness the amazing power of these devices.

One thought on “High-End FPGA Showdown – Part 2”

TotallyLost says:

September 14, 2019 at 3:37 pm

Thanks Kevin … these chips are certainly getting pretty awesome and converging head to head for the data center.

15 years ago there were some pure old school hardware guys sneering that we would want to run UNIX on the FPGA SOC’s … well that day is actually long past as a possibility, reality is that it’s not out of sight to be running cache coherent MP configurations with highly accelerated critical paths in hard logic, for both the OS and the Application.
