Xilinx Boosts the Bandwidth

A few years ago, we all thought we were on the cusp of “enough” bandwidth. If everybody with a smartphone could stream HD video simultaneously, what else could we even want? Surely an infrastructure that could handle that task would be ready for whatever else we wanted to throw at it.

Oh, how wrong we were.

Video still is the biggest bandwidth hog, but now we’re talking 4K and 8K at terrifying bit-rates. And, instead of just one smartphone per person, we’ve got billions of IoT devices all trying to stream massive amounts of data from each of those “things” to all the rest, on that internet of theirs. And, since a lot of the computing for all those edge devices can’t yet be done at the edge, lots of that data is shipped back and forth to cloud data centers, creating even more of a network crunch. Combine that IoT explosion with the transition to 5G and the network that backs it up, and you’ve got yourself a serious global bandwidth crisis.

Which, of course, is exactly what Xilinx was hoping for.

The history of FPGAs is the history of bandwidth explosion. For the past several decades, FPGAs have been the go-to technology for networking companies trying to hit the next big bandwidth target first. FPGAs powered some of the first one-gig appliances, and now they’re poised to enable the move to 800 gig (gulp). Xilinx just made a pair of announcements that put them front-and-center in that equation. The appetizer is a new Alveo U-25 SmartNIC Platform, and the main course is a crazy-powerful new family of FPGAs ACAPs (shhh, they’re really still FPGAs, but don’t tell). These two announcements push Xilinx’s product portfolio further into the realm of extreme bandwidth enablement.

To fully appreciate the strategic importance of the Alveo card, we need to first consider Intel’s stranglehold on the data center. While Intel’s corporate lawyers probably forbid their marketing from using the term “dominance” (out of fear of garnering the attention of anti-trust regulators), Intel does, in fact, dominate the business of data center hardware. When it comes to technology like FPGAs, Intel has their own via their acquisition of Altera, and they likely aren’t too keen to welcome competitors to the party. Intel can do all kinds of mean tricks, like shipping servers to OEMs with FPGAs already built in. Hey, why buy an FPGA from someone else if your server already came with one, right?

If you’re a competitor like Xilinx, who claims to now be “data center first” – that makes things kinda tough. What you need is a trojan horse. A nice gift you can roll up to the gate of the data center, saying “Hey, let me in. I’ll make things better in there.” And, what better offering to bring than a smart network interface card. In cloud data centers, an estimated 30% of compute resources are sucked up by networking I/O processing. As the number of CPU cores continues to increase, the overhead continues to grow – which results in further increased demands on networking. Smart network interface cards allow you to migrate acceleration closer to the network interface, freeing up compute resources to do more computing.

Xilinx’s new Alveo U-25 SmartNIC is a “bump-in-the-wire.” It can be inserted into the data flow without bothering anything over in Intel-land, and it just makes life easier for all those hard-working and expensive Xeons. Ah, but here’s the rub. Xilinx says the SmartNIC brings a “True convergence of network, storage, and compute acceleration functions on a single platform.” Uh, oh. That means that, once you’ve got a bunch of these suckers in your data center, why not give them some more work to do like storage and compute acceleration. Your Xeons will love you! But Intel – maybe not so much. Xilinx will have succeeded in getting their silicon across the formidable Intel moat, and it’ll be sitting there taunting those Intel components at close range.

Before we dive into more details of the Alveo U-25 SmartNIC, let’s look at the second announcement – the Versal Premium Series. Sounds a little like a new luxury crossover vehicle, doesn’t it? Xilinx says Versal Premium is actually a new family of what they call ACAPs (Adaptable Compute Acceleration Platforms) — or what we call FPGAs. This brings up an issue for those accustomed to the historical programmable logic vernacular. Namely, Xilinx is trying to change the name of EVERYTHING lately.

Here’s our helpful EE Journal decoder ring, which may come in handy when interpreting newer-generation Xilinx marketing materials:

ACAP = FPGA

Platform = FPGA

Series = FPGA Family

Adaptable Hardware = FPGA LUT fabric

Adaptable Engines = FPGA LUTs

Intelligent Engines = DSP Slices

Scalar Engines = ARM processor subsystems

Dynamic Function Exchange = Partial Reconfiguration

LUTs (OK, this is a tricky one) = 6-input LUTs

Buckle up, this one gets complicated. Most of the industry gives “LUT” numbers based on an estimated number of equivalent 4-input LUTs (LUT4s), even though their devices have wider LUTs. To their credit, Xilinx now gives LUT counts based on their 6-input LUTs, which (gasp) correspond to something actually on the chip. Xilinx’s basic programmable logic block is called a CLB. CLBs are made up of slices. Slices are made up of several 6-input LUTs, each of which has 2 flip flops. We can look at the data sheet numbers for “CLB flip flops,” and we see that it is exactly double the number given for LUTs.

Way to go Xilinx! An FPGA datasheet that shows what’s actually on the chip. Novel concept.

Oops, except now they have introduced a marketing metric called System Logic Cells, and it gets top-line billing on the product table. What is a System Logic Cell? Apparently, it is approximately 2.187 LUTs. You may ask how they came up with 2.187? Simple. They calculated what multiplier was required to make their devices sound bigger than Intel/Altera’s. Ah, and they were doing so well.

Anyway, back to Versal Premium ACAPs. These are truly amazing devices. The Versal line has two branches: an “AI” branch with families that include AI engines, and a more general-purpose branch without AI engines. Versal Premium is part of the general-purpose branch, and it comes packed with resources the likes of which we’ve never seen. Beginning with the bandwidth side, Premium packs 9Tb/s of serial bandwidth via up to 68 32Gbps NRZ transceivers, which are perfect for implementing power-optimized mainstream 100G interfaces, and up to 140 58Gbps PAM4 transceivers for applications migrating to 400G. Those 140 transceivers can also be doubled up to give 70 lanes of 112Gbps for those of you flirting with the idea of 800 gig gear.

On the programmable logic “adaptable hardware” front, Versal Premium brings from 700K to almost 3.4M of Xilinx’s LUT6 cells. Even with conversion factors for estimating the equivalent number of LUT4s, that makes these some enormous programmable logic chips. And, all that logic fabric is just part of the story. These devices have incredible amounts of functionality already hardened that would have taken up massive amounts of LUT space in the past. Xilinx estimates that the hardened cores included in one Versal Premium device would fill up the LUT fabric of 22 Virtex UltraScale+ devices. That means these devices bring an astronomical jump in functionality, not to mention the big boost in performance and reduction in power consumption.

Versal Premium has hardened numerous functions that would have previously been implemented or partially implemented in LUTs, including 400G high-speed crypto engines, 600G Interlaken cores, 600G Ethernet cores, multi-rate Ethernet cores, 112G PAM4 Transceivers, PCIe® Gen5 w/DMA & CCIX, and CXL. The CXL in particular is interesting in that it gets Xilinx at least on par with Intel when it comes to cache-coherent communication with the ubiquitous Xeons (whenever Xeon itself eventually gets CXL support). The largest Premium device includes 14,352 of the new (larger) DSP blocks, so for signal processing, convolution, and other tasks that require massive parallel arithmetic, Premium brings the goods.

The largest device has a total of 994Mb of on-chip SRAM, and all that storage close to the computation elements is likely to accelerate applications considerably, as well as reducing power due to less off-chip memory access. And, the embedded ARM processing subsystem includes both application processors (dual-core Arm Cortex-A72, 48KB/32KB L1 Cache w/parity & ECC; 1MB L2 Cache w/ECC), and real-time processors (dual-core Arm Cortex-R5F, 32KB/32KB L1 Cache, and 256KB TCM w/ECC).

In conventional FPGAs, all that hardware would make routing and timing closure a near-impossibility which is why the Versal devices include a dedicated, full-featured network-on-chip (NoC) that dramatically reduces the burden on routing and timing closure. This should substantially reduce the work required in the Vivado tool suite. And, speaking of tools, Versal Premium will take advantage of Xilinx’s new “Vitis” multi-entry-point tool architecture, which fronts the Vivado back-end traditionally used by HDL designers with environments designed specifically for software engineers, and for AI engineers.

Versal Premium has documentation available now, with design tool support scheduled for the second half of 2020 and first silicon shipments in early 2021. Pin migration allows prototyping of designs to begin now with the existing Versal Prime devices, with a smooth transition to Versal Premium chips when they become available.

Xilinx Boosts the Bandwidth

Related

Leave a Reply Cancel reply

featured video

How NV5, NVIDIA, and Cadence Collaboration Optimizes Data Center Efficiency, Performance, and Reliability

featured chalk talk