
Xilinx Scales Hyperscale

Announces “Reconfigurable Acceleration Stack”

Today, power rules the data center. When we put together huge clusters of machines to handle the world’s biggest computing loads, the final arbiter of the amount of computing we can do, and the size of the data center we can actually build, is power. It is estimated that data centers consume as much as 2% of the electricity in the US, and the situation is only going to get worse. Today, the biggest companies in the data center build-out are going to extremes to get access to more energy, and to make their data centers more energy efficient.

Data centers are often built near hydroelectric dams or other ready sources of cheap power that will keep the operation cost down. Climate and the availability of cooling resources are major considerations as well, as many data centers spend at least as much on cooling as on direct computing. Still, the limiting factor is often the maximum amount of power available from the local utility. If we can improve the performance-per-watt of our server machinery, we can have a dramatic impact on the economics of data centers.
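To put rough numbers on that argument, here is a minimal sketch of the arithmetic. It assumes a hypothetical site capped at 10 MW by the utility, a cooling load equal to the compute load (the "same amount on cooling" case above), and an arbitrary unit of work per watt. The function name and every figure are illustrative assumptions, not measurements.

```python
def throughput_under_power_cap(site_power_w, cooling_ratio, perf_per_watt):
    """Work per second achievable when the utility feed is the hard limit.

    site_power_w  -- total power available from the utility (watts)
    cooling_ratio -- watts spent on cooling per watt spent on computing
    perf_per_watt -- work units per second delivered per compute watt
    """
    # Only part of the feed reaches the servers; the rest goes to cooling.
    compute_power_w = site_power_w / (1.0 + cooling_ratio)
    return compute_power_w * perf_per_watt


# Hypothetical site: 10 MW feed, cooling costs as much as computing.
baseline = throughput_under_power_cap(10e6, 1.0, 1.0)
# Same site, same feed, with servers that do twice the work per watt.
improved = throughput_under_power_cap(10e6, 1.0, 2.0)
```

Because the power cap is fixed, any gain in performance-per-watt shows up directly as extra throughput from the same building, which is why the metric matters so much to the economics.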

For some types of data centers – those with very large loads on a single type of task – hyperscale computing has been gaining popularity. In general, hyperscale simply means the ability to scale and allocate computing resources in response to demand. However, in the context of cloud, big data, distributed storage, and large-scale AI applications, hyperscale computing has moved far from the “lots of cheap hardware” vision that was its original rationale. Huge distributed data center operators such as Google, Facebook, Amazon, and Microsoft rely on hyperscale architectures to efficiently deliver their services.

Since acceleration with FPGAs is one of the most effective ways to reduce power consumption while increasing compute performance, there has been intense effort of late to figure out how to get FPGAs into the data center, and how to put them to work efficiently offloading large compute tasks. As we have written many times, the most challenging problem in any such large-scale deployment of FPGAs is software – getting the code set up so that critical workloads can be handled by the FPGA accelerators. Various approaches are in play – compiling OpenCL code into FPGA implementations, custom RTL design, model-based design, and high-level synthesis from languages like C and C++. But all of these are somewhat long-term solutions when it comes to rapid widespread adoption.

This week, however, Xilinx announced another strategy – its so-called “Reconfigurable Acceleration Stack,” based on a much coarser-grained use of FPGAs. Rather than pairing FPGA resources one-on-one with conventional processors, the company is supporting pooled FPGAs as resources that take advantage of the hyperscale architecture to accelerate key cloud and data center workloads. Specifically, the company says that it is targeting machine learning, data analytics, and video processing.

Xilinx is calling this a “stack” because it has three basic levels – provisioning, application development, and platform development. Starting at the bottom, platform development is based on things we already know well – FPGAs, development boards, and reference designs for hyperscale applications. Supporting that are the company’s Vivado design tool suite and the “SDAccel” development environment, which provides high-level design specifically aimed at acceleration applications. In the next level up, we have specific libraries for each of the targeted application areas – DNN (deep neural network) and GEMM (general matrix multiplication) for machine learning, HEVC (High Efficiency Video Coding) encoding and decoding for video processing, and data mover and compute kernels for SQL. Above that we have frameworks such as Caffe, FFmpeg, and SQL, and, finally, the top level is the OpenStack open source cloud OS.
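For readers who have not met GEMM before, the operation such a library accelerates is the standard BLAS-style update C ← αAB + βC. The following is a plain-Python reference sketch of that definition only. It says nothing about how the Xilinx library is implemented or what its API looks like, and the function name and argument order are our own assumptions.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [
            # One multiply-accumulate chain per output element.
            alpha * sum(a[i][k] * b[k][j] for k in range(inner)) + beta * c[i][j]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
plain_product = gemm(1, a, b, 0, c)   # just a @ b
scaled_update = gemm(2, a, b, 1, c)   # 2 * (a @ b) + c
```

Every output element is an independent chain of multiply-accumulates, which is the kind of regular, parallel arithmetic that maps well onto an FPGA's DSP blocks.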

The Xilinx strategy has several clever elements. First, by focusing on a small number of high-value application segments, the company can do the application heavy lifting in advance. Starting adopters off with development kits, reference designs, domain-specific libraries, and a robust tool suite shaves huge chunks off the development process for deployment of FPGA-based acceleration. Second, by choosing applications that are important to very large enterprises, they set themselves up for larger potential wins on a smaller number of design/support cycles. Third, by going after the idea of “pooled” FPGAs, they avoid the architectural swamp of pairing processor, memory, and FPGA fabric in a finer-grained architecture. And fourth, the company is (finally) able to take advantage of its “partial reconfiguration” technology in a meaningful way – to facilitate very fast transition of FPGA clusters from one application to another, circumventing the comparatively costly full-reconfiguration process typically required by FPGAs.

There has been significant attention in recent years to the role of GPUs in data center acceleration, but pooled FPGAs bring a substantial advantage in power efficiency while still approaching the performance capabilities of GPU-based accelerators. In machine learning, in particular, it seems plausible that GPUs would be used for the “training” tasks while FPGAs could shine at the “inferencing” portion, which is where the broader deployment and scaling is required. 

Xilinx is claiming substantial performance-per-watt gains over traditional server CPUs for a number of typical hyperscale workloads. The advantage claimed ranges from 16x for machine learning inference, through 23x for networking vSwitch and 34x for SQL queries for data analytics, to 40x for both video transcode and storage compression (more datapath-related applications). The company goes further to claim advantages over rival Intel/Altera’s offering as well – claiming a significant (4x-6x) efficiency advantage over Arria 10, and a 2x-3x advantage over the upcoming Stratix 10 family. Xilinx claims that much of this advantage comes from more efficient arithmetic on reduced-precision fixed-point data, resulting from compromises Altera made in supporting floating-point. While these claims (like any vendor’s competitive benchmarks) should be taken with a grain of salt, the reduced-precision fixed-point argument is interesting because machine-learning inference tasks rely heavily on 8-bit arithmetic operations. Perhaps we’ll see an AI-specific FPGA at some point with scores of hardened 8×8 multipliers?
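To see why 8-bit arithmetic is attractive for inference, consider a minimal sketch of a quantized dot product, the inner loop of most neural network layers. This is a generic illustration of fixed-point quantization and not Xilinx's method. The function names, the shared scale of 1/64, and the sample values are all assumptions chosen so that the numbers quantize exactly; real weights would pick up a small rounding error.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers (clamped to -128..127)."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(a_q, b_q):
    """Integer multiply-accumulate; the running sum needs more than 8 bits."""
    return sum(x * y for x, y in zip(a_q, b_q))


# Hypothetical layer weights and activations.
weights = [0.5, -0.25, 0.75, 1.0]
inputs = [1.0, 0.5, -0.5, 0.25]
scale = 1.0 / 64  # one shared power-of-two scale, an illustrative choice

w_q = quantize(weights, scale)
x_q = quantize(inputs, scale)

acc = int8_dot(w_q, x_q)                 # all work here is 8x8 integer multiplies
approx = acc * scale * scale             # a single rescale back to real units
exact = sum(w * x for w, x in zip(weights, inputs))
```

All of the heavy lifting happens in small integer multipliers, with one rescale at the end, which is the kind of workload where a fabric full of narrow multipliers could plausibly outrun hardware built around floating-point units.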

It will be interesting to watch as Xilinx positions itself against Intel/Altera in the data center space. With Intel’s long-time dominance, Xilinx has to be seen as the “challenger” in any scenario. But FPGAs are a disruptive force in the data center, and part of the disruption could occur in the supply chain. Clearly, we expect to see some impressive offerings from Intel/Altera – some of which were announced at the recent Supercomputing show. And, because wins in the server/data center arena are likely to be large ones, the early reactions of the major customers will probably foreshadow the overall direction of the market.

 

 
