
Xilinx Scales Hyperscale

Announces “Reconfigurable Acceleration Stack”

Today, power rules the data center. When we put together huge clusters of machines to handle the world’s biggest computing loads, the final arbiter of the amount of computing we can do, and the size of the data center we can actually build, is power. It is estimated that data centers consume as much as 2% of the electricity in the US, and the situation is only going to get worse. Today, the biggest companies in the data center build-out are going to extremes to get access to more energy, and to make their data centers more energy efficient.

Data centers are often built near hydroelectric dams or other ready sources of cheap power that will keep operating costs down. Climate and the availability of cooling resources are major considerations as well, as many data centers spend at least as much on cooling as on direct computing. Still, the limiting factor is often the maximum amount of power available from the local utility. If we can improve the performance-per-watt of our server machinery, we can have a dramatic impact on the economics of data centers.
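To make the power-limited argument concrete, here is a minimal back-of-the-envelope sketch. It assumes the utility feed is the hard cap and that cooling draws a fixed number of watts per watt of computing. The function name, the 10 MW feed, and the operations-per-watt figures are hypothetical values chosen for illustration, not numbers from any vendor or data center.

```python
def deliverable_throughput(utility_power_w, cooling_overhead, perf_per_watt):
    """Return the throughput (ops/s) a site can sustain under a fixed utility feed.

    utility_power_w  : maximum power available from the utility, in watts
    cooling_overhead : watts of cooling spent per watt of computing
                       (1.0 means cooling costs as much as computing)
    perf_per_watt    : operations per second delivered per watt of compute power
    """
    if utility_power_w <= 0 or cooling_overhead < 0 or perf_per_watt <= 0:
        raise ValueError("power and perf_per_watt must be positive, overhead non-negative")
    # Only part of the feed reaches the servers; the rest goes to cooling.
    compute_power_w = utility_power_w / (1.0 + cooling_overhead)
    return compute_power_w * perf_per_watt


# Hypothetical site: 10 MW feed, cooling equal to compute power.
baseline = deliverable_throughput(10e6, 1.0, 1e9)
# Same site and same feed, with servers that deliver twice the work per watt.
improved = deliverable_throughput(10e6, 1.0, 2e9)
speedup = improved / baseline
```

Under these assumptions, throughput scales linearly with performance-per-watt: with the feed fixed, doubling efficiency doubles the work the same building can do, which is why the metric dominates the economics.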

For some types of data centers – those with very large loads on a single type of task – hyperscale computing has been gaining popularity. In general, hyperscale simply means the ability to scale and allocate computing resources in response to demand. However, in the context of cloud, big data, distributed storage, and large-scale AI applications, hyperscaling now sits far from the “lots of cheap hardware” vision that was the original rationale. Huge distributed data center monsters such as Google, Facebook, Amazon, and Microsoft rely on hyperscale architectures to efficiently deliver their services.

Since acceleration with FPGAs is one of the most effective ways to reduce power consumption while increasing compute performance, there has been intense effort of late into figuring out how to get FPGAs into the data center, and how to put them to work efficiently offloading large compute tasks. As we have written many times, the most challenging problem in any such large-scale deployment of FPGAs is software – getting the code set up so that critical workloads can be handled by the FPGA accelerators. Various approaches are in play – compiling OpenCL code into FPGA implementations, and doing custom RTL design, model-based design, and high-level synthesis from languages like C and C++. But, all of these are somewhat long-term solutions when it comes to rapid widespread adoption.

This week, however, Xilinx announced another strategy – their so-called “Reconfigurable Acceleration Stack,” based on a much coarser-grained use of FPGAs. Rather than pairing FPGA resources one-on-one with conventional processors, the company is supporting pooled FPGAs as resources that take advantage of the hyperscale architecture to accelerate key cloud and data center workloads. Specifically, the company says that they are targeting machine learning, data analytics, and video processing.

Xilinx is calling this a “stack” because it has three basic levels – provisioning, application development, and platform development. Starting at the bottom, platform development is based on things we already know well – FPGAs, development boards, and reference designs for hyperscale applications. Supporting that is the company’s Vivado design tool suite, and the “SDAccel” development environment, which provides high-level design specifically aimed at acceleration applications. In the next level up, we have specific libraries for each of the targeted application areas – DNN (deep neural network) and GEMM (general matrix multiplication) for machine learning, HEVC (High Efficiency Video Coding) encoding and decoding for video processing, and data mover and compute kernels for SQL. Above that we have frameworks such as Caffe, FFmpeg, and SQL, and, finally, the top level is the OpenStack open source cloud OS.

The Xilinx strategy has several clever elements. First, by focusing on a small number of high-value application segments, the company can do the application heavy lifting in advance. Starting adopters off with development kits, reference designs, domain-specific libraries, and a robust tool suite shaves huge chunks off the development process for deployment of FPGA-based acceleration. Second, by choosing applications that are important to very large enterprises, they set themselves up for larger potential wins on a smaller number of design/support cycles. Third, by going after the idea of “pooled” FPGAs, they avoid the architectural swamp of pairing processor, memory, and FPGA fabric in a finer-grained architecture. And fourth, the company is (finally) able to take advantage of its “partial reconfiguration” technology in a meaningful way – to facilitate very fast transition of FPGA clusters from one application to another, circumventing the comparatively costly full-reconfiguration process typically required by FPGAs.

There has been significant attention in recent years to the role of GPUs in data center acceleration, but pooled FPGAs bring a substantial advantage in power efficiency while still approaching the performance capabilities of GPU-based accelerators. In machine learning, in particular, it seems plausible that GPUs would be used for the “training” tasks while FPGAs could shine at the “inferencing” portion, which is where the broader deployment and scaling is required. 

Xilinx is claiming substantial performance-per-watt gains over traditional server CPUs for a number of typical hyperscale workloads. The claimed advantages are 16x for machine learning inference, 23x for networking vSwitch, 34x for SQL queries for data analytics, and 40x for both video transcode and storage compression (more datapath-related applications). The company goes further, claiming advantages over rival Intel/Altera’s offerings as well: a significant (4x-6x) efficiency advantage over Arria 10, and a 2x-3x advantage over the upcoming Stratix 10 family. Xilinx claims that much of this advantage comes from more efficient arithmetic on reduced-precision fixed-point operations, which it attributes to compromises Altera made in supporting floating-point. While these claims (like any vendor’s competitive benchmarks) should be taken with a grain of salt, the reduced-precision fixed-point argument is interesting because machine-learning inference tasks rely heavily on 8-bit arithmetic operations. Perhaps we’ll see an AI-specific FPGA at some point with scores of hardened 8×8 multipliers?
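For readers who have not worked with reduced-precision inference, the sketch below shows the general idea of 8-bit fixed-point arithmetic: values are scaled into signed 8-bit integers, multiplied and accumulated as integers, and scaled back at the end. This is a generic illustration of symmetric quantization, not Xilinx's library or any vendor's implementation. The function names, scales, and sample values are assumptions chosen so that the arithmetic is exact.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers: q = round(v / scale), clamped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(qa, qb):
    """Integer multiply-accumulate: each term is one 8x8 multiply,
    summed in a wider accumulator (Python integers do not overflow)."""
    return sum(a * b for a, b in zip(qa, qb))


def dequantize(acc, scale_a, scale_b):
    """Convert the integer accumulator back to a real value."""
    return acc * scale_a * scale_b


# Hypothetical weights and activations with power-of-two scales.
weights = [0.5, -0.25, 0.75]
activations = [1.0, 2.0, -1.0]
w_scale = 1 / 128
a_scale = 1 / 32

q_w = quantize(weights, w_scale)          # [64, -32, 96]
q_a = quantize(activations, a_scale)      # [32, 64, -32]
acc = int8_dot(q_w, q_a)                  # -3072
result = dequantize(acc, w_scale, a_scale)

# Floating-point reference for comparison.
reference = sum(w * a for w, a in zip(weights, activations))
```

Here the 8-bit result matches the floating-point reference (-0.75) because the sample values sit exactly on the quantization grid. Real networks incur some rounding error, which inference workloads often tolerate. The inner loop is nothing but small integer multiplies and adds, which is the kind of operation the fixed-point argument above is about.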

It will be interesting to watch as Xilinx positions itself against Intel/Altera in the data center space. With Intel’s long-time dominance, Xilinx has to be seen as the “challenger” in any scenario. But FPGAs are a disruptive force in the data center, and part of the disruption could occur in the supply chain. Clearly, we expect to see some impressive offerings from Intel/Altera – some of which were announced at the recent Supercomputing show. And, because wins in the server/data center arena are likely to be large ones, the early reactions of the major customers will probably foreshadow the overall direction of the market.

 

 
