Xilinx Scales Hyperscale

Today, power rules the data center. When we put together huge clusters of machines to handle the world’s biggest computing loads, the final arbiter of the amount of computing we can do, and the size of the data center we can actually build, is power. It is estimated that data centers consume as much as 2% of the electricity in the US, and the situation is only going to get worse. Today, the biggest companies in the data center build-out are going to extremes to get access to more energy, and to make their data centers more energy efficient.

Data centers are often built near hydroelectric dams or other ready sources of cheap power that will keep the operating cost down. Climate and the availability of cooling resources are major considerations as well, as many data centers spend at least as much on cooling as on direct computing. Still, the limiting factor is often the maximum amount of power available from the local utility. If we can improve the performance-per-watt of our server machinery, we can have a dramatic impact on the economics of data centers.
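To make the power-cap argument concrete, here is a minimal back-of-the-envelope sketch. It assumes a facility whose only hard limit is the utility feed and that cooling costs a fixed number of watts per watt of compute (1.0 matches the "same amount on cooling as on computing" case above). The function name, the parameters, and the numbers in the usage example are illustrative assumptions, not measurements.

```python
def max_throughput(power_budget_watts, perf_per_watt, cooling_overhead=1.0):
    """Sustained work per second under a fixed utility power cap.

    power_budget_watts: total power the utility can deliver (assumed fixed).
    perf_per_watt:      useful operations per second per watt of compute.
    cooling_overhead:   watts of cooling spent per watt of compute
                        (1.0 means cooling costs as much as computing).
    """
    # Only part of the budget reaches the servers; the rest goes to cooling.
    compute_watts = power_budget_watts / (1.0 + cooling_overhead)
    return compute_watts * perf_per_watt


if __name__ == "__main__":
    budget = 10_000_000  # hypothetical 10 MW utility feed
    baseline = max_throughput(budget, perf_per_watt=2.0)
    improved = max_throughput(budget, perf_per_watt=4.0)
    # With the power cap fixed, throughput scales directly with perf-per-watt.
    print(improved / baseline)
```

Under these assumptions the building, the utility contract, and the cooling plant stay the same, so any gain in performance-per-watt shows up directly as extra capacity.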

For some types of data centers – those with very large loads on a single type of task – hyperscale computing has been gaining popularity. In general, hyperscale simply means the ability to scale and allocate computing resources in response to demand. However, in the context of cloud, big data, distributed storage, and large-scale AI applications, hyperscale has moved far from the “lots of cheap hardware” vision that was its original rationale. Huge distributed data center operators such as Google, Facebook, Amazon, and Microsoft rely on hyperscale architectures to efficiently deliver their services.

Since acceleration with FPGAs is one of the most effective ways to reduce power consumption while increasing compute performance, there has been intense effort of late into figuring out how to get FPGAs into the data center, and how to put them to work efficiently offloading large compute tasks. As we have written many times, the most challenging problem in any such large-scale deployment of FPGAs is software – getting the code set up so that critical workloads can be handled by the FPGA accelerators. Various approaches are in play – compiling OpenCL code into FPGA implementations, and doing custom RTL design, model-based design, and high-level synthesis from languages like C and C++. But, all of these are somewhat long-term solutions when it comes to rapid widespread adoption.

This week, however, Xilinx announced another strategy – their so-called “Reconfigurable Acceleration Stack,” based on a much coarser-grained use of FPGAs. Rather than pairing FPGA resources one-on-one with conventional processors, the company is supporting pooled FPGAs as resources that take advantage of the hyperscale architecture to accelerate key cloud and data center workloads. Specifically, the company says that they are targeting machine learning, data analytics, and video processing.

Xilinx is calling this a “stack” because it has three basic levels – provisioning, application development, and platform development. Starting at the bottom, platform development is based on things we already know well – FPGAs, development boards, and reference designs for hyperscale applications. Supporting that are the company’s Vivado design tool suite and the “SDAccel” development environment, which provides high-level design specifically aimed at acceleration applications. In the next level up, application development, we have specific libraries for each of the targeted application areas – DNN (deep neural network) and GEMM (general matrix multiplication) for machine learning, HEVC (High Efficiency Video Coding) encoding and decoding for video processing, and data mover and compute kernels for SQL. Above those libraries we have frameworks such as Caffe, FFmpeg, and SQL, and, finally, the top level, provisioning, is the OpenStack open source cloud OS.

The Xilinx strategy has several clever elements. First, by focusing on a small number of high-value application segments, the company can do the application heavy lifting in advance. Starting adopters off with development kits, reference designs, domain-specific libraries, and a robust tool suite shaves huge chunks off the development process for deployment of FPGA-based acceleration. Second, by choosing applications that are important to very large enterprises, they set themselves up for larger potential wins on a smaller number of design/support cycles. Third, by going after the idea of “pooled” FPGAs, they avoid the architectural swamp of pairing processor, memory, and FPGA fabric in a finer-grained architecture. And fourth, the company is (finally) able to take advantage of its “partial reconfiguration” technology in a meaningful way – to facilitate very fast transition of FPGA clusters from one application to another, circumventing the comparatively costly full-reconfiguration process typically required by FPGAs.
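The pooling idea can be sketched as a toy allocator. This is not Xilinx's provisioning code or any OpenStack API. The `Fpga` and `FpgaPool` classes, the kernel names, and the 10 ms reconfiguration cost are all hypothetical, chosen only to show why a shared pool with cheap reconfiguration differs from FPGAs pinned one-to-one to processors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fpga:
    name: str
    loaded: Optional[str] = None  # kernel currently in the reconfigurable region
    busy: bool = False


@dataclass
class FpgaPool:
    devices: List[Fpga]
    reconfig_ms: float = 10.0       # hypothetical cost of one partial reconfiguration
    total_reconfig_ms: float = 0.0  # running total of time spent reconfiguring

    def acquire(self, kernel: str) -> Optional[Fpga]:
        """Hand out a free device for `kernel`, reconfiguring only if needed."""
        free = [d for d in self.devices if not d.busy]
        if not free:
            return None  # pool exhausted; caller must wait or fall back to CPU
        # Prefer a free device that already holds the requested kernel.
        device = next((d for d in free if d.loaded == kernel), free[0])
        if device.loaded != kernel:
            device.loaded = kernel
            self.total_reconfig_ms += self.reconfig_ms
        device.busy = True
        return device

    def release(self, device: Fpga) -> None:
        """Return a device to the pool; its kernel stays loaded for reuse."""
        device.busy = False


if __name__ == "__main__":
    pool = FpgaPool([Fpga("fpga0"), Fpga("fpga1")])
    d = pool.acquire("dnn")
    pool.release(d)
    pool.acquire("dnn")            # reuses fpga0 without reconfiguring
    print(pool.total_reconfig_ms)  # one reconfiguration so far
```

In this sketch the cost of switching workloads is a single number, so the benefit of fast partial reconfiguration is easy to see: the smaller `reconfig_ms` is, the more freely the pool can be reassigned among applications.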

There has been significant attention in recent years to the role of GPUs in data center acceleration, but pooled FPGAs bring a substantial advantage in power efficiency while still approaching the performance capabilities of GPU-based accelerators. In machine learning, in particular, it seems plausible that GPUs would be used for the “training” tasks while FPGAs could shine at the “inferencing” portion, which is where the broader deployment and scaling is required.

Xilinx is claiming substantial performance-per-watt savings over traditional server CPUs for a number of typical hyperscale workloads. The claimed advantage ranges from 16x for machine learning inference, through 23x for networking vSwitch and 34x for SQL queries for data analytics, to 40x for both video transcode and storage compression (more datapath-related applications). The company goes further to claim advantages over rival Intel/Altera’s offering as well – claiming a significant (4x-6x) efficiency advantage over Arria 10, and a 2x-3x advantage over the upcoming Stratix 10 family. Xilinx claims that much of this advantage comes from more efficient arithmetic on reduced-precision fixed-point operations, resulting from compromises Altera made in supporting floating-point. While these claims (like any vendor’s competitive benchmarks) should be taken with a grain of salt, the reduced-precision fixed-point argument is interesting because machine-learning inference tasks rely heavily on 8-bit arithmetic operations. Perhaps we’ll see an AI-specific FPGA at some point with scores of hardened 8×8 multipliers?
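For readers unfamiliar with reduced-precision inference, the following minimal sketch shows what an 8-bit fixed-point dot product looks like. It is a generic illustration of symmetric quantization with a shared scale factor, not the arithmetic of any particular Xilinx library. The function names and the 1/64 scale are assumptions chosen so the example values are represented exactly.

```python
def quantize(values, scale):
    """Map floats to signed 8-bit integers, clamping to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(a_q, b_q):
    """Integer multiply-accumulate, the work an array of 8x8 multipliers does."""
    return sum(x * y for x, y in zip(a_q, b_q))


def quantized_dot(a, b, scale_a, scale_b):
    """Dot product computed in 8-bit integers, rescaled to a float at the end."""
    return int8_dot(quantize(a, scale_a), quantize(b, scale_b)) * scale_a * scale_b


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25]
    inputs = [1.0, 0.5, -2.0]
    scale = 1 / 64  # hypothetical scale: one integer step equals 1/64

    exact = sum(w * x for w, x in zip(weights, inputs))
    approx = quantized_dot(weights, inputs, scale, scale)
    print(exact, approx)
```

All of the multiplications in `int8_dot` operate on small integers, which is why narrow hardened multipliers, rather than floating-point units, are the resource that matters for this kind of workload. With real weights the quantized result only approximates the floating-point one, and values outside the representable range are clamped.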

It will be interesting to watch as Xilinx positions itself against Intel/Altera in the data center space. With Intel’s long-time dominance, Xilinx has to be seen as the “challenger” in any scenario. But FPGAs are a disruptive force in the data center, and part of the disruption could occur in the supply chain. Clearly, we expect to see some impressive offerings from Intel/Altera – some of which were announced at the recent Supercomputing show. And, because wins in the server/data center arena are likely to be large ones, the early reactions of the major customers will probably foreshadow the overall direction of the market.