FPGAs Cool Off the Datacenter

The problem is programming.

If it were just a straight-up race to see what kind of chip delivers the most processing for the least power, FPGAs would have won long ago. A custom hardware version of just about any algorithm you can name, carefully optimized for FPGA LUT fabric, will run much faster, with less latency and far less power, than anything you can do with any conventional processor.

But what about GPUs?

Yes, that includes GPUs – especially the power part. And these days, it’s getting more and more to the point that power is the ONLY part. In big-iron computing applications such as large datacenters, you could always add more cores, servers, or processors. But you can’t add more processing if you can’t pipe more power into the building or get more heat out. Many datacenters are running up against exactly that issue right now. That’s why you see huge server farms constructed where power is cheap and abundant. And, for the companies that need giant server farms, power is usually the largest business expense.

The power crunch basically eliminates GPUs from the hunt as far as long-term solutions for datacenter throughput. While GPUs can pack more processing into a smaller space than their CPU counterparts, they both share the same coulomb-guzzling von Neumann architecture, so any power savings from the CPU-to-GPU conversion are simply artifacts of integration.

The key for the future, then, is to crank out as much processing as possible with the least amount of total power. That’s where FPGAs shine. Because an algorithm can be reduced to an optimized hardware implementation in an FPGA, the only thing that could go faster would be a custom non-reprogrammable chip (like an ASIC or ASSP), which would be fine as long as you never, ever needed to run a different algorithm.

The problem is programming.

This idea of programmable logic as a processing panacea is not a new one, it turns out. For decades, there has been a dedicated community working on “reconfigurable computing” – which meant using FPGAs (or FPGA-like fabric) to accelerate algorithms in supercomputing applications. Several of the leading supercomputer companies (such as Cray) built machines around this idea over a decade ago. They put a co-processing card in their machines with a bunch of FPGAs and a high-speed (for the time) connection to the processor and the memory. When fine-tuned for specialized algorithms (think DNA sequencing, oil and gas exploration, and financial analysis), these machines could make bank. Unfortunately, “fine tuning” was accomplished by hiring teams of FPGA engineers to spend months hand-crafting RTL code for the FPGA-based accelerators.

For those who didn’t have teams of FPGA experts available to spend months on each individual algorithm, the problem was programming. The people who knew the complex, acceleration-needing algorithms tended to be software savvy. The people with the hardware expertise to build optimized FPGA-based accelerators were hardware engineers. This created the uncomfortable situation where the interface between those two groups of people was exactly at the nexus of complexity – the subtle nuances of these esoteric high-performance algorithms.

This week, at the Supercomputing conference, we may be witnessing the beginning of the long-awaited breakthrough. The two major FPGA players have both chosen the event to make important announcements in the area of FPGA-based and FPGA-accelerated computation. We have been expecting some key announcements from Altera. The company has been making an obvious strategic play for the datacenter for years now. They long ago announced that they were beefing up their floating point support – a known weakness for FPGAs in compute acceleration. They also have been talking for a while about their support for an OpenCL flow – which allows code written in the increasingly popular parallel-friendly GPU-inspired C dialect to be targeted to Altera FPGAs.

Did you catch that last part? It’s about the programming problem.

Xilinx, on the other hand, has been showing a broader play. Delivering on their “All Programmable” slogan, the company has been systematically building design flows for various application domains – beginning with domain-specific capture methods and funneling down into the company’s powerful new Vivado implementation tools. Xilinx divides the world into two major groups – hardware engineers and software engineers. For hardware engineers, Vivado is the ticket – with a comprehensive suite of EE-friendly design, analysis, and implementation tools.

For software engineers, the company has launched the “SDx” brand – for software-defined (whatever). At Supercomputing this week, the “whatever” is datacenter and computation acceleration applications, and the official title is SDAccel. The company is announcing a development environment that supports OpenCL, C, and C++ for software engineers using FPGAs. Now we have a horse race, and that’s a very good thing.

The part of that announcement that provides the most interesting head-to-head comparison vs archrival Altera is the OpenCL support. Of course, FPGA-based computing is not all about OpenCL, but that’s a very good place to examine some key differences in the approach of the two FPGA companies.

Altera certainly got off to an early lead in the OpenCL arena, and they are now in a maturing phase. Altera’s big announcement at Supercomputing is with partners IBM and Nallatech: an acceleration platform that combines Altera FPGAs with POWER8 CPUs via IBM’s Coherent Accelerator Processor Interface (CAPI). This enables shared virtual memory between the FPGA and the processor, and it can take advantage of Altera’s OpenCL environment for development of accelerators. Long-time FPGA acceleration experts Nallatech are providing a development card that they claim is the industry’s first CAPI FPGA accelerator card.

Xilinx’s strategy takes advantage of the company’s high-level synthesis (HLS) technology. This could be a key advantage vs Altera, as HLS is capable of extracting maximum benefit from the FPGA’s hardware programmability. Altera’s OpenCL implementation creates what one might think of as a GPU (of sorts) on an FPGA – with processing elements analogous to GPU cores that execute software kernels in parallel. This solution should be easy to program, as it mimics many elements of a GPU, but it will most likely not achieve the same performance and quality-of-results as Xilinx’s HLS-based approach. HLS has been repeatedly shown to deliver hardware architectures on a par with what skilled experts can create with hand-coded, manually-optimized RTL.
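For readers who have not written a GPU-style kernel, here is a rough sketch, in plain C++ rather than OpenCL C, of the data-parallel model that description leans on. The names vec_add_kernel and launch are hypothetical, and the sequential loop merely stands in for whatever parallel processing elements the hardware provides. It illustrates the programming model only and makes no claim about how Altera's tools actually map kernels onto the fabric.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One "work-item": computes a single output element, identified by its
// global id. In a GPU-style model, many of these run at the same time.
void vec_add_kernel(std::size_t gid, const float* a, const float* b, float* out) {
    out[gid] = a[gid] + b[gid];
}

// Hypothetical host-side launcher. The loop is a sequential stand-in for
// dispatching one kernel instance per element to parallel hardware.
std::vector<float> launch(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t gid = 0; gid < a.size(); ++gid) {
        vec_add_kernel(gid, a.data(), b.data(), out.data());
    }
    return out;
}
```

The point is that each kernel instance is independent of the others, so the platform is free to run as many of them side by side as it has resources for.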

The “most likely” qualifier is because Xilinx is breaking new ground packaging HLS technology in a way that can be understood and used by software engineers. Traditionally, HLS has been an incredible power tool for hardware experts. If you understand concepts like loop unrolling, pipelining, resource sharing, and datapath/controller/memory architecture, you can rapidly explore vast numbers of hardware micro-architectures, selecting the one that delivers the best solution for your application. Then, HLS can quickly generate the RTL code that synthesizes down into your chosen architecture. But encapsulating all that power into something that appears to software engineers as a ‘compiler’ is a substantial task. Picture an airplane with a “Fly me to Dallas” button instead of normal cockpit controls and instruments that a pilot would understand.
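To make "loop unrolling" and "pipelining" a little more concrete, here is a minimal sketch of the kind of C++ a hardware engineer might hand to an HLS tool. The pragmas use Vivado HLS directive syntax, while the function names and the array length kLen are invented for illustration. An ordinary C++ compiler ignores the pragmas (possibly with a warning), so the same source also runs as plain software.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical problem size, chosen only for illustration.
constexpr std::size_t kLen = 64;

// Pipelined version: asks the tool to start a new loop iteration every
// clock cycle (initiation interval of 1) on one multiply-accumulate datapath.
int dot_product(const int a[kLen], const int b[kLen]) {
    assert(a != nullptr && b != nullptr);
    int acc = 0;
    for (std::size_t i = 0; i < kLen; ++i) {
#pragma HLS PIPELINE II=1
        acc += a[i] * b[i];
    }
    return acc;
}

// Unrolled version: asks the tool to replicate the loop body four times,
// trading more multipliers (area) for fewer loop iterations.
int dot_product_unrolled(const int a[kLen], const int b[kLen]) {
    assert(a != nullptr && b != nullptr);
    int acc = 0;
    for (std::size_t i = 0; i < kLen; ++i) {
#pragma HLS UNROLL factor=4
        acc += a[i] * b[i];
    }
    return acc;
}
```

Both functions compute the same result. What changes is the micro-architecture the tool is asked to build, which is exactly the kind of exploration described above, and exactly the kind of decision a software engineer may not want to make.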

Since Xilinx is riding the HLS horse, they can support several languages (C, C++, OpenCL) all at the same time, with the same implementation back-end. That makes their solution potentially broader, as there are certainly vast application areas that use C and C++ rather than OpenCL.

Taking the HLS route also presented Xilinx with another challenge, however. In order to create a “software-like” development and runtime environment, several pieces of the normal FPGA flow had to be improved. First, for testing and debugging code, the normal HLS, logic synthesis, place-and-route, timing-closure, device programming cycle would be far too clunky for the rapid-iteration development software engineers are accustomed to. To get around this, Xilinx created a software environment that would be comfortable to any Eclipse-using programmer, and then set up their HLS tool to spit out software-executable cycle-accurate versions of the HLS architecture. This allows the “accelerated” algorithm to be executed completely in software for debug and testing purposes – without the long logic synthesis and place-and-route runs required for the final hardware-based implementation.

To get around the run-time issues, Xilinx deployed their partial-reconfiguration technology. This allows the FPGA to be re-configured with a different algorithm without interrupting system execution. This is critical in datacenter applications where a server cannot be arbitrarily taken offline in order to process a normal, full-FPGA configuration cycle. Partial reconfiguration allows the FPGA to keep its IOs live and to keep some baseline communication with the applications processor while it quickly swaps in new acceleration bitstreams.

Xilinx claims that SDAccel can deliver up to 25x better performance-per-watt compared with CPUs or GPUs. If the software development and deployment environment is good enough to realize even a fraction of that potential in the datacenter, it could represent the biggest revolution ever in managing datacenter power.
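A back-of-the-envelope sketch shows why performance-per-watt is the number that matters when the building's power feed is the hard limit. The function names and every figure fed to them are hypothetical. Only the 25x ratio comes from the Xilinx claim, and that is an "up to" figure.

```cpp
#include <cassert>

// Throughput achievable when the facility power budget is the fixed constraint.
double throughput_at_budget(double budget_watts, double ops_per_watt) {
    assert(budget_watts >= 0.0 && ops_per_watt >= 0.0);
    return budget_watts * ops_per_watt;
}

// Power needed to sustain a fixed target throughput.
double watts_for_throughput(double target_ops, double ops_per_watt) {
    assert(ops_per_watt > 0.0);
    return target_ops / ops_per_watt;
}
```

With a made-up baseline of 2 operations per watt and a 1,000-watt budget, throughput_at_budget(1000.0, 2.0) gives 2,000 operations. At 25 times the efficiency (50 operations per watt), watts_for_throughput(2000.0, 50.0) gives 40 watts for the same work, or the original budget buys 25 times the throughput.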

The fact that the market is heating up and getting more competitive just underscores that potential, and it is certainly a trend to watch over the next few years. With Altera and Xilinx now squaring off against each other to win compute acceleration sockets, and with Intel announcing that they, too, plan to produce versions of their Xeon processor with FPGAs inside, it is clear that the heavy-hitters think FPGAs could revolutionize computation.

IF… we really can solve that programming problem.