At the Gigaom Structure 2014 event last week, Intel’s Diane Bryant announced that Intel is “integrating [Intel’s] industry-leading Xeon processor with a coherent FPGA in a single package, socket compatible to [the] standard Xeon E5 processor offerings.” Bryant went on to say that the FPGA will provide Intel customers “a programmable, high performance coherent acceleration capability to turbo-charge their algorithms” and that industry benchmarks indicate FPGA-based accelerators can deliver >10x performance gains, with Intel claiming an additional 2x on top of that thanks to a low-latency coherent interface between the FPGA and the processor.
If we did our math right, Intel is implying that an FPGA could boost the speed of a server-based application by somewhere in the range of 20x.
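For anyone who wants to check that math, here is a minimal sketch. It assumes the two gains are independent and simply multiply, which is how we read Intel's statement; the function name `combined_speedup` is ours, not Intel's. Because the benchmark figure is quoted as ">10x", the result is better read as a floor on the claim than as a measured number.

```python
def combined_speedup(*factors: float) -> float:
    """Multiply independent speedup factors into one overall factor."""
    total = 1.0
    for factor in factors:
        if factor <= 0:
            raise ValueError("speedup factors must be positive")
        total *= factor
    return total


# Figures quoted above (assumed to compound multiplicatively).
fpga_gain = 10.0      # lower bound from the cited industry benchmarks
coherent_gain = 2.0   # Intel's claimed additional gain from the coherent link

print(f"implied overall speedup: {combined_speedup(fpga_gain, coherent_gain):.0f}x")
```

Note that this applies only to the portion of a workload that actually runs on the accelerator; the rest of the application sees no such gain.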
At almost the same time, Microsoft announced a system it calls “Catapult” (which apparently has no connection whatsoever to the very closely related algorithmic synthesis technology from Calypto, Inc. – which oddly bears exactly the same name). Microsoft’s Catapult, described in a paper titled “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services,” achieved a reported 95% increase in Bing search engine performance, with only a 10% increase in power consumption. Yep, pairing FPGAs (and most definitely Altera FPGAs, in this case) with traditional processors nearly doubled the performance of a traditional heavy-iron server task and, by those numbers, improved its performance per watt by roughly 77%.
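The power-efficiency side of those numbers is easy to work out. The sketch below takes the two reported percentages at face value and treats "performance" as throughput per server; the helper name `efficiency_gain` is ours and does not come from the Microsoft paper.

```python
def efficiency_gain(perf_increase: float, power_increase: float) -> float:
    """Return the ratio of new to old performance per watt.

    Both arguments are fractional increases (0.95 means +95%).
    """
    if power_increase <= -1.0:
        raise ValueError("power cannot drop by 100% or more")
    return (1.0 + perf_increase) / (1.0 + power_increase)


# Reported figures: +95% performance for +10% power.
throughput_ratio = 1.0 + 0.95
perf_per_watt_ratio = efficiency_gain(0.95, 0.10)

print(f"throughput:           {throughput_ratio:.2f}x")
print(f"performance per watt: {perf_per_watt_ratio:.2f}x")
```

That works out to about 1.95x the throughput and about 1.77x the performance per watt: close to double on the first count and somewhat short of it on the second.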
Well, who out there didn’t see this coming? Anyone? Anyone?
With data center power consumption estimated at somewhere between one and ten percent of the entire world’s electricity use (and growing fast), doing more computation with less energy is a problem with enormous economic and ecological stakes. Today, data centers are built based on access to cheap power, and the size and throughput of those data centers are typically limited by how much power can be brought into the building and how much heat can be taken out. Companies like Microsoft, Google, Facebook, and eBay clearly would be highly motivated to both crank up the MIPS and lower the electric bill.
At the same time, Moore’s Law, after a nearly fifty-year run, is most definitely running out of gas. First, we hit the power wall on single-core processors, reaching a point where clocking the chips faster ran up the power more than it improved performance. Then we went to two, four, and more cores, and, finally, we’re looking at wider instructions and data busses to compensate for the lack of continued progress in the underlying semiconductor processes. Simply waiting for better silicon to solve the data center’s power woes is not a viable option.
Just about everyone who reads these pages understands that FPGAs offer the potential for dramatically increased compute performance combined with much lower power consumption. For specialized algorithms, an FPGA-based hardware implementation offers the rewards of fine-grained parallelism – lower latency, higher throughput, and much lower power.
Of course, everyone who already understands the benefits of FPGA-based compute acceleration also knows the single biggest obstacle to widespread adoption: the programming model. Traditional von Neumann processors and their accompanying ecosystem have evolved to the point of being drop-dead easy to program. Slap down a few lines of C, C++, or any other popular language, crank up an open-source compiler, and your computer is computing in no time.
With FPGAs, getting the gates to do your bidding is a substantially greater challenge – one that has even provided a lucrative livelihood for many of us. Converting a complex algorithm to an efficient custom hardware architecture, then describing that architecture in a hardware description language, then simulating, synthesizing, and placing-and-routing the resulting design can get you to the point … where you now have hundreds of annoying timing violations to sort out. An EE degree, a fair amount of experience, and a few months of free time are pretty much minimum requirements for making efficient use of FPGA fabric to accelerate a single high-performance algorithm, and most of the folks writing complex algorithms for cloud and data center applications don’t have a lot of extra time and energy to pick up that kind of expertise.
This programming problem has not escaped the attention of the folks who make FPGAs, of course. They’ve been toiling away for years trying to simplify the process of programming FPGAs. Today, the state of the art is best represented by three primary approaches: model-based design, high-level synthesis, and parallel programming languages like OpenCL. All of these approaches have merit for different types of problems and for different programmer skill sets. None of them has reached anything like the robustness required for the general programming public to be able to efficiently take advantage of FPGA co-processing.
So, why would Intel ever buy Altera?
(Note: We have no indication that Intel has any actual plans to buy Altera, so we’re speculating here.)
Intel has arguably the most advanced semiconductor processes in the world, and those processes have historically been applied to making high-performance processors for PCs and servers. These days, the PC market is waning, and Intel has failed to capture any meaningful share of the exploding mobile and tablet market, which is dominated by lower-power ARM architecture processors. That leaves Intel with the server market, which (thanks to the reliance of the mobile, cloud, and emerging IoT markets on giant server farms) is growing rapidly.
However, as we mentioned above, power is the primary limiting factor in the global data center build-out, and ARM is trying to get into the server game by taking advantage of its comparatively lower-power processor architectures. This poses a significant risk to Intel’s domination of the racks. With PCs on the decline and the data center possibly up for grabs, Intel needs to do something.
What Intel needs is a game-changing answer to server power efficiency, and the best place to look for that is in FPGAs.
Of course, Intel could make their own FPGAs or FPGA fabric (integrated in the same package or potentially even on the same die as their processors), but that doesn’t solve the problem. The key to success with FPGA technology is tools, not fabric. And, if Intel put every engineer in the company to the task of developing FPGA tools for the next decade, they would not be able to match what Altera and Xilinx have today. Robust FPGA tools require tens of thousands of user-generated designs to botch their way through the tool flow, and no amount of careful engineering development can replace that experience-based tool evolution.
Furthermore, what Altera and Xilinx have today is (as we mentioned earlier) not yet even remotely up to the task of smoothly compiling high-performance server-based algorithms into a form that will efficiently execute on hybrid processor/FPGA heterogeneous computing servers. They have the bare bones of a few marginally workable solutions. Of course, as this recent announcement shows, Intel could partner with Altera or Xilinx and hope that those companies give enough attention to the server space to pull it off, but with the perpetual lure of the lucrative comms space constantly distracting the FPGA companies from the server world’s problems, that crucial attention is most definitely not guaranteed.
This announcement is certainly not Intel’s first warning shot about FPGAs, heterogeneous compute acceleration with FPGAs, or partnering with companies like Altera. A few years back, Intel launched another device family, the E6x5C, with an Atom processor and an Altera Arria FPGA sharing the same package, connected by PCIe.
This new announcement bumps the processor component of that up to Xeon land, and it bumps the ever-so-critical FPGA-to-processor communication channel up from PCIe to the coherent QuickPath Interconnect (QPI) – reportedly capable of up to 25 Gbps communication at very low latency. As one can see from the Bing/Microsoft paper (or as many of us know from traumatic personal experience), the architecture for passing and sharing data between processor, FPGA, and memory is the single most important feature (and potential bottleneck) of any heterogeneous computing platform with FPGA fabric.
Intel is experimenting and learning with other pieces of the FPGA puzzle as well, of course. After dipping their toes in the water by partnering with smaller FPGA suppliers Achronix and Tabula, fabricating devices for those companies on the 22nm Tri-Gate (FinFET) process, the company stepped up to a manufacturing partnership with Altera for the upcoming Stratix 10 FPGA family, based on Intel’s 14nm Tri-Gate process. This is a critical (and vastly under-appreciated) engineering task where both the FPGA fabric and the semiconductor process must be adapted and evolved to work together. You can’t just slap any old FPGA fabric on a cutting-edge semiconductor process and expect it to work, and you conversely can’t take just any semiconductor process and succeed – even with a proven FPGA fabric. Both parts have to meet and meld in the middle.
For the record, Intel isn’t saying which FPGA company they are partnering with for the new heterogeneous Xeon devices. Both Altera and Xilinx are tight-lipped as well, so we’re gonna put our well-considered bet on Altera. Either way, the curtain will be pulled back soon enough, because Intel says that the end customers will need to use the FPGA company’s tools and design flow in order to take advantage of the FPGA portion of the processor. So, that conversation would be something like this:
Facebook: Hey Intel, we’d like to use your new heterogeneous Xeon/FPGA processors.
Intel: OK, you’ll need to get FPGA tools and support from the vendor.
Facebook: Which vendor is that?
Intel: We’re not saying…
OK, maybe not exactly, but – that’s one secret that won’t last long.
It’s important to note that Intel is not the first to plan mass production of heterogeneous processors with FPGAs. Xilinx has been attacking that market for a few years now with their Zynq family – which incorporates ARM processors with Xilinx FPGA fabric. Altera is aggressively giving chase with their own ARM-based FPGA SoC families. While Zynq is certainly not a data-center-class processor, the distance from today’s Zynq to one that would be a viable server-class solution isn’t huge, and the expertise and tool flow that Xilinx is accumulating with Zynq would come in very handy in a fight over low-power server dominance.
Even though Intel soft-pedaled the Xeon/FPGA announcement, the potential implications are enormous. If the tools can get good enough (and that’s a big IF), we are looking at the displacement of the von Neumann processor as the dominant computing architecture for the majority of the world’s data centers, and therefore the majority of the world’s computation. And, it could happen during the most rapid expansion of global computing power in history.
Sure, Intel could continue to defend its turf against insurgent ARM-based architectures in the single most important market in the world by simply partnering with a few smaller companies (like Altera) for the most critical enabling technologies. Intel could hope that those partners will spend enough time and energy solving the tool flow problem to make that discontinuous leap in the global computing architecture possible and practical.
Somehow that scenario doesn’t seem like the most likely to me.