May 25, 2016
Cooling Off Accelerated Computing
Intel/Altera Attacks Memory Bottleneck
A battle is on to claim supremacy in the next generation of computing. Alliances are forming, battle plans are being forged, and armies are amassing.
The enemy, quite simply, is power. No matter what kind of computing we’re doing, from IoT edge devices to massive data center and high-performance computing (HPC) server applications, the single limiting factor is energy consumption. Our IoT device may have to survive indefinitely on minuscule doses of harvested energy. Our mobile device needs to live an entire day on a single battery charge. Our server farm has to achieve the maximum possible throughput, limited by the amount of energy the power company can provide and the amount of heat we can pump out of the building.
In the US alone, data centers use an estimated 100 billion kilowatt hours per year. Depending on the figures you use, that works out to 2-3% of total US electricity. Other estimates put global total IT energy consumption (not just data centers) at around 10% of the world’s electricity.
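As a quick sanity check on those figures, the sketch below works backward from the two numbers quoted above (100 billion kWh and a 2-3% share) to the total US electricity consumption they imply. It uses no data beyond what is in the paragraph; the function name `implied_total_kwh` is just an illustrative helper.

```python
# Back-of-the-envelope check using only the figures quoted in the article.
DATA_CENTER_KWH = 100e9  # estimated annual US data center consumption, in kWh


def implied_total_kwh(part_kwh, share):
    """Return the total consumption implied if part_kwh is `share` of it."""
    return part_kwh / share


for share in (0.02, 0.03):
    total = implied_total_kwh(DATA_CENTER_KWH, share)
    print(f"{share:.0%} share implies a US total of {total / 1e9:,.0f} billion kWh")
```

In other words, the 2-3% range corresponds to a national total somewhere between roughly 3,300 and 5,000 billion kWh per year, which is why the percentage shifts depending on whose total you use.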
This is a big deal.
Long ago, we stopped optimizing for performance in our computing hardware. The energy consumption for the fastest processors was increasing exponentially, and it became more efficient to parallelize processors than to make monolithic cores run faster. From that point on, the battle was not for performance but for power - the architecture that could squeeze the most computation out of the fewest coulombs would win the race. In short, when it came to designing computers, we needed to stop designing Ferraris and start designing Priuses. Efficiency is everything.
As electronic engineers, we also have to come to grips with the reality that today, in some way, we are all designing computers. With a multi-tiered computing architecture that reaches from sensors in IoT-edge and mobile devices through the fabric of the internet to server-based cloud, storage, and high-performance computing systems - and back again, it is almost impossible to be working on an electronic system design that isn’t part of that global computing infrastructure.
If, then, we are all computer engineers, and computation is all about optimizing energy efficiency, we need to take a careful collective look at the basic architecture for computation. Historically, the dominant piece of that architecture is the von Neumann machine - an elegant piece of digital design that maximizes computational complexity over silicon area. Because of the ability to program with software, a von Neumann machine can complete more complex operations over a given silicon area than anything else ever devised.
That would be wonderful if we were still trying to optimize for minimum silicon area. But, unfortunately, we are not. As Moore’s Law has given us more and more effective silicon area to work with, silicon has become almost free. At the same time, toggling the billions of transistors we fabricate in that silicon has put power at a premium.
Nobody ever claimed that von Neumann was the most power-efficient architecture. And it most definitely is not. For any given application, we could design custom hardware that can perform the algorithm many orders of magnitude more efficiently than a von Neumann machine. That is why we have ASICs.
But ASICs do nothing to solve the efficiency problem of von Neumann for general-purpose computing. We can’t afford to design a custom chip for every single algorithm in the world. For that, we have the gift of FPGA fabric. FPGA fabric combines the computational efficiency of ASIC with the programmability of software (well, sort of). But von Neumann has enabled us to build software algorithms of such enormous complexity that we couldn’t hope to fit them on any FPGA we will be able to create in the foreseeable future.
This leads us to the conclusion that our ultimate computing architecture must be a heterogeneous machine that combines von Neumann (for complexity management) with FPGA-like fabric (for computational energy efficiency). We won’t do the math here, but let’s say that, with such a machine, we should be able to gain multiple orders of magnitude in computational energy efficiency.
The devil, however, is in the architectural details.
The top of the “details” list clearly lies in programming: how do we program these new heterogeneous von Neumann/FPGA machines? That particular detail is by far the most important and is far, far from solved. Whoever can deliver the dream of legacy software running efficiently in an accelerated heterogeneous environment will rule the world and all others will bow at their feet.
Before that, however, we need to build the machine. Of course, we can construct such a computer using off-the-shelf components available today: processors, FPGAs, DRAMs, etc. Many have done this already and have reported impressive results. Microsoft’s Catapult project yielded compelling gains in efficiency of search algorithms with just such an approach. But there are significant integrations and optimizations we can do that will dramatically improve upon those already-impressive results.
One of the key areas to attack is memory. If you look at a typical computing machine, a giant chunk of the energy budget is spent shuffling data back and forth, to and from DRAM. With modern, high-bandwidth memory architectures, that process involves a lot of serializing and deserializing, pumping high-frequency signals through PCB traces and connectors, and energizing DRAM cells. Since today’s high-performance applications are voracious consumers of memory bandwidth, a lot of juice is consumed by memory.
One of the immediately visible benefits of Intel’s acquisition of Altera is its focus on key projects that take advantage of technology from both sides. This week, the company announced details of the upcoming Stratix 10 MX devices (we wrote about the pre-announcement back in November). These devices combine Altera’s FPGA technology with Intel’s EMIB (Embedded Multi-die Interconnect Bridge) technology and SK hynix HBM2 (High Bandwidth Memory, 2nd generation) to provide FPGAs with basically insane amounts of high-efficiency, high-bandwidth memory.
By the numbers, you can have a device with 2 million FPGA logic elements and up to four stacks of HBM2 memory at 4GB each, totaling 16GB (that’s gigabytes). Each of those four stacks has a bandwidth of 256 GBps, yielding an aggregate bandwidth of 1 terabyte per second with 16 gigabytes of capacity. The HBM2 stacks are connected to the FPGA via Intel’s EMIB, which provides an extremely low-latency connection between FPGA fabric and memory. Maybe even more important, this low-latency connection doesn’t use the FPGA’s IO and transceiver resources, leaving those available for other purposes.
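The arithmetic behind those headline numbers is simple enough to sketch. The per-stack capacity and bandwidth below come straight from the figures above; the "full sweep" time is an added illustration that assumes, unrealistically, that peak bandwidth is sustained for the entire transfer.

```python
# Per-stack figures as quoted in the article for the largest configuration.
STACKS = 4
GB_PER_STACK = 4        # capacity of each HBM2 stack, in gigabytes
GBPS_PER_STACK = 256    # bandwidth of each HBM2 stack, in gigabytes per second

total_capacity_gb = STACKS * GB_PER_STACK
total_bandwidth_gbps = STACKS * GBPS_PER_STACK

# Idealized assumption (not from the article): peak bandwidth is sustained
# while reading every byte of in-package memory exactly once.
full_sweep_ms = total_capacity_gb / total_bandwidth_gbps * 1000

print(f"Capacity:   {total_capacity_gb} GB")
print(f"Bandwidth:  {total_bandwidth_gbps} GB/s (about 1 TB/s)")
print(f"Full sweep: {full_sweep_ms:.3f} ms at peak rate")
```

Under that idealized assumption, the FPGA fabric could touch every byte of its 16GB in about 16 milliseconds, which gives a feel for what "1 terabyte per second" means in practice.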
The company has just released details of the family, which includes devices with 1M-2M FPGA logic elements, 4-16GB of HBM2 memory, and 48 dual-mode 56 Gbps PAM-4 / 30 Gbps NRZ transceivers. The SiP HBM2 memory should truly be a game changer with these devices, as it gives incredible bandwidth and power efficiency, while freeing up those FPGA transceivers for other tasks.
Meanwhile, back on the battle lines, a number of companies (including ARM, IBM, Qualcomm, Xilinx, and others) hoping to attack Intel’s dominant position in the data center have formed an alliance to define a standard for communication between processors and accelerators (such as FPGAs). While this initiative is in the VERY early stages (it appears to have no technology behind it just yet, but rather represents an agreement between several players to work cooperatively on a standard), it does seem like a scramble is underway to get some kind of foothold against the Intel/Altera juggernaut in FPGA-based compute acceleration. The next few years will be interesting to watch.
Posted on May 27, 2016 at 9:17 AM
This statement is key: "One of the key areas to attack is memory. If you look at a typical computing machine, a giant chunk of the energy budget is spent shuffling data back and forth, to and from DRAM."
This is a characteristic of RISC CPUs and RISC was created because compilers could not efficiently compile CISC.
On top of that, compilers compile to an intermediate language which is JIT compiled to machine code at run time.
JIT reads byte code from data memory and writes machine code to the data cache; the machine code is then written to memory and loaded into the instruction cache, creating a lot of memory traffic that does not contribute to the application at all.
Accelerators in FPGAs typically stream data to the chip, where the "application" processing algorithm is already installed (no JIT).
The FPGA does not have an ISA, JIT, or cache.
However, an FPGA can be designed to execute C statements and eliminate the RISC, JIT, etc.
Data is read from memory, processed, then results are written to memory so the end result of processed data being in memory is done efficiently.
Solution: Eliminate Intermediate Language, JIT, Cache, and execute language statements directly in the FPGA.
Posted on June 04, 2016 at 6:34 AM
Dear Kevin,
An excellent article, clearly convincing us that the dominance of heterogeneous computing is unavoidable, although the von Neumann syndrome creates masses of problems. Because of the poor state of the art of FPGA application development, we still urgently need von Neumann (vN). Although FPGAs need up to 2 orders of magnitude more transistors, migrations yield speed-ups of up to several orders of magnitude. This Reconfigurable Computing Paradox demonstrates how unbelievably bad the vN paradigm really is:
But we have another machine paradigm, suitable for running FPGAs: the Xputer, which uses multiple data counters instead of a program counter. See
In a Mead-&-Conway-type environment, already in 1984, using the Xputer gave us a speed-up of 15,000 when re-implementing the CMOS design rule check, using PLAs instead of FPGAs. See the PISA Project: