May 25, 2016

Cooling Off Accelerated Computing

Intel/Altera Attacks Memory Bottleneck

by Kevin Morris

A battle is on to claim supremacy in the next generation of computing. Alliances are forming, battle plans are being forged, and armies are amassing. 

The enemy, quite simply, is power. No matter what kind of computing we’re doing, from IoT edge devices to massive data center and high-performance computing (HPC) server applications, the single limiting factor is energy consumption. Our IoT device may have to survive indefinitely on minuscule doses of harvested energy. Our mobile device needs to live an entire day on a single battery charge. Our server farm has to achieve the maximum possible throughput, limited by the amount of energy the power company can provide and the amount of heat we can pump out of the building. 

In the US alone, data centers use an estimated 100 billion kilowatt hours per year. Depending on the figures you use, that works out to 2-3% of total US electricity. Other estimates put global total IT energy consumption (not just data centers) at around 10% of the world’s electricity.
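
As a quick sanity check on those estimates (the total-consumption figure in the sketch below, roughly 3,900 billion kWh per year for the US, is an assumption on my part, not one of the estimates above):

```python
# Back-of-envelope check of the data-center share quoted above.
# The US total (~3,900 billion kWh/year, mid-2010s) is an assumed figure.
us_datacenter_kwh_per_year = 100e9   # estimated US data-center consumption
us_total_kwh_per_year = 3.9e12       # assumed total US electricity consumption

share = us_datacenter_kwh_per_year / us_total_kwh_per_year
print(f"Data centers: {share:.1%} of US electricity")  # ~2.6%, inside the 2-3% range
```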

This is a big deal.

Long ago, we stopped optimizing for performance in our computing hardware. The energy consumption for the fastest processors was increasing exponentially, and it became more efficient to parallelize processors than to make monolithic cores run faster. From that point on, the battle was not for performance but for power - the architecture that could squeeze the most computation out of the fewest coulombs would win the race. In short, when it came to designing computers, we needed to stop designing Ferraris and start designing Priuses. Efficiency is everything.

As electronics engineers, we also have to come to grips with the reality that today, in some way, we are all designing computers. With a multi-tiered computing architecture that reaches from sensors in IoT-edge and mobile devices, through the fabric of the internet, to server-based cloud, storage, and high-performance computing systems - and back again - it is almost impossible to be working on an electronic system design that isn’t part of that global computing infrastructure.

If, then, we are all computer engineers, and computation is all about optimizing energy efficiency, we need to take a careful collective look at the basic architecture for computation. Historically, the dominant element of that architecture has been the von Neumann machine - an elegant piece of digital design that maximizes computational capability per unit of silicon area. Because it can be programmed with software, a von Neumann machine can perform more complex operations in a given silicon area than anything else ever devised.

That would be wonderful if we were still trying to optimize for minimum silicon area. But, unfortunately, we are not. As Moore’s Law has given us more and more effective silicon area to work with, silicon has become almost free. At the same time, toggling the billions of transistors we fabricate in that silicon has put power at a premium.

Nobody ever claimed that von Neumann was the most power-efficient architecture. And it most definitely is not. For any given application, we could design custom hardware that can perform the algorithm many orders of magnitude more efficiently than a von Neumann machine. That is why we have ASICs. 

But ASICs do nothing to solve the efficiency problem of von Neumann for general-purpose computing. We can’t afford to design a custom chip for every single algorithm in the world. For that, we have the gift of FPGA fabric. FPGA fabric combines the computational efficiency of ASIC with the programmability of software (well, sort of). But von Neumann has enabled us to build software algorithms of such enormous complexity that we couldn’t hope to fit them on any FPGA we will be able to create in the foreseeable future.

This leads us to the conclusion that our ultimate computing architecture must be a heterogeneous machine that combines von Neumann (for complexity management) with FPGA-like fabric (for computational energy efficiency). We won’t do the math here, but let’s say that, with such a machine, we should be able to gain multiple orders of magnitude in computational energy efficiency.

The devil, however, is in the architectural details.

The top of the “details” list clearly lies in programming: how do we program these new heterogeneous von Neumann/FPGA machines? That particular detail is by far the most important and is far, far from solved. Whoever can deliver the dream of legacy software running efficiently in an accelerated heterogeneous environment will rule the world and all others will bow at their feet.
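
To make that challenge concrete, here is a minimal sketch of what the dream might look like from the software side, built around a purely hypothetical fpga_offload decorator - nothing like it is implied above, and the fallback simply runs on the CPU; the hard, unsolved part is making the same legacy source compile into efficient FPGA fabric:

```python
# A minimal sketch of the heterogeneous-programming dream, with an entirely
# hypothetical offload API. The decorator below is a CPU-only stand-in; the
# unsolved problem is compiling unmodified legacy code to efficient FPGA fabric.
def fpga_offload(func):
    """Hypothetical decorator: compile 'func' for FPGA fabric if a toolchain and
    device are available, otherwise fall back to running it on the CPU."""
    return func  # CPU fallback only - acceleration is the part nobody has solved

@fpga_offload
def dot_product(a, b):
    # The compute-dense kernel we would like to move into the fabric.
    return sum(x * y for x, y in zip(a, b))

# The control-heavy legacy code stays on the von Neumann side, unchanged.
print(dot_product(range(1000), range(1000)))
```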

Before that, however, we need to build the machine. Of course, we can construct such a computer using off-the-shelf components available today: processors, FPGAs, DRAMs, etc. Many have done this already and have reported impressive results. Microsoft’s Catapult project yielded compelling gains in the efficiency of search algorithms with just such an approach. But there is significant integration and optimization work still to be done that can dramatically improve upon those already-impressive results.

One of the key areas to attack is memory. If you look at a typical computing machine, a giant chunk of the energy budget is spent shuffling data back and forth, to and from DRAM. With modern, high-bandwidth memory architectures, that process involves a lot of serializing and deserializing, pumping high-frequency signals through PCB traces and connectors, and energizing DRAM cells. Since today’s high-performance applications are voracious consumers of memory bandwidth, a lot of juice is consumed by memory.
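
A rough back-of-envelope comparison shows why. The per-operation energies in the sketch below are commonly cited ballpark figures for mid-2010s silicon - treat the exact values as assumptions rather than measurements:

```python
# Why off-chip memory traffic dominates the energy budget (rough illustration).
# Assumed ballpark energies, commonly cited for ~45nm-class silicon:
DRAM_READ_PJ = 640.0   # energy to fetch one 32-bit word from external DRAM
FP_ADD_PJ = 0.9        # energy for one 32-bit floating-point add on-chip

ratio = DRAM_READ_PJ / FP_ADD_PJ
print(f"One DRAM word fetch costs ~{ratio:.0f}x the energy of one FP add")
```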

One of the immediately visible benefits of Intel’s acquisition of Altera is the focus on key projects that take advantage of technology from both sides. This week, the company announced details of the upcoming Stratix 10 MX devices (we wrote about the pre-announcement back in November here) - which combine Altera’s FPGA technology with Intel’s EMIB (Embedded Multi-die Interconnect Bridge) technology, along with SK hynix HBM2 (second-generation High Bandwidth Memory), to provide FPGAs with basically insane amounts of high-efficiency, high-bandwidth memory.

By the numbers, you can have a device with 2 million FPGA logic elements and up to four stacks of HBM2 memory at 4GB each, for a total of 16GB (that’s gigabytes, with a capital B). Each of those four stacks has a bandwidth of 256 GB/s, yielding an aggregate bandwidth of roughly one terabyte per second across the 16-gigabyte capacity. The HBM2 stacks are connected to the FPGA via Intel’s EMIB, which provides an extremely low-latency connection between FPGA fabric and memory. Perhaps even more important, this low-latency connection doesn’t use the FPGA’s IO and transceiver resources, leaving those available for other purposes.
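
For the record, those headline numbers fall straight out of the per-stack figures:

```python
# The Stratix 10 MX memory arithmetic from the paragraph above.
stacks = 4
capacity_per_stack_gb = 4          # GB per HBM2 stack
bandwidth_per_stack_gb_s = 256     # GB/s per HBM2 stack

print(f"Capacity:  {stacks * capacity_per_stack_gb} GB")        # 16 GB
print(f"Bandwidth: {stacks * bandwidth_per_stack_gb_s} GB/s")   # ~1 TB/s aggregate
```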

The company has just released details of the family, which includes devices with 1M-2M FPGA logic elements, 4-16GB of HBM2 memory, and 48 dual-mode transceivers that run at 56 Gbps in PAM-4 mode or 30 Gbps in NRZ mode. The in-package (SiP) HBM2 memory should truly be a game changer for these devices, as it delivers enormous bandwidth with high power efficiency while leaving the FPGA transceivers free for other tasks.
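
To put the in-package memory in perspective, here is a quick comparison of its aggregate bandwidth against the raw line rate of all 48 transceivers combined - a simplification of mine that ignores encoding and protocol overhead:

```python
# HBM2 bandwidth vs. the raw line rate of all 48 transceivers combined.
# Ignores encoding and protocol overhead (a simplifying assumption).
transceivers = 48
line_rate_gbps = 56                                   # Gb/s each in PAM-4 mode

serial_gb_per_s = transceivers * line_rate_gbps / 8   # ~336 GB/s, raw
hbm2_gb_per_s = 4 * 256                               # 1024 GB/s in-package

print(f"All transceivers (raw): ~{serial_gb_per_s:.0f} GB/s")
print(f"In-package HBM2:         {hbm2_gb_per_s} GB/s")
```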

Meanwhile, back on the battle lines, a number of companies (ARM, IBM, Qualcomm, Xilinx, and others) hoping to attack Intel’s dominant position in the data center have formed an alliance to define a standard for communication between processors and accelerators (such as FPGAs). While this initiative is in the VERY early stages (it appears to have no technology behind it just yet, but rather represents an agreement between several players to work cooperatively on a standard), it does seem that a scramble is underway to gain some kind of foothold against the Intel/Altera juggernaut in FPGA-based compute acceleration. The next few years will be interesting to watch.

Comments:


KarlS51

Total Posts: 10
Joined: Dec 2014

Posted on May 27, 2016 at 9:17 AM

This statement is key "One of the key areas to attack is memory. If you look at a typical computing machine, a giant chunk of the energy budget is spent shuffling data back and forth, to and from DRAM."

This is a characteristic of RISC CPUs, and RISC was created because compilers could not generate efficient code for CISC.

On top of that, compilers compile to an intermediate language, which is JIT-compiled to machine code at run time.

The JIT reads bytecode from data memory and writes machine code to the data cache; the machine code is then written out to memory and loaded into the instruction cache, creating a lot of memory traffic that does not contribute to the application at all.

Accelerators in FPGAs typically stream data to the chip, where the "application" processing algorithm is already installed (no JIT).

The FPGA does not have an ISA, JIT, or cache.

However, an FPGA can be designed to execute C statements directly and eliminate the RISC CPU, JIT, etc.

Data is read from memory, processed, and the results are written back to memory, so the end result - processed data in memory - is achieved efficiently.

Solution: eliminate the intermediate language, JIT, and cache, and execute language statements directly in the FPGA.

Reiner

Total Posts: 6
Joined: Aug 2010

Posted on June 04, 2016 at 6:34 AM

Dear Kevin,
an excellent article, clearly convincing us that the dominance of heterogeneous computing is unavoidable, even though the von Neumann syndrome creates masses of problems. Because of the poor state of the art in FPGA application development, we still urgently need von Neumann (vN). Although FPGAs need up to two orders of magnitude more transistors, migrations yield speed-ups of up to several orders of magnitude. This Reconfigurable Computing Paradox demonstrates how unbelievably bad the vN paradigm really is:
http://xputer.de/RCpx/#pdx
But we have another machine paradigm suitable for running FPGAs: the Xputer, which uses multiple data counters instead of a program counter. See
http://en.wikipedia.org/wiki/Xputer...
For a Mead-&-Conway-type design environment, we already got a speed-up of 15,000 back in 1984 by using an Xputer to re-implement the CMOS design rule check, using PLAs instead of FPGAs. See the PISA Project:
http://www.fpl.uni-kl.de/PISA/index.htm...
Best regards,
Reiner
