
Cooling Off Accelerated Computing

Intel/Altera Attacks Memory Bottleneck

A battle is on to claim supremacy in the next generation of computing. Alliances are forming, battle plans are being forged, and armies are amassing. 

The enemy, quite simply, is power. No matter what kind of computing we’re doing, from IoT edge devices to massive data center and high-performance computing (HPC) server applications, the single limiting factor is energy consumption. Our IoT device may have to survive indefinitely on minuscule doses of harvested energy. Our mobile device needs to live an entire day on a single battery charge. Our server farm has to achieve the maximum possible throughput, limited by the amount of energy the power company can provide and the amount of heat we can pump out of the building. 

In the US alone, data centers use an estimated 100 billion kilowatt-hours per year. Depending on the figures you use, that works out to 2-3% of total US electricity consumption. Other estimates put total global IT energy consumption (not just data centers) at around 10% of the world’s electricity.
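As a quick sanity check on those percentages, here’s a back-of-the-envelope sketch. It assumes total US electricity consumption of roughly 4,000 terawatt-hours per year, which is a ballpark figure rather than an official statistic:

```python
# Back-of-the-envelope check of the "2-3% of US electricity" claim.
# Assumption: total US electricity consumption of ~4,000 TWh/year
# (a rough ballpark, not an official figure for any particular year).
data_center_twh = 100.0   # 100 billion kWh/year = 100 TWh/year
us_total_twh = 4000.0     # assumed US total

share = data_center_twh / us_total_twh
print(f"Data centers: {share:.1%} of US electricity")  # ~2.5%
```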

This is a big deal.

Long ago, we stopped optimizing for performance in our computing hardware. The energy consumption for the fastest processors was increasing exponentially, and it became more efficient to parallelize processors than to make monolithic cores run faster. From that point on, the battle was not for performance but for power – the architecture that could squeeze the most computation out of the fewest coulombs would win the race. In short, when it came to designing computers, we needed to stop designing Ferraris and start designing Priuses. Efficiency is everything.

As electronic engineers, we also have to come to grips with the reality that today, in some way, we are all designing computers. With a multi-tiered computing architecture that reaches from sensors in IoT-edge and mobile devices through the fabric of the internet to server-based cloud, storage, and high-performance computing systems – and back again – it is almost impossible to be working on an electronic system design that isn’t part of that global computing infrastructure.

If, then, we are all computer engineers, and computation is all about optimizing energy efficiency, we need to take a careful collective look at the basic architecture of computation. Historically, the dominant piece of that architecture has been the von Neumann machine – an elegant piece of digital design that maximizes computational capability per unit of silicon area. Because it is programmed in software, a von Neumann machine can handle more computational complexity in a given silicon area than anything else ever devised.

That would be wonderful if we were still trying to optimize for minimum silicon area. But, unfortunately, we are not. As Moore’s Law has given us more and more effective silicon area to work with, silicon has become almost free. At the same time, toggling the billions of transistors we fabricate in that silicon has put power at a premium.

Nobody ever claimed that von Neumann was the most power-efficient architecture. And it most definitely is not. For any given application, we could design custom hardware that executes the algorithm orders of magnitude more efficiently than a von Neumann machine. That is why we have ASICs.

But ASICs do nothing to solve the efficiency problem of von Neumann for general-purpose computing. We can’t afford to design a custom chip for every single algorithm in the world. For that, we have the gift of FPGA fabric. FPGA fabric combines the computational efficiency of ASIC with the programmability of software (well, sort of). But von Neumann has enabled us to build software algorithms of such enormous complexity that we couldn’t hope to fit them on any FPGA we will be able to create in the foreseeable future.

This leads us to the conclusion that our ultimate computing architecture must be a heterogeneous machine that combines von Neumann (for complexity management) with FPGA-like fabric (for computational energy efficiency). We won’t do the math here, but let’s say that, with such a machine, we should be able to gain multiple orders of magnitude in computational energy efficiency.

The devil, however, is in the architectural details.

The top of the “details” list clearly lies in programming: how do we program these new heterogeneous von Neumann/FPGA machines? That particular detail is by far the most important and is far, far from solved. Whoever can deliver the dream of legacy software running efficiently in an accelerated heterogeneous environment will rule the world and all others will bow at their feet.
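Today, that partitioning is done by hand: the branchy, complex control code stays on the processor, while the regular, compute-dense kernels get pushed into fabric. Here is a minimal conceptual sketch of that division of labor; the FpgaAccelerator class and its methods are hypothetical stand-ins rather than any vendor’s actual API, and a real kernel would be built with an HLS or OpenCL flow.

```python
# Conceptual sketch only: a hand-partitioned CPU/FPGA application.
# "FpgaAccelerator" is a hypothetical stand-in for a kernel compiled into
# FPGA fabric; it is not a real vendor API.

class FpgaAccelerator:
    """Stand-in for a fixed datapath configured into FPGA fabric."""

    def __init__(self, kernel_name):
        self.kernel_name = kernel_name   # e.g., a pre-built filter bitstream

    def run(self, batch):
        # In hardware, 'batch' would stream through a spatial pipeline with
        # no instruction fetch, no JIT, and no cache hierarchy in the way.
        return [x * 2 for x in batch]    # placeholder for the real kernel


def chunks(seq, n):
    """Split a sequence into fixed-size batches for streaming."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]


def application(samples):
    # Complex, irregular control flow stays on the von Neumann processor...
    accel = FpgaAccelerator("fir_filter")
    results = []
    for batch in chunks(samples, 1024):
        # ...while the regular, compute-dense inner loop is offloaded to fabric.
        results.extend(accel.run(batch))
    return results


if __name__ == "__main__":
    print(application(list(range(8)))[:4])   # [0, 2, 4, 6]
```

The hard, unsolved part is getting tools to discover that split automatically in legacy code, rather than having an engineer draw the line by hand.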

Before that, however, we need to build the machine. Of course, we can construct such a computer using off-the-shelf components available today: processors, FPGAs, DRAMs, etc. Many have done this already and have reported impressive results. Microsoft’s Catapult project yielded compelling gains in efficiency of search algorithms with just such an approach. But there are significant integrations and optimizations we can do that will dramatically improve upon those already-impressive results.

One of the key areas to attack is memory. If you look at a typical computing machine, a giant chunk of the energy budget is spent shuffling data back and forth, to and from DRAM. With modern, high-bandwidth memory architectures, that process involves a lot of serializing and deserializing, pumping high-frequency signals through PCB traces and connectors, and energizing DRAM cells. Since today’s high-performance applications are voracious consumers of memory bandwidth, a lot of juice is consumed by memory.
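To get a feel for the scale, here’s a deliberately rough sketch. The per-bit energy figures below are illustrative assumptions (in the spirit of commonly quoted architecture-research estimates), not measurements of any particular memory system:

```python
# Rough illustration of why off-chip data movement dominates the power budget.
# The per-bit energies are assumptions for illustration only.
pj_per_bit_offchip = 20.0   # assumed: off-chip DRAM access incl. I/O and SerDes
pj_per_bit_inpkg = 0.5      # assumed: in-package / on-chip access

bytes_per_s = 100e9         # 100 GB/s of sustained memory traffic
bits_per_s = bytes_per_s * 8

watts_offchip = bits_per_s * pj_per_bit_offchip * 1e-12
watts_inpkg = bits_per_s * pj_per_bit_inpkg * 1e-12
print(f"Off-chip traffic:   ~{watts_offchip:.0f} W")   # ~16 W
print(f"In-package traffic: ~{watts_inpkg:.1f} W")     # ~0.4 W
```

Under those assumptions, just sustaining 100 GB/s of off-chip traffic burns a double-digit number of watts before any computation gets done, which is exactly why pulling memory into the package is so attractive.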

One of the immediately visible benefits of Intel’s acquisition of Altera is the focus on key projects that take advantage of technology from both sides. This week, the company announced details of the upcoming Stratix 10 MX devices (we wrote about the pre-announcement back in November here) – which combine Altera’s FPGA technology with Intel’s EMIB (Embedded Multi-die Interconnect Bridge) technology and SK hynix HBM2 (High Bandwidth Memory, second generation) to provide FPGAs with basically insane amounts of high-efficiency, high-bandwidth memory.

By the numbers, you can have a device with 2 million FPGA logic elements and up to four stacks of HBM2 memory at 4GB each, totaling 16GB (that’s gigabytes, not gigabits). Each of those four stacks delivers 256 GB/s of bandwidth, yielding an aggregate of 1 terabyte per second across 16 gigabytes of capacity. The HBM2 stacks are connected to the FPGA via Intel’s EMIB, which provides an extremely low-latency path between FPGA fabric and memory. Maybe even more important, this connection doesn’t use the FPGA’s IO and transceiver resources, leaving those available for other purposes.

The company has just released details of the family, which includes devices with 1M-2M FPGA logic elements, 4-16GB of HBM2 memory, and 48 dual-mode transceivers running at 56 Gbps (PAM-4) or 30 Gbps (NRZ). The in-package (SiP) HBM2 memory should truly be a game changer for these devices, as it delivers enormous bandwidth with high power efficiency while freeing up those FPGA transceivers for other tasks.
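Just to make the arithmetic explicit, here’s a quick sketch using only the figures quoted above (the transceiver conversion to GB/s uses raw line rate and ignores encoding and protocol overhead):

```python
# Aggregate in-package HBM2 capacity and bandwidth for a fully loaded device.
stacks = 4
gb_per_stack = 4        # GB of capacity per HBM2 stack
gbs_per_stack = 256     # GB/s of bandwidth per stack

print(stacks * gb_per_stack, "GB total capacity")       # 16 GB
print(stacks * gbs_per_stack, "GB/s total bandwidth")   # 1024 GB/s ~= 1 TB/s

# For comparison: raw aggregate serial bandwidth of the transceiver complement.
lanes = 48
gbps_per_lane = 56      # Gb/s per lane in PAM-4 mode
print(lanes * gbps_per_lane / 8, "GB/s from transceivers")  # 336 GB/s
```

Even with every transceiver running flat out in PAM-4 mode, the in-package memory offers roughly three times the bandwidth, without spending a single serial lane to get it.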

Meanwhile, back on the battle lines, a number of companies (including ARM, IBM, Qualcomm, Xilinx, and others) hoping to attack Intel’s dominant position in the data center have formed an alliance to define a standard for communication between processors and accelerators (such as FPGAs). While this initiative is in the VERY early stages (it appears to have no technology behind it just yet, but rather represents an agreement between several players to work cooperatively on a standard), it does seem like a scramble is underway to get some kind of foothold against the Intel/Altera juggernaut in FPGA-based compute acceleration. The next few years will be interesting to watch. 

9 thoughts on “Cooling Off Accelerated Computing”

  1. This statement is key “One of the key areas to attack is memory. If you look at a typical computing machine, a giant chunk of the energy budget is spent shuffling data back and forth, to and from DRAM.”

    This is a characteristic of RISC CPUs, and RISC was created because compilers could not efficiently target CISC.

    On top of that, many compilers target an intermediate language that is JIT-compiled to machine code at run time.

    The JIT reads bytecode from data memory and writes machine code into the data cache; that machine code is then written back to memory and loaded into the instruction cache, creating a lot of memory traffic that contributes nothing to the application itself.

    Accelerators in FPGAs typically stream data to the chip, where the “application” processing algorithm is already installed (no JIT).

    The FPGA does not have an ISA, JIT, or cache.

    However, an FPGA can be designed to execute C statements directly and eliminate the RISC pipeline, JIT, and so on.

    Data is read from memory, processed, and the results are written back to memory, so the end result (processed data in memory) is achieved efficiently.

    Solution: eliminate the intermediate language, JIT, and cache, and execute language statements directly in the FPGA.

  2. Dear Kevin,
    an excellent article, clearly convincing us that the dominance of heterogeneous computing is unavoidable, even though the von Neumann syndrome creates masses of problems. Because of the poor state of the art in FPGA application development, we still urgently need von Neumann (vN). Although FPGAs need up to two orders of magnitude more transistors, migrations yield speed-ups of up to several orders of magnitude. This Reconfigurable Computing Paradox demonstrates how unbelievably bad the vN paradigm really is:
    http://xputer.de/RCpx/#pdx
    But we have another machine paradigm suitable for running FPGAs: the Xputer, which uses multiple data counters instead of a program counter. See
    https://en.wikipedia.org/wiki/Xputer
    For a Mead-&-Conway-type environment, we already achieved a speed-up of 15,000 back in 1984 by using an Xputer to re-implement the CMOS design rule check, using PLAs instead of FPGAs. See the PISA Project:
    http://www.fpl.uni-kl.de/PISA/index.htm
    Best regards,
    Reiner
