The Path to Acceleration

Every hardware designer knows that a von Neumann machine (a traditional processor) is a model of computational inefficiency. The primary design goal of the architecture is flexibility, with performance and power efficiency thrown in as an afterthought. Every calculation requires fetching instructions and shuttling data back and forth between registers and memory in a sequential, Rube-Goldbergian fashion. There is absolutely nothing efficient about it.

Dataflow machines (and their relatives), however, are the polar opposite. With data streaming directly into and through parallel computational elements – all pre-arranged according to the algorithm being performed – calculations are done just about as quickly and efficiently as possible. Custom designed hardware like this can perform calculations orders of magnitude faster and more power-efficiently than von Neumann processors.

Why don’t we just do everything with custom-designed hardware, then? Well, you know the answer to that. It’s really not practical to manufacture new hardware for each algorithm we want to run, and many algorithms are so complex that – even with the billions of transistors on today’s biggest chips – we wouldn’t have enough gates to do them in hardware. That means that, for the most complex systems, we’d definitely need a mixture of conventional von Neumann processors (doing the non-performance-critical tasks) along with custom-designed hardware (for the big number crunching).

This blend of perfectly-partitioned hardware and software has been a dream of computational idealists for decades. FPGAs delivered the first critical capability in that vision – hardware that could be reconfigured into anything we want on the fly, so that new algorithms could be loaded and reloaded without the need to physically build new hardware. The idealists were beside themselves with excitement and hurried off to form research consortia and all manner of long-range projects to explore and enhance our ability to perform computation with FPGAs paired with conventional processors.

The remaining stumbling block was a bit harder to conquer, however. Even though we could build the hardware for our ideal computing system, we didn’t have a good way to program it. For the past two decades (OK, even longer than that, really), we have struggled to find a convenient and productive way to write software that could execute efficiently on one of these heterogeneous computing machines. For the most part, we’ve failed.

The core problem that has attracted the most attention is the challenge of taking an algorithm described in a normal, sequential programming language (like C) and restructuring it into a parallel/pipelined architecture that would be fast and efficient for hardware implementation. Various generations of technologies with names like “Behavioral Synthesis,” “High-level synthesis,” “Architectural Exploration,” “ESL,” “Algorithmic Synthesis,” and “Hardware compilation” (and probably a few I’m forgetting) have been put forth and have found their way into everything from academic research projects to full-fledged EDA tools. The underlying NP-complete problems of resource allocation and scheduling gave fodder to a generation of grad students who assaulted the issue with arsenals of heuristics, linear programming algorithms, and divine trickery.
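To make the scheduling problem concrete, here is a minimal sketch of one classic heuristic, list scheduling, written in Python purely for illustration. It is not taken from any particular tool. The function name `list_schedule`, the operation names, and the simplifications (every operation takes one clock cycle, and all functional units are interchangeable) are assumptions made for this example.

```python
from typing import Dict, List


def list_schedule(deps: Dict[str, List[str]], units: int) -> Dict[str, int]:
    """Assign each operation to a clock cycle.

    deps maps an operation name to the operations whose results it needs.
    Assumptions for this sketch: every operation takes exactly one cycle,
    and at most `units` operations may run in the same cycle.
    """
    cycle_of: Dict[str, int] = {}
    remaining = list(deps)  # dictionary order doubles as a naive priority
    cycle = 0
    while remaining:
        # An operation is ready once all of its inputs finished in an earlier cycle.
        ready = [op for op in remaining
                 if all(d in cycle_of and cycle_of[d] < cycle for d in deps[op])]
        if not ready:
            raise ValueError("dependency graph has a cycle or a missing operation")
        # Greedily fill the available units with the first ready operations.
        for op in ready[:units]:
            cycle_of[op] = cycle
            remaining.remove(op)
        cycle += 1
    return cycle_of


# y = (a*b) + (c*d) + (e*f): three multiplies feeding two chained adds.
deps = {"m1": [], "m2": [], "m3": [], "s1": ["m1", "m2"], "s2": ["s1", "m3"]}
two_units = list_schedule(deps, units=2)
one_unit = list_schedule(deps, units=1)
```

With two units the five operations fit into three cycles. With one unit they take five, which is the sequential, processor-like schedule. Adding a third unit would not help here, because the chain of dependent additions sets the minimum length. Real tools must also deal with different unit types, multi-cycle operations, and memory ports, which is where the problem gets hard.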

Many more practical people, wanting to skip the whole black-magic realm of sequential-to-parallel algorithm transformation, came up with alternative approaches. Most of these involved teaching us new programming languages or methodologies in which we specified the parallelism and pipelining manually. We got special dialects of C, graphical programming languages, variants of languages like Java, and IP-based design methods, where we created algorithms by stitching together functional hardware blocks.

All of these approaches required the user to learn a significant new methodology and/or language, to have a strong understanding of hardware architectures, and to create designs specifically for the hardware environment they were targeting.

Altera is taking a cleverly different approach to this issue, with the announcement this week of a new software development kit (SDK) for OpenCL (Open Computing Language). OpenCL is an open, royalty-free standard built around a C-based kernel language, originally created by Apple and now maintained by the Khronos Group. OpenCL is similar to NVIDIA’s popular CUDA – designed for writing software that takes advantage of arrays of parallel processing elements such as graphics processors. OpenCL lets you write what are called “kernels” that execute on OpenCL devices, and it provides APIs for the management and control functions.
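The kernel idea can be sketched without any OpenCL installation. The Python below only emulates the execution model and is not OpenCL code: `vec_add_kernel` and `enqueue_nd_range` are hypothetical names invented for this illustration. In real OpenCL, the kernel would be written in OpenCL C, marked `__kernel`, and would find its index with `get_global_id(0)`, while the host would launch it through the `clEnqueueNDRangeKernel` API call.

```python
from typing import Callable, List


def vec_add_kernel(gid: int, a: List[float], b: List[float], out: List[float]) -> None:
    """Body of one work-item: it handles exactly one element, chosen by its global ID."""
    out[gid] = a[gid] + b[gid]


def enqueue_nd_range(kernel: Callable[..., None], global_size: int, *args) -> None:
    """Hypothetical stand-in for the host-side launch call.

    Runs the kernel once per work-item. A real device would run the
    work-items in parallel; this emulation runs them one after another.
    """
    for gid in range(global_size):
        kernel(gid, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(a)
enqueue_nd_range(vec_add_kernel, len(a), a, b, out)
```

The point of the model is that the kernel contains no loop over the data. The programmer describes the work for a single element, and the runtime decides how many copies to run and where, whether that is on GPU cores or, in Altera’s case, in FPGA hardware.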

So far, this sounds like one of the options we already discussed, right? Isn’t it just another of our “special dialects of C”?

Yes, it is.

However, OpenCL has an important differentiator – a user base. You see, the world of high-performance computing has not been sitting around idly waiting for us to come up with the ideal heterogeneous computing platform. People needed performance, and they needed it NOW. The fastest processing machines on the block for the past couple of decades have been GPUs (graphics processors) with their large arrays of parallel processors. But you can’t just take any old software and plunk it down onto an array of processors. You need to write your code in a way that partitions the tasks out logically so that they can be executed in parallel – taking advantage of the key feature of the GPU, LOTS of processors. Therefore, we have seen the emergence and rapid growth in popularity of languages like OpenCL and CUDA.
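As a rough illustration of what “partitioning the tasks out logically” means, the sketch below splits a summation into independent chunks, computes the partial sums separately, and then combines them. It is a structural sketch only, with hypothetical helper names `partition` and `parallel_sum`. Python threads stand in for the parallel processors and will not deliver a real speedup on this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def partition(data: Sequence[int], parts: int) -> List[Sequence[int]]:
    """Split data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum(data: Sequence[int], parts: int = 4) -> int:
    """Sum each chunk independently, then combine the partial results."""
    chunks = partition(data, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)


total = parallel_sum(list(range(1, 101)))
```

The chunks share no data, so no worker ever waits on another until the final combining step. Code that cannot be carved up this way gains little from an array of processors, however large.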

Altera wants all those people writing OpenCL code to be able to swap their GPUs for FPGAs. By releasing an OpenCL SDK (with the associated hardware – which we’ll discuss in a bit), Altera has given the OpenCL crowd a way to move their software seamlessly from processor+GPU to processor+FPGA. Altera is not, therefore, faced with creating a new language or dialect, promoting and selling the idea to the masses, and training the world in developing code that can be run on an FPGA. They are also not trying to create, maintain, and support a complex high-level synthesis program for a wide audience of users.

Altera’s development flow is very similar to targeting OpenCL code to a CPU+GPU combo. Your normal C code is compiled with normal C compilers for your conventional processor, and your OpenCL kernels (here’s where the magic takes place) are transformed into hardware blocks that reside on your FPGA. The FPGA-to-CPU connection is made via PCIe. Altera has certified boards with FPGAs and the appropriate additional resources and connections that will plug right into your host computer and – wheee! – you’re off and running at blinding speed, and (this is important) using a tiny fraction of the power you’d require running the same algorithm on a CPU+GPU combination.

This OpenCL SDK should allow designers who don’t have any FPGA experience to take advantage of the amazing capabilities of today’s FPGAs for compute acceleration – without having to learn HDLs or hardware architecture. It seems like a viable and differentiated solution for high-performance computing in a number of areas.

Looking down the road just a bit, there is more excitement on the horizon. Altera’s previously-announced and still-upcoming SoC FPGAs will combine a high-performance ARM processor subsystem with FPGA fabric on a single chip (similar to Xilinx’s Zynq devices). It doesn’t take much imagination to envision these CPU+FPGA devices having the capability to perform the entire supercomputing task – both the conventional processor and the hardware acceleration kernel portions. Furthermore, such devices could have a much higher bandwidth connection between CPU and FPGA fabric, because of the massive amount of on-chip interconnect, and because those signals would no longer have to go off chip.

We could truly end up with data centers full of racks of servers – where each blade contains FPGA SoC devices running OpenCL code. The performance capabilities of such an architecture would be staggering and, more importantly, the power-per-performance would be dramatically lower than for today’s high-performance computing systems. Since power is often the ultimate limitation on the amount of compute power we can deploy today, these could form the basis of the supercomputers of the future.

It will be interesting to watch the adoption and evolution of Altera’s OpenCL technology. We don’t yet know whether current CPU+GPU users will broadly accept and adopt the CPU+FPGA idea. A lot of that will depend on how good a job Altera has done with the tools in the background. The company claims that the OpenCL kernels are seamlessly converted into hardware blocks that will pass cleanly through synthesis and place-and-route without issues like timing violations or routing failures. A lot of experienced FPGA designers would love to be able to claim that, but it’s a difficult problem. Altera claims that early adopters of their OpenCL SDK have had good success, however, so we’ll see how it progresses as the broader rollout continues.

3 thoughts on “The Path to Acceleration”

kevin says:

November 6, 2012 at 9:05 am

Altera is betting that high-performance computing folks will take OpenCL (which a lot of them are already using) and target hybrid computing systems with FPGAs married to conventional processors.

What do you think?

HenryWMS says:

November 6, 2012 at 11:39 am

All commendations, but IMHO not a winning strategy. Please remember these comments are from a bit of sub-plankton within a small stagnant pond; but please, please keep the configurable FPGA with the included (inside the FPGA) DO.254? pre-approved processor and OS on-board; (perhaps) more tools (i.e. expand the simulation such as ‘L–C*’ [and many other CAD packages] which provide the facility to ‘drop’ code onto processors [e.g. A?R,et al] within their schematic) and samples, plain ol’time, and maybe the odd book(!).

Dwyland says:

November 6, 2012 at 1:27 pm

I think this is somewhere between a final solution and a transition to a final solution. Kernels can be simple, single-clock data flow elements (single register + ALU) or multi-register, multi-clock units from state machines to GPUs. This allows a compiler to optimize the data flow paths (hardware vs clocks) through the flow pattern. Eventually, there will be a set of hard and soft kernels to get the most bang out of the silicon. Hats off to Altera for taking the step.
