When we hear the term “supercomputing,” each of us probably forms a different image in our head – depending on our age. For my generation, I like to visualize the semi-cylindrical form of the Cray-2, with its radical architecture and Fluorinert cooling. It looked fast just sitting there. Others may envision anything from IBM mainframes to racks of blades to modern video gaming consoles.
In practical terms, supercomputing seems to mean computers that have a processing power equivalent to the smartphones of five years from now, at a cost premium of hundreds to thousands of times the price of current mainstream computers. While that kind of multiplier may seem absurd, so are the lengths to which supercomputer users will go at any given time in order to gain a couple notches on the current state of Moore’s Law.
This week, in Seattle, we have the supercomputing party of the year – SC11. As is customary leading up to a major show, we’ve had the usual salvo of press releases – for everything from new blade/server configurations to cooling systems to memory and storage components. When thinking about the supercomputing problem, however, it pays to back off a step or two from the exotic and esoteric hardware on display, and to think about the problem of supercomputing at a higher level.
In recent years, we have passed a fundamental milestone in the evolution of computing architectures. There is no longer much gain to be had from cranking up the operating frequency of a single processor core, and there is likewise little efficiency left in continuing to widen the word length of our basic architecture. Each of the improvements that served us in the past now yields diminishing returns at the current state of technology. This has moved our quest for ever more processing power from “how fast” to “how many.” Instead of trying to get more performance out of each processor, we try to see how many processors we can stuff into a box.
This, of course, brings its own set of challenges. First, we have to change the way we develop software – to be able to take advantage of all those processing elements operating in parallel. Second, we become limited fundamentally by the amount of power we have available. These two challenges define the software and the hardware side of supercomputing for the near future. Taking advantage of a multiplicity (or even a plethora) of processing elements is mainly a software problem. We need new languages, new compilers, new operating systems, and better ways of expressing algorithms in order to take advantage of massive parallelism.
On the hardware side, we need processing elements that pack a lot of computing power into a small space and (probably more important now) do the computation with the least power possible. It is in these two areas that FPGAs have attracted a lot of attention in recent years. We have long documented the ability of FPGAs to accelerate many types of algorithms by significant factors – with huge savings in power. FPGAs have walloped DSP processors, conventional processors, and even graphics processors – both in raw processing throughput on a single device and, particularly, in the amount of power consumed.
Why, then, haven’t FPGAs taken over the supercomputing scene entirely?
Oh, how they have tried. Reconfigurable computing engines based on FPGAs have been designed, touted, tried, re-designed, re-touted, and ultimately – not found much acceptance. Formidable academic consortia have risen and fallen with the sole purpose of exploiting the extraordinary capabilities of FPGAs for supercomputing applications. The reason is rooted back in that software part of the problem. The only super-great way to do supercomputing on an FPGA has been to hand-code the algorithm in a hardware description language (HDL) such as VHDL or Verilog.
Well, then, why hasn’t somebody come up with a better solution for the software issue? I’m pretty sure more brainpower has been applied to that problem than to locating Bigfoot, Elvis, and UFOs combined. Well, by some definition of “brainpower” anyway. There have been more start-ups than you can count hawking high-level synthesis tools, new parallel programming languages that target FPGA hardware, graphical input languages, model-based design methodologies – the list goes on and on. The tool-making world understands that most people who are world experts in the subtleties of DNA-matching algorithms did not reach that pinnacle of their field while also mastering VHDL “on the side.”
However, two briefings I had in the run-up to the supercomputing show illustrate that major progress is in the works, and that we may be closer than we think to answering the ultimate question of FPGAs, supercomputing, and everything. First, Altera is announcing plans to support OpenCL – a language developed with GPU-based supercomputing in mind. For the past few years, led most visibly by Nvidia, GPUs have been gaining a lot of popularity for high-performance computing applications. Nvidia rolled out CUDA (Compute Unified Device Architecture) several years ago, and it has gained significant acceptance in traditional supercomputing strongholds like financial computing, pharmaceuticals, and oil and gas exploration. Alongside that effort came OpenCL (Open Computing Language), which lets developers write C-like code that explicitly specifies parallelism. OpenCL originated at Apple and is now managed by the Khronos Group – a non-profit industry consortium that maintains open standards for parallel computing and graphics. Even though OpenCL was authored with GPUs in mind, it is equally applicable to parallel processing with modern FPGAs.
Altera’s vision is for software developed in OpenCL to be compiled into a combination of executable software and FPGA-based accelerators. Since OpenCL has the notion of concurrency built right in, it is much more straightforward to translate an algorithm from there into a datapath in an FPGA. Altera has been developing this technology for some time, apparently, and has at least parts of the solution in the hands of key customers.
The goal here is to provide FPGA-based acceleration in areas that don’t have traditional FPGA design expertise. There is already a large community of interest around OpenCL and its relatives, and giving those developers a path to FPGA-based supercomputers has to be a good thing.
Also on the path of making FPGA-based acceleration easy is MathWorks. While their MATLAB and Simulink tools have long been the de facto standard for algorithm prototyping, getting from that prototype to actual hardware has always been tricky. The company has been addressing that issue in recent years – bridging the gap with what they call “model-based design.” We will have more extensive coverage of some specific aspects of the MathWorks approach in an upcoming article, but in the context of supercomputing, their approach merits consideration. By representing an algorithm as a combination of standardized models with a sprinkling of new, user-defined functions, complex algorithms can be modeled in something conceptually very close to hardware. When the time comes to move the algorithm into hardware, the transformation is straightforward – translate each model into its corresponding pre-defined hardware implementation. It is essentially IP-based design, but starting conceptually at the algorithmic model.
By breaking the boundaries of the von Neumann architecture, FPGAs have an inherent advantage in supercomputing applications. They should ultimately be able to deliver more computations for less total power than any architecture based on a combination of conventional processors. With the astounding growth in capacity and performance of FPGAs in recent years, truly the only major hurdle that remains is the software component. Even there, progress is promising.
So hey, software guys – what do you say? Get us some better tools already!
Photograph by Rama, Wikimedia Commons