Like the proverbial carrot-on-a-stick, FPGA-based acceleration has been right in front of our noses, just out of reach, for the better part of three decades. We move closer, and the prize moves farther away. Every few years, we feel some tangible progress, and perhaps cut the distance in half, but asymptotes can be unfriendly bedfellows. The old “reconfigurable computing” vision of FPGAs as replacements for CPUs has teased, taunted, and ultimately disappointed us.
Because FPGAs are basically vast arrays of unconnected logic, they suggest that we could have optimal hardware custom-designed for our algorithm, with exactly the right resources applied to deliver maximum parallelism and efficiency without all those pesky program counters and instructions. They conjure visions of data flowing smoothly into one side of our machine and results flowing out the other with an absolute minimum of friction.
All we have to do is get our software into one.
Ah, and there’s the rub. The overwhelming challenge of FPGA acceleration has always been the programming model. And, try as we may, every approach that has been tried (and there have been many) has failed to come anywhere near what can be achieved with a conventional von Neumann CPU.
At first, we simply thought we should re-train the world’s software engineers to use “different languages” and adopt hardware-description languages like Verilog to develop their code. After writing thousands of lines of almost-incomprehensible nonsense in VHDL or Verilog to do what would have required a few dozen lines of simple C, software engineers told us that plan was, uh, sub-optimal, and to please not ever call them back again.
With some success, we created high-level synthesis (HLS) tools that can convert sequential descriptions written in software languages such as C and C++ into parallel datapath machines that can then be synthesized and placed-and-routed. But that is not a software development flow. It is a hardware design flow, and the time and effort required for each iteration of that process still puts it far behind modern software development and debug environments. While HLS can give us FPGA-accelerated designs with a fraction of the effort required by conventional HDL design, it still falls well short of the ease with which we can put the same algorithm on a regular processor.
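To make that contrast concrete, here is a minimal sketch (not drawn from any real design) of the kind of C++ an HLS tool consumes. The loop body is ordinary sequential code; the pragma, written here in Xilinx’s Vitis HLS syntax, is the hint that asks the tool to build a pipelined datapath rather than execute one operation at a time. Even from source this simple, getting to working hardware still means synthesis, place-and-route, and timing closure, which is where those long iteration times come from.

```cpp
// Illustrative HLS-style kernel: element-wise vector add.
// A standard C++ compiler just runs this loop; an HLS tool reads the
// pragma and generates a datapath that starts a new iteration every
// clock cycle (initiation interval of 1).
constexpr int N = 1024;

extern "C" void vadd(const int a[N], const int b[N], int c[N]) {
vadd_loop:
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1  // request one new loop iteration per clock
        c[i] = a[i] + b[i];
    }
}
```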
But we are at the dawn of a new era. Who needs a programming model if the thing we are accelerating is not even software? (Well, not exactly, anyway.) For a range of applications, the AI revolution has done away with the task of programming. Instead, we just feed our machines massive amounts of data, and they program themselves. And the value proposition for AI is so incredibly strong that we are determined to do whatever is required, even if it is somewhat less convenient than writing conventional software. We don’t need five-second iterations from code change to results, or fancy debuggers to see all the inner workings of our CNN… Heck, we don’t even KNOW the inner workings of our CNN.
Mipsology, a French startup, recognized this opportunity and came up with an elegant strategy to capitalize on it. The Mipsology team has mad skills and an enormous amount of experience doing computation with FPGAs in the emulation space, where several of them worked on ZeBu, an innovative FPGA prototyping/emulation platform developed by EVE and later acquired by Synopsys. Calling on their FPGA mapping experience from emulation, the Mipsology team developed Zebra, a software stack that takes any neural network from any of the popular frameworks – Caffe, Caffe2, MXNET, TensorFlow, etc. – and creates an efficient FPGA accelerator model that can be deployed on a wide range of FPGA-based platforms from Amazon AWS F1 FPGA instances to a huge range of FPGA accelerator boards.
Through partnerships with companies like Xilinx, Avnet, Mellanox, Advantech, Western Digital, and Tul, Zebra supports deployment on a wide range of FPGA platforms, with zero code changes and no hardware expertise required from the AI engineer. Mipsology says Zebra is essentially pushbutton – network model in, FPGA implementation out. This allows AI engineers to target inference deployment in anything from data centers to edge devices to desktop applications. It’s like having a big “ACCELERATE” button. Mipsology claims, “Zebra users don’t have to learn new languages, new frameworks, or new tools. Not a single line of code must be changed in the application.”
Beyond not requiring code changes, Mipsology has solved another key problem: the FPGA does not have to be reconfigured in order to load an updated model, so no synthesis, place-and-route, or timing closure is required. Performance-wise, Mipsology claims more than 5,000 images per second on ResNet-50, more than 2,500 images per second on Inception-V3, and 250 fps on YOLOv3 – all on Xilinx’s Alveo U250. That’s some serious throughput, and well beyond what the leading GPUs can accomplish. There is also the option to dial in more performance by scaling back the resolution at which the hardware processes your model.
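That “resolution” is, in essence, the numeric precision of the arithmetic. As a generic illustration only (not Mipsology’s actual scheme), the sketch below shows 8-bit affine quantization: each 32-bit float in a tensor is mapped to an 8-bit integer plus a shared scale and zero point, which lets the fabric do many more multiplies per cycle at the cost of a small, bounded rounding error.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Generic 8-bit affine quantization sketch (illustrative only, not
// Mipsology's scheme): map floats in [lo, hi] onto the int8 range with a
// shared scale and zero point, then map back to see the rounding error.
int main() {
    std::vector<float> weights = {-1.7f, -0.3f, 0.0f, 0.42f, 0.9f, 1.6f};

    float lo = *std::min_element(weights.begin(), weights.end());
    float hi = *std::max_element(weights.begin(), weights.end());

    // One scale / zero-point pair shared by the whole tensor.
    float scale = (hi - lo) / 255.0f;
    int zero_point = static_cast<int>(std::round(-lo / scale)) - 128;

    for (float w : weights) {
        int q = static_cast<int>(std::round(w / scale)) + zero_point;
        q = std::max(-128, std::min(127, q));   // clamp to int8 range
        float back = scale * (q - zero_point);  // dequantize
        std::printf("%+.3f -> q = %4d -> %+.3f (err %+.4f)\n",
                    w, q, back, back - w);
    }
    return 0;
}
```

The reconstruction error is on the order of one scale step (about 0.013 for this toy tensor); for many trained networks that loss is small enough that reduced-precision inference is an easy trade for the higher frame rates quoted above.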
Recently, Mipsology and Xilinx announced that Zebra software/IP has been integrated into the latest build of Xilinx’s Alveo U50 data center accelerator card, making the jump to acceleration for Alveo users that much easier. The company bills this as “Zero Effort IP” – and who doesn’t need more zero effort solutions in their life?
The AI inference world is certainly exciting right now, crowded with a plethora of solutions including new chip architectures, accelerator cards, development tools… the list goes on and on. It will be interesting to see how Mipsology’s “zero effort,” hardware-agnostic, framework-independent software stack approach fares as the market begins to shake out. The team certainly seems to be checking a lot of the key boxes: ease of use, platform portability, performance, and power efficiency.
Ah, and there’s the rub. The overwhelming challenge of FPGA acceleration has always been the programming model. And, try as we may, every approach that has been tried (and there have been many) has failed to come anywhere near what can be achieved with a conventional von Neumann CPU.
Do you mean that FPGA accelerators do not exist (or that they do not accelerate algorithms)? Do you mean that the von Neumann CPU is the fastest thing known to man?
Then came the GPU for graphics algorithms. Of course, it was immediately obvious that was not enough, and there was an immediate push to extend it to general-purpose computing algorithms… Sometimes just let sleeping dogs lie.
AI inferencing algorithms are not general-purpose computing, and let us hope that AI is big enough to overcome the notion that the von Neumann CPU must be used for AI.
@Karl – FPGA accelerators most certainly do exist, and deliver performance and energy efficiency orders of magnitude better than von Neumann CPUs. But, their adoption has always been severely limited by how difficult they are to program.
@Kevin, thanks for your reply. One of my frustrations is the so-called “Tool Chain,” which is more like a ball and chain. It is absurd to start with a hardware description language for design entry. First there must be logic design, and that means the logical combination of inputs and storage/state elements. Each must have a name that is used in and/or/not expressions. It is obvious that things appearing in expressions are inputs used to determine the true/false value of the output. Sure, outputs must have names, because they are inputs to other expressions. But lists of inputs and outputs are not necessary.
Sensitivity lists, always blocks, processes, and blocking/non-blocking assignments only have meaning to synthesis, not to the logic. Sorry, I needed to let off a little steam.
HLS/SystemC, or whatever you want to call it, boils down to evaluating numeric expressions, but after many years it still has not fully “matured” and is not universally accepted. (Let’s skip the fact that there are two HDLs and that new C++ classes may have to be designed.)
By now you have had enough of this… but there is hope. There is a compiler that will take an expression and identify the sequence of operators and operands needed to evaluate expressions/algorithms. This can be implemented using a few hundred LUTs and three embedded memory blocks on an FPGA. Here is the sweet part: it is programmable simply by loading the memories, not by redesigning the FPGA.
I just saw an article about Altera OpenCL that was going to take care of all this.
Have you checked lately? At the time, it seemed that the GPU was handling the graphics fine, but a group of know-it-alls decided they could do better…