Ross Freeman, co-founder of Xilinx, invented the FPGA in 1984. In the 34 years that have passed, FPGAs have been wildly successful and are certainly among the most important electronic devices ever conceived. But during that entire history, tracing the evolution of FPGAs from dozens of LUTs to millions, the FPGA has been the optimal solution for … exactly zero applications.
Don’t get me wrong. FPGAs do one thing exceptionally well: Flexibility. FPGAs can often do what no other device can, bridging gaps between otherwise-incompatible protocols, reconfiguring themselves on the fly to adapt to changing requirements and circumstances, and acting as stand-ins for ASICs and ASSPs that have not yet been created. If your application needs the flexibility of FPGAs, probably nothing else will work.
But all that flexibility comes at a cost – price, power, and performance. Best estimates are that all three of those factors are worse than optimized, dedicated silicon by about a factor of ten. That means that, if you design your application with an FPGA in it, and your application is successful, over time, once your requirements stop changing and your design and architecture get nailed down, replacing that FPGA with something cheaper, faster, and more power-efficient will be high on your list of priorities.
For many applications, that day never comes. By the time there is impetus to remove the FPGA, new requirements have come along that start the clock over again. A new design requires a new FPGA and goes through its own maturation process. So, FPGAs have had many application areas where they remain for decades, even though they are never ever the optimal design solution.
This is a problem for FPGA companies, as it limits their growth potential. Their products never get to enjoy the “cash cow” stage, where large volume orders come in with virtually no effort. Instead, FPGA vendors are constantly battling to win new sockets and to re-win old ones. They are forever giving heavy-duty support to a wide range of customers in ever-evolving situations simply in order to retain business they’ve already won. You may have been supplying a customer like Cisco with fantastic silicon and service for decades, but fall off your game on one new design iteration and you’ll find yourself kicked to the curb while your competitor steps in and captures your hard-earned customer.
As a result, FPGA companies have always been chasing the elusive “killer app” – the application where FPGAs are the optimal fit, and there’s no opportunity for some ASIC or ASSP to step in and grab the easy money just as the market explodes. The requirements are tricky. You need to find a problem where the flexibility of FPGAs is an essential and enduring part of the solution, and where dedicated/specialized hardware can’t be designed to do the job any better or cheaper. That’s a tall order, but it’s never stopped them from trying.
Now, there is considerable buzz in the industry about FPGAs for AI applications. Both Xilinx and Intel are touting the prowess of FPGAs as accelerators for convolutional neural network (CNN) deep-learning applications. These AI applications typically have both “training” and “inference” phases. “Training” is executed on a broad data set to teach the network its job and establish the best topology, coefficients, and so forth. Training typically happens once, and big-iron servers are used – often with GPUs doing the heavy lifting. Training requires massive floating-point computation, and GPUs are the best fit (so far) for delivering the required floating-point performance.
Once the network is trained and optimized, it can be deployed into the field, where “inferencing” is done. Inferencing has different requirements than training. Where training is generally done in a data center, inferencing is often done in embedded applications. Where training is not particularly sensitive to cost and power (since it’s done once, in a data center), inferencing can be extremely sensitive to both cost and power, since it will be done enormous numbers of times, will often be built into high-volume embedded systems with severe BOM cost restrictions, and will often be required to operate on batteries or other very limited power sources.
Most importantly, unlike training, inferencing can be done with narrow bit-width fixed point computation. Helloooooo FPGAs! Your killer app may have just arrived.
It turns out you can build a pretty decent neural net inferencing engine using FPGA LUT fabric. The fixed-point math aligns perfectly with FPGAs’ sweet spot, and best of all (try to contain your glee, FPGA marketers) every neural network topology is different. It turns out that optimal neural networks for various applications have very different topologies, so the inferencing processor for automotive vision, for example, might be completely different from one for language processing.
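To make the fixed-point claim concrete, here is a minimal sketch of how one neuron's multiply-accumulate can run on 8-bit integers and be rescaled once at the end. It is an illustration only, not any vendor's flow: the `quantize` and `int_dot` helpers, the symmetric scaling scheme, and the sample weights and inputs are all assumptions made up for the example.

```python
# Illustrative sketch only: symmetric 8-bit quantization of one neuron's
# weights and inputs, followed by an integer-only multiply-accumulate.
# All names and sample values here are hypothetical.

def quantize(values, bits=8):
    """Map floats to signed integers of the given width; return (ints, scale)."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8 bits
    peak = max(abs(v) for v in values) or 1.0   # avoid a zero scale for all-zero input
    scale = peak / qmax                         # real value of one integer step
    return [round(v / scale) for v in values], scale


def int_dot(q_a, q_b):
    """Integer multiply-accumulate: the part that would live in FPGA fabric."""
    return sum(a * b for a, b in zip(q_a, q_b))


weights = [0.42, -0.17, 0.88, -0.63]
inputs = [0.10, 0.95, -0.30, 0.55]

q_w, s_w = quantize(weights)
q_x, s_x = quantize(inputs)

exact = sum(w * x for w, x in zip(weights, inputs))
approx = int_dot(q_w, q_x) * s_w * s_x          # rescale the integer result once

print(f"floating-point result: {exact:.4f}")
print(f"8-bit fixed-point result: {approx:.4f}")
```

The two printed values should agree closely. The inner loop needs nothing wider than small integer multipliers and an accumulator, which is the kind of arithmetic that LUT fabric handles comfortably.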
And, speaking of automotive vision applications, here we have an enormous industry going through a period of disruptive change, where FPGAs might possibly play an essential and irreplaceable role? Oh my. FPGA sales people are skipping along the garden path with songs in their hearts and dollar signs in their eyes. Could this finally be it? Is AI the promised land where programmable logic can curl up on a rug in front of the fire and just cash checks all day long?
Sadly, probably not.
First, it’s no secret that FPGAs are hard to “program.” There is a very small population of FPGA experts in the world, and the Venn diagram showing those people and AI application experts has a very small intersection indeed. That means that AI experts need some serious help creating FPGA implementations of their solutions. Both Xilinx and Intel have gone to great lengths to bridge that expertise gap, with various solutions in play. The most common answer for those non-FPGA-experts wanting to use FPGAs as accelerators is to use OpenCL (a C-like language designed for using GPUs as accelerators). It’s a solid strategy. If you want to sell FPGAs against GPUs, design a flow that makes GPU code easily portable to FPGAs. That way, you’ve conquered the programming challenge right up front – or at least made it much easier.
Unfortunately, the GPU-as-accelerator market is dominated by one company – Nvidia. Nvidia created a proprietary language called CUDA (similar to OpenCL) for software developers wanting to use Nvidia GPUs to accelerate their applications. When OpenCL came along (with that annoying “Open” in the name) Nvidia announced “support” for OpenCL, but clearly kept the bulk of their effort behind CUDA, and their CUDA customers were perfectly happy to keep writing CUDA code, thank you very much. The success of CUDA and Nvidia in the GPU acceleration market has put a big damper on the adoption of OpenCL, and that has significantly slowed the use of OpenCL as a bridge from GPU-based acceleration to FPGA-based acceleration – which is exactly what Nvidia wants.
Further, many of the neural network experts have not yet taken the OpenCL or the CUDA plunge. They need more help to bridge the gap between their trained models and FPGA-based inferencing engines. A number of companies are attacking this very important problem, and we’ll discuss them more in the future. But for now, FPGA-based neural network inferencing is basically limited to organizations with the ability to deploy FPGA experts alongside their neural network/AI engineers. In fact, this problem is most likely the driving factor behind the Xilinx/Daimler alliance we wrote about last week – Daimler probably needed Xilinx’s help to implement their automotive-specific AI algorithms on Xilinx’s hardware.
Beyond the programming problem, there is another barrier in the FPGAs’ path to irreplaceability. In high-volume embedded applications (such as automobiles) the solutions will become both specific and bounded over time. That means that the wild flexibility of FPGAs will no longer be required. Once a pedestrian-recognizing network is proven effective, the hardware for that application can then be hardened, with a potentially enormous boost in performance, reduction in power, and (most importantly) reduction in cost. The FPGA will only be the go-to chip while the system architecture is in flux.
We talked with a manufacturer of Lidar, for example, who uses large numbers of FPGAs in their system. While they are producing one of the very best performing systems on the market for automotive applications, their system cost still runs five digits. They estimate that they need to reduce that to three digits (Yep, a two-order-of-magnitude cost reduction for the same performance) as well as reducing their footprint and power consumption – before they are viable for mass production in automobiles. The top of their to-do list? Design out the FPGAs.
We suspect that this same situation may exist across numerous subsystems going after the ADAS and AD markets, leading to temporary euphoria for FPGA companies as they win sockets – followed by future disappointment when they are designed out before the big-volume payoff. And, this is just in the automotive space. Anywhere FPGAs are called upon to do the same job for a long period of time, they are vulnerable to replacement by a more optimized application-specific solution.
One place this may not be true (or may at least see the effects delayed) is in the data center/cloud deployment of AI applications. There, a variety of different-topology neural networks may have to be deployed on the same hardware, and further-optimized ASIC or ASSP solutions will be much longer in arriving. But even then, we will need more purpose-built FPGAs aimed directly at the data center. The current “do anything” SoC FPGA with its assortment of “would be nice” features will certainly be too bloated with unused features to be optimal for data center applications. And, given the recent rise in eFPGA (embedded FPGA IP) technology, more companies may choose to design data center class neural network accelerators that don’t have the overhead (and huge margins) of stand-alone FPGAs.
With both Xilinx and Intel focused on fighting it out for the data center acceleration market (which is expected to be exploding any time now), this aspect of their respective strategies will be critical. But with their focus on the data center, the possibly more lucrative embedded opportunities such as automotive should prove even more elusive. It will be interesting to watch.
33 thoughts on “Is AI the Killer FPGA Application?”
The flexibility of FPGAs is overkill. The result is that too much time, power, and cost go into the fabric.
Microprogramming was used for years in CPU controls because it is much easier than RTL design and there is a lot of flexibility to change algorithms.
Another thing that is overlooked is the embedded memories, which can operate in full dual-port mode and whose width can easily be changed.
The necessity to build registers out of individual dFFs with input, output, and clocks all connected by the fabric is absurd.
The eFPGA approach is desirable, but it is still too focused on physical design. The Microsoft Roslyn compiler's Syntax API does a good job of breaking down a program into the sequencing necessary for evaluation and control sequences.
Function can be done by either hardware or software, but software has debugging capability far beyond RTL. So program the function, extract the control graph, design the dataflow using memory blocks where appropriate, microprogram the control (using memory blocks and some LUTs), and reload the memory blocks to change function. Please note that the elusive C-to-hardware nonsense was never mentioned.
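As a rough software model of that microprogramming idea, the sketch below keeps a fixed datapath and changes its behavior only by loading a different control store. Everything in it is hypothetical: the five-field microinstruction format, the `run` function, and the two tiny microprograms are invented for illustration and do not describe any particular device or tool.

```python
# Hypothetical sketch: a control store (standing in for a block RAM) holds
# microinstructions, and a fixed datapath does whatever the store dictates.
# Assumed microinstruction format:
#   (op, dest_reg, src_reg, addr_if_dest_nonzero, addr_otherwise)
HALT = -1


def run(control_store, regs, max_steps=1000):
    """Execute microinstructions from address 0 until HALT is reached."""
    addr = 0
    for _ in range(max_steps):
        if addr == HALT:
            return regs
        op, dst, src, branch_addr, next_addr = control_store[addr]
        if op == "add":
            regs[dst] += regs[src]
        elif op == "dec":
            regs[dst] -= 1
        else:
            raise ValueError(f"unknown micro-op: {op}")
        # Conditional sequencing: take the branch while the destination is nonzero.
        take_branch = branch_addr is not None and regs[dst] != 0
        addr = branch_addr if take_branch else next_addr
    raise RuntimeError("microprogram did not halt")


# Microprogram 1: acc = a * n by repeated addition (assumes n >= 1).
multiply = [
    ("add", "acc", "a", None, 1),
    ("dec", "n", None, 0, HALT),
]

# Microprogram 2: same datapath, different store contents: acc = n + (n-1) + ... + 1.
triangular = [
    ("add", "acc", "n", None, 1),
    ("dec", "n", None, 0, HALT),
]

print(run(multiply, {"acc": 0, "a": 6, "n": 7})["acc"])
print(run(triangular, {"acc": 0, "a": 0, "n": 4})["acc"])
```

Swapping `multiply` for `triangular` changes the function without touching `run`, which is the software analogue of reloading the memory blocks rather than rebuilding the logic.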
FPGAs can provide much higher performance, but there is still the problem of programmability, as mentioned in the article.
The only way to make it widely available to cloud/ML users is through libraries/cores that are completely transparent to the users, so that there is no need to change the original code.
This is exactly the approach that we follow at InAccel to allow acceleration of machine learning algorithms (e.g., for Spark) without the need to change the original code. Deploying ready-to-use FPGA cores in the cloud allows both a speedup of the applications and a reduction of the TCO.
The primary bottleneck in CPU/GPU solutions is serialization from memory, even with large multi-level caches. If you can fit the algorithm in L1 cache, then it can go fast. Amdahl’s law kicks in.
The speedup that FPGAs provide comes from removing many of the serial memory accesses and replacing them with parallel accesses to block RAMs, LUT RAMs, and just plain old registers … and in some cases removing the memory accesses altogether when it’s just temporary data that had to be stored (rather than held in some register or cache) in a CPU/GPU solution.
If the algorithm is seriously memory bound, in both CPU/GPU and FPGA, then there isn’t a significant speedup. If the algorithm isn’t memory bound at all, then a highly pipelined superscalar high core count CPU/GPU implementation is likely to be as fast, or faster than an FPGA. Switching speeds in FPGAs are fairly slow compared to a fast ASIC, CPU or GPU.
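The memory-bound point can be put in numbers with Amdahl's law. The sketch below is a generic illustration rather than a measurement of any real system: the function name, the 100x gain, and the three fractions are assumptions chosen only to show the shape of the curve.

```python
def amdahl_speedup(accelerated_fraction, gain):
    """Overall speedup when only part of the runtime is sped up by `gain`.

    accelerated_fraction: share of the original runtime that the accelerator
    can actually improve (the rest stays serial, e.g. stuck on memory).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if gain <= 0:
        raise ValueError("gain must be positive")
    serial_fraction = 1.0 - accelerated_fraction
    return 1.0 / (serial_fraction + accelerated_fraction / gain)


# Hypothetical 100x faster accelerated portion, with varying serial remainders.
for fraction in (0.50, 0.90, 0.99):
    print(f"{fraction:.0%} accelerated -> {amdahl_speedup(fraction, 100):.1f}x overall")
```

Even with a very large gain on the accelerated portion, the overall speedup is capped by whatever remains serial, which is why removing serialized memory accesses matters more than raw switching speed.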
It really doesn’t matter what HDL you use … C, OpenCL, Verilog or VHDL, as long as the back end does a good job at removing dead logic and produces highly optimal netlists. C and OpenCL are good resources for converting large CPU/GPU algorithms to logic … corresponding Verilog and VHDL designs can get nasty.
“If the algorithm isn’t memory bound at all, then a highly pipelined superscalar high core count CPU/GPU implementation is likely to be as fast, or faster than an FPGA.”
While it’s possible to be faster than an FPGA with a high core count, it’s probably not economical energy-wise. An example is the proof-of-work functions in cryptocurrency. They quickly migrated to FPGAs as the energy cost of CPU/GPU was prohibitive. Probably the same economics work in the data center too.
Interesting article, some good points made.
There is clearly a growing interest in this subject from the likes of Xilinx. I’ve seen a number of papers, and I see they have recently acquired DeePhi.