Today, we call it “acceleration” – the use of specialized hardware to optimize compute tasks that do not perform well on conventional von Neumann processors. We have entered an “age of acceleration” driven primarily by the explosion in AI technology. Countless startups are engaged in developing chips with alternative architectures that accelerate and parallelize various types of compute-intensive algorithms. As a result, we are living in a heterogeneous computing world with processors and accelerators working side by side on a new generation of applications. It is possible, even likely, that this proliferation of acceleration will subsume our current notion of processing, and this heterogeneous approach will simply be the new “computing.”
Front and center in the acceleration race are FPGAs and SoC FPGAs. Because each algorithm wants a slightly different specialized hardware architecture in order to execute efficiently, custom accelerator chips such as ASICs have to make serious compromises in order to function as at least somewhat general-purpose acceleration machines. At a high level, this means system designers are faced with a choice of either a heterogeneous system with numerous ASIC accelerator chips to handle various types of problems, or a single type of compromise chip that handles many types of algorithms well. FPGAs are that compromise. Because FPGAs offer infinitely reconfigurable logic, we can have exactly the accelerator we need for each algorithm, and the only compromise is having that accelerator in programmable logic rather than hardened gates.
The elephant in the room with FPGA-based acceleration, however, is the programming model. Implementing a hardware version of an algorithm in FPGA fabric generally requires hardware engineers with specific FPGA/HDL expertise, and a lot of time. When compared with the traditional software programming model, FPGAs are vastly more demanding to program. The biggest challenge facing the FPGA industry today is building a workable development flow that allows software-like methodology for achieving near-optimal acceleration of algorithms in FPGAs.
High-level synthesis (HLS) is a key technology for bridging that gap in expertise and productivity for getting from algorithm to architecture in FPGAs. Both Xilinx and Intel (the two largest FPGA companies) offer HLS flows that target their FPGA families. These HLS tools take C/C++ code and semi-magically produce hardware description language (HDL) architectures (such as register-transfer-level Verilog) for implementation in FPGAs.
On the surface, we might be tempted to think “Great! Problem solved.” If we can go from C/C++ to HDL to FPGA hardware, we can just bring our software algorithms in and compile them directly for FPGA acceleration, right?
Oops, we didn’t look at the fine print.
And there is a LOT of fine print. It turns out that HLS tools are able to handle only a very narrow dialect of C and C++. This dialect is so narrow, in fact, that your chances of successfully bringing conventional software into an HLS tool are approximately zero. While HLS tools can process C or C++, they require some pretty specific coding styles in order to produce reasonable hardware architectures. There are numerous language constructs that are not synthesizable. And just getting synthesizable code is only the beginning. HLS is capable of producing an enormous range of architectures for any particular algorithm. In order to get one that meets your design constraints, you’ll need to provide guidance to the tool, and that requires hardware design knowledge. Just throwing some synthesizable C code at an HLS tool could easily get you a solution that is orders of magnitude worse than an optimal one.
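To make the "narrow dialect" point concrete, here is a minimal sketch contrasting the two coding styles. The function names are illustrative, and the pragma follows Xilinx Vivado/Vitis HLS conventions; an ordinary compiler simply ignores the unknown pragma, so both versions still build and run as software.

```cpp
#include <cstdlib>

// Non-synthesizable style: heap allocation and a data-dependent loop
// bound are constructs HLS tools typically reject or handle poorly.
int sum_dynamic(const int *data, int n) {
    int *copy = (int *)std::malloc(n * sizeof(int)); // dynamic memory: not synthesizable
    int acc = 0;
    for (int i = 0; i < n; ++i) {   // unbounded trip count: hard for HLS to schedule
        copy[i] = data[i];
        acc += copy[i];
    }
    std::free(copy);
    return acc;
}

// HLS-friendly style: static storage, a compile-time loop bound, and a
// pragma hinting at the desired hardware architecture.
#define N 8
int sum_static(const int data[N]) {
    int acc = 0;
    for (int i = 0; i < N; ++i) {
#pragma HLS UNROLL
        acc += data[i];
    }
    return acc;
}
```

Both functions compute the same sum, but only the second gives an HLS tool the fixed bounds and static memory layout it needs to generate a sensible hardware architecture.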
Due to all these issues, HLS turns out to be more of a power tool for hardware designers than a tool that allows software designers to create hardware. HLS can dramatically improve the productivity of hardware designers. And, by mastering the use of HLS and using C/C++ as a higher-level alternative hardware description language, hardware (and HLS) experts can realize enormous performance and efficiency gains in their designs. But this doesn’t solve our fundamental problem for acceleration – getting software to take optimal advantage of FPGA acceleration without needing to bring in a team of FPGA experts for months.
This is where Silexica comes in.
Silexica’s SLX FPGA performs static and dynamic code analysis to give us insight into C/C++ code that we want to accelerate. The SLX FPGA identifies non-synthesizable C/C++ code, detects data types that are not “hardware aware,” and locates parallelism within the C/C++ code that can be accelerated in FPGAs. Beyond that, the SLX FPGA can do automatic or “guided” refactoring of that code for use with HLS. Finally, it can create the HLS pragmas required to optimize the resulting design – taking into account performance goals and available FPGA resources such as DSP blocks and memory. In short, the SLX FPGA acts as an in-house HLS/FPGA expert to get your algorithm from “software” to a combination of software and FPGA-based hardware accelerators.
The rubber meets the road in HLS land with loops in your code. If you have a loop (or nested loop) with some arithmetic operations inside, there is often the potential to unroll and/or pipeline the loop. Often there will be some computationally expensive operation such as multiply-accumulate inside that can take advantage of the DSP resources in a typical FPGA. FPGAs can have thousands of DSP blocks, so it is theoretically possible to have thousands of iterations of a loop executing in hardware in parallel. The amount and type of parallelization depends on the availability of hardware in the FPGA and the data dependencies within the loop structure. You can’t, for example, execute iterations in parallel if one iteration depends on the result of a previous iteration.
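A sketch of both cases, using Xilinx-style pragmas (PIPELINE and UNROLL are Vivado/Vitis HLS directives; the function names here are illustrative). On a conventional compiler the pragmas are ignored and the code runs as plain C++:

```cpp
#include <cstdint>

#define N 16

// Multiply-accumulate over fixed-size arrays: the classic candidate for
// FPGA DSP blocks. The pragmas request one loop iteration per clock
// (initiation interval II=1) with a partial unroll, so that several
// multiplies can map to DSP slices in parallel.
int32_t dot_product(const int16_t a[N], const int16_t b[N]) {
    int32_t acc = 0;
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=4
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}

// Counter-example: a loop-carried dependency. Each iteration needs the
// previous iteration's result, so the iterations cannot execute in
// parallel no matter how many DSP blocks the FPGA has.
int32_t running_feedback(const int16_t x[N]) {
    int32_t y = 0;
    for (int i = 0; i < N; ++i) {
        y = y / 2 + x[i];   // y depends on the previous y: serializes the loop
    }
    return y;
}
```

The first loop is the kind HLS can spread across DSP resources; the second illustrates the dependency limit described above, where parallelization is blocked by the algorithm itself rather than by available hardware.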
Beyond that, most software implementations of algorithms are written with standard data types, with little regard to quantizing down to minimum required bit-widths. When executing on a conventional processor, this doesn’t matter significantly, as the datapaths tend to be fixed width, and the arithmetic processing units are designed for those specific widths. In the world of custom FPGA hardware, however, massive gains can be made by reducing bit widths where possible, as that removes huge amounts of actual hardware from the accelerator architecture. Quantizing and parallelizing loops are where the majority of the gains can be found in moving software algorithms into FPGA-based hardware accelerators.
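A minimal quantization sketch: a floating-point coefficient scaled into Q1.7 fixed point (8 bits total, 7 fractional). HLS vendors expose arbitrary bit widths through template types such as Xilinx's ap_int&lt;W&gt;/ap_fixed&lt;W,I&gt;; this sketch approximates that with int8_t so it runs anywhere, and the constants and function names are illustrative.

```cpp
#include <cstdint>
#include <cmath>

const int FRAC_BITS = 7;  // Q1.7: 7 fractional bits

// Scale a float in roughly [-1, 1) into an 8-bit fixed-point value.
// An 8-bit multiplier consumes a small fraction of the FPGA resources
// that a 32-bit floating-point multiplier would.
int8_t quantize(float coeff) {
    return (int8_t)std::lround(coeff * (1 << FRAC_BITS));
}

// Fixed-point multiply: 8x8 -> 16-bit product, shifted back to Q1.7.
int16_t fixed_mul(int8_t a, int8_t b) {
    return (int16_t)(((int16_t)a * b) >> FRAC_BITS);
}

// Convert a Q1.7 result back to floating point for inspection.
float dequantize(int16_t v) {
    return (float)v / (1 << FRAC_BITS);
}
```

For example, 0.5 quantizes to 64, and multiplying two such values in fixed point yields 32, which dequantizes back to 0.25. The hardware win comes from every multiplier, adder, and register in the datapath shrinking to the minimum width the algorithm actually needs.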
The transition from sequential code to parallel, optimized hardware is still far more art than science. While HLS can dramatically accelerate the creation of those optimized hardware architectures, it shifts rather than removes the engineering expertise required, from "RTL designer" to "hardware expert with HLS experience." The Silexica SLX FPGA probably doesn't eliminate the need for hardware expertise entirely, but it does have a good chance of changing the "art" of hardware accelerator optimization into more of a "paint by number" operation. It will be interesting to see how teams take advantage of this type of tool as more and more compute-intensive tasks move to heterogeneous computing environments with FPGA accelerators.