The history of digital hardware design is one of managing ever-increasing complexity by raising the level of design abstraction. When our digital circuits had four inputs, it was completely reasonable to do logic minimization with a Karnaugh map. When sequential logic was involved, a state diagram was a nice way to work things out, and we could generally draw a single page schematic with a dozen or so logic gates describing our implementation. As the number of logic gates soared, though, those schematics became hundreds of incomprehensible pages.
We leveled up, of course, and adopted register-transfer level design and logic synthesis. At that higher level of abstraction, we could describe tens of thousands of gates of logic with just a few hundred lines of hardware description language. The downside, of course, was that we were now separated from the “bare metal” of the logic gates in our design. Our primary source – the thing we created – was the RTL/HDL description. The gate-level netlist that was generated from that was truly understood only by the synthesis tool we used.
Moore’s Law, however, was not done with us yet. Exponentials are unrelenting, and “hundreds of lines” of RTL quickly expanded into “tens of thousands” of lines of RTL. We once again were designing a thing more complex than we could manage. This time, we raised our level of abstraction by encapsulation. We created standard blocks of IP for complex, commonly-used functions. Multipliers, ALUs, microcontrollers, and even full-fledged multi-core application processors became black boxes we could drop into our design with no need to dive into the thousands of lines of underlying RTL code that implemented them. Again, the tradeoff was lack of visibility into the details of our design. Not only could we no longer access the “bare metal” logic gates, we didn’t even have access to, or an understanding of, the register-transfer-level microarchitectures that implemented most of our biggest functions.
But virtually no hardware design is a simple drag-and-drop affair that can be assembled from large, pre-designed chunks of IP. If that’s all we need, we can simply select one of the thousands of pre-designed, pre-manufactured SoCs on the market and use software to implement whatever our application demands. Custom hardware inevitably has some new, original content – some secret sauce – that isn’t just code running on an MCU or CPU. The whole reason we have programmable logic devices such as FPGAs is to enable the implementation of that “secret sauce” that differentiates our design.
High-level Synthesis (HLS) was created to address this problem. RTL/HDL captures the specific structure and implementation of a design, whereas HLS raises our level of abstraction an additional layer – to a (mostly) purely behavioral/functional description that describes what the design does, independent of implementation. To use HLS, the designer creates the design with a “software” language such as C/C++ that captures the algorithm. Then, the HLS tool helps to find the best way to implement that function in hardware – given the overall design constraints such as available hardware resources, required performance, power consumption, and so forth.
The key words in that last paragraph are “helps to find.” HLS is not a “compiler” in the software sense. HLS, as it exists today, is a power tool for hardware experts, not a magic “hardware engineer in a box” that converts conventional software into hardware. Using HLS effectively requires a deep understanding of hardware engineering concepts such as datapath flow and control, parallelism, resource utilization, loop unrolling, pipelining, memory access architectures, latency, throughput, power consumption – the list goes on and on. HLS tools generally give a hardware design expert a massive productivity boost by automating the low-level aspects of designing a hardware architecture, allowing the designer to evaluate more options and implement their chosen solution more quickly.
Today, however, the problem has become more challenging still. Hardware/FPGA design skills such as writing RTL/HDL, optimizing logic synthesis, running place-and-route, and achieving timing closure are rare. And many of the “secret sauce” elements are extremely complex and come from non-hardware-design domains such as AI. Many engineering teams that have sophisticated algorithms that require hardware acceleration don’t have access to hardware engineers, or, if they do, the task of passing design information from domain experts such as data scientists and software algorithm experts to hardware engineers is unmanageably complex. What we now need is a way for application/domain experts to take advantage of the hardware-optimizing power of HLS and to build a bridge directly from the functionality and constraints specified by those engineers to optimized implementations in devices such as FPGAs or ASICs.
HLS and FPGAs are a superb fit, and Xilinx almost certainly leads the world in HLS users. FPGAs allow very rapid development and deployment of complex custom logic designs with maximum flexibility to adapt to design and requirements changes. HLS facilitates very rapid development of hardware architectures for FPGA implementation, so the combination of HLS and FPGA creates a flow from functional description to working hardware implementation that is unmatched. Xilinx recognized this years ago and made an enormous push to bring the world of FPGA designers up to speed on HLS. They offered sophisticated HLS capabilities right in their proprietary tool suite and made it accessible and affordable to the masses.
Xilinx’s HLS-for-all strategy was in sharp contrast to the EDA companies who were marketing HLS tools to the most elite ASIC designers at the most elite price points. The result was that Xilinx quickly built a comparatively enormous user base of HLS adopters and gained a tremendous amount of experience with users adopting HLS for FPGA-based design.
But, as we pointed out above, a capable hardware engineer is pretty much mandatory to drive the HLS tool. A jet can get you from point A to point B faster than anything, but it does require a jet pilot to make it happen. What we really need is a way to bridge the gap that separates system designers, application developers, data scientists, and other domain experts from the HLS/FPGA flow. And, of course, any time there is an obvious “what we really need” you can count on the startup universe to begin cranking out innovative solutions to chip away at the problem.
One of the most successful of those efforts is Silexica’s SLX FPGA tool. SLX FPGA does a lot of the work of the hardware engineer in taking “ordinary” C/C++ code and converting it to “HLS-ready” C/C++. In most cases, the difference is a set of pragmas or directives that must be added to the code (usually by a hardware engineer) to instruct the HLS tool how to optimize the hardware architecture of the design. These pragmas may include instructions for unrolling and pipelining loops, managing memory access and creating hardware memory structures from C++ data structures, and so forth. SLX FPGA performs both static and dynamic analysis of the C++ code as well as the design constraints, identifies critical issues like data dependencies, and makes recommendations about which particular pragmas to insert for the best HLS results. This is not a trivial task, and SLX FPGA dramatically lowers the bar on HLS-specific expertise required to get the best results from an HLS tool.
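To make the “ordinary” versus “HLS-ready” distinction concrete, here is a minimal sketch of the same dot product written both ways. The function names, the vector length, and the particular pragma choices are ours, picked purely for illustration; they are not output from SLX FPGA. The directives follow the general form of Vitis HLS pragmas for pipelining, unrolling, and array partitioning, and an ordinary C++ compiler simply ignores them (possibly with a warning), so both functions compute the same value in software.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLen = 16;  // hypothetical vector length, chosen for illustration
static_assert(kLen % 4 == 0, "the unroll/partition factor of 4 assumes kLen is a multiple of 4");

// "Ordinary" C++: what a software developer would naturally write.
std::int64_t dot_plain(const std::int16_t a[kLen], const std::int16_t b[kLen]) {
    assert(a != nullptr && b != nullptr);
    std::int64_t acc = 0;
    for (std::size_t i = 0; i < kLen; ++i) {
        acc += static_cast<std::int32_t>(a[i]) * b[i];
    }
    return acc;
}

// "HLS-ready" C++: the same algorithm plus directives of the kind a hardware
// engineer (or an assistant tool) might add. The specific factors are assumptions.
std::int64_t dot_hls(const std::int16_t a[kLen], const std::int16_t b[kLen]) {
    assert(a != nullptr && b != nullptr);

    // Local copies that the HLS tool can map to on-chip memory.
    std::int16_t a_buf[kLen];
    std::int16_t b_buf[kLen];
    // Ask for each buffer to be split into 4 banks so that 4 elements
    // can be read in the same clock cycle.
#pragma HLS ARRAY_PARTITION variable=a_buf cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=b_buf cyclic factor=4 dim=1

    for (std::size_t i = 0; i < kLen; ++i) {
        // Ask for a new loop iteration to start every clock cycle.
#pragma HLS PIPELINE II=1
        a_buf[i] = a[i];
        b_buf[i] = b[i];
    }

    std::int64_t acc = 0;
    for (std::size_t i = 0; i < kLen; ++i) {
        // Ask for 4 copies of the multiply-accumulate body to be built in parallel.
#pragma HLS UNROLL factor=4
        acc += static_cast<std::int32_t>(a_buf[i]) * b_buf[i];
    }
    return acc;
}
```

The algorithm is untouched; only the annotations differ. Choosing which loops to pipeline, how far to unroll, and how to bank the memories so the unrolled hardware is not starved for data is exactly the kind of decision that normally calls for hardware expertise.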
This week, Silexica announced the release of a new plugin for Xilinx’s Vitis Unified Software Platform. The new plugin expands the capabilities of the Vitis HLS tool, enabling the addition of new pragmas and compiler optimizations. That means Silexica can add powerful optimization controls to the existing Xilinx tool, which can either be manually inserted by designers or automatically exercised by the full SLX tool. This paves the way for SLX to perform more, and more powerful, optimizations of the HLS process, making it easier to get from generic C/C++ to optimized hardware implementations.
The initial version of the SLX Plugin – which the company says is the first commercial plugin for Vitis – contains a new “loop interchange pragma,” which is a simple-yet-powerful feature that allows nested loops to be interchanged, with the outer loop becoming the inner and vice versa. In many situations, this can result in substantial improvements in the optimality of the hardware implementation generated by HLS. Loop Interchange can remove data dependencies that block HLS from improving parallelism, pipelining, and memory access regularity. The plugin also improves the HLS design’s internal program representation before the RTL code is generated, which allows the HLS tool to more easily relocate loop-invariant memory accesses. It also verifies that the transformation does not cause functionality issues and is not unsafe or illegal from a data-flow point of view.
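A small, hand-written sketch shows what loop interchange does and why it can matter. This is our own illustration with hypothetical names and sizes; it does not use the SLX pragma itself, whose syntax is not reproduced here. In the first version, every pass of the inner loop updates the same accumulator, so each iteration must wait for the previous one, and the matrix is read down a column with a stride. After the two loops are swapped by hand, consecutive inner iterations touch different accumulators and walk the matrix row by row.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kRows = 4;  // hypothetical matrix dimensions
constexpr std::size_t kCols = 3;

// Original nest: columns outside, rows inside. The inner loop repeatedly
// read-modify-writes sums[c], a loop-carried dependence, and reads m
// with a stride of kCols elements.
void col_sums_original(const float m[kRows][kCols], float sums[kCols]) {
    assert(m != nullptr && sums != nullptr);
    for (std::size_t c = 0; c < kCols; ++c) {
        sums[c] = 0.0f;
    }
    for (std::size_t c = 0; c < kCols; ++c) {
        for (std::size_t r = 0; r < kRows; ++r) {
            sums[c] += m[r][c];
        }
    }
}

// Interchanged nest: rows outside, columns inside. Consecutive inner
// iterations update different elements of sums and read m contiguously.
// The swap is legal here because no iteration depends on a value produced
// for a different column, and each column is still summed in row order.
void col_sums_interchanged(const float m[kRows][kCols], float sums[kCols]) {
    assert(m != nullptr && sums != nullptr);
    for (std::size_t c = 0; c < kCols; ++c) {
        sums[c] = 0.0f;
    }
    for (std::size_t r = 0; r < kRows; ++r) {
        for (std::size_t c = 0; c < kCols; ++c) {
            sums[c] += m[r][c];
        }
    }
}
```

Both functions produce identical results, since each column is accumulated in the same row order either way. What changes is the shape of the dependencies the HLS tool sees, which is why an interchange has to be checked for legality before it is applied.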
This is the first of many pragmas Silexica plans to add to the SLX plugin. The model is a bit like a “freemium” play, where the loop interchange optimization will come for free, and additional planned features will be available for customers of the SLX FPGA. While the SLX FPGA certainly makes HLS easier for non-experts, it also brings a powerful set of automated capabilities to experienced HLS users, including identifying non-synthesizable code constructs and suggesting alternatives, recommending optimizations, and automatically adding selected pragmas. The SLX Plugin is available immediately, standalone and free as an add-on to Vitis HLS 2020.2. The SLX FPGA tool will support the SLX Plugin in 2020.4, releasing on January 11, 2021.
Also this week, Xilinx announced they were acquiring the assets of Falcon Computing Solutions, the developer of another HLS compiler optimization technology focused specifically on hardware acceleration of software applications. Xilinx plans to use the technology to “make adaptive computing more accessible to software developers by enhancing the Xilinx Vitis Unified Software Platform with automated hardware-aware optimizations.”
The Falcon compiler uses machine learning algorithms to optimize code for FPGA implementation in hardware. It essentially learns how the HLS compiler behaves and is trained to find the best places in code to apply certain optimization pragmas. The goal is, again, to make HLS more approachable by software developers with minimal hardware expertise. Falcon was developed by a team led by industry legend Dr. Jason Cong, who, interestingly, also co-founded AutoESL – which was acquired by Xilinx in 2011 and is now the basis of Xilinx’s Vitis HLS tool – as well as Neptune Design Automation – a startup focused on FPGA physical design optimization, which Xilinx acquired in 2013, and whose technology is now part of Xilinx’s Vivado FPGA design suite.
It isn’t entirely clear how the capabilities of the Falcon and Silexica tools overlap, or how they would complement each other. It will be interesting to watch as the two technologies make their way into the Xilinx user base, with the clear goal of enabling system and software developers to more easily take advantage of FPGA technology to deliver massive acceleration of their applications.