
Silexica Bridges the HLS Gap

The New “Computing” Comes of Age

Today, we call it “acceleration” – the use of specialized hardware to optimize compute tasks that do not perform well on conventional von Neumann processors. We have entered an “age of acceleration” driven primarily by the explosion in AI technology. Countless startups are engaged in developing chips with alternative architectures that accelerate and parallelize various types of compute-intensive algorithms. As a result, we are living in a heterogeneous computing world with processors and accelerators working side by side on a new generation of applications. It is possible, even likely, that this proliferation of acceleration will subsume our current notion of processing, and this heterogeneous approach will simply be the new “computing.”

Front and center in the acceleration race are FPGAs and SoC FPGAs. Because each algorithm wants a slightly different specialized hardware architecture in order to execute efficiently, custom accelerator chips such as ASICs have to make serious compromises in order to function as at least somewhat general-purpose acceleration machines. At a high level, this means system designers are faced with a choice of either a heterogeneous system with numerous ASIC accelerator chips to handle various types of problems, or a single type of compromise chip that handles many types of algorithms well. FPGAs are that compromise. Because FPGAs offer infinitely reconfigurable logic, we can have exactly the accelerator we need for each algorithm, and the only compromise is having that accelerator in programmable logic rather than hardened gates.

The elephant in the room with FPGA-based acceleration, however, is the programming model. Implementing a hardware version of an algorithm in FPGA fabric generally requires hardware engineers with specific FPGA/HDL expertise, and a lot of time. Compared with the traditional software programming model, FPGAs are far more demanding to program. The biggest challenge facing the FPGA industry today is building a workable development flow that allows a software-like methodology for achieving near-optimal acceleration of algorithms in FPGAs.

High-level synthesis (HLS) is a key technology for bridging that gap in expertise and productivity for getting from algorithm to architecture in FPGAs. Both Xilinx and Intel (the two largest FPGA companies) offer HLS flows that target their FPGA families. These HLS tools take C/C++ code and semi-magically produce hardware description language (HDL) architectures (such as register-transfer level Verilog) for implementation in FPGAs.

On the surface, we might be tempted to think “Great! Problem solved.” If we can go from C/C++ to HDL to FPGA hardware, we can just bring our software algorithms in and compile them directly for FPGA acceleration, right?

Oops, we didn’t look at the fine print.

And there is a LOT of fine print. It turns out that HLS tools are able to handle only a very narrow dialect of C and C++. This dialect is so narrow, in fact, that your chances of successfully bringing conventional software into an HLS tool are approximately zero. While HLS tools can process C or C++, they require some pretty specific coding styles in order to produce reasonable hardware architectures. There are numerous language constructs that are not synthesizable. And just getting synthesizable code is only the beginning. HLS is capable of producing an enormous range of architectures for any particular algorithm. In order to get one that meets your design constraints, you’ll need to provide guidance to the tool, and that requires hardware design knowledge. Just throwing some synthesizable C code at an HLS tool could easily get you a solution that is orders of magnitude worse than an optimal one.
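To make the "narrow dialect" point concrete, here is a minimal sketch of the kind of rewrite involved. It assumes, as is commonly the case with HLS tools, that run-time heap allocation is a problem and that fixed-size arrays with statically bounded loops are preferred. The function names and the `kMaxSamples` bound are hypothetical and chosen only for illustration; check your own tool's documentation for what it actually accepts.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Software-style version: the container is sized at run time and grows on
// the heap. This is ordinary C++, but dynamic allocation like this is the
// sort of construct HLS tools typically cannot turn into hardware.
int sum_of_squares_sw(const std::vector<int>& samples) {
    std::vector<int> squares;
    for (int s : samples) {
        squares.push_back(s * s);
    }
    return std::accumulate(squares.begin(), squares.end(), 0);
}

// Hypothetical compile-time bound on the input size (an assumption made
// for this example, not something taken from any tool).
constexpr std::size_t kMaxSamples = 8;

// HLS-friendly version: fixed-size array, no heap, and a loop whose trip
// count is known at compile time, so a tool can size the hardware for it.
int sum_of_squares_hw(const int samples[kMaxSamples]) {
    int acc = 0;
    for (std::size_t i = 0; i < kMaxSamples; ++i) {
        acc += samples[i] * samples[i];
    }
    return acc;
}
```

Both functions compute the same result for an eight-element input. The second simply states its sizes up front, which is the information a hardware generator needs.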

Due to all these issues, HLS turns out to be more of a power tool for hardware designers than a tool that allows software designers to create hardware. HLS can dramatically improve the productivity of hardware designers. And, by mastering the use of HLS and using C/C++ as a higher-level alternative hardware description language, hardware (and HLS) experts can realize enormous performance and efficiency gains in their designs. But this doesn’t solve our fundamental problem for acceleration – getting software to take optimal advantage of FPGA acceleration without needing to bring in a team of FPGA experts for months.

This is where Silexica comes in.

Silexica’s SLX FPGA performs static and dynamic code analysis to give us insight into C/C++ code that we want to accelerate. The SLX FPGA identifies non-synthesizable C/C++ code, detects data types that are not “hardware aware,” and locates parallelism within the C/C++ code that can be accelerated in FPGAs. Beyond that, the SLX FPGA can do automatic or “guided” refactoring of that code for use with HLS. Finally, it can create the HLS pragmas required to optimize the resulting design – taking into account performance goals and available FPGA resources such as DSP blocks and memory. In short, the SLX FPGA acts as an in-house HLS/FPGA expert to get your algorithm from “software” to a combination of software and FPGA-based hardware accelerators.

The rubber meets the road in HLS land with loops in your code. If you have a loop (or nested loop) with some arithmetic operations inside, there is often the potential to unroll and/or pipeline the loop. Often there will be some computationally expensive operation such as multiply-accumulate inside that can take advantage of the DSP resources in a typical FPGA. FPGAs can have thousands of DSP blocks, so it is theoretically possible to have thousands of iterations of a loop executing in hardware in parallel. The amount and type of parallelization depends on the availability of hardware in the FPGA and the data dependencies within the loop structure. You can’t, for example, execute iterations in parallel if one iteration depends on the result of a previous iteration.
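The two loops below sketch that distinction. The pragma spellings follow the Xilinx Vivado HLS convention (`#pragma HLS unroll`, `#pragma HLS pipeline`); other tools use different directives, and an ordinary C++ compiler simply ignores them, possibly with a warning. The function names, the `kTaps` size, and the unroll factor are assumptions made for illustration.

```cpp
#include <cstddef>

// Hypothetical loop length, chosen only for this example.
constexpr std::size_t kTaps = 16;

// Multiply-accumulate: each product a[i] * b[i] is independent of the
// others, so the multiplies can be spread across several DSP blocks.
// The additions form a reduction that a tool can typically restructure.
int dot_product(const int a[kTaps], const int b[kTaps]) {
    int acc = 0;
    for (std::size_t i = 0; i < kTaps; ++i) {
#pragma HLS unroll factor=4
        acc += a[i] * b[i];
    }
    return acc;
}

// Loop-carried dependency: out[i] needs the value computed in iteration
// i - 1, so iterations cannot simply run side by side. Pipelining is
// still possible, but each iteration must wait for the previous sum.
void running_sum(const int in[kTaps], int out[kTaps]) {
    int prev = 0;
    for (std::size_t i = 0; i < kTaps; ++i) {
#pragma HLS pipeline II=1
        prev = prev + in[i];
        out[i] = prev;
    }
}
```

How far the first loop can actually be unrolled depends on the DSP and memory resources available, which is exactly the trade-off a tool or a hardware expert has to weigh.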

Beyond that, most software implementations of algorithms are written with standard data types, with little regard to quantizing down to minimum required bit-widths. When executing on a conventional processor, this doesn’t matter significantly, as the datapaths tend to be fixed width, and the arithmetic processing units are designed for those specific widths. In the world of custom FPGA hardware, however, massive gains can be made by reducing bit widths where possible, as that removes huge amounts of actual hardware from the accelerator architecture. Quantizing and parallelizing loops are where the majority of the gains can be found in moving software algorithms into FPGA-based hardware accelerators.
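As a rough illustration of the bit-width arithmetic, the sketch below computes how many bits an unsigned value range really needs and how wide a full-precision product of two such operands becomes. The helper names are hypothetical. In a real HLS flow these widths would usually be expressed with the vendor's arbitrary-precision types (for example `ap_uint<N>` in Xilinx's HLS libraries); those are left out here so the example compiles with an ordinary C++14 compiler.

```cpp
#include <cstdint>

// Number of bits needed to hold any unsigned value in [0, max_value].
// A range of [0, 0] still needs one bit.
constexpr unsigned bits_for_unsigned(std::uint64_t max_value) {
    unsigned bits = 1;
    while (max_value >>= 1) {
        ++bits;
    }
    return bits;
}

// A full-precision product of two unsigned operands needs at most the
// sum of the operand widths.
constexpr unsigned product_bits(unsigned a_bits, unsigned b_bits) {
    return a_bits + b_bits;
}
```

For example, if samples never exceed 1000 and coefficients never exceed 255, the operands need 10 and 8 bits, and the multiplier output needs 18 bits. Written with plain `int`, the same multiply would be treated as 32 bits by 32 bits unless the tool or the designer narrows it.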

The transition from sequential code to parallel, optimized hardware is still far more art than science. While HLS can dramatically accelerate the creation of those optimized hardware architectures, it still only slightly shifts the level and type of engineering expertise required, from “RTL designer” to “hardware expert with HLS experience.” The Silexica SLX FPGA probably doesn’t remove the need for hardware expertise entirely, but it does have a good chance to change the “art” of hardware accelerator optimization into more of a “paint by number” operation. It will be interesting to see how teams take advantage of this type of tool as we see more and more compute-intensive tasks being moved to heterogeneous computing environments with FPGA accelerators.

5 thoughts on “Silexica Bridges the HLS Gap”

  1. FPGAs have many embedded memory blocks that can be used instead of flops to implement registers. True dual port mode can read two operands while another block reads the operator and address of the next operand for the algorithm, etc.

    For multiply/add, just pipeline using another small memory block and adder.

    And probably best of all the Roslyn compiler syntax API makes it easy to parse the algorithm to get the values and operator sequence for evaluation. Then simple classes can be used for simulation and debug using the same logic as the FPGA.

    Yes indeed heterogeneous is the way to go. Down with HDL for design entry.

  2. The C# compiler can be used to generate an AST from C source code.
    The SyntaxWalker emits nodes in the sequence for evaluation of if/else, while, for, expressions, etc.
    As far as I can tell, HLS (after 20 years or more of “maturing”) is finally able to handle expressions, which I think is implied by the “very limited dialect” of C mentioned in this article.

    The C# compiler uses the AST for code generation and the code does not have to be verified. (Debugged? Yes.)

    Compiled code is used without verification, so synthesis using the same AST should not need to be verified.

    1. I am building my HLS toolkit (Quokka https://github.com/EvgenyMuryshkin/QuokkaEvaluation) and it is based on Roslyn.

      So yes, it is rather powerful and many C# constructs can be mapped to hardware.
      Syntax check came out of the box, but verification still needs to be done via unit/integration testing.

      What is really cool is the ability to use a modern IDE (Visual Studio, VS Code), which comes with rich code support, IntelliSense, refactoring capabilities, debugging and testing infrastructure.

      At some stage of development, I noticed that hardware just runs without even looking into HDL (as long as you stick to what the toolkit can support without trying to hack around it).

  3. Regarding C#: we actually implemented HLS for C# with our Hastlayer project. It actually decompiles the compiled .NET assemblies with ILSpy instead of processing the C# source. An AST is created there as well and this way it can support the other .NET languages too. And this way it’s possible to work with a simplified AST: for example, lambda expressions (Karl, probably you mean these when you talk about expressions) don’t need to be supported: these are actually compiled into classes and methods by the C# compiler.

    1. Actually I meant Assignment expressions: more specifically SimpleAssignmentExpressions and BinaryExpressions as exposed by SyntaxWalker Nodes.Kind(). Lambdas are beyond what hardware designers need. Sadly, they probably don’t even use conditional assignments or embedded memory blocks.

      Classes/Objects correspond to hardware modules.

      Get/Set can emulate handshaking among hardware modules.

      Arithmetic and Boolean expressions are used in both classes and modules.

      And yes, lambdas are very useful for creating expression trees internally.
