
HLS Powers AI Revolution

Will Inference Finally Mainstream HLS?

I started working in high-level synthesis (HLS) in 1994, which, assuming my math is correct, was 26 years ago. In those early days, we referred to the technology as “behavioral synthesis” because it relied on analyzing the desired behavior of a circuit in order to create a structural description, rather than simply converting a higher-level structural description into a lower-level one.

HLS promised to replace thousands of lines of detailed RTL code with a few dozen lines of behavioral description as the primary design entry method. It brought the ability to quickly try out a huge range of architectural options for a datapath and select the one that best realized the key design constraints for a particular application. HLS decoupled the specifics of hardware architecture from the desired behavior of the design. It expanded the concept of re-use. It raised the level of abstraction. It would deliver an enormous gain in designer productivity.

Our team, and those like ours, were giddy about the potential of HLS to revolutionize digital circuit design. While the technical challenges with HLS were enormous, the promise it brought was so compelling that we were certain HLS would completely take over electronic design within three to five years.

Boy, were we wrong.

HLS, it turns out, was much more difficult than we had first estimated. Sure, our tool could give a snappy demo, and we could crank out an FFT example design with the best of them, but being able to handle real-world designs in a practical and compelling way proved to be a vast and consuming technical challenge. Months of development and testing became years, and HLS became jokingly known as “the technology of the future – and always will be.” While numerous chip design groups did pilot projects and beta tests with varying degrees of success, integrating HLS into the existing RTL-centric design and verification flow was prohibitively difficult for most. HLS was relegated to the role of “science-fair experiment.”


Over the two decades that followed, HLS slowly and steadily matured. Moore’s Law drove design complexity to a point that methodologies had to change. While HLS offered a compelling way to raise the level of design abstraction, more of the burden was borne by increased scale and complexity of re-usable IP blocks. Chip design teams would assemble a system-on-chip primarily from re-usable IP, and would then use either HLS or conventional RTL design to create the comparatively small and unique portion of the design that remained. Over time, the benefits of HLS became clear to a wider and wider segment of the engineering community, and HLS captured more of the design of those “special sauce” design blocks. But the dream of giant systems on chip designed completely using behavioral abstractions was never to be.

HLS hit a major milestone in 2011, with Xilinx’s acquisition of a company called AutoESL. There was a powerful technological synergy between the ability of HLS to quickly generate optimized datapath architectures from behavioral descriptions and the ability of FPGAs to quickly create hardware implementations of those datapaths. The combination of HLS and FPGA promised to form a kind of new “compiler” and “processor” pair that could take C/C++ code and create an engine that could trounce conventional processors in performance and power efficiency. A typical simple but computationally demanding algorithm put through the HLS-to-FPGA flow could easily beat a conventional software-on-CPU implementation of the same algorithm by orders of magnitude.
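To make that concrete, here is a minimal sketch (the function name and sizes are illustrative, and the pragma spelling follows Vivado HLS conventions; other tools use different directives) of the kind of C++ kernel an HLS-to-FPGA flow can turn into a pipelined, parallel datapath:

    #define N 1024

    // A simple dot product. To a CPU this is a sequential loop; to an HLS
    // tool it is a description of a multiply-accumulate datapath.
    int dot_product(const int a[N], const int b[N]) {
        int acc = 0;
        for (int i = 0; i < N; i++) {
    #pragma HLS PIPELINE II=1   // request one multiply-accumulate per clock cycle
            acc += a[i] * b[i];
        }
        return acc;
    }

A dozen lines like these stand in for the pages of RTL that would otherwise describe the multipliers, accumulator, and control logic by hand.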

Xilinx seeded the world with HLS technology by quickly moving their HLS tool into their core Vivado design tool framework, essentially making HLS available for a very low cost to an enormous community of designers. Other HLS offerings from the large EDA suppliers were exotic technology, and they carried price tags to match. EDA had found a niche for HLS in high-end ASIC design and had kept the price of the technology high in order to capitalize on that. Xilinx’s offering, on the other hand, brought HLS to the masses. Over the years that followed, Xilinx built up the largest community of customers using HLS in the world – by far. 

Still, HLS faced challenges making the move to mainstream. Even though HLS typically starts with a description in C or C++, it is most definitely not a technology that can compile conventional software into hardware. Many code constructs that work well in software running on a CPU are not directly synthesizable into hardware, and writing code that will result in efficient HLS implementations is an art unto itself. Furthermore, HLS tools themselves require a working knowledge of hardware structures that most software engineers do not have, so HLS is usually more of a “power tool” for experienced hardware designers, rather than a tool that enables software developers to design hardware. In fact, the C or C++ used for HLS should be thought of more as a hardware description language with different syntax and semantics, rather than as “software.”
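As a hypothetical illustration of that last point, the first function below is perfectly ordinary software but leans on dynamic allocation, which most HLS tools cannot synthesize; the second expresses the same intent with a fixed bound and caller-provided storage, giving the tool a structure it can map onto memories and a known loop:

    // Typical software style: heap allocation, size unknown until runtime.
    // This compiles and runs fine on a CPU, but most HLS tools will reject it.
    int *scale_sw(const int *in, int n, int k) {
        int *out = new int[n];        // dynamic allocation is not synthesizable
        for (int i = 0; i < n; i++)
            out[i] = in[i] * k;
        return out;
    }

    // HLS-friendly rewrite: fixed bound, static storage supplied by the caller.
    #define MAX_N 256
    void scale_hls(const int in[MAX_N], int out[MAX_N], int k) {
        for (int i = 0; i < MAX_N; i++)
            out[i] = in[i] * k;       // maps cleanly to a small pipelined datapath
    }

The two functions mean roughly the same thing to a programmer, but only the second reads as a hardware description to an HLS tool.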

Now, however, HLS has found a new avenue for mainstream acceptance, and this one may be the most compelling yet. With the rapid growth of AI, there has been an industry-wide rush to find ways to accelerate AI inference. Numerous startups, as well as established chip companies, have rushed to come up with the best processor architecture to outperform conventional CPUs at the massive convolutions required by convolutional neural networks (CNNs) and other AI architectures. FPGAs have found a clear spotlight in this race, due to their ability to create custom hardware optimized for each particular neural network – something that no hard-wired AI accelerator can hope to accomplish. FPGAs have the potential to outperform even the best dedicated general-purpose AI accelerator chips by a significant margin.
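To show what those convolutions look like at the level HLS works with, here is a deliberately simplified sketch (single channel, 3x3 filter, illustrative sizes, and Vivado-style pragmas, not drawn from any particular product) of the loop nest an HLS tool would turn into a layer-specific convolution engine:

    #define H 32
    #define W 32
    #define K 3

    // Simplified 2D convolution (valid region only). Real CNN layers add
    // input/output channels and batching, but the loop structure that HLS
    // unrolls and pipelines into custom hardware looks much like this.
    void conv2d(const int in[H][W], const int coeff[K][K],
                int out[H - K + 1][W - K + 1]) {
        for (int r = 0; r < H - K + 1; r++) {
            for (int c = 0; c < W - K + 1; c++) {
    #pragma HLS PIPELINE II=1      // aim for one output pixel per clock
                int acc = 0;
                for (int kr = 0; kr < K; kr++) {
                    for (int kc = 0; kc < K; kc++) {
    #pragma HLS UNROLL             // flatten the 3x3 multiply-accumulates into parallel hardware
                        acc += in[r + kr][c + kc] * coeff[kr][kc];
                    }
                }
                out[r][c] = acc;
            }
        }
    }

Because the loop bounds and unroll factors can be tuned per layer, the generated hardware can match the exact shape of a given network, which is the per-network customization that hard-wired accelerators cannot offer.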

But FPGAs are hamstrung in the AI arena by the same issue that has plagued them in every other market. Creating a design with an FPGA is hard. Getting from the realm of the AI expert/data scientist using frameworks such as TensorFlow down to the level of synthesizable RTL for an optimized FPGA (or ASIC) implementation of that neural network is a daunting challenge that spans multiple engineering domains. HLS has the ability to bridge that divide, creating a deterministic path from AI framework to RTL. 

Now, we see numerous companies (FPGA as well as EDA) working to create smooth, tested design flows that encapsulate the AI-HLS-RTL methodology. Xilinx, Intel, Mentor, and Cadence have all made announcements and rolled out products that implement various flavors of this flow. Xilinx uses their HLS tool in their recently announced Vitis AI framework. Intel’s comparatively new HLS technology is a key component of their ambitious oneAPI framework. Cadence touts their Stratus HLS tool for AI accelerator design, as does Mentor with their Catapult HLS AI Toolkit, and companies like Silexica are augmenting those design flows with tools that optimize and drive the HLS process for creation of accelerator architectures (see https://www.eejournal.com/article/silexica-bridges-the-hls-gap/).

HLS will be working quietly behind the scenes in many of these AI-to-hardware implementation flows, but it will clearly be a key enabling technology. As AI inference claws its way out of the data center and into edge and endpoint IoT devices, the need for highly optimized inference solutions that deliver high performance on a tiny power budget should pull numerous design teams into this type of design methodology. It will be interesting to watch.


One thought on “HLS Powers AI Revolution”

  1. A related article says: “But, if we know that RTL is so bad, we also know, somewhere deep down, that we will need to do something different someday. But what? And when?”
    RTL goes back to “Verilog can be simulated, therefore we MUST use it for design entry,” which is one of the dumbest illogical conclusions ever. Verilog is for synthesis, not logic design.
    HLS and OpenCL are focused on expression evaluation, not the interaction of IP blocks.
    The real problem is, and always has been, getting the data and the events (inputs) that trigger actions.
    That is the problem that accelerators and all the complexities of out-of-order execution, branch prediction, caches, and the rest of superscalar design try to solve.
    GPUs and accelerators stream the data to on-chip memory so it can be accessed at the same speed as registers. Matrix algebra needs to address rows and columns; try a true dual-port embedded memory block.
    IP involves connecting functional blocks. Try defining classes for the IP and let the OOP compiler/debugger take a lot of the pain out of design and debug. And you can run the compiled code, too.

