feature article

HLS Powers AI Revolution

Will Inference Finally Mainstream HLS?

I started working in high-level synthesis (HLS) in 1994, which, assuming my math is correct, was 26 years ago. In those early days, we referred to the technology as “behavioral synthesis,” because it relied on analyzing the desired behavior of a circuit in order to create a structural description, rather than simply converting a higher-level structural description into a lower-level one.

HLS promised to replace thousands of lines of detailed RTL code with a few dozen lines of behavioral description as the primary design entry method. It brought the ability to quickly try out a huge range of architectural options for a datapath and select the one that best realized the key design constraints for a particular application. HLS de-coupled the specifics of hardware architecture from the desired behavior of the design. It expanded the concept of re-use. It raised the level of abstraction. It would deliver an enormous gain in designer productivity.

Our team, and those like ours, were giddy about the potential of HLS to revolutionize digital circuit design. While the technical challenges with HLS were enormous, the promise it brought was so compelling that we were certain HLS would completely take over electronic design within three to five years.

Boy, were we wrong.

HLS, it turns out, was much more difficult than we had first estimated. Sure, our tool could give a snappy demo, and we could crank out an FFT example design with the best of them, but being able to handle real-world designs in a practical and compelling way proved to be a vast and consuming technical challenge. Months of development and testing became years, and HLS became jokingly known as “the technology of the future – and always will be.” While numerous chip design groups did pilot projects and beta tests with varying degrees of success, integrating HLS into the existing RTL-centric design and verification flow was prohibitively difficult for most. HLS was relegated to the role of “science-fair experiment.”

 

Over the two decades that followed, HLS slowly and steadily matured. Moore’s Law drove design complexity to a point that methodologies had to change. While HLS offered a compelling way to raise the level of design abstraction, more of the burden was borne by increased scale and complexity of re-usable IP blocks. Chip design teams would assemble a system-on-chip primarily from re-usable IP, and would then use either HLS or conventional RTL design to create the comparatively small and unique portion of the design that remained. Over time, the benefits of HLS became clear to a wider and wider segment of the engineering community, and HLS captured more of the design of those “special sauce” design blocks. But the dream of giant systems on chip designed completely using behavioral abstractions was never to be.

HLS hit a major milestone in 2011, with Xilinx’s acquisition of a company called AutoESL.  There was a powerful technological synergy between the ability of HLS to quickly generate optimized datapath architectures from behavioral descriptions and the ability of FPGAs to quickly create hardware implementations of those datapaths. The combination of HLS and FPGA promised to form a kind of new “compiler” and “processor” pair that could take C/C++ code and create an engine that could trounce conventional processors in performance and power efficiency. A typical simple but computationally-demanding algorithm put through the HLS-to-FPGA flow could easily beat a conventional software-on-CPU implementation of the same algorithm by orders of magnitude.

Xilinx seeded the world with HLS technology by quickly moving their HLS tool into their core Vivado design tool framework, essentially making HLS available for a very low cost to an enormous community of designers. Other HLS offerings from the large EDA suppliers were exotic technology, and they carried price tags to match. EDA had found a niche for HLS in high-end ASIC design and had kept the price of the technology high in order to capitalize on that. Xilinx’s offering, on the other hand, brought HLS to the masses. Over the years that followed, Xilinx built up the largest community of customers using HLS in the world – by far. 

Still, HLS faced challenges making the move to mainstream. Even though HLS typically starts with a description in C or C++, it is most definitely not a technology that can compile conventional software into hardware. Many code constructs that work well in software running on a CPU are not directly synthesizable into hardware, and writing code that will result in efficient HLS implementations is an art unto itself. Furthermore, HLS tools themselves require a working knowledge of hardware structures that most software engineers do not have, so HLS is usually more of a “power tool” for experienced hardware designers, rather than a tool that enables software developers to design hardware. In fact, the C or C++ used for HLS should be thought of more as a hardware description language with different syntax and semantics, rather than as “software.”

Now, however, HLS has found a new avenue to mainstream acceptance, and this one may be the most compelling yet. With the rapid growth of AI, there has been an industry-wide rush to find ways to accelerate AI inference. Numerous startups, as well as established chip companies, have raced to come up with processor architectures that can outperform conventional CPUs at the massive convolutions required by convolutional neural networks (CNNs) as well as other AI architectures. FPGAs have earned a clear spotlight among these contenders, thanks to their ability to implement custom hardware optimized for each particular neural network – something that no hard-wired AI accelerator can hope to accomplish. As a result, FPGAs have the potential to outperform even the best dedicated general-purpose AI accelerator chips by a significant margin.

But FPGAs are hamstrung in the AI arena by the same issue that has plagued them in every other market. Creating a design with an FPGA is hard. Getting from the realm of the AI expert/data scientist using frameworks such as TensorFlow down to the level of synthesizable RTL for an optimized FPGA (or ASIC) implementation of that neural network is a daunting challenge that spans multiple engineering domains. HLS has the ability to bridge that divide, creating a deterministic path from AI framework to RTL. 

Now we see numerous companies – FPGA vendors as well as EDA suppliers – working to create smooth, tested design flows that encapsulate the AI-HLS-RTL methodology. Xilinx, Intel, Mentor, and Cadence have all made announcements and rolled out products that implement various flavors of this flow. Xilinx uses its HLS tool in the recently-announced Vitis AI framework. Intel’s comparatively new HLS technology is a key component of its ambitious oneAPI framework. Cadence touts its Stratus HLS tool for AI accelerator design, as does Mentor with its Catapult HLS AI Toolkit, and companies like Silexica are augmenting those design flows with tools that optimize and drive the HLS process for the creation of accelerator architectures (see https://www.eejournal.com/article/silexica-bridges-the-hls-gap/).

HLS will be working quietly behind the scenes in many of these AI-to-hardware implementation flows, but it will clearly be a key enabling technology. As AI inference claws its way out of the data center and into edge and endpoint IoT devices, the need for highly-optimized inference solutions that generate high performance on a tiny power budget should pull numerous design teams into this type of design methodology. It will be interesting to watch.

 

One thought on “HLS Powers AI Revolution”

  1. A related article put it this way: “But, if we know that RTL is so bad, we also know somewhere, deep down, that we will need to do something different someday. But what? And when?”
    RTL goes back to “Verilog can be simulated, therefore we MUST use it for design entry,” which is one of the dumbest, most illogical conclusions ever. Verilog is for synthesis, not logic design.
    HLS and OpenCL are focused on expression evaluation, not the interaction of IP blocks.
    The real problem is, and always has been, getting the data and the events (inputs) that trigger actions.
    That is the problem that accelerators and all the complexities of out-of-order execution, branch prediction, caches, and the rest of superscalar design try to solve.
    GPUs and accelerators stream the data to on-chip memory so it can be accessed at the same speed as registers. Matrix algebra needs to address rows and columns. Try a true dual-port embedded memory block.
    IP involves connecting functional blocks. Try defining classes for the IP and let the OOP compiler/debugger take a lot of the pain out of design/debug. And you can run the compiled code, too.

