feature article

HLS Powers AI Revolution

Will Inference Finally Mainstream HLS?

I started working in high-level synthesis (HLS) in 1994, which, assuming my math is correct, was 26 years ago. In those early days, we referred to the technology as “behavioral synthesis,” because it relied on analyzing the desired behavior of a circuit in order to create a structural description, rather than simply converting a higher-level structural description into a lower-level one.

HLS promised to replace thousands of lines of detailed RTL code with a few dozen lines of behavioral description as the primary design entry method. It brought the ability to quickly try out a huge range of architectural options for a datapath and select the one that best realized the key design constraints for a particular application. HLS de-coupled the specifics of hardware architecture from the desired behavior of the design. It expanded the concept of re-use. It raised the level of abstraction. It would deliver an enormous gain in designer productivity.

Our team, and those like ours, were giddy about the potential of HLS to revolutionize digital circuit design. While the technical challenges with HLS were enormous, the promise it brought was so compelling that we were certain HLS would completely take over electronic design within three to five years.

Boy, were we wrong.

HLS, it turns out, was much more difficult than we had first estimated. Sure, our tool could give a snappy demo, and we could crank out an FFT example design with the best of them, but being able to handle real-world designs in a practical and compelling way proved to be a vast and consuming technical challenge. Months of development and testing became years, and HLS became jokingly known as “the technology of the future – and always will be.” While numerous chip design groups did pilot projects and beta tests with varying degrees of success, integrating HLS into the existing RTL-centric design and verification flow was prohibitively difficult for most. HLS was relegated to the role of “science-fair experiment.”

 

Over the two decades that followed, HLS slowly and steadily matured. Moore’s Law drove design complexity to a point that methodologies had to change. While HLS offered a compelling way to raise the level of design abstraction, more of the burden was borne by increased scale and complexity of re-usable IP blocks. Chip design teams would assemble a system-on-chip primarily from re-usable IP, and would then use either HLS or conventional RTL design to create the comparatively small and unique portion of the design that remained. Over time, the benefits of HLS became clear to a wider and wider segment of the engineering community, and HLS captured more of the design of those “special sauce” design blocks. But the dream of giant systems on chip designed completely using behavioral abstractions was never to be.

HLS hit a major milestone in 2011, with Xilinx’s acquisition of a company called AutoESL.  There was a powerful technological synergy between the ability of HLS to quickly generate optimized datapath architectures from behavioral descriptions and the ability of FPGAs to quickly create hardware implementations of those datapaths. The combination of HLS and FPGA promised to form a kind of new “compiler” and “processor” pair that could take C/C++ code and create an engine that could trounce conventional processors in performance and power efficiency. A typical simple but computationally-demanding algorithm put through the HLS-to-FPGA flow could easily beat a conventional software-on-CPU implementation of the same algorithm by orders of magnitude.

Xilinx seeded the world with HLS technology by quickly moving their HLS tool into their core Vivado design tool framework, essentially making HLS available for a very low cost to an enormous community of designers. Other HLS offerings from the large EDA suppliers were exotic technology, and they carried price tags to match. EDA had found a niche for HLS in high-end ASIC design and had kept the price of the technology high in order to capitalize on that. Xilinx’s offering, on the other hand, brought HLS to the masses. Over the years that followed, Xilinx built up the largest community of customers using HLS in the world – by far. 

Still, HLS faced challenges making the move to mainstream. Even though HLS typically starts with a description in C or C++, it is most definitely not a technology that can compile conventional software into hardware. Many code constructs that work well in software running on a CPU are not directly synthesizable into hardware, and writing code that will result in efficient HLS implementations is an art unto itself. Furthermore, HLS tools themselves require a working knowledge of hardware structures that most software engineers do not have, so HLS is usually more of a “power tool” for experienced hardware designers, rather than a tool that enables software developers to design hardware. In fact, the C or C++ used for HLS should be thought of more as a hardware description language with different syntax and semantics, rather than as “software.”

Now, however, HLS has found a new avenue to mainstream acceptance, and this one may be the most compelling yet. With the rapid growth of AI, there has been an industry-wide rush to find ways to accelerate AI inference. Numerous startups, as well as established chip companies, have raced to come up with processor architectures that can outperform conventional CPUs at the massive convolutions required by convolutional neural networks (CNNs) as well as other AI architectures. FPGAs have earned a clear spotlight among these contenders, thanks to their ability to implement custom hardware optimized for each particular neural network – something that no hard-wired AI accelerator can hope to accomplish. As a result, FPGAs have the potential to outperform even the best dedicated general-purpose AI accelerator chips by a significant margin.

But FPGAs are hamstrung in the AI arena by the same issue that has plagued them in every other market. Creating a design with an FPGA is hard. Getting from the realm of the AI expert/data scientist using frameworks such as TensorFlow down to the level of synthesizable RTL for an optimized FPGA (or ASIC) implementation of that neural network is a daunting challenge that spans multiple engineering domains. HLS has the ability to bridge that divide, creating a deterministic path from AI framework to RTL. 

Now we see numerous companies – FPGA vendors as well as EDA suppliers – working to create smooth, tested design flows that encapsulate the AI-HLS-RTL methodology. Xilinx, Intel, Mentor, and Cadence have all made announcements and rolled out products that implement various flavors of this flow. Xilinx uses its HLS tool in the recently-announced Vitis AI framework. Intel’s comparatively new HLS technology is a key component of its ambitious oneAPI framework. Cadence touts its Stratus HLS tool for AI accelerator design, as does Mentor with its Catapult HLS AI Toolkit, and companies like Silexica are augmenting those design flows with tools that optimize and drive the HLS process for the creation of accelerator architectures (see https://www.eejournal.com/article/silexica-bridges-the-hls-gap/).

HLS will be working quietly behind the scenes in many of these AI-to-hardware implementation flows, but it will clearly be a key enabling technology. As AI inference claws its way out of the data center and into edge and endpoint IoT devices, the need for highly-optimized inference solutions that generate high performance on a tiny power budget should pull numerous design teams into this type of design methodology. It will be interesting to watch.

 

One thought on “HLS Powers AI Revolution”

  1. A related article put it this way: “But, if we know that RTL is so bad, we also know somewhere, deep down, that we will need to do something different someday. But what? And when?”
    RTL goes back to “Verilog can be simulated, therefore we MUST use it for design entry,” which is one of the dumbest, most illogical conclusions ever. Verilog is for synthesis, not logic design.
    HLS and OpenCL are focused on expression evaluation, not the interaction of IP blocks.
    The real problem is, and always has been, getting the data and the events (inputs) that trigger actions.
    That is the problem that accelerators and all the complexities of out-of-order execution, branch prediction, caches, and the rest of superscalar design try to solve.
    GPUs and accelerators stream the data to on-chip memory so it can be accessed at the same speed as registers. Matrix algebra needs to address rows and columns. Try a true dual-port embedded memory block.
    IP involves connecting functional blocks. Try defining classes for the IP and let the OOP compiler/debugger take a lot of the pain out of design/debug. And you can run the compiled code, too.

