In part 1 of this series, we looked at new high-end FPGA families from Xilinx, Intel, and Achronix and discussed their underlying semiconductor processes, the type and amount of programmable logic LUT fabric, the type and amount of DSP/arithmetic resources and their applicability to AI inference acceleration tasks, the claimed TOPS/FLOPS performance capabilities, and on-chip interconnect such as FPGA routing resources and networks-on-chip (NoCs). In part 2, we looked at memory architectures, in-package integration architecture, and high-speed serial IO capabilities. From these comparisons, it is clear that these are some of the most complex and sophisticated chips ever developed, that there are high stakes involved in this battle, and that each vendor brings some unique value to the table with no clear winner or loser.
In this installment, we will look at possibly the most important factor of all – the design tool flow that allows engineering teams to take advantage of the awesome power of these devices. It turns out that the flexibility and power of programmable logic are both its greatest assets and its biggest limitations. FPGAs are not processors, even though one of the most attractive uses of them is accelerating computation. Engineers with software-only backgrounds often fail to appreciate the complexity involved in using these devices and the long learning curve required to gain enough proficiency to use FPGAs near their capability. Unlike conventional von Neumann processors, FPGAs are not software programmable – at least not in the conventional sense.
Taking full advantage of FPGAs requires digital logic to be designed, and, despite decades of progress, we are not yet at the point where FPGAs can be optimally used without at least some degree of hardware expertise in the design process. There are several caveats to that statement, however. First, after years of dealing with the huge (and ever-increasing) complexity of FPGA design, FPGA vendors have come up with a few tricks to mitigate this issue. The easiest and most significant is the use of pre-designed single-function accelerators and applications. For certain common applications, FPGA companies have already done the design for you – with experts tuning and tweaking the logic for optimal performance, power consumption, and utilization. If you happen to be doing one of these applications, you may never even know you are using an FPGA. It just sits quietly in the background making everything better in your system. We’ll go into more details on this in a future installment when we talk about the marketing and distribution of these devices.
The second approach to reducing the required design expertise is raising the level of design abstraction. By creating tools that allow design at a higher level, FPGA vendors reduce the need for detailed design at the register-transfer level, thus greatly simplifying the process. We will discuss that in detail later in this article.
But, regardless of the reality on the ground, FPGA companies today are addressing primarily three distinct audiences, or domains of expertise: digital hardware engineers with HDL skills, software engineers, and (most recently) AI/ML engineers. Each of these groups of users stands to gain profoundly from the application of FPGA technology, but giving each of these groups the tools they need to take advantage of FPGAs is a daunting task.
To unroll the vast landscape of FPGA tools and IP, we should start with the core – the part that uniquely defines FPGAs, and that is the look-up table (LUT) fabric. In modern times, the FPGA has evolved like the Swiss Army Knife or the smartphone. The “knife” portion of the SAK has become only a small portion of the device’s capability, but we still call them “knives.” The “phone” feature of a smartphone is becoming vanishingly small in terms of differentiating a device, but we still refer to them as “phones.” And the LUT fabric in FPGAs is now only a small fraction of the value delivered by these amazing devices, but the LUT fabric is the one thing that uniquely gives FPGAs their superpowers.
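Conceptually, a LUT is a tiny thing: a k-input LUT is just a 2^k-bit truth table whose contents come from the configuration bitstream, with the k inputs acting as the read address. A minimal sketch of the idea (our illustration only – real fabric LUTs add carry chains, fracturable modes, and associated flip-flops):

```cpp
#include <bitset>
#include <cstdint>

// A k-input LUT is a 2^k-entry truth table: the configuration bitstream
// fills the table, and the k inputs select one entry to drive the output.
template <unsigned K>
struct Lut {
    std::bitset<(1u << K)> config;  // truth table contents from the bitstream

    bool eval(std::uint32_t inputs) const {
        // The low K input bits form the read address into the table.
        return config[inputs & ((1u << K) - 1)];
    }
};
```

Loading `config = 0b0110` into a 2-input LUT, for example, makes it an XOR gate; loading `0b1000` makes the same silicon an AND gate. That reprogrammability is the entire trick.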
Xilinx points out that their “ACAP” devices do not require that the LUT fabric be configured or even used at all in order for their Versal devices to be booted and used. This is a large part of their basis for claiming that Versal devices are not FPGAs, but a new category. But there is enough commonality in the capabilities and mission of all these devices to compare them, and it is also unlikely that Versal will find a home in many applications where the FPGA fabric is completely unused.
It All Starts with HDL Design
The lowest level of design tool in the FPGA domain is place and route. This process takes a netlist of interconnected LUTs (along with the configuration of those LUTs), arranges it on the chip in a (hopefully) near optimal fashion, and makes the required connections through the programmable interconnect fabric. Place and route is the heart and soul of FPGA implementation. As devices have gotten progressively larger, two trends have impacted the place and route process. First, the portion of the chip dedicated to interconnect (versus logic) has had to increase in order for large designs to be routed successfully. Of course, FPGA companies don’t like allocating enormous amounts of silicon for routing, because that means less logic on their device. You never see a vendor brag about the vast routing resources they provide, but LUT counts are often front-and-center. Maximizing LUTs and minimizing routing, then, is datasheet marketing 101.
In order to minimize the amount of silicon real estate consumed by routing resources, FPGA companies do an exhaustive dance during chip design where they place and route countless real-world designs over numerous iterations of the chip development in order to find the perfect balance where most user designs will be able to take advantage of a high percentage of the logic resources on the chip without failing during the routing process. If they allocate too few routing resources, large amounts of logic on the chip will be unusable. This has happened in the past, as FPGA companies have released chips that could regularly be routed only at 60% or so utilization, rendering the chips far below their advertised capacity in the real world. Conversely, if too much silicon is allocated to routing, the chips will have lower logic density than their competitors for the same silicon area.
Obviously, the better the place and route algorithms perform, the fewer routing resources are required to complete most designs. Thus, the performance of place and route has an impact all the way back to the design of the devices themselves.
The second trend that has impacted place and route is that the dominant factor in delay in logic paths has shifted from “logic” delay to “interconnect” delay. This has a profound effect on the design flow, as the logic synthesis process can no longer do timing analysis independent of the delays associated with routing/interconnect. This means that logic synthesis and place-and-route must be bound tightly together – with placement information determining routing delays, and those routing delays being fed back to make different decisions in the construction of logic paths by synthesis. In today’s world, logic synthesis and place and route are generally bound together in one (usually iterative) step, where various solutions for both the logic and the layout are evaluated and compared until a satisfactory result is found that meets key goals such as timing, design area, and power consumption.
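The production timing-driven flows described above are enormously sophisticated, but the core placement idea – iteratively perturb, measure, and accept or reject – fits in a few lines. This toy simulated-annealing placer (a hypothetical illustration, not any vendor’s algorithm) swaps cell locations to minimize total Manhattan wirelength:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <random>
#include <utility>
#include <vector>

// Toy placer: cells sit on a grid, nets connect cell pairs, and cost is
// total Manhattan wirelength. Simulated annealing swaps two cells and keeps
// the swap if it helps -- or occasionally even if it hurts, early on, to
// escape local minima.
struct Placement {
    std::vector<std::pair<int, int>> pos;   // cell -> (x, y) location
    std::vector<std::pair<int, int>> nets;  // (cellA, cellB) connections

    int wirelength() const {
        int total = 0;
        for (auto [a, b] : nets)
            total += std::abs(pos[a].first - pos[b].first) +
                     std::abs(pos[a].second - pos[b].second);
        return total;
    }
};

int anneal(Placement& p, int iters = 20000, unsigned seed = 1) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, (int)p.pos.size() - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    double temp = 10.0;
    int cost = p.wirelength();
    for (int i = 0; i < iters; ++i) {
        int a = pick(rng), b = pick(rng);
        std::swap(p.pos[a], p.pos[b]);      // propose: swap two cells
        int next = p.wirelength();
        if (next <= cost || coin(rng) < std::exp((cost - next) / temp))
            cost = next;                    // accept the swap
        else
            std::swap(p.pos[a], p.pos[b]);  // reject: undo it
        temp *= 0.9995;                     // cool toward pure greed
    }
    return cost;
}
```

A real FPGA placer replaces the wirelength objective with timing, congestion, and power terms, and re-evaluates delays incrementally rather than from scratch – but the accept/reject skeleton is recognizably the same.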
Each of these three companies offers a robust toolset for synthesis and place-and-route, and these toolsets are the home base for the traditional FPGA user – the “digital hardware engineers with HDL skills” mentioned above.
Xilinx bit the bullet in 2012 with a ground-up rewrite of their then-aging ISE tool suite, creating Vivado. Seven years later, Vivado has nicely matured into a comparatively robust, reliable platform with an architecture that has generally done well keeping up with the fast pace of upgrades in the FPGA business. The logic synthesis and place-and-route algorithms are state-of-the-art, and Vivado does well with compile times and memory footprint on today’s enormous designs. Vivado is scriptable via Tcl, offering a great deal of control over the flow. Vivado also contains a simulator and an IP integration tool.
Xilinx does extensive iteration between their tools and their FPGA architectures to find the sweet spot in terms of routing versus logic resources that allows their devices to achieve consistent high utilization with their tools, with solid results in terms of timing closure. Of the three vendors, Xilinx is the most “old school” on layout and timing closure, while Intel and Achronix have both taken somewhat novel steps in their device architectures to help achieve timing closure on today’s large and complex designs.
However, Xilinx has also led the charge in high-level synthesis (HLS) in the FPGA world, and Vivado HLS is (by far, we believe) the industry’s most used HLS tool, supporting a C/C++ to gates flow for hardware designers looking for productivity beyond what register-transfer level (RTL) offers. The high adoption rate of Xilinx’s HLS tool also helps with timing closure, as the automatically-generated RTL from HLS tools tends to be much better behaved for timing closure than handwritten RTL.
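For readers who haven’t seen HLS input, here is a minimal sketch in the Vivado HLS style: an untimed C++ dot product with a pipeline directive. `#pragma HLS PIPELINE` is a real Vivado HLS directive; an ordinary compiler ignores unknown pragmas, so the same source doubles as the software model for verification.

```cpp
// HLS-style C++: the same untimed source is both the software model and the
// hardware description. Synthesis intent is expressed through pragmas that
// an ordinary C++ compiler simply ignores.
constexpr int N = 8;

int dot_product(const int a[N], const int b[N]) {
    int acc = 0;
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1  // ask the HLS tool for one new iteration per cycle
        acc += a[i] * b[i];
    }
    return acc;
}
```

The tool decides how many multipliers and adders to instantiate and how to schedule them across clock cycles – which is exactly the kind of decision that makes HLS a productivity tool for hardware engineers rather than a hardware-design-free zone for software engineers.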
Intel’s Quartus Prime Pro is an evolution of the Altera Quartus design-tool suite that has been their flagship for FPGA design for the past two decades. As we mentioned before, Intel updated their chips a couple of generations ago with what they call the “HyperFlex” architecture – essentially covering the device with small registers that facilitate on-the-fly re-timing of critical logic paths by the tools. This facilitates much easier timing closure on complex designs, likely at the expense of some overall performance.
More recently, Intel has added an optional strategy called “Fractal Synthesis” to Quartus for designs such as machine-learning algorithms that are heavy in arithmetic or small multipliers. The company says that Microsoft used Fractal Synthesis on their “Brainwave” project (which powers Bing searches) to fill Stratix 10 devices to 92% at high performance. Another recent addition is Design Assistant design-rule checks (DRCs) that help to locate issues in constraints and the placed netlist, designed to reduce the iterations required to close timing.
Intel came far later to the HLS party than Xilinx, but they now include the Intel HLS Compiler in the Quartus suite. The HLS Compiler takes untimed C++ as input and generates RTL code that is optimized to target Intel FPGAs. While Intel’s HLS tool has significantly less usage in the field than Xilinx’s Vivado HLS, we expect to see considerable adoption as the HLS Compiler powers the “FPGA” leg of Intel’s One API software development platform. It appears that Intel’s HLS implementation may be skewed more toward the software engineer than Xilinx’s (which is pretty clearly a power tool for hardware designers).
Achronix’s ACE tool suite relies on third-party tools for simulation and synthesis. An OEM version of Synopsys Synplify Pro is included in the ACE suite, and it includes advanced floorplanning and critical-path analysis features to assist in closing timing. With Speedster7t, however, Achronix has taken a unique approach to timing closure with their novel network-on-chip (NoC) implementation. The NoC “enables designers to transport data anywhere across the FPGA fabric at speeds up to 2GHz without using logic resources. This also means that fewer valuable FPGA resources will be consumed when placing and routing the user design than a traditional FPGA that must exclusively use LUTs to route signals within the device.”
Since acceleration workloads typically involve large amounts of “bus” routing for multi-bit paths, Achronix has also introduced byte- (or nibble-) based routing to further improve timing closure. These are additional routing resources that can be used if a data word is moved and no bit-swapping is required.
For their HLS entry, Achronix has partnered with Mentor, whose Catapult-C is probably the most proven ASIC HLS tool on the market. Catapult-C is industrial strength, and it brings a price tag to match. Catapult-C will allow a full C/C++ flow to target Speedster7t FPGAs – with the usual caveat that HLS is still a tool designed for hardware engineers to improve productivity and quality of results, rather than a tool to enable software engineers to design hardware. Catapult-C should certainly shine in 5G applications, as well as in optimizing acceleration workloads such as AI inference.
Entry Points for Software Developers
For the last two decades (since Altera launched their ill-fated “Excalibur” series), many high-end FPGAs have contained conventional von Neumann processors in addition to their LUT fabric. These processors range from “soft” microcontrollers implemented in the LUT fabric itself to complex “hard” multi-core 64-bit processing subsystems complete with peripherals. These FPGAs qualify as systems on chip (SoCs), and, with this evolution, the design problem became more complicated, because we now need tools to support software engineers to develop the applications that run on the processing subsystems embedded in today’s FPGAs.
For years that meant that FPGA SoC projects required both hardware and software experts on the team. But, as FPGA companies have worked to penetrate new markets for their chips, they have worked to reduce the reliance on HDL/hardware expertise to use their devices. Today, they are working to provide software development tool suites that don’t just enable software engineers to develop code to run on the embedded processing subsystems – they also allow those software engineers to create accelerators that take advantage of the FPGA fabric to accelerate their applications.
As we wrote recently, Xilinx has just released their Vitis unified platform that seeks to enable software engineers (as well as AI developers using the “Vitis AI” entry point) to accelerate applications using Xilinx devices. Xilinx says that Vitis “provides a unified programming model for accelerated host CPU, embedded CPU and hybrid (host + embedded) applications. In addition to the core development tools, Vitis provides a rich set of hardware-accelerated libraries, pre-optimized for Xilinx hardware platforms.”
Vitis could represent a bit of a behavior and philosophy shift for Xilinx, (who, historically, has had a bit of a reputation for eating their own ecosystem) with a “commitment to open source and community participation.” Xilinx says that all of their hardware-accelerated libraries are being released on GitHub, and their runtime, XRT, is also being open-sourced. Of course, all this open sourcery is still software that ultimately targets Xilinx hardware, and it would probably be a pretty big undertaking to modify it to target competitors’ devices, but – combined with the fact that Vitis is offered free of charge, it’s certainly a giant step in the right direction for the company, and a red carpet for non-traditional users such as CUDA developers.
Intel’s One API is designed with similar goals to Xilinx’s Vitis, but with an understandably wider target. Because of the breadth of Intel’s compute portfolio – heavy-iron Xeon processors, GPUs, FPGAs, specialized AI engines such as Nervana and Movidius, and so on – Intel is setting about the ambitious goal of a single-entry point for software development that spans the gamut of what Intel calls “SVMS architectures” (scalar, vector, matrix, spatial, deployed in CPUs, GPUs, NNPs and FPGAs). Intel says that One API “promises to remove the barriers to entry that currently exist for hardware accelerators on FPGAs by abstracting the DMA of data from the host to the FPGA and back – something that is very manual, tedious, and error-prone in the HDL-based design flow. One API also shares the back-end infrastructure for FPGAs with Intel’s HLS Compiler and Intel OpenCL SDK for FPGAs, allowing developers currently using these tools to easily move over to One API.”
Intel cautions that “FPGA developers will need to modify their code for optimal performance on the FPGA or the use of libraries that have been pre-optimized for the spatial FPGA architecture. Software developers must be trained in the ways of the FPGA to enable the full performance benefit of FPGA special architecture acceleration with the portability across architectures.”
Thanks Intel-wan. We all aspire to be trained in the ways of the FPGA.
One API will use a new programming language called “Data Parallel C++” (DPC++) as well as API calls. One API will incorporate API libraries for various workload domains and will also include enhanced analysis and debug tools tailored to DPC++.
One capability that sets Intel’s offering apart is their optimized low-latency and cache-coherent UPI (and, in the future, CXL) interface between Xeon Scalable processors and the FPGA accelerator. The company says that this capability “enables applications such as virtualization, AI, and large memory databases to perform at higher levels versus relying on traditional PCIe connectivity.”
When it comes to enabling software developers to take direct advantage of FPGA acceleration, Achronix is at a disadvantage compared with their larger, more diverse competitors. The company is developing an ecosystem through various partnerships, but they will likely primarily target design teams with the required hardware expertise to use their FPGAs effectively, as well as applications for which the Achronix ecosystem already has robust pre-optimized solutions and reference designs.
As AI Takes Over the World…
Most recently, AI has emerged as a new “killer app” for FPGAs. As we discussed in previous installments, FPGAs have a superb ability to dynamically create custom processing engines tailored to the needs of particular AI inference applications. But that also creates a new breed of user – the AI/ML engineer. This third target audience brings with them their own set of requirements for tool flows to take advantage of the power of FPGAs. These folks need support for TensorFlow, Caffe, and other front-end tools of their trade, complete with paths that translate those into reasonably-optimized FPGA-based solutions for their problems.
As part of their Vitis unified software platform announcement, Xilinx also announced Vitis AI, a plug-in that targets (as one might guess) AI and data scientists. It allows AI models to be compiled directly from standard frameworks such as TensorFlow. Vitis AI instantiates what Xilinx calls a domain-specific processing unit (DPU) in the FPGA fabric or in the Versal device’s AI engines. A key (and compelling) advantage of this approach is that Vitis AI compiles the AI models into op-code for the DPU, and, therefore, models can be downloaded and run within minutes or even seconds as no new place and route is needed. This makes iteration time and real-world workload provisioning significantly simpler and faster than with most approaches where FPGA fabric configuration is part of the plan. Vitis AI also takes advantage of Xilinx’s pruning technology to help optimize models for inference.
Intel took a similar approach to their One API for AI developers (in reality, the order was reversed) with the Intel distribution of the OpenVINO toolkit. Intel’s version provides AI developers what the company says is “a single toolkit to accelerate their solution deployment across multiple hardware platforms including Intel FPGAs.” Intel says that OpenVINO can take AI developers and data scientists from frameworks such as TensorFlow and Caffe directly to hardware without any FPGA knowledge required.
Intel’s solution supports a variety of popular neural networks and also allows the creation of custom networks. They also include customization APIs for FPGA developers to morph the engine for their application. The company says that orders-of-magnitude performance gains are possible by custom-building an AI engine to an application’s particular data flow and building blocks.
The customization flow also includes what Intel calls AI+, for applications that require other functionality in addition to AI. This allows application developers to take advantage of the flexibility of FPGA fabric to enable (for example) AI+ pre-processing, freeing the CPU from having to perform orchestration, and reducing overall system latency.
Achronix provides low-level machine-learning library functions for frameworks such as TensorFlow, as well as support for higher-level frameworks. They support both “overlay architectures” and mapping of the datagraph from the high-level framework directly to FPGA logic. The company says an “overlay” is “a specialized application processor which is optimized for one or more ML networks, with the hardware instantiated in reprogrammable logic and a specific network implemented in microcode running on that hardware instance.” This means that the functionality of an overlay implementation can be modified by changing the software, or by changing both hardware and software, whereas a direct-mapped datagraph requires the FPGA fabric to be fully or partially reconfigured. Achronix’s approach essentially allows you to dial in the amount of optimization you want, depending on the time and expertise you have available to throw at the problem.
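The overlay concept can be modeled in a few lines. In this toy sketch (our illustration, not Achronix code), the dispatch loop stands in for the fixed processor instantiated in the fabric, and the microcode vector is the software-reprogrammable part: swapping in different microcode retargets the overlay to a different network with no reconfiguration of the “hardware.”

```cpp
#include <cstddef>
#include <vector>

// Toy model of an "overlay": the dispatch loop below plays the role of the
// fixed hardware instantiated in reprogrammable logic, while the microcode
// vector plays the role of the network-specific program. Changing networks
// means changing microcode, not re-running place and route.
enum class Op { MAC, RELU };

struct MicroInst {
    Op op;
    float weight;  // operand for MAC; unused by RELU
};

float run_overlay(const std::vector<MicroInst>& microcode,
                  const std::vector<float>& inputs) {
    float acc = 0.0f;
    std::size_t in = 0;
    for (const auto& inst : microcode) {  // fixed fetch/dispatch "hardware"
        switch (inst.op) {
        case Op::MAC:  acc += inst.weight * inputs[in++]; break;  // multiply-accumulate
        case Op::RELU: acc = acc > 0.0f ? acc : 0.0f;     break;  // activation
        }
    }
    return acc;
}
```

A direct-mapped datagraph, by contrast, would bake the weights and operations into the fabric itself – faster and denser, but every network change means reconfiguring the device.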
As you can see, the scope and magnitude of development tools and technology being thrown at these complex programmable logic devices is massive, and we have barely scratched the surface here. It will be interesting to see what capabilities emerge as differentiators, as we believe that more hardware sockets will be won based on development flows than on the capabilities of the hardware itself.
Next time, we will take a look at the vast ecosystem of IP and reference designs, as well as pre-integrated boards and modules that could propel these complex devices into high-value applications where the engineering resources for detailed, optimized development are not always available.