feature article
Subscribe Now

Who Will Win AI at the Edge?

Low-power Options Get Traction

We’ve written a lot about AI in the cloud, and we’ve discussed data center solutions such as GPUs, high-end FPGAs, and dedicated AI chips such as Intel’s Nervana. For many applications, training and inferencing CNNs and other AI-based systems in cloud data centers is the only way to get the compute power required to crunch the vast data sets and complex models. But, for perhaps an even larger set of applications, cloud-based inference is not practical. We may need latency that cannot be achieved by shipping data upstream to be analyzed. We may not have the ability to maintain a full-time network connection. We may have any of a number of other factors that preclude sending data to a data center and waiting for results to return.

For these applications, the only practical solution is to do the AI inferencing at the edge, right where the data is collected. But AI at the edge brings a host of challenges. The computation required for inferencing of even modest-sized models is enormous. Conventional applications processors cannot come close to the performance required, and their architecture is far from ideal for neural network inferencing tasks. GPUs are power hungry and expensive. Most edge devices are heavily constrained on cost, power, and form factor. Throwing in a real-time latency requirement brings the problem almost to the realm of “unsolvable” with current technology.

Ah, that “almost unsolvable” is the stuff fortunes are made of.

The challenge of conquering AI at the edge has attracted droves of brilliant minds to the task. Countless startups are slinging silicon at the problem, conjuring up novel processing architectures adapted to the particular requirements of CNN inferencing. What are those requirements? For starters, we need a lot of multiplication. The C in CNN stands for “convolution” and, as math nerds know, that means we are likely to be multiplying a lot of matrices.

But, for training and inference, the types of values we are multiplying are not the same. During training, we need floating point math, which is the reason GPUs have taken such a hold in the data center where most CNN models are created. However, once training is done and those coefficients are established, it is possible to do inferencing with much simpler values – fixed point numbers, sometimes as narrow as a single bit. If you consider the hardware and energy costs of doing full single- or double-precision floating point multiplication versus multiplication of small fixed-point quantities, the implications are obvious: custom hardware that can execute massive numbers of correct-width, fixed-point hardware multiplications in parallel could generate orders of magnitude better performance and efficiency on CNN inferencing than conventional CPUs or GPUs.

When it comes to creating custom hardware, there are two primary options, of course. If the hardware is static (and we have the buget, time, and talent) we create an ASIC. If the hardware configuration needs to change, we use FPGAs. If we’re hip and cool and trying to capture the best of both worlds, we may create an ASIC with embedded FPGA fabric (eFPGA). Custom ASIC, it turns out, is a good solution for only a tiny fraction of the AI edge applications out there. Seldom are we given a static, unchanging CNN model that we want to use forever. More often, we want an SoC optimized for every part of our design except the CNN inferencing part, and then we want a block (or chip) with FPGA fabric or another type of optimized neural processing unit or tensor processing CNN accelerator.

If you’re designing an SoC, have endpoint AI processing requirements, and need on-chip acceleration, there are a number of options available. Cadence Design Systems, for example, has developed versions of their Tensilica processor IP. Cadence’s recently-announced Tensilica DNA-100 processor IP capitalizes on the sparsity of many CNN models by avoiding the repeated loading and mulitplication of zeroes inherent in the flow of most accelerator architectures. The company claims this yields much more efficient inference computation than other acceleration solutions with similar multiply-accumulate (MAC) array sizes.

Synopsys has an entire portfolio of DesignWare IP aimed at the edge inferencing market, including vision-specific architectures, memory and datapath customization solutions, and customized versions of the venerable ARC processor architecture. Synopsys appears to cater more to the “roll-your-own” crowd when it comes to edge AI, which may result in more application-optimized systems, albeit with a steeper design and learning curve on the hardware architecture side.

FPGA and eFPGA companies appear to be committing heavy engineering to the edge inference problem as well. Mainstream FPGA companies Xilinx and Intel have thus-far focused most of their attention on the high-end versions of FPGA acceleration, with bigger, more expensive chips aimed at the data center or other power-rich applications. This has given niche FPGA players like QuickLogic, Lattice Semiconductor, and Microchip/Microsemi an opportunity to capitalize on their low-power FPGA fabric technology to create various low-cost, low-power FPGA solutions, and QuickLogic and Lattice have both also joined companies like Achronix and FlexLogix in the eFPGA movement, offering IP blocks that put FPGA fabric in your ASIC and pre-engineered FPGA IP and software stacks to facilitate the creation of application-specific AI accelerators with that fabric. Just last month, QuickLogic acquired SensiML corporation, particularly for the SensiML Analytics Toolkit, which provides a streamlined flow for developing AI-based pattern-matching sensor algorithms optimized for ultra-low power consumption.

While the hardware and hardware IP suppliers battle each other with various claims on their combination of low power, low cost, tiny form factor, and inference throughput and latency, perhaps the bigger challenge faced by the industry is the infrastructure for creating and optimizing AI models in the first place. While languages and tool flows continue to evolve, the community of truly skilled AI experts with the cross-over capability to optimize those models for the wide variety of custom hardware environments competing for the crown in edge-based AI is almost vanishingly small. It is likely that the hardware architectures that win the battle may be the ones with the smoothest development flow, rather than the ones with the most compelling data sheets. We have seen time and again that novel hardware alone is not enough to dominate a market. The winning solution is often one that lacks luster in optimality, but excels in usability. It will definitely be interesting to watch.

Leave a Reply

featured blogs
May 25, 2023
Register only once to get access to all Cadence on-demand webinars. Unstructured meshing can be automated for much of the mesh generation process, saving significant engineering time and cost. However, controlling numerical errors resulting from the discrete mesh requires ada...
May 24, 2023
Accelerate vision transformer models and convolutional neural networks for AI vision systems with the ARC NPX6 NPU IP, the best processor for edge AI devices. The post Designing Smarter Edge AI Devices with the Award-Winning Synopsys ARC NPX6 NPU IP appeared first on New Hor...
May 8, 2023
If you are planning on traveling to Turkey in the not-so-distant future, then I have a favor to ask....

featured video

Automate PCB P&R Tasks for Designs in Minutes

Sponsored by Cadence Design Systems

Discover how to get a dramatic reduction in design turnaround time by automating your placement, power plane generation, and critical net routing with Cadence® Allegro® X AI technology. Built on and accessed through the Allegro X Design Platform, Allegro X AI reduces P&R tasks from days to minutes with equivalent or higher quality compared with manually designed boards.

Click here for more information

featured contest

Join the AI Generated Open-Source Silicon Design Challenge

Sponsored by Efabless

Get your AI-generated design manufactured ($9,750 value)! Enter the E-fabless open-source silicon design challenge. Use generative AI to create Verilog from natural language prompts, then implement your design using the Efabless chipIgnite platform - including an SoC template (Caravel) providing rapid chip-level integration, and an open-source RTL-to-GDS digital design flow (OpenLane). The winner gets their design manufactured by eFabless. Hurry, though - deadline is June 2!

Click here to enter!

featured chalk talk

Johnson RF Connectivity Solutions
The growing need for remote patient monitoring and wireless connectivity has made RF in medicine applications more important than ever before. In this episode of Chalk Talk, Amelia Dalton chats with Ketan Thakkar from Cinch Connectivity Solutions about the growing trends in medicine today that are encouraging the use of RF, why higher frequency, smaller form factor, cable assembly expansion and adapter expansion are vital components in today’s medical applications and why Johnson medical solutions could be a great fit for your next medical design.
Nov 28, 2022
23,177 views