
Who Will Win AI at the Edge?

Low-power Options Get Traction

We’ve written a lot about AI in the cloud, and we’ve discussed data center solutions such as GPUs, high-end FPGAs, and dedicated AI chips like Intel’s Nervana. For many applications, training and inferencing CNNs and other AI-based systems in cloud data centers is the only way to get the compute power required to crunch the vast data sets and complex models. But, for perhaps an even larger set of applications, cloud-based inference is not practical. We may need latency that cannot be achieved by shipping data upstream to be analyzed. We may not have the ability to maintain a full-time network connection. Or any number of other factors may preclude sending data to a data center and waiting for results to return.

For these applications, the only practical solution is to do the AI inferencing at the edge, right where the data is collected. But AI at the edge brings a host of challenges. The computation required for inferencing of even modest-sized models is enormous. Conventional applications processors cannot come close to the performance required, and their architecture is far from ideal for neural network inferencing tasks. GPUs are power hungry and expensive. Most edge devices are heavily constrained on cost, power, and form factor. Throwing in a real-time latency requirement brings the problem almost to the realm of “unsolvable” with current technology.

Ah, that “almost unsolvable” is the stuff fortunes are made of.

The challenge of conquering AI at the edge has attracted droves of brilliant minds to the task. Countless startups are slinging silicon at the problem, conjuring up novel processing architectures adapted to the particular requirements of CNN inferencing. What are those requirements? For starters, we need a lot of multiplication. The C in CNN stands for “convolution” and, as math nerds know, that means we are likely to be multiplying a lot of matrices.
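
To get a feel for how much multiplication that is, here is a minimal sketch of a direct 2D convolution that counts its own multiply-accumulate (MAC) operations. The function names `conv2d_valid` and `conv_layer_macs` and the layer dimensions in the comments are illustrative assumptions, not taken from any particular framework or product.

```python
def conv2d_valid(image, kernel):
    """Direct 2D convolution as used in CNNs (no kernel flip),
    stride 1, no padding. Returns (output, number_of_MACs).
    Hypothetical helper for illustration only."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0] * ow for _ in range(oh)]
    macs = 0
    for y in range(oh):
        for x in range(ow):
            acc = 0
            for ky in range(kh):
                for kx in range(kw):
                    # One multiply-accumulate per kernel tap per output pixel.
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
                    macs += 1
            out[y][x] = acc
    return out, macs


def conv_layer_macs(out_h, out_w, k_h, k_w, in_ch, out_ch):
    """MAC count for one full convolutional layer: every output pixel of
    every output channel sums over a k_h x k_w window of every input channel."""
    return out_h * out_w * k_h * k_w * in_ch * out_ch


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 1, 1],
          [1, 1, 1],
          [1, 1, 1]]
result, macs = conv2d_valid(image, kernel)

# An assumed (hypothetical) layer: 224x224 output, 3x3 kernel, 64 -> 64 channels.
layer_macs = conv_layer_macs(224, 224, 3, 3, 64, 64)
```

Even this toy 4x4 image needs 36 MACs for a single 3x3 filter, and the assumed 64-channel layer works out to roughly 1.85 billion MACs for one pass over one frame.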

But, for training and inference, the types of values we are multiplying are not the same. During training, we need floating point math, which is the reason GPUs have taken such a hold in the data center where most CNN models are created. However, once training is done and those coefficients are established, it is possible to do inferencing with much simpler values – fixed point numbers, sometimes as narrow as a single bit. If you consider the hardware and energy costs of doing full single- or double-precision floating point multiplication versus multiplication of small fixed-point quantities, the implications are obvious: custom hardware that can execute massive numbers of correct-width, fixed-point hardware multiplications in parallel could generate orders of magnitude better performance and efficiency on CNN inferencing than conventional CPUs or GPUs.
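
As a rough illustration of that trade, the sketch below quantizes floating-point weights and activations to 8-bit signed integers, does the dot product entirely in integer math, and rescales once at the end. The symmetric linear scheme and the names `quantize`, `int_dot`, and `quantized_dot` are assumptions chosen for simplicity; real toolchains use a variety of quantization schemes.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization to signed integers.
    Returns (list_of_ints, scale) where value ~= int * scale.
    Hypothetical helper, not any specific library's API."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def int_dot(qa, qb):
    """Integer-only multiply-accumulate: the part custom hardware
    would execute in narrow fixed-point multipliers."""
    return sum(a * b for a, b in zip(qa, qb))


def quantized_dot(weights, activations, num_bits=8):
    """Approximate a floating-point dot product with integer MACs
    plus a single floating-point rescale at the end."""
    qw, sw = quantize(weights, num_bits)
    qa, sa = quantize(activations, num_bits)
    return int_dot(qw, qa) * sw * sa


weights = [0.5, -1.0, 0.25]
activations = [1.0, 2.0, -4.0]
exact = sum(w * a for w, a in zip(weights, activations))
approx = quantized_dot(weights, activations)
```

The integer result lands close to the floating-point one, yet every multiplication in the inner loop involves only small integers, which is the kind of operation that is cheap to replicate by the thousands in custom hardware.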

When it comes to creating custom hardware, there are two primary options, of course. If the hardware is static (and we have the budget, time, and talent) we create an ASIC. If the hardware configuration needs to change, we use FPGAs. If we’re hip and cool and trying to capture the best of both worlds, we may create an ASIC with embedded FPGA fabric (eFPGA). Custom ASIC, it turns out, is a good solution for only a tiny fraction of the AI edge applications out there. Seldom are we given a static, unchanging CNN model that we want to use forever. More often, we want an SoC optimized for every part of our design except the CNN inferencing part, and then we want a block (or chip) with FPGA fabric or another type of optimized neural processing unit or tensor processing CNN accelerator.

If you’re designing an SoC, have endpoint AI processing requirements, and need on-chip acceleration, there are a number of options available. Cadence Design Systems, for example, has developed versions of its Tensilica processor IP for this purpose. Cadence’s recently announced Tensilica DNA-100 processor IP capitalizes on the sparsity of many CNN models by avoiding the repeated loading and multiplication of zeroes inherent in the flow of most accelerator architectures. The company claims this yields much more efficient inference computation than other acceleration solutions with similar multiply-accumulate (MAC) array sizes.
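
The general idea behind exploiting sparsity can be sketched in a few lines: a MAC whose weight or activation is zero contributes nothing, so it can be skipped. This is only a software illustration of the concept with hypothetical function names; it does not represent how the DNA-100 or any other product actually implements zero-skipping in hardware.

```python
def dense_mac(weights, activations):
    """Baseline: multiply every weight/activation pair, zeros included.
    Returns (result, number_of_multiplies)."""
    acc = 0
    macs = 0
    for w, a in zip(weights, activations):
        acc += w * a
        macs += 1
    return acc, macs


def zero_skipping_mac(weights, activations):
    """Skip any pair where either operand is zero, since the product
    would add nothing to the accumulator.
    Returns (result, number_of_multiplies)."""
    acc = 0
    macs = 0
    for w, a in zip(weights, activations):
        if w == 0 or a == 0:
            continue  # no load into the multiplier, no multiply
        acc += w * a
        macs += 1
    return acc, macs


# Made-up sparse data for illustration.
weights = [0, 3, 0, 0, -2, 0, 1, 0]
activations = [5, 0, 7, 2, 4, 0, 0, 9]

dense_result, dense_count = dense_mac(weights, activations)
sparse_result, sparse_count = zero_skipping_mac(weights, activations)
```

Both versions produce the same sum, but the zero-skipping version performs a single multiplication instead of eight on this made-up data. How much a real design gains depends on how sparse the model and its activations actually are.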

Synopsys has an entire portfolio of DesignWare IP aimed at the edge inferencing market, including vision-specific architectures, memory and datapath customization solutions, and customized versions of the venerable ARC processor architecture. Synopsys appears to cater more to the “roll-your-own” crowd when it comes to edge AI, which may result in more application-optimized systems, albeit with a steeper design and learning curve on the hardware architecture side.

FPGA and eFPGA companies appear to be committing heavy engineering to the edge inference problem as well. Mainstream FPGA companies Xilinx and Intel have thus far focused most of their attention on the high-end versions of FPGA acceleration, with bigger, more expensive chips aimed at the data center or other power-rich applications. This has given niche FPGA players like QuickLogic, Lattice Semiconductor, and Microchip/Microsemi an opportunity to capitalize on their low-power FPGA fabric technology to create various low-cost, low-power FPGA solutions. QuickLogic and Lattice have both also joined companies like Achronix and Flex Logix in the eFPGA movement, offering IP blocks that put FPGA fabric in your ASIC, along with pre-engineered FPGA IP and software stacks to facilitate the creation of application-specific AI accelerators with that fabric. Just last month, QuickLogic acquired SensiML Corporation, particularly for the SensiML Analytics Toolkit, which provides a streamlined flow for developing AI-based pattern-matching sensor algorithms optimized for ultra-low power consumption.

While the hardware and hardware IP suppliers battle each other with various claims on their combination of low power, low cost, tiny form factor, and inference throughput and latency, perhaps the bigger challenge faced by the industry is the infrastructure for creating and optimizing AI models in the first place. While languages and tool flows continue to evolve, the community of truly skilled AI experts with the cross-over capability to optimize those models for the wide variety of custom hardware environments competing for the crown in edge-based AI is vanishingly small. The hardware architectures that win the battle will likely be the ones with the smoothest development flow, rather than the ones with the most compelling data sheets. We have seen time and again that novel hardware alone is not enough to dominate a market. The winning solution is often one that lacks luster in optimality, but excels in usability. It will definitely be interesting to watch.
