Who Will Win AI at the Edge?

Low-power Options Get Traction

We’ve written a lot about AI in the cloud, and we’ve discussed data center solutions including GPUs, high-end FPGAs, and dedicated AI chips such as Intel’s Nervana. For many applications, training and inferencing CNNs and other AI-based systems in cloud data centers is the only way to get the compute power required to crunch the vast data sets and complex models. But, for perhaps an even larger set of applications, cloud-based inference is not practical. We may need latency that cannot be achieved by shipping data upstream to be analyzed. We may not have the ability to maintain a full-time network connection. Or any of a number of other factors may preclude sending data to a data center and waiting for results to return.

For these applications, the only practical solution is to do the AI inferencing at the edge, right where the data is collected. But AI at the edge brings a host of challenges. The computation required for inferencing of even modest-sized models is enormous. Conventional applications processors cannot come close to the performance required, and their architecture is far from ideal for neural network inferencing tasks. GPUs are power hungry and expensive. Most edge devices are heavily constrained on cost, power, and form factor. Throwing in a real-time latency requirement brings the problem almost to the realm of “unsolvable” with current technology.

Ah, that “almost unsolvable” is the stuff fortunes are made of.

The challenge of conquering AI at the edge has attracted droves of brilliant minds to the task. Countless startups are slinging silicon at the problem, conjuring up novel processing architectures adapted to the particular requirements of CNN inferencing. What are those requirements? For starters, we need a lot of multiplication. The C in CNN stands for “convolution” and, as math nerds know, that means we are likely to be multiplying a lot of matrices.
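To put a rough number on “a lot of multiplication,” here is a minimal Python sketch of the sliding-window operation at the heart of a convolutional layer, with a counter for multiply-accumulate (MAC) operations. The names `conv2d_valid` and `layer_macs`, the toy 4x4 input, and the layer shape in the last line are all assumptions chosen for illustration; they do not come from any particular framework or model.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1).

    Returns the output feature map and the number of
    multiply-accumulate (MAC) operations performed.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0] * ow for _ in range(oh)]
    macs = 0
    for r in range(oh):
        for c in range(ow):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
                    macs += 1
            out[r][c] = acc
    return out, macs


def layer_macs(out_h, out_w, k_h, k_w, in_channels, out_channels):
    """MAC count for one convolutional layer of the given shape."""
    return out_h * out_w * k_h * k_w * in_channels * out_channels


# Toy single-channel example (hypothetical data).
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0],
    [0, 1],
]
feature_map, macs = conv2d_valid(image, kernel)
print(feature_map)  # [[7, 9, 11], [15, 17, 19], [23, 25, 27]]
print(macs)         # 36

# A hypothetical first layer: 224x224 output, 3x3 kernel, 3 -> 64 channels.
print(layer_macs(224, 224, 3, 3, 3, 64))  # 86704128
```

Even this single assumed layer needs tens of millions of MACs per input frame, and a real network stacks many such layers, which is why the multiplier array dominates accelerator design.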

But, for training and inference, the types of values we are multiplying are not the same. During training, we need floating point math, which is the reason GPUs have taken such a hold in the data center where most CNN models are created. However, once training is done and those coefficients are established, it is possible to do inferencing with much simpler values – fixed point numbers, sometimes as narrow as a single bit. If you consider the hardware and energy costs of doing full single- or double-precision floating point multiplication versus multiplication of small fixed-point quantities, the implications are obvious: custom hardware that can execute massive numbers of correct-width, fixed-point hardware multiplications in parallel could generate orders of magnitude better performance and efficiency on CNN inferencing than conventional CPUs or GPUs.
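The move from floating point to narrow fixed point can be sketched in a few lines of Python. This is one simple scheme (symmetric, per-vector scaling), picked here for illustration rather than taken from any particular tool flow. The helper names `quantize` and `int_dot` and the sample values are hypothetical.

```python
def quantize(values, bits=8):
    """Map floats to signed integers of the given width.

    Uses a single symmetric scale factor for the whole vector.
    Returns (integer values, scale) such that value ~= integer * scale.
    """
    if bits < 2:
        # Single-bit (binary) networks need a different scheme than this one.
        raise ValueError("this sketch handles widths of 2 bits and up")
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def int_dot(qa, qb):
    """Dot product using integer multiplies and adds only."""
    return sum(a * b for a, b in zip(qa, qb))


# Hypothetical trained weights and input activations.
weights = [0.42, -0.17, 0.88, -0.53]
activations = [1.5, 0.25, -0.75, 2.0]

qw, sw = quantize(weights)
qx, sx = quantize(activations)

exact = sum(w * x for w, x in zip(weights, activations))
approx = int_dot(qw, qx) * sw * sx  # one rescale after all the integer MACs

print(qw, qx)
print(exact, approx)
```

All of the multiplications inside `int_dot` are on small integers; the only floating point work is a single rescale at the end. In hardware, that is the difference between an array of narrow integer multipliers and an array of full floating point units.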

When it comes to creating custom hardware, there are two primary options, of course. If the hardware is static (and we have the budget, time, and talent), we create an ASIC. If the hardware configuration needs to change, we use FPGAs. If we’re hip and cool and trying to capture the best of both worlds, we may create an ASIC with embedded FPGA fabric (eFPGA). Custom ASIC, it turns out, is a good solution for only a tiny fraction of the AI edge applications out there. Seldom are we given a static, unchanging CNN model that we want to use forever. More often, we want an SoC optimized for every part of our design except the CNN inferencing part, and then we want a block (or chip) with FPGA fabric or another type of optimized CNN accelerator, such as a neural or tensor processing unit.

If you’re designing an SoC, have endpoint AI processing requirements, and need on-chip acceleration, there are a number of options available. Cadence Design Systems, for example, has developed versions of its Tensilica processor IP for this purpose. Cadence’s recently announced Tensilica DNA-100 processor IP capitalizes on the sparsity of many CNN models by avoiding the repeated loading and multiplication of zeroes inherent in the flow of most accelerator architectures. The company claims this yields much more efficient inference computation than other acceleration solutions with similar multiply-accumulate (MAC) array sizes.
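The general idea behind exploiting sparsity can be illustrated with a toy Python sketch. To be clear, this is not a description of how the DNA-100 works internally; it is a generic zero-skipping dot product, and the names `dense_dot`, `compress`, and `sparse_dot`, along with the sample data, are hypothetical.

```python
def dense_dot(weights, activations):
    """Multiply every pair, zeros included. Returns (result, MAC count)."""
    acc = 0
    macs = 0
    for w, x in zip(weights, activations):
        acc += w * x
        macs += 1
    return acc, macs


def compress(weights):
    """Keep only the nonzero weights, each tagged with its original index."""
    return [(i, w) for i, w in enumerate(weights) if w != 0]


def sparse_dot(compressed_weights, activations):
    """Skip any product where the weight or the activation is zero."""
    acc = 0
    macs = 0
    for i, w in compressed_weights:
        x = activations[i]
        if x == 0:
            continue  # zero activation: nothing to multiply
        acc += w * x
        macs += 1
    return acc, macs


# Hypothetical pruned weights and post-ReLU activations.
weights = [0, 3, 0, 0, -2, 0, 1, 0]
activations = [5, 0, 7, 1, 4, 0, 2, 9]

print(dense_dot(weights, activations))              # (-6, 8)
print(sparse_dot(compress(weights), activations))   # (-6, 2)
```

Both versions produce the same answer, but the zero-skipping version performs a quarter of the multiplications on this data. The compression step can be done once, offline, because the weights are fixed after training.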

Synopsys has an entire portfolio of DesignWare IP aimed at the edge inferencing market, including vision-specific architectures, memory and datapath customization solutions, and customized versions of the venerable ARC processor architecture. Synopsys appears to cater more to the “roll-your-own” crowd when it comes to edge AI, which may result in more application-optimized systems, albeit with a steeper design and learning curve on the hardware architecture side.

FPGA and eFPGA companies appear to be committing heavy engineering to the edge inference problem as well. Mainstream FPGA companies Xilinx and Intel have thus far focused most of their attention on the high-end versions of FPGA acceleration, with bigger, more expensive chips aimed at the data center or other power-rich applications. This has given niche FPGA players like QuickLogic, Lattice Semiconductor, and Microchip/Microsemi an opportunity to capitalize on their low-power FPGA fabric technology to create various low-cost, low-power FPGA solutions. QuickLogic and Lattice have both also joined companies like Achronix and Flex Logix in the eFPGA movement, offering IP blocks that put FPGA fabric in your ASIC, along with pre-engineered FPGA IP and software stacks to facilitate the creation of application-specific AI accelerators with that fabric. Just last month, QuickLogic acquired SensiML Corporation, particularly for the SensiML Analytics Toolkit, which provides a streamlined flow for developing AI-based pattern-matching sensor algorithms optimized for ultra-low power consumption.

While the hardware and hardware IP suppliers battle each other with various claims on their combination of low power, low cost, tiny form factor, and inference throughput and latency, perhaps the bigger challenge faced by the industry is the infrastructure for creating and optimizing AI models in the first place. Languages and tool flows continue to evolve, but the community of truly skilled AI experts with the cross-over capability to optimize those models for the wide variety of custom hardware environments competing for the crown in edge-based AI is vanishingly small. The hardware architectures that win the battle will likely be the ones with the smoothest development flow, rather than the ones with the most compelling data sheets. We have seen time and again that novel hardware alone is not enough to dominate a market. The winning solution is often one that lacks luster in optimality, but excels in usability. It will definitely be interesting to watch.
