
Watching AI Evolve

A Common Trajectory?

We’ve looked at several AI IP offerings over the last year or so, from Cadence (more than once), Synopsys, and Imagination Technologies. All of them, in one way or another, rely on computing units like DSPs or GPUs to handle the many computations it takes to make an inference.

As all of this IP evolves, there’s a key trend: move neural-net inference out of the cloud and into the “edge” devices that currently rely on the cloud. Latency and privacy can be improved, and devices that have poor cloud connections can be more self-sufficient.

But today, we’re going to talk about two different neural-net applications: vision and speech recognition. We have IP for both from Cadence: their announcement a couple of months ago of the Tensilica DNA 100 accelerator and their much more recent announcement of their Tensilica HiFi 5 audio platform.

Looking at them illustrates some of the drivers of the evolution that we’re seeing – and the two applications are not the same in terms of what they require, so the IP design decisions and tradeoffs have been different.

Generic Processing

The first stage of neural processing has taken place amidst seas of processing elements – typically DSPs or GPUs. At this early stage of things, algorithms and applications and performance and power are all in the experimental phase. It’s about getting it to work, not necessarily getting it to work optimally.

Vision left this stage of evolution a while ago, but not so much audio. In a conversation with Cadence about their HiFi 5 IP, it was clear that their prior HiFi 4 edition was handling neural nets, but doing so using the same DSPs that were being used for everything else. That has changed with HiFi 5.

Tighter DSPs

The next evolutionary stage comes when you have a better handle on how much processing is required – in other words, how many DSPs you’re likely to need access to. Here we take note of two key facts about neural-net processing:

  • You’re going to be working with smaller data units. The big data loads consist of the weights (or coefficients) for the nodes of the network and the activations – the input data (a visual image or frame, or an audio stream). These tend to be integers of at most 16 bits – sometimes down to 4 bits or smaller. None of the 32-bit behemoths that float around in other DSP applications.
  • You’re dealing with integer multiplication. A general-purpose DSP has lots of multiply-accumulate (MAC) units, but, because it’s general purpose, those units can be configured into a number of different modes. You don’t need most of those modes for neural-net operations; a simpler integer unit will do (see the sketch after this list).
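
To make that concrete, here’s a minimal Python sketch of the idea (my own illustration, not Cadence code – the 8-bit width and the scale factors are assumptions): quantize floating-point weights and activations to small integers, do the multiply-accumulates entirely in integer arithmetic, and rescale once at the end.

```python
import numpy as np

# Illustrative only: quantize float weights/activations to int8 and run an integer MAC.
def quantize_int8(x, scale):
    """Map float values to int8 using a simple symmetric scale."""
    q = np.round(x / scale).astype(np.int32)
    return np.clip(q, -128, 127).astype(np.int8)

weights = np.random.randn(64).astype(np.float32)
activations = np.random.randn(64).astype(np.float32)

w_scale, a_scale = 0.02, 0.05          # assumed scales; real flows calibrate these
w_q = quantize_int8(weights, w_scale)
a_q = quantize_int8(activations, a_scale)

# The MAC itself needs only integer multiplies; the accumulator is just a wider integer.
acc = np.sum(w_q.astype(np.int32) * a_q.astype(np.int32))

# Rescale once at the end to recover an approximate floating-point result.
approx = acc * w_scale * a_scale
print(approx, float(np.dot(weights, activations)))
```

The hardware consequence is the one in the bullet above: the datapath needs only narrow integer multipliers and a wider integer accumulator, not a full general-purpose MAC with all its modes.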

So one of the big differences between Cadence’s HiFi 4 and HiFi 5 is that the latest one has a series of simplified DSPs optimized for neural-net use. You no longer need general-purpose DSPs.

Data in the Way

The next evolutionary step involves the surprising fact that performance may not be limited by all of those MAC calculations. The bottleneck may be something far more pedestrian: moving data around.

As you re-use a set of MACs for different parts of the calculation, you end up moving matrices of weights in and out of use. And, of course, you’re moving the input data around as you receive it and deliver it to different parts of the engine. Moving all of those matrices takes time.

So the first thing that you can do is to optimize the architecture in a way that reduces the required number of data moves, getting that movement out of the way of actual work done by the MACs. This is a typical change that’s been made for accelerators, and you see that in the vision stuff. Notably, that hasn’t happened yet with HiFi 5 for reasons we’ll come back to in a moment.
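
As a back-of-the-envelope illustration of why this matters (my numbers, not Cadence’s), count how many times the weights of one smallish convolution layer cross the memory interface if they’re re-fetched for every output position versus fetched once and held in place:

```python
# Rough weight-traffic comparison for one conv layer (illustrative numbers only).
out_h, out_w = 112, 112          # output feature-map size
k, c_in, c_out = 3, 64, 64       # 3x3 kernel, 64 input and 64 output channels

weights = k * k * c_in * c_out                  # weight values in the layer
naive_fetches = weights * out_h * out_w         # re-fetch the weights per output pixel
stationary_fetches = weights                    # fetch once, reuse across the image

print(f"naive weight fetches:      {naive_fetches:,}")
print(f"weight-stationary fetches: {stationary_fetches:,}")
print(f"reduction factor:          {naive_fetches // stationary_fetches:,}x")
```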

The other key characteristic of these matrices is that they tend to be sparse. That means that most of their entries are 0, with a few having useful non-zero data elements. This gives rise to a couple of opportunities (one of which we’ll defer for a moment).

When you have big chunks of data that are zero, what technique beckons for reducing the amount of data? Compression, of course. So the weights and inputs can be compressed before being stored, making it much faster to move them around when needed. When pressed into service, the matrices are decompressed first. The key requirement, of course, is that the time you save moving data doesn’t get eaten back up by compression and decompression time – which, given that this is a touted feature in both the DNA 100 and the HiFi 5, appears to be the case.
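
Cadence doesn’t publish the details of its compression scheme, so here’s a toy sketch of the general idea: a simple zero run-length encoding of a sparse 8-bit weight vector, which shows why sparsity makes the compression pay off.

```python
import numpy as np

# Illustrative only: a toy zero run-length encoding, not Cadence's actual scheme.
def rle_compress(weights):
    """Encode as (zero_run_length, value) pairs for the non-zero entries."""
    pairs, run = [], 0
    for w in weights:
        if w == 0:
            run += 1
        else:
            pairs.append((run, int(w)))
            run = 0
    return pairs, run                      # trailing zeros kept as a count

def rle_decompress(pairs, trailing_zeros):
    out = []
    for run, value in pairs:
        out.extend([0] * run)
        out.append(value)
    out.extend([0] * trailing_zeros)
    return np.array(out, dtype=np.int8)

rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=4096).astype(np.int8)
w[rng.random(4096) < 0.8] = 0              # ~80% zeros, a plausible post-pruning level

pairs, tail = rle_compress(w)
assert np.array_equal(rle_decompress(pairs, tail), w)
print(f"dense bytes: {w.nbytes}, compressed (run, value) pairs: {len(pairs)}")
```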

Ignore the Zeros

Finally, those sparse matrices, once decompressed, can result in lots of calculations on those zero-valued elements. Cadence created an optimized neural-net engine in their DNA 100 that works only on non-zero elements. They skip the zero-valued ones, gaining yet more performance.
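
The flavor of that zero-skipping trick – just the principle, not the DNA 100’s actual microarchitecture – looks something like this:

```python
import numpy as np

# Illustrative only: accumulate only where the weight is non-zero.
def dot_skip_zeros(weights, activations):
    """Integer dot product that performs MACs only at non-zero weight positions."""
    nz = np.flatnonzero(weights)           # indices where there's useful work
    acc = int(np.dot(weights[nz].astype(np.int32),
                     activations[nz].astype(np.int32)))
    return acc, len(nz)

rng = np.random.default_rng(1)
w = rng.integers(-128, 128, size=1024).astype(np.int8)
w[rng.random(1024) < 0.8] = 0              # sparse weights
a = rng.integers(-128, 128, size=1024).astype(np.int8)

acc, macs_done = dot_skip_zeros(w, a)
print(f"result {acc}: {macs_done} MACs instead of {len(w)}")
```

With 80% of the weights at zero, roughly four out of five MAC operations simply never happen.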

Both this step and the optimized architecture are ones that the HiFi 5 has not taken yet – for a reason we need to dig into next.

Soft or Hard?

We now have a completely separate set of decisions to make, depending on how mature our algorithms are and how much performance we need to squeeze out of a device: do we execute the algorithm in software or in hardware (or in some combination thereof)?

There’s been a consistent trajectory in the industry: do things in software as much as possible, since there’s pretty much nothing more flexible than software. If you are on a steep learning curve and have to make frequent changes, then software is your safest bet. As long as you don’t overstep your memory footprint, you can make all the changes you want without affecting the hardware design at all.

But, at some point, many applications start to cry out for hardware implementation. It might be for cost reasons or speed reasons or power reasons, but any of those reasons can drive the move to harden some, if not all, of the algorithm.

Vision and audio processing have very different performance requirements. Even analysis of a still picture involves a starting data set of megapixels (meaning many megabits). Yes, as you move through the neural-net pipeline, that set of bits shrinks due to convolution and pooling, but it starts out big. If you’re processing video, then you have to process the frames at the frame rate in order to keep up with real-time performance. That’s a lot of data to push around.

Audio, meanwhile, has a couple of characteristic differences as compared to visual processing. With a still image, you have all the pixels at once. Convolution is used because it resembles how we think our brains process vision. But audio works differently. First of all, instead of a pile of bits that are concurrently valid, you have a time sequence of bits from the audio stream. So you’re not just processing data that was captured at the same time; you’re capturing a data set acquired over time. This is where recurrent neural nets (RNNs) tend to be used rather than convolutional neural nets (CNNs).
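
A minimal sketch of that difference (an Elman-style recurrence with made-up sizes, not anything Cadence ships): each new audio frame updates a hidden state that summarizes everything heard so far, rather than the whole input being available at once.

```python
import numpy as np

# Minimal Elman-style recurrence (illustrative shapes, not a Cadence kernel):
# each arriving audio frame updates a hidden state that carries the history forward.
rng = np.random.default_rng(2)
n_in, n_hidden = 40, 128                   # e.g., 40 filterbank features per frame
W_x = rng.standard_normal((n_hidden, n_in)) * 0.1
W_h = rng.standard_normal((n_hidden, n_hidden)) * 0.1

h = np.zeros(n_hidden)
for t in range(100):                       # 100 frames arriving over time
    x_t = rng.standard_normal(n_in)        # stand-in for one frame of audio features
    h = np.tanh(W_x @ x_t + W_h @ h)       # new state depends on the whole history so far

print(h[:4])
```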

Audio also involves relatively low-frequency vibrations – those in the audible range. That limits sample rates to a couple tens of kilohertz, resulting in processing rates of a few mega-MACs/s rather than the hundreds of mega-MACs/s or even tera-MACs/s needed for video. In other words, all in all, audio processing requires far less performance.
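
Here’s a rough back-of-the-envelope comparison using illustrative layer sizes of my own choosing – not anything from a Cadence spec sheet:

```python
# Back-of-the-envelope throughput comparison (illustrative numbers, not a spec).
# Audio: a small recurrent layer run once per 10 ms frame.
audio_frames_per_s = 100
audio_macs_per_frame = 128 * (40 + 128)           # hidden * (inputs + hidden) for one RNN layer
audio_macs_per_s = audio_frames_per_s * audio_macs_per_frame

# Video: one 3x3, 64->64-channel conv layer over a 224x224 feature map at 30 fps.
video_fps = 30
video_macs_per_frame = 224 * 224 * 3 * 3 * 64 * 64
video_macs_per_s = video_fps * video_macs_per_frame

print(f"audio: ~{audio_macs_per_s / 1e6:.1f} M MACs/s")
print(f"video (one layer!): ~{video_macs_per_s / 1e9:.1f} G MACs/s")
```

Even with these toy sizes, a single vision layer wants orders of magnitude more MAC throughput than a modest audio network.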

So, given the tradeoff between performance and the flexibility of software, audio can still stay with an all-software implementation. This is why Cadence went as far as to use optimized DSPs for audio, but they didn’t use an optimized architecture or a hardened circuit. They can still get the performance they need without giving up any of the software flexibility.

(Image courtesy Cadence)

The DNA 100, by comparison, went farther down that hardening path, with a dedicated architecture and the non-zero-value-only MAC.

(Image courtesy Cadence)

So there you have it. The xNN evolution pipeline – from all generic DSPs to optimized hardware – and how vision and audio processing have traveled down that path, giving proof that not all neural nets are the same. Are we at the end of that pipeline for vision? Not likely. But we won’t know what the next stages are until we get there.

 

More info:

Cadence Tensilica DNA 100

Cadence Tensilica HiFi 5

2 thoughts on “Watching AI Evolve”

    1. Hi, Bryon. I am a system and hardware designer new to AI. You wrote:
      “You’re going to be working with smaller data units. The big data loads consist of the weights or coefficients for the nodes of the network and for what they call the activation – the input data (a visual image or frame or an audio sound-stream). These tend to involve integers of at most 16 bits – sometimes down to 4 bits or smaller. None of the 32-bit behemoths that float around for other DSP applications.
      You’re dealing with integer multiplication. A general-purpose DSP will have lots of multiply-accumulation units (MACs), but, because it’s general purpose, those units can be configured into a number of different modes. You don’t need most of those modes for neural-net operations. You can use a simpler integer unit.”

      Using embedded block memory for these MACs is a natural fit for FPGAs because there are so many blocks available.

      Typical FPGA designs use LUTs grouped together to form 4-, 8-, and 16-bit integers, plus LUTs for combinational logic. LUTs are one-bit-wide memories, pure and simple, that can evaluate any Boolean combination of up to 6 Boolean variables.

      A MAC can use 1 or 2 control inputs to evaluate two 2-bit operands – 4 inputs in all. So 2 LUTs can do a 4-bit MAC, an 8-by-256 memory can decode an 8-bit input to 8 outputs, and so on. There are hundreds or even thousands of memory blocks available on FPGAs, and designs can get both performance and flexibility this way.

      I have not tried matrix multiplication, but intuitively feel that dual two-port memories can deliver the matrix elements for each operation in one clock cycle.

      This approach could blend in very well with the evolution.

