Is the Future of AI Sparse?

I think we’re all familiar with the old riddle joke that starts with the question, “Why did the chicken cross the road?” and ends with the answer, “To get to the other side!” This is one of the things we seem to learn by some form of sociological osmosis without ever being able to recall where we first heard it.

According to Wikipedia, this is an example of something called “anti-humor” in which the curious setup of the joke leads the listener to expect a traditional punchline, but they are instead given a simple statement of fact.

There are, of course, numerous derivatives of this iconic (some might say “fowl”) example, including the following:

Why did the chicken cross the road? To get to the idiot’s house. “Knock-knock.” (“Who’s there?”) “The chicken!”

Why did the chewing gum cross the road? It was stuck to the chicken’s foot.

Why did the dinosaur cross the road? Chickens didn’t exist yet.

My wife (Gina the Gorgeous) has two stupid cats. She says they are my cats also, but I don’t remember being asked for my opinion prior to their becoming part of our family. Knowing our cats as I do, one variant of the “crossing the road” joke I heard recently that brought a grin to my face was as follows:

Why did the cat cross the road? Because the chicken had a laser pointer.

If you are still reading, you must be (a) lacking in things to do and (b) wondering where I’m going with this. I will reveal all in just a moment. Before we go there, however, if you have any additional answers to the chicken conundrum you would care to share with the rest of us, please feel free to post them in the comments below.

Why is this chicken crossing the road? (Source: Pixabay.com)

The reason for my poultry-centric cogitations is that I was just talking to Sam Fok, who is one of the co-founders of Femtosense.ai. Are you wondering as to the meaning underlying this company name? If so, you’ll be delighted to discover that I just found the following on their website:

Femto is the SI prefix for 10^-15. When a neuron fires in the human brain, the amount of energy dissipated is on the order of femtojoules. Femtosense technology is inspired by research conducted at the Stanford University Brains in Silicon Laboratory, where our founders created a spiking neural network chip, the synaptic operations of which were on the order of, you guessed it, femtojoules. Since many inference problems are also sensing problems, Femtosense seemed like a good idea at the time.

Well, you can’t argue with spiking logic like that. One of the things I love about America is the entrepreneurial ethos that enables motivated people to take an idea, form a company, and run with it. This is what happened to Sam and his colleagues. As part of their PhDs, they were working in the Brains in Silicon laboratory at Stanford University, building—in Sam’s words— “weird neuromorphic hardware.” Specifically, the work they were doing focused on hardware that could take full advantage of sparse artificial neural networks (ANNs).

In fact, they were so enthused by the work they were doing and the potential they were seeing as PhD candidates that, upon graduation, they founded Femtosense in 2018.

This is where we return to chickens (which sparked my earlier chicken-related ruminations). As Sam explained to me, this was a classic “chicken-and-egg” situation.

Just to remind ourselves, the question “Which came first: the chicken or the egg?” is a causality dilemma stemming from the observation that all chickens hatch from eggs and all chicken eggs are laid by chickens. Thus, “chicken-and-egg” is a metaphoric adjective describing situations where it is not clear which of two events should be considered the cause and which should be considered the effect, to express a scenario of infinite regress, or to convey the difficulty of sequencing actions where each seems to depend on others being done first.

As an aside, the Australian Academy of Science has the answer to this conundrum as follows: “With amniotic eggs showing up roughly 340 million or so years ago, and the first chickens evolving at around 58 thousand years ago at the earliest, it’s a safe bet to say the egg came first. Eggs were around way before chickens even existed.” By some strange quirk of fate, they posted this answer in 2018, which is the same year Femtosense was founded (coincidence…?).

But we digress… The “chicken-and-egg” scenario to which Sam was referring was the fact that, back in 2018, although the concept of sparsity was understood (i.e., zeroing out values in the ANN to remove unnecessary parameters without affecting inferencing accuracy), this technique was not widely used in the wild. The problem is that sparse networks come into their own when running on appropriate hardware. However, companies were reluctant to design hardware for sparse workloads that didn’t exist, while developers weren’t keen to spend time working on sparse networks if there wasn’t any hardware on which to run them.
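For anyone who would like to see what “zeroing out values” looks like in practice, here is a minimal Python sketch. It assumes magnitude-based pruning (dropping the weights closest to zero), which is one common way to make a network sparse. Nothing here is taken from Femtosense’s own tools; the function name prune_by_magnitude and the toy weight values are made up purely for illustration.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of a flat list of weights.

    `sparsity` is the fraction (0.0 to 1.0) of entries to set to zero.
    Magnitude pruning is an assumed technique, not the article's stated method.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0.0 and 1.0")
    n_prune = round(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Hypothetical weights: a few large values and many near-zero ones.
weights = [0.9, -0.05, 0.02, -0.7, 0.01, 0.3, -0.03, 0.04, 0.6, -0.02]
pruned = prune_by_magnitude(weights, sparsity=0.7)
nonzero = sum(1 for w in pruned if w != 0.0)
print(pruned)
print(f"{nonzero} of {len(weights)} weights kept")
```

In a real flow, the network would typically be retrained after pruning to recover any lost accuracy; this sketch shows only the zeroing step.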

Of course, if you are bright-eyed and bushy-tailed, there’s another way of looking at this, which is that if there’s a workload (like a sparse network) that could be valuable, but that is not well served by existing hardware, then maybe, just maybe, there’s an opportunity to build some new hardware to satisfy the hoped-for demand.

The following image provides a 30,000-foot view of things. On the left we have a conventional von Neumann CPU-type processor executing an AI/ML model stored in memory, which involves a lot of data moving around. Next, we have processing architectures like GPUs and NPUs, which boast arrays of small cores or multiply-accumulate (MAC) units, but which still involve a lot of time-consuming and energy-hungry data movement.

Traditional versus sparse processing units (Source: Femtosense)

The next step up the hierarchy is to combine distributed processors with near-memory compute. The folks at Femtosense take this one step further with their sparse processing unit (SPU), which provides a unique combination of near-memory compute and sparse acceleration.

Sparsity comes in different flavors. First, we have sparse weights in which sparsely connected models store and compute only those weights that actually matter, thereby resulting in a 10X improvement in speed, efficiency, and memory. Second, we have sparse activations, which means skipping computation when a neuron outputs zero, thereby providing a further 10X improvement in speed and efficiency. Now, math may not be my strong suit, but I’m reasonably confident that 10X x 10X = 100X, wherever you hang your hat.
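To see why the two flavors multiply, here is a small Python sketch that counts multiply-accumulate (MAC) operations for a single toy layer. It is only an illustration under stated assumptions: the layer size, the function names, and the data are hypothetical, and the layer is deliberately constructed so that exactly 10% of the weights and 10% of the activations are nonzero and overlap evenly. Real savings depend on the model and on the hardware, and this is not a description of how the SPU works internally.

```python
def dense_macs(n_out, n_in):
    """MACs a dense layer performs: one per weight."""
    return n_out * n_in


def sparse_layer(rows, activations):
    """Compute a layer that stores only its nonzero weights.

    `rows` holds, per output, a list of (column, weight) pairs.
    Returns (outputs, number of MACs actually performed).
    """
    outputs = []
    macs = 0
    for row in rows:
        acc = 0.0
        for col, w in row:
            a = activations[col]
            if a == 0.0:
                continue  # sparse activation: skip the multiply entirely
            acc += w * a
            macs += 1
        outputs.append(acc)
    return outputs, macs


N = 100  # hypothetical layer: 100 inputs, 100 outputs
# Sparse weights: each output keeps 10 of its 100 weights (10%).
rows = [[((i + 11 * k) % N, 1.0) for k in range(10)] for i in range(N)]
# Sparse activations: 10 of the 100 inputs are nonzero (10%).
activations = [1.0 if j % 10 == 0 else 0.0 for j in range(N)]

outputs, macs = sparse_layer(rows, activations)
stored = sum(len(r) for r in rows)
print("dense MACs:       ", dense_macs(N, N))
print("stored weights:   ", stored)
print("MACs performed:   ", macs)
print("reduction factor: ", dense_macs(N, N) // macs)
```

Storing only the nonzero weights cuts the work from 10,000 to 1,000 candidate MACs, and skipping the zero activations cuts it again to 100, which is where the 100X in this toy case comes from.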

The SPU’s near-memory-compute architecture distributes memory into small banks located near the processing elements to improve throughput. In addition to reducing data motion by performing computations close to the on-chip memory banks, this eliminates the energy and memory bottlenecks caused by accessing off-chip memory.

The SPU also features a scalable architecture in which the cores can be tiled, thereby facilitating the creation of devices that scale to match the performance needs and cost/size/power constraints of any deployment environment.

The guys and gals at Femtosense received their first silicon at the end of 2022. Below we see the SPU001 as a bare die, a packaged test chip, and an evaluation board.

SPU001 bare die, packaged chip, and evaluation board (Source: Femtosense)

In addition to the packaged test chip, the SPU001 (which was created using a 22nm TSMC process) is available in 1.52mm x 2.2mm chip-scale packaging and on an evaluation board. With 1MB of on-chip SRAM (which is equivalent to 10MB of effective memory when running a sparse network), the SPU001 provides sub-milliwatt inference for the always-on processing of speech, audio, and other 1-D data.

I was very impressed with the video of the CES 2022 Femtosense demo; the difference between the raw sound and the audio stream with background noise removed is stunning (now I want to see the video of their 2023 CES presentation).

In this video, the raw audio input is shown on the bottom of the screen and the processed output is presented on the top. Unfortunately, although you can hear the difference, it’s a bit hard to see what’s happening on the displays. I think it would be better if they were to indicate the noise portion of the signal in the bottom window in red to make it easier to see what’s being removed (it’s easy to be a critic, which is why I’m so good in this role).

This real-time speech enhancement is achieved while consuming less than 1mW in power. This is the sort of capability that could benefit all sorts of applications and devices, including earbuds, headsets, and hearing aids, where the latter could have the noise reduction augmented with speech enhancement.

Related audio applications include always-listening keyword detection while consuming less than 100µW in power, a hands-free remote control that lasts over two years on batteries, and always-on multiclass sound event detection in the form of an intelligent security system that can detect, identify, and respond appropriately to such audio occurrences as glass breaking, a baby crying, gunshots, and alarms generated by sensor nodes.
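To put power figures of this order into perspective, here is a back-of-the-envelope Python sketch that converts a constant power draw into energy over time. The 100µW and 1mW values are the upper bounds quoted above; the battery capacity is a hypothetical, illustrative number (not a Femtosense figure or a real battery specification), and the calculation ignores everything else in the device, such as microphones, radios, and self-discharge.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years


def energy_wh(power_watts, years):
    """Energy in watt-hours consumed by a constant load over `years`."""
    return power_watts * HOURS_PER_YEAR * years


def lifetime_years(battery_wh, power_watts):
    """Years a battery of the given capacity could sustain a constant load."""
    return battery_wh / (power_watts * HOURS_PER_YEAR)


keyword_spotting_w = 100e-6     # 100 µW, upper bound quoted in the article
speech_enhancement_w = 1e-3     # 1 mW, upper bound quoted in the article
hypothetical_battery_wh = 2.0   # illustrative assumption only

print(f"Keyword spotting, 2 years:   {energy_wh(keyword_spotting_w, 2):.3f} Wh")
print(f"Speech enhancement, 2 years: {energy_wh(speech_enhancement_w, 2):.2f} Wh")
print(f"Lifetime at 100 µW: "
      f"{lifetime_years(hypothetical_battery_wh, keyword_spotting_w):.2f} years")
```

The point of the exercise is the order of magnitude: at 100µW, two years of continuous listening costs under 2Wh for the inference itself, whereas a 1mW load would need ten times as much.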

Sam says that, in addition to providing their technology in the form of SPU001 chips and evaluation boards (along with the supporting software development kit (SDK) that includes advanced sparse optimization tools, a model performance simulator, and the Femto compiler), the chaps and chapesses at Femtosense are not averse to offering their technology as IP to be incorporated into other people’s MPUs and SoCs.

Also of interest is the fact that they are currently hiring. I was just on their Careers page where I see open positions in application engineering, SoC design and integration, frontend chip design and hardware architecture, and backend chip design. If only I were 40 years younger (and brighter, taller, slimmer, and had more hair). Sad to relate, I fear my days at the forefront of design are behind me, but perhaps you know someone who might be tempted…
