With literally dozens of chip startups working on edge AI inference these days, it’s easy to get caught up in a blur of marketing hype around the various “novel” and “revolutionary” architectures under development. The potential rewards are enormous, as AI has invaded just about every application space, and the intense demands AI inference puts on computing hardware have resulted in a mad dash to find new architectures that can outperform traditional CPUs on inference and break the reliance on cloud-based inference for devices at the edge.
The killer app at the tip of the spear of the AI revolution is clearly video. Video’s massive data sets, enormous bandwidth demands, and critical latency requirements can easily bring even the newest data-center-class GPUs and CPUs to their knees. Any time you think you’ve got the problem under control, someone wants more resolution, higher frame rates, more channels, lower latency, and higher quality. At the same time, the AI inference models being executed on these video workloads are growing exponentially. There is literally no amount of computing performance available today that can deliver the full potential of machine vision.
Standing out in a sea of startups is Perceive, with their already-shipping “Ergo” device. How exactly are they standing out? Well, for starters, as we just said, their devices are shipping. That puts them ahead of the majority of the competition, who are still in various stages from stealth mode to tape-out to sampling with key customers. But the big story on Perceive could be boiled down to a single metric:

55 TOPS/Watt
For those of you who regularly deal with this sort of thing, we thought we’d just leave that there on its own line. It’s not a typo. While you recover, we’ll bring the rest of the crowd up to speed. Computational power efficiency – how much number crunching you can do with a given amount of energy – is by far the most important metric for evaluating any processor for use in power-constrained edge/endpoint devices. If you’re running on battery power or are constrained in your ability to get heat out of your device, power is your biggest system-level concern. Combine that with the intense computational demands of AI inference, and you end up with the current designer’s dilemma – whether to try to make do with a low-accuracy, stripped-down AI model that you can run on your device, or to ship your data off to cloud-based systems for the inference, or to do some hybrid where your in-device AI says, “Some animal might be walking by” – and then it ships the image off to the cloud-based system that returns the answer: “That is an eleven-year-old Welsh Corgi named Doug.” … well after he walks out of the video frame altogether.
When we talk to companies about their AI inference offerings, the primary talking point is almost always trillions of operations per second, per watt of power, or tera-operations per watt. Now, at this point we probably could jump into a discussion of what exactly an “operation” is in the context of AI inference, and we’d take up a lot of paragraphs on semantics. But, for reasons that will become clear soon, we are going to table that discussion for the moment. Feel free to weigh in with comments if you feel we’ve led folks astray.
Let’s look at some TOPS numbers. NVIDIA’s fastest offerings have TOPS/Watt in the low single digits. The new A100 claims 624 INT8 TOPS at 250W, which comes out to (a very optimistic estimate of) just under 2.5 TOPS/Watt. Google’s Edge TPU delivers 4 TOPS at 2W, so a solid 2 TOPS/Watt there, and Blaize’s recently announced Graph Streaming Processor (GSP) chip claims around 2 TOPS/Watt. Flex Logix’s InferX X1 delivers a claimed 8.5 peak TOPS at power consumption between 2 and 10W, which would put TOPS/Watt somewhere in the 1-4 range (Flex Logix is adamant that TOPS/Watt is not a good metric, and we agree, but keep reading). Seeing a trend here? The current magic number in TOPS/Watt is 2. Perceive claims that Ergo does 55.
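For readers who want to check the arithmetic, here’s a quick sketch that computes the TOPS/Watt figures above directly from the vendors’ peak claims quoted in this article (these are marketing numbers, not independent measurements):

```python
# Back-of-envelope TOPS/Watt from the vendors' own peak claims,
# as quoted above -- not measured figures.
chips = {
    "NVIDIA A100 (INT8)": (624.0, 250.0),           # (peak TOPS, watts)
    "Google Edge TPU": (4.0, 2.0),
    "Flex Logix InferX X1 (best case)": (8.5, 2.0),
    "Flex Logix InferX X1 (worst case)": (8.5, 10.0),
}

for name, (tops, watts) in chips.items():
    print(f"{name}: {tops / watts:.2f} TOPS/Watt")
# The A100 lands just under 2.5; the edge parts cluster in the 1-4 range.
```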
So, using our (arguably sketchy) TOPS/Watt metric, Perceive’s Ergo beats the current state of the art in inference by a factor of 20-50x. Perceive themselves claim 20-100x, so we’re in the ballpark. This is a really big deal.
But, now let’s dive deeper into the metrics. We don’t really have a standardized definition of an “operation.” Since nobody said “TFLOPS,” we could be talking INT8, INT4… There is really no telling. Also, is a multiply-accumulate one operation or two? And, who is to say how many “operations” are involved in doing what people really want, which is currently running something like YOLOv3 on a video stream. And, since external memory is often heavily utilized during inference, do TOPS/Watt metrics take the memory power consumption into account?
Finally, and here is the kicker: for almost all existing inference engines, the bottleneck is the gazillions of multiply-accumulate (MAC) operations required for convolution. Those multiply-accumulates are the “operations” upon which the TOPS claims are based. Almost all of the novel hardware architectures going after the AI inference pot of gold are building different structures to make multiply-accumulate operations more efficient, get better utilization out of the MACs on the chip, and so forth.
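To see just how slippery the “operation” count is, consider a back-of-envelope MAC count for a single convolution layer. The layer shape below is hypothetical, chosen only for illustration – note how the “ops” total doubles depending on whether you count a MAC as one operation or two:

```python
def conv_macs(out_h, out_w, out_c, k, in_c):
    """Multiply-accumulates for one convolution layer:
    each output element needs k*k*in_c MACs."""
    return out_h * out_w * out_c * k * k * in_c

# Hypothetical early layer: 416x416 output, 32 filters, 3x3 kernel, 3 input channels.
macs = conv_macs(416, 416, 32, 3, 3)
print(f"{macs:,} MACs")        # roughly 150 million MACs for one layer
print(f"{macs * 2:,} 'ops'")   # ...or double that, if a MAC counts as two operations
```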
Perceive is doing inference a completely different way. Their chip has no giant array of MACs, so they are arguably not doing any of the “TOPS” by which inference engines measure themselves. Perceive is VERY quiet about exactly how they are achieving this, saying only that they started with a clean sheet, using information theory to examine the computations that are actually taking place in neural networks. The resulting chip runs “the equivalent of 4 TOPS” and boasts benchmark results on YOLOv3 – where Ergo can run a blistering 250fps while consuming only 70mW, or a more civilized 30fps at 20mW. So, all TOPS claims aside, that is game-changing performance and efficiency. It gives edge devices the ability to run data-center-class AI models on power budgets in the tens of mW.
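One way to put those YOLOv3 claims in perspective is energy per frame, which falls straight out of the quoted power and frame-rate numbers:

```python
# Energy per inference frame implied by Perceive's claimed YOLOv3 numbers.
def mj_per_frame(power_mw, fps):
    # milliwatts divided by frames/second gives millijoules per frame
    return power_mw / fps

print(f"{mj_per_frame(70, 250):.2f} mJ/frame at 250 fps")  # 0.28 mJ/frame
print(f"{mj_per_frame(20, 30):.2f} mJ/frame at 30 fps")    # 0.67 mJ/frame
```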
Perceive claims Ergo is not constrained to CNNs and can run any popular neural network topology, including CNNs, RNNs, LSTMs, “and more.” Ergo is delivered in a tiny 7×7 mm package and requires no external RAM. It can be reconfigured in-system to run various models, and can run multiple models – and types of models – at once. It could, for example, simultaneously run both video and audio inference engines in real time. Perceive says they have simultaneously run YOLOv3 and its 63 million weights, ResNet 28 with several million parameters, and an LSTM or RNN for speech and audio processing. That’s a complex workload for a tiny chip.
The chip also features a fast-wakeup mode, where it can go from zero power to operation in 50 milliseconds, including decryption. That means only a couple of frames of video would be missed if, for example, motion were detected and Ergo were called in to do analysis.
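That “couple of frames” figure checks out with simple arithmetic against the claimed 50 ms wake-up time:

```python
# Frames elapsed during Ergo's claimed 50 ms wake-up, at common frame rates.
wakeup_s = 0.050
for fps in (24, 30, 60):
    print(f"{fps} fps: {wakeup_s * fps:.1f} frames missed during wake-up")
# At 30 fps that's 1.5 frames; even at 60 fps, only 3.
```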
In addition to the AI fabric, Ergo has a complete video pre-processing subsystem along with all the relevant IO to support two cameras. The pre-processing subsystem handles de-warping, white balance, gamma correction, and so forth. Ergo has a similar IO and pre-processing system for audio inputs, and it can handle arrays of microphones. Ergo also includes an ARC processor and external Flash, which can be used to update the AI models loaded in the device.
Ergo is fabricated on the GLOBALFOUNDRIES 22FDX platform, so clearly the company is aiming at high-volume, price-sensitive applications. But, with Ergo’s performance and efficiency, they are likely running largely unopposed for the moment in ultra-low-power edge inference. Perceive is launching with a complete design solution, including reference boards and reference designs for common imaging and audio inferencing applications. Currently, a design flow from PyTorch is supported, and Perceive is likely working on additional flows from popular AI frameworks.
Obviously, Perceive’s technology has potential applications running the gamut of AI scenarios – probably even including the data center – but Perceive is smart to focus on edge-based video and audio applications to start. It will be interesting to watch as applications are developed around this high-accuracy, high-performance, ultra-low-power technology.