feature article
Subscribe Now

Untether AI – Peddling PetaOps

PCIe AI Accelerator Raises the Bar

It seems we are announcing new AI inference hardware on a weekly basis these days. There is an incredible amount of engineering talent and energy going into the task of developing the next generation of computing architecture – one optimized for the AI revolution. Recently, Untether AI came out of stealth mode and unveiled an entire ecosystem aimed at boosting performance and cutting power consumption in AI workloads in a wide variety of systems, via a PCIe accelerator card boasting a new chip architecture developed by the company.

Let’s look at the chip first. 

Untether looked at the inference task from a power consumption perspective and observed that a large portion of the power burned by any inference task (a whopping 91% according to Untether) is used to move massive amounts of data in and out of external memories. The company attacked this problem with an architecture they call “at-memory computation” – which really amounts to putting large amounts of memory as close as possible to the location where data will be needed for computation. This is accomplished via a memory bank block containing 385KBs of SRAM with a 2D array of 512 processing elements. There are 511 such banks on the “runAI200” chip, yielding 200MB of highly-localized memory and a beefy 502 TeraOps, in what the company calls “sport” mode. There is also an “eco” mode which maximizes power efficiency, yielding a remarkable 8 TOPs per watt (about the best we’ve seen in this class device). 

The runAI200 chip achieves this with a novel architecture, rather than exotic silicon. The company says it is fabricated on a mainstream 16nm CMOS process. In addition to hyper-locality of memory, the performance/efficiency magic includes reliance on integer data types (hence quantization is necessary). The combination of massively-parallel MAC operations, data locality – with on-chip storage of coefficients, and batch=1 optimization gives significant advantages in power and performance.

Tailored for inference acceleration, runAI200 devices operate using integer data types and a batch mode of 1. At the heart of the unique at-memory compute architecture is a memory bank: 385KBs of SRAM with a 2D array of 512 processing elements. With 511 banks per chip, each device offers 200MB of memory, and operating at 960MHz delivers 502 TeraOperations per second in its “sport” mode. It may also be configured for maximum efficiency, offering 8 TOPs per watt in 720 MHz “eco” mode. The new runAI200 devices consist of 19.4 billion transistors, manufactured using a cost-effective, mainstream 16nm process. The architecture appears to borrow several concepts from the FPGA world, including the localized on-chip memory blocks and redundancy – allowing for higher yield.

Diving into some details, the chip is a highly regular structure, and the routing architecture is optimized for inference. There is “direct row transfer,” which is a column-based bank-to-bank local interconnect that provides 32 GB/s transfer for a total of 15 TB/s. Going the other direction is a “rotator cuff,” which provides column-to-column movement of 16GB/s bank-to-bank for a total of 8TB/s. All of this makes its way to and from the PCIe block via a row-based ring pipelined bus that manages 8GB/s per row, for a total of 80GB/s. The PCIe block itself is PCIe gen 4 x16, delivering 32GB/s connectivity. With all that, data can move throughout the chip as well as on and off with minimal friction. The chip also has a specialized 32-bit custom instruction set RISC processor operating at 920MHz to handle housekeeping tasks while the inference fabric does the heavy lifting.

Untether AI puts four of the runAI200 devices onto their “tsunAImi accelerator card” which delivers a crazy 2 PetaOps in a PCIe form factor. The formidable throughput and efficiency of the chip makes this kind of performance possible within the limited form factor. The card manages 1 PetaByte/s memory bandwidth, which can be DMA with the host processor system, maximizing access of the AI engine to application data. The tsunAImi card claims significantly higher performance than other cards we’ve seen, with something like triple the performance of competitors such as Qualcomm, NVidia, Tenstorrent, and Groq – and double the power efficiency of the nearest competitor, despite many of the competitors being built on more advanced (and more expensive) semiconductor processes. Untether claims this can mean a significant savings in total cost of ownership.

Of course, the users have to program the thing, and to do that Untether AI offers the “imAIgine Software Development Kit.” The company says the flow is automated, and it starts with popular and familiar TensorFlow and Pytorch frameworks. There is a lot of model-crunching to do, including quantization, layer optimization, layer-replacement/approximation, and optional post-quantization retraining. The model must then be mapped onto the chip and resources allocated and optimized. To see what you’ve done, the system provides visualizations, including resource allocation and congestion analysis, cycle-accurate simulation, and utilization reporting. 

Of course, there is always the question of accuracy retention when quantizing and munging around on models, and Untether claims that (for example) automated quantization on ResNet-50 v1.5 moves accuracy from FP32: 76.458% Top-1 to INT8: 76.298% Top-1. That’s a very small accuracy price to pay for the enormous benefits in performance and efficiency from dropping FP32 in INT8. On MobileNet v1 the company claims we go from FP32: 71.666% Top-1 to INT8: 70.620% Top-1, and UAT: 71.336% Top-1. 

Because models will need to be partitioned across multiple chips, there is a heuristic partitioning algorithm that does graph analysis after compilation. It is designed to minimize chip-to-chip communication and is extensible to multiple-card partitioning. This means that the system is scalable to enormous models, which will become increasingly important as the average model size is growing rapidly at this point in our AI evolution. 

The runAI200 devices and the tsunAImi accelerator card are sampling now and should be available in production in Q1 2021. It will be interesting to see the adoption of this multi-petaOps solution versus the offerings of some of the more established players. 


Leave a Reply

featured blogs
Nov 24, 2020
In our last Knowledge Booster Blog , we introduced you to some tips and tricks for the optimal use of the Virtuoso ADE Product Suite . W e are now happy to present you with some further news from our... [[ Click on the title to access the full blog on the Cadence Community s...
Nov 23, 2020
It'€™s been a long time since I performed Karnaugh map minimizations by hand. As a result, on my first pass, I missed a couple of obvious optimizations....
Nov 23, 2020
Readers of the Samtec blog know we are always talking about next-gen speed. Current channels rates are running at 56 Gbps PAM4. However, system designers are starting to look at 112 Gbps PAM4 data rates. Intuition would say that bleeding edge data rates like 112 Gbps PAM4 onl...
Nov 20, 2020
[From the last episode: We looked at neuromorphic machine learning, which is intended to act more like the brain does.] Our last topic to cover on learning (ML) is about training. We talked about supervised learning, which means we'€™re training a model based on a bunch of ...

featured video

AI SoC Chats: Scaling AI Systems with Die-to-Die Interfaces

Sponsored by Synopsys

Join Synopsys Interface IP expert Manmeet Walia to understand the trends around scaling AI SoCs and systems while minimizing latency and power by using die-to-die interfaces.

Click here for more information about DesignWare IP for Amazing AI

featured paper

How semiconductor technologies have impacted modern telehealth solutions

Sponsored by Texas Instruments

Innovate technologies have given the general population unprecedented access to healthcare tools for self-monitoring and remote treatment. This paper dives into some of the newer developments of semiconductor technologies that have significantly contributed to the telehealth industry, along with design requirements for both hospital and home environment applications.

Click here to download the whitepaper

Featured Chalk Talk

Innovative Hybrid Crowbar Protection for AC Power Lines

Sponsored by Mouser Electronics and Littelfuse

Providing robust AC line protection is a tough engineering challenge. Lightning and other unexpected events can wreak havoc with even the best-engineered power supplies. In this episode of Chalk Talk, Amelia Dalton chats with Pete Pytlik of Littelfuse about innovative SIDACtor semiconductor hybrid crowbar protection for AC power lines, that combine the best of TVS and MOV technologies to deliver superior low clamping voltage for power lines.

More information about Littelfuse SIDACtor + MOV AC Line Protection