
Untether AI – Peddling PetaOps

PCIe AI Accelerator Raises the Bar

It seems we are announcing new AI inference hardware on a weekly basis these days. There is an incredible amount of engineering talent and energy going into the task of developing the next generation of computing architecture – one optimized for the AI revolution. Recently, Untether AI came out of stealth mode and unveiled an entire ecosystem aimed at boosting performance and cutting power consumption in AI workloads in a wide variety of systems, via a PCIe accelerator card boasting a new chip architecture developed by the company.

Let’s look at the chip first. 

Untether looked at the inference task from a power consumption perspective and observed that a large portion of the power burned by any inference task (a whopping 91%, according to Untether) is used to move massive amounts of data in and out of external memories. The company attacked this problem with an architecture it calls “at-memory computation” – which really amounts to putting large amounts of memory as close as possible to the location where data will be needed for computation. This is accomplished via a memory bank block containing 385KB of SRAM with a 2D array of 512 processing elements. There are 511 such banks on the “runAI200” chip, yielding 200MB of highly-localized memory and a beefy 502 TeraOps in what the company calls “sport” mode. There is also an “eco” mode that maximizes power efficiency, yielding a remarkable 8 TOPS per watt (about the best we’ve seen in this class of device).

The runAI200 chip achieves this with a novel architecture, rather than exotic silicon. The company says it is fabricated on a mainstream 16nm CMOS process. In addition to the hyper-locality of memory, the performance/efficiency magic includes reliance on integer data types (hence quantization is necessary). The combination of massively-parallel MAC operations, data locality (with on-chip storage of coefficients), and batch=1 optimization gives significant advantages in power and performance.

Tailored for inference acceleration, runAI200 devices operate on integer data types with a batch size of 1. Running its banks at 960MHz, each device delivers 502 TeraOperations per second in “sport” mode; it may also be configured for maximum efficiency, offering 8 TOPS per watt in 720MHz “eco” mode. The runAI200 comprises 19.4 billion transistors, manufactured on a cost-effective, mainstream 16nm process. The architecture appears to borrow several concepts from the FPGA world, including the localized on-chip memory blocks and redundancy – allowing for higher yield.
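To make the at-memory idea concrete, here is a toy model of batch-1 INT8 inference with weights pre-partitioned across local banks, so no coefficients are ever fetched from external memory. This is an illustrative sketch of the general technique only – the function and bank layout are invented for illustration, not Untether's actual microarchitecture.

```python
import numpy as np

def banked_matvec_int8(x, weight_banks):
    """Multiply one INT8 activation vector (batch=1) against weights that are
    pre-partitioned across local banks; accumulate in INT32 to avoid overflow."""
    acc = np.zeros(sum(w.shape[0] for w in weight_banks), dtype=np.int32)
    row = 0
    for w in weight_banks:           # each "bank" computes on its resident slice
        acc[row:row + w.shape[0]] = w.astype(np.int32) @ x.astype(np.int32)
        row += w.shape[0]
    return acc

rng = np.random.default_rng(0)
x = rng.integers(-128, 128, size=64, dtype=np.int8)
full_w = rng.integers(-128, 128, size=(32, 64), dtype=np.int8)
banks = np.split(full_w, 4)          # weight matrix distributed across 4 banks

# The banked result matches a conventional single-matrix multiply
assert np.array_equal(banked_matvec_int8(x, banks),
                      full_w.astype(np.int32) @ x.astype(np.int32))
```

The point of the sketch: once the weight slices are resident next to the compute, the only data in flight is the activation vector, which is the intuition behind the power savings Untether claims.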

Diving into some details, the chip is a highly regular structure, and the routing architecture is optimized for inference. There is “direct row transfer,” a column-based bank-to-bank local interconnect that provides 32 GB/s per transfer for a total of 15 TB/s. Going the other direction is a “rotator cuff,” which provides column-to-column movement of 16 GB/s bank-to-bank for a total of 8 TB/s. All of this makes its way to and from the PCIe block via a row-based, pipelined ring bus that manages 8 GB/s per row, for a total of 80 GB/s. The PCIe block itself is PCIe Gen4 x16, delivering 32 GB/s connectivity. With all that, data can move throughout the chip, as well as on and off it, with minimal friction. The chip also has a specialized 32-bit custom-instruction-set RISC processor operating at 920MHz to handle housekeeping tasks while the inference fabric does the heavy lifting.

Untether AI puts four of the runAI200 devices onto its “tsunAImi accelerator card,” which delivers a crazy 2 PetaOps in a PCIe form factor. The formidable throughput and efficiency of the chip make this kind of performance possible within the limited form factor. The card manages 1 PetaByte/s of memory bandwidth and supports DMA with the host processor system, maximizing the AI engine’s access to application data. The tsunAImi card claims significantly higher performance than other cards we’ve seen, with something like triple the performance of competitors such as Qualcomm, NVIDIA, Tenstorrent, and Groq – and double the power efficiency of the nearest competitor, despite many of the competitors being built on more advanced (and more expensive) semiconductor processes. Untether claims this can mean a significant savings in total cost of ownership.
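The card-level throughput claim is simple arithmetic on the per-chip figure quoted earlier, assuming the card aggregates its four devices directly:

```python
# Back-of-envelope check of the card-level numbers quoted above,
# using only figures from the article.
chip_tops_sport = 502          # TeraOps per runAI200 chip in "sport" mode
chips_per_card = 4
card_tops = chip_tops_sport * chips_per_card
print(card_tops)               # 2008 TOPS, i.e. roughly 2 PetaOps
assert card_tops >= 2000
```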

Of course, users have to program the thing, and to do that Untether AI offers the “imAIgine Software Development Kit.” The company says the flow is automated, and it starts with the popular and familiar TensorFlow and PyTorch frameworks. There is a lot of model-crunching to do, including quantization, layer optimization, layer replacement/approximation, and optional post-quantization retraining. The model must then be mapped onto the chip, with resources allocated and optimized. To see what you’ve done, the system provides visualizations, including resource allocation and congestion analysis, cycle-accurate simulation, and utilization reporting.

Of course, there is always the question of accuracy retention when quantizing and munging around on models, and Untether claims that (for example) automated quantization on ResNet-50 v1.5 moves accuracy from FP32: 76.458% Top-1 to INT8: 76.298% Top-1. That’s a very small accuracy price to pay for the enormous benefits in performance and efficiency from dropping from FP32 to INT8. On MobileNet v1, the company claims we go from FP32: 71.666% Top-1 to INT8: 70.620% Top-1, and UAT: 71.336% Top-1.
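Why is the accuracy cost so small? Here is a minimal sketch of symmetric per-tensor INT8 post-training quantization – the generic technique, not necessarily the imAIgine SDK's exact algorithm – showing that the round-trip error is bounded by half a quantization step:

```python
import numpy as np

def quantize_int8(t):
    """Map an FP32 tensor onto [-127, 127] with a single shared scale."""
    scale = np.abs(t).max() / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(42)
weights = rng.normal(0, 0.05, size=10_000).astype(np.float32)

q, s = quantize_int8(weights)
err = np.abs(dequantize(q, s) - weights).max()
# Worst-case error is half a quantization step, tiny relative to the weights
assert err <= s / 2 + 1e-6
```

Because typical weight distributions are narrow, the per-weight error stays small, which is consistent with the sub-1% Top-1 deltas quoted above.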

Because large models will need to be partitioned across multiple chips, there is a heuristic partitioning algorithm that performs graph analysis after compilation. It is designed to minimize chip-to-chip communication and is extensible to multi-card partitioning. This means the system is scalable to enormous models, which will become increasingly important as average model sizes continue to grow rapidly at this point in our AI evolution.
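To illustrate the kind of heuristic described above – and this is a generic greedy sketch, not Untether's actual algorithm – a partitioner can walk the model graph in order and keep connected layers on the same chip until it runs out of capacity:

```python
def greedy_partition(edges, num_nodes, num_chips, capacity):
    """edges: list of (src, dst) layer dependencies; returns node -> chip map.
    Nodes are assumed to arrive in topological order."""
    assign, load = {}, [0] * num_chips
    for node in range(num_nodes):
        # Prefer a chip already holding this layer's predecessors, if it has room,
        # since co-locating connected layers avoids chip-to-chip transfers.
        preferred = [assign[s] for s, d in edges if d == node and s in assign]
        for chip in preferred + sorted(range(num_chips), key=lambda c: load[c]):
            if load[chip] < capacity:
                assign[node], load[chip] = chip, load[chip] + 1
                break
    return assign

# A 6-layer chain split across 2 chips of capacity 3: the single cut lands
# between layers 2 and 3, minimizing chip-to-chip traffic.
chain = [(i, i + 1) for i in range(5)]
assert greedy_partition(chain, 6, 2, 3) == {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

Real partitioners weigh tensor sizes on each edge rather than counting layers, but the objective – few, cheap cuts – is the same.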

The runAI200 devices and the tsunAImi accelerator card are sampling now and should be available in production in Q1 2021. It will be interesting to see the adoption of this multi-petaOps solution versus the offerings of some of the more established players. 

