
Flex Logix Accelerates Edge Inference

New InferX X1 Chips Show Promise

Let’s start with one hard fact. There are a lot of companies developing new AI inference processors right now, and most of them won’t survive. We have heard reports that as many as 80 startups are currently funded to some level, all of them hard at work on better mousetraps – or at least on specialized processors that can execute neural network models for identifying mice with minimal latency and power. Conventional processors are not very efficient at AI inference, which demands that convolution operations involving billions, or even trillions, of multiplications execute as fast as possible, with minimal power consumption.
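To make that concrete, here is a quick back-of-the-envelope calculation – the layer dimensions below are hypothetical, not taken from any particular network – showing how fast the multiplications pile up in even a single convolution layer:

```python
# MAC (multiply-accumulate) count for one hypothetical convolution layer:
# a 3x3 kernel over a 224x224 feature map, 64 input and 64 output channels.
out_h, out_w = 224, 224   # output spatial dimensions
in_ch, out_ch = 64, 64    # input / output channel counts
k = 3                     # kernel is k x k

macs = out_h * out_w * in_ch * out_ch * k * k
print(f"{macs:,} MACs")   # 1,849,688,064 -- nearly two billion, for ONE layer
```

Stack a few dozen such layers and a single inference quickly lands in the tens or hundreds of billions of operations.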

And the field isn’t limited to startups, either. Intel just announced that their new second-generation Xeon processors have special extensions called “DL Boost” that accelerate AI inference performance as much as 30x over the previous Xeons, and Intel has also acquired at least three companies – Movidius, Nervana, and, yep, Altera – who offer acceleration capabilities for AI inference. NVIDIA has built a substantial business in recent years accelerating AI processing with their GPUs and their CUDA programming platform. Xilinx has re-branded itself from FPGA-centric to acceleration-centric with a new “data center first” mantra – largely focused on accelerating AI tasks in the data center – and recently acquired DeePhi Tech for its machine-learning IP. The Achronix Speedcore Gen4 eFPGA offering includes Machine Learning Processor (MLP) blocks that dramatically accelerate inference. And the list goes on and on.

AI acceleration is critical at every level of the compute infrastructure, from the cloud/data center all the way down to battery-powered edge devices. The edge, of course, is in many ways the most demanding, as designs tend to be cost-, area-, and power-constrained, yet allow little compromise in required performance and latency. Many applications cannot take advantage of heavy-iron data center AI acceleration because the round-trip latency and reliability are simply not good enough. For these applications, the race is on to develop low-power, low-latency, low-cost AI inference processors – and there is no shortage of competitors.

This week, Flex Logix – a supplier of eFPGA IP – announced that they were jumping into the fray and going into the chip business as well, using their own IP to develop what they call the InferX X1 chip, which “delivers high throughput in edge applications with a single DRAM.” The company claims that the new devices will deliver “up to 10 times the throughput compared to existing inference edge ICs.” And, according to the company’s benchmark data, it appears they may have a solid contender.

Flex Logix used the interconnect technology from their eFPGA offering and added what they call nnMAX inferencing clusters to create the new devices. The company claims much higher throughput/watt than existing solutions, particularly at low batch sizes – common in edge applications where there is typically only one camera/sensor. Flex Logix says the InferX X1’s performance at small batch sizes is “close to data center inference boards and is optimized for large models that need 100s of billions of operations per image. For YOLOv3 real time object recognition, InferX X1 processes 12.7 frames/second of 2 megapixel images at batch size = 1. Performance is roughly linear with image size, so frame rate approximately doubles for a 1 megapixel image. This is with a single DRAM.”
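Taking the quoted numbers at face value, the scaling claim is easy to check: if per-frame work scales linearly with pixel count, frame rate scales inversely with it. A simple model, assuming everything else stays constant:

```python
# Simple scaling model based on the quoted figures: work per frame is
# proportional to pixel count, so fps is inversely proportional to it.
fps_at_2mp = 12.7   # quoted YOLOv3 throughput at 2 megapixels, batch size 1

def estimated_fps(megapixels):
    """Frame rate assuming compute scales linearly with pixel count."""
    return fps_at_2mp * 2.0 / megapixels

print(f"{estimated_fps(1.0):.1f} fps at 1 MP")  # ~25.4 fps -- roughly double
```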

InferX X1 will be offered as chips for edge devices and on half-height, half-length PCIe cards for edge servers and gateways, as well as via IP for companies developing their own ASICs/SoCs. The X1 chip consists of four tiles, each of which includes 1,024 MAC units operating at 1.067 GHz and 2MB of on-chip SRAM. Data feeds in through the ArrayLINX interconnect at the top of the nnMAX cluster stack. The devices use partial reconfiguration to change the routing configuration layer by layer while pulling in new weights from DRAM. At speed, this takes only a microsecond.
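That one-microsecond figure matters because it keeps reconfiguration off the critical path. A rough model – the per-layer MAC count here is an illustrative assumption, not a Flex Logix number – shows why:

```python
# Why a ~1 us per-layer reconfiguration is negligible: compare it with
# the compute time of a moderately sized layer on 4,096 MACs at 1.067 GHz.
macs_per_layer = 2e9    # assumed layer size: ~2 billion MACs (illustrative)
mac_units = 4096        # 4 tiles x 1,024 MACs, per the article
clock_hz = 1.067e9

compute_us = macs_per_layer / (mac_units * clock_hz) * 1e6
reconfig_us = 1.0       # per-layer reconfiguration time, per the article

print(f"compute: {compute_us:.0f} us, reconfig overhead: "
      f"{reconfig_us / (compute_us + reconfig_us):.2%}")
# -> compute: ~458 us, so reconfiguration adds roughly 0.2% overhead
```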

Flex Logix says that they have hidden the complexity of the FPGA underpinnings from the user, and that end users will not need to learn or use any FPGA development tools.

Their software stack includes the “nnMAX Compiler,” which accepts models from TensorFlow Lite or ONNX, and the engine supports integer 8, integer 16, and bfloat16 data types. The software automatically casts between these types to achieve the required precision and performance, and then generates the required configuration bitstream. This means that models can first be run at full precision to get up and running quickly, with optimized, quantized models swapped in later for better performance. Winograd transformations (with 12-bit accuracy) are used in integer 8 mode, giving more than 2x speedup for those operations. Numerics can be mixed in each layer, allowing optimization with integer types to be combined with selective use of floating point for accuracy.
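To illustrate the workflow this enables – every name below is invented for illustration and is not the nnMAX Compiler’s actual interface – a team might start all layers at full precision and then quantize only the layers that tolerate it:

```python
# Hypothetical sketch of the per-layer numeric-selection workflow described
# above. Nothing here is Flex Logix's actual API; it just shows the idea:
# start at full precision, then quantize the layers that can handle it.
def plan_numerics(layers, quantize=False):
    """Assign a data type to each layer; types may vary layer by layer."""
    plan = {}
    for name, tolerates_int8 in layers:
        if quantize and tolerates_int8:
            plan[name] = "int8"      # Winograd gives >2x speedup here
        else:
            plan[name] = "bfloat16"  # keep full precision for accuracy
    return plan

layers = [("conv1", True), ("conv2", True), ("detect_head", False)]
print(plan_numerics(layers))                 # first pass: all bfloat16
print(plan_numerics(layers, quantize=True))  # optimized: int8 where safe
```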

The goal of the Flex Logix architecture is to minimize power consumption by reducing movement of data via the FPGA-style interconnect. With on-chip SRAM, the demands on off-chip DRAM memory are much smaller, with corresponding improvements in performance and power consumption. The new chips are fabricated in TSMC 16nm FinFET technology, which has outstanding static and dynamic power characteristics, further boosting X1’s prowess in edge applications. A single InferX X1 chip, with 4 tiles totaling 4,096 MACs, could hit a theoretical peak throughput of 8.5 TOPS. X1 chips are designed to be efficiently daisy-chained for larger models and higher throughput.
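That peak figure is straightforward to reconstruct from the stated specs, counting each MAC as two operations (a multiply plus an accumulate) per cycle:

```python
# Reconstructing the peak-throughput claim from the stated specs.
mac_units = 4096       # 4 tiles x 1,024 MACs
clock_hz = 1.067e9     # 1.067 GHz
ops_per_mac = 2        # one multiply plus one accumulate per cycle

peak_tops = mac_units * clock_hz * ops_per_mac / 1e12
print(f"{peak_tops:.1f} TOPS")  # ~8.7 TOPS, in line with the quoted 8.5
```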

Flex Logix says nnMAX is in development now and will be available for integration into SoCs by Q3 2019. The stand-alone InferX X1 will tape out in Q3 2019, with samples of chips and PCIe boards available shortly after.

The move to selling silicon is a leap for Flex Logix, which has so far made its living as a supplier of FPGA IP, but the exercise of developing and shipping its own silicon is likely to complement the IP business well. It also opens an entirely new class of customer: the company has historically focused on organizations with the wherewithal to develop their own custom ICs, and shipping stand-alone chips addresses a much larger domain of end users. The company’s sales and support structures will have to evolve to handle this new, larger customer pool.

While benchmark results are impressive and Flex Logix has a good reputation as a savvy startup, it is too early to predict any long-term winners in this fast-moving market. With so many companies developing AI inference solutions, we expect to be bombarded with competing claims over the coming months. It will be interesting to watch.
