Let’s start with one hard fact. There are a lot of companies developing new AI inference processors right now, and most of them won’t survive. We have heard reports that as many as 80 startups are currently funded to some level, and all of them are working hard on building better mousetraps, or at least building better specialized processors that could execute neural network models for identifying mice with minimal latency and power. Conventional processors are not very efficient at AI inference, which requires convolution operations with billions, or even trillions, of multiplications to occur as fast as possible, and with minimal power consumption.
And the field isn’t limited to startups, either. Intel just announced that their new second-generation Xeon processors have special extensions called “DL Boost” that accelerate performance of AI inference as much as 30x over the previous Xeons, and Intel has also acquired at least three companies – Movidius, Nervana, and, yep, Altera – who offer acceleration capabilities for AI inference. NVIDIA has built a substantial business in recent years accelerating AI processing with their GPUs and the CUDA programming platform. Xilinx has re-branded their company from FPGA-centric to acceleration-centric with a new “data center first” mantra – largely focused on accelerating AI tasks in the data center – and they recently acquired DeePhi Tech for their machine-learning IP. The Achronix Speedcore Gen4 eFPGA offering includes Machine Learning Processor (MLP) blocks that dramatically accelerate inference. And the list goes on and on.
AI acceleration is critical in every level of the compute infrastructure, from cloud/data center all the way to battery-powered edge devices. The edge, of course, is in many ways the most demanding, as designs tend to be cost-, area-, and power-constrained, but with little compromise in the performance and latency required. Many applications cannot take advantage of heavy-iron data center AI acceleration because the round-trip latency and reliability are just not good enough. For these types of applications, the race is on to develop low-power, low-latency, low-cost AI inference processors – and there is no shortage of competitors.
This week, Flex Logix – a supplier of eFPGA IP – announced that they were jumping into the fray and going into the chip business as well, using their own IP to develop what they call the InferX X1 chip, which “delivers high throughput in edge applications with a single DRAM.” The company claims that the new devices will deliver “up to 10 times the throughput compared to existing inference edge ICs.” And, according to the company’s benchmark data, it appears they may have a solid contender.
Flex Logix used the interconnect technology from their eFPGA offering and added what they call nnMAX inferencing clusters to create the new devices. The company claims much higher throughput/watt than existing solutions, particularly at low batch sizes – common in edge applications where there is typically only one camera/sensor. Flex Logix says the InferX X1’s performance at small batch sizes is “close to data center inference boards and is optimized for large models that need 100s of billions of operations per image. For YOLOv3 real time object recognition, InferX X1 processes 12.7 frames/second of 2 megapixel images at batch size = 1. Performance is roughly linear with image size, so frame rate approximately doubles for a 1 megapixel image. This is with a single DRAM.”
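The scaling claim in that quote can be sanity-checked with simple arithmetic: if per-image work grows roughly linearly with pixel count, frame rate falls inversely with it. A quick sketch (the 2:1 megapixel figures come from the quote above; the constant-throughput assumption is ours):

```python
# Back-of-envelope check of the quoted scaling claim: if per-image work
# grows linearly with pixel count, frame rate falls inversely with it.
def scaled_fps(base_fps: float, base_mpix: float, new_mpix: float) -> float:
    """Estimate frame rate at a new image size, assuming work ~ pixels."""
    return base_fps * (base_mpix / new_mpix)

# Quoted figure: 12.7 frames/second on 2-megapixel images at batch size 1.
fps_1mp = scaled_fps(12.7, 2.0, 1.0)
print(f"Estimated 1-megapixel frame rate: {fps_1mp:.1f} fps")  # ~25.4 fps
```

That works out to roughly 25 frames/second at 1 megapixel, consistent with the company’s “approximately doubles” characterization.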
InferX X1 will be offered as chips for edge devices and on half-height, half-length PCIe cards for edge servers and gateways, as well as via IP for companies developing their own ASICs/SoCs. The X1 chip consists of four tiles, and each tile includes 1,024 MAC units operating at 1.067 GHz and 2MB of on-chip SRAM. Data feeds into the top of the nnMAX cluster stack via the ArrayLINX interconnect. The devices use partial reconfiguration to change the routing configuration layer-by-layer while pulling in new weights from DRAM. At speed, this takes only a microsecond.
Flex Logix says that they have hidden the complexity of the FPGA underpinnings from the user, and that end users will not need to learn or use any FPGA development tools.
Their software stack includes the “nnMAX Compiler,” which accepts models from TensorFlow Lite or ONNX, and the engine supports 8- and 16-bit integer and bfloat16 data types. The software will automatically cast between these types to achieve the required precision and performance, and then automatically generate the required configuration bitstream. This means that models can be run at full precision to get up and running quickly, and optimized, quantized models can be swapped in later for better performance. Winograd transformations (with 12-bit accuracy) are used for 8-bit integer mode, giving more than 2x speedup for those operations. Numerics can be mixed in each layer, allowing optimization with integer types to be combined with selective use of floating point for accuracy.
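The Winograd speedup comes from trading multiplications for cheap additions. As an illustration of the general technique (not Flex Logix’s actual implementation, which operates on 2-D integer tiles), here is the classic 1-D F(2,3) case, which produces two outputs of a 3-tap convolution with 4 multiplications instead of the direct method’s 6:

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices.
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of the valid correlation of a 4-sample input d
    with a 3-tap filter g, using only 4 elementwise multiplies."""
    U = G @ g       # transform the filter (precomputable once per layer)
    V = B_T @ d     # transform the input tile
    M = U * V       # 4 multiplications -- the savings vs. 6 direct
    return A_T @ M  # inverse transform -> 2 outputs

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([2.0, -1.0, 3.0])
print(winograd_f23(d, g))  # matches np.correlate(d, g) -> [9., 13.]
```

In 2-D, the same trick applied to 3x3 filters (F(2x2, 3x3)) yields a theoretical 2.25x reduction in multiplications, which is consistent with the “more than 2x speedup” claimed above.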
The goal of the Flex Logix architecture is to minimize power consumption by reducing movement of data via the FPGA-style interconnect. With on-chip SRAM, the demands on off-chip DRAM memory are much smaller, with corresponding improvements in performance and power consumption. The new chips are fabricated in TSMC 16nm FinFET technology, which has outstanding static and dynamic power characteristics, further boosting X1’s prowess in edge applications. A single InferX X1 chip, with 4 tiles totaling 4,096 MACs, could hit a theoretical peak throughput of 8.5 TOPS. X1 chips are designed to be efficiently daisy-chained for larger models and higher throughput.
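The quoted peak figure tracks with a simple back-of-envelope calculation, assuming the usual convention of counting each multiply-accumulate as two operations (one multiply plus one add):

```python
# Back-of-envelope peak throughput for the InferX X1, using the figures
# quoted in the article. The 2-ops-per-MAC convention is an assumption.
TILES = 4
MACS_PER_TILE = 1024
CLOCK_HZ = 1.067e9
OPS_PER_MAC = 2  # one multiply + one accumulate

peak_tops = TILES * MACS_PER_TILE * CLOCK_HZ * OPS_PER_MAC / 1e12
print(f"Raw peak: {peak_tops:.2f} TOPS")  # -> Raw peak: 8.74 TOPS
```

The raw ceiling works out to about 8.7 TOPS, in the same ballpark as the quoted 8.5 TOPS figure; the small gap presumably reflects how Flex Logix rounds or rates the parts.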
Flex Logix says nnMAX is in development now and will be available for integration in SoCs by Q3 2019. The stand-alone InferX X1 will tape out in Q3 2019, and samples of chips and PCIe boards will be available shortly after.
The move to selling silicon is a leap for Flex Logix, who has made their living so far as a supplier of FPGA IP, but the exercise of developing and shipping their own silicon is likely to complement their IP business well. It also opens an entirely new class of customer for the company: they have historically sold to organizations with the wherewithal to develop their own custom ICs, and shipping stand-alone chips addresses a much larger domain of end users. The company’s sales and support structures will have to evolve to handle the challenge of this new, larger customer pool.
While benchmark results are impressive and Flex Logix has a good reputation as a savvy startup, it is too early to predict any long-term winners in this fast-moving market. With so many companies developing AI inference solutions, we expect to be bombarded with competing claims over the coming months. It will be interesting to watch.