feature article
Subscribe Now

Intel Announces Stratix 10 NX

AI-specific FPGAs Target Inference

Intel has announced what they call their “First Intel AI-Optimized FPGA,” the Stratix 10 NX family. The company says these FPGAs “will offer customers customizable, reconfigurable and scalable AI acceleration for compute-demanding applications such as natural language processing and fraud detection.” Intel has bet on all the horses in the AI race, adding “Deep Learning Boost (DL Boost) to their flagship Xeon processors to dramatically accelerate AI inference in conventional data center processors, but also investing heavily in acceleration strategies such as FPGAs, acquisition of Habana Labs, Nervana (whose technology has now been reportedly dropped in favor of Habana Labs technology), and Movidius (focused on low-end AI inference).

Intel’s strategy is clearly morphing toward heterogeneous computing in the data center, where variations in workloads can have a dramatic impact on the optimal hardware configuration. For many (if not most) data center applications, the performance of Xeon with DL Boost will be adequate for the AI tasks that come along. However, if there is a heavy load of AI demand combined with low latency requirements, it makes more sense to add workload-specific acceleration. This is where Stratix 10 NX is likely to come into play.

This new Stratix 10 NX family is a continuation of the Stratix 10 line introduced back in 2015 when Intel had just announced “plans” to buy Altera. Stratix 10 is fabricated on Intel’s 14nm 3D tri-gate (FinFET) technology, and it makes extensive use of Intel’s proprietary embedded multi-die interconnect bridge (EMIB) technology to allow the company to deploy a wide variety of devices and device families for specialized application domains by mixing and matching chiplets in a single package. EMIB is Intel’s alternative to a silicon interposer for  in-package high density interconnect of heterogeneous chips.

During the wait for the deployment of the next-generation Agilex family (which is based on Intel’s delayed 10nm process and began shipping to early-access customers last August), Intel has rolled out several new variants of Stratix 10 by taking advantage of EMIB’s ability to combine chiplets into domain-focused solutions. Stratix 10 now consists of six different variants, GX – which are general-purpose FPGAs, SX – which is Intel’s SoC FPGA that includes hard processor subsystems with 64-bit quad-core ARM Cortex-A53s, TX – which is the transceiver-heavy variant with tons of PAM4 57.8 Gbps transceivers, MX – which includes in-package HBM2, DX – which supports Intel Ultra Path Interconnect (Intel UPI) for direct coherent connection to future select Intel Xeon processors, and now NX – which emphasizes low-precision computation.

In the case of the new Stratix 10 NX, the company is going after the AI inference market primarily via new AI-optimized arithmetic blocks called AI Tensor Blocks. These blocks would have previously been called “DSP” blocks, but the new versions contain dense arrays of lower-precision multipliers typically used for AI model arithmetic. Intel’s previous devices focused on higher-precision multiplication and even floating point, which would be useful in targeting AI training, but inference acceleration is all about low-precision, and Intel has answered with 15x more INT8 performance than the standard Stratix 10 DSP Block.

Intel gets this INT8 boost by swapping the usual 2-multiplier 2-accumulator architecture of the Stratix 10 MX block for a 30-multiplier, 30-accumulator block that can handle INT4, INT8, BLOCK FP12, and BLOCK FP16. Presumably, the “15x” factor is due to going from 2 MACs to 30 MACs in the same block for lower precision operations. Stratix 10 NX also boasts up to 16GB in-package stacked HBM (like the MX line) and PCIe Gen3x16 plus PCIe Gen4x16 support.

Intel says Stratix 10 NX is up to 2.3X faster than Nvidia V100 GPUs for BERT batch processing, 9.5X faster in LSTM batch processing, and 3.8X faster in ResNet50 batch processing. The targeting of NVidia in their marketing materials clearly illuminates Intel’s strategy for Stratix 10 NX. The company wants to stop NVidia’s incursion into the data center at any cost.

Of course, as we have discussed many times, the big challenge with taking advantage of FPGA performance and power efficiency in compute acceleration is the programming model. Creating an optimized accelerator using FPGAs traditionally requires a team with significant FPGA expertise and experience, and a lot of time. Intel and rival FPGA supplier Xilinx have worked hard over the past decade or so to improve that situation, and Intel seems to be hanging their hat on their ambitious “oneAPI” which is a standards-based, unified programming model that aims to facilitate integration of heterogeneous Xeon-based platforms with various accelerators such as FPGAs. Intel’s approach makes sense, given the breadth of their offering, and there is insufficient industry experience so far to fairly assess how oneAPI compares or competes with Xilinx’s VITIS, or with Nvidia’s CUDA environment. While the other solutions aim specifically at acceleration, Intel appears to be attacking the problem one level of abstraction higher, which could be a spectacular success, or could be a bridge too far.

Stratix 10 NX was announced as part of a larger Intel data center announcement, which included the debut of 3rd Gen Xeon processors with built-in AI acceleration through the integration of bfloat16 support. Bfloat16 has half the bits of FP32 for AI inference with comparable model accuracy. Also announced were the New Intel Optane persistent memory series and new 3D NAND SSDs. Taken together, the announcements show a steady drum beat of progress across the spectrum of data center AI workload optimization.

Intel is defending their data center dominance on multiple fronts these days, with strong pressure coming from acceleration providers such as Nvidia and Xilinx, alternative processors and architectures such as AMD and ARM, and a proliferation of standards-based approaches that minimize the company’s ability to hold off competition with a breadth-first and integration strategy. It will be interesting to watch the next few years unfold as the increasingly lucrative data center market attracts even more, and better-funded, attackers.

2 thoughts on “Intel Announces Stratix 10 NX”

  1. BLOCK FP12, and BLOCK FP16 … I haven’t seen Intel support block FP before, other than in their early Nervana designs. Is this something novel for FPGA FP ai processing?

Leave a Reply

featured blogs
Apr 14, 2021
Hybrid Cloud architecture enables innovation in AI chip design; learn how our partnership with IBM combines the best in EDA & HPC to improve AI performance. The post Synopsys and IBM Research: Driving Real Progress in Large-Scale AI Silicon and Implementing a Hybrid Clou...
Apr 13, 2021
The human brain is very good at understanding the world around us.  An everyday example can be found when driving a car.  An experienced driver will be able to judge how large their car is, and how close they can approach an obstacle.  The driver does not need ...
Apr 13, 2021
If a picture is worth a thousand words, a video tells you the entire story. Cadence's subsystem SoC silicon for PCI Express (PCIe) 5.0 demo video shows you how we put together the latest... [[ Click on the title to access the full blog on the Cadence Community site. ]]...
Apr 12, 2021
The Semiconductor Ecosystem- It is the definition of '€œHigh Tech'€, but it isn'€™t just about… The post Calibre and the Semiconductor Ecosystem appeared first on Design with Calibre....

featured video

Learn the basics of Hall Effect sensors

Sponsored by Texas Instruments

This video introduces Hall Effect, permanent magnets and various magnetic properties. It'll walk through the benefits of Hall Effect sensors, how Hall ICs compare to discrete Hall elements and the different types of Hall Effect sensors.

Click here for more information

featured paper

Understanding the Foundations of Quiescent Current in Linear Power Systems

Sponsored by Texas Instruments

Minimizing power consumption is an important design consideration, especially in battery-powered systems that utilize linear regulators or low-dropout regulators (LDOs). Read this new whitepaper to learn the fundamentals of IQ in linear-power systems, how to predict behavior in dropout conditions, and maintain minimal disturbance during the load transient response.

Click here to download the whitepaper

featured chalk talk

TI Robotics System Learning Kit

Sponsored by Mouser Electronics and Texas Instruments

Robotics projects can get complicated quickly, and finding a set of components, controllers, networking, and software that plays nicely together is a real headache. In this episode of Chalk Talk, Amelia Dalton chats with Mark Easley of Texas Instruments about the TI-RSLK Robotics Kit, which will get you up and running on your next robotics project in no time.

Click here for more information about the Texas Instruments TIRSLK-EVM Robotics System Lab Kit