Feature Article

Intel Announces Stratix 10 NX

AI-specific FPGAs Target Inference

Intel has announced what they call their “First Intel AI-Optimized FPGA,” the Stratix 10 NX family. The company says these FPGAs “will offer customers customizable, reconfigurable and scalable AI acceleration for compute-demanding applications such as natural language processing and fraud detection.” Intel has bet on all the horses in the AI race: adding Deep Learning Boost (DL Boost) to their flagship Xeon processors to dramatically accelerate AI inference in conventional data center processors, while also investing heavily in acceleration strategies such as FPGAs and the acquisitions of Habana Labs, Nervana (whose technology has reportedly been dropped in favor of Habana Labs’), and Movidius (focused on low-end AI inference).

Intel’s strategy is clearly morphing toward heterogeneous computing in the data center, where variations in workloads can have a dramatic impact on the optimal hardware configuration. For many (if not most) data center applications, the performance of Xeon with DL Boost will be adequate for the AI tasks that come along. However, if there is a heavy load of AI demand combined with low latency requirements, it makes more sense to add workload-specific acceleration. This is where Stratix 10 NX is likely to come into play.

This new Stratix 10 NX family is a continuation of the Stratix 10 line introduced back in 2015, when Intel had just announced “plans” to buy Altera. Stratix 10 is fabricated on Intel’s 14nm 3D tri-gate (FinFET) technology, and it makes extensive use of Intel’s proprietary embedded multi-die interconnect bridge (EMIB) technology, which allows the company to deploy a wide variety of devices and device families for specialized application domains by mixing and matching chiplets in a single package. EMIB is Intel’s alternative to a silicon interposer for high-density in-package interconnect of heterogeneous chips.

During the wait for the deployment of the next-generation Agilex family (which is based on Intel’s delayed 10nm process and began shipping to early-access customers last August), Intel has rolled out several new variants of Stratix 10 by taking advantage of EMIB’s ability to combine chiplets into domain-focused solutions. Stratix 10 now consists of six variants:

- GX – general-purpose FPGAs
- SX – Intel’s SoC FPGA, which includes a hard processor subsystem with 64-bit quad-core Arm Cortex-A53s
- TX – the transceiver-heavy variant, with tons of 57.8 Gbps PAM4 transceivers
- MX – which includes in-package HBM2
- DX – which supports Intel Ultra Path Interconnect (Intel UPI) for direct coherent connection to select future Intel Xeon processors
- NX – the new variant, which emphasizes low-precision computation

In the case of the new Stratix 10 NX, the company is going after the AI inference market primarily via new AI-optimized arithmetic blocks called AI Tensor Blocks. These blocks would have previously been called “DSP” blocks, but the new versions contain dense arrays of lower-precision multipliers typically used for AI model arithmetic. Intel’s previous devices focused on higher-precision multiplication and even floating point, which would be useful in targeting AI training, but inference acceleration is all about low-precision, and Intel has answered with 15x more INT8 performance than the standard Stratix 10 DSP Block.
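To make the low-precision point concrete, here is a rough sketch (not Intel code, and not specific to the AI Tensor Block) of what INT8 inference arithmetic typically involves: FP32 weights and activations are linearly quantized to 8-bit integers, the multiplies are accumulated in a wider integer, and the result is rescaled back to floating point:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of a float vector to INT8.
    Returns the int8 values and the scale needed to dequantize."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

# Hypothetical weights and activations for a single dot product.
w = np.array([0.5, -1.2, 0.8, 0.1], dtype=np.float32)
a = np.array([1.0, 0.3, -0.7, 2.0], dtype=np.float32)

qw, sw = quantize_int8(w)
qa, sa = quantize_int8(a)

# INT8 multiplies accumulated in INT32 (as a hardware MAC array would),
# then rescaled back to floating point.
acc = np.dot(qw.astype(np.int32), qa.astype(np.int32))
approx = acc * sw * sa
exact = float(np.dot(w, a))
print(approx, exact)  # close, but not bit-exact
```

The small error introduced by quantization is the accuracy/throughput trade that inference accelerators exploit: 8-bit multipliers are far cheaper in silicon than FP32 units, so many more of them fit in the same block.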

Intel gets this INT8 boost by swapping the usual 2-multiplier, 2-accumulator architecture of the standard Stratix 10 DSP block for a 30-multiplier, 30-accumulator block that can handle INT4, INT8, Block FP12, and Block FP16. Presumably, the “15x” factor comes from going from 2 MACs to 30 MACs in the same block for lower-precision operations. Stratix 10 NX also boasts up to 16GB of in-package stacked HBM (like the MX line) and PCIe Gen3 x16 plus PCIe Gen4 x16 support.
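If that presumption is right, the multiplier count alone accounts for the headline number:

```python
# Peak INT8 MACs per block, per the announcement (our reading, not Intel's math).
std_dsp_macs = 2       # standard Stratix 10 DSP block: 2 multipliers / 2 accumulators
tensor_block_macs = 30  # Stratix 10 NX AI Tensor Block: 30 multipliers / 30 accumulators

speedup = tensor_block_macs / std_dsp_macs
print(speedup)  # 15.0 -- matches Intel's "15x" INT8 claim
```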

Intel says Stratix 10 NX is up to 2.3X faster than Nvidia V100 GPUs for BERT batch processing, 9.5X faster in LSTM batch processing, and 3.8X faster in ResNet50 batch processing. The targeting of Nvidia in their marketing materials clearly illuminates Intel’s strategy for Stratix 10 NX. The company wants to stop Nvidia’s incursion into the data center at any cost.

Of course, as we have discussed many times, the big challenge with taking advantage of FPGA performance and power efficiency in compute acceleration is the programming model. Creating an optimized accelerator using FPGAs traditionally requires a team with significant FPGA expertise and experience, and a lot of time. Intel and rival FPGA supplier Xilinx have worked hard over the past decade or so to improve that situation, and Intel seems to be hanging their hat on their ambitious “oneAPI,” a standards-based, unified programming model that aims to facilitate integration of heterogeneous Xeon-based platforms with various accelerators such as FPGAs. Intel’s approach makes sense, given the breadth of their offering, and there is insufficient industry experience so far to fairly assess how oneAPI compares or competes with Xilinx’s Vitis, or with Nvidia’s CUDA environment. While the other solutions aim specifically at acceleration, Intel appears to be attacking the problem one level of abstraction higher, which could be a spectacular success, or could be a bridge too far.

Stratix 10 NX was announced as part of a larger Intel data center announcement, which included the debut of 3rd Gen Xeon processors with built-in AI acceleration through the integration of bfloat16 support. Bfloat16 has half the bits of FP32 while delivering comparable model accuracy for AI inference. Also announced were the new Intel Optane persistent memory series and new 3D NAND SSDs. Taken together, the announcements show a steady drumbeat of progress across the spectrum of data center AI workload optimization.
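The reason bfloat16 preserves model accuracy so well is that it keeps FP32’s full 8-bit exponent and sacrifices only mantissa bits (23 down to 7). Conversion can be sketched as simply dropping the low 16 bits of an FP32 word (a simplification — real hardware may round rather than truncate):

```python
import struct

def fp32_to_bfloat16_bits(x):
    """Keep the top 16 bits of an FP32 value: sign, 8-bit exponent,
    7-bit mantissa -- the bfloat16 representation."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bfloat16_bits_to_fp32(b):
    """Re-expand bfloat16 bits to FP32 by zero-filling the low mantissa bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

x = 3.14159265
bf = bfloat16_bits_to_fp32(fp32_to_bfloat16_bits(x))
print(x, bf)  # same exponent range, ~2-3 decimal digits of precision
```

Because the exponent field is untouched, bfloat16 covers the same dynamic range as FP32, which is why models often run without the rescaling gymnastics that FP16 or INT8 can require.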

Intel is defending their data center dominance on multiple fronts these days, with strong pressure coming from acceleration providers such as Nvidia and Xilinx, alternative processors and architectures such as AMD and ARM, and a proliferation of standards-based approaches that minimize the company’s ability to hold off competition with a breadth-first and integration strategy. It will be interesting to watch the next few years unfold as the increasingly lucrative data center market attracts even more, and better-funded, attackers.

2 thoughts on “Intel Announces Stratix 10 NX”

  1. Block FP12 and Block FP16 … I haven’t seen Intel support block FP before, other than in their early Nervana designs. Is this something novel for FPGA FP AI processing?

