
Endpoint AI FPGA Acceleration for the Masses

Microchip’s VectorBlox SDK

Let’s face it. Just about anybody can throw a few billion transistors onto a custom 7nm chip, crank up the coulombs, apply data center cooling, put together an engineering team of a few hundred folks including world-class data scientists, digital architecture experts, software savants, and networking gurus, and come up with a pretty decent AI inference engine. There’s nothing to it, really.

But there are a lot of applications that really do need AI inference at the edge – when there just isn’t time to ship the data from your IoT device to the cloud and wait for an answer back. What then? Oh, and what if your team is on a budget smaller than a medium-sized nation’s? Yeah, that’s when things begin to get trickier.

Consider, if you will, an IoT development team with a double-digit, or even (gasp) single-digit, number of engineers. Imagine, if you can, a project schedule considerably shorter than a decade. Picture, if it isn’t too much of a stretch, an engineering team with some holes in its expertise, and with communications that fall short of a multi-port hive mind. We’re not talking about your team, of course. This is a hypothetical exercise, but bear with me.

Such a team might find themselves in a situation where the folks designing the hardware – laying out PCBs, writing RTL, staring at eye diagrams on BERT scopes – were not experts in AI. Those working on the AI models and data sets with their TensorFlows and Caffes were not familiar with the ins and outs of shared memory, hardware multipliers, and multi-gigabit serial interfaces. And neither set was really that solid on the architectural challenges of creating efficient convolution data-path engines and optimally quantizing models for power and area efficiency. 

Well, hypothetical team, Microchip has your back. 

There is a large gap in the power/performance/cost spectrum for AI inference where Microchip’s FPGA offerings (formerly from Microsemi and Actel) run mostly unopposed. In particular, their PolarFire and PolarFire SoC FPGAs occupy a point on the curve with AI inference capability well above what one could expect with ultra-low-power offerings such as those from Lattice Semiconductor, and with cost and power consumption significantly lower than high-end offerings from the likes of Xilinx, Intel, and Achronix. If your application sits in that gap, and many, many do – particularly performing functions like smart embedded vision – Microchip’s chips may be an excellent fit.

But we’ve discussed that in the past. What brings us here today is our possibly less-than-hypothetical engineering team – toiling away to develop an edge- or endpoint-AI-based embedded vision application for an IoT device, and struggling to bridge the enormous AI-to-hardware expertise gap. If all you needed to have your smart camera distinguish between grazing herds of deer and approaching zombie hordes was some fancy hardware, we’d all have shipped our products long ago.

This week, Microchip rolled out their VectorBlox Accelerator Software Development Kit (SDK), which “helps developers take advantage of Microchip’s PolarFire FPGAs for creating low-power, flexible overlay-based neural network applications without learning an FPGA tool flow.” VectorBlox brings a number of compelling advantages to our aforementioned embattled embedded engineering team. First and foremost, it creates the critical bridge between software/AI developers creating AI models in frameworks such as TensorFlow and hardware acceleration in an FPGA – without requiring the software developers to understand FPGA design, and without requiring FPGA developers to understand the AI models. Assuming that it creates efficient implementations, this is a game-changing capability for many teams.

But VectorBlox has several other nice tricks up its conceptual sleeve that intrigued us even more. Even with an automated flow from AI model to hardware, there are numerous zombies in our path. First, most models are developed in full 32-bit floating point splendor, but in order to get the real gains from hardware acceleration, we need to trim the data widths down to something much more manageable. This quantization step can be done with little loss of accuracy, but it would be really nice to know just how much accuracy we are giving up. VectorBlox contains a bit-accurate simulator, so you can see (using your actual sample data) how the accuracy is holding up with the trimming of the bits.
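
How much accuracy does quantization cost? That depends on your model and your data, which is exactly why bit-accurate simulation matters. For a feel of what the step looks like, here’s a minimal sketch using TensorFlow Lite’s post-training quantization – not the VectorBlox tool itself – where the saved-model path, input shape, and random calibration data are placeholders for your own:

```python
import numpy as np
import tensorflow as tf

# A few hundred representative samples let the converter calibrate int8 ranges.
def representative_data():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
tflite_int8 = converter.convert()

# Run the quantized model on your actual sample data and compare its outputs
# (or top-1 accuracy) against the float original to see what trimming the
# bits really costs.
interp = tf.lite.Interpreter(model_content=tflite_int8)
interp.allocate_tensors()
inp = interp.get_input_details()[0]
out = interp.get_output_details()[0]
interp.set_tensor(inp["index"], np.random.rand(1, 224, 224, 3).astype(np.float32))
interp.invoke()
print(interp.get_tensor(out["index"]))
```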

Some approaches to AI FPGA accelerator generation take C/C++ and run high-level synthesis (HLS) algorithms to create custom RTL for the model, then synthesize and place-and-route that RTL to create a bitstream to program the FPGA. While this approach can generate very efficient accelerators, it introduces a long-latency development loop every time a model changes, and it requires the FPGA to be re-programmed for each model swap. Microchip has taken a different approach, creating a semi-generic (but customizable for your application) neural network processor in the FPGA fabric that allows new networks and coefficients to be rapidly loaded and re-loaded in real time without reconfiguring the FPGA. This gives a tremendous amount of in-system capability, with the ability to willy-nilly hot swap AI models at runtime – without having to resynthesize, or re-do place-and-route and timing closure, and therefore also without reconfiguring the FPGA.
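
To make the overlay idea concrete, here’s a deliberately hypothetical host-side sketch. Every name in it (the Overlay class, the blob contents) is invented for illustration; this is not the actual VectorBlox API, just the shape of the flow an overlay enables:

```python
# Hypothetical sketch of an overlay-style flow; all names are invented.

class Overlay:
    """Stand-in for the fixed CNN engine configured in the FPGA fabric.
    The bitstream is programmed once; everything after is data movement."""
    def __init__(self, bitstream: str):
        self.bitstream = bitstream  # loaded once, at power-up
        self.model = None

    def load(self, blob: bytes):
        # In hardware this would be a DMA transfer of weights and network
        # instructions into the overlay's memory: microseconds, not a
        # synthesis / place-and-route / timing-closure cycle.
        self.model = blob

    def run(self, frame: bytes) -> str:
        return f"ran {len(self.model)}-byte model on {len(frame)}-byte frame"

overlay = Overlay("corevectorblox.bit")
classifier = b"\x00" * 1024  # placeholder compiled-network blobs
detector = b"\x01" * 2048

frame = b"\xff" * (640 * 480)
overlay.load(classifier)
print(overlay.run(frame))  # swap models between frames: no re-synthesis,
overlay.load(detector)     # no timing closure, no FPGA reconfiguration
print(overlay.run(frame))
```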

The CoreVectorBlox IP implements Matrix Processors (MXPs) that perform elementwise tensor operations such as add, sub, xor, shift, mul, and dotprod, with up to eight 32-bit ALUs. It can mix precisions, including int8, int16, and int32, and it includes a 256KB scratchpad memory and a DMA controller. It also implements VectorBlox CNNs, which perform tensor multiply-accumulate operations at int8 precision and support layer enhancements via software-only updates. This is what allows CNNs to be loaded and swapped at runtime. It also allows multiple CNNs to run apparently in parallel, by time-multiplexing different CNNs on each frame, as we were shown in an included demo.
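
For the arithmetic-minded, the CNN engine’s core operation boils down to something like the numpy sketch below. The accumulation width is our assumption (int32 is common practice; the article states only int8 operand precision):

```python
import numpy as np

# int8 multiply-accumulate: int8 activations and weights, widened before
# accumulation so the running sum doesn't overflow the narrow operand type.
act = np.random.randint(-128, 128, size=256, dtype=np.int8)
wgt = np.random.randint(-128, 128, size=256, dtype=np.int8)
acc = np.dot(act.astype(np.int32), wgt.astype(np.int32))  # int32 accumulator
print(acc)
```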

VectorBlox can handle models in TensorFlow and the Open Neural Network Exchange (ONNX) format, which supports many widely-used frameworks such as Caffe2, MXNet, PyTorch, Keras, and MATLAB. VectorBlox will be available soon, with an early-access program beginning in late June 2020 and a broad launch planned for the end of September. VectorBlox will make a particularly compelling combo with the new PolarFire SoC FPGA family, which adds a hardened multi-core RISC-V processing subsystem to the company’s popular PolarFire line and will begin sampling in August 2020.
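
On the framework front, if your models live in PyTorch, getting to ONNX is a single call. In this sketch, MobileNetV2 and opset 11 are arbitrary stand-ins, not anything VectorBlox specifically requires:

```python
import torch
import torchvision

# Export a trained PyTorch model to ONNX for an ONNX-consuming tool chain.
model = torchvision.models.mobilenet_v2(pretrained=True).eval()
dummy = torch.randn(1, 3, 224, 224)  # example input pins the graph's shapes
torch.onnx.export(model, dummy, "mobilenet_v2.onnx", opset_version=11)
```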

One thought on “Endpoint AI FPGA Acceleration for the Masses”

  1. Quote: “Microchip has taken a different approach, creating a semi-generic (but customizable for your application) neural network processor in the FPGA fabric that allows new networks and coefficients to be rapidly loaded and re-loaded in real time without reconfiguring the FPGA. This gives a tremendous amount of in-system capability, with the ability to willy-nilly hot swap AI models at runtime – without having to resynthesize, or re-do place-and-route and timing closure, and therefore also without reconfiguring the FPGA.”
    Definitely a step in the right direction! Especially since I am taking a similar approach to execute if/else, for/while, expression evaluation, and function calls. CEngine does not require C source to be compiled to a specific micro ISA (and does as much in one clock cycle as a processor does in a pipeline sequence).
    The Sharp/Mono Syntax API is used to parse the C source code to an AbstractSyntaxTree, and the SyntaxWalker exposes the execution sequences.
    4 dual port memory blocks and a few hundred LUTs are used out of the thousands that are available.
    Loading the memories with data and the control sequences can be done via DMA or FPGA bit stream.
    Yes, it is a programmable accelerator but it can also do control sequencing.
