
Endpoint AI FPGA Acceleration for the Masses

Microchip’s VectorBlox SDK

Let’s face it. Just about anybody can throw a few billion transistors onto a custom 7nm chip, crank up the coulombs, apply data center cooling, put together an engineering team of a few hundred folks, including world-expert data scientists, digital architecture experts, software savants, and networking gurus, and come up with a pretty decent AI inference engine. There’s nothing to it, really.

But there are a lot of applications that really do need AI inference at the edge – when there just isn’t time to ship the data from your IoT device to the cloud and wait for an answer back. What then? Oh, and what if your team is on a budget smaller than a medium-sized nation’s? Yeah, that’s when things begin to get trickier.

Consider, if you will, an IoT development team with a double-digit, or even (gasp) single-digit, number of engineers. Imagine, if you can, a project schedule considerably shorter than a decade. Picture, if it isn’t too much of a stretch, an engineering team with some holes in expertise, and with communications that fall short of a multi-port hive mind. We’re not talking about your team, of course. This is a hypothetical exercise, but bear with me.

Such a team might find themselves in a situation where the folks designing the hardware – laying out PCBs, writing RTL, staring at eye diagrams on BERT scopes – were not experts in AI. Those working on the AI models and data sets with their TensorFlows and Caffes were not familiar with the ins and outs of shared memory, hardware multipliers, and multi-gigabit serial interfaces. And neither set was really that solid on the architectural challenges of creating efficient convolution data-path engines and optimally quantizing models for power and area efficiency. 

Well, hypothetical team, Microchip has your back. 

There is a large gap in the power/performance/cost spectrum for AI inference where Microchip’s FPGA offerings (formerly from Microsemi and Actel) run mostly unopposed. In particular, their PolarFire and PolarFire SoC FPGAs occupy a point on the curve with AI inference capability well above what one could expect with ultra-low-power offerings such as those from Lattice Semiconductor, and with cost and power consumption significantly lower than high-end offerings from the likes of Xilinx, Intel, and Achronix. If your application sits in that gap, and many, many do – particularly performing functions like smart embedded vision – Microchip’s chips may be an excellent fit.

But we’ve discussed that in the past. What brings us here today is our possibly less-than-hypothetical engineering team – toiling away to develop an edge- or endpoint-AI-based embedded vision application for an IoT device, and struggling to bridge the enormous AI-to-hardware expertise gap. If all you needed to have your smart camera distinguish between grazing herds of deer and approaching zombie hordes was some fancy hardware, we’d all have shipped our products long ago.

This week, Microchip rolled out their VectorBlox Accelerator Software Development Kit (SDK), which “helps developers take advantage of Microchip’s PolarFire FPGAs for creating low-power, flexible overlay-based neural network applications without learning an FPGA tool flow.” VectorBlox brings a number of compelling advantages to our aforementioned embattled embedded engineering team. First and foremost, it creates the critical bridge between software/AI developers creating AI models in frameworks such as TensorFlow and hardware acceleration in an FPGA – without requiring the software developers to understand FPGA design, and without requiring FPGA developers to understand the AI models. Assuming that it creates efficient implementations, this is a game-changing capability for many teams.

But VectorBlox has several other nice tricks up its conceptual sleeve that intrigued us even more. Even with an automated flow from AI model to hardware, there are numerous zombies in our path. First, most models are developed in full 32-bit floating point splendor, but in order to get the real gains from hardware acceleration, we need to trim the data widths down to something much more manageable. This quantization step can be done with little loss of accuracy, but it would be really nice to know just how much accuracy we are giving up. VectorBlox contains a bit-accurate simulator, so you can see (using your actual sample data) how the accuracy is holding up with the trimming of the bits.
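To make that concrete, here is a minimal, self-contained Python sketch (ours, not Microchip’s simulator) of the kind of measurement a bit-accurate check performs: quantize a float32 layer to int8 with simple symmetric scaling, run the same math both ways, and compare. The layer and data below are random stand-ins for a real model and real sample data.

import numpy as np

def quantize_int8(x):
    # Symmetric linear quantization: map [-max|x|, +max|x|] onto [-127, 127].
    scale = float(np.max(np.abs(x))) / 127.0
    scale = scale if scale > 0.0 else 1.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128)).astype(np.float32)   # stand-in layer weights
x = rng.normal(size=(128,)).astype(np.float32)      # stand-in activations

y_ref = W @ x                                       # full 32-bit float reference

Wq, w_scale = quantize_int8(W)
xq, x_scale = quantize_int8(x)
acc = Wq.astype(np.int32) @ xq.astype(np.int32)     # int8 math, int32 accumulate
y_q = acc.astype(np.float32) * (w_scale * x_scale)  # back to real units

rel_err = np.max(np.abs(y_ref - y_q)) / np.max(np.abs(y_ref))
print(f"worst-case relative error after int8 quantization: {rel_err:.4f}")

A real flow would run this comparison over an entire network and a labeled sample set, reporting the change in end-to-end accuracy rather than per-layer error, which is the question the simulator is there to answer.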

Some approaches to AI FPGA accelerator generation take C/C++ and run high-level synthesis (HLS) algorithms to create custom RTL for the model, then synthesize and place-and-route that model to create a bitstream to program the FPGA. While this approach can generate very efficient accelerators, it introduces a long latency development loop when models change, and it requires the FPGA to be re-programmed for each model swap. Microchip has taken a different approach, creating a semi-generic (but customizable for your application) neural network processor in the FPGA fabric that allows new networks and coefficients to be rapidly loaded and re-loaded in real time without reconfiguring the FPGA. This gives a tremendous amount of in-system capability, with the ability to willy-nilly hot swap AI models at runtime – without having to resynthesize, or re-do place-and-route and timing closure, and therefore also without reconfiguring the FPGA.
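To picture the difference, here is a tiny hypothetical host-side sketch in Python. Every name in it (OverlayAccelerator, load_model, infer, the device path) is invented for illustration and is not Microchip’s actual API; the point is only that a model swap is a memory copy, not a bitstream change.

class OverlayAccelerator:
    def __init__(self, device):
        self.device = device   # FPGA configured once, at power-up, with the overlay
        self.model = None

    def load_model(self, blob):
        # Hot-swapping the network is a DMA-sized copy of coefficients and a
        # layer program: no synthesis, place-and-route, or timing closure.
        self.model = blob

    def infer(self, frame):
        assert self.model is not None, "load a model blob first"
        return f"result from {len(self.model)}-byte model"  # placeholder

accel = OverlayAccelerator("/dev/fpga0")        # hypothetical device node
accel.load_model(b"mobilenet-v2-int8-blob")     # milliseconds, not hours
print(accel.infer(b"frame-0"))
accel.load_model(b"tiny-yolo-int8-blob")        # swapped at runtime
print(accel.infer(b"frame-1"))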

The CoreVectorBlox IP implements Matrix Processors (MXPs) that perform elementwise tensor operations such as add, sub, xor, shift, mul, and dotprod, with up to eight 32-bit ALUs. It can mix precisions, including int8, int16, and int32, and it includes a 256KB scratchpad memory and a DMA controller. It also implements VectorBlox CNNs, which perform tensor multiply-accumulate operations at int8 precision and support layer enhancements with software-only updates. This is what allows CNNs to be loaded and swapped at runtime. It also allows multiple CNNs to run (apparently) in parallel, by time-multiplexing different CNNs on each frame (as we were shown in an included demo).
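The multi-CNN trick is easiest to see as a schedule. Below is an illustrative round-robin in Python (ours, not Microchip’s scheduler): within each frame period, every loaded network gets a slice of the accelerator, so all of them appear to run concurrently at the full frame rate.

models = ["face_detector", "object_classifier", "plate_reader"]  # loaded blobs

for frame_id in range(3):      # pretend camera frames arrive here
    for model in models:       # each CNN gets a time slice of this frame period
        # hypothetically: accel.load_model(blobs[model]); accel.infer(frame)
        print(f"frame {frame_id}: running {model}")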

VectorBlox can handle models in TensorFlow and the Open Neural Network Exchange (ONNX) format, which supports many widely used frameworks such as Caffe2, MXNet, PyTorch, Keras, and MATLAB. VectorBlox will be available soon, with an early-access program beginning in late June 2020 and a broad launch planned for the end of September. VectorBlox will make a particularly compelling combo with the new PolarFire SoC FPGA family, which adds a hardened multi-core RISC-V processing subsystem to the company’s popular PolarFire line and will begin sampling in August 2020.
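For reference, the ONNX hand-off mentioned above typically looks like the sketch below: define or train a model in one of the supported frameworks (PyTorch here; the toy TinyCNN is our invention, not a VectorBlox requirement), export it to a .onnx file, and feed that file to the vendor flow. Requires the torch package.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    # A toy stand-in for a real vision model: conv -> global pool -> classify.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 2)     # e.g., "grazing deer" vs. "zombie horde"

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)

model = TinyCNN().eval()
dummy = torch.randn(1, 3, 224, 224)   # one RGB frame, NCHW layout
torch.onnx.export(model, dummy, "tiny_cnn.onnx", opset_version=11)
# tiny_cnn.onnx is now framework-neutral: the hand-off point where an SDK's
# quantization and compilation steps take over.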

One thought on “Endpoint AI FPGA Acceleration for the Masses”

  1. Quote: “Microchip has taken a different approach, creating a semi-generic (but customizable for your application) neural network processor in the FPGA fabric that allows new networks and coefficients to be rapidly loaded and re-loaded in real time without reconfiguring the FPGA. This gives a tremendous amount of in-system capability, with the ability to willy-nilly hot swap AI models at runtime – without having to resynthesize, or re-do place-and-route and timing closure, and therefore also without reconfiguring the FPGA.”
    Definitely a step in the right direction! Especially since I am taking a similar approach to executing if/else, for/while, expression evaluation, and function calls. CEngine does not require C source to be compiled to a specific micro ISA (and does as much in one clock cycle as a processor does in a pipeline sequence).
    The Sharp/Mono Syntax API is used to parse the C source code into an AbstractSyntaxTree, and the SyntaxWalker exposes the execution sequences.
    Four dual-port memory blocks and a few hundred LUTs are used, out of the thousands that are available.
    Loading the memories with data and control sequences can be done via DMA or the FPGA bitstream.
    Yes, it is a programmable accelerator, but it can also do control sequencing.
