
Endpoint AI FPGA Acceleration for the Masses

Microchip’s VectorBlox SDK

Let’s face it. Just about anybody can throw a few billion transistors onto a custom 7nm chip, crank up the coulombs, apply data center cooling, put together an engineering team with a few hundred folks including world expert data scientists, digital architecture experts, software savants, and networking gurus, and come up with a pretty decent AI inference engine. There’s nothing to it, really. 

But there are a lot of applications that really do need AI inference at the edge – when there just isn’t time to ship the data from your IoT device to the cloud and wait for an answer back. What then? Oh, and what if your team is on a budget smaller than a medium-sized nation? Yeah, that’s when things begin to get trickier. 

Consider, if you will, an IoT development team with a double-, or even (gasp) single-digit number of engineers. Imagine, if you can, a project schedule considerably shorter than a decade. Picture, if it isn't too much of a stretch, an engineering team with some holes in its expertise, and with communications that fall short of a multi-port hive mind. We're not talking about your team, of course. This is a hypothetical exercise, but bear with me.

Such a team might find themselves in a situation where the folks designing the hardware – laying out PCBs, writing RTL, staring at eye diagrams on BERT scopes – were not experts in AI. Those working on the AI models and data sets with their TensorFlows and Caffes were not familiar with the ins and outs of shared memory, hardware multipliers, and multi-gigabit serial interfaces. And neither set was really that solid on the architectural challenges of creating efficient convolution data-path engines and optimally quantizing models for power and area efficiency. 

Well, hypothetical team, Microchip has your back. 

There is a large gap in the power/performance/cost spectrum for AI inference where Microchip’s FPGA offerings (formerly from Microsemi and Actel) run mostly unopposed. In particular, their PolarFire and PolarFire SoC FPGAs occupy a point on the curve with AI inference capability well above what one could expect with ultra-low-power offerings such as those from Lattice Semiconductor, and with cost and power consumption significantly lower than high-end offerings from the likes of Xilinx, Intel, and Achronix. If your application sits in that gap, and many, many do – particularly performing functions like smart embedded vision – Microchip’s chips may be an excellent fit.

But we’ve discussed that in the past. What brings us here today is our possibly less-than-hypothetical engineering team – toiling away to develop an edge- or endpoint-AI-based embedded vision application for an IoT device, and struggling to bridge the enormous AI-to-hardware expertise gap. If all you needed to have your smart camera distinguish between grazing herds of deer and approaching zombie hordes was some fancy hardware, we’d all have shipped our products long ago.

This week, Microchip rolled out their VectorBlox Accelerator Software Development Kit (SDK), which “helps developers take advantage of Microchip’s PolarFire FPGAs for creating low-power, flexible overlay-based neural network applications without learning an FPGA tool flow.” VectorBlox brings a number of compelling advantages to our aforementioned embattled embedded engineering team. First and foremost, it creates the critical bridge between software/AI developers creating AI models in frameworks such as TensorFlow and hardware acceleration in an FPGA – without requiring the software developers to understand FPGA design, and without requiring FPGA developers to understand the AI models. Assuming that it creates efficient implementations, this is a game-changing capability for many teams.

But VectorBlox has several other nice tricks up its conceptual sleeve that intrigued us even more. Even with an automated flow from AI model to hardware, there are numerous zombies in our path. First, most models are developed in full 32-bit floating point splendor, but in order to get the real gains from hardware acceleration, we need to trim the data widths down to something much more manageable. This quantization step can be done with little loss of accuracy, but it would be really nice to know just how much accuracy we are giving up. VectorBlox contains a bit-accurate simulator, so you can see (using your actual sample data) how the accuracy is holding up with the trimming of the bits.
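To make the quantization trade-off concrete, here is a toy sketch (plain numpy, not VectorBlox code) of symmetric int8 quantization of a float32 weight tensor, along with the worst-case rounding error you are accepting in exchange for narrower hardware datapaths. A bit-accurate simulator answers the same question for an entire model rather than a single tensor.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of float32 data to int8."""
    scale = np.max(np.abs(x)) / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)  # stand-in for a layer's weights
q, scale = quantize_int8(w)
err = np.max(np.abs(w - dequantize(q, scale)))    # worst-case reconstruction error
# Rounding to the nearest quantization step bounds the error by half a step:
assert err <= scale / 2 + 1e-6
```

The half-step error bound is per weight; what the simulator tells you is how those small per-weight errors compound through many layers into a change in end-to-end accuracy on your actual sample data.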

Some approaches to AI FPGA accelerator generation take C/C++ and run high-level synthesis (HLS) algorithms to create custom RTL for the model, then synthesize and place-and-route that model to create a bitstream to program the FPGA. While this approach can generate very efficient accelerators, it introduces a long latency development loop when models change, and it requires the FPGA to be re-programmed for each model swap. Microchip has taken a different approach, creating a semi-generic (but customizable for your application) neural network processor in the FPGA fabric that allows new networks and coefficients to be rapidly loaded and re-loaded in real time without reconfiguring the FPGA. This gives a tremendous amount of in-system capability, with the ability to willy-nilly hot swap AI models at runtime – without having to resynthesize, or re-do place-and-route and timing closure, and therefore also without reconfiguring the FPGA.
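The overlay idea is easy to illustrate in software terms: the bitstream implements a fixed interpreter, and a model swap is just new data. This is a hypothetical Python analogy (none of these names are Microchip's API) of why swapping models needs no re-synthesis.

```python
# A fixed "engine" that never changes (analogous to the overlay in the FPGA
# fabric), interpreting a loadable program (analogous to a network blob).
def run_overlay(program, x):
    ops = {
        "add": lambda v, arg: v + arg,
        "mul": lambda v, arg: v * arg,
        "relu": lambda v, arg: max(v, 0),
    }
    for op, arg in program:      # the model is data, not hardware
        x = ops[op](x, arg)
    return x

model_a = [("mul", 2), ("add", -3), ("relu", None)]
model_b = [("add", 10), ("mul", 0.5)]

# Hot-swapping models at runtime means swapping data; run_overlay itself
# never changes, just as the overlay bitstream never needs re-synthesis,
# re-place-and-route, or timing closure.
print(run_overlay(model_a, 5))   # -> 7
print(run_overlay(model_b, 5))   # -> 7.5
```

The HLS-per-model approach is the opposite trade: each model becomes custom hardware (potentially more efficient), at the cost of a full synthesis/place-and-route loop and an FPGA reprogram per model change.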

The CoreVectorBlox IP implements Matrix Processors (MXPs) that perform elementwise tensor operations such as add, sub, xor, shift, mul, and dotprod, using up to eight 32-bit ALUs. It can mix precisions including int8, int16, and int32, and it includes a 256KB scratchpad memory and a DMA controller. It also implements the VectorBlox CNN engine, which performs tensor multiply-accumulate operations at int8 precision and supports layer enhancements through software-only updates. This is what allows CNNs to be loaded and swapped at runtime. It also allows multiple CNNs to run concurrently, by time-multiplexing the different CNNs on each frame (as we were shown in a demo).
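The int8 multiply-accumulate operation at the heart of that CNN engine can be mimicked in numpy. The sketch below is illustrative only; the one detail worth noticing is that int8 products are accumulated in a wider int32 type, as a hardware MAC array would, so that intermediate sums cannot overflow.

```python
import numpy as np

def int8_dot(a, b):
    """Dot product of two int8 vectors, widened to int32 before
    multiply-accumulate so the partial sums cannot overflow int8."""
    return np.dot(a.astype(np.int32), b.astype(np.int32))

a = np.array([127, -128, 64], dtype=np.int8)
b = np.array([127,  127,  2], dtype=np.int8)
acc = int8_dot(a, b)   # 16129 - 16256 + 128 = 1, safely representable in int32
print(acc)             # -> 1
```

Had the products been kept in int8, the very first term (127 × 127 = 16129) would already have wrapped around, which is why a real int8 datapath pairs narrow multipliers with wide accumulators.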

VectorBlox can handle models in TensorFlow and the Open Neural Network Exchange (ONNX) format, which supports many widely used frameworks including Caffe2, MXNet, PyTorch, Keras, and MATLAB. VectorBlox will be available soon, with an early-access program beginning in late June 2020 and a broad launch planned for the end of September. It will make a particularly compelling combo with the new PolarFire SoC FPGA family, which adds a hardened multi-core RISC-V processing subsystem to the company's popular PolarFire line and will begin sampling in August 2020.

