Let’s face it. Just about anybody can throw a few billion transistors onto a custom 7nm chip, crank up the coulombs, apply data center cooling, put together an engineering team with a few hundred folks including world expert data scientists, digital architecture experts, software savants, and networking gurus, and come up with a pretty decent AI inference engine. There’s nothing to it, really.
But there are a lot of applications that really do need AI inference at the edge – when there just isn’t time to ship the data from your IoT device to the cloud and wait for an answer back. What then? Oh, and what if your team is on a budget smaller than a medium-sized nation? Yeah, that’s when things begin to get trickier.
Consider, if you will, an IoT development team with a double-, or even (gasp) single-digit number of engineers? Imagine, if you can, a project schedule considerably shorter than a decade? Picture, if it isn’t too much of a stretch, an engineering team with some holes in its expertise, and with communications that fall short of a multi-port hive mind. We’re not talking about your team, of course. This is a hypothetical exercise, but bear with me.
Such a team might find themselves in a situation where the folks designing the hardware – laying out PCBs, writing RTL, staring at eye diagrams on BERT scopes – were not experts in AI. Those working on the AI models and data sets with their TensorFlows and Caffes were not familiar with the ins and outs of shared memory, hardware multipliers, and multi-gigabit serial interfaces. And neither set was really that solid on the architectural challenges of creating efficient convolution data-path engines and optimally quantizing models for power and area efficiency.
Well, hypothetical team, Microchip has your back.
There is a large gap in the power/performance/cost spectrum for AI inference where Microchip’s FPGA offerings (formerly from Microsemi and Actel) run mostly unopposed. In particular, their PolarFire and PolarFire SoC FPGAs occupy a point on the curve with AI inference capability well above what one could expect with ultra-low-power offerings such as those from Lattice Semiconductor, and with cost and power consumption significantly lower than high-end offerings from the likes of Xilinx, Intel, and Achronix. If your application sits in that gap, and many, many do – particularly performing functions like smart embedded vision – Microchip’s chips may be an excellent fit.
But we’ve discussed that in the past. What brings us here today is our possibly less-than-hypothetical engineering team – toiling away to develop an edge- or endpoint-AI-based embedded vision application for an IoT device, and struggling to bridge the enormous AI-to-hardware expertise gap. If all you needed to have your smart camera distinguish between grazing herds of deer and approaching zombie hordes was some fancy hardware, we’d all have shipped our products long ago.
This week, Microchip rolled out their VectorBlox Accelerator Software Development Kit (SDK), which “helps developers take advantage of Microchip’s PolarFire FPGAs for creating low-power, flexible overlay-based neural network applications without learning an FPGA tool flow.” VectorBlox brings a number of compelling advantages to our aforementioned embattled embedded engineering team. First and foremost, it creates the critical bridge between software/AI developers creating AI models in frameworks such as TensorFlow and hardware acceleration in an FPGA – without requiring the software developers to understand FPGA design, and without requiring FPGA developers to understand the AI models. Assuming that it creates efficient implementations, this is a game-changing capability for many teams.
But VectorBlox has several other nice tricks up its conceptual sleeve that intrigued us even more. Even with an automated flow from AI model to hardware, there are numerous zombies in our path. First, most models are developed in full 32-bit floating point splendor, but in order to get the real gains from hardware acceleration, we need to trim the data widths down to something much more manageable. This quantization step can be done with little loss of accuracy, but it would be really nice to know just how much accuracy we are giving up. VectorBlox contains a bit-accurate simulator, so you can see (using your actual sample data) how the accuracy is holding up with the trimming of the bits.
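To make the quantization trade-off concrete, here is a minimal sketch of symmetric int8 weight quantization in plain Python. This is a generic illustration of the technique, not VectorBlox’s actual quantizer (whose internals aren’t public): float weights are mapped onto the int8 range with a single scale factor, and the round-trip error shows how much precision the trimming costs.

```python
# Generic symmetric int8 quantization sketch -- NOT VectorBlox's quantizer.

def quantize_int8(weights):
    """Map float weights onto [-127, 127] with a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.82, -1.5, 0.003, 0.25, -0.91]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Worst-case round-trip error is bounded by half a quantization step.
assert max_err <= scale / 2 + 1e-12
```

In practice the interesting question is not the weight error but the end-to-end accuracy on real inputs, which is exactly what running sample data through a bit-accurate simulator tells you.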
Some approaches to AI FPGA accelerator generation take C/C++ and run high-level synthesis (HLS) algorithms to create custom RTL for the model, then synthesize and place-and-route that model to create a bitstream to program the FPGA. While this approach can generate very efficient accelerators, it introduces a long latency development loop when models change, and it requires the FPGA to be re-programmed for each model swap. Microchip has taken a different approach, creating a semi-generic (but customizable for your application) neural network processor in the FPGA fabric that allows new networks and coefficients to be rapidly loaded and re-loaded in real time without reconfiguring the FPGA. This gives a tremendous amount of in-system capability, with the ability to willy-nilly hot swap AI models at runtime – without having to resynthesize, or re-do place-and-route and timing closure, and therefore also without reconfiguring the FPGA.
The CoreVectorBlox IP implements Matrix Processors (MXPs) that perform elementwise tensor operations such as add, subtract, xor, shift, multiply, and dot product, with up to eight 32-bit ALUs. It can mix precisions, including int8, int16, and int32, and it includes a 256KB scratchpad memory and a DMA controller. It also implements VectorBlox CNNs, which perform tensor multiply-accumulate operations at int8 precision and support layer enhancements through software-only updates. This is what allows CNNs to be loaded and swapped at runtime. It also allows multiple CNNs to run (apparently) in parallel by time-multiplexing different CNNs on each frame, as we were shown in an included demo.
VectorBlox can handle models in TensorFlow and the Open Neural Network Exchange (ONNX) format, which supports many widely used frameworks including Caffe2, MXNet, PyTorch, Keras, and MATLAB. VectorBlox will be available soon, with an early-access program beginning in late June 2020 and a broad launch planned for the end of September. VectorBlox will make a particularly compelling combo with the new PolarFire SoC FPGA family, which adds a hardened multi-core RISC-V processing subsystem to the company’s popular PolarFire line and will begin sampling in August 2020.
One thought on “Endpoint AI FPGA Acceleration for the Masses”
Quote: “Microchip has taken a different approach, creating a semi-generic (but customizable for your application) neural network processor in the FPGA fabric that allows new networks and coefficients to be rapidly loaded and re-loaded in real time without reconfiguring the FPGA. This gives a tremendous amount of in-system capability, with the ability to willy-nilly hot swap AI models at runtime – without having to resynthesize, or re-do place-and-route and timing closure, and therefore also without reconfiguring the FPGA.”
Definitely a step in the right direction! Especially since I am taking a similar approach to execute if/else, for, while, expression evaluation, and function calls. CEngine does not require C source to be compiled to a specific micro ISA (and does as much in one clock cycle as a processor does in a pipeline sequence).
The Sharp/Mono Syntax API is used to parse the C source code to an AbstractSyntaxTree and the SyntaxWalker exposes the execution sequences.
4 dual port memory blocks and a few hundred LUTs are used out of the thousands that are available.
Loading the memories with data and the control sequences can be done via DMA or the FPGA bitstream.
Yes, it is a programmable accelerator but it can also do control sequencing.