Teaching Machines to See

The IoT world is all about sensing, and no sense is more important or empowering than vision. We humans rely on our sight to understand the world around us more than any other source of information, and it’s likely that the same will be true for our intelligent machines. From automotive applications like ADAS to drones to factory automation, giving our systems the ability to “see” brings capabilities that are difficult or even impossible to achieve in any other way.

But vision is one of the most challenging computational problems of our era. High-resolution cameras generate massive amounts of data, and processing that information in real time requires enormous computing power. Even the fastest conventional processors are not up to the task, and some kind of hardware acceleration is mandatory at the edge. Hardware acceleration options are limited, however. GPUs require too much power for most edge applications, and custom ASICs or dedicated ASSPs are horrifically expensive to create and don’t have the flexibility to keep up with changing requirements and algorithms.
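To put "massive amounts of data" into rough numbers, here is a minimal back-of-the-envelope sketch. The helper name `raw_video_rate`, the two video formats, and the 3-bytes-per-pixel assumption are illustrative choices, not figures from the article.

```python
def raw_video_rate(width, height, fps, bytes_per_pixel=3):
    """Return the uncompressed data rate of a video stream in bytes per second."""
    return width * height * fps * bytes_per_pixel

# Assumed example formats (hypothetical, chosen only for illustration):
# 24-bit color at 60 frames per second.
hd = raw_video_rate(1920, 1080, 60)
uhd = raw_video_rate(3840, 2160, 60)

print(f"1080p60: {hd / 1e6:.0f} MB/s uncompressed")
print(f"2160p60: {uhd / 1e6:.0f} MB/s uncompressed")
```

A single uncompressed stream already runs to hundreds of megabytes per second before any processing is done on it, and multi-camera systems multiply that figure.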

That makes hardware acceleration via FPGA fabric just about the only viable option. And it makes SoC devices with embedded FPGA fabric – such as Xilinx Zynq and Altera SoC FPGAs – absolutely the solutions of choice. These devices bring the benefits of single-chip integration, ultra-low latency and high bandwidth between the conventional processors and the FPGA fabric, and low power consumption to the embedded vision space.

Unfortunately, they also typically bring the requirement of an engineering team with FPGA design expertise. Developing the accelerators for vision algorithms is a non-trivial task, and the accelerator part is typically created using a hardware description language such as Verilog or VHDL, driving a design flow with RTL simulation, synthesis, place and route, and timing closure. In addition to requiring a qualified engineering team with specialized expertise, this can add months to the development cycle.

The problem is just getting worse. Now, AI technologies such as neural networks are being increasingly used for the complex and fuzzy pattern recognition part of vision systems. Neural networks have two distinct modes of operation. “Training” – which is done once on a large sample data set, typically in a data center environment – requires heaping helpings of floating-point computation. Your vision algorithm may be shown millions of pictures of cats, so that it can later automatically recognize cats in video streams. Training sets and tunes the coefficients that will be used in the later “Inference” phase. “Inference” is the in-the-field portion of the neural network. During inference, you want your autonomous mouse to be able to recognize cats as quickly and accurately as possible, engaging its “fight or flight” mode.
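The split between the two modes can be sketched with a deliberately tiny model. This is a toy logistic classifier, not a real vision network; the two made-up features, the data points, and the function names `train` and `predict` are all assumptions for illustration.

```python
import math

def predict(weights, bias, features):
    """Inference: apply frozen coefficients to one input, return a score in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=2000, lr=0.5):
    """Training: repeatedly adjust the coefficients with floating-point gradient steps."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = predict(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Made-up training set: two hypothetical features per sample, label 1 means "cat".
samples = [(0.9, 0.8), (0.8, 0.9), (0.1, 0.2), (0.2, 0.1)]
labels = [1, 1, 0, 0]

# Done once, offline.
weights, bias = train(samples, labels)

# Done in the field, on inputs the model has not seen before.
is_cat = predict(weights, bias, (0.85, 0.75)) > 0.5
not_cat = predict(weights, bias, (0.15, 0.25)) > 0.5
print(is_cat, not_cat)
```

All of the iterative, compute-heavy work sits in `train`. Once the coefficients are fixed, `predict` is a single multiply-accumulate pass, which is the part that has to run quickly at the edge.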

Inference is done at the edge of the IoT, or as close to it as possible. You don’t have time for massive amounts of image data to be uploaded to the cloud and processed, delivering a “Hey, that thing you’re looking at is a CAT!” conclusion about 100 ms after the limbs are torn from your robotic device. Inference, therefore, is typically done with fixed-point (8-bit or lower) precision in the IoT edge device itself – minimizing latency and power consumption while maximizing performance.
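As a rough illustration of why 8-bit fixed point can be enough for inference, the sketch below quantizes a few coefficients and inputs to signed 8-bit integers, performs the multiply-accumulate in integers only, and rescales the result. The symmetric single-scale scheme, the helper names, and the sample values are assumptions for illustration; real toolflows use more elaborate schemes.

```python
def quantize(values, bits=8):
    """Map floats to signed integers using one shared scale (symmetric quantization).

    Assumes at least one value is non-zero.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) for v in values], scale

def int_dot(qa, qb):
    """Integer-only multiply-accumulate, the kind of operation FPGA fabric handles well."""
    return sum(a * b for a, b in zip(qa, qb))

weights = [0.42, -1.30, 0.07, 0.88]  # made-up trained coefficients
pixels = [0.50, 0.25, 1.00, 0.75]    # made-up normalized input values

qw, w_scale = quantize(weights)
qx, x_scale = quantize(pixels)

exact = sum(w * x for w, x in zip(weights, pixels))  # floating-point reference
approx = int_dot(qw, qx) * w_scale * x_scale         # integer math, rescaled once

print(f"float result: {exact:.4f}")
print(f"8-bit result: {approx:.4f}")
```

The two results differ only slightly, while the inner loop needs nothing wider than small integer multipliers and an accumulator.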

This “training” vs. “inference” model is very convenient for the companies who make FPGAs and hybrid SoC/FPGA devices. FPGAs are really good at high-speed, low-precision computation. Interestingly, it’s doubly convenient for Xilinx, whose FPGAs and SoCs differ from those of archrival Intel (Altera): Intel’s devices support hardware floating point (good for training), supposedly at the expense of some performance in the fixed-point domain (good for inference). Xilinx is apparently more than willing to let Intel duke it out with GPUs for the training sockets, while Xilinx focuses its advantages on the much-more-lucrative inference sockets.

So, Xilinx is sitting pretty with their Zynq SoCs and MPSoCs, perfectly aligned with the needs of the embedded vision developer and well differentiated from Intel/Altera’s devices. What else could they possibly need?

Oh, yeah. There’s still that “almost impossible to program” issue.

Rewinding a few paragraphs – most of the very large systems companies have well-qualified teams of hardware engineers who can handle the FPGA portion of an embedded vision system. Xilinx has dozens of engagements in every important application segment involving embedded vision – from ADAS to drones to industrial automation. But many companies don’t have the required hardware expertise, and they wouldn’t want to dedicate the design time to it even if they did. Plus, crossing the conceptual barrier from vision experts to neural network experts to FPGA design experts and back again is a very expensive, time-consuming, and lossy process. What we really need is a way for software developers to harness the power of Zynq devices without bringing in a huge team of hardware experts.

That’s the whole point of Xilinx’s reVISION.

reVISION, announced this week, is a stack – a set of tools, interfaces, and IP – designed to let embedded vision application developers start in their own familiar sandbox (OpenVX for vision acceleration and Caffe for machine learning), smoothly navigate down through algorithm development (OpenCV and neural networks such as AlexNet, GoogLeNet, SqueezeNet, SSD, and FCN), and target Zynq devices without the need to bring in a team of FPGA experts. reVISION takes advantage of Xilinx’s previously announced SDSoC stack to facilitate the algorithm development part. Xilinx claims enormous gains in productivity for embedded vision development – with customers predicting cuts of as much as 12 months from current schedules for new product and update development.

In many systems employing embedded vision, it’s not just the vision that counts. Increasingly, information from the vision system must be processed in concert with information from other types of sensors such as LiDAR, SONAR, RADAR, and others. FPGA-based SoCs are uniquely agile at handling this sensor fusion problem, with the flexibility to adapt to the particular configuration of sensor systems required by each application. This diversity in application requirements is a significant barrier for typical “cost optimization” strategies such as the creation of specialized ASIC and ASSP solutions.

The performance rewards for system developers who successfully harness the power of these devices are substantial. Xilinx is touting benchmarks showing their devices delivering an advantage of 6x images/sec/watt in machine learning inference with GoogLeNet @batch = 1, 42x frames/sec/watt in computer vision with OpenCV, and 1/5 the latency on real-time applications with GoogLeNet @batch = 1 versus “NVidia Tegra and typical SoCs.” These kinds of advantages in latency, performance, and particularly in energy-efficiency can easily be make-or-break for many embedded vision applications.
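For readers unfamiliar with per-watt metrics, the sketch below shows how a figure such as images/sec/watt, and an "Nx" advantage derived from it, is computed. The device numbers are invented placeholders, not Xilinx's or anyone else's measurements, and the function names are hypothetical.

```python
def per_watt(throughput_per_sec, power_watts):
    """Throughput normalized by power draw, e.g. images/sec/watt."""
    return throughput_per_sec / power_watts

def advantage(candidate, baseline):
    """How many times better the candidate's per-watt figure is than the baseline's."""
    return candidate / baseline

# Placeholder figures for two hypothetical devices (not real benchmark data).
device_a = per_watt(throughput_per_sec=90.0, power_watts=6.0)
device_b = per_watt(throughput_per_sec=100.0, power_watts=20.0)

ratio = advantage(device_a, device_b)
print(f"device A: {device_a:.1f} images/sec/watt")
print(f"device B: {device_b:.1f} images/sec/watt")
print(f"advantage: {ratio:.1f}x")
```

In this made-up case, the device with the lower raw throughput still comes out ahead once power is taken into account, which is why per-watt figures matter so much at the edge.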

Xilinx has also announced a range of embedded vision development kits that support various cameras and input configurations and work with the reVISION development flow, so you’ll be able to get your design working on actual hardware as quickly as possible.

At press time, Intel had just announced the acquisition of Mobileye – a company specializing in embedded vision and collision avoidance in the automotive market – for almost as much as they paid for Altera. It seems that the stakes in this emerging applications space are going up yet again. It will be interesting to watch the battle unfold.

2 thoughts on “Teaching Machines to See”

beercandyman says:

March 14, 2017 at 1:44 pm

Good article, although I have to point out that the reason FPGAs are not used for training has nothing to do with floating point. It has to do with the ability to quickly change the architecture and constants to achieve high accuracy during training. GPUs and CPUs can do this in milliseconds, while any FPGA takes hours of place and route. For inference, FPGAs are much better than GPUs or CPUs because these algorithms can be pipelined and run in real time on an FPGA, something FPGAs do better than any other programmable solution. In many cases, as was shown at FPGA2017, fewer bits are needed. Floating point in an FPGA gives you the ability to replicate the results of GPUs and CPUs, but it is not needed in the field. After all, the incoming data is not floating point, so using floating point just takes area and is not really useful.

hightechlov3r says:

May 23, 2017 at 12:33 am

Hi there,

Great article. It’s honestly quite mind-blowing how high-tech and innovative vision technology has become, and how machines/robotics can pretty much see on their own now, isn’t it? There’s another particularly innovative product I recently learned about, from a company called OrCam, so I thought I’d share. Essentially, they have these smart glasses with a camera for the blind/vision impaired/etc. that have an optical character recognition system that helps people without the best vision to see their surroundings. Pretty astounding. Anyhow, thanks for this content, I just thought I’d pass this along.

Best,
Gary
