Embedding Vision

Remember sitting down at a DEC VT52 terminal? The screen held 24 lines of text, at 80 characters each. The font was built in. The VT52 proudly boasted support for all 95 ASCII characters, including the desirable but somewhat superfluous lowercase letters. Some special graphics characters were available as well, but the terminal did not support graphics per se. There was no mouse, no windows, and text editing was only marginally WYSIWYG – mostly using the “vi” text editor.

Today, it’s hard to think of interacting with a computer, or even a smartphone, without a GUI and some sort of pointing device. Even for those of us old enough to remember working on the VT52, trying to communicate with a machine exclusively via a keyboard would seem awkward and arcane at best.

Since most of our mobile and tablet devices have used touchscreen interfaces for a while, it is interesting to watch a young person poke at the screen of a desktop or laptop computer, expecting a touch response, and then become confused when the machine ignores the input. Once you’ve grown accustomed to a level of sophistication in human-machine interface, it’s hard to go back.

As engineers, most of us know what the next steps are, and we know they’re really difficult problems. Our machines need to be able to see and hear us and to understand what they’re seeing and hearing. Voice interaction has been with us for a while now, but it really hasn’t caught on in the mainstream. The accuracy of voice recognition/understanding is still low, and the idea of a work or office environment with a sea of cubes – where everyone is talking aloud to their computers simultaneously – sounds a bit chaotic at best. The problem here, of course, is that deriving meaning from spoken language is a much harder problem than simply recognizing words and phrases in an audio stream. The secondary problem is that people seem to want private interactions with their devices – even in public places. Spoken communication doesn’t facilitate that very well.

Likewise, there has been a gigantic amount of research into machine vision. Video has been commoditized to the point that only relatively inexpensive hardware is required to add video cameras, storage, and playback capability to an embedded device. However, making a machine understand what is going on in that video stream is a significant challenge – one that researchers have been grappling with for decades. The limiting factors for machine vision have always been computing power (massive amounts of it are required to run most machine vision algorithms in real time on a video stream) and the algorithms themselves. While it’s true that there is a vast repository of research available on machine vision algorithms, those algorithms tend to be tailored for very specific problems. A number of sophisticated algorithms exist for facial recognition, for example, but those algorithms are different from those required for locating people in a scene, those required for understanding human gestures and movement, and so forth. Since the algorithms are so specific to the type of information being extracted from the scene, creating fixed-hardware accelerators to solve the computing problem becomes impractical. Programmable hardware like FPGAs and/or huge amounts of parallelism in conventional processors (such as with graphics processors) is required.

This year, however, machine vision went mass market with the introduction of the Kinect interface for Microsoft’s Xbox 360. In case you’ve been under a rock for the past year (which many of us working on complex engineering problems tend to be from time to time), Microsoft built a low-cost device that enables a video game console to get its input from watching the players move, rather than from a dedicated controller. The system can locate people in the scene, interpret their gestures, and even recognize which individuals it is “seeing.” With the retail cost of the system being less than $150 USD, one can imagine that the bill of materials cost must be very low. Granted, Kinect cheats a bit by borrowing some of the Xbox 360’s massively parallel processing power to accomplish its magic, but even with that, Kinect sets a new bar for cost-effective machine vision.

Kinect has kicked off a virtual revolution in hacking, which apparently has been warmly welcomed by Microsoft. There are websites and forums dedicated to sharing information on using and adapting the Kinect hardware for a huge variety of applications. With the broad-based adoption of Kinect, the door to machine vision has been blown open, and the next few years should see remarkable progress in adding a sense of sight to our intelligent devices.

Unfortunately, adding machine vision to your next embedded design isn’t as simple as dropping in WiFi or USB. You can’t just add a camera and a piece of machine vision IP to your embedded device and end up with a functional machine vision interface. Vision, as we mentioned, is an incredibly complex problem that has already experienced decades of research, and the average – or even the far-above-average – electronic designer isn’t going to just pick it up with some spare weekend reading. To get our intelligent devices to see and understand the world around them, we’re going to need some serious help.

Fortunately, a new group has been formed with the intent of doing just that. The Embedded Vision Alliance was founded with the goal of “Inspiring and empowering engineers to design systems that see and understand.” Jeff Bier, President of Berkeley Design Technology (BDTi) and founder of the Embedded Vision Alliance, sees huge market potential for embedded vision applications in the near future in areas like consumer electronics, automotive, gaming, retail, medical, industrial, defense, and many others. Embedded vision systems will be doing things like gesture-based control of devices, active driver safety and situational awareness, active digital signage, and point-of-sale transaction assistance – just to name a few.

“The engineer who wants to add vision to his or her embedded design will have both great news and bad news,” explains Bier. “First, they will discover that there are hundreds of papers, books, and other resources with volumes of research on the topic. Then, they will discover that the vast majority of that work is not particularly useful for real-world engineering applications. Much of the material is heavily theoretical – books with 800 pages filled with multi-variable calculus – and very little of it is in a form that engineers could use, like block diagrams and code.” One of the goals of the Embedded Vision Alliance is to sift through that mountain of information and extract that which will be practically useful for adding vision to embedded designs.

The Embedded Vision Alliance already has over a dozen companies participating – from semiconductor suppliers to distributors to software companies – all of whom see a big future for embedded vision and who have products or technology that they feel will play a significant role in deployment of that capability. The Alliance is already building a website (www.embedded-vision.com) with resources and community to assist engineers in development of embedded vision capabilities. With efforts like this, the path to embedded vision will be far less treacherous.

Embedded vision is one of the most significant and exciting engineering challenges to come along in decades, and it will happen. There will be a time when interacting with a machine that can’t see you will seem as strange as trying to compute with a VT52 would today. Once our intelligent devices gain a proper set of senses, a vast range of new applications and capabilities will emerge. If we want to be part of that revolution, we’d better start catching up now. If embedded vision were easy, everybody would already have it.