Images and Gestures

Ambitious ideas about the blossoming sensor opportunities boil down to two things:

Gather data. Lots of data.
Make sense out of that data, which means computing. Lots of computing.

Of course, how you do that depends on the application and the platform. We’ve looked at sensor fusion and the fact that some software is provided for free by the sensor makers as part of the cost of making a sensor. But higher-level (as well as low-level) software is provided by sensor-agnostic companies like Movea and Hillcrest Labs.

But then there’s the provider of the computing platform. Unlike sensor makers, makers of processors – general-purpose or DSP (and, increasingly, even graphics processors as they’re leveraged for non-graphics applications) – can’t always point to a specific application as motivation for their technology. Sensors tend to be moderately vertical in their application; processors are much more horizontal.

However, CEVA, whom we’ve looked at a number of times, does create different architectures (or variants of their architecture) for different markets like mobile communications and audio, and they continue to define platforms that are increasingly vertical in order to drive entry into specific markets. (Which is much easier to do when you create IP than it is if you have to build an actual hard-silicon version of each variant…)

Most recently, they announced the availability of a gesture-recognition package based on work done by eyeSight, a company CEVA has invested in. This package targets the MM3101 platform, which was announced early this year and is itself a specialization of a specialization.

The MM3101 is a vision-processing platform intended for use in cameras. It’s based on the MM3000 architecture announced a couple of years ago. The MM3000 family is a video- and image-processing-oriented architecture that consists of a series of stream processors for processing bitstreams (encoding, decoding, etc.) and vector processors for working on arrays of pixels.

The MM3101 skips the stream processors, using only the vector processors. What does this specialization buy you as compared to using just a standard processor? They claim a 1-mm² footprint (including memory) using 28-nm technology and “10-20x lower power” (meaning, presumably, 90 or 95% lower power) than an ARM Cortex-A9 doing the same thing.
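For anyone who wants to check that parenthetical conversion, here’s a minimal Python sketch. The function name percent_reduction is my own, not anything from CEVA, and it assumes that “Nx lower” means consumption falls to 1/N of the baseline.

```python
def percent_reduction(factor: float) -> float:
    """Percent of power saved when consumption drops to 1/factor of the baseline.

    Assumption: "Nx lower power" is read as "uses 1/N of the baseline power".
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (1.0 - 1.0 / factor) * 100.0


if __name__ == "__main__":
    for factor in (10, 20):
        saved = percent_reduction(factor)
        print(f"{factor}x lower power -> {saved:.0f}% reduction")
```

Note how quickly the percentage saturates: doubling the factor from 10x to 20x only moves the savings from 90% to 95%.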

But, as with anything vertical, it’s not enough simply to have the platform; customers want solutions (i.e., they want their vendors to do as much work for them as possible as a condition for buying their product). So, of course, CEVA has layers and libraries of software to accompany the platform. Up to a point.

They presented two specific camera-oriented algorithms to illustrate what the platform can do. Frankly, I found these to be interesting applications apart from the underlying technology. The first relates to so-called “Super Resolution” (SR) images. (As opposed to merely “high” resolution. Someday we’ll run out of superlative intensifiers.) The cool thing about SR images is that you don’t need an SR camera to take them. Instead, several low-quality images can be fused into a single SR image. Alternatively, a single low-resolution image can be analyzed at different scales to create an SR version. The magic lies in the algorithm that combines the images.

This can also be done with video, using a sliding window of images in the video stream. They claim to be able to process this in a manner that can yield a real-time 4x digital zoom on 720p video at 30 frames per second.
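The sliding-window idea can be sketched in a few lines of Python. To be clear about what’s assumed: CEVA didn’t describe their fusion algorithm, so fuse_placeholder below is a plain per-pixel average standing in for the real SR step, and the window size of 3 and the one-pixel “frames” are purely illustrative. The names are mine.

```python
from collections import deque
from typing import Iterable, Iterator, List

Frame = List[float]  # a flattened list of pixel values (illustrative only)


def sliding_windows(frames: Iterable[Frame], size: int) -> Iterator[List[Frame]]:
    """Yield the most recent `size` frames each time a new frame arrives."""
    window: deque = deque(maxlen=size)  # oldest frame drops out automatically
    for frame in frames:
        window.append(frame)
        if len(window) == size:
            yield list(window)


def fuse_placeholder(window: List[Frame]) -> Frame:
    """Per-pixel mean. A stand-in only; this is NOT a super-resolution algorithm."""
    return [sum(pixels) / len(pixels) for pixels in zip(*window)]


if __name__ == "__main__":
    video = [[0.0], [3.0], [6.0], [9.0]]  # four tiny one-pixel frames
    for window in sliding_windows(video, size=3):
        print(fuse_placeholder(window))
```

Once the window has filled, every incoming frame produces one output frame, so the frame rate of the stream is preserved.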

The second application has to do with high dynamic range scenes – those annoying ones on a bright day where your detail is in a shady area – so that either the bright spots stop down the camera to where you can’t see in the shadows or you’ve got to look at the image through welding goggles to avoid having your eyes burned by the bright spots.

There’s a technique for still images that consists of bracketing – taking an under-exposed, normal, and over-exposed image – and then combining them to darken the bright spots and brighten the dark parts. This can also be applied to video on a two-frame (instead of three-frame) basis to provide better-balanced images. This halves the frame rate of the video, since two images are combined into one. This, they say, is doable for 1080p60 video in real time.
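Here’s a rough Python sketch of the two-frame idea. It is not CEVA’s algorithm, which they didn’t detail; it’s a simplified exposure blend that weights each pixel by how close it sits to mid-gray. The function names, the 8-bit luminance range, and the weighting scheme are all my assumptions.

```python
from typing import List

Frame = List[int]  # flattened 8-bit luminance values, 0..255 (assumed format)


def well_exposedness(value: int) -> float:
    """Weight that peaks at mid-gray and falls toward pure black or white."""
    distance = abs(value - 127.5) / 127.5  # 0 at mid-gray, 1 at the extremes
    return 1.0 - distance + 1e-6           # epsilon keeps the total weight nonzero


def fuse_pair(under: Frame, over: Frame) -> Frame:
    """Blend an under-exposed and an over-exposed frame, pixel by pixel."""
    if len(under) != len(over):
        raise ValueError("frames must be the same size")
    fused = []
    for u, o in zip(under, over):
        wu, wo = well_exposedness(u), well_exposedness(o)
        fused.append(round((wu * u + wo * o) / (wu + wo)))
    return fused


def fuse_stream(frames: List[Frame]) -> List[Frame]:
    """Fuse consecutive pairs: N input frames yield N // 2 output frames."""
    return [fuse_pair(frames[i], frames[i + 1])
            for i in range(0, len(frames) - 1, 2)]


if __name__ == "__main__":
    stream = [[0, 100], [128, 100], [10, 250], [120, 255]]
    print(fuse_stream(stream))  # two output frames from four inputs
```

The pairing in fuse_stream is where the frame-rate halving comes from: each output frame consumes two inputs.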

But here is where they start to hand off. It’s one thing to manipulate images into better images. It’s something quite different to interpret the meaning of scenes. And when we talk about gesture recognition, it’s all about understanding what’s going on in the video (regardless of its quality).

We’ve discussed gestures and motion before, primarily with the two motion guys I mentioned above. And they both have gesture engines, but they rely on actual movement: you take some physical token, like a remote control or a pointer, and you move it to create a gesture, and the motion sensors inside determine, together, what the gesture is.

Using a camera to decode motion is quite a different ball game. The “smarts” of the engine must decide which part of the scene it should be paying attention to, and it must then both track and interpret the motion. And, unlike a remote control, the “thing” being tracked may not move as a unit. In fact, it might be a change in the shape of the tracked object that constitutes a gesture: a thumbs-up hand may be in the same location as a thumbs-down hand, but the hand itself is a different shape.

3D cameras are typically used when analyzing motion so that you get… well… three dimensions. eyeSight, however, does gesture recognition using only 2D cameras; they get no explicit depth information. Their company history has involved working with low-quality images – poor resolution, “unexpected” lighting situations, and such. They’ve worked particularly hard at scene analysis: deciding what part of the image is of interest. Even with no depth data, you can still make some depth inferences: if something passes in front of something else, then that provides useful information about their relative positioning.

They’ve applied this to a library of gestures (they don’t enumerate them explicitly, but they include left, right, wave, select, up, down, open/close fist, thumbs-up, thumbs-down, peace/victory sign, etc.) that they can process on the MM3101 platform. Based on their history of dealing with less-than-ideal situations, they say that they can operate, for example, in a moving car, with rapidly changing light and with close or far distances, and still successfully interpret what’s going on.

But there’s a catch for this sort of application. With the image-enhancement algorithms we looked at above, you do them and then they’re done. Gestures, however, are an input mechanism. That means the engine is always on, since you might gesture at the system at any time. This means that:

The processing overhead needs to be low enough that it’s not chewing up a dramatic portion of the CPU power (just like you wouldn’t want your desktop CPU churning away just to process the keyboard); and
The power has to be low so that the mere fact of being “on” doesn’t suddenly drain the battery on a mobile device. Machine vision has a reputation for being power hungry.

eyeSight claims to do the gesture processing in less than 100 MHz, leaving a ton of headroom for everything else. And power? Well, that’s one reason why CEVA keeps reinforcing the power message. They say, in particular, that they can do gesture recognition at a cost of 20 mW.
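To put 20 mW of always-on load in perspective, here’s a back-of-the-envelope Python sketch. The battery (1500 mAh at 3.7 V) is a hypothetical phone-class cell that I picked for illustration; it doesn’t come from CEVA or eyeSight, and the function names are mine.

```python
def battery_energy_mwh(capacity_mah: float, voltage_v: float) -> float:
    """Nominal battery energy in milliwatt-hours (mAh x V)."""
    return capacity_mah * voltage_v


def daily_drain_percent(load_mw: float, capacity_mah: float, voltage_v: float) -> float:
    """Share of a full charge consumed by a constant load over 24 hours."""
    energy_per_day_mwh = load_mw * 24.0
    return energy_per_day_mwh / battery_energy_mwh(capacity_mah, voltage_v) * 100.0


if __name__ == "__main__":
    GESTURE_LOAD_MW = 20.0   # figure quoted in the article
    CAPACITY_MAH = 1500.0    # hypothetical battery, assumed for illustration
    VOLTAGE_V = 3.7          # hypothetical nominal cell voltage
    drain = daily_drain_percent(GESTURE_LOAD_MW, CAPACITY_MAH, VOLTAGE_V)
    print(f"Always-on gesture engine: {drain:.1f}% of a charge per day")
```

Under those assumptions, the engine would use under a tenth of a charge per day. That counts only the processing load quoted above; the camera itself and the rest of the system aren’t included.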

So, rolling back to the opening discussion, this parallels more traditional sensor scenarios in that, up to a point, some software is available as “part of the deal”; higher-level capabilities must be purchased separately. Only here, the “sensor” is a camera – probably one of the oldest sensors around. And, instead of the sensor guy giving out the software, it’s the DSP platform guy.

Taken together, they still meet those two key requirements: lots of data from the camera and lots of computing to make sense out of it all.

More info:

CEVA MM3101

eyeSight