Seeing Into the Distance

It’s been made into a big deal, and you can thank Avatar. Once a goofy movie gimmick that required glasses you wouldn’t get caught wearing anywhere else, 3D suddenly became cool. And, for a while, the best way to turn anything ordinary into something cool was clear: make it 3D.

Well, we’ve gotten a bit older and wiser (OK, older, anyway) and we’ve had time to catch our breaths and internalize the results of endless movies in 3D, TVs in 3D, printers in 3D. (OK, that’s more than a gimmick…) But it’s easier now to take a good, long, nuanced look at 3D and its potential for things other than box office smashing.

Embedded vision in particular is moving further into the 3D world: it’s even going 4D. But I get ahead of myself. Let’s start with 2D and move forwards to understand how it is that designers are trying to leverage simple equipment to create complex visual models.

One of the most immediate ways in which vision is being put to use is for gesture recognition. We humans process gestures largely as visual cues, and these gestures occur in three-dimensional space. When we beckon someone to approach, for example, we (in the US – this isn’t universal) place our hands up, fingers towards the sky, with the back of the hand facing the person we’re communicating with, and we move all the fingers towards us quickly and repeatedly. A 2D projection would see very little motion since the gesture largely operates along the line between you and the other party. And, in camera terms, this is the third dimension.

So capturing that particular gesture would be very difficult to do with a regular 2D camera: you would do much better if you could capture all three dimensions. We’ve looked at eyeSight, a company that’s making progress installing gesture recognition into laptops using the standard 2D laptop camera; PointGrab is another company with similar goals. But, realistically, gestures for computer use are often exaggerated versions of what we would do with another human. (Much the way we speak louder to our computers and cell phones, treating them as if they were foreigners.) A third dimension provides extra information to supplement the 2D information (at a cost, of course), allowing more nuanced motion.

There are other applications for which the third dimension can be useful. Take the cameras that are making their way into cars for things like collision avoidance. Even without specific ranging technology like radar, applications use visual cues to estimate distance to other cars for collision warning and avoidance. Again, this can be attempted in two dimensions, as is being done by iOnRoad, using a cell phone camera. But the systems being planned for outright integration into the safety systems of a car may benefit from an actual third dimension.

Apps aside, if you want real information on the missing camera dimension, depth, there are different ways of approaching it. The result is normally something called a “depth map,” which assigns a depth value to each point in the image (or to major artifacts or some subset of points). You can think of this as adding another value to the tuple that defines a pixel (say, x and y locations and color values).

The obvious way we’re used to thinking about 3D is using a binocular system – two cameras. This has the downside, of course, of requiring two cameras. There is also the “correspondence problem”: aligning the two images so that you can then triangulate a depth.

The process ordinarily consists first of removing distortions from each image, since the lens will curve straight lines towards the edge or introduce other aberrations. These need to be corrected to make the two images look “flat.” The images can then be compared pixel-for-pixel to create a so-called “disparity map” which can be used to calculate the depth map.
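To make that last step a little more concrete: for a rectified pair of images, depth is inversely proportional to disparity, roughly Z = f·B/d, where f is the focal length in pixels, B is the distance between the two cameras, and d is the disparity in pixels. Here is a minimal Python sketch of that conversion. The function name and all of the numbers are assumptions chosen for illustration, not values from any particular camera.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert a disparity map (pixels) to a depth map (meters).

    Assumes a rectified stereo pair, so depth = focal * baseline / disparity.
    Pixels with zero or negative disparity are reported as infinitely far away.
    """
    disparity = np.asarray(disparity_px, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical setup: 700-pixel focal length, cameras 10 cm apart.
disparity = np.array([[70.0, 35.0],
                      [7.0, 0.0]])
depth = disparity_to_depth(disparity, focal_px=700.0, baseline_m=0.10)
```

Note how the relationship works against you at a distance: halving the disparity doubles the depth, so far-away objects are separated by only a pixel or so of disparity.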

The challenge comes when trying to decide which points in one image correspond to which points in the other. For small artifacts or edges, especially corners, it’s not hard. But for pixels in the middle of a color field, for example, it’s difficult to establish that correspondence.
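A small sketch may help show why. One simple way to look for a match (an assumption on my part; real systems use more elaborate methods) is to slide a small window along the same row of the other image and add up the absolute pixel differences at each candidate offset. The function and the toy scanlines below are made up for illustration.

```python
import numpy as np

def match_costs(left_row, right_row, x, half_window, max_disparity):
    """Sum-of-absolute-differences cost for each candidate disparity.

    Compares the window centered at x in the left scanline with windows
    shifted left by d pixels in the right scanline (rectified images assumed).
    """
    left_row = np.asarray(left_row, dtype=float)
    right_row = np.asarray(right_row, dtype=float)
    patch = left_row[x - half_window : x + half_window + 1]
    costs = []
    for d in range(max_disparity + 1):
        start = x - d - half_window
        candidate = right_row[start : start + 2 * half_window + 1]
        costs.append(float(np.abs(patch - candidate).sum()))
    return costs

# A scanline with a sharp edge: the right view sees it 2 pixels further left.
left_edge = [0, 0, 0, 0, 0, 0, 9, 9, 9, 9, 9, 9]
right_edge = [0, 0, 0, 0, 9, 9, 9, 9, 9, 9, 9, 9]
edge_costs = match_costs(left_edge, right_edge, x=6, half_window=1, max_disparity=3)
best_disparity = int(np.argmin(edge_costs))

# A flat color field: every candidate offset matches equally well.
flat = [5] * 12
flat_costs = match_costs(flat, flat, x=6, half_window=1, max_disparity=3)
```

At the edge, exactly one offset gives a zero cost, so the match is unambiguous. In the flat field, all of the costs tie, and there is nothing to tell one candidate from another.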

The use of “structured light” can help. This can consist of infrared features, for example, projected on the various surfaces (invisibly, as far as we and the image camera(s) are concerned). They can either facilitate alignment, or you can infer distance directly from carefully-designed and -projected features. For example, it’s possible to project a shape that, impinging up close, looks like a horizontal oval, further away creates a circle, and further still looks like a vertical oval. By measuring the orientation and aspect ratio of the oval, you can infer a distance. This doesn’t require a second image camera, but does, of course, require the IR imaging.
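As a rough sketch of the oval example, suppose you had calibrated the projected shape by measuring its aspect ratio at a few known distances. You could then interpolate between those points to estimate the distance of a newly measured feature. Everything below is hypothetical: the calibration numbers are invented, and straight-line interpolation between them is an assumption rather than how any particular product works.

```python
import numpy as np

# Hypothetical calibration: aspect ratio (width / height) of the projected
# shape as measured at known distances. Real values would come from
# calibrating a specific projector and camera.
CAL_ASPECT = np.array([0.5, 1.0, 2.0])      # vertical oval, circle, horizontal oval
CAL_DISTANCE_M = np.array([3.0, 2.0, 1.0])  # far, middle, near

def distance_from_aspect(width_px, height_px):
    """Interpolate a distance from the measured aspect ratio of the feature."""
    aspect = width_px / height_px
    return float(np.interp(aspect, CAL_ASPECT, CAL_DISTANCE_M))

circle = distance_from_aspect(40, 40)     # aspect ratio 1.0
wide_oval = distance_from_aspect(60, 40)  # aspect ratio 1.5, closer than the circle
```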

Other single-camera approaches include interferometry and the use of “coded aperture” masks, where the blurring created by multiple images can yield depth information.

The approach that seems to be getting more attention these days is the so-called “time of flight” solution. The idea is that, by measuring how long it takes the light to reflect back from the surface, you can tell how far away it is; it almost sounds trivial. But exactly how this is done is less than trivial.

With one approach, a light beam is modulated with an RF carrier, and then the phase shift of the reflected signal is used to gauge distance. This works only modulo the wavelength of that modulation, of course (half the wavelength in distance terms, since the light has to make a round trip): all objects at distances corresponding to the same phase of the wave will appear to be the same distance away. Another approach is to use a shutter to block off reflected pulses of light. Depending on when the light returns during the window of time when the shutter is open, more or less of it will make it through, and the brightness is then used to infer the distance the light had to travel to get back.
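Both ideas reduce to a line or two of arithmetic, sketched below in Python. The phase version follows from the round trip: a phase shift of φ at modulation frequency f corresponds to a distance of c·φ/(4πf), and the reading repeats every c/(2f). The shutter version assumes an idealized model of my own choosing, in which the shutter stays open for exactly one pulse width starting at emission and the brightness has already been normalized for surface reflectivity. The 20-MHz and 20-ns figures are illustrative, not taken from any real device.

```python
import math

C = 299_792_458.0  # speed of light, m/s

def phase_to_distance(phase_rad, mod_freq_hz):
    """Distance implied by the phase shift of the reflected modulation.

    The light travels out and back, so
    phase = 2*pi * mod_freq * (2 * distance / C).
    """
    return C * phase_rad / (4 * math.pi * mod_freq_hz)

def distance_to_phase(distance_m, mod_freq_hz):
    """Phase the sensor would measure, wrapped into [0, 2*pi)."""
    return (4 * math.pi * mod_freq_hz * distance_m / C) % (2 * math.pi)

def unambiguous_range(mod_freq_hz):
    """Distance at which the phase wraps around and readings start repeating."""
    return C / (2 * mod_freq_hz)

def gated_distance(fraction_received, pulse_width_s):
    """Distance from a shuttered pulse under the idealized model above.

    A return delayed by t loses t / pulse_width of its light to the closing
    shutter, so t = (1 - fraction_received) * pulse_width.
    """
    delay_s = (1.0 - fraction_received) * pulse_width_s
    return C * delay_s / 2

freq = 20e6                                   # hypothetical 20-MHz modulation
wrap = unambiguous_range(freq)                # about 7.5 m
quarter = phase_to_distance(math.pi / 2, freq)  # a quarter of the wrap distance
near_phase = distance_to_phase(1.0, freq)
far_phase = distance_to_phase(1.0 + wrap, freq)  # indistinguishable from near_phase
half_light = gated_distance(0.5, 20e-9)       # about 1.5 m
```

The last few lines show the ambiguity directly: an object at 1 m and another one wrap distance further away produce the same phase.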

Noise is, of course, the enemy in all of this, and the background light can swamp the bits you’re interested in. If you’re looking to isolate a moving object, for instance, on an otherwise constant background, then the background image can be captured and subtracted from the working image to isolate the item being tracked (or any items not part of the background).
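The background-subtraction idea can be sketched in a few lines of Python. This is a bare-bones version under simple assumptions (a perfectly static background and a fixed threshold chosen by hand); the function name and pixel values are invented for the example.

```python
import numpy as np

def foreground_mask(frame, background, threshold):
    """Mark pixels that differ from the stored background by more than threshold."""
    frame = np.asarray(frame, dtype=float)
    background = np.asarray(background, dtype=float)
    return np.abs(frame - background) > threshold

background = np.full((4, 4), 100.0)  # captured once, with nothing moving
frame = background.copy()
frame[1:3, 1:3] = 180.0              # a hypothetical object enters the scene
frame[0, 0] = 103.0                  # small sensor noise, below the threshold
mask = foreground_mask(frame, background, threshold=20.0)
```

The threshold is what keeps ordinary sensor noise from being mistaken for the object; set it too low and the noise comes back, too high and dim objects vanish.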

Other complications include multi-path reflections, which can exaggerate distances, and inter-camera interference: multiple cameras may confuse each other.

But time-of-flight is of interest because it can be made small; there are no moving parts, and the processing is modest. You also get depth information for the complete image field at once, meaning that you can track objects and their depth in real time.

Having tackled three dimensions, why stop there? How about a fourth dimension? If this has you scratching your head, it’s okay; we will be veering into the slightly surreal. We usually think of time as the fourth dimension (which would suggest 3D video). But that’s not what this is: the fourth dimension is more or less the focal length. And a “plenoptic” camera can capture all focal lengths at once, meaning that you can post-process an image to change the focus. Yeah, bizarre. You can see web demonstrations of this that are almost eerie: you start with an image having a shallow depth of field, with clear in-focus and out-of-focus pieces of the image; you click on an out-of-focus element, and suddenly it comes into focus.

This works by constructing what amounts to a man-made fly eye. Micro-lenses are arranged in an array over the imaging silicon, and each of those lenses captures the entire image using the light coming from all directions. You then end up with multiple renditions of the image from which you can reconstruct a picture with any focal distance. It’s like binocular vision on steroids, with a depth map easily obtainable.
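One common way to think about the reconstruction (an assumption here, not necessarily what any commercial camera does internally) is "shift and add": treat the data as a set of views from slightly different viewpoints, slide each view by an amount proportional to its viewpoint offset, and average. Objects whose parallax matches the chosen shift line up and come out sharp; everything else smears. The toy Python sketch below uses a one-dimensional row of views and a single bright point, all invented for illustration.

```python
import numpy as np

def refocus(views, shift_per_view):
    """Shift-and-add refocusing over a 1-D row of viewpoints.

    views[u] is the scene as seen from viewpoint u. Each view is shifted back
    by shift_per_view * (u - center) pixels and the results are averaged, so
    points whose parallax matches that shift line up and come out sharp.
    """
    views = np.asarray(views, dtype=float)
    center = views.shape[0] // 2
    shifted = [np.roll(view, -shift_per_view * (u - center))
               for u, view in enumerate(views)]
    return np.mean(shifted, axis=0)

# Synthetic data: 5 viewpoints, one bright point that moves 2 pixels per
# viewpoint (its parallax is what encodes its depth).
n_views, width, parallax = 5, 21, 2
views = np.zeros((n_views, width))
for u in range(n_views):
    views[u, 10 + parallax * (u - n_views // 2)] = 1.0

in_focus = refocus(views, shift_per_view=2)      # aligned: one sharp point
out_of_focus = refocus(views, shift_per_view=0)  # misaligned: smeared across 5 pixels
```

The shift that gives the sharpest result for a given object is also a measure of its parallax, which is why a depth map falls out of this kind of data so readily.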

Other than its being incredibly cool to be able to refocus an image after the fact, this approach has the benefits of allowing fast image capture – you don’t have a focusing step, so you can take the picture more quickly – and allowing a larger aperture – since focus is no longer an issue – making it better for low-light situations.

The challenge is that the image resolution isn’t determined by the pixel density of the image sensor; it’s set by the micro-lens array. Micro-lenses are small, as lenses go, but they’re much bigger than pixels, so the resolution on such pictures is coarser than what would otherwise be possible. Two companies are commercializing this technology. The first out was Raytrix, with 1-Mpixel and 3-Mpixel cameras for commercial and industrial use. Lytro has a consumer-oriented camera at around 1.2 Mpixels, and, because of their consumer orientation, they’ve been more visible.

For all of these options, it still feels to me like 3D camera technology is shifting around, looking for solid traction and a way forward. Expect to see it as a continuing topic of interest with the Embedded Vision Alliance. It’s one thing to make a movie look incredibly cool through dorky glasses; it’s something else entirely to add depth information in an unobtrusive manner. When 3D stops being so self-conscious, then we’ll know we’ve mastered it, and it will be no big deal.

More info:

Embedded Vision Alliance

Raytrix