You can almost imagine an I Love Lucy caper. Lucy and Ricky are trying to catch someone in the act of something nefarious. They dress up in fake private-eye clothes with a PI hat, turned-up collar (pre-bro), and a fake moustache for her. They’re on opposite sides of the room in stealth mode, with only hand gestures to communicate. They’ve worked out an intricate set of signals, including “right hand to the nose means we go in 3…2…1…” and “left hand to the nose means something’s not right; hold off.”
And as they stand there poised for action, a fly lands on Lucy’s nose. And she swats it with her right hand, and Ricky gets ready to launch. But then she swipes it with her left hand and he panics and backs up, unsure of what to do. She, of course, is completely oblivious to her mixed messages.
I have no idea how this scene ends, because it wasn’t a real episode, and everyone knows it’s easy to think of funny situations but hard to figure out how (or when) they end. So I’ll leave that for the pros.
But it introduces us to the tenuous world of gesture and motion, day two of the conference on touch, gesture, and motion put on by IMS Research. Unlike touch, where so much of the story is the hardware needed to sense the touches, gestures and motion aren’t that complicated from a sensing standpoint: you’ve got either inertial measurement units (IMUs) that sense motion or cameras that see what’s happening. You might have 2D or 3D vision (made possible by stereoscopic vision or some other kind of depth sensor).
But most of this is not about sensing; it’s about software. It takes a lot of processing to take a visual scene and overlay meaning on top of it. But the level of meaning depends strongly on the goal. Which brings us to the central question: what’s the difference between gesture and motion? After all, gestures are motion.
My early thinking – which was supported (selectively?) by various things I’d seen and read – was that motion had to do with things that used IMUs and that gesture had to do with things that used vision. In other words, Wii was motion and Kinect was gesture.
But the further we got into the presentations, the clearer it became – eventually becoming explicit – that this is completely wrong. Gestures are limited, pre-defined sets of motion that act like a single token of information. They’re discrete and limited in number, and they have specific meaning. They are oriented towards command and control, and they’re event-oriented, with a specific machine response expected after a gesture.
Motion, on the other hand, is anything that moves. It may or may not have meaning, but it’s definitely not discrete – it’s continuous. Obviously motion has to be detected in order to identify a gesture, so gesture recognition lies over the top of motion, but from an application standpoint, they’re considered separate. It’s like sound and speech: there’s an infinite range of sounds, and a microphone, amp, and speakers can faithfully render them. Identifying and interpreting those sounds that are speech, however, is much different – and harder.
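To make the layering concrete, here is a minimal Python sketch of gesture recognition sitting on top of a motion stream. Everything in it is an assumption made for illustration rather than anything shown at the conference: the `detect_swipes` function, the `GestureEvent` type, the one-dimensional hand position, and the distance and time thresholds are all hypothetical. The input is continuous motion (timestamped positions); the output is a short list of discrete tokens.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class GestureEvent:
    """A discrete token: one of a small, pre-defined set of gestures."""
    name: str
    time: float


def detect_swipes(
    samples: Iterable[Tuple[float, float]],
    min_distance: float = 0.3,   # hypothetical threshold, in arbitrary units
    max_duration: float = 0.5,   # hypothetical time window, in seconds
) -> Iterator[GestureEvent]:
    """Turn continuous (time, x_position) samples into swipe events.

    A swipe is reported when the hand travels at least min_distance
    within max_duration. Slower movement is still motion, but it is
    not recognized as a gesture.
    """
    window: List[Tuple[float, float]] = []
    for t, x in samples:
        window.append((t, x))
        # Forget samples that are too old to belong to the same swipe.
        while t - window[0][0] > max_duration:
            window.pop(0)
        dx = x - window[0][1]
        if abs(dx) >= min_distance:
            yield GestureEvent("swipe_right" if dx > 0 else "swipe_left", t)
            # Restart the window so one movement produces one event.
            window = [(t, x)]


# Continuous motion: a quick move right, a pause, then a quick move left.
motion = [
    (0.0, 0.00), (0.1, 0.10), (0.2, 0.20), (0.3, 0.35), (0.4, 0.36),
    (1.5, 0.36), (1.6, 0.20), (1.7, 0.00),
]
events = list(detect_swipes(motion))

# The same distance covered slowly is motion without a gesture.
slow_drift = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.2), (3.0, 0.3), (4.0, 0.4)]
slow_events = list(detect_swipes(slow_drift))
```

Eight motion samples collapse into two events, and the slow drift produces none. A real recognizer would work in two or three dimensions and cope with sensor noise, but the shape is the same: a continuous stream in, a handful of command-like events out.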
Gestures are an important part of new approaches to human/machine interfaces. One company spoke to the “primacy” of touchless interfaces, which struck me as one of those over-the-top “if gestures are good for some things they must be the best for everything!” comments – where the technology drives the solution.
A different presentation noted that voice control hasn’t really taken over because people aren’t comfortable with it, especially in public. Rather than accepting this as reality, the take-away was that “social re-engineering” was needed to get people comfortable with it – again, technology forcing a solution.
So, as with touch, we have some work to do to make sure that we maintain the ability to select the right tool for the right job rather than applying one tool to everything.
Philips noted some other challenges for gestures, not the least of which is the fact that gestures are cultural – they’re not universally intuitive. In addition, if you want to control something complex entirely using gestures, then you’ll likely have a very large gesture vocabulary to memorize – which is not likely to appeal to the masses. There are also issues with ambiguity: when you gesture “turn on,” does that mean the light or the TV?
It actually occurs to me as I write this that many gestural issues could be solved if systems could recognize – and if everyone learned – sign language. Of course… there are different sign languages… lots of them… but still…
Motion raises a separate set of questions, especially when it comes to realistic applications. The most obvious motion applications we have now are activity-related. Simulated golf or football or whatever – great fun for the family in the comfort of your own living room – and up off the couch.
But videos glorifying the future of motion also show some alluring – and entirely improbable (imho) – scenes. For example, using motion to control, say, your phone. This isn’t a gesture app; it’s a camera watching your fingers dance through the air as if you’re dialing on a macro-phone. Such scenes typically depict a standard phone interface blithely following all the hand motions in mid-air. Really?
It’s hard enough to get a touch-screen to interpret the right location of my fat finger; in the air, if I’m selecting an app from a 4×8 matrix of little icons, I’m simply going to point in the air and hit the right one? I don’t think so. If the app provides visual feedback by, say, tracking where your finger is on the screen, then maybe (those details are never part of such videos… perhaps I need to relax and let go and assume they’ll figure that out with the first prototypes).
Even more ludicrous, if I may wax Ludditic, are scenes of people playing air violin or air piano. Must have been done by people who think that being a master at air guitar makes you able to really and truly play guitar. Or people who think that Autotune turns them into great singers.
Ask any musician whether they get their sounds simply by putting their finger or hand at the right place at the right time and they’ll tell you that that’s only the start. Pressure matters. Bending notes matters. Attack and decay matter. And those are incredibly subtle – not macro motions that are easily discernible and not even visible when executed in the mouth on a wind instrument.
Anyway… before this turns into a full-blown rant over something that, at its core, is a promotional video rather than actual technology… (Perhaps someone will take this as a challenge to create a real air-motion-only musical instrument with all the subtlety and nuance of a real instrument…) Moving along…
An organization has been started to assemble vision technology information in one place; it’s called the Embedded Vision Alliance. Started and run by the folks at BDTI, it has a website that seems to have quite a bit of information on the industry, applications, and technology. This includes, of course, both gesture and motion.
In general, developments in gesture and motion are proceeding briskly, and one of the main challenges will be figuring out where they work best and where other modalities work better. And it’s a subtle world, and discriminating subtlety from noise will be a challenge. Frankly, combining some of this with what seem to be some dramatic advances in reading brains might help to establish intent and thereby filter noise. (On the other hand, if brain reading gets that good, we won’t need to gesture at all.)
Whatever way we solve it, Lucy and Ricky would definitely benefit from a technology that helps them to decide whether or not a particular gesture really means, “Let’s roll.”