
Signals and Swats

The Promise and Limitations of Gesture and Motion Technology

You can almost imagine an I Love Lucy caper. Lucy and Ricky are trying to catch someone in the act of something nefarious. They dress up in fake private-eye clothes with a PI hat, turned up collar (pre-bro), and a fake moustache for her. They’re on opposite sides of the room in stealth mode, with only hand gestures to communicate. They’ve worked out an intricate set of signals, including “right hand to the nose means we go in 3…2…1…” and “left hand to the nose means something’s not right; hold off.”

And as they stand there poised for action, a fly lands on Lucy’s nose. And she swats it with her right hand, and Ricky gets ready to launch. But then she swipes it with her left hand and he panics and backs up, unsure of what to do. She, of course, is completely oblivious to her mixed messages.

I have no idea how this scene ends, because it wasn’t a real episode, and everyone knows it’s easy to dream up funny situations but hard to figure out how (or when) they end. So I’ll leave that for the pros.

But it introduces us to the tenuous world of gesture and motion, day two of the conference on touch, gesture, and motion put on by IMS Research. Unlike touch technology, which has so much to do with the technology needed to sense touches, gestures and motion aren’t that complicated from a sensing standpoint: you’ve got either inertial measurement units (IMUs) that sense motion or cameras that see what’s happening. You might have 2D or 3D vision (made possible by stereoscopic vision or some other kind of depth sensor).

But most of this is not about sensing; it’s about software. It takes a lot of processing to take a visual scene and overlay meaning on top of it. But the level of meaning depends strongly on the goal. Which brings us to the central question: what’s the difference between gesture and motion? After all, gestures are motion.

My early thinking – which was supported (selectively?) by various things I’d seen and read – was that motion had to do with things that used IMUs and that gesture had to do with things that used vision. In other words, Wii was motion and Kinect was gesture.

But the further we got into the presentations, the clearer it became – eventually being explicitly obvious – that this is completely wrong. Gestures are limited, pre-defined sets of motion that act like a single token of information. They’re discrete and limited in number, and they have specific meaning. They are oriented towards command and control, and they’re event-oriented, with a specific machine response expected after a gesture.

Motion, on the other hand, is anything that moves. It may or may not have meaning, but it’s definitely not discrete – it’s continuous. Obviously motion has to be detected in order to identify a gesture, so gesture recognition lies over the top of motion, but from an application standpoint, they’re considered separate. It’s like sound and speech: there’s an infinite range of sounds, and a microphone, amp, and speakers can faithfully render them. Identifying and interpreting those sounds that are speech, however, is much different – and harder.
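The layering described above can be sketched in a few lines. This is purely illustrative: a toy recognizer that sits on top of a continuous stream of motion samples and emits discrete gesture tokens. The threshold value and the gesture names are made up for the example.

```python
def detect_swipes(samples, threshold=50):
    """Scan a continuous stream of x-positions and emit discrete
    'swipe_left' / 'swipe_right' tokens whenever net displacement
    since the last event exceeds the threshold."""
    events = []
    anchor = samples[0]
    for x in samples[1:]:
        delta = x - anchor
        if delta >= threshold:
            events.append("swipe_right")
            anchor = x  # reset after emitting a discrete event
        elif delta <= -threshold:
            events.append("swipe_left")
            anchor = x
    return events

# Continuous motion in, discrete gestures out:
stream = [0, 10, 30, 60, 55, 40, 5, 0, 20]
print(detect_swipes(stream))  # ['swipe_right', 'swipe_left']
```

The point of the sketch is the asymmetry: the input is an open-ended stream of positions, while the output is drawn from a small, pre-defined vocabulary of event tokens — exactly the gesture-over-motion layering described above.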

Gestures are an important part of new approaches to human/machine interfaces. One company spoke to the “primacy” of touchless interfaces, which struck me as one of those over-the-top “if gestures are good for some things they must be the best for everything!” comments – where the technology drives the solution.

A different presentation noted that voice control hasn’t really taken over because people aren’t comfortable with it, especially in public. Rather than accepting this as reality, the presenter concluded that “social re-engineering” was needed to get people comfortable with it – again, technology forcing a solution.

So, as with touch, we have some work to do to make sure that we maintain the ability to select the right tool for the right job rather than applying one tool to everything.

Philips noted some other challenges for gestures, not the least of which is the fact that gestures are cultural – they’re not universally intuitive. In addition, if you want to control something complex entirely using gestures, then you’ll likely have a very large gesture vocabulary to memorize – which is not likely to appeal to the masses. There are also issues with ambiguity: when you gesture “turn on,” does that mean the light or the TV?
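The ambiguity problem can be made concrete with a small sketch. One plausible (hypothetical) way to disambiguate is to keep track of which device the user most recently addressed – say, via a pointing gesture – and route subsequent commands to it. Everything here, including the device names and the `GestureRouter` class, is invented for illustration.

```python
class GestureRouter:
    """Routes a discrete gesture token to a target device,
    using the most recently addressed device as context."""

    def __init__(self):
        self.focus = None  # device currently being addressed

    def point_at(self, device):
        # e.g. a pointing gesture sets the focus
        self.focus = device

    def handle(self, gesture):
        if self.focus is None:
            return "ambiguous: no target selected"
        return f"{gesture} -> {self.focus}"

router = GestureRouter()
print(router.handle("turn_on"))  # ambiguous: no target selected
router.point_at("lamp")
print(router.handle("turn_on"))  # turn_on -> lamp
```

The same “turn on” gesture means nothing by itself; only the added context makes it actionable – which is the crux of the ambiguity issue.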

It actually occurs to me as I write this that many gestural issues could be solved if systems could recognize – and if everyone learned – sign language. Of course… there are different sign languages… lots of them… but still…

Motion raises a separate set of questions, especially when it comes to realistic applications. The most obvious motion applications we have now are activity-related. Simulated golf or football or whatever – great fun for the family in the comfort of your own living room – and up off the couch.

But videos glorifying the future of motion also show some alluring – and entirely improbable (imho) – scenes. For example, using motion to control, say, your phone. This isn’t a gesture app; it’s a camera watching your fingers dance through the air as if you’re dialing on a macro-phone. Such scenes typically depict a standard phone interface blithely following all the hand motions in mid-air. Really?

It’s hard enough to get a touch-screen to interpret the right location of my fat finger; in the air, if I’m selecting an app from a 4×8 matrix of little icons, I’m simply going to point in the air and hit the right one? I don’t think so. If the app provides visual feedback by, say, tracking where your finger is on the screen, then maybe (those details are never part of such videos… perhaps I need to relax and let go and assume they’ll figure that out with the first prototypes).
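That visual-feedback idea can be sketched simply: snap the tracked fingertip position to the nearest icon cell and highlight it, so the user sees what they’re about to hit before committing. This is a hedged sketch, not anyone’s actual product; the coordinates are assumed to be normalized to 0..1, and the grid size comes from the 4×8 example above.

```python
def snap_to_icon(x, y, cols=8, rows=4):
    """Map a normalized (x, y) fingertip position to the (col, row)
    of the nearest icon cell, clamping to the edges of the grid so
    a wobbly in-air finger never 'falls off' the screen."""
    col = min(cols - 1, max(0, int(x * cols)))
    row = min(rows - 1, max(0, int(y * rows)))
    return col, row

print(snap_to_icon(0.55, 0.10))  # (4, 0)
```

With continuous snapping and highlighting like this, in-air selection becomes at least plausible; without that feedback loop, the promotional-video version is wishful thinking.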

Even more ludicrous, if I may wax Ludditic, are scenes of people playing air-violin or air piano. Must have been done by people who think that being a master at air guitar makes you able to really and truly play guitar. Or people who think that Autotune turns them into great singers.

Ask any musician whether they get their sounds simply by putting their finger or hand at the right place at the right time and they’ll tell you that that’s only the start. Pressure matters. Bending notes matters. Attack and decay matter. And those are incredibly subtle – not macro motions that are easily discernible and not even visible when executed in the mouth on a wind instrument.

Anyway… before this turns into a full-blown rant over something that, at its core, is a promotional video rather than actual technology… (Perhaps someone will take this as a challenge to create a real air-motion-only musical instrument with all the subtlety and nuance of a real instrument…) Moving along…

There is an organization, the Embedded Vision Alliance, that has been started to assemble vision technology information in one place. Started and run by the folks at BDTI, it has a website with quite a bit of information on the industry, applications, and technology – including, of course, both gesture and motion.

In general, developments in gesture and motion are proceeding briskly, and one of the main challenges will be figuring out where they work best and where other modalities work better. It’s a subtle world, and discriminating subtlety from noise will be a challenge. Frankly, combining some of this with what seem to be some dramatic advances in reading brains might help to establish intent and thereby filter noise. (On the other hand, if brain reading gets that good, we won’t need to gesture at all.)

Whatever way we solve it, Lucy and Ricky would definitely benefit from a technology that helps them to decide whether or not a particular gesture really means, “Let’s roll.”


More info:

The Embedded Vision Alliance

