feature article
Subscribe Now

Get Woke: Sensory Improves Voice Activation

Software Company Fine-Tunes Wake-Word Recognition

“I know the voices aren’t real, but they have some good ideas.” – anonymous

We laugh when Siri, Alexa, or Cortana misunderstands us and makes embarrassing mistakes, but we cut them some slack because, hey, the technology is only a few years old, right?  

Nope. Speech recognition has been around for almost 100 years, and we’re only now getting to the point where it’s actually useful. You’d think we could make better progress than that.

Voice recognition is everywhere now. We talk to (not into) our phones; we talk to Alexa; we talk to our TV remote controls. The machines listen and, we hope, do our bidding. All those systems rely on two fundamental technologies: voice recognition, to figure out what you’re saying, and artificial intelligence, to figure out what to do about it.

The AI part may prove to be infinitely difficult, but voice recognition is no slam dunk, either. Santa Clara–based Sensory has been working on the latter problem for almost 25 years, and the company takes credit for introducing the idea to Apple.

Ten or 20 years ago, voice-recognition systems were painfully limited and inaccurate. You generally had to train them to recognize a few spoken commands by, for example, repeating the word “play” or “stop” until it collected enough samples of your voice to make a template. After that, the machine would respond only to you; anyone else’s voice wouldn’t match the template. Intentionally or not, those early systems were speaker-dependent.

Nowadays that’s considered a bug. Voice-activated gadgets are supposed to be speaker-independent. And, except for folks with a strong regional accent, they generally are. But now the industry is doing a 180: we’re trying to make devices speaker-dependent again, and Sensory is offering new software for exactly that purpose.

The idea is to allow your phone or smart appliance (think Amazon Echo or Nest thermostat) to distinguish between speakers so that it responds appropriately. You might want Alexa to follow your commands, for example, but not those of your four-year-old daughter. Even though Alexa understands both speakers, you would hope that it’s smart enough to ignore one but not the other. That’s tricky to do.

The more subtle benefits allow a device to respond to ambiguous or self-referential commands. “Update my calendar” is tricky if you don’t know who the speaker is. “Call work” can also be ambiguous. Today’s voice assistants get around this in a different way. Asking Siri about your wife’s birthday usually works, but only because (a) Siri knows who its owner is, and (b) you’ve probably identified your spouse’s birthday in your contacts list. Siri doesn’t really know who’s speaking, and any random stranger holding your phone would get the same response.  

Sensory’s software starts segregating speakers’ voices automatically, without any specific vocal training. What happens after that is up to you, however. Different OEMs will have different methods of assigning rights and privileges to each speaker; that’s a user-interface issue. What was once a bug is now a feature.

Another key subcomponent of any speech-recognition system is an effective “wake word.” Without a wake word, the system must be on, ready, listening, and processing audio 100% of the time. That obviously wastes processing power and energy, but worse, it makes it tough for the system to tell if you’re talking to it or if you’re just talking. Without calling out, “Alexa,” (an intentionally unusual multisyllabic name that’s unlikely to occur in normal conversation), your creepy Amazon eavesdropper won’t know that you’re ready to order diapers. Like an old dog, a wake word allows the system to sleep 99% of the time and pay attention only when it hears its name.

Sensory offers software for small DSPs and MCUs to implement wake words. Depending on your hardware and the number and complexity of the wake words, it requires as little as 1 MIPS of processing and 100KB of RAM. Better hardware allows for more elaborate setups. Power consumption – one of the reasons for having wake words at all – depends on your hardware, but Sensory says some OEMs are operating at under 1mA.

To make wake words even more power-efficient, Sensory has also developed its own hardware IP, the first in many years. Imaginatively called LPSD, for low-power sound detection, it’s an accelerator that gets integrated into a soft DSP from Ceva, Synopsys, Tensilica, or others. The idea is that your DSP can go to sleep and ignore audio input entirely. Sensory’s LPSD will trigger on the wake word and fire up the DSP in time for it to process the full audio command stream.

My guess is that Sensory has at least another 25 good years ahead of it. Voice activation and speech recognition aren’t going away, but there’s a lot of improvement yet to be done. The more we talk to our machines, the more we’re going to need the underlying black magic to make it happen.  

Leave a Reply

featured blogs
Jul 12, 2024
I'm having olfactory flashbacks to the strangely satisfying scents found in machine shops. I love the smell of hot oil in the morning....

featured video

Larsen & Toubro Builds Data Centers with Effective Cooling Using Cadence Reality DC Design

Sponsored by Cadence Design Systems

Larsen & Toubro built the world’s largest FIFA stadium in Qatar, the world’s tallest statue, and one of the world’s most sophisticated cricket stadiums. Their latest business venture? Designing data centers. Since IT equipment in data centers generates a lot of heat, it’s important to have an efficient and effective cooling system. Learn why, Larsen & Toubro use Cadence Reality DC Design Software for simulation and analysis of the cooling system.

Click here for more information about Cadence Multiphysics System Analysis

featured paper

DNA of a Modern Mid-Range FPGA

Sponsored by Intel

While it is tempting to classify FPGAs simply based on logic capacity, modern FPGAs are alterable systems on chips with a wide variety of features and resources. In this blog we look closer at requirements of the mid-range segment of the FPGA industry.

Click here to read DNA of a Modern Mid-Range FPGA - Intel Community

featured chalk talk

Accelerating Tapeouts with Synopsys Cloud and AI
Sponsored by Synopsys
In this episode of Chalk Talk, Amelia Dalton and Vikram Bhatia from Synopsys explore how you can accelerate your next tapeout with Synopsys Cloud and AI. They also discuss new enhancements and customer use cases that leverage AI with hybrid cloud deployment scenarios, and how this platform can help CAD managers and engineers reduce licensing overheads and seamlessly run complex EDA design flows through Synopsys Cloud.
Jul 8, 2024
772 views