
Get Woke: Sensory Improves Voice Activation

Software Company Fine-Tunes Wake-Word Recognition

“I know the voices aren’t real, but they have some good ideas.” – anonymous

We laugh when Siri, Alexa, or Cortana misunderstands us and makes embarrassing mistakes, but we cut them some slack because, hey, the technology is only a few years old, right?  

Nope. Speech recognition has been around for almost 100 years, and we’re only now getting to the point where it’s actually useful. You’d think we could make better progress than that.

Voice recognition is everywhere now. We talk to (not into) our phones; we talk to Alexa; we talk to our TV remote controls. The machines listen and, we hope, do our bidding. All those systems rely on two fundamental technologies: voice recognition, to figure out what you’re saying, and artificial intelligence, to figure out what to do about it.

The AI part may prove to be infinitely difficult, but voice recognition is no slam dunk, either. Santa Clara–based Sensory has been working on the latter problem for almost 25 years, and the company takes credit for introducing the idea to Apple.

Ten or 20 years ago, voice-recognition systems were painfully limited and inaccurate. You generally had to train them to recognize a few spoken commands by, for example, repeating the word “play” or “stop” until the system had collected enough samples of your voice to make a template. After that, the machine would respond only to you; anyone else’s voice wouldn’t match the template. Intentionally or not, those early systems were speaker-dependent.

Nowadays that’s considered a bug. Voice-activated gadgets are supposed to be speaker-independent. And, except for folks with a strong regional accent, they generally are. But now the industry is doing a 180: we’re trying to make devices speaker-dependent again, and Sensory is offering new software for exactly that purpose.

The idea is to allow your phone or smart appliance (think Amazon Echo or Nest thermostat) to distinguish between speakers so that it responds appropriately. You might want Alexa to follow your commands, for example, but not those of your four-year-old daughter. Even though Alexa understands both speakers, you would hope that it’s smart enough to obey one and ignore the other. That’s tricky to do.

A subtler benefit is that the device can respond to ambiguous or self-referential commands. “Update my calendar” is tricky if you don’t know who the speaker is. “Call work” can also be ambiguous. Today’s voice assistants get around this in a different way. Asking Siri about your wife’s birthday usually works, but only because (a) Siri knows who its owner is, and (b) you’ve probably identified your spouse’s birthday in your contacts list. Siri doesn’t really know who’s speaking, and any random stranger holding your phone would get the same response.

Sensory’s software starts segregating speakers’ voices automatically, without any specific vocal training. What happens after that is up to you, however. Different OEMs will have different methods of assigning rights and privileges to each speaker; that’s a user-interface issue. What was once a bug is now a feature.
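To make the “rights and privileges” idea concrete, here is a minimal sketch of a per-speaker permission table. It assumes the recognizer has already attached a speaker label to each command; the labels, command names, and the `authorize` helper are all invented for illustration and are not part of Sensory’s software or any OEM’s design.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SpeakerPolicy:
    """Commands a given speaker is allowed to issue (hypothetical)."""
    allowed: Set[str] = field(default_factory=set)


# Hypothetical policy table: speaker labels and command names are
# assumptions made up for this example.
POLICIES = {
    "parent": SpeakerPolicy({"play_music", "order_item", "set_thermostat"}),
    "child": SpeakerPolicy({"play_music"}),
}


def authorize(speaker_id: Optional[str], command: str) -> bool:
    """Return True if the identified speaker may run the command.

    Unknown or unidentified speakers get no privileges by default.
    """
    policy = POLICIES.get(speaker_id)
    return policy is not None and command in policy.allowed
```

With this table, `authorize("parent", "order_item")` succeeds while `authorize("child", "order_item")` does not, even though the device understood both requests equally well. Denying unknown speakers by default is one possible design choice; an OEM could just as easily grant strangers a small set of harmless commands.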

Another key subcomponent of any speech-recognition system is an effective “wake word.” Without a wake word, the system must be on, ready, listening, and processing audio 100% of the time. That obviously wastes processing power and energy, but worse, it makes it tough for the system to tell if you’re talking to it or if you’re just talking. Unless you call out “Alexa” (an intentionally unusual multisyllabic name that’s unlikely to occur in normal conversation), your creepy Amazon eavesdropper won’t know that you’re ready to order diapers. A wake word lets the system behave like an old dog: it sleeps 99% of the time and pays attention only when it hears its name.
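The gating behavior can be sketched as a two-state machine. This is a toy illustration only: it works on already-recognized words rather than raw audio, treats the single word after the wake word as the whole command, and the `WakeWordGate` class is a hypothetical name, not any vendor’s API.

```python
from enum import Enum, auto


class State(Enum):
    ASLEEP = auto()
    LISTENING = auto()


class WakeWordGate:
    """Ignore everything until the wake word is heard, then pass one command."""

    def __init__(self, wake_word="alexa"):
        self.wake_word = wake_word
        self.state = State.ASLEEP

    def feed(self, word):
        """Consume one recognized word; return a command string or None."""
        word = word.lower()
        if self.state is State.ASLEEP:
            # Cheap comparison only: nothing is interpreted while asleep.
            if word == self.wake_word:
                self.state = State.LISTENING
            return None
        # Awake: treat the next word as the command, then go back to sleep.
        self.state = State.ASLEEP
        return word


gate = WakeWordGate()
heard = "we need diapers alexa diapers please".split()
commands = [c for c in map(gate.feed, heard) if c is not None]
```

Here `commands` ends up holding only the “diapers” that followed the wake word; the earlier mention in ordinary conversation is ignored, which is the whole point.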

Sensory offers software for small DSPs and MCUs to implement wake words. Depending on your hardware and the number and complexity of the wake words, it requires as little as 1 MIPS of processing and 100 KB of RAM. Better hardware allows for more elaborate setups. Power consumption (one of the reasons for having wake words at all) depends on your hardware, but Sensory says some OEMs are operating at under 1 mA.
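For a rough sense of what those numbers buy you, here is a back-of-the-envelope calculation. Only the 1 mA listening current and the “asleep 99% of the time” figure come from the text above; the 50 mA active current and the 1,000 mAh battery are assumptions picked purely for illustration.

```python
def average_current_ma(sleep_ma, active_ma, active_fraction):
    """Time-weighted average current for a duty-cycled device."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return active_fraction * active_ma + (1.0 - active_fraction) * sleep_ma


def battery_life_hours(capacity_mah, current_ma):
    """Idealized battery life: capacity divided by average current."""
    return capacity_mah / current_ma


SLEEP_MA = 1.0        # wake-word listening current, per the figure quoted above
ACTIVE_MA = 50.0      # assumed current with full recognition running
CAPACITY_MAH = 1000   # assumed battery capacity

gated = average_current_ma(SLEEP_MA, ACTIVE_MA, active_fraction=0.01)
hours_gated = battery_life_hours(CAPACITY_MAH, gated)
hours_always_on = battery_life_hours(CAPACITY_MAH, ACTIVE_MA)
```

Under these assumptions the gated system averages about 1.49 mA and runs for roughly 671 hours, versus 20 hours if the recognizer never sleeps. Real batteries and real hardware are messier, so treat this as an order-of-magnitude argument, not a datasheet.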

To make wake words even more power-efficient, Sensory has also developed its own hardware IP, its first in many years. Imaginatively called LPSD, for low-power sound detection, it’s an accelerator that gets integrated into a soft DSP from Ceva, Synopsys, Tensilica, or others. The idea is that your DSP can go to sleep and ignore audio input entirely. Sensory’s LPSD will trigger on the wake word and fire up the DSP in time for it to process the full audio command stream.
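The hand-off from a tiny always-on detector to a sleeping DSP might look something like the sketch below. How LPSD actually passes audio along isn’t described here, so the ring buffer of recent frames is an assumption (one common way to make sure the main processor doesn’t miss the start of a command), and `LowPowerDetector` is a hypothetical stand-in, not Sensory’s interface.

```python
from collections import deque


class LowPowerDetector:
    """Stand-in for an always-on detector; NOT Sensory's LPSD interface."""

    def __init__(self, is_trigger, on_wake, preroll_frames=4):
        self.is_trigger = is_trigger   # hypothetical per-frame trigger test
        self.on_wake = on_wake         # callback that "powers up" the DSP
        # Ring buffer: keeps only the most recent frames (assumed design).
        self.recent = deque(maxlen=preroll_frames)

    def push(self, frame):
        """Accept one audio frame; wake the DSP if the trigger fires."""
        self.recent.append(frame)
        if self.is_trigger(frame):
            # Hand over the buffered frames so the DSP sees the lead-in too.
            self.on_wake(list(self.recent))
            self.recent.clear()


woken = []
detector = LowPowerDetector(
    is_trigger=lambda frame: frame == "WAKE",
    on_wake=woken.append,
    preroll_frames=3,
)
for frame in ["a", "b", "c", "d", "WAKE", "e"]:
    detector.push(frame)
```

After the loop, `woken` holds a single hand-off containing the last three frames up to and including the trigger. The DSP callback runs once, and only when there is something worth hearing.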

My guess is that Sensory has at least another 25 good years ahead of it. Voice activation and speech recognition aren’t going away, but there’s a lot of improvement yet to be made. The more we talk to our machines, the more we’re going to need the underlying black magic to make it happen.
