
Teaching Alexa to Ignore Your TV

How Does She Know If It’s Really You Talking?

“The medium is the message” – Marshall McLuhan

Distinguishing between a live human speaking and a recorded version of the same person is trickier than it sounds, and it has a lot of developers scrambling for solutions. Amazon’s Alexa team, for one, has published some of its research results along with some ideas for improvement. The techniques range from the simple to the complex, with detours into the creative and the slightly creepy.

Voice recognition as a whole just keeps getting tougher, and the problems keep flip-flopping. At first, the trick was to get a machine to respond to any kind of voice activation at all. Then the challenge became making it recognize a few words. After that, we tried making recognizers speaker-independent. Then we reversed course and made them speaker-dependent again.

Now the challenge is to recognize the difference between a live speaker and a recording of a live speaker. In short, how do we stop Alexa from taking orders from the TV?

Which has already happened, of course. Burger King’s April 2017 TV commercial remotely commanded a bunch of Google Home devices, causing widespread embarrassment to Google and near-panic among users. It wasn’t long before people started plotting widespread spoofs and takeovers of virtual assistants worldwide. The robot uprising never seemed closer.

Since then, we’ve more or less mastered the technology behind getting Alexa, Siri, Google, Cortana, and other voice-activated devices to recognize their names as triggers or wake words. But the more-subtle problem is avoiding false triggers coming from the TV, radio, MP3 player, or 180-gram vinyl if they happen to utter the same wake words. It’s fun to speculate on the trouble you’d cause by releasing a #1 hit single with the lyrics, “Hey, Siri, cancel my cable subscription. Alexa, order a million pounds of gravel, and Cortana, text my boss telling him he’s an idiot.” Oh, the quiet joys of modern technology.

Amazon’s Alexa team takes a multi-pronged approach with its products. Unbeknownst to most users, Alexa products already try to distinguish between real commands and those coming from the little man inside the radio.

The first line of defense is a cache of known recordings of the word “Alexa.” It’s basically a blacklist. These are stored locally on your Amazon Echo Dot, Echo Show, or similar device. If the device hears Alexa’s name (Amazon developers internally refer to Alexa as “she”), it compares the incoming audio to the local samples. If there’s a match – that is, if it’s overhearing a prerecorded TV or radio program that’s known to include the word Alexa – then the device will ignore that input. From a user’s perspective, nothing happens. An Echo Dot, for example, won’t wake up and won’t blink its colored lights. There’s no outward evidence that Alexa even recognized, but ignored, her name. The comparison is done locally, using only the device’s CPU and memory, so there’s no communication with the Amazon mothership required.
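
Amazon hasn’t published the matching algorithm, but conceptually it’s a fingerprint lookup against a small local cache. Here’s a minimal Python sketch of the idea; the dominant-frequency fingerprint is my own stand-in, not Amazon’s actual scheme.

```python
# Conceptual sketch only: Amazon's on-device matching isn't public.
# The idea: reduce incoming audio to a coarse fingerprint and compare it
# against a local blacklist of known broadcast recordings of "Alexa."
import numpy as np

def fingerprint(audio, frame=512):
    """Reduce a clip to a sequence of dominant-frequency bins (hypothetical scheme)."""
    frames = [audio[i:i + frame] for i in range(0, len(audio) - frame, frame)]
    spectra = [np.abs(np.fft.rfft(f * np.hanning(frame))) for f in frames]
    return np.array([s.argmax() for s in spectra])

def matches_blacklist(clip, blacklist, tolerance=0.9):
    """True if the clip lines up closely with any cached broadcast sample."""
    fp = fingerprint(clip)
    for cached in blacklist:
        n = min(len(fp), len(cached))
        if n and np.mean(fp[:n] == cached[:n]) >= tolerance:
            return True   # known recording: stay asleep, no blinking lights
    return False
```

The important part isn’t the fingerprint math; it’s that a check like this is cheap enough to run entirely on the device’s own CPU and memory, with no trip to the mothership.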

Interestingly, Amazon regularly updates this cache of blacklisted audio samples based on regional broadcast media programming. If there’s a TV commercial airing in your area that mentions Alexa, Amazon will direct all your neighbors’ devices to ignore the ad.

The next step of rejection relies on the wisdom of crowds. If three or more Alexa devices in the same general area but not in the same house all hear exactly the same thing at the same time, Amazon assumes it must be a media broadcast that it didn’t know about. This would happen during any live TV or radio broadcast, where Amazon has no opportunity to record and download an audio fingerprint. In these cases, Alexa devices do connect to Amazon’s servers to compare notes. Outwardly, your Echo Dot will appear to wake up, pause, and then shut down again.
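
That “compare notes” step presumably boils down to a duplicate-detection query on Amazon’s side. Here’s a hypothetical sketch; the field names are invented, and only the three-device threshold and the same-area, different-house condition come from Amazon’s description.

```python
# Hypothetical server-side sketch of the wisdom-of-crowds check.
from datetime import timedelta

BROADCAST_THRESHOLD = 3        # distinct houses hearing the same thing
WINDOW = timedelta(seconds=2)  # what counts as "at the same time"

def looks_like_broadcast(event, recent_events):
    """event and recent_events entries: (timestamp, region, house_id, audio_hash)."""
    ts, region, house, audio_hash = event
    other_houses = {
        h for (t, r, h, a) in recent_events
        if r == region and a == audio_hash
        and abs(t - ts) <= WINDOW and h != house
    }
    # Three or more *different* houses reporting an identical clip at the same
    # moment is assumed to be live TV or radio, not three people in sync.
    return len(other_houses) + 1 >= BROADCAST_THRESHOLD
```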

The third line of defense is still in development, and it relies on machine learning, a lot of training, and some hardware that’s already in place. Amazon is trying to teach Echo Dots and other devices the subtle difference between a live speaker it should obey and a recorded speaker it should ignore – not an easy task.

Personally, I figured the tinny sound of TV speakers would be a giveaway, but that’s apparently not the case. Halfway decent home audio systems, home theaters, and TVs with sound bars produce full-spectrum sound that easily fools Alexa and devices like her… it. Distinguishing between a recording and the real thing is complicated if you’re not human.

Amazon is applying fully acronym-enabled machine-learning techniques to the problem, describing a parallel CNN-GRU-FC classifier architecture in this research paper and this one. Briefly, they’re combining two separate and unrelated techniques, each of which improves recognition by a measurable percentage. Combining the two improves things still further, although it’s far from perfect.
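
The papers go into detail I won’t reproduce here, but the name suggests a convolutional branch and a recurrent (GRU) branch running in parallel, merged by fully connected (FC) layers. A rough PyTorch sketch of that shape, with every layer size an illustrative guess rather than Amazon’s actual configuration:

```python
# Rough PyTorch sketch of a parallel CNN-GRU-FC classifier over LFBE frames.
# The overall shape follows the architecture named above; the layer sizes and
# class name are illustrative guesses, not Amazon's published configuration.
import torch
import torch.nn as nn

class MediaVsLiveClassifier(nn.Module):
    def __init__(self, n_bands=64, hidden=128):
        super().__init__()
        # CNN branch: local time-frequency patterns in the LFBE "image."
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        # GRU branch: longer-term behavior across frames (constant noise vs. bursts).
        self.gru = nn.GRU(input_size=n_bands, hidden_size=hidden, batch_first=True)
        # FC head: merge both branches into a single live-vs-recorded decision.
        self.fc = nn.Sequential(
            nn.Linear(32 + hidden, 64), nn.ReLU(),
            nn.Linear(64, 1),  # one logit: a single live-or-recorded bit
        )

    def forward(self, lfbe):                                    # lfbe: (batch, frames, bands)
        cnn_out = self.cnn(lfbe.unsqueeze(1)).flatten(1)        # (batch, 32)
        _, h = self.gru(lfbe)                                   # h: (1, batch, hidden)
        return self.fc(torch.cat([cnn_out, h[-1]], dim=1))      # (batch, 1)
```

The CNN branch looks for local time-frequency patterns; the GRU branch watches how the signal behaves over time, which matters for the always-on-versus-short-bursts distinction discussed next.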

Helping the issue is the fact that broadcast media sources tend to make noise constantly (TV shows, radio broadcasts, movies, etc.), while humans tend to talk in short bursts. Complicating the issue, however, is that said humans will probably be talking over the aforementioned media broadcast. Alexa has to pick out the wake word from the background noise, while also rejecting wake words that are part of the background noise.

Recorded media sources also tend to be stationary (TV on the wall, radio on the table, etc.), whereas human speakers sometimes move around while they’re talking to Alexa. It’s not a foolproof test, but it helps.

Amazon’s two-pronged ML approach combines the potential motion aspect with careful frequency filtering in an attempt to separate its master’s voice from recorded facsimiles. An Echo Dot is less than 4 inches (100 mm) across, yet it has seven microphones arranged in a ring. Given the speed of sound through room-temperature air, that means there are roughly 300 microseconds of delay from one side of an Echo Dot to the other. That’s enough to triangulate the direction of arrival (DoA) of the audio source. More important, it allows an Echo Dot to tell whether the speaker is moving or stationary. If the wake word came from the direction of a stationary, constantly noisy source, it’s probably a media unit (TV, radio, et al.). If, on the other hand, the speaker is moving and hasn’t been talking nonstop for an hour, it’s probably human.
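
That delay figure is easy to sanity-check, assuming 343 m/s for sound in room-temperature air and a microphone ring roughly as wide as the device:

```python
# Back-of-envelope check of the inter-microphone delay. Assumed values:
# 343 m/s for sound at room temperature, a 100 mm (4-inch) microphone ring.
SPEED_OF_SOUND = 343.0   # m/s
RING_DIAMETER = 0.100    # m

max_delay = RING_DIAMETER / SPEED_OF_SOUND
print(f"Max inter-microphone delay: {max_delay * 1e6:.0f} microseconds")
# -> roughly 290 microseconds, or about 14 samples at a 48 kHz sample rate,
#    which is plenty to estimate which direction the sound came from.
```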

The more-complex part involves audio frequency filtering. Amazon’s researchers factored in LFBE (logarithmic filter bank energy) to break apart the incoming audio spectrum into energy bands. Filter banks are common enough, but Amazon breaks theirs into 64 very unevenly sized bands, starting with narrow bands at low frequencies (beginning at 100 Hz), moving to wider bands at higher frequencies (up to 7200 Hz). (For a description of LFBEs in speech recognition, see pp. 152–158 in this book.) There is considerable overlap between adjacent bands, as is typical in audio filtering.
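
The exact band spacing isn’t spelled out beyond narrow-at-the-bottom, wide-at-the-top, so this sketch uses mel-scale spacing as a stand-in; the rest (64 overlapping triangular bands from 100 Hz to 7200 Hz, with the log of the energy in each) follows the description above.

```python
# Sketch of log filter-bank energies (LFBEs): 64 overlapping triangular bands
# from 100 Hz to 7200 Hz. Mel-style spacing is an assumption on my part; the
# description only says low bands are narrow and higher bands get wider.
import numpy as np

def lfbe(frame, sr=16000, n_bands=64, f_lo=100.0, f_hi=7200.0):
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)

    # Band edges spaced on a mel scale: narrow bands low, wide bands high.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv(np.linspace(mel(f_lo), mel(f_hi), n_bands + 2))

    energies = []
    for lo, mid, hi in zip(edges[:-2], edges[1:-1], edges[2:]):
        # Triangular weights; adjacent triangles share skirts, giving the overlap.
        up = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
        down = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
        energies.append(np.sum(spectrum * np.minimum(up, down)))
    return np.log(np.array(energies) + 1e-10)
```

Feed it one 25 ms frame at a time (400 samples if the device samples at 16 kHz, which is my assumption, not Amazon’s published figure) and you get 64 log energies per frame: the kind of input the classifier sketched above chews on.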

Through this they fed 378 hours of both live and prerecorded audio, including a mixture of spoken conversation, recorded music, and “other.” The filter was very good at telling the difference between live and recorded singing, but less accurate at “other” sounds, and least accurate of all at simple spoken words. That’s because of speech’s limited bandwidth compared to music. So, if you want Alexa to know it’s really you, sing to her.

Combining directionality with filtering improved accuracy even more, and this seems to be the route Amazon is likely to take with future products. They call this a split long/short term (SLST) model, and their results show how much it improves on “baseline” accuracy, which in this case means the audio filtering without the directional microphone array. The hardest test cases highlight the challenge that remains. When singing, speaking, and other sounds are all present at once, Amazon’s SLST is 100% accurate at telling the difference between media sources and live sources. With the “other” component removed, accuracy stays at 100%. With the singing removed and the “other” sounds put back, accuracy drops slightly to 97%, although that’s still much better than the 90% baseline. Finally, with all other recorded sounds removed and just the recorded voices left, accuracy drops substantially, to just 76% (still better than the 68% baseline). There’s a lot of work still to do.

Bear in mind that this wasn’t a test to recognize specific wake words; only to distinguish between live audio and prerecorded or broadcast audio. In other words, the outcome of this elaborate analysis is a single bit.

Humans are good at solving the “cocktail party problem” of filtering the relevant speech from the background noise. Machines aren’t. Anyone who’s played around with a tape recorder has likely been surprised by how much background noise it preserves. Sounds that we didn’t notice at all seem like they’re amplified on tape. There’s still a way to go before Alexa can become like us and ignore the TV.
