Ask HN: How does Alexa avoid interrupting itself when saying its own name?

richarme · on June 29, 2024

This is achieved using Acoustic Echo Cancellation (AEC). This essentially subtracts the output of the speaker, as well as its reverberations from the room, from the microphone input. Here's a YouTube video explaining the basic principle: https://www.youtube.com/watch?v=bJKGrheOoY4

Source: worked on 3rd party Alexa speakers

cushychicken · on June 29, 2024

Fundamental principle at work is adaptive filtering.

It also has uses in noise-canceling headphones, voice conferencing software, and radar/sonar in some cases.

No LLMs or deep learning at all - purely DSP!
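For anyone who wants to see the adaptive filtering idea in code, here is a minimal sketch using normalized LMS (NLMS), one common adaptive filter. The tap count, step size `mu`, and the made-up `room` impulse response are all assumptions for illustration, not anything taken from a real Echo device.

```python
import random

def nlms_echo_cancel(far_end, mic, taps=16, mu=0.5, eps=1e-8):
    """Return mic minus the adaptively estimated echo of far_end (NLMS)."""
    w = [0.0] * taps   # running estimate of the speaker-to-mic path
    x = [0.0] * taps   # most recent far-end samples, newest first
    residual = []
    for ref, d in zip(far_end, mic):
        x = [ref] + x[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x))   # predicted echo
        e = d - y                                  # what remains after subtraction
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        residual.append(e)
    return residual

def power(signal):
    return sum(v * v for v in signal) / len(signal)

random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]  # what the speaker plays
room = [0.0, 0.0, 0.6, 0.0, 0.3, -0.1]                      # hypothetical echo path
mic = [
    sum(room[k] * far_end[n - k] for k in range(len(room)) if n - k >= 0)
    for n in range(len(far_end))
]

residual = nlms_echo_cancel(far_end, mic)
echo_power = power(mic[-500:])
residual_power = power(residual[-500:])
```

In this toy setup the mic hears only the echo, so the residual shrinks toward zero once the filter has converged. With a person talking as well, the residual would be (roughly) their voice.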

kqr · on June 29, 2024

This is also the reason you can (sometimes) video conference on your laptop without headphones plugged in. The software does not (or at least tries not to) record its own output.

yccs27 · on June 30, 2024

You only notice it if you try to use different devices for speakers and microphone - you'll get a feedback loop, since the AEC only works on the same device.

Someone · on June 29, 2024

The simple solution: switch off the code that listens for the “Alexa” prompt when saying “Alexa” yourself.

Slightly harder: keep it running, but discard hits that are timed close to the time you say “Alexa” yourself.

Even harder: have a second detector that is trained on the device saying “Alexa”, and discard hits that coincide with that detector firing. That second detector can be simplified by superimposing a waveform that humans will (barely) notice but that is easily detected by a computer on top of the audio whenever the device says “Alexa”.

Still harder: obtain the transfer function and latency between the speaker(s) and its microphone(s) and, using that, compute what signal you expect to hear at the microphone from the speaker’s output, and subtract that from the actual signal detected to get a signal that doesn’t include one’s own utterances.

That function could be obtained from one device in the factory or trained on-device.

I suspect the first already is close to good enough for basic devices. If you want a device that can listen while also playing music at volume, the last option can be helpful.
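A rough sketch of the simplest version of the last option, assuming the speaker-to-mic path is nothing more than one delay and one gain (a real transfer function has many taps). The function names, the 7-sample delay, and the 0.5 gain are made up for the example.

```python
import math
import random

def estimate_delay_and_gain(played, recorded, max_lag):
    """Find the lag where played best matches recorded, plus the gain at that lag."""
    best_lag, best_corr = 0, 0.0
    for lag in range(max_lag + 1):
        corr = sum(played[n - lag] * recorded[n] for n in range(lag, len(recorded)))
        if abs(corr) > abs(best_corr):
            best_lag, best_corr = lag, corr
    energy = sum(p * p for p in played[: len(played) - best_lag])
    return best_lag, best_corr / energy

def subtract_echo(played, recorded, lag, gain):
    """Remove the predicted speaker contribution from the mic signal."""
    return [
        r - gain * (played[n - lag] if n >= lag else 0.0)
        for n, r in enumerate(recorded)
    ]

random.seed(1)
played = [random.uniform(-1.0, 1.0) for _ in range(2000)]        # device output
user = [0.1 * math.sin(2 * math.pi * n / 50) for n in range(2000)]  # stand-in for a person
recorded = [
    user[n] + (0.5 * played[n - 7] if n >= 7 else 0.0)           # hypothetical path
    for n in range(2000)
]

lag, gain = estimate_delay_and_gain(played, recorded, max_lag=20)
cleaned = subtract_echo(played, recorded, lag, gain)
worst_error = max(abs(c - u) for c, u in zip(cleaned, user))
```

A factory-measured path would just hardcode `lag` and `gain`. Measuring on-device means re-running the estimate whenever the room changes.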

davidmurdoch · on June 29, 2024

Wouldn't it be simpler to add noise above 20kHz and ignore the key phrase if the noise is present?
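To make the idea concrete, here is a sketch of detecting such a marker tone with the Goertzel algorithm, which measures energy at a single frequency. The 48 kHz sample rate, 21 kHz pilot, and detection threshold are assumptions picked for the example, and it presumes the whole playback and capture chain actually passes 21 kHz.

```python
import math

SAMPLE_RATE = 48000   # assumed capture rate
PILOT_HZ = 21000      # hypothetical marker tone above typical hearing range
FRAME = 480           # 10 ms of audio

def goertzel_power(samples, sample_rate, freq):
    """Squared DFT magnitude of samples at the bin nearest freq."""
    n = len(samples)
    k = round(n * freq / sample_rate)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2

def has_pilot(frame, threshold=0.01):
    """True if the frame contains a pilot tone louder than threshold (amplitude)."""
    power = max(goertzel_power(frame, SAMPLE_RATE, PILOT_HZ), 0.0)
    amplitude = 2.0 * math.sqrt(power) / len(frame)
    return amplitude > threshold

def tone(freq, amplitude):
    return [amplitude * math.sin(2 * math.pi * freq * n / SAMPLE_RATE) for n in range(FRAME)]

speech_like = tone(300, 0.5)                                  # stand-in for a voice
marked = [a + b for a, b in zip(speech_like, tone(PILOT_HZ, 0.05))]

ignore_marked = has_pilot(marked)        # device output carrying the marker
ignore_plain = has_pilot(speech_like)    # a person speaking, no marker
```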

lazide · on June 29, 2024

Even simpler is to subtract any sound (and associated echoes) output through the speaker from any input received from the microphone.

colanderman · on June 29, 2024

Echo cancellation is much more complex than tone squelching, which is simple enough to be implemented in the analog domain.

brazzy · on June 29, 2024

Echo cancellation was implemented in the analog domain 50 years ago at the least.

It's wild to see someone assume this kind of thing requires machine learning algorithms.

lazide · on June 29, 2024

I'm guessing they were (rightfully) pointing out it's still harder than just ignoring any input if there is a given tone. Which yeah, it is. I'm assuming they'd prefer to use that for things like advertisements though, since the cancellation approach lets them ‘listen’ while still talking/playing music/etc.

xx-bean-xx · on June 29, 2024

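no, because that 20+ kHz noise would mostly be lost in any 44.1 kHz signal processing: nothing above the 22.05 kHz Nyquist limit can be represented, and anti-aliasing filters typically start rolling off around 20 kHz. you would have to guarantee the audio player is capable of higher sample rates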
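The arithmetic behind this, as a small sketch. `alias_hz` assumes an ideal sampler with no anti-aliasing filter, which real audio hardware does not have, so treat it as illustration only.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2

def alias_hz(tone_hz, sample_rate_hz):
    """Apparent frequency of a pure tone after ideal sampling (no anti-alias filter)."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

limit_44k = nyquist_hz(44100)
marker_21k = alias_hz(21000, 44100)   # below the limit, so it keeps its frequency
marker_25k = alias_hz(25000, 44100)   # above the limit, so it folds back down
```

So a 21 kHz marker technically fits under the limit but sits right where filters cut, while a 25 kHz marker would fold back into the audible band.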

astrange · on June 30, 2024

It also isn't resilient to room echo/reflections, or EQ if people hook it up to their own speaker systems.

If you listen to the Apple keynote, anytime someone says Siri the audio has been very audibly low-passed; presumably that's enough.

davidmurdoch · on June 29, 2024

Ah, I didn't think about that. Thanks for pointing that out. Typical mics sample at 48 kHz, but that wouldn't leave much headroom. Echo devices have an array of mics, IIRC, so I wonder if they can effectively sample at a higher rate?

sgc · on June 29, 2024

That was my thought as well. Just use some form of acoustical fingerprinting. It would be very simple to do, and would be both robust and flexible.

throwaway211 · on June 29, 2024

Could two Alexas be put next to each other, hailing themselves into eternity?

brookst · on June 29, 2024

I’ve always wanted to write a hit pop song called “hey Alexa, order twenty pounds of sand”. Only a complete lack of hit songwriting ability has stopped me.

yial · on June 30, 2024

*Verse 1:* Woke up this morning with a plan in my head, Gonna build a castle, need a lot of sand, Grabbed my phone, started to command, Hey Alexa, can you lend a hand?

*Pre-Chorus:* I’m feeling like a kid again, Dreaming of the seaside, In my living room, I’ll make my own tide.

*Chorus:* Hey Alexa, order twenty pounds of sand, We’ll turn my place into a beachy wonderland, Dancing barefoot, got the sun in our hands, Hey Alexa, won’t you understand?

*Verse 2:* Neighbors think I’m crazy, but they don’t know, I’ve got the waves crashing, the good vibes flow, With a little magic from my techy friend, We’re surfing in the living room, it’ll never end.

*Pre-Chorus:* Turn up the heat, let’s make it bright, With a pinch of paradise, Every day feels like a summer night.

*Chorus:* Hey Alexa, order twenty pounds of sand, We’ll turn my place into a beachy wonderland, Dancing barefoot, got the sun in our hands, Hey Alexa, won’t you understand?

*Bridge:* Building castles, making dreams, Feeling free, it’s the ultimate scheme, With every grain, we’re setting the scene, Life is better when it’s sandy and serene.

*Chorus:* Hey Alexa, order twenty pounds of sand, We’ll turn my place into a beachy wonderland, Dancing barefoot, got the sun in our hands, Hey Alexa, won’t you understand?

*Outro:* So next time you’re dreaming of the shore, Just remember what Alexa’s for, A little voice can bring the beach to your door, Hey Alexa, we’re ready for more.

jaredsohn · on June 30, 2024

LLMs can do this for you (although no guarantee it will be a hit)

toomuchtodo · on June 30, 2024

https://www.youtube.com/watch?v=ESt7CTZiXqY (NSFW)

stavros · on June 29, 2024

They'll definitely do some cancellation of the signal they're sending; you can't really hear people saying commands over your own music otherwise.

JoBrad · on June 29, 2024

I’ve interrupted my Echo with “Alexa, stop” or “Alexa <repeated phrase more clearly or louder>” while it is speaking. So it doesn’t stop listening when speaking.

Krastan · on June 29, 2024

The suggestion was to stop listening only while it is saying "Alexa", not whenever it is speaking in general. So to test it, you'd have to say "Alexa" at the same moment it is saying "Alexa".

kQq9oHeAz6wLLS · on June 29, 2024

It could stop listening only when it's saying the word "Alexa", or even just part of that word, and listen the rest of the time.
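A sketch of that kind of gating, assuming the device knows the time spans during which its own speech contained the wake word. The function name and the 0.2 s guard interval are made up for illustration.

```python
def filter_wake_hits(hit_times, self_utterances, guard=0.2):
    """Drop wake-word hits that fall inside (or near) spans where the device said the word.

    hit_times: detection timestamps in seconds.
    self_utterances: (start, end) spans in seconds when the device itself said it.
    guard: extra slack on both sides for playback and detection latency.
    """
    kept = []
    for t in hit_times:
        own = any(start - guard <= t <= end + guard for start, end in self_utterances)
        if not own:
            kept.append(t)
    return kept

hits = [1.10, 3.50, 7.95]
device_said_it = [(1.0, 1.4), (7.2, 7.6)]
kept = filter_wake_hits(hits, device_said_it)
```

The obvious hole is the one pointed out above: a person saying the wake word inside one of those spans gets dropped too.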

matheist · on June 29, 2024

Transfer function would need to be learned online because room reverberations contribute a lot.

ITB · on June 29, 2024

More interesting is when we used to run Alexa commercials and it would cause a denial of service attack on ourselves with all the devices being triggered, particularly during the superbowl. In that case, we added some imperceptible noise to the audio stream so that Alexa wouldn’t trigger.

shagie · on June 29, 2024

The other approach is to remove certain noise.

If there is nothing in the frequency range from 3kHz to 6kHz Alexa won't wake when a wake word is spoken. https://youtu.be/iNxvsxU2rJE doesn't wake up anything.

https://www.theverge.com/2018/2/2/16965484/amazon-alexa-supe...

> Apparently, the Alexa commercials are intentionally muted in the 3,000Hz to 6,000Hz range of the audio spectrum, which apparently tips off the system that the “Alexa” phrase being spoken isn’t in fact a real command and should be ignored.

Compare that with selecting 'Alexa, what time is it' (I'm on a Mac) and doing "speak text". Same speaker (for me with the previous video).

I had one device set with a wake word of "Amazon" but that got really annoying when watching AWS training videos. I believe Ziggy is the best wake word for that reason.

hughesjj · on June 30, 2024

I'm guessing this is because plenty of speakers aren't designed to output sound outside of typical hearing range (because, I mean why bother for a TV etc)

shagie · on June 30, 2024

If you watch https://www.youtube.com/watch?v=lugeruSbnAE you will get Alexa waking up.

Meanwhile https://youtu.be/iNxvsxU2rJE and https://youtu.be/8bACuhV5RPM don't.

Play them on a TV or tiny computer speaker.

You can hear 3 kHz quite well. https://www.youtube.com/watch?v=AacTD0HtadE

And we can hear up to 20 kHz when young (that's nearly 3 octaves higher than 3 kHz). https://www.ncbi.nlm.nih.gov/books/NBK10924/

And while this is audio people talking about sound... https://repforums.prosoundweb.com/index.php?topic=22875.0

> 3k is commonly the most sensitive frequency for human ears - we use it for testing wow and flutter on tape machines; you can hear pitch variation there easiest. 3k is a ballpark, but a good one - I think it works anywhere in that region 2-4k, but usually 3k is THE ONE.

> When I have a harsh fuzz guitar or trashy cymbal, I often try to just tuck down the 3k region, and suddenly it no longer covers everything. It still sounds thick and harsh, but nothing kills your ears.

> ...

> Cutting at 3k can help hide out of tune instruments and bad pitch on vocals as well.

---

Late edit. Extracting the audio of "Alexa Loses Her Voice" and sending it into https://academo.org/demos/spectrum-analyzer/ - watch the range around 5.1 kHz (inside the 3 kHz to 6 kHz octave). It starts out with: https://i.imgur.com/iEe7s6P.png

That spike is the actress saying "Alexa" and you can see the unnatural gap in that range. Also, there is sound in that mp3 all the way out to 14 kHz.
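If you'd rather check for that gap in code than by eye, here is a toy sketch that compares energy in the 3 kHz to 6 kHz band against the band just below it. This is only a guess at what a gap check could look like, not how Alexa does it. The 16 kHz sample rate, the 500 Hz lower edge, and the 1% ratio are assumptions.

```python
import cmath
import math

def band_energy(samples, sample_rate, lo_hz, hi_hz):
    """Sum of squared DFT magnitudes for bins with lo_hz <= freq < hi_hz (naive DFT)."""
    n = len(samples)
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        if lo_hz <= freq < hi_hz:
            x_k = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))
            total += abs(x_k) ** 2
    return total

def has_spectral_gap(samples, sample_rate, ratio=0.01):
    """True if the 3-6 kHz band is nearly empty compared with 0.5-3 kHz."""
    gap = band_energy(samples, sample_rate, 3000, 6000)
    below = band_energy(samples, sample_rate, 500, 3000)
    return gap < ratio * below

RATE, N = 16000, 256

def mix(freqs):
    return [sum(math.sin(2 * math.pi * f * i / RATE) for f in freqs) for i in range(N)]

natural = mix([1000, 2000, 4500])   # stand-in for ordinary speech
notched = mix([1000, 2000])         # stand-in for the ad audio with the band removed

natural_flagged = has_spectral_gap(natural, RATE)
notched_flagged = has_spectral_gap(notched, RATE)
```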

imranq · on June 29, 2024

Wow that's cool - that's like those adversarial attacks they do on self-driving cars to make it think a stop sign is a gorilla

How do you know what noise to add?

kqr · on June 29, 2024

...or did you specifically use some sort of fingerprinting noise that Alexa was programmed to ignore?

JoBrad · on June 29, 2024

How do you differentiate a noise source that contains a signal, vs another sound playing at the same time as your signal?

felixgallo · on June 29, 2024

Not speaking for Alexa or Amazon, where I have worked in the past, but the wakeword detection is done separately in a much lower power, localized model using hardware features. On the downside this means that you can only select from a few wakewords and the filtering etc. are limited (e.g. the local model is not aware of commercials being played concurrently in the world, so has to wake up the bigger model, which can check to see if that's happening). On the positive side, it's much lower power and mitigates concerns that Alexa is listening to normal activity/conversations.

solardev · on June 29, 2024

What happens if you record her saying her own name and then play it back separately? Does she respond to that if she's not actively talking at the moment?

-----

Not directly the same case but similar, Amazon trains Alexa to avoid certain mentions of her in commercials using acoustic fingerprinting techniques: https://www.amazon.science/blog/why-alexa-wont-wake-up-when-...

kqr · on June 29, 2024

I found the scientist! Great idea for an experiment.

CoastalCoder · on June 29, 2024

> What happens if you record her saying her own name

I suggest we don't personify devices.

JoBrad · on June 29, 2024

People have personified devices and machines for centuries. My wife’s laptops have been “Lappy” (from Strongbad) since she’s had one. It would be odd to start drawing a line at a particular type of device.

djbusby · on June 29, 2024

That ship has sailed.

solardev · on June 29, 2024

I dunno, I think I'm on the side of "robots should have rights too". I'm more worried about us oppressing them than the other way around...

ma2rten · on June 29, 2024

This is the same problem as echo cancellation on calls. This is something that is built into a lot of software and hardware.

hoffs · on June 29, 2024

Yeah, just like having a Google Meet call with built-in speakers and microphone: when someone is speaking, their audio gets cancelled out.

nickburns · on June 29, 2024

spitballing: 'her' own hardcoded waveform is a hardcoded wake word exception. i have doubts that it'd be much more complex than that.

what i do find interesting, however, is that, at times, she'll wake to an utterance from some other media i have playing and seems to 'know' immediately that she was inadvertently awoken. the 'listening' tone and 'end listening' tones sound in quick succession. i do not have voice recognition enabled (to the extent that that setting is respected).

01HNNWZ0MV43FF · on June 29, 2024

> seems to 'know' immediately that she was inadvertently awoken. the 'listening' tone and 'end listening' tones sound in quick succession

Speculation:

- To reduce latency, the "listening" tone plays as soon as the wake word chip hears the wake word

- To improve accuracy, the wake word chip keeps a circular buffer of the last couple seconds of audio, and the main CPU / DSP scans that when it wakes up

So you get spurious wakeups exactly the same as a human - You think you hear something, then you re-listen to it in your mind and realize it was something else.
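A sketch of that two-stage flow, to go with the speculation above. `AudioRingBuffer`, `handle_frame`, and both detectors are hypothetical stand-ins, and the tiny sample rate is just to keep the example readable.

```python
from collections import deque

class AudioRingBuffer:
    """Keeps only the most recent `seconds` of audio samples."""

    def __init__(self, sample_rate, seconds):
        self._buf = deque(maxlen=int(sample_rate * seconds))

    def push(self, frame):
        self._buf.extend(frame)   # oldest samples fall off automatically

    def snapshot(self):
        return list(self._buf)

def handle_frame(ring, frame, cheap_detector, careful_detector):
    """Cheap check on each frame; careful re-check over the buffered audio on a hit."""
    ring.push(frame)
    if not cheap_detector(frame):
        return "idle"
    # A real device would play the "listening" tone here, before the re-check.
    return "listening" if careful_detector(ring.snapshot()) else "cancelled"

# Toy detectors: the cheap one fires on any loud sample,
# the careful one wants several loud samples in the buffer.
cheap = lambda frame: max(frame) > 0.5
careful = lambda audio: sum(v > 0.5 for v in audio) >= 3

ring = AudioRingBuffer(sample_rate=10, seconds=2)
states = [
    handle_frame(ring, [0.0] * 10, cheap, careful),
    handle_frame(ring, [0.9] + [0.0] * 9, cheap, careful),      # spurious blip
    handle_frame(ring, [0.9] * 5 + [0.0] * 5, cheap, careful),  # sustained sound
]
```

The "cancelled" case is the quick listening/end-listening tone pair described above.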

pklack · on June 29, 2024

I remember an explanation by some Google employee a while back for the Google Home that was basically this, but a layer further removed. Basically the activation causes audio to start getting streamed to the cloud backend. That runs some more thorough checks on the activation (like checking for massively concurrent activation with the same audio, etc.) and tells the device to stop and disregard the current activation.
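Roughly what the backend side of that could look like, as a sketch. `ActivationMonitor`, the 5 second window, and the device threshold are invented for the example, and the fingerprint is treated as an opaque string because real acoustic fingerprinting is a much bigger topic.

```python
from collections import defaultdict, deque

class ActivationMonitor:
    """Flags activations when many devices report the same audio at nearly the same time."""

    def __init__(self, window_s=5.0, max_devices=3):
        self.window_s = window_s
        self.max_devices = max_devices
        self._seen = defaultdict(deque)   # fingerprint -> deque of (timestamp, device_id)

    def should_reject(self, device_id, fingerprint, now):
        events = self._seen[fingerprint]
        events.append((now, device_id))
        # Forget reports that are older than the window.
        while events and now - events[0][0] > self.window_s:
            events.popleft()
        distinct_devices = {dev for _, dev in events}
        return len(distinct_devices) > self.max_devices

monitor = ActivationMonitor()
ad_results = [
    monitor.should_reject(f"device-{i}", "tv-ad-audio", now=100.0 + 0.1 * i)
    for i in range(5)
]
lone_result = monitor.should_reject("device-9", "someone-in-a-kitchen", now=100.0)
later_result = monitor.should_reject("device-0", "tv-ad-audio", now=200.0)
```

Note the first few devices to report still get through in this version, so a real system would presumably also need to cancel activations already in flight.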

icecube123 · on June 29, 2024

I’ve always thought they might have a method to ignore the wake word if there’s a specific frequency sent at the same time. I’ve noticed that there are sometimes TV commercials that have the “alexa” or “hey google” wake words, but they do not activate the smart speakers. But if the smart speakers hear something close on just a random TV show, they will activate.

But as others have said, they might be able to just sleep the wake algorithm temporarily when they know it’s playing back its own wake word.

caprock · on June 29, 2024

There's usually a very small, purpose built model for hearing the initial starting phrase (Alexa, hey Google, etc). It's called a wake word model or wake word detection. Because it's a separate component, it's then fairly straightforward to disable it during an active conversation.

makerdiety · on June 29, 2024

Easy. Because it's not artificial intelligence (modern advances are only a small subset of AI). It's just an expert system with recursion programmed in.

Real AI doesn't need recursion that is explicitly instructed into its behavior. Because real artificial general intelligence has better things to do than to listen to human advisors and programmers who don't know about effective objective function optimization. Therefore, Alexa gets a rudimentary infinite recursion loop break statement explicitly installed into her by her human shepherds.

Edit: Recursion should be seen as a general, mathematical form of engineering constructs like acoustic echo cancellation and adaptive filtering. Recursion should be what those engineering tools get reduced to being.

smitelli · on June 29, 2024

Here's something I just thought to try (although I can't do it myself; don't/won't own a smart speaker) -- If a person were to play something like "The Downeaster Alexa"[1] in the room, would that wake it up or does the fact that it's sung-not-spoken with music behind it prevent activation?

[1] https://www.youtube.com/watch?v=LESFuoW-T7I

numpad0 · on June 29, 2024

Microphone noise canceling? It's known what waveform is going out, so it should be trivial to subtract playback audio from recorded audio.

chuckadams · on June 29, 2024

"Alexa, what is love?" https://www.youtube.com/watch?v=ESt7CTZiXqY&t=123s

Anyone actually pull this off?

goutham2688 · on June 29, 2024

I assume they're using some form of https://en.m.wikipedia.org/wiki/Speaker_diarisation

next_xibalba · on June 29, 2024

Couldn’t it just shut off input from the mic while speaking?

01HNNWZ0MV43FF · on June 29, 2024

Kinda but not really, you want to allow humans to interrupt it

dtagames · on June 30, 2024

It can be done in a single line of code, like this JS example: "if (wakeWordHeard && !sayingAlexa) doWakeWordCommand();"

cedws · on June 30, 2024

I don’t have one of those Amazon things but from OP’s phrasing I’m guessing that it’s possible to interrupt Alexa even while it’s talking. That would imply this lock isn’t in place.

I’m guessing that the device just cancels out the output waveform from the input.

ww520 · on June 29, 2024

Pre-generate a waveform in exact opposite of the utterance. Add it to the incoming waveform.
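A quick sketch of why this is fragile, assuming an idealized mic that hears the utterance unchanged apart from a delay. The 1 kHz test tone, the 16 kHz sample rate, and the offsets are arbitrary choices for the example.

```python
import math

def cancel_with_inverse(mic, utterance, offset):
    """Add the sign-flipped utterance to the mic signal, starting at sample `offset`."""
    out = list(mic)
    for i, v in enumerate(utterance):
        if 0 <= i + offset < len(out):
            out[i + offset] += -v
    return out

def peak(signal):
    return max(abs(v) for v in signal)

RATE = 16000
utterance = [math.sin(2 * math.pi * 1000 * n / RATE) for n in range(160)]
true_delay = 10
mic = [0.0] * true_delay + utterance + [0.0] * 10

aligned_peak = peak(cancel_with_inverse(mic, utterance, offset=true_delay))
misaligned_peak = peak(cancel_with_inverse(mic, utterance, offset=true_delay + 2))
```

With the exact delay the cancellation is perfect, but being off by two samples (0.125 ms here) leaves most of the tone behind. Room reflections make the mismatch worse, which is presumably why the adaptive approaches mentioned elsewhere in the thread get used instead.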