Today in Seattle, Dave Limp, Amazon’s senior vice president for devices, unveiled the latest lineup of products and services from his organization. During the presentation, Rohit Prasad, Amazon vice president and Alexa head scientist, described three new advances from the Alexa science team. One of those is natural turn-taking.
Alexa’s natural turn-taking feature, which we plan to launch next year, will let customers interact with Alexa more naturally, without having to repeat the wake word. The feature’s AI will be able to recognize when users have finished speaking, when their speech is directed at the device and when it isn’t, and whether a reply is expected.
Natural turn-taking builds on Alexa’s Follow-Up Mode, which uses acoustic cues to distinguish between device-directed and non-device-directed speech. Natural turn-taking adds further cues, such as visual information from devices with cameras. On-device algorithms process images from the camera, inferring from speakers’ body positions whether they are likely to be addressing Alexa.
The output of the computer vision algorithms is combined with the output of Alexa’s existing acoustic algorithm for detecting device-directed speech and fed to an on-device fusion model, which determines device directedness. This approach can distinguish device-directed speech even when multiple speakers are interacting with each other and with Alexa.
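The post doesn’t describe the fusion model’s architecture, but the idea of late fusion is easy to sketch. Below is a minimal illustration in Python, assuming each modality emits a device-directedness probability; the names, weights, and logistic form are placeholders of ours, not Alexa’s actual on-device model.

```python
# A minimal sketch of late fusion over per-modality directedness scores.
# Weights and bias are made-up placeholders; in practice they would be
# learned from labeled device-directed / non-device-directed examples.
import math
from dataclasses import dataclass


@dataclass
class DirectednessScores:
    acoustic: float  # P(device-directed) from the acoustic model, in [0, 1]
    visual: float    # P(addressing the device) from on-device vision, in [0, 1]


def fuse(scores: DirectednessScores,
         w_acoustic: float = 1.4,
         w_visual: float = 1.1,
         bias: float = -1.2) -> float:
    """Combine per-modality scores with a tiny logistic fusion model."""
    z = w_acoustic * scores.acoustic + w_visual * scores.visual + bias
    return 1.0 / (1.0 + math.exp(-z))  # fused P(device-directed)


if __name__ == "__main__":
    # Speaker faces the device and the audio also looks device-directed.
    print(fuse(DirectednessScores(acoustic=0.9, visual=0.8)))  # high
    # Two people chatting with each other, neither facing the device.
    print(fuse(DirectednessScores(acoustic=0.4, visual=0.1)))  # low
```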
One key to natural turn-taking is handling barge-ins, or customer interruptions of Alexa’s output speech. When a customer barges in with a new request (“show me Italian restaurants instead”), Alexa stops speaking and processes the new request.
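As a rough illustration of that control flow, here is a minimal sketch; the `TtsPlayer` stub and the `on_speech` handler are hypothetical names standing in for whatever interfaces the real system uses.

```python
# Illustrative sketch of basic barge-in handling: if device-directed speech
# arrives while text-to-speech playback is active, playback stops and the
# new request is routed onward. All names here are assumptions for
# illustration, not Alexa's actual interfaces.
class TtsPlayer:
    """Stub player that tracks whether synthesized speech is playing."""

    def __init__(self) -> None:
        self.playing = False

    def speak(self, text: str) -> None:
        self.playing = True
        print(f"Alexa: {text}")

    def stop(self) -> None:
        self.playing = False
        print("[playback stopped]")


def on_speech(player: TtsPlayer, utterance: str, device_directed: bool) -> None:
    """Handle incoming speech while Alexa may still be talking."""
    if not device_directed:
        return  # side conversation; not a barge-in
    if player.playing:
        player.stop()  # cut Alexa off mid-response
    print(f"[processing new request: {utterance!r}]")


if __name__ == "__main__":
    player = TtsPlayer()
    player.speak("Here are some Mexican restaurants near you...")
    on_speech(player, "show me Italian restaurants instead", device_directed=True)
```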
In some cases of barge-in, Alexa also needs to know how far she got in her output speech, as that information can be useful to the dialogue manager. We call this scenario contextual barge-in. If, for instance, Alexa is reading out a list of options in response to a customer request, and the customer interrupts to say, “That one”, Alexa knows that “that one” refers to whatever option she was reading at the time of the barge-in.
This feature uses the difference between the time stamps of the start of the interrupted speech and of the interruption itself to determine how far into the speech to look for a referent for the customer’s utterance. That information is passed to the Alexa Conversations dialogue manager, which uses it to determine the proper response to the customer’s utterance.
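Here is one way that timestamp arithmetic could work, as a hedged sketch: given the playback offset at which each option begins, the elapsed time at barge-in identifies the option being read. The function names and offset values are illustrative assumptions, not the production interfaces.

```python
# Sketch of contextual barge-in referent resolution, assuming the client
# reports when TTS playback began, when the barge-in occurred, and the
# playback offset at which each list option starts.
from bisect import bisect_right


def resolve_referent(options: list[str],
                     option_offsets_ms: list[int],
                     tts_start_ms: int,
                     barge_in_ms: int) -> str:
    """Return the option Alexa was reading when the customer barged in.

    option_offsets_ms[i] is the playback offset at which options[i] begins.
    """
    elapsed = barge_in_ms - tts_start_ms  # how far into the speech Alexa got
    index = bisect_right(option_offsets_ms, elapsed) - 1
    return options[max(index, 0)]


if __name__ == "__main__":
    options = ["Piatto Romano", "Trattoria Bella", "Cafe Milano"]
    offsets = [0, 2400, 4800]  # ms into the synthesized list readout
    # The customer says "that one" three seconds after the readout began:
    print(resolve_referent(options, offsets,
                           tts_start_ms=1_000,
                           barge_in_ms=4_000))  # -> "Trattoria Bella"
```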
When natural turn-taking launches, we also plan to beta-test a feature known as user pacing. User pacing relies on several signals to determine whether a customer has finished speaking and whether any additional prompting is needed.
Those signals include space fillers, such as “um” or “uh”; the lengthening of vowels, as in, “Let me seeee … ”; and incomplete utterances, such as “I think I’m going to go with”.
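As a toy illustration of these signals, the sketch below applies text-only heuristics (regular expressions and a trailing-word check) to a transcript. A production system would work on the audio and recognizer output rather than raw text, so treat these patterns as stand-ins, not the actual user-pacing model.

```python
# Toy detectors for the three pacing signals described above: space
# fillers, vowel lengthening, and incomplete utterances. The patterns and
# word lists are illustrative assumptions, not Amazon's method.
import re

FILLERS = re.compile(r"\b(um+|uh+|er+|hmm+)\b", re.IGNORECASE)
LENGTHENED_VOWEL = re.compile(r"([aeiou])\1{2,}", re.IGNORECASE)  # "seeee"
DANGLING_WORDS = {"with", "to", "the", "a", "an", "and", "of"}


def still_thinking(utterance: str) -> bool:
    """Guess whether the customer is pausing mid-thought rather than done."""
    text = utterance.strip().lower()
    words = text.split()
    if FILLERS.search(text):
        return True                      # space fillers: "um", "uh"
    if LENGTHENED_VOWEL.search(text):
        return True                      # vowel lengthening: "let me seeee"
    if words and words[-1] in DANGLING_WORDS:
        return True                      # incomplete: "...going to go with"
    return False


if __name__ == "__main__":
    print(still_thinking("Let me seeee"))                   # True
    print(still_thinking("I think I'm going to go with"))   # True
    print(still_thinking("Play my dinner playlist"))        # False
```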
We are also investigating new techniques for inferring device directedness from the speech signal. Earlier this year, for instance, we reported a method that uses syntactic and semantic characteristics of customer utterances as well as the acoustic characteristics already employed by Follow-Up Mode.
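To make the feature-combination idea concrete, here is a hedged sketch that concatenates a few simple lexical cues with acoustic features and scores them with a single logistic layer. The specific features, names, and weights are illustrative assumptions of ours, not the architecture from that paper.

```python
# Sketch of scoring device directedness from combined lexical/syntactic
# and acoustic features. Everything here is a placeholder: real features
# would come from the recognizer and acoustic front end, and the weights
# would be learned.
import numpy as np

# Openers common in device-directed requests (commands and question words).
DEVICE_QUERY_STARTS = ("play", "set", "turn", "show", "tell", "what", "when")


def lexical_features(transcript: str) -> np.ndarray:
    """Extract a few toy lexical/syntactic cues from an ASR transcript."""
    words = transcript.lower().split()
    return np.array([
        float(bool(words) and words[0] in DEVICE_QUERY_STARTS),  # command-like opener
        float(len(words)),                                       # utterance length
        float(transcript.rstrip().endswith("?")),                # question form
    ])


def directedness_score(transcript: str,
                       acoustic_features: np.ndarray,
                       weights: np.ndarray,
                       bias: float) -> float:
    """Logistic score over concatenated lexical + acoustic features."""
    x = np.concatenate([lexical_features(transcript), acoustic_features])
    return float(1.0 / (1.0 + np.exp(-(weights @ x + bias))))


if __name__ == "__main__":
    acoustic = np.array([0.8, 0.2])                  # placeholder acoustic cues
    weights = np.array([1.5, 0.05, 0.5, 2.0, -1.0])  # made-up, would be learned
    print(directedness_score("play some jazz", acoustic, weights, bias=-2.0))
```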