Everyone’s had the experience of pausing mid-sentence during a conversation, trying to conjure a forgotten word. These pauses can be so pronounced that today’s voice assistants mistake them for the ends of users’ sentences. When this happens, the entire sentence has to be repeated.
This is frustrating for all users, but certain user groups are affected more than others — often, the groups that can benefit the most from voice assistants. During conversations, for example, people with dementia pause more often and for longer durations than others.
At Alexa AI, we experimented with several speech-processing pipelines in an attempt to address this problem. Our most successful approach involved a model that learned to “understand” incomplete sentences. To train that model, we adapted two existing datasets, truncating their sentences and pairing each sentence with a graph-based semantic representation.
One of the truncated-sentence datasets, which we presented at the ACM conference on Conversational User Interfaces (CUI) earlier this year, contains only questions; the other dataset, which we’ll present next week at Interspeech, contains more-general sentences.
The graphs in our datasets capture the semantics of each word in each sentence and the relationships between words. When we truncated the original sentences, we also removed the sections of the graphs contributed by the removed words.
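To make that construction concrete, here is a minimal sketch of the truncation step, under the assumption that each graph is stored as one labeled node per content word plus labeled edges between word positions. The data structures, function name, and example graph are illustrative only, not the format of our released corpora.

```python
# Illustrative sketch of the dataset truncation step described above.
# We assume a toy graph representation: one node per content word, plus
# labeled edges between word indices. These structures are hypothetical,
# not the format of the released corpora.
from dataclasses import dataclass

@dataclass
class SemanticGraph:
    nodes: dict[int, str]             # word index -> semantic label
    edges: set[tuple[int, str, int]]  # (head index, relation, dependent index)

def truncate_example(tokens: list[str], graph: SemanticGraph, cut: int):
    """Drop all tokens from position `cut` onward, and remove the graph
    nodes and edges contributed by the dropped tokens."""
    kept_tokens = tokens[:cut]
    kept_nodes = {i: lbl for i, lbl in graph.nodes.items() if i < cut}
    kept_edges = {(h, rel, d) for (h, rel, d) in graph.edges if h < cut and d < cut}
    return kept_tokens, SemanticGraph(kept_nodes, kept_edges)

# Example: "Who was the father of Prince Harry", truncated after "of"
tokens = ["Who", "was", "the", "father", "of", "Prince", "Harry"]
graph = SemanticGraph(
    nodes={0: "unknown", 3: "father", 6: "Prince_Harry"},
    edges={(3, "ARG1", 0), (3, "ARG2", 6)},
)
partial_tokens, partial_graph = truncate_example(tokens, graph, cut=5)
print(partial_tokens)        # ['Who', 'was', 'the', 'father', 'of']
print(partial_graph.nodes)   # {0: 'unknown', 3: 'father'}
print(partial_graph.edges)   # {(3, 'ARG1', 0)}
```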
We used these datasets to train a model that takes an incomplete sentence as input and outputs the corresponding incomplete semantic graph. The partial graphs, in turn, feed into a model that completes the graph, and its outputs are converted into text strings for downstream processing.
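The sketch below shows the shape of that two-stage pipeline. The class and method names are placeholders invented for illustration, not a real Alexa API: a parser maps the incomplete sentence to a partial graph, a completion model fills in the missing structure, and a generator renders the completed graph back to a text string for downstream processing.

```python
# Hypothetical sketch of the two-stage pipeline described above; all names
# are placeholders, not a real Alexa API.
from typing import Any, Protocol

SemanticGraph = Any  # stands in for whatever graph structure the models share

class PartialParser(Protocol):
    def parse(self, text: str) -> SemanticGraph: ...

class GraphCompleter(Protocol):
    def complete(self, graph: SemanticGraph) -> SemanticGraph: ...

class TextGenerator(Protocol):
    def generate(self, graph: SemanticGraph) -> str: ...

def handle_interrupted_utterance(text: str,
                                 parser: PartialParser,
                                 completer: GraphCompleter,
                                 generator: TextGenerator) -> str:
    partial_graph = parser.parse(text)             # e.g. "Who was the father of"
    completed_graph = completer.complete(partial_graph)
    return generator.generate(completed_graph)     # repaired utterance as plain text
```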
In tests involving semantic parsing, we compared the results of using our repaired utterances with the results of using the full, uninterrupted questions. In the ideal case, the outputs would be the same for both sets of inputs.
In the question-answering context, the model that received our repaired questions answered only 0.77% fewer questions than the model given the full questions. On the more general corpus, we lost only 1.6% in graph similarity F score, which factors in both false positives and false negatives.
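For readers unfamiliar with the metric, the snippet below illustrates how an F score over graph edges can be computed from precision, which penalizes false positives, and recall, which penalizes false negatives. The matching procedure used in our paper may be more involved; this is just the basic idea.

```python
# A minimal illustration of an F score over graph edges, the kind of
# similarity measure referenced above (the papers' exact metric may differ).
def graph_f_score(predicted_edges: set, gold_edges: set) -> float:
    matched = len(predicted_edges & gold_edges)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted_edges)  # penalizes false positives
    recall = matched / len(gold_edges)          # penalizes false negatives
    return 2 * precision * recall / (precision + recall)

# Example: one spurious edge and one missing edge
gold = {("father", "ARG1", "who"), ("father", "ARG2", "Prince_Harry")}
pred = {("father", "ARG1", "who"), ("father", "ARG2", "Susan")}
print(graph_f_score(pred, gold))  # 0.5
```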
More-natural conversation
This work is part of a broader effort to make interactions with Alexa more natural and human-like. To get a sense of the problem we’re trying to address, read the following sentence fragment slowly, focusing on how the addition of each word increases your understanding:
Yesterday Susan ate some crackers with…
Maybe Susan ate crackers with cheese, with a fork, or with her aunt … the ending does not matter. You don’t need to read the end of this sentence to understand that multiple crackers were eaten by Susan yesterday, and you built this understanding word by word.
In conversation, when sentences are left incomplete, people typically ask for a clarification, like Amit’s question in this example:
Susan: “Who was the father of …”
Amit: “Sorry, of who?”
Susan: “Prince Harry”
Amit: “Oh, King Charles III”
Our two papers show that computer systems can successfully understand incomplete sentences, which means that natural interactions like this should be possible.
These findings are of key importance for making Alexa more accessible. People who have dementia find Alexa incredibly useful. They can set reminders, get involved in family mealtimes by choosing recipes, and access music more easily. If future systems can seamlessly recover when someone pauses unexpectedly, then people with dementia will be able to enjoy these benefits with minimal frustration.
Our work also confirms that it is possible to correct speech recognition errors through natural interactions. We all mispronounce words (as when asking the weather in Llanfairpwllgwyngyll), but mispronunciations are particularly common among people with speech impairments, muscular dystrophy, early-stage motor neurone disease, and even hearing impairments.
Similarly, a word spoken mid-utterance can be hard to make out when a dog barks. We show that future voice assistants can identify and clarify misheard words through natural interaction, improving the user experience for people with nonstandard speech. This also improves voice agents' robustness to noisy environments, such as family homes and public spaces.
We hope that releasing our corpora will inspire other researchers to work on this problem too, improving the natural interactivity and accessibility of future voice assistants.