Today in Seattle, Dave Limp, Amazon’s senior vice president for devices, unveiled the latest lineup of products and services from his organization. During the presentation, Rohit Prasad, Amazon vice president and Alexa head scientist, described three new advances from the Alexa science team. One of those is speaking style adaptation.
Alexa’s speech is generated by text-to-speech (TTS) models, which convert the textual outputs of Alexa’s natural-language-understanding models and dialogue managers into synthetic speech.
In recent years, Alexa has been using neural TTS, or TTS based on neural networks, which has enabled not only more natural-sounding speech, but also much greater versatility. Neural TTS enables Alexa to vary her speaking style — newscaster or music style, for instance — and it enables us to transfer prosody, or inflection patterns, from one voice to another.
In human speech, speaking style and prosody are often a matter of context, and for Alexa’s interactions with customers to be as natural as possible, the same should be true for her. Imagine the following exchange, for instance:
Customer: Alexa, play the Village People.
Alexa: Do you mean the band, the album, or the song?
A human speaker would naturally emphasize “band”, “album”, and “song”, the words most strongly correlated with missing information.
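One common way to realize that kind of emphasis in synthetic speech is SSML markup, which Alexa's TTS accepts. The sketch below wraps focus words in `<emphasis>` tags; the helper name and word list are illustrative, not part of Alexa's actual pipeline:

```python
# Sketch: wrap focus words in SSML <emphasis> tags so a TTS engine
# stresses them. The helper name and the word set are illustrative.
def emphasize(text: str, focus_words: set[str]) -> str:
    out = []
    for word in text.split():
        # Compare without trailing punctuation so "song?" matches "song".
        core = word.strip("?,.!")
        if core.lower() in focus_words:
            out.append(word.replace(
                core, f'<emphasis level="moderate">{core}</emphasis>'))
        else:
            out.append(word)
    return "<speak>" + " ".join(out) + "</speak>"

ssml = emphasize("Do you mean the band, the album, or the song?",
                 {"band", "album", "song"})
```

Fed that markup, a TTS engine stresses exactly the three words carrying the missing information.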
With speaking style adaptation, Alexa will begin to vary prosodic patterns in the same way, to fit the conversational context. Similarly, she will vary her tone: a cheerful, upbeat tone might fit some contexts, but it could be grating if Alexa has just failed to complete a request.
One of the models that enable speaking style adaptation generates alternative phrasings in a context-aware way, so that Alexa does not repeat the same question verbatim. In one round of conversation, she might say, “Do you mean the song?”, in another, “Should I play the song, then?”, and so on.
Speaking style adaptation thus represents a step in the direction of concept-to-speech, the envisioned successor of text-to-speech, which takes as input a high-level representation of a concept and has considerable latitude in how to convey it, based on context and other signals. For instance, sometimes the same conceptual content can be conveyed by tone of voice, by explicit linguistic formulation, or by both.
Speaking style adaptation depends on state information from the dialogue manager. That information includes the customer’s intent — the action the customer wants performed, such as playing a song — and slot values — the specific entities involved in the action, such as the song name.
It also includes the current conversational state — opening, development, or closing — and the dialogue manager’s current confidence in its understanding of the dialogue state.
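Collected together, that state might look like the following sketch. The field names and values are assumptions for illustration; the article doesn't specify Alexa's internal schema:

```python
from dataclasses import dataclass, field

# Illustrative container for the dialogue state the speech generator
# consumes; field names are assumptions, not Alexa's real schema.
@dataclass
class DialogueState:
    intent: str                               # e.g. "PlayMusic"
    slots: dict = field(default_factory=dict) # e.g. {"entity": "the Village People"}
    phase: str = "development"                # "opening" | "development" | "closing"
    confidence: float = 0.5                   # dialogue manager's confidence, in [0, 1]

state = DialogueState(
    intent="PlayMusic",
    slots={"entity": "the Village People"},
    phase="development",
    confidence=0.4,
)
```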
First, the state information passes to the speech generator’s rephrasing module, a Transformer-based neural network trained on a large, domain-specific linguistic corpus. Based on the state information, the model produces a list of alternative phrasings.
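In production this step is a learned Transformer; purely to show the interface (state in, list of candidate phrasings out), the stand-in below fills hypothetical templates. The templates and intent name are invented, not drawn from Alexa's corpus:

```python
# Stand-in for the Transformer-based rephrasing module: given state
# information, return candidate phrasings. The templates and the
# "Clarify" intent label are invented; the real model generates
# phrasings learned from a large, domain-specific corpus.
def rephrase(intent: str, slot_type: str) -> list[str]:
    templates = {
        "Clarify": [
            "Do you mean the {slot}?",
            "Should I play the {slot}, then?",
            "Did you want the {slot}?",
        ],
    }
    return [t.format(slot=slot_type) for t in templates.get(intent, [])]

candidates = rephrase("Clarify", "song")
# Each round of conversation can pick a different candidate.
```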
The rephrasings then pass to another neural network that has been trained to identify “focus words” in each sentence, words that are good candidates for particular emphasis in speech.
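The focus-word model is likewise learned; as a rough illustration of its output on the example above, the heuristic below marks the nouns being contrasted. The determiner heuristic is an assumption for the sketch, not the trained model's actual logic:

```python
# Toy heuristic standing in for the trained focus-word model: in a
# clarification question like "Do you mean the band, the album, or
# the song?", the contrasted nouns follow a determiner. The real
# model learns which words to emphasize from data.
def focus_words(sentence: str) -> list[str]:
    tokens = [w.strip("?,.!").lower() for w in sentence.split()]
    return [tokens[i + 1] for i, tok in enumerate(tokens)
            if tok == "the" and i + 1 < len(tokens)]

print(focus_words("Do you mean the band, the album, or the song?"))
# → ['band', 'album', 'song']
```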
The dialogue state information, the rephrasing proposed by the rephrasing module, and the output of the focus word model all pass to another neural network — the articulator — that generates the output speech.
The focus word information, together with the slot information, tells the articulator which words of the input sentence to stress. The confidence scores from the dialogue manager determine the speech style, on a spectrum from low to high excitement.
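Putting those signals together, the articulator's inputs might be combined as in this sketch, which maps confidence to a point on the style spectrum and marks which words to stress. The threshold values and style labels are assumptions, and the real articulator is a neural network generating speech, not a rule:

```python
# Sketch of how the articulator's inputs could combine: the dialogue
# manager's confidence picks a point on the excitement spectrum, and
# focus/slot words get stress marks. Thresholds and label names are
# illustrative assumptions, not Alexa's actual parameters.
def render(sentence: str, stressed: set[str], confidence: float) -> dict:
    if confidence > 0.8:
        style = "high-excitement"
    elif confidence > 0.4:
        style = "neutral"
    else:
        style = "low-excitement"
    words = [{"word": w, "stress": w.strip("?,.!").lower() in stressed}
             for w in sentence.split()]
    return {"style": style, "words": words}

plan = render("Do you mean the song?", {"song"}, confidence=0.3)
```

A low-confidence state thus yields a subdued style with stress on “song”, the word the clarification hinges on.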
It’s still day one, however, and we are experimenting with leveraging other contextual information to further customize Alexa’s responses.