“Alexa, ask Samuel L. Jackson for the weather.”
Samuel L. Jackson: “It’s cloudy with a chance of me.”
Amazon released the Samuel L. Jackson celebrity voice in December 2019. The new voice allowed Alexa’s customers to get the news, weather, jokes, and more from the legendary actor.
“The Samuel L. Jackson celebrity voice was an important milestone in seeing our multi-persona vision come to life,” says Sai Rupanagudi, senior product manager for the Alexa wake word team.
However, while customers enjoyed the interaction with Jackson’s voice, many found the initial experience burdensome.
“You have to ask Alexa to ask Samuel everything. I was under the impression I could have Alexa speak to me in Samuel’s voice rather than the female voice we currently have. Nope, only if I ask Alexa to ask Samuel. Who wants to add more steps than you already have to take to get a response. I don’t get it,” said one reviewer.
Another reviewer said: “While it’s a neat novelty to show friends when they visit…the fact that you have to specifically ask Alexa to ask Sam to do something gets old really fast.”
Rupanagudi and other team members paid close attention to the initial feedback. So did the Alexa text-to-speech team, which responded by further improving the naturalness of Samuel L. Jackson’s voice so that it more closely matches the lively personality of the actor and producer.
“Customer obsession is central to everything we do at Amazon,” says Remus Mois, senior software development manager within the Alexa text-to-speech team. “We take the feedback of our customers seriously. We decided to improve the Samuel L. Jackson celebrity voice by allowing users to invoke Sam Jackson with a new wake word.”
The concept of multiple personas or voice agents working on the same device is an important milestone for the Voice Interoperability Initiative launched by Amazon last year. The initiative’s principal tenet: different services should work seamlessly alongside one another on a single device, and voice-enabled products should be designed to support multiple simultaneous wake words.
“At Alexa, we believe that customers should be able to access their favorite agent or persona directly, be it Alexa, another agent, or a celebrity voice,” says Mois.
The response from beta customers to the new experience was overwhelmingly positive.
“Sam’s voice synthesis is actually amazing, and it no longer feels awkward to invoke him,” said one beta customer. “THANK YOU so much for the custom, and simultaneously active wake word.”
Added another beta user, “It’s great to get Sam’s voice without awkwardly asking ‘Alexa, ask Sam to…’ like I had to before.”
Still another beta user added, “Celebrity voice was fun to use. Wake word was intuitive and easy to use. Enjoyed the personality that comes through with the celebrity voice.”
The task of getting the “Hey Samuel” wake word to coexist with the “Alexa” wake word presented formidable research and engineering challenges. With today’s announcement, Alexa customers can interact with Samuel L. Jackson’s voice directly, simply by saying, “Hey Samuel.”
The research challenges
An interaction with Alexa begins with her name. Only when a device detects Alexa’s wake word does it begin streaming voice data to the cloud.
Developing machine-learning models for the new “Hey Samuel” wake word is one of the more challenging problems Shiv Vitaladevuni and his team have encountered since he joined the Alexa organization in 2013. Vitaladevuni, an Alexa senior machine learning manager, leads the wake word team.
“The Alexa wake word has billions of interactions every week,” says Vitaladevuni. “However, there was a paucity of training data for the ‘Hey Samuel’ wake word. To develop a multi-wake-word model for ‘Hey Samuel’ and Alexa, we had to develop new training and data modeling techniques, while drawing on learnings from the past.”
However, drawing on that past experience brought its own challenges. Researchers had to train the algorithm to recognize the new wake word (“Hey Samuel”) while concurrently detecting the other primary wake words: “Alexa”, “Echo”, “Amazon”, and “Computer”.
Instead of training a model for each wake word separately, Alexa’s scientists leveraged multi-target learning, in which multiple learning tasks are carried out concurrently by exploiting similarities across tasks. In multi-target learning, one input is used to predict multiple outputs. Multi-target training is inherently more complex, given the large number of variables and the speed at which they must be processed.
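To make "one input, multiple outputs" concrete, here is a minimal sketch of a multi-target forward pass. It is not Amazon's model: the layer sizes, the weights, and the function names are all hypothetical, chosen only to show a shared layer feeding one output head per wake word.

```python
import math

# Hypothetical wake-word labels and toy weights, for illustration only.
WAKE_WORDS = ["alexa", "echo", "amazon", "computer", "hey_samuel"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shared_layer(features, weights):
    """Shared hidden layer: every output head reuses these activations."""
    return [math.tanh(sum(w * f for w, f in zip(row, features))) for row in weights]

def multi_target_scores(features, shared_weights, head_weights):
    """One input in, one independent detection score per wake word out."""
    hidden = shared_layer(features, shared_weights)
    return {
        word: sigmoid(sum(w * h for w, h in zip(head_weights[word], hidden)))
        for word in WAKE_WORDS
    }

# Toy parameters: 3 input features, 2 shared hidden units, 5 output heads.
shared_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
head_weights = {
    "alexa": [1.0, -1.0],
    "echo": [-0.5, 0.5],
    "amazon": [0.2, 0.2],
    "computer": [-1.0, -1.0],
    "hey_samuel": [0.7, 1.2],
}

scores = multi_target_scores([0.9, 0.1, -0.3], shared_weights, head_weights)
for word, score in scores.items():
    print(f"{word}: {score:.3f}")
```

Because the shared layer is computed once and reused by every head, adding a wake word in this sketch costs only one extra small head rather than a whole additional model.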
“Multi-target training isn’t an easy task,” says Vitaladevuni, “especially when you are contending with wake words that are a single word (‘Alexa’, ‘Amazon’, ‘Echo’ and ‘Computer’) and a phrase (‘Hey Samuel’). The team had to innovate in a number of areas to solve for this problem. To give just one example, we had to conduct extensive research on developing new data preparation and training techniques to balance the data sets for each word. We have made significant progress in the difficult task of ensuring multi-target training performs with the same accuracy we expect from our devices, and we are continually working to improve.”
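The article does not say which balancing techniques the team developed. As a rough illustration of the general problem, the sketch below uses inverse-frequency weighting, a common generic approach, so that a rare wake word carries as much total weight in training as a frequent one. The label counts and the function name are hypothetical.

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Weight each example by 1 / (count of its label) so that every
    label contributes the same total weight to training."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

# Hypothetical, deliberately skewed label counts.
labels = ["alexa"] * 1000 + ["echo"] * 200 + ["hey_samuel"] * 10
weights = balanced_sample_weights(labels)

# Sum the weights per label to confirm that each label ends up equal.
totals = Counter()
for label, weight in zip(labels, weights):
    totals[label] += weight
print({label: round(total, 6) for label, total in totals.items()})
```

With these weights, the ten "hey_samuel" examples together count as much as the thousand "alexa" examples, which is the kind of imbalance a multi-target model would otherwise tend to ignore.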
The team also had to innovate to deal with false rejects. A false reject occurs when a customer says “Hey Samuel” or “Alexa” but the wake word goes unrecognized. Because no audio is sent to the cloud in that case, the team has no data from those instances to help reduce false rejects.
To get around this obstacle, Alexa’s scientists used transfer learning techniques to train their new multi-wake-word model to accept a wide spectrum of nuances in pronunciation, thereby reducing false rejects. Transfer learning allows a model to take what it has learned in one domain and apply it to another. In this instance, the team trained a baseline model on a medium-vocabulary recognition task and then adapted the model to recognize the “Hey Samuel” wake word more efficiently, using minimal amounts of training data.
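The general shape of transfer learning can be sketched in a few lines. The example below assumes one common variant, in which the pretrained base is frozen and only a small new output head is trained on a handful of examples. The article does not say which parts of the model the team adapted, and every weight, data point, and function name here is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Stand-in for a base model already trained on a larger task.
# These weights are frozen: fine-tuning below never modifies them.
BASE_WEIGHTS = ((0.8, -0.4), (-0.3, 0.9))

def base_features(x):
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in BASE_WEIGHTS]

def head_loss(head, bias, data):
    """Mean binary cross-entropy of the new head on (input, label) pairs."""
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(w * f for w, f in zip(head, base_features(x))) + bias)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def fine_tune_head(data, steps=200, lr=0.5):
    """Train only the new head (weights and bias) with gradient descent."""
    head, bias = [0.0, 0.0], 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            feats = base_features(x)
            p = sigmoid(sum(w * f for w, f in zip(head, feats)) + bias)
            err = p - y  # gradient of cross-entropy w.r.t. the logit
            grad_w = [g + err * f for g, f in zip(grad_w, feats)]
            grad_b += err
        head = [w - lr * g / len(data) for w, g in zip(head, grad_w)]
        bias -= lr * grad_b / len(data)
    return head, bias

# A handful of hypothetical labeled examples (1 = new wake word present).
few_shot = [([1.0, 0.2], 1), ([0.9, -0.1], 1), ([-0.8, 0.3], 0), ([-1.0, -0.2], 0)]

loss_before = head_loss([0.0, 0.0], 0.0, few_shot)
head, bias = fine_tune_head(few_shot)
loss_after = head_loss(head, bias, few_shot)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Only the head's three parameters are learned here, which is why four examples are enough to reduce the loss. The features it builds on come from the base model's earlier training.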
The engineering challenges
The wake word detector, unlike Alexa’s other machine learning systems, must run on-device. On-device computing resources are far more limited than those available in the cloud for Alexa’s other components.
As a result, Alexa scientists and engineers had to develop wake word solutions that could carry out the complex task of detecting two wake words without exceeding the available CPU, memory, and other resources. To complicate matters, the multi-wake-word functionality must run on both older and newer Echo devices.
Amazon’s engineering team also developed inference algorithms that are able to adjust to varying prefixes and their corresponding lengths for wake words that might be used in the future. This will be particularly useful as additional partner agents and personas come online with different lengths and prefixes, and will allow the team to stay true to its vision outlined in the Voice Interoperability Initiative.
While the updated Samuel L. Jackson skill has been released today, it’s still Day One for the wake word team. Now that the team has added one new wake word, it is continuing to break ground in research related to how to add new wake words to a multi-target model, using minimal training data, and without degrading the accuracy of existing wake words.
“With this new ability to develop wake words with little to no prior data, we have the opportunity to support much richer customer experiences on Alexa-enabled devices,” says Rupanagudi. “We’re excited to see how customers respond to this updated experience, and how we will continue to improve the experience for our customers.”