When a customer asks Alexa to play “Hey Jude”, and Alexa responds, “Playing ‘Hey Jude’ by the Beatles,” that response is generated by a text-to-speech (TTS) system, which converts textual inputs into synthetic-speech outputs.
Historically, TTS systems used either concatenative approaches, which string together ultrashort snippets of recorded speech, or classical statistical parametric speech synthesis (SPSS) methods, which generate speech waveforms from scratch by fitting them to statistical models.
Recently, however, voice agents such as Alexa have begun moving to the neural TTS paradigm, in which neural networks synthesize speech. Like all neural networks, neural TTS systems learn from large bodies of training examples. In user studies, subjects tend to rate the speech produced by neural TTS (or NTTS) as much more natural than speech produced through earlier methods.
In general, NTTS models require more data than SPSS models. But recent work suggests that training NTTS systems on examples from several different speakers yields better results with less data. This opens the prospect that voice agents could offer a wide variety of customizable speaker styles, without requiring voice performers to spend days in the recording booth.
At the International Conference on Acoustics, Speech, and Signal Processing, my colleagues and I will present what we believe is the first systematic study of the advantages of training NTTS systems on data from multiple speakers. In tests involving 70 listeners, we found that a model trained on 5,000 utterances from seven different speakers yielded more-natural-sounding speech than a model trained on 15,000 utterances from a single speaker.
An NTTS system trained on data from seven different speakers doesn’t sound like an average of seven different voices. When we train our neural network on multiple speakers, we use a one-hot vector — a string of 0’s with one 1 among them — to indicate which speaker provided each sample. At run time, we can select an output voice by simply passing the network the corresponding one-hot vector.
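To make that mechanism concrete, here is a minimal sketch, in PyTorch, of how a one-hot speaker vector can condition a TTS encoder. The module, layer sizes, and phoneme inventory are illustrative assumptions rather than the architecture we used; note that a linear layer applied to a one-hot vector is simply a learned speaker-embedding lookup.

```python
# Minimal sketch of one-hot speaker conditioning in a multi-speaker TTS
# acoustic model. Names and dimensions are illustrative, not the system
# described in this article.
import torch
import torch.nn as nn

NUM_SPEAKERS = 7          # seven speakers in the multi-speaker corpus
PHONEME_VOCAB = 100       # placeholder phoneme inventory size
PHONEME_EMB_DIM = 256
SPEAKER_EMB_DIM = 64

class MultiSpeakerEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(PHONEME_VOCAB, PHONEME_EMB_DIM)
        # Applying a linear layer to the one-hot speaker vector is equivalent
        # to looking up a learned speaker embedding.
        self.speaker_projection = nn.Linear(NUM_SPEAKERS, SPEAKER_EMB_DIM, bias=False)

    def forward(self, phoneme_ids, speaker_one_hot):
        # phoneme_ids: (batch, time); speaker_one_hot: (batch, NUM_SPEAKERS)
        phones = self.phoneme_embedding(phoneme_ids)                   # (batch, time, 256)
        speaker = self.speaker_projection(speaker_one_hot)             # (batch, 64)
        speaker = speaker.unsqueeze(1).expand(-1, phones.size(1), -1)  # broadcast over time
        return torch.cat([phones, speaker], dim=-1)                    # speaker-conditioned input

# At run time, selecting an output voice is just a matter of which
# one-hot vector we pass to the network:
one_hot = torch.zeros(1, NUM_SPEAKERS)
one_hot[0, 3] = 1.0  # request the voice of speaker index 3
```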
In our user study, we also presented listeners with recordings of a human speaker alongside synthetic speech modeled on that speaker and asked them to judge whether the two voices belonged to the same person. On this test, the NTTS system trained on multiple speakers fared just as well as the one trained on a single speaker. Nor did we observe any statistically significant difference in naturalness between models trained on data from speakers of both genders and models trained on data from speakers of the same gender as the target speaker.
The single-gender model was trained on 5,000 utterances from four female speakers; the mixed-gender model was trained on 5,000 utterances from four female and three male speakers.
Finally, we also found the models trained on multiple speakers to be more stable than models trained on single speakers. NTTS systems sometimes drop words, mumble, or produce “heavy glitches,” where they get stuck repeating a single sound. In our study, the multi-speaker models exhibited these types of errors less frequently than the single-speaker models.
NTTS systems typically consist of two neural networks. The first converts phonetic renderings of text into mel-spectrograms, or 50-millisecond snapshots of the power in a series of frequency bands chosen to emphasize frequencies to which humans are particularly attuned. Because humans can perceive acoustic features shorter than 50 milliseconds in duration, the second network — the vocoder — converts the mel-spectrograms into a finer-grained audio signal.
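As an illustration of that intermediate representation, the short Python sketch below computes log-mel-spectrogram frames from a waveform with librosa. The 50-millisecond analysis window follows the figure in the text, but the file name, hop length, mel-band count, and sample rate are assumptions made for the sake of the example.

```python
# A minimal sketch of the first network's target representation:
# mel-spectrogram frames computed from audio with librosa.
import librosa
import numpy as np

audio, sr = librosa.load("utterance.wav", sr=22050)  # hypothetical recording

frame_length = int(0.050 * sr)   # ~50 ms analysis window per frame
hop_length = int(0.0125 * sr)    # frames overlap, so the hop is shorter

mel = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    n_fft=frame_length,
    hop_length=hop_length,
    n_mels=80,                   # 80 mel bands is a common choice in TTS
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log compression
print(log_mel.shape)             # (80, number_of_frames)
```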
Like NTTS systems, SPSS systems learn to synthesize mel-spectrograms from phonetic data. But with SPSS systems, the vocoders have traditionally been hand-engineered. The versatility and complexity of neural vocoders account for much of the difference in performance between SPSS and NTTS.
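For contrast, the sketch below inverts a mel-spectrogram back to a waveform with Griffin-Lim phase reconstruction, a fixed, non-learned procedure. This is not the vocoder used in classical SPSS systems (those typically relied on parametric vocoders such as WORLD or STRAIGHT), and the settings here are assumptions, but it illustrates the kind of hand-engineered signal-processing step that a trained neural vocoder replaces.

```python
# Hand-engineered mel-spectrogram inversion via Griffin-Lim, standing in for a
# classical, non-learned vocoding step; a neural vocoder would instead be
# trained to generate the waveform directly.
import librosa
import soundfile as sf

y, sr = librosa.load("utterance.wav", sr=22050)  # hypothetical recording
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1102, hop_length=275, n_mels=80)

# Reconstruct a waveform from the mel-spectrogram; settings must match the analysis above.
audio = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1102, hop_length=275)
sf.write("reconstructed.wav", audio, sr)
```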
Our experiments suggest that, beyond 15,000 training examples, single-speaker NTTS models will start outperforming multi-speaker models. To be sure, the NTTS version of Alexa’s current voice was trained on more than 15,000 examples. But mixed models could make it significantly easier to get new voices up and running for targeted applications.
Acknowledgments: Javier Latorre, Jaime Lorenzo-Trueba, Thomas Merritt, Thomas Drugman, Srikanth Ronanki, Viacheslav Klimkov