At this year’s Conference on Empirical Methods in Natural Language Processing (EMNLP), Julia Hirschberg, an Amazon Scholar and a professor of computer science at Columbia University, was one of three area chairs for speech and multimodality, overseeing the review of speech-related paper submissions.
Until last year, however, EMNLP had never had an area chair for speech. Traditionally, natural-language processing (NLP), which seeks to make sense of free-form language, has focused on text; speech technologies, such as automatic speech recognition, were viewed as a way to provide text inputs for NLP systems.
The wide adoption of voice-based technologies, and especially conversational assistants like Alexa, has changed that. As Hirschberg explains, with spoken language, understanding meaning — traditionally the purview of NLP — depends vitally on the acoustic speech signal.
“I do prosody,” Hirschberg says, referring to the varied inflections and rhythms of human speech. “We all produce different prosodic contours, and those contours — and also what you emphasize and don’t emphasize, where you put your pauses — can totally influence what you’re saying. It can make it a completely different thing. So that’s why it’s really important to do both [speech and NLP].”
Moreover, Hirschberg points out, some people have access to NLP technology only through speech. “To do dialogue in text is very cumbersome,” she says. “And there are a lot of people who can’t do it because they’re visually impaired. And some people, particularly in low-resource-language countries, do not know how to read. Millions of people are in that state, so there’s been a lot of work on how you can speak to technology and work on low-resource languages.”
One of the Amazon projects that Hirschberg is involved in is natural turn-taking, a feature that will allow multiple customers to converse with Alexa without repeating the wake word "Alexa." Distinguishing device-directed speech is an example of a problem that is better solved by combining acoustic features of the speech signal with semantic interpretation than by considering either in isolation.
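To make the fusion idea concrete, the sketch below shows a minimal late-fusion classifier, in PyTorch, that scores whether an utterance is device-directed by combining an acoustic embedding with a semantic embedding. It is purely illustrative: the class name, feature dimensions, and embeddings are assumptions for the example, not a description of how Alexa's system is actually built.

```python
# Illustrative late-fusion sketch (not Amazon's implementation): score whether
# speech is device-directed from an acoustic embedding (e.g., prosodic/spectral
# features) plus a semantic embedding (e.g., from a text encoder over the ASR
# hypothesis). All names and dimensions here are hypothetical.
import torch
import torch.nn as nn

class DeviceDirectedClassifier(nn.Module):
    def __init__(self, acoustic_dim=128, semantic_dim=256, hidden_dim=64):
        super().__init__()
        # Project each modality separately, then score the fused representation.
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden_dim)
        self.semantic_proj = nn.Linear(semantic_dim, hidden_dim)
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, 1),  # fused features -> single logit
        )

    def forward(self, acoustic_emb, semantic_emb):
        fused = torch.cat(
            [self.acoustic_proj(acoustic_emb), self.semantic_proj(semantic_emb)],
            dim=-1,
        )
        # Probability that the utterance is addressed to the device.
        return torch.sigmoid(self.head(fused))

# Toy usage with random tensors standing in for real feature extractors.
model = DeviceDirectedClassifier()
p = model(torch.randn(1, 128), torch.randn(1, 256))
print(f"P(device-directed) = {p.item():.2f}")
```

The point of the sketch is simply that both signals feed one decision: either modality alone can be ambiguous, but a model that sees acoustic and semantic evidence together can separate device-directed requests from side conversation more reliably.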
“Another thing that we’re working on now is trying to understand pacing, so you know when they’ve finished talking,” Hirschberg says. “You don’t want to interrupt, but you don’t want to wait too long. For example, we found in a long study that when human beings are really talking with each other in a good way, they tend to overlap about two milliseconds of their speech. And that’s something that dialogue systems typically do not do.”
The research area that Hirschberg chairs at EMNLP is not just voice, however, but voice and multimodality. Alexa researchers’ work on natural turn-taking doesn’t just integrate semantic analysis with acoustic analysis; it also integrates it with computer vision, which helps distinguish customers’ conversations with each other from their device-directed speech.
“We’re getting much, much more multimodal now,” Hirschberg says. “That’s the wave of the future.”
In the same way that Hirschberg is interested in how spoken-language-understanding systems can infer meaning from prosody, she’s also interested in how text-to-speech systems can use prosody to convey meaning.
“What I’m interested in, which I haven’t worked on before, is empathetic speech,” Hirschberg says. “And why it’s so cool is that you need to understand the context to be appropriately empathetic. It’s not just about imitation: you don’t always want to make your speech sound like the other person. Say the person’s really mad. You do not want to sound mad, right?”
Given the ways in which speech and NLP complement each other, Hirschberg says, “I think this [the speech track] was a very good start to making EMNLP broader. They may also want to include multimodal data, because that’s becoming huge nowadays. If they want to keep up with the AI wave, I think it would be a good idea to do that as well. And I think it would be good for people who are basically based in text-based things to understand how other aspects of language are important as well. I think it’s a good start, and I hope it continues.”