ASRU: Integrating speech recognition and language understanding

July 6, 2024

0 Views 0

SaveSavedRemoved 0

Jimmy Kunzmann, a senior manager for applied science with Alexa AI, is one of the sponsorship chairs at this year’s IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). His research team also presented two papers at the conference, both on the topic of “signal-to-interpretation”, or the integration of automatic speech recognition and natural-language understanding into a single machine learning model.

Jimmy Kunzmann, a senior manager for applied science with Alexa AI and a sponsorship chair at the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

“Signal-to-interpretation derives the domain, intent, and slot values directly from the audio signal, and it’s becoming more and more a hot topic in research land,” Kunzmann says. “Research is driven largely by what algorithm gives the best performance in terms of accuracy, and signal-to-interpretation can drive accuracy up and latency and memory footprint down.”

The Alexa AI team is constantly working to improve Alexa’s accuracy, but its interest in signal-to-interpretation stemmed from the need to ensure Alexa’s availability on resource-constrained devices with intermittent Internet connections.

“If Internet connectivity drops all of a sudden, and nothing is working anymore, in a home or car environment, that’s frustrating — when your lights are not switched on anymore, or you can’t call your favorite contacts in your car,” Kunzmann says.

Kunzmann says that his team’s early work concentrated on finding techniques to dramatically reduce the memory footprint of models that run on-device — techniques such as perfect hashing. But that work still approached automatic speech recognition (ASR) and natural-language understanding (NLU) as separate, sequential tasks.

More recently, he says, the team has moved to end-to-end neural-network-based models that tightly couple ASR and NLU, enabling more compact on-device models.

“By replacing traditional techniques with neural techniques, we could get a smaller footprint — and faster and more accurate models, actually,” Kunzmann says. “And the closer we couple all system components, the further we increasing reliability.”

Running end-to-end models on device can also improve responsiveness, Kunzmann says.

“Fire TV customers said that when we process requests like switching channels or proceeding to the next page on-device, we are much faster, and usability goes up,” he says

At ASRU, Kunzmann’s team is reporting on two new projects to make on-device, neural, signal-to-interpretation models even more useful.