The International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024) takes place April 14–19 in Seoul, South Korea. Amazon is a bronze sponsor of “the world’s largest and most comprehensive technical conference focused on signal processing and its applications.”
Amazon’s presence includes the Trustworthy Speech Processing workshop, two of whose organizers are researchers with Amazon’s Artificial General Intelligence (AGI) Foundations organization: Anil Ramakrishna, senior applied scientist, and Rahul Gupta, senior manager of applied science. In addition, Wontak Kim, senior manager of research science with Amazon Devices, will present a spotlight talk titled “Synthetic data for algorithm development: Real-world examples and lessons learned.”
As in previous years, many of Amazon’s accepted papers focus on automatic speech recognition, and topics such as speech enhancement, spoken-language understanding, and wake word recognition are well represented. This year’s publications also touch on dialogue, paralinguistics, pitch estimation, and responsible AI. Below is a quick guide to Amazon’s more than 20 papers at the conference.
Addressee detection
Long-term social interaction context: The key to egocentric addressee detection
Deqian Kong, Furqan Khan, Xu Zhang, Prateek Singhal, Ying Nian Wu
Audio event detection
Cross-triggering issue in audio event detection and mitigation
Huy Phan, Byeonggeun Kim, Vu Nguyen, Andrew Bydlon, Qingming Tang, Chieh-Chi Kao, Chao Wang
Automatic speech recognition (ASR)
Max-margin transducer loss: Improving sequence-discriminative training using a large-margin learning strategy
Rupak Vignesh Swaminathan, Grant Strimel, Ariya Rastrow, Harish Mallidi, Kai Zhen, Hieu Duy Nguyen, Nathan Susanj, Thanasis Mouchtaris
Promptformer: Prompted conformer transducer for ASR
Sergio Duarte Torres, Arunasish Sen, Aman Rana, Lukas Drude, Alejandro Gomez Alanis, Andreas Schwarz, Leif Rädel, Volker Leutnant
Significant ASR error detection for conversational voice assistants
John Harvill, Rinat Khaziev, Scarlett Li, Randy Cogill, Lidan Wang, Gopinath Chennupati, Hari Thadakamalla
Task-oriented dialogue as a catalyst for self-supervised automatic speech recognition
David M. Chan, Shalini Ghosh, Hitesh Tulsiani, Ariya Rastrow, Björn Hoffmeister
Computer vision
Skin tone disentanglement in 2D makeup transfer with graph neural networks
Masoud Mokhtari, Fatima Taheri Dezaki, Timo Bolkart, Betty Mohler Tesch, Rahul Suresh, Amin Banitalebi
Dialogue
Turn-taking and backchannel prediction with acoustic and large language model fusion
Jinhan Wang, Long Chen, Aparna Khare, Anirudh Raju, Pranav Dheram, Di He, Minhua Wu, Andreas Stolcke, Venkatesh Ravichandran
Paralinguistics
Paralinguistics-enhanced large language modeling of spoken dialogue
Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yi Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko
Pitch estimation
Noise-robust DSP-assisted neural pitch estimation with very low complexity
Krishna Subramani, Jean-Marc Valin, Jan Buethe, Paris Smaragdis, Mike Goodwin
Responsible AI
Leveraging confidence models for identifying challenging data subgroups in speech models
Alkis Koudounas, Eliana Pastor, Vittorio Mazzia, Manuel Giollo, Thomas Gueudre, Elisa Reale, Giuseppe Attanasio, Luca Cagliero, Sandro Cumani, Luca de Alfaro, Elena Baralis, Daniele Amberti
Speaker recognition
Post-training embedding alignment for decoupling enrollment and runtime speaker recognition models
Chenyang Gao, Brecht Desplanques, Chelsea J.-T. Ju, Aman Chadha, Andreas Stolcke
Speech enhancement
NoLACE: Improving low-complexity speech codec enhancement through adaptive temporal shaping
Jan Buethe, Ahmed Mustafa, Jean-Marc Valin, Karim Helwani, Mike Goodwin
Real-time stereo speech enhancement with spatial-cue preservation based on dual-path structure
Masahito Togami, Jean-Marc Valin, Karim Helwani, Ritwik Giri, Umut Isik, Mike Goodwin
Scalable and efficient speech enhancement using modified cold diffusion: A residual learning approach
Minje Kim, Trausti Kristjansson
Spoken-language understanding
S2E: Towards an end-to-end entity resolution solution from acoustic signal
Kangrui Ruan, Cynthia He, Jiyang Wang, Xiaozhou Joey Zhou, Helian Feng, Ali Kebarighotbi
Towards ASR robust spoken language understanding through in-context learning with word confusion networks
Kevin Everson, Yi Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke
Text-to-speech
Mapache: Masked parallel transformer for advanced speech editing and synthesis
Guillermo Cambara Ruiz, Patrick Tobing, Mikolaj Babianski, Ravichander Vipperla, Duo Wang, Ron Shmelkin, Giuseppe Coccia, Orazio Angelini, Arnaud Joly, Mateusz Lajszczak, Vincent Pollet
Wake word recognition
Hot-fixing wake word recognition for end-to-end ASR via neural model reprogramming
Pin-Jui Ku, I-Fan Chen, Huck Yang, Anirudh Raju, Pranav Dheram, Pegah Ghahremani, Brian King, Jing Liu, Roger Ren, Phani Nidadavolu
Maximum-entropy adversarial audio augmentation for keyword spotting
Zuzhao Ye, Gregory Ciccarelli, Brian Kulis
On-device constrained self-supervised learning for keyword spotting via quantization aware pre-training and fine-tuning
Gene-Ping Yang, Yue Gu, Sashank Macha, Qingming Tang, Yuzong Liu