Amazon adds Catalan to MASSIVE dataset


Earlier this year, we released MASSIVE, a million-record natural-language-understanding (NLU) dataset composed of human-translated utterances spanning 51 languages, 18 domains, 60 intents, and 55 slot types. We are pleased to announce the release of MASSIVE 1.1, which includes new data for the Catalan language.

Antoni Gaudí’s Sagrada Família basilica in the Catalan capital, Barcelona.

Mapics / stock.adobe.com

Instructions for downloading MASSIVE 1.1 can be found at our GitHub repository, alexa/massive. The dataset is also available from Hugging Face. For more information on the dataset, please see our paper.
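For readers who want to explore the downloaded data, here is a minimal sketch of counting intents in one locale's file. It assumes the data ships as JSON Lines with fields named `locale`, `partition`, `scenario`, `intent`, and `utt`; these names, the sample rows, and the helper functions `read_records` and `intent_counts` are illustrative assumptions, so check the repository README for the actual layout.

```python
import json
from collections import Counter
from typing import Iterable, Iterator

# Made-up rows for illustration only; they imitate the ASSUMED JSON Lines layout
# and are not real MASSIVE records.
SAMPLE_JSONL = """\
{"id": "0", "locale": "ca-ES", "partition": "train", "scenario": "alarm", "intent": "alarm_set", "utt": "example utterance one"}
{"id": "1", "locale": "ca-ES", "partition": "test", "scenario": "weather", "intent": "weather_query", "utt": "example utterance two"}
{"id": "2", "locale": "ca-ES", "partition": "train", "scenario": "alarm", "intent": "alarm_set", "utt": "example utterance three"}
"""


def read_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def intent_counts(records: Iterable[dict], partition: str) -> Counter:
    """Count intent labels among the records in the given partition."""
    return Counter(r["intent"] for r in records if r["partition"] == partition)


if __name__ == "__main__":
    records = list(read_records(SAMPLE_JSONL.splitlines()))
    print(intent_counts(records, "train"))
```

To run this against a real file, replace `SAMPLE_JSONL.splitlines()` with an open file handle for the locale of interest.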

One immediate customer of the additional Catalan data is the Barcelona Supercomputing Center.

“Project AINA, dedicated to creating an advanced AI infrastructure for the Catalan language, is very excited about the inclusion of our language in the MASSIVE 1.1 dataset,” said Carlos Rodríguez Penagos, a researcher with the center’s text-mining unit. “This is a big step forward for digital assistants and chatbots that are able to converse fluently with people in their own language, a vital requirement of modern digital ecosystems. Amazon’s addition of Catalan to the MASSIVE dataset is very good news for languages that up to now were not well represented in the online platforms that we use daily. We will add this task to the CLUB [the Catalan Language Understanding Benchmark], the AI performance reference for this language. Thanks, Amazon, for this important initiative.”

We are excited to see how MASSIVE 1.1 can be used for Catalan and for all 52 included languages, as we continue our progress toward a technology for understanding all of the world’s languages.

We hope that you will join us at the Massively Multilingual Natural Language Understanding (MMNLU) workshop, co-located with the Conference on Empirical Methods in Natural Language Processing (EMNLP), on December 7.

Keep building.

Acknowledgments: I would like to acknowledge the following people for executing the Catalan data collection for MASSIVE: Ana Sanchez, Aaron Nash, Liam Urbach, Wouter Leeuwis, Christopher Hench, Charith Peris, Kay Rottmann, Gokhan Tur, and Prem Natarajan.




