Amazon adds Catalan to MASSIVE dataset

June 20, 2024


Earlier this year, we released MASSIVE, a million-record natural-language-understanding (NLU) dataset composed of human-translated utterances spanning 51 languages, 18 domains, 60 intents, and 55 slot types. We are pleased to announce the release of MASSIVE 1.1, which includes new data for the Catalan language.

Antoni Gaudí’s Sagrada Família basilica in the Catalan capital, Barcelona.

Mapics / stock.adobe.com

Instructions for downloading MASSIVE 1.1 can be found in our GitHub repository, alexa/massive. The dataset is also available from Hugging Face. For more information on the dataset, please see our paper.
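Once the data is downloaded, a quick sanity check is to count how many utterances carry each intent label. The sketch below is illustrative only. It assumes the download contains one JSON Lines file per locale, and that each record has a `partition` field and an `intent` field. The helper name `count_intents` is hypothetical. Check these details against the repository's README before relying on them.

```python
import json
from collections import Counter
from typing import Iterable


def count_intents(lines: Iterable[str], partition: str = "train") -> Counter:
    """Count intent labels among JSON Lines records in one partition.

    Assumes (not confirmed by this article) that each record has
    "partition" and "intent" keys.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("partition") == partition:
            counts[record["intent"]] += 1
    return counts
```

Passing an open file handle for the Catalan file (opened with `encoding="utf-8"`) to `count_intents` would return a `Counter`, whose `most_common()` method lists the most frequent intents first. Taking an iterable of lines rather than a path keeps the function easy to try on a few hand-written records.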

One immediate customer of the additional Catalan data is the Barcelona Supercomputing Center.

“Project AINA, dedicated to creating an advanced AI infrastructure for the Catalan language, is very excited about the inclusion of our language in the MASSIVE 1.1 dataset,” said Carlos Rodríguez Penagos, a researcher with the center’s text-mining unit. “This is a big step forward for digital assistants and chatbots that are able to converse fluently with people in their own language, a vital requirement of modern digital ecosystems. Amazon’s addition of Catalan to the MASSIVE dataset is very good news for languages that up to now were not well represented in the online platforms that we use daily. We will add this task to the CLUB [the Catalan Language Understanding Benchmark], the AI performance reference for this language. Thanks, Amazon, for this important initiative.”