20B-parameter Alexa model sets new marks in few-shot learning


Most major advances in AI have come from supervised learning, in which machine learning models are trained on annotated data. But as commercial AI models continue to increase in scale, relying on data annotation is becoming unsustainable.

At Alexa AI, we are moving to the new paradigm of generalizable intelligence, in which models can learn new concepts and transfer knowledge from one language or task to another with minimal human input. Such models allow us to develop new features efficiently and improve Alexa in multiple languages at the same time.

As part of this move, we have introduced Transformer-based large-scale multilingual language models we call Alexa Teacher Models (AlexaTM). Given only a few examples of a task in a new language, AlexaTM can transfer what it knows to the new language with no extra human supervision.


In a paper we’re presenting at this year’s Knowledge Discovery and Data Mining Conference (KDD), we showed that 10-billion- and two-billion-parameter AlexaTM models can improve on the state of the art in cross-lingual transfer learning and increase Alexa’s accuracy in different locales.

In a follow-up paper, which we’ve published on arXiv, we have taken this line of research a step further, with a 20-billion-parameter generative model called AlexaTM 20B. The experiments reported in the paper — which use only publicly available data — show that AlexaTM 20B can not only transfer what it learns across languages but also learn new tasks from just a handful of examples (few-shot learning).

In the example below, the model is provided with three examples of different intents, or tasks that the customer wants executed: book-restaurant, play-music, and get-weather. The model can generalize from these to the unfamiliar intent get-news-update and generate utterances corresponding to that intent in different languages. This allows us to develop new features more rapidly, in multiple languages simultaneously.

Using AlexaTM 20B to generate annotated data for a new intent in different languages.
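As a rough sketch of how such a prompt can be put together, the snippet below assembles a few-shot input from three labeled utterances and asks the model to produce an utterance for the unseen get-news-update intent in a target language. The prompt layout, markers, and example utterances are illustrative assumptions, not the exact format used in the paper.

```python
# Illustrative few-shot prompt for generating training data for a new intent.
# The "intent: ... => utterance: ..." layout is an assumption, not the paper's format.
EXAMPLES = [
    ("book-restaurant", "Book a table for two at an Italian place tonight"),
    ("play-music", "Play some relaxing jazz in the living room"),
    ("get-weather", "What's the weather like in Seattle tomorrow"),
]

def build_prompt(new_intent: str, language: str) -> str:
    """Build a few-shot prompt asking the model to invent an utterance
    for an intent it has never seen, in the requested language."""
    lines = [f"intent: {intent} => utterance: {utt}" for intent, utt in EXAMPLES]
    lines.append(f"intent: {new_intent} ({language}) => utterance:")
    return "\n".join(lines)

print(build_prompt("get-news-update", "German"))
```

Feeding the resulting string to the model, it continues the final line with a plausible utterance for the new intent, which can then be used as synthetic training data.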

Our work is inspired by OpenAI’s recent development of the GPT-3 model. However, where other large language models use decoder-only architectures, the AlexaTM 20B model is a sequence-to-sequence (seq2seq) encoder-decoder.

In an encoder-decoder architecture, the encoder produces a representation of an input text using a bidirectional encoding, and the decoder uses that representation to perform some task — historically, generating a translation of the input.

In a language model with an encoder-decoder architecture, the encoder produces a representation of an input text using a bidirectional encoding, and the decoder uses that representation to predict the next tokens (such as words and punctuation) in the sequence.

By contrast, a decoder-only model uses a left-to-right (unidirectional) encoding of the input text. This works well for language modeling, in which the task is to predict the next token in a sequence based on those that precede it, but it’s less effective for machine translation and text summarization, the tasks on which AlexaTM 20B outperforms GPT-3.

A decoder-only language model uses left-to-right (unidirectional) encoding of the input text.
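To make the contrast concrete, here is a minimal sketch, in plain PyTorch rather than code from the AlexaTM implementation, of the two attention patterns: the fully connected, bidirectional mask an encoder applies to its input and the lower-triangular, causal mask a decoder-only model applies.

```python
import torch

def bidirectional_mask(seq_len: int) -> torch.Tensor:
    # Encoder-style: every token may attend to every other token in the input.
    return torch.ones(seq_len, seq_len).bool()

def causal_mask(seq_len: int) -> torch.Tensor:
    # Decoder-only style: token i may attend only to tokens 0 through i.
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

print(bidirectional_mask(4).int())  # all ones: full bidirectional context
print(causal_mask(4).int())         # lower triangle: left-to-right context only
```

In AlexaTM 20B, the encoder reads the input under the bidirectional pattern, while the decoder generates output tokens one at a time under the causal pattern.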

AlexaTM 20B also tops GPT-3 by being multilingual, supporting Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu. And its carbon footprint during training is only one-fifth of GPT-3’s, thanks to its lower parameter count and internal improvements to our training engine.


To train AlexaTM 20B, we break with convention, training on a mix of denoising and causal-language-modeling (CLM) tasks. On the denoising task, the model is required to find dropped spans and generate the complete version of the input. This is similar to how other seq2seq models like T5 and BART are trained. On the CLM task, the model is required to meaningfully continue the input text. This is similar to how decoder-only models like GPT-3 and PaLM are trained.

Training on a mix of these two pretraining tasks enables AlexaTM 20B to generalize based on the given input and generate new text (the CLM task), while also performing well on tasks that seq2seq models are particularly good at, such as summarization and machine translation (the denoising task).

AlexaTM 20B pre-training objectives. During pretraining, the model is trained on the denoising task 80% of the time and on causal language modeling (CLM) 20% of the time.
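The sketch below illustrates that 80/20 mixture at the level of individual training examples. The token-level corruption, the 50% prefix split, and the [CLM] marker string are illustrative assumptions (the paper corrupts contiguous spans and defines its own special tokens), so treat this as a schematic rather than the actual data pipeline.

```python
import random

def make_denoising_example(tokens, drop_prob=0.15):
    # Randomly drop pieces of the input; the target is the complete original
    # sequence. (The real objective drops contiguous spans; rates are illustrative.)
    kept = [t for t in tokens if random.random() > drop_prob]
    return {"source": kept, "target": tokens}

def make_clm_example(tokens, prefix_fraction=0.5):
    # Keep a prefix of the text and ask the model to continue it.
    # "[CLM]" is an illustrative stand-in for the mode-marker token.
    cut = max(1, int(len(tokens) * prefix_fraction))
    return {"source": ["[CLM]"] + tokens[:cut], "target": tokens[cut:]}

def make_pretraining_example(tokens, clm_prob=0.2):
    # Denoising 80% of the time, causal language modeling 20% of the time.
    if random.random() < clm_prob:
        return make_clm_example(tokens)
    return make_denoising_example(tokens)

print(make_pretraining_example("the quick brown fox jumps over the lazy dog".split()))
```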

For example, we demonstrated that, given a single article-summary pair, AlexaTM 20B can generate higher-quality summaries in English, German, and Spanish than the much larger PaLM 540B model can (see the example below).


Moreover, AlexaTM 20B achieves state-of-the-art performance in few-shot machine translation (MT) across almost all language pairs supported by the model on the Flores-101 dataset. The gains in translating to and from low-resource languages like Marathi, Tamil, and Telugu are particularly significant (e.g., 21.8 Arabic-to-Tamil sentence-piece BLEU score compared to 0.9 for the supervised M2M-124 615M model).

These results suggest that large-scale seq2seq-style pretraining, as formulated in our work, improves MT for languages with few training pairs, especially when a large amount of monolingual data is available for the target language. AlexaTM 20B also has no difficulty translating directly between different languages, in contrast to many-to-many MT systems that require parallel translation data for training.
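As with the intent example above, few-shot MT amounts to putting a handful of translation pairs in the model’s input and letting it continue the pattern for a new source sentence. The prompt format below is an illustrative assumption, not the exact template used in the Flores-101 evaluation.

```python
# Illustrative few-shot translation prompt; the layout is an assumption,
# not the template used for the Flores-101 experiments.
def build_mt_prompt(pairs, source_sentence, src_lang, tgt_lang):
    blocks = [f"{src_lang}: {src}\n{tgt_lang}: {tgt}" for src, tgt in pairs]
    blocks.append(f"{src_lang}: {source_sentence}\n{tgt_lang}:")
    return "\n\n".join(blocks)

few_shot_pairs = [
    ("The weather is nice today.", "Das Wetter ist heute schön."),
    ("Where is the train station?", "Wo ist der Bahnhof?"),
]
print(build_mt_prompt(few_shot_pairs, "I would like a cup of coffee.",
                      src_lang="English", tgt_lang="German"))
```

The model is expected to complete the final line with the translation of the new sentence; the same pattern applies to low-resource pairs such as Arabic-to-Tamil.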

News summarization by AlexaTM 20B when given only a single example. The input to the encoder is in the yellow box, the decoder’s output in the pink box.

AlexaTM 20B is the largest multilingual seq2seq model to date that is also capable of few-shot learning. We will be releasing the model publicly for non-commercial use to aid the development and evaluation of multilingual large language models (LLMs). We have also implemented a function for loading the model across up to eight GPUs with limited GPU memory, so that inference can run on instances of Amazon Web Services’ EC2 compute service. This gives researchers a more flexible way to use AlexaTM 20B in their own work.
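For readers who want a concrete starting point, the sketch below shows one common way to shard a large seq2seq checkpoint across several GPUs for inference, using the Hugging Face transformers and accelerate libraries. The checkpoint path, memory budget, and prompt are placeholders, and the released AlexaTM 20B artifacts may ship with their own loading utilities, so treat this as a generic pattern rather than the specific function described above.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint path; the released model may require its own loading code.
MODEL_ID = "path/to/alexatm-20b-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half-precision weights to fit in limited GPU memory
    device_map="auto",           # shard layers across the available GPUs (e.g., eight)
    max_memory={i: "40GiB" for i in range(torch.cuda.device_count())},  # example budget
)

# One-shot style input: a worked example followed by the new text to process.
# The "[CLM]" marker and the prompt layout are illustrative assumptions.
prompt = "[CLM] Article: ...\nSummary: ...\n\nArticle: ...\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```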

In an analysis reported in our paper, we found that AlexaTM 20B, like other LLMs, has some likelihood of reproducing toxic language, social biases, and harmful stereotypes found in its training data. We therefore recommend that users conduct a full task-specific fairness-and-bias analysis before using the model, to fully understand and address any potential harm that might arise from its use. Depending on the downstream application, one or more detoxification and debiasing techniques from the literature may be applied to the model. We reiterate the importance of task-specific fairness auditing and emphasize the need for more research on bias measurement and mitigation from the community.

All in all, we demonstrated in our work that the proposed style of pretraining enables seq2seq models that outperform much larger decoder-only LLMs across different tasks, both in a few-shot setting and with fine-tuning. We hope our work presents a compelling case for seq2seq models as a powerful alternative to decoder-only models for LLM training.




