Few-shot learning is a technique in which we train a general machine learning model on a set of related tasks and then adapt it to new tasks with only a handful of training examples. This sharing of knowledge across tasks is called transfer learning.
In a paper we presented at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), we show how we can use question answering as a base task and achieve effective transfer learning between natural-language understanding (NLU) tasks by treating them as if they were question-answering tasks.
For instance, consider the task of intent classification, which is a mainstay of voice agents such as Alexa. If an Alexa customer says, “Alexa, play the album Innervisions”, the intent is play_music (as opposed to, say, check_weather or set_timer). The task of intent classification can be recast as the answering of a question, such as “Is the intent play_music?”
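To make the recasting concrete, here is a minimal sketch of how a labeled utterance might be turned into question-answer pairs. The question template, candidate-intent list, and data format are our illustrative assumptions, not necessarily the exact scheme from the paper:

```python
# Minimal sketch of recasting intent classification as question answering.
# The question template and example format are illustrative assumptions,
# not necessarily the exact scheme used in the paper.

INTENTS = ["play_music", "check_weather", "set_timer"]

def intent_to_qa_examples(utterance, true_intent):
    """Turn one labeled utterance into (context, question, answer) triples."""
    examples = []
    for intent in INTENTS:
        examples.append({
            "context": utterance,
            "question": f"Is the intent {intent}?",
            "answer": "yes" if intent == true_intent else "no",
        })
    return examples

for ex in intent_to_qa_examples("Alexa, play the album Innervisions", "play_music"):
    print(ex)
```

In a purely extractive QA setup, the literal words "yes" and "no" would also have to appear somewhere in the context, for instance by prefixing them to the utterance; we gloss over that detail here.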
In our paper, we show that if a model has been trained to do question answering (QA), this kind of task recasting lets it transfer knowledge to other NLU tasks much more efficiently than it would otherwise. We call this method QANLU.
Across numerous experiments involving two different NLU tasks (intent classification and slot tagging), two different baseline models, and several different strategies for sampling few-shot training examples, our model consistently delivered the best performance, with relative improvements of at least 20% in several cases and 65% in one case.
We also found that sequentially fine-tuning a model on multiple tasks could improve its performance on each. In the graph below, for instance, the orange plot shows the performance of the baseline model in our experiments; the blue plot shows the performance of a question-answering model fine-tuned, using our method, on a restaurant-domain NLU dataset; and the grey plot shows the performance of the question-answering model fine-tuned, using our method, first on an airline-travel NLU dataset (ATIS, for Airline Travel Information Systems) and then on the restaurant-domain dataset.
With ten fine-tuning examples, for instance, our method confers a 21% improvement over the baseline when the question-answering model is fine-tuned directly on the restaurant dataset. When it is first fine-tuned on the ATIS dataset, the improvement leaps to 63%.
This demonstrates that the advantages of our approach could compound as the model is fine-tuned on more and more tasks.
Transference
Mapping NLU problems to question answering has been studied in the literature; members of our research group have published on the topic in the past. The novelty of this work lies in studying the power of this approach for transfer learning.
Today, most NLU systems are built atop Transformer-based models pretrained on huge corpora of text, so they encode statistics about word sequences across entire languages. Extra layers are added to one of these networks, and the complete model is retrained on the target NLU task.
This is the paradigm we consider in our work. In our experiments, we used two different types of pretrained Transformer models, DistilBERT and ALBERT.
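As a rough illustration of this paradigm, the sketch below fine-tunes an off-the-shelf SQuAD-trained DistilBERT checkpoint on a few NLU examples recast as extractive QA. The checkpoint name, hyperparameters, and examples are illustrative assumptions, not the paper's configuration:

```python
# Minimal sketch of the QANLU fine-tuning paradigm with Hugging Face
# Transformers. The checkpoint, hyperparameters, and examples are
# illustrative assumptions, not the paper's configuration.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

CHECKPOINT = "distilbert-base-cased-distilled-squad"  # DistilBERT already trained on SQuAD
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForQuestionAnswering.from_pretrained(CHECKPOINT)

# A few NLU examples recast as extractive QA: each answer is a span of the context.
examples = [
    {"context": "Alexa, play the album Innervisions",
     "question": "What album name was mentioned?",
     "answer": "Innervisions"},
    {"context": "book a table for two at eight tonight",
     "question": "What party size was mentioned?",
     "answer": "two"},
]

def encode(ex):
    """Tokenize one example and attach the answer span as token positions."""
    enc = tokenizer(ex["question"], ex["context"], return_tensors="pt")
    start_char = ex["context"].index(ex["answer"])
    end_char = start_char + len(ex["answer"]) - 1
    # Map character offsets in the context (sequence 1) to token indices.
    enc["start_positions"] = torch.tensor([enc.char_to_token(0, start_char, sequence_index=1)])
    enc["end_positions"] = torch.tensor([enc.char_to_token(0, end_char, sequence_index=1)])
    return enc

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for _ in range(3):  # few-shot regime: a handful of examples, a few epochs
    for ex in examples:
        loss = model(**encode(ex)).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The sequential setup described above is just this loop run twice, first on QA pairs derived from ATIS and then on the few-shot restaurant-domain pairs.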
In addition to evaluating the effectiveness of QANLU for intent classification, we evaluate it on the related task of slot tagging. In the example above ("Alexa, play the album Innervisions"), "Innervisions" is the value of a slot labeled album_name. There, the question corresponding to the slot-tagging task would be "What album name was mentioned?"
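Because a SQuAD-trained model already performs span extraction, this recasting can even be probed before any NLU fine-tuning. A minimal sketch, assuming the Hugging Face question-answering pipeline and the same publicly available DistilBERT-SQuAD checkpoint as above:

```python
from transformers import pipeline

# Illustrative probe, not an experiment from the paper: ask a slot-tagging
# question of a SQuAD-trained model and extract the answer span.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(question="What album name was mentioned?",
            context="Alexa, play the album Innervisions")
print(result["answer"], result["score"])  # a span such as "Innervisions" (output may vary)
```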
One interesting side effect of QANLU is that training on the questions and answers created for NLU tasks could improve model performance on the native question-answering task as well. If that’s the case, it opens the further possibility of using mappings between NLU and question answering for data augmentation.