Using teacher knowledge at inference time to enhance student models


Knowledge distillation (KD) is one of the most effective ways to deploy large-scale language models in environments where low latency is essential. KD involves transferring the knowledge contained in large-scale models (“teachers”) to smaller models (“students”).

Because of their size, student models are typically more efficient than teacher models, but they’re often less powerful. In a paper we presented at this year’s meeting of the Association for Computational Linguistics (ACL), we proposed retrieval-augmented knowledge distillation (ReAugKD), a framework that leverages the power of teacher models to improve student models’ performance, with a minimal latency overhead.


Specifically, we use data representations (embeddings) and predictions produced by the teacher model on previous inputs — which can be stored in a lookup table — to guide the student model’s predictions for similar inputs. In principle, however, the same approach could be adapted for any task-specific external knowledge.

To evaluate ReAugKD, we compared its performance to that of ten prior models on six natural-language-processing tasks, including paraphrasing, natural-language inference, and question answering. On five of the tasks, ReAugKD was the top performer, and on the sixth, it ranked second. On average, it established a new state of the art for the benchmark, while incurring a latency overhead of only 3%.

In the inference phase of ReAugKD, we aggregate the teacher’s predictions on those past examples that are most similar to the current example and combine them with the student’s prediction.
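The sketch below illustrates that inference step in PyTorch, assuming the knowledge base stores the teacher's (resized) embeddings alongside its soft predictions. The cosine-similarity search, top-k cutoff, softmax weighting, and interpolation weight alpha are our illustrative choices, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def retrieval_augmented_predict(student_logits, student_emb, kb_embs, kb_probs,
                                k=10, alpha=0.5, temperature=1.0):
    """Sketch of ReAugKD-style inference-time retrieval.

    student_logits: (num_classes,) student prediction for the current input
    student_emb:    (d,) student embedding of the current input
    kb_embs:        (N, d) teacher embeddings stored in the knowledge base
    kb_probs:       (N, num_classes) teacher soft predictions for those inputs
    alpha:          interpolation weight (a tunable hyperparameter here)
    """
    # Find the k stored teacher embeddings most similar to the current input.
    sims = F.cosine_similarity(student_emb.unsqueeze(0), kb_embs, dim=-1)   # (N,)
    top_sims, top_idx = sims.topk(k)

    # Aggregate the teacher's predictions on those neighbors,
    # weighting more-similar examples more heavily.
    weights = F.softmax(top_sims / temperature, dim=-1)                     # (k,)
    teacher_agg = (weights.unsqueeze(-1) * kb_probs[top_idx]).sum(0)        # (num_classes,)

    # Combine the aggregated teacher prediction with the student's own prediction.
    student_probs = F.softmax(student_logits, dim=-1)
    return alpha * student_probs + (1.0 - alpha) * teacher_agg
```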

Training method

ReAugKD uses a two-step training procedure. In the first step, we begin with a teacher model that has been fine-tuned for a specific downstream task. Then we add a linear-projection layer on top of the model’s encoder, to project the encoder’s embeddings — or vector representations of the input data — to the same dimensions as the student model’s encoder. To fine-tune the parameters of the linear-projection layer, we use a supervised contrastive loss, which uses training examples with the same labels as positives and contrasts them with negatives sampled randomly from the remainder of the batch.
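A minimal PyTorch sketch of this first step appears below. The projection dimensions, the temperature, and the exact form of the supervised contrastive loss are illustrative assumptions rather than the paper's precise configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherProjectionHead(nn.Module):
    """Projects teacher encoder embeddings down to the student's embedding size.
    The dimensions here are placeholders."""
    def __init__(self, teacher_dim=768, student_dim=384):
        super().__init__()
        self.proj = nn.Linear(teacher_dim, student_dim)

    def forward(self, teacher_emb):
        return self.proj(teacher_emb)

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Examples sharing a label are positives; the rest of the batch are negatives."""
    z = F.normalize(embeddings, dim=-1)
    sims = z @ z.t() / temperature                           # (B, B) pairwise similarities
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1))   # positives share a label
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    pos_mask = pos_mask & ~self_mask                         # exclude self-pairs

    # Log-softmax over all other examples in the batch, then average over positives.
    sims = sims.masked_fill(self_mask, float('-inf'))
    log_prob = sims - torch.logsumexp(sims, dim=-1, keepdim=True)
    pos_counts = pos_mask.sum(-1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(-1) / pos_counts
    return loss.mean()
```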


In the second step, we generate (resized) teacher embeddings and teacher predictions for the input data we’ll use to train the student. Then we create a similarity matrix for the teacher embeddings, which measures the similarity between the embedding of each input and those of all the other inputs.
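A sketch of that precomputation follows, assuming placeholder teacher_encoder, projection_head, and classifier callables, cosine similarity, and a softmax temperature of our choosing.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_teacher_knowledge(teacher_encoder, projection_head, classifier, inputs, tau=0.1):
    """Precompute resized teacher embeddings, soft predictions, and a
    teacher-teacher similarity distribution for the training inputs."""
    hidden = teacher_encoder(inputs)                 # (N, teacher_dim) pooled embeddings
    t_emb = projection_head(hidden)                  # (N, student_dim) resized embeddings
    t_probs = F.softmax(classifier(hidden), dim=-1)  # (N, num_classes) teacher predictions

    # Pairwise similarities between every pair of training inputs,
    # normalized row-wise into a distribution over "neighbors".
    z = F.normalize(t_emb, dim=-1)
    teacher_sim = F.softmax(z @ z.t() / tau, dim=-1)  # (N, N)
    return t_emb, t_probs, teacher_sim
```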

To train the student model, we create a similarity matrix for the student embeddings and the teacher embeddings and use a loss function that minimizes the Kullback–Leibler divergence between the teacher-teacher similarity distribution and the teacher-student similarity distribution. Essentially, this ensures that at inference time, when we’re searching our knowledge base for teacher embeddings similar to that of the student’s current input, both the student and the teacher are using the same notion of similarity.

Our loss function also has a term that uses the popular cross-entropy loss to calculate the divergence between the student’s predictions and the teacher’s predictions.

In the training phase of ReAugKD, we combine two loss functions. One computes the cross-entropy loss between the student’s prediction and the teacher’s prediction. The other measures the Kullback–Leibler divergence between a distribution over the similarity of each teacher embedding to all the other teacher embeddings and a distribution over the similarity of each teacher embedding to all the other student embeddings.
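The sketch below puts those two terms together in PyTorch. The relative weighting beta and the temperature tau are illustrative hyperparameters, not the paper's reported values.

```python
import torch
import torch.nn.functional as F

def reaugkd_training_loss(student_logits, student_emb, teacher_probs, teacher_emb,
                          tau=0.1, beta=0.5):
    """Sketch of the two-part distillation loss: a relational KL term plus a
    prediction-matching cross-entropy term."""
    zs = F.normalize(student_emb, dim=-1)
    zt = F.normalize(teacher_emb, dim=-1)

    # Teacher-teacher and teacher-student similarity distributions.
    teacher_dist = F.softmax(zt @ zt.t() / tau, dim=-1)
    student_dist = F.log_softmax(zt @ zs.t() / tau, dim=-1)

    # KL divergence aligns the student's notion of similarity with the teacher's.
    kl_term = F.kl_div(student_dist, teacher_dist, reduction='batchmean')

    # Cross-entropy between the student's predictions and the teacher's soft predictions.
    ce_term = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(-1).mean()

    return beta * kl_term + (1.0 - beta) * ce_term
```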

Experiments and results

In tests, we used ReAugKD to distill the 12-layer BERT-Base model into a six-layer BERT model, evaluating performance on six datasets of the GLUE benchmark. Our method achieved state-of-the-art results on five of the six datasets, with an average improvement of 0.42% over the previous best KD method and improvements of 1.37% and 1.43% on two of the benchmark tasks.

The version of ReAugKD that uses knowledge base retrieval also exhibits an improvement of 0.45% over ReAugKD without retrieval, verifying the benefit of retrieval augmentation in our approach.

Example from the benchmark paraphrase dataset where retrieving the teacher’s knowledge of similar examples improves the student’s prediction. Predictions are in the format <output score → prediction>; “ReAugKD combined prediction” is the final prediction, which combines the student prediction score and the teacher’s aggregated prediction score.




