Many tasks in natural-language processing and information retrieval involve pairwise comparisons of sentences — for example, sentence similarity detection, paraphrase identification, question-answer entailment, and textual entailment.
The most accurate method of sentence comparison is so-called cross-encoding, which jointly encodes the two sentences of each pair and scores them together. Training cross-encoders, however, requires annotated training data, which is labor-intensive to collect.
How can we train completely unsupervised models for sentence-pair tasks, eliminating the need for data annotation?
At this year’s International Conference on Learning Representations (ICLR), we are presenting an unsupervised sentence-pair model we call a trans-encoder (paper, code), which improves on the prior state of the art by up to 5% on sentence similarity benchmarks.
A tale of two encoders
Today, there are basically two paradigms for sentence-pair tasks: cross-encoders and bi-encoders. The choice between the two comes down to the standard trade-off between computational efficiency and performance.
Cross-encoder. In a cross-encoder, two sequences are concatenated and sent in one pass to the sentence-pair model, which is usually built atop a Transformer-based language model such as BERT or RoBERTa. The attention heads of a Transformer can directly model which elements of one sequence correlate with which elements of the other, enabling the computation of an accurate classification/relevance score.
However, a cross-encoder needs to compute a new encoding for every pair of input sentences, resulting in high computational overhead. Cross-encoding is thus impractical for tasks like information retrieval and clustering, which involve massive pairwise sentence comparisons. Also, converting pretrained language models (PLMs) into cross-encoders always requires fine-tuning on annotated data.
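To make the distinction concrete, here is a minimal sketch of cross-encoding with a Hugging Face Transformer. The model name and the single-logit scoring head are illustrative assumptions rather than the paper's exact setup, and the head is randomly initialized, so the score is meaningless until the model is fine-tuned on annotated or self-labeled pair data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative choice of PLM; the single-logit scoring head is randomly initialized.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

def cross_encode_score(sentence_a: str, sentence_b: str) -> float:
    # Both sentences go through the model in a single pass ([CLS] a [SEP] b [SEP]),
    # so self-attention can relate tokens of one sentence to tokens of the other.
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 1): a single relevance score
    return logits.item()

# Every pair needs its own forward pass, which is what makes cross-encoding costly at scale.
print(cross_encode_score("A man is playing a guitar.", "Someone plays an instrument."))
```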
Bi-encoder. By contrast, in a bi-encoder, each sentence is encoded separately and mapped to a common embedding space, where the distances between them can be measured. As the encoded sentences can be cached and reused, bi-encoding is much more efficient, and the outputs of a bi-encoder can be used off-the-shelf as sentence embeddings for downstream tasks.
That said, it is well known that in supervised learning, bi-encoders underperform cross-encoders, since they don’t explicitly model interactions between sentences.
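For contrast, here is a minimal bi-encoder sketch under the same assumptions (illustrative model name, simple mean pooling over token embeddings). Each sentence is embedded independently, so the embeddings can be cached and compared with a cheap cosine similarity.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    # Each sentence is encoded on its own; the resulting vector can be cached and reused.
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)      # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # mean pooling over tokens

# Pairwise comparison reduces to a distance in the shared embedding space.
emb_a = embed("A man is playing a guitar.")
emb_b = embed("Someone plays an instrument.")
print(F.cosine_similarity(emb_a, emb_b).item())
```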
Trans-encoder: The best of both worlds
In our ICLR paper, we ask whether we can leverage the advantages of both bi- and cross-encoders to bootstrap an accurate sentence-pair model in an unsupervised manner.
Our answer — the trans-encoder — is built on the following intuition: As a starting point, we can use bi-encoder representations to fine-tune a cross-encoder. With its more powerful inter-sentence modeling, the cross-encoder should extract more knowledge from the PLMs than the bi-encoder can given the same input data. In turn, the more powerful cross-encoder can distill its knowledge back into the bi-encoder, improving the accuracy of the more computationally practical model. We can repeat this cycle to iteratively bootstrap from both the bi- and cross-encoders.
Specifically, the process of training a trans-encoder is as follows:
Step 1. Transform PLMs into effective bi-encoders. To transform existing PLMs into bi-encoders, we leverage a simple contrastive-tuning procedure. Given a sentence, we encode it twice with the same PLM. Because of dropout (a standard technique in which a random fraction of neural-network nodes is dropped on each forward pass to prevent overfitting), the two passes produce slightly different encodings.
The bi-encoder is then trained to maximize the similarity of the two almost-identical encodings. This step primes the PLMs to be good at embedding sequences. Details can be found in the prior works Mirror-BERT and SimCSE.
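A minimal sketch of this Mirror-BERT/SimCSE-style contrastive step might look as follows. The model name, temperature value, and tiny two-sentence batch are illustrative assumptions; data loading and the optimizer update are omitted.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout active so the two passes differ slightly

def mean_pool(hidden, attention_mask):
    mask = attention_mask.unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

sentences = ["A man is playing a guitar.", "A dog runs in the park."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Two forward passes over the same batch; dropout yields two slightly different "views".
view1 = mean_pool(encoder(**batch).last_hidden_state, batch["attention_mask"])
view2 = mean_pool(encoder(**batch).last_hidden_state, batch["attention_mask"])

# Contrastive (InfoNCE) objective: each sentence's two views should be more similar
# to each other than to the views of the other sentences in the batch.
temperature = 0.05  # illustrative value
similarities = F.cosine_similarity(view1.unsqueeze(1), view2.unsqueeze(0), dim=-1) / temperature
loss = F.cross_entropy(similarities, torch.arange(len(sentences)))
loss.backward()  # one contrastive-tuning step (optimizer update omitted)
```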
Step 2. Self-distillation: bi- to cross-encoder. After obtaining a reasonably good bi-encoder from step one, we use it to create training data for a cross-encoder. Specifically, we label sentence pairs with the pairwise similarity scores computed by the bi-encoder and use them as training targets for a cross-encoder built on top of a new PLM.
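Sketched in code, this self-labeling step could look like the snippet below. The similarity values are placeholders standing in for the tuned bi-encoder's scores, the choice of RoBERTa for the new cross-encoder is illustrative, and the plain regression loss is an assumption; the paper's exact training objective may differ.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sentence pairs to self-label, plus placeholder scores standing in for the
# cosine similarities the tuned bi-encoder from step one would assign them.
pairs = [("A man is playing a guitar.", "Someone plays an instrument."),
         ("A man is playing a guitar.", "A dog runs in the park.")]
soft_labels = torch.tensor([0.82, 0.11])

# Cross-encoder built on top of a fresh PLM (RoBERTa here, as an illustrative choice).
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
cross_encoder = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=1)

batch = tokenizer([a for a, _ in pairs], [b for _, b in pairs],
                  padding=True, truncation=True, return_tensors="pt")
scores = cross_encoder(**batch).logits.squeeze(-1)  # one joint pass per pair

# Regress the cross-encoder's scores onto the bi-encoder's self-labels.
loss = F.mse_loss(scores, soft_labels)
loss.backward()  # one bi-to-cross distillation step (optimizer update omitted)
```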
Step 3. Self-distillation: Cross- to bi-encoder. A natural next step is to distill the extra knowledge gained by the cross-encoder back into bi-encoder form, which is more useful for downstream tasks. More importantly, a better bi-encoder can produce even better self-labeled data for tuning the cross-encoder. In this way we can repeat steps two and three, continually bootstrapping the performance of both encoders.
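The reverse direction can be sketched the same way. Here the tensors are placeholders: `targets` stands in for the cross-encoder's scores on a batch of self-labeled pairs, and `emb_a`/`emb_b` stand in for the bi-encoder's embeddings of each side of those pairs, through which gradients would flow back into the bi-encoder; the MSE loss is again an illustrative choice.

```python
import torch
import torch.nn.functional as F

# Placeholders: cross-encoder scores for two self-labeled pairs, and the
# bi-encoder's embeddings of each side of those pairs (gradients enabled).
targets = torch.tensor([0.85, 0.12])
emb_a = torch.randn(2, 768, requires_grad=True)
emb_b = torch.randn(2, 768, requires_grad=True)

# Train the bi-encoder so the cosine similarity of its sentence embeddings
# matches the cross-encoder's scores.
predictions = F.cosine_similarity(emb_a, emb_b, dim=-1)
loss = F.mse_loss(predictions, targets)
loss.backward()  # one cross-to-bi distillation step

# Steps two and three then alternate: the improved bi-encoder re-labels pairs for
# the cross-encoder, and the improved cross-encoder re-labels them for the bi-encoder.
```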
Our paper proposes other techniques, such as mutual distillation, to improve our model’s performance. Please refer to Section 2.4 of the paper for more details.
Benchmarks: A new state of the art for sentence similarity
We experiment with the trans-encoder on seven semantic textual similarity (STS) benchmarks. We observe significant improvements over previous unsupervised sentence-pair models across all datasets.
We also benchmark the model on binary-classification and domain-transfer tasks; please refer to Section 5 of the paper for more details.