Abstract
Existing methods for measuring sentence similarity face two challenges: (1) labeled datasets are usually limited in size, making them insufficient to train supervised neural models; and (2) there is a train-test gap for unsupervised language modeling (LM) based models when computing semantic scores between sentences, since sentence-level semantics are not explicitly modeled at training time. This results in inferior performance on this task. In this work, we propose a new framework to address these two issues. The proposed framework is based on the core idea that the meaning of a sentence should be defined by its contexts, and that sentence similarity can be measured by comparing the probabilities of generating two sentences given the same context. The proposed framework is able to generate a high-quality, large-scale dataset with semantic similarity scores between sentence pairs in an unsupervised manner, with which the train-test gap can be largely bridged. Extensive experiments show that the proposed framework achieves significant performance boosts over existing baselines under both the supervised and unsupervised settings across different datasets.
1 Introduction
Measuring sentence similarity is a long-standing task in NLP (Luhn, 1957; Robertson et al., 1995; Blei et al., 2003; Peng et al., 2020). The task aims at quantitatively measuring the semantic relatedness between two sentences, and has wide applications in text search (Farouk et al., 2018), natural language understanding (MacCartney and Manning, 2009), and machine translation (Yang et al., 2019a).
One of the greatest challenges that existing methods face for sentence similarity is the lack of large-scale labeled datasets, which contain sentence pairs with labeled semantic similarity scores. The acquisition of such a dataset is both labor-intensive and expensive. For example, the STS benchmark (Cer et al., 2017) and SICK-Relatedness dataset (Marelli et al., 2014) respectively contain 8.6K and 9.8K labeled sentence pairs, the sizes of which are usually insufficient for training deep neural networks.
Unsupervised learning methods have been proposed to address this issue, where word embeddings (Le and Mikolov, 2014) or BERT embeddings (Devlin et al., 2018) are used to map sentences to fixed-length vectors in an unsupervised manner. Sentence similarity is then computed based on the cosine similarity or dot product of these sentence representations. Our work follows this thread, where sentence similarity is computed from fixed-length sentence representations, as opposed to comparing sentences directly. The biggest issue with current unsupervised approaches is the large gap between model training and testing (i.e., computing semantic similarity between two sentences). For example, BERT-style models are trained at the token level by predicting words given contexts: sentence-level semantics are neither explicitly modeled nor are sentence embeddings produced at the training stage. At test time, however, sentence semantics needs to be explicitly modeled to obtain semantic similarity. This inconsistency results in a distinct discrepancy between the objectives at the two stages and in inferior performance on textual semantic similarity tasks. For example, BERT embeddings yield inferior performance on semantic similarity benchmarks (Reimers and Gurevych, 2019), and even underperform naive methods such as averaging GloVe (Pennington et al., 2014) embeddings. Li et al. (2020) investigated this problem and found that BERT induces a non-smooth anisotropic semantic space of sentences, a property that significantly harms performance on semantic similarity tasks.
Just as word meanings are defined by neighboring words (Harris, 1954), the meaning of a sentence is determined by its contexts. Given the same context, two semantically similar sentences should both have a high probability of being generated; if the probabilities of generating two sentences given the same contexts differ greatly, the two sentences are likely far apart in the semantic space. Based on this idea, we propose a framework that measures semantic similarity by comparing the probabilities of generating two sentences given the same contexts, in a fully unsupervised manner. The framework consists of the following steps: (1) we train a contextual model to predict the probability of a sentence fitting into the left and right contexts; (2) we obtain sentence-pair similarity by comparing the scores assigned by the contextual model across a large number of contexts. To facilitate inference, we train a surrogate model, which plays the role of step (2), based on the outputs from step (1). The surrogate model can be used directly for sentence similarity prediction in the unsupervised setup, or as an initialization to be further finetuned on downstream datasets in the supervised setup. Note that the outcome from step (1) or the surrogate model is a fixed-length vector for the input sentence: each element indicates how well the input sentence fits the context corresponding to that element, and the vector itself can be viewed as the overall semantics of the input sentence in the contextual space. We then use the cosine similarity between two sentence vectors to compute semantic similarity.
The proposed framework offers the potential to fully address the two challenges above: (1) the context regularization provides a reliable means to generate a large-scale, high-quality dataset with semantic similarity scores from an unlabeled corpus; and (2) the train-test gap can be naturally bridged by training the model on this large-scale similarity dataset, leading to significant performance gains over using pretrained models directly.
We conduct experiments on different datasets under both supervised and unsupervised setups, and experimental results show that the proposed framework significantly outperforms existing sentence similarity models.
2 Related Work
Statistics-based methods for measuring sentence similarity include bag-of-words (BoW) (Li et al., 2006), term frequency inverse document frequency (TF-IDF) (Luhn, 1957; Jones, 2004), BM25 (Robertson et al., 1995), latent semantic indexing (LSI) (Deerwester et al., 1990), and latent Dirichlet allocation (LDA) (Blei et al., 2003). Deep learning based methods for sentence similarity rely on distributed representations (Mikolov et al., 2013; Le and Mikolov, 2014) and can be generally divided into the following three categories.
Matrix Based Methods
The first line of work for measuring sentence similarity is to construct a similarity matrix between two sentences, each element of which represents the similarity between the two corresponding units in two sentences. Then the matrix is aggregated in different ways to induce the final similarity score. Pang et al. (2016) applied a two-layer convolutional neural network (CNN) followed by a feed-forward layer to the similarity matrix to derive the similarity score. He and Lin (2016) used a deeper CNN to make the best use of the similarity matrix. Yin and Schütze (2015) built a hierarchical architecture to model text compositions at different granularities, so several similarity matrices can be computed and combined for interactions. Other works proposed using the attention mechanism as a way of computing the similarity matrix (Rocktäschel et al., 2015; Wang et al., 2016; Parikh et al., 2016; Seo et al., 2016; Shen et al., 2017; Lin et al., 2017; Gong et al., 2017; Tan et al., 2018; Kim et al., 2019; Yang et al., 2019b).
Word Distance Based Methods
The second line of work measures sentence similarity by calculating the cost of transforming one sentence into another; the smaller the cost, the more similar the two sentences. This idea is implemented by the Word Mover’s Distance (WMD) (Kusner et al., 2015), which measures the dissimilarity between two documents as the minimum cumulative distance that the embedded words of one document need to travel to reach the embedded words of the other. Follow-up works improve WMD by incorporating supervision from downstream tasks (Huang et al., 2016), introducing hierarchical optimal transport over topics (Yurochkin et al., 2019), addressing the complexity of having to consider every word pair (Wu and Li, 2017; Wu et al., 2018; Backurs et al., 2020), and combining graph structures with WMD to perform cross-domain alignment (Chen et al., 2020). More recently, Yokoi et al. (2020) proposed to disentangle word vectors in the Word Rotator’s Distance (WRD), which has shown significant performance boosts over vanilla WMD.
Sentence Embedding Based Methods
Sentence embeddings are high-dimensional representations of sentences. They are expected to contain rich sentence semantics so that the similarity between two sentences can be computed from their embeddings via metrics such as cosine similarity. Le and Mikolov (2014) introduced the paragraph vector, which is learned in an unsupervised manner by predicting the words within a paragraph using the paragraph vector. Subsequently, a line of sentence embedding methods such as FastText, Skip-Thought vectors (Kiros et al., 2015), Smooth Inverse Frequency (SIF) (Arora et al., 2017), Sequential Denoising Autoencoders (SDAEs) (Hill et al., 2016), InferSent (Conneau et al., 2017), Quick-Thought vectors (Logeswaran and Lee, 2018), and the Universal Sentence Encoder (Cer et al., 2018) have been proposed to improve sentence embedding quality and efficiency.
The great success achieved by large-scale pretrained models (Devlin et al., 2018; Liu et al., 2019) has recently stimulated a strand of work on producing sentence embeddings based on the pretraining-finetuning paradigm using large-scale unlabeled corpora. The cosine similarity between the representations of two sentences produced by large-scale pretrained models is treated as the semantic similarity (Reimers and Gurevych, 2019; Wang and Kuo, 2020; Li et al., 2020). Su et al. (2021) and Huang et al. (2021) proposed regularizing the sentence representations by whitening them, that is, enforcing the covariance to be an identity matrix, to address the non-smooth anisotropic distribution issue (Li et al., 2020).
BERT-based scores (Zhang et al., 2020; Sellam et al., 2020), though designed as automatic evaluation metrics, also capture rich semantic information about sentences and thus have the potential for measuring semantic similarity. Cer et al. (2018) proposed a method of encoding sentences into embeddings that specifically targets transfer learning to other NLP tasks. Karpukhin et al. (2020) adopted two separate BERT encoders whose weights are optimized to maximize the dot product between relevant pairs. The most recent line of work leverages the contrastive learning framework to tackle semantic textual similarity (Wu et al., 2020; Carlsson et al., 2021; Kim et al., 2021; Yan et al., 2021; Gao et al., 2021), where two similar sentences are pulled close and two random sentences are pushed apart in the sentence representation space. This learning strategy helps better separate sentences with different semantics.
This work is motivated by learning word representations from their contexts (Mikolov et al., 2013; Le and Mikolov, 2014) under the assumption that the meaning of a word is determined by its context. Our work is based on large-scale pretrained models and aims at learning informative sentence representations for measuring sentence similarity.
3 Model
3.1 Overview
The key point of the proposed paradigm is to compute the semantic similarity between two sentences by measuring the probabilities of generating the two sentences across a number of contexts.
We achieve this goal through the following steps: (1) we first train a contextual model to predict the probability of a sentence fitting into its left and right contexts. This can be achieved by either a discriminative model, namely, predicting the probability that the concatenation of a sentence with a context forms a coherent text, or a generative model, namely, predicting the probability of generating a sentence given contexts; (2) next, given a pair of sentences, we measure their similarity by comparing the scores assigned by the contextual models across different contexts; (3) for step (2), for any pair of sentences at test time, we need to sample different contexts and compute the scores assigned by the contextual models, which is time-consuming. We thus train a surrogate model that takes a pair of sentences as input and predicts the similarity assigned by the contextual model. This enables faster inference, though at a small sacrifice of accuracy; (4) the surrogate model can be directly used for obtaining sentence similarity scores in an unsupervised manner, or used as model initialization and further finetuned on downstream datasets in the supervised setting. We discuss the details of each module below.
3.2 Training Contextual Models
We need a contextual model to predict the probability of a sentence fitting into its left and right contexts. We combine a generative model and a discriminative model to achieve this goal, allowing us to take advantage of both in modeling text coherence (Li et al., 2017).
Notations
Let c_i denote the i-th sentence, consisting of a sequence of words {w_{i,1}, w_{i,2}, ..., w_{i,n_i}}, where n_i denotes the number of words in c_i. Let c_{i:j} denote the i-th to j-th sentences. c_{<i} and c_{>i} respectively denote the preceding and subsequent contexts of c_i.
3.2.1 Discriminative Models
3.2.2 Generative Models
3.3 Scoring Sentence Pairs
Let C denote a set of contexts and N_C its size. For a sentence s, its semantic representation v_s is an N_C-dimensional vector, with each element being S(s, c) for a context c ∈ C. The semantic similarity between two sentences s1 and s2 can then be computed from v_{s1} and v_{s2} using metrics such as cosine similarity.
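For concreteness, the following is a minimal sketch of this scoring procedure; context_fit_score is a hypothetical stand-in for the contextual-model score S(s, c) of Eq. (3), and the helper names are ours rather than part of the described implementation.

```python
import numpy as np

def semantic_vector(sentence, contexts, context_fit_score):
    """Build the N_C-dimensional semantic vector v_s: one fitness score per context.

    `context_fit_score(sentence, context)` is a placeholder for S(s, c) in Eq. (3),
    i.e., the score assigned by the contextual models.
    """
    return np.array([context_fit_score(sentence, c) for c in contexts])

def sentence_similarity(s1, s2, contexts, context_fit_score):
    """Cosine similarity between the semantic vectors of two sentences."""
    v1 = semantic_vector(s1, contexts, context_fit_score)
    v2 = semantic_vector(s2, contexts, context_fit_score)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12))
```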
Constructing C
We need to pay special attention to the construction of C. The ideal situation is to use all contexts, that is, to let C be the entire corpus. Unfortunately, this is computationally prohibitive, as we would need to iterate over the entire corpus for each sentence s.
We propose the following workaround for tractable computation. For a sentence s, rather than using the full corpus as C, we construct its sentence-specific context set Cs such that s fits into every constituent context in Cs. The intuition is as follows. With respect to sentence s1, contexts can be divided into two categories: contexts that s1 fits into, based on which we measure whether or not s2 also fits, and contexts that s1 does not fit into, for which we would measure whether or not s2 also does not fit. We are mostly concerned with the former and can neglect the latter. The reason is that the latter can be further divided into two categories: contexts that fit neither s1 nor s2, and contexts that do not fit s1 but fit s2. Contexts that fit neither s1 nor s2 can be neglected, since two sentences not fitting into the same context does not signify their semantic relatedness; contexts that do not fit s1 but fit s2 can be deferred to the construction of Cs2.
Practically, for a given sentence s, we first use TF-IDF weighted BoW bi-gram vectors to perform primary screening over the whole corpus and retrieve related text chunks (20K for each sentence). Next, we rank all contexts using the discriminative model based on Eq. (1). For the discriminative model, we cache sentence representations in advance and only compute model scores in the last neural layer, which is significantly faster than the generative model. This two-step selection strategy is akin to pipelined systems in open-domain QA (Chen et al., 2017; Karpukhin et al., 2020), which combine document retrieval using IR systems with fine-grained question answering using neural QA models.
Cs is built by selecting the top-ranked contexts according to Eq. (3). We use an incremental construction strategy, adding one context at a time. To promote diversity in Cs, each text chunk is allowed to contribute at most one context, and the Jaccard similarity between the (i−1)-th sentence of the context to be selected and those already selected should be lower than 0.5.¹
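The two-step construction can be sketched as follows; disc_score is a placeholder for the discriminative score of Eq. (1), the helper names and the exact TF-IDF settings are ours, and the sketch simplifies each retrieved chunk to a single candidate context with a sentence-level Jaccard filter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def build_context_set(sentence, chunks, disc_score, n_retrieve=20000, n_keep=500):
    """Two-step construction of C_s (sketch)."""
    # Step 1: coarse screening with TF-IDF weighted bi-gram BoW vectors.
    vectorizer = TfidfVectorizer(ngram_range=(2, 2))
    chunk_vecs = vectorizer.fit_transform(chunks)
    query_vec = vectorizer.transform([sentence])
    scores = cosine_similarity(query_vec, chunk_vecs)[0]
    candidates = [chunks[i] for i in scores.argsort()[::-1][:n_retrieve]]

    # Step 2: fine-grained ranking with the (placeholder) discriminative model.
    ranked = sorted(candidates, key=lambda c: disc_score(sentence, c), reverse=True)

    # Incremental selection with the diversity constraint; one context per chunk
    # is guaranteed here because each candidate is a distinct chunk.
    selected = []
    for ctx in ranked:
        if all(jaccard(ctx, prev) < 0.5 for prev in selected):
            selected.append(ctx)
        if len(selected) == n_keep:
            break
    return selected
```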
3.4 Training Surrogate Models
The method described in Section 3.3 provides a direct way to compute semantic relatedness scores. But it comes with the severe shortcoming of slow inference: given an arbitrary pair of sentences, the model still needs to go through the entire corpus, harvest the context set Cs, and iterate over all instances in Cs to compute context scores based on Eq. (3), each of which is time-consuming. To address this issue, we propose training a surrogate model to accelerate inference.
Specifically, we first harvest similarity scores for sentence pairs using the method in Section 3.3. We collect scores for 100M pairs in total, which are split into train/dev/test sets by 98/1/1. Next, treating the harvested similarity scores as gold labels, we train a neural model that takes a pair of sentences as input and predicts their similarity score. The cosine similarity between the two sentence representations is the predicted semantic similarity, and we minimize the L2 distance between the predicted and gold similarities. The Siamese structure makes it possible to derive and store fixed-size vectors for input sentences, allowing for fast semantic similarity search, which we discuss in detail in the ablation study section.
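A minimal PyTorch sketch of one surrogate training step is shown below; encoder stands for any Siamese sentence encoder returning fixed-size vectors, and the function name and batching details are our own illustration.

```python
import torch
import torch.nn.functional as F

def surrogate_training_step(encoder, optimizer, batch_s1, batch_s2, gold_scores):
    """One training step of the surrogate model (sketch).

    `encoder` maps a list of sentences to a [batch, dim] tensor (the paper uses a
    Siamese RoBERTa; see Section 4.1); `gold_scores` are the harvested similarities.
    """
    v1 = encoder(batch_s1)                      # [batch, dim]
    v2 = encoder(batch_s2)                      # [batch, dim]
    pred = F.cosine_similarity(v1, v2, dim=-1)  # predicted similarity
    loss = F.mse_loss(pred, gold_scores)        # L2 distance to harvested scores
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```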
It is worth noting both the advantages and disadvantages of the surrogate model. On the plus side, it significantly speeds up inference, as it avoids the time-consuming process of iterating over the entire corpus to construct C. Moreover, the surrogate shares the same structure with widely used models such as BERT and RoBERTa, and can thus be easily finetuned on human-labeled datasets for supervised learning, whereas the original model in Section 3.3 cannot be readily combined with other human-labeled datasets. On the minus side, the surrogate model inevitably comes with a cost in accuracy, as its upper bound is the original model in Section 3.3.
4 Experiments
4.1 Experiment Settings
We evaluate the Surrogate model on the Semantic Textual Similarity (STS), Argument Facet Similarity (AFS) corpus (Misra et al., 2016), and Wikipedia Sections Distinction (Ein Dor et al., 2018) tasks. We perform both unsupervised and supervised evaluations on these tasks. For unsupervised evaluations, models are directly used to obtain sentence representations. For supervised evaluations, we use the training set to fine-tune all models with the L2 regression objective. We also conduct a partially supervised evaluation on the STS benchmark.
Implementation Details
For the discriminative model in Section 3.2.1, we use a single-layer bi-directional LSTM as the backbone, with the size of hidden states set to 300.
For the generative model in Section 3.2.2, we implement the three models above, namely, p(c_i | c_{<i}, c_{>i}), p(c_{<i} | c_i, c_{>i}), and p(c_{>i} | c_{<i}, c_i), based on the Seq2Seq structure, and use Transformer-large as the backbone (Vaswani et al., 2017). Sentence position embeddings and token position embeddings are added to the word embeddings. The model is trained on a corpus extracted from CommonCrawl that contains 100B tokens.
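As an illustration of how a generative contextual score such as p(c_i | c_{<i}, c_{>i}) can be computed, the sketch below uses an off-the-shelf seq2seq checkpoint as a stand-in for the paper's Transformer trained on CommonCrawl; the checkpoint name and the use of a <mask> placeholder to mark the sentence slot are assumptions, not the actual setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative checkpoint only; the paper trains its own Transformer-large on CommonCrawl.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base").eval()

@torch.no_grad()
def log_prob(target, left_context, right_context):
    """Approximate total log-probability of generating `target` given its
    surrounding context, i.e., log p(c_i | c_<i, c_>i)."""
    src = tokenizer(left_context + " <mask> " + right_context, return_tensors="pt")
    tgt = tokenizer(target, return_tensors="pt").input_ids
    out = model(**src, labels=tgt)
    # `out.loss` is the mean token-level negative log-likelihood over the target.
    return -out.loss.item() * tgt.size(1)
```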
For the surrogate model in Section 3.4, we use RoBERTa (Liu et al., 2019) as the backbone and adopt the Siamese structure (Reimers and Gurevych, 2019), where two sentences are separately mapped to vector representations by the same RoBERTa encoder. We apply average pooling over the last RoBERTa layer to obtain the sentence representation. During training, we use Adam (Kingma and Ba, 2014) with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999. The trained surrogate model obtains an average L2 distance of 7.4 × 10−4 on the dev set when trained from scratch, and 6.1 × 10−4 when initialized with RoBERTa-large (Liu et al., 2019). We set the size of Cs to 500.
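A sketch of the Siamese encoder branch (average pooling over the last RoBERTa layer, with padding masked out) is given below; the checkpoint name is illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
backbone = AutoModel.from_pretrained("roberta-base")

def encode(sentences):
    """Sentence representations via average pooling over the last RoBERTa layer,
    ignoring padding positions (one branch of the Siamese surrogate model)."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = backbone(**batch).last_hidden_state          # [batch, seq, dim]
    mask = batch["attention_mask"].unsqueeze(-1).float()  # [batch, seq, 1]
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
```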
Baselines
We use the following models as baselines:
Avg. Glove embeddings is the average of word embeddings produced via the co-occurrence statistics in the corpus (Pennington et al., 2014).
Avg. Skip-Thought embeddings is the average of word embeddings produced by Skip-Thought vectors (Kiros et al., 2015).
InferSent uses a Siamese BiLSTM network with max-pooling over the output, trained on NLI datasets (Conneau et al., 2017).
Avg. BERT embeddings is the average of word embeddings produced by BERT (Devlin et al., 2018).
BERT [CLS] computes scores based on the vector representation of the special token [CLS] in BERT.
BERTScore computes the similarity of two sentences as a sum of cosine similarities between their tokens’ embeddings (Zhang et al., 2020).
BLEURT is based on BERT and captures non-trivial semantic similarities by fine-tuning the model on the WMT Metrics dataset, on a set of ratings provided by the user, or a combination of both (Sellam et al., 2020).
DPR uses two separate BERT encoders whose weights are optimized to maximize the dot product between relevant pairs (Karpukhin et al., 2020).
Universal Sent Encoder encodes sentences into embeddings that specifically target transfer learning to other NLP tasks (Cer et al., 2018).
SBERT is a BERT-based method of using the Siamese structure to derive sentence embeddings that can be compared through cosine similarity (Reimers and Gurevych, 2019).
4.2 Run-time Efficiency
The run-time efficiency is important for sentence representation models because similarity functions are potentially applied to large corpora. In this subsection, we compare Surrogatebase to InferSent (Conneau et al., 2017), Universal Sent Encoder (Cer et al., 2018), and SBERTbase (Reimers and Gurevych, 2019). We adopt a length batching strategy in which sentences are grouped together by length.
The proposed Surrogate model, InferSent (Conneau et al., 2017), and SBERT (Reimers and Gurevych, 2019) are implemented in PyTorch, while Universal Sent Encoder (Cer et al., 2018) is based on TensorFlow and loaded from the TensorFlow Hub. Model efficiency is measured on a server with an Intel i7-5820K CPU @ 3.30GHz, an Nvidia Tesla V100 GPU, CUDA 10.2, and cuDNN. We report both CPU and GPU speed; the results can be found in Table 1. As can be seen, InferSent is around 69% faster than the Surrogate model on CPU because of its simpler architecture. The speed of the proposed Surrogate model is comparable to SBERT for both the non-batching and batching setups, which is in line with our expectations given that the Surrogate model adopts the same Transformer structure.
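The length-batching strategy amounts to sorting sentences by length so that each batch contains sentences of similar length, reducing padding; a minimal sketch (assuming whitespace tokenization as the length proxy) is:

```python
def length_batches(sentences, batch_size=32):
    """Group sentences of similar length into the same batch to reduce padding
    (the length-batching strategy used in the run-time comparison)."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [sentences[i] for i in idx]
```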
Computation speed of sentence embedding methods (sentences per second).
Model | CPU | GPU |
---|---|---|
InferSent | 125 | 1527 |
Universal Sent Encoder | 72 | 1330 |
SBERTbase | 41 | 1315 |
SBERTbase length batching | 88 | 2112 |
Surrogatebase | 48 | 1514 |
Surrogatebase length batching | 91 | 2175 |
4.3 Experiment: Semantic Textual Similarity
We evaluate the proposed method on the Semantic Textual Similarity (STS) tasks. We compute the Spearman’s rank correlation ρ between the cosine similarity of the sentence pairs and the gold labels for comparison.
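Concretely, the evaluation reduces to the following computation, where encode stands for any sentence encoder returning one embedding per sentence as a NumPy array; the helper name is ours.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_sts(encode, sentence_pairs, gold_scores):
    """Spearman's rank correlation between predicted cosine similarities and gold labels."""
    v1 = encode([s1 for s1, _ in sentence_pairs])
    v2 = encode([s2 for _, s2 in sentence_pairs])
    cos = np.sum(v1 * v2, axis=1) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1) + 1e-12)
    rho, _ = spearmanr(cos, gold_scores)
    return rho
```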
Unsupervised Evaluation
We evaluate the proposed method on the Semantic Textual Similarity tasks 2012–2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), the STS benchmark (Cer et al., 2017), and the SICK-Relatedness dataset (Marelli et al., 2014). All datasets contain sentence pairs labeled with semantic relatedness scores between 0 and 5. The proposed models are directly used for inference under the unsupervised setup.
The results are shown in Table 2, where we observe significant performance boosts of the proposed models over baselines. Notably, the proposed models trained in the unsupervised setting (both Origin and Surrogate) achieve results competitive with models trained on additional annotated NLI datasets. Another observation is that, as expected, the Surrogate models underperform the Origin model: Origin serves as an upper bound for Surrogate, but at the cost of slower inference.
Spearman rank correlation ρ between the cosine similarity of sentence representations and the gold labels for various Semantic Textual Similarity (STS) tasks under the unsupervised setting. We use *-NLI to denote the model additionally trained on NLI datasets. ♯ indicates that results are reproduced by ourselves; § indicates results taken from Reimers and Gurevych (2019); Surrogate denotes our proposed method.
Model | STS12 | STS13 | STS14 | STS15 | STS16 | STSb | SICK-R | Avg |
---|---|---|---|---|---|---|---|---|
fully unsupervised without human labels | ||||||||
Avg. Glove embeddings§ | 55.14 | 70.66 | 59.73 | 68.25 | 63.66 | 58.02 | 53.76 | 61.32 |
Avg. Skip-Thought embeddings§ | 57.11 | 71.98 | 61.30 | 70.13 | 65.21 | 59.42 | 55.50 | 62.95 |
InferSent-Glove♯ | 52.86 | 66.75 | 62.15 | 72.77 | 66.87 | 68.03 | 65.65 | 65.01 |
Avg. BERT embeddings§ | 38.78 | 57.98 | 57.98 | 63.15 | 61.06 | 46.35 | 58.40 | 54.81 |
BERT [CLS]♯ | 20.16 | 30.01 | 20.09 | 36.88 | 38.08 | 16.50 | 42.63 | 29.19 |
BERTScore♯ | 54.60 | 50.11 | 57.74 | 70.79 | 64.58 | 57.58 | 51.37 | 58.11 |
DPR♯ | 53.98 | 56.00 | 57.83 | 66.68 | 67.43 | 58.53 | 61.85 | 60.33 |
BLEURT♯ | 70.16 | 64.97 | 57.41 | 72.91 | 70.01 | 69.81 | 58.46 | 66.25 |
Universal Sent Encoder♯ | 64.49 | 67.80 | 64.61 | 76.83 | 73.18 | 74.92 | 76.69 | 71.22 |
Origin | 72.41 | 74.30 | 75.45 | 78.45 | 79.93 | 78.47 | 79.49 | 76.93 |
Surrogatebase | 70.62 | 72.14 | 72.72 | 76.34 | 75.24 | 74.19 | 77.20 | 74.06 |
Surrogatelarge | 71.93 | 73.74 | 73.95 | 77.01 | 76.64 | 75.32 | 77.84 | 75.20 |
partially supervised: without human labels from the same domain | ||||||||
InferSent-NLI♯ | 50.48 | 67.75 | 62.15 | 72.77 | 66.87 | 68.03 | 65.65 | 64.81 |
BERT [CLS]-NLI♯ | 60.35 | 54.97 | 64.92 | 71.49 | 70.49 | 73.25 | 70.79 | 66.61 |
BERTScore-NLI♯ | 60.89 | 54.64 | 63.96 | 74.35 | 66.67 | 65.65 | 66.01 | 64.60 |
DPR-NLI♯ | 61.36 | 56.71 | 65.49 | 71.80 | 71.03 | 74.08 | 70.86 | 67.33 |
BLEURT-NLI♯ | 66.40 | 68.15 | 71.98 | 79.69 | 77.86 | 77.98 | 70.92 | 73.28 |
Universal Sent Encoder-NLI♯ | 65.55 | 67.95 | 71.47 | 80.81 | 78.70 | 78.41 | 69.31 | 73.17 |
BERT-NLIbase♯ | 71.07 | 76.81 | 73.29 | 79.56 | 74.58 | 77.10 | 72.65 | 75.01 |
SBERT-NLIbase§ | 70.97 | 76.53 | 73.19 | 79.09 | 74.30 | 77.03 | 72.91 | 74.86 |
SRoBERTa-NLIbase§ | 71.54 | 72.49 | 70.80 | 78.74 | 73.69 | 77.77 | 74.46 | 74.21 |
Surrogate-NLIbase | 74.15 | 76.50 | 72.23 | 81.24 | 78.75 | 79.32 | 78.56 | 77.25 |
BERT-NLIlarge♯ | 71.62 | 77.40 | 72.69 | 78.61 | 75.28 | 77.83 | 72.64 | 75.15 |
SBERT-NLIlarge§ | 72.27 | 78.46 | 74.90 | 80.99 | 76.25 | 79.23 | 73.75 | 76.55 |
SRoBERTa-NLIlarge§ | 74.53 | 77.00 | 73.18 | 81.85 | 76.82 | 79.10 | 74.29 | 76.68 |
Surrogate-NLIlarge | 76.98 | 79.83 | 75.15 | 83.54 | 79.32 | 80.82 | 79.64 | 79.33 |
Partially Supervised Evaluation
We finetune the model on the combination of the SNLI (Bowman et al., 2015) and Multi-Genre NLI (Williams et al., 2018) datasets, with the former containing 570K sentence pairs and the latter containing 433K pairs across various genres. Sentence pairs from both datasets are annotated with one of the labels contradiction, entailment, and neutral. The proposed models are trained on the natural language inference task and then used to compute sentence representations in an unsupervised manner.
The partially supervised results are shown in Table 2. Since no labeled similarity dataset is used, these results can be fairly compared with those of the unsupervised models; when the model is further finetuned on similarity datasets such as STSb, it can be compared with the supervised models.
Supervised Evaluation
For the supervised setting, we use the STS benchmark (STSb) to evaluate supervised STS systems. This dataset contains 8,628 sentence pairs from three categories (captions, news, and forums) and is split into 5,749/1,500/1,379 sentence pairs for training/dev/test, respectively. The proposed models are finetuned on the labeled datasets under this setup.
For our proposed framework, we use Origin to represent the original model, where C for each sentence is constructed by searching the entire corpus as in Section 3.3 and we compute similarity scores based on Eq. (4). We also report performances for Surrogate models with base and large sizes.
The results are shown in Table 3. For both model sizes (base and large) and both setups (with and without NLI training), the proposed Surrogate model significantly outperforms baseline models, leading to performance gains of over 2 points on average on the STSb dataset.
Spearman correlation ρ for the STSb dataset under the supervised setting. We use *-NLI to denote the model additionally trained on NLI datasets. ♯ indicates that results are reproduced by ourselves; § indicates results taken from Reimers and Gurevych (2019); Surrogate denotes our proposed method.
Model | Spearman ρ |
---|---|
BERT [CLS]♯ | 73.01 |
BERTbase§ | 84.30 |
SBERTbase§ | 84.67 |
SRoBERTabase§ | 84.92 |
Surrogatebase | 87.91 |
BERT-NLIbase§ | 88.33 |
SBERT-NLIbase§ | 85.35 |
SRoBERTa-NLIbase§ | 84.79 |
Surrogate-NLIbase | 89.95 |
BERTlarge§ | 85.64 |
SBERTlarge§ | 84.45 |
SRoBERTalarge§ | 85.02 |
Surrogatelarge | 88.52 |
BERT-NLIlarge§ | 88.77 |
SBERT-NLIlarge§ | 86.10 |
SRoBERTa-NLIlarge§ | 86.15 |
Surrogate-NLIlarge | 90.69 |
Note that the Origin model cannot be readily adapted to the partially supervised or supervised setting, because it is hard to finetune the Origin model, where the context set C needs to be constructed first. Hence, we finetune the Surrogate model to compensate for the accuracy loss brought by replacing Origin with Surrogate. As we can see from Table 2 and Table 3, finetuning Surrogate on NLI datasets and STSb is an effective remedy for the performance loss.
4.4 Experiment: Argument Facet Similarity
We evaluate the proposed model on the Argument Facet Similarity (AFS) dataset (Misra et al., 2016). This dataset contains 6,000 manually annotated argument pairs collected from human conversations on three topics: gun control, gay marriage, and death penalty. Each argument pair is labeled on a scale from 0 to 5 in steps of 1. Different from sentence pairs in STS datasets, the similarity of an argument pair in AFS is measured not only by the claim, but also by the way of reasoning, which makes AFS more difficult than the STS datasets. We report the Pearson correlation r and Spearman's rank correlation ρ for all models.
Unsupervised Evaluation
The results are shown in Table 4, from which we can see that under the unsupervised setting, the proposed Origin and Surrogate models outperform baseline models by a large margin of over 10 points.
Results of Pearson correlation r and Spearman's rank correlation ρ on the Argument Facet Similarity (AFS) dataset. ♯ indicates that results are reproduced by ourselves; § indicates results taken from Reimers and Gurevych (2019); Surrogate denotes our proposed method.
Model | Pearson r | Spearman ρ |
---|---|---|
Unsupervised Setting | ||
Avg. Glove embeddings♯ | 32.40 | 34.00 |
Avg. Skip-Thought embeddings♯ | 22.34 | 23.24 |
InferSent-Glove♯ | 24.83 | 25.83 |
Avg. BERT embeddings♯ | 29.15 | 31.45 |
BERT [CLS]♯ | 12.00 | 9.06 |
BERTScore♯ | 45.32 | 33.56 |
DPR♯ | 41.89 | 32.16 |
BLEURT♯ | 45.98 | 44.12 |
Universal Sent Encoder♯ | 44.28 | 43.47 |
Origin | 56.20 | 54.40 |
Surrogatebase | 53.00 | 52.50 |
Surrogatelarge | 54.50 | 54.70 |
Supervised Setting | ||
BERT [CLS]♯ | 35.28 | 36.24 |
BERTbase§ | 77.20 | 74.84 |
SBERTbase§ | 76.57 | 74.13 |
SRoBERTabase♯ | 77.26 | 74.89 |
Surrogatebase | 79.80 | 78.20 |
BERTlarge§ | 78.68 | 76.38 |
SBERTlarge§ | 77.85 | 75.93 |
SRoBERTalarge♯ | 79.03 | 76.92 |
Surrogatelarge | 81.00 | 80.50 |
Supervised Evaluation
We follow Reimers and Gurevych (2019) in using 10-fold cross-validation for supervised learning. Results are shown in Table 4, from which we can see that under the supervised setting, the proposed Surrogate models outperform baseline models by a large margin of over 4 points.
4.5 Experiment: Wikipedia Sections Distinction
Ein Dor et al. (2018) constructed a large set of weakly labeled sentence triplets from Wikipedia for evaluating sentence embedding methods, each of which is composed of a pivot sentence, one sentence from the same section, and one from another section. The test set contains 222K triplets. The construction of this dataset is based on the idea that a sentence is thematically closer to sentences within its section than to sentences from other sections.
We use accuracy as the evaluation metric for both unsupervised and supervised experiments: An example is treated as correctly classified if the positive example is closer to the anchor than the negative example.
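A minimal sketch of this accuracy computation under cosine similarity is shown below; encode is again a placeholder sentence encoder returning NumPy embeddings.

```python
import numpy as np

def triplet_accuracy(encode, triplets):
    """Fraction of (anchor, positive, negative) triplets where the positive is
    closer to the anchor than the negative under cosine similarity."""
    def cos(a, b):
        return np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12)
    a = encode([t[0] for t in triplets])
    p = encode([t[1] for t in triplets])
    n = encode([t[2] for t in triplets])
    return float(np.mean(cos(a, p) > cos(a, n)))
```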
Unsupervised Evaluation
We directly evaluate the trained model on the test set without finetuning. Results are shown in Table 5. For the unsupervised setting, the large model Surrogatelarge outperforms the base model Surrogatebase by 2.1 points.
Accuracy results for the Wikipedia sections distinction task. ♯ indicates that results are reproduced by ourselves; § indicates results taken from Reimers and Gurevych (2019); Surrogate denotes our proposed method.
Model | Accuracy |
---|---|
Unsupervised Setting | |
Avg. Glove embeddings♯ | 60.94 |
Avg. Skip-Thought embeddings♯ | 61.54 |
InferSent-Glove♯ | 63.39 |
Avg. BERT embeddings♯ | 66.40 |
BERT [CLS]♯ | 32.30 |
BERTScore♯ | 67.29 |
DPR♯ | 66.71 |
BLEURT♯ | 67.39 |
Universal Sent Encoder♯ | 65.18 |
Surrogatebase | 71.40 |
Surrogatelarge | 73.50 |
Supervised Setting | |
BERT [CLS]♯ | 78.13 |
BERTbase♯ | 79.30 |
SBERTbase§ | 80.42 |
SRoBERTabase§ | 79.45 |
Surrogatebase | 83.10 |
BERTlarge♯ | 80.15 |
SBERTlarge§ | 80.78 |
SRoBERTalarge§ | 79.73 |
Surrogatelarge | 83.50 |
Supervised Evaluation
During training, we use the triplet objective to train the proposed model on 1.8M training triplets and evaluate it on the test set.
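A sketch of the triplet objective is given below; using Euclidean distance with a margin of 1, as in the common SBERT setup, is an assumption, since the exact hyperparameters are not specified here.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet objective (sketch): push the anchor closer to the positive than to
    the negative by at least `margin`; inputs are [batch, dim] embeddings."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```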
Results are shown in Table 5. For the supervised setting, the proposed model significantly outperforms SBERT, with a nearly 3-point gain in accuracy for both base and large models.
5 Ablation Studies
We perform comprehensive ablation studies on the STSb dataset with no additional training on NLI datasets to better understand the behavior of the proposed framework. Studies are performed on both the original model setup (denoted by Origin) and the surrogate model setup (denoted by Surrogate). We adopt the unsupervised setting for comparison.
5.1 Size of Training Data for Origin
We would like to understand how the amount of data used to train Origin affects downstream performance. We vary the training size over {10M, 100M, 1B, 10B, 100B} tokens and present the results in Table 6. Model performance drastically improves as we increase the training data when its size is below 1B. With more training data, for example, 1B and 10B tokens, the performance approaches the best result achieved with the largest training set.
5.2 Size of Cs
Changing the size of Cs influences downstream performance. Table 7 shows the results. The overall trend is clear: a larger Cs leads to better performance. When the size is 20 or 100, the results are substantially worse than when the size is 500, while increasing the size from 500 to 1,000 only brings marginal gains. We thus use 500 as a trade-off between performance and speed.
5.3 Number of Pairs to Train Surrogate
Next, we explore the effect of the number of sentence pairs used to train Surrogate. The results are shown in Table 8. As expected, more training data leads to better performance. With only 100K training pairs, the Surrogate model still achieves a respectable result of 74.02, indicating that the automatically labeled sentence pairs are of high quality.
5.4 How to Construct C
We explore the effect of the way we construct C. We compare three different strategies: (1) the proposed two-step strategy as detailed in Section 3.3; (2) random selection; and (3) the proposed two-step strategy but without the diversity promotion constraint that allows each text chunk to contribute at most one context. For all strategies, we fix the size of C to 500.
The results for these strategies are, respectively, 78.47, 34.45, and 76.32. The random selection strategy significantly underperforms the other two. The explanation is as follows: given the huge semantic space of sentences, randomly selected contexts are very likely to be semantically irrelevant to both s1 and s2 and can hardly reflect the contextual semantics in which the sentences reside. The similarity computed from context scores based on completely irrelevant contexts is thus extremely noisy, leading to inferior performance. Removing the diversity promotion constraint (the third strategy) reduces the Spearman correlation by over 2 points. The explanation is straightforward: without the diversity constraint, very similar contexts are included in C, making the dimensions of the semantic vector redundant; with more diverse contexts, sentence similarity can be measured more comprehensively and thus more accurately.
5.5 Modules in the Scoring Function
We next turn to explore the effect of each term in the scoring function of Eq. (3). Table 9 shows the results. We can observe that removing each of these terms leads to performance drops to different degrees. Removing discriminative results in the least performance loss, with a reduction of 0.5; removing left-context and right-context respectively results in a performance loss of 1.11 and 1.46; and removing both left-context and right-context has the largest negative impact on the final results, with a performance loss of 1.97. These observations verify the importance of different terms in the scoring function, especially the context prediction terms.
The effect of each term in the scoring function Eq. (3). discriminative stands for the discriminative coherence score, left-context stands for p(c_{<i} | c_i, c_{>i}), and right-context stands for p(c_{>i} | c_{<i}, c_i). both contexts means we remove both the left-context and right-context terms.
Model | Spearman ρ |
---|---|
Full | 78.47 |
w/o discriminative | 77.97 (−0.50) |
w/o left-context | 77.36 (−1.11) |
w/o right-context | 77.01 (−1.46) |
w/o both contexts | 76.50 (−1.97) |
5.6 Model Structures
To train the surrogate model, we originally use the Siamese network structure, where the two sentences are separately fed into the same model. It is interesting to examine the effect of feeding the two sentences jointly into the model, that is, as {[CLS], s1, [SEP], s2}, and then using the special token [CLS] to compute the similarity, which is the strategy BERT uses for sentence-pair classification. We call this the BERT-style model, in contrast to the Siamese model.
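For reference, a minimal sketch of such a BERT-style (cross-encoder) scorer is shown below; the checkpoint name and the single-output regression head are illustrative, and the tokenizer inserts the [CLS] and [SEP] tokens automatically.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Cross-encoder sketch: both sentences are fed jointly as {[CLS], s1, [SEP], s2}
# and the pooled [CLS] representation is regressed onto a similarity score.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
cross_encoder = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)

def cross_encoder_score(s1, s2):
    batch = tokenizer(s1, s2, padding=True, truncation=True, return_tensors="pt")
    return cross_encoder(**batch).logits.squeeze(-1)  # predicted similarity
```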
Training the BERT-style model on the same harvested sentence pairs as the Siamese model with the L2 regression loss yields a Spearman's rank correlation of 77.43, slightly better than the 77.32 of the Siamese model. This is because interactions between words/phrases in the two sentences are modeled more thoroughly in the BERT structure, where interactions start at the input layer through self-attention. In the Siamese structure, the two sentences do not interact until the output cosine layer.
The merit of richer interactions in the BERT structure also comes at a cost: we need to rerun the full model for every new sentence pair. This is not the case for the Siamese structure, which allows for fast semantic similarity search by caching sentence representations in advance. In practice, we prefer the Siamese structure because the speedup in semantic similarity search outweighs the slight performance boost brought by the BERT structure.
5.7 Case Analysis
We conduct a case analysis on the STS benchmark (Cer et al., 2017) test set. Examples are shown in Table 10. Given two sentences s1 and s2, a model needs to compute how similar s1 and s2 are, returning a similarity score between 0 and 5. As can be seen, scores from the proposed Surrogate model are more consistent with the gold labels than those from the Universal Sentence Encoder and the SBERT model.
We use gold, surrogate, sbert, and universal to denote scores obtained from the gold label, the proposed Surrogate model, the SBERT model (Reimers and Gurevych, 2019), and the Universal Sentence Encoder model (Cer et al., 2018), respectively. Scores from the proposed Surrogate model are more consistent with the gold labels than those from the Universal Sentence Encoder and the SBERT model.

6 Conclusion
In this work, we propose a new framework for measuring sentence similarity based on the idea that the probabilities of generating two similar sentences given the same context should be similar. We design a pipelined system that first harvests massive amounts of sentence pairs along with their similarity scores, and then trains a surrogate model on the automatically labeled sentence pairs for faster inference. Extensive experiments demonstrate the effectiveness of the proposed framework against existing sentence embedding based methods.
Acknowledgment
This work is supported by the Science and Technology Innovation 2030 - "New Generation Artificial Intelligence" Major Project (no. 2021ZD0110201) and the Key R&D Projects of the Ministry of Science and Technology (2020YFC0832500). We would like to thank the editors for their help and the anonymous reviewers for their comments and suggestions.
Notes
¹ This strategy can also remove text duplicates.