Sentence Similarity Based on Contexts

Existing methods to measure sentence similarity are faced with two challenges: (1) labeled datasets are usually limited in size, making them insufficient to train supervised neural models; and (2) there is a training-test gap for unsupervised language modeling (LM) based models to compute semantic scores between sentences, since sentence-level semantics are not explicitly modeled at training. This results in inferior performance on this task. In this work, we propose a new framework to address these two issues. The proposed framework is based on the core idea that the meaning of a sentence should be defined by its contexts, and that sentence similarity can be measured by comparing the probabilities of generating two sentences given the same context. The proposed framework is able to generate a high-quality, large-scale dataset with semantic similarity scores between two sentences in an unsupervised manner, with which the train-test gap can be largely bridged. Extensive experiments show that the proposed framework achieves significant performance boosts over existing baselines under both the supervised and unsupervised settings across different datasets.


Introduction
Measuring sentence similarity is a long-standing task in NLP (Luhn, 1957; Robertson et al., 1995; Blei et al., 2003; Peng et al., 2020). The task aims at quantitatively measuring the semantic relatedness between two sentences, and has wide applications in text search (Farouk et al., 2018), natural language understanding (MacCartney and Manning, 2009) and machine translation (Yang et al., 2019a).
One of the greatest challenges that existing methods face for sentence similarity is the lack of large-scale labeled datasets, which contain sentence pairs with labeled semantic similarity scores. The acquisition of such datasets is both labor-intensive and expensive.

[Footnote 1: Accepted by TACL.]
For example, the STS benchmark (Cer et al., 2017) and SICK-Relatedness dataset (Marelli et al., 2014) respectively contain 8.6K and 9.8K labeled sentence pairs, the sizes of which are usually insufficient for training deep neural networks.
Unsupervised learning methods have been proposed to address this issue, where word embeddings (Le and Mikolov, 2014) or BERT embeddings (Devlin et al., 2018) are used to map sentences to fixed-length vectors in an unsupervised manner. Sentence similarity is then computed based on the cosine similarity or dot product of these sentence representations. Our work follows this thread, where sentence similarity is computed based on fixed-length sentence representations, as opposed to comparing sentences directly. The biggest issue with current unsupervised approaches is that there exists a big gap between model training and testing (i.e., computing semantic similarity between two sentences). For example, BERT-style models are trained at the token level by predicting words given contexts; sentence semantics are neither explicitly modeled nor are sentence embeddings produced at the training stage. But at test time, sentence semantics needs to be explicitly modeled to obtain semantic similarity. This inconsistency results in a distinct discrepancy between the objectives at the two stages and inferior performance on textual semantic similarity tasks. For example, BERT embeddings yield inferior performance on semantic similarity benchmarks (Reimers and Gurevych, 2019), even underperforming naive methods such as averaging GloVe (Pennington et al., 2014) embeddings. Li et al. (2020) investigated this problem and found that BERT induces a non-smooth anisotropic semantic space of sentences, and that this property significantly harms performance on semantic similarity tasks.
Just as word meanings are defined by neighboring words (Harris, 1954), the meaning of a sentence is determined by its contexts. Given the same context, there is a high probability of generating two similar sentences; if the probability of generating two sentences given the same context is low, there is a gap between the two sentences in the semantic space. Based on this idea, we propose a framework that measures semantic similarity through the similarity of the probabilities of generating two sentences given the same context, in a fully unsupervised manner. As for implementation, the framework consists of the following steps: (1) we train a contextual model by predicting the probability of a sentence fitting into the left and right contexts; (2) we obtain sentence-pair similarity by comparing the scores assigned by the contextual model across a large number of contexts. To facilitate inference, we train a surrogate model, which acts in the role of step 2, based on the outputs from step 1. The surrogate model can be directly used for sentence similarity prediction in the unsupervised setup, or used as initialization to be further finetuned on downstream datasets in the supervised setup. Note that the outcome from step 1 or the surrogate model is a fixed-length vector for the input sentence. Each element in the vector indicates how well the input sentence fits the context corresponding to that element, and the vector itself can be viewed as the overall semantics of the input sentence in the contextual space. We then use the cosine similarity between two sentence vectors to compute their semantic similarity.
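This core idea can be sketched in a few lines. The sketch below is purely illustrative: `toy_fit_score` is a crude word-overlap stand-in for the trained contextual model (an assumption for the example, not the paper's scorer), but the vector construction and cosine comparison follow the framework.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sentence_vector(sentence, contexts, fit_score):
    """Represent a sentence as one fit score per context."""
    return [fit_score(sentence, c) for c in contexts]

def toy_fit_score(sentence, context):
    """Stand-in scorer: word-overlap (Jaccard) with the context.
    The paper uses trained discriminative/generative models instead."""
    s, c = set(sentence.split()), set(context.split())
    return len(s & c) / max(len(s | c), 1)

contexts = ["the cat sat on the mat", "stock prices fell sharply today"]
v1 = sentence_vector("a cat lay on a mat", contexts, toy_fit_score)
v2 = sentence_vector("the kitten rested on the rug", contexts, toy_fit_score)
sim = cosine(v1, v2)  # both sentences fit only the first context
```

Two sentences that fit the same contexts end up with aligned score vectors and hence a high cosine similarity, even when they share few surface words.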
The proposed framework offers the potential to fully address the two challenges above: (1) the context regularization provides a reliable means to generate a large-scale high-quality dataset with semantic similarity scores based on unlabeled corpora; and (2) the train-test gap can be naturally bridged by training the model on the large-scale similarity dataset, leading to significant performance gains compared to utilizing pretrained models directly.
We conduct experiments on different datasets under both supervised and unsupervised setups, and experimental results show that the proposed framework significantly outperforms existing sentence similarity models.

Matrix Based Methods
The first line of work for measuring sentence similarity is to construct a similarity matrix between two sentences, each element of which represents the similarity between the two corresponding units of the sentences. The matrix is then aggregated in different ways to induce the final similarity score. Pang et al. (2016) applied a two-layer convolutional neural network (CNN) followed by a feed-forward layer to the similarity matrix to derive the similarity score. He and Lin (2016) used a deeper CNN to make the best use of the similarity matrix. Yin and Schütze (2015) built a hierarchical architecture to model text compositions at different granularities, so that several similarity matrices can be computed and combined for interactions. Other works proposed to use the attention mechanism as a way of computing the similarity matrix (Rocktäschel et al., 2015; Wang et al., 2016; Parikh et al., 2016; Seo et al., 2016; Shen et al., 2017; Lin et al., 2017; Gong et al., 2017; Tan et al., 2018; Kim et al., 2019; Yang et al., 2019b).
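To make the similarity-matrix construction concrete, the following sketch builds the token-by-token cosine matrix from word embeddings. The 2-dimensional embeddings are placeholder values, and the downstream CNN or attention aggregation described above is omitted.

```python
import math

def similarity_matrix(emb_a, emb_b):
    """Cosine similarity between every token pair of two sentences.
    emb_a, emb_b: lists of token embedding vectors (lists of floats)."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(y * y for y in v))
        return dot / (nu * nv)
    return [[cos(u, v) for v in emb_b] for u in emb_a]

# Toy 2-d token embeddings; real systems feed this matrix to a CNN
# (Pang et al., 2016) or aggregate it with attention.
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0], [1.0, 1.0]]
M = similarity_matrix(a, b)
```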

Word Distance Based Methods
The second line of work to measure sentence similarity is to calculate the cost of transforming one sentence into another: the smaller the cost, the more similar the two sentences. This idea is implemented by the Word Mover's Distance (WMD) (Kusner et al., 2015), which measures the dissimilarity between two documents as the minimum amount of distance that the embedded words of one document need to travel to reach the embedded words of the other. Follow-up works improve WMD by incorporating supervision from downstream tasks (Huang et al., 2016), introducing hierarchical optimal transport over topics (Yurochkin et al., 2019), addressing the complexity limitation of having to consider every word pair (Wu and Li, 2017; Wu et al., 2018; Backurs et al., 2020) and combining graph structures with WMD to perform cross-domain alignment (Chen et al., 2020). More recently, Yokoi et al. (2020) proposed the Word Rotator's Distance (WRD), which disentangles word vectors into norm and direction and shows significant performance boosts over vanilla WMD.
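The transport intuition behind WMD can be illustrated with a brute-force toy for two equal-length sentences with uniform word weights. This is only a sketch: real implementations solve a relaxed optimal-transport problem rather than enumerating permutations.

```python
import itertools
import math

def word_movers_distance(emb_x, emb_y):
    """Brute-force WMD for equal-length sentences with uniform word
    weights: the cheapest one-to-one transport of the words of x onto
    the words of y, measured by Euclidean distance in embedding space."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    n = len(emb_x)
    best = min(
        sum(dist(emb_x[i], emb_y[p[i]]) for i in range(n))
        for p in itertools.permutations(range(n))
    )
    return best / n  # uniform weight 1/n per word

# Toy 2-d word embeddings for two two-word sentences.
x = [[0.0, 0.0], [1.0, 0.0]]
y = [[1.0, 0.0], [0.0, 0.1]]
d = word_movers_distance(x, y)
```

The optimal matching pairs each word of `x` with its nearest counterpart in `y`, so the distance stays small when the sentences use near-synonymous words.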

Sentence Embeddings Based Methods
Sentence embeddings are high-dimensional representations of sentences. They are expected to contain rich sentence semantics so that the similarity between two sentences can be computed from their sentence embeddings via metrics such as cosine similarity. Le and Mikolov (2014) introduced the paragraph vector, which is learned in an unsupervised manner by predicting the words within a paragraph using the paragraph vector. In follow-up work, a line of sentence embedding methods such as FastText, Skip-Thought vectors (Kiros et al., 2015), Smooth Inverse Frequency (SIF) (Arora et al., 2016), Sequential Denoising Autoencoders (SDAEs) (Hill et al., 2016), InferSent (Conneau et al., 2017), Quick-Thought vectors (Logeswaran and Lee, 2018) and the Universal Sentence Encoder (Cer et al., 2018) were proposed to improve sentence embedding quality and efficiency.
The great success achieved by large-scale pretraining models (Devlin et al., 2018; Liu et al., 2019) has recently stimulated a strand of work on producing sentence embeddings based on the pretraining-finetuning paradigm using large-scale unlabeled corpora. The cosine similarity between the representations of two sentences produced by large-scale pretrained models is treated as the semantic similarity (Reimers and Gurevych, 2019; Wang and Kuo, 2020; Li et al., 2020). Su et al. (2021) and Huang et al. (2021) proposed to regularize sentence representations by whitening them, i.e., enforcing the covariance to be an identity matrix, to address the non-smooth anisotropic distribution issue (Li et al., 2020).
BERT-based scores (Zhang et al., 2020; Sellam et al., 2020), though designed as automatic evaluation metrics, also capture rich semantic information about sentences and have the potential to measure semantic similarity. Cer et al. (2018) proposed a method of encoding sentences into embeddings that specifically targets transfer learning to other NLP tasks. Karpukhin et al. (2020) adopted two separate BERT encoders whose weights are optimized to maximize the dot product. The most recent line of work focuses on leveraging the contrastive learning framework to tackle semantic textual similarity (Wu et al., 2020; Carlsson et al., 2021; Kim et al., 2021; Yan et al., 2021; Gao et al., 2021), where two similar sentences are pulled close and two random sentences are pushed apart in the sentence representation space. This learning strategy helps better separate sentences with different semantics. This work is motivated by learning word representations from their contexts (Mikolov et al., 2013; Le and Mikolov, 2014), under the assumption that the meaning of a word is determined by its context. Our work is based on large-scale pretrained models and aims at learning informative sentence representations for measuring sentence similarity.
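The contrastive objective used by this line of work can be sketched as a generic InfoNCE-style loss over in-batch candidates. This is a simplified form, not the exact objective of any single cited paper, and the temperature value is a common but arbitrary choice.

```python
import math

def info_nce_loss(sim_row, pos_index, temperature=0.05):
    """Contrastive loss for one anchor sentence: the positive pair is
    pulled close while the remaining (in-batch negative) candidates
    are pushed apart.  sim_row: cosine similarities between the anchor
    and all candidate sentences."""
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[pos_index] / sum(exps))

# The anchor is most similar to candidate 0 (its positive pair):
loss_good = info_nce_loss([0.9, 0.1, 0.0], pos_index=0)
# A badly trained encoder ranks a negative above the positive:
loss_bad = info_nce_loss([0.1, 0.9, 0.0], pos_index=0)
```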

Overview
The key point of the proposed paradigm is to compute the semantic similarity between two sentences by measuring the probabilities of generating the two sentences across a number of contexts.
We achieve this goal with the following steps: (1) we first train a contextual model to predict the probability of a sentence fitting into the left and right contexts. This can be achieved by either a discriminative model, i.e., predicting the probability that the concatenation of a sentence with a context forms a coherent text, or a generative model, i.e., predicting the probability of generating a sentence given contexts; (2) next, given a pair of sentences, we measure their similarity by comparing the scores assigned by the contextual models given different contexts; (3) for step 2, for any pair of sentences at test time, we need to sample different contexts to compute scores assigned by the contextual models, which is time-consuming. We thus propose to train a surrogate model that takes a pair of sentences as input and predicts the similarity assigned by the contextual model. This enables faster inference, though at a small sacrifice of accuracy; (4) the surrogate model can be directly used for obtaining sentence similarity scores in an unsupervised manner, or used as model initialization to be further fine-tuned on downstream datasets in a supervised setting. We discuss the details of each module in order below.

Training Contextual Models
We need a contextual model to predict the probability of a sentence fitting into its left and right contexts. We combine a generative model and a discriminative model to achieve this goal, allowing us to take advantage of both to model text coherence (Li et al., 2017).
Notations Let c_i denote the i-th sentence, which consists of a sequence of words c_i = {c_{i,1}, ..., c_{i,n_i}}, where n_i denotes the number of words in c_i. Let c_{i:j} denote the i-th to j-th sentences. c_{<i} and c_{>i} respectively denote the preceding and subsequent contexts of c_i.

Discriminative Models
The discriminative model takes a sequence of consecutive sentences [c_{<i}, c_i, c_{>i}] as input, and maps the input to a probability indicating whether the input is natural and coherent. We treat sentence sequences taken from original articles written by humans as positive examples, and sequences in which the center sentence c_i has been replaced as negative ones. Half of the replacements of c_i come from the original document, and half are random sentences from the corpus. The concatenation of the LSTM representations at the last step (right-to-left and left-to-right) is used to represent each sentence. The sentence representations for the consecutive sentences are concatenated and passed to a sigmoid function to obtain the final probability:

p_disc(coherent | c_{<i}, c_i, c_{>i}) = sigmoid(h^T [v_{c_{<i}} ; v_{c_i} ; v_{c_{>i}}])  (1)

where v denotes the LSTM-based sentence representations and h denotes learnable parameters. We deliberately make the discriminative model simple for two reasons: the discriminative approach to coherence prediction is a relatively easy task, and more importantly, the model will be further used in the next selection stage for screening, where faster speed is preferred.
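A minimal sketch of this discriminative scoring step, assuming the sentence vectors have already been produced by the BiLSTM encoder. The toy vectors and weights below are placeholders, not trained parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def coherence_probability(left_vec, center_vec, right_vec, h):
    """Discriminative score: concatenate the three sentence vectors
    (in the paper, last-step BiLSTM states) and map them through a
    learned linear layer h followed by a sigmoid."""
    features = left_vec + center_vec + right_vec
    return sigmoid(sum(w * f for w, f in zip(h, features)))

# Toy 2-d vectors standing in for BiLSTM sentence representations,
# and toy weights standing in for the learned parameters h.
h = [0.5, -0.2, 0.3, 0.1, 0.4, -0.1]
p = coherence_probability([1.0, 0.0], [0.5, 0.5], [0.0, 1.0], h)
```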

Generative Models
Given contexts c_{<i} and c_{>i}, the generative model predicts the probability of generating each token of sentence c_i sequentially, using SEQ2SEQ structures (Sutskever et al., 2014) as the backbone:

p(c_i | c_{<i}, c_{>i}) = ∏_{k=1}^{n_i} p(c_{i,k} | c_{i,<k}, c_{<i}, c_{>i})  (2)

Semantic similarity between two sentences can be measured not only by the forward probability of generating the two sentences given the same context, p(c_i | c_{<i}, c_{>i}), but also by the backward probability of generating contexts given sentences. The context-given-sentence probability can be modeled by predicting the preceding context given the subsequent context, p(c_{<i} | c_i, c_{>i}), and by predicting the subsequent context given the preceding context, p(c_{>i} | c_{<i}, c_i).
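The generative score is a sum of per-token conditional log-probabilities. The sketch below shows that decomposition, with a stand-in uniform language model playing the role of the trained SEQ2SEQ decoder (an assumption for illustration only).

```python
import math

def sentence_log_prob(tokens, token_log_prob, context):
    """log p(c_i | c_<i, c_>i): sum of per-token log-probabilities,
    each conditioned on the context and the previously generated
    tokens (a SEQ2SEQ decoder plays this role in the paper)."""
    total, prefix = 0.0, []
    for tok in tokens:
        total += token_log_prob(tok, prefix, context)
        prefix.append(tok)
    return total

# Stand-in conditional model: uniform over a tiny vocabulary.
VOCAB = ["the", "cat", "sat"]
def uniform_lm(tok, prefix, context):
    return math.log(1.0 / len(VOCAB))

lp = sentence_log_prob(["the", "cat"], uniform_lm, context=None)
```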

Scoring Sentence Pairs
Given a context [c_{<i}, c_{>i}], the score for s_i fitting into the context is a linear combination of the scores from the discriminative and generative models:

S(s_i, c_{<i}, c_{>i}) = λ_1 log p_disc(coherent | c_{<i}, s_i, c_{>i}) + λ_2 log p(s_i | c_{<i}, c_{>i}) + λ_3 log p(c_{<i} | s_i, c_{>i}) + λ_4 log p(c_{>i} | c_{<i}, s_i)  (3)

where λ_1, λ_2, λ_3, λ_4 control the tradeoff between the different modules. For simplicity, we use c to denote the context (c_{<i}, c_{>i}); S(s_i, c) is thus equivalent to S(s_i, c_{<i}, c_{>i}).
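The combination in Eq. (3) is a plain weighted sum of the four module scores. A sketch, with illustrative placeholder lambda values (the tuned weights are not reported here):

```python
def context_fit_score(disc_score, fwd_logp, left_logp, right_logp,
                      lambdas=(0.25, 0.25, 0.25, 0.25)):
    """S(s, c): linear combination of the discriminative score and
    the three generative log-probability terms of Eq. (3).  The
    default lambda values are placeholders, not tuned weights."""
    l1, l2, l3, l4 = lambdas
    return (l1 * disc_score + l2 * fwd_logp
            + l3 * left_logp + l4 * right_logp)

# Toy module outputs: one discriminative score, three log-probs.
s = context_fit_score(0.8, -1.0, -2.0, -2.0)
```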
Let C denote a set of contexts and N_C the size of C. For a sentence s, its semantic representation v_s is an N_C-dimensional vector, whose individual entries are S(s, c) for c ∈ C. The semantic similarity between two sentences s_1 and s_2 can then be computed from v_{s_1} and v_{s_2} using metrics such as cosine similarity.
Constructing C We need to pay special attention to the construction of C. The optimal choice is to use all contexts, i.e., to let C be the entire corpus. Unfortunately, this is computationally prohibitive, as we would need to iterate over the entire corpus for each sentence s.
We propose the following workaround for tractable computation. For a sentence s, rather than using the full corpus as C, we construct a sentence-specific context set C_s such that s fits into every constituent context in C_s. The intuition is as follows: with respect to sentence s_1, contexts can be divided into two categories: contexts which s_1 fits into, based on which we measure whether s_2 also fits; and contexts which s_1 does not fit into, for which we would measure whether s_2 also does not fit. We are mostly concerned with the former and can neglect the latter, for the following reason: the latter can be further divided into contexts that fit neither s_1 nor s_2, and contexts that do not fit s_1 but fit s_2. Contexts that fit neither s_1 nor s_2 can be neglected, since two sentences not fitting into the same context does not signify their semantic relatedness; contexts that do not fit s_1 but fit s_2 can be deferred to the construction of C_{s_2}.
Practically, for a given sentence s, we first use TF-IDF weighted bag-of-bigram vectors to perform a primary screening over the whole corpus to retrieve related text chunks (20K for each sentence). Next, we rank all contexts using the discriminative model based on Eq. (1). For the discriminative model, we cache sentence representations in advance and compute model scores only in the last neural layer, which is significantly faster than the generative model. This two-step selection strategy is akin to the pipelined selection systems (Chen et al., 2017; Karpukhin et al., 2020) in open-domain QA, which consist of document retrieval using IR systems followed by fine-grained question answering using neural QA models.
C_s is built by selecting the top-ranked contexts according to Eq. (3). We use an incremental construction strategy, adding one context at a time. To promote the diversity of C_s, each text chunk is allowed to contribute at most one context, and the Jaccard similarity between the sentences of the context to be selected and those already selected should be lower than 0.5. To compute the semantic similarity between s_1 and s_2, we concatenate C_{s_1} and C_{s_2} and use the concatenation as the context set C. The semantic similarity score between s_1 and s_2 is then the cosine similarity between the corresponding score vectors:

sim(s_1, s_2) = cos(v_{s_1}, v_{s_2}), where v_s = [S(s, c)] for c ∈ C_{s_1} ∪ C_{s_2}  (4)
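The incremental, diversity-constrained construction of C_s can be sketched as follows, assuming candidate contexts arrive already ranked by the discriminative model. The word-level Jaccard computation is a simplifying assumption for the example.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two texts."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / max(len(a | b), 1)

def select_diverse_contexts(ranked_contexts, k, max_jaccard=0.5):
    """Incrementally build C_s from contexts ranked by the
    discriminative model, skipping any candidate too similar
    (Jaccard >= max_jaccard) to a context already selected."""
    selected = []
    for cand in ranked_contexts:
        if all(jaccard(cand, s) < max_jaccard for s in selected):
            selected.append(cand)
        if len(selected) == k:
            break
    return selected

ranked = ["the cat sat on the mat",
          "the cat sat on the mat today",   # near-duplicate, skipped
          "stock prices fell sharply"]
C_s = select_diverse_contexts(ranked, k=2)
```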

Training Surrogate Models
The method described in Section 3.3 provides a direct way to compute scores for semantic relatedness. But it comes with a severe shortcoming: slow speed at inference time. Given an arbitrary pair of sentences, the model still needs to go through the entire corpus, harvest the context set C_s, and iterate over all instances in C_s for context score calculation based on Eq. (3), each of which is time-consuming. To address this issue, we propose to train a surrogate model to accelerate inference. Specifically, we first harvest similarity scores for sentence pairs using the methods in Section 3.3. We collect scores for 100M pairs in total, which are further split into train/dev/test by 98/1/1. Next, treating the harvested similarity scores as gold labels, we train a neural model that takes a pair of sentences as input and predicts their similarity score. The cosine similarity between the two sentence representations is the predicted semantic similarity, and we minimize the L2 distance between predicted and gold similarities. The Siamese structure makes it possible to derive and store fixed-size vectors for input sentences, allowing for fast semantic similarity search, which we discuss in detail in the ablation study section.
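The surrogate's training objective, the L2 distance between the Siamese model's cosine similarity and the harvested score, can be written compactly. The sentence vectors here are toy placeholders for RoBERTa representations.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def surrogate_loss(batch):
    """Mean squared (L2) error between the Siamese model's cosine
    similarity and the similarity harvested from the origin model.
    batch: (sentence_vec_1, sentence_vec_2, gold_similarity) triples."""
    return sum((cosine(u, v) - gold) ** 2 for u, v, gold in batch) / len(batch)

batch = [([1.0, 0.0], [1.0, 0.0], 1.0),   # identical vectors, gold 1.0
         ([1.0, 0.0], [0.0, 1.0], 0.0)]   # orthogonal vectors, gold 0.0
loss = surrogate_loss(batch)
```

In training, this loss would be backpropagated into the encoder; here the encoder outputs are fixed toy vectors, so the loss is simply evaluated.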
It is worth noting both the advantages and disadvantages of the surrogate model. As for advantages, it first significantly speeds up inference, as it avoids the time-consuming process of iterating over the entire corpus to construct C. Secondly, the surrogate shares the same structure with existing widely-used models such as BERT and RoBERTa, and can thus easily be finetuned on human-labeled datasets for supervised learning; the origin model in Section 3.3, in contrast, cannot readily be combined with other human-labeled datasets. As for disadvantages, the surrogate model inevitably comes with a cost in accuracy, as its upper bound is the origin model in Section 3.3.

Experiment Settings
We evaluate the Surrogate model on the Semantic Textual Similarity (STS), Argument Facet Similarity (AFS) corpus (Misra et al., 2016), and Wikipedia Sections Distinction (Ein Dor et al., 2018) tasks. We perform both unsupervised and supervised evaluations on these tasks. For unsupervised evaluations, models are directly used for obtaining sentence representations. For supervised evaluations, we use the training set to fine-tune all models with the L2 regression objective. Additionally, we conduct a partially supervised evaluation on the STS benchmarks.
Implementation Details For the discriminative model in Section 3.2.1, we use a single-layer bi-directional LSTM as the backbone, with the size of the hidden states set to 300.
For the generative model in Section 3.2.2, we implement the three models above, i.e., p(c_i | c_{<i}, c_{>i}), p(c_{<i} | c_i, c_{>i}) and p(c_{>i} | c_{<i}, c_i), based on the SEQ2SEQ structure, and use Transformer-large as the backbone (Vaswani et al., 2017). Sentence position embeddings and token position embeddings are added to the word embeddings. The model is trained on a corpus extracted from CommonCrawl which contains 100B tokens.
For the surrogate model in Section 3.4, we use RoBERTa (Liu et al., 2019) as the backbone and adopt the Siamese structure (Reimers and Gurevych, 2019), where the two sentences are first mapped to vector representations using RoBERTa. We use average pooling over the last RoBERTa layer to obtain the sentence representation. During training, we use Adam (Kingma and Ba, 2014) with a learning rate of 1e-4, β_1 = 0.9, β_2 = 0.999. The trained surrogate model obtains an average L2 distance of 7.4 × 10^-4 on the dev set when trained from scratch, and 6.1 × 10^-4 when initialized with the RoBERTa-large model (Liu et al., 2019). We set the size of C_s to 500.
Baselines We use the following models as baselines:
• Avg. GloVe embeddings is the average of word embeddings produced via co-occurrence statistics in the corpus (Pennington et al., 2014).
• BERTScore computes the similarity of two sentences as a sum of cosine similarities between their tokens' embeddings (Zhang et al., 2020).
• BLEURT is based on BERT and captures non-trivial semantic similarities by finetuning the model on the WMT Metrics dataset, on a set of ratings provided by the user, or a combination of both (Sellam et al., 2020).
• DPR uses two separate BERT encoders whose weights are optimized to maximize the dot product (Karpukhin et al., 2020).
• Universal Sent Encoder encodes sentences into embeddings that specifically target transfer learning to other NLP tasks (Cer et al., 2018).
• SBERT is a BERT-based method that uses the Siamese structure to derive sentence embeddings that can be compared through cosine similarity (Reimers and Gurevych, 2019).

Run-time Efficiency
The run-time efficiency is important for sentence representation models since similarity functions are potentially applied to large corpora.
In this subsection, we compare Surrogate base to InferSent (Conneau et al., 2017), Universal Sent Encoder (Cer et al., 2018) and SBERT base (Reimers and Gurevych, 2019).We adopt a length batching strategy in which sentences are grouped together by length.
The proposed Surrogate model is based on PyTorch, as are InferSent (Conneau et al., 2017) and SBERT (Reimers and Gurevych, 2019). The Universal Sent Encoder (Cer et al., 2018) is based on TensorFlow, with the model taken from the TensorFlow Hub. Model efficiency is measured on a server with an Intel i7-5820K CPU @ 3.30GHz, an Nvidia Tesla V100 GPU, CUDA 10.2 and cuDNN. We report both CPU and GPU speed; the results can be found in Table 1. As can be seen, InferSent is around 69% faster than the Surrogate model on CPU owing to its simpler model architecture. The speed of the proposed Surrogate model is comparable to SBERT for both the non-batching and batching setups, which is in accord with our expectations given the same transformer structure adopted by the Surrogate model.
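The length-batching strategy mentioned above amounts to sorting sentences by length before grouping them, so that each batch requires little padding. A minimal sketch, using character length as a proxy for token length:

```python
def length_batches(sentences, batch_size):
    """Sort sentences by length and group neighbors into batches,
    minimizing the padding needed within each batch."""
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

batches = length_batches(["a bb", "a", "a bb cc dd", "a bb cc"],
                         batch_size=2)
```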

Experiment: Semantic Textual Similarity
We evaluate the proposed method on the Semantic Textual Similarity (STS) tasks. We compute Spearman's rank correlation ρ between the cosine similarity of the sentence pairs and the gold labels for comparison.
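Spearman's ρ is the Pearson correlation of the rank-transformed scores. A minimal implementation without tie handling (library routines such as `scipy.stats.spearmanr` handle ties properly):

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation (no tie handling): the Pearson
    correlation of the ranks, as used to score STS systems."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Perfectly monotone predictions yield rho = 1.0.
rho = spearman_rho([0.1, 0.4, 0.9], [1.0, 2.0, 5.0])
```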
The results are shown in Table 2, and we observe significant performance boosts of the proposed models over baselines. Notably, the proposed models trained in the unsupervised setting (both Origin and Surrogate) are able to achieve competitive results compared to models trained on additional annotated NLI datasets. Another observation is that, as expected, the Surrogate models underperform the Origin model, since Origin serves as an upper bound for Surrogate, though at a cost of inference speed.
Partially Supervised Evaluation We finetune the model on the combination of the SNLI (Bowman et al., 2015) and Multi-Genre NLI (Williams et al., 2018) datasets, with the former containing 570K sentence pairs and the latter 433K pairs across various genres of sources. Sentence pairs from both datasets are annotated with one of the labels contradiction, entailment, and neutral. The proposed models are trained on the natural language inference task and then used for computing sentence representations in an unsupervised manner.
The partially supervised results are shown in Table 2. As can be seen, results from the proposed model finetuned on NLI datasets are comparable to results from unsupervised models, since no labeled similarity dataset is used, and comparable to results from supervised models when further finetuned on similarity datasets such as STS.
Supervised Evaluation For the supervised setting, we use the STS benchmark (STSb) to evaluate supervised STS systems. This dataset contains 8,628 sentence pairs from three categories: captions, news, and forums, and is split into 5,749/1,500/1,379 sentence pairs for training/dev/test respectively. The proposed models are finetuned on the labeled datasets under this setup.
For our proposed framework, we use Origin to denote the original model, where C for each sentence is constructed by searching the entire corpus as in Section 3.3 and similarity scores are computed based on Eq. (4). We also report performance for Surrogate models of base and large sizes.
The results are shown in Table 3. We can see that for both model sizes (base and large) and both setups (with and without NLI training), the proposed Surrogate model significantly outperforms baseline models, with an average gain of over 2 points on the STSb dataset.
Note that the Origin model cannot be readily adapted to the partially supervised or supervised settings, because it is hard to finetune the Origin model, for which the context set C needs to be constructed first. Hence, we finetune the Surrogate model as compensation for the accuracy loss brought by replacing Origin with Surrogate. As we can see from Table 2 and Table 3, finetuning Surrogate on the NLI datasets and STSb is an effective remedy for the performance loss.

Experiment: Argument Facet Similarity
We evaluate the proposed model on the Argument Facet Similarity (AFS) dataset (Misra et al., 2016). This dataset contains 6,000 manually annotated argument pairs collected from human conversations on three topics: gun control, gay marriage and the death penalty. Each argument pair is labeled on a scale from 0 to 5 with a step of 1. Different from the sentence pairs in STS datasets, the similarity of an argument pair in AFS is measured not only by the claim, but also by the way of reasoning, which makes AFS a more difficult dataset compared to the STS datasets. We report the Pearson correlation r and Spearman's rank correlation ρ to compare all models.

Table 2: Spearman rank correlation ρ between the cosine similarity of sentence representations and the gold labels for various Semantic Textual Similarity (STS) tasks under the unsupervised setting. We use *-NLI to denote the model additionally trained on NLI datasets. ♯ indicates that results are reproduced by ourselves; § indicates results are taken from Reimers and Gurevych (2019); Surrogate are results for our proposed method.
Unsupervised Evaluation The results are shown in Table 4, from which we can see that under the unsupervised setting, the proposed models Origin and Surrogate outperform baseline models by a large margin of over 10 points.

Supervised Evaluation
We follow Reimers and Gurevych (2019) in using 10-fold cross-validation for supervised learning. Results are shown in Table 4, from which we can see that under the supervised setting, the proposed models Origin and Surrogate outperform baseline models by over 4 points.

Experiment: Wikipedia Sections Distinction
Ein Dor et al. (2018) constructed a large set of weakly labeled sentence triplets from Wikipedia for evaluating sentence embedding methods; each triplet is composed of a pivot sentence, one sentence from the same section and one from another section. The test set contains 222K triplets. The construction of this dataset is based on the idea that a sentence is thematically closer to sentences within its own section than to sentences from other sections.
We use accuracy as the evaluation metric for both the unsupervised and supervised experiments: an example is treated as correctly classified if the positive example is closer to the anchor than the negative example. Results are shown in Table 5. For the supervised setting, the proposed model significantly outperforms SBERT, with a nearly 3-point gain in accuracy for both base and large models.
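The triplet accuracy metric can be stated in a few lines; the scalar "embeddings" and similarity function below are toy placeholders for real sentence representations.

```python
def triplet_accuracy(triplets, similarity):
    """Fraction of triplets where the positive example is closer to
    the anchor than the negative example, the metric used for the
    Wikipedia Sections Distinction task."""
    correct = sum(
        1 for anchor, pos, neg in triplets
        if similarity(anchor, pos) > similarity(anchor, neg)
    )
    return correct / len(triplets)

# Toy similarity: negative absolute difference of scalar "embeddings".
sim = lambda a, b: -abs(a - b)
acc = triplet_accuracy([(0.0, 0.1, 1.0), (0.0, 0.9, 0.2)], sim)
```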

Ablation Studies
We perform comprehensive ablation studies on the STSb dataset with no additional training on NLI datasets to better understand the behavior of the proposed framework.Studies are performed on both the original model setup (denoted by Origin) and the surrogate model setup (denoted by Surrogate).We adopt the unsupervised setting for comparison.

Size of Training Data for Origin
We would like to understand how the size of the data used to train Origin affects downstream performance. We vary the training size over {10M, 100M, 1B, 10B, 100B} tokens and present the results in Table 6. Performance improves as we increase the size of the training data when its size is below 1B. With more training data, e.g., 1B and 10B, the performance gets close to the best result achieved with the largest training set.

Size of C s
Changing the size of C_s influences downstream performance. Table 7 shows the results. The overall trend is clear: a larger C_s leads to better performance. When the size is 20 or 100, the results are substantially worse than when the size is 500. Increasing the size from 500 to 1,000 only brings marginal performance gains. We thus use 500 as a trade-off between performance and speed.

Number of Pairs to Train Surrogate
Next, we explore the effect of the number of sentence pairs used to train Surrogate. The results are shown in Table 8; even the smallest number of training pairs achieves an acceptable result of 74.02, which indicates that the automatically labeled sentence pairs we collect are of high quality.

How to Construct C
We explore the effect of the way we construct C. We compare three different strategies: (1) the proposed two-step strategy detailed in Section 3.3; (2) random selection; and (3) the proposed two-step strategy but without the diversity-promotion constraint that allows each text chunk to contribute at most one context. For all strategies, we fix the size of C to 500.
The results for these strategies are 78.47, 34.45 and 76.32 respectively. The random selection strategy significantly underperforms the other two. The explanation is as follows: given the huge semantic space of sentences, randomly selected contexts are very likely to be semantically irrelevant to both s_1 and s_2, and can hardly reflect the contextual semantics in which a sentence resides. The similarity computed from context scores based on completely irrelevant contexts is thus extremely noisy, leading to inferior performance. Removing the diversity-promotion constraint (the third strategy) reduces the Spearman correlation by over 2 points. The explanation is straightforward: without the diversity constraint, very similar contexts are included in C, making the dimensions of the semantic vector redundant; with more diverse contexts, sentence similarity can be measured more comprehensively and the result is more accurate.

Modules in the Scoring Function
We next turn to the effect of each term in the scoring function Eq. (3). Table 9 shows the results. We observe that removing each of these terms leads to performance drops of different degrees. Removing discriminative results in the smallest loss, a reduction of 0.5; removing left-context and right-context results in losses of 1.11 and 1.46, respectively; and removing both left-context and right-context has the largest negative impact, with a performance loss of 1.97. These observations verify the importance of the different terms in the scoring function, especially the context-prediction terms.
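Assuming the three terms ablated in Table 9 are combined additively in log space with equal weights (Eq. (3) may weight them differently), the scoring function can be sketched as:

```python
import math

def context_score(p_disc, p_left, p_right, n_left, n_right):
    """Combine the three terms ablated in Table 9:
    - discriminative: log p(y=1 | s_i, c_<i, c_>i)
    - left-context:  (1/|c_<i|) log p(c_<i | s_i, c_>i)
    - right-context: (1/|c_>i|) log p(c_>i | c_<i, s_i)

    The probabilities are assumed to come from the trained language model;
    n_left and n_right are the left/right context lengths used for the
    length normalization.
    """
    return (math.log(p_disc)
            + math.log(p_left) / n_left
            + math.log(p_right) / n_right)
```

The ablations in Table 9 then correspond to dropping individual summands from this expression.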

Model Structures
To train the surrogate model, we originally use a Siamese network structure in which the two sentences are separately fed into the same model. It is interesting to examine the effect of feeding the two sentences jointly into the model, i.e., {[CLS], s_1, [SEP], s_2}, and then using the special token [CLS] to compute the similarity, which is the strategy BERT uses for sentence-pair classification. We call this the BERT-style model, in contrast to the Siamese model.
Training the BERT-style model on the same harvested sentence pairs as the Siamese model with the L2 regression loss yields a Spearman's rank correlation of 77.43, slightly better than the 77.32 obtained by the Siamese model. This is because interactions between words and phrases in the two sentences are modeled more thoroughly in the BERT structure, where interactions start at the input layer through self-attention. In the Siamese structure, the two sentences do not interact until the final cosine layer.
The merit of richer interactions in the BERT structure also comes at a cost: the full model must be rerun for every new sentence pair. This is not the case for the Siamese structure, which allows fast semantic similarity search by caching sentence representations in advance. In practice, we prefer the Siamese structure, since the speedup in semantic similarity search outweighs the slight performance boost brought by the BERT structure.
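The caching argument can be illustrated with a toy Siamese scorer: each sentence is encoded once, so scoring a corpus of n sentences against a query needs n + 1 encoder calls rather than one expensive forward pass per pair. The encoder interface and cache layout here are hypothetical, not the paper's implementation.

```python
import math

def encode(sentence, cache, encoder):
    """Return the sentence embedding, calling the encoder at most once
    per distinct sentence (the Siamese structure makes this possible)."""
    if sentence not in cache:
        cache[sentence] = encoder(sentence)  # expensive model call happens once
    return cache[sentence]

def siamese_score(s1, s2, cache, encoder):
    """Cosine similarity between two cached sentence embeddings."""
    u = encode(s1, cache, encoder)
    v = encode(s2, cache, encoder)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

A BERT-style cross-encoder offers no such factorization: the pair {[CLS], s_1, [SEP], s_2} must pass through the full model for every pair, which is why the Siamese structure wins for large-scale search despite its slightly lower correlation.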

Case Analysis
We conduct case analyses on the STS benchmark (Cer et al., 2017) test set. Examples are shown in Table 10. Given two sentences s_1 and s_2, a model computes how similar they are, returning a similarity score between 0 and 5. As can be seen, scores from the proposed surrogate model correlate better with the gold scores than those from the Universal Sentence Encoder and the SBERT model.

Conclusion
In this work, we propose a new framework for measuring sentence similarity based on the idea that the probabilities of generating two similar sentences given the same context should be similar. We build a pipelined system that first harvests massive amounts of sentence pairs along with their similarity scores, and then trains a surrogate model on the automatically labeled sentence pairs for faster inference. Extensive experiments demonstrate the effectiveness of the proposed framework against existing sentence-embedding-based methods.
We directly evaluate the trained model on the test set without finetuning. Results are shown in Table 5. For the unsupervised setting, the large model Surrogate_large outperforms the base model Surrogate_base by 2.1 points.

Table 3: Results on the STSb dataset under the supervised setting. We use *-NLI to denote a model additionally trained on the NLI datasets. ♯ indicates results reproduced by ourselves; § indicates results taken from Reimers and Gurevych (2019); Surrogate denotes results for our proposed method.

Table 4: Pearson correlation r and Spearman's rank correlation ρ on the Argument Facet Similarity (AFS) dataset. ♯ indicates results reproduced by ourselves; § indicates results taken from Reimers and Gurevych (2019); Surrogate denotes results for our proposed method.

Table 5: Results on the Wikipedia sections distinction task. ♯ indicates results reproduced by ourselves; § indicates results taken from Reimers and Gurevych (2019); Surrogate denotes results for our proposed method.

Table 6: The effect of the size of training data for Origin (model performance drastically improves with more training data).

Table 7: The effect of the size of C.

Table 8: The effect of the size of training data for Surrogate.

Table 9: The effect of each term in the scoring function Eq. (3). discriminative stands for log p(y = 1 | s_i, c_<i, c_>i); left-context stands for (1/|c_<i|) log p(c_<i | s_i, c_>i); right-context stands for (1/|c_>i|) log p(c_>i | c_<i, s_i); both contexts means removing both the left-context and right-context terms.

Table 10: Examples from the STS benchmark test set. We use gold, surrogate, sbert and universal to denote scores obtained from the gold label, the proposed Surrogate model, the SBERT model (Reimers and Gurevych, 2019), and the Universal Sentence Encoder model (Cer et al., 2018), respectively.