Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings

We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods. We then improve an XLM-based unsupervised neural MT system pre-trained on Wikipedia by supplementing it with pseudo-parallel text mined from the same corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the WMT'14 French-English and WMT'16 German-English tasks and outperforming the previous state-of-the-art. Finally, we enrich the IWSLT'15 English-Vietnamese corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU improvement on the low-resource MT task. We demonstrate that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings.


Introduction
Large corpora of parallel sentences are prerequisites for training models across a diverse set of applications, such as neural machine translation (NMT; Bahdanau et al., 2015), paraphrase generation (Bannard and Callison-Burch, 2005), and aligned multilingual sentence embeddings (Artetxe and Schwenk, 2019b). Systems that extract parallel corpora typically rely on various cross-lingual resources (e.g., bilingual lexicons, parallel corpora), but recent work has shown that unsupervised parallel sentence mining (Hangya et al., 2018) and unsupervised NMT (Artetxe et al., 2018; Lample et al., 2018a) produce surprisingly good results. Existing approaches to unsupervised parallel sentence (or bitext) mining start from bilingual word embeddings (BWEs) learned via an unsupervised, adversarial approach (Lample et al., 2018b). Hangya et al. (2018) created sentence representations by mean-pooling BWEs over content words. To disambiguate semantically similar but nonparallel sentences, Hangya and Fraser (2019) additionally proposed parallel segment detection by searching for paired substrings with high similarity scores per word. However, using word embeddings to generate sentence embeddings ignores sentential context, which may degrade bitext retrieval performance.
We describe a new unsupervised bitext mining approach based on contextual embeddings. We create sentence embeddings by mean-pooling the outputs of multilingual BERT (mBERT; Devlin et al., 2019), which is pre-trained on unaligned Wikipedia sentences across 104 languages. For a pair of source and target languages, we find candidate translations by using nearest-neighbor search with margin-based similarity scores between pairs of mBERT-embedded source and target sentences. We bootstrap a dataset of positive and negative sentence pairs from these initial neighborhoods of candidates, then self-train mBERT on its own outputs. A final retrieval step gives a corpus of pseudo-parallel sentence pairs, which we expect to be a mix of actual translations and semantically related non-translations.
Figure 1: Our self-training scheme. Left: We index sentences using our two encoders. For each source sentence, we retrieve k nearest-neighbor target sentences per the margin criterion (Eq. 1), depicted here for k = 4. If the nearest neighbor is within a threshold, it is treated with the source sentence as a positive pair, and the remaining k − 1 are treated with the source sentence as negative pairs. Right: We refine one of the encoders such that the cosine similarity of the two embeddings is maximized on positive pairs and minimized on negative pairs.

We apply our technique on the BUCC 2017 parallel sentence mining task (Zweigenbaum et al., 2017). We achieve state-of-the-art F1 scores on unsupervised bitext mining, with an improvement of up to 24.5 points (absolute) over published results (Hangya and Fraser, 2019). Other work (e.g., Libovický et al., 2019) has shown that retrieval performance varies substantially with the layer of mBERT used to generate sentence representations; using the optimal mBERT layer yields an improvement as large as 44.9 points. Furthermore, our pseudo-parallel text improves unsupervised NMT (UNMT) performance. We build upon the UNMT framework of Lample et al. (2018c) and XLM (Lample and Conneau, 2019) by incorporating our pseudo-parallel text (also derived from Wikipedia) at training time. This boosts performance on WMT'14 En-Fr and WMT'16 En-De by up to 3.5 BLEU over the XLM baseline, outperforming the state-of-the-art in unsupervised NMT (Song et al., 2019).
Finally, we demonstrate the practical value of unsupervised bitext mining in the low-resource setting. We augment the English-Vietnamese corpus (133k pairs) from the IWSLT'15 translation task (Cettolo et al., 2015) with our pseudo-bitext from Wikipedia (400k pairs) and observe a 1.2 BLEU increase over the best published model (Nguyen and Salazar, 2019). When we reduce the amount of parallel and monolingual Vietnamese data by a factor of ten (13.3k pairs), the model trained with pseudo-bitext performs 7 BLEU points better than a model trained on the reduced parallel text alone.

Our approach
Our aim is to create a bilingual sentence embedding space where, for each source sentence embedding, a sufficiently close nearest neighbor among the target sentence embeddings is its translation. By aligning source and target sentence embeddings in this way, we can extract sentence pairs to create new parallel corpora. Artetxe and Schwenk (2019a) construct this space by training a joint encoder-decoder MT model over multiple language pairs and using the resulting encoder to generate sentence embeddings. A margin-based similarity score is then computed between embeddings for retrieval (Section 2.2). However, this approach requires large parallel corpora to train the encoder-decoder model in the first place.
We investigate whether contextualized sentence embeddings created with unaligned text are useful for unsupervised bitext retrieval. Previous work explored the use of multilingual sentence encoders taken from machine translation models (e.g., Artetxe and Schwenk, 2019b; Lu et al., 2018) for zero-shot cross-lingual transfer. Our work is motivated by recent success in tasks like zero-shot text classification and named entity recognition (e.g., Keung et al., 2019; Mulcaire et al., 2019) with multilingual contextual embeddings, which exhibit cross-lingual properties despite being trained without parallel sentences.
We illustrate our method in Figure 1. We first retrieve the candidate translation pairs:
• Each source and target language sentence is converted into an embedding vector with mBERT via mean-pooling.
• Margin-based scores are computed for each sentence pair using the k nearest neighbors of the source and target sentences (Sec. 2.2).
• Each source sentence is paired with its nearest neighbor in the target language based on this score.
• We select a threshold score that keeps some top percentage of pairs (Sec. 2.2).
• Rule-based filters are applied to further remove mismatched sentence pairs (Sec. 2.3).
The remaining candidate pairs are used to bootstrap a dataset for self-training mBERT as follows:
• Each candidate pair (a source sentence and its closest nearest neighbor above the threshold) is taken as a positive example.
• This source sentence is also paired with its next k − 1 neighbors to give hard negative examples (we compare this with random negative samples in Sec. 3.3).
• We finetune mBERT to produce sentence embeddings that discriminate between positive and negative pairs (Sec. 2.4).
After self-training, the finetuned mBERT model is used to generate new sentence embeddings. Parallel sentences should be closer to each other in this new embedding space, which improves retrieval performance.

Sentence embeddings and nearest-neighbor search
We use mBERT (Devlin et al., 2019) to create sentence embeddings for both languages by mean-pooling the representations from the final layer. We use FAISS (Johnson et al., 2017) to perform exact nearest-neighbor search on the embeddings. We compare every sentence in the source language to every sentence in the target language; we do not use links between Wikipedia articles or other metadata to reduce the size of the search space. In our experiments, we retrieve the k = 4 closest target sentences for each source sentence; the source language is always non-English, while the target language is always English.
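The pooling and retrieval steps can be sketched in a few lines. This is an illustrative stand-in, not the paper's implementation: plain NumPy arrays stand in for mBERT token outputs, a brute-force scan stands in for FAISS, and masking out padding tokens in the pooling is our assumption about how batched inputs would be handled.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, ignoring padding.
    token_embeddings: (seq_len, dim); attention_mask: (seq_len,) of 0/1."""
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

def knn_search(src_embs, tgt_embs, k=4):
    """Exact k-nearest-neighbor search by cosine similarity.
    (The paper uses FAISS; a brute-force NumPy scan stands in for it here.)"""
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = src @ tgt.T                      # (n_src, n_tgt) cosine matrix
    idx = np.argsort(-sims, axis=1)[:, :k]  # top-k target indices per source
    scores = np.take_along_axis(sims, idx, axis=1)
    return idx, scores
```

Normalizing the embeddings first lets an inner-product search return cosine similarities, which is also how exact cosine search is typically done with a flat FAISS index.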

Margin-based score
We compute a margin-based similarity score between each source sentence and its k nearest target neighbors. Following Artetxe and Schwenk (2019a), we use the ratio margin score, which calibrates the cosine similarity of a candidate pair by the average cosine similarity of each embedding's k nearest neighbors:

score(x, y) = cos(x, y) / ( Σ_{z ∈ NN_k(x)} cos(x, z) / 2k + Σ_{z ∈ NN_k(y)} cos(y, z) / 2k ).  (1)

We remove the sentence pairs with margin scores below some pre-selected threshold. For BUCC, we do not have development data for tuning the threshold hyperparameter, so we simply use the prior probability. For example, the creators of the dataset estimate that ∼2% of De sentences have an En translation, so we choose a score threshold such that we retrieve ∼2% of the pairs. We set the threshold in the same way for the other BUCC pairs. For UNMT with Wikipedia bitext mining, we set the threshold such that we always retrieve 2.5 million sentence pairs for each language pair.
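A minimal sketch of the ratio margin score in Eq. 1, assuming the cosine similarities of each sentence to its k nearest neighbors have already been computed (e.g., by the retrieval step):

```python
import numpy as np

def ratio_margin(cos_xy, nn_sims_x, nn_sims_y):
    """Ratio margin score (Artetxe and Schwenk, 2019a), Eq. 1.
    cos_xy:     cosine similarity of the candidate pair (x, y).
    nn_sims_x:  cosines of x to its k nearest target-side neighbors.
    nn_sims_y:  cosines of y to its k nearest source-side neighbors."""
    k = len(nn_sims_x)
    # Average neighborhood similarity: each side contributes k terms,
    # and the two sums are jointly divided by 2k.
    denom = (np.sum(nn_sims_x) + np.sum(nn_sims_y)) / (2 * k)
    return cos_xy / denom
```

A score well above 1 means the pair is much closer than either sentence's typical neighborhood, which is the signal used to separate true translations from merely similar sentences.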

Rule-based filtering
We also apply two simple filtering steps before finalizing the candidate pairs list:
• Digit filtering: sentence pairs which are translations of each other must have digit sequences that match exactly.
• Edit distance: sentences from English Wikipedia sometimes appear in non-English pages and vice versa. We remove sentence pairs where the source and target share substantial overlap (i.e., the character-level edit distance is ≤50%).
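The two filters might be implemented as follows. Normalizing the edit distance by the longer sentence's length is our assumption, since the text only states a 50% cutoff:

```python
import re

def digits_match(src, tgt):
    """Digit filter: translations must contain exactly the same digit sequences."""
    return re.findall(r"\d+", src) == re.findall(r"\d+", tgt)

def edit_distance(a, b):
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def too_similar(src, tgt):
    """Edit-distance filter: drop pairs whose strings mostly overlap
    (likely copied text rather than a translation)."""
    return edit_distance(src, tgt) <= 0.5 * max(len(src), len(tgt))
```

A pair survives only if `digits_match` is true and `too_similar` is false.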

Table 1: F1 scores for unsupervised bitext retrieval on BUCC 2017 (De-En, Fr-En, Ru-En, and Zh-En), comparing the method of Hangya and Fraser (2019) with ours. Results with mBERT are from our method (Sec. 2) using the final (12th) layer. We also include results for the 8th layer (e.g., Libovický et al., 2019), but do not consider this part of the unsupervised setting, as we would not have known a priori which layer was best to use.

Self-training
We devise an unsupervised self-training technique that improves mBERT for bitext retrieval using mBERT's own outputs. For each source sentence, if the nearest target sentence is within the threshold and not filtered out, the pair is treated as a positive example. We then keep the next k − 1 nearest neighbors as negative examples. Altogether, these give us a training set of sentence pairs labeled as positive or negative. We train mBERT to discriminate between positive and negative sentence pairs as a binary classification task. We denote the mBERT encoders for the source and target languages as f_src and f_tgt respectively. Our training objective is the binary cross-entropy

L(Θ_src) = −[Par(X, Y) log σ(cos(f_src(X), f_tgt(Y))) + (1 − Par(X, Y)) log(1 − σ(cos(f_src(X), f_tgt(Y))))],  (2)

where f_src(X) and f_tgt(Y) are the mean-pooled representations of the source sentence X and target sentence Y, and Par(X, Y) is 1 if X and Y are parallel and 0 otherwise. This loss encourages the cosine similarity between the source and target embeddings to increase for positive pairs and decrease otherwise. The process is depicted in Figure 1. Note that we only finetune f_src (parameters Θ_src) and hold f_tgt fixed. If both f_src and f_tgt were updated, the training process would collapse to a trivial solution, since the model would map all pseudo-parallel pairs to one representation and all non-parallel pairs to another. Holding f_tgt fixed forces f_src to align its outputs to the target (in our experiments, always English) mBERT embeddings.
After finetuning, we use the updated f_src to generate new non-English sentence embeddings. We then repeat the retrieval process with FAISS, yielding a final set of pseudo-parallel pairs after thresholding and filtering.
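One way to realize the binary classification objective over cosine similarity is binary cross-entropy on a sigmoid-squashed cosine; this hypothetical per-pair sketch only computes the loss value (in the full method, its gradient would update only the source encoder f_src, with f_tgt frozen):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_loss(src_emb, tgt_emb, is_parallel):
    """Binary cross-entropy on the sigmoid of the cosine similarity:
    the loss falls as cosine rises for positives, and as it falls for
    hard negatives. A plausible instantiation of the self-training
    objective, not necessarily the paper's exact formula."""
    p = 1.0 / (1.0 + np.exp(-cosine(src_emb, tgt_emb)))  # sigmoid
    y = float(is_parallel)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Because only the source-side embedding is trainable, minimizing this loss pulls f_src outputs toward their English mBERT neighbors on positive pairs and away from them on negatives.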

Unsupervised bitext mining
We apply our method to the BUCC 2017 shared task, "Spotting Parallel Sentences in Comparable Corpora" (Zweigenbaum et al., 2017). The task involves retrieving parallel sentences from monolingual corpora derived from Wikipedia. Parallel sentences were inserted into the corpora in a contextually-appropriate manner by the task organizers. The shared task assessed retrieval systems on precision, recall, and F1 score for four language pairs: De-En, Fr-En, Ru-En, and Zh-En. Prior work on unsupervised bitext mining has generally studied the European language pairs to avoid dealing with Chinese word segmentation (Hangya et al., 2018; Hangya and Fraser, 2019).

Setup
For each BUCC language pair, we take the corresponding source and target monolingual corpora, which have been pre-split into training, sample, and test sets at a ratio of 49%-2%-49%. The identities of the parallel sentence pairs are only available for the training set; those for the test set were not publicly released. Following the convention established in Hangya and Fraser (2019) and Artetxe and Schwenk (2019a), we use the test portion for unsupervised system development and evaluate on the training portion.

Table 2: Examples of parallel sentence pairs retrieved by our method.

Language pair | Parallel sentence pair
De-En | Beide Elemente des amerikanischen Traums haben heute einen Teil ihrer Anziehungskraft verloren. | Both elements of the American dream have now lost something of their appeal.
Fr-En | L'Allemagne à elle seule s'attend à recevoir pas moins d'un million de demandeurs d'asile cette année. | Germany alone expects as many as a million asylum-seekers this year.
Zh-En | 在如今这个奇怪的新世界里，现代和前现代相互依存。 | In the strange new world of today, the modern and the pre-modern depend on each other.

We use the reference FAISS implementation for nearest-neighbor search, and the GluonNLP toolkit (Guo et al., 2020) with pre-trained mBERT weights for inference and self-training. We compute the margin similarity score in Eq. 1 with k = 4 nearest neighbors. We set a threshold on the score such that we retrieve the prior proportion (e.g., ∼2%) of parallel pairs in each language. We then finetune mBERT via self-training, taking minibatches of 100 sentence pairs and using the Adam optimizer with a constant learning rate of 0.00001 for 2 epochs. To avoid noisy translations, we finetune on the top 50% of the highest-scoring pairs from the retrieved bitext (e.g., if the prior proportion is 2%, then we would use the top 1% of sentence pairs for self-training).
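Since there is no development data for tuning, the threshold is derived from the prior proportion of parallel pairs; a one-line sketch of this quantile-based choice:

```python
import numpy as np

def score_threshold(margin_scores, prior=0.02):
    """Pick the score cutoff that retains roughly the top `prior`
    fraction of candidate pairs (e.g., ~2% for BUCC De-En)."""
    return np.quantile(margin_scores, 1.0 - prior)
```

All candidate pairs scoring at or above this cutoff are kept; the same mechanism with a count target instead of a fraction would yield the fixed 2.5 million pairs used for Wikipedia mining.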
We considered performing more than one round of self-training but found it was not helpful for the BUCC task. BUCC has very few parallel pairs (e.g., 9,000 pairs for Fr-En) per language and thus few positive pairs for our unsupervised method to find. The size of the self-training corpus is limited by the proportion of parallel sentences, and mBERT rapidly overfits to small datasets.

Results
We show a few examples of the bitext we retrieved above. Our retrieval results are in Table 1. We compare our results with strictly unsupervised techniques, which do not use bilingual lexicons, parallel text, or other cross-lingual resources. Using mBERT as-is with the margin-based score works reasonably well, giving F1 scores in the range of 35.8 to 45.8, which is competitive with the previous state-of-the-art for some pairs and outperforms it by 12 points in the case of Ru-En. Furthermore, applying simple rule-based filters (Sec. 2.3) on the candidate translation pairs adds a few more points, although the edit distance filter has a negligible effect compared with the digit filter.
We see that finetuning mBERT on its own chosen sentence pairs (i.e., unsupervised self-training) yields significant improvements, adding another 8 to 14 points to the F 1 score on top of filtering. In all, these F 1 scores represent a 34% to 98% relative improvement over existing techniques in unsupervised parallel sentence extraction for these language pairs. Libovický et al. (2019) explored bitext mining with mBERT in the supervised context and found that retrieval performance significantly varies with the mBERT layer used to create sentence embeddings. In particular, they found layer 8 embeddings gave the highest precision-at-1. We also observe an improvement (Table 1) in unsupervised retrieval of another 13 to 20 points by using the 8th layer instead of the default final layer (12th). We include these results but do not consider them unsupervised, as we would not know a priori which layer was best to use.

Choosing negative sentence pairs
Other authors (e.g., Guo et al., 2018) have noted that the choice of negative examples has a considerable impact on metric learning. Specifically, using negative examples which are difficult to distinguish from the positive nearest neighbor is often beneficial for performance. We examine the impact of taking random sentences, instead of the remaining k − 1 nearest neighbors, as the negatives during self-training.
Our results are in Table 3. While self-training with random negatives still greatly improves the untuned baseline, the use of hard negative examples mined from the k-nearest neighborhood can make a significant difference to the final F 1 score.

Bitext for neural machine translation
A major application of bitext mining is to create new corpora for machine translation. We conduct an extrinsic evaluation of our unsupervised bitext mining approach on unsupervised (WMT'14 French-English, WMT'16 German-English) and low-resource (IWSLT'15 English-Vietnamese) translation tasks. We perform large-scale unsupervised bitext extraction on the October 2019 Wikipedia dumps in various languages. We use wikifil.pl (https://github.com/facebookresearch/fastText/blob/master/wikifil.pl) to extract paragraphs from Wikipedia and remove markup, and the syntok package (https://github.com/fnl/syntok) for sentence segmentation. Finally, we reduce the size of the corpus by removing sentences that are not part of the body of Wikipedia pages: sentences that contain *, =, //, ::, #, www, (talk), or the pattern [0-9]{2}:[0-9]{2} are filtered out.
We index, retrieve, and filter candidate sentence pairs with the procedure in Sec. 3. Unlike BUCC, the Wikipedia dataset does not fit in GPU memory. The processed corpus is quite large, with 133 million, 67 million, 36 million, and 6 million sentences in English, German, French, and Vietnamese respectively. We therefore shard the dataset into chunks of 32,768 sentences and perform the nearest-neighbor comparisons chunk by chunk for each language pair, using a simple map-reduce algorithm to merge the intermediate results back together.
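The sharded search can be sketched as a single-machine loop that scans the target corpus in chunks and merges a running top-k; the distribution of shards across GPUs and the map-reduce plumbing are omitted, and the shard size here is the same 32,768 used above:

```python
import numpy as np

def sharded_topk(src, tgt, k=4, shard_size=32768):
    """Top-k inner-product search when the target side is too large to
    hold at once: scan the target corpus shard by shard and merge each
    shard's scores into the running best-k per source sentence."""
    n_src = src.shape[0]
    best_sims = np.full((n_src, k), -np.inf)
    best_idx = np.full((n_src, k), -1, dtype=int)
    for start in range(0, tgt.shape[0], shard_size):
        shard = tgt[start:start + shard_size]
        sims = src @ shard.T  # (n_src, shard) similarity block
        # Merge this shard's scores with the running top-k.
        all_sims = np.concatenate([best_sims, sims], axis=1)
        shard_ids = np.arange(start, start + shard.shape[0])
        all_idx = np.concatenate(
            [best_idx, shard_ids[None, :].repeat(n_src, axis=0)], axis=1)
        order = np.argsort(-all_sims, axis=1)[:, :k]
        best_sims = np.take_along_axis(all_sims, order, axis=1)
        best_idx = np.take_along_axis(all_idx, order, axis=1)
    return best_idx, best_sims
```

Because only the running best-k is kept per source sentence, peak memory depends on the shard size rather than the full target corpus, which is what makes the 100-million-sentence scale tractable.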
We follow the approach outlined in Sec. 2 for Wikipedia bitext mining. For each source sentence, we retrieve the 4 nearest target neighbors across the millions of sentences that we extracted from Wikipedia and compute the margin-based scores for each pair.

Unsupervised NMT
We show that our pseudo-parallel text can complement existing techniques for unsupervised translation (Artetxe et al., 2018; Lample et al., 2018c). In line with existing work on UNMT, we evaluate our approach on the WMT'14 Fr-En and WMT'16 De-En test sets.
Our UNMT experiments build upon the reference implementation of XLM (Lample and Conneau, 2019). The UNMT model is trained by alternating between two steps: a denoising autoencoder step and a backtranslation step (refer to Lample et al. (2018c) for more details). The backtranslation step generates pseudo-parallel training data, and we incorporate our mined bitext during UNMT training in the same way, as another set of pseudo-parallel sentences. We also use the same initialization as Lample and Conneau (2019), where the UNMT models have encoders and decoders that are initialized with contextual embeddings trained on the source and target language Wikipedia corpora with the masked language model (MLM) objective; no parallel data is used.
We performed the exhaustive (Fr Wiki)-(En Wiki) and (De Wiki)-(En Wiki) nearest-neighbor comparison on eight V100 GPUs, which requires 3 to 4 days to complete per language pair. We retained the top 2.5 million pseudo-parallel Fr-En and De-En sentence pairs after mining.

Results
Our results are in Table 4. In the unsupervised bitext retrieval task (Table 1), we saw that self-training improved the F1 score. The improvement in bitext quality carries over to UNMT: providing better pseudo-parallel text yields a consistent improvement in all translation directions.
Our results are state-of-the-art in UNMT, but they should be interpreted relative to the strength of our XLM baseline. We are building on top of the XLM initialization, and the effectiveness of the initialization (and the various hyperparameters used during training and decoding) affects the strength of our final results. For example, we adjusted the beam width on our XLM baselines to attain BLEU scores which are similar to what others have published. One can apply our method to MASS, which performs better than XLM on UNMT, but we chose to report results on XLM because it has been validated on a wider range of tasks and languages.
We also trained a standard 6-layer transformer encoder-decoder model directly on the pseudo-parallel text. We used the standard implementation in Sockeye (Hieber et al., 2018) as-is, and trained models for French and German on 2.5 million Wikipedia sentence pairs. We withheld 10k pseudo-parallel pairs per language pair to serve as a development set. We achieved BLEU scores of 20.8, 21.1, 28.2, and 28.0 on En-De, De-En, En-Fr, and Fr-En respectively; BLEU scores were computed with SacreBLEU (Post, 2018). This compares favorably with the best UNMT results in Lample et al. (2018c), while avoiding the use of parallel development data altogether.

Low-resource NMT
French and German are high-resource languages and are linguistically close to English. We therefore evaluate our mined bitext on a low-resource, linguistically distant language pair. The IWSLT'15 English-Vietnamese MT task (Cettolo et al., 2015) provides 133k sentence pairs derived from translated TED talk transcripts and is a common benchmark for low-resource MT. We take supervised training data from the IWSLT task and augment it with different amounts of pseudo-parallel text mined from English and Vietnamese Wikipedia. Furthermore, we construct a very low-resource setting by downsampling the parallel text and the monolingual Vietnamese Wikipedia text by a factor of ten (13.3k sentence pairs).
We use the reference implementation for the state-of-the-art model (Nguyen and Salazar, 2019), which is a highly regularized 6+6-layer transformer with pre-norm residual connections, scale normalization, and normalized word embeddings. We use the same hyperparameters (except for the dropout rate) but train on our augmented datasets. To mitigate domain shift, we finetune the best checkpoint for 75k more steps using only the IWSLT training data, in the spirit of "trivial" transfer learning for low-resource NMT (Kocmi and Bojar, 2018).
In Table 5, we show BLEU scores as more pseudo-parallel text is included during training. As in previous work on En-Vi (cf. Luong and Manning, 2015), we use tst2012 (1553 pairs) and tst2013 (1268 pairs) as our development and test sets respectively, we tokenize all data with Moses, and we report tokenized BLEU via multi-bleu.perl. The BLEU score increases monotonically with the size of the pseudo-parallel corpus and exceeds the state-of-the-art system's BLEU by 1.2 points. This result is consistent with improvements observed with other types of monolingual data augmentation, like pre-trained UNMT initialization, various forms of back-translation (Hoang et al., 2018; Zhou and Keung, 2020), and cross-view training (CVT; Clark et al., 2018):

En-Vi BLEU (tst2013):
Luong and Manning (2015): 26.4
Clark et al. (2018): 28.9

We describe our hyperparameter tuning and infrastructure following Dodge et al. (2019). The translation sections of this work mostly used default parameters, but we did tune the dropout rate (at 0.2 and 0.3) for each amount of mined bitext for the supervised En-Vi task (at 100k, 200k, 300k, and 400k sentence pairs). We include development scores for our best models; dropout of 0.3 did best for 0k and 100k, while 0.2 did best otherwise. Training takes less than a day on one V100 GPU.
To simulate a very low-resource task, we use one-tenth of the training data by downsampling the IWSLT En-Vi train set to 13.3k sentence pairs. Furthermore, we mine bitext from one-tenth of the monolingual Wiki Vi text and extract proportionately fewer sentence pairs (i.e., 10k, 20k, 30k and 40k pairs). We use the implementation and hyperparameters for the regularized 4+4-layer transformer used by Nguyen and Salazar (2019) in a similar setting. We tune the dropout rate (0.2, 0.3, 0.4) to maximize development performance; 0.4 was best for 0k, 0.3 for 10k and 20k, and 0.2 for 30k and 40k. In Table 6, we see larger improvements in BLEU (4+ points) for the same relative increases in mined data (as compared to Table 5). In both cases, the rate of improvement tapers off as the quality and relative quantity of mined pairs degrades at each increase.

UNMT ablation study: Pre-training and bitext mining corpora
In Sec. 4.2, we mined bitext from the October 2019 Wikipedia snapshot whereas the pre-trained XLM embeddings were created prior to January 2019. Hence, it is possible that the UNMT BLEU increase would be smaller if the bitext were mined from the same corpus used for pre-training. We ran an ablation study to show the effect (or lack thereof) of the overlap between the pre-training and pseudo-parallel corpora.
For the En-Vi language pair, we used 5 million English and 5 million Vietnamese Wiki sentences to pre-train the XLM model. We only use text from the October 2019 Wiki snapshot. We mined 300k pseudo-parallel sentence pairs using our approach (Sec. 2) from the same Wiki snapshot. We created two datasets for XLM pre-training: a 10 million-sentence corpus that is disjoint from the 600k sentences of the mined bitext, and a 10 million-sentence corpus that contains all 600k sentences of the bitext. In Table 7, we show the BLEU increase on the IWSLT En-Vi task with and without using the mined bitext as parallel data, using each of the two XLM models as the initialization.
The benefit of using pseudo-parallel text is very clear: even if the pre-trained XLM model saw the pseudo-parallel sentences during pre-training, using mined bitext still significantly improves UNMT performance (23.1 vs. 28.3 BLEU). In addition, the baseline UNMT performance without the mined bitext is similar between the two XLM initializations (23.1 vs. 23.2 BLEU), which suggests that removing some of the parallel text present during pre-training does not have a major effect on UNMT. Finally, we trained a standard encoder-decoder model on the 300k pseudo-parallel pairs only, using the same Sockeye recipe as in Sec. 4.2. This yielded a BLEU score of 27.5 on En-Vi, which is lower than the best XLM-based result (i.e., 28.9) and suggests that the XLM initialization improves unsupervised NMT. A similar outcome was also reported in Lample and Conneau (2019).

Parallel sentence mining
Approaches to parallel sentence (or bitext) mining have historically been driven by the data requirements of statistical machine translation. Some of the earliest work in mining the web for large-scale parallel corpora can be found in Resnik (1998) and Resnik and Smith (2003). Recent interest in the field is reflected by new shared tasks on parallel extraction and filtering (Zweigenbaum et al., 2017) and the creation of massively multilingual parallel corpora mined from the web, like WikiMatrix and CCMatrix. Existing parallel corpora have been exploited in many ways to create sentence representations for supervised bitext mining. One approach involves a joint encoder with a shared wordpiece vocabulary, trained as part of multiple encoder-decoder translation models on parallel corpora (Schwenk, 2018). Artetxe and Schwenk (2019b) applied this approach at scale, sharing a single encoder and joint vocabulary across 93 languages. Another approach uses negative sampling to align the encoders' sentence representations for nearest-neighbor retrieval (Grégoire and Langlais, 2018; Guo et al., 2018).
However, these approaches require training with initial parallel corpora. In contrast, Hangya et al. (2018) and Hangya and Fraser (2019) proposed unsupervised methods for parallel sentence extraction that use bilingual word embeddings induced in an unsupervised manner. Our work is the first to explore using contextual representations (mBERT; Devlin et al., 2019) in an unsupervised manner to mine for bitext, and to show improvements over the latest UNMT systems (Lample and Conneau, 2019; Song et al., 2019), for which transformers and encoder/decoder pre-training have doubled or tripled BLEU scores on unsupervised WMT'16 En-De since Artetxe et al. (2018) and Lample et al. (2018c).

Self-training techniques
Self-training refers to techniques that use the outputs of a model to provide labels for its own training. Yarowsky (1995) proposed a semi-supervised strategy where a model is first trained on a small set of labeled data and then used to assign pseudo-labels to unlabeled data. Semi-supervised self-training has been used to improve sentence encoders that project sentences into a common semantic space. For example, Clark et al. (2018) proposed cross-view training (CVT) with labeled and unlabeled data to achieve state-of-the-art results on a set of sequence tagging, MT, and dependency parsing tasks.
Semi-supervised methods require some annotated data, even if it is not directly related to the target task. Our work is the first to apply unsupervised self-training for generating cross-lingual sentence embeddings. The most similar approach to ours is the prevailing scheme for unsupervised NMT (Lample et al., 2018c), which relies on multiple iterations of backtranslation (Sennrich et al., 2016) to create a sequence of pseudo-parallel sentence pairs with which to bootstrap an MT model.

Conclusion
In this work, we describe a novel approach for state-of-the-art unsupervised bitext mining using multilingual contextual representations. We extract pseudo-parallel sentences from unaligned corpora to create models that achieve state-of-the-art performance on unsupervised and low-resource translation tasks. Our approach is complementary to the improvements derived from initializing MT models with pre-trained encoders and decoders, and helps narrow the gap between unsupervised and supervised MT. We focused on mBERT-based embeddings in our experiments, but we expect unsupervised self-training to improve the unsupervised bitext mining and downstream UNMT performance of other forms of multilingual contextual embeddings as well.
Our findings are in line with recent work showing that multilingual embeddings are very useful for cross-lingual zero-shot and zero-resource tasks.
Even without using aligned corpora, mBERT can embed sentences across different languages in a consistent fashion according to their semantic content. More work will be needed to understand how contextual embeddings discover these cross-lingual correspondences.