Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora, and additionally mine 37.4 million sentence pairs from the Web, resulting in a 4× increase. We mine the parallel sentences from the Web by combining many corpora, tools, and methods: (a) Web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We train multilingual NMT models spanning all these languages on Samanantar, which outperform existing models and baselines on publicly available benchmarks such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at Samanantar and we hope they will help advance research in NMT and multilingual NLP for Indic languages.


Introduction
The advent of deep-learning (DL) based neural encoder-decoder models has led to significant progress in machine translation (MT) (Bahdanau et al., 2015; Wu et al., 2016; Sennrich et al., 2016a,b; Vaswani et al., 2017). While this has been favorable for resource-rich languages, there has been limited benefit for resource-poor languages which lack parallel corpora, monolingual corpora, and evaluation benchmarks (Koehn and Knowles, 2017). Multilingual models can improve performance on resource-poor languages via transfer learning from resource-rich languages (Firat et al., 2016; Johnson et al., 2017b; Kocmi and Bojar, 2018), more so when the resource-rich and resource-poor languages are related (Nguyen and Chiang, 2017; Dabre et al., 2017). However, it is difficult to achieve this with limited in-language data (Guzmán et al., 2019), particularly when an entire group of related languages is low-resource, making transfer learning infeasible.

* The first two authors have contributed equally. † Corresponding author: miteshk@cse.iitm.ac.in. ‡ Dedicated to the loving memory of my grandmother.
A case in point is that of languages from the Indian subcontinent, a very linguistically diverse region. India has 22 constitutionally listed languages spanning 4 major language families. Other countries in the subcontinent also have their share of widely spoken languages. These languages are closely related both genetically and through contact, which has led to significant sharing of vocabulary and linguistic features (Emeneau, 1956). Together, they account for a collective base of over 1 billion speakers. The demand for quality, publicly available translation systems in a multilingual society like India is obvious. However, there is very limited publicly available parallel data for Indic languages. Given this situation, an obvious question to ask is: What does it take to improve MT on the large set of related low-resource Indic languages? The answer is straightforward: create large parallel datasets and train proven DL models. However, collecting new data with manual translations at the scale necessary to train large DL models would be slow and expensive. Instead, several recent works have proposed mining parallel sentences from the Web (Schwenk et al., 2019a, 2020; El-Kishky et al., 2020). The representation of Indic languages in these works is poor, however (e.g., CCMatrix contains parallel data for only 6 Indic languages).
In this work, we aim to significantly increase the amount of parallel data for Indic languages by combining the benefits of many recent contributions: large Indic monolingual corpora (Kakwani et al., 2020; Ortiz Suárez et al., 2019), accurate multilingual representation learning (Feng et al., 2020; Artetxe and Schwenk, 2019), scalable approximate nearest neighbor search (Johnson et al., 2017a; Subramanya et al., 2019; Guo et al., 2020), and optical character recognition (OCR) of Indic scripts in rich text documents. By combining these methods, we propose different pipelines to collect parallel data from three different types of sources: (a) non-machine-readable sources like scanned parallel documents, (b) machine-readable sources like news Web sites with multilingual content, and (c) IndicCorp (Kakwani et al., 2020), the largest corpus of monolingual data for Indic languages. Combining existing datasets and the new datasets that we mine from the above-mentioned sources, we present Samanantar, 1 the largest publicly available parallel corpora collection for Indic languages. Samanantar contains ∼49.7M parallel sentences between English and 11 Indic languages, ranging from 141K pairs for English-Assamese to 10.1M pairs for English-Hindi. Of these, 37.4M pairs are newly mined as a part of this work and 12.4M are compiled from existing sources. Thus, the newly mined data is about 3 times the existing data. Table 1 shows the language-wise statistics. Figure 1 shows the relative contribution of the different sources from which new parallel sentences were mined; the largest contributor is data mined from IndicCorp, which accounts for 67% of the total corpus. From this English-centric corpus, we mine 83.4M parallel sentences between the 55 (11 choose 2) Indic language pairs using English as the pivot.

1 Samanantar in Sanskrit means semantically similar.
To evaluate the quality of the mined sentences we collect human judgments from 38 annotators for a total of 9,566 sentence pairs across 11 languages. The annotations attest to the high quality of the mined parallel corpus and validate our design choices.
To evaluate whether Samanantar advances the state of the art for Indic NMT, we train a multilingual model, called IndicTrans, using Samanantar. We compare IndicTrans with (a) commercial translation systems (Google, Microsoft), (b) publicly available translation systems: OPUS-MT (Tiedemann and Thottingal, 2020a), mBART50 (Tang et al., 2020), and CVIT-ILMulti, and (c) models trained on all existing sources of parallel data between Indic languages. Across multiple publicly available test sets spanning 10 Indic languages, we observe that IndicTrans performs better than all existing open-source models and even outperforms commercial systems on many benchmarks, thereby establishing the utility of Samanantar.
The three main contributions of this work, namely, (i) Samanantar, the largest collection of parallel corpora for Indic languages, (ii) IndicTrans, a multilingual model for translating from En-Indic and Indic-En, and (iii) human judgments on cross-lingual textual similarity for 9,566 sentence pairs, are publicly available (https://indicnlp.ai4bharat.org/samanantar/). In addition to English-centric data, Samanantar also contains parallel sentences between the 55 (11 choose 2) Indic language pairs obtained by pivoting through English (en). To build this corpus, we first collated all existing public sources of parallel data for Indic languages that have been released over the years, as described in Section 2.1. We then expand this corpus further by mining parallel sentences from three types of sources from the Web, as described in Sections 2.2 to 2.4.

Collation from Existing Sources
We first briefly describe the existing sources of parallel sentences for Indic languages. The Indic NLP Catalog 2 helped identify many of these sources. Recently, the WAT 2021 MultiIndicMT shared task (Nakazawa et al., 2021) also compiled many existing Indic language parallel corpora.
As shown in Table 1, these sources 4 collated together result in a total of 12.4M parallel sentences (after removing duplicates) between English and 11 Indic languages. It is interesting that no publicly available MT system has been trained using parallel data from all these existing sources.
We observed that some existing sources, such as JW300, were extremely noisy, containing many sentence pairs that were not translations of each other. However, we chose not to clean/post-process any of the existing sources, beyond what was already done by the public repositories that released these datasets. As future work, we plan to study different data filtering (Junczys-Dowmunt, 2018) and data sampling techniques (Bengio et al., 2009) and their impact on the performance of the NMT model being trained. For example, we could sort the sources by their quality and feed sentences from only very high quality sources during the later epochs while training the model.

Mining Parallel Sentences from Machine Readable Comparable Corpora
We identified several news Web sites which publish articles in multiple Indic languages (see Table 2). We found such news Web sites to be good sources of parallel sentences. We also identified some sources from the education domain (NPTEL 5 , Coursera 6 , Khan Academy 7 ) and some science YouTube channels that provide educational videos with parallel human-translated subtitles in different Indic languages.
We use the following steps to extract parallel sentences from the above sources:

Article Extraction. For every news Web site, we build custom extractors using BeautifulSoup 8 or Selenium 9 to extract the main article content. For NPTEL, YouTube science channels, and Khan Academy, we use youtube-dl 10 to collect Indic and English subtitles for every video. We skip the auto-generated YouTube captions to ensure that we only get high-quality translations. We collected subtitles for all available courses/videos on March 7, 2021. For Coursera, we identify courses which have manually created Indic and English subtitles and then use coursera-dl 11 to extract these subtitles.
Tokenization. We split the main content of the articles into sentences using the Indic NLP Library 12 (Kunchukuttan, 2020), with a few additional heuristics to account for Indic punctuation characters, sentence delimiters and non-breaking prefixes.
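The kind of heuristics involved can be sketched in a few lines. The snippet below is an illustrative simplification, not the actual pipeline (which uses the Indic NLP Library's sentence splitter); the abbreviation list here is a made-up subset:

```python
import re

# Abbreviations that should not end a sentence (illustrative subset;
# the Indic NLP Library ships fuller per-language lists).
NON_BREAKING_PREFIXES = {"Dr", "Mr", "Mrs", "Prof", "St"}

def split_sentences(text):
    """Split `text` on sentence delimiters, treating the Devanagari
    danda (U+0964) and double danda (U+0965) as full stops and
    suppressing breaks after known abbreviations."""
    # Keep each delimiter attached to the sentence it terminates.
    parts = re.split(r"(?<=[.!?\u0964\u0965])\s+", text.strip())
    sentences, buffer = [], ""
    for part in parts:
        candidate = (buffer + " " + part).strip() if buffer else part
        words = candidate.rstrip(".").split()
        last_word = words[-1] if words else ""
        if candidate.endswith(".") and last_word in NON_BREAKING_PREFIXES:
            buffer = candidate  # abbreviation: do not break here
        else:
            sentences.append(candidate)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences
```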
Parallel Sentence Extraction. At the end of the above step, we have sentence tokenized articles in English and a target language (say, Hindi). Further, all these news Web sites contain metadata based on which we can cluster the articles according to the month in which they were published (say, January 2021). We assume that to find a match for a given Hindi sentence we only need to consider all English sentences which belong to articles published in the same month as the article containing the Hindi sentence. This is a reasonable assumption as content of news articles is temporal in nature. Note that such clustering based on dates is not required for the education sources as there we can find matching sentences in bilingual captions belonging to the same video.
Let S = {s_1, s_2, . . . , s_m} be the set of all sentences across all English articles in a particular month (or in the English caption file corresponding to a given video). Similarly, let T = {t_1, t_2, . . . , t_n} be the set of all sentences across all Hindi articles in that same month (or in the Hindi caption file corresponding to the same video). Let f(s, t) be a scoring function which assigns a score indicating how likely it is that s ∈ S, t ∈ T form a translation pair. For a given Hindi sentence t_i ∈ T, the matching English sentence can be found as:

s* = argmax_{s ∈ S} f(s, t_i)

We chose f to be the cosine similarity function on embeddings of s and t. We compute these embeddings using LaBSE (Feng et al., 2020), which is a state-of-the-art multilingual sentence embedding model that encodes text from different languages into a shared embedding space. We refer to the cosine similarity between the LaBSE embeddings of s and t as the LaBSE Alignment Score (LAS).
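This matching step amounts to an argmax over cosine similarities. A minimal sketch, with the embedding model abstracted into a caller-supplied `embed` function (in the paper this is LaBSE; here any encoder works):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_match(t_i, english_sentences, embed):
    """Return (s*, LAS) for target sentence t_i: the English candidate
    maximizing cosine similarity in the shared embedding space.

    `embed` maps a sentence to a vector; in the paper this is LaBSE,
    here it is any stand-in encoder passed by the caller."""
    t_vec = embed(t_i)
    scored = ((cosine(embed(s), t_vec), s) for s in english_sentences)
    las, s_star = max(scored)
    return s_star, las
```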
Post Processing. Using the above-described process, we find the top matching English sentence, s*, for every Hindi sentence, t_i. We then select only those pairs for which the cosine similarity is greater than a threshold; across different sources we found 0.75 to be a good value. We refer to this as the LAS threshold. Next, we remove duplicates in the data: we consider two pairs (s_i, t_i) and (s_j, t_j) to be duplicates if s_i = s_j and t_i = t_j. We also remove any sentence pair where the English sentence has fewer than 4 words. Lastly, we use a language identifier 13 and eliminate pairs where the language identified for s_i or t_i does not match the intended language.
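These post-processing filters can be sketched as follows; the `detect_lang` callable is a stand-in for the language identifier, which is an external tool in the actual pipeline:

```python
def post_process(pairs, las_threshold=0.75, min_en_words=4,
                 detect_lang=None, expected_langs=("en", "hi")):
    """Apply the post-processing filters (sketch): LAS threshold,
    exact-pair de-duplication, minimum English length, and an optional
    language-identification check.

    `pairs` is an iterable of (english, target, las) triples;
    `detect_lang`, if given, maps a sentence to a language code."""
    seen, kept = set(), []
    for en, tgt, las in pairs:
        if las <= las_threshold:
            continue                      # below the LAS threshold
        if len(en.split()) < min_en_words:
            continue                      # English side too short
        if (en, tgt) in seen:
            continue                      # exact duplicate pair
        if detect_lang is not None:
            if (detect_lang(en), detect_lang(tgt)) != tuple(expected_langs):
                continue                  # language-ID mismatch
        seen.add((en, tgt))
        kept.append((en, tgt, las))
    return kept
```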

Mining Parallel Sentences from Non-Machine Readable Comparable Corpora
While Web sources are machine readable, many official documents are not. For example, proceedings of the legislative assemblies of different Indian states, in English as well as the official language of the state, are published as PDFs. In this work, we considered 3 such public sources: (a) documents from the Tamil Nadu government 14 (en-ta), (b) speeches from the Bangladesh Parliament 15 and West Bengal Legislative Assembly 16 (en-bn), and (c) speeches from the Andhra Pradesh 17 and Telangana Legislative Assemblies 18 (en-te).
Most of these documents either contained scanned images of the original document or contained proprietary encodings (non-UTF8) due to legacy issues. As a result, standard PDF parsers cannot be used to extract text from them. We use the following pipeline for extracting parallel sentences from such sources.
Optical Character Recognition (OCR). We used Google's Vision API, which supports English as well as the 11 Indic languages considered, to extract text from each document.
Tokenization. We use the same tokenization process as described in the previous section on the extracted text with extra heuristics to merge an incomplete sentence at the bottom of one page with an incomplete sentence at the top of the next page.
Parallel Sentence Extraction. Unlike the previous section, we have exact information about which documents are parallel. This information is typically encoded in the URL of the document itself (e.g., https://tn.gov.in/en/budget .pdf and https://tn.gov.in/ta/budget .pdf). Hence, for a given Tamil sentence, t i we only need to consider the sentences S = {s 1 , s 2 , . . . , s m } which appear in the corresponding English article. For a given t i , we identify the matching sentence, s * , from the candidate set S, using LAS as described in Section 2.2.
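Such URL-based document pairing can be sketched as below, assuming the language appears as a path segment (as in the illustrative tn.gov.in URLs above; other sources may encode the language differently):

```python
from urllib.parse import urlparse

def pair_documents(urls, src_lang="en", tgt_lang="ta"):
    """Pair parallel documents whose URLs differ only in the language
    path segment, e.g. /en/budget.pdf <-> /ta/budget.pdf. The URL
    pattern is assumed for illustration; real sources vary."""
    def key(url, lang):
        parts = urlparse(url).path.strip("/").split("/")
        if lang in parts:
            parts.remove(lang)
            return (urlparse(url).netloc, tuple(parts))
        return None

    src = {key(u, src_lang): u for u in urls if key(u, src_lang)}
    tgt = {key(u, tgt_lang): u for u in urls if key(u, tgt_lang)}
    return [(src[k], tgt[k]) for k in src if k in tgt]
```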
Post-Processing. We use the same postprocessing as described in Section 2.2.

Mining Parallel Sentences from Web Scale Monolingual Corpora
Recent work (Schwenk et al., 2019b;Feng et al., 2020) has shown that it is possible to align parallel sentences in large monolingual corpora (e.g., CommonCrawl) by computing the similarity between them in a shared multilingual embedding space. In this work, we consider IndicCorp (Kakwani et al., 2020), the largest collection of monolingual corpora for Indic languages (ranging from 1.39M sentences for Assamese to 100.6M sentences for English). The idea is to take an Indic sentence and find its matching En sentence from a large collection of En sentences. To perform this search efficiently, we use FAISS (Johnson et al., 2017a) which does efficient indexing, clustering, semantic matching, and retrieval of dense vectors as explained below.
Indexing. We compute the sentence embedding using LaBSE for all English sentences in IndicCorp. We create a FAISS index where these embeddings are stored in 100k clusters. We use Product Quantization (Jégou et al., 2011) to reduce the space required to store these embeddings by quantizing the 768-dimensional LaBSE embedding into an m-dimensional vector (m = 64), where each dimension is represented using an 8-bit integer value.
Retrieval. For every Indic sentence (say, Hindi sentence) we first compute the LaBSE embedding and then query the FAISS index for its nearest neighbor based on normalized inner product (i.e., cosine similarity). FAISS first finds the top-p clusters by computing the distance between each of the cluster centroids and the given Hindi sentence. We set the value of p to 1024. Within each of these clusters, FAISS searches for the nearest neighbors. This retrieval is highly optimized to scale.
In our implementation, we were able to perform on average 1,100 nearest-neighbor searches per second on the index containing 100.6M En sentences.
Recomputing Cosine Similarity. Note that FAISS computes cosine similarity on the quantized vectors (of dimension m = 64). We found that the similarity scores on the quantized vectors vary widely and do not accurately capture the cosine similarity between the original 768d LaBSE embeddings, making it difficult to choose an appropriate threshold on the quantized similarity. The relative ranking produced by FAISS, however, is reliable: for all 100 query Hindi sentences that we analyzed, FAISS retrieved the correct matching English sentence from an index of 100.6M sentences at the top-1 position. Based on this observation, we follow a two-step approach: first, we retrieve the top-1 matching sentence from FAISS using the quantized vectors; then, we compute the LAS between the full LaBSE embeddings of the retrieved sentence pair. On the computed LAS, we apply a LAS threshold of 0.80 (slightly higher than the one used for comparable sources described earlier) for filtering. This modified FAISS mining, combining quantized vectors for efficient search and full LaBSE embeddings for accurate thresholding, was crucial for mining a large number of parallel sentences.
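The two-step pattern can be illustrated end to end. This sketch replaces FAISS's product quantization with a crude int8 scalar quantization and brute-force search, purely to show the retrieve-with-quantized, rescore-with-full-precision idea; it is not the paper's implementation:

```python
import numpy as np

def quantize_int8(X):
    """Crude stand-in for FAISS product quantization: scale unit-norm
    embeddings to int8. Real PQ compresses 768-d vectors to 64 bytes;
    here we only illustrate that search and scoring use different
    precisions."""
    return np.clip(np.round(X * 127), -127, 127).astype(np.int8)

def mine_pairs(indic_emb, en_emb, las_threshold=0.80):
    """Two-step mining: retrieve top-1 neighbors with quantized dot
    products, then rescore each retrieved pair with full-precision
    cosine similarity (LAS) and keep pairs above the threshold.
    Embeddings are assumed L2-normalized, so dot product = cosine."""
    en_q = quantize_int8(en_emb).astype(np.int32)
    kept = []
    for i, vec in enumerate(indic_emb):
        scores = en_q @ quantize_int8(vec[None, :]).astype(np.int32).T
        j = int(np.argmax(scores))                 # retrieval (quantized)
        las = float(indic_emb[i] @ en_emb[j])      # rescoring (full precision)
        if las > las_threshold:
            kept.append((i, j, las))
    return kept
```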
Post-processing. We follow the same post-processing steps as described in Section 2.2. We also used the above process to extract parallel sentences from Wikipedia by treating it as a collection of monolingual sentences in different languages. This approach mined more parallel sentences than using Wikipedia's interlanguage links for article alignment followed by inter-article parallel sentence mining.
Note that we chose this LaBSE-based alignment method over existing methods like Vecalign (Thompson and Koehn, 2019) and Bleualign (Sennrich and Volk, 2011) as these methods assume/require parallel documents. For IndicCorp, however, such a parallel alignment of documents is not available and may not even exist. Further, LaBSE is trained on 17 billion monolingual sentences and 6 billion bilingual sentence pairs from 109 languages, including all the 11 Indic languages considered in this work. The authors have shown that it produces state-of-the-art results on multiple parallel text retrieval tasks and is effective even for low-resource languages. Given these advantages of LaBSE embeddings, and to have a uniform scoring mechanism (i.e., LAS) across sources, we use the same LaBSE-based mechanism for mining parallel sentences from all the sources that we considered.

Mining Inter-Indic Language Corpora
So far, we have discussed mining parallel corpora between English and Indic languages. Following Freitag and Firat (2020) and Rios et al. (2020), we now use English as a pivot to mine parallel sentences between Indic languages from all the English-centric corpora described earlier in this section. Most of the sources that we crawled data from for creating Samanantar were English-centric, that is, they contain data in English and one or more Indian languages; hence we chose English as the pivot language. For example, let (s_en, t_hi) and (ŝ_en, t_ta) be mined parallel sentences between en-hi and en-ta respectively. If s_en = ŝ_en, then we extract (t_hi, t_ta) as a Hindi-Tamil parallel sentence pair. Further, we use a very strict de-duplication criterion to avoid the creation of very similar parallel sentences. For example, if an en sentence is aligned to m hi sentences and n ta sentences, then we would get mn hi-ta pairs; since these mn pairs are likely to be similar, we retain only 1 randomly chosen pair out of them. We mined 83.4M parallel sentences between the 55 Indic language pairs, resulting in a 5.33× increase in publicly available sentence pairs between these languages (see Table 3).
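The pivoting and 1-of-mn de-duplication described above can be sketched as:

```python
import random
from collections import defaultdict

def pivot_pairs(en_hi, en_ta, seed=0):
    """Mine hi-ta pairs by pivoting through English: if an English
    sentence aligns to m Hindi and n Tamil sentences, the m*n cross
    pairs are near-duplicates, so keep exactly one at random."""
    rng = random.Random(seed)
    hi_by_en, ta_by_en = defaultdict(list), defaultdict(list)
    for en, hi in en_hi:
        hi_by_en[en].append(hi)
    for en, ta in en_ta:
        ta_by_en[en].append(ta)
    mined = []
    for en in hi_by_en.keys() & ta_by_en.keys():  # shared pivot sentences
        mined.append((rng.choice(hi_by_en[en]), rng.choice(ta_by_en[en])))
    return mined
```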

Analysis of the Quality of the Mined Parallel Corpus
We now describe the intrinsic evaluation of the data that we mined as a part of this work using the methods described in Sections 2.2, 2.3, and 2.4. This evaluation was performed by asking human annotators to estimate the cross-lingual Semantic Textual Similarity (STS) of the mined parallel sentences.

Annotation Task and Setup
We sampled 9,566 sentence pairs (English and Indic) from the mined data across 11 Indic languages and several sources. The sampling was stratified to have an equal number of sentences from three sets:
• Definite accept: sentence pairs with LAS at least 0.1 above the chosen threshold.
• Marginal accept: sentence pairs with LAS above, but within 0.1 of, the chosen threshold.
• Reject: sentence pairs with LAS below, but within 0.1 of, the chosen threshold.
The sampled sentences were shuffled randomly such that no ordering is preserved across sources or LAS. We then divided the language-wise sentence pairs into annotation batches of 30 parallel sentences each. For defining the annotation scores, we refer to SemEval-2016 Task 1 (Agirre et al., 2016), wherein cross-lingual semantic textual similarity is characterized by six ordinal levels ranging from complete semantic equivalence (5) to complete semantic dissimilarity (0). These guidelines were explained to 38 annotators across 11 Indic languages, with a minimum of 2 annotators per language. Each annotator is a native speaker of the language assigned and is also fluent in English. The annotators have 1 to 20 years of experience working on language tasks, with a mean of 5 years. The annotation task was performed on Google Forms: each form consisted of 30 sentence pairs from an annotation batch. Annotators were shown one sentence pair at a time and were asked to score it in the range of 0 to 5. The SemEval-2016 guidelines were visible to annotators at all times. After annotating 30 parallel sentences, the annotators submitted the form and continued with a new form. Annotators were compensated at the rate of Rs 100 to Rs 150 (1.38 to 2.06 USD) per 100 words read.

Annotation Results and Discussion
The results of the annotation of the 9,566 sentence pairs, comprising almost 30,000 annotations, are shown language-wise in Table 4. For over 85% of the sentence pairs, the annotators' scores are within 1 point of each other. We make the following key observations from the data.

Sentence Pairs Included in Samanantar Have High Semantic Similarity. Overall, the 'All accept' sentence pairs received a mean STS score of 4.27 and a median of 5. On a scale of 0 to 5, where 5 represents perfect semantic similarity, these statistics indicate that annotators rated sentence pairs that are included in Samanantar to be of high quality. Furthermore, the chosen LAS thresholds effectively regulate quality: the 'Definite accept' sentence pairs have a high average STS score of 4.63, which reduces to 3.89 for 'Marginal accept', and falls significantly to 2.94 for the 'Reject' set.

Table 4: Results of the annotation task to evaluate the semantic similarity between sentence pairs across 11 languages. Human judgments confirm that the mined sentences (All accept) have a high semantic similarity, with a moderately high correlation between the human judgments and LAS.

LaBSE Alignment and Annotator Scores are Moderately Correlated. The Spearman correlation coefficient between LAS and STS is a moderately positive 0.37; that is, sentence pairs with a higher LAS are more likely to be rated as semantically similar. However, the correlation is not very high (say, > 0.5), indicating potential for further improvement in learning multilingual representations with LaBSE-like models. Further, the two languages with the smallest correlation (As and Or) also have the smallest resource sizes, indicating potential for improvement in alignment methods for low-resource languages.
LaBSE Alignment is Negatively Correlated with Sentence Length, while Annotator Scores Are Not. To be consistent across languages, sentence length is computed for the English sentence in each pair. We find that sentence length is negatively correlated with LAS with a Spearman correlation coefficient of -0.35, while it is almost uncorrelated with STS with a Spearman correlation coefficient of -0.04. In other words, pairs with longer sentences are less likely to have high alignment on LaBSE representations.
Error Analysis of Mined Corpora. For error analysis, we considered a mined sentence pair to be accurate if it had (a) LAS greater than the threshold (i.e., it falls in the Marginal accept or Definite accept bucket), and (b) a human annotation score greater than or equal to 4. We found that the extraction accuracy is 79.5% overall, while the extraction accuracy for the Definite accept bucket is 90.1%. This shows that LAS-based mining and filtering can yield high-quality parallel corpora with high accuracy. In Table 5, we call out different styles of errors for each of the three buckets. In the Marginally reject (MR) bucket, we find cases where the English and the aligned Indic sentences differ in meaning and cannot be treated as parallel sentences at all. In the Marginally accept (MA) and Definitely accept (DA) buckets, we find more minor errors, for instance differences in quantities/numbers and mistaken alignment of specific words, such as 'Quarter finals' (in English) being aligned to 'Semi finals' (in Indic languages). In summary, the annotation task established that the parallel sentences in Samanantar are of high quality and validated the chosen thresholds. The task also established that LaBSE-based alignment should be further improved for low-resource languages (like as, or) and for longer sentences. We will release the human judgments on the 9,566 sentence pairs as a dataset for evaluating cross-lingual semantic similarity between English and Indic languages.

IndicTrans: Multilingual, Single Indic Script Models
The languages in the Indian subcontinent exhibit many lexical and syntactic similarities on account of genetic and contact relatedness (Abbi, 2012; Subbārāo, 2012). Genetic relatedness manifests in the two major language groups considered in this work: the Indo-Aryan branch of the Indo-European family and the Dravidian family. Owing to the long history of contact between these language groups, the Indian subcontinent is a linguistic area (Emeneau, 1956) exhibiting convergence of many linguistic properties between languages of these groups. Hence, we explore multilingual models spanning all these Indic languages to enable transfer from high-resource to low-resource languages on account of genetic relatedness (Nguyen and Chiang, 2017) or contact relatedness (Goyal et al., 2020). We trained two types of multilingual models for translation involving Indic languages: (i) One to Many for English to Indic language translation (O2M: 11 pairs), and (ii) Many to One for Indic language to English translation (M2O: 11 pairs).

Data Representation. We made a design choice to represent all the Indic language data in a single script (using the Indic NLP Library). The scripts for these Indic languages are all derived from the ancient Brahmi script. Though each of these scripts has its own Unicode codepoint range, it is possible to get a 1-1 mapping between characters in these different scripts, since the Unicode standard takes into account the similarities between the scripts. Hence, we convert all the Indic data to the Devanagari script. This allows better lexical sharing between languages for transfer learning, prevents fragmentation of the subword vocabulary between Indic languages, and allows using a smaller subword vocabulary.
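The 1-1 mapping exploited here follows from the parallel layout of the Brahmi-derived script blocks in Unicode. A simplified sketch (the Indic NLP Library used in the paper also handles script-specific exceptions that a plain codepoint offset ignores):

```python
# Unicode block bases for some Brahmi-derived scripts; each block
# mirrors Devanagari's layout, so a fixed offset maps most characters.
SCRIPT_BASE = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}

def to_devanagari(text, src_script):
    """Map `text` from `src_script` to Devanagari by Unicode offset.
    A simplified sketch; real conversion needs per-script exceptions."""
    base = SCRIPT_BASE[src_script]
    dev = SCRIPT_BASE["devanagari"]
    out = []
    for ch in text:
        cp = ord(ch)
        if base <= cp < base + 0x80:       # inside the source script block
            out.append(chr(cp - base + dev))
        else:                              # punctuation, digits, spaces, ...
            out.append(ch)
    return "".join(out)
```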
The first token of the source sentence is a special token indicating the source language (Tan et al., 2019;Tang et al., 2020). The model can make a decision on the transfer learning between these languages based on both the source language tag and the similarity of representations. When multiple target languages are involved, we follow the standard approach of using a special token in the input sequence to indicate the target language (Johnson et al., 2017b). Other standard pre-processing done on the data are Unicode normalization and tokenization. When the target language is Indic, the output in Devanagari script is converted back to the corresponding Indic script.
Training Data. We use all the Samanantar parallel data between English and Indic languages and remove overlaps with any test or validation data using a very strict criterion. For the purpose of overlap identification only, we work with lower-cased data with all punctuation characters removed. We remove any translation pair (en, t) from the training data if (i) the English sentence en appears in the validation/test data of any En-X language pair, or (ii) the Indic sentence t appears in the validation/test data of the corresponding En-X language pair. Note that, since we train a joint model, it is important to ensure that no en sentence in the test/validation data appears in any of the En-X training sets. For instance, if an en sentence occurs in the En-Hi validation/test data, then any pair containing this sentence should not be in any of the En-X training sets. We do not use any data sampling while training and leave the exploration of such strategies for future work (Arivazhagan et al., 2019).
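The overlap-removal criterion can be sketched as:

```python
import string

def normalize(sentence):
    """Lower-case and strip punctuation, as done for overlap checks."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(sentence.lower().translate(table).split())

def remove_overlaps(train_pairs, heldout_en, heldout_tgt):
    """Drop a training pair (en, t) if its English side appears in ANY
    En-X validation/test set, or its Indic side appears in the held-out
    data of this language pair. `heldout_en` is the union of all
    English held-out sentences; `heldout_tgt` is for this language."""
    en_set = {normalize(s) for s in heldout_en}
    tgt_set = {normalize(s) for s in heldout_tgt}
    return [(en, t) for en, t in train_pairs
            if normalize(en) not in en_set and normalize(t) not in tgt_set]
```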
Validation Data. We used all the validation data from the benchmarks described in Section 5.1.
Vocabulary. We learn separate vocabularies for English and Indic languages from the English-centric training data, using 32K BPE merge operations each, with subword-nmt (Sennrich et al., 2016b).
Network and Training. We use fairseq (Ott et al., 2019) for training transformer-based models. We use 6 encoder and decoder layers, input embeddings of size 1536 with 16 attention heads and feedforward dimension of 4096. We optimized the cross entropy loss using the Adam optimizer with a label-smoothing of 0.1 and gradient clipping of 1.0. We use mixed precision training with Nvidia Apex. 19 We use an initial learning rate of 5e-4, 4000 warmup steps, and the learning rate annealing schedule as proposed in Vaswani et al. (2017). We use a global batch size of 64k tokens. We train each model on 8 V100 GPUs and use early stopping with the patience of 5 epochs.
Decoding. We use beam search with a beam size of 5 and length penalty set to 1.

Experimental Setup
We evaluate the usefulness of Samanantar by comparing the performance of a translation system trained using it with existing state of the art models on a wide variety of benchmarks.

Evaluation Metrics
We use BLEU scores for evaluating the models. To ensure consistency and reproducibility across the models, we provide SacreBLEU signatures in the footnotes for Indic-English 21 and English-Indic 22 evaluations. For Indic-English, we use the in-built, default mteval-v13a tokenizer. For En-Indic, since the SacreBLEU tokenizer does not support Indic languages, 23 we first tokenize using the IndicNLP tokenizer before running SacreBLEU. The evaluation script will be made available for reproducibility.

21 BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.5.1
22 BLEU+case.mixed+numrefs.1+smooth.exp+tok.none+version.1.5.1
23 We plan to submit a pull request to sacrebleu for Indic tokenizers.

Models
We compare the following models: Commercial MT Systems. We use the translation APIs provided by Google Cloud Platform (v2) (GOOG) and Microsoft Azure Cognitive Services (v3) (MSFT) to translate all the sentences in the test sets of the benchmarks described above.
Publicly Available MT Systems. We consider the following publicly available NMT systems: OPUS-MT (OPUS): These models were trained using all parallel sources available from OPUS, as described in Section 2.1. We refer readers to Tiedemann and Thottingal (2020b) for further details about the training data.
mBART50 (mBART): This is a multilingual many-to-many model that can translate between any pair of 50 languages. The model is first pre-trained on large amounts of monolingual data from all 50 languages and then jointly fine-tuned using parallel data between multiple language pairs. We refer readers to Tang et al. (2020) for details of the monolingual pre-training data and the bilingual fine-tuning data.

Models Trained on All Existing Parallel Data.
To evaluate the usefulness of the parallel sentences in Samanantar, we train several well-studied models using all parallel data available prior to this work. Transformer (TF): We train one transformer model for every En-Indic language pair and one for every Indic-En language pair (22 models in all).
We follow the Transformer BASE model described in Vaswani et al. (2017). We use byte pair encoding (BPE) with a vocabulary size of ≈32K for every language and the same learning rate schedule as proposed in Vaswani et al. (2017). We train each model on 8 V100 GPUs and use early stopping with a patience of 5 epochs. mT5 (mT5): We finetune the pre-trained mT5 BASE model (Xue et al., 2021) for the translation task using all existing sources of parallel data. We finetune one model for every language pair of interest (18 pairs). We train each model on 1 v3 TPU and use early stopping with a patience of 25K steps.

Table 6: BLEU scores for En-X and X-En translation across different available test sets. Δ represents the difference between IndicTrans and the best results from the other models. We bold the best public model and underline the overall best model.

Models Trained Using Samanantar (IT).
We train the proposed IndicTrans model from scratch using the entire Samanantar corpus. For all models trained or finetuned as part of this work, we ensure that there is no overlap between the training set and the test/validation sets.

Results and Discussion
The results of our experiments on Indic-En and En-Indic translation are reported in Tables 6 and 7. Below, we list the main observations from our experiments.
Compilation of Existing Resources was a Fruitful Exercise. We observe that current state-of-the-art models trained on all existing parallel data (curated as a subset of Samanantar) perform competitively with other models.

IndicTrans Trained on Samanantar Outperforms Existing Models. From Tables 6 and 7, we observe that IndicTrans trained on Samanantar outperforms nearly all existing models for all the languages in both directions; the exceptions are the WMT benchmarks and the UFAL en-ta benchmark. The absolute gain in BLEU score is higher for the Indic-En direction than for the En-Indic direction, on account of better transfer in many-to-one settings compared to one-to-many settings (Aharoni et al., 2019) and a better language model on the target side. In particular, in Table 7, we observe that IndicTrans trained on Samanantar clearly outperforms IndicTrans trained only on existing resources. From Table 6, we observe that IndicTrans trained on Samanantar also outperforms commercial models (GOOG and MSFT) on most benchmarks. On the FLORES dataset, our models are still a few points behind the commercial systems, which suggests that the in-house training data for these systems better capture the domain and data distribution of FLORES.
Performance Gains are Higher for Low Resource Languages. We observe significant gains for low-resource languages such as or and kn, especially in the Indic-En direction. These languages benefit from related languages with more resources due to multilingual training.
Pre-training Needs Further Investigation. mT5, which is pre-trained on large amounts of monolingual corpora from multiple languages, does not always outperform a Transformer BASE model that is trained only on existing parallel data without any pre-training. While this does not invalidate the value of pre-training, it does suggest that pre-training needs to be optimized for the specific languages of interest. As future work, we would like to explore pre-training using the monolingual corpora for Indic languages available from IndicCorp. Further, we would like to pre-train a single-script mT5- or mBART-like model for Indic languages and then fine-tune it on MT using Samanantar.

Conclusion
We present Samanantar, the largest publicly available collection of parallel corpora for Indic languages. In particular, we mine 37.4M parallel sentences by leveraging Web-crawled monolingual corpora as well as recent advances in multilingual representation learning, approximate nearest neighbor search, and optical character recognition. We also mine 83.4M parallel sentences between 55 Indic language pairs from this English-centric corpus. We collect human judgments for 9,566 sentence pairs from Samanantar and show that the newly mined pairs are of high quality. Our multilingual single-script model, IndicTrans, trained on Samanantar outperforms existing models on a wide variety of benchmarks, demonstrating that our parallel corpus mining approaches can contribute to high-quality MT models for Indic languages.
To further improve the parallel corpora and translation quality for Indian languages, the following areas need further exploration: (a) improving LaBSE representations for low-resource languages and longer sentences, especially benefiting from human judgments, (b) optimizing training schedules and objectives such that they utilize data quality information and linguistic similarity, and (c) pre-training multilingual models.
We hope that the three main contributions of this work (Samanantar, IndicTrans, and a manually annotated dataset for cross-lingual similarity) will support further research on NMT and multilingual NLP for Indic languages.