Revisiting Multi-Domain Machine Translation

When building machine translation systems, one often needs to make the best out of heterogeneous sets of parallel data in training, and to robustly handle inputs from unexpected domains in testing. This multi-domain scenario has attracted a lot of recent work that falls under the general umbrella of transfer learning. In this study, we revisit multi-domain machine translation, with the aim of formulating the motivations for developing such systems and the associated expectations with respect to performance. Our experiments with a large sample of multi-domain systems show that most of these expectations are hardly met, and suggest that further work is needed to better analyze the current behaviour of multi-domain systems and to make them fully deliver on their promises.


Introduction
Data-based Machine Translation (MT), whether statistical or neural, rests on well-understood machine learning principles. Given a training sample of matched source-target sentence pairs (f, e) drawn from an underlying distribution D_s, a model parameterized by θ (here, a translation function h_θ) is trained by minimizing the empirical expectation of a loss function ℓ(h_θ(f), e). This approach ensures that the translation loss remains low when translating more sentences drawn from the same distribution.
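This training principle can be made concrete with a deliberately tiny sketch (ours, not the paper's system): a word-for-word "model" standing in for h_θ, a 0/1 sentence-level loss standing in for ℓ, and the empirical risk averaged over the sample.

```python
# Toy illustration of empirical risk minimization over (source, target) pairs
# drawn from D_s. `h` stands for the translation function h_theta; `loss` for
# the per-pair loss l(h(f), e). All names and data here are made up.

def empirical_risk(h, loss, sample):
    """Average training loss over the sample: the quantity minimized in training."""
    return sum(loss(h(f), e) for f, e in sample) / len(sample)

# A trivial word-for-word "model" and a 0/1 sentence-level loss, purely for illustration.
lexicon = {"the": "le", "cat": "chat"}
h = lambda f: " ".join(lexicon.get(w, w) for w in f.split())
loss = lambda hyp, ref: 0.0 if hyp == ref else 1.0

sample = [("the cat", "le chat"), ("the dog", "le chien")]
risk = empirical_risk(h, loss, sample)  # 0.5: one of the two pairs is translated correctly
```

Minimizing this average over θ is what "training" denotes throughout the paper; the guarantee only covers test sentences drawn from the same D_s, which motivates the next paragraph.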
Owing to the great variability of language data, this ideal situation is rarely met in practice, warranting the study of an alternative scenario, where the test distribution D_t differs from D_s. In this setting, domain adaptation (DA) methods are in order. DA has a long history in Machine Learning in general (e.g., Shimodaira, 2000; Ben-David et al., 2010; Quiñonero-Candela and Lawrence, 2008; Pan and Yang, 2010) and in NLP in particular (e.g., Daumé III and Marcu, 2006; Blitzer, 2007; Jiang and Zhai, 2007). Various techniques thus exist to handle both the situation where a (small) training sample drawn from D_t is available in training, and the one where only samples of source-side (or target-side) sentences are available (see Foster and Kuhn [2007], Bertoldi and Federico [2009], and Axelrod et al. [2011] for proposals from the statistical MT era, or Chu and Wang [2018] for a recent survey of DA for Neural MT).
A seemingly related problem is multi-domain (MD) machine translation (Sajjad et al., 2017; Farajian et al., 2017b; Kobus et al., 2017; Zeng et al., 2018; Pham et al., 2019), where one single system is trained and tested with data from multiple domains. MD machine translation (MDMT) corresponds to a very common situation, where all available data, no matter its origin, is used to train a robust system that performs well for any kind of new input. While the intuitions behind MDMT are quite simple, the exact specifications of MDMT systems are rarely spelled out: For instance, should an MDMT system perform well when the test data is distributed like the training data, when it is equally distributed across domains, or when the test distribution is unknown? Should it also be robust to new domains? How should it handle domain labeling errors?
A related question concerns the relationship between supervised domain adaptation and multi-domain translation. The latter task seems more challenging, as it tries to optimize MT performance for a more diverse set of potential inputs, with an additional uncertainty regarding the distribution of test data. Are there still situations where MD systems can surpass single-domain adaptation, as is sometimes expected?
In this paper, we formulate in a more precise fashion the requirements that an effective MDMT system should meet (Section 2). Our first contribution is thus of a methodological nature and consists of a list of expected properties of MDMT systems and of associated measurements to evaluate them (Section 3). In doing so, we also shed light on new problems that arise in this context, regarding, for instance, the accommodation of new domains in the course of training, or the computation of automatic domain tags. Our second main contribution is experimental and consists in a thorough reanalysis of eight recent multi-domain approaches from the literature, including a variant of a model initially introduced for DA. We show in Section 4 that existing approaches still fall short of meeting many of these requirements, notably with respect to the handling of a large number of heterogeneous domains and to dynamically integrating new domains in training.

Requirements of Multi-Domain MT
In this section, we recap the main reasons for considering a multi-domain scenario and discuss their implications in terms of performance evaluation.

Formalizing Multi-Domain Translation
We conventionally define a domain d as a distribution D_d(x) over some feature space X that is shared across domains (Pan and Yang, 2010): In machine translation, X is the representation space for source sentences; each domain corresponds to a specific source of data, and differs from the other data sources in terms of textual genre, thematic content (Chen et al., 2016; Zhang et al., 2016), register (Sennrich et al., 2016a), style (Niu et al., 2018), and so forth. Translation in domain d is formalized by a translation function h_d(y|x) pairing sentences x ∈ X in a source language with sentences y ∈ Y in a target language. h_d is usually assumed to be deterministic (hence y = h_d(x)), but can differ from one domain to the other.
A typical learning scenario in MT is to have access to samples from n_d domains, which means that the training distribution D_s is a mixture D_s(x) = Σ_d λ_d D_d(x), with mixture weights λ_d. Multi-domain learning, as defined in Dredze and Crammer (2008), further assumes that domain tags are also available in testing, the implication being that the test distribution is also such a mixture, making the problem distinct from mere domain adaptation. A multi-domain learner is then expected to use these tags effectively (Joshi et al., 2012) when computing the combined translation function h(x, d), and to perform well in all domains (Finkel and Manning, 2009). This setting is closely related to the multi-source adaptation problem formalized in Mansour et al. (2009a,b) and Hoffman et al. (2018). This definition seems to be the most accepted view of multi-domain MT and one that we also adopt here. Note that in the absence of further specification, the naive answer to the MD setting is to estimate one translation function ĥ_d(x) separately for each domain, then to translate using ĥ(x, d) = Σ_{d'} I(d' = d) ĥ_{d'}(x), where I() is the indicator function. We now discuss the arguments that are put forward to proceed differently.
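The naive per-domain answer amounts to a simple dispatch on the domain tag; a minimal sketch (our toy functions, not an actual MT system) makes the indicator-function notation concrete:

```python
# Sketch of the naive MD baseline: one estimated translation function h_d per
# domain, combined as h(x, d) = sum_d' I(d' = d) * h_d'(x), i.e., a dispatch
# on the domain tag d. The per-domain "translators" below are placeholders.

def make_naive_md_translator(per_domain):
    """per_domain maps a domain tag d to its estimated translation function h_d."""
    def h(x, d):
        # Equivalent to the sum with the indicator I: only the matching
        # domain's function contributes.
        return per_domain[d](x)
    return h

per_domain = {
    "MED": lambda x: x + " [medical translation]",
    "LAW": lambda x: x + " [legal translation]",
}
h = make_naive_md_translator(per_domain)
out = h("take two tablets daily", "MED")
```

The arguments below are precisely about why one might prefer a single system with shared parameters over this collection of independent per-domain functions.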

Reasons for Building MDMT Systems
A first motivation for moving away from the one-domain / one-system solution is practical (Sennrich et al., 2013; Farajian et al., 2017a): When faced with inputs that are potentially from multiple domains, it is easier and computationally cheaper to develop one single system than to optimize and maintain multiple engines. The underlying assumption here is that the number of domains of interest can be large, a limiting scenario being fully personalized machine translation (Michel and Neubig, 2018).
A second line of reasoning rests on linguistic properties of the translation function and contends that domain specificities are mostly expressed lexically and will primarily affect content words or multi-word expressions; function words, on the other hand, are domain agnostic and tend to remain semantically stable across domains, motivating some cross-domain parameter sharing. An MDMT system should simultaneously learn lexical domain peculiarities, and leverage cross-domain similarities to improve the translation of generic contexts and words (Zeng et al., 2018; Pham et al., 2019). It is here expected that the MDMT scenario will be more profitable when the domain mix includes domains that are closely related and can share more information.
A third series of motivations is of a statistical nature. The training data available for each domain is usually unevenly distributed, and domain-specific systems trained or adapted on small datasets are likely to have a high variance and to generalize poorly. For some test domains, there may even be no data at all (Farajian et al., 2017a). Training mixed-domain systems is likely to reduce this variance, at the expense of a larger statistical bias (Clark et al., 2012). Under this view, MDMT would be especially beneficial for domains with little training data. This is observed for multilingual MT from English: an improvement for under-resourced languages due to positive transfer, at the cost of a decrease in performance for well-resourced languages (Arivazhagan et al., 2019).
Combining multiple domain-specific MT systems can also be justified for the sake of distributional robustness (Mansour et al., 2009a,b), for instance, when the test mixture differs from the train mixture, or when it includes new domains unseen in training. An even more challenging case is when the MT system would need to perform well for any test distribution, as studied for statistical MT in Huck et al. (2015). In all these cases, mixing domains in training and/or testing is likely to improve robustness against unexpected or adversarial test distributions (Oren et al., 2019).
A distinct line of reasoning is that mixing domains can have a positive regularization effect for all domains. By introducing variability in training, it prevents DA from overfitting the available adaptation data and could help improve generalization even for well-resourced domains. A related case is made in Joshi et al. (2012), which shows that part of the benefits of MD training is due to an ensembling effect, where systems from multiple domains are simultaneously used in the prediction phase; this effect may subsist even in the absence of clear domain separations.
To recap, there are multiple arguments for adopting MDMT, some already used in DA settings, and some original. These arguments are not mutually exclusive; however, each yields specific expectations with respect to the performance of this approach, and should also yield an appropriate evaluation procedure. If the motivation is primarily computational, then a drop in MT quality with respect to multiple individual domain-specific systems might be acceptable if compensated by the computational savings. If it is to improve statistical estimation, then the hope will be that MDMT improves, at least for some under-resourced domains, over individually trained systems. If, finally, it is to make the system more robust to unexpected or adversarial test distributions, then this is the setting that should be used to evaluate MDMT. The next section discusses ways in which these requirements of MDMT systems can be challenged.

Challenging Multi-Domain Systems
In this section, we propose seven operational requirements that can be expected from an effective multi-domain system, and discuss ways to evaluate whether these requirements are actually met. All these evaluations rest on comparisons of translation performance, and do not depend on the choice of a particular metric. To make our results comparable with the literature, we will only use the BLEU score (Papineni et al., 2002) in Section 4, noting that it may not be the best yardstick to assess the subtle improvements in lexical choice that are often associated with domain-adapted systems (Irvine et al., 2013). Other important figures of merit for MDMT systems are the computational training cost and the total number of parameters.

Multi-Domain Systems Should Be Effective
A first expectation is that MDMT systems should perform well in the face of mixed-domain test data. We thus derive the following requirements.
[P1-LAB] An MDMT system should perform better than the baseline, which disregards domain labels or reassigns them in a random fashion (Joshi et al., 2012). Evaluating this requirement is a matter of a mere comparison, assuming the test distribution of domains is known: If all domains are equally important, performance averages can be reported; if they are not, weighted averages should be used instead.
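The two aggregation modes can be sketched in a few lines; the per-domain BLEU scores and the test distribution below are illustrative numbers, not results from the paper:

```python
# Sketch of the two aggregate scores used for [P1-LAB]: an unweighted average
# when all domains are equally important, and a weighted average when the test
# distribution over domains is known. All figures here are hypothetical.

def aggregate(scores, weights=None):
    """Average per-domain scores, optionally weighted by a test distribution."""
    domains = sorted(scores)
    if weights is None:  # all domains equally important
        return sum(scores[d] for d in domains) / len(domains)
    z = sum(weights[d] for d in domains)  # normalize, in case weights don't sum to 1
    return sum(weights[d] * scores[d] for d in domains) / z

bleu = {"MED": 40.0, "LAW": 60.0}   # hypothetical per-domain BLEU scores
prior = {"MED": 0.9, "LAW": 0.1}    # hypothetical test distribution over domains
unweighted = aggregate(bleu)         # 50.0
weighted = aggregate(bleu, prior)    # 42.0: the dominant domain drives the score
```

The gap between the two averages shows why the assumed test distribution must be stated when reporting MDMT results: the ranking of systems can change with the weighting.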
[P2-TUN] Additionally, one can expect that MDMT will improve over fine-tuning (Luong and Manning, 2015;Freitag and Al-Onaizan, 2016), at least in domains where data is scarce, or in situations where several domains are close. To evaluate this, we perform two measurements, using a real as well as an artificial scenario. In the real scenario, we simply compare the performance of MDMT and fine-tuning for domains of varying sizes, expecting a larger gain for smaller domains.
In the artificial scenario, we split a single domain in two parts, which are considered as distinct in training. The expectation here is that an MDMT system should yield a clear gain for both pseudo sub-domains, which should benefit from the supplementary amount of relevant training data. In this situation, MDMT should even outperform fine-tuning on either of the pseudo sub-domains.

Robustness to Fuzzy Domain Separation
A second set of requirements is related to the definition of a domain. As repeatedly pointed out in the literature, parallel corpora in MT are often collected opportunistically, and the view that each corpus constitutes a single domain is often a gross approximation. MDMT should aim to make the best of the available data and be robust to imperfect domain assignments. To challenge these expectations, we propose the following requirements.
[P3-HET] The notion of a domain being a fragile one, an effective MDMT system should be able to discover not only when cross-domain sharing is useful (cf. requirement [P2-TUN]), but also when intra-domain heterogeneity is hurting. This requirement is tested by artificially conjoining separate domains into one during training, hoping that the loss in performance with respect to the baseline (using correct domain tags) will remain small.
[P4-ERR] MDMTs should perform best when the true domain tag is known, but deteriorate gracefully in the face of tag errors; in this situation, catastrophic drops in performance are often observed. This requirement can be assessed by translating test texts with erroneous domain tags and reporting the subsequent loss in performance.
[P5-UNK] A related situation occurs when the domain of a test document is unknown. Several situations need to be considered: For domains seen in training, using automatically predicted domain labels should not be much worse than using the correct one. For test documents from unknown domains (zero-shot transfer), a good MD system should ideally outperform the default baseline that merges all available data.
[P6-DYN] Another requirement, of a more operational nature, is that an MDMT system should smoothly evolve to handle a growing number of domains, without having to retrain the full system each time new data becomes available. We challenge this requirement by dynamically changing the number of training and test domains.

Scaling to a Large Number of Domains
[P7-NUM] As mentioned above, MDMT systems have often been motivated by computational arguments. This argument is all the more sensible as the number of domains increases, making the optimization of many individual systems both ineffective and undesirable. Lacking access to corpora containing very large sets of domains (e.g., on the order of 100-1,000), we experiment with automatically learned domains.

Data and Metrics
We experiment with translation from English into French and use texts originating from six domains, corresponding to the following data sources: the UFAL Medical corpus V1.0 (MED); the European Central Bank corpus (BANK) (Tiedemann, 2012); the JRC-Acquis Communautaire corpus (LAW) (Steinberger et al., 2006); documentation for KDE, Ubuntu, GNOME, and PHP from the Opus collection (Tiedemann, 2009) (IT); and two further domains, REL and TALK. Translation performance is measured with the BLEU score (Papineni et al., 2002). Significance testing is performed using bootstrap resampling (Koehn, 2004), implemented in compare-mt (Neubig et al., 2019). We report significant differences at the level of p = 0.05.

We measure the distance between domains using the H-divergence (Ben-David et al., 2010), which relates domain similarity to the test error of a domain discriminator: the larger the error, the closer the domains. Our discriminator is an SVM independently trained for each pair of domains, with sentence representations derived via mean pooling from the source-side representations of the generic Transformer model. We use the scikit-learn implementation with default values. Results in Table 2 show that all domains are well separated from all others, with REL being the furthest apart, while TALK is slightly more central.

Baselines
Our baselines are standard for multi-domain systems. Using Transformers (Vaswani et al., 2017) implemented in OpenNMT-tf (Klein et al., 2017), we build the following systems:
• a generic model trained on a concatenation of all corpora (Mixed). We develop two versions of this system, one where the domain imbalance reflects the distribution of our training data given in Table 1 (Mixed-Nat) and one where all domains are equally represented in training (Mixed-Bal). The former is the best option when the train mixture D_s is also expected in testing; the latter should be used when the test distribution is uniform across domains. Accordingly, we report two aggregate scores: a weighted average reflecting the training distribution, and an unweighted average, where all test domains are equally important.
• fine-tuned models (Luong and Manning, 2015;Freitag and Al-Onaizan, 2016), based on the Mixed-Nat system, further trained on each domain for at most 20,000 iterations, with early stopping when the dev BLEU stops increasing. The full fine-tuning (FT-Full) procedure may update all the parameters of the initial generic model, resulting in six systems adapted for one domain, with no parameter-sharing across domains.
All models use embeddings and hidden layers of dimension 512. Transformers have 8 attention heads in each of the 6+6 layers; the inner feed-forward layer contains 2,048 cells. The adapter-based systems (see below) additionally use an adaptation block in each layer, composed of a two-layer perceptron with an inner ReLU activation function, operating on normalized entries, with an inner dimension of 1,024. Training uses batches of 12,288 tokens, Adam with parameters β1 = 0.9, β2 = 0.98, Noam decay (warmup steps = 4,000), and a dropout rate of 0.1 in all layers.
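The adaptation block can be sketched in numpy (this is an illustrative re-derivation from the description above, not the OpenNMT-tf implementation): layer-normalize the 512-dimensional entries, apply a two-layer perceptron with an inner ReLU of dimension 1,024, and add the result back residually.

```python
# Numpy sketch of a per-layer adaptation block: LayerNorm -> 512->1024 linear
# -> ReLU -> 1024->512 linear -> residual add. Weight initialization and the
# absence of LayerNorm gain/bias are simplifications for this sketch.
import numpy as np

def adapter(x, w_in, b_in, w_out, b_out, eps=1e-6):
    # layer-normalize the 512-dim entries
    h = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)
    h = np.maximum(h @ w_in + b_in, 0.0)   # 512 -> 1024, inner ReLU
    return x + h @ w_out + b_out           # 1024 -> 512, residual connection

rng = np.random.default_rng(0)
w_in, b_in = rng.normal(0, 0.02, (512, 1024)), np.zeros(1024)
w_out, b_out = rng.normal(0, 0.02, (1024, 512)), np.zeros(512)
y = adapter(rng.normal(size=(3, 512)), w_in, b_in, w_out, b_out)  # shape (3, 512)
```

In the adapter-based systems, only these small blocks need be domain-specific, which is what keeps the per-domain parameter overhead modest.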

Multi-Domain Systems
Our comparison of multi-domain systems includes our own reimplementations of recent proposals from the literature:
• a system using domain control as in Kobus et al. (2017): domain information is introduced either as an additional token for each source sentence (DC-Tag), or as a supplementary feature for each word (DC-Feat);
• a system using lexicalized domain representations (Pham et al., 2019): word embeddings are composed of a generic and a domain specific part (LDR); • the three proposals of Britz et al. (2017).
TTM is a feature-based approach where the domain tag is introduced as an extra word on the target side. Training uses reference tags, and inference is usually performed with predicted tags, just like for regular target words. DM is a multi-task learner where a domain classifier is trained on top of the MT encoder, so as to make it aware of domain differences; ADM is the adversarial version of DM, pushing the encoder towards learning domain-independent source representations. These methods thus only use domain tags in training.
• two adapter-based systems, FT-Res and MDL-Res (see, e.g., Sharaf et al., 2020). FT-Res fine-tunes the adapter modules of a Mixed-Nat system independently for each domain, keeping all the other parameters frozen. MDL-Res uses the same architecture, but a different training procedure, and learns all parameters jointly from scratch with a mixed-domain corpus.
This list includes systems that slightly depart from our definition of MDMT: Standard implementations of TTM and WDCMT rely on inferred, rather than gold, domain tags, which may somewhat affect their predictions; DM and ADM make no use of domain tags at all. We did not consider the proposal of Farajian et al. (2017b), however, which performs on-the-fly tuning for each test sentence and diverges more strongly from our notion of MDMT.
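The simplest of these schemes, the DC-Tag input convention, amounts to prepending a domain pseudo-token to each source sentence; the tag format below is ours, for illustration:

```python
# Sketch of DC-Tag-style preprocessing (Kobus et al., 2017): the domain is
# injected as an extra pseudo-token at the start of each source sentence, and
# the model learns to condition its output on it. The tag syntax is ours.

def add_domain_tag(source, domain):
    """Prepend a domain pseudo-token to a source sentence."""
    return f"<{domain}> {source}"

tagged = add_domain_tag("take two tablets daily", "MED")
# -> "<MED> take two tablets daily"
```

At inference time, the same function is applied with the (gold or predicted) test-domain tag, which is exactly where requirements [P4-ERR] and [P5-UNK] bite: a wrong tag changes the conditioning of every generated word.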

Performance of MDMT Systems
In this section, we discuss the basic performance of MDMT systems trained and tested on six domains. Results are in Table 3. As expected, balancing data in the generic setting makes a great difference (the unweighted average is 2 BLEU points better, notably owing to the much better results for REL).
As explained above, this setting should be the baseline when the test distribution is assumed to be balanced across domains. As all other systems are trained with an unbalanced data distribution, we use the weighted average to perform global comparisons. Fine-tuning each domain separately yields a better baseline, outperforming Mixed-Nat for all domains, with significant gains for domains that are distant from MED: REL, IT, BANK, LAW.
All MDMT systems (except DM and ADM) slightly improve over Mixed-Nat (for most domains), but these gains are rarely significant. Among systems using an extra domain feature, DC-Tag has a small edge over DC-Feat and requires fewer parameters; it also outperforms TTM, which, however, uses predicted rather than gold domain tags. The system of Su et al. (2019) seems to produce comparable, albeit slightly improved, results. TTM is also the best choice among the systems that do not use domain tags in inference. The best contenders overall are FT-Res and MDL-Res, which significantly improve over Mixed-Nat for a majority of domains, and are the only ones to clearly fulfill [P1-LAB]; WDCMT also improves on three domains, but regresses on one. The use of a dedicated adaptation module thus seems better than feature-based strategies, but yields a large increase in the number of parameters. The effect of the adaptation layer is especially significant for small domains (BANK, IT, and REL). All systems fail to outperform fine-tuning, sometimes by a wide margin, especially for an ''isolated'' domain like REL. This might be due to the fact that domains are well separated (cf. Section 4.1) and hardly help each other. In this situation, MDMT systems should dedicate a sufficient number of parameters to each domain, so as to close the gap with fine-tuning.

In the first three experiments, we randomly split one corpus in two parts and proceed as if this corresponded to two actual domains. An MD system should detect that these two pseudo-domains are mutually beneficial and should hardly be affected by this change with respect to the baseline scenario (no split). In this situation, we expect MDMT to even surpass fine-tuning separately on each of these dummy domains, as MDMT exploits all the data, while fine-tuning focuses only on a subpart. In testing, we decode the test set twice, once with each pseudo-domain tag. This makes no difference for TTM, DM, ADM, and WDCMT, which do not use domain tags in testing.
In the merge experiment, we merge two corpora in training, in order to assess the robustness with respect to heterogeneous domains [P3-HET]. We then translate the two corresponding test sets with the same (merged) system.

Redefining Domains
Our findings can be summarized as follows. For the split experiments, we see small variations that can be positive or negative compared to the baseline situation, but these are hardly significant. All systems show some robustness with respect to fuzzy domain boundaries; this is most notable for ADM, suggesting that normalizing away domain information pays off when domain boundaries are uncertain.

Handling Wrong or Unknown Domains
In the last two columns of Table 4, we report the drop in performance when the domain information is not correct. In the first (RND), we use test data from the domains seen in training, presented with a random domain tag. In this situation, the loss with respect to using the correct tag is generally large (more than 10 BLEU points), showing an overall failure to meet requirement [P4-ERR], except for systems that ignore domain tags in testing.
In the second (NEW), we assess [P5-UNK] by translating sentences from a domain unseen in training (NEWS). For each sentence, we automatically predict the domain tag and use it for decoding: we train a language model for each domain and assign tags on a per-sentence basis from the language-model log-probabilities (assuming uniform domain priors); this domain classifier has an average prediction error of 16.4% on in-domain data. In this configuration, again, systems using domain tags during inference perform poorly, significantly worse than the Mixed-Nat baseline (BLEU=23.5).

Handling Growing Numbers of Domains

Another set of experiments evaluates the ability to dynamically handle supplementary domains (requirement [P6-DYN]) as follows. Starting with the existing MD systems of Section 5.1, we introduce an extra domain (NEWS) and resume training with this new mixture of data for 50,000 additional iterations. Designing a proper balance between domains in training is critical for achieving optimal performance; as our goal is to evaluate all systems in the same conditions, we consider a basic mixing policy based on the new training distribution, which is detrimental to the small domains, for which the ''negative transfer'' effect is stronger than for larger domains. We contrast this approach with training all systems from scratch and report differences in performance in Figure 1 (see also Table 7 in Appendix B; WDCMT results are excluded, as resuming its training proved difficult to implement). Figure 1 reports BLEU scores for a complete training session with seven domains, as well as differences (in blue) with training with six domains (from Table 3), and (in red) with continual training. We expect that MDMT systems should not be too significantly impacted by the addition of a new domain and should reach about the same performance as when training with this domain from scratch.

From a practical viewpoint, dynamically integrating new domains is straightforward for DC-Tag, DC-Feat, and TTM, for which new domains merely add new labels. It is less easy for DM, ADM, and WDCMT, which include a built-in domain classifier whose outputs have to be pre-specified, or for LDR, FT-Res, and MDL-Res, for which the number of possible domains is built into the architecture and has to be anticipated from the start. This makes a difference between domain-bounded systems, for which the number of domains is limited, and truly open-domain systems.
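The automatic domain tagger used for [P5-UNK] (one language model per domain, each sentence assigned to the domain with the highest log-probability under uniform priors) can be sketched as follows; the unigram models and the tiny corpora are toy stand-ins for the real LMs:

```python
# Sketch of per-sentence domain tag prediction: train one LM per domain and
# pick argmax_d log p_d(sentence), i.e., Bayes with uniform domain priors.
# The add-alpha unigram LMs and corpora below are illustrative only.
import math
from collections import Counter

def train_unigram_lm(corpus, alpha=1.0):
    """Return a function scoring a sentence's log-probability under an
    add-alpha-smoothed unigram model of the corpus."""
    counts = Counter(w for sent in corpus for w in sent.split())
    total, vocab = sum(counts.values()), len(counts) + 1  # +1 for unknown words
    return lambda sent: sum(
        math.log((counts[w] + alpha) / (total + alpha * vocab)) for w in sent.split()
    )

lms = {
    "MED": train_unigram_lm(["the patient received a dose", "dose of the drug"]),
    "LAW": train_unigram_lm(["the court ruled on the case", "article of the treaty"]),
}

def predict_domain(sentence):
    # uniform priors: the argmax over log-likelihoods suffices
    return max(lms, key=lambda d: lms[d](sentence))

tag = predict_domain("the patient received the drug")  # -> "MED"
```

The 16.4% in-domain error reported above shows that even on domains seen in training, such a tagger is noisy, which compounds the tag-sensitivity problem measured under [P4-ERR].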
We can first compare the results of cold-start training with six or seven domains in Table 7: A first observation is that the extra training data hardly helps for most domains, except for NEWS, where we see a large gain, and for TALK. The picture is the same when one looks at MDMT systems, where only the weakest systems (DM, ADM) seem to benefit from more (out-of-domain) data. Comparing now the cold-start with the warm-start scenario, we see that the former is always significantly better for NEWS, as expected, and that resuming training also negatively impacts the performance for the other domains. This happens notably for DC-Tag, TTM, and ADM. In this setting, MDL-Res and DM show the smallest average loss, with the former achieving the best balance of training cost and average BLEU score.

Automatic Domains
In this section, we experiment with automatic domains, obtained by clustering sentences into k = 30 classes using the k-means algorithm, based on generic sentence representations obtained via mean pooling (cf. Section 4.1). This allows us to evaluate requirement [P7-NUM], training and testing our systems as if these classes were fully separated domains. Many of these clusters are mere splits of the large MED domain, while a smaller number of classes are mixtures of two (or more) existing domains (full details are in Appendix C). We are thus in a position to reiterate, at a larger scale, the measurements of Section 5.2 and test whether multi-domain systems can effectively take advantage of cross-domain similarities and eventually perform better than fine-tuning. The results, to be contrasted with those of Table 3, show that this effect is more visible for small subsets of the medical domain.

Finally, Table 6 reports the effect of using automatic domains for each of the six test sets: Each sentence was first assigned to an automatic class, then translated with the corresponding multi-domain system with 30 classes; aggregate numbers were then computed and contrasted with the six-domain scenario. Results are clear and confirm previous observations: Even though some clusters are very close, the net effect is a loss in performance for almost all systems and conditions. In this setting, the best MDMT in our pool (MDL-Res) is no longer able to surpass the Mixed-Nat baseline.
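Deriving the automatic domains is a standard clustering step; a small-scale sketch (random stand-ins for the mean-pooled sentence representations, and k reduced to 3 to keep the example light) looks as follows:

```python
# Sketch of how automatic domains are obtained: k-means over mean-pooled
# sentence representations. Here the representations are random placeholders
# and k is reduced from 30 to 3 for the example.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
reprs = rng.normal(size=(300, 512))   # pseudo sentence representations (300 sentences)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(reprs)
labels = km.labels_                    # the automatic "domain" of each sentence
```

Each sentence's cluster index then plays the role of its domain tag, both when training the 30-class systems and when routing test sentences, as in the Table 6 experiment.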

Related Work
The multi-domain training regime is more the norm than the exception in natural language processing (Dredze and Crammer, 2008; Finkel and Manning, 2009), and the design of multi-domain systems has been proposed for many language processing tasks. We focus here exclusively on MD machine translation, keeping in mind that similar problems and solutions (parameter sharing, instance selection / weighting, adversarial training, etc.) have been studied in other contexts.
Multi-domain translation was already proposed for statistical MT, either considering as we do multiple sources of training data (e.g., Banerjee et al., 2010;Clark et al., 2012;Sennrich et al., 2013;Huck et al., 2015), or domains made of multiple topics (Eidelman et al., 2012;Hasler et al., 2014). Two main strategies were considered: instance-based, involving a measure of similarities between train and test domains; feature-based, where domain/topic labels give rise to additional features.
The latter strategy has been widely used in NMT: Kobus et al. (2017) inject an additional domain feature in their seq2seq model, either in the form of an extra (initial) domain token or in the form of an additional domain feature associated with each word. These results are reproduced by Tars and Fishel (2018), who also consider automatically induced domain tags. This technique also helps control the style of MT outputs in Sennrich et al. (2016a) and Niu et al. (2018), and to encode the source or target languages in multilingual MT (Firat et al., 2016; Johnson et al., 2017). Domain control can also be performed on the target side, as in Chen et al. (2016), where a topic vector describing the whole document serves as an extra context in the softmax layer of the decoder. Such ideas are further developed in Chu and Dabre (2018) and Pham et al. (2019), where domain differences and commonalities are encoded in the network architecture: Some parameters are shared across domains, while others are domain-specific.
Techniques proposed by Britz et al. (2017) aim to ensure that domain information is actually used in a mixed-domain system. Three methods are considered, using either domain classification (or domain normalization, via adversarial training) on the source or target side. There is no clear winner in any of the three language pairs considered. One contribution of this work is the idea of normalizing representations through adversarial training, so as to make the mixture of heterogeneous data more effective; representation normalization has since proven a key ingredient in multilingual transfer learning. The same basic techniques (parameter sharing, automatic domain identification / normalization) are simultaneously at play in Zeng et al. (2018) and Su et al. (2019): In this approach, the lower layers of the MT system use auxiliary classification tasks to disentangle domain-specific representations on the one hand from domain-agnostic representations on the other hand. These representations are then processed as two separate inputs, then recombined to compute the translation.
Another parameter-sharing scheme appears in Jiang et al. (2019), which augments a Transformer model with domain-specific heads, whose contributions are regulated at the word/position level: Some words have ''generic'' use and rely on mixed-domain heads, whereas for some other words it is preferable to use domain-specific heads, thereby reintroducing the idea of ensembling at the core of Huck et al. (2015) and Saunders et al. (2019). The results for three language pairs outperform several standard baselines for two-domain systems (in fr:en and de:en) and a four-domain system (zh:en).
Finally, Farajian et al. (2017b), Li et al. (2018), and Xu et al. (2019) adopt a different strategy: Each test sentence triggers the selection of a small set of related instances, on which a generic NMT model is fine-tuned for a few iterations before delivering its output. This approach entirely dispenses with the notion of domain and relies on data selection techniques to handle data heterogeneity.

Conclusion and Outlook
In this study, we have carefully reconsidered the idea of multi-domain machine translation, which seems to be taken for granted in many recent studies. We have spelled out the various motivations for building such systems and the associated expectations in terms of system performance. We have then designed a series of requirements that MDMT systems should meet and proposed a series of associated test procedures. In our experiments with a representative sample of MDMT systems, we have found that most requirements were hardly met under our experimental conditions. While MDMT systems are able to outperform the mixed-domain baseline, at least for some domains, they all fall short of matching the performance of fine-tuning on each individual domain, which remains the best choice in multi-source, single-domain adaptation. As expected, however, MDMT systems are less brittle than fine-tuning when domain frontiers are uncertain and can, to a certain extent, dynamically accommodate additional domains, which is especially easy for feature-based approaches. Our experiments finally suggest that all methods show decreasing performance as the number of domains or the diversity of the domain mixture increases.
Two other main conclusions can be drawn from this study. First, more work is needed to help MDMT systems make the best of the variety of the available data, both to effectively share what needs to be shared and to keep separate what needs to be kept separate. We see two areas worthy of further exploration: the development of parameter-sharing strategies when the number of domains is large, and the design of training strategies that can effectively handle a change in the training mixture, including an increase in the number of domains. Both problems are of practical relevance in industrial settings. Second, and maybe more importantly, there is a general need for better methodologies for evaluating MDMT systems, which would require system developers to clearly spell out the testing conditions and the associated expected distribution of test instances, and to report more than comparisons with simple baselines on a fixed and known handful of domains.

Acknowledgments
The work presented in this paper was partially supported by the European Commission under contract H2020-787061 ANITA.
The medium Transformer model uses 8 heads in each of its 6 layers; the inner feed-forward layer contains 2,048 cells. Training uses a batch size of 12,288 tokens; optimization uses Adam with parameters β1 = 0.9, β2 = 0.98 and Noam decay (warmup steps = 4,000), and a dropout rate of 0.1 for all layers.
• FT-Res and MDL-res use the same medium Transformer and add residual layers with a bottleneck dimension of size 1,024.
• ADM and DM use the medium Transformer model and a domain classifier composed of 3 dense layers of size 512 × 2,048, 2,048 × 2,048, and 2,048 × domain num. The first two layers of the classifier use ReLU as activation function; the last layer uses tanh.
• DC-Feat uses the medium Transformer model and domain embeddings of size 4. Given a sentence of domain i in a training batch, the embedding of domain i is concatenated to the embedding of each token in the sentence.
• LDR uses the medium Transformer model and introduces, for each token, an LDR feature of size 4 × domain num. Given a sentence of domain i ∈ [1, .., K] in the training batch, for each token of the sentence, the LDR units with indexes outside the range [4(i − 1), .., 4i − 1] are masked to 0, and the masked LDR feature is concatenated to the embedding of the token. Details are in Pham et al. (2019).
• Mixed-Nat-RNN uses one bidirectional LSTM layer in the encoder and one LSTM layer in the decoder. The size of hidden layers is 1,024, the size of word embeddings is 512.
• WDCNMT uses one bidirectional GRU layer in the encoder and one GRU-conditional layer in the decoder. The size of hidden layers is 1,024, the size of word embeddings is 512.
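The ADM/DM domain classifier described above can be sketched as a small feed-forward network; layer sizes and activations follow the description, while the weights below are random placeholders for illustration only:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class DomainClassifier:
    """Dense layers of sizes 512x2048, 2048x2048, and 2048xK, with ReLU
    on the first two layers and tanh on the last, matching the ADM/DM
    setup described above. Weights are random placeholders."""
    def __init__(self, d_in=512, d_hid=2048, n_domains=6, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.02, (d_in, d_hid))
        self.W2 = rng.normal(0.0, 0.02, (d_hid, d_hid))
        self.W3 = rng.normal(0.0, 0.02, (d_hid, n_domains))

    def __call__(self, h):
        return np.tanh(relu(relu(h @ self.W1) @ self.W2) @ self.W3)

clf = DomainClassifier()
scores = clf(np.ones((1, 512)))  # one pooled encoder state -> shape (1, 6)
```

With `n_domains=6` for the six-domain setting; the tanh output keeps the per-domain scores in [−1, 1].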
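The LDR masking described above can be sketched as follows; active-unit values are set to 1.0 as placeholders for what would be learned embedding values, and the final concatenation mirrors the token-level concatenation also used by DC-Feat:

```python
import numpy as np

def ldr_feature(domain_i, n_domains, unit=4):
    """LDR vector for one token of a sentence from domain i (1-indexed):
    only the units with indexes in [4(i-1), .., 4i-1] are kept; all
    others are masked to 0. Active units are set to 1.0 here as
    placeholders for learned values."""
    v = np.zeros(unit * n_domains)
    v[unit * (domain_i - 1): unit * domain_i] = 1.0
    return v

def concat_ldr(token_emb, domain_i, n_domains):
    """Concatenate the masked LDR feature to one token embedding."""
    return np.concatenate([token_emb, ldr_feature(domain_i, n_domains)])

f = ldr_feature(2, 3)               # -> [0 0 0 0 1 1 1 1 0 0 0 0]
x = concat_ldr(np.zeros(512), 2, 3) # token embedding of size 512 + 12
```

The masking ensures that each domain reads and writes only its own 4-unit slice of the feature vector.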
Training For each domain, we create train/dev/test sets by randomly splitting each corpus. We keep the size of validation and test sets equal to 1,000 lines for every domain. The learning rate is set as in Vaswani et al. (2017). For the fine-tuning procedures used for FT-Full and FT-Res, we continue training with the same learning rate schedule, continuing to increment the step count. All other MDMT systems reported in Tables 3 and 4 use a combined validation set comprising 6,000 lines, obtained by merging the six development sets. For the results in Table 7, we also append the validation set of NEWS to the multi-domain validation set. In any case, training stops either when it reaches the maximum number of iterations (50,000) or when the score on the validation set has not increased for three consecutive evaluations. We average five checkpoints to get the final model.
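The learning-rate schedule of Vaswani et al. (2017) referred to above (Noam decay with warmup = 4,000) can be written as follows; `d_model = 512` and `scale = 1.0` are the usual defaults, not values stated in the text:

```python
def noam_lr(step, d_model=512, warmup=4000, scale=1.0):
    """Noam schedule (Vaswani et al., 2017): the learning rate grows
    linearly for `warmup` steps, then decays as the inverse square
    root of the step number."""
    step = max(step, 1)  # guard against step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The rate peaks exactly at the end of warmup and decays afterwards,
# which is why continued fine-tuning keeps incrementing the step count.
```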
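Checkpoint averaging, used above to obtain the final model from five checkpoints, simply averages parameters name by name; a toy sketch with dicts of float lists standing in for tensor state dicts:

```python
def average_checkpoints(state_dicts):
    """Average parameter values elementwise across saved checkpoints.
    Each checkpoint maps parameter names to flat lists of floats
    (tensors, in a real toolkit)."""
    n = len(state_dicts)
    return {name: [sum(vals) / n
                   for vals in zip(*(sd[name] for sd in state_dicts))]
            for name in state_dicts[0]}

avg = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
# avg == {"w": [2.0, 3.0]}
```

In practice this is done over the last few checkpoints and tends to smooth out optimization noise at no extra training cost.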

B. Experiments with Continual Learning
Complete results for the experiments with continual learning are reported in Table 7.

C. Experiments with Automatic Domains
This experiment aims to simulate, with automatic domains, a scenario where the number of ''domains'' is large and where some ''domains'' are close and can effectively share information. Full results are in Table 8. Cluster sizes vary from approximately 8k sentences (cluster 24) up to more than 350k sentences. More than two thirds of these clusters mostly comprise texts from one single domain (e.g., cluster 12, which is predominantly MED); the remaining clusters typically mix two domains. Fine-tuning with small domains is often outperformed by other MDMT techniques, an issue that a better regularization strategy might mitigate. Domain control (DC-Feat) is very effective for small domains, but again less so in larger data conditions. Among the MD models, approaches using residual adapters have the best average performance.