Abstract
Multilingual task-oriented dialogue (ToD) facilitates access to services and information for many (communities of) speakers. Nevertheless, its potential is not fully realized, as current multilingual ToD datasets—both for modular and end-to-end modeling—suffer from severe limitations. 1) When created from scratch, they are usually small in scale and fail to cover many possible dialogue flows. 2) Translation-based ToD datasets might lack naturalness and cultural specificity in the target language. In this work, to tackle these limitations, we propose a novel outline-based annotation process for multilingual ToD datasets, where domain-specific abstract schemata of dialogue are mapped into natural language outlines. These in turn guide the target language annotators in writing dialogues by providing instructions about each turn’s intents and slots. Through this process, we annotate a new large-scale dataset for evaluation of multilingual and cross-lingual ToD systems. Our Cross-lingual Outline-based Dialogue dataset (cod) enables natural language understanding, dialogue state tracking, and end-to-end dialogue evaluation in 4 diverse languages: Arabic, Indonesian, Russian, and Kiswahili. Qualitative and quantitative analyses of cod versus an equivalent translation-based dataset demonstrate improvements in data quality, unlocked by the outline-based approach. Finally, we benchmark a series of state-of-the-art systems for cross-lingual ToD, setting reference scores for future work and demonstrating that cod prevents the over-inflated performance estimates typically observed with prior translation-based ToD datasets.
1 Introduction and Motivation
One of the staples of machine intelligence is the ability to communicate with humans and complete a task as instructed during such an interaction. This is commonly referred to as task-oriented dialogue (ToD; Gupta et al., 2005; Bohus and Rudnicky, 2009; Young et al., 2013; Muise et al., 2019). Despite having far-reaching applications, such as banking (Altinok, 2018), travel (Zang et al., 2020), and healthcare (Denecke et al., 2019), this technology is currently accessible to very few communities of speakers (Razumovskaia et al., 2022a).
The progress in multilingual ToD is critically hampered by the paucity of training data for many of the world’s languages. While cross-lingual transfer learning (Zhang et al., 2019; Xu et al., 2020; Siddhant et al., 2020; Krishnan et al., 2021) offers a partial remedy, its success is tenuous beyond typologically similar languages and generally hard to assess due to the lack of evaluation benchmarks (Razumovskaia et al., 2022a). What is more, transfer learning often cannot leverage multi-source transfer and few-shot learning due to lack of language diversity in the ToD datasets (Zhu et al., 2020; Quan et al., 2020; Farajian et al., 2020).
Therefore, the main driver of development in multilingual ToD is the creation of multilingual resources. However, even when available, they suffer from several pitfalls. Most are obtained by manual or semi-automatic translation of an English source (Castellucci et al., 2019; Bellomaria et al., 2019; Susanto and Lu, 2017; Upadhyay et al., 2018; Xu et al., 2020; Ding et al., 2022; Zuo et al., 2021, inter alia). While this process is cost-efficient and typically makes data and results comparable across languages, it yields dialogues that lack naturalness (Lembersky et al., 2012; Volansky et al., 2015) and are neither properly localized nor culture-specific (Clark et al., 2020). Further, such datasets provide over-optimistic estimates of performance due to the artificial similarity between source and target texts (Artetxe et al., 2020). As an alternative to translation, new ToD datasets can be created from scratch in a target language through the Wizard-of-Oz framework (WOZ; Kelley, 1984) where humans impersonate both the client and the assistant. However, this process is highly time- and money-consuming, thus failing to scale to large quantities of examples and languages, and often lacks coverage in terms of possible dialogue flows (Zhu et al., 2020; Quan et al., 2020).
To address all these gaps, in this work we devise a novel outline-based annotation pipeline for multilingual ToD datasets that combines the best of both processes. In particular, abstract dialogue schemata, specific to individual domains, are sampled from the English Schema-Guided Dialogue dataset (SGD; Shah et al., 2018; Rastogi et al., 2020). Then, the schemata are automatically mapped into outlines in English, which describe the intention that should underlie each dialogue turn and the slots of information it should contain, as shown in Table 1. Finally, outlines are paraphrased by human subjects into their native tongue and slot values are adapted to the target culture and geography. This ensures both the cost-effectiveness and cross-lingual comparability offered by manual translation, and the naturalness and culture-specificity of creating data from scratch. Through this process, we create the Cross-lingual Outline-based Dialogue dataset (termed cod), supporting natural language understanding (intent detection and slot labeling tasks), dialogue state tracking, and end-to-end dialogue modeling in 11 domains and 4 typologically and areally diverse languages: Arabic, Indonesian, Russian, and Kiswahili.
Example from the cod dataset of outline-based dialogue generation in Russian with target language substitutions of slot values. The first column (Outline) includes example outlines presented to the dialogue creators, and the second column holds the creators’ output (Dialogue & Slot Output).

To confirm the advantages of the proposed annotation process, we run a proof-of-concept experiment where we create two analogous datasets through the outline-based pipeline and manual translation, respectively. Based on a quality survey with human participants, we find that, at a similar annotation speed, outline-based annotation achieves significantly higher naturalness and familiarity of concepts and entities, without compromising data quality and language fluency.1 Finally, we provide crucial evidence that cross-lingual transfer test scores on translation-based data are over-estimated. We demonstrate that this is because the distribution of sentences (and their hidden representations) diverges considerably more between training and evaluation dialogues in cod than in the translation-based dataset.
Further, to establish realistic estimates of performance on multilingual ToD, we benchmark a series of state-of-the-art multilingual ToD models in different ToD tasks on cod. Among other findings, we report that zero-shot transfer surpasses ‘translate-test’ on slot labeling, but this trend is reversed for intent detection. Language-specific performance also varies substantially among evaluated models, depending on the quantity of unlabeled data available for pretraining.
In sum, cod provides a typologically diverse dataset for end-to-end dialogue modeling and evaluation, and streamlines a scalable annotation process that results in natural and localised dialogues. We hope that cod will contribute to democratizing dialogue technology and facilitating reliable cost-effective ToD systems for a wide array of languages. Our data and code are available at github.com/cambridgeltl/COD.
2 Related Work
Although a number of NLU resources have recently emerged in languages other than English, the availability of high-quality, multi-domain data to support multilingual ToD is still inconsistent (Razumovskaia et al., 2022a). Translation of English data has been the predominant method for generating examples in other languages: For example, the ATIS corpus (Hemphill et al., 1990) boasts translations into Chinese (He et al., 2013), Vietnamese (Dao et al., 2021), Spanish, German, Indonesian, and Turkish, among others (Susanto and Lu, 2017; Upadhyay et al., 2018; Xu et al., 2020). Bottom-up collection of ToD data directly in the target language has been the less popular choice (e.g., in French [Bonneau-Maynard et al., 2005] and Chinese [Zhang et al., 2017; Gong et al., 2019]).
Concurrent work by FitzGerald et al. (2022) employs translation as part of a dataset creation workflow where Amazon MTurk workers first translate or localize slot values, and subsequently translate or localize entire phrases in which these slots appear. While localization improves the geographical and cultural relevance of entities mentioned in dialogues, this approach still relies on translation from English, thus perpetuating many of the problems of earlier translation-based methods: For example, introducing English grammatical and lexical biases in dialogue utterances (Koppel and Ordan, 2011) or compromising target language idiomaticity. As we demonstrate in §4, our outline-based dialogue generation method addresses these issues by eschewing direct translation in favor of guided dialogue creation in the target language, ensuring naturalness of linguistic expressions used in each language and yielding a dataset better capturing linguistic diversity.
Thus far, the focus of existing benchmarks has been predominantly either on monolingual multi-domain (Hakkani-Tür et al., 2016; Liu et al., 2019; Larson et al., 2019) or multilingual single-domain evaluation (Xu et al., 2020), rather than balancing diversity along both these dimensions. Moreover, the current multilingual datasets are mostly constrained to the two NLU tasks of intent detection and slot labeling (Li et al., 2021; van der Goot et al., 2021), and do not enable evaluations of E2E ToD systems in multilingual setups. In order to adequately assess the strengths and generalizability of NLU as well as DST and E2E models, they should be tested both on multiple languages and multiple domains, a goal pursued in this work.
3 Annotation Design
We selected the English Schema-Guided Dialogue (SGD) dataset (Shah et al., 2018; Rastogi et al., 2020) as a starting point due to its scale (20k human-assistant dialogues) and diversity (20 domains). It was constructed via automatic generation of dialogue schemata combined with manual creation of dialogue paraphrases by crowdworkers, organized as lists of turns for each individual interaction, each turn containing an utterance by the user or system. The accompanying annotations are grouped into frames, each corresponding to a single API or service (e.g., Banks_2). In turn, each service is represented as a schema including its characteristic functions (intents) and parameters (slots), as well as their natural language (NL) descriptions.2
We first assessed the viability of our method on Russian, collecting data using (i) direct translation from English and (ii) our proposed outline-based approach. We then applied our method to three other languages that boast a large number of speakers and yet suffer from a shortage of resources: Arabic, Indonesian, and Kiswahili, ensuring the dataset’s diversity in terms of language family and macro-area, as well as writing systems (Cyrillic, Arabic, and Latin scripts), see Table 2.3 In Table 3 we quantify the linguistic diversity of the language sample and compare it with the standard multilingual dialogue NLU and end-to-end datasets. In terms of typology, cod is comparable to datasets with much larger language samples (e.g., MultiATIS++, xSID) and considerably exceeds others. With respect to family and macroarea diversity, cod is the most diverse out of existing datasets.
Language statistics. The last two columns denote the number of speakers in millions. †Standard Arabic is learned as L2.
| Language | iso | Family | Branch | Macro-area | L1 [M] | Total [M] |
|---|---|---|---|---|---|---|
| Russian | ru | Indo-European | Balto-Slavic | Eurasia | 153.7 | 258 |
| Standard Arabic | ar | Afro-Asiatic | Semitic | Eurasia / Africa | 0† | 274 |
| Indonesian | id | Austronesian | Malayo-Polynesian | Papunesia | 43.6 | 199 |
| Kiswahili | sw | Niger–Congo | Bantu | Africa | 16.3 | 69 |
Comparison of diversity indices of multilingual dialogue datasets in terms of typology, family, and macroareas. For the description of the three diversity measures, we refer the reader to Ponti et al. (2020). M. TOP was created by Schuster et al. (2019); M. ATIS (Upadhyay et al., 2018); MultiATIS++ (Xu et al., 2020); MTOP (Li et al., 2021); xSID (van der Goot et al., 2021); BiTOD (Lin et al., 2021); GlobalWOZ (Ding et al., 2022).

3.1 Data Creation Protocol
The data creation protocol involved the following phases: 1) source dialogue sampling, 2) automatic generation of outlines based on intent and slot information using rewrite rules, 3) manual outline-driven target language dialogue creation and slot annotation, and 4) post-hoc review, all described here.
Source Dialogue Sampling.
To ensure wide coverage of dialogue scenarios, we randomly sampled source dialogues from across 11 domains, out of which five (Alarm, Flights, Homes, Movies, Music) are shared between the development and test set; the remainder are unique to either set, to enable cross-domain experiments. To guarantee a balanced coverage of different intents, we sampled 10 examples per intent, which ensures the task cannot be solved by simply predicting the most common intent (see Table 4 for dataset statistics).
Number of dialogues per domain and total number of turns in each set. ♢ marks the domains that are not included in the (English) training set.
| | Alarm (♢) | Flights | Homes | Movies | Music | Media | Banks | Payment (♢) | RideSharing | Travel | Weather | #turns |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dev | 13 | 12 | 12 | 16 | 14 | – | 14 | – | – | 12 | 18 | 1138 |
| Test | 21 | 23 | 13 | 19 | 16 | 17 | – | 8 | 11 | – | – | 1352 |
Outline Generation.
Our goal was to create minimal but sufficient instructions for target language dialogue creators to ensure coverage of specific intents and slots, while avoiding imposing predefined syntactic structures or linguistic expressions. First, for each user or system act, we manually created a rewrite rule, for example, INFORM_COUNT → Inform the user that you found + INFORM_COUNT[value] + such option(s) (value corresponds to the number of options matching the user request). Next, we automatically matched each intent and slot with its NL description (provided in the SGD schemata) and used them to generate intent/slot-specific outlines (with stylistic adaptations where necessary): For example, an intent “SearchOnewayFlight” and a description “Search for one-way flights to the destination of choice” would yield an outline Express the desire to search for one-way flights (see Table 5).
Examples of dialogue generation outlines created from SGD schemata, that is, annotations of dialogue acts, intents, slots and values, with intent-specific rewrites in bold.
| Act | Slot/Intent | Description | Value | Outline |
|---|---|---|---|---|
| INFORM_INTENT | SearchOnewayFlight | Search for one-way flights to the destination of choice | – | Express the desire to search for one-way flights |
| REQUEST | number_checked_bags | Number of bags to check in | 2 | Ask if the number of bags to check in is 2 |
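To make the mapping concrete, below is a minimal sketch of such rule-based outline generation. The act-level templates and helper names are illustrative assumptions that mirror the examples in Table 5; they are not the exact rules used to build cod, and the resulting English outlines would subsequently be paraphrased into the target language by native speakers (§3.1).

```python
# Illustrative act-level templates mirroring Table 5 (not the exact COD rules).
ACT_TEMPLATES = {
    "INFORM_INTENT": "Express the desire to {description}",
    "REQUEST": "Ask if the {description} is {value}",
    "INFORM_COUNT": "Inform the user that you found {value} such option(s)",
}

def lowercase_first(text: str) -> str:
    """Lowercase the first character so SGD descriptions slot into the templates."""
    return text[0].lower() + text[1:] if text else text

def generate_outline(act: str, description: str, value: str = "") -> str:
    """Map one annotated dialogue act (plus its NL description) to an outline."""
    return ACT_TEMPLATES[act].format(description=lowercase_first(description), value=value)

# Intent "SearchOnewayFlight" with its SGD description:
print(generate_outline("INFORM_INTENT",
                       "Search for one-way flights to the destination of choice"))
# Slot "number_checked_bags" with value 2:
print(generate_outline("REQUEST", "Number of bags to check in", value="2"))
```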
Dialogue Writing.
We recruited target language native speakers fluent in English via the proz.com platform.4 Dialogue creators were presented with language-specific guidelines.5 An essential part of the task consisted in a cultural adaptation of culturally and geographically specific slot values (e.g., city names, movie titles) through substitutions with named entities more familiar or closer to the creators’ culture (e.g., American Airlines → Aeroflot, New York → Jakarta).
Slot Span Validation.
First, creators performed slot span labeling while working on dialogue writing. Subsequently, the annotated data in each language underwent an additional round of manual revision by a target language native speaker and a final automatic check for slot value-span matches. We verified inter-annotator reliability on Russian, where we collected slot span annotations from pairs of independent native-speaker annotators. The accuracy scores (i.e., ratio of slot instances with matching spans to the total annotated instances) of 0.99 for development data and 0.98 for test data reveal very high agreement on this task.
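For concreteness, the agreement score used above can be computed as a simple exact-span match; the key structure (one (start, end) span per annotated slot instance) is an assumption about the annotation format, not the exact validation script.

```python
def span_agreement(annotator_a: dict, annotator_b: dict) -> float:
    """Exact-match accuracy over slot spans from two independent annotators.

    Both arguments map a slot instance key, e.g. (dialogue_id, turn_id, slot_name),
    to its annotated character span (start, end)."""
    shared = annotator_a.keys() & annotator_b.keys()
    matches = sum(annotator_a[key] == annotator_b[key] for key in shared)
    return matches / len(shared) if shared else 0.0

# Toy example: the two annotators agree on 1 of 2 spans -> 0.5
a = {("dlg1", 0, "city"): (10, 16), ("dlg1", 2, "date"): (4, 12)}
b = {("dlg1", 0, "city"): (10, 16), ("dlg1", 2, "date"): (4, 14)}
print(span_agreement(a, b))
```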
4 Translation versus Outline-Based
The main motivation behind the outline-based approach is to avoid the known pitfalls of direct translation and produce evaluation data better representing the linguistic and cultural realities of each language in the sample. To verify whether the method satisfies these goals in practice, we carried out a trial experiment consisting in parallel dialogue data creation using two different methods, (i) direct translation and (ii) outline-based generation, starting from the same sample of source SGD dialogues to ensure a fair comparison. In (i), randomly sampled (see §3.1) English user/system utterances were extracted directly from the SGD data with accompanying slot and intent annotations and subsequently translated into the target language by professional translators, also responsible for validating target language slot spans. In (ii), we automatically extracted dialogue frames, including intents and slots, matching dialogue IDs sampled in (i), and used them to generate NL outlines to guide manual dialogue creation by native speakers (§3.1).
We also asked the participants to time themselves while working on the task. Notably, we found the annotation speed to be identical for the two methods, averaging 15 seconds per single dialogue turn (dialogue writing + slot annotation). While the translation approach does not require any creative input in terms of cultural adaptations of slot values, the outline-based approach allows freedom in terms of the linguistic expressions used, which results in similar time requirements.
Quality Survey.
We assessed the quality of the two methods’ output in a survey with 15 Russian native speakers, consisting of (1) independent and (2) comparative evaluation.6 Within each part, the order of questions was randomized. In Part 1, the respondents were presented with 6 randomly sampled dialogues from the data generated by either method (3 dialogues per method) and asked to answer to what extent they agree with each of four statements in Table 6 (translated into Russian) by giving a 1-5 rating. In Part 2, respondents were presented with 5 randomly sampled pairs of matching dialogue excerpts from both datasets (based on shared dialogue IDs) and asked to choose which excerpt (A or B) sounded more natural to them. Following the validation experiments and analyses of our outline-based method in Russian (as reported in the remainder of §4), we extended the quality survey to the other three languages included in cod, Arabic, Kiswahili, and Indonesian, comparing outline-generated dialogues to those translated from English by professional translators in an analogous two-part evaluation setup. All survey questions and instructions were translated into each target language and 15 native speaker participants were recruited for each language-specific survey.
Figure 1 shows average scores for Part 1 questions (Q1–Q4) across the 15 participants in each language. The methods produce dialogues which score similarly in terms of the assistant’s goal-orientedness (Q1), with a statistically significant negative effect of translation, with respect to outline-based generation, noted only in Indonesian. However, we observe consistent differences in the perceived naturalness and target-language fluency (Q2 and Q3). First, the user utterances created based on outlines are perceived as more natural-sounding (Q2) across all four languages, with the largest quality gap observed in Indonesian and Arabic. This pattern is repeated for Q3, where Arabic and Indonesian participants found outline-based generated assistant utterances substantially closer to natural target language spoken by native speakers than their translated counterparts.
Average scores for each quality survey question (see Table 6) assigned to dialogue examples generated via translation versus outline-based generation in each language. Statistically significant differences (paired Student’s t-test) are indicated as follows: p ≤ 0.05 (*), p ≤ 0.01 (**), p ≤ 0.001 (***), p ≤ 0.0001 (****); ns indicates p > 0.05.
Crucially, outline-generated dialogues score consistently better in terms of the familiarity of mentioned entities (Q4), with significant score differences found in all four languages. These results are encouraging, given that Q4 directly addresses one of the main objectives of our method, namely, target language-specificity. While both approaches are capable of producing convincing dialogues in each language, as reflected in positive (>3) average scores, it is worth noting that the perceived degree of naturalness and familiarity of the conversations is on average lower in the case of Kiswahili. This emphasizes the need for careful debiasing of the concepts and situation types referred to in the dialogues, to ensure that the entire dialogue scenarios, not just slot values, reflect the linguistic and cultural reality of target language communities.
The patterns noticed in the independent evaluation (Part 1) are further reinforced in the results of the comparative evaluation in Part 2, even more clearly skewed in favor of the outline-based method. Out of 75 comparisons (15 participants judging 5 pairs each) in each language, outline-based dialogues are judged as more natural-sounding, on average, in over 80% of cases, with a near-perfect preference found in Indonesian (94%), followed by Arabic (82%), Russian (80%), and Kiswahili (76%). Table 7 shows an example pair of matching dialogue excerpts from each method with accompanying English translations.
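The significance tests reported in Figure 1 can be reproduced with a paired Student's t-test over per-participant ratings; the arrays below are placeholders (one mean rating per participant and method), not the collected survey scores.

```python
from scipy.stats import ttest_rel

# One mean rating per participant (15 per language) for each data creation method.
translation_ratings = [3.4, 3.1, 3.8, 2.9, 3.5, 3.2, 3.7, 3.0,
                       3.6, 3.3, 2.8, 3.9, 3.1, 3.4, 3.2]   # placeholder values
outline_ratings     = [4.2, 3.9, 4.5, 3.6, 4.1, 4.0, 4.4, 3.8,
                       4.3, 4.1, 3.7, 4.6, 3.9, 4.2, 4.0]   # placeholder values

t_statistic, p_value = ttest_rel(translation_ratings, outline_ratings)
print(f"paired t-test: t = {t_statistic:.2f}, p = {p_value:.4f}")
```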
Comparison of dialogues generated by each method. For each user/assistant utterance, we provide the original English sentences from SGD for the translation method, and English translations of the Russian utterances written based on outlines. ♣ – syntactic similarity to source language; ♠ – lexical similarity to source language.

Effects of Translationese.
Dialogue data should be representative of natural interactions between two interlocutors. The utterances of both the user and the system should reflect the properties characteristic of the conversational register in a given language, appropriate for the communicative situation at hand and the participants’ social roles (Chaves et al., 2019; Chaves and Gerosa, 2021). When qualitatively comparing the translation and outline-based generation in Table 7, we observe that translated utterances are often skewed toward the source language syntax and lexicon (the “translationese” effect [Koppel and Ordan, 2011]), compromising the fluency and idiomaticity that are essential in natural-sounding exchanges.
One issue that arises in literal translation is the presence of syntactic calques from the source language. For instance, the translation of the first USER utterance (Table 7, col. ‘Translation’) uses a dative pronoun найти мне [dative] (find me), even though the transitive verb найти (find) does not require the [dative] case after it—a likely calque of the English expression Can you find me. In comparison, the corresponding outline-based generated utterance uses a more natural construction. Another problem concerns the differences in the use of grammatical structures depending on the language register. For instance, passive voice is common in spoken English (e.g., the last ASSISTANT utterance in Table 7). Its translation into Russian also uses passive voice, although it is usually avoided in spoken Russian (Babby and Brecht, 1975). In contrast, the outline-based utterance uses a simpler active voice construction, preserving the original meaning.
Lexical “translationese” effects include (i) the preference for lexical cognates of source language words, and (ii) the use of a vocabulary typical for the written language, both exemplified by the last ASSISTANT utterance (Table 7). The translation includes the verb запланирован (is planned), even though the verb планировать, having the same root as English to plan, is rarely used in spoken Russian when arranging near-future appointments and more frequently when making a step-by-step plan. In contrast, the outline-based generated utterance includes the verb забронировать (to book) which is more specific to arranging appointments and more frequently used in spoken language.
Slot Localization.
Datasets collected via translation stay largely grounded in the realm of the Anglosphere (Zuo et al., 2021; Hung et al., 2022). For instance, slot values are directly translated rather than being substituted with a culture-specific equivalent. As a result, multilingual models are tested in a very favorable context where only the surface language changes but the entities stay the same (this bias is especially pertinent for models in cross-lingual setups). In cod guidelines, annotators are explicitly instructed to replace English concepts with their target language equivalents. In this study, we calculate the percentage of slot values which were localized. We consider a slot value to be localized if the value is conceptually different from its English counterpart (e.g., using a local artist’s name or converting a sum in GBP or USD to the local currency). Table 8 demonstrates that more than half of all slot values in the dataset are localized, a large improvement over direct translation, where values remain anchored to their English originals. This shows that with the cod dataset, models will be tested on more culturally and linguistically aware data than if the dataset had been created via translation.
Evaluation of Translation-Based vs. Outline-Generated Data.
The vast majority of existing NLU datasets are based on translation from English to the target language (Xu et al., 2020; van der Goot et al., 2021). This could lead to an overly optimistic evaluation of cross-lingual ToD systems, since the data might not be representative of real-life language use, due to the “translationese” effects discussed above. We verify this hypothesis in the following diagnostic experiment. We use a translate-train approach where: (i) training data are translated from the source language (en) to the target (ru) via Google Translate; and (ii) the model is fine-tuned on these automatically translated data. We then test the model on evaluation data obtained by: (a) translation using Google Translate, (b) translation by professional translators (closest in nature to existing dialogue NLU datasets), and (c) outline-based generation. For the experiment, we fine-tune mBERT (Devlin et al., 2019) on intent detection.7
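A condensed sketch of this diagnostic evaluation is given below: a single intent classifier fine-tuned on machine-translated (EN→RU) training data is scored on the three Russian evaluation sets. The checkpoint path and the load_eval_set helper are hypothetical placeholders; the reported scores are in Table 9.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical path to mBERT fine-tuned on machine-translated (en->ru) SGD intents.
CHECKPOINT = "path/to/mbert-intent-translate-train-ru"
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)

@torch.no_grad()
def intent_accuracy(utterances, gold_intent_ids):
    batch = tokenizer(utterances, padding=True, truncation=True, return_tensors="pt")
    predictions = model(**batch).logits.argmax(dim=-1)
    return (predictions == torch.tensor(gold_intent_ids)).float().mean().item()

# load_eval_set is a hypothetical loader returning (utterances, gold_intent_ids).
for name in ("google_translate", "professional_translation", "outline_based"):
    utterances, gold = load_eval_set(name)
    print(name, round(100 * intent_accuracy(utterances, gold), 2))
```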
The results in Table 9 show a stronger performance on translation-based evaluation sets than on more natural, outline-based generated examples, thus corroborating previous observations in other areas of NLP, e.g., machine translation (Graham et al., 2020), now also attested in ToD. Crucially, this experiment verifies that using solely translation-based ToD evaluation data might lead to an inflated estimation of models’ cross-lingual capabilities and, consequently, too optimistic performance expectations in real-life applications. This further validates our proposed outline-based approach to multilingual ToD data creation.
Cross-lingual intent detection accuracy on development and test data (a) translated via Google Translate; (b) translated by professionals; and (c) outline-generated: cod.
| Data Creation | Split | Accuracy |
|---|---|---|
| Google Translate | Dev | 47.98 |
| Google Translate | Test | 35.06 |
| Professional Translation | Dev | 48.33 |
| Professional Translation | Test | 34.62 |
| Outline-based Generation | Dev | 40.25 |
| Outline-based Generation | Test | 31.81 |
Analysis of Sentence Encodings.
One reason behind the scores in Table 9 likely lies in the differences between multilingual sentence encodings of English examples, examples generated via translation, and those yielded by the outline-based method. To test this, we obtain sentence encodings of all user turns for one intent from the three datasets via the distilled multilingual USE sentence encoder (Yang et al., 2020; Reimers and Gurevych, 2019).8
As shown in Figure 2, the translation-based data are encoded into sentence representations that are much more similar to their English source than the corresponding outline-generated examples. We use pairwise KL-divergence scores between KDE-estimated densities to measure the similarity between English (En), Translated to Russian (Trans), and Outline-based sentences: KL(En ∥ Trans) = 7.5 × 10⁻⁴; KL(En ∥ Outline) = 4.69 × 10⁻⁵; KL(Trans ∥ Outline) = 3.84 × 10⁻⁵. As expected, direct translation artificially skews target utterances towards English. This again reinforces the finding from Table 9: Multilingual ToD datasets collected via outline-based generation should lead to more realistic assessments of multilingual ToD models than their translation-based counterparts.
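The analysis can be reproduced along the following lines, sketched here with the sentence-transformers distilled multilingual USE checkpoint, a shared t-SNE projection, and a Monte Carlo estimate of KL divergence between the fitted KDEs. The checkpoint name, sample size, and the three input lists (en_turns, translated_turns, outline_turns, assumed to hold the user-turn strings) are assumptions rather than the exact setup.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

def kl_divergence(kde_p, kde_q, n_samples=10_000, seed=0):
    """Monte Carlo estimate of KL(p || q) between two fitted KDEs."""
    samples = kde_p.resample(n_samples, seed=seed)
    return float(np.mean(kde_p.logpdf(samples) - kde_q.logpdf(samples)))

encoder = SentenceTransformer("distiluse-base-multilingual-cased-v1")  # assumed checkpoint
groups = {"En": en_turns, "Trans": translated_turns, "Outline": outline_turns}

# Encode all user turns together so the 2D t-SNE projection is shared across groups.
all_turns = [turn for turns in groups.values() for turn in turns]
points = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(
    encoder.encode(all_turns))

kdes, offset = {}, 0
for name, turns in groups.items():
    kdes[name] = gaussian_kde(points[offset:offset + len(turns)].T)
    offset += len(turns)

for p, q in [("En", "Trans"), ("En", "Outline"), ("Trans", "Outline")]:
    print(f"KL({p} || {q}) = {kl_divergence(kdes[p], kdes[q]):.2e}")
```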
Kernel density estimate (KDE) plot for distributions of user turn encodings via the distilled multilingual USE. Input sentences are either the original sentences in English (En), translated to Russian (Trans), or generated in Russian based on Outlines (Outline). Dimensionality reduction was performed using tSNE (Van der Maaten and Hinton, 2012).
5 Baselines, Results, Discussion
cod includes labeled data for three standard ToD tasks: i) Natural Language Understanding (NLU; intent detection and slot labeling); ii) dialogue state tracking (DST); and iii) end-to-end (E2E) dialogue modeling. Here, we benchmark a representative selection of state-of-the-art models (§5.1) on our new dataset, highlighting its potential for evaluation and the key challenges it presents across different tasks and experimental setups (§5.2).
Notation.
A dialogue is a sequence of alternating user and system turns D = (u_1, s_1, …, u_T, s_T). The dialogue history at turn t is the set of turns up to point t, i.e., H_t = (u_1, s_1, …, s_{t−1}, u_t).
5.1 Baselines and Experimental Setup
We evaluate and compare the baselines for each task along the following axes: (i) different multilingual pretrained models; (ii) cross-lingual transfer approaches; (iii) in-domain versus cross-domain.
Multilingual Pretrained Models.
For cross-lingual transfer based on multilingual pretrained models, we abide by the standard procedure where the entire set of encoder parameters and the task-specific classifier head are fine-tuned. We evaluate the following pretrained language models: (i) for NLU and DST, we use the Base variants of multilingual BERT (mBERT; Devlin et al., 2019) and XLM-RoBERTa (XLM-R; Conneau et al., 2020); the models were pretrained on Wikipedia in over 100 languages and the CommonCrawl corpus, respectively; for intent detection and slot labeling, we evaluate both a model that jointly learns the two tasks (Xu et al., 2020) and separate task-specific models; (ii) for E2E modeling, we use multilingual T5 (mT5; Xue et al., 2021), a sequence-to-sequence model demonstrated to be the strongest baseline for cross-lingual dialogue generation (Lin et al., 2021).
Cross-lingual Transfer.
We focus on two standard methods of cross-lingual transfer: (i) transfer based on multilingual pretrained models and (ii) translate-test (Hu et al., 2020). In (i), a Transformer-based encoder is pretrained on multiple languages with a language modeling objective, yielding strong cross-lingual representations that enable zero-shot model transfer. In (ii), test data in a target language are translated into English via a translation system: We compare Google Translate (GTr)9 and MarianMT (Junczys-Dowmunt et al., 2018). The models in both transfer methods are fine-tuned on the original English task-specific data from the English SGD dataset.
For end-to-end training, we set up two additional cross-lingual baselines, similar to Lin et al. (2021). In few-shot fine-tuning (FF), after the model is trained on source language data (EN), it is further fine-tuned on a small number of target language dialogues. In our FF experiments, we use the dev sets in each language as few-shot learning data. In mixed-language pretraining (MLT; Lin et al., 2021), the model is fine-tuned on mixed language data where the slot values in the source language data are substituted with their target language counterparts. Unlike Lin et al. (2021), we do not assume the existence of a bilingual parallel knowledge base, unrealistic for low-resource languages. Hence, the translations of slot values are obtained via MarianMT (Junczys-Dowmunt et al., 2018).
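As an illustration of the MLT setup, the sketch below substitutes English slot values with MarianMT translations before fine-tuning. The checkpoint name corresponds to the publicly available Helsinki-NLP English–Russian model; the substitution helper is a simplification of the actual preprocessing, not the exact pipeline.

```python
from transformers import MarianMTModel, MarianTokenizer

CKPT = "Helsinki-NLP/opus-mt-en-ru"          # one MarianMT model per target language
tokenizer = MarianTokenizer.from_pretrained(CKPT)
model = MarianMTModel.from_pretrained(CKPT)

def translate(values):
    """Translate a list of English slot values into the target language."""
    batch = tokenizer(values, return_tensors="pt", padding=True)
    outputs = model.generate(**batch)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

def mixed_language_utterance(utterance, slot_values):
    """Replace each English slot value in the utterance with its translation."""
    for en_value, target_value in zip(slot_values, translate(slot_values)):
        utterance = utterance.replace(en_value, target_value)
    return utterance

print(mixed_language_utterance(
    "I want a one-way flight from New York to London.", ["New York", "London"]))
```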
In-Domain versus Cross-Domain Experiments.
cod development and test splits include examples belonging to domains which were not seen in the English training data (see Table 4). This enables cross-lingual evaluation in 3 different regimes: in-domain testing (In), where the model is evaluated on examples coming from the domains seen during training; cross-domain testing (Cross), evaluating on examples coming from the domains which were not seen during training; and overall testing (All), evaluating on all examples in the evaluation set.
Architectures and Training Hyperparameters.
NLU in ToD consists of two tasks performed for each user turn u_t: intent detection and slot labeling, which are typically framed as sentence- and token-level classification tasks, respectively. When a model is trained in a joint fashion, the two tasks share an encoder, and task-specific classification layers are added on top of the encoder (Zhang et al., 2019; Xu et al., 2020). The loss is a sum of the intent classification and the slot labeling losses (cross-entropy). In separate training, there is no parameter sharing, so neither NLU task influences the other. The performance metrics are accuracy for intent detection and F1 for slot labeling.
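A minimal sketch of the joint variant is given below: a shared multilingual encoder with a sentence-level intent head and a token-level slot head, trained with the sum of the two cross-entropy losses. Label-set sizes and other details are left as parameters; this is an illustrative sketch, not the exact baseline implementation.

```python
import torch.nn as nn
from transformers import AutoModel

class JointNLU(nn.Module):
    def __init__(self, n_intents: int, n_slot_tags: int, encoder_name: str = "xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden_size = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden_size, n_intents)   # sentence-level classifier
        self.slot_head = nn.Linear(hidden_size, n_slot_tags)   # token-level classifier

    def forward(self, input_ids, attention_mask, intent_labels=None, slot_labels=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        intent_logits = self.intent_head(hidden[:, 0])          # first-token representation
        slot_logits = self.slot_head(hidden)                    # per-token representations
        loss = None
        if intent_labels is not None and slot_labels is not None:
            ce = nn.CrossEntropyLoss(ignore_index=-100)          # -100 masks padding/subwords
            loss = ce(intent_logits, intent_labels) + ce(
                slot_logits.view(-1, slot_logits.size(-1)), slot_labels.view(-1))
        return loss, intent_logits, slot_logits
```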
In the DST task, the model maps the dialogue history H_t to the belief state at turn t; this includes the slot values that have been filled up to turn t. We use BERT-DST (Chao and Lane, 2019) in the experiments, which performs binary classification of the relevance of every slot-value pair to the current context. During training, negative dialogue context-slot pairs are sampled randomly in a 1:1 ratio. At inference time, every context is mapped to every possible slot-value pair. The performance metric used for DST is the standard Joint Goal Accuracy (JGA) (Rastogi et al., 2020), defined as the ratio of dialogue turns in which all slot values are correctly predicted.
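The JGA metric itself is straightforward to compute; the sketch below assumes one predicted and one gold belief state (a slot-to-value dict) per dialogue turn.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full predicted belief state matches the gold state exactly."""
    assert len(predicted_states) == len(gold_states)
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states) if gold_states else 0.0

# Toy example: the second turn misses one slot value -> JGA = 0.5
predicted = [{"city": "Jakarta", "date": "Friday"}, {"city": "Jakarta"}]
gold      = [{"city": "Jakarta", "date": "Friday"}, {"city": "Jakarta", "date": "Saturday"}]
print(joint_goal_accuracy(predicted, gold))
```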
As in prior work (Lin et al., 2021), E2E modeling is framed as a sequence-to-sequence (seq2seq) generation task. At every turn t, the goal is to predict the next system turn s_t based on the dialogue history H_t, fed into the model as a concatenated string. We adopt the generative seq2seq model, termed mSeq2Seq, as used by Lin et al. (2021). This is based on mT5 Small and mT5 Base (Xue et al., 2021) and standard top-k sampling. Unless stated otherwise, the Small version of the model is used. As in prior work (Lin et al., 2021), performance is reported as BLEU scores (Papineni et al., 2002). Unless stated otherwise, we use a beam size of 5 for generation.10
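A hedged sketch of this E2E interface follows: the dialogue history is concatenated into a single input string and the next system turn is generated with beam size 5, as in the experiments. The turn separator and maximum output length are assumptions, and the checkpoint below is the pretrained mT5-Small for illustration; in practice the model would first be fine-tuned on the English SGD training data.

```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

# Pretrained checkpoint for illustration; fine-tuning on SGD is assumed before use.
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

def generate_system_turn(history, max_new_tokens=64):
    """history: alternating user/system utterances up to (and including) the current user turn."""
    source = " </s> ".join(history)   # simple turn separator (an assumption)
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, num_beams=5, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_system_turn([
    "I would like to book a one-way flight to Jakarta.",
    "On which date would you like to travel?",
    "Next Friday, please.",
]))
```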
Source Language Training.
We train all models on the standard full training split of the English SGD dataset (Rastogi et al., 2020). In order to measure performance gaps due to transfer and ensure comparability of dialogue flows in all languages, we also evaluate on the subset of the English SGD test set sampled as a source for cod (see Table 4).
5.2 Results and Discussion
Below we discuss the results of cross-lingual transfer under the experimental setups in §5.1. We report both per-language scores and averages across the four cod target languages.
Main Results.
Table 10 compares the results for the two NLU tasks, while Table 11 shows scores in the E2E task. With translate-test (TrTest), the gains are highly task-dependent: It performs considerably better than encoder-based (MEncoder) transfer on intent detection and E2E modeling, while the opposite holds for slot labeling. This is likely because: 1) we rely on a word alignment algorithm on top of English predictions to align them with the target language, which adds noise to the final predictions; and 2) many errors are due to incorrect ‘label granularity’ (e.g., predicting departure city instead of departure airport), as shown by qualitative analysis.11 Note that TrTest, unlike MEncoder, assumes access to high-quality MT systems and/or parallel data for different language pairs.
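The alignment-based projection mentioned in point 1) can be sketched as follows; the alignment format (a list of (English index, target index) token pairs, e.g., produced by fast_align) and the BIO-style tags are assumptions about the pipeline rather than the exact implementation.

```python
def project_slot_labels(en_labels, alignments, n_target_tokens):
    """Project BIO slot tags predicted on the English translation back onto target tokens.

    en_labels: list of BIO tags over English tokens.
    alignments: list of (en_index, target_index) word-alignment pairs.
    """
    target_labels = ["O"] * n_target_tokens
    for en_index, target_index in alignments:
        if en_labels[en_index] != "O":
            target_labels[target_index] = en_labels[en_index]
    return target_labels

# Toy example: a "city" tag projected from English token 5 onto target token 3.
print(project_slot_labels(["O"] * 5 + ["B-city"], [(5, 3)], n_target_tokens=6))
```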
Per-language NLU results for (i) zero-shot cross-lingual transfer using multilingual pretrained models (MEncoder) and (ii) translate-test (TrTest) transfer with Google Translate and MarianMT (see §5.1). Translations for slot labeling were aligned using fast_align (Dyer et al., 2013). MEncoder results are from the separate training regime (see §5.1). All scores are averages over 5 random seeds and follow the All-domain setup.

Per-language E2E results for two cross-lingual transfer methods (see also the information in Table 10).

Table 11 reveals large gains of TrTest over the vanilla version of MEncoder, both with MarianMT and GTr, but GTr proves consistently better: This corroborates recent findings on other cross-lingual NLP tasks (Ponti et al., 2021). However, the +FF results in Table 11 reverse this trend and underline the benefits of few-shot target language fine-tuning in E2E training. The performance gains are large, even though the target language data include only 92 dialogues (<1% of English training data). In contrast, +MLT does not have a significant impact, possibly due to i) noisy target language substitutes, obtained via automatic translation, unlike in Lin et al. (2021) where ground truth target language slot values were available; or ii) the culture-specificity of slot values in cod. Thus, substitution with translations seems beneficial only for dialogues with a pre-defined common cross-lingual slot ontology.
Figure 3c presents another interesting trend, concerning the comparison of E2E performance of a larger versus a smaller model: mT5-Base versus mT5-Small. While zero-shot performance is comparable between the two, we observe that mT5-Base performs considerably better in a few-shot training scenario (+FF). We hypothesize that in zero-shot training the models overfit to generation in English,12 while in few-shot training the model’s cross-lingual generation capabilities are highlighted, once the model has encountered several examples in the target language.
Per-language results over all domains. (a) and (b) share the model labels on the y-axis.
In DST, irrespective of the transfer method and target language, cross-lingual performance is near-zero, as visible from Table 12. These findings are in line with prior work (Ding et al., 2022) and are due to the DST task complexity. This is even more pronounced in zero-shot cross-lingual settings and especially for cod, where culture-specific slot values are obtained via outline-based generation. Given the low results, we focus on NLU and E2E as the two main tasks in all the following analyses.
Comparison of Multilingual Models on NLU.
The results in Table 10 and Figure 3 indicate that XLM-R largely outperforms mBERT in all setups in both NLU tasks, especially on two languages more distant from English, ID and SW. We attribute this to XLM-R being exposed to more data in these languages during pretraining than mBERT. This very reason also accounts for the discrepancy in their performance on EN relative to other languages: With XLM-R, the gap between EN scores and other languages is much smaller than with mBERT. This is especially apparent in the case of Indonesian: ID pretraining data for mBERT are less than 10% of EN pretraining data, while their sizes are comparable in XLM-R.
Further, the results in Figure 3 indicate that joint training of two NLU tasks tends to benefit intent detection while degrading the performance on slot labeling. The reverse trend is true for separate training: Slot labeling scores improve, while intent detection degrades. This confirms the trend observed in recent work (Razumovskaia et al., 2022b).13
Gaps with Respect to English.
The per-language NLU results (Table 10 and Figure 3) also illustrate a performance gap due to ‘information loss’ during transfer: The drops (averaged across all 4 target languages) of the strongest transfer method are ≈10 points on intent detection (in All-domains experiments) and 15 points on slot labeling, using exactly the same underlying models. These gaps are even more pronounced for some languages (e.g., the lowest-resource language Kiswahili) and in domain-specific setups (e.g., In-domain setups).
The E2E results in Figure 3 c also reveal a chasm between mT5 performance on English and the other four languages, especially so without any target-language adaptation. The gap, while still present, is substantially reduced with the +FF model variant (see §5.1). This disparity emphasizes the key importance of (i) continuous development of multilingual benchmarks inclusive of less-resourced languages to provide realistic estimates of performance on multilingual ToD, as well as (ii) creation of (indispensable) in-domain data for few-shot target language adaptation. The low absolute scores indicate the complexity of the task in general. Overall, these findings reveal the challenging nature of cod, and call for further research on data-efficient and effective transfer methods in multilingual ToD.
In-Domain vs. Cross-Domain Evaluation.
cod not only enables cross-lingual transfer but is also the first multilingual dialogue dataset suitable for testing models in cross-domain settings (Table 13). The general observation is that in-domain performance is much higher than cross-domain, although both have large room for improvement.
Baseline results for NLU and E2E on the cod test set, averaged over all 4 target languages; In-, Cross-domain, and All domains setups.

We conduct a more detailed analysis of the in-domain and cross-domain performance for the slot labeling task. We chose to focus on slot labeling as the annotators were explicitly instructed to substitute slot values with target language-specific values where appropriate. We use XLM-R fine-tuned on the full English dataset. In the interest of space and clarity, we present the results for two domains that the model has seen in training (Flights, Movies) and one domain which it has not seen during training (Payment). The results in Table 14 support the general claims: There is a significant drop between domains seen and not seen at training.14 Further, we note that the performance on Flights is much lower than on Movies. This is due to: (i) the larger number of slots in the Flights domain; and (ii) the fact that slot values in Flights are naturally suited for localization (e.g., departure and destination cities), which makes the domain more complex for cross-cultural generalization. This additionally underscores the need to collect multilingual dialogue datasets in a more culturally aware fashion to get realistic estimates of cross-lingual performance of ToD models.
6 Conclusion and Outlook
We have presented and validated a ‘bottom-up’ method for the creation of multilingual task-oriented dialogue (ToD) datasets. The key idea is to map domain-specific, language-independent dialogue schemata to natural language outlines, which in turn guide human dialogue generators to create natural target-language utterances, for the user and system alike. We have empirically demonstrated that the proposed outline-based approach yields more natural and culturally sensitive dialogues than the standard translation-based approach to multilingual ToD data creation. Moreover, we have shown that standard translation-based approaches often yield over-inflated and unrealistic performance estimates in multilingual evaluation, an issue which the outline-based generation method removes.
Our proposed approach yielded a new Cross-lingual Outline-based Dialogue dataset (termed cod), which covers 5 typologically diverse languages, 11 domains in total, and enables evaluations in standard NLU, DST, and end-to-end ToD tasks. Thus, cod is an important step towards challenging multilingual and multi-domain ToD evaluation in future research. We have also evaluated a series of state-of-the-art models for the different ToD tasks, setting baseline reference points, and revealing the challenging nature of the dataset with ample room for improvement.
We hope that our work will inspire future research across multiple aspects. One such area concerns cultural debiasing of the concepts and situations captured in the dialogues. Our method addresses this through cultural adaptations and replacements of foreign concepts with those common in the annotators’ culture and environment. The next step should involve a careful selection of dialogue scenarios based on their relevance and plausibility in the culture in question, as very recently started in other NLP areas (e.g., Liu et al., 2021). In this work, we presented useful practices and insights hoping to guide similar (potentially larger-scale) data creation efforts in ToD for other, especially lower-resource, languages, and domains.
cod is available online at github.com/cambridgeltl/COD.
Acknowledgments
This work was funded by the ERC PoC Grant MultiConvAI: Enabling Multilingual Conversational AI (no. 957356) and a research donation from Huawei. The work of EMP was supported by the Facebook CIFAR AI Chair program. We would like to thank our annotators for their contribution to this work and the TACL editors and anonymous reviewers for their helpful feedback and suggestions.
Notes
Furthermore, when asked to compare equivalent dialogues obtained with the two processes, respondents favored outline-based dialogues in more than 80% of cases.
For example, the “Alarm_1” service comprises intents such as “GetAlarms” (“Get the alarms user has already set”) and “AddAlarm” (“Set a new alarm”) and slots “alarm_time”, “alarm_name”, “new_alarm_time”, and “new_alarm_name”.
The total cost of cod was 800 GBP per language.
To ensure quality, we selected candidates with reported target language credentials who successfully completed a qualification exercise consisting in writing a 6-turn dialogue according to outlines analogous to those in the main task.
The non-comparative part came first to avoid priming effects from an a priori awareness of systematic qualitative differences between examples coming from either method.
We focus on the intent detection task to avoid the interference of noise introduced by the alignment algorithms (i.e., aligning the source language examples with automatic translations of the training data for slot labeling).
The same trends were observed in the results with other standard multilingual sentence encoders such as LaBSE (Feng et al., 2022), not included due to space limits.
We opt for mT5 as it substantially outperformed mBART (Liu et al., 2020a) and other E2E baselines in the work of Lin et al. (2021). We leave experimentation with more sophisticated model variants (Liu et al., 2020b) and sampling methods such as nucleus sampling (Holtzman et al., 2020) for future work. For brevity, we do not report results with other automatic E2E modeling metrics such as Task Success Rate or Dialogue Success Rate (Budzianowski and Vulić, 2019).
This is more likely in translated text where language-specific hints for the exact slot type may get lost in translation.
The observation is also corroborated by weaker performance in languages which use non-Latin script.
We also evaluated whether incorporating English SGD schemata into the NLU models—that is, leveraging short English descriptions of domains, intents, and slots available from the English SGD dataset—improves performance, adapting the process of Cao and Zhang (2021) to a cross-lingual setup; however, we obtained negative results.