Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation

Multilingual task-oriented dialogue (TOD) facilitates access to services and information for many (communities of) speakers. Nevertheless, its potential is not fully realised, as current multilingual TOD datasets—both for modular and end-to-end modelling—suffer from severe limitations. 1) When created from scratch, they are usually small in scale and fail to cover many possible dialogue flows. 2) Translation-based TOD datasets might lack naturalness and cultural specificity in the target language. In this work, to tackle these limitations we propose a novel outline-based annotation process for multilingual TOD datasets, where domain-specific abstract schemata of dialogue are mapped into natural language outlines. These in turn guide the target language annotators in writing dialogues by providing instructions about each turn's intents and slots. Through this process we annotate a new large-scale dataset for evaluation of multilingual and cross-lingual TOD systems. Our Cross-lingual Outline-based Dialogue dataset (COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue evaluation in 4 diverse languages: Arabic, Indonesian, Russian, and Kiswahili. Qualitative and quantitative analyses of COD versus an equivalent translation-based dataset demonstrate improvements in data quality, unlocked by the outline-based approach. Finally, we benchmark a series of state-of-the-art systems for cross-lingual TOD, setting reference scores for future work and demonstrating that COD prevents the over-inflated performance typically observed with prior translation-based TOD datasets.


Introduction and Motivation
One of the staples of machine intelligence is arguably the ability to communicate with humans and complete a task as instructed during such an interaction. This is commonly referred to as task-oriented dialogue (TOD; Gupta et al., 2005; Bohus and Rudnicky, 2009; Young et al., 2013; Muise et al., 2019). Despite having far-reaching applications, such as banking (Altinok, 2018), travel (Zang et al., 2020), and healthcare (Denecke et al., 2019), this technology is currently limited to a handful of languages (Razumovskaia et al., 2021). Thus, large communities of speakers are prevented from accessing automated services and information.
The progress in multilingual TOD is critically hampered by the paucity of training data for many of the world's languages. While cross-lingual transfer learning (Zhang et al., 2019; Xu et al., 2020; Siddhant et al., 2020; Krishnan et al., 2021) offers a partial remedy, its success is tenuous beyond typologically similar languages and generally hard to assess due to the lack of evaluation benchmarks (Razumovskaia et al., 2021). What is more, transfer learning often cannot leverage multi-source transfer and few-shot learning due to the lack of language diversity in TOD datasets (Zhu et al., 2020; Quan et al., 2020; Farajian et al., 2020).
Therefore, the main driver of development in multilingual TOD is the creation of multilingual resources. However, even when available, these resources suffer from several pitfalls. Most are obtained by manual or semi-automatic translation of an English source (Castellucci et al., 2019; Bellomaria et al., 2019; Susanto and Lu, 2017; Upadhyay et al., 2018; Xu et al., 2020; Ding et al., 2021; Zuo et al., 2021, inter alia). While this process is cost-efficient and typically makes data and results comparable across languages, it yields dialogues that lack naturalness (Lembersky et al., 2012; Volansky et al., 2015) and are not properly localised nor culture-specific (Clark et al., 2020). Further, they provide over-optimistic estimates of performance due to the artificial similarity between source and target texts (Artetxe et al., 2020). As an alternative to translation, new TOD datasets can be created from scratch in a target language through the Wizard-of-Oz framework (WOZ; Kelley, 1984), where humans impersonate both the client and the assistant. However, this process is highly time- and money-consuming, thus failing to scale to large quantities of examples and languages, and it often lacks coverage in terms of possible dialogue flows (Zhu et al., 2020; Quan et al., 2020).

Outlines | Dialogue & Slot Output
USER: Express the desire to search for roundtrip flights for a trip | Мне нужно найти рейс в Ставрополь и обратно авиакомпании S7.
the name of the airport or city to arrive at: Seattle | Ставрополь
the company that provides air transport services: American Airlines | S7
ASSISTANT/SYSTEM: Inform the user that you found 1 such option(s). Offer the following option(s): | Найден 1 рейс авиакомпании S7 с пересадкой, вылет в 7:35, возвращение в Москву в 16:15. Стоимость билетов 6845 рублей.
the company that provides air transport services: American Airlines | S7
departure time of the flight flying to the destination: 7:35am | 07:35
departure time of the flight coming back from the trip: 4:15pm | 16:15
the total cost of the flight tickets: $343 | 6845 рублей
Table 1: Example from the COD dataset of outline-based dialogue generation in Russian with target language substitutions of slot values. The first column (Outlines) includes example outlines presented to the dialogue creators, and the second column holds the creators' output (Dialogue & Slot Output).
To address all these gaps, in this work we devise a novel outline-based annotation pipeline for multilingual TOD datasets that combines the best of both processes. In particular, abstract dialogue schemata, specific to individual domains, are sampled from the English Schema-Guided Dialogue dataset (SGD; Shah et al., 2018; Rastogi et al., 2020). Then, the schemata are automatically mapped into outlines in English, which describe the intention that should underlie each dialogue turn and the slots of information it should contain, as shown in Table 1. Finally, outlines are paraphrased by human subjects into their native tongue, and slot values are adapted to the target culture and geography. This ensures both the cost-effectiveness and cross-lingual comparability offered by manual translation, and the naturalness and culture-specificity of creating data from scratch. Through this process, we create the Cross-lingual Outline-based Dialogue dataset (termed COD), supporting natural language understanding (intent detection and slot labelling tasks), dialogue state tracking, and end-to-end dialogue modelling in 11 domains and 4 typologically and areally diverse languages: Arabic, Indonesian, Russian, and Kiswahili.
To confirm the advantages of the proposed annotation process, we run a proof-of-concept experiment where we create two analogous datasets through the outline-based pipeline and manual translation, respectively. Based on a quality survey with human participants, we find that, while the two methods have similar annotation speed, outline-based annotation achieves significantly higher naturalness and familiarity of concepts and entities, without compromising data quality or language fluency. Finally, we provide crucial evidence that cross-lingual transfer test scores on translation-based data are over-estimated. We demonstrate that this is due to the fact that the distribution of the sentences (and their hidden representations) is considerably more divergent between training and evaluation dialogues in COD than in the translation-based dataset.
Further, to establish realistic estimates of performance on multilingual TOD, we benchmark a series of state-of-the-art multilingual TOD models in different TOD tasks on COD. Among other findings, we report that zero-shot transfer surpasses 'translate-test' on slot labelling, but this trend is reversed for intent detection. Language-specific performance also varies substantially among the evaluated models, depending on the quantity of unlabelled data available for pretraining.
In sum, COD provides a typologically diverse dataset for end-to-end dialogue modelling, and streamlines a scalable annotation process that results in natural and localised dialogues. As such, we hope that COD will contribute to democratising language technology and to facilitating reliable and cost-effective TOD systems for a wide array of languages. Our data and code are available at https://github.com/cambridgeltl/COD.
The main goal of our TOD dataset creation approach is to balance the practical advantages offered by direct translation and the linguistic and cultural specificity granted by bottom-up data collection in the target language. On the one hand, translation of an existing dataset removes the need for a costly and lengthy interactive dialogue generation protocol. By using pre-existing annotated data, dialogue intent labels can be directly transferred to a new language, and annotation work is limited to slot value spans. As a consequence, the data are automatically aligned across different languages, which enables direct comparisons of system performance. On the other hand, direct translation is known to perpetuate linguistic and cultural biases into the target language, skewing the syntactic and lexical properties of the data towards the source language, as well as imposing dialogue behaviours and concepts which are not necessarily familiar or appropriate in the target culture. As a result, translated datasets cannot be reliably used as benchmarks of model performance in the target language (Koppel and Ordan, 2011; Volansky et al., 2015; Artetxe et al., 2020; Ponti et al., 2020).
Our proposed outline-based approach aims to marry the benefits of both methods, while avoiding their shortcomings. It achieves time- and cost-effectiveness by bootstrapping from existing dialogue schemata, but refrains from direct translation in favour of outline-guided dialogue writing with target culture-specific slot value adaptation, thus ensuring naturalness and familiarity of the concepts.
Source Data. We selected the English Schema-Guided Dialogue (SGD) dataset (Shah et al., 2018; Rastogi et al., 2020) as our starting point due to its scale (20k human-assistant dialogues) and diversity (20 different domains). The SGD dataset construction paradigm combined automatic generation of dialogue schemata with manual creation of dialogue paraphrases by crowdworkers. The method, dubbed "Machines Talking To Machines" (M2M), is an alternative to the popular human-to-human Wizard-of-Oz framework (Kelley, 1984), where pairs of crowdworkers interact following task specifications, generated through sampling of slot values from an API client, in order to complete a certain goal, and their conversations are directly recorded (Wen et al., 2016; Budzianowski et al., 2018). The crowdsourced dialogues then undergo another round of annotation with dialogue acts and slot spans. While the WOZ approach has the advantage of collecting actual human-to-human conversations, the process is expensive and prone to error, given the risk that the free-form interactions might not exhaustively cover possible interactions or might not lend themselves to direct use for model training (e.g., long and overly convoluted exchanges).
The SGD's M2M approach has the advantage of greater speed and cost-effectiveness. In the first stage, it simulates the user-assistant interaction to exhaustively explore possible user behaviours and dialogue scenarios and to generate dialogue outlines (i.e., template utterances and their semantic parses), maximising diversity and coverage of different dialogue flows by means of permutations of slots, intents, and domains. Subsequently, crowdworkers are tasked with paraphrasing dialogue templates to create natural language (NL) utterances, preserving the meaning and key elements captured in the templates (e.g., outline: "Book movie with title is Inside Out and date is tomorrow" → paraphrase: I want to buy tickets for Inside Out for tomorrow.), and subsequently validate slot spans. Given that dialogue intents and slot values are provided in the dialogue outlines, the risk of erroneous labels in the final dataset is minimised.
The SGD dataset organises dialogue data as lists of turns for each individual interaction, each turn containing an utterance by the user or system. The accompanying annotations are grouped into frames, each corresponding to a single API or service (e.g., Banks_2). In turn, each service is represented as a schema, i.e., a normalised representation of a service-specific interface, which includes its characteristic functions (intents) and parameters (slots), as well as their NL descriptions.

Languages. To assess the viability of the outline-based method, we selected Russian as a trial language and carried out data collection using two methods: (i) direct translation from English and (ii) our proposed outline-based approach. Having evaluated the quality of the output of both methods and the advantages of in-target outline-based creation (see §3), we applied the method to three other languages which boast a large number of speakers and yet suffer from a shortage of resources: Arabic, Indonesian, and Kiswahili, ensuring the dataset's diversity in terms of language family (Indo-European (RU), Afro-Asiatic (AR), Austronesian (ID), and Niger-Congo (SW)) and macroarea (Eurasia, Papunesia, Africa), as well as writing systems (Cyrillic, Arabic, and Latin scripts). We present a quantitative evaluation of the linguistic diversity of the language sample in Table 3, where we also compare it with standard multilingual dialogue NLU and end-to-end datasets. In terms of typology, COD is comparable to datasets with much larger language samples (e.g., MultiATIS++ or xSID) and considerably exceeds the others. With respect to family and macroarea diversity, COD is the most diverse among existing datasets.
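To make the schema concept above concrete, the following sketch represents one service as a nested structure. The service name, intents, slots, and descriptions are illustrative stand-ins modelled on SGD's format, not verbatim entries from the dataset:

```python
# Illustrative SGD-style service schema: a normalised representation of one
# service (API) with its intents and slots, each paired with an NL description.
schema = {
    "service_name": "Flights_1",  # hypothetical service identifier
    "description": "Search and reserve flights",
    "intents": [
        {
            "name": "SearchOnewayFlight",
            "description": "Search for one-way flights to the destination of choice",
            "required_slots": ["origin_city", "destination_city"],
        }
    ],
    "slots": [
        {"name": "destination_city",
         "description": "the name of the airport or city to arrive at"},
        {"name": "airlines",
         "description": "the company that provides air transport services"},
    ],
}

def slot_description(schema, slot_name):
    """Look up the natural-language description attached to a slot name."""
    for slot in schema["slots"]:
        if slot["name"] == slot_name:
            return slot["description"]
    return None

print(slot_description(schema, "airlines"))
```

The NL descriptions stored alongside each intent and slot are exactly what the outline generation step later draws on.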

Annotation Protocol
The data creation protocol involved the following phases: 1) source dialogue sampling; 2) automatic generation of outlines based on intent and slot information using rewrite rules; 3) manual outline-driven target language dialogue creation and slot annotation; 4) post-hoc review. Each phase is described below.
Source Dialogue Sampling. To ensure wide coverage of dialogue scenarios, we randomly sampled source dialogues from across 11 domains, out of which five (Alarm, Flights, Homes, Movies, Music) are shared between the development and test sets; the remainder are unique to either set, to enable cross-domain experiments. To guarantee balanced coverage of different intents, we sampled 10 examples per intent, which ensures the task cannot be solved by simply predicting the most common intent. Table 4 summarises the coverage of domains and the number of dialogues and turns resulting from this sampling procedure.
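The per-intent sampling step can be sketched as follows. The function and field names are our own; the paper does not release this specific code:

```python
import random

def sample_per_intent(dialogues, per_intent=10, seed=0):
    """Group dialogues by the intents they contain and sample up to
    `per_intent` examples for each intent, so no intent dominates."""
    rng = random.Random(seed)
    by_intent = {}
    for dlg in dialogues:
        for intent in dlg["intents"]:
            by_intent.setdefault(intent, []).append(dlg)
    sampled, seen_ids = [], set()
    for intent, pool in sorted(by_intent.items()):
        for dlg in rng.sample(pool, min(per_intent, len(pool))):
            if dlg["id"] not in seen_ids:  # avoid duplicates across intents
                seen_ids.add(dlg["id"])
                sampled.append(dlg)
    return sampled

# Toy corpus: 25 dialogues per intent; the balanced sample keeps 10 of each.
dialogues = [{"id": i, "intents": ["FindMovie" if i % 2 else "BookFlight"]}
             for i in range(50)]
subset = sample_per_intent(dialogues, per_intent=10)
print(len(subset))  # 20: 10 per intent
```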
Outline Generation. The goal of this step is to create minimal but sufficient instructions for target language dialogue creators to ensure coverage of specific intents and slots, while avoiding imposing predefined syntactic structures or linguistic expressions. First, for each user or system act, we manually create a rewrite rule, e.g., REQUEST_ALTS → Request alternative options, or INFORM_COUNT → Inform the user that you found + INFORM_COUNT[value] + such option(s) (where value corresponds to the number of options matching the user request). Next, we automatically match each intent and slot with its NL description (provided in the SGD schemata) and use them to generate intent/slot-specific outlines (with stylistic adaptations where necessary): e.g., the intent "SearchOnewayFlight" and the description "Search for one-way flights to the destination of choice" would yield the outline Express the desire to search for one-way flights (see Table 5).
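A minimal sketch of this rule-based mapping, using simplified versions of the rules and descriptions quoted above (the rule table and function names are illustrative, not the authors' implementation):

```python
# Sketch of outline generation: dialogue acts are mapped to natural-language
# instructions via rewrite rules; intent descriptions come from the schema.
REWRITE_RULES = {
    "REQUEST_ALTS": lambda frame: "Request alternative options",
    "INFORM_COUNT": lambda frame: (
        f"Inform the user that you found {frame['value']} such option(s)"
    ),
    # Intents are verbalised from their schema NL descriptions.
    "INFORM_INTENT": lambda frame: (
        "Express the desire to "
        + frame["intent_description"][0].lower()
        + frame["intent_description"][1:]
    ),
}

def generate_outline(act, frame):
    """Turn one annotated dialogue act into an outline instruction."""
    return REWRITE_RULES[act](frame)

print(generate_outline("INFORM_COUNT", {"value": 1}))
# → "Inform the user that you found 1 such option(s)"
```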
Dialogue Writing. We recruited target language native speakers fluent in English via the proz.com platform. Dialogue creators were presented with language-specific dialogue creation guidelines (see Appendix A), which described the goals of the task, i.e., creative writing of natural-sounding exchanges between a hypothetical user and a TOD system. An essential part of the task consisted in a cultural adaptation of the slot values, illustrated in Table 1. For all culturally and geographically specific slot values (e.g., city names, movie titles, names of artists), creators were asked to substitute them with named entities more familiar or closer to their culture (e.g., American Airlines → Aeroflot, New York → Jakarta).
Slot Span Validation. Creators performed the first round of slot span labelling while working on dialogue writing. Subsequently, the annotated data in each language underwent an additional round of manual revision by a target language native speaker and a final automatic check for slot value-span matches. We verified inter-annotator reliability on slot span labelling on Russian, where we collected slot span annotations from pairs of independent native-speaker annotators. The obtained accuracy scores (i.e., the ratio of slot instances with matching spans to the total number of annotated instances) of 0.99 for dev data and 0.98 for test data reveal very high agreement on this task.
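The reported agreement metric boils down to an exact span-match ratio. A sketch under our reading of the metric, with illustrative field names:

```python
def span_match_accuracy(annotations_a, annotations_b):
    """Ratio of slot instances whose (start, end) character spans match
    exactly between two annotators, keyed by (dialogue, turn, slot name)."""
    spans_b = {(a["dlg"], a["turn"], a["slot"]): (a["start"], a["end"])
               for a in annotations_b}
    matches = 0
    for a in annotations_a:
        key = (a["dlg"], a["turn"], a["slot"])
        if spans_b.get(key) == (a["start"], a["end"]):
            matches += 1
    return matches / len(annotations_a)

# Toy example: two annotators agree on one span and disagree on the other.
ann1 = [{"dlg": 1, "turn": 0, "slot": "city", "start": 10, "end": 16},
        {"dlg": 1, "turn": 2, "slot": "airline", "start": 3, "end": 5}]
ann2 = [{"dlg": 1, "turn": 0, "slot": "city", "start": 10, "end": 16},
        {"dlg": 1, "turn": 2, "slot": "airline", "start": 4, "end": 5}]
print(span_match_accuracy(ann1, ann2))  # 0.5
```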

Translation versus Outline-Based Generation
The main motivation behind the outline-based approach is to avoid the known pitfalls of direct translation and to produce evaluation data that better represent the linguistic and cultural realities of each language in the sample (see §2). To verify whether the method satisfies these goals in practice, we first carried out a trial experiment consisting in parallel dialogue data creation using two different methods: (i) direct translation and (ii) outline-based generation. To ensure a fair comparison, we used the same sample of source SGD dialogues in both tasks, in two different ways. In (i), randomly sampled (see §2.1) English user/system utterances were extracted directly from the dataset with the accompanying slot and intent annotations, and subsequently translated into the target language by professional translators, who were also responsible for validating target language slot spans. In (ii), we automatically extracted dialogue frames, including intents and slots, corresponding to the dialogue IDs sampled in (i), and used them to generate NL outlines to guide manual dialogue creation by native speakers, relying on the procedure described in §2.1.
We also asked the participants to time themselves while working on the task. Notably, we found the annotation speed to be identical for the two methods, averaging 15 seconds per dialogue turn (dialogue writing + slot annotation). While the translation approach does not require any creative input in terms of cultural adaptations of slot values, the outline-based approach allows freedom in terms of the linguistic expressions used, removing the need for faithful translation of the original English sentences; this ultimately results in similar time requirements for both tasks.
Quality Survey. To compare the two methods' output, we carried out a quality survey with 15 Russian native speakers. It consisted of two consecutive parts: (1) independent and (2) comparative evaluation; the non-comparative part came first so as to avoid priming effects from an a priori awareness of systematic qualitative differences between examples coming from either method. Within each part, the order of questions was randomised. In Part 1, the respondents were presented with 6 randomly sampled dialogues from the data generated by either method (3 dialogues per method) and were asked to rate, on a 5-point Likert scale, to what extent they agreed with each of four statements (provided in Table 6). In Part 2, respondents were presented with 5 randomly sampled pairs of matching dialogue excerpts (i.e., a set of N dialogue turns extracted based on a shared dialogue ID from both datasets) and were asked to choose which excerpt (A or B) sounded more natural to them. All survey questions and instructions were translated into Russian.
Figure 1 shows the average scores for each question in Part 1 across all 15 participants. The methods produce dialogues which score very similarly in terms of the assistant's goal-orientedness (Q1) and Russian language fluency (Q3). However, the differences are clearer in the scores for Q2 and Q4. First, the user utterances created based on outlines are perceived as more natural-sounding (Q2). Further, they score noticeably better in terms of the familiarity of the mentioned entities (Q4). These results are encouraging, given that Q4 directly addresses one of the main objectives of our method, i.e., target language-specificity. While both approaches are capable of producing convincing Russian dialogues, the results of Part 2 are more clearly skewed in favour of the outline-based method: out of 75 comparisons (15 participants judging 5 pairs each), outline-based dialogues are preferred (i.e., judged as more natural-sounding) in 80% of cases. In Table 7 we show an example pair of dialogue excerpts from each method, analogous to those used in the survey, with accompanying English translations.
Effects of Translationese. Dialogue data are expected to be representative of a natural interaction between two interlocutors. The utterances of both the user and the system should reflect the properties characteristic of the conversational register in a given language, appropriate for the communicative situation at hand and the participants' social roles (Chaves et al., 2019; Chaves and Gerosa, 2021).

Figure 1: Ratings (Table 6) assigned to dialogue examples generated via translation versus outline-based generation.

When qualitatively comparing the translation and outline-based generation in Table 7, we observe that translated utterances are often skewed towards the source language syntax and lexicon (known as "translationese" effects; Koppel and Ordan, 2011), compromising the fluency and idiomaticity that are essential in natural-sounding exchanges.
One issue which arises in literal translation is syntactic calques from the source language. For instance, the translation of the first USER utterance (Table 7, col. 'Translation') uses a dative pronoun, найти мне[DATIVE] (find me), even though the transitive verb найти (find) does not require the [DATIVE] case after it; this is a likely calque of the English expression Can you find me. In comparison, the corresponding outline-based generated utterance uses a more fluent construction. Another problem concerns differences in the use of grammatical structures depending on the language register. For instance, using the passive voice is common in spoken English (cf. the last ASSISTANT utterance in Table 7). The literal translation of the dialogue into Russian also includes the passive voice, although it is usually avoided in spoken Russian (Babby and Brecht, 1975). In contrast, the outline-based utterance uses a simpler active voice construction with the same meaning as the one in the translation.
We observe further "translationese" effects on the lexical level, namely (i) a preference for lexical cognates of source language words, and (ii) the use of vocabulary typical of the written language; both are exemplified by the last ASSISTANT utterance (Table 7). The translation includes the verb запланирован (is planned): even though the verb планировать shares its root with English to plan, it is rarely used in spoken Russian with regard to arranging near-future appointments, and more frequently with regard to making a step-by-step plan. In contrast, the outline-based generated utterance includes the verb забронировать (to book), which is more specific to arranging appointments and more frequently used in spoken language. Similar examples for both (i) and (ii) are presented in Appendix B.

Table 7: Comparison of dialogues generated by each method. For each user/assistant utterance, we provide the original English sentences from SGD for the translation method, and English translations of the Russian utterances written based on outlines. ♣ marks syntactic similarity to the source language; ♠ marks lexical similarity to the source language.
Evaluation of TOD Systems on Translation-Based versus Outline-Generated Data. The vast majority of existing NLU datasets are based on translation from English into the target language (Xu et al., 2020; van der Goot et al., 2021). This could lead to overly optimistic evaluation of cross-lingual TOD systems, as the translations might not be representative of users' language use in real life: as demonstrated above, translation-based evaluation data suffer from the effects of "translationese".
In this diagnostic experiment, we use a translate-train approach where: (i) training data are translated from the source language (EN) to the target (RU) via Google Translate; and (ii) the model is fine-tuned on these automatically translated data. In our analysis, we test the model on evaluation data obtained in each of the following ways: (a) translated using Google Translate; (b) translated by professional translators (closest in nature to existing dialogue NLU datasets); (c) generated based on outlines. For the experiment, we fine-tune mBERT (Devlin et al., 2019) on intent detection. The results in Table 8 indicate that stronger performance is observed on translation-based evaluation sets than on the more natural, outline-generated examples. The results corroborate previous observations in other areas of NLP, e.g., machine translation (Graham et al., 2020), now for TOD. Crucially, this experiment verifies that using solely translation-based TOD evaluation data might yield an overly optimistic estimation of models' cross-lingual capabilities and, consequently, too optimistic performance expectations in real-life applications. This further validates our proposed outline-based approach to (more natural and target-grounded) multilingual TOD data creation.
Analysis of Sentence Encodings. One reason behind the scores observed in Table 8 may lie in how similar the evaluation data remain to their English sources under the translation-based versus the outline-based approach. To test this, we obtain sentence encodings of all user turns for one intent from the three datasets via the distilled multilingual USE sentence encoder (Yang et al., 2020; Reimers and Gurevych, 2019). As illustrated in Figure 2, the translation-based data are encoded into sentence representations that are much more similar to their English source than the corresponding outline-generated examples. The difference holds across dev and test splits and across different multilingual sentence encoders (see also Appendix C). This indicates that, as expected, the utterances obtained via translation are artificially more similar to their English counterparts than the outline-generated ones. This again underlines the finding from Table 8: multilingual TOD datasets collected via outline-based generation should lead to more realistic assessments of multilingual TOD models than translation-based multilingual TOD datasets.
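The analysis above amounts to comparing encoded utterances with their English sources, typically via cosine similarity. A pure-Python sketch with toy vectors (in the paper, the vectors come from a multilingual sentence encoder; the values below are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy encodings: the "translated" vector stays close to the English source,
# while the outline-generated one diverges more.
english_src = [0.9, 0.1, 0.2]
translated = [0.85, 0.15, 0.25]
outline_based = [0.4, 0.6, 0.5]

print(cosine(english_src, translated) > cosine(english_src, outline_based))  # True
```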
Further Discussion. To meet the urgent, ever-growing demand for large-scale multilingual TOD datasets, data collection methods are needed which efficiently leverage existing resources to generate new data quickly without compromising data quality. Direct translation has the benefit of reusing already annotated and verified data entries; moreover, it is a well-defined task which does not require task-specific guidelines or training. However, as we have demonstrated, it unnaturally skews the data towards the source language, which makes evaluation results unreliable. Our proposed 'bottom-up' approach produces a much more realistic benchmark for gauging models' multilingual capabilities. The outlines provide minimal instructions to annotators, which, together with the task guidelines, prove sufficient for a single annotator to create natural-sounding user-assistant exchanges capturing predefined intents and slot values. This circumvents the need for setting up an expensive WOZ pipeline, where pairs of users interact live and their conversations are recorded.
An important area for future improvement concerns cultural debiasing of the topics, situations, and concepts captured in the dialogues. Although our method generates linguistic expressions which are natural in the target language, the dialogue scenarios included in the dataset are still inherited from English. While most of these are common around the globe (e.g., searching for a property to rent, selecting a movie to watch), some are much less likely to happen in some cultures or communities (e.g., making a public money transfer). Looking ahead, creating dialogue technology and resources that are representative of and applicable within individual communities of speakers should involve a careful selection of dialogue scenarios, based on their relevance and plausibility in the culture in question, as very recently started in other NLP areas (e.g., Liu et al., 2021). In our current dataset, we ensured the applicability and comprehensibility of the concepts referred to in the dialogues by entrusting native speakers with cultural adaptations and replacements of foreign concepts with those common in their culture and environment.

Baselines, Results, Discussion
COD includes labelled data and thus enables experimentation for three standard TOD tasks: i) Natural Language Understanding (NLU; intent detection and slot labelling); ii) dialogue state tracking (DST); and iii) end-to-end (E2E) dialogue modelling. In what follows, we benchmark a representative selection of state-of-the-art models (§4.1) on our new dataset, highlighting its potential for evaluation and the key challenges it presents across different tasks and experimental setups (§4.2).
Notation. A dialogue D is a sequence of alternating user and system turns {U_1, S_1, U_2, S_2, ...}. The dialogue history at turn t is the set of turns up to point t, i.e., H_t = {U_1, S_1, ..., U_{t-1}, S_{t-1}, U_t}.
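In code, the history construction is a simple prefix over interleaved turns. A sketch of the notation only (not an implementation from the paper):

```python
def dialogue_history(user_turns, system_turns, t):
    """Return H_t = [U_1, S_1, ..., U_{t-1}, S_{t-1}, U_t] for 1-indexed t."""
    history = []
    for i in range(t - 1):
        history += [user_turns[i], system_turns[i]]
    history.append(user_turns[t - 1])
    return history

users = ["U1", "U2", "U3"]
systems = ["S1", "S2", "S3"]
print(dialogue_history(users, systems, 2))  # ['U1', 'S1', 'U2']
```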

Baselines and Experimental Setup
We evaluate and compare the baselines for each task along the following axes: (i) different multilingual pretrained models; (ii) cross-lingual transfer approaches; (iii) in-domain versus cross-domain.
Multilingual Pretrained Models. For cross-lingual transfer based on multilingual pretrained models, we abide by the standard procedure where the entire set of encoder parameters and the task-specific classifier head are fine-tuned. We evaluate the following pretrained language models: (i) for NLU and DST, we use the Base variants of multilingual BERT (mBERT; Devlin et al., 2019) and XLM-RoBERTa (XLM-R; Conneau et al., 2020); for intent detection and slot labelling, we evaluate both a model that jointly learns the two tasks (Xu et al., 2020) and separate task-specific models; (ii) for E2E modelling, we use multilingual T5 (mT5; Xue et al., 2021), a sequence-to-sequence model, as it was demonstrated to be the strongest baseline for cross-lingual dialogue generation (Lin et al., 2021).
Cross-lingual Transfer. We focus on two standard methods of cross-lingual transfer: (i) transfer based on multilingual pretrained models and (ii) translate-test (Hu et al., 2020). In (i), a Transformer-based encoder is pretrained on multiple languages with a language modelling objective, yielding strong cross-lingual representations that enable zero-shot model transfer. In (ii), test data in a target language are translated into English via a translation system. To this end, we compare translations obtained via Google Translate (GTr) and MarianMT (Junczys-Dowmunt et al., 2018).
For end-to-end training, we set up two additional cross-lingual baselines, similar to Lin et al. (2021). In few-shot fine-tuning (FF), after the model is trained on the source language data (English), it is further fine-tuned on a small number of target language dialogues. In our FF experiments, we use the dialogues in the development set of each language as few-shot learning data. In mixed-language training (MLT; Lin et al., 2021), the model is fine-tuned on mixed-language data where the slot values in the source language data are substituted with their target language counterparts. Unlike Lin et al. (2021), we do not assume the existence of a bilingual parallel knowledge base, which is unrealistic for low-resource languages. Hence, the translations of the slot values are obtained via MarianMT (Junczys-Dowmunt et al., 2018).
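The slot-value substitution underlying the MLT baseline can be sketched as below. The translation lookup is a stand-in for MarianMT output, and the function name is ours:

```python
def mix_language(utterance, slot_values, translations):
    """Replace source-language slot values in an utterance with their
    target-language translations, producing mixed-language training data."""
    for value in slot_values:
        if value in translations:
            utterance = utterance.replace(value, translations[value])
    return utterance

# Hypothetical en->ru slot-value translations (stand-in for MarianMT output).
translations = {"New York": "Нью-Йорк", "American Airlines": "Американ Эйрлайнс"}
utt = "Find me a flight to New York with American Airlines"
print(mix_language(utt, ["New York", "American Airlines"], translations))
```

The rest of the utterance stays in the source language; only the slot values are swapped, which is what makes the data "mixed-language".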
In-Domain versus Cross-Domain Experiments. COD development and test splits include examples belonging to domains which were not seen in the English training data (see Table 4). This enables cross-lingual evaluation in 3 different regimes: in-domain testing (In), where the model is evaluated on examples coming from the domains seen during training; cross-domain testing (Cross), evaluating on examples coming from the domains which were not seen during training; and overall testing (All), evaluating on all examples in the evaluation set.
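A minimal sketch of how the three regimes partition an evaluation set; the example records and domain names below are purely illustrative:

```python
def split_by_regime(examples, train_domains):
    """Partition evaluation examples into the three testing regimes:
    In (domains seen in training), Cross (unseen domains), All (everything)."""
    in_dom = [ex for ex in examples if ex["domain"] in train_domains]
    cross = [ex for ex in examples if ex["domain"] not in train_domains]
    return {"In": in_dom, "Cross": cross, "All": list(examples)}

# Hypothetical toy evaluation set.
toy_eval = [{"domain": "Flights"}, {"domain": "Payment"}, {"domain": "Flights"}]
splits = split_by_regime(toy_eval, train_domains={"Flights"})
```
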

Architectures and Training Hyperparameters.
NLU in TOD consists of two tasks performed for each user turn U_i: intent detection and slot labelling, which are typically framed as sentence- and token-level classification tasks, respectively. When a model is trained in a joint fashion, the two tasks share an encoder, and task-specific classification layers are added on top of the encoder (Zhang et al., 2019; Xu et al., 2020). The loss is a sum of the intent classification and the slot labelling losses (cross-entropy). In separate training, there is no parameter sharing, so neither NLU task influences the other. The performance metrics are accuracy for intent detection and F1 for slot labelling.
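The joint training loss (intent cross-entropy plus per-token slot-labelling cross-entropy) can be written out concretely; the probability values below are illustrative only:

```python
import math

def cross_entropy(prob_dist, gold):
    """Cross-entropy for a single prediction: -log p(gold)."""
    return -math.log(prob_dist[gold])

def joint_nlu_loss(intent_probs, gold_intent, slot_probs_per_token, gold_slots):
    """Joint NLU loss: intent cross-entropy plus the summed
    per-token slot-labelling cross-entropy over the utterance."""
    loss = cross_entropy(intent_probs, gold_intent)
    loss += sum(cross_entropy(p, g)
                for p, g in zip(slot_probs_per_token, gold_slots))
    return loss

# Toy distributions for a two-token utterance (illustrative numbers only).
loss = joint_nlu_loss(
    {"FindFlights": 0.8, "NONE": 0.2}, "FindFlights",
    [{"O": 0.9, "B-city": 0.1}, {"O": 0.3, "B-city": 0.7}],
    ["O", "B-city"],
)
```
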
In the DST task, the model maps the dialogue history H_t to the belief state at U_t; this includes the slot values that have been filled up to turn t. We use BERT-DST (Chao and Lane, 2019), where the dialogue history is mapped to every possible slot-value pair.

Table 9: Per-language NLU results for two cross-lingual transfer methods: zero-shot cross-lingual transfer using multilingual pretrained models (MEncoder) and translate-test (TrTest) with Google Translate and MarianMT; see §4.1 for more details. Translations for slot labelling were aligned using fast_align (Dyer et al., 2013). The results of MEncoder are from the separate training regime (see again §4.1). All scores are averages over 5 random seeds and follow the All-domain setup. Full results on the dev and test sets are provided in Appendix E.

Table 10: Per-language E2E results for two cross-lingual transfer methods (see also the information in Table 9).
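The BERT-DST-style inference described above (scoring the dialogue history against every candidate slot-value pair) can be sketched as below; `toy_score` is a hypothetical stand-in for the fine-tuned encoder:

```python
def predict_belief_state(history, candidates, score, threshold=0.5):
    """DST inference sketch: score the dialogue history against every
    candidate slot-value pair and keep the pairs above a threshold."""
    belief = {}
    for slot, value in candidates:
        if score(history, slot, value) > threshold:
            belief[slot] = value
    return belief

# Hypothetical scorer; in the real model this is a fine-tuned encoder.
def toy_score(history, slot, value):
    return 0.9 if value.lower() in history.lower() else 0.1

state = predict_belief_state(
    "User: I need a flight to Jakarta on Friday.",
    [("destination", "Jakarta"), ("destination", "Moscow"), ("day", "Friday")],
    toy_score,
)
```
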
As in prior work (Lin et al., 2021), E2E modelling is framed as a sequence-to-sequence generation task. At every turn t, the goal is to predict the following system turn S_t based on the history H_t fed into the model as a concatenated string. We adopt the generative seq2seq model, termed mSeq2Seq, as used by Lin et al. (2021). This is based on mT5 Small (Xue et al., 2021) and standard top-k sampling. As in prior work (Lin et al., 2021), performance is reported as BLEU scores (Papineni et al., 2002). Unless stated otherwise, we use a beam size of 5 for generation; see also Appendix D for further details. We opt for mT5 as it substantially outperformed mBART (Liu et al., 2020a) and other E2E baselines in the work of Lin et al. (2021). We leave experimentation with more sophisticated model variants (Liu et al., 2020b) and sampling methods such as nucleus sampling (Holtzman et al., 2020) for future work. For brevity, we do not report results with other automatic metrics relevant to E2E modelling such as Task Success Rate or Dialogue Success Rate (Budzianowski and Vulić, 2019).
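For reference, standard top-k sampling over a toy vocabulary can be sketched as follows; this is a generic illustration of the decoding strategy, not the exact mT5 decoding code:

```python
import math
import random

def top_k_sample(logprobs, k, rng):
    """Top-k sampling: keep the k highest-scoring tokens, renormalise
    their probabilities, then draw one token at random."""
    top = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(math.exp(lp) for _, lp in top)
    r = rng.random() * total
    for token, lp in top:
        r -= math.exp(lp)
        if r <= 0.0:
            return token
    return top[-1][0]

# With k=1 this reduces to greedy decoding (toy vocabulary).
token = top_k_sample({"yes": -0.1, "no": -2.0, "maybe": -3.0}, k=1,
                     rng=random.Random(0))
```
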

Source Language Training.
We train all models on the standard full training split of the English SGD dataset (Rastogi et al., 2020).In order to measure performance gaps due to transfer and ensure comparability of dialogue flows in all languages, we also evaluate on the corresponding subset of the full English SGD test set, which was sampled as a source for the COD dataset (see §2 and Table 4).

Results and Discussion
We now present and discuss the results of cross-lingual transfer under the experimental setups outlined in §4.1. We report both per-language scores and averages across the 4 COD target languages.
Main Results. Table 9 compares the results for the two NLU tasks, while Table 10 shows the scores in the E2E task. Both include the two main methods of cross-lingual transfer, MEncoder and TrTest. With translate-test, the gains are highly task-dependent: it performs considerably better than encoder-based transfer methods on intent detection and E2E modelling, while the opposite holds for slot labelling. We speculate that this pattern stems from the following causes: 1) we rely on a word alignment algorithm on top of English predictions to align them with the target language, which adds noise to the final predictions; 2) qualitative analysis of the predictions revealed that many errors are due to incorrect 'label granularity' (e.g., predicting departure city instead of departure airport). Note that translate-test, unlike the encoder-based transfer method, assumes access to high-quality MT systems and/or parallel data for different language pairs. Table 10 reveals large gains of TrTest over the vanilla version of MEncoder. This occurs with both MarianMT and GTr, with GTr being the consistently better-performing translation method: this corroborates recent findings on other cross-lingual NLP tasks (Ponti et al., 2021). However, the +FF results in Table 10 reverse this trend and underline the benefits of few-shot target language fine-tuning in end-to-end training. The performance gains are large, even though the target language data includes only 92 dialogues (<1% of the English training data). In contrast, +MLT does not have a significant impact. This could be due to i) noisy target language substitutes, as they are obtained via automatic translation, unlike in Lin et al. (2021), where ground truth target language slot values were available; or ii) culture-specificity of slot values in COD. Thus, substitution with translations appears to be beneficial only for dialogues with a pre-defined common cross-lingual slot ontology.
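The alignment-based projection step behind cause 1), carrying slot labels from the English translation back onto the target-language tokens, can be sketched as below; the 3-token example and its word alignment are hypothetical:

```python
def project_slot_labels(en_labels, alignments, tgt_len):
    """Project token-level slot labels predicted on the English translation
    back onto the target-language tokens via word alignments
    (e.g., as produced by fast_align)."""
    tgt_labels = ["O"] * tgt_len
    for en_i, tgt_j in alignments:
        if en_labels[en_i] != "O":
            tgt_labels[tgt_j] = en_labels[en_i]
    return tgt_labels

# Hypothetical example: "flights to Jakarta" <-> "penerbangan ke Jakarta",
# with a one-to-one alignment between token positions.
projected = project_slot_labels(
    ["O", "O", "B-destination"], [(0, 0), (1, 1), (2, 2)], tgt_len=3)
```

Misalignments simply leave target tokens labelled "O", which is one way the projection step injects noise into the final predictions.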
In DST, irrespective of the transfer method and target language, cross-lingual performance is near-zero (not shown). These findings are in line with prior work (Ding et al., 2021) and are due to the DST task complexity. This is even more pronounced in zero-shot cross-lingual settings and especially for COD, where culture-specific slot values are obtained via outline-based generation. Given the low results, we focus on NLU and E2E as the two main tasks in all the following analyses.

Comparison of Multilingual Models on NLU.
The results in Table 9 and Figure 3 indicate that XLM-R largely outperforms mBERT in all setups in both NLU tasks. The gains are more pronounced on two languages more distant from English, ID and SW. We attribute this to XLM-R being exposed to more data in these languages during pretraining than mBERT. This very reason also accounts for the discrepancy in their performance on EN relative to other languages: with XLM-R, the gap between EN scores and other languages is much smaller than with mBERT. This is especially apparent in the case of Indonesian: ID pretraining data for mBERT is less than 10% of EN pretraining data, while their sizes are comparable in XLM-R.
Further, the results in Figure 3 indicate that joint training of the two NLU tasks tends to benefit intent detection while degrading the performance on slot labelling. The reverse trend is true for separate training: slot labelling scores improve, while intent detection degrades. This confirms the trend observed in recent work (Anonymous, 2021).

Gaps with respect to English. The per-language NLU results (see Table 9 and Figure 3, and also Appendix E) also illustrate the performance gap due to 'information loss' during transfer: the drops (averaged across all 4 target languages) of the strongest transfer method are ≈10 points on intent detection (in All-domain experiments) and 15 points on slot labelling, using exactly the same underlying models. Moreover, the gaps are even more pronounced for some languages (e.g., Kiswahili as the lowest-resource language) and in domain-specific setups (e.g., in In-domain setups).
In the E2E task, the results in Figure 3c also reveal a chasm between mT5 performance on English and the other four languages, especially so without any target-language adaptation. The gap, while still present, gets substantially reduced with the +FF model variant (see §4.1). This disparity emphasises the key importance of (i) continuous development of multilingual benchmarks inclusive of less-resourced languages to provide realistic estimates of performance on multilingual TOD, as well as (ii) creation of (indispensable) in-domain data for few-shot target language adaptation.
Overall, these findings suggest the challenging nature of the COD dataset, and also call for further research on data-efficient and effective transfer methods in multilingual TOD.
In-Domain vs. Cross-Domain Evaluation. COD not only enables cross-lingual transfer but is also the first multilingual dialogue dataset suitable for testing models in cross-domain settings: we summarise the results of this evaluation in Table 11.
The key finding is that in-domain performance is much higher than cross-domain, although both have large room for improvement.

Related Work
Although a number of NLU resources have recently emerged in languages other than English, the availability of high-quality, multi-domain data to support multilingual TOD is still inconsistent (Razumovskaia et al., 2021). Translation of English data has been the predominant method for generating examples in other languages. The ATIS corpus (Hemphill et al., 1990) has been particularly widely translated, boasting translations into Chinese (He et al., 2013), Vietnamese (Dao et al., 2021), Spanish, German, Indonesian, and Turkish, among others (Susanto and Lu, 2017; Upadhyay et al., 2018; Xu et al., 2020). Bottom-up collection of TOD data directly in the target language has been the less popular choice, giving rise to monolingual datasets in French (Bonneau-Maynard et al., 2005) and Chinese (Zhang et al., 2017; Gong et al., 2019).
Thus far, the focus of existing benchmarks has been predominantly either on monolingual multi-domain (Hakkani-Tür et al., 2016; Liu et al., 2019; Larson et al., 2019) or multilingual single-domain evaluation (Xu et al., 2020), rather than balancing diversity along both these dimensions. Moreover, the current multilingual datasets are mostly constrained to the two NLU tasks of intent detection and slot labelling (Li et al., 2021; van der Goot et al., 2021), and do not enable evaluations of E2E TOD systems in multilingual setups. In order to adequately assess the strengths and generalisability of NLU as well as DST and E2E models, they should be tested both on multiple languages and multiple domains, a goal pursued in this work.

Conclusion and Outlook
In this work we have presented and validated a 'bottom-up' method for the creation of multilingual task-oriented dialogue (TOD) datasets. The key idea is to map domain-specific, language-independent dialogue schemata into natural language outlines, which in turn guide human annotators in each target language to create natural language utterances, both on the system and on the user side. We have empirically demonstrated that the proposed outline-based approach yields more natural and culturally better adapted dialogues than the standard translation-based approach to multilingual TOD data creation. Moreover, we have shown that the standard translation-based approaches often yield over-inflated and unrealistic performance estimates in multilingual setups, while this issue is removed with the outline-based generation pipeline.
We have also presented a new Cross-lingual Outline-based Dialogue dataset (termed COD), created via the proposed outline-based approach. The dataset covers 5 typologically diverse languages and 11 domains in total, and enables evaluations in standard NLU, DST, and end-to-end TOD tasks; this way, COD makes an important step towards challenging multilingual and multi-domain TOD evaluation in future research. We have also evaluated a series of state-of-the-art models for the different TOD tasks, setting baseline reference points and revealing the challenging nature of the dataset, with ample room for improvement.
We hope that our work will inspire future research across multiple aspects. Besides its direct potential to serve as a more challenging testbed for current and future multilingual TOD models, our work provides useful practices and insights to steer and guide similar (potentially larger-scale) data creation efforts in TOD for other, especially lower-resource, languages and domains.

A Dialogue Generation Guidelines
Imagine having a conversation with a virtual or telephone assistant, where you want to complete a specific task.For example, you feel like going to a concert and would like to find out if there are any in your area, or would like to travel and need to book a flight.
In this task, we ask you to take on both roles, the user and the assistant: what would a helpful assistant reply to your query? Try and imagine an actual conversation you might have with an employee of a hotel or an airline, or at a tourist information office; the aim is to write down natural conversations that could take place between two language_name speakers.

As a user, you will need to provide all the information that the assistant might need to carry out the task for you. You can be casual, like with someone you know and would address directly. As an assistant, you will provide information about flights, events, music, movies, or make suggestions that may interest the user.
In this task, we will provide you with brief instructions and the types of information that the conversation between the user and the assistant should contain. However, to make the dialogues more natural to (hypothetical) language_name users, we encourage you to replace proper names which relate to English song titles, films, airline companies, cities, etc., with equivalents in language_name. You have complete freedom to make the replacements as you feel appropriate, as long as they are consistent within a single dialogue. See examples in Table 1.
It is likely that some concepts found in the English-language outlines do not exist in your culture or are unfamiliar to language_name speakers. Feel free to omit or creatively change these cases, so that the dialogues are fully understandable to language_name speakers.

Table 12: Examples of unnatural linguistic choices in translations vs. outline-based generated sentences: ♠ for choices of lexical cognates closer to the source language; ♣ for syntactic calques from the source language.

C Additional Sentence Similarity Scores
We also show additional (cosine) similarity scores between sentences generated via different data creation approaches (see §3 in the main paper) in Table 13 below.

F Leveraging SGD Schemata in NLU?
Since the English SGD dataset (Shah et al., 2018; Rastogi et al., 2020) served as the starting point for COD, we have access to its metadata (termed schemata): short descriptions of domains, intents, and slots released with SGD, provided in the English language. Leveraging such schemata was proven useful for boosting NLU results in monolingual English-only scenarios (Rastogi et al., 2020). We thus evaluate whether incorporation of such schemata into the NLU models may positively impact their performance also in cross-lingual setups.
For the intent detection task, we use domain and intent descriptions as the schema. Schemata are encoded with the multilingual pretrained model (mBERT) and are not fine-tuned during training, following the setup of Rastogi et al. (2020). To ensure comparability with the results without schemata, we use only the user utterance as input into the intent classification model. At inference, we follow the process described by Cao and Zhang (2021), where the schema for every intent is passed into the model together with the user utterance, and the probability of the corresponding intent is recorded. If there is no intent with probability >0.5, the NONE intent is predicted. This is slightly different from our standard setup without the schema, where NONE is an additional intent class.
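A sketch of this schema-based inference procedure; `toy_score` is a hypothetical stand-in for the mBERT-based scorer, and we assume the highest-scoring intent above the threshold is returned:

```python
def predict_intent_with_schema(utterance, intent_schemas, score, threshold=0.5):
    """Schema-based intent inference sketch: score the utterance against
    each intent's schema description; if no intent clears the threshold,
    predict NONE."""
    best_intent, best_prob = "NONE", threshold
    for intent, description in intent_schemas.items():
        prob = score(utterance, description)
        if prob > best_prob:
            best_intent, best_prob = intent, prob
    return best_intent

# Hypothetical scorer based on word overlap; the paper encodes schemata
# with mBERT instead.
def toy_score(utterance, description):
    shared = set(utterance.lower().split()) & set(description.lower().split())
    return min(1.0, 0.4 * len(shared))

intent = predict_intent_with_schema(
    "find me a flight", {"FindFlights": "search for a flight"}, toy_score)
```
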
The results in Table 15 show that the use of schemata in cross-lingual settings does not provide performance boosts for intent prediction; on the contrary, we note a performance drop across the board. This could be a consequence of the increased number of trainable parameters due to the incorporation of schema embeddings into the model, which might also result in overfitting to the English training data.

Figure 1: Average scores for each quality survey question (see Table 6) assigned to dialogue examples generated via translation versus outline-based generation.

Figure 3: Per-language results over all domains. (a) and (b) share the model labels on the y-axis.

Table 2: Language statistics. The last two columns denote the number of speakers. † Standard Arabic is learned as an L2.

Table 3: Comparison of diversity indices of multilingual dialogue datasets in terms of typology, family, and macroareas. For a description of the three diversity measures, we refer the reader to Ponti et al. (2020).

Table 5: Examples of dialogue generation outlines created from SGD schemata, that is, annotations of dialogue acts, intents, slots, and values, with intent-specific rewrites in bold.

Table 11: Baseline results for NLU and E2E on the COD test set, averaged over all 4 target languages, in three setups (In-domain, Cross-domain, or All domains). Per-language results are in Appendix E.
B Translation-Based versus Outline-Based Generation: Additional Examples

English source: "Do you want to schedule a visit to check out the property?" ♠: use of the verb "планировать" [a calque from to plan] instead of other, better-suited options (e.g., "хотели бы" [would like to]).

Dialogue 12_00089 (English source: "Which is your preferred day of travel?"): translation "На какой день планируете вылет?" vs. outline-based "В какой день вы бы хотели полететь?".

Table 15: Results for schema-based intent prediction with the mBERT-based model.