Abstract
Creating high-quality annotated data for task-oriented dialog (ToD) is known to be notoriously difficult, and the challenges are amplified when the goal is to create equitable, culturally adapted, and large-scale ToD datasets for multiple languages. Therefore, the current datasets are still very scarce and suffer from limitations such as translation-based non-native dialogs with translation artefacts, small scale, or lack of cultural adaptation, among others. In this work, we first take stock of the current landscape of multilingual ToD datasets, offering a systematic overview of their properties and limitations. Aiming to reduce all the detected limitations, we then introduce Multi3WOZ, a novel multilingual, multi-domain, multi-parallel ToD dataset. It is large-scale and offers culturally adapted dialogs in 4 languages to enable training and evaluation of multilingual and cross-lingual ToD systems. We describe a complex bottom–up data collection process that yielded the final dataset, and offer the first sets of baseline scores across different ToD-related tasks for future reference, also highlighting its challenging nature.
1 Introduction and Motivation
Task-oriented dialog (ToD), where a human user engages in a conversation with a system agent with the aim of completing a concrete task, is one of the central objectives, hallmarks, and applications of machine intelligence (Gupta et al., 2006; Tür et al., 2010; Young, 2010, inter alia). ToD technology has proven useful across a wide spectrum of application sectors such as the hospitality industry (Henderson et al., 2014, 2019), healthcare (Laranjo et al., 2018), online shopping (Yan et al., 2017), banking (Altinok, 2018), and travel (Raux et al., 2005; El Asri et al., 2017), among others.
Wider developments in ToD have been hampered by two conflicting requirements: 1) large-scale in-domain datasets are crucially required in order to unlock the potential of deep learning-based ToD components and systems to handle complex dialog patterns (Budzianowski et al., 2018; Lin et al., 2021b); at the same time, 2) data collection for ToD is notoriously difficult as it is extremely time-consuming, expensive, and requires expert and domain knowledge (Shah et al., 2018; Larson and Leach, 2022). Put simply, the creation of ToD datasets for new domains and languages incurs significantly higher time and budget costs than for most other NLP tasks (Casanueva et al., 2022). Consequently, until recently, progress in ToD has been limited to a small number of high-resource languages such as English and Chinese (Razumovskaia et al., 2022).
Recent work has recognized the need to expand the reach of multilingual ToD technology to more languages via collecting multilingual ToD data (Razumovskaia et al., 2022). Yet, as discussed in more detail later in §2, all the currently available multilingual ToD datasets suffer from one or several serious limitations: (i) the predominant reliance on translation-based data creation, which introduces issues with ‘translationese’ and artificial performance inflation (Xu et al., 2020; Zuo et al., 2021); (ii) lack of cultural adaptation, which results in artificial dialogs that are neither localized nor adapted to real-world data and to the cultural specificities of each target language and culture; (iii) small scale and lack of sufficient training data, which prevents truly equitable multilingual development and in-depth comparative cross-language analyses (Ding et al., 2022; Hung et al., 2022); (iv) lack of coherent and multi-parallel dialogs in all the represented languages, which are typically not created and corrected by native speakers, hindering meaningful cross-language comparisons and analyses (Ding et al., 2022); (v) focus on a single component of a full ToD system, typically Natural Language Understanding (NLU), which prevents training and evaluation of other crucial tasks such as Dialog State Tracking (DST) or Natural Language Generation (NLG) in multilingual and transfer setups.
In this work, we address all the aforementioned limitations of current multilingual ToD datasets and present a large-scale data collection process that resulted in a novel large-scale multilingual dataset for ToD: Multi3WOZ. The departure point of our data collection is the established multi-domain English MultiWOZ dataset (Budzianowski et al., 2018), in particular its cleaned version 2.3 (Han et al., 2021). Multi3WOZ is then created by adapting the recent bottom–up outline-based approach of Majewska et al. (2023), which bypasses (the issues of) the translation-based design and distinguishes between language-agnostic abstract dialog schemata (i.e., outlines) and adapted, language-specific surface realizations of the underlying schemata (i.e., the actual user and system utterances). We validate the usefulness of the outline-based approach to multilingual ToD data creation for the first time at a large scale, demonstrating its feasibility for such endeavors: the dataset contains a total of 494,116 dialog turns created manually by human subjects.
Guided by the need to tackle the present limitations, Multi3WOZ is the first multilingual ToD dataset with the following crucial properties; see also Table 1 for an overview. First, Multi3WOZ is large-scale, with the same number of training (7,440 dialogs per language), development (860), and test (860) dialogs offered in 4 different languages: English, Arabic, French, and Turkish. It is more versatile than all prior multilingual ToD datasets as it allows for training and evaluation in monolingual, multilingual, and cross-lingual setups, and in zero-shot, few-shot, and ‘many-shot’ cross-lingual and cross-domain transfer scenarios. Second, Multi3WOZ offers multi-parallel dialogs, conveying comparable information over exactly the same conversational flows across all four languages. This property allows for cross-language studies and comparative analyses. Third, Multi3WOZ enables both (monolingual and multilingual) training and evaluation over different constituent ToD tasks such as NLU (intent detection and slot filling), DST, and NLG, as well as full-fledged end-to-end (E2E) learning. Fourth, Multi3WOZ is localized and culturally adapted to actual existing entities from the cultures in which the target languages are spoken. Finally, created in a bottom-up fashion by native speakers of the target languages, and hence linguistically adapted to each target language, it offers natural and native dialogs in all target languages, avoiding ‘translationese’ and preventing over-inflation of transfer performance (Majewska et al., 2023).
Furthermore, to guide future research, we set reference scores across different ToD tasks in all the languages of Multi3WOZ, running a representative set of standard baselines in each relevant ToD task. The results clearly indicate the challenging nature of the dataset; we also outline the differences in performance across different languages.
2 Multi3WOZ versus Limitations of Current Multilingual ToD Datasets
We now delve deeper into the main benefits of Multi3WOZ, characterizing how its key properties make it a unique ToD resource. The summary and statistics of the most relevant prior work are provided in Table 1. Building upon this table, we discuss those datasets along with other related work in what follows, focusing on the five desirable properties of Multi3WOZ and how these counteract the main limitations detected in other datasets.
P1. Supporting Multiple Languages and ToD Tasks.
There has been a surge of interest in the creation of multilingual ToD datasets, aiming to mitigate the language resource gap in multilingual NLP (Ponti et al., 2019; Joshi et al., 2020b). Despite this effort, the gap is still much more pronounced for dialog tasks and data than for some other NLP tasks such as NLI (Conneau et al., 2018; Ebrahimi et al., 2022) or NER (Adelani et al., 2021), also due to the increased time demands and cost of dialog annotation.1 Further, the majority of multilingual ToD datasets have focused only on two standard NLU tasks (i.e., intent detection and slot labeling), again due to the high cost and specific challenges posed by collecting full dialog data (Budzianowski et al., 2018). The first wave of such NLU datasets was built upon the single-domain English ATIS dataset (Hemphill et al., 1990), extending it to 10 languages via human translation (Upadhyay et al., 2018; Xu et al., 2020; Dao et al., 2021). More recent NLU datasets cover multiple domains and a wider span of linguistic typology and geography (Schuster et al., 2019; FitzGerald et al., 2022; Moghe et al., 2023; Majewska et al., 2023). However, current NLU datasets (i) still support only the two NLU tasks, and (ii) provide utterances ‘in isolation’ (i.e., out of the context of the full dialog, which facilitates their multilingual construction). Further, (iii) some datasets do not provide any training data and are useful only for evaluation of (zero-shot) cross-lingual transfer; and (iv) all the datasets except that of Majewska et al. (2023) and the concurrent work of Goel et al. (2023) were constructed via translation from the source English datasets.
Monolingual ‘end-to-end’ ToD datasets, which support NLU as well as other ToD tasks (i.e., modeling and evaluation of the full ToD pipeline), have been created only for particular high-resource languages. MultiWOZ (Budzianowski et al., 2018) and Taskmaster (Byrne et al., 2019) are two large-scale multi-domain English datasets spanning 7 and 6 domains, respectively, containing both single-domain and multi-domain dialogs. Inspired by MultiWOZ, the monolingual RisaWOZ (Quan et al., 2020) and CrossWOZ (Zhu et al., 2020) datasets have been created for Chinese. Crucially, multilingual multi-domain ToD datasets that support full ToD modeling are still scarce (see Table 1), and they all come with some core limitations, as discussed next.
P2. Avoiding Translation-Based Design.
The majority of datasets have been obtained via manual or semi-automatic translation (e.g., via post-editing MT output [PEMT]) of an English source dataset (Zuo et al., 2021; Ding et al., 2022; Hung et al., 2022). The translation-based approach is cost-efficient and naturally yields data comparable across languages, but it results in (i) undesired ‘translationese’ effects (Artetxe et al., 2020), (ii) a lack of dialog naturalness (Ding et al., 2022), and (iii) typically overinflated and thus misleading performance estimates for ToD systems. For instance, Majewska et al. (2023) empirically validate that cross-lingual transfer performance substantially increases when exactly the same dialogs are obtained via automatic or manual translation rather than via a bottom-up approach relying on native speakers of the target languages.
Unlike prior work (i.e., all datasets from Table 1 except BiToD), the honed outline-based construction of Multi3WOZ (see §3 later) avoids all the negative implications of translation, while maintaining cost efficiency (and thus enabling its large scale), supporting cultural adaptation, and enabling coherence and multi-parallelism.
P3. Dataset Scale and Large-Scale Training.
Multi3WOZ offers a substantially larger number of dialogs for training than any previous multilingual ‘full ToD’ dataset, and it treats the four supported languages in an equitable way: i.e., it provides the same set of manually (bottom-up) constructed dialogs for training, development, and testing in each language. Previous work (Multi2WOZ, AllWOZ, GlobalWOZ) targeted the creation of test data only, for evaluating cross-lingual transfer scenarios. These datasets come (i) without providing any training data at all (Multi2WOZ),2 (ii) with a very small set of post-edited MT-obtained dialogs (AllWOZ), or (iii) with automatically created MT-based training data only (GlobalWOZ). The only exception is BiToD (Lin et al., 2021b), but it spans only two of the highest-resourced languages, covers a smaller number of domains, and has approximately three times less training data than Multi3WOZ. For instance, Multi3WOZ contains almost 124,000 turns per represented language (∼98,000/12,500/12,500 for train/dev/test), for a total of 494,116 turns; for comparison, the total number of turns in BiToD is 115,638, while it is 143,048 in the original English-only MultiWOZ.
P4. (Improved) Cultural Adaptation.
Many datasets for multilingual NLP ignore the fact that the data should also be adapted to the target cultures and concepts (Ponti et al., 2020; Hershcovich et al., 2022). Besides (i) propagating the source language’s bias towards particular conversational concepts (e.g., the US-tied concept of tailgating or conversations about baseball) (Ponti et al., 2020), the lack of so-called cultural adaptation also (ii) creates peculiar or unlikely conversational contexts (e.g., a user speaking to a Turkish ToD system about restaurants in Cambridge) (Ding et al., 2022), or (iii) even ignores specificities of a particular culture (e.g., postcodes are not used in Arabic-speaking countries). The only two datasets that try to incorporate the notion of cultural adaptation into their design are BiToD and GlobalWOZ (see Table 1). However, BiToD’s adaptation is based on a very specific bilingual region of the world (Hong Kong), while GlobalWOZ’s automatic cultural adaptation approach results in a large number of incoherent dialogs and annotation errors; e.g., see Figure 1. We thus adopt a new and improved cultural adaptation approach that ensures high-quality, coherent, and multi-parallel dialogs across languages while respecting the underlying cultural traits; see §3 later.
P5. Dialog Coherence and ‘Multi-Parallelism’.
Finally, due to their design properties and oversimplifying assumptions, some datasets break the coherence and multi-parallelism of dialogs. GlobalWOZ, while performing a form of cultural adaptation, (i) creates erroneous slot-value annotations that are inconsistent with the dialog ontology and database of the particular language, and (ii) even induces inconsistent annotations within an individual dialog. Another problem with GlobalWOZ is that the authors select a subset of 500 test set dialogs for human PEMT work based on a simple heuristic: they opt for dialogs for which the sum of corpus-level frequencies of their constituent 4-grams, normalized by dialog length, is the largest. This selection, unmotivated in the original paper and performed independently for each language, entails that different portions of the original English MultiWOZ are included in the final language-specific test sets. This design choice, besides (i) artificially decreasing the linguistic diversity of dialogs chosen for the test set in each language,3 also (ii) breaks the desired multi-parallel nature of the test set. As a consequence, GlobalWOZ overestimates downstream ToD performance for target languages, and cannot be used for any direct comparison of ToD task performance across different languages, since test sets per language contain different dialogs, as also pointed out by Hung et al. (2022).
Multi3WOZ is the only dataset which performs cultural adaptation and avoids confounding factors such as GlobalWOZ’s selection heuristics, while maintaining the desired properties of dialog coherence and multi-parallelism.
3 Multi3WOZ
Multi3WOZ comprises linguistically and culturally adapted task-oriented dialogs in four languages: Arabic (ara; Afro-Asiatic), English (eng; Indo-European), French (fra; Indo-European), and Turkish (tur; Turkic). A total of 27,480 (3 × 9,160) dialogs was collected for ara, fra, and tur, while the dataset also includes a subset of 9,160 normalized and corrected MultiWOZ v2.3 dialogs.4
In what follows, we describe its creation, as depicted in Figure 2. Our approach involves three key steps: (i) normalizing annotations from the original MultiWOZ v2.3 with canonical values; (ii) cultural adaptation by contextualizing dialogs to entities from the relevant cultures; and (iii) collecting linguistically adapted dialogs from target language native speakers using a bottom–up outline-based method.
Preliminaries and Notation.
In ToD, the domains of a dataset (e.g., MultiWOZ) and the systems built upon it are typically defined by an ontology, which provides a structured representation of an underlying database. The ontology specifies slots that encompass all entity attributes and their corresponding values (Budzianowski et al., 2018). Multi3WOZ is designed to be fully compatible with the original English MultiWOZ’s ontology and data format, but now with culturally adapted database entries (see Figure 2).
Multi3WOZ contains four multi-parallel sets of dialogs, namely $\mathcal{D}^{eng}$, $\mathcal{D}^{ara}$, $\mathcal{D}^{fra}$, and $\mathcal{D}^{tur}$, along with their corresponding culture-specific databases denoted as $DB^{eng}$, $DB^{ara}$, $DB^{fra}$, and $DB^{tur}$.5 Each database entry, $e \in DB$, contains a set of slot-value pairs, such that $e = \{(s_1, v_1), \ldots, (s_m, v_m)\}$.6 Each dialog in the dataset is represented as a list of natural language utterances, with alternating turns between the user and system initiated by the user. Each turn is annotated with its corresponding sentence-level meaning representation. Namely, for $D \in \mathcal{D}$, $D = [(u_1, a_1), \ldots, (u_j, a_j)]$, where $u$ is a surface form (user or system) utterance; $a$ is a dialog act representation; $j$ is the length of the dialog $D$.
A dialog act $a$ is then defined as a set of tuples $a = \{(d_1, i_1, s_1, v_1), \ldots, (d_k, i_k, s_k, v_k)\}$, where each tuple consists of a domain $d$, intent $i$, slot $s$, and slot value $v$.
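To make this notation concrete, the following minimal Python sketch (with illustrative class and field names of our own choosing) shows how a Multi3WOZ dialog and its turn-level dialog acts can be represented:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen => hashable, so tuples can live in a set
class ActTuple:
    domain: str  # d, e.g., "restaurant"
    intent: str  # i, e.g., "inform"
    slot: str    # s, e.g., "book time"
    value: str   # v, a canonical slot value, e.g., "19:45"

@dataclass
class Turn:
    utterance: str         # u: surface-form user or system utterance
    dialog_act: frozenset  # a: set of (d, i, s, v) tuples

@dataclass
class Dialog:
    language: str  # one of "eng", "ara", "fra", "tur"
    turns: list    # [(u_1, a_1), ..., (u_j, a_j)], user-initiated

# Example turn: "There will be 5 of us and 19:45 would be great."
turn = Turn(
    utterance="There will be 5 of us and 19:45 would be great.",
    dialog_act=frozenset({
        ActTuple("restaurant", "inform", "book people", "5"),
        ActTuple("restaurant", "inform", "book time", "19:45"),
    }),
)
```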
Slot-Value Normalization.
In the English MultiWOZ dataset, slot values are annotated as text spans within the corresponding utterances. This annotation scheme allows for more flexible and natural language expressions of the canonical value $v_{truth}$ described in the ontology and database (e.g., 13:00), resulting in various surface forms $v^{(1)}, \ldots, v^{(l)}$ (e.g., 1 pm, 1:00 pm, one). However, this flexibility can create a discrepancy between the expected canonical value required by the backend API and the predicted value by the model.7
Moreover, the absence of a 1-to-1 mapping between the canonical values in the database and the annotations in MultiWOZ, coupled with erroneous or misspelled entries, hinders the consistent and systematic adaptation of culture-dependent entities to the target language. To address this, we manually created a normalization dictionary and assigned canonical values to all slot values across the English MultiWOZ dataset. For example, the normalization dictionary for the restaurant-name slot maps 544 distinct surface forms to 110 canonical names. These canonical names correspond exactly to the entities in the English restaurant domain’s database, enabling a one-to-one mapping between the entities described in dialogs and those in the database. Besides facilitating cultural adaptation through the creation of surface-form-agnostic outlines, we believe that this time-consuming yet crucial normalization process will also enable consistent evaluations of models built on Multi3WOZ. Henceforth, any mention of a slot value v assumes that it is in its canonical form.8
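As a minimal illustration of this normalization step (the dictionary entries below are hypothetical; the actual dictionaries were curated manually per slot), the lookup reduces to:

```python
# Hypothetical excerpt of a normalization dictionary: many observed
# surface forms map to one canonical value per slot.
NORMALIZATION = {
    "restaurant-name": {
        "gandhi": "the gandhi",
        "the ghandi": "the gandhi",  # misspelled annotation
    },
    "restaurant-book time": {
        "1 pm": "13:00",
        "1:00 pm": "13:00",
        "one": "13:00",
    },
}

def normalize(slot: str, surface_value: str) -> str:
    """Map an annotated surface form to its canonical value; unknown
    values are returned unchanged so they can be flagged for review."""
    table = NORMALIZATION.get(slot, {})
    return table.get(surface_value.strip().lower(), surface_value)

assert normalize("restaurant-book time", "1 pm") == "13:00"
```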
Cultural Adaptation.
While English MultiWOZ contains only dialogs describing entities in the Cambridge (UK) area, Multi3WOZ expands the scope to three additional languages, targeting three cities where the target languages are considered native: Dubai for Arabic, Paris for French, and Ankara for Turkish.9 To ensure that our dataset respects and reflects the cultural traits pertaining to each target city and language, we propose a systematic approach to cultural adaptation, which ensures dialog coherence and multi-parallelism across all languages and includes the following steps: (1) slot-value localization/redistribution with cultural awareness; (2) controlled entity replacement with one-to-one entity mappings; and (3) slot-value randomization to avoid verbatim memorization.
We perform slot-value redistribution to adjust the original slots and values to align with the target ‘culture’. These modifications are based on feedback from native speakers of the target language with expertise in the corresponding cultural context. To better fit the target culture, we remove eng-specific slots and values that are irrelevant to it. For example, we remove the postcode slot in the Arabic dataset due to its limited relevance in the associated culture.10
The main objective of our proposed cultural adaptation method is to perform controlled entity replacement using a 1-to-1 entity mapping. As a prerequisite, we first construct a localized database (e.g., $DB^{ara}$ for Arabic) for each target language. This database aims to reflect real-world entities and properties, and it has been constructed by human participants in our project, native speakers of the target languages, who referred to a variety of public knowledge sources on the Internet, including the Google Places API and the TripAdvisor API.11
In order to construct such a 1-to-1 mapping, an English entity $e^{eng} \in DB^{eng}$ and a target entity (e.g., $e^{ara} \in DB^{ara}$) can be mapped to each other only if all categorical slot values attributed to each entity are identical.12 Namely, for every categorical slot $s$, the following condition holds: $v^{eng}_s = v^{ara}_s$. This strategy guarantees a distribution of entities with respect to each categorical property that is consistent with MultiWOZ. It further facilitates the coherent and multi-parallel creation of dialogs, particularly when the user requests a certain property of a desired entity as the dialog progresses (e.g., ‘an expensive restaurant’). This stands in contrast to the random-sampling cultural adaptation solution of GlobalWOZ, which frequently returns mismatched entities in response to the user request and often results in dialog incoherence.
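A sketch of this matching condition, assuming entities are stored as slot-value dictionaries and categorical slots are declared per domain in the ontology (names below are illustrative):

```python
# Hypothetical declaration of categorical slots per domain.
CATEGORICAL_SLOTS = {"restaurant": {"price range", "area", "food"}}

def can_map(e_eng: dict, e_tgt: dict, domain: str) -> bool:
    """Two entities may be mapped 1-to-1 only if every categorical
    slot value is identical (v_eng_s == v_tgt_s for all categorical s)."""
    return all(e_eng.get(s) == e_tgt.get(s) for s in CATEGORICAL_SLOTS[domain])

def build_mapping(db_eng: list, db_tgt: list, domain: str) -> dict:
    """Greedily pair each English entity with an unused compatible
    target entity; returns {english name -> target name}."""
    mapping, used = {}, set()
    for e in db_eng:
        for i, t in enumerate(db_tgt):
            if i not in used and can_map(e, t, domain):
                mapping[e["name"]] = t["name"]
                used.add(i)
                break
    return mapping

db_eng = [{"name": "the gandhi", "price range": "cheap",
           "area": "centre", "food": "indian"}]
db_ara = [{"name": "bait al mandi", "price range": "cheap",
           "area": "centre", "food": "indian"}]
print(build_mapping(db_eng, db_ara, "restaurant"))
# {'the gandhi': 'bait al mandi'}
```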
The original MultiWOZ contains a substantial number of randomized slot values, such as time, reference, and taxi-phone. To prevent verbatim memorization and undesired data artefacts, we perform slot-value randomization independently in each target dialog subset of Multi3WOZ. For time-related slot values, we apply randomization by adding a random offset, drawn uniformly from [−1, 1] hours, to the original value, as also illustrated in Figure 2. We ensure that all time-related slots (e.g., leaving time and arriving time) in a dialog are shifted by the same randomized offset. For reference numbers, we employ a randomly generated 1-to-1 reference mapping. Regarding taxi-phone values, we first follow the target culture’s specific phone number pattern and then apply a randomly generated 1-to-1 phone mapping. In general, this procedure mitigates the risk of exploiting annotation artifacts and consequent overfitting when conducting cross-lingual transfer learning experiments.
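A minimal sketch of the time randomization, assuming times are stored as HH:MM strings; one offset is drawn per dialog so that related slots (e.g., leaving and arriving time) shift together:

```python
import random

def shift_time(hhmm: str, offset_minutes: int) -> str:
    """Shift an HH:MM time by a signed offset, wrapping around midnight."""
    h, m = map(int, hhmm.split(":"))
    total = (h * 60 + m + offset_minutes) % (24 * 60)
    return f"{total // 60:02d}:{total % 60:02d}"

def randomize_dialog_times(time_values: dict) -> dict:
    """Apply one random offset, drawn uniformly from [-60, +60] minutes,
    to every time-related slot value in a dialog."""
    offset = random.randint(-60, 60)
    return {slot: shift_time(v, offset) for slot, v in time_values.items()}

# Leaving and arriving times shift together, preserving their gap.
print(randomize_dialog_times({"leave at": "09:15", "arrive by": "09:45"}))
```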
Outline-Based Dialog Generation.
By adopting the outline-based dialog generation process, we simultaneously enable cultural adaptation, eliminate the impact of syntactic and lexical grounding in the source language (i.e., the so-called “translation artifacts”), and keep the annotation protocol feasible (Majewska et al., 2023). The outline-based method can be decomposed into two steps: outline creation (i.e., creating dialog schemata) and dialog writing (i.e., creating the actual surface realizations, utterances, from the dialog schemata).
Following Majewska et al. (2023), outline creation involves creating minimal but comprehensive instructions for the so-called dialog creators (termed DCs henceforth) to generate dialogs that fully convey specific intents and slots while avoiding the imposition of predefined syntactic structures or linguistic expressions. As depicted in Figure 2, we convert a culturally adapted (termed CA-ed henceforth) dialog act (e.g., using ara as an example language, $a^{ara}$) into a human-interpretable outline based on a set of manually defined templates, where different sets of templates are used for the user and system utterances. Given a tuple $(d, i, s, v^{ara}) \in a^{ara}$, we transform a domain-intent pair d-i into a natural language instruction, e.g., Restaurant-Inform ⇒ “Express your intent to search for a restaurant with the following properties:”. In addition, the slot s is mapped to a predefined natural language description, and it is presented along with the CA-ed slot value $v^{ara}$ (e.g., booking time = 18:45). As illustrated in Figure 2, in cases where there are multiple tuples with the same pair d-i, we group them together and present them within a “card”. We note that a target language utterance (e.g., $u^{ara}$) can be constructed based on multiple cards, with each card corresponding to a unique domain-intent pair d-i.13 Additionally, each card may contain multiple slot-value pairs, where each slot value is shown as a CA-ed value (e.g., $v^{ara}$). To take full advantage of our outline-based framework, we have developed a Web-based annotation toolkit along with detailed annotation guidelines; the latter are made publicly available.
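The act-to-outline conversion can be sketched as a simple template lookup; the template strings and slot descriptions below are illustrative and not the exact wording used by our annotation toolkit:

```python
from collections import defaultdict

# Hypothetical templates and slot descriptions.
INTENT_TEMPLATES = {
    ("restaurant", "inform"): "Express your intent to search for a "
                              "restaurant with the following properties:",
}
SLOT_DESCRIPTIONS = {"book time": "booking time", "food": "type of cuisine"}

def act_to_cards(dialog_act: set) -> list:
    """Group (d, i, s, v) tuples by domain-intent pair and render one
    human-readable outline card per pair."""
    grouped = defaultdict(list)
    for d, i, s, v in dialog_act:
        grouped[(d, i)].append((s, v))
    cards = []
    for (d, i), slot_values in grouped.items():
        lines = [INTENT_TEMPLATES[(d, i)]]
        lines += [f"- {SLOT_DESCRIPTIONS.get(s, s)} = {v}"
                  for s, v in slot_values]
        cards.append("\n".join(lines))
    return cards

print(act_to_cards({("restaurant", "inform", "book time", "18:45"),
                    ("restaurant", "inform", "food", "turkish")})[0])
```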
Dialog writing is then carried out by bilingual speakers acting as DCs. They are (i) native in the target language and (ii) fluent in English: following the results of our pilots, we opted to keep the English templates, as this facilitated quality control of templates and cards while having no detrimental effect on the quality of the final target language dialogs. The DCs were instructed to write natural-sounding exchanges in their native language between a hypothetical user and an assistant, based on the outlines derived from the CA-ed dialog act (e.g., $a^{ara}$) and a set of user goals that the hypothetical user wants to achieve (e.g., You are looking for a place to stay.). For each utterance u from the source eng dataset, the tasks of the DCs were then as follows: 1) writing a native dialog utterance from the card(s) that covers all the slot values from the cards; 2) annotating character-level span indices for each slot value $v^{ara}$; 3) indicating with a binary flag, for each domain-intent pair d-i, whether this dialog act retains the coherence of the full dialog, this way also signaling and capturing errors still present in the English MultiWOZ v2.3 dataset.
Duration, Cost, Dialog Creators, Quality Control.
The logistically and technically complex data collection process spanned 14 months, starting in January 2022. The full cost of data collection was ∼$64,500, equally distributed across the three target languages. The recruited DCs are (i) professional translators and (ii) college students, recruited via the ProZ platform (www.proz.com) or from universities worldwide. A total of 133 native Arabic speakers, 112 native French speakers, and 75 native Turkish speakers contributed to the dataset.
We applied a number of quality control mechanisms throughout the data collection process. First, to ensure that the DCs had fully understood the instructions and all (sub)tasks, they were required to complete a qualification round before creating any actually deployed data. Second, our annotation platform features a real-time automatic check for all submissions, providing feedback and highlighting issues for the collected dialogs. Finally, we also ran two rounds of post-collection dialog editing: we invited a carefully selected small group of dialog creators, who had consistently produced exceptionally high-quality dialogs, to review and, if necessary, edit all the dialogs in the validation and test sets of all three target languages.
Ethical and Responsible Data Creation and Use.
Following the principles from Rogers et al. (2021), the project has placed a high priority on ethical and responsible data creation and use. It underwent the full Ethics Approval process at the University of Cambridge, and we describe other ethics-related aspects here.
Terms of Use: Multi3WOZ is released under the same MIT License as the original MultiWOZ.
Privacy: To comply with the EU General Data Protection Regulation (GDPR), we have acted as a data controller and collected the minimum of personal data required for this project. All participants provided informed consent by signing a Participant Consent Form before any data collection occurred. To adhere to the principle of data minimization, we collected only the participants’ email addresses as individually identifiable information for the sole purpose of processing payments. Our dataset consists solely of hypothetical dialogs in which the domains and content have been restricted and predefined, minimizing the risk of personal data being present in Multi3WOZ.
Compensation: The DCs were compensated based on the number of dialogs they contributed to the dataset, with a payment rate of approximately $12/h. As stated in our consent form, they were able to withdraw from the study at any time.
Data Structure and Statistics.
Figure 3 presents an example of multi-parallel dialogs from Multi3WOZ. All dialogs in Multi3WOZ consist of parallel surface-form utterances in multiple languages and retain the same annotations as the original MultiWOZ. Precisely, each dialog is annotated with a CA-ed user goal, and each utterance u in the dialog is annotated with a CA-ed dialog act and a CA-ed dialog state. In addition, Multi3WOZ offers (i) annotations of character-level textual spans for all the slot values in the dialog act, to steer span extraction-based solutions to slot labeling (Joshi et al., 2020a), and (ii) a binary coherence indicator. The dataset is released in three standard formats: (i) json files following the structure of MultiWOZ (Budzianowski et al., 2018); (ii) a format compatible with the Huggingface repository (Wolf et al., 2020; Lhoest et al., 2021); and (iii) a ConvLab-3-compatible format (Zhu et al., 2022).
Multi3WOZ’s language-independent features, e.g., the frequency of dialog acts and average dialog length, closely resemble those of the original MultiWOZ; we thus focus on the statistics pertaining to language and cultural adaptation. Figure 4 presents the distribution of the number of tokens per turn, with white space as the token delimiter. Note that each language exhibits variance in its morphosyntactic properties (e.g., Turkish is an agglutinative language), which naturally impacts the expected utterance length. Further, we find that 13.3% of the slot values in the dialog acts are normalized with canonical values, while 38.7% of the dialog acts’ slot values are provided with CA-ed values. The type-to-token ratio (TTR) varies across languages, with English having a lower TTR value (0.010) compared to Arabic (0.032), French (0.023), and Turkish (0.035). In comparison to the GlobalWOZ dataset, an MT-based dataset without CA, Multi3WOZ achieves an increased TTR for Arabic (↑0.013), French (↑0.006), and Turkish (↑0.014).14 This outcome highlights that Multi3WOZ’s bottom-up design yielded higher semantic variability and naturalness in the target languages (Majewska et al., 2023).

We further highlight the higher semantic diversity of utterances in Multi3WOZ in comparison to PEMT-based methods such as the one used by Multi2WOZ: we select a subset of 1,586 Arabic dialogs with flows shared between the two datasets and calculate the average pairwise cosine similarity between utterances in each data subset and their corresponding utterances in the English MultiWOZ, relying on LaBSE (Feng et al., 2022) as a state-of-the-art multilingual sentence encoder. The scores of 0.54 (Multi3WOZ) and 0.91 (Multi2WOZ) suggest the higher semantic variability created through the outline-based approach with cultural adaptation.
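Both diversity statistics are straightforward to reproduce from the released utterances; a minimal sketch, using the publicly available LaBSE checkpoint via sentence-transformers:

```python
from sentence_transformers import SentenceTransformer, util

def type_token_ratio(utterances: list) -> float:
    """TTR with white space as the token delimiter, as described above."""
    tokens = [tok for u in utterances for tok in u.split()]
    return len(set(tokens)) / len(tokens)

def avg_pairwise_similarity(tgt_utts: list, eng_utts: list) -> float:
    """Average cosine similarity between corresponding target-language
    and English utterances, encoded with LaBSE."""
    model = SentenceTransformer("sentence-transformers/LaBSE")
    tgt_emb = model.encode(tgt_utts, convert_to_tensor=True)
    eng_emb = model.encode(eng_utts, convert_to_tensor=True)
    sims = util.cos_sim(tgt_emb, eng_emb).diagonal()
    return sims.mean().item()
```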
4 Multi3WOZ as a ToD Benchmark
Multi3WOZ establishes a multilingual and cross-lingual benchmark for ToD systems and their sub-modules. We now present a first ‘benchmarking study’ on the dataset, evaluating representative models for NLU, DST, NLG, and E2E tasks in ToD, merely scratching the surface of possible experimental work now enabled by Multi3WOZ.
Natural Language Understanding.
NLU is typically decomposed into two established tasks: intent detection (ID) and slot labeling (SL). ID can be cast as a multi-label classification task that identifies the presence of domain-intent pairs d-i (e.g., Restaurant-Inform) in the user’s utterance, where the set of intents is predefined by the ontology. SL is a sequence tagging task that identifies the presence of a value v and its corresponding slot s within the utterance.
We evaluate ID and SL methods backed by XLM-R-base (Conneau et al., 2020). Precisely, at each dialog turn $t$, the model encodes the concatenation of the previous two utterances ($u_{t-2}$ and $u_{t-1}$) along with the current utterance ($u_t$) to provide embedding vectors at both the sequence and token levels. To implement the intent detector, for each domain-intent pair d-i defined by the ontology, the representation of the “<s>” token is projected down to two logits and passed through a Sigmoid layer to form a Bernoulli distribution indicating whether d-i appears in $u_t$. Performance is evaluated by measuring accuracy in identifying the exact presence of all domain-intent pairs in a dialog act, as well as the F1 score. For SL, we adopt the widely used BIO labeling scheme to annotate each token in the user’s utterance.15,16
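A minimal sketch of this intent detector (the intent inventory is illustrative, and the two logits are normalized with a softmax to obtain the per-pair Bernoulli distribution described above):

```python
import torch
from transformers import AutoModel, AutoTokenizer

class IntentDetector(torch.nn.Module):
    """XLM-R encoder with one two-logit binary head per domain-intent pair."""
    def __init__(self, intents: list):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("xlm-roberta-base")
        hidden = self.encoder.config.hidden_size
        self.heads = torch.nn.ModuleDict(
            {name: torch.nn.Linear(hidden, 2) for name in intents})

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # representation of the "<s>" token
        # Normalizing the two logits yields P(d-i absent), P(d-i present).
        return {name: head(cls).softmax(-1)[:, 1]
                for name, head in self.heads.items()}

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = IntentDetector(["Restaurant-Inform", "Restaurant-Request"])
# Input: concatenation of u_{t-2}, u_{t-1}, and the current utterance u_t.
batch = tok("i am looking for a cheap restaurant . which area please ? "
            "the centre , for 5 people at 19:45 .", return_tensors="pt")
probabilities = model(batch["input_ids"], batch["attention_mask"])
```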
In Table 2, we observe that the fully supervised ID model achieves similarly high accuracy across all languages, and we also observe a large cross-lingual transfer gap (Hu et al., 2020) for both tasks. Further, there is a substantial decrease in performance for Arabic SL. Note that in Multi3WOZ the slot-value spans are annotated at the character level, and we only consider a span to be correctly identified if there is an exact match. At the same time, Rust et al. (2021) observed that suboptimal tokenizers in multilingual models may degrade downstream performance. To investigate the impact of tokenization, we aligned the slot boundaries with the token boundaries; specifically, we defined the slot span as the minimal token span that covers the entire slot in the utterance. With this approach, the identical model achieved an F1 of 78.44 (↑30.00) for Arabic SL, confirming that XLM-R’s suboptimal tokenization was the primary contributor to the original performance degradation in Arabic.
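The boundary alignment used in this analysis can be sketched as follows, assuming character-level span annotations and a HuggingFace fast tokenizer, which exposes character-to-token offsets:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

def align_span_to_tokens(utterance: str, start: int, end: int):
    """Expand a character-level slot span [start, end) to the minimal
    token span that fully covers it; returns aligned char boundaries."""
    enc = tok(utterance, return_offsets_mapping=True,
              add_special_tokens=False)
    overlapping = [i for i, (s, e) in enumerate(enc["offset_mapping"])
                   if s < end and e > start]  # tokens overlapping the span
    first, last = overlapping[0], overlapping[-1]
    return enc["offset_mapping"][first][0], enc["offset_mapping"][last][1]

utt = "i want an expensive restaurant"
span_start = utt.index("expensive")
print(align_span_to_tokens(utt, span_start, span_start + len("expensive")))
```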
Dialog State Tracking.
For DST, we follow the standard MultiWOZ preprocessing and evaluation setups (Wu et al., 2019), excluding the ‘hospital’ and ‘police’ domains due to the absence of test dialogs in these domains. We report the Joint Goal Accuracy (JGA), Turn Accuracy, and Joint F1.
We adapt T5DST (Lin et al., 2021a) as a strong baseline that reformulates DST as a QA task with slot descriptions. The DST model is backed by mT5-small (Xue et al., 2021), as very similar scores were obtained with mT5-base. Regarding the model and training details, readers are referred to the original work (Lin et al., 2021a).17
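A sketch of this QA-style formulation (the prompt wording below illustrates the slot-description idea and is not the exact T5DST template; the untrained checkpoint is used here only to show the input/output format):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def predict_slot_value(history: list, slot_description: str) -> str:
    """Query the model for one slot: the dialog history plus a natural
    language slot description; the decoder generates the value."""
    prompt = " ".join(history) + f" What is the {slot_description}?"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=8)
    return tok.decode(out[0], skip_special_tokens=True)

history = ["user: i need a cheap restaurant in the centre"]
# Untrained checkpoint: output is meaningless until fine-tuned.
print(predict_slot_value(history, "price range of the restaurant"))
```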
Fully supervised DST with the multilingual T5DST model provides a strong benchmark across all languages in Multi3WOZ. We observe the highest performance in English (59.9% JGA), followed by Turkish, French, and Arabic, indicating the relative difficulty of DST in each language. Table 2 presents the zero-shot cross-lingual transfer-from-English results, revealing poor transferability of the DST models across languages (all below 4% JGA). This indicates the limitations of current multilingual models in zero-shot setups and the challenge of transfer learning for culturally adapted dialogs in Multi3WOZ.
Natural Language Generation.
We approach the NLG task as a sequence-to-sequence problem, again supported by mT5-small. Specifically, at each dialog turn $t$, the model takes its dialog context as input and generates a system response $u_t$. Traditionally, NLG in ToD systems is defined as the task of converting a dialog act into a natural language utterance (Williams and Young, 2007). In our study, we evaluate NLG performance in three setups: a traditional surface realization setup, where the goal is to realize the surface form of the dialog act; an end-to-end language modeling setup, where we model response generation as a transduction problem from the dialog history to a natural response; and a setup where both the dialog history and the ‘oracle’ dialog act are available, serving as a performance upper bound. For the surface realization setup, we convert the dialog act $a_t$ into a flattened string format (e.g., [inform][restaurant]([price range][expensive], [area][center])) to serve as the input. For the language modeling setup, the model generates a response $u_t$ solely based on the preceding dialog history $u_{t-2}$ and $u_{t-1}$; in this setup, the generation model does not have any knowledge about the system’s ontology and database. In the language modeling with oracle setup, the model takes the concatenation of the two preceding utterances $u_{t-2}$ and $u_{t-1}$, as well as $a_t$, as input.
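The three input formats can be linearized as follows (a sketch; the helper names are ours, and the flattened act string follows the bracketed example above):

```python
def linearize_act(act: set) -> str:
    """Flatten a dialog act into the bracketed string format, e.g.
    [inform][restaurant]([price range][expensive],[area][center])."""
    by_pair = {}
    for d, i, s, v in act:
        by_pair.setdefault((i, d), []).append(f"[{s}][{v}]")
    return " ".join(f"[{i}][{d}](" + ",".join(sv) + ")"
                    for (i, d), sv in by_pair.items())

def make_input(setup: str, history: list, act: set) -> str:
    """Build the model input for each of the three NLG setups."""
    if setup == "surface_realization":
        return linearize_act(act)
    if setup == "language_modeling":
        return " ".join(history[-2:])  # u_{t-2} and u_{t-1} only
    if setup == "language_modeling_with_oracle":
        return " ".join(history[-2:]) + " " + linearize_act(act)
    raise ValueError(setup)

act = {("restaurant", "inform", "price range", "expensive"),
       ("restaurant", "inform", "area", "center")}
print(make_input("language_modeling_with_oracle",
                 ["user: any good places?", "system: in which area?"], act))
```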
Following MultiWOZ, we evaluate with the corpus BLEU score (Papineni et al., 2002); we evaluate lexicalized utterances without performing delexicalization. We also report ROUGE-L (Lin, 2004) and METEOR (Banerjee and Lavie, 2005).18
The results are summarized in Table 3. We observe that performance in English is significantly higher than in the other languages in the surface realization setup. This disparity can be attributed to the fact that dialog acts are considered a formal language for the system to process internally and, except for culturally adapted values, they are provided in English. Therefore, it is more challenging for a model to learn how to generate natural language utterances in the other languages. Furthermore, by incorporating the dialog history and the oracle dialog act, the performance in all three target languages improves significantly, indicating that modeling the dialog history contributes to more coherent responses. Lastly, in the absence of database information, the performance for all languages is considerably lower. This highlights the challenge of modeling ToD and underlines the necessity of incorporating databases into ToD models in future work.
Table 3: NLG results per language and setup (BLEU / ROUGE-L / METEOR).

| Language | Surface Realization (BLEU / ROUGE / METEOR) | Language Modeling (BLEU / ROUGE / METEOR) | Language Modeling with Oracle (BLEU / ROUGE / METEOR) |
|---|---|---|---|
| ENG | 20.67 / 47.76 / 44.16 | 8.66 / 27.95 / 25.18 | 21.20 / 48.52 / 44.31 |
| ARA | 9.57 / 14.04 / 21.92 | 7.22 / 20.77 / 18.11 | 17.56 / 15.99 / 35.22 |
| FRA | 9.96 / 35.31 / 29.17 | 6.19 / 24.47 / 19.78 | 13.61 / 40.69 / 34.87 |
| TUR | 13.59 / 39.29 / 33.99 | 9.87 / 30.07 / 26.84 | 24.23 / 53.76 / 48.49 |
| AVG. | 13.45 / 34.10 / 32.31 | 7.98 / 21.14 / 22.48 | 19.15 / 39.74 / 40.72 |
End-to-End Modeling.
Finally, E2E modeling performance serves as an even more comprehensive, challenging, and arguably more important indicator for assessing the progress of ToD research, and it has been garnering intensified research attention (Hosseini-Asl et al., 2020; Lin et al., 2020; Peng et al., 2021; Su et al., 2022; Wu et al., 2023, inter alia). Developing an E2E system offers several advantages over focusing on individual sub-components such as NLU modules or dialog state trackers. The E2E approach achieves increased applicability, enabling the development of practical real-world applications. Moreover, it reduces vulnerability to error propagation across sub-components and offers a simpler system design compared to traditional pipelined approaches.
To the best of our knowledge, no previous publicly available implementation of a multilingual E2E ToD system exists that would be compatible with the MultiWOZ dataset and its derivatives. Other available multilingual ToD benchmarks either lack E2E results (Hung et al., 2022; Ding et al., 2022), or do not release their implementation (Zuo et al., 2021). The only exception is BiToD (Lin et al., 2021b); however, the BiToD dataset and the associated system use a different annotation schema, which is incompatible with MultiWOZ. Therefore, we present the first publicly available implementation of a multilingual E2E system compatible with the MultiWOZ-related datasets. We release this implementation as a baseline for further research and experimentation on Multi3WOZ.
Our system is composed of three key components: a Dialog State Tracking (DST) model, a Database (DB) Interface component, and a Response Generation (RG) model. First, the DST model is a sequence-to-sequence model, which takes the concatenated lexicalized form of all the historical utterances as input and generates a linearized dialog state (e.g., hotel price range = cheap; type = hotel). Then, the DB Interface transforms the predicted dialog state into an SQL query. This query is executed, resulting in a list of entities that satisfy the specified constraints, which are then returned to the system. Finally, the RG model, also implemented as a seq2seq model, takes as input the concatenation of historical utterances, predicted dialog state, and a database summary that indicates the number of entities returned for each active domain (e.g., restaurant more than five). It generates a delexicalized response, which can be further lexicalized using the values in the predicted dialog state and the returned entities from the database.
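A compact sketch of this three-stage pipeline (the state and summary string formats are illustrative, the checkpoints are untrained stand-ins, the active domain is assumed known for brevity, and the actual implementation follows SOLOIST’s preprocessing):

```python
import sqlite3
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/mt5-small")  # mT5-large in our setup
dst_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
rg_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def generate(model, text: str) -> str:
    ids = tok(text, return_tensors="pt").input_ids
    return tok.decode(model.generate(ids, max_new_tokens=64)[0],
                      skip_special_tokens=True)

def parse_state(state: str) -> dict:
    """Parse 'hotel price range = cheap ; type = hotel' into slot-value pairs."""
    pairs = (p.split("=", 1) for p in state.split(";") if "=" in p)
    return {k.strip(): v.strip() for k, v in pairs}

def query_db(conn, domain: str, constraints: dict) -> list:
    """Turn the predicted dialog state into an SQL query over the localized DB."""
    where = " AND ".join(f'"{k}" = ?' for k in constraints)
    sql = f"SELECT name FROM {domain}" + (f" WHERE {where}" if where else "")
    return conn.execute(sql, tuple(constraints.values())).fetchall()

def respond(conn: sqlite3.Connection, history: list, domain: str) -> str:
    state = generate(dst_model, " ".join(history))          # 1) DST
    entities = query_db(conn, domain, parse_state(state))   # 2) DB interface
    summary = f"{domain} " + ("more than five" if len(entities) > 5
                              else str(len(entities)))
    # 3) RG: history + predicted state + DB summary -> delexicalized response
    return generate(rg_model, " ".join(history) + f" {state} {summary}")
```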
In our implementation, we utilize two separate mT5-large models as the backbones for the DST model and the RG model. As discussed later, we opt for the large model because it demonstrates a substantial performance advantage over its smaller counterpart. The data preprocessing, including the linearization of dialog state annotations for training, and the evaluation protocol are based on the established implementation of the SOLOIST system (Peng et al., 2021). To ensure up-to-date functionality, our implementation is based on the recent version 4.30 of the HuggingFace transformers library. Our system is designed to prioritize simplicity and efficiency, with the primary goal of minimizing the complexity and effort required for training, evaluation, and future development. We report the standard evaluation metrics for the E2E task: the Inform Rate, Success Rate, and the delexicalized corpus BLEU score.19
Table 4 presents the results of the fully supervised E2E experiments. As anticipated, we observe noticeable performance disparities across languages, particularly in comparison to English. Furthermore, we find that the size of the pretrained language model significantly impacts system performance: the mT5-large model exhibits a substantial average improvement of 16.4 Inform Rate points, 17.2 Success Rate points, and 4.6 BLEU points over mT5-small.
Table 4: Fully supervised end-to-end modeling results.

| Language | Inform | Success | BLEU |
|---|---|---|---|
| ENG | 67.9 | 39.0 | 15.7 |
| ARA | 66.8 | 36.7 | 14.0 |
| FRA | 47.9 | 22.2 | 12.0 |
| TUR | 45.9 | 21.2 | 16.7 |
| AVG. | 57.1 | 29.8 | 14.6 |
5 Conclusion
We have introduced a large-scale, culturally adapted, multilingual, and multi-parallel training and evaluation framework for ToD, which covers ∼495,000 dialog turns over 4 languages. The dataset was motivated by the limitations of current ToD datasets in multilingual setups, which we systematically analyzed as one contribution of this work. Owing to its unique set of properties and scale, beyond initial analyses and experiments conducted in this work, we hope that Multi3WOZ will inspire a wide array of further developments in modeling, analysis, and interpretability of multilingual and cross-lingual multi-domain ToD.
For instance, future work could replicate the data collection process to expand the dataset to even more languages (including low-resource ones). Further, one could analyze the performance disparities observed in Tables 2–4 within each language-specific ToD system, as well as explore methods to mitigate such disparities, e.g., through the utilization of cross-lingual transfer techniques. Future work could also explore evaluation metrics beyond the ones explored in this work, e.g., it would be interesting to explore the correlation between the increase in evaluation scores in multilingual ToD systems and the resulting performance gain in terms of factors such as utility, user experience, and user satisfaction. Additionally, it would be important to investigate how ToD systems should, ideally, be constructed and evaluated across different languages to ensure their inclusiveness and robustness in diverse linguistic contexts.
Code and Data.
We release the dataset and code at github.com/cambridgeltl/multi3woz.
Acknowledgments
Songbo Hu is supported by Cambridge International Scholarship. Ivan Vulić acknowledges the support of a personal Royal Society University Research Fellowship (no 221137; 2022–).
We would like to thank our internship students, Bassil Alaeddin (for the work on the Arabic portion of the dataset) and Max Letellier (for French), for their contributions and dedication to this project. We are grateful to a large number of our diligent annotators for their significant efforts and contributions to this work. Furthermore, we would like to express our gratitude to the TACL editors and anonymous reviewers for their insightful feedback, which greatly improved the quality of this paper.
Notes
For instance, the creation of the validation and test sets of the XCOPA dataset requires a total time ranging from 12 to 20 hours per language (Ponti et al., 2020). In contrast, the creation of the validation and testing sets for each individual language in Multi3WOZ requires over 300 hours of effort. Even when considering the annotation cost per sentence (utterance), which amounts to approximately $0.17 per utterance, the cost is notably higher than the per sentence annotation cost for NER ($0.06 as reported by Bontcheva et al. [2017]) and NLI ($0.01015 per instance as reported by Marelli et al. [2014]).
The tiny size of AllWOZ is even more problematic at the level of single domains, e.g., it contains only 13 dialogs for the Taxi domain, hindering any generalizable evaluations.
The selection heuristic favors dialogs that contain the same most frequent 4-grams globally.
We select 9,160 out of MultiWOZ’s full set of 10,438 dialogs by filtering out erroneous dialogs identified during the normalization and cultural adaptation process; problematic dialogs were also recorded by our annotators during the dialog generation and quality control phases (see later in §3).
In order to simplify our notation, we represent a backend database as a set of data entries, where each entry corresponds to a real-world entity within the target culture.
We denote each attribute of an entity as a slot and consider the domain of an entity as an inherent attribute. For example, {(domain, police), (name, parkside police station), (address, Parkside, Cambridge), (phone, 01223358966), (postcode, cb11jg)} is a database entry in .
The introduction of slot values in canonical forms offers supplementary information to the original MultiWOZ annotation. The original format can be automatically derived, enabling backward compatibility with previous models.
We fully acknowledge that here we use the term ‘culture’ (imprecisely) as a proxy for the limited set of properties, customs, and entities to be expected or common at the target location. We also acknowledge that language-culture mappings are typically many-to-many, with the possibility of multiple languages being native to the same culture, and one language spreading over more than one culture or subculture (Hershcovich et al., 2022). Our (simplified) choice is primarily driven by pragmatic considerations and feasibility requirements.
We also consider religious factors: e.g., to respect local culture, we replace the ‘gastropub’ restaurant type with the value ‘Arab’, or ‘nightclub’ with ‘waterpark’ for the attractions slot. Moreover, we address the issue of unbalanced entity distribution in the original MultiWOZ, which is heavily skewed towards Cambridge (UK) and contains a disproportionate number of mentions of ‘colleges’ and ‘guest houses’. To mitigate this bias, we swap certain types of entities; e.g., we exchange the very specific term ‘college’ with ‘architecture’ and ‘guest house’ with ‘hotel’ to offer a better localization of the entity distribution for the target location.
However, we note that, for database completeness, a portion of the entity information has been synthetically generated due to missing information on the Web, e.g., when a restaurant does not provide a phone number on its website.
A categorical slot is defined by the ontology such that the possible values for this slot are a closed set. For example, the slot ‘price range’ can only have the values of ‘cheap’, ‘moderate’, and ‘expensive’. In contrast, the value for a hotel name is an open set and not categorical, as it can be any string.
Restaurant-Inform is the domain-intent pair for the utterance There will be 5 of us and 19:45 would be great.
For this comparison, we utilize the “F&E” proportion of the GlobalWOZ dataset. In this dataset, English utterances are translated into the target language using Google Translate, while preserving the slot values associated with English entities. The calculation of the TTR is limited to the dialogs that are included in both the GlobalWOZ dataset and our dataset.
Specifically, each token is labeled with either B-d-i-s (e.g., B-Restaurant-Inform-Food), denoting the beginning of a slot-value pair with the corresponding slot name, I-d-i-s indicating it is inside the slot-value, or O indicating that the token is not associated with any slot-value pair.
We conducted all NLU experiments on a single RTX 24 GiB GPU with a batch size of 64 and a learning rate of 2e −5. We trained each model for 10 epochs and selected the model with the best F1 score on the validation set as the final model.
The experiments were run on a single RTX 24 GiB GPU; batch size of 4, a learning rate of 1e −4; 5 epochs.
All NLG experiments were run on a single A100 80 GiB GPU; batch size of 32, a learning rate of 1e −3; 10 epochs.
All E2E experiments were run on a single A100 80 GiB GPU; batch size of 4, learning rate of 5e −5; 5 epochs.