Abstract
We present a new method, Soloist,1 that uses transfer learning and machine teaching to build task bots at scale. We parameterize classical modular task-oriented dialog systems using a Transformer-based auto-regressive language model, which subsumes different dialog modules into a single neural model. We pre-train, on heterogeneous dialog corpora, a task-grounded response generation model, which can generate dialog responses grounded in user goals and real-world knowledge for task completion. The pre-trained model can be efficiently adapted to accomplish new tasks with a handful of task-specific dialogs via machine teaching, where training samples are generated by human teachers interacting with the system. Experiments show that (i) Soloist creates a new state-of-the-art on well-studied task-oriented dialog benchmarks, including CamRest676 and MultiWOZ; (ii) in the few-shot fine-tuning settings, Soloist significantly outperforms existing methods; and (iii) the use of machine teaching substantially reduces the labeling cost of fine-tuning. The pre-trained models and code are available at https://aka.ms/soloist.
1 Introduction
The increasing use of personal assistants and messaging applications has spurred interest in building task-oriented dialog systems (or task bots) that can communicate with users through natural language to accomplish a wide range of tasks, such as restaurant booking, weather queries, flight booking, and IT helpdesk support (e.g., Zhou et al., 2020; Adiwardana et al., 2020; Roller et al., 2020b; Gao et al., 2020; Peng et al., 2020a). The wide variety of tasks and domains has created the need for a flexible task-oriented dialog development platform that can support many different use cases while remaining straightforward for developers to use and maintain.
A typical task-oriented dialog system uses a modular pipeline, which has four modules and executes sequentially (Young et al., 2013; Gao et al., 2019a), as shown in Figure 1(a). A natural language understanding (NLU) module identifies user intents and extracts associated information such as slots and their values from users’ input. A dialog state tracker (DST) infers the belief state (or user goal) from dialog history. The belief state is often used to query a task-specific database (DB) to obtain the DB state, such as the number of entities that match the user goal. The dialog state and DB state are then passed to a dialog policy (POL) to select the next system action. A natural language generation (NLG) module converts the action to a natural language response.
Most popular commercial tools for dialog development employ this modular design, including Google's Dialog Flow, Microsoft's Power Virtual Agents (PVA), Facebook's Wit.ai, Amazon's Lex, and IBM's Watson Assistant. They are designed mainly to help developers build systems manually, that is, by writing code, crafting rules, and authoring templates. Unfortunately, even with these tools, building dialog systems remains a label-intensive, time-consuming task, requiring rich domain knowledge, reasonable coding skill, and expert experience. The cost of building dialog systems at scale (i.e., tens of thousands of bots for different tasks) can be prohibitively expensive.
With the recent advances in neural approaches to conversational AI (Gao et al., 2019a), researchers have been developing data-driven methods and neural models for either individual dialog modules or end-to-end systems. For example, recent attempts such as RASA (Bocklisch et al., 2017), ConvLab (Lee et al., 2019b; Zhu et al., 2020), and Conversation Learner (Shukla et al., 2020) are made to allow the use of data-driven approaches based on machine learning and machine teaching to develop dialog modules. End-to-end trainable dialog systems have also been studied (e.g., Wen et al., 2017; Zhao and Eskenazi, 2016; Li et al., 2017; Williams et al., 2017; Lei et al., 2018; Gao et al., 2019a; Zhang et al., 2020b). Although these methods have achieved promising results, they require large amounts of task-specific labeled data for training, which are rarely available for new tasks in real-world applications.
In this paper, we propose a novel method of building task bots at scale, Soloist, which significantly eases the workflow of training and deploying dialog systems for new tasks, compared to existing tools and methods. Our approach is inspired by the recent success of applying transfer learning to natural language processing (NLP) tasks: Large language models pre-trained on large amounts of raw text (e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and UniLM (Dong et al., 2019)) can be effectively fine-tuned for a wide range of NLP tasks with few in-domain labels. Recently, these pre-trained language models have also been employed to develop dialog modules such as NLU and DST (Henderson et al., 2020; Coope et al., 2020; Wu et al., 2020a). The proposed Soloist uses a similar pre-training-and-fine-tuning framework for building end-to-end dialog systems. We parameterize a task bot using a Transformer-based auto-regressive language model, which subsumes different dialog modules (i.e., NLU, DST, POL, and NLG) into a single neural model. Task bot building proceeds in two stages: (i) In the pre-training stage, initialized using GPT-2 (Radford et al., 2019), we train a Transformer-based, task-grounded, response generation model using large heterogeneous dialog corpora. The model learns the primary task completion skills such as DST and POL, and can generate dialog responses grounded in user goals and real-world knowledge for task completion. (ii) In the fine-tuning stage, we adapt the pre-trained Soloist model to complete a specific (new) task using a handful of task-specific dialogs via machine teaching, where training samples are generated by human teachers interacting with the system (Zhu, 2015; Shukla et al., 2020).
We show through a comprehensive empirical study that Soloist is an effective method of building task bots at scale by successfully transferring two capabilities from the pre-trained model to a new task bot: (i) the capability of NLU and NLG learned on raw text, and (ii) the capability of grounding system responses in user goals and real-world knowledge for task completion, learned on the out-of-domain dialog corpora.
Soloist achieves state-of-the-art performance on two well-studied task-oriented dialog benchmarks, lifting the combined score by 10 points in automatic evaluation, and the success rate by 20 points in human evaluation. In the few-shot fine-tuning settings, Soloist adapts to the new domain much more effectively than competing methods, achieving a reasonable success rate using fewer than 50 dialogs. The promising results demonstrate the potential of the new method for developing task bots at scale. Instead of collecting and labeling data and building one bot per task, we can pre-train a task-grounded response generation model, and adapt it to new tasks via transfer learning and machine teaching.
2 Soloist
2.1 An Auto-Regressive Model for Dialog
The modular dialog system in Figure 1 constitutes a data processing pipeline that produces a sequence by concatenating the input-output pair of each module along the generation process. Each consecutive pair in this sequence plays the role of annotated data for the corresponding module. Ideally, when the entire sequence is available, the data generation process of a dialog system (NLU, DST, POL, NLG) can be formulated as a single auto-regressive model.
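Concretely, writing x = (s, b, c, r) for the dialog history, belief state, DB state, and delexicalized response of a turn (the four items defined in Section 2.2), one natural auto-regressive factorization of the sequence probability is

p_θ(x) = p(s) · p_θ(b | s) · p(c | s, b) · p_θ(r | s, b, c),

where the DB state c is obtained by querying the database with the belief state b rather than being generated by the neural model, so learning concentrates on the belief and response terms.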
2.2 Task-Grounded Pre-Training
Given training data of N samples D = {x_1, …, x_N}, our goal is to build a neural model parameterized by θ to characterize the sequence generation probability pθ(x). We use a multi-task objective for learning θ, where each task is a self-supervised learning task.
Task 1: Belief Prediction.
Task 2: Grounded Response Generation.
Task 3: Contrastive Objective.
Full Pre-Training Objective.
Implementation Details.
Each dialog turn in the training data is processed to form a sequence of tokens consisting of four items (s, b, c, r), i.e., the dialog history, belief state, DB state, and delexicalized response. For example, the dialog turn of Figure 1(b) is represented as such a sequence (in the figure, the four items are rendered in different colors).
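Since the figure is not reproduced here, the sketch below illustrates how such a turn might be serialized into a single training string; the separator strings and special tokens ("=>", "<EOB>", "<EOKB>", "<EOS>") are illustrative assumptions rather than the exact markup used by Soloist.

```python
# A minimal sketch of serializing one dialog turn into the
# (dialog history s, belief state b, DB state c, response r) sequence used for
# multi-task training. Separator and end-of-segment tokens are assumptions.

def serialize_turn(history, belief, db_state, response):
    """Concatenate the four items of one turn into a single training string."""
    history_str = " ".join(history)                                    # s
    belief_str = "belief : " + " ; ".join(
        f"{slot} = {value}" for slot, value in belief.items())         # b
    return f"{history_str} => {belief_str} <EOB> {db_state} <EOKB> {response} <EOS>"

example = serialize_turn(
    history=["user : i would like a cheap restaurant in the centre"],
    belief={"restaurant-pricerange": "cheap", "restaurant-area": "centre"},
    db_state="restaurant 5 matches",                                   # c, from a DB lookup
    response="there are [value_count] cheap restaurants in the centre . "
             "do you have a cuisine preference ?",                     # r, delexicalized
)
print(example)
```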
This sequence, tokenized using byte pair encodings (Sennrich et al., 2016), can be readily used for multi-task training, as shown in Figure 1(c). The implementation of Soloist is based on the Huggingface PyTorch Transformers library (Wolf et al., 2020). The task-grounded pre-training of Soloist uses the public 117M-parameter GPT-2 as initialization. Adam (Kingma and Ba, 2014) with weight decay is used for pre-training. Table 1 shows the dialog corpora (Kim et al., 2019; Rastogi et al., 2020; Byrne et al., 2019) used for task-grounded pre-training. To ensure there is no overlap between the pre-training and fine-tuning datasets, we exclude any data akin to MultiWOZ (Budzianowski et al., 2018), CamRest676 (Wen et al., 2017), Banking77 (Casanueva et al., 2020), and Restaurant-8k (Coope et al., 2020).
| Name | #Dialog | #Utterance | Avg. Turn | #Domain |
|---|---|---|---|---|
| task-grounded pre-training: | | | | |
| Schema | 22,825 | 463,284 | 20.3 | 17 |
| Taskmaster | 13,215 | 303,066 | 22.9 | 6 |
| fine-tuning: | | | | |
| MultiWOZ2.0 | 10,420 | 71,410 | 6.9 | 7 |
| CamRest676 | 676 | 2,744 | 4.1 | 1 |
| Banking77 | – | 25,716 | – | 21 |
| Restaurant-8k | – | 8,198 | – | 1 |
2.3 Fine-Tuning and Machine Teaching
When deploying Soloist to a new task, we collect task-specific dialogs x in the same format as that used for pre-training, i.e., (1). When such labeled data is available, the conventional fine-tuning procedure is used: we apply the same multi-task objective of (7) to update θ, adapting the model to complete the new task using the labeled task-specific dialogs.
In real applications, annotated task-specific data is often unavailable beforehand, or is noisy and incomplete. One may deploy the dialog system and acquire high-quality task-specific labels (e.g., belief state and system response) for each dialog turn using machine teaching. Machine teaching is an active learning paradigm that focuses on leveraging the knowledge and expertise of domain experts as "teachers". This paradigm puts a strong emphasis on tools and techniques that enable teachers (particularly non-data scientists and non-machine-learning experts) to visualize data, find potential problems, and provide corrections or additional training inputs in order to improve the system's performance (Simard et al., 2017; Zhu, 2015; Williams and Liden, 2017; Shukla et al., 2020).
We perform fine-tuning using Conversation Learner (Shukla et al., 2020), a machine teaching tool, in the following steps: (i) Dialog authors deploy the pre-trained Soloist model for a specific task. (ii) Users (or human subjects recruited for system fine-tuning) interact with the system and generate human-bot dialog logs. (iii) Dialog authors revise a dozen training samples by selecting representative failed dialogs from the logs and correcting their belief states and/or responses so that the system can complete these dialogs successfully, as illustrated in Figure 2. The corrected task-specific dialog turns are used to fine-tune the model.
Implementation Details.
To adapt a pre-trained Soloist to a new task in our experiments, we always fine-tune Soloist using a small amount of pre-collected task-specific dialogs, and then continue to fine-tune it via machine teaching, as detailed in Section 3.3. Training examples are truncated to a maximum length of 512. The pre-trained models are fine-tuned with a mini-batch size of 6 on 8 Nvidia V100 GPUs until no progress is observed on validation data or up to 10 epochs. Nucleus sampling (Holtzman et al., 2019) is used for decoding, where the sampling top-p ranges from 0.2 to 0.5 for all our models. The best setup of hyper-parameters is selected through grid search on the validation set. For the machine teaching experiment, pre-trained models are fine-tuned with SGD on a single Nvidia V100 GPU.
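As a concrete illustration of the decoding setup, the following is a minimal sketch of nucleus-sampling generation with the Huggingface Transformers API; the checkpoint name and prompt format are placeholders and not necessarily those of the released Soloist code.

```python
# Minimal sketch: nucleus (top-p) sampling with a GPT-2-style model.
# "gpt2" is a placeholder; a fine-tuned Soloist checkpoint would be loaded instead.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "user : i am looking for a cheap restaurant in the centre =>"  # assumed prompt format
input_ids = tokenizer.encode(context, return_tensors="pt")

output = model.generate(
    input_ids,
    do_sample=True,
    top_p=0.5,                          # within the 0.2-0.5 range mentioned above
    max_length=input_ids.shape[1] + 80,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```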
3 Experiments
This section evaluates the proposed Soloist to answer three questions: Q1: How does Soloist perform on standard benchmarks compared to SoTA methods? Q2: Does Soloist meet the goal of effectively generalizing to new domains in the few-shot fine-tuning setting? Q3: How effective is machine teaching for fine-tuning? Note that we employ the conventional fine-tuning method without machine teaching for a fair comparison when studying Q1 and Q2.
3.1 Experimental Setup
Dialog Datasets for Fine-Tuning.
We validate the end-to-end dialog system performance of Soloist on two well-studied datasets. (i) CamRest676 (Wen et al., 2017) is a single-domain task-oriented dialog corpus. It contains 408/136/136 dialogs for training/validation/testing, respectively. Following Lei et al. (2018), we delexicalize each token that occurs in the ontology, replacing it with its slot name, such as restaurant name, phone number, or postcode. (ii) MultiWOZ (Budzianowski et al., 2018) is a multi-domain task-oriented dialog dataset. It contains 8,438/1,000/1,000 dialogs for training/validation/testing, respectively. Each dialog session covers 1 to 3 domains, such as Attraction, Hotel, Hospital, Police, Restaurant, Train, and Taxi. MultiWOZ is inherently challenging due to its multi-domain setting and diverse language styles.
Automatic Evaluation Metrics.
Following Budzianowski et al. (2018), Inform, Success, and BLEU scores are reported. The first two metrics relate to dialog task completion: whether the system has provided an appropriate entity (Inform) and then answered all the requested attributes (Success). BLEU evaluates how natural the generated responses are compared with those generated by human agents. A combined score (Combined) is also reported as an overall quality measure, using Combined = (Inform + Success) × 0.5 + BLEU.
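For clarity, the Combined score can be computed as in this small helper (the example numbers are Soloist's MultiWOZ results from Table 3):

```python
def combined_score(inform: float, success: float, bleu: float) -> float:
    # Combined = (Inform + Success) * 0.5 + BLEU
    return (inform + success) * 0.5 + bleu

print(combined_score(85.50, 72.90, 16.54))  # 95.74
```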
Baselines.
We compare Soloist with several strong baselines, which hold SoTA results on the CamRest676 or MultiWOZ datasets. (i) Multi-Action Data Augmentation (DAMD) (Zhang et al., 2020b) is a modular system, where each dialog module is implemented using a neural network and the whole system is trained in an end-to-end manner. (ii) Sequicity (Lei et al., 2018) is similar to DAMD except that it does not use multi-action data augmentation. (iii) GPT fine-tuning (Budzianowski and Vulić, 2019) fine-tunes GPT-2 to generate responses based on the dialog state and history. (iv) ARDM (Wu et al., 2019b) uses GPT-2 as the pre-trained model to learn to generate role-aware responses given the dialog context; it has to work with a separate dialog state tracker for task completion. (v) HDSA (Chen et al., 2019) is a modular dialog system, which generates responses using a BERT-based dialog policy and graph-structured dialog act representations.
3.2 End-to-End Evaluation
CamRest676.
Table 2 shows the results and lists the annotations used by different models. Soloist achieves the best scores on all the metrics. ARDM performs similarly to Soloist in terms of Success and BLEU. However, ARDM cannot track dialog states and requires a separately trained state tracker to accomplish tasks. GPT-2 fine-tuned with task-specific data works reasonably well but lags behind Soloist by a large margin. Sequicity, which uses a jointly trained model with belief state and policy annotations, underperforms Soloist. This result also shows that, compared to other end-to-end models, Soloist not only achieves better performance but also requires a lower labeling cost for fine-tuning, due to the use of task-grounded pre-training.
| Model | Belief State | Policy | Inform↑ | Success↑ | BLEU↑ | Combined↑ |
|---|---|---|---|---|---|---|
| Sequicity (Lei et al., 2018) | | | 92.30 | 85.30 | 21.40 | 110.20 |
| Sequicity (w/o RL) | | | 94.00 | 83.40 | 23.40 | 112.10 |
| GPT fine-tuning (Budzianowski and Vulić, 2019) | | | – | 86.20 | 19.20 | – |
| ARDM¹ (Wu et al., 2019b) | | | – | 87.10 | 25.20 | – |
| Soloist | | | 94.70 | 87.10 | 25.50 | 116.40 |

¹ ARDM is not fully E2E, as it requires a rule-based dialog state tracker.
MultiWOZ.
The results are shown in Table 3. Soloist achieves the best performance in terms of Inform, Success, and Combined, lifting the previous SoTA by a significant margin (e.g., an improvement of about 10 points in Combined over DAMD). Soloist also outperforms the method of Ham et al. (2020), where GPT-2 is fine-tuned and applied for end-to-end dialog modeling. Compared to classical modular dialog systems such as DAMD, Soloist uses a much simpler architecture and requires much less labeling effort. For example, Soloist requires only the belief states, while DAMD requires additional annotations for task definition (i.e., defining the intents, slots, and the corresponding value ranges) and dialog acts.
| Model | Belief State | Policy | Inform↑ | Success↑ | BLEU↑ | Combined↑ |
|---|---|---|---|---|---|---|
| Sequicity (Lei et al., 2018) | | | 66.41 | 45.32 | 15.54 | 71.41 |
| HRED-TS (Peng et al., 2019) | | | 70.00 | 58.00 | 17.50 | 81.50 |
| Structured Fusion (Mehri et al., 2019b) | | | 73.80 | 58.60 | 16.90 | 83.10 |
| DSTC8 Track 1 Winner¹ (Ham et al., 2020) | | | 73.00 | 62.40 | 16.00 | 83.50 |
| DAMD (Zhang et al., 2020b) | | | 76.40 | 60.40 | 16.60 | 85.00 |
| Soloist | | | 85.50 | 72.90 | 16.54 | 95.74 |

¹ The result of the DSTC8 Track 1 Winner is produced by adapting their code to our setting.
3.3 Few-Shot Evaluation
It is desirable for task bots to effectively generalize to new tasks with few task-specific training samples. Thus, the few-shot fine-tuning setting is a more realistic setting for evaluating dialog systems. Unfortunately, existing task-oriented dialog benchmarks typically contain hundreds to thousands of dialogs for each task. Therefore, we re-organize CamRest676 and MultiWOZ to simulate the few-shot fine-tuning setting for end-to-end evaluation.7 We sample from the MultiWOZ dataset the dialog tasks that contain only one domain. The Attraction, Train, Hotel, and Restaurant domains are used. We do not use the Police, Taxi, and Hospital domains, as they do not require explicitly tracking dialog states for task completion. For each domain, we randomly sample 50 dialog sessions for training and validation and 200 dialog sessions for testing. The only exception is the Attraction domain, which has 100 sessions for testing. For CamRest676, we randomly sample 20 sessions for training. Details are shown in Table 4.
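The few-shot splits described above can be drawn with a simple random sampling procedure; the sketch below is illustrative, assuming dialog sessions are available as a per-domain list (the session IDs and seed are placeholders).

```python
import random

def few_shot_split(sessions, n_train=50, n_valid=50, n_test=200, seed=0):
    """Randomly partition one domain's dialog sessions into train/valid/test."""
    shuffled = sessions[:]
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:n_train + n_valid + n_test]
    return train, valid, test

restaurant_sessions = [f"session_{i}" for i in range(1200)]  # placeholder IDs
train, valid, test = few_shot_split(restaurant_sessions)
print(len(train), len(valid), len(test))  # 50 50 200
```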
| Domain | Attra. | Train | Hotel | Rest. | CamRest676 |
|---|---|---|---|---|---|
| #Train | 50 | 50 | 50 | 50 | 20 |
| #Valid | 50 | 50 | 50 | 50 | 136 |
| #Test | 100 | 200 | 200 | 200 | 136 |
Tables 5 and 6 report the end-to-end performance in the few-shot fine-tuning settings on CamRest676 and MultiWOZ, respectively. On all the domains, Soloist obtains substantially better performance on all the metrics. Removing task-grounded pre-training significantly hurts the performance of Soloist, although Soloist without task-grounded pre-training still consistently outperforms DAMD in all the domains. Soloist without task-grounded pre-training is conceptually similar to Ham et al. (2020), but is architecturally simpler and needs fewer annotations. The result verifies the importance of task-grounded pre-training on annotated dialog corpora, which allows Soloist to learn how to track dialog and database states to accomplish a task. To study the impact of a larger model size, we build a large version of Soloist, SoloistL, which is task-grounded pre-trained on the same data but uses GPT-2 medium with 345M parameters as initialization. SoloistL consistently outperforms Soloist by a large margin, indicating that a larger model is a better few-shot learner, exhibiting stronger generalization ability with limited in-domain data. We leave it to future work to significantly scale up Soloist.
| Model (CamRest676) | Inform↑ | Success↑ | BLEU↑ |
|---|---|---|---|
| Sequicity (Lei et al., 2018) | 60.61 | 66.11 | 11.15 |
| Soloist w/o pre-training | 73.88 | 72.22 | 13.11 |
| Soloist | 85.82 | 84.22 | 19.18 |
| SoloistL | 88.05 | 84.79 | 18.88 |
| Model | Attraction Inform↑ | Attraction Success↑ | Attraction BLEU↑ | Train Inform↑ | Train Success↑ | Train BLEU↑ | Hotel Inform↑ | Hotel Success↑ | Hotel BLEU↑ | Restaurant Inform↑ | Restaurant Success↑ | Restaurant BLEU↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAMD (Zhang et al., 2020b) | 70.00 | 15.00 | 6.90 | 75.00 | 39.50 | 6.20 | 62.50 | 20.50 | 7.60 | 68.00 | 19.50 | 10.50 |
| Soloist w/o pre-training | 65.66 | 46.97 | 5.85 | 59.00 | 44.00 | 7.07 | 62.50 | 40.00 | 7.70 | 75.50 | 44.50 | 11.00 |
| Soloist | 86.00 | 65.00 | 12.90 | 80.81 | 64.65 | 9.96 | 74.50 | 43.50 | 8.12 | 81.00 | 55.50 | 12.80 |
| SoloistL | 86.00 | 68.00 | 14.60 | 81.31 | 74.24 | 11.90 | 75.00 | 51.50 | 10.09 | 84.00 | 62.50 | 13.17 |
We conduct experiments to fine-tune Soloist by varying the percentage of task-specific training samples, ranging from 1% (80 examples) to 20% (1600 examples), on the MultiWOZ dataset. As shown in Table 7, Soloist consistently outperforms DAMD for a wide range of dataset sizes, and the improvement is more substantial when smaller numbers of in-domain examples are used for fine-tuning.
| Model | 1% Inform↑ | 1% Success↑ | 1% BLEU↑ | 5% Inform↑ | 5% Success↑ | 5% BLEU↑ | 10% Inform↑ | 10% Success↑ | 10% BLEU↑ | 20% Inform↑ | 20% Success↑ | 20% BLEU↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DAMD (Zhang et al., 2020b) | 34.40 | 9.10 | 8.10 | 52.50 | 31.80 | 11.60 | 55.30 | 30.30 | 13.00 | 62.60 | 44.10 | 14.90 |
| Soloist w/o pre-training | 46.10 | 24.40 | 10.39 | 63.40 | 38.70 | 11.19 | 64.90 | 44.50 | 13.57 | 70.10 | 52.20 | 14.72 |
| Soloist | 58.40 | 35.30 | 10.58 | 69.30 | 52.30 | 11.80 | 69.90 | 51.90 | 14.60 | 74.00 | 60.10 | 15.24 |
3.4 Machine Teaching Results
The machine teaching module of Conversation Learner (CL) (Shukla et al., 2020) allows human teachers (dialog authors) to select and visualize dialogs, find potential problems, and provide corrections or additional training samples to improve the bot's performance. We use CL to evaluate the effectiveness of machine teaching for task bot fine-tuning. In our experiment, we first sample 10 dialogs from each domain to fine-tune Soloist as described in Section 3.3. The result is presented in the first row of Table 8. We then deploy the model to interact with human users via CL. The row Soloist+Teach shows the result of machine teaching, where a human teacher has manually corrected 5 dialogs, recommended by CL using a ranking heuristic based on perplexity. The corrections are used to continually fine-tune the deployed system.
| Model | Attraction Inform↑ | Attraction Success↑ | Attraction BLEU↑ | Train Inform↑ | Train Success↑ | Train BLEU↑ | Hotel Inform↑ | Hotel Success↑ | Hotel BLEU↑ | Restaurant Inform↑ | Restaurant Success↑ | Restaurant BLEU↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Soloist | 45.00 | 19.00 | 7.67 | 67.68 | 58.08 | 7.13 | 33.50 | 22.50 | 8.70 | 50.50 | 10.00 | 8.61 |
| Soloist+Extra | 63.00 | 41.00 | 11.08 | 65.15 | 57.58 | 9.74 | 41.50 | 19.00 | 7.96 | 44.50 | 27.00 | 9.77 |
| Soloist+Teach | 78.00 | 45.00 | 11.90 | 68.18 | 63.64 | 9.45 | 46.50 | 22.50 | 7.68 | 53.00 | 32.00 | 9.81 |
Table 8 shows that Soloist+Teach consistently improves Combined by a large margin compared with the model fine-tuned without human teaching. Soloist+Extra is used as an ablation baseline, where 5 randomly selected dialogs with full annotations from experts are added as extra examples to fine-tune the model; it yields lower performance than machine teaching. Figure 3 shows the performance of Soloist in the Restaurant domain when the above machine teaching process is repeated for multiple iterations. We observe that in the second iteration of machine teaching, Soloist+Teach improves Combined by more than 8 points, while Soloist+Extra gains only 5 points. The result demonstrates the effectiveness of our two-step fine-tuning scheme for deploying Soloist in a new task (domain). In terms of machine teaching cost, taking the Restaurant domain as an example, we assume that correcting one slot-value pair of the belief state counts as one edit and correcting a response counts as ten edits. The total numbers of edits for Soloist+Teach and Soloist+Extra are 61 and 396, respectively, suggesting that machine teaching reduces the labeling cost by about 6×.
3.5 Component-Wise Evaluation
This section evaluates Soloist on two NLU tasks (intent classification and slot filling), the DST task, and the context-to-response generation task. We show that, although Soloist is an end-to-end dialog model, it also performs well on these component tasks.
Intent Classification.
The task is to classify a user utterance into one of several pre-defined classes (intents). We follow the experimental setting of Casanueva et al. (2020). The last hidden state of Soloist is used as the sequence representation for classification. Several baseline methods are used for comparison. BERT-Fixed and BERT-Tuned are fine-tuned from BERT, with the BERT parameters kept fixed or updated during fine-tuning, respectively; a linear classifier with a softmax layer is added on top of BERT for classification. Universal Sentence Encoder (USE) and ConveRT are sentence encoders tailored for modeling sentence pairs and trained to optimize the conversational response selection task. The results in Table 9 show that Soloist is comparable with SoTA intent classification models. Soloist is the best performer when the full dataset is used for fine-tuning, but its performance deteriorates more quickly than USE+ConveRT when fewer samples are used for fine-tuning. It would be interesting to investigate whether incorporating intent classification tasks in task-grounded pre-training can boost Soloist's performance; we leave this to future work.
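The following sketch shows one way to realize the classification setup described above, i.e., a linear softmax classifier on top of the last hidden state of a GPT-2-style encoder; the pooling choice and untrained head are illustrative assumptions, not the exact Soloist configuration.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

NUM_INTENTS = 77  # Banking77 defines 77 intent classes

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
encoder = GPT2Model.from_pretrained("gpt2")          # placeholder for a Soloist checkpoint
classifier = nn.Linear(encoder.config.n_embd, NUM_INTENTS)

def intent_probs(utterance: str) -> torch.Tensor:
    enc = tokenizer(utterance, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state        # (1, seq_len, hidden_dim)
    pooled = hidden[:, -1, :]                        # last token's hidden state
    return torch.softmax(classifier(pooled), dim=-1)

print(intent_probs("how do i activate my new card ?").argmax(dim=-1))
```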
| Model | Banking77 (10) | Banking77 (30) | Banking77 (Full) |
|---|---|---|---|
| BERT-Fixed | 67.55 | 80.07 | 87.19 |
| BERT-Tuned | 83.42 | 90.03 | 93.66 |
| USE | 84.23 | 89.74 | 92.81 |
| ConveRT | 83.32 | 89.37 | 93.01 |
| USE+ConveRT | 85.19 | 90.57 | 93.36 |
| Soloist | 78.70 | 89.28 | 93.80 |
Slot Filling.
We follow the experimental setting of Coope et al. (2020) and formulate slot filling as a turn-based span extraction problem. The results in Table 10 show that Soloist performs significantly better than the SoTA method Span-ConveRT, a variant of ConveRT designed explicitly for slot filling. The gap is wider when fewer examples are used for training. For example, when 64 samples are used for training, Soloist outperforms Span-ConveRT by 20 points in F1 score.
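A minimal sketch of the span extraction formulation is given below: per-token start and end logits are predicted and the highest-scoring span is returned as the slot value. The two-logit head and untrained weights are illustrative assumptions rather than Soloist's exact slot-filling head.

```python
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
encoder = GPT2Model.from_pretrained("gpt2")
span_head = nn.Linear(encoder.config.n_embd, 2)       # start and end logits per token

def extract_span(utterance: str) -> str:
    enc = tokenizer(utterance, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state          # (1, seq_len, hidden_dim)
    logits = span_head(hidden)                         # (1, seq_len, 2)
    start = logits[0, :, 0].argmax().item()
    end = max(logits[0, :, 1].argmax().item(), start)  # guard against end < start
    return tokenizer.decode(enc["input_ids"][0][start:end + 1])

print(extract_span("book a table for 4 people at 7 pm"))
```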
| Fraction (#examples) | Soloist | Span-ConveRT | V-CNN-CRF | Span-BERT |
|---|---|---|---|---|
| 1 (8198) | 0.98 | 0.96 | 0.94 | 0.93 |
| 1/2 (4099) | 0.95 | 0.94 | 0.92 | 0.91 |
| 1/4 (2049) | 0.93 | 0.91 | 0.89 | 0.88 |
| 1/8 (1024) | 0.89 | 0.89 | 0.85 | 0.85 |
| 1/16 (512) | 0.84 | 0.81 | 0.74 | 0.77 |
| 1/32 (256) | 0.79 | 0.64 | 0.57 | 0.54 |
| 1/64 (128) | 0.74 | 0.58 | 0.37 | 0.42 |
| 1/128 (64) | 0.61 | 0.41 | 0.26 | 0.30 |
Dialog State Tracking.
We compare the dialog state tracking capability of Soloist with several strong baselines on MultiWOZ 2.0 and 2.1. The results in Table 11 show that Soloist achieves the best performance on MultiWOZ 2.1 and performs similarly to DST-Picklist (Zhang et al., 2020a), which requires a pre-defined task ontology to guide state tracking. In comparison with Simple-TOD (Hosseini-Asl et al., 2020), which is also based on GPT-2, Soloist obtains 1.13% higher joint goal accuracy. We attribute the gain to the task-grounded pre-training that equips Soloist with task completion skills, including dialog state tracking.
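Joint goal accuracy, the metric reported in Table 11, counts a turn as correct only if the entire predicted belief state matches the gold belief state exactly; a minimal sketch is shown below.

```python
from typing import Dict, List

def joint_goal_accuracy(predictions: List[Dict[str, str]],
                        references: List[Dict[str, str]]) -> float:
    """Each element is one turn's belief state, i.e., a dict of slot -> value."""
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)

preds = [{"restaurant-area": "centre"},
         {"restaurant-area": "centre", "restaurant-food": "thai"}]
golds = [{"restaurant-area": "centre"},
         {"restaurant-area": "centre", "restaurant-food": "indian"}]
print(joint_goal_accuracy(preds, golds))  # 0.5
```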
| Model | MWoz2.0 Joint Goal Accuracy↑ | MWoz2.1 Joint Goal Accuracy↑ |
|---|---|---|
| MDBT (Ramadan et al., 2018) | 15.57 | – |
| GLAD (Zhong et al., 2018) | 35.57 | – |
| GCE (Nouri and Hosseini-Asl, 2018) | 36.27 | – |
| FJST (Eric et al., 2020) | 40.20 | 38.00 |
| HyST (Goel et al., 2019) | 44.24 | – |
| SUMBT (Lee et al., 2019a) | 46.65 | – |
| TOD-BERT (Wu et al., 2020a) | – | 48.00 |
| Neural Reading (Gao et al., 2019b) | 41.10 | – |
| TRADE (Wu et al., 2019a) | 48.62 | 45.60 |
| COMER (Ren et al., 2019) | 48.79 | – |
| NADST (Le et al., 2020) | 50.52 | 49.04 |
| DSTQA (Zhou and Small, 2019) | 51.44 | 51.17 |
| SOM-DST (Kim et al., 2020) | 51.38 | 52.57 |
| DST-Picklist (Zhang et al., 2020a) | 53.30 | – |
| MinTL (Lin et al., 2020) | 52.10 | 53.62 |
| SST (Chen et al., 2020) | 51.17 | 55.23 |
| TripPy (Heck et al., 2020) | – | 55.29 |
| Simple-TOD (Hosseini-Asl et al., 2020) | – | 55.72 |
| Soloist | 53.20 | 56.85 |
Context-to-Response.
In this task, systems need to generate responses given the ground-truth belief state and DB search result (Wen et al., 2017). The results on MultiWOZ 2.0 are shown in Table 12. Soloist achieves the best performance in terms of Inform and Success but performs slightly worse in BLEU. The Combined score of Soloist is comparable with that of the current SoTA method DAMD. However, DAMD uses dialog act labels on both the user and system sides, which demands significantly more labeling effort than Soloist for model training. HDSA achieves the best BLEU score; compared with HDSA, Soloist is much simpler and performs better in terms of Combined. Soloist also outperforms ARDM in Combined. It is worth mentioning that ARDM cannot perform dialog state tracking and requires an extra dialog state tracker to accomplish tasks. These results show that Soloist can learn dialog policies accurately and generate natural language responses in the multi-domain scenario.
| Model | Belief State | Policy | Inform↑ | Success↑ | BLEU↑ | Combined↑ |
|---|---|---|---|---|---|---|
| Baseline (Budzianowski et al., 2018) | | | 71.29 | 60.94 | 18.80 | 84.93 |
| TokenMoE (Pei et al., 2019) | | | 75.30 | 59.70 | 16.81 | 84.31 |
| GPT fine-tuning (Budzianowski and Vulić, 2019) | | | 70.96 | 61.36 | 19.05 | 85.21 |
| Structured Fusion (Mehri et al., 2019b) | | | 82.70 | 72.10 | 16.34 | 93.74 |
| LaRL (Zhao et al., 2019) | | | 82.80 | 79.20 | 12.80 | 93.80 |
| MD-Sequicity (Zhang et al., 2020b) | | | 86.60 | 71.60 | 16.68 | 95.90 |
| HDSA (Chen et al., 2019) | | | 82.90 | 68.90 | 23.60 | 99.50 |
| ARDM (Wu et al., 2019b) | | | 87.40 | 72.80 | 20.60 | 100.70 |
| DAMD (Zhang et al., 2020b) | | | 89.20 | 77.90 | 18.60 | 102.15 |
| Soloist | | | 89.60 | 79.30 | 18.03 | 102.49 |
3.6 Human Evaluation Results
We conduct a human evaluation to assess the quality of Soloist interacting with human users. Following the evaluation protocol of the DSTC8 Track 1 challenge (Kim et al., 2019), we host the Soloist model that performs best on the MultiWOZ validation set as a bot service in the back-end and crowdsource the work to Amazon Mechanical Turk. For each dialog session, we present Turkers with a goal and instructions. Turkers are then required to converse with Soloist to achieve the goal and judge the overall dialog experience at the end of a session using four metrics. (i) Success evaluates task completion. (ii) Under. (language understanding score), ranging from 1 (bad) to 5 (good), indicates the extent to which the system understands user inputs. (iii) Appr. (response appropriateness score), ranging from 1 (bad) to 5 (good), denotes whether the response is appropriate and human-like. (iv) Turns is the average number of turns in a dialog over all successful dialog sessions. Turkers are further required to write down a justification for giving a specific rating. In total, 120 dialog sessions are gathered for analysis.
Table 13 shows the human assessment results on MultiWOZ. The results are consistent with the automatic evaluation. Soloist achieves substantially better performance than the other systems on all the metrics. Moreover, Soloist outperforms the DSTC8 Track 1 Winner by a much larger margin in Success (+20 points) in human evaluation than in automatic evaluation (+10 points in Table 3). We attribute this to the fact that Turkers use more diverse language to interact with the target bots in interactive human evaluation than occurs in the pre-collected MultiWOZ dataset, and that the use of heterogeneous dialog data for task-grounded pre-training makes Soloist a more robust task bot than the others. In many test cases with Soloist, Turkers commented that they felt like they were talking to a real person.
| Model | Success↑ | Under.↑ | Appr.↑ | Turns↓ |
|---|---|---|---|---|
| Soloist | 91.67 | 4.29 | 4.43 | 18.97 |
| DSTC8 Track 1 Winner | 68.32 | 4.15 | 4.29 | 19.51 |
| DSTC8 2nd Place | 65.81 | 3.54 | 3.63 | 15.48 |
| DSTC8 3rd Place | 65.09 | 3.54 | 3.84 | 13.88 |
| DSTC8 Baseline | 56.45 | 3.10 | 3.56 | 17.54 |
Figure 4 depicts a dialog example where a user interacts with Soloist to complete a multi-domain task. The user starts the conversation by asking for a recommendation of a museum in the center of town. Soloist identifies the user intent and provides a recommendation based on the search result from an attraction DB. Then, the user wants to book a table at a restaurant in the same area. We can see that, through the conversation, Soloist maintains a belief state, which can be viewed as the system's understanding of what the user needs and what is available in the DB. Based on the belief state and DB state, Soloist picks the next action, either asking for clarification or providing the user with the information being requested. This example also demonstrates that Soloist is able to deal with some NLU challenges often displayed in human conversations, such as co-reference resolution. For example, Soloist understands that the "same area" at Turn 5 refers to "centre of town", and then identifies a proper entity from the restaurant booking DB to make the reservation.
4 Related Work
Dialog Systems.
Dialog systems are typically grouped into two categories, task-oriented systems and social chatbots (e.g., Chen et al., 2017; Gao et al., 2019a; Roller et al., 2020a; Zhou et al., 2020). Recently many variants have been developed to extend the scope of dialog systems, including empathetic dialog systems (Ma et al., 2020; Zhou et al., 2018), chatbots for sentiment analysis (Li et al., 2020c), dialog systems with commonsense knowledge (Young et al., 2018; Shuster et al., 2020), or using audio features (Young et al., 2020). In this paper, we focus on end-to-end dialog models for task-oriented systems.
Pre-Trained Language Models.
Recent advances in self-supervised learning have witnessed the blooming of large-scale pre-trained language models (e.g., Devlin et al., 2019; Radford et al., 2019; Dong et al., 2019), which achieve SoTA performance on a variety of language understanding and generation tasks. The closest to Soloist are GPT-2 (Radford et al., 2019) and its variants that ground language generation in prescribed control codes, such as CTRL (Keskar et al., 2019) and Grover (Zellers et al., 2019), or latent variables, such as Optimus (Li et al., 2020a).
Recently, pre-trained language models have been adopted to develop task-oriented and chit-chat dialog systems. To name a few examples of chit-chat dialog systems: DialoGPT (Zhang et al., 2020c), TransferTransfo (Wolf et al., 2019), and CGRG (Wu et al., 2020b) adapt GPT-2 using human conversational data for response generation. Plato (Bao et al., 2020) pre-trains a discrete latent variable model for response generation. Meena (Adiwardana et al., 2020) and BST (Roller et al., 2020b) pre-train large models on conversational data and have demonstrated impressive performance in generating social chit-chat dialogs. For task-oriented dialogs, Mehri et al. (2019a) explore different pre-training methods for dialog context representation learning. TOD-BERT (Wu et al., 2020a) adapts the pre-trained BERT model to achieve strong performance on four dialog sub-tasks. ConveRT (Henderson et al., 2020) pre-trains a model on Reddit data for intent classification and response selection. Span-ConveRT (Coope et al., 2020) extends the framework to entity extraction. SC-GPT (Peng et al., 2020b) uses a pre-trained language model to convert a dialog act to a natural language response. All these works use the pre-training-and-fine-tuning framework. However, they follow the modular architecture of task bots, and the pre-trained models are used for improving individual dialog modules such as NLU and DST. Soloist generalizes these methods to the entire dialog pipeline, building an end-to-end dialog system.
End-to-End Trainable Dialog Systems.
End-to-end dialog systems based on neural models have been studied in Wen et al. (2017); Li et al. (2017); Lei et al. (2018); Xu et al. (2019). Although these methods have achieved promising results, they are designed for specific domains, making it difficult to generalize to multi-domain settings such as MultiWOZ. Dialog models that can handle multi-domain tasks are studied in Pei et al. (2019); Budzianowski and Vulić (2019); Mehri et al. (2019b); Zhao et al. (2019); Wu et al. (2019b); Zhang et al. (2020b); Peng et al. (2017). However, these works require large amounts of in-domain labels to achieve good performance. In contrast, Soloist can effectively adapt to a new task in the few-shot fine-tuning settings.
The most related work to ours is Ham et al. (2020), which is the first attempt to fine-tune GPT-2 to build end-to-end dialog models. Hosseini-Asl et al. (2020) take a similar approach and is concurrent work to Soloist. However, Soloist differs from these two methods in two major aspects. The first is the use of task-grounded pre-training, which allows Soloist to learn primary task completion skills, such as tracking dialog states and selecting system actions. These skills can be easily reused and adapted (e.g., via few-shot fine-tuning) to solve new dialog tasks, leading to a much higher task success rate, as reported in Section 3. The second is that the annotation cost required for training Soloist is much lower than that of Ham et al. (2020) or Hosseini-Asl et al. (2020). Training Soloist requires only belief states as labels, whereas training the models of Ham et al. (2020) and Hosseini-Asl et al. (2020) requires labeling each dialog turn with dialog acts. In addition, while Soloist is end-to-end trainable, the other two models are not and need heuristic rules to handle different database search conditions.
5 Conclusion
Soloist is a method of building task bots at scale with transfer learning and machine teaching. Unlike GPT-2, Soloist is pre-trained in a task-grounded manner, so it can generate responses grounded in user goals and real-world knowledge for task completion. Experiments show that Soloist creates a new SoTA on two popular task-oriented dialog benchmarks, and that Soloist outperforms existing methods by a large margin in the few-shot fine-tuning settings, where only a limited number of task labels are available for fine-tuning.
We hope that Soloist can inspire dialog researchers and developers to comprehensively explore the new paradigm for building task bots based on task-grounded pre-training and fine-tuning via machine teaching, and to improve the recipe we present in this paper, namely: formulating task-oriented dialog as a single auto-regressive language model, pre-training a task-grounded response generation model on heterogeneous dialog corpora, and adapting the pre-trained model to new tasks through fine-tuning using a handful of task-specific examples via machine teaching.
Notes
1. TaSk-Oriented DiaLOg wIth a Single pre-Trained model. In this paper, Soloist refers to both the proposed bot building method and the dialog model or system developed using the method.
7. We will release the re-organized datasets.