Abstract
This paper presents an ontology-aware pretrained language model (OPAL) for end-to-end task-oriented dialogue (TOD). Unlike chit-chat dialogue models, task-oriented dialogue models must fulfill at least two task-specific modules: a dialogue state tracker (DST) and a response generator (RG). The dialogue state consists of domain-slot-value triples, which are regarded as the user’s constraints for searching the domain-related databases. Large-scale task-oriented dialogue data with annotated structured dialogue states are usually inaccessible, which hinders the development of pretrained language models for task-oriented dialogue. We propose a simple yet effective pretraining method to alleviate this problem, which consists of two pretraining phases. The first phase pretrains on large-scale contextual text data, where the structured information of the text is extracted with an information extraction tool. To bridge the gap between the pretraining method and the downstream tasks, we design two pretraining tasks: ontology-like triple recovery and next-text generation, which simulate DST and RG, respectively. The second phase fine-tunes the pretrained model on the TOD data. The experimental results show that our proposed method achieves a clear performance boost and obtains competitive performance on the CamRest676 and MultiWOZ benchmarks even without any TOD data in pretraining.
1 Introduction
A task-oriented dialogue (TOD) system aims to assist users in accomplishing a specific task, such as reserving a hotel or booking flight tickets, through natural-language interaction. With the growing deployment of industrial dialogue systems, task-oriented dialogue has attracted extensive research attention.
Existing task-oriented dialogue systems can be classified into two categories: pipeline and end-to-end. The pipeline TOD system (Ultes et al., 2017; Weisz et al., 2018) is composed of four modules: natural language understanding (NLU) (Quirk et al., 2015), dialogue state tracking (DST) (Xu et al., 2020; Chen et al., 2020c), dialogue policy (DP) (Chen et al., 2018, 2019, 2020b), and natural language generation (NLG) (Wen et al., 2015; Li et al., 2016; Zhao et al., 2017). Since each module is trained separately and executed sequentially, the pipeline system faces two serious issues: error accumulation and high annotation cost. Thus, the end-to-end dialogue system (Lee et al., 2019b; Zhao et al., 2019) has gradually become the research focus; it formulates task-oriented dialogue as a sequence-to-sequence task, in which the dialogue state, database (DB) state, and the corresponding system response are concatenated and flattened into a single token sequence. The DB state is the status of the domain-related database searched with the dialogue state, as shown in Figure 1.
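To make the serialization concrete, the following minimal sketch flattens one turn into a single training sequence; the delimiter tokens and the exact ordering are illustrative assumptions rather than the precise format used by OPAL.

```python
# A minimal sketch of flattening one dialogue turn into a single token
# sequence for sequence-to-sequence training. The delimiter markers
# (<context>, <state>, <db>, <response>) are illustrative assumptions,
# not the exact special tokens used by OPAL.

def flatten_turn(context_utterances, dialogue_state, db_state, response):
    """Serialize a turn as: dialogue context | dialogue state | DB state | response."""
    context = " ".join(context_utterances)
    # The dialogue state is a set of domain-slot-value triples.
    state = " ; ".join(f"{d} {s} {v}" for d, s, v in sorted(dialogue_state))
    return f"<context> {context} <state> {state} <db> {db_state} <response> {response}"

example = flatten_turn(
    ["i need a cheap hotel in the north .", "sure , any star rating ?"],
    {("hotel", "price", "cheap"), ("hotel", "area", "north")},
    "3 matches",
    "i found 3 cheap hotels in the north . do you have any other preference ?",
)
print(example)
```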
Thanks to the success of pretrained language models (Kenton and Toutanova, 2019; Raffel et al., 2020), their effective application has shed light on open-domain (chit-chat) dialogue (Bao et al., 2020; Adiwardana et al., 2020). Nevertheless, utilizing such pretrained language models in TOD systems remains challenging due to the scarcity of TOD data with annotated dialogue states. Unlike open-domain dialogue, TOD is restricted by a dialogue ontology, which defines the dialogue domains, the slots, and their candidate values. The TOD system needs to predict the dialogue state and return the DB content to accomplish a task. The dialogue state is structured information extracted from the dialogue context, represented as a set of domain-slot-value triples.
Recently, some works (Hosseini-Asl et al., 2020; Lin et al., 2020) directly leverage pretrained language models, e.g., GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020), in the end-to-end TOD system. Such models (Mehri et al., 2019) are pretrained on large-scale contextual text with general self-supervised objectives, e.g., language modeling and language denoising. However, in the task-oriented dialogue task, the dialogue state is structured information rather than contextual text. This inconsistency between the pretraining and downstream tasks hurts the performance of the PLMs on TOD benchmarks. To alleviate this problem, SOLOIST (Peng et al., 2020a) fine-tunes the pretrained GPT-2 on existing annotated TOD data and then transfers it to other task-oriented dialogue generation tasks. Similarly, NCM (Liu et al., 2021) first warms up a Transformer-based model with large-scale Reddit (Völske et al., 2017) data and then fine-tunes the model on TOD data. However, the existing TOD data is too limited to pretrain a large-scale language model.
To alleviate the problems above and advance pretrained language model research, especially its application to TOD, we propose an Ontology-aware PretrAined Language model (OPAL). From a high-level perspective, we can abstract the end-to-end TOD task into two sub-tasks: ontology-like triple recovery and next-text generation, which correspond to dialogue state tracking and response generation, respectively. In TOD, ontology-like triple recovery means predicting the corresponding value given the domain and the slot. Next-text generation is straightforward to design for contextual text: we simply mask the last sentence and ask the model to generate it. The challenge is how to design the ontology-like triple recovery task, which requires structured information extracted from the contextual text. In this paper, we utilize external OpenIE tools (Angeli et al., 2015; Kolluru et al., 2020) to extract relation triples (subject-relation-object) from the contextual text as the structured information. In most cases, a domain-slot-value triple can be regarded as a relation triple, for example, train-arrive-12:30. The relation triples extracted from the contextual text can therefore be regarded as ontology-like triples. We design a self-supervised ontology-like triple recovery task and a next-text generation task to pretrain the model.
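As an illustration of the two pretraining tasks, the sketch below builds one ontology-like triple recovery example and one next-text generation example from a short paragraph and its extracted triples; the serialization and mask tokens are assumptions for exposition, not the paper's exact format.

```python
# A sketch of building the two self-supervised pretraining examples from a
# paragraph and its extracted (subject, relation, object) triples. The
# serialization, delimiter, and mask tokens are assumptions for exposition,
# not the paper's exact input format.

MASK = "<mask>"

def triple_recovery_example(sentences, triples):
    """Ontology-like triple recovery: mask the object of each triple and
    ask the model to recover the full triples, given the text."""
    masked = " ; ".join(f"{s} {r} {MASK}" for s, r, _ in triples)
    source = " ".join(sentences) + " <triples> " + masked
    target = " ; ".join(f"{s} {r} {o}" for s, r, o in triples)
    return source, target

def next_text_example(sentences):
    """Next-text generation: hide the last sentence and generate it."""
    source = " ".join(sentences[:-1]) + f" {MASK}"
    target = sentences[-1]
    return source, target

sentences = [
    "The express train departs from Cambridge at noon.",
    "It arrives in London at 12:30.",
]
triples = [("train", "arrive", "12:30")]
print(triple_recovery_example(sentences, triples))
print(next_text_example(sentences))
```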
The main contributions of this paper are summarized below:
We leverage the external tool OpenIE to generate large amounts of TOD-like data, which is important for the development of pretrained language models in the TOD community.
To the best of our knowledge, this is the first work to design self-supervised tasks for end-to-end TOD tasks. It bridges the gap between pretrained language models and end-to-end TOD models.
The experimental results show that our proposed pretrained model OPAL achieves competitive performance even without any annotated TOD data in the pretraining process.
When further fine-tuned on annotated TOD data, our proposed method obtains substantial performance gains on the CamRest676 and MultiWOZ datasets.
2 End-to-End Task-Oriented Dialogue
As previously introduced, the pipeline dialogue system consists of four modules. The NLU module recognizes the user’s intents and the corresponding slot values. The DST module combines the previous state with the NLU results to update the current dialogue state. The DP module chooses discrete dialogue acts according to the dialogue state and the database state to respond to the user. The NLG module generates natural language based on the chosen dialogue acts. Such systems require at least four kinds of annotation: the user’s intent, the slot values, the dialogue state, and the dialogue acts. The heavy annotation labor enormously increases the cost of building a pipeline system, and its poor scalability further hinders pipeline dialogue system development.
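For reference, a minimal sketch of one pipeline turn is given below; the four module functions and the database handle are placeholders for trained components, and their signatures are assumptions for illustration only.

```python
# A minimal sketch of one pipeline turn, following the module descriptions
# above. The four module functions and the db object are placeholders for
# trained components; their signatures are assumptions for illustration.

def pipeline_turn(user_utterance, prev_state, nlu, dst, db, policy, nlg):
    intents, slot_values = nlu(user_utterance)      # NLU: intents and slot values
    state = dst(prev_state, intents, slot_values)   # DST: update the dialogue state
    db_state = db.query(state)                      # search the domain-related database
    dialogue_acts = policy(state, db_state)         # DP: choose discrete dialogue acts
    response = nlg(dialogue_acts)                   # NLG: surface realization
    return response, state
```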
3 Ontology-Aware Pretraining Method
The existing task-oriented dialogue data with a given ontology is too limited to pretrain a language model. To increase the scale of the pretraining data, we divide the pretraining process into two phases. The first phase pretrains the model on large-scale contextual text, where the triples of the text are extracted by the latest neural-based OpenIE6 (Kolluru et al., 2020). There is still a glaring discrepancy between contextual text and dialogue; for example, dialogue often contains co-reference and information ellipsis (Iyyer et al., 2017). We therefore pretrain the model on the smaller TOD data in the second phase to further decrease the gap between the pretrained model and the downstream tasks. The two complementary phases are introduced below:
Phase-1: Pretrained on Contextual Text
The triples extracted by OpenIE6 are filtered with the following rules (a code sketch of these rules follows the list):
Remove all stopwords in the triples and discard any triple in which one of the components is blank.
Remove any triple in which one of the components contains more than 4 words.
For triples that share the same subject-relation pair, randomly keep one and remove the others.
If more than two triples remain, randomly select two of them, so that no more than two triples are extracted per sentence.
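A minimal sketch of these filtering rules, assuming (subject, relation, object) string triples and a placeholder stopword list, might look as follows.

```python
import random

# A sketch of the four filtering rules above. The STOPWORDS set is a small
# illustrative placeholder (a full stopword list would be used in practice),
# and triples are (subject, relation, object) strings.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "are", "was", "were"}

def filter_triples(triples, max_words=4, max_triples=2, seed=None):
    rng = random.Random(seed)
    cleaned = []
    for subj, rel, obj in triples:
        # Rule 1: drop stopwords inside each component; discard blank components.
        parts = []
        for comp in (subj, rel, obj):
            words = [w for w in comp.split() if w.lower() not in STOPWORDS]
            parts.append(" ".join(words))
        if any(not p.strip() for p in parts):
            continue
        # Rule 2: discard triples with a component longer than max_words words.
        if any(len(p.split()) > max_words for p in parts):
            continue
        cleaned.append(tuple(parts))
    # Rule 3: keep one randomly chosen triple per subject-relation pair.
    by_key = {}
    for triple in cleaned:
        by_key.setdefault((triple[0], triple[1]), []).append(triple)
    deduped = [rng.choice(group) for group in by_key.values()]
    # Rule 4: keep at most max_triples triples per sentence.
    if len(deduped) > max_triples:
        deduped = rng.sample(deduped, max_triples)
    return deduped
```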
Phase-2: Pretrained on TOD Data
To further decrease the gap between the pretrained language model and the end-to-end model, we leverage the smaller task-oriented data in the pretraining process. Instead of extracting ontology-like triples with OpenIE6, we rely on the TOD ontology designed by dialogue experts: we directly use text matching to extract domain-slot-value triples from the dialogue context with the given ontology. Note that the triples extracted by text matching are not the dialogue state. In this pretraining phase, the system-mentioned ontology triples also have to be recovered, which is consistent with the previous pretraining process. In other words, different from SOLOIST (Peng et al., 2020a) and NCM (Liu et al., 2021), we do not need the annotated dialogue state and only utilize the given dialogue ontology to match the ontology-related triples. This property increases the generalization of the proposed ontology-aware pretraining method, since an ontology is much easier to obtain than dialogue state annotation. Figure 3 gives a toy example distinguishing the usage of the pretraining TOD data from the fine-tuning data of the end-to-end TOD task. During the pretraining process, the ontology triples are extracted from the context and cover only part of the given ontology. The ontology recovery task is to recover all the ontology-related triples; for example, the triple res-food-Chinese is recovered even though it is not in the dialogue state. During the fine-tuning process, there is an extra database search step.
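The text-matching step could be sketched as below; the ontology structure and the toy example are illustrative assumptions, and a real matcher would additionally handle value normalization and aliases.

```python
# A toy sketch of the phase-2 text-matching step: given an ontology mapping
# domains to slots to candidate values, collect every domain-slot-value
# triple whose value literally appears in the dialogue context. Real
# matching would also handle normalization and value aliases.

def match_ontology_triples(dialogue_context, ontology):
    context = " ".join(dialogue_context).lower()
    matched = set()
    for domain, slots in ontology.items():
        for slot, values in slots.items():
            for value in values:
                if value.lower() in context:
                    matched.add((domain, slot, value))
    return matched

ontology = {"restaurant": {"food": ["chinese", "italian"], "area": ["centre", "north"]}}
context = ["i want to find a chinese restaurant in the centre ."]
print(match_ontology_triples(context, ontology))
# {('restaurant', 'food', 'chinese'), ('restaurant', 'area', 'centre')}
```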
4 Experiments
We evaluate the proposed pretrained model OPAL on dialogue state tracking tasks and end-to-end TOD tasks. To further validate its effectiveness, we conduct an ablation study to analyze the effects of the different pretraining ingredients. Last but not least, we design resource-limited experiments to examine the sample efficiency of OPAL on the end-to-end TOD task and present cases to study its strengths.
4.1 Corpora
In phase-1 of the proposed OPAL, we use the Wikipedia corpus to pretrain the model, with 72.24 million samples collected from Wikipedia. We use five task-oriented dialogue datasets in the experiments, shown in Table 1, where Schema (Rastogi et al., 2020) and TaskMaster (Byrne et al., 2019) are leveraged in phase-2 of the pretraining process and the rest are downstream benchmarks. WOZ (Mrkšić et al., 2017) and CamRest676 (Wen et al., 2016) are single-domain task-oriented dialogue corpora, which are a well-studied DST benchmark and an end-to-end TOD benchmark, respectively. MultiWOZ is a multi-domain dialogue corpus, which is challenging due to its multi-domain setting and diverse language styles. Two versions of the MultiWOZ dataset are used in the experiments: MultiWOZ2.0 (Budzianowski et al., 2018) and MultiWOZ2.1 (Eric et al., 2019), where MultiWOZ2.1 fixes most of the DST annotation errors in MultiWOZ2.0. To compare fairly with the other baselines, we run the end-to-end TOD task on MultiWOZ2.0 and the DST task on both MultiWOZ2.0 and MultiWOZ2.1.
Table 1: Statistics of the TOD datasets (Usage: P = pretraining, F = fine-tuning).

| Dataset | #Dialogue | #Domain | #Slot | X-Domain | Usage |
|---|---|---|---|---|---|
| Schema | 22,825 | 17 | 123 | ✓ | P |
| TaskMaster | 17,304 | 7 | 281 | ✗ | P |
| MultiWOZ | 10,438 | 7 | 46 | ✓ | F |
| WOZ | 1,200 | 1 | 4 | ✗ | F |
| CamRest676 | 676 | 1 | 4 | ✗ | F |
4.2 Metrics
For the dialogue state tracking task, we use joint goal accuracy (JGA) to evaluate the models: a turn is counted as correct only if all the predicted slot values exactly match the gold annotation. For the end-to-end TOD task, three scores are reported: Inform, Success, and BLEU. Inform measures whether the system response provides the right entity. Success measures whether the system response provides all the requested slots. BLEU evaluates the naturalness of the generated system response. Following Budzianowski et al. (2018), the combined score (Combined) is also reported as Combined = (Inform + Success) × 0.5 + BLEU.
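The two headline metrics can be sketched as follows, assuming each dialogue state is represented as a dictionary of slot-value pairs; this is a simplified illustration rather than the official evaluation script.

```python
# A sketch of the two headline metrics. Each dialogue state is assumed to be
# a dict mapping "domain-slot" to value; Inform, Success, and BLEU are the
# corpus-level scores reported by the standard MultiWOZ evaluation scripts.

def joint_goal_accuracy(predicted_states, gold_states):
    """A turn counts as correct only if every predicted slot value matches the gold."""
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)

def combined_score(inform, success, bleu):
    """Combined = (Inform + Success) * 0.5 + BLEU (Budzianowski et al., 2018)."""
    return (inform + success) * 0.5 + bleu

print(combined_score(89.40, 81.10, 18.60))  # 103.85, OPAL's score on MultiWOZ2.0
```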
4.3 Experimental Setup
We implement the proposed OPAL with HuggingFace’s Transformers (Wolf et al., 2020) and BART, which is a pretrained denoising autoencoder. To validate the generalization of the proposed pretraining method, we use both the base version and the large version (BARTL) of BART as backbones, yielding OPAL and OPALL, respectively. The learning rate for both pretraining and fine-tuning is 1e-5, and the optimizer is AdamW. In phase-1 of the pretraining process, the total number of training steps is 280,000 and the batch size is 256. The model is pretrained on four P100 GPUs (16GB memory each); this pretraining process costs 260 hours (one epoch over Wikipedia). Similar to NCM (Liu et al., 2021), we pretrain for 100,000 steps in phase-2. During fine-tuning on the downstream tasks, the batch size is 32. We conduct significance tests (paired t-test) (Koehn, 2004) with five different seeds on the end-to-end TOD task, where the final results are trained with the default seed 42.
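A minimal sketch of this fine-tuning setup with HuggingFace Transformers is shown below; the public facebook/bart-base checkpoint and the simplified batching stand in for the actual OPAL weights and training pipeline.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

# A sketch of the fine-tuning setup described above (BART backbone, AdamW,
# learning rate 1e-5). The public facebook/bart-base checkpoint stands in
# for the ontology-aware pretrained OPAL weights, and batching details are
# simplified for illustration.

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(source_texts, target_texts):
    batch = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True)
    labels = tokenizer(target_texts, return_tensors="pt", padding=True, truncation=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    outputs = model(**batch, labels=labels)          # seq2seq cross-entropy loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```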
4.4 Baselines
We compare the proposed OPAL with the strong baselines, which hold the state-of-the-art (SOTA) performance on the DST and end-to-end TOD.
The DST models can be divided into two categories: classification methods and generation methods. The classification methods rely on the candidate slot values defined in the ontology and select values from them; their scalability is a severe problem for practical dialogue systems. The generation methods directly extract the values from the dialogue context and are directly comparable to the proposed OPAL.
For the end-to-end TOD tasks, the existing end-to-end TOD systems can be grouped into modular systems and sequential systems. The modular systems use multiple decoders to generate the downstream outputs independently and are trained in an end-to-end manner. The sequential systems formulate end-to-end TOD as a single sequence prediction problem. Sequicity (Lei et al., 2018) proposes a two-stage CopyNet method to generate the dialogue state and the system response. HRED-TS (Peng et al., 2019) proposes a teacher-student framework with a hierarchical recurrent encoder-decoder backbone. DAMD (Zhang et al., 2020b) designs a domain-aware multi-decoder network with a multi-action data augmentation method. The DSTC8 Winner (Ham et al., 2020) and SimpleTOD (Hosseini-Asl et al., 2020) successfully leverage the pretrained language model GPT-2 for end-to-end TOD modeling in a unified way. Inspired by SimpleTOD, SOLOIST (Peng et al., 2020a) fine-tunes GPT-2 with out-of-domain TOD data and obtains excellent transferability. MinTL-BART (Lin et al., 2020) and UBAR (Yang et al., 2021) improve the end-to-end TOD system by changing the input content without extra assumptions. HTER (Santra et al., 2021) improves the end-to-end TOD system with a hierarchical dialogue modeling mechanism. NCM (Liu et al., 2021) improves the decoder with a noisy channel model and proposes a two-stage pretraining method to warm up the Transformer-based model, which first pretrains on the Reddit corpus and then on task-oriented dialogues. NCM is the closest method to ours and serves as our main point of comparison.
4.5 Results on End-to-End TOD
We first fine-tune our pretrained models OPAL and OPALL on two well-studied end-to-end TOD datasets: MultiWOZ2.0 and CamRest676, as shown in Table 2 and Table 3. We compare our models with strong baselines in the end-to-end dialogue learning setting.
Table 2: End-to-end results on MultiWOZ2.0.

| Model | Model Size | Dialogue Act | Inform | Success | BLEU | Combined |
|---|---|---|---|---|---|---|
| Sequicity (Lei et al., 2018) | – | ✗ | 66.40 | 45.30 | 15.54 | 71.39 |
| HRED-TS (Peng et al., 2019) | – | ✓ | 70.00 | 58.00 | 17.50 | 81.50 |
| DSTC8 Winner (Ham et al., 2020) | 124M | ✓ | 73.00 | 62.40 | 16.00 | 83.50 |
| DAMD (Zhang et al., 2020b) | – | ✓ | 76.40 | 60.40 | 16.60 | 85.00 |
| SimpleTOD (Hosseini-Asl et al., 2020) | 117M | ✓ | 84.40 | 70.10 | 15.01 | 92.26 |
| SOLOIST (Peng et al., 2020a) | 117M | ✗ | 85.50 | 72.90 | 16.54 | 95.74 |
| MinTL-BART (Lin et al., 2020) | 406M | ✗ | 84.88 | 74.91 | 17.89 | 97.78 |
| UBAR (Yang et al., 2021) | 82M | ✗ | 88.20 | 79.50 | 16.43 | 100.28 |
| NCMB (Liu et al., 2021) | 116M | ✓ | 85.90 | 74.80 | 19.76 | 100.11 |
| NCML (Liu et al., 2021) | 292M | ✓ | 86.90 | 76.20 | 20.58 | 102.13 |
| HTER (Santra et al., 2021) | – | ✓ | 91.72 | 75.80 | 19.05 | 102.81 |
| BART | 139M | ✗ | 87.50 | 72.20 | 16.67 | 96.53 |
| OPAL | 139M | ✗ | 89.40 | 81.10 | 18.60 | 103.85 |
| BARTL | 406M | ✗ | 86.20 | 70.30 | 17.01 | 95.26 |
| OPALL | 406M | ✗ | 88.00 | 82.80 | 20.80 | 106.20 |
Table 3: End-to-end results on CamRest676.

| Model | Inform | Success | BLEU | Combined |
|---|---|---|---|---|
| Sequicity | 92.30 | 85.03 | 21.40 | 110.20 |
| SOLOIST | 94.70 | 87.10 | 25.50 | 116.40 |
| NCMB | 94.30 | 85.20 | 25.98 | 115.73 |
| NCML | 95.40 | 85.30 | 26.89 | 117.24 |
| BART | 96.31 | 79.41 | 24.74 | 112.61 |
| OPAL | 96.32 | 89.86 | 26.56 | 119.65 |
To validate the generalization of our proposed ontology-aware pretraining method, we use the base-version and large-version BART as the backbones of the pretraining models. Compared with models fine-tuned from the original BART backbones, the proposed OPAL and OPALL achieve 7.32 and 10.94 overall performance gains on the MultiWOZ2.0 dataset and an absolute 7.04-point gain on the CamRest676 dataset. SOLOIST (Peng et al., 2020a) and NCM (Liu et al., 2021) are the two closest methods to OPAL, both of which leverage out-of-domain TOD data in pretraining Transformer-based models; different from our method, these two approaches rely on DST annotation. Our proposed models still obtain the best task completion (Inform and Success) and only slightly lower BLEU scores than NCM. Compared with all baselines, our proposed models reach new SOTA overall performance (Combined) on both datasets. The large-version model OPALL outperforms the base-version OPAL by 2.35 on the combined score. For fair comparison with the other baselines, we only report the base-version OPAL’s performance in the following experiments.
Compared with NCMB, our proposed OPAL has higher task-completion performance (revealed by (Inform + Success) × 0.5). However, the BLEU score of OPAL is lower than that of NCMB. Figure 4 shows the correlation between the BLEU score and task-completion ability. The fine-tuned model tries to balance the BLEU score against task-completion ability: as training progresses, the BLEU score decreases while the task-completion ability improves. The main reason is that there are different expressions for the same system intention, which is the typical one-to-many mapping problem (Zhao and Eskenazi, 2018) in dialogue generation. The final fine-tuned model has stronger task-completion ability but sacrifices dialogue diversity. In the evaluation, we choose the model with the highest combined score.
4.6 Results on DST
The classification-based DST models and generation-based DST models are shown in the upper and lower parts of Table 4 and Table 5, respectively. Table 4 reports the DST results on the MultiWOZ2.0 and MultiWOZ2.1 datasets. Our proposed OPAL obtains the highest JGA among all the generation-based baselines on both datasets. Compared with the classification-based SOTA model FPDSC (Zhou et al., 2021), OPAL even achieves a 0.93% JGA improvement on the MultiWOZ2.0 dataset. Table 5 shows the DST results on WOZ, which is a single-domain dataset with only 4 slots. The computational complexity of the classification-based models is proportional to the number of candidate slot values, so these models have the advantage of predicting slot values from valid candidates in simpler dialogue domains. This is the main reason that the classification-based models are more popular on the single-domain WOZ dataset. Compared with the well-designed classification-based model BERT-DST (Lai et al., 2020), OPAL has a 0.7% JGA gain, and OPAL achieves 6.7% higher JGA than the generation-based model TRADE (Wu et al., 2019). Note that we do not compare the proposed model with variants (Yu et al., 2020; Li et al., 2020; Dai et al., 2021) of the data augmentation methods based on TripPy (Heck et al., 2020). In this paper, we focus on the end-to-end task-oriented dialogue generation task; our model is fully compatible with these data augmentation methods, and we will try them on our model in the future.
Table 4: DST results (JGA) on MultiWOZ2.0 and MultiWOZ2.1. The upper block (FJST through FPDSC) lists classification-based models; the lower block lists generation-based models.

| Model | MultiWOZ2.0 | MultiWOZ2.1 |
|---|---|---|
| FJST (Eric et al., 2017) | 40.20 | 38.00 |
| HyST (Goel et al., 2019) | 44.24 | – |
| SUMBT (Lee et al., 2019a) | 46.65 | – |
| TOD-BERT (Wu et al., 2020) | – | 48.00 |
| DST-Picklist (Zhang et al., 2020a) | – | 53.30 |
| SST (Chen et al., 2020a) | 51.17 | 55.23 |
| TripPy (Heck et al., 2020) | – | 55.29 |
| FPDSC (Zhou et al., 2021) | 53.17 | 59.07 |
| TRADE (Wu et al., 2019) | 48.62 | 45.60 |
| COMER (Ren et al., 2019) | 48.79 | – |
| NADST (Le et al., 2020) | 50.52 | 49.04 |
| DSTQA (Zhou and Small, 2019) | 51.44 | 51.17 |
| SOM-DST (Kim et al., 2020) | 51.38 | 52.57 |
| MinTL-BART (Lin et al., 2020) | 52.10 | 53.62 |
| SimpleTOD (Hosseini-Asl et al., 2020) | – | 55.72 |
| UBAR (Yang et al., 2021) | 52.59 | 56.20 |
| SOLOIST (Peng et al., 2020a) | 53.20 | 56.85 |
| OPAL | 54.10 | 57.05 |
Table 5: DST results (JGA) on WOZ.

| Model | JGA (WOZ) |
|---|---|
| NBT (Mrkšić et al., 2017) | 84.4 |
| GLAD (Zhong et al., 2018) | 88.1 |
| GCE (Nouri and Hosseini-Asl, 2018) | 88.5 |
| G-SAT (Balaraman and Magnini, 2019) | 88.7 |
| StateNet (Ren et al., 2018) | 88.9 |
| BERT-DST (Lai et al., 2020) | 90.5 |
| TRADE (Wu et al., 2019)† | 84.5 |
| OPAL | 91.2 |
5 Analysis
The analysis experiments evaluate the proposed OPAL on the end-to-end TOD tasks to answer three main questions: Q1: What role do the different pretraining corpora (Wikipedia and out-of-domain TOD) play? Q2: What is the main factor that affects the pretrained model? Q3: Does OPAL have a higher sample efficiency than the original BART in the limited-resource setting?
5.1 Ablation Study
Table 6 reports the ablation study of the proposed OPAL, which has two pretraining phases: phase-1 pretrains on contextual texts and phase-2 pretrains on task-oriented dialogues with the ontology-aware pretraining method. To evaluate the effects of these two corpora, we separately pretrain the backbone (BART) on only the contextual texts or only the task-oriented dialogues, yielding pretrained models named WIKI and TOD, respectively. Both WIKI and TOD still outperform the original BART by a large margin, which indicates the effectiveness of the proposed ontology-aware pretraining method. Notably, WIKI, which sees no TOD data during pretraining, achieves performance competitive with NCML. Compared with NCMB, which has a parameter scale similar to our model, WIKI has an apparent advantage on both end-to-end TOD datasets. WIKI performs slightly better than TOD overall: WIKI suffers from never seeing TOD data, while TOD suffers from the small scale of its pretraining data. Our proposed OPAL adopts a two-stage pretraining method to solve both problems, a classic example of “one plus one greater than two”. The two-stage pretrained model OPAL outperforms the single-corpus models by 1.65 and 2.70 combined-score points on MultiWOZ2.0. This indicates that the ontology-aware contextual text corpus and the ontology-aware TOD data are complementary.
Table 6: Ablation study on MultiWOZ2.0.

| Model | Inform | Success | BLEU | Combined |
|---|---|---|---|---|
| OPAL | 89.40 | 81.10 | 18.60 | 103.85 |
| Effect of Pretrained Corpora | | | | |
| WIKI | 88.40 | 79.50 | 18.28 | 102.23 |
| TOD | 89.00 | 78.20 | 17.55 | 101.15 |
| REDD | 86.90 | 77.10 | 16.93 | 98.93 |
| Effect of Pretrained Tasks | | | | |
| w/o NTG | 87.00 | 80.80 | 16.88 | 100.79 |
| w/o OR | 85.20 | 79.50 | 17.52 | 99.88 |
| Effect of IE Tools | | | | |
| OpenIE-Stanford | 88.40 | 79.20 | 17.34 | 101.14 |
| BART | 87.50 | 72.20 | 16.67 | 96.52 |
To further compare Wikipedia with the Reddit corpus, we also use the same scale of Reddit data for phase-1 pretraining, yielding a model named REDD. WIKI is ahead of REDD on all the automatic metrics (BLEU and task completion). To analyze the underlying factor, we calculate the proportion of extracted triples that contain a pronoun as subject or object. As shown in Figure 5, 31.0% of the triples extracted from Reddit contain pronouns, and the most frequent pronoun is “i”, which accounts for 31%. In contrast, only 0.7% of the triples extracted from Wikipedia contain pronouns. In TOD, the domains and slot values in the dialogue states are specific entities rather than pronouns, so the meaningless pronouns widen the gap between the pretraining data and the TOD model. The co-reference and information ellipsis in Reddit also seriously hurt the performance of the external information extraction tool. This is the main reason that we choose Wikipedia as the pretraining corpus.
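The pronoun statistic can be sketched as follows; the pronoun list is an illustrative assumption.

```python
# A sketch of the pronoun statistic reported above: the fraction of extracted
# triples whose subject or object contains a pronoun. The pronoun list is an
# illustrative assumption.

PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "me", "him", "her", "us", "them"}

def pronoun_rate(triples):
    def has_pronoun(component):
        return any(word.lower() in PRONOUNS for word in component.split())
    hits = sum(1 for subj, _, obj in triples if has_pronoun(subj) or has_pronoun(obj))
    return hits / len(triples) if triples else 0.0

print(pronoun_rate([("i", "need", "a hotel"), ("train", "arrive", "12:30")]))  # 0.5
```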
We also evaluate the effects of the two pretraining tasks: ontology-like triple recovery (OR) and next-text generation (NTG). In the “w/o OR” study, we directly remove the extracted triples from the input; “w/o NTG” means that the model only needs to recover the masked triples. The results show that the OR task and the NTG task benefit task completion and contextual consistency, respectively. In the complex dialogue domain, the single-task pretrained models cannot achieve performance comparable with OPAL, which indicates that both designed tasks are essential for reducing the gap between the pretrained model and the TOD model.
We further validate the effects of different OpenIE tools. In our main experiments, we use the latest neural-based IE tool OpenIE6; a popular rule-based alternative is OpenIE-Stanford. Compared with OpenIE-Stanford, the neural-based OpenIE6 achieves promising performance improvements on well-studied IE benchmarks (Kolluru et al., 2020). As shown in Table 6, WIKI with OpenIE6 is also better than the OpenIE-Stanford variant on all metrics. However, the improvement from the neural-based OpenIE6 is limited, which indicates that the proposed pretraining method is not sensitive to IE accuracy.
5.2 Sample Efficiency
Under the different resource-limited settings, the proposed OPAL achieves the best performance among the baselines in terms of task completion (Inform and Success), response naturalness (BLEU), and overall performance, as shown in Figure 6. This indicates the sample efficiency of the proposed ontology-aware pretraining method. When the training data is extremely limited (only 80 dialogues), TOD improves overall performance by a large margin (an absolute 3.2-point improvement) over WIKI. This improvement comes from task-completion ability, which indicates that the TOD data can increase the generalization of the pretrained model for end-to-end TOD tasks. As the training data increases, WIKI, pretrained on the large-scale contextual text data, obtains a larger performance gain than TOD. When the training data reaches 1,600 dialogues, WIKI obtains an absolute 4.8-point gain over TOD, which indicates that the scale of the pretraining data influences the growth potential of the pretrained model. On the other hand, TOD outperforms WIKI on task-completion ability in three of the four data-limitation cases, whereas WIKI achieves better fluency (revealed by BLEU). This indicates that TOD benefits task-completion ability and WIKI facilitates fluency and contextual consistency.
5.3 Case Study
Our proposed pretrained model OPAL improves task completion and contextual consistency over the original BART. As shown in Figure 7, the dialogue model fine-tuned from BART fails to respond to one of the user’s requests (the address), whereas our proposed OPAL accurately provides all the requested information. As shown in Figure 8, at the first turn, our proposed OPAL provides a response closer to the oracle than BART does, which indicates that OPAL performs better on response prediction. At the second turn, the dialogue system needs to provide the correct entity to the user; the original BART model fails to do so, while our proposed OPAL recommends an entity to the user in time. Compared with the original BART, the proposed OPAL has an obvious advantage in modeling task-oriented dialogue: it not only generates precise responses but also completes the dialogue task successfully. This performance improvement comes from the two-stage ontology-aware pretraining method on the large-scale contextual text with automatically extracted ontology-like triples and the small task-oriented dialogue data with a given ontology.
6 Related Work
End-to-End TOD Systems
Early studies on end-to-end task-oriented dialogue systems either design a neural network-based model or propose a reinforcement learning method that uses the reward signal to update the whole system. In these systems, the modules of the pipeline TOD system still exist and require separate annotation. Such systems usually achieve promising performance on one specific task but have poor transferability. With the emergence of multi-domain TOD benchmarks such as MultiWOZ, the generative DST method has replaced the classification method as the mainstream in recent years due to its better generalization ability, which encourages formulating end-to-end TOD as a text-to-text task. Lei et al. (2018) propose a two-stage CopyNet to generate the dialogue state and response jointly with a single seq2seq architecture. Zhang et al. (2020b) design a data augmentation method to increase response diversity, where the dialogue state, dialogue act, and response are generated with a shared encoder and different decoders; note that our proposed model does not use the annotated dialogue acts. Recently, some works (Hosseini-Asl et al., 2020; Peng et al., 2020a; Lin et al., 2020; Yang et al., 2021) directly leverage pretrained language models (such as GPT-2 and BART) as the end-to-end TOD model in a unified way. Liu et al. (2021) propose a Transformer-based noisy channel method to model the response prior and use Reddit data and TOD data to warm up the TOD model. Most recently, Su et al. (2021) formulate all the end-to-end TOD tasks as unified generation tasks learned in a multitask manner, and He et al. (2021) propose a semi-supervised method to explicitly learn dialogue policy from limited labeled dialogues. Our proposed pretraining method is compatible with these end-to-end TOD training strategies.
Self-supervised Learning for Dialogue System
Recent advances in self-supervised learning have witnessed the success of pretrained language models (PLMs) on language understanding and generation tasks. Since the large-scale comment data in Reddit can be regarded as a kind of chit-chat dialogue, self-supervised methods were first applied to chit-chat systems. DialoGPT (Zhang et al., 2020c) adapts the pretrained GPT-2 to large-scale dialogue data. PLATO (Bao et al., 2020) proposes a discrete latent variable pretraining method to address the one-to-many problem of dialogue systems. Meena (Adiwardana et al., 2020) pretrains a large-scale model on dialogue data and demonstrates its conversational ability. SC-GPT (Peng et al., 2020b) uses a pretrained language model to convert a dialogue act into a natural language response. For task-oriented dialogue, large-scale domain-specific dialogue data is inaccessible, so TOD models (Jiang et al., 2020; Wu et al., 2020; Yu et al., 2020) are usually pretrained on chit-chat dialogues (Reddit) first and then fine-tuned on smaller released or synthetic TOD data. Different from the above PLMs, we pretrain the TOD model directly on large-scale contextual text: we extract relation triples from the contextual text as grounded ontology-like knowledge and design adaptive self-supervised learning tasks for end-to-end TOD.
Knowledge-grounded PLMs
Recently, an important branch of PLM research studies how to integrate knowledge into the PLM. ERNIE (Zhang et al., 2019) utilizes an external knowledge graph to recognize the type of a mentioned entity, adds an entity-type embedding layer as one of the input representations, and improves the masking mechanism by masking a whole entity directly to enhance knowledge-related representations. Similarly, Rosset et al. (2020) propose a knowledge-aware language model (KALM) with a decoder-only Transformer-based architecture, like GPT; KALM introduces an entity tokenizer that directly segments popular entities as single tokens. Some fields, such as medicine, involve considerable proprietary information, and it is crucial to integrate this proprietary knowledge into the pretrained model; SMedBERT (Zhang et al., 2021) incorporates deep structured semantic knowledge from the neighbors of linked entities. In this paper, we aim to utilize the external tool OpenIE6 to produce large amounts of TOD-like data to bridge the gap between the pretraining task and the end-to-end TOD system. The proposed ontology-like triple recovery task only masks the object values in the extracted triples, rather than randomly masking mentioned entities.
7 Conclusion and Future Work
In this paper, we propose an ontology-aware pretraining method for modeling end-to-end task-oriented dialogue. Since the scale of the existing task-oriented dialogue data falls far short of what pretraining requires, we leverage the external tool OpenIE6 to extract ontology-like knowledge from large-scale contextual texts. To bridge the gap between the pretrained model and the end-to-end TOD model, we design two adaptive self-supervised learning tasks: ontology-like triple recovery and next-text generation. The pretraining process is divided into two phases, where phase-1 pretrains on the large-scale ontology-aware contextual texts and phase-2 pretrains on the ontology-aware TOD data. Our proposed OPAL achieves excellent performance on both end-to-end TOD tasks and dialogue state tracking tasks. In the future, we will evaluate the effect of different ontology-building methods.
Acknowledgments
We would like to thank the TACL team and four anonymous reviewers for their insightful comments. This work has been supported by China NSFC Projects (No.62120106006, No.62106142, and No.92048205), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and CCF-Tencent Open Fund and Startup Fund for Youngman Research at SJTU (SFYR at SJTU).