OPAL: Ontology-Aware Pretrained Language Model for End-to-End Task-Oriented Dialogue

This paper presents an ontology-aware pretrained language model (OPAL) for end-to-end task-oriented dialogue (TOD). Unlike chit-chat dialogue models, task-oriented dialogue models comprise at least two task-specific modules: a dialogue state tracker (DST) and a response generator (RG). The dialogue state consists of domain-slot-value triples, which are regarded as the user's constraints for searching the domain-related databases. Large-scale task-oriented dialogue data with annotated structured dialogue states is usually inaccessible, which hinders the development of pretrained language models for task-oriented dialogue. We propose a simple yet effective pretraining method to alleviate this problem, which consists of two pretraining phases. The first phase pretrains on large-scale contextual text data, where the structured information of the text is extracted by an information extraction tool. To bridge the gap between the pretraining method and downstream tasks, we design two pretraining tasks: ontology-like triple recovery and next-text generation, which simulate DST and RG, respectively. The second phase is to fine-tune the pretrained model on the TOD data. The experimental results show that our proposed method achieves a substantial improvement and obtains competitive performance even without any TOD data on the CamRest676 and MultiWOZ benchmarks.


Introduction
A task-oriented dialogue system aims to assist users in accomplishing a specific task through natural language interaction, e.g., reserving a hotel or booking flight tickets. With the popularity of industrial dialogue systems, the task-oriented dialogue system has attracted extensive attention in research. Existing task-oriented dialogue systems can be classified into two categories: pipeline and end-to-end. The pipeline TOD system (Ultes et al., 2017; Weisz et al., 2018) is composed of four modules: natural language understanding (NLU) (Quirk et al., 2015), dialogue state tracking (DST) (Xu et al., 2020; Chen et al., 2020c), dialogue policy (DP) (Chen et al., 2018, 2019, 2020b) and natural language generation (NLG) (Wen et al., 2015; Li et al., 2016; Zhao et al., 2017). Since each module of the system is trained separately and executed sequentially, it faces two serious issues: error accumulation and high annotation cost. Thus, the end-to-end dialogue system (Lee et al., 2019b; Zhao et al., 2019), which formulates task-oriented dialogue as a sequence-to-sequence task, has gradually become the research focus. The dialogue state, database (DB) state and the corresponding system response are directly concatenated together and flattened as a token sequence. The DB state is the status of the domain-related database searched with the dialogue state, as shown in Figure 1.
Thanks to the success of pretrained language models (Kenton and Toutanova, 2019; Raffel et al., 2020), their effective application has shed light on open-domain (chit-chat) dialogues (Bao et al., 2020; Adiwardana et al., 2020). Nevertheless, utilizing such pretrained language models in TOD systems remains challenging due to the limited TOD data with annotated dialogue states. Unlike open-domain dialogue, TOD is restricted by a dialogue ontology, which defines the dialogue domains, the slots and their candidate values. The TOD system needs to predict the dialogue state and feed back the DB content to accomplish a task. The dialogue state is structured information extracted from the dialogue context, in the form of a set of domain-slot-value triples.
Recently, some works (Hosseini-Asl et al., 2020; Lin et al., 2020) have tried to directly leverage pretrained language models (PLMs), e.g., GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020), in the end-to-end TOD system. Such models (Mehri et al., 2019) are pretrained on large-scale contextual text with general self-supervised methods, e.g., language modeling and language denoising. However, in the task-oriented dialogue task, the dialogue state is structured information rather than contextual text. This inconsistency between the pretraining and downstream tasks hurts the performance of the PLMs on the TOD benchmarks. To alleviate this problem, SOLOIST (Peng et al., 2020a) fine-tunes the pretrained GPT-2 with the existing annotated TOD data and then transfers it to other task-oriented dialogue generation tasks. Similarly, NCM (Liu et al., 2021) first warms up the Transformer-based model with large-scale Reddit (Völske et al., 2017) data and then fine-tunes the model on the TOD data. However, the existing TOD data is too limited to pretrain a large-scale language model.
To alleviate the problems above and advance pretrained language model research, especially its application to TOD, we propose an Ontology-aware PretrAined Language model (OPAL). From a high-level perspective, we can abstract the end-to-end TOD task into two sub-tasks: ontology-like triple recovery and next-text generation, which correspond to the dialogue state tracking task and the response generation task, respectively. Ontology-like triple recovery in TOD means predicting the corresponding value given the domain and the slot. The next-text generation task is easy to design for contextual text: it is fulfilled directly by masking the last sentence. The challenge is how to design the ontology-like triple recovery task, which requires obtaining structured information from the contextual text. In this paper, we utilize external OpenIE tools (Angeli et al., 2015; Kolluru et al., 2020) to extract relation triples (subject-relation-object) from the contextual text as the structured information. In most cases, a domain-slot-value triple can be regarded as a relation triple, e.g., train-arrive-12:30. The relation triples extracted from the contextual text can therefore be regarded as ontology-like triples. We design a self-supervised ontology-like triple recovery task and a next-text generation task to pretrain the model.
The main contributions of this paper are summarized as below:
• We leverage the external tool OpenIE to generate large amounts of TOD-like data, which is important for the development of pretrained language models in the TOD community.
• To the best of our knowledge, this is the first work to design self-supervised tasks for end-to-end TOD tasks. It bridges the gap between pretrained language models and end-to-end TOD models.
• The experimental results show that our proposed pretrained model OPAL achieves competitive performance even without any annotated TOD data in the pretraining process.
• Further fine-tuned on the annotated TOD data, our proposed method obtains substantial performance gains on the CamRest676 and MultiWOZ datasets.
End-to-End Task-Oriented Dialogue
As previously introduced, the pipeline dialogue system consists of four modules. The NLU module recognizes the user's intents and the corresponding slot values. The DST module combines the previous state and the results of the NLU to update the current dialogue state. The DP module chooses the discrete dialogue acts according to the dialogue state and the database state to respond to the user. The NLG module generates the natural language response based on the chosen dialogue acts. There are at least four kinds of annotation in such systems: the user's intent, the slot value, the dialogue state, and the dialogue act. The heavy annotation labor enormously increases the cost of building a pipeline system. Its poor scalability further hampers the development of pipeline dialogue systems.

[Figure 2: Pretraining on contextual text with two tasks, Ontology-like Triple Recovery (object values masked) and Next Text Generation. Example contextual text: "Kwai Tsing is an area of Hong Kong. The district has the third least educated residents. And their income is below average. Kwai Tsing did not exist … 1980s."]

Compared with the pipeline system, the end-to-end task-oriented dialogue system in this paper only requires the annotated dialogue state. The end-to-end TOD system is fed with the dialogue context c and generates the dialogue state b and the delexicalized response r, where the database (DB) state d is retrieved from the results searched with b. A delexicalized response is one in which the specific slot values are replaced with the corresponding slot placeholders. The lexicalized response is recovered from the delexicalized one with the generated dialogue state and DB state. The training sample at each dialogue turn of the end-to-end TOD model is defined as:

$$x = (c, b, d, r). \qquad (1)$$

For task-oriented dialogue, the dialogue context c consists not only of the dialogue history h but also of the dialogue ontology schema s, which is usually ignored by the existing end-to-end models. The ontology can be seen as prior knowledge designed by the dialogue expert, which defines the dialogue domain, slots, and candidate values. The end-to-end TOD model needs to fulfill two sub-tasks: dialogue state tracking (DST) and response generation (RG). Formally, the learning goal of the TOD model is to maximize the joint probability p_θ(x), which can be factorized in an auto-regressive manner as:

$$
\begin{aligned}
p_\theta(x) &= p(h, s, b, d, r) && (2) \\
&= p(r \mid d, b, h, s)\, p(d \mid b, h, s)\, p(b \mid h, s)\, p(h, s) && (3) \\
&= \underbrace{p(r \mid d, b, h, s)}_{\text{RG}}\; \underbrace{p(b \mid h, s)}_{\text{DST}}\; p(h, s), && (4)
\end{aligned}
$$

where the factorization from (3) to (4) is based on the fact that the database-lookup operation is a deterministic process, so that p(d | b, h, s) = 1. The term p(h, s) is the prior probability of the paired dialogue and ontology (the input of the model), which depends on the distribution of the (pre-)training data and is independent of the model. The dialogue state tracker intrinsically extracts the ontology-related constraints demanded by the user, where the ontology schema is given in advance.
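The delexicalization step described above can be illustrated with a minimal sketch. The `[value_<slot>]` placeholder format follows the examples that appear in this paper's figures; the function names, the dictionary-based slot representation and the plain substring replacement are assumptions made for illustration and are not taken from the authors' implementation.

```python
def delexicalize(response: str, slot_values: dict[str, str]) -> str:
    """Replace each slot value that occurs in the response with a [value_<slot>] placeholder.

    Assumption: plain, case-sensitive substring replacement. A real system
    would need more careful matching (tokenization, casing, spelling variants).
    """
    out = response
    # Replace longer values first so that a short value never splits a longer one.
    for slot, value in sorted(slot_values.items(), key=lambda kv: -len(kv[1])):
        out = out.replace(value, f"[value_{slot}]")
    return out


def lexicalize(template: str, slot_values: dict[str, str]) -> str:
    """Recover a lexicalized response by filling the placeholders back in."""
    out = template
    for slot, value in slot_values.items():
        out = out.replace(f"[value_{slot}]", value)
    return out


# Toy values in the style of the restaurant example used in this paper.
values = {"price": "expensive", "food": "Chinese"}
response = "There are several restaurants serve expensive food."
template = delexicalize(response, values)
restored = lexicalize(template, values)
```

In this sketch only the `price` value occurs in the response, so only that value is replaced by its placeholder; `lexicalize` then restores the original sentence from the template and the slot values.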

Ontology-Aware Pretraining Method
The existing task-oriented dialogue data with a given ontology is too limited to pretrain a language model. To increase the scale of the pretraining data, we divide the pretraining process into two phases. The first phase pretrains the model on large-scale contextual text, whose triples are extracted by the latest neural-based OpenIE6 (Kolluru et al., 2020). There is still a glaring discrepancy between contextual text and dialogue. For example, dialogue frequently contains co-reference and information ellipsis (Iyyer et al., 2017). We therefore pretrain the model on the smaller TOD data in the second phase to further decrease the gap between the pretrained model and the downstream tasks. The two phases are complementary to each other and are introduced below.

Phase-1: Pretrained on Contextual Text. In traditional dialogue pretrained models (Zhang et al., 2020c), crawled Reddit data is commonly used as the pretraining corpus. However, Reddit data contains many co-references and information ellipses, which seriously degrade the performance of the external information extraction tool. In contrast to dialogue data, co-reference and information ellipsis are infrequent in the contextual text of Wikipedia (https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2). Section 5.1 gives more details that validate the effects of the pretraining corpora. We use the neural-based OpenIE6 to extract the ontology-like knowledge of contextual text automatically. We directly treat the extracted subject-relation-object triples as domain-slot-value triples. As shown in Figure 2, the object values in the extracted ontology are masked during the pretraining process. One of our designed pretraining tasks is to recover the ontology-like triples (named ontology-like triple recovery, OR), which is similar to the DST task. To increase the inference ability of the pretrained model, we mask the next text (one or two sentences, chosen randomly) and push the model to infer it (named next-text generation, NTG), which is similar to the RG task. Thus, the pretraining sample is composed of four elements: the masked ontology-like triples ŝ, the masked document context ĥ, the ontology-like triples b and the next text r. Similar to Equation 4, the goal of the pretraining model is to maximize the joint probability:

$$p(\hat{h}, \hat{s}, b, r) = \underbrace{p(r \mid b, \hat{h}, \hat{s})}_{\text{NTG}}\; \underbrace{p(b \mid \hat{h}, \hat{s})}_{\text{OR}}\; p(\hat{h}, \hat{s}). \qquad (5)$$

[Figure 3: Toy example contrasting the TOD pretraining data with the end-to-end fine-tuning data.
Encoder: [S] res [R] price [O] [mask] [S] res [R] food [O] [mask] [SEP] User: I need to find an expensive restaurant. System: Do you have a cuisine preference, Chinese food? User: No, I dont care. System: [mask]
Decoder (pretraining): [S] res [R] price [O] expensive [S] res [R] food [O] Chinese [SEP] System: There are several restaurants serve expensive food.
Decoder (fine-tuning): Res matched 7 [SEP] System: There are several restaurants serve [value_price] food.]

To obtain qualified triples of a sentence using OpenIE6, we remove all the stopwords in the triples and filter out the triples in which any component is blank. This is also the main reason that we do not choose Reddit in this pretraining phase: its text contains many pronouns, which makes it hard to extract qualified triples. This pretraining phase vastly increases the scale of the pretraining data. There are four steps to filter the triples of a sentence:
• Remove all the stopwords in the triples and filter out the triples in which any component is blank.
• Remove the triples in which any component contains more than 4 words.
• For triples that share the same subject-relation pair, randomly select one of them and remove the others.
• If more than two triples remain, randomly select two of them, so that no more than two triples are extracted from a sentence.
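The four filtering steps above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the tiny `STOPWORDS` set, the function names, the seeded random generator and the toy triples are all hypothetical, and the triples are assumed to have already been produced by an extraction tool such as OpenIE6.

```python
import random

# Tiny illustrative stopword list (assumption); a real run would use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in", "to", "has", "have"}


def strip_stopwords(text: str) -> str:
    """Drop stopwords from one triple component."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)


def filter_triples(triples, max_words=4, max_triples=2, rng=None):
    """Apply the four filtering steps to the (subject, relation, object) triples of one sentence."""
    rng = rng or random.Random(0)
    # Step 1: remove stopwords, then drop triples with a blank component.
    cleaned = [tuple(strip_stopwords(part) for part in t) for t in triples]
    cleaned = [t for t in cleaned if all(part.strip() for part in t)]
    # Step 2: drop triples in which any component has more than max_words words.
    cleaned = [t for t in cleaned if all(len(part.split()) <= max_words for part in t)]
    # Step 3: keep one randomly chosen triple per (subject, relation) pair.
    groups = {}
    for s, r, o in cleaned:
        groups.setdefault((s, r), []).append((s, r, o))
    unique = [rng.choice(group) for group in groups.values()]
    # Step 4: keep at most max_triples triples, sampled at random.
    if len(unique) > max_triples:
        unique = rng.sample(unique, max_triples)
    return unique


# Hypothetical extractor output for one sentence (toy data, not from any corpus).
toy_triples = [
    ("the hotel", "is located in", "the city centre"),
    ("the hotel", "is located in", "the north"),        # same subject-relation pair
    ("the train", "arrives at", "12:30"),
    ("it", "is", "a"),                                   # blank after stopword removal
    ("the restaurant", "serves", "very cheap but extremely tasty Chinese food"),  # too long
    ("the museum", "opens on", "Monday"),
]
kept = filter_triples(toy_triples)
```

Because steps 3 and 4 are random, which triples survive depends on the random generator; the sketch only guarantees that at most two triples remain, that no component is blank or longer than four words, and that no subject-relation pair is repeated.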
Phase-2: Pretrained on TOD Data. To further decrease the gap between the pretrained language model and the end-to-end model, we leverage the smaller task-oriented data in the pretraining process. Instead of extracting the ontology-like triples with OpenIE6, we rely on the TOD ontology, which is designed by dialogue experts. We directly use a text matching method to extract the domain-slot-value triples from the dialogue context with the given ontology. Note that the triples extracted with the text matching operation are not the dialogue state. In this pretraining phase, the system-mentioned ontology triples also have to be recovered, which is consistent with the previous pretraining process. In other words, different from SOLOIST (Peng et al., 2020a) and NCM (Liu et al., 2021), we do not need the annotated dialogue state and only utilize the given dialogue ontology to match the ontology-related triples. This attribute increases the generalization of the proposed ontology-aware pretraining method, since the ontology is much easier to obtain than the dialogue state annotation. Figure 3 gives a toy example that distinguishes the use of the pretraining TOD data from the fine-tuning data of the end-to-end TOD task. In the pretraining process, the ontology is extracted from the context and is just a part of the given ontology. Ontology recovery means recovering all the ontology-related triples, e.g., the triple res-food-Chinese, which is not in the dialogue state. In the fine-tuning process, there is an extra database searching step.
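A minimal sketch of the text-matching step and of the masked triple serialization is given below. The `[S] … [R] … [O] …` layout and the `[mask]` and `[SEP]` tokens follow the toy example of Figure 3; the `ONTOLOGY` dictionary, the function names and the case-insensitive substring matching are assumptions for illustration, since the exact matching procedure is not specified here.

```python
# Hypothetical toy ontology: domain -> slot -> candidate values.
ONTOLOGY = {
    "res": {
        "price": ["cheap", "moderate", "expensive"],
        "food": ["Chinese", "Italian"],
    }
}


def match_triples(context: str, ontology: dict) -> list[tuple[str, str, str]]:
    """Return every (domain, slot, value) whose value string occurs in the context.

    Assumption: naive case-insensitive substring matching. Both user-mentioned
    and system-mentioned values are matched, so the result is not a dialogue state.
    """
    text = context.lower()
    found = []
    for domain, slots in ontology.items():
        for slot, values in slots.items():
            for value in values:
                if value.lower() in text:
                    found.append((domain, slot, value))
    return found


def serialize(triples, mask_objects: bool) -> str:
    """Flatten triples into the [S]/[R]/[O] format, optionally masking the object values."""
    parts = []
    for subj, rel, obj in triples:
        shown = "[mask]" if mask_objects else obj
        parts.append(f"[S] {subj} [R] {rel} [O] {shown}")
    return " ".join(parts) + " [SEP]"


context = (
    "User: I need to find an expensive restaurant. "
    "System: Do you have a cuisine preference, Chinese food? "
    "User: No, I dont care."
)
triples = match_triples(context, ONTOLOGY)
encoder_prefix = serialize(triples, mask_objects=True)    # model input: object values hidden
decoder_prefix = serialize(triples, mask_objects=False)   # recovery target: object values restored
```

Here the system-mentioned value Chinese is matched even though the user does not ask for it, which mirrors the point above that the matched triples differ from the annotated dialogue state.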

Experiments
We evaluate our proposed pretrained model OPAL on dialogue state tracking tasks and end-to-end TOD tasks. To further validate the effectiveness of the proposed OPAL, we conduct an ablation study to analyze the effects of the different pretraining ingredients. Finally, we design resource-limited experiments to measure the sample efficiency of the proposed OPAL on the end-to-end TOD task and present some cases to study the strengths of the proposed OPAL.

Corpora
At phase-1 of the proposed OPAL, we use the Wikipedia corpus to pretrain the model.There are 72.24 million samples collected from Wikipedia.
We use five task-oriented dialogue datasets in the experiments, as shown in Table 1, where Schema (Rastogi et al., 2020) and TaskMaster (Byrne et al., 2019) are leveraged in phase-2 of the pretraining process and the rest are the downstream benchmarks. WOZ (Mrkšić et al., 2017) and CamRest676 (Wen et al., 2016) are single-domain task-oriented dialogue corpora, which are the well-studied DST benchmark and end-to-end TOD benchmark, respectively. MultiWOZ is a multi-domain dialogue corpus, which is challenging due to its multi-domain setting and diverse language styles. There are two versions of the MultiWOZ dataset used in the experiments: MultiWOZ2.0 (Budzianowski et al., 2018) and MultiWOZ2.1 (Eric et al., 2019), where MultiWOZ2.1 fixes most of the DST annotation errors in MultiWOZ2.0. To compare fairly with the other baselines, we run the end-to-end TOD tasks on MultiWOZ2.0 and run the DST tasks on MultiWOZ2.0 and MultiWOZ2.1.

Table 1: The usages of the datasets are grouped into Pretraining (named as P) and Fine-tuning (named as F), which mean that the corresponding dataset is used in the pretraining phase and the fine-tuning phase, respectively.

Metrics
For the dialogue state tracking task, we use the joint goal accuracy (JGA) to evaluate the models. A DST prediction at a turn counts as successful only if all the predicted slot values at that turn exactly match the golden ones. For the end-to-end TOD task, there are three reported scores: Inform, Success and BLEU. Inform measures whether the system response has provided the right entity. Success reports whether the system response has provided all the requested slots. BLEU evaluates the naturalness of the generated system response. Following Budzianowski et al. (2018), the combined score (Combined) is also reported using Combined = (Inform + Success) × 0.5 + BLEU.
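The two metrics that are fully defined above, JGA and the combined score, can be written down directly. The sketch below assumes that each turn's dialogue state is represented as a dictionary from (domain, slot) to value; that representation, the function names and the numbers in the example are hypothetical and only serve to illustrate the definitions.

```python
def joint_goal_accuracy(predictions, golds) -> float:
    """Fraction of turns whose predicted state matches the gold state exactly.

    predictions / golds: one dict per turn, mapping (domain, slot) -> value.
    A turn is correct only if every slot value matches and nothing is missing or extra.
    """
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same number of turns")
    if not golds:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predictions, golds))
    return correct / len(golds)


def combined_score(inform: float, success: float, bleu: float) -> float:
    """Combined = (Inform + Success) x 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu


# Hypothetical two-turn dialogue: the second turn has one wrong slot value.
golds = [
    {("res", "price"): "expensive"},
    {("res", "price"): "expensive", ("res", "food"): "Chinese"},
]
preds = [
    {("res", "price"): "expensive"},
    {("res", "price"): "expensive", ("res", "food"): "Italian"},
]
jga = joint_goal_accuracy(preds, golds)
combined = combined_score(inform=90.0, success=80.0, bleu=18.0)  # made-up scores
```

In the example, a single wrong slot value makes the whole second turn incorrect, which is what distinguishes joint goal accuracy from a per-slot accuracy.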

Experimental Setup
We implement the proposed OPAL with HuggingFace's Transformers (Wolf et al., 2020) and BART, which is a pretrained denoising autoencoder. To validate the generalization of the proposed pretraining method, we set the base version and the large version (BART_L) of BART as the backbones of the proposed OPAL, named OPAL and OPAL_L respectively. The learning rates of pretraining and fine-tuning are both 1e-5. The optimizer is AdamW. In phase-1 of the pretraining process, the total number of training steps is 280,000 and the batch size is 256. The model is pretrained on four P100 GPUs (16G memory each). This pretraining process takes 260 hours (one epoch on Wikipedia). Similar to NCM (Liu et al., 2021), we pretrain for 100,000 steps in phase-2. In the fine-tuning process of the downstream tasks, the batch size is 32. We conduct significance tests with five different seeds on the end-to-end TOD task, where the final results are obtained with the default seed 42.
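As a quick consistency check on the phase-1 numbers, the following sketch relates the reported step count, batch size and corpus size. It assumes that the batch size of 256 is the total number of samples per optimizer step and that 72.24 million (a rounded figure from the Corpora section) is the exact sample count; neither assumption is confirmed by the text.

```python
import math

num_samples = 72_240_000   # 72.24 million Wikipedia samples (rounded figure from the paper)
batch_size = 256           # assumed to be the total batch size per optimizer step
reported_steps = 280_000   # phase-1 training steps
reported_hours = 260       # phase-1 wall-clock time

steps_per_epoch = math.ceil(num_samples / batch_size)
epochs_covered = reported_steps * batch_size / num_samples
seconds_per_step = reported_hours * 3600 / reported_steps

print(f"steps for one full epoch: {steps_per_epoch}")
print(f"fraction of the corpus seen in {reported_steps} steps: {epochs_covered:.3f}")
print(f"average seconds per step: {seconds_per_step:.2f}")
```

Under these assumptions, 280,000 steps cover slightly less than one pass over the 72.24 million samples, which agrees with the statement that phase-1 amounts to one epoch on Wikipedia.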

Baselines
We compare the proposed OPAL with the strong baselines, which hold the state-of-the-art (SOTA) performance on the DST and end-to-end TOD.
The DST models can be divided into two categories: classification method and generation method.The classification methods rely on the optional slot values of the ontology and select the value from it.Their scalability is a severe problem for the practical dialogue system.The generation methods directly extract the values from the dialogue context, which are comparable to the proposed OPAL.
For the end-to-end TOD tasks, the existing end-to-end TOD systems can be grouped into modular systems and sequential systems. The modular systems use multiple decoders to generate the downstream outputs independently and are trained in an end-to-end manner. The sequential systems formulate the end-to-end TOD as a single sequence prediction problem. Sequicity (Lei et al., 2018) proposes a two-stage CopyNet method to generate the dialogue state and the system response. HRED-TS (Peng et al., 2019) proposes a teacher-student framework with a hierarchical recurrent encoder-decoder backbone. DAMD (Zhang et al., 2020b) designs a domain-aware multi-decoder network with the multi-action data augmentation method. DSTC8 Winner (Ham et al., 2020) and SimpleTOD (Hosseini-Asl et al., 2020) successfully leverage the pretrained language model GPT-2 for end-to-end TOD modelling in a unified way. Inspired by SimpleTOD, SOLOIST (Peng et al., 2020a) fine-tunes GPT-2 with out-of-domain TOD data and obtains excellent transferability. MinTL-BART (Lin et al., 2020) and UBAR (Yang et al., 2021) improve the end-to-end TOD system by changing the input content without extra assumptions. HTER (Santra et al., 2021) improves the end-to-end TOD system with a hierarchical dialogue modeling mechanism. NCM (Liu et al., 2021) improves the decoder with the noisy channel model and proposes a two-stage pretraining method to warm up the Transformer-based model, where the model first pretrains on the Reddit corpus and then on task-oriented dialogues. NCM is the method closest to ours, so we mainly compare our proposed method with it.

Results on End-to-End TOD
We first fine-tune our pretrained models OPAL and OPAL_L on two well-studied end-to-end TOD datasets, MultiWOZ2.0 and CamRest676, as shown in Table 2 and Table 3. We compare our models with strong baselines in the end-to-end dialogue learning setting.
To validate the generalization of our proposed ontology-aware pretraining method, we set the base-version and large-version BART as the backbones of the pretraining models. Compared with the performance of the original BARTs after fine-tuning, the proposed OPAL and OPAL_L achieve 7.32 and 10.94 overall performance gains on the MultiWOZ2.0 dataset and absolute 7.04 point gains on the CamRest676 dataset. SOLOIST (Peng et al., 2020a) and NCM (Liu et al., 2021) are the two methods closest to OPAL; both leverage out-of-domain TOD data in pretraining the Transformer-based models. Different from our method, these two approaches rely on DST annotation. Our proposed models still obtain the best task completion (Inform and Success), with BLEU scores only slightly lower than those of NCM. Compared with all the baselines, our proposed models reach new SOTA overall performance (Combined) on both datasets. The large-version model OPAL_L outperforms the base-version OPAL with a 2.53 performance gain on the combined score. To compare fairly with other baselines, we only report the performance of the base-version OPAL in the following experiments.
Compared with NCM_B, our proposed OPAL has higher task-completion performance (measured by (Inform + Success) × 0.5). However, the BLEU score of OPAL is lower than that of NCM_B. Figure 4 shows the correlation between the BLEU score and the task-completion ability. The fine-tuned model tries to balance the BLEU score against the task-completion ability. As training progresses, the BLEU score decreases while the task-completion ability improves. The main reason is that the same system intention can be expressed in different ways, which is the typical one-to-many mapping problem (Zhao and Eskenazi, 2018) in dialogue generation. The final fine-tuned model has stronger task-completion ability but sacrifices dialogue diversity. In the evaluation, we choose the model with the highest combined score.

Results on DST
The classification-based DST models and generation-based DST models are shown in the upper part and lower part of Table 4 and Table 5, respectively. Table 4 reports the DST results on the MultiWOZ2.0 and MultiWOZ2.1 datasets.
Our proposed OPAL obtains the highest joint goal accuracy (JGA) among all the generation-based baselines on both datasets.

Analysis
The analysis experiments evaluate the proposed OPAL on the end-to-end TOD tasks to answer three main questions:
Q1: What role do the different pretraining corpora (Wikipedia and out-of-domain TOD) play?
Q2: What is the main factor that affects the pretrained model?
Q3: Does OPAL have a higher sample efficiency than the original BART in the limited-resource setting?

[Table fragment: DST baselines and their scores]
Model                                  Score
(Mrkšić et al., 2017)                  84.4
GLAD (Zhong et al., 2018)              88.1
GCE (Nouri and Hosseini-Asl, 2018)     88.5
G-SAT (Balaraman and Magnini, 2019)    88.7
StateNet (Ren et al., 2018)            88.9
BERT-DST (Lai et al., 2020)            90.5
TRADE (Wu et al., …

Phase-1 of OPAL pretrains on the contextual texts and phase-2 pretrains on the task-oriented dialogues with the ontology-aware pretraining method.

Ablation Study
To evaluate the effects of these two corpora, we separately pretrain the backbone (BART) on only the contextual texts or only the task-oriented dialogues, where the resulting pretrained models are named WIKI and TOD respectively. Both WIKI and TOD still outperform the original BART by a large margin, which indicates the effectiveness of the proposed ontology-aware pretraining method. In particular, the pretrained model WIKI, which does not see any TOD data in the pretraining phase, achieves performance competitive with NCM_L. Compared with NCM_B, which has a parameter scale similar to our model, WIKI has clear advantages on both end-to-end TOD datasets. WIKI also performs better than TOD. WIKI suffers from never having seen TOD data, whereas TOD suffers from the limited scale of its pretraining data. Our proposed OPAL adopts a two-stage pretraining method to solve the above problem, a classic example of "one plus one greater than two". The two-stage pretrained model OPAL outperforms the separately pretrained models by 1.65 and 2.70 points of combined score on MultiWOZ2.0. It indicates that the ontology-aware contextual text corpus and the ontology-aware TOD data are complementary.
To further compare Wikipedia with the Reddit corpus, we also use the same scale of Reddit data to conduct the phase-1 pretraining, and name the resulting model REDD. WIKI is ahead of REDD in all the automatic metrics (BLEU and task completion). To analyze the cause, we calculate the proportion of extracted triples that contain a pronoun as subject or object. As shown in Figure 6, 31.0% of the triples in the Reddit data contain pronouns. The most frequent pronoun is "i", which accounts for 31%. Only 0.7% of the triples in Wikipedia contain pronouns. In TOD, the domains and slot values in the dialogue states are specific entities, not pronouns. The meaningless pronouns increase the gap between the pretrained model and the TOD model. The co-reference and information ellipsis in Reddit seriously hurt the performance of the external information extraction tool. This is the main reason that we choose Wikipedia as the pretraining corpus.
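The pronoun statistic discussed above can be computed with a few lines of code. The sketch below is an assumption-laden illustration: the `PRONOUNS` set is a small hypothetical list, a triple is counted when its whole subject or object is a pronoun, and the toy triples are made up, so the resulting rate has no relation to the percentages reported for Reddit or Wikipedia.

```python
# Small illustrative pronoun list (assumption); the list used for Figure 6 is not given.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them", "this", "that"}


def pronoun_triple_rate(triples) -> float:
    """Share of (subject, relation, object) triples whose subject or object is a pronoun."""
    if not triples:
        return 0.0
    hits = sum(
        1
        for subj, _rel, obj in triples
        if subj.strip().lower() in PRONOUNS or obj.strip().lower() in PRONOUNS
    )
    return hits / len(triples)


# Made-up triples in the style of forum text and encyclopedic text.
toy_triples = [
    ("i", "like", "pizza"),
    ("Kwai Tsing", "area", "Hong Kong"),
    ("she", "told", "him"),
    ("train", "arrive", "12:30"),
]
rate = pronoun_triple_rate(toy_triples)
```

A triple with a pronoun as subject or object carries no entity that could play the role of a domain or slot value, which is why a high rate makes a corpus less suitable for ontology-like triple recovery.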
We also evaluate the effects of the pretraining tasks: ontology-like triple recovery (OR) and next-text generation (NTG). In the "w/o OR" study, we directly remove the extracted triples from the input. The "w/o NTG" setting means that the model only needs to recover the masked triples. The results show that the OR task and the NTG task benefit task completion and contextual consistency, respectively. In the complex dialogue domain, the single-task pretraining methods cannot achieve performance comparable to OPAL. It indicates that both designed tasks are significant for reducing the gap between the pretrained model and the TOD model.
We further validate the effects of different OpenIE tools. In our main experiments, we use the latest neural-based IE tool OpenIE6. There is also a very popular rule-based IE tool, OpenIE-Stanford. Compared with OpenIE-Stanford, the neural-based OpenIE6 achieves a promising performance improvement on well-studied IE benchmarks (Kolluru et al., 2020). As shown in Table 6, WIKI with OpenIE6 is also better than WIKI with the OpenIE-Stanford tool in all the metrics. However, the improvement brought by the neural-based OpenIE6 is limited, which indicates that the proposed pretraining method is not sensitive to IE accuracy.

[Figure: case-study example. OPAL: [value_name] is in the [value_price] price range and is located at [value_address]. The phone number is [value_phone].]

Sample Efficiency
Under the different resource-limited settings, the proposed OPAL achieves the best performance among the baselines in terms of task completion (Inform and Success), response naturalness (BLEU) and overall performance, as shown in Figure 5. It indicates the sample efficiency of the proposed ontology-aware pretraining method. When the training data is extremely limited (only 80 dialogues), TOD improves overall performance over WIKI by a large margin (absolute 3.2 point improvement). This improvement comes from the task-completion ability, which indicates that the TOD data can increase the generalization of the pretrained model for end-to-end TOD tasks. As the training data increases, WIKI, which is pretrained on the large-scale contextual text data, has a larger performance gain than TOD. When the training data reaches 1600 dialogues, WIKI obtains absolute 4.8 point gains over TOD. It indicates that the scale of the pretraining data influences the growth potential of the pretrained model. On the other hand, TOD outperforms WIKI on task-completion ability in three of the four limited-data cases. However, WIKI achieves better performance on fluency (measured by BLEU). It indicates that TOD benefits the task-completion ability, whereas WIKI facilitates fluency and context consistency.

Case Study
Our proposed pretrained model OPAL improves the performance on task completion and contextual consistency over the original BART. As shown in Figure 7, the dialogue model fine-tuned from BART fails to respond to a request (address) from the user. In contrast, our proposed OPAL accurately provides all the requested information to the user. As shown in Figure 8, at the first turn, our proposed OPAL provides a response that is more similar to the oracle than the one from BART. It indicates that OPAL has better performance on response prediction. At the second turn, the dialogue system needs to provide the correct entity to the user. The original BART model fails to do so, whereas our proposed OPAL recommends an entity to the user in time. Compared with the original BART, the proposed OPAL has an obvious advantage in modeling task-oriented dialogue: it not only generates the precise response but also completes the dialogue task successfully. This performance improvement comes from the two-stage ontology-aware pretraining method on the large-scale contextual text with the handcrafted ontology-like triples and the small task-oriented dialogue data with the given ontology.

Related Work
End-to-End TOD Systems. Early studies on end-to-end task-oriented dialogue systems either design a neural network-based model or propose a reinforcement learning method that uses the reward signal to update the whole system. In these systems, the modules of the pipeline TOD system still exist and need their separate annotations. These systems can usually achieve promising performance on one specific task but have poor transferability. With the emergence of multi-domain TOD benchmarks such as MultiWOZ, the generative DST method has replaced the classification method as the mainstream in recent years due to its better generalization ability. This encourages formulating end-to-end TOD as a text-to-text task. Lei et al. (2018) propose a two-stage CopyNet to generate the dialogue state and response jointly with a single seq2seq architecture. Zhang et al. (2020b) design a data augmentation method to increase the response diversity, in which the dialogue state, dialogue act and response are generated with a shared encoder and different decoders. Note that our proposed model does not use the annotated dialogue acts. Recently, some works (Hosseini-Asl et al., 2020; Peng et al., 2020a; Lin et al., 2020; Yang et al., 2021) directly leverage pretrained language models (like GPT-2 and BART) as the end-to-end TOD model in a unified way. Liu et al. (2021) propose a Transformer-based noisy channel method to model the response prior and use Reddit data and TOD data to warm up the TOD model. Most recently, Su et al. (2021) formulate all the end-to-end TOD tasks as unified generation tasks, which are learned in a multi-task learning manner. He et al. (2021) propose a semi-supervised method to explicitly learn dialogue policy from limited labeled dialogues. Our proposed pretraining method is compatible with these end-to-end TOD training strategies.

Self-supervised Learning for Dialogue System
Recent advances in self-supervised learning have witnessed the success of pretrained language models on language understanding and generation tasks. Since the large-scale comment data in Reddit can be regarded as a kind of chit-chat dialogue, self-supervised methods were first used in chit-chat systems. DialoGPT (Zhang et al., 2020c) adapts the pretrained GPT-2 to large-scale dialogue data. PLATO (Bao et al., 2020) proposes a discrete latent variable pretraining method to solve the one-to-many problem of the dialogue system. Meena (Adiwardana et al., 2020) pretrains a large-scale model with dialogue data and demonstrates its conversation ability. SC-GPT (Peng et al., 2020b) uses a pretrained language model to convert a dialogue act into a natural language response. For task-oriented dialogue, large-scale domain-specific dialogue data is inaccessible. The TOD models (Jiang et al., 2020; Wu et al., 2020; Yu et al., 2020) are usually pretrained on chit-chat dialogues (Reddit) first and then fine-tuned on the smaller released or synthetic TOD data. Different from the above PLMs, we pretrain the TOD model directly with large-scale contextual text. We extract relation triples of the contextual text as the grounded ontology-like knowledge and design adaptive self-supervised learning tasks for the end-to-end TOD.
Knowledge-grounded PLMs Recently, an important branch of pretrained language model research studies how to integrate knowledge into the PLM. ERNIE (Zhang et al., 2019) utilizes an external knowledge graph to recognize the type of each mentioned entity. An entity type embedding layer serves as one of the input representations. To enhance the knowledge-related representation, they improve the mask mechanism by masking a whole entity directly. Similarly, Rosset et al. (2020) propose a knowledge-aware language model (KALM), which is a decoder-only Transformer-based architecture, like GPT. KALM proposes an entity tokenizer that directly segments popular entities as single tokens. Some fields, such as medicine, have a large amount of proprietary information, which makes integrating knowledge urgent. SMedBERT (Zhang et al., 2021) incorporates deep structured semantic knowledge from the neighbours of linked entities. In this paper, we aim to utilize the external tool OpenIE6 to produce a large amount of TOD-like data to bridge the gap between the pretraining task and the end-to-end TOD system. The proposed ontology-like triple recovery task masks only the object values in the extracted triples, rather than randomly masking mentioned entities.
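The difference from entity masking can be made concrete with a minimal sketch of object-value masking. It assumes triples are already available as (subject, relation, object) strings, as an information extraction tool might return them, and that object values appear verbatim in the text. The `<mask>` token and the function name `mask_object_values` are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

MASK = "<mask>"  # assumed mask token; the token used in the paper may differ


def mask_object_values(text: str, triples: List[Triple], mask: str = MASK) -> Tuple[str, List[str]]:
    """Mask only the object value of each triple in the text.

    Returns the masked text and the list of masked object values,
    which act as the recovery targets.
    """
    masked = text
    targets = []
    for _subject, _relation, obj in triples:
        if obj and obj in masked:
            # Replace only the first occurrence so each triple masks one span.
            masked = masked.replace(obj, mask, 1)
            targets.append(obj)
    return masked, targets


# Hypothetical sentence and triples, for illustration only.
text = "Alice booked a table at the Golden Dragon in the city centre."
triples = [("Alice", "booked", "a table"), ("the Golden Dragon", "in", "the city centre")]
masked_text, targets = mask_object_values(text, triples)
```

Subjects and relations stay visible, so the model has to recover values given the ontology-like structure, which mirrors how a dialogue state tracker fills slot values.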

Conclusion & Future Work
In this paper, we propose an ontology-aware pretraining method for modeling end-to-end task-oriented dialogue. The scale of the existing task-oriented dialogue data is far from what a pretrained model needs. Thus, we leverage the external tool OpenIE6 to extract ontology-like knowledge from large-scale contextual texts. To bridge the gap between the pretrained and end-to-end TOD models, we design two adaptive self-supervised learning tasks: ontology-like triple recovery and next-text generation. The pretraining process is divided into two phases: phase-1 pretrains on the large-scale ontology-aware contextual texts and phase-2 pretrains on the ontology-aware TOD data. Our proposed OPAL achieves excellent performance on the end-to-end TOD tasks and dialogue state tracking tasks. In the future, we will evaluate the effect of different ontology-building methods.

Figure 1: A task-oriented dialogue example. The dialogue model needs to infer the dialogue state based on the dialogue history and ontology schema. The DB state is searched by the generated dialogue state. The last step is to generate the system response.

Figure 2: The ontology-aware pretraining method contains two masking strategies: object-value mask and next-text mask.The corresponding self-supervised learning methods are ontology-like triple recovery and next-text generation.The ontology-like triples of the contextual text are extracted by the external tool OpenIE at the pretraining phase-1 and matched with the given whole ontology at phase-2.
res [R] price [S] res [R] food [S] res [R] area [SEP] User: I need to find an expensive restaurant. System: Do you have a cuisine preference, Chinese food? User: No , I dont care.
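The serialized example above suggests how the ontology and the dialogue history are flattened into a single input sequence. The sketch below reproduces that layout under two assumptions read off the example rather than stated in the text: `[R]` separates a domain from its slot, and `[S]` separates consecutive domain-slot pairs. The function name `serialize_input` is hypothetical.

```python
from typing import List, Tuple


def serialize_input(schema: List[Tuple[str, str]], history: List[Tuple[str, str]]) -> str:
    """Flatten (domain, slot) pairs and dialogue turns into one string.

    Assumed layout: "domain [R] slot" pairs joined by " [S] ", then " [SEP] ",
    then the turns written as "Speaker: utterance".
    """
    pairs = " [S] ".join(f"{domain} [R] {slot}" for domain, slot in schema)
    turns = " ".join(f"{speaker}: {utterance}" for speaker, utterance in history)
    return f"{pairs} [SEP] {turns}"


schema = [("res", "price"), ("res", "food"), ("res", "area")]
history = [
    ("User", "I need to find an expensive restaurant."),
    ("System", "Do you have a cuisine preference, Chinese food?"),
    ("User", "No , I dont care."),
]
flat = serialize_input(schema, history)
```

With this schema and history, `flat` matches the example string character for character.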

Figure 3: A toy example to show the differences between the pretraining data and the fine-tuning data.

Figure 4: The correlation between BLEU score and task-completion ability over the first 20 fine-tuning epochs. The values are the average evaluation results on MultiWOZ2.0 with five different seeds.
you give me their phone number , address and price range , please ? GT Response: they are in the [value_price] price range . [value_phone] is their number . you can find them on [value_address] BART: it is in the [value_price] price range and the phone number is [value_phone] .

Figure 7: Third dialogue turn in the dialogue session SNG02115 from the MultiWOZ2.0 development set. The oracle response is represented as GT Response. BART and OPAL mean that the responses are generated by the corresponding models.

Figure 8: The first two dialogue turns in the dialogue session SNG921 from the MultiWOZ2.0 development set. The oracle response is represented as GT Response. BART and OPAL mean that the responses are generated by the corresponding models.

Table 1: The five task-oriented dialogue datasets used in this paper. X-domain (cross-domain) means that a dialogue can contain different dialogue domains.

Table 2: End-to-end response generation results on MultiWOZ2.0. The marks denote whether the dialogue act annotation is used in the training process. The model sizes of all Transformer-based end-to-end TOD models are listed. Note that we directly use the UBAR result provided by Liu et al. According to the released code of UBAR, they do not use the standard evaluation metric, which makes comparison with other methods unfair. We also ran their code with the released model checkpoint, and its combined score is even worse than the result provided by Liu et al. Results are significant (p < 0.01) when comparing the OPAL model and the BART model as the initialized TOD model.

Table 4: Dialogue state tracking results on MultiWOZ2.0 and MultiWOZ2.1. The upper part lists classification-based models and the lower part lists generation-based models.
Table 6 reports the ablation study of the proposed OPAL, which has two pretraining phases. The

Table 5: Dialogue state tracking results on the single-domain WOZ. The upper part lists classification-based models and the lower part lists generation-based models. † indicates that the result was produced by us from their released code.

Table 6: Ablation study on MultiWOZ2.0. There are three types of ablation study. The first analyzes the effects of the pretraining data. The second validates the effects of the designed pretraining tasks. The last examines the effects of the IE tools. Results are significant (p < 0.01) when comparing the OPAL model and the BART model as the initialized TOD model.
User: hello ! i need a place to bed down for the night that offers free parking and wifi . can you help me ? GT Response: i can definitely help with that . do you have a preference in location ? BART: i sure can . first i 'll need to know what day you 're arriving and the number of nights you 'll be staying .