Abstract
Answer selection in open-domain dialogues aims to select an accurate answer from candidates. The recent success of answer selection models hinges on training with large amounts of labeled data. However, collecting large-scale labeled data is labor-intensive and time-consuming. In this paper, we introduce predicted intent labels to calibrate answer labels in a self-training paradigm. Specifically, we propose intent-calibrated self-training (ICAST), which improves the quality of pseudo answer labels through an intent-calibrated answer selection paradigm that employs pseudo intent labels to improve pseudo answer labels. We carry out extensive experiments on two benchmark datasets with open-domain dialogues. The experimental results show that ICAST consistently outperforms baselines with 1%, 5%, and 10% labeled data. Specifically, it improves F1 scores by 2.06% and 1.00% on the two datasets, compared with the strongest baseline, with only 5% labeled data.
1 Introduction
Open-domain dialogue systems (ODSs) interact with users through dialogues in open-ended domains (Huang et al., 2020). The responses in an ODS can be divided into different types, such as answer, gratitude, greeting, and junk (Qu et al., 2018). In this paper, we focus on the selection of answers, which aims to identify the correct answer from a pool of candidates given a dialogue context. Typically, there are two main branches of approaches to producing answers, i.e., generation-based methods and selection-based methods (Park et al., 2022). The former generate a response token by token; the latter select a response from a pool of candidates. Currently, pure generation methods such as ChatGPT still face challenges: (1) they may generate incorrect content, and (2) they cannot generate timely answers. Thus, selection-based methods are still needed to improve the correctness and timeliness of generation-based methods.
Figure 1 illustrates our idea by comparing the answer selection paradigms of (a) context-aware methods, (b) intent-aware methods, and (c) intent-calibrated methods. Context-aware methods (see Figure 1(a)) capture the context of the ongoing dialogue to understand users’ information needs and select the most relevant responses from the answer candidates (Jeong et al., 2021). Unlike task-oriented dialogue systems, it is much more challenging for ODSs to infer users’ information needs due to their open-ended goals (Huang et al., 2020).
To this end, user intents, i.e., a taxonomy of utterances, are introduced to guide the information-seeking process (Qu et al., 2018, 2019a; Yang et al., 2020). If the intent of the previous original question (OQ) is not satisfied by the potential answer (PA) provided by a system, then the user’s next intent is more likely to be an information request (IR). For example, if the user asks: “Can you send me a website, so I can read more information?”, the user’s intent is IR. If the system does not consider the intent label IR, then it may provide an answer that does not satisfy the user’s request.
Intent-aware methods (see Figure 1(b)) adopt intents as an extra input to better understand users’ information needs in an utterance (Yang et al., 2020). However, they require sufficient human-annotated intent labels for training, the construction of which is time-consuming and labor-intensive.
Self-training has been widely used to mitigate the label-scarcity problem (Liu et al., 2022; Yang et al., 2022; Zhang et al., 2022a), but it is still under-explored for answer selection in ODSs. The principle of self-training is to iteratively learn a model by assigning pseudo labels to large-scale unlabeled data to extend the training set (Amini et al., 2022). The teacher-student self-training framework has been widely used in recent work, where the teacher generates pseudo labels and the student makes predictions (Xie et al., 2020; Ghiasi et al., 2021; Li et al., 2021; Karamanolakis et al., 2021). However, noisy pseudo labels incur error propagation across iterations, so the key challenge is to assure both the quality and the quantity of pseudo labels (Karamanolakis et al., 2021).
In this paper, we introduce an intent-calibrated answer selection paradigm, as in Figure 1(c). It first conducts both context-aware and intent-aware answer selection to predict pseudo intent and answer labels, and then it selects high-quality intent labels to calibrate final answer labels. To be more specific, we develop an intent-calibrated self-training (ICAST) algorithm based on the teacher-student self-training and intent-calibrated answer selection paradigm.
The core procedure is as follows. First, we train a teacher model on the labeled data and predict pseudo intent labels for the unlabeled data. Second, we select high-quality intent labels by estimating the intent confidence gain, which measures how much information a candidate intent label brings to the model, and add the selected intents to the input of the answer selection model. Third, we re-train a student model on both the labeled and pseudo-labeled data. Intuitively, ICAST synthesizes pseudo intent and answer labels and integrates them into teacher-student self-training, so that high-quality intents assure the quality of the pseudo answer labels.
We conduct experiments on two datasets: MSDIALOG (Qu et al., 2018) and MANTIS (Penha et al., 2019). The experimental results show that ICAST outperforms the state-of-the-art baseline by 2.51%/0.63% of F1 score on the MSDIALOG/MANTIS dataset with 1% labeled data. The results demonstrate the effectiveness of ICAST, which selects accurate answers by incorporating high-quality predicted intent labels.
2 Related Work
In this section, we summarize related work in terms of three categories, i.e., traditional answer selection models, intent-aware answer selection models, and self-training for answer selection.
2.1 Traditional Answer Selection Models
The dominant work focuses on modeling the representations of dialogue contexts and responses, and their relevance, to select appropriate answers (Zhou et al., 2016, 2018; Chaudhuri et al., 2018). Wang et al. (2019) propose a sequential matching network to model the relation between the contextual utterances and the response via a cross-attention matrix. Yang and Choi (2019) encode dialogue contexts and responses for answer utterance selection and answer span selection using multiple self-attention models, e.g., R-Net (Wang et al., 2017) based on RNNs and QANet (Yu et al., 2018) based on CNNs. Many researchers also explore enhancing the dialogue contexts or candidate responses. Medveď et al. (2020) extend the input candidate sentence with selected information from the preceding sentence context. Fu et al. (2020) extend the contexts of the responses and integrate context-to-context matching with context-to-response matching. Several studies (Ohmura and Eskenazi, 2018; Barz and Sonntag, 2021) also propose to improve the quality of answers by re-ranking answer candidates.
More recently, transformer-based pre-trained models have become the state-of-the-art paradigm (Kim et al., 2019; Henderson et al., 2019a; Tao et al., 2021). Researchers (Henderson et al., 2019b; Yang and Choi, 2019) apply a BERT encoder (Devlin et al., 2019) pre-trained on a large-scale open-domain dialogue corpus and fine-tune it on a small-scale in-domain dataset to capture the nuances. Likewise, Whang et al. (2020) also use a BERT encoder and perform context-response matching, but they additionally introduce next utterance prediction and masked language modeling tasks during post-training. Gu et al. (2020) incorporate speaker-aware embeddings into BERT to help with context understanding in multi-turn dialogues. Liu et al. (2021a) construct utterance-aware and speaker-aware representations of dialogue contexts based on masking mechanisms in transformer-based pre-trained models, including BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and ELECTRA (Clark et al., 2019).
There are several studies that use auxiliary tasks to enhance answer selection. Wu et al. (2020) combine a BERT-based response selection model with a contrastive learning objective and multiple auxiliary learning tasks, i.e., intention recognition, dialogue state tracking, and dialogue act prediction. Xu et al. (2021) enhance the response selection task with several auxiliary tasks, which bring in extra supervised signals in a multi-task learning manner. Pei et al. (2021) jointly learn missing user profiles with personalized response selection, which gradually improves response quality based on enriched user profiles and neighboring dialogues.
2.2 Intent-aware Answer Selection Models
Intent detection is a key prerequisite for understanding users’ intents in answer selection, especially in multi-turn dialogues (Gu et al., 2020; Park et al., 2022). Various deep NLP models have been adopted to classify intents (Chen et al., 2017; Liu et al., 2019a; Weld et al., 2021; Wang et al., 2021a). Chen et al. (2016) generate new intents to bridge the semantic relation across domains for intent expansion and classification. Wu et al. (2020) improve pre-trained BERT with an extra contrastive objective for intention recognition. The key challenge is natural language understanding with state-of-the-art NLP models, e.g., CNNs (Chen et al., 2016), RNNs (Firdaus et al., 2021), transformers (Zhao et al., 2020), and pre-trained language models (PLMs) (Wu et al., 2020; Yan et al., 2022).
Intent calibration research attempts to predict additional information to resolve users’ ambiguous or uncertain intents. Lin and Xu (2019) calibrate the confidence of the softmax outputs for unknown intent detection. Gong et al. (2022) represent labels uniformly in hyperspherical space and calibrate confidence to trade off accuracy and uncertainty. However, none of the above research has adapted the detected intents to answer selection. The most related work is IART (Yang et al., 2020), which weights the context by attending to predicted intents for response selection.
Unlike the above methods, we propose to improve the performance of answer selection by using a large amount of unlabeled data. We devise the intent-calibrated self-training to improve the quality of pseudo answer labels by considering user intents.
2.3 Self-training for Answer Selection
Self-training has received remarkable attention in natural language processing (Luo, 2022) and machine learning (Karamanolakis et al., 2021; Amini et al., 2022). In general, the core idea is to augment the model training with pseudo supervision signals (Wu et al., 2020; Yan et al., 2022).
Sachan and Xing (2018) introduce a self-training algorithm for jointly learning to answer and generate questions, which augments labeled question-answer pairs with unlabeled text. Wu et al. (2018) introduce a pre-trained sequence-to-sequence model as an annotator to generate pseudo labels for unlabeled data to supervise the training process. Deng et al. (2021) propose to use a fine-tuned question generator and answer generator to generate pseudo question-answer pairs. Lin et al. (2020) introduce a fine-tuned generation-based model to generate grayscale data.
Differently, the proposed ICAST seeks to improve the quality of pseudo answer labels by introducing an intent-calibrated pseudo labeling mechanism, which uses high-quality pseudo intent labels to calibrate pseudo answer labels.
3 Preliminary
3.1 Answer Selection Task
We formulate answer selection as a binary classification task (Yang et al., 2020). We denote the labeled dataset as Dl = {(xi, ei, yi)} and the unlabeled dataset as Du = {xi}. For the i-th sample, xi = (ui, ai) is a context-candidate pair, which consists of the context ui = [u1, ⋯, u|ui|], a sequence of utterances, and a candidate answer ai ∈ A (the set of all candidate answers). ei = [e1, ⋯, e|ui|] is the corresponding sequence of user intent labels. yi ∈ {0,1} is the answer label: yi = 1 denotes that ai is a correct answer, otherwise yi = 0.
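To make the notation concrete, a sample from either dataset could be represented as follows (a minimal sketch; the class and field names are ours, not the authors’):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Sample:
    """One context-candidate pair x_i = (u_i, a_i) with optional labels."""
    context: List[str]            # u_i: the sequence of utterances u_1, ..., u_|u_i|
    candidate: str                # a_i: one candidate answer from the pool A
    intents: Optional[List[str]]  # e_i: per-utterance intent labels (None if absent)
    label: Optional[int]          # y_i in {0, 1}; None for unlabeled data

# A labeled sample from D_l (intent tags such as OQ/PA follow the dataset's taxonomy).
labeled = Sample(
    context=["How does a photon picture make the pattern?",
             "Photons ... build up the classical electromagnetic radiation..."],
    candidate="The theories of quantum mechanics for electron photon interactions ...",
    intents=["OQ", "PA"],
    label=1,
)
# An unlabeled sample from D_u carries neither intents nor an answer label.
unlabeled = Sample(context=["..."], candidate="...", intents=None, label=None)
```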
3.2 BERT for Answer Selection
3.3 Teacher-Student Self-training Framework
The teacher-student self-training framework (Li et al., 2021) is shown in Figure 2(a). It first trains the teacher model with the labeled data Dl to predict correct answer probabilities. Then at each iteration, the pseudo labeling module selects samples by using teacher’s predictions to assign pseudo answer labels. Finally, the student model is trained with the labeled data and pseudo-labeled data. At the next iteration, the student model is used as a new teacher model.
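A minimal sketch of this loop, with a linear classifier standing in for the BERT-based models and a simple confidence filter as the pseudo labeling rule (both our own simplifications, not the authors’ code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def teacher_student_self_training(X_l, y_l, X_u, n_iters=3, tau=0.9):
    """Sketch of the TSST loop: the teacher labels unlabeled data and a
    student is retrained on labeled plus confidently pseudo-labeled data."""
    teacher = LogisticRegression().fit(X_l, y_l)          # train teacher on D_l
    for _ in range(n_iters):
        p = teacher.predict_proba(X_u)[:, 1]              # teacher's answer probabilities
        keep = (p > tau) | (p < 1 - tau)                  # keep confident predictions only
        X_aug = np.vstack([X_l, X_u[keep]])
        y_aug = np.concatenate([y_l, (p[keep] > 0.5).astype(int)])
        student = LogisticRegression().fit(X_aug, y_aug)  # retrain the student
        teacher = student                                 # student becomes the next teacher
    return teacher
```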
Pseudo Labeling Module.
4 Intent-calibrated Self-training
4.1 Overview
We illustrate the proposed intent-calibrated self-training (ICAST) in Figure 2(b). First, we train a teacher model on labeled data Dl to predict pseudo intent labels for unlabeled data Du (see §4.2). Second, we conduct intent-calibrated pseudo labeling (see §4.3): we estimate the intent confidence gain to select samples with high-quality intent labels, and we calibrate the answer labels by incorporating the selected intent labels as extra input for answer selection. Third, we train the student model with labeled and pseudo-labeled data (see §4.4). We summarize the whole procedure in Algorithm 1.
4.2 Teacher Model Training
The answer selection module fβ constructs its input as a sequence of tokens, i.e., xi = [[CLS]; u1; e1; ⋯; u|ui|; e|ui|; [SEP]; ai], and computes the probability of a candidate answer as in Eq. 3 to decide whether it is the correct answer.
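For illustration, such an input could be assembled with a HuggingFace tokenizer as follows (a sketch under our own simplifications: intent labels are appended to their utterances as plain tokens, and the tokenizer adds the [CLS]/[SEP] markers):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def build_input(utterances, intents, candidate):
    """Interleave each utterance u_k with its intent label e_k, then pair the
    context with the candidate answer a_i as the second segment."""
    context = " ".join(f"{u} {e}" for u, e in zip(utterances, intents))
    # Produces [CLS] u_1 e_1 ... u_n e_n [SEP] a_i [SEP].
    return tokenizer(context, candidate, truncation=True,
                     max_length=512, return_tensors="pt")

enc = build_input(
    ["How does a photon picture make the pattern?"],
    ["OQ"],
    "The theories of quantum mechanics for electron photon interactions ...",
)
```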
4.3 Intent-calibrated Pseudo Labeling
4.3.1 Intent Confidence Gain Estimation
The first term of the intent confidence gain is the confidence score of MC dropout after incorporating pseudo intents, and the second term is the confidence score of MC dropout without them. The intent confidence gain thus measures how much confidence the pseudo intents can bring to the model under MC dropout. The higher the intent confidence gain, the more the predicted intents improve the confidence score. We set a threshold λ to determine whether the predicted intents bring enough improvement to the confidence score. If the intent confidence gain is larger than λ, we conclude that the predicted intents improve the confidence score sufficiently and add them to the model’s inputs. Specifically, if Δ > λ, we update the input with the extra predicted intent labels ei, i.e., xi = [[CLS]; u1; e1; ⋯; u|ui|; e|ui|; [SEP]; ai], which is expected to bring a higher confidence score to the model; otherwise the input remains xi = [[CLS]; u1; ⋯; u|ui|; [SEP]; ai].
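A minimal sketch of estimating the gain Δ for a single context-candidate pair with MC dropout (assuming `model` returns a correct-answer logit; using the probability of the predicted class as the confidence score is our assumption):

```python
import torch

def mc_confidence(model, inputs, T=5):
    """Mean confidence over T stochastic forward passes (MC dropout).
    Calling model.train() keeps dropout active at inference time."""
    model.train()
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(**inputs)) for _ in range(T)])
    p = probs.mean(dim=0)                # averaged correct-answer probability
    return torch.max(p, 1 - p).item()    # confidence of the predicted label

def intent_confidence_gain(model, inputs_with_intents, inputs_plain, T=5):
    """Delta = confidence with pseudo intents minus confidence without them."""
    return mc_confidence(model, inputs_with_intents, T) - \
           mc_confidence(model, inputs_plain, T)

# If intent_confidence_gain(...) > lambda, the predicted intent labels are
# kept in the input; otherwise the plain context-candidate input is used.
```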
4.3.2 Answer Label Calibration
Note that line 13 in Algorithm 1 shows the process of selecting pseudo answer labels for retraining the answer selection module of the student model: Eq. 5 determines whether we add a sample to the subset for pseudo labeling. Eq. 12 is a relaxation of Eq. 5, which aims to make use of more unlabeled samples by introducing three extra thresholds. The goal of selecting samples by the criteria of Eq. 5 and Eq. 12 (line 13 of Algorithm 1) is to prepare a preliminary set of candidates for high-quality pseudo labeling.
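Since Eq. 5 and Eq. 12 are not reproduced here, the following is only a plausible sketch of an Eq. 5-style criterion, assuming the confidence thresholds λ+ and λ− reported in §5.3 (Eq. 12 would relax this with extra thresholds to admit more samples):

```python
def select_pseudo_labels(probs, lam_pos=0.8, lam_neg=0.1):
    """Keep only samples whose (intent-calibrated) answer probability is
    confidently high or low, and assign the pseudo answer label accordingly."""
    selected = []
    for i, p in enumerate(probs):
        if p > lam_pos:
            selected.append((i, 1))   # confident positive pseudo label
        elif p < lam_neg:
            selected.append((i, 0))   # confident negative pseudo label
    return selected

# e.g., select_pseudo_labels([0.95, 0.50, 0.03]) -> [(0, 1), (2, 0)]
```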
4.4 Student Model Re-training
The answer selection loss without intent labels is the cross-entropy between predicted and ground-truth answer labels computed on the plain context-candidate input; it optimizes the answer selection module when the intent confidence gain is lower than the threshold. The answer selection loss with intent labels is the same cross-entropy computed on the intent-enriched input; it optimizes the answer selection module when the intent confidence gain is larger than the threshold.
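Both losses are standard binary cross-entropy and differ only in which input variant they see; a per-example sketch (batching details are omitted, and `model` returning a correct-answer logit is our assumption):

```python
import torch
import torch.nn.functional as F

def answer_selection_loss(model, inputs_with_intents, inputs_plain,
                          label, gain, lam):
    """Cross-entropy on the answer label; the intent-enriched input is used
    only when the intent confidence gain exceeds the threshold lam."""
    inputs = inputs_with_intents if gain > lam else inputs_plain
    logit = model(**inputs)                          # correct-answer logit
    target = torch.tensor([float(label)])
    return F.binary_cross_entropy_with_logits(logit.view(1), target)
```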
5 Experimental Setup
5.1 Datasets and Evaluation Metrics
We test all methods on our extension of two benchmark datasets: MSDIALOG (Qu et al., 2018) and MANTIS (Penha et al., 2019). The MSDIALOG dataset contains multi-turn question answering dialogues across 4 topics collected from the Microsoft community, with 12 different types of intents. The MANTIS dataset provides multi-turn dialogues with user intent labels across 14 domains crawled from Stack Exchange, with 10 different types of intents. Note that our experiments require a small amount of data with intent labels. Other response selection datasets exist (e.g., UDC [Lowe et al., 2015]); however, they do not contain dialogues with intent labels. We therefore select the MSDIALOG and MANTIS datasets, which contain a small amount of intent-labeled data and thus satisfy our experimental requirements.
In particular, we extend both datasets with unlabeled data. For MSDIALOG, we treat data without intent labels as unlabeled data; for MANTIS, we crawl unlabeled data from Stack Exchange from 2021 to 2022. For a fair comparison with baselines, we follow previous work (Zhang et al., 2022b; Yang et al., 2020; Han et al., 2021): we use the ground-truth label as the positive sample and use the BM25 algorithm (Robertson and Zaragoza, 2009) to retrieve 9 relevant samples from different dialogues as negative samples. A small number of these negatives may be false negatives, since the same answer can occur in different dialogues, but their number is very small. Besides, we use a different data partitioning strategy. First, we only extract conversations containing accurate answers, and the ground-truth labels of all data are accurate answers; this is because we focus on the answer selection task, whereas prior work focuses on the response selection task, and not all responses can serve as answers to users’ questions. Second, we put the data with intent labels into the training set, because the amount of intent-labeled data is small and we want to fully utilize the intent labels. To compare different methods in low-resource settings, we design three low-resource simulations with 1%, 5%, and 10% labeled data and a large amount of unlabeled data. The statistics of the extended datasets are shown in Table 1.
Table 1: Statistics of the extended datasets.

| Dataset | Setting | Train (Labeled) | Train (Unlabeled) | Validation | Test |
|---|---|---|---|---|---|
| MSDIALOG | 1% | 1,410 | 140,420 | 5,000 | 21,280 |
| MSDIALOG | 5% | 7,050 | 134,780 | 5,000 | 21,280 |
| MSDIALOG | 10% | 14,100 | 127,730 | 5,000 | 21,280 |
| MANTIS | 1% | 2,640 | 260,990 | 12,000 | 50,000 |
| MANTIS | 5% | 13,200 | 250,430 | 12,000 | 50,000 |
| MANTIS | 10% | 26,400 | 237,230 | 12,000 | 50,000 |
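To illustrate the BM25 negative sampling described above, here is a minimal sketch using the rank_bm25 package (our choice of library, since the paper does not specify its BM25 implementation; the corpus is hypothetical):

```python
from rank_bm25 import BM25Okapi

# A hypothetical pool of answers drawn from *other* dialogues.
corpus = ["You can reset your password from the account settings page.",
          "Try reinstalling the driver from the vendor's website.",
          "The energy of a photon equals the level spacing of the system."]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

def sample_negatives(context, k=9):
    """Retrieve the k most relevant answers from other dialogues as negatives."""
    scores = bm25.get_scores(context.lower().split())
    top = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]
    return [corpus[i] for i in top]

negatives = sample_negatives("How do I reset my password?", k=2)
```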
5.2 Baselines
We compare the proposed ICAST with recent state-of-the-art methods that have reported results on the MSDIALOG and MANTIS datasets.
IART (Yang et al., 2020) proposes the intent-aware attention mechanism to weight the utterances in context.
SAM (Zhang et al., 2022b) captures semantic and similarity features to enhance answer selection.
JM (Zhang et al., 2021) concatenates the context and all candidate responses as input to select the most proper response.
BIG (Deng et al., 2021) uses the bilateral generation method to augment data and designs a contrastive loss function for training.
GRN (Liu et al., 2021b) uses NUP and UOP pre-training tasks, and combines the graph network and sequence network to model the reasoning process of multi-turn response selection.
GRAY (Lin et al., 2020) generates grayscale data by a fine-tuned generation model and proposes a multi-level ranking loss function for training.
BERT_FP (Whang et al., 2020) learns the interactions between utterances in context to enhance answer selection.
BERT (Devlin et al., 2019) is a general classification framework, which predicts answer labels on the vector of [CLS] token.
Teacher-student self-training (TSST) (Li et al., 2021) is a semi-supervised method in which a teacher model is first trained on a small amount of labeled data to generate pseudo labels for a large unlabeled dataset, and a student model is then trained with the pseudo labels.
5.3 Implementation Details
All models are implemented with PyTorch and HuggingFace. We tune hyper-parameters on the validation set and report results on the test set. We use the BERT-base-uncased model (Devlin et al., 2019) as the encoder in both fα and fβ, with shared parameters. We use AdamW (Loshchilov and Hutter, 2017) as the optimizer. The batch size is 16, the initial learning rate is 5e-5, and the weight decay is 0.01. The maximum number of context turns is set to 4. The maximum lengths of the context and answer are set to 400 and 100, respectively. The dropout ratio is 0.1. ICAST generates pseudo labels every 5 epochs. MC dropout samples T = 5 times. For the pseudo-labeling thresholds, we set λ+ = 0.8, λ− = 0.1, and λh = 0.2. For the intent confidence gain threshold, we set λ = 0.0 for the MANTIS dataset with 5% and 10% labeled data, and λ = 0.02 otherwise.
For each hyper-parameter, we fix the others and select the value that gives the best performance on the validation sets. λ−, λ+, and λh are selected in (0, 1) with a grid of 0.1; the remaining pseudo-labeling thresholds are selected in (0.1, 0.8) with a grid of 0.1; and λ is selected in (0, 0.05) with a grid of 0.01. ICAST has 109,493,005 parameters. We train ICAST on two 2080Ti GPUs with random seed 42; training takes 48 hours.
6 Results
6.1 Overall Performance
We compare the overall performance of ICAST against the baseline methods. We also report results for ICAST (Teacher). ICAST (Teacher) uses intent labels whereas BERT and BERT_FP do not, so this is not a strictly fair comparison; we conduct these experiments to see whether our method can outperform baselines without using unlabeled data. The overall results are shown in Table 2.
Table 2: Overall performance. In each row, the first seven metric columns (P through MAP) are on MSDIALOG and the last seven on MANTIS.

| Setting | Model | P | R | F1 | R@1 | R@2 | R@5 | MAP | P | R | F1 | R@1 | R@2 | R@5 | MAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1% labeled | IART† | 22.18 | 46.75 | 30.08 | 25.65 | 46.28 | 77.58 | 47.74 | 48.29 | 52.22 | 50.18 | 50.40 | 68.34 | 86.22 | 66.12 |
| | SAM† | 44.17 | 44.36 | 44.26 | 46.89 | 59.06 | 77.02 | 60.72 | 57.75 | 58.62 | 58.18 | 65.10 | 76.32 | 88.54 | 75.60 |
| | JM‡ | 44.80 | 44.59 | 44.70 | 44.54 | 60.76 | 84.30 | 61.26 | 62.95 | 62.62 | 62.78 | 62.64 | 77.32 | 92.30 | 75.25 |
| | BIG‡ | 44.07 | 44.78 | 44.42 | 50.93 | 66.30 | 87.50 | 66.15 | 57.91 | 57.42 | 57.66 | 70.22 | 83.04 | 95.12 | 80.78 |
| | GRAY‡ | 41.68 | 42.15 | 41.91 | 51.26 | 66.40 | 85.62 | 66.10 | 61.30 | 60.72 | 61.01 | 64.67 | 77.34 | 88.32 | 75.57 |
| | GRN‡ | 43.41 | 43.37 | 43.39 | 43.28 | 61.60 | 86.46 | 61.19 | 61.75 | 61.10 | 61.42 | 61.06 | 76.64 | 93.66 | 74.56 |
| | BERT_FP† | 44.32 | 42.95 | 43.62 | 56.76 | 72.08 | 91.25 | 70.90 | 66.26 | 62.86 | 64.51 | 75.62 | 86.14 | 95.22 | 84.11 |
| | BERT‡ | 48.56 | 45.34 | 46.90 | 54.79 | 68.32 | 85.80 | 68.04 | 67.28 | 65.62 | 66.44 | 74.82 | 83.00 | 92.16 | 82.41 |
| | ICAST (Teacher) | 49.82 | 46.33 | 48.01 | 56.86 | 67.81 | 85.38 | 69.03 | 68.48 | 66.12 | 67.28 | 77.28 | 86.12 | 94.98 | 82.98 |
| 1% labeled + all unlabeled | TSST‡ | 53.72 | 52.58 | 53.14 | 61.04 | 73.91 | 89.70 | 73.04 | 73.73 | 72.60 | 73.16 | 82.94 | 91.08 | 97.88 | 89.18 |
| | ICAST | 57.05 | 54.32 | 55.65 | 62.21 | 76.31 | 91.07 | 73.77 | 74.89 | 72.72 | 73.79 | 83.68 | 90.68 | 96.42 | 88.31 |
| 5% labeled | IART† | 23.52 | 49.38 | 31.86 | 28.80 | 48.02 | 79.93 | 49.97 | 50.24 | 53.60 | 51.86 | 51.56 | 70.66 | 89.52 | 67.75 |
| | SAM† | 49.52 | 51.45 | 50.47 | 54.27 | 67.66 | 83.03 | 67.32 | 59.16 | 57.82 | 58.48 | 66.52 | 76.88 | 89.28 | 76.51 |
| | JM‡ | 50.98 | 49.81 | 50.39 | 50.37 | 67.62 | 89.47 | 66.56 | 67.16 | 66.82 | 66.99 | 66.92 | 80.83 | 94.94 | 78.47 |
| | BIG‡ | 50.82 | 50.93 | 50.88 | 58.12 | 73.07 | 89.80 | 71.53 | 61.34 | 60.88 | 61.11 | 74.22 | 87.58 | 96.94 | 84.02 |
| | GRAY‡ | 48.99 | 48.26 | 48.62 | 55.16 | 69.54 | 86.23 | 68.75 | 62.53 | 66.50 | 64.45 | 70.24 | 80.92 | 90.62 | 79.48 |
| | GRN‡ | 49.28 | 50.04 | 49.66 | 49.76 | 66.77 | 89.52 | 66.04 | 64.27 | 63.00 | 63.62 | 63.78 | 78.60 | 93.38 | 76.27 |
| | BERT_FP† | 49.74 | 50.93 | 50.33 | 62.96 | 77.16 | 92.76 | 75.41 | 70.04 | 68.32 | 69.17 | 80.22 | 89.36 | 97.30 | 87.37 |
| | BERT‡ | 52.01 | 49.67 | 50.81 | 61.23 | 72.60 | 85.19 | 72.13 | 71.17 | 67.56 | 69.32 | 77.70 | 86.82 | 95.48 | 85.20 |
| | ICAST (Teacher) | 54.22 | 51.83 | 53.00 | 62.59 | 74.38 | 90.36 | 74.16 | 73.13 | 69.20 | 71.11 | 80.82 | 88.12 | 95.86 | 84.27 |
| 5% labeled + all unlabeled | TSST‡ | 58.34 | 58.78 | 58.56 | 64.89 | 74.62 | 86.41 | 74.61 | 74.33 | 72.92 | 73.61 | 81.62 | 89.32 | 96.10 | 87.83 |
| | ICAST | 61.54 | 59.72 | 60.62 | 69.54 | 80.35 | 93.09 | 77.29 | 74.60 | 74.62 | 74.61 | 84.38 | 90.76 | 97.06 | 89.78 |
| 10% labeled | IART† | 34.38 | 47.22 | 39.79 | 39.05 | 58.31 | 84.77 | 58.00 | 50.77 | 53.04 | 51.88 | 51.80 | 71.20 | 89.28 | 68.04 |
| | SAM† | 55.63 | 54.27 | 54.94 | 59.53 | 70.62 | 85.05 | 71.00 | 61.39 | 60.00 | 60.69 | 66.88 | 77.92 | 90.84 | 77.08 |
| | JM‡ | 57.64 | 57.56 | 57.60 | 57.61 | 73.12 | 90.97 | 71.70 | 68.06 | 68.98 | 68.52 | 68.22 | 80.46 | 94.04 | 79.02 |
| | BIG‡ | 56.15 | 55.96 | 56.06 | 62.96 | 76.78 | 90.08 | 74.92 | 62.74 | 62.34 | 62.54 | 76.60 | 87.62 | 96.28 | 85.08 |
| | GRAY‡ | 54.46 | 53.05 | 53.75 | 62.45 | 76.08 | 90.60 | 74.51 | 65.20 | 65.26 | 65.23 | 74.80 | 85.74 | 94.66 | 83.51 |
| | GRN‡ | 54.06 | 53.43 | 53.74 | 53.52 | 70.67 | 90.08 | 68.96 | 66.01 | 64.92 | 65.46 | 66.10 | 80.60 | 93.34 | 77.83 |
| | BERT_FP† | 57.95 | 56.81 | 57.38 | 67.48 | 80.16 | 94.07 | 78.56 | 71.04 | 68.20 | 69.59 | 80.72 | 89.38 | 96.82 | 87.59 |
| | BERT‡ | 61.94 | 60.19 | 61.05 | 64.38 | 73.77 | 85.99 | 74.12 | 70.33 | 69.56 | 69.94 | 82.12 | 91.00 | 97.70 | 88.72 |
| | ICAST (Teacher) | 62.41 | 59.77 | 61.06 | 66.54 | 76.55 | 89.09 | 76.43 | 71.89 | 70.24 | 71.05 | 81.92 | 90.02 | 97.12 | 88.29 |
| 10% labeled + all unlabeled | TSST‡ | 63.28 | 63.34 | 63.31 | 70.91 | 81.95 | 93.18 | 80.37 | 76.17 | 73.34 | 74.73 | 83.70 | 91.18 | 97.50 | 89.43 |
| | ICAST | 65.98 | 64.89 | 65.43 | 72.27 | 81.95 | 91.63 | 79.63 | 77.43 | 73.36 | 75.35 | 84.60 | 91.52 | 97.36 | 88.59 |
First, in terms of all classification metrics, ICAST and ICAST (Teacher) outperform the baselines in each setting, with a single exception: the R score of ICAST (Teacher) is 0.42% lower than BERT trained on the 10% labeled MSDIALOG dataset. Specifically, on the MSDIALOG dataset, ICAST with 1%, 5%, and 10% labeled data improves F1 over the corresponding strongest baselines by 2.51%, 2.06%, and 2.12%. On the MANTIS dataset, ICAST with 1%, 5%, and 10% labeled data improves F1 over the corresponding strongest baselines by 0.63%, 1.00%, and 0.62%. This demonstrates the effectiveness of ICAST at classifying correct answers. We believe there are two reasons: (i) the predicted intent labels provide extra information that is useful for selecting correct answers; and (ii) the self-training paradigm calibrates answer labels for continuous improvement. For example, with self-training on the 10% labeled MSDIALOG dataset and all unlabeled data, the R score of ICAST is 4.70%/1.55% higher than BERT/TSST, respectively.
Second, in terms of ranking metrics, we have the following observations. (i) ICAST outperforms all baselines in terms of R@1 in each setting. Specifically, on the MSDIALOG dataset, ICAST with 1%, 5%, and 10% labeled data achieves R@1 scores 1.17%, 4.65%, and 1.36% higher than the corresponding strongest baselines, respectively; on the MANTIS dataset, the improvements are 0.74%, 2.76%, and 0.90%, respectively. This indicates that ICAST can rank an accurate answer at the top. (ii) For R@2, R@5, and MAP, ICAST achieves the highest scores in most settings, with the following exceptions: on the MSDIALOG dataset with 10% labeled data and all unlabeled data, R@5 and MAP decrease by 1.55% and 0.74%; on the MANTIS dataset with 1% labeled data and all unlabeled data, R@2, R@5, and MAP decrease by 0.40%, 1.46%, and 0.87%, and with 10% labeled data and all unlabeled data, R@5 and MAP decrease by 0.14% and 0.84%. Our method does not possess a significant advantage in terms of R@2, R@5, and MAP, as the primary objective of answer selection is to identify the answer rather than to generate a ranking list. Hence, the fundamental performance measures are precision, recall, and F1 (Wang et al., 2021b), and we evaluate the models using these standard metrics for a fair comparison. Additionally, we report supplementary ranking metrics (i.e., R@2, R@5, MAP) to assess whether improvements in selection metrics cause a noteworthy decline in ranking metrics; the results show no considerable decrease.
Third, using self-training with unlabeled data has the largest impact in all settings in terms of both classification and ranking metrics. Specifically, on the MSDIALOG dataset with 1%, 5%, and 10% labeled data, F1 scores increase by 7.64%, 7.62%, and 4.37%, and MAP scores by 2.87%, 1.88%, and 1.81%. On the MANTIS dataset with 1%, 5%, and 10% labeled data, F1 scores increase by 6.51%, 3.50%, and 4.30%, and MAP scores by 5.07%, 2.41%, and 0.71%. This reveals that ICAST benefits from making good use of unlabeled data through self-training. Besides, the influence on classification performance is larger than that on ranking performance in each setting.
Last but not least, our method does not require much intent-labeled data, which also motivates us to conduct experiments with only small proportions of labels (1%, 5%, 10%). For example, with 1% labeled data, our method outperforms the baselines with only 141 and 264 intent labels on the MSDIALOG and MANTIS datasets, respectively. Thus, it is possible to apply our method in practice, even without a large number of intent labels.
6.2 Ablation Study
To better understand the contribution of each functional component of ICAST, i.e., intent confidence gain estimation (ICGE), answer label calibration (ALC), and intent generation (IG), we conduct the following ablation studies. After removing IG (denoted as “-IG”), ICGE and ALC no longer work, so ICAST degenerates to the TSST model: ICGE estimates the intent confidence gain from the predicted intents, which are outputs of the IG module, and ALC uses intent confidence scores, whose computation also requires the predicted intents, to select unlabeled samples. After removing ICGE (denoted as “-ICGE”), the model does not filter the predicted intents by intent confidence gain and adds all predicted intent labels to the inputs. After removing both ICGE and ALC (denoted as “-ICGE-ALC”), the model does not select unlabeled data by intent confidence score and degenerates into TSST with all predicted intents added to the inputs. Table 3 reports the results of the ablation studies.
Table 3: Ablation results. In each row, the first seven metric columns (P through MAP) are on MSDIALOG and the last seven on MANTIS.

| Setting | Model | P | R | F1 | R@1 | R@2 | R@5 | MAP | P | R | F1 | R@1 | R@2 | R@5 | MAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1% labeled + all unlabeled | ICAST | 57.05 | 54.32 | 55.65 | 62.21 | 76.31 | 91.07 | 73.77 | 74.89 | 72.72 | 73.79 | 83.68 | 90.68 | 96.42 | 88.31 |
| | -ICGE | 54.81 | 54.04 | 54.42 | 60.24 | 72.85 | 86.46 | 71.83 | 74.44 | 71.78 | 73.08 | 81.68 | 89.02 | 94.86 ⇑ | 87.55 |
| | -ICGE-ALC | 54.13 | 53.52 | 53.82 | 60.80 ↑ | 73.77 ↑ | 89.38 ↑ | 72.86 ↑ | 74.00 | 71.96 ↑ | 72.96 | 82.66 ↑ | 90.82 ⇑↑ | 97.72 ⇑↑ | 88.94 ⇑↑ |
| | -IG | 53.72 | 52.58 | 53.14 | 61.04 | 73.91 | 89.70 | 73.04 | 73.73 | 72.60 | 73.16 | 82.94 | 91.08 ⇑ | 97.88 ⇑ | 89.18 ⇑ |
| 5% labeled + all unlabeled | ICAST | 61.54 | 59.72 | 60.62 | 69.54 | 80.35 | 93.09 | 77.29 | 74.60 | 74.62 | 74.61 | 84.38 | 90.76 | 97.06 | 89.78 |
| | -ICGE | 60.66 | 58.55 | 59.58 | 64.94 | 76.31 | 90.32 | 75.61 | 74.35 | 74.12 | 74.23 | 83.18 | 91.14 ⇑ | 97.50 ⇑ | 89.29 |
| | -ICGE-ALC | 59.94 | 58.92 ↑ | 59.43 | 66.30 ↑ | 77.82 ↑ | 91.63 ↑ | 76.86 ↑ | 74.19 | 73.56 | 73.87 | 82.80 | 90.72 | 96.66 | 88.81 |
| | -IG | 58.34 | 58.78 | 58.56 | 64.89 | 74.62 | 86.41 | 74.61 | 74.33 | 72.92 | 73.61 | 81.62 | 89.32 | 96.10 | 87.83 |
| 10% labeled + all unlabeled | ICAST | 65.98 | 64.89 | 65.43 | 72.27 | 81.95 | 91.63 | 79.63 | 77.46 | 73.36 | 75.35 | 84.60 | 91.52 | 97.36 | 88.59 |
| | -ICGE | 65.71 | 62.78 | 64.21 | 71.42 | 81.53 | 93.70 ⇑ | 80.51 ⇑ | 76.19 | 73.82 ⇑ | 74.98 | 83.82 | 91.26 | 96.98 | 89.46 ⇑ |
| | -ICGE-ALC | 64.27 | 64.09 | 64.18 ↑ | 70.63 | 81.95 | 92.95 | 80.15 | 75.60 | 74.26 | 74.92 | 84.06 | 91.38 | 97.32 | 89.69 |
| | -IG | 63.28 | 63.34 | 63.31 | 70.91 | 81.95 | 93.18 ⇑ | 80.37 ⇑ | 76.17 | 73.34 | 74.73 | 83.70 | 91.18 | 97.50 ⇑ | 89.43 ⇑ |
First, intent confidence gain estimation (ICGE), answer label calibration (ALC), and intent generation (IG) all have a positive influence on the overall classification performance on both the MSDIALOG and MANTIS datasets with 1%, 5%, and 10% labeled data. Removing IG from ICAST decreases F1 by 2.51%/2.06%/2.12% on the MSDIALOG dataset and 0.63%/1.00%/0.62% on the MANTIS dataset. This supports our hypothesis that the generated intents provide useful information for selecting correct answers. Removing ICGE from ICAST decreases F1 by 1.23%/1.04%/1.22% on the MSDIALOG dataset and 0.71%/0.38%/0.37% on the MANTIS dataset. This reveals that the intent confidence gain selects high-quality intent labels that help select correct answers. Removing ALC from ICAST without ICGE decreases F1 by a further 0.60%/0.15%/0.03%. This shows that ALC brings extra improvement even when ICGE is absent; meanwhile, it works better together with the other two components.
Second, in terms of ranking performance, R@1 decreases when removing ICGE, ALC, or IG from ICAST in all settings on the MSDIALOG and MANTIS datasets. Removing ICGE/ALC/IG with 1%, 5%, and 10% labeled data, R@1 drops by 1.97%/1.41%/1.17%, 4.60%/3.24%/4.65%, and 0.85%/1.64%/1.36% on the MSDIALOG dataset; R@1 drops by 2.00%/1.02%/0.74%, 1.20%/1.58%/2.76%, and 0.78%/0.54%/0.90% on the MANTIS dataset. This shows that all three functional components help rank correct answers at the top.
6.3 Analysis
Figure 3 shows the impact of the intent confidence gain threshold λ on the classification performance of ICAST. As λ increases, the average number of selected intents decreases. Meanwhile, F1 scores increase first, peak at λ = 0.02, and then decline: a moderate λ filters out noisy predicted intents while keeping informative ones to calibrate the answer labels, which explains the initial increase in F1, whereas a small λ admits more generated intents and thus more noise for answer selection, and a large λ discards too many useful intents, which is the likely reason for the eventual decrease in F1. Thus, λ balances between more predicted intents and fewer noisy intents.
6.4 Case Study
Table 4 shows a case study of how ICAST and TSST select different answers for the same given context.
| Context Utterances | Intent |
|---|---|
| User: How does a photon picture make the pattern? | OQ |
| Agent: Photons in mainstream physics, are quantum mechanical entities which in great numbers build up the classical electromagnetic radiation... | PA |
| User: Do you know why the photon which is hitting forward is causing an electron to move up-down? | IR |

| Candidate Answers | Model | ICG | Probability |
|---|---|---|---|
| A1: The theories of quantum mechanics for electron photon interactions can be found in https://www.website.com. | TSST | / | 0.00 |
| | ICAST | 0.14 | 0.99 |
| A2: The energy of a photon is equal to the level spacing of a two-level system. It is a result of energy conservation... | TSST | / | 0.96 |
| | ICAST | −0.13 | 0.71 |
In general, a model chooses the candidate answer with the highest probability among all candidates as the correct answer. In this case, the strongest baseline TSST incorrectly chooses the second candidate answer (A2), with a probability of 0.96, instead of the first candidate answer (A1), which has a probability of 0.00. This shows that selecting answers based solely on their probabilities can introduce significant bias. ICAST calibrates the probabilities based on ICG and correctly chooses A1 with a probability of 0.99, while skipping A2 with a probability of 0.71. ICAST computes ICGs by combining the context with its predicted intents and each candidate answer. The ICG of the correct answer is larger than λ, indicating that ICAST captures the intent information from the correct answer, so it increases the probability of the correct answer from 0.00 to 0.99. Meanwhile, the ICG of the incorrect answer is less than λ, indicating that ICAST cannot capture intent information from the incorrect answer, so it decreases the probability of the incorrect answer from 0.96 to 0.71. Furthermore, we explain the intuition: in the context utterances, the user asks the original question (OQ), the agent gives a potential answer (PA) that explains it, but the user still raises an information request (IR) for more detailed information. The user then anticipates an answer that includes a link or document providing more detail, rather than a continued textual explanation. Intuitively, the predicted intents can help monitor changes in the user’s expectations throughout the utterances.
7 Conclusion and Future Work
In this paper, we propose intent-calibrated self-training (ICAST) based on teacher-student self-training and intent-calibrated answer selection: we train a teacher model on labeled data to predict intent labels on unlabeled data; select high-quality intents by intent confidence gain to enrich inputs and predict pseudo answer labels; and retrain a student model on both the labeled and pseudo-labeled data. We conduct extensive experiments on two benchmark datasets, and the results show that ICAST outperforms baselines with small proportions (i.e., 1%, 5%, and 10%) of labeled data. We note that a greater proportion of labeled data may lead to higher performance; e.g., BERT_FP with 10% labeled data beats ICAST with 1% labeled data across all metrics on MSDIALOG. However, our focus is on verifying whether the proposed ICAST outperforms other methods given a very small amount of labeled data, and in some cases ICAST outperforms baselines trained with more labeled data. In future work, we will explore dialogue context that is more predictive than intents, e.g., user profiles.
8 Reproducibility
To facilitate reproducibility of the results reported in this paper, the code and data used are available at https://github.com/dengwentao99/ICAST.
Limitations
Our proposed ICAST has the following limitations. First, ICAST only considers user intents to enhance answer selection; it captures the user’s expectations solely from the predicted intent labels, without considering other user-centered factors such as user profiles and user feedback. Second, like retrieval-based methods, which have been shown to work well in professional question-answering fields, ICAST is limited in diversity: for example, it cannot retrieve multiple correct answers with different expressions given the same context. Third, since our model needs to predict intent labels, it requires a few additional parameters for this task.
Ethics Considerations
We realize that there are risks in developing dialogue systems, so it is necessary to pay attention to their ethical issues. It is crucial for a dialogue system to give correct answers to users while avoiding ethical problems such as privacy violations. We use public datasets to train our model; these datasets have been carefully processed by their publishers to avoid ethical problems. Specifically, the dataset publishers anonymized user IDs in all datasets, and only the tokens “user” and “agent” are used to represent the roles in a conversation. The utterances do not contain any private user information (e.g., names, phone numbers, addresses), preventing privacy disclosure.
Acknowledgments
We would like to thank the editors and reviewers for their helpful comments. This research was supported by the National Key R&D Program of China (grants No.2022YFC3303004, No.2020YFB1406704), the Natural Science Foundation of China (62102234, 62272274, 62202271, 61902219, 61972234, 62072279), the Key Scientific and Technological Innovation Program of Shandong Province (2019JZZY010129), the Natural Science Foundation of Shandong Province (ZR2021QF129), the Fundamental Research Funds of Shandong University, and VOXReality (European Union grant 101070521). All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.