Abstract
Supervised neural approaches are hindered by their dependence on large, meticulously annotated datasets, a requirement that is particularly cumbersome for sequential tasks. The quality of annotations tends to deteriorate with the transition from expert-based to crowd-sourced labeling. To address these challenges, we present CAMEL (Confidence-based Acquisition Model for Efficient self-supervised active Learning), a pool-based active learning framework tailored to sequential multi-output problems. CAMEL possesses two core features: (1) it requires expert annotators to label only a fraction of a chosen sequence, and (2) it facilitates self-supervision for the remainder of the sequence. By deploying a label correction mechanism, CAMEL can also be utilized for data cleaning. We evaluate CAMEL on two sequential tasks, with a special emphasis on dialogue belief tracking, a task plagued by the constraints of limited and noisy datasets. Our experiments demonstrate that CAMEL significantly outperforms the baselines in terms of efficiency. Furthermore, the data corrections suggested by our method contribute to an overall improvement in the quality of the resulting datasets.1
1 Introduction
Supervised training of deep neural networks requires large amounts of accurately annotated data (Russakovsky et al., 2015; Szegedy et al., 2017; Li et al., 2020b). A particularly challenging scenario arises when training for sequential multi-output tasks. In this case, the neural network is required to generate multiple predictions simultaneously, one for each output category, at every time step throughout an input sequence. Consequently, the labeling effort increases rapidly, becoming impractical as the demand for precise and consistent labeling across each time step and output category intensifies. Therefore, a heavy dependence on human-generated labels poses significant limitations on the scalability of such systems.
A prominent example of a sequential multi-output label task for which this bottleneck is evident is dialogue belief tracking. A dialogue belief tracker is one of the core components of a dialogue system, tasked with inferring the goal of the user at every turn (Young et al., 2007). Current state-of-the-art trackers are based on deep neural network models (Lin et al., 2021; van Niekerk et al., 2021; Heck et al., 2022). These models outperform traditional Bayesian network-based belief trackers (Young et al., 2010; Thomson and Young, 2010). However, neural belief trackers are greatly hindered by the lack of adequate training data. Real-world conversations, even those pertaining to a specific task-oriented domain, are extremely diverse. They encompass a broad spectrum of user objectives, natural language variations, and the overall dynamic nature of human conversation. While there are many sources for dialogue data, such as logs of call centers or virtual personal assistants, labeled dialogue data is scarce and several orders of magnitude smaller than, say, data for speech recognition (Panayotov et al., 2015) or translation (Bojar et al., 2017). Although zero-shot trackers do not require large amounts of labeled data, they typically underperform compared to supervised models that are trained on accurately labeled datasets (Heck et al., 2023).
One of the largest available labeled datasets for task-oriented dialogues is MultiWOZ, which is a multi-domain dialogue dataset annotated via crowdsourced annotators. The challenges in achieving consistent and precise human annotations are apparent in all versions of MultiWOZ (Budzianowski et al., 2018; Eric et al., 2020; Zang et al., 2020; Han et al., 2021; Ye et al., 2022). Despite manual corrections in the most recent edition, model performance has plateaued, not due to limitations in the models, but as a result of data inconsistencies (Li et al., 2020a).
Addressing the omnipresent issue of unreliable labels, as evident in the MultiWOZ dataset, is a common problem that affects the quality and reliability of supervised learning systems. In order to mitigate these issues and enhance the robustness of model training, we propose a novel methodology.
In this work, we present CAMEL, a pool-based semi-supervised active learning approach for sequential multi-output tasks. Given an underlying supervised learning model that can estimate confidence in its predictions, CAMEL substantially reduces the required labeling effort. CAMEL comprises:
A selection component that, rather than selecting whole sequences for labeling as is normally the case, selects a subset of time-steps and output categories within input sequences to be labeled by experts.
A self-supervision component that uses self-generated labels for the remaining time-steps and output categories within selected input sequences.
A label validation component which examines the reliability of the human-provided labels.
We first apply CAMEL within an idealized setting for machine translation, a generative language modeling task. CAMEL achieves impressive results, matching the performance of a model trained on the full dataset while utilizing less than 60% of the expert-provided labels. Subsequently, we apply CAMEL to the dialogue belief tracking task. Notably, we achieve 95% of a tracker’s full-training dataset performance using merely 16% of the expert-provided labels. Additionally, we propose an adaptation of the meta-post-hoc model approach (Shen et al., 2023), tailored for cost-efficient active learning. We demonstrate that CAMEL, utilizing uncertainty estimates from this cost-effective method, exhibits similar performance compared to using uncertainty estimates from a significantly more computationally expensive ensemble of models.
On top of this framework, we develop a method for automatically detecting and correcting inaccuracies of human labels in datasets. We illustrate that these corrections boost performance of distinct tracking models, overcoming the limitations imposed by labeling inconsistencies. Having demonstrated its efficacy in machine translation and dialogue belief tracking, our framework holds potential for broad applicability across various sequential multi-output tasks, such as object tracking, pose detection, and language modeling.
2 Related Work
2.1 Active Learning
Active learning is a machine learning framework that pinpoints scenarios in data that lack representation and interactively queries a designated annotator for labels (Cohn et al., 1996). The framework uses an acquisition function to identify the most beneficial data points for querying. Such a function estimates how performance can improve following the labeling of data. Functions of this kind often rely on various factors, such as prediction uncertainty (Houlsby et al., 2011), data space coverage (Sener and Savarese, 2018), variance reduction (Johansson et al., 2007), or topic popularity (Iovine et al., 2022).
Active learning approaches can be categorized into stream-based and pool-based (Settles, 2009). Stream-based setups are usually employed when data creation and labeling occur simultaneously. In contrast, pool-based approaches separate these steps, operating under the assumption that an unlabeled data pool is available.
Active learning has been frequently employed in tasks such as image classification (Houlsby et al., 2011; Gal et al., 2017) and machine translation (Vashistha et al., 2022; Liu et al., 2018). A noteworthy example in machine translation is the work of Hu and Neubig (2021), which enhances efficiency by applying active learning to datasets enriched with frequently used phrases. While this strategy does reduce the overall effort required for labeling, it inherently limits the scope of the annotator’s work to phrases only. As a result, this method may not support the annotation of longer texts, where understanding the context and nuances of full sentences is crucial.
At the same time, active learning is less prevalent in dialogue belief tracking, with Xie et al. (2018) being a notable exception. Their framework involves querying labels for complete sequences (dialogues) and bases selection on a single output category, neglecting any potential correlation between categories. Furthermore, this approach does not account for annotation quality problems.
One work that addresses the issue of annotation quality within an active learning framework is Su et al. (2018). In that work, stream-based active learning is deployed for the purpose of learning whether a dialogue is successful. The user-provided labels are validated using a label confidence score. This innovative learning strategy is however not directly applicable to sequential multi-output tasks, as it does not deal with the sequential nature of the problem.
2.2 Semi-Supervised Learning
Semi-Supervised Learning (SSL) makes use of both labeled and unlabeled data to improve learning efficiency and model performance. While SSL traditionally encompasses various approaches, including encoder-decoder architectures, alternative methods incorporate self-labeling or self-supervision to enhance model training with minimal human intervention.
In SSL, a “pre-trained” model typically undergoes an initial phase of unsupervised learning, leveraging large volumes of unlabeled data to learn representations. Subsequently, the model is fine-tuned for specific tasks using labeled data. This fine-tuning process, especially prevalent in state-of-the-art transformer-based models like RoBERTa (Liu et al., 2019), is integral to semi-supervised learning strategies, serving as an illustration of their practical utility (van Niekerk et al., 2021; Su et al., 2022; Heck et al., 2022).
Moreover, SSL can utilize self-training techniques, such as Pseudo Labeling and Noisy Student Training, where a “teacher” model generates pseudo labels for unlabeled data, which are then used to train a “student” model. In this iterative process, the student assumes the teacher role. This semi-supervised training can improve performance without necessitating extra labels.
The Pseudo-Label method proposed by Lee (2013) is a straightforward and effective SSL technique where the model’s confident predictions on unlabeled data are treated as ground truth labels. This method has been widely adopted due to its simplicity and effectiveness in various domains.
Recent advances in SSL have focused on methods such as FixMatch (Sohn et al., 2020), which simplifies the semi-supervised learning pipeline by combining consistency regularisation and pseudo-labeling. FixMatch leverages weakly augmented data to predict pseudo labels, and strongly augmented data to enforce consistency.
Additionally, Xie et al. (2020) propose the Noisy Student method, which extends the teacher- student framework by adding noise to the student model, thereby improving its robustness and performance. Further, Kumar et al. (2020) explore the concept of gradual domain adaptation through self-training, where a model is iteratively trained on data that gradually shifts from the source to the target domain. This approach has been shown to effectively handle large distribution shifts by leveraging intermediate domains to improve generalisation.
In summary, the incorporation of self- supervision and iterative training frameworks in SSL has proven to be highly effective, driving advancements in model performance with minimal labeled data. These methods not only enhance the learning process but also reduce the reliance on extensive labeled datasets, making SSL a crucial area of research in modern machine learning.
2.3 Label Validation
The process of manually correcting labels is very tedious and expensive. As a result, many works focus on learning from imperfect labels, using loss functions and/or model architectures adapted for label noise (Reed et al., 2015; Xiao et al., 2015; Sukhbaatar et al., 2015). Still, these methods have been unable to match the performance of models trained on datasets that include manually corrected labels. However, the alternative of automated label validation or correction is often overlooked by such works. It has been shown that learning from automatically corrected labels, e.g., based on confidence scores, performs better than learning from noisy labels alone (Liu et al., 2017; Jiao et al., 2019). The major drawback of these approaches is that they frequently rely on overconfident predictions of neural network models to correct labels, which can further bias the model.
3 CAMEL: Confidence-based Acquisition Model for Efficient Self-supervised Active Learning
In this section, we introduce our pool-based active learning approach, named CAMEL, to address sequential multi-output classification problems. Let us consider a classification problem with input features $\mathbf{x}$ and output $\mathbf{y}$. According to Read et al. (2015), such a problem can be cast as a multi-output classification problem if the output consists of multiple label categories that need to be predicted simultaneously. Specifically, for a problem with $M$ categories, the output is represented as $\mathbf{y} = (y^1, \ldots, y^M)$, where each $y^m$, $m \in [1, M]$, can be binary or multivariate. Furthermore, this problem is characterized as a sequential classification problem if the output is dependent on a sequence of prior inputs. For a sequence with $T$ time-steps, the input-output pairs can be represented as $\{(\mathbf{x}_t, \mathbf{y}_t)\}_{t=1}^{T}$, where $\mathbf{y}_t = (y_t^1, \ldots, y_t^M)$ represents the output labels at time-step $t$.
In a conventional setting, for an unlabeled data sequence $\{\mathbf{x}_t\}_{t=1}^{T}$, an annotator would typically be required to provide a label $y_t^m$ for each label category $m$ at every time-step $t$, which is considerably expensive.
3.1 Requirements
CAMEL, as a confidence-based active learning framework, utilizes confidence estimates to determine data points to be queried for labeling. The framework relies on the model's ability to gauge the certainty of each prediction. Specifically, for every time-step $t$ in a sequence, for each category $m$ in a multi-output setting, and for each possible value $v$ that $y^m$ can take, the model calculates the predictive probability $P(y_t^m = v \mid \mathbf{x}_{1:t})$. These probabilities, collected over all values of $y_t^m$, form the predictive distribution that CAMEL uses for active learning decisions.
The calibration of these confidence estimates is also critical. Calibration refers to the alignment between the model’s estimated confidence and the empirical likelihood of its predictions (Desai and Durrett, 2020). Should the model’s confidence estimates be poorly calibrated, it may select instances that are not informative, resulting in an inefficient allocation of the annotation budget and potentially suboptimal performance.
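To make the notion of calibration concrete, the sketch below computes a standard binned expected calibration error (ECE), the metric we report in Table 1; the equal-width binning and the bin count are illustrative choices rather than a description of our exact evaluation code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE sketch: average gap between confidence and empirical accuracy.

    confidences: array of predicted confidence scores in [0, 1].
    correct:     boolean array, True where the corresponding prediction is correct.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece
```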
3.2 Active Learning Approach
The approach we propose starts with an initial learning model, which is trained using a small labeled seed dataset and iteratively progresses through four stages: data selection, labeling, label validation, and semi-supervised learning. These iterations continue until either a pre-defined performance threshold is achieved or the dataset is fully labeled. The schematic representation of this approach is illustrated in Figure 1.
Figure 1: CAMEL comprises four stages. Stage 1 involves data selection, choosing instances for labeling where the model shows uncertainty (confidence below the αsel threshold), as indicated by pink arrows. In Stage 2, annotators label the selected instances while the model self-labels the remaining ones (dashed green arrows). Stage 3 (optional) validates labels using a label confidence estimate, incorporating only labels exceeding the αval threshold and the self-labeled data into the dataset (black arrows). Finally, Stage 4 involves retraining the model for the next cycle.
Stage 1: Data Selection
In each cycle, we select a subset of Nsel sequences from the unlabeled pool of size Nunlb. Selection is based on the model's prediction confidence (specified in Equation 1). Instances in which the model displays low confidence (confidence below a threshold αsel) are selected. More precisely, an input sequence is selected if the model shows high uncertainty for at least one of its time-step and label category instances (t, m). The αsel threshold is set such that Nsel sequences are selected for labeling.
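As an illustration of this selection rule, the following sketch picks the Nsel least-confident sequences and derives αsel from them; the array layout and function names are assumptions made for exposition, not part of our released implementation.

```python
import numpy as np

def select_sequences(conf, n_sel):
    """Stage 1 sketch: choose the n_sel sequences whose least-confident
    (time-step, category) instance is lowest, which implicitly sets the
    threshold alpha_sel such that exactly n_sel sequences are selected.

    conf: list of arrays; conf[i] has shape (T_i, M) and holds prediction
          confidence scores in [0, 1] for unlabeled sequence i.
    """
    # Minimum confidence over all time-steps and categories of each sequence.
    min_conf = np.array([c.min() for c in conf])
    order = np.argsort(min_conf)            # least confident sequences first
    selected = order[:n_sel]
    alpha_sel = min_conf[order[n_sel - 1]]  # confidence of the last selected one
    return selected.tolist(), float(alpha_sel)
```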
Stage 2: Labeling
In the input sequences selected in Stage 1, the learning model self-labels the time-steps and categories where its confidence is above the threshold αsel. Concurrently, expert annotators are responsible for labeling the remaining time-steps and categories.
Stage 3: Label Validation
This is an optional step; we call the variant of CAMEL that contains this stage Confidence-based Acquisition Model for Efficient Self-supervised Active Learning with Label Validation (CAMELL). We consider expert-provided labels whose label confidence falls below a threshold αval to be potentially incorrect. This label confidence is not assigned by the annotators themselves but is computed by the learning model. To safeguard the model from being trained with these potentially erroneous labels, we purposely exclude them (i.e., these labels are masked in the dataset). The αval threshold can be set using a development set.
Stage 4: Semi-supervised Learning
At each iteration of the active learning approach, the expert-provided labels that passed validation (Stage 3) and the self-determined labels from Stage 2 are added to the labeled pool, resulting in Nlab + Nsel labeled data sequences. Based on these, the learning model is retrained.
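Putting Stages 2 and 3 together, the sketch below processes one selected sequence; `expert_label`, `label_confidence`, and the `MASKED` placeholder stand in for the annotator interface, the label confidence estimator of Section 3.3.2, and the loss-masking convention, and are assumptions for illustration.

```python
MASKED = None  # placeholder: masked labels are ignored by the training loss

def label_sequence(conf, preds, expert_label, label_confidence,
                   alpha_sel, alpha_val):
    """Stages 2-3 sketch for one selected sequence.

    conf, preds: (T, M) numpy arrays of prediction confidences and predicted values.
    expert_label(t, m):        queries the annotator for instance (t, m).
    label_confidence(t, m, y): label confidence estimate for an expert label y.
    """
    T, M = conf.shape
    labels = [[MASKED] * M for _ in range(T)]
    for t in range(T):
        for m in range(M):
            if conf[t, m] >= alpha_sel:
                # Stage 2: confident instances are self-labeled by the model.
                labels[t][m] = preds[t, m]
            else:
                # Stage 2: the annotator labels the uncertain instance.
                y = expert_label(t, m)
                # Stage 3 (optional): keep the label only if it passes validation.
                if label_confidence(t, m, y) >= alpha_val:
                    labels[t][m] = y
    return labels  # Stage 4: append to the labeled pool and retrain
```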
3.3 Confidence Estimation
To accurately estimate the prediction confidence required in Stage 1 as well as the label confidence in Stage 3, we propose a confidence estimation model for each stage. These models are designed to encapsulate the learning model’s confidence by considering both its total and knowledge-based uncertainties. Total uncertainty captures all uncertainty in the model’s prediction, irrespective of the source. Conversely, knowledge uncertainty in a model originates from its incomplete understanding, which occurs due to a lack of relevant data during training, or the inherent complexity of the problem (Gal, 2016).
Both the prediction and label confidence estimation models share the same objective: to estimate the probability that the value for a specific label category m at time-step t is correct. To provide the training data for these models, we assume that the labels in the labeled pool are correct, as they have already been validated. Furthermore, we retrain these models whenever more data is labeled.
Figure 2: Category-specific uncertainty measures: (a) displays prediction uncertainty, including prediction probability and total and knowledge uncertainty; (b) depicts label uncertainty, including label probability and total and knowledge uncertainty from both learning and noisy models.
The design choices for the confidence estimation models were motivated by a desire to capture both intra- and inter-category uncertainty for reliable confidence estimation. We observed that excluding inter-category features degraded performance, emphasizing the importance of incorporating them.
3.3.1 Prediction Confidence Estimation
The objective of the prediction confidence estimation model is to assess, based on a prediction confidence score, whether the value predicted by the learning model is the “true” value. This model, also known as the confidence-based acquisition model, is used as the selection criterion in Stage 1.
3.3.2 Label Confidence Estimation
The objective of the label confidence estimation model is to determine, based on a label confidence score, whether an annotator's label is the “true” value. In Su et al. (2018), the confidence score of the learning model is directly used for both purposes. We believe this is a suboptimal strategy, because the model has not been exposed to instances of “incorrect” labels. To address this, we generate a noisy dataset featuring “incorrect” labels for training purposes.
Further, we extend the input features of the label confidence estimator to include uncertainty measures drawn from both a noisy model, trained on the corresponding noisy dataset, and the original learning model (as depicted in Figure 2b). Given that the noisy model is conditioned to accept the “incorrect” labels as correct, the discrepancy in uncertainty between the noisy model and the learning model enhances the label confidence estimator's ability to identify potentially incorrect labels.
Noisy Dataset
The creation of a noisy dataset can be approached in two ways. One method is to randomly replace a portion of labels. However, this approach may not yield a realistic noisy dataset, considering human errors are rarely random. A second approach, particularly when the learning model is an ensemble, as is often the case for uncertainty-endowed deep learning models (Gal and Ghahramani, 2016; Ashukha et al., 2020), is to leverage individual ensemble members to supply noisy labels (see Section 4.5.1 for details related to an ensemble free approach). This method may be more effective, given the individual members’ typical lower accuracy compared to the ensemble as a whole.
In our proposed approach, we initially select αnoise percent of the sequences from the training data at random. For each category m, we choose a random ensemble member to generate noisy labels. This ensemble member creates labels at each time-step t by sampling from its predictive probability distribution. To avoid reproducing the original clean labels, their probabilities are set to zero prior to sampling. The noisy dataset is regenerated after each update of the learning model using the updated ensemble members, enhancing the diversity of the noisy labels.
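A possible implementation of this sampling scheme is sketched below; the (K, T, M, V) tensor layout of per-member predictive distributions is an assumption made for clarity.

```python
import numpy as np

def make_noisy_labels(member_probs, clean_labels, rng):
    """Draw a noisy label per (time-step, category) from a random ensemble member.

    member_probs: array (K, T, M, V) of per-member predictive distributions.
    clean_labels: array (T, M) of original label indices (to be excluded).
    rng:          a numpy Generator, e.g. np.random.default_rng(0).
    """
    K, T, M, V = member_probs.shape
    noisy = np.empty((T, M), dtype=int)
    for m in range(M):
        k = rng.integers(K)                  # one random ensemble member per category
        for t in range(T):
            p = member_probs[k, t, m].copy()
            p[clean_labels[t, m]] = 0.0      # never reproduce the clean label
            p /= p.sum()                     # renormalise before sampling
            noisy[t, m] = rng.choice(V, p=p)
    return noisy
```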
3.4 Label Correction
We propose a label correction method that utilizes the model that solves the task at hand, referred to as the learning model, the label confidence estimation model (Section 3.3.2), and the prediction confidence estimation model (Section 3.3.1). In order to correct a noisy dataset, this method involves three steps: (1) detecting potentially erroneous labels, (2) determining which of these labels can be accurately corrected by the learning model, and (3) substituting the incorrect labels with the learning model's predictions. Detecting potentially erroneous labels requires the label confidence estimation model and the hyperparameter αval, set such that all labels with confidence below this threshold are considered potentially incorrect. The prediction confidence model is then used to estimate the learning model's confidence in its own prediction for each detected instance. If this confidence is greater than the confidence assigned to the label by the label confidence estimation model, the label is substituted with the learning model's prediction.
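The three-step rule can be summarised in a few lines; `label_conf`, `pred_conf`, and `model_pred` are placeholders for the label confidence estimator, the prediction confidence estimator, and the learning model's predictions.

```python
def correct_labels(labels, label_conf, pred_conf, model_pred, alpha_val):
    """Label correction sketch for one sequence.

    labels:     dict {(t, m): y} of annotator-provided labels.
    label_conf: dict {(t, m): c} confidence that the given label is correct.
    pred_conf:  dict {(t, m): c} confidence in the learning model's prediction.
    model_pred: dict {(t, m): y_hat} the learning model's predicted value.
    """
    corrected = dict(labels)
    for (t, m), y in labels.items():
        # Step 1: flag potentially erroneous labels.
        if label_conf[(t, m)] < alpha_val:
            # Steps 2-3: substitute only if the model is more confident in its
            # own prediction than the estimator is in the annotator's label.
            if pred_conf[(t, m)] > label_conf[(t, m)]:
                corrected[(t, m)] = model_pred[(t, m)]
    return corrected
```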
3.5 Efficient Confidence Estimation with Post-hoc Uncertainty Learning
To obtain reliable estimates of the knowledge and total uncertainties required in Section 3.3, an ensemble-based approach is typically employed; however, this method is computationally expensive (Gal, 2016). This challenge is amplified in active learning scenarios, where the model is frequently updated. Shen et al. (2023) propose an uncertainty estimation technique in which uncertainties are generated by a post-hoc Dirichlet meta-model, offering greater computational efficiency than an ensemble of models. This method enables the model to distinguish between knowledge and data uncertainty, without needing several instances of the learning model. The post-hoc Dirichlet meta-model involves a two-stage training process. In the initial stage, a model with the same architecture as the learning model is trained to create a base model. In the second stage, meta-features are employed to estimate the uncertainties of the base model. These meta-features, derived from various intermediate layers of the base model, capture distinct levels of feature representation, from low- to high-level representations. Utilizing the diversity in these representations allows for more nuanced uncertainty quantification (Shen et al., 2023). To capture the uncertainty of the base model, we utilize a meta-model. This meta-model takes as input the intermediate features from the base model and outputs the parameters of a Dirichlet distribution. This Dirichlet distribution over the probability simplex, in turn, describes the uncertainty present in the prediction.
More rigorously, given a base neural network model that solves the task at hand, a set of $L$ features $\{f_l\}_{l=1}^{L}$ is extracted from different layers of this model for a given input, where $L$ refers to the number of layers of the base model. These intermediate features can include embeddings from various layers within a neural network, such as the transformer layers in a transformer model. Meta-features are computed via small meta-feature extraction layers $g_l$.
In our case, these are fully connected layers with a ReLU activation function that map the intermediate features to meta-features of dimension $d_{\text{meta}}$, i.e., $m_l = g_l(f_l)$ for $l = 1, \ldots, L$. These meta-features are then combined and mapped to the required prediction dimension through another fully connected layer with ReLU activation.
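A minimal PyTorch sketch of this meta-model is given below, following our reading of Shen et al. (2023): per-layer extractors $g_l$, concatenation, and a head producing Dirichlet concentration parameters. Adding 1 to the non-negative head output to keep the concentrations strictly positive is our assumption; the exact parameterisation may differ.

```python
import torch
import torch.nn as nn

class DirichletMetaModel(nn.Module):
    """Sketch: maps intermediate features of a frozen base model to the
    concentration parameters of a Dirichlet distribution over classes."""

    def __init__(self, feature_dims, d_meta, n_classes):
        super().__init__()
        # One small meta-feature extractor g_l per intermediate base-model layer.
        self.extractors = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d_meta), nn.ReLU()) for d in feature_dims]
        )
        # Combined meta-features are mapped to the prediction dimension.
        self.head = nn.Sequential(
            nn.Linear(d_meta * len(feature_dims), n_classes), nn.ReLU()
        )

    def forward(self, features):
        # features: list of tensors f_l, each of shape (batch, feature_dims[l]).
        meta = [g(f) for g, f in zip(self.extractors, features)]
        evidence = self.head(torch.cat(meta, dim=-1))  # non-negative evidence
        # +1 keeps the Dirichlet concentrations strictly positive (assumption).
        return evidence + 1.0
```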
3.5.1 Learning Objective
The post-hoc meta-model is trained using Bayesian matching loss (Joo et al., 2020) with the same training dataset as the base model.
We can show that an optimal state for this model is reached when the output concentration parameters equal the sum of the prior concentration parameters, β, and the scaled one-hot encoded label. This mechanism enables the model to adjust its uncertainty by integrating both prior knowledge and the evidence gathered from observed data. However, the reliance on constant prior concentration parameters, β, introduces a limitation: it encourages the model to generate similar uncertainty estimates across all inputs, irrespective of their complexity. This leads to a model that is under-confident for inputs it can correctly predict and over-confident for inputs it cannot. To address this problem, we introduce a distillation approach called Dynamic Priors within the Bayesian matching loss framework. Dynamic Priors adapt at each active learning step by leveraging previous model versions, thereby mitigating the constant prior problem.
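For concreteness, one common form of this objective and the resulting optimum are sketched below; the precise weighting used by Joo et al. (2020) and in our implementation may differ, and λ is a generic scaling constant in our notation.

```latex
\mathcal{L}_{\text{meta}}
  = \mathbb{E}_{\boldsymbol{\pi}\sim\mathrm{Dir}(\boldsymbol{\alpha})}\!\left[-\log \pi_{y}\right]
  + \lambda\,\mathrm{KL}\!\left(\mathrm{Dir}(\boldsymbol{\alpha})\,\|\,\mathrm{Dir}(\boldsymbol{\beta})\right),
\qquad
\boldsymbol{\alpha}^{*} \approx \boldsymbol{\beta} + \tfrac{1}{\lambda}\,\mathbf{e}_{y},
```

where $\mathbf{e}_{y}$ is the one-hot encoding of the observed label $y$, matching the statement above that the optimum is the prior concentration plus a scaled one-hot label.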
3.5.2 Dynamic Priors
Dynamic priors leverage the active learning setting in which we operate. This setting allows the model to access previous versions of the learning model, which can then be used as priors. The underlying hypothesis is that replacing the constant prior, as described in Section 3.5.1, with a dynamic prior—one that evolves at each active learning step—addresses the homogeneity issue discussed above.
More concretely, the prior is given by the Dirichlet distribution predicted by the previous model version. If no previous version is available, such as at the beginning of the active learning process, a small ensemble of models, trained on a small seed set from the active learning initialization phase, is used to obtain the initial prior. It is important to emphasize that only the initial prior is obtained using a small ensemble; in all subsequent updates to the model, the predicted Dirichlet distribution from a single model instance is used as the prior. By parameterizing the Dirichlet prior with the previous model's outputs, our approach dynamically adjusts the prior concentration parameters. This adjustment not only mitigates the issue of constant priors but also improves the model's ability to produce accurate uncertainty estimates.
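A sketch of how such a dynamic prior could enter the loss is shown below, using PyTorch's Dirichlet KL divergence; treating the previous meta-model's (detached) output as the prior concentration is the core idea, while the specific loss form and weighting are our assumptions.

```python
import torch
from torch.distributions import Dirichlet, kl_divergence

def dynamic_prior_loss(alpha, labels, prev_alpha, kl_weight=1.0):
    """Bayesian matching loss sketch with a dynamic (previous-model) prior.

    alpha:      (batch, V) concentrations from the current meta-model.
    labels:     (batch,) gold class indices (long tensor).
    prev_alpha: (batch, V) concentrations from the previous model version
                (or an initial small ensemble), used as the prior.
    """
    # Expected log-likelihood under Dir(alpha):
    # E[log pi_y] = digamma(alpha_y) - digamma(sum(alpha)).
    alpha0 = alpha.sum(-1)
    exp_ll = torch.digamma(alpha.gather(-1, labels.unsqueeze(-1)).squeeze(-1)) \
             - torch.digamma(alpha0)
    # KL against the dynamic prior; the prior itself is not updated (detached).
    kl = kl_divergence(Dirichlet(alpha), Dirichlet(prev_alpha.detach()))
    return (-exp_ll + kl_weight * kl).mean()
```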
Finally, it is important to emphasize the computational efficiency of this approach. During training, only the parameters of the meta-model, which typically constitute less than 5% of the base model’s size, are updated. Additionally, in the inference phase, the meta-model incurs an additional computational cost of approximately 15–20% of the total inference cost, resulting in an overall computationally efficient approach to uncertainty estimation.
4 Experiments
4.1 Baselines
Random Selection
randomly selects sequences to be annotated. Random selection is often used as a baseline for active learning approaches, as it allows us to observe the impact of purely adding more labeled data to our labeled pool without strategically selecting sequences to be labeled. Its advantage is that it maintains the full data distribution with every selection, thus not creating a bias (Dasgupta and Hsu, 2008).
Bayesian Active Learning by Disagreement (BALD)
is an uncertainty-based active learning method which employs knowledge uncertainty as the primary metric for selection (Houlsby et al., 2011). This technique has established itself as a strong baseline in various applications. For instance, in image classification tasks (Gal et al., 2017) and named entity recognition (Shen et al., 2017), BALD has shown notable performance. Its performance is further enhanced when used in conjunction with ensemble models (Beluch et al., 2018). Given its widespread adoption and proven efficacy, we see BALD as an ideal baseline.
In our study, we examined two criteria for making the selection decision: one based on the cumulative uncertainty across all time-steps and label categories, and another based on the average uncertainty across categories and time. Upon evaluation, we observed that the latter criterion yielded superior results, and therefore, adopted it as our baseline, which we refer to as BALD.
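For reference, the BALD criterion we use can be sketched as follows: knowledge uncertainty is estimated as the mutual information between the prediction and the ensemble members, then averaged over time-steps and categories; the tensor layout is again an illustrative assumption.

```python
import numpy as np

def bald_score(member_probs, eps=1e-12):
    """Average knowledge uncertainty (mutual information) for one sequence.

    member_probs: array (K, T, M, V) of per-member predictive distributions.
    """
    mean_p = member_probs.mean(axis=0)                                        # (T, M, V)
    total = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)                     # H[E[p]]
    expected = -(member_probs * np.log(member_probs + eps)).sum(-1).mean(0)   # E[H[p]]
    mutual_info = total - expected                                            # (T, M)
    return mutual_info.mean()  # averaged over categories and time-steps
```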
We further present an enhanced version of BALD which consists of stages 1, 2, and 4 of our approach as outlined in Section 3.2, utilizing knowledge uncertainty as the prediction confidence estimate. We call this BALD with self-supervision, BALD+SS.3
4.2 Variants of CAMEL
We introduce the following variants to understand the individual and collective contributions of our proposed framework’s components.
CAML
Confidence-based Acquisition Model for active Learning, represents the foundational layer of our framework, incorporating Stages 1, 2, and 4 described in Section 3.2. Crucially, it excludes the self-labeling part of Stage 2, thus relying solely on labels from the annotators. This variant serves as a baseline to evaluate the efficacy of our confidence estimation model in an active learning context, without the influence of self-supervision. For brevity, we report the CAML results for the translation experiments only (similar trends were observed in the dialogue belief tracking task).
CAMEL
Confidence-based Acquisition Model for Efficient self-supervised active Learning is the complete approach, which also includes the self-supervision component. This variant assesses the value added by self-supervision to the framework, while retaining stages 1, 2, and 4.
CAMELL
Confidence-based Acquisition Model for Efficient self-supervised active Learning with Label validation is an extended variation of our approach that includes a label validation component, denoted as Stage 3 in Section 3.2.
4.3 Variants of Label Correction
Live Label Correction
involves simultaneous labeling, validation, and correction of data. A variant of CAMELL is employed, in which the label is corrected at the validation stage using the prediction of the learning model.
On-line Label Correction
is a method that labels and validates data simultaneously, with the objective of minimising human effort in providing labels while concurrently validating them. CAMELL can be employed to flag the data points requiring correction, as well as to apply corrections to the flagged labels using the final model after active learning has been performed.
Offline Label Correction
is a technique used to correct an already labeled corpus, with the objective of identifying potentially incorrect labels and providing alternatives. To achieve this, individually trained components of CAMELL can be utilized, specifically the prediction confidence model (Section 3.3.1) and the label confidence model (Section 3.3.2). The process consists of the following steps:
1. Train the learning model on the labeled corpus.
2. Generate a noisy dataset using this model, leveraging the ensemble members from Step 1. If computational constraints prevent the use of an ensemble, a noisy dataset can be generated from a single model using the strategy described in Section 3.5.
3. Train a noisy model on the noisy dataset.
4. Train the prediction and label confidence models.
5. Perform label correction.
Semi-offline Label Correction
is a method in which data is collected with the objective of minimising human effort in providing labels, with validation occurring subsequently. For this purpose, CAMEL can be utilized alongside a separately trained label confidence model (Steps 2 and 3 from above), followed by Step 5.
4.4 Generative Language Modeling Task
For the generative language modeling task, we explore the application of our CAMEL framework to the task of Neural Machine Translation (NMT). NMT focuses on converting sequences of text from a source language to a target language. Our approach involves iterative annotation methods similar to those used in automatic speech recognition (Sperber et al., 2016), which incrementally increase model precision.
Specifically, in our experiment, an annotator corrects individual words within a translation, thereby progressively enhancing the quality of the subsequent output generated by the model. Conventional annotation methods typically involve providing fully corrected translations or quality ratings. While this iterative process diverges from conventional methods of machine translation annotation, it allows us to effectively demonstrate the self-supervision mechanism within our framework.
4.4.1 Implementation Details
We apply CAMEL to the task of machine translation, specifically using the T5 encoder-decoder transformer model (t5-small) (Raffel et al., 2020). We utilize an ensemble of 10 models in order to produce a well-calibrated predictive distribution. The full procedure requires 2,500 GPU hours: approximately 40% of this time is for training the ensemble, 50% for the annotation process, and 10% for training the confidence estimator.
The WMT17 DE-EN dataset, which consists of German to English translations (Bojar et al., 2017), is used for training, and METEOR (Banerjee and Lavie, 2005), BLEU (Papineni et al., 2002), and COMET (Rei et al., 2020) serve as evaluation metrics.
As machine translation does not entail a multi-output task, we employed a simplified version of the confidence estimation model, introduced in Section 3.3, consisting of only the intra-category encoder. The latent dimension of the encoder and feature transformation layer is 16. The parameters are optimised using the standard binary negative log likelihood loss (Cox, 1958).
It is crucial to address the inherent challenges in sequential machine translation labeling: (1) future sentence structure and labels can change depending on the current label, and (2) for any word position there exist multiple valid candidate words. This complexity necessitates the use of a dynamic annotation approach, as static dataset labels are insufficient for new data labeling. To avoid high translation annotation costs, we propose a practical approach: using an expert translation model, specifically the MBART-50 multilingual model (Tang et al., 2020), to simulate a human annotator.
Our approach, depicted in Figure 3, is a multi-stage procedure. Initially, the learning model produces a translation for a selected source language sentence. As it generates the translation, it simultaneously estimates its confidence for the subsequent token. Should this confidence fall below a set threshold αsel, the expert translation model steps in to supply the next word in the translation. After the label is provided, the learning model resumes the translation generation. For any future token whose confidence drops below the threshold, the expert translation model re-engages. This process continues until a complete translation for the source sentence is realised. The uncertainty threshold, αsel, is strategically chosen to yield a maximum of Nann word labels.
Figure 3: The model-based annotation process for semi-supervised annotation for NMT. The learning model initiates the translation with the word “The”, then confidence for the next token generation is below the threshold. The expert annotation model is prompted and provides the next word, “drunks”. The learning model resumes and successfully generates the remainder of the translation: “interrupted the event”.
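The annotation loop of Figure 3 can be simulated roughly as below; `learner_step` and `expert_next_token` are placeholder callables for the T5 learning model's next-token proposal and the MBART-50 expert, not actual library APIs.

```python
def annotate_translation(source, learner_step, expert_next_token,
                         alpha_sel, eos="</s>", max_len=128):
    """Semi-supervised translation annotation sketch (Figure 3).

    learner_step(source, prefix) -> (token, confidence) proposed by the learner.
    expert_next_token(source, prefix) -> token supplied by the expert model.
    Returns the token sequence and the indices of expert-provided tokens.
    """
    prefix, expert_positions = [], []
    while len(prefix) < max_len:
        token, conf = learner_step(source, prefix)
        if conf < alpha_sel:
            # Low confidence: query the (simulated) expert annotator instead.
            token = expert_next_token(source, prefix)
            expert_positions.append(len(prefix))
        prefix.append(token)
        if token == eos:
            break
    return prefix, expert_positions
```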
4.4.2 Results
We evaluated the performance of our proposed CAMEL framework and baseline models using METEOR (Banerjee and Lavie, 2005), BLEU (Papineni et al., 2002), and COMET (Rei et al., 2020) scores. While traditional metrics such as METEOR and BLEU highlight similar trends (with BLEU scores included in Appendix A, Figure 8), COMET, a neural evaluation metric, provides a more comprehensive picture of translation quality. We establish that our proposed CAMEL framework, enhanced with self-supervision, is significantly more efficient in requesting word-level labels than baselines such as BALD, BALD+SS, and random selection. This efficiency is evident in Figures 4a and 5a, which showcase CAMEL's need for fewer word-level labels to achieve similar performance. Although our primary focus is on the number of word-level labels queried, it is crucial to note that labeling overhead is also accounted for. We measure this overhead by the effort required to read and understand the source-language tokens, which we consider a sufficient indicator.
Figure 4: METEOR score of the T5 translation model using different active learning approaches on the WMT17 DE-EN test set, as a function of (a) the number of word-level labels and (b) the number of complete translations, with 95% confidence interval.
Figure 5: COMET score of the T5 translation model using different active learning approaches on the WMT17 DE-EN test set, as a function of (a) the number of word-level labels and (b) the number of complete translations, with 95% confidence interval.
A notable point to observe in Figure 4b is that the introduction of self-supervision to CAMEL does not significantly influence its performance in terms of the number of complete translations required, as evident by the comparison between CAML (CAMEL without the self-supervised labeling component) and CAMEL. This implies that self-supervision within CAMEL is applied predominantly when the model's predictions can be considered reliable. In contrast, we observe that BALD+SS, despite its label efficiency shown in Figure 4a, performs poorly in terms of the number of complete translations required, as demonstrated in Figure 4b. This drop in performance may be attributed to BALD+SS's tendency to incorrectly self-label complex examples. This trend is further evidenced by CAML's lower expected calibration error (ECE), reported in Table 1. The COMET results, presented in Figure 5, further attest to CAMEL's superiority. CAMEL not only excels in reducing the number of word-level labels but also outperforms other models in the number of complete translations required. The non-overlapping confidence intervals in the results indicate that the improvements of CAMEL over other methods are statistically significant.
Table 1: Comparison of the expected calibration error (ECE) of confidence estimation approaches. * indicates significant difference on 95% confidence interval.

| Confidence Estimator | Dataset | ECE (%) ↓ |
|---|---|---|
| CE-T5 + CAML | WMT17 DE-EN | 26.74* |
| CE-T5 + BALD | WMT17 DE-EN | 47.21 |
| CE-SetSUMBT + CAML | MultiWOZ 2.1 | 9.65* |
| CE-SetSUMBT + BALD | MultiWOZ 2.1 | 17.21 |
Regardless of the methodology used, all models require roughly the same number of complete translations, as shown in Figures 4b and 5b. This supports the widely accepted notion that exposure to large datasets is vital for training robust natural language processing (NLP) models.
Encouraged by these results, we adapt CAMEL to address the dialogue belief tracking problem, a task plagued by errors in the labels of available datasets.
4.5 Dialogue Belief Tracking Task
In task-oriented dialogue, the dialogue ontology contains a set of M domain-slot pairs {s1, s2, …, sM} and a set of plausible values for each sm. The goal of the dialogue belief tracker is to infer the user's preference for each sm by predicting a probability distribution over the plausible values. Notably, each set of plausible values includes the not_mentioned value, indicating that a specific domain-slot pair is not part of the user's goal. This allows for computing the model's confidence for slots not present in the user's preference.
To train a belief tracking model, we require the dialogue state, which includes the value label for each domain-slot pair, in every dialogue turn. The dialogue state at turn t in dialogue i assigns to each domain-slot pair sm its value at that turn. Consequently, we obtain a dataset consisting of N dialogues, each comprising Ti turns of user and system utterances.
To create such a dataset, annotators usually provide relevant values for the domain-slot pairs they believe are present in the user's utterance at every turn t. Subsequently, a handcrafted rule-based tracker considers the previous state at turn t−1, the semantic actions present in the system utterance, and the values provided by the annotator to generate complete dialogue states for each turn (Budzianowski et al., 2018). However, this approach has several drawbacks. Firstly, rule-based trackers tend to be imprecise and necessitate redevelopment for each new application, making them less versatile. Secondly, this approach may not use the time of human annotators efficiently, as the learning model could potentially predict the state for a substantial part of the dialogue accurately. Lastly, there is the risk of human annotators inadvertently overlooking slots in the user input, which could result in incomplete data.
4.5.1 Learning Model
To apply CAMEL to the dialogue belief tracking problem, we use the CE-SetSUMBT (Calibrated Ensemble – SetSUMBT) model (van Niekerk et al., 2021), which produces the well-calibrated uncertainty estimates that are important for CAMEL. The CE-SetSUMBT model consists of 10 ensemble members, and the full procedure requires 1,000 GPU hours to train: approximately 45% of this time is utilized for training the ensemble, 45% for training the noisy model, and 10% for training the confidence estimators. In addition, we integrate post-hoc uncertainty learning using the Dirichlet meta-model approach (Shen et al., 2023), described in Section 3.5, into SetSUMBT.
4.5.2 Datasets
In order to test our proposed approach, we utilize the multi-domain task-oriented dialogue dataset MultiWOZ 2.1 (Eric et al., 2020; Budzianowski et al., 2018) and its manually corrected test set provided in MultiWOZ 2.4 (Ye et al., 2022). In our experiments, we regard MultiWOZ 2.1 as a dataset with substantial label noise (Eric et al., 2020; Zang et al., 2020; Ye et al., 2022), and the test set of MultiWOZ 2.4 as a dataset with accurate labels.
4.5.3 Implementation Details
The latent dimension of the intra- and inter-category encoders and feature transformation layer is 16. During training of the label confidence estimation model (Section 3.3.2), to avoid overfitting, we improve the calibration of this model by deploying binary label smoothing loss (Szegedy et al., 2016), temperature scaling and noisy training using Gaussian noise (An, 1996).
For the seed dataset (Section 3) we randomly select 5% of dialogues on which we train the initial SetSUMBT model. The other dialogues in the dataset are treated as the unlabeled pool. At each update step another 5% of the data are selected to be labeled. At each point where we require expert labels, we take the original labels provided in the dataset to simulate a human annotator.
4.5.4 Evaluation
As the main metric for our experiments, we use joint goal accuracy (JGA) (Henderson et al., 2014). We further include the joint goal expected calibration error (ECE) (Guo et al., 2017; van Niekerk et al., 2020), which measures the calibration of the model. In terms of measuring efficiency of each method, we examine JGA as a function of the number of expert provided labels. In order to assess the quality of the corrected dataset, we measure the JGA of models trained on a noisy dataset, with and without the proposed label correction.
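As a reference for the main metric, a sketch of joint goal accuracy is given below: the fraction of turns in which every domain-slot value is predicted correctly; the dictionary-based state representation is illustrative.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """predicted_states, gold_states: lists (one per turn) of dicts mapping
    each domain-slot pair to a value (including 'not_mentioned')."""
    correct = sum(
        all(pred.get(slot) == value for slot, value in gold.items())
        for pred, gold in zip(predicted_states, gold_states)
    )
    return correct / len(gold_states)
```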
4.5.5 Dialogue Diversity Baseline
We include an additional dialogue diversity baseline, aiming to obtain labels for dialogues geometrically dissimilar from those in the labeled pool, thus ensuring data space coverage. This diversity strategy proposed by Xie et al. (2018) assesses similarity based on vector embeddings of the candidate dialogue versus labeled dialogues. We adapt this approach by employing RoBERTa model embeddings (Liu et al., 2019), fine-tuned in an unsupervised fashion, on the MultiWOZ dialogues.
4.5.6 Results
As shown in Figure 6a, our proposed CAMEL framework requires significantly fewer labels to reach performance levels comparable to those of the baseline methods. This indicates that CAMEL is more efficient in learning dialogue belief tracking than the baseline strategies. It is important to note that all approaches require the same number of unlabeled dialogues (see Figure 6b). These results also highlight the role played by CAMEL's confidence estimates in guiding the active learning process, a conclusion supported by the lower calibration error of CAMEL's confidence estimates, as reported in Table 1.
Figure 6: JGA of the CE-SetSUMBT model using different active learning approaches, on the MultiWOZ 2.1 test set, as a function of (a) the number of labels and (b) the number of dialogues, with 95% conf. int.
Further, we observe in Figures 7a and 7b that similar results can be achieved using a computationally efficient uncertainty estimation technique such as the post-hoc Dirichlet meta-model, described in Section 3.5, applied to the SetSUMBT model. It should be noted that the comparatively lower joint goal accuracy of this model can be attributed to its single SetSUMBT model configuration; an ensemble of models consistently achieves an accuracy that is 2 to 3 percentage points higher.
Figure 7: JGA of the Dirichlet Meta SetSUMBT model using different active learning approaches, on the MultiWOZ 2.1 test set, as a function of (a) the number of labels and (b) the number of dialogues, with 95% conf. int.
4.6 Label Correction
To assess the quality of the corrected labels generated by our proposed label correction method (Section 3.4), we trained two distinct tracking models, CE-SetSUMBT and TripPy (Heck et al., 2020), using both the original MultiWOZ 2.1 dataset and various autocorrected datasets (live, online, offline, and semi-offline). The evaluation was conducted on both the noisy MultiWOZ 2.1 test set and the manually corrected MultiWOZ 2.4 test set. The selected tracking models represent the two major non-generative approaches to dialogue state tracking: a pick-list-based approach (SetSUMBT) and a span-prediction approach (TripPy).
4.6.1 Results
In Table 2, we present the JGA of the CE-SetSUMBT models on two test sets: the (noisy) MultiWOZ 2.1 test set and the (manually corrected) MultiWOZ 2.4 test set.4 Overall, results show the same trend both for CE-SetSUMBT and TripPy: On the MultiWOZ 2.1 test set, the models do not show statistically significant improvements, which is unsurprising given that the MultiWOZ 2.1 test set contains errors and, therefore, cannot adequately assess the impact of label correction. In contrast, on the MultiWOZ 2.4 test set, we observe significant improvements for both offline and online label correction methods for both belief state trackers. This demonstrates that the datasets resulting from online and offline label correction are of significantly higher quality.
Table 2: Comparison of JGA of trackers trained with and without label corrections. The label corrections can be obtained using a SetSUMBT model trained on the full MultiWOZ 2.1 dataset, trained using CAMEL, or trained using CAMELL. * indicates significant difference on 95% conf. int.

| Model | Label Corr. Setup | MultiWOZ 2.1 | MultiWOZ 2.4 |
|---|---|---|---|
| CE-SetSUMBT | None | 51.79 | 61.63 |
| CE-SetSUMBT | Live | 32.48 | 37.32 |
| CE-SetSUMBT | Online | 52.85 | 63.35 |
| CE-SetSUMBT | Offline | 52.83 | 63.32* |
| CE-SetSUMBT | Semi-offline | 52.69 | 63.12 |
| TripPy | None | 55.28 | 64.45 |
| TripPy | Online | 56.17 | 66.13 |
| TripPy | Offline | 56.11 | 66.02* |
| TripPy | Semi-offline | 55.85 | 65.82 |
The semi-offline method fails to produce significant improvements. We hypothesise that the model trained using CAMEL has already acquired similar error patterns to those commonly made by human annotators. The live label correction setup results in a low-quality dataset, which we attribute to the model’s inherent inability to correct data selected through active learning. At this stage, the model lacks the capability to make accurate predictions for these instances.5
Although the label validation stage of CAMELL does not yield a statistically significant improvement in the active learning setting, it produces a model that provides more reliable label correction compared to the CAMEL approach without label validation (see online vs. semi-offline correction in Table 2). While CAMELL does not generate labels of higher quality than those produced by the offline label correction approach, it facilitates the creation of a clean dataset with fewer labels, thereby reducing human effort.
An important take-away message is: if all labels in the dataset are available and active learning is not required, offline label correction can be applied to enhance the dataset’s quality. However, if labels are being collected through an active learning process, an online label correction should be applied rather than a semi-offline method, as the label validation component enables the creation of a final dataset of higher quality.
4.6.2 Qualitative Analysis
In our investigation of the improved datasets obtained from offline label correction, we identified three prevalent label errors, which our approach successfully rectifies, as exemplified in Table 3. (I) Hallucinated annotations, where the annotator assigns labels not present in the dialogue context, (II) Multi-annotation, the case of assigning multiple labels to the same piece of information, and (III) Erroneous annotation, the situation where an incorrect label is assigned based on the context. These instances underscore the efficacy of our label validation model in minimising the propagation of errors into the dataset.
Table 3: Examples of three common types of annotation errors in the MultiWOZ 2.1 dataset detected and corrected by CAMELL: (I) hallucinated annotations, (II) multi-annotation, and (III) erroneous annotation. For each, we provide the confidence scores of the labels and the corrections proposed by the model. Incorrect labels are marked in red and the proposed corrections in blue.
5 Conclusion
We propose CAMEL, a novel active learning approach that integrates self-supervision, with the goal of minimizing the reliance on labeled data in addressing sequential multi-output labeling problems. Initially, we applied CAMEL to a generative language modeling task in an idealized setting, specifically focusing on machine translation. Subsequently, in a more realistic setting focused on the dialogue belief tracking task, we demonstrated that our approach significantly outperforms baseline methods in terms of robustness and data efficiency.
Additionally, we introduce a methodology for automated dataset correction. Our experiments confirm that our label correction method enhances the overall quality of a dataset. We demonstrate that CAMELL (with label validation) is capable of producing high-quality datasets with a fraction of the human annotation required, through online label correction, thereby highlighting the importance of the label validation component for this task.
Finally, it is important to note that while many presented experiments used ensembles to establish comparisons, we have also provided a mechanism for confidence estimation and active learning that does not utilize ensembles and thus is more environmentally friendly.
We believe that this work has far-reaching implications. Firstly, it underscores the indispensable role of uncertainty estimation in learning models. Secondly, the versatility of CAMEL opens up possibilities for its application across diverse sequential multi-output labeling problems, such as entity-relation extraction or weather forecasting. Thirdly, it demonstrates that, in principle, dataset deficiencies can be addressed via data-driven approaches, circumventing the need for extensive manual or rule-based curation. This is particularly pertinent considering the prevailing belief that undesirable outcomes produced by NLP models are inherently linked to the training datasets and cannot be rectified algorithmically (Eisenstein, 2019, 14.6.3).
Looking ahead, we anticipate that refining the process of generating noisy datasets could result in a model capable of not only identifying label noise but also filtering out biases, false premises, and misinformation.
Acknowledgments
This work was made possible through the support of the Alexander von Humboldt Foundation, provided within the Sofja Kovalevskaja Award, the European Research Council (ERC) under the Horizon 2020 research and innovation program (grant no. STG2018 804636), and the Ministry of Culture and Science of North Rhine-Westphalia within the Lamarr Fellow Network. Computational resources were provided by the Centre for Information and Media Technology at Heinrich Heine University Düsseldorf and Google Cloud. We thank the anonymous reviewers for their insightful comments and suggestions, particularly for encouraging us to develop a more computationally efficient approach to uncertainty estimation. We also thank Andrey Malinin for early discussions that inspired us to broaden our perspective beyond dialogue state tracking, as well as Prof. Joseph van Genabith for his valuable insights regarding the machine translation setting.
Notes
1. The code is available under https://gitlab.cs.uni-duesseldorf.de/general/dsml/camell.git.
2. During label confidence estimation, for categories not selected for labeling, self-labels are used to complete the inter-category features.
3. Note that we are not able to combine BALD with label validation as knowledge uncertainty does not provide candidate-level confidence scores.
4. The MultiWOZ 2.4 validation set was never used during training.
5. This method is not examined for TripPy, as we do not expect it to behave differently.
References
A BLEU Scores for Translation Experiments
Figure 8: BLEU score of the T5 translation model using different active learning approaches on the WMT17 DE-EN test set, as a function of (a) the number of word-level labels and (b) the number of complete translations, with 95% conf. int.
Author notes
Action Editor: Wenjie (Maggie) Li