Abstract
Semi-supervised text classification (SSTC) paradigms typically follow the spirit of self-training. The key idea is to train a deep classifier on limited labeled texts and then iteratively use its predictions on the unlabeled texts as pseudo-labels for further training. However, performance is largely affected by the accuracy of the pseudo-labels, which may not be sufficient in real-world scenarios. This paper presents a Rank-aware Negative Training (RNT) framework that addresses SSTC as learning with noisy labels. To alleviate the noisy information, we adopt a reasoning-with-uncertainty approach that ranks the unlabeled texts based on the evidential support they receive from the labeled texts. Moreover, we propose the use of negative training to train RNT based on the concept that “the input instance does not belong to the complementary label”. A complementary label is randomly selected from all labels except the label on-target. Intuitively, the probability of the true label serving as a complementary label is low; complementary labels thus provide less noisy information during training, resulting in better performance on the test data. Finally, we evaluate the proposed solution on various text classification benchmark datasets. Our extensive experiments show that it consistently outperforms the state-of-the-art alternatives in most scenarios and achieves competitive performance in the others. The code of RNT is publicly available on GitHub.
1 Introduction
The text classification task aims to associate a piece of text with a corresponding class, which could be a sentiment, topic, or category. With the rapid development of deep neural networks, text classification has experienced a considerable shift towards pre-trained language models (PLMs) (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Lewis et al., 2020). Overall, PLMs are first trained on massive text corpora (e.g., Wikipedia) to learn contextual representations and are then fine-tuned on downstream tasks (Li et al., 2021; Chen et al., 2022; Tsai et al., 2022; Ahmed et al., 2022). The improvement of these approaches heavily relies on high-quality labeled data. However, labeling data is labor-intensive and may not be readily available in real-world scenarios. To alleviate the burden of labeling, Semi-Supervised Text Classification (SSTC) leverages unlabeled texts alongside a limited amount of labeled data. SSTC-based approaches commonly attempt to exploit the consistency between instances under different perturbations (Li et al., 2020).
Earlier SSTC-based approaches adopt data augmentation via back-translation: they enforce a consistency loss between the predictions on unlabeled texts and on augmented versions obtained by translating the text into a target language and then back into the source language (Miyato et al., 2019; Xie et al., 2020; Chen et al., 2020). However, these approaches require an additional neural machine translation (NMT) system, which may be inaccurate and cumbersome in real-world scenarios. Recently, SSTC has shifted toward self-training and PLM fine-tuning (Li et al., 2021; Tsai et al., 2022). The basic idea is to fine-tune a PLM on the labeled data and iteratively use its predictions on the unlabeled data as pseudo-labels for further training. However, the pseudo-labels are treated as equally reliable as the true labels and may thus lead to error accumulation (Zhang et al., 2021; Arazo et al., 2020).
In this paper, we propose a Rank-aware Negative Training (RNT) framework that addresses SSTC under learning-with-noisy-labels settings. To prevent noisy information from dominating training, we adopt a reasoning-with-uncertainty approach that ranks the unlabeled texts by measuring the features they share with the labeled texts, referred to as evidential support. The shared features, which serve as a medium for conveying knowledge from the labeled texts (i.e., evidence) to the unlabeled texts (i.e., inference), are treated as belief functions for reasoning about the degree of noisiness. These belief functions are combined to reach a final belief about a text being mislabeled. In other words, we attempt to discard the texts whose pseudo-labels may introduce inaccurate information into the training process.
Moreover, we propose using negative training (NT) (Kim et al., 2019) to train robustly with potentially noisy pseudo-labels. Unlike positive training, NT is an indirect learning method that trains the network based on the concept that “the input sentence does not belong to the complementary label”, where a complementary label is randomly drawn from the label space excluding the label of the sentence on-target. Considering the AG News dataset, given a sentence annotated as sport, the complementary label is randomly selected from all labels except sport (e.g., business). Intuitively, the probability of the true label serving as a complementary label is low, which reduces the noisy information injected during training. Finally, we conduct extensive experiments on various text classification benchmark datasets with different ratios of labeled examples. Experimental results suggest that RNT mostly outperforms the SSTC-based alternatives. Moreover, it has been empirically shown that RNT can perform better than PLMs fine-tuned on sufficient labeled examples.
In brief, our main contributions are three-fold:
We propose a rank-aware negative training framework, namely, RNT, to address the semi-supervised text classification problem as learning in the noisy label setting.
We introduce a reasoning-with-uncertainty-based solution that discards texts with potentially noisy pseudo-labels by measuring the evidential support they receive from the labeled texts.
We evaluate the proposed solution on various text classification benchmark datasets. Our extensive experiments show that it consistently outperforms the state-of-the-art alternatives in most cases and achieves competitive performance in the others.
2 Related Work
This section reviews the existing solutions of the SSTC task and learning with noisy labels.
Text Classification.
Text classification aims at assigning a given document to a number of semantic categories, which could be sentiments, topics, or aspects (Hu and Liu, 2004; Liu, 2012; Schouten and Frasincar, 2016). Earlier solutions were usually equipped with deep memory or attention mechanisms to learn semantic representations in response to a given category (Socher et al., 2013b; Zhang et al., 2015; Wang et al., 2016; Ma et al., 2017; Chen et al., 2017; Johnson and Zhang, 2017; Conneau et al., 2017; Song et al., 2019; Murtadha et al., 2020; Tsai et al., 2022). Recently, many NLP tasks have experienced a considerable shift towards fine-tuning PLMs (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Zaheer et al., 2020; Chen et al., 2022; Tsai et al., 2022; Ahmed et al., 2022). Despite the effectiveness of these approaches, their performance heavily relies on the quality of the labeled data, which requires intensive human labor.
Semi-supervised Text Classification.
Partially supervised text classification, also known as learning from Positive and Unlabeled (PU) examples, aims at building a classifier from P and U, in the absence of negative examples, to classify the unlabeled examples (Liu et al., 2002; Li et al., 2010; Liu et al., 2011). Recent SSTC approaches primarily focus on exploiting the consistency of predictions for the same samples under different perturbations. Miyato et al. (2016) established virtual adversarial training, which perturbs word embeddings to encourage consistency between perturbed embeddings. Variational auto-encoder-based approaches (Yang et al., 2017; Chen et al., 2018; Gururangan et al., 2019) attempted to reconstruct instances and utilized the latent variables to classify text. Unsupervised data augmentation (UDA) (Xie et al., 2020) performed consistency training by making features consistent between back-translated instances. However, these methods mostly require additional systems (e.g., NMT for back-translation), which may be cumbersome in real-world scenarios. Mukherjee and Awadallah (2020) and Tsai et al. (2022) introduced uncertainty-driven self-training-based solutions that select samples and perform self-training on the selected data. An iterative framework named SENT (Ma et al., 2021) was proposed to address distant relation extraction via negative training. Self-Pretraining (Karisani and Karisani, 2021) employs an iterative distillation procedure to cope with the inherent problems of self-training. SSTC-based approaches and their limitations are well described by van Engelen and Hoos (2020) and Yang et al. (2022). Recently, S2TC-BDD (Li et al., 2021) was introduced to balance the label angle variances (i.e., the angles between deep representations of texts and the weight vectors of labels), also called the margin bias. Despite the effectiveness of these methods, the unlabeled instances contribute to training as if they were as reliable as the labeled ones; therefore, the performance heavily relies on the quality of the pseudo-labels. Unlike these methods, our proposed solution addresses the SSTC task as learning under noisy-label settings. Since the pseudo-labels are automatically assigned by the model, we regard them as noisy labels and introduce a ranking approach to filter out the instances at high risk of being mislabeled. To alleviate the noisy information that remains after the filtering process, we use negative training, which performs classification based on the concept that “the input instance does not belong to the complementary label”.
Learning with Noisy Labels.
Learning with noisy data has been extensively studied, especially in computer vision. Existing solutions introduce various methods to relabel the noisy samples in order to correct the loss function, including directed graphical models (Xiao et al., 2015), conditional random fields (Vahdat, 2017), knowledge graphs (Baek et al., 2022), and deep neural networks (Veit et al., 2017; Lee et al., 2018). However, these methods are built on semi-supervised learning and require access to a limited amount of clean data. Ma et al. (2018) introduced a bootstrapping method that modifies the loss with model predictions by exploiting the dimensionality of feature subspaces. Patrini et al. (2017) proposed to estimate the label corruption matrix for loss correction. Another line of research on loss correction investigated two approaches: reweighting training samples and separating clean from noisy samples (Thulasidasan et al., 2019; Konstantinov and Lampert, 2019). Shen and Sanghavi (2019) claimed that a deep classifier normally learns clean instances faster than noisy ones and, based on this claim, considered instances with smaller losses as clean. A negative training technique (Kim et al., 2019) was introduced to train the model on complementary labels, which are randomly generated from the label space excluding the label on-target. The goal is to encourage the confidence to follow a distribution in which noisy instances are largely concentrated in low-value areas and clean instances in high-value areas, facilitating their separation. Han et al. (2018) proposed to jointly train two networks that select small-loss samples within each mini-batch to train each other. Based on this paradigm, Yu et al. (2019) proposed updating the networks on disagreement data to keep the two networks diverged. In this paper, we leverage a robust negative loss (Kim et al., 2019) for noisy data training.
3 Rank-aware Negative Training
This section describes the proposed framework, namely, Rank-aware Negative Training (RNT), for semi-supervised text classification. An example of RNT is depicted in Figure 1. Suppose we have a training dataset D consisting of a limited labeled set Dl and a large unlabeled set Du. We follow the pseudo-labeling method introduced by Lee (2013) to associate Du with pseudo-labels based on the concept of positive training. Simply put, we fine-tune a pre-trained language model (e.g., BERT) on the Dl set. It is noteworthy that we use BERT for a fair comparison, while other models can be used similarly. As the pseudo-labels are not manually annotated, we rank the texts by their potential of being mislabeled and discard the riskiest ones. Specifically, we first capture the shared information (referred to as evidential support) between the labeled and unlabeled instances. Then, we measure the amount of support that an unlabeled instance receives from the labeled instances for being correctly labeled. We denote the filtered set as Du′ in Figure 1. Finally, we train on both Dl and Du′ using negative training. Next, we describe the framework in detail.
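As a minimal sketch of this pseudo-labeling step, the snippet below assumes a Hugging Face-style sequence classifier fine-tuned on Dl; the function and loader names are illustrative placeholders rather than the released RNT code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, device="cpu"):
    """Run the PT classifier (a PLM fine-tuned on D_l) over D_u and keep its
    argmax predictions as pseudo-labels together with their confidences."""
    model.eval()
    preds, confs = [], []
    for batch in unlabeled_loader:                        # batches of tokenized texts
        logits = model(**{k: v.to(device) for k, v in batch.items()}).logits
        probs = F.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)                    # confidence and predicted class
        preds.append(pred.cpu())
        confs.append(conf.cpu())
    return torch.cat(preds), torch.cat(confs)
```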
3.1 Task Description
Semi-Supervised Text Classification (SSTC).
Let D be the training dataset consisting of a limited labeled set Dl = {(x_i^l, y_i^l)}, i = 1..Nl, and a large unlabeled text set Du = {x_j^u}, j = 1..Nu, where x_i^l and x_j^u denote the input sequences of labeled and unlabeled texts, respectively, and y_i^l represents the corresponding one-hot label vector of x_i^l. The goal is to learn a classifier that leverages both Dl and Du to better predict the labels of unseen texts at the inference step, also known as inductive SSTC.
3.2 Positive and Negative Training
Positive Training (PT).
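In its standard form (we restate the conventional formulation here, with p_k the predicted probability of class k and K the number of classes), PT minimizes the cross-entropy between the model output and the one-hot (pseudo-)label y:

```latex
% Positive (cross-entropy) training on the (pseudo-)label y.
\mathcal{L}_{\mathrm{PT}}\bigl(f(x), y\bigr) = -\sum_{k=1}^{K} y_k \,\log p_k \qquad (1)
```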
Negative Training (NT).
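NT instead penalizes the probability assigned to a randomly drawn complementary label. Following Kim et al. (2019), a standard formulation, with the same notation as above and \bar{y} a complementary label sampled uniformly from the K − 1 classes other than the (pseudo-)label, is:

```latex
% Negative training with a complementary label \bar{y} (Kim et al., 2019).
\mathcal{L}_{\mathrm{NT}}\bigl(f(x), \bar{y}\bigr) = -\sum_{k=1}^{K} \bar{y}_k \,\log\bigl(1 - p_k\bigr) \qquad (2)
```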
To illustrate the robustness of PT and NT against noise, we train both techniques on the AG News dataset corrupted with 30% symmetric noise (i.e., each corrupted instance is associated with a random label). In terms of confidence (i.e., the probability of the true class), we illustrate the histogram of the training data after PT and NT in Figure 2. As can be seen, with PT in Figure 2(a), the confidence of both clean and noisy instances increases simultaneously. With NT in Figure 2(b), in contrast, the noisy instances yield much lower confidence than the clean ones, which discourages the domination of noisy data. After NT, we continue training with only the samples whose NT confidence is over 1/K, where K denotes the number of classes. We refer to this process as Selective NT (SelNT), as illustrated in Figure 2(c) (Kim et al., 2019). We also depict the distribution of the proposed RNT in Figure 2(d), which demonstrates the improvement of RNT in terms of noise filtering. In terms of performance, as shown in Figure 3, the accuracy of PT on the Dev data increases in the early stage of training. However, the direct mapping of features to the noisy labels eventually leads to overfitting and thus gradually degrades performance on the clean Dev data.
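For illustration, a PyTorch sketch of the NT loss and the SelNT selection rule is given below; it is our own code assuming classifier `logits` and integer `pseudo_labels`, not the released implementation.

```python
import torch
import torch.nn.functional as F

def negative_training_loss(logits, pseudo_labels, num_classes):
    """Sample a complementary label != pseudo-label per instance and
    minimize -log(1 - p_complementary), as in Eq. 2."""
    probs = F.softmax(logits, dim=-1)
    offsets = torch.randint(1, num_classes, pseudo_labels.shape, device=logits.device)
    complementary = (pseudo_labels + offsets) % num_classes    # never equals the pseudo-label
    p_comp = probs.gather(1, complementary.unsqueeze(1)).squeeze(1)
    return -torch.log(1.0 - p_comp + 1e-7).mean()

def selnt_mask(logits, num_classes):
    """SelNT: keep only instances whose confidence exceeds 1/K (Kim et al., 2019)."""
    conf = F.softmax(logits, dim=-1).max(dim=-1).values
    return conf > 1.0 / num_classes
```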
3.3 Noise Ranking
We begin by extracting the shared features (i.e., evidential support) between the evidence (i.e., the labeled texts) and the inference (i.e., the unlabeled texts). Then, we adopt a reasoning-with-uncertainty approach to measure the evidential support. An instance with higher evidential support is regarded as less likely to be noisy. An illustrative example is shown in Figure 4. Next, we describe the process in detail.
3.3.1 Feature Generation
Recall that RNT begins by training on the labeled data using the PT technique. Consequently, we rely on the learned latent space of PT to generate features with three properties: automatically generated, discriminative, and high-coverage, as follows.
Semantic Distance.
For each instance xi ∈{Dl, Du}, we recompute its semantic relatedness to each label based on the Angular Margin (AM) loss (Wang et al., 2018).
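A minimal sketch of this feature, assuming the PT encoder exposes an instance representation and a per-label weight vector (AM-style scoring relies on their cosine similarity); the function name and tensor shapes are our illustration:

```python
import torch
import torch.nn.functional as F

def semantic_distance(features, label_weights):
    """Cosine similarity between each instance representation (N, d) and every
    label weight vector (K, d) learned during PT; higher = semantically closer."""
    f = F.normalize(features, dim=-1)
    w = F.normalize(label_weights, dim=-1)
    return f @ w.t()                          # (N, K) cosine scores per label
```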
PT Confidence.
Instances with extreme confidence (i.e., close to 1) are generally considered to have a low risk of being mislabeled (Hou et al., 2020). To incorporate the class distribution of PT into the evidential support measurement, we introduce a new feature, denoted as f′, whose value consists of the predicted class and its corresponding probability. In the illustrative example in Figure 4, two instances share f1′ (i.e., f1′(0, 0.9)), which can be read as both instances being predicted as class 0 by the PT classifier with 0.9 confidence.
3.3.2 Evidential Support Measurement
We can now capture the shared knowledge between the labeled and unlabeled instances (i.e., the evidential support). We leverage Dempster–Shafer Theory (DST) (Yang and Xu, 2013) to cast evidential support measurement as reasoning with uncertainty. The goal is to estimate the degree of noisiness of an unlabeled instance by combining its evidence from multiple sources of uncertain information (i.e., the PT and semantic features). To achieve this, DST applies Dempster’s rule, which combines the mass functions of each source of evidence to form a joint mass function. It is noteworthy that DST has been widely used for various reasoning purposes (Liu et al., 2018; Wang et al., 2021; Ahmed et al., 2021). The basic concepts of DST are:
Proposition. It refers to all possible states of the situation under consideration. Two propositions are defined: “clean instance”, denoted by C, and “unclean instance”, denoted by U. Let the frame of propositions be X = {C, U} and the power set of X be 2^X = {∅, {C}, {U}, X}.
Belief function. It associates each E ∈ 2^X with a degree of belief (or mass) m(E), which satisfies Σ_{E ∈ 2^X} m(E) = 1 and m(∅) = 0. Different belief functions are defined for the various sources of evidence (i.e., the generated features).
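A toy illustration of Dempster’s rule of combination over the frame X = {C, U}; the dictionary keys and example masses are ours, not the paper’s implementation.

```python
def combine(m1, m2):
    """Combine two mass functions over {C, U, X} with Dempster's rule."""
    focal = ["C", "U", "X"]
    joint = {e: 0.0 for e in focal}
    conflict = 0.0
    for a in focal:
        for b in focal:
            mass = m1[a] * m2[b]
            if a == b or b == "X":
                joint[a] += mass          # intersection is a
            elif a == "X":
                joint[b] += mass          # intersection is b
            else:                         # {C} ∩ {U} = ∅ -> conflicting evidence
                conflict += mass
    norm = 1.0 - conflict
    return {e: v / norm for e, v in joint.items()}

# e.g., evidence from the PT-confidence feature and from one shared semantic feature
m_conf = {"C": 0.7, "U": 0.1, "X": 0.2}
m_feat = {"C": 0.5, "U": 0.2, "X": 0.3}
belief = combine(m_conf, m_feat)          # belief["C"] ~ degree of being clean
```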
3.4 Training Procedure
Now that we can measure the evidential support, we rank the instances of Du and select the less risky instances as the filtered set, denoted as Du′, of size Nf. Note that the value of Nf is tuned using the Dev set (please refer to Section 4.3 for more details). Finally, we combine Dl and Du′ for the final NT training, as illustrated in Figure 1. The training procedure can be summarized as follows. We first generate pseudo-labels using the PT technique of Eq. 1. Then, we apply DST to filter out the highly risky instances. Finally, we adopt the NT technique of Eq. 2 to alleviate the noisy information during training. Furthermore, to improve convergence after NT, we follow Kim et al. (2019) by training only with the instances whose confidence exceeds 1/K, denoted as SelNT in Figure 2(c).
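To make the procedure concrete, a structural sketch of the loop is given below; the callables passed in are hypothetical placeholders for the steps described above, not real APIs from our release.

```python
# Structural sketch of RNT under our reading of Section 3; every callable
# argument (train_pt, rank_by_evidential_support, tune_nf, train_nt) is a
# hypothetical placeholder for the corresponding step, not an actual API.
def rnt_procedure(D_l, D_u, dev, train_pt, rank_by_evidential_support, tune_nf, train_nt):
    model = train_pt(D_l)                                     # step 1: PT fine-tuning (Eq. 1)
    pseudo = [(x, model.predict(x)) for x in D_u]             # step 2: pseudo-label D_u
    ranked = rank_by_evidential_support(pseudo, D_l, model)   # step 3: DST-based ranking
    n_f = tune_nf(ranked, dev)                                # step 4: pick N_f on the Dev set
    D_u_filtered = ranked[:n_f]                               # step 5: drop risky pseudo-labels
    return train_nt(model, list(D_l) + D_u_filtered)          # step 6: NT (Eq. 2) + SelNT
```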
4 Experimental Setup
4.1 Dataset
We validate the performance of the proposed RNT on various text classification benchmark datasets (Table 1). In particular, we rely on AG News (Zhang et al., 2015), Yahoo (Chang et al., 2008), Yelp (Zhang et al., 2015), DBPedia (Zhang et al., 2015), TREC (Li and Roth, 2002), SST (Socher et al., 2013a), CR (Ding et al., 2008), MR (Pang and Lee, 2005), TNEWS, and OCNLI (Xu et al., 2020). For the AG News, Yelp, and Yahoo datasets, we follow the comparative approaches by forming the unlabeled training set Du, labeled training set Dl, and development set by randomly drawing from the corresponding original training datasets. For the other datasets, we split the training set into 10% and 90% for Dl and Du, respectively. Note that we utilize the original test sets for prediction evaluation.
Table 1: Statistics of the benchmark datasets.

| Dataset | #Class | Train #Lab | Train #Unlab | #Dev | #Test | Length | Language | Task | Metric |
|---|---|---|---|---|---|---|---|---|---|
| AG News | 4 | 10k | 20k | 8k | 7.6k | 100 | English | Topic | Macro-F1 |
| Yelp | 5 | 10k | 20k | 10k | 5k | 256 | English | Sentiment | Macro-F1 |
| Yahoo | 10 | 10k | 40k | 20k | 60k | 256 | English | Topic | Macro-F1 |
| DBPedia | 14 | 10k | 20k | 10k | 70k | 160 | English | Topic | Macro-F1 |
| TREC | 6 | 5.4k | NA | 1.1k | 500 | 30 | English | Question | Macro-F1 |
| SST | {2,5} | 6.9k | NA | 871 | 1.8k | 50 | English | Sentiment | Macro-F1 |
| CR | 2 | 3k | NA | 378 | 372 | 50 | English | Sentiment | Macro-F1 |
| MR | 2 | 6.9k | NA | 1.7k | 2k | 50 | English | Sentiment | Macro-F1 |
| TNEWS | 15 | 53.3k | NA | 10k | 10k | 128 | Chinese | Topic | Accuracy |
| OCNLI | 3 | 50k | NA | 3k | 3k | 128 | Chinese | NLI | Accuracy |
4.2 Comparative Baselines
For fairness, we only include the semi-supervised learning methods that were built upon the contextual embedding models (e.g., BERT):
PLM is a pre-trained language model directly fine-tuned on the labeled data. We compare with BERT (Devlin et al., 2019; Cui et al., 2021) and RoBERTa (Liu et al., 2019);
UDA (Xie et al., 2020) is an SSTC method based on unsupervised data augmentation with back-translation. We use German and English as pivot languages for back-translating the English and Chinese datasets, respectively;
UST (Mukherjee and Awadallah, 2020) selects samples by information gain and uses a cross-entropy loss to perform self-training;
S2TC-BDD (Li et al., 2021) is an SSTC method that addresses the margin bias problem by balancing the label angle variances.
4.3 Experimental Settings
Hyper-parameters. We use 12 attention heads and 12 layers, keep the dropout probability at 0.1, and train for 30 epochs with a learning rate of 2e−5 and a batch size of 32. To guarantee reproducibility without manual effort, we rely on the Dev set to automatically set the value of Nf (i.e., the number of instances in Du′). First, the ranked Dev set is split into small proportions (at most 10). Then, θ is set to the fraction of ranked proportions whose accuracy is at least λ = max(p) − st(p), where p is a vector of RNT’s accuracy on each proportion and st denotes the standard deviation. For example, θ = 0.2 means that Du′ consists of the first 20% of the ranked Du, as shown in Figure 5. We set the number of negative samples to K − 1, where K is the number of classes in the labeled training set.
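A small sketch of this selection rule under our reading of it; the accuracy values below are made up for illustration.

```python
import numpy as np

def select_theta(p):
    """p: RNT's accuracy on each of the ten ranked Dev proportions."""
    p = np.asarray(p, dtype=float)
    lam = p.max() - p.std()                    # lambda = max(p) - st(p)
    return float((p >= lam).sum()) / len(p)    # fraction of proportions kept

theta = select_theta([0.96, 0.95, 0.88, 0.80, 0.71, 0.66, 0.60, 0.55, 0.50, 0.45])
# here theta = 0.4, i.e., D_u' keeps the first 40% of the ranked D_u
```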
Metrics. We use the accuracy metric on the CLUE datasets (i.e., TNEWS and OCNLI) and Macro-F1 scores for all other datasets.
5 Evaluation and Results
We describe the evaluation tasks and report the experimental results in this section. The evaluation criteria are: (I) Is RNT able to rank instances by their risk of being mislabeled? (II) Can the filtered data enhance performance on the clean test data?
5.1 Results
We use the Dev set to select the best model and average three runs with different seeds. The experimental results are reported in Tables 2, 3, and 4, from which we have made the following observations.
Compared to the baselines, RNT gives the best results in most cases and achieves competitive performance in the others. We also observe that SSTC-based approaches comfortably outperform PLM fine-tuning when training with scarce labeled data (e.g., Nl = 30); one would expect the same advantage when Nl is increased (e.g., Nl ∈ {1k, 10k}), but this is not supported by the experiments. Furthermore, the results suggest that RNT is less sensitive to the number of classes than the SSTC-based alternatives. For instance, UDA (Xie et al., 2020) performs comparatively better on the binary datasets, as shown in Table 3.
Compared to the PLM fine-tuned only on the labeled data, RNT wins by considerable margins. For example, the Macro-F1 scores of RNT with Nl = 30 are about 2.6%, 2.7%, and 3.0% higher on the AG News, Yelp, and Yahoo datasets, respectively. Moreover, we also observe that RNT can perform better than the PLM fine-tuned on sufficient labeled data (e.g., Nl = 10k).
Table 2: Results (Macro-F1, %) on AG News, Yelp, Yahoo, and DBPedia with Nl ∈ {30, 1k, 10k} labeled examples (DBPedia: Nl = 30 only).

| PLM | Model | AG News 30 | AG News 1k | AG News 10k | Yelp 30 | Yelp 1k | Yelp 10k | Yahoo 30 | Yahoo 1k | Yahoo 10k | DBPedia 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT-Base | Fine-tuning | 84.1±0.9 | 87.8±0.3 | 90.5±0.2 | 42.2±1.7 | 53.2±0.8 | 58.6±0.5 | 63.2±0.5 | 67.1±0.3 | 70.8±0.2 | 97.1±0.9 |
| BERT-Base | UDA | 85.7±0.3 | 88.3 | 90.6 | 44.6±1.2 | 55.0 | 57.6 | 66.4±0.5 | 66.6 | 70.4 | 98.5±0.6 |
| BERT-Base | UST | 87.2±0.6 | 88.6 | 90.8 | 44.8±1.1 | 54.2 | 57.7 | 66.5±0.3 | 67.5 | 71.1 | 98.4±0.6 |
| BERT-Base | S2TC-BDD | 86.9±0.7 | 88.9 | 90.7 | 45.9±1.4 | 55.0 | 58.6 | 66.2±0.6 | 68.0 | 70.9 | 98.8±0.7 |
| BERT-Base | RNT (Ours) | 86.7±0.3 | 89.4±0.1 | 91.9±0.1 | 44.9±1.2 | 56.6±0.6 | 60.2±0.1 | 66.2±0.3 | 69.1±0.2 | 72.7±0.1 | 98.2±0.4 |
| RoBERTa-Base | Fine-tuning | 84.9±0.7 | 88.5±0.3 | 91.0±0.2 | 53.7±1.6 | 57.8±0.7 | 62.5±0.4 | 66.6±0.4 | 68.3±0.5 | 72.3±0.2 | 98.1±0.5 |
| RoBERTa-Base | RNT (Ours) | 86.9±0.4 | 89.6±0.1 | 92.2±0.2 | 53.9±1.4 | 60.0±0.5 | 63.8±0.1 | 67.2±0.4 | 69.6±0.2 | 73.7±0.1 | 98.4±0.2 |
| RoBERTa-Large | Fine-tuning | 86.5±0.4 | 89.1±0.2 | 91.8±0.2 | 56.2±1.3 | 62.3±0.6 | 66.0±0.4 | 67.8±0.3 | 70.3±0.3 | 73.7±0.1 | 98.3±0.3 |
| RoBERTa-Large | RNT (Ours) | 87.8±0.3 | 89.8±0.1 | 92.6±0.1 | 58.3±0.9 | 63.1±0.4 | 66.8±0.2 | 68.9±0.2 | 71.2±0.2 | 74.3±0.1 | 98.8±0.2 |
Table 3: Results (Macro-F1, %) on TREC, SST-2, SST-5, CR, and MR with Nl = 30 and with 10% of the training set labeled.

| PLM | Model | TREC 30 | TREC 10% | SST-2 30 | SST-2 10% | SST-5 30 | SST-5 10% | CR 30 | CR 10% | MR 30 | MR 10% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT-Base | Fine-tuning | 78.7±1.6 | 87.1±1.0 | 76.9±1.6 | 85.2±0.8 | 33.2±1.4 | 39.0±1.1 | 74.7±1.2 | 85.8±0.9 | 66.6±1.4 | 80.7±0.7 |
| BERT-Base | UDA | 83.5±1.1 | 91.2±0.7 | 79.9±1.3 | 85.6±0.3 | 33.6±1.1 | 40.6±0.8 | 81.0±0.7 | 87.7±0.6 | 72.9±0.9 | 81.0±0.1 |
| BERT-Base | UST | 83.3±1.2 | 92.1±0.8 | 78.7±1.0 | 85.6±0.4 | 33.9±1.1 | 40.8±0.7 | 82.7±0.8 | 87.8±0.3 | 71.1±1.0 | 81.0±0.3 |
| BERT-Base | S2TC-BDD | 81.2±1.3 | 91.2±0.9 | 81.1±1.2 | 85.7±0.5 | 34.6±1.3 | 39.6±0.5 | 82.3±0.9 | 87.6±0.7 | 72.1±0.9 | 80.0±0.6 |
| BERT-Base | RNT (Ours) | 85.2±1.1 | 91.4±0.7 | 83.8±1.3 | 87.6±0.4 | 35.9±1.2 | 42.3±0.9 | 82.6±0.9 | 89.3±0.4 | 71.5±1.0 | 82.4±0.3 |
| RoBERTa-Base | Fine-tuning | 84.2±1.3 | 92.1±0.7 | 85.0±0.9 | 89.5±0.4 | 39.3±1.0 | 47.6±0.7 | 86.5±1.2 | 91.1±0.7 | 71.2±1.4 | 84.9±0.5 |
| RoBERTa-Base | RNT (Ours) | 86.7±0.8 | 93.2±0.4 | 87.7±0.7 | 90.7±0.4 | 40.5±0.6 | 49.6±0.4 | 88.9±0.6 | 92.5±0.2 | 75.8±0.7 | 86.4±0.2 |
| RoBERTa-Large | Fine-tuning | 88.9±1.1 | 92.5±0.6 | 87.7±1.0 | 92.3±0.7 | 40.5±0.8 | 51.0±0.6 | 89.7±0.9 | 91.8±0.8 | 82.4±1.2 | 88.3±0.6 |
| RoBERTa-Large | RNT (Ours) | 89.6±0.6 | 94.0±0.4 | 89.6±0.6 | 93.2±0.5 | 42.8±0.5 | 52.4±0.3 | 92.3±0.6 | 92.6±0.3 | 85.9±0.6 | 88.4±0.3 |
5.2 Mislabeling Filtering Evaluation
To evaluate the ability of RNT to filter mislabeled instances, we conduct experiments on the Dev sets of the AG News, Yelp, and Yahoo datasets as follows. We first associate the instances with the corresponding pseudo-labels (i.e., inferred by the PT classifier). Then, we require RNT to rank them based on the evidential support received from the clean training set (i.e., Nl = 1k). Since we have access to the true labels of the Dev set, we can evaluate the performance of the filtering process. Specifically, we divide the ranked Dev set into ten equal-sized proportions (keeping the ranking order) and calculate the accuracy of each proportion separately (i.e., comparing the pseudo-labels with the true labels); a small sketch of this procedure is given after Table 5. As shown in Figure 5, the proportions are strongly correlated with the extent of mislabeling; in other words, the accuracy score gradually drops as the mislabeled instances accumulate, and vice versa. Note that we report accuracy due to the label imbalance within the proportions. Moreover, we report the performance on both the full Dev set and the filtered set in Table 5.
Table 5: Performance of the pseudo-labels on the full Dev set and on the filtered (top-ranked) proportion.

| Dataset | Nl | Full Dev Acc | Full Dev F1 | Filtered Prop | Filtered Acc | Filtered F1 |
|---|---|---|---|---|---|---|
| AG News | 1k | 88.1 | 88.1 | 70% | 95.8 | 95.2 |
| AG News | 10k | 91.9 | 91.9 | 70% | 98.2 | 97.9 |
| Yelp | 1k | 53.8 | 53.2 | 30% | 68.8 | 64.2 |
| Yelp | 10k | 60.5 | 60.2 | 30% | 75.9 | 72.5 |
| Yahoo | 1k | 67.1 | 67.0 | 30% | 89.6 | 72.6 |
| Yahoo | 10k | 72.0 | 71.2 | 40% | 91.0 | 83.6 |
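As a small sketch of the per-proportion evaluation described above (our own illustration), the pseudo-labels are assumed to be already ordered by evidential support:

```python
import numpy as np

def per_bin_accuracy(pseudo_labels, true_labels, n_bins=10):
    """Split the ranked Dev instances into n_bins equal bins (ranking order kept)
    and score each bin's pseudo-labels against the true labels."""
    pseudo = np.asarray(pseudo_labels)
    true = np.asarray(true_labels)
    bins = np.array_split(np.arange(len(pseudo)), n_bins)
    return [float((pseudo[idx] == true[idx]).mean()) for idx in bins]
```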
5.3 The Impact of Noise Filtering
To assess the impact of noise filtering on the overall performance of RNT, we remove DST and conduct experiments on the AG News, Yelp, and Yahoo datasets. The experimental results presented in Table 6 show that removing noise ranking from RNT causes performance drops of 1.3, 1.6, and 1.2 points on the AG News, Yelp, and Yahoo datasets, respectively. This demonstrates the efficacy of a well-designed noise-ranking mechanism in improving text classification performance. Furthermore, we observe that even without noise filtering, RNT outperforms PLM fine-tuning and achieves competitive results compared to the other alternatives, which supports the adoption of NT for noisy data.
5.4 The Effect of DST
To validate the contribution of DST to the final performance in terms of filtering mislabeled instances, we implement two variants, namely, RNT Pure and RNT PT-conf, as follows. RNT Pure is trained on Dl and Du as a whole without any filtering mechanism, while RNT PT-conf uses the PT confidence to filter out the instances in Du that do not meet a predefined threshold (i.e., 0.9 in our experiments). In other words, instead of DST, we rely on the PT confidence to discard the instances close to the decision boundary. Empirically, we conduct experiments on the AG News, Yelp, and Yahoo datasets with Nl ∈ {30, 1k, 10k}. The comparative results are shown in Table 7, from which we make the following observations. Overall, RNT mostly gives the best performance, and the improvements are significant, especially with scarce labeled data (e.g., Nl = 30). RNT Pure performs worse due to the absence of a filtering mechanism. RNT PT-conf achieves competitive performance with sufficient labeled data (e.g., Nl = 10k) but gradually degrades as the labeled data decrease. Intuitively, these results are expected, since the reliability of the PT confidence heavily depends on the amount of labeled data. In brief, this ablation study empirically supports the contribution of DST to the performance of RNT.
Table 7: Ablation of the DST-based filtering (Macro-F1, %) on AG News, Yelp, and Yahoo.

| Model | AG News 30 | AG News 1k | AG News 10k | Yelp 30 | Yelp 1k | Yelp 10k | Yahoo 30 | Yahoo 1k | Yahoo 10k |
|---|---|---|---|---|---|---|---|---|---|
| RNT Pure | 82.9±0.8 | 88.1±0.2 | 91.3±0.2 | 42.6±1.7 | 55.0±0.8 | 59.8±0.3 | 65.1±0.7 | 67.9±0.2 | 71.9±0.1 |
| RNT PT-conf | 83.7±1.3 | 89.7±0.2 | 91.7±0.1 | 42.2±1.6 | 54.6±0.6 | 60.1±0.2 | 65.4±1.0 | 68.1±0.4 | 72.3±0.1 |
| RNT (Ours) | 86.7±0.3 | 89.4±0.1 | 91.9±0.1 | 44.9±1.2 | 56.6±0.6 | 60.2±0.1 | 66.2±0.3 | 69.1±0.2 | 72.7±0.1 |
5.5 Denoising Evaluation
Recall that the ultimate goal of DST is to estimate the likelihood that an unlabeled instance has been mislabeled by the PT classifier. To evaluate the ability of DST to denoise, we adopt a perturbation strategy widely used in the literature (Belinkov and Bisk, 2018; Sun and Jiang, 2019). We randomly pick 30% of the Dev data as noisy instances and, for each such instance, randomly select 30% of its words to be perturbed. Specifically, we apply four kinds of noise: (1) swap two letters per word; (2) delete a letter randomly in the middle of the word; (3) replace a random letter with another; (4) insert a random letter in the middle of the word.
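A sketch of these four character-level perturbations is given below; it is our own implementation of the strategy, not the authors’ script, and the function names are illustrative.

```python
import random

def perturb_word(word):
    """Apply one of the four character-level perturbations to a single word."""
    if len(word) < 3:
        return word
    i = random.randrange(1, len(word) - 1)              # stay in the middle of the word
    op = random.choice(["swap", "delete", "replace", "insert"])
    letters = "abcdefghijklmnopqrstuvwxyz"
    if op == "swap":                                     # (1) swap two adjacent letters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":                                   # (2) delete a middle letter
        return word[:i] + word[i + 1:]
    if op == "replace":                                  # (3) replace a letter at random
        return word[:i] + random.choice(letters) + word[i + 1:]
    return word[:i] + random.choice(letters) + word[i:]  # (4) insert a random letter

def perturb_text(text, ratio=0.3):
    """Perturb roughly `ratio` of the words in a text."""
    words = text.split()
    if not words:
        return text
    for j in random.sample(range(len(words)), max(1, int(ratio * len(words)))):
        words[j] = perturb_word(words[j])
    return " ".join(words)
```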
The evaluation results of denoising are reported in Table 8, from which we make the following observations. (1) A considerable margin exists between the performance of the PT classifier on the clean and the noisy data, demonstrating the impact of the generated noise. (2) Despite the well-recognized difficulty of denoising in NLP, our proposed solution can mostly identify the clean instances. (3) Even though the performance can be deemed considerable, noisy information may still exist in the filtered data; we therefore use NT for further training.
6 Conclusion and Future Work
In this paper, we proposed a self-training semi-supervised framework, namely, RNT, to address the text classification problem in learning-with-noisy-labels settings. RNT first discards the highly risky mislabeled texts based on reasoning with uncertainty. Then, it uses the negative training technique to reduce the noisy information during training. Our extensive experiments have shown that RNT mostly outperforms SSTC-based alternatives. Despite the robustness of negative training, clean samples, which share the distribution of the test data, are also trained with complementary labels. Consequently, both clean and potentially noisy samples contribute equally to the final performance. A combination of positive and negative training strategies in a unified framework may remedy the abundance of noisy samples; however, this requires further investigation.
Acknowledgments
We extend our gratitude to the TACL action editor and the anonymous reviewers for their valuable feedback and insightful suggestions, which have significantly contributed to the improvement of our work.
References