Rank-Aware Negative Training for Semi-Supervised Text Classification

Abstract Semi-supervised text classification (SSTC) paradigms typically employ the spirit of self-training. The key idea is to train a deep classifier on limited labeled texts and then iteratively predict pseudo-labels for the unlabeled texts for further training. However, the performance largely depends on the accuracy of the pseudo-labels, which may not be sufficient in real-world scenarios. This paper presents a Rank-aware Negative Training (RNT) framework that addresses SSTC under a learning-with-noisy-labels setting. To alleviate the noisy information, we adapt a reasoning-with-uncertainty approach to rank the unlabeled texts based on the evidential support they receive from the labeled texts. Moreover, we propose the use of negative training to train RNT based on the concept that "the input instance does not belong to the complementary label". A complementary label is randomly selected from all labels except the target label. Intuitively, the probability of a true label serving as a complementary label is low, so it provides less noisy information during training, resulting in better performance on the test data. Finally, we evaluate the proposed solution on various text classification benchmark datasets. Our extensive experiments show that it consistently outperforms the state-of-the-art alternatives in most scenarios and achieves competitive performance in the others. The code of RNT is publicly available on GitHub.


Introduction
The text classification task aims to associate a piece of text with a corresponding class, which could be a sentiment, topic, or category. With the rapid development of deep neural networks, text classification has experienced a considerable shift towards pre-trained language models (PLMs) (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Lewis et al., 2020). Overall, PLMs are first trained on massive text corpora (e.g., Wikipedia) to learn contextual representations, followed by a fine-tuning step on downstream tasks (Li et al., 2021; Chen et al., 2022; Tsai et al., 2022; Ahmed et al., 2022). The improvement of these approaches heavily relies on high-quality labeled data. However, labeling data is labor-intensive, and labeled data may not be readily available in real-world scenarios. To alleviate the burden of labeling, Semi-Supervised Text Classification (SSTC) leverages unlabeled texts to perform a particular task. SSTC-based approaches commonly attempt to exploit the consistency between instances under different perturbations (Li et al., 2020).
Earlier SSTC-based approaches adopt various data augmentation techniques via back-translation. They employ a consistency loss between the predictions for unlabeled texts and their augmented counterparts, obtained by translating the text into a target language and then back into the source language (Miyato et al., 2019; Xie et al., 2020; Chen et al., 2020). However, these approaches require an additional Neural Machine Translation (NMT) system, which may be inaccurate and cumbersome in real-world scenarios. Recently, SSTC has experienced a shift toward self-training and PLM fine-tuning (Li et al., 2021; Tsai et al., 2022). The basic idea is to fine-tune PLMs on the labeled data and iteratively use predictions on the unlabeled data as pseudo-labels for further training. However, the pseudo-labels are treated on a par with the true labels and thus may lead to error accumulation (Zhang et al., 2021; Arazo et al., 2020).
In this paper, we propose a Rank-aware Negative Training (RNT) framework to address SSTC under a learning-with-noisy-labels setting. To alleviate the domination of noisy information during training, we adopt a reasoning-with-uncertainty approach to rank the unlabeled texts by measuring their shared features, also known as evidential support, with the labeled texts. The shared features, which serve as a medium to convey knowledge from the labeled texts (i.e., evidence) to the unlabeled texts (i.e., inference), are regarded as belief functions to reason about the degree of noisiness. These belief functions are combined to reach a final belief about a text being mislabeled. In other words, we attempt to discard the texts whose pseudo-labels may introduce inaccurate information into the training process.
Moreover, we propose using negative training (NT) (Kim et al., 2019) to robustly train with potentially noisy pseudo-labels. Unlike positive training, NT is an indirect learning method that trains the network based on the concept that "the input sentence does not belong to the complementary label", where a complementary label is randomly selected from the label space excluding the label of the sentence on target. Considering the AG News dataset, given a sentence annotated as sport, the complementary label is randomly selected from all labels except sport (e.g., business). Intuitively, the probability of a true label serving as a complementary label is low, which reduces the noisy information during the training process. Finally, we conduct extensive experiments on various text classification benchmark datasets with different ratios of labeled examples. Experimental results suggest that RNT mostly outperforms the SSTC-based alternatives. Moreover, it has been empirically shown that RNT can perform better than PLMs fine-tuned on sufficient labeled examples.
In brief, the main contributions are three-fold:
• We propose a rank-aware negative training framework, namely RNT, to address the semi-supervised text classification problem in a learning-with-noisy-labels manner.
• We introduce a reasoning-with-uncertainty-based solution that discards texts with potentially noisy pseudo-labels by measuring the evidential support received from the labeled texts.
• We evaluate the proposed solution on various text classification benchmark datasets.
Our extensive experiments show that it consistently outperforms the state-of-the-art alternatives in most cases and achieves competitive performance in the others.

Related Work
This section reviews existing solutions for the SSTC task and for learning with noisy labels.
Text Classification. Text classification aims at assigning a given document to a number of semantic categories, which could be a sentiment, topic, or aspect (Hu and Liu, 2004; Liu, 2012; Schouten and Frasincar, 2016). Earlier solutions were usually equipped with a deep memory or an attention mechanism to learn a semantic representation in response to a given category (Socher et al., 2013b; Zhang et al., 2015; Wang et al., 2016; Ma et al., 2017; Chen et al., 2017; Johnson and Zhang, 2017; Conneau et al., 2017; Song et al., 2019; Murtadha et al., 2020; Tsai et al., 2022). Recently, many NLP tasks have experienced a considerable shift towards fine-tuning pre-trained language models (PLMs) (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Zaheer et al., 2020; Chen et al., 2022; Tsai et al., 2022; Ahmed et al., 2022). Despite the effectiveness of these approaches, their performance heavily relies on the quality of the labeled data, which requires intensive human labor.
Semi-supervised text classification. Partially supervised text classification, also known as learning from Positive and Unlabeled (PU) examples, aims at building a classifier using P and U, in the absence of negative examples, to classify the unlabeled examples (Liu et al., 2002; Li et al., 2010; Liu et al., 2011). Recent SSTC approaches primarily focus on exploiting the consistency of predictions for the same samples under different perturbations. Miyato et al. (2016) established virtual adversarial training, which perturbs word embeddings to encourage consistency between perturbed embeddings. Variational auto-encoder-based approaches (Yang et al., 2017; Chen et al., 2018; Gururangan et al., 2019) attempted to reconstruct instances and utilized the latent variables to classify text. Unsupervised data augmentation (UDA) (Xie et al., 2020) performed consistency training by making features consistent between back-translated instances. However, these methods mostly require additional systems (e.g., NMT back-translation), which may be cumbersome in real-world scenarios. Mukherjee and Awadallah (2020) and Tsai et al. (2022) introduced uncertainty-driven self-training-based solutions that select samples and perform self-training on the selected data. An iterative framework named SENT (Ma et al., 2021) proposed to address distant relation extraction via negative training. Self-Pretraining (Karisani and Karisani, 2021) employs an iterative distillation procedure to cope with the inherent problems of self-training. SSTC-based approaches and their limitations are well described by van Engelen and Hoos (2020) and Yang et al.
(2022). Recently, S2TC-BDD (Li et al., 2021) was introduced to balance the label angle variances (i.e., the angles between deep representations of texts and weight vectors of labels), also called the margin bias. Despite the effectiveness of these methods, the unlabeled instances contribute on a par with the labeled ones; therefore, the performance heavily relies on the quality of the pseudo-labels. Unlike these methods, our proposed solution addresses the SSTC task as a problem of learning under noisy-label settings. Since the pseudo-labels are automatically assigned by the machine, we regard them as noisy labels and introduce a ranking approach to filter out instances at high risk of being mislabeled. To alleviate the noisy information remaining after the filtering process, we use negative training, which performs classification based on the concept that "the input instance does not belong to the complementary label".
Learning with noisy labels. Learning with noisy data has been extensively studied, especially in the computer vision community. Existing solutions introduced various methods to relabel the noisy samples in order to correct the loss function. To this end, several relabeling methods treat all samples equally to model the noisy ones, including directed graphical models (Xiao et al., 2015), conditional random fields (Vahdat, 2017), knowledge graphs (Baek et al., 2022), and deep neural networks (Veit et al., 2017; Lee et al., 2018). However, they were built on semi-supervised learning, where access to a limited amount of clean data is required. Ma et al. (2018) introduced a bootstrapping method to modify the loss with model predictions by exploiting the dimensionality of feature subspaces. Patrini et al. (2017) proposed to estimate the label corruption matrix for loss correction. Another line of research on loss correction investigated two approaches: reweighting training samples and separating clean and noisy samples (Thulasidasan et al., 2019; Konstantinov and Lampert, 2019). Shen and Sanghavi (2019) claimed that a deep classifier normally learns clean instances faster than noisy ones; based on this claim, they consider instances with smaller losses as clean. A negative training technique (Kim et al., 2019) trains the model on a complementary label, randomly generated from the label space except the target label. The goal is to encourage the predicted probabilities to follow a distribution in which noisy instances are largely concentrated in low-confidence areas and clean instances in high-confidence areas, facilitating their separation. Han et al. (2018) proposed to jointly train two networks that select small-loss samples within each mini-batch to train each other. Based on this paradigm, Yu et al.
(2019) proposed updating the networks on disagreement data to keep the two networks diverged. In this paper, we leverage a robust negative loss (Kim et al., 2019) for noisy data training.

Rank-aware Negative Training
This section describes the proposed framework, namely Rank-aware Negative Training (RNT), for semi-supervised text classification. An overview of RNT is depicted in Figure 1. Suppose we have a training dataset D consisting of a limited labeled set D_l and a large unlabeled set D_u. We follow the pseudo-labeling method introduced by Lee (2013) to associate D_u with pseudo-labels based on the concept of positive training. Simply, we fine-tune a pre-trained language model (e.g., BERT) on the D_l set. It is noteworthy that we use BERT for a fair comparison, while other models can be used similarly. As the pseudo-labels are not manually annotated, we rank the texts by their potential for mislabeling in order to identify and discard the most risky mislabeled texts. Specifically, we first capture the shared information (referred to as evidential support) between the labeled and unlabeled instances. Then, we measure the amount of support that an unlabeled instance receives from the labeled instances for being correctly labeled. We denote the filtered set as D′_u in Figure 1. Finally, we train on both D_l and D′_u using the concept of negative training. Next, we describe the framework in detail.

Task Description
Semi-Supervised Text Classification (SSTC). Let D be the training dataset consisting of a limited labeled set D_l = {(x^l_i, y^l_i)} and a large unlabeled set D_u = {x^u_j}, where x^l_i and x^u_j denote the input sequences of labeled and unlabeled texts, respectively, and y^l_i ∈ {0, 1}^K represents the corresponding one-hot label vector of x^l_i over K classes. The goal is to learn a classifier that leverages both D_l and D_u to generalize better at inference, also known as inductive SSTC.

Positive and Negative Training
Positive Training (PT). A typical method of training a model with a given input instance and its corresponding label is referred to as positive training (PT). In other words, the model is trained based on the concept that "the input instance belongs to this label". Considering a multi-class classification problem, let x ∈ X be an input and y ∈ {0, 1}^K the K-dimensional one-hot vector of its label. The model f(x; θ) maps the input instance to the K-dimensional score space f : X → R^K, where θ is the set of parameters. To achieve this, PT uses the cross-entropy loss function defined as follows:

L_PT(f, y) = − Σ_{k=1}^{K} y_k log(p_k),     (1)

where p_k denotes the predicted probability of the k-th label. Equation 1 satisfies the goal of PT to push the probability value corresponding to the given label toward 1 (p_k → 1).
Negative Training (NT). Unlike PT, the model is trained based on the concept that "the input text does not belong to this label". Specifically, given an input text x with a label y ∈ {0, 1}^K, a complementary label ȳ is generated by randomly sampling from the label space except y (i.e., ȳ ∈ Y \ {y}). The cross-entropy loss function of NT is defined as follows:

L_NT(f, ȳ) = − Σ_{k=1}^{K} ȳ_k log(1 − p_k).     (2)

To illustrate the robustness of PT and NT against noise, we train with both techniques on the AG News dataset corrupted with 30% symmetric noise (i.e., associating instances with a random label). In terms of confidence (i.e., the probability of the true class), we illustrate the histogram of the training data after PT and NT in Figure 2. As can be seen, with PT in Figure 2 (a), the confidence of both clean and noisy instances increases simultaneously. With NT in Figure 2 (b), in contrast, the noisy instances yield much lower confidence than the clean ones, which discourages the domination of noisy data. After NT training, we train the model with only the samples having NT confidence over 1/K, where K denotes the number of classes. We refer to this process as Selective NT (SelNT), as illustrated in Figure 2 (c) (Kim et al., 2019). We also depict the distribution of the proposed RNT in Figure 2 (d), which demonstrates the improvement of RNT in terms of noise filtering. In terms of performance, as shown in Figure 3, the accuracy of PT on the Dev data increases in the early stage. However, the direct mapping of features to the noisy labels eventually leads to overfitting and thus gradually results in inaccurate performance on the clean Dev data.
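As a concrete illustration, the two objectives and the complementary-label sampling can be sketched in NumPy (a minimal sketch; the class count and probabilities below are made up for illustration):

```python
import numpy as np

def pt_loss(probs, label):
    # Positive training: -log p_y, pushes the probability of the
    # (possibly noisy) label toward 1.
    return -np.log(probs[label])

def nt_loss(probs, comp_label):
    # Negative training: -log(1 - p_ybar), pushes the probability of the
    # complementary label toward 0.
    return -np.log(1.0 - probs[comp_label])

def sample_complementary(label, num_classes, rng):
    # A complementary label is drawn uniformly from all classes except `label`.
    choices = [k for k in range(num_classes) if k != label]
    return int(rng.choice(choices))

rng = np.random.default_rng(0)
probs = np.array([0.7, 0.2, 0.1])        # softmax output for one instance
y = 0                                    # (possibly noisy) pseudo-label
y_bar = sample_complementary(y, 3, rng)  # never equals y
```

Because a true label is rarely drawn as the complementary label, the NT gradient rarely pushes probability mass away from the correct class, which is the robustness property exploited above.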

Noise Ranking
We begin by extracting the shared features (i.e., evidential support) between the evidence (i.e., the labeled texts) and the inference (i.e., the unlabeled texts). Then, we adopt a reasoning-with-uncertainty approach to measure the evidential support.
An instance with higher evidential support is regarded as less likely to be noisy. An illustrative example is shown in Figure 4. Next, we describe the process in detail.

Feature Generation
Recall that RNT begins by training on the labeled data using the PT technique. Consequently, we rely on the learned latent space of PT to generate features with three properties: automatically generated, discriminating, and high-coverage, as follows.
Semantic distance. For each instance x_i ∈ {D_l, D_u}, we recompute its semantic relatedness to each label y_i ∈ Y based on the Angular Margin (AM) loss (Wang et al., 2018).
The AM loss adds a margin term to the Softmax loss based on the angle (i.e., cosine similarity) between an input sample's feature vector and the actual class's weight vector. Notably, the margin term encourages the network to learn feature representations that are well separated and distinct for different classes. As a result, the angle between the feature vector of an input sample and different classes becomes an essential factor in estimating the degree of noisiness. For clarity, we first describe the AM loss with respect to angles. Given a training example (x_i; y_i), it can be formulated as:

L_AM(x_i, y_i; ϕ) = − log [ e^{s·(cos θ_{y_i} − m)} / ( e^{s·(cos θ_{y_i} − m)} + Σ_{k≠y_i} e^{s·cos θ_k} ) ],     (3)

where ϕ denotes the model parameters, s is a scaling factor, m is the margin, and cos(.) stands for the cosine similarity, which can be read as the angular distance between feature vectors and the class weights. Given an unlabeled instance x^u_j, we recompute its AM loss with respect to each class y_i ∈ Y as follows:

L_cos(x^u_j, y_i) = (1/N) Σ_{n=1}^{N} (1 − θ_jn),     (4)

where N is the number of samples (e.g., 5) from D_l labeled with y_i (i.e., the class on target) and θ_jn denotes the cosine similarity between x^u_j and x^l_n (i.e., their deep representations under the PT classifier). The intuition behind this feature is that an unlabeled instance x^u_j that receives close amounts of support from different classes is regarded as potentially mislabeled. We denote this feature as f, and its value consists of the corresponding class y_i as well as the value of L_cos. To enable valuable shared knowledge between instances, L_cos is approximated to one digit (e.g., L_cos(x^u_j, 1) = 0.213 ≈ 0.2). Considering the illustrative example in Figure 4, x^u_j and x^l_i approximately share the same L_cos in response to class 0 (i.e., f_3(0, 0.2)).
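A sketch of how this feature could be computed from PT representations (the helper names, the mean-angular-distance form, and the one-digit rounding are taken from the description above; the exact implementation in the released code may differ):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_feature(z_u, labeled_feats, class_id):
    # Mean angular distance (1 - cosine similarity) between the unlabeled
    # instance's PT feature vector and N labeled samples of class `class_id`,
    # rounded to one digit so that instances can share the feature value.
    dists = [1.0 - cosine(z_u, z_l) for z_l in labeled_feats]
    return (class_id, round(float(np.mean(dists)), 1))
```

An instance whose rounded distances to several classes coincide receives similar feature values for each of them, which is exactly the "close amounts of support" signal used to flag potential mislabeling.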
PT confidence. Instances with extreme confidence (i.e., close to 1) are generally considered to have a low risk of being mislabeled (Hou et al., 2020). To incorporate the class distribution of PT into the evidential support measurement, we introduce a new feature, denoted f′, whose value consists of the predicted class and its corresponding probability. Considering the illustrative example in Figure 4, x^u_j and x^l_i share f′_1 (i.e., f′_1(0, 0.9)), which can be read as: both instances are related to class 0 according to the PT classifier with 0.9 confidence.

Evidential Support Measurement
Now that we can capture shared knowledge between the labeled and unlabeled instances (i.e., the evidential support), we leverage Dempster-Shafer Theory (DST) (Yang and Xu, 2013) to address evidential support measurement as reasoning with uncertainty. The goal is to estimate the degree of noisiness of an unlabeled instance by combining its evidence from multiple sources of uncertain information (i.e., the PT and semantic features). To achieve this, DST applies Dempster's rule, which combines the mass functions of each source of evidence to form a joint mass function. It is noteworthy that DST has been widely used for various reasoning purposes (Liu et al., 2018; Wang et al., 2021; Ahmed et al., 2021). Consider the illustrative example in Figure 4: the instances x^u_j and x^l_i exhibit a similar L_cos value in response to class 0 (i.e., f_3(0, 0.2)). The approximate PT confidence of 0.9, represented by f′_1(0, 0.9), further strengthens this similarity. Consequently, the instance x^u_j is considered less noisy due to the higher degree of evidential support it receives. The basic concepts of DST are:
• Proposition. It refers to all possible states of a situation under consideration. Two propositions are defined: "clean instance", denoted by C, and "unclean instance", denoted by U. Let the frame be X = {C, U} and its power set 2^X = {∅, C, U, X}.
• Mass function. Each element E of the power set 2^X is associated with a degree of belief (or mass) m(E), which satisfies Σ_{E∈2^X} m(E) = 1 and m(∅) = 0. Different belief functions are defined for the various pieces of evidence (i.e., the generated features).
Given an unlabeled instance x^u_j and its semantic feature f, we estimate the evidential support that x^u_j receives from labeled instances that share f by the belief function:

m_f({C}) = (1 − d_f) · P(f),  m_f({U}) = (1 − d_f) · (1 − P(f)),  m_f(X) = d_f,     (5)

where d_f denotes the degree of uncertainty of f, and P(f) is the number of positive instances (i.e., the labeled instances with the same class as the feature on target f) divided by the number of all labeled instances sharing f. Consider the illustrative example in Figure 4 with f_3(0, 0.2) (i.e., semantically related to class 0 with an approximated similarity of 0.2); supposing that the positive instances x^l_i and x^l_{i+1} are annotated with class 0, then P(f_3) = 1.0. Equation 5 can be read as: the more extreme the value of P(f) (i.e., close to 0 or 1), the more decisive the evidential support that C (or U) receives from the feature f. Similarly, we use Equation 5 to estimate the evidential support m_f′(E) that x^u_j receives from f′. Note that d_f represents the impact that a given feature may have on the final degree of belief in the evidential support measurement: the lower the value, the greater the impact. Both types of features are generated from the latent space of the PT classifier, whose semantic representation we trust since it is trained on the labeled data. Therefore, we empirically set d_f to a small unified value (i.e., 0.2 in our experiments).
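Under these definitions, the per-feature mass assignment can be sketched as follows (a sketch consistent with the mass-function constraints above; the exact functional form used in the paper's implementation is an assumption):

```python
def feature_mass(p_f, d_f=0.2):
    # Split the non-uncertain mass (1 - d_f) between "clean" (C) and
    # "unclean" (U) according to P(f); the whole frame X = {C, U}
    # absorbs the feature's uncertainty d_f.
    return {"C": (1 - d_f) * p_f, "U": (1 - d_f) * (1 - p_f), "X": d_f}

m = feature_mass(1.0)   # P(f_3) = 1.0 as in the Figure 4 example
```

Note the masses always sum to 1 and m(∅) is implicitly 0, as the mass-function definition requires.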
The overall evidential support of E = {C} that x^u_j receives from its observations is estimated by combining the estimated beliefs as follows:

m(E) = (m_f ⊕ m_f′)(E),     (6)

where m(E) represents the total amount of evidential support that x^u_j receives, and the combination ⊕ of the two sets of belief functions m_f and m_f′ is computed as follows:

(m_f ⊕ m_f′)(E) = (1 / (1 − K_c)) Σ_{E′ ∩ E″ = E} m_f(E′) · m_f′(E″),  with  K_c = Σ_{E′ ∩ E″ = ∅} m_f(E′) · m_f′(E″),     (7)

where E′ and E″ denote elements of the power set 2^X and K_c is a measure of the amount of conflict between E′ and E″. In words, given the element E = {C}, we multiply the combinations of E′ and E″ such that E′ ∩ E″ = C, which can thus be regarded as a measure of the amount of support for {C}. Regarding time complexity, each iteration takes O(n × n_f) time with n instances and n_f generated features. Thus, the overall time complexity is O(n² × n_f).
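Dempster's rule over the two-proposition frame X = {C, U} can be sketched as a small function (a minimal implementation of the combination step described above):

```python
def dempster_combine(m1, m2):
    # Combine two mass functions over 2^X with X = {C, U}: multiply masses of
    # all pairs of focal elements, route each product to the intersection of
    # the pair, and renormalise by 1 - K_c, where K_c collects the mass of
    # conflicting (empty-intersection) pairs.
    sets = {"C": frozenset("C"), "U": frozenset("U"), "X": frozenset("CU")}
    names = {v: k for k, v in sets.items()}
    joint = {"C": 0.0, "U": 0.0, "X": 0.0}
    conflict = 0.0
    for e1, v1 in m1.items():
        for e2, v2 in m2.items():
            inter = sets[e1] & sets[e2]
            if inter:
                joint[names[inter]] += v1 * v2
            else:
                conflict += v1 * v2
    return {k: v / (1.0 - conflict) for k, v in joint.items()}
```

Combining two agreeing sources concentrates mass on {C}, which is why instances supported by both the semantic and the PT-confidence features rank as least noisy.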

Training Procedure
Now that we can measure the evidential support, we rank the instances of D_u and select the least risky instances as the filtered set, denoted as D′_u, whose size is N_f. Note that the value of N_f is tuned using the Dev set (please refer to Section 4.2 for more details). Finally, we combine both sets D_l and D′_u for the final NT training, as illustrated in Figure 1. The training procedure can be summarized by the following steps. We first generate pseudo-labels using the PT technique (Eq. 1). Then, we apply DST to filter out the highly risky instances. Finally, we adopt the NT technique (Eq. 2) to alleviate the noisy information during training. Furthermore, to improve convergence after NT, we follow Kim et al. (2019) by training only with the instances whose confidence is over 1/K, denoted as SelNT in Figure 2 (c).
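For instance, the SelNT selection step at the end of the procedure can be sketched as (a minimal sketch with made-up probabilities; the model and label arrays stand in for the NT classifier's outputs):

```python
import numpy as np

def selnt_mask(probs, labels):
    # Keep only samples whose confidence in their (pseudo-)label
    # exceeds 1/K, where K is the number of classes.
    probs = np.asarray(probs)
    K = probs.shape[1]
    conf = probs[np.arange(len(labels)), labels]
    return conf > 1.0 / K

probs = [[0.7, 0.2, 0.1],    # confident in label 0 -> kept
         [0.3, 0.35, 0.35]]  # 0.3 < 1/3 -> dropped
mask = selnt_mask(probs, [0, 0])
```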
Experimental Setup

Dataset
We validate the performance of the proposed RNT on various text classification benchmark datasets (see Table 1). In particular, we rely on AG News (Zhang et al., 2015), Yahoo (Chang et al., 2008), Yelp (Zhang et al., 2015), DBPedia (Zhang et al., 2015), TREC (Li and Roth, 2002), SST (Socher et al., 2013a), CR (Ding et al., 2008), MR (Pang and Lee, 2005), TNEWS, and OCNLI (Xu et al., 2020). For the AG News, Yelp, and Yahoo datasets, we follow the comparative approaches by forming the unlabeled training set D_u, the labeled training set D_l, and the development set by randomly drawing from the corresponding original training datasets.
For the other datasets, we split the training set into 10% and 90% for D_l and D_u, respectively. Note that we utilize the original test sets for prediction evaluation.

Comparative Baselines
For fairness, we only include semi-supervised learning methods built upon contextual embedding models (e.g., BERT):
• PLM is a pre-trained language model directly fine-tuned on the labeled data. We compare to BERT (Devlin et al., 2019; Cui et al., 2021) and RoBERTa (Liu et al., 2019);
• UDA (Xie et al., 2020) is an SSTC method based on unsupervised data augmentation with back-translation. We use German and English for back-translation of the English and Chinese datasets, respectively;
• UST (Mukherjee and Awadallah, 2020) selects samples by information gain and utilizes cross-entropy loss to perform self-training;
• S2TC-BDD (Li et al., 2021) is an SSTC method that addresses the margin bias problem by balancing the label angle variances.

Experimental Settings
• Hyper-parameters. We use 12 heads and layers and keep the dropout probability at 0.1, with 30 epochs, a learning rate of 2e-5, and a batch size of 32. To guarantee reproducibility without manual effort, we rely on the Dev set to automatically set the value of N_f (i.e., the number of instances in D′_u). First, the ranked Dev set is split into small proportions (at most 10). Then, N_f is set to cover the proportions that meet the condition λ = max(p) − st(p), where p is a vector representing the accuracy of RNT on each proportion and st denotes the standard deviation. For example, a proportion of 0.2 means that D′_u consists of the first 20% of the ranked D_u, as shown in Figure 5. We set the number of negative samples to K − 1, where K is the number of classes in the labeled training set.
Table 1: The statistics of the benchmark datasets, where #Lab and #Unlab denote the number of labeled and unlabeled texts, respectively. Note that for datasets with NA, we split the training set into 10% and 90% for #Lab and #Unlab, respectively.
• Metrics. We use the accuracy metric on the CLUE datasets, including TNEWS and OCNLI, and Macro-F1 scores for all other datasets.
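The Dev-set-based selection of N_f described above could be sketched as follows (the source's description of the condition is partly garbled, so the reading "keep the proportions whose accuracy is at least λ" is an assumption):

```python
import numpy as np

def select_proportions(acc):
    # acc[i]: accuracy of RNT on the i-th ranked Dev proportion (best first).
    # Keep the proportions whose accuracy is at least lambda = max(p) - st(p).
    p = np.asarray(acc, dtype=float)
    lam = p.max() - p.std()
    return [i for i, a in enumerate(p) if a >= lam]
```

N_f would then be the number of D_u instances covered by the selected leading proportions.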

Evaluation and Results
We describe the evaluation tasks and report the experimental results in this section. The evaluation criteria are: (I) Is RNT able to rank instances by their likelihood of being mislabeled? (II) Can the filtered data enhance performance on the clean test data?

Results
We use the Dev set to select the best model and average three runs with different seeds. The experimental results are reported in Tables 2, 3 and 4, from which we have made the following observations.
• Compared to the baselines, RNT gives the best results in most cases and achieves competitive performance in the others. We also observe that SSTC-based approaches comfortably outperform PLM fine-tuning when training with scarce labeled data (e.g., N_l = 30); the same advantage might be expected when N_l is increased (e.g., N_l ∈ {1k, 10k}), but this was not supported by the experiments. Furthermore, the experimental results demonstrate that RNT is less sensitive to the number of classes than the SSTC-based alternatives. For instance, UDA (Xie et al., 2020) performs better on the binary datasets, as shown in Table 3.
• Compared to the PLM fine-tuned on the labeled data, RNT comfortably outperforms PLMs by considerable margins. For example, the Macro-F1 gains of RNT with N_l = 30 are about 2.6%, 2.7% and 3.0% on the AG News, Yelp and Yahoo datasets, respectively. Moreover, we also observe that RNT can perform better than a PLM fine-tuned on sufficient labeled data (e.g., N_l = 10k). The results for N_l ∈ {1k, 10k} are retrieved from S2TC-BDD (Li et al., 2021), while the others are our implementations (Cui et al., 2020). The scores consist of the average of three runs, and the best scores are in bold.

Mislabeling Filtering Evaluation
To evaluate the ability of RNT in mislabeling filtering, we conduct experiments on the Dev sets of the AG News, Yelp, and Yahoo datasets as follows. We first associate the instances with corresponding pseudo-labels (i.e., inferred using the PT classifier). Then, we require RNT to rank them based on the evidential support received from the clean training set (i.e., N_l = 1k). Since we have access to the true labels of the Dev set, we can evaluate the performance of the filtering process. Specifically, we divide the ranked Dev set into ten equal proportions (keeping the order of the ranking) and calculate the accuracy of each proportion separately (i.e., comparing the pseudo-labels with the true labels).
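The per-proportion accuracy computation described above can be sketched as:

```python
import numpy as np

def proportion_accuracies(pseudo, true, n_bins=10):
    # `pseudo` and `true` are aligned label arrays, already sorted by
    # evidential support (most-supported first); report accuracy per
    # equal-sized proportion while preserving the ranking order.
    pseudo, true = np.asarray(pseudo), np.asarray(true)
    bins = np.array_split(np.arange(len(pseudo)), n_bins)
    return [float((pseudo[b] == true[b]).mean()) for b in bins]
```

If the ranking works, the returned accuracies should decrease from the first proportion to the last, which is the pattern reported in Figure 5.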
The proportions, as shown in Figure 5, are significantly correlated with the extent of mislabeling. In other words, the accuracy score gradually drops as the number of mislabeled instances increases, and vice versa. Note that we report accuracy due to the label imbalance within the proportions. Moreover, we report the performance on both the full Dev set and the filtered set in Table 6.

The Impact of Noise Filtering
To assess the impact of noise filtering on the overall performance of RNT, we remove DST and conduct experiments on the AG News, Yelp, and Yahoo datasets. The experimental results presented in Table 7 show that removing noise ranking from RNT causes a performance drop of 1.3, 1.6, and 1.2 points on the AG News, Yelp, and Yahoo datasets, respectively. This demonstrates the efficacy of a well-designed noise ranking in improving text classification performance. Furthermore, we observe that even without noise filtering, RNT outperforms PLM fine-tuning and achieves competitive results compared to the other alternatives. This supports the adoption of NT for noisy data.

The Effect of DST
To validate the contribution of DST to the final performance in terms of mislabeled instance filtering, we implement two variants, namely RNT-Pure and RNT-PT-conf, as follows. RNT-Pure is trained on D_l and D_u as a whole without any filtering mechanism, while RNT-PT-conf uses the PT confidence to filter out the instances in D_u that do not meet a predefined threshold (i.e., 0.9 in our experiments). In other words, instead of DST, we rely on the PT confidence to discard the instances close to the decision boundary. Empirically, we conduct experiments on the AG News, Yelp, and Yahoo datasets with N_l ∈ {30, 1k, 10k}. The comparative results are shown in Table 5, from which we made the following observations. Overall, RNT mostly gives the best performance, and the improvements are significant, especially with very limited data (e.g., N_l = 30). RNT-Pure performs worse due to the absence of a filtering mechanism. RNT-PT-conf can achieve competitive performance with sufficient labeled data (e.g., N_l = 10k). However, its performance gradually drops as the labeled data decreases. Intuitively, these results are expected, as the performance of the PT classifier heavily relies on the amount of labeled data. In brief, the ablation study empirically supports the contribution of DST to the performance of RNT.

Denoising Evaluation
Recall that the ultimate goal of DST is to estimate the likelihood of unlabeled instances being mislabeled by the PT classifier. To assess the ability of DST to denoise, we adopt a perturbation strategy that has been widely used in the literature (Belinkov and Bisk, 2018; Sun and Jiang, 2019). We randomly pick 30% of the Dev data as noisy instances. For each such instance, we randomly select 30% of the words to be perturbed. Specifically, we apply four kinds of noise: (1) swap two letters per word; (2) delete a letter randomly in the middle of the word; (3) replace a random letter with another in a word; (4) insert a random letter in the middle of the word.
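The four character-level perturbations can be sketched as follows (a sketch; which positions count as the "middle" of a word is our assumption):

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def perturb(word, rng):
    # Apply one of four random character-level noises to a word:
    # swap adjacent letters, delete, replace, or insert a middle letter.
    if len(word) < 3:
        return word                      # too short to perturb safely
    i = rng.randrange(1, len(word) - 1)  # a middle position
    op = rng.randrange(4)
    if op == 0:                          # (1) swap two adjacent letters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == 1:                          # (2) delete a middle letter
        return word[:i] + word[i + 1:]
    if op == 2:                          # (3) replace a middle letter
        return word[:i] + rng.choice(LETTERS) + word[i + 1:]
    return word[:i] + rng.choice(LETTERS) + word[i:]   # (4) insert a letter
```

Applying this to 30% of the words of a sampled instance yields the noisy Dev data used in the denoising evaluation.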
The evaluation results of denoising are reported in Table 8, from which we make the following observations. (1) There is a considerable margin between the performance of the PT classifier on the clean and noisy data, demonstrating the impact of the generated noise. (2) Despite the well-recognized challenge of denoising in NLP, our proposed solution can mostly identify the clean instances. (3) Even though the performance can be deemed considerable, noisy information may still exist in the filtered data; therefore, we use NT for further training.

Conclusion and Future Work
In this paper, we proposed a self-training semi-supervised framework, namely RNT, to address the text classification problem in a learning-with-noisy-labels manner. RNT first discards the texts at high risk of being mislabeled based on reasoning with uncertainty theory. Then, it uses the negative training technique to reduce the noisy information during training. Our extensive experiments have shown that RNT mostly outperforms SSTC-based alternatives. Despite the robustness of negative training, clean samples that have identical distributions to the test data are still subjected to complementary labels. Consequently, both clean and potentially noisy samples contribute equally to the final performance. A combination of positive and negative training strategies in a unified framework could remedy the abundance of noisy samples; however, this needs further investigation.
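The negative-training objective summarized above can be sketched as follows, assuming a standard softmax classifier: a complementary label is drawn uniformly from all classes except the (possibly noisy) target, and the model minimizes -log(1 - p_complementary). The function name and toy logits are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def negative_training_loss(logits, labels):
    """Negative training: sample a complementary label (any class except
    the given, possibly noisy, label) and minimize -log(1 - p_comp)."""
    num_classes = logits.size(1)
    # Shift by a random offset in [1, K-1] so comp_labels != labels.
    offset = torch.randint(1, num_classes, labels.shape)
    comp_labels = (labels + offset) % num_classes
    probs = F.softmax(logits, dim=1)
    p_comp = probs.gather(1, comp_labels.unsqueeze(1)).squeeze(1)
    return -torch.log(1.0 - p_comp + 1e-8).mean()

# Toy batch: two instances, three classes.
logits = torch.tensor([[2.0, 0.1, -1.0], [0.3, 1.5, 0.2]])
labels = torch.tensor([0, 1])
loss = negative_training_loss(logits, labels)
```

Because a true label is unlikely to be sampled as the complementary label, the gradient signal carries less noise than standard (positive) cross-entropy on pseudo-labels.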

Figure 1 :
Figure 1: An example of the proposed framework. D_l, D_u, and D'_u denote the labeled set, the unlabeled set, and the filtered unlabeled set, respectively. Briefly, RNT consists of three key steps: (1) training with PT on limited labeled texts and then iteratively predicting the unlabeled texts as their pseudo-labels; (2) measuring the evidential support based on the learned embedding space of PT to estimate the degree of noise; (3) training with NT on the mixture of clean and filtered data.

Figure 2 :
Figure 2: A histogram of the PT, NT, and RNT training data distributions on the AG NEWS dataset with 30% random label noise, in which blue represents the clean data and orange indicates the noisy data. SelNT further trains the model with only the samples whose NT confidence exceeds 1/K, where K denotes the number of classes.

Figure 3 :
Figure 3: A comparison between the PT and NT techniques trained on the AG NEWS dataset corrupted with 30% random symmetric noise. The accuracy of PT on the clean Dev data increases in the early stage. However, overfitting to the noisy training examples results in gradually degrading performance on the clean Dev data.

Figure 4 :
Figure 4: An illustrative example of the evidential support. The instances x^u_j and x^l_i exhibit a similar L_cos value in response to class 0 (i.e., f_3(0, 0.2)). The approximate PT confidence of 0.9, represented by f'_1(0, 0.9), further strengthens this similarity. Consequently, the instance x^u_j is considered less noisy due to the higher degree of evidential support it receives.

Figure 5 :
Figure 5: Ranking evaluation on the Dev sets with N_l = 1k. The ranked Dev set is first split into 10 equally sized proportions. Then, each proportion is inferred (i.e., its accuracy is calculated) independently. The accuracy gradually drops as the noisy texts increase. In our experiment, we choose the proportions whose instances meet λ (Section 4.2) for further NT training.
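The per-proportion accuracy computation described in this caption can be sketched as below; the function name and toy correctness flags are illustrative assumptions, with the cleanest (highest-ranked) instances listed first.

```python
import numpy as np

def proportion_accuracy(ranked_correct, n_bins=10):
    """Split a ranked list of per-instance correctness flags (1 = correct)
    into equally sized proportions and report accuracy per proportion."""
    bins = np.array_split(np.asarray(ranked_correct), n_bins)
    return [float(b.mean()) for b in bins]

# Toy ranking of 100 instances: accuracy should decay down the ranking.
flags = [1] * 40 + [1, 0] * 20 + [0] * 20
print(proportion_accuracy(flags))
# → [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.0, 0.0]
```

A decaying curve of this shape is what motivates keeping only the top-ranked proportions for NT training.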


Table 2 :
Comparative results against the state-of-the-art alternatives with 30 examples per label and N_l ∈ {1k, 10k}.

Table 3 :
Comparative results against the state-of-the-art alternatives with 30 samples per label and N_l = 10% of the labeled texts. Note that all results are the average of three runs with different seeds.

Table 4 :
Comparative results on the Chinese datasets based on initial weights from RoBERTa-Large.

Table 5 :
The effect of DST on the performance of RNT. All variants are jointly trained on D_l and D_u using PT and NT. RNT Pure is trained on all instances in D_u without any filtering mechanism, while RNT PT-conf uses the PT-based confidence to filter out the instances in D_u that do not meet a predefined threshold (i.e., 0.9 in our experiments).

Table 6 :
Filtering evaluation on the Dev sets. Prop, Acc, and F1 denote the proportion (i.e., the ratio of filtered texts), accuracy, and Macro-F1, respectively.

Table 7 :
The impact of noise filtering on the overall performance. Note that the number of labeled instances is set to N_l = 1k. Removing noise ranking from RNT leads to a noticeable performance drop; however, it still performs better than BERT fine-tuned on the labeled data and achieves competitive scores comparable to S2TC-BDD.

Table 8 :
Denoising evaluation with N_l = 10k and 30% noisy instances. Full Dev denotes the performance on the clean and noisy Dev sets. Denoising indicates the ability of RNT to identify the clean instances.