Abstract
Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation. It provides high-quality data at a low annotation cost for label-scarce text classification. However, existing CSAL methods overlook weak classes and hard representative examples, resulting in biased learning. To address these issues, this paper proposes a novel dual-diversity enhancing and uncertainty-aware (Deuce) framework for CSAL. Specifically, Deuce leverages a pretrained language model (PLM) to efficiently extract textual representations, class predictions, and predictive uncertainty. Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both textual diversity and class diversity, ensuring a balanced data distribution. It further propagates uncertainty information via density-based clustering to select hard representative instances. By combining dual-diversity and informativeness, Deuce performs well in selecting class-balanced and hard representative data. Experiments on six NLP datasets demonstrate the superiority and efficiency of Deuce.
1 Introduction
Cold-start active learning (CSAL; Yuan et al., 2020a; Zhang et al., 2022b) has gained much attention for efficiently labeling large corpora from zero. Given an unlabeled corpus (i.e., the “cold-start” stage), it aims to acquire a small subset (seed set) for annotation. Such absence of labels can happen due to data privacy concerns (Holzinger, 2016; Li et al., 2023), limited domain experts1 (Wu et al., 2022), labeling difficulty (Herde et al., 2021), quick expiration of labels (Yuan et al., 2020b; Zhang et al., 2021), etc. In real-world tasks with specialized domains (e.g., medical report classification with rare diseases; De Angeli et al., 2021), the complete absence of labels and lack of a posteriori knowledge pose challenges to CSAL.
While active learning (AL) has been studied for a wide range of NLP tasks (Zhang et al., 2022b), the cold-start problem has hardly been addressed. At the cold-start stage, the model is untrained and no labeled data are available for validation. Traditional CSAL applies random sampling (Ash et al., 2020; Margatina et al., 2021), diversity sampling (Yu et al., 2019; Chang et al., 2021), or uncertainty sampling (Schröder et al., 2022). However, random sampling suffers from high variance (Rudolph et al., 2023); diversity sampling is prone to easy examples and vector space noise (Eklund and Forsman, 2022); and uncertainty sampling is prone to redundant examples, outliers, and unreliable metrics (Wójcik et al., 2022). Moreover, existing methods ignore class diversity, where the sampling bias often results in class imbalance (Krishnan et al., 2021). At worst, the missed cluster effect (Schütze et al., 2006; Yu et al., 2019) can happen, i.e., clusters of weak classes are neglected. Tomanek et al. (2009) showed that an unrepresentative seed set gives rise to this effect. Learning is misguided if it starts unfavorably.
The key challenge for CSAL lies in how to acquire a diverse and informative seed set. As a general heuristic (Dasgupta, 2011), a proper seed set should strike a balance between exploring the input space for instance regions (e.g., diversity sampling) and exploiting the version space for decision boundaries (e.g., uncertainty sampling). Such hybrid CSAL strategies have been proposed based on combinations of neighbor-awareness (Hacohen et al., 2022; Su et al., 2023; Yu et al., 2023), clustering (Yuan et al., 2020a; Agarwal et al., 2021; Müller et al., 2022; Brangbour et al., 2022; Shnarch et al., 2022; Yu et al., 2023), and uncertainty estimation (Dligach and Palmer, 2011; Yuan et al., 2020a; Müller et al., 2022; Yu et al., 2023). However, existing methods fail to explore the label space to enhance class diversity and mitigate imbalance. Moreover, most methods perform diversity sampling followed by uncertainty sampling, treating both aspects in isolation.
To address these challenges, this paper presents Deuce, a dual-diversity enhancing and uncertainty-aware framework for CSAL. It adopts a graph-based hybrid strategy to enhance diversity and informativeness. Different from previous works, Deuce not only emphasizes the diversity in textual contents (textual diversity), but also the diversity in class predictions (class diversity). This is termed dual-diversity in this paper. To achieve this in the cold-start stage, it exploits the rich representational and predictive capabilities of PLMs. For informativeness, the predictive uncertainty is estimated from a one-vs-all (OVA) perspective. This helps mine informative “hard examples” for learning. Then, Deuce further employs manifold learning techniques (McInnes et al., 2020) to derive dual-diversity information. This results in the novel construction of a Dual-Neighbor Graph (DNG). Finally, Deuce performs density-based uncertainty propagation and Farthest Point Sampling (FPS) on the DNG. While propagation prioritizes representatively uncertain (RU) instances, FPS enhances the dual-diversity. Overall, Deuce ensures a more diverse and informative acquisition.
The merits of Deuce are attributed to the following contributions:
The dual-diversity enhancing and uncertainty-aware (Deuce) framework adopts a novel hybrid acquisition strategy. It effectively selects class-balanced and hard representative instances, achieving a good balance between exploration and exploitation in CSAL.
This paper proposes a graph-based dual-diversity enhancement mechanism to select diverse instances with textual diversity and class diversity, tackling class imbalance in CSAL.
This paper presents an embedding-based uncertainty-aware prediction mechanism to effectively select hard representative instances according to predictive uncertainty.
2 Related Work
2.1 Cold-start Active Learning (CSAL)
According to the taxonomy of Zhang et al. (2022b), CSAL research for NLP can be categorized as informativeness-based, representativeness-based, and hybrid. As most methods are hybrid, the techniques and challenges for informativeness or representativeness are elucidated below.
2.1.1 Informativeness
Uncertainty.
The main metric for informativeness in CSAL is uncertainty, as it is more tractable in cold-start stages than others (e.g., gradients). High predictive uncertainty indicates difficulty for the model, thus valuable for annotation. Most existing methods use language models (LMs) for estimation. Common estimators include entropy (Zhu et al., 2008; Yu et al., 2023), LM probability (Dligach and Palmer, 2011), LM loss (Yuan et al., 2020a), and probability margin (Müller et al., 2022). However, several challenges exist in uncertainty estimation: (a) Often, a closed-world assumption is imposed. In other words, predictions are normalized such that they sum to 1. This hinders the expression of uncertainty, as it forces mapping to one of the known classes, ignoring options such as “none of the above” (Padhy et al., 2020). (b) PLMs suffer from overconfidence (Park and Caragea, 2022; Wang, 2024). This requires calibration for more robust uncertainty estimation (Yu et al., 2023). (c) Task information is hardly considered. As a result, the uncertainty will not be related to the downstream task (output uncertainty), but rather its intrinsic perplexity (input uncertainty) (Jiang et al., 2021). Patron (Yu et al., 2023) uses task-related prompts to tackle this issue.
2.1.2 Representativeness
Density.
To avoid outliers, density-based CSAL methods prefer “typical” instances. The method of Zhu et al. (2008) and TypiClust (Hacohen et al., 2022) prioritize instances with high kNN density. Uncertainty propagation (Yu et al., 2023) is also useful in aggregating density information. A typical group of uncertain examples indicates a region where the model’s knowledge is lacking.
Discriminative.
Some CSAL methods acquire sequentially or iteratively. They thus discriminate, i.e., prefer an instance if it differs the most from selected ones. Coreset selection (Sener and Savarese, 2018) selects an instance (cover-point) such that its minimum distance to selected instances is maximized. vote-k (Su et al., 2023) adopts a greedy approach to select remote instances on a kNN graph.
Batch Diversity.
It is more efficient to acquire in batch mode (Settles, 2009), i.e., to select multiple instances at each step. Clustering has been a common technique to enhance batch diversity and avoid redundancy in CSAL. It helps structure the unlabeled dataset by grouping similar instances together. Nguyen and Smeulders (2004) and Kang et al. (2004) first proposed pre-clustering the input space to select representatives from each cluster. Dasgupta and Ng (2009) used spectral clustering on the similarity matrix of documents. Hu et al. (2010) and Yu et al. (2019) used hierarchical clustering to stabilize the process. Zhu et al. (2008) and more recent works (Yuan et al., 2020a; Chang et al., 2021; Agarwal et al., 2021; Müller et al., 2022; Hacohen et al., 2022; Yu et al., 2023) have commonly used k-means for its simplicity and efficiency. However, these clustering methods can be sensitive to outliers. Moreover, clustering in the input space only contributes to textual diversity, regardless of other aspects.
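For reference, the common cluster-then-select pattern used by several of these methods (k-means over embeddings, then the instance closest to each centroid) can be sketched as follows; the embedding source and library choice are illustrative:

```python
# Sketch of the cluster-then-select pattern: run k-means with one cluster per
# budget slot, then pick the instance nearest to each centroid.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_representatives(X: np.ndarray, budget: int, seed: int = 0) -> list:
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(X)
    picks = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(d)]))
    return picks
```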
2.2 Missed Cluster Effect
The missed cluster effect (Schütze et al., 2006; Tomanek et al., 2009) is an extreme case of class imbalance. It refers to when an AL strategy neglects certain classes (or clusters within classes). Schütze et al. (2006) first recognized the missed cluster effect in the context of text classification. They suggested more use of domain knowledge. Knowledge extraction from PLMs is in harmony with this suggestion. Dligach and Palmer (2011) proposed an uncertainty-based approach to avoid the missed cluster effect in word sense disambiguation (WSD). However, it is based on task-agnostic LM probability. Marcheggiani and Artières (2014) showed that labeling relevant instances, which reduces the labeling noise, also helps mitigate the missed cluster effect. Label calibration aligns with this finding. While many works are devoted to addressing the missed cluster effect or general class imbalance (e.g., Aggarwal et al., 2020; Fairstein et al., 2024) for general AL, they often rely on a labeled subset. Class diversity enhancement would help mitigate class imbalance issues, but it remains an open question for CSAL.
3 Methodology
In this section, the methodology of the proposed Deuce is introduced. Section 3.1 first defines CSAL and declares the notations for the rest of this paper. The framework of Deuce is then elaborated in Section 3.2.
3.1 Problem Formulation
This paper considers CSAL in a pool-based manner. Learning is initiated with a pool of N unlabeled documents. A C-way text classification task is defined by a set of C candidate classes.
Given a labeling budget b ≪ N, a CSAL strategy acquires a subset of fixed size b, such that the labeled subset boosts performance most when used as a training seed set. The performance is evaluated by fine-tuning a PLM ℳθ on the labeled seed set and testing its accuracy.
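For concreteness, the setting can be written in the following form; the symbols below are chosen here for illustration and may differ from the paper's original notation:

```latex
% Pool-based CSAL (illustrative notation).
% Unlabeled pool and label domain:
\mathcal{D}_u = \{x_i\}_{i=1}^{N}, \qquad \mathcal{Y} = \{y_j\}_{j=1}^{C}.
% Acquire a seed set of size b that maximizes accuracy after fine-tuning:
\mathcal{Q}^{*} = \operatorname*{arg\,max}_{\mathcal{Q} \subseteq \mathcal{D}_u,\; |\mathcal{Q}| = b}
  \operatorname{Acc}\!\left(\mathcal{M}_{\theta} \;\middle|\; \{(x, y(x)) : x \in \mathcal{Q}\}\right),
\qquad b \ll N.
```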
3.2 The Deuce Framework
The proposed Deuce framework is illustrated in Figure 1. Overall, the components of Deuce serve the same goal—to produce a seed set with high dual-diversity and informativeness.
3.2.1 Embedding Module
In CSAL, data selection starts with only an unlabeled corpus. Deuce leverages PLM embeddings, which guide the selection process towards more diverse and informative samples.
Specifically, the embedding module implements a prompt-based, verbalizer-free approach (Jiang et al., 2022). This requires only a single inference pass per document.
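As a rough illustration of what a single-pass, prompt-based, verbalizer-free extraction can look like (the template wording, pooling choice, and model below are assumptions for this sketch, not Deuce's exact design):

```python
# Sketch: one forward pass yields both a textual embedding (hidden state at the
# mask position) and a predictive embedding (MLM logits at the same position).
# Template, pooling, and model choice are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def embed(document: str):
    # Assumes the prompt (including the <mask> token) fits within the model's max length.
    prompt = f'This sentence : "{document}" means <mask> .'
    inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        out = mlm(**inputs, output_hidden_states=True)
    mask_pos = int((inputs["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0])
    textual = out.hidden_states[-1][0, mask_pos]   # textual embedding
    predictive = out.logits[0, mask_pos]           # predictive embedding (vocabulary logits)
    return textual, predictive
```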
Textual and Predictive Embedding.
Class Embedding.
3.2.2 Prediction Module
This module aims to produce uncertainty-aware labels. With class information, Deuce gains prior knowledge about potential data distributions. With uncertainty information, Deuce is informed of potential labeling gain.
Label Vector.
Predictive Uncertainty.
In CSAL, uncertainty represents the difficulty of an instance. Deuce adapts entropy, a common measure of uncertainty (§2.1.1).
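The paper's exact estimator follows its OVA formulation (§1); as a hedged illustration, one way to turn per-class one-vs-all scores into an entropy-style uncertainty, without the closed-world softmax normalization criticized in §2.1.1, is:

```python
# Illustrative OVA-style entropy: per-class sigmoid probabilities are treated as
# independent binary predictions, and their binary entropies are averaged.
# This is a sketch, not necessarily Deuce's exact estimator.
import numpy as np

def ova_entropy(class_scores: np.ndarray) -> float:
    """class_scores: shape (C,), raw similarity/logit score for each class."""
    p = 1.0 / (1.0 + np.exp(-class_scores))          # one-vs-all probabilities
    p = np.clip(p, 1e-12, 1 - 1e-12)
    h = -(p * np.log(p) + (1 - p) * np.log(1 - p))   # binary entropy per class
    return float(h.mean())                           # average over classes
```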
3.2.3 Dual-Neighbor Graph (DNG) Module
Graphs serve as a powerful tool for data selection by explicitly modeling data interrelationship. This enables the propagation of valuable information (e.g., uncertainty) and the selection of more diverse samples. To integrate textual and class diversity, Deuce leverages manifold learning techniques (McInnes et al., 2020) on k-Nearest-Neighbor (kNN) graphs of both spaces.2
kNN Graph.
The use of kNN arises from the neighborhood perspective of diversity. Deuce aims to avoid selecting neighboring instances. In a kNN graph, an instance xi is connected with its k nearest neighbors {xij} under some distance function Δ(·,·). Formally, the two metric spaces of kNN are defined as follows.
The textual space is defined by textual embeddings under cosine distance;
The label space is defined by label vectors under ℓ1 distance.
A kNN graph is built from each of the two spaces, referred to as the textual kNN graph and the label kNN graph, respectively.
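A minimal sketch of building the two kNN structures with off-the-shelf tooling (the value k = 500 follows §4.1; the data layout and library choice are illustrative):

```python
# Sketch: kNN neighborhoods in the textual space (cosine distance) and the
# label space (L1 distance), returned as (distances, indices) arrays.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph(X: np.ndarray, k: int, metric: str):
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(X)
    dist, idx = nn.kneighbors(X)       # position 0 is each point itself
    return dist[:, 1:], idx[:, 1:]     # drop self-neighbors

# text_emb: (N, d) textual embeddings; label_vec: (N, C) label vectors
# d_text, i_text = knn_graph(text_emb, k=500, metric="cosine")
# d_label, i_label = knn_graph(label_vec, k=500, metric="l1")
```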
Graph Normalization.
To unify textual and class diversity, Deuce merges the two kNN graphs into one for graph-based sampling. However, across two distinct spaces, it is necessary to first normalize the distances (McInnes et al., 2020).
Symmetrization.
To identify representative instances, Deuce performs graph clustering. This requires symmetric kNN graphs.
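A sketch of UMAP-style normalization and symmetrization (McInnes et al., 2020); the per-point scale σi is simplified here, whereas the original implementation finds it by binary search:

```python
# Sketch: normalize kNN distances into edge memberships and symmetrize them.
# rho_i is the distance to the nearest neighbor; sigma_i is a per-point scale
# (simplified to the mean neighbor distance here, an assumption for brevity).
import numpy as np
from scipy.sparse import csr_matrix

def normalize_knn(dist: np.ndarray, idx: np.ndarray, n: int) -> csr_matrix:
    rho = dist[:, 0:1]                                # distance to closest neighbor
    sigma = dist.mean(axis=1, keepdims=True) + 1e-12  # simplified scale
    w = np.exp(-np.maximum(dist - rho, 0.0) / sigma)  # edge memberships in (0, 1]
    rows = np.repeat(np.arange(n), dist.shape[1])
    g = csr_matrix((w.ravel(), (rows, idx.ravel())), shape=(n, n))
    # Symmetrize with the probabilistic t-conorm: w_ij + w_ji - w_ij * w_ji.
    return g + g.T - g.multiply(g.T)

# e.g., g_text = normalize_knn(d_text, i_text, n=len(d_text)) using the kNN sketch above
```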
Merging.
It is now appropriate to merge the two kNN graphs. This unifies textual and class diversity in one graph.
As merged, the DNG is an undirected graph whose edge set is the union of the edges in the two kNN graphs. The edges are divided into two types:
Single-neighbor edges appear in only one of the two kNN graphs;
Dual-neighbor edges appear in both kNN graphs. They connect neighboring documents which are similar in both textual semantics and class predictions.
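A sketch of the edge-level bookkeeping implied above; how the edge weights from the two graphs are combined is left to the paper, and only the single-/dual-neighbor split is shown:

```python
# Sketch: merge the textual and label kNN graphs and split the union of edges
# into single-neighbor and dual-neighbor edges.
def edge_set(idx) -> set:
    # idx: (N, k) neighbor indices; undirected edges stored as sorted pairs.
    return {tuple(sorted((i, int(j)))) for i, row in enumerate(idx) for j in row}

def merge_graphs(i_text, i_label):
    e_text, e_label = edge_set(i_text), edge_set(i_label)
    dual = e_text & e_label             # edges present in both spaces
    single = (e_text | e_label) - dual  # edges present in exactly one space
    return single, dual

# e.g., single, dual = merge_graphs(i_text, i_label) with indices from the kNN sketch above
```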
3.2.4 Acquisition Module
Deuce adopts a hybrid acquisition strategy. Overall, the goal is to produce a diverse and informative seed set. To achieve this, the acquisition module performs graph clustering, propagation, and traversal on DNG.
hdbscan*.
A group of similar documents with high predictive uncertainty indicates an area where the model’s knowledge is lacking. By labeling one of the documents, the model predictions can be improved for similar ones in the area. Therefore, it is valuable to identify and prioritize such representatively uncertain (RU) groups for CSAL.
Clustering has been a common technique to group similar instances (§2.1.2). However, traditional clustering methods (e.g., k-means) are ill-suited, as the number of RU groups is unknown. Moreover, they force every instance into a cluster, while some instances may not belong to any RU group. Instead, Deuce adopts density-based clustering, which identifies RU groups with a sufficient density (≥ kr similar documents).
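A minimal sketch of this step, assuming a precomputed distance (or dissimilarity) matrix derived from the DNG and the min-cluster-size kr = 3 from §4.1; graph-specific details are omitted:

```python
# Sketch: identify representatively uncertain (RU) groups with HDBSCAN*.
# Points that do not belong to any sufficiently dense group receive label -1,
# i.e., they are left unclustered rather than forced into a cluster.
import numpy as np
import hdbscan

def cluster_ru_groups(distance_matrix: np.ndarray, k_r: int = 3) -> np.ndarray:
    clusterer = hdbscan.HDBSCAN(min_cluster_size=k_r, metric="precomputed")
    return clusterer.fit_predict(distance_matrix.astype(np.float64))
```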
Uncertainty Propagation.
FPS.
The whole process is described in Algorithm 1.
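Algorithm 1 is not reproduced here; for reference, classical farthest point sampling over a precomputed distance matrix looks as follows (Deuce's graph-based, uncertainty-aware variant differs in how candidates are prioritized):

```python
# Sketch of classical farthest point sampling (FPS): greedily pick the point
# whose minimum distance to the already selected set is largest.
import numpy as np

def farthest_point_sampling(dist: np.ndarray, budget: int, start: int = 0) -> list:
    selected = [start]
    min_dist = dist[start].copy()           # distance to the nearest selected point
    for _ in range(budget - 1):
        nxt = int(np.argmax(min_dist))      # farthest from the current selection
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return selected
```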
4 Experiments and Results
4.1 Experimental Setup
Datasets.
Deuce is evaluated on six text classification datasets: IMDb (Maas et al., 2011), Yelpfull (Meng et al., 2019), AG’s News (Zhang et al., 2015), Yahoo! Answers (Zhang et al., 2015), DBpedia (Lehmann et al., 2015), and TREC (Li and Roth, 2002). Dataset statistics are shown in Table 1. All the datasets used in the experiments are publicly accessible. The original labels are removed to create a cold-start scenario.
| Dataset | Source domain | Target domain | #Class C | #Unlabeled N | #Test | Label names yj |
|---|---|---|---|---|---|---|
| IMDb | Movie review | Sentiment | 2 | 25,000 | 25,000 | Negative, Positive |
| Yelpfull | Review | Rating | 5 | 38,352 | 10,000 | 1 star, 2 stars, 3 stars, 4 stars, 5 stars |
| AG’s News | News | Category | 4 | 120,000 | 7,600 | World, Sports, Business, Sci/Tech |
| Yahoo! Answers | Web Q&A | Category | 10 | 300,000† | 60,000 | Society & Culture, Science & Mathematics, Health, Education & Reference, Computers & Internet, Sports, Business & Finance, Entertainment & Music, Family & Relationships, Politics & Government |
| DBpedia | Wikipedia lead section | Category | 14 | 420,000† | 70,000 | Company, Educational institution, Artist, Athlete, Office holder, Mean of transportation, Building, Natural place, Village, Animal, Plant, Album, Film, Written work |
| TREC | Question | Category | 6 | 5,452 | 500 | ‡ Abbreviation, Entity, Description and abstract concept, Human being, Location, Numeric value |
Evaluation Metric.
To evaluate the performance of the acquired seed set, it is labeled and used for fine-tuning the PLM. The original labels of the seed set are revealed. The accuracy of the fine-tuned PLM on the test set is then reported. To be consistent with previous methods (Yu et al., 2023), the experiments adopt RoBERTa-base (Liu et al., 2019) as the backbone PLM.
Analysis Metrics.
Implementation Details.
The fine-tuning setup and hyperparameters are the same as Patron’s (Yu et al., 2023). Notably, the experiment code transplants the original implementation of graph normalization (McInnes et al., 2018) to GPU for acceleration. For Deuce, k = 500, kr = 3, and γ = 1.0 are used. All experiments are run on a machine with a single NVIDIA A800 GPU with 80 GB of VRAM.
Baselines.
The following CSAL baseline methods are considered:
Random sampling selects uniformly.
Entropy-based uncertainty sampling (revisited by Schröder et al., 2022) selects data with the highest predictive entropy.
Coreset selection (Sener and Savarese, 2018) iteratively selects data whose minimum distance to the selected data is maximized.
alps (Yuan et al., 2020a) computes surprisal embeddings from BERT loss as uncertainty. They are then clustered with k-means. Data closest to each centroid are selected.
few-selector (Chang et al., 2021) clusters the text embeddings with k-means.
TypiClust (Hacohen et al., 2022) clusters the text embeddings with k-means, and selects data with the highest typicality, i.e., kNN density, from each cluster.
Patron (Yu et al., 2023) clusters the text embeddings with k-means, and selects from each cluster data with the highest propagated uncertainty. It then iteratively updates the set to refine inter-sample distances.
vote-k (Su et al., 2023) iteratively assigns high scores to instances that are far from the already selected data.
Comparisons of the CSAL baselines and Deuce are presented in Table 2.
| Method | Informativeness: Uncertainty | Representativeness: Density | Representativeness: Textual diversity | Representativeness: Class diversity |
|---|---|---|---|---|
| Random | ✗ | ✗ | ✗ | ✗ |
| Entropy | ✓ | ✗ | ✗ | ✗ |
| Coreset | ✗ | ✗ | ✓ | ✗ |
| alps | ✓ | ✗ | ✓ | ✗ |
| few-s. | ✗ | ✗ | ✓ | ✗ |
| TypiCl. | ✗ | ✓ | ✓ | ✗ |
| Patron | ✓ | ✓ | ✓ | ✗ |
| vote-k | ✗ | ✓ | ✓ | ✗ |
| Deuce | ✓ | ✓ | ✓ | ✓ |
4.2 Accuracy Improvement
The main quantitative results of PLM fine-tuning performance with Deuce and baseline CSAL methods are shown in Table 3. Results for baselines other than vote-k are from Yu et al. (2023). To report the standard deviation, each setup is repeated with 10 different random seeds. Figure 2 shows a qualitative visualization of the b = 128 seed set from the IMDb dataset, acquired by the latest baseline method, vote-k, and the proposed Deuce. The t-SNE (van der Maaten and Hinton, 2008) method is used for visualization.
| Dataset | b | Random | Entropy | Coreset | alps | few-s. | TypiCl. | Patron | vote-k | Deuce |
|---|---|---|---|---|---|---|---|---|---|---|
| IMDb | 32 | 80.2±2.5 | 81.9±2.7 | 74.5±2.9 | 82.2±3.0 | 79.2±1.6 | 82.8±2.2 | 85.5±1.5 | 85.6±1.8 | 86.9±0.9 |
| | 64 | 82.6±1.4 | 84.7±1.5 | 82.8±2.5 | 86.1±0.9 | 84.9±1.5 | 84.0±0.9 | 87.3±1.0 | 88.0±1.2 | 88.5±0.7 |
| | 128 | 86.6±1.7 | 87.1±0.7 | 87.8±0.8 | 87.5±0.8 | 88.5±1.6 | 88.1±1.4 | 89.6±0.4 | 89.1±0.7 | 90.0±0.3 |
| Yelpfull | 32 | 30.2±4.5 | 32.7±1.0 | 32.9±2.8 | 36.8±1.8 | 35.2±1.0 | 32.6±1.5 | 35.9±1.6 | 40.1±2.2 | 42.6±1.1 |
| | 64 | 42.5±1.7 | 36.8±2.1 | 39.9±3.4 | 40.3±2.6 | 39.3±1.0 | 39.7±1.8 | 44.4±1.1 | 49.3±1.6 | 49.8±1.2 |
| | 128 | 47.7±2.1 | 41.3±1.9 | 49.4±1.6 | 45.1±1.0 | 46.4±1.3 | 46.8±1.6 | 51.2±0.8 | 50.8±1.5 | 53.4±0.7 |
| AG’s News | 32 | 73.7±4.6 | 73.7±3.0 | 78.6±1.6 | 78.4±2.3 | 79.1±2.7 | 80.7±1.8 | 83.2±0.9 | 81.8±1.3 | 83.7±0.8 |
| | 64 | 80.0±2.5 | 80.0±2.2 | 82.0±1.5 | 82.6±2.5 | 82.4±2.0 | 83.0±2.4 | 85.3±0.7 | 84.7±1.3 | 86.3±0.6 |
| | 128 | 84.5±1.7 | 82.5±0.8 | 85.2±0.6 | 84.3±1.7 | 85.6±0.8 | 85.7±0.3 | 87.0±0.6 | 86.2±1.2 | 87.5±0.4 |
| Yahoo! Answers | 32 | 43.5±4.0 | 23.0±1.6 | 22.0±2.3 | 47.7±2.3 | 46.8±2.1 | 36.9±1.8 | 56.8±1.0 | 54.5±1.6 | 58.0±1.5 |
| | 64 | 53.1±3.1 | 37.6±2.0 | 45.7±3.7 | 55.3±1.8 | 52.9±1.6 | 54.0±1.6 | 61.9±0.7 | 60.8±1.4 | 62.8±1.3 |
| | 128 | 60.2±1.5 | 41.8±1.9 | 56.9±2.5 | 60.8±1.9 | 61.3±1.0 | 58.2±1.5 | 65.1±0.6 | 64.3±0.9 | 66.2±0.9 |
| DBpedia | 32 | 67.1±3.2 | 18.9±2.4 | 64.0±2.8 | 77.5±4.0 | 83.3±1.0 | 78.2±1.8 | 85.3±0.9 | 78.1±2.6 | 86.0±1.7 |
| | 64 | 86.2±2.4 | 37.5±3.0 | 85.2±0.8 | 89.7±1.1 | 92.7±0.9 | 88.5±0.7 | 93.6±0.4 | 92.7±1.3 | 94.1±0.9 |
| | 128 | 95.0±1.5 | 47.5±2.3 | 89.4±1.5 | 95.7±0.4 | 96.5±0.5 | 95.7±0.6 | 97.0±0.2 | 96.4±0.4 | 97.3±0.3 |
| TREC | 32 | 49.0±3.5 | 46.6±1.4 | 47.1±3.6 | 60.5±3.7 | 60.3±1.5 | 42.0±4.4 | 64.0±1.2 | 57.6±2.9 | 70.2±1.7 |
| | 64 | 69.1±3.4 | 59.8±4.2 | 75.7±3.0 | 73.0±2.0 | 77.3±2.0 | 72.6±2.1 | 78.6±1.6 | 81.8±3.1 | 82.2±1.5 |
| | 128 | 85.6±2.8 | 75.0±1.8 | 87.6±3.0 | 87.3±3.6 | 87.7±1.5 | 83.0±3.8 | 91.1±0.8 | 89.7±2.6 | 92.1±0.8 |
| Average | 32 | 57.2±3.8 | 46.1±2.1 | 53.2±2.7 | 63.9±3.0 | 64.0±1.8 | 58.9±2.5 | 68.4±1.2 | 66.3±2.1 | 71.2±1.3 |
| | 64 | 68.9±2.5 | 56.1±2.7 | 68.5±2.7 | 71.2±1.9 | 71.6±1.6 | 70.3±1.7 | 75.2±1.0 | 76.2±1.8 | 77.3±1.1 |
| | 128 | 76.6±1.9 | 62.5±1.7 | 76.1±1.9 | 76.8±1.9 | 77.6±1.2 | 76.3±1.9 | 80.2±0.6 | 79.4±1.4 | 81.1±0.6 |
From the results in Table 3, it can be seen that Deuce consistently outperforms the other baselines, achieving up to a 2.5% gain on balanced datasets and up to 6.2% on the imbalanced dataset, TREC. Deuce mainly benefits from enhancing class diversity as well as textual diversity. This can be concluded from the larger improvements on TREC. In over half of the setups, Deuce also achieves the lowest standard deviation. In addition, Deuce improves most when b is small. This aligns with the fundamental goal of AL, which is to maximize performance gains with minimal labeled data. Furthermore, from the visualization in Figure 2, it can be seen that Deuce’s enhancement of dual-diversity leads to a broader and more balanced coverage of both the input space and the label space. As Deuce adopts a highest-uncertainty strategy, such coverage also exhibits high predictive uncertainty, thus including more “hard examples” which are valuable for annotation.
4.3 Enhancement of Class Diversity
To verify the enhancement of class diversity, the class imbalance value (Yu et al., 2023) under b = 128 is reported in Table 4.
| Dataset | Random | Entropy | Coreset | alps | few-s. | TypiCl. | Patron | vote-k | Deuce |
|---|---|---|---|---|---|---|---|---|---|
| IMDb | 1.207 | 6.111 | 1.000 | 1.783 | 1.286 | 2.765 | 1.286 | 1.065 | 1.169 |
| Yelpfull | 1.778 | 3.800 | 6.000 | 2.833 | 2.000 | 5.200 | 2.250 | 1.273 | 1.450 |
| AG’s News | 1.462 | 28.000 | 2.000 | 1.667 | 1.500 | 1.818 | 1.500 | 2.200 | 1.133 |
| Yahoo! Answers | 3.000 | 12.000 | 7.000 | 5.500 | 2.250 | 3.333 | 5.500 | 3.333 | 2.125 |
| DBpedia | 3.500 | ∞ | 9.000 | 9.000 | 3.500 | 9.000 | 2.333 | 2.800 | 3.250 |
| TREC | 8.000 | 16.000 | ∞ | 9.500 | 10.500 | 21.000 | 15.000 | 11.333 | 6.000 |
| Harmonic avg. | 2.128 | 9.863 | 3.124 | 3.138 | 2.166 | 3.839 | 2.338 | 2.052 | 1.779 |
From Table 4, it can be seen that Deuce achieves the lowest average IMB value. This indicates that Deuce enhances class diversity properly. In contrast, an IMB value of ∞ emerges for the purely uncertainty-based (Entropy) and purely textual-diversity-based (Coreset) methods. This indicates that the missed cluster effect happens in their acquisition.
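Assuming the IMB value is the ratio of the largest to the smallest per-class count in the acquired seed set (following the description of Yu et al., 2023; a class with zero acquired instances then yields ∞), it can be computed as:

```python
# Sketch of the class-imbalance (IMB) value under the assumed definition:
# ratio of the most to the least frequent gold class among acquired instances.
from collections import Counter
import math

def imbalance_value(seed_labels, num_classes: int) -> float:
    counts = Counter(seed_labels)
    per_class = [counts.get(c, 0) for c in range(num_classes)]
    return math.inf if min(per_class) == 0 else max(per_class) / min(per_class)
```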
4.4 Enhancement of Textual Diversity
To measure the textual diversity of seed sets, the textual-diversity value (Ein-Dor et al., 2020; Yu et al., 2023) under b = 128 is reported in Table 5.
| Dataset | Random | Entropy | Coreset | alps | few-s. | TypiCl. | Patron | vote-k | Deuce |
|---|---|---|---|---|---|---|---|---|---|
| IMDb | 0.646 | 0.647 | 0.643 | 0.647 | 0.687 | 0.648 | 0.684 | 0.669 | 0.670 |
| Yelpfull | 0.645 | 0.626 | 0.456 | 0.680 | 0.685 | 0.677 | 0.685 | 0.657 | 0.679 |
| AG’s News | 0.354 | 0.295 | 0.340 | 0.385 | 0.436 | 0.376 | 0.423 | 0.370 | 0.448 |
| Yahoo! Answers | 0.430 | 0.375 | 0.400 | 0.441 | 0.470 | 0.438 | 0.486 | 0.451 | 0.491 |
| DBpedia | 0.402 | 0.316 | 0.381 | 0.420 | 0.461 | 0.399 | 0.459 | 0.434 | 0.476 |
| TREC | 0.301 | 0.298 | 0.298 | 0.339 | 0.337 | 0.326 | 0.338 | 0.346 | 0.353 |
| Average | 0.463 | 0.426 | 0.420 | 0.485 | 0.513 | 0.477 | 0.512 | 0.488 | 0.520 |
Table 5 shows that Deuce also achieves the highest average textual-diversity value. This indicates that Deuce also enhances textual diversity properly. The improvement in textual-diversity value is less pronounced than that in IMB value (Table 4). This signals that, compared to other baselines, Deuce enhances class diversity more than textual diversity. Such a difference can be explained by the highest-uncertainty-candidate strategy, which acquires more information from the label space.
4.5 Quality of Textual Embedding
To analyze the quality of Deuce’s prompt-based, unsupervised text embeddings (§3.2.1), they are compared with the supervised Sentence Transformer embeddings (Sentence Transformers, 2024) used in vote-k (Su et al., 2023). Correlations are computed between the cosine similarities of all possible document pairs under the two embedding models.3 Results on three datasets are reported in Table 6.
| Dataset | Pearson correlation r | Spearman correlation ρ |
|---|---|---|
| IMDb | 0.1651 | 0.1636 |
| w/ denoising | 0.1980 | 0.1889 |
| Yelpfull | 0.1424 | 0.1440 |
| w/ denoising | 0.3072 | 0.2984 |
| TREC | 0.4271 | 0.4000 |
| w/ denoising | 0.4662 | 0.4368 |
From Table 6, a weak positive correlation is observed. Moreover, template denoising produces better embeddings, as it removes the biases from raw embeddings. Overall, the quality of textual embeddings is acceptable and adequate for cold-start acquisition.
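A sketch of the comparison protocol described in this subsection: pairwise cosine similarities under the two embedding models are correlated with Pearson r and Spearman ρ. The pair subsampling below is an efficiency shortcut added for illustration, not part of the original protocol:

```python
# Sketch: correlation between pairwise cosine similarities of two embedding
# matrices A and B whose rows are aligned to the same documents.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def similarity_correlation(A: np.ndarray, B: np.ndarray, max_pairs: int = 100_000, seed: int = 0):
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    i = rng.integers(0, n, size=max_pairs)
    j = rng.integers(0, n, size=max_pairs)
    i, j = i[i != j], j[i != j]
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    sim_a = np.einsum("ij,ij->i", An[i], An[j])   # cosine similarities under model A
    sim_b = np.einsum("ij,ij->i", Bn[i], Bn[j])   # cosine similarities under model B
    return pearsonr(sim_a, sim_b)[0], spearmanr(sim_a, sim_b)[0]
```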
4.6 Quality of Class Prediction
To analyze the quality of the embedding-based class predictions (§3.2.2), they are compared with gold labels. As uncertainty indicates unstable predictions, the predictions are arranged from the most confident (lowest ui) to the least. Results are demonstrated in Figure 3.
From Figure 3, a high accuracy of class predictions is consistently observed with high confidence and with denoised embeddings, and vice versa. This demonstrates the good quality of e.d.f. predictions and the derived uncertainty metric.
5 Discussion
5.1 Comparison with LLM-based Methods
The landscape of NLP is rapidly evolving with generative large language models (LLMs). This section evaluates two potential LLM-based alternatives to Deuce: serialization for acquisition and zero-shot Chain-of-Thought prompting. The following experiments are conducted with Llama 2 7B (Touvron et al., 2023).
5.1.1 Serialization for Acquisition
Due to resource constraints, LoRA (Hu et al., 2022) is used for fine-tuning, with r = α = 64. Results are reported in Table 7. Despite utilizing a mid-sized PLM, Deuce outperforms serialization with the LLM on most datasets. The decision process of the LLM is also a black box. In contrast, Deuce adopts graphs to explicitly capture the interplay of information, offering better interpretability.
| Method | b | IMDb | Yelpfull | AG’s News | Yahoo! | DBpedia | TREC | Average |
|---|---|---|---|---|---|---|---|---|
| Serialization | 32 | 81.7 | 44.5 | 25.2 | 38.8 | 62.6 | 28.4 | 46.9 |
| | 64 | 83.8 | 51.2 | 53.4 | 55.7 | 45.9 | 27.8 | 53.0 |
| | 128 | 89.6 | 56.9 | 83.7 | 63.4 | 58.6 | 35.6 | 64.6 |
| Deuce | 32 | 86.9 | 42.6 | 83.7 | 58.0 | 86.0 | 70.2 | 71.2 |
| | 64 | 88.5 | 49.8 | 86.3 | 62.8 | 94.1 | 82.2 | 77.3 |
| | 128 | 90.0 | 53.4 | 87.5 | 66.2 | 97.3 | 92.1 | 81.1 |
5.1.2 Zero-shot Chain-of-Thought
Zero-shot Chain-of-Thought (CoT) prompting (Kojima et al., 2022) with LLMs has emerged as a promising method in cold-start scenarios. This paper tests zero-shot CoT without and with explicit choices in prompts. The temperature of generation is set to 0, and a maximum of 256 tokens are generated. Results are shown in Table 8. From the results, fine-tuning the PLM with Deuce still outperforms 0-shot LLM predictions. On class-imbalanced and difficult datasets, the performance gaps are larger. Inspecting failure cases shows that the LLM failed to output a final answer within 256 tokens for many test instances.
| Method | IMDb | Yelpfull | AG’s News | Yahoo! | DBpedia | TREC | Average |
|---|---|---|---|---|---|---|---|
| 0-shot CoT, w/o choices | 63.6 | 9.2 | 34.7 | 23.7 | 37.1 | 12.6 | 32.0 |
| 0-shot CoT, w/ choices | 72.1 | 25.4 | 60.2 | 43.6 | 32.3 | 24.2 | 43.0 |
| Deuce, b = 32 | 86.9 | 42.6 | 83.7 | 58.0 | 86.0 | 70.2 | 71.2 |
In addition, the average total GPU and CPU energy consumption and time usage are measured using Alizadeh and Castor’s (2024) method. Results are reported in Table 9. There is a 7.82× difference in energy consumption and a 6.26× difference in time consumption. While increasing the number of output tokens might improve accuracy, the added resource consumption cannot be neglected. Deuce provides an efficient solution for low-resource scenarios.
| Stage | 0-shot CoT: Energy (kJ) | 0-shot CoT: Time (sec) | Deuce: Energy (kJ) | Deuce: Time (sec) |
|---|---|---|---|---|
| Acquisition | – | – | 59.82 | 81.00 |
| Fine-tuning | – | – | 225.77 | 208.89 |
| Prediction | 2561.58 | 1967.23 | 41.99 | 24.27 |
| Total | 2561.58 | 1967.23 | 327.58 | 314.16 |
5.2 Effect of Labeling Noise
Real-world annotations often involve noise. Northcutt et al. (2021) estimated an average of 2.6% labeling errors across 3 commonly used NLP datasets. To evaluate Deuce under labeling noise, experiments with artificial errors are conducted. As the gold labels may already contain around 3% errors, 7% of seed labels are randomly replaced by wrong labels. The final sets are expected to exhibit an error level of 4–10%. Results are reported in Table 10.
| Deuce | b | IMDb | Yelpfull | AG’s News | Yahoo! | DBpedia | TREC | Average |
|---|---|---|---|---|---|---|---|---|
| w/o noise | 32 | 86.9±0.9 | 42.6±1.1 | 83.7±0.8 | 58.0±1.5 | 86.0±1.7 | 70.2±1.7 | 71.2±1.3 |
| | 64 | 88.5±0.7 | 49.8±1.2 | 86.3±0.6 | 62.8±1.3 | 94.1±0.9 | 82.2±1.5 | 77.2±1.1 |
| | 128 | 90.0±0.3 | 53.4±0.7 | 87.5±0.4 | 66.2±0.9 | 97.3±0.3 | 92.1±0.8 | 81.1±0.6 |
| w/ noise | 32 | 67.8±4.3 | 38.7±3.0 | 72.5±1.0 | 49.7±7.2 | 61.5±2.0 | 69.6±0.6 | 60.0±1.5 |
| | 64 | 83.4±1.3 | 41.0±2.7 | 82.6±1.4 | 53.4±2.7 | 87.5±3.3 | 78.7±3.3 | 71.1±1.1 |
| | 128 | 82.9±6.3 | 45.1±1.7 | 84.7±2.4 | 62.7±1.3 | 89.2±3.7 | 82.5±3.8 | 74.5±1.5 |
From the results, a decrease in accuracy and an increase in standard deviation occur as expected. However, Deuce still outperforms 0-shot CoT (Table 8) in nearly all setups, despite the added noise. This shows the robustness of Deuce-based fine-tuning to labeling noise.
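The noise-injection protocol above (randomly replacing 7% of the seed labels with a different class) can be sketched as follows; the seeding and class encoding are illustrative:

```python
# Sketch: replace a fixed fraction of seed labels with a uniformly chosen
# different class, as in the labeling-noise experiment above.
import random

def inject_label_noise(labels, num_classes: int, noise_rate: float = 0.07, seed: int = 0):
    rng = random.Random(seed)
    noisy = list(labels)
    flip = rng.sample(range(len(noisy)), k=round(noise_rate * len(noisy)))
    for i in flip:
        noisy[i] = rng.choice([c for c in range(num_classes) if c != noisy[i]])
    return noisy
```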
5.3 Effect of Class Prediction Failure
For real-world cold-start tasks, the knowledge about classes might not be well exploited by the PLM. In the worst case, the PLM can fail to generate meaningful class predictions. To simulate this scenario, ablation experiments with random class predictions are conducted. In this setup, the predictive embeddings are replaced with random vectors. This ablates class predictions. Results are reported in Table 11.
| Method | b | IMDb | Yelpfull | AG’s News | Yahoo! | DBpedia | TREC | Average |
|---|---|---|---|---|---|---|---|---|
| Coreset | 32 | 74.5±2.9 | 32.9±2.8 | 78.6±1.6 | 22.0±2.3 | 64.0±2.8 | 47.1±3.6 | 53.2±2.7 |
| | 64 | 82.8±2.5 | 39.9±3.4 | 82.0±1.5 | 45.7±3.7 | 85.2±0.8 | 75.7±3.0 | 68.5±2.7 |
| | 128 | 87.8±0.8 | 49.4±1.6 | 85.2±0.6 | 56.9±2.5 | 89.4±1.5 | 87.6±3.0 | 76.1±1.9 |
| Deuce w/ rand. pred. | 32 | 83.3±4.1 | 44.1±0.7 | 83.4±2.0 | 52.3±3.9 | 63.2±1.1 | 64.9±3.9 | 65.2±1.2 |
| | 64 | 85.9±4.5 | 48.0±0.3 | 84.6±1.2 | 60.0±0.6 | 82.9±1.7 | 78.2±2.0 | 73.3±0.9 |
| | 128 | 86.6±2.5 | 49.5±0.4 | 87.2±0.4 | 63.4±1.3 | 96.8±0.1 | 86.8±1.3 | 78.4±0.5 |
As the class and uncertainty information is scrambled, Deuce degenerates to textual diversity alone, and performance degradation occurs as expected. Nonetheless, Deuce still outperforms Coreset selection (Sener and Savarese, 2018), a CSAL baseline which also purely utilizes textual diversity. This demonstrates Deuce’s effectiveness in real-world cold-start scenarios.
5.4 Performance of Few-shot Math Reasoning
Deuce has the potential to generalize to other NLP tasks. To demonstrate this, Deuce is tested on GSM8K (Cobbe et al., 2021), a dataset of math word problems. However, directly adapting RoBERTa to solving math problems is difficult due to its masked modeling nature. Instead, Deuce is applied with RoBERTa to produce a seed set.4 Then, the seeds are taken as examples for few-shot Chain-of-Thought prompting (Wei et al., 2022) with Llama 2 7B. As reported in Table 12, Deuce remains effective in few-shot math problem solving, compared to random sampling.
6 Conclusion
This paper presents Deuce, a dual-diversity enhancing and uncertainty-aware CSAL framework via a prompt-based and graph-based approach. Different from previous works, it emphasizes dual-diversity (i.e., textual diversity and class diversity) to ensure a balanced acquisition. This is achieved by the novel construction of Dual-Neighbor Graph (DNG) and Farthest Point Sampling (FPS). DNG leverages the kNN graph structure of textual space and label space from a PLM. In addition, Deuce prioritizes hard representative examples, so as to ensure an informative acquisition. This leverages density-based clustering and uncertainty propagation on the DNG. Experiments show the effectiveness of Deuce’s dual-diversity enhancement and uncertainty-aware mechanism. It offers an efficient solution for low-resource data acquisition. Overall, Deuce’s hybrid strategy strikes an important balance between exploration and exploitation in CSAL.
Limitations
Backbone LM.
Deuce leverages a discriminative PLM. However, state-of-the-art PLMs are primarily generative. Generative embedding models (e.g., Jiang et al., 2023) or adaptations (Yang et al., 2019; Gong et al., 2019; Zhang et al., 2022a) can be investigated and combined with Deuce. For such approaches, their quality and efficiency should be carefully considered.
External Knowledge.
In Deuce, the only source of external knowledge is the language model. Incorporation of more domain knowledge, if possible, can improve the performance in the cold-start stage. As Deuce adopts a prompt-based and graph-based acquisition, prompt engineering and knowledge graphs (Pan et al., 2024) can be investigated.
Acknowledgments
We extend our gratitude to our action editor, Sebastian Padó, and the anonymous reviewers for their constructive comments. We also thank Tianjun Li and Jiangfeng Liu for their helpful feedback on the initial drafts.
This work was funded in part by the National Natural Science Foundation of China grant under number 62222603, in part by the STI2030-Major Projects grant from the Ministry of Science and Technology of the People’s Republic of China under number 2021ZD0200700, in part by the Key-Area Research and Development Program of Guangdong Province under number 2023B0303030001, in part by the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (2019ZT08X214), and in part by the Science and Technology Program of Guangzhou under number 2024A04J6310.
Notes
It is worth noting that Deuce does not utilize or optimize any Graph Neural Network (GNN). With the rich representational capability of PLMs, Deuce does not require GNNs to learn data representations.
Semantic similarity benchmarks (e.g., STS) cannot be used here, as the prompt Tx requires a task-specific class domain.
For open questions like math problems, there are no concepts of “classes”. Instead, the predictive embeddings are clustered with hdbscan*. The cluster centroids are taken as meta-class embeddings.
References
Author notes
Action Editor: Sebastian Padó