Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation. It provides high-quality data at a low annotation cost for label-scarce text classification. However, existing CSAL methods overlook weak classes and hard representative examples, resulting in biased learning. To address these issues, this paper proposes a novel dual-diversity enhancing and uncertainty-aware (Deuce) framework for CSAL. Specifically, Deuce leverages a pretrained language model (PLM) to efficiently extract textual representations, class predictions, and predictive uncertainty. Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both textual diversity and class diversity, ensuring a balanced data distribution. It further propagates uncertainty information via density-based clustering to select hard representative instances. Deuce performs well in selecting class-balanced and hard representative data by jointly leveraging dual-diversity and informativeness. Experiments on six NLP datasets demonstrate the superiority and efficiency of Deuce.

Cold-start active learning (CSAL; Yuan et al., 2020a; Zhang et al., 2022b) has gained much attention for efficiently labeling large corpora from zero. Given an unlabeled corpus (i.e., the “cold-start” stage), it aims to acquire a small subset (seed set) for annotation. Such absence of labels can happen due to data privacy concerns (Holzinger, 2016; Li et al., 2023), limited domain experts1 (Wu et al., 2022), labeling difficulty (Herde et al., 2021), quick expiration of labels (Yuan et al., 2020b; Zhang et al., 2021), etc. In real-world tasks with specialized domains (e.g., medical report classification with rare diseases; De Angeli et al., 2021), the complete absence of labels and lack of a posteriori knowledge pose challenges to CSAL.

While active learning (AL) has been studied for a wide range of NLP tasks (Zhang et al., 2022b), the cold-start problem has hardly been addressed. At the cold-start stage, the model is untrained and no labeled data are available for validation. Traditional CSAL applies random sampling (Ash et al., 2020; Margatina et al., 2021), diversity sampling (Yu et al., 2019; Chang et al., 2021), or uncertainty sampling (Schröder et al., 2022). However, random sampling suffers from high variance (Rudolph et al., 2023); diversity sampling is prone to easy examples and vector space noise (Eklund and Forsman, 2022); and uncertainty sampling is prone to redundant examples, outliers, and unreliable metrics (Wójcik et al., 2022). Moreover, existing methods ignore class diversity, where the sampling bias often results in class imbalance (Krishnan et al., 2021). At worst, the missed cluster effect (Schütze et al., 2006; Yu et al., 2019) can happen, i.e., clusters of weak classes are neglected. Tomanek et al. (2009) showed that an unrepresentative seed set gives rise to this effect. Learning is misguided if started unfavorably.

The key challenge for CSAL lies in how to acquire a diverse and informative seed set. As a general heuristic (Dasgupta, 2011), a proper seed set should strike a balance between exploring the input space for instance regions (e.g., diversity sampling) and exploiting the version space for decision boundaries (e.g., uncertainty sampling). Such hybrid CSAL strategies have been proposed based on combinations of neighbor-awareness (Hacohen et al., 2022; Su et al., 2023; Yu et al., 2023), clustering (Yuan et al., 2020a; Agarwal et al., 2021; Müller et al., 2022; Brangbour et al., 2022; Shnarch et al., 2022; Yu et al., 2023), and uncertainty estimation (Dligach and Palmer, 2011; Yuan et al., 2020a; Müller et al., 2022; Yu et al., 2023). However, existing methods fail to explore the label space to enhance class diversity and mitigate imbalance. Moreover, most methods perform diversity sampling followed by uncertainty sampling, treating both aspects in isolation.

To address these challenges, this paper presents Deuce, a dual-diversity enhancing and uncertainty-aware framework for CSAL. It adopts a graph-based hybrid strategy to enhance diversity and informativeness. Different from previous works, Deuce emphasizes not only the diversity in textual contents (textual diversity) but also the diversity in class predictions (class diversity). This is termed dual-diversity in this paper. To achieve this in the cold-start stage, it exploits the rich representational and predictive capabilities of PLMs. For informativeness, the predictive uncertainty is estimated from a one-vs-all (OVA) perspective. This helps mine informative “hard examples” for learning. Then, Deuce further employs manifold learning techniques (McInnes et al., 2020) to derive dual-diversity information. This results in the novel construction of a Dual-Neighbor Graph (DNG). Finally, Deuce performs density-based uncertainty propagation and Farthest Point Sampling (FPS) on the DNG. While propagation prioritizes representatively uncertain (RU) instances, FPS enhances the dual-diversity.

The merits of Deuce are attributed to the following contributions:

  • The dual-diversity enhancing and uncertainty-aware (Deuce) framework adopts a novel hybrid acquisition strategy. It effectively selects class-balanced and hard representative instances, achieving a good balance between exploration and exploitation in CSAL.

  • This paper proposes a graph-based dual-diversity enhancement mechanism to select diverse instances with textual diversity and class diversity, tackling class imbalance in CSAL.

  • This paper presents an embedding-based uncertainty-aware prediction mechanism to effectively select hard representative instances according to predictive uncertainty.

2.1 Cold-start Active Learning (CSAL)

According to the taxonomy of Zhang et al. (2022b), CSAL research for NLP can be categorized as informativeness-based, representativeness-based, and hybrid. As most methods are hybrid, the techniques and challenges for informativeness or representativeness are elucidated below.

2.1.1 Informativeness

Uncertainty.

The main metric for informativeness in CSAL is uncertainty, as it is more tractable in cold-start stages than others (e.g., gradients). High predictive uncertainty indicates difficulty for the model, thus valuable for annotation. Most existing methods use language models (LMs) for estimation. Common estimators include entropy (Zhu et al., 2008; Yu et al., 2023), LM probability (Dligach and Palmer, 2011), LM loss (Yuan et al., 2020a), and probability margin (Müller et al., 2022). However, several challenges exist in uncertainty estimation: (a) Often, a closed-world assumption is imposed. In other words, predictions are normalized such that they sum to 1. This hinders the expression of uncertainty, as it forces mapping to one of the known classes, ignoring options such as “none of the above” (Padhy et al., 2020). (b) PLMs suffer from overconfidence (Park and Caragea, 2022; Wang, 2024). This requires calibration for more robust uncertainty estimation (Yu et al., 2023). (c) Task information is hardly considered. As a result, the uncertainty will not be related to the downstream task (output uncertainty), but rather its intrinsic perplexity (input uncertainty) (Jiang et al., 2021). Patron (Yu et al., 2023) uses task-related prompts to tackle this issue.

2.1.2 Representativeness

Density.

To avoid outliers, density-based CSAL methods prefer “typical” instances. The method of Zhu et al. (2008) and TypiClust (Hacohen et al., 2022) prioritize instances with high kNN density. Uncertainty propagation (Yu et al., 2023) is also useful in aggregating density information. A typical group of uncertain examples indicates a region where the model’s knowledge is lacking.

Discriminative.

Some CSAL methods acquire sequentially or iteratively. They thus discriminate, i.e., prefer an instance if it differs the most from selected ones. Coreset selection (Sener and Savarese, 2018) selects an instance (cover-point) such that its minimum distance to selected instances is maximized. vote-k (Su et al., 2023) adopts a greedy approach to select remote instances on a kNN graph.

Batch Diversity.

It is more efficient to acquire in batch mode (Settles, 2009), i.e., to select multiple instances at each step. Clustering has been a common technique to enhance batch diversity and avoid redundancy in CSAL. It helps structure the unlabeled dataset by grouping similar instances together. Nguyen and Smeulders (2004) and Kang et al. (2004) first proposed pre-clustering the input space to select representatives from each cluster. Dasgupta and Ng (2009) used spectral clustering on the similarity matrix of documents. Hu et al. (2010) and Yu et al. (2019) used hierarchical clustering to stabilize the process. Zhu et al. (2008) and more recent works (Yuan et al., 2020a; Chang et al., 2021; Agarwal et al., 2021; Müller et al., 2022; Hacohen et al., 2022; Yu et al., 2023) have commonly used k-means for its simplicity and efficiency. However, these clustering methods can be sensitive to outliers. Moreover, clustering in the input space only contributes to textual diversity, regardless of other aspects.

2.2 Missed Cluster Effect

The missed cluster effect (Schütze et al., 2006; Tomanek et al., 2009) is an extreme case of class imbalance. It refers to when an AL strategy neglects certain classes (or clusters within classes). Schütze et al. (2006) first recognized the missed cluster effect in the context of text classification. They suggested more use of domain knowledge. Knowledge extraction from PLMs is in harmony with this suggestion. Dligach and Palmer (2011) proposed an uncertainty-based approach to avoid the missed cluster effect in word sense disambiguation (WSD). However, it is based on task-agnostic LM probability. Marcheggiani and Artières (2014) showed that labeling relevant instances, which reduces the labeling noise, also helps mitigate the missed cluster effect. Label calibration aligns with this finding. While many works are devoted to addressing the missed cluster effect or general class imbalance (e.g., Aggarwal et al., 2020; Fairstein et al., 2024) for general AL, they often rely on a labeled subset. Class diversity enhancement would help mitigate class imbalance issues, but it remains an open question for CSAL.

In this section, the methodology of the proposed Deuce is introduced. Section 3.1 first defines CSAL and declares the notations for the rest of this paper. The framework of Deuce is then elaborated in Section 3.2.

3.1 Problem Formulation

This paper considers CSAL in a pool-based manner. Learning is initiated with a set of $N$ unlabeled documents, $\mathcal{X} = \{x_i\}_{i=1}^{N}$. A $C$-way text classification task is defined by a set of classes $\mathcal{Y} = \{y_j\}_{j=1}^{C}$ taking values in a domain $\mathbb{Y}$.

Given a labeling budget $b \ll N$, a CSAL strategy acquires a subset $\mathcal{X}_s \subset \mathcal{X}$ of fixed size $|\mathcal{X}_s| = b$, such that the labeled subset $\mathcal{X}_s$ yields the greatest performance gain when used as a training seed set. The performance is evaluated by fine-tuning a PLM $\mathcal{M}_\theta$ with $\mathcal{X}_s$ and testing its accuracy.

3.2 The Deuce Framework

The proposed Deuce framework is illustrated in Figure 1. Overall, the components of Deuce serve the same goal—to produce a seed set with high dual-diversity and informativeness.

Figure 1: The proposed Deuce framework.

3.2.1 Embedding Module

In CSAL, data selection starts with only an unlabeled corpus. Deuce leverages PLM embeddings, which guide the selection process towards more diverse and informative samples.

Specifically, the embedding module implements a prompt-based, verbalizer-free approach (Jiang et al., 2022). This requires only a single inference pass per document.

Textual and Predictive Embedding.
In a masked PLM, the bidirectional semantics can be condensed into a [MASK] token. In light of this, Deuce extends Jiang et al. (2022)’s template with double [MASK] tokens:
where [domain] is the target domain $\mathbb{Y}$, such as “sentiment”. The hidden representations of the [MASK] tokens are extracted as the textual embedding $\mathbf{z}_{x_i}$ and the predictive embedding $\mathbf{z}_{\hat{y}|x_i}$. They capture the intrinsic and task-related semantics, respectively.

However, raw embeddings suffer from template bias and length bias (Miao et al., 2023). Deuce further applies template denoising (Jiang et al., 2022) to obtain the denoised embeddings $\tilde{\mathbf{z}}$.
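For illustration, a minimal sketch of this embedding step is given below, assuming a PromptBERT-style template with two mask slots (the exact template $T_x$ is not reproduced in this excerpt) and omitting template denoising. The template wording and the `embed` helper are illustrative assumptions.

```python
# Hypothetical sketch of the embedding module: extract textual and predictive
# embeddings from the hidden states of two mask slots in a prompt template.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base", output_hidden_states=True)
model.eval()

def embed(text: str, domain: str = "sentiment"):
    # Assumed template with two mask slots: one for intrinsic semantics
    # (textual embedding), one for the task-related prediction (predictive embedding).
    mask = tokenizer.mask_token
    prompt = f'This sentence : " {text} " means {mask} and its {domain} is {mask} .'
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[-1][0]            # (seq_len, dim)
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().squeeze(-1)
    z_text, z_pred = hidden[mask_pos[0]], hidden[mask_pos[1]]    # textual / predictive
    return z_text, z_pred
```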

Class Embedding.
Predictions need to be paired with the known classes. Class embeddings $\tilde{\mathbf{z}}_{y_j}$ are generated from a prompt template $T_y$, similar to $T_x$:
where [Y] is the placeholder for a class $y_j$.

3.2.2 Prediction Module

This module aims to produce uncertainty-aware labels. With class information, Deuce gains prior knowledge about potential data distributions. With uncertainty information, Deuce is informed of potential labeling gain.

Label Vector.
For better uncertainty estimation, Deuce adopts an OVA setup, such that labels $\hat{\mathbf{y}}_i$ do not necessarily sum to 1. First, it computes the inner product $\omega_{ij}$ for each pair of predictive and class embeddings:
Ideally, the similarity $\omega_{ij}$ could be linearly transformed to the class label $\hat{y}_{ij}$. However, high anisotropy (Gao et al., 2019) was observed in preliminary experiments. As a result, $\omega_{ij}$ has a non-uniform distribution over $[-1, 1]$. To tackle this issue, Deuce uses the empirical distribution function (e.d.f.) of $\Omega$ to give a calibrated estimate of the labels $\hat{Y}$:
where $\mathbb{1}[\cdot]$ is the indicator function. This gives $\hat{y}_{ij} \sim U(0,1)$ regardless of the embedding distribution.
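For illustration, a minimal sketch of the e.d.f.-based calibration is given below, assuming the e.d.f. is taken over all $N \times C$ inner products in $\Omega$ (the exact formula is not reproduced in this excerpt).

```python
# Hypothetical sketch of OVA label calibration via the empirical distribution
# function (e.d.f.) of the similarity matrix Omega.
import numpy as np

def calibrate_labels(Z_pred: np.ndarray, Z_cls: np.ndarray) -> np.ndarray:
    """Z_pred: (N, d) predictive embeddings; Z_cls: (C, d) class embeddings."""
    omega = Z_pred @ Z_cls.T                  # (N, C) inner products
    flat = omega.ravel()
    # e.d.f.: fraction of all similarities that are <= omega_ij, which maps
    # scores to an approximately uniform [0, 1] scale.
    ranks = flat.argsort().argsort()          # 0 .. N*C-1
    y_hat = (ranks + 1) / flat.size
    return y_hat.reshape(omega.shape)         # OVA labels need not sum to 1
```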
Predictive Uncertainty.

In CSAL, uncertainty represents the difficulty of an instance. Deuce adapts entropy, a common measure of uncertainty (§2.1.1).

In information theory, entropy is the expected self-information $I$ of possible events. In an OVA setup, the possible events $\{E_i\}$ are “$x_i$ has a high predictive score for exactly one class”. The probability of event $E_i$ is given by Wójcik et al. (2022):
Therefore, Deuce adopts the entropy from $\{E_i\}$ as the uncertainty estimate $u$:
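For illustration, a minimal sketch of this uncertainty estimate is given below. It instantiates the event formulation described above (exactly one class is “on”); the exact closed form used by Deuce and by Wójcik et al. (2022) is not reproduced in this excerpt, so the normalization here is an assumption.

```python
# Hypothetical sketch of the OVA predictive uncertainty: entropy over the
# normalized probabilities of "exactly one class is on" events.
import numpy as np

def ova_uncertainty(y_hat: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """y_hat: (N, C) calibrated OVA scores in [0, 1]; returns (N,) uncertainties."""
    one_minus = 1.0 - y_hat + eps
    total = np.prod(one_minus, axis=1, keepdims=True)        # prod over all classes
    p_events = y_hat * total / one_minus                      # y_j * prod_{l != j}(1 - y_l)
    p_events = p_events / (p_events.sum(axis=1, keepdims=True) + eps)
    return -(p_events * np.log(p_events + eps)).sum(axis=1)   # entropy per instance
```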

3.2.3 Dual-Neighbor Graph (DNG) Module

Graphs serve as a powerful tool for data selection by explicitly modeling data interrelationship. This enables the propagation of valuable information (e.g., uncertainty) and the selection of more diverse samples. To integrate textual and class diversity, Deuce leverages manifold learning techniques (McInnes et al., 2020) on k-Nearest-Neighbor (kNN) graphs of both spaces.2

kNN Graph.

The use of kNN arises from the neighborhood perspective of diversity. Deuce aims to avoid selecting neighboring instances. In a kNN graph, an instance $x_i$ is connected with its $k$ nearest neighbors $\{x_{i_j}\}$ under some distance function $\Delta(\cdot,\cdot)$. Formally, the two metric spaces of kNN are defined as follows.

  • The textual space $(\mathcal{X}, \Delta_{\tilde{z}})$ is defined by textual embeddings under cosine distance, $\Delta_{\tilde{z}}(x_i, x_j) = \frac{1}{\pi}\arccos\left(\tilde{\mathbf{z}}_{x_i} \cdot \tilde{\mathbf{z}}_{x_j}\right)$;

  • The label space $(\mathcal{X}, \Delta_{\hat{y}})$ is defined by label vectors under $\ell_1$ distance, $\Delta_{\hat{y}}(x_i, x_j) = \|\hat{\mathbf{y}}_i - \hat{\mathbf{y}}_j\|_1$.

The kNN graph from each space is denoted by $G_{\tilde{z}}$ and $G_{\hat{y}}$, respectively.

Graph Normalization.

To unify textual and class diversity, Deuce merges the two kNN graphs into one for graph-based sampling. However, across two distinct spaces, it is necessary to first normalize the distances (McInnes et al., 2020).

To ease notation, this part omits the subscript and writes $G \in \{G_{\tilde{z}}, G_{\hat{y}}\}$. For each $x_i$, Deuce finds a normalization factor $\tau_i > 0$ that satisfies the equation
where $\rho_i$ denotes $x_i$’s distance to its nearest neighbor. The weights $\tilde{w}$ of the normalized (directed) kNN graph $G$, denoted by $\tilde{G}$, are defined by
After normalization, the original kNN weights $w \in [0, \infty)$ are transformed to $\tilde{w} \in (0, 1]$.
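For reference, the normalization presumably follows UMAP’s smooth-kNN construction (McInnes et al., 2020); a sketch of the assumed constraint and normalized weight is shown below (the exact constants used by Deuce are not reproduced in this excerpt).

```latex
% Assumed UMAP-style normalization (McInnes et al., 2020).
\sum_{j=1}^{k} \exp\!\left(-\frac{\max\bigl(0,\, \Delta(x_i, x_{i_j}) - \rho_i\bigr)}{\tau_i}\right) = \log_2 k,
\qquad
\tilde{w}(x_i, x_{i_j}) = \exp\!\left(-\frac{\max\bigl(0,\, \Delta(x_i, x_{i_j}) - \rho_i\bigr)}{\tau_i}\right).
```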
Symmetrization.

To identify representative instances, Deuce performs graph clustering. This requires symmetric kNN graphs.

Let $\tilde{W}$ denote the sparse weight matrix of $\tilde{G}$. Since the weights $\tilde{w} \in [0, 1]$, they can be interpreted as fuzzy neighborhood memberships. Hence, symmetrizing $\tilde{W}$ is equivalent to finding the fuzzy union (Dubois and Prade, 1982) of the neighbors $\tilde{W}$ and the reverse neighbors $\tilde{W}^{\top}$:
where $\odot$ is the Hadamard product. $\tilde{W}_{\mathrm{sym}}$ defines the weights of the symmetric kNN graph $\tilde{G}_{\mathrm{sym}}$. Its edges are denoted by $\tilde{E}_{\mathrm{sym}}$.
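For reference, the probabilistic-sum form of the fuzzy union, consistent with the Hadamard-product formulation above, is:

```latex
% Fuzzy union (probabilistic sum) of neighbor and reverse-neighbor weights.
\tilde{W}_{\mathrm{sym}} \;=\; \tilde{W} + \tilde{W}^{\top} - \tilde{W} \odot \tilde{W}^{\top}.
```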
Merging.

It is now appropriate to merge the two kNN graphs. This unifies textual and class diversity in one graph.

As merged, the DNG is an undirected graph $G_{\mathrm{dual}} = (V, E, w_{\mathrm{dual}})$. The edges $E$ are the union of the edges in $\tilde{G}_{\tilde{z},\mathrm{sym}}$ and $\tilde{G}_{\hat{y},\mathrm{sym}}$. Moreover, $E$ is divided into two types:

  • $E_1$ represents edges that appear in only one of the two kNN graphs, called single-neighbor edges;

  • $E_2$ represents edges that appear in both kNN graphs, called dual-neighbor edges. They connect neighboring documents which are similar in both textual semantics and class predictions.

The weight $w_{\mathrm{dual}}$ of an undirected edge $\langle x_i, x_j \rangle \in E$ is thereby defined as
where $\gamma$ is a threshold to distinguish dual-neighbor edges $E_2$ from single-neighbor edges $E_1$. In essence, DNG assigns greater weights to dual-neighbor edges. As a result, during the subsequent graph clustering and traversal, Deuce can avoid selecting textual and class neighbors.
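For illustration, a minimal sketch of the merging step is given below. It follows the edge-set definitions above (union of edges; intersection marks dual-neighbor edges), but the exact weight formula $w_{\mathrm{dual}}$ and the role of $\gamma$ are not reproduced in this excerpt, so the simple up-weighting of dual-neighbor edges here is an assumption.

```python
# Hypothetical sketch of DNG construction: merge the edge sets of the two
# symmetrized kNN graphs, distinguishing dual-neighbor edges (in both graphs)
# from single-neighbor edges (in only one graph).
def build_dng(W_text: dict, W_label: dict, gamma: float = 1.0) -> dict:
    """W_text, W_label: {(i, j): weight} with i < j, from the two kNN graphs."""
    edges = set(W_text) | set(W_label)                    # E = union of edge sets
    w_dual = {}
    for e in edges:
        if e in W_text and e in W_label:                  # dual-neighbor edge (E2)
            w_dual[e] = gamma + W_text[e] + W_label[e]    # assumed up-weighting
        else:                                             # single-neighbor edge (E1)
            w_dual[e] = W_text.get(e, 0.0) + W_label.get(e, 0.0)
    return w_dual
```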

3.2.4 Acquisition Module

Deuce adopts a hybrid acquisition strategy. Overall, the goal is to produce a diverse and informative seed set. To achieve this, the acquisition module performs graph clustering, propagation, and traversal on DNG.

hdbscan*.

A group of similar documents with high predictive uncertainty indicates an area where the model’s knowledge is lacking. By labeling one of the documents, the model predictions can be improved for similar ones in the area. Therefore, it is valuable to identify and prioritize such representatively uncertain (RU) groups for CSAL.

Clustering has been a common technique to group similar instances (§2.1.2). However, traditional clustering methods (e.g., k-means) are ill-suited, as the number of RU groups is unknown. Moreover, they force every instance into a cluster, while some instances may not belong to any RU group. Instead, Deuce adopts density-based clustering, which identifies RU groups with a sufficient density ($\geq k_r$ similar documents).

Specifically, Deuce applies hdbscan* (Campello et al., 2013, 2015) on the DNG, with minimum cluster size $k_r$. A document $x_i$ is either (a) clustered in an RU group $c_l$ with membership $p_i$, or (b) excluded as a non-RU outlier.
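For illustration, a minimal sketch of this clustering step using the `hdbscan` package is given below. How Deuce derives a distance matrix from the DNG weights is not reproduced in this excerpt, so the weight-to-distance conversion here is an assumption.

```python
# Hypothetical sketch: density-based clustering of the DNG with HDBSCAN*,
# using a dense precomputed distance matrix derived from DNG edge weights.
import numpy as np
import hdbscan

def cluster_dng(w_dual: dict, n: int, k_r: int = 3):
    dist = np.full((n, n), 10.0)                       # large distance for non-neighbors
    np.fill_diagonal(dist, 0.0)
    for (i, j), w in w_dual.items():
        dist[i, j] = dist[j, i] = 1.0 / (w + 1e-12)    # higher weight -> closer
    clusterer = hdbscan.HDBSCAN(min_cluster_size=k_r, metric="precomputed")
    labels = clusterer.fit_predict(dist)               # -1 marks non-RU outliers
    return labels, clusterer.probabilities_            # RU group c_l and membership p_i
```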

Uncertainty Propagation.
To prioritize RU documents, uncertainty information (§3.2.2) is propagated and aggregated in RU groups. This is formulated as a single step of message propagation:
FPS.
The final acquisition adopts a combination of diversity sampling and uncertainty sampling. First, Deuce runs Farthest Point Sampling (FPS; Eldar et al., 1994) on the DNG. As the result depends only on the initial point, FPS is started from the documents $x_i$ with the top-$k$ degrees. Each produces a candidate seed set $\mathcal{X}_c^{(i)}$, which contains $b$ dually diverse samples. Finally, Deuce chooses the candidate with the highest propagated uncertainty:

The whole process is described in Algorithm 1.
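For illustration, a minimal sketch of the acquisition step is given below. The propagation formula is not reproduced in this excerpt, so a membership-weighted cluster average stands in for it, and a pairwise distance matrix stands in for distances on the DNG; the function names are illustrative.

```python
# Hypothetical sketch of the acquisition module: propagate uncertainty within
# RU clusters, then run FPS from several high-degree start points and keep the
# candidate seed set with the highest propagated uncertainty.
import numpy as np

def propagate(u, labels, probs):
    u_prop = u.copy()
    for c in set(labels) - {-1}:                       # skip non-RU outliers
        idx = np.where(labels == c)[0]
        u_prop[idx] = np.average(u[idx], weights=probs[idx] + 1e-12)
    return u_prop

def fps(dist, start, b):
    selected = [start]
    d = dist[start].copy()
    for _ in range(b - 1):
        nxt = int(d.argmax())                          # farthest from the selected set
        selected.append(nxt)
        d = np.minimum(d, dist[nxt])
    return selected

def acquire(dist, degrees, u_prop, b, top_k=8):
    starts = np.argsort(-degrees)[:top_k]              # top-degree start points
    candidates = [fps(dist, int(s), b) for s in starts]
    return max(candidates, key=lambda c: u_prop[c].sum())
```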


4.1 Experimental Setup

Datasets.

Deuce is evaluated on six text classification datasets: IMDb (Maas et al., 2011), Yelpfull (Meng et al., 2019), AG’s News (Zhang et al., 2015), Yahoo! Answers (Zhang et al., 2015), DBpedia (Lehmann et al., 2015), and TREC (Li and Roth, 2002). Dataset statistics are shown in Table 1. All the datasets used in the experiments are publicly accessible. The original labels are removed to create a cold-start scenario.

Table 1: 

Statistics of evaluation datasets. Yahoo! and DBpedia are the truncated version with 30k samples per class by Yu et al. (2023). TREC is an imbalanced dataset.

| Dataset | Source domain | Target domain Y | #Class C | #Unlabeled | #Test | Label names y_j |
| IMDb | Movie review | Sentiment | 2 | 25,000 | 25,000 | Negative, Positive |
| Yelpfull | Review | Rating | 5 | 38,352 | 10,000 | 1 star, 2 stars, 3 stars, 4 stars, 5 stars |
| AG's News | News | Category | 4 | 120,000 | 7,600 | World, Sports, Business, Sci/Tech |
| Yahoo! Answers | Web Q&A | Category | 10 | 300,000 | 60,000 | Society & Culture, Science & Mathematics, Health, Education & Reference, Computers & Internet, Sports, Business & Finance, Entertainment & Music, Family & Relationships, Politics & Government |
| DBpedia | Wikipedia lead section | Category | 14 | 420,000 | 70,000 | Company, Educational institution, Artist, Athlete, Office holder, Mean of transportation, Building, Natural place, Village, Animal, Plant, Album, Film, Written work |
| TREC | Question | Category | 6 | 5,452 | 500 | Abbreviation, Entity, Description and abstract concept, Human being, Location, Numeric value |
Evaluation Metric.

To evaluate the performance of the acquired seed set Xs, it is labeled and used for fine-tuning the PLM. The original labels of the seed set are revealed. The accuracy of the fine-tuned PLM on the test set is then reported. To be consistent with previous methods (Yu et al., 2023), the experiments adopt RoBERTa-base (Liu et al., 2019) as the backbone PLM.

Analysis Metrics.
To analyze the effect of dual-diversity enhancement, the class imbalance (IMB) and textual-diversity value of seed sets are reported. Both metrics are computed under budget $b = 128$. IMB (Yu et al., 2023) is defined as:
where $n_j$ is the number of instances from class $y_j$. Textual-diversity value (Ein-Dor et al., 2020; Yu et al., 2023) is defined as:
where $\Delta(x_i, x_j)$ is the Euclidean distance between the SimCSE embeddings (Gao et al., 2021) of $x_i$ and $x_j$.
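For illustration, a minimal sketch of these two analysis metrics is given below. The displayed formulas are not reproduced in this excerpt, so this assumes IMB is the max/min class-count ratio (consistent with Table 4’s note that IMB = ∞ signals a missed class) and D is the mean pairwise distance between seed-set embeddings.

```python
# Hypothetical sketch of the analysis metrics; the exact formulas are assumptions.
import numpy as np

def imbalance(labels, num_classes):
    counts = np.bincount(labels, minlength=num_classes)
    return counts.max() / counts.min() if counts.min() > 0 else float("inf")

def diversity(embeddings):
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    d = np.linalg.norm(diffs, axis=-1)
    n = len(embeddings)
    return d.sum() / (n * (n - 1))        # average over distinct pairs
```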
Implementation Details.

The fine-tuning setup and hyperparameters are the same as Patron’s (Yu et al., 2023). Notably, the experiment code transplants the original implementation of graph normalization (McInnes et al., 2018) to GPU for acceleration. For Deuce, $k = 500$, $k_r = 3$, and $\gamma = 1.0$ (since $\tilde{w}_{\mathrm{sym}} \leq 1.0$) are used. All experiments are run on a machine with a single NVIDIA A800 GPU with 80 GB of VRAM.

Baselines.

The following CSAL baseline methods are considered:

  • Random sampling selects uniformly.

  • Entropy-based uncertainty sampling (revisited by Schröder et al., 2022) selects data with the highest predictive entropy.

  • Coreset selection (Sener and Savarese, 2018) iteratively selects data whose minimum distance to the selected data is maximized.

  • alps (Yuan et al., 2020a) computes surprisal embeddings from BERT loss as uncertainty. They are then clustered with k-means. Data closest to each centroid are selected.

  • few-selector (Chang et al., 2021) clusters the text embeddings with k-means.

  • TypiClust (Hacohen et al., 2022) clusters the text embeddings with k-means, and selects data with the highest typicality, i.e., kNN density, from each cluster.

  • Patron (Yu et al., 2023) clusters the text embeddings with k-means, and selects from each cluster data with the highest propagated uncertainty. It then iteratively updates the set to refine inter-sample distances.

  • vote-k (Su et al., 2023) iteratively assigns high scores to instances that are far from the already selected data.

Comparisons of the CSAL baselines and Deuce are presented in Table 2.

Table 2: 

Comparisons of CSAL methods, which adapt the taxonomy of Zhang et al. (2022b) (§2.1).

| Method | Uncertainty (informativeness) | Density (representativeness) | Textual diversity (representativeness) | Class diversity (representativeness) |
| Random | ✗ | ✗ | ✗ | ✗ |
| Entropy | ✓ | ✗ | ✗ | ✗ |
| Coreset | ✗ | ✗ | ✓ | ✗ |
| alps | ✓ | ✗ | ✓ | ✗ |
| few-s. | ✗ | ✗ | ✓ | ✗ |
| TypiCl. | ✗ | ✓ | ✓ | ✗ |
| Patron | ✓ | ✓ | ✓ | ✗ |
| vote-k | ✗ | ✓ | ✓ | ✗ |
| Deuce | ✓ | ✓ | ✓ | ✓ |

4.2 Accuracy Improvement

The main quantitative results of PLM fine-tuning performance with Deuce and baseline CSAL methods are shown in Table 3. Results for baselines other than vote-k are from Yu et al. (2023). To report the standard deviation, each setup is repeated with 10 different random seeds. Figure 2 shows a qualitative visualization of the b = 128 seed set from the IMDb dataset, acquired by the latest baseline method vote-k and by the proposed Deuce. The t-SNE method (van der Maaten and Hinton, 2008) is used for visualization.

Table 3: 

Evaluation results of Deuce and CSAL baselines on six datasets and three budgets (denoted by b), each with 10 repetitions. Accuracy (%) of one-round fine-tuned PLM is reported in the format of avg±std. The best and second best results per setup are emboldened and underlined, respectively.

| Dataset | b | Random | Entropy | Coreset | alps | few-s. | TypiCl. | Patron | vote-k | Deuce |
IMDb 32 80.2±2.5 81.9±2.7 74.5±2.9 82.2±3.0 79.2±1.6 82.8±2.2 85.5±1.5 85.6 ±1.8 86.9 ± 0.9 
64 82.6±1.4 84.7±1.5 82.8±2.5 86.1±0.9 84.9±1.5 84.0±0.9 87.3±1.0 88.0 ±1.2 88.5 ± 0.7 
128 86.6±1.7 87.1±0.7 87.8±0.8 87.5±0.8 88.5±1.6 88.1±1.4 89.6 ±0.4 89.1±0.7 90.0 ± 0.3 
Yelpfull 32 30.2±4.5 32.7±1.0 32.9±2.8 36.8±1.8 35.2±1.0 32.6±1.5 35.9±1.6 40.1 ±2.2 42.6 ± 1.1 
64 42.5±1.7 36.8±2.1 39.9±3.4 40.3±2.6 39.3± 1.0 39.7±1.8 44.4±1.1 49.3 ±1.6 49.8 ± 1.2 
128 47.7±2.1 41.3±1.9 49.4±1.6 45.1±1.0 46.4±1.3 46.8±1.6 51.2 ±0.8 50.8±1.5 53.4 ± 0.7 
AG’s News 32 73.7±4.6 73.7±3.0 78.6±1.6 78.4±2.3 79.1±2.7 80.7±1.8 83.2 ±0.9 81.8±1.3 83.7 ± 0.8 
64 80.0±2.5 80.0±2.2 82.0±1.5 82.6±2.5 82.4±2.0 83.0±2.4 85.3 ±0.7 84.7±1.3 86.3 ± 0.6 
128 84.5±1.7 82.5±0.8 85.2±0.6 84.3±1.7 85.6±0.8 85.7±0.3 87.0 ±0.6 86.2±1.2 87.5 ± 0.4 
Yahoo! Answers 32 43.5±4.0 23.0±1.6 22.0±2.3 47.7±2.3 46.8±2.1 36.9±1.8 56.8 ± 1.0 54.5±1.6 58.0 ±1.5 
64 53.1±3.1 37.6±2.0 45.7±3.7 55.3±1.8 52.9±1.6 54.0±1.6 61.9 ± 0.7 60.8±1.4 62.8 ±1.3 
128 60.2±1.5 41.8±1.9 56.9±2.5 60.8±1.9 61.3±1.0 58.2±1.5 65.1 ± 0.6 64.3±0.9 66.2 ±0.9 
DBpedia 32 67.1±3.2 18.9±2.4 64.0±2.8 77.5±4.0 83.3±1.0 78.2±1.8 85.3 ± 0.9 78.1±2.6 86.0 ±1.7 
64 86.2±2.4 37.5±3.0 85.2±0.8 89.7±1.1 92.7±0.9 88.5±0.7 93.6 ± 0.4 92.7±1.3 94.1 ±0.9 
128 95.0±1.5 47.5±2.3 89.4±1.5 95.7±0.4 96.5±0.5 95.7±0.6 97.0 ± 0.2 96.4±0.4 97.3 ±0.3 
TREC 32 49.0±3.5 46.6±1.4 47.1±3.6 60.5±3.7 60.3±1.5 42.0±4.4 64.0 ± 1.2 57.6±2.9 70.2 ±1.7 
64 69.1±3.4 59.8±4.2 75.7±3.0 73.0±2.0 77.3±2.0 72.6±2.1 78.6±1.6 81.8 ±3.1 82.2 ± 1.5 
128 85.6±2.8 75.0±1.8 87.6±3.0 87.3±3.6 87.7±1.5 83.0±3.8 91.1 ±0.8 89.7±2.6 92.1 ± 0.8 
 
Average 32 57.2±3.8 46.1±2.1 53.2±2.7 63.9±3.0 64.0±1.8 58.9±2.5 68.4 ± 1.2 66.3±2.1 71.2 ±1.3 
64 68.9±2.5 56.1±2.7 68.5±2.7 71.2±1.9 71.6±1.6 70.3±1.7 75.2± 1.0 76.2 ±1.8 77.3 ±1.1 
128 76.6±1.9 62.5±1.7 76.1±1.9 76.8±1.9 77.6±1.2 76.3±1.9 80.2 ±0.6 79.4±1.4 81.1 ± 0.6 
Figure 2: The t-SNE visualization of the acquired seed set (b = 128) on the IMDb dataset. Text embeddings are colored by their true labels.

From the results in Table 3, it can be seen that Deuce consistently outperforms the baselines, achieving up to a 2.5% gain on balanced datasets and up to 6.2% on the imbalanced dataset, TREC. Deuce mainly benefits from enhancing class diversity as well as textual diversity, as indicated by the larger improvements on TREC. In over half of the setups, Deuce also achieves the lowest standard deviation. In addition, Deuce improves most when b is small, which aligns with the fundamental goal of AL: maximizing performance gains with minimal labeled data. Furthermore, the visualization in Figure 2 shows that Deuce’s enhancement of dual-diversity leads to a broader and more balanced coverage of both the input space and the label space. As Deuce adopts a highest-uncertainty strategy, such coverage also exhibits high predictive uncertainty, thus including more “hard examples” which are valuable for annotation.

4.3 Enhancement of Class Diversity

To verify the enhancement of class diversity, the class imbalance value (Yu et al., 2023) under b = 128 is reported in Table 4.

Table 4: 

Label imbalance value (IMB) of acquired seed sets (b = 128). Smaller values indicate better class diversity and balance. An IMB of ∞ indicates that the missed cluster effect happens.

| Dataset | Random | Entropy | Coreset | alps | few-s. | TypiCl. | Patron | vote-k | Deuce |
| IMDb | 1.207 | 6.111 | 1.000 | 1.783 | 1.286 | 2.765 | 1.286 | 1.065 | 1.169 |
| Yelpfull | 1.778 | 3.800 | 6.000 | 2.833 | 2.000 | 5.200 | 2.250 | 1.273 | 1.450 |
| AG's News | 1.462 | 28.000 | 2.000 | 1.667 | 1.500 | 1.818 | 1.500 | 2.200 | 1.133 |
| Yahoo! Answers | 3.000 | 12.000 | 7.000 | 5.500 | 2.250 | 3.333 | 5.500 | 3.333 | 2.125 |
| DBpedia | 3.500 | ∞ | 9.000 | 9.000 | 3.500 | 9.000 | 2.333 | 2.800 | 3.250 |
| TREC | 8.000 | 16.000 | ∞ | 9.500 | 10.500 | 21.000 | 15.000 | 11.333 | 6.000 |
| Harmonic avg. | 2.128 | 9.863 | 3.124 | 3.138 | 2.166 | 3.839 | 2.338 | 2.052 | 1.779 |

From Table 4, it can be seen that Deuce achieves the lowest average IMB value. This indicates that Deuce properly enhances class diversity. In contrast, an IMB of ∞ emerges for the pure uncertainty-based (Entropy) and textual-diversity-based (Coreset) methods, indicating that the missed cluster effect happens in their acquisition.

4.4 Enhancement of Textual Diversity

To measure the textual diversity of seed sets, the textual-diversity value (Ein-Dor et al., 2020; Yu et al., 2023) under b = 128 is reported in Table 5.

Table 5: 

Textual diversity value D of acquired seed sets (b = 128). Larger values indicate better textual diversity.

| Dataset | Random | Entropy | Coreset | alps | few-s. | TypiCl. | Patron | vote-k | Deuce |
IMDb 0.646 0.647 0.643 0.647 0.687 0.648 0.684 0.669 0.670 
Yelpfull 0.645 0.626 0.456 0.680 0.685 0.677 0.685 0.657 0.679 
AG’s News 0.354 0.295 0.340 0.385 0.436 0.376 0.423 0.370 0.448 
Yahoo! Answers 0.430 0.375 0.400 0.441 0.470 0.438 0.486 0.451 0.491 
DBpedia 0.402 0.316 0.381 0.420 0.461 0.399 0.459 0.434 0.476 
TREC 0.301 0.298 0.298 0.339 0.337 0.326 0.338 0.346 0.353 
 
Average 0.463 0.426 0.420 0.485 0.513 0.477 0.512 0.488 0.520 

Table 5 shows that Deuce also achieves the highest average textual-diversity value. This indicates that Deuce also properly enhances textual diversity. The improvement in textual diversity is less pronounced than the improvement in IMB (Table 4). This signals that, compared to other baselines, Deuce enhances class diversity more than textual diversity. Such a difference can be explained by the highest-uncertainty-candidate strategy, which acquires more information from the label space.

4.5 Quality of Textual Embedding

To analyze the quality of Deuce’s prompt-based, unsupervised text embeddings z~xi3.2.1), they are compared with the supervised Sentence Transformer embeddings (Sentence Transformers, 2024) used in vote-k (Su et al., 2023). The correlations are computed across all the possible N2 pairs of their cosine similarity.3 Results on three datasets are reported in Table 6.

Table 6: 

The quality of textual embeddings, without and with template denoising (Jiang et al., 2022). Both correlation metrics are over [−1,1]; higher values indicate better quality.

| Dataset | Pearson correlation r | Spearman correlation ρ |
IMDb 0.1651 0.1636 
w/ denoising 0.1980 0.1889 
Yelpfull 0.1424 0.1440 
w/ denoising 0.3072 0.2984 
TREC 0.4271 0.4000 
w/ denoising 0.4662 0.4368 

From Table 6, a weak positive correlation is observed. Moreover, template denoising produces better embeddings, as it removes the biases from raw embeddings. Overall, the quality of textual embeddings is acceptable and adequate for cold-start acquisition.

4.6 Quality of Class Prediction

To analyze the quality of the embedding-based class predictions $\hat{\mathbf{y}}_i$ (§3.2.2), they are compared with the gold labels. As uncertainty indicates unstable predictions, labels are arranged from the most confident (lowest $u_i$) to the least confident. Results are demonstrated in Figure 3.

Figure 3: The quality of class predictions with respect to predictive uncertainty $u_i$. Dataset: IMDb (left) and TREC (right).

From Figure 3, a high accuracy of class predictions is consistently observed with high confidence and with denoised embeddings, and vice versa. This demonstrates the good quality of e.d.f. predictions and the derived uncertainty metric.

5.1 Comparison with LLM-based Methods

The landscape of NLP is rapidly evolving with generative large language models (LLMs). This section evaluates two potential LLM-based alternatives to Deuce: serialization for acquisition and zero-shot Chain-of-Thought prompting. The following experiments are conducted with Llama 2 7B (Touvron et al., 2023).

5.1.1 Serialization for Acquisition

Inspired by the work of Hegselmann et al. (2023), class and uncertainty information can be serialized into natural language for LLM-based acquisition. The process is designed to involve three passes. In the first pass, each unlabeled text is formalized as a multiple-choice problem for the LLM. The prompt template T1 is used to collect class and uncertainty information:
In the second pass, the LLM decides whether each text should be selected. Predictive uncertainty is estimated by the entropy of the first-pass predictions, bounded by $\log C$. The extended template T2 is used to combine the multiple sources of information:
In the third pass, the texts with the top-b probabilities of answering “yes” to T2 are selected as the seed set. The LLM is then fine-tuned with the seed set under T1. Finally, T1 is applied to the fine-tuned LLM to report the test set accuracy.

Due to resource constraints, LoRA (Hu et al., 2022) is used for fine-tuning, with r = α = 64. Results are reported in Table 7. Despite utilizing a mid-sized PLM, Deuce outperforms LLM-based serialization on most datasets. The decision process of the LLM is also a black box. In contrast, Deuce adopts graphs to explicitly capture the interplay of information, offering better interpretability.

Table 7: 

Fine-tuning results of Deuce (RoBERTa-base) and LLM serialization (Llama 2 7B).

| Method | b | IMDb | Yelpfull | AG's News | Yahoo! | DBpedia | TREC | Average |
Serialization 32 81.7 44.5 25.2 38.8 62.6 28.4 46.9 
64 83.8 51.2 53.4 55.7 45.9 27.8 53.0 
128 89.6 56.9 83.7 63.4 58.6 35.6 64.6 
Deuce 32 86.9 42.6 83.7 58.0 86.0 70.2 71.2 
64 88.5 49.8 86.3 62.8 94.1 82.2 77.3 
128 90.0 53.4 87.5 66.2 97.3 92.1 81.1 

5.1.2 Zero-shot Chain-of-Thought

Zero-shot Chain-of-Thought (CoT) prompting (Kojima et al., 2022) with LLMs has emerged as a promising method in cold-start scenarios. This paper tests zero-shot CoT both without and with explicit choices in the prompts. The temperature of generation is set to 0, and a maximum of 256 tokens are generated. Results are shown in Table 8. From the results, fine-tuning a PLM with Deuce still outperforms 0-shot LLM predictions, with greater performance gaps on class-imbalanced and difficult datasets. Inspecting failure cases (lemon-picking) shows that the LLM failed to output a final answer within 256 tokens for many test instances.

Table 8: 

Evaluation results of Deuce (b = 32, RoBERTa-base) and zero-shot Chain-of-Thought prompting (Kojima et al., 2022; Llama 2 7B).

| Method | IMDb | Yelpfull | AG's News | Yahoo! | DBpedia | TREC | Average |
0-shot CoT, w/o choices 63.6 9.2 34.7 23.7 37.1 12.6 32.0 
0-shot CoT, w/ choices 72.1 25.4 60.2 43.6 32.3 24.2 43.0 
 
Deuce, b = 32 86.9 42.6 83.7 58.0 86.0 70.2 71.2 

In addition, the average total GPU and CPU energy consumption and time usage are measured using Alizadeh and Castor’s (2024) method. Results are reported in Table 9. There is a 7.82× difference in energy consumption and a 6.26× difference in time consumption. While increasing the number of output tokens might improve accuracy, the added resource consumption cannot be neglected. Deuce provides an efficient solution for low-resource scenarios.

Table 9: 

Energy consumption and time usage of Deuce (b = 32, RoBERTa-base) and zero-shot Chain-of-Thought prompting (Kojima et al., 2022; Llama 2 7B), under the same data amount of 25000.

| Stage | 0-shot CoT Energy (kJ) | 0-shot CoT Time (sec) | Deuce Energy (kJ) | Deuce Time (sec) |
| Acquisition | – | – | 59.82 | 81.00 |
| Fine-tuning | – | – | 225.77 | 208.89 |
| Prediction | 2561.58 | 1967.23 | 41.99 | 24.27 |
| Total | 2561.58 | 1967.23 | 327.58 | 314.16 |

5.2 Effect of Labeling Noise

Real-world annotations often involve noise. Northcutt et al. (2021) estimated an average of 2.6% labeling errors across 3 commonly used NLP datasets. To evaluate Deuce under labeling noise, experiments with artificial errors are conducted. As the gold labels may already contain around 3% errors, 7% of seed labels are randomly replaced by wrong labels. The final sets are expected to exhibit an error level of 4–10%. Results are reported in Table 10.

Table 10: 

Evaluation results of Deuce, compared under an expected labeling noise level of 7%.

| Deuce | b | IMDb | Yelpfull | AG's News | Yahoo! | DBpedia | TREC | Average |
w/o noise 32 86.9±0.9 42.6±1.1 83.7±0.8 58.0±1.5 86.0±1.7 70.2±1.7 71.2±1.3 
64 88.5±0.7 49.8±1.2 86.3±0.6 62.8±1.3 94.1±0.9 82.2±1.5 77.2±1.1 
128 90.0±0.3 53.4±0.7 87.5±0.4 66.2±0.9 97.3±0.3 92.1±0.8 81.1±0.6 
w/ noise 32 67.8±4.3 38.7±3.0 72.5±1.0 49.7±7.2 61.5±2.0 69.6±0.6 60.0±1.5 
64 83.4±1.3 41.0±2.7 82.6±1.4 53.4±2.7 87.5±3.3 78.7±3.3 71.1±1.1 
128 82.9±6.3 45.1±1.7 84.7±2.4 62.7±1.3 89.2±3.7 82.5±3.8 74.5±1.5 

From the results, a decrease in accuracy and an increase in standard deviation occur as expected. However, Deuce still outperforms 0-shot CoT (Table 8) in nearly all setups, despite the added noise. This shows the robustness of Deuce-acquired seed sets to labeling noise during fine-tuning.

5.3 Effect of Class Prediction Failure

For real-world cold-start tasks, the knowledge about classes might not be well exploited by the PLM. In the worst case, the PLM can fail to generate meaningful class predictions. To simulate this scenario, ablation experiments with random class predictions are conducted. In this setup, the predictive embeddings $\mathbf{z}_{\hat{y}|x_i}$ are replaced with random vectors, which ablates the class predictions. Results are reported in Table 11.

Table 11: 

Ablation results of Deuce with random class predictions, compared with Coreset selection (Sener and Savarese, 2018). In this case, the class and uncertainty information are disarranged.

| Method | b | IMDb | Yelpfull | AG's News | Yahoo! | DBpedia | TREC | Average |
Coreset 32 74.5± 2.9 32.9±2.8 78.6± 1.6 22.0± 2.3 64.0±2.8 47.1± 3.6 53.2±2.7 
64 82.8± 2.5 39.9±3.4 82.0±1.5 45.7±3.7 85.2 ± 0.8 75.7±3.0 68.5±2.7 
128 87.8± 0.8 49.4±1.6 85.2±0.6 56.9±2.5 89.4±1.5 87.6 ±3.0 76.1±1.9 
Deuce w/ rand. pred. 32 83.3 ±4.1 44.1 ± 0.7 83.4 ±2.0 52.3 ±3.9 63.2 ± 1.1 64.9 ±3.9 65.2 ± 1.2 
64 85.9 ±4.5 48.0 ± 0.3 84.6 ± 1.2 60.0 ± 0.6 82.9±1.7 78.2 ± 2.0 73.3 ± 0.9 
128 86.6 ±2.5 49.5 ± 0.4 87.2 ± 0.4 63.4 ± 1.3 96.8 ± 0.1 86.8± 1.3 78.4 ± 0.5 

As the class and uncertainty information are disarranged, Deuce degenerates to using textual diversity alone, and performance degradation occurs as expected. Nonetheless, Deuce still outperforms Coreset selection (Sener and Savarese, 2018), a CSAL baseline which also purely utilizes textual diversity. This demonstrates Deuce’s effectiveness in real-world cold-start scenarios.

5.4 Performance of Few-shot Math Reasoning

Deuce has the potential to generalize to other NLP tasks. To demonstrate this, Deuce is tested on GSM8K (Cobbe et al., 2021), a dataset of math word problems. However, directly adapting RoBERTa to solve math problems is difficult due to its masked modeling nature. Instead, Deuce is applied with RoBERTa to produce a seed set.4 Then, the seeds are taken as examples for few-shot Chain-of-Thought prompting (Wei et al., 2022) with Llama 2 7B. As reported in Table 12, Deuce is still effective in few-shot math problem solving, compared to random sampling.

Table 12: 

Evaluation results of Deuce (RoBERTa-base) with few-shot Chain-of-Thought prompting (Wei et al., 2022; Llama 2 7B) on GSM8K dataset (Cobbe et al., 2021), compared to random sampling.

| Method | 4-shot | 8-shot | Average |
Random 25.1 24.3 24.7 
Deuce 25.8 27.4 26.6 

This paper presents Deuce, a dual-diversity enhancing and uncertainty-aware CSAL framework via a prompt-based and graph-based approach. Different from previous works, it emphasizes dual-diversity (i.e., textual diversity and class diversity) to ensure a balanced acquisition. This is achieved by the novel construction of Dual-Neighbor Graph (DNG) and Farthest Point Sampling (FPS). DNG leverages the kNN graph structure of textual space and label space from a PLM. In addition, Deuce prioritizes hard representative examples, so as to ensure an informative acquisition. This leverages density-based clustering and uncertainty propagation on the DNG. Experiments show the effectiveness of Deuce’s dual-diversity enhancement and uncertainty-aware mechanism. It offers an efficient solution for low-resource data acquisition. Overall, Deuce’s hybrid strategy strikes an important balance between exploration and exploitation in CSAL.

Backbone LM.

Deuce leverages a discriminative PLM. However, state-of-the-art PLMs are primarily generative. Generative embedding models (e.g., Jiang et al., 2023) or adaptations (Yang et al., 2019; Gong et al., 2019; Zhang et al., 2022a) could be investigated and combined with Deuce. For such approaches, their quality and efficiency should be carefully considered.

External Knowledge.

In Deuce, the only source of external knowledge is the language model. Incorporation of more domain knowledge, if possible, can improve the performance in the cold-start stage. As Deuce adopts a prompt-based and graph-based acquisition, prompt engineering and knowledge graphs (Pan et al., 2024) can be investigated.

We extend our gratitude to our action editor, Sebastian Padó, and the anonymous reviewers for their constructive comments. We also thank Tianjun Li and Jiangfeng Liu for their helpful feedback on the initial drafts.

This work was funded in part by the National Natural Science Foundation of China grant under number 62222603, in part by the STI2030-Major Projects grant from the Ministry of Science and Technology of the People’s Republic of China under number 2021ZD0200700, in part by the Key-Area Research and Development Program of Guangdong Province under number 2023B0303030001, in part by the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (2019ZT08X214), and in part by the Science and Technology Program of Guangzhou under number 2024A04J6310.

1 

Recent studies (Lu et al., 2023; Naeini et al., 2023; Zhang et al., 2023) have shown that state-of-the-art PLMs still underperform human experts in difficult tasks.

2 

It is worth noting that Deuce does not utilize or optimize any Graph Neural Network (GNN). With the rich representational capability of PLMs, Deuce does not require GNNs to learn data representations.

3 

Semantic similarity benchmarks (e.g., STS) cannot be used here, as the prompt Tx requires a task domain Y.

4 

For open questions like math problems, there are no concepts of “classes”. Instead, the predictive embeddings $\tilde{\mathbf{z}}_{\hat{y}|x_i}$ are clustered with hdbscan*. The cluster centroids are taken as meta-class embeddings $\mathbf{z}_{\hat{y}}$.

Deepesh
Agarwal
,
Pravesh
Srivastava
,
Sergio
Martin-del-Campo
,
Balasubramaniam
Natarajan
, and
Babji
Srinivasan
.
2021
.
Addressing practical challenges in active learning via a hybrid query strategy
.
arXiv preprint arXiv:2110.03785v1
.
Umang
Aggarwal
,
Adrian
Popescu
, and
Céline
Hudelot
.
2020
.
Active learning for imbalanced datasets
. In
2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
, pages
1417
1426
.
Negar
Alizadeh
and
Fernando
Castor
.
2024
.
Green AI: A preliminary empirical study on energy consumption in DL models across different runtime infrastructures
. In
Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI
,
CAIN ’24
, pages
134
139
,
New York, NY, USA
.
Association for Computing Machinery
.
Jordan T.
Ash
,
Chicheng
Zhang
,
Akshay
Krishnamurthy
,
John
Langford
, and
Alekh
Agarwal
.
2020
.
Deep batch active learning by diverse, uncertain gradient lower bounds
. In
8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020
.
OpenReview.net
.
Etienne
Brangbour
,
Pierrick
Bruneau
,
Thomas
Tamisier
, and
Stéphane
Marchand-Maillet
.
2022
.
Cold start active learning strategies in the context of imbalanced classification
.
arXiv preprint arXiv:2201.10227v1
.
Ricardo J. G. B.
Campello
,
Davoud
Moulavi
, and
Joerg
Sander
.
2013
.
Density-based clustering based on hierarchical density estimates
. In
Advances in Knowledge Discovery and Data Mining
, pages
160
172
,
Berlin, Heidelberg
.
Springer Berlin Heidelberg
.
Ricardo J. G. B.
Campello
,
Davoud
Moulavi
,
Arthur
Zimek
, and
Jörg
Sander
.
2015
.
Hierarchical density estimates for data clustering, visualization, and outlier detection
.
ACM Transactions on Knowledge Discovery from Data
,
10
(
1
).
Ernie
Chang
,
Xiaoyu
Shen
,
Hui-Syuan
Yeh
, and
Vera
Demberg
.
2021
.
On training instance selection for few-shot neural text generation
. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
, pages
8
13
,
Online
.
Association for Computational Linguistics
.
Karl
Cobbe
,
Vineet
Kosaraju
,
Mohammad
Bavarian
,
Mark
Chen
,
Heewoo
Jun
,
Lukasz
Kaiser
,
Matthias
Plappert
,
Jerry
Tworek
,
Jacob
Hilton
,
Reiichiro
Nakano
,
Christopher
Hesse
, and
John
Schulman
.
2021
.
Training verifiers to solve math word problems
.
arXiv preprint arXiv:2110.14168v2
.
Sajib
Dasgupta
and
Vincent
Ng
.
2009
.
Mine the easy, classify the hard: A semi-supervised approach to automatic sentiment classification
. In
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
, pages
701
709
,
Suntec, Singapore
.
Association for Computational Linguistics
.
Sanjoy
Dasgupta
.
2011
.
Two faces of active learning
.
Theoretical Computer Science
,
412
(
19
):
1767
1781
.
Algorithmic Learning Theory (ALT 2009)
.
Kevin
De Angeli
,
Shang
Gao
,
Mohammed
Alawad
,
Hong-Jun
Yoon
,
Noah
Schaefferkoetter
,
Xiao-Cheng
Wu
,
Eric B.
Durbin
,
Jennifer
Doherty
,
Antoinette
Stroup
,
Linda
Coyle
,
Lynne
Penberthy
, and
Georgia
Tourassi
.
2021
.
Deep active learning for classifying cancer pathology reports
.
BMC Bioinformatics
,
22
(
1
). ,
[PubMed]
Dmitriy
Dligach
and
Martha
Palmer
.
2011
.
Good seed makes a good crop: Accelerating active learning using language modeling
. In
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
, pages
6
10
,
Portland, Oregon, USA
.
Association for Computational Linguistics
.
Didier
Dubois
and
Henri
Prade
.
1982
.
A class of fuzzy measures based on triangular norms: A general framework for the combination of uncertain information
.
International Journal of General Systems
,
8
(
1
):
43
61
.
Liat
Ein-Dor
,
Alon
Halfon
,
Ariel
Gera
,
Eyal
Shnarch
,
Lena
Dankin
,
Leshem
Choshen
,
Marina
Danilevsky
,
Ranit
Aharonov
,
Yoav
Katz
, and
Noam
Slonim
.
2020
.
Active learning for BERT: An empirical study
. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
, pages
7949
7962
,
Online
.
Association for Computational Linguistics
.
Anton
Eklund
and
Mona
Forsman
.
2022
.
Topic modeling by clustering language model embeddings: Human validation on an industry dataset
. In
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
, pages
635
643
,
Abu Dhabi, UAE
.
Association for Computational Linguistics
.
Yuval
Eldar
,
Micahel
Lindenbaum
,
Moshe
Porat
, and
Yehoshua Y.
Zeevi
.
1994
.
The farthest point strategy for progressive image sampling
. In
Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 2 - Conference B: Computer Vision & Image Processing. (Cat. No.94CH3440-5)
, volume
3
, pages
93
97
.
Yaron Fairstein, Oren Kalinsky, Zohar Karnin, Guy Kushilevitz, Alexander Libov, and Sofia Tolmach. 2024. Class balancing for efficient active learning in imbalanced datasets. In Proceedings of The 18th Linguistic Annotation Workshop (LAW-XVIII), pages 77–86, St. Julians, Malta. Association for Computational Linguistics.

Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tieyan Liu. 2019. Representation degeneration problem in training natural language generation models. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Xin-Rong Gong, Jian-Xiu Jin, and Tong Zhang. 2019. Sentiment analysis using autoregressive language modeling and broad learning system. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1130–1134.

Guy Hacohen, Avihu Dekel, and Daphna Weinshall. 2022. Active learning on a budget: Opposite strategies suit high and low budgets. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 8175–8195. PMLR.

Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. 2023. TabLLM: Few-shot classification of tabular data with large language models. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 5549–5581. PMLR.

Marek Herde, Denis Huseljic, Bernhard Sick, and Adrian Calma. 2021. A survey on cost types, interaction schemes, and annotator performance models in selection algorithms for active learning in classification. IEEE Access, 9:166970–166989.

Andreas Holzinger. 2016. Interactive machine learning for health informatics: When do we need the human-in-the-loop? Brain Informatics, 3(2):119–131.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25–29, 2022. OpenReview.net.

Rong Hu, Brian Mac Namee, and Sarah Jane Delany. 2010. Off to a good start: Using clustering to select the initial training set in active learning. In Proceedings of the Twenty-Third International Florida Artificial Intelligence Research Society Conference, May 19–21, 2010, Daytona Beach, Florida, USA. AAAI Press.

Ting Jiang, Shaohan Huang, Zhongzhi Luan, Deqing Wang, and Fuzhen Zhuang. 2023. Scaling sentence embeddings with large language models. arXiv preprint arXiv:2307.16645v1.

Ting Jiang, Jian Jiao, Shaohan Huang, Zihan Zhang, Deqing Wang, Fuzhen Zhuang, Furu Wei, Haizhen Huang, Denvy Deng, and Qi Zhang. 2022. PromptBERT: Improving BERT sentence embeddings with prompts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8826–8837, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. 2021. How can we know when language models know? On the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977.

Jaeho Kang, Kwang Ryel Ryu, and Hyuk-Chul Kwon. 2004. Using cluster-based sampling to select initial training set for active learning in text classification. In Advances in Knowledge Discovery and Data Mining, pages 384–388, Berlin, Heidelberg. Springer Berlin Heidelberg.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc.

Ranganath Krishnan, Alok Sinha, Nilesh A. Ahuja, Mahesh Subedar, Omesh Tickoo, and Ravi R. Iyer. 2021. Mitigating sampling bias and improving robustness in active learning. In Proceedings of Workshop on Human in the Loop Learning (HILL) in International Conference on Machine Learning (ICML 2021).

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195.

Xin Li and Dan Roth. 2002. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.

Yansong Li, Zhixing Tan, and Yang Liu. 2023. Privacy-preserving prompt tuning for large language model services. arXiv preprint arXiv:2305.06212v1.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692v1.

Yuxuan Lu, Bingsheng Yao, Shao Zhang, Yun Wang, Peng Zhang, Tun Lu, Toby Jia-Jun Li, and Dakuo Wang. 2023. Human still wins over LLM: An empirical study of active learning on domain-specific annotation tasks. arXiv preprint arXiv:2311.09825v1.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579–2605.

Diego Marcheggiani and Thierry Artières. 2014. An experimental comparison of active learning strategies for partially labeled sequences. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 898–906, Doha, Qatar. Association for Computational Linguistics.

Katerina Margatina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. 2021. Active learning by acquiring contrastive examples. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 650–663, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Leland McInnes, John Healy, and James Melville. 2020. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426v3.

Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. 2018. UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29):861.

Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2019. Weakly-supervised hierarchical text classification. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):6826–6833.

Pu Miao, Zeyao Du, and Junlin Zhang. 2023. DebCSE: Rethinking unsupervised contrastive sentence embedding learning in the debiasing perspective. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM '23, pages 1847–1856, New York, NY, USA. Association for Computing Machinery.

Thomas Müller, Guillermo Pérez-Torró, Angelo Basile, and Marc Franco-Salvador. 2022. Active few-shot learning with FASL. In Natural Language Processing and Information Systems: 27th International Conference on Applications of Natural Language to Information Systems, NLDB 2022, Valencia, Spain, June 15–17, 2022, Proceedings, pages 98–110, Cham. Springer International Publishing.

Saeid Alavi Naeini, Raeid Saqur, Mozhgan Saeidi, John Michael Giorgi, and Babak Taati. 2023. Large language models are fixated by red herrings: Exploring creative problem solving and Einstellung effect using the Only Connect Wall dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Hieu T. Nguyen and Arnold Smeulders. 2004. Active learning using pre-clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, page 79, New York, NY, USA. Association for Computing Machinery.

Curtis Northcutt, Anish Athalye, and Jonas Mueller. 2021. Pervasive label errors in test sets destabilize machine learning benchmarks. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1.

Shreyas Padhy, Zachary Nado, Jie Ren, Jeremiah Liu, Jasper Snoek, and Balaji Lakshminarayanan. 2020. Revisiting one-vs-all classifiers for predictive uncertainty and out-of-distribution detection in neural networks. In ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning.

Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering, pages 1–20.

Seo Yeon Park and Cornelia Caragea. 2022. On the calibration of pre-trained language models using mixup guided by area under the margin and saliency. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5364–5374, Dublin, Ireland. Association for Computational Linguistics.

Kara E. Rudolph, Nicholas T. Williams, Caleb H. Miles, Joseph Antonelli, and Ivan Diaz. 2023. All models are wrong, but which are useful? Comparing parametric and nonparametric estimation of causal effects in finite samples. Journal of Causal Inference, 11(1):20230022.

Christopher Schröder, Andreas Niekler, and Martin Potthast. 2022. Revisiting uncertainty-based query strategies for active learning with transformers. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2194–2203, Dublin, Ireland. Association for Computational Linguistics.

Hinrich Schütze, Emre Velipasaoglu, and Jan O. Pedersen. 2006. Performance thresholding in practical text classification. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM '06, pages 662–671, New York, NY, USA. Association for Computing Machinery.

Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings. OpenReview.net.

Sentence Transformers. 2024. paraphrase-mpnet-base-v2 (revision e6981e5). Hugging Face.

Burr Settles. 2009. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.

Eyal Shnarch, Ariel Gera, Alon Halfon, Lena Dankin, Leshem Choshen, Ranit Aharonov, and Noam Slonim. 2022. Cluster & tune: Boost cold start performance in text classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7639–7653, Dublin, Ireland. Association for Computational Linguistics.

Hongjin Su, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. 2023. Selective annotation makes language models better few-shot learners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1–5, 2023. OpenReview.net.

Katrin Tomanek, Florian Laws, Udo Hahn, and Hinrich Schütze. 2009. On proper unit selection in active learning: Co-selection effects for named entity recognition. In Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, pages 9–17, Boulder, Colorado. Association for Computational Linguistics.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288v2.

Cheng Wang. 2024. Calibration in deep learning: A survey of the state-of-the-art. arXiv preprint arXiv:2308.01222v2.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.

Xingjiao Wu, Luwei Xiao, Yixuan Sun, Junhang Zhang, Tianlong Ma, and Liang He. 2022. A survey of human-in-the-loop for machine learning. Future Generation Computer Systems, 135:364–381.

Bartosz Wójcik, Jacek Grela, Marek Smieja, Krzysztof Misztal, and Jacek Tabor. 2022. SLOVA: Uncertainty estimation using single label one-vs-all classifier. Applied Soft Computing, 126:109219.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Hualong Yu, Xibei Yang, Shang Zheng, and Changyin Sun. 2019. Active learning from imbalanced data: A solution of online weighted extreme learning machine. IEEE Transactions on Neural Networks and Learning Systems, 30(4):1088–1103.

Yue Yu, Rongzhi Zhang, Ran Xu, Jieyu Zhang, Jiaming Shen, and Chao Zhang. 2023. Cold-start data selection for better few-shot language model fine-tuning: A prompt-based uncertainty propagation approach. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2499–2521, Toronto, Canada. Association for Computational Linguistics.

Michelle Yuan, Hsuan-Tien Lin, and Jordan Boyd-Graber. 2020a. Cold-start active learning through self-supervised language modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7935–7948, Online. Association for Computational Linguistics.

Mu Yuan, Lan Zhang, Xiang-Yang Li, and Hui Xiong. 2020b. Comprehensive and efficient data labeling via adaptive model scheduling. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pages 1858–1861.

Shiwei Zhang, Mingfang Wu, and Xiuzhen Zhang. 2023. Utilising a large language model to annotate subject metadata: A case study in an Australian national research data catalogue. arXiv preprint arXiv:2310.11318v1.

Tong Zhang, Xinrong Gong, and C. L. Philip Chen. 2022a. BMT-Net: Broad multitask transformer network for sentiment analysis. IEEE Transactions on Cybernetics, 52(7):6232–6243.

Tong Zhang, Guoxi Su, Chunmei Qing, Xiangmin Xu, Bolun Cai, and Xiaofen Xing. 2021. Hierarchical lifelong learning by sharing representations and integrating hypothesis. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 51(2):1004–1014.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.

Zhisong Zhang, Emma Strubell, and Eduard Hovy. 2022b. A survey of active learning for natural language processing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6166–6190, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Jingbo Zhu, Huizhen Wang, Tianshun Yao, and Benjamin K. Tsou. 2008. Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 1137–1144, Manchester, UK. Coling 2008 Organizing Committee.
Author notes

Action Editor: Sebastian Padó

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.