Abstract
Weakly supervised learning aims to reduce the cost of labeling data by using expert-designed labeling rules. However, existing methods require experts to design effective rules in a single shot, which is difficult in the absence of proper guidance and tooling. Therefore, it is still an open question whether experts should spend their limited time writing rules or instead providing instance labels via active learning. In this paper, we investigate how to exploit an expert’s limited time to create effective supervision. First, to develop practical guidelines for rule creation, we conduct an exploratory analysis of diverse collections of existing expert-designed rules and find that rule precision is more important than coverage across datasets. Second, we compare rule creation to individual instance labeling via active learning and demonstrate the importance of both across 6 datasets. Third, we propose an interactive learning framework, INTERVAL, that achieves efficiency by automatically extracting candidate rules based on rich patterns (e.g., by prompting a language model), and effectiveness by soliciting expert feedback on both candidate rules and individual instances. Across 6 datasets, INTERVAL outperforms state-of-the-art weakly supervised approaches by 7% in F1. Furthermore, it requires as few as 10 queries for expert feedback to reach F1 values that existing active learning methods cannot match even with 100 queries.
1 Introduction
Supervised machine learning models for text classification require large, hand-labeled training datasets, which are both expensive and time-consuming to obtain. Most efforts to reduce the reliance on large training datasets support just a single type of expert supervision, namely, to label individual instances one at a time (Seeger, 2006; Clark et al., 2018; Ruder and Plank, 2018; Berthelot et al., 2019; Peters et al., 2018; Devlin et al., 2019; Zhang and Yang, 2021; Zhang et al., 2022c).
To reduce the data labeling bottleneck, weakly supervised learning (WSL) (Zhang et al., 2022a) focuses on labeling rules that automatically generate weak labels for unlabeled instances. WSL works in two separate steps: (i) experts provide labeling rules; and (ii) labeling rules are used to train a machine learning model. Most work focuses on solving the second step, namely, learning with noisy rules (Ratner et al., 2016, 2017; Karamanolakis et al., 2019; Bach et al., 2019; Awasthi et al., 2020). In practice, however, experts find it difficult to define sufficiently many rules in one shot (Varma and Ré, 2018). Considerable time and creativity are required for inspecting unlabeled instances and creating rules that add predictive value by effectively covering a substantial number of instances. Therefore, it is an open question whether experts should spend their limited time writing rules or instead providing instance labels, notably via active learning (Settles, 2009).
In this paper, we investigate how to efficiently exploit an expert’s limited time for machine teaching. Our main idea is to automatically extract labeling rules with high coverage of unlabeled data, and then rely on domain expertise to validate the candidate rules. In contrast to active learning methods, where the machine queries the expert for labels of individual examples (Zhang et al., 2022c), providing feedback for each rule leads to multiple data labels, which we show here can boost classification performance faster.
Supporting rich forms of interaction is challenging, especially when the teaching budget is limited. First, given a restricted number of rules that can be created or validated by an expert, it is not clear what properties these rules should have to train an accurate model. For example, should one prioritize rules that cover many examples but with relatively low precision, or rules that have high precision but lower coverage? Moreover, existing algorithms for rule extraction require substantial labeled data, and it is unclear how to extract and rank candidate rules when we are given just limited labeled data and perhaps a few expert-validated rules. In general, there are few guidelines in the literature for creating effective rules for efficient machine teaching. Additionally, the option to ask for feedback on both rules and instances requires balancing the costs and potential benefits of each type of feedback when there is a shared budget of expert interaction.
Our work addresses these open questions via the following contributions:
Characterization of Prevalent Patterns in Offline Machine Teaching.
We analyze six datasets with expert-defined rules and evaluate multiple weak supervision methods under simulated low-resource settings. Specifically, we unify several weak supervision methods using a Teacher-Student abstraction, where a subset of the rules are considered in the teacher model for training a student model. By evaluating more than 1,000 Teacher-Student configurations per dataset, we associate Teacher properties with the Student’s performance and, even though rules are dataset-specific, we find two prevalent patterns across datasets and methods that could inform guidelines for rule creation. First, we show that a higher-F1 Teacher does not necessarily lead to a higher-F1 Student. Second, we show the Teacher’s precision is more important than coverage for training an accurate Student.
Automatic Rule Extraction via Prompting.
We propose a method that extracts rules with rich predicates, expressed as conjunctions of n-grams, syntactic features, and prompt-based features. By prompting a pre-trained model (see Figure 1), our method extracts high-level features that might not explicitly appear in the text (e.g., “terrible” customer experience) and thus can discover common patterns across instances with no n-gram overlap. As we will show, by extracting both surface-level and higher-level features, our rule family achieves higher precision and coverage than n-gram rules. Our design focuses on rules that could be easily validated by a human and are highly effective.
Interactive Machine Teaching.
We present a human-in-the-loop machine teaching framework called INTERVAL,1 which queries for expert feedback on both instances and rules, and uses all the available resources to train a classifier. We quantify the trade-off between labeling rules vs. instances and show that our framework is more efficient than existing WSL and active learning approaches even when starting with no expert-written rules. Our analysis demonstrates that feedback on both rules and instances is more effective than feedback on instances only (as in Active Learning) even when labeling rules are more expensive than labeling instances by up to 9 times.
The rest of this paper is organized as follows. Section 2 reviews related work on interactive machine teaching and defines our problem of focus. Section 3 presents our interactive machine teaching framework,2 which queries for feedback on labeling rules and instances. Sections 4 and 5 evaluate our interactive method via experiments on six text classification datasets. Finally, Sections 6 and 7 discuss future work and conclude.
2 Problem Definition and Related Work
We now define our problem of focus (Section 2.1); we also discuss related work on non-interactive weak supervision and interactive learning with instance- and feature-level feedback (Section 2.2).
2.1 Problem Definition
Let X denote the feature space and Y = {1, …, K} denote the label space for a K-class classification task. We consider a set of manually labeled examples DL = {(sl, yl)}, where sl ∈ X and yl ∈ Y, and a set of unlabeled examples DU = {si}. We also consider a set of pre-defined expert-provided labeling rules R = {rj}. A rule rj maps an example si into a label in Y ∪ {∅}. Predicting ∅ indicates that rj does not cover si. We are primarily interested in the scenario where the size of DL is small in comparison to that of DU, and where R contains just a few or no expert-provided rules, which is often the case for new tasks. Additionally, we assume that we have a budget of T “cost” units (e.g., time) for querying a subject matter expert for feedback on either an instance si ∈ DU (at a cost of TI) or an automatically extracted rule rj (at a cost of TR), as we discuss in Section 2.2.
Our goal is to leverage DL, DU, and R, and interact with the expert within the specified budget T to train a classifier that, given an unseen test instance s ∈ X, predicts a label ŷ ∈ Y.
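To make this setup concrete, here is a minimal Python sketch of the resources involved (type and variable names are illustrative, not part of our implementation); a labeling rule simply returns a class label, or None when it does not cover an instance:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Label = int  # one of the K classes, encoded as 0..K-1

@dataclass
class LabeledExample:
    text: str
    label: Label

@dataclass
class LabelingRule:
    name: str
    # Returns a class label if the rule covers the instance, else None (i.e., ∅).
    apply: Callable[[str], Optional[Label]]

# Low-resource setting: small D_L, large D_U, and few or no expert rules R.
D_L: List[LabeledExample] = [LabeledExample("I won't go back.", 0)]
D_U: List[str] = ["Great food, friendly staff!", "Terrible service."]
R: List[LabelingRule] = []   # may start empty for a new task
```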
2.2 Prior Work
Non-interactive Approaches.
Non-interactive weak supervision approaches do not involve a human in the loop (i.e., T = 0 for our problem definition). Supervised learning methods consider just DL, semi-supervised learning methods consider DL and DU (Nigam and Ghani, 2000; Lee, 2013; Gera et al., 2022), and WSL methods consider DL, DU, and R (Ratner et al., 2017; Bach et al., 2019; Badene et al., 2019; Fu et al., 2020; Awasthi et al., 2020; Karamanolakis et al., 2021). WSL uses rules in R (e.g., keyword-based patterns, regular expressions, heuristic labeling functions) to automatically generate weak training labels for unlabeled instances in DU. As rules can be noisy, can have limited coverage, and different rules may generate conflicting labels for the same instance, WSL techniques estimate rule weights for noise-aware training (Zhang et al., 2022a). Our method also employs WSL, can work with any rule-weighting technique and further discovers new rules to expand the coverage of R.
Our method is also related to zero-shot and few-shot prompting methods, which use a template to modify the input si into a cloze-style or entailment question and leverage a pre-trained model to “answer” the question (Schick and Schütze, 2021; Yin et al., 2019; Liu et al., 2023). By directly using the outputs of the pre-trained model for classification, prompt-based techniques are sensitive to the selection of prompting templates (Gao et al., 2021; Ye et al., 2023), labeled examples (Zhao et al., 2021; Perez et al., 2021), and hyperparameters (Tam et al., 2021). Even prompting powerful models such as ChatGPT, the successor of InstructGPT (Ouyang et al., 2022), requires substantial effort to reach the performance of supervised (fine-tuned) models on text benchmarks (Bang et al., 2022). Our work explores prompting for rule creation during training instead of direct inference. Specifically, we use the pre-trained model’s output to construct labeling rules, which we assume are only weakly indicative of the true labels. In our approach, prompting is required only during training, and inference can be performed with any model, thus enabling applications where deploying large language models might not be possible.
Our work is also related to rule extraction methods, which consider rules of various types such as keywords, named entities, and numeric expressions (Yangarber et al., 2000), syntactic relations (Snow et al., 2004), part-of-speech tags and hypernyms (Califf and Mooney, 2003), regular expression patterns (Augenstein et al., 2016), sequential patterns (Srikant and Agrawal, 1996; Jindal and Liu, 2008), and more recently, features extracted by prompting pre-trained models (Zhang et al., 2022b). Our method considers a rich family of rules based on n-grams, linguistic features (e.g., part-of-speech tags and named entities), and prompt-based features, and focuses on efficient interaction by soliciting feedback on both candidate rules and instances.
Interactive Learning with Instance Feedback.
One type of interaction that has been studied extensively in the literature is active learning, in which the machine queries the expert for just a small number of labels for examples that are chosen adaptively from abundant unlabeled data (Lewis and Gale, 1994; Cohn et al., 1996; Roy and McCallum, 2001; Dasgupta et al., 2007; Dasgupta and Hsu, 2008; Settles, 2009; Beygelzimer et al., 2010; Houlsby et al., 2011; Zhang and Chaudhuri, 2015; Shen et al., 2017; Kirsch et al., 2019; Ash et al., 2019; Brantley et al., 2020; Yuan et al., 2020; Dor et al., 2020; Margatina et al., 2021; Zhang et al., 2022c). Nearly all previous active learning methods solicit the expert’s judgment only to label instances. In other words, they do not support feedback on labeling rules and query only for feedback on instance labels. Creating a sufficiently large training set would thus require separate feedback on many individual instances. On the other hand, validating a candidate rule leads to weak labels for many examples at a time (i.e., for all the examples covered by the rule) and, as a result, a large weakly-labeled dataset can be created with a relatively small number of rules.
Interactive Learning with Rule Feedback.
Our work is related to previous interactive methods that support expert queries on automatically generated rules from the n-gram family (Druck et al., 2008; Melville et al., 2009; Settles, 2011; Jagarlamudi et al., 2012; Poulis and Dasgupta, 2017; Dasgupta et al., 2018; Boecking et al., 2020; Kartchner et al., 2022). These methods extract simple n-gram based rules, which as we will show (e.g., in Figure 3) have limited effectiveness and different characteristics than expert-provided rules in R. As two exceptions, Sen et al. (2019) extract rules based on linguistic expressions via syntactic parsing and Zhang et al. (2022b) consider rules based on the output of pre-trained language models prompted with task-specific templates; both show that experts can successfully provide feedback on rules from the proposed families. Most of the above methods do not allow instance-labeling queries. In contrast, our method subsumes and generalizes existing work on rule labeling and active learning by querying an expert for both instances and automatically extracted rules from a new rule family with rich predicates.
3 Interactive Machine Teaching with Instance and Rule Feedback
This section describes our interactive machine teaching framework, which addresses the problem defined in Section 2.1. The core question is how to efficiently solicit expert feedback for machine teaching given a limited budget T. Our main idea is to balance the quality of instance labels with the efficiency of labeling rules under this low-resource setting. We propose a framework, INTERVAL, that supports efficient interaction by selecting which instances to label manually and by extracting candidate rules that, when accepted, can automatically generate many additional labels. INTERVAL can be used with several WSL methods and any learning model.
In the rest of this section, we describe the individual steps followed by INTERVAL on each iteration, namely, Teacher-Student co-training (Section 3.1), querying for instance feedback (Section 3.2), candidate rule extraction (Section 3.3), and querying for rule feedback (Section 3.4), and then we summarize the main ideas of our interactive machine teaching algorithm (Section 3.5).
3.1 Teacher-Student Co-Training
In the first step of each iteration, we use DL, DU, and R to train a model. This has been the main objective in non-interactive WSL. Our model training employs the Teacher-Student abstraction by Karamanolakis et al. (2021) to unify several WSL methods (Dawid and Skene, 1979; Ratner et al., 2016, 2019; Zhang et al., 2022a).
The same Teacher-Student abstraction appears across different WSL approaches (Zhang et al., 2022a), which differ in the teacher model design. For example, in simple majority voting, the Teacher aggregates the predictions of rules in R. In Snorkel (Ratner et al., 2017), the Teacher is a probabilistic graphical model that estimates weights for rules in R in an unsupervised way. In ASTRA (Karamanolakis et al., 2021), the Teacher is a rule-attention network that aggregates rule labels with instance-specific weights and is co-trained with the Student.
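As a concrete illustration of the simplest Teacher in this abstraction, the sketch below aggregates rule predictions by unweighted majority voting to produce weak labels for unlabeled instances (a toy example; Snorkel and ASTRA replace this aggregation with learned rule weights):

```python
from collections import Counter
from typing import Callable, List, Optional

# A rule maps a text instance to a class label, or None if it does not cover it.
Rule = Callable[[str], Optional[int]]

def majority_vote_teacher(text: str, rules: List[Rule]) -> Optional[int]:
    """Unweighted majority vote over all rules that cover the instance."""
    votes = [y for y in (rule(text) for rule in rules) if y is not None]
    if not votes:
        return None                                    # uncovered: no weak label
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None                                    # tie between classes: abstain
    return counts[0][0]

# Hypothetical keyword rules for a sentiment task (0 = negative, 1 = positive).
rules = [
    lambda s: 0 if "terrible" in s.lower() else None,
    lambda s: 1 if "great" in s.lower() else None,
]
print(majority_vote_teacher("Great food, friendly staff!", rules))  # -> 1
```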
In our problem of focus, where the size of DL is small and R contains just a small number of rules, the student model might fall far short of satisfactory accuracy for our target task. Next, we show how to exploit the interaction budget T.
3.2 Querying for Instance Feedback
After having trained the Student, INTERVAL queries the label yi for an instance si from the unlabeled set DU. To efficiently interact with an expert, we design a method that chooses which instance to query for feedback based on the Student’s probabilities, as some instances might be more “informative” for the Student than others. INTERVAL identifies a diverse collection of unlabeled instances for which the Student’s predicted probabilities have high entropy as explained next.
Instance Clustering.
At the beginning of our algorithm, we construct a hierarchical clustering of the unlabeled instances in DU. To achieve this, we implement agglomerative clustering using Ward’s linkage method, which focuses on minimizing cluster variances. For cluster variances, we calculate the Euclidean distances between instances based on instance embeddings, which are computed via pre-trained BERT (Devlin et al., 2019). For implementation details see Section 4.
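A minimal sketch of this clustering step is shown below, using SciPy’s Ward-linkage agglomerative clustering over precomputed BERT [CLS] embeddings; the specific library calls and the flat cut of the hierarchy are illustrative assumptions rather than the exact implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_clusters(embeddings: np.ndarray, num_clusters: int) -> np.ndarray:
    """Build a Ward-linkage hierarchy over instance embeddings (Euclidean
    distances, as in Ward's method) and cut it into `num_clusters` flat clusters."""
    Z = linkage(embeddings, method="ward")       # full agglomerative tree
    return fcluster(Z, t=num_clusters, criterion="maxclust")

# `X_unlabeled` would hold one BERT [CLS] embedding per instance in D_U.
X_unlabeled = np.random.randn(1000, 768)         # placeholder embeddings
cluster_ids = hierarchical_clusters(X_unlabeled, num_clusters=20)
```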
Instance Selection.
To choose which instances to query, INTERVAL applies the Student pθ(·) on each unlabeled instance si ∈ DU to get soft labels pi = pθ(si), where pik represents the Student’s predicted probability of assigning si to class k ∈ {1, …, K}. We use DS = {(si, pi)} to denote the dataset that is soft-labeled by the Student. Then, INTERVAL selects sample instances si via the cluster-adaptive sampling algorithm of Dasgupta and Hsu (2008), which exploits the hierarchical structure of the data and evaluates cluster informativeness based on the entropy of the Student’s predicted probabilities for si in DS. Specifically, the algorithm chooses instances si from clusters characterized by low label “purity”, or equivalently, high entropy based on the Student’s probabilities pi. This selection is made under the premise that collecting expert labels for these instances will provide valuable information for the subsequent round of Student training. Once a cluster becomes “pure,” the algorithm shifts its focus to another cluster, with the goal of acquiring a diverse collection of instances.
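The selection criterion can be sketched as follows: compute the entropy of the Student’s soft labels, score each cluster by its mean entropy (a proxy for low label “purity”), and sample the most uncertain instances from the highest-entropy cluster. This is a simplification of the cluster-adaptive sampling algorithm of Dasgupta and Hsu (2008), which additionally updates its purity estimates as expert labels arrive.

```python
import numpy as np

def entropy(p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Row-wise entropy of an (n, K) matrix of predicted class probabilities."""
    return -(p * np.log(p + eps)).sum(axis=1)

def select_instances(probs: np.ndarray, cluster_ids: np.ndarray,
                     batch_size: int = 10) -> np.ndarray:
    """Pick a batch of indices from the cluster whose instances have the
    highest average predictive entropy under the Student."""
    ent = entropy(probs)
    clusters = np.unique(cluster_ids)
    cluster_scores = {c: ent[cluster_ids == c].mean() for c in clusters}
    target = max(cluster_scores, key=cluster_scores.get)
    members = np.where(cluster_ids == target)[0]
    # Within the chosen cluster, prefer the most uncertain instances.
    return members[np.argsort(-ent[members])[:batch_size]]
```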
Instance Labeling.
After selecting an instance si, the system queries the expert’s label yi at a cost of TI. At the end of the iteration, the labeled pair (si, yi) is added in DL to train the Teacher and Student at the next iteration.
3.3 Candidate Rule Extraction
In contrast to WSL, where experts manually create rules with significant coverage on DU, we propose to automatically extract candidate rules and hopefully reduce the cost of rule creation. After getting the label yi for an instance si, we extract candidate rules rj that predict the same label yi for si and have non-trivial coverage in DU. We first describe the types of rules and then how to extract them.
Rule Family.
Most work on interactive learning with rule feedback has focused on extracting keyword-based labeling rules. These rules have limited expressiveness compared to expert-written rules, which include class-indicative keywords, regular expression patterns, and auxiliary classifiers (e.g., polarity and subjectivity classifiers for spam classification) (Zhang et al., 2022c). To improve expressiveness without sacrificing interpretability, our method extracts rules rj whose predicates vj(si) are conjunctions of features that can have three different types: n-grams (vj(si) is true if a specific n-gram appears in si), linguistic features (e.g., part-of-speech tags and named entities), and prompt-based features. Specifically, to construct prompt-based rules, we prompt pre-trained models for si using templates from “PromptSource” (Bach et al., 2022). As an example, consider the sentence from Figure 1 (si: “I have been to this restaurant 3 times. I won’t go back”). We construct “prompt-based” predicates by prompting a pre-trained model to fill in the mask in the following template: “<si>. Overall, the experience is [MASK]” and extracting the top k tokens (e.g., “terrible”). Table 1 shows more examples of prompt templates and Table 2 lists examples of rules extracted by our method using such templates (extraction details are discussed later). Our approach extracts common patterns across instances that might not even share any n-gram features, such as in tasks with short documents. As we will see in Section 5.2, the rules in our expanded family can be substantially more accurate than the simple n-gram rules considered in previous work, and yet they are nearly as interpretable. Note that, at test time, our method does not require access to the above resources as the student model predicts labels directly based on si.
Table 1: Examples of prompt templates.

| Name | Prompt Template |
|---|---|
| EXPERIENCE | Overall, the experience is [MASK]. [TEXT]. |
| RECOMMEND | [TEXT]. Would I recommend it? The answer is [MASK]. |
| ASKS_FOR | The following SMS message asks for [MASK]: [TEXT]. |
| IS_ABOUT | The following SMS message is about [MASK]: [TEXT]. |
Table 2: Examples of candidate rules extracted by our method.

| Candidate Rules (predicate → label) |
|---|
| PMT-EXPERIENCE = “terrible” → Negative |
| PMT-EXPERIENCE = “fantastic” → Positive |
| PMT-RECOMMEND = “certainly” → Positive |
| PMT-IS_ABOUT = “prizes” → Spam |
| NGRAM = “http” AND PMT-ASKS_FOR = “donations” → Spam |
| NER = “CARDINAL” AND PMT-ASKS_FOR = “information” → Spam |
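To illustrate how prompt-based predicates such as PMT-EXPERIENCE = “terrible” might be obtained, the following sketch queries a masked language model through the Hugging Face fill-mask pipeline and keeps the top-k predicted tokens as features. The template string comes from the example in the text; the exact prompting code used in our implementation may differ.

```python
from typing import List
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

def prompt_features(text: str, template: str, top_k: int = 10) -> List[str]:
    """Return the top-k tokens the masked LM predicts for [MASK] in the template."""
    prompt = (template.replace("[TEXT]", text)
                      .replace("[MASK]", fill_mask.tokenizer.mask_token))
    return [pred["token_str"].strip() for pred in fill_mask(prompt, top_k=top_k)]

features = prompt_features(
    "I have been to this restaurant 3 times. I won't go back.",
    template="[TEXT] Overall, the experience is [MASK].",
)
# e.g., ['terrible', 'awful', ...] -> candidate predicate PMT-EXPERIENCE = "terrible"
```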
Rule Extraction.
We extract rules r from the above family, as long as (i) they cover at least tcov examples in DU, including si, and (ii) they have a precision of at least tprec in DL. Both tcov and tprec are hyper-parameters. Given the above coverage and precision constraints, we extract conjunctions of features using the Apriori algorithm (Agrawal et al., 1994). Specifically, we first exhaustively search all rules with a single feature from the above family and keep all rules that satisfy all constraints. (The constraint that all rules have to cover si with a label yi is especially strong and allows efficient search.) Then, we create rules as conjunctions of two features selected before and pick just the resulting rules that satisfy all of the above constraints. Our method considers rules with conjunctions of up to tlen features, where tlen is another hyper-parameter. The set RC contains all candidate rules that are extracted by our method and satisfy our constraints.
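A simplified sketch of this levelwise (Apriori-style) search is shown below: start from single-feature predicates that fire on si, keep those meeting the coverage and precision thresholds, and extend the survivors by one feature of si at a time, up to tlen conjuncts. Feature extraction itself is omitted and assumed to yield a boolean feature set per instance; function and argument names are illustrative.

```python
from typing import Dict, FrozenSet, List, Set

def extract_candidate_rules(
    feats_u: List[Set[str]],    # boolean features of each unlabeled instance in D_U
    feats_l: List[Set[str]],    # boolean features of each labeled instance in D_L
    labels_l: List[int],        # gold labels of D_L
    anchor_feats: Set[str],     # features of the just-labeled instance s_i
    anchor_label: int,          # expert label y_i for s_i
    t_cov: int = 100, t_prec: float = 0.75, t_len: int = 3,
) -> Dict[FrozenSet[str], int]:
    """Levelwise search for conjunctive predicates that fire on s_i, cover at
    least t_cov instances in D_U, and reach precision t_prec on D_L."""

    def coverage(pred: FrozenSet[str]) -> int:
        # number of unlabeled instances whose feature set contains all conjuncts
        return sum(pred <= feats for feats in feats_u)

    def precision(pred: FrozenSet[str]) -> float:
        # fraction of covered labeled instances whose gold label equals y_i
        hits = [y for feats, y in zip(feats_l, labels_l) if pred <= feats]
        return sum(y == anchor_label for y in hits) / len(hits) if hits else 0.0

    rules: Dict[FrozenSet[str], int] = {}
    level = [frozenset([f]) for f in anchor_feats]      # single-feature predicates
    for _ in range(t_len):
        level = [p for p in level if coverage(p) >= t_cov and precision(p) >= t_prec]
        rules.update({p: anchor_label for p in level})
        # extend each surviving predicate with one more feature of s_i
        level = list({p | {f} for p in level for f in anchor_feats if f not in p})
    return rules
```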
Automatically identifying a good rule is hard with limited labeled data DL. For example, a candidate rule rj with high coverage on DU might have low coverage in DL (DL might contain just a few labeled examples), and therefore it is hard to estimate the true precision of rj. Therefore, we rely on expert feedback for selected candidate rules from RC, as discussed next.
3.4 Querying for Rule Feedback
After having extracted the set RC of candidate rules that cover si, we select up to β candidate rules rj and query for their labels zj ∈ Y ∪ {∅}, where β is a hyper-parameter. Specifically, we first select in RC′ all rules from RC that predict the label yi (thus agreeing with the expert’s label for si). Then, we select from RC′ the top β rules with the highest precision (computed on DL). Note that RC′ might have fewer than β rules in total, thus we use βi ≤ β to indicate the number of rules selected finally.
Next, we query the labels zj for the βi selected rules at a cost of βi · TR. At the end of the iteration, the βi labeled rules are added to R, where by design each rule rj will predict the same label for all instances that it covers. Our method ignores rules labeled with zj = ∅ (i.e., rejected rules).
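For completeness, the rule-query selection step can be sketched as follows, assuming the candidate rules and their precisions on DL were produced as in Section 3.3 (names are hypothetical, not our exact API):

```python
from typing import Dict, FrozenSet, List, Tuple

def select_rules_to_query(
    candidates: Dict[FrozenSet[str], int],     # predicate -> predicted label
    precisions: Dict[FrozenSet[str], float],   # precision of each predicate on D_L
    instance_label: int,                       # expert label y_i for s_i
    beta: int = 1,
) -> List[Tuple[FrozenSet[str], int]]:
    """Keep rules that agree with the expert's label for s_i and query the
    top-beta by precision (there may be fewer than beta agreeing rules)."""
    agreeing = [(p, y) for p, y in candidates.items() if y == instance_label]
    agreeing.sort(key=lambda rule: precisions.get(rule[0], 0.0), reverse=True)
    return agreeing[:beta]
```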
Throughout this interaction design, we assume that the domain expert can judge whether rj provides the correct label for most of the examples that the rule covers, and is aware that (i) a rule rj does not need to have perfect accuracy but rather represents a pattern that the expert intends to exploit to label examples more efficiently than manually; and (ii) rule predictions will be aggregated to train a model in a noise-aware way. Similar to how expert-written rules are used for WSL, we assume that accepting a precise candidate rule for si could improve the Student in the next iteration. This is possible by augmenting the Student’s training data with all the unlabeled examples covered by the rule, and by increasing the overlap of accepted rules R on DU, which provides useful signal for rule denoising, similar to inter-annotator agreement methods.
3.5 Interactive Machine Teaching Algorithm
The steps outlined in Sections 3.1–3.4 make up our interactive machine teaching method (Algorithm 1), which we recap as follows. First, our method clusters DU into hierarchical clusters. In each interaction round: (1) we train the Teacher and Student using labeled data, unlabeled data, and expert-validated rules (line 3.1); (2) we apply the Student on unlabeled data to get soft labels (line 3.2); (3) we pick a candidate unlabeled instance (line 3.3) and obtain its instance label from an expert (line 3.4); (4) we extract candidate rules (line 3.5) and obtain the labels for βi rules from an expert (line 3.6); and (5) we update the labeled dataset, expert-validated rules, and the remaining budget (line 3.7). In practice, we repeat Steps 3–6 (lines 3.3–3.6) in batches of 10 instances. We repeat the full procedure until the budget T runs out.
By associating rj with a specific instance si, we give the expert extra context (e.g., the text of si) for deciding zj. Also, we hypothesize that, in practice, reading the text of the instance can help reduce the cost TR for deciding zj. While some previous work assumes that labeling rules have no extra cost (Poulis and Dasgupta, 2017), we assume that TR > 0. The hyper-parameter βi controls how to distribute the budget T. Specifically, setting βi = 0 reduces to standard active learning, as INTERVAL will perform queries on instances only. By setting βi ≥ 1, one can exploit feedback on rules that apply to si. As we will show, rule feedback leads to performance improvements relative to instance feedback only.
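For reference, the interaction loop can be summarized with the following pseudocode-style sketch; all helper functions are placeholders for the steps of Sections 3.1–3.4, the batching of queries is omitted, and costs are tracked in the same units as T.

```python
REJECT = None  # placeholder for a rejected rule label (z_j = ∅)

def interval_loop(D_L, D_U, R, budget, cost_instance=1.0, cost_rule=1.0, beta=1):
    """High-level sketch of the INTERVAL interaction loop (not runnable as-is)."""
    clusters = build_hierarchical_clusters(D_U)                   # done once (Section 3.2)
    while budget >= cost_instance:
        teacher, student = train_teacher_student(D_L, D_U, R)     # Section 3.1
        soft_labels = student.predict_proba(D_U)
        s_i = select_instance(D_U, soft_labels, clusters)         # Section 3.2
        y_i = query_expert_instance(s_i)                          # cost T_I
        budget -= cost_instance
        D_L, D_U = D_L + [(s_i, y_i)], [s for s in D_U if s is not s_i]
        candidates = extract_candidate_rules(s_i, y_i, D_L, D_U)  # Section 3.3
        for rule in select_rules_to_query(candidates, y_i, beta): # Section 3.4
            if budget < cost_rule:
                break
            z_j = query_expert_rule(rule, s_i)                    # cost T_R per rule
            budget -= cost_rule
            if z_j is not REJECT:
                R = R + [rule]
    return train_teacher_student(D_L, D_U, R)[1]                  # final Student
```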
4 Experimental Settings
We now present our experimental setting for interactive machine teaching on several text classification datasets.
Datasets.
For our analysis and to evaluate our framework, we consider six benchmark datasets from diverse domains: (1) spam classification of YouTube comments (Alberto et al., 2015); (2) spam classification of SMS messages (Almeida et al., 2011); (3) sentiment classification of IMDB movie reviews (Maas et al., 2011); (4) sentiment classification of Yelp reviews (Zhang et al., 2015); (5) question classification from TREC-6 (Li and Roth, 2002); and (6) topic classification in AGNews (Zhang et al., 2015). Table 3 reports dataset statistics. For each dataset, we use expert-made rules that are provided by Zhang et al. (2021) and prompt templates that are provided by Bach et al. (2022). For a fair comparison, we use exactly the same expert-written rules3 as in previous work, which can have various types such as keywords, regular expression patterns, and lexicons.
Table 3: Dataset statistics.

| | YouTube | SMS | IMDB | Yelp | TREC | AGNews |
|---|---|---|---|---|---|---|
| Classification task | spam | spam | sentiment | sentiment | question type | topic |
| Domain | user comments | text messages | movies | reviews | web queries | news |
| # Classes (K) | 2 | 2 | 2 | 2 | 6 | 4 |
| Unlabeled size (\|DU\|) | 1546 | 4531 | 19,960 | 30,360 | 4845 | 95,920 |
| Labeled train size (\|DL\|) | 40 | 40 | 40 | 40 | 120 | 80 |
| Test size | 250 | 500 | 2500 | 3800 | 500 | 12,000 |
| # Prompt templates | 5 | 5 | 15 | 12 | 6 | 9 |
| # Expert-provided rules (R) | 10 | 73 | 5 | 8 | 68 | 9 |
Experimental Procedure.
To simulate the low-resource setting, we split the training examples into DL (labeled set) and DU (unlabeled set) by sampling 20 labeled examples per class (20 · K in total) uniformly at random, which we use in DL, while we use the rest in DU. To be consistent with our low-resource assumptions, we downsample the validation set (used for training Student via early stopping) to match the size of DL. For interactive approaches, we consider the extreme low-resource setting where R = ∅. We simulate expert feedback for candidate instances si from DU (Section 3.2) using the ground-truth labels of DU (hidden to the main algorithm), which is common in active learning research (Zhang et al., 2022c). We simulate expert feedback for candidate automatic rules (Section 3.4) using all ground-truth labels in DU: a candidate rule rj is accepted if it correctly classifies more than toracle of the instances in DU that it covers. We experiment with different values of toracle: 25%, 50%, 75%, 90%, and 100% and study their impact on the student’s accuracy.
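The simulated rule oracle described above can be sketched as follows (argument names are illustrative): a candidate rule is accepted only if its accuracy on the covered portion of DU, measured with the hidden gold labels, exceeds toracle.

```python
from typing import Callable, List, Optional, Tuple

def simulated_rule_oracle(
    rule: Callable[[str], Optional[int]],   # returns a label, or None if uncovered
    hidden_pool: List[Tuple[str, int]],     # D_U paired with its hidden gold labels
    t_oracle: float = 0.75,
) -> bool:
    """Accept the rule iff it correctly labels more than t_oracle of the
    unlabeled instances that it covers (simulating an expert's judgment)."""
    predictions = [(rule(text), gold) for text, gold in hidden_pool]
    covered = [(pred, gold) for pred, gold in predictions if pred is not None]
    if not covered:
        return False
    accuracy = sum(pred == gold for pred, gold in covered) / len(covered)
    return accuracy > t_oracle
```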
For a robust evaluation, for each method we run 10 different experiments with different random seeds, thus each run corresponds to a different version of DL, DU, and R. We report the average test performance over the 10 different runs. As evaluation metric, we use the macro-averaged F1 of the student model on the test set.
Model Configuration.
For a fair comparison, we use exactly the same text pre-processing (tokenization, embedding) as in the WRENCH benchmark (Zhang et al., 2021). Following Zhang et al. (2021), we represent each text instance (si) as a vector using pre-trained BERT (Devlin et al., 2019), specifically as the output embedding of the [CLS] token of BERT-base.4 For the hyper-parameters and search space for bag-of-words logistic regression, multilayer perceptron, and BERT, see Table 10 in Zhang et al. (2022c). For candidate rule extraction, we consider conjunctions (AND) of up to tlen = 3 features consisting of n-grams with n = 1,2,3; linguistic features (part-of-speech tags and named entities extracted using the spaCy library5); and prompt-based features as the top k = 10 tokens predicted by pre-trained RoBERTa (Liu et al., 2019) for each of the templates provided by Bach et al. (2022).6 For our analysis of rule characteristics, we experiment with different values for the minimum rule coverage on DU (tcov ∈{10,100,1000}) and the minimum rule precision based on DL (tprec ∈{25%,50%,75%,100%}). In INTERVAL, we use tcov = 100 and tprec = 75%. For interaction, we study different relative values for β (maximum number of rules per instance), TR (rule labeling cost) and TI (instance labeling cost).
Model Comparison.
For a robust evaluation of our approach, we compare several approaches that utilize different resources:
“Fully supervised”: a model trained in the high-resource setting using all labeled data.
“Low supervised”: a model trained in the low-resource setting using only DL.
“Semi supervised”: a model trained using DL and DU. We consider self-training (Nigam and Ghani, 2000; Lee, 2013) for up to 25 iterations with early stopping based on the validation performance.
“WSL”: a model trained using DL, DU, and R. We experiment with different methods, including unweighted majority voting and weighted aggregation of rule predictions with majority voting, Snorkel (Ratner et al., 2017), Dawid-Skene (Dawid and Skene, 1979), FlyingSquid (Fu et al., 2020), MeTaL (Ratner et al., 2019), and ASTRA (Karamanolakis et al., 2021).
“Active learning”: a model trained using DL, DU, and the interaction budget T. We experiment with standard active learning (performing queries on instances only) with different acquisition functions, including random instance selection, uncertainty-based sampling, hierarchical sampling (Dasgupta and Hsu, 2008), and contrastive active learning (Margatina et al., 2021). We also evaluate IWS (Boecking et al., 2020), which considers n-gram rule families and performs queries on rules only.7
“INTERVAL”: a model trained using our interactive machine teaching method that uses DL and DU, and spends the interaction budget T to perform queries on both instances and rules.
For a fair comparison, we use exactly the same modeling configuration across all methods (see paragraph “Model configuration” for details).
5 Experimental Results
We now present our analysis of expert-provided rules (Section 5.1), results on automatic rule extraction (Section 5.2), and our experiments for interactive machine teaching with queries on instances and rules (Section 5.3).
5.1 Analysis of Expert Rules
In this section, we analyze existing datasets with expert-labeled rules and simulate low-resource rule settings to understand the impact of Teacher properties on the performance of the Student.
Analysis of the Precision vs. Coverage Trade-off.
In Section 1, we highlighted one challenging question: Should one prioritize rules that cover more examples but have a relatively lower precision, or a few rules that have higher precision but lower coverage? To analyze the precision-coverage trade-off, we create different Teacher versions using different subsets of the expert-labeled rules and evaluate the performance of the Student using each Teacher separately. For a robust analysis, we evaluate multiple Teacher types (majority voting, Snorkel (Ratner et al., 2016), Dawid-Skene (Dawid and Skene, 1979), MeTaL (Ratner et al., 2019), FlyingSquid (Fu et al., 2020)), and multiple Student types (bag-of-words logistic regression, multilayer perceptron, BERT). See Section 4 for implementation details. For each Teacher type, we keep different randomly selected subsets of the rules in R ranging from 1% to 100%. For each Teacher-Student combination, we run 10 different experiments with different random seeds. This results in more than 1,000 Teacher-Student configurations for each dataset.
Figure 2 summarizes the results across all experiments for YouTube and TREC. While different datasets have Teacher-Student pairs with different characteristics, there are patterns that are prevalent across datasets. First, a more accurate Teacher does not necessarily lead to a more accurate Student. For example, in YouTube (Figure 2) some Teachers with F1 ≥ 0.6 train a Student with F1 ≥ 0.5, while other Teachers with F1 ≤ 0.2 train a Student with F1 ≥ 0.8. This result implies that naively optimizing the Teacher’s performance (according to the standard “data programming” paradigm (Ratner et al., 2016)) might not lead to the best performing student model.
A second pattern that is prevalent across datasets is that the Teacher’s precision is more important than coverage for training an accurate Student. In the scatterplots of Figure 2, most Teachers with high precision train high-quality Students, while many Teachers with high coverage train low-quality Students. To quantify this observation, we compute precision-coverage weights using the Teacher’s precision and coverage to predict the Student’s F1 score. Specifically, we compute the Student’s F1 score as the weighted geometric average of the Teacher’s precision and coverage, and we tune the corresponding weights using grid search. A higher weight thus indicates that the corresponding feature is more important for the prediction of the Student’s F1 score. Table 4 shows the estimated precision and coverage weights for all datasets. Across all datasets, precision has higher weight than coverage: more precise Teachers lead to more accurate Students.
Table 4: Importance weights of the Teacher's coverage and precision for predicting the Student's F1.

| | YouTube | SMS | Yelp | IMDB | TREC | AGNews |
|---|---|---|---|---|---|---|
| Coverage weight | 0.20 | 0.00 | 0.22 | 0.23 | 0.30 | 0.46 |
| Precision weight | 0.80 | 1.00 | 0.78 | 0.77 | 0.70 | 0.54 |
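A minimal sketch of how such weights could be fitted is shown below, assuming one (precision, coverage, Student F1) triple per Teacher-Student configuration; the grid resolution and error metric are illustrative assumptions.

```python
import numpy as np

def fit_precision_coverage_weights(precision, coverage, student_f1, steps=101):
    """Grid-search a weight w so that precision**w * coverage**(1-w) (a weighted
    geometric mean) best predicts the Student's F1; weights sum to 1 as in Table 4."""
    precision, coverage, student_f1 = map(np.asarray, (precision, coverage, student_f1))
    best_w, best_err = 0.0, np.inf
    for w in np.linspace(0.0, 1.0, steps):
        pred = precision ** w * coverage ** (1.0 - w)
        err = np.mean((pred - student_f1) ** 2)
        if err < best_err:
            best_w, best_err = w, err
    return {"precision_weight": best_w, "coverage_weight": 1.0 - best_w}
```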
Our observation that rule precision is more important than coverage explains recent design choices for WSL (Awasthi et al., 2020; Hsieh et al., 2022), such as the “contextualized LF modeling” component of Hsieh et al. (2022), which explicitly reduces rule coverage to improve rule precision. Moreover, our observation might inform guidelines for rule creation. In YouTube, for instance, if we reject all Teacher models with coverage lower than 0.5, then the precision’s importance weight increases from 0.75 to 0.84, indicating that focusing on precision would be beneficial. Therefore, one potential guideline is that if the Teacher has a coverage higher than 50%, then the main focus should be on improving its precision.
5.2 Analysis of Automatic Rules
In this section, we compare our rule family to n-gram rules and expert rules. Figure 3 shows precision-coverage scatterplots for rules automatically extracted by our method. For this analysis, we have included all rules with precision higher than 0.5 and coverage higher than 0. Rules with high-level predicates (conjunctions of n-grams, named entities, and prompt-based features) can achieve relatively high precision and coverage compared to n-gram predicates and thus are promising to improve the overall performance of interactive machine teaching.
Table 5 reports the performance of “WSL” with rules automatically extracted by our method using tcov = 100 (minimum coverage) and tprec = 0.75 (minimum precision). Across all datasets, our rule family is more effective than n-gram rules and could thus improve the effectiveness of automatic rule extraction. Also, across most datasets (except TREC and YouTube), our rule family is more effective than expert-provided rules: we effectively use DU and DL to discover high-quality rules. TREC is a notable exception, as it contains the highest number of manually crafted rules among all datasets. As we will show next, expert interaction can lead to further improvements.
Table 5: WSL performance (F1) with different rule families.

| Rule family | YouTube | SMS | IMDB | Yelp | TREC | AGNews | AVG F1 |
|---|---|---|---|---|---|---|---|
| Expert | 90.0 | 86.8 | 71.2 | 80.2 | 57.0 | 75.9 | 76.8 |
| Automatic (n-gram; Boecking et al. (2020)) | 76.4 | 79.7 | 49.1 | 54.9 | 52.7 | 74.8 | 64.6 |
| Automatic (ours) | 82.7 | 91.4 | 73.5 | 86.1 | 53.3 | 78.1 | 77.5 |
5.3 Interactive Machine Teaching
Table 6 reports classification results of different methods for each dataset. For brevity, we report the best method under each category and list the average F1 across datasets (see AVG F1 column). In interactive methods, we assume TR = TI and fix β = 1 (while we study different values later).
Table 6: Classification performance (F1) per dataset. For each method, we indicate the resources used: labeled instances DL, unlabeled instances DU, expert-provided rules R, and interaction budget T split across instance and rule queries (TI, TR).

| Method | \|DL\| | DU | R | T (TI, TR) | YouTube | SMS | IMDB | Yelp | TREC | AGNews | AVG F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fully Supervised | 100% | – | – | – | 94.0 | 95.6 | 79.6 | 87.5 | 90.3 | 80.7 | 88.0 |
| Low Supervised | 20·K | – | – | – | 79.8 | 82.5 | 61.6 | 70.4 | 55.0 | 58.8 | 68.0 |
| Semi Supervised | 20·K | ✓ | – | – | 80.7 | 83.2 | 63.4 | 72.0 | 55.0 | 60.7 | 69.2 |
| WSL (ASTRA) | 20·K | ✓ | ✓ | – | 90.0 | 86.8 | 71.2 | 80.2 | 57.0 | 75.9 | 76.8 |
| Active Learning (hierarchical) | 20·K | ✓ | – | 100 (100, 0) | 85.3 | 89.9 | 67.6 | 81.2 | 61.4 | 71.4 | 76.1 |
| INTERVAL | 20·K | ✓ | – | 100 (50, 50) | 91.4 | 94.8 | 79.3 | 86.2 | 66.6 | 78.8 | 82.8 |
Non-interactive Approaches.
Across non-interactive approaches, WSL (ASTRA) performs best: using both labeled instances and expert-provided rules is more effective than using just labeled instances (in Low Supervised or Semi Supervised), which agrees with conclusions from recent work (Karamanolakis et al., 2021). ASTRA also outperforms other WSL methods, including majority voting (AVG F1 = 74.1) and Snorkel (AVG F1 = 74.5).
Active Learning Approaches.
Using the extra interaction budget T in Active Learning improves over Low Supervised: Labeling extra instances leads to important performance boosts, as expected. Hierarchical sampling performs better than random sampling (AVG F1 = 75.0), uncertainty-based sampling (AVG F1 = 75.3), contrastive active learning (AVG F1 = 74.1), and IWS (AVG F1 = 75.3). For SMS, Yelp, and TREC, Active Learning with a budget of T = 100 outperforms ASTRA: Acquiring 100 extra instance labels is more effective than collecting expert rules for these datasets. However, for YouTube, IMDB, and AGNews, Active Learning (hierarchical) does not outperform ASTRA, which highlights that expert-provided rules are worth many examples. The above results suggest that there is no clear winner between Active Learning and WSL, and their relative performance varies across datasets.
Interactive Learning with Queries on Rules and Instances.
In Table 6, INTERVAL with a budget of T = 100 performs better than the best Active Learning (hierarchical) approach with the same budget: leveraging feedback on both instances and rules within a limited budget is more effective than feedback on instances only. Interestingly, even without using any expert-provided rules, INTERVAL outperforms ASTRA. This indicates that automatically-generated rules (analyzed in Section 5.2) are effective. While the ASTRA Student might capture implicit rules via self-training, many rules could be inaccurate, thus highlighting the importance of expert interaction.
Table 7 summarizes the results for all methods and ablation experiments. INTERVAL performs better than its ablations without instance labeling (by 6%) and without rule labeling (by 8%): feedback on both instances and rules is the most effective. Also, our rule family is more effective than its ablations without n-gram rules (by 4%) and without prompt-based rules (by 3%). Performance differences on each dataset are statistically significant at p < 0.05 using the Student’s t-test.
Table 7: Average F1 across all six datasets for each method, including ablations of INTERVAL.

| Method | AVG F1 |
|---|---|
| Fully Supervised | 88.0 |
| Low Supervised | 68.0 |
| Semi Supervised (self-training) (Lee, 2013) | 69.2 |
| WSL (majority voting) | 74.0 |
| WSL (Snorkel) (Ratner et al., 2017) | 74.2 |
| WSL (FlyingSquid) (Fu et al., 2020) | 74.2 |
| WSL (MeTaL) (Ratner et al., 2019) | 74.7 |
| WSL (ASTRA) (Karamanolakis et al., 2021) | 76.8 |
| Active Learning (random) | 75.0 |
| Active Learning (uncertainty) | 75.3 |
| Active Learning (contrastive) (Margatina et al., 2021) | 75.4 |
| Active Learning (hierarchical) (Dasgupta and Hsu, 2008) | 76.1 |
| Interactive Rule Labeling (IWS) (Boecking et al., 2020) | 75.1 |
| INTERVAL | 82.8 |
| INTERVAL w/o instance labeling | 78.2 ↓6% |
| INTERVAL w/o rule labeling | 76.1 ↓8% |
| INTERVAL w/o prompt-based rules | 79.7 ↓4% |
| INTERVAL w/o n-gram rules | 80.2 ↓3% |
Performance with Different Budget Values.
Table 8 reports the performance of interactive methods with different budget sizes ranging from 10 to 250. INTERVAL requires as few as T = 10 queries to reach F1 values that existing active learning methods cannot match even with T = 100 queries. Figure 4 shows the performance of INTERVAL compared to Active Learning approaches on Yelp and AGNews. INTERVAL yields especially large performance gains in low-budget settings where T < 100. Our results highlight that INTERVAL can effectively leverage feedback on both instances and automatic rules, and outperform previous interactive methods.
Table 8: Average F1 for different interaction budgets T.

| Method | T = 10 | T = 50 | T = 100 | T = 150 | T = 200 | T = 250 |
|---|---|---|---|---|---|---|
| Active Learning (rand.) | 68.1 | 71.8 | 75.0 | 76.5 | 78.0 | 78.4 |
| Active Learning (hier.) | 68.4 | 73.9 | 76.1 | 78.3 | 79.3 | 79.9 |
| INTERVAL | 76.2 | 81.1 | 82.8 | 84.3 | 85.5 | 86.2 |
Evaluating the Relative Cost of Rules and Instances.
So far, we have evaluated our method by assuming that TR = TI. Here, we experiment with different relative costs of labeling rules (TR) and instances (TI). We assume T = 100 · TI (fixed total budget) and β = 1 (labeling up to one rule per instance), and find the maximum value of TR for which INTERVAL achieves an F1 score that is at least as high as that of the best Active Learning (hierarchical) method. Table 9 reports the maximum TR value for each dataset. On average across datasets, feedback on rules and instances is more effective than feedback on instances only as long as TR ≤ 5.2 TI, though this value varies significantly per dataset and can be as high as 9 TI (for Yelp). In other words, our hybrid method for labeling rules and instances is highly effective even when labeling rules is up to 9 times (for Yelp) more expensive than labeling instances.
How Many Rules to Label per Instance.
Table 10 shows the performance of INTERVAL by varying β (maximum number of rules to label per instance). Labeling up to one rule (β = 1) gives strong boosts compared to no rule labeling (β = 0) across datasets while labeling up to two rules (β = 2) gives further improvements in some tasks (YouTube, Yelp, AGNews). However, increasing β to values higher than 2 is less effective: when β = 5, then either less-accurate or redundant rules are queried, while this interaction budget could be used more effectively by labeling more instances (and the associated rules). Table 11 shows an example from AGNews (classes are “World,” “Sports,” “Business,” and “Sci/Tech”) where INTERVAL is applied with β = 5. The candidate instance is labeled as “World” topic and out of the βi = 3 rules that were queried (by satisfying the minimum precision and coverage thresholds), 2 were accepted and 1 was rejected as “international” also appears in other topics (e.g., “Business”). Our analysis suggests that most performance benefits are realized by labeling up to 1 rule per instance, while future research could dynamically determine the threshold β, for example as a function of task characteristics and labeling costs.
Table 10: Performance (F1) of INTERVAL for different values of β (maximum number of rules labeled per instance).

| β | YouTube | SMS | IMDB | Yelp | TREC | AGNews | AVG F1 |
|---|---|---|---|---|---|---|---|
| 0 | 85.3 | 89.9 | 67.6 | 81.2 | 61.4 | 71.4 | 76.1 |
| 1 | 91.4 | 94.8 | 79.3 | 86.2 | 66.6 | 78.8 | 82.8* |
| 2 | 91.9 | 94.8 | 79.2* | 87.3 | 65.0 | 79.7 | 83.0 |
| 5 | 91.0 | 94.7* | 78.4 | 86.9 | 62.5 | 79.2 | 82.1 |
Table 11: Example of an instance query and rule queries from AGNews with β = 5 (✓: accepted, ✗: rejected).

Text instance si: “Prime Minister Manmohan Singh today said international environment for India’s development was highly favourable...”

Queries:
- Instance label: World
- Rule 1: NGRAM = “prime minister” → World ✓
- Rule 2: PROMPT_IS_ABOUT = “politics” → World ✓
- Rule 3: NGRAM = “international” → World ✗
- Rule 4: –
- Rule 5: –
6 Discussion and Future Work
Our framework and analysis demonstrate the advantages of soliciting feedback on both candidate rules and individual instances. We identify several areas for future research and discuss them next.
As future work, we will explore additional design choices for INTERVAL, including instance selection strategies (e.g., based on rule informativeness), rule extraction methods (e.g., based on rule diversity), and weak supervision techniques. While INTERVAL selects up to β candidate rules per instance (where βi depends on how many rules satisfy the precision and coverage thresholds), we could further explore adaptive querying protocols, for example dynamically determining β or selectively skipping instance labeling based on dataset characteristics or labeling costs. We could also extend INTERVAL to support richer types of feedback, such as editing (rather than accepting or rejecting) candidate rules and prompt templates (rather than relying on fixed templates from Bach et al. (2022)). More research is required from a user perspective, for example on how to visualize rules (Lertvittayakumjorn et al., 2022) and effectively present a combination of rules and instances for expert labeling.

INTERVAL supports prompting pre-trained models just for training data creation, and can work with any model for inference, thus enabling applications where deploying large language models might not be possible. We expect further gains by creating rules using more powerful pre-trained models such as InstructGPT (Ouyang et al., 2022), Flan-T5 (Chung et al., 2022), and LLaMA (Touvron et al., 2023a, b). We also expect performance improvements by replacing the Student with stronger pre-trained models and by representing instances using more recent text embedding techniques (He et al., 2020; Wang et al., 2023; Su et al., 2023; Muennighoff et al., 2024). INTERVAL could also be extended to multi-label classification by changing the Teacher-Student co-training objective (Section 3.1), and to broader tasks by generating rules from more complex rule families using models such as Toolformer (Schick et al., 2023).
Our current experimental evaluation used simulated expert feedback, because a definitive evaluation involving actual subject matter experts would be too expensive. A potential stopgap is to use large language models (such as ChatGPT), which may be too expensive to query at test time, but are cheaper than subject matter experts to query at training time for selected instances.
7 Conclusions
In this paper, we presented an interactive machine teaching approach that queries experts for feedback on both instances and automatically generated rules. Our findings show that, even though rules are domain specific and have diverse characteristics, there are patterns that are prevalent across datasets. Specifically, a higher-F1 Teacher does not necessarily lead to a higher-F1 Student. We identified that the Teacher’s precision is more important than coverage for training an accurate Student. These findings could potentially inform guidelines for rule creation. Our analysis demonstrates that automatic rules based on high-level predicates are more accurate than rules based on n-gram predicates. We additionally showed that by asking queries on both instances and automatically extracted rules, our method can be more effective than active learning methods.
Acknowledgments
We thank the reviewers and action editors for their constructive feedback. This material is based upon work supported by the National Science Foundation under grant no. IIS-15-63785.
Notes
INTERVAL: INTEractive Rule discoVery for weAkly supervised Learning.
Our implementation is publicly available at https://github.com/gkaramanolakis/interval.
All rules are described at https://github.com/JieyuZ2/wrench.
Prompt templates are available at https://github.com/bigscience-workshop/promptsource.
Unfortunately, the code repository for PRBoost (Zhang et al., 2022b), https://github.com/rz-zhang/PRBoost, does not contain any code as of August 9, 2024.
Author notes
Work done at Columbia prior to joining Amazon.
Action Editor: Asli Celikyilmaz