Multi-task Active Learning for Pre-trained Transformer-based Models

Abstract Multi-task learning, in which several tasks are jointly learned by a single model, allows NLP models to share information from multiple annotations and may facilitate better predictions when the tasks are inter-related. This technique, however, requires annotating the same text with multiple annotation schemes, which may be costly and laborious. Active learning (AL) has been demonstrated to optimize annotation processes by iteratively selecting unlabeled examples whose annotation is most valuable for the NLP model. Yet, multi-task active learning (MT-AL) has not been applied to state-of-the-art pre-trained Transformer-based NLP models. This paper aims to close this gap. We explore various multi-task selection criteria in three realistic multi-task scenarios, reflecting different relations between the participating tasks, and demonstrate the effectiveness of multi-task compared to single-task selection. Our results suggest that MT-AL can be effectively used in order to minimize annotation efforts for multi-task NLP models.

Accepted for publication in Transactions of the Association for Computational Linguistics (TACL), 2022. Pre-MIT Press publication version.

Introduction

Nevertheless, DNNs often require large labeled training sets in order to achieve good performance. While annotating such training sets is costly and laborious, the active learning (AL) paradigm aims to minimize these costs by iteratively selecting valuable training examples for annotation. Recently, AL has been shown effective for DNNs across various NLP tasks (Duong et al., 2018; Peris and Casacuberta, 2018; Ein-Dor et al., 2020).
An appealing capability of DNNs is performing multi-task learning (MTL): learning multiple tasks with a single model (Ruder, 2017). This stems from their architectural flexibility, which allows constructing increasingly deeper and wider architectures from basic building blocks, and from their gradient-based optimization, which allows them to jointly update parameters from multiple task-based objectives. Indeed, MTL has become ubiquitous in NLP (Luan et al., 2018; Liu et al., 2019a).
MTL models for NLP can often benefit from using corpora annotated for multiple tasks, particularly when these tasks are closely-related and can inform each other. Prominent examples of multi-task corpora include OntoNotes (Hovy et al., 2006), the Universal Dependencies Bank (Nivre et al., 2020) and STREUSLE (Schneider et al., 2018). Given the importance of multi-task corpora for many MTL setups, effective AL frameworks that support MTL are becoming crucial.
Unfortunately, most AL methods do not support annotations for more than one task. Multi-task AL (MT-AL) was proposed by Reichart et al. (2008) before the neural era, and adapted by Ikhwantri et al. (2018) to a neural architecture. Recently, Zhu et al. (2020) proposed an MT-AL model for slot filling and intent detection, focusing mostly on LSTMs (Hochreiter and Schmidhuber, 1997).
In this paper, we are the first to systematically explore MT-AL for large pre-trained Transformer models. Naturally, our focus is on closely-related NLP tasks, for which multi-task annotation of the same corpus is likely to be of benefit. In particular, we consider three challenging real-life multi-task scenarios, reflecting different relations between the participating NLP tasks: 1. Complementing tasks, where each task may provide valuable information to the other task: dependency parsing (DP) and named entity recognition (NER); 2. Hierarchically-related tasks, where one of the tasks depends on the output of the other: relation extraction (RE) and NER; and 3. Tasks with different annotation granularity: slot filling (SF, token level) and intent detection (ID, sentence level). We propose various novel MT-AL methods and tailor them to the specific properties of these scenarios, in order to properly address the underlying relations between the participating tasks. Our experimental results highlight a large number of patterns that can guide NLP researchers when annotating corpora with multiple annotation schemes using the AL paradigm.

Previous Work
This paper addresses a previously unexplored problem: multi-task AL (MT-AL) for NLP with pre-trained Transformer-based models. We hence start by covering AL in NLP and then proceed with multi-task learning (MTL) in NLP.
Recent works demonstrated that models like BERT can benefit from AL in low-resource settings (Ein-Dor et al., 2020; Grießhaber et al., 2020), and Bai et al. (2020) suggested basing the AL selection criterion on linguistic knowledge captured by BERT. Other works performed cost-sensitive AL, where instances may have different costs (Tomanek and Hahn, 2010; Xie et al., 2018). However, most previous works did not apply AL for MTL, which is our main focus.

Multi-task Learning (MTL) in NLP
MTL has become increasingly popular in NLP, particularly when the solved tasks are closely-related (Chen et al., 2018; Safi Samghabadi et al., 2020; Zhao et al., 2020). In some cases, the MTL model is trained in a hierarchical fashion, where information is propagated from lower-level (sometimes auxiliary) tasks to higher-level tasks (Søgaard and Goldberg, 2016; Rotman and Reichart, 2019; Sanh et al., 2019; Wiatrak and Iso-Sipila, 2020). In other cases, different labeled corpora can be merged to serve as multi-task benchmarks (McCann et al., 2017; Wang et al., 2018). This way, a single MTL model can be trained on multiple tasks, which are typically only distantly-related. This research considers the setup of closely-related tasks, where annotating a single corpus w.r.t. multiple tasks is a useful strategy.
Task Definition: Multi-task Active Learning (MT-AL)

In the MT-AL setup, the AL algorithm is provided with a textual corpus, where an initial (typically small) set of n_0 examples is labeled for t tasks.
The AL algorithm implements an iterative process, where at the i-th iteration the goal of the AL algorithm is to select n_i additional unlabeled examples that will be annotated on all t tasks, such that the performance of the base NLP model improves as much as possible with respect to all of them. While such a greedy strategy of gaining the most in the i-th iteration may not yield the best performance in subsequent iterations, most AL algorithms are greedy, and we hence follow this strategy here as well.
We focus on the standard setup of confidence-based AL, where unlabeled examples with the lowest model confidence are selected for annotation. Algorithm 1 presents a general sketch of such AL algorithms, in the context of MTL. This framework, first introduced by Reichart et al. (2008), is a simple generalization of the single-task AL (ST-AL) framework, which supports the annotation of data with respect to multiple tasks.
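The iterative protocol of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; `train`, `confidence`, and `annotate` are hypothetical stand-ins for the model trainer, the (possibly multi-task) confidence scorer, and the human annotators.

```python
def mt_al_loop(labeled, unlabeled, n_per_iter, iterations,
               train, confidence, annotate):
    """Confidence-based MT-AL: repeatedly annotate the least confident examples."""
    for _ in range(iterations):
        model = train(labeled)                       # retrain on all labels so far
        ranked = sorted(unlabeled, key=lambda x: confidence(model, x))
        selected = ranked[:n_per_iter]               # lowest confidence first
        labeled = labeled + [annotate(x) for x in selected]  # label on ALL t tasks
        unlabeled = ranked[n_per_iter:]
    return train(labeled), labeled
```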
As discussed in §1, we explore several variations of the MT-AL setup: independent tasks that inform each other (§5), hierarchically-related tasks, where one task depends on the output of the other (§6), and tasks with different annotation granularity, word and sentence level (§7). Before we can introduce the MT-AL algorithms for each of these setups, we first need to lay their shared foundations: the single-task and multi-task model confidence scores.

Confidence Estimation in Single-task and Multi-task Active Learning
We now introduce the confidence scores that we consider for single-task (ST-AL) and multi-task (MT-AL) active learning. These confidence scores are essentially the core of confidence-based AL algorithms (see Steps 2-3 of Algorithm 1). In Table 1 we provide a summary of the various ST-AL and MT-AL selection methods we explore.

Single-task Confidence Scores
We consider three confidence scores that have been widely used in ST-AL.

Random (ST-R) This baseline method simply assigns random scores to the unlabeled examples.

Entropy-based Confidence (ST-EC)
The single-task entropy-based confidence score is defined as follows. For sentence classification tasks such as ID, E(x) is based on the entropy over the class predictions of a sample x, divided by the log number of labels. In our token classification tasks (DP, NER, RE and SF), E(x) is based on the normalized sentence-level entropy (Kim et al., 2006), which allows us to estimate the uncertainty of the model for a given sequence of tokens x = (x_1 . . . x_m):

E(x) = 1 − (1/m) · Σ_{i=1..m} [ −Σ_{j=1..s} P(ŷ_i = y_j | x) · log P(ŷ_i = y_j | x) / log s ]    (2)

where m is the number of tokens, y_j is the j-th possible label, and s is the number of labels. We perform entropy normalization by averaging the token-level entropies, in order to mitigate the effect of the sentence length, and by dividing the score by the log number of labels. The resulting confidence score ranges from 0 to 1, where lower values indicate lower certainty.
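As a concrete sketch (assuming the confidence is one minus the length- and label-normalized entropy, so that lower values indicate lower certainty, consistent with the 0-to-1 range described above):

```python
import math

def entropy_confidence(token_probs):
    """ST-EC sketch: 1 minus the average normalized token-level entropy.

    token_probs: a list of per-token probability distributions, each a
    list of s label probabilities. Returns a score in [0, 1]."""
    m = len(token_probs)
    s = len(token_probs[0])
    total = 0.0
    for dist in token_probs:
        h = -sum(p * math.log(p) for p in dist if p > 0)  # token entropy
        total += h / math.log(s)                          # normalize by log #labels
    return 1.0 - total / m                                # average over m tokens
```

A uniform prediction yields confidence 0 (maximal uncertainty), a one-hot prediction yields 1.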
Dropout Agreement (ST-DA) Ensemble methods have proven effective for AL (see, e.g., Seung et al., 1992; Settles and Craven, 2008). In this paper, we derive a confidence score inspired by Reichart and Rappoport (2007). We start by creating k = 10 different models by performing dropout inference k times (Gal and Ghahramani, 2016). We then compute the single-task dropout agreement score for a sentence x by calculating the average token-level agreement across model pairs:

DA(x) = (2 / (k(k−1))) · Σ_{j=1..k} Σ_{j'=j+1..k} (1/m) · Σ_{i=1..m} 1[ŷ_i^j = ŷ_i^{j'}]    (3)

where ŷ_i^j is the predicted label of model j for the i-th token. The resulting scores range from 0 to 1, where lower values indicate lower certainty.
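The agreement score can be sketched as follows; this is an illustration of the pairwise-agreement idea, where the k dropout passes are assumed to have already produced the label sequences:

```python
from itertools import combinations

def dropout_agreement(predictions):
    """ST-DA sketch: average pairwise token-level agreement over k
    label sequences, one per stochastic dropout pass."""
    m = len(predictions[0])
    pair_scores = [
        sum(1 for ya, yb in zip(a, b) if ya == yb) / m
        for a, b in combinations(predictions, 2)
    ]
    return sum(pair_scores) / len(pair_scores)  # in [0, 1]
```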

Multi-task Confidence Scores
When deriving confidence scores for MT-AL, multiple design choices should be made. First, the confidence score of a multi-task model can be based on both tasks or only on one of them. We denote with MT-EC and MT-DA the confidence scores that are equivalent to ST-EC and ST-DA: the only (important) difference is that they are calculated for a multi-task model. For clarity, we will augment this notation with the name of the task according to which the confidence is calculated. For example, ST-EC-NER and MT-EC-NER are the EC scores calculated using the named entity recognition (NER) classifier of a single-task and a multi-task model, respectively.
We can hence evaluate MT-AL algorithms on cross-task selection, i.e., when the evaluated task is different from the task used for computing the confidence scores (and hence for sample selection). For example, we may evaluate the performance of a multi-task model, trained jointly on NER and DP, on the DP task when the confidence scores used by the MT-AL algorithm are based only on the NER classifier (MT-EC-NER).
We also consider a family of confidence scores for MT-AL that are computed with respect to all participating tasks (joint-selection scores). For this aim, we consider three simple aggregation schemes using the average, maximum, or minimum operators over the single-task confidence scores. For example, the multi-task average confidence (MT-AVG) averages for a sample x the entropy-based confidence scores over all t tasks:

MT-AVG(x) = (1/t) · Σ_{i=1..t} MT-EC-i(x)

The multi-task average dropout agreement score (MT-AVGDA) is similarly defined, but the averaging is over the MT-DA scores. Finally, the multi-task maximum (minimum) score, MT-MAX (MT-MIN), is computed in a similar manner to MT-AVG but with the max (min) operator taken over the task-specific entropy-based confidence scores.
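These aggregation schemes reduce to a few lines; a sketch, where selection picks the n samples with the lowest aggregated confidence:

```python
def aggregate_confidence(task_scores, mode="avg"):
    """MT-AVG / MT-MAX / MT-MIN over one sample's per-task confidences."""
    if mode == "avg":
        return sum(task_scores) / len(task_scores)
    if mode == "max":
        return max(task_scores)
    if mode == "min":
        return min(task_scores)
    raise ValueError(f"unknown mode: {mode}")

def select(samples, conf_by_task, n, mode="avg"):
    """Pick the n samples whose aggregated multi-task confidence is lowest."""
    return sorted(samples,
                  key=lambda x: aggregate_confidence(conf_by_task[x], mode))[:n]
```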
Beyond Direct Manipulations of Confidence Scores

Since our focus in this paper is on multi-task selection, we would like to consider additional selection methods which go beyond the simple methods in previous work. The common principle of these methods is that they are less sensitive to the actual values of the confidence scores and instead consider the relative importance of the example to the participating tasks.
First, we consider MT-PAR, which is based on the Pareto-efficient frontier (Lotov and Miettinen, 2008). We start by representing each unlabeled sample as a t-dimensional vector c, where c_i = ST-EC-i is the single-task confidence score for task i. Next, we select all samples whose corresponding vector belongs to the Pareto-efficient frontier. A vector c belongs to the frontier if no other vector c' satisfies both: 1. ∀i ∈ [t], c'_i ≤ c_i; and 2. ∃i ∈ [t], c'_i < c_i. If the number of samples in the frontier is smaller than the total number of samples to select (n), we re-iterate the procedure by removing the vectors of the selected samples and calculating the next Pareto points. If there are still p points to be selected but the number of the final Pareto points (f) exceeds p, we select every (f/p)-th point, ordered by the first axis.

Next, inspired by the field of information retrieval, we propose MT-RRF. This method allows us to consider the rank of each example with respect to the participating tasks, rather than the actual confidence values. We first calculate r_i, the ranked list of the i-th task, by ranking the examples according to their ST-EC-i scores, from lowest to highest. We next fuse the resulting t ranked lists into a single ranked list R, using the reciprocal rank fusion (RRF) technique (Cormack et al., 2009). The RRF score of an example x is computed as:

RRF(x) = Σ_{i=1..t} 1 / (k + r_i(x))

where r_i(x) is the rank of x in the i-th list and k is a constant, set to 60, as in the original paper. The final ranking is computed over the RRF scores of the examples, from highest to lowest. Higher-ranked examples are chosen first for annotation, as they have lower confidence scores.

Finally, MT-IND independently selects the n/t most uncertain samples according to each task by ranking the ST-EC scores and re-iterating if overlaps occur.
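Both selection schemes can be sketched compactly. The Pareto frontier keeps the non-dominated confidence vectors (lower means more uncertain), and RRF fuses the per-task rankings; this is an illustrative implementation, not the authors' code:

```python
def pareto_front(vectors):
    """Indices of non-dominated per-task confidence vectors (lower is preferred)."""
    front = []
    for i, c in enumerate(vectors):
        dominated = any(
            all(o[k] <= c[k] for k in range(len(c))) and
            any(o[k] < c[k] for k in range(len(c)))
            for j, o in enumerate(vectors) if j != i)
        if not dominated:
            front.append(i)
    return front

def rrf_ranking(per_task_confidences, k=60):
    """Fuse t per-task rankings with reciprocal rank fusion; most uncertain first."""
    n = len(per_task_confidences[0])
    scores = [0.0] * n
    for task_scores in per_task_confidences:
        ranked = sorted(range(n), key=lambda i: task_scores[i])  # low confidence first
        for rank, i in enumerate(ranked, start=1):
            scores[i] += 1.0 / (k + rank)
    return sorted(range(n), key=lambda i: -scores[i])
```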
We finally compare the selected samples of six of the MT-AL methods, after training a multi-task model for a single AL iteration on the DP and NER tasks (Figure 1). It turns out that while some of the methods tend to choose very similar example subsets for annotation (e.g., MT-IND and MT-RRF share 94% of the selected samples, and MT-AVG and MT-MIN share 84% of them), others substantially differ in their selection (e.g., MT-MAX shares only 16% of its selected samples with MT-MIN and 20% with MT-IND). This observation encourages us to continue investigating the impact of the various selection methods on MT-AL.

MT-AL for Complementing Tasks
We start by investigating MT-AL for two closely-related, complementing syntactic tasks: dependency parsing (DP) and named entity recognition (NER), which are often solved together by a joint multi-task model (Finkel and Manning, 2009; Nguyen and Nguyen, 2021).

Research Questions
We focus on three research questions. First, we would like to establish whether MT-AL methods are superior to ST-AL methods for multi-task learning. Our first two questions are hence: Q1.1: Is multi-task learning effective in this setup? and Q1.2: Is AL effective? If so, which AL strategy is better: ST-AL or MT-AL? Next, notice that in MT-AL the confidence score of an example can be based on one or more of the participating tasks. That is, even if the base model for which training examples are selected is an MTL model, the confidence scores used by the MT-AL algorithm can be based on one task or more (§4.2). Our third question is thus: Q1.3: Is it better to calculate confidence scores based on one of the participating tasks, or should we consider a joint confidence score, based on both tasks? (This question naturally generalizes when more than two tasks are involved.)

Data
We consider the English version of the OntoNotes 5.0 corpus (Hovy et al., 2006), consisting of seven textual domains: broadcast conversation (BC), broadcast news (BN), magazine (MZ), news (NW), bible (PT), telephone conversation (TC) and web (WB). Sentences are annotated with constituency-parse trees, named entities, part-of-speech tags, as well as other labels. We convert constituency-parse trees to dependency trees using the ElitCloud conversion tool. We do not report results in the PT domain, as it is not annotated for NER. Table 2 summarizes the number of sentences per split for the OntoNotes domains, as well as for the additional datasets used in our next setups.

Models
We consider two model types: single-task and multi-task models. Our single-task model (ST) consists of the 12-layer pre-trained BERT-base encoder (Devlin et al., 2019), followed by a task decoder. At first, we implemented a simple multi-task model (SMT), consisting of a shared 12-layer pre-trained BERT-base encoder followed by an independent decoder for each task. However, early results suggested that it is inferior to single-task modeling. We therefore implemented a more complex multi-task model (CMT), illustrated in Figure 2. This model consists of (shared) cross-task and task-specific modules, similar in nature to the architecture proposed by Lin et al. (2018). In particular, it uses the 8 bottom BERT layers as shared cross-task layers and employs t + 1 replications of the 4 top BERT layers, one replication for each task, as well as a shared cross-task replication.
The input text, as encoded by the shared 8 layers, e^S_{1:8}, is passed through the shared and non-shared 4-layer modules, e^S_{8:12} and e^{U_i}_{8:12}, respectively. The task classifiers are then fed with the output of the cross-task layers combined with the output of their task-specific layers, following the gating mechanism of Rotman and Reichart (2019):

g^i(x) = σ(W^i_g [e^S_{8:12}(x); e^{U_i}_{8:12}(x)] + b^i_g) ⊙ [e^S_{8:12}(x); e^{U_i}_{8:12}(x)]

where ; is the concatenation operator, ⊙ is the element-wise product, σ is the Sigmoid function, and W^i_g and b^i_g are the gating mechanism parameters. The combined vector g^i(x) is then fed to the i-th task-specific decoder. All implementations are based on HuggingFace's Transformers package (Wolf et al., 2020). For all models, the DP decoder is based on the Biaffine parser (Dozat and Manning, 2017) and the NER decoder is a simple linear classifier.

Table 2: Data statistics. We report the number of sentences in the original splits for each pair of tasks.

Training and Hyper-parameter Tuning
We consider the following hyper-parameters for the AL experiments. At first, we randomly sample 2% of the original training set to serve as the initial labeled examples in all experiments and treat the rest of the training examples as unlabeled. We also fix our development set to be twice the size of our initial training set, by randomly sampling examples from the original development set. We then run each AL method for 5 iterations, where at each iteration the algorithm selects an unlabeled set of the size of its initial training set (that is, 2% of the original training set) for annotation. We then reveal the labels of the selected examples and add them to the training set of the next iteration. At the beginning of the final iteration, our labeled training set consists of 10% of the original training data.
In each iteration, we train the models with 20K gradient steps with an early stopping criterion according to the development set. We report LAS scores for DP and F1 scores for NER. For DP, we measure our AL confidence scores on the unlabeled edges. When performing multi-task learning, we set the stopping criterion as the geometric mean of the task scores (F1 for NER and LAS for DP). We optimize all parameters using the ADAM optimizer (Kingma and Ba, 2015) with a weight decay of 0.01, a learning rate of 5e-5, and a batch size of 32. For label smoothing (see below), we use α = 0.2. Following Dror et al. (2018), we use the t-test for measuring statistical significance (p-value = 0.05).

Results
Model Architecture (Q1.1) We would first like to investigate the performance of the single-task and multi-task models in the full training (FT) and, more importantly, in the active learning (AL) setups. We hence compare three architectures: the single-task model (ST), the simple multi-task model (SMT), and our complex multi-task model (CMT). We train each model for DP and for NER on the six OntoNotes domains using the cross-entropy (CE) objective function, or with the label smoothing objective (LS; Szegedy et al., 2016), which has been demonstrated to decrease calibration errors of Transformer models (Desai and Durrett, 2020; Kong et al., 2020). ST-AL is performed with ST-EC and MT-AL with MT-AVG.
Table 3 reports the average scores (Avg column) over all domains and the number of domains where each model achieved the best results (Best). The results raise three important observations. First, SMT is worse on average than ST in all setups, suggesting that vanilla MT is not always better than ST training. Second, our CMT model achieves the best scores in most cases. The only case where it is inferior to ST, but not to SMT, is on AL with CE training. However, when training with LS, it achieves results comparable to or higher than those of ST on AL.
Third, when comparing CE to LS training, LS clearly improves the average scores of all models (besides one case). Interestingly, the improvement is more significant in the AL setup than in the FT setup. We report that when expanding these experiments to all AL selection methods, LS was found very effective for both tasks, outperforming CE in most comparisons, with an average improvement of 1.8% LAS for DP and 0.9 F1 for NER.
Multi-task vs. Single-task Performance (Q1.2) We next ask whether MT-AL outperforms strong ST-AL baselines. Figure 3 presents for every task and domain the performance of the per-domain best ST-AL and MT-AL methods after the final AL iteration. Following our observations in Q1.1, we train all models with the LS objective and base the multi-task models on the effective CMT model.
Although there is no single method, MT-AL or ST-AL, which performs best across all domains and tasks, MT-AL seems to perform consistently better. The figure suggests that MT-AL is effective for both tasks, outperforming the best ST-AL methods in 4 of 6 DP domains (results are not statistically significant; the average p-value is 0.19) and in 5 of 6 NER domains (results for 3 domains are statistically significant). While the average gap between MT-AL and ST-AL is small for DP (0.28% LAS), in NER it is as high as 2.4 F1 points. Table 4 reports the percentage of comparisons where the MT-AL methods are superior. On average, the two method types are on par when comparing Within-task performance. More interestingly, for Cross-task performance MT-AL methods are clearly superior, with around 90% wins (87% of the cases statistically significant). Finally, the Average column also supports the superiority of MT-AL methods, which perform better in 79% of the cases (all results are statistically significant). These results demonstrate the superiority of MT-AL, particularly (and perhaps unsurprisingly) when both tasks are considered.
Single-task vs. Joint-task Selection (Q1.3) Next, we turn to our third question, which compares single-task vs. joint-task confidence scores. That is, we ask whether MT-AL methods that base their selection criterion on more than one task are better than ST-AL and MT-AL methods that compute confidence scores using a single task only.
To answer this question, we compare the two best ST-AL and MT-AL methods that are based on single-task selection to the two best joint-task selection MT-AL methods. As previously, all methods employ the LS objective. Table 5 reports the average scores (across domains) of each of these methods for DP, NER, and the average task score, based on the final AL iteration.
While the method that performs best on average on both tasks is MT-MAX, a joint-selection method, the second-best method is MT-EC-NER, a single-task selection method, and the gap is only 0.26 points. Not surprisingly, performance is higher when the evaluated task also serves as the task on which the confidence score is based, either solely or jointly with another task.
Although the joint-selection methods are effective for both tasks, we cannot decisively conclude that they are better than MT-AL methods that perform single-task selection. However, we do witness another confirmation of our answer to Q1.2, as all presented MT-AL methods perform better on average on both tasks (the Average column) than the ST-AL methods.
Overconfidence Analysis Originally, we trained our models with the standard CE loss. However, our early experiments suggested that such CE-based training yields overconfident models, which is likely to severely harm confidence-based AL methods. While previous work demonstrated the positive impact of label smoothing (LS) on model calibration, to the best of our knowledge, the resulting impact on multi-task learning has not been explored, specifically not in the context of AL. We next analyze this impact, which is noticeable in our above results, in more detail.
Figure 5 presents sentence-level confidence scores as a function of sentence-level accuracy. Following this analysis, we turn to investigate the impact of LS on model predictions in MT-AL. Inspired by Thulasidasan et al. (2019), who defined the overconfidence error (OE) for classification tasks, we first slightly generalize OE to support sentence-level scores for token classification tasks. Given N sentences, for each sentence x we start by calculating its accuracy score acc(x) over its tokens. The confidence score conf(x) is set to the confidence score of the corresponding AL method. We then define OE as:

OE = (1/N) · Σ_x conf(x) · max(conf(x) − acc(x), 0)

In essence, OE penalizes predictions according to the gap between their confidence score and their accuracy, but only when the former is higher.
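Under our reading of this sentence-level generalization (penalizing the conf − acc gap only when confidence exceeds accuracy, weighted by the confidence itself), OE can be computed directly; a sketch:

```python
def overconfidence_error(confidences, accuracies):
    """Sentence-level OE sketch: average of conf * max(conf - acc, 0)
    over N sentences; zero when the model is never overconfident."""
    n = len(confidences)
    return sum(c * max(c - a, 0.0)
               for c, a in zip(confidences, accuracies)) / n
```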
In Table 6 we compare the OE scores of ST-EC trained with the LS objective to 3 alternatives: ST-EC trained with the CE objective, ST-EC with the post-processing method temperature scaling (TS), and ST-DA trained with the CE objective. Both TS and dropout inference have been shown to improve confidence estimates (Guo et al., 2017; Ovadia et al., 2019), and hence serve as alternatives to LS in this comparison. OE scores are reported on the unlabeled set (given the true labels in hindsight) at the final AL iteration for both tasks. Additionally, OE scores for MT-AVG and MT-AVGDA are also reported and averaged over both tasks.
The results are conclusive: LS is the least overconfident method, achieving the lowest OE scores in all 18 setups but one. While LS achieves a proportionate reduction in error (PRE) of between 57.2% and 96.7% compared to the standard CE method, DA achieves at most a PRE of 51.9% and TS seems to have almost no effect. These results confirm that LS is highly effective in reducing overconfidence for BERT-based models, and we are able to show for the first time that such a reduction also holds for multi-task models.

MT-AL for Hierarchically-related Tasks
Until now, we have considered tasks (DP and NER) that are mutually informative but can be trained independently of each other. However, other multi-task learning scenarios involve a task that depends on the output of another task. A prominent example is the relation extraction (RE) task, which depends on the output of the NER task, since the goal of RE is to identify and classify relations between named entities. Importantly, if the NER part of the model does not perform well, this harms the RE performance as well. Sample selection in such a setup should hence reflect the hierarchical relation between the tasks.

Selection Methods
Since the quality of the classifier for the independent task (NER) now affects also the quality of the classifier for the dependent task (RE), the confidence of each of the tasks may get a different relative importance value. Although this can in principle also be true for independent tasks (§5), explicitly accounting for this property seems more crucial in the current setup.
We hence modify four of our joint-selection methods (§4) to reflect the inherent asymmetry between the tasks, by introducing a scaling parameter 0 ≤ β ≤ 1:

a) MT-AVG is now calculated as follows:

MT-AVG(x) = β · Conf_RE(x) + (1 − β) · Conf_NER(x)

b) MT-RRF is calculated similarly, by multiplying the RRF term of RE by β and that of NER by 1 − β.

c) MT-IND is calculated by independently choosing 100 · β% of the selected samples according to the RE scores and 100 · (1 − β)% according to the NER scores.

d) MT-PAR is computed by restricting the first Pareto condition for the position of the RE confidence score: c'_RE ≤ q_β · c_RE, where q_β is the value of the β-quantile of the RE confidence scores. We apply such a condition if β < 0.5. Otherwise, if β > 0.5, the condition is applied to the NER component, and when β is equal to 0.5, the original Pareto method is used. Since we restrict the condition to only one of the tasks, fewer samples will meet this condition (since 0 ≤ q_β ≤ 1), and the Pareto frontier will include more samples that have met the condition for the second task.
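The β-scaled variants of MT-AVG and MT-IND can be sketched as follows; this is illustrative code, where `conf_re` and `conf_ner` are assumed mappings from samples to their per-task confidence scores:

```python
def beta_mt_avg(conf_re, conf_ner, beta):
    """beta-scaled MT-AVG: weight the dependent task (RE) by beta."""
    return beta * conf_re + (1.0 - beta) * conf_ner

def beta_mt_ind(samples, conf_re, conf_ner, n, beta):
    """beta-scaled MT-IND: ~beta*n samples by RE confidence, the rest by NER,
    skipping samples already chosen for RE."""
    n_re = round(beta * n)
    chosen = sorted(samples, key=lambda x: conf_re[x])[:n_re]
    rest = [x for x in sorted(samples, key=lambda x: conf_ner[x])
            if x not in chosen]
    return chosen + rest[:n - n_re]
```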

Research Questions
In our experiments, we would like to explore two research questions: Q2.1: Which MT-AL selection methods are most suitable for this setup? and Q2.2: What is the best balance between the participating tasks?
Since RE fully relies on the output of NER, we limit our experiments only to joint multi-task models and do not include single-task models.

Experimental Setup
We experiment with the span-based joint NER and RE BERT model of Li et al. (2021). Experiments were conducted on five diverse datasets: NYT24 and NYT29 (Nayak and Ng, 2020), SciERC (Luan et al., 2018), WebNLG (Gardent et al., 2017), and WLP (Kulkarni et al., 2018). The AL setup is identical to that of §5.4. Other hyper-parameters that were not mentioned before are identical to those of the original implementation.

Results
Best Selection Method (Q2.1) We start by identifying the best selection method for this setup. Table 7 summarizes the per-task average score for the best β value of each method across the five datasets.
We observe three interesting patterns. First, MT-AL is very effective in this setup for the dependent task (RE), while for the independent task (NER), random selection does not fall too far behind. Second, all MT-AL methods achieve better performance for higher β values, giving more weight to the RE confidence scores during the selection process. This is an indication that the selection method should indeed reflect the asymmetric nature of the tasks. Third, overall, MT-IND is the best performing method, averaging first in NER and in RE, while MT-AVG, MT-RRF and MT-PAR achieve similar results in both tasks.
Scaling Configuration (Q2.2) Figure 6 presents the average F1 scores of the four joint-selection methods, as well as the random selection method MT-R, as a function of β (the relative weight of the RE confidence). First, we notice that joint selection outperforms random selection in the WebNLG domain only for RE (the dependent task) but not for NER (except when β approaches 1). Second, and more importantly, β = 1, that is, selecting examples only according to the confidence score of the RE (dependent) task, is most beneficial for both tasks (Q2.1). We hypothesize that this stems from the fact that the RE confidence score conveys information about both tasks and that this combined information provides a stronger signal with respect to the NER (independent) task, compared to the NER confidence score. Interestingly, the positive impact of higher values of β is more prominent for NER, even though this means that the role of the NER confidence score is downplayed in the sample selection process.
For the individual selection methods, we report that MT-AVG and MT-IND achieve higher results as β increases, while MT-PAR and MT-RRF peak at β = 0.7 and β = 0.8, respectively, and then drop by 0.2 F1 points for NER and 0.9 F1 points for RE on average.

MT-AL for Tasks with Different Annotation Granularity
NLP tasks are defined over different textual units, the most common examples being sentence-level and token-level tasks. Our last investigation considers the scenario of two closely-related tasks that are of different granularity: slot filling (SF, token-level) and intent detection (ID, sentence-level).
Due to the different annotation nature of the two tasks, we have to define the cost of example annotation with respect to each. Naturally, there is no single correct way to quantify these costs, but we aim to propose a realistic model. We denote the cost of annotating a sample for SF with Cost_SF = m + tp · nt, where m is the number of tokens in the sentence, tp is a fixed token annotation cost (we set tp = 1) and nt is the number of entities. The cost of annotating a sample for ID is denoted with Cost_ID = m + ts, where ts is a fixed sentence cost (we set ts = 3). Our solution (see below) allows some examples to be annotated with respect to only one of the tasks. For examples that are annotated with respect to both tasks, we consider an additive joint cost where the token-level term m is counted only once: JCost = Cost_SF + Cost_ID − m. In our experiments, we allow a fixed annotation budget of B = 500 per AL iteration.
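This cost model is easy to state in code (the function names are ours, for illustration):

```python
def cost_sf(m, nt, tp=1):
    """Slot-filling (token-level) cost: m tokens plus tp per entity."""
    return m + tp * nt

def cost_id(m, ts=3):
    """Intent-detection (sentence-level) cost: m tokens plus a fixed fee."""
    return m + ts

def joint_cost(m, nt, tp=1, ts=3):
    """Annotating both tasks: the shared token-reading term m counts once."""
    return cost_sf(m, nt, tp) + cost_id(m, ts) - m
```

For a 10-token sentence with 2 entities, annotating SF costs 12, ID costs 13, and both jointly cost 15 rather than 25.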

Methods
We consider three types of decision methods: Greedy Active Learning (GRD_AL): AL methods that at each iteration greedily choose the least confident samples until the budget limit is reached; Binary Linear Programming Active Learning (BLP_AL): AL methods that at each iteration minimize the sum of confidence scores of the chosen samples subject to the budget constraints, solving the resulting optimization problem (see below) with a BLP solver;11 and Binary Linear Programming (BLP): an algorithm that, after training on the initial training set, chooses all samples at once by solving the same constrained optimization problem as in BLP_AL.
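One reasonable reading of the GRD_AL step can be sketched as follows: sort the unlabeled pool by confidence and take samples in order until the next one no longer fits in the remaining budget. The sample representation and function name are illustrative assumptions.

```python
def greedy_select(pool, budget):
    """Illustrative GRD_AL step: repeatedly take the least confident
    remaining sample until the budget limit is reached."""
    chosen = []
    for sample in sorted(pool, key=lambda s: s["conf"]):
        if sample["cost"] > budget:
            break  # budget limit reached
        chosen.append(sample)
        budget -= sample["cost"]
    return chosen
```

Unlike the BLP variants below, this greedy step commits to each sample locally and may leave budget unused that a global optimizer could exploit.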
For each of these categories, we experiment with four families of AL methods: a) Unrestricted Disjoint Sets (UDJS): This selection method is based on the non-aggregated multi-task confidence scores, where each sample can be chosen for annotation on either task or on both. The UDJS optimization problem aims to maximize the uncertainty scores (1 − Conf_t(x)) of the selected samples subject to the budget and selection constraints, where U is the unlabeled set, T is the set of tasks, Conf_t is the MT-EC-t confidence score, X_t(x) is a binary indicator of the annotation of sample x on task t, and Y(x) is a binary indicator of the annotation of x on all tasks.

11 https://www.python-mip.com/.
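The UDJS objective and its budget constraint can be illustrated on a toy pool by brute-force enumeration of the binary indicators X_t(x) (the paper instead solves the program with a BLP solver, python-mip). This is an assumption-based sketch: the dictionary layout, task names, and function name are ours, and the joint-cost discount follows the cost model of this section.

```python
from itertools import product

def udjs_brute_force(pool, budget):
    """Enumerate all assignments of X_t(x) in {0, 1} and return the
    feasible one maximizing total uncertainty sum of (1 - Conf_t(x))."""
    tasks = ("sf", "id")
    best_val, best_sel = 0.0, []
    for bits in product((0, 1), repeat=2 * len(pool)):
        cost, value, sel = 0.0, 0.0, []
        for i, x in enumerate(pool):
            flags = bits[2 * i: 2 * i + 2]
            for t, flag in zip(tasks, flags):
                if flag:
                    value += 1.0 - x["conf"][t]  # uncertainty 1 - Conf_t(x)
                    cost += x["cost"][t]
                    sel.append((x["id"], t))
            if all(flags):        # Y(x) = 1: x is annotated on both tasks
                cost -= x["m"]    # count the shared token-reading term once
        if cost <= budget and value > best_val:
            best_val, best_sel = value, sel
    return best_val, best_sel
```

Enumeration is exponential and only serves to make the objective concrete; a BLP solver handles realistic pool sizes.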
Notice that this formulation may yield examples annotated for only one of the tasks, although this is unlikely, particularly under an iterative protocol in which the confidence scores of the models are updated after each iteration.
b) Equal Budget Disjoint Sets (EQB-DJS): This strategy is similar to UDJS above, except that the budget is equally divided between the two tasks and the optimization problem is solved for each of them separately. If a sample is chosen to be annotated for both tasks, we update its cost according to the joint cost and re-solve the optimization problems until the entire budget is used.
c-f) Joint-task Selection: A sample can only be chosen to be annotated on both tasks, where confidence scores are calculated using a multi-task aggregation. The BLP optimization problem is formulated analogously, where Conf is calculated by c) MT-AVG, d) MT-MAX, e) MT-MIN, or f) MT-RRF.12 g-h) Single-task Confidence Selection (STCS): A sample can only be chosen to be annotated on both tasks, where the selection process aims to maximize the uncertainty scores of only one of the tasks: g) STCS-SF or h) STCS-ID. Similarly to Joint-task Selection, the budget constraints are applied to the joint costs.
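The four joint aggregations can be sketched compactly. The first three follow directly from their names; for MT-RRF we assume a reciprocal-rank-fusion-style score over per-task uncertainty ranks, with the conventional constant k = 60 as our own assumption.

```python
def mt_avg(confs):
    """MT-AVG: the mean of the per-task confidence scores."""
    return sum(confs) / len(confs)

def mt_max(confs):
    """MT-MAX: the most confident task dominates."""
    return max(confs)

def mt_min(confs):
    """MT-MIN: the least confident task dominates."""
    return min(confs)

def mt_rrf(ranks, k=60):
    """Reciprocal-rank-fusion-style score over per-task uncertainty
    ranks (1 = most uncertain). Higher score = more valuable sample.
    The constant k = 60 is an assumption, not taken from the paper."""
    return sum(1.0 / (k + r) for r in ranks)
```

Under a BLP objective that minimizes summed confidence, MT-MIN favors samples on which at least one task is very uncertain, while MT-MAX only selects samples on which both tasks are uncertain.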

Research Questions
We focus on three research questions: Q3.1: Does BLP optimization improve upon greedy selection? Q3.2: Do optimization-based selection and active learning have a complementary effect? And Q3.3: Is it better to annotate all samples on both tasks or to construct a disjoint annotated training set?

Experimental Setup
We conduct experiments on two prominent datasets, one of which is ATIS (Price, 1990), with two pre-trained language encoders: BERT-base (Devlin et al., 2019) and RoBERTa-base (Liu et al., 2019b). Our code is largely based on the implementation of Zhu and Yu (2017).13 We run the AL process for 5 iterations with an initial training set of 50 random samples and a fixed-size development set of 100 random samples. We train all models for 30 epochs per iteration, with an early stopping criterion and with label smoothing (α = 0.1). Other hyperparameters were set to their default values.
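For completeness, label smoothing with α = 0.1 can be computed as below. We assume the standard formulation (the smoothed target is (1 − α) times the one-hot vector plus α spread uniformly over all classes); the function name is ours.

```python
def smooth_labels(gold_index: int, num_classes: int, alpha: float = 0.1):
    """Standard label smoothing: the gold class keeps 1 - alpha of the
    probability mass, and alpha is spread uniformly over all classes."""
    base = alpha / num_classes
    target = [base] * num_classes
    target[gold_index] += 1.0 - alpha
    return target
```

Smoothing keeps the model from producing near-degenerate output distributions, which also makes the entropy-based confidence scores used for selection less saturated.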

Results
Optimization-based Selection and AL (Q3.1 and Q3.2) To answer the first two questions, Table 8 shows the final F1 performance on both tasks when selection is done with the best selection method of each decision category: MT-RRF_GRD_AL, MT-MIN_BLP_AL, and MT-AVG_BLP. The results confirm that BLP optimization (with AL) is indeed superior to greedy AL selection, surpassing it in all setups (two tasks, two datasets, and two pre-trained language encoders; 5 of the comparisons are statistically significant).
The answer to Q3.2 is also mostly positive. BLP optimization and iterative AL have a complementary effect, as MT-MIN_BLP_AL in most cases achieves higher performance than MT-AVG_BLP (50% of the results are statistically significant).
In fact, the answer to both questions holds for all selection methods, as all perform best with our novel BLP_AL decision method. Second is BLP, indicating that the BLP formulation is highly effective for MT setups under different budget constraints. Last is the standard greedy procedure (GRD_AL), which is commonly used in the AL literature.

Joint-task Selection (Q3.3)
To answer Q3.3, we perform a similar comparison in Table 9, but now for the most prominent method of each joint-task selection family. The results indicate that MT-MIN_BLP_AL, which enforces joint annotation, is better than allowing unrestricted disjoint annotation (UDJS_BLP_AL) and than equally splitting the budget between the two tasks (EQB-DJS_BLP_AL), with 9 of the 16 comparisons statistically significant. We hypothesize that the superiority of MT-MIN_BLP_AL stems from two main reasons: 1. Integrating the joint aggregation confidence score into the optimization objective provides a better signal than the single-task confidence scores; 2. A joint dataset in which all samples are annotated on both tasks, rather than a disjointly annotated (and larger) dataset, allows the model to achieve better performance, since the two tasks are closely related.
Finally, we also report that ST-AL experiments led to poor performance. On average, the ST-AL methods trail the best MT-AL method, MT-MIN_BLP_AL, by 8.8 (SF) and 4.7 (ID) F1 points. Interestingly, we also observe that selection according to ID (STCS-ID_BLP_AL) led to better average results on both tasks than selection according to SF (STCS-SF_BLP_AL), suggesting that, similarly to § 6, selection according to the higher-level task often yields better results.

Overall Comparison
As a final evaluation, we compare the performance of our proposed joint-selection MT-AL methods across all setups. In our first setup (§5) we implemented all MT-AL selection methods with greedy selection, considering uniform task importance. Thereafter (§6 and §7), we showed how these selection methods can be modified in the next two setups, by either integrating non-uniform task weights or replacing greedy selection with a BLP objective over confidence scores. However, not all of our MT-AL selection methods can be modified in such ways.14 We hence report our final comparison under the conditions of the first setup: assuming uniform task weights and selecting samples greedily. Table 10 summarizes the average performance of the MT-AL methods in our three proposed setups: complementing tasks (DP + NER), hierarchically related tasks (NER + RE), and tasks of different annotation granularity (SF + ID).
Each selection method has its pros and cons, which we outline in our final discussion: MT-R is, unsurprisingly, the worst performing method on average, as it makes no use of the model's predictions. Nevertheless, it performs quite well on the third setup (SF + ID) compared to the other methods trained with the greedy decision method, the least successful decision method for this setup. Next, MT-AVG performs well when the tasks are of equal importance (DP + NER), but achieves only moderate performance on the other setups.
Surprisingly, MT-MAX is highly effective despite its simplicity. It is mostly beneficial for the first two setups (DP + NER and NER + RE), where the tasks are of the same annotation granularity. It is the third best method overall and does not lag substantially behind the best method, MT-PAR. Interestingly, MT-MIN, which offers a complementary perspective to MT-MAX, is on average the worst MT-AL method excluding MT-R, and is mainly beneficial for the first setup (DP + NER).
The next MT-AL method, MT-PAR, seems to capture well the joint confidence space of the task pairs. It is the best method on average, achieving high scores in all setups. However, when incorporated with other training techniques, such as non-uniform task weights (in the second setup), it is outperformed by the other MT-AL methods. MT-RRF does not lag far behind MT-PAR, achieving similar results on most tasks, excluding the RE and ID tasks, which are the higher-level tasks of their setups. Finally, MT-IND does not excel on three of the four tasks of the first two setups, while achieving the best average results on NER when jointly trained with RE. Furthermore, it demonstrates strong performance on the third setup, where the tasks are of different annotation granularity, justifying independent annotation selection in this case.

Conclusions
We considered the problem of multi-task active learning for pre-trained Transformer-based models. We posed multiple research questions concerning the impact of multi-task modeling, multi-task selection criteria, overconfidence reduction, the relationships between the participating tasks, and budget constraints, and presented a systematic algorithmic and experimental investigation to answer these questions. Our results demonstrate the importance of MT-AL modeling in three challenging real-life scenarios, corresponding to diverse relations between the participating tasks. In future work, we plan to explore setups with more than two tasks and to consider language generation and multilingual modeling.

Figure 1: The percentage of shared selected samples between pairs of MT-AL selection methods (see experimental details in the text).

Figure 3: Performance of the best ST-AL vs. the best MT-AL method per domain (Q1.2).

Figure 4:

Figure 5: Sentence-level accuracy as a function of entropy-based confidence, for DP (left) and for NER (right), when training with the CE objective. The heat maps represent the point frequency.

Figure 6: Average F1 scores over four joint-selection methods as a function of β (the relative weight of the RE confidence).

Table 1: Summary of the ST-AL and MT-AL selection methods explored in this paper.

Table 4: A comparison of MT-AL vs. ST-AL on within-task, cross-task, and average performance. Values indicate the percentage of comparisons in which MT-AL methods were superior.

Table 7: Hierarchical MT-AL results. We report the best average F1 results over all five datasets for the best β configuration per method.