Improving Multitask Retrieval by Promoting Task Specialization

Abstract In multitask retrieval, a single retriever is trained to retrieve relevant contexts for multiple tasks. Despite its practical appeal, naive multitask retrieval lags behind task-specific retrieval, in which a separate retriever is trained for each task. We show that it is possible to train a multitask retriever that outperforms task-specific retrievers by promoting task specialization. The main ingredients are: (1) a better choice of pretrained model—one that is explicitly optimized for multitasking—along with compatible prompting, and (2) a novel adaptive learning method that encourages each parameter to specialize in a particular task. The resulting multitask retriever is highly performant on the KILT benchmark. Upon analysis, we find that the model indeed learns parameters that are more task-specialized compared to naive multitasking without prompting or adaptive learning.


Introduction
A standard approach to knowledge-intensive language tasks such as question answering (QA), entity disambiguation, and fact verification is retrieval-based. Given a query, a retriever is used to efficiently search a large knowledge base (KB) and retrieve relevant "contexts", typically in the form of short paragraphs. How these contexts are used is task-specific (e.g., entity disambiguation takes the title of the article in which the top retrieved context is found; QA predicts an answer from the contexts through a reader model). In this paper, we focus on the retrieval step.
In particular, we focus on multitask retrieval. In this setting, there are K > 1 downstream tasks that benefit from retrieval from a shared KB, and a single retriever is tasked with performing retrieval for all K tasks. Multitask retrieval contrasts with task-specific retrieval, in which a separate retriever is trained for each task, and has compelling advantages such as model simplicity (i.e., we can use the same model for all tasks rather than having to design potentially different models for different tasks) and memory efficiency at test time (a K-times smaller footprint).
Despite the practical appeal, the performance of multitask retrieval has been underwhelming, severely limiting its real-world applicability. Specifically, previous work by Maillard et al. (2021) trains DPR (Karpukhin et al., 2020) on the union of all training datasets in the KILT benchmark (Petroni et al., 2021), but the model is outperformed by task-specific retrievers in 5 out of 8 tasks (page-level R-precision, validation split). In our experiments, we find that it is in fact outperformed in all tasks (often by substantial margins) when a stronger task-specific baseline is used. This result is surprising as well as disappointing given the usual benefits of multitask learning (e.g., data efficiency, reduced overfitting) when properly done.
We debunk the previous negative result by presenting a multitask retriever that outperforms task-specific retrievers. The main theme of our work is that it is beneficial to explicitly promote task specialization. A first important source of improvement is a better choice of pretrained model, one that is explicitly optimized for multitasking. Specifically, instead of the standard retrieval encoder BERT (Kenton and Toutanova, 2019), we use T5 (Raffel et al., 2019), which includes multitasking in its pretraining stage. Importantly, we use the same prompting as in pretraining (i.e., a task indicator) to reduce the gap between pretraining and finetuning for multitask retrieval. A second source of improvement is a novel adaptive learning method in which we adaptively upweight the task gradients by the parameter's sensitivity to these tasks to encourage task specialization.
The resulting multitask retriever is highly performant on the KILT benchmark. We achieve 73.74% average page-level R-precision on KILT validation data and 72.84% average page-level R-precision on KILT test data. Upon analysis, we find that the model indeed learns parameters that are more task-specialized compared to naive multitasking without prompting or adaptive learning.
Related Work

Maillard et al. (2021) propose multitask retrieval largely as an extension of DPR. Their best model is a BERT-based dual encoder trained on the union of 8 retrieval tasks. While it performs comparably with task-specific DPRs on some tasks, it generally lags behind. In this work, we use stronger task-specific retrievers based on T5 and ANCE (Xiong et al., 2021), all of which substantially outperform their multitask retriever. We argue that this negative result undermines the case for multitask retrieval and that it is crucial to demonstrate competitive performance. Our main contribution is producing this demonstration.
We emphasize that achieving competitive multitask retrieval in practice is a highly difficult empirical problem. One might think that it is simply an application of multitask learning, which has no shortage of sophisticated techniques. These techniques typically modify the gradients during training, such as gradient surgery (Yu et al., 2020), gradient vaccine (Wang et al., 2020), common gradient descent (Piratla et al., 2021), and GradNorm (Chen et al., 2018). We experiment with these techniques and find that they do not help, which motivates us to develop one that does.
Our technical contribution is a new method for multitask learning based on the notion of task sensitivity. Given a loss function J(θ), the sensitivity of the i-th parameter to the loss at θ is defined as the absolute change in the loss when θ_i is set to zero, which can be approximated by a first-order Taylor expansion as

|J(θ) − J(θ_{−i})| ≈ |θ_i ∇_{θ_i} J(θ)|

where θ_{−i} is equal to θ except that its i-th element is zero. This quantity has been used in the context of model pruning as a way of identifying weakly sensitive weights (Molchanov et al., 2016, 2019; Michel et al., 2019; Liang et al., 2021) and updating them more aggressively (Liang et al., 2022). In contrast, we use the quantity to identify weights that are strongly sensitive to a particular task and increase their sensitivity even further, intuitively to achieve per-parameter task specialization. To our knowledge, we are the first to use parameter sensitivity for multitask learning.

We briefly differentiate our work from other recent works on multitask retrieval. Chen et al. (2022) present CorpusBrain, an autoregressive multitask retriever trained in largely the same style as GENRE (De Cao et al., 2021) with excellent performance. Autoregressive retrieval has different pros and cons compared to dense retrieval, which is our setting; it can be more memory- and runtime-efficient, but it does not "read" the description of the target and is thus not suited for retrieval tasks that require involved reasoning over query-target pairs (e.g., zero-shot entity retrieval (Logeswaran et al., 2019)). Thus we consider the contribution of CorpusBrain to be at least partially orthogonal to ours. Nevertheless, we show in experiments that our model outperforms CorpusBrain in a similar training setting. Asai et al. (2022) propose instruction-based retrieval in which the retriever is given an intent as well as a query to find the intended target. While this is a form of multitask retrieval, the problem formulation is different and it is evaluated on its own benchmark.
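As a concrete numerical illustration of the Taylor approximation above (helper names are ours, not the paper's), the estimate |θ_i ∇_{θ_i} J(θ)| can be checked against the exact zeroed-out difference; for a linear loss the two coincide:

```python
def sensitivity_exact(loss, theta, i):
    # |J(theta) - J(theta with the i-th coordinate zeroed)|
    zeroed = list(theta)
    zeroed[i] = 0.0
    return abs(loss(theta) - loss(zeroed))

def sensitivity_approx(grad, theta, i):
    # First-order Taylor estimate: |theta_i * dJ/dtheta_i|
    return abs(theta[i] * grad(theta)[i])

# For a linear loss J(theta) = 2*theta_0 + 3*theta_1 the estimate is exact.
loss = lambda t: 2 * t[0] + 3 * t[1]
grad = lambda t: [2.0, 3.0]
theta = [0.5, -1.0]
```

For curved losses the two values differ by higher-order terms, but the approximation needs only one backward pass, which is what makes it practical during training.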

Method
We build on the well-established framework of the dual encoder (Bromley et al., 1993; Huang et al., 2013; Gillick et al., 2019; Karpukhin et al., 2020, inter alia). Let X denote the set of all queries and Y the set of all targets (i.e., the KB). First, we assume mappings text_X : X → V+ and text_Y : Y → V+, where V denotes the vocabulary, to "verbalize" queries and targets. Second, we assume encoders enc^X_θ, enc^Y_θ : V+ → R^d with parameters θ defining the relevance score function

s_θ(x, y) = ⟨enc^X_θ(text_X(x)), enc^Y_θ(text_Y(y))⟩

Third, assuming iid samples (x_1, y_1) . . . (x_N, y_N) ∼ pop, we learn the parameters by noise contrastive estimation (NCE):

J(θ) = −(1/N) Σ_{i=1}^N log [ exp(s_θ(x_i, y_i)) / Σ_{y ∈ Y_i} exp(s_θ(x_i, y)) ]

where Y_i ⊂ Y satisfying y_i ∈ Y_i is a set containing the gold and negative targets for the i-th labeled example. At test time we pre-encode every y ∈ Y as v_y = enc^Y_θ(text_Y(y)) and efficiently compute the highest-scoring target ŷ(x) = arg max_{y ∈ Y} ⟨enc^X_θ(text_X(x)), v_y⟩ for any x ∈ X by maximum inner product search.
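The per-example NCE term above can be sketched as follows (a minimal sketch; the score list holds the gold score plus the scores of the negatives in Y_i):

```python
import math

def nce_loss(scores, gold_index=0):
    # -log softmax probability of the gold target among gold + negatives,
    # computed with the standard max-shift for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[gold_index]
```

With K equal scores the loss is log K; as the gold score grows relative to the negatives, the loss approaches zero, which is what drives the encoders to separate gold from negative targets.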
In multitask retrieval, there are K retrieval tasks, each with N_k labeled examples (x^(k)_1, y^(k)_1) . . . (x^(k)_{N_k}, y^(k)_{N_k}) ∼ pop_k drawn iid from the k-th population distribution pop_k. We use the KILT benchmark, which includes K = 8 tasks addressing QA, entity linking, fact checking, slot filling, and dialogue. The per-task loss is

J_k(θ) = −(1/N_k) Σ_{i=1}^{N_k} log [ exp(s_θ(x^(k)_i, y^(k)_i)) / Σ_{y ∈ Y^(k)_i} exp(s_θ(x^(k)_i, y)) ]

defining the final loss J(θ) = Σ_{k=1}^K J_k(θ). Previous work by Maillard et al. (2021) uses the following setting. The KB Y consists of 100-token disjoint Wikipedia passages. The text mappings text_X, text_Y apply the BERT tokenizer to unmodified queries and passages. The encoders enc^X_θ, enc^Y_θ are initialized with independent pretrained BERT bases (uncased). The task-specific training datasets are downsampled to be of similar sizes. As in DPR, they train the model using hard negatives based on BM25, followed by one round of hard negative mining from the model (only on Natural Questions and TriviaQA, in which verifying whether a candidate negative is indeed incorrect is expedient).
We now describe the main sources of improvement that we achieve over the baseline multitask retriever: a better choice of the base model with appropriate prompting, and better optimization.

Base Model
We use a shared T5 to parameterize and initialize the query and passage encoders: enc_θ = enc^X_θ = enc^Y_θ. Specifically, we follow the ST5-EncDec architecture (Ni et al., 2021), which encodes any z ∈ V+ as enc_θ(z) = T5.generate(z, length = 1).state (i.e., we run the T5 encoder on z, run the T5 decoder for one step from the special start symbol, and take the resulting hidden state prior to token prediction). In addition, we define the text mapping for queries x ∈ X in task k as

text_X(x) = π_k ⊕ [SEP] ⊕ x

where ⊕ is string concatenation, [SEP] is the special separation token, and π_k is a text prefix that indicates which task x is a query of. We use dataset names as prefixes (e.g., π_1 = "NQ"). The text mapping for passages y ∈ Y does not use prefixes, that is, text_Y(y) = y. This allows us to pre-encode passage embeddings at test time and retain the efficiency of the single-task dual encoder framework.
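Concretely, the two text mappings might look like the sketch below (the exact whitespace and separator handling are our illustrative assumptions, not the paper's verbatim format):

```python
SEP = "[SEP]"

def verbalize_query(query: str, task_prefix: str) -> str:
    # Queries get a dataset-name prefix (e.g., "NQ") so the shared
    # encoder knows which task the query belongs to.
    return f"{task_prefix} {SEP} {query}"

def verbalize_passage(passage: str) -> str:
    # Passages carry no prefix, so their embeddings can be pre-computed
    # once and reused across all tasks at test time.
    return passage
```

Keeping the prefix on the query side only is the design choice that preserves a single shared passage index.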
While simple, this choice is the most crucial component in our approach to improving multitask retrieval. We take a model pretrained for multitasking and adopt the same prefix concatenation scheme for task adaptation, treating multitask retrieval as a continuation of the T5 training.
Interestingly, task markers are reported to be unhelpful in Maillard et al. (2021). This is likely because their base model, BERT, is not pretrained for multitasking. Another difference is that they use task markers to indicate the 5 task types (e.g., "QA"), whereas we use fine-grained markers to indicate the 8 tasks (e.g., "NQ"). While there are previous works that use T5 for dense retrieval (Ni et al., 2021), we are the first to exploit the multitasking component of T5 pretraining for multitask retrieval.

Adaptive Learning
For the k-th task, let θ^(t) denote the parameter value at the t-th update in gradient-based training. For any i = 1 . . . d, we define θ^(t)_{−i} to be equal to θ^(t) except that its i-th element is zero. The first-order approximation of the sensitivity of the i-th parameter to the k-th task loss is

σ^(t)_{i,k} = |J_k(θ^(t)) − J_k(θ^(t)_{−i})| ≈ |θ^(t)_i ∇_{θ_i} J_k(θ^(t))|   (1)

which is easily computable and can be viewed as measuring how sensitive the i-th parameter is with respect to the k-th task in the t-th iteration of training. We propose to use this quantity, previously used in the model pruning literature (Molchanov et al., 2016), to encourage task specialization during training. We define a conditional distribution over the K tasks by

p_t(k|i) = exp(σ̄^(t)_{i,k} / τ_t) / Σ_{k'=1}^K exp(σ̄^(t)_{i,k'} / τ_t)   (2)

where τ_t > 0 is a temperature and σ̄^(t)_{i,k} is an appropriately normalized and amortized estimate of σ^(t)_{i,k} in Eq. (1) (see Section 3.2.1). Assuming training examples are sampled to roughly balance the size across tasks (i.e., N_k ≈ N/K), we take the following gradient step for the i-th parameter in the t-th iteration:

θ^(t+1)_i = θ^(t)_i − η Σ_{k=1}^K K p_t(k|i) ∇_{θ_i} J_k(θ^(t))

Note that this is per-parameter adaptive learning. Each parameter θ_i ∈ R maintains a distribution over the K tasks and is updated more aggressively for tasks that θ_i is sensitive to.
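For a single parameter, the sensitivity-weighted update can be sketched as below (helper names are ours; with uniform sensitivities the softmax is uniform and the update reduces to the ordinary summed-gradient step):

```python
import math

def task_distribution(sensitivities, tau=0.1):
    # Softmax over one parameter's per-task sensitivities (Eq. 2),
    # with the usual max-shift for numerical stability.
    m = max(s / tau for s in sensitivities)
    exps = [math.exp(s / tau - m) for s in sensitivities]
    z = sum(exps)
    return [e / z for e in exps]

def adaptive_gradient(task_grads, sensitivities, tau=0.1):
    # K * sum_k p(k|i) * grad_k for one parameter: tasks the parameter
    # is sensitive to contribute more to its update.
    K = len(task_grads)
    p = task_distribution(sensitivities, tau)
    return K * sum(pk * g for pk, g in zip(p, task_grads))
```

A small temperature sharpens p(k|i), pushing each parameter toward its single most sensitive task; a large temperature recovers plain multitask training.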

Sensitivity normalization
The d parameters θ^(t) can be of very different magnitudes. To reduce the parameter-wise variance in the sensitivity scores, for task k we divide the scores by the median across all parameters with respect to task k:

σ̃^(t)_{i,k} = σ^(t)_{i,k} / median_j(σ^(t)_{j,k})

We use the median instead of the mean to account for the long-tailed distribution of task-specific sensitivity scores. We also use momentum to amortize the scores: assuming some β > 0,

σ̄^(t)_{i,k} = β σ̄^(t−1)_{i,k} + (1 − β) σ̃^(t)_{i,k}

where σ̄^(0)_{i,k} = 0. This is the final version of sensitivity that we use in Eq. (2). The algorithm in matrix form is given in Algorithm 1 (Appendix A).

Setup
Datasets. We follow Maillard et al. (2021) and use eight tasks from KILT (Petroni et al., 2021) for training and evaluation. We randomly downsample the training data of the two largest datasets (T-REx and zsRE) to the same order of magnitude as the rest. All the datasets share the same knowledge base of 36 million disjoint 100-token Wikipedia passages preprocessed by Maillard et al. (2021). The data statistics and other data-related details can be found in Appendix B.
Evaluation. We use page-level R-precision (the suggested main metric in KILT) to measure retrieval performance. Page-level R-precision is the fraction of the R gold pages captured by the retriever in the top-R candidates. We map the retrieved passages to their corresponding pages and use the official KILT evaluation scripts to evaluate page-level R-precision. We also report the passage-level R-precision proposed by Maillard et al. (2021) on the dev sets in Appendix E. We use TREC Eval to evaluate passage-level R-precision.
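The metric can be sketched as follows (a simplified stand-in for the official KILT script; `passage_to_page` is a hypothetical mapping from passage ids to page titles):

```python
def page_r_precision(gold_pages, retrieved_passages, passage_to_page):
    # Map retrieved passages to pages, deduplicate while preserving rank,
    # then count how many of the R gold pages appear in the top R pages.
    pages = []
    for p in retrieved_passages:
        page = passage_to_page[p]
        if page not in pages:
            pages.append(page)
    gold = set(gold_pages)
    R = len(gold)
    if R == 0:
        return 0.0
    return len(gold & set(pages[:R])) / R
```

Note that R varies per query (most KILT queries have a single gold page, where the metric reduces to precision@1 at the page level).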
Model details. We initialize our dual encoder with the official T5-base (Raffel et al., 2019) checkpoint. The query encoder and passage encoder share weights. Following the ANCE (Xiong et al., 2021) training paradigm, we first warm up our model for 20 epochs with BM25 hard negatives by naive multitask learning with task prefixes. Then we train the model for 8 ANCE episodes with model-mined hard negatives refreshed at the beginning of each ANCE episode. We adopt naive multitask learning with task prefixes for the first 7 ANCE episodes and apply the adaptive learning introduced in Section 3.2 for the last ANCE episode to improve performance further. We use Adam (Kingma and Ba, 2015) with a linear learning rate decay schedule with warmup proportion 0.1 over 3 epochs for each ANCE iteration. We provide more details and hyperparameters in Appendix C.

Main Results
We refer to our model as TACO, which stands for TAsk speCialty Optimization. Tables 1 and 2 show our main results; let avg val denote average validation page-level R-precision. TACO achieves the best performance on 4 out of 8 tasks for both validation and test data. The performance is either the second best or close to the second best except on AIDA, an entity linking dataset favoring autoregressive retrieval models over dense retrieval models (De Cao et al., 2021). TACO outperforms the previous multitask dense retrieval model MT-DPR (Maillard et al., 2021) by a large margin. Among the non-comparable models, GENRE uses extra BLINK data (Wu et al., 2020) for finetuning, which contains 8.9M annotated Wikipedia sentences, and TABi (Leszczynski et al., 2022) uses extra type label information and leverages a knowledge graph that is very effective for retrieval. TACO rivals even these non-comparable models on all tasks except AIDA. TACO is the only model that noticeably outperforms strong task-specific models. Our task-specific baseline is significantly stronger than task-specific DPR, likely due to a better training paradigm (ANCE) and a better model (T5 vs. BERT). Task-specific CorpusBrain is even stronger, especially on FEVER and AIDA. Only TACO and CorpusBrain_mt outperform the strong task-specific models. TACO achieves a 2.36% improvement over its task-specific counterpart and a 1.41% improvement over task-specific CorpusBrain, while CorpusBrain_mt is only slightly better than its task-specific counterpart (+0.23% avg val).

Ablation Study
Table 3 shows the results of ablation studies on KILT validation data.
Model components. We first conduct experiments to understand the impact of individual components of our model. Removing the task prefix results in a 1.62% R-precision decrease, and disabling adaptive learning yields a 1.08% R-precision decrease. Removing both the task prefix and adaptive learning significantly degrades performance (−3.13%). This demonstrates that both the task prefix and adaptive learning contribute to the effectiveness of TACO.
Query variants. We conduct experiments to investigate other query-side variants besides the task prefix. These variants are not trained with adaptive learning and only change the query input format or model. Leveraging task-specific query encoders yields slightly better performance (70.87% vs 70.61%), but is significantly outperformed by the task prefix (70.87% vs 72.66%). The task type marker introduced in Maillard et al. (2021) is not helpful for BERT-based MT-DPR, but we find it effective for our T5-based model, likely because T5 is pretrained for multitasking. We conduct experiments leveraging their task type markers for our model. Using task type markers (i.e., 5 symbols indicating the 5 classes of task in KILT) leads to a 1.24% R-precision improvement (71.85% vs 70.61%), but is less effective than our fine-grained dataset-level task prefix (71.85% vs 72.66%).
Multitask learning variants. We compare our adaptive learning method with recent general multitask learning algorithms using our own implementations. PCG (Yu et al., 2020) focuses on mitigating the conflict of gradients from different tasks. It performs on par with the "w/o adaptive" variant (72.47% vs 72.66%), but underperforms TACO, which leverages our adaptive learning (72.47% vs 73.74%). This shows that gradient conflict is not the main bottleneck in our multitask retrieval setting. CGD (Piratla et al., 2021) aims to improve multitask learning by encouraging updates toward common directions across tasks, which is opposite to our method that encourages task specialty. It performs much worse than TACO (69.51% vs 73.74%) and lags behind the "w/o adaptive" variant significantly (69.51% vs 72.66%). This shows that we should encourage task specialty rather than emphasize the shared part of tasks for multitask retrieval. GradNorm (Chen et al., 2018) weights different task losses using the average gradient norm. It performs slightly better than the naive "w/o adaptive" variant (72.80% vs 72.66%). Our adaptive learning method achieves a decent improvement over GradNorm (73.74% vs 72.80%). Note that our adaptive update is more fine-grained and critically different: we adjust learning rates along both the task dimension and the parameter dimension, whereas GradNorm only reweights losses.
Adaptive learning. We consider variations of the main version of adaptive learning, which is applied only in the last ANCE episode. Specifically, we investigate the impact of applying adaptive learning to the last four ANCE episodes using an exponential softmax temperature decay scheduler. This approach yields an average page-level R-precision of 73.47%. In comparison, when adaptive learning is applied only to the last ANCE episode, we achieve an average page-level R-precision of 73.74%. These results suggest that extending adaptive learning to more ANCE episodes does not yield improvement. Additionally, we examine the effectiveness of encouraging task specialization within adaptive learning. For this purpose, we focus on the second ANCE episode and experiment with a positive softmax temperature (encouraging task specialty) and a negative softmax temperature (discouraging task specialty). Encouraging task specialization results in an average page-level R-precision of 70.53%, while discouraging task specialization leads to an average page-level R-precision of 68.39%. In comparison, the performance of the standard multitask baseline at the second ANCE episode is 69.28%. These results highlight the benefits of encouraging task specialization and the detrimental effect of discouraging it within adaptive learning. Normalizing task sensitivity using the median is preferred over using the mean or not applying any normalization, as different tasks exhibit variations in magnitude while sharing similar distribution shapes (see Figure 2).

Figure 1: Task entropy histograms for model variants.

Task Specialization
Figure 1 plots the histograms of task entropy for the learned parameters. The task entropy for each parameter is calculated with the distribution defined in Eq. (2). We first group parameters into two special bins. The first is a "Task Specific" bin that includes parameters whose entropy is smaller than 0.3, which is the entropy of putting 95% probability on one task and 5% uniformly on the remaining seven. The "Not Activated" bin includes parameters whose sensitivity with respect to all tasks is near zero (< 1e−8). TACO significantly improves the fraction of task-specific parameters to 22%, compared with 19% for the naive multitask model (w/o prefix, w/o adaptive). It also reduces the fraction of not-activated parameters, showing that optimizing task specialty also better utilizes the model capacity.
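The binning described above can be sketched as follows (thresholds taken from the text; helper names are ours):

```python
import math

def task_entropy(p):
    # Shannon entropy (in nats) of a parameter's task distribution.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def bin_parameter(sensitivities, task_dist, entropy_threshold=0.3, eps=1e-8):
    # "Not Activated": near-zero sensitivity to every task;
    # "Task Specific": task-distribution entropy below the threshold;
    # everything else is shared across tasks.
    if max(sensitivities) < eps:
        return "Not Activated"
    if task_entropy(task_dist) < entropy_threshold:
        return "Task Specific"
    return "Shared"
```

As a sanity check, the 95%/5%-split distribution over 8 tasks indeed has entropy just under the 0.3 threshold, while the uniform distribution has entropy ln 8 ≈ 2.08.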
Figure 2 plots the kernel density estimates of the task-specific sensitivity distributions in TACO and the standard multitask model for four KILT tasks. We drop outliers that deviate significantly from the median to ease visualization. Notably, TACO exhibits a noticeable reduction in the peak on the low-sensitivity side for each task compared to the standard multitask model. This observation suggests that TACO activates a larger number of parameters and enhances their sensitivity toward individual tasks.

Additional Benchmark
To test the performance of TACO in a setup other than KILT, we construct an additional benchmark containing MS-MARCO (Nguyen et al., 2016), ZESHEL (Logeswaran et al., 2019), a document-level version of FEVER from BEIR (Thakur et al., 2021), and Natural Questions from KILT. We chose this combination for a few reasons. First, we found that few public datasets outside KILT provide sufficiently large and high-quality training data other than MS-MARCO and ZESHEL. Second, each task now has its own KB to retrieve from, making this a rather different setup from KILT, in which all tasks share one KB. We compare task-specific retrievers and multitask retrievers trained by TACO and other methods. Table 4 shows their recall at 100 on the validation split. We see that multitasking is clearly beneficial for this benchmark. The best performance is obtained by CGD, and it is the only multitask optimization method that yields noticeable improvements over the standard multitask model. Given that CGD aims to improve multitask learning by encouraging updates toward common directions across tasks, we hypothesize that the need for task specialization is diminished here because the tasks are more similar in difficulty (e.g., in KILT, T-REx and zsRE are much easier than HotpotQA). This experiment sheds light on which multitask settings benefit most from task specialization.

Table 4: Recall@100 on an additional benchmark containing MS-MARCO (MS), ZESHEL (ZES), FEVER (FEV), and Natural Questions (NQ).

Conclusions
Multitask retrieval has compelling practical advantages such as model simplicity and memory efficiency, but it lags behind task-specific retrieval in the existing literature. We have shown that it is possible to significantly improve the performance of multitask retrieval by promoting task specialization. The key steps are the use of a base model optimized for multitasking with appropriate prompting, and a per-parameter adaptive learning technique that upweights the task gradients by the parameters' sensitivity to the task losses. We have achieved strong results on the KILT retrieval benchmark.

A Algorithm in Matrix Form
Algorithm 1 is the matrix form of our adaptive learning algorithm.

B Data Details
See Table 5 for data statistics and some data-related hyperparameters. We randomly downsample T-REx and zsRE to bring them to the same order of magnitude as the others. We follow Raffel et al. (2019) and use a temperature-scaled mixing sampling strategy to compute the batch size for each task k:

B_k = B · N_k^{1/c} / Σ_{k'=1}^K N_{k'}^{1/c}

for some temperature c (we set it to 4 in our experiments), where B is the total batch size.
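The mixing rule can be sketched as below (the rounding scheme is our simplification; the paper's exact per-task allocation may differ):

```python
def task_batch_sizes(dataset_sizes, total_batch=120, c=4.0):
    # Temperature-scaled mixing: B_k proportional to N_k^(1/c).
    # Larger c flattens the allocation toward uniform batch sizes.
    weights = [n ** (1.0 / c) for n in dataset_sizes]
    total = sum(weights)
    return [max(1, round(total_batch * w / total)) for w in weights]
```

With c = 4 a 16x larger dataset only earns a 2x larger batch, which keeps small tasks from being drowned out.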
Here N_k is the dataset size of task k. Note that we compute the task loss of each task batch independently instead of mixing all task batches at every optimization step. Each dataset needs to sample a different number of batches to cover every training sample in that dataset once; we set the maximum of these as the number of batches that every dataset samples, and shuffle and cycle the batch sampling iterators of datasets that finish iterating early. The batch size of each dataset, computed with mixing temperature c = 4 and total batch size 120, is given in Table 5.

Algorithm 1 Task sensitivity-guided adaptive learning

Require: model parameters θ ∈ R^d; minibatches B, where each batch B ∈ B is further divided by tasks B = {B_k}_{k=1...K}; moving average rate β ∈ [0, 1]; temperature τ > 0; learning rate η > 0.
Ensure: median : R^{d×K} → R^K is the column-wise median; softmax : R^{d×K} → R^{d×K} is the row-wise softmax; 1_K is a vector of K ones; ⊙ is the Hadamard product.
1: Initialize Ī ← 0 ∈ R^{d×K}.
2: for each batch B = {B_k}_{k=1...K} in B do
3:   Compute the task-specific loss J_k(θ) on B_k for each k = 1 . . . K.
4:   Compute the gradient matrix G ∈ R^{d×K} with each column G_k ← ∇J_k(θ).
5:   Compute the sensitivity matrix I ∈ R^{d×K} with each column I_k ← |G_k ⊙ θ|.
6:   Normalize each column of I by its median.
7:   Update the moving average: Ī ← βĪ + (1 − β)I.
8:   Update θ ← θ − ηK (softmax(Ī/τ) ⊙ G) 1_K, where softmax(Ī/τ) ∈ R^{d×K}.
9: end for

C Training Details

We use Adam (Kingma and Ba, 2015) with learning rate 5e-6 and a linear learning rate schedule with warmup ratio 0.1. Each ANCE episode trains for 3 epochs, and the total batch size over all task batches is 120. The data-related hyperparameters, such as maximum input query length and per-task batch size, are listed in Table 5; the training hyperparameters are listed in Table 6. We use the NCE loss with cross-device in-batch negatives mixed with hard negatives to compute each task loss, sampling two hard negatives for each query. We employ a "burn-in" period with uniform learning rates over the first 10% of training steps so that parameters can establish their task tendencies before adaptive learning takes over. All of our experiments are run on a machine with 8 A100-80GB GPUs. Our implementation is built upon OpenMatch (Liu et al., 2021).

Table 8: Average page-level R-precision w.r.t. momentum factor for our adaptive learning.

Table 7 shows the impact of the softmax temperature on validation R-precision for our adaptive learning. Table 8 shows the impact of the momentum factor on validation R-precision for our adaptive learning.
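The matrix-form procedure of Algorithm 1 can be sketched in NumPy as follows (a minimal sketch of one update step; in practice G holds the per-task gradients from backpropagation, stubbed out here):

```python
import numpy as np

def adaptive_step(theta, G, I_bar, beta=0.9, tau=0.1, eta=0.01):
    """One step of task sensitivity-guided adaptive learning.

    theta: (d,) parameters; G: (d, K) per-task gradient columns;
    I_bar: (d, K) moving-average sensitivity. Returns (theta, I_bar).
    """
    d, K = G.shape
    # Per-parameter, per-task sensitivity |theta_i * grad_i J_k| (Eq. 1).
    I = np.abs(G * theta[:, None])
    # Column-wise median normalization to remove cross-task scale differences.
    med = np.median(I, axis=0)
    I = I / np.maximum(med, 1e-12)
    # Momentum-amortized sensitivity estimate.
    I_bar = beta * I_bar + (1.0 - beta) * I
    # Row-wise softmax over tasks (sharper for small temperature tau).
    Z = I_bar / tau
    Z = Z - Z.max(axis=1, keepdims=True)
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
    # Sensitivity-weighted step; uniform P recovers the plain summed gradient.
    theta = theta - eta * K * (P * G).sum(axis=1)
    return theta, I_bar
```

When every task produces the same gradient column, the softmax rows are uniform and the step coincides with ordinary multitask gradient descent, which is a useful sanity check for an implementation.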

Figure 2: Task-specific sensitivity density distributions on the training data of four KILT tasks. The final models are used. The x-axis is sensitivity, and we drop outliers that are far from the median to ease visualization.

Table 1: Page-level R-precision on KILT validation data. Bold indicates the best model and underline indicates the second best. † and * mark results from Chen et al. (2022) and Maillard et al. (2021), respectively. The non-comparable models are trained on additional data or use extra information; we list them only for reference, not for comparison. Task-specific models use a separate retriever for each task, while all the other models use a single retriever across all the tasks.

Table 2: Page-level R-precision on KILT test data. Bold indicates the best model and underline indicates the second best. †, *, and ‡ mark results from Chen et al. (2022), Maillard et al. (2021), and Bevilacqua et al. (2022), respectively. The non-comparable models are trained on additional data or use extra information.

Table 3: Ablation study results on KILT validation data. We report page-level R-precision. Bold indicates the best variant. Each line makes a single change or multiple changes from the TACO model. The performance of the recent general multitask algorithms PCG (Yu et al., 2020), CGD (Piratla et al., 2021), and GradNorm (Chen et al., 2018) is obtained from our own implementation.

Table 5: Data statistics and some data-related hyperparameters for our experiments. B denotes batch size. L denotes maximum query input length excluding the task prefix.

Table 6: Training hyperparameters for our TACO-DR model.