Abstract
In multitask retrieval, a single retriever is trained to retrieve relevant contexts for multiple tasks. Despite its practical appeal, naive multitask retrieval lags behind task-specific retrieval, in which a separate retriever is trained for each task. We show that it is possible to train a multitask retriever that outperforms task-specific retrievers by promoting task specialization. The main ingredients are: (1) a better choice of pretrained model—one that is explicitly optimized for multitasking—along with compatible prompting, and (2) a novel adaptive learning method that encourages each parameter to specialize in a particular task. The resulting multitask retriever is highly performant on the KILT benchmark. Upon analysis, we find that the model indeed learns parameters that are more task-specialized compared to naive multitasking without prompting or adaptive learning.
1 Introduction
A standard approach to knowledge-intensive language tasks such as question answering (QA), entity disambiguation, and fact verification is retrieval-based. Given a query, a retriever is used to efficiently search a large knowledge base (KB) and retrieve relevant “contexts”, typically in the form of short paragraphs. How these contexts are used is task-specific (e.g., entity disambiguation takes the title of the article in which the top retrieved context is found; QA predicts an answer from the contexts through a reader model). In this paper, we focus on the retrieval step.
In particular, we focus on multitask retrieval. In this setting, there are K > 1 downstream tasks that benefit from retrieval from a shared KB. A single retriever is then tasked with performing retrieval for K tasks. Multitask retrieval contrasts with task-specific retrieval, in which a separate retriever is trained for each task, and has compelling advantages such as model simplicity (i.e., we can use the same model for all tasks rather than having to design potentially different models for different tasks) and memory efficiency at test time (K times smaller).
Despite the practical appeal, the performance of multitask retrieval has been underwhelming, severely limiting its real-world applicability. Specifically, previous work by Maillard et al. (2021) trains DPR (Karpukhin et al., 2020) on the union of all training datasets in the KILT benchmark (Petroni et al., 2021), but the model is outperformed by task-specific retrievers in 5 out of 8 tasks (page-level R-precision, validation split). In our experiments, we find that it is in fact outperformed in all tasks (often by substantial margins) when a stronger task-specific baseline is used. This result is surprising as well as disappointing given the usual benefits of multitask learning (e.g., data efficiency, reduced overfitting) when properly done.
We debunk the previous negative result by presenting a multitask retriever that outperforms task-specific retrievers. The main theme of our work is that it is beneficial to explicitly promote task specialization. A first important source of improvement is a better choice of pretrained model, one that is explicitly optimized for multitasking. Specifically, instead of the standard retrieval encoder BERT (Devlin et al., 2019), we use T5 (Raffel et al., 2019), which includes multitasking in its pretraining stage. Importantly, we use the same prompting as in pretraining (i.e., a task indicator) to reduce the gap between pretraining and finetuning for multitask retrieval. A second source of improvement is a novel adaptive learning method in which we adaptively upweight the task gradients by each parameter’s sensitivity to those tasks, encouraging task specialization.
The resulting multitask retriever is highly performant on the KILT benchmark. We achieve 73.74% average page-level R-precision on KILT validation data and 72.84% average page-level R-precision on KILT test data. Upon analysis, we find that the model indeed learns parameters that are more task-specialized compared to naive multitasking without prompting or adaptive learning.
2 Related Work
Maillard et al. (2021) propose multitask retrieval largely as an extension of DPR. Their best model is a BERT-based dual encoder trained on the union of 8 retrieval tasks. While it performs comparably with task-specific DPRs on some tasks, it generally lags behind. In this work, we use stronger task-specific retrievers based on T5 and ANCE (Xiong et al., 2021), all of which substantially outperform their multitask retriever. We argue that this negative result undermines the case for multitask retrieval and that it is crucial to demonstrate competitive performance. Our main contribution is producing this demonstration.
We emphasize that achieving competitive multitask retrieval in practice is a highly difficult empirical problem. One might think that it is simply an application of multitask learning, which has no shortage of sophisticated techniques. These techniques typically modify the gradients during training, such as gradient surgery (Yu et al., 2020), gradient vaccine (Wang et al., 2020), common gradient descent (Piratla et al., 2021), and GradNorm (Chen et al., 2018). We experiment with these techniques and find that they do not help, thus motivating us to develop one that works.
We briefly differentiate our work from other recent work on multitask retrieval. Chen et al. (2022) present CorpusBrain, an autoregressive multitask retriever trained in largely the same style as GENRE (De Cao et al., 2021) with excellent performance. Autoregressive retrieval has different pros and cons compared to dense retrieval, which is our setting: it can be more memory- and runtime-efficient, but it does not “read” the description of the target and is thus not suited for retrieval tasks that require involved reasoning over query-target pairs (e.g., zero-shot entity retrieval [Logeswaran et al., 2019]). We therefore consider the contribution of CorpusBrain to be at least partially orthogonal to ours. Nevertheless, our experiments show that our model outperforms CorpusBrain in a comparable training setting. Asai et al. (2022) propose instruction-based retrieval, in which the retriever is given an intent as well as a query to find the intended target. While this is a form of multitask retrieval, the problem formulation is different and it is evaluated on its own benchmark.
3 Method
We now describe the main sources of improvement that we achieve over the baseline multitask retriever: a better choice of the base model with appropriate prompting, and better optimization.
3.1 Base Model
While simple, this choice is the most crucial component of our approach to improving multitask retrieval. We take a model pretrained for multitasking, T5 (Raffel et al., 2019), and adopt the same prefix concatenation scheme for task adaptation, treating multitask retrieval as a continuation of T5 pretraining.
Interestingly, task markers are reported not to be helpful in Maillard et al. (2021). This is likely because their base model, BERT, is not pretrained for multitasking. Another difference is that they use task markers to indicate the 5 task types (e.g., “QA”), whereas we use fine-grained markers to indicate the 8 tasks (e.g., “NQ”). While there is previous work that uses T5 for dense retrieval (Ni et al., 2021), we are the first to exploit the multitasking component of T5 pretraining for multitask retrieval.
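To make the input format concrete, here is a minimal sketch of dataset-level task prefixing; the exact prefix strings are illustrative assumptions, not necessarily the ones used in our released code.

```python
# Illustrative sketch of fine-grained, dataset-level task prefixes.
# The exact prefix strings here are assumptions for illustration only.
TASK_PREFIXES = {
    "nq": "nq question: ",
    "tqa": "trivia question: ",
    "hopo": "hotpot question: ",
    "fever": "fever claim: ",
    # ... one prefix per KILT dataset (8 in total)
}

def build_query(task: str, query: str) -> str:
    """Prepend the dataset-level task prefix to the query text, T5-style."""
    return TASK_PREFIXES[task] + query
```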
3.2 Adaptive Learning
3.2.1 Sensitivity Normalization
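In brief: for each parameter we estimate its sensitivity to every task from the task gradients, normalize the sensitivities by their per-task median (different tasks have similar sensitivity distributions but different magnitudes; see Section 4.3.1), convert them into per-parameter task weights with a softmax at temperature τ (Appendix D), and upweight each task gradient accordingly, so that each parameter gradually specializes in the tasks it is most sensitive to. The sketch below illustrates one plausible instantiation; the exact sensitivity measure (here |g ⊙ θ|), the weight rescaling, and all function names are assumptions rather than the released implementation.

```python
import torch

def adaptive_multitask_step(params, task_losses, tau=2.0):
    """Hedged sketch of TACO-style adaptive learning (not the official code).

    params:      list of model parameter tensors
    task_losses: list of K scalar losses, one per task batch
    tau:         softmax temperature (tau = 2 works best in Appendix D)
    """
    K = len(task_losses)
    # Per-task gradients for every parameter.
    task_grads = [torch.autograd.grad(loss, params, retain_graph=True)
                  for loss in task_losses]
    with torch.no_grad():
        for i, p in enumerate(params):
            # Sensitivity of this parameter to task k, approximated here by
            # |g_k * theta|; the paper's exact measure may differ, and in
            # practice sensitivities are smoothed with a momentum factor
            # (Appendix D).
            sens = torch.stack([(g[i] * p).abs() for g in task_grads])  # (K, *p.shape)
            # Sensitivity normalization (Section 3.2.1): divide by the
            # per-task median so tasks with larger gradient magnitudes do
            # not dominate.
            med = sens.flatten(1).median(dim=1).values.clamp_min(1e-8)
            sens = sens / med.view(K, *([1] * p.dim()))
            # Per-parameter task weights; a positive temperature encourages
            # specialization, a negative one discourages it (Section 4.3.1).
            w = K * torch.softmax(sens / tau, dim=0)
            # Upweight each task gradient by the parameter's sensitivity.
            p.grad = sum(w[k] * task_grads[k][i] for k in range(K))
```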
4 Experiments
4.1 Setup
Datasets.
We follow Maillard et al. (2021) and use eight tasks from KILT (Petroni et al., 2021) for training and evaluation. We randomly downsample the training data of the two largest datasets (T-REx and zsRE) to the same order of magnitude as the rest. All the datasets share the same knowledge base of 36 million disjoint 100-token Wikipedia passages preprocessed by Maillard et al. (2021). The data statistics and other data-related details can be found in Appendix B.
Evaluation.
We use page-level R-precision (the suggested main metric in KILT) to measure retrieval performance. Page-level R-precision is the fraction of the R gold pages captured by the retriever in the top-R candidates. We map the retrieved passages to their corresponding pages and use the official KILT evaluation scripts to compute page-level R-precision. We also report the passage-level R-precision proposed by Maillard et al. (2021) on the dev sets in Appendix E, using TREC Eval.
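Concretely, the metric can be computed as in the sketch below; the official KILT scripts handle details such as multiple gold provenance sets, which we omit here.

```python
def page_level_r_precision(gold_pages: set, retrieved_pages: list) -> float:
    """Fraction of the R gold pages found in the top-R retrieved pages.

    retrieved_pages is a ranked list of page ids, obtained by mapping
    retrieved passages to their source pages and deduplicating in
    rank order.
    """
    r = len(gold_pages)
    return len(gold_pages & set(retrieved_pages[:r])) / r
```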
Model Details.
We initialize our dual encoder with the official T5-base (Raffel et al., 2019) checkpoint. The query encoder and passage encoder share weights. Following the ANCE (Xiong et al., 2021) training paradigm, we first warm up our model for 20 epochs with BM25 hard negatives using naive multitask learning with task prefixes. We then train the model for 8 ANCE episodes, with model-mined hard negatives refreshed at the beginning of each episode. We adopt naive multitask learning with task prefixes for the first 7 ANCE episodes and apply the adaptive learning introduced in Section 3.2 in the last episode to further improve performance. We use Adam (Kingma and Ba, 2015) with a linear learning rate decay schedule and a warmup proportion of 0.1 over 3 epochs for each ANCE iteration. We provide more details and hyperparameters in Appendix C.
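Schematically, the overall training procedure follows the sketch below (epoch and episode counts are from the text; the function names are hypothetical).

```python
# Hypothetical sketch of the training schedule described above.
model = init_dual_encoder("t5-base")          # shared query/passage encoder

# Warmup: 20 epochs of naive multitask learning with task prefixes,
# using BM25 hard negatives.
negatives = mine_bm25_negatives(train_data)
train(model, train_data, negatives, epochs=20)

# 8 ANCE episodes; hard negatives are refreshed from the current model
# at the beginning of each episode (3 epochs per episode, Adam with
# linear decay and warmup proportion 0.1).
for episode in range(8):
    negatives = mine_hard_negatives(model, train_data)
    adaptive = (episode == 7)                 # adaptive learning in the last episode only
    train(model, train_data, negatives, epochs=3, adaptive=adaptive)
```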
4.2 Main Results
We refer to our model as TACO, which stands for TAsk speCialty Optimization. Table 1 and Table 2 show our main results on the KILT validation data and test data respectively. Fewer comparable baselines are available for KILT test data than for KILT validation data.
Table 1: Page-level R-precision (%) on the KILT validation data. FEV: fact checking; AY2: entity linking; T-REx, zsRE: slot filling; NQ, HoPo, TQA: open-domain QA; WoW: dialogue.

| Model | FEV | AY2 | T-REx | zsRE | NQ | HoPo | TQA | WoW | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Baselines* |  |  |  |  |  |  |  |  |  |
| BM25* | 50.13 | 3.47 | 58.60 | 66.43 | 25.83 | 43.95 | 29.44 | 27.50 | 38.17 |
| BART | 81.92 | 89.17 | 75.18 | 91.08 | 58.62 | 48.69 | 67.64 | 50.98 | 70.41 |
| CorpusBrain | 82.06 | 90.84 | 77.62 | 98.26 | 59.10 | 50.07 | 68.78 | 53.75 | 72.56 |
| MT-DPR* | 74.72 | 83.78 | 69.18 | 77.23 | 61.51 | 44.21 | 61.95 | 39.70 | 64.04 |
| Task-specific DPR* | 73.60 | 81.77 | 69.08 | 97.74 | 63.24 | 46.63 | 65.12 | 40.32 | 67.19 |
| Task-specific BART† | 80.03 | 87.98 | 74.46 | 93.91 | 50.96 | 39.21 | 66.13 | 50.75 | 67.93 |
| Task-specific CorpusBrain† | 81.77 | 90.36 | 76.90 | 98.49 | 57.67 | 50.62 | 69.25 | 53.60 | 72.33 |
| Task-specific (ours) | 74.28 | 85.28 | 77.18 | 99.38 | 65.39 | 46.79 | 69.08 | 53.63 | 71.38 |
| *Non-Comparable Models (For Reference)* |  |  |  |  |  |  |  |  |  |
| CorpusBrain + BLINK | 85.03 | 92.86 | 80.22 | 98.49 | 64.61 | 52.23 | 71.71 | 59.72 | 75.61 |
| GENRE† | 84.68 | 92.75 | 79.68 | 94.84 | 64.26 | 51.82 | 71.11 | 56.32 | 74.43 |
| TABi (Leszczynski et al., 2022) | 85.8 | – | 82.0 | 95.2 | 62.4 | 52.7 | 71.5 | 51.8 | – |
| *Ours* |  |  |  |  |  |  |  |  |  |
| TACO | 86.17 | 84.64 | 78.12 | 97.91 | 61.86 | 50.61 | 69.62 | 60.97 | 73.74 |
Table 2: Page-level R-precision (%) on the KILT test data. Columns as in Table 1.

| Model | FEV | AY2 | T-REx | zsRE | NQ | HoPo | TQA | WoW | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Baselines* |  |  |  |  |  |  |  |  |  |
| TF-IDF† | 50.9 | 3.7 | 44.7 | 60.8 | 28.1 | 34.1 | 46.4 | 49.0 | 39.7 |
| SEAL‡ | 81.4 | – | 62.1 | 91.6 | 63.2 | 58.8 | 68.4 | 57.5 | – |
| MT-DPR* | 74.5 | 26.5 | 69.5 | 80.9 | 59.4 | 42.9 | 61.5 | 41.1 | 57.0 |
| MT-DPR | 74.8 | – | 75.6 | 89.7 | 59.8 | 45.4 | 58.9 | 41.5 | – |
| Task-specific (ours) | 73.22 | 79.52 | 77.00 | 99.15 | 60.87 | 46.50 | 69.12 | 55.03 | 70.05 |
| *Non-Comparable Models (For Reference)* |  |  |  |  |  |  |  |  |  |
| CorpusBrain + BLINK | 84.07 | 89.98 | 79.98 | 98.27 | 60.32 | 51.80 | 70.19 | 64.79 | 74.93 |
| GENRE† | 83.64 | 89.85 | 79.42 | 95.81 | 60.25 | 51.27 | 69.16 | 62.88 | 74.04 |
| TABi (Leszczynski et al., 2022) | 84.4 | – | 81.9 | 96.2 | 62.6 | 53.1 | 70.4 | 59.1 | – |
| *Ours* |  |  |  |  |  |  |  |  |  |
| TACO | 84.07 | 80.64 | 77.22 | 98.21 | 60.80 | 50.70 | 68.45 | 62.64 | 72.84 |
Let avg val denote average validation page-level R-precision. TACO achieves the best performance on 4 out of 8 tasks for both validation and test data. Its performance is either the second best or close to the second best on the rest, except AIDA, an entity linking dataset that favors autoregressive retrieval models over dense retrieval models (De Cao et al., 2021). TACO outperforms the previous multitask dense retrieval model MT-DPR (Maillard et al., 2021) significantly (+7.34% avg val). TACO also outperforms the current top-performing multitask autoregressive retrieval models in a comparable setting (finetuned purely on KILT): it outperforms multitask BART (+3.33% avg val) with a smaller model size (T5-base vs. BART-large), and although multitask CorpusBrain employs additional pretraining and yields a significant improvement over multitask BART (+2.15% avg val), TACO still outperforms it (+1.18% avg val) with a smaller model size and no additional pretraining. We also list various top-performing multitask retrieval models for reference but not for comparison, because they are not in a comparable setting. Both GENRE and CorpusBrain + BLINK are finetuned on a large amount of additional training data besides the KILT training data; specifically, they also use the BLINK training data (Wu et al., 2020) for finetuning, which contains 8.9M annotated Wikipedia sentences. TABi (Leszczynski et al., 2022) uses extra type label information and leverages a knowledge graph, which is very effective for retrieval. TACO even rivals these non-comparable models on all tasks except AIDA.
TACO is the only model that outperforms strong task-specific models noticeably. Our task-specific baseline is significantly stronger than the task-specific DPR, likely due to a better training paradigm (ANCE) and a better model (T5 vs. BERT). Task-specific CorpusBrain is even stronger, especially on FEVER and AIDA. Only TACO and multitask CorpusBrain outperform the strong task-specific models: TACO achieves a 2.36% improvement over its task-specific counterpart and a 1.41% improvement over task-specific CorpusBrain, whereas multitask CorpusBrain is only slightly better than its task-specific counterpart (+0.23% avg val).
4.3 Analysis
4.3.1 Ablation Study
Table 3 shows the results of ablation studies on KILT validation data.
Table 3: Ablation studies on the KILT validation data (page-level R-precision, %). Columns as in Table 1.

| Variants | FEV | AY2 | T-REx | zsRE | NQ | HoPo | TQA | WoW | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TACO | 86.17 | 84.64 | 78.12 | 97.91 | 61.86 | 50.61 | 69.62 | 60.97 | 73.74 |
| w/o task prefix | 85.71 | 84.68 | 74.82 | 94.68 | 61.05 | 49.38 | 67.79 | 58.81 | 72.12 |
| w/o adaptive | 84.81 | 85.49 | 75.00 | 92.24 | 62.81 | 51.47 | 68.95 | 60.54 | 72.66 |
| w/o task prefix w/o adaptive | 84.03 | 85.62 | 70.96 | 86.04 | 62.46 | 49.78 | 66.04 | 59.95 | 70.61 |
| task query encoder | 82.71 | 87.56 | 72.72 | 85.15 | 64.01 | 49.74 | 69.12 | 55.93 | 70.87 |
| task type marker | 84.49 | 85.51 | 73.88 | 89.37 | 62.85 | 50.97 | 67.70 | 60.02 | 71.85 |
| PCG (Yu et al., 2020) | 84.97 | 85.26 | 74.90 | 91.43 | 62.67 | 51.47 | 68.54 | 60.48 | 72.47 |
| CGD (Piratla et al., 2021) | 82.25 | 80.39 | 71.62 | 83.40 | 62.67 | 49.66 | 66.73 | 59.33 | 69.51 |
| GradNorm (Chen et al., 2018) | 84.70 | 85.28 | 75.32 | 91.73 | 63.80 | 51.97 | 69.30 | 60.31 | 72.80 |
Model Components.
We first conduct experiments to understand the impact of individual components of our model. Removing the task prefix results in a 1.62% R-precision decrease, and disabling adaptive learning yields a 1.08% decrease. Removing both the task prefix and adaptive learning significantly degrades performance (−3.13%). This demonstrates that both the task prefix and adaptive learning contribute to the effectiveness of TACO.
Query Variants.
We investigate other query-side variants besides the task prefix. These variants are not trained with adaptive learning and only change the query input format or model. Using a task-specific query encoder yields slightly better performance than the naive baseline (70.87% vs 70.61%), but is significantly outperformed by the task prefix (70.87% vs 72.66%). The task type markers introduced in Maillard et al. (2021) are not helpful for BERT-based MT-DPR, but we find them effective for our T5-based model, likely because T5 is pretrained for multitasking. Using their task type markers (i.e., 5 symbols indicating the 5 classes of task in KILT) leads to a 1.24% R-precision improvement (71.85% vs 70.61%), but is less effective than our fine-grained dataset-level task prefix (71.85% vs 72.66%).
Multitask Learning Variants.
We compare our adaptive learning method with recent general-purpose multitask learning algorithms, using our own implementations. PCG (Yu et al., 2020) focuses on mitigating the conflict between gradients from different tasks. It performs on par with the “w/o adaptive” variant (72.47% vs 72.66%) but underperforms TACO with our adaptive learning (72.47% vs 73.74%), showing that gradient conflict is not the main bottleneck in our multitask retrieval setting. CGD (Piratla et al., 2021) aims to improve multitask learning by encouraging updates toward directions common to different tasks, which is the opposite of our method that encourages task specialty. It performs much worse than TACO (69.51% vs 73.74%) and lags behind the “w/o adaptive” variant significantly (69.51% vs 72.66%), showing that for multitask retrieval we should encourage task specialty rather than emphasize what tasks share. GradNorm (Chen et al., 2018) weights the different task losses using the average gradient norm. It performs slightly better than the naive “w/o adaptive” variant (72.80% vs 72.66%), and our adaptive learning method achieves a decent improvement over it (73.74% vs 72.80%). Note that our adaptive update is more fine-grained and critically different: we adjust learning rates along both the task and parameter dimensions, whereas GradNorm only reweights losses.
Adaptive Learning.
We consider variations of the main version of adaptive learning, which is applied only in the last ANCE episode. Specifically, we investigate applying adaptive learning to the last four ANCE episodes using an exponential softmax temperature decay scheduler. This approach yields an average page-level R-precision of 73.47%, compared with 73.74% when adaptive learning is applied only to the last episode, suggesting that extending adaptive learning to more ANCE episodes does not yield improvement. Additionally, we examine the effectiveness of encouraging task specialization within adaptive learning. For this purpose, we focus on the second ANCE episode and experiment with a positive softmax temperature (encouraging task specialty) and a negative softmax temperature (discouraging task specialty). Encouraging task specialization results in an average page-level R-precision of 70.53%, while discouraging it leads to 68.39%; the standard multitask baseline at the second ANCE episode achieves 69.28%. These results highlight the benefit of encouraging task specialization and the detrimental effect of discouraging it. Finally, normalizing task sensitivity using the median is preferred over using the mean or applying no normalization, as different tasks exhibit variations in magnitude while sharing similar distribution shapes (see Figure 2).
4.3.2 Task Specialization
Figure 1 plots the histograms of task entropy for the learned parameters. The task entropy of each parameter is calculated with the distribution defined in Equation 2. We first group parameters into two special bins. The “Task Specific” bin includes parameters whose entropy is smaller than 0.3, which is the entropy of placing 95% probability on one task and spreading the remaining 5% uniformly over the other seven. The “Not Activated” bin includes parameters whose sensitivity w.r.t. all tasks is near zero (< 1e-8). TACO significantly increases the fraction of task-specific parameters to 22%, compared with 19% for the naive multitask model (w/o prefix, w/o adaptive). It also reduces the fraction of not-activated parameters, showing that optimizing for task specialty also better utilizes model capacity.
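For concreteness, the binning can be computed as in the sketch below, assuming a precomputed (K × N) sensitivity matrix for the K = 8 tasks and N parameters; the normalization inside Equation 2 is assumed here to be a simple sum-to-one.

```python
import torch

def task_entropy_bins(sens: torch.Tensor, eps: float = 1e-8):
    """sens: (K, N) nonnegative sensitivities of N parameters to K tasks.

    Returns per-parameter task entropy and the two special bins of
    Figure 1. 0.3 is (approximately) the entropy of putting 95% mass on
    one task and spreading the remaining 5% uniformly over the other seven.
    """
    total = sens.sum(dim=0)
    not_activated = total < eps                       # "Not Activated" bin
    p = sens / total.clamp_min(eps)                   # distribution of Eq. 2 (assumed)
    entropy = -(p * p.clamp_min(eps).log()).sum(dim=0)
    task_specific = (entropy < 0.3) & ~not_activated  # "Task Specific" bin
    return entropy, task_specific, not_activated
```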
Figure 2 plots the kernel density estimate of the distribution of task-specific sensitivity in TACO and the standard multitask model for four KILT tasks. We drop outliers that deviate significantly from the median to ease visualization. Notably, TACO exhibits a noticeable reduction in the peak on the low-sensitivity side for each task compared to the standard multitask model. This suggests that TACO activates a larger number of parameters and enhances their sensitivity toward individual tasks.
4.3.3 Additional Benchmark
To test TACO in a setup other than KILT, we constructed an additional benchmark containing MS-MARCO (Nguyen et al., 2016), ZESHEL (Logeswaran et al., 2019), a document-level version of FEVER from BEIR (Thakur et al., 2021), and Natural Questions from KILT. We chose this combination for a few reasons. First, few public datasets outside KILT provide sufficiently large and high-quality training data other than MS-MARCO and ZESHEL. Second, each task now has its own KB to retrieve from, making this a rather different setup from KILT, in which all tasks share one KB. We compare task-specific retrievers and multitask retrievers trained by TACO and other methods. Table 4 shows their recall at 100 on the validation split. We see that multitasking is clearly beneficial on this benchmark. The best performance is obtained by CGD, the only multitask optimization method that yields noticeable improvements over the standard multitask model. Given that CGD encourages updates toward directions common to different tasks, we hypothesize that the need for task specialization is diminished here because the tasks are more similar in difficulty (e.g., in KILT, T-REx and zsRE are much easier than HotpotQA). This experiment sheds light on which multitask settings benefit most from task specialization.
Table 4: Recall@100 (%) on the validation split of the additional benchmark. MS: MS-MARCO; ZES: ZESHEL; FEV: FEVER (BEIR); NQ: Natural Questions (KILT).

| Model | MS | ZES | FEV | NQ | Avg |
| --- | --- | --- | --- | --- | --- |
| Task-specific | 73.3 | 67.3 | 90.0 | 71.8 | 75.6 |
| TACO | 85.8 | 67.6 | 91.2 | 76.8 | 80.4 |
| w/o adapt | 85.9 | 67.5 | 91.3 | 76.6 | 80.3 |
| w/o prefix, adapt | 86.2 | 68.1 | 92.2 | 76.4 | 80.7 |
| PCG | 86.1 | 67.9 | 91.7 | 76.9 | 80.7 |
| CGD | 86.8 | 69.1 | 94.4 | 76.2 | 81.6 |
| GradNorm | 86.0 | 67.3 | 91.7 | 76.9 | 80.5 |
5 Conclusions
Multitask retrieval has compelling practical advantages such as model simplicity and memory efficiency, but it lags behind task-specific retrieval in the existing literature. We have shown that it is possible to significantly improve the performance of multitask retrieval by promoting task specialization. The key steps are the use of a base model optimized for multitasking with appropriate prompting and a per-parameter adaptive learning technique that upweights the task gradients by the parameters’ sensitivity to the task losses. We have achieved strong results on the KILT retrieval benchmark.
Notes
1. Our code and model checkpoints are publicly available at https://github.com/WenzhengZhang/TACO.
2. We write “task” and “dataset” synonymously instead of distinguishing datasets from task types as done in some previous work. Thus KILT has 8 tasks and 5 task types.
References
A Algorithm in Matrix Form
B Data Details
See Table 5 for data statistics and some data-related hyperparameters. We randomly downsample T-REx and zsRE to bring them to the same order of magnitude as the others. We follow Raffel et al. (2019) and use a temperature-scaled mixing sampling strategy to compute the batch size for each task k: B_k ∝ N_k^(1/c) for some temperature c (we set c = 4 in our experiments), where N_k is the dataset size of task k. Note that we compute the task loss of each task batch independently instead of mixing all task batches at every optimization step. Each dataset needs to sample a different number of batches to cover every training sample in that dataset once; we set the number of batches per epoch to the maximum over datasets, and shuffle and cycle the batch sampling iterators of datasets that finish iterating early. The batch size of each dataset, computed with mixing temperature c = 4, is given in Table 5 (a sketch of the computation follows the table).
Table 5: Data statistics and data-related hyperparameters. #Train: number of training examples; B: per-task batch size (mixing temperature c = 4); L: maximum query length in tokens.

| Dataset | #Train | B | L |
| --- | --- | --- | --- |
| Natural Questions | 76k | 16 | 32 |
| TriviaQA | 53k | 14 | 32 |
| HotpotQA | 69k | 15 | 32 |
| Wizard of Wikipedia | 80k | 16 | 256 |
| T-REx | 95k | 16 | 32 |
| FEVER | 71k | 15 | 64 |
| Zero Shot RE | 100k | 17 | 32 |
| AIDA-YAGO 2 | 18k | 11 | 128 |
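A minimal sketch of this computation follows; the total per-step batch budget is our assumption (a budget of 120 with c = 4 reproduces the B column of Table 5).

```python
# Sketch of temperature-scaled batch-size mixing (budget is an assumption).
def mixed_batch_sizes(dataset_sizes: dict, budget: int = 120, c: float = 4.0) -> dict:
    """Allocate per-task batch sizes proportional to N_k^(1/c)."""
    weights = {k: n ** (1.0 / c) for k, n in dataset_sizes.items()}
    total = sum(weights.values())
    return {k: round(budget * w / total) for k, w in weights.items()}

sizes = {"NQ": 76_000, "TQA": 53_000, "HoPo": 69_000, "WoW": 80_000,
         "T-REx": 95_000, "FEV": 71_000, "zsRE": 100_000, "AY2": 18_000}
print(mixed_batch_sizes(sizes))
# {'NQ': 16, 'TQA': 14, 'HoPo': 15, 'WoW': 16, 'T-REx': 16, 'FEV': 15,
#  'zsRE': 17, 'AY2': 11}
```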
C Other Training Details
The data-related hyperparameters, such as maximum input query length and batch size, are listed in Table 5. The training hyperparameters are listed in Table 6. We use an NCE loss with cross-device in-batch negatives mixed with hard negatives to compute each task loss, sampling two hard negatives per query. We use a “burn-in” period with uniform learning rates for the first 10% of training steps so that parameters can establish their task tendencies before adaptive learning takes effect. All of our experiments are run on a machine with 8 A100 80GB GPUs. Our implementation is built upon OpenMatch (Liu et al., 2021).
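As an illustration, here is a single-device sketch of the per-task loss; the actual implementation gathers in-batch negatives across devices via OpenMatch, which we omit.

```python
import torch
import torch.nn.functional as F

def task_nce_loss(q, pos, hard_negs):
    """Sketch of NCE with in-batch negatives mixed with hard negatives.

    q:         (B, d) query embeddings for one task batch
    pos:       (B, d) gold passage embeddings
    hard_negs: (B, H, d) mined hard negatives (H = 2 per query)
    """
    B, d = q.shape
    # Candidate pool: the B positives (serving as in-batch negatives for
    # the other queries) plus all B*H hard negatives.
    cands = torch.cat([pos, hard_negs.reshape(-1, d)], dim=0)  # (B + B*H, d)
    scores = q @ cands.t()                                     # dot-product scores
    labels = torch.arange(B, device=q.device)                  # gold is row i
    return F.cross_entropy(scores, labels)
```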
D Softmax Temperature and Momentum Ratio
Table 7 shows the impact of the softmax temperature on validation R-precision for our adaptive learning. Table 8 shows the impact of the momentum factor on validation R-precision.
Table 7: Impact of the softmax temperature τ on average validation page-level R-precision (%).

| τ | 0.1 | 1 | 2 | 5 | 10 | 100 |
| --- | --- | --- | --- | --- | --- | --- |
| Avg R-prec | 72.45 | 73.23 | 73.74 | 73.72 | 73.48 | 72.85 |
E Passage-level Performance
Table 9 shows the passage-level R-precision on KILT validation data. We also list the passage-level performance from Maillard et al. (2021) for comparison.
Table 9: Passage-level R-precision (%) on the KILT validation data. Columns as in Table 1 (AIDA is excluded).

| Model | FEV | T-REx | zsRE | NQ | HoPo | TQA | WoW | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MT-DPR* | 46.96 | 53.54 | 41.70 | 28.80 | 38.42 | 24.56 | 24.07 | 36.86 |
| Task-specific DPR* | 43.92 | 58.54 | 78.81 | 28.13 | 43.47 | 23.79 | 20.73 | 42.48 |
| Task-specific (ours) | 44.89 | 72.09 | 84.47 | 33.14 | 43.40 | 29.57 | 27.64 | 47.89 |
| TACO | 60.76 | 72.57 | 82.80 | 31.16 | 46.72 | 28.32 | 33.24 | 50.80 |
Author notes
Work done during an internship at Microsoft.
Action Editor: Dani Yogatama