Abstract
Label scarcity is a bottleneck for improving task performance in specialized domains. We propose a novel compositional transfer learning framework (DoT51) for zero-shot domain transfer. Without access to in-domain labels, DoT5 jointly learns domain knowledge (from masked language modelling of unlabelled in-domain free text) and task knowledge (from task training on more readily available general-domain data) in a multi-task manner. To improve the transferability of task training, we design a strategy named NLGU: We simultaneously train natural language generation (NLG) for in-domain label-to-data generation, which enables data augmentation for self-finetuning and natural language understanding (NLU) for label prediction. We evaluate DoT5 on the biomedical domain and the resource-lean subdomain of radiology, focusing on natural language inference, text summarization, and embedding learning. DoT5 demonstrates the effectiveness of compositional transfer learning through multi-task learning. In particular, DoT5 outperforms the current state-of-the-art in zero-shot transfer by over 7 absolute points in accuracy on RadNLI. We validate DoT5 with ablations and a case study demonstrating its ability to solve challenging NLI examples requiring in-domain expertise.
1 Introduction
While pretrained language models demonstrate massive improvements on a wide range of natural language processing (NLP) tasks, it remains challenging to apply them to specialized domains (Ramponi and Plank, 2020). To acquire domain-specific task knowledge, a conventional approach is to perform domain-specific pretraining—usually masked language modelling (MLM) on in-domain raw text—followed by finetuning with in-domain task-annotated data (Lee et al., 2020; Gu et al., 2021; Boecking et al., 2022). However, this approach requires in-domain task labels that can be expensive to acquire. Another approach is to train a model with the usually abundant general-domain task labels and directly transfer to the new domain (Romanov and Shivade, 2018; Ma et al., 2021), but the transfer performance is often limited by the domain gap. Past studies on zero-shot domain transfer or unsupervised domain adaptation have explored methods to transfer task knowledge from a source domain to an unseen target domain (Ramponi and Plank, 2020; Ganin and Lempitsky, 2015), but they usually require external modules to perform feature or domain alignment and are not always easily applicable to pretrained language models. In particular, there is little understanding of how we can leverage and combine domain-specific knowledge and general-domain task knowledge in the context of the recent success of text-to-text architectures in transfer learning.
To close this gap, we propose DoT5, a novel compositional zero-shot domain-transfer framework based on the state-of-the-art (SOTA) transfer learning model Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020). Throughout, the ‘zero-shot’ setup refers to zero-shot domain transfer with no access to labelled in-domain data.2 By “compositional” we mean that DoT5 combines seen task labels and domain text to acquire an unseen combination of task and domain knowledge.
As shown in Figure 1, DoT5 combines domain knowledge and task knowledge by making the best use of in-domain free text and general-domain task labels, which are typically accessible and abundant. For example, in the context of natural language inference (NLI), DoT5 can learn domain-specific semantics (e.g., “bony abnormalities” is a synonym of “osseous abnormalities”) from in-domain free text and transferable task knowledge from general-domain task labels (e.g., negation indicates contradiction) to infer domain-specific task knowledge (e.g., “There are no bony abnormalities” contradicts “There are osseous abnormalities”).
We apply DoT5 to NLI, summarization, and text embedding learning, which are fundamental applications across many domains, and we explore zero-shot domain transfer to the high-value and highly specialized domain of biomedicine and its extremely low-resource subdomain of radiology. Due to their specialization, obtaining labelled data in these domains is expensive and time-consuming. For example, the radiology-specific NLI dataset RadNLI (Miura et al., 2021) contains only 960 manually labelled examples as development and test data, and no training data is available.
The key to DoT5’s compositional transfer is continual multi-task pretraining to simultaneously acquire domain and task knowledge: We jointly train T5 with MLM on in-domain unlabelled data and general-domain tasks (NLI and summarization). To better acquire the transferable task knowledge from the general-domain task labels, we propose a multi-task setup we call NLGU. As depicted in Figure 2, NLGU gives each task two formulations: natural language generation (NLG) (label-to-data generation), and natural language understanding (NLU) (data-to-label prediction). NLU enables label prediction when tested in an unseen domain and forces model sensitivity to the conditioned label, assisting NLG. Meanwhile, NLG enables downstream tasks such as summarization or data augmentation. This enables DoT5 to generate its own NLI in-domain task data for further finetuning (a process we call self-finetuning), or to generate positive and negative examples for improving text embeddings by contrastive learning (Oord et al., 2018).
Our experiments show the effectiveness of DoT5 in zero-shot domain transfer, and our proposed multi-task compositional approach achieves large gains compared with sequential training with T5 across all tasks. In particular, we achieve SOTA zero-shot domain transfer performance on RadNLI (Miura et al., 2021), outperforming baselines including large language models (LLMs), sequential training approaches, and task-specific baselines by large margins. We also identify several key insights through extensive analysis: 1) All three key components (in-domain MLM, NLGU, self-finetuning) in DoT5 are important for transfer success, and multi-task learning with in-domain MLM is the key to combining domain and task knowledge. 2) Scaling up model size significantly improves transfer performance. 3) DoT5 is able to solve challenging domain-specific task examples, indicating that it acquires domain-specific task knowledge through compositional transfer.
To summarize, we present the following major contributions: 1) We propose DoT5, a general framework for compositional transfer learning with text-to-text models, and show that multi-task training is superior to sequential training in the models’ domain transfer. 2) With a novel NLGU training strategy combining generation and understanding, DoT5 can be used for both classification and generation tasks.3 With the latter, DoT5 can perform self-finetuning to further improve transfer performance. 3) We show the effectiveness of DoT5 in zero-shot domain transfer, achieving SOTA zero-shot performance in radiology NLI. 4) Comprehensive analysis demonstrates the inner workings of DoT5’s compositional transfer.
2 Related Work
Cross-task Transfer with Text-to-text Models
T5 (Raffel et al., 2020) unifies NLP tasks under a seq-to-seq framework and solves them using a single model. T0 (Sanh et al., 2022), FLAN (Wei et al., 2022), MetaICL (Min et al., 2022), and ExT5 (Aribandi et al., 2022) build on top of this idea and explore pretraining T5 with a massive collection of NLP datasets with diverse natural language prompts. Among them, T0, FLAN, and MetaICL investigate pretraining on a set of tasks, and then zero-shot transfer to another set of unseen tasks.
Domain-specific Pretraining
Gururangan et al. (2020) show that continual training on domain and task data can adapt pretrained models to new domains and tasks. Both BioBERT (Lee et al., 2020) and BlueBERT (Peng et al., 2019) apply the BERT pretraining protocol (i.e., masked language modelling and next sentence prediction) to PubMed Central (PMC) or PubMed articles. They continue pretraining BERT checkpoints instead of training from scratch. Gu et al. (2021) demonstrate the importance of domain-specific vocabulary and pretraining from scratch when in-domain text is abundant, and produce PubMedBERT by pretraining on PubMed articles. Similar to PubMedBERT, SciBERT (Beltagy et al., 2019) pretrains from scratch on a mix of PMC and computer science publications. Boecking et al. (2022) introduce CXR-BERT, which is pretrained on biomedical and radiology corpora. SciFive (Phan et al., 2021) continually pretrains T5 checkpoints on PubMed abstracts with seq-to-seq MLM. We compare to finetuned versions of SciFive, PubMedBERT, and CXR-BERT in §4.3.
Zero-shot Domain Transfer Learning
Training in one domain and directly testing on another domain has been a prevalent paradigm in zero-shot cross-domain transfer (Miura et al., 2021; Boecking et al., 2022; Agrawal et al., 2022). A similar zero-shot setup is also frequently seen in other transfer learning scenarios such as cross-lingual zero-shot learning (Conneau et al., 2018, 2020). Our summarization experiment is most similar to such a direct zero-shot setup. Concurrently, Pan et al. (2022) also propose to combine in-domain training and out-of-domain task knowledge: They build a zero-shot in-domain question answering model by finetuning a general-domain RoBERTa model first on domain-specific NER and then on general-domain question answering. This study is the closest to our approach, with several key differences: Their method requires in-domain labels (in-domain NER) whereas we do not require any in-domain task labels. They only test on question answering whereas we evaluate on a more diverse range of datasets. Additionally, they perform sequential training whereas we perform multi-task training. Finally, their model is not generative and therefore cannot perform NLGU and self-finetuning as we do in our approach (see §3).
Our proposed NLGU and self-finetuning strategies are closely related to cross-domain data augmentation. A line of work in information retrieval generates “in-domain” pseudo training data by leveraging unlabelled in-domain texts. For example, Ma et al. (2021) and Wang et al. (2022) train a passage-to-query generator to synthesize in-domain queries for zero-shot passage retrieval. Similarly, the NLG component in our proposed NLGU strategy can also perform data augmentation, but with better granularity and diversity, as we can generate label-conditioned task data to create both positive and negative examples.
Besides zero-shot transfer in NLP, unsupervised domain adaptation (which likewise assumes labels in the source domain and only unlabelled data in the target domain) is a long-standing research topic in machine learning in general (Huang et al., 2006; Pan et al., 2010b; Ganin and Lempitsky, 2015; Ramponi and Plank, 2020). Many conventional unsupervised domain adaptation methods require external components to align domains at the feature/embedding level. For example, Pan et al. (2010a) propose applying spectral feature alignment to align domain-specific words across domains into unified clusters. Ganin and Lempitsky (2015) add a domain classifier that promotes domain-invariant features via a gradient reversal layer. These methods are not always immediately suitable for recent pretrained language models, especially text-to-text models. In comparison, our approach exploits the task-unifying nature of text-to-text models, which have inherent transfer learning abilities and require minimal architecture changes.
3 Method
We use T5, an encoder-decoder generative language modelling framework (Raffel et al., 2020), to learn a conditional sequence generator P(output|input). T5 is chosen for two reasons: 1) It is a strong transfer learning model, and 2) it can unify classification and generation, which has potential to further boost transfer performance (see NLGU discussion in §3.2). We use the same pretraining objective (cross-entropy with teacher-forcing) as in T5.
3.1 Continual Pretraining with In-domain MLM
For ℒdomain-MLM we use the MLM loss (Devlin et al., 2019) to continually pretrain a T5 on in-domain free text: Given a piece of sampled radiology or biomedical text, we randomly mask 15% of its tokens and ask the model to denoise the masked input sequence, i.e., generate the masked tokens.
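A minimal sketch of this denoising setup, assuming T5-style sentinel tokens; the exact span-merging and masking details of the T5 implementation may differ.

```python
import random

def t5_mlm_example(tokens, mask_rate=0.15, seed=0):
    """Turn a token list into a (corrupted input, target) pair for
    seq-to-seq MLM with T5-style sentinel tokens."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_rate))
    masked = set(rng.sample(range(len(tokens)), n_mask))

    inp, tgt, sentinel, prev_masked = [], [], 0, False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not prev_masked:                       # open a new masked span
                inp.append(f"<extra_id_{sentinel}>")
                tgt.append(f"<extra_id_{sentinel}>")
                sentinel += 1
            tgt.append(tok)                           # model must generate this token
            prev_masked = True
        else:
            inp.append(tok)
            prev_masked = False
    tgt.append(f"<extra_id_{sentinel}>")              # closing sentinel
    return " ".join(inp), " ".join(tgt)

report = "No focal consolidation pleural effusion or pneumothorax is identified".split()
corrupted, target = t5_mlm_example(report)
# corrupted: the report with ~15% of tokens replaced by <extra_id_k> sentinels
# target:    the sentinels followed by the tokens they hide
```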
3.2 Continual Pretraining on General-domain Tasks
For ℒtask, we define (x1, x2) as a text pair that denotes (premise, hypothesis) for NLI, and (document, summary) for summarization. The standard NLI task assigns labels from y ∈ {entailment, neutral, contradiction}, and the task is modelled as P(y | x1, x2). Summarization is usually cast as P(x2 | x1). We follow Sanh et al. (2022) in adopting a multi-task learning strategy to train summarization and NLI simultaneously. Hence, the basic setup of task learning is: NLI as a discriminative NLU task plus summarization as an NLG task.
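Putting §3.1 and §3.2 together, the overall continual-pretraining objective can be written as a sum of the two losses, with the task loss splitting into the NLU and NLG views introduced next. The unweighted sum is our reading (the 1:1 data mixing in §4.1 plays the role of a weight); it is not a formula stated explicitly in the text.

```latex
% Combined continual-pretraining objective (our reading; per-term weights are
% possible, cf. the loss-weighting discussion in the Limitations):
\mathcal{L}_{\mathrm{DoT5}}
  = \mathcal{L}_{\mathrm{domain\text{-}MLM}} + \mathcal{L}_{\mathrm{task}},
\qquad
\mathcal{L}_{\mathrm{task}}
  = \mathcal{L}_{\mathrm{NLU}} + \mathcal{L}_{\mathrm{NLG}}.
```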
NLGU: Simultaneous NLG and NLU
| Task | Setting | Prompt (Input) | Output |
| --- | --- | --- | --- |
| NLI | NLG | Generate a {label} sentence of: {premise} | {hypothesis} |
| NLI | NLU | {premise} Question: {hypothesis} True, False or Neither? | {True / False / Neither} |
| Sum. | NLG | Generate a {label} summary of: {document} | {summary} |
| Sum. | NLU | {document} Question: {summary} True or False? | {True / False} |
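For concreteness, a minimal sketch of how one labelled general-domain NLI example is expanded into both NLGU views using the templates above; the exact control-code wording (‘entailed’, ‘contradictory’) and the label-to-answer mapping are our assumptions for illustration.

```python
def nlgu_views(premise, hypothesis, label):
    """Expand one labelled NLI pair into an NLG and an NLU seq-to-seq example,
    following the prompt templates in the table above."""
    control = {"entailment": "entailed", "neutral": "neutral",
               "contradiction": "contradictory"}[label]
    answer = {"entailment": "True", "neutral": "Neither",
              "contradiction": "False"}[label]
    nlg = (f"Generate a {control} sentence of: {premise}", hypothesis)
    nlu = (f"{premise} Question: {hypothesis} True, False or Neither?", answer)
    return nlg, nlu

nlg_ex, nlu_ex = nlgu_views("There are no bony abnormalities.",
                            "There are osseous abnormalities.",
                            "contradiction")
# nlg_ex: ("Generate a contradictory sentence of: There are no bony abnormalities.",
#          "There are osseous abnormalities.")
# nlu_ex: ("There are no bony abnormalities. Question: There are osseous abnormalities."
#          " True, False or Neither?", "False")
```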
3.3 Task-specific Designs for In-domain Zero-shot Inference
After continual pretraining, we zero-shot-transfer the trained model to three applications in specialized domains without requiring labels from these domains: 1) NLI, 2) summarization, and 3) text embedding learning.
NLI (with Self-finetuning)
While the model is capable of directly performing NLI after training on general-domain NLI task labels with P(y | x1, x2), we propose an additional step, self-finetuning, to boost transfer performance (§5.1). We first use the model’s NLG capabilities to generate pseudo in-domain NLI data: We sample a set of sentences from the target domain as premises, and prompt the pretrained model to generate hypotheses (the NLG task) with each of the three control codes (labels). This pseudo-in-domain NLI dataset is then used as additional training data to finetune the same model on the NLU task, P(y | x1, x2). The resulting finetuned model is then used for zero-shot NLI transfer.
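The self-finetuning loop can be sketched as follows; this is a minimal illustration assuming a text-in/text-out `generate_fn` wrapping the continually pretrained model, and the control-code wording and label-to-answer mapping mirror the NLGU templates above rather than an exact implementation.

```python
CONTROL_TO_ANSWER = {"entailed": "True", "neutral": "Neither", "contradictory": "False"}

def build_pseudo_nli(premises, generate_fn):
    """Generate one hypothesis per (premise, control code) with the NLG prompt,
    then recast each triple as an NLU training example for self-finetuning."""
    pseudo = []
    for premise in premises:
        for control, answer in CONTROL_TO_ANSWER.items():
            hypothesis = generate_fn(f"Generate a {control} sentence of: {premise}")
            pseudo.append({
                "input": f"{premise} Question: {hypothesis} True, False or Neither?",
                "target": answer,
            })
    return pseudo

# Usage sketch: generate_fn wraps the continually pretrained model's decoder,
# and the resulting pseudo data is used for standard seq-to-seq finetuning
# before zero-shot NLI inference.
```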
Text Summarization
We directly prompt the model after continual pretraining to summarize in-domain documents. We use the same prompt as in pretraining: “Generate an entailed summary of: {document}”. The output summary is then compared against the gold summary. Since this is already a text generation task, i.e., P(x2 | x1), we cannot exploit self-finetuning as for NLI: training on the model’s own generated pseudo data would not improve generation.
Text Embedding Learning
DoT5 can be directly used as a generator for data augmentation. Apart from creating more pseudo NLI task data to improve NLI, DoT5 can improve domain-specific embedding learning in general. To do so, we sample a set of in-domain sentences as anchors, and prompt the trained model to generate entailed and contradictory sentences to form positive and negative pairs for each anchor. With a beam search size of 5, we sample the top-k most probable sequences as the entailed (positive) and contradictory (negative) sentences for each anchor.4 Given the collected anchors and positive/negative sentences, we finetune a SOTA sentence embedding model with a contrastive loss. Specifically, we continually finetune the all-mpnet-base-v25 model with a variant of InfoNCE (Oord et al., 2018) modified to handle multiple positives (Miech et al., 2020). The learned embedding space is then used for query-document retrieval or for computing text similarity.
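A minimal sketch of the multiple-positives InfoNCE variant used for this contrastive finetuning; the temperature value and the use of cosine similarity are illustrative assumptions, and in practice the embeddings come from the all-mpnet-base-v2 encoder being finetuned.

```python
import torch
import torch.nn.functional as F

def multi_positive_info_nce(anchor, positives, negatives, temperature=0.05):
    """InfoNCE with several positives per anchor (cf. Miech et al., 2020):
    the numerator sums over all positives rather than a single one.

    anchor:    (d,)   embedding of a sampled in-domain sentence
    positives: (P, d) embeddings of generated entailed sentences
    negatives: (N, d) embeddings of generated contradictory sentences
    """
    anchor = F.normalize(anchor, dim=-1)
    positives = F.normalize(positives, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logits = positives @ anchor / temperature   # (P,)
    neg_logits = negatives @ anchor / temperature   # (N,)

    # -log( sum_p exp(pos_p) / (sum_p exp(pos_p) + sum_n exp(neg_n)) )
    all_logits = torch.cat([pos_logits, neg_logits])
    return torch.logsumexp(all_logits, dim=0) - torch.logsumexp(pos_logits, dim=0)

# Three generated positives and negatives per anchor, as in our setup:
loss = multi_positive_info_nce(torch.randn(768), torch.randn(3, 768), torch.randn(3, 768))
```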
4 Experiment
4.1 Experimental Setup
Details of the datasets used for training and evaluation are given in Table 2.
| Domain | Dataset | Task | Used for | # Examples |
| --- | --- | --- | --- | --- |
| General | SNLI – Bowman et al. | NLI | Task pretrain. | 550K |
| General | MultiNLI – Williams et al. | NLI | Task pretrain. | 392K |
| General | AdversarialNLI – Nie et al. | NLI | Task pretrain. | 162K |
| General | Gigaword – Graff et al. | Summ. | Task pretrain. | 1M |
| Radiology | MIMIC-CXR – Johnson et al. | MLM | Domain pretrain. | 227K |
| Radiology | RadNLI – Miura et al. | NLI | Evaluation | 480 |
| Radiology | Open-I – Demner-Fushman et al. | Summ. | Evaluation | 683 |
| Biomedical | PubMed Abstracts | MLM | Domain pretrain. | 4.2M |
| Biomedical | MedNLI – Romanov and Shivade | NLI | Evaluation | 1.4K |
| Biomedical | PubMed ‘ShortSum’ | Summ. | Evaluation | 5K |
| Biomedical | MedSTS – Yanshan et al. | Similarity | Evaluation | 371 |
Pretraining Datasets
As our continual pretraining is a multi-task process, we balance the in-domain and general-domain datasets in each batch via up/downsampling as needed: For radiology, we upsample MIMIC-CXR samples via duplication, whereas for biomedicine we downsample PubMed abstracts, in each case matching the general-domain task dataset size. We also balance the number of samples coming from each task, downsampling the summarization dataset to roughly match that of NLI (‘# Examples’ in Table 2). Experiments with DoT5small (Figure 3) indicate that downstream task performance could be boosted by tuning the relative prevalence of the data sources, with a task-dependent optimal value. In this proof-of-concept study, we fix a ratio of 1:1.
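A minimal sketch of the 1:1 batch mixing described above; sampling with replacement stands in for the duplication/downsampling, and epoch bookkeeping is omitted.

```python
import random

def mixed_batches(in_domain_mlm, general_task, batch_size, seed=0):
    """Yield batches with a fixed 1:1 ratio of in-domain MLM examples to
    general-domain task examples."""
    rng = random.Random(seed)
    half = batch_size // 2
    while True:
        batch = rng.choices(in_domain_mlm, k=half) + rng.choices(general_task, k=half)
        rng.shuffle(batch)
        yield batch
```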
We generate counterfactual summaries of Gigaword following Rajagopal et al. (2022). Specifically, we run a named entity recognition model, the “en_core_web_sm” trained SpaCy pipeline (Honnibal et al., 2020), on the documents from the Gigaword summarization training data. For each document that contains a named entity, we randomly sample an entity and replace it with a different named entity of the same category from the training corpus. This is our ‘counterfactual’ example.6 We also filter out noisy data when the generated counterfactual contains UNK or #. The resulting dataset, as listed in Table 2, consists of 50% document-‘wrong summary’ pairs (i.e., 500k pairs), one for each true document-summary pair. To create pseudo NLI data for the self-finetuning process, we use all premises from the RadNLI/MedNLI development set and generate one entailed, one neutral, and one contradictory hypothesis for each premise. In total, we have 1440 and 4185 pseudo examples for RadNLI and MedNLI, respectively.
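A hedged sketch of the entity-swap step, assuming the spaCy pipeline named above; whether the swap is applied to the document or to the reference summary, and how the entity pool is sampled, follow the description only loosely.

```python
import random
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")
rng = random.Random(0)

def build_entity_pool(texts):
    """Collect named entities from the corpus, grouped by entity type."""
    pool = defaultdict(set)
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            pool[ent.label_].add(ent.text)
    return {label: sorted(ents) for label, ents in pool.items()}

def make_counterfactual(text, pool):
    """Swap one randomly chosen entity for a different entity of the same type,
    producing the 'wrong' half of a document-'wrong summary' pair; returns None
    if there is nothing to swap or the result contains noise (UNK or '#')."""
    doc = nlp(text)
    if not doc.ents:
        return None
    ent = rng.choice(doc.ents)
    candidates = [e for e in pool.get(ent.label_, []) if e != ent.text]
    if not candidates:
        return None
    swapped = text.replace(ent.text, rng.choice(candidates), 1)
    return None if ("UNK" in swapped or "#" in swapped) else swapped
```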
Evaluation Datasets and Metrics
All the evaluation datasets are from domain-specific tasks (Table 2). For NLI, we report accuracy and macro-F1 (out of 100, for legibility) on the test set of RadNLI and MedNLI. For summarization in radiology, we evaluate on findings-to-impression7 summarization on the test split of the Open-I dataset (Demner-Fushman et al., 2016). For biomedical summarization, we create an abstract-to-title summarization dataset, ‘PubMed ShortSum’. The data for this task is sampled from PubMed and filtered to abstracts shorter than 1000 characters. Compared with the traditional article-to-abstract PubMed summarization task, which evaluates long summary generation for long text (Cohan et al., 2018), PubMed ShortSum evaluates extreme summarization for short text and is a more comparable task to our general domain Gigaword summarization. For summarization evaluation, we use standard lexical metrics (BLEU-4, ROUGE-L) and domain-specific factuality metrics: named entity matching (NEM) for both radiology (Miura et al., 2021) and biomedical (Alambo et al., 2022) summarization, and CheXbert (Smit et al., 2020)8 for radiology.
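For concreteness, a simplified entity-overlap F1 in the spirit of NEM; this is an illustration only, not the exact metric of Miura et al. (2021) or Alambo et al. (2022), whose entity extractors and matching rules differ.

```python
def entity_overlap_f1(pred_entities, gold_entities):
    """Simplified entity-overlap F1: precision/recall of the entity set
    extracted from the generated summary against the reference's entity set."""
    pred, gold = set(pred_entities), set(gold_entities)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```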
We evaluate embeddings trained for the biomedical domain on MedSTS (Yanshan et al., 2020), a clinical text similarity benchmark. Since the radiology domain has no text similarity datasets available, we design an impression-to-findings retrieval task on the Open-I test set, and report Accuracy@1/5/10. This retrieval task can also evaluate embedding quality as it requires the model to differentiate text from same/different reports by encoding texts from matching findings-impression pairs (from the same report) with similar representations.
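A minimal sketch of the retrieval evaluation, assuming impression and findings embeddings have already been computed with the finetuned encoder and that row i of each matrix comes from the same report.

```python
import numpy as np

def accuracy_at_k(impression_emb, findings_emb, ks=(1, 5, 10)):
    """impression_emb[i] and findings_emb[i] come from the same report.
    Rank all findings by cosine similarity to each impression and check
    where the paired findings section lands."""
    q = impression_emb / np.linalg.norm(impression_emb, axis=1, keepdims=True)
    d = findings_emb / np.linalg.norm(findings_emb, axis=1, keepdims=True)
    ranks = (-(q @ d.T)).argsort(axis=1)               # (num_queries, num_docs)
    hits = {k: 0 for k in ks}
    for i in range(len(q)):
        position = int(np.where(ranks[i] == i)[0][0])  # rank of the true pair
        for k in ks:
            hits[k] += position < k
    return {k: hits[k] / len(q) for k in ks}
```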
Training Details
Models are trained for 10 epochs with validation loss used for checkpoint selection. We use distributed data parallelism on eight GPUs with the largest batch size permissible given computational constraints, resulting in batch sizes of 1024, 512, and 128 for small, base, and large models. With a dataset of ∼8M samples, we thus train the large model for ∼64,000 steps per epoch. We use AdaFactor (Shazeer and Stern, 2018) with learning rates of 10⁻³ for MIMIC-CXR and 2 × 10⁻⁵ for PubMed pretraining.
4.2 Baselines
We have three categories of baselines: (1) task-specific zero-shot baseline models reported from the literature (where applicable); (2) LLMs including T0 and GPT-3; and (3) sequential training first on in-domain unlabelled data and then on general-domain task labels. All the baseline models in our study must satisfy one constraint: not using any in-domain labels for the task, but they may differ in the required training resources (detailed comparison is found in Table 3). We compare with (2) as LLMs are known to be excellent zero-shot and few-shot learners for an unseen task, and should serve as a reasonable baseline for domain transfer. We provide (3) as a straightforward baseline to sequentially combine in-domain MLM training and general-domain task training as opposed to our proposed multi-task training.
Task-specific Zero-shot Baselines
We compare with the strongest task-specific zero-shot models from the literature. For the NLI task, we compare with Miura et al. (2021) and Boecking et al. (2022), which both finetune a BERT model with MedNLI training data and then test on RadNLI. Boecking et al. (2022) perform better as they use a radiology-specific BERT model. Note that MedNLI is a nearby-domain corpus rather than general-domain task data, and in fact there have been no successful attempts in the literature to transfer general-domain NLI to RadNLI. In the sequential training section below, we establish such baselines by finetuning CXR-BERT on general-domain NLI. For MedNLI, we compare with the best transfer learning result so far, ESIM (MultiNLI), which was trained on the MultiNLI dataset (Romanov and Shivade, 2018). For radiology summarization, to our knowledge, we are the first to report results on direct transfer from general-domain summarization. For biomedical summarization, since we use a new dataset (PubMed ShortSum), there is no prior comparison.
Large Language Models
T0 (Sanh et al., 2022) and GPT-3 (Brown et al., 2020) are massively pretrained language models that can be used off-the-shelf for zero-shot or few-shot inference. T0 is pretrained on multiple tasks including general-domain summarization datasets (but not NLI), and shows strong transfer ability (Sanh et al., 2022). T0 can thus be seen both as a strong general-domain summarization model and as a strong zero-shot domain transfer baseline for summarization. T0 is also particularly effective at transferring to unseen tasks; we therefore include it as a zero-shot baseline for NLI even though it has not been trained on any NLI data. We test T0 (3B) and the most powerful T0++ (11B) model. GPT-3 (Brown et al., 2020) (davinci) is a massive language model with 175B parameters, pretrained on raw text with an autoregressive language modelling objective.
In the general domain, both models have been shown to perform reasonably well on NLI and summarization with prompting. We test their zero-shot-inference capabilities in our experiments, following the original papers for prompt design. For the NLI task, both T0 models and GPT-3 use the ANLI prompt template described in Brown et al. (2020): “<premise> Question: <hypothesis> True, False or Neither?”. For the summarization task, T0 uses the prompt: “<document> \n === \n Generate a title for this article:”. For GPT-3 summarization, we use the prompt “<document>\n\n Tl;dr:” as recommended in the OpenAI GPT-3 playground example.9 Since GPT-3 benefits when few-shot examples are incorporated in the prompt, we create two additional baselines (GPT-3-NLI and GPT-3-GW10) that perform in-context learning of the task from general-domain NLI training data (30 examples, randomly selected) and Gigaword summarization training data (20 examples, randomly selected), respectively (Table 2).
Sequential Training
The most straightforward way to exploit both in-domain unlabelled data and task labels is to first train on in-domain MLM and then further finetune on general-domain task labels.11 We provide two variants of this baseline. The first performs continual training with general-domain task labels starting from SOTA domain-specific pretrained models. We adopt SciFive (Phan et al., 2021), a T5 model pretrained on large biomedical corpora, CXR-BERT-General (Boecking et al., 2022), a radiology-specialized BERT model, and the PubMed-specific PubMedBERT (Gu et al., 2021). For finetuning these models we use the same general-domain task data as provided to DoT5; for the BERT models we finetune only on NLI. This results in the baseline models SciFivelarge-NLI, SciFivelarge-GW (summarization), CXR-BERT-NLI, and PubMedBERT-NLI. We further improve SciFivelarge-NLI by including our proposed self-finetuning stage (SciFivelarge-NLI + SFT). Since there is no radiology-pretrained T5 model, we compare with SciFive on both domains.
The second baseline type strictly compares multi-task training (DoT5) and sequential training. Here, we first pretrain T5 with in-domain MLM, and then continually pretrain on the general-domain task data, ensuring other factors remain the same including the training duration, use of NLGU, and use of self-finetuning where appropriate. We call this setting T5large-MLM → Task.
4.3 Main Results
NLI (Table 4)
DoT5large establishes new SOTA for zero-shot domain transfer on RadNLI and competitive results on MedNLI (Table 4). On RadNLI, DoT5large reaches an impressive 82.1% on accuracy and is the best performing model. It outperforms the strongest reported number from the literature (CXR-BERT) by more than 15%, and our baseline CXR-BERT-NLI by almost 7%. Comparing DoT5 to T5large-MLM → Task on RadNLI reveals the benefit of multitask training for compositional transfer.
On MedNLI, DoT5large outperforms ESIM (MultiNLI) by almost 20% (accuracy), but does not quite reach the 75.7% accuracy achieved by PubMedBERT-NLI, which establishes a new SOTA in zero-shot domain transfer on MedNLI (the supervised SOTA is 86.6%; Phan et al., 2021). Although factors such as tokenization and pretraining strategies may contribute, we speculate that the domain gap between MedNLI and our biomedical pretraining corpus explains the weaker performance of DoT5 on MedNLI. MedNLI was sourced from clinical notes in MIMIC-III, which differ distributionally from biomedical articles in PubMed. Supporting this hypothesis, we observed that DoT5 pretrained on radiology text and the sequential baseline T5large-MLM → Task achieved similar performance on MedNLI (70% accuracy), indicating that results on MedNLI may not fully reflect compositional domain knowledge transfer in our setup. In this case, a strong NLI-specific model is the most performant, though it lacks the potentially advantageous versatile text generation/summarization capabilities.
Summarization (Table 5)
DoT5large achieves competitive performance compared with the best models in the radiology (GPT-3-GW) and biomedical (T0 models) domains (Table 5). In radiology, DoT5large is the second-best model. That the strongest-performing summarization models are LLMs with substantially more parameters is not surprising; we observe in §5.2 that DoT5 also benefits from scaling. Most importantly, we again demonstrate the benefit of multi-task compositional transfer, as DoT5large significantly outperforms both T5large-MLM→Task and SciFive-GW across all metrics in both domains. This further verifies that naïve sequential training on these two sources does not lead to effective compositional knowledge transfer. We also acknowledge that domain transfer is generally more difficult for generation tasks: We cannot apply the NLG data augmentation and self-finetuning pipeline, as it would amount to training the model to generate its own outputs.
Text Embedding Learning (Table 6)
The DoT5-generated examples greatly improve the SOTA sentence embedding model’s performance on both impression-to-findings retrieval in radiology and semantic textual similarity (MedSTS) in the biomedical domain (Table 6). This is evidence that the DoT5-generated sentences are of high quality and capture the semantic similarity and contradiction required for learning a good embedding model. We also compare with an ablated version of DoT5 without in-domain MLM to generate data and find that the full model performs better across the board. This shows the importance of domain training for generating good in-domain examples. We explore this further in §5.1.
Radiology (Open-I Retrieval)

| Model | Acc@1 | Acc@5 | Acc@10 |
| --- | --- | --- | --- |
| all-mpnet-base-v2 | 8.3 | 15.1 | 20.2 |
| + DoT5large (no-MLM) | 12.0 | 19.9 | 22.8 |
| + DoT5large | 13.3 | 20.4 | 25.5 |

Biomedicine (MedSTS)

| Model | r | ρ |
| --- | --- | --- |
| all-mpnet-base-v2 | 72.8 | 64.6 |
| + DoT5large (no-MLM) | 76.4±0.04 | 67.1±0.06 |
| + DoT5large | 76.9±0.00 | 67.9±0.09 |
5 Further Analysis
5.1 Ablation Study
Through ablations, we probe the contributions of key components of DoT5: 1) in-domain MLM (§3.1), 2) NLGU (combining NLU and NLG) (§3.2), and 3) self-finetuning for zero-shot NLI (§3.3). We conduct these ablations on the radiology domain with DoT5large. The results are shown in Table 7.
| Setting | RadNLI (acc.) | Sum. (NEM) |
| --- | --- | --- |
| DoT5large (full model) | 82.1 | .082 |
| (1) no in-domain MLM | 63.5 | .015 |
| (2) no NLGU & (3) | 59.0 | .052 |
| (3) no self-finetuning | 49.6 | − |
We observe that all components are essential to the success of the model. In-domain MLM is especially important for summarization, without which the model fails in zero-shot transfer as it often just extracts a random subsequence of the document. Removing NLGU harms both NLI and summarization. Training without NLGU removes the NLG component from NLI and therefore disables self-finetuning. Self-finetuning is the most important component for boosting NLI performance, without which the model’s accuracy drops more than 30%. As shown in Table 4, SciFive also benefits from self-finetuning in this way. This indicates that the pseudo in-domain NLI task data generated by NLGU is crucial. Training without NLGU also removes the NLU task for summarization and brings down the performance, indicating that having an NLU task can also benefit generation.
We hypothesize that NLU improves NLG by forcing the model to be more sensitive to the control code in the prompt, leading to improved pseudo-data generation and better summarization. To test this, following Tang et al. (2018), we compute the maximum attention weights across all attention heads to the control codes in the prompt when generating an NLI hypothesis (Figure 4). We compare DoT5large trained with or without NLU. We see that the full model attends more on the control codes, suggesting that NLU is increasing label conditionality during generation. Table 8 shows some examples: When required to generate an entailment, the model can usually correctly paraphrase the original sentence; for negation, the model is usually able to correctly identify modifiers to flip the logic (e.g., change “increase” to “decrease” and adding or removing “no”); for neutral, the model generates a thematically related sentence but not directly negating or agreeing with the original sentence.
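A hedged sketch of this attention probe, assuming the decoder-to-encoder attention weights for one generated hypothesis have been collected into a single array; the exact aggregation used in Tang et al. (2018) and in our analysis may differ in detail.

```python
import numpy as np

def max_attention_to_control(cross_attn, control_positions):
    """cross_attn: (layers, heads, tgt_len, src_len) decoder-to-encoder attention
    weights for one generated hypothesis; control_positions: indices of the
    control-code tokens (e.g., 'entailed'/'contradictory') in the prompt.
    Returns, per generated token, the maximum attention weight over all layers
    and heads that lands on the control code."""
    ctrl = cross_attn[..., control_positions]   # (layers, heads, tgt_len, n_ctrl)
    return ctrl.max(axis=(0, 1, 3))             # (tgt_len,)

# Comparing these per-token maxima for models trained with vs. without NLU
# gives the comparison summarized in Figure 4.
```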
Prompt: Generate a {label} sentence of: {premise}

| Premise | Label | Generated hypothesis |
| --- | --- | --- |
| “Low lung volumes but no evidence of pneumonia.” | Entailment | “There is no evidence of pneumonia.” |
| | Neutral | “The patient has a history of smoking.” |
| | Contradiction | “The patient has pneumonia.” |
| “The patient is rotated slightly which limits assessment.” | Entailment | “Assessment is limited due to patient rotation.” |
| | Neutral | “The patient is rotated to the left.” |
| | Contradiction | “The patient is oriented perfectly.” |
5.2 Effect of Scaling Up
We have so far reported results on a large T5 model (770M parameters). In Figure 5, we plot the performance of small (70M) and base (220M) DoT5 models with their ablated versions for RadNLI and radiology summarization, showing a clear trend of increasing performance as the model size grows. Interestingly, this scaling effect disappears when we remove in-domain MLM, revealing the importance of domain training for larger models, especially for summarization. This is possibly because, without domain training, scaling up the model leads to overfitting to the general-domain task data. The compositional transfer framework from DoT5 however regularises the model for more complex knowledge acquisition, and thus is able to harness the power from larger models.
5.3 Evidence of Compositional Transfer in DoT5: A Case Study on RadNLI
Although RadNLI is a radiology-specific NLI dataset, we observe that some examples may be solvable using general-domain task knowledge (e.g., syntactic cues) alone. A general-purpose NLI model will likely detect that ‘There is no pneumothorax’ contradicts ‘There is a pneumothorax’ without requiring radiology-specific knowledge such as an understanding of pneumothorax. Therefore, higher performance on RadNLI may not strictly guarantee the model has acquired in-domain knowledge. To quantify how much of DoT5’s transfer success is due to the acquisition of the previously unseen domain-specific task knowledge versus from direct application of the general-domain task knowledge, we manually annotated each of the 480 sentence pairs in the RadNLI test set by whether it could be solved without particular medical expertise.12 Examples are shown in Table 9.
| | Premise | Hypothesis | Label |
| --- | --- | --- | --- |
| Does not require radiology expertise | There is a small left pleural effusion. | No pleural effusion or pneumothorax is seen. | Contradiction |
| Requires radiology expertise | The cardiac silhouette is top normal. | The heart is not enlarged. | Entailment |
Table 10 compares three models on these subsets: DoT5large (a), T5large-MLM → Task (b), and DoT5large without in-domain MLM (c) (equivalent to ‘T5 Task’). We further test with and without self-finetuning to probe its capacity to strengthen domain-specific competence.
| Model | All cases | Expertise required: Yes | Expertise required: No |
| --- | --- | --- | --- |
| a) DoT5large | 80.7 | 70.1 | 86.4 |
| No self-finetuning | 51.0 | 43.3 | 50.8 |
| b) T5large-MLM → Task | 75.6 | 54.8 | 86.5 |
| No self-finetuning | 35.6 | 36.2 | 35.6 |
| c) T5 Task | 59.5 | 35.3 | 70.1 |
| No self-finetuning | 37.5 | 36.1 | 35.1 |
| Zero-rule baseline | 24.6 | 29.0 | 20.0 |
While DoT5large achieves the best performance overall, it is specifically on the challenging domain-specific cases that it outperforms T5large-MLM → Task, by an increase of 15 points in F1. For example, in Table 9, only DoT5large is able to solve the second example, which requires radiology-specific knowledge (the model should know that the cardiac silhouette includes the heart size, and that if the heart is top normal, it is not enlarged). This demonstrates the role of compositional transfer in inferring otherwise unseen in-domain task knowledge (in this case, radiology NLI knowledge) and solving challenging cases that require expertise.
The two ablated versions help understand where this domain-specific task knowledge is acquired. In-domain MLM training is key as removing it (c) significantly decreases the performance on domain-expert cases in particular, producing a model which cannot benefit from self-finetuning at all for such cases. This is because without in-domain MLM, the model is not able to generate good-quality pseudo in-domain labels in the first place, and therefore self-finetuning has little effect on the expert cases. Introducing in-domain data sequentially (b) resolves the performance gap on non-expert cases, but still underperforms on domain-specific cases relative to multi-task training (a). We conclude that the compositional fusion of task and domain knowledge happens during DoT5’s multi-task pretraining phase with in-domain MLM as the key element, and that domain-specific competence is elicited through self-finetuning.
6 Conclusion and Discussion
We propose DoT5, a compositional transfer learning framework to solve domain-specific NLP tasks without requiring in-domain task labels. We show the effectiveness of DoT5 on zero-shot transfer to multiple tasks in the biomedicine and radiology domains. DoT5 significantly outperforms T5 sequential training across all tasks, and achieves zero-shot SOTA in radiology NLI with massive gains. We also conduct extensive analyses to identify the contribution from each model component and the benefits from scaling up the model size, and demonstrate direct evidence of domain-specific task knowledge learned from DoT5’s compositional transfer.
Limitations of this work include the challenge of drawing clear boundaries between domains and the necessarily incomplete exploration of hyperparameters and configurations. For example, general domain texts may contain biomedical or radiology sources, and our ‘biomedical’ NLI evaluation set leans strongly clinical, introducing a degree of domain shift. Investigation of the weighting of terms in the loss reveals the potential to improve performance through more exhaustive hyperparameter search—we emphasise that this was a proof-of-concept study and although DoT5 performs favourably, zero-shot domain transfer could be further pushed, especially if only a single downstream task is required.
The proposed NLGU method and subsequent self-finetuning was critical for improving downstream task performance. However, we observed an intermittent negative effect wherein the model would attempt to solve the NLU task when presented with an unusually long prompt. Further work can be done to refine this approach. For example, the benefit of NLGU in resource-rich domains is unclear. As our focus is on domain transfer and we do not evaluate on general-domain tasks, we leave such experimentation to future study.
Finally, we acknowledge that it is non-trivial to apply our full framework to single-sentence/paragraph classification tasks. While our most basic setup (compositional training of in-domain MLM and vanilla task training) can still transfer to any task format, NLGU and self-finetuning currently only work for tasks that involve pairs of texts. Nonetheless, we believe DoT5 proves to be a highly effective zero-shot domain transfer framework that will benefit domain-specific applications beyond radiology and biomedicine.
Acknowledgments
The authors would like to thank the anonymous TACL reviewers and editors for their detailed feedback and helpful suggestions.
Notes
DoT5 (read as “dot five”): Domain Compositional ZerO-shot T5.
The definition of ‘zero-shot’ in this paper follows recent studies (Pan et al., 2022; Zhao et al., 2022), and is similar to unsupervised domain adaptation, as discussed in §2. Another similar usage of ‘zero-shot’ is found in cross-lingual setups where no task labels are accessible in the target test language but labels in the same task are available in a source language. Note that this definition is different from ‘zero-shot learning’ traditionally used to refer to the prediction of unseen classes.
Notice that the tasks are limited to those that can have pairwise input instead of single sentence input.
We experimented with generating one, three, and five pairs of positives and negatives and found three to be the best in our setup. We thus use three across all models.
Note that a counterfactual is not always a contradiction. We approximate contradiction this way and use the ‘contradictory’ control code in our experiments for consistency.
In a radiology report, the “findings” section is a detailed description and the “impression” section is a summary of the findings with follow-up recommendation.
The average of the weighted-F1 score across 14 pathological observations labelled by CheXbert.
These are still zero-shot baselines as they do not use in-domain task examples.
This baseline category is similar to contemporaneous work (Pan et al., 2022) where domain-task transfer is achieved through sequential in-domain off-task training followed by general-domain in-task training. Here we do not use in-domain task data of any kind.
We determined 228 (47%) pairs could be solved without medical/radiological expertise, 177 (37%) could not, and the remaining 75 (16%) were ambiguous. Ambiguous cases were excluded from the analysis.
Author notes
Work done at Microsoft Health Futures.
Action Editor: Alexander Rush