Compositional Zero-Shot Domain Transfer with Text-to-Text Models

Abstract Label scarcity is a bottleneck for improving task performance in specialised domains. We propose a novel compositional transfer learning framework (DoT5) for zero-shot domain transfer. Without access to in-domain labels, DoT5 jointly learns domain knowledge (from masked language modelling of unlabelled in-domain free text) and task knowledge (from task training on more readily available general-domain data) in a multi-task manner. To improve the transferability of task training, we design a strategy named NLGU: we simultaneously train natural language generation (NLG) for in-domain label-to-data generation, which enables data augmentation for self-finetuning, and natural language understanding (NLU) for label prediction. We evaluate DoT5 on the biomedical domain and the resource-lean subdomain of radiology, focusing on natural language inference, text summarisation, and embedding learning. DoT5 demonstrates the effectiveness of compositional transfer learning through multi-task learning. In particular, DoT5 outperforms the current state-of-the-art in zero-shot transfer by over 7 absolute points in accuracy on RadNLI. We validate DoT5 with ablations and a case study demonstrating its ability to solve challenging NLI examples requiring in-domain expertise.


Introduction
While pretrained language models demonstrate massive improvements on a wide range of natural language processing (NLP) tasks, it remains challenging to apply them to specialised domains (Ramponi and Plank, 2020). To acquire domain-specific task knowledge, a conventional approach is to perform domain-specific pretraining (usually masked language modelling (MLM) on in-domain raw text), followed by finetuning with in-domain task-annotated data (Lee et al., 2020; Gu et al., 2021; Boecking et al., 2022). However, this approach requires in-domain task labels that can be expensive to acquire. Another approach is to train a model with the usually abundant general-domain task labels and directly transfer to the new domain (Romanov and Shivade, 2018; Ma et al., 2021), but the transfer performance is often limited by the domain gap. Past studies on zero-shot domain transfer or unsupervised domain adaptation have explored methods to transfer task knowledge from a source domain to an unseen target domain (Ramponi and Plank, 2020; Ganin and Lempitsky, 2015), but they usually require external modules to perform feature or domain alignment and are not always easily applicable to pretrained language models. In particular, there is little understanding of how we can leverage and combine domain-specific knowledge and general-domain task knowledge in the context of the recent success of text-to-text architectures in transfer learning.
To close this gap, we propose DOT5, a novel compositional zero-shot domain-transfer framework based on the state-of-the-art (SOTA) transfer learning model Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020). Throughout, the 'zero-shot' setup refers to zero-shot domain transfer with no access to labelled in-domain data. By "compositional" we mean that DOT5 is able to combine seen task labels and domain text to acquire an unseen combination of task and domain knowledge.
[Fig. 1: DOT5 combines in-domain unlabelled text with general-domain task labels to acquire in-domain task knowledge. Example of domain-specific semantics: osseous ~ bony. "There are bony abnormalities. Question: There are no osseous abnormalities. True or False?" Label: False.]

As shown in Fig. 1, DOT5 combines domain knowledge and task knowledge by making the best use of in-domain free text and general-domain task labels, which are typically accessible and abundant. For example, in the context of natural language inference (NLI), DOT5 can learn domain-specific semantics (e.g., "bony abnormalities" is a synonym of "osseous abnormalities") from in-domain free text and transferable task knowledge from general-domain task labels (e.g., negation indicates contradiction) to infer domain-specific task knowledge (e.g., "There are no bony abnormalities" contradicts "There are osseous abnormalities").
We apply DOT5 to NLI, summarisation and text embedding learning, which are fundamental applications across many domains, and we explore zero-shot domain transfer to the high-value and highly specialised domain of biomedicine and its extremely low-resource subdomain of radiology. Due to their specialisation, obtaining labelled data in these domains is expensive and time-consuming. For example, the radiology-specific NLI dataset RadNLI (Miura et al., 2021) provides labelled examples for evaluation only, with no training set.

The key to DOT5's compositional transfer is continual multi-task pretraining to simultaneously acquire domain and task knowledge: we jointly train T5 with MLM on in-domain unlabelled data and on general-domain tasks (NLI and summarisation). To better acquire transferable task knowledge from the general-domain task labels, we propose a multi-task setup we call NLGU. As depicted in Fig. 2, NLGU gives each task two formulations: natural language generation (NLG; label-to-data generation) and natural language understanding (NLU; data-to-label prediction). NLU enables label prediction when tested in an unseen domain and forces the model to be sensitive to the conditioned label, assisting NLG. Meanwhile, NLG enables downstream tasks such as summarisation or data augmentation. This allows DOT5 to generate its own in-domain NLI task data for further finetuning (a process we call self-finetuning), or to generate positive and negative examples for improving text embeddings via contrastive learning (Oord et al., 2018).
Our experiments show the effectiveness of DOT5 in zero-shot domain transfer, and our proposed multi-task compositional approach achieves large gains over sequential training with T5 across all tasks. In particular, we achieve SOTA zero-shot domain transfer performance on RadNLI, outperforming baselines including large language models (LLMs), sequential training approaches and task-specific baselines by large margins. We also identify several key insights through extensive analysis: 1) All three key components (in-domain MLM, NLGU, self-finetuning) in DOT5 are important for transfer success, while multi-task learning with in-domain MLM is the key to combining domain and task knowledge. 2) Scaling up model size significantly improves transfer performance. 3) DOT5 is able to solve challenging domain-specific task examples, indicating that it acquires domain-specific task knowledge through compositional transfer.
To summarise, we present the following major contributions: 1) We propose DOT5, a general framework for compositional transfer learning with text-to-text models, and show that multi-task training is superior to sequential training for domain transfer. 2) With a novel NLGU training strategy combining generation and understanding, DOT5 can be used for both classification and generation tasks; with the latter, DOT5 can perform self-finetuning to further improve transfer performance. 3) We show the effectiveness of DOT5 in zero-shot domain transfer, achieving SOTA zero-shot performance in radiology NLI. 4) Comprehensive analysis demonstrates the inner workings of DOT5's compositional transfer.

[Fig. 2: DoT5 training examples. NLG: "Generate an entailed sentence of: Waste in ponds produced by leaching gold from ore is a source of potential environmental dangers." → "Gold can produce pond waste." NLU: "The car belonged to James Clark, 68, an acquaintance of James' family. Question: James Clark is not a teenager. True, False, or Neither?"]

[Table 1 (notation): For NLI, x1: premise, x2: hypothesis, and the label y is one of {entailed, neutral, contradictory}. For summarisation, x1: document, x2: summary, and y is one of {entailed, contradictory}.]

Related Work

ExT5 (Aribandi et al., 2022) builds on T5's task-unifying idea and explores pretraining T5 on a massive collection of NLP datasets with diverse natural language prompts. Among related models, T0, FLAN, and MetaICL investigate pretraining on one set of tasks followed by zero-shot transfer to another set of unseen tasks. Our zero-shot domain-transfer setup, which assumes no in-domain task labels, is shared by prior work in biomedical and radiology NLP (Miura et al., 2021; Boecking et al., 2022; Agrawal et al., 2022). A similar zero-shot setup is also frequently seen in other transfer learning scenarios such as cross-lingual zero-shot learning (Conneau et al., 2018, 2020).

Our proposed NLGU and self-finetuning strategies are closely related to cross-domain data augmentation. A line of work in information retrieval generates "in-domain" pseudo training data from unlabelled in-domain texts. For example, Ma et al. (2021) and Wang et al. (2022) train a passage-to-query generator to synthesise in-domain queries for zero-shot passage retrieval. The NLG component of our NLGU strategy can likewise perform data augmentation, but with better granularity and diversity, as we can generate label-conditioned task data to create both positive and negative examples.

Unsupervised domain adaptation
Besides zero-shot transfer in NLP, unsupervised domain adaptation (which also assumes labels in the source domain and unlabelled data in the target domain) is a long-standing research topic in machine learning in general (Huang et al., 2006; Pan et al., 2010b; Ganin and Lempitsky, 2015; Ramponi and Plank, 2020). Many conventional unsupervised domain adaptation methods require external components to align domains at the feature/embedding level. For example, Pan et al. (2010a) propose applying spectral feature alignment to align domain-specific words across domains into unified clusters. Ganin and Lempitsky (2015) add a domain classifier that promotes domain-invariant features via a gradient reversal layer. These methods are not always immediately suitable for recent pretrained language models, especially text-to-text models. In comparison, our approach exploits the task-unifying nature of text-to-text models, which have inherent transfer-learning abilities, and requires minimal architectural changes.

Method
To achieve compositional transfer, DOT5 acquires domain knowledge and task knowledge via continual pretraining (see Fig. 2). Specifically, we optimise a joint loss composed of an in-domain masked language model loss ("domain-MLM") and a general-domain task-specific loss:

L = λ · L_domain-MLM + (1 − λ) · L_task.

We set λ = 0.5, but explore tuning it in §4.1.
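A minimal sketch of this joint objective and the accompanying 1:1 batch mixing described in §4.1 (function and variable names are illustrative, not from any released code):

```python
import random

def joint_loss(domain_mlm_loss, task_loss, lam=0.5):
    # L = lambda * L_domain-MLM + (1 - lambda) * L_task, with lambda = 0.5
    return lam * domain_mlm_loss + (1.0 - lam) * task_loss

def mixed_batch(mlm_examples, task_examples, batch_size, seed=0):
    # Draw half of each batch from in-domain MLM data and half from
    # general-domain task data, mirroring the fixed 1:1 ratio.
    rng = random.Random(seed)
    half = batch_size // 2
    return rng.sample(mlm_examples, half) + rng.sample(task_examples, batch_size - half)
```

In practice both losses come from the same seq2seq cross-entropy, so mixing data sources within a batch and averaging is equivalent to weighting the two losses.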
We use T5, an encoder-decoder generative language modelling framework (Raffel et al., 2020), to learn a conditional sequence generator P(output|input). T5 is chosen for two reasons: 1) it is a strong transfer learning model, and 2) it can unify classification and generation, which has the potential to further boost transfer performance (see the NLGU discussion in §3.2). We use the same pretraining objective (cross-entropy with teacher forcing) as in T5.
We detail the two loss components for continual pretraining in §§3.1 and 3.2. Once the model has been continually pretrained, it can be used to perform zero-shot domain transfer on a task. Task-specific designs for inference are given in §3.3.

Continual Pretraining with In-domain MLM
For L_domain-MLM we use the MLM loss (Devlin et al., 2019) to continually pretrain T5 on in-domain free text: given a piece of sampled radiology or biomedical text, we randomly mask 15% of its tokens and ask the model to denoise the masked input sequence, i.e., generate the masked tokens.
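As a toy illustration of this denoising objective (a sketch only: the real T5 recipe corrupts contiguous spans, whereas this masks individual tokens with T5-style sentinels at the 15% rate described above):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Corrupt ~15% of tokens with sentinel markers; the training target
    is the sequence of sentinels followed by the tokens they replaced."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted, targets = [], []
    for i, tok in enumerate(tokens):
        if i in masked_positions:
            sentinel = f"<extra_id_{masked_positions.index(i)}>"
            corrupted.append(sentinel)  # model sees the sentinel...
            targets.extend([sentinel, tok])  # ...and must generate the token
        else:
            corrupted.append(tok)
    return corrupted, targets
```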

Continual Pretraining on General-domain Tasks
For L_task, we define (x1, x2) as a text pair that denotes (premise, hypothesis) for NLI and (document, summary) for summarisation. The standard NLI task assigns labels y from {entailment, neutral, contradiction}, and the task is (x1, x2) → y.
For summarisation, the task is usually cast as x1 → x2. We follow Sanh et al. (2022) in adopting a multi-task learning strategy to train summarisation and NLI simultaneously. Hence, the basic setup of task learning would be: NLI as a discriminative NLU task plus summarisation as an NLG task.
NLGU: Simultaneous NLG and NLU One immediate question is whether we can turn each task into both NLG and NLU (i.e., adding NLG for NLI and NLU for summarisation). For NLI, we can add label-to-data NLG to generate pseudo in-domain text for data augmentation, performing (x1, y) → x2 (the label y is used as a control code).
For summarisation, we can also follow NLI and add an NLU task that predicts whether a document-summary pair is entailed (the correct match) or contradictory (a counterfactual summary) (§4.1). This NLU component aims to improve the factuality of generated text, as it encourages the model to distinguish counterfactual from true summaries. With the hypothesis that performing NLG and NLU simultaneously will be mutually beneficial, we propose NLGU, meaning joint training of NLG and NLU. With NLGU, we unify both summarisation and NLI into (x1, x2) → y for NLU and (x1, y) → x2 for NLG. The conditional generator then simultaneously optimises two losses:

L_task = L_(x1,y)→x2 + γ · L_(x1,x2)→y.

We set γ = 10 to balance the two losses, since x2 is usually much longer than y (the classification label). NLU and NLG are both trained with sequence-to-sequence generation, and differ only in the input prompt and the expected output (Table 1). The prompt for L_(x1,x2)→y is from Brown et al. (2020).
The prompts for summarisation are akin to those for NLI, with premise and hypothesis replaced with document and summary respectively, and we only use {entailment, contradiction} relations.
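The NLGU formulation for NLI can be sketched as follows. Only the "entailed" NLG template and the ANLI-style NLU prompt are quoted in the text above; the neutral and contradictory NLG wordings below are illustrative assumptions:

```python
# NLG direction: label-conditioned generation prompts (the neutral and
# contradictory wordings are assumed by analogy with the quoted template).
NLG_PROMPT = {
    "entailment": "Generate an entailed sentence of: {p}",
    "neutral": "Generate a neutral sentence of: {p}",
    "contradiction": "Generate a contradictory sentence of: {p}",
}
# NLU direction: verbalise the three NLI labels (ANLI-style prompt).
LABEL_WORD = {"entailment": "True", "contradiction": "False", "neutral": "Neither"}

def nlgu_examples(premise, hypothesis, label):
    """Turn one labelled NLI pair into two seq2seq training examples:
    NLU: (x1, x2) -> y  and  NLG: (x1, y) -> x2."""
    nlu = (f"{premise} Question: {hypothesis} True, False, or Neither?",
           LABEL_WORD[label])
    nlg = (NLG_PROMPT[label].format(p=premise), hypothesis)
    return nlu, nlg

def nlgu_loss(nlg_loss, nlu_loss, gamma=10.0):
    # L_task = L_(x1,y)->x2 + gamma * L_(x1,x2)->y, with gamma = 10
    return nlg_loss + gamma * nlu_loss
```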

Task-specific Designs for In-domain Zero-shot Inference
After continual pretraining, we zero-shot-transfer the trained model to three applications in specialised domains without requiring labels from these domains: 1) NLI, 2) summarisation, and 3) text embedding learning.
NLI (with self-finetuning) While the model is capable of directly performing NLI after training on general-domain NLI task labels with (x1, x2) → y, we propose an additional step, self-finetuning, to boost transfer performance (§5.1).
We first use the model's NLG capabilities to generate pseudo in-domain NLI data: we sample a set of sentences from the target domain as premises, and prompt the pretrained model to generate hypotheses (the NLG task) with each of the three control codes (labels). This pseudo in-domain NLI dataset is then used as additional training data to finetune the same model on the NLU task: (x1, x2) → y. The resulting finetuned model is then used for zero-shot NLI transfer.
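A sketch of the pseudo-data construction step, with the model's NLG interface stubbed as a plain `generate` callable (the neutral/contradictory prompt wordings are assumed by analogy with the quoted "entailed" template):

```python
def make_pseudo_nli(premises, generate):
    """Self-finetuning data construction: each in-domain premise yields
    one generated hypothesis per label, giving (premise, hypothesis,
    label) triples for NLU finetuning. `generate(prompt)` stands in for
    the continually pretrained model's generation call."""
    prompts = {
        "entailment": "Generate an entailed sentence of: {}",
        "neutral": "Generate a neutral sentence of: {}",
        "contradiction": "Generate a contradictory sentence of: {}",
    }
    data = []
    for premise in premises:
        for label, template in prompts.items():
            hypothesis = generate(template.format(premise))
            data.append((premise, hypothesis, label))
    return data
```

With the RadNLI development-set premises described in §4.1, this loop yields three pseudo examples per premise, which are then fed back as NLU training data.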

Text summarisation
We directly prompt the model after continual pretraining to summarise in-domain documents, using the same prompt as in pretraining: "Generate an entailed summary of: {document}". The output summary is then compared against the gold summary. Since this is already a text generation task, i.e., (x1, y) → x2, we cannot exploit self-finetuning as for NLI: we cannot improve generation by training on the model's own generated pseudo data.
Text embedding learning DOT5 can be directly used as a generator for data augmentation. Apart from creating more pseudo NLI task data to improve NLI, DOT5 can improve domain-specific embedding learning in general. To do so, we sample a set of in-domain sentences as anchors, and prompt the trained model to generate entailed and contradictory sentences to form positive and negative pairs for each anchor. With a beam search size of five, we sample the top-k most probable sequences as the entailed (positive) and contradictory (negative) sentences for the anchor. Given the collected anchors and positive/negative sentences, we finetune a SOTA sentence embedding model with a contrastive loss. Specifically, we continually finetune the all-mpnet-base-v2 model with a variant of InfoNCE (Oord et al., 2018) modified to handle multiple positives (Miech et al., 2020). The learned embedding space is then used for query-document retrieval or for computing text similarity.
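The modified contrastive objective can be sketched as below, assuming precomputed cosine similarities; this follows the multiple-positives treatment of Miech et al. (2020) in spirit, not their exact implementation:

```python
import math

def multi_positive_infonce(pos_sims, neg_sims, temperature=0.07):
    """InfoNCE generalised to multiple positives per anchor: the
    positives' exponentiated scores are summed inside the log, so all
    positives jointly compete against the negatives."""
    pos = sum(math.exp(s / temperature) for s in pos_sims)
    neg = sum(math.exp(s / temperature) for s in neg_sims)
    return -math.log(pos / (pos + neg))
```

Here `pos_sims` are the anchor's similarities to its DOT5-generated entailed sentences and `neg_sims` those to its contradictory sentences (and, in practice, in-batch negatives).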

Experiment
We introduce our experimental setup in §4.1, briefly discuss baseline approaches in §4.2 and then present results in §4.3.

Experimental Setup
Details of the datasets used for training and evaluation are given in Table 2.

Pretraining datasets As our continual pretraining is a multi-task process, we balance the in-domain and general-domain datasets in each batch via up/downsampling as needed: for radiology, we upsample MIMIC-CXR samples via duplication, whereas for biomedicine we downsample PubMed abstracts, in each case matching the general-domain task dataset size. We also balance the number of samples coming from each task, downsampling the summarisation dataset to roughly match that of NLI ('# Examples' in Table 2). Experiments with DOT5-small (Fig. 3) indicate that downstream task performance could be boosted by tuning the relative prevalence of the data sources, with a task-dependent optimal value. In this proof-of-concept study, we fix a ratio of 1:1.

We generate counterfactual summaries of Gigaword based on Rajagopal et al. (2022). Specifically, we run a named entity recognition model (the "en_core_web_sm" spaCy pipeline; Honnibal et al., 2020) on the documents from the Gigaword summarisation training data. For each document that contains a named entity, we randomly sample an entity and replace it with a different named entity of the same category from the training corpus. This is our 'counterfactual' example. We also filter out noisy data where the generated counterfactual contains UNK or #. The resulting dataset, as listed in Table 2, consists of 50% document-'wrong summary' pairs (i.e., 500k pairs), one for each true document-summary pair.

To create pseudo NLI data for the self-finetuning process, we use all premises from the RadNLI/MedNLI development set and generate one entailed, one neutral, and one contradictory hypothesis for each premise. In total, we have 1440 and 4185 pseudo examples for RadNLI and MedNLI respectively.
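The counterfactual construction can be sketched as follows. To keep the example dependency-free, entity annotations (produced in the paper by the spaCy "en_core_web_sm" pipeline) are passed in directly, and the function names are illustrative:

```python
import random

def make_counterfactual(summary, entities, entity_pool, seed=0):
    """Entity-swap counterfactual (after Rajagopal et al., 2022): pick one
    entity in the summary and replace it with a different entity of the
    same category drawn from the training corpus.
    `entities`: surface form -> NER category, for this summary.
    `entity_pool`: category -> surface forms seen corpus-wide."""
    rng = random.Random(seed)
    if not entities:
        return None  # no named entity -> no counterfactual for this pair
    surface = rng.choice(sorted(entities))
    category = entities[surface]
    candidates = [e for e in entity_pool.get(category, []) if e != surface]
    if not candidates:
        return None
    return summary.replace(surface, rng.choice(candidates))
```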
Evaluation datasets and metrics All evaluation datasets are from domain-specific tasks (Table 2). For NLI, we report accuracy and macro-F1 (out of 100, for legibility) on the test sets of RadNLI and MedNLI. For summarisation in radiology, we evaluate findings-to-impression summarisation on the test split of the Open-I dataset (Demner-Fushman et al., 2016). For biomedical summarisation, we create an abstract-to-title summarisation dataset, 'PubMed ShortSum'. The data for this task is sampled from PubMed and filtered to abstracts shorter than 1000 characters. Compared with the traditional article-to-abstract PubMed summarisation task, which evaluates long summary generation for long text (Cohan et al., 2018), PubMed ShortSum evaluates extreme summarisation for short text and is more comparable to our general-domain Gigaword summarisation task. For summarisation evaluation, we use standard lexical metrics (BLEU-4, ROUGE-L) and domain-specific factuality metrics: named entity matching (NEM) for both radiology (Miura et al., 2021) and biomedical (Alambo et al., 2022) summarisation, and CheXbert (Smit et al., 2020) for radiology.
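As an illustration of the NEM factuality metric, a simplified sketch computing an F1 over entity sets (the exact definition used in the paper follows Miura et al. (2021)):

```python
def nem(gen_entities, gold_entities):
    """Named-entity matching sketch: how well the entities of a generated
    summary overlap those of the gold summary (set-level F1)."""
    gen, gold = set(gen_entities), set(gold_entities)
    if not gen or not gold:
        return 0.0
    overlap = len(gen & gold)
    precision = overlap / len(gen)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```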
We evaluate embeddings trained for the biomedical domain on MedSTS (Yanshan et al., 2020), a clinical text similarity benchmark. Since the radiology domain has no text similarity datasets available, we design an impression-to-findings retrieval task on the Open-I test set and report Accuracy@1/5/10. This retrieval task also evaluates embedding quality, as it requires the model to differentiate text from the same and different reports by encoding texts from matching findings-impression pairs (from the same report) with similar representations.
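The retrieval evaluation can be sketched as follows, assuming queries (impressions) and documents (findings) are index-aligned so that query i's gold document is document i:

```python
def accuracy_at_k(query_vecs, doc_vecs, k):
    """Accuracy@k for impression-to-findings retrieval: rank all documents
    by dot-product similarity to each query and check whether the gold
    (index-matched) document appears in the top k."""
    hits = 0
    for i, q in enumerate(query_vecs):
        sims = [sum(a * b for a, b in zip(q, d)) for d in doc_vecs]
        topk = sorted(range(len(sims)), key=lambda j: -sims[j])[:k]
        hits += i in topk
    return hits / len(query_vecs)
```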
Training details Models are trained for 10 epochs, with validation loss used for checkpoint selection. We use distributed data parallelism on eight GPUs with the largest batch size permissible given computational constraints, resulting in batch sizes of 1024, 512, and 128 for the small, base, and large models. With a dataset of ∼8M samples, we thus train the large model for ∼64,000 steps per epoch. We use AdaFactor (Shazeer and Stern, 2018) with learning rates of 10^-3 for MIMIC-CXR and 2×10^-5 for PubMed pretraining.

Baselines
We have three categories of baselines: (1) task-specific zero-shot baseline models reported in the literature (where applicable); (2) LLMs, including T0 and GPT-3; (3) sequential training, first on in-domain unlabelled data and then on general-domain task labels. All baseline models in our study must satisfy one constraint: they use no in-domain labels for the task, but they may differ in the required training resources (a detailed comparison is found in Table 3). We compare with (2) because LLMs are known to be excellent zero-shot and few-shot learners on unseen tasks, and should serve as a reasonable baseline for domain transfer. We provide (3) as a straightforward baseline that sequentially combines in-domain MLM training and general-domain task training, as opposed to our proposed multi-task training.
Task-specific zero-shot baselines We compare with the strongest task-specific zero-shot models from the literature. For the NLI task, we compare with Miura et al. (2021).

Large language models T0 (Sanh et al., 2022) and GPT-3 (Brown et al., 2020) are massively pretrained language models that can be used off-the-shelf for zero-shot or few-shot inference. T0 is pretrained on multiple tasks, including general-domain summarisation datasets (but not NLI), and shows strong transfer ability (Sanh et al., 2022). T0 can thus be seen as a strong general-domain summarisation model and a strong zero-shot domain-transfer baseline for summarisation. T0 is also particularly effective at transferring to unseen tasks; we therefore include T0 as a zero-shot baseline for NLI even though it has not been trained on any NLI data. We test T0 (3B) and the most powerful T0++ (11B) model. GPT-3 (Brown et al., 2020) (davinci) is a massive language model with 175B parameters, pretrained on raw text with an autoregressive language modelling objective.
In the general domain, both models have been shown to perform reasonably well on NLI and summarisation with prompting. We test their zero-shot inference capabilities in our experiments, following the original papers for prompt design. For the NLI task, both T0 models and GPT-3 use the ANLI prompt template described in Brown et al. (2020): "<premise> Question: <hypothesis> True, False or Neither?". For the summarisation task, T0 uses the prompt "<document> \n === \n Generate a title for this article:". For GPT-3 summarisation, we use the prompt "<document>\n\n Tl;dr:" as recommended in the OpenAI GPT-3 playground example. Since GPT-3 benefits from few-shot examples incorporated in the prompt, we create two additional baselines (GPT-3-NLI and GPT-3-GW) that perform in-context learning of the task from general-domain NLI training data (30 examples, randomly selected) and Gigaword summarisation training data (20 examples, randomly selected) respectively (Table 2).

Sequential training
The most straightforward way to exploit both in-domain unlabelled data and task labels is to first train on in-domain MLM and then further finetune on general-domain task labels. We provide two variants of this baseline. The first performs continual training with general-domain task labels starting from SOTA domain-specific pretrained models. We adopt SciFive (Phan et al., 2021), a T5 model pretrained on large biomedical corpora; CXR-BERT-General (Boecking et al., 2022), a radiology-specialised BERT model; and the PubMed-specific PubMedBERT (Gu et al., 2021). For finetuning these models we use the same general-domain task data as provided to DOT5, where for the BERT models we only finetune on NLI. This results in the baseline models SciFive-large-NLI, SciFive-large-GW (summarisation), CXR-BERT-NLI, and PubMedBERT-NLI. We further improve SciFive-large-NLI by including our proposed self-finetuning stage (SciFive-large-NLI + SFT). Since there is no radiology-pretrained T5 model, we compare with SciFive on both domains.
The second baseline type strictly compares multi-task training (DOT5) with sequential training. Here, we first pretrain T5 with in-domain MLM, and then continually pretrain on the general-domain task data, keeping all other factors the same, including the training duration, the use of NLGU, and the use of self-finetuning where appropriate. We call this setting T5-large-MLM→Task.

Results

NLI (Table 4) DOT5-large establishes a new SOTA for zero-shot domain transfer on RadNLI and competitive results on MedNLI (Table 4). On RadNLI, DOT5-large reaches an impressive 82.1% accuracy and is the best-performing model. It outperforms the strongest reported number from the literature (CXR-BERT) by more than 15%, and our baseline CXR-BERT-NLI by almost 7%. Comparing DOT5 to T5-large-MLM→Task on RadNLI reveals the benefit of multi-task training for compositional transfer.

On MedNLI, DOT5-large outperforms ESIM (MultiNLI) by almost 20% in accuracy, but does not quite reach the 75.7% accuracy achieved by PubMedBERT-NLI, which establishes a new SOTA in zero-shot domain transfer on MedNLI (the supervised SOTA is 86.6%; Phan et al., 2021). Although factors such as tokenisation and pretraining strategies may contribute, we speculate that the domain gap between MedNLI and our biomedical pretraining corpus explains the weaker performance of DOT5 on MedNLI: MedNLI was sourced from clinical notes in MIMIC-III, which differ distributionally from biomedical articles in PubMed. Supporting this hypothesis, we observed that DOT5 pretrained on radiology text and the sequential baseline T5-large-MLM→Task achieved similar performance on MedNLI (70% accuracy), indicating that results on MedNLI may not fully reflect compositional domain knowledge transfer in our setup. In this case, a strong NLI-specific model is most performant, while lacking the potentially advantageous versatile text generation/summarisation capabilities.
Summarisation (Table 5) The best-performing summarisation models overall are LLMs pretrained at much larger scale across many domains (the T0 models). In radiology, DOT5-large is the second-best model. That the strongest-performing models on summarisation are LLMs with substantially more parameters is not surprising; we observe in §5.2 that DOT5 too enjoys scaling effects. Most importantly, we again demonstrate the benefit of multi-task compositional transfer, as DOT5-large significantly outperforms both T5-large-MLM→Task and SciFive-GW across all metrics in both domains. This further verifies that naïve sequential training on these two sources does not lead to effective compositional knowledge transfer. We also acknowledge that domain transfer is harder for generation tasks in general: we cannot apply the NLG data-augmentation and self-finetuning pipeline, as it would amount to training the model to generate its own outputs.

[Table 5: Zero-shot summarisation results. NEM (named entity matching) and CheXbert (radiology-specific) assess domain-specific factuality, while BLEU and ROUGE are standard lexical metrics. In all cases higher is better. GW = Gigaword. The T0 (Sanh et al., 2022) and GPT-3 (Brown et al., 2020) baseline experiments were conducted by us.]

Text embedding learning (Table 6) The DOT5-generated examples greatly improve the SOTA sentence embedding model's capability on both impression-to-findings retrieval in radiology and semantic textual similarity (MedSTS) in the biomedical domain (Table 6). This is evidence that DOT5-generated sentences are of high quality and capture the semantic similarity and contradiction required for learning a good embedding model. We also compare with an ablated version of DOT5 that generates data without in-domain MLM, and find that the full model performs better across the board. This shows the importance of domain training for generating good in-domain examples.
We explore this further in §5.1.

Further Analysis
In this section, we demonstrate the importance of individual components of DOT5 (§5.1) and explore the role of model size (§5.2). Finally, we provide fine-grained analysis on RadNLI to verify whether DOT5 has indeed acquired domain-specific task knowledge from compositional transfer (§5.3).

[Table 6: Text embedding learning results. Starting from a state-of-the-art embedding model (all-mpnet-base-v2), we finetune with DOT5-generated data (indicated by '+'). Radiology evaluation is retrieval: given the impression section of a report, find the corresponding findings section. For biomedicine, we report similarity on MedSTS (Yanshan et al., 2020), where r and ρ refer to Pearson's r and Spearman's ρ (scaled by 100 for legibility).]
Importance of Individual Components

We observe that all components are essential to the success of the model. In-domain MLM is especially important for summarisation, without which the model fails in zero-shot transfer, often just extracting a random subsequence of the document. Removing NLGU harms both NLI and summarisation. Training without NLGU removes the NLG component from NLI and therefore disables self-finetuning. Self-finetuning is the most important component for boosting NLI performance, without which the model's accuracy drops by more than 30%. As shown in Table 4, SciFive also benefits from self-finetuning in this way. This indicates that the pseudo in-domain NLI task data generated via NLGU is crucial.

Training without NLGU also removes the NLU task for summarisation and brings down performance, indicating that having an NLU task can also benefit generation. We hypothesise that NLU improves NLG by forcing the model to be more sensitive to the control code in the prompt, leading to improved pseudo-data generation and better summarisation. To test this, following Tang et al. (2018), we compute the maximum attention weights across all attention heads to the control codes in the prompt when generating an NLI hypothesis (Fig. 4). We compare DOT5-large trained with and without NLU, and see that the full model attends more to the control codes, suggesting that NLU increases label conditionality during generation.

Table 8 shows some examples: when required to generate an entailment, the model can usually correctly paraphrase the original sentence; for negation, the model is usually able to correctly identify modifiers to flip the logic (e.g., changing "increase" to "decrease", or adding or removing "no"); for neutral, the model generates a thematically related sentence that neither directly negates nor agrees with the original sentence.

Effect of Scaling Up
We have so far reported results with a large T5 model (770M parameters). In Fig. 5, we plot the performance of small (70M) and base (220M) DOT5 models and their ablated versions on RadNLI and radiology summarisation, showing a clear trend of increasing performance as the model size grows. Interestingly, this scaling effect disappears when we remove in-domain MLM, revealing the importance of domain training for larger models, especially for summarisation. This is possibly because, without domain training, scaling up the model leads to overfitting to the general-domain task data. The compositional transfer framework of DOT5, however, regularises the model towards more complex knowledge acquisition, and is thus able to harness the power of larger models.

Evidence of Compositional Transfer in DOT5: A Case Study on RadNLI
Although RadNLI is a radiology-specific NLI dataset, we observe that some examples may be solvable using general-domain task knowledge (e.g., syntactic cues) alone. A general-purpose NLI model will likely detect that 'There is no pneumothorax' contradicts 'There is a pneumothorax' without requiring radiology-specific knowledge such as an understanding of pneumothorax. Therefore, higher performance on RadNLI does not strictly guarantee that the model has acquired in-domain knowledge. To quantify how much of DOT5's transfer success is due to the acquisition of the previously unseen in-domain task knowledge, we partition the RadNLI test set into examples that require radiology expertise and examples that do not (Table 9).

[Fig. 5: Performance across model sizes. Both the full model and its ablated versions are compared. Note that self-finetuning is only applicable for the NLI tasks.]

[Table 9: Example RadNLI cases by required expertise.
Does not require radiology expertise — Premise: "There is a small left pleural effusion." Hypothesis: "No pleural effusion or pneumothorax is seen." Label: Contradiction.
Requires radiology expertise — Premise: "The cardiac silhouette is top normal." Hypothesis: "The heart is not enlarged." Label: Entailment.]
Table 10 compares three models on these subsets: DoT5 large (a), T5 large-MLM→Task (b), and DoT5 large without in-domain MLM (c) (equivalent to 'T5 large→Task'). We further test with and without self-finetuning to probe its capacity to strengthen domain-specific competence.
While DoT5 large achieves the best performance overall, it outperforms T5 large-MLM→Task specifically on the challenging domain-specific cases, an increase of 15 points in F1. For example, in Table 9, only DoT5 large is able to solve the second example, which requires radiology-specific knowledge (the model should know that the cardiac silhouette includes the heart size, and that if the heart is top normal, it is not enlarged). This demonstrates the role of compositional transfer in inferring otherwise unseen in-domain task knowledge (in this case, radiology NLI knowledge) and solving challenging cases that require expertise. The two ablated versions help us understand where this domain-specific task knowledge is acquired. In-domain MLM training is key: removing it (c) significantly decreases performance on the domain-expert cases in particular, producing a model which cannot benefit from self-finetuning at all for such cases. This is because, without in-domain MLM, the model is unable to generate good-quality pseudo in-domain labels in the first place, and therefore self-finetuning has little effect on the expert cases. Introducing in-domain data sequentially (b) closes the performance gap on non-expert cases, but still underperforms on domain-specific cases relative to multi-task training (a). We conclude that the compositional fusion of task and domain knowledge happens during DoT5's multi-task pretraining phase, with in-domain MLM as the key element, and that domain-specific competence is elicited through self-finetuning.

Conclusion and Discussion
We propose DoT5, a compositional transfer learning framework that solves domain-specific NLP tasks without requiring in-domain task labels. We show the effectiveness of DoT5 on zero-shot transfer to multiple tasks in the biomedical and radiology domains. DoT5 significantly outperforms T5 sequential training across all tasks, and achieves zero-shot SOTA on radiology NLI with massive gains. We also conduct extensive analyses to identify the contribution of each model component and the benefits of scaling up the model size, and provide direct evidence of domain-specific task knowledge learned through DoT5's compositional transfer.
Limitations of this work include the challenge of drawing clear boundaries between domains and the necessarily incomplete exploration of hyperparameters and configurations. For example, general-domain texts may contain biomedical or radiology sources, and our 'biomedical' NLI evaluation set leans strongly clinical, introducing a degree of domain shift. Investigation of the weighting of terms in the loss reveals the potential to improve performance through a more exhaustive hyperparameter search. We emphasise that this is a proof-of-concept study: although DoT5 performs favourably, zero-shot domain transfer could be pushed further, especially if only a single downstream task is required.
The proposed NLGU method and subsequent self-finetuning were critical for improving downstream task performance. However, we observed an intermittent negative effect wherein the model would attempt to solve the NLU task when presented with an unusually long prompt. Further work could refine this approach. For example, the benefit of NLGU in resource-rich domains is unclear; as our focus is on domain transfer and we do not evaluate on general-domain tasks, we leave such experimentation to future study.
Finally, we acknowledge that it is nontrivial to apply our full framework to single-sentence/paragraph classification tasks. While our most basic setup (compositional training of in-domain MLM and vanilla task training) transfers to any task format, NLGU and self-finetuning currently only work for tasks that involve pairs of texts. Nonetheless, we believe DoT5 is a highly effective zero-shot domain transfer framework that will benefit domain-specific applications beyond radiology and biomedicine.

Figure 1 :
Figure 1: By combining task knowledge from general-domain data and domain knowledge from in-domain unlabelled text, our text-to-text model DoT5 learns to solve in-domain tasks.

Figure 2 :
Figure 2: Continual pretraining of DoT5 on general-domain tasks (warm colours) and in-domain unlabelled text (blue). For task training, we form both NLG and NLU variants of NLI and summarisation. All training is performed simultaneously, exploiting the unified text-to-text framework of T5.

Figure 3 :
Figure 3: Varying the prevalence of in-domain MLM and task data in DoT5 small training.

Figure 4 :
Figure 4: Maximum attention weights assigned to the control code {label} ("entailed", "neutral", "contradictory") in the prompt during NLI hypothesis generation, averaged over 100 randomly sampled examples from the RadNLI dev set. Error bars represent standard deviation.

Figure 5 :
Figure 5: Scaling-up effect on RadNLI (left) and Open-I summarisation (right). Both the full model and its ablated versions are compared. Note that self-finetuning is only applicable to the NLI tasks.
RadNLI contains only 960 manually labelled examples, available as development and test data; no training data is available.

Table 8 :
Pseudo-NLI data in the radiology domain generated by DoT5 large for a given input premise and label. Premises are taken from the development split of the RadNLI dataset. Prompt template: "Generate a {label} sentence of {premise}:". Example premises include "Low lung volumes but no evidence of pneumonia." (generated hypothesis: "The patient has pneumonia.") and "The patient is rotated slightly which limits assessment."
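The NLGU pseudo-data generation behind Table 8 can be sketched as follows. The NLG prompt template is taken from Table 8; the control-code labels follow Fig. 4; the function names and the `generate` callable are our assumptions standing in for the model's NLG head.

```python
# Control codes used as labels in the NLG prompt (as in Fig. 4).
LABELS = ["entailed", "neutral", "contradictory"]

def nlg_prompt(label, premise):
    # NLG direction: label-to-data generation (template as in Table 8)
    return f"Generate a {label} sentence of {premise}:"

def build_pseudo_nli(premises, generate):
    """Create pseudo in-domain NLI pairs for self-finetuning.
    `generate` is any callable mapping a prompt string to model text;
    here it stands in for the model's NLG head."""
    data = []
    for premise in premises:
        for label in LABELS:
            hypothesis = generate(nlg_prompt(label, premise))
            data.append({"premise": premise,
                         "hypothesis": hypothesis,
                         "label": label})
    return data
```

The resulting premise/hypothesis/label triples form the pseudo in-domain task data on which the model is then self-finetuned.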

Table 9 :
Examples from RadNLI that do or do not require radiology-specific knowledge to solve. While all models listed in Table 10 correctly solved the top example, only DoT5 large solved the more challenging second example.

Table 10 :
Macro F1 of DoT5 with and without in-domain data during pretraining, on subsets of RadNLI that do or do not require radiology-specific expertise. The zero-rule baseline always outputs the most common class (for RadNLI, this is 'Neither'). We report macro F1 to account for differing label distributions. Note that T5 large→Task is equivalent to DoT5 large without in-domain MLM training.
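The metric and baseline used in Table 10 can be reproduced with a short sketch (pure Python; function names are ours):

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of per-class F1 scores, so rare
    classes count as much as frequent ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def zero_rule(y_true):
    """Zero-rule baseline: always predict the most common class."""
    majority = Counter(y_true).most_common(1)[0][0]
    return [majority] * len(y_true)
```

Because the baseline never predicts the minority classes, their per-class F1 is zero, which is why macro averaging penalises it under a skewed label distribution.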