Abstract
Compositional Natural Language Inference (NLI) has been explored to assess the true abilities of neural models to perform NLI. Yet, current evaluations assume models to have full access to all primitive inferences in advance, in contrast to humans, who continuously acquire inference knowledge. In this paper, we introduce the Continual Compositional Generalization in Inference (C2Gen NLI) challenge, where a model continuously acquires knowledge of constituting primitive inference tasks as a basis for compositional inferences. We explore how continual learning affects compositional generalization in NLI, by designing a continual learning setup for compositional NLI inference tasks. Our experiments demonstrate that models fail to compositionally generalize in a continual scenario. To address this problem, we first benchmark various continual learning algorithms and verify their efficacy. We then further analyze C2Gen, focusing on how to order primitives and compositional inference types, and examining correlations between subtasks. Our analyses show that by learning subtasks continuously while observing their dependencies and increasing degrees of difficulty, continual learning can enhance compositional generalization ability.[1]
1 Introduction
Natural Language Inference (NLI) determines the inferential relation between pairs of sentences, by classifying the hypothesis as being true (entailment), undecided (neutral), or false (contradiction) given the premise (Dagan et al., 2013; Bowman et al., 2015; Williams et al., 2018). The task has been researched for decades and has been shown to facilitate downstream NLU tasks such as text summarization (Laban et al., 2022; Utama et al., 2022), question answering (Chen et al., 2021), or dialogue generation (Stasaski and Hearst, 2022).
Recently, large pre-trained models (PLMs) have achieved results on par with human performance by fitting NLI training data (Wang et al., 2019a; Raffel et al., 2020; Chowdhery et al., 2023). Despite the success of state-of-the-art PLMs, it remains unclear to what extent neural models have the ability to generalize when performing NLI. To better assess the true abilities of PLMs to perform NLI, Compositional Generalization (Fodor and Pylyshyn, 1988; Hupkes et al., 2020) evaluation has been proposed for NLI (Yanaka et al., 2020; Geiger et al., 2020; Fu and Frank, 2023). This novel task aims to evaluate whether models are able to predict unseen compositional inferences if they have seen their constituting primitive inferences in training. The left part of Table 1 (Compositional Generalization for NLI) shows an unseen compositional NLI test instance for which we expect a model to make the correct prediction 'He tries to catch his dog ↛ He catches his pet', by relying on the primitive inferences 'try to S ↛ S' and 'catch his dog → catch his pet'.
However, existing work evaluating Compositional Generalization for NLI (CGen NLI) relies on offline training, which crucially differs from the way humans acquire knowledge, i.e., by continual learning (Ring, 1997; Parisi et al., 2019). Real communication scenarios require the understanding and induction of compositional inferences relative to dynamically updated knowledge. For example, an agent should be able to compose some newly acquired inferential knowledge buy an Apple Vision Pro (S) → digital content is blended with physical space (S') with previously learned try to S ↛ S, to induce try to S ↛ S'. In Section 9, we present a promising application of continual compositional inference in a dialogue setting.
To better align with the compositional generalization ability required in real-world situations, and to prepare applying compositional NLI to dynamically evolving information states, we introduce a new task: Continual Compositional Generalization for NLI (C2Gen NLI), which aims to explore the compositional generalization ability of a model when performing NLI in a continual learning scenario. We simulate a continuous learning process by manipulating the order in which specific primitive NLI inferences are encountered during training. The right part of Table 1 shows an example. To solve the unseen compositional inference test sample, a model needs to learn, in the first place, the primitive inference try to S ↛ S, and then catch his dog → catch his pet. The C2Gen NLI task challenges models in two ways: it tests i) their ability to perform compositional generalization, by combining learned primitive inferences to solve unseen compositional inferences, and ii) their ability to do so in a continual learning scenario that requires models to memorize and re-use primitive inferential knowledge they continually acquire. Unlike the existing CGen NLI task, C2Gen NLI allows us to evaluate whether models can learn primitive inferences continuously and efficiently.
To facilitate research on C2Gen NLI, we establish an evaluation setup and task dataset for systematic analysis of the effect that continual learning has on the compositional generalization capabilities of models performing NLI. We design two sub-tasks for fine-grained compositional generalization analysis: i) compositional inference (TaskCI), to explore how well a model performs compositional inference; and ii) primitive recognition (TaskP), to evaluate a model's ability to resolve constituting primitive inferences. With our evaluation datasets and tasks, we conduct experiments in CGen and C2Gen NLI settings using a multi-task model for the different inference tasks. Initial results show that under continual learning, models deliver inferior performance on compositional NLI inference, which we trace back to forgetting.
To combat the forgetting issue, we benchmark a set of continual learning algorithms targeted at memorization. Our results validate their effectiveness, but also show that memorization alone cannot solve the compositional inference task. To gain a deeper understanding of the challenges involved in a continual scenario, we investigate the effect of learning primitive inferences in different orders, analyze correlations between primitive and compositional NLI tasks, and examine the impact of ordering compositional inference types by difficulty. Our findings highlight the importance of ordering inference types in continual learning according to dependencies and intrinsic difficulty.
Our main contributions are as follows:
- i) We motivate and introduce the C2Gen NLI (Continual Compositional Generalization for Natural Language Inference) task, which to our knowledge is the first challenge to explore the compositional generalization ability of NLI in a continual learning scenario.
- ii) We construct a compositional NLI dataset and rearrange its partitions for C2Gen NLI.
- iii) Experiments indicate that forgetting is a major challenge for C2Gen NLI. To combat this issue, we benchmark a set of continual learning algorithms and verify their effectiveness.
- iv) Further analyses highlight the impact of guiding the order of continual learning by observing dependencies and degrees of difficulty of primitive and compositional inference types, for compositional NLI performance.
- v) By controlling for data leakage using pseudo data, we demonstrate that the C2Gen NLI challenge persists for LLMs such as Llama.
2 Related Work
NLI determines the inferential relation between a hypothesis and a premise (Dagan et al., 2013; Bowman et al., 2015; Lai et al., 2017; Williams et al., 2018; Welleck et al., 2019). Prior work aimed to improve NLI performance with various neural model types (Parikh et al., 2016; Gong et al., 2018; Chen et al., 2018; Bauer et al., 2021). Recently, large PLMs have performed well on the NLI task, often matching human performance (Wang et al., 2019a; Liu et al., 2019). Despite the success of state-of-the-art LLMs, it remains unclear if models are able to generalize when performing NLI. To better assess their inference abilities, research has started to explore to what extent they generalize in NLI. This includes cross-genre (Williams et al., 2018) and cross-lingual (Conneau et al., 2018) generalization, or investigating the impact of heuristics (McCoy et al., 2019; Bhargava et al., 2021). In this work, we evaluate the generalization ability in NLI focusing on compositional generalization. That is, we test a model's capability of predicting unseen compositional inferences if constituting primitive inferences have been learned.
Early work evaluates compositional generalization for NLI targeting novel compositions involving specific linguistic phenomena, e.g., composing predicate replacements and embedding quantifiers (Yanaka et al., 2020), or focusing on lexical entailment and negation (Geiger et al., 2020; Goodwin et al., 2020). Recently, Yanaka et al. (2021) and Fu and Frank (2023) extended the scope of compositional generalization evaluation to the composition of veridical inference with customary NLI, finding that PLMs are limited in compositionality. Despite the promising findings of the above studies, they all assume that models have full access to all training data in advance. This is in contrast with humans, who acquire knowledge in a continuous fashion.
To simulate human learning processes, continual learning has been proposed (McCloskey and Cohen, 1989; Wu et al., 2022), enabling models to learn from a continuous data stream over time. Robins (1995) and French (1999) identified catastrophic forgetting as the main challenge in continual learning. To address this issue, various continual learning strategies have been proposed. Among others, data-based algorithms (Chaudhry et al., 2019b, a; Aguilar et al., 2020) are well-known. They use small memories to store seen training data, to be reused in later training steps. Using such strategies, later work designed elaborate models to enhance the performance of tasks such as relation extraction (Wang et al., 2019b), multilingual learning (Berard, 2021; M'hamdi et al., 2023), or dialogue (Madotto et al., 2021). By contrast, we use such continual strategies to analyze the impact of continual learning on compositional generalization ability in NLI.
Both compositional and continual learning are pivotal aspects for evaluating the genuine capabilities of large PLMs. Existing work (Dziri et al., 2023; Berglund et al., 2024; Mitchell et al., 2023) indicates that although LLMs are pre-trained on large amounts of data, they still struggle in novel tasks and situations. Thus, LLMs are expected to learn compositionally and continuously. Some recent work aims to combine continual learning and compositionality. Li et al. (2020) focus on continual learning in a sequence-to-sequence task. They propose to represent syntactic and semantic knowledge separately, which allows them to leverage compositionality for knowledge transfer. Jin et al. (2020) introduce a challenging benchmark that aims at continual learning of compositional semantics from visually grounded text. Unlike them, we introduce a new task that focuses on Continual Learning of Compositional Generalization in NLI. With this task, we i) analyze the challenge of compositional generalization in NLI in a continual learning setup; ii) identify the effect of ordering primitive and compositional inference types according to their dependencies and difficulty; and iii) finally, in §9 we showcase the relevance of continual learning in NLI in a concrete application, namely, Persona Dialogue.
Our finding ii), which highlights the impact of ordering primitive and compositional inference types based on their difficulty, is close to another machine learning paradigm, known as curriculum learning (Elman, 1993; Krueger and Dayan, 2009; Bengio et al., 2009; Soviany et al., 2022). This learning paradigm is inspired by the human classroom, and refers to training a model with a curriculum of increasing difficulty. Existing work first assesses the difficulty of training samples; according to this difficulty, it then weights data samples and biases the model towards them (Kumar et al., 2010; Huang and Du, 2019), or organizes data into subgroups and commences learning from the easiest batch (Xu et al., 2020; Jia et al., 2023; Ranaldi et al., 2023). Curriculum learning differs from continual learning in two respects: i) learning schema. Curriculum learning remains an offline learning method: it focuses on structuring the learning process to facilitate faster and more robust learning, whereas continual learning aims to adapt to new data over time while preserving past knowledge. ii) training atoms. Curriculum learning concentrates on data points, whereas continual learning focuses on tasks or knowledge levels. Despite these distinctions, curriculum learning and continual learning interact, e.g., by adopting the ordering principle from curriculum learning to enhance continual learning. Our findings, derived from the analysis of learning sequences in continual learning, could serve as empirical evidence supporting the principles of curriculum learning.
3 Task Setup: C2Generalization in NLI
In this section, we provide an overview of continual learning (§3.1) and describe the construction of our Compositional NLI dataset (§3.2). Building upon this foundation, we rearrange partitions of the dataset to establish Compositional generalization tests with standard training (CGen) and a Continual learning (C2Gen) setup (§3.3).
3.1 Continual Learning Preliminary
Continual learning (McCloskey and Cohen, 1989; Wu et al., 2022) is proposed to simulate human learning processes, enabling models to learn from a continuous, non-stationary data stream over time. The objective is to enable a model to continuously learn a set of instances sequentially ordered with respect to a set of n tasks {T1, …, Tn}, following a given order. The model is trained on examples from T1, progresses to T2, and so on until Tn. Notably, during the learning process for each task Ti, the model is not allowed to access training data from previous tasks T<i or future tasks T>i. Within each task Ti, instances are trained in a random order. In contrast, conventional training involves full access to all data in advance, meaning the model is trained simultaneously on examples randomly sampled from the set of tasks {T1, …, Tn}.
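The following sketch contrasts the two protocols. It is a minimal illustration under our own naming; model, train_step, and the per-task datasets are hypothetical placeholders, not code from this work.

```python
import random

def train_step(model, example):
    # placeholder for one gradient update on a single example
    pass

def train_offline(model, tasks, epochs=1):
    """Conventional training: data from all tasks is mixed and shuffled."""
    pool = [ex for task in tasks for ex in task]
    for _ in range(epochs):
        random.shuffle(pool)
        for example in pool:
            train_step(model, example)

def train_continual(model, tasks, epochs=1):
    """Continual training: tasks T1..Tn are seen strictly in the given order;
    while learning Ti, no data from earlier or later tasks is accessible."""
    for task in tasks:                  # fixed order T1 -> T2 -> ... -> Tn
        data = list(task)
        for _ in range(epochs):
            random.shuffle(data)        # random order within a task
            for example in data:
                train_step(model, example)
```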
3.2 Compositional NLI
We model Compositional Inference (CI) building on customary NLI samples. Both customary and compositional NLI determine the inferential relation between a premise and a hypothesis, but compositional inference involves at least two different primitive inference types (Table 1).[3] To master compositional inference, a model must i) resolve the involved primitive NLI inferences and ii) compose the inferred results, using a suitable composition function.
We construct compositional inferences by selecting veridical inference as a special primitive inference type, and combining it with customary NLI inference samples as a second primitive inference (cf. Table 1). Given that veridical inference involves an embedded sentence, it can be flexibly combined and scaled to compositional inference datasets (Yanaka et al., 2021). Veridical inference (Karttunen, 1971; Ross and Pavlick, 2019) is strongly determined by the lexical meaning of sentence-embedding verbs. In the context of a positively veridical verb, we can infer that the proposition it embeds holds true, e.g., He manages to S → S. For a non-veridical (neutral) verb, we cannot infer the truth or falsity of the proposition, e.g., He tries to S ↛ S; while for negatively veridical verbs, we can infer the negation of the complement, e.g., He refuses to S → ¬S. For customary NLI we distinguish three classes: e(ntailment): S → S′, n(eutral): S ↛ S′, and c(ontradiction): S → ¬S′.
To construct compositional NLI samples, we ensure that the hypothesis of a (non-)veridical inference pair (x verb S, S) matches the premise of a customary NLI pair (S, S′), to derive a transitive inference pair that may be entailed, neutral, or contradictory. For example, He tries to do S ↛ S & S → S′ ⇒ He tries to do S ↛ S′. We use the composition rules listed in Table 2 to define the compositional inference labels. For example, A man tries to catch his dog ↛ A man catches his pet is a (non-entailing) compositional inference. Here, tries to S ↛ S represents a non-veridical (neutral) inference sample, and catch his dog → catch his pet an entailing inference sample. Composing the above primitive inference results determines the label for the compositional inference, i.e., neutral (rule ④).
Table 2: Composition rules deriving the compositional inference (CI) label from a veridical primitive (PV) and a customary NLI primitive (PN).

| index | PV | PN | CI |
|---|---|---|---|
| ① | positive | entailment | entailment |
| ② | positive | neutral | neutral |
| ③ | positive | contradiction | contradiction |
| ④ | neutral | entailment | neutral |
| ⑤ | neutral | neutral | neutral |
| ⑥ | neutral | contradiction | neutral |
| ⑦ | negative | entailment | contradiction |
| ⑧ | negative | neutral | neutral |
| ⑨ | negative | contradiction | entailment |
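Since the label of a compositional inference is fully determined by its two primitive labels, the rules of Table 2 reduce to a lookup. A minimal sketch (names are ours):

```python
# (veridicality signature, customary NLI label) -> compositional label
COMPOSE = {
    ("positive", "entailment"):    "entailment",     # rule 1
    ("positive", "neutral"):       "neutral",        # rule 2
    ("positive", "contradiction"): "contradiction",  # rule 3
    ("neutral",  "entailment"):    "neutral",        # rule 4
    ("neutral",  "neutral"):       "neutral",        # rule 5
    ("neutral",  "contradiction"): "neutral",        # rule 6
    ("negative", "entailment"):    "contradiction",  # rule 7
    ("negative", "neutral"):       "neutral",        # rule 8
    ("negative", "contradiction"): "entailment",     # rule 9
}

def compose_label(p_v: str, p_n: str) -> str:
    return COMPOSE[(p_v, p_n)]

# e.g. "tries to" (neutral) + "catch his dog -> catch his pet" (entailment)
assert compose_label("neutral", "entailment") == "neutral"  # rule 4
```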
3.3 Compositional Generalization Testing
Compositional Generalization (CGen) in NLI
Compositional generalization tests are designed to evaluate whether a model can generalize to unseen compositional inferences whose constituting primitives have been observed in training. For example, we can evaluate a model's compositional generalization ability by testing it on an unseen compositional sample A man tries to catch his dog ↛ A man catches his pet, where its constituting primitive inferences tries to S ↛ S and catch his dog → catch his pet have been seen in training. We denote the set of possible veridical inference types with V, the set of customary inference types with N, and the set of all possible compositional inference types with C; the domain of all instances of the respective types is given as D_V, D_N, and D_C. In all our compositional generalization experiments, we guarantee that there is no intersection between the compositional types used in training and test, i.e., C_train ∩ C_test = ∅, while primitive inferences involved in test instances are ensured to have been seen in training: V_test ⊆ V_train and N_test ⊆ N_train.
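A minimal sketch of these split constraints, assuming each sample is a dict carrying its compositional type and its two primitive types (the field names are illustrative):

```python
def check_cgen_split(train, test):
    """Verify the CGen constraints on a train/test split."""
    comp_train = {s["comp_type"] for s in train}
    comp_test = {s["comp_type"] for s in test}
    ver_train = {s["ver_type"] for s in train}
    nli_train = {s["nli_type"] for s in train}
    # compositional types in test must be unseen in training
    assert comp_train.isdisjoint(comp_test)
    # ...while every primitive type used in test was seen in training
    assert all(s["ver_type"] in ver_train and s["nli_type"] in nli_train
               for s in test)
```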
Continual Compositional Generalization (C2Gen) in NLI
Unlike standard compositional generalization evaluation that relies on offline learning, requiring all training data to be processed in advance, the continual compositional generalization test (C2Gen) extends the evaluation to a continual learning setup. Here, a model is fed with a non-stationary data stream, i.e., the training process follows a controlled learning order, simulating how humans acquire knowledge from their environment. Following the standard CGen setup, we evaluate a model's generalization ability in compositional NLI by testing unseen composition types, e.g., A man tries to catch his dog ↛ A man catches his pet. During training, we separate the training stream into sequential stages Si (i ∈ {1,2}), where i) in one stage the model learns to categorize veridical inference based on the embedding verb (e.g., the neutral verb try); ii) in the other it learns to categorize a customary NLI pair (e.g., the entailment pair catch his dog → catch his pet). Hence, the model first learns one primitive (e.g., veridical inference) as a basis for solving compositional inference, and then the other (customary NLI), or vice versa.
We construct the above continual scenario by controlling irrelevant variables. When exploring veridical inference in a stage, we fix a small number of primitive NLI samples and feed various veridicality samples. Similarly, in the other stage, we fix a restricted number of samples from veridical inference and feed various primitive NLI instances. In parallel to training primitives, compositional instances are presented, where the used primitives have been seen in training of the corresponding stage Si. The stages are trained sequentially, while samples are randomly ordered within each stage. This process enables models to learn incrementally from new data. Figure 1 shows the process.
Compared to customary offline training, C2Gen NLI is more challenging: models need to learn how to compose primitive inferences, while preserving previously acquired knowledge of the constituting primitive inferences.
4 Analyzing C2Gen NLI as a Multi-Task
4.1 Decomposing Compositional NLI
To prepare a deep analysis of the generalization capabilities in C2Gen NLI, i.e., compositional NLI in a continual learning training regime, we decompose the CGen task into two constituting subtasks: prediction of primitive inferences (TaskP), and prediction of compositional NLI (TaskCI) as the main task. We apply multi-task learning to jointly learn the two tasks.[4] Figure 2 gives an overview.
TaskCI: Compositional Inference
In the NLI CI task, a model is tasked to predict the inferential relationship instantiated in a given compositional NLI sample. For example, the model is expected to predict the value 'neutral' for A man tries to catch his dog ↛ A man catches his pet.[5]
TaskP: Primitives Recognition
TaskP evaluates whether a model correctly predicts the primitive inferences from which a given compositional sample is built. That is, for A man tries to catch his dog ↛ A man catches his pet we test the model predictions for its constituting primitive inferences, expecting i) neutral for A man tries to S ↛ S and ii) entailment for the entailed inference A man catches his dog → A man catches his pet.
4.2 Model
We use a multi-task training strategy to jointly train the two tasks, TaskP and TaskCI. Their objectives ℒ_prim and ℒ_cr are jointly optimized during training, using the loss ℒ = ℒ_prim + ℒ_cr.
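A minimal sketch of such a two-head multi-task setup in PyTorch. The class and parameter names are our own illustration, assuming a shared encoder that exposes a pooled sentence-pair representation (as RoBERTa does):

```python
import torch.nn as nn

class MultiTaskNLI(nn.Module):
    def __init__(self, encoder, hidden=1024, n_classes=3):
        super().__init__()
        self.encoder = encoder                         # shared encoder, e.g. RoBERTa
        self.head_prim = nn.Linear(hidden, n_classes)  # TaskP head
        self.head_cr = nn.Linear(hidden, n_classes)    # TaskCI head

    def forward(self, prim_inputs, ci_inputs):
        h_prim = self.encoder(**prim_inputs).pooler_output
        h_ci = self.encoder(**ci_inputs).pooler_output
        return self.head_prim(h_prim), self.head_cr(h_ci)

def joint_loss(logits_prim, logits_ci, y_prim, y_ci):
    ce = nn.CrossEntropyLoss()
    # L = L_prim + L_cr, as above
    return ce(logits_prim, y_prim) + ce(logits_ci, y_ci)
```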
4.3 Training Settings
Compositional Generalization (CGen).
The standard compositional generalization test in NLI relies on offline training, where models have full access to all training data in advance. This setup serves as an upper-bound baseline for our experiments. All training data in CGen is mixed and presented in random order.
Continual Compositional Generalization (C2Gen).
This new training setup evaluates the compositional generalization capability in NLI in a continual learning scenario. The model is restricted to follow a non-stationary data stream, i.e., all compositional NLI training data is presented in a specific order (e.g., ver→nat or nat→ver, cf. §5.2).
4.4 Continual Learning Strategies
In order to deeply analyze the challenges of the C2Gen NLI task, we first benchmark well-known continual learning strategies designed to combat forgetting. All methods introduce a small, fixed-size so-called episodic memory, which stores samples selected from a previous learning stage and is used in subsequent training stages in different ways:
Experience Replay (ER).
Chaudhry et al. (2019b) utilize samples from a memory directly for re-training in future stages. We benchmark three variants: a) ER-res(ervoir) applies a sampling technique that ensures that each seen sample has an equal chance of being stored; b) ER-buff guarantees that the size of the memory at each training stage is the same; and c) ER-mir (Aljundi et al., 2019) selects re-training data that is most likely to be forgotten in the next training stage.
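For illustration, a minimal sketch of the reservoir memory behind ER-res. This is the standard reservoir-sampling buffer; the class is our own, not the authors' code:

```python
import random

class ReservoirMemory:
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.buffer = []
        self.n_seen = 0

    def add(self, example):
        """Every seen example has probability capacity/n_seen of being stored."""
        self.n_seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = random.randrange(self.n_seen)  # uniform over all seen items
            if j < self.capacity:
                self.buffer[j] = example

    def sample(self, k):
        # replayed alongside the current batch in later training stages
        return random.sample(self.buffer, min(k, len(self.buffer)))
```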
Averaged Gradient Episodic Memory (A-GEM).
Chaudhry et al. (2019a) constrain the direction of the updated gradient. They calculate the gradient g′ of the previous training stage on memory data and project the updated gradient to a direction that is close to g′.
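The projection step can be written compactly. A sketch assuming flattened gradient tensors g (current batch) and g_ref (memory batch):

```python
import torch

def agem_project(g: torch.Tensor, g_ref: torch.Tensor) -> torch.Tensor:
    dot = torch.dot(g, g_ref)
    if dot >= 0:  # update does not conflict with the memory gradient
        return g
    # project g so it no longer increases the loss on memory data:
    # g' = g - (g . g_ref / ||g_ref||^2) * g_ref
    return g - (dot / torch.dot(g_ref, g_ref)) * g_ref
```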
Knowledge Distillation (KD).
Aguilar et al. (2020) apply memory samples to distill and preserve knowledge learned in previous stages, by minimizing the difference between the output predictions from the previous stage and the current stage over memory data.
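A minimal sketch of such a distillation loss over memory samples, assuming old_logits come from the frozen previous-stage model; the temperature is an assumed hyperparameter, not a reported setting:

```python
import torch.nn.functional as F

def kd_loss(new_logits, old_logits, temperature=2.0):
    # minimize the divergence between previous- and current-stage predictions
    p_old = F.softmax(old_logits / temperature, dim=-1)
    log_p_new = F.log_softmax(new_logits / temperature, dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * temperature ** 2
```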
5 Experimental Setup
5.1 Dataset Construction and Verification
We construct datasets with instances chosen from established NLI datasets. i) For primitive veridical inference, we select 21 verbs from the dataset of Ross and Pavlick (2019). We restricted our choice to verbs with infinitive complements to ease the construction of compositional samples. Table 3 shows the selected verbs for each class label. ii) For primitive customary NLI we extract 2130 instances (e: 747; n: 693; c: 690) from SICK (Marelli et al., 2014), focusing on instances where the inference is based on specific semantic relations including synonymy, hyponymy, active-passive diathesis, etc. For compositional inference, we compose samples from these primitive veridical and customary NLI data points, as described in §3.2. All compositional inferences are categorized into nine groups using the composition rules in Table 2. Table 4 shows the class distribution.[6] The distribution of target class labels (e:n:c) is roughly 1:2:1.
Table 3: Selected veridical verbs per signature class.

| Signature | Instantiations |
|---|---|
| positive (+) | manage, begin, serve, start, dare, use, get, come |
| neutral (∘) | hope, wish, expect, try, plan, want, intend, appear |
| negative (−) | forget, fail, refuse, decline, remain |
Table 4: Number of instances per compositional inference type.

| type | #num | type | #num | type | #num |
|---|---|---|---|---|---|
| ① ee_e | 5976 | ④ ne_n | 5976 | ⑦ ce_c | 3735 |
| ② en_n | 5544 | ⑤ nn_n | 5544 | ⑧ cn_n | 3465 |
| ③ ec_c | 5520 | ⑥ nc_n | 5520 | ⑨ cc_e | 3450 |
As the dataset is automatically constructed from existing datasets, we perform manual human verification to ensure its validity, following Keysers et al. (2020) and Liu et al. (2022, 2024). For cost considerations we restricted manual verification to 200 randomly sampled instances. Two annotators specialized in computational linguistics performed this task. They underwent training in practice sessions with direct feedback before starting the annotation process. Their task was to annotate the correct class (entailment, neutral, or contradiction) for each premise-hypothesis pair, for all three inference types. The inter-annotator agreement calculated by Cohen's kappa was 0.961, 0.954, and 0.917 for the respective inference types.
After the annotation, we computed the consistency between the human-labeled and automatically constructed data for each inference type. Among incorrect veridical inference samples (15 cases),[7] 87% of instances can be attributed to a systematic veridicality bias among humans (Ross and Pavlick, 2019): some verbs with neutral signature are often perceived to have positive signature, while our construction follows the semantic definition (cf. Table 2). The remaining 13% are due to a range of different annotation errors. For customary inference (based on SICK), we follow the taxonomy of Kalouli et al. (2023) to categorize error samples (24 cases). Applied to our data, the errors are attributed to the following sources: ambiguity (55%), looseness (25%), phrasal verbs (10%), and annotation error (10%). Note that NLI labeling consistency, in general, is still an open issue, relating to factors such as ambiguity and uncertainty (Pavlick and Kwiatkowski, 2019; Nie et al., 2020; Jiang and de Marneffe, 2022). For incorrect compositional inferences (25 instances), we note that incorrect primitive inferences cause accumulated errors, accounting for approx. 91.5% of the incorrect compositional inferences. The remaining ones are annotation errors. Still, the consistency for each inference type exceeds 90%, indicating a high quality of our benchmark dataset, which can be a valuable resource for future work.
5.2 Dataset Split
The compositional inference data is prepared for the Compositional Generalization in NLI (CGen) evaluation as follows: Given nine compositional inference types, we conduct nine-fold cross-validation experiments, reporting averaged results. Specifically, each type once serves as test dataset (e.g., ①), while the remaining eight types are used as training set (e.g., ②–⑨). As outlined in §3.3, we guarantee that the primitive inferences used in a given test instance have been seen in training. For TaskCI we train on C_train and test on C_test. For TaskP we decompose the instances of C_train and C_test into their constituting primitive inferences (a veridical and a customary NLI pair) for primitive recognition training, and test on the primitive inferences decomposed from the unseen test compositions.
In the Continual Compositional Generalization in NLI (C2Gen) setting (cf. §3.3) we maintain the evaluation protocol for both tasks as detailed above for CGen, but split the train set into two parts assigned to stages S1 and S2, such that together they cover the full training set, and present this data in a continual training stream. For each stage Si, i ∈ {1,2}: i) If it serves to train the model to learn veridical inference, we use a small number of NLI samples and feed various veridicality samples. For example, we select data from ②⑤③⑨, where for the pair ②⑤ the model needs to distinguish the effect of positive and neutral veridicality, and similarly for ③⑨, where it needs to distinguish the effect of positive and negative veridicality. ii) If the model is tasked to learn natural language inference, we use a small number of veridical verbs, selecting data from ④⑥⑦⑧ (for similar reasons as in i). We experiment with alternative data streams that reverse the order in which the specific phenomena are trained, once setting S1 to process training data targeted at veridical inference and S2 at customary NLI (ver→nat), and once choosing the opposite assignment (nat→ver). In each stage, we uniformly sample 3200 instances for training.
5.3 Evaluation Metric
We adopt two metrics: i) Acc(uracy) reports the percentage of correctly predicted labels for a given task after training on all stages. ii) Forget is a commonly used metric in continual learning. It measures to what extent knowledge that a model has learned in S1 is preserved after training in S2. For a given task T, Forget is calculated as (Acc_T after S1 − Acc_T after S2) / (Acc_T after S1), i.e., the relative drop of T's accuracy across the two stages.
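Both metrics are straightforward to compute. A minimal sketch; the ×100 scaling to percentage points is our reading of the tables below:

```python
def accuracy(preds, golds):
    """Percentage of correctly predicted labels."""
    return 100.0 * sum(p == g for p, g in zip(preds, golds)) / len(golds)

def forget(acc_after_s1, acc_after_s2):
    """Relative loss of the knowledge acquired in S1, in percentage points."""
    return 100.0 * (acc_after_s1 - acc_after_s2) / acc_after_s1

# e.g. the nat->ver primitive NLI accuracies below: forget(93.94, 71.15) ~= 24.26
```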
5.4 Implementation Details
Backbone.
We use RoBERTa-Large (Liu et al., 2019) as the backbone model; §8.2 additionally evaluates the larger Llama-2 model.
Continual Learning Strategies.
For all evaluations using continual learning strategies, we set the memory size to 100. Following Chaudhry et al. (2019b) and Aljundi et al. (2019), we set the number of replay samples in each step to the batch size for ER-based strategies, including ER-reservoir, ER-buff, and ER-mir. In practice, we add the memory batch to the current batch in training. For fair comparison with these strategies, we likewise set the number of memory samples used for controlling the gradient in AGEM (Chaudhry et al., 2019a) and for distilling knowledge in KD (Aguilar et al., 2020) to the batch size. For each experiment, we perform three runs with different seeds, as in Jin et al. (2020) and Madotto et al. (2021). We report the mean performance with standard deviations in the following experiments.
Hyperparameter Settings.
We determine suitable hyperparameters by empirical assessment in a grid search. To assess the impact of the learning rate, we run experiments across a range of learning rates [1e-5, 2e-5, 3e-5] using the Adam optimizer.[10] Results indicate that the gap (Δ) between CGen and C2Gen increases monotonically with the learning rate, yielding gaps of [18.11, 19.05, 19.88] points for TaskP and [7.44, 8.39, 9.26] for TaskCI for the respective choices. We select 1e-5 as the learning rate because it yields the smallest gap. Moreover, the similarity in gap values between TaskP and TaskCI implies that adjusting hyperparameters alone does not significantly impact the subsequent conclusions. We similarly evaluate the impact of memory capacity on continual strategies, for ranges from 2% to 5% of the one-stage training data, corresponding to memory sizes of 50, 100, 150, and 200. The results for the two tasks exhibit a unimodal distribution, with a peak at 100. We therefore opt for a memory size of 100.
6 Results and Analysis
6.1 How Does a Model Perform in C2Gen?
We start by analyzing the effects of the different training settings, CGen and C2Gen, on model performance in the compositional generalization test for NLI (TaskCI). Table 5 shows the results. In the CGen setting, the model shows decent performance in compositional inference (TaskCI) with an accuracy of 46.67. Compared to CGen, C2Gen NLI shows a decline for both continual order variants ver→nat and nat→ver, with reductions of 7.27 and 9.31 points, respectively. This suggests that compositional generalization in NLI in a continual learning scenario is more challenging.
Table 5: Accuracy (mean ± std) for TaskP and TaskCI under CGen and C2Gen.

| Settings | TaskP: V | TaskP: N | TaskP: V+N | TaskCI |
|---|---|---|---|---|
| CGen | 99.96 ±0.12 | 94.36 ±0.57 | 94.36 ±0.41 | 46.67 ±0.26 |
| ver→nat: C2Gen (after S1) | 100.00 ±0.00 | – | – | – |
| ver→nat: C2Gen (after S2) | 80.72 ±0.39 | 94.25 ±0.76 | 76.31 ±0.59 | 39.40 ±0.43 |
| nat→ver: C2Gen (after S1) | – | 93.94 ±0.65 | – | – |
| nat→ver: C2Gen (after S2) | 99.58 ±0.14 | 71.15 ±0.72 | 70.73 ±0.49 | 37.36 ±0.57 |
Why is C2Gen More Challenging?
To investigate this question, we examine the accuracy of primitive inference (TaskP) in the different continual learning stages. This is because TaskCI is dependent on TaskP, requiring correct predictions for the constituting elements of the composition. For C2Gen in order ver→nat, we find that the initially learned veridical primitive inference achieves a high accuracy of 100% in stage S1, showing that the model has acquired perfect knowledge of veridical inference after S1. However, the accuracy for veridicality drops to 80.72 (↓19.18) after learning primitive NLI in S2. This suggests that the model forgets the primitive knowledge learned during S1. We find a similar trend in the C2Gen setting nat→ver, where the accuracy of the initially learned NLI primitive inference drops from 93.94 to 71.15 (↓22.79). While in each order only one primitive is affected by forgetting, the joint accuracy for TaskP drops to 70–76 points in both settings. From these observations we conclude that catastrophic forgetting is a major challenge in C2Gen.
6.2 Can Continual Learning Strategies Help?
Next, we apply existing continual learning strategies that are designed to address the problem of forgetting, and analyze their effect on the preservation of knowledge of primitives (TaskP) and on compositional generalization (TaskCI) in C2Gen, for both learning orders. Table 6 shows the results. Compared to vanilla C2Gen, all continual strategies yield improved accuracy for both tasks and reduce the forgetting value of learned primitive inference. In C2Gen ver→nat, they yield a significant improvement in the accuracy of the initially learned primitive (AccV), with an increase from approx. 80 to 100. Accordingly, the forgetting value associated with this primitive decreases by the same amount to almost 0. A similar trend is seen in C2Gen nat→ver, where the accuracy of the initially learned primitive (AccN) increases from 71 to 83, while its forget value drops from 24 to 10. This shows that continual learning strategies alleviate forgetting, helping the model to regain substantive performance (+5 points for TaskCI).
Table 6: Results (mean ± std) with continual learning strategies; the first five result columns report the ver→nat order, the last five the nat→ver order.

| Settings | AccV (↑) | AccN (↑) | AccV+N (↑) | ForgetV (↓) | AccCI (↑) | AccV (↑) | AccN (↑) | AccV+N (↑) | ForgetN (↓) | AccCI (↑) |
|---|---|---|---|---|---|---|---|---|---|---|
| C2Gen | 80.72 ±0.39 | 94.25 ±0.76 | 76.31 ±0.59 | 19.18 ±0.39 | 39.40 ±0.43 | 99.58 ±0.14 | 71.15 ±0.72 | 70.73 ±0.49 | 24.26 ±0.48 | 37.36 ±0.57 |
| ER-Res | 99.89 ±0.01 | 94.14 ±0.56 | 94.04 ±0.53 | 0.11 ±0.01 | 44.89 ±0.68 | 100.00 ±0.00 | 87.43 ±0.67 | 87.43 ±0.67 | 7.64 ±0.53 | 42.34 ±0.71 |
| ER-Buff | 99.78 ±0.01 | 94.25 ±0.34 | 93.91 ±0.28 | 0.15 ±0.01 | 44.34 ±0.56 | 100.00 ±0.00 | 87.38 ±0.59 | 87.38 ±0.59 | 6.91 ±0.42 | 41.68 ±0.42 |
| ER-Mir | 99.92 ±0.00 | 94.87 ±0.29 | 94.04 ±0.19 | 0.08 ±0.00 | 44.73 ±0.72 | 100.00 ±0.00 | 87.55 ±0.71 | 87.55 ±0.71 | 6.71 ±0.69 | 42.01 ±0.66 |
| AGEM | 99.86 ±0.02 | 94.91 ±0.87 | 93.78 ±0.75 | 0.14 ±0.03 | 42.10 ±0.91 | 99.61 ±0.03 | 81.70 ±1.12 | 81.35 ±0.94 | 13.25 ±0.86 | 41.60 ±0.81 |
| KD | 99.80 ±0.03 | 94.56 ±0.63 | 90.13 ±0.44 | 0.20 ±0.03 | 42.37 ±0.77 | 97.86 ±0.04 | 82.69 ±0.99 | 81.90 ±0.87 | 11.57 ±0.68 | 41.78 ±0.74 |
We then analyze the effect of the different continual strategies. Table 6 shows that Experience Replay strategies (ER-Res/Buff/Mir) achieve superior results on both tasks in both learning orders. For example, in C2Gen ver→nat, ER-based strategies achieve a TaskCI accuracy of 44 (as opposed to 42 for AGEM and KD). With the reverse order nat→ver, overall performance is lower for both tasks, but ER remains ahead: TaskCI achieves 42 (ER) vs. 41 (non-ER); TaskP yields 87 (ER) vs. 81 (non-ER). The only exception is AccV in C2Gen nat→ver, where all continual strategies show comparable performance, at almost 100%. This is likely due to the ease of learning the highly lexicalized veridicality classes, to which continual strategies cannot contribute much (cf. also Table 5).
7 Establishing Learning Order for C2Gen
As shown in §6.2, continual strategies can greatly improve the performance of primitive and compositional NLI in C2Gen NLI. However, the continual learning results still lag behind non-continual training. To gain a deeper understanding of the challenges involved in the continual learning process for compositional generalization inference, we perform further analyses of the C2Gen setting.[11]
7.1 Effects of Primitive Learning Orders
While it seems evident that primitive tasks must be learned prior to the compositional tasks they are constitutive for, the order among primitive tasks is more difficult to establish. To explore how different orders of learning primitives in continual learning affect compositional generalization, we compare the performance of TaskP and TaskCI with alternating orders of learning veridical inference (ver) and customary NLI inference (nat), i.e., ver→nat vs. nat→ver. Table 6 shows that ver→nat consistently outperforms nat→ver. For ER-Res, e.g., i) for TaskP, AccV+N differs by 6.61 points (94.04 vs. 87.43); ii) for AccCI in TaskCI the difference is smaller, but still considerable (2.55 points). These differences indicate that the order of learning constituting primitive inferences is relevant for compositional NLI inferences.
In order to investigate why ver→nat performs better than nat→ver, we examine the representation changes of the initially learned primitives for the respective learning orders at different timesteps: i) by the end of S1, where the model has just completed learning the initial primitive, and ii) after S2, when the model has completed learning the other primitive. For S2 we compare two settings: pure continual learning (S2 w/o continual strategy) and continual learning using the ER-Res strategy (S2 w/ continual strategy).
Figure 3 visualizes the results. For both orders, we observe similar changes between S1 and S2: the three categories within each primitive inference type are clearly grouped in S1. In S2, the shapes of the three clusters get looser without a continual strategy, while with the ER-Res strategy (rightmost images), the density of each cluster can be recovered. When comparing the density of the individual clusters for the different orders (ver→nat vs. nat→ver), it becomes evident that the clusters in ver→nat exhibit a higher level of density in both stages. This suggests that veridical inference is easier to learn than customary NLI, leading to a reduced likelihood of forgetting. This finding highlights the importance of considering the inherent difficulty of learning a primitive, and of ordering primitives that are easier to learn first.
7.2 Continual Learning of Dependent Tasks
To better understand the challenges of compositional NLI in the different learning frameworks, we further analyze the correlation between TaskP and TaskCI. We decompose the compositional inference test data into its constituting primitive inferences (a veridical and a customary NLI pair) for primitive recognition. We then categorize all test instances into four groups: i) P(✓)CI(✓), where both tasks yield correct predictions. ii) P(✓)CI(✗), where seen primitive inferences are correctly classified, but predicting unseen compositions fails. This we identify as lacking generalization capability. iii) P(✗)CI(✓) records unseen compositions that are correctly predicted without accurately recognizing their primitives. Given that TaskP is a prerequisite for TaskCI, this scenario indicates a shortcut. iv) For P(✗)CI(✗), where both tasks are incorrectly predicted, the model fails the complete task.
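A minimal sketch of this four-way categorization, assuming per-instance boolean correctness flags for the two tasks (the group keys are our shorthand):

```python
from collections import Counter

def categorize(p_correct, ci_correct):
    """Return the percentage of test instances per correctness group."""
    groups = Counter()
    for p, ci in zip(p_correct, ci_correct):
        if p and ci:
            key = "P(ok)CI(ok)"   # correct
        elif p and not ci:
            key = "P(ok)CI(x)"    # no generalization
        elif not p and ci:
            key = "P(x)CI(ok)"    # shortcut
        else:
            key = "P(x)CI(x)"     # wrong
        groups[key] += 1
    total = sum(groups.values())
    return {k: 100.0 * v / total for k, v in groups.items()}
```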
Table 7 displays the distribution of these cases. For CGen, we find an exceedingly low percentage of instances in the P(✗)CI(✓) category, indicating a scarcity of model shortcuts. Since P(✓)CI(✓) and P(✓)CI(✗) jointly cover the remaining probability mass, we conclude that the model meets the preconditions for solving TaskCI by being able to solve TaskP. However, about half of these cases fail at compositional NLI inference in TaskCI. This suggests that the evaluated models lack compositionality. In contrast, human annotation evaluations (§5.1) show that incorrect compositional inferences mainly stem from accumulated errors in primitive inferences. That is, P(✗)CI(✗) is more predominant compared to P(✓)CI(✗). This indicates that humans show greater proficiency in handling compositionality compared to models.
Table 7: Distribution (%) of the four correctness groups; Δ values are relative to CGen.

| Setting | P(✓)CI(✓) correct | P(✓)CI(✗) no generalization | P(✗)CI(✓) shortcut | P(✗)CI(✗) wrong |
|---|---|---|---|---|
| CGen | 46.05 | 52.41 | 0.62 | 0.92 |
| C2Gen | 37.98 (Δ8.07) | 46.71 (Δ5.70) | 1.42 (Δ0.80) | 13.89 (Δ12.97) |
| ER-Res | 44.33 (Δ1.72) | 54.38 (Δ1.97) | 0.56 (Δ0.06) | 0.73 (Δ0.19) |
Continual learning in C2Gen shows a reduction in the proportions of P(✓)CI(✓) and P(✓)CI(✗), with the majority of erroneous predictions transitioning to P(✗)CI(✗). This shows that continual learning has a clear impact on primitive recognition, with or without generalization ability. Enhancing the model with the ER-Res strategy yields a reduction for P(✗)CI(✗) and a corresponding increase of the P(✓)CI(✓) and P(✓)CI(✗) classes. However, the increase is more pronounced for the no generalization class (+3.7). That is, ER-Res proves more effective for primitives than for compositional generalization. This may be due to the differing complexity of the two tasks, making it relatively easier for primitives to recover from forgetting. Overall, we show that memorization methods can alleviate the forgetting effect for primitives, while compositional inference remains challenging, with a small decrease compared to CGen.
7.3 C2Gen by Increasing Difficulty of Tasks
As the above analysis shows, C2Gen remains challenging, with a gap of Δ1.78 for C2Gen ver→nat with ER-Res compared to CGen (Table 6). We aim to explore how to alleviate this issue. Inspired by our insights into ordering effects for primitive inference types (§7.1) and the curriculum learning paradigm, we investigate the effect of ordering the continual learning stream for the complete compositional task along the degree of difficulty of all involved NLI types.
Table 8 shows that the 9 compositional inference types can be grouped into 3 function types based on veridicality:[12] i) for positive verbs v_e, the compositional inference label is consistent with the label of the NLI primitive; ii) for neutral verbs v_n, compositional inference remains neutral regardless of the NLI inference type; iii) for negative verbs v_c, the compositional inference label is the inverse of the customary NLI label. The respective function types f_e, f_n, f_c are defined in Table 8. We determine the difficulty of the individual functions by averaging the results of the individual inferences pertaining to each veridicality label in the CGen setup. Table 8 shows that the performance of the 3 functions varies considerably: f_n, for neutral veridicality, exhibits significantly higher accuracy (85.74) than the other ones; f_e for positive veridicality performs much worse (35.76) but still better than f_c for negative veridicality, at 18.51 points. We hence define two compositional function learning orders (cfo) for TaskCI: easy→hard: f_n→f_e→f_c and hard→easy: f_c→f_e→f_n.
Following the two-stage learning process described above, we add a stage S3 that only presents compositional inference training data, controlled by a continual data stream in which the functions f_n, f_e, f_c are arranged by degree of difficulty. Row (2) in Table 9 shows the results of C2Gen in the two opposing orders. For fair comparison, CGen is also trained with this data, yet in random order, achieving 48.64 accuracy. Indeed, applying the easy→hard learning order narrows the gap to CGen to a small margin of Δ0.42, considerably outperforming hard→easy (Δ2.58). This finding indicates that further training with a favorable function learning order benefits C2Gen, aligning with our insight from §7.1 that learning easy components first enhances learning performance.
Table 9: TaskCI accuracy with compositional function learning orders (cfo), compared to CGen.

| # | Setting | CGen | C2Gen easy→hard: f_n→f_e→f_c | C2Gen hard→easy: f_c→f_e→f_n |
|---|---|---|---|---|
| (1) | two-stage C2Gen (§6.2, no cfo) | 46.67 | 44.89 (Δ1.78) | – |
| (2) | + compositional stage S3 (cfo) | 48.64 | 48.22 (Δ0.42) | 46.06 (Δ2.58) |
| (3) | primitives first, then cfo | 47.19 | 45.45 (Δ1.74) | 44.63 (Δ2.56) |
To further consolidate the above finding, we conduct a complementary experiment (row (3) in Table 9). Here, we construct a learning scheme that follows easy before hard but strictly orders primitive before compositional inference. That is, the model is forced to learn independent primitive inferences first, and only later compositional inferences ordered by function difficulty. Row (3) in Table 9 indicates that easy→hard still improves over the reverse order, confirming the easy before hard scheme. We also note that this strictly ordered scheme yields a larger gap than the parallel scheme (1.74 vs. 0.42). This suggests that learning CI in parallel to P in S1 and S2 is beneficial.
8 Controlling Model Size & Data Leakage
PLMs (Devlin et al., 2019; Liu et al., 2019) have demonstrated impressive performance on many NLP tasks through pre-training on extensive data. Recent advancements in large PLMs (Chowdhery et al., 2023; Touvron et al., 2023) have achieved even more substantial improvements by further scaling models and data. However, this raises concerns regarding the reliability of generalization evaluations: i) regarding data: whether evaluation data might have been encountered during pre-training; ii) regarding model scale: whether a scaled PLM could show emerging compositional generalization ability. We address these concerns in two experiments.
8.1 Controlling for Data Leakage
Following Lake and Baroni (2023), we construct a pseudo-compositional inference dataset by replacing all relevant knowledge-bearing natural language terms with pseudo-words. For veridical inference we replace veridical verbs with pseudo words, e.g., manage → blicke. Table 10 shows examples. Irrespective of these applied changes, we leave the signatures of the original verbs untouched. For customary NLI, we replace pairs of semantically related words that are crucial for deciding the NLI class with a pair of pseudo words. For example, in 'A man catches his dog → A man catches his pet' we replace dog → ozf and pet → yqj. Given the difficulty of identifying crucial semantic relations in the NLI data, we select 438 relation pairs covering 813 NLI instances (again, examples in Table 10). As for veridical inference, we preserve the original inference labels. Using these pseudo primitive inference indicators, we build a pseudo TaskCI dataset following the process in §3.2.
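A minimal sketch of this substitution, using the example replacements named above; the mapping and function are illustrative:

```python
# knowledge-bearing words -> pseudo words; inference labels stay untouched
PSEUDO = {"manage": "blicke", "dog": "ozf", "pet": "yqj"}

def pseudoize(sentence: str) -> str:
    """Replace knowledge-bearing tokens with their pseudo counterparts."""
    return " ".join(PSEUDO.get(tok, tok) for tok in sentence.split())

premise = pseudoize("A man catches his dog")     # -> "A man catches his ozf"
hypothesis = pseudoize("A man catches his pet")  # -> "A man catches his yqj"
# the original label (entailment) is preserved for the pseudo pair
```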
With this pseudo dataset we re-evaluate the performance of RoBERTa under CGen and C2Gen. The results in Table 11 align well with the trends we have seen in Table 5, for the same data in natural language. This shows that the results of our generalization experiments are not affected by data seen in pre-training. Indeed, compared to CGen, C2Gen NLI shows a decline for both continual order variants of primitives ver and nat, in both datasets. This confirms that compositional generalization in NLI is more challenging in a continual learning setup. Comparing TaskP and TaskCI with alternating orders, we note that ver→nat outperforms nat→ver in both datasets.
Table 11: Results on original vs. pseudo data for CGen and C2Gen (RoBERTa).

| Settings | orig: TaskP V | orig: TaskP N | orig: TaskP V+N | orig: TaskCI | pseudo: TaskP V | pseudo: TaskP N | pseudo: TaskP V+N | pseudo: TaskCI |
|---|---|---|---|---|---|---|---|---|
| CGen | 100.00 | 92.92 | 92.92 | 46.15 | 91.76 | 83.55 | 81.37 | 39.34 |
| v→n: C2Gen (after S1) | 100.00 | – | – | – | 90.14 | – | – | – |
| v→n: C2Gen (after S2) | 81.29 (Δ18.71) | 92.48 | 78.83 | 37.98 (Δ8.17) | 76.29 (Δ13.85) | 81.57 | 73.82 | 36.62 (Δ2.72) |
| n→v: C2Gen (after S1) | – | 93.15 | – | – | – | 82.19 | – | – |
| n→v: C2Gen (after S2) | 99.87 | 73.91 (Δ19.24) | 72.42 | 34.64 (Δ11.51) | 89.72 | 59.42 (Δ22.77) | 56.72 | 33.91 (Δ5.43) |
Finally, we observe that the absolute accuracies obtained for TaskP and TaskCI on pseudo data generally drop compared to the original data, and substantially so for TaskP. As for the relative performance of the different continual orders in TaskP, we note that the relative drop for nat→ver compared to ver→nat is much more pronounced for pseudo than for original data.
8.2 Model Scale: Testing C2Gen with Llama
We next test the generalization ability for C2Gen NLI of a large PLM, Llama-2-7B (Touvron et al., 2023). Table 12 shows the results. To fine-tune this large PLM, we adopt standard parameter-efficient fine-tuning (peft) with LoRA (Hu et al., 2022). Compared to RoBERTa-Large, the model size increases by approx. 20 times, from 0.355 to 7 billion parameters. This enhances the accuracy on the CGen test from 46.67 to 49.51% (Δ2.84) for TaskCI. While this marks progress, a large drop again occurs for continual learning in C2Gen (Δ5.12 / Δ6.88). This suggests that compositional generalization is still a challenge for LLMs. Moreover, the gain of 2.84 points over RoBERTa on CGen is modest compared to the substantial resource cost. This finding is consistent with Qiu et al. (2022), who found that fine-tuning LLMs generally has a flat or negative scaling curve for compositional generalization in semantic parsing.
Table 12: Llama-2-7B results for CGen and C2Gen.

| Settings | TaskP: V | TaskP: N | TaskP: V+N | TaskCI |
|---|---|---|---|---|
| CGen | 100.00 | 95.17 | 95.17 | 49.51 |
| v→n: C2Gen (after S1) | 100.00 | – | – | – |
| v→n: C2Gen (after S2) | 82.89 (Δ17.11) | 95.08 | 79.63 | 44.39 (Δ5.12) |
| n→v: C2Gen (after S1) | – | 95.14 | – | – |
| n→v: C2Gen (after S2) | 99.43 | 75.24 (Δ19.90) | 74.82 | 42.63 (Δ6.88) |
Similar to RoBERTa, we observe that Llama-2 is affected by forgetting – the amount of forgetting does not differ much from RoBERTa's, dropping by 1.6 points in ver→nat but rising by 0.66 points in nat→ver. Comparing the different training orders (ver→nat, nat→ver) confirms that Llama-2 also benefits from an 'easy to hard' learning scheme.
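For reference, a minimal sketch of a LoRA-based peft setup for this experiment; the specific rank, alpha, and target modules are our illustrative assumptions, not the reported configuration:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# 3-way NLI classification head on top of Llama-2-7B
base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf", num_labels=3)
config = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16,
                    lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)   # only adapter weights are trainable
model.print_trainable_parameters()
```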
9 Potential Applications
Our work introduces the new C2Gen NLI task as a first step to explore the compositional generalization ability of models performing NLI in a continual learning setup. Similar to existing continual NLP-based tasks (Wang et al., 2019b; Berard, 2021; Madotto et al., 2021; M’hamdi et al., 2023), the continual learning setup inspires models to learn new inference knowledge continuously, to avoid costs for model retraining. Given such capabilities, the C2Gen NLI task setting can benefit future applications that require the understanding and induction of compositional inferences relative to dynamically updated knowledge stores.
We use the widely researched task Personalized Dialogue Agent (PDA) (Zhang et al., 2018) as an example to show how the C2Gen NLI task could apply in a dynamic setting. PDA proposes chit-chat models that are conditioned on information provided in a given personality profile. Figure 4 shows an illustration. Existing approaches suffer from consistency issues when a chit-chat model generates utterances that contradict its personality profile. For example, I dislike Rock'n Roll contradicts I always listen to Elvis songs. To solve this issue, some studies (Welleck et al., 2019; Utama et al., 2022) proposed to use NLI to evaluate and improve consistency. We can achieve this by evaluating whether the persona information entails or contradicts a dialogue utterance. In dialogue, utterances show semantic composition effects when combining primitive information to form new and meaningful sentences. For example, I have a blue iPod composes information from I have an iPod and my favorite color is blue. This scenario aligns with the CGen NLI setup.
But the persona profile of a chit-chat model is dynamic and gets updated over time. For example, Fig. 4 shows persona information that is updated with a fact about a new product, a computer. The new primitive can be composed with previously learned primitives to generate novel compositional facts, e.g., I have a blue computer from I bought a computer and my favorite color is blue. Here, re-training the model to update the profile's information state is expensive and time-consuming. By contrast, enabling the model to perform continual learning is a more viable and economic solution. The model is then expected to evaluate compositional inferences relative to the updated information state, aligning with our new task C2Gen NLI.
10 Conclusions and Future Work
We propose C2Gen, a new challenge task for compositional generalization in NLI, grounded in a continual learning scenario. Our new task targets NLP applications that rely on composing information from continuously updated sources.
By conducting rich analyses for this novel task, on our new benchmark, we show that in continual learning, neural models fail to generalize to unseen compositional inferences due to forgetting. With known continual learning strategies we can combat forgetting, but our analyses show that memorization alone cannot solve the compositional inference challenge. Our in-depth analyses of C2Gen show that the model benefits from learning primitive before compositional inference, and learning easy before hard inference subtasks.
Our findings highlight the importance of observing differences of primitive and compositional inference types, and establishing the relative difficulties of diverse primitive and compositional inference types. With this, we establish recipes that can improve continual learning to approach non-continual learning. New methods can determine optimal learning orders for diverse inference types, while ensuring sufficient diversity in the data stream. Our insights could also benefit other compositional generalization methods, e.g., by ordering demonstrations in in-context learning along principles we established to improve compositional generalization in continual learning.
Acknowledgments
We are grateful to the anonymous reviewers, and action editors Mihai Surdeanu and Katrin Elisabeth Erk for their valuable comments. This work has been supported through a scholarship provided by the Heidelberg Institute for Theoretical Studies gGmbH.
Notes
1. Data and code can be found at https://github.com/Heidelberg-NLP/C2Gen.
3. We restrict ourselves to two primitive components.
4. While we expect that task performance will generally profit from MTL with the decomposed subtasks, our main interest is the ability to analyze the effect of continual learning in more detail.
5. Table 2 shows how the CI NLI value is semantically determined from its constituting NLI primitives.
6. Here, we use {e / n / c} to denote the veridical inference types, instead of {positive / neutral / negative}.
7. We provide the aggregate count of incorrect samples annotated by the two annotators for analysis.
10. We follow Liu et al. (2019) in the selection of potential learning rates, as excessively large or small values can impede convergence in RoBERTa.
11. In this section we select ER-Res as the continual learning strategy for our experiments, given its superior performance (cf. Table 6). The remaining strategies show similar trends.
12. We take veridicality as an example; NLI works analogously.