## Abstract

Humans are able to master a variety of knowledge and skills with ongoing learning. By contrast, dramatic performance degradation is observed when new tasks are added to an existing neural network model. This phenomenon, termed catastrophic forgetting, is one of the major roadblocks that prevent deep neural networks from achieving human-level artificial intelligence. Several research efforts (e.g., lifelong or continual learning algorithms) have proposed to tackle this problem. However, they either suffer from an accumulating drop in performance as the task sequence grows longer, or require storing an excessive number of model parameters for historical memory, or cannot obtain competitive performance on the new tasks. In this letter, we focus on the incremental multitask image classification scenario. Inspired by the learning process of students, who usually decompose complex tasks into easier goals, we propose an adversarial feature alignment method to avoid catastrophic forgetting. In our design, both the low-level visual features and high-level semantic features serve as soft targets and guide the training process in multiple stages, which provide sufficient supervised information of the old tasks and help to reduce forgetting. Due to the knowledge distillation and regularization phenomena, the proposed method gains even better performance than fine-tuning on the new tasks, which makes it stand out from other methods. Extensive experiments in several typical lifelong learning scenarios demonstrate that our method outperforms the state-of-the-art methods in both accuracy on new tasks and performance preservation on old tasks.

## 1  Introduction

Conventional deep neural network (DNN) models customized for specific data usually fail to handle other tasks, even when they share a lot in common. When facing new tasks, these complicated models have to be either retrained or fine-tuned. A typical example of retraining a model for new tasks is joint training, where new layers or other components are added to the existing model every time a new task arrives and then the whole model is trained on all the data sets. However, this solution incurs substantial data storage costs and requires training the model from scratch for every new task, which is inevitably time- and computation-consuming. Fine-tuning is more prevalent in practice since it retrains the model with new task data and then enables the old model to produce new task results. It works well on the latest task but suffers from a decline of performance on previous tasks, which is termed catastrophic forgetting (Goodfellow, Mirza, Xiao, Courville, & Bengio, 2013).

To address this problem, various lifelong learning algorithms (Kirkpatrick et al., 2017; Li & Hoiem, 2017; Rusu et al., 2016) have been proposed during the past few years that aim to preserve performance on previous tasks while adapting to new data. These methods, each with its own weaknesses, try to alleviate the forgetting problem from different perspectives (detailed information is provided in section 2.1): architectural strategies require storing an increasing number of model parameters, while rehearsal ones need to store part of the training samples, both violating the motivation of lifelong learning to some extent; among regularization strategies, parameter regularization methods do not perform well when the tasks have different output domains or start from a small data set, as revealed by our analysis and experiments, and activation regularization methods are reported to suffer an accumulating drop in old tasks' performance as the task sequence grows longer (Aljundi, Chakravarty, & Tuytelaars, 2017).

In this work, we focus on an incremental multitask image classification scenario. Formally, we aim to retrain an existing model and enable it to perform well on all the tasks, added both separately and sequentially, without access to the legacy data or storing an excessive number of model parameters. We find that current activation regularization strategies usually use the classification probabilities of previous models as soft targets and try to preserve old tasks' performance via knowledge distillation (Hinton, Vinyals, & Dean, 2015). However, the classification probabilities cannot provide supervised information as strong as the hard targets (i.e., the labels) of the new tasks. Thus, the models are likely to deviate from the optimal points of previous tasks when adapting to the new data. Meanwhile, we observe that when a student tries to learn something difficult, it is usually useful to break the complex task down into several easier and gradual goals.

Inspired by these insights, we propose a novel activation regularization method to alleviate the forgetting problem. Since the activations (denoting the outputs of neural network layers in this letter) of the old model have integrated the knowledge of previous tasks, they are able to guide the training process in multiple stages by serving as multilevel soft targets. The typical training process with multilevel soft targets is illustrated in Figure 1. Nevertheless, it is challenging to characterize the value of these intermediate activations. For example, the convolutional features are in high dimensions and contain spatial structure information, while the fully connected features are rich in abstract semantic information. To tackle the problem, we introduce a trainable discriminator to align the low-level visual features while applying maximum mean discrepancy (MMD; Gretton et al., 2012) to the high-level semantic features. The overall algorithm is termed adversarial feature alignment (AFA). On one hand, feature alignment provides sufficient supervised information of the old tasks during training and thus helps to alleviate the forgetting problem. On the other hand, it distills the knowledge of the old models to the new one and acts as a regularizer for the new tasks, which enables the model to gain even better performance than fine-tuning on the new tasks.

Figure 1:

Lifelong learning with multilevel soft targets. They provide sufficient supervised information of the old tasks and act as regularizers for the new tasks.


Our contributions are as follows:

• We propose a novel activation regularization lifelong learning method to alleviate the forgetting problem via adversarial feature alignment, which not only preserves the performance on old tasks but also achieves better performance on new tasks.

• We propose to align convolutional attention maps with a trainable discriminator and high-level semantic features with MMD. They guide the training process in multiple stages and help to reduce forgetting, while guaranteeing better performance on the new tasks due to knowledge distillation and regularization.

• Extensive experiments in several typical lifelong learning scenarios demonstrate that our method outperforms state-of-the-art methods in accuracies on the new tasks and performance preservation on old tasks.

## 2  Related Work

Lifelong learning, sometimes called continual learning or incremental learning, has been attracting increasing research interest. However, there is no standard terminology to identify various strategies or experimental protocols. One method that performs well in some experimental settings may totally fail in others (Kemker, McClure, Abitino, Hayes, & Kanan, 2018). For a fairer and more structured comparison, we combine the categorization of lifelong learning strategies in Maltoni and Lomonaco (2018) and the distinct experimental scenarios in van de Ven and Tolias (2018), and then illustrate the representative lifelong learning methods as in Figure 2.

Figure 2:

Strategies and scenarios of the representative lifelong learning methods: PNN (Rusu et al., 2016), ExpertGate (Aljundi et al., 2017), DAN (Rosenfeld & Tsotsos, 2017), CWR (Lomonaco & Maltoni, 2017), GEM (Lopez-Paz & Ranzato, 2017), ICARL (Rebuffi, Kolesnikov, Sperl, & Lampert, 2017), PDG (Hou, Pan, Change Loy, Wang, & Lin, 2018), DGR (Shin, Lee, Kim, & Kim, 2017), RtF (van de Ven & Tolias, 2018), LwF (Li & Hoiem, 2017), EBLL (Rannen Ep Triki, Aljundi, Blaschko, & Tuytelaars, 2017), EWC (Kirkpatrick et al., 2017), SI (Zenke, Poole, & Ganguli, 2017), MAS (Aljundi, Babiloni, Elhoseiny, Rohrbach, & Tuytelaars, 2018), and AFA, hereby proposed. (See the online figure in color.)


### 2.1  Lifelong Learning Strategies

#### 2.1.1  Architectural Strategies

Architectural strategies train separated models for different tasks, and usually a selector is introduced to determine which model to launch during inference. The progressive neural network (PNN; Rusu et al., 2016) is one of the first architectural strategies in which a layer is connected to both the previous layer of the current model and the layers of old models, allowing information to flow horizontally. Expert Gate (Aljundi et al., 2017) decides which expert model to launch during inference and which training strategy to apply to the new incoming task with autoencoders. Other architectural strategies are Incremental Learning through Deep Adaptation (DAN; Rosenfeld & Tsotsos, 2017) and Copy Weight with Re-init (CWR; Lomonaco & Maltoni, 2017).

Architectural methods enjoy the advantage of preserving the performance on old tasks because adding new tasks to the system does not harm the previously learned models. However, an increasing amount of extra space is needed to store parameters for each task, which conflicts with the original intentions of lifelong learning.

#### 2.1.2  Rehearsal Strategies

Rehearsal strategies replay past information periodically while adapting to the new data to avoid forgetting. Gradient episodic memory (GEM; Lopez-Paz & Ranzato, 2017) uses a fixed memory to store a subset of old patterns. Incremental classifier and representation learning (ICARL; Rebuffi et al., 2017) and progressive distillation and retrospection (PDG; Hou et al., 2018) include an external memory to store a subset of old task data. They also employ a distillation step and thus overlap with the regularization strategies.

Inspired by the recent success of generative adversarial networks (GAN; Goodfellow et al., 2014), deep generative replay (DGR; Shin et al., 2017) and replay through feedback (RtF; van de Ven & Tolias, 2018) propose training generative models to generate samples that follow previous task distributions. However, training generative models for complex images is itself still an open problem. Current generative replay strategies are limited to simple experimental settings such as MNIST (LeCun, Bottou, Bengio, & Haffner, 1998) and its variations (Goodfellow et al., 2013).

#### 2.1.3  Regularization Strategies

Regularization strategies extend the loss function with additional terms to preserve the performance on previous tasks. They can be further divided into two categories: activation regularization strategies and parameter regularization ones.

Activation regularization strategies are usually based on knowledge distillation and use additional loss terms between activations of models to maintain the performance on old tasks. Learning without forgetting (LwF; Li & Hoiem, 2017) proposes using outputs of the old models as soft targets of old tasks. These soft targets are considered a substitute for the data of previous tasks, which cannot be accessed in lifelong learning settings. Encoder-based lifelong learning (EBLL; Rannen Ep Triki et al., 2017) prevents reconstruction of convolutional features from changing with autoencoders, which has the effect of preserving the knowledge of previous tasks.

Parameter regularization strategies focus more on the model itself. They try to figure out the importance of parameters for old tasks and apply a penalty to the change of essential parameters during training on new tasks. Elastic weight consolidation (EWC; Kirkpatrick et al., 2017) estimates the importance of parameters by the diagonal of the Fisher information matrix and uses an individual penalty for each previous task. Synaptic intelligence (SI; Zenke et al., 2017) estimates the importance weights in an online manner by the parameter-specific contribution to the changes in the total loss. Memory aware synapses (MAS; Aljundi et al., 2018) is similar to SI but estimates the importance of weights by the changes of the model outputs. Parameter regularization strategies pay more attention to preserving the knowledge of old tasks but prevent the model from achieving competitive performance on the new tasks.
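As a concrete illustration of the quadratic penalty these methods add, here is a minimal numpy sketch of an EWC-style term; the function name, the flat-parameter representation, and the strength `lam` are our own placeholders, not the original implementation:

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher_diag, lam=1.0):
    """EWC-style penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    theta       : current parameters (flat array)
    theta_star  : parameters after training on the previous task
    fisher_diag : diagonal Fisher estimate, one importance weight per parameter
    lam         : penalty strength (a hypothetical hyperparameter here)
    """
    diff = theta - theta_star
    return 0.5 * lam * np.sum(fisher_diag * diff ** 2)
```

Parameters with large Fisher values are pinned near their old-task optimum, while unimportant ones remain free to adapt to the new task.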

### 2.2  Lifelong Learning Scenarios

Benchmarking lifelong learning lacks a universal standard even if we focus on supervised image classification and leave out reinforcement learning, as various experimental protocols have been used in previous research. For a fairer comparison, we follow the distinct scenarios for lifelong learning proposed in van de Ven and Tolias (2018) and summarize the experimental settings of the above strategies.

#### 2.2.1  Incremental Task Scenario

In the incremental task scenario, the tasks are similar but have different output domains; for example, all the tasks belong to image classification, but one cares about fine-grained flowers while another may focus on fine-grained birds. Architectural strategies (Aljundi et al., 2017; Rosenfeld & Tsotsos, 2017; Rusu et al., 2016) can easily handle this scenario by training a specific model for each task, while some other methods (Aljundi et al., 2018; Hou et al., 2018; Li & Hoiem, 2017; Rannen Ep Triki et al., 2017) usually introduce a multihead output layer with a separate head for each task.

#### 2.2.2  Incremental Domain Scenario

In the incremental domain scenario, the tasks usually have the same output domain but follow different data distributions. Typical examples of such protocol include MNIST (LeCun et al., 1998) and SVHN (Netzer et al., 2011), both being digit classifications but collected in different manners, and permuted MNIST (Goodfellow et al., 2013), in which different permutations are applied to the pixels and each permutation corresponds to a unique task. GEM (Lopez-Paz & Ranzato, 2017) and some parameter regularization strategies (Kirkpatrick et al., 2017; Zenke et al., 2017) mainly evaluate their algorithms in such a scenario.

#### 2.2.3  Incremental Class Scenario

The last one is the incremental class scenario, where the model is required to learn to recognize new classes continually. An example of such protocol is learning to classify MNIST (split MNIST) or CIFAR10 (Krizhevsky & Hinton, 2009) (split CIFAR10) class by class. These two data sets are widely adopted in CWR (Lomonaco & Maltoni, 2017), ICARL (Rebuffi et al., 2017), GEM (Lopez-Paz & Ranzato, 2017), and SI (Zenke et al., 2017).

Generative replay strategies (Shin et al., 2017; van de Ven & Tolias, 2018) conduct experiments in all the above scenarios. However, due to the complexity of generative models, they have not shown an ability to apply to more complicated data sets other than MNIST.

In this letter, we do not consider lifelong learning methods that require storing model parameters (architectural strategies) or training samples (rehearsal strategies). Concretely, we compare our method with state-of-the-art regularization strategies, both activation and parameter based, in the incremental task scenario.

## 3  Background

### 3.1  Problem Definition and Notations

We first briefly introduce the lifelong learning setup and notations under the incremental task scenario, as used in Aljundi et al. (2018, 2017), Hou et al. (2018), Li and Hoiem (2017), and Rannen Ep Triki et al. (2017).

Formally, let $X_t=\{x_i^t\}_{i=1}^{N_t}$ and $Y_t=\{y_i^t\}_{i=1}^{N_t}$ be the inputs and corresponding labels from task $t$, with $N_t$ denoting the number of examples. Lifelong learning algorithms focus on how to transfer an existing model to $\{X_\tau, Y_\tau\}$ when task $\tau$ arrives, without access to samples of the previous tasks $\{X_t, Y_t\}_{t=1}^{\tau-1}$. In other words, they try to train a model that performs well on a sequence of tasks arriving separately and sequentially, with only the most recent data. In this work, we focus on supervised image classification, and the model is usually a convolutional neural network (CNN). Following the notations in Rannen Ep Triki et al. (2017), the model for task $t$ is denoted as $f_t$ and can be decomposed as $C_t \circ C \circ F$, where:

• $F$ is the feature extractor (e.g., the convolutional layers in CNNs).

• $C$ is the shared classifier (e.g., the fully connected layers in CNNs except the last one).

• $Ct$ is the task-specific classifier for task $t$ (e.g., the last fully connected layer in CNNs).
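The decomposition $f_t = C_t \circ C \circ F$ can be sketched as follows; the toy dense layers and all dimensions below are made up for illustration and stand in for the real convolutional and fully connected blocks:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(in_dim, out_dim):
    """A toy ReLU dense layer standing in for a block of the real network."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: np.maximum(x @ W, 0.0)

# f_t = C_t o C o F: a shared feature extractor, a shared classifier,
# and one task-specific head per task (dimensions are hypothetical).
F = layer(32, 16)                          # feature extractor (conv layers)
C = layer(16, 8)                           # shared classifier (fc6, fc7)
heads = {1: layer(8, 5), 2: layer(8, 3)}   # C_1, C_2: per-task output layers

def f(t, x):
    """Full model for task t."""
    return heads[t](C(F(x)))
```

Only the head $C_t$ differs across tasks; $F$ and $C$ are shared, which is exactly where forgetting occurs when they are retrained on new data.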

### 3.2  Formulation of Existing Methods

If all the data of the previous tasks were available, we could train a model to handle multiple tasks with joint training by minimizing the following empirical risk:
$\sum_{t=1}^{\tau}\frac{1}{N_t}\sum_{i=1}^{N_t}\ell_{ce}\big(f_t(x_i^t),\,y_i^t\big)=\sum_{t=1}^{\tau}\mathbb{E}\big[\ell_{ce}(f_t(X_t),Y_t)\big].$
(3.1)
Considering the decomposition of $f_t$, equation 3.1 can be written as
$\mathcal{L}_{joint}=\sum_{t=1}^{\tau}\mathbb{E}\big[\ell_{ce}(C_t\circ C\circ F(X_t),\,Y_t)\big]$
(3.2)
$=\mathcal{L}_{cls}+\mathcal{L}_{old},$
(3.3)
$\mathcal{L}_{cls}=\mathbb{E}\big[\ell_{ce}(C_\tau\circ C\circ F(X_\tau),\,Y_\tau)\big],$
(3.4)
$\mathcal{L}_{old}=\sum_{t=1}^{\tau-1}\mathbb{E}\big[\ell_{ce}(C_t\circ C\circ F(X_t),\,Y_t)\big],$
(3.5)
where $\ell_{ce}$ is the standard cross-entropy loss.
Joint training achieves (in theory) the best performance for all the tasks, as the network is trained with the data from all the tasks simultaneously; however, most of these data are no longer accessible in the lifelong learning setup. To tackle the lack of previous data, LwF (Li & Hoiem, 2017) suggests that we first record the response of $C_t^*\circ C^*\circ F^*$ (i.e., the classification probabilities) on $X_\tau$ and replace $\mathcal{L}_{old}$ with
$\mathcal{L}_{dist}=\sum_{t=1}^{\tau-1}\mathbb{E}\big[\ell_{dist}(C_t\circ C\circ F(X_\tau),\,C_t^*\circ C^*\circ F^*(X_\tau))\big],$
(3.6)
where $C_t^*\circ C^*\circ F^*$ denotes the model parameters optimal for the old tasks and $C_t\circ C\circ F$ those to be learned for the new task. $\ell_{dist}$ denotes the knowledge distillation loss (KD loss) proposed in Hinton et al. (2015).
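A minimal numpy sketch of a KD loss usable as $\ell_{dist}$, following Hinton et al. (2015): both models' logits are softened by a temperature before the cross-entropy is taken. The temperature value and function names are our assumptions:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax, computed stably."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(new_logits, old_logits, T=2.0):
    """Cross-entropy between the temperature-softened outputs of the frozen
    old model (soft targets) and the new model being trained."""
    p_old = softmax(old_logits, T)
    p_new = softmax(new_logits, T)
    return -np.mean(np.sum(p_old * np.log(p_new + 1e-12), axis=-1))
```

Raising the temperature flattens the soft targets, exposing the relative similarities between classes that the old model has learned.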

However, LwF is reported to suffer an accumulating drop in performance as the task sequence grows longer (Aljundi et al., 2017). EBLL (Rannen Ep Triki et al., 2017) slightly alleviates the forgetting problem by using autoencoders to preserve the necessary information of old tasks in a lower-dimensional manifold. Nevertheless, its improvement over LwF is mainly on the old tasks, while its performance on the new task is even inferior (Hou et al., 2018).

## 4  Method

Classification probabilities of the old model cannot provide supervised information as strong and accurate as the labels, which accounts for the performance drop in LwF and EBLL. From another point of view, the old model $f^*$ has learned the knowledge of previous tasks. Thus, the intermediate activations (or features) of $f^*$ can be treated as soft targets to guide the training process. Inspired by these insights, we propose a novel activation regularization–based lifelong learning method with adversarial feature alignment (AFA).

Our framework is illustrated in Figure 3, with a two-stream model denoting the old and new networks, respectively. Besides the cross-entropy loss of the new task and the constraint between classification probabilities of the previous tasks, we employ the low-level visual features and high-level semantic features as soft targets, aiming to provide sufficient supervised information of the old task through multilevel feature alignment.

Figure 3:

The architecture of the proposed method. Feature alignment penalty is introduced in addition to the cross-entropy loss against labels and the distillation loss between the label probabilities of the old and new models. The convolutional feature maps generated by different models with the same data are aligned through adversarial attention alignment (see section 4.1), and the high-level semantic features across networks are aligned by MMD (see section 4.2). The above two constraints provide supervised information of the old task while acting as regularizers for the new tasks. (See the online figure in color.)


### 4.1  Adversarial Attention Alignment

It is challenging to use convolutional visual features as soft targets, as these low-level features are high-dimensional and hard to characterize. For example, the $conv5$ activation of AlexNet (Krizhevsky, Sutskever, & Hinton, 2012) has dimension 9216 ($6\times6\times256$) when flattened into a vector. Directly feeding this vector into a neural network model would introduce a vast number of parameters, while using statistical moments (e.g., the L2 norm) as constraints would lose the 2D structural information in the feature maps (refer to section 5.8.3 for more detailed discussion). Here, we solve the problem with the help of an activation-based visual attention mechanism, defined as a function of spatial maps with regard to the convolutional layers, as in Zagoruyko and Komodakis (2017). The attention mechanism puts more weight on the most discriminative parts and makes it easier to capture the character of the visual feature maps.

Concretely, let us consider a convolutional layer and its corresponding activation tensor $A\in\mathbb{R}^{C\times H\times W}$, which consists of $C$ channels with spatial dimensions $H\times W$. An activation-based mapping function $F_{att}$ takes the above 3D tensor $A$ as input and outputs a spatial attention map, a 2D tensor defined over the spatial dimensions:
$F_{att}:\mathbb{R}^{C\times H\times W}\to\mathbb{R}^{H\times W}.$
(4.1)
An implicit assumption in defining such a spatial attention map function is that the absolute value of a hidden neuron activation can be used as an indication of the importance of that neuron with regard to the specific input. Specifically, we consider the following activation-based mapping function,
$F_{att}(A)=\sum_{ch=1}^{C}|A_{ch}|^2,\quad A_{ch}\in\mathbb{R}^{H\times W},$
(4.2)
where $A_{ch}$ is the $ch$th feature map of activation tensor $A$. The operations in equation 4.2 are element-wise.
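Equation 4.2 can be implemented directly; the following numpy sketch assumes the activation tensor is laid out channel-first as $(C, H, W)$:

```python
import numpy as np

def attention_map(A):
    """Activation-based attention map (eq. 4.2): the element-wise squared
    absolute values, summed over the channel axis.

    A has shape (C, H, W); the result has shape (H, W)."""
    return np.sum(np.abs(A) ** 2, axis=0)
```

Squaring before summing emphasizes the spatial locations where at least some channels respond strongly, which is what makes the map "attention-like."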
Based on the above attention mapping, we further propose to apply the adversarial alignment penalty to the attention maps of visual features, which guides the new model to integrate the knowledge from the old model. A discriminator (termed the $D$ network) is introduced to play a GAN-like minimax game with the feature extractors of the old and new models, $F^*$ and $F$. Formally, the feature extractors take the training data as inputs and compute convolutional feature maps, which are further encoded by the attention-mapping function $F_{att}$ into latent representations $Z=\{z\}$, where
$z=F_{att}\circ F(x),\quad x\in X_\tau.$
(4.3)
The $D$ network tries to distinguish whether the latent representations come from the old or the new network. Thus, $D$ is optimized by the standard supervised loss in GANs, defined as
$\mathcal{L}_{adv}^{D}=\max_{D}\;\mathbb{E}_{z^*\sim Z^*}[\log D(z^*)]+\mathbb{E}_{z\sim Z}[\log(1-D(z))],$
(4.4)
where $Z^*$ and $Z$ are latent representations from the old and new feature extractors, respectively.
Then the feature extractor $F$ is updated by playing a minimax game with the discriminator $D$. Rather than directly adopting the gradient reversal layer (Ganin et al., 2016), which corresponds to the true minimax objective, modern GANs are usually trained with the inverted-label loss, defined as
$\mathcal{L}_{adv}^{F}=\min_{F}\;-\mathbb{E}_{z\sim Z}[\log D(z)].$
(4.5)
This objective has the same convergence point as the minimax loss but provides stronger gradients and thus eases the training process.
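The two adversarial objectives can be sketched with a toy logistic discriminator; the discriminator form and variable names below are our simplifications of equations 4.4 and 4.5, written as losses to be minimized:

```python
import numpy as np

def d_out(z, w):
    """Toy logistic discriminator: probability that z comes from the OLD model."""
    return 1.0 / (1.0 + np.exp(-(z @ w)))

def discriminator_loss(z_old, z_new, w):
    """Eq. 4.4, negated for minimization: D should output 1 on old-model
    attention maps and 0 on new-model ones."""
    eps = 1e-12
    return -(np.mean(np.log(d_out(z_old, w) + eps))
             + np.mean(np.log(1.0 - d_out(z_new, w) + eps)))

def feature_extractor_loss(z_new, w):
    """Eq. 4.5, the inverted-label loss: the new features try to look 'old'."""
    eps = 1e-12
    return -np.mean(np.log(d_out(z_new, w) + eps))
```

The feature-extractor loss falls as the new attention maps become indistinguishable from the old ones, which is exactly the alignment the method seeks.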

#### 4.1.1  Discussion

In our method, the discriminator $D$ plays the minimax game with the feature extractor $F$ instead of with a generator, as in standard GANs. Similar modules are commonly adopted in adversarial domain adaptation studies (Ganin et al., 2016; Kang, Zheng, Yan, & Yang, 2018; Tzeng, Hoffman, Saenko, & Darrell, 2017). The key difference between those methods and ours lies in the data flow: we use a single data flow, freeze the old model $f^*$, and align features generated by different models from the same data to distill the knowledge of the old network into the new one, instead of pairing features generated from different data distributions as in domain adaptation works.

### 4.2  High-Level Feature Alignment with MMD

In deep CNNs, the features generated by fully connected layers are high-level semantic representations, which contain massive information about tasks and labels. Aligning these high-level, task-specific features will force the new model to integrate the knowledge about the old tasks learned by the old network. However, employing a discriminator here and playing the minimax game with $C∘F$ does not work because these modules have already been customized for specific tasks and thus cannot adapt immediately to confuse the $D$ network (refer to section 5.8.4 for more detailed discussion).

Previous studies on measuring the discrepancy between high-level features usually take advantage of MMD (Gretton et al., 2012). Concretely, given two data distributions $P$ and $Q$, MMD is expressed as the distance between their means after mapping to a reproducing kernel Hilbert space (RKHS),
$\mathrm{MMD}^2(P,Q)=\big\|\mathbb{E}_{p\sim P}[\varphi(p)]-\mathbb{E}_{q\sim Q}[\varphi(q)]\big\|^2,$
(4.6)
where $\varphi(\cdot)$ denotes the mapping to the RKHS.
In practice, this mapping is unknown. Expanding equation 4.6 and using the kernel trick to replace the inner products, we have an unbiased estimator of MMD:
$\mathcal{L}_{mmd}(P,Q)=\mathbb{E}_{p,p'\sim P}[k(p,p')]+\mathbb{E}_{q,q'\sim Q}[k(q,q')]-2\,\mathbb{E}_{p\sim P,\,q\sim Q}[k(p,q)],$
(4.7)
where $k(p,q)=\langle\varphi(p),\varphi(q)\rangle$ is the desired kernel function and $p'$, $q'$ are independent samples from $P$ and $Q$. In this work, we use a standard radial basis function (RBF) kernel with multiple widths (Gretton et al., 2012).
Formally, let $H=\{h\}$ be the fully connected features generated by the feature extractor $F$ and the shared classifier $C$:
$h=C\circ F(x),\quad x\in X_\tau.$
(4.8)
We align the high-level semantic features from the old and new models (termed $H^*$ and $H$, respectively) by minimizing the following loss function:
$\mathcal{L}_{fc}=\mathcal{L}_{mmd}(H,H^*).$
(4.9)
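A numpy sketch of the multi-width RBF-kernel MMD estimate used for equations 4.7 and 4.9; for brevity this is the simple V-statistic form that includes the diagonal terms, and the bandwidth values are placeholders of our own, as they are not stated here:

```python
import numpy as np

def rbf_kernel(x, y, widths=(1.0, 2.0, 4.0)):
    """Sum of RBF kernels with multiple bandwidths (widths are hypothetical)."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return sum(np.exp(-d2 / (2.0 * s ** 2)) for s in widths)

def mmd2(P, Q, widths=(1.0, 2.0, 4.0)):
    """Squared-MMD estimate between two batches of fully connected features,
    e.g. H (new model) and H* (frozen old model) on the same inputs."""
    return (rbf_kernel(P, P, widths).mean()
            + rbf_kernel(Q, Q, widths).mean()
            - 2.0 * rbf_kernel(P, Q, widths).mean())
```

The estimate is zero when the two feature batches coincide and grows as the new model's semantic features drift away from the old model's.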

#### 4.2.1  Discussion

In previous studies, MMD has usually been used for facilitating the network to generate domain-invariant features for data from the same class but different domains. In our method, the fully connected features are generated by different networks but with the same data. They contain rich task information and can be treated as soft targets for the new model. When these high-level semantic features are aligned, the knowledge of previous tasks is transferred across networks.

### 4.3  Overall Algorithm

The backbone network is trained by minimizing the following loss function that consists of four parts of constraints:
$\mathcal{L}=\mathcal{L}_{cls}+\lambda_1\mathcal{L}_{dist}+\lambda_2\mathcal{L}_{adv}^{F}+\lambda_3\mathcal{L}_{fc},$
(4.10)
where $\mathcal{L}_{cls}$, $\mathcal{L}_{dist}$, $\mathcal{L}_{adv}^{F}$, and $\mathcal{L}_{fc}$ are defined in equations 3.4, 3.6, 4.5, and 4.9, respectively. $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyperparameters (for discussion, see section 5.8.5).

The key idea of our method lies in employing the intermediate activations of the old model, which contain rich knowledge of previous tasks, as soft targets to guide the training process in multiple stages when fitting the new data. Aligning the multilevel features provides sufficient supervised information of the old tasks and helps to alleviate the forgetting problem. Further experiments demonstrate that the feature alignment strategy enables the model to gain even better performance than fine-tuning on the new tasks due to the knowledge distillation and regularization phenomena.

## 5  Experiments

We compare our method with the state-of-the-art regularization-based lifelong learning methods and several baselines in the incremental task scenario. We consider situations containing two tasks (starting from large and small data sets separately) and five tasks.

### 5.1  Architecture

The network architecture is based on AlexNet (Krizhevsky et al., 2012), a representative CNN widely used in transfer learning research (Aljundi et al., 2018; Li & Hoiem, 2017; Long, Cao, Wang, & Jordan, 2015; Rannen Ep Triki et al., 2017; Yosinski, Clune, Bengio, & Lipson, 2014); transferability findings obtained with it extend easily to other network architectures. Concretely, the shared feature extractor $F$ corresponds to the convolutional layers $conv1\sim conv5$. The shared classifier $C$ comprises all the fully connected layers except the last one, that is, $fc6$ and $fc7$, while the task-specific classifier is the last fully connected layer, $fc8$. The network architecture of our method is illustrated in Figure 3.

### 5.2  Data Sets

We use the following popular data sets in the incremental task scenario (Aljundi et al., 2018, 2017; Li & Hoiem, 2017; Rannen Ep Triki et al., 2017):

• MIT Scenes (Quattoni & Torralba, 2009): Images for indoor scenes classification, with 5360 training samples and 1340 test samples

• Caltech-UCSD Birds (Welinder et al., 2010): Images for fine-grained bird classification, with 5994 training samples and 5794 test samples

• Oxford Flowers (Nilsback & Zisserman, 2008): Images for fine-grained flower classification, with 2040 training samples and 6149 test samples

• Stanford Cars (Krause, Stark, Deng, & Fei-Fei, 2013): Images for car classification, with 8144 training samples and 8041 testing samples

• FGVC-Aircraft (Maji, Kannala, Rahtu, Blaschko, & Vedaldi, 2013): Images for aircraft manufacturer classification, with 6667 training samples and 3333 testing samples

• ImageNet (ILSVRC 2012 subset; Deng et al., 2009): The validation set of ILSVRC 2012 is used for testing the preservation of performance on old tasks

The results reported in this letter are obtained on the test sets of Scenes, Birds, Flowers, Cars, and Aircraft and on the validation set of ImageNet. We fix the random seed to make the results reproducible.

### 5.3  Compared Methods

The proposed AFA is compared with the following methods:

• Joint training: The data of all the tasks are used during training, which is considered an upper bound of performance preservation on old tasks.

• Finetuning: Copy the feature extractor ($F$) and shared classifier ($C$) of the old model, randomly initialize the task-specific classifier ($Cτ$), and then train the network on the new task $τ$.

• LwF (Li & Hoiem, 2017): It introduces a KD loss term between the label probabilities of the old and new models, computed on the new data, to preserve the knowledge of previous tasks.

• EBLL (Rannen Ep Triki et al., 2017): This work builds on LwF and prevents the reconstruction of convolutional features from changing with autoencoders to reduce forgetting.

• EWC (Kirkpatrick et al., 2017): It estimates the importance of parameters by the diagonal of the Fisher information matrix and applies a penalty to the change of important parameters during training on the new task, with an individual penalty for each previous task.

• SI (Zenke et al., 2017): It is similar to EWC but estimates the importance weights in an online manner by the parameter-specific contribution to the changes in the total loss.

• MAS (Aljundi et al., 2018): It is similar to SI but estimates the importance weights by the changes of the model outputs.

We do not consider lifelong learning methods that require storing model parameters (i.e., architectural strategies) or training samples (i.e., rehearsal strategies).

Remark.

The original versions of parameter regularization strategies (e.g., EWC and SI) require the network to maintain the same structure throughout all tasks, as the constraint is explicitly given to each parameter. To make these methods compatible with the incremental task scenario, where tasks have different output domains, we apply the parameter constraints to the feature extractor ($F$) and the shared classifier ($C$), excluding the task-specific classifiers ($C_1, \dots, C_\tau$). The results are reported after convergence on the new tasks.
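To make this remark concrete, the sketch below (NumPy, with hypothetical parameter names) shows how such a quadratic penalty can be restricted to the shared modules, leaving the task-specific heads unconstrained. The importance weights here are placeholders, not the actual Fisher or path-integral estimates used by EWC and SI.

```python
import numpy as np

def shared_param_penalty(params, old_params, importance, shared_keys):
    """EWC-style penalty sum_i w_i * (theta_i - theta_old_i)^2,
    applied only to shared parameters (feature extractor F and
    shared classifier C), skipping task-specific heads."""
    penalty = 0.0
    for key in shared_keys:  # e.g., "F.conv1", "C.fc" (hypothetical names)
        diff = params[key] - old_params[key]
        penalty += np.sum(importance[key] * diff ** 2)
    return penalty

# Toy example: one shared weight vector and one task-specific head.
params     = {"F.conv1": np.array([1.0, 2.0]), "C3.head": np.array([5.0])}
old_params = {"F.conv1": np.array([1.0, 0.0]), "C3.head": np.array([0.0])}
importance = {"F.conv1": np.array([0.5, 0.5])}

# Only F.conv1 contributes: 0.5*(0)^2 + 0.5*(2)^2 = 2.0;
# the head C3.head changes freely with no penalty.
loss = shared_param_penalty(params, old_params, importance, ["F.conv1"])
```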

### 5.4  Implementation Details

During training, we use the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 (Sutskever, Martens, Dahl, & Hinton, 2013) and dropout enabled in the fully connected layers. Data normalization is applied within each task. We augment the training data with random resized cropping and random horizontal flipping but without color jittering.
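As a minimal illustration of per-task normalization, the following NumPy sketch standardizes a batch with channel statistics computed on that task's own data (the batch shape and values are illustrative):

```python
import numpy as np

def normalize_per_task(images, eps=1e-8):
    """Standardize a batch of images (N, H, W, C) with the
    per-channel mean/std computed on this task's own training data."""
    mean = images.mean(axis=(0, 1, 2))       # per-channel mean
    std = images.std(axis=(0, 1, 2)) + eps   # per-channel std
    return (images - mean) / std, mean, std

rng = np.random.default_rng(0)
batch = rng.uniform(0.0, 1.0, size=(8, 4, 4, 3))  # toy "images"
normed, mean, std = normalize_per_task(batch)
# After normalization, each channel has (near) zero mean and unit variance.
```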

We adopt a three-layer perceptron as the discriminator, with a hidden layer of 500 units. Additionally, we find that the number of hidden units is not a sensitive hyperparameter: choices between 100 and 900 produce very similar results.
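A discriminator of this kind can be sketched as follows. This is a framework-agnostic NumPy forward pass; the initialization scheme and the binary-output convention are our assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def init_discriminator(in_dim, hidden=500, seed=0):
    """Three-layer perceptron (input -> 500 hidden units -> 1 output)
    used to tell old-model features from new-model features."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, np.sqrt(2.0 / in_dim), size=(in_dim, hidden))
    W2 = rng.normal(0.0, np.sqrt(2.0 / hidden), size=(hidden, 1))
    return {"W1": W1, "b1": np.zeros(hidden), "W2": W2, "b2": np.zeros(1)}

def discriminator_prob(params, x):
    """Forward pass: ReLU hidden layer, sigmoid output giving the
    probability that a feature comes from the old model."""
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])
    logit = h @ params["W2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-logit))

params = init_discriminator(in_dim=256)   # feature dimension is illustrative
feat = np.ones((4, 256))                  # a toy batch of features
p = discriminator_prob(params, feat)      # shape (4, 1), values in (0, 1)
```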

Before training, we randomly initialize $C_\tau$ using Xavier initialization (Glorot & Bengio, 2010) with a scaling factor of 0.25. Then we freeze $F$ and $C$ and train $C_\tau$ for some epochs, a stage termed the warm-up step in LwF. In our experiments, the warm-up step lasts for 70 epochs, after which the whole network is trained until convergence. For a fair comparison, the warm-up model is used as the starting point for all the compared methods. When convergence on the validation set is observed, we reduce the learning rate by a factor of 10 and keep training for an extra 20 epochs.
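The scaled Xavier initialization for the new task-specific classifier can be sketched in NumPy as follows (the layer dimensions are illustrative, e.g., AlexNet fc7 features feeding 100 classes):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, scale=0.25, seed=0):
    """Xavier/Glorot uniform initialization, shrunk by an extra
    scaling factor (0.25 here) as used for the new task head:
    samples are uniform in [-limit, limit] with
    limit = scale * sqrt(6 / (fan_in + fan_out))."""
    rng = np.random.default_rng(seed)
    limit = scale * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(4096, 100)  # hypothetical fc7 -> 100-class head
```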

### 5.5  Two-Task Scenario Starting from ImageNet

In the two-task scenario, we are given a network trained on a previous task, and then a single new task arrives to be learned. First, we follow the experimental setup in LwF and EBLL, in which all the experiments start from an AlexNet model pretrained on ImageNet. This scenario contains three individual experiments: ImageNet $→$ Scenes, ImageNet $→$ Birds, and ImageNet $→$ Flowers. The performance of our method and the compared ones in the two-task scenario is reported in Table 1. SI (Zenke et al., 2017) is not included in this scenario because it requires training from scratch on ImageNet to obtain the importance weights.

Table 1:
Classification Accuracy in the Two-Task Scenario Starting from ImageNet.

| Method | ImageNet $→$ Scenes (old / new) | ImageNet $→$ Birds (old / new) | ImageNet $→$ Flowers (old / new) | Average (old / new) |
| --- | --- | --- | --- | --- |
| Joint Training$a$ | 55.11 (ref) / 62.93 (+0.14) | 54.93 (ref) / 56.88 (-0.29) | 56.26 (ref) / 85.09 (-0.21) | (ref) / (-0.12) |
| Finetuning$b$ | 51.28 (-3.83) / 62.79 (ref) | 42.94 (-11.99) / 57.17 (ref) | 44.46 (-11.80) / 85.30 (ref) | (-9.21) / (ref) |
| LwF | 53.62 (-1.49) / 63.51 (+0.72) | 53.41 (-1.52) / 57.42 (+0.25) | 54.64 (-1.62) / 85.15 (-0.15) | (-1.54) / (+0.27) |
| EBLL | 54.33 (-0.78) / 63.29 (+0.50) | 54.17 (-0.76) / 56.78 (-0.39) | 55.37 (-0.89) / 84.36 (-0.94) | (-0.81) / (-0.28) |
| EWC | 54.25 (-0.86) / 61.34 (-1.45) | 52.16 (-2.77) / 54.57 (-2.60) | 53.81 (-2.45) / 84.58 (-0.72) | (-2.03) / (-1.59) |
| MAS | 53.98 (-1.13) / 61.87 (-0.92) | 53.06 (-1.87) / 54.26 (-2.91) | 54.90 (-1.36) / 83.66 (-1.64) | (-1.46) / (-1.83) |
| AFA | 54.71 (-0.40) / 63.88 (+1.09) | 54.43 (-0.50) / 57.84 (+0.66) | 55.21 (-1.05) / 86.03 (+0.73) | (-0.65) / (+0.83) |

Note: The bold numbers indicate the highest accuracy on the corresponding task.

$a$For the old task, the reference performance is given by Joint Training.

$b$For the new task, Finetuning is considered the reference.

As Table 1 shows, the reference performance on the old task is given by Joint Training, which assumes that the data of previous tasks are available. Without any constraints, Finetuning favors the new task and disregards the performance on the old task as training proceeds. The performance on ImageNet drops dramatically, especially in ImageNet $→$ Birds and ImageNet $→$ Flowers. That is exactly the catastrophic forgetting problem we are trying to overcome. Through activation or parameter regularization, LwF, EBLL, EWC, and MAS alleviate the forgetting problem to some extent. Our method, AFA, suffers the smallest performance drop on the old task in two of the three experiments. The visual and semantic features generated by the old model contain rich knowledge of the old task, which is integrated into the new model through the proposed feature alignment strategy.

Considering the performance on the new task, AFA reaches the best result among the compared methods, better even than Joint Training, which has access to the most data, and Finetuning, a commonly used transfer learning routine. Since the new tasks in our experiments are small data sets compared to ImageNet, Finetuning usually overfits the training sets and cannot reach high accuracy on the test sets. That AFA and LwF outperform Joint Training is unexpected but does make sense. The critical point is that the quantities of data from the old and new tasks are extremely unbalanced; hence, the network favors the data from ImageNet and neglects those of Scenes/Birds/Flowers during Joint Training. In contrast, AFA and LwF use knowledge from both the old and new tasks and avoid the unbalanced data distribution problem, which accounts for the improved accuracy on the new task. EBLL, EWC, and MAS focus more on preserving the performance on the old task, while their accuracy on the new task is inferior. According to Kirkpatrick et al. (2017), the layers closer to the output are indeed being reused in EWC. However, when the tasks have different output domains, constraining the parameters near the output layer prevents the model from achieving competitive performance on the new task.

### 5.6  Two-Task Scenario Starting from Flowers

When we start from a smaller data set, Flowers here, different trends can be observed. The experimental results for Flowers $→$ Scenes and Flowers $→$ Birds are reported in Table 2. As in the scenario starting from ImageNet, the reference performance on the old task is given by Joint Training, while Finetuning is considered the reference for the new task.

Table 2:
Classification Accuracy in the Two-Task Scenario Starting from Flowers.

| Method | Flowers $→$ Scenes (old / new) | Flowers $→$ Birds (old / new) | Average (old / new) |
| --- | --- | --- | --- |
| Joint Training$a$ | 84.75 (ref) / 61.05 (+0.67) | 83.04 (ref) / 56.42 (+0.22) | 83.89 (ref) / 58.73 (+0.45) |
| Finetuning$b$ | 71.67 (-13.08) / 60.37 (ref) | 65.41 (-17.63) / 56.20 (ref) | 68.54 (-15.35) / 58.28 (ref) |
| LwF | 79.58 (-5.16) / 62.39 (+2.02) | 79.35 (-3.69) / 55.89 (-0.31) | 79.46 (-4.43) / 59.14 (+0.85) |
| EBLL | 80.19 (-4.56) / 61.57 (+1.20) | 80.09 (-2.95) / 55.26 (-0.94) | 80.14 (-3.75) / 58.42 (+0.13) |
| EWC | 78.89 (-5.85) / 58.96 (-1.42) | 76.57 (-6.47) / 55.07 (-1.12) | 77.73 (-6.16) / 57.01 (-1.27) |
| SI | 78.78 (-5.97) / 58.88 (-1.49) | 76.50 (-6.54) / 55.13 (-1.07) | 77.64 (-6.25) / 57.00 (-1.28) |
| MAS | 78.89 (-5.85) / 58.51 (-1.87) | 76.63 (-6.41) / 55.06 (-1.14) | 77.76 (-6.13) / 56.78 (-1.50) |
| AFA | 80.96 (-3.78) / 63.73 (+3.36) | 81.15 (-1.88) / 57.60 (+1.40) | 81.06 (-2.83) / 60.66 (+2.38) |

Note: The bold numbers indicate the highest accuracy on the corresponding task.

$a$For the old task, the reference performance is given by Joint Training.

$b$For the new task, Finetuning is considered the reference.

Activation regularization strategies (LwF, EBLL, and AFA) outperform parameter regularization strategies (EWC, SI, and MAS) in both alleviating forgetting on the old task and accuracy on the new task. One potential reason is the biased estimation of importance weights: when starting from a small data set, there are too few samples to estimate the importance weights of the old task reliably. Another side effect of the biased importance weights is the poor performance on the new tasks.

Among the activation regularization strategies, the improvement brought by EBLL over LwF mainly concerns performance preservation on the old tasks, while its accuracy on the new task is inferior. This conclusion is consistent with that in PDG (Hou et al., 2018). Under the guidance of multilevel soft targets, the proposed AFA further alleviates the forgetting problem compared to LwF and EBLL. Meanwhile, AFA achieves better performance than Finetuning and Joint Training on the new task due to the regularization and knowledge distillation phenomena.

In conclusion, the proposed AFA suffers the least performance drop on the old task while achieving the best accuracy on the new task among all the compared methods in scenarios starting from both large (e.g., ImageNet) and small (e.g., Flowers) data sets.

### 5.7  Five-Task Scenario

We study a more challenging scenario containing a sequence of five tasks: Scenes $→$ Birds $→$ Flowers $→$ Aircraft $→$ Cars. The performance of all compared methods on each task at the end of the five-task sequence is illustrated in Figure 4. For a more efficient training process, we start from the AlexNet model pretrained on ImageNet and then fine-tuned on Scenes. We do not include the performance on ImageNet in the results because SI (Zenke et al., 2017) requires training from scratch on ImageNet to obtain the importance weights.

Figure 4:

The performance on each task at the end of the five-task sequence. The average accuracy over five tasks is marked next to the legend. (See the online figure in color.)


As expected, Finetuning suffers severe forgetting on previous tasks and favors the most recently learned task. With access to all the training data, Joint Training reaches the highest accuracy on the first four tasks and the best performance on average.

Parameter regularization strategies (EWC, SI, and MAS) better preserve the performance on older tasks and outperform activation regularization strategies in the first two tasks, Scenes and Birds. Nevertheless, parameter regularization strategies usually prevent the model from achieving competitive performance on the new tasks, which accounts for the reduced accuracy in the last two tasks, Aircraft and Cars.

Recall that in the two-task scenario, as reported in both our experiments and the original paper (Li & Hoiem, 2017), LwF reaches higher accuracy than Joint Training on the new task. In the five-task scenario, however, the result is reversed, which is consistent with the issue discussed earlier: LwF suffers an accumulating drop in performance as the sequence grows longer. EBLL brings some improvement in alleviating forgetting compared to LwF, but its performance on newer tasks is inferior. EBLL operates directly on the high-dimensional features generated by the convolutional layers and loses the structural information of the feature maps. The proposed AFA is more effective: it distills the knowledge of previous tasks into the new model by aligning multilevel soft targets and acts as a regularizer that improves training on the new task. It reaches the best average performance, with a balance between performance preservation on old tasks and accuracy on new tasks.

For more detailed comparisons, we illustrate additional information in Figure 5. Figure 5a shows the average performance drop over previous tasks (relative to the performance right after training on that task) after training on a new task. Figure 5b demonstrates the average performance gain on new tasks compared to Finetuning.
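Both quantities can be computed from a task-accuracy matrix. The NumPy sketch below uses made-up numbers, not the paper's results, and shows the average drop on previous tasks relative to the accuracy measured right after each task was learned:

```python
import numpy as np

def avg_drop(acc):
    """acc[i, j]: accuracy on task j after training through task i.
    For each stage i >= 1, return the mean drop on previous tasks
    relative to the accuracy right after each was learned (acc[j, j])."""
    n = acc.shape[0]
    return [float(np.mean([acc[i, j] - acc[j, j] for j in range(i)]))
            for i in range(1, n)]

# Toy accuracy matrix for 3 tasks (illustrative numbers only).
acc = np.array([[60.0,  0.0,  0.0],
                [55.0, 62.0,  0.0],
                [52.0, 58.0, 64.0]])
drops = avg_drop(acc)
# After task 2: (55 - 60) = -5.0
# After task 3: mean(52 - 60, 58 - 62) = -6.0
```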

Figure 5:

(a) The average performance drop on previous tasks (relative to the performance right after training on that task) after training on a new task in a five-task scenario. (b) The average performance gain on new tasks compared to Finetuning in a five-task scenario. (See the online figure in color.)


As shown in Figure 5a, the performance drops of the different methods share the same trends. The proposed AFA suffers the least degradation on average among the compared methods. It is worth noting that although the relative performance drop of parameter regularization strategies is comparable to AFA's, their absolute accuracy is much lower due to their poor performance on the newer tasks; in other words, they start from a lower baseline. On the new tasks, different trends can be observed: AFA and LwF gain higher accuracy than Finetuning, while EBLL and the parameter regularization strategies (EWC, SI, and MAS) perform progressively worse as training proceeds.

The conclusion is identical to that in the two-task scenario. Under the guidance of the sufficient supervised information provided by multilevel soft targets, our method, AFA, alleviates the forgetting problem on old tasks while achieving even better performance than Finetuning on new tasks.

### 5.8  Ablation Study

#### 5.8.1  With Single Constraint

Since two additional constraints are introduced in this work, we analyze them individually here. The variant with only adversarial attention alignment of the visual features is termed AFA-adv, while the one employing only high-level feature alignment with MMD is termed AFA-mmd. Their results in the two-task scenario starting from ImageNet are illustrated in Figure 6. Both constraints help improve accuracy on the new task and preserve performance on the old task.

Figure 6:

AFA-adv—the method with merely adversarial attention alignment of visual features. AFA-mmd—the method employing MMD constraints to high-level semantic features. The accuracy drop on ImageNet and gain on new tasks are illustrated. (See the online figure in color.)


Concretely, AFA-adv and AFA-mmd have similar effects on improving the accuracy on the new task. AFA-adv performs better in ImageNet $→$ Birds, while AFA-mmd reaches higher accuracy in ImageNet $→$ Scenes and ImageNet $→$ Flowers. Feature alignments act as regularizers during the training process, reduce overfitting on the new data, and thus improve test accuracy.

AFA-mmd outperforms AFA-adv in preserving the old task's performance. The fully connected features contain rich task-specific knowledge. Aligning these high-level semantic features with MMD provides strong supervised information, forces the network to integrate the knowledge from the old model, and thus helps to alleviate the forgetting problem.

#### 5.8.2  Loss Function between Label Probabilities

A popular choice for measuring the discrepancy between the outputs of the old and new models is the KD loss. As Hinton et al. (2015) and Li and Hoiem (2017) state, other types of constraints also work. Through experiments, we find that the L2 norm achieves better performance preservation on the old tasks, while the KD loss favors the new tasks. We set the hyperparameter $\lambda_1$ in equation 4.10 to 1.0 for the KD loss; a smaller value should be chosen if the L2 norm is adopted as the loss between label probabilities.
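A minimal NumPy sketch of the two loss choices follows, assuming temperature-softened probabilities for the KD loss as in Hinton et al. (2015); the temperature value and logits are illustrative:

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(old_logits, new_logits, T=2.0):
    """Knowledge-distillation loss: cross-entropy between the
    temperature-softened old and new label probabilities."""
    p_old = softmax(old_logits, T)
    p_new = softmax(new_logits, T)
    return float(-np.mean(np.sum(p_old * np.log(p_new + 1e-12), axis=-1)))

def l2_loss(old_logits, new_logits):
    """Alternative: squared L2 distance between label probabilities."""
    d = softmax(old_logits) - softmax(new_logits)
    return float(np.mean(np.sum(d ** 2, axis=-1)))

old = np.array([[2.0, 0.5, 0.1]])   # toy logits from the old model
new = np.array([[1.8, 0.7, 0.2]])   # toy logits from the new model
# Identical logits minimize both losses (KD bottoms out at the
# entropy of p_old, L2 at zero).
```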

#### 5.8.3  Constraints of Visual Features

We have tried applying statistical constraints (e.g., the L2 norm) to the visual features directly, and the results are illustrated in Figure 7. The method employing the L2 norm achieves slightly better performance preservation on the old tasks but much lower accuracy on the new tasks. Since the L2 norm is a stricter and less flexible constraint, applying it to the visual features discards the rich 2D structural information in the convolutional feature maps and prevents the model from achieving competitive performance on the new tasks. In contrast, the proposed adversarial attention alignment strategy introduces a trainable discriminator that measures the discrepancy between visual features dynamically and makes the training process smoother.
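A common way to obtain such spatial attention maps (following Zagoruyko & Komodakis, 2017) is to sum powered absolute activations over channels and normalize. The NumPy sketch below is an illustration only; for brevity, a plain distance stands in for the trainable discriminator the method actually uses.

```python
import numpy as np

def attention_map(feat, p=2, eps=1e-8):
    """Spatial attention map of a conv feature (C, H, W): sum of
    |activation|^p over channels, flattened and L2-normalized.
    Unlike a raw L2 penalty on the features, this keeps the 2D
    spatial structure of the feature maps."""
    a = np.sum(np.abs(feat) ** p, axis=0)   # (H, W): channel-wise energy
    a = a.flatten()
    return a / (np.linalg.norm(a) + eps)

rng = np.random.default_rng(0)
old_feat = rng.normal(size=(64, 6, 6))   # toy old-model feature map
new_feat = rng.normal(size=(64, 6, 6))   # toy new-model feature map

# Illustrative discrepancy between the two attention maps.
dist = np.linalg.norm(attention_map(old_feat) - attention_map(new_feat))
```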

Figure 7:

Employing Adv or L2 as the constraints of visual features in the two-task scenario starting from ImageNet. (a) The performance drop or gain compared to Joint Training on the old tasks. (b) The performance drop or gain compared to Finetuning on the new tasks.


#### 5.8.4  Constraints of Fully Connected Features

MMD measures the distance between data distributions after mapping them to a reproducing kernel Hilbert space (RKHS), which makes it well suited to high-level semantic feature alignment (Long et al., 2015; Long, Zhu, Wang, & Jordan, 2017). The performance of the variant that replaces the MMD constraint with the L2 norm is illustrated in Figure 8. The results indicate that it is inferior to the proposed method in both performance preservation on the old tasks and accuracy on the new tasks.
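A biased estimator of squared MMD with a single Gaussian kernel can be sketched in NumPy as follows; the kernel bandwidth and feature sizes are illustrative (in practice a multikernel choice as in Gretton et al., 2012, or Long et al., 2015, may be used):

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    """Gaussian kernel matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=4.0):
    """Biased estimator of squared MMD between two feature batches:
    mean k(x,x) + mean k(y,y) - 2 mean k(x,y)."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(32, 8))   # features from one model
b = rng.normal(0.0, 1.0, size=(32, 8))   # same distribution -> small MMD
c = rng.normal(3.0, 1.0, size=(32, 8))   # shifted distribution -> large MMD
```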

Figure 8:

Employing MMD or L2 as the constraints of fully connected features in the two-task scenario starting from ImageNet. (a) The performance drop or gain compared to Joint Training on the old tasks. (b) The performance drop or gain compared to Finetuning on the new tasks.


We have also tried introducing a discriminator to play the adversarial minimax game on the fully connected features. However, the discriminator learns to distinguish which task the features come from after only a few epochs, so the adversarial game cannot continue. The reason is that the fully connected features are high-level semantic features containing rich task-specific information and cannot adapt quickly enough to confuse the discriminator.

#### 5.8.5  Complexity and Effectiveness

Although our method introduces a trainable discriminator and additional hyperparameters, the extra complexity is limited. The discriminator can be trained end-to-end along with the backbone network, without extra tricks. It is re-initialized between tasks, which avoids the storage costs of the autoencoders in EBLL and the importance weights in EWC, SI, and MAS. Besides, we find that our method is not very sensitive to hyperparameters. For equation 4.10, the selection of $\lambda_1$ was discussed in section 5.8.2, while $\lambda_3$ is set to 1.0 in all of the above experiments. We choose 1.0 for $\lambda_2$ in the two-task scenario and a smaller value for longer sequences. The network design and training strategies are set out in section 5.4.

The key idea of our method is that the intermediate features of the old model are treated as multilevel soft targets that guide the training process in multiple stages. Aligning features generated by different networks from the same data does not prevent the network from learning specialized features. Instead, by distilling the knowledge of previous networks into the new one and acting as a regularizer, our method reaches higher accuracy than even Joint Training and Finetuning on the new tasks. The absolute improvement is bounded by the upper bounds inherent in lifelong learning scenarios (e.g., the accuracy of Joint Training in the five-task scenario). Still, the method improves both performance preservation on the old tasks and accuracy on the new tasks.

## 6  Conclusion and Future Work

Lifelong learning remains an open research problem. In this letter, we focus on the incremental task scenario and propose an improved activation regularization lifelong learning method with adversarial feature alignment. Both the low-level visual features and high-level semantic features serve as soft targets when training on new data. Aligning these features generated by different models but from the same data provides sufficient supervised information for the old tasks and helps to reduce forgetting. Additionally, the proposed method gains even better performance than fine-tuning on the new tasks due to knowledge distillation and regularization. Extensive experiments in the incremental task scenarios are conducted, and the results show that our method outperforms previous ones in both accuracy on new tasks and performance preservation on old tasks.

There are several directions for further improvements. For example, the commonly used target data sets in lifelong learning scenarios are usually small compared to ImageNet. We would like to conduct more experiments using other kinds of backbone models with larger data sets to evaluate the generalization ability of our methods.

## Acknowledgments

This work is supported by the National Key R&D Program of China (2018YFB1003703), the National Natural Science Foundation of China (61521002), and the Beijing Key Lab of Networked Multimedia (Z161100005016051).

## References

Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., & Tuytelaars, T. (2018). Memory aware synapses: Learning what (not) to forget. In *Proceedings of the European Conference on Computer Vision* (pp. 139–154). Berlin: Springer.

Aljundi, R., Chakravarty, P., & Tuytelaars, T. (2017). Expert gate: Lifelong learning with a network of experts. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (pp. 7120–7129). Piscataway, NJ: IEEE.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (pp. 248–255). Piscataway, NJ: IEEE.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., … Lempitsky, V. (2016). Domain-adversarial training of neural networks. *Journal of Machine Learning Research*, 17(1), 1–35.

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In *Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics* (pp. 249–256).

Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., & Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv:1312.6211.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), *Advances in neural information processing systems*, 27 (pp. 2672–2680). Red Hook, NY: Curran.

Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., & Sriperumbudur, B. K. (2012). Optimal kernel choice for large-scale two-sample tests. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), *Advances in neural information processing systems*, 25 (pp. 1205–1213). Red Hook, NY: Curran.

Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. *Stat*, 1050, 9.

Hou, S., Pan, X., Change Loy, C., Wang, Z., & Lin, D. (2018). Lifelong learning via progressive distillation and retrospection. In *Proceedings of the European Conference on Computer Vision* (pp. 437–452). Berlin: Springer.

Kang, G., Zheng, L., Yan, Y., & Yang, Y. (2018). Deep adversarial attention alignment for unsupervised domain adaptation: The benefit of target expectation maximization. In *Proceedings of the European Conference on Computer Vision*. Berlin: Springer.

Kemker, R., McClure, M., Abitino, A., Hayes, T. L., & Kanan, C. (2018). Measuring catastrophic forgetting in neural networks. In *Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence*.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., … Grabska-Barwinska, A. (2017). Overcoming catastrophic forgetting in neural networks. *Proceedings of the National Academy of Sciences*, 114, 3521–3526.

Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3D object representations for fine-grained categorization. In *Proceedings of the 4th International IEEE Workshop on 3D Representation and Recognition*. Piscataway, NJ: IEEE.

Krizhevsky, A., & Hinton, G. (2009). *Learning multiple layers of features from tiny images* (Technical Report). Citeseer.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), *Advances in neural information processing systems*, 25 (pp. 1097–1105). Red Hook, NY: Curran.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, 86(11), 2278–2324.

Li, Z., & Hoiem, D. (2017). Learning without forgetting. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 40, 2935–2947.

Lomonaco, V., & Maltoni, D. (2017). CORe50: A new dataset and benchmark for continuous object recognition. In *Proceedings of the Conference on Robot Learning* (pp. 17–26).

Long, M., Cao, Y., Wang, J., & Jordan, M. (2015). Learning transferable features with deep adaptation networks. In *Proceedings of the International Conference on Machine Learning* (pp. 97–105).

Long, M., Zhu, H., Wang, J., & Jordan, M. I. (2017). Deep transfer learning with joint adaptation networks. In *Proceedings of the International Conference on Machine Learning* (pp. 2208–2217).

Lopez-Paz, D., & Ranzato, M. (2017). Gradient episodic memory for continual learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), *Advances in neural information processing systems*, 30 (pp. 6467–6476). Red Hook, NY: Curran.

Maji, S., Kannala, J., Rahtu, E., Blaschko, M., & Vedaldi, A. (2013). *Fine-grained visual classification of aircraft* (Technical Report). Oxford: Oxford University.

Maltoni, D., & Lomonaco, V. (2018). Continuous learning in single-incremental-task scenarios. arXiv:1806.08568.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. *NIPS Workshop on Deep Learning and Unsupervised Feature Learning*. http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf

Nilsback, M.-E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In *Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing*. Berlin: Springer.

Quattoni, A., & Torralba, A. (2009). Recognizing indoor scenes. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (pp. 413–420). Piscataway, NJ: IEEE.

Rannen Ep Triki, A., Aljundi, R., Blaschko, M., & Tuytelaars, T. (2017). Encoder based lifelong learning. In *Proceedings of the IEEE International Conference on Computer Vision* (pp. 1320–1328). Piscataway, NJ: IEEE.

Rebuffi, S.-A., Kolesnikov, A., Sperl, G., & Lampert, C. H. (2017). iCaRL: Incremental classifier and representation learning. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (pp. 2001–2010). Piscataway, NJ: IEEE.

Rosenfeld, A., & Tsotsos, J. K. (2017). Incremental learning through deep adaptation. arXiv:1705.04228.

Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., … Hadsell, R. (2016). Progressive neural networks. arXiv:1606.04671.

Shin, H., Lee, J. K., Kim, J., & Kim, J. (2017). Continual learning with deep generative replay. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), *Advances in neural information processing systems*, 30 (pp. 2994–3003). Red Hook, NY: Curran.

Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In *Proceedings of the International Conference on Machine Learning* (pp. 1139–1147).

Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (pp. 7167–7176). Piscataway, NJ: IEEE.

van de Ven, G. M., & Tolias, A. S. (2018). Generative replay with feedback connections as a general strategy for continual learning. arXiv:1809.10635.

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., & Perona, P. (2010). *Caltech-UCSD Birds 200* (Technical Report CNS-TR-2010-001). Pasadena, CA: California Institute of Technology.

Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), *Advances in neural information processing systems*, 27 (pp. 3320–3328). Red Hook, NY: Curran.

Zagoruyko, S., & Komodakis, N. (2017). Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In *Proceedings of the International Conference on Learning Representations*. OpenReview.net.

Zenke, F., Poole, B., & Ganguli, S. (2017). Continual learning through synaptic intelligence. In *Proceedings of the 34th International Conference on Machine Learning*.