## Abstract

This letter focuses on the problem of lifelong classification in the open world, the goal of which is to achieve an endless process of learning. However, incremental data sets (like streaming data) in the open world, where new classes may emerge, are unsuited for classical classification methods. To address this problem, existing methods usually retrain on the whole observed data set, with complex computation and expensive storage costs. This letter attempts to improve the performance of classification in the open world and decomposes the problem into three subproblems: (1) rejecting unknown instances, (2) classifying accepted instances, and (3) cutting the cost of learning. Rejecting unknown instances means recognizing those instances whose classes are unknown to the learner, which can reduce the computation of the retraining process and eliminate the storage of historical data sets. We employ outlier detection to reject instances and a variant artificial neural network with fewer weights to classify them. Results on several experiments show that the work is effective. Source code can be found at https://github.com/wangbi1988/Lifelong-learning-in-Open-World-Classification.

## 1  Introduction

Classification in the open world is important and difficult in lifelong learning (LL; Javed & White, 2019; Beaulieu et al., 2020) and streaming data analysis (Mu, Ting, & Zhou, 2017; Cai, Zhao, Ting, Mu, & Jiang, 2019). Both streaming data analysis and LL (Silver, Yang, & Li, 2013; Carlson et al., 2010; Mitchell et al., 2018) take an interest in streaming data observed over time; thus, this problem should be discussed in both fields. In the closed world, many classification methods have been successful (Bendale & Boult, 2015). By contrast, in the open world, the performance of a model or classifier degrades significantly (Mu et al., 2017; Janardan & Mehta, 2017). The reason is that the basic assumption of classification, that the training and testing sets share the same distribution, does not hold in this case. The difference between the training and the testing set is not only like concept drift in streaming data classification (Janardan & Mehta, 2017); it also implies the emergence of new (unknown) classes that have never been seen before.

To address this problem, typical classification methods require a human to state whether instances are sampled from unknown classes. Given this statement, one solution (Da, Yu, & Zhou, 2014) may train a classifier on all data observed up to this point; however, retaining and retraining all the data is expensive. With the statement, the open world is reduced to the closed world of typical classification: instances from unknown classes can be learned by a new classifier, which avoids modifying previous classifiers. Thus, much LL research focuses on the transfer of training (Caruana, 1997; Kumar & Daume, 2012; Javed & White, 2019) to reduce the cost of retraining or to improve the performance of retrained classifiers. However, the problem of classification in the open world has not attracted enough attention in LL.

Some researchers aim to free the learner in the open world from the human's statement and help the learner decide whether a class is unknown (Fei, Wang, & Liu, 2016; Shu, Xu, & Liu, 2018). One important solution (Fei et al., 2016) proposes a learner based on the 1-versus-rest strategy of support vector machines. The learner contains many binary classifiers, each employed to reject instances from unknown classes by a decision boundary. As known classes accumulate, adjusting the decision boundaries maintains high performance. However, adjusting boundaries with historical data sets is costly in both storage and computation. At the same time, classification under emerging new classes has been posed as an important problem in streaming data analysis (Mu et al., 2017), addressed by unsupervised learning. Nevertheless, classification degrades when instances are difficult to distinguish (Cai et al., 2019), and the degree of difficulty increases over time.

There are three challenges for classification in the open world: (1) automatically accepting instances drawn from known classes and rejecting instances drawn from unknown classes; (2) classifying accepted instances and learning rejected instances incrementally; and (3) downsizing the cost of storage (the dependence of historical data sets) and the computation.

The trade-off between performance and cost is the ultimate goal of this work. Two aspects of the storage cost are important. The first is retaining historical data sets to retrain learned classifiers: over time, the number of arrived data sets grows, and the cost of retaining them also increases. The second is storing the learned classifiers themselves: the cost of model-based and model-free classifiers differs, and the number of used weights is also related to the cost.

To address these challenges, we distill the functionality of rejecting instances drawn from unknown classes from the task of classification. For each data set, outlier detection is used as a binary classifier to reject instances whose classes have not been learned, and an artificial neural network (ANN) classifies the accepted instances. The independent rejecting functionality implies that we can obtain more than one classifier for each training data set. The set of classifiers maintains performance over time, even without retraining, because their decision boundaries prefer to reject: the final boundary is superimposed from those partially overlapping boundaries rather than adjusted from a boundary that prefers to accept. In contrast with the 1-versus-rest strategy, which needs one binary classifier per class, we can train one ANN for all instances in the data set. The variant ANN responds to different data sets by activating sparse connections, which reduces the number of used weights.

In this letter, section 2 introduces related work, and section 3 describes the framework of this work and the VC dimension analysis. Sections 4 and 5 present the performance of our work and the conclusions, respectively.

## 2  Related Work

Definition 1

(Lifelong Classification in the Open World). Given streaming data, at time $t+1$, the sets of instances $X(t+1)$ and true labels $Y(t+1)$ are observed. Where $Y(t+1)$ is not a subset of the known classes $\bigcup_{i=0}^{t} Y(i)$, the set $Y'(t+1) = Y(t+1) - \bigcup_{i=0}^{t} Y(i) \neq \emptyset$ of new classes is unknown. The goal of classification in the open world is to classify those instances in $X(t+1)$ drawn from $\bigcup_{i=0}^{t} Y(i)$ and to learn those instances drawn from the new classes $Y'(t+1)$ without degrading the previous classification performance.

This letter focuses on the problem of lifelong classification in the open world, as defined in definition 1. The lifelong learner learns endlessly. Thus, lifelong classification in an open world should support streaming data with emerging classes in both learning and prediction, a changeless model like the brain, and incremental training without historical data. Based on these characteristics, we address the differences and similarities between existing classification methods and our proposed method, as shown in Table 1.

Table 1: A Comparison of Related Classification Methods.

| Methods | Model Type | Emerging New Classes | Streaming Data | Altered Model | Historical Data | Training |
| --- | --- | --- | --- | --- | --- | --- |
| Typical classification | based/free | – | no | no | – | – |
| One-shot (few-shot) learning (Fei-Fei, 2006) | based | learn | no | no | – | – |
| Continual learning (Aljundi, Chakravarty, & Tuytelaars, 2017; Javed & White, 2019; Ostapenko, Puscas, Klein, Jahnichen, & Nabi, 2019; Beaulieu et al., 2020) | based | learn | yes | yes | no | incremental |
| Learning with augmented class (Da, Yu, & Zhou, 2014) | based/free | predict | no | no | – | – |
| Open set recognition (Scheirer, de Rezende Rocha, Sapkota, & Boult, 2013; Rudd, Jain, Scheirer, & Boult, 2017; Geng, Huang, & Chen, 2020) | based | predict | no | no | – | – |
| Cumulative learning (Fei et al., 2016) | based/free | learn/predict | yes | no | yes | whole |
| Streaming emerging new class (Mu et al., 2017; Cai et al., 2019) | free | learn/predict | yes | no | no | incremental |
| Our method | based/free | learn/predict | yes | no | no | incremental |

### 2.1  Classification in the Closed World

The difference between classification methods in the closed and open worlds is whether new classes are observed. The typical classification methods, which assume a closed world, are an important part of supervised learning. Input of these methods contains a training set with labeled instances and a testing set with unlabeled instances. Output is a predicting set, with each instance in the testing set mapped to a class. The set of classes is not altered during learning and prediction: each class labeled in the predicting set has also been observed (known) in the training set.

Mixture of experts (MoE; Jacobs, Jordan, Nowlan, & Hinton, 1991) is a supervised learning architecture composed of many separate experts (Yang, Yuan, Heng, Komura, & Li, 2020), each specialized for a subdomain of the task. Based on an ANN, an MoE system contains several ANN experts and a gating network that is used to compute the weights of combining dynamically from the inputs, according to the local expertise of each expert. Since there is no transformation between the distribution of the training set and the testing set, the closed world assumption still holds. Though MoE is established based on the divide-and-conquer principle and handles many subtasks, it is still considered a typical classification method, the same as boosting (Sun, Kamel, Wong, & Wang, 2007; Masoudnia & Ebrahimpour, 2014).

Transfer-based classification methods, like one-shot (few-shot) learning (Fei-Fei, 2006), focus on exploiting existing knowledge of the world to quickly learn predictions on new instances (accelerating future learning). In this situation, input contains a large training set and a small training set, whose distributions are different but related. For example, knowledge of cars could apply when recognizing trucks, as a human would do. Transfer-based classification methods require a closed-world assumption and cannot reject instances of new classes in the prediction stage.

Continual learning (Beaulieu et al., 2020) is routinely described as lifelong learning (Javed & White, 2019), but in actuality, it is a kind or a part of lifelong learning in the open world. These methods target the acceleration of learning and the mitigation of catastrophic forgetting in the context of lifelong learning. The acceleration of learning relies on meta-learning and transfer learning, and the mitigation of catastrophic forgetting is implemented by altering the model (or network). Continual learning cannot be applied in the open world due to its inability to recognize new classes in the prediction stage. For example, Beaulieu et al. (2020) is more accurately described as operating in the "open world with a new-task oracle." However, one of the aims of this neuromodulated meta-learning algorithm (ANML) is to remove the "oracle."

### 2.2  Classification in the Open World

Classification in the open world relates to the reject option (Chow, 1970), which aims at making reliable predictions by rejecting when the classifier is not confident, typically by proposing loss functions with a rejection cost (Yuan & Wegkamp, 2010). However, high-confidence predictions do not imply a small error in recognizing an unknown class (Scheirer et al., 2013); for example, a linear model can solve a linearly separable problem with high-confidence predictions but cannot recognize a new class. Although a reject response incurs a loss smaller than that incurred by misclassification, this classification method is not a sophisticated solution to the problem of classification in the open world.

These methods handle unknown classes in the prediction stage but cannot handle streaming data. Open set recognition (Geng et al., 2020) is an alternative method for this problem in computer vision and for the long-tail problem (Liu et al., 2019). It focuses on an operating threshold so that an instance is classified to the seen classes only if the confidence is above the threshold. In learning with an augmented class, the augmented class is unknown during the training stage but appears in the test stage. Once the system can distinguish the augmented classes from the seen ones, the augmented classes can be handled (Da et al., 2014). However, these methods can only recognize unknown classes; they cannot learn them.

The methods for streaming emerging new classes (Mu et al., 2017) detect instances of new classes as soon as they emerge in the data stream. However, detecting a new class is much harder if it is similar to known classes (in terms of feature characteristics; Cai et al., 2019). The reason is that the supervised methods they employ for classification are model-free, like the nearest-neighbor ensemble approach, which depends on the distance between two instances.

### 2.3  Lifelong Classification in the Open World

Lifelong learning requires the maintenance of learned knowledge during a lifetime (Silver, Yang, & Li, 2013), and can be described as (1) continual learning (in the closed world), which focuses on the faster learning of new classes, and (2) cumulative learning (in the open world), which focuses on recognizing new classes.

Continual learning is similar to transfer and multitask learning (Chen & Liu, 2016). Earlier research on supervised lifelong learning (Thrun, 1995; Pentina & Lampert, 2015) shows the transfer of knowledge across multiple learning tasks sampled from the same domain (Bengio, Louradour, Collobert, & Weston, 2009) or different domains (Ruvolo & Eaton, 2013; Mitchell et al., 2018). The rule-based learner NELL (Carlson et al., 2010; Mitchell et al., 2018) builds never-ending language learning on a knowledge base (KB) by belief revision. NELL employs semisupervised bootstrap learning methods that need high precision (confidence). The rule-based method provides suitable confidence and sustains the continual learning process over two months. Extracted instances added to the KB are partitioned into beliefs and candidate facts for retaining knowledge. Rule-based learners are difficult to apply to some tasks, such as computer vision tasks. Existing research (Ruvolo & Eaton, 2013), such as an ANN-based learner (Tessler, Givony, Zahavy, Mankowitz, & Mannor, 2017), selectively uses its past knowledge to solve new tasks efficiently. Without the support of a KB, classifiers typically exhibit forgetting when learning multiple tasks observed in sequence (Isele & Cosgun, 2018).

Cumulative learning (Fei et al., 2016; Shu et al., 2018), which is similar to streaming emerging classes (Mu et al., 2017; Cai et al., 2019), aims to classify data of the known/seen classes into their respective classes and to reject instances from unknown/unseen classes (Chen, Ma, & Liu, 2018). Both of these methods employ several models for the reject option, such as the KNN in SENNE (Cai et al., 2019) and the OCSVM in CL-cbsSVM (Fei et al., 2016). Each reject model scores the test instance; once the score is over (or under) a threshold $\gamma$, the instance is accepted, and the accepted instance is then classified by a suitable classifier. The difference between CL-cbsSVM and SENNE is the classifier they employ. The SENNE classification model is the nearest-neighbor approach, a model-free classification method. CL-cbsSVM uses a nonlinear SVM, a model-based method, to perform various tasks. Thus, after each subtask is observed, CL-cbsSVM is retrained on all of the historical data rather than incrementally.

The proposed framework employs nearest-neighbor approaches for the reject option and the model-based methods for classification with incremental learning. Another difference is the number of reject models for each subtask. For the reject option, the proposed framework can create multiple models for each subtask rather than one model in the two previous methods.

The reason for interest in the performance of the reject-accept models can be interpreted probabilistically:
$\Pr(y;x) = \sum_i \Pr(y, p_i; x) = \Pr(y, p_k; x) + \sum_{i \neq k} \Pr(y, p_i; x) = \Pr(y, p_k; x) + 0 = \Pr(y \mid p_k; x) \Pr(p_k; x),$
(2.1)
where $\Pr(\cdot)$ is the probability of events. The event $y$ implies that a given instance $x$ is correctly classified, and $p_k$ represents that the instance is accepted by the reject model $k$. An instance $x$ is correctly classified if and only if it is accepted by the correct reject model $p_k$ and recognized correctly by the suitable classifier, which is indexed by $p_k$. Thus, $\Pr(y;x) = \sum_i \Pr(y, p_i; x)$, and $\sum_{i \neq k} \Pr(y, p_i; x)$ equals zero because the other reject models ($\forall i \neq k$) cannot obtain the correct prediction, so the output of each classifier is disjoint from the others. Following formula 2.1, the performance of lifelong classification in the open world increases with the performance of the reject models $\Pr(p_k; x)$ and the relative classifiers $\Pr(y \mid p_k; x)$.
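The factorization above can be checked numerically; a minimal sketch with made-up probabilities (all numbers here are illustrative assumptions, not measured values):

```python
# Hypothetical numbers illustrating equation 2.1: only the correct reject
# model p_k (index 1 here) can lead to a correct classification of x.
pr_accept = [0.0, 0.9, 0.0]         # Pr(p_i; x)
pr_correct_given = [0.0, 0.8, 0.0]  # Pr(y | p_i; x); disjoint outputs -> 0

# Pr(y; x) = sum_i Pr(y | p_i; x) Pr(p_i; x) collapses to the single k term.
pr_y = sum(c * a for c, a in zip(pr_correct_given, pr_accept))
assert abs(pr_y - 0.8 * 0.9) < 1e-12  # Pr(y | p_k; x) * Pr(p_k; x)
```

Improving either factor, the reject model or its classifier, raises $\Pr(y;x)$ directly.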

## 3  Methods

In this section, we address how to achieve lifelong classification in the open world. Section 3.1 provides the framework for LL, which consists of two components; the definition of each component is exhibited. Section 3.2 describes the algorithm, and section 3.3 analyzes the VC dimension.

### 3.1  Framework

In general, an instance $x$ presented to this classification framework is recognized as a sample drawn from known (learned) classes or from an unknown class (to be learned). If instance $x$ is possibly drawn from known classes, the framework should retrieve a suitable learned model to classify it. If not, the framework should collect the instance for training a new classifier and index the learned model. There are two proposed components: (1) a brain-regions-like structure (BRL, $P$), a set of binary classifiers to reject unknown classes, and (2) an ANN (CENet, $C$), used to classify known classes and learn the unknown ones.

We model LL classification as a continual learning process in the open world, as shown in Figure 1. BRL is the kernel for a retraining-free LL framework in which the historical data sets are no longer required.

On the right side of Figure 1, the learned models are saved in a lookup table; the size of the table increases as tasks arrive. This lookup table is employed for incremental learning. The binary classifier $p_{t-1}^{k}$ is used to reject or accept observed instances at time $t$. The index $*p_{t-1}^{k}$ is for retrieving a suitable learned model to classify the accepted instances. In this section, $*p_{t-1}^{k}$ refers to the sparse matrix $\rho_t$, which switches classifiers within one large ANN.

The purpose of employing CENet for classification instead of nearest-neighbor ensemble approaches like Cai et al. (2019) is to improve performance. Model-based classification methods usually outperform model-free methods when instances from different classes are similar in terms of feature characteristics.

Figure 1:

The framework of LL classification in an open world, in which the input is streaming data (in the top-left corner). After time $t$, the learner has recognized the classes airplane and automobile. At time $t+1$, instances of the new classes "bird" and "deer" are observed. BRL, the set of binary classifiers $p^k$ and their indexes $*p_{t-1}^{k}$ (on the right side), rejects those instances of the new "bird" and "deer" classes and accepts instances of known classes. $p_{t-1}^{k}$ and $p_{t}^{m}$ accept only instances drawn from the classes "airplane" and "automobile," respectively. These accepted instances are classified by CENet (in the middle). The instances of the new classes, which are rejected, are trained by CENet and also recognized by creating a new binary classifier, $p_{t+1}^{n}$, in BRL.


As shown in Figure 1, at time $t+1$, the observed data contain instances drawn from two classes. After the learning process, we obtain an index pair $*p_{t+1}^{n}$ and $p_{t+1}^{n}$ in BRL. $p_{t+1}^{n}$ needs only to decide whether an instance belongs to these two classes, and CENet, according to $*p_{t+1}^{n}$, decides which class the instance belongs to. We represent the learner as a tuple $L = \langle P, C \rangle$ and describe these components in detail below.

Definition 2
(Brain-Regions-Like Structure, BRL, $P$). There exists a model used to classify unknown instances into the class labeled 0, and otherwise into the class labeled 1. Considered collectively, many of these classifiers form a single set $P$, the brain-regions-like structure, which can be represented as
$P = \{\langle p_i, \rho_i \rangle \mid i = 1, \ldots\},$
where $x$ is an instance with $N$ features, $p_i$ is a binary classifier, and $\rho_i$ is a sparse matrix that points out the connections activated by $p_i$.
Definition 3
(Connection-Eliminated Network, CENet, $C$). At time $t$, let $x_t^j$ be an instance recognized by $p_t^i$ and $Y = \{y_j \mid y_j \in \mathbb{N}\}$ be the union of all known classes and the unknown class. Then CENet, denoted $C: \mathbb{R}^N \to \mathbb{N}$, is defined as
$C(x_t^j, p_t^i) = \arg\max_{y \in Y} \left( \Pr(y \mid x_t^j; w(p_t^i)) \right).$
(3.1)
In other words, $C$ is a multiclass classifier that can receive the sparse matrix $\rho$ from $p_t^i$ to select weights by the function $w(\cdot)$.
An instance $x$ is drawn from a known class according to the learner if and only if there exists at least one binary classifier $p_i$ in $P$ labeling $x$ as 1 (accepted). Notice that the input $x$ implies that this structure can be unsupervised. The classifier $p_i$ accepts only known (positive) instances and rejects others (negative). We can calculate $P(x)$ by
$P(x) = 1 - \prod_i (1 - p_i(x)).$
(3.2)
At time $t$, the learner observes $m$ instances $X_t = \{x_t^j \mid j = 1, \ldots, m\}$, which are divided into two sets, $\tilde{X}_t$ and $\dot{X}_t$, by BRL (definition 2). The sets $\tilde{X}_t$ and $\dot{X}_t$ contain the instances recognized as positives and negatives, respectively, where $\tilde{X}_t \cap \dot{X}_t = \emptyset$ and $\tilde{X}_t \cup \dot{X}_t = X_t$:
$\tilde{X}_t = \{x_t^j \mid I(P(x_t^j))\}, \qquad \dot{X}_t = X_t - \tilde{X}_t.$
(3.3)

In formula 3.3, the predicate (Boolean-valued function) $I(P(x_t^j))$ indicates whether an instance in $X_t$ is classified to a known class according to the learner. We implement $P(\cdot)$ by outlier detection, such as KNN (Ramaswamy, Rastogi, & Shim, 2000); other unsupervised methods and supervised methods with a reject option (Chow, 1970) are also suitable. As in the lookup table of Figure 1, each $p_i$ in the lookup table is an outlier detector. The job of BRL, the lookup table of outlier detectors, is to divide $X_t$ into $\tilde{X}_t$ and $\dot{X}_t$. Positive (accepted) and negative (rejected) instances are treated as normal points and outlier points, respectively. Each rejected instance is collected in a temporary set $\dot{X}_t$. Once the scale of $\dot{X}_t$ is sufficient, new outlier detectors are trained on all the instances in $\dot{X}_t$. Thus, instances in $\tilde{X}_t$ or in $\dot{X}_t$ may not belong to an identical class. One $p_t^i$ could, ideally, accept any instance drawn from all classes learned at time $t$. The number of incremental detectors does not necessarily equal 1 for each training; thus, $k$ may be greater than $t$ in Figure 1.
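As a concrete sketch, BRL can be read as a lookup table of distance-based outlier detectors that splits each batch; the detector, its radius, and the one-dimensional data below are illustrative assumptions, not the paper's implementation:

```python
# A minimal BRL sketch: each detector accepts instances close to its
# training data (k-nearest-neighbor distance under an assumed radius).

def knn_score(x, train, k=1):
    """Distance from x to its k-th nearest training instance (1-D here)."""
    dists = sorted(abs(x - z) for z in train)
    return dists[min(k, len(dists)) - 1]

class BRL:
    def __init__(self, radius=1.0):
        self.detectors = []          # lookup table: (training set, index)
        self.radius = radius

    def add_detector(self, train, index):
        self.detectors.append((train, index))

    def split(self, batch):
        """Divide a batch into accepted (known) and rejected (unknown)."""
        accepted, rejected = [], []
        for x in batch:
            if any(knn_score(x, tr) <= self.radius for tr, _ in self.detectors):
                accepted.append(x)   # P(x) = 1: some p_i accepts x
            else:
                rejected.append(x)   # collected to train a new detector
        return accepted, rejected

brl = BRL(radius=1.0)
brl.add_detector([0.0, 0.5, 1.0], index=0)   # detector p_0 for a learned task
acc, rej = brl.split([0.2, 0.8, 5.0])
# 0.2 and 0.8 lie within the radius of p_0's data; 5.0 is rejected as unknown
```

Once enough rejected instances accumulate, they would be used to fit a new detector appended to the table, mirroring the incremental growth of BRL.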

However, we need to predict the class of each instance in $\tilde{X}_t$ by a classifier and train a new classifier for the data set $\dot{X}_t$. A simple ANN is employed as the classifier; other model-based classification methods, such as continual learning (Javed & White, 2019; Beaulieu et al., 2020) and one-shot (few-shot) learning (Fei-Fei, 2006), are also suitable. Model-based classification methods normally outperform model-free methods. The ANN is called CENet (definition 3) and is a three-layer neural network with extra mask layers, whose model is shown in Figure 2.
Figure 2:

An example model layout for CENet. The dashed arrow implies an eliminated connection, which transmits across the mask layer.


The mask layers can be represented by a sparse matrix, in which each element implies that one connection is activated or unactivated. It is similar to brain activity shown by functional magnetic resonance imaging (fMRI; Lee et al., 2010): the response to various tasks and stimuli activates sparse synapses. The first purpose of the sparse matrix ($\rho$) is to reduce the number of weights for each task. The second purpose is to provide a hypothesis space that is neither too large (overfitting) nor too small (underfitting) for each task, which is further discussed in section 3.3. Without the matrix $\rho$, it is costly to retain a new network for each element in BRL. For various tasks, an appropriate hypothesis space is large enough to prevent underfitting and regularized (e.g., by an $L$-norm) to avoid overfitting. Thus, we provide a large hypothesis space $H$ for the initial task. Then we store the used units in $\rho$ and let the remaining units form the new hypothesis space $H'$.

In detail, the output of the mask layer $h$ in Figure 2 is
$h = \sigma((M \circ W)x), \qquad \frac{\partial h}{\partial M} = \frac{\partial \sigma(\cdot)}{\partial \cdot} \frac{\partial ((M \circ W)x)}{\partial M} = \frac{\partial \sigma(\cdot)}{\partial \cdot} W x, \qquad \frac{\partial h}{\partial W} = \frac{\partial \sigma(\cdot)}{\partial \cdot} \frac{\partial ((M \circ W)x)}{\partial W} = \frac{\partial \sigma(\cdot)}{\partial \cdot} M x,$
(3.4)
where $\sigma$ is an activation function and $\circ$ denotes the entrywise product. $W$ is a weight matrix, $M$ is a binary matrix upon $W$, and $|M| = |W|$. There are two pairs of $(W, M)$: one before the hidden layer and another before the output layer (see Figure 2).
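A minimal NumPy sketch of this mask layer may help; the shapes, the ReLU choice, and the random data are assumptions for illustration only:

```python
# Masked-layer sketch of formula 3.4: h = sigma((M o W) x), where M is a
# binary mask and "o" is the entrywise (Hadamard) product.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))          # weight matrix
M = np.ones_like(W); M[0, :] = 0.0   # mask: row 0 eliminated (untrainable)
x = rng.normal(size=(3,))

h = relu((M * W) @ x)                # forward pass through the mask layer

# Backward: dh/dW = sigma'(.) * M x, so masked weights get zero gradient.
grad_pre = ((M * W) @ x > 0).astype(float)   # ReLU derivative at each unit
grad_W = np.outer(grad_pre, x) * M           # gradient entrywise-masked by M
assert np.allclose(grad_W[0, :], 0.0)        # eliminated row never updates
```

The key point the sketch makes concrete: a zero in $M$ both removes the connection from the forward pass and zeroes its gradient, so the weight is frozen.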
Each element of the matrix $M$, initialized to 1, is flipped from 1 to 0 only once during the process of LL. The weights of $W$ are regularized by the $L_1$-norm. The structural risk for CENet is then
$R_{\mathrm{srm}}(x, y) = \sum_i^{|C|} -y_i \log(\hat{y}_i) + \lambda L_1(W),$

where $|C|$ is the number of classes, $\hat{y}: x \to \mathbb{R}_+^{|C|}$ such that $\sum_i^{|C|} \hat{y}_i(x) = 1$, and $y$ is the one-hot encoding of the instance label. The reason for using the $L_1$-norm is that it leads to sparse weights (Bengio et al., 2009) and reduces the cost of storing weights. For example, consider a layer with 100 units regularized by the $L_1$-norm. After training a task, 10 units have been reweighted with nonzero values. The location of each of these 10 units is indexed by the sparse matrix, and the remaining 90 units will be used to train the next tasks.

For streaming data, the adjustment of the mask matrix $M$ must be done carefully because $M$ is crucial for the incremental learning process: it determines whether a weight in CENet is trainable, and CENet contains all classification models. If $M$ makes mistakes, the previous classification models may be damaged. We now explain the maintenance of the sparse matrix $M$ and the index of each $*p$ in BRL. At time $t$, after training, each element of the matrix $M$ is updated as
$\rho_t = [m_t^{ij} \cdot I(w_t^{ij})], \qquad M_{t+1} = 1 - [I(w_t^{ij})].$
(3.5)
The function $I: \mathbb{R} \to \mathbb{B}$ is the indicator function mapping a real to a Boolean: $I(w_t^{ij}) = 1$ if $w_t^{ij} \neq 0$; otherwise, $I(w_t^{ij}) = 0$. $m_t^{ij}$ is an element of the matrix $M$. For instance, at time $t$, let $M_t = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}$ and $W_t = \begin{bmatrix} 0 & 0.7 \\ 0 & 0.5 \end{bmatrix}$. Then, following equation 3.5, $\rho_t = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$ and $M_{t+1} = \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}$, where $\rho_t^{0,1} = m_t^{0,1} \cdot I(w_t^{0,1}) = 1 \cdot 1 = 1$ and $m_{t+1}^{0,1} = 1 - I(w_t^{0,1}) = 0$. The zero elements of $M$ make their weights untrainable: in the next training, the weights masked by 0 (both 0.7 and 0.5 in $W_t$) never change, and their gradients always equal 0. The function $w(\cdot)$ in formula 3.1 copies $\rho$ to $M$.
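The update and the worked example can be verified in a few lines; this sketch of equation 3.5 simply replays the matrices above:

```python
# Sketch of the mask update in equation 3.5, reproducing the worked example.
import numpy as np

M_t = np.array([[1, 1], [1, 0]])       # current binary mask
W_t = np.array([[0, 0.7], [0, 0.5]])   # weights after training at time t

I = (W_t != 0).astype(int)             # indicator I(w) = 1 iff w != 0
rho_t = M_t * I                        # rho_t = [m_ij * I(w_ij)]: task index
M_next = 1 - I                         # M_{t+1} = 1 - [I(w_ij)]

assert (rho_t == np.array([[0, 1], [0, 0]])).all()
assert (M_next == np.array([[1, 0], [1, 0]])).all()
```

Note how $\rho_t$ records exactly the connections this task used while $M_{t+1}$ frees only the still-unused ones for future tasks.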

The binary mask of connections indicates whether a connection between two linked layers is working. If the connection is not working, the relative weight resets to zero temporarily. The reason is that a neuron has two actions: to fire or not to fire (Dejean et al., 2016). It exists in both a drop-connect model (Wan, Zeiler, Zhang, Le Cun, & Fergus, 2013) and our work, but they are very different. The drop-connect model, a generalization of the drop-out model (Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov, 2012), is for regularizing large fully connected layers within neural networks through randomly dropping a subset of activations.

ANML (Beaulieu et al., 2020) employs a similar masking scheme, which refers to the flexibility to turn on and off activations in subsets of the prediction network conditioned on the input. For improving performance, the masking scheme leads to different selective activation and plasticity for different types of images.

In contrast, our work aims at handling various tasks and saving the trained models. The number of parameters necessary for fitting linear functions differs from that for nonlinear functions. In detail, if an element $m^{ij}$ in the mask matrix equals zero, then the weight $w^{ij}$ in the network is never tunable: since $\frac{\partial h}{\partial W} = \frac{\partial \sigma(\cdot)}{\partial \cdot} M x$, as shown in formula 3.4, if $m^{ij} = 0$, the gradient of $w^{ij}$ equals 0.

The ReLU activation function is used to address the vanishing gradient problem. Unlike the gradient of the sigmoid, which tends to shrink when passing through a deep neural network, ReLU is a piecewise linear function, so the gradient does not vanish. CENet is designed to classify all data sets learned over time, so we need a large hypothesis space (more nodes in the network). However, the complex model may suffer from overfitting, so we use the $L_1$ regularization term. This term leads the nodes to a sparsely linked network $M_t$. The nodes weighted by zero can be tuned in the next learning. Thus, we have a neural network that can be reused without forgetting.

### 3.2  Algorithm

#### 3.2.1  Revising Misclassification by BRL

The process of revising misclassification is to change labels that turn out to be wrong when a new task arrives. We demonstrate the process on the Iris data set in Figure 3. When the learner trains BRL for tasks, some of the classifiers in BRL may accept the same instance; BRL could even accept instances that should be rejected. The instances marked by circles in Figure 3a are mislabeled as "Versicolour." For this case, we make the revision in formula 3.6, where $\hat{y} = -1$ means that the instance $x$ is unknown and $\theta$ is a hyperparameter:
$\hat{y} = \begin{cases} C(x, \arg\max_{p \in P} Sim(x, p)) & \text{if } Sim(\cdot, \cdot) \geq \theta \\ -1 & \text{otherwise} \end{cases}$
(3.6)

Elements of BRL are required to measure the similarity $Sim(x, p)$ by quantitative functions. For example, suppose the classifier is trained with empirical risk minimization (ERM): $L_p(\hat{x}, x) = (\hat{x} - x)^2$. ERM assumes a minimal empirical error. If the error of the task is 0.05, each $x$ sampled from the same distribution has a high probability that the loss $L_p(\hat{x}, x)$ is smaller than an upper value, say, 0.5. We can then define the function $Sim(x_{t+1}, p) = 1 - L_p(\hat{x}, x)$ and set $\theta = 0.5$. Once an instance is accepted by more than one classifier $p$, we greedily maximize the similarity $Sim(\cdot, \cdot)$ and assume that the instance is recognized by the classifier with the highest similarity. In Figure 3b, after learning the class Virginica, these false-positive instances are revised by the incremental classifier in BRL.
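A toy sketch of this revision rule may clarify the routing; the prototype detectors, the similarity function, and the threshold below are illustrative assumptions:

```python
# Sketch of formula 3.6: route an instance to the most similar detector
# if its similarity clears theta, otherwise label it -1 (unknown).

def revise(x, detectors, classify, sim, theta=0.5):
    """Return a class label for x, or -1 if no detector is similar enough."""
    best = max(detectors, key=lambda p: sim(x, p))
    if sim(x, best) >= theta:
        return classify(x, best)   # classify with the model indexed by best
    return -1                      # reject: x is treated as unknown

# Toy usage: similarity = 1 - squared reconstruction error (the ERM example).
detectors = [0.0, 1.0]                       # stand-ins for each p's prototype
sim = lambda x, p: 1.0 - (x - p) ** 2
classify = lambda x, p: detectors.index(p)   # class = detector index
assert revise(0.1, detectors, classify, sim) == 0    # near prototype 0.0
assert revise(5.0, detectors, classify, sim) == -1   # far from all: unknown
```

When a later task adds a better-fitting detector, the same `max` step reassigns previously mislabeled instances, which is exactly the revision shown in Figure 3b.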

Figure 3:

Revising misclassification based on the Iris data set by BRL. The $x$-axis is the petal length of instance, and the $y$-axis the width. Circles mark mislabeled instances. Contours point to the incremental similarity from the light to the dark color. Each contour is a decision boundary.


#### 3.2.2  Algorithm 1

The detailed algorithm is given in algorithm 1.

Line 1 denotes a data structure $D$ in which each element $d_i$ is a tuple that includes a classifier $p_i$ and the instances $X_i$ accepted by $p_i$. Line 2 assigns initial values to $T$ and $\hat{Y}$. Line 3 divides the instances $X_t$ into two sets, $\tilde{D}_t$ and $\dot{D}_t$: each element of $\tilde{D}_t$ holds the instances accepted by one classifier in $P$, and the rejected instances are collected in $\dot{D}_t$, where $|\tilde{D}_t|=|P|$ and $|\dot{D}_t|$ always equals one. Lines 4 to 8 classify the accepted instances. Line 5 denotes that for every $d_i$ in $\tilde{D}_t$, we assign the accepted instances to $input_x$. Line 6 is important for recovering the sparse matrix $M$ as in formula 3.5. Line 7 retains the union of the set $\hat{Y}$ and the classes of $input_x$. Lines 9 to 15 are the training process. Line 10 updates $input_x$ from $\dot{D}_t$. Line 11 creates a new classifier $p_T$ based on $input_x$. Line 12 repeats the process of line 3, and line 13 trains CENet on the accepted instances. Line 14 is for assignments.

For example, at time $t=0$, we train the learner to recognize airplanes. $T=|P|=0$, and $\tilde{D}_t$ is empty, so we skip to line 8 and train on those instances. The next steps are to create a new classifier $p_0$, train on the instances accepted by $p_0$, evaluate $\gamma^*$ and $T$, and save the importance matrix $\rho_0$. At time $t=1$, the learner incrementally learns the automobiles. First, collect the instances accepted by $P$ in $\tilde{D}_t$ and the rejected ones in $\dot{D}_t$, respectively. Identify each instance accepted by a classifier $p$; in this case, there is only $p_0$. Then the learner predicts the accepted collection based on the sparse matrix $\rho_0$ and learns the rejected collection.

The time complexity of this algorithm is $O(|P|M)$, where $|P|$ is the number of classifiers in BRL and $O(M)$ is the time complexity of each classifier: we must enumerate all elements in BRL to distinguish the known instances and then predict their classes.
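The accept/reject partition that dominates this $O(|P|M)$ cost can be sketched as follows (our own minimal sketch of lines 3 and 10 of algorithm 1; `partition` and the toy scoring functions are hypothetical names):

```python
def partition(instances, brl, theta):
    """Split a batch into per-classifier accepted sets (D~) and a
    rejected set (D.), scoring every instance under all |P| classifiers:
    O(|P| * M) overall, with O(M) per classifier."""
    accepted = {i: [] for i in range(len(brl))}  # D~: one bucket per p_i
    rejected = []                                # D.: unknown instances
    for x in instances:
        scores = [sim(x) for sim in brl]         # score under every p_i
        best = max(range(len(brl)), key=lambda i: scores[i])
        if scores[best] >= theta:
            accepted[best].append(x)             # greedily take max Sim
        else:
            rejected.append(x)
    return accepted, rejected

# Toy BRL: two "classifiers" scoring closeness to 0 and to 10.
brl = [lambda x: 1 - abs(x - 0) / 10, lambda x: 1 - abs(x - 10) / 10]
acc, rej = partition([0.5, 9.8, 55.0], brl, theta=0.5)
print(acc, rej)   # 0.5 -> p_0, 9.8 -> p_1, 55.0 rejected
```

The accepted buckets feed CENet for classification; the rejected set becomes the training data for the next classifier.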

### 3.3  VC Dimension Analysis

In this section, we analyze the VC dimension of BRL and explain the reason for employing one large ANN rather than several smaller ones. The assumption behind BRL is that incrementing $|P|$ can provide a large hypothesis space for any task. Let the VC dimension of each classifier be $d$; we prove that assumption by showing that the VC dimension of BRL is positively correlated with $|P|$.

Lemma 1.
Let $P=\{p_i \mid i=1,\dots,T\}$ be a BRL. Assume that the VC dimension of each binary classifier, $VCdim(p_i)=d$, is greater than 3. Then
$VCdim(P)\le(dT+2)\frac{2e}{e-1}\log(dT+2)+1.$
Proof.
Let $p_i$ be a hypothesis class with $VCdim(p_i)=d$, and let $X=\{x_1,x_2,\dots,x_m\}$ be a set that is shattered by $P$. By the Sauer-Shelah-Perles lemma (Sauer, 1972; Shalev-Shwartz & Ben-David, 2014), the growth function of each $p_i$ is bounded as $\tau_{p_i}(m)\le(em/d)^d$. There are therefore at most $(em/d)^{dT}$ ways to pick $T$ hypotheses $(p_1(x),p_2(x),\dots,p_T(x))$. Rewrite formula 3.2 as follows:
$P(x_j)=\bigcup_{i=1}^{T}p_i(x_j)=\mathrm{sgn}\Big(\sum_{i=1}^{T}p_i(x_j)\Big).$
Here $\mathrm{sgn}$ is the sign function. $\mathrm{sgn}\big(\sum_{i=1}^{T}p_i(x_j)\big):=\sum_{i=1}^{T}p_i(x_j)-\delta\ge 0$ is a function of the single variable $\delta$; hence its $VCdim=2$ and $\tau_{\mathrm{sgn}}(m)\le(em/2)^2$. We can then construct the upper bound
$2^m\le(em/d)^{dT}(em/2)^2\;\Rightarrow\;2^m\le m^{dT+2}\;\Rightarrow\;m\le\frac{dT+2}{\log(2)}\log(m).$
Now we solve the inequality. Let $x=(dT+2)/\log(2)>1$; then $\exists\,\alpha>1$ such that $m\le\sup\{m\}\le\widehat{\sup}\{m\}=x\log(x^{\alpha})$. Since $m\le x\log(x\log(m))=x\log(x^{\alpha})+x\log\big(\log(m)/x^{\alpha-1}\big)$, $\exists\,\alpha$ such that $x\log\big(\log(m)/x^{\alpha-1}\big)\le 0$. For a tighter boundary, let
$f(x,\alpha):=\widehat{\sup}\{m\}-x\log\big(\widehat{\sup}\{m\}\big)=x\log(x^{\alpha})-x\log\big(x\log(x^{\alpha})\big)=x\log\Big(\frac{x^{\alpha-1}}{\alpha\log(x)}\Big)=0.$
The extreme point of $f(x,\alpha)$ is the stationary point of $g(x,\alpha):=x^{\alpha-1}-\alpha\log(x)$. $g(x,\alpha)$ is an increasing function with respect to $\alpha$ if $\alpha>1$ and $x>1$:
$\frac{\partial g}{\partial\alpha}=(x^{\alpha-1}-1)\log(x);\qquad\frac{\partial g}{\partial x}=(\alpha-1)x^{\alpha-2}-\frac{\alpha}{x}.$
Let $\frac{\partial g}{\partial x}=(\alpha-1)x^{\alpha-2}-\frac{\alpha}{x}=0$. Then:
$\log(x)=\frac{\log\frac{\alpha}{\alpha-1}}{\alpha-1};\quad x^{\alpha-1}=\frac{\alpha}{\alpha-1}\;\Rightarrow\;\frac{\alpha}{\alpha-1}=\frac{\alpha}{\alpha-1}\log\frac{\alpha}{\alpha-1}\;\Rightarrow\;\alpha=\frac{e}{e-1}.$
At this point, $x_0=e^{e-1}$ and $\alpha_0=\frac{e}{e-1}$. Then for all $x_1$ and $\alpha_1\le\alpha_0\le\alpha_2$, we have $f(x_0,\alpha_1)\le f(x_0,\alpha_0)=0\le f(x_1,\alpha_0)\le f(x_1,\alpha_2)$. Since (1) $f(m)$ is an increasing function with respect to $m$, (2) $\forall x>1$, $\widehat{\sup}\{m\}=x\log(x^{\alpha_0})$, and (3) $f(\widehat{\sup}\{m\})\ge0$, it follows that $m\le\sup\{m\}\le\widehat{\sup}\{m\}$ such that $f(m)\le0$, and $m$ can be bounded:
$m\le\frac{dT+2}{\log(2)}\cdot\frac{e}{e-1}\log\frac{dT+2}{\log(2)}\le(dT+2)\frac{2e}{e-1}\log(dT+2)+1.$
(3.7)
Therefore,
$VCdim(P)\le(dT+2)\frac{2e}{e-1}\log(dT+2)+1.$
(3.8)
Lemma 1 gives an assurance that BRL can shatter a finite set when $T$ is sufficiently large. For instance, let $p$ be an autoencoder and $|W|$ the number of weights of $p$. The VC dimension of an ANN with the ReLU activation function is $O(|W|L\log(|W|))$ (Bartlett, Harvey, Liaw, & Mehrabian, 2017), where $|W|$ is the number of weights and $L$ is the number of layers. The VC dimension of BRL is then $O(T\,VCdim(p)\log(T\,VCdim(p)))$:
$O(T\,VCdim(p)\log(T\,VCdim(p)))=O\big(T|W|L\log(|W|)\log(T|W|L\log(|W|))\big)=O\big(|W|L\log(|W|)\log(|W|L\log(|W|))\big)=O\big(|W|L\log(|W|L)\big).$
(3.9)
In words, the expressiveness of BRL is similar to that of an ANN. Even within one learning task, BRL can grow one or more classifiers $\{p_i\}$; each $p_i$ moves several samples from $\dot{X}_t$ to $\tilde{X}_t$, and growth finishes when the set $\dot{X}_t$ is scarce or empty. The reason to use one large ANN is that the VC dimension of an ANN with a large $|W|$ is larger than that of several small networks with the same total number of weights: $O(|W|L\log(|W|))$ may be greater than $O(\sum_i|W_i|L\log(|W_i|))$ with respect to $\sum_i|W_i|=|W|$.
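As a quick numeric sanity check on the bound in lemma 1 (our own evaluation, with illustrative values of $d$ and $T$; `vc_bound` is a hypothetical helper), the bound grows slightly faster than linearly in $T$ because of the log factor:

```python
import math

def vc_bound(d, T):
    """Upper bound from lemma 1: (dT+2) * 2e/(e-1) * log(dT+2) + 1."""
    e = math.e
    return (d * T + 2) * (2 * e / (e - 1)) * math.log(d * T + 2) + 1

# The hypothesis space keeps growing as classifiers are added.
for T in (1, 10, 100):
    print(T, round(vc_bound(d=4, T=T)))
```

This is the quantitative content of the assumption that incrementing $|P|$ supplies a large enough hypothesis space for any new task.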

## 4  Experiments and Results

In this section, we present evaluations and shortcomings of our work. Evaluations over the compared approaches cover (1) the correctness of the probabilistic interpretation, (2) the cost of storage and performance, and (3) the tendency of storage cost and F1 score with different numbers of features. The goal of this work remains to lower the cost of storage in open world classification. We need to reconstruct the data to make sure that unknown classes arrive over time.

### 4.1  Experimental Setup

#### 4.1.1  Data

To evaluate the proposed BRL, we need a sequence of tasks sampled from different distributions. In this case, we encode each instance from cifar-10 (Krizhevsky & Hinton, 2009) into a 20-dimensional vector (from 512 dimensions by VGG (Simonyan & Zisserman, 2014) to 20 dimensions by an autoencoder). We divide the training set into five subsets, each containing two classes, in which the data are shuffled. The five subtasks are treated as streaming data for training. After training each subtask, we predict the test set, which contains instances of all 10 classes. For example, subtask 1 includes instances labeled “airplane” and “automobile.” In the process of prediction, we annotate all instances of the classes “airplane” and “automobile” and label the others “negative.” The classifier needs to recognize the learned classes “airplane” and “automobile” as well as the unknown classes (the others).
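The carving of a labeled set into sequential two-class subtasks can be sketched as follows (a minimal numpy sketch with toy stand-in data, not the authors' pipeline; `make_subtasks` is a hypothetical helper):

```python
import numpy as np

def make_subtasks(features, labels, classes_per_task=2, seed=0):
    """Split a labeled set into sequential subtasks of two classes each,
    shuffling instances within every subtask (streaming order)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        chunk = classes[start:start + classes_per_task]
        mask = np.isin(labels, chunk)
        idx = rng.permutation(np.flatnonzero(mask))  # shuffle inside task
        tasks.append((features[idx], labels[idx]))
    return tasks

# Toy stand-in for the 20-dimensional encoded cifar-10 vectors.
X = np.random.default_rng(1).normal(size=(100, 20))
y = np.repeat(np.arange(10), 10)          # 10 classes, 10 instances each
tasks = make_subtasks(X, y)
print(len(tasks), [sorted(set(t[1])) for t in tasks])
```

With 10 classes and two classes per task this yields five subtasks, mirroring the cifar-10 setup in the text.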

Our hardware details are as follows. CPU: Intel Core i7-7700; RAM: 8.00 GB; GPU: none; operating system: Windows 10.

Five metrics are used in the experiments:

1. Accuracy: $Acc=\frac{TP+TN}{TP+TN+FP+FN}$

2. Precision: $P=\frac{TP}{TP+FP}$

3. Recall: $R=\frac{TP}{TP+FN}$

4. F1: $F1=\frac{2TP}{2TP+FP+FN}=\frac{2PR}{P+R}$

5. Jaccard measure: $J=\frac{TP}{TP+FP+FN}$

In these formulas, T and F mean, respectively, true or false, and P and N mean, respectively, positive and negative. TP means the number of correctly classified positive instances.
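These five metrics follow directly from the confusion counts; as a quick sketch (our own helper, not from the letter's code):

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1, and Jaccard from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)   # equals 2PR / (P + R)
    j = tp / (tp + fp + fn)
    return acc, p, r, f1, j

acc, p, r, f1, j = metrics(tp=80, tn=90, fp=20, fn=10)
print(acc, p, round(r, 3), round(f1, 3), round(j, 3))
```

Note that the Jaccard measure is the F1 numerator's TP divided by everything except TN, so it penalizes both kinds of error at once.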

The accuracy used in section 4.5 is for interpreting formula 2.1: we do not care how many negative instances are classified as positive, only the scale of correct recognition (accuracy).

Precision, recall, and F1 are used to evaluate the task of binary classification. Macro-average performance is the average over the subtasks. In section 4.2, we show details of the ratio of accepted TP instances and rejected TN instances. The ratio of acceptance can be evaluated by recall $R$, and the ratio of rejection can be measured by precision $P$. In section 4.4, we explain why increasing the number of classifiers can reach the higher recall of BRL shown in lemma 1 with a smaller drop in full performance; accordingly, we evaluate the performance there by $R$ and F1. The Jaccard measure (Amigó, Gonzalo, & Verdejo, 2013; Zabihimayvan, Sadeghi, Rude, & Doran, 2017) is also reported.

#### 4.1.2  Compared Approaches

All approaches are listed: (1) one-class SVM with nonlinear kernel (OCSVM; Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 2001), (2) EllipticEnvelope (EE; Rousseeuw & Driessen, 1999), (3) Isolation Forest (iForest; Liu, Ting, & Zhou, 2008), (4) K-Nearest Neighbor (KNN; Ramaswamy et al., 2000), (5) Local Outlier Factor (LOF; Breunig, Kriegel, Ng, & Sander, 2000), (6) Autoencoder (AE; Hinton & Salakhutdinov, 2006), (7) CL-cbsSVM (Fei et al., 2016), (8) Random-Erasing (Zhong, Zheng, Kang, Li, & Yang, 2017), (9) OLTR (Liu et al., 2019), (10) SENCForest (Mu et al., 2017), (11) SENNE (Cai et al., 2019), (12) Expert Gate (Aljundi et al., 2017), and (13) DGM (Ostapenko et al., 2019). Approaches 7 to 13 are baselines.

A classifier in the open world has two characteristics: (1) incremental learning without retraining and (2) recognizing (rejecting) unknown instances. Retraining implies a costly learning process, since the historical data need to be relearned. The ability to recognize unknown instances affects performance, since unknown instances keep emerging in the testing set. We therefore group the compared approaches by the first characteristic and evaluate their performance by the metrics.

• Without retraining: Approaches 1 to 6 are methods for detecting outliers. These outlier detection methods are employed to implement BRL. For example, OCSVM+BRL+CENet, sometimes shortened to OCSVM, refers to the LL classification with a CENet and a BRL in which each $p_t$ is an OCSVM outlier detector. The purpose of comparing approaches 1 to 6 is to find a better method for rejecting unknown classes in our proposed framework. Approaches 10 and 11 are streaming emerging new class methods, which are very similar to ours but cannot benefit from model-based classification methods. Approaches 12 and 13 are continual learning methods, a kind of lifelong learning, which lack the reject option.

• With retraining: Approach 7 is the incremental method addressing lifelong classification in the open world. Approach 8, which works in the closed world, is a typical classification method, so retraining is necessary and it is unable to recognize instances of unknown classes. Approach 9 is a method of open set recognition.

We set $γ=1$. Then, during the learning process, no parameter should be tuned to minimize human efforts.

The following list identifies the hyperparameters of each compared approach:

• OCSVM. OCSVM is imported from scikit-learn with the following settings: $\theta=-1$ and $\nu=0.95o+0.05$, where the outlier fraction $o$ equals 0.15.

• EE and iForest. We set the contamination of EE and iForest, which are also imported from scikit-learn, to $o$, and $\theta=-1$. In iForest, max_samples equals 100 and bootstrap is True.

• KNN and LOF. KNN and LOF are imported from Pyod (Zhao, Nasrullah, & Li, 2019). The settings are the defaults, with $\theta=-1$ for KNN and $\theta=-2$ for LOF.

• AE. The autoencoder is formed by an encoder (one input layer and one hidden layer with 20 units) and a mirrored decoder. The activation function is ReLU, and the optimizer is Adam (Kingma & Ba, 2015) with learning rate $1e{-3}$; $\theta=0.5$.

• CL-cbsSVM. We set the hyperparameters of CL-cbsSVM as $λ=0.8$ and $θ=0.5$.

• Random-Erasing. Source code can be found on GitHub.

• OLTR. We set the hyperparameters of OLTR as follows: num_classes $=10$ and open_threshold $=0.95$.

• SENCForest. $\alpha=\beta=1$, the number of trees is 20, and the number of subsamples for each class is 100, balancing computation and performance.

• SENNE. $\phi$ is the number of subsamples for each class; the fine-tuned score threshold is 0.88.

• Expert Gate. Source code can be found on GitHub. Oracle Expert Gate refers to the setting in which the gate's prediction is always correct.

• DGM. Source code can be found on GitHub. DGM requires inputting the classes one by one; thus, we replace the class label for each subtask with the subtask number. For example, there are two classes, 0 and 1, in subtask 1; the adjusted label for DGM is 1.

### 4.2  Storage Cost and Performance

In this section, we present the recall, precision, and storage cost for each approach. The storage cost is the real file size. We use the function $dump$ in the pickle module to convert a Python object into a byte stream and save it as a file.
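The file-size measurement described above can be sketched with the standard library alone (our own sketch; the toy "models" only illustrate why instance-based methods cost more than weight-based ones):

```python
import pickle

def storage_mb(model):
    """Serialized size of a model, as the letter measures storage cost."""
    return len(pickle.dumps(model)) / (1024 * 1024)

# Toy "models": a weight list vs. one that memorizes training instances.
weights_only = {"weights": [0.0] * 1_000}
instance_based = {"instances": [[0.0] * 20 for _ in range(5_000)]}
print(f"{storage_mb(weights_only):.3f} MB vs {storage_mb(instance_based):.3f} MB")
```

This is why AE (weights only) is cheap while KNN and LOF, which retain per-instance structures, dominate the storage column of Table 2.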

Table 2 demonstrates that LOF obtains better performance in both recall and precision than the others on all five subtasks. Recall declines over time for models including OCSVM, EE, iForest, and AE, which illustrates that $p_t$ in BRL rejects positive instances whose classes are learned at time $t$ and accepts a part of the negative instances whose classes are learned at other times. In contrast, the trend of recall for CL-cbsSVM and KNN is incremental. The reason for the improvement of CL-cbsSVM is retraining: the classification in the open world is altered to the closed world from subtask 1 to subtask 5. The improvement of KNN is due to the fact that its decision function differs from that of the other methods. In Pyod, the KNN score is the distance between a test instance and its k-nearest neighbors rather than a density-based measure such as LOF. If an instance can be recognized by a retraining-based KNN, it can be recognized correctly by the incremental KNN-BRL, since the distance is constant; the density-based measures cannot guarantee that. For example, suppose an instance from an unknown class lies inside the accepted boundary of $p_t$ at time $t$. It is an FN instance; thus, the recall is degraded. At time $t+1$, the class of this instance is learned, and the score of $p_{t+1}$ is greater than that of $p_t$. Thus it becomes a TP instance, and recall and precision both increase.

Table 2:

Macro-Average Performance on cifar-10.

Each R/P/J column group corresponds to subtasks 1 through 5, from left to right.

| Classifier | R | P | J | R | P | J | R | P | J | R | P | J | R | P | J | Storage cost (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AE+BRL+CENet | 0.85 | 0.73 | 0.60 | 0.83 | 0.79 | 0.65 | 0.80 | 0.82 | 0.66 | 0.75 | 0.82 | 0.62 | 0.76 | 0.85 | 0.65 | 0.45 |
| EE+BRL+CENet | 0.80 | 0.71 | 0.58 | 0.71 | 0.68 | 0.51 | 0.76 | 0.81 | 0.62 | 0.72 | 0.79 | 0.59 | 0.68 | 0.82 | 0.59 | 0.53 |
| KNN+BRL+CENet | 0.83 | 0.72 | 0.56 | 0.88 | 0.81 | 0.71 | 0.86 | 0.84 | 0.72 | 0.91 | 0.92 | 0.84 | 0.93 | 0.93 | 0.88 | 10.69 |
| LOF+BRL+CENet | 0.96 | 0.97 | 0.92 | 0.92 | 0.94 | 0.86 | 0.91 | 0.92 | 0.84 | 0.91 | 0.91 | 0.83 | 0.90 | 0.95 | 0.86 | 17.26 |
| iForest+BRL+CENet | 0.86 | 0.96 | 0.81 | 0.68 | 0.71 | 0.52 | 0.56 | 0.57 | 0.39 | 0.63 | 0.70 | 0.50 | 0.58 | 0.72 | 0.47 | 2.38 |
| OCSVM+BRL+CENet | 0.86 | 0.94 | 0.81 | 0.77 | 0.79 | 0.63 | 0.75 | 0.79 | 0.62 | 0.77 | 0.83 | 0.64 | 0.73 | 0.90 | 0.67 | 1.58 |
| Random-Erasing | 0.66 | 0.29 | 0.29 | 0.30 | 0.27 | 0.07 | 0.30 | 0.42 | 0.15 | 0.27 | 0.29 | 0.12 | 0.27 | 0.39 | 0.16 | – |
| SENCForest | 0.96 | 0.91 | 0.88 | 0.53 | 0.55 | 0.48 | 0.34 | 0.39 | 0.31 | 0.24 | 0.30 | 0.21 | 0.19 | 0.17 | 0.17 | – |
| SENNE ($\phi=10$) | 0.76 | 0.86 | 0.64 | 0.85 | 0.81 | 0.68 | 0.86 | 0.77 | 0.66 | 0.76 | 0.71 | 0.59 | 0.87 | 0.68 | 0.66 | – |
| SENNE ($\phi=50$) | 0.96 | 0.94 | 0.90 | 0.94 | 0.87 | 0.82 | 0.93 | 0.82 | 0.76 | 0.80 | 0.70 | 0.62 | 0.87 | 0.64 | 0.62 | – |
| SENNE ($\phi=100$) | 0.96 | 0.94 | 0.90 | 0.93 | 0.88 | 0.82 | 0.92 | 0.82 | 0.75 | 0.79 | 0.69 | 0.61 | 0.87 | 0.61 | 0.59 | – |
| Expert Gate | 0.32 | 0.66 | 0.32 | 0.57 | 0.76 | 0.53 | 0.38 | 0.56 | 0.37 | 0.30 | 0.43 | 0.29 | 0.18 | 0.39 | 0.17 | 47.23 |
| Oracle Expert Gate | 0.66 | 0.66 | 0.65 | 0.76 | 0.76 | 0.72 | 0.82 | 0.82 | 0.78 | 0.85 | 0.85 | 0.81 | 0.96 | 0.96 | 0.92 | 47.23 |
| DGM | 0.10 | 0.50 | 0.10 | 0.22 | 0.41 | 0.21 | 0.26 | 0.33 | 0.16 | 0.32 | 0.36 | 0.22 | 0.69 | 0.56 | 0.36 | 4.58 |
| CL-cbsSVM$^\Phi$ | 0.50 | 0.10 | 0.10 | 0.74 | 0.73 | 0.55 | 0.70 | 0.65 | 0.49 | 0.62 | 0.75 | 0.49 | 0.56 | 0.81 | 0.59 | 25.77 |
| OLTR$^\Phi$ | 0.63 | 0.52 | 0.41 | 0.84 | 0.81 | 0.71 | 0.84 | 0.81 | 0.70 | 0.84 | 0.89 | 0.76 | 0.65 | 0.79 | 0.63 | 30.37 |
| Random-Erasing$^\Phi$ | 0.66 | 0.29 | 0.29 | 0.68 | 0.35 | 0.33 | 0.79 | 0.54 | 0.52 | 0.82 | 0.70 | 0.75 | 0.95 | 0.94 | 0.90 | 33.8 |
| SENCForest$^\Phi$ | 0.96 | 0.92 | 0.88 | 0.94 | 0.84 | 0.80 | 0.90 | 0.81 | 0.73 | 0.93 | 0.82 | 0.76 | 0.97 | 0.82 | 0.80 | – |
| SENNE ($\phi=50$)$^\Phi$ | 0.96 | 0.94 | 0.90 | 0.94 | 0.87 | 0.82 | 0.93 | 0.82 | 0.76 | 0.80 | 0.70 | 0.62 | 0.87 | 0.64 | 0.63 | – |

Notes: The number of features is 20. R: recall; P: precision; J: Jaccard measure. $\Phi$: the method suffered from retraining. From subtask 1 to subtask 5, the scale of unknown instances emerging in the prediction stage decreases.

On the other hand, LOF prefers to reject (accumulating FN), but KNN prefers to accept (accumulating FP). The performance of these methods is sensitive to the decision boundary.

However, the storage cost of LOF is 17.26 MB, 38 times the cost of AE. The cost of CL-cbsSVM is the highest, because it needs to retain the historical data sets and retrain on them. LOF needs to retain the local density for each point, and KNN needs to retain the k-d tree, so their costs are higher. OCSVM is supported by a set of support vectors, and iForest makes decisions through a set of isolation trees; in contrast to EE, which assumes that the data distribution is gaussian, and AE, which stores only weights, OCSVM and iForest are expensive.

### 4.3  More Details with Different Numbers of Features

In this section, the number of incremental classifiers in BRL is limited to one per subtask, and the number of features varies over [10, 20, 50, 100, 200, 500, 1000]. The threshold of similarity $\theta$ for each approach is fixed.

The relationship between the storage cost and the number of features is shown in Figure 4. The ranking of storage cost across approaches is almost stable. The costs of AE- and iForest-based LL classification are the cheapest. As the number of features increases, the cost per feature declines; the same phenomenon can be observed in KNN- and LOF-based LL classification.
Figure 4:

Storage cost with various numbers of features.


Our work improves learning efficiency. During the learning process, CL-cbsSVM needs to create $14(\pm1)$ SVM classifiers for the five subtasks. Moreover, its number of training instances increases: if there are 10,000 (“positive”) instances for classifier $A$ and 2000 new (“negative”) instances are accepted by $A$, then $A$ needs to be retrained with 12,000 instances. The other approaches create just five classifiers, and each instance is trained once. As for time cost, AE is smooth, while the cost of the others rises with the growing number of features: OCSVM takes 2, 4, and 10 minutes for 200, 500, and 1000 features, respectively; LOF and KNN take 1, 8, and 20 minutes.

F1 scores are shown in Figure 5. When the number of features ($N$) equals 10, we observe the worst F1 scores of KNN-, LOF-, AE-, and OCSVM-based LL classification. For example, the F1 score of OCSVM-based LL classification is 0.4 with $N=10$ and 0.8 with $N=20$. The reason is that instances embedded in the feature space are close together, so the decision boundaries of the elements in BRL overlap. As the number of features grows, precision rises, and recall first rises and then falls.
Figure 5:

F1 score with various numbers of features. Each subtask creates only one classifier in BRL. The F1 score is measured after all five tasks are learned.


The performance of the other approaches is poor because of unfit decision boundaries: a loose boundary implies low precision, while a tight one leads to low recall. Recall is the main gap between KNN and LOF on one side and AE and OCSVM on the other. When $N=1000$, their recalls are [0.94, 0.90, 0.79, 0.77], and their precisions are [0.94, 0.96, 0.97, 0.98]. Notice that the performance of CL-cbsSVM is similar to that of OCSVM in the beginning but then falls, which also means CL-cbsSVM is sensitive to the threshold $\theta$.

In a high-dimensional feature space, the interval between two decision boundaries is smaller, and the threshold $\theta$ is hard to tune. For example, the interval between the decision boundary with $\theta=1$ and the boundary with $\theta=0$ may be 0.2 (F1 score) in a 20-dimensional space but 0.8 in a 1000-dimensional space. By expanding the boundary of OCSVM, we can plot the relative F1 score in Figure 6. The circles point out a peak of the F1 scores over different $\theta$. The $\theta$ of the peak is closer to one, and the score falls away from it more rapidly in a higher-dimensional feature space. The best-performing $\theta$ is thus hard to fine-tune, and a poorly sized $\theta$ leads to low precision. Low precision means a large count of false positives; however, rejecting those false positives is key in open world classification.
Figure 6:

Tendency of F1 score with the incremental $θ$ based on OCSVM-based LL classification. The circle marks the peak F1 score. The vertical black line implies omissions.


### 4.4  Improvements of More Binary Classifiers

The learner can create several $p_t$s with a tight boundary in BRL at time $t$ to improve performance; however, a tight boundary can shrink the overlapped areas. In this section, we identify the relation between the parameter $\gamma$ (in algorithm 1) and performance based on OCSVM. Results for recall, F1 score, and the number of classifiers in BRL are listed in Table 3. In most cases, $\gamma=0.005$ provides the highest performance by spending more classifiers. The improvement in recall is over 10% in cases with three times the number of classifiers; normally, extra classifiers bring improvement. Notice that there are some abnormal cases. For instance, with 500 features and $\gamma=0.1$, one extra classifier leads to increased recall but a decreased F1 score. Furthermore, with 1000 features and $\theta=-1$, there is no increase in the number of classifiers, while with $\theta=0$ the increase is notable. The fixed threshold of similarity $\theta$ cannot cope with the rapid change in a high-dimensional feature space: when the threshold $\theta$ falls from $a$ to $b$, the extended area of the decision boundary increases with the number of features. The same behavior emerged for AE. Though we can observe some improvements, a stable boundary is necessary.

Table 3:

Details of Recall (R), F1 Score (F), and Storage Cost (C, MB) on OCSVM+BRL+CENet.

Columns are grouped by $\gamma=0.005$, $\gamma=0.01$, $\gamma=0.1$, and $\gamma=0.2$, from left to right.

| D | R | F1 | # | C | R | F1 | # | C | R | F1 | # | C | R | F1 | # | C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 0.39 | 0.46 | 13 | 0.93 | 0.38 | 0.45 | 10 | 0.93 | 0.34 | 0.43 | | 0.87 | 0.32 | 0.41 | | 0.85 |
| 20 | 0.88 | 0.88 | 15 | 1.82 | 0.88 | 0.88 | 15 | 1.82 | 0.83 | 0.86 | | 1.76 | 0.73 | 0.81 | | 1.58 |
| 50 | 0.93 | 0.93 | 15 | 4.30 | 0.92 | 0.93 | 12 | 4.27 | 0.89 | 0.92 | | 4.19 | 0.77 | 0.86 | | 3.79 |
| 100 | 0.92 | 0.93 | 14 | 8.61 | 0.90 | 0.93 | 12 | 8.54 | 0.91 | 0.85 | | 8.31 | 0.74 | 0.85 | | 7.50 |
| 200 | 0.91 | 0.92 | 13 | 17.17 | 0.88 | 0.92 | 10 | 16.94 | 0.85 | 0.91 | | 16.59 | 0.73 | 0.84 | | 14.89 |
| 500 | 0.84 | 0.87 | | 62.12 | 0.84 | 0.87 | | 62.12 | 0.76 | 0.82 | | 52.81 | 0.74 | 0.85 | | 49.04 |
| 1000 | 0.77 | 0.87 | | 127.07 | 0.77 | 0.87 | | 127.07 | 0.77 | 0.87 | | 127.07 | 0.77 | 0.87 | | 127.07 |
| $*$1000 | 0.84 | 0.87 | 30 | 283.21 | 0.84 | 0.88 | 21 | 281.20 | 0.85 | 0.88 | 12 | 227.54 | 0.85 | 0.87 | | 193.55 |

Notes: # is the number of binary classifiers in BRL. The number of features (D) and $\gamma$ are listed; $\theta=-1$. $*$: $\theta=0$, for creating more classifiers and tightening the decision boundary.

The storage cost also increases in this experiment, since added classifiers imply that more instances are retrained: with $k$ added classifiers, $\sum_{i=1}^{k}m(\gamma^*)^{i-1}$ instances are trained. Thus, the cost per classifier shows a negative correlation with $k$.
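This geometric decay in training volume can be evaluated directly (our own sketch, with illustrative values of $m$ and $\gamma$; `trained_instances` is a hypothetical helper):

```python
def trained_instances(m, gamma, k):
    """Total instances trained when k classifiers are grown in one task:
    the i-th classifier sees m * gamma^(i-1) instances (geometric decay)."""
    return sum(m * gamma ** (i - 1) for i in range(1, k + 1))

m, gamma = 10_000, 0.5
for k in (1, 2, 4):
    total = trained_instances(m, gamma, k)
    print(k, total, total / k)   # per-classifier cost falls as k grows
```

Because the series is geometric, the total is bounded by $m/(1-\gamma)$ no matter how many classifiers are added, which is why extra classifiers are comparatively cheap.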

### 4.5  Correctness of Probabilistic Interpretation

To validate formula 2.1, an important assumption in this work, we design an experiment involving two probabilities. The first is the probability $Pr(b_k;x)$ that an instance is correctly recognized by BRL. The second is the conditional probability $Pr(a|b_k;x)$ that a recognized instance is correctly classified by CENet. The estimated performance (the estimated accuracy) is their product.

In Table 4, we can see that the estimated accuracy keeps pace with the true accuracy of the learner. Based on LOF, the learner labels 9751 instances (1848 instances of “airplane” and “automobile” and 7903 others) correctly after training subtask 1 and feeds those instances to CENet for prediction. The accuracy of LOF+BRL is $Pr(p_i;x)$: given an instance $x$, the probability that $x$ is accepted by $p_i$. $Pr(y|p_i;x)$ refers to the accuracy of CENet: instances accepted by $p_i$ can be correctly classified by a suitable model whose weights are indexed by $p_i$, as shown in Figure 1. The estimated accuracy on subtask 1 is 0.96 ($=0.98\times0.98$). The error within 2% means that the instances mislabeled by BRL are also misclassified by CENet. The substantial improvement in accuracy for KNN on subtask 4 is due to the reduced number of unknown instances: after revising prior false positives, the incremental classifiers have no extra acceptable negatives.

Table 4:

Accuracy of Each Subtask under LL on cifar-10.

| Classifier | Subtask 1 | Subtask 2 | Subtask 3 | Subtask 4 | Subtask 5 |
| --- | --- | --- | --- | --- | --- |
| AE+BRL | 0.78 | 0.77 | 0.75 | 0.75 | 0.76 |
| KNN+BRL | 0.74 | 0.82 | 0.81 | 0.91 | 0.93 |
| LOF+BRL | 0.98 | 0.94 | 0.92 | 0.91 | 0.90 |
| OCSVM+BRL | 0.94 | 0.79 | 0.74 | 0.76 | 0.73 |
| CENet | 0.98 | 0.97 | 0.96 | 0.97 | 0.97 |
| AE+BRL+CENet | 0.78 (−0.02) | 0.77 (−0.02) | 0.74 (−0.02) | 0.74 (−0.01) | 0.75 (−0.01) |
| KNN+BRL+CENet | 0.70 (+0.03) | 0.80 (−0.00) | 0.78 (−0.00) | 0.89 (−0.01) | 0.91 (−0.01) |
| LOF+BRL+CENet | 0.98 (−0.02) | 0.93 (−0.02) | 0.91 (−0.03) | 0.90 (−0.02) | 0.89 (−0.02) |
| OCSVM+BRL+CENet | 0.94 (−0.02) | 0.78 (−0.01) | 0.74 (−0.03) | 0.76 (−0.02) | 0.72 (−0.01) |

Notes: The rows in the top panel show the accuracy obtained by BRL only. The middle row shows the accuracy obtained by CENet only. The rows in the bottom panel show the accuracy obtained by algorithm 1, together with the signed difference between the true accuracy and the estimated accuracy according to formula 2.1. For example, 0.74 is the accuracy of KNN+BRL, 0.98 is the accuracy of CENet, 0.70 is the true accuracy, and $0.98\times0.74=0.70+0.03$ is the estimated accuracy; $+0.03$ is the overestimation.

### 4.6  Visualization

This experiment demonstrates the visualization of the used connections in CENet and the belief revision on MNIST. Each instance (a digit) in the testing set is plotted in Figure 7. We split the training data of classes 0 to 9 of MNIST into five subsets. Each subtask contains the training data of two digits, such as 0-1 and 2-3, and the test data of all 10 digits. Each instance is encoded from 784 dimensions to 28 dimensions. Each classifier $p$ in BRL is an autoencoder with $\theta=0.5$ and $\gamma=1$; CENet is a three-layer ANN whose output layer is a softmax layer ($28\times1024\times10$).

Figure 7 shows the predicted performance on the full MNIST test data. Figure 7a shows the 10 digits classified into five colors: classes 0-3 and an unknown class (black). Some of the unlearned digits (such as the 4 in the right corner) are mislabeled as 0, 2, or 3. After training on the data of 4 and 5, many mislabeled digits are revised, and the instances of 9 are classified into the adjacent class, 4. We can observe that those incorrect beliefs (mislabeled digits) can be revised through learning new tasks. The employment of an ANN that learns more classes at one time improves efficiency.

Figure 7:

Results of classification. Test data are visualized using t-SNE (Maaten & Hinton, 2008).


For further information on the sparse matrix, we show a pseudocolor image of the masked weights in Figure 8. Figure 8a shows that active connections occupy a fraction (approximately 15%, highlighted) of the sites. The fraction of active connections for one subtask is nearly 3% in Figure 8b. Due to the compression by average pooling and subsampling, the colored area seems more abundant than the claimed 15%.
Figure 8:

Pseudocolor image of weights in $M$. We compress $M$ ($28×1024$ to $56×32$) by average pooling and subsampling and plot through linear interpolation. The color at the top of the color bar means more active connections, and the black area means unused connections.


### 4.7  Defects

Not maintaining historical data sets implies a lack of ability to adjust decision boundaries, so the result partly depends on the order in which tasks arrive. In this case, we can only tighten the boundary and create more classifiers. Furthermore, there is no guarantee that more classifiers lead to improvement in every case, as in the last row of Table 3.

Meanwhile, the increased number of classifiers in BRL raises the computational cost, because every instance must be checked by every element in BRL. This can be relieved through parallel computing.

## 5  Conclusion

In this letter, we have proposed an architecture for lifelong learning with limited storage in open-world classification. The cost of storage comes from retaining historical data sets and learned models. First, BRL does not need to retain historical data sets. Second, the number of weights used in CENet is reduced by the sparse matrix. Meanwhile, we employ outlier-detection methods as the classifiers in BRL. Results show that LOF (a density-based method) and KNN (an instance-based method) outperform the others, with accuracy above 0.85. AE and EE cost less storage space: less than one-third of the space of OCSVM and iForest and one-twentieth of that of LOF and KNN.

Further work is required to balance performance with computational cost. For methods such as LOF and KNN, the cost grows with the dimension of instances. The complexity of AE-based BRL remains steady, but its decision boundary is not as stable as that of OCSVM. Research on stable boundaries across various dimensions is therefore significant. A hierarchical structure for BRL is also necessary to accelerate the process of discerning instances into the known and unknown sets, and sparse recurrent neural networks or the neural Turing machine could be considered in place of CENet.

More interesting, this work can switch to the suitable learned model to solve problems in different domains. It can reasonably be used not only as an if statement but also as a for-loop to process hierarchical data structures: linking the output back to the input, the loop stops automatically once the input can no longer be recognized. For example, suppose there are an add-one function, a trained traditional classifier, and a BRL trained on the task of learning integers in the range 1 to 10, and the objective is to implement a loop like "for $i$ in range$(1,10)$." With the traditional classifier, the loop cannot be implemented because there is no break condition. With BRL, the initial input is $i=1$, and the output is the input plus one, $i=i+1$; once the output is 11, which BRL cannot recognize, the loop stops. For test instances, we can maintain a linked list of all elements in BRL together with the add-one function; then the suitable model can be determined after one sweep, regardless of the size of BRL in the lifelong learning process.
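The loop example above can be sketched in a few lines. Here `recognizes` stands in for BRL's accept/reject decision on a range the learner was trained on, and `run_loop` is a hypothetical name; both are illustrative assumptions rather than the paper's implementation.

```python
def recognizes(x):
    """Stand-in for BRL's decision: accept integers it was trained on (1..10)."""
    return 1 <= x <= 10

def run_loop(start=1):
    """Emulate `for i in range(1, 10)` with BRL rejection as the break condition."""
    visited = []
    i = start
    while recognizes(i):   # BRL accepts the input -> keep looping
        visited.append(i)
        i = i + 1          # the learned add-one model feeds output back as input
    return visited         # loop stops once the output (11) is rejected

print(run_loop())  # [1, 2, ..., 10]
```

The break condition is supplied by rejection rather than by an explicit bound, which is exactly what the traditional classifier in the example cannot provide.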

## References

Aljundi, R., Chakravarty, P., & Tuytelaars, T. (2017). Expert gate: Lifelong learning with a network of experts. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (pp. 7120–7129). Piscataway, NJ: IEEE.

Amigó, E., Gonzalo, J., & Verdejo, F. (2013). A general evaluation measure for document organization tasks. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 643–652). New York: ACM.

Bartlett, P. L., Harvey, N., Liaw, C., & Mehrabian, A. (2017). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv:1703.02930.

Beaulieu, S., Frati, L., Miconi, T., Lehman, J., Stanley, K. O., Clune, J., & Cheney, N. (2020). Learning to continually learn. In Proceedings of the 24th European Conference on Artificial Intelligence (vol. 325, pp. 992–1001). Amsterdam: IOS Press.

Bendale, A., & Boult, T. (2015). Towards open world recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1893–1902). Piscataway, NJ: IEEE.

Bengio, S., Pereira, F., Singer, Y., & Strelow, D. (2009). Group sparse coding. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, & A. Culotta (Eds.), Advances in neural information processing systems, 22 (pp. 82–89). Red Hook, NY: Curran.

Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 41–48). New York: ACM.

Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. ACM Sigmod Record, 29, 93–104.

Cai, X. Q., Zhao, P., Ting, K. M., Mu, X., & Jiang, Y. (2019). Nearest neighbor ensembles: An effective method for difficult problems in streaming classification with emerging new classes. In Proceedings of the IEEE International Conference on Data Mining (pp. 970–975). Piscataway, NJ: IEEE.

Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E. R., & Mitchell, T. M. (2010). Toward an architecture for never-ending language learning. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (vol. 5, p. 3). Palo Alto, CA: AAAI.

Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.

Chen, Z., & Liu, B. (2016). Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(3), 1–145.

Chen, Z., Ma, N., & Liu, B. (2018). Lifelong learning for sentiment classification. arXiv:1801.02808.

Chow, C. (1970). On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16, 41–46.

Da, Q., Yu, Y., & Zhou, Z. H. (2014). Learning with augmented class by exploiting unlabeled data. In Proceedings of the National Conference on Artificial Intelligence, 3(2), 1760–1766. Palo Alto, CA: AAAI Press.

Dejean, C., Courtin, J., Karalis, N., Chaudun, F., Wurtz, H., Bienvenu, T. C., & Herry, C. (2016). Prefrontal neuronal assemblies temporally control fear behaviour. Nature, 535(7612), 420–424.

Fei, G., Wang, S., & Liu, B. (2016). Learning cumulatively to become more knowledgeable. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1565–1574). New York: ACM.

Fei-Fei, L. (2006). Knowledge transfer in learning to recognize visual objects classes. In Proceedings of the Fifth International Conference on Development and Learning.

Geng, C., Huang, S.-j., & Chen, S. (2020). Recent advances in open set recognition: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(8).

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.

Isele, D., & Cosgun, A. (2018). Selective experience replay for lifelong learning. arXiv:1802.10269.

Jacobs, R. A., Jordan, M. I., Nowlan, S. E., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.

Janardan, & Mehta, S. (2017). Concept drift in streaming data classification: Algorithms, platforms and issues. Procedia Computer Science, 122, 804–811.

Javed, K., & White, M. (2019). Meta-learning representations for continual learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32 (pp. 1–15). Red Hook, NY: Curran.

Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Technical Report). Citeseer.

Kumar, A., & Daume III, H. (2012). Learning task grouping and overlap in multi-task learning. arXiv:1206.6417.

Lee, J. H., Durand, R., Gradinaru, V., Zhang, F., Goshen, I., Kim, D.-S., & Deisseroth, K. (2010). Global and local FMRI signals driven by neurons defined optogenetically by type and wiring. Nature, 465(7299), 788.

Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). Isolation forest. In Proceedings of the Eighth IEEE International Conference on Data Mining (pp. 413–422). Piscataway, NJ: IEEE.

Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., & Yu, S. X. (2019). Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE.

Maaten, L. v. d., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.

Masoudnia, S., & Ebrahimpour, R. (2014). Mixture of experts: A literature survey. Artificial Intelligence Review, 42(2), 275–293.

Mitchell, T., Cohen, W., Hruschka, E., Talukdar, P., Yang, B., Betteridge, J., … Kisiel, B. (2018). Never-ending learning. Communications of the ACM, 61(5), 103–115.

Mu, X., Ting, K. M., & Zhou, Z. H. (2017). Classification under streaming emerging new classes: A solution using completely-random trees. IEEE Transactions on Knowledge and Data Engineering, 29(8), 1605–1618.

Ostapenko, O., Puscas, M., Klein, T., Jahnichen, P., & Nabi, M. (2019). Learning to remember: A synaptic plasticity driven framework for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11313–11321). Piscataway, NJ: IEEE.

Pentina, A., & Lampert, C. H. (2015). Lifelong learning with non-iid tasks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 1540–1548). Red Hook, NY: Curran.

Ramaswamy, S., Rastogi, R., & Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. ACM Sigmod Record, 29, 427–438.

Rousseeuw, P. J., & Driessen, K. V. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41(3), 212–223.

Rudd, E., Jain, L. P., Scheirer, W. J., & Boult, T. (2017). The extreme value machine. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3).

Ruvolo, P., & Eaton, E. (2013). ELLA: An efficient lifelong learning algorithm. In Proceedings of the International Conference on Machine Learning (pp. 507–515). New York: ACM.

Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1), 145–147.

Scheirer, W. J., de Rezende Rocha, A., Sapkota, A., & Boult, T. E. (2013). Toward open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7), 1757–1772.

Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.

Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge: Cambridge University Press.

Shu, L., Xu, H., & Liu, B. (2018). Unseen class discovery in open-world classification. arXiv:1801.05609.

Silver, D. L., Yang, Q., & Li, L. (2013). Lifelong machine learning systems: Beyond learning algorithms. In Proceedings of the AAAI Spring Symposium: Lifelong Machine Learning. Palo Alto, CA: AAAI.

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.

Sun, Y., Kamel, M. S., Wong, A. K., & Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12), 3358–3378.

Tessler, C., Givony, S., Zahavy, T., Mankowitz, D. J., & Mannor, S. (2017). A deep hierarchical approach to lifelong learning in Minecraft. In Proceedings of the AAAI Conference on Artificial Intelligence (vol. 3, p. 6). Palo Alto, CA: AAAI.

Thrun, S. (1995). A lifelong learning perspective for mobile robot control. In Proceedings of the IEEE/RSJ/GI Conference on Intelligent Robots and Systems (pp. 201–214). Amsterdam: Elsevier.

Wan, L., Zeiler, M., Zhang, S., LeCun, Y., & Fergus, R. (2013). Regularization of neural networks using DropConnect. In Proceedings of the International Conference on Machine Learning (pp. 1058–1066). New York: ACM.

Yang, C., Yuan, K., Heng, S., Komura, T., & Li, Z. (2020). Learning natural locomotion behaviors for humanoid robots using human bias. IEEE Robotics and Automation Letters, 5(2), 2610–2617.

Yuan, M., & Wegkamp, M. (2010). Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11, 111–130.

Zabihimayvan, M., Sadeghi, R., Rude, H. N., & Doran, D. (2017). A soft computing approach for benign and malicious web robot detection. Expert Systems with Applications, 87, 129–140.

Zhao, Y., Nasrullah, Z., & Li, Z. (2019). PyOD: A Python toolbox for scalable outlier detection. Journal of Machine Learning Research, 20(96), 1–7.

Zhong, Z., Zheng, L., Kang, G., Li, S., & Yang, Y. (2017). Random erasing data augmentation. arXiv:1708.04896.