## Abstract

We explore classifier training for data sets with very few labels. We investigate this task using a neural network for nonnegative data. The network is derived from a hierarchical normalized Poisson mixture model with one observed and two hidden layers. With the single objective of likelihood optimization, both labeled and unlabeled data are naturally incorporated into learning. The neural activation and learning equations resulting from our derivation are concise and local. As a consequence, the network can be scaled using standard deep learning tools for parallelized GPU implementation. Using standard benchmarks for nonnegative data, such as text document representations, MNIST, and NIST SD19, we study the classification performance when very few labels are used for training. In different settings, the network's performance is compared to standard and recently suggested semisupervised classifiers. While other recent approaches are more competitive for many labels or fully labeled data sets, we find that the network studied here can be applied with numbers of labels so small that no other system has been reported to operate in this regime so far.

## 1 Introduction

Large data sets (e.g., in the form of digital texts, images, sounds, or medical measurements) are becoming increasingly ubiquitous. Classification of such data has long been identified as a central task of machine learning because of its many practical applications. If data sets are fully labeled, standard deep neural networks (DNNs), such as multilayer perceptrons (compare Rosenblatt, 1958; Ivakhnenko & Lapa, 1965) and their many modern versions, are often the method of choice. For many current benchmarks, large DNNs show state-of-the-art performance and can often exceed human abilities in specific data domains (see Schmidhuber, 2015; Bengio, Courville, & Vincent, 2013; Hinton et al., 2012, for reviews). However, the creation of fully labeled data sets becomes increasingly costly with the number of data points. While acquisition of the data points themselves is usually relatively easy (e.g., consider digital photos or the recording of sounds), correct labeling of the acquired data requires the availability of ground truth or a human who can hand-label the data. Neither ground truth nor labels provided by humans are available for most data sets, however. Furthermore, human labels may be erratic, especially for large data sets. The same applies to automatic or semiautomatic procedures that can provide labels. Depending on the data, acquiring even a few labels can be very costly. If, for example, data points consist of a set of medical measurements, a label in the form of a diagnosis requires the time and knowledge of a human medical expert. Hence, by considering large data sets or data sets with considerable cost per label, the following research question naturally emerges: How can a good classifier be trained for data sets with only few labels?

Indeed, classifiers leveraging information from labeled and unlabeled data points (semisupervised classifiers) have in recent years shifted into the focus of many research groups (Liu, He, & Chang, 2010; Weston, Ratle, Mobahi, & Collobert, 2012; Pitelis, Russell, & Agapito, 2014; Kingma, Mohamed, Rezende, & Welling, 2014; Forster, Sheikh, & Lücke, 2015; Rasmus, Berglund, Honkala, Valpola, & Raiko, 2015; Miyato, Maeda, Koyama, Nakae, & Ishii, 2016). If classifiers can successfully be trained using just few labels, they enable applications in many practically relevant settings. For example, classifiers for new data sets can be obtained in a very limited amount of time if labels for only few data points need to be provided; classifiers that perform poorly because of erratic labels in large training sets can be replaced by classifiers that are trained on the same data using only a few reliable labels; in settings where a stream of unlabeled data is constantly available (e.g., frames of a video, texts in chat rooms), a classifier could be obtained online if humans provide a few labels interactively. In all these examples, the total number of labels required to train the classifier is the main factor that determines its applicability.

But how far can we reduce the required total number of labels? And how strong can we expect a classifier to be in the limit of very few labels? In order to extract information from labeled and unlabeled data points, most successful contributions use hybrid combinations of two or more learning algorithms in order to merge unsupervised and supervised learning mechanisms (see, e.g., Weston et al., 2012; Kingma et al., 2014; Rasmus et al., 2015; Miyato et al., 2016). Typically, a standard DNN is used for the supervised part. Such DNNs are always equipped with a set of tunable parameters (i.e., hyperparameters or free parameters)—for example, for network architecture, activation functions, regularization, dropout, sparsity, gradient ascent types, learning rates, and early stopping. The unsupervised part adds further tunable parameters, and still more parameters are required to organize the interplay between supervised and unsupervised learning. For fully supervised learning, the problem of finding good values for the set of free parameters has been identified as its own research topic (see, e.g., Thornton, Hutter, Hoos, & Leyton-Brown, 2013; Bergstra, Yamins, & Cox, 2013; Hutter, Lücke, & Schmidt-Thieme, 2015). In the semisupervised setting, approaches with large numbers of free parameters face the additional challenge of parameter tuning using very few labels, which, for example, increases the risk of heavily overfitting to a subsequently very small validation set. Large sets of free parameters can thus negatively affect the applicability of a given system in the limit of few labels. The same applies to more principled combinations of supervised and unsupervised networks, for example, in the form of generative adversarial networks (GANs; Goodfellow et al., 2014; Salimans et al., 2016), which maintain large sets of free parameters of their constituting neural network components.

Alternatives to large hybrid approaches are classifiers derived from standard support vector machines (SVMs; Cortes & Vapnik, 1995). The transductive SVM (TSVM; Vapnik, 1998; Collobert, Sinz, Weston, & Bottou, 2006) was specifically derived for the semisupervised setting, and SVMs typically have comparably few free parameters. For supervised tasks, large DNNs are, however, often preferred because of their favorable scaling with the number of data points. While training of DNNs scales roughly linearly with the size of the training data, SVMs typically scale approximately quadratically. As the same applies to TSVMs, it becomes difficult to leverage large numbers of unlabeled data points.

Another alternative to large hybrid approaches is a standard probabilistic network, for example, in the form of deep directed graphical models (DDMs). DDMs are well suited to capture the rich structure of typical data, such as text documents, medical data, images and speech, and they can, in principle, be trained using unlabeled and labeled data points. Training of DDMs also scales efficiently with the number of data points (typically each learning iteration scales linearly with the number of data points). However, while being potentially very powerful information processors, typical directed models are limited in size. For instance, deep sigmoid belief networks (SBNs; Saul, Jaakkola, & Jordan, 1996; Gan, Henao, Carlson, & Carin, 2015) or newer models such as NADE (Larochelle & Murray, 2011) have been trained with only a couple of hundred to about a thousand hidden units (Bornschein & Bengio, 2015; Gan et al., 2015). Their scalability with regard to the number of neurons is thus more limited than standard discriminative DNNs, which often owe their competitive performance to their size.

In contrast to DNNs and SVMs, which are representatives of supervised learning, DDMs are primarily used for unsupervised learning. For the targeted limit of few labels, DDMs thus appear as a more natural starting point if we are able to address scalability for classification applications. In order to do so, we base our study on a directed graphical model that is sufficiently richly structured to give rise to a good classifier, while it allows for efficient training on large data sets and with large network sizes. Scalability will be realized by the derivation of a neural network equivalent for maximum likelihood learning of the graphical model. The emerging concise and local inference and learning equations of the network can then be parallelized and scaled using the same tools as were originally developed for conventional deep neural networks. By additionally considering a minimalistic network architecture, the number of free parameters will, at the same time, be kept low and easily tunable on few labels.

## 2 A Hierarchical Mixture Model for Classification

A classification problem can be modeled as an inference task based on a probabilistic mixture model (e.g., Duda, Hart, & Stork, 2001). Such a model can be hierarchical, or deep, if we expect the data to obey a hierarchical structure. For handwritten digits, for instance, we first assume the data to be divided into digit classes (0 to 9), and within each class, we expect a structure that distinguishes among different writing styles. Most deep systems allow for a much deeper substructure, using 5, 10, or, recently, even up to 100 or 1000 layers (He, Zhang, Ren, & Sun, 2016; Huang, Sun, Liu, Sedra, & Weinberger, 2016). For our goal of semisupervised learning with few labels, however, we want to restrain the model complexity to the necessary minimum of a hierarchical model.

### 2.1 The Generative Model

The parameters of the model, $W \in \mathbb{R}_{>0}^{C \times D}$ and $R \in \mathbb{R}_{\geq 0}^{K \times C}$, will be referred to as generative weights, which are normalized to constants $A$ and 1, respectively. The top node (see Figure 1) represents $K$ abstract concepts or superclasses $k$ (e.g., 10 classes of digits). The middle node represents any of the occurring $C$ subclasses $c$ (e.g., different writing styles of the digits). And the bottom nodes represent an observed data sample $\vec{y}$ with an according data label $l$ (e.g., ranging from 0 to 9). To generate an observation $\vec{y}$ from the model, we first draw a superclass $k$ from a uniform categorical distribution $p(k)$. Next, we draw a subclass $c$ according to the conditional categorical distribution $p(c \mid k, R)$. Given the subclass, we then sample $\vec{y}$ from a Poisson distribution. For labeled data, we assign to it the label $l$ of class $k$ via a Kronecker delta, that is, without label noise. Equations 2.1 to 2.3 define a minimalistically deep mixture model.
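As an illustration, ancestral sampling from this three-node model can be sketched as follows (a minimal sketch with our own function and variable names; $R$ rows are normalized to one and $W$ rows to the constant $A$, following the conventions above):

```python
import numpy as np

def sample_hierarchical_poisson(K, C, R, W, rng):
    """Draw one (k, c, y) triple from the hierarchical Poisson mixture.

    R: (K, C) conditional distribution p(c | k); each row sums to one.
    W: (C, D) Poisson rates; each row sums to the constant A.
    (Illustrative sketch; names are ours, not from the paper.)
    """
    k = rng.integers(K)             # superclass k ~ uniform categorical p(k)
    c = rng.choice(C, p=R[k])       # subclass   c ~ p(c | k, R)
    y = rng.poisson(W[c])           # observation y_d ~ Poisson(W_cd)
    return k, c, y
```

For labeled data, the label is then simply $l = k$ (no label noise), as stated above.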

Our model assumes nonnegative observed data, and we use the Poisson distribution as an elementary distribution for nonnegative observations. Nonnegative data represent a natural type of data, and examples include bag-of-words representations of text documents or light-intensity representations of images including standard pattern recognition benchmarks such as MNIST (LeCun, Bottou, Bengio, & Haffner, 1998) or NIST (Grother, 1995). While bag-of-words data may directly motivate a Poisson distribution (for word counts), the model in principle will be applicable to any kind of nonnegative data. An important difference between models that assume Poisson distributed data and models using the very common assumption of gaussian distributed observables is the implicitly assumed similarity relation between data points. The assumption of gaussian observables is, for example, naturally linked to the assumption of Euclidean distances (i.e., squared coordinate-wise differences). For such models (including many deep neural networks), the classification problem is usually unaffected by global shifts of data points, and the origin of the data space has no dedicated meaning. This is no longer the case for the Poisson noise model. For nonnegative data, a zero-valued observation has a special meaning, and any addition of a fixed value changes this meaning (e.g., the difference between a word count of zero and a word count of 10 words conveys a different meaning from the difference between a word count of 100 and 110). Instead of Euclidean distances, Poisson noise links to distances defined by the Kullback-Leibler divergences (e.g., Cemgil, 2009), which are more similar to those used for nonnegative matrix factorization (compare Lee & Seung, 1999). In addition to being a natural choice for nonnegative data, the Poisson distribution used here also turns out to be mathematically convenient for deriving inference and learning rules. 
Similar observations have been made by Keck, Savin, and Lücke (2012), who used Poisson observables to derive local learning rules for shallow neural network models (also compare Lücke & Sahani, 2008; Nessler, Pfeiffer, & Maass, 2009, 2013).

### 2.2 Maximum Likelihood Learning

The EM algorithm optimizes the free energy by iterating two steps. First, given the current parameters $\Theta^{\text{old}}$, the relevant expectation values under the posterior are computed in the E-step. Given these posterior expectations, $F(\Theta^{\text{old}}, \Theta)$ is then maximized with regard to $\Theta$ in the M-step. Iteratively applying E- and M-steps locally maximizes the data likelihood.
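To make the E/M alternation concrete, the following sketch implements one EM iteration for a flat Poisson mixture with a single hidden layer (our own simplification for illustration; the hierarchical model of section 2 adds the superclass layer on top of this):

```python
import numpy as np

def em_step(Y, W, pi, A, eps=1e-8):
    """One EM iteration for a flat Poisson mixture (illustrative sketch).

    Y: (N, D) nonnegative data, W: (C, D) Poisson rates (rows sum to A),
    pi: (C,) mixing proportions.
    """
    # E-step: log p(c | y) up to an additive constant; the rate sums
    # over d equal A for every c and therefore cancel in the posterior.
    logp = Y @ np.log(W + eps).T + np.log(pi + eps)       # (N, C)
    logp -= logp.max(axis=1, keepdims=True)               # stability
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)               # p(c | y^(n))
    # M-step: re-estimate rates and mixing weights from the posteriors.
    W_new = post.T @ Y + eps
    W_new = A * W_new / W_new.sum(axis=1, keepdims=True)  # keep rows at A
    pi_new = post.mean(axis=0)
    return W_new, pi_new, post
```

Iterating this step monotonically increases the free energy and hence locally maximizes the likelihood, as stated above.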

#### 2.2.1 M-Step

#### 2.2.2 E-Step

#### 2.2.3 Probabilistically Optimal Classification

### 2.3 Truncated Variational EM

## 3 A Neural Network for Optimal Hierarchical Learning

For the purposes of this study, we now turn to the task of specifying a neural network formulation that corresponds to learning and inference in the hierarchical generative model of section 2. The study of optimal learning and inference with neural networks is a popular research field, and we here follow an approach similar to Lücke and Sahani (2008), Keck et al. (2012), Nessler et al. (2009), and Nessler, Pfeiffer, Buesing, and Maass (2013).

### 3.1 A Neural Network Approximation

The complete set of activation and learning rules, after identifying neural activities $s_c$ and $t_k$ with the respective posterior distributions, is summarized in Table 1. By comparing equations 3.4 with the M-step equations 2.9 and 2.10, we can now observe that such neural learning converges to the same fixed points as EM for the hierarchical Poisson mixture model (note that we set $B = B' = 1$ as $s_c$ and $t_k$ sum to one). While the identification of the neural weights $W_{cd}$ with the generative weights at convergence is straightforward, we have to restrict learning of $R_{kc}$ to labeled data to gain a neural equivalent in $R_{kc}$. In that case, $p(k \mid c, l^{(n)}, \Theta^{\text{old}}) = p(k \mid l^{(n)})$, which corresponds to our chosen activities $t_k$ for labeled inputs. (In section 3.3, we will show a way to loosen this restriction by using self-labeling on unlabeled data with high inference certainty.)

| Neural Simpletron | | |
| --- | --- | --- |
| **Input** | | |
| Bottom up | $\tilde{y}_d$ (unnormalized data) | (T1.1) |
| Top down | $u_k = \begin{cases} \delta_{kl} & \text{for labeled data} \\ \frac{1}{K} & \text{for unlabeled data} \end{cases}$ | (T1.2) |
| **Activation across layers** | | |
| Observation layer | $y_d = (A - D)\, \dfrac{\tilde{y}_d}{\sum_{d'} \tilde{y}_{d'}} + 1$ | (T1.3) |
| First hidden | $s_c = \dfrac{\exp(I_c)}{\sum_{c'} \exp(I_{c'})}$, with | (T1.4) |
| | $I_c = \sum_d \log(W_{cd})\, y_d + \log\big(\sum_k u_k R_{kc}\big)$ | (T1.5) |
| Second hidden | $t_k = \begin{cases} u_k & \text{for labeled data} \\ \sum_c \frac{R_{kc}}{\sum_{k'} R_{k'c}}\, s_c & \text{for unlabeled data} \end{cases}$ | (T1.6) |
| **Learning of neural weights** | | |
| First hidden | $\Delta W_{cd} = \epsilon_W (s_c y_d - s_c W_{cd})$ | (T1.7) |
| Second hidden | $\Delta R_{kc} = \epsilon_R (t_k s_c - t_k R_{kc})$ for labeled data | (T1.8) |


In other words, by executing the online neural network of Table 1, we optimize the likelihood of the generative model, equations 2.1 to 2.3. The network's neural activities provide the posterior probabilities, which we can, for example, use for classification. The computation of posteriors is in general a difficult and computationally intensive endeavor, and their interpretation as neural activation rules is usually difficult. In our case, however, because of a specific interplay between the introduced constraints, the categorical distributions, and the Poisson noise, the posteriors and their neural interpretation greatly simplify.

All equations in Table 1 can directly be interpreted as neural activation or learning rules. Let us consider an unnormalized data point $\vec{\tilde{y}} = (\tilde{y}_1, \ldots, \tilde{y}_D)^T$ as bottom-up input to the network. Labels are neurally coded as top-down information $\vec{u} = (u_1, \ldots, u_K)^T$, where the entry $u_l$ equals one if $l$ is the label and all other entries are zero.^{1} In the case of unlabeled data, all labels are assumed to be equally likely at $1/K$. As the first processing step, a divisive normalization, equation T1.3, is executed to obtain activations $y_d$. Considering equations T1.4 and T1.5, we can interpret $I_c$ as input to neural unit $s_c$. The input consists of a bottom-up and a top-down activation. The bottom-up input is the standard weighted summation of neural networks, $\sum_d \log(W_{cd})\, y_d$ (note that we could redefine the weights by $\tilde{W}_{cd} := \log W_{cd}$). Likewise, the top-down input is a standard weighted sum, $\sum_k u_k R_{kc}$, but it affects the input through a logarithm. Both sums can be computed locally at the neural unit $c$. The inputs to the hidden units $s_c$ are then combined using a softmax function, which is also standard for neural networks. However, in contrast to discriminative networks, the weighted sums and the softmax function are here a direct result of the correspondence to a generative mixture model (compare also Jordan & Jacobs, 1994). The activation of the top layer, equation T1.6, is directly given by the top-down input $u_k$ if the data label is known. For unlabeled data, the inference again takes the form of a weighted sum over bottom-up inputs, which are now the activations $s_c$ from the middle layer. Regarding learning, both equations T1.7 and T1.8 are local Hebbian learning equations with synaptic scaling. The weights of the first hidden layer are updated on all data points during learning, while those of the second hidden layer learn only from labeled input data.
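The forward pass described above (equations T1.2 to T1.6) can be sketched in a few lines; the function name and the handling of the labeled case as a comment are our own choices:

```python
import numpy as np

def nesi_forward(y_raw, W, R, A, u=None):
    """One forward pass through the network of Table 1 (sketch).

    y_raw: (D,) unnormalized input; W: (C, D) positive weights (rows sum
    to A); R: (K, C) nonnegative weights; u: (K,) one-hot label code, or
    None for unlabeled data. For labeled data, T1.6 gives t = u directly.
    """
    C, D = W.shape
    K = R.shape[0]
    if u is None:
        u = np.full(K, 1.0 / K)                  # T1.2, unlabeled case
    y = (A - D) * y_raw / y_raw.sum() + 1.0      # T1.3 divisive normalization
    I = np.log(W) @ y + np.log(u @ R)            # T1.5 bottom-up + top-down
    I -= I.max()                                 # numerical stability only
    s = np.exp(I) / np.exp(I).sum()              # T1.4 softmax
    t = (R / R.sum(axis=0)) @ s                  # T1.6, unlabeled case
    return y, s, t
```

Note that the normalized input satisfies $\sum_d y_d = A$ by construction of T1.3, and both $s$ and $t$ are proper probability distributions.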

Other kinds of generative layers could be imagined that can (depending on the data) be more suitable—for example, GMMs for not necessarily nonnegative data in a Euclidean space or generative convolutional models (see, e.g., Dai, Exarchakis, & Lücke, 2013; Gal & Ghahramani, 2016; Patel, Nguyen, & Baraniuk, 2016) to exploit prior knowledge about image data. Derivations of corresponding simpletron layers are, however, not necessarily as straightforward as for the Poisson model. M-steps that adhere to the form of equations 3.4 are not generally given and may require further approximations or modified neural learning rules to allow for the identification of EM and neural network fixed points. Similarly, neural activation rules for gaussian noise would be different from those in equations 3.5 and 3.6 (which follow from the Poisson assumption). Instead of the standard sums of weights in equation 3.6, a gaussian noise assumption would result in activations proportional to squared distances between cluster centers and data points.

As a control of our analytical derivations above, Figure 3 shows a direct comparison of the likelihood using EM equations 2.9 and 2.10 and the corresponding neural learning rules in Table 1. We here used the MNIST data set as an example and trained both EM and the network with $C = 1000$ and $A = 900$. The scale of the learning rates of the network, $\epsilon_W$ and $\epsilon_R$, was set to produce training iterations comparable to EM. We then verified numerically that the local optima of the neural network are indeed approximate local optima of the EM algorithm and vice versa. Note in this respect that although neural learning has the same convergence points as EM learning for the mixture model, at finite distances from the convergence points, neural learning follows different gradients, such that the trajectories of the network in parameter space are different from EM. By adjusting the learning rates in equations T1.7 and T1.8, the gradient directions can be changed in a systematic way without changing the convergence points, which we observed to be beneficial for avoiding convergence to shallow local optima.

The equations defining the neural network are elementary and very concise, and they contain only four free parameters: the number of hidden units $C$, an input normalization constant $A$, and the learning rates $\epsilon_W$ and $\epsilon_R$. Because of its concise form, we call the network *neural simpletron* (NeSi).

In the experiments in section 4, we differentiate between five neural network approximations on the basis of Table 1. These result from two different approximations of the activations in the first hidden layer, two different approximations for the activations in the second hidden layer, and a truncated network approximation. These approximations are discussed in sections 3.2, 3.3, and 3.4, respectively.

### 3.2 Recurrent, Feedforward, and Greedy Learning

The complete formulas for the first hidden layer, given in equations T1.4 and T1.5, define a recurrent network, that is, a network that combines both bottom-up and top-down information. The first summation in $I_c$ incorporates the bottom-up information. Due to the chosen normalization in equation T1.3 with a background value of $+1$, all summands in this term are nonnegative. Values of the sum over these bottom-up connections will be high for input data $\vec{y}$ generated by the hidden unit $c$. The second summation in $I_c$ incorporates top-down information. The weighted sum inside the logarithm, which can take the label information into account, will always yield values between zero and one. Thus, because of the logarithm, this second term is always nonpositive and suppresses the activation of the unit. The suppression is stronger the less likely it is that the given hidden unit $c$ belongs to the class of the provided label $l$ (for labeled data) and the less likely it is that this unit becomes active at all. Because of these recurrent connections between the first and second hidden layers, we refer to our method in Table 1 as r-NeSi (“r” for *recurrent*) in the experiments. With “recurrent,” we do not mean a temporal memory of sequential inputs but the direction in which information flows through the network (following, for example, the definition of *recurrent* by Dayan & Abbott, 2001).

To investigate the influence of such recurrent information in the network, we also test a pure feedforward version of the first hidden layer. There, we remove all top-down connections by discarding the second term in equation T1.5. Such a feedforward formulation of the network is equivalent to treating the distribution $p(c|k,R)$ in the first hidden layer as a uniform prior distribution $p(c)=1/C$. We refer to this feedforward network as ff-NeSi in the experiments. Since ff-NeSi is stripped of all top-down recurrence and the fixed points of the second hidden layer now depend only on the activities of the first hidden layer at convergence, it can also be trained disjointly using a greedy layer-by-layer approach, which is customary for deep networks (e.g., Hinton, Osindero, & Teh, 2006).

### 3.3 Self-Labeling

So far, we trained the top layer of NeSi completely supervised by updating the weights in equation T1.8 only on labeled data. When labeled data are sparse, it could be beneficial to also make use of unlabeled data in this layer. We can do so by letting the network itself provide the missing labels (a procedure often termed “self-labeling”; see, e.g., Lee, 2013; Triguero, García, & Herrera, 2015). The availability of the full posterior distribution in the network (see equation T1.6 for unlabeled data) allows us to selectively use only those inferred labels where the network shows a very high classification certainty. As an index for decision certainty, we use the best versus second best ($BvSB$) measure on $t_k$, which is the absolute difference between the most likely and the second most likely prediction. Such a measure gives a sensible indicator for high skewness of the distribution toward a single class (Joshi, Porikli, & Papanikolopoulos, 2009). If the $BvSB$ lies above some threshold parameter $\vartheta$, which we treat as an additional free parameter, we approximate the full posterior in $t_k$ by the MAP estimate. In that case, we set $t_k \rightarrow \mathrm{MAP}(t_k)$, such that $t_k$ for unlabeled data now holds the one-hot coded inferred label information, with which we can then update the top layer in the usual fashion using equation T1.8.

We mark those NeSi networks where we use self-labeling in the top layer with a superscript $+$ (i.e., r$^+$-NeSi and ff$^+$-NeSi). Although we here use the MAP estimate of $t_k$ during training, because of the validity of equation 3.8 at high inference certainty, we are still learning in the context of the generative model, equations 2.1 to 2.3. Thus, we still keep the full posterior distribution in $t_k$ for inference, as well as all identifications of section 3.1.
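A minimal sketch of the $BvSB$ gate described above, assuming `t` holds the top-layer posterior for an unlabeled data point (function name ours):

```python
import numpy as np

def self_label(t, threshold):
    """BvSB self-labeling gate (sketch). t: (K,) posterior over classes.

    Returns a one-hot MAP label if the best-versus-second-best margin
    exceeds the threshold, else None (the sample then contributes no
    top-layer update).
    """
    order = np.argsort(t)
    bvsb = t[order[-1]] - t[order[-2]]   # best minus second best
    if bvsb <= threshold:
        return None
    one_hot = np.zeros_like(t)
    one_hot[order[-1]] = 1.0             # MAP estimate of t
    return one_hot
```

The returned one-hot vector can then replace $t_k$ in the update rule T1.8, exactly as for a genuinely labeled data point.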

### 3.4 Truncated Simpletrons

Considering equation 3.9, it is hence sufficient to only compare the first-hidden-layer inputs $I_c$ for each data point $\vec{y}^{(n)}$ in order to construct the sets $\mathcal{K}^{(n)}$. Sets that maximize the free energy in the E-step are consequently obtained by selecting those $C'$ clusters $c$ with the highest values $I_c$. In the mixture model, approximate truncated posteriors are then obtained by setting all posteriors $p(c \mid \vec{y}^{(n)}, \Theta)$ for $c \notin \mathcal{K}^{(n)}$ to zero and renormalizing $p(c \mid \vec{y}^{(n)}, \Theta)$ to sum to one.

We refer to the resulting network as the *truncated neural simpletron* (t-NeSi).
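The truncation step can be sketched as follows, assuming the inputs $I_c$ have already been computed (our own function name; `np.argpartition` provides a linear-time selection of the top $C'$ units):

```python
import numpy as np

def truncated_softmax(I, C_trunc):
    """Truncated posterior over hidden units (sketch).

    Keeps only the C_trunc units with the highest inputs I_c, sets all
    other posteriors to exactly zero, and renormalizes over the kept set.
    """
    keep = np.argpartition(I, -C_trunc)[-C_trunc:]   # indices of top C'
    s = np.zeros_like(I, dtype=float)
    e = np.exp(I[keep] - I[keep].max())              # stable softmax
    s[keep] = e / e.sum()
    return s
```

The exact zeros produced here are what enables the computational savings discussed in section 3.4.1.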

#### 3.4.1 Computational Complexity Reduction

Truncated approaches generally reduce the complexity of inference because the number of evaluated hidden states per data point can be drastically reduced (e.g., Lücke & Eggert, 2010; Dai & Lücke, 2014; Sheikh et al., 2014; Lücke, 2016). For mixture models, the reduction of states at first glance does not appear to be very significant (in contrast to multiple-causes models), as the number of hidden states scales linearly with the number of hidden variables. However, the exact zeros for posterior probabilities also result in a large reduction of computational cost in our case. Equation 3.10 for $I_c$ is here still computed fully, which is of $O(CD)$. But for the updates of the weights $W_{cd}$, equation T1.7, the required computations reduce from $O(CD)$ to $O(C'D)$ after truncation, as those $s_c$ values that are equal to zero result in no changes for their corresponding weights. Furthermore, and less significant, the computations of $s_c$ directly reduce from $O(C)$ to $O(C')$. Even with the fully computed $I_c$, we thus still reduce the computational cost by a number of numerical operations per data point proportional to $(C - C')D$. Considering that the additional operations needed to find the largest $C'$ elements are typically of order $O(C + C' \log C)$ per data point (Lam & Ting, 2000) or just $O(C)$ (Blum, Floyd, Pratt, Rivest, & Tarjan, 1973), we can expect to reduce the required overall operations for t-NeSi by a large fraction compared to nontruncated NeSi networks.
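The weight-update saving described above can be sketched as follows: the update touches only the rows with $s_c \neq 0$, yet coincides with the dense Hebbian rule T1.7 because zero activities produce zero changes (function name ours):

```python
import numpy as np

def sparse_weight_update(W, s, y, lr):
    """Hebbian update with synaptic scaling (T1.7) exploiting truncation.

    Units with s_c == 0 are skipped entirely, so the per-data-point cost
    drops from O(C*D) to O(C'*D) for C' surviving units, while the result
    equals the dense update lr * (s_c * y_d - s_c * W_cd) for every row.
    """
    active = np.flatnonzero(s)               # the C' units kept by truncation
    W = W.copy()
    W[active] += lr * (s[active, None] * y[None, :]
                       - s[active, None] * W[active])
    return W
```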

For experiments with large numbers of hidden units (namely, on MNIST and NIST SD19), we perform additional experiments using the t-NeSi networks to investigate the benefits of such truncated approaches. These networks have one additional free parameter $C'$ that primarily depends on the relationship of the clusters in the data themselves rather than on the other network parameters. Furthermore, as will be shown on MNIST, tuning of this parameter is still possible with very few labels and can even be done directly on the unsupervised likelihood of the first hidden layer with no validation set at all.

## 4 Numerical Experiments

We apply an efficiently scalable implementation of our network to three standard benchmarks for classification on nonnegative data:^{2} the 20 Newsgroups text data set (Lang, 1995), the MNIST data set of handwritten digits (LeCun et al., 1998), and the NIST Special Database 19 of handwritten characters (Grother, 1995). To investigate the task of learning from few labels, we randomly divide the training parts of the data sets into labeled and unlabeled partitions, where we make sure that each class holds the same number of labeled training examples if possible. We repeat experiments for different proportions of labeled data and measure the classification error on the blind test set. For all such settings, we report the average test error over a given number of independent training runs with new random labeled and unlabeled data selection. Details on parallelization and weight initialization are in appendix B. Detailed statistics of the obtained results are in appendix C.
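The class-balanced random split into labeled and unlabeled partitions can be sketched as follows (our own helper name; a sketch of the protocol, not the paper's code):

```python
import numpy as np

def split_labels(labels, n_per_class, rng):
    """Pick a class-balanced labeled subset at random (sketch).

    Returns the indices of the labeled partition; all remaining training
    points are treated as unlabeled.
    """
    labeled = []
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        labeled.extend(rng.choice(idx, size=n_per_class, replace=False))
    return np.array(labeled)
```

Repeating experiments then amounts to calling this helper with fresh random seeds and retraining, which yields the averaged test errors reported below.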

### 4.1 Parameter Tuning

For the basic NeSi algorithms, we have four free parameters: the normalization constant $A$ in the bottom layer, the number of hidden units $C$ and the learning rate $\epsilon_W$ in the middle layer, and the learning rate $\epsilon_R$ in the top layer. The optional self-labeling and truncation procedures to further improve learning will add a fifth and sixth free parameter, respectively. The parameter $\vartheta$ will set the $BvSB$ threshold for self-labeling (top layer), and the parameter $C'$ will set the number of considered middle-layer units for truncated learning.

To optimize the free parameters in the semisupervised setting with only a few labeled data points, it is customary to use a validation set, which comprises labeled data in addition to the labels available in the training set of the given setting (e.g., using a validation set of 1000 labeled data points to tune parameters in the setting of 100 labels). As this procedure does not guarantee that the resulting optimal parameter setting could have also been found with the limited number of labels in the given training setting, such results reflect more the performance limit of the model than the actual performance when only very restricted numbers of labeled data are given. As already done in Forster et al. (2015), we therefore train our model given a strictly limited total number of labels for the complete tuning and training procedure in order to address our goal. This implies that we also have to tune all free parameters in the same setting as for training, without any additional labeled data. In doing so, we make sure that our results are achievable by using no more labels than provided within each training setting. Furthermore, using only training data for parameter optimization ensures a fully blind test set, such that the test error gives a reliable index for generalization.

To construct the training and validation set for parameter tuning, we consider the setting of 10 labeled training data points per class (i.e., 200 labeled data points for 20 Newsgroups and 100 labeled data points for MNIST). This is the setting with the lowest number of labels on which models are generally compared on MNIST. For simplicity, we take half of these labeled data as the validation set (class balanced and randomly drawn) and use the other labeled half plus all unlabeled training data as the training set for parameter tuning. With this data split, we optimize the parameters of the r-NeSi network via a coarse manual grid search. For the search space, we may consider run time versus performance trade-offs where necessary (e.g., with an upper bound on the network size and a lower bound on the learning rates). Keeping the optimized parameter setting of r-NeSi fixed, we optimize only $\vartheta$ for r$^+$-NeSi. For comparison, we keep the same parameter settings for the feedforward networks (ff-NeSi and ff$^+$-NeSi) without further optimization. Finally, for the truncation parameter $C'$ of t-NeSi, we again optimize only $C'$ and keep all other parameters fixed.

Once optimized in this semisupervised setting, we keep the free parameters fixed for all following experiments. When evaluating the performance of the networks, we perform repeated experiments with different sets of randomly chosen training labels. This evaluation scheme is possible only with more labels available than used by each single network. However, this procedure is purely to gather meaningful statistics about the mean and variance of the acquired results, as these can vary based on the set of randomly chosen labels. As the experiments are performed independently of each other and the parameters are not further tuned based on these results on the test set, it is safe to say that the acquired results are a statistical representation of the performance of our models given no more than the corresponding number of labels in each setting.
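The reported statistics can be computed as follows (a small sketch with hypothetical error values; `mean_and_sem` is our own helper name):

```python
import numpy as np

def mean_and_sem(errors):
    """Mean test error and standard error of the mean (SEM) over
    repeated runs with different random label sets, as used for the
    reported results."""
    errors = np.asarray(errors, dtype=float)
    return errors.mean(), errors.std(ddof=1) / np.sqrt(len(errors))

# e.g., three hypothetical runs with test errors of 4%, 5%, and 6%
mean_err, sem = mean_and_sem([4.0, 5.0, 6.0])
```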

A more rigorous parameter tuning would also allow for retuning of all parameters for each model and each new label setting, making use of the additional label information in the settings where more than 100 labels are available; for our purposes, however, we refrained from doing so. The overall tuning, training, and testing protocol is shown in Figure 4.

### 4.2 Document Classification (20 Newsgroups)

The 20 Newsgroups data set in the bydate version consists of 18,774 newsgroup documents, of which 11,269 form the training set and the remaining 7505 form the test set. Each data vector comprises the raw occurrence frequencies of 61,188 words in each document. We preprocess the data using only tf-idf weighting (Sparck Jones, 1972). No stemming, stop-word removal, or frequency cutoffs were applied. The documents belong to 20 different classes of newsgroup topics that are partitioned into six different subject matters (comp, rec, sci, forsale, politics, and religion). We show experiments for both classification into subject matter (6 classes) and the more difficult full 20-class problem.
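The weighting step might look as follows (a minimal sketch; the exact tf-idf variant is not specified in the text, so we use a common choice of raw term frequency times log inverse document frequency):

```python
import numpy as np

def tfidf(counts):
    """Plain tf-idf weighting of a document-term count matrix
    (rows: documents, columns: vocabulary). No stemming, stop-word
    removal, or frequency cutoffs, matching the preprocessing above."""
    N = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)     # document frequency per term
    idf = np.log(N / np.maximum(df, 1))       # guard against df = 0
    return counts * idf

docs = np.array([[3, 0, 1],
                 [0, 2, 1],
                 [1, 1, 1]], dtype=float)
w = tfidf(docs)
# a term occurring in every document (last column) gets idf = log(3/3) = 0
```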

#### 4.2.1 Parameter Tuning on 20 Newsgroups

In the following, we give a short overview of the parameter tuning on the 20 Newsgroups data set. We use the procedure described in section 4.1 to optimize the free parameters of NeSi using only 200 labels in total while keeping a fully blind test set. The parameters are optimized with respect to the more common 20-class problem, and we then keep the same parameter setting for the easier 6-class task as well. We allowed a training time of 200 iterations over the whole training set and restricted the parameters in the grid search such that sufficient convergence was reached within this limit.

*Hidden units.* Following the above tuning protocol for 20 Newsgroups (20 classes) results in a best-performing architecture of $D$–$C$–$K$ = 61188–20–20, that is, the complete setting $C=K=20$. Generally, we would expect the overcomplete setting $C>K$ to allow for more expressive representations. This is indeed the case for the 6-class problem ($K=6$), for which we find that $C=20$ (61188–20–6) is still the best setting. For the 20-class problem, however, more than $K$ middle-layer classes were not beneficial. Using more than 20 middle-layer units ($C>20$) for the $K=20$ problem may be hindered here by the high dimensionality of the data relative to the number of available training data points, as well as by the prominent noise that arises when all words of a given document are taken into account.

*Normalization.* Because of the introduced background value of $+1$ (see equation T1.3), the normalization constant $A$ is bounded from below by the dimensionality of the input data, $D=61,188$. For values of $A$ only slightly above this bound ($A \gtrsim D$), the model is unable to differentiate the observed patterns from background noise. At the other extreme, $A \to \infty$, the softmax function converges to a winner-take-all maximum function. The optimal value lies in between: just above the point at which the system can differentiate all classes from background noise, while the normalization is still low enough to allow for a broad softmax response. For all our experiments on the 20 Newsgroups data set, we chose (following the tuning protocol) $A=80,000$ (that is, $A/D \approx 1.31$).
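The qualitative effect of $A$ can be illustrated as follows (a toy sketch applying the softmax directly to a normalized input rather than to the actual hidden-layer activations; helper names are our own, not from the paper):

```python
import numpy as np

def normalized_input(y, A):
    """Rescale a nonnegative data vector, after adding the +1 background
    value per dimension, so that it sums to the normalization constant A.
    Our illustrative reading of the role of A."""
    y = np.asarray(y, dtype=float) + 1.0
    return A * y / y.sum()

def softmax(x):
    x = x - x.max()                     # numerical stability
    e = np.exp(x)
    return e / e.sum()

D = 4
y = np.array([5.0, 0.0, 0.0, 0.0])               # one active input dimension
s_low = softmax(normalized_input(y, A=D))        # A barely above D: broad response
s_high = softmax(normalized_input(y, A=1000 * D))  # A >> D: near winner-take-all
```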

*Learning rates.* A relatively high learning rate in the first hidden layer ($\epsilon_W = 5 \times C/N$), coupled with a much lower learning rate in the second hidden layer ($\epsilon_R = 0.5 \times K/L$), yielded the best results on the validation set. Especially the high value for $\epsilon_W$ seems to help avoid shallow local optima, which exist, again, due to noise and the high dimensionality of the data compared to the relatively low number of training samples. The different learning rates for $\epsilon_W$ and $\epsilon_R$ mean that the neural network follows a gradient markedly different from an EM update. This suggests that the neural network allows for improved learning compared to the EM updates it was derived from.

Note that in practice, we use normalized learning rates. The factors $C/N$ for the first hidden layer and $K/L$ for the second hidden layer represent the average activation per hidden unit over one full iteration over a data set of $N$ data points with $L$ labels. Tuning the proportionality to this average activation rather than the absolute learning rate helps to decouple the optimum of the learning rates from the network size ($C$ and $K$) and the numbers of available training data and labels ($N$ and $L$).
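A minimal sketch of these normalized rates (the function name and signature are our own):

```python
def layer_learning_rates(C, K, N, L, w_factor, r_factor):
    """Normalized learning rates: the absolute rates are proportional to
    the average per-unit activation over one epoch, C/N for the first
    hidden layer and K/L for the top layer."""
    return w_factor * C / N, r_factor * K / L

# 20 Newsgroups values from the text: C = K = 20, N = 11269, L = 200,
# with tuned factors 5 (first hidden layer) and 0.5 (top layer)
eps_W, eps_R = layer_learning_rates(C=20, K=20, N=11269, L=200,
                                    w_factor=5.0, r_factor=0.5)
```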

*BvSB threshold.* Given the optimized values of the other free parameters, we found that the additional self-labeling for unlabeled data is not helpful and can even be harmful on the 20 Newsgroups data set. Since, even in the settings with only very few labeled data points, the number of provided labels per middle-layer hidden unit is already sufficiently large, the use of inferred labels only introduces destructive noise. Self-labeling is more useful in scenarios where the number of hidden units greatly surpasses the number of available labeled data points (for MNIST, section 4.3, and NIST, section 4.4).

#### 4.2.2 Results on 20 Newsgroups (6 Classes)

We start with the easier task of subject matter classification, where the 20 newsgroup topics are partitioned into six higher-level groups that combine related topics (e.g., comp, rec). The optimal architecture for 20 Newsgroups (20 classes) on the validation set was given by the complete setting, $C = K = 20$. At first glance, it seems as though no subclasses were learned and the split in the middle layer was primarily guided by class labels. However, for classification of subject matters (6 classes), where only labels of the six higher-level topics were given, we observed the setting with $C=20$ units (61188-20-6) to be far superior to the complete setting with architecture 61188-6-6 (see Table 2). This suggests that the data structure of 20 subclasses, and not the number of label classes, determines the optimal architecture of the NeSi network (see also sections 4.3 and 4.4). In our experiments, we furthermore observed that the feedforward network, which learns completely unsupervised in the middle layer, still achieves performance similar to the recurrent r-NeSi network. This shows that the NeSi networks are able to recover individual subclasses of the newsgroups data independent of the label information. As more labels become available, however, the recurrent network improves on the feedforward version, as the additional top-down label information also leads to further fine-tuning of the learned representations in the middle layer.

Table 2: Test error (%) on 20 Newsgroups subject matter classification (6 classes) for complete ($C=6$) and overcomplete ($C=20$) middle layers.

| Number of Labels | ff-NeSi ($C=6$, $K=6$) | ff-NeSi ($C=20$, $K=6$) | r-NeSi ($C=6$, $K=6$) | r-NeSi ($C=20$, $K=6$) |
| --- | --- | --- | --- | --- |
| 200 | 41.66 $\pm$ 1.21 | 14.23 $\pm$ 0.45 | 39.02 $\pm$ 1.49 | 14.21 $\pm$ 0.42 |
| 800 | 40.41 $\pm$ 1.31 | 14.04 $\pm$ 0.48 | 39.54 $\pm$ 1.64 | 14.58 $\pm$ 0.75 |
| 2000 | 42.31 $\pm$ 0.72 | 14.26 $\pm$ 0.47 | 40.05 $\pm$ 0.64 | 13.44 $\pm$ 0.43 |
| 11,269 | 41.85 $\pm$ 0.90 | 14.95 $\pm$ 0.73 | 36.56 $\pm$ 2.09 | 13.26 $\pm$ 0.35 |

Note: The overcomplete setting ($C>K$) shows best results, where the network is able to learn the 20 individual subclasses present in the data.

#### 4.2.3 Results on 20 Newsgroups (20 Classes)

We now continue with the more challenging 20-class problem ($K=20$). Here, we investigate semisupervised settings of 20, 40, 200, 800, and 2000 labels in total (that is, 1, 2, 10, 40, and 100 labels per class) as well as the fully labeled setting. For each setting, we present the mean test error averaged over 100 independent runs and the standard error of the mean (SEM). For each new run, a new set of class-balanced labels is chosen randomly from the training set. We train our model on the full 20-class problem without any feature selection. Examples of learned weights of r-NeSi are shown in Figure 5.

To the best of our knowledge, most methods that report performance on the same benchmark consider easier tasks: they either break the task into binary classification between individual or merged topics (e.g., Cheng, Kannan, Vempala, & Wang, 2006; Kim, Der, & Saul, 2014; Wang & Manning, 2012; Zhu, Ghahramani, & Lafferty, 2003) or perform feature selection before classification (e.g., Srivastava, Salakhutdinov, & Hinton, 2013; Settles, 2011). There are, however, works that are compatible with our experimental setup (Larochelle & Bengio, 2008; Ranzato & Szummer, 2008). A hybrid of generative and discriminative RBMs (HDRBM) trained by Larochelle and Bengio (2008) uses stochastic gradient descent to perform semisupervised learning. They report results on 20 Newsgroups for both supervised and semisupervised setups. In the fully labeled setting, all their hyperparameters are optimized using a validation set of 1691 examples, with the remaining 9578 in the training set. In the semisupervised setup, 200 examples were used as a validation set, with 800 labeled examples in the training set. To reduce the dimensionality of the input data, they used only the 5000 most frequent words. The classification performance of the methods is compared in Table 3.

Table 3: Test error (%) on 20 Newsgroups (20 classes) for different numbers of labels, compared to HDRBM (Larochelle & Bengio, 2008).

| Number of Labels | ff-NeSi | r-NeSi | HDRBM |
| --- | --- | --- | --- |
| 20 | 70.64 $\pm$ 0.68 (*) | **68.68 $\pm$ 0.77** (*) | |
| 40 | 55.67 $\pm$ 0.54 (*) | **54.24 $\pm$ 0.66** (*) | |
| 200 | 30.59 $\pm$ 0.22 | **29.28 $\pm$ 0.21** | |
| 800 | 28.26 $\pm$ 0.10 | **27.20 $\pm$ 0.07** | 31.8 (*) |
| 2000 | 27.87 $\pm$ 0.07 | **27.15 $\pm$ 0.07** | |
| 11,269 | 28.08 $\pm$ 0.08 | 27.28 $\pm$ 0.07 | **23.8** |

Notes: We differentiate here between settings with different numbers of labels available during training. For results marked with “(*),” the free parameters of the model were optimized using additional labels: NeSi used the same parameter setting in all experiments on 20 Newsgroups, which was tuned with 200 labels in total; HDRBM used 1000 labels in total for tuning in the semisupervised setting (200 additional labels for the validation set). The numbers in bold are the best performing (in terms of lowest mean error) of the compared systems for each label setting.

Here, the recurrent and feedforward networks produce very similar results, with a small advantage for the recurrent networks. In comparison with HDRBM, ff-NeSi and r-NeSi both achieve better results in the semisupervised setting. Both algorithms remain better down to 200 labels, even though HDRBM uses more labels for training and additional labels for parameter tuning. Performance decreases significantly only when going down further to only one or two labels per class for training (note that the parameters were tuned using 200 labels in total). In the fully labeled setting, the HDRBM outperforms the shown NeSi approaches significantly. However, so far we have used only one parameter setting for all experiments. Optimizing r-NeSi specifically for the fully labeled setting, we achieve test errors of $(17.85 \pm 0.01)\%$. For details on the parameter tuning, see section B.5.

### 4.3 Handwritten Digit Recognition (MNIST)

The MNIST data set consists of 60,000 training and 10,000 test data points of $28 \times 28$ gray-scale images of handwritten digits centered by pixel mass. We perform experiments in the semisupervised setting using 10, 100, 600, 1000, and 3000 labels in total (that is, 1, 10, 60, 100, and 300 labels per class), drawn randomly and class balanced from the 10 classes. We also consider the setting of a fully labeled training set.

#### 4.3.1 Parameter Tuning on MNIST

We here give a short overview of the parameter tuning on the MNIST data set. We again use the tuning procedure described in section 4.1 to optimize all free parameters of NeSi using only 100 labels in total from the training data, keeping a fully blind test set. We allowed a training time of 500 iterations over the whole training set and again restricted the parameters in the grid search such that sufficient convergence was reached within this limit.

*Hidden units.* In contrast to the 20 Newsgroups data set, for MNIST the validation error generally decreased with an increasing number of hidden units. We therefore used $C=10,000$ for all our experiments, for both the feedforward and the recurrent networks; we set this as an upper limit on the network size as a good trade-off between performance and required computing time. However, with so many hidden units on a training set of 60,000 data points and with as few as 10 labeled training samples in total, overfitting effects have to be taken into consideration. We discuss these in more depth in sections B.3 and B.4. In general, we encountered an increase in error rates with prolonged training times only for the r-NeSi algorithm in the semisupervised settings when no self-labeling was used. For this case only, we devised and used a stopping criterion based on the likelihood of the training data.

*Normalization.* The dependence of the validation error on the normalization constant $A$ shows behavior similar to that on the 20 Newsgroups data set. Following a coarse screening according to the tuning protocol, the setting $A=900$ (i.e., $A/D \approx 1.15$) was chosen.

*Learning rates.* While a high learning rate can be used to overcome shallow local optima, a lower learning rate will in general find more precise optima, with the downside of a longer training time until convergence. As a trade-off between performance and training time, we chose $\epsilon_W = 0.2 \times C/N$ and $\epsilon_R = 0.2 \times K/L$ for all experiments on MNIST. Since for networks using self-labeling the number of effectively used labels $L$ approaches $N$ over time, we scale the learning rate $\epsilon_R$ for these systems with $K/N$ instead of $K/L$, that is, $\epsilon_R = 0.2 \times K/N$ for r$+$- and ff$+$-NeSi.

*BvSB threshold.* With $C=10,000$ and only 50 labels in total in the training set during parameter tuning, only a single label per 200 middle-layer units is available to learn their respective classes. In this setting, using self-labeling on unlabeled data as described in section 3.3 decreased the validation error significantly over the whole tested regime of $\vartheta \in \{0.1, 0.2, \ldots, 0.9\}$. We chose $\vartheta = 0.6$ as the optimal value.
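The BvSB criterion can be sketched as follows (our own reading of the best-versus-second-best margin rule of section 3.3; the function name is hypothetical):

```python
import numpy as np

def bvsb_self_label(posterior, threshold):
    """Best-versus-second-best (BvSB) self-labeling: accept the class
    inferred for an unlabeled data point only if the margin between the
    largest and second-largest class posterior exceeds the threshold
    (the tuned value on MNIST was 0.6). Returns the class index or
    None when the network is too uncertain."""
    p = np.sort(posterior)[::-1]
    if p[0] - p[1] > threshold:
        return int(np.argmax(posterior))
    return None

confident = bvsb_self_label(np.array([0.90, 0.05, 0.05]), threshold=0.6)
uncertain = bvsb_self_label(np.array([0.50, 0.40, 0.10]), threshold=0.6)
```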

*Truncation.* For t-NeSi, only the additional parameter $C'$ is optimized while keeping the other parameters fixed. Complementary to the reduced computational complexity (see section 3.4), we also observe significantly faster learning in truncated networks. Figure 6 shows the training likelihood of only the first hidden layer of truncated and nontruncated NeSi for 10 independent runs each (which are, however, hardly distinguishable from one another at this scale, as the likelihoods within each setting lie too close together). As can be seen in the outer plot, the likelihood initially increases faster for lower $C'$. However, when the posterior is truncated too strongly, the likelihood converges to significantly lower values (inset plot). Notably, we can also observe here that the optimal setting $C'=15$, found via optimization on the validation set, achieves a higher likelihood than all other shown settings, which could allow for parameter tuning based solely on the (unsupervised) likelihood, that is, without any additional labels.
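The truncation itself can be sketched as follows (an illustrative sketch of the truncation of section 3.4; variable names are ours):

```python
import numpy as np

def truncate_posterior(s, C_prime):
    """Truncated posterior over the middle-layer units: keep only the
    C' largest activations, set the rest to zero, and renormalize."""
    t = np.zeros_like(s)
    top = np.argsort(s)[-C_prime:]   # indices of the C' largest entries
    t[top] = s[top]
    return t / t.sum()

s = np.array([0.05, 0.15, 0.30, 0.50])
t = truncate_posterior(s, C_prime=2)
```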

#### 4.3.2 Results on MNIST

Table 4 shows the results of the NeSi algorithms on the MNIST benchmark. As the NeSi model has no prior knowledge about spatial relations in the data, the given results are invariant to pixel permutation. As can be observed, the basic recurrent network (r-NeSi) yields significantly lower classification errors than the basic feedforward network (ff-NeSi) in the fully labeled setting as well as in the settings with 600 labels or fewer. In between these extremes, we find a regime where the feedforward network not only catches up to the recurrent one but even performs slightly better. In the highly overcomplete setting that we use for MNIST, we now also see a significant gain in performance in the semisupervised settings from the additional self-labeling (ff$+$-NeSi and r$+$-NeSi). With these additional inferred labels, the feedforward network surpasses the recurrent version also in the settings with very few labels, down to a single label per class. For this last setting, however, we had to increase the training time to 2000 iterations to ensure convergence, since learning in the top layer with a single label per class per iteration is very slow when the learning rate is not adjusted. The best-performing network is the truncated version of the ff$+$-NeSi network. As shown in Figure 6, truncation of the middle layer leads to convergence to optima with, on average, higher (middle-layer) likelihoods. Especially in settings with very few labels, such improved clustering significantly helps to maintain a low test error in the higher-level classification task.

Table 4: Test error (%) on MNIST for different numbers of labels.

| Number of Labels | ff-NeSi | r-NeSi | ff$+$-NeSi | r$+$-NeSi | t-NeSi |
| --- | --- | --- | --- | --- | --- |
| 10 | 55.46 $\pm$ 0.57 (*) | 29.61 $\pm$ 0.57 (*) | 10.91 $\pm$ 0.86 (*) | 18.68 $\pm$ 0.89 (*) | **7.22 $\pm$ 0.53** (*) |
| 20 | 38.88 $\pm$ 0.52 (*) | 21.21 $\pm$ 0.34 (*) | 7.23 $\pm$ 0.35 (*) | 12.46 $\pm$ 0.73 (*) | **6.21 $\pm$ 0.38** (*) |
| 100 | 19.08 $\pm$ 0.26 | 12.43 $\pm$ 0.15 | 4.96 $\pm$ 0.08 | 4.93 $\pm$ 0.05 | **4.23 $\pm$ 0.07** |
| 600 | 7.27 $\pm$ 0.05 | 6.94 $\pm$ 0.05 | 4.08 $\pm$ 0.02 | 4.34 $\pm$ 0.01 | **3.65 $\pm$ 0.01** |
| 1000 | 5.88 $\pm$ 0.03 | 6.07 $\pm$ 0.03 | 4.00 $\pm$ 0.01 | 4.26 $\pm$ 0.01 | **3.63 $\pm$ 0.01** |
| 3000 | 4.39 $\pm$ 0.02 | 4.68 $\pm$ 0.02 | 3.85 $\pm$ 0.01 | 4.05 $\pm$ 0.01 | **3.52 $\pm$ 0.01** |
| 60,000 | 3.27 $\pm$ 0.01 | **2.94 $\pm$ 0.01** | 3.27 $\pm$ 0.01 | **2.94 $\pm$ 0.01** | **2.94 $\pm$ 0.01** |

Notes: We differentiate here between settings with different numbers of labels available during training. For results marked with “(*),” the free parameters were optimized using more labels than available in the given setting. We used the same parameter setting for all experiments shown here, which was tuned using 100 labels. The results are given as the mean and standard error (SEM) over 100 independent repetitions, with randomly drawn class-balanced labels. In the fully labeled case, there are no unlabeled data points to use self-labeling on. Therefore, the results of ff- and ff$+$-NeSi are identical there, as well as those of r- and r$+$-NeSi. Numbers in bold are the best performing (in terms of lowest mean error) of the compared systems for each label setting.

We also performed experiments using semisupervised kNN as a baseline for simple discriminative algorithms. We used a two-stage procedure to make use of unlabeled data for kNN: first, only labeled training examples were used to classify all unlabeled examples (similar to our self-labeling approach but without an uncertainty threshold); then the test data were classified using labeled and self-labeled training data. We optimize four free parameters for kNN: the number of neighbors $n$, the weight function (equal or distance-dependent weighting), the algorithm (ball tree, $kd$-tree, brute force, or an automated pick of the three), and the power parameter $p$ of the Minkowski metric (where, e.g., $p=1$ recovers the taxicab metric and $p=2$ the Euclidean metric). For the parameter optimization, we used the same tuning procedure and validation set as for the neural simpletrons. We found optimal parameter settings at $n=11$, uniform weighting, automated algorithm pick, and $p=4$.
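A minimal brute-force version of this two-stage procedure might look as follows (a toy numpy stand-in, not the tuned library implementation used for the reported results):

```python
import numpy as np

def knn_predict(X_train, y_train, X, n=1, p=2):
    """Minimal kNN classifier with a Minkowski metric (p = 2 is the
    Euclidean case) and uniform weighting."""
    preds = []
    for x in X:
        d = np.sum(np.abs(X_train - x) ** p, axis=1)  # monotone in distance
        nearest = y_train[np.argsort(d)[:n]]
        preds.append(int(np.bincount(nearest).argmax()))
    return np.array(preds)

def two_stage_knn(X_lab, y_lab, X_unlab, X_test, n=1, p=2):
    """Stage 1: self-label all unlabeled training examples from the
    labeled ones (no uncertainty threshold). Stage 2: classify the test
    set with labeled plus self-labeled data."""
    y_self = knn_predict(X_lab, y_lab, X_unlab, n, p)
    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, y_self])
    return knn_predict(X_all, y_all, X_test, n, p)

X_lab = np.array([[0.0, 0.0], [10.0, 10.0]])
y_lab = np.array([0, 1])
X_unlab = np.array([[1.0, 0.0], [9.0, 10.0]])
y_test = two_stage_knn(X_lab, y_lab, X_unlab,
                       X_test=np.array([[0.0, 1.0], [10.0, 9.0]]))
```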

Figure 7 shows a comparison to kNN and to standard and recent state-of-the-art approaches for 100 labels and more. In this comparison (for lack of more comparable findings), all other algorithms use either a validation set with a substantial number of labels in addition to those available during training or (explicitly) use the test set for parameter tuning. If new sets of random labels were chosen between tuning iterations (for the training or validation set), even more labels than we account for in Figures 7 and 8 were actually seen by the algorithm to produce the final results. Also, some of the shown results (TSVM, AGR, AtlasRBF, and the Em-networks) were achieved in the transductive setting, where the (unlabeled) test data are included in the training process. The NeSi approaches are, to our knowledge, so far the closest to our goal of a competitive algorithm in the limit of as few labels as possible. We explicitly avoided any training or tuning on additional labeled data or on the test set, which also prevents the risk of overfitting to test data. The more complex a system is, the more labels are generally necessary to find parameter settings that are not overfitted to a small validation set and that generalize well. When test data are used during parameter tuning, the danger of such overfitting is even more severe, as overfitting effects could be mistaken for good generalizability. Therefore, in Figure 7, we group the models by the number of additional labeled data points used in the validation set for parameter tuning and also show the number of free parameters for each algorithm, as far as we were able to estimate it from the corresponding papers. These numbers have to be taken with caution, as not all parameters can be treated equally: for some tunable parameters, a default value may already give good results most of the time, while others may have to be highly optimized for each new task. Thus, these numbers should be taken as a rough index of model complexity.

Regarding classification performance, the NeSi networks achieve competitive results, surpassing even deep belief networks (DBN-rNCA) and other recent approaches (such as the Embed-networks, AGR, and AtlasRBF). In light of the reduced model complexity and the number of effectively used labels, we can furthermore compare to the few very recent algorithms with a lower error rate (M1$+$M2, VAT, and the Ladder networks).

Figure 8 shows the performance of the models with respect to the number of labels used during training (left-hand side) and with respect to the total number of labels used for the complete tuning and training procedure (right-hand side). For the NeSi algorithms, these plots are identical, as we only use maximally as many labels in the tuning phase as in the training phase for the shown results of 100 labels and more. No other model has yet been shown to operate in the same regime as NeSi networks are able to. For all other algorithms, these plots can be regarded as the two extreme cases, where their actual performance in our chosen setting would probably lie somewhere in between (if no overfitting to the test set occurred).

One competing model that so far comes closest to our limit setting of as few labels as possible is an approach that combines 10 generative adversarial networks (GANs) (Salimans et al., 2016) with five layers each. With down to 20 labels for training (2 labels per class), the classification error of a single GAN was reported as $(16.77 \pm 4.52)\%$, and for the full ensemble of 10 GANs as $(11.34 \pm 4.45)\%$. Comparison with the systems of Figure 8 is difficult, however, as no information about the number of additional labels in a validation set is reported. If we assume that the ensemble of 10 GANs requires at least 100 additional labels for tuning (a conservative estimate considering the at least 1000 labels of all other approaches; see Figure 7), we can compare the GANs to the performance of t-NeSi in the limit of few labels. The classification error of $(11.34 \pm 4.45)\%$ achieved by 10 GANs, tuned with presumably at least 100 labels and trained with 20 labels, then compares to an error of $(6.21 \pm 0.38)\%$ for t-NeSi, which was tuned with exactly 100 labels and also trained with 20 labels. In settings where more labels are available during training (again under the assumption of only few additional labels for parameter tuning), the GANs surpass the NeSi networks again and perform comparably to the Ladder networks.

### 4.4 Large-Scale Handwriting Recognition (NIST SD19)

Modern algorithms, especially in the field of semisupervised learning, should be able to handle and benefit from the ever-increasing amounts of available data (big data). A task comparable to MNIST, but with many more data points and much higher input dimensionality, is given by the NIST Special Database 19. It contains over 800,000 binary $128 \times 128$ images from 3600 different writers (with around half of the data being handwritten digits and the other half lower- and uppercase letters). We perform experiments on both digit recognition (10 classes) and case-sensitive letter recognition (52 classes).

We first applied the NeSi networks to the unpreprocessed NIST SD19 digit data with $D=16,384$ input pixels. The data are of much higher dimensionality than MNIST, and the patterns are not centered by pixel mass, which makes for a significantly more challenging task, as much more uninformative variation is kept within the data. Hence, for a mixture model, learning these variations would require many more hidden units to achieve similar performance. When keeping the same parameter setting as for MNIST (where we only increased $A$ to 25,000, giving $A/D \approx 1.5$, to account for the increased input dimensionality), the best performance on digit data in the fully labeled case was achieved by the r-NeSi network, with an error rate of 9.5%.

For better performance and easier comparison, we preprocessed the data similarly to MNIST (compare Cireşan, Meier, & Schmidhuber, 2012): for each image, we calculate a square bounding box, resize to $20 \times 20$, zero-pad to $28 \times 28$, and center by pixel mass. Finally, we invert the image, such that the patterns have high pixel values instead of the background, as is the case for MNIST. For simplicity, and because of the high similarity of the tasks, we then use the same setting of our free model parameters as for MNIST without further retuning. The experiments are done using 1, 2, 10, 60, 100, 300, or all labels per class. We allowed the same number of iterations as for MNIST to give sufficient training time for convergence. However, with roughly five times more training data than for MNIST but the same total number of labels, we now have a five times lower average activation in the top layer until self-labeling starts. In the semisupervised settings, we therefore scale the learning rate of the top layer by an additional factor of five compared to MNIST, to $\epsilon_R = 1 \times K/N$, for comparable convergence times. Figure 9 shows examples of weights learned by the ff$+$-NeSi network with 10 labels per class. In Table 5, we report the mean and standard error over 10 experiments on both digit and letter data. For the NeSi networks, the results are given for the permutation-invariant task. To the best of our knowledge, this is the first system to report results for NIST SD19 in the semisupervised setting.
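The preprocessing pipeline can be sketched as follows (a simplified numpy-only version with nearest-neighbor resizing and integer mass centering; the paper does not specify these implementation details, and the function name is ours):

```python
import numpy as np

def preprocess_nist(img):
    """MNIST-style preprocessing sketch for one binary NIST image
    (dark pattern on light background, nonempty): invert so the pattern
    has high values, crop to a square bounding box, resize to 20x20
    (nearest neighbor for simplicity), zero-pad to 28x28, and center by
    pixel mass."""
    x = img.max() - img                              # invert
    rows = np.where(x.sum(axis=1) > 0)[0]
    cols = np.where(x.sum(axis=0) > 0)[0]
    x = x[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    side = max(x.shape)                              # square bounding box
    sq = np.zeros((side, side))
    r0, c0 = (side - x.shape[0]) // 2, (side - x.shape[1]) // 2
    sq[r0:r0 + x.shape[0], c0:c0 + x.shape[1]] = x
    idx = (np.arange(20) * side) // 20               # nearest-neighbor resize
    out = np.zeros((28, 28))
    out[4:24, 4:24] = sq[np.ix_(idx, idx)]           # zero-pad to 28x28
    r, c = np.nonzero(out)                           # center by pixel mass
    w = out[r, c]
    dr = int(round(13.5 - (r * w).sum() / w.sum()))
    dc = int(round(13.5 - (c * w).sum() / w.sum()))
    return np.roll(out, (dr, dc), axis=(0, 1))

img = np.ones((128, 128))                            # light background
img[30:70, 40:60] = 0.0                              # dark rectangular "stroke"
out = preprocess_nist(img)
```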

Table 5: Test error (%) on NIST SD19 for digit (10 classes) and case-sensitive letter (52 classes) classification.

Digits (10 classes):

| Number of Labels/Class | 1 | 2 | 10 | 60 | 100 | 300 | Fully Labeled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| #labels total | 10 | 20 | 100 | 600 | 1000 | 3000 | 344,307 |
| ff$+$-NeSi | 7.56 $\pm$ 1.79 | 6.15 $\pm$ 0.14 | 6.20 $\pm$ 0.16 | 6.02 $\pm$ 0.08 | 6.02 $\pm$ 0.12 | 5.70 $\pm$ 0.03 | 5.11 $\pm$ 0.01 |
| r$+$-NeSi | 9.84 $\pm$ 2.40 | 8.50 $\pm$ 2.09 | 6.14 $\pm$ 0.23 | 5.83 $\pm$ 0.14 | 5.94 $\pm$ 0.12 | 5.72 $\pm$ 0.10 | 4.52 $\pm$ 0.01 |
| t-NeSi | **5.71 $\pm$ 0.42** | **5.23 $\pm$ 0.15** | **5.26 $\pm$ 0.23** | **4.84 $\pm$ 0.02** | **4.86 $\pm$ 0.03** | **4.83 $\pm$ 0.02** | 4.50 $\pm$ 0.01 |
| 35c-MCDNN | | | | | | | **0.77** |

Letters (52 classes):

| Number of Labels/Class | 1 | 2 | 10 | 60 | 100 | 300 | Fully Labeled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| #labels total | 52 | 104 | 520 | 3120 | 5200 | 15,600 | 387,361 |
| ff$+$-NeSi | 55.70 $\pm$ 0.62 | 51.32 $\pm$ 0.79 | 46.22 $\pm$ 0.43 | 44.24 $\pm$ 0.23 | 43.69 $\pm$ 0.21 | 42.96 $\pm$ 0.28 | 34.66 $\pm$ 0.05 |
| r$+$-NeSi | 64.97 $\pm$ 0.85 | 60.32 $\pm$ 0.91 | 54.08 $\pm$ 0.38 | 43.73 $\pm$ 0.15 | **41.57 $\pm$ 0.13** | **37.95 $\pm$ 0.12** | 31.93 $\pm$ 0.06 |
| t-NeSi | **52.14 $\pm$ 1.07** | **48.46 $\pm$ 0.92** | **45.62 $\pm$ 0.43** | **41.87 $\pm$ 0.32** | 41.75 $\pm$ 0.36 | 41.13 $\pm$ 0.30 | 33.34 $\pm$ 0.04 |
| 35c-MCDNN | | | | | | | **21.01** |

Notes: The results for NeSi are permutation invariant and given as the mean and standard error (SEM) over 10 independent repetitions, with randomly drawn, class-balanced labels. Free-parameter values are the same as those used for MNIST. Numbers in bold are the best performing (in terms of lowest mean error) of the compared systems for each label setting.

As for MNIST, the performance of our three-layer network in the fully labeled setting is not competitive with state-of-the-art fully supervised algorithms (like the 35c-MCDNN, a committee of 35 deep convolutional neural networks; Cireşan et al., 2012). Note, however, that our results apply to the permutation-invariant setting and do not take prior knowledge about two-dimensional image data into account (as convolutional networks do). More important, for the settings with few labels, we see only a relatively mild increase in test error when we strongly decrease the total number of used labels. Even for just 10 labels per class, most patterns are correctly classified in the challenging task of case-sensitive letter classification (chance is below 2%). Comparison of the digit classification setting with MNIST furthermore suggests that the absolute rather than the relative number of labels per class is important for learning in our networks (compare, for example, Rasmus et al., 2015, note 4).

In general, digit classification with NIST SD19 seems to be a more challenging task than MNIST (which can also be observed in the results of Cireşan et al., 2012). However, with decreasing numbers of labels, the test error in our case increased more slowly than for MNIST, and in the extreme case of a single label per class, it even fell below the MNIST results. When using, as for MNIST, only 60,000 training examples for NIST, the test error for the single-label setting on digit data increased from $(7.56 \pm 1.79)\%$ to $(9.10 \pm 0.92)\%$ for ff$+$-NeSi, nicely showing the benefit of additional unlabeled data points for learning in NeSi networks. In fact, rare outliers are the main reason for the increase in test error in the single-label case: occasionally, two or more classes were learned completely switched, for example, all 3's were learned as 8's and vice versa. This can happen when the single randomly chosen labeled data points of two similar classes are too ambiguous and therefore lie close together at the border between two clusters. Additional unlabeled data points lead to better-defined clusters, where this problem occurs less frequently. Since in the recurrent network the label information is also fed back to the middle layer, this network is more sensitive to label information. On one hand, this helps when more label information is known. On the other hand, it also more often results in a stronger accumulation of errors in the self-labeling procedure, as wrong labels are less frequently corrected. The best result in this setting of very few labels is again achieved by the truncated feedforward network, as with better-defined clusters in the middle layer, the problem of class confusion also becomes less frequent (also compare the detailed results in appendix C).

With more training data available than for MNIST, we also tried bigger networks of 20,000 hidden units for digit data but saw only slight improvements in the test error. This points to a limit on the number of learnable subclasses (i.e., writing styles) within the data: modeling more than $C=10{,}000$ subclasses improves performance very little, but the additional data in NIST help to better define those subclasses.

## 5 Discussion

In this study, we explored classifier training on data sets with few labels. We put special emphasis on adhering to this restriction not only during the training phase but throughout the complete tuning, training, and testing procedure. Our tool was a novel neural network with learning rules based on a maximum likelihood objective. Starting from hierarchical Poisson mixtures, the derived three-layer directed data model takes on a form similar to learning in standard DNNs. The parameters of the network can be optimized with a very limited number of labels, and training in this setting achieved competitive results, making this the first network shown to operate with no more than 10 labels per class in total on the investigated data sets.

### 5.1 Relation to Standard and Recent Deep Learning

Neural simpletrons are, on one hand, similar to standard DNNs as they learn online (i.e., they learn per data point or per mini-batch), are efficiently scalable, and as their activation and learning rules are local and consist of very elementary mathematical expressions (see Table 1). On the other hand, the NeSi networks exhibit features that are a hallmark of deep directed generative models, such as learning from unlabeled data and integration of bottom-up and top-down information for optimal inference. By comparing the learning and neural interaction equations of DNNs and the NeSi networks directly, equation T1.5 for top-down integration and the learning rules, equations T1.7 and T1.8, represent the crucial differences. The first allows the NeSi networks (the r- and r$+$-NeSi versions) to integrate top-down and bottom-up information for inference, which contrasts with pure feedforward processing in standard DNNs. The second shows that NeSi learning is local and neurally plausible (Hebbian) while approximating likelihood optimization, which differs from the less local backpropagation used for discriminative learning in standard DNNs. For the NeSi networks, recurrent bottom-up/top-down integration was especially useful when many labels were available, particularly in the complete setting (see section B.5) or for the task of case-sensitive letter recognition (see Table 5), which represented one of the largest-scale applications considered here. When we acquire additional inferred labels through self-labeling, the (truncated) feedforward system was best at maintaining a low test error even down to the limit of a single training label per class. For fully labeled data, the NeSi systems were not observed to be competitive (e.g., for MNIST). Discriminative approaches dominate in this regime, as it seems difficult to compete with discriminative learning using such a minimalistic system once sufficiently many labeled data points are available.
Furthermore, the generative NeSi approach relies on the possibility of learning representations of meaningful templates for Poisson-like distributed data (as shown, for example, in Figures 11, 5, and 9), and these template representations make the networks very interpretable. However, for large image databases showing 2D images of 3D objects, for example, learning such templates based on pixel intensities seems very challenging. Approaches applied to such data therefore commonly use (handcrafted or learned) features that typically transform images into suitable metric spaces. Such data would motivate studies of networks similar to ours but assuming gaussian observables. Similar approaches have shown competitive performance (Van den Oord & Schrauwen, 2014). Alternatively, 2D images of 3D objects may be transformed into nonnegative data spaces to make them suitable for our model with Poisson observables. Any such approach would require the introduction of an additional feature layer, however, with potentially additional free parameters.

Besides the approaches studied here, many other systems are able to make use of top-down and bottom-up integration for learning and inference. Top-down information is provided in an indirect way if a system introduces new labels itself by using its own inference mechanism. Similar to the ff$+$- and r$+$-NeSi networks, this self-labeling idea has been pursued repeatedly before (for a recent overview, see Triguero et al., 2015). For the NeSi systems, such feedback worked especially well, which may indicate that self-labeling is particularly well suited for deep directed models in general. Systems that make a more direct use of bottom-up and top-down information include approaches based on undirected graphical models. The most prominent examples, especially in the context of deep learning, are deep restricted Boltzmann machines (RBMs). While RBMs are successfully used in many contexts (e.g., Hinton et al., 2006; Goodfellow, Courville, & Bengio, 2013; Neftci, Pedroni, Joshi, Al-Shedivat, & Cauwenberghs, 2015), the performance of RBMs alone, without additional learning methods, does not seem to be competitive with recent results on semisupervised learning. The best-performing RBM-related systems we compared to here are the HDRBM (Larochelle & Bengio, 2008) for 20 Newsgroups and the DBN-rNCA system (Salakhutdinov & Hinton, 2007) for MNIST. Both approaches use additional mechanisms for semisupervised classification, which can be taken as evidence that standard RBM approaches are more limited when labeled data are sparse. In this semisupervised setting, both ff-NeSi and r-NeSi perform better than the DBN-rNCA approach for MNIST (see Figures 7 and 8) and better than the HDRBM for 20 Newsgroups (see Table 3). When optimized for the fully labeled setting, NeSi even improves considerably over the HDRBM in the fully labeled 20 Newsgroups task.
Recent RBM versions, enhanced and combined with discriminative deep networks (Goodfellow et al., 2013), outperform NeSi networks on fully labeled MNIST; however, the competitiveness of such approaches in semisupervised settings has not been shown so far.

To reduce network complexity and improve classification performance, we showed results with a newly introduced selection criterion for network truncation (Forster & Lücke, 2017). Reducing the number of active neurons in a network for enhanced performance is also common for standard deep networks and became popular with approaches like “dropout” (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014). However, the truncation of hidden variables used here is notably different as it uses a systematic data-driven selection of very few neurons (here, only $0.15%$) to maximize the free energy, whereas dropout typically uses a random selection of half of the neurons to reduce coadaptation.
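The distinction can be illustrated with a minimal sketch (a simplified illustration with hypothetical variable names; in the actual truncation approach, the selection maximizes the free energy rather than raw activations):

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.random(10_000)  # stand-in for per-neuron responses to one data point

# Truncation (as used here): a systematic, data-driven selection of very few
# neurons -- here 0.15% of 10,000, i.e., the 15 best-matching units.
k = 15
truncated_mask = np.zeros(activations.shape, dtype=bool)
truncated_mask[np.argsort(activations)[-k:]] = True

# Dropout (Srivastava et al., 2014): a random selection of (typically) half of
# the neurons, independent of the data, to reduce coadaptation.
dropout_mask = rng.random(activations.shape) >= 0.5
```

The key contrast is that the truncation mask depends deterministically on the data point, whereas the dropout mask is resampled at random for every update.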

Regarding the learning and inference equations themselves, the compactness of the equations defining the NeSi algorithms and their formulation as minimalistic neural networks represent a major difference to pure generative approaches (such as Saul et al., 1996; Larochelle & Murray, 2011; Gan et al., 2015) or combinations of DNNs and graphical models (e.g., Kingma et al., 2014). Regarding empirical comparisons, typical directed generative models are not compared on typical DNN tasks but use other evaluation criteria. Prominent or recent examples such as deep sigmoid belief networks (SBNs; see, for example, Saul et al., 1996; Gan et al., 2015) have, for instance, not been shown to be competitive with standard discriminative deep networks on semisupervised classification tasks so far. In general, a main challenge is the need to introduce approximation schemes. The accuracy of approximations for large networks, and the complexity of the networks themselves, still seem to prevent scalability or competitive performance on tasks as discussed here. In principle, however, deep directed generative models such as deep SBNs or other deep directed multiple-cause approaches are more expressive than deep mixture models. One may thus also interpret our results as highlighting the general potential of deep directed generative models for tasks such as classification.

### 5.2 Empirical Performance, Model Complexity, and Data with Few Labels

Our main empirical results for the NeSi systems were obtained using the 20 Newsgroups, the MNIST, and the NIST SD19 data sets (with MNIST simply being the data set for which most empirical data for semisupervised learning are available). Tables 3 to 5 and Figures 7 and 8 summarize the results and provide comparison to those of other approaches. The r-NeSi system is the best-performing system for the semisupervised 20 Newsgroups data set (see Table 3), but the data set is much more popular as a fully supervised benchmark (in the semisupervised setting, comparison is only possible to the HDRBM). The semisupervised MNIST benchmark is therefore more instructive for comparison.

Considering Figure 8 (right-hand side), the NeSi algorithms still perform well for a budget of 1000, 600, or just 100 labels. For 600 labels, t-NeSi has a classification error well below $4\%$, and all NeSi approaches with self-labeling have classification errors below $5\%$ down to 100 labels. So far, it has not been shown that other classifiers can be trained with similarly low numbers of labels; all comparable approaches use at least 1000 labels to optimize the free parameters of their respective systems. If these additional labels are not considered, the NeSi approaches t-NeSi and r$+$-NeSi are outperformed by three recent systems in the limit of few labels: M1$+$M2 (Kingma et al., 2014), VAT (Miyato et al., 2016), and the ladder network (Rasmus et al., 2015). All three systems use a combination of different approaches: M1$+$M2 combines generative learning and discriminative backpropagation learning; the results for VAT are obtained by combining a DNN using backpropagation with a smoothness constraint derived from the data distribution; and the ladder network applies a per-layer denoising objective on top of standard discriminative models like MLPs and CNNs. The many free parameters of M1$+$M2, VAT, and ladder networks seem to require relatively large labeled validation sets (see Figure 7). M1$+$M2 and ladder networks used 10,000 additional labels for the tuning of these parameters, while VAT used 1000 additional labels. It can be argued that free parameters could be tuned in other ways (e.g., using other, related data sets). However, it remains to be shown how well any of the recently suggested approaches would perform in such a case. The use of up to 10,000 labels may indicate that large labeled validation sets are important for some approaches to obtain high performance. Other recent work uses ensembles of generative adversarial networks (GANs; Salimans et al., 2016).
As discussed in section 4.2, comparison to the GAN approach is made difficult because the labels required for the tuning of free parameters are not reported. If we only consider labels for training, the NeSi networks are the first to report results in the absolute limit case of only a single label per class on the MNIST data set. In this limit, the t-NeSi approach achieves lower classification errors than the best results reported by Salimans et al. (2016): an error of $(7.22 \pm 0.53)\%$ for t-NeSi trained with one label per class, or of $(6.21 \pm 0.38)\%$ when trained with two labels per class, compared to $(11.34 \pm 4.45)\%$ for an ensemble of 10 GANs trained with two labels per class. How many labels the GAN ensembles require in total (for training and tuning) remains unknown.

Another way to view the comparisons in Figures 7 and 8 is to interpret the results as highlighting a performance versus model complexity trade-off. If we consider the learning and tuning protocols that were used for the different systems to achieve the reported performance, large differences in the number of tunable parameters, the size of validation sets, and the complexity of the systems can be noticed. While some systems need to tune only a few parameters, others (especially hybrid systems) require tuning of quite many (see Figure 7). Parameter tuning can be considered a second optimization loop requiring labels in addition to those of the training set. It may be argued that not counting these additional labels favors large systems with many tunable parameters, as is typically the case when parameterized models are fitted to data. To (partly) normalize for model complexity, performance comparison with regard to the total number of required labels could therefore serve as a kind of empirical Occam's razor. If the total number of labels is considered in the case of MNIST, the comparison of system performances changes as illustrated in Figure 8. Considering the right-hand side of Figure 8, the VAT system (1000 additional labels) could, for instance, be considered to perform more strongly than the ladder network. However, while the numbers of tunable parameters of the different systems and the sizes of the used validation sets are clearly correlated (see Figure 7), it remains unclear how many additional labels would be required by the different systems. The two plots of Figure 8 can therefore be considered as two limit cases for comparison.

Regarding the comparison of the networks themselves, M1$+$M2 is the approach most similar to the NeSi networks, as both use generative models as integral parts. Both approaches can also be taken as evidence that two hidden layers of generative latents already result in competitive performance. A difference is, however, the strong reliance of M1$+$M2 on deep neural networks to parameterize the dependencies between observed and hidden variables and among hidden variables, which are optimized using DNN gradient approaches (the same applies to the DNNs used for the applied variational approximation). Inference and learning in M1$+$M2 are therefore significantly more intricate and require multiple deep networks. Also, the generative description itself is very different (e.g., motivated by easy differentiability based on continuous latents) and is in M1$+$M2 not directly used for inference. For neural simpletrons, the generative and the neural network weights are identical and are directly used for inference.

Compared to the hybrid M1$+$M2, VAT, and ladder networks, the NeSi networks studied here are nonhybrid networks. As AGR, Atlas-RBF, and EmbeddCNN are also hybrids (Liu et al., 2010; Pitelis et al., 2014; Weston et al., 2012), the NeSi networks can be considered the best-performing nonhybrid approaches, even if we do not use self-labeling and truncated training and even if we consider exclusively the labels used for training (see Figure 8, left-hand side). With self-labeling and truncation mechanisms, r$+$-NeSi, ff$+$-NeSi, and t-NeSi achieve even better performance, especially in the limit of few labels as investigated here. Self-labeling and truncation introduce one additional free parameter each, but both parameters are tunable on the same minimal validation set as the other free parameters of the NeSi networks. Furthermore, both additional mechanisms are obtained from the same single central likelihood objective, equation 2.5, from which the NeSi networks were derived.

Finally, using the NIST SD19 data set, we demonstrate the applicability of the approach to data sets with more data (up to 800,000 data points), larger input dimensionality (up to 16,384 input pixels), and more classes (up to 52 classes for case-sensitive letter recognition). NIST SD19 is known to be much more challenging than MNIST (compare, e.g., Cireşan et al., 2012). The NeSi approaches scale to all of the much larger settings investigated in Table 5, and the results show that good classification performance can be maintained with few labels. The networks can successfully leverage the larger number of unlabeled data—for example, for the setting of one training label per class (10-class NIST digit classification). For the challenging 52-class NIST setting, t-NeSi maintains above $50%$ correct classification for down to 10 labels per class in total. The NeSi classification in this setting remains fully interpretable with subclasses shown in Figure 9. In general, we have provided here the first results for semisupervised learning for NIST SD19.

### 5.3 Future Work and Outlook

As the NeSi networks share many properties with standard deep neural networks, further enhancements such as network pruning, annealing, or dropout could be investigated to further increase performance or efficiency. Also, settings with additional information in the form of correct-or-false classification feedback instead of labels represent an interesting future research direction. For example, Holca-Lamarre, Lücke, and Obermayer (2017) have, in a study with a neuroscientific focus, shown that classification performance can improve using a global reinforcement signal. Further reduction of label dependency could be achieved by using active learning (Cohn, Ghahramani, & Jordan, 1996) to systematically select required labels based on the uncertainty in posterior distributions. By choosing the labels to learn from in a better-than-random way, the number of labels needed to achieve similar or better performance could be further reduced, as shown, for example, with the BayesianCNN (see Figure 8, left-hand side) by Gal et al. (2017). Such an active learning approach could complement the self-labeling used here: a user-provided label could, for example, be requested at low decision certainties, while a self-provided label could be used at high decision certainties. Any new technique for improved learning may make the algorithms more complex and may introduce new free parameters, however. For the goal studied here, such future systems will have to remain tunable and trainable with as few labels as possible. The same applies to any future versions of our network with more than three layers or different layer variants.

The development of further variants of simpletron layers would allow for higher flexibility to construct simpletron networks that are most suitable for the given data. Gaussian mixture models would be an interesting and promising candidate for metric data in a Euclidean space. And incorporation of prior knowledge about spatial relations in generative convolutional variants (e.g., Dai et al., 2013; Gal & Ghahramani, 2016; Patel et al., 2016) would be more suitable for image data. However, derivations of such layers and guarantees on the (approximate) equivalence between EM and neural fixed points similar to section 3.1 are not necessarily possible or as straightforward as for the hierarchical Poisson mixture model and require careful further research.

Also, the combination with discriminative learning approaches is a promising extension. Ideally, such a combination would maintain a monolithic architecture and limited complexity. Other studies have already shown that deep discriminative models can be related to directed generative models in grounded mathematical ways (see Patel et al., 2016, for a recent example). Similarly, discriminative counterparts may be derivable for the NeSi systems.

Still further potential research directions are combinations with hyperparameter optimization approaches (e.g., Thornton et al., 2013; Bergstra et al., 2013; Hutter et al., 2015) in order to increase autonomy and to further exploit the very low number of free parameters. Finally, the probabilistic nature of the NeSi networks would allow for addressing problems such as label noise in straightforward ways, while its generative model relation would allow for the investigation of tasks other than classification.

## Notes

^{1}

This is sometimes referred to as one-hot coding.

^{2}

We use a Python 2.7 implementation of the NeSi algorithms, which is optimized using Theano to execute on NVIDIA TITAN X and Tesla GPUs. Details are in section B.1. The source code and scripts for repeating the experiments can be found at https://github.com/dennisforster/NeSi.

## Appendix A: Derivation Details

Although the resulting NeSi neural network models are given by a very compact and simple set of equations (see Table 1), the derivation of these equations is not trivial. We therefore give further insight into some derivation steps here to allow for a better understanding of the model at hand. In section A.1, we give details on the derivation of the EM update rules for the underlying generative model. In section A.2, we show the derivation steps necessary to attain the approximate equivalence of neural online learning with EM batch learning at convergence, which is the basis of our neural network derivation.

### A.1 EM Update Steps

#### A.1.1 E-Step

The posterior $p(k|c,l,\Theta)$ can easily be obtained by applying Bayes' rule for the labeled and the unlabeled case. For $p(c|\vec{y},l,\Theta)$, however, some additional steps are necessary to attain the compact form shown in equation 2.11.

#### A.1.2 M-Step

### A.2 Approximate Equivalence of Neural Online Learning at Convergence

The large products in the numerator and denominator of equations A.21 and A.22 can be regarded as polynomials of order $N$ in $\epsilon_W$ and $\epsilon_R$, respectively. Even for small $\epsilon_W$ and $\epsilon_R$, however, it is difficult to argue that higher-order terms of $\epsilon_W$ and $\epsilon_R$ can be neglected because of the combinatorial growth of the prefactors given by the large products.

We therefore consider the approximations derived for the nonhierarchical model in Keck et al. (2012), which were applied to an equation of the same structure as equations A.21 and A.22. At closer inspection of the terms $F_{cd}(T+N-n)$ and $G_{kc}(T+N-n)$, we find that we can apply these approximations also in the hierarchical case. For completeness, we reiterate the main intermediate steps of these approximations below.

For the second step, equation A.24, we approximated the sum over $n$ in equation A.23 by observing that the terms with large $n$ are negligible and by approximating sums of $F_{cd}(T+N-n)$ over $n$ by the mean $\bar{F}_{cd}(0)$. For the last steps, equation A.25, we used the geometric series and approximated for large $N$ (for details on these last two approximations, see the supplement of Keck et al., 2012). Furthermore, we used the fact that for small $\epsilon_W$, $\frac{\epsilon_W \exp(-\epsilon_W B)}{1-\exp(-\epsilon_W B)} \approx B^{-1}$ (which can be seen, for example, by applying l'Hôpital's rule).
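For completeness, the limit invoked in this last approximation can be written out; differentiating numerator and denominator with respect to $\epsilon_W$ (l'Hôpital's rule) gives

```latex
\lim_{\epsilon_W \to 0}
\frac{\epsilon_W \, e^{-\epsilon_W B}}{1 - e^{-\epsilon_W B}}
= \lim_{\epsilon_W \to 0}
\frac{e^{-\epsilon_W B} - \epsilon_W B \, e^{-\epsilon_W B}}{B \, e^{-\epsilon_W B}}
= \frac{1}{B},
```

so that for small but finite $\epsilon_W$, the ratio is well approximated by $B^{-1}$.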

Note that each approximation is individually very accurate for small $\u220aW$ and large $N$. Equations 3.4 can thus be expected to be satisfied with high accuracy in this case and numerical experiments based on comparisons with EM batch-mode learning verified such high precision.

## Appendix B: Computational Details

### B.1 Parallelization on GPUs and CPUs

The online update rules of the neural network (see Table 1) are ideally suited for parallelization using GPUs, as they break down to elementary vector and matrix multiplications. We observed GPU execution with Theano to result in training-time speed-ups of over two orders of magnitude compared to single-CPU execution (NVIDIA GeForce GTX TITAN Black GPUs versus AMD Opteron 6134 CPUs).

The maximal deviation from single-step updates caused by this approximation can be shown to be of order $O((\epsilon\nu)^2)$. Since this effect is negligible for $\epsilon\nu \ll 1$, as we also experimentally confirmed, we consider the mini-batch size $\nu$ only as a parallelization parameter and not as a free parameter that could be chosen to optimize anything other than training speed.
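As a minimal numerical sketch (using a generic convex-combination update of the same first-order structure, not the exact NeSi learning rule), one can check that $\nu$ sequential online steps and one combined mini-batch step differ only at order $(\epsilon\nu)^2$:

```python
import numpy as np

def online_updates(w, xs, eps):
    """Apply nu sequential single-step updates: w <- (1 - eps) * w + eps * x."""
    for x in xs:
        w = (1.0 - eps) * w + eps * x
    return w

def minibatch_update(w, xs, eps):
    """Apply one parallelized update over the whole mini-batch of size nu."""
    nu = len(xs)
    return (1.0 - eps * nu) * w + eps * xs.sum(axis=0)

rng = np.random.default_rng(1)
w0 = rng.random(5)           # toy weight vector
batch = rng.random((10, 5))  # mini-batch of nu = 10 data points
eps = 1e-3                   # learning rate, so eps * nu = 0.01 << 1

# discrepancy between sequential and mini-batch updates is O((eps * nu)^2)
diff = np.abs(online_updates(w0, batch, eps) - minibatch_update(w0, batch, eps)).max()
```

Here `diff` stays below $(\epsilon\nu)^2 = 10^{-4}$, illustrating why the mini-batch approximation is benign for $\epsilon\nu \ll 1$.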

### B.2 Weight Initialization

For the complete setting ($C=K$), there is a good amount of labeled data per hidden unit even when labeled data are sparse, but the risk of running into early local optima in which the classes are not well separated is high. We therefore initialize the weights of the first hidden layer in a modified version of the scheme of Keck et al. (2012): we compute the mean $m_{kd}$ and standard deviation $\sigma_{kd}$ of the labeled training data for each class $k$ and set $W_{kd}=m_{kd}+U(0,2\sigma_{kd})$, where $U(x_{\mathrm{dn}},x_{\mathrm{up}})$ denotes the uniform distribution on the range $(x_{\mathrm{dn}},x_{\mathrm{up}})$.

For the overcomplete setting ($C>K$), where there are far fewer labeled data points than hidden units in the semisupervised setting and class separation is not an imminent problem, we initialize the weights using all data, disregarding the label information. With the mean $m_d$ and standard deviation $\sigma_d$ over all training data points, we set $W_{cd}=m_d+U(0,2\sigma_d)$.

The weights of the second hidden layer are initialized as $R_{kc}=1/C$. The only exceptions to this rule are the additional experiments on the 20 Newsgroups data set in section B.5 for the fully labeled setting. As noted in the text, in this setting, we were able to make better use of the recurrent connections of the r-NeSi network and the fully labeled data set by initializing the weights of the second hidden layer as $R_{kc}=\delta_{kc}$.
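In code, these initializations amount to the following sketch (a hypothetical NumPy illustration with $K$ classes, $C$ hidden units, and a data matrix of shape $(N, D)$; function names are ours, not from the released implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_W_complete(X_labeled, labels, K):
    """Complete setting (C = K): W_kd = m_kd + U(0, 2*sigma_kd), with mean and
    standard deviation computed per class from the labeled data."""
    W = np.empty((K, X_labeled.shape[1]))
    for k in range(K):
        Xk = X_labeled[labels == k]
        W[k] = Xk.mean(axis=0) + rng.uniform(0.0, 2.0 * Xk.std(axis=0))
    return W

def init_W_overcomplete(X, C):
    """Overcomplete setting (C > K): W_cd = m_d + U(0, 2*sigma_d), with mean and
    standard deviation computed over all data, disregarding labels."""
    m, s = X.mean(axis=0), X.std(axis=0)
    return m + rng.uniform(0.0, 2.0 * s, size=(C, X.shape[1]))

def init_R(K, C):
    """Second hidden layer: uniform initialization R_kc = 1/C."""
    return np.full((K, C), 1.0 / C)
```

Note that `rng.uniform` broadcasts an array-valued upper bound, so the noise range adapts per input dimension as in the description above.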

### B.3 A Likelihood Criterion for Early Stopping

Training of the first layer in the feedforward network is not influenced by the state of the second layer and is therefore independent of the number of provided labels. This is no longer the case for the recurrent network (r-NeSi). A low number of labels can lead to overfitting effects in r-NeSi when the number of hidden units in the first hidden layer is substantially larger than the number of labeled data points. However, when using the inferred labels for training in the r$+$-NeSi network, such overfitting effects vanish again.

Since learning in our network corresponds to maximum likelihood learning in a hierarchical generative model, a natural criterion for early stopping of r-NeSi can be based on monitoring the log likelihood, which is given by equation 2.5 (with the generative weights replaced by the learned weights $(W,R)$ of the network). As soon as the scarce labeled data start to overfit the first-layer units as a result of the top-down influence in $I_c$ (compare equation T1.5), the log likelihood computed over the whole training data is observed to decrease. This decline in data likelihood can be used as a stopping criterion to avoid overfitting without requiring additional labels.

Figure 10 shows an example of the evolution of the average log likelihood per data point during training, compared to the test error. For experiments over a variety of network sizes, we found strong negative correlations of $\langle\mathrm{PPMCC}\rangle = -0.85 \pm 0.1$. To smooth out random fluctuations in the likelihood, we compute the centered moving average over 20 iterations and stop as soon as this value drops below its maximum value by more than the centered moving standard deviation. The test error in Figure 10 is computed only for illustration purposes; in our experiments, we solely used the moving average of the likelihood to detect the drop event and stop learning. In our control experiments on MNIST, we found that the best test error generally occurred some iterations after the peak in the likelihood (compare Figure 10), which, for simplicity, we did not exploit for our reported results.
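The stopping criterion can be sketched as follows (a simplified illustration; whether the moving standard deviation is taken at the current iteration or at the maximum is an implementation detail, and we use the current one here):

```python
import numpy as np

def should_stop(loglik_trace, window=20):
    """Return True once the moving average of the per-data-point log likelihood
    has dropped below its maximum so far by more than the moving standard
    deviation (computed over the same window)."""
    L = np.asarray(loglik_trace, dtype=float)
    if len(L) < window:
        return False  # not enough iterations to form a full window
    avg = np.convolve(L, np.ones(window) / window, mode="valid")
    std_now = L[-window:].std()
    return avg[-1] < avg.max() - std_now
```

For instance, a trace that rises and then declines triggers the criterion shortly after the drop, while a monotonically increasing trace never does.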

### B.4 Overfitting Control for NeSi

With the tuning protocol shown in section 4.1, we ensure that our networks will not overfit to the test set such that our reported results accurately represent the generalization error. However, overfitting to the training set can still occur and may decrease the overall performance of the networks.

With networks of 10,000 hidden units on MNIST, which learn on only 60,000 training samples, some of the hidden units adapt to represent more rarely seen patterns, while others adapt to represent patterns that are more frequent in the training data. Furthermore, the network learns the frequency at which patterns occur as the distribution $p(c|R)=\frac{1}{K}\sum_k R_{kc}$. Figure 11 displays a random selection of 100 out of the 10,000 fields after training using the r$+$-NeSi algorithm.

Fields colored blue in Figure 11 have a very low probability of $p(c|R)\cdot N<0.5$, with most of their $p(c|R)$ being close to zero. These fields have ceased to specialize further to respective pattern classes because sufficiently many other fields have already optimized for a class. They are effectively discarded by the network itself, as the low values in $R_{kc}$ further suppress the activation of those fields in the recurrent network. With longer training times, $p(c|R)$ of those fields converges to zero, which practically prunes the network to the remaining size. The red fields in Figure 11 have a probability of $0.5 \le p(c|R)\cdot N<1.5$ of being activated, which corresponds to approximately one data point in the training set activating the field. Such weights are often adapted to one single training data point with a very uncommon writing style (like the crooked 7 in the fourth column, ninth row) or some kind of preprocessing artifact (like the cropped 3 in the second column, seventh row).

We controlled for the effect of rarely active fields (blue and red in Figure 11), especially as some of these fields are clearly overfitted to the training set. For this, we compared an original network of 10,000 fields (i.e., 10,000 middle-layer neurons) with a network from which all fields with activity $p(c|R)\cdot N < 1.5$ were removed (around 15% of the 10,000 fields). We observed no significant change in test error between the original and the pruned network. The reason is that the pruned fields are rarely activated at test time, because of low similarities to test data and strong suppression by the network itself (due to the low activation rates learned during training).
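The pruning control can be written compactly. The sketch below is ours and assumes the middle-layer fields are stored as rows of a matrix `W` and the top-layer weights as a matrix `R` of shape $K \times C$, so that $p(c|R)$ is the column mean of `R`:

```python
import numpy as np

def prune_rare_fields(W, R, n_train, threshold=1.5):
    """Remove fields expected to be activated by fewer than `threshold`
    training data points.

    W: array of shape (C, D), one field (middle-layer weight vector) per row.
    R: array of shape (K, C), so that p(c|R) = (1/K) * sum_k R[k, c].
    """
    p_c = R.mean(axis=0)                 # p(c|R) for each field c
    keep = p_c * n_train >= threshold    # expected number of activations
    return W[keep], R[:, keep]
```

With `threshold=1.5` and `n_train` set to the training set size, this removes both the blue and the red fields of Figure 11.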

### B.5 Optimization in the Fully Labeled Setting for 20 Newsgroups

In the fully labeled setting on the 20 Newsgroups data set, we can gain a larger benefit from the recurrence of r-NeSi. Changing the initialization procedure from $R_{kc} = 1/C$ to $R_{kc} = \delta_{kc}$ helps to avoid shallow local optima and reaches a test error of $(17.85 \pm 0.01)\%$. This initialization fixes each subclass $c$ to a single specific class $k$ by setting all connections between the first and second hidden layers to other classes to a hard zero. Training with such a weight initialization is, however, useful only when very large numbers of labeled data are available. The top-down label information is then a necessary mechanism to ensure that the middle-layer units learn the appropriate representation of their respective fixed class (e.g., that a middle-layer unit fixed to class alt.atheism mainly, or exclusively, learns from data belonging to that class). So instead of first learning representations in the middle layer purely from the data and then learning the classes with respect to these representations from the labels, as the (greedy) ff-NeSi does, the r-NeSi algorithm can also conversely shape its middle-layer representations according to their probability of belonging to the class of the presented data point.
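The two initialization procedures can be stated as follows. This is our own sketch (the shapes follow the $K \times C$ convention used for $R$ above), not code from the original implementation:

```python
import numpy as np

def init_R(K, C, mode="uniform"):
    """Initialize the top-layer weights R of shape (K, C).

    'uniform': R_kc = 1/C, the standard initialization.
    'delta':   R_kc = delta_kc, fixing each middle-layer unit c to one
               class k; only sensible when C == K, as in the fully
               labeled 20 Newsgroups setting with C = 20.
    """
    if mode == "uniform":
        return np.full((K, C), 1.0 / C)
    if mode == "delta":
        if C != K:
            raise ValueError("delta initialization assumes C == K")
        return np.eye(K)
    raise ValueError(f"unknown mode: {mode}")
```

The hard zeros of the delta initialization remain zero under multiplicative likelihood-based updates, which is what fixes each unit's class assignment throughout training.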

To decide between this initialization procedure and our standard one in the fully labeled setting, we used the fully labeled training set during parameter tuning (again with a half/half split into training and validation sets). With the better avoidance of shallow optima by this initialization, lower learning rates $\epsilon_W$ were now more beneficial ($\epsilon_R$ drops out as a free parameter, as the top layer remains fixed). A coarse manual grid search in this setting resulted in optimal parameter values of $A = 90{,}000$ ($A/D \approx 1.47$) and $\epsilon_W = 0.02$ (the lowest value in our search range, chosen to restrict computational time), while keeping $C = 20$. These results also show that optimizing parameters for each individual label setting, and changing the initialization procedure based on label availability, could lead to better parameter settings and stronger performance in the other settings as well.

## Appendix C: Detailed Training Results

We performed 100 independent training runs for the results obtained on 20 Newsgroups and MNIST in sections 4.2 and 4.3, and 10 independent training runs for the NIST data set in section 4.4, with each of the given networks for each label setting, drawing new randomly chosen, class-balanced labels for each run. Tables 6 to 18 give a detailed summary of the statistics of the obtained results. They show the mean test error alongside the standard error of the mean (SEM) and the standard deviation (in percentage points), as well as the minimal and maximal test error over the given number of runs. For the networks with self-labeling of unlabeled data (ff$+$- and r$+$-NeSi), we show only the semisupervised settings, as they are identical to their respective standard versions in the fully labeled case.
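The columns of Tables 6 to 18 can be reproduced from the per-run test errors as follows (a small sketch of ours; that the standard deviation uses the sample convention, `ddof=1`, is our assumption):

```python
import numpy as np

def run_statistics(test_errors):
    """Mean test error, SEM, standard deviation, minimum, and maximum
    over a set of independent training runs (all in percent)."""
    e = np.asarray(test_errors, dtype=float)
    sd = e.std(ddof=1)                # sample standard deviation
    return {
        "mean": e.mean(),
        "sem": sd / np.sqrt(e.size),  # standard error of the mean
        "sd": sd,
        "min": e.min(),
        "max": e.max(),
    }
```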

Table 6:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 20 | 70.64 ± 0.68 | 6.82 | 55.35 | 88.59 |
| 40 | 55.67 ± 0.54 | 5.44 | 37.53 | 68.13 |
| 200 | 30.59 ± 0.22 | 2.22 | 26.97 | 37.57 |
| 800 | 28.26 ± 0.10 | 1.00 | 26.68 | 31.59 |
| 2000 | 27.87 ± 0.07 | 0.74 | 25.85 | 30.01 |
| 11,269 | 28.08 ± 0.08 | 0.78 | 26.29 | 30.25 |

Table 7:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 20 | 68.68 ± 0.77 | 7.72 | 49.98 | 85.48 |
| 40 | 54.24 ± 0.66 | 6.59 | 37.00 | 66.76 |
| 200 | 29.28 ± 0.21 | 2.09 | 25.90 | 39.60 |
| 800 | 27.20 ± 0.07 | 0.70 | 25.85 | 29.41 |
| 2000 | 27.15 ± 0.07 | 0.65 | 25.77 | 29.13 |
| 11,269 | 27.28 ± 0.07 | 0.73 | 26.08 | 29.82 |

Table 8:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 10 | 55.46 ± 0.57 | 5.72 | 42.49 | 69.62 |
| 20 | 38.88 ± 0.52 | 5.19 | 27.86 | 49.62 |
| 100 | 19.08 ± 0.26 | 2.61 | 13.31 | 24.93 |
| 600 | 7.27 ± 0.05 | 0.49 | 6.01 | 8.76 |
| 1000 | 5.88 ± 0.03 | 0.31 | 5.19 | 6.97 |
| 3000 | 4.39 ± 0.02 | 0.15 | 4.01 | 4.89 |
| 60,000 | 3.27 ± 0.01 | 0.08 | 3.08 | 3.46 |

Table 9:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 10 | 29.61 ± 0.57 | 5.71 | 20.05 | 46.05 |
| 20 | 21.21 ± 0.34 | 3.37 | 13.80 | 31.81 |
| 100 | 12.43 ± 0.15 | 1.53 | 9.29 | 16.25 |
| 600 | 6.94 ± 0.05 | 0.49 | 5.72 | 8.44 |
| 1000 | 6.07 ± 0.03 | 0.28 | 5.24 | 6.78 |
| 3000 | 4.68 ± 0.02 | 0.19 | 4.22 | 5.29 |
| 60,000 | 2.94 ± 0.01 | 0.08 | 2.75 | 3.14 |

Table 10:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 10 | 10.91 ± 0.86 | 8.64 | 3.96 | 53.15 |
| 20 | 7.23 ± 0.35 | 3.45 | 4.17 | 24.82 |
| 100 | 4.96 ± 0.08 | 0.82 | 3.84 | 9.13 |
| 600 | 4.08 ± 0.02 | 0.17 | 3.68 | 4.73 |
| 1000 | 4.00 ± 0.01 | 0.12 | 3.76 | 4.38 |
| 3000 | 3.85 ± 0.01 | 0.11 | 3.64 | 4.14 |

Table 11:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 10 | 18.68 ± 0.89 | 8.90 | 5.06 | 51.88 |
| 20 | 12.46 ± 0.73 | 7.31 | 4.89 | 39.70 |
| 100 | 4.93 ± 0.05 | 0.49 | 4.26 | 7.32 |
| 600 | 4.34 ± 0.01 | 0.15 | 3.87 | 4.78 |
| 1000 | 4.26 ± 0.01 | 0.12 | 3.97 | 4.62 |
| 3000 | 4.05 ± 0.01 | 0.10 | 3.84 | 4.29 |

Table 12:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 10 | 7.22 ± 0.53 | 5.33 | 3.60 | 26.71 |
| 20 | 6.21 ± 0.38 | 3.84 | 3.72 | 26.49 |
| 100 | 4.23 ± 0.07 | 0.68 | 3.58 | 6.88 |
| 600 | 3.65 ± 0.01 | 0.12 | 3.37 | 3.97 |
| 1000 | 3.63 ± 0.01 | 0.11 | 3.25 | 4.02 |
| 3000 | 3.52 ± 0.01 | 0.11 | 3.23 | 3.82 |
| 60,000 | 2.94 ± 0.01 | 0.08 | 2.72 | 3.12 |

Table 13:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 10 | 7.56 ± 1.76 | 5.67 | 5.52 | 23.46 |
| 20 | 6.15 ± 0.14 | 0.44 | 5.49 | 6.73 |
| 100 | 6.20 ± 0.16 | 0.51 | 5.49 | 7.08 |
| 600 | 6.02 ± 0.08 | 0.25 | 5.72 | 6.51 |
| 1000 | 6.02 ± 0.12 | 0.38 | 5.63 | 6.99 |
| 3000 | 5.70 ± 0.03 | 0.10 | 5.56 | 5.89 |
| 344,307 | 5.11 ± 0.01 | 0.03 | 5.06 | 5.16 |

Table 14:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 10 | 9.84 ± 2.41 | 7.61 | 5.64 | 34.95 |
| 20 | 8.50 ± 2.09 | 6.62 | 5.46 | 27.29 |
| 100 | 6.14 ± 0.23 | 0.72 | 5.52 | 7.84 |
| 600 | 5.83 ± 0.14 | 0.45 | 5.43 | 6.50 |
| 1000 | 5.94 ± 0.12 | 0.39 | 5.46 | 6.49 |
| 3000 | 5.72 ± 0.10 | 0.33 | 5.52 | 6.63 |
| 344,307 | 4.52 ± 0.01 | 0.04 | 4.44 | 4.56 |

Table 15:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 10 | 5.71 ± 0.42 | 1.32 | 4.77 | 8.72 |
| 20 | 5.23 ± 0.15 | 0.49 | 4.75 | 5.88 |
| 100 | 5.26 ± 0.23 | 0.72 | 4.79 | 6.95 |
| 600 | 4.84 ± 0.02 | 0.07 | 4.76 | 4.93 |
| 1000 | 4.86 ± 0.03 | 0.09 | 4.69 | 5.01 |
| 3000 | 4.83 ± 0.02 | 0.08 | 4.64 | 4.93 |
| 344,307 | 4.50 ± 0.01 | 0.02 | 4.46 | 4.54 |

Table 16:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 52 | 55.70 ± 0.62 | 1.96 | 52.88 | 58.75 |
| 104 | 51.32 ± 0.79 | 2.49 | 48.21 | 55.96 |
| 520 | 46.22 ± 0.43 | 1.37 | 43.91 | 48.47 |
| 3120 | 44.24 ± 0.23 | 0.74 | 43.23 | 45.49 |
| 5200 | 43.69 ± 0.21 | 0.65 | 42.53 | 44.40 |
| 15,600 | 42.96 ± 0.28 | 0.88 | 41.55 | 44.38 |
| 387,361 | 34.66 ± 0.05 | 0.15 | 34.45 | 34.86 |

Table 17:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 52 | 64.97 ± 0.85 | 2.70 | 60.88 | 69.71 |
| 104 | 60.32 ± 0.91 | 2.86 | 57.74 | 65.74 |
| 520 | 54.08 ± 0.38 | 1.21 | 51.71 | 55.89 |
| 3120 | 43.73 ± 0.15 | 0.47 | 42.99 | 44.62 |
| 5200 | 41.57 ± 0.13 | 0.42 | 40.90 | 42.21 |
| 15,600 | 37.95 ± 0.12 | 0.38 | 37.25 | 38.56 |
| 387,361 | 31.93 ± 0.06 | 0.18 | 31.63 | 32.17 |

Table 18:

| Number of Labels | Mean Test Error ± SEM | SD | Minimum | Maximum |
|---|---|---|---|---|
| 52 | 52.14 ± 1.07 | 3.39 | 45.26 | 56.70 |
| 104 | 48.46 ± 0.92 | 2.90 | 44.49 | 53.52 |
| 520 | 45.62 ± 0.43 | 1.37 | 42.87 | 47.83 |
| 3120 | 41.87 ± 0.32 | 1.03 | 39.70 | 43.48 |
| 5200 | 41.75 ± 0.36 | 1.12 | 39.77 | 43.18 |
| 15,600 | 41.13 ± 0.30 | 0.96 | 39.75 | 42.67 |
| 387,361 | 33.34 ± 0.04 | 0.14 | 33.12 | 33.63 |

## Appendix D: Tunable Parameters of the Compared Algorithms

We list in Table 19 the tunable parameters of each method compared in Figures 7 and 8. For some of the methods, this estimate gives only a lower bound on the number of tunable parameters, as some parameters may have multiple instances (e.g., one for each added layer in the network). If a parameter was kept constant across all layers, we counted it as a single parameter, whereas parameters with differing values in different layers were counted as multiple parameters; an example is the constant number of hidden units in NN versus the differing numbers in the layers of the CNN. We also counted parameters that were not (explicitly) optimized in the corresponding papers but were taken from other papers (e.g., parameters of the ADAM algorithm), or for which the reason for the specific choice is not given (as for specific network architectures).

Table 19:

| Method | Model Description and Tunable Hyperparameters | Count |
|---|---|---|
| SVM | Standard supervised support vector machine. Soft margin parameter $C$. | 1 |
| TSVM | Semisupervised transductive support vector machine. Soft margin parameter $C$, data-similarity kernel parameter $\lambda$. | 2 |
| NN | Supervised neural network using stochastic gradient descent. Number of hidden layers (here: 2), number of hidden units (here: same per layer), learning rate(s). | 3+ |
| AGR | AnchorGraph: semisupervised large graph with anchor-based label prediction using k-means cluster centers as anchors. Number of anchors $m$, number of nearest anchors $s$, regularization parameter $\gamma$, dimensionality reduction (for acceleration). | 3–4 |
| kNN | Semisupervised k-nearest neighbors. Number of neighbors $k$, weight function, algorithm, power parameter $p$. | 4 |
| NeSi (ours) | Neural network approximation of hierarchical Poisson mixtures. Number of middle-layer units $C$, input normalization constant $A$, learning rates $\epsilon_W$ and $\epsilon_R$, BvSB threshold $\vartheta$ (only for r$+$-, ff$+$-, and t-NeSi), $C'$ (truncation, only for t-NeSi). | 4–6 |
| AtlasRBF | Manifold learning of atlas-based kernels for SVMs. Chart penalty $\lambda$, softening parameter $\gamma$, RBF kernel parameter $\sigma$, number of neighbors $k$, local manifold dimensionality. | 5 |
| Em$_{all}$NN | Neural network with nonlinear embedding using unlabeled data pairs. NN hyperparameters (see above, here: 10 layers), layers to embed (here: all), embedding parameter $\lambda$, distance parameter $m$. | 6 |
| CNN | Standard supervised convolutional neural network. Number of CNN layers (here: 6), patch size, pooling window size (2nd layer), neighborhood radius (4th layer), numbers of units in the 1st, 3rd, 5th, and 6th layers, learning rate. | ≥9 |
| M1$+$M2 | Generative model (2 hidden layers) parameterized with deep neural networks. M1: number of hidden layers (here: 2), number of hidden units per layer, number of samples from posterior; M2: number of hidden layers (here: 1), number of hidden units, $\alpha$; RMSProp: learning rate, first and second momenta. | ≥10 |
| DBN-rNCA | Deep belief network with regularized nonlinear neighborhood components analysis; 4 stacks of RBMs, unrolled and fine-tuned as deep autoencoders. Number of layers (here: 4), number of hidden units per layer; RBM learning rate, momentum, weight decay, RBM epochs, NCA epochs, trade-off parameter $\lambda$. | ≥11 |
| EmCNN | CNN with nonlinear embedding using unlabeled data pairs. CNN hyperparameters (see above), layers to embed (here: 5th layer), embedding parameter $\lambda$, distance parameter $m$. | ≥12 |
| VAT | Virtual adversarial training: standard deep networks with a local distributional smoothness (LDS) constraint. Number of layers (here: 2–4), number of hidden units per layer, LDS weighting $\lambda$, magnitude of virtual adversarial perturbation $\epsilon$, number of power-method iterations $I_p$; ADAM (Kingma & Ba, 2014): learning rate $\alpha$, $\epsilon_{\mathrm{ADAM}}$, exponential decay rates $\beta_1$ and $\beta_2$; batch normalization (Ioffe & Szegedy, 2015): mini-batch sizes for the labeled and mixed sets. | ≥12 |
| Ladder | Per-layer denoising objective on standard deep networks (here: CNNs). Number of hidden layers (here: 5), number of hidden units per layer, noise level $n^{(l)}$, per-layer denoising cost multipliers $\lambda^{(l)}$; ADAM (Kingma & Ba, 2014): learning rate $\alpha$, $\epsilon_{\mathrm{ADAM}}$, iterations until annealing phase, linear decay rate; batch normalization (Ioffe & Szegedy, 2015): mini-batch size. | ≥18 |

## Acknowledgments

We acknowledge funding by the German Research Foundation (DFG) in the Priority Program 1527 (Autonomous Learning), grant LU 1196/5-1, and within the Cluster of Excellence Hearing4all (EXC 1077/1). Furthermore, we acknowledge the use of the HPC cluster CARL of Oldenburg University, funded through INST 184/157-1 FUGG, the use of the GPU cluster GOLD, and the support of the NVIDIA Corporation through the donation of a GPU card.

## References

*Neural Computation*,