We consider the problem of classifying data manifolds in which each manifold represents invariances that are parameterized by continuous degrees of freedom. Conventional data augmentation methods rely on sampling large numbers of training examples from these manifolds. Instead, we propose an iterative algorithm based on a cutting plane approach that efficiently solves a quadratic semi-infinite programming problem to find the maximum margin solution. We provide a proof of convergence as well as a polynomial bound on the number of iterations required for a desired tolerance in the objective function. The efficiency and performance of the algorithm are demonstrated in high-dimensional simulations and on image manifolds generated from the ImageNet data set. Our results indicate that the algorithm rapidly learns good classifiers and shows superior generalization performance compared with conventional maximum margin methods that rely on data augmentation.
Handling object variability is a major challenge for machine learning systems. For example, in visual recognition tasks, changes in pose, lighting, identity, or background can result in large variability in the appearance of objects (Hinton, Dayan, & Revow, 1997). Techniques to deal with this variability have been the focus of much recent work, especially with convolutional neural networks consisting of many layers. The manifold hypothesis states that natural data variability can be modeled as lower-dimensional manifolds embedded in higher-dimensional feature representations (Bengio, Courville, & Vincent, 2013). A deep neural network can then be understood as disentangling or flattening the data manifolds so that they can be more easily read out in the final layer (Brahma, Wu, & She, 2016). Manifold representations of stimuli have also been utilized in neuroscience, where different brain areas are believed to untangle and reformat their representations (Riesenhuber & Poggio, 1999; Serre, Wolf, & Poggio, 2005; Hung, Kreiman, Poggio, & DiCarlo, 2005; DiCarlo & Cox, 2007; Pagan, Urban, Wohl, & Rust, 2013).
This article addresses the problem of classifying data manifolds that contain invariances with a number of continuous degrees of freedom. These invariances may be modeled using prior knowledge, manifold learning algorithms (Tenenbaum, 1998; Roweis & Saul, 2000; Tenenbaum, De Silva, & Langford, 2000; Belkin & Niyogi, 2003; Belkin, Niyogi, & Sindhwani, 2006; Canas, Poggio, & Rosasco, 2012), or generative neural networks via adversarial training (Goodfellow et al., 2014). Based on knowledge of these structures, other work has considered building group-theoretic invariant representations (Anselmi et al., 2013) or constructing invariant metrics (Simard, Le Cun, & Denker, 1994). Most approaches today rely on data augmentation by explicitly generating virtual examples from these manifolds (Niyogi, Girosi, & Poggio, 1998; Schölkopf, Burges, & Vapnik, 1996). Unfortunately, the number of samples needed to successfully learn the underlying manifolds may increase the original training set by more than a thousand-fold (Krizhevsky, Sutskever, & Hinton, 2012).
We propose a new method, the manifold cutting plane algorithm, which uses knowledge of the manifolds to efficiently learn a maximum margin classifier. Figure 1 illustrates the problem in its simplest form, binary classification of manifolds with a linear hyperplane, with extensions to this basic model discussed later. Given a number of manifolds embedded in a feature space, the algorithm learns a weight vector that separates positively labeled manifolds from negatively labeled manifolds with the maximum margin. Although the manifolds consist of uncountable sets of points, the algorithm is able to find a good solution in a provably finite number of iterations and training examples.
Support vector machines (SVMs) can learn a maximum margin classifier given a finite set of training examples (Vapnik, 1998); however, with conventional data augmentation methods, the number of training examples increases exponentially, rendering the standard SVM algorithm intractable. Methods such as shrinkage and chunking to reduce the complexity of SVMs have been studied in the context of large-scale data sets (Smola & Schölkopf, 1998), but the resultant kernel matrix may still be very large. Other methods that subsample the kernel matrix (Lee & Mangasarian, 2001) or reduce the number of training samples (Wang, Neskovic, & Cooper, 2005; Smola & Schölkopf, 1998) are infeasible when the input data come from manifolds with an uncountable set of examples.
Our algorithm directly handles the uncountable set of points in the manifolds by solving a quadratic semi-infinite programming problem (QSIP). It is based on a cutting plane method that iteratively refines a finite set of training examples to solve the underlying QSIP (Fang, Lin, & Wu, 2001; Kortanek & No, 1993; Liu, Teo, & Wu, 2004). The cutting plane method was also previously shown to efficiently handle learning problems with a finite number of examples but an exponentially large number of constraints (Joachims, 2006) by adding constraints successively. We provide a novel analysis of the convergence of the algorithm with both hard and soft margins. When the problem is realizable, the convergence bound explicitly depends on the margin value, whereas with a soft margin and slack variables, the bound depends linearly on the number of manifolds.
The article is organized as follows. We first consider the hard margin problem and analyze the simplest form of the algorithm. Next, we introduce slack variables, one for each manifold, and analyze convergence with the additional auxiliary variables. We then demonstrate the application of the algorithm to both high-dimensional synthetic data manifolds and feature representations of images undergoing a variety of warpings. We compare its performance in both efficiency and generalization error with conventional SVMs using data augmentation techniques. Finally, we discuss some natural extensions and potential future work on the algorithm and its applications.
2 Manifold Cutting Plane Algorithm with Hard Margin
In this section, we first consider the problem of classifying a set of manifolds when they are linearly separable. This allows us to introduce the simplest version of the algorithm, along with the appropriate definitions and QSIP formulation. We analyze the convergence of the simple algorithm and prove an upper bound on the number of errors the algorithm can make in this setting.
2.1 Hard-Margin QSIP
This is the primal formulation of the problem, where maximizing the margin is equivalent to minimizing the squared norm $\|\mathbf{w}\|^2$. We denote the maximum margin attainable by $\kappa^*$ and the corresponding optimal solution by $\mathbf{w}^*$. For simplicity, we do not explicitly include the bias term here; a nonzero bias can be modeled by adding a constant-valued feature as a component to all the input vectors $\mathbf{x}$. Note that the dual formulation of this QSIP is more complicated, involving optimization of nonnegative measures over the manifolds. In order to solve the hard-margin QSIP, we propose the following simple algorithm.
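The display equation for the hard-margin QSIP appears to have been lost in extraction; the following is a reconstruction under standard max-margin notation, where $M^\mu$, $y^\mu \in \{\pm 1\}$, and $P$ (our assumed symbols) denote the manifolds, their labels, and their number:

```latex
\min_{\mathbf{w}} \;\; \tfrac{1}{2}\,\|\mathbf{w}\|^{2}
\quad \text{subject to} \quad
y^{\mu}\,\mathbf{w}\cdot\mathbf{x} \;\ge\; 1
\quad \forall\, \mathbf{x} \in M^{\mu},\;\; \mu = 1,\dots,P .
```

Under this normalization, the margin of a solution is $1/\|\mathbf{w}\|$, so the maximum margin is $\kappa^* = 1/\|\mathbf{w}^*\|$.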
The algorithm iteratively finds the optimal $\mathbf{w}$ in equation 2.1 using a cutting-plane method for solving the QSIP. The general idea is to start with a finite number of training examples, find the maximum margin solution for that training set, augment the training set with a point on the manifolds that violates the constraints, and iterate this process until a tolerance criterion is reached.
At each stage of the algorithm, there is a finite set of training points and associated labels. The training set at the $t$th iteration is denoted $T_t$, with $|T_t|$ examples. For the $i$th pattern in $T_t$, $\mu_i$ denotes the index of its manifold and $y^{\mu_i}$ its associated label.
In step 3 of the algorithm, a point among the manifolds that violates the margin constraint needs to be found. The use of a separation oracle is common in other cutting plane algorithms, such as those used for structural SVMs (Joachims, 2006) or linear mixed-integer programming (Marchand, Martin, Weismantel, & Wolsey, 2002). In our case, this requires determining the feasibility of the margin constraints over the $D$-dimensional convex parameter set of each manifold. When the manifold mapping is convex, feasibility can in some cases be determined analytically. More generally, it can be solved using modern convex techniques such as projection-based methods (Bauschke & Borwein, 1996). For nonconvex mappings, we note that only the convex hulls of the manifolds are relevant to the classification problem. Thus, in those situations, the separation oracle need only return a feasible point in the intersection of a manifold convex hull with the half-space determined by the current weight vector and margin. These intersection points can be computed using search techniques in the $D$-dimensional parameter set using gradient information, finite differences, or convex relaxation techniques. We describe specific methods for finding separating points in our experiments below.
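The cutting-plane loop and an analytic separation oracle can be sketched concretely. Below is a minimal Python/NumPy/SciPy illustration, assuming $\ell_2$-ball ellipsoid manifolds $\mathbf{x}(\mathbf{s}) = \mathbf{x}_0 + \sum_i s_i R_i \mathbf{u}_i$, $\|\mathbf{s}\|_2 \le 1$; the names `max_margin`, `worst_point`, and `cutting_plane` are our own, and the finite QP solve via SLSQP is one choice among many, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize

def max_margin(X, y):
    """Hard-margin SVM primal on a finite set: min 0.5||w||^2 s.t. y_i w.x_i >= 1."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    cons = [{"type": "ineq",
             "fun": lambda w, xi=xi, yi=yi: yi * (w @ xi) - 1.0,
             "jac": lambda w, xi=xi, yi=yi: yi * xi}
            for xi, yi in zip(X, y)]
    res = minimize(lambda w: 0.5 * (w @ w), x0=np.zeros(X.shape[1]),
                   jac=lambda w: w, constraints=cons, method="SLSQP")
    return res.x

def worst_point(w, center, axes):
    """Separation oracle for an L2 ellipsoid x = center + axes.T @ s, ||s||_2 <= 1.

    Returns the point minimizing w.x: center - axes.T @ v/||v|| with v = axes @ w.
    """
    v = axes @ w                      # projections of w onto the scaled axes
    n = np.linalg.norm(v)
    return center if n == 0 else center - axes.T @ (v / n)

def cutting_plane(manifolds, labels, tol=1e-3, max_iter=100):
    """Iteratively grow the training set with margin-violating manifold points."""
    X = [m["center"] for m in manifolds]   # initialize with the manifold centers
    Y = list(labels)
    w = np.zeros(len(X[0]))
    for _ in range(max_iter):
        w = max_margin(X, Y)
        # find the most violated margin constraint across all manifolds;
        # passing y*w makes the oracle minimize y * w.x
        viol, worst = 0.0, None
        for m, y in zip(manifolds, labels):
            x = worst_point(y * w, m["center"], m["axes"])
            gap = 1.0 - y * (w @ x)
            if gap > viol:
                viol, worst = gap, (x, y)
        if viol <= tol:                    # every manifold point has margin >= 1 - tol
            return w
        X.append(worst[0])
        Y.append(worst[1])
    return w
```

For two well-separated ellipsoids, the loop typically terminates after adding only a handful of extremal points, far fewer than uniform sampling would require.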
2.3 Convergence of the Algorithm
The algorithm converges asymptotically to an optimal solution $\mathbf{w}^*$ when one exists. Denote the change in the weight vector at the $t$th iteration by $\Delta\mathbf{w}_t = \mathbf{w}_{t+1} - \mathbf{w}_t$. We present a set of lemmas and theorems leading up to bounds on the number of iterations required for convergence and an estimate of the objective function.
The change in the weights satisfies $\mathbf{w}_t \cdot \Delta\mathbf{w}_t \ge 0$.
Define $\mathbf{w}_\lambda = \mathbf{w}_t + \lambda\,\Delta\mathbf{w}_t$. Then for all $\lambda \in [0,1]$, $\mathbf{w}_\lambda$ satisfies the constraints on the point set $T_t$: $y^{\mu_i}\,\mathbf{w}_\lambda \cdot \mathbf{x}_i \ge 1$ for all $i$. However, if $\mathbf{w}_t \cdot \Delta\mathbf{w}_t < 0$, there exists a $\lambda > 0$ such that $\|\mathbf{w}_\lambda\| < \|\mathbf{w}_t\|$, contradicting the fact that $\mathbf{w}_t$ is the minimum-norm solution on $T_t$.
Next, we show that the norm must monotonically increase by a finite amount at each iteration.
In the $t$th iteration of the algorithm, the increase in the squared norm of $\mathbf{w}$ is lower-bounded by $\|\mathbf{w}_{t+1}\|^2 - \|\mathbf{w}_t\|^2 \ge \delta^2/L^2$, where $\delta$ is the violation tolerance and $L$ bounds the norm of the points on the manifolds.
First, note that $\Delta\mathbf{w}_t \neq 0$; otherwise, the algorithm stops. We have $\mathbf{w}_t \cdot \Delta\mathbf{w}_t \ge 0$ (see lemma 1). Consider the point $\mathbf{x}$ added to the set $T_{t+1}$. At this point, $y\,\mathbf{w}_t \cdot \mathbf{x} < 1 - \delta$ and $y\,\mathbf{w}_{t+1} \cdot \mathbf{x} \ge 1$; hence $y\,\Delta\mathbf{w}_t \cdot \mathbf{x} > \delta$ and $\|\Delta\mathbf{w}_t\| \ge \delta/L$.
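The arithmetic behind the per-iteration increase can be written out explicitly; we assume the notation $\mathbf{w}_t$ for the weights, $\Delta\mathbf{w}_t$ for the update, $\delta$ for the violation tolerance, and $L$ for a bound on the norm of the manifold points:

```latex
\|\mathbf{w}_{t+1}\|^{2} - \|\mathbf{w}_{t}\|^{2}
 \;=\; 2\,\mathbf{w}_{t}\cdot\Delta\mathbf{w}_{t} + \|\Delta\mathbf{w}_{t}\|^{2}
 \;\ge\; \|\Delta\mathbf{w}_{t}\|^{2}
 \;\ge\; \frac{\delta^{2}}{L^{2}},
```

using $\mathbf{w}_t \cdot \Delta\mathbf{w}_t \ge 0$ from lemma 1, and $\|\Delta\mathbf{w}_t\| \ge \delta/L$, which follows from $y\,\Delta\mathbf{w}_t \cdot \mathbf{x} > \delta$ together with $\|\mathbf{x}\| \le L$ and the Cauchy-Schwarz inequality.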
As a corollary, we see that this procedure is guaranteed to find a realizable solution, if one exists, in a finite number of steps.
The algorithm converges to a zero-error classifier in fewer than $L^2/\kappa^{*2}$ iterations, where $\kappa^*$ is the optimal margin and $L$ bounds the norm of the points on the manifolds.
When there is an error, we have $y\,\mathbf{w}_t \cdot \mathbf{x} \le 0$, so the constraint is violated by at least 1 and the squared norm increases by at least $1/L^2$ (see equation 2.4). Since $\|\mathbf{w}_t\| \le \|\mathbf{w}^*\| = 1/\kappa^*$, the total number of possible errors is upper-bounded by $L^2/\kappa^{*2}$.
With a finite tolerance $\delta$, we obtain a bound on the number of iterations required for convergence:
The algorithm for a given tolerance $\delta$ terminates in fewer than $L^2/(\delta^2\kappa^{*2})$ iterations, where $\kappa^*$ is the optimal margin and $L$ bounds the norm of the points on the manifolds.
Again, $\|\mathbf{w}_t\| \le \|\mathbf{w}^*\| = 1/\kappa^*$, and each iteration increases the squared norm by at least $\delta^2/L^2$.
We can also bracket the error in the objective function after the algorithm terminates:
With tolerance $\delta$, after the algorithm terminates with solution $\mathbf{w}_t$, the optimal value of the objective is bracketed by $\|\mathbf{w}_t\|^2 \le \|\mathbf{w}^*\|^2 \le \|\mathbf{w}_t\|^2/(1-\delta)^2$.
The lower bound on $\|\mathbf{w}^*\|^2$ holds as before. Since the algorithm has terminated, every manifold point satisfies $y\,\mathbf{w}_t \cdot \mathbf{x} \ge 1 - \delta$, so setting $\mathbf{w} = \mathbf{w}_t/(1-\delta)$ would make it feasible for the QSIP, resulting in the upper bound on $\|\mathbf{w}^*\|^2$.
3 Cutting Plane Algorithm with Slack Variables
In many classification problems, the manifolds may not be linearly separable due to their dimensionality, size, or correlation structure. In these situations, the algorithm will not be able to find a feasible solution, and no weight vector achieves zero error over all the manifold points. To handle these problems, the classic approach is to introduce slack variables on each point to control generalization error. For continuous manifolds, we define generalization performance as the classification error over distributions of points sampled from the manifolds with corresponding labeled outputs. The naive implementation of slack variables requires defining appropriate measures over entire manifolds and solving the optimization problem over infinite sets of slack variables. This is not computationally tractable, and we formulate a more efficient alternative version of the QSIP with slack variables next.
3.1 QSIP with Manifold Slacks
In this work, we propose using only one slack variable per manifold for classification problems with nonseparable manifolds. This formulation demands that all the points on each manifold obey an inequality constraint with a single manifold slack variable, $\xi_\mu$. As we will see, solving for this constraint is tractable, and the algorithm has good convergence guarantees.
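The slack formulation referenced above can be reconstructed under the same assumed notation as before, with $C$ denoting the slack penalty and $\xi_\mu$ the per-manifold slack:

```latex
\min_{\mathbf{w},\,\boldsymbol{\xi}} \;\;
\tfrac{1}{2}\,\|\mathbf{w}\|^{2} + C \sum_{\mu=1}^{P} \xi_{\mu}
\quad \text{subject to} \quad
y^{\mu}\,\mathbf{w}\cdot\mathbf{x} \;\ge\; 1 - \xi_{\mu}
\quad \forall\, \mathbf{x} \in M^{\mu},
\qquad \xi_{\mu} \ge 0 .
```

Note that a single $\xi_\mu$ is shared by the uncountably many constraints of manifold $M^\mu$, which is what keeps the number of slack variables finite.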
3.3 Convergence of the Algorithm with Slack Variables
Here we show that the objective function is guaranteed to increase by a finite amount with each iteration. This result is similar to Tsochantaridis, Joachims, Hofmann, and Altun (2005), but here we present proofs in the primal domain over an infinite number of examples.
We derive that the added point at each iteration must be a support vector for the next weight vector:
We also derive a bound on the following quadratic function over nonnegative values:
The unconstrained minimum occurs at the stationary point of the quadratic. When the stationary point falls outside the feasible range, the constrained minimum is attained at the boundary; otherwise, it is attained at the stationary point itself. Thus, the lower bound is the smaller of these two values.
Using the lemmas above, the lower bound on the change in the objective function can be found:
The added point comes from a particular manifold. Since, by lemma 3, it must be a support vector for the next weight vector, applying the Cauchy-Schwarz inequality to the corresponding margin condition yields a lower bound on the norm of the change in the weights.
We consider the minimum value of the thresholds over the manifolds. There are two possibilities: either this minimum is positive, so that none of the points in the current training set are support vectors for the new solution, or it is not, so that at least one support vector lies in the training set.
Since the algorithm is guaranteed to increase the objective by a finite amount, it will terminate in a finite number of iterations if we require each added point to violate its constraint by at least $\epsilon$ for some positive $\epsilon$.
The algorithm for a given $\epsilon$ will terminate after a number of iterations that is at most linear in the number of manifolds $P$, with a prefactor depending on $C$, $\epsilon$, and $L$, where $L$ bounds the norm of the points on the manifolds.
Setting $\mathbf{w} = 0$ with all slacks $\xi_\mu = 1$ is a feasible solution for the soft-margin QSIP. Therefore, the optimal objective function is upper-bounded by $CP$. The upper bound on the number of iterations is then provided by theorem 2.
We can also bound the error in the objective function after the algorithm terminates:
The lower bound is apparent since the full QSIP includes constraints for all points on the manifolds. Increasing each slack by the tolerance $\epsilon$ makes the terminal solution feasible for the full QSIP, resulting in the upper bound.
4.1 Ellipsoidal Manifolds
As an illustration of our method, we generated $D$-dimensional $\ell_p$-norm ellipsoids with random radii, centers, and directions. The points on each manifold are parameterized as $\mathbf{x}(\mathbf{s}) = \mathbf{x}_0 + \sum_{i=1}^{D} s_i R_i \mathbf{u}_i$ with $\|\mathbf{s}\|_p \le 1$, where the center $\mathbf{x}_0$ and basis vectors $\mathbf{u}_i$ are random gaussian and the ellipsoidal radii $R_i$ are sampled from a distribution with a fixed mean. Similar to the classification task shown in Figure 1, we consider the following task: given $P$ such $D$-dimensional ellipsoids embedded in an $N$-dimensional ambient space, where a fixed fraction of the ellipsoids are labeled positive and the rest negative, find a linear hyperplane that shatters the given ellipsoids according to the labels, with maximum margin.
We compare the performance of our algorithm to a conventional point SVM trained on samples drawn uniformly from the surface of the ellipsoids. The test points for both methods are also drawn from the uniform distribution over the surface of each ellipsoid and given the binary label assigned to that ellipsoid. Performance is measured by the generalization error in classifying new test points from positively labeled manifolds versus negatively labeled ones, as a function of the total number of training samples used during learning.
For these manifolds, the separation oracle returns points that minimize $y^\mu\,\mathbf{w}\cdot\mathbf{x}(\mathbf{s})$ over the set $\|\mathbf{s}\|_p \le 1$. For norms with $p > 1$, the minimizing point can be expressed in closed form via the Hölder-conjugate norm. In some cases, we additionally used a single margin constraint per manifold, given by the center of the ellipsoids.
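The closed form of this oracle follows from Hölder duality; the following is a sketch under our assumed parameterization $\mathbf{x}(\mathbf{s}) = \mathbf{x}_0 + \sum_i s_i R_i \mathbf{u}_i$, $\|\mathbf{s}\|_p \le 1$, with $q$ the Hölder conjugate of $p$:

```latex
s_i^{*} \;=\; -\,\mathrm{sign}(v_i)\,
\frac{|v_i|^{\,q-1}}{\|\mathbf{v}\|_q^{\,q-1}},
\qquad
v_i = y^{\mu}\,R_i\,(\mathbf{w}\cdot\mathbf{u}_i),
\qquad
\frac{1}{p}+\frac{1}{q}=1,
```

which attains the Hölder bound $y^\mu\,\mathbf{w}\cdot\mathbf{x}(\mathbf{s}^*) = y^\mu\,\mathbf{w}\cdot\mathbf{x}_0 - \|\mathbf{v}\|_q$. For $p = 2$ this reduces to the familiar $\mathbf{s}^* = -\mathbf{v}/\|\mathbf{v}\|_2$.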
We compare the generalization performance of our algorithm and the point-wise SVM on two families of ellipsoids (see Figures 3 and 4). As a linearly separable example, we use ellipsoids embedded in an ambient dimension for which, with the parameters we use, the classification problem is known to be below the critical manifold capacity and therefore linearly separable according to the recent theory of linear classification of general manifolds (Chung, Lee, & Sompolinsky, 2018). We then test the generalization error of both methods for different numbers of training samples (see Figure 3a). To illustrate a linearly nonseparable example, we use the same set of ellipsoids, except with an increased number of ellipsoids and a decreased ambient dimension, such that the task is above the critical manifold capacity and linearly nonseparable; we then compare the generalization error of our algorithm with the point SVM (Figure 3b). Similarly, for the second family of ellipsoids, parameters were chosen slightly below capacity for the separable case and above capacity for the nonseparable case (details are in the captions of Figure 4). The results illustrate that our algorithm achieves a low generalization error far more efficiently than a conventional maximum margin classifier trained on sampled examples.
Another important question is how the classification task difficulty, together with the manifold geometry, affects the number of samples required for convergence of the algorithm. The recent theory of manifold classification (Chung, Lee, & Sompolinsky, 2016, 2018) relates the critical manifold linear classification capacity, the task margin, and manifold geometric properties such as dimension and size. To see the effect of the manifold geometry, we tested how the ellipsoid properties above (the number of manifolds, their dimension, and their radii) influence the number of required training examples and the task margin within the linearly separable regime. First, we varied the number of ellipsoids with all other task parameters fixed, and we varied the dimension of the ellipsoids with all other parameters fixed (see Figure 3c). We find a regime with a trade-off between the number of examples required for convergence and the task margin. The task margin reflects the task difficulty: the harder the classification task, the smaller the margin. As we vary the number and dimension of the ellipsoids, the trade-off is manifested by an approximate plateau in the number of required examples, except close to the capacity, where the margin approaches zero. The task difficulty also depends on the size of the manifolds: increasing the ellipsoid scale produces behavior very similar to increasing the number or dimension of the manifolds, with a rapid increase in required examples when the task is easy and an approximate plateau at larger scales. Even in the large-scale regime, however, there is a rapid drop in the number of required examples, because the hyperplane can orthogonalize the solution and attain a nonzero asymptotic margin (see Figure 3d). Varying the norm of the ellipsoids shows qualitatively similar behavior to varying the scale, again reflecting that the hyperplane can find an orthogonalizing solution with a nonzero asymptotic margin (see Figure 4c).
4.2 ImageNet Object Manifolds
We also apply the algorithm to a more realistic class of object manifolds. Here, each object manifold is constructed from a set of affine warpings applied to a template image from the ImageNet data set (Deng et al., 2009). Figure 5a illustrates image changes along multiple axes of variation in such a manifold, which contains all combinations of changes along those axes. The manifolds are parameterized by the intrinsic transformation dimensionality (i.e., the number of axes of variation) and the generated image variability (defined as the maximal displacement of the object corners, in pixels). In general, such affine transformations have six degrees of freedom; here, we demonstrate results for manifolds utilizing only translation transformations and for manifolds utilizing both translation and shear transformations, as in Figure 5a. The resulting manifolds are split into dichotomies, with some manifolds considered as positive instances and the remaining manifolds as negative instances. The algorithm is then used to learn a binary classifier for the manifolds, with the slack penalty parameter obtained through cross-validation. We note that the method can easily be extended to multiclass problems, but only binary classification was used in this experiment.
We trained the algorithm on the object manifolds and used the same feature representation throughout. At each iteration, a constraint-violating point on the manifolds is found using a local search among neighboring samples in the affine warping space. Figure 5b shows how quickly the algorithm converges to a low generalization error when classifying manifolds with different amounts of variation.
Next, we compared the generalization performance of the algorithm to a conventional SVM on sampled points, using two different feature representations of the images: the original pixel space (as in the previous experiment) and a V1-like representation of the same images created by applying full-wave rectification after filtering with arrays of Gabor functions (Serre, Wolf, Bileschi, Riesenhuber, & Poggio, 2007), as illustrated in Figure 6a. The test examples are created by transforming the template image with parameters drawn uniformly in the intrinsic transformation-parameter space and are given the binary label assigned to the corresponding object manifold. For the pixel representation, the images' grayscale values are used; the Gabor representations are created by applying a set of Gabor filters (of four orientations and four spatial scales) around multiple image locations, from which we randomly sampled features to match the number of pixels.
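A V1-like feature map of this kind can be sketched in a few lines. The paper's exact filter-bank parameters are not specified here, so the frequencies, bandwidths, kernel size, and subsampling below are our assumptions, and `gabor_kernel` and `v1_features` are our own names:

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(freq, theta, sigma, size=11):
    """Even/odd Gabor pair at one orientation and frequency (truncated envelope)."""
    half = size // 2
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
    xr = xx * np.cos(theta) + yy * np.sin(theta)       # rotate coordinates
    yr = -xx * np.sin(theta) + yy * np.cos(theta)
    env = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))    # gaussian envelope
    return env * np.cos(2 * np.pi * freq * xr), env * np.sin(2 * np.pi * freq * xr)

def v1_features(img, n_orient=4, n_scale=4, n_sample=256, seed=0):
    """Full-wave-rectified Gabor responses, randomly subsampled to n_sample features."""
    maps = []
    for k in range(n_scale):
        freq = 0.25 / (2 ** k)                         # coarser frequency per scale
        for j in range(n_orient):
            even, odd = gabor_kernel(freq, np.pi * j / n_orient, sigma=2.0 * (k + 1))
            maps.append(np.abs(convolve2d(img, even, mode="same")))  # rectification
            maps.append(np.abs(convolve2d(img, odd, mode="same")))
    feats = np.concatenate([m.ravel() for m in maps])
    rng = np.random.default_rng(seed)
    return feats[rng.choice(feats.size, size=n_sample, replace=False)]
```

The random subsampling at the end mirrors the step in the text where features are sampled to match the number of pixels.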
Figure 6b shows the results on these object manifolds: as the manifolds are transformed from pixel to Gabor features, the problem changes from nonseparable to separable. We note that the maximal number of separable manifolds depends on the feature representation. Thus, the algorithm can also be used to investigate the benefits of alternative feature representations, such as those found in different areas of the visual hierarchy in the brain or in the layers of a deep neural network.
We described and analyzed a novel algorithm, based on the cutting plane method, for finding the maximum margin solution when classifying manifolds. We proved its convergence and provided bounds on the number of iterations required. Our analysis shows that in the separable case, the algorithm finds, in a finite number of iterations, a solution that completely segregates the manifolds even though they consist of uncountable sets of points.
The situation is more complex when the manifolds are not separable. In that case, the algorithm finds, in a finite number of iterations, an approximate solution that obeys slack constraints on a per-manifold basis. We consider the generalization performance of the solution on points randomly sampled from the manifolds and find that it generalizes well across all the manifold input points even though it has only been provided with a finite set of samples during the training iterations. The number of nonzero slack variables indicates the fraction of manifolds containing inseparable points, providing an upper bound on the generalization error.
Our algorithm is reminiscent of cutting-plane methods for structured SVMs in that its convergence does not explicitly depend on the presence of a large number of constraints. However, in our setting, the semi-infinite nature of the optimization problem arises even for binary classification, due to the continuous manifold structure of the inputs rather than from constraints on the outputs as in structured SVMs. A possible future extension would be to handle structured output labels on manifold inputs.
Our experiments with both synthetic data manifolds and image manifolds demonstrate the efficiency of our algorithm and its superior generalization performance compared to conventional SVMs trained on many virtual examples. Although the theoretical convergence bounds depend only on the margin, our numerical examples show that the empirical number of training samples required depends on both the task margin and the manifold geometric properties. In particular, we demonstrated how the algorithm performs when the dimensionality, size, and shape of the manifolds are varied, illustrating the complex interplay between the learning dynamics and the underlying manifold geometry.
There is a natural extension of the algorithm to nonlinear classifiers via the kernel trick, as all our operations involve dot products between the weight vector and manifold points. At each iteration, the algorithm would solve the dual version of the problem, which is readily kernelized. More theoretical work is needed to analyze infinite-dimensional kernels, where the manifold optimization problem is no longer a QSIP but becomes a fully (doubly) infinite quadratic programming problem.
Beyond binary classification, variations of our algorithm can also be used to solve other machine learning problems, including multiclass classification, ranking, ordinal regression, and one-class learning. We can also use it to evaluate the computational benefits of manifold representations at successive layers of deep networks, in both machine learning systems and brain sensory hierarchies. We anticipate using the algorithm to help construct novel hierarchical architectures that incrementally reformat manifold representations through the layers for better overall performance in machine learning tasks. Such work would improve our understanding of how neural architectures can learn to process high-dimensional real-world signal ensembles and cope with the large variability due to continuous modulation of the underlying physical parameters.
The work is partially supported by the Gatsby Charitable Foundation, the Swartz Foundation, the Simons Foundation (SCGB grant 325207), the NIH, the MAFAT Center for Deep Learning, and the Human Frontier Science Program (project RGP0015/2013). D. L. also acknowledges the support of the U.S. National Science Foundation, Army Research Laboratory, Office of Naval Research, Air Force Office of Scientific Research, and Department of Transportation.