A complex-valued convolutional network (convnet) implements the repeated application of the following composition of three operations, recursively applying the composition to an input vector of nonnegative real numbers: (1) convolution with complex-valued vectors, followed by (2) taking the absolute value of every entry of the resulting vectors, followed by (3) local averaging. For processing real-valued random vectors, complex-valued convnets can be viewed as data-driven multiscale windowed power spectra, data-driven multiscale windowed absolute spectra, data-driven multiwavelet absolute values, or (in their most general configuration) data-driven nonlinear multiwavelet packets. Indeed, complex-valued convnets can calculate multiscale windowed spectra when the convnet filters are windowed complex-valued exponentials. Standard real-valued convnets, using rectified linear units (ReLUs), sigmoidal (e.g., logistic or tanh) nonlinearities, or max pooling, for example, do not obviously exhibit the same exact correspondence with data-driven wavelets (whereas for complex-valued convnets, the correspondence is much more than just a vague analogy). Courtesy of the exact correspondence, the remarkably rich and rigorous body of mathematical analysis for wavelets applies directly to (complex-valued) convnets.
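To make the three-operation composition concrete, here is a minimal NumPy sketch of a single stage (our own toy illustration; the function name, filter choice, and pooling scheme are ours, not from any particular library). When the filter is a windowed complex exponential, as below, the stage computes a locally averaged windowed absolute spectrum at that filter's frequency:

```python
import numpy as np

def complex_convnet_stage(x, filters, pool=2):
    """One stage of a complex-valued convnet (toy sketch):
    (1) convolve the real input with each complex-valued filter,
    (2) take the absolute value of every entry,
    (3) locally average (here, non-overlapping mean pooling)."""
    outputs = []
    for h in filters:
        y = np.convolve(x, h, mode="valid")    # (1) complex convolution
        a = np.abs(y)                          # (2) entrywise absolute value
        n = (len(a) // pool) * pool            # trim to a multiple of pool
        outputs.append(a[:n].reshape(-1, pool).mean(axis=1))  # (3) averaging
    return outputs

# A windowed complex exponential (Gabor-like) filter.
k = np.arange(8)
filt = np.hanning(8) * np.exp(2j * np.pi * k / 4.0)

x = np.abs(np.random.default_rng(0).standard_normal(64))  # nonnegative input
(out,) = complex_convnet_stage(x, [filt], pool=2)
```

A deeper network simply applies `complex_convnet_stage` again to the (nonnegative, real-valued) outputs of the previous stage.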
Convolutional networks (convnets) have become increasingly important to artificial intelligence in recent years, as reviewed by LeCun, Bengio, and Hinton (2015). This note presents a theoretical argument for complex-valued convnets and their remarkable performance. Complex-valued convnets turn out to calculate “data-driven multiscale windowed spectra” characterizing certain stochastic processes common in the modeling of time series (such as audio) and natural images (including patterns and textures). We motivate the construction of such multiscale spectra using “local averages of multiwavelet absolute values” or, more generally, “nonlinear multiwavelet packets.”
A textbook treatment of all concepts and terms we use in this note is given by Mallat (2008). Further information is available in the original work of Daubechies (1992), Meyer (1993), Coifman, Meyer, Quake, and Wickerhauser (1994), Coifman and Donoho (1995), Simoncelli and Freeman (1995), Meyer and Coifman (1997), LeCun, Bottou, Bengio, and Haffner (1998), Donoho, Mallat, von Sachs, and Samuelides (2003), Srivastava, Lee, Simoncelli, and Zhu (2003), Rabiner and Schafer (2007), and Mallat (2008), for example. The work of Haensch and Hellwich (2010), Mallat (2010), Poggio, Mutch, Leibo, Rosasco, and Tacchetti (2012), Bruna and Mallat (2013), Bruna, Mallat, Bacry, and Muzy (2015), and Chintala et al. (2015) also develops complex-valued convnets, providing copious applications and numerical experiments. A related, more sophisticated connection (to renormalization group theory) is given by Mehta and Schwab (2014). Our exposition relies on nothing but the basic signal processing treated by Mallat (2008). Using the connections discussed below, the rich, rigorous mathematical analysis surveyed by Daubechies (1992), Meyer (1993), Mallat (2008), and others applies directly to complex-valued convnets.
Citing such connections, the anonymous reviews of this note suggested viewing complex-valued convnets as a kind of baseline architecture for much of the deep learning reviewed by LeCun et al. (2015). Section 6 presents numerical analyses corroborating this viewpoint. Having such a theoretical basis for deep learning could help in paring down the combinatorial explosion of possibilities for future developments, while probably illuminating further possibilities as well.
The rest of this note proceeds as follows. Section 2 reviews stationary stochastic processes and their spectra. Section 3 reviews locally stationary stochastic processes and the connection of their spectra to stages in a complex-valued convnet. Section 4 introduces the multiscale extension (multiple stages in a convnet). Section 5 describes the fitting (also known as learning or training) that the connection to convnets facilitates. Section 6 briefly compares on a common benchmark the accuracies for the complex-valued convnets of Chintala et al. (2015) to those for the scattering transforms of Mallat (2010) and for the standard real-valued convnets of Krizhevsky, Sutskever, and Hinton (2012). Section 7 generalizes and summarizes the note.
2 Stationary Stochastic Processes
The absolute spectrum can be more robust than the power spectrum, in the same sense that the mean absolute deviation can be more robust than the variance or standard deviation. The power spectrum is more fundamental in a certain sense, yet the absolute spectrum may be preferable for applications to machine learning. We conjecture that both can work about the same. We focus on the absolute spectrum to simplify the exposition.
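A toy numerical illustration of this robustness claim (our own example, not from the note): contaminate a gaussian sample with a single gross outlier and compare how much the second moment (underlying the power spectrum) and the first absolute moment (underlying the absolute spectrum) shift.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10_000)
x_out = x.copy()
x_out[0] = 100.0  # a single gross outlier

# The outlier enters the second moment quadratically but the first
# absolute moment only linearly, so the variance shifts far more.
var_shift = abs(x_out.var() - x.var())                     # roughly 100**2 / 10_000
mad_shift = abs(np.abs(x_out).mean() - np.abs(x).mean())   # roughly 100 / 10_000
```

The same quadratic-versus-linear sensitivity carries over to the power spectrum versus the absolute spectrum, since these are second and first moments of Fourier-domain magnitudes.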
3 Locally Stationary Stochastic Processes
In practice, decimation or subsampling is important to avoid overfitting in the data-driven approach discussed below, by limiting the number of degrees of freedom appropriately. Even when the signal is not a strictly stationary stochastic process, the averaging in equation 3.3 (the left-most summation) performs the cycle spinning of Coifman and Donoho (1995) to avoid artifacts that would otherwise arise due to windows’ partitioning after subsampling. The averaging reduces the variance; wider averaging would further reduce the variance.
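A minimal sketch of the limiting case (our own illustration, assuming circular boundary conditions): averaging the absolute values over every shift yields a feature that is exactly invariant to circular shifts of the input, so no subsampling artifacts can arise. The local averaging of equation 3.3 interpolates between no averaging and this fully invariant extreme.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.abs(rng.standard_normal(64))                      # nonnegative input
h = np.hanning(8) * np.exp(2j * np.pi * np.arange(8) / 4.0)

def avg_abs_response(x):
    # Circular convolution via the FFT, then the absolute value,
    # then an average over all positions (the extreme of cycle spinning).
    y = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h, len(x)))
    return np.abs(y).mean()

# Full averaging makes the feature invariant to circular shifts.
assert np.isclose(avg_abs_response(x), avg_abs_response(np.roll(x, 5)))
```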
Sequences that are finite rather than doubly infinite provide only enough information for estimating a smoothed version of the spectrum. Alternatively, a finite amount of data provides information for estimating multiscale windowed spectra yielding time-frequency (or space-Fourier) resolution similar to the multiresolution analysis of wavelets.
SIFT (scale-invariant feature transform), HOG (histograms of oriented gradients), and SURF (speeded-up robust features) of Lowe (1999, 2004), Dalal and Triggs (2005), Bay, Ess, Tuytelaars, and Van Gool (2008), and others are more analogous to the multiwavelet architecture of Figure 2 than to the more general wavelet-packet architecture of Figure 3.
The “multiwavelet transform” constitutes a desirable baseline model. We can easily adapt to the data the choices of windows and indeed the whole recursive structure of the processing (whether restricting the recursion to the zero-frequency channels or also allowing the recursive processing of higher-frequency channels). Viewing as parameters the convolutional filters in equation 3.3 that serve as windowed exponentials, the desirable baseline is just one member of a parametric family of models. This parametric family is known as a “complex-valued convolutional network.” We can fit (i.e., learn or train) the parameters to the data using optimization procedures such as stochastic gradient descent in conjunction with “backpropagation” (backpropagation is simply the chain rule of calculus applied to calculate gradients of our recursively composed operations). For supervised learning, we optimize according to a specified objective, usually using the multiscale spectra as inputs to a scheme for classification or regression, as detailed by LeCun et al. (1998), for example.
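To make the fitting concrete, the following deliberately minimal sketch (our own toy example; it substitutes finite differences for backpropagation and a scalar target for a full classification objective) fits the real and imaginary parts of a single complex filter so that the averaged absolute response of the stage matches a target:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.abs(rng.standard_normal(32))   # nonnegative input signal
target = 2.0                          # toy regression target

def loss(params):
    h = params[:4] + 1j * params[4:]  # real and imaginary parts form the filter
    feature = np.abs(np.convolve(x, h, mode="valid")).mean()
    return (feature - target) ** 2

params = 0.1 * rng.standard_normal(8)
lr, eps = 0.05, 1e-6
for _ in range(200):
    # Central finite differences stand in for backpropagation here;
    # in practice, backpropagation computes the same gradient efficiently.
    grad = np.array([(loss(params + eps * e) - loss(params - eps * e)) / (2 * eps)
                     for e in np.eye(8)])
    params -= lr * grad
```

Backpropagation would deliver the same gradient in one forward and one backward pass through the composition of convolution, absolute value, and averaging.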
In consonance with the “best-basis” approach of Coifman et al. (1994) and Saito and Coifman (1995), a potentially more efficient possibility is to restrict the convolutional filters in equation 3.3 to be windowed exponentials that are designed completely a priori, aside from one overall scaling factor per filter, fitting only the scaling factors. How best to effect this approach is an open question.
6 Numerical Experiments
This section reports the classification accuracies for the complex-valued convnets of Chintala et al. (2015), the standard real-valued convnets of Krizhevsky et al. (2012), and the scattering transforms of Oyallon and Mallat (2015), on a benchmark data set, CIFAR-10, from Krizhevsky (2009) (CIFAR-10 contains 50,000 images in its training set and 10,000 images in its testing set; each image falls into one of 10 classes, is full color, and consists of a 32 × 32 grid of pixels). According to Table 4 of Oyallon and Mallat (2015), the scattering transforms attain an error rate of on the test set after training their classifiers on the training set. According to Section 3.3 of Krizhevsky et al. (2012), a standard real-valued convnet attains an error rate of on the test set without the local response normalization of that Section 3.3 and attains with the local response normalization. The complex-valued convnets detailed in Chintala et al. (2015) attain an error rate of on the test set, at least when using a larger net and training with enough iterations for the test error to settle down and converge (for complex-valued convnets, accuracy seems to improve as the net becomes larger; for the error rate of , a net eight times the size of that reported in Table 1 of Chintala et al., 2015, was sufficient, using the same kernel sizes and other parameter settings as for Table 1). Augmenting the training images with their mirror images improved convergence to the reported accuracies. All in all, the extensively trained real- and complex-valued convnets yielded similar error rates, which are about one-third less than those that scattering transforms attained. Of course, the fitting/learning/training involved for classification with the scattering transforms is much less extensive.
While the above concerns X_k, where k ranges over the integers, extending the above to analyze X_{j,k}, where j and k range over the integers, is straightforward; the latter could be a “locally homogeneous random field.” Also, the infinite range of the integers is far from essential; implementations on computers obviously use only finite sequences. Moreover, the above construction is appropriate for processing any locally stationary stochastic process, not just filtered white noise. For instance, the construction can enable a multiresolution analysis of regularity (or smoothness) that easily distinguishes between low-pass filtered i.i.d. gaussian noise and a pulse train or sinusoid with a random phase offset (e.g., for any integer k, where J is an integer drawn uniformly at random from 1, 2, …, 2000). More generally, the construction should enable discriminating between many interesting classes of stochastic processes, commensurate with the ability of multiwavelet-based multiresolution analysis to measure regularity, intermittency, distributional characteristics (say, gaussian versus Poisson), and so on. Any globally stationary stochastic process, with or without intermittent fluctuations, can be modeled as above as a locally stationary stochastic process (of course, Bruna et al., 2015, treat the former directly, to great advantage in the analysis of homogeneous turbulence and other phenomena from statistical physics). Every model in the parametric family constituting the complex-valued convnet calculates relevant features, windowed spectra of the form in equations 3.2 and 3.3. The absolute values in equations 3.2 and 3.3 are the key nonlinearity, a reflection of the local stationarity—the local translation invariance—of the process and its relevant features.
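To make the discrimination claim concrete, the following toy sketch (our own; the particular signals, window width, and filter lengths are arbitrary choices) computes averaged windowed absolute spectra for low-pass filtered gaussian noise and for a random-phase sinusoid, and checks that the sinusoid's spectrum is far more concentrated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2048
# Process A: low-pass filtered i.i.d. gaussian noise (a moving average).
noise = np.convolve(rng.standard_normal(n + 8), np.ones(8) / 8, mode="valid")[:n]
# Process B: a sinusoid with a random (integer) phase offset.
J = rng.integers(1, 2001)
sine = np.cos(2 * np.pi * (np.arange(n) + J) / 16.0)

def windowed_abs_spectrum(x, width=64):
    """Average of windowed absolute spectra: magnitudes of the DFT over
    non-overlapping windows, averaged across the windows."""
    w = x[: (len(x) // width) * width].reshape(-1, width)
    return np.abs(np.fft.rfft(w * np.hanning(width), axis=1)).mean(axis=0)

sa = windowed_abs_spectrum(noise)
sb = windowed_abs_spectrum(sine)
# The sinusoid's absolute spectrum concentrates in one frequency bin;
# the filtered noise spreads across all the low frequencies.
peak_fraction_a = sa.max() / sa.sum()
peak_fraction_b = sb.max() / sb.sum()
```

Thresholding the peak fraction (or feeding such spectra to a classifier, as in the preceding sections) then separates the two processes, irrespective of the random phase offset.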
We would like to thank Keith Adams, Lubomir Bourdev, Rob Fergus, Armand Joulin, Manohar Paluri, Christian Puhrsch, Marc’Aurelio Ranzato, Ben Recht, Rachel Ward, and the editor and reviewers.