Abstract
Deep learning (DL), a variant of the neural network algorithms originally proposed in the 1980s (Rumelhart et al., 1986), has made surprising progress in artificial intelligence (AI), ranging from language translation, protein folding (Jumper et al., 2021), and autonomous cars to, more recently, human-like language models (chatbots), all of which seemed intractable until very recently. Despite the growing use of DL networks, little is understood about the learning mechanisms and representations that make these networks effective across such a diverse range of applications. Part of the answer must be the enormous scale of the architectures and, of course, of the data, since little else has changed since 1986. But the nature of deep-learned representations remains largely unknown. Unfortunately, training sets with millions or billions of tokens have unknown combinatorics, and networks with millions or billions of hidden units cannot easily be visualized, nor can their mechanisms be easily revealed. In this letter, we explore these challenges with a large (1.24 million weight) VGG-style DL network in a novel high-density sample task (five unique tokens with more than 500 exemplars per token), which allows us to follow more carefully the emergence of category structure and feature construction. We use various visualization methods to follow the emergence of the classification and the development of the coupling of feature detectors and structures that provide a type of graphical bootstrapping. From these results, we harvest some basic observations about the learning dynamics of DL and propose a new theory of complex feature construction based on our results.
1 Introduction
Perhaps the most common task at which deep learning models have been successful is classification, most frequently using image data. In cognitive science, this kind of task might be termed categorization or concept learning (Shepard et al., 1961; Hanson & Gluck, 1990; Bruner et al., 1956) or, more fundamentally, identification (Luce et al., 1963). Although there are similarities to human learning, there are some important differences from typical DL classification. One key difference involves the distribution of exemplars (samples) and the level at which they are sampled. Specifically, in cognitive science, the level of a category (Rosch et al., 1976) is based on the hierarchical nature of categories and a preferred level of reference (e.g., the basic level) and will vary with individual expertise. In supervised learning in DL architectures, the nature of the representation depends on the learned similarity function that maximally separates within-category members from between-category members. Because a DL network, like any neural network, is a general function approximator (Hanson & Burr, 1990; Carroll & Dickinson, 1989; Hornik et al., 1990), it is difficult to tell whether the similarity functions learned for the mapping are arbitrary or consistent with human bias (Hanson et al., 2018).
Supervised learning using DL usually involves labeled data that may only sparsely represent any “concept” or category, given the arbitrary category label. These types of classification tasks are often used to establish benchmarks with previously developed shared image databases in order to compare algorithm variations: for example, the 60,000 images of CIFAR (Canadian Institute for Advanced Research), where there are 10 or 100 categories, making the exemplar sample space sparse; or, as another example, the similar kind of diversity in ImageNet (14 million images, based on G. Miller's WordNet, although often used with a reduced sample of 1.28 million images). For CIFAR, the category structure in both labeled data sets (10 and 100 categories) is diverse and sometimes appears at the basic level (e.g., “fish,” “man”), sometimes at an unfamiliar subordinate level (“aquarium fish,” as opposed to, say, “goldfish”), with other cases containing mixed levels of reference.
The other common category learning data set is ImageNet's Large-Scale Visual Recognition Challenge (ILSVRC) subset, which contains around 1.28 million training images representing 1,000 categories. These categories also tend to be at both the basic and subordinate levels, with a focus on fine-grained classification but across multiple levels of reference. Hence, in these typical benchmark data sets, there does not appear to be a common and consistent level of reference relative to human lexical knowledge or usage. This means that the concept space can be very sparsely covered, with many exemplars clustered together and large gaps throughout the feature hypercube where the exemplars distribute. But, of course, having benchmark data sets consistent with human bias was not a primary goal, nor was it necessary for comparative tests of architectures or learning algorithms.
We introduce here a more ecologically valid human learning task that involves a smaller set of categories but with a dense set of exemplars per category, similar to the nearest kin and friends and famous individuals (e.g., actors, politicians) that a single individual might have. We term this the dense sample category task (DSC).
One such set is the Yale Face Dataset, in which each face has no fewer than 500 exemplars (5,760 images of 10 individuals, each under 9 poses and 64 different lighting conditions, for 576 images per individual). This makes the concept space very dense in the feature hypercube, since it involves a singular category, a single identity. Of course, there are categories with thousands of exemplars in CIFAR, but at a much coarser level and never with this kind of density per concept.
We randomly chose five faces (although human subjects can accurately recognize more than 1,000 faces), a tactic meant to represent immediate family members, for whom an individual has constant and extensive exposure and whom they can recognize in many novel contexts. Thus, the goals of this study are to approximate the sample space of a typical human individual with a similar density and learning exposure; to compare and contrast the feature construction in the hidden layers early and late in learning, as well as early and late topographically in the network; and to analyze the dynamics of learning, characterize the learning phases when features first crystallize (that is, when category formation first appears), and determine whether learning is unitary or multifactorial.
2 Methods and Materials
Table 1: Network architecture.

| Layer (type) | Output Shape | Param # |
|---|---|---|
| Conv2d-1 | [-1, 3, 128, 128] | 84 |
| ReLU-2 | [-1, 3, 128, 128] | 0 |
| Conv2d-3 | [-1, 3, 128, 128] | 84 |
| ReLU-4 | [-1, 3, 128, 128] | 0 |
| Conv2d-5 | [-1, 3, 128, 128] | 84 |
| ReLU-6 | [-1, 3, 128, 128] | 0 |
| Conv2d-7 | [-1, 3, 128, 128] | 84 |
| ReLU-8 | [-1, 3, 128, 128] | 0 |
| Conv2d-9 | [-1, 3, 128, 128] | 84 |
| ReLU-10 | [-1, 3, 128, 128] | 0 |
| AdaptiveAvgPool2d-11 | [-1, 3, 20, 20] | 0 |
| Linear-12 | [-1, 1024] | 1,229,824 |
| ReLU-13 | [-1, 1024] | 0 |
| Dropout-14 | [-1, 1024] | 0 |
| Linear-15 | [-1, 1024] | 1,049,600 |
| ReLU-16 | [-1, 1024] | 0 |
| Dropout-17 | [-1, 1024] | 0 |
| Linear-18 | [-1, 5] | 5,125 |
Total params: 2,284,969
Trainable params: 2,284,969
Non-trainable params: 0
Input size (MB): 0.19
Forward/backward pass size (MB): 3.81
Params size (MB): 8.72
Estimated Total Size (MB): 12.71
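The layer summary above can be sketched as a small PyTorch module (a minimal reconstruction from the table, not the authors' code; the class name `DenseSampleVGG` and the dropout rate are our own assumptions, since neither is given):

```python
import torch
import torch.nn as nn

class DenseSampleVGG(nn.Module):
    """Minimal VGG-style network matching the Table 1 summary:
    five 3-filter, kernel-3 conv layers, adaptive pooling to 20 x 20,
    and three fully connected layers ending in 5 class outputs."""

    def __init__(self, num_classes: int = 5, dropout: float = 0.5):
        super().__init__()
        convs = []
        for _ in range(5):
            # 3 -> 3 channels, 3 x 3 kernel; padding 1 preserves the 128 x 128 size
            convs += [nn.Conv2d(3, 3, kernel_size=3, padding=1), nn.ReLU()]
        self.features = nn.Sequential(*convs)
        self.pool = nn.AdaptiveAvgPool2d((20, 20))
        self.classifier = nn.Sequential(
            nn.Linear(3 * 20 * 20, 1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(self.features(x))
        return self.classifier(torch.flatten(x, 1))

model = DenseSampleVGG()
n_params = sum(p.numel() for p in model.parameters())
```

Each Conv2d contributes 3 × 3 × 3 × 3 + 3 = 84 parameters, and the three linear layers contribute 1,229,824, 1,049,600, and 5,125, reproducing the 2,284,969 total reported above.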
In order to identify the network architecture for the data, we performed several empirical experiments on a VGG-style network (starting with VGG-11; see Table 1) and were able to shrink it to a network with five convolutional layers, each with kernel size 3 and three filters per layer. Decreasing the size of the network any further made the learning process very unstable and made it difficult to obtain consistent results while experimenting. We used the smallest network size possible so as to better understand the functionality of each of the layers and to reduce unnecessary parameters, given the small and condensed size of our classes. We used a fixed filter size for all of the convolutional layers so that the differences between the layers could be better analyzed and understood. This left us with feature maps of size 128 × 128 at the end of the convolutional stages. A final adaptive pooling stage kept the size of the model manageable in the fully connected upper layers of the DL.
2.1 Factoring Learning Dynamics
The number of terms in this series will be a function of the number of categories, the complexity of the decision surface (e.g., nonlinearity, convexity, connectedness), and the complexity of the data. For this small number of categories, two factors are not unreasonable; in a typical CIFAR classification, however, we would expect hundreds of such hyperbolic processes, some of which must be highly correlated (see below). Fast hyperbolic processes, as we will see next, are based on the highest-variance extraction, which will usually be based on the first significant class separation (in this case, the red data points, i.e., exemplars) but could be based on the entire linear separability of the task, as in, say, Fisher's iris data (Fisher, 1936). Two decision surfaces might be extracted close in time, producing two large variance spikes in the learning dynamics (a third, more nonlinear one is slower) with a fairly coarse structure. More moderate hyperbolic learning processes in DL can therefore independently and gradually adjust the decision surface, improving the overall accommodation of other exemplars, while slower hyperbolic processes can discover more complex, higher-fidelity feature detectors and structure in the data, producing the smoothest and most probable fit to the true decision surface.
It is also important to understand that the learning dynamics will tend to be a sequential extraction (similar to PCA, though unlikely to yield orthogonal components), making the overall DL approximation a series of conditionally independent feature detectors or filters (partly due to the spectral separation resulting from the depth of each layer) that can be incrementally added to a larger hierarchy or “scaffold” that may have already developed.
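The factoring of an overall learning curve into a small number of logistic components can be illustrated with a short curve-fitting sketch (our own illustration, not the authors' fitting procedure; the synthetic two-component curve, rates, onsets, and initial guesses are all assumptions made for the example):

```python
import numpy as np
from scipy.optimize import curve_fit

def multilogistic(t, *params):
    """Sum of K logistic components, each parameterized by
    amplitude a, rate r, and onset time t0 (params flattened as a, r, t0, ...)."""
    y = np.zeros_like(t, dtype=float)
    for a, r, t0 in zip(params[0::3], params[1::3], params[2::3]):
        y += a / (1.0 + np.exp(-r * (t - t0)))
    return y

# Synthetic learning curve: a fast early process plus a slower, later one.
t = np.linspace(0.0, 100.0, 200)
truth = multilogistic(t, 0.6, 0.5, 15.0, 0.4, 0.1, 60.0)
rng = np.random.default_rng(0)
data = truth + rng.normal(0.0, 0.005, t.shape)

# Fit K = 2 components; p0 holds rough initial guesses for (a, r, t0) each.
p0 = [0.5, 0.3, 10.0, 0.5, 0.2, 50.0]
popt, _ = curve_fit(multilogistic, t, data, p0=p0, maxfev=20000)
```

The recovered components separate the fast, high-variance process from the slow “backfilling” one, which is the sense in which the aggregate curve is multifactorial.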
2.2 Looking Further inside the Black Box
3 Discussion
One recent theoretical approach to understanding the dynamics of DL learning is to use statistical mechanics (Martin & Mahoney, 2021) to frame the weight dynamics as a disordered system. In this theory, the overparameterization paradox of DL is characterized as a natural regularization of parameters in DL networks through a competitive implicit process (thus, there is no explicit regularization; see Moody, 1991). Regularization of models in the standard case (Tikhonov regularization; neural network weight decay) involves adding a penalizer to the estimator so that parameter dynamics are damped and parameter efficiency is optimized (Hanson & Pratt, 1988).
This new theory of regularization was constructed from more than 100 learned DL architectures by extracting the eigenvalue distribution from the correlation of the weight matrices per layer. What resulted is a new theory of regularization (heavy-tailed regularization, in contrast to Tikhonov's classical regularization) that appears to apply to deep learning architectures and predicts five distinct phases of learning in terms of the eigenvalue spectral distribution (ESD): random, bulk, bulk bleed-out, bulk with spikes, and final rank collapse. The term bulk refers to the eigenvalue distribution that results from weak covariance soon after the random ESD appears (which would imply no learning). The bleed-out and spikes are the first signs that a strong covariance structure is emerging, which then rapidly grows into a highly connected predictive structure prior to the rank collapse. We return to this theory in discussing a new proposal for DL feature construction. But first we consider what we can say so far about deep learning dynamics.
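The ESD computation underlying this phase taxonomy can be sketched in a few lines (our own illustration; the matrix sizes are arbitrary). For an untrained layer with random weights, the eigenvalues of the layer correlation matrix fall in a Marchenko–Pastur bulk; eigenvalues that bleed out or spike beyond the bulk edge are the signature of emerging covariance structure:

```python
import numpy as np

def esd(W):
    """Eigenvalue spectral distribution of the layer correlation
    matrix X = W^T W / n for an n x m weight matrix W."""
    n, _ = W.shape
    X = W.T @ W / n
    return np.linalg.eigvalsh(X)  # real, sorted ascending

rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0, (1024, 256))  # stand-in for untrained layer weights
ev = esd(W)

# For random W the bulk edges are (1 +/- sqrt(q))^2, with q = m / n.
q = 256 / 1024
bulk_min = (1 - np.sqrt(q)) ** 2  # 0.25
bulk_max = (1 + np.sqrt(q)) ** 2  # 2.25
```

Tracking how `ev` deforms from this bulk over training epochs, layer by layer, is what distinguishes the random, bulk, bleed-out, spike, and rank-collapse phases.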
3.1 Harvesting Some Principles of Deep Learning Dynamics
During learning and visualization, we observed a number of regularities that we summarize next in the context of at least two kinds of learning observed in this example, what we define as fast feature competition and slow feature curation:
More layers create a buffer against the aggressive competitive learning common in single-layer networks. The layers in a DL appear locally correlated, suggesting that competitive learning, which is common in perceptron and single-hidden-layer backpropagation networks, is isolated per layer, thus slowing destructive and constructive learning throughout later layers.
Layers can enable latent learning, an induction period, to more comprehensively model data complexity with smaller, less destructive (near-zero) gradients that preserve promising feature analyzers in later layers. This will in effect “curate” them with more consistent samples and more consistent feature sets that middle layers will filter.
Layers decouple long-range effects and effectively create a conditionally independent network of layers. This effect allows for more local updates per layer conditioned on the spatially nearest layers, thus again increasing the locality and ultimate fidelity of the feature analyzers and protecting them from destructive competition.
As slow logistic processes increase in accuracy, a tipping point is reached with an explosive growth of accuracy reflecting a network of feature analyzers that are already close to their asymptotic form. This logistic or exponential-hyperbolic form has a slow rise period (induction period) prior to explosive growth, followed by slower refinement processes (Hanson et al., 2018; Hanson & Hanson, 2023).
There are multiple learning processes initiated in parallel but at different rates depending on the covariance and complexity in the data. These learning processes accumulate into an overall learning curve that can be factored into N multilogistic learning functions with different rates. The faster processes during classification extract a lower-dimensional structure (separable features) while slower processes backfill structure from covariance (integral features), similar to nonlinear factor analysis (Kruskal & Shepard, 1974). These slower logistic processes might be termed “latent correct responses,” which emerge over a longer resolution period alongside the much faster, hyperbolic rise in learning. Clearly, DL learning is multifactorial.
We would expect that the deeper the network, the more quickly destructive competition and slow curation of feature structures trade off. Consequently, lower layers (those closest to the input) may experience more rapid change in accommodating the higher layers' incremental improvement of feature detectors, which tend to be more consistent with the error feedback outcome per sample or batch. This potentially creates a long-distance communication channel between lower and upper layers. Nonetheless, there must be diminishing returns on this strategy and on the effective depth of any DL learner.
3.2 A New Proposal for Feature Creation in DL: Autocatalytic Feature Sets
As we have seen, the logistic (hyperbolic) function can produce explosive growth to an asymptote with a complex construction of novel feature analyzers. We propose a theory of feature creation in DL networks that is based on an analogy to a chemical reaction. Recall that catalysis is typically described as a chemical reaction that normally proceeds at a fixed rate, depending on the chemical constituents present. A catalyst will multiply the basic rate of that reaction manyfold, thus increasing the rate of the reaction without itself being consumed in the reaction.
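The analogy can be made precise with a standard kinetics derivation (our formalization, not the authors'): in an autocatalytic step $A + X \rightarrow 2X$ with rate constant $k$, the product $X$ accelerates its own production. Writing $x = [X]$ and using the conserved total concentration $c = [A] + [X]$,

```latex
\frac{dx}{dt} = k\,[A]\,[X] = k\,x\,(c - x),
\qquad
x(t) = \frac{c}{1 + \dfrac{c - x_0}{x_0}\, e^{-kct}} .
```

This is exactly the logistic form observed in the learning curves: a slow induction period while $x$ is small, explosive growth once the product itself drives the reaction, and saturation at the asymptote $c$.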
By comparison, this type of feature representation is still a central mystery in the visual pathways of the brain, which standard neuroscience textbooks typically depict as pathways of simple features (Hubel & Wiesel, 1962): they begin with edges and lines, then patterns, textures, and more complex checkerboards, and then objects, faces, and complex types. However, no specific intermediate structures have been observed, specified theoretically, or identified that might bootstrap or provide constructive routes to more complex forms. The nature of the visual pathway representation is only a plausible hypothesis. Worse, conventional wisdom has held that hierarchical structures could emerge that could in principle build novel structure and also be diagnostic of any type or token that might appear in the visual world (Biederman, 1972). But these proposals have generally not been productive and are rife with many types of inductive, constructive, and logical problems (Zoccolan et al., 2007; Herzog & Clarke, 2014; Carroll & Dickinson, 1989). Even the ventral visual pathway of the brain, which at some system level is primarily focused on “what” information, is still poorly understood: what the “what” pathway actually does remains unclear. It seems ironic that, instead of the usual sci-fi trope in which futuristic technology eventually enters common human use, we have already created artificial systems, modeled on the human brain and nearly as complex as the human brain, creating yet another mystery to investigate.
Conflict of Interest
The authors report no conflicts of interest.