Deep learning (DL), a variant of the neural network algorithms originally proposed in the 1980s (Rumelhart et al., 1986), has made surprising progress in artificial intelligence (AI), ranging from language translation and protein folding (Jumper et al., 2021) to autonomous cars and, more recently, human-like language models (chatbots), all of which seemed intractable until very recently. Despite the growing use of DL networks, little is understood about the learning mechanisms and representations that make these networks effective across such a diverse range of applications. Part of the answer must be the huge scale of the architectures and, of course, of the data, since not much else has changed since 1986. But the nature of deep-learned representations remains largely unknown: training sets with millions or billions of tokens have unknown combinatorics, and networks with millions or billions of hidden units cannot easily be visualized, nor can their mechanisms be easily revealed. In this letter, we explore these challenges with a large DL network (a 2.28-million-weight VGG; see Table 1) in a novel high-density sample task (five unique tokens with more than 500 exemplars per token), which allows us to follow more carefully the emergence of category structure and feature construction. We use various visualization methods to follow the emergence of the classification and the development of the coupling of feature detectors and structures that provide a type of graphical bootstrapping. From these results, we harvest some basic observations about the learning dynamics of DL and propose a new theory of complex feature construction based on our results.

Perhaps the most common task that deep learning models have succeeded at is classification, most frequently using image data. In cognitive science, this kind of task might be termed categorization or concept learning (Shepard et al., 1961; Hanson & Gluck, 1990; Bruner et al., 1956) or, more fundamentally, identification (Luce et al., 1963). Although typical DL classification bears similarities to human learning, there are some important differences. One key difference involves the distribution of exemplars (samples) and the level at which they are sampled. Specifically, in cognitive science, the level of a category (Rosch et al., 1976) is based on the hierarchical nature of categories and a preferred level of reference (e.g., the basic level), and it varies with individual expertise. In supervised learning in DL architectures, the nature of the representation depends on the learned similarity function that maximally separates within-category members from between-category members. Because the DL, like any neural network, is a general function approximator (Hanson & Burr, 1990; Carroll & Dickinson, 1989; Hornik et al., 1990), it is difficult to tell whether the similarity functions learned for the mapping are arbitrary or consistent with human bias (Hanson et al., 2018).

Supervised learning using DL usually involves labeled data that may only sparsely represent any “concept” or category, given the arbitrary category label. These classification tasks are often used to establish benchmarks on previously developed shared image databases so that algorithm variations can be compared: for example, the 60,000 images of CIFAR (Canadian Institute for Advanced Research), with 10 or 100 categories, making the exemplar sample space sparse; or, with a similar kind of diversity, ImageNet (14 million images, based on G. Miller's WordNet, although often used with a reduced sample of 1.28 million images). For CIFAR, the category structure in both labeled data sets (CIFAR-10 and CIFAR-100) is diverse: it sometimes appears at the basic level (e.g., “fish,” “man”), sometimes at an unfamiliar subordinate level (“aquarium fish,” as opposed to, say, “goldfish”), and sometimes at mixed levels of reference.

The other common category learning data set is ImageNet's Large-Scale Visual Recognition Challenge (ILSVRC) subset, which contains around 1.28 million training images representing 1,000 categories. These categories also appear at both the basic and subordinate levels, with a focus on fine-grained classification across multiple levels of reference. Hence, in these typical benchmark data sets, there does not appear to be a common and consistent level of reference relative to human lexical knowledge or usage. This means the concept space can be very sparsely covered: the exemplars cluster in the feature hypercube, with large gaps between clusters. But, of course, having benchmark data sets consistent with human bias was not a primary goal, nor was it necessary for comparative tests of architectures or learning algorithms.

We introduce here a more ecologically valid human learning task that involves a smaller set of categories but a dense set of exemplars per category, similar to the set of near kin, friends, and famous individuals (e.g., actors, politicians) that a single individual might know. We term this the dense sample category task (DSC).

One such set is the Yale Face Dataset, in which each face has no fewer than 500 exemplars (the full set comprises 5,760 images of 10 individuals, each photographed in 9 poses under 64 different lighting conditions, for 576 cases per individual). This makes the concept space very dense in the feature hypercube, since each category is singular: a single identity. Of course, there are categories with thousands of exemplars in CIFAR, but at a much coarser level and never with this kind of density per concept.

We randomly chose five faces (although human subjects can accurately recognize more than 1,000 faces), a set size that might represent immediate family members, in that an individual has constant and extensive exposure to them and can recognize them in many novel contexts. Thus, the goals of this study are to approximate the sample space of a typical human individual with a similar density and learning exposure; to compare and contrast feature construction in the hidden layers early and late in learning, as well as early and late topographically in the network; and to analyze the dynamics of learning, characterizing the learning phases when features first crystallize (that is, when category formation first appears) and determining whether learning is unitary or multifactorial.

We trained a full DL model (2.28 million weights; see Table 1) and focus on the layer-to-layer interactions in order to characterize the internal learning dynamics (see Krizhevsky et al., 2012). Five faces (one female and four males; see Figure 1) were randomly selected from the 10 sets of faces. They are typical of the variation in the entire set and show strong similarity across individuals.
Table 1: Configuration of the 2.28M-Weight VGG Architecture Used for Face Classification.

Layer (type)            Output Shape           Param #
Conv2d-1                [-1, 3, 128, 128]      84
ReLU-2                  [-1, 3, 128, 128]
Conv2d-3                [-1, 3, 128, 128]      84
ReLU-4                  [-1, 3, 128, 128]
Conv2d-5                [-1, 3, 128, 128]      84
ReLU-6                  [-1, 3, 128, 128]
Conv2d-7                [-1, 3, 128, 128]      84
ReLU-8                  [-1, 3, 128, 128]
Conv2d-9                [-1, 3, 128, 128]      84
ReLU-10                 [-1, 3, 128, 128]
AdaptiveAvgPool2d-11    [-1, 3, 20, 20]
Linear-12               [-1, 1024]             1,229,824
ReLU-13                 [-1, 1024]
Dropout-14              [-1, 1024]
Linear-15               [-1, 1024]             1,049,600
ReLU-16                 [-1, 1024]
Dropout-17              [-1, 1024]
Linear-18               [-1, 5]                5,125

Total params: 2,284,969

Trainable params: 2,284,969

Non-trainable params: 0

Input size (MB): 0.19

Forward/backward pass size (MB): 3.81

Params size (MB): 8.72

Estimated Total Size (MB): 12.71

Figure 1: Face images of the five individuals in the database after a close crop removing background features.

In order to identify a network architecture suited to the data, we performed several empirical experiments on a VGG-style network (starting with VGG-11; see Table 1) and were able to shrink it to a network with five convolutional layers, each with kernel size 3 and three filters per layer. Decreasing the size of the network any further made the learning process very unstable and made it difficult to obtain consistent results while experimenting. We used the smallest workable network so as to better understand the functionality of each layer and to reduce unnecessary parameters, given the small and dense sample of our classes. We used a fixed filter size for all convolutional layers so that the differences between layers could be better analyzed and understood. This left images of size 128 × 128 at the end of the convolutional stages; a further focused (adaptive average) pooling stage kept the size of the model manageable in the fully connected upper layers of the DL.
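For concreteness, the following minimal PyTorch sketch reproduces the parameter counts in Table 1; the class name FaceVGG and padding=1 (preserving the 128 × 128 resolution through the convolutional stack) are our assumptions, and no training details (optimizer, learning rate) are implied:

import torch
import torch.nn as nn

class FaceVGG(nn.Module):
    def __init__(self, n_classes: int = 5):
        super().__init__()
        convs = []
        for _ in range(5):  # five 3x3 conv layers, three filters each (84 params apiece)
            convs += [nn.Conv2d(3, 3, kernel_size=3, padding=1), nn.ReLU()]
        self.features = nn.Sequential(*convs, nn.AdaptiveAvgPool2d((20, 20)))
        self.classifier = nn.Sequential(
            nn.Linear(3 * 20 * 20, 1024), nn.ReLU(), nn.Dropout(),  # 1,229,824 params
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(),         # 1,049,600 params
            nn.Linear(1024, n_classes),                             # 5,125 params
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                 # (N, 3, 20, 20) after adaptive pooling
        return self.classifier(x.flatten(1))

model = FaceVGG()
print(sum(p.numel() for p in model.parameters()))  # 2,284,969, matching Table 1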

2.1  Factoring Learning Dynamics

Here we describe this mechanism in more detail using the Yale Face Dataset. Consider, for example, the two-dimensional principal component analysis (PCA) representation of the exemplar distribution during learning in Figures 2 and 3. Figure 2 shows the beginning of learning and the lack of separation of the five face categories; chance accuracy is 20% (1/5). Figure 3 shows learning at asymptote after 51 epochs. Note that the five categories are completely separated (accuracy = 97%) in the two-dimensional PCA space.
Figure 2: The beginning of learning (top) and lack of separation of the five face categories (bottom). Chance accuracy is 20%.
Figure 3: Learning at asymptote, after 51 epochs, showing the five categories completely separated (accuracy = 97%) in the two-dimensional PCA space.
Next, we analyzed the resultant learning curve, shown in Figure 4, with a sum of two logistic curves. A single logistic curve accounts for around 94% of the overall variance but shows systematic errors at the phase transition where the first face identification occurs. The second logistic (albeit with two more parameters) boosts the fit to 99% of the variance, with a visually unbiased fit over the whole learning function.
Figure 4: Cumulative logistic fit to the Yale Face Database DL learning curve. Note the two component logistic curves at the bottom of the figure.
This factorization of learning dynamics is key to understanding dynamics in deep learning. Specifically, we propose that multiple learning processes tend to be initiated as a series of hyperbolic (logistic) learning processes. As in a wavelet decomposition or any other type of spectral decomposition, we hypothesize that deep learning dynamics can likewise be decomposed into a series of such functions: some very fast, some slower, and others near the floor (effectively background processes), but no less critical to the final representation (see Figure 5). We term this factoring the logistic learning decomposition (LLD):
Acc(t) = \sum_{i=1}^{N} \frac{a_i}{1 + e^{-b_i (t - c_i)}},    (2.1)

where a_i is the asymptote, b_i the rate, and c_i the onset (midpoint in epochs) of the ith logistic component.
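For illustration, here is a minimal Python sketch (our construction, not the authors' fitting procedure) that fits equation 2.1 to an accuracy-per-epoch curve with scipy's curve_fit; the initial-guess heuristics are assumptions:

import numpy as np
from scipy.optimize import curve_fit

def lld(t, *params):
    """Sum of N logistics, equation 2.1; params = (a_i, b_i, c_i) per component."""
    out = np.zeros_like(t, dtype=float)
    for a, b, c in np.reshape(params, (-1, 3)):
        out += a / (1.0 + np.exp(-b * (t - c)))
    return out

def fit_lld(epochs, accuracy, n_components=2):
    p0 = []
    for i in range(n_components):
        p0 += [accuracy.max() / n_components,   # asymptote a_i
               1.0,                             # rate b_i
               epochs.min() + (i + 1) * np.ptp(epochs) / (n_components + 1)]  # onset c_i
    popt, _ = curve_fit(lld, epochs, accuracy, p0=p0, maxfev=20000)
    return np.reshape(popt, (-1, 3))            # one (a_i, b_i, c_i) row per component

# e.g., fit_lld(np.arange(51.0), acc_curve) would recover two components like Figure 5.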
Figure 5: Factoring the components underlying the Yale Face data DL learning curve.

The number of terms in this series will be a function of the number of categories, the complexity of the decision surface (e.g., nonlinearity, convexity, connectedness), and data complexity. Given this small number of categories, two factors are not unreasonable; in a typical CIFAR classification, however, we would expect hundreds of such hyperbolic processes, some of which must be highly correlated (see below). Fast hyperbolic processes, as we will see next, are based on the highest-variance extraction, which will usually correspond to the first significant class separation (in this case, the red data points, the face 1 exemplars) but could be based on the entire linear separability of the task, as in, say, Fisher's iris data (Fisher, 1936): two decision surfaces might be extracted close in time, producing two large variance spikes in the learning dynamics (the third, more nonlinear surface is slower), with fairly coarse structure. More moderate hyperbolic learning processes in DL can then independently and gradually adjust the decision surface, improving the overall accommodation of other exemplars, while slower hyperbolic processes can discover more complex, higher-fidelity feature detectors and structure in the data, producing the smoothest and most probable fit to the true decision surface.

It is also important to understand that the learning dynamics will tend to be a sequential extraction (similar to PCA, though unlikely to yield orthogonal components), making the overall DL approximation a series of conditionally independent feature detectors or filters (partly due to the spectral separation resulting from the depth of each layer) that can be incrementally added to a larger hierarchy or “scaffold” that may have already developed.

2.2  Looking Further inside the Black Box

In order to further backfill this type of mechanism, we next explore the representation of the five face categories at the change point, where learning abruptly accelerates. We consider this latent, hyperbolic phase to be the accumulation phase of deep learning. As we will see, there is a sequence of learning structures supporting the final classification. Consider Figure 6, where we model the dynamics of the last 1,024 hidden units prior to classification. These 1,024 units form the last hidden layer of the VGG network (see Table 1) and thus show the first effects of error feedback over epochs.
Figure 6: Principal components analysis over the last 1,024 hidden units prior to classification. The blue line is the total error over epochs.
We performed a PCA over all classes and exemplars on the activations of the hidden units at each epoch. As learning proceeds, the composite score of the largest PC (we extract five PCs, which represent nearly 99% of the entire variance of the set) peaks near epoch 22 and shows the emergence of an entire class of faces (see the face 1 red exemplars in Figure 7), with the others emerging near epoch 26 and finally by epoch 30. At the point of transition, there is a large spike in the hidden layer as it accommodates the convolutional layers, whose feature detectors through the first 22 epochs were not sufficiently tuned to separate a critical mass of the cases (in effect, the category representation of face 1). For faces 1 and 5, we show the pixel-wise variance across all exemplar faces per category, along with likely hypotheses about feature detectors as they form: Figures 8 and 9 highlight the nose, eyebrow, and cheek regions, each the basis for a larger unique set of detectors that will eventually emerge. Consistent with these emerging face features are the dynamics in Figure 6, where the other extracted PCs show a perturbation at the point of phase transition, but with smaller peaks and depressions prior to a steady rise in variance accumulation. The bottom blue curve shows the accuracy as the DL approaches asymptotic classification and the change point near epochs 22 to 28.
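A minimal sketch of this per-epoch PCA (our reconstruction; the variable name activations and the composite-score definition are assumptions). Here activations[e] holds the last-hidden-layer activations (exemplars × 1,024) recorded at epoch e:

import numpy as np
from sklearn.decomposition import PCA

def pc_trajectories(activations, n_components=5):
    """Explained variance and composite (mean) PC scores, epoch by epoch."""
    var_by_epoch, score_by_epoch = [], []
    for acts in activations:                  # one (n_exemplars, 1024) array per epoch
        pca = PCA(n_components=n_components)
        scores = pca.fit_transform(acts)      # per-exemplar projections on 5 PCs
        var_by_epoch.append(pca.explained_variance_ratio_)  # ~99% total here
        score_by_epoch.append(scores.mean(axis=0))          # composite score per PC
    return np.array(var_by_epoch), np.array(score_by_epoch)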
Figure 7: First appearance of one face category (red) around epoch 22.
Figure 8: Face features for the first emerging face category.
Figure 9: Face features of the fifth emerging face category.
So far, it is clear that the hidden units form different paths as they cluster together in various combinations within each PC. This is a type of competition that produces sets of PCs that are mostly uncorrelated. These feature combinations (PCs) are initially hypotheses about the classification of the categories, in the prototype cases for each of the five faces. How independent the hidden layers have become is not obvious from the components alone, so we also calculated pairwise correlations of all convolutional layers over epochs (see Figure 10). Similar to the PCs over time, there is a phase transition around epochs 22 to 28, after which there is clear orthogonality of the hidden layers throughout the network, with spatially closer layers retaining higher correlations and more distant layers lower ones. This gradient was nearly linear over the five layers: correlations of spatially close layers drift down slowly, while those of spatially distant layers decrease more quickly and steadily over epochs. A specific weight correlation structure per layer is clearly at the basis of these hidden-unit functions and the dynamics of DL learning. Nonetheless, it is also clear that the correlation structure of the DL is complex at the layer/weight level.
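A sketch of the layer-correlation computation behind Figure 10 (our reconstruction; the exact statistic the analysis used may differ). Because every convolutional layer here has the same shape (84 weights), successive layers can be correlated directly:

import numpy as np

def successive_layer_correlations(weight_snapshots):
    """weight_snapshots[e] is a list of per-layer weight arrays at epoch e."""
    corrs = []
    for layers in weight_snapshots:
        flat = [w.ravel() for w in layers]   # flatten each layer's weights
        corrs.append([np.corrcoef(a, b)[0, 1] for a, b in zip(flat, flat[1:])])
    return np.array(corrs)                   # (n_epochs, n_layers - 1)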
Figure 10: Correlation over epochs of successive pairs of layers over the DL model.

One recent theoretical approach to understanding the dynamics of DL learning is to use statistical mechanics (Martin & Mahoney, 2021) to frame the weight dynamics as a disordered system. In this theory, the overparameterization paradox of DL is characterized as a natural regularization of parameters in DL networks through a competitive implicit process (thus, there is no explicit regularization; see Moody, 1991). Regularization in the standard case (Tikhonov regularization; neural network weight decay) involves adding a penalty to the estimator so that parameter dynamics are damped and parameter efficiency is optimized (Hanson & Pratt, 1988).

This theory was constructed from more than 100 trained DL architectures by extracting the eigenvalue distribution of the correlation of the weight matrices per layer. What resulted is a new theory of regularization (heavy-tailed self-regularization, in contrast to Tikhonov's classical regularization) that appears to apply to deep learning architectures and predicts five distinct phases of learning in terms of the eigenvalue spectral distribution (ESD): random, bulk, bulk bleed-out, bulk with spikes, and final rank collapse. Initially, the ESD is random; the term bulk refers to the eigenvalue distribution that results from weak covariance soon after the random ESD (which by itself would imply no learning). The bleed-out and spikes are the first signs that a strong covariance structure is emerging, which then rapidly grows into a highly connected predictive structure prior to rank collapse. We return to this theory in discussing a new proposal for DL feature construction. But first we consider what we can say so far about deep learning dynamics.
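A minimal sketch of the per-layer ESD computation (our reconstruction of the standard random-matrix recipe, not code from Martin & Mahoney, 2021):

import numpy as np

def layer_esd(W):
    """Eigenvalue spectral distribution of one layer's (n x m) weight matrix."""
    n, m = W.shape                   # assume n >= m
    X = W.T @ W / n                  # m x m empirical correlation matrix
    return np.linalg.eigvalsh(X)     # eigenvalues; histogram these per epoch

# Tracked over training, the histogram of layer_esd(W) moves through the phases
# named above: random (a Marchenko-Pastur-like bulk), bleed-out, bulk with
# spikes, and eventual rank collapse.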

3.1  Harvesting Some Principles of Deep Learning Dynamics

During learning and visualization, we observed a number of regularities, which we summarize next in the context of at least two kinds of learning observed in this example: what we define as fast feature competition and slow feature curation.

More layers create a buffer against the aggressive competitive learning common in single-layer networks. The layers in a DL appear locally correlated, suggesting that competitive learning, which is common in perceptron and single-hidden-layer backpropagation networks, is isolated per layer, thus slowing destructive and constructive learning throughout later layers.

Layers can enable latent learning, an induction period, that models data complexity more comprehensively with smaller, less destructive (near-zero) gradients that preserve promising feature analyzers in later layers. This in effect “curates” them with more consistent samples and more consistent feature sets, which the middle layers filter.

Layers decouple long-range effects and effectively create a conditionally independent network of layers. This allows more local updates per layer, conditioned on the spatially nearest layers, again increasing the locality and ultimate fidelity of the feature analyzers and protecting them from destructive competition.

As slow logistic processes increase in accuracy, a tipping point is reached, with explosive growth in accuracy reflecting a network of feature analyzers that is already nearly what it will be asymptotically. This logistic or exponential-hyperbolic form has a slow rise (the induction period) prior to explosive growth, followed by slower refinement processes (Hanson et al., 2018; Hanson & Hanson, 2023).

There are multiple learning processes initiated in parallel but at different rates, depending on the covariance and complexity in the data. These learning processes accumulate into an overall learning curve that can be factored into N multilogistic learning functions with different rates. The faster processes during classification extract lower-dimensional (separable-feature) structure, while slower processes backfill structure from covariance (integral features), similar to nonlinear factor analysis (Kruskal & Shepard, 1974). These slower logistic processes might be termed “latent correct responses,” emerging over a longer resolution period alongside the much faster hyperbolic rise in learning. Clearly, DL learning is multifactorial.

We would expect that the deeper the network, the more quickly destructive competition and slow curation of feature structures trade off. Consequently, lower layers (those closest to the input) may change more rapidly in accommodating the higher layers' incremental improvement of feature detectors, which tend to be more consistent with the error feedback per sample or batch. This potentially creates a long-distance communication channel between lower and upper layers. Nonetheless, there must be diminishing returns on this strategy, limiting the effective depth of any DL learner.

3.2  A New Proposal for Feature Creation in DL: Autocatalytic Feature Sets

As we have seen, the logistic (hyperbolic) function can produce explosive growth to an asymptote together with the complex construction of novel feature analyzers. We propose a theory of feature creation in DL networks based on an analogy to chemical reactions. Recall that a chemical reaction normally proceeds at a fixed rate that depends on the chemical constituents present; a catalyst multiplies that basic rate many fold, increasing the rate of the reaction without itself being consumed in the reaction.

The resultant product accumulates far faster in the presence of the catalyst. The growth curve for such reactions is a sigmoid (specifically, a logistic function), which is typical of autocatalytic reactions: they proceed slowly at the start (the induction period) because little catalyst is present; the rate increases progressively as the reaction proceeds and the amount of catalyst grows; and the reaction slows again as the reactant concentration decreases. Autocatalytic sets (Kauffman, 1992), as their name suggests, literally catalyze themselves: they produce their own catalyst during the reaction, providing cross-catalysis among connected variables. (In a dynamic system like Lotka-Volterra, each variable inhibits, and therefore symmetrically disinhibits, the growth of the other.) The second property of Kauffman's original theory is that an autocatalytic set appears as one “giant connected component” in the chemical reaction graph. In effect, the autocatalytic set is a kind of “clique” or circuit that provides for the sustainability and hyperbolic growth of structure as it develops into more and more elaborate form, what we referred to earlier as curated features.

One of the continuing mysteries of DL feature construction is how raw, basic features input to a convolutional layer become a high-fidelity representation of a “face,” “cat,” or “car.” Are there universal structures that evolve for the recognition of types and tokens? Autocatalytic sets may also account for how such representations can evolve at all from simple pixel-level input. For example, recent work in DL visualization has revealed complex features that are not merely obvious bits and pieces of the original feature structure of the stimulus set but appear to be based on inferred qualities of symmetry, relational structure, texture, color, pattern, and central features. They appear to be novel inventions of the DL (see Figure 11, with the nose and eyes in the case of the “dog” category; Olah et al., 2017, 2020; Erhan et al., 2009).
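To make the analogy concrete, the simplest autocatalytic reaction already yields the logistic curve described above; this derivation is standard chemical kinetics, not taken from the original text:

% Autocatalytic reaction A + B -> 2B with conserved total a0 = [A] + [B].
% Writing x = [B], the rate law is second order and logistic in form:
\[
  \frac{dx}{dt} \;=\; k\,[A]\,[B] \;=\; k\,x\,(a_0 - x),
  \qquad
  x(t) \;=\; \frac{a_0}{1 + \dfrac{a_0 - x_0}{x_0}\, e^{-a_0 k t}} .
\]
% Slow induction while the catalyst x is scarce, explosive growth as the
% product catalyzes its own production, then saturation as A is depleted.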
Figure 11: Visualization of a large-scale DL (Olah et al., 2020), “dog” category.

In comparison, this type of feature representation is still a central mystery in the visual pathways of the brain, which standard neuroscience textbooks typically depict as a hierarchy of simple features (Hubel & Wiesel, 1962), beginning with edges and lines, then patterns, textures, and more complex checkerboards, and then objects, faces, and complex types. However, no specific intermediate structures have been observed, specified theoretically, or identified that might bootstrap or provide constructive routes to more complex forms; the nature of the visual pathway representation remains only a plausible hypothesis. Worse, conventional wisdom has held that hierarchical structures could emerge that could in principle build novel structure while also being diagnostic of any type or token in the visual world (Biederman, 1972). But these proposals have generally not been productive and are rife with many types of inductive, constructive, and logical problems (Zoccolan et al., 2007; Herzog & Clarke, 2014; Carroll & Dickinson, 1989). Even the ventral visual pathway, which at some system level appears primarily focused on “what” information, is still poorly understood with respect to what the “what” pathway actually does. It seems ironic that, rather than the usual science-fiction trope of future technology eventually entering common human use, we have already created artificial systems modeled on the human brain that are nearly as complex as the brain itself, creating yet another mystery to investigate.

The authors report no conflicts of interest.

References

Biederman, I. (1972). Perceiving real-world scenes. Science, 177(4043), 77-80.

Bruner, J. S., Goodnow, J. J., & Austin, G. A. (1956). A study of thinking. Wiley.

Carroll, S. M., & Dickinson, B. W. (1989). Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks (pp. 607-611).

Erhan, D., Bengio, Y., Courville, A., & Vincent, P. (2009). Visualizing higher-layer features of a deep network. Technical report, University of Montreal.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.

Hanson, C., Caglar, L. R., & Hanson, S. J. (2018). Attentional bias in human category learning: The case of deep learning. Frontiers in Psychology, 9, 374.

Hanson, S. J., & Burr, D. J. (1990). What connectionist models learn: Learning and representation in connectionist networks. Behavioral and Brain Sciences, 13(3), 471-489.

Hanson, S. J., & Gluck, M. A. (1990). Spherical units as dynamic consequential regions: Implications for attention, competition and categorization. In R. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3. Morgan Kaufmann.

Hanson, S. J., & Hanson, C. (2023). L. L. Thurstone: The law of effect and the dynamics of deep learning. Manuscript submitted for publication.

Hanson, S. J., & Pratt, L. Y. (1988). Comparing biases for minimal network construction with back-propagation. In D. Touretzky (Ed.), Advances in neural information processing systems, 1. Morgan Kaufmann.

Herzog, M. H., & Clarke, A. M. (2014). Why vision is not both hierarchical and feedforward. Frontiers in Computational Neuroscience, 8, 135.

Hornik, K., Stinchcombe, M., & White, H. (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5), 551-560.

Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106-154.

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583-589.

Kauffman, S. A. (1992). Origins of order in evolution: Self-organization and selection. In F. J. Varela & J. P. Dupuy (Eds.), Understanding origins. Springer.

Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 25. Curran.

Kruskal, J. B., & Shepard, R. N. (1974). A nonmetric variety of linear factor analysis. Psychometrika, 39(2), 123-157.

Luce, R. D., Bush, R. R., & Galanter, E. (Eds.). (1963). Handbook of mathematical psychology. Wiley.

Martin, C. H., & Mahoney, M. W. (2021). Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research, 22, 1-73.

Moody, J. E. (1991). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. Moody, S. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 847-854). Morgan Kaufmann.

Olah, C., Cammarata, N., Schubert, L., & Goh, G. (2020). Zoom in: An introduction to circuits. Distill, 5(3).

Olah, C., Mordvintsev, A., & Schubert, L. (2017). Feature visualization. Distill.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8(3), 382-439.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Shepard, R. N., Hovland, C. I., & Jenkins, H. M. (1961). Learning and memorization of classifications. Psychological Monographs: General and Applied, 75(13), 1-42.

Zoccolan, D., Kouh, M., Poggio, T., & DiCarlo, J. J. (2007). Trade-off between object selectivity and tolerance in monkey inferotemporal cortex. Journal of Neuroscience, 27, 12292-12307.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode