Abstract
The ventral visual stream enables humans and nonhuman primates to effortlessly recognize objects across a multitude of viewing conditions, yet the computational role of its abundant feedback connections is unclear. Prior studies have augmented feedforward convolutional neural networks (CNNs) with recurrent connections to study their role in visual processing; however, often these recurrent networks are optimized directly on neural data, or the comparative metrics used are undefined for standard feedforward networks that lack these connections. In this work, we develop task-optimized convolutional recurrent (ConvRNN) network models that more closely mimic the timing and gross neuroanatomy of the ventral pathway. Properly chosen intermediate-depth ConvRNN circuit architectures, which incorporate mechanisms of feedforward bypassing and recurrent gating, can achieve high performance on a core recognition task, comparable to that of much deeper feedforward networks. We then develop methods that allow us to compare both CNNs and ConvRNNs to fine-grained measurements of primate categorization behavior and neural response trajectories across thousands of stimuli. We find that high-performing ConvRNNs provide a better match to these data than feedforward networks of any depth, predicting the precise timings at which each stimulus is behaviorally decoded from neural activation patterns. Moreover, these ConvRNN circuits consistently produce quantitatively accurate predictions of neural dynamics from V4 and IT across the entire stimulus presentation. In fact, we find that the highest-performing ConvRNNs, which best match neural and behavioral data, also achieve a strong Pareto trade-off between task performance and overall network size. Taken together, our results suggest that the functional purpose of recurrence in the ventral pathway is to fit a high-performing network in cortex, attaining computational power through temporal rather than spatial complexity.
1 Introduction
The visual system of the brain must discover meaningful patterns in a complex physical world (James, 1890). Within 200 ms, primates can identify objects despite changes in position, pose, contrast, background, foreground, and many other factors from one occasion to the next, a behavior known as “core object recognition” (Pinto, Cox, & DiCarlo, 2008; DiCarlo, Zoccolan, & Rust, 2012). It is known that the ventral visual stream (VVS) underlies this ability by transforming the retinal image of an object into a new internal representation, in which high-level properties, such as object identity and category, are more explicit (DiCarlo et al., 2012).
Nontrivial dynamics result from a ubiquity of recurrent connections in the VVS, including synapses that facilitate or depress, dense local recurrent connections within each cortical region, and long-range connections between different regions, such as feedback from higher to lower visual cortex (Gilbert & Wu, 2013). However, the behavioral roles of recurrence and dynamics in the visual system are not well understood. Several conjectures are that recurrence “fills in” missing data (Spoerer, McClure, & Kriegeskorte, 2017; Michaelis, Bethge, & Ecker, 2018; Rajaei, Mohsenzadeh, Ebrahimpour, & Khaligh-Razavi, 2019; Linsley, Kim, Veerabadran, Windolf, & Serre, 2018), such as object parts occluded by other objects; that it “sharpens” representations by top-down attentional feature refinement, allowing for easier decoding of certain stimulus properties or performance of certain tasks (Gilbert & Wu, 2013; Lindsay, 2015; McIntosh, Maheswaranathan, Sussillo, & Shlens, 2018; Li, Jie, Feng, Liu, & Yan, 2018; Kar, Kubilius, Schmidt, Issa, & DiCarlo, 2019); that it allows the brain to “predict” future stimuli, such as the frames of a movie (Rao & Ballard, 1999; Lotter, Kreiman, & Cox, 2017; Issa, Cadieu, & DiCarlo, 2018); or that recurrence “extends” a feedforward computation, reflecting the fact that an unrolled recurrent network is equivalent to a deeper feedforward network that conserves on neurons (and learnable parameters) by repeating transformations several times (Liao & Poggio, 2016; Zamir et al., 2017; Leroux et al., 2018; Rajaei et al., 2019; Kubilius et al., 2019; Spoerer, Kietzmann, Mehrer, Charest, & Kriegeskorte, 2020). Formal computational models are needed to test these hypotheses: if optimizing a model for a certain task leads to accurate predictions of neural dynamics, then that task may be a primary reason those dynamics occur in the brain.
We show that with a novel choice of layer-local recurrent circuit and long-range feedback connectivity pattern, ConvRNNs can match the performance of much deeper feedforward CNNs on ImageNet but with far fewer units and parameters, as well as a more anatomically consistent number of layers, by extending these computations through time. In fact, such ConvRNNs most accurately explain neural dynamics from V4 and IT across the entirety of stimulus presentation with a temporally fixed linear mapping compared to alternative recurrent circuits. Furthermore, we find that these suitably chosen ConvRNN circuit architectures provide a better match to primate behavior in the form of object solution times compared to feedforward CNNs. We observe that ConvRNNs that attain high task performance but have small overall network size, as measured by number of units, are most consistent with these data, while even the highest-performing but biologically implausible deep feedforward models are overall a less consistent match. In fact, we find a strong Pareto trade-off between network size and performance, with ConvRNNs of biologically plausible intermediate depth occupying the sweet spot with high performance and a (comparatively) small overall network size. Because we do not fit neural networks end-to-end to neural data (see Kietzmann et al., 2019), but instead show that these outcomes emerge naturally from task performance, our approach enables a normative interpretation of the structural and functional design principles of the model.
Our work is also the first to develop large-scale, task-optimized ConvRNNs with biologically plausible temporal unrolling. Unlike most studies of combinations of convolutional and recurrent networks, which posit a recurrent subnetwork concatenated onto the end of a convolutional backbone (McIntosh et al., 2018), we model local recurrence implanted within ConvRNN layers, and allow long-range feedback between layers. Moreover, we treat each connection in the network, whether feedforward or feedback, as a real temporal object with a biophysical conduction delay (set at 10 ms), rather than the typical procedure (e.g., as in McIntosh et al., 2018; Zamir et al., 2017; and Kubilius et al., 2019) in which the feedforward component of the network (no matter how deep) operates in one time step. As a result, our networks can be directly compared with neural and behavioral trajectories at a fine-grained scale limited only by the conduction delay itself.
This level of realism is especially important for establishing what appears to be the main quantitative advantage of ConvRNNs as biological models compared to very deep feedforward networks. In particular, we can define an improved metric for assessing the correctness of the match between a ConvRNN network, thought of as a dynamical system, and the actual trajectories of real neurons. By treating such feedforward networks as ConvRNNs with recurrent connections set to 0, we can map these networks to temporal trajectories as well. As a result, we can directly ask how much of the neural-behavioral trajectory of real brain data is explicable by very deep feedforward networks. This is an important question because implausibly deep networks have been shown in the literature to achieve not only the highest categorization performance (He, Zhang, Ren, & Sun, 2016) but also competitive results on matching static (temporally averaged) neural responses (Schrimpf et al., 2018). Due to nonbiological temporal unrolling, previous work comparing such networks to temporal trajectories in neural data (Kubilius et al., 2019) has been forced to unfairly score feedforward networks as total failures, with their temporal match score artificially set to 0. With our improved realism, we find (see section 2) that deep feedforward networks actually make quite nontrivial temporal predictions that do explain some of the reliable temporal variability of real neurons. In this context, our finding that plausibly deep ConvRNNs meaningfully outperform these deep feedforward networks on this fairer metric is a strong and nontrivial signal that ConvRNNs are in fact a better biological match than deep feedforward networks.
2 Results
2.1 An Evolutionary Architecture Search Yields Specific Layer-Local Recurrent Circuits and Long-Range Feedback Connectivity Patterns That Improve Task Performance and Maintain Small Network Size
We speculated that this was because standard circuits lack a combination of two key properties, each of which on its own has been successful either purely for RNNs or for feedforward CNNs: (1) direct passthrough, where at the first time step, a zero-initialized hidden state allows feedforward input to pass on to the next layer as a single linear-nonlinear layer just as in a standard feedforward CNN layer (see Figure 2a, top left); and (2) gating, in which the value of a hidden state determines how much of the bottom-up input is passed through, retained, or discarded at the next time step (see Figure 2a, top right). For example, LSTMs employ gating but not direct passthrough, as their inputs must pass through several nonlinearities to reach their output, whereas SimpleRNNs do pass through a zero-initialized hidden state but do not gate their input (see Figure 2a; see section A.3 for cell equations). Additionally, each of these computations has direct analogies to biological mechanisms: direct passthrough would correspond to feedforward processing in time, and gating would correspond to adaptation to stimulus statistics across time (Hosoya, Baccus, & Meister, 2005; McIntosh et al., 2016).
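To make these two mechanisms concrete, a minimal toy cell combining them might look as follows. This is a hypothetical sketch, not the RGC equations of section A.3.7; `u_gate`, `u_rec`, and `gate_bias` are illustrative parameters, with the gate bias chosen so that the gate stays nearly shut when the hidden state is zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_passthrough_step(ff, h, u_gate, u_rec, gate_bias=-4.0):
    """One step of a toy cell with direct passthrough and gating.

    ff:     bottom-up (feedforward) input at the current time step
    h:      this layer's hidden state (zero-initialized at t = 0)
    u_gate, u_rec: illustrative recurrent weight matrices
    gate_bias: chosen so the gate is nearly closed when h == 0
    """
    gate = sigmoid(u_gate @ h + gate_bias)        # in (0, 1); ~0 when h is zero
    pre = (1.0 - gate) * ff + gate * (u_rec @ h)  # gated mix of input and state
    return np.maximum(pre, 0.0)                   # ReLU, as in a CNN layer

# Direct passthrough: with a zero-initialized hidden state, the gate stays
# nearly shut on the recurrent term, so the first step is almost exactly
# ReLU(ff) -- that is, a standard linear-nonlinear feedforward layer.
h0 = np.zeros(4)
ff = np.array([1.0, -0.5, 2.0, 0.3])
h1 = gated_passthrough_step(ff, h0, u_gate=np.zeros((4, 4)), u_rec=np.eye(4))
```

At later time steps, a nonzero hidden state opens the gate, so the cell attenuates its bottom-up input in favor of recurrent drive, which is the gating behavior described above.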
We thus implemented recurrent circuits with both features to determine whether they function better than standard circuits within CNNs. One example of such a structure is the reciprocal gated circuit (RGC; Nayebi et al., 2018), which passes through its zero-initialized hidden state and incorporates gating (see Figure 2a, bottom right; see section A.3.7 for the circuit equations). Adding this circuit to the six-layer BaseNet increased accuracy from 0.51 (FF) and 0.53 (RGC Minimal, the minimally unrolled, parameter-matched control version) to 0.6 (RGC Extended). Moreover, the RGC used substantially fewer parameters than the standard circuits to achieve greater accuracy (see Figure S1).
However, it has been shown that different RNN structures can succeed or fail to perform a given task because of differences in trainability rather than differences in capacity (Collins, Sohl-Dickstein, & Sussillo, 2017). Therefore, we designed an evolutionary search to jointly optimize over both discrete choices of recurrent connectivity patterns and continuous choices of learning hyperparameters and weight initializations (search details are in section A.4). While a large-scale search over thousands of convolutional LSTM architectures did yield a better purely gated LSTM-based ConvRNN (LSTM Opt), it did not eclipse the performance of the smaller RGC ConvRNN. In fact, applying the same hyperparameter optimization procedure to the RGC ConvRNNs equally increased that architecture class's performance and further reduced its parameter count (see Figure S1, RGC Opt).
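To illustrate the flavor of such a joint discrete-continuous search (a toy stand-in, not the actual procedure of section A.4), consider a tournament-style loop in which each candidate pairs a hypothetical circuit choice with a learning rate, scored by a synthetic fitness function rather than real ImageNet training:

```python
import math
import random

CIRCUITS = ["simple", "lstm", "gru", "rgc"]  # hypothetical discrete choices

def sample(rng):
    """A candidate pairs a discrete circuit with a continuous hyperparameter."""
    return {"circuit": rng.choice(CIRCUITS), "lr": 10 ** rng.uniform(-4, -1)}

def mutate(cand, rng):
    out = dict(cand)
    if rng.random() < 0.3:                     # occasionally swap the circuit
        out["circuit"] = rng.choice(CIRCUITS)
    out["lr"] *= 10 ** rng.uniform(-0.3, 0.3)  # jitter the learning rate
    return out

def fitness(cand):
    """Synthetic stand-in for a few epochs of ImageNet validation accuracy."""
    bonus = {"simple": 0.0, "lstm": 0.20, "gru": 0.25, "rgc": 0.40}[cand["circuit"]]
    return bonus - 0.1 * (math.log10(cand["lr"]) + 3.0) ** 2

def evolutionary_search(population=16, generations=10, seed=0):
    """Toy tournament search: keep the top half, refill with their mutants."""
    rng = random.Random(seed)
    pool = [sample(rng) for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=fitness, reverse=True)
        parents = pool[: population // 2]
        pool = parents + [mutate(p, rng) for p in parents]
    return max(pool, key=fitness)

best = evolutionary_search()
```

Because the top half survives unchanged each generation, the best candidate found so far is never lost; the real search additionally optimizes weight initializations and many more architectural choices.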
Therefore, given the promising results with shallower networks, we turned to embedding recurrent circuit motifs into intermediate-depth feedforward networks at scale, whose number of feedforward layers corresponds to the timing of the ventral stream (DiCarlo et al., 2012). We then performed an evolutionary search over these resultant intermediate-depth recurrent architectures (see Figure 2b). If the primate visual system uses recurrence in lieu of greater network depth to perform object recognition, then a shallower recurrent model with a suitable form of recurrence should achieve recognition accuracy equal to a deeper feedforward model, albeit with temporally fixed parameters (Liao & Poggio, 2016). We therefore tested whether our search had identified such well-adapted recurrent architectures by fully training a representative ConvRNN, the model with the median (across 7000 sampled models) validation accuracy after five epochs of ImageNet training. This median model (RGC Median) reached a final ImageNet top-1 validation accuracy nearly equal to that of a ResNet-34 model with roughly twice as many layers, even though the ConvRNN used only approximately 75% as many parameters. The fully unrolled model from the random phase of the search (RGC Random) did not perform substantially better than the BaseNet, though the minimally unrolled control did (see Figure 2c). We suspect that the improvement of the base intermediate feedforward model over the shallow networks (as in Figure S1) diminishes the difference between the minimal and extended versions of the RGC relative to suitably chosen long-range feedback connections. However, compared to alternative choices of ConvRNN circuits, even the minimally extended RGC significantly outperforms them with fewer parameters and units, indicating the importance of this circuit motif for task performance.
This observation suggests that our evolutionary search strategy yielded effective recurrent architectures beyond the initial random phase of the search.
We also considered a control model (Time Decay) that produces temporal dynamics by learning time constants on the activations independently at each layer rather than by learning connectivity between units. In this ConvRNN, unit activations have exponential rather than immediate falloff once feedforward drive ceases. These dynamics could arise, for instance, from single-neuron biophysics (e.g., synaptic depression) rather than interneuronal connections. However, this model did not perform any better than the feedforward BaseNet, implying that ConvRNN performance is not a trivial result of outputting a dynamic time course of responses. We further implanted other more sophisticated forms of ConvRNN circuits into the BaseNet, and while this improved performance over the Time Decay model, it did not outperform the RGC Median ConvRNN despite having many more parameters (see Figure 2c). Together, these results demonstrate that the RGC Median ConvRNN uses recurrent computations to subserve object recognition behavior and that particular motifs in its recurrent architecture (see Figure S2), found through an evolutionary search, are required for its improved accuracy. Thus, given suitable local recurrent circuits and patterns of long-range feedback connectivity, a physically more compact, temporally extended ConvRNN can do the same challenging object recognition task as a deeper feedforward CNN.
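As a concrete illustration, the Time Decay dynamics amount to a per-layer leaky accumulator. The sketch below is a toy version, with `tau` standing in for the layer's learned time constant:

```python
import numpy as np

def time_decay_trace(drives, tau):
    """Leaky per-layer trace: h[t] = tau * h[t-1] + drive[t], with no learned
    connectivity between units (toy version of the Time Decay control).

    drives: per-time-step feedforward inputs; None once the drive ceases.
    tau:    stand-in for the layer's learned time constant, in [0, 1).
    """
    h = np.zeros_like(drives[0])
    trace = []
    for d in drives:
        h = tau * h + (d if d is not None else 0.0)
        trace.append(h.copy())
    return trace

# Once feedforward drive stops, activations fall off exponentially, by a
# factor of tau per step, rather than vanishing immediately.
drives = [np.ones(3), np.ones(3), None, None]
trace = time_decay_trace(drives, tau=0.5)
```

Because the units never interact, this model can only reshape each unit's response in time, which is why matching its performance is a meaningful bar for the interneuronal circuits above.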
2.2 ConvRNNs Better Match Temporal Dynamics of Primate Behavior Than Feedforward Models
To address whether recurrent processing is engaged during core object recognition behavior, we turn to behavioral data collected from behaving primates. There is a growing body of evidence that current feedforward models fail to accurately capture primate behavior (Rajalingham et al., 2018; Kar et al., 2019). We therefore reasoned that if recurrence is critical to core object recognition behavior, then recurrent networks should be more consistent with suitable measures of primate behavior than the feedforward model family. Since the identity of different objects is decoded from the IT population at different times, we considered the first time at which the IT neural decoding accuracy reaches the (pooled) primate behavioral accuracy for a given image, known as the object solution time (OST) (Kar et al., 2019). Given that our ConvRNNs also have an output at each 10 ms time bin, we computed the OST for these models from their IT-preferred layers, and we report the OST consistency, which we define as the Spearman correlation between the model OSTs and the IT population's OSTs on the common set of images solved by the given model and IT under the same stimulus presentation (see sections A.6.1 and A.8 for more details).
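The OST consistency itself reduces to a rank correlation over commonly solved images. A minimal sketch, with hypothetical solution times and NaN marking images a system never solves (real model OSTs come from the IT-preferred layers as described above):

```python
import numpy as np

def rank(x):
    """Simple ranks (assumes no ties, for clarity)."""
    order = np.argsort(x)
    r = np.empty(len(x))
    r[order] = np.arange(len(x))
    return r

def ost_consistency(model_ost, it_ost):
    """Spearman correlation between model and IT object solution times,
    restricted to the images solved by both (NaN = never solved)."""
    solved = ~np.isnan(model_ost) & ~np.isnan(it_ost)
    rm, ri = rank(model_ost[solved]), rank(it_ost[solved])
    rm, ri = rm - rm.mean(), ri - ri.mean()
    return float((rm @ ri) / np.sqrt((rm @ rm) * (ri @ ri)))

# Hypothetical per-image solution times in ms; the fourth image is unsolved
# by the model, so it is excluded from the comparison.
model = np.array([100.0, 120.0, 180.0, np.nan, 150.0])
it = np.array([110.0, 130.0, 200.0, 140.0, 160.0])
score = ost_consistency(model, it)  # identical orderings give 1.0
```

Because only ranks matter, a model is rewarded for solving images in the same order as the IT population, not for matching absolute latencies.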
With model OST defined across both model families, we compared various ConvRNNs and feedforward models to the IT population's OST in Figure 3c. Among shallower and deeper models, we found that ConvRNNs were generally able to better explain IT's OST than their feedforward counterparts. Specifically, we found that ConvRNN circuits without any multiunit interaction, such as the Time Decay ConvRNN, only marginally, and not always significantly, improved the OST consistency over the respective BaseNet model.2 On the other hand, ConvRNNs with multiunit interactions generally provided a better match to IT OSTs than even the deepest feedforward models,3 where the best feedforward model (ResNet-152) attains a mean OST consistency of 0.173 and the best ConvRNN (UGRNN) attains an OST consistency of 0.237.
Consistent with our observations in Figure 2 that different recurrent circuits with multiunit interactions were not all equally effective when embedded in CNNs (despite outperforming the simple Time Decay model), we similarly found that this observation held for the case of matching IT's OST. Given recent observations (Kar & DiCarlo, 2021) that inactivating parts of macaque ventrolateral PFC (vlPFC) results in behavioral deficits in IT for late-solved images, we reasoned that additional decoding procedures employed at the categorization layer during task optimization might have a meaningful impact on the model's OST consistency, in addition to the choice of recurrent circuit used. We designed several decoding procedures (defined in section A.5), motivated by prior observations of accumulation of relevant sensory signals during decision making in primates (Shadlen & Newsome, 2001). Overall, we found that ConvRNNs with different decoding procedures but with the same layer-local recurrent circuit (RGC Median) and long-range feedback connectivity patterns yielded significant differences in final consistency with the IT population OST (see Figure S4; Friedman test, ). Moreover, the simplest decoding procedure of outputting a prediction at the last time point, a strategy commonly employed by the computer vision community, had a lower OST consistency than each of the more nuanced Max Confidence4 and Threshold decoding procedures5 that we considered. Taken together, our results suggest that the type of multiunit layer-wise recurrence and downstream decoding strategy are important features for OST consistency with IT, suggesting that specific, nontrivial connectivity patterns farther downstream of the ventral visual pathway may be important to core object recognition behavior over timescales of a couple hundred milliseconds.
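For concreteness, the three decoding procedures compared here can be sketched as follows. These are toy implementations operating on a per-time-bin vector of class probabilities; the threshold value and other details are illustrative, not those of section A.5:

```python
import numpy as np

def decode_last(probs):
    """Standard computer-vision readout: prediction at the final time step."""
    return int(np.argmax(probs[-1]))

def decode_threshold(probs, thresh=0.75):
    """Hypothetical 'Threshold' readout: commit at the first time step whose
    top class probability crosses thresh (fall back to the final step)."""
    for t, p in enumerate(probs):
        if p.max() >= thresh:
            return int(np.argmax(p)), t
    return int(np.argmax(probs[-1])), len(probs) - 1

def decode_max_confidence(probs):
    """Hypothetical 'Max Confidence' readout: commit at the time step where
    the top class probability is highest over the whole trajectory."""
    t = int(np.argmax([p.max() for p in probs]))
    return int(np.argmax(probs[t])), t

# Toy two-class probability trajectory over three 10 ms bins: confidence
# builds and then dips, so the readouts can disagree in general.
traj = [np.array([0.55, 0.45]), np.array([0.20, 0.80]), np.array([0.35, 0.65])]
```

Note that only the last-time-point readout ignores when the network becomes confident, which is why the two time-sensitive readouts can produce model OSTs that better track IT's.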
2.3 Neural Dynamics Differentiate ConvRNN Circuits
ConvRNNs naturally produce a dynamic time series of outputs given an unchanging input stream, unlike feedforward networks. While these recurrent dynamics could be used for tasks involving time, here we optimized the ConvRNNs to perform the static task of object classification on ImageNet. It is possible that the primate visual system is optimized for such a task, because even static images produce reliably dynamic neural response trajectories at temporal resolutions of tens of milliseconds (Issa et al., 2018). The object content of some images becomes decodable from the neural population significantly later than the content of other images, even though animals recognize both object sets equally well. Interestingly, late-decoding images are not well characterized by feedforward CNNs, raising the possibility that they are encoded in animals through recurrent computations (Kar et al., 2019). If this were the case, we reason then that recurrent networks trained to perform a difficult but static object recognition task might explain neural dynamics in the primate visual system, just as feedforward models explain time-averaged responses (Yamins et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014).
Prior studies (Kietzmann et al., 2019) have directly fit recurrent parameters to neural data, as opposed to optimizing them on a task. While it is natural to try to fit recurrent parameters directly to the temporally varying neural responses, this approach sacrifices normative explanatory power. In fact, we found that it suffers from a fundamental overfitting issue to the particular image statistics of the neural data collected. Specifically, we directly fit these recurrent parameters (implanted into the task-optimized feedforward BaseNet) to the dynamic firing rates of primate neurons recorded during encoding of visual stimuli. However, while these non-task-optimized dynamics generalized to held-out images and neurons (see Figures S5a and S5b), they no longer retained performance on the original object recognition task that the primate itself is able to perform (see Figure S5c). Therefore, to avoid this issue, we instead asked whether fully task-optimized ConvRNN models (including the ones introduced in section 2.1) could predict these dynamic firing rates from multielectrode array recordings from the ventral visual pathway of rhesus macaques (Majaj, Hong, Solomon, & DiCarlo, 2015).
We began with the feedforward BaseNet and added a variety of ConvRNN circuits, including the RGC Median ConvRNN and its counterpart generated at the random phase of the evolutionary search (RGC Random). All of the ConvRNNs were presented with the same images shown to the primates, and we collected the time series of features from each model layer. To decide which layer should be used to predict which neural responses, we fit linear models from each feedforward layer's features to the neural population and measured where explained variance on held-out images peaked (see section A.6 for more details). Units recorded from distinct arrays—placed in the successive V4, posterior IT (pIT), and central/anterior IT (cIT/aIT) cortical areas of the macaque—were fit best by the successive layers of the feedforward model, respectively. Finally, we measured how well ConvRNN features from these layers predicted the dynamics of each unit. In contrast with feedforward models' fit to temporally averaged neural responses, the linear mapping in the temporal setting must be fixed at all time points. The reason for this choice is that the linear mapping yields “artificial units” whose activity can change over time (just like the real target neurons), but the identity of these units should not change over the course of 260 ms, which would be the case if instead a separate linear mapping was fit at each 10 ms time bin. This choice of a temporally fixed linear mapping therefore maintains the physical relationship between real neurons and model neurons.
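The temporally fixed mapping can be sketched as fitting one weight matrix jointly across all (image, time bin) pairs. The example below uses synthetic data and ordinary least squares as a simple stand-in for the regression actually used; the dimensions are illustrative (26 ten-ms bins over 260 ms):

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_bins, n_feats, n_neurons = 50, 26, 20, 8

feats = rng.normal(size=(n_images, n_bins, n_feats))  # model layer features
w_true = rng.normal(size=(n_feats, n_neurons))        # synthetic ground truth
rates = feats @ w_true + 0.1 * rng.normal(size=(n_images, n_bins, n_neurons))

# Temporally fixed mapping: stack (image, time) pairs into rows and solve a
# single regression, so each "artificial unit" keeps a fixed identity over
# the whole trajectory.
X = feats.reshape(-1, n_feats)
Y = rates.reshape(-1, n_neurons)
w_fixed, *_ = np.linalg.lstsq(X, Y, rcond=None)
pred = feats @ w_fixed  # dynamic predictions for every image and time bin

# A per-time-bin alternative would instead fit 26 separate mappings (one per
# bin), letting the model-to-neuron correspondence drift every 10 ms.
```

The fixed mapping is the more constrained (and more physically interpretable) choice: all temporal structure in the prediction must come from the model's own dynamics, not from retuning the readout at each bin.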
Crucially, these predictions result from the task-optimized nonlinear dynamics from ImageNet, as both models are fit to neural data with the same form of temporally fixed linear model with the same number of parameters. Since the initial phase of neural dynamics was well fit by feedforward models, we asked whether the later phase could be fit by a much simpler model than any of the ConvRNNs we considered, namely, the Time Decay ConvRNN with ImageNet-trained time constants at convolutional layers. If the Time Decay ConvRNN were to explain neural data as well as the other ConvRNNs, it would imply that interneuronal recurrent connections are not needed to account for the observed dynamics; however, this model did not fit the late-phase dynamics of intermediate areas (V4 and pIT) as well as the other ConvRNNs did.6 The Time Decay model did match the other ConvRNNs for cIT/aIT, which may indicate some functional differences in the temporal processing of this area versus V4 and pIT. Thus, the more complex recurrence found in ConvRNNs is generally needed both to improve object recognition performance over feedforward models and to account for neural dynamics in the ventral stream, even when animals are only required to fixate on visual stimuli. However, not all forms of complex recurrence are equally predictive of temporal dynamics. As depicted in Figure 4b, we found that the RGC Median, UGRNN, and GRU ConvRNNs attained the highest median neural predictivity for each visual area in both early and late phases and significantly outperformed the SimpleRNN ConvRNN on the late-phase dynamics of these areas;7 these models in turn were among the best matches to IT object solution times (OST) from section 2.2.
A natural follow-up question to ask is whether a lack of recurrent processing is the reason for the prior observation that there is a drop in explained variance for feedforward models from early to late time bins (Kar et al., 2019). In short, we find that this is not the case and that this drop likely has to do with task-orthogonal dynamics specific to individual primates, which we examine below.
It is well known that recurrent neural networks can be viewed as very deep feedforward networks with weight sharing across layers that would otherwise be recurrently connected (Liao & Poggio, 2016). Thus, to address this question, we compare feedforward models of varying depths to ConvRNNs across the entire temporal trajectory under a varying linear mapping at each time bin, in contrast to the above. This choice of linear mapping allows us to evaluate how well the model features explain early compared to late time dynamics without information from the early dynamics influencing the later dynamics and, more crucially, allows the feedforward model features to be independently compared to the late dynamics. Specifically, as can be seen in Figure S7a, we observe a drop in explained variance from early (130–140 ms) to late (200–210 ms) time bins for the BaseNet and ResNet-18 models, across multiple neural data sets. Models with increased feedforward depth (such as ResNet-101 or ResNet-152), along with our performance-optimized RGC Median ConvRNN, exhibit a similar drop in median population explained variance as the intermediate feedforward models. The benefit of model depth with respect to increased explained variance of late IT responses might be noticeable only when comparing shallow models (less than 7 nonlinear transforms) to much deeper (more than 15 nonlinear transforms) models (Kar et al., 2019). Our results suggest that the amount of variance explained in the late IT responses is not a monotonically increasing function of model depth.
As a result, an alternative hypothesis is that the drop in explained variance from early to late time bins could instead be attributed to task-orthogonal dynamics specific to an individual primate as opposed to iterated nonlinear transforms, resulting in variability unable to be captured by any task-optimized model (feedforward or recurrent). To explore this possibility, we examined whether the model's neural predictivity at these early and late time bins was relatively similar in ratio to that of one primate's IT neurons mapped to that of another primate (see section A.7 for more details, where we derive a novel measure of the neural predictivity between animals, known as the “interanimal consistency”).
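The logic of this comparison can be sketched on synthetic data: map a model's features to a "target" animal's neurons, map a "source" animal's neurons to the same target, and take the ratio of the two predictivities. The code below is a toy illustration only (in-sample least squares rather than the cross-validated procedure of section A.7), with all populations sharing a common latent signal:

```python
import numpy as np

def predictivity(source, target):
    """In-sample median r^2 of a least-squares mapping from source features
    to target neurons (illustration only; the paper uses held-out data)."""
    w, *_ = np.linalg.lstsq(source, target, rcond=None)
    pred = source @ w
    r2 = [np.corrcoef(pred[:, i], target[:, i])[0, 1] ** 2
          for i in range(target.shape[1])]
    return float(np.median(r2))

# Synthetic stand-ins for two primates and a model driven by a shared latent.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 5))
animal_a = latent @ rng.normal(size=(5, 10)) + 0.1 * rng.normal(size=(200, 10))
animal_b = latent @ rng.normal(size=(5, 10)) + 0.1 * rng.normal(size=(200, 10))
model = latent @ rng.normal(size=(5, 12))

# A ratio near 1 means the model predicts animal B about as well as animal
# A's neurons do -- the "interanimal consistency" ceiling described above.
ratio = predictivity(model, animal_b) / predictivity(animal_a, animal_b)
```

In this construction the model and the source animal carry the same latent signal, so the ratio sits near one, mirroring the empirical observation in Figure S7b.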
As shown in Figure S7b, across various hyperparameters of the linear mapping, we observe a ratio close to one between the neural predictivity (of the target primate neurons) of the feedforward BaseNet and that of the source primate mapped to the same target primate. Therefore, as it stands, temporally varying linear mappings to neural responses collected from an animal during rapid serial visual presentation (RSVP) may not separate feedforward models from recurrent models any better than one animal separates from another, though more investigation is needed to ensure tight estimates of the interanimal consistency measure we have introduced here with neural data recorded from more primates. Nonetheless, this observation further motivates our earlier result of additionally turning to temporally varying behavioral metrics (such as the OST consistency) in order to separate these model classes beyond what is currently achievable by neural response predictions.
2.4 ConvRNNs Mediate a Trade-Off between Task Performance and Network Size
Given our finding that specific forms of task-optimized recurrence are more consistent with IT's OST than iterated feedforward transformations (with unshared weights), we asked whether it was possible to approximate the effect of recurrence with a feedforward model. This approximation would allow us to better describe the additional “action” that recurrence is providing in its improved OST consistency. In fact, one difference between this metric and the explained variance metric evaluated on neural responses in the prior section is that the latter uses a linear transform from model features to neural responses, whereas the former operates directly on the original model features. Therefore, a related question is whether the (now standard) use of a linear transform for mapping from model units to neural responses can potentially mask the behavioral improvement that suitable recurrent processing has over deep feedforward models in their original feature space.
To address these questions, we trained a separate linear mapping (PLS regression) from each model layer to the corresponding IT response at the given time point, on a set of images distinct from those on which the OST consistency metric is evaluated (see section A.8.2 for more details). The outputs of this linear mapping were then used in place of the original model features for both the uniform and graded mappings, constituting PLS Uniform and PLS Graded, respectively. Overall, as depicted in Figure S3, we found that models with less temporal variation in their source features (namely, those under a uniform mapping with fewer IT-preferred layers than the total number of time bins) had significantly improved OST consistency with their linearly transformed features under PLS regression (Wilcoxon test, ; mean OST difference 0.0458 and s.e.m. 0.00399). On the other hand, the linearly transformed intermediate feedforward models were not significantly different from task-optimized ConvRNNs that achieved high OST consistency,8 suggesting that the action of suitable task-optimized recurrence approximates that of a shallower feedforward model with linearly induced ground-truth neural dynamics.
3 Discussion
The overall goal of this study is to determine what role recurrent circuits may have in the execution of core object recognition behavior in the ventral stream. By broadening the method of goal-driven modeling from solving tasks with feedforward CNNs to ConvRNNs that include layer-local recurrence and feedback connections, we first demonstrate that appropriate choices of these recurrent circuits that incorporate specific mechanisms of direct passthrough and gating lead to matching the task performance of much deeper feedforward CNNs with fewer units and parameters, even when minimally unrolled. This observation suggests that the recurrent circuit motif plays an important role even during the initial time points of visual processing. Moreover, unlike very deep feedforward CNNs, the mapping from the early, intermediate, and higher layers of these ConvRNNs to corresponding cortical areas is neuroanatomically consistent and reproduces prior quantitative properties of the ventral stream. In fact, ConvRNNs with high task performance but small network size (as measured by number of neurons rather than synapses) are most consistent with the temporal evolution of primate IT object identity solutions. We further find that these task-optimized ConvRNNs can reliably produce quantitatively accurate dynamic neural response trajectories at temporal resolutions of tens of milliseconds throughout the ventral visual hierarchy.
Taken together, our results suggest that recurrence in the ventral stream extends feedforward computations by mediating a trade-off between task performance and neuron count during core object recognition. In this view, the computer vision community's solution of stacking more feedforward layers to solve challenging visual recognition problems approximates what the primate visual system implements more compactly by applying additional nonlinear temporal transformations to the initial feedforward IT response. This work therefore provides a quantitative prescription for the next generation of dynamic ventral stream models, addressing the call to action in a recent study (Kar et al., 2019) for a change in architecture away from feedforward models.
Many hypotheses about the role of recurrence in vision have been put forward, particularly in regard to overcoming certain challenging image properties (Spoerer et al., 2017; Michaelis et al., 2018; Rajaei et al., 2019; Linsley et al., 2018; Gilbert & Wu, 2013; Lindsay, 2015; McIntosh et al., 2018; Li et al., 2018; Kar et al., 2019; Rao & Ballard, 1999; Lotter, Kreiman, & Cox, 2017; Issa et al., 2018). We believe this is the first work to address the role of recurrence at scale by connecting novel task-optimized recurrent models to temporal metrics defined on high-throughput neural and behavioral data, to provide evidence for recurrent connections extending feedforward computations. Moreover, these metrics are well defined for feedforward models (unlike prior work; Kubilius et al., 2019) and therefore meaningfully demonstrate a separation between the two model classes.
Though our results help to clarify the role of recurrence during core object recognition behavior, many major questions remain. Our work addresses why the visual system may leverage recurrence to subserve visually challenging behaviors, replacing a physically implausible architecture (deep feedforward CNNs) with one that is ubiquitously consistent with anatomical observations (ConvRNNs). However, our work does not address gaps in understanding either the loss function or the learning rule of the neural network. Specifically, we intentionally implant layer-local recurrence and long-range feedback connections into feedforward networks that have been useful for supervised learning via backpropagation on ImageNet. A natural next step would be to connect these ConvRNNs with unsupervised objectives, as has been done for feedforward models of the ventral stream in concurrent work (Zhuang et al., 2021). The question of biologically plausible learning targets is similarly linked to biologically plausible mechanisms for learning such objective functions. Recurrence could play a separate role in implementing the propagation of error-driven learning, obviating the need for some of the issues with backpropagation (such as weight transport), as has been recently demonstrated at scale (Akrout, Wilson, Humphreys, Lillicrap, & Tweed, 2019; Kunin et al., 2020). Therefore, building ConvRNNs with unsupervised objective functions optimized with biologically plausible learning rules would be essential toward a more complete goal-driven theory of visual cortex.
High-throughput experimental data will also be critical to further separate hypotheses about recurrence. While we see evidence of recurrence as mediating a trade-off between network size and task performance for core object recognition, it could be that recurrence plays a more task-specific role under more temporally dynamic behaviors. It would be an interesting direction not only to optimize ConvRNNs on more temporally dynamic visual tasks than ImageNet but also to compare them to neural and behavioral data collected from such stimuli, potentially over timescales longer than 260 ms. While the RGC motif of gating and direct passthrough gave the highest task performance among ConvRNN circuits, the circuits that maintain a trade-off between number of units and task performance (RGC Median, GRU, and UGRNN) had the best match to the current set of neural and behavioral metrics, even if some of them do not employ passthroughs. However, it could be the case that with the same metrics we develop here but used in concert with such stimuli over potentially longer timescales, we can better differentiate these three ConvRNN circuits. Therefore, such models and experimental data would synergistically provide great insight into how rich visual behaviors proceed, while also inspiring better computer vision algorithms.
Notes
1. Mean OST difference 0.0120 and standard error of the mean (s.e.m.) 0.0045, Wilcoxon test on uniform versus graded mapping OST consistencies across feedforward models, ; see also Figure S3.
2. Paired t-test with Bonferroni correction: shallow Time Decay versus BaseNet in blue, mean OST difference 0.101 and s.e.m. 0.0313, ; intermediate Time Decay versus BaseNet in purple, mean OST difference 0.0148 and s.e.m. 0.00857, .
3. Paired t-test with Bonferroni correction: shallow RGC versus BaseNet in blue, mean OST difference 0.153 and s.e.m. 0.0252, ; intermediate UGRNN versus ResNet-152, mean OST difference 0.0652 and s.e.m. 0.00863, ; intermediate GRU versus ResNet-152, mean OST difference 0.0559 and s.e.m. 0.00725, ; RGC Median versus ResNet-152, mean OST difference 0.0218 and s.e.m. 0.00637, .
4. Paired t-test with Bonferroni correction, mean OST difference 0.0195, and s.e.m. 0.00432, .
5. Paired t-test with Bonferroni correction, mean OST difference 0.0279, and s.e.m. 0.00634, .
6. Wilcoxon test with Bonferroni correction for each ConvRNN versus Time Decay, except for the SimpleRNN for pIT.
7. Wilcoxon test with Bonferroni correction between each of these ConvRNNs versus the SimpleRNN on late phase dynamics, per visual area.
8. Paired t-test with Bonferroni correction: RGC Median versus PLS Uniform BaseNet, mean OST difference and s.e.m. 0.0061, ; RGC Median with Threshold Decoder versus PLS Uniform ResNet-18, mean OST difference 0.00697 and s.e.m. 0.0085, ; RGC Median with Max Confidence Decoder versus PLS Uniform ResNet-34, mean OST difference 0.0001 and s.e.m. 0.0079, .
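The paired comparisons reported throughout these notes can be sketched as follows. This is a hedged illustration, not the authors' analysis code: the data are synthetic, and the number of comparisons used for the correction is an assumption for the example.

```python
# Hedged sketch: paired t-test on per-image OST consistencies of two models,
# with a Bonferroni-corrected significance threshold. Data are synthetic and
# the comparison count is an illustrative assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_images, n_comparisons = 100, 4                      # e.g., four model pairs tested
model_a = rng.normal(0.60, 0.05, n_images)            # OST consistency, model A
model_b = model_a + rng.normal(0.02, 0.03, n_images)  # model B, slightly higher

t_stat, p_value = stats.ttest_rel(model_b, model_a)   # paired t-test
alpha_corrected = 0.05 / n_comparisons                # Bonferroni correction
mean_diff = np.mean(model_b - model_a)                # "mean OST difference"
sem_diff = stats.sem(model_b - model_a)               # "s.e.m."
print(mean_diff, sem_diff, p_value < alpha_corrected)
```

The Wilcoxon variants in the notes would substitute `stats.wilcoxon(model_b, model_a)` for the paired t-test, with the same corrected threshold.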
Acknowledgments
We thank Tyler Bonnen, Eshed Margalit, and the anonymous reviewers for comments on this article. We thank the Google TensorFlow Research Cloud team for generously providing TPU hardware resources for this project. D.L.K.Y. is supported by the James S. McDonnell Foundation (Understanding Human Cognition Award, grant 220020469), the Simons Foundation (Collaboration on the Global Brain, grant 543061), the Sloan Foundation (fellowship FG-2018-10963), the National Science Foundation (RI 1703161 and CAREER Award 1844724), the DARPA Machine Common Sense program, and hardware donation from the NVIDIA Corporation. This work is also supported in part by Simons Foundation grant SCGB-542965 (J.J.D. and D.L.K.Y.). This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement 70549 (J.K.). J.S. is supported by the Mexican National Council of Science and Technology (CONACYT).
Author Contributions
A.N. and D.L.K.Y. designed the experiments. A.N., J.S., and D.B. conducted the experiments, and A.N. analyzed the data. K.K. contributed neural data, and J.K. contributed to initial code development. K.K. and J.J.D. provided technical advice on neural predictivity metrics. D.S. and S.G. provided technical advice on recurrent neural network training. A.N. and D.L.K.Y. interpreted the data and wrote the article.
Competing Interest Declaration
The authors declare no competing interests.