Abstract
As animals adapt to their environments, their brains are tasked with processing stimuli in different sensory contexts. Whether these computations are context dependent or independent, they are all implemented in the same neural tissue. A crucial question is what neural architectures can respond flexibly to a range of stimulus conditions and switch between them. This is a particular case of a flexible architecture that permits multiple related computations within a single circuit.
Here, we address this question in the specific case of the visual system circuitry, focusing on context integration, defined as the integration of feedforward and surround information across visual space. We show that a biologically inspired microcircuit with multiple inhibitory cell types can switch between visual processing of the static context and the moving context. In our model, the VIP population acts as the switch and modulates the visual circuit through a disinhibitory motif. Moreover, the VIP population is efficient, requiring only a relatively small number of neurons to switch contexts. This circuit eliminates noise in videos by using appropriate lateral connections for contextual spatiotemporal surround modulation, having superior denoising performance compared to circuits where only one context is learned. Our findings shed light on a minimally complex architecture that is capable of switching between two naturalistic contexts using few switching units.
1 Introduction
Our brains are unique in their ability to adapt to the context in which stimuli appear. Animals face the problem of processing visual stimuli rapidly and efficiently while adapting to different contexts every time they transition to a new environment (e.g., from jungle to savanna, from the shores of a river to underwater). A classic example of adaptation to different contexts is discussed in Barlow's “efficient coding hypothesis” (Barlow, 1961), which proposes that sensory systems encode maximal information about environments with different statistics (Olshausen & Field, 1996a, 1996b). In this and other cases, when context changes, neural circuits switch from previous strategies of feature representation to new ones that are better adapted to the statistical properties of the new context. How the neuronal circuitry of the brain is organized to account for the multitude of contexts animals may encounter has not been established (Yang, Cole, & Rajan, 2019). In particular, when do we need separate circuits for different contexts, and when can single circuits be modulated to switch among multiple contexts (Gozzi et al., 2010; Koganezawa, Kimura, & Yamamoto, 2016; Zhou et al., 2017; Cardin, 2019; Mante, Sussillo, Shenoy, & Newsome, 2013; Cohen, Dunbar, & McClelland, 1990; Yang et al., 2019)? Our aim is to identify a biologically constrained network that is capable of switching contexts and to infer the building blocks required for such switching. In constructing such a network, we will only discuss and include the structural and functional detail needed for the switching of contexts.
We focus on a concrete setting in which rapid context switching is apparent. This is mouse V1, which responds differently to inputs when the animal is running (moving condition) compared to when it is stationary (static condition) (Niell & Stryker, 2010; Fu et al., 2014). When the animal transitions from standing still to running, visually evoked firing rates significantly increase. For example, in one experimental setting, the firing rate of neurons in layers II/III of area V1 more than doubled (Niell & Stryker, 2010), while in layer V of V1, noise correlations between pairs of neurons were substantially reduced (Dadarlat & Stryker, 2017).
(a) Schematic of circuit involving VIP, SST, PV, and PYR groups of neurons. When VIP are silent, PYR are self-excitatory, while SST and PV inhibit PYR. When VIP are active, they inhibit the PYR while also creating a disinhibitory motif given by VIP-SST-PYR. The potential connection from PYR to VIP explored in this article is marked with a dotted arrow. (b) Processing of two input types (e.g., images, videos) happens using two separate networks for each type of input, each having N units, with 2N² weights in total to learn. (c) Processing of two input types can be done with one circuit: a switching circuit with N units adapted to one of the contexts and n switching units that turn on when the other context is presented. We may want n ≪ N, with N² + 2nN connections to learn (assuming switching units are not interconnected). When the number of switching units required in a switching circuit is small, fewer connections need to be learned; more specifically, if n < N/2. This generalizes well to a range of circuits, including in the case of sparse connectivities, as often presented throughout the article.
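The connection-counting argument in panel c can be made concrete in a short sketch. Here N (units per context-adapted network) and n (switching units) are placeholder symbols for the illustration, not notation taken from the figure:

```python
def separate_circuits_connections(N: int) -> int:
    """Two fully recurrent networks of N units each: 2 * N^2 weights."""
    return 2 * N * N

def switching_circuit_connections(N: int, n: int) -> int:
    """One N-unit network plus n switching units that project to and
    receive from all N units (switching units not interconnected):
    N^2 + 2 * n * N weights."""
    return N * N + 2 * n * N

# The switching circuit needs fewer connections exactly when
# N^2 + 2nN < 2N^2, i.e., when n < N/2.
N, n = 100, 10
assert switching_circuit_connections(N, n) < separate_circuits_connections(N)
```

For example, with N = 100 and n = 10, the switching circuit has 12,000 connections against 20,000 for two separate circuits.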
We study this circuit using a model in which the contextual information is stored in the lateral connections between neurons (Iyer, Hu, & Mihalas, 2020). Each neuron receives information about the visual scene from feedforward connections (which can be arbitrary in this model) and complements this with surround information provided by nearby neurons. The connections are dependent on the statistics of the environment; more precisely, they depend on the frequency of co-occurrence in the environment of the features which the neurons represent. These connections are most useful if the information from the feedforward connections is corrupted (e.g., by occlusions).
Importantly, the contextual information via lateral connections comes not only from the spatial surround but also from the past. Synaptic delays introduce a constraint on the available information each neuron gets. During the static condition, past surround information matches present information, and thus there is no temporal variability of the context. During movement, this no longer holds; neighboring features now also vary temporally, which changes the co-occurrence frequency; hence, the statistics of the moving context are different. We aim to find connection strengths from the switching VIP units that, during movement, modulate firing rates and neuronal correlation structure to adapt and enhance the encoding of visual stimuli when the moving context is turned on. Although throughout the article, we focus on the visual circuit and the switching role of the VIP neural population, these results can be generalized to circuits processing multiple contexts, and thus their applicability has broader scope. In section 3, we list several other biological examples of circuits processing multiple contexts.
Understanding switching circuits may also further aid efforts to design both flexible and efficient artificial neural architectures. This research area has benefited from bio-inspired architectures and algorithms like elastic weight consolidation (Kirkpatrick, Pascanu, & Hadsell, 2017), intelligent synapses (Zenke, Poole, & Ganguli, 2017), iterative pruning (Mallya & Lazebnik, 2018), leveraging prior knowledge through lateral connections (Rusu et al., 2016), task-based hard attention mechanisms (Serra, Suris, Miron, & Karatzoglou, 2018), and block-modular architecture (Terekhov, Montone, & O'Regan, 2015), for example, to enable sequential learning by eliminating “catastrophic forgetting” (where previously acquired memories are overwritten once new tasks are learned). We hypothesize that a few switching units akin to VIP can be incorporated as part of the hidden layers to enable context modulation. This makes such a switching circuit architecture (see Figure 1c) more efficient than employing separate circuits for the different contexts (see Figure 1b) because switching circuits have fewer connections to learn.1 We hope such a circuit architecture will inspire next-generation flexible artificial nets that can process stimuli in changing contexts.
1.1 Article Outline
In section 2.1, we first detail a model introduced in Iyer et al. (2020) that describes neuronal connections and firing rates of a circuit adapted to static visual scenes (images). We next extend this model to the case of circuits adapted to moving visual scenes (videos). These circuits are attuned to the statistical regularities of movement and take into account constraints of biological networks, like synaptic delay. We are able to map these two circuit models to the V1 circuit, consisting of PYR, SST, and PV neuron populations. We thus obtain two different networks with full cell-type specifications achieving optimal context integration for static and moving contexts, respectively. In section 2.2, we detail the data sets and procedures used to quantify connectivities and firing rates in these two circuits. In section 2.3, we go on to describe a circuit that can switch between neuronal activity in the static circuit and neuronal activity in the moving circuit by virtue of adding a single population, the VIP. We find that VIP projections to SST and PYR are not enough to shift activity during movement, but that we need a feedback connection from the PYR to the VIP (section 2.4). The resulting circuit is the minimally complex circuit resembling V1 that we have found to switch contexts. In section 2.5, we describe how this circuit switches using only a small number of VIP units. We follow up on these results in section 2.6, where we use this switching circuit to obtain better reconstructions of videos in conditions of high noise. Finally, we evaluate the new switching circuit architecture with data from V1 that confirm some of the model's predictions (see section 2.7).
2 Results
2.1 Theoretical Models of Processing Visual Information in Static and Moving Contexts
We first introduce two models of visual processing in V1 in the static and moving contexts, where the circuits implementing the computations perform optimal inference and are adapted to the statistical regularities of the contexts through the lateral connections between neurons.
2.1.1 Model of Visual Processing in the Static Context
(a) Neurons receive stimulus input from a patch in space at position x, their classical receptive field (CRF), but also from surrounding patches in space (e.g., the patch at position y) through interactions with other neurons. These neurons are connected by weights that depend on the statistical regularities of natural scenes. (b) When features at positions x and y occur together often in natural scenes, the weight between the corresponding neurons is strong; when the features occur together by chance, without significant correlation, the weight is close to 0. (c) Spatiotemporal surround for motion processing. Due to synaptic delay Δt, context integration uses surrounding patches that are also Δt ms in the past to assess the features in the present frame.
From a computational perspective, the organism cannot measure the feature probabilities and joint probabilities in equations 2.1 and 2.3 directly, but these can be estimated, given our defined neural code, as convolutions between image and feature and as cross-correlations between classical receptive field firing rates. By mapping these probabilistic statements on feature occurrence to neurobiological quantities that capture firing rates and weights, we have obtained a circuit that does approximate context integration, extracting information through priors embedded in the neural connectivities. While the start of the model is Bayes optimal via equations 4.12 and 4.14, a set of approximations is needed to keep the circuit simple.
There are multiple possible mappings from the probabilistic framework to the neurobiological circuit (Iyer et al., 2020), but the current correspondence is straightforward and yields successful predictions from data, such as like-to-like connectivity, as detailed below. When a pair of features is frequently co-occurring, weights between neurons preferential for these features are strong and positive (see Figure 2b). In contrast, when two features are unlikely to co-occur in the same image, the connectivity is strong and negative. Overall occurrence probabilities of individual features normalize the co-occurrence probabilities so that the weights express the co-occurrence of features over and above chance. Co-occurrence probabilities of features are then averaged over many natural scenes so that the corresponding weights capture the statistical regularities of natural environments.
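As an illustration of weights that express the co-occurrence of features over and above chance, the sketch below computes pointwise mutual information from a co-occurrence count matrix. This is one standard quantity with exactly the sign behavior described above; the model's actual weight formula (equation 2.3) may differ in detail:

```python
import numpy as np

def pmi_weights(co_occurrence: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Weights from a feature co-occurrence count matrix.

    W[i, j] > 0 when features i and j co-occur more often than chance,
    W[i, j] < 0 when they co-occur less often than chance,
    W[i, j] ~ 0 when they are statistically independent.
    """
    joint = co_occurrence / co_occurrence.sum()   # P(f_i, f_j)
    marginal = joint.sum(axis=1)                  # P(f_i)
    expected = np.outer(marginal, marginal)       # chance co-occurrence
    return np.log((joint + eps) / (expected + eps))

# Bar-world intuition: horizontal features co-occur with each other and
# never with vertical features -> positive within-type, negative across.
counts = np.array([[50.0, 0.0],
                   [0.0, 50.0]])
W = pmi_weights(counts)
assert W[0, 0] > 0 and W[0, 1] < 0
```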
2.1.2 Model of Visual Processing in the Moving Context
We next show how the framework above can be applied to the moving context. While equations 2.2 and 2.3 show how connectivity and firing rates can be optimized to account for spatially co-occurring features—features that appear at the same moment in time but in different locations of the visual field—we now extend these equations to account for temporally co-occurring features—features that occur at nearby moments in time at different locations of the visual field.
We have introduced a model of visual processing where feedforward and lateral connections between neurons serve different roles. The lateral connections between neurons perform unsupervised learning of the probability of co-occurrence of visual features that the neurons represent. For the purpose of this study, the feedforward connections can be arbitrary, and the microcircuit described here can be at any level of processing. This separation of the roles for the feedforward and lateral connections allows for an easy implementation of both supervised and unsupervised learning in deep networks (Hu & Mihalas, 2018).
Here, we show how this model can integrate information from the surround using these within-layer connectivities in both static and moving states. However, integrating these two contexts results in two distinct circuits being needed to perform visual processing under the different conditions (static versus moving). The model optimally integrates context in the Bayesian sense, meaning it uses priors on the co-occurrence of features in natural scenes when integrating information from the surround. These priors reflect the known statistical regularities of the environment (Simoncelli, 2003; Barlow, 1961; Marr, 1982) and weigh the surround contributions appropriately. We are then able to map this model formalism to the circuit architecture in V1 described above while specifying steady-state network weights and activations, as well as cell type functionality. This model emphasizes robust coding and applies best in conditions of high noise, where parts of the visual scene are missing due to occlusions or are corrupted, and thus where context information may play a critical role. We next describe our model of visual processing in detail.
2.2 Modeling Firing Rates and Weights in Networks Responding to Images and Videos
We next describe two separate circuits capable of performing optimal context integration in the moving and static contexts, respectively. We characterize these two circuits through their static and moving connectivities, computed by applying formulas 2.3 and 2.5 to the images and videos in the training data sets. Once the corresponding connectivities are specified, we can further characterize the static and moving circuits by their neural activations. In the following, we elaborate, section by section, on the algorithm we implemented to compute the static and the moving weights.
2.2.1 Data Set and Feature Preparation
(a) Sample images from the BSDS data set. Images of animals, human faces, landscapes, buildings, and so on are used. (b) Sliding window on images from the BSDS data set so that the appearance of movement is achieved. The red arrow shows how much the window has moved from frame 1 to frame 4. In general, movement of the sliding window is random and in any direction, but we focus on horizontal movement in the case of natural videos. (c) Images of horizontal and vertical bars (above) and how the bars move in videos (below). (d) Eighteen filters: ON, OFF, and ON/OFF with two gaussian subfields, with different subfields dominating, at different intensities and orientations. Color bars show the different intensities of pixels. (e) Example of a spatiotemporal filter comprising two frames. Spatiotemporal filters are added to the 18 original filters to make up a total of 34 filters. The filter shown here over two frames captures a bar moving to the left and is obtained by translating the original filter by three pixels. Color bars show the different intensities of pixels. (f) Two filters for the simplistic “bar world” comprising a horizontal and a vertical bar, respectively.
We generated a dictionary of features (filters) based on a parameterized set of models derived from recordings in V1 (Durand et al., 2016). This contains 18 filters with gaussian subfields (see Figure 3d) at different relative intensities and orientations. We added filters containing a temporal dimension—spatiotemporal filters—to obtain a set of 34 filters. Our spatiotemporal filters consist of two frames (see Figure 3e) and represent a temporal shift by several pixels in the horizontal direction, corresponding to the direction of movement and amount of displacement of the sliding window in the videos described above.
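A filter with two gaussian subfields of different intensity, as described above, might be constructed along the following lines. All parameter values and the function name are illustrative and are not those fit to the Durand et al. (2016) recordings:

```python
import numpy as np

def gaussian_subfield_filter(size=15, sep=3.0, sigma=2.0,
                             theta=0.0, on_gain=1.0, off_gain=0.7):
    """ON/OFF filter: two gaussian subfields with different gains,
    offset along an axis at orientation theta (radians)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    # Rotate coordinates by theta to orient the filter.
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    yr = -xx * np.sin(theta) + yy * np.cos(theta)
    g = lambda cx: np.exp(-((xr - cx) ** 2 + yr ** 2) / (2 * sigma ** 2))
    # ON subfield on one side, weaker OFF subfield on the other.
    f = on_gain * g(-sep / 2) - off_gain * g(sep / 2)
    return f / np.abs(f).sum()   # normalize total absolute weight

filt = gaussian_subfield_filter()
assert filt.shape == (15, 15)
assert filt.max() > 0 > filt.min()   # both ON and OFF subfields present
```

Varying `theta`, `sep`, and the two gains produces a family of oriented filters at different relative intensities; shifting such a filter by a few pixels across two frames yields a spatiotemporal filter as in Figure 3e.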
To more easily illustrate and interpret our model, we first tested our framework on a different, synthetic context. We analyzed a simplified world of horizontal and vertical bars moving up and down as well as left and right (see Figure 3c). This simple data set has only two features, horizontal bars and vertical bars (see Figure 3f), but movement can be in any of the four orthogonal directions.
2.2.2 Computing the Static and Moving Weights
(a) To obtain the weight matrix, we first take the convolution of video frames with features from the feature basis. We then consider the convolution of these convolved image frames to detect feature co-occurrence. (b) Schematic of how weights are represented. Normalized convolutions between patches separated by the same spatial and temporal distances are averaged and stored in the corresponding entry of the weight matrix. (c) Top: Static weights for the data set of images of bars. Bottom: Moving weights for the data set of videos of bars. (d) Static weights (above) and moving weights (below) for the data set of natural images/videos during horizontal motion only. (e) Sparse versions of slices from the static and moving weights for the data sets of natural images/videos during horizontal motion. Weights between neurons whose receptive fields are not at certain preselected, sufficiently far apart locations in visual space were discarded to satisfy the constraint that patches are independent. (f) The full (nonsparse) static and moving weight tensors, ordered first by spatial position, then by filter.
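The procedure in panel a can be sketched as follows for feature maps along one spatial dimension: responses are correlated at a spatial offset dx and temporal offset dt, and for a rightward-moving stimulus the weight at the offset matching the motion dominates. The normalization by individual feature probabilities described in the text is omitted for brevity, and all names are ours:

```python
import numpy as np

def co_occurrence_weight(map_i, map_j, dx=0, dt=0):
    """Average product of feature-i responses at (x, t) with feature-j
    responses at (x - dx, t - dt).

    map_i, map_j: arrays of shape (T, X) of feature-map responses
    (frames already convolved with the corresponding filters).
    Assumes dx >= 0 and dt >= 0 for brevity.
    """
    T, X = map_i.shape
    present = map_i[dt:, dx:]         # feature i at (x, t)
    past = map_j[:T - dt, :X - dx]    # feature j at (x - dx, t - dt)
    return (present * past).mean()

# A stimulus sliding rightward by one pixel per frame: the moving weight
# at the offset matching the motion (dx = 1, dt = 1) exceeds the weight
# at zero spatial offset with the same temporal delay.
T, X = 20, 30
base = np.zeros(X)
base[5] = 1.0
maps = np.stack([np.roll(base, t) for t in range(T)])
w_match = co_occurrence_weight(maps, maps, dx=1, dt=1)
w_no_shift = co_occurrence_weight(maps, maps, dx=0, dt=1)
assert w_match > w_no_shift
```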
2.2.3 Simplifications to Weights
We make three simplifications to reduce the number of parameters in this tensor (see section 4.2): (1) we assume translational invariance, so that only the relative position of two filters is relevant; (2) the model is designed to compute connections to neurons that receive independent observations; thus, we only consider connections between neurons whose receptive fields are sufficiently far apart (i.e., at least half a receptive field apart); (3) as statistical dependencies in natural images decay with distance, we limit the spatial extent of connectivity to three times the size of the classical receptive field. Figures 4c and 4d show several 2D slices through this tensor, corresponding to a specific cell source and target, as well as the full static and moving weights (see Figure 4f) ordered by spatial position and feature type (see also Figure S1a). Figure 4c provides some intuition as to what these weights represent and how they are structured: in the data set of bars, a horizontal feature frequently occurs, or is absent, together with other horizontal features at neighboring locations, which leads the corresponding weights to have positive values. Conversely, a horizontal feature always occurs when the vertical feature is absent, and vice versa, leading to negative weights (see Figure 4c).
2.2.4 Characterizing the Moving Weights in the Case of Two Different Video Statistics
In the generation of the video data set, we use a sliding window to enforce controlled and comparable statistics between the moving and static contexts. When the sliding window is free to move in all directions, the moving weights tend to be weaker in absolute value, which holds both for the simple data set of bars (see Figure 4c) and for the weights generated from the data set of natural images and videos (see Figures S1a to S1b). This effect is due to the weaker statistical dependence of features separated by the synaptic time delay. Feature co-occurrence, and thus connectivity, is affected by distortions during movement, like changes in the orientation of objects or the appearance or disappearance of objects in the visual scene. Moving weights in this case are approximately smoothed-out versions of the static weights (see Figures S1a to S1b). In these conditions, as the information from the surround is less reliable, the feedforward input plays a more important role during movement.
In the case when the sliding window moves a fixed number of pixels horizontally per synaptic time step, past surround features and present features actually coincide, so that their probability of co-occurrence is maximized. This means that for horizontal movement, the moving weight peaks at that same number of pixels from the center for any feature and is strong (see Figures 4d to 4e). Results for natural videos below are for horizontal movement, although the same general conclusions hold when movement is allowed in any direction (see Figure S2).
2.3 Implementing a Switching Circuit
Having just defined the optimal connectivities for the static and moving contexts, we next consider whether a single circuit involving the cell types described above (VIP, PYR, SST, and PV) can respond optimally in these two contexts and switch between them. We additionally seek the computational principles behind the minimally complex circuit (i.e., the circuit with the fewest connections) capable of such switching. Specifically, we ask whether a circuit with optimal weights for the static context can switch to produce nearly optimal activities in the moving context, via projections from a set of switching units. In such a circuit, every PYR neuron approximates Bayesian inference, combining classical receptive field information with information from the surround to estimate feature probability.
(a) Two separate circuits for optimal visual processing of static (top) and moving contexts (bottom), respectively. (b) The proposed switching circuit with the VIP population approximates the static circuit when the VIP are silent and the animal is static, and approximates the moving circuit when the VIP are active and the animal is moving. (c) Previous circuit, but with a feedback connection added from the PYR population to the VIP.
2.4 In the Absence of Feedback to VIP Neurons, the Circuit Is Unable to Switch from Static to Moving Conditions
This is a high-dimensional constrained optimization problem with the loss function defined as in equation 2.16, which we solved with the gradient-based Adam optimizer, implemented in PyTorch.2 The weights defined in equations 2.13 and 2.14 are unknown and learned by stochastic gradient descent (SGD), while the remaining weights are fixed. Finding the global minimum of the loss function is difficult, but our main goal is instead to find weights that give a small enough error, and to later test these on a specific task to demonstrate that the optimal moving circuit can be approximated successfully (see section 2.6). We assessed the stability of our optimization by modifying several learning parameters—for example, the learning rate (ranging from 0.001 to 0.1) and the optimization algorithm (SGD, AdaGrad, RMSProp, Adam)—and checking the generalization error on a small number of frames (50) that were not used during training.
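A stripped-down stand-in for this kind of optimization is sketched below: plain gradient descent in NumPy rather than PyTorch's Adam, and a simplified Frobenius-norm loss on the weights rather than the full functional of equation 2.16. The shapes and the rank-n VIP correction are illustrative assumptions, not the paper's actual parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 20, 2                     # circuit units and VIP units (illustrative)
W_stat = rng.normal(size=(N, N))
W_mov = W_stat + rng.normal(size=(N, n)) @ rng.normal(size=(n, N))

# Learn a rank-n VIP correction U @ V to the static weights so that
# W_stat + U @ V approximates W_mov, by gradient descent on the
# Frobenius-norm loss 0.5 * ||W_stat + U @ V - W_mov||_F^2.
U = rng.normal(scale=0.1, size=(N, n))   # projections from the VIP units
V = rng.normal(scale=0.1, size=(n, N))   # projections onto the VIP units
lr = 3e-3
for _ in range(10000):
    resid = W_stat + U @ V - W_mov
    gU, gV = resid @ V.T, U.T @ resid    # gradients of the loss
    U -= lr * gU
    V -= lr * gV

final_err = np.linalg.norm(W_stat + U @ V - W_mov)
assert final_err < 0.2 * np.linalg.norm(W_mov - W_stat)
```

Because the synthetic target here is exactly a rank-n perturbation of the static weights, gradient descent recovers it; the point of sections 2.4 and 2.5 is that whether the real circuit admits such a correction depends on its architecture.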
(a) Goal: Instead of two separate circuits for visual processing of static and moving contexts, the proposed circuit approximates the static circuit when the VIP are silent and the animal is static, and the moving circuit when the VIP are active and the animal is moving. (b) Generalization/validation error found during the optimization to minimize the functional for the data sets of static and moving bars does not converge. (c) Generalization/validation error found during the optimization to minimize the functional for the data sets of natural images and videos converges, but the norm of the loss function decreases only marginally. (d) Circuit as in panel a, but with a feedback connection added from the PYR population to the VIP. (e) Training error (blue) and generalization/validation error (red) found during the optimization to minimize the functional (movement approximation error) for the data sets of natural images and videos converges to yield a relatively small error. (f) The movement approximation error for various circuit architectures: the static circuit with no VIP switching units, the circuit depicted in panel a without PYR-to-VIP feedback, and the circuit depicted in panel d.
In order to understand the origin of this failure, we mathematically analyzed the circuit at hand. Analytically, if the loss is small, then the difference between the activities of the moving and static circuits, which is unique to each image in the data, must be matched by the contribution of the VIP. The left side is a term that varies across a wide range of video frames, while the right side is a constant term incorporating the weights we are solving for. This suggests that the failure of our optimization procedure to yield weights that approximate the moving circuit results from the VIP having no stimulus dependence.
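The argument is easy to check numerically: a stimulus-independent VIP contribution can only add a constant offset per unit, so it can remove at most the mean of a frame-varying activity difference, leaving essentially all of its variance unexplained. A toy demonstration with random activity differences (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
frames, units = 200, 40
# Stimulus-dependent difference between moving- and static-circuit
# activities (varies from frame to frame).
diff = rng.normal(size=(frames, units))

# Best stimulus-INdependent offset (one constant per unit) in the
# least-squares sense: the mean over frames.
offset = diff.mean(axis=0)
residual = diff - offset

# The offset removes only the mean; the frame-to-frame variability
# remains, so the loss cannot be driven to zero without giving the
# VIP some stimulus dependence.
explained = 1 - (residual ** 2).sum() / (diff ** 2).sum()
assert explained < 0.05
```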
We conclude that the circuit switching between static and moving contexts must be more complex than the simple circuit considered here, which has only outgoing projections from the VIP. Below, we introduce recurrent connections that make the VIP input-dependent and overcome the limitations above.
2.5 VIP Circuit with Feedback from the PYR Cells Can Switch Context Integration from Static to Moving Conditions
We remind the reader of the notation: the classical receptive field contributes a feedforward term to the firing rate, and lateral weights connect each population of neurons to the others, where the populations are the PYR, SST, and VIP neurons. As before, the classical receptive field contribution and the non-VIP lateral weights are held fixed. A schematic of the underlying circuit model, along with the corresponding formula for the firing rate of PYR, is shown in Figure 6d.
To solve this high-dimensional optimization problem, we set up, as in section 2.4, an optimization problem with the loss function being the average Frobenius norm defined in equation 2.19. The weights to and from the VIP are unknown and learned by SGD, while the remaining weights are fixed. Importantly, Dale's law is enforced for biological realism: outgoing weights from the excitatory PYR population are constrained to be nonnegative, and those from the inhibitory SST and VIP populations to be nonpositive.
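The sign-constrained learning described above can be sketched as projected SGD: after each gradient step, each population's outgoing weights are clipped back to the sign that Dale's law dictates. The sketch below is a toy least-squares stand-in for the movement approximation error; the dimensions, data, and variable names are illustrative placeholders, not the model's actual tensors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the movement approximation error: make the static-circuit
# activity X, corrected through a VIP pathway, match a target activity Y.
n_samples, n_pyr, n_vip = 200, 34, 5
X = rng.standard_normal((n_samples, n_pyr))   # stand-in static-circuit PYR activity
Y = rng.standard_normal((n_samples, n_pyr))   # stand-in target moving-circuit activity

# Initialize with the correct signs: PYR -> VIP excitatory, VIP -> PYR inhibitory.
W_pyr_to_vip = 0.1 * np.abs(rng.standard_normal((n_pyr, n_vip)))
W_vip_to_pyr = -0.1 * np.abs(rng.standard_normal((n_vip, n_pyr)))

def loss(W_in, W_out):
    """Frobenius-norm loss of the corrected activity against the target."""
    return 0.5 * np.linalg.norm(X + (X @ W_in) @ W_out - Y) ** 2

loss_start = loss(W_pyr_to_vip, W_vip_to_pyr)
lr = 1e-4
for _ in range(1000):
    V = X @ W_pyr_to_vip                   # VIP rates driven by PYR feedback
    err = X + V @ W_vip_to_pyr - Y         # residual of the approximation
    g_out = V.T @ err                      # gradient w.r.t. VIP -> PYR weights
    g_in = X.T @ (err @ W_vip_to_pyr.T)    # gradient w.r.t. PYR -> VIP weights
    W_vip_to_pyr -= lr * g_out
    W_pyr_to_vip -= lr * g_in
    # Dale's law projection: clip each pathway back to its permitted sign.
    np.minimum(W_vip_to_pyr, 0.0, out=W_vip_to_pyr)
    np.maximum(W_pyr_to_vip, 0.0, out=W_pyr_to_vip)

loss_end = loss(W_pyr_to_vip, W_vip_to_pyr)
```

Clipping after each step is the simplest projection onto the Dale's-law constraint set; any projected-gradient variant would serve the same purpose.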
(a) Adding VIP switching units to the circuit processing videos of bars approximates the activity of the optimal circuit for the moving context on this simple data set. No more than 20 VIPs are needed in practice, compared to the 162 PYR and SST cells. (b) Adding VIP switching units to the circuit processing natural videos approximates the activity of the optimal circuit for the moving context on the naturalistic data set. No more than 5 VIPs per unit space are needed in practice, compared to the 34 PYR and SST cells per unit space. The number of VIP units is varied in this optimization. (c) A random subset of activities corresponding to different video frames, filters, and spatial positions for the static, moving, and approximated moving circuits. Red dots: activities of the moving circuit versus activities of the static circuit; blue dots: activities of the moving circuit versus activities of the approximated switching circuit. Activities are computed using weights with 5 VIP units per unit space. The activities of the approximated switching circuit estimate the activities of the moving circuit better than the activities of the static circuit do.
Second, in the distinct case of more complex stimuli like images and videos of natural scenes, the movement approximation error in equation 2.19 was minimized when the number of VIP units was 34 per unit space, matching the number of units in the PYR and SST populations. However, the approximation error was already close to this minimum with only 5 VIP units per unit space, with no significant improvement from adding more units (see Figure 7b). Varying the dimensionality of the spatial components of the tensors we were solving for (see Figure S4) and the synaptic delay for the sparse weights that account for patch independence, we obtained the same qualitative results. Our results also hold for nonsparse weights, as shown in Figure S5a. Fixing the number of VIP units to 5 per unit space, we find that the approximated firing rate of equation 2.18 matches the moving-circuit firing rates far better than do the firing rates of a circuit without VIP units (see Figure 7c). We conclude that for the specific parameters chosen in Figure 7b, the ratio of PYR to switching VIP units is roughly 7 to 1, so that the switching operation requires relatively few units, a fact we return to in the context of the underlying biology below.
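One way to build intuition for why so few VIP units suffice is a low-rank argument: a pathway routed through k VIP units can add at most a rank-k correction to the effective lateral interactions. The sketch below uses a synthetic stand-in for the moving-minus-static difference and shows the best rank-k approximation error collapsing once k reaches the true rank (here 5, chosen to mirror the 5 VIP units per unit space); all matrices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the correction the VIP pathway must supply: a
# rank-5 matrix plus a small full-rank residue.
n, k_true = 34, 5
diff = rng.standard_normal((n, k_true)) @ rng.standard_normal((k_true, n))
diff += 0.05 * rng.standard_normal((n, n))

s = np.linalg.svd(diff, compute_uv=False)
total = np.sqrt((s**2).sum())
# Relative Frobenius error of the best rank-k approximation, for each k
# (Eckart-Young: the tail singular values give that error exactly).
rel_err = [np.sqrt((s[k:]**2).sum()) / total for k in range(n + 1)]
```

The relative error drops steeply up to k = 5 and then plateaus, mirroring how adding VIP units beyond a handful yields no further improvement.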
All in all, we have shown that a switching circuit with relatively few switching VIP units and appropriate feedback connections can achieve visual processing during the static and moving contexts, both for a simple synthetic data set of bars and for a biologically relevant data set of natural images and videos.
2.6 Context-Dependent Visual Processing with Extraclassical Receptive Fields Leads to Denoising
According to our theory (see section 4.1), the moving circuit achieves optimality of visual processing for videos, the static circuit achieves optimality for static images, and we have found appropriate connectivities to and from a population of switching units—VIP—that can approximate either circuit in a model of V1, the switching circuit. We have, however, not yet assessed the performance of these circuits on specific visual processing tasks. We pursue this here for the task of denoising. Specifically, we ask (1) how well extraclassical receptive field contributions from the static or moving circuits (see Figure 5a) can improve reconstructions of noisy images and videos and (2) whether the switching circuit can achieve the same level of performance as the separately optimized moving circuit when processing videos. We focus on reconstructions of video frames and the superior performance of the moving and switching circuits for processing moving contexts, although we also note the comparably high performance of the static circuit, and implicitly that of the switching circuit with VIP silent, for processing static contexts.
To reconstruct a visual scene during movement, our brain uses information from the present but also time-delayed surround information, both of which can be inaccurate or incomplete. We use the learned lateral weights to weigh the past surround information, as these weights encapsulate the cross-correlational structure between features of the past and the present, thereby informing which features are more or less likely. We note that during motion, using the static-context weights to weigh surround information may still be better than using no surround information at all: if movement in the videos is slow enough or displacements are small, features vary smoothly and consecutive frames are highly correlated.
(a) Example of a reconstructed frame for each condition/circuit architecture: no EXC, static EXC, moving EXC, approximated EXC. (b) Average correlation coefficients between reconstructed noisy frames and reconstructed noiseless frames for one video in our data set. Here, reconstruction benefits from surround contextual information. (c) Same as panel b, but for a video in which the general inequality that holds on average breaks down. (d) Average correlation coefficient over all frames and all videos after salt-and-pepper noise was added to the video frames; each pixel is changed to white with probability 0.2 and to black with probability 0.2. The moving and approximated EXC average correlation coefficients are higher than those for static EXC or no EXC (Wilcoxon rank-sum test for all relevant comparisons). Inset: correlation coefficients in time, averaged across videos. (e) Same as panel d for gaussian white noise with 0.5 standard deviation (Wilcoxon rank-sum test for all relevant comparisons). (f) Average correlation coefficient over frames and videos as the noise level is varied. Top: salt-and-pepper noise level is varied; bottom: gaussian white noise standard deviation is varied.
Thus equipped, we ask which circuit architecture gives rise to neural activity best suited for decoding visual scenes in noisy conditions. Figure 8a shows reconstructions of a video frame using these different circuit architectures. We expect the moving-circuit weights to perform best on average, as they are the optimal lateral connections as defined above. However, the exact relationship among the performances of the four architectures depends on the exact correlational structure of the frames of each video. Some videos match our prediction that the moving-circuit performance is maximal (see Figure 8b), while other videos do not (see Figure 8c). Specifically, there are videos where surround modulation is not effective, which appears to be due to the presence of independent features for which the information in the extraclassical receptive field does not aid image reconstruction.
On average across the videos, the moving and approximated EXC circuits yield the best reconstructions (dark and light green bars), displaying the highest cross-correlation coefficients between the noiseless reconstruction (the baseline) and the reconstructed frames (see Figure 8d). Figures 8d and 8e show that this holds true when adding to the original frames either salt-and-pepper noise, varying the proportion of pixels occluded, or gaussian white noise, varying the standard deviation of the noise distribution. The relation is robust to the amount of noise added to the frames (see Figure 8f), whether salt-and-pepper or gaussian. This holds true both when the complete set of 34 spatiotemporal filters is used (see Figure S10a) and when only the set of 18 filters with no temporal component is used (see Figure S10b). As expected, the addition of filters with a temporal component improves the reconstruction performance in all four circuit architectures presented (see Figure S10c). Furthermore, reconstruction performance for images in the static condition is maximized on average when the static-circuit weights are used to weigh the surround (see Figure S9). This shows that the moving circuit is best suited for processing noisy video frames and that the static circuit (or the switching circuit with VIP silent) is best suited for processing images.
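The noise conditions of Figures 8d and 8e can be reproduced schematically as follows; the frame and scoring function are placeholders, with the noise parameters mirroring the stated values (salt-and-pepper probabilities of 0.2 and gaussian noise with 0.5 standard deviation).

```python
import numpy as np

rng = np.random.default_rng(2)

def salt_and_pepper(img, p_white=0.2, p_black=0.2, rng=rng):
    """Set each pixel to 1 with probability p_white and to 0 with p_black."""
    u = rng.random(img.shape)
    out = img.copy()
    out[u < p_white] = 1.0
    out[u > 1.0 - p_black] = 0.0
    return out

def frame_corr(a, b):
    """Pearson correlation between two frames, the Figure 8-style score."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

frame = rng.random((32, 32))                       # hypothetical normalized frame
noisy_sp = salt_and_pepper(frame)                  # salt-and-pepper condition
noisy_gauss = frame + rng.normal(0.0, 0.5, frame.shape)  # gaussian condition
```

In the article's pipeline this correlation is computed between reconstructions from noisy and noiseless frames rather than raw pixels, but the scoring is the same.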
Thus, the switching circuit provides reconstruction performance comparable to a dedicated moving circuit for videos and to a dedicated static circuit for images. In the case of videos, this is because the switching circuit reproduces firing rates close enough to those of the moving circuit to improve reconstruction fidelity. The correlation coefficients between noiseless baseline reconstructions and reconstructions from the moving and switching circuits, respectively, show almost perfect overlap (light and dark green curves in Figures S10a and S10b). In sum, we conclude that the extraclassical receptive field contribution in the moving circuit and in the approximated switching circuit generates neural activity that can be decoded to produce more accurate frame reconstructions of videos. To produce the most accurate image reconstructions, the VIP neurons in the switching circuit must be silent so that the network implements the static circuit.
2.7 Experimental Evidence of VIP Role in Movement-Related Visual Coding
When we examine the weights to and from the VIP we have inferred in our model, we find that there are a few equally correct solutions for the optimization problem, equation 2.19, due to the multiple local minima of the movement approximation error. One of the possible solutions we found matched experimental data showing that in various layers of V1, the VIP-to-SST connection is strong compared to other connections, specifically the VIP-to-PYR connection (see Figure S6). Interestingly, this property arose only when including weights from SST to VIP in the circuit, consistent with experiments (Pfeffer et al., 2013, found the connection probability/strength from SST to VIP to be strong). Including this connection in our circuit and rewriting the circuit equations as in equation 4.24, we obtain a new set of connectivity patterns and activities so that we can now compare predictions of our model switching circuit to the extensive empirical evidence from the literature.
Importantly, we have not exhaustively explored the set of all possible solutions of the optimization problem, equation 2.19, and the optimization may admit additional constraints on the switching circuit while still yielding solutions. Acknowledging this, we now study both the connectivity and the activity of the switching circuit with an additional SST-to-VIP connection.
2.7.1 Connectivity
We find that our model produces connectivity patterns that are largely consistent with empirical findings, as we describe next. Connection weights in the model can be interpreted as corresponding to a combination of connection probabilities and connection strengths in the data, as these have been shown to correlate well (Cossell et al., 2015). Regarding the connection from the VIP to SST, experimental data on connectivity in the visual cortex from Pfeffer et al. (2013) has shown that in layer 4 of V1, the average connection probability from VIP to SST is double the connection probability from VIP to PYR (0.625 compared to 0.351), while in layer 5, VIP to SST is five times more probable (0.625 compared to 0.125) (Pfeffer et al., 2013). A recent study by Campagnola et al. (2021) has confirmed the relative paucity of VIP-to-PYR connections as compared to VIP-to-SST connections throughout all layers, for example finding 3 out of 52 VIP-to-PYR versus 5 out of 33 VIP to SST interarea L2/3 connections (Campagnola et al., 2021). VIP-to-SST connections are also stronger than VIP-to-PYR throughout all the layers: 0.32 compared to 0.28 as found by Jiang et al. (2015) and 0.3 compared to 0.21 as found by Campagnola et al. (2021).
(a–c) Analysis of model connectivities. (a) Histogram of the absolute value of connectivities for one set of VIP output weights, showing a mean of 0.31. (b) Histogram of the absolute value of connectivities for a second set of VIP output weights, showing a mean of 0.4. (c) Average connectivity per filter, corresponding to the postsynaptic cell type, for the two sets of weights (blue and orange). Filters for the postsynaptic units with the strongest connectivities are displayed to show which units are strongly inhibited during movement. (d–f) Data analysis of VIP population activity in calcium imaging data. (d) Dimensionality ratio (participation ratio measure) during periods of spontaneous activity between movement and static conditions across Cre lines. (e) Histogram of the modulation of dimensionality (statistics relative to the blue bar in panel d). (f) Activity (dF/F signal) ratio during periods of natural image viewing between movement and static conditions across Cre lines.
We next inquire whether the synapses encode the contextual statistics by probing like-to-like connectivity both between PYR neurons and between the VIP and PYR populations (see section 4.9). We find that while there is like-to-like connectivity between PYR neurons, as found by Iyer et al. (2020), this effect is largely absent between the VIP and PYR. To further examine the pattern of connectivity from the VIP, we correlate the VIP output weights to the optimal lateral weights, because these latter weights reflect the statistical regularities of the static and moving contexts. After averaging over presynaptic filters and the spatial receptive fields, one set of VIP output weights correlates positively with the optimal weights (correlation coefficient 0.41, two-sided t-test); while the other set also correlates positively, the correlation coefficient is weaker and not statistically significant. Similarly, the convolution of the two VIP pathways also correlates positively (0.15, two-sided t-test).
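A minimal sketch of the correlation analysis above, assuming a plain Pearson correlation assessed with a two-sided t-test; the weight vectors here are synthetic stand-ins, correlated by construction (the 0.4 mixing factor is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical flattened weight vectors, standing in for the VIP output
# weights and the optimal lateral weights after averaging over presynaptic
# filters and spatial receptive fields.
n = 200
w_optimal = rng.standard_normal(n)
w_vip = 0.4 * w_optimal + rng.standard_normal(n)   # positively correlated by design

def pearson_r_t(x, y):
    """Pearson r and the statistic t = r * sqrt((n - 2) / (1 - r^2)),
    which a two-sided t-test compares against the t distribution with
    n - 2 degrees of freedom."""
    xc, yc = x - x.mean(), y - y.mean()
    r = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))
    t = r * np.sqrt((len(x) - 2) / (1.0 - r**2))
    return r, t

r, t = pearson_r_t(w_vip, w_optimal)
```

With a few hundred entries, even a moderate r like the reported 0.41 gives a t statistic far beyond the two-sided significance threshold.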
Analyzing the average postsynaptic weights more closely, we find that the strongest connections inhibit postsynaptic units corresponding to vertical or diagonal filters. Looking at the strongest inhibitory weights of one VIP output pathway, for example (see Figure 9c), we find that most correspond to postsynaptic vertical filters, and nearly all to either vertical or diagonal postsynaptic filters. For the other pathway, 7 out of 10 such filters are vertical or diagonal (see Figure S11). We note also that the average connection strength for postsynaptic vertical filters is negative and stronger than that for horizontal filters. This can be interpreted as follows: our videos feature horizontal movement, hence the spatiotemporal co-occurrence of vertical features in particular is distorted during the moving context; this results in weaker weights overall when the postsynaptic cell responds to vertical filters, and thus the weights are strongly negative on average for such filters. The overall positive correlation of the VIP output weights with the optimal moving-context weights implies that postsynaptic units tuned to vertical features are more strongly inhibited through these connections when switching from static to moving contexts. This phenomenon is more prevalent in the pathway whose strongest connections are driven, at least in part, by inhibition of vertically tuned units; in contrast, for the static-context weights, the strongest inhibitory connections are mostly onto horizontally tuned units, while connections onto vertically tuned units are mostly excitatory on average.
2.7.2 Activity
We next study the consistency of activity patterns produced by our model with empirical data. Published experimental findings provide strong evidence that the VIP inhibitory population modulates the visual circuitry in a movement-dependent manner (Niell & Stryker, 2010; Pfeffer et al., 2013; Fu et al., 2014). Very recent results show that VIP neurons respond synergistically to stimuli moving front to back during locomotion, a conjunction expected during locomotion in a natural environment for mice, with a preference for low but nonzero contrasts (Millman et al., 2020). Such movement-modulated activity matches that required in our model, although we have not endowed the VIP units with specific feature selectivity. Additionally, we perform a set of new analyses of experimental data in the context of our model. These draw both on the literature and on the Allen Brain Observatory (http://observatory.brain-map.org/visualcoding, 2016), which contains in vivo physiological activity in the mouse visual cortex, featuring representations of visually evoked calcium responses from GCaMP6-expressing neurons in selected cortical layers, visual areas, and Cre lines. The data set contains calcium activations across multiple experimental conditions; here we focus on periods of spontaneous activity, natural images, and drifting gratings.
Our model of the switching circuit shows that the relative number of VIP neurons required to switch between moving and static contexts is low when compared to the number of PYR or SST neurons (see Figures 7a and 7b). This number qualitatively matches the relative abundance of neurons in the three populations. Excitatory neurons PYR are more abundant than inhibitory ones (roughly 80% to 20%), and VIP are a minority of inhibitory cells. Moreover, the existing VIP cells recorded in the Allen Observatory do not appear to exploit substantially more degrees of freedom (as measured by their relative dimensionality) than other cell populations (see Figure S10a), consistent with a small number of effective VIP “units.”
We now highlight two aspects of VIP neural activity that are directly related to our model and justify the choice of VIP as switching units whose activities are modulated by the locomotion state of the animal. First, VIP activity dimensionality is significantly modulated between the moving and static conditions during periods of spontaneous activity, as shown in Figures 9d and 9e. To extract this dimensionality modulation, we considered periods of spontaneous activity in the recordings and divided the statistical distribution of the animal's speed, for each experimental session, into four quartiles. We then computed the average dimensionality, or participation ratio (PR; see section 4.8), for each recording in each quartile; we define the dimensionality here as the (lower) dimension of a subspace in which the activation data can be represented while retaining meaningful properties of the original data. We define the dimensionality modulation to be the ratio between the average dimensionality within the highest quartile (movement condition) and that within the first quartile (static condition). This ratio is displayed in Figure 9e. The dimensionality of the VIP population is significantly modulated by movement, while in the other populations the same quantity was not significantly different across moving and static conditions (see Figure 9d). The histogram of these statistics is shown in Figure 9e.
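The participation ratio described above can be computed directly from the eigenvalues of the activity covariance, PR = (sum_i lambda_i)^2 / (sum_i lambda_i^2). A minimal sketch on synthetic low- and high-dimensional activity (the data shapes are illustrative, not the Allen Observatory recordings):

```python
import numpy as np

def participation_ratio(activity):
    """PR = (sum_i lam_i)^2 / sum_i lam_i^2, where lam_i are eigenvalues of
    the covariance of a (time x neurons) activity matrix.  PR is near 1 for
    one-dimensional activity and near the neuron count for isotropic activity."""
    cov = np.cov(activity, rowvar=False)
    lam = np.linalg.eigvalsh(cov)
    return lam.sum() ** 2 / (lam ** 2).sum()

rng = np.random.default_rng(4)
n_t, n_neurons = 1000, 20

# Low-dimensional activity: all neurons driven by one shared latent signal.
latent = rng.standard_normal((n_t, 1))
low_d = latent @ rng.standard_normal((1, n_neurons)) \
        + 0.05 * rng.standard_normal((n_t, n_neurons))

# High-dimensional activity: independent neurons.
high_d = rng.standard_normal((n_t, n_neurons))
```

The dimensionality modulation of the text is then the ratio of such PR values between the highest and lowest speed quartiles.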
Second, we analyzed evoked activity during the animals' viewing of natural scenes. We performed a calcium signal modulation analysis and found that for this stimulus set, the activity was strongly modulated for the VIP population and less so for other neural populations (see Figure 9f) across moving and static conditions assessed via the quartile method just described. This further confirms the stronger VIP modulation across the moving-static conditions. Further pieces of experimental evidence are presented in Figure S12.
Finally, we analyzed the activities of the VIP and PYR neuron populations. Similar to Niell and Stryker (2010), we find the average activity of the PYR during the moving condition to be higher than during the stationary condition (0.066 during stationarity versus 0.074 during movement, p-value 0.01). However, our PYR population activity does not double during locomotion compared to periods of stationarity, as it does in Niell and Stryker (2010). More recent studies have reproduced the relation between excitatory neuronal activity in mouse visual cortex and running but have observed a much weaker effect (Millman et al., 2020, Figure 5e).
We conducted further analysis to infer the tuning properties of the PYR and the VIP. This was achieved by considering a wavelet family (e.g., Daubechies), taking the two-dimensional discrete wavelet transforms of the video frames in our data, regarding the corresponding average wavelet transforms as features, and finally performing a linear regression or GLM against VIP or PYR activities with the average wavelet transforms as the independent variables (see section 4.9). We find that most PYR neurons are tuned to horizontal features and much less to vertical features. Because VIP neurons in our model only receive input from the PYR, while the top-down input activating the VIP is described simply by a binary switch term, the VIP acquire the same preferential selectivity to horizontal features over vertical ones (see Figure S13 and section 4.9 for details). This is counter to what we would expect if the VIP were capable of detecting the horizontal movement in our data set by exhibiting preferential selectivity toward vertical features within their receptive fields, instead of relying on the ad hoc built-in switch term. We conclude that the simplification of employing a binary switch term in equation 4.27 prevents us from observing a more realistic VIP activation pattern that would deviate from the PYR pattern and provide further insight. This points to an important direction for future, more detailed modeling expanding on our current simplified model.
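The wavelet-feature construction above can be sketched with a hand-rolled one-level Haar transform (Haar being the simplest Daubechies wavelet, db1); the frames and feature choices below are illustrative, not the actual regression pipeline of section 4.9.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar transform (Haar = Daubechies db1).  Returns the
    approximation and three detail subbands, named by the axis along which
    the differencing acts."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0    # average over row pairs
    d = (img[0::2, :] - img[1::2, :]) / 2.0    # difference over row pairs
    approx = (a[:, 0::2] + a[:, 1::2]) / 2.0
    col_detail = (a[:, 0::2] - a[:, 1::2]) / 2.0   # differences across columns
    row_detail = (d[:, 0::2] + d[:, 1::2]) / 2.0   # differences across rows
    diag_detail = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return approx, col_detail, row_detail, diag_detail

def subband_energies(img):
    """Mean absolute value of each detail subband; averages like these serve
    as the regression features for probing orientation tuning."""
    _, cd, rd, dd = haar_dwt2(img)
    return np.abs(cd).mean(), np.abs(rd).mean(), np.abs(dd).mean()

# Horizontal stripes vary along rows, so their energy lands in row_detail;
# vertical stripes land in col_detail.
horiz = np.zeros((16, 16)); horiz[::2, :] = 1.0
vert = np.zeros((16, 16)); vert[:, ::2] = 1.0
```

Regressing measured activities against such subband features (e.g., with `numpy.linalg.lstsq` or a GLM) then reveals whether a unit responds preferentially to horizontal or vertical content.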
Altogether, these comparisons provide further support for our modeling assumptions and for the role of VIP neurons in visual coding across static and moving conditions. We conclude that our switching circuit model reproduces the global pattern of interactions via VIP that we expect, approximating the static and moving circuits in synchrony with VIP activation. Further analysis of future data sets, as examined in section 3, will guide the next steps of circuit modeling.
3 Discussion
We have introduced a computational model for V1 circuitry that uses multiple cell types to integrate contextual information into local visual processing, during two different—static and moving—contexts. We have identified a need for recurrence, leading to the architecture of a switching circuit with bidirectional, learned connections to a switching population (here, the VIP cell class). Beyond V1 and biological circuit modeling, this circuit may be useful in searching for artificial neural network (ANN) architectures that can operate in different contexts and switch effectively between them.
Our model connects to a body of recent empirical studies elucidating V1 neural cell types and network logic. First, Niell and Stryker (2010) have established that as the speed of mice increases, the circuit increases spiking overall and changes the frequency content of local field potentials. Potentially, distinct activity patterns during locomotion could be attributed to effects from eye movements; however, Niell and Stryker provide evidence against this hypothesis. These findings prompt us to model the network as a switching circuit that adapts its activity as the state of the animal changes from static to moving. Later studies have focused on the connection strengths for excitatory and inhibitory neurons: neurons display like-to-like connectivity (Cossell et al., 2015; Ko et al., 2011), whereby neurons with similar orientation tuning have a higher probability of connecting and display stronger connections on average. Pfeffer et al. (2013) describe the V1 circuit logic by using transgenic mouse lines expressing fluorescent proteins or Cre-recombinase, providing a consistent classification of cell populations across experiments. Three large nonoverlapping classes of molecularly distinct interneurons that interact via a simple connectivity scheme were identified: PV, SST, and VIP inhibitory neurons. In particular, PV inhibit one another, SST avoid one another and inhibit all other types of interneurons, and VIP preferentially inhibit SST cells.
Another important development made by Fu et al. (2014) has established that locomotion activates VIP neurons independent of visual stimulation and predominantly through nicotinic inputs from basal forebrain. This study was the first to propose the existence of a cortical circuit for the enhancement of visual response by locomotion, describing a modulation of sensory processing by behavioral state. These studies motivate us to choose VIP as switching units and to map the positive and negative weights of our model to connectivities between different neuronal populations. Finally, another study suggests that differentiated network response during locomotion can be advantageous for visual processing (Dadarlat & Stryker, 2017): an increase in firing rates can enhance the mutual information between visual stimuli and single neuron responses over a fixed window of time, while noise correlations decrease across the population, which further improves stimulus discrimination. The authors hypothesize that cortical state modulation due to locomotion likely increases visually pertinent information encoded in the V1 population during times when visual information changes rapidly, such as during movement.
At least one study (Dipoppa et al., 2018) has disputed the findings of Niell and Stryker and of Fu et al., finding evidence contrary to the disinhibitory model. Experiments with the light on and visual stimuli present showed that locomotion increased both SST responses to large stimuli and VIP responses to small stimuli. However, the authors note that rerunning the measurements in darkness reproduced the results of Fu et al., reinforcing the assumption that our model operates in conditions of poor visibility and high noise.
There is a vast literature on models of efficient coding starting with Barlow (1961) and Attneave (1954). (For a great description of this literature, see Chalk, Marre, & Tkačik, 2018.) On one extreme, if the signal-to-noise ratio is high and additional constraints (e.g., sparsity) are introduced, such models emphasize redundancy reduction (Olshausen & Field, 1996a; Rao & Ballard, 1999; Harpur & Prager, 1996; Comon, 1994; Bell & Sejnowski, 1995; Zemel, 1993; Dayan, Hinton, Neal, & Zemel, 1995). At the other extreme, if the signal-to-noise ratio is low, such models emphasize robust coding (Karklin & Simoncelli, 2011; Doi & Lewicki, 2014). We use a theoretical framework that emphasizes robust coding and that we have selected because of its generality. It starts with an assumption on neuronal activation functionality (i.e., firing rates of neurons encode the probability of specific features being present in a given location of the image). This model describes local circuit interactions needed for integration of information from surrounding visual stimuli in noisy conditions for an arbitrary representation. The model matches multiple empirical findings—for example that statistical regularities of natural images give rise to like-to-like local circuit connectivities, as observed experimentally (Cossell et al., 2015; Ko et al., 2011). However, in different contexts the model predicts different functional lateral interactions. Therefore, we looked at circuits that can implement multiple functional interactions in one circuit.
Our model also relates to other switching circuits reported in the experimental literature. For example, selective inhibition of a subset of neurons in the central nucleus of the amygdala (CeA) led to decreased conditioned freezing behavior and increased cortical arousal as visualized by fMRI (Gozzi et al., 2010), identifying a circuit that can shift fear reactions from passive to active. Another study has unraveled the cellular identity of the neural switch that governs the alternative activation of aggression and courtship in Drosophila fruit flies (Koganezawa et al., 2016). While these studies detail circuits responsible for switching behaviors, there are also circuits that switch between contexts: from detection of weak visual stimuli to discrimination after adaptation in mice (Ollerenshaw, Zheng, Millard, Wang, & Stanley, 2014); from high-response firing during active whisker movement to low response when no tactile processing is initiated (Zhou et al., 2017); and from odor attraction in food-deprived larvae to odor aversion in well-fed larvae (Vogt et al., 2020).
In contrast to this rich body of experimental studies, relatively few computational models have been proposed so far that explain switching in circuits (Yang et al., 2019). We may compare our V1 circuit to recurrent circuits trained with FORCE learning, where a single unit or a few units project their feedback onto a recurrent neural network and momentarily disrupt chaotic activity to enable training. VIP units in our model resemble such feedback-providing output units in the FORCE framework, but it is unclear how far this analogy goes and to what extent the framework of Sussillo and Abbott (2009) is helpful in understanding V1 circuitry.
Another interesting example of a circuit with flexible, context-dependent behavior was proposed by Mante et al. (2013), where prefrontal cortex (PFC) activity is modulated by the presence of a visual cue signaling which feature (color versus direction) the animals must integrate in a random-dots decision task. PFC functionality in this task was modeled using a recurrent neural network (RNN) that takes the direction of motion, the color of the random dots, and the visual cue as input and outputs the appropriate, reward-generating direction to saccade. This suggests the RNN enacts a potentially new mechanism for the selection and integration of context-dependent inputs, with gating possible because the representations of the inputs and the upcoming choice are separable at the population level, even though they are deeply entangled at the single-neuron level. The architecture of the model RNN proposed in that study is simpler than what we have laid out while also attaining high flexibility. There are important differences between that framework and our work. First, it is unclear how many weights the network in Mante et al. (2013) requires in order to multitask. One of our main motivations has been to achieve a switching circuit with few added units and weights, so that the circuit has fewer weights to learn than two separate circuits processing the two contexts independently. It is unclear whether this potential advantage holds in the case of Mante et al. Second, our circuit adapts to the statistics of both static and moving scenes and yields firing rates that are optimal for visual processing in either context. In the case of Mante et al., the circuit does not change momentary input processing when the context changes; it simply adapts its dynamics to integrate the appropriate feature and initiate the action that will be rewarded.
Context takes on different meanings in these two instances: in our model, context is given by the statistical regularities of a certain environment, static or moving; in Mante et al., context refers to an input cue that changes the goals and reward dependencies of actions within the task. Importantly, we have focused on switching circuits that modulate their responses to different sensory contexts, as opposed to different input cues and behaviors. It is unclear whether identical or different mechanisms for switching apply in the cases of sensory processing and action selection, when the animal changes scene statistics or behaviors, respectively.
Although our model is faithful to some aspects of the biology of V1 circuits, it has several limitations. First, it has been reported that during locomotion, firing rates of neurons more than double, at least in layers II/III of V1. Our firing rates are normalized to sum to one across features and thus cannot reproduce a doubling that occurs uniformly over features. Second, the model does not reproduce a few experimental findings reported in Ayaz, Saleem, Schölvinck, and Carandini (2013) and Keller et al. (2020). For instance, in our model locomotion does not increase spontaneous activity, as we found by sequentially showing, to the static and moving circuits, images in which every pixel takes a constant value or images with gaussian white noise (0.13 versus 0.11 mean static versus moving activity for constant-pixel images; 0.063 versus 0.53 for gaussian white noise images/videos). Similar to Keller et al., the firing rate due to the cross-oriented surround is only slightly higher than the firing rate due to the iso-oriented surround (0.087 versus 0.085, p-value 0.015; see Figure S14 for the stimuli shown to the circuits). However, locomotion does weaken signals conveying surround suppression, as reported in Ayaz et al., through the inhibition of the SST population by the VIP.
Moreover, another study (Dadarlat & Stryker, 2017) reported that noise correlations are reduced during motion, which does not occur in our model. Further, we model VIP as a switch that is off during the static condition and whose activation during locomotion depends on the input images, whereas data show that VIP activity is modulated at a finer scale and correlates strongly with running speed (Fu et al., 2014). In addition, VIP switching units in our model turn on based on perfect knowledge of whether the animal is static or moving, rather than on more subtle time-varying visual or motor features. Furthermore, data from Ko et al. (2011), Pfeffer et al. (2013), Jiang et al. (2015), Hofer et al. (2011), Lefort, Tomm, Floyd Sarria, and Petersen (2009), Thomson, West, Wang, and Bannister (2002), and Cauli et al. (1997) on connection probabilities and strengths between neuron populations present a richer, more complex picture than our simplified circuit. There is wide-ranging connectivity to and from PV, there are strong connections from PYR to SST in most layers, and the weights from SST to VIP are strong (in terms of both connection probability and strength across layers), details that our simplified model cannot describe. When we enabled weights from SST to VIP, we could similarly infer weights to and from the VIP that approximate the circuit during the moving condition (see Figures S6a and S6b). However, there remain many potential connectivity structures between neuron populations that our model does not describe.
From a computational perspective, our model makes several simplifications in describing context integration in circuits tuned to the statistical regularities of natural scenes. These include approximating a product with a sum in equation 4.12 and ignoring higher-order surround modulation in going from equation 4.6 to 4.8. Furthermore, our equations omit terms explicitly describing feedback from higher-order areas. Top-down input to the VIP that mediates increases of local PYR activity has been reported, for example, in Zhang et al. (2014), Wilmes and Clopath (2019), Hertäg and Sprekeler (2019), Batista-Brito, Zagha, Ratliff, and Vinck (2018), and Wall et al. (2016). In our model, the terms modulating the VIP firing rate that cause the population to behave like a switch are essentially encapsulated in the binary variable in equation 4.27. Although incorporating cell-type-specific contributions of top-down feedback is clearly important for relating the model to recent experimental findings, we leave this to future work. For simplicity, we have also limited the basis set of filters to one that extracts information about oriented edges in natural scenes. However, the computation of the extraclassical receptive field need not be intrinsically limited to simple cells responding to Gabor-like filters; it can be extended to encompass neurons responding to more complex features in areas beyond V1. Switching circuits can occur more generally, including in somatosensory and auditory cortices, where some of the same neuronal populations interact using a similar circuit logic (Niell & Stryker, 2010; Bigelow, Morrill, Dekloe, & Hasenstaub, 2019). Populations of neurons in general switching circuits can respond to diverse stimuli (e.g., VIP neurons in auditory cortex are activated by punishment; Pi et al., 2013).
The theoretical framework here makes no assumptions regarding the completeness of the basis. Instead, it focuses specifically on interactions outside the classical receptive field. Prior work by Olshausen (2013), Olshausen and Field (1996a, 1996b, 1997), and Lewicki and Sejnowski (2000) has discussed extensively the benefits of overcomplete bases. The key feature in our model is the normalization of the activity of the neurons in a patch, not the orthogonality or completeness of the basis (indeed, the 34 filters used here are not orthogonal). In our model, the interactions outside the classical receptive field of a cell are expressed exclusively through the representations by the cells with classical receptive fields at that location. As such, features not represented in an incomplete basis will be ignored in the context calculations. We use a relatively simple linear model for classical receptive field formation. If there are nonlinear interactions within the classical receptive field, the model can be expanded to represent the covariance of neuronal activities rather than the covariance of projections onto a linear filter; the analysis of such an extension, however, is beyond the scope of the current study.
Here, we showed how a biologically inspired switching mechanism can enable a network to efficiently process stimuli in two different conditions. Most artificial neural networks (ANNs) suffer from what has been termed catastrophic forgetting, whereby previously acquired memories are overwritten once new tasks are learned. Conversely, humans and other animals are capable of transfer learning—the ability to use past information without overwriting previous knowledge. Proposed solutions to this problem, such as elastic weight consolidation and intelligent synapses, are discussed in Kirkpatrick et al. (2017), Zenke et al. (2017), and Mallya, Davis, and Lazebnik (2018). Applied to the narrower setting of learning new contexts, our work adds a switching mechanism based on the connections among different cell types in V1. This may open new doors to artificial neural networks with analogous switching architectures.
4 Methods
4.1 A Theory of Optimal Integration of Static Context in Images
A theory of optimal context integration, first outlined in Iyer et al. (2020), describes a probabilistic framework for inferring features at particular locations of an image given the features at surrounding locations. The probabilities of these features occurring and co-occurring are then mapped to elements of a biological circuit (firing rates, weights).
4.1.1 Neuronal Code
Thus, the sum over probabilities of features adds up to 1. Throughout the article, we assume , although the model may be applied with other monotonic functions as well.
4.1.2 Probabilistic Framework
4.1.3 Mapping from the Probabilistic Framework to a Neural Network
4.2 Computing the Synaptic Weights
To compute weights according to equation 4.16, we first compute , the firing rates due to the classical receptive field for every image in a large data set. Initially, we preprocess the image: we convert the image to grayscale, subtract the mean, and normalize the image to have a maximum value of 1. Similarly, we preprocess the filters so the mean of each is 0. is the result of convolving with feature , rectifying and then normalizing so that at each location , the sum over features of firing rates is equal to 1. Rectification ensures that firing rates are nonnegative, while normalization further ensures we can interpret as probabilities. We average these firing rates over all images in the data set to obtain for each feature . The feature co-occurrence probability given by in the numerator for the synaptic weight formula is then computed by further pairwise convolution of firing rates due to the classical receptive field for each possible pair of filters in the basis set and each image in the data set and then averaged over all images.
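The procedure above can be sketched in a few lines of numpy (a simplified illustration only: the function names and toy basis are ours, cross-correlation with periodic shifts stands in for the paper's convolutions, and the exact normalization of equation 4.16 may differ):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def classical_rf_rates(image, filters, eps=1e-9):
    """Firing rates due to the classical receptive field: filter responses at
    every location are rectified, then normalized so that the sum over
    features at each location equals 1 (interpretable as probabilities)."""
    k = filters.shape[-1]                                # filters: (F, k, k)
    patches = sliding_window_view(image, (k, k))         # (H-k+1, W-k+1, k, k)
    resp = np.einsum('xyab,fab->fxy', patches, filters)  # cross-correlation
    resp = np.maximum(resp, 0.0)                         # rectify
    return resp / (resp.sum(axis=0, keepdims=True) + eps)

def context_weight(rates, i, j, dy, dx, eps=1e-9):
    """Co-occurrence of features i and j at offset (dy, dx), divided by the
    product of their marginal rates (cf. equation 4.16); the periodic shift
    via np.roll is a simplification for this sketch."""
    ri = rates[i]
    rj = np.roll(np.roll(rates[j], -dy, axis=0), -dx, axis=1)
    return ((ri * rj).mean() + eps) / (ri.mean() * rj.mean() + eps)
```

In the full pipeline, these per-image quantities are averaged over the entire data set before forming the synaptic weights.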
We first assume translational invariance so that only the relative position of two filters is relevant: when . The assumption that weights act with translational invariance allows rewriting the connectivities as simply a function of the distance, in image space, between the receptive field centers of the two neurons. Second, the mathematical validity of our probabilistic framework relies on the assumption that patches in the visual space, representing receptive fields of neurons, contain independent information. To reconcile this assumption with our empirically derived weights, we only consider connections between neurons whose receptive fields are sufficiently far apart, regardless of their corresponding feature identity. This leads to the use of sparse weights for moving and static contexts (see Figure 4e), where the only nonzero weights we allow are between neurons spatially farther apart than a minimum distance, which is half the receptive field size. More precisely, for every feature , synaptic weights from target filters were sampled in steps of the receptive field size at three distances in each direction around (0, 0), so that we have synaptic weights on a (7 × 7) grid (3 connections to the left/up + 3 connections to the right/down + the self-connection, per dimension). Instead of using these sparse weights after sampling, we could also have rescaled the original, nonsparse weights by a scalar so that . Searching over possible values of , we find . We choose, however, to work with sparse weights or test our results on the original nonsparse weights without worrying about the rescaling by . Although results presented in this study are largely for sparse weights, we have checked that the main results also hold when using full connectivity, at least for small (see Figure S6a).
Further, assuming that the contribution due to context integration decays as the filters are spatially farther and farther apart, we can limit the weights in space to three times the size of the classical receptive field. Sample synaptic weights obtained using this procedure are shown in Figure 4e (and Figures 4d and 4f without the sampling of weights).
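The sampling procedure above can be sketched as follows (the dense offset map and the receptive field size passed in are illustrative):

```python
import numpy as np

def sample_sparse_weights(dense_w, rf_size):
    """Keep weights only at offsets that are multiples of the receptive field
    size, out to three RF sizes in each direction: 3 left/up + 3 right/down +
    the self-connection give 7 offsets per axis, hence a 7 x 7 grid."""
    c = dense_w.shape[0] // 2             # centre of the (odd-sized) offset map
    steps = np.arange(-3, 4) * rf_size    # sampled offsets, in pixels
    sparse = np.zeros_like(dense_w)
    for dy in steps:
        for dx in steps:
            sparse[c + dy, c + dx] = dense_w[c + dy, c + dx]
    return sparse
```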
4.3 Constructing the Feature Space for Natural Images and Videos
We chose a basis of spatial filters constructed as outlined in Iyer et al. (2020). This is done by averaging approximations of spatial receptive field sizes from 212 recorded neurons in V1 (Durand et al., 2016). This set of filters is our first feature space and consists of four classes of spatial RFs observed experimentally: ON (1 feature), OFF (1 feature), and two versions of ON/OFF neurons (8 features each, for a total of 16), the first version having a stronger ON subfield and the second a stronger OFF subfield. Each subfield was modeled as a 2D gaussian with a standard deviation given by the average subfield size, measured to be 4.8 degrees for the OFF subfield and 4.2 degrees for the ON subfield. The relative orientation between the two subfields for each ON/OFF class was varied uniformly in steps of 45 degrees, from 0 to 315 degrees. Also for the ON/OFF classes, the relative distance between the centers of the ON and OFF subfields was chosen to be 5 degrees, which equates to roughly . According to the data, the amplitude of the weaker subfield was chosen to be half that of the stronger subfield, whose peak amplitude was set to unity. These two subfields are then combined additively to form a receptive field whose size is 7 degrees (the distance between the two subfields plus ). The set of 18 features is shown in Figure 3d.
We then added 16 more filters with a temporal component, for a total of 34 filters. These filters have two frames with the first frame being one of the ON/OFF filters. The second frame is the ON/OFF filter in the previous frame shifted 3 pixels to the left, which matches the distance the sliding window moves every frame to generate the video. Such a spatiotemporal filter is shown in Figure 3e.
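A minimal construction of one such filter (a sketch: the 21 × 21 pixel grid, the treatment of degrees as pixels, and the parameter names are our assumptions; only the subfield sigmas, the separation of 5 units, the 0.5 amplitude ratio, and the 3-pixel temporal shift come from the text):

```python
import numpy as np

def gaussian2d(shape, center, sigma):
    """2D isotropic gaussian bump on a pixel grid."""
    y, x = np.mgrid[:shape[0], :shape[1]]
    return np.exp(-((x - center[1])**2 + (y - center[0])**2) / (2 * sigma**2))

def on_off_filter(shape=(21, 21), sep=5.0, angle=0.0,
                  sigma_on=4.2, sigma_off=4.8, weak=0.5):
    """ON/OFF receptive field: two gaussian subfields of opposite sign,
    combined additively; the weaker subfield has half the amplitude."""
    cy, cx = shape[0] / 2, shape[1] / 2
    dy, dx = sep / 2 * np.sin(angle), sep / 2 * np.cos(angle)
    on = gaussian2d(shape, (cy - dy, cx - dx), sigma_on)
    off = gaussian2d(shape, (cy + dy, cx + dx), sigma_off)
    f = on - weak * off                  # stronger ON subfield here
    return f - f.mean()                  # zero-mean, as in preprocessing

def spatiotemporal_filter(spatial, shift=3):
    """Two-frame filter: frame 2 is frame 1 shifted 3 pixels to the left,
    matching the sliding-window displacement per video frame."""
    return np.stack([spatial, np.roll(spatial, -shift, axis=1)])
```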
4.4 Data Sets of Natural and Synthetic Images and Videos
4.4.1 Natural Images and Videos
For the data set of images, we used the Berkeley Segmentation Dataset (BSDS) training and test sets (Martin et al., 2001). The training set consists of 200 images of animals, human faces, landscapes, buildings, and so on and is used to compute the weights . This same training set is then employed to construct the data set of 200 videos in which a sliding window moves across the image for each frame of the video. In the simple case, the sliding window () moves 3 pixels per frame in the horizontal direction across the image ( or ), from left to right, for 50 frames (see Figure 3b). The sliding window may also move in any random direction, resulting in different statistics of the video data set and hence different . This alternative data set of videos is generated by choosing any pixel in the image and moving the sliding window toward it in small increments until that pixel is reached; a new pixel is then chosen from the image, and the process repeats until the maximum number of frames in the video (50) is reached. Results from this data set are shown in Figures S1 and S2. We further take 100 images from the BSDS test set to generate the corresponding 100 videos used in the optimization problem. These video frames are provided as input to the optimizer that minimizes the loss functions and to find , for and , , and for . For both optimization problems, we set 50 frames aside from these 100 videos to compute the generalization error during the minimization procedure.
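The simple horizontal-motion construction can be sketched as follows (the window size `win` is illustrative, as the exact value is not stated here; the 3-pixel step and 50 frames follow the text):

```python
import numpy as np

def sliding_window_video(image, win=50, step=3, n_frames=50):
    """Generate a video by sliding a win x win window `step` pixels per frame
    horizontally, left to right, stopping at the image border."""
    frames = []
    for t in range(n_frames):
        x = t * step
        if x + win > image.shape[1]:
            break                        # window would leave the image
        frames.append(image[:win, x:x + win])
    return np.stack(frames)
```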
To generate the numbers in Figure 8, another set of 100 videos generated from the BSDS test set is altered by adding gaussian and salt-and-pepper noise with different parameter values to each frame. The resulting noisy video frames are used to establish that the switching circuit processes visual stimuli with better reconstruction capability than either the circuit implementing only the static extraclassical receptive field or a circuit without an extraclassical receptive field (see section 2.6). Gaussian white noise has standard deviation for reconstructions in Figure 8e, while salt-and-pepper noise turns pixels black or white with probability each, for reconstructions in Figures 8d and 8f. Parameters and are varied (, ) in Figure 8f.
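The two noise processes can be implemented as follows (a sketch; frame intensities are assumed to lie in [0, 1]):

```python
import numpy as np

def add_gaussian_noise(frame, sigma, rng):
    """Additive gaussian white noise with standard deviation sigma."""
    return frame + rng.normal(0.0, sigma, frame.shape)

def add_salt_and_pepper(frame, p, rng, low=0.0, high=1.0):
    """Each pixel independently turns black (low) or white (high) with
    probability p each; otherwise it is left unchanged."""
    noisy = frame.copy()
    u = rng.random(frame.shape)
    noisy[u < p] = low
    noisy[(u >= p) & (u < 2 * p)] = high
    return noisy
```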
4.4.2 Synthetic Data Sets of Images and Videos of Horizontal and Vertical Bars
This simple synthetic data set consists of 18 images of horizontal and vertical bars (9 horizontal, 9 vertical). Images are , each image having a bar at a different location. Videos consist of bars moving in any direction one pixel at a time: left or right (for horizontal bars) and up or down (for vertical bars).
4.5 Deriving an Equation for PYR Firing Rate Consistent with V1 Circuit Architecture
Let be the firing rate due to the classical receptive field, the firing rate incorporating extraclassical receptive field information, and the weights between neuronal populations . We can write approximated expressions for firing rates of PYR, SST, VIP neurons at time :
- Case a: When there is no feedback connection from PYR to VIP:(4.22)(4.23)(4.24)
- Case b: When there is feedback from PYR to VIP:(4.25)(4.26)where of equation 4.24 is the intrinsic firing rate of VIP and is a binary variable that takes the value 1 during the moving condition and 0 during the static condition. For the analysis of the firing rate during movement, we assume . Equations 4.22 and 4.25, which express the firing rate of the PYR population, assume that the extraclassical receptive field contribution given by lateral connections has a multiplicative effect on the feedforward activities . This multiplicative gain results from the mapping from the probabilistic framework of equations 4.14 and 4.18 and their analogs for the moving-circuit activities and weights. As a result, the network performs optimal inference of visual features via PYR firing rates, as expressed in equations 4.22 and 4.25 and detailed in section 2.1. The VIP firing rate expression involves a binary gating term that switches based on state (static or moving), a simplification of what has been found empirically. The model could incorporate into expression 4.27 a term describing VIP firing rates driven independently from PYR such that , but this change would not alter our main results. Finally, only the connections with the longest synaptic delay (connections to and from PYR) are assumed to be noninstantaneous, while the remaining connections (those between inhibitory neurons) are presumed to occur on a much faster timescale. Biologically, PYR are assumed to carry out computations using dendritic trees, as outlined in Poirazi, Brannon, and Mel (2003), while SST and VIP are more spatially compact than PYR (Gouwens et al., 2019). Hence, synaptic delays between PYR and other neuron populations are longer than those between other populations.(4.27)
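The qualitative logic of the motif can be sketched as follows (an illustrative caricature only, not the paper's equations 4.25 to 4.27: the specific SST rate, the form of the gain, and the normalization are our simplifications; what it preserves is the binary VIP gate, VIP inhibition of SST, and the multiplicative surround gain on the feedforward PYR drive):

```python
import numpy as np

def step(r_ff, r_pyr, w_lat, m, r0_vip=1.0, eps=1e-9):
    """One update of the PYR/SST/VIP motif. m is the binary state variable
    (1 = moving, 0 = static). VIP is gated by m and inhibits SST; when SST is
    suppressed, the lateral (surround) contribution modulates PYR
    multiplicatively; rates are normalized to sum to 1 across features."""
    r_vip = m * r0_vip                              # binary VIP gate (eq. 4.27)
    r_sst = max(0.0, 1.0 - r_vip)                   # VIP inhibits SST
    gain = np.exp((1.0 - r_sst) * (w_lat @ r_pyr))  # surround gain (sketch)
    r_new = r_ff * gain                             # multiplicative effect
    return r_new / (r_new.sum() + eps), r_sst, r_vip
```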
4.6 Reconstructions from Noisy Videos Using Firing Rates and Optimal Synaptic Weights of Different Circuit Architectures
The reconstruction was performed as follows. For any noisy input image , where is some random variable representing a noisy process, we calculated the effective firing rate (activity) of neuron/feature at location using equations 4.38 to 4.42. To reconstruct image frames from firing rates, we convolved the computed firing rates with the inverses of the filters in our basis set. More specifically, the activity corresponding to filter was convolved with the inverse of , obtained by flipping the filter about the horizontal and vertical axes. These convolutions were then averaged over all filters to obtain the final reconstruction.
We then performed the reconstruction for the same image frame without any noise added. We assessed the denoising capability of our circuits by computing the Pearson correlation coefficient between the reconstruction of and the reconstruction of . The latter is a baseline for our comparisons, as there is no noise to remove from the image frame through extraclassical surround modulation. The Pearson correlation coefficient is a function of the activity of different circuit architectures and is discussed and compared across circuits in section 2.6.
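The reconstruction and the comparison metric can be sketched with numpy only (function names are ours; this implements the naive, filter-by-filter deconvolution described above):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate2d_full(a, k):
    """Full 2D cross-correlation (output size = input + kernel - 1)."""
    ph, pw = k.shape[0] - 1, k.shape[1] - 1
    ap = np.pad(a, ((ph, ph), (pw, pw)))
    win = sliding_window_view(ap, k.shape)
    return np.einsum('xyab,ab->xy', win, k)

def reconstruct(rates, filters):
    """Convolution with the inverse (flipped) filter equals cross-correlation
    with the original filter, so we correlate each feature's activity with
    its filter and average over features."""
    recs = [correlate2d_full(r, f) for r, f in zip(rates, filters)]
    return np.mean(recs, axis=0)

def pearson(a, b):
    """Denoising score: Pearson correlation between two reconstructions
    (noisy input versus clean input), computed over flattened pixels."""
    return np.corrcoef(np.ravel(a), np.ravel(b))[0, 1]
```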
Two further issues merit discussion. First, if the spectral content of the noise and the image frame is known, a Wiener deconvolution can be applied, which minimizes the mean square error between the estimated reconstruction and the original frame. Such a Wiener deconvolution would minimize the impact of deconvolved noise at frequencies with a poor signal-to-noise ratio. Here, however, we assume that interpretation of signals is done without access to this spectral content, instead implementing the naive reconstruction that would be optimal in the noise-free limit. Second, given the presence of the extraclassical surround contribution, the deconvolution operation may be more complex than the simple, filter-by-filter convolution with the inverse filter . Specifically, the inverse may contain information about the cross-correlation of features. Again, we work in the simplifying limit in which this is not the case. We do not exclude, however, the possibility that the biological circuit applies a more complex reconstruction (e.g., via learning weights), an interesting avenue to explore in future work.
4.7 Like-to-Like Connectivity for PYR and VIP Populations
In addition to the interneuron connectivity discussed in section 2.7, PYR connection probability as a function of the difference in orientation tuning (see Figures S2c and S2d) qualitatively matches the corresponding graph reported experimentally (Ko et al., 2011). This like-to-like connectivity, with neurons responding to similar features (orientations) more strongly connected, holds for both static (shown in Iyer et al., 2020) and moving weights (shown in Figures S2c, S2d, and S3). Another feature concerns the amplitude of static and moving weights, which decreases with distance from the classical receptive field, with lower weights on average between neurons whose classical receptive fields are far apart. Figure S2 shows the dependence of the maximum, minimum, and average positive and negative synaptic weights on the distance between neuronal receptive fields. Assuming an exponential spatial decay of weights with distance and using the first two points of the decreasing mean positive static weights curve (see Figure S2a), we computed the spatial constant of the decay, which is approximately the size of the classical receptive field. This is in accordance with past findings (Angelucci & Bressloff, 2006; Iyer et al., 2020), suggesting that the near surround extends over a range similar in size to the classical receptive field.
We further studied the inferred connections to and from the VIP to establish whether these weights reflect the contextual statistics of the static and moving states. We first asked whether there is like-to-like connectivity between the VIP and PYR populations by building a similarity matrix of dimension (number of VIP neurons) × (number of PYR neurons) that measures response similarity between the two populations. Each entry of this response similarity matrix is computed by taking the Pearson correlation between the GLM coefficients found above (see section 2.7) for each VIP neuron and each PYR neuron, respectively. We next built, from our tensor used for convolution, a matrix of connectivities of dimension (number of VIP neurons) × (number of PYR neurons). Finally, taking the Pearson correlation coefficient between the response similarity matrix and the matrix of connectivities yields a statistically significant but very low correlation coefficient (0.01, p-value 0.01, Kolmogorov-Smirnov test). We conclude that while like-to-like connectivity is present between PYR neurons, it is not prevalent between the VIP and PYR populations.
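The analysis can be sketched as follows (the shapes and names are ours):

```python
import numpy as np

def like_to_like_score(glm_vip, glm_pyr, conn):
    """Correlate response similarity with connectivity. glm_vip: (n_vip,
    n_coef) GLM coefficients per VIP neuron; glm_pyr: (n_pyr, n_coef); conn:
    (n_vip, n_pyr) connection strengths. Each similarity entry is the Pearson
    correlation between one VIP and one PYR coefficient vector; the returned
    score is the Pearson correlation between similarity and connectivity."""
    sim = np.array([[np.corrcoef(v, p)[0, 1] for p in glm_pyr]
                    for v in glm_vip])
    return np.corrcoef(sim.ravel(), conn.ravel())[0, 1]
```

When connectivity tracks response similarity, the score approaches 1; the low value reported above indicates it does not.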
4.8 Measuring Dimensionality with the Participation Ratio
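The participation ratio named in this section's title is conventionally computed from the eigenvalues of the activity covariance; a standard numpy implementation (our sketch) is:

```python
import numpy as np

def participation_ratio(activity):
    """PR = (sum_i lambda_i)^2 / sum_i lambda_i^2, where lambda_i are the
    eigenvalues of the covariance of `activity` (samples x neurons). PR is 1
    when a single dimension dominates and N when activity is isotropic."""
    lam = np.linalg.eigvalsh(np.cov(activity, rowvar=False))
    return lam.sum() ** 2 / (lam ** 2).sum()
```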
4.9 Inferring the Tuning Properties of VIP and PYR Neurons
We further studied the activation patterns of units in our switching circuit model by inferring the tuning properties of VIP and PYR units. To achieve this, we first chose a wavelet family to define the features; this family differs from the basis approximating spatial receptive fields in V1 from section 4.3. We chose the Daubechies 4 wavelet family with a mother wavelet of length 15 pixels, as shown in Figure S13a. We then computed the 2D discrete wavelet transform of each video frame to obtain the approximation, horizontal detail, vertical detail, and diagonal detail coefficients. The goal is to use the averages of these coefficients as the independent variables of a linear regression or GLM that models VIP or PYR activations.
We obtain that most PYR neurons are tuned to horizontal features, and much less so to vertical features (data not shown). Using either a linear regression or a GLM with a Poisson distribution yields qualitatively similar results. Because VIP neurons in our model only get input from the PYR, while the top-down input activating VIP is described simply by the binary term , we obtain that VIP acquires the same preferential selectivity to horizontal features to the detriment of vertical features (see Figure S13d). VIP neurons are tuned to horizontal features with an average regression coefficient of 0.65, while they are tuned to vertical features with an average regression coefficient of 0.015 (using the results from the linear regression). This runs counter to our expectation that VIP is capable of detecting horizontal movement in our data set by exhibiting preferential selectivity toward vertical features within their receptive fields, analogous to empirical results in Millman et al. (2020). Clearly, the simplification we have made by employing a binary term in equation 4.27 prevents us from observing a more realistic VIP activation pattern that would deviate from the PYR pattern and provide further insights. We leave the more detailed modeling expanding our current simplified model in this direction to future work.
Code
Source code is available in ModelDB (McDougal et al., 2017) at http://modeldb.yale.edu/267120.
Notes
In general, if N is the number of neurons per location, L is the number of locations, and C is the number of connections per neuron, then the total number of connections in a circuit is NLC. Two identical circuits have 2NLC connections, while a switching circuit has NLC + v(c_in + c_out), where v is the number of switching (VIP) units and c_in, c_out are the numbers of connections to and from the switching units, respectively. When v(c_in + c_out) < NLC, then NLC + v(c_in + c_out) < 2NLC, which holds for circuits with small v.
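The connection counts above can be checked numerically (variable names are ours):

```python
def connection_counts(n, l, c, v, c_in, c_out):
    """Connections in two independent circuits (2*N*L*C) versus one switching
    circuit (N*L*C plus the connections of the v switching units)."""
    two_circuits = 2 * n * l * c
    switching = n * l * c + v * (c_in + c_out)
    return two_circuits, switching
```

For example, with N = 10, L = 100, C = 7 and 5 switching units with 7 connections each way, the switching circuit uses 7070 connections versus 14,000 for two independent circuits.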
The tensor weights are very high-dimensional, so the least-squares method and variations thereof failed owing to high memory requirements.
Acknowledgments
We gratefully acknowledge the support of the Swartz Foundation Center for Theoretical Neuroscience at the University of Washington, and of NIH training grant 5 R90 DA 033461-08. We are grateful to Matthew Farrell and Kameron Harris for their helpful comments in producing the final manuscript. We thank Paul G. Allen, the founder of the Allen Institute for Brain Science, for his vision, encouragement, and support.