Abstract
Under difficult viewing conditions, the brain’s visual system uses a variety of recurrent modulatory mechanisms to augment feedforward processing. One resulting phenomenon is contour integration, which occurs in the primary visual (V1) cortex and strengthens neural responses to edges if they belong to a larger smooth contour. Computational models have contributed to an understanding of the circuit mechanisms of contour integration, but less is known about its role in visual perception. To address this gap, we embedded a biologically grounded model of contour integration in a task-driven artificial neural network and trained it using a gradient-descent variant. We used this model to explore how brain-like contour integration may be optimized for high-level visual objectives as well as its potential roles in perception. When the model was trained to detect contours in a background of random edges, a task commonly used to examine contour integration in the brain, it closely mirrored the brain in terms of behavior, neural responses, and lateral connection patterns. When trained on natural images, the model enhanced weaker contours and distinguished whether two points lay on the same versus different contours. The model learned robust features that generalized well to out-of-training-distribution stimuli. Surprisingly, and in contrast with the synthetic task, a parameter-matched control network without recurrence performed the same as or better than the model on the natural-image tasks. Thus, a contour integration mechanism is not essential to perform these more naturalistic contour-related tasks. Finally, the best performance in all tasks was achieved by a modified contour integration model that did not distinguish between excitatory and inhibitory neurons.
1 Introduction
Deep neural networks (DNN) are often used as models of the visual system (Kriegeskorte, 2015; Yamins & DiCarlo, 2016; Spoerer et al., 2017; Nayebi et al., 2018; Schrimpf et al., 2018; Lindsay, 2020). It has been argued that they are mechanistic models (Lindsay, 2020) because some of their computational elements have analogies in the brain. But they lack many other biological mechanisms, which may contribute to differences in representations (Schrimpf et al., 2018; Tripp, 2017; Shi et al., 2022) and behavior (Geirhos et al., 2019; Rajalingham et al., 2018; Szegedy et al., 2013; Nguyen et al., 2015; Hendrycks & Dietterich, 2019; Serre, 2019; Lake et al., 2015). In contrast, there are many physiological models of circuits that underlie localized neural phenomena (Haeusler & Maass, 2007; Rubin et al., 2015; Carandini & Heeger, 2012; Baker & Bair, 2016; Hurzook et al., 2013; Piëch et al., 2013), but these models tend to be isolated from larger circuits and to have uncertain connections with ethologically important visual tasks.
The limitations of both deep networks and isolated circuit models might potentially be addressed by combining them, that is, incorporating detailed circuit models into deep networks. In this direction, recent studies have incorporated details of interlaminar and interareal connectivity into deep networks (Kubilius et al., 2018; Lindsey et al., 2019; Tripp, 2019; Shi et al., 2022). Few studies (Guerguiev et al., 2017; Sacramento et al., 2018; Linsley et al., 2018; Iyer et al., 2020) have incorporated biologically grounded microcircuits into functionally sophisticated deep networks, but doing so may be an important step in understanding how microcircuits contribute to behavior and reproducing the superior generalization abilities of the brain (Sinz et al., 2019).
Contour integration (Field et al., 1993; Li et al., 2006; Hess et al., 2014; Roelfsema, 2006) is a phenomenon in the V1 cortex where stimuli from outside a neuron’s classical receptive field (cRF) modulate its feedforward responses (see Figure 1). In particular, a neuron’s response is enhanced if a preferred stimulus within the cRF is part of a larger contour. Li et al. (2006) found that these elevated V1 responses were highly correlated with contour detectability. Under difficult viewing conditions, it is thought that the visual system uses contour integration to pop out smooth contours. Contour integration is mediated by intra-area lateral and higher-layer feedback connections (Chen et al., 2017; Liang et al., 2017). Past computational models (Li, 1998; Piëch et al., 2013; Ursino & La Cara, 2004; Hu & Niebur, 2017; Mély et al., 2018) have tested potential mechanisms and successfully replicated neurophysiological data.1 However, a limitation of all of these circuit models is that they are stand-alone models that do little to clarify the roles of contour integration in natural vision.
In this work, we embedded a circuit model of contour integration within a deep network. We used this model to investigate two broad questions. First, we tested whether key characteristics of biological contour integration would emerge as a result of the network learning to identify contours within backgrounds of randomly oriented edges (a kind of stimulus that has often been used to study contour integration). We found that the trained model was consistent with biological data on behavior (detection of contours), electrophysiology (unit responses versus contour length and contour-fragment spacing), and connectivity (structure of learned lateral connections). This provides new evidence that these particular circuit characteristics benefit the perception of contours within these synthetic visual stimuli.
Second, we used our model to investigate whether contour integration improved performance of two natural scene tasks. One of these was detection of weak edges in natural scenes, a role that has previously been proposed for contour integration. The second was a new task that required distinguishing connected contours from nearby unconnected contours. In the first task, the contour integration model performed similarly to a parameter-matched feedforward network. In the second task, surprisingly, the model performed much worse than the control network. However, it generalized better to a variation of the task that it was not trained on. Furthermore, a variation of the model that allowed excitatory neurons to inhibit some of their targets substantially outperformed the control. This suggests that contour integration is relevant to the second task, but the model we adopted was not optimal, either because biological contour integration is not optimal or because important biological elements were missing from the model.
2 Model
2.1 Contour Integration Block
We adapted an existing circuit model of V1 contour integration and incorporated it into an artificial neural network (ANN). We used the current-based, subtractive-inhibition model of Piëch et al. (2013). This model focuses on within-layer lateral interactions between V1 orientation columns (co-located populations of neurons that respond to edges of similar orientations over a small area of visual space).
Functionally, excitatory (E) nodes process incoming edge extraction responses from preceding layers, while inhibitory (I) nodes subtractively modulate E node activities. Both of these nodes also receive recurrent inputs from nearby columns via lateral connections. Piëch et al. (2013) designed these anisotropically distributed connections (Stettler et al., 2002) with connectivity patterns suggested by the association field model (Field et al., 1993) (see Figure 1B); each orientation column preferentially connects with nearby columns that respond to stimuli that are co-linear or curvilinear with the column’s preferred orientation. A similar but orthogonally oriented association field was used to model inhibitory lateral connections.
Piëch et al. (2013) defined the full model over a 2D grid of spatial locations. Each spatial location contained a set of orientation columns with the same frequency selectivities and a range of orientation preferences. The lateral connections of each orientation column were hard-coded. The dynamics of the full model were realized as the joint activities of all columns.
We made minimal adaptations to this circuit model to implement it as a trainable block inside a convolutional network. First, we replaced summations over e-cRFs with convolutions. The convolution operates over columns in nearby locations as well as at the same location; it incorporates the excitatory self-connection and the lateral connections. Second, we used Euler’s method to express the dynamics as difference equations (Linsley et al., 2018; Tallec & Ollivier, 2018). Third, we defined all model parameters, including the lateral connections, to be learnable and used task-level optimization to learn their optimal settings. Piëch et al. (2013) distinguished excitatory and inhibitory neurons in their model, consistent with Dale’s principle (Dale, 1935; Eccles et al., 1954). In contrast, neurons in convolutional networks typically do not make this distinction but allow weights to take on whatever values maximize performance. This consistently results in each neuron exciting some of its targets and inhibiting others. To ensure individual nodes were consistent with Dale’s principle, we constrained weights to be positive or negative, as appropriate. For connections between paired excitatory and inhibitory neurons, a logistic sigmoid nonlinearity was applied to the learned weight parameter to prevent changes in sign. The same method was used to retain the sign of the model’s time constants. For lateral connection kernels, a positive-only constraint was imposed on each element during training.
$$x_t = (1 - \sigma(a_x))\,x_{t-1} + \sigma(a_x)\left[W_e * f(x_{t-1}) - \sigma(J_{xy})\,f(y_{t-1}) + I_t + b_x\right]$$
$$y_t = (1 - \sigma(a_y))\,y_{t-1} + \sigma(a_y)\left[W_i * f(x_{t-1}) + \sigma(J_{yx})\,f(x_{t-1}) + b_y\right]$$

where $f(\cdot) = \max(0, \cdot)$.

Here, $x$ and $y$ are membrane potentials; $f$ is a nonlinear activation function; $a_x$ and $a_y$ set the membrane time constants; $J_{xy}$, $J_{yx}$ are local I→E, E→I connection strengths; $\sigma$ is the logistic sigmoid function that constrains time constants and local connection strengths to be positive; $W_e$ are lateral excitatory connections from E nodes in nearby columns to E nodes; $W_i$ are connections from nearby E nodes to I nodes; $f(x_t)$ is the output of all modeled nodes at time $t$; $*$ is the convolution operator; $I_t$ is the external input; and $b_x$, $b_y$ are nodes’ background activities.
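These difference equations can be sketched as a recurrent PyTorch module. This is a minimal illustration of the discretized E/I dynamics under the notation above, not the authors' released implementation; the channel count, iteration count, and parameter names (`w_e`, `w_i`, `j_xy`, `j_yx`, `a_x`, `a_y`, `b_x`, `b_y`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CIBlock(nn.Module):
    """Sketch of the contour integration (CI) block: paired E/I populations
    coupled by learnable lateral convolutions, iterated with
    Euler-discretized dynamics."""

    def __init__(self, channels, kernel_size=15, n_iters=5):
        super().__init__()
        pad = kernel_size // 2
        # Lateral kernels: E -> E (w_e) and E -> I (w_i); clamped nonnegative in forward()
        self.w_e = nn.Conv2d(channels, channels, kernel_size, padding=pad, bias=False)
        self.w_i = nn.Conv2d(channels, channels, kernel_size, padding=pad, bias=False)
        # Local connection strengths and step sizes are stored pre-sigmoid, so
        # their effective values stay in (0, 1) and can never change sign
        self.j_xy = nn.Parameter(torch.zeros(channels))  # I -> E (subtractive)
        self.j_yx = nn.Parameter(torch.zeros(channels))  # E -> I
        self.a_x = nn.Parameter(torch.zeros(channels))   # E step size
        self.a_y = nn.Parameter(torch.zeros(channels))   # I step size
        self.b_x = nn.Parameter(torch.zeros(channels))   # E background activity
        self.b_y = nn.Parameter(torch.zeros(channels))   # I background activity
        self.n_iters = n_iters

    def forward(self, ff_input):
        def per_ch(p):  # reshape a per-channel parameter for broadcasting
            return p.view(1, -1, 1, 1)

        x = torch.zeros_like(ff_input)  # E membrane potentials
        y = torch.zeros_like(ff_input)  # I membrane potentials
        for _ in range(self.n_iters):
            with torch.no_grad():  # Dale's principle: lateral weights stay nonnegative
                self.w_e.weight.clamp_(min=0)
                self.w_i.weight.clamp_(min=0)
            a_x = torch.sigmoid(per_ch(self.a_x))
            a_y = torch.sigmoid(per_ch(self.a_y))
            x = (1 - a_x) * x + a_x * (self.w_e(F.relu(x))
                                       - torch.sigmoid(per_ch(self.j_xy)) * F.relu(y)
                                       + ff_input + per_ch(self.b_x))
            y = (1 - a_y) * y + a_y * (self.w_i(F.relu(x))
                                       + torch.sigmoid(per_ch(self.j_yx)) * F.relu(x)
                                       + per_ch(self.b_y))
        return F.relu(x)
```

Each iteration mixes the previous state with the new drive via the sigmoid-constrained step size, so the effective time constant of every channel is itself learnable.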
This final form is a recurrent neural network that can be trained using standard neural network training techniques (Tallec & Ollivier, 2018). Finally, we included batch normalization (BN) (Ioffe & Szegedy, 2015) layers after every convolutional layer to model weak omnidirectional inhibition (Kapadia et al., 2000). We refer to this transformed model as the contour integration (CI) block and include it as a whole inside ANNs. Parameters of the CI block and their settings are described in section 6.1.
2.2 Visual Inference Network
The full model is composed of edge extraction, CI, and classification blocks (see Figure 2). For edge extraction, we used the first convolutional layer of a ResNet50 (He et al., 2016) that was pretrained on the ImageNet (Deng et al., 2009) data set. We additionally added BN and max-pooling after the convolutional layer in all tasks other than edge detection in natural images. This helped reduce computational complexity (by reducing the spatial dimensions over which the recurrent CI block acts) and improved performance as well. For the task of edge detection in natural images, only the BN was added. Outputs of the edge extraction block were fed into the CI block. The same CI block was used across all tasks.
Outputs of the CI blocks were passed to classification blocks, which mapped CI block outputs to required label sizes for each task. These blocks had two convolutional layers each. Deeper classification blocks might have allowed better task performance, but we chose shallower classification blocks so that the CI block would play an essential role in network function. Description of each of the classification blocks is in section 6.2. The architectures of all the models we used are shown in Figure 2.
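The three-block composition can be sketched as follows. To keep the sketch self-contained, a single untrained 7 × 7 convolution stands in for ResNet50's pretrained first layer, and the layer sizes in the classification head are hypothetical.

```python
import torch
import torch.nn as nn

# Edge extraction: stand-in for ResNet50's first conv (7x7, stride 2, 64 channels),
# followed by BN and max-pooling (the pooling is omitted for natural-image edge detection).
edge_extract = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# Classification head: two convolutional layers mapping CI-block outputs to the
# required label size (here a 1-channel map; channel counts are illustrative).
classifier = nn.Sequential(
    nn.Conv2d(64, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1),
)

def full_model(images, ci_block):
    """images: (N, 3, H, W); ci_block: any module mapping 64 -> 64 channels."""
    return classifier(ci_block(edge_extract(images)))
```

The same middle block can be swapped between tasks, with only the classification head changing, which mirrors how the paper reuses one CI block across all tasks.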
2.2.1 Feedforward Control Network
We compared our contour-integration model (the visual inference network described above) with a feedforward control network of matching capacity (number of parameters). Feedforward networks can be parameterized to match capacity in several different ways (Spoerer et al., 2020; Tan & Le, 2019). Because we were interested in modeling V1 lateral connections, we used convolutional kernels of the same size as the model’s lateral kernels. Compared to standard convolutional kernels, these were much larger and were specifically designed to model lateral connections, which may spread out up to eight times the cRF of V1 neurons (Stettler et al., 2002).
The control network used the same edge extraction and classification blocks as the model. Only the middle block was different. The control’s middle block used the same convolutional layers as the model’s CI block but ordered them sequentially. Additionally, batch normalization and dropout layers were added after every convolutional layer to prevent the control from overfitting the training data. Finally, no positive-only weight constraint was enforced on the control network; it was free to adopt any weight changes that improved performance. Compared to the control, the CI block performs more computations per image and has a longer inference run time because it is recurrent. However, this is consistent with contour integration in the brain, which affects late-phase responses of V1 neurons rather than their initial responses (Li et al., 2006).
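A sketch of the control's middle block under these assumptions; the dropout probability `p_drop` is hypothetical, since the original value is not given here.

```python
import torch
import torch.nn as nn

def make_control_middle(channels=64, kernel_size=15, p_drop=0.3):
    """Feedforward control middle block: the CI block's convolutions applied
    once, in sequence, with BN and dropout after each layer. Weights are
    unconstrained in sign, unlike the CI block's Dale-constrained kernels."""
    pad = kernel_size // 2
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size, padding=pad),
        nn.BatchNorm2d(channels), nn.ReLU(), nn.Dropout2d(p_drop),
        nn.Conv2d(channels, channels, kernel_size, padding=pad),
        nn.BatchNorm2d(channels), nn.ReLU(), nn.Dropout2d(p_drop),
    )
```

Because the same large kernels are applied only once rather than iterated, the control matches the model's parameter count while doing a fraction of its computation.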
3 Results
3.1 Contour Detection
We first trained the networks with stimuli that are typically used to study biological contour integration (Field et al., 1993; Li et al., 2006, 2008). These consisted of many small edges, a few of which were aligned to form a contour, while the rest were randomly oriented to form the background (see Figure 3). Li et al. (2008) found that macaque monkeys progressively improved at detecting contours and had higher contour-enhanced V1 responses with experience on these stimuli. Hence, contour integration is learnable from these stimuli.
We constructed a data set containing 64,000 training and 6400 validation images in which contours differed in their locations, lengths (number of edges), curvature (random inter-edge rotations), and component edges (64 Gabors with different parameters). Details of the full data set are described in section 6.5.
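A minimal sketch of how such stimuli might be generated: a straight contour of aligned Gabor fragments embedded among randomly oriented background fragments. The Gabor parameters, fragment spacing, and background density are illustrative, not the data set's actual settings.

```python
import numpy as np

def gabor_patch(size=11, theta=0.0, freq=0.2, sigma=2.5):
    """Single oriented Gabor fragment (peak value 1 at the center)."""
    r = np.arange(size) - size // 2
    xx, yy = np.meshgrid(r, r)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    return np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)

def contour_stimulus(img_size=128, n_contour=7, spacing=14, n_background=40, seed=0):
    """Aligned Gabor contour through the image center plus random distractors."""
    rng = np.random.default_rng(seed)
    img = np.zeros((img_size, img_size))
    half = 11 // 2

    def paste(cy, cx, theta):
        if half <= cy < img_size - half and half <= cx < img_size - half:
            img[cy - half:cy + half + 1, cx - half:cx + half + 1] += gabor_patch(theta=theta)

    c = img_size // 2
    for k in range(n_contour):      # aligned fragments forming a straight contour
        paste(c, c + (k - n_contour // 2) * spacing, 0.0)
    for _ in range(n_background):   # randomly oriented, randomly placed distractors
        paste(int(rng.integers(half, img_size - half)),
              int(rng.integers(half, img_size - half)),
              float(rng.uniform(0, np.pi)))
    return img
```

Curved contours would follow the same scheme, with each successive fragment rotated and displaced along the rotated direction.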
Networks were tasked with identifying fragments that were part of the contour. A fragments classifier block (see Figure 2) followed the CI block to map its outputs to the desired label size. Details of the training process are described in section 6.3. Network performances were evaluated using mean intersection over union (IoU) scores between predictions and labels (see section 6.4.1). We refer to this task as contour detection due to its similarity with object detection in computer vision, but note that it differs from the kind of detection used in monkey experiments, which involves two patches of line segments and requires only selection of the patch that contains a contour (Li et al., 2008).
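A straightforward implementation of the thresholded IoU score described above (the threshold value is one example setting):

```python
import numpy as np

def mean_iou(pred, label, threshold=0.3):
    """Intersection-over-union between thresholded predictions and binary labels."""
    p = np.asarray(pred) >= threshold
    l = np.asarray(label).astype(bool)
    inter = np.logical_and(p, l).sum()
    union = np.logical_or(p, l).sum()
    # If both prediction and label are empty, the overlap is perfect by convention
    return inter / union if union else 1.0
```

Averaging this score over all images in the data set gives the mean IoU used to compare networks.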
Averaged peak IoU scores after training are shown in Table 1. For each network, the results were averaged over five independent runs, each initialized with different random seeds. The model outperformed the control by approximately 11% (validation score).
| Network | Train (%) | Validation (%) |
| --- | --- | --- |
| Model | | |
| Control | | |

Note: Peak values (mean ± 1 SD) were averaged across five independent runs for each network.
To ensure the validity of our lateral kernel size choice, we also tested control models with smaller, more standard kernel sizes. The validation IoU for the smaller of these two controls reached approximately 36%, while for the larger it reached approximately 44%. Both figures were lower than the score achieved by our selected control model.
3.1.1 Effect of Contour Length and Interfragment Spacing
To determine whether networks learned to integrate contours in a manner similar to the brain, we analyzed them for consistency with behavioral and neurophysiological data. Li et al. (2006) concurrently monitored behavioral performance and V1 neural responses of macaque monkeys as the length of embedded contours and the spacing between contour fragments were varied. At the behavioral level, contours became more salient as lengths increased. Furthermore, when contours extended in the direction of the preferred orientation of V1 neurons, firing rates monotonically increased. Conversely, when spacing between fragments increased, contours became less salient and V1 firing rates decreased monotonically.
We constructed separate test stimuli (similar to those of Li et al., 2006) for each recorded neuron. These consisted of centrally located contours of varying length and interfragment spacing, where each contour fragment was a spatially shifted copy of the neuron’s optimal within-cRF stimulus. A detailed description of the test stimuli is given in section 6.6. Examples are shown in Figures 3C and 3D.
Average IoU scores as contour length increased are shown in Figure 4A. Results were averaged over five copies of each network, each trained in the same way but initialized with different random weights. For centrally located straight contours, behavioral performance of both networks was similar. Both the contour-integration model and control networks excelled (95% or higher) at detecting the absence of contours. There were dips in performance for length-three contours as they were the hardest to detect. For all other lengths, prediction accuracy increased with length, with the model outperforming the control at larger contour lengths.
Larger contrasts between the model and control were observed when neural response gains were analyzed. Figure 4B shows population average gains as contour lengths changed, along with averaged gains from two monkeys in Li et al. (2006). In the contour-integration model network, average gains increased monotonically with contour length, similar to the monkey data. In contrast, average gains in the control network did not change appreciably with contour length. Figure 4C shows population average gains as the spacing between fragments increased. Model network gains decreased monotonically with spacing, consistent with the monkey data (Li et al., 2006). Control network gains, unexpectedly, increased with spacing.
To calculate gains in both the model and the control network, we excluded neurons that did not respond to any single Gabor fragment in the cRF (no optimal stimulus). Out of the 320 possible neurons, 188 model and 178 control neurons were retained according to this criterion. Furthermore, for population average gains, neurons that were unresponsive to any contour condition (all zero gains) and those that had outlier gains (20 or more) for any contour condition were also removed. Typically, these large gains were seen for neurons that had small responses to contours, and small changes in the CI block outputs significantly affected their gains. This resulted in the removal of an additional 36 model and 144 control neurons. Across each population (model and control), there was a wide range of enhancement gains exhibited by individual neurons, as shown in the mean ± SD shaded area in Figure 4B.
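The gain computation and exclusion criteria above can be sketched as follows; the function names are our own labels.

```python
import numpy as np

def contour_gain(contour_responses, single_fragment_response):
    """Enhancement gain: response to a multi-fragment contour divided by the
    response to the optimal single fragment within the cRF."""
    return np.asarray(contour_responses, dtype=float) / single_fragment_response

def keep_neuron(gains, outlier=20.0):
    """Inclusion rule for population averages: drop neurons that are
    unresponsive to every contour condition (all zero gains) or that have an
    outlier gain (>= `outlier`) for any condition."""
    g = np.asarray(gains, dtype=float)
    return bool(g.any()) and bool((g < outlier).all())
```

Population curves like those in Figure 4B are then means of `contour_gain` across the neurons passing `keep_neuron`.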
To better understand how responses varied across neuron populations, we plotted histograms of the slopes of linear fits to CI block outputs versus contour length and interfragment spacing. This was done for all neurons for which the optimal stimulus was found. Since outputs rather than gains were considered, we included neurons with outlier gains in these histograms. Results of the model network are shown in Figures 4D and 4E while those of the control network are shown in Figures 4F and 4G. Most model neurons showed positive slopes as contour lengths increased and negative slopes as fragment spacing increased, consistent with trends in the monkey data. In contrast, the slopes of control-network responses versus fragment length and spacing were both clustered slightly above zero. While the task performance of the model and the control networks were similar, they employed different strategies to solve the task, and only the contour-integration model network was consistent with neurophysiological data.
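The slope analysis reduces to a linear least-squares fit per neuron; a minimal sketch:

```python
import numpy as np

def response_slope(x, responses):
    """Slope of a linear fit of CI block output vs. x, where x is either
    contour length or inter-fragment spacing."""
    return np.polyfit(np.asarray(x, dtype=float),
                      np.asarray(responses, dtype=float), 1)[0]
```

Histograms of these slopes across a population (as in Figures 4D to 4G) summarize whether responses tend to grow with contour length and shrink with spacing.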
3.1.2 Lateral Connectivity Patterns
We additionally analyzed lateral kernels of trained models for consistency with neuroanatomical properties of V1 lateral connections. To maintain consistency with Dale’s principle and the approach used in the model of Piëch et al. (2013), the signs of all lateral kernel weights were constrained to be positive. Moreover, separate kernels were used to model excitatory connections onto excitatory neurons and inhibitory neurons. These constraints also facilitated visualizing these multidimensional connection patterns (see section 6.8). Example learned outgoing lateral kernels for a trained model are shown in Figures 5A and 5B. The full sets of excitatory and inhibitory learned lateral kernels of this trained model are shown in Figures S2 and S3, respectively. Their corresponding feedforward kernels are shown in Figure S1. Qualitatively, many excitatory-targeting connections were anisotropically distributed and spread out densely in the preferred orientations of the source neurons, while inhibitory-targeting connections were shorter and more omnidirectional.
We quantified the spread of lateral connections using a procedure adapted from Sincich and Blasdel (2001). They injected axon-staining dye into V1 orientation columns and characterized the staining pattern around each injection with an averaging vector. The magnitude of this vector indicated the directional selectivity of lateral connections, while its angle pointed in the direction of the densest staining. Directional selectivity was quantified using a normalized index of ellipticity, obtained by normalizing the magnitude of the averaging vector by the mean length of all lateral connection vectors. More details of the procedure are in section 6.7. An index of zero indicates an omnidirectional spread of lateral connections, while a value of one indicates a straight line. Finally, they compared the axes of elongation of lateral connections with orientation preferences of V1 columns. In 11 of the 14 injection sites, a highly elliptical distribution of lateral connections was found (ellipticity 0.42), as well as a close correspondence between the axis of elongation of lateral connections and the preferred orientation of injected V1 columns (mean difference of 11°).
We analyzed the directional selectivity and axis of elongation of lateral connections in our trained models in a similar manner. Details of how we adapted their analysis for our models are described in section 6.7. Ellipticity distributions for excitatory-targeting and inhibitory-targeting kernels of a trained model are shown in Figures 5C and 5D, respectively. The average ellipticity for excitatory-targeting kernels was found to be 0.20, while for the inhibitory-targeting kernels it was substantially lower at 0.07. Across the five trained models, we found a population-average excitatory-targeting ellipticity of 0.19 ± 0.01 (mean ± 1 SD) and inhibitory-targeting ellipticity of 0.07 ± 0.01. Excitatory-targeting connections were substantially more directed than inhibitory-targeting ones.
These ellipticity values were lower than those reported by Sincich and Blasdel (2001). Two differences in our analysis may contribute to this. First, Sincich and Blasdel (2001) were only able to include connections outside a radius of 200 μm from the injection location, while we considered all lateral connections. Second, we weighted all lateral connections by their connection strengths, so that stronger connections had a greater influence on the averaging vector, while Sincich and Blasdel (2001) considered all patch vectors to have equal weight.
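One plausible reading of the strength-weighted averaging-vector analysis, sketched for a single 2D kernel slice of nonnegative weights. Doubling angles before averaging and halving afterward handles orientation's 180° periodicity; the exact weighting used by the authors is an assumption here.

```python
import numpy as np

def averaging_vector(kernel):
    """Ellipticity and axis of elongation of a 2D lateral kernel, adapted from
    the Sincich & Blasdel (2001) averaging-vector procedure. Each off-center
    element contributes a vector weighted by its (nonnegative) strength."""
    h, w = kernel.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - h // 2, xs - w // 2
    r = np.hypot(dx, dy)
    mask = r > 0                       # exclude the kernel center
    ang = np.arctan2(dy[mask], dx[mask])
    wts = kernel[mask] * r[mask]       # strength-weighted connection vectors
    vx = np.sum(wts * np.cos(2 * ang)) # angle doubling: orientation, not direction
    vy = np.sum(wts * np.sin(2 * ang))
    total = np.sum(kernel[mask])
    mean_len = np.sum(kernel[mask] * r[mask]) / total
    ellipticity = (np.hypot(vx, vy) / total) / mean_len
    axis = 0.5 * np.arctan2(vy, vx)    # axis of elongation, radians
    return ellipticity, axis
```

A kernel concentrated along one row yields an ellipticity near 1 with axis 0, while an isotropic ring of weights yields an ellipticity near 0.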
Orientation differences between neurons’ orientation preferences and the axes of elongation of their lateral connections are shown in Figure 5E for a trained model. Each marker is scaled by the kernel’s normalized index of ellipticity, so that larger markers show more anisotropic connections. Because orientation has a period of 180°, angular differences have a potential range of ±90°. Most neurons’ axes of elongation were close to their feedforward kernel orientations (see Figure 5E). A smaller number of neurons had axes of elongation nearly orthogonal to their preferred orientation. The results were consistent across the five independently trained models. The difference between lateral connections’ axes of elongation and feedforward orientation preferences was larger than what Sincich and Blasdel (2001) found, but the trend was similar: most lateral excitatory connections project along the preferred orientation of their associated feedforward kernel.
Excitatory lateral connections onto inhibitory neurons in our model have a net inhibitory effect on excitatory neurons in surrounding columns. Previous contour integration models with fixed connection structures (Piëch et al., 2013; Li, 1998; Ursino & La Cara, 2004) typically used a similar size for both excitatory and inhibitory interactions. In contrast, our model learned smaller and more omnidirectional inhibitory-targeting kernels. Moreover, previous models aligned the orientation of lateral inhibition kernels orthogonal to the preferred direction of feedforward kernels, consistent with Kapadia et al. (2000). In contrast, our model learned inhibitory-targeting connections that were mostly aligned with the preferred orientations of feedforward kernels but more omnidirectional (see Figure 5D). These kernels are consistent with observations that short-range connections in superficial layers of V1 tend to be omnidirectional and largely suppressive (Malach et al., 1993). They are also related to a recent version of the association field model that includes short-range, omnidirectional inhibition (Field et al., 2013).
In summary, the lateral kernels in our model were qualitatively realistic in three respects: degree of elongation, alignment of elongation with neurons’ preferred orientations, and relatively omnidirectional short-range inhibitory interactions. Together with the realistic responses discussed in previous sections, this indicates that a physiologically realistic contour integration mechanism is consistent with optimizing the contour integration network for this contour detection task.
3.2 Edge Detection in Natural Images
Next, we explored whether brain-like contour integration can be learned from tasks in our natural viewing environment and whether contour integration is useful in the performance of these tasks. Despite substantial research on the mechanisms of contour integration and the phenomenon of contour pop-out, little is known about the role of contour integration in natural life and survival. Perhaps the most specific proposal to date is that contour integration may enhance detection of parts of a contour with weak local cues, such as poor contrast (Piëch et al., 2013; Li, 1998). To test this idea, we trained our network to detect edges in natural images. We used the Barcelona Images for Perceptual Edge Detection (BIPED) data set (Poma et al., 2020) as it considers all contours rather than object boundaries only. This is important because our focus is on contour integration in V1, whereas object awareness relies on more abstract representations in deeper layers. The data set contains 200 train and 50 validation (image, edge map) pairs. It was expanded to 57,600 related training images using data augmentation methods. Sample images and ground-truth labels are shown in Figures 6A and 6B.
Networks were tasked with detecting all contours in input images. Performance was evaluated using mean IoU scores (see section 6.4.1) between network predictions and ground-truth labels over all pixels in an image and all images in the data set. An edge detection block (see Figure 2) was used to map CI block outputs to the same dimensions as labels. Details of the training process are described in section 6.3.
Example predictions of trained control and model networks are shown in Figures 6C and 6D, respectively. Visually, differences between their predictions are subtle. Validation IoU scores over the time course of training, for a detection threshold of 0.3 (see section 6.4.1), are shown in Figure 7A. Both networks achieved their highest mean IoU scores (0.45) at this threshold. The mean IoU scores of both networks were similar. The CI block had little impact on overall performance, suggesting that the physiology of contour integration may not be essential for reliable detection of a wide variety of edges in natural scenes. To further explore this point, we trained a version of the model in which the lateral connections had a much smaller spatial extent than the model’s 15 × 15 kernels. This variant also reached the same peak performance.
3.2.1 Weak versus Strong Edge Pixel Detection
In natural images, contours have nonuniform strengths, and some parts are easier to detect than others. Li (1998) and Piëch et al. (2013) showed that contour integration can potentially enhance weak contours. However, results were qualitatively analyzed only using a single image. Although we found that contour integration did not improve detection of a wide variety of contours, including weak contours, contour integration may still strengthen low-level responses to weak contours. To investigate this question, we plotted the difference between model and control outputs as a function of the control outputs, pixel-by-pixel. Details of the procedure we used are described in section 6.9.
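The pixel-wise comparison might be implemented by binning control outputs and averaging the model-minus-control difference within each bin; the bin count here is illustrative.

```python
import numpy as np

def binned_output_difference(model_out, control_out, mask, n_bins=10):
    """Mean (model - control) output as a function of control output, computed
    over the pixels selected by `mask` (e.g., edge or non-edge pixels)."""
    m = model_out[mask].ravel()
    c = control_out[mask].ravel()
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(c, bins) - 1, 0, n_bins - 1)
    diff = np.full(n_bins, np.nan)   # NaN marks empty bins
    for b in range(n_bins):
        sel = idx == b
        if sel.any():
            diff[b] = np.mean(m[sel] - c[sel])
    return bins, diff
```

Running this separately with an edge-pixel mask and a non-edge-pixel mask gives curves analogous to Figures 7B and 7C.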
The results are shown in Figure 7B for edge pixels and in Figure 7C for nonedge pixels. On average, the model had higher edge predictions for weaker edges (up to a control output of 0.3). For stronger edges, the control network responded more strongly on average. For nonedge pixels, model outputs were on average lower than control outputs for all control outputs above 0.2, showing that the model had a lower tendency toward false-positive edge detection. In summary, contour integration strengthened the representation of weak edges, but this had little practical effect on detection of weak edges at the most effective discrimination threshold.
3.3 Naturalistic Contour Processing
Contour integration may support other kinds of reasoning about contours in natural scenes—for example, determining which branch to climb in order to reach some fruit. To investigate this possibility, we devised a new visual perception task. Specifically, we trained the model to detect whether two points in a natural scene were part of the same contour. We placed two markers in each image. In some cases, the markers were connected by a single contour in the image, while in others, they were placed on different contours. We additionally punctured input images with occlusion bubbles to fragment the contours. This made it difficult to rely solely on edge extraction to solve the task. Example images are shown in Figure 8C.
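A sketch of the occlusion-bubble puncturing; the bubble count, radius, and mean-value fill are our assumptions about how such fragmentation could be implemented.

```python
import numpy as np

def puncture(image, n_bubbles=8, radius=7, seed=0):
    """Fragment contours by overwriting random circular 'occlusion bubbles'
    with the image's mean intensity."""
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w = image.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    fill = image.mean()
    for _ in range(n_bubbles):
        cy = int(rng.integers(0, h))
        cx = int(rng.integers(0, w))
        out[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = fill
    return out
```

Because the bubbles break contours into fragments, a network cannot rely on locally continuous edges alone to decide whether two markers lie on the same contour.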
We constructed a data set of 50,000 training contours and 5000 validation contours that were extracted from the BIPED data set (Poma et al., 2020). Details of the data set and how it was constructed are described in section 6.10. A binary classifier block (see Figure 2) was used to map CI block outputs to binary decisions, that is, whether the pair of markers in each image was connected by a smooth contour. Performance was measured by comparing the accuracy of network predictions with labels. Training details are described in section 6.3.
Table 2 shows peak classification accuracies averaged across five independent runs for all networks. Over the whole data set, the model performed about 5% worse than the control (validation accuracy).
| Network | Train (%) | Validation (%) | Test (%) |
|---|---|---|---|
| Model | 70.52 ± 0.95 | 77.27 ± 1.55 | 70.39 |
| Control | 77.54 ± 0.44 | 82.67 ± 0.53 | 65.65 |
Notes: Peak values (mean ± 1 SD) were averaged across five independent runs for each network. The Test column shows results for test stimuli not seen during training, which had a constant interfragment distance of RCD = 1.
3.3.1 Effect of Interfragment Spacing
When occlusion bubbles were systematically added along contours rather than randomly placed throughout the image, classification accuracies of all networks dropped, even for the smallest bubble size (see Table 2, Test column). However, the relative drop in performance for the model (about 6%) was significantly less than that of the control (about 17%), showing that the strategy employed by the model generalized better from the training data to these new stimuli. Figure 9A shows the effect of fragment spacing on the behavioral performance of the networks. From the least to the most spacing, model performance monotonically dropped by about 4%, consistent with trends in the synthetic contour detection task. The control was unaffected by interfragment spacing.
Figure 9B shows population-averaged contour integration gains as interfragment spacing increased. Population averages were found by averaging gains of individual neurons for which the optimal stimuli were found and across all five networks (trained from different random initializations) of each type. Model results were averaged across 293 neurons, while control results were averaged across 120 neurons. Response gains of the control network were similar regardless of spacing, in contrast with their marked increase with spacing in the synthetic contour task. Response gains in the model decreased with increasing fragment spacing, consistent with the synthetic contour task, although the changes were less pronounced in this case.
We further analyzed the impact of fragment spacing on output activations using linear fits of output activation versus fragment spacing of individual neurons. Histograms of the slopes are plotted in Figures 9C and 9D for the control and the model networks, respectively. Similar to population-averaged gain results, model outputs dropped more sharply while control output activations only dropped slightly as spacing increased.
Overall, the model behaved more consistently than the control. Its performance was less affected by new stimuli outside the training distribution, and its responses to fragment spacing were similar to those of both synthetic contours and natural images.
3.4 The Effect of Separating Excitation and Inhibition
On the fragmented contours data set, the RPCM network outperformed the model by about 7% and the control by about 18% (train and validation IoU, averaged across three networks), even though it was trained for half the time (see section 6.3). The effect of contour length on behavioral performance is shown in Figure 10A. For all contour lengths, IoU scores of the RPCM network were higher than those of the model. Moreover, performance monotonically increased for contours of length three or longer, consistent with behavioral data. Neuron response gains also increased monotonically with contour length (see Figure 10B; results averaged over 149 neurons from three networks). However, these increases were not as pronounced as those of the model network. Similarly, RPCM neurons responded less to more widely spaced fragments, but the difference was not as pronounced as in the model network (see Figure 10C).
On the task of edge detection in natural images, RPCM networks peaked at a mean IoU score of 0.46 and slightly outperformed other networks (see Figure 10D). Like contour-integration model neurons, RPCM neurons had larger responses to weak edges than control neurons (see Figure 10E). Relative to control responses, RPCM responses varied in much the same way as model responses, although the variations were somewhat less pronounced. Similar to the contour-integration model, RPCM networks enhanced weaker contours, but this did not substantially affect task performance.
There were larger differences between the model and the RPCM on the task of contour tracing in natural images. The RPCM network outperformed the model by about 13% and the control by about 8% (Train 92.90 ± 0.14, Validation 90.62 ± 0.21). When tested with contours that were fragmented with fixed interfragment spacing, RPCM network performance dropped by about 6% (Test 84.36). The drop was similar to what was observed for the model and was substantially less than that of the control. RPCM networks retained the generalization properties of the model while improving overall performance. Performance also dropped monotonically with interfragment spacing (see Figure 10G), similar to the model. Neural response gains in the RPCM also decreased with increasing fragment spacing (see Figure 10H, averaged across 257 neurons), intermediate between the model and control gains.
In summary, RPCM networks trained more quickly than contour-integration model networks and outperformed both the model and the control on every task. RPCM neurons’ responses to contour length and fragment spacing were intermediate to those of control and model neurons, but qualitatively consistent with monkey data (i.e., stronger responses with longer contours and tighter fragment spacing). Thus, Dale’s principle may have helped to account for monkeys’ neural responses, while at the same time it was functionally counterproductive in these networks and tasks.
Despite the separation of excitation and inhibition in the brain, the functional connection from any neuron to another could, in principle, be either excitatory or inhibitory depending on the strengths of direct connections and indirect connections through inhibitory interneurons (Parisien et al., 2008). We wondered whether there was a similar equivalence in our model network. Analysis of the dynamic equations (see the appendix) indicated that the contour-integration model could become functionally equivalent to the RPCM at steady state. This suggests that functional differences may be due to transient responses and/or the model being more difficult to optimize with standard algorithms in deep learning.
4 Discussion
As a category, deep networks are the most realistic models of the brain in terms of neural representations (Schrimpf et al., 2018) and behavior, including near-human performance on a wide range of vision tasks. However, they lack many well-known mechanisms that seem to prominently affect the function of real brains. Local circuit models (Li, 1998; Piëch et al., 2013; Mély et al., 2018; Rubin et al., 2015; Carandini & Heeger, 2012) have the opposite limitation. They reflect specific physiological phenomena faithfully but lack sophisticated perceptual abilities. Each of these approaches has limitations that might be alleviated by integration with the other, but such integration is rare.
Contour integration in particular has been studied extensively, but the scope of its role in visual perception is uncertain. Contour integration in V1 may occur too late (Li et al., 2006) to drive core object recognition, which involves selectivity in inferotemporal cortex 100 ms after stimulus onset (DiCarlo et al., 2012). It is not necessary for visual motion perception, which proceeds robustly in the absence of contours (Shadlen et al., 1996). Contour integration may play a role in later stages of object recognition, together with dynamics in higher areas of the ventral stream. It could also bias core object recognition if inferotemporal neurons learn to predict their future inputs. Such a mechanism might help to account for humans’ greater reliance on contours in object recognition compared with deep networks (Geirhos et al., 2019; Baker et al., 2018). Contour integration has been proposed to strengthen the representation of weak edges in complex scenes (Li, 1998; Piëch et al., 2013). It seems also to play a role in perceptual grouping, related to the Gestalt laws of good continuation, proximity, and similarity (Wertheimer, 1938; Elder & Goldberg, 2002). It may also be involved in segmentation, or in other kinds of reasoning about visual scenes. Integrating local circuit models into a deep network may help to clarify the plausibility of various potential roles of contour integration in higher-level visual tasks and may lead to new questions and predictions.
4.1 Main Findings
Our integration of a contour integration model with a deep network has produced new insights, discussed below.
Realistic physiology emerges from training the model to detect contours in a background of randomly oriented line segments. In contrast with past work, our model was initialized with random synaptic weights and optimized as a whole to perform various tasks. When we trained the model to perform a contour detection task, similar to tasks that have been used to study contour integration in monkeys and humans, the model learned a physiologically realistic local circuit organization. Specifically, neurons in the trained model had local edge responses that were enhanced in the presence of contours, and this enhancement varied with contour length and contour fragment spacing in physiologically realistic ways. Neurons in a similar feedforward network that was trained to perform the same task did not have physiologically realistic contour responses. Their responses did not depend appreciably on contour length, and they increased instead of decreasing with contour fragment spacing. Furthermore, our contour integration model learned excitatory lateral connections that were elongated and largely aligned with neurons’ preferred orientations, as observed in the brain (Sincich & Blasdel, 2001). Past models established that such lateral connection patterns can produce realistic contour responses, but they did so by imposing the connection patterns by hand. Our work reinforces this link by showing that the connection patterns emerge consistently from an optimization process. In other words, we showed that both the lateral connections and physiological responses associated with contour integration are optimal for detecting contours in these synthetic stimuli, among a fairly generic family of networks with broad lateral connections and separate excitatory and inhibitory neurons.
Contour fragment spacing affects response gains similarly in natural and synthetic images. We occluded contours in natural images to test how spacing of visible contour fragments would affect contour gains. We found that greater fragment spacing monotonically reduced response strength. This result was qualitatively similar to the effect of contour spacing in synthetic images, although it was less pronounced. We do not believe that the effect of contour fragment spacing in natural images has been tested in monkeys. This would be informative, because the response patterns observed so far may only occur in response to specialized synthetic images, which would limit their ethological relevance. However, our computational results suggest that the phenomenon can generalize beyond synthetic images.
A contour integration model strengthens the representation of edges with weak local cues in natural images. We trained the contour integration network to detect edges in natural scenes. Compared with a feedforward control network, this network responded more uniformly to local edge cues, with stronger responses to weak edges and weaker responses to strong edges. This confirms a suggestion by Piëch et al. (2013) and Li (1998) that was previously tested with only a single image. However, despite these changes in local edge representation, we did not find that the contour integration model facilitated edge detection overall. The weakest edges were strengthened the most, but not enough that they exceeded the detection threshold. Indeed, because the transition from strengthening to weakening occurred near the detection threshold and because the differences were not sufficiently pronounced (specifically, the slope in Figure 7B was less than 1), the differences in representation had little effect on edge detection. These results elaborate a previous proposal about the role of contour integration in natural images. However, while the use of natural images goes part of the way toward confronting the role of contour integration in natural life, edge detection per se has limited survival value. It may be fruitful in the future to consider edge representations in service of a higher-level perceptual task. In such a context, effects of contour integration below the edge detection threshold may become more relevant.
The contour integration mechanism can impair contour following. When we trained the model to determine whether two points in a natural image belonged to the same contour, the model performed substantially worse than the feedforward control (about 77% versus about 83% correct; chance performance 50%). This outcome was consistent with the impressive performance of standard convolutional networks in a wide range of vision tasks. However, it was unexpected, because the task directly involved contours. This outcome was also complicated by two factors. First, the model was better able to generalize to new stimuli than the control network. Second, the RPCM variation of the model, which did not respect Dale’s principle, outperformed the control (about 91% correct). The RPCM appropriately constrains the signs of net lateral influences and exhibits physiological responses that are more realistic than those of the control network. These results indicate that recurrence in general facilitates this task and, more specifically, that recurrence with some physiological properties can be beneficial. Results with the model network also show that contour integration can produce a solution that generalizes well outside the range of prior experience. However, the results do not support our expectation that physiologically realistic contour integration would improve performance of this task.
Dale’s principle consistently impaired performance. As a general rule, neurons release the same small-molecule neurotransmitter at each synapse (Dale’s principle), leading to distinct groups of excitatory, inhibitory, and modulatory neurons. Accordingly, our model had separate groups of excitatory and inhibitory neurons. We also tested a variant of the model (the relaxed positivity constraint model, RPCM) that did not respect Dale’s principle but allowed the optimization process to make any synaptic weight either excitatory or inhibitory. In every task, the RPCM outperformed the more biologically grounded model. This is unsurprising because Dale’s principle amounts to a constraint on the model parameters. It is for this reason that Dale’s principle has not been adopted in deep learning.
It is unclear why Dale’s principle has been adopted in the brain, for that matter. Exceptions suggest that it could have been otherwise. For example, glutamate is normally excitatory but has inhibitory effects associated with certain receptors (Katayama et al., 2003). Some neurons elicit a biphasic inhibitory-excitatory response due to cotransmission of dopamine and GABA (Liu et al., 2013) or glutamate and GABA (Shabel et al., 2014), and others change from excitatory to inhibitory depending on the presence of brain-derived neurotrophic factor (Yang et al., 2002). So the fact that excitation and inhibition are largely separate in the brain seems to suggest that this separation is consistent with effective information processing in ways that have yet to be exploited in deep networks.
The fact that Dale’s principle impaired our model could indicate that it impairs performance of contour-related tasks in the brain or that our model is missing other factors (e.g., feedback from higher areas, or a different kind of plasticity) that keep it from impairing performance in the brain. Consistent with the former possibility, the model that respected Dale’s principle produced the most physiologically realistic responses. However, there may be another solution that has both realistic physiology and superior task performance. Analysis of the dynamic equations indicates that the model and RPCM can become equivalent in certain conditions. This may suggest that the constrained and unconstrained models could learn similar behavior given suitable learning rules. Related to this, recent work (Cornford et al., 2021) has shown that carefully designed feedforward networks with separate layers of excitatory projection neurons and intermediate inhibitory neurons can learn as well as standard deep networks, and an extension of this approach to recurrent networks was proposed, a promising direction for future work, although a related approach was shown to introduce new modes of instability in recurrent networks (Tripp & Eliasmith, 2016). Alternatively, while our model learned task-optimized lateral connections, unsupervised learning of lateral connections, as in Iyer et al. (2020), might be more effective.
4.2 Related Work
Apart from an earlier version of this work (Khan et al., 2020), our model is most closely related to the horizontal gated recurrent unit (hGRU) model (Linsley et al., 2018), which similarly embeds a learnable circuit model of a low-level neural phenomenon into a larger ANN. Here we discuss some of the distinctions from that work. First, the objectives were different. Whereas we sought to test a physiologically grounded circuit model within a deep network, the purpose of the hGRU model was to improve task-level performance by using lateral connections to address the inefficient detection of long-range spatial dependencies in CNNs. Many biological constraints were relaxed to achieve higher performance. Second, the two models use different embedded circuit models. The hGRU model uses the circuit model of Mély et al. (2018), a model of surround modulation, while our model uses the contour integration circuit model of Piëch et al. (2013). Third, recurrent interactions in the hGRU model are derived from gated recurrent unit (GRU) networks (Chung et al., 2014). These networks are trainable and expressive, but their internal architectures are complex and difficult to map onto circuits of the brain. Fourth, because we constrained our learned lateral connections to be positive only, a more detailed analysis and comparison of lateral kernels was possible. In particular, we were able to compare the axis-of-elongation of lateral kernels with orientation preferences.
The V1Net model of Veerabadran & de Sa (2020) also incorporates biologically inspired lateral connections into ANNs for contour grouping tasks. The model is similar to hGRU (Linsley et al., 2018) but derives its recurrent interactions from convolutional long short-term memory (conv-LSTM) networks (Shi et al., 2015). Consistent with the results of the hGRU model, they find that certain recurrent ANNs, especially those with biological constraints, can match or outperform a variety of feedforward networks, including those with many more parameters. Moreover, on these tasks, they train more quickly and are more sample efficient.
5 Conclusion
Local circuits are of much interest in neuroscience, but their roles in perception and behavior are mediated by the rest of the brain. Ideas about these relationships can be tested for plausibility by integrating biologically grounded models of local circuits into functionally sophisticated deep networks. Overall, our work to integrate a contour integration model into a deep network has not supported a role for this circuit in the natural-image tasks we investigated (a contour following task and detection of edges in complex natural images). This may be due to limitations of the model, although the model’s physiologically realistic responses suggest that it has much in common with the brain circuit. More work is needed to determine whether incorporating other physiological factors might produce a model that is more effective (similar to our model variant without constraints on the weight signs) without being less realistic and to test the role of contour integration in a wider range of tasks. This line of work may be important for understanding the role of contour integration in natural life.
6 Methods
6.1 Contour Integration Block Parameters
The architecture of the model’s contour integration (CI) block is shown in Figure 2. In the brain, V1 lateral connections of orientation columns are sparse and preferentially connect with other orientation columns with similar selectivities (Malach et al., 1993). Furthermore, these connections are long and can extend up to eight times the classical receptive fields (cRF) of V1 neurons (Stettler et al., 2002). Rather than using hard-coded lateral connections, we connected all columns within an n × n neighborhood and used task-level optimization to learn them. For edge extraction, we used the first convolutional layer of a ResNet50 (He et al., 2016). It uses 7 × 7 kernels, and we defined n to be larger than this, so that lateral connections extended well beyond the cRF. Additionally, a sparsity constraint was used during training to retain only the most important connections (see section 6.3).
Incoming feedforward signals iterated through the CI block for a fixed number of time steps before E node outputs were passed to deeper layers. The number of steps was chosen as a good trade-off between performance and run time. Connection strengths were initialized to 0.1, while time constants were initialized to 0.5. Each neuron incorporated a rectified linear unit (ReLU) activation function, except where noted.
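To make the iteration concrete, here is a minimal numpy sketch of a CI-block-style recurrence. It is an illustration only: the E/I update form, parameter values, and the use of dense matrices in place of convolutional lateral kernels are our simplifications, not the exact Piëch et al. (2013) dynamics.

```python
import numpy as np

def ci_block(ff_input, J_e, J_i, n_iter=5, tau_e=0.5, tau_i=0.5):
    """Simplified CI-block recurrence: excitatory (x) and inhibitory (y)
    nodes are updated for a fixed number of steps, after which the
    rectified E-node output is passed to deeper layers. J_e and J_i are
    positive-only lateral weights targeting the E and I nodes,
    respectively (dense matrices standing in for convolutional kernels)."""
    relu = lambda v: np.maximum(v, 0.0)
    x = np.zeros_like(ff_input)  # excitatory state
    y = np.zeros_like(ff_input)  # inhibitory state
    for _ in range(n_iter):
        lat_e = J_e @ relu(x)  # excitatory-targeting lateral drive
        lat_i = J_i @ relu(x)  # inhibitory-targeting lateral drive
        x = (1 - tau_e) * x + tau_e * (ff_input + lat_e - relu(y))
        y = (1 - tau_i) * y + tau_i * lat_i
    return relu(x)
```

With all lateral weights zero, the E node simply low-pass filters its feedforward input toward steady state.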
6.2 Classification Blocks of the Network
For the task of detecting fragmented contours, CI block outputs were fed into the fragments classifier block (see Figure 2). It consisted of two convolutional layers. The first layer contained 16 kernels of size 3 × 3 and used a stride of 1, while the second layer used a single kernel of size 1 × 1. There was a batch normalization layer between the two convolutional layers. The final convolutional layer used a sigmoid nonlinearity to generate prediction maps.
For the task of edge detection in natural images, CI block outputs were passed to an edge detection block (see Figure 2). These outputs were upsampled by a factor of 4, using bilinear interpolation, to restore them to the input size. Upsampled activations were passed through two convolutional layers before prediction maps were generated. The first convolutional layer contained eight kernels of size 3 × 3 and used a stride of 1. There was a batch normalization layer after the first convolutional layer. The last convolutional layer contained a single kernel of size 1 × 1 and was used to flatten activations to a single channel. Outputs of the final convolutional layer were passed through a logistic sigmoid nonlinearity to generate prediction maps.
For the task of detecting whether two markers were connected by a smooth contour, CI block outputs were passed to the binary classifier block (see Figure 2), which also consisted of two convolutional layers. The first convolutional layer consisted of eight kernels of size 3 × 3 and used a stride of 3. As in the other detection blocks, there was a batch normalization layer after the first convolutional layer. The final convolutional layer used a single kernel of size 1 × 1 and a stride of 1. Finally, a global average pooling layer (Lin et al., 2013) mapped output activations to a single value that could be compared with image labels.
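As an illustration of the final stage of this head, the sketch below combines a 1 × 1 convolution (a weighted sum over channels), global average pooling, and a logistic sigmoid. The weights and shapes are hypothetical, not the trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_head(activations, w_1x1, bias=0.0):
    """Sketch of the final stage of the binary classifier block:
    a 1x1 convolution (a weighted sum over channels), global average
    pooling over space, and a logistic sigmoid. `activations` has shape
    (channels, H, W); `w_1x1` has shape (channels,)."""
    single_channel = np.tensordot(w_1x1, activations, axes=([0], [0]))  # (H, W)
    pooled = single_channel.mean() + bias  # global average pooling
    return float(sigmoid(pooled))
```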
6.3 Training
Networks were trained to minimize a binary cross-entropy loss,
L = -(1/N) Σ_i [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ],
where y_i is the label and ŷ_i is the network prediction. Here, the sum runs over all images as well as over all predictions per image, and N is the resulting total number of predictions.
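Assuming the loss is the standard binary cross-entropy averaged over all predictions (consistent with the sigmoid outputs described above), it can be computed as:

```python
import numpy as np

def bce_loss(labels, predictions, eps=1e-7):
    """Binary cross-entropy averaged over every prediction in every image.
    `predictions` are post-sigmoid values in (0, 1); eps guards the logs."""
    p = np.clip(predictions, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)))
```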
For the sparsity constraint, the distance parameter was set to 10 pixels, while the penalty weight was set to 1e-4. Learned lateral connections of the model (but not the control) were restricted to be positive only: after every weight update step, negative weights were clipped to 0.
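The positivity constraint can be sketched as a projection applied after each optimizer step (the function name is ours):

```python
import numpy as np

def clip_lateral_weights(weights):
    """Project lateral weights back onto the positive orthant after a
    gradient update, keeping learned lateral connections excitatory-only.
    The RPCM variant simply skips this step."""
    np.clip(weights, 0.0, None, out=weights)  # in-place clamp of negatives to 0
    return weights
```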
All networks were trained with the Adam optimizer (Kingma & Ba, 2014). In the synthetic contour fragments detection and contour tracing in natural images tasks, both the model and the control were trained for 100 epochs with a starting learning rate of 1e-4, which was reduced by a factor of two after 80 epochs. The RPCM network was trained for 50 epochs with the same starting learning rate, which was dropped by a factor of two after 40 epochs. Trained RPCM networks had fully converged after 50 epochs and did not noticeably improve with additional training. For edge detection in natural images, networks were trained for 50 epochs with an initial learning rate of 1e-3, which was reduced by a factor of two after 40 epochs. A fixed batch size of 32 images was used in all tasks.
All input images were fixed to a size of 256 × 256 pixels, resizing images and labels when necessary. Input pixels were preprocessed to be approximately zero-centered with a standard deviation of one on average. Synthetic contour fragment images were normalized with data set channel means and standard deviations, while natural images were normalized with ImageNet values. In the contour tracing in natural images tasks, input images were punctured with occlusion bubbles as described in section 6.10.
6.4 Metrics
6.4.1 Mean Intersection-over-Union
To get binary network predictions for an image, network outputs were passed through a sigmoid nonlinearity and thresholded. The intersection with the labels was found by multiplying the predictions with their corresponding labels, while the union was found by summing labels and predictions followed by subtracting the intersection of the two. An IoU score of 1 signifies a perfect match between predictions and labels, while an IoU score of 0 means that there is no match between what the network predicted and the label. Mean IoU score was found by averaging IoU scores over the full data set.
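Following the description above, a minimal IoU computation for one image might look like this (a sketch; thresholding and reductions as described):

```python
import numpy as np

def iou_score(predictions, labels, threshold=0.5):
    """IoU for one image: threshold the (post-sigmoid) predictions,
    intersect via elementwise product, and take the union as the sum of
    both maps minus the intersection."""
    binary = (predictions >= threshold).astype(float)
    intersection = float((binary * labels).sum())
    union = float(binary.sum() + labels.sum() - intersection)
    return intersection / union if union > 0 else 1.0
```

The mean IoU is then the average of this score over the data set.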
For the contour fragments data set, a threshold of 0.5 was used, while for the contour detection in natural images tasks, a value of 0.3 returned the best scores. IoU scores dropped monotonically as the detection threshold deviated from 0.3 for all networks.
6.5 Synthetic Contour Fragments Stimuli
We used stimuli similar to those of Field et al. (1993). Each input stimulus consisted of a 2D grid of tiles that contained Gabor fragments that were identical except for their orientations and positions. The orientations and locations of a few adjacent fragments were aligned to form a smooth contour. The remaining (background) fragments had randomly varying orientations and positions.
To construct each stimulus, first, a Gabor fragment, a contour length (in number of fragments), and a contour curvature, β, were selected. Each Gabor fragment was a square tile the same size as the cRF (kernel spatial size) of the preceding edge-extracting layer. Second, a blank image was initialized with the mean pixel value of all boundary pixels of the selected Gabor. Third, the input image was sectioned into a grid of squares (full tiles) whose side length was the pixel length of a fragment plus the desired interfragment spacing. The grid was aligned so that the center of the image coincided with the center of the middle full tile. Fourth, a starting contour fragment was randomly placed in the image. Fifth, the location of the next contour fragment was found by projecting a vector of length equal to the full-tile length and orientation equal to the previous fragment’s orientation. A small random direction change and distance jitter were added to prevent these from appearing as cues to the network. Sixth, a fragment rotated by β was added at this position. The fifth and sixth steps were repeated, adding contour fragments to both ends of the starting fragment, until the selected contour length was reached. Seventh, background fragments were added to all unoccupied full tiles. Background fragments were randomly rotated and positioned inside the larger full tiles. Finally, a binary label was created for each full tile indicating whether it contained the center of a contour fragment.
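Steps 4 to 6 can be sketched as a fragment-placement loop. This is a one-directional simplification with illustrative jitter magnitudes; the actual stimuli extend contours from both ends of the starting fragment.

```python
import numpy as np

def contour_fragment_centers(start, n_fragments, spacing, orientation, beta, rng):
    """Sketch of contour construction: starting from `start`, repeatedly
    project a vector of length `spacing` along the current fragment
    orientation (plus curvature `beta` and small random jitter) to place
    the next fragment center. Jitter magnitudes are illustrative."""
    centers = [np.asarray(start, dtype=float)]
    theta = orientation
    for _ in range(n_fragments - 1):
        theta = theta + beta + rng.uniform(-0.1, 0.1)  # direction change + jitter
        d = spacing + rng.uniform(-1.0, 1.0)           # distance jitter
        centers.append(centers[-1] + d * np.array([np.cos(theta), np.sin(theta)]))
    return np.array(centers)
```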
In all training images, interfragment spacing and fragment length were equal. A fixed input image size of 256 × 256 pixels was used. Gabor fragments of size 7 × 7 pixels and full tiles of size 14 × 14 pixels were used in stimulus construction. This resulted in labels of size 19 × 19 for each input stimulus.
The full data set contained 64,000 training and 6,400 validation images. In its construction, 64 different Gabor types, contour lengths of 1, 3, 5, 7, and 9 fragments, and a range of interfragment rotations were used. Gabor parameters were manually picked, with the only restriction that the Gabor fragment visually appear as a well-defined line segment. Each Gabor fragment was defined over three channels, and the data set included colored as well as standard black-and-white stimuli. Length-1 stimuli were included to teach the model not to do contour integration when there are no co-aligned fragments outside the cRF. Contour integration requires inputs from outside the cRF, and the model had to learn when not to apply enhancement gains. For these stimuli, the label was set to all zeros. An equal number of images was generated for each condition. Due to the random distance jitter, interfragment rotations, and the location of contours, multiple unique contours were possible for each condition. Moreover, background fragments varied in each image.
6.6 Test Synthetic Contour Fragments Stimuli
We used test stimuli similar to those of Li et al. (2006). These consisted of centrally located contours of different lengths and interfragment spacings. Test stimuli were similar to training stimuli except that the starting contour fragment was always centered at the image center. This ensured that centrally located neurons (whose outputs were measured) always received a full stimulus within their cRF. Furthermore, test stimuli were constructed in an online manner whereby the optimal stimulus of each centrally located neuron in each channel was first found by checking which of the 64 Gabor fragments elicited the maximum response in the cRF. Next, contours were extended in the direction of the preferred orientations of the selected Gabors. The effects of contour length were analyzed using contours of 1, 3, 5, 7, and 9 fragments and a fixed spacing of RCD = 1 (see Figure 3C). The effects of interfragment spacing were analyzed using RCD = [7, 8, 9, 10, 11, 12, 13, 14]/7 and a fixed contour length of 7 fragments (see Figure 3D). For each condition, results were averaged across 100 different images.
6.7 Lateral Kernel Analysis
We followed the method of Sincich and Blasdel (2001) to quantify the directional selectivity and find the axis-of-elongation of lateral connections. First, Sincich and Blasdel (2001) identified locations where stained lateral connections terminated in clusters (patches). Next, they constructed vectors originating at the injection site and ending at patch centers. Given the set of patch vectors of a V1 orientation column, an averaging vector was computed. Because patch vectors pointing in opposite directions represent lateral connections extending along the same axis, the orientations of individual patch vectors were doubled before computing the vector sum. Consequently, patches in opposite directions summed constructively, while orthogonal patches summed destructively. After computing the vector sum, the resultant angle was halved to get the axis-of-elongation. To quantify directional selectivity, the magnitude of the averaging vector was normalized by the sum of the magnitudes of all patch vectors, giving a normalized index of ellipticity.
Similarly, for trained models, we analyzed the lateral kernels of the CI block. Each input channel of lateral kernels receives output from a specific kernel in the previous feedforward layer. This signal is passed to all other neurons in a defined area as specified by its individual connection kernel. For each input channel, we constructed patch vectors starting at the kernel center and extending to each nonzero weight. This was slightly different from the approach of Sincich and Blasdel (2001) as we considered all weights rather than patch centers only. Moreover, individual patch vectors were weighted with their connection strengths. Stronger weights contributed more to the average vector compared to weaker ones. Next, similar to Sincich and Blasdel (2001), we computed an averaging vector and used it to compute the axis-of-elongation and directional selectivity of lateral connections.
Axes-of-elongation of lateral connections were compared with the preferred orientation of the source feedforward edge extraction neurons. To find the preferred orientation of a kernel in the edge extraction layer, we least-square-fit each channel to a 2D Gabor function defined by eight parameters: the x and y location of its center; its amplitude; the orientation, wavelength, and phase offset of its sinusoid component; and the spatial extent and the ratio of the spread in the x versus y direction of its Gaussian envelope. The orientation of the channel with the highest amplitude was selected as the kernel's preferred orientation. Orientation preferences of the pretrained edge extraction kernels are shown in Figure S1 (see the Supplementary Information section). We found Gabor fits for 42 of the 64 kernels of the edge extraction layer.
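A minimal sketch of the Gabor fit, assuming SciPy's least-squares fitter; the function names, the parameter ordering, and the initial-guess strategy are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np
from scipy.optimize import curve_fit

def gabor_2d(xy, x0, y0, amp, theta, lam, phase, sigma, gamma):
    """2D Gabor with the eight parameters listed in the text: center
    (x0, y0), amplitude, sinusoid orientation/wavelength/phase, and a
    Gaussian envelope with spread sigma and x-vs-y ratio gamma."""
    x, y = xy
    xr = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
    yr = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
    env = np.exp(-(xr**2 + (gamma * yr) ** 2) / (2 * sigma**2))
    return (amp * env * np.cos(2 * np.pi * xr / lam + phase)).ravel()

def fit_gabor(kernel_channel, p0):
    """Least-squares fit of one kernel channel to gabor_2d.  Fits
    that fail to converge are discarded, which is why only 42 of the
    64 kernels were fit in the text."""
    h, w = kernel_channel.shape
    xy = np.meshgrid(np.arange(w), np.arange(h))
    popt, _ = curve_fit(gabor_2d, xy, kernel_channel.ravel(),
                        p0=p0, maxfev=5000)
    return popt
```

In practice one would fit every channel of a kernel and take the orientation parameter of the channel with the highest fitted amplitude as the kernel's preferred orientation.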
Only those lateral kernels for which the orientation of the corresponding feedforward kernel was found were included in the analysis. Excitatory-targeting and inhibitory-targeting lateral kernels were analyzed separately.
6.8 Lateral Kernel Visualization
In the model, lateral connections were implemented using convolutional layers within the CI block. Each convolutional layer had kernels of size [C_out, C_in, S, S], where C_in is the number of feedforward input channels, C_out is the number of output channels, and S is the spatial extent of the lateral connections. To visualize the lateral connections of a particular feedforward kernel in the preceding layer (input channel), we summed over output channels and plotted the spatial spread of the result.
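The summation reduces to one indexing-and-sum step; this sketch assumes the PyTorch weight layout [out_channels, in_channels, kH, kW]:

```python
import numpy as np

def lateral_spread(lateral_weights, in_channel):
    """Spatial spread of the lateral connections originating from one
    feedforward kernel (input channel).

    lateral_weights : array of shape (C_out, C_in, S, S), the weight
                      tensor of one lateral convolutional layer
                      (PyTorch layout assumed).
    Summing over the output-channel axis collapses the kernel to a
    single S x S map that can be plotted directly.
    """
    return lateral_weights[:, in_channel, :, :].sum(axis=0)
```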
6.9 Comparison of Network Prediction Strengths in Natural Images
To compare predictions of the model and the control at different edge strengths, prethreshold outputs of both networks were first collected over the entire BIPED validation data set. Second, a sliding window of size 0.2 was run over the control outputs to select pixels whose predictions lay within the window. Third, the corresponding predictions of the model were found. Fourth, the average difference between model and control predictions was calculated. The process was repeated over the full prediction range (0, 1) by sliding the window at intervals of 0.1.
Edge pixels and nonedge pixels were analyzed separately. To extract edge predictions, network outputs were multiplied with the ground-truth mask; to isolate nonedge pixels, network outputs were multiplied with the inverted ground-truth mask. For edge pixels, a mean difference above zero suggests that the model is better at detecting pixels of the corresponding strength. For nonedge pixels, a mean difference below zero indicates that the model has a lower tendency toward false positives.
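The sliding-window comparison above can be sketched as follows; the function name and the NaN convention for empty windows are assumptions:

```python
import numpy as np

def windowed_prediction_difference(model_out, control_out, mask,
                                   window=0.2, step=0.1):
    """Mean (model - control) prediction difference, binned by the
    control network's prediction strength.

    model_out, control_out : prethreshold predictions in [0, 1].
    mask : boolean array selecting the pixels to analyze
           (ground-truth edge pixels, or the inverted mask for
           nonedge pixels).
    Returns (window centers, mean differences); windows containing
    no pixels yield NaN.
    """
    m, c = model_out[mask], control_out[mask]
    centers, diffs = [], []
    lo = 0.0
    while lo + window <= 1.0 + 1e-9:
        sel = (c >= lo) & (c < lo + window)
        centers.append(lo + window / 2)
        diffs.append(float(np.mean(m[sel] - c[sel])) if sel.any() else np.nan)
        lo += step
    return centers, diffs
```

With the defaults, the nine overlapping windows [0, 0.2), [0.1, 0.3), ..., [0.8, 1.0) tile the prediction range as described in the text.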
6.10 Contour Tracing in Natural Images Stimuli
The construction of stimuli for the contour tracing in natural images task required selecting contours in natural images. We randomly extracted a smooth contour, C1, from a BIPED (Poma et al., 2020) image using its edge map. Contours were extracted by first selecting a random starting edge pixel from the edge map. Valid starting pixels had to be part of a straight contour in their 3 x 3 pixel vicinity, either vertically, horizontally, or diagonally. Next, this starting contour was extended at both ends by adding contiguous edge pixels that deviated by no more than a fixed angular threshold (in radians) from the local direction of the contour. The local direction of the contour was defined by the direction of the segment connecting the last two points of the contour. If there was more than one candidate edge pixel, the candidate with the smallest offset from the contour direction was selected. The process was repeated until there were no more edge pixels at candidate positions or until the selected candidate pixel was already a part of C1 (a circular contour). Additionally, once the contour length exceeded eight pixels, a large-scale smooth-curvature constraint was applied to check that the angle difference between contour points c_i and c_(i-8) did not exceed a fixed threshold (in radians), where c_i is the last point on the contour. Contour extraction was also stopped if the large-scale curvature constraint was not met.
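A minimal sketch of one extension step of this procedure. The per-step angular limit is an assumption (the extracted text does not preserve its value); the eight-pixel large-scale window is from the text:

```python
import numpy as np

MAX_STEP_ANGLE = np.pi / 4   # assumed per-step limit, radians
LARGE_SCALE_WINDOW = 8       # pixels, from the text


def local_direction(contour):
    """Direction of the segment joining the last two contour points."""
    (y0, x0), (y1, x1) = contour[-2], contour[-1]
    return np.arctan2(y1 - y0, x1 - x0)


def next_pixel(contour, edge_map):
    """Among the 8-connected neighbours of the contour's last point,
    pick the edge pixel with the smallest angular offset from the
    local contour direction, or return None if no candidate lies
    within MAX_STEP_ANGLE."""
    y, x = contour[-1]
    direction = local_direction(contour)
    best, best_off = None, None
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            ny, nx = y + dy, x + dx
            if not (0 <= ny < edge_map.shape[0] and 0 <= nx < edge_map.shape[1]):
                continue
            if not edge_map[ny, nx] or (ny, nx) in contour:
                continue
            # wrapped angular offset from the contour direction
            off = abs(np.angle(np.exp(1j * (np.arctan2(dy, dx) - direction))))
            if off <= MAX_STEP_ANGLE and (best_off is None or off < best_off):
                best, best_off = (ny, nx), off
    return best
```

The full extraction loop would call `next_pixel` at both ends of the contour and additionally enforce the large-scale curvature check over `LARGE_SCALE_WINDOW` pixels.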
After extracting C1, one of its end points was chosen as the position of the first marker, M1. Next, a second edge pixel that did not lie on C1 was randomly selected. To ensure that connected and unconnected stimuli had similar separation distances, the selection process used a nonuniform probability distribution that favored edge pixels roughly equidistant from M1 with the unselected end point of C1. First, the distances of all edge pixels from M1 were calculated. Next, the absolute difference between each edge pixel's distance and the distance to the unselected end point of C1 was calculated. A softmax function over the negated distance differences was used to convert them to probabilities. Edge pixels at a distance similar to that of the unselected end point of C1 had distance differences close to zero and were more likely to be selected, while edge pixels at a very different distance had large negative values and were less probable.
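The distance-matched selection reduces to a softmax over negated absolute distance differences; a minimal sketch, with the function name an assumption:

```python
import numpy as np

def second_pixel_probabilities(edge_pixel_dists, target_dist):
    """Selection probabilities for the second edge pixel.

    edge_pixel_dists : distance of each candidate edge pixel from M1.
    target_dist      : distance from M1 to the unselected end of C1.

    Candidates whose distance matches target_dist have difference ~0
    and receive the highest probability; mismatched candidates have
    large negative values and receive low probability.
    """
    d = -np.abs(np.asarray(edge_pixel_dists, dtype=float) - target_dist)
    e = np.exp(d - d.max())   # numerically stable softmax
    return e / e.sum()
```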
Given the location of the second edge pixel, a second contour, C2, was extended from it. If any point on C2 overlapped with C1, a new starting edge pixel was selected, and the process was repeated until a nonoverlapping pair of contours was found. The location of the second marker, M2, was determined by the type of stimulus. For connected stimuli, the opposite end of C1 was selected as M2, while for unconnected stimuli, one of the end points of C2 was chosen. Once marker positions were determined, markers were placed at the corresponding positions in the input image. Each marker consisted of a bull's-eye of alternating red and blue concentric circles (see Figure 8B). Markers were added directly to input natural images, and networks were given no information about the selected contours.
Within a bubble mask, bubbles were allowed to overlap, and a different mask was used for each image. Values in the bubble mask ranged over [0, 1]. Sample input training images for the contour tracing task are shown in Figure 8C.
The train data set contained 50,000 contours extracted from BIPED train images, while the validation data set contained 5000 contours extracted from BIPED test images. Since the BIPED test data set contains only 50 images, multiple contours were extracted per image. Care was taken to ensure that duplicate contours were not selected. Puncturing of input images was done as a preprocessing step during the training loop, so each exposure of an image to a network was unique. Connected and unconnected stimuli were generated with equal probability.
6.11 Test Contour Tracing in Natural Images Stimuli
As in the analysis of interfragment spacing with synthetic fragmented contours, the optimal stimuli of target neurons needed to be found. In the synthetic contour fragments data set, test images were designed to contain the optimal stimuli of the monitored neurons. For natural images, however, inputs cannot be defined in this way, so a new procedure was devised. To find the optimal stimulus of an individual channel, multiple unoccluded connected contours were presented to the networks (see Figure 8B). New random contours were selected from the augmented BIPED train data set, which was used instead of the test data set because it contained more images and a larger variety of contours. For each image, the position of the most active neuron of each channel in the CI block was found. If it was within three pixels (the stride length of the subsequent convolutional layer) of the contour, the image and the position of the most active neuron were stored. The process was repeated over 5000 contours, and the top 50 (contour, most active neuron) pairs were retained for each channel.
Given the optimal stimulus for a channel, each input contour was fragmented by inserting occlusion bubbles at specific positions along the contour. Different bubble sizes were used to fragment contours with different interfragment spacings. A fixed fragment length of seven pixels, the same size as the cRF of edge-extracting neurons, was used. To ensure that the cRF of the most active neuron was unaffected by bubbles, the position of the closest point on the contour was found first. Bubbles were then inserted along the contour at fixed offsets on either side of this point, out to the ends of the contour. Finally, the blending-in area of the bubbles was restricted to their full width at half maximum (FWHM), in pixels, to ensure that visible contour fragments were unaffected.
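The bubble-placement step can be sketched as follows. The `spacing` parameter (the fragment-plus-bubble step along the contour, in pixels) is an assumption; the text fixes the fragment length at seven pixels but the extracted text does not preserve the exact offset value:

```python
def bubble_centers(contour, closest_idx, spacing):
    """Contour indices at which occlusion bubbles are inserted,
    walking outward in both directions from closest_idx (the contour
    point nearest the most active neuron's cRF) so that the fragment
    around it stays unoccluded.

    contour     : sequence of contour points (only its length is used
                  here).
    closest_idx : index of the protected contour point.
    spacing     : step between consecutive bubbles, in contour pixels
                  (an assumption).
    """
    idxs = []
    i = closest_idx + spacing
    while i < len(contour):          # walk toward one end
        idxs.append(i)
        i += spacing
    i = closest_idx - spacing
    while i >= 0:                    # walk toward the other end
        idxs.append(i)
        i -= spacing
    return sorted(idxs)
```

Larger bubbles (with a larger effective `spacing`) produce larger interfragment spacings, mirroring the synthetic-stimulus manipulation.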
Appendix: Model and RPCM Network Equivalency
More generally, the latter two constraints do not have to be enforced. If the first constant is nonzero, it can be absorbed into the diagonal of the recurrent weight matrix. Similarly, if the second constant differs from one, one weight matrix can be multiplied by it and the other by its reciprocal, without changing the function computed.
This suggests that differences between model and RPCM are due to transient dynamics and learning dynamics (i.e., the model may be structurally capable of RPCM performance, but the solution may not be reachable via backpropagation and Adam).
Supporting Information
Code Availability
The source code for all networks, experiments, and analyses performed in this work, as well as for generating the data sets used, is available at https://github.com/salkhan23/contour_integration_pytorch.
Note
Many other computational models of contour integration exist in the literature, including those that are based on edge co-occurrence probabilities in natural images. For a review, see Elder and Goldberg (2002) and Geisler et al. (2001). However, because we are interested in the brain’s mechanisms of contour integration, we restrict our comparisons to mechanistic models only.