Visual understanding requires comprehending complex visual relations between objects within a scene. Here, we seek to characterize the computational demands for abstract visual reasoning. We do this by systematically assessing the ability of modern deep convolutional neural networks (CNNs) to learn to solve the synthetic visual reasoning test (SVRT) challenge, a collection of 23 visual reasoning problems. Our analysis reveals a novel taxonomy of visual reasoning tasks, which can be primarily explained by both the type of relations (same-different versus spatial-relation judgments) and the number of relations used to compose the underlying rules. Prior cognitive neuroscience work suggests that attention plays a key role in humans' visual reasoning ability. To test this hypothesis, we extended the CNNs with spatial and feature-based attention mechanisms. In a second series of experiments, we evaluated the ability of these attention networks to learn to solve the SVRT challenge and found the resulting architectures to be much more efficient at solving the hardest of these visual reasoning tasks. Most important, the corresponding improvements on individual tasks partially explained our novel taxonomy. Overall, this work provides a granular computational account of visual reasoning and yields testable neuroscience predictions regarding the differential need for feature-based versus spatial attention depending on the type of visual reasoning problem.

Humans can effortlessly reason about the visual world and provide rich and detailed descriptions of briefly presented real-life photographs (Fei-Fei, Li, Iyer, Koch, & Perona, 2007), vastly outperforming the best current computer vision systems (Geman, Geman, Hallonquist, & Younes, 2015; Kreiman & Serre, 2020). For the most part, studies of visual reasoning in humans have sought to characterize the neural computations underlying the judgment of individual relations between objects, such as their spatial relations (Logan, 1994a) or whether they are the same or different (up to a transformation, e.g., Shepard & Metzler, 1971). It has also been shown that different visual reasoning problems have different attentional and working memory demands (Logan, 1994b; Moore, Elsinger, & Lleras, 1994; Rosielle et al., 2002; Holcombe, Linares, & Vaziri-Pashkam, 2011; Van Der Ham et al., 2012; Kroger et al., 2002; Golde, von Cramon, & Schubotz, 2010; Clevenger & Hummel, 2014; Brady & Alvarez, 2015). However, there is still little known about the neural computations that are engaged by different types of visual reasoning (see Ricci, Cadène, & Serre, 2021, for a recent review).

One benchmark that has been designed to probe abstract visual relational capabilities in humans and machines is the synthetic visual reasoning test (SVRT) (Fleuret et al., 2011). The data set consists of 23 hand-designed binary classification problems, each testing an abstract relationship between objects in images of closed-contour shapes. Observers are never explicitly given the underlying rule for solving any given problem. Instead, they learn it while classifying positive and negative examples and receiving task feedback. Examples from two representative tasks are depicted in Figure 1: observers must learn to recognize whether two shapes are the same or different (task 1), or whether the smaller of the two shapes is near the boundary of the larger one (task 2). Additional abstract relationships tested in the challenge include “inside,” “in between,” “forming a square,” “aligned in a row,” and “finding symmetry” (see Figures S1 and S2 for examples).
Figure 1:

Two SVRT sample tasks from a set of 23. For each task, the leftmost and rightmost two examples illustrate the two categories to be classified. Representative samples for the complete set of 23 tasks are in Figures S1 and S2.


Human observers rapidly learn most SVRT tasks within 20 or fewer training examples (Fleuret et al., 2011; see Table 2 in this article; reproduced from the original study). However, modern deep neural network models require several orders of magnitude more training samples for some of the more challenging tasks (Ellis, Solar-Lezama, & Tenenbaum, 2015; Stabinger, Rodríguez-Sánchez, & Piater, 2016; Kim, Ricci, & Serre, 2018; Messina, Amato, Carrara, Gennaro, & Falchi, 2021b; Stabinger, Peer, Piater, & Rodríguez-Sánchez, 2021; Puebla & Bowers, 2021; see Ricci et al., 2021, for review; see also Funke et al., 2021, for an alternative perspective).

It is now clear that some SVRT tasks are more difficult to learn than others. For instance, tasks that involve spatial-relation (SR) judgments can be learned much more easily by deep convolutional neural networks (CNNs) than tasks that involve same-different (SD) judgments (Stabinger et al., 2016; Kim et al., 2018; Yihe, Lowe, Lewis, & van Rossum, 2019). In contrast, a very recent study (Puebla & Bowers, 2021) demonstrated that even when CNNs learn to detect whether objects are the same or different, they fail to generalize over small changes in appearance, meaning that they have only partially learned this abstract rule. The implication of the relative difficulty of learning SR versus SD tasks is that CNNs appear to need additional computations to solve SD tasks beyond standard filtering, nonlinear rectification, and pooling. Indeed, recent human electrophysiology work (Alamia et al., 2021) has shown that SD tasks recruit cortical mechanisms associated with attention and working memory processes to a greater extent than SR tasks. Others have argued that SD tasks are central to human intelligence (Firestone, 2020; Forbus & Lovett, 2021; Gentner, Shao, Simms, & Hespos, 2021). Beyond this basic dichotomy of SR and SD tasks, little is known about the neural computations necessary to learn to solve SVRT tasks as efficiently as human observers.

Here, we investigate the neural computations required for visual reasoning in two sets of experiments. In our first set of experiments, we extend prior studies on the learnability of individual SVRT tasks by feedforward neural networks using a popular class of deep neural networks known as deep residual networks (ResNets; He, Zhang, Ren, & Sun, 2016). We systematically analyze the ability of ResNets to learn all 23 SVRT tasks as a function of their expressiveness, parameterized by processing depth (number of layers) and their efficiency in learning a particular task. Through these experiments, we found that most of the performance variance in the space of SVRT tasks could be accounted for by two principal components, which reflected both the type of task (same-different versus spatial-relation judgments) and the number of relations used to compose the underlying rules.

Consistent with the speculated role of attention in solving the binding problem when reasoning about objects (Egly, Rafal, Driver, & Starrveveld, 1994; Roelfsema, Lamme, & Spekreijse, 1998), prior work by Kim et al. (2018) has shown that combining CNNs with an oracle model of attention and feature binding (i.e., preprocessing images so that they are explicitly and readily organized into discrete object channels) renders SD tasks as easy to learn by CNNs as SR tasks. Here, we build on this work and introduce CNN extensions that incorporate spatial or feature-based attention. In a second set of experiments, we show that these attention networks learn difficult SVRT tasks with fewer training examples than their nonattentive (CNN) counterparts but that the different forms of attention help on different tasks.

This second set of experiments raises a question: How do attention mechanisms help with learning different visual reasoning problems? There are at least two possible computational benefits: attention could improve model performance simply by increasing its capacity, or it could help models more efficiently learn the abstract rules governing object relationships. To adjudicate between these two possibilities, we measured the sample efficiency of ResNets pretrained on SVRT images so that they only had to learn the abstract rules for each SVRT task. We found that attention ResNets and ResNets pretrained on the SVRT were similarly sample-efficient in learning new SVRT tasks, indicating that attention helps discover abstract rules instead of merely increasing model capacity.

2.1  Systematic Analysis of SVRT Tasks' Learnability

All experiments were carried out with the SVRT data set using code provided by the authors to generate images with dimension 128×128 pixels (see Fleuret et al., 2011, for details). All images were normalized and resized to 256×256 pixels for training and testing models. No image augmentations were used during training. In our first experiment, we wanted to measure how easy or difficult each task is for ResNets to learn. We did this by recording the SVRT performance of multiple ResNets, each with a different number of layers and trained with different numbers of examples. By varying model complexity and the number of samples provided to a model to learn any given task, we obtained complementary measures of the learnability of every SVRT task for ResNet architectures. In total, we trained 18-, 50-, and 152-layer ResNets separately on each of the SVRT's 23 tasks. Each of these models was trained with 500, 1000, 5000, 10,000, 15,000, and 120,000 class-balanced samples. We also generated two unique sets of 40,000 positive and negative samples for each task: one was used as a validation set to select a stopping criterion for training the networks (training stopped if validation accuracy reached 100%) and the other as a test set to report model accuracy. In addition, we used three independent random initializations of the training weights for each configuration of architecture and task and selected the best model using the validation set. Models were trained for 100 epochs using the Adam optimizer (Kingma & Ba, 2014) with a training schedule (an initial learning rate of 1e-3, reduced to 1e-4 from the 70th epoch onward). As a control, because these tasks are quite different from each other, we also tested two additional initial learning rates (1e-4, 1e-5).
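A minimal sketch of this per-task training loop, assuming PyTorch/torchvision and hypothetical loaders train_loader and val_loader that yield (image, label) batches for a single SVRT task (not the authors' exact code):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def train_svrt_task(train_loader, val_loader, epochs=100, device="cuda"):
    model = resnet50(num_classes=2).to(device)        # one binary SVRT problem
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        if epoch == 70:                               # schedule: 1e-3 -> 1e-4 from the 70th epoch
            for group in optimizer.param_groups:
                group["lr"] = 1e-4

        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()

        # stopping criterion: validation accuracy reaches 100%
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images.to(device)).argmax(dim=1)
                correct += (preds == labels.to(device)).sum().item()
                total += labels.numel()
        if correct == total:
            break
    return model
```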

Consistent with prior work (Kim et al., 2018; Stabinger et al., 2016; Yihe et al., 2019), we found that some SVRT tasks are much easier for ResNets to learn than others (see Figure 2). For instance, a ResNet50 needs only 500 examples to perform well on tasks 2, 3, 4, 8, 10, 11, and 18, but the same network needs 120,000 samples to perform well on task 21 (see Figures S1 and S2 for examples of these tasks). Similarly, with 500 training examples, tasks 2, 3, 4, and 11 can be learned well with only 18 layers, while tasks 9, 12, 15, and 23 require as many as 152 layers. A key assumption of our work is that these differences in training set sizes and depth requirements between different SVRT tasks reflect different computational strategies that need to be discovered by the neural networks during training for different tasks. Our next goal is to characterize what these computational strategies are.
Figure 2:

Test accuracy for each of the 23 SVRT tasks as a function of the number of training samples for ResNets with depth 18, 50, and 152, respectively. The color scheme reflects the identified taxonomy of SVRT tasks (see Figure 3 and text for details).


2.2  An SVRT Taxonomy

To better understand the computational strategies needed to solve the SVRT, we analyzed ResNet performance on the tasks with a multivariate clustering analysis. For each individual task, we created an N-dimensional vector by concatenating the test accuracy of all ResNet architectures (N=3 depths × 5 training set sizes = 15), which served as a signature of each task's computational requirements. We then passed a matrix of these vectors to an agglomerative hierarchical clustering analysis (see Figure 3) using Ward's method.
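As a rough sketch (assuming SciPy and a placeholder accuracy matrix standing in for the actual results), the clustering step amounts to:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# accuracy[i, j]: test accuracy of configuration j (3 depths x 5 training set sizes)
# on SVRT task i + 1; random values stand in for the real measurements here
accuracy = np.random.rand(23, 15)

Z = linkage(accuracy, method="ward")               # agglomerative clustering, Ward's method
dendrogram(Z, labels=[f"task {i + 1}" for i in range(23)])
```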
Figure 3:

Dendrogram derived from an N-dim hierarchical clustering analysis on the test accuracy of N=15 ResNets[18/50/152] trained to solve each task over a range of training set sizes.


Our clustering analysis revealed a novel taxonomy for the SVRT. At the coarsest level, it recapitulated the dichotomy between same-different (SD; green branches) and spatial-relation (SR; brown branches) categorization tasks originally identified by Kim et al. (2018) using shallow CNNs. Interestingly, two of the tasks that Kim et al. (2018) classified as SR (tasks 6 and 17) were assigned to the SD cluster in our analysis. We examined the descriptions of these two tasks as given in Fleuret et al. (2011) (see also Figures S1 and S2 in the online supplement) and found that they involve both SR and SD judgments: observers must tell whether shapes are the same or different and judge the distance between them. Specifically, task 6 involves two pairs of identical shapes; the distance between the identical shapes is the same in one category and different in the other. Similarly, in task 17, three of the four shapes are identical, and their distance to the nonidentical one is the same in one category and different in the other. Thus, our data-driven dichotomization of SR versus SD refines the original proposal of Kim et al. (2018). This could be due to our use of ResNets (as opposed to vanilla CNNs), deeper networks, and a greater variety of training set sizes (including much smaller training set sizes than those used by Kim et al., 2018). The analysis by Fleuret et al. (2011) also revealed that several SD tasks (6, 16, 17, 21) are particularly challenging for human observers.

Our clustering analysis also revealed a finer organization beyond the main SR versus SD dichotomy. The SR cluster could be further subdivided into two subclusters. The SR2 (dark brown) branch in Figure 3 captures tasks that involve relatively simple relational rules, such as shapes making close contact (3, 11), being close to one another (2), one shape being inside the other (4), or the shapes being arranged to form a symmetric pattern (8, 10, 18). In contrast, tasks that fall in the SR1 (light brown) branch involve the composition of more than two rules, such as comparing the sizes of multiple shapes to identify a subgroup before judging the relationship among its members. Examples include a larger shape lying in between two smaller ones (9); three shapes, two small and one large, where both small shapes are either inside or outside the large one in one category versus one inside and one outside in the other (23); and two small shapes equally close to a bigger one (12). These tasks also tend to be comparatively harder to learn, requiring ResNets with greater processing depth and more training samples. For instance, tasks 9, 12, 15, and 23 were harder to learn than tasks 2, 4, and 11, requiring more samples or more depth to solve well (see Figure 2).

Task 15, which requires judging whether the four shapes in an image are identical or not, was assigned to this latter subcluster. One would expect this task to fall in the SD cluster, but we speculate that the deep networks are actually able to leverage a shortcut (Geirhos et al., 2020) by classifying the overall pattern as symmetric or square (when the four shapes are identical) versus trapezoid (when the four shapes are different; see Figure S2), effectively turning an SD task into an SR task.

Our clustering analysis also reveals a further subdivision of the SD cluster. These tasks require recognizing shapes that are identical to at least one of the other shapes in the image. The first subcluster, SD2 (light green branch), contains tasks defined by relatively simple rules, such as deciding whether two shapes are identical, possibly after reflection about a perpendicular bisector (tasks 1, 20; see Figure S1), determining whether all the shapes in an image are the same (16, 22), or detecting whether two pairs of identical shapes can be translated to become identical to each other (13). Another set of tasks within this subcluster is defined by more complex rules that involve the composition of additional relational judgments, such as identifying pairs or triplets of identical shapes and judging their distance to the remaining shapes (6, 17), determining whether an image consists of pairs of identical shapes (5), or detecting whether one of the shapes is a scaled version of the other (19). Finally, the second subcluster, SD1, shown in dark green, involves two tasks that require an understanding of shape transformations. One asks observers to say whether one of the shapes is a scaled, translated, or rotated version of the other (21). The other asks observers to judge whether an image contains two groups of three identical shapes or three groups of two identical shapes (7).

To summarize this first set of experiments, we have systematically evaluated the ability of ResNets spanning multiple depths to solve each of the 23 SVRT tasks for different training set sizes. This allowed us to represent SVRT tasks according to their learnability by ResNets of varying depth. By clustering these representations, we extracted a novel SVRT taxonomy that both recapitulated the previously described SD-SR dichotomy (Kim et al., 2018) and revealed a more granular task structure corresponding to the number of rules used to form each task. Tasks with more rules are harder for ResNets to learn. Our taxonomy also reveals an organization of tasks in which the easier SR1 and SR2 subclusters fall closer to each other than the harder SD1 and SD2 subclusters.

We next sought to identify computational mechanisms that could help ResNets learn the more challenging SVRT tasks revealed by our novel taxonomy. Attention has classically been implicated in visual reasoning in primates and humans (Egly et al., 1994; Roelfsema et al., 1998). Attentional processes can be broadly divided into spatial (e.g., attending to all features in a particular image location) versus feature-based (e.g., attending to a particular shape or color at all spatial positions; Desimone & Duncan, 1995). The importance of attention for perceiving and reasoning about challenging visual stimuli has also been recognized by the computer vision community. A number of attention modules have been proposed to extend CNNs, including spatial (e.g., Sharma, Kiros, & Salakhutdinov, 2015; Yang, He, Gao, Deng, & Smola, 2016; Xu & Saenko, 2015; Ren & Zemel, 2016), feature-based (Stollenga, Masci, Gomez, & Schmidhuber, 2014; Chen et al., 2015, 2017; Hu et al., 2018), and hybrid (Linsley, Scheibler, Eberhardt, & Serre, 2018a; Woo, Park, Lee, & So Kweon, 2018) approaches. Here, we adapt the increasingly popular transformer architecture (Vaswani et al., 2017) to implement both forms of attention. These networks, originally developed for natural language processing, are now pushing the state of the art in computer vision (Zhu et al., 2020; Carion et al., 2020; Dosovitskiy et al., 2020). Recent work (Ding, Hill, Santoro, Reynolds, & Botvinick, 2021) has also shown the benefits of such architectures, and especially of attention mechanisms, for solving higher-level reasoning problems.

Transformers are neural network modules that usually consist of at least one “self-attention” module followed by a feedforward layer. Here, we introduced different versions of the self-attention module into ResNets to better understand the computational demands of each SVRT task. The self-attention implemented by transformers is both applied to and derived from the module's input. By reconfiguring standard transformer self-attention, we developed versions capable of allocating either spatial or feature-based attention over the input. Specifically, we created these different forms of attention by reshaping the convolutional feature map input to a transformer. For spatial attention, we reshaped the feature maps Z ∈ R^{H×W×C} to Z ∈ R^{C×HW}, so that the transformer's self-attention was allocated over all spatial locations. For feature-based attention, we reshaped the convolutional feature maps to Z ∈ R^{HW×C}, enforcing attention over all features instead of spatial locations. See section S1 in the online supplement for an elaborated treatment.
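As a rough illustration, the two reshaping schemes can be written as follows (a sketch with placeholder dimensions; the self-attention module subsequently applied to the reshaped tensors is described in section S1 and omitted here):

```python
import torch

b, h, w, c = 8, 16, 16, 1024                              # placeholder batch and feature-map sizes
z = torch.randn(b, h, w, c)                               # convolutional feature maps, Z in R^{H x W x C}

# spatial attention: reshape to R^{C x HW} so attention is allocated over spatial locations
z_spatial = z.permute(0, 3, 1, 2).reshape(b, c, h * w)

# feature-based attention: reshape to R^{HW x C} so attention is allocated over features
z_feature = z.reshape(b, h * w, c)
```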

We added a single spatial or feature-based attention module after one of the four residual blocks of a ResNet-50. We selected where to add either form of attention by choosing the location where its addition yielded the best validation accuracy across the SVRT tasks. Through this procedure, we inserted the spatial attention module after the second residual block and the feature-based attention module after the third residual block (see Figure 4).
Figure 4:

Location of the transformer self-attention modules in our ResNet extensions.

To measure the effectiveness of different forms of attention for solving the SVRT, we compared the accuracy of three ResNet-50 models: one with spatial attention, one with feature-based attention, and one with no attention mechanism (“vanilla”) (see Figure 5). Spatial attention consistently improved model accuracy on all tasks, across all five training set sizes. The improvement in accuracy is particularly noticeable for the SD1 cluster. Tasks in this subcluster are composed of two rules, which ResNets without attention struggled to learn. Attention helps ResNets learn these tasks more efficiently. The improvement is also evident for SD2 and SR1. The benefit of attention for SR2 is, however, marginal, since ResNets without attention already perform well on these tasks.
Figure 5:

Test accuracies for a baseline ResNet50 versus the same architecture endowed with the two forms of attention for each of the 23 SVRT tasks when varying the number of training examples. A different axis scale is used for SR2 to improve visibility. The curves connect task accuracies at the five training set sizes.

We find that feature-based attention leads to the largest improvements for SD1, especially when training on 5000 or 10,000 examples (see Figure 6). On the other hand, spatial attention leads to the largest improvements for SD2 and SR1. This improvement is pronounced when training on 500 or 1000 examples. Taken together, the differential success of spatial versus feature-based attention reveals that the task subclusters discovered in our data-driven taxonomy can be explained by their varying attentional demands.
Figure 6:

Test accuracies for 50-layer ResNets with spatial attention (orange), feature-based attention (tan), or no attention (green). Each bar depicts performance after training from scratch on 10,000 samples.

To better understand how the ResNet-derived taxonomy found in experiment 1 can be explained by the need for spatial and feature-based attention, we measured the relative improvement of each form of attention over the vanilla ResNet. For each attention model and task, we calculated the ratio of the test accuracies between the model and the vanilla ResNet50. We repeated this for every training data set size, then fit a linear model to these ratios to calculate the slope across data set sizes (see Figure 7 for representative examples). We repeated this procedure for all 23 tasks to produce two 23-dimensional vectors containing slopes for each model and every task.
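A minimal sketch of this slope computation, assuming hypothetical arrays acc_attention and acc_vanilla of shape (23, 5) that hold test accuracies per task and training set size (whether the fit uses raw training set sizes or their ordinal index is an assumption here):

```python
import numpy as np

sizes = np.array([500, 1000, 5000, 10000, 15000])

acc_attention = np.random.rand(23, 5)      # placeholders for the real test accuracies
acc_vanilla = np.random.rand(23, 5)

ratios = acc_attention / acc_vanilla       # > 1 where attention helps, < 1 where it hurts
# slope of a first-degree polynomial fit across training set sizes, one slope per task
slopes = np.array([np.polyfit(sizes, ratios[t], deg=1)[0] for t in range(23)])
```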
Figure 7:

The benefit of attention in solving the SVRT is greatest in data-limited training regimes. The x-axis depicts the number of training samples, and the y-axis depicts the ratio of the average performance of models with attention to models without attention. A ratio greater than 1 indicates that attention helps; a ratio below 1 indicates that it hurts. This gives five ratios per task and attention process, one for each data set size. We fit a line to these points and calculated the corresponding slope. This slope characterizes the relative benefit of attention for that particular task as the number of available training examples increases. If the benefit of attention is most evident in lower training regimes, one would expect a relatively small slope; if the benefit of attention is most evident in higher training regimes, one would expect a large slope.

We next used these slopes to understand the attentional demands of each SVRT task. We did this through a two-step procedure. First, we applied a principal component analysis (see Figure 8) to the vanilla ResNet performance feature vectors (N=15) derived from experiment 1. Second, we correlated the principal components with the slope vectors from the two attention models. We restricted our analysis to the first two principal components, which captured about 93% of the variance in the vanilla ResNet's performance (see Figure 8). This analysis revealed a dissociation between the two forms of attention: feature-based attention was most correlated with the first principal component and spatial attention with the second. Additionally, the first principal component reflected the broader dichotomy of the 23 tasks into SD and SR clusters, whereas the second principal component separated the tasks that benefited more from spatial attention from those requiring either no attention or feature-based attention (dotted red lines in Figure 8). The corresponding Pearson coefficients r and p values are given in Table 1.
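A minimal sketch of this two-step analysis, assuming hypothetical arrays vanilla_features (the 23 × 15 test-accuracy vectors from experiment 1) and the two 23-dimensional slope vectors slopes_spatial and slopes_feature:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

vanilla_features = np.random.rand(23, 15)    # placeholders for the real data
slopes_spatial = np.random.rand(23)
slopes_feature = np.random.rand(23)

pcs = PCA(n_components=2).fit_transform(vanilla_features)   # (23, 2) task coordinates
for name, slopes in [("spatial", slopes_spatial), ("feature", slopes_feature)]:
    for k in range(2):
        r, p = pearsonr(pcs[:, k], slopes)                  # correlation with each PC
        print(f"{name} attention vs PC{k + 1}: r = {r:.3f}, p = {p:.4f}")
```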
Figure 8:

Principal component analysis of the 23 tasks using the 15-dimensional feature vectors derived from experiment 1 representing the test accuracy obtained for each task for different data set sizes and ResNets of varying depths (18, 50, 152). The dotted red line represents four different bins in which these tasks can be clustered.

Table 1:

Pearson Coefficient (r) and Corresponding p Values Obtained by Correlating the Slope Vectors of the Spatial Attention and the Feature-Based Attention Modules with the Two Principal Components of Figure 8.

            Spatial                Feature
            r        p             r        p
PC1         0.466    0.0249        0.649    0.0008
PC2         -0.652   0.0007        -0.491   0.0174

Notes: Each form of attention correlates most strongly with a different principal component (feature-based with PC1, spatial with PC2). See text for details.

To summarize our results from experiment 2, we have found that the task clusters derived from ResNet test accuracies computed over a range of depths and training set sizes can be explained in terms of attentional demands. Here, we have shown that endowing these networks with attention mechanisms helps them learn some of the most challenging problems with far fewer training examples. We also found that the relative improvements obtained over standard ResNets with feature-based and spatial attention are consistent with the taxonomy of visual reasoning tasks found in experiment 1. More generally, our analysis shows that the relative need for feature-based versus spatial attention accounts for a large fraction of the variance in the computational demands of the SVRT tasks, as characterized in experiment 1 by their learnability by ResNets.

The learnability of individual SVRT tasks reflects two components: the complexity of the task's visual features and, separately, the complexity of the rule needed to solve the task. To what extent are our estimates of learnability driven by either of these components? We tested this question by training a new set of ResNets without attention according to the procedure laid out in experiment 1 but with different pretraining strategies. One ResNet was pretrained to learn the visual statistics (but not the rules) of SVRT images, and another was pretrained on ImageNet, a popular computer vision data set containing natural object categories (Deng et al., 2009).

For pretraining on SVRT, we sampled 5000 class-balanced images from each of the 23 tasks (5000 × 23 = 115,000 samples in total). To make sure the networks did not learn any of the SVRT task rules, we shuffled images and binary class labels across all 23 problems while pretraining the network. We then trained models with binary cross-entropy to detect positive examples without discriminating tasks. Our assumption is that shuffling images and labels removes any consistent mapping between individual images and SVRT rules. However, a network with sufficient capacity can still learn the corresponding mapping between arbitrary images and class labels (even though it cannot generalize to novel samples). To learn this arbitrary mapping, the network has to be able to encode visual features, but by construction, it cannot learn the SVRT task rules. When training this model and the ImageNet-initialized model to solve individual SVRT tasks, we froze the weights of the convolutional layers and fine-tuned only the classification layers.
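The label-shuffling control can be sketched as follows (placeholder arrays stand in for the actual pooled SVRT images and labels):

```python
import numpy as np

rng = np.random.default_rng(0)
images = np.zeros((115_000, 256, 256), dtype=np.float32)   # placeholder for the pooled SVRT images
labels = rng.integers(0, 2, size=115_000)                   # placeholder binary class labels
shuffled_labels = rng.permutation(labels)                   # break any consistent image-rule mapping

# A classifier trained with binary cross-entropy on (images, shuffled_labels) can still
# memorize this arbitrary mapping, and hence must encode visual features, but it cannot
# learn any SVRT rule; its convolutional layers are then frozen during fine-tuning.
```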

Figure 9 shows a comparison between the different architectures in terms of their test accuracies according to the subclusters discovered in experiment 1. These results first confirm that the SVRT pretraining approach works: it consistently outperforms pretraining on ImageNet (see Figure S7) or training from scratch. Interestingly, for the SR2 subcluster, we found that the benefits of pretraining on SVRT go down very quickly as the number of training examples grows. We interpret these results as reflecting the fact that generic visual features are sufficient for the task and that the rule can be learned very quickly (somewhere between 500 and 5000 samples). For the SR1 subcluster, the benefits of starting from features learned on SVRT are somewhat more evident in low training regimes, but these advantages quickly vanish as more training examples become available (the tasks are learned by all architectures within 5000 training samples).
Figure 9:

Test accuracies for a baseline ResNet50 trained from scratch (“No initialization”) versus the same architecture pretrained on an auxiliary task in order to learn visual representations already adapted to the SVRT stimuli, for different numbers of training examples. The format is the same as in Figure 5. A different axis scale is used for SR2 to improve visibility. The curves connect task accuracies at the five training set sizes.


For SD1, while there appears to be a noteworthy advantage of pretraining on SVRT over ImageNet pretraining and training from scratch, the tasks never appear to be fully learned by any of the networks, even with 15,000 training examples. This demonstrates the challenge of learning the rules associated with this subcluster beyond simply learning good visual representations. Finally, our results also show that the performance gap across all the architectures for SD2 versus SD1 increases rapidly with more training examples, demonstrating that the abstract rules for SD2 tasks are learned more rapidly than those for SD1.

Finally, we carried out a similar analysis with the pretrained network as done in experiment 2. We built test accuracy vectors for the SVRT-pretrained network trained using all five data set sizes (500, 1000, 5000, 10,000, 15,000), searching over a range of learning rates (1e-4, 1e-5, 1e-6) for the best one. This led to a five-dimensional vector per task, which we normalized by dividing each entry by the corresponding test accuracy of a baseline ResNet50 trained from scratch. Hence, the normalized vectors represent the improvement (ratio larger than 1) or reduction in accuracy (ratio smaller than 1) that results from pretraining on SVRT for that particular task and training set size. We then calculated the slope vector in R^{23}, which we correlated with the corresponding spatial and feature-based attention slope vectors from experiment 2.

We found that task improvements due to SVRT pretraining correlated more strongly with task improvements due to spatial (r=0.90, p=4e-9) than feature-based attention (r=0.595, p=0.002). This suggests that the observed improvements in accuracy derived from spatial attention are more consistent with learning better feature representations compared to feature-based attention.

To summarize, in experiment 3, we have tried to address the question of learnability of SVRT features versus rules. We found that using an auxiliary task to pretrain the networks on the SVRT stimuli in order to learn visual representations beforehand provides learning advantages to the network compared to a network trained from scratch.

We also found a noteworthy correlation between the test accuracy vector of a network pretrained on SVRT visual statistics and a similar network endowed with spatial attention. This suggests that spatial attention helps discover the abstract rule more so than it helps improve learning good visual representations for the task.

The goal of this study was to shed light on the computational mechanisms underlying visual reasoning using the synthetic visual reasoning test (SVRT; Fleuret et al., 2011). There are 23 binary classification problems in this challenge, which include a variety of same-different and spatial reasoning tasks.

In a first experiment, we systematically evaluated the ability of a battery of N=15 deep convolutional neural networks (ResNets)—varying in depths and trained using different training set sizes—to solve each of the SVRT problems. We found a range of accuracies across all 23 tasks, with some tasks being easily learned by shallower networks and relatively small training sets and some tasks remaining barely solved with much deeper networks and orders of magnitude more training examples.

Under the assumption that the computational complexity of individual tasks can be well characterized by the pattern of test accuracy across these N=15 neural networks, we formed N-dimensional accuracy vectors for each task and ran a hierarchical clustering algorithm. The resulting analysis suggests a taxonomy of visual reasoning tasks: beyond two primary clusters corresponding to same-different (SD) versus spatial-relation (SR) judgments, we also identified a finer organization with subclusters reflecting the nature and the number of relations used to compose the rules defining the task. Our results are consistent with previous work by Kim et al. (2018), who first identified a dichotomy between SD and SR tasks. Our results also extend prior work (Fleuret et al., 2011; Kim et al., 2018; Yihe et al., 2019) in proposing a finer-level taxonomy of visual reasoning tasks. That network accuracy tracks the number of relationships used to define the underlying rules is perhaps expected, but it deserves closer examination.

Kim et al. (2018) have previously suggested that SD tasks “strain” convolutional neural networks. That is, while it is possible to find a network architecture of sufficient depth (or number of units) that can solve a version of the task up to a number of stimulus configurations (e.g., by forcing all stimuli to be contained within a ΔH×ΔW window), it is relatively easy to render the same task unlearnable by the same network past a certain number of stimulus configurations (e.g., by increasing the size of the window that contains all stimuli). It is as if these convolutional networks are capable of learning the task as long as the number of stimulus configurations remains below their memory capacity and fail beyond that. It remains an open question whether nonconvolutional alternatives to the CNNs tested here, such as the now popular transformer networks (Dosovitskiy et al., 2020; Touvron et al., 2021; Tolstikhin et al., 2021), would learn to solve some of the harder SVRT tasks more efficiently. As an initial experiment, we attempted to train and test a Vision Transformer (ViT; Dosovitskiy et al., 2020) constrained to have a similar number of parameters (21 million) to the ResNet-50 used here. We were not able to get these architectures to do well on most of the tasks that are difficult for ResNets, even with 100,000 samples (as also shown in Messina, Amato, Carrara, Gennaro, & Falchi, 2021a). It is worth noting that even 100,000 samples remain a relatively small data set size by modern standards, since the ViT was trained from scratch.

Multilayer perceptrons and convolutional neural networks, including ResNets and other architectures, can be formally shown to be universal approximators under certain architectural constraints. That is, they can learn arbitrary mappings from images to class labels. Depending on the complexity of the mapping, one might need an increasing number of hidden units to allow for enough expressiveness of the network, but provided enough units, sufficient depth, and a sufficient number of training examples, deep CNNs can learn arbitrary visual reasoning tasks. While we cannot make any strong claim for the specific ResNet architectures used in this study (currently the proof is limited to a single layer without max pooling or batch normalization; Lin & Jegelka, 2018), we have found empirically that all SVRT tasks could indeed be learned by networks of sufficient depth provided a sufficient number of training examples. However, deep CNNs typically lack many human cognitive functions such as attention and working memory. Such functions are likely to provide a critical advantage for a learner trying to solve some of these tasks (Marcus, 2001). CNNs might have to rely instead on function approximation, which could lead to a less general, “brute-force” solution. Given this, an open question is whether the clustering of SVRT tasks derived from our CNN-based analyses will indeed hold in human studies. At the same time, the prediction by Kim et al. (2018) using CNNs that SD tasks are harder than SR tasks and hence may demand additional computations (through feedback processes) such as attention and/or working memory was successfully validated experimentally by Alamia et al. (2021) using EEG.

Additional evidence for the benefits of feedback mechanisms for visual reasoning was provided by Linsley, Shiebler, Eberhardt, and Serre (2018b), who showed that contour-tracing tasks that can be solved efficiently with a single layer of a recurrent CNN may require several orders of magnitude more processing stages in a nonrecurrent CNN to solve the same task. This ultimately translates into much greater sample efficiency for recurrent CNNs on natural image segmentation tasks (Linsley, Kim, Ashok, & Serre, 2020). The closely related task of “insideness” was also studied by Villalobos et al. (2021), who demonstrated the inability of CNNs to learn a general solution for this class of problems. Universal approximators with minimal inductive biases such as multilayer perceptrons, CNNs, and other feedforward or nonattentive architectures can learn to solve visual reasoning tasks, but they might need a very large number of training examples to fit properly. Hence, beyond simply measuring the accuracy of very deep nets in high data regimes (such as when millions of training examples are available), systematically assessing the performance of neural nets of varying depths and for different training regimes may provide critical information about the complexity of different visual reasoning tasks.

Kim et al. (2018) hypothesized that such straining of convolutional networks is due to their lack of attention mechanisms that would allow the explicit binding of image regions to mental objects. A similar point was made by Greff, van Steenkiste, and Schmidhuber (2020) regarding contemporary neural networks' failure to carve sensory information into discrete chunks that can then be individually analyzed and compared (see also Tsotsos, Rodriguez-Sanchez, Rothenstein, & Simine, 2007, for a similar point). Interestingly, this prediction was recently tested using human EEG by Alamia et al. (2021), who showed that brain activity recorded during SD tasks is indeed compatible with greater attention and working memory demands than during SR tasks.

At the same time, the fact that CNNs can learn SR tasks more efficiently than SD tasks does not necessarily mean that human participants can solve these tasks without attention. Indeed, Logan (1994b) has shown that under some circumstances, SR tasks such as judging insideness require attention.

To further assess the role of attention in visual reasoning, we used transformer modules to endow deep CNNs with spatial and feature-based attention. The relative improvements obtained by the CNNs with the two forms of attention varied across tasks. Many tasks showed a larger improvement with spatial attention, and a smaller number benefited from feature-based attention. Further, we found that the patterns of relative improvements accounted for much of the variance in the space of SVRT tasks derived in experiment 1. Overall, we found that the requirement for feature-based and spatial attention accounts well for the taxonomy of visual reasoning tasks identified in experiment 1. Our computational analysis also led to testable predictions for human experiments by suggesting tasks that benefit primarily from spatial attention (task 22) or from feature-based attention (task 21), tasks that benefit from either form of attention (task 19), and tasks that do not benefit from attention (task 2).

Finally, our study has focused on the computational benefits of spatial and feature-based attention for visual reasoning. Future work should consider the role of other forms of attention, including object-based attention (Egly et al., 1994) for visual reasoning.

In our third experiment, we studied the learnability of SVRT features versus rules. We did this by pretraining the neural networks on auxiliary tasks in order to learn SVRT features before training them to learn the abstract rules associated with individual SVRT problems. Our pretraining methods led to networks that learn to solve the SVRT problems better than networks trained from scratch, as well as networks that were pretrained to perform image categorization on the ImageNet data set. We have also found that such attention processes seem to contribute more to rule learning than to feature learning. For the SR1 subcluster, we find this type of pretraining to be advantageous in lower training regimes, but the benefits rapidly fade away in higher training regimes. In contrast, this pretraining does not allow the tasks from the SD1 subcluster to be learned even with 15,000 samples, suggesting that the key challenge with these tasks is not to discover good visual representations but rather to discover the rule. This suggests the need for additional mechanisms beyond those implemented in ResNets. This is also consistent with the improvements observed for these tasks with the addition of attention mechanisms.

In summary, our study compared the computational demands of different visual reasoning tasks. While our focus has been on understanding the computational benefits of attention and feature learning mechanisms, it is clear that additional mechanisms will be required to fully solve all SVRT tasks. These mechanisms are likely to include working memory, which is known to play a role in SD tasks (Alamia et al., 2021). Overall, this work illustrates the potential benefits of incorporating brain-like mechanisms in neural networks and provides a path forward to achieving human-level visual reasoning.

This work was funded by NSF (IIS-1912280) and ONR (N00014-19-1-2029) to T.S. and ANR (OSCI-DEEP grant ANR-19-NEUC-0004) to R.V. Additional support was provided by the ANR-3IA Artificial and Natural Intelligence Toulouse Institute (ANR-19-PI3A-0004) and the Center for Computation and Visualization and High Performance Computing resources from CALMIP (grant 2016-p20019). We acknowledge the Cloud TPU hardware resources that Google made available via the TensorFlow Research Cloud program, as well as computing hardware supported by NIH Office of the Director grant S10OD025181.

Alamia, A., Luo, C., Ricci, M., Kim, J., Serre, T., & VanRullen, R. (2021). Differential involvement of EEG oscillatory components in sameness versus spatial-relation visual reasoning tasks. eNeuro, 8(1).
Brady, T. F., & Alvarez, G. A. (2015). Contextual effects in visual working memory reveal hierarchically structured memory representations. Journal of Vision, 15(15), 6.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. arXiv:2005.12872.
Chen, K., Wang, J., Chen, L. C., Gao, H., Xu, W., & Nevatia, R. (2015). ABC-CNN: An attention based convolutional neural network for visual question answering. CoRR, abs/1511.05960.
Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., & Chua, T.-S. (2017). SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5659–5667). Piscataway, NJ: IEEE.
Clevenger, P. E., & Hummel, J. E. (2014). Working memory for relations among objects. Attention, Perception, and Psychophysics, 76(7), 1933–1953.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255). Piscataway, NJ: IEEE.
Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18(1), 193–222.
Ding, D., Hill, F., Santoro, A., Reynolds, M., & Botvinick, M. M. (2021). Attention over learned object embeddings enables complex visual reasoning. In M. Ranzato, A. Beygelzimer, K. Nguyen, P. S. Liang, J. W. Vaughan, & Y. Dauphin (Eds.), Advances in neural information processing systems, 34. Red Hook, NY: Curran.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … Houlsby, N. (2020). An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv:2010.11929.
Egly, R., Rafal, R., Driver, J., & Starrveveld, Y. (1994). Covert orienting in the split brain reveals hemispheric specialization for object-based attention. Psychological Science, 5(6), 380–383.
Ellis, K., Solar-Lezama, A., & Tenenbaum, J. (2015). Unsupervised learning by program synthesis. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28. Red Hook, NY: Curran.
Fei-Fei, L., Li, F. F., Iyer, A., Koch, C., & Perona, P. (2007). What do we perceive in a glance of a real-world scene? Journal of Vision, 7(1), 1–29.
Firestone, C. (2020). Performance vs. competence in human–machine comparisons. Proceedings of the National Academy of Sciences, 117(43), 26562–26571.
Fleuret, F., Li, T., Dubout, C., Wampler, E. K., Yantis, S., & Geman, D. (2011). Comparing machines and humans on a visual categorization test. Proceedings of the National Academy of Sciences, 108(43), 17621–17625.
Forbus, K. D., & Lovett, A. (2021). Same/different in visual reasoning. Current Opinion in Behavioral Sciences, 37, 63–68.
Funke, C. M., Borowski, J., Stosio, K., Brendel, W., Wallis, T. S. A., & Bethge, M. (2021). Five points to check when comparing visual perception in humans and machines. Journal of Vision, 21(3), 16.
Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., & Wichmann, F. A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11), 665–673.
Geman, D., Geman, S., Hallonquist, N., & Younes, L. (2015). Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12), 3618–3623.
Gentner, D., Shao, R., Simms, N., & Hespos, S. (2021). Learning same and different relations: Cross-species comparisons. Current Opinion in Behavioral Sciences, 37, 84–89.
Golde, M., von Cramon, D. Y., & Schubotz, R. I. (2010). Differential role of anterior prefrontal and premotor cortex in the processing of relational information. NeuroImage, 49(3), 2890–2900.
Greff, K., van Steenkiste, S., & Schmidhuber, J. (2020). On the binding problem in artificial neural networks. arXiv:2012.05208.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE.
Holcombe, A. O., Linares, D., & Vaziri-Pashkam, M. (2011). Perceiving spatial relations via attentional tracking and shifting. Current Biology, 21(13), 1135–1139.
Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7132–7141). Piscataway, NJ: IEEE.
Kim, J., Ricci, M., & Serre, T. (2018). Not-so-CLEVR: Learning same–different relations strains feedforward neural networks. Interface Focus, 8(4), 20180011.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
Kreiman, G., & Serre, T. (2020). Beyond the feedforward sweep: Feedback computations in the visual cortex. Annals of the New York Academy of Sciences, 1464(1), 222–241.
Kroger, J. K., Sabb, F. W., Fales, C. L., Bookheimer, S. Y., Cohen, M. S., & Holyoak, K. J. (2002). Recruitment of anterior dorsolateral prefrontal cortex in human reasoning: A parametric study of relational complexity. Cerebral Cortex, 12(5), 477–485.
Lin, H., & Jegelka, S. (2018). ResNet with one-neuron hidden layers is a universal approximator. arXiv:1806.10909.
Linsley, D., Kim, J., Ashok, A., & Serre, T. (2020). Recurrent neural circuits for contour detection. arXiv:2010.15314.
Linsley, D., Scheibler, D., Eberhardt, S., & Serre, T. (2018a). Global-and-local attention networks for visual recognition. arXiv:1805.08819.
Linsley, D., Shiebler, D., Eberhardt, S., & Serre, T. (2018b). Learning what and where to attend. arXiv:1805.08819.
Logan, G. D. (1994a). On the ability to inhibit thought and action: A users' guide to the stop signal paradigm. Orlando, FL: Academic Press.
Logan, G. D. (1994b). Spatial attention and the apprehension of spatial relations. Journal of Experimental Psychology: Human Perception and Performance, 20(5), 1015.
Marcus, G. F. (2001). The algebraic mind: Integrating connectionism and cognitive science. Cambridge, MA: MIT Press.
Messina, N., Amato, G., Carrara, F., Gennaro, C., & Falchi, F. (2021a). Recurrent vision transformer for solving visual reasoning problems. arXiv:2111.14576.
Messina, N., Amato, G., Carrara, F., Gennaro, C., & Falchi, F. (2021b). Solving the same-different task with convolutional neural networks. Pattern Recognition Letters, 143, 75–80.
Moore, C. M., Elsinger, C. L., & Lleras, A. (1994). Visual attention and the apprehension of spatial relations: The case of depth. Journal of Experimental Psychology: Human Perception and Performance, 20(5), 1015–1036.
Puebla, G., & Bowers, J. S. (2021). Can deep convolutional neural networks learn same-different relations? bioRxiv.
Ren, M., & Zemel, R. S. (2016). End-to-end instance segmentation and counting with recurrent attention. CoRR, abs/1605.09410.
Ricci, M., Cadène, R., & Serre, T. (2021). Same-different conceptualization: A machine vision perspective. Current Opinion in Behavioral Sciences, 37, 47–55.
Roelfsema, P. R., Lamme, V. A., & Spekreijse, H. (1998). Object-based attention in the primary visual cortex of the macaque monkey. Nature, 395(6700), 376–381.
Rosielle, L. J., Crabb, B. T., & Cooper, E. E. (2002). Attentional coding of categorical relations in scene perception: Evidence from the flicker paradigm. Psychonomic Bulletin and Review, 9(2), 319–326.
Sharma, S., Kiros, R., & Salakhutdinov, R. (2015). Action recognition using visual attention. arXiv:1511.04119.
Shepard, R. N., & Metzler, J. (1971). Mental rotation of three-dimensional objects. Science, 171(3972), 701–703.
Stabinger, S., Peer, D., Piater, J., & Rodríguez-Sánchez, A. (2021). Evaluating the progress of deep learning for visual relational concepts. Journal of Vision, 21(11), 8.
Stabinger, S., Rodríguez-Sánchez, A., & Piater, J. (2016). 25 years of CNNs: Can we compare to human abstraction capabilities? In A. E. Villa, P. Masulli, & A. J. Pons Rivero (Eds.), Artificial Neural Networks and Machine Learning–ICANN 2016 (pp. 380–387). Cham: Springer.
Stollenga, M. F., Masci, J., Gomez, F., & Schmidhuber, J. (2014). Deep networks with internal selective attention through feedback connections. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 3545–3553). Red Hook, NY: Curran.
Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., … Dosovitskiy, A. (2021). MLP-Mixer: An all-MLP architecture for vision. arXiv:2105.01601.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers and distillation through attention. arXiv:2012.12877v2.
Tsotsos, J. K., Rodriguez-Sanchez, A. J., Rothenstein, A. L., & Simine, E. (2007). Different binding strategies for the different stages of visual recognition. In F. Mele, G. Ramella, S. Santillo, & F. Ventriglia (Eds.), Advances in brain, vision, and artificial intelligence (pp. 150–160). Berlin: Springer.
Van Der Ham, I. J. M., Duijndam, M. J. A., Raemaekers, M., van Wezel, R. J. A., Oleksiak, A., & Postma, A. (2012). Retinotopic mapping of categorical and coordinate spatial relation processing in early visual cortex. PLOS One, 7(6), 1–8.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, Y. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30. Red Hook, NY: Curran.
Villalobos, K., Štih, V., Ahmadinejad, A., Sundaram, S., Dozier, J., Francl, A., … Boix, X. (2021). Do neural networks for segmentation understand insideness? Neural Computation, 33(9), 2511–2549.
Woo, S., Park, J., Lee, J.-Y., & So Kweon, I. (2018). CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (pp. 3–19). Berlin: Springer.
Xu, H., & Saenko, K. (2015). Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. CoRR, abs/1511.05234.
Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. (2016). Stacked attention networks for image question answering. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (pp. 21–29). Piscataway, NJ: IEEE.
Yihe, L., Lowe, S. C., Lewis, P. A., & van Rossum, M. C. (2019). Program synthesis performance constrained by non-linear spatial relations in synthetic visual reasoning test. arXiv:1911.07721.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2020). Deformable DETR: Deformable transformers for end-to-end object detection. arXiv:2010.04159.
