Abstract
In our previous study (Han & Sereno, 2022a), we found that two artificial cortical visual pathways trained for either identity or space actively retain information about both identity and space independently and differently. We also found that this independently and differently retained information about identity and space in two separate pathways may be necessary to accurately and optimally recognize and localize objects. One limitation of our previous study was that there was only one object in each visual image, whereas in reality, there may be multiple objects in a scene. In this study, we generalize our findings to object recognition and localization tasks in which multiple objects are present in each visual image. We constrain the binding problem by training the identity network pathway to report the identities of objects in a given order according to the relative spatial relationships between the objects, given that most visual cortical areas, including high-level ventral stream areas, retain spatial information. Under these conditions, we find that artificial neural networks with two pathways for identity and space perform better in multiple-objects recognition and localization tasks (higher average testing accuracy, lower testing accuracy variance, less training time) than artificial neural networks with a single pathway. We also find that the required number of training samples and the required training time increase quickly, and potentially exponentially, as the number of objects in each image increases. We suggest that binding information from multiple objects simultaneously within any network (cortical area) induces conflict or competition and may be part of the reason why our brain has limited attentional and visual working memory capacities.
1 Introduction
According to many neuropsychological, lesion, and anatomical studies, the human visual system has two major distinct cortical pathways (Felleman & Essen, 1991; Mishkin, Ungerleider, & Macko, 1983; Ungerleider & Mishkin, 1982). The ventral pathway is concerned with object identity (Logothetis & Sheinberg, 1996) and the dorsal pathway with spatial cognition (Colby & Goldberg, 1999). However, some recent studies argued that representations associated with shape and location processing are present in both visual streams (Konen & Kastner, 2008; Lehky & Sereno, 2007; Sereno & Lehky, 2011; Sereno, Lehky, & Sereno, 2020). In a previous study using artificial neural networks (Han & Sereno, 2022a), we showed that the two cortical visual pathways for identity and space actively retained information about both identity and space independently and differently. We also showed that this independent and different retained information about identity and space in the two modeled pathways was necessary to accurately and optimally recognize and localize objects. One limitation of our previous study was that there was only one object in each visual image, whereas in reality, there may be multiple objects in a scene.
In our current study, we try to generalize our models to multiple-objects recognition and localization tasks. One of the difficulties of dealing with images with multiple objects is the binding problem, where the representation of multiple objects with independent feature sets can lose information about which features belong to which objects (Markov, Utochkin, & Brady, 2021). Given that our previous study showed that the identity pathway actively retained information about space, we wanted to test whether it is possible to constrain the binding problem by taking advantage of this information (i.e., the spatial information in the identity network pathway). Our previous study also showed that the kinds of information that a network actively retains depend on the tasks or goals that were used for training the network. In our current study, we trained the identity network pathway by asking it to report the identities of the objects in a certain order that depends on the relative spatial relationships between objects in the image. As a result, the identity network pathway would actively retain information about the relative spatial relationships between objects. Asking the identity network to retain relative spatial relationships is plausible because previous physiological work has shown that cells in high-level ventral areas retain spatial information about objects, including retinotopic spatial information (Op De Beeck & Vogels, 2000; Sereno & Lehky, 2011) and angle-of-gaze spatial information (Sereno, Sereno, & Lehky, 2014), as well as spatial relationships among object parts (Yamane, Tsunoda, Matsumoto, Phillips, & Tanifuji, 2006), information needed for scene recognition (where the objects are part of a larger scene) and object recognition, respectively. Furthermore, even fMRI studies, with their poorer spatial resolution, have demonstrated that much of human neocortex contains topological maps of sensory surfaces (Sereno, Sood, & Huang, 2022).
In our prior work (Han & Sereno, 2022a), we showed that the simulated ventral pathway needed information about the relative spatial relationships between object parts to recognize the identity of the whole object (see also the discussion of the spatial relation of the faucet and basin of a sink in Figure 10b, in Sereno et al., 2020). Additionally, preliminary modeling results (Han & Sereno, 2022b) suggest that information about the relative spatial relationships between objects is able to constrain the binding problem when we combine the outputs of the identity network pathway and the spatial network pathway and process them together using a two-pathway neural network.
Previous studies have used artificial neural networks trained with supervised learning, self-supervised learning, or unsupervised learning to simulate the ventral and dorsal cortical visual pathways in the brain (Yamins et al., 2014; Kriegeskorte, 2015; Dobs, Martinez, Kell, & Kanwisher, 2022; Konkle & Alvarez, 2022; Bakhtiari, Mineault, Lillicrap, Pack, & Richards, 2021; Zhuang et al., 2022). Many of these previous studies found that artificial convolutional neural network models could successfully produce brain-like neural responses or even predict neural responses in the biological visual cortex. However, the main goal of our study is to gain a better understanding of the consequences of brain structure or segregated streams of processing using computational modeling rather than identifying the specific response features that are similar to the real neural responses of ventral and dorsal cortical pathways.
In our study, feedforward convolutional neural networks were used to simulate the two cortical visual pathways. All neural networks in our study were trained using supervised learning. When modeling the two cortical visual pathways, it is assumed that the two pathways use the same structure for simplicity and control. We trained the two neural networks separately using multiple-objects recognition tasks and multiple-objects localization tasks, respectively, so that the trained neural networks will be able to model the ventral and dorsal pathways, respectively. We used stochastic gradient descent with backpropagation to update the weights in the neural networks during training. Stochastic gradient descent with backpropagation is currently the best method for updating connection weights between neurons in artificial neural networks, and some have argued that the brain might be able to implement backpropagation-like effective synaptic updates (Lillicrap, Santoro, Marris, Akerman, & Hinton, 2020; Whittington & Bogacz, 2019).
One artificial neural network, the identity network, was trained to do an identity task (to identify whether the objects are tops, pants, or shoes). Another artificial neural network, the location network, was trained to do a localization task (to determine the locations of the objects). The identity network was used to model the ventral pathway, whereas the location network was used to model the dorsal pathway. These two networks were used to simulate the functions of the ventral and dorsal cortical visual pathways in the brain. The identity network and the location network were trained independently to serve as the two pathways in the two-pathway network. The goal of the two-pathway network is to recognize and localize multiple objects in the image at the same time. For comparison, another neural network, the one-pathway network, was also trained to recognize and localize multiple objects in the image at the same time. The sizes of the two-pathway network and the one-pathway network are equal. The difference is that the one-pathway network has only one pathway, and all of its training occurs as a single network (the two pathways in the two-pathway network are trained as independent networks).
According to our simulation results, the two-pathway network was able to outperform the one-pathway network in almost all experimental conditions (different numbers of objects in each image, different numbers of training samples). The two-pathway network achieved significantly higher average testing accuracy, had smaller testing accuracy variance, and required fewer training epochs and less training time. However, the required number of training samples and the required training time increased quickly when the number of objects in each image increased. As a result, neither of the two networks was able to efficiently achieve high testing accuracies when there were four or more objects in the image. Though it may be a limitation of our models, this phenomenon may agree with the experimental evidence that our brain has a limited attention and working memory capacity for many cognitive processes, such as the processes involved in visual perception tasks, digit span tasks, and reading span tasks (Isbell, Fukuda, Neville, & Vogel, 2015; Miller, 1956; Daneman & Carpenter, 1980). Our models were not able to achieve high performance when there were four or more objects in the image because the binding problem became more difficult as the number of objects increased. Therefore, we suggest that capacity limits may be in part a consequence of the binding problem.
Similar to our previous study, our modeling is intended as a computational proof of concept and as a way to better understand the effects of different organizational schemes, rather than as an accurate model of the real human brain. Multiple-objects recognition and localization tasks are very important in both cognitive neuroscience and computer science. Our models may help people better understand the computational costs and benefits of brain organization. Our models may also provide insights into how to find better, more efficient, and more biologically plausible multiple-objects recognition and localization algorithms.
2 Methods
2.1 Objects
Black and white images of different kinds of tops, pants, and shoes obtained from the data set Fashion-MNIST were used as the objects in the object recognition and localization tasks (Xiao et al., 2017). There are 62 kinds of tops, 66 kinds of pants, and 58 kinds of shoes. Each object image was embedded in a black background and presented at different locations. There may be two, three, or four objects in each black background image. These object images with black background were used as visual inputs. Some examples of these input images are shown in Figure 1.
These black and white images were used to train, validate, and test the neural networks: two-thirds of the total number of images were used for training, one-sixth of the total number of images were used for validating, and one-sixth of the total number of images were used for testing.
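The split described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the images are shuffled before being assigned to the three sets; the function name `split_dataset` and the fixed random seed are hypothetical and are not taken from the original implementation.

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle samples and split them into train (about 2/3), validation (1/6), test (1/6)."""
    items = list(samples)
    random.Random(seed).shuffle(items)      # assumption: random assignment to the three sets
    n_val = len(items) // 6
    n_test = len(items) // 6
    n_train = len(items) - n_val - n_test   # the remainder (about two-thirds) is used for training
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(6000))
print(len(train), len(val), len(test))      # 4000 1000 1000
```

With a total of 6000 samples, this yields 4000 training, 1000 validation, and 1000 testing samples, so the "number of samples" reported for each condition covers all three sets.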
2.2 Object Locations
Object image locations are shown and explained in Figure 1. The objects were put at different locations in a 140 × 140 pixel black square background. Specifically, each object image could have nine possible locations (see Figure 1a). The objects in the same visual image are always at different locations, and they never overlap with each other.
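One way to compose such input images can be sketched as follows. The exact pixel coordinates of the nine locations are defined in Figure 1a and are not reproduced here, so the layout below is an assumption: a 3 × 3 grid with row-major numbering (1 to 9) and evenly spaced offsets of 0, 56, and 112 pixels, which fits the 28 × 28 pixel Fashion-MNIST images into a 140 × 140 background without overlap. The names `location_to_corner` and `compose_image` are hypothetical.

```python
CANVAS = 140             # side length of the black background in pixels (from the text)
OBJ = 28                 # side length of a Fashion-MNIST image in pixels
OFFSETS = (0, 56, 112)   # assumption: three evenly spaced positions along each axis

def location_to_corner(loc):
    """Map a location index 1..9 (assumed row-major) to the top-left pixel (row, col)."""
    if not 1 <= loc <= 9:
        raise ValueError("location must be in 1..9")
    r, c = divmod(loc - 1, 3)
    return OFFSETS[r], OFFSETS[c]

def compose_image(objects):
    """objects: dict {location: 28x28 list of pixel rows}. Returns a 140x140 canvas."""
    canvas = [[0] * CANVAS for _ in range(CANVAS)]
    for loc, img in objects.items():   # dict keys are unique, so objects never share a location
        top, left = location_to_corner(loc)
        for dr in range(OBJ):
            canvas[top + dr][left:left + OBJ] = img[dr]
    return canvas

white = [[255] * OBJ for _ in range(OBJ)]   # placeholder standing in for a real object image
canvas = compose_image({1: white, 5: white, 6: white})
```

Because the assumed offsets are at least 28 pixels apart, objects placed at different locations cannot overlap, matching the constraint stated above.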
2.3 Neural Networks
Feedforward convolutional neural networks were used to build brain networks to model the two cortical visual pathways in the brain. Each neural network consists of several hidden layers, including the convolutional layers, the pooling layers, and the fully connected dense layers. ReLU activation function was used at each layer except the final output layer, in which a softmax activation function was used. These neural networks were implemented using TensorFlow and were trained using supervised learning, the cross-entropy loss function, and stochastic gradient descent with backpropagation.
Our primary goal in this study is not to optimize each artificial neural network to achieve the highest performance. Rather, it is to compare the performance of one-pathway and two-pathway artificial neural network architectures when they have the same hyperparameter settings. In our previous study (Han & Sereno, 2022a), we repeated some simulations with different hyperparameter settings (e.g., number of layers, number of filters, filter sizes) in the artificial neural networks and found that our findings did not depend on the specific hyperparameter settings. Therefore, in our current study, we chose hyperparameter settings similar to those used in our previous study.
A batch size of 256 and the Adam optimization method were used while training. The initial learning rate of Adam optimization was 0.001. The other hyperparameters are specified in Figures 2, 3, and 4. We applied 30% random dropout to all the dense layers in all neural networks during training for regularization. All networks were trained with enough epochs to ensure that all of them had reached the highest possible validation accuracy at the end of training.
The structure of the identity network and the location network is shown in Figure 2. These two neural networks share the same structure, and the only difference between them is in their final output layers. Both networks take the same set of images as inputs. However, they were trained to do different tasks, so their output layers have different sizes. These two neural networks were trained to serve as the two pathways in the two-pathway network for simultaneous multiple-objects recognition and localization.
The identity network was trained to determine the identities of the objects (tops, pants, or shoes) and report their identities according to their relative locations. Specifically, it was trained to report the identities in this order: it should report the identity of the object at the top of the image first. If two objects are on the same horizontal line, then it should report the identity of the object on the left first. For example, when it receives the input image shown in Figure 1c, it should report the identities of all the objects in this order: shoe, top, pant. This information was represented in the output layer of the network using one-hot encoding. Note that the specific order described here is just an assumption without loss of generality: any particular (but consistent) spatial report would suffice. In general, the only requirement is that the one-hot vector representation in the final output layer of the identity network is determined by both the identities of the objects and the spatial relationships between the objects.
The location network was trained to determine the locations of the objects. It should report the locations of all the objects in the image regardless of their identities. For example, when it receives the input image shown in Figure 1c, it should report the locations of all the objects: locations 1, 5, and 6. This information was also represented in the output layer of the network using one-hot encoding.
The structures of the one-pathway network and the two-pathway network are shown in Figures 3 and 4. The sizes of the two networks are designed to be equal, which means they have the same number of layers, the same total number of kernels in each convolutional layer, and the same total number of units in each dense layer. We chose to keep the number of units the same because our brain has a limited number of neurons, but the number of connections (the number of parameters) in our brain is more flexible. The only difference between the two networks is their architectures. The one-pathway network was trained to determine the identities and locations of all the objects in each image simultaneously using only one pathway. It took images as inputs, and the output layer reported the identities and locations of all the objects using one-hot encoding. The two-pathway network was trained to determine the identities and locations of all the objects in each image simultaneously by processing the input image in two pathways and then combining them. It took the images as inputs and sent this information into two pathways. The independently trained identity network and location network (excluding their one-hot encoded output layers) were used as the two pathways in the two-pathway network, which processed the input images with the identity pathway and the location pathway separately. Then the network concatenated the final layers of the two pathways and processed the information jointly with some additional common dense layers. After the two pathways had been independently trained and their weights fixed, the common dense layers in the two-pathway network were trained to report the identities and locations of all the objects using one-hot encoding.
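The two training targets described above can be illustrated with a short sketch. It assumes that the nine locations are numbered row by row (1 to 3 in the top row, 4 to 6 in the middle row, 7 to 9 in the bottom row) and that the identity output is a one-hot vector over all ordered identity sequences; both are assumptions made for illustration, as are the function names. The pairing of objects with locations in the example (shoe at 1, top at 5, pant at 6) is inferred from the Figure 1c description under that numbering.

```python
from itertools import product

CLASSES = ("top", "pant", "shoe")   # the three categories named in the text

def grid_position(loc):
    """Assumed row-major numbering: location 1..9 -> (row, col) in a 3x3 grid."""
    return divmod(loc - 1, 3)

def identity_target(objects):
    """objects: list of (class_name, location). Returns (ordered names, one-hot vector)."""
    # Sort by row first (top of the image first), then by column (left first).
    ordered = sorted(objects, key=lambda o: grid_position(o[1]))
    names = tuple(name for name, _ in ordered)
    all_sequences = list(product(CLASSES, repeat=len(objects)))   # 3**n ordered sequences
    one_hot = [0] * len(all_sequences)
    one_hot[all_sequences.index(names)] = 1
    return names, one_hot

def location_target(objects):
    """Locations of all objects, regardless of their identities."""
    return tuple(sorted(loc for _, loc in objects))

scene = [("pant", 6), ("shoe", 1), ("top", 5)]   # input order is arbitrary
names, one_hot = identity_target(scene)
print(names)                    # ('shoe', 'top', 'pant')
print(location_target(scene))   # (1, 5, 6)
```

The identity target changes if two objects swap places, whereas the location target does not depend on which object occupies which location. This is the sense in which the identity pathway is asked to carry relative spatial information.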
Each network was trained five times, and testing accuracies were obtained for each of the five training sessions. The testing accuracies were obtained by dividing the number of correct classifications by the total number of testing samples during the testing session. The accuracies used to compare different networks in this letter always refer to the testing accuracies. Welch's two-sample t-tests were used to compare network accuracies and determine the significance of the differences.
3 Results
We performed training, validation, and testing multiple times for each network, with network weights randomly initialized differently each time. When obtaining the accuracies in each experimental setting, the networks were always trained five times, and five testing accuracies were obtained for each condition after training. Welch's two-sample t-tests were used to compare different testing accuracies and determine the significance of the differences. The difference between testing accuracies is considered to be significant if the corresponding p-value is less than 0.05. The average testing accuracies of different neural networks for input images with two objects, three objects, and four objects are shown in Tables 1, 2, and 3, respectively.
Table 1: Average testing accuracies (%) for input images with two objects.
Network | 600 samples | 2400 samples | 6000 samples | 12,000 samples |
---|---|---|---|---|
Identity network (chance level 11.1%) | 81.8 ± 2.2 | 98.4 ± 0.2 | 99.9 ± 0.1 | 100.0 ± 0.0 |
Location network (chance level 1.2%) | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 |
Two-pathway network (chance level 0.1%) | 62.2 ± 2.2 | 98.2 ± 0.1 | 99.9 ± 0.1 | 100.0 ± 0.0 |
One-pathway network (chance level 0.1%) | 61.6 ± 5.0 | 97.8 ± 0.5 | 99.9 ± 0.1 | 100.0 ± 0.0 |
Notes: The row headings are the names of the networks. The column headings are the total number of samples for training, validation, and testing. The identity network was trained to report the identities of all objects according to their relative locations. The location network was trained to determine the locations of all objects. The two-pathway network and the one-pathway network were trained to determine the identity and location of each object in the image. The chance-level accuracies (%) are reported next to the network names in the table.
Table 2: Average testing accuracies (%) for input images with three objects.
Network | 600 samples | 2400 samples | 6000 samples | 12,000 samples |
---|---|---|---|---|
Identity network (chance level 3.7%) | 35.6 ± 2.6 | 93.8 ± 0.8 | 99.6 ± 0.3 | 99.9 ± 0.1 |
Location network (chance level 0.1%) | 97.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 |
Two-pathway network (chance level 0.0%) | 16.8 ± 1.5 | 47.6 ± 0.3 | 85.4 ± 0.1 | 96.9 ± 0.0 |
One-pathway network (chance level 0.0%) | 12.2 ± 2.6 | 29.6 ± 3.4 | 63.7 ± 10.5 | 91.3 ± 2.5 |
Notes: The row headings are the names of the networks. The column headings are the total number of samples for training, validation, and testing. The identity network was trained to report the identities of all objects according to their relative locations. The location network was trained to determine the locations of all objects. The two-pathway network and the one-pathway network were trained to determine the identity and location of each object in the image. The chance-level accuracies (%) are reported next to the network names in the table.
Table 3: Average testing accuracies (%) for input images with four objects.
Network | 600 samples | 2400 samples | 6000 samples | 12,000 samples |
---|---|---|---|---|
Identity network (chance level 1.2%) | 14.0 ± 3.0 | 69.2 ± 1.1 | 98.7 ± 0.8 | 99.8 ± 0.1 |
Location network (chance level 0.0%) | 99.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 |
Two-pathway network (chance level 0.0%) | 6.2 ± 0.8 | 15.0 ± 0.5 | 31.4 ± 0.1 | 54.8 ± 0.1 |
One-pathway network (chance level 0.0%) | 3.6 ± 1.5 | 3.8 ± 1.1 | 5.1 ± 0.8 | 14.8 ± 1.3 |
Notes: The row headings are the names of the networks. The column headings are the total number of samples for training, validation, and testing. The identity network was trained to report the identities of all objects according to their relative locations. The location network was trained to determine the locations of all objects. The two-pathway network and the one-pathway network were trained to determine the identity and location of each object in the image. The chance-level accuracies (%) are reported next to the network names in the table.
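The chance levels listed in Tables 1 to 3 are consistent with a simple counting argument, sketched below. This is our reading of the reported numbers rather than a procedure stated in the text: it assumes 3 identity classes and 9 locations per object, treated as independent choices for each of the n objects (so it does not exclude repeated locations). The function name `chance_levels` is hypothetical.

```python
def chance_levels(n_objects, n_classes=3, n_locations=9):
    """Chance accuracy (%) for the identity, location, and combined tasks."""
    identity = 100 / n_classes ** n_objects                    # 3**n ordered identity reports
    location = 100 / n_locations ** n_objects                  # 9**n location reports (assumed)
    combined = 100 / (n_classes * n_locations) ** n_objects    # 27**n joint reports (assumed)
    return tuple(round(x, 1) for x in (identity, location, combined))

for n in (2, 3, 4):
    print(n, chance_levels(n))
# 2 (11.1, 1.2, 0.1)
# 3 (3.7, 0.1, 0.0)
# 4 (1.2, 0.0, 0.0)
```

Under these assumptions, the number of possible combined reports grows by a factor of 27 with each additional object, which gives a sense of how quickly the output space expands.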
3.1 Two Objects
According to the results shown in Table 1, the testing accuracies of the location network were always 100% for different total numbers of samples. This may be because it is very easy to determine the locations of all the objects when there are only two objects in each image. The testing accuracies of all the other neural networks increased when the total number of samples increased. Though the difference between the average testing accuracies of the two-pathway network and the one-pathway network was small and not significant (), the standard deviations of the two-pathway network were smaller than or equal to the standard deviations of the one-pathway network. These results indicate that the performance of the two-pathway network was more stable than that of the one-pathway network.
3.2 Three Objects
According to the results shown in Table 2, the testing accuracy of the location network was 100% when the total number of samples used was 2400 or more. This may be because it is relatively easy to determine the locations of all the objects when there are three objects in each image. The testing accuracies of all the other neural networks increased when the total number of samples increased. The difference between the average testing accuracies of the two-pathway network and the one-pathway network was significant (), and the standard deviations of the two-pathway network accuracies were smaller than the standard deviations of the one-pathway network accuracies. These results indicate that the two-pathway network performed better and more stably than the one-pathway network.
3.3 Four Objects
According to the results shown in Table 3, the testing accuracy of the location network was 100% when the total number of samples used was 2400 or more. This may be because it is still relatively easy to determine the locations of all the objects when there are four objects in each image. The testing accuracies of all the other neural networks increased when the total number of samples increased. The difference between the average testing accuracies of the two-pathway network and the one-pathway network was significant (), and the standard deviations of the two-pathway network accuracies were smaller than the standard deviations of the one-pathway network accuracies. These results indicate that the two-pathway network performed better and more stably than the one-pathway network.
3.4 The Number of Epochs and Time Required for Training the Two-Pathway Network and the One-Pathway Network
In addition, we measured the time spent during each training epoch according to the TensorFlow logs, which are automatically output when using the model.fit command and indicate the amount of time it took to train each epoch. For the two-pathway network, it took around 1 second per epoch to train the identity network, around 1 second per epoch to train the location network, and less than 1 second per epoch to train the common dense layers. For the one-pathway network, each training epoch took around 3 seconds. Therefore, each training epoch for the two-pathway network (around or less than 1 s) always took much less time than each training epoch for the one-pathway network (around 3 s). This may be because fewer weight parameters need to be updated in the two-pathway network during training.
Because training the two-pathway network required fewer epochs and each training epoch took less time, the total training time required for the two-pathway network was shorter. These results on the required number of training epochs and training time were obtained using three-object images and 6000 samples. Similar results were also found with different numbers of objects and different numbers of samples.
3.5 Comparing the Performance of the Two-Pathway Network and the One-Pathway Network
Table 4: Testing accuracies (%) of the two-pathway network for different numbers of objects and samples.
Number of objects | 1200 samples | 2400 samples | 6000 samples | 12,000 samples | 30,000 samples | 60,000 samples |
---|---|---|---|---|---|---|
Two objects | 82.2 ± 0.6¹ | 98.2 ± 0.1² | 99.9 ± 0.1 | 100.0 ± 0.0 | NA³ | NA³ |
Three objects | NA³ | 47.6 ± 0.3 | 85.4 ± 0.1¹ | 96.9 ± 0.0² | NA³ | NA³ |
Four objects | NA³ | 15.0 ± 0.5 | 31.4 ± 0.1 | 54.8 ± 0.1 | 85.7 ± 0.1¹ | 98.1 ± 0.0² |
Notes: The row headings are the numbers of objects. The column headings are the total number of samples for training, validation, and testing.
¹ The accuracies are between and .
² The accuracies are between and .
³ Simulations that were not conducted.
3.6 Comparing the Performance of the One-Pathway Network with and without Pretraining
The results reported above were obtained without any pretraining. To find out whether pretraining could improve the performance of the one-pathway network, we conducted additional simulations to test its performance with pretraining. With three objects per image and 6000 samples in total, we pretrained the one-pathway network with the identity task first and the location task later; its testing accuracy on the multiple-objects recognition and localization task was (12.2 ± 8.9)%. We also pretrained the one-pathway network in the other order, with the location task first and the identity task later. Its testing accuracy in this case was (11.6 ± 10.2)%. The accuracies of the one-pathway network in both pretraining cases were significantly lower than its accuracy without pretraining (63.7 ± 10.5)%.
3.7 Possible Reasons for the Underperformance of the One-Pathway Network
To help elucidate a possible reason for the underperformance of the one-pathway network, we used a decoder to decode information from the second-to-last layer activities of the trained one-pathway network. The decoder was a multilayer perceptron with three hidden dense layers and 100 units in each hidden layer. The ReLU activation function was used at each layer in the decoder except the final output layer, in which a softmax activation function was used. We used the second-to-last layer activities of the trained one-pathway network as inputs to the decoder and trained the decoder to do either the identity task or the location task with three objects per image and 6000 samples in total. The decoding accuracy for the identity task was (66.7 ± 3.0)%, and the decoding accuracy for the location task was (91.2 ± 1.0)%. The decoding accuracy for the identity task was much lower than that for the location task. In addition, the decoding accuracy for the identity task (66.7 ± 3.0)% was very close to the accuracy of the one-pathway network on the object recognition and localization task (63.7 ± 10.5)% in the same condition.
3.8 The Contribution of Each Pathway to the Performance of the Two-Pathway Network
To examine the contribution of each pathway to the performance of the two-pathway network, we tested its performance after removing the identity pathway or the location pathway, using three objects per image and 6000 samples in total. After removing the location pathway and keeping only the identity pathway, the testing accuracy of the two-pathway network was (37.0 ± 1.9)%. After removing the identity pathway and keeping only the location pathway, the testing accuracy was (4.6 ± 0.5)%. The accuracies in both cases were significantly lower than the accuracy of the two-pathway network when both pathways were present (85.4 ± 0.1)%. The accuracy of the two-pathway network decreased more when the identity pathway was removed.
3.9 The Contribution of Spatial Relation Information in the Identity Network in Constraining the Binding Problem
To constrain the binding problem, we asked the identity network to learn spatial relation information by training it to report objects' identities in a certain order. To test whether the spatial relation information in the identity network contributed to performance, we trained another identity network and asked this network to report the identities of all the objects regardless of the spatial relations between these objects. Then we used this trained network as the identity pathway in the two-pathway network. The testing accuracy of this modified two-pathway network was (76.7 ± 1.0)% when there were three objects per image and 6000 samples in total, which was significantly lower than the testing accuracy of the original two-pathway network (85.4 ± 0.1)% in the same condition. In addition, the modified two-pathway network required more training epochs (around 1000 epochs) than the original two-pathway network (around 100 epochs).
For comparison, we also trained the modified two-pathway network and the original two-pathway network to do an object recognition and localization task without binding. We only asked the networks to report all objects' identities and locations, but we did not require them to bind each object's identity with its location. The chance-level accuracy for this task without binding and the original task with binding is approximately 0%. We trained them with three objects per image and 6000 samples in total. The testing accuracy of the modified two-pathway network for this task was (98.2 ± 0.0)%, very close to the testing accuracy of the original two-pathway network on the same task (98.3 ± 0.0)%. In addition, both networks required around 100 epochs to train.
3.10 The Accuracy of the Two-Pathway Network When the Number of Objects Increased
According to the results shown in Tables 1, 2, and 3, and Figure 6, the accuracy of the two-pathway network decreased when the number of objects increased with the same number of training samples. However, it is still unclear how many samples the two-pathway network requires to reach a similar level of high accuracy for different numbers of objects.
It is hard to accurately estimate the required number of samples for the two-pathway network to reach the same accuracy in different conditions. Therefore, to address this question, we conducted additional simulations to estimate the required number of samples for the two-pathway network to reach similar high accuracies when there are two, three, or four objects. The results, shown in Table 4, suggest that the two-pathway network required 1200, 6000, or 30,000 samples to reach a similar relatively high accuracy (between and ) for two, three, or four objects, respectively. Additionally, these results suggest that the two-pathway network required 2400, 12,000, or 60,000 samples to reach a similar but even higher accuracy level (between and ) for two, three, or four objects, respectively.
To test whether our results are robust to hyperparameter changes, we repeated some simulations with different numbers of convolutional layers. We either increased or decreased the number of these layers in each pathway of the two-pathway network. We increased the number of these layers by adding one convolutional layer to each pathway of the original two-pathway network. The additional layer is the fourth convolutional layer, and it has the same number of filters and the same kernel sizes as the third convolutional layer in each pathway of the original two-pathway network. We reduced the number of convolutional layers by removing the second convolutional layer in each pathway of the original two-pathway network. We repeated simulations using these modified two-pathway networks under different conditions, and the results are shown in Table 5.
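The sample counts quoted above can be checked for how they scale with the number of objects. The sketch below takes the counts from Table 4 and computes the ratio between consecutive object counts; the variable and function names are hypothetical. A constant ratio is what exponential growth would produce, although three data points per accuracy band cannot distinguish it from other fast-growing functions.

```python
# Total samples needed by the two-pathway network, keyed by number of objects (Table 4).
lower_band = {2: 1200, 3: 6000, 4: 30000}     # the lower of the two accuracy bands
higher_band = {2: 2400, 3: 12000, 4: 60000}   # the higher of the two accuracy bands

def growth_ratios(samples):
    """Ratio of required samples between consecutive numbers of objects."""
    counts = sorted(samples)
    return [samples[b] / samples[a] for a, b in zip(counts, counts[1:])]

print(growth_ratios(lower_band))    # [5.0, 5.0]
print(growth_ratios(higher_band))   # [5.0, 5.0]
```

In both bands, each additional object multiplies the required number of samples by about five over the tested range.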
Number of objects | Remove one convolutional layer (1200 samples) | Remove one convolutional layer (6000 samples) | Remove one convolutional layer (30,000 samples) | Add one convolutional layer (1200 samples) | Add one convolutional layer (6000 samples) | Add one convolutional layer (30,000 samples) |
---|---|---|---|---|---|---|
Two objects | 80.0 ± 0.7 | NA | NA | 85.5 ± 0.6 | NA | NA |
Three objects | NA | 84.6 ± 0.3 | NA | NA | 85.6 ± 0.1 | NA |
Four objects | NA | NA | 85.6 ± 0.1 | NA | NA | 85.8 ± 0.0 |
Notes: The row headings are the numbers of objects. The column headings are the total number of samples for training, validation, and testing. The data in the first three columns were obtained using the modified two-pathway network with one convolutional layer removed in each pathway. The data in the last three columns were obtained using the modified two-pathway network with one additional convolutional layer added in each pathway. The data for simulations that were not conducted are labeled “NA.”
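The two architectural variants described above can be sketched as operations on a per-pathway layer specification. This is a minimal illustration only: the filter counts, kernel sizes, layer names, and function names below are hypothetical placeholders, not the values or code used in the simulations. The sketch also assumes that each original pathway has three convolutional layers, as implied by the added layer being the fourth.

```python
import copy

# Hypothetical per-pathway specification; filter counts and kernel sizes
# are placeholders, not the values used in the simulations.
original_pathway = [
    {"name": "conv1", "filters": 16, "kernel": (3, 3)},
    {"name": "conv2", "filters": 32, "kernel": (3, 3)},
    {"name": "conv3", "filters": 64, "kernel": (3, 3)},
]


def add_fourth_layer(layers):
    """Append a fourth layer that copies the third layer's filters and kernel."""
    new_layers = copy.deepcopy(layers)
    fourth = copy.deepcopy(new_layers[2])
    fourth["name"] = "conv4"
    new_layers.append(fourth)
    return new_layers


def remove_second_layer(layers):
    """Return a copy of the specification without the second layer."""
    return [copy.deepcopy(layer) for i, layer in enumerate(layers) if i != 1]


deeper = add_fourth_layer(original_pathway)
shallower = remove_second_layer(original_pathway)
print([layer["name"] for layer in deeper])
print([layer["name"] for layer in shallower])
```

Both helpers return new lists, so the original specification stays available for the baseline condition.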
3.11 Decreased Accuracy When the Number of Objects Increased: Role of Binding?
It is possible that the decreased performance of the two-pathway network with increasing numbers of objects was partly caused by a binding limitation. However, because the testing accuracies of the identity network also decreased when the number of objects increased, it is also possible that the decreased performance of the two-pathway network was caused merely by the decreased performance of the identity pathway and not by a binding limitation.
To test this hypothesis, we trained the same original two-pathway network to do the object recognition and localization task either with or without binding. For the case without binding, we asked the final combined network only to report all objects' identities and all objects' locations; we did not require the network to bind each object's identity with its location. We trained each network with three objects per image and 6000 samples in total. The chance-level accuracy for the task without binding and for the original task with binding is approximately 0%. The testing accuracy of the two-pathway network on the task without binding was (98.3 ± 0.0)%, which was significantly higher than its testing accuracy on the original task that required binding (85.4 ± 0.1)%. We also trained the two-pathway network with four objects per image and 6000 samples in total. The chance-level accuracy for the task without binding and for the original task with binding is still approximately 0%. The testing accuracy of the two-pathway network on this task without binding was (89.3 ± 0.3)%, which was also significantly higher than its testing accuracy on the original task that required binding (31.4 ± 0.1)%.
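The difference between the two scoring criteria can be illustrated with a minimal sketch. It assumes that a response is a list of (identity, location) pairs. The function names, the labels "A", "B", and "C", and the grid coordinates are hypothetical and are not taken from the simulation code.

```python
from collections import Counter


def correct_with_binding(predicted, target):
    """Correct only if every (identity, location) pair matches the target."""
    return Counter(predicted) == Counter(target)


def correct_without_binding(predicted, target):
    """Correct if identities and locations match separately, ignoring pairing."""
    pred_ids, pred_locs = zip(*predicted)
    true_ids, true_locs = zip(*target)
    return (Counter(pred_ids) == Counter(true_ids)
            and Counter(pred_locs) == Counter(true_locs))


# Hypothetical three-object image: identity labels with (row, column) locations.
target = [("A", (0, 0)), ("B", (1, 2)), ("C", (2, 1))]
# The same identities and locations, but "A" and "B" are bound to the wrong places.
swapped = [("B", (0, 0)), ("A", (1, 2)), ("C", (2, 1))]

print(correct_with_binding(swapped, target))
print(correct_without_binding(swapped, target))
```

A response that swaps two identities between locations counts as an error under the binding criterion but as correct under the criterion without binding, which is why the latter task isolates recognition and localization from binding.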
4 Discussion
One of the limitations of our previous study modeling the two cortical visual pathways was that there was only one object in each input image. Here, we sought to test whether our findings could be generalized to multiple-object recognition and localization tasks. In our current study, we found that the artificial neural networks with two pathways for identity and space have better performance in multiple-objects recognition and localization tasks (higher average testing accuracy, lower testing accuracy variance, less training time) than the artificial neural networks with a single pathway. Additionally, we found that the required number of training samples and the required training time increased quickly, and potentially exponentially, when the number of objects in each image increased. We also showed that the spatial relation information required in the training of our identity network to constrain the binding problem was critical and increased the performance of the two-pathway network. Finally, we showed that testing accuracies of the two-pathway network increased when it was trained to do an object recognition and localization task without binding, suggesting that binding limited performance and may be a reason that our brain has limited attentional and visual working memory capacities.
4.1 The Performance of the Two-Pathway Network Was Significantly Better Than That of the One-Pathway Network
According to our simulation results, the two-pathway network outperformed the one-pathway network in almost all conditions. These are fair comparisons because the two networks have equal sizes, and we trained every network for enough epochs to reach its highest possible validation accuracy. The two-pathway network achieved higher average testing accuracy and lower testing accuracy variance in most conditions. Further, the two-pathway network reached its highest validation accuracy with less total training time. Therefore, our simulation results suggest that two separate pathways that process the same visual information in different ways are advantageous, giving the network better performance (higher average testing accuracy, lower testing accuracy variance, less training time) in multiple-objects recognition and localization tasks.
We compared the performance of the one-pathway network with or without pretraining. In the case of pretraining, we pretrained the one-pathway network with the identity task and the location task. We found that the accuracy of the one-pathway network with pretraining was significantly lower than its accuracy without pretraining. This may be because pretraining made the one-pathway network more likely to get stuck in local minima. These findings show that pretraining with the identity task and the location task cannot improve the performance of this network. Therefore, the better performance of the two-pathway network cannot be explained by pretraining. A one-pathway neural network is not optimal or efficient for learning these two different specializations (multiple-objects recognition and localization).
According to Dobs et al. (2022), a one-pathway neural network may be sufficient for learning object recognition and face recognition tasks. However, in our study, the difference between the object recognition task and the object localization task is larger than the difference between the object recognition task and the face recognition task. As a result, it is likely more difficult for a one-pathway neural network to find a common feature space to solve both the object recognition task and the object localization task. Further, we show here that the performance of a one-pathway neural network is impaired in multiple-objects recognition and localization tasks compared to a two-pathway neural network (lower average testing accuracy, higher testing accuracy variance, more training time).
4.2 Possible Reasons for the Underperformance of the One-Pathway Network
We used a decoder to decode identity and location information from the second-to-last layer of the one-pathway network trained with three objects per image and 6000 samples in total. According to our simulations, the decoding accuracy for the identity task was much lower than the decoding accuracy for the location task. It is important to note that with three objects per image and 6000 samples in total, the identity network and the location network have very similar accuracies on the identity and location tasks, respectively (see Table 2). Further, the identity and location tasks also have similar chance-level accuracies that are close to 0. Therefore, the much lower decoding accuracy for the identity task suggests that the one-pathway network learned less identity information than location information. In addition, the decoding accuracy for the identity task was very close to the accuracy of the one-pathway network on the object recognition and localization tasks. Therefore, these results suggest that one reason for the underperformance of the one-pathway network was that it was not able to learn enough identity information.
4.3 The Contribution of Each Pathway in the Performance of the Two-Pathway Network
According to our simulations, the accuracy of the two-pathway network decreased significantly when the identity pathway or the location pathway was removed. In addition, the accuracy decreased significantly more when the identity pathway was removed. Therefore, both pathways contributed to the performance of the two-pathway network, and the identity pathway contributed more. It is possible that the identity pathway contributed more because it included both identity information and spatial relation information, whereas the location pathway contained only spatial information. It is also likely that the relative contributions of each pathway change with different task goals or task design.
4.4 The Contribution of Spatial Relation Information in the Identity Network in Constraining the Binding Problem
Previously, we showed that networks trained either for identity or location retained spatial information (Han & Sereno, 2022a). In some visual perception tasks, the goal of the task may require coordination of the information from these separated pathways (e.g., reaching for the object that is edible when multiple objects are present). In these cases, processing information independently and differently using multiple separate pathways may cause a binding problem (Treisman, 1996). We suggested that the binding problem may be lessened if we could take advantage of the spatial information contained in the identity network and the object identity information in the spatial network. Therefore, in our current study, we assume that the ventral pathway has access to the relative spatial information of objects and try to constrain the binding problem in the following way. We trained the identity network by asking it to report the identities of all the objects in each image in a certain order that depends on the spatial relations between these objects. Note that we can choose any consistent order as long as the one-hot vector representation in the final output layer of the identity network is determined by both the identities of the objects and the spatial relationships between the objects. Because the information retained by the neural networks depends on the training task (Han & Sereno, 2022a), this task would make the identity network actively retain not only the identities of the objects but also the relative spatial relationships between the objects. We trained the location network by asking it to report the locations of all the objects in the image regardless of their identities. Then we used these trained networks as the two separate pathways in the two-pathway network. Therefore, the two-pathway network should be able to bind the identity of each object with its location by combining the identity information in the identity network, the relative spatial relation information in the identity network, and the absolute location information without identities in the location network.
Our simulation results indicate that it is possible to constrain the binding problem in this way.
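One way to construct an order-dependent identity target of the kind described above is sketched here. The sketch rests on two assumptions that the text does not specify: objects are ordered by reading order of their (row, column) locations, which is only one of many consistent choices, and the ordered identity tuple is mapped to a single one-hot index by treating the identities as digits in base `num_classes`. The function names and example values are hypothetical.

```python
def ordered_identity_index(objects, num_classes):
    """Map objects to one class index that depends on identities AND spatial order.

    objects: list of (identity, (row, col)) with identity in range(num_classes).
    Assumption: objects are ordered by (row, col); any consistent rule would do.
    """
    ordered = sorted(objects, key=lambda obj: obj[1])
    index = 0
    for identity, _location in ordered:
        index = index * num_classes + identity
    return index


def one_hot(index, length):
    """Return a one-hot list of the given length with a 1 at position index."""
    vector = [0] * length
    vector[index] = 1
    return vector


num_classes = 10
image_a = [(7, (2, 0)), (3, (0, 1)), (5, (1, 2))]
# Same identities, but objects 3 and 7 have exchanged locations.
image_b = [(3, (2, 0)), (7, (0, 1)), (5, (1, 2))]

index_a = ordered_identity_index(image_a, num_classes)
index_b = ordered_identity_index(image_b, num_classes)
target_a = one_hot(index_a, num_classes ** len(image_a))
print(index_a, index_b)
```

The two images contain the same identities but receive different targets, so a network trained on such targets has to retain relative spatial relations in addition to identities. The output length grows as `num_classes` raised to the number of objects under this particular encoding.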
To evaluate the effectiveness of this method, we trained another identity network without spatial relation information by asking it to report the identities of all the objects regardless of the spatial relations between these objects. Then we used this trained network as the identity pathway in a two-pathway network. According to our simulation results, the performance of this two-pathway network was significantly lower than the testing accuracy of the original two-pathway network in the same condition. These results suggest that the spatial relation information in the identity network improved the performance of the two-pathway neural network. However, these findings do not establish whether this improvement arose because the spatial relation information in the identity network was important for constraining the binding problem. For comparison, we also trained the two-pathway network whose identity network was trained with no spatial relation information and the original two-pathway network to do an object recognition and localization task without binding. When binding was not required in the task, the difference in performance between the two networks disappeared: the performance of the network without spatial relation information was almost the same as the performance of the original two-pathway network. Thus, these findings suggest that the spatial relation information retained in the identity network is not important if the task does not require binding, but when the task requires binding, it is critical in constraining the binding problem.
An important assumption that we made was that the identity network should be able to report object identities according to the relative spatial relations between objects. This assumption is biologically plausible because the ventral cortical visual pathway may have different neural representations when the relative spatial relations between the same set of objects are different (Yamane et al., 2006). Sereno & Lehky (2011) report additional experimental evidence where they showed not only that the majority of cells in late stages of the ventral pathway were spatially selective but also that it was possible to decode object location from a small population of cells. Further, they demonstrated that the recovered spatial representation was topologically correct. Topologically correct spatial information indicates that the information about relative spatial relations between objects is retained in the ventral pathway.
4.5 The Accuracy of the Two-Pathway Network When the Number of Objects Increased
According to Table 4, the two-pathway network required 1200, 6000, or 30,000 samples to reach a similar relatively high accuracy (between and ) for two, three, or four objects, respectively. Table 4 also shows that the two-pathway network required 2400, 12,000, or 60,000 samples to reach a similar but even higher accuracy (between and ) for two, three, or four objects, respectively. Though these required numbers of samples are just estimates, the results suggest that the required number of samples increases by around five times when the number of objects increases from two to three, and by another five times when the number of objects increases from three to four. Training time for the same network is roughly proportional to the number of training samples, so training time rises in a similar way. This indicates that the required number of samples and the required training time for the two-pathway network to reach a certain high accuracy increase quickly, and potentially exponentially, when the number of objects increases.
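The fivefold growth can be checked with a short calculation on the sample counts quoted above. The sketch below computes the ratios between consecutive conditions and fits a log-linear model by ordinary least squares. The variable and function names are hypothetical, and a fit through three estimated points illustrates the trend without establishing that the growth is exponential.

```python
import math

# Estimated sample counts for the lower accuracy band, as quoted in the text.
samples = {2: 1200, 3: 6000, 4: 30000}

# Ratio of required samples between consecutive numbers of objects.
ratios = [samples[k + 1] / samples[k] for k in (2, 3)]

# Least-squares fit of log(N) = a + b * k; exp(b) is the per-object growth factor.
ks = list(samples)
logs = [math.log(samples[k]) for k in ks]
k_mean = sum(ks) / len(ks)
log_mean = sum(logs) / len(logs)
b = (sum((k - k_mean) * (v - log_mean) for k, v in zip(ks, logs))
     / sum((k - k_mean) ** 2 for k in ks))
a = log_mean - b * k_mean
growth_factor = math.exp(b)


def fitted_samples(num_objects):
    """Sample count given by the fitted exponential for the observed conditions."""
    return math.exp(a + b * num_objects)


print(ratios, round(growth_factor, 3))
```

The same ratios hold for the higher accuracy band (2400, 12,000, and 60,000 samples), since each of those counts is twice the corresponding count used here.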
In addition, the results shown in Tables 4 and 5 suggest that the accuracies of the original two-pathway network and the modified two-pathway networks with different numbers of convolutional layers in each pathway were very similar when they were trained under the same conditions. The modified networks with different numbers of convolutional layers in each pathway still required 1200, 6000, or 30,000 samples to reach a similar relatively high accuracy (between and ) for two, three, or four objects, respectively. Together, these findings suggest that our results are robust to hyperparameter changes.
4.6 The Limitation of the Two-Pathway Network When the Number of Objects Increases: Role of Binding?
According to our simulations, our model could not recognize and localize four or more objects efficiently. With four or more objects in each image, training the two-pathway network was very slow and required many more training samples if we wanted to achieve a high testing accuracy (90%).
It is possible that the decreased performance of the two-pathway network with larger numbers of objects was caused by a decreased performance of the identity pathway, not by a binding limitation. We conducted additional simulations to test this hypothesis. According to our simulations, the testing accuracy of the two-pathway network on the object recognition and localization tasks that did not require binding was significantly higher than its testing accuracy on the original task that required binding, for both the three- and four-object conditions. Therefore, the decreased performance of the two-pathway network with larger numbers of objects cannot be fully explained by a decreased performance of the identity pathway. The binding problem increased the difficulty of the task and caused the model performance to decrease.
Though this increased difficulty with tasks that require binding may be a limitation of our model, it may agree with the computational properties of the biological brain. Our biological brain is also not good at recognizing and localizing four or more objects in the scene at the same time. According to Isbell et al. (2015), the capacity limit of visual working memory is about three simple objects in healthy young adults. Visual working memory capacity may be considered as the maximum number of objects that our brain could recognize and localize at the same time. This agrees with our findings. Our two-pathway network was also able to achieve high testing accuracy (90%) within a reasonable training time and number of training samples only if the number of objects in each image was less than or equal to three. Therefore, we argue that our visual working memory capacity may be limited in part by the binding problem. Other kinds of attention and working memory capacity may also be limited by the binding problem. Many memory span tasks that are used to measure working memory capacity require people to remember the occurrence of information in space and time (Tulving, 1972; Nairne, 2015). Combining the information with its occurrence in space and time is also often referred to as a binding problem, and some researchers have proposed that working memory is a system for building, maintaining, and updating different kinds of bindings (Oberauer, 2009).
If the brain needs to recognize and localize four or more objects, then we speculate that it could only do it sequentially. For example, Quirk, Adam, and Vogel (2020) found that the human visual working memory capacity increased for both simple and real-world objects when encoding times were longer. Their experimental tasks required participants to recognize and localize multiple objects during the encoding time and make a response about which object appeared at a certain location during the testing time. In our opinion, one possible explanation of their findings is that the brain is able to bind more objects' identities and locations sequentially when the encoding time is longer. As a result, the brain can recognize and localize more objects, so the visual working memory capacity may appear to increase.
Experimental evidence also suggests that visual working memory continues to develop throughout adolescence, and it does not reach adult levels even in 16-year-old participants (Isbell et al., 2015). It may be because multiple-objects recognition and localization requires a lot of training samples and training time. It also agrees with our findings because our model also required a relatively large number of training samples and training time to achieve a high testing accuracy (90%) with three objects in each image. There are many possible combinations of the same information in different ways. Furthermore, in the real world, learning conditions are complex and changing. For example, context itself may alter the meaning of the same information (an empty pot on a stove, with the stove on versus off). Finally, our contexts and environments are themselves changing over time. Thus, there should be some improvements in working memory capacities with training time on context-appropriate sets. However, visual working memory capacity cannot be improved indefinitely through training, likely because our life and experience (training samples) are limited, as well as the fact that the greatest developmental benefits of the human brain occur before adulthood.
If our visual working memory capacity is limited in part by the binding problem, then we speculate that the measured visual working memory capacity may increase if we ask the participants to report all objects' identities regardless of their locations, and/or ask them to report all objects' locations regardless of their identities. In addition, human visual working memory capacity should be continuously developing in a long period (from infancy to adulthood) and be dependent in part on stimulus and context-appropriate training. Human visual working memory capacity should increase relatively quickly at the beginning when the capacity is low and increase relatively slower later when the capacity is high and the individual is nearing nervous system limits. There may not be a hard limit for the maximum possible visual working memory capacity. However, there may be a soft limit because the difficulty of increasing the visual working memory capacity further should increase quickly, and potentially exponentially, when the capacity increases.
Though both our two-pathway model and human brains are not good at recognizing and localizing four or more objects at the same time, more research is needed to find out how different computational resources, hyperparameter settings, and learning types could affect this limit. Our current study suggests that the human visual working memory capacity may be limited in part by the binding problem, but our study does not suggest this limit must be three or any other specific critical number. According to some previous studies, human visual working memory capacity varies across individuals and groups (Luck & Vogel, 2013). The individual differences in working memory capacity may be caused by different hyperparameter settings in human brains (e.g., number of neurons, number of connections between neurons). Some previous studies argue that it would be biologically expensive for the brain to have a larger working memory capacity (Cowan, 2010). In addition, we have shown that the required training time and the number of samples for high neural network performance increased quickly as the number of objects increased. These findings suggest that the individual differences in terms of visual working memory capacity may exist but may not be very large. Therefore, there seem to be fairly standard numbers (three or four) for the limit of human visual working memory capacity (Luck & Vogel, 2013).
Using a computational modeling approach, we aimed to better understand whether the presence of the two separate cortical visual pathways in the brain is important for object recognition and localization when there are multiple objects in the scene. Our simulations using convolutional neural networks used simple tasks, and we ignored a lot of details of real biological neural networks. These simplifications are necessary to make direct computational comparisons possible. Our claims concern whether there could be a computational advantage for retaining information independently and differently in multiple pathways and whether this computational advantage could increase the network performance in multiple-objects recognition and localization tasks. A previous simulation study of one-pathway and two-pathway artificial neural networks compared model simulations with actual neural representations in the brain (Bakhtiari et al., 2021). They reported that their two-pathway artificial neural network models could produce better matches to the representations in the mouse ventral and dorsal visual streams than their one-pathway artificial neural network models. Though our intent with simulations was to explore the computational consequences of multiple streams architecture rather than emulate physiological conditions of the brain, interestingly our findings generally agree with this study, which considered actual neural activities in the brain. The brain is a complex organ, sometimes described as the last frontier, and it is clear that computational approaches can play a key role, including in elucidating consequences of its organization and function.
5 Conclusion
In summary, our simulations show that our models are able to accurately and simultaneously recognize and localize multiple objects in a scene. Furthermore, we show that the artificial neural networks with two pathways for identity and space have significantly better performance (higher average testing accuracy, lower testing accuracy variance, less training time) than the artificial neural networks with a single pathway in multiple-objects recognition and localization tasks. We also find that the required number of training samples and the required training time increased quickly, and potentially exponentially, when the number of objects in each image increased. The simulations suggest that the difficulty of binding identity and spatial information increases quickly, and potentially exponentially, when the number of objects increases. We suggest that binding information from multiple segregated pathways may be a reason that our brain has a limited visual working memory capacity. Given that attention and working memory require binding information with space or time, it is possible that many attentional and working memory capacities are also limited by similar binding problems.
Acknowledgments
This work was partially supported by funds from Purdue University to A.S.