Abstract
Although in conventional models of cortical processing, object recognition and spatial properties are processed separately in the ventral and dorsal cortical visual pathways, respectively, some recent studies have shown that representations associated with both objects' identity (shape) and space are present in both visual pathways. However, it is still unclear whether the presence of identity and spatial properties in both pathways has functional roles. In our study, we tried to answer this question through computational modeling. Our simulation results show that a model ventral pathway and a model dorsal pathway, trained separately to do object recognition and spatial recognition, respectively, each actively retained information about both identity and space. In addition, we show that these networks retained different amounts and kinds of identity and spatial information. As a result, our modeling suggests that two separate cortical visual pathways for identity and space (1) actively retain information about both identity and space, (2) retain information about identity and space differently, and (3) may need this differently retained information in order to accurately and optimally recognize and localize objects. Further, the modeling results suggest that these findings are robust and do not strongly depend on the specific structures of the neural networks.
1 Introduction
It is widely documented in neuropsychological, lesion, and anatomical studies that the human visual system has two distinct cortical pathways (Felleman & Essen, 1991; Mishkin, Ungerleider, & Macko, 1983; Ungerleider & Mishkin, 1982). Furthermore, the ventral pathway primarily processes information important for object recognition (Logothetis & Sheinberg, 1996), while the dorsal pathway primarily processes information related to spatial cognition (Colby & Goldberg, 1999). However, some recent studies have challenged this idea (Freud, Plaut, & Behrmann, 2016; Freud, Rosenthal, Ganel, & Avidan, 2015; Hong, Yamins, Majaj, & DiCarlo, 2016; Konen & Kastner, 2008). Some studies have found that representations associated with shape and location processing are present in both visual streams (Hong et al., 2016; Konen & Kastner, 2008; Sereno, Lehky, & Sereno, 2020). However, it remains unclear whether the representations of shape in the dorsal stream and the representations of location in the ventral stream are non-task-related or whether they might play a functional role in spatial cognition and object recognition, respectively. Some findings from fMRI and behavioral studies have suggested that spatial processing that operates at the level of the scene, presumably within the dorsal visual pathway, can contribute to shape processing (Zachariou, Klatzky, & Behrmann, 2014). Another study found that correlated activity between ventral and dorsal visual pathways was higher when people were looking at objects with impossible spatial structures compared with when they were looking at objects with possible structures (Freud et al., 2015), which indicated that dorsal pathway processing might help the brain recognize objects with impossible structures. Furthermore, Hong et al. (2016) found in neural recordings that spatial information increases along the ventral stream, consistent with prior studies demonstrating spatial properties in later stages of the ventral stream (Lehky, Peng, McAdams, & Sereno, 2008; Nowicka & Ringo, 2000). In addition, Hong et al. (2016) suggest that it is likely that the spatial information in the ventral stream does not come from the dorsal stream, in agreement with previous studies arguing that ventral stream spatial representations are distinct and independent from dorsal stream spatial encodings (Sereno & Lehky, 2011; Sereno, Sereno, & Lehky, 2014).
This experimental evidence indicates that representations of shape and space exist in both visual pathways and might have functional roles. We attempt here to tackle these questions through explicit hypothesis testing using computational models. In our study, we examine whether information associated with identity (shape) and space is present in both simulated ventral and dorsal streams trained to do straightforward object recognition and localization tasks, respectively; we explore possible reasons why this information is present in both simulated streams; and finally, we discuss how this information could elucidate our understanding of the computational properties and needs of the two visual streams. Hong et al. (2016) showed with modeling that explicit spatial information is present in the ventral pathway. They did not show whether shape information is present or retained in the dorsal pathway. They also did not show whether different kinds of shape and spatial information are maintained differently in simulations of the ventral and dorsal pathways. Their results are not sufficient to suggest why seemingly task-irrelevant information is maintained in a neural network. These are computationally tractable questions that are important and timely.
In order to model the two cortical visual pathways and study their computational properties, feedforward multilayer convolutional neural networks were used to simulate the functions of the two visual pathways in the brain, and multilayer perceptrons were used to simulate the process of decoding information from recorded neural activities in the brain. All networks were trained using supervised learning. When modeling the two cortical visual pathways, for simplicity and control, it is assumed that the two pathways use the same computational structure (the numbers of neurons are the same, and the structures of the initial connections between neurons are the same) and receive the same visual input images. However, we will allow the connection weights between the neurons in the two pathways to be modified with training. It is almost certain that the connection weights between the two pathways will be different after training because the training networks have to meet different goals. Specifically, the primary goal of the ventral pathway is to distinguish different kinds of objects by distinguishing different features or different combinations of features, whereas the primary goal of the dorsal pathway is to determine the spatial information (e.g., locations and/or orientations) of objects necessary for interaction (e.g., reach, grasp, and/or avoidance/navigation). We used the backpropagation training method as a tool to capture the computational properties that result from these differing goals of the two visual pathways. Backpropagation is currently the best method for updating connection weights between neurons in artificial neural networks. In general, artificial neural networks trained using the backpropagation method tend to perform better than models trained using any other weight-updating methods (Lillicrap, Santoro, Marris, Akerman, & Hinton, 2020). Though biological neural networks are unlikely to be able to perform backpropagation weight updates in the same way as artificial neural networks, some researchers have argued that biological neural networks could compute back-propagation-like effective synaptic updates by using the differences of neural activities induced by feedback connections (Lillicrap, Santoro, Marris, Akerman, & Hinton, 2020; Whittington & Bogacz, 2019). Therefore, the backpropagation training method was used to obtain the results shown herein.
Figure 1: Object image locations and orientations. (A) Nine possible locations of the center of an object image. (B) Four possible orientations of an object image (up, down, left, and right orientations, respectively; going from top to bottom images and, for the first row, left to right images). Note that the alignment of parts within an image is not randomized; it is always the same and always constrained to the two directions along the long axis. (C) An example of an unscrambled (US) image with “down” orientation at location 7. (D) An example of a noisy scrambled (S) image with “right” orientation at location 5.
Five artificial neural networks (the identity, shoe, space, location, and orientation networks) were trained to do an identity task (determine whether the image is scrambled or unscrambled), a shoe task (determine whether the shoe in the image is a sandal or a closed shoe), and three spatial tasks (determine the location and orientation together, the location alone, or the orientation alone of the image), respectively. The identity and shoe networks were used to model the ventral pathway, whereas the space, location, and orientation networks were used to model the dorsal pathway. These five networks, considered the brain networks, were used to simulate the functions of the ventral and dorsal cortical visual pathways in the brain. Various additional nonlinear and linear decoders were then trained to decode different kinds of information from the later processing stages of these brain networks. These decoders were used to simulate the process of decoding information embedded in recorded neural activity signals in the brain. It is assumed that the higher the testing accuracy of a decoder, the more information relevant to that decoder's goal was retained in the later processing stages of the brain network.
According to the simulation results, though the space network lost some identity information when it was trained to do the space task, its later processing stage still retained some of the information that was necessary for distinguishing different kinds of objects (combinations of features). In addition, though the identity network lost some spatial information when it was trained to do the identity task, it still maintained some information that was necessary for the spatial task. Specifically, although the identity network maintained both location and orientation information, it maintained more information about the orientations of the object images. The results suggest that object information is retained by a network trained to do a spatial task and spatial information is retained by a network trained to do object recognition, suggesting that aspects of both object and spatial properties might be important for successful object recognition and spatial tasks. However, the information retained was not always sufficient to optimally complete the other goal. Therefore, the results indicate that a reason for two visual pathways in the brain might be that multiple pathways are necessary in order to achieve the highest performance on different goals, such as those required by the identity, spatial, and shoe tasks. More important, the results also suggest that these multiple pathways retain different aspects and amounts of both object and spatial information to achieve the highest performance on spatial and object tasks, respectively.
Our main modeling goal is to gain a better understanding of computational issues rather than identifying the specific response features that are similar to the real neural responses of ventral and dorsal cortical areas—that is, a proof of computational concept more than an accurate model of the real human brain. Indeed, known differences in the structure of the two pathways (e.g., different number of areas within each stream, already evident in Felleman & Essen, 1991) would complicate direct and controlled comparisons of such biologically accurate models.
Given that our goal is proof of computational concept, we repeated some of the simulations with slightly different brain network structures (different number of filters, different kernel sizes) to test if our findings are dependent on the specific conditions or structures of the artificial neural networks. Because our findings do not depend on the specific structures we have used or particular parameters chosen, the findings suggest they may reflect more general computational processes. Specifically, our findings may also be valid for the biological brain even though the structures of our artificial neural networks and the structures of the biological brain networks are not the same.
2 Methods
2.1 Object Images
Black and white images of different kinds of tops, pants, and shoes, obtained from the TensorFlow data set Fashion-MNIST (Xiao et al., 2017), were used to construct the images of objects (see Figure 1). Each of these object images consists of three parts: a top (1 of 62 possible), a pant (1 of 66 possible), and a shoe. The shoe could be one of two types: a sandal (58 possible) or a closed shoe (61 possible). Each object image was embedded in a black background and presented at different locations and orientations (all parts of the object, that is, the top, pant, and shoe, were presented with the same orientation and centered at the selected location). These object images with black backgrounds were used as visual inputs. In half of the images, the top, the pant, and the shoe were in the unscrambled order: the normal order of how people are dressed, with the top at the top, the pant in the middle, and the shoe at the bottom. In the other half of the images, the top, the pant, and the shoe were in a scrambled order, that is, any order that does not follow the normal order.
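To make the construction concrete, the sketch below shows one way such composite images could be generated from Fashion-MNIST parts. The canvas size, location grid, orientation encoding, and class choices (e.g., sneakers standing in for closed shoes) are illustrative assumptions, not the exact parameters used in the study.

```python
# A sketch (with assumed canvas size, location grid, and class choices) of how
# the composite top/pant/shoe images could be built from Fashion-MNIST.
import numpy as np
import tensorflow as tf

(x, y), _ = tf.keras.datasets.fashion_mnist.load_data()
tops, pants = x[y == 0], x[y == 1]            # T-shirt/top, Trouser
sandals, closed_shoes = x[y == 5], x[y == 7]  # Sandal; Sneaker as a closed shoe

rng = np.random.default_rng(0)
CANVAS = 200                                                        # assumed image size
CENTERS = [(r, c) for r in (50, 100, 150) for c in (50, 100, 150)]  # 9 locations

def make_object(scrambled: bool, orientation: int, location: int) -> np.ndarray:
    """Stack top/pant/shoe, optionally scramble the order, rotate, and place."""
    shoes = sandals if rng.random() < 0.5 else closed_shoes
    parts = [tops[rng.integers(len(tops))],
             pants[rng.integers(len(pants))],
             shoes[rng.integers(len(shoes))]]
    if scrambled:                        # any of the 5 non-normal part orders
        order = [0, 1, 2]
        while order == [0, 1, 2]:
            rng.shuffle(order)
        parts = [parts[i] for i in order]
    obj = np.concatenate(parts, axis=0) / 255.0     # 84 x 28, "up" orientation
    obj = np.rot90(obj, k=orientation)              # assumed orientation coding
    canvas = np.zeros((CANVAS, CANVAS), dtype=np.float32)
    r, c = CENTERS[location]
    h, w = obj.shape
    canvas[r - h // 2:r - h // 2 + h, c - w // 2:c - w // 2 + w] = obj
    return canvas
```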
Six hundred black and white images were used to train, validate, and test the neural networks: 400 images for training, 100 for validation, and 100 for testing. We used a small data set and did not use image augmentation because our goal is not to maximize the performance of the artificial neural networks but rather to compare the performance of different neural networks in order to clarify differences in the kinds and amounts of information that are retained. Using a very large data set (60,000 images) caused the testing accuracies of the networks to approach ceiling, likely because the training images and tasks were simple and the number of possible variations was limited. It is difficult to examine and identify performance differences between different networks if most of the networks perform at near-perfect accuracy. Nevertheless, to test whether data set size would alter any findings, we repeated some simulations with a larger data set (1200 images) and found that the size of the data set did not affect our major findings. All networks were trained for 200 epochs, and all of these networks had reached their highest performance level by the end of training. For some conditions in which testing accuracies approached ceiling, we added gaussian noise to the images (including both object and background) to increase task difficulty so that we could better compare the performance of the different networks. An example of a noisy image is shown in Figure 1D. A batch size of 256 and the Adam optimization method were used during training; the initial learning rate of Adam optimization was 0.001.
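A minimal sketch of this training setup (400/100/100 split, Adam with learning rate 0.001, batch size 256, 200 epochs, optional gaussian noise) is given below; the noise standard deviation and the use of integer labels are assumptions.

```python
# A sketch of the training setup described above; the noise standard deviation
# and the sparse (integer) label format are assumptions.
import numpy as np
import tensorflow as tf

def add_gaussian_noise(images: np.ndarray, std: float = 0.1) -> np.ndarray:
    """Add gaussian noise to object and background pixels (std is assumed)."""
    return np.clip(images + np.random.normal(0.0, std, size=images.shape), 0.0, 1.0)

def train_brain_network(model: tf.keras.Model, images: np.ndarray,
                        labels: np.ndarray, noisy: bool = False) -> float:
    if noisy:
        images = add_gaussian_noise(images)
    x_tr, y_tr = images[:400], labels[:400]        # 400 training images
    x_va, y_va = images[400:500], labels[400:500]  # 100 validation images
    x_te, y_te = images[500:600], labels[500:600]  # 100 testing images
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_tr, y_tr, validation_data=(x_va, y_va),
              batch_size=256, epochs=200, verbose=0)
    _, test_accuracy = model.evaluate(x_te, y_te, verbose=0)
    return test_accuracy
```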
2.2 Object Image Location
2.3 Alignment of the Parts within an Object Image and Orientation of an Object Image
Figure 2: Unscrambled and scrambled object orders. The alignment of the parts within an object and the orientation of the object are always the same. For each orientation, there are six possible orders of parts. Only the first image for each orientation (first image in each row) is considered unscrambled (labeled “US”). The other images for a given orientation are scrambled object images (labeled “S”). (A) Up orientation. (B) Down orientation. (C) Left orientation. (D) Right orientation.
2.4 Object Image Order: Unscrambled versus Scrambled
The six possible orders for a given object image in the four different orientations are illustrated in Figure 2. Despite the six possible orders, there are only two possible classifications by the identity network: unscrambled (US) object or scrambled (S) object. The object image order is determined relative to the orientation of the object: if the orientation of the object is up, then the start of the order (the top part of the object image) is at the top; if the orientation of the object is down, then the start of the order is at the bottom. In half of the object images (300 out of 600), the top, the pant, and the shoe parts are in the normal order. These images were labeled as unscrambled (images labeled “US” in Figure 2). Just as when people dress themselves and stand up in daily life, the normal order means that the top is at the top, the pant is in the middle, and the shoe is at the bottom. If the object image is rotated to another orientation, the normal order stays consistent, just as people sometimes lie down or do a handstand. In the other half of the images (300 out of 600), the top, the pant, and the shoe are in a scrambled order (images labeled “S” in Figure 2). That is, if the order of top, pant, and shoe does not follow the normal order (e.g., shoe, top, pant), the object image is labeled as scrambled (second image in Figure 2A). In addition, if all the parts are rotated so that the orientation of the object is upside down but the top still appears at the top, the pant in the middle, and the shoe at the bottom, the object image is also considered scrambled (third image in Figure 2B). Thus, with three parts in every object image (top, pant, shoe), there are six possible spatial orders in total for each orientation, and only one of them is the unscrambled order.
We chose to use the scrambled-unscrambled or identity task because it is a common task used to identify ventral regions in human fMRI studies (e.g., Kourtzi & Kanwisher, 2000; Grill-Spector, Kourtzi, & Kanwisher, 2001). It is an object recognition task that includes information about the relations between parts. The shoe identity task is a task that is sensitive to the ability to discriminate shape information of parts. Much work in animals (e.g., with respect to faces; Perrett, Hietanen, Oram, & Benson, 1992) as well as in humans (Hoffman & Haxby, 2000) has demonstrated the importance of ventral regions in discriminating lower-level visual features (see also Bracci, Ritchie, & Op de Beeck, 2017).
2.5 Neural Networks
Feedforward multilayer convolutional artificial neural networks were used to build the brain networks that model visual information processing in the brain. Each neural network consists of several hidden layers, including convolutional layers, pooling layers, and fully connected dense layers. A ReLU activation function was used at each layer except the final output layer, in which a softmax activation function was used. Random dropout was used as a regularization method to improve the performance of the network; it is a neuroscience-inspired regularization method that is commonly used in the deep learning community (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014). These neural networks were implemented using TensorFlow and were trained using supervised learning, the cross-entropy loss function, and the backpropagation method (Rumelhart, Hinton, & Williams, 1986). Simple multilayer perceptrons were used to build the decoder networks (see additional details below).
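The sketch below illustrates this kind of brain network in TensorFlow/Keras. The filter counts, kernel sizes, dense-layer width, dropout rate, and input size are assumptions (the paper notes only that one variant used 64, 128, and 256 filters, suggesting a smaller default); only the layer types follow the description above.

```python
# A sketch of the brain network architecture; specific hyperparameters are assumed.
import tensorflow as tf

def build_brain_network(n_outputs: int, input_shape=(200, 200, 1)) -> tf.keras.Model:
    layers = tf.keras.layers
    return tf.keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 5, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 5, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dropout(0.5),                    # random dropout regularization
        layers.Dense(256, activation="relu"),   # second-to-last layer (decoded from)
        layers.Dense(n_outputs, activation="softmax"),
    ])
```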
2.6 Brain Networks: Global Recognition (Identity Task), Spatial Cognition (Spatial Task, Location Task, Orientation Task), and Feature Recognition Networks (Shoe Task)
Figure 3: The structure of brain networks. Each neural network consists of several hidden layers, including the convolutional layer, the pooling layer, and the fully connected dense layer. The only difference between different brain networks is the size of their output layer. The size of the output layer depends on the task they were trained to do.
All brain networks take the same set of images as inputs. However, the identity network was trained to classify the input images as scrambled or unscrambled (identity task), whereas the space network was trained to determine both the location and the orientation of the images (spatial task). The third network, the shoe network, was a variant of the identity network: it was identical in structure but trained instead to classify the type of shoe in both scrambled and unscrambled images as either a closed shoe or a sandal. Two additional networks were variants of the space network: the location network was trained to determine only the locations of the images, and the orientation network was trained to determine only the orientations of the images (both networks differing from the space network only in their final output layer). The chance-level testing accuracies for the various tasks are: identity task, 50%; spatial task, 1/36 (about 2.8%); shoe task, 50%; location task, 1/9 (about 11.1%); and orientation task, 25%. During training and testing, the activities of the second-to-last layers of all five networks were recorded.
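Assuming a helper like the build_brain_network sketch above, the five brain networks would differ only in the size of their softmax output layer:

```python
# The five brain networks share one architecture and differ only in the size of
# the softmax output layer (build_brain_network is the helper assumed above).
tasks = {
    "identity":    2,   # scrambled vs. unscrambled; chance 50%
    "shoe":        2,   # sandal vs. closed shoe;    chance 50%
    "space":       36,  # 9 locations x 4 orientations; chance ~2.8%
    "location":    9,   # chance ~11.1%
    "orientation": 4,   # chance 25%
}
brain_networks = {name: build_brain_network(n_out) for name, n_out in tasks.items()}
```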
2.7 Decoders
In order to analyze the information contained in the later processing stage of the convolutional networks, two kinds of decoders were used: nonlinear decoder networks and linear decoders.
Figure 4: The structure of a decoder network. The input dimension is equal to the number of units in the network layer that it was trained to decode from. The output dimension depends on what kind of information it was trained to decode.
The linear decoders we used were linear support vector machines (linear SVMs). The parameters were set as follows: loss function: hinge; regularization: L2 regularization with regularization strength 1. The linear decoders also took the artificial neural activities of the second-to-last-layer units of a brain network as inputs and were trained to give different kinds of outputs depending on what kind of information they were trying to decode.
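A minimal sketch of such a linear decoder, assuming scikit-learn's LinearSVC with the stated hinge loss and L2 regularization (the regularization parameter C = 1.0 is an assumed mapping of "regularization strength 1"):

```python
# A sketch of the linear decoder: a linear SVM trained on second-to-last-layer
# activities; the scikit-learn implementation choice is an assumption.
from sklearn.svm import LinearSVC

def train_linear_decoder(train_acts, train_labels, test_acts, test_labels):
    svm = LinearSVC(loss="hinge", penalty="l2", C=1.0)
    svm.fit(train_acts, train_labels)
    return svm.score(test_acts, test_labels)   # testing accuracy
```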
The second-to-last-layer activities of a brain network are different when the input images are different. Therefore, during the training and testing of a decoder network, the inputs (second-to-last-layer activities) must be paired with the corresponding true labels of the training and testing images. There are two reasons for choosing to decode from the second-to-last-layer activities. First, the last layer is the output layer, which includes only information about the final classification decision of the corresponding task, which was different for different networks. Second, the layers before the second-to-last layer are closer to the input layer, and information may not have been fully processed at these layers. The assumption is that if a decoder is able to use the second-to-last-layer activities to do a task with high accuracy, then a large amount of task-relevant information is contained (and/or retained) in the second-to-last-layer activities.
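The sketch below shows one way the second-to-last-layer activities could be read out from a trained Keras brain network so that they can be paired with the true task labels; the sub-model construction is an assumption about implementation, not the study's exact code.

```python
# One way (an implementation assumption) to read out second-to-last-layer
# activities from a trained Keras brain network, to be paired with task labels.
import tensorflow as tf

def penultimate_activities(brain_net: tf.keras.Model, images):
    readout = tf.keras.Model(inputs=brain_net.inputs,
                             outputs=brain_net.layers[-2].output)
    return readout(images)   # shape: (n_images, n_penultimate_units)
```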
2.8 Comparing Networks
In order to compare networks, each network (including the decoders) was trained 10 times, and testing accuracies were obtained for each of the 10 training sessions. The testing accuracies were obtained by dividing the number of correct classifications by the total number of testing samples (100) during the testing session. The accuracies used to compare different networks herein always refer to the testing accuracies. Unpaired two-sample t-tests were used to compare network accuracies and to determine the significance of the differences.
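A sketch of this comparison procedure, using scipy's unpaired two-sample t-test on the 10 testing accuracies obtained for each network:

```python
# A sketch of the statistical comparison between two networks.
from scipy.stats import ttest_ind

def compare_networks(accuracies_net1, accuracies_net2, alpha=0.05):
    """Each argument is a list of 10 testing accuracies (one per training run)."""
    t_stat, p_value = ttest_ind(accuracies_net1, accuracies_net2)
    return t_stat, p_value, p_value < alpha   # significant if p < 0.05
```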
2.9 Baseline Decoder Networks: Getting the Baseline Accuracies
Before trying to decode information from the second-to-last layer of each brain network, it is important to know the accuracy of decoding from an untrained network. To get the baseline accuracies, an untrained network is used. The untrained network has the same structure as the trained brain networks (because the output layer is not important here, any of the brain network structures could have been used). After all connection weights in the untrained network were randomly initialized, the training, validating, and testing images were provided as inputs to the network for zero epochs, and the activities of the second-to-last-layer units were recorded. Because all input data only went through the network once and no training happened during this process (trained for zero epochs), the connection weights were still random.
Unit activities of the second-to-last layer of the untrained network served as inputs to a decoder that was trained to do the identity task; the accuracy obtained was the baseline accuracy for identity. These same unit activities also served as inputs to a decoder trained to determine the spatial information; the accuracy obtained was the baseline accuracy for space. Finally, when these activities of the untrained network served as inputs to a decoder trained to determine the type of shoe, the accuracy obtained was the baseline accuracy for the classification of shoes.
The reason for getting these baseline accuracies is to determine how much information about identity, space, and shoes would still be present in the second-to-last layer of the network if the network was not trained at all (i.e., all connection weights are random).
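Putting the earlier sketches together, a baseline accuracy could be estimated roughly as follows; the helpers, the use of the linear decoder for brevity, and the 400/100 split indices are assumptions carried over from those sketches.

```python
# A rough sketch of the baseline procedure, reusing the helpers assumed above
# (build_brain_network, penultimate_activities, train_linear_decoder).
def baseline_accuracy(images, labels, n_outputs=2):
    """Decode a task from an untrained brain network (weights stay random)."""
    untrained = build_brain_network(n_outputs)          # fit() is never called
    acts = penultimate_activities(untrained, images).numpy()
    # Assumed split: first 400 images train the decoder, last 100 test it.
    return train_linear_decoder(acts[:400], labels[:400],
                                acts[500:600], labels[500:600])
```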
2.10 Determining the Amount of Information about a Task in the Later Processing Stage of the Brain Network When It Was Trained to Do a Different Task
It is possible that when a network is trained to do one kind of task, it extracts the task-relevant information and throws away task-irrelevant information. We examine here whether the amount of information about one task in the later processing stage of a brain network increases or decreases when that network was trained to do a different task.
The inputs and task goals of the different decoders are listed in Table 1. For example, one decoder received intermediate processing information about space from the brain space network (i.e., its inputs were the artificial neural activities from the second-to-last layer of the brain space network) but was then trained to decode information about identity from them. Similar arguments apply to the other decoders.
Table 1: Inputs and Task Goals of Different Decoders When the Brain Network Was Trained with a Different or the Same Task.
| Decoder Name | Takes the second-to-last-layer activities from | To do the task |
|---|---|---|
identity | ||
identity | ||
identity | ||
space | ||
space | ||
location | ||
orientation | ||
shoes | ||
shoes | ||
shoes | ||
identity | ||
space |
2.11 Determining the Amount of Information about a Task in the Later Processing Stage of the Brain Network When the Network Was Trained to Do the Same Task
Using decoder networks to decode information from the brain networks is similar to adding more layers to the brain network and then training it for more epochs. The network's testing accuracy could increase or decrease simply because it was trained for more epochs or it has more layers. Training for more epochs or having more layers may increase testing accuracy by extracting more statistical information from the training samples, whereas it could also decrease testing accuracy by overfitting. Therefore, if we want to determine whether training with a task helps or hurts the network's ability to do another (different) task, we need to determine the accuracy of the decoder network when the brain network was trained again to do the same task.
The inputs and task goals of the relevant decoders are also listed in Table 1. For example, one decoder received the intermediate processing information of the brain identity network as inputs (i.e., the artificial neural activities from the second-to-last layer of the identity network) and was then trained to decode information about identity from them. Another decoder received the intermediate processing information of the brain space network and was then trained to decode spatial information from it.
2.12 Determining Whether Performance on the Identity and Spatial Tasks Is Dependent on Whether There Is One (Double-Sized) Single Network or Two Separate Networks
In the single-network condition, a single (double-sized) network takes the images as visual inputs and determines objects' identity and space information as 1 of the 72 possible combinations of identity (2 possible) and space (36 possible). In the two-network condition, two brain networks take the images as visual inputs: the brain identity network determines objects' identity, and the brain space network determines space. Later, the results from the two networks are combined to determine objects' identity and space information as 1 of the same 72 possible combinations.
Figure 5: The structure of the single (double-sized) network, which takes the images as visual inputs and determines objects' identity and space information as 1 of the 72 possible combinations of identity (2 possible) and space (36 possible).
Figure 6: The structure of the two separate brain networks, which take the images as visual inputs. The brain identity network determines objects' identity, and the brain space network determines space. Later, the results from the two networks are combined to determine objects' identity and space information as 1 of the 72 possible combinations of identity (2 possible) and space (36 possible).
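A rough sketch of the two conditions follows: a single network with a 72-way output versus two separately trained networks whose outputs are combined. The combination rule shown (identity index times 36 plus space index) is an assumption consistent with the 2 x 36 = 72 description, and the doubling of the single network's size is not shown.

```python
# A sketch (not the study's code) of the single-network vs. two-network setups.
import numpy as np

# Single network: one softmax over all 72 identity x space classes.
single_net = build_brain_network(n_outputs=72)   # width doubling omitted here

def combined_prediction(identity_net, space_net, images) -> np.ndarray:
    """Combine two separately trained networks into one 72-way decision."""
    identity_pred = np.argmax(identity_net.predict(images), axis=1)  # 0..1
    space_pred = np.argmax(space_net.predict(images), axis=1)        # 0..35
    # Assumed combination rule: class = identity * 36 + space (0..71).
    return identity_pred * 36 + space_pred
```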
3 Results
It is necessary to perform training, validation, and testing multiple times, with the network weights initialized randomly and differently each time, to make sure a network did not get stuck in a local minimum. When obtaining the accuracies in each experimental setting, the networks were always trained 10 times, and 10 testing accuracies were obtained for each condition after training. Unpaired two-sample t-tests were used to compare different accuracies and determine the significance of the differences. A difference is considered significant if the corresponding p-value is below 0.05 (levels of significance: *p < 0.05, **p < 0.01, ***p < 0.001). The average testing accuracies for the different experimental settings are shown in Tables 2 and 3. One possible reason for the baseline accuracies being higher than the corresponding chance levels is that although the connection weights were initialized randomly, some information contained within the input images themselves can still be passed on to the second-to-last-layer units, and this sensory-driven information was decoded by the decoder networks.
Table 2: Average Testing Accuracies in Percentage (%) ± Standard Deviations (%) for Brain Networks and Nonlinear Decoder Networks.
| Decoders (rows) / Brain networks (columns) | Identity | Space | Shoes | Orientation | Location |
|---|---|---|---|---|---|
| Baseline accuracy (decode from the untrained brain network) | 60.7 ± 2.0 | 48.0 ± 3.9 | 60.7 ± 2.4 | 38.2 ± 1.1 (noisy inputs) | 44.2 ± 4.8 (noisy inputs) |
| No decoder (brain network accuracy) | 80.2 ± 1.8 | 85.7 ± 3.1 | 73.4 ± 1.9 | 82.6 ± 2.3 (noisy inputs) | 97.0 ± 1.1 (noisy inputs) |
| Identity | 81.6 ± 0.8 | 71.2 ± 2.8 | NA | 72.2 ± 1.2 | 65.0 ± 2.0 |
| Space | 75.8 ± 2.9 | 86.5 ± 1.2 | 61.0 ± 1.6 | NA | NA |
| Shoes | NA | 57.5 ± 2.1 | NA | NA | NA |
| Orientation | 34.5 ± 3.9 (noisy inputs) | NA | NA | 84.1 ± 0.7 (noisy inputs) | NA |
| Location | 29.9 ± 4.1 (noisy inputs) | NA | NA | NA | 97.3 ± 0.5 (noisy inputs) |
Notes: The column headings are the names of the brain networks. The row headings are the kinds of information that the decoder networks were trying to decode. The data are accuracies obtained by the various decoder networks, except for the data in the row labeled “No decoder (brain network accuracy)”; for that row there is no decoder, and the value is the accuracy obtained by the brain network itself. Definitions of decoder networks are listed in Tables 1 and 2. The data for simulations that were not conducted are labeled “NA.”
Table 3: Average Testing Accuracies for the Single Combined Network and the Two Separate Networks.
| Network | Average Accuracy (%) | Chance Level (%) | Standard Deviation (%) |
|---|---|---|---|
| Single combined network (one pathway) | 72.8 | 1.4 | 2.1 |
| Two separate networks (two pathways) | 76.8 | 1.4 | 1.5 |
Note: The single combined network and the two separate networks are used to simulate object identification and localization with one pathway or with two separate pathways, respectively.
The comparisons of accuracies between different networks are shown in Table 4. Briefly, we found that the second-to-last-layer activities of brain networks that were trained to do a given task had higher decoding accuracies than the baseline when we tried to decode information about a different task. That is, we found that a network trained to identify images actively retained information about space, and a network trained on a spatial task actively retained information about identity. In addition, the decoding accuracies were lower from brain networks that were trained to do a different task than from brain networks that were trained to do the same task. Additional modeling to better understand why networks retained seemingly task-irrelevant information suggests that this information is retained and preserved uniquely in service of improving accuracy on the network's own trained task. For example, the identity network actively maintained more information about orientation than location because, in order to determine whether the object is in the unscrambled or scrambled order, the network needs to determine the object orientation. Finally, simulation results from comparing a single combined pathway versus two segregated pathways in order to accurately identify objects and accurately determine the location and orientation of objects suggest that two separate pathways are advantageous in order to process the same input (visual information) in different ways for different tasks or goals. The specific comparisons and findings are discussed in more detail in section 4.
Table 4: Comparisons of Testing Accuracies between Different Networks.
| Network 1 | Network 2 | Average Difference in Accuracy (%) (*p < 0.05, **p < 0.01, ***p < 0.001) | p-Value |
|---|---|---|---|
| Space decoded from the identity network | Baseline space decoder (untrained network) | 27.8*** | <0.001 |
| Space decoded from the identity network | Space decoded from the space network | −10.7*** | <0.001 |
| Space decoded from the identity network | Space decoded from the shoe network | 14.8*** | <0.001 |
| Identity decoded from the location network | Identity decoded from the space network | −6.2*** | <0.001 |
| Identity decoded from the orientation network | Identity decoded from the space network | 1.0 | 0.324 |
| Identity decoded from the location network | Identity decoded from the orientation network | −7.2*** | <0.001 |
| Identity decoded from the space network | Baseline identity decoder (untrained network) | 10.5*** | <0.001 |
| Identity decoded from the space network | Identity decoded from the identity network | −10.4*** | <0.001 |
| Shoes decoded from the space network | Baseline shoe decoder (untrained network) | −3.2** | 0.005 |
| Two separate networks (identity and space) | Single combined network | 4.0*** | <0.001 |
Notes: The first two sections examine whether and why there is information about space in the identity network. The next section examines whether there is information about identity and what kind of identity information is in the space network. The final section compares testing accuracies of a network doing the identity and spatial tasks using two separate pathways with a network doing the identity and spatial tasks using a single pathway.
One additional comparison, not illustrated in Table 4, was made: the decrease in accuracy when decoding location from the identity network rather than from the location network versus the decrease when decoding orientation from the identity network rather than from the orientation network. We found that the accuracy of the brain location network stayed near ceiling when this network was trained with more epochs (e.g., 200 epochs), suggesting that the accuracy of this brain network saturates when trained for 200 epochs. Since we are using the accuracies of the decoders to assess the amount of different kinds of information contained in the second-to-last-layer brain network activities, when the accuracy saturates it is possible that, with additional training, the amount of information retained changes while the accuracy stays the same (near ceiling), making it difficult to evaluate whether the amount of information retained has changed and difficult to compare these networks' performance with other brain and decoder networks. Therefore, we increased the difficulty of the location and orientation tasks by adding gaussian white noise to the input images.
With noisy input images, the accuracy of the location network was still very high but did not reach ceiling. The range of possible accuracies for the location task extends from chance (1/9, about 11.1%) to 100%; the range for the orientation task extends from chance (25%) to 100%. Given the differences in the ranges for these tasks, the change of accuracy in percentage was normalized by these respective ranges: the normalized change of accuracy was obtained by dividing the amount of change in accuracy by the size of the accuracy range of the corresponding task. The accuracy of decoding location from the identity network is lower than that of decoding location from the location network, and the accuracy of decoding orientation from the identity network is lower than that of decoding orientation from the orientation network. After the network had been trained to do the identity task, the accuracy of determining location decreased more in normalized percentage than did the accuracy of determining orientation. This difference in the amount of accuracy decrease is significant (p-value < 0.001).
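As a worked illustration of this normalization, using the nonlinear decoder values reported in Table 2 and assuming a 100% upper bound for each task's accuracy range:

```python
# A worked illustration of the normalized accuracy change (values from Table 2;
# the 100% ceiling is an assumption).
def normalized_drop(acc_same_task, acc_from_identity_net, chance, ceiling=100.0):
    return (acc_same_task - acc_from_identity_net) / (ceiling - chance)

location_drop = normalized_drop(97.3, 29.9, chance=100.0 / 9)   # ~0.76
orientation_drop = normalized_drop(84.1, 34.5, chance=25.0)     # ~0.66
print(location_drop > orientation_drop)   # location information decreased more: True
```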
We repeated the simulations about whether there is information about space in the identity network and about identity in the space network with several different settings. The comparisons of accuracies between different networks when different sample sizes or different network parameter settings were used are shown in Tables 5 and 6. We repeated the simulations with the following five alternative settings: (1) with 1200 images used as the data set; (2) with the number of filters in each convolutional layer in the brain networks doubled from the first layer to the last layer (64, 128, 256 filters); (3) with the kernel sizes for the first, second, and third convolutional layers in the brain networks reduced; (4) with decoder networks that have two hidden layers; and (5) with decoder networks that have 50 units in each hidden layer. Only one setting (size of the data set; number of filters or kernel sizes for the brain networks; number of hidden layers or number of units in the hidden layers for the decoder networks) was changed at a time. The results with these different settings are consistent with the results we obtained with the regular settings.
Table 5: Average Testing Accuracies for Networks with Different Settings (Different Data Set Size, Different Number of Filters, or Different Kernel Sizes).
| Network | Average Accuracy (%) | Chance Level (%) | Standard Deviation (%) |
|---|---|---|---|
| | 60.7 | 50.0 | 2.0 |
| | 48.0 | 2.8 | 3.9 |
| | 77.7 | 50.0 | 2.1 |
| | 69.5 | 50.0 | 2.0 |
| | 67.3 | 50.0 | 2.1 |
| | 57.6 | 2.8 | 2.1 |
| | 62.4 | 2.8 | 2.7 |
| | 50.4 | 2.8 | 2.9 |
| | 59.4 | 50.0 | 2.1 |
| | 48.9 | 2.8 | 3.7 |
| | 60.0 | 50.0 | 2.5 |
| | 41.4 | 2.8 | 5.9 |
| | 75.8 | 2.8 | 2.9 |
| | 79.3 | 2.8 | 2.0 |
| | 67.5 | 2.8 | 1.9 |
| | 69.5 | 2.8 | 4.0 |
| | 71.2 | 50.0 | 2.8 |
| | 87.4 | 50.0 | 0.9 |
| | 75.5 | 50.0 | 2.5 |
| | 80.8 | 50.0 | 1.4 |
| | 75.4 | 2.8 | 1.8 |
| | 70.4 | 50.0 | 1.3 |
| | 73.1 | 2.8 | 3.3 |
| | 69.8 | 50.0 | 3.4 |
Notes: Only one setting was changed at a time. Definitions of decoder networks are listed in Tables 1 and 2. If labeled with “1200 samples,” then 1200 images were used as the data set. If labeled with “increase filters,” the number of filters in each convolutional layer doubled from the first layer to the last layer (64, 128, 256 filters). If labeled with “different kernel sizes,” the kernel sizes for the first, second, and third convolutional layers were reduced. If labeled with “2 layer decoder,” a decoder with 2 hidden layers was used. If labeled with “50 units decoder,” a decoder with 50 units in each hidden layer was used. For the other networks, 600 images and the regular parameter settings shown in Figures 3 and 4 were used.
Table 6: Comparisons of Testing Accuracies between Different Networks.
| Network 1 | Network 2 | Average Difference in Accuracy (%) (*p < 0.05, **p < 0.01, ***p < 0.001) | p-Value |
|---|---|---|---|
| | | 21.7*** | <0.001 |
| | | 9.7*** | <0.001 |
| | | 5.1*** | <0.001 |
| | | 6.0*** | <0.001 |
| | | 19.1*** | <0.001 |
| | | 13.5*** | <0.001 |
| | | 26.5*** | <0.001 |
| | | 11.0*** | <0.001 |
| | | 31.7*** | <0.001 |
| | | 9.8*** | <0.001 |
Notes: The first section examines whether there is information about space in the identity network and whether there is information about identity in the space network when the number of samples was 1200. The second and third sections examine whether there is information about space in the identity network and information about identity in the space network when different network parameter settings were used. The fourth and fifth sections examine whether the results would change when decoders with different numbers of hidden layers or different numbers of units were used. If labeled with “1200 samples,” then 1200 images were used as the data set. If labeled with “increase filters,” then the number of filters in each convolutional layer doubled from the first to the last layer (64, 128, 256 filters). If labeled with “different kernel sizes,” the kernel sizes for the first, second, and third convolutional layers were reduced. If labeled with “2 layer decoder,” then a decoder with 2 hidden layers was used. If labeled with “decoder 50 units,” a decoder with 50 units in each hidden layer was used. For the other networks, 600 images and the regular parameter settings shown in Figures 3 and 4 were used.
We also repeated the decoding simulations using linear decoders. When obtaining the accuracies in each experimental setting, the linear decoders were always trained 10 times, and 10 testing accuracies were obtained for each condition after training (10 training and testing episodes). The input images were permuted each time, and different sets of input images were selected from the whole data set for training and testing during each episode. Unpaired two-sample t-tests were used to compare different accuracies and to determine the significance of the differences.
The average testing accuracies for the different experimental settings are shown in Table 7. The comparisons of accuracies between different linear decoders are shown in Table 8. According to these results, and unlike the results obtained using nonlinear decoders, the differences among the accuracies of decoding identity from the location, orientation, and space networks are not significant. Though the accuracy of decoding the type of shoe from the space network is significantly higher than the baseline (p-value = 0.003), which differs from the nonlinear decoder result, it is still significantly lower than the accuracy of decoding identity from the space network, which is consistent with the nonlinear decoder result. All of the other results shown in Table 8 are consistent with the results obtained using nonlinear decoders.
Table 7: Average Testing Accuracies in Percentage (%) ± Standard Deviations (%) for Linear Decoders.
| Decoders (rows) / Brain networks (columns) | Identity | Space | Shoes | Orientation | Location |
|---|---|---|---|---|---|
| Baseline accuracy (decode from the untrained brain network) | 52.0 ± 5.1 | 2.8 ± 1.2 | 53.3 ± 3.5 | 24.0 ± 5.3 (noisy inputs) | 10.9 ± 5.2 (noisy inputs) |
| Identity | 93.6 ± 1.9 | 63.2 ± 4.3 | NA | 63.0 ± 4.2 | 61.4 ± 6.5 |
| Space | 76.1 ± 4.1 | 96.1 ± 1.7 | 68.8 ± 6.0 | NA | NA |
| Shoes | NA | 58.7 ± 3.7 | NA | NA | NA |
| Orientation | 28.7 ± 3.4 (noisy inputs) | NA | NA | 94.4 ± 2.1 (noisy inputs) | NA |
| Location | 15.8 ± 5.9 (noisy inputs) | NA | NA | NA | 98.9 ± 0.9 (noisy inputs) |
Notes: The column headings are the names of the brain networks. The row headings are the kinds of information that the linear decoders were trying to decode. The data are accuracies obtained by various decoder networks. Definitions of decoders are listed in Tables 1 and 2. The data for simulations that were not conducted are labeled “NA.”
Table 8: Comparisons of Testing Accuracies between Different Linear Decoders.
| Network 1 | Network 2 | Average Difference in Accuracy (%) (*p < 0.05, **p < 0.01, ***p < 0.001) | p-Value |
|---|---|---|---|
| Space decoded from the identity network | Baseline space decoder (untrained network) | 73.3*** | <0.001 |
| Space decoded from the identity network | Space decoded from the space network | −20.0*** | <0.001 |
| Space decoded from the identity network | Space decoded from the shoe network | 7.3** | 0.006 |
| Identity decoded from the location network | Identity decoded from the space network | −1.8 | 0.47 |
| Identity decoded from the orientation network | Identity decoded from the space network | −0.2 | 0.92 |
| Identity decoded from the location network | Identity decoded from the orientation network | −1.6 | 0.52 |
| Identity decoded from the space network | Baseline identity decoder (untrained network) | 11.2*** | <0.001 |
| Identity decoded from the space network | Identity decoded from the identity network | −30.4*** | <0.001 |
| Shoes decoded from the space network | Baseline shoe decoder (untrained network) | 5.4** | 0.003 |
| Shoes decoded from the space network | Identity decoded from the space network | −4.5* | 0.02 |
Notes: The first two sections examine whether, and if so why, there is information about space in the identity network. The next section examines whether there is information about identity in the space network and what kind of identity information it is.
One additional comparison, not illustrated in Table 8, was made to compare the decrease in accuracy when decoding location from the identity network rather than from the location network versus the decrease when decoding orientation from the identity network rather than from the orientation network. Again, we increased the difficulty of the location and orientation tasks by adding gaussian white noise to the input images. With noisy input images, the accuracy of decoding location from the identity network is lower than that of decoding location from the location network, and the accuracy of decoding orientation from the identity network is lower than that of decoding orientation from the orientation network. After the network had been trained to do the identity task, the accuracy of determining location decreased more in normalized percentage than did the accuracy of determining orientation. This difference in the amount of accuracy decrease is significant (p-value < 0.05). This result is consistent with the result obtained using nonlinear decoders.
4 Discussion
Using a computational modeling approach, we aimed to better understand whether the presence of identity and spatial properties in cortical areas important for space and object recognition has a functional role. We trained networks to do various object and spatial recognition tasks. We show that these networks actively retain non-task-related information. Specifically, these networks retain different amounts of identity and spatial information (as shown, for example, by the different amounts of identity information retained by the identity and space networks, or by the space network retaining more scrambled/unscrambled identity information than shoe-type information) and different kinds of identity and spatial information (as shown, for example, by the greater retention of orientation than location information by the identity network). Each of these networks was independent, was trained on a single task, and had no cross-connections from other networks. Hence, any non-task-related properties that were retained in each of these networks were not coming from other networks. We repeated some simulations with different neural network parameters, and the results were still consistent with our findings, implying that our findings are robust and do not depend on specific parameter settings of the neural networks. In sum, based on our results, we (1) suggest that the differently retained information about identity and space in the two pathways is functional, (2) demonstrate that this task-irrelevant information need not come from another cortical stream or external source, and (3) show that in some cases, the task-irrelevant information may be necessary to accurately and optimally recognize and localize objects. Because our findings do not depend on specific parameter settings, they may also be valid for the biological brain even though its structure is not the same as that of our artificial networks.
4.1 Is There Information about Space in the Identity Network?
According to both the nonlinear and linear decoder results, the accuracy of the decoder trained on activities from the identity network is significantly higher than that of the baseline decoder when both were trained to decode information about space. This finding shows that it is possible to decode information about space from the activities of the second-to-last-layer units of the identity network. It indicates that even though the identity network was trained only to identify scrambled and unscrambled images, its later processing stage still had information about space while it was processing the input images in order to identify scrambled and unscrambled images. Furthermore, the accuracy of decoding space from the identity network is significantly lower than the accuracy of decoding space from the space network. This may be because, as information goes from the input layer to the second-to-last layer, some information about space is lost, since very precise information about space is not useful for the identity network's task (identifying scrambled/unscrambled images).
However, an important question is why the activities of the second-to-last-layer units of the identity network still contained spatial information. Was this spatial information actively kept by the network or just passively left in the network? While the network was processing input information in order to do a task, it would extract useful information from the inputs and eliminate useless information. As a result, some information would be retained and some would be lost. “Actively kept by the network” means the network chose to keep spatial information when it was eliminating other useless information. “Passively left in the network” means the network did not choose to keep spatial information but also did not actively eliminate all spatial information. If the network did not choose to actively keep the spatial information, then whatever spatial information, if any, was passively left in the network should be equivalent across trained networks, and there should be no difference in the spatial information retained in the identity network and the shoe network.
In order to answer this question, the accuracy of decoding space from the shoe network is compared with the accuracy of decoding space from the identity network. The result is that the accuracy of decoding space from the shoe network is significantly lower, which means that there was significantly less space information retained in the activities of the second-to-last-layer units of the shoe network. This result is the same for both nonlinear and linear decoders. It is likely because identifying a feature (feature recognition, or the type of shoe) does not need as much spatial information as identifying scrambled and unscrambled images (global recognition, or identifying combinations of features). These findings indicate that the space information was actively maintained by the identity network even though it was trained to do the identity task. Though most studies assume spatial information in the ventral stream is coming from the dorsal stream, our results indicate the information may be retained or built up within the ventral stream.
These findings agree with Hong et al.'s (2016) computational modeling results. They also used hierarchical convolutional neural networks (HCNNs) to model the ventral visual cortical pathway. They trained an HCNN to do category estimation tasks, took the neural activities of the top hidden layer of the HCNN during training, and used these artificial hidden-layer neural activities and a decoder to perform category-orthogonal estimation tasks. They found that the network performance on these tasks improved as training proceeded. This suggests that the category-orthogonal information was extracted by the HCNN when it was trained to do category estimation tasks, which is similar to what we found.
4.2 What Kind of Spatial Information Was Actively Maintained More in the Identity Network?
There are different kinds of spatial information, including the locations of the object images (defined as the location of the center of the object), the orientations of the object images, and the spatial alignments and orders of the parts of the object images and so forth. Two kinds of spatial information were examined in this study: object location and part/object orientation. According to the results presented, the accuracy of determining object location decreased significantly more than the accuracy of determining part/object orientation after the network had been trained to do the identity task when noisy input images were used. The amounts of accuracy decrease are comparable because they have been normalized according to their different chance level accuracies (see section 3). These findings suggest that the information loss about part/object orientation is smaller than the information loss about object location in the identity network. That is, these findings indicate that the identity network in our study actively maintained more information about part/object orientation than object location.
4.3 Why Did the Identity Network Actively Maintain More Information about Orientation?
To answer why the identity network actively maintained more information about part/object orientation, the accuracies of decoding identity from the location network and from the orientation network were compared. The assumption is that the location network would retain more information about object location in its second-to-last layer, while the orientation network would retain more information about part/object orientation in its second-to-last layer.
According to the nonlinear decoder results, the accuracy of decoding identity from the orientation network is significantly higher than that of decoding identity from the location network. These findings suggest that part/object orientation information is more important for the identity task. It could be because, in order to tell whether the object images are in the unscrambled order, the network needs to determine the part/object orientation: the definition of the order of parts depends on the part/object orientation, as explained in section 2. However, information about object location is less important for our identity task because the location of the object image is irrelevant for identifying the scrambled or unscrambled object image in our task. This suggests that spatial information is preserved along the ventral pathway because it is behaviorally useful. Likewise, given that some object recognition tasks can require an ability to disambiguate the same or similar objects in different locations (Byun & Lee, 2010; Garcia & Buffalo, 2020; Suzuki, Miller, & Desimone, 1997), we would expect that if our identity network was trained with such an object recognition task, spatial location information would be preserved to a greater extent than what we report in our study. Our findings suggest that spatial information is preserved along the ventral pathway when this information is behaviorally useful for the identification task.
When linear decoders are used, the difference between the identity-decoding accuracies of the orientation-trained and location-trained networks is not significant, but it is in the same direction as with the nonlinear decoders. It is possible that the current methods and experiments are not sensitive enough to detect a small difference using linear decoders. Alternatively, it is possible that the amount of linearly decodable identity information in the two networks genuinely does not differ. However, it is important to point out that the question we were trying to clarify is why orientation or location information is important for doing an identity task. The brain is the system that performs the identity task, and the brain is nonlinear. Therefore, we feel the results obtained using nonlinear decoders are most relevant to understanding neural decoding. Nevertheless, future work is needed to better understand and test the reliability of this difference between nonlinear and linear decoders.
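For reference, the sketch below shows one way the linear versus nonlinear decoder comparison could be implemented on the same penultimate-layer features; the specific choices here (logistic regression for the linear readout and a one-hidden-layer MLP for the nonlinear one) are assumptions for illustration, not our exact decoders.

from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def linear_vs_nonlinear(X_train, y_train, X_test, y_test):
    """Decode the same labels from the same features with a linear and a nonlinear readout."""
    linear = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    nonlinear = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000).fit(X_train, y_train)
    return {"linear": linear.score(X_test, y_test),
            "nonlinear": nonlinear.score(X_test, y_test)}

A gap between the two readouts indicates information that is present in the features but not linearly accessible.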
4.4 Is There Information about Object Identity (Scrambled/Unscrambled or Type of Shoes) in the Space Network?
The identity-decoding accuracy obtained from the space network's second-to-last layer is significantly higher than that of the corresponding baseline decoder when both were trained to identify scrambled and unscrambled images (the identity task). This indicates that even though the space network was trained to determine the location and orientation of the parts or images, its later processing stage still retained some of the information that is necessary to do the identity task. This may be because the dorsal pathway processes part, object, or face representation information in order to better recognize an object's or face's configural identity (Freud et al., 2016). For example, although face recognition is believed to be mainly processed by the ventral pathway (Grill-Spector, Weiner, Gomez, Stigliani, & Natu, 2018), in a same–different face detection task, configural but not featural processing of faces was found in the posterior dorsal pathway (Zachariou, Nikas, Safiullah, Gotts, & Ungerleider, 2017). TMS centered on the parietal regions impaired performance on configural but not featural face difference detection (Zachariou et al., 2017), which suggests that dorsal pathway processing is important for the intact perception of configural object information.
Furthermore, the identity-decoding accuracy obtained from the space network's second-to-last layer is significantly lower than the accuracy obtained from the input layer. Likely, as information passes from the input layer to the second-to-last layer, some information about scrambled or unscrambled identity is lost because very precise information about this identity is not useful for the space network's task (identifying locations and orientations).
On the other hand, according to the nonlinear decoder results, the shoe-identity (sandal versus closed shoe) decoding accuracy obtained from the space network's second-to-last layer is not significantly higher than the corresponding baseline accuracy. According to the linear decoder results, although this accuracy is significantly higher than the baseline accuracy, it is still significantly lower than the accuracy obtained from the input layer. Together, these findings indicate that the space network's later processing stage retains less information about shoe identity (whether the object is a sandal or a closed shoe). The space network may retain information about scrambled or unscrambled identity but little about shoe identity because the global object recognition information (scrambled/unscrambled) is relevant to the space network's task, whereas the specific feature recognition information (the identity of shoes) is not.
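The sketch below shows one way such "significantly higher/lower" comparisons could be made: repeat training and decoding across random seeds and compare the two resulting samples of accuracies. The statistical test shown is an assumption for illustration and may differ from our actual procedure.

import numpy as np
from scipy import stats

def compare_accuracies(acc_condition_a, acc_condition_b, alpha=0.05):
    """Two-sample t-test on decoding accuracies collected across repeated runs."""
    t, p = stats.ttest_ind(acc_condition_a, acc_condition_b)
    return {"t": t, "p": p, "significant": p < alpha,
            "mean_diff": np.mean(acc_condition_a) - np.mean(acc_condition_b)}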
When the space network was determining object orientation or location, it did not need to know object identity (scrambled or unscrambled). So what might be the benefit of retaining this information? The answer may be related to studies of tool processing. Recent studies found that tool sensitivity undergoes further refinement between the ages of 4 and 8, which indicates that sensitivity to objects in the dorsal pathway may require more motor experience and learning during childhood (Freud et al., 2016; Kersey, Clark, Lussier, Mahon, & Cantlon, 2016). Young children are more likely to categorize objects based on their physical features (e.g., the material of the object) than on their function-related features (Landau, Smith, & Jones, 1998; Smith, Jones, & Landau, 1996). These studies indicate that the dorsal pathway may retain object function-related features (including the spatial relation of parts, which is important for the scrambled/unscrambled distinction) when it is trained on spatial tasks during motor learning (tasks that require localization and orientation), and such motor training can help people learn how to categorize different kinds of tools based on these function-related features. Shoe identity may be more similar to a physical object feature (it does not require the spatial relation of parts), and thus it is not retained by the dorsal pathway. These arguments would need to be confirmed by future studies.
4.5 Independence, Not Interactions: What These Simulations Imply about the Ventral and Dorsal Cortical Visual Pathways
Some previous studies (Sereno & Lehky, 2011; Sereno et al., 2020) have found that the ventral pathway has representations of space, but these spatial representations differ from those in the dorsal pathway: the spatial representations in the ventral pathway are topological (“categorical”), whereas those in the dorsal pathway are precise and accurate (“coordinate”). These authors suggested that objects' shape and spatial information may be differently and independently constructed within each pathway in order to achieve different functions (object recognition or spatial recognition). Freud et al. (2015) also suggested that object representations in the dorsal pathway can be computed independently from those in the ventral pathway. According to our simulation results using nonlinear decoders, the spatial information retained in the identity network was more explicit about orientation than about location, and information about orientation is more useful for identifying scrambled and unscrambled images. Given that the identity network models the ventral pathway and the space network models the dorsal pathway, these results agree with previous experimental findings and with the interpretation that objects' identity and spatial information are differently and independently constructed within each pathway to achieve different functions, as opposed to the idea that these “crossed” signals come from the other stream (van Polanen & Davare, 2015; Zachariou et al., 2014). Hong et al. (2016) found that spatial information increased along the ventral stream, and their computational modeling also suggested that this spatial information is extracted within the ventral pathway itself, becoming more explicit at its later processing stages.
Some previous studies also found that the dorsal pathway has representations of objects' shapes (Konen & Kastner, 2008; Sereno & Maunsell, 1998; Sereno, Trinath, Augath, & Logothetis, 2002), and some have argued that these representations differ from the object representations in the ventral pathway (Janssen, Srivastava, Ombelet, & Orban, 2008; Lehky & Sereno, 2007). According to our simulation results, the identity information retained in the space network contained more information about scrambled or unscrambled identity (global recognition) than about shoe identity (feature recognition). This likely occurred because the space network extracted spatial information about the arrangement of features from the inputs, and this extracted information was useful for the scrambled/unscrambled (global) identity task but did not help with the shoe identity (feature recognition) task.
4.6 What Might Be a Reason for Why There Are Two Relatively Segregated Visual Pathways in the Brain?
Suppose the ventral pathway in the brain works similarly to the identity network and the dorsal pathway works similarly to the space network. There are two possible ways to determine an object's identity and spatial information. One way is to use a single pathway to process the visual inputs and determine the object's identity and spatial information (e.g., location and orientation) at the same time. The other is to segregate these goals (identity and space) and use two separate pathways to process the visual inputs. In this dual-stream arrangement, one pathway processes the visual inputs and is critical for object identity, whereas the other pathway processes the same visual inputs and is important for spatial information and visuomotor control, with separate cortical regions and streams responsible for the object's identity and spatial information. Experimental evidence has shown that the brain uses the second way to determine an object's identity and spatial information (Ungerleider & Mishkin, 1982), but why?
To address this question, we compared the accuracy of a single network trained to do both the identity and the spatial tasks with the accuracies of two separate networks, each trained on only one of the tasks. The single combined network was used to simulate doing the identity and spatial tasks with one pathway, and the two separate networks were used to simulate doing these tasks with two separate pathways. The testing accuracy of the two separate networks is significantly higher than that of the single combined network. This implies that when two pathways are used to determine an object's identity and spatial information separately, the networks perform better. Our findings therefore suggest that there are advantages for the brain in using two separate pathways to determine identity and spatial information.
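To make the contrast concrete, the sketch below (with placeholder layer sizes, class counts, and a 28 x 28 single-channel input, none of which are the paper's actual settings) shows a single network with one shared trunk and two output heads versus two independently trained single-task networks.

import torch.nn as nn

def trunk():
    # Placeholder feature extractor; sizes assume 1 x 28 x 28 inputs.
    return nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Flatten(), nn.Linear(16 * 28 * 28, 128), nn.ReLU())

class SinglePathwayNet(nn.Module):
    """One pathway, two outputs: identity and space share all features."""
    def __init__(self, n_identity=2, n_space=8):  # class counts are placeholders
        super().__init__()
        self.shared = trunk()
        self.identity_head = nn.Linear(128, n_identity)
        self.space_head = nn.Linear(128, n_space)
    def forward(self, x):
        h = self.shared(x)
        return self.identity_head(h), self.space_head(h)

# Dual-pathway control: two independent trunks, each trained on only one task.
identity_net = nn.Sequential(trunk(), nn.Linear(128, 2))
space_net = nn.Sequential(trunk(), nn.Linear(128, 8))

The single-pathway model must serve both objectives with one shared representation, whereas the dual-pathway control can specialize each representation for its own task.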
However, according to the results discussed in the previous sections, if there were only a single combined pathway and this pathway processed spatial information first and identity later, less information about object identity would remain, and this single-pathway brain would not be able to do object recognition accurately. Likewise, if a single pathway processed object identity first, it would lose information about space and would not be able to accurately determine the locations and orientations of objects.
In summary, these findings suggest that in order to accurately identify objects and accurately determine their locations and orientations, two separate pathways are advantageous because they allow the same input (visual information) to be processed in different ways for different tasks or goals. However, some tasks or conditions require coordinating the information from these segregated pathways (e.g., reaching for objects only if they are edible). In these cases, processing information differently in multiple separate pathways may cause a binding problem (Treisman, 2002). We suggest that the binding problem may be lessened by using the spatial information contained in the identity network and the object identity information contained in the spatial network.
We are not aware of any published study examining exactly how much information, and what kinds of information, can be extracted from neural networks solving these different tasks jointly or separately. These are computationally tractable questions that are important and timely. Our simulations using CNNs ignore many details of real-world tasks and real biological neural networks (e.g., different cell types and connectivity, or the fact that the ventral stream has more cortical areas than the dorsal stream). These simplifications are important and necessary to make direct computational comparisons possible. Our intent is not to claim that the simulation findings emulate the physiological conditions of these brain pathways. Our claims concern whether there could be a computational need for the properties retained in distinct pathways, or a computational need for separate pathways for recognition and nonrecognition tasks.
We repeated the simulations of decoding spatial information from the identity network and decoding identity information from the space network with different parameter settings for the brain networks and the decoder networks. Because our findings do not strongly depend on the specific parameter settings of these networks, they may also be valid for the brain, even though the structures of biological neural networks may differ. In sum, our computational findings are supported by the results shown above and are relevant to better understanding the computational constraints of neural computation.
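Schematically, such a robustness check amounts to rerunning the same decoding analysis over a small grid of settings and checking that the qualitative pattern of accuracies is unchanged; the settings listed and the helper function named below are illustrative assumptions only, not our actual sweep.

hyperparams = [
    {"hidden_units": 64, "decoder": "linear"},
    {"hidden_units": 128, "decoder": "linear"},
    {"hidden_units": 64, "decoder": "nonlinear"},
    {"hidden_units": 128, "decoder": "nonlinear"},
]

# Hypothetical driver: rerun the full decoding experiment for each setting and
# verify that the ordering of accuracies (the qualitative result) is preserved.
# results = [run_decoding_experiment(**hp) for hp in hyperparams]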
5 Limitations and Future Directions
First, we constrained the alignments of the three parts in each object image to always be the same, and we defined object orientation according to the alignment of parts. In this case, the identity of an object (scrambled or unscrambled) depends on the simple 1D order of parts, and the 1D order of parts depends on the alignment of parts. In reality, however, the identity of an object may not depend on the 1D order or alignment of parts. In other words, the dependency between object identity and the alignment of parts is just one simple example of the possible dependencies between object identity and the spatial arrangement of parts; in real life these dependencies could be different and more complex. For example, an object may be unscrambled even when its parts do not share the same alignment (e.g., the yoga pose Uttanasana, standing forward bend, where the head is upside down but the feet are right-side up). In this case, the identity of an object (scrambled or unscrambled) no longer depends on the alignment of parts, but it may still depend on other 2D or 3D spatial properties of the parts (such as the relative distances between parts, the relative locations of parts, and other topological information) that affect object identity recognition. A previous study found, using mathematical modeling rather than artificial neural networks, that object recognition accuracy can be improved by taking into account the spatial distribution of object parts (Morales-González & García-Reyes, 2013). Many objects differ not because they have different physical features such as color or texture, but because they have different spatial relations between parts. Therefore, it is very likely that, in general, an identity artificial neural network retains some spatial information about parts because this information increases object recognition accuracy. In our study, we demonstrated that the identity network retains some spatial information when object identity depends on the alignment of parts. In addition, we ran simulations in which object identity depended only on the order of parts and found no differences in the major findings we report. We used relatively simple object images to make sure the variables used in the simulated experiments were well defined and controlled (e.g., all objects consist of three parts, with the objects' orientations and locations clearly defined). If more complex and realistic images were used, the objects in the images would have many more variations, and it would be more difficult to define and control the variables that might increase or affect the computational differences we report. In the future, it will be important to examine object recognition accuracy in more general settings, with more realistic images and with object identity based on other higher-dimensional or topological spatial information.
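As a purely illustrative sketch (the actual stimuli are described in section 2), such a composite object can be thought of as three part images placed on a canvas in a fixed alignment, here assumed to be a vertical stack, with "unscrambled" corresponding to the canonical part order and "scrambled" to any other permutation; the canvas size, part sizes, and placement scheme below are assumptions, not the actual stimulus parameters.

import numpy as np

def make_object_image(parts, order, canvas=(96, 96), top_left=(0, 0)):
    """Place three equally sized part images (each H x W) on a canvas in the given order."""
    h, w = parts[0].shape
    img = np.zeros(canvas, dtype=np.float32)
    r, c = top_left
    for idx in order:                      # order == (0, 1, 2) -> unscrambled
        img[r:r + h, c:c + w] = parts[idx]
        r += h                             # parts share the same (vertical) alignment
    return img

# A scrambled image uses any permutation of (0, 1, 2) other than the canonical
# order, and the object's location is controlled by top_left.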
In addition, motion is an important property of both the ventral and dorsal cortical visual pathways (Sereno et al., 2002). Because our goal was a proof of computational concept, we did not use more complex stimuli or models that could complete tasks using motion. In the future, it would be interesting to use more elaborate artificial neural network models to test whether our findings still hold for more complex moving stimuli, scenes, and tasks. However, given the vast known biological variety (e.g., in cell types, receptors, connectivity, and modules), as well as the variety of stimuli, tasks, experiences, and training that a single brain encounters by the age of 20, even a CNN model that generalized to more complex stimuli and multiple tasks, and that predicted neural responses in visual cortical areas, would still be disputable as an accurate model of a real human brain.
Finally, we used a supervised learning rule. Many researchers think the brain relies mainly on unsupervised learning and reinforcement learning to learn how to accomplish different tasks (Hinton & McClelland, 1988). Although previous work has argued that supervised learning may be biologically plausible (Lillicrap et al., 2020; Whittington & Bogacz, 2019), it would be interesting to examine whether more biologically plausible learning rules affect any of the findings we report. Future studies should also try to localize objects more precisely (give more accurate coordinates when localizing objects). In addition, the networks in this study can localize only one object at a time; it would be interesting in future work to use more realistic and biologically plausible networks that can localize and identify multiple objects at the same time.
6 Conclusion
In summary, our simulations imply that both the ventral and dorsal cortical visual pathways contain information about identity and space, even when each is trained with a single identity or location task. We have also shown that the ventral pathway does not contain all types of spatial information equally, and the dorsal pathway does not contain all types of object identity information equally. In our simulations and tasks, more orientation information than location information was retained in the ventral pathway. Likewise, in the dorsal pathway, more information was retained about the whole object (global recognition) than about individual features (feature recognition). These modeling findings suggest that the object information retained in the dorsal pathway is not the same as the object information represented in the ventral pathway, and the spatial information retained in the ventral pathway is not the same as the spatial information represented in the dorsal pathway. Rather, the object and spatial information retained in the dorsal and ventral pathways, respectively, appear to be the aspects of identity and space that are most needed to accomplish spatial and identity tasks, respectively. As a result, the modeling suggests that the identity and spatial information retained in the two pathways need to be different in order to accurately accomplish different kinds of tasks. Furthermore, we show that two separate pathways are needed in order to process visual information in different ways so that the brain can accomplish different kinds of visual tasks more accurately. Using a computational approach, we provide a framework to test the properties and functional consequences of two independent visual pathways (with no cross connections) and show that the findings can provide insight into recent contradictory findings in systems neuroscience.
Acknowledgments
We thank Sidney Lehky and Margaret Sereno for comments on the manuscript. This work was partially supported by start-up funds from Purdue University to A.S.