Abstract
Neurons selective for faces exist in humans and monkeys. However, characteristics of face cell receptive fields are poorly understood. In this theoretical study, we explore the effects of complexity, defined as algorithmic information (Kolmogorov complexity) and logical depth, on possible ways that face cells may be organized. We use tensor decompositions to decompose faces into a set of components, called tensorfaces, and their associated weights, which can be interpreted as model face cells and their firing rates. These tensorfaces form a high-dimensional representation space in which each tensorface forms an axis of the space. A distinctive feature of the decomposition algorithm is the ability to specify tensorface complexity. We found that low-complexity tensorfaces have blob-like appearances crudely approximating faces, while high-complexity tensorfaces appear clearly face-like. Low-complexity tensorfaces require a larger population to reach a criterion face reconstruction error than medium- or high-complexity tensorfaces, and thus are inefficient by that criterion. Low-complexity tensorfaces, however, generalize better when representing statistically novel faces, which are faces falling beyond the distribution of face description parameters found in the tensorface training set. The degree to which face representations are parts based or global forms a continuum as a function of tensorface complexity, with low and medium tensorfaces being more parts based. Given the computational load imposed in creating high-complexity face cells (in the form of algorithmic information and logical depth) and in the absence of a compelling advantage to using high-complexity cells, we suggest face representations consist of a mixture of low- and medium-complexity face cells.
1 Introduction
The ability to recognize individual faces and interpret facial expressions is central to human social interactions, as well as the social interactions of some nonhuman primates (Leopold & Rhodes, 2010; Parr, 2011; Parr, Winslow, Hopkins, & de Waal, 2000). Neurons whose responses are selective for faces have been demonstrated in humans and nonhuman primates, both neurophysiologically and through fMRI (Duchaine & Yovel, 2015; Freiwald, Duchaine, & Yovel, 2016; Haxby, Hoffman, & Gobbini, 2000; Kanwisher & Yovel, 2006; Nestor, Plaut, & Behrmann, 2016; Parr, Hecht, Barks, Preuss, & Votaw, 2009; Tsao, 2014; Tsao & Livingstone, 2008). How those neurons are used to represent face is a matter of extensive research.
Neurophysiological evidence indicates that faces can be encoded using a neural population code, with each face represented by a point within a high-dimensional face response space (Chang & Tsao, 2017; Eifuku, De Souza, Tamura, Nishijo, & Ono, 2004; Rolls & Tovée, 1995; Young & Yamane, 1992). Each neuron forms an axis of the neural face space. Neural responses within the high-dimensional response space can be visualized through dimensional-reduction techniques such as multidimensional scaling (MDS) or principal components analysis (PCA). The dimensionality of face space has been estimated psychophysically to be on the order of 100 (Meytlis & Sirovich, 2007; Sirovich & Meytlis, 2009). Within an axis-based face space, the average face may have special status as defining the origin of the face space coordinate system (Leopold, Bondar, & Giese, 2006; Leopold, O'Toole, Vetter, & Blanz, 2001; Rhodes & Jeffery, 2006; Tsao & Freiwald, 2006; Wilson, Loffler, & Wilkinson, 2002), though this remains controversial.
The use of axis-based high-dimensional neural response spaces has become commonplace for interpreting neural data, not just for describing faces responses but also for describing neural responses to object stimuli in general. Those using an axis-based approach to characterize neurophysiological object responses (nonface) include MDS studies (Kayaert, Biederman, & Vogels, 2005; Kiani, Esteky, Mirpour, & Tanaka, 2007; Lehky & Sereno, 2007; Murata, Gallese, Luppino, Kaseda, & Sakata, 2000; Op de Beeck, Wagemans, & Vogels, 2001; Romero, Van Dromme, & Janssen, 2013; Sereno & Lehky, 2018; Sereno, Sereno, & Lehky, 2014), as well as those based on PCA (Baldassi et al., 2013; Chang & Tsao, 2017). This axis-based approach can be extended to interpreting fMRI data for objects, in this case using each voxel as an axis for the high-dimensional response space (Bracci & Op de Beeck, 2016; Connolly et al., 2012; Kravitz, Peng, & Baker, 2011; Kriegeskorte et al., 2008).
There are two perspectives on the development of face processing circuitry in temporal cortex. The first is that there are face-specific neural processes that are hardwired (domain specificity) (Kanwisher, 2000; McKone, Kanwisher, & Duchaine, 2007; Tsao & Livingstone, 2008; Yovel & Kanwisher, 2004). The second is that the temporal cortex can also acquire processing for different classes of nonface stimuli through experience (expertise) (Cowell & Cottrell, 2013; Gauthier, Behrmann, & Tarr, 1999; Gauthier, Skudlarski, Gore, & Anderson, 2000; Gauthier & Tarr, 1997; Tong, Joyce, & Cottrell, 2008; Wang, Gauthier, & Cottrell, 2016). For the purposes of this study, we remain agnostic between these possibilities, focusing on the face representations themselves, not their development.
A neural face space is defined by the properties of the individual neurons that constitute the axes of the space (plus possible interactions within the face cell population if the face space is nonlinear). Therefore, a central task in characterizing face space is to characterize those individual neurons. As with high-level representations of objects in general, the complexity of face representations at the population level reflects the complexity in the organization of individual face cell receptive fields. Given the complexity of face cell organization, a fruitful approach is to constrain the possibilities of what aspects of facial features are important to face cells. An interesting example of this sort of analysis is given by Freiwald, Tsao, and Livingstone (2009) for monkey inferotemporal cortex, based on the geometry of facial features and parts/whole organization using simple cartoon face stimuli. In contrast, we have hesitations concerning the conclusions of Chang and Tsao (2017) that face space corresponds to one unique linear space that they have discovered. We believe that other linear face spaces are also consistent with their data under their mathematical analysis methods, as we will consider in section 4.
Here we suggest that image complexity may be a novel way to characterize face representations, where complexity is given a well-defined mathematical definition. We approach the issue of face complexity theoretically by using a mathematical technique based on tensor decomposition (Bro, 1997; Cichocki et al., 2015; Favier & de Almeida, 2014; Kolda & Bader, 2009; Rabanser, Shchur, & Günnemann, 2017; Sidiropoulos et al., 2017) that allows us to vary the complexity of the face cells that constitute the encoding dimensions. Complexity as used here is defined as Kolmogorov complexity, also known as algorithmic information (Adriaans, 2019; Cover & Thomas, 2006; Grünwald & Vitányi, 2008a, 2008b; Li & Vitányi, 2008), as well as another complexity measure called logical depth (Bennett, 1988, 1994; Zenil, Delahaye, & Gaucherel, 2012). Comparing properties of face representations with different complexities is the central focus of this study.
Tensor analysis decomposes faces into a set of components called tensorfaces. Under the algorithm used here, the original faces can be reconstructed by a weighted linear sum of the tensorfaces (under other tensor algorithms the mixing can be multilinear). A set of components and their associated weights can be thought of as model face cells and their firing rates. This tensor decomposition is analogous to reconstructing faces using a weighted linear sum of components derived from principal components analysis (PCA) (Turk & Pentland, 1991), a weighted linear sum of components derived from independent components analysis (ICA) (Bartlett, Movellan, & Sejnowski, 2002; Bartlett & Sejnowski, 1997), or a weighted linear sum of components derived from nonnegative matrix factorization (NMF; Wang, Jia, Hu, & Turk, 2005), among other possibilities. These decomposition algorithms differ based on what constraint is applied to the decomposition. PCA produces components subject to the constraint that they are orthogonal, ICA that they are statistically independent, and NMF that they are nonnegative. Another member of this genre of decomposing faces into linear components is active appearance modeling (AAM) (Cootes, Edwards, & Taylor, 2001; Edwards, Cootes, & Taylor, 1998), as used by Chang and Tsao (2017). AAM is similar to PCA except that fiducial markers are placed on the face images by hand to help with aligning features during decomposition (thus, this is not an automatic algorithm). In this study, the constraint we place on the face decomposition is that the components have fixed complexity.
Tensor decomposition is not a single algorithm but a category of algorithms. The term tensorface was originated by Vasilescu and Terzopoulos (2002) for a particular nonlinear (multilinear) tensorface decomposition algorithm (see also Vasilescu & Terzopoulos, 2002, 2003, 2005, 2011). We use a different tensor algorithm to linearly decompose faces (Phan, Cichocki, Tichavský, Zdunek, & Lehky, 2013), one that is constrained to produce tensorfaces with specified image complexity. Each tensorface can be visualized as a matrix of pixels, and the rank of that matrix serves as the direct proxy of image complexity when running the algorithm. (Rank is defined as the maximum number of linearly independent columns or rows in a matrix.) Matrix rank is the input parameter specified for the algorithm to specify the face complexity we want, while Kolmogorov complexity and logical depth are calculated from the output tensorfaces after the algorithm is run.
We are not advocating the algorithm used here as a specific model of biological face cells, and we are not interested in creating a canonical face space (we believe such an effort is premature). Rather, we are interested in exploring the concept of complexity in face representations in general using this algorithm as an example, with hopes that this concept will prove useful in future investigations of biological face processing. We create tensorfaces with specified complexity by adding a rank constraint to a tensor decomposition algorithm. Creating other face representations with specified complexity could also be done by adding rank constraints to other decomposition algorithms not based on tensor algorithms. An example of this is PCA decomposition with a rank constraint added (Yu, 2016). We confine ourselves here to issues of basic face representation and do not attempt to categorize different views of individual faces because we are not creating a full face recognition model.
2 Methods
2.1 Face Stimulus Set
Synthetic colored faces were generated using FaceGen software (Singular Inversions, Inc.; facegen.com). Some details of the FaceGen algorithms are discussed in Blanz and Vetter (1999). The face set included equal numbers of males and females and equal numbers from the four racial groups provided by the software: African, East Asian, European, and South Asian. Because we included color in our consideration of facial representations, we wanted to have different skin tones in the face sample. Example faces are shown in Figure 1. Within each racial group, we generated faces with random shape, color, and texture parameters using the Generate button in the software control panel. This automatic, random generation of faces sometimes led to unnatural-looking faces, which either were rejected from inclusion in the face set or had their parameters manually tweaked. Faces had zero rotation. The illumination angle was 0 azimuth and 0 elevation. Tensor decomposition was carried out on a sample set usually consisting of 128 faces (see the examples in Figure 2a). The resulting tensorfaces were tested by using them to reconstruct a different set of faces, a test set containing 40 faces (see Figure 2b).
For this initial study of tensorface complexity, we have kept the face-sample set simple, all front-facing with identical illumination. The multiway nature of tensor decompositions would allow inclusion of additional image parameters as additional dimensions to the input tensor containing the sample face set. For example, representations of rotated faces (changes in viewpoint) are an important aspect of face identification (Fang, Murray, & He, 2007; Freiwald & Tsao, 2010; Jiang, Blanz, & O'Toole, 2006; Natu et al., 2010; Noudoost & Esteky, 2013; Perrett et al., 1991, 1985; Ramírez, Cichy, Allefeld, & Haynes, 2014). Face rotation in depth (azimuth) could be added as a fifth dimension to the current four-dimensional input tensor ( spatial dimensions, spatial dimension, color, and different individuals), and analogously for additional image parameters.
2.2 Tensor Decomposition Algorithm Background
We computed face components using tensor methods rather than the matrix methods used in PCA, ICA, and NMF. PCA and other matrix techniques can deal only with 2D data. That means each face image must be unfolded or vectorized into one long 1D vector. Then the vectors for the individual faces are placed together to form the columns of a 2D matrix, which serves as the input to PCA (see Figure 3a). In contrast, tensor methods can be applied to data with an arbitrarily large number of parameter dimensions. Therefore, images do not need to be vectorized, and each pixel within the image retains its spatial context during the decomposition process (see Figure 3b). Here, we did tensor decompositions of 4D face data structures, which included two spatial dimensions for each face, color as the third dimension, and different individuals as the fourth dimension. While PCA and other matrix methods use linear algebra, tensor methods use multilinear algebra that allows consideration of multiple parameter dimensions concurrently. While we used a multilinear algorithm to decompose faces into a set of components and weights, the faces were reconstructed linearly as the weighted sum of the components.
2.3 Tensor Decomposition Algorithm
Matlab code and example face files are available at https://github.com/slehky/tensorfaces-neco.
2.3.1 Matrix Operators
The tensor decomposition algorithm uses the Kronecker, Khatri-Rao, and Hadamard products between two matrices, as well as Hadamard division. The properties and applications of those matrix operators have been reviewed by Van Loan (2000), as well as Liu and Trenkler (2008), and are included in the Matlab toolbox software of Kolda et al. (2017) and Phan (2018). Here we briefly look at these operators before describing the algorithm.
2.3.2 The Model
The tensor decomposition algorithm we use is described by Phan et al. (2013). We have not made any changes to it but present it in more detail here. The algorithm is a variant of the CANDECOMP/PARAFAC (CP) algorithm (Carroll & Chang, 1970; Harshman, 1970). It falls into the category of structured or constrained CP incorporating a PARALIND algorithm (Bro, Harshman, Sidiropoulos, & Lundy, 2009). Constrained CP algorithms have been reviewed by Favier and de Almeida (2014). Although this algorithm is derived from CP, it is not based on an outer product sum of rank 1 components as is done by CP. Rather, the decomposition is based on a Kronecker product between two tensors—namely, a components tensor and a weights tensor. The structured CP algorithm used here can be viewed in some sense as intermediate between two commonly used tensor decomposition models, the conventional CP model (Carroll & Chang, 1970; Harshman, 1970) and the Tucker model (Tucker, 1966), and incorporates aspects of both. The reason for using a structured CP model in this study rather than either the conventional CP or Tucker models is briefly outlined in Phan et al. (2013).
Consider a data tensor of size . Our aim is to represent this tensor by multiple basis components (tensorfaces) in which the components were specified to have various levels of complex structures. In our case, we are dealing with a four-way tensor () with size 200 (pixels) 200 (pixels) 3 (color channels) 128 (individuals), which represents 128 colored face images concatenated into a single data structure. All calculations are performed with the color channels converted from RGB to CIE 1976 L*A*B color space, which approximates human color vision more closely.
Various decomposition algorithms can be carried out subject to different constraints on , such as orthogonality (PCA), statistical independence (ICA), and nonnegativity (NMF), as well as possible constraints on such as sparseness. Here the constraint was on the tensor rank of , where we take rank to be a measure of the complexity of the tensorface patterns. As the number of components is limited, the decomposition will be only approximately equal to the original data .
For our model, the weights are of size , with their order (dimensionality) given by . Within the algorithm, we defined to be order , and thus is represented by an vector, where is number of face images in the input set (typically ). The patterns are of size , with their order given by . are of order , and form -sized tensors, where is the size of the input image in pixels—in our case, always 200 pixels.
and are rank- and rank- tensors, respectively. The rank of is always . The rank of is set over the range to for different runs of the tensor decomposition algorithm. Examining the effects of changing (changing the complexity of tensorfaces) is a central concern of this study.
The subscript for different tensorface patterns is included for generality, but we hold both the order and the rank of both and constant for all . Notably the rank of is constant for the entire population of tensorfaces during a single run. Although we had the option to set the rank of each tensorface individually, we do not do so here.
The tensor decompositions in equations 2.7 to 2.9 are particular cases of Kronecker tensor decomposition (KTD) and also constitute a generalized model of block term decomposition (BTD) (De Lathauwer, 2008a, 2008b; Sorber, Van Barel, & De Lathauwer, 2013). If all are of order 1 (i.e., ), as was the case here, then the above model is simplified into the rank- rank-1 BTD (Sorber et al., 2013).
In order to derive the algorithm that updates the factor matrices of the basis patterns and the weight tensors, we rewrite the tensor decomposition in equation 2.7 with rank constraints in equations 2.8 and 2.9 in the form of the CP decomposition.
In equations 2.14 and 2.15, is the Kronecker product, defined in equation 2.1. is the tensor with ones along the superdiagonal, is a vector of ones, and is the matrix transpose operator.
In this decomposition, due to properties of the Kronecker product, each component (column) of was replicated times in for , and each component of was replicated times in for . Such behavior is related to the rank-overlap problem (the decomposition creates multiple identical components), which often exists in real-world signals such as chemical data, flow injection analysis (FIA) data; Bro, 1998; Bro et al., 2009), or spectral tensors of EEG signals (Phan et al., 2013). However, in our case, this does not lead to the creation of multiple identical tensorfaces because each is the result of combining all factor matrices .
The structured CPD in lemma 1 is a particular case of parallel factor analysis (CANDECOMP/PARAFAC; Carroll & Chang, 1970; Harshman, 1970) with linearly dependent loadings (PARALIND; Bro et al., 2009) in which the dependency matrices (Bro et al., 2009) are fixed and given in lemma 1. Discussions on the uniqueness of the CPD with linearly dependent loadings can be found in Guo, Miron, Brie, and Stegeman (2012) and Stegeman and Lam (2012).
2.3.3 Algorithm
Updating and in turn allows updates of and through equations 2.8 and 2.9. ALS acts to iteratively adjust the factors and in equation 2.7 so as to minimize the Frobenius error between the original data tensor and the reconstructed data tensor , . is calculated from the estimated and during each iteration. The ALS algorithm updates each parameter sequentially, in contrast to error minimization using a gradient descent algorithm, which updates all parameters simultaneously. The error minimization loop is begun by initializing and using the singular value decomposition (SVD) of . SVD is performed on a matrix in which each column is formed by vectorizing a face image (creating a vector with 200 200 3 pixels), with the number of columns equal to the number of images (128 images). The left SVD vector is saved to a tensor with image dimensions, then approximated by a low-rank tensor using CANDECOMP/PARAFAC, and finally assigned as initialization of . is initialized using the right SVD vector.
2.4 Reconstruction Error
2.5 Displaying Tensorfaces
The pixel values of the tensorfaces produced by the tensor decomposition algorithm generally extend beyond the range of values allowed by the color space, as the decomposition was not constrained to fit requirements of the color space. The (luminance) channel allows values on the range 0 to 100, while the (red-green opponent) and (blue-yellow opponent) channels both allow values on the range 100 to 100. For display purposes, each tensorface was individually normalized to fill out the allowable values of the color space. The L channel was separately normalized, while the and channels were jointly normalized so as not to affect the color balance between the two. After the color space normalization, the tensorface was converted to RGB color space for display.
2.6 Kolmogorov Complexity (Algorithmic Information)
The Kolmogorov complexity of a pattern or, equivalently, the algorithmic information it contains is the length of the shortest algorithm required to reproduce it (Grünwald & Vitányi, 2008a, 2008b; Li & Vitányi, 2008). In other words, the complexity of a pattern is the size of the most compressed description of the pattern. The concept of Kolmogorov complexity was independently introduced by Solomonoff (1964), Kolmogorov (1965), and Chaitin (1969), and is sometimes known as Kolmogorov-Chaitin-Solomonoff (KCS) complexity.
To illustrate the difference between algorithmic information and Shannon information, consider a communications channel in which only two messages are possible: face A or face B. Whenever one of those faces is transmitted, the Shannon information is one bit because there are only two possibilities. However, the algorithmic information transmitted is vastly higher because it requires many bits to form a complete description of the face.
While the definition of Kolmogorov complexity is straightforward, determining its value is problematic as there is no systematic way to determine the most compact description of a pattern. In other words, Kolmogorov complexity is uncomputable (no algorithm exists). In practice, therefore, we use lossless compression algorithms to approximate an upper bound to the complexity of the tensorfaces (Ruffini, 2017).
Here, we base our estimate of the Kolmogorov complexity of tensorfaces on the file size of the tensorface images after they underwent a lossless compression. That is done by saving a tensorface image in PNG image format and noting the number of bits in the saved file. The PNG image format uses an efficient, lossless compression algorithm called DEFLATE, based on the Lempel-Ziv algorithm (Lempel & Ziv, 1976; Ruffini, 2017) together with Huffman coding. To further compress the tensorface files beyond the standard PNG format, we use the program ImageOptim (imageoptim.com), which ran an additional set of compression algorithms, also based on DEFLATE, that are more efficient but too-time consuming for ordinary use, combining the results of those compression algorithms. The algorithms included Zopfli, PGNOUT, OptiPGN, AdvPGN, and PGNCrush. Using ImageOptim reduces tensorface file sizes beyond the standard PNG compression by an amount depending on tensorface rank, ranging on average from 19% for rank 2 tensorfaces to 9% for rank 32 tensorfaces.
The number of bits in the compressed image was then normalized by the number of pixels in the image, giving an estimate of Kolmogorov complexity as bits per pixel for the compressed image. While all tensorface images had identical file sizes when uncompressed and initially had 24 bits per pixel, some tensorfaces were more compressible than others, reflecting image complexity.
After setting the desired rank of the tensor decomposition, the face sample set was decomposed into 100 components, and then the decomposition was replicated 10 times to produce 1000 tensorfaces. Kolmogorov complexity was averaged over those 1000 tensorfaces. The same set of tensorfaces was used in calculations of logical depth, power spectra, and globality described below.
In this analysis, the tensor decomposition algorithm provides us with model face receptive fields (tensorfaces). Such fields are presented as images of receptive fields, analogous to the way that V1 Gabor receptive fields are presented as images of the receptive fields. Having access to such receptive field images makes it feasible to employ mathematical methods for evaluating algorithmic complexity. On the other hand, in an experimental neurophysiological situation, producing images of face cell receptive fields is problematic because of the intractability with finding optimal face stimuli given undefined spatial nonlinearities in the receptive fields. We discuss nonlinearities in face cells further below.
2.7 Logical Depth
Logical depth is another way to measure the complexity of tensorfaces. In the present context, logical depth is the duration of computational time required to restore an image back to its original state after it has been maximally compressed in a lossless manner. The concept of logical depth was originated by Bennett (1988, 1994) and has previously been applied to the characterization of images by Zenil et al. (2012). The basic idea is that objects that “contain internal evidence of a nontrivial causal history” (Bennett, 1988) have a complex structure that requires more computational time to reconstitute from their shortest descriptions (maximally compressed states) than objects without complex structure.
While Kolmogorov complexity can be thought of as measuring complexity in terms of space (the length of the shortest description of an object), logical depth measures complexity in terms of time (the number of computational steps required to reconstruct the object from that shortest description). An important difference between the two is that Kolmogorov complexity considers both structured states and random states to be complex, but logical depth considers only structured states as complex, while treating both trivial and random states as noncomplex. Thus, as Zenil et al. (2012) pointed out, logical depth may lie closer to our intuitive concept of complexity than Kolmogorov complexity.
To measure logical depth, we first compress the tensorface images by running the program dzip within Matlab (Mathworks, Natick, MA). Then the image is uncompressed using dunzip, and the elapsed time to perform the uncompression was measured using the Matlab tic-toc timer function. The uncompression time is measured 1000 times for each tensorface and then averaged. Timing is measured with no user applications running aside from Matlab, with WiFi and Bluetooth turned off, and nothing attached to any of the computer ports.
Dzip implements the DEFLATE lossless compression algorithm. (Dzip and dunzip are available for download from the Matlab File Exchange: www.mathworks.com/matlabcentral/fileexchange/8899-rapid-lossless-data-compression-of-numerical-or-string-variables.)
2.8 Power Spectra
We calculated the 2D spatial frequency power spectra of tensorfaces having different levels of complexity. The tensorfaces were first converted from color to grayscale images. The 2D spectra were then transformed to 1D by performing rotational averaging (averaging spectral power over all orientations in the images).
2.9 Globality Index
We define the globality of a tensorface component as the fraction of the face it covers. This is the number of pixels in a tensorface divided by the average number of pixels in a face (averaged over all faces in the sample). The number of pixels in the faces is simple to determine, as the faces are on a black background and easy to segment. The number of pixels in a tensorface is more difficult, as the tensorfaces had a continuum of values that could blend in with the background. Including all tensorface pixels that differed just a tiny bit from the background would greatly inflate the size of the tensorfaces and therefore their globality.
We therefore follow the following procedure to exclude small pixel values from the globality calculations and isolate the high-activity regions of the tensorfaces. First, we convert the tensorfaces to grayscale and subtract the background, leaving the tensorfaces on a black background. Then we set a gray threshold level and exclude pixels below that level. The threshold is set using Otsu's method (Otsu, 1979), which minimizes the intraclass variance of the pixels above and below threshold (Matlab command graythresh in the Image Processing Toolbox). The grayscale tensorface is then binarized based on that threshold level, with pixels above threshold set to white and those below set to black. This thresholding typically leaves the high-activity tensorface regions as a set of disjoint white patches. To create a unitary tensorface region for purposes of globality calculations, all the individual white patches are enclosed by their convex hull (Matlab command convhull). The interior of this convex hull constitutes the high-activity region of the tensorface. Finally, the area of a tensorface enclosed by the convex hull, measured in pixels, is divided by the area of the face. The resulting fraction is the globality index of the tensorface.
2.10 Selectivity and Sparseness
We use kurtosis as a measure of both the selectivity of single tensorfaces and the sparseness of populations of tensorfaces. Kurtosis is a measure of the shape of a probability distribution—in this case, the distribution of tensorface responses to stimuli. A high kurtosis distribution, corresponding to high selectivity or high sparseness, emphasizes the peak and tails of the distribution with less probability in between. A low kurtosis distribution, corresponding to low selectivity or low sparseness, has a flatter distribution. By tensorface “response” to a stimulus, we mean the weight associated with that tensorface when reconstructing the stimulus image.
Cell selectivity is based on the probability distribution of the responses of a single cell (single tensorface) when presented with a set of stimuli over time. Selectivity has also been called the “lifetime sparseness” of single neurons (Willmore & Tolhurst, 2001). Population sparseness is based on the probability distribution of the simultaneous responses of a population of cells to a single stimulus (using the terminology of Lehky, Kiani, Esteky, & Tanaka, 2011; Lehky, Sejnowski, & Desimone, 2005; Lehky & Sereno, 2007). In the work presented here, responses could take both positive and negative values, which we interpret as deviations from a spontaneous level of activity, and probability distributions were roughly symmetrical.
Kurtosis has previously been used as a measure of selectivity and sparseness in the theoretical literature (Bell & Sejnowski, 1997; Olshausen & Field, 1996; Simoncelli & Olshausen, 2001). Kurtosis has also been used in the experimental literature for extrastriate cortex (Lehky et al., 2011, 2005; Lehky & Sereno, 2007; Tolhurst, Smyth, & Thompson, 2009).
2.11 Multidimensional Scaling
Multidimensional scaling (MDS; Hout, Papesh, & Goldinger, 2013) is used to visualize the face space produced by tensor decomposition. The MDS analysis is based on the tensorface weights that allow reconstruction of the faces in the sample set, after the faces have been decomposed into a set of 100 tensorfaces. We examined responses (weights) of a population of 100 tensorfaces to each of the 128 faces in the sample face set, as well as the responses of those tensorfaces to the average face calculated from the 128 faces. Thus, in total, we have population responses for 129 faces. These faces form 129 points in a 100-dimensional face space defined by the tensorface population. Because the relative positions of faces in the high-dimensional face space cannot be visualized, we use MDS to reduce the dimensionality of the face space down to two dimensions while maintaining approximate relative positions. While MDS is useful for low-dimensional visualization, the MDS algorithm has nonlinearities within it and should not be relied to produce a quantitatively accurate depiction of biological face space.
The responses of the tensorface population to a single face form a response vector with a length of 100 elements, defining the position of that face in face space. The first step in performing MDS is to calculate the distances between response vectors for all 129 faces, forming a distance matrix. A Euclidean distance metric is used. The distance matrix is then fed into the cmdscale command in the Matlab Statistics and Machine Learning Toolbox, which performs the MDS.
3 Results
3.1 Appearance of Tensorfaces
The tensor decomposition algorithm was applied to a set of 128 sample faces (the examples are shown in Figure 2a), producing tensorface components. Shown are the resulting low-complexity tensorfaces (rank 2, Figure 5), medium-complexity tensorfaces (rank 8, Figure 6), and high-complexity tensorfaces (rank 32, Figure 7). These are all shown as 200 200 pixel images. The number of tensorface components created by the algorithm was set by a parameter, and here we show examples of a decomposition of the face set into 40 components. The qualitative appearance of the components did not change as we varied the number of components over the range 5 to 100.
An expanded view of example tensorfaces at the three complexity levels is shown in Figure 8. As the complexity increases, the face representation progresses from crude blobs to a clear face-like appearance.
For comparison, eigenfaces resulting from a PCA decomposition of the sample face set are shown in Figure 9. They most closely resemble the high-complexity tensorfaces. The eigenfaces have rank 142, so they are more complex than any of the tensorfaces we created. Applying ICA to the sample faces produced components that qualitatively resembled the eigenfaces and were also highly complex, with the same rank 142.
3.2 Reconstructing Faces Using Tensorfaces
The tensorfaces were used to reconstruct a set of test faces (see Figure 3b), which were different from the sample faces used to create the tensorfaces. Although the tensor decomposition algorithm used to create the tensorfaces is nonlinear, reconstructing faces from a population of tensorfaces is a linear process. These face reconstructions are used to graphically illustrate how much information is available in the tensorfaces for representing faces and does not imply that the brain reconstitutes face bitmaps somewhere along the visual pathways.
Face reconstructions are shown in Figures 10a (reconstructed using low-complexity tensorfaces), 10b (reconstructed using medium-complexity faces), and 10c (reconstructed using high-complexity tensorfaces). In all three cases the reconstructions are subjectively comparable, showing that even the blob-like, low-complexity tensorfaces are capable of providing a good face representation.
Reconstruction errors are plotted as a function of the number of components in Figure 11a. Reconstruction error is the normalized Euclidean pixel-wise distance between original and reconstructed images. Not surprisingly, performance improved as tensorface population size increased. There was a trade-off between tensorface complexity and the population size required to reach a criterion error level. A large population of low-complexity tensorfaces can match the performance of a smaller population of high-complexity tensorfaces.
Reconstruction error is plotted as a function of complexity in Figure 11b (holding the tensorface population size constant at 100). Error is large for low-complexity tensorfaces, with the error dropping greatly going to medium complexity but then staying approximately constant with further increases in complexity. There is, in fact, a slight rise in reconstruction error at high complexities. That is because error is being measured here on a test set of faces different from the sample set used to create the tensorfaces, and high-complexity tensorfaces have a poorer ability to generalize to new stimuli. (Generalization is further discussed below.)
3.3 Computational Complexity of Tensorfaces
We have been measuring complexity in terms of the rank of the matrix of pixel values representing a tensorface image. The algorithm allows specification of the desired tensorface rank resulting from the decomposition process. Matrix rank is the minimum number of column vectors that can be used to generate all the columns in the matrix (equivalently, it can be done in terms of rows rather than columns). For example, a tensorface with rank 8 means that all 200 columns in the tensorface image can be generated by different linear combinations of just eight column vectors. A matrix with a larger rank requires a larger basis set of vectors to define it and is therefore more complex.
A standard way to measure complexity within computational theory is Kolmogorov complexity, also known as algorithmic information (Grünwald & Vitányi, 2008b; Li & Vitányi, 2008). As described in section 2, we operationally define Kolmogorov complexity as the number of bits per pixel required to store the tensorface image after undergoing maximal lossless compression. A more complex image requires a larger file size. The relationship between complexity measured as matrix rank and Kolmogorov complexity is plotted in Figure 12a. We see that Kolmogorov complexity correlates with complexity measured by rank. A second measure of computational complexity is logical depth, the time duration of computations required to uncompress a compressed tensorface (Bennett, 1988, 1994; Zenil et al., 2012). The logical depth of tensorfaces as a function of tensorface rank is plotted in Figure 12b.
Both Kolmogorov complexity and logical depth provide similar estimates of the relative complexity of different tensorfaces. Tensorfaces that compress to a small file size (low Kolmogorov complexity) take less computational time to uncompress (small logical depth). Tensorfaces that produce large compressed file sizes (high Kolmogorov complexity) take more computational time to uncompress (large logical depth).
Note the large standard deviations in Figure 12. For each plotted point, although all tensorfaces had identical matrix ranks, the resulting values for Kolmogorov complexity and logical depth were spread out over a broad range. The relation between tensorface rank and the two complexity measurements is therefore statistical rather than deterministic.
In addition to characterizing tensorfaces by their complexity as defined by computational theory, we can also characterize them using concepts from signal processing theory. The average spatial frequency power spectra of tensorfaces at rank 2, 8, and 32 are plotted in Figure 13a. We see that as tensorface complexity increases, the spectral power at high spatial frequencies also increases (see Figure 13b). Computational complexity of tensorface receptive fields (Figures 12a and 12b) correlates strongly with their Fourier power at high spatial frequencies.
3.4 Selectivity and Sparseness of Tensorfaces
Selectivity and sparseness of neural responses to stimuli are major concerns in neural coding theory. Under our terminology, selectivity is a function of the statistical distribution of responses of a single neuron to a large set of stimuli presented sequentially (Lehky et al., 2005). Sparseness is a function of the statistical distribution of responses over a population of neurons when simultaneously presented with a single stimulus. Here, we quantify both selectivity and sparseness by calculating kurtosis of the appropriate probability distribution.
Tensorface selectivity is plotted as a function of rank in Figure 14a. Although there is a lot of variability for different tensorfaces, the median value of selectivity (kurtosis) is close to zero, independent of tensorface complexity (rank). That means that as one presents many faces to a particular tensorface, responses tend to be gaussian distributed, as gaussians have zero reduced kurtosis. Population sparseness of tensorfaces is plotted as a function of rank in Figure 14b. Although there is higher population sparseness with very low complexities, for medium and high complexities, the sparseness settles down to values of around 1.0.
Sparseness and selectivity of tensorfaces are lower than reported in monkey inferotemporal cortex (Lehky et al., 2011), with single-unit selectivity 1.88 and population sparseness 9.61 (as measured by kurtosis). One reason tensorfaces have lower sparseness values and lower selectivity values is that responses of tensorfaces do not include a threshold nonlinearity for response rates. In real neurons, response rates cannot have negative values. That causes the probability distribution of response rates to be skewed to the right, leading to higher sparseness and selectivity values. Without that threshold nonlinearity, the response probability distributions of tensorfaces are closer to gaussian and the sparseness and selectivity values are therefore smaller.
Another factor reducing model values of sparseness and selectivity is that tensorfaces are linear filters, as are all the face decompositions mentioned earlier (PCA, ICA, NMF, AAF), whereas biological face cells are nonlinear spatial filters, as is the case for inferotemporal object representations in general (Tanaka, 1996). By being nonlinear spatial filters, we mean that different portions of the receptive field sum nonlinearly to produce the total response of a neuron to an object stimulus. As the nature of the spatial nonlinearities within face cells and object cells is unknown, we cannot quantify their contributions to sparseness and selectivity. Spatial nonlinearities are discussed further below.
3.5 Tensorfaces: Local or Global Representations
We define the globality of a tensorface as the fraction of the face covered by that tensorface. Thus, local and global representations formed end points on a continuum rather than a dichotomy. Within that continuum, we observe tensorfaces that can be strongly local or strongly global (Figure 15a), and also everything in between.
To measure globality, we threshold the tensorfaces to include only the envelope (convex hull) of the areas that gave a “strong” activation, as described in section 2. Examples of such thresholded tensorfaces are shown in Figure 15b, with the high-activation region outlined in a black line. Only the high-activation region was used in calculations of globality.
Globality as a function of tensorface complexity is plotted in Figure 15c. More complex tensorfaces have greater globality on average, although there is large variability in globality across a tensorface population. This relationship breaks down at the lowest values of rank. The discrepancy at low ranks appears to be an artifact of the methodology we are using to calculate globality. Some low-rank tensorfaces form bilaterally symmetric pairs of blobs at the left and right edges of the face. When those two widely separated blobs are enclosed in an envelope to define the high-activation region of the tensorface that inflates the area covered by the tensorface, that increases the globality measure as we calculate it.
3.6 Generalization to Statistically Novel Categories of Faces
When previously examining face reconstructions based on tensorfaces (see Figures 10 and 11), we used a set of test faces (see Figure 3b) that closely resembled the original sample set (see Figure 3a). Although the two sets contained different individuals, both are drawn from the same statistical distribution of face parameters in the face-generating software and thus are statistically nonnovel. As used here, the statistically nonnovel/statistically novel distinction is based purely on the physical characteristics of faces and not cognitive and semantic factors involved.
In Figure 16, we examine what happens when we reconstruct statistically novel faces that are radically different from those used to create the tensorfaces. Yoda (see Figure 16ai) is a face we can instantly perceive without a period of training, yet it is unlikely from an evolutionary perspective that we would have developed face cells specifically tuned to handle that stimulus. Presented here are reconstructions of Yoda using tensorface populations of different complexities that were created using human faces. The quality of all the tensorface reconstructions is poor, as nothing resembling this stimulus was part of the face sample set used to create the tensorfaces. However, subjectively, it looks as if the low-complexity reconstruction is better than the high-complexity one. The high-complexity reconstruction appears to be overconstrained to resemble the faces in the training set.
Measuring reconstruction error, we can see the reconstruction error of Yoda does indeed get worse as tensorface complexity increases (see Figure 16aii, dashed line). That trend is the opposite of what we saw for the reconstruction of the test face set, where reconstruction error decreased with greater tensorface complexity (see Figure 16aii, solid line, repeated from Figure 11b). Figure 16bi show the reconstruction of another statistically novel face far beyond the bounds of what was included in the sample face set, the face of a chimpanzee. As with Yoda, we see in Figure 16bii that reconstruction error increases with tensorface complexity, the opposite of what occurs when reconstructing the test face set.
Although high-complexity tensorfaces produce the best reconstructions of faces that are statistically nonnovel, they have a reduced ability to generalize to faces in statistically novel faces. For statistically novel faces, the low-complexity tensorfaces produce the best reconstructions. The lower ability to generalize as tensorface complexity increases cannot be explained by changes in the selectivity of tensorfaces. We see in Figure 14a that tensorface selectivity remains constant (and low) regardless of tensorface complexity.
3.7 Representation of the Average Face
There is evidence indicating that the representation of the average face forms the origin of a high-dimensional face space (Leopold et al., 2001; Rhodes & Jeffery, 2006; Tsao & Freiwald, 2006; Wilson et al., 2002). Using multidimensional scaling (MDS) based on a Euclidean distance metric, we examined the location of the average face in a face space formed by 100 tensorfaces. This set of faces thus formed 129 points (128 sample faces plus the average face) in a 100-dimensional face space. The MDS analysis is based on tensorfaces with rank 8, with other rank values performed similarly.
The result of MDS analysis is given in Figure 17. It shows that the faces of different racial groups and genders cluster into different regions of face space. Furthermore, we see that the face space formed by the tensorfaces places the representation of the average face at the origin of the face space. Note that the representation of the average face at the origin is due to activity across a population of tensorfaces. No individual tensorface specifically represents the average face.
3.8 Cross-Stimulation of Tensorfaces by Nonface Stimuli
A complete and autonomous face processing system should reject nonface inputs. Therefore, now we look at the reconstruction of a nonface object by tensorfaces. Reconstruction of a melon is shown in Figure 18a. The reconstruction obviously fails, producing an enormous reconstruction error (see Figure 18b). Going beyond just the magnitude of the error, the organization of the errors gives the reconstructed melon the shape of a face, although with melon texture on it.
Representing faces is essentially the only thing that tensorfaces are capable of. Any object presented to that face representation system will be interpreted as a face. In contrast, the ideal representation of a nonface object produced by a specialized face representation system should be null, no response.
The spurious reconstruction of nonface objects as faces by the tensorface population occurs because response magnitudes of tensorfaces are similar to face and nonface objects (see Figure 18c). This cross-stimulation of tensorfaces by face and nonface stimuli appears to be the result of the low stimulus selectivity of tensorfaces seen in Figure 14a. As will be discussed further below, a possible solution to this cross-stimulation problem would be to have a nonlinearity associated with the linear tensorface receptive fields that would filter out nonface stimuli from being processed (see Figure 18d).
4 Discussion
Based on measures of algorithmic information (Kolmogorov complexity), we show here that low-complexity and high-complexity faces have different properties and therefore that complexity can be a way of constraining possible ways that face space is organized. Just as Shannon information has proven useful for understanding processing in early vision (Barlow, 1961; Field, 1994), we suggest that Kolmogorov complexity and related measures such as logical depth may prove useful in providing a framework for studying high-level vision, including face recognition and object recognition in general.
Cover and Thomas (2006), in their textbook on information theory, state, “We consider Kolmogorov complexity to be more fundamental than Shannon entropy.” Kolmogorov complexity is associated with concepts from computational theory (Turing machines), while Shannon entropy is a statistical theory not derived from computational theory. Both Shannon entropy and Kolmogorov complexity can be used as measures of efficient coding, indicating how compressed a representation can be. In addition to considering compression from the statistical perspective of Shannon entropy (Barlow, 1961; Field, 1994; Olshausen & Field, 2004) it can also be considered from the algorithmic perspective of Kolmogorov complexity (Adriaans, 2007; Chater & Vitányi, 2003; Feldman, 2016). A critical difference between the two types of information is that Shannon entropy is defined probabilistically in terms of a distribution over an ensemble of symbols, without any connection to the structure of individual messages, while Kolmogorov complexity is a deterministic concept measuring information of a single entity (message) by itself in isolation (Grünwald & Vitányi, 2008a). While assigning probabilities to repetitive low-level structures (e.g., V1 Gabor receptive fields) is clearly reasonable within the framework of Shannon entropy, assigning such probabilities to high-level structures that are essentially unique (e.g., inferotemporal receptive fields) may be problematic (Chater & Vitányi, 2003). As a nonprobabilistic computation, Kolmogorov complexity can assign a measure of information content to an individual high-level structure purely in terms of its internal structure.
Receptive field complexity appears to increase as one ascends through the hierarchy of visual cortical areas. Although this impression is not yet confirmed through neurophysiological measurements of Kolmogorov complexity, the ventral stream deep learning model of Güçlü and van Gerven (2015) reinforces this perception of increased complexity, as it shows a monotonic increase in Kolmogorov complexity as a function of the layer in the network. While Kolmogorov complexity has not been measured experimentally, sparseness has been. It is well established that visual representations are sparse (Dan, Atick, & Reid, 1996; Lehky et al., 2011, 2005; Lehky & Sereno, 2007; Pitkow & Meister, 2012; Rolls & Tovée, 1995; Vinje & Gallant, 2000; Willmore, Mazer, & Gallant, 2011). What is not well established is the gradient of how sparseness changes across different cortical areas, as such data are limited. Kolmogorov complexity and sparseness need not necessarily be correlated. Just because receptive field organization may appear highly complex does not mean that there must be a correspondingly high level of sparseness (see Figure 14). Indeed, a comparison of a low-level visual area (V1) (see Lehky et al., 2005) and a high-level visual area (anterior inferotemporal cortex) (see Lehky et al., 2011) shows only a very modest increase in sparseness (median kurtosis going from 0.84 to 1.88, measuring lifetime sparseness or what we call cell selectivity). The data of Willmore et al. (2011) indicate sparseness stays essentially the same going from V1 to V4. As Willmore et al. (2011) conclude, the data suggest that “maximization of lifetime sparseness is not a principle that determines the organization of visual cortex.” In contrast, there appears to be a steady and substantial increase in receptive field complexity across the cortical areas of the ventral visual hierarchy. In view of that, Kolmogorov complexity may be a more interesting parameter than sparseness in high-level visual processing.
Given these preliminary remarks on the general significance of using Kolmogorov complexity for characterizing visual receptive fields, we now turn to face processing specifically. The different complexities of tensorfaces we examine here demonstrate a range of possibilities that biological face cells could have. In particular, low- and medium-complexity face cells form feasible representations in addition to very high-complexity representations, such as those formed by PCA eigenfaces or variants thereof (e.g., active appearance models; Chang & Tsao, 2017). Such high-complexity face representations have in the past been suggested as forming the basis of biological face space. The actual complexity of biological face cells remains a question for future experimental studies.
We observe a trade-off between receptive field complexity and the population size necessary to reach a criterion error in reconstructing faces (see Figure 11a). A large population of low-complexity tensorfaces is equivalent to a smaller population of high-complexity tensorfaces. This trade-off can be observed for receptive fields in earlier cortical areas. For example, large populations of low-complexity Gabor functions in striate cortex can also accurately represent faces, and face identification can be performed using Gabor-based face representations without any face cells (Wiskott, Krüger, Kuiger, & von der Malsburg, 1997).
From the perspective of information contained in a population of tensorfaces as indicated by reconstruction error, there does not seem to be a benefit to using high-complexity face cells. Reconstruction error as a function of tensorface complexity does not decrease moving from medium- to high-complexity tensorfaces (see Figure 11b). Moreover, high-complexity face cells incur high computational costs to create, measured as Kolmogorov complexity or logical depth (see Figure 12). Low-complexity cells are inefficient in that they require larger population sizes to reach a criterion reconstruction error. The sweet spot for face representations may be at intermediate complexity, perhaps at about rank 8. Nevertheless, low-complexity face cells may balance their representational inefficiency with their increased ability to generalize to statistically novel faces (see Figure 16). Thus, there may be an advantage to having a mixture of low- to medium-complexity face cells but not high-complexity face cells such as produced by PCA. Furthermore, not all face cells in a population need to have the same level of complexity. That is another empirical question for future experimental work.
What might be the advantage of increased complexity of face representations in higher visual cortical areas (i.e., the creation of face cells)? The smaller population sizes allowed by more complex receptive fields means that face spaces with lower dimensionalities can be created (see Lehky, Kiani, Esteky, & Tanaka, 2014, for a discussion of dimensionality). In other words, creating face representations with more complex receptive fields may be a dimensionality-reduction technique. Lower-dimensional face spaces may make it easier to categorize faces (Plastria, De Bruyne, & Carrizosa, 2008). However, the benefits for creating more efficient face spaces using more complex receptive fields must be balanced with computational costs of the increased complexity as measured by Kolmogorov complexity and logical depth of receptive field spatial structure.
There is a high correlation between computational complexity (see Figure 12) and spectral power at high spatial frequencies (see Figure 13b). The link between computational complexity and spatial frequency provides additional motivation to characterize spatial frequency properties of face cells, expanding on current physiological (Inagaki & Fujita, 2011; Rolls, Baylis, & Hasselmo, 1987) and psychophysical studies (Costen, Parker, & Craw, 1996; Gaspar, Sekuler, & Bennett, 2008; Näsänen, 1999). Nevertheless, tensorfaces with different complexities are not simply Fourier amplitude filtered versions of each other but have substantial differences in appearance (phase spectra). The spatial frequency content of a facial representation is not sufficient to completely characterize its complexity.
We worked with colored faces rather than the monochromatic faces as used in most studies of face coding. Color can be an important aspect of face identification (Nestor, Plaut, & Behrmann, 2013; Tanaka, Weiskopf, & Williams, 2001), particularly if the shape information is degraded or ambiguous (Choi, Ro, & Plataniotis, 2009; Yip & Sinha, 2002). We include joint shape and color sensitivity in the tensorfaces developed here. Responsiveness to both shape and color is found in the same face cells in the inferotemporal cortex of monkeys as measured neurophysiologically (Edwards, Xiao, Keysers, Földiák, & Perrett, 2003). However, there is also fMRI evidence for separate, parallel channels coding face shape and color (Lafer-Sousa & Conway, 2013; Lafer-Sousa, Conway, & Kanwisher, 2016).
A significant question is whether the representation of faces is global (holistic) or local (parts based) (Behrmann, Richler, Avidan, & Kimchi, 2015; Maurer, Grand, & Mondloch, 2002; Piepers & Robbins, 2012; Richler, Palmeri, & Gauthier, 2012; Tanaka & Simonyi, 2016). We examined this issue by measuring a globality index for tensorfaces, defined as the average fraction of the face covered by the tensorfaces. Tensorfaces across a population exhibit a great deal of variability in their globality. Some tensorfaces are local, and others are strongly global. On average, high-complexity tensorfaces are more global than low-complexity ones (see Figure 15). Typically, a tensorface covers a sizable fraction of a face but not the entire face.
This variability in globality is consistent with both psychophysical (Tanaka & Simonyi, 2016) and neurophysiological reports (Freiwald et al., 2009), which conclude that face processing involves both global and parts-based processing. We have previously proposed such mixed and intermediate globality for inferotemporal object representations in general, not just faces (Lehky & Tanaka, 2016), based on data from monkey neurophysiology showing sensitivity of neurons to a partial set of features but generally not the entire object (Fujita, Tanaka, Ito, & Cheng, 1992; Ito, Fujita, Tamura, & Tanaka, 1994; Ito, Tamura, Fujita, & Tanaka, 1995; Kobatake & Tanaka, 1994; Tanaka, Saito, Fukada, & Moriya, 1991; Yamane, Tsunoda, Matsumoto, Phillips, & Tanifuji, 2006).
Recently Chang and Tsao (2017) reported that biological face space corresponds to one specific linear space that they have discovered. However, we believe that the linear face space they report is not uniquely defined under their mathematical data analyses. Rather, a variety of different face spaces are consistent with their data.
Approximate linear transforms (i.e., multiple linear regression) can be fit between face coefficients for various linear decompositions (e.g., PCA, ICA, NMF, our version of tensorfaces). Fits between the different linear face decompositions will be good provided each is capable of doing acceptable reconstructions of faces (e.g., under some psychophysical criterion for reconstruction error). If the neurophysiological data provide a good fit to face components from one linear decomposition, such as active appearance model (AAT) of Chang and Tsao (2017), then the data will also provide good fits to other linear face decompositions. Chang and Tsao (2017) have studied one predetermined linear face decomposition, and since it happened to meet their criterion of goodness of fit, they did not continue to examine other possible decompositions.
For example, we investigated the transform between PCA coefficients (PCAcoeff) and tensor coefficients (tensorCoeff) for two linear face decompositions. This transform is given by PCAcoeff′ tensorCoeffb, where PCAcoeff and tensorCoeff are matrices of coefficients for a set of faces, one face per column, and PCAcoeff′ are estimated PCA coefficients. The coefficient is given by pinv(tensorCoeff) PCAcoeff, where pinv is the Moore-Penrose pseudoinverse operator performing a multiple linear regression, and tensorcoeff has been augmented by a column of ones to include offsets. There were 128 faces as input, the tensor decomposition had 100 components, and we modeled the first 50 PCA components. We fit the model leaving one face left out for testing, repeating with a different face being left out. The results show that when comparing actual PCAcoeff and estimated PCAcoeff′, the model accounts for a 0.985 fraction of the variance. This shows that it is possible to predict PCAcoeff from tensorCoeff with high accuracy. Therefore, the two linear face decompositions would each provide essentially the same fit to the neurophysiological data. The interchangeability between different linear face spaces means that if one wants to select a face model, it would have to be constrained based on some criterion other than overall goodness of fit of data to one single linear model in isolation, but perhaps based instead on comprehensive experimental characterizations of receptive fields of individual face cells across the population.
Furthermore, experimental face stimulus sets are limited in that they cover only a limited range of the faces that are possible. It is conceivable that observed face spaces such as Chang and Tsao (2017) are approximately linear at a local scale but that a more complete sampling of faces will reveal a nonlinear face space at a broader scale. Even color space at high-level visual cortex is a complicated, nonlinear space (Bohon, Hermann, Hansen, & Conway, 2016; Komatsu, Ideura, Kaji, & Yamane, 1992; Lehky & Sejnowski, 1999), and there is no reason to expect that face space is not similarly complicated and nonlinear.
Underlying a possible nonlinear face space would be face cells themselves that act individually as spatially nonlinear filters. Reports of inferotemporal processing indicate that object representations, and in particular face representations, involve nonlinear spatial filters, as mentioned earlier (Owaki et al., 2018; Tanaka, 1996; Yamane et al., 2006). Those nonlinear spatial interactions are why we cannot map face cell receptive fields with a simple stimulus spot as we do in striate cortex. The spurious reconstruction of nonface objects using linear components such as tensorfaces (see Figure 18) and eigenfaces (see Figure 2b in Tsao & Livingstone, 2008) also indicates a requirement to introduce some sort of nonlinearity in face cells.
Linear models of biological facial representations, including the particular implementation of tensorfaces used here, can reveal some significant aspects of face processing and thus can be useful in theoretical discussions as long as the biological limitations of those models are kept firmly in sight. However, without nonlinearity, they cannot be considered complete solutions. Nonlinearity is a central stumbling block in understanding biological face processing and object processing generally. One approach to introducing nonlinearity into face representations is illustrated by the nonlinear tensor modeling of Vasilescu and Terzopoulos (2011). However, there are multitudes of other possibilities, including the development of Kolmogorov-complexity constrained deep learning networks.
Overall, the results here suggest that spatial complexity of face cells is likely to be a significant factor, among others, in characterizing face space. Defining the complexity of face representations may contribute to a more complete framework for guiding future research.
Acknowledgments
This research was supported by a grant to K.T. from the Strategic Research Program for Brain Sciences of the Japan Agency for Medical Research and Development. It was also supported by grants to A.C. and A-H.P. from the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001) and to A.C. from the Polish National Science Center (grant 2016/20/W/N24/00354). We thank Topi Tanskanen for comments on the manuscript.