## Abstract

Hierarchical sparse coding (HSC) is a powerful model to efficiently represent multidimensional, structured data such as images. The simplest solution to solve this computationally hard problem is to decompose it into independent layer-wise subproblems. However, neuroscientific evidence would suggest interconnecting these subproblems as in predictive coding (PC) theory, which adds top-down connections between consecutive layers. In this study, we introduce a new model, 2-layer sparse predictive coding (2L-SPC), to assess the impact of this interlayer feedback connection. In particular, the 2L-SPC is compared with a hierarchical Lasso (Hi-La) network made out of a sequence of independent Lasso layers. The 2L-SPC and a 2-layer Hi-La network are trained on four different databases and with different sparsity parameters on each layer. First, we show that the overall prediction error generated by 2L-SPC is lower thanks to the feedback mechanism as it transfers prediction error between layers. Second, we demonstrate that the inference stage of the 2L-SPC is faster to converge and generates a refined representation in the second layer compared to the Hi-La model. Third, we show that the 2L-SPC top-down connection accelerates the learning process of the HSC problem. Finally, the analysis of the emerging dictionaries shows that the 2L-SPC features are more generic and present a larger spatial extension.

## 1 Introduction

Sparse coding (SC) has proven to be one of the most successful methods to find an efficient representation for sensory signals such as natural images. It rests on the idea that signals (e.g., images) can be encoded as a linear combination of a few features drawn from a larger set (Elad, 2010). The set of features (also called atoms) is called the dictionary, and SC is thus an inverse problem of prominent importance to the machine learning community, as it is complex to solve when the dictionary is unknown and as the dimensionality of the signals increases. The pursuit of optimal coding is usually decomposed into two complementary subproblems: inference (coding) and dictionary learning. Inference involves finding an accurate sparse representation of the input data considering the dictionaries are fixed; it can be performed using algorithms like ISTA and FISTA (Beck & Teboulle, 2009), Matching Pursuit (Mallat & Zhang, 1993), Coordinate Descent (Li & Osher, 2009), or ADMM (Heide, Heidrich, & Wetzstein, 2015). Once the representation is inferred, one can learn the atoms from the data using methods like gradient descent (Kreutz-Delgado et al., 2003; Rubinstein, Bruckstein, & Elad, 2010; Sulam, Papyan, Romano, & Elad, 2018) or online dictionary learning (Mairal, Bach, Ponce, & Sapiro, 2009). Consequently, SC offers an unsupervised framework to learn simultaneously the dictionary and the corresponding input representation. SC has been applied with success to image restoration (Mairal, Bach, Ponce, Sapiro, & Zisserman, 2009), feature extraction (Szlam, Kavukcuoglu, & LeCun, 2010), and classification (Perrinet & Bednar, 2015; Yang, Zhang, Yang, & Zhang, 2011).
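For concreteness, the inference subproblem mentioned above can be illustrated with a minimal NumPy sketch of ISTA (Beck & Teboulle, 2009) for a single flat (non-convolutional) layer; the function names and the plain matrix dictionary are simplifying assumptions of ours, not the convolutional implementation used in this letter:

```python
import numpy as np

def soft_threshold(x, t):
    # Elementwise soft-thresholding: the proximal operator of the l1 norm
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(x, D, lam, n_iter=100):
    # Minimize 0.5 * ||x - D g||^2 + lam * ||g||_1 for a fixed dictionary D
    # Step size = 1 / (largest eigenvalue of D^T D), the classical Lipschitz bound
    step = 1.0 / np.linalg.eigvalsh(D.T @ D).max()
    g = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = soft_threshold(g + step * (D.T @ (x - D @ g)), lam * step)
    return g
```

Each iteration takes a gradient step on the quadratic term and then applies soft-thresholding, which is what makes the resulting code sparse.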

Interestingly, SC is also a field of interest for computational neuroscience. Olshausen & Field (1997) first demonstrated that adding a sparse prior to a shallow neural network was sufficient to account for the emergence of neurons whose receptive fields (RFs) are spatially localized, bandpass and oriented filters, analogous to those found in the primary visual cortex (V1) of mammals (Hubel & Wiesel, 1962). Because most SC algorithms are limited to single-layer networks, this method cannot easily be extended to model the hierarchical structure of the cortical areas constituting the visual pathways. Still, some solutions have been proposed to tackle this problem of hierarchical sparse coding (HSC) as a global optimization problem (Aberdam, Sulam, & Elad, 2019; Makhzani & Frey, 2013, 2015; Sulam, Aberdam, Beck, & Elad, 2019; Sulam et al., 2018; Zeiler, Taylor, & Fergus, 2011). However, these methods are looking for optimal solutions of HSC without bearing in mind their plausibility in terms of neuronal implementation. Consequently, the quest for an efficient HSC formulation that is compatible with such a neural implementation remains open.

Rao and Ballard (1999) introduced the predictive coding (PC) framework to model the effect of the interaction of two cortical areas in the visual cortex. PC intends to solve the inverse problem of vision by combining bottom-up (feedforward) and top-down (feedback) activities. In PC, feedback connections carry a prediction of the neural activity of the afferent lower cortical area, while the feedforward connection carries a prediction error to the next higher cortical area. In such a framework, the activity of the neural population is updated to minimize the unexpected component of the neural signal (Friston, 2010). PC has been applied to supervised object recognition (Wen et al., 2018; Han et al., 2018; Spratling, 2017) and unsupervised prediction of future video frames (Lotter, Kreiman, & Cox, 2016). Interestingly, it is flexible enough to allow the introduction of a sparse prior within each layer. Therefore, one might consider PC as a bio-plausible formulation of the HSC problem.

Interestingly, when recast into an HSC problem, SC and PC could be used to model different types of computation in the brain. On the one hand, SC might be considered an intralayer computational mechanism that exacerbates competition between neurons by selecting only the strongly activated ones. This mechanism, called explaining away, could be used to model intracortical recurrent connectivity. On the other hand, PC describes the flow of information between consecutive layers and could be used to model intercortical feedback connections. To the best of our knowledge, no study has revealed the advantages of interconnecting SC layers using the PC principle. What is the effect of the top-down connection of PC? What are the consequences in terms of computations and convergence? What are the qualitative differences concerning the learned atoms and representations?

The objective of this study is to experimentally answer these questions and show that the PC framework could be successfully used for improving the solutions to HSC problems in a 2-layer network. The letter is organized as follows. We start our study by defining the two mathematical formulations to solve the HSC problem: the hierarchical Lasso (Hi-La), which consists of stacking two independent Lasso subproblems, and the 2-layer sparse predictive coding (2L-SPC), which leverages PC into a deep and sparse network of bidirectionally connected layers. To experimentally compare both models, we train the 2L-SPC and Hi-La networks on four databases. First, we vary the sparsity of each layer and compare the overall prediction error for the two models, analyzing it layer-by-layer to understand their respective roles. Second, we analyze the number of inference iterations needed for the state variables of each network to reach their stability. Third, we study the evolution of the representations generated by the Hi-La and the 2L-SPC during their inference process. Fourth, we compare the convergence of both models during the dictionary learning stage. Finally, we discuss the differences between the features that both networks learned in light of their activation probability and their spatial extension.

## 2 Methods

In the following mathematical description, italic letters (e.g., $\alpha$) are used as symbols for scalars, bold lowercase letters (e.g., $\boldsymbol{x}$) for column vectors, and bold uppercase letters (e.g., $\boldsymbol{D}$) for matrices; $\nabla_{x} L$ denotes the gradient of a function $L$ with respect to $x$.

### 2.1 Background

### 2.2 From Hierarchical Lasso …

### 2.3 … to Hierarchical Predictive Coding

### 2.4 Coding Stopping Criterion and Unsupervised Learning

### 2.5 Link between the Hi-La, the 2L-SPC, and CNNs

## 3 Experimental Settings: Data Sets and Parameters

We use four databases to train and test both networks.

**STL-10**. The STL-10 database (Coates, Ng, & Lee, 2011) has 100,000 colored images of size $96\times 96$ pixels (px) representing 10 classes of objects (e.g., airplane, bird). STL-10 presents a high diversity of object viewpoints and backgrounds. This set is partitioned into a training set composed of 90,000 images and a testing set of 10,000 images.

**CFD**. The Chicago Face Database (CFD) (Ma, Correll, & Wittenbrink, 2015) consists of 1804 high-resolution ($2444\times 1718$ px), color, standardized photographs of male and female faces of varying ethnicity between the ages of 18 and 40 years. We resized the pictures to $170\times 120$ px to keep the computational time reasonable. The CFD database is partitioned into batches of 10 images. This data set is split into a training set composed of 721 images and a testing set of 486 images.

**MNIST**. MNIST (LeCun, 1998) is composed of 70,000 grayscale images of size $28\times 28$ px representing handwritten digits. We decomposed this data set into batches of 32 images, split into a training set composed of 60,000 digits and a testing set of 10,000 digits.

**AT&T**. The AT&T database (AT&T, 1994) has 400 grayscale images of size $92\times 112$ px representing faces of 40 distinct persons with different lighting conditions, facial expressions, and details. This set is partitioned into batches of 20 images. The training set is composed of 330 images (33 subjects), and the testing set is composed of 70 images (7 subjects).

All of these databases are preprocessed using local contrast normalization (LCN) first and then whitening (see Figure 17 for sample examples on all databases). LCN is inspired by neuroscience and consists of a local subtractive and divisive normalization (Jarrett, Kavukcuoglu, & LeCun, 2009). In addition, we use whitening to reduce dependency between pixels (Olshausen & Field, 1997).
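The subtractive and divisive steps of LCN can be sketched as follows. This is an illustration only, assuming a Gaussian neighborhood (the `sigma` width and the `eps` floor are our choices, not parameters stated in the text), and it relies on SciPy's `gaussian_filter`:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalization(img, sigma=2.0, eps=1e-6):
    # Subtractive step: remove the local (Gaussian-weighted) mean
    local_mean = gaussian_filter(img, sigma)
    centered = img - local_mean
    # Divisive step: divide by the local standard deviation of the centered image
    local_std = np.sqrt(gaussian_filter(centered ** 2, sigma))
    return centered / np.maximum(local_std, eps)
```

The result has roughly zero local mean and unit local variance, which equalizes contrast across image regions before whitening is applied.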

To draw a fair comparison between the 2L-SPC and Hi-La models, we train both models using the same set of hyperparameters. We summarize these parameters in Table 1 for the STL-10, MNIST, and CFD databases and in section A.3 for the AT&T database. Note that the parameter $\eta_{c_i}$ is omitted in the table because it is computed as the inverse of the largest eigenvalue of $D_i^T D_i$ (Beck & Teboulle, 2009). To learn the dictionary $D_i$, we use stochastic gradient descent on the training set only, with a learning rate $\eta_{L_i}$ and a momentum equal to 0.9. In this study, we consider only 2-layered networks, and we vary the sparsity parameters of each layer ($\lambda_1$ and $\lambda_2$) to assess their effect on both the 2L-SPC and the Hi-La networks.
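The step size $\eta_{c_i}$ described above, the inverse of the largest eigenvalue of $D_i^T D_i$, can be estimated without a full eigendecomposition. A hedged sketch using power iteration (our choice of estimator, not necessarily the authors' implementation):

```python
import numpy as np

def largest_eigenvalue(D, n_iter=100):
    # Power iteration on D^T D: returns an estimate of its largest eigenvalue.
    # The layer's step size eta_c is then the inverse of this value.
    v = np.random.default_rng(0).standard_normal(D.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = D.T @ (D @ v)
        v = w / np.linalg.norm(w)
    # Rayleigh quotient of the (unit-norm) converged vector
    return float(v @ (D.T @ (D @ v)))
```

This avoids ever forming $D^T D$ explicitly, which matters for convolutional dictionaries where $D$ is only available as an operator.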

| | | STL-10 | CFD | MNIST |
|---|---|---|---|---|
| Network parameters | $D_1$ size | [64, 1, 8, 8] (2) | [64, 3, 9, 9] (3) | [32, 1, 5, 5] (2) |
| | $D_2$ size | [128, 64, 8, 8] (1) | [128, 64, 9, 9] (1) | [64, 32, 5, 5] (1) |
| | $T_{\text{stab}}$ | 1e-4 | 5e-3 | 5e-4 |
| Training parameters | Number of epochs | 10 | 250 | 100 |
| | $\eta_{L_1}$ | 1e-4 | 1e-4 | 5e-2 |
| | $\eta_{L_2}$ | 5e-3 | 5e-3 | 1e-3 |
| Simulation parameters | $\lambda_1$ range | [$0.2:0.6::0.1, 1.6$] | [$0.3:0.7::0.1, 1.8$] | [$0.1:0.3::0.05, 0.3$] |
| | $\lambda_2$ range | [$0.4, 1.4:1.8::0.1$] | [$0.5, 1:1.8::0.2$] | [$0.2, 0.2:0.4::0.05$] |


Note: The sizes of the convolutional kernels are shown in the format [# features, # channels, width, height] (stride). To describe the range of explored parameters during simulations, we use the format [$0.3:0.7::0.1, 0.5$], which means that we vary $\lambda_1$ from 0.3 to 0.7 by steps of 0.1 while $\lambda_2$ is fixed to 0.5.

## 4 Results

For cross-validation, we ran all the simulations presented in this section seven times, each time with a different random seed for the initialization of the dictionary. We define the central tendency of our curves by the median of the runs and its variation by the median absolute deviation (MAD) (Pham-Gia & Hung, 2001). We prefer this measure to the classical mean $\pm$ standard deviation because a few measures did not exhibit a normal distribution. All presented curves are obtained on the testing set.
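These robust statistics are straightforward to compute; a minimal NumPy sketch (the function name is ours) that returns the median across runs and the MAD around it:

```python
import numpy as np

def median_and_mad(runs):
    # Central tendency: median across runs (axis 0 = run index).
    # Dispersion: median absolute deviation around that median.
    runs = np.asarray(runs, dtype=float)
    med = np.median(runs, axis=0)
    mad = np.median(np.abs(runs - med), axis=0)
    return med, mad
```

Unlike the mean and standard deviation, both quantities are insensitive to a single outlier run, which is why they suit the non-normal measures mentioned above.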

### 4.1 2L-SPC Converges to a Lower Prediction Error

As a first analysis, we report the cost $F(D_i, \gamma_i)$ (see equation 2.2) for each layer and for both networks. To refine our analysis, we decompose this cost, for each layer, into a quadratic cost (i.e., the $\ell_2$ term in $F$, that is, $G$) and a sparsity cost (i.e., the $\ell_1$ term in $F$), and we monitor these quantities when varying the first- and second-layer sparse penalties (see Figure 2). For scaling reasons and because the error bars are small, we cannot display them in Figure 2; we thus include them in Figure 8. For all the simulations shown in Figure 2, we observe that the total cost (i.e., $F(D_1, \gamma_1) + F(D_2, \gamma_2)$) is lower for the 2L-SPC than for the Hi-La model. As expected, in both models, the total cost increases when we increase $\lambda_1$ or $\lambda_2$.

For all databases, Figure 2 shows that the feedback connection of the 2L-SPC tends to increase the first-layer quadratic cost. For example, when $\lambda_1$ is increased, the average variation of the first-layer quadratic cost of the 2L-SPC compared to that of the Hi-La is $+$126% for STL-10, $+$110% for CFD, $+$100% for MNIST, and $+$73% for AT&T. On the contrary, the second-layer quadratic cost decreases strongly when the feedback connection is activated. In particular, when $\lambda_1$ is increased, the average variation of the second-layer quadratic cost of the 2L-SPC compared to that of the Hi-La is $-$57% for STL-10, $-$58% for CFD, $-$61% for MNIST, and $-$44% for AT&T. These observations also hold when the second-layer sparse penalty is increased. This is expected: while the Hi-La first layer is fully specialized in minimizing the quadratic cost with the lower level, the 2L-SPC finds a trade-off between lower- and higher-level quadratic costs.

In addition, when $\lambda_1$ is increased, the Hi-La first-layer quadratic cost increases faster ($+$337% for STL-10, $+$152% for CFD, $+$214% for MNIST, and $+$260% for AT&T) than the 2L-SPC first-layer quadratic cost ($+$117% for STL-10, $+$83% for CFD, $+$148% for MNIST, and $+$147% for AT&T). This phenomenon is amplified if we consider the evolution of the first-layer sparsity cost when increasing $\lambda_1$. The first-layer sparsity cost of the Hi-La exhibits a stronger increase ($+$108% for STL-10, $+$99% for CFD, $+$150% for MNIST, and $+$60% for AT&T) than that of the 2L-SPC ($+$92% for STL-10, $+$94% for CFD, $+$112% for MNIST, and $+$47% for AT&T). This suggests that the extra penalty induced by the increase of $\lambda_1$ is better mitigated by the 2L-SPC.

When $\lambda_2$ is increased, the quadratic cost of the first layer of the Hi-La model is almost stable ($+1\%$ for STL-10, $0\%$ for CFD, $+2\%$ for MNIST, and $0\%$ for AT&T), whereas the 2L-SPC first-layer $\ell_2$ cost increases ($+6\%$ for STL-10, $+16\%$ for CFD, $+20\%$ for MNIST, and $+8\%$ for AT&T). The explanation here is straightforward: while the first layer of the 2L-SPC includes the influence of the upper layer, the Hi-La has no such mechanism. It suggests that the feedback connection of the 2L-SPC transfers into the first-layer quadratic cost a part of the extra penalty coming from the increase of $\lambda_2$.

Figures 3i and 3ii show the mapping of the total cost when we vary the sparsity of each layer for the 2L-SPC and Hi-La, respectively. These heat maps confirm what has been observed in Figure 2 and extend it to a larger range of sparsity values: both models are more sensitive to a variation of $\lambda_1$ than to a change in $\lambda_2$. Figure 3iii is a heat map of the relative difference between the 2L-SPC and the Hi-La total cost. It shows that the minimum relative difference between 2L-SPC and Hi-La (10.6%) is reached when $\lambda_1$ is maximal and $\lambda_2$ is minimal, and the maximum relative difference (19.9%) is reached when both $\lambda_1$ and $\lambda_2$ are minimal. It suggests that the previously observed mitigation mechanism, originated by the feedback connection, is more efficient when the sparsity of the first layer is lower.

All these observations point in the same direction: the 2L-SPC framework mitigates the total cost through a better distribution of the cost among layers. This mechanism is even more pronounced when the sparsity of the first layer is lower. Surprisingly, while the feedback connection of the 2L-SPC imposes more constraints on the state variables, it also generates a lower total cost.

### 4.2 2L-SPC Has a Faster Inference Process

One may wonder if this better convergence is not achieved at the cost of a slower inference process. To address this concern, we report for both models the number of iterations the inference process needs to converge toward a stable state on the testing set. Figure 4 shows the evolution of this quantity for the STL-10, CFD, and MNIST databases (see section A.5 for the AT&T database) when varying both layers' sparsity. For all the simulations, results demonstrate that the 2L-SPC needs fewer iterations than the Hi-La model to converge toward a stable state. We also observe that data dispersion is in general more pronounced for the Hi-La model. In addition to converging to a lower cost, the 2L-SPC thus also requires fewer inference iterations to reach a stable state.

### 4.3 2L-SPC Refines the Second Layer

We now study the evolution of the quadratic term of the prediction error during the inference process. As a representative example, we report these errors for both layers and both models when they are trained on the STL-10 database with $\lambda_1 = 0.4$ and $\lambda_2 = 1.4$ (see Figure 5a). We distinguish three states, denoted A, B, and C, corresponding to the inference at iterations 1, 7, and 20, respectively. At state A, the inference process has just started, and the first-layer prediction error is very high in contrast to the second-layer prediction error. Around state B and for both models, all layers' prediction errors are getting closer to each other. At state C, even if the inference process has not yet fully converged, the evolution of the layers' prediction errors gets smoother until a final error is reached (at iteration 155 for the Hi-La and 75 for the 2L-SPC).

We now back-project all the latent variables into the input space to assess qualitatively how both models and layers represent the input image (see Figure 12 for more details on the back-projection mechanism). These back-projections are shown in the first two lines of Figure 5b. We observe that the reconstructions of the first-layer latent variables are highly similar for both the Hi-La and the 2L-SPC models. Note also that the quantitative difference between the first-layer prediction error of the Hi-La and the 2L-SPC at state C is not perceptible in the corresponding reconstructions. In state A, both models' first layers have already perceived all the details of the input. It suggests that feedforward networks are sufficient to provide accurate first-layer reconstruction (see section 2.5). In contrast, both models' second-layer reconstructions are very rough in state A and get refined along the inference process. In states B and C, we observe that the 2L-SPC second-layer reconstructions exhibit more details than the corresponding Hi-La reconstructions. In particular, the 2L-SPC tends to better outline the contours of the object.

These observations suggest that the 2L-SPC is beneficial for the second layer to better represent both fine textural details and contrasted contours.

### 4.4 2L-SPC Learns Faster

Figure 6 shows the evolution of the total cost during the dictionary learning stage and evaluated on the testing set (see section 5 for the AT&T database). For all databases, the 2L-SPC model reaches its minimal total cost before the Hi-La model. The convergence rate of both models is comparable, but the 2L-SPC has a much lower cost in the very first epochs. The interlayer feedback connection of the 2L-SPC pushes the network toward lower prediction errors from the very beginning of the learning.

### 4.5 2L-SPC Features Are Larger and More Generic

Another way to grasp the impact of the interlayer feedback connection is to visualize its effect on the dictionaries. To generate a human-readable visualization of the learned dictionaries, we back-project them into the image space using a cascade of transposed convolutions (see Figure 12). Using the analogy with neuroscience, these back-projections are called receptive fields (RFs). Figure 7 shows some of the RFs for the two layers and the second-layer activation probability histogram for both models when they are trained on the CFD database. In general, first-layer RFs are oriented Gabor-like filters, and second-layer RFs are more specific and represent more abstract concepts (e.g., curvatures, eyes, mouth, nose). In some extreme cases, RFs in the second layer of the Hi-La seem to overfit some specific faces and do not encompass all generality in the concept of a face. The red-framed RFs highlight one of these typical cases: the corresponding activation probabilities are $0.25\%$ and $0.92\%$ for the Hi-La and 2L-SPC, respectively. This overfitting of features is supported by the lower activation probability of the second layer's atoms of the Hi-La compared to that of the 2L-SPC ($0.16\%$ versus $0.30\%$). This phenomenon is even more striking when we sort all the features by activation probabilities in descending order (see Figure 14). We filter out the highest activation probability (corresponding to the low-frequency filters highlighted by the black squares) of both the Hi-La and 2L-SPC for scaling reasons. All the filters are displayed in Figures 13 through 16 for the STL-10, CFD, MNIST, and AT&T RFs, respectively. The atoms' activation probability confirms the qualitative analysis of the RFs: the features learned by the 2L-SPC are more generic and informative as they describe a wider range of images.

We also observe that the second-layer RFs present longer curvatures and oriented lines in the 2L-SPC than in the Hi-La model. We quantitatively confirm this statement by computing the mean surface coverage (MSC) of the first- and second-layer features (see Table 2). To perform such an analysis, we first filter out the low-frequency and face-specific features (e.g., eyes, nose) to keep only the oriented lines and the curvatures. Next, we normalize the dictionaries of both models so that the maximum pixel intensity is equal to one. We then binarize each atom using its standard deviation as a threshold, so the only remaining pixels are those whose intensity is higher than the standard deviation of the atom. Finally, we compute the MSC by summing the number of active pixels and dividing it by the total surface of the RF. For the four databases, both the first- and the second-layer dictionaries of the 2L-SPC cover a larger space in the RFs (see Table 2). In particular, the first-layer dictionary of the 2L-SPC has an extra MSC varying from $+$11% to $+$21% compared to that of the Hi-La. The increase of spatial extension is even more pronounced in the case of the second-layer features: the 2L-SPC second-layer atoms have an extra MSC ranging from $+$43% to $+$56% compared to those of the Hi-La.

| Database | Hi-La (Layer 1) | 2L-SPC (Layer 1) | Variation | Hi-La (Layer 2) | 2L-SPC (Layer 2) | Variation |
|---|---|---|---|---|---|---|
| CFD | 8.4% | 9.4% | +12% | 3.2% | 4.8% | +50% |
| STL-10 | 7.9% | 9.5% | +20% | 2.5% | 3.9% | +56% |
| MNIST | 12.1% | 14.6% | +21% | 3.8% | 5.8% | +52% |
| AT&T | 11.1% | 12.3% | +11% | 3.0% | 4.3% | +43% |

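The MSC procedure described above (normalize each atom to unit peak intensity, binarize at the atom's standard deviation, count active pixels) can be sketched as follows; this is our reading of the procedure, not the authors' code:

```python
import numpy as np

def mean_surface_coverage(atoms):
    # atoms: iterable of 2-D arrays (back-projected receptive fields)
    coverages = []
    for atom in atoms:
        a = np.abs(atom) / np.abs(atom).max()  # normalize: peak intensity = 1
        active = a > a.std()                   # binarize at the std threshold
        coverages.append(active.mean())        # active pixels / RF surface
    return float(np.mean(coverages))
```

A spatially extended atom keeps many pixels above its own standard deviation, while a localized atom keeps only a few, so larger MSC values indicate features with wider spatial support.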

The analysis of the features reveals that the second-layer features of the 2L-SPC are more frequently activated than those of the Hi-La. The 2L-SPC second-layer atoms are thus more generic, as they encode more diverse situations. In addition, these features present a larger spatial extension and thus include more contextual information compared to those of the Hi-La. Interestingly, these results might serve as an explanation for the lower global residual error observed in Figure 2.

## 5 Conclusion

What are the computational advantages of interlayer feedback connections in hierarchical sparse coding algorithms? We answered this question by comparing the hierarchical Lasso (Hi-La) and the 2-layer sparse predictive coding (2L-SPC) models. Both are identical in every respect, except that the 2L-SPC adds interlayer feedback connections. These extra connections force the internal state variables of the 2L-SPC to converge toward a trade-off between an accurate prediction passed by the lower layer and a better predictability by the upper layer. Experimentally, we demonstrated for these 2-layered networks on four different databases that the interlayer feedback connection (1) mitigates the overall prediction error by distributing it among layers, (2) accelerates the convergence toward a stable internal state, (3) allows the second layer to refine its representation of the input, (4) accelerates the learning process, and (5) enables the learned features to be more generic and spatially extended.

The 2L-SPC is novel in its way of considering hierarchical sparse coding (HSC) as a combination of local SC subproblems linked by the PC theory. This is a crucial difference with CNNs, which are trained by backpropagating gradients from a global loss. To the best of our knowledge, the 2L-SPC model is the first one that leverages local sparse coding into a hierarchical and unsupervised algorithm. The ML-CSC from Sulam et al. (2018) is equivalent to a one-layer sparse coding algorithm (Aberdam et al., 2019), and the ML-ISTA from Sulam et al. (2019) is trained using supervised learning. Deconvolutional networks (Zeiler & Fergus, 2012; Zeiler et al., 2011) do not leverage local sparse coding either, as each layer aims at reconstructing the input image. Interestingly, other approaches based on hierarchical probabilistic inference have successfully generated sparse decompositions of images even though these models are not explicitly solving HSC problems (Lee, Grosse, Ranganath, & Ng, 2009).

Therefore, the 2L-SPC is a proof of concept demonstrating the beneficial impact of feedback connections in HSC models. Further work needs to be conducted to generalize our results to deeper networks. In particular, one needs to find an efficient normalization mechanism, compatible with recurrent networks, to mitigate the strong vanishing-activity phenomenon we have observed with deeper latent variables. Neuroscience might bring an elegant solution to this problem through divisive normalization. Another crucial extension of the work presented here would consist of including a channel-wise activation function (e.g., one sparsity parameter per atom) in which the optimal biases would be assessed during the inference process. Such an improvement would allow the network to adapt the sparsity of the layers to the specificity of the input image. For example, highly textural images would result in a very high sparsity for oriented line features and a low sparsity for textural features.

If one succeeds in applying all of these improvements to such a sparse predictive coding framework, the resulting networks should exhibit promising results for modeling the brain and for tackling practical applications like image inpainting, denoising, and image super-resolution.

## Appendix

### A.1 2L-SPC Pseudo-Code

### A.2 Evolution of the Global Prediction Error with Error Bar

### A.3 2L-SPC Parameters on AT&T

| | | AT&T |
|---|---|---|
| Network parameters | $D_1$ size | [64, 1, 9, 9] (3) |
| | $D_2$ size | [128, 64, 9, 9] (1) |
| | $T_{\text{stab}}$ | 5e-4 |
| Training parameters | Number of epochs | 1000 |
| | $\eta_{L_1}$ | 1e-4 |
| | $\eta_{L_2}$ | 5e-3 |
| Simulation parameters | $\lambda_1$ range | [$0.3:0.7::0.1, 1$] |
| | $\lambda_2$ range | [$0.5, 0.6:1.6::0.2$] |


Notes: The sizes of the convolutional kernels are shown in the format [# features, # channels, width, height] (stride). To describe the range of explored parameters during simulations, we use the format [$0.3:0.7::0.1, 0.5$], which means that we vary $\lambda_1$ from 0.3 to 0.7 by steps of 0.1 while $\lambda_2$ is fixed to 0.5.

### A.4 Prediction Error Distribution on AT&T

### A.5 Number of Iterations of the Inference on AT&T

### A.6 Evolution of Prediction Error during Training

### A.7 Illustration of the Back-Projection Mechanism

### A.8 Full Map of RFs for 2L-SPC and Hi-La on STL10

### A.9 Full Map of RFs for 2L-SPC and Hi-La on CFD

### A.10 Full Map of RFs for 2L-SPC and Hi-La on MNIST

### A.11 Full Map of RFs for 2L-SPC and Hi-La on AT&T

### A.12 Preprocessed Samples

## Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement 713750. It has also been carried out with financial support from the Regional Council of Provence-Alpes-Côte d'Azur and with the financial support of the A*MIDEX (ANR-11-IDEX-0001-02). This work was granted access to the HPC resources of Aix-Marseille Université financed by the project EquipMeso (ANR-10-EQPX-29-01) of the Investissements d'Avenir program.