Partial least squares (PLS) is a class of methods that uses a set of latent or unobserved variables to model the relation between two sets of observed variables, typically inputs and outputs. Several flavors, which differ in how the latent components are computed, have been developed over the years. In this letter, we propose a Bayesian formulation of PLS along with some extensions. In a nutshell, we provide sparsity at the input space level and an automatic estimation of the optimal number of latent components. We follow the variational approach to infer the parameter distributions. We have successfully tested the proposed methods on a synthetic data benchmark and on electrocorticogram data associated with several motor outputs in monkeys.
Partial least squares (PLS) (Wold, Sjöström, & Eriksson, 2001) is a family of techniques originally devised for modeling two sets of observed variables, which we shall call input and output components, by means of some (typically low-dimensional) set of latent or unobserved components. This model can also be extended to deal with more than two sets of components (Wangen & Kowalsky, 1989). It is commonly used for regression but is also applicable to classification (Barker & Rayens, 2003). In this letter, we focus on the regression paradigm. Latent components are generated by some linear transformation of the input components, while the output components are assumed to be generated by some linear transformation of the latent components.
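As a concrete, purely illustrative rendering of this generative structure, one can simulate the two linear maps directly. All dimensions and noise scales below are hypothetical, not taken from the letter:

```python
import numpy as np

# Illustrative dimensions only: samples, inputs, latent components, outputs.
rng = np.random.default_rng(0)
N, p, k, q = 100, 10, 3, 4

X = rng.standard_normal((N, p))   # observed input components
P = rng.standard_normal((p, k))   # input -> latent loadings
Q = rng.standard_normal((k, q))   # latent -> output loadings

Z = X @ P                                      # latent components, linear in X
Y = Z @ Q + 0.1 * rng.standard_normal((N, q))  # outputs, linear in Z plus noise

# The implied input-output coefficient matrix P @ Q has rank at most k,
# which is the reduced-rank view discussed later.
```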
The difference between PLS and related techniques lies in how the latent components are estimated (Hastie, Tibshirani, & Friedman, 2008; Rosipal & Krämer, 2006). Unlike PLS, principal components regression (PCR), for example, does not consider the output when constructing the latent components. Also, PLS differs from canonical correlation analysis (CCA) in that CCA treats the input and output spaces symmetrically (Hardoon, Szedmak, & Shawe-Taylor, 2004). A complete comparison between PLS, PCR, and classical shrinkage regression from a statistical perspective is given by Frank and Friedman (1993), and further insight into the shrinkage properties of PLS can be found, for instance, in Goutis (1996). There exist Bayesian formulations of some latent component models in the literature, such as PCA (Bishop, 1998; Nakajima, Sugiyama, & Babacan, 2011), CCA (Fujiwara, Miyawaki, & Kamitani, 2009; Virtanen, Klami, & Kaski, 2011; Wang, 2007), and mixtures of factor analyzers (Beal, 2003; Ghahramani & Beal, 2000). To our knowledge, however, a Bayesian version of PLS has not yet been proposed.
Different varieties of PLS regression arise from the way they extract latent components (Rosipal & Krämer, 2006). In its classic form, PLS aims to maximize the covariance among the latent components, which are constrained to be orthogonal, using the nonlinear iterative partial least squares (NIPALS) algorithm (Wold, 1975). This is an algorithmic rather than a traditional statistical approach, and hence the analysis of its properties is less straightforward. A more rigorous approach (from a statistical perspective) is taken by de Jong (1993), who directly formulates the latent space as a linear projection of the input space and solves the resulting optimization problem by the so-called SIMPLS algorithm. The SIMPLS algorithm is equivalent to NIPALS only when the output space is unidimensional. Sparsifying accounts of PLS are proposed by van Gerven, Chao, and Heskes (2012) and Chun and Keleş (2010). A kernelized approach has been introduced by Lindgren, Geladi, and Wold (1993) and Rosipal and Trejo (2001).
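For reference, a bare-bones sketch of NIPALS component extraction: scores are obtained by alternating power-style updates, and X is deflated after each component. Y-deflation and other refinements of the full algorithm are omitted here:

```python
import numpy as np

def nipals_pls(X, Y, k, n_iter=100, tol=1e-10):
    """Minimal NIPALS sketch: extract k components by alternating updates
    of input/output weights and scores, deflating X after each component."""
    X, Y = X.copy(), Y.copy()
    T, W = [], []
    for _ in range(k):
        u = Y[:, [0]]                               # initialize output score
        for _ in range(n_iter):
            w = X.T @ u; w /= np.linalg.norm(w)     # input weights
            t = X @ w                               # input scores
            c = Y.T @ t; c /= np.linalg.norm(c)     # output weights
            u_new = Y @ c                           # output scores
            if np.linalg.norm(u_new - u) < tol:
                u = u_new; break
            u = u_new
        p = X.T @ t / (t.T @ t)                     # input loadings
        X -= t @ p.T                                # deflation keeps scores orthogonal
        T.append(t); W.append(w)
    return np.hstack(T), np.hstack(W)
```

Because of the deflation step, successive X-scores are mutually orthogonal by construction.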
By imposing separate gaussian priors for each column of P (and Q), we are allowing different input (and latent) variable couplings for each component (and ). An obvious simplification, which we take in the rest of the letter, is to let (and ), so that information is borrowed among the latent components and responses and the number of parameters is reduced. The derivation of the general case is straightforward.
It might appear that for large p scenarios, a large number of parameters is associated with the full precision matrix . However, as we will show, the estimation of these matrices is low rank, so that the effective number of parameters is kept reasonably low.
Note that . Since PQ has at most rank k, this formulation is related to reduced rank methods (Izenman, 1975) and approaches that penalize the nuclear norm of the coefficient matrix (Yuan, Ekici, Lu, & Monteiro, 2007). There is also a connection with the multivariate group Lasso (Obozinski, Wainwright, & Jordan, 2011), which imposes an L1/L2-penalty on the coefficient matrix so that a common sparsity pattern is shared by all responses. However, the multivariate group Lasso does not account for correlated errors. The sparse multivariate regression with covariance estimation approach (Rothman, Levina, & Zhu, 2010), on the other hand, does consider correlation between the responses by simultaneously estimating the coefficient matrix and the (full) inverse covariance matrix of the response variables. The coefficient matrix is L1-regularized, so the sparsity pattern can vary for each response. Our approach is also somewhat related to the multitask feature learning problem, where each task has a different set of inputs and the goal is to find some shared structural parameterization that is beneficial for the individual tasks. For example, the method proposed by Argyriou, Evgeniou, and Pontil (2006) seeks a low-rank linear transformation such that the outputs are encouraged to share a common input sparsity pattern. The method introduced by Ando and Zhang (2005) is formulated so that unlabeled data can be used for learning common underlying predictive functional structures. However, these approaches do not build on a generative model, and it is not possible to express a Bayesian formulation that leads to the same estimator.
The rest of the letter is organized as follows. Section 2 introduces the variational approximation in the basic setting. Section 3 describes how to achieve a sparse solution. Section 4 proposes an improved model that aims to estimate the optimal number of latent components and increase the accuracy. Section 5 presents a simulation study with comparisons to other methods. Section 6 provides some results for real neural signal decoding. Finally, section 7 provides conclusions and directions for future work.
2. Variational Parameter Inference
Note that the matrix is not full rank as long as p>k, which is typically the case; it has, in fact, rank k. The effective number of parameters of this matrix is therefore not p(p+1)/2 but, at most, pk+1−k(k−1)/2. When p is high relative to N, it becomes necessary to borrow information between the components , suggesting the choice .
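As a quick arithmetic check, with the dimensions used later in the synthetic study (p = 50 and, illustratively, k = 4), the rank-k structure reduces the count substantially:

```python
# Parameter counts for a symmetric p x p matrix: full versus rank-k
# structured (p = 50 and k = 4 are illustrative choices).
p, k = 50, 4
full_count = p * (p + 1) // 2                  # free entries of a full symmetric matrix
low_rank_count = p * k + 1 - k * (k - 1) // 2  # pk + 1 - k(k-1)/2
print(full_count, low_rank_count)
```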
Calculations for Q and dependent distributions are similar to those of P and are given in appendix A. Brown and Zidek (1980) showed theoretically that an adaptive joint estimation dominates independent estimation for each of the outputs separately when the number of inputs is considerably larger than the number of outputs, but this domination breaks down when the number of outputs approaches the number of inputs. In our situation, when estimating Q, the number of outputs q typically even exceeds the number of latent components k. This suggests that the factorization , mimicking independent estimation, is the sensible choice for q>k, which is often the case.
In short, the proposed approach proceeds as follows:
This grouping of the updates is motivated by the structure of the Bayesian hierarchy and the variational factorization in equation 2.1. A variant of the basic scheme, which assumes diagonality of and , arises by replacing equation 2.2 with 2.3, 2.4 with 2.5, A.1 with A.2, and A.3 with A.4. A variational lower bound of the evidence is given in appendix B.
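The actual update equations live in the referenced equations; purely to illustrate the alternating structure of such a scheme, the following MAP-style sketch replaces the variational posteriors with simple ridge-penalized point updates. This is a simplification, not the letter's algorithm:

```python
import numpy as np

def fit_pls_map(X, Y, k, n_sweeps=50, lam=1e-3):
    """Alternating MAP-style sketch: ridge point updates for Q and P in
    turn, standing in for the full variational updates of q(P), q(Q), ..."""
    rng = np.random.default_rng(0)
    p = X.shape[1]
    P = 0.1 * rng.standard_normal((p, k))
    for _ in range(n_sweeps):
        Z = X @ P                                                  # current latent scores
        Q = np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ Y)    # fit Y on Z
        Zt = Y @ np.linalg.pinv(Q)                                 # latent targets from Y
        P = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Zt)   # fit targets on X
    return P, Q
```

When the true input-output map has rank at most k, this alternation recovers a near-OLS-quality rank-k fit.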
3. Sparsity in P and Q
To achieve sparsity on the input variables, we may impose a group-sparsifying prior on P, where the groups are the k-dimensional rows of P. By setting an individual regularization parameter on each group and integrating out P, maximum likelihood estimation of these regularization parameters effectively discards some groups. This is an example of groupwise automatic relevance determination (see, for example, Virtanen et al., 2011).
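The pruning mechanism can be illustrated on a plain multiresponse regression: one ARD precision per input group (here a row of the coefficient matrix), updated by a naive EM-style fixed point; groups whose precision diverges are discarded. This is a MAP-style stand-in for the variational treatment, with unit noise variance assumed:

```python
import numpy as np

def group_ard_regression(X, Y, n_iter=50, eps=1e-8, alpha_max=1e8):
    """Groupwise ARD sketch for Y ~ X W: one precision alpha_j per row of W
    (a group of q coefficients), unit noise variance assumed. Rows whose
    precision diverges to alpha_max are effectively pruned."""
    p, q = X.shape[1], Y.shape[1]
    alpha = np.ones(p)
    for _ in range(n_iter):
        # Ridge-like MAP update of W with per-row prior precisions:
        W = np.linalg.solve(X.T @ X + np.diag(alpha), X.T @ Y)
        # Naive fixed point: alpha_j = group size / squared row norm.
        alpha = np.minimum(q / (np.sum(W**2, axis=1) + eps), alpha_max)
    return W, alpha < alpha_max
```

For irrelevant inputs, the precision grows without bound, which drives the corresponding row of W to zero; this is the groupwise analogue of classical ARD.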
4. Adaptive P and Automatic Selection of k
The proposed estimation of P disregards Y and uses only the current state of Z. Put differently, equal attention is paid to all latent components when estimating P, no matter the contribution of each latent component to the prediction of Y. A possible improvement is to focus on the latent components that are most useful for modeling Y, regularizing more strongly the latent components that contribute little, that is, those whose coefficients in Q are small.
Here, we impose groupwise priors on P for both rows and columns so as to achieve two goals. First, we provide a more adaptive estimation of P, which we hope will reduce the bias. Second, starting with a reasonably high value of k, we will be better able to obtain an estimate of the adequate dimension for the latent space. Hence, the purpose of this section is to improve the model performance for a given number of latent components. A proper model selection scheme, which would compare the model evidence (see appendix B) for different values of k, would be an alternative (or complementary) way to go.
Then, by automatic relevance determination, the number of coordinates of significantly greater than zero at convergence is an estimation of the optimal number of latent variables k.
5. Synthetic Experiments
To test the practical performance of the algorithms, we now present a synthetic simulation study in which we compare the sparse approach proposed in section 3 (SPLS for short) with other methods. For simplicity, we do not present results for the nonfactorized estimation of P and Q or for full matrices and .
We have tested some alternative PLS methods: the NIPALS and SIMPLS algorithms and the frequentist sparse PLS from Chun and Keleş (2010) (sSIMPLS). All the PLS methods have been provided with four latent components. Also, we have tested a number of non-PLS univariate and multivariate regression techniques. Univariate methods are separately applied for each response; these include ordinary least squares regression (OLS), ridge regression, and the Lasso (Hastie et al., 2008). From the multivariate side, we have tested the multivariate group lasso (MGL) (Obozinski et al., 2011) and the sparse multivariate regression with covariance estimation approach (MRCE) (Rothman et al., 2010). The regularization parameters for the non-Bayesian methods have been chosen by cross-validation.
The design of the simulation study is as follows. In all experiments, the number of input variables is p=50, and the number of responses is q=8. We have covered several different situations, varying the number of hidden components, which we denote as k0, and the amount of training data (N=100, 500). We have set k0=1, 2, 4, 8. Within each situation, we generated 100 random replications, and for each of them, we sampled 1000 testing data points. Then, for each situation and each model, we report results over the test residuals.
For each random replication, the input matrix was generated according to a multivariate gaussian distribution with zero mean and covariance matrix M, whose elements were set as , and r1 was sampled from the uniform distribution with support [0, 1]. Hence, depending on r1, the degree of collinearity in the input matrix is different among the replications. Of course, the testing input matrix was sampled using the same covariance matrix M.
When k0<8, the latent components were generated as . Sparsity was enforced by setting each row of P to zero with probability 0.8 (ensuring at least two relevant input variables). The rest of the rows were sampled from a standard normal distribution. The noise was generated from a gaussian distribution with zero mean and diagonal covariance matrix , with , where denotes the sample standard deviation and the value r2 was sampled from the uniform distribution with support [0.01, 0.1] (separately for each diagonal element).
We generated the responses as . We did not consider sparsity in Q, whose elements were sampled from a standard normal distribution. The noise was sampled from a gaussian distribution with zero mean and diagonal covariance matrix , with diagonal elements , where r3 was sampled from the uniform distribution with support [0.25, 0.5].
When k0=8, the response was computed as , where F has rank q. Since q>k, F has a higher number of effective parameters than PQ in the factorized (k0<8) case. F was sampled from a standard normal distribution, with sparsity imposed as before.
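The data-generating scheme above can be sketched as follows. Where the text elides specifics (the exact input covariance and the exact noise scalings), the code makes explicitly labeled assumptions: an AR(1)-style covariance controlled by r1, and noise scaled by the sample standard deviation of the corresponding signal:

```python
import numpy as np

def generate_replication(N=100, p=50, q=8, k0=2, seed=0):
    """One synthetic replication of the factorized (k0 < 8) design.
    The covariance form M_ij = r1**|i-j| and the noise scalings are
    assumptions made here for illustration."""
    rng = np.random.default_rng(seed)
    # Input covariance: r1 controls the degree of collinearity.
    r1 = rng.uniform(0, 1)
    idx = np.arange(p)
    M = r1 ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), M, size=N)

    # Row-sparse P: each row zeroed with probability 0.8, at least 2 kept.
    P = rng.standard_normal((p, k0))
    keep = rng.random(p) > 0.8
    while keep.sum() < 2:
        keep = rng.random(p) > 0.8
    P[~keep] = 0.0

    # Latent components with small additive noise (r2 in [0.01, 0.1]).
    Z = X @ P
    Z = Z + rng.standard_normal((N, k0)) * (rng.uniform(0.01, 0.1, k0) * Z.std(axis=0))

    # Responses with noise governed by r3 in [0.25, 0.5] (assumed here to
    # scale the signal's sample standard deviation).
    Q = rng.standard_normal((k0, q))
    S = Z @ Q
    Y = S + rng.standard_normal((N, q)) * (rng.uniform(0.25, 0.5, q) * S.std(axis=0))
    return X, Y, P
```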
Figures 2, 3, 4, and 5 show box plots with the performance of the different methods for all situations. Results are in terms of the explained variance R2. In short, these graphs illustrate how the methods behave for different complexities of the true response function and amounts of training input data.
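The reported measure can be computed per response as one minus the ratio of residual to total sum of squares on the test set:

```python
import numpy as np

def explained_variance_r2(Y_true, Y_pred):
    """Per-response R^2 on test data: 1 - residual SS / total SS."""
    ss_res = np.sum((Y_true - Y_pred) ** 2, axis=0)
    ss_tot = np.sum((Y_true - Y_true.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot
```

A perfect prediction scores 1 on every response; predicting the per-response mean scores 0.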
When k0=8, the response function does not factorize and has the highest number of parameters. In this case, the PLS methods have fewer parameters than the actual true response function and tend to underfit. This is particularly the case for N=500, when there are sufficient data for a reliable estimation of all the parameters. Interestingly, SPLS is the only PLS method that does not perform much worse than ridge regression, the Lasso, MGL, and MRCE, for both N=100 and N=500.
The k0=1 case is the opposite extreme. Whereas the full-rank methods aim to estimate pq=400 parameters, the PLS methods, which are set to use four latent components, estimate 4p+4q=232 parameters. Still, this is much more than the true number of parameters (p+q=58). For this reason, the PLS methods are not much better than the others. Indeed, SPLS and the Lasso appear to be the most effective at controlling the complexity of the model. Note that the differences between the methods are very subtle for N=500.
When k0=2, the true response function has 116 parameters. In this case, for N=100, SPLS and the Lasso clearly outperform the other methods. SPLS is slightly better than the Lasso. Surprisingly, SIMPLS and NIPALS do not perform better than OLS. For N=500, SPLS, OLS, ridge regression, the Lasso, and MGL are almost indistinguishable and better than the others.
Finally, when k0=4, the true number of parameters (232) matches that of the PLS methods. Surprisingly, for N=100, SIMPLS, sSIMPLS, and NIPALS are again less accurate than any of the univariate regression methods, MGL, and MRCE. In contrast, SPLS, followed by the Lasso, performs best. For N=500, SPLS, the univariate regression methods (including OLS), and MGL have the same performance, which is not far from the optimal Bayes error.
It is notable that SPLS is the only PLS method that works as well as the Lasso and even beats it in some situations (N=100 and k0=1, 2). The good performance of the univariate regression techniques (in particular, ridge regression and the Lasso) is very likely because is diagonal, even when there is a coupling due to . The poor performance of MRCE probably stems from the estimation of a full inverse covariance matrix of the response variables.
In these experiments, the performance of APLS (not shown) is not very different from that of SPLS. This is not surprising, because the synthetic data sets were generated according to equations 1.1, which correspond to the SPLS model. In the next section, we shall see an example where the adaptive version is clearly better.
6. Electrocorticogram Data Decoding
In this section we describe some experimental results on a neuroscientific data set. In particular, we aim to decode the motor output from electrocorticogram (ECoG) signals collected in monkeys. ECoG signals were recorded at a sampling rate of 1 kHz per channel from 32 electrodes implanted in the right hemisphere and then bandpass-filtered from 0.3 to 500 Hz. A time-frequency representation of the ECoG signals containing p=1600 variables (32 electrodes, 10 frequency bins, and 5 time lags) was used as the input for the algorithms. The monkey's movements were captured at a sampling rate of 120 Hz. Seven degrees of freedom of the monkey's movements (q=7) are to be modeled and predicted. From the available data, we have used N=5982 time points for training and 5982 more for testing. More details about the data can be found in van Gerven et al. (2012).
Given the high dimensionality of the data, we have run the simplest version of the algorithms: P and Q are factorized, and the precision matrices are taken to be diagonal.
Figure 6 illustrates the performance (explained variance R2 over testing data) of the proposed approaches compared to SIMPLS and sSIMPLS. Although not included in the plot, the results of NIPALS are almost indistinguishable from SIMPLS. It is remarkable that APLS performs better than the other methods for most of the responses and is always better than the nonadaptive algorithm. Note that APLS appears to need fewer latent components to give a reasonable estimation. Moreover, APLS is more robust to the choice of k than the other algorithms (including SPLS). Only for the wrist abduction response is the APLS accuracy clearly decreased when k>5. On the other hand, SPLS performs worse in general for high values of k.
Figure 7a shows, for SPLS and APLS, a histogram with the values of as a measure of the importance of each input variable for the output prediction. (Note that reflects the importance of each input for predicting the latent variables and hence is not a good measure of importance of each input variable for the output prediction.) It is worth noting that APLS yields sparser solutions than SPLS in this sense.
Figure 7b shows the values of and the variance of the latent components for six executions of APLS, each with a different number of latent components . Each line, labeled with a different type of symbol, corresponds to a different value k. Note that latent variables that have a high value (left graph) or exhibit a low variance (right graph) are given less importance. From the graph, we can conclude that the first two latent components are the most important for the prediction. For example, the line with the × symbol corresponds to k=5 and has five components (symbols). Each component of this line corresponds to a latent component in the model. The lowest value of (or the highest variance of Zl) for this model pertains to the first two components. Hence, for this model, the two relevant components are l=1, 2. For the models with k=6, 7 components (whose lines have the and ▿ symbols), however, the last components have the lowest values for . The accuracy in these cases is worse than the accuracy that can be obtained with lower k values (see Figure 6). In summary, for all runs, there are two predominant latent components (either the first two or the last two), which indicates that k=2 is a reasonable estimate of the optimal value of latent components.
Finally, Figure 8 shows the trajectories decoded by SIMPLS and APLS. We have used k=7 for SIMPLS and k=2 for APLS. Both algorithms do well, although APLS produces somewhat smoother (less noisy) trajectories than SIMPLS. Again, the outcome for NIPALS is very similar to that of SIMPLS.
Table 1 reports the results of APLS versus OLS, ridge regression, the Lasso, MGL, and MRCE. Interestingly, APLS performs better than all the other methods for all responses, and most differences are statistically significant according to a t-test. MGL does worse than ridge regression and the Lasso; MRCE, however, is quite competitive.
Table 1: Results for APLS, OLS, ridge regression (Ridge), the Lasso, MGL, and MRCE.
Notes: Each row corresponds to a motor output: shoulder abduction (SA), shoulder flexion (SF), pronation (P), wrist abduction (WA), shoulder rotation (SR), elbow flexion (EF), and wrist flexion (WF). The best method is highlighted in bold.
7. Conclusions
We have proposed a Bayesian formulation of PLS, with extensions for sparsity, adaptive modeling of P, and automatic determination of k, and we have shown empirically that these methods perform well on ECoG data decoding.
The proposed approximation relies on the Bayesian paradigm, and, hence, regularization is performed in a data-driven fashion with low risk of overfitting. An advantage is interpretability of the model: using diagonal matrices and , automatic relevance determination provides a measure of the relevance of each input and latent component. If, in addition, we use the adaptive extension proposed in section 4, we can obtain a reasonable estimate of the optimal number k of latent components from . Unlike other PLS formulations, the adaptive model appears to be robust to the choice of k.
For automatically selecting k, we can run the algorithm with several values of k and then select the one that reaches the highest model evidence (see appendix B). A more sophisticated solution is to move toward a nonparametric method, where a proper prior on Z would automatically select the optimal number of latent components. However, the model would probably lose its conjugacy, so that variational inference would no longer be practicable, and we would have to resort to sampling methods of inference.
Note that up to permutation of the latent components and sign flips, the model is identifiable thanks to the priors over P and Q, even when neither Q nor the latent components are forced to be orthonormal. Furthermore, although the model is unidentifiable with respect to permutations of the latent components, due to the initialization of Z (step 1 of the algorithm in section 2), the method will always produce the same order in the latent components across different runs.
Future developments could involve a Markovian treatment of the time dynamics. By means of this extension, a connection between PLS and an input-output linear dynamical system (Beal, 2003) could be established.
The variational lower bound for other variations of the method can be easily computed following the same line of argument.
This work has been partially supported by projects TIN2010-20900-C04-04 and Cajal Blue Brain of the Spanish Ministry of Science and Innovation (MINECO).