We present a supervised learning algorithm for the estimation of generic input-output relations in a real-time, online fashion. The proposed method is based on a generalized expectation-maximization approach to fit an infinite mixture of linear experts (IMLE) to an online stream of data samples. This probabilistic model, while not fully Bayesian, can efficiently choose the number of experts allocated to the mixture, thereby effectively controlling the complexity of the resulting model. The result is an incremental, online, and localized learning algorithm that performs nonlinear regression with multivariate inputs and outputs, approximating the target function by a linear relation within each expert's input domain and allocating new experts as needed. A distinctive feature of the proposed method is its ability to learn multivalued functions: one-to-many mappings that naturally arise in some robotic and computer vision learning domains. This is achieved with a Bayesian generative model for the predictions provided by each of the mixture experts; as a consequence, the method can directly provide forward and inverse relations from the same learned mixture model. We conduct an extensive set of experiments to evaluate the performance of the proposed algorithm, and the results show that it can outperform state-of-the-art online function approximation algorithms in single-valued regression while demonstrating good estimation capabilities in a multivalued function approximation context.
Online and incremental learning algorithms that can efficiently deal with high-dimensional streams of data are increasingly required in the robotics community. As humanoid robots evolve and become more complex, their sensorimotor relations become more difficult to understand, leading to situations where analytical models fail to provide accurate approximations of these sensorimotor maps. These situations range from lack of knowledge of certain hard-to-measure physical parameters (e.g., friction) to highly nonlinear physical interactions, such as actuator nonlinearities and unmodeled mass distributions (Peters & Schaal, 2006; Nguyen-Tuong & Peters, 2008). Resorting to modern learning techniques is, in these cases, the only way to provide these systems with the necessary representation capability. Approaches such as neural networks (Bishop, 1995), memory-based learning (Hastie & Loader, 1993; Atkeson, Moore, & Schaal, 1997), support vector regression (SVR) (Smola & Schölkopf, 2004) and gaussian process regression (GPR) (Rasmussen & Williams, 2006), among many others, have been successfully applied to robotic learning tasks in the past few decades.
Many of these learning techniques, however, are not suitable when continuous and autonomous operation is desired, in which, in a biologically plausible way, the robot continuously learns and adapts as it interacts with the surrounding environment. In this setup, an offline learning phase does not exist. As a consequence, online learning algorithms are needed that can keep their computational demands low in the presence of potentially very large sensorimotor spaces to explore. This rules out many state-of-the-art algorithms like SVR and GPR: these methods fit nonlinear functions globally and require, in their original form, the presence of all training points in memory and a computational effort that grows very fast with the number of samples to process. Other methods, like neural networks, can suffer from slow convergence and destructive interference when learning new data, a problem reported, for example, by Schaal and Atkeson (1998). To overcome this computational burden, one can identify two major trends: methods based on global, sparse approximations (Csató & Opper, 2002; Ma, Theiler, & Perkins, 2003; Quiñonero-Candela & Rasmussen, 2005) try to keep a representative reduced set of training samples to represent the function to approximate, while a second group of methods is based on local approximations of the function to learn (Meeds & Osindero, 2006; Nguyen-Tuong & Peters, 2008). The former class of methods may become inadequate in the presence of large streams of online training data, as it tries to represent the function to learn in a global manner and is particularly sensitive to shifts in the input data distribution.
Local methods have some strong support in neurobiology (see the references in Atkeson et al., 1997, for instance), and their ability to perform localized learning seems more suitable for online operation: they can often be seen as probabilistic mixtures of models, where each model describes the function to be learned in a particular region of the input space. They have their origins in the mixture of experts (ME) concept (Jacobs, Jordan, Nowlan, & Hinton, 1991; Jordan & Jacobs, 1994), where competing experts, assigned to different zones of the input space, are responsible for generating a corresponding output, with the final prediction produced by a gating network that combines their outputs. Particularly suited for real-time operation, mixtures of local linear models avoid the need to keep all training points in memory, yielding a dramatic increase in the speed of operation while possibly sacrificing the lower approximation errors typical of state-of-the-art GP methods. Xu, Jordan, and Hinton (1995) provided one of the earliest examples of such a mixture of linear experts architecture, trained using the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977); in its simplest form it is an offline algorithm, for which the number of components—number of experts—must be set beforehand. Sato and Ishii (2000) suggest an online version of the ME, based on the introduction of a discount factor in the update of the sufficient statistics vector of the mixture. Training, again, is done using the EM algorithm, although the regularization and allocation of new experts are performed in a somewhat heuristic way. LWPR (Vijayakumar, D'Souza, & Schaal, 2005) and XCSF (Wilson, 2002) are two other popular nonprobabilistic algorithms based on this mixture of linear experts concept.
LWPR has been widely used for online, real-time learning of robotic tasks; it uses a gradient descent on the prediction error, based on a stochastic leave-one-out cross-validation algorithm, to adapt the distance metrics of the receptive fields that partition the input space. Within each receptive field, a linear relation from input to output is obtained using an incremental partial least squares algorithm that efficiently deals with redundant and high-dimensional input spaces. XCSF updates the input distance metric of each model resorting to a steady-state genetic algorithm, while each linear model is fitted using a recursive least squares algorithm.
Unfortunately, almost all of the current state-of-the-art learning algorithms fail to deal with multivalued functions; these relations, also known as multifunctions or multimaps, are one-to-many maps where a single input can be associated with more than one output and can be thought of as inverse relations for noninjective functions. They naturally arise in some robotic domains, such as perceptual aliasing in reinforcement learning (Whitehead & Ballard, 1991) or robotic learning from demonstration (Chernova & Veloso, 2008; Grollman & Jenkins, 2010). However, the most evident example of multivalued functions comes from serial and parallel robot kinematics, which relate actuated joint space variables to the task space configuration vector of the robot. The forward kinematics for serial robots is straightforward to obtain and consists of a proper, single-valued function of the joint variables, while the inverse kinematics for this type of manipulator usually exhibits multiple joint space solutions for the same task space variable, even for nonredundant robots, where the task and joint spaces have the same dimension (Craig, 1989). The opposite situation occurs with parallel robots: there, the same value of the joint space variables may correspond to different values of the task space vector. Inverse kinematics, however, is usually easy to obtain and provides only a single solution for every task space configuration (Merlet, 2006).
For many supervised learning algorithms, this lack of ability to obtain different multivalued predictions for the same input query point is a result of assuming the single-valued model during the training phase, either explicitly or by allowing some sort of interference across multiple solution branches. As a consequence, many of the learning methods described can provide only single-valued predictions, and no inverse predictions are directly available from the learned forward model in its original form. D'Souza, Vijayakumar, and Schaal (2001), for instance, used LWPR to learn inverse kinematics for serial robots, but this approach, like other inverse kinematics learning methods, obtains only a single, local inverse solution by application of some sort of constraining or optimization criterion.
Three major approaches to the multivalued prediction problem that do not rely on the single-valued model can be devised. The first approach explicitly takes into account the multivalued model and mostly relies on artificial neural networks, using recurrent neural networks (Tomikawa & Nakayama, 1998), feedforward neural networks (Lee & Lee, 2001; Brouwer, 2004), or regularization networks that provided a direct algebraic representation of the multivalued function to learn (Shizawa, 1994). These implementations, however, were tested only in very low-dimensional toy problems, leaving their performance in more demanding problems an open question.
Alternatively, as Ghahramani and Jordan (1994) suggested, unsupervised learning algorithms can be applied to supervised learning tasks by means of the conditional densities obtained from the learned joint density function over both inputs and outputs. In principle, this makes it possible to obtain multivalued predictions for both forward and inverse relations (Lopes & Damas, 2007). Different techniques exist for learning the joint density function of the data; the mixture model is a popular approach, for which specializations exist such as the mixture of gaussians (Ghahramani & Jordan, 1994) or the mixture of factor analyzers (Ghahramani & Hinton, 1997). One of the major problems with this type of model is the choice of the number of components that constitute the mixture; several extensions to the mixture model exist that are capable of automatically selecting an appropriate number of components. The infinite mixture concept presented by Rasmussen (2000), for instance, is a Bayesian method that assigns a Dirichlet process prior on the mixing proportions of the mixture, responsible for the automatic generation of the correct number of components. This, however, usually requires expensive offline computational training based on Markov chain Monte Carlo sampling methods, not suitable for real-time learning, although some variational techniques can be used to accelerate the training process. Other extensions rely on the EM algorithm, either using a greedy approach to grow the mixture to an appropriate number of components (Vlassis & Likas, 2002) or, using the opposite idea, starting with a large number of components and, using carefully chosen priors, automatically shrinking the number of components of the mixture in a Bayesian fashion (Figueiredo & Jain, 2002). 
Unsupervised learning, however, is a more difficult problem than its supervised counterpart and usually leads to worse results, as it ignores the fact that the joint data, apart from noise corruption, lie in a lower-dimensional manifold. This is the main drawback of the unsupervised approach to multivalued learning and prediction, as it typically results in convergence to suboptimal solutions that do not take the problem structure into account.
The last, and perhaps the most promising, approach to multivalued learning is to consider an ME framework where different experts are allowed to share the same input space regions while being allocated to different solutions in the output space. This is done, for instance, in Rasmussen and Ghahramani (2002) and Meeds and Osindero (2006), using a mixture of gaussian processes (GP) experts, or in Bishop (1994) and Qin and Carreira-Perpinán (2008), resorting to the mixture density network (MDN). These examples, however, fail to deal with two aspects of the learning process that we would like to achieve. First, they are not originally designed for online operation, assuming the training to be done in a batch, offline way. The work of Grollman and Jenkins (2010) builds on the mixture of GP experts of Meeds and Osindero (2006) and changes it to allow for an online operation of the algorithm, using a sparse GP model for each of the experts. However, a compromise between accuracy and speed limits its flexibility: training requires both a computationally expensive Monte Carlo integration over particles and an equally expensive sparse GP online training, which severely degrades the algorithm's performance in medium- to high-dimensional learning problems, as discussed in section 5. MDN does not have such a computational cost, but it requires the number of experts to be set beforehand. Second, no explicit inversion of the relation to be learned is available; this means that two distinct models, for inverse and forward prediction, must be trained separately, for instance, when trying to simultaneously learn direct and inverse kinematics.
The infinite mixture of linear experts (IMLE) proposed in this letter builds on the mixture of linear models concept to obtain an online, real-time supervised learning algorithm that is able to learn multivalued functions while allocating experts to the mixture as needed, achieving, for single-valued regression, performance at least comparable to state-of-the-art online learning methods like LWPR. It is, at its core, an infinite version of the model for mixture of linear experts given by Xu et al. (1995) and Sato and Ishii (2000), with a careful choice of priors for some of its parameters and the adoption of a generalized EM approach that allows an online operation based on the concepts present in Neal and Hinton (1999) and Cappé and Moulines (2009) and the automatic allocation of experts to the mixture in order to adapt to the complexity of the function to be learned. The training procedure automatically rejects interference among different multivalued function branches, and by considering simple linear local relations from input to output, it is possible to easily obtain the inverse prediction from each local expert. The probabilistic model underlying the proposed algorithm is described in section 2, and the training procedure is presented in section 3. This is the first major contribution of this letter.
After the probabilistic model has been trained, an important question that remains is how to generate a possibly multivalued set of predictions for a given query data point. Qin and Carreira-Perpinán (2008), for instance, use an MDN and obtain the set of modes for the resulting conditional distribution. We instead suggest a different approach, based on clustering the individual expert predictions with a predictive probabilistic model that relates multivalued solutions to individual expert predictions. This is the topic of section 4, where we describe how to obtain forward and inverse multivalued predictions for any query, given the current state of the IMLE model. This topic is the second main contribution of this letter. Finally, section 5 provides a large set of experimental evaluations of the proposed algorithm, testing it in both single-valued and multivalued prediction situations. Section 6 provides the concluding remarks.
2. Probabilistic Model
Here wij denotes a latent or hidden indicator variable that equals 1 if data point i was generated by linear model j and 0 otherwise, with ∑j wij = 1 for each i (sometimes we will use the shorthand notation wij to denote the event wij = 1, as in the above equations). These variables can be gathered, for each i, in a binary vector wi whose jth component is equal to wij. The parameter mj indicates whether expert j is activated, effectively contributing to the mixture: it is equal to 1 if expert j is active and 0 otherwise. Each of the latent indicator variables wij consequently has an associated probability of 1/M if linear model j contributes to the mixture, where M is the total number of active experts. This uniform probability distribution on the latent variables is, in our opinion, a more natural approach to the online regression problem, since fully learned mixture coefficients depend heavily on the input training data distribution, which can vary greatly in an online data acquisition setting. Although we use a probabilistic mixture representation, the ultimate goal of this model is to describe a mapping from inputs to outputs; in this context, assigning the same importance to different parts of this mapping seems to make more sense.
Given wij, input zi follows a normal distribution with center νj and covariance Σj, while output xi follows a linear relation from zi, with mean μj + Λj(zi − νj), design matrix Λj, and diagonal covariance matrix Ψj, corresponding to uncorrelated gaussian noise in x. This model, apart from the uniform distribution for wij, is similar to the one presented in Xu et al. (1995) and Sato and Ishii (2000), where each expert j models a linear relation from input z to output x in some region of the input domain, defined by input center νj and covariance Σj, in this way softly partitioning the input space among the experts.
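To make the generative process concrete, the following minimal numpy sketch draws joint samples (z, x) from a small two-expert mixture. All dimensions and parameter values are made up for illustration; the names nu, Sigma, mu, Lam, and Psi stand for, respectively, the expert input centers, input covariances, output offsets, design matrices, and diagonal output noise covariances described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dimensions: d inputs, D outputs, M active experts.
d, D, M = 2, 1, 2
nu = [np.zeros(d), np.full(d, 3.0)]                     # input centers
Sigma = [np.eye(d), np.eye(d)]                          # input covariances
mu = [np.zeros(D), np.ones(D)]                          # output offsets
Lam = [rng.standard_normal((D, d)) for _ in range(M)]   # design matrices
Psi = [0.01 * np.eye(D) for _ in range(M)]              # diagonal output noise

def sample(n):
    """Draw n joint samples (z, x): pick an expert uniformly (p = 1/M),
    draw z from its input gaussian, then x from the local linear model."""
    pts = []
    for _ in range(n):
        j = rng.integers(M)
        z = rng.multivariate_normal(nu[j], Sigma[j])
        noise = rng.multivariate_normal(np.zeros(D), Psi[j])
        x = mu[j] + Lam[j] @ (z - nu[j]) + noise
        pts.append((z, x))
    return pts
```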
The prior on the activations mj imposes an increasing penalty on the number of linear experts the learning phase tries to allocate; we detail this in section 3.2. The remaining priors are multivariate inverse-wishart and univariate inverse-gamma distributions over the input covariances Σj and the diagonal elements of the output noise matrices Ψj, together with normal priors on the centers νj and μj and on the rows of the design matrices Λj. The constants that parameterize these priors determine their strength, expressed as an equivalent number of “fake” data points. We choose these distributions for convenience, since they are conjugate priors for the observed data distribution. The purpose of the common elliptical prior on Σj is threefold: it introduces some regularization, so that Σj always has an inverse; it ensures that the experts’ input region shapes do not differ too much from each other; and it prevents nonneighboring experts from competing for the same data in the initial phase of the learning process of each expert, a serious problem that occurs in ME models, referred to, for instance, in Schaal and Atkeson (1998) and Vijayakumar, D'Souza, and Schaal (2005), thus enforcing the principle of localized learning. The normal prior on νj, on the other hand, controls the degree of mobility of this parameter: a vanishing prior strength makes it depend solely on the training data, while an infinite strength leads to a fixed center, as typically occurs in radial basis networks or in LWPR; the prior on μj has the same purpose. Finally, the inverse-gamma prior on the elements of Ψj defines a kind of average noise level that all experts share, while the prior on the rows of Λj performs a coefficient shrinkage similar to ridge regression. Its main purpose, however, is to provide a regularization mechanism that makes the matrix inversion required for the estimation of Λj full rank. Such a prior introduces some bias in the expert prediction, and its strength should consequently be kept low in order to make this undesired effect negligible.
Given Z, a collection of input training data, and X, the corresponding outputs, we would like to obtain the posterior distribution of the mixture parameters and use it to obtain p(x|zq, Z, X) and p(z|xq, Z, X), respectively, the forward and inverse posterior predictive distributions. This is, however, intractable without also considering the latent variables wij and mj. Although several variational or Monte Carlo Bayesian methods exist to approximate this posterior, in this letter we use the EM algorithm (Dempster et al., 1977) to find a maximum a posteriori (MAP) estimate for the unknown parameters, due to its easy adaptation to online learning schemes.
We also define for convenience, so that .
Equations 3.7b and 3.7f, on one hand, and 3.7e and 3.7g, on the other, are coupled, without an explicit closed-form solution for the parameters being estimated. To deal with this issue, the maximization step is relaxed, and each of these parameters is maximized individually, conditioned on the others remaining fixed. This corresponds to the expectation conditional maximization algorithm, a particular case of the generalized variant of the EM algorithm, where the M-step is modified to an update that improves the Q-function without necessarily maximizing it (Dempster et al., 1977). Solving for the values of mj that maximize the likelihood is, however, intractable, as it requires evaluating all the infinite combinations of values for mj and picking the one that maximizes the Q-function.
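The conditional maximization scheme can be sketched generically: each coupled parameter is updated to its conditional maximizer while the others stay fixed, which improves the Q-function without jointly maximizing it. The sketch below is schematic; the update functions are hypothetical placeholders, not the actual expressions in equations 3.7.

```python
def ecm_step(params, updates, n_sweeps=1):
    """Expectation conditional maximization sketch: each parameter is
    updated in turn while the others are held fixed, which is guaranteed
    not to decrease the Q-function. `updates` maps a parameter name to a
    (hypothetical) function computing its conditional maximizer given the
    current values of all parameters."""
    for _ in range(n_sweeps):
        for name, update in updates.items():
            params[name] = update(params)
    return params
```

As a toy check, for Q(a, b) = −(a − b)² − (a − 1)² the conditional maximizers are a = (b + 1)/2 and b = a, and alternating them converges to the joint maximizer a = b = 1.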
3.2.1. Growing the Mixture.
Activations mj define the number of experts constituting the current mixture and play a key role in defining the complexity of the global probabilistic model. Choosing the appropriate number of components for a mixture is a difficult problem, and several methods have been proposed to deal with it. Among them, Bayesian methods provide an elegant framework that automatically generates a trade-off between the fitness of the data to the model and the complexity of the same model. Moreover, the infinite mixture models based on the Dirichlet process nonparametric prior for the mixing coefficients allow for generative models where the number of components of the mixture is not defined a priori (Antoniak, 1974; Rasmussen & Ghahramani, 2002; Meeds & Osindero, 2006).
Unfortunately, training these infinite mixtures usually requires either computationally expensive Markov chain Monte Carlo sampling methods or variational approaches that typically rely on some sort of truncation imposing a bound on the admissible number of components for the mixture. Furthermore, the online operation we are trying to achieve in this work imposes some additional difficulties. In an offline setting, Bayesian methods can in principle efficiently grow or annihilate mixture components, but when operating online, the full set of training points is no longer available, and decisions concerning the allocation or removal of components of the mixture must be made resorting only to the most recently available training points and the current mixture state, a far more demanding learning challenge. Sato (2001) derives an online variational Bayesian algorithm for learning mixture models, but it requires the maintenance of parallel hypotheses about the number of components, which can easily become too computationally expensive. Some recent work also views variational Bayesian learning from an online perspective but relies on nondeterministic Gibbs sampling approaches (Wang & Blei, 2012) or requires processing the training points in smaller batches of data (Gomes, Welling, & Perona, 2008). For all of these methods, an adequate computational speed that allows the processing of hundreds of samples per second, as required for online learning schemes in robotic applications, has yet to be shown.
In contrast to Bayesian methods, it is more difficult to assess an optimal value for the number of components of a mixture when using an EM algorithm for training. A well-known drawback of EM-based techniques is that the maximized observed data likelihood will never decrease when new components are added to the mixture. This allows the derivation of a broad class of criteria for deciding when to add new components, ranging from allowing the existence of only a single component, equivalent to performing a global linear regression on the data, to the activation of a new one for each data point processed, which would correspond to a memory-based learning approach, where predictions would be made resorting to all available training data. As a consequence, most deterministic EM methods impose some kind of penalty on the number of mixture components during optimization, such as Akaike's information criterion (AIC), the Bayesian inference criterion (BIC), or the minimum message length criterion (MML), to name just a few. A comprehensive comparison and review of this kind of penalty method can be found in McLachlan and Peel (2000).
In the probabilistic model given in equations 2.1 and 2.2, the prior distribution on mj plays the role of such a penalty, making a high number of experts increasingly less probable. Changing, at iteration t, the value of a particular activation mj from 0 to 1 affects the Q-function in equation 3.3 only through the prior term, as the values htij will still be equal to 0 for all i and this particular value of j, following equation 3.2. This results in a decrease of the Q-function value every time a new expert is activated in the M-step; in the absence of the prior, equation 2.2f, the Q-function value would not change.
The right side of equation 3.8 introduces an increasingly severe penalty on the activation of new experts as the value of M increases. The adjustable parameter p0 regulates the propensity to activate new experts: the lower its value, the higher the critical value of the chi-squared distribution, making the experts’ activation criterion harder to meet and resulting in fewer components in the mixture. In general, equation 3.8 tells us that a new expert should be activated when the next acquired training point is poorly explained by the current probabilistic model. This is a sensible approach to mixture growth in online algorithms; LWPR, for instance, creates a new linear model each time an input training point zi fails to activate the nearest receptive field by more than a given threshold. Of course, when learning multivalued functions, an activation scheme must take into account both the input and output parts of the training point, as occurs in equation 3.8.
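A schematic version of such an activation rule can be written as follows. This is not the exact expression in equation 3.8, but a sketch under the assumption that "poorly explained" is measured by the squared Mahalanobis distance of the joint point (z, x) to every active expert, compared against a chi-squared critical value controlled by p0.

```python
import numpy as np
from scipy.stats import chi2

def should_activate(y, centers, covs, p0=0.1):
    """Sketch: activate a new expert when the joint training point
    y = (z, x) is poorly explained by every current expert, that is, its
    squared Mahalanobis distance to each expert exceeds the chi-squared
    critical value at level 1 - p0 (lower p0 -> harder to activate)."""
    crit = chi2.ppf(1.0 - p0, df=y.size)
    for c, S in zip(centers, covs):
        diff = y - c
        if diff @ np.linalg.solve(S, diff) <= crit:
            return False   # at least one expert explains the point
    return True
```

With p0 = 0.1 and a two-dimensional joint space, the critical value is about 4.6, so a point within roughly two standard deviations of some expert does not trigger an activation.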
3.2.2. Handling Outliers.
The probabilistic outlier model in equation 3.9 depends on the current estimates for the output noise and input length scale. This is a better approach than considering a fixed threshold, as tuning such a threshold would require some problem domain-specific knowledge. If we consider a newly activated, not-yet-trained expert, the criterion can be seen as the evaluation of the probability density of the expert at a point (x, z) that lies on the equidensity contour, centered at the expert center, that encircles a region of 1 − p0 probability.
The results above show us that the mechanisms for activating a new expert and for recognizing an outlier are essentially the same and correspond to identifying training points poorly explained by the current mixture. How do we know, then, whether a training point satisfying equation 3.8 is an outlier or, alternatively, an indication that a new expert is needed in the corresponding region of the input-output space? This is a challenging problem. For single-valued regression, outliers can be detected as training points that have enough support from the current model in the input space while presenting a large deviation from it in the output space. The same, however, does not happen in multivalued regression, where such a situation may simply correspond to a yet unseen branch of the multivalued function being learned. Furthermore, the online assumption does not allow us to look at the whole training set, where identifying outliers would in principle be easier, by searching for isolated points.
This question can be answered under the assumption that the training data are temporally correlated, as often is the case in online learning. In this situation, a training point will have a high probability of being poorly explained by the current mixture if the same occurred with the previous point. In contrast, the same does not happen with statistically independent outliers. Observing two consecutive training points satisfying equation 3.8 is then more likely to be caused by a lack of fit of the points to the current model than the occurrence of two consecutive outliers. In such an event, we decide to activate a new expert.
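The resulting decision rule is easy to state in code. The sketch below processes a stream of flags, where flags[i] indicates that training point i satisfies the activation criterion of equation 3.8; the handling of the flag after an activation is our own assumption.

```python
def process_stream(flags):
    """For each point: 'update' if explained by the mixture, 'hold' if it
    is the first poorly explained point in a row (possible outlier), and
    'activate' (a new expert) on the second consecutive one."""
    decisions, prev = [], False
    for poorly in flags:
        if poorly and prev:
            decisions.append("activate")
            prev = False          # assumption: reset after activating
        elif poorly:
            decisions.append("hold")
            prev = True
        else:
            decisions.append("update")
            prev = False
    return decisions
```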
3.3. Computational Complexity.
Learning with IMLE is very fast. For a new observation (zt, xt), a complete update of the mixture parameters consists of (1) deciding whether a new expert should be activated, using equation 3.8; (2) assigning responsibilities hij to active experts using equation 3.2 (E-step); (3) updating the sufficient statistics (E-step, equation 3.6); and (4) obtaining the new values for the parameters (M-step, equations 3.7). All of these calculations can be performed efficiently, since the matrix inversions required in this process reduce to rank 1 updates that can be carried out using the Sherman-Morrison formula. The computational complexity of a complete update of the IMLE parameters is consequently linear in D, the number of output dimensions, and in M, the number of experts, and quadratic in d, the number of input dimensions, making it directly comparable to the state-of-the-art LWPR in terms of computational complexity per training point. As in LWPR, this complexity can be made linear in d if the input distance metrics are constrained to be diagonal.
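For reference, the Sherman-Morrison identity underlying this cost reduction is (A + uvᵀ)⁻¹ = A⁻¹ − (A⁻¹u vᵀA⁻¹)/(1 + vᵀA⁻¹u); a minimal numpy sketch:

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Rank 1 update of a matrix inverse: returns (A + u v^T)^{-1} given
    A^{-1}, at O(d^2) cost instead of the O(d^3) of a fresh inversion."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)
```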
The same reasoning used to obtain an inverse probability distribution from the IMLE model can also be used to produce more general predictions. Assuming the joint vector y comprises a query part yq and an answer part ya, we can easily derive the predictive distribution of ya given yq, for each expert j, by conditioning the joint distribution with respect to yq.
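This conditioning is the standard gaussian identity; a minimal numpy sketch, with index arrays selecting the query and answer blocks of a hypothetical joint gaussian over y:

```python
import numpy as np

def condition_gaussian(mean, cov, idx_q, idx_a, y_q):
    """Given a joint gaussian over y, return the mean and covariance of
    the answer block y_a conditioned on the query block taking value y_q."""
    m_q, m_a = mean[idx_q], mean[idx_a]
    S_qq = cov[np.ix_(idx_q, idx_q)]
    S_aq = cov[np.ix_(idx_a, idx_q)]
    S_aa = cov[np.ix_(idx_a, idx_a)]
    gain = S_aq @ np.linalg.inv(S_qq)
    return m_a + gain @ (y_q - m_q), S_aa - gain @ S_aq.T
```

Taking yq = z and ya = x yields a forward prediction, while swapping the roles of the two blocks yields an inverse prediction from the very same joint model.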
4.1. Grouping Experts Predictions.
The uncertainty for prediction is given by . When wxj=1 for a particular expert j, meaning that only that expert contributes to the prediction, this uncertainty becomes equal to , which agrees with the predictive uncertainty of the ordinary least squares solution for multiple linear regression.
4.2. A Test for Multivalued Prediction Goodness of Fit.
A practical detail of the above test is that the individual prediction errors will not take arbitrarily large values for correspondingly small values of wxj, as predicted by model 4.8. As a consequence, Tk will assume much lower values than the ones expected under the null hypothesis distribution, since in a typical scenario most of the experts have a negligible contribution to the prediction. In this situation, we found that replacing Mk − 1 in equation 4.10 with the effective degrees of freedom of the mixture provides a better fit of the statistic to its corresponding distribution under the null hypothesis.
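The exact expression for the effective degrees of freedom is not reproduced here; a common choice for such a quantity, shown below purely as an illustrative assumption, is the ratio (∑j wxj)² / ∑j wxj², which equals the number of experts when all weights are equal and approaches 1 when a single expert dominates.

```python
import numpy as np

def effective_dof(w):
    """Assumed effective number of contributing experts:
    (sum w)^2 / sum w^2. Experts with negligible weight barely count."""
    w = np.asarray(w, dtype=float)
    return (w.sum() ** 2) / (w ** 2).sum()
```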
4.3. Obtaining a Valid Set of Multivalued Solutions.
Multivalued prediction starts with the single-valued estimate, Nsol = 1. If the test described in the previous section rejects this single-valued solution, Nsol is increased to 2 and the EM iterations in equation 4.9 are performed. If the test for goodness of fit rejects at least one of the solutions, Nsol is incremented, and this procedure is repeated until a value of Nsol is found for which the test fails to reject the null hypothesis for any of the solutions thus obtained. The set of solutions obtained this way, for k = 1, …, Nsol, is the set of multivalued forward solutions predicted by IMLE, together with their respective uncertainties. To speed up the prediction process, each EM procedure is initialized with the values found in the previous run of the algorithm, while the extra solution starts near the solution k that produced the smallest p-value in the previous goodness-of-fit test, in this way splitting in two the solution responsible for the null hypothesis rejection.
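The search loop over Nsol can be sketched as follows, with hypothetical callables fit_solutions (standing in for the EM clustering of equation 4.9) and passes_test (the goodness-of-fit test of the previous section); the warm starting of each EM run from the previous one is omitted for brevity.

```python
def find_solutions(fit_solutions, passes_test, max_sol=8):
    """Grow the number of solutions until the goodness-of-fit test fails
    to reject every solution found (max_sol is a safety bound we add)."""
    n_sol = 1
    while True:
        sols = fit_solutions(n_sol)
        if all(passes_test(s) for s in sols) or n_sol >= max_sol:
            return sols
        n_sol += 1
```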
During this process, the significance level controls the number of solutions found: the lower its value, the harder it is to reject the null hypothesis, and the fewer solutions are likely to be found. Increasing it helps to separate different solutions, but as an unwanted consequence, predictions from neighboring experts may stop being merged together, due to the function curvature around zq.
5. Experimental Results
In this section we evaluate IMLE in several different experimental settings, especially focusing on large training sets arising from continuous streams of data, the scenario that particularly suits online learning. We compare our results with different online learning algorithms: LWPR (Vijayakumar et al., 2005), probably one of the most widely used state-of-the-art online learning methods for robotic applications; SOGP, a sparse online approximation to gaussian process regression (Csató & Opper, 2002); and ROGER, an online infinite mixture of SOGP experts (Grollman & Jenkins, 2010). All of these algorithms have their C++ implementation code available, and their most recent versions to date are used in our comparisons.1 A C++ implementation of the IMLE algorithm is also freely available for download.2 We also compare IMLE to standard GPR. This is not an online algorithm in its standard formulation, but it can give some insight into the performance loss we can expect when moving to an online operation setting. In the following experiments, we specify a gaussian likelihood together with an isotropic squared exponential covariance function, using exact inference for training and prediction. Optimal values for the input length scale of the kernel and the output noise, the free hyperparameters of the model, are obtained using standard optimization techniques over the training set, using the GPML Matlab code.3
Parameters of interest for tuning the LWPR algorithm include Dinit, wgen, and penalty (for details on the meaning of these parameters, consult the related documentation). Additionally, in all experiments we set diagOnly to false and useMeta and updateD to true. We use SOGP with a gaussian kernel. The remaining tuning parameters for this algorithm are the kernel width, the expected output noise for the function to learn, and the maximum number of training basis points to be kept by the algorithm. SOGP behavior is supposedly similar to standard GPR if no upper limit is set on this number of basis points. Besides the (common) parameters for each of the SOGP experts, ROGER also needs some parameters to be defined: the most important quantities, according to its authors, are P, the number of particles; the Chinese restaurant process concentration parameter, which drives the propensity for new SOGP experts to be created within each particle; and the common parameters for each of its SOGP experts. In general, the higher the number of particles and the SOGP capacity, the slower ROGER will run.
As for IMLE, there are 11 parameters that can be tuned to change the resulting behavior of the algorithm. Some of them typically do not need any tweaking, and the following experiments, unless otherwise noted, will keep them at their default values (a small regularization value, with expert locations fully learned). As the input space dimension increases, a stronger prior on the input covariance matrices and output noise is needed to make the learning process relatively invariant with respect to the trajectory nature of the training data acquisition process. A good rule of thumb is to choose the prior strength based on the confidence in the corresponding parameter value, with smaller strengths corresponding to a larger uncertainty on this parameter. A typical value for the forgetting factor lies in a narrow range close to 1. The remaining parameters, including p0, have a strong influence on the experts' activation process, equation 3.8, and ultimately on the number of local experts created during the training phase; they represent the expected output noise variance and the input activation region within which the function to be learned can be approximately represented by a linear relation.
While setting and tuning such an apparently high number of free parameters may seem challenging at first, the convergence of the probabilistic model is not very sensitive to the specific values of these parameters. Perhaps the most sensitive issue when tuning the IMLE free parameters is ensuring a correct convergence of the input length scale and output noise estimates; this problem is discussed thoroughly next.
5.1. Single-Valued Function Approximation.
In this section we evaluate IMLE's ability to perform single-valued function approximation, applying the algorithm to three learning problems with different input dimensions and comparing its performance to LWPR, SOGP, and GPR. Some care must be taken when comparing these different learning schemes. In general, increasing the model complexity of each algorithm will produce smaller approximation errors while incurring a heavier computational cost. The number of local linear models activated by IMLE and LWPR is a good measure of model complexity for these online algorithms. Its final value, after the training process, is a consequence of the choices for their tunable parameters and of the training data themselves. Since IMLE and LWPR have the same computational complexity per training point, the final number of activated models provides a fair ground for comparing IMLE and LWPR in terms of the approximation error/computational complexity trade-off. GPR and SOGP model complexity, on the other hand, is measured by the number of stored training points used for posterior prediction over the test data. For GPR, this number is set beforehand, while SOGP learns a sparse subset of the training data to be used for prediction, possibly limiting the maximum number of these inducing points. IMLE and LWPR computational demands are linear in the number of local models, while GPR and SOGP are much more penalized by an increase in the number of stored training points or inducing points, respectively.
The amount of information implicitly available to the algorithms is another important issue concerning a fair comparison between them: for regression, the input length scale of the data and the noise level present in the output are two critical properties of the function to be learned. GPR learns them offline by optimizing the likelihood of the training data with respect to these hyperparameters. Since GPR is an offline algorithm, its prediction performance strongly depends on the input locations of the stored training points. If they efficiently cover the whole input space, we expect GPR to outperform methods based on local linear approximations in terms of prediction error. Yet such a highly desirable, informative training set may be unavailable or difficult to generate, as in typical robotic applications, where visiting the whole input space can be very time-consuming.
Online methods alleviate this dependence on an initial representative training set by learning their models on the fly, adapting them as new training data arrive. This is achieved in SOGP by maintaining a representative subsample of the data. The kernel width and output noise parameters, however, are not adapted during the learning process, and thus SOGP must rely on a good initialization of these quantities. Adequate values can be obtained, for instance, from an initial offline optimization, as in GPR. Both IMLE and LWPR learn the input length scale and output noise for each of their local linear models (in fact, they learn a full input distance metric, represented by covariance matrices and D−1j, respectively). They differ in the way they initialize these quantities: LWPR initializes the input distance metrics to a constant value Dinit; IMLE, on the other hand, puts a common prior on them, defining then a vague hyperprior over that prior's parameters. While the LWPR initialization strongly influences the number of receptive fields created during learning, in IMLE the information conveyed by the hyperpriors' parameters can quickly lose importance if the corresponding prior strengths are small. This capability to learn, in an online fashion, the characteristic input length scale and output noise makes IMLE more robust to poor parameter initialization and less dependent on problem-specific knowledge; this more complex probabilistic model structure, however, can make the learning convergence depend more on the input distribution of the training data, since the hyperparameters strongly influence the behavior of newly activated experts, which in turn contribute to the estimation of those same hyperparameters. Ultimately this can lead to convergence to poor local maxima of the likelihood function.
5.1.1. Cross Function.
We first ran IMLE on a sequential stream of data taken from the cross function suggested in Vijayakumar et al. (2005), a two-dimensional-input, univariate-output function displayed in Figure 2a. The training set consisted of points sampled from a random trajectory performed in the input space, together with the corresponding output data, to which we added gaussian noise with a standard deviation of 0.1. A small sample of such training data can be seen in Figure 2a, superimposed on the target function.
We adopted all the suggested values for the LWPR parameters presented in the cross-2D example given in the LWPR source code. SOGP was left with its default parameters, with no limit on the number of inducing points. As for IMLE, we chose to match the LWPR initial input covariance matrix. For comparison purposes, we varied the parameter p0, considering two values, p0=0.1 and p0=0.2. Since the output was known to be single-valued, we did not group the experts' predictions into different solutions when estimating the output for the test input data, bypassing the hypothesis testing described in section 4. Finally, we also trained a standard GP model, using a random, nonsequential training set with two different sizes (M = 1000 and M = 5000). All of the online algorithms were trained on a set of 200,000 sequential points coming from a random trajectory in the input space, and the prediction root mean square error (RMSE) was evaluated on a noiseless test grid of 200 × 200 equally spaced input points and corresponding output values. For accuracy, the presented IMLE and LWPR results are averages over 100 random trials.
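A data-generation sketch for this experiment follows, assuming the usual statement of the cross function and an input domain of [-1, 1]² (the exact domain and the random-walk step size are not restated in this excerpt and are assumptions here):

```python
import numpy as np

rng = np.random.default_rng(0)

def cross(x1, x2):
    """The 2-D 'cross' benchmark of Vijayakumar et al. (2005), as
    commonly stated: the max of three unnormalized gaussian ridges."""
    return np.maximum.reduce([np.exp(-10 * x1**2),
                              np.exp(-50 * x2**2),
                              1.25 * np.exp(-5 * (x1**2 + x2**2))])

def random_walk_samples(n, step=0.05, lo=-1.0, hi=1.0, noise_sd=0.1):
    """Sequential training stream: a random trajectory in the input
    square, with N(0, 0.1^2) noise added to the outputs."""
    X = np.empty((n, 2))
    x = rng.uniform(lo, hi, size=2)
    for i in range(n):
        x = np.clip(x + rng.normal(0, step, size=2), lo, hi)
        X[i] = x
    y = cross(X[:, 0], X[:, 1]) + rng.normal(0, noise_sd, size=n)
    return X, y

# noiseless test grid of 200 x 200 equally spaced points, as in the text
g = np.linspace(-1, 1, 200)
G1, G2 = np.meshgrid(g, g)
y_test = cross(G1, G2)
```

The RMSE reported in Figure 2 then corresponds to `np.sqrt(np.mean((y_pred - y_test) ** 2))` over this grid, for any learner's predictions `y_pred`.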
Figure 2b shows a typical reconstruction of the original target function after learning (p0=0.2). Figure 2c shows, for the online methods, the evolution of the RMSE and the number of created models as a function of the number of processed training points, and Figure 2d presents the final results after training. We can see that SOGP achieves the best RMSE, but at a high computational cost compared to the other online methods. LWPR, on the other hand, is the fastest algorithm but has a worse function approximation error when compared to IMLE, even when IMLE resorts to fewer local models (p0=0.1). Note that increasing p0 results in a better function approximation, at the cost of more activated linear experts and a consequent increase in computation time. As for offline GPR, low error rates can be obtained if the training set is large enough, but this comes at a prohibitive cost in terms of offline computation time and the memory required to perform the necessary matrix inversions. We also used a random set for training the GP; as stated before, such a set may not be easy to generate in many practical real-time applications. This may also explain why SOGP has a better RMSE than GPR while using significantly fewer training points for prediction, since SOGP keeps only the most informative points taken out of the full training set.
5.1.2. The PUMA 560 Serial Robot.
The Unimation PUMA 560 is a well-known six-degrees-of-freedom industrial robotic arm. Its forward kinematics function is described, for instance, in Craig (1989). To evaluate the single-valued prediction capabilities of IMLE, we simulated the PUMA 560 kinematics, defining a 10 cm tool extending along the z-axis of the frame associated with the sixth joint, and generated random trajectories over the joint space of the robot, calculating the corresponding 3D position of the tool tip. The kinematic function to be learned was thus a map from an input space of dimension 6 to an output space of dimension 3, even if the last joint was irrelevant, as it changed only the tool orientation. Note that the fully stretched arm plus tool measured more than 90 cm, making the range of each output variable almost 2 m.
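The data-generation step described above can be sketched with a standard Denavit-Hartenberg forward kinematics routine. The DH transform below is the textbook formula, but the table values are illustrative placeholders, not the actual PUMA 560 parameters from Craig (1989):

```python
import numpy as np

def dh(theta, d, a, alpha):
    """Standard Denavit-Hartenberg homogeneous transform."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.,       sa,       ca,      d],
                     [0.,       0.,       0.,     1.]])

# Illustrative 6-DOF table of (d, a, alpha) rows; substitute the
# PUMA 560 values from Craig (1989) for the real robot.
DH_TABLE = [(0.00, 0.00, -np.pi / 2),
            (0.15, 0.43,  0.0),
            (0.00, 0.02,  np.pi / 2),
            (0.43, 0.00, -np.pi / 2),
            (0.00, 0.00,  np.pi / 2),
            (0.00, 0.00,  0.0)]

def tool_tip(q, tool_z=0.10):
    """3-D position of a tool extending 10 cm along the z-axis of
    the frame attached to the sixth joint."""
    T = np.eye(4)
    for (d, a, alpha), qi in zip(DH_TABLE, q):
        T = T @ dh(qi, d, a, alpha)
    return (T @ np.array([0.0, 0.0, tool_z, 1.0]))[:3]
```

Note that, with the tool along the z-axis of the last frame, rotating the sixth joint leaves the tip position unchanged, which is exactly why that joint is irrelevant for the learned map.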
We trained both LWPR and IMLE with a set of 10 million training points, evaluating the final RMSE on a different test set comprising 100,000 points. Output training values were corrupted with gaussian noise with a standard deviation of 2 cm. This corresponds approximately to 1/100 of the output range and can model, for instance, moderate noise in a vision-based end-effector tracking process. SOGP unfortunately had to be left out of the comparisons: its parameters turned out to be difficult to tune, and its behavior was unstable and very slow in the face of a large stream of highly correlated input data. We also considered standard GPR, generating a random set of training data for hyperparameter optimization and testing on the independent test set. Once again, the results thus obtained do not compare fairly to those of the other online algorithms, as one of the most challenging difficulties arising with this data set is the massive, sequential, and correlated nature of the training data.
Experimental results are shown in Table 1, where the RMSE for two different instances of the GPR algorithm, trained with 1000 and 5000 data points, is also presented. We set p0=0.1 and the recommended default values; since we did not wish to give IMLE any information about the input length scale and output noise characteristics of the function to be learned, we defined vague priors for these quantities and set the corresponding prior strength to a low value of 8, in order to quickly decay their influence on the estimates. As for LWPR, we picked the combination of parameters (including wgen=0.1) that made LWPR achieve the lowest error while not letting the number of created receptive fields grow to inadmissible values. Tuning these parameters to have roughly the same number of allocated models as IMLE never produced an RMSE of less than 10 cm on each output dimension. Figure 3b shows the corresponding learning curves for IMLE and LWPR. Due to the conservatively large prior values, IMLE converged slowly in the initial learning phase but quickly recovered after convergence to better output noise and input length-scale estimates.
|Method .||RMSE .||Number of Models .||CPU Time (s) .|
|IMLE||0.0245||668||4181 + 141|
|LWPR||0.0560||4338||11,974 + 725|
The LWPR algorithm was tested over an exhaustive combination of parameters. These parameters affected the final error through the number of receptive fields they tended to create: the larger this number, the lower the error, as expected. The same happened with IMLE when varying its tunable parameters. In general, a lower RMSE is achieved at the cost of an increased number of local linear models. It is illustrative to depict the final RMSE as a function of the number of created models for each parameter configuration of IMLE or LWPR, as in Figure 3a, since the number of local models affects the computational training and testing time in a similar way for these two algorithms. In the figure we can identify three convergence behaviors for IMLE. The first one, marked IMLE (Global), corresponds to a low value of the expected output noise parameter: since this value is lower than the actual output noise, IMLE activates a large number of linear experts in the initial learning phase, each covering all of the input space (due to a large input length scale) and a particular region of the output space. This is of course an undesirable behavior that goes against the principle of localized learning. The second case, identified as IMLE (Overfitting), is a consequence of choosing a low value of the input length-scale parameter in combination with a low or medium prior strength, allowing the individual input covariances to shrink to a point where each new training point has a high probability of activating a new expert. This may result in a snowball effect, where more experts contribute to a decrease of the length-scale estimate, which leads to smaller experts and, consequently, to more experts being activated. In Figure 3a this does not result in a reduction of the RMSE on the test set, a good indicator that IMLE is overfitting in that situation. These are, however, two extreme situations, caused by setting the expected output noise or the input length scale to values smaller than the true ones.
All other combinations of parameters exhibit good robustness regarding training convergence. Note that in this case IMLE performs much better than LWPR, achieving a much smaller RMSE while activating considerably fewer local models.
5.1.3. The SARCOS Inverse Dynamics Data Set.
The SARCOS anthropomorphic robotic arm is a 7-degrees-of-freedom manipulator that has been used to test several function approximation algorithms. The learning task considered here is the estimation of its inverse dynamics from training examples. This is a nonlinear map from a 21-dimensional input space, consisting of the positions, velocities, and accelerations of each joint, to a 7-dimensional output space comprising the corresponding joint torques. The learned model can then be used to estimate the torques that achieve a desired trajectory in joint space. The data set consists of 48,933 training points and 4449 test points, taken from trajectories performed with the real robot. Output values are normalized by the variance of the outputs of the test set, making the results presented here directly comparable to the ones in Rasmussen and Williams (2006) and Vijayakumar, D'Souza, Shibata, Conradt, and Schaal (2002). We compare IMLE results on this data set to the following models:
Linear regression (LR): A linear regression model is fitted to the data to provide a baseline for comparison.
Rigid body dynamics (RBD): This is a parametric physics-based model for the inverse dynamics function that is estimated using a least-squares approach with the available training data.
LWPR: This model was trained using diagonal distance metrics Dj, cycling through the training data over 6 million iterations. This roughly corresponds to 123 passes over the full training data.
GPR: Due to the computational infeasibility of using the full training data for optimization of hyperparameters and prediction over test data, a subset of regressors method was employed, with size 4096. A squared exponential covariance function was used, and its hyperparameters were optimized accordingly, using a subset of the training data.
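The error normalization used throughout this benchmark (MSE divided by the variance of the test outputs) can be written as a one-liner; predicting the test-set mean everywhere then gives a normalized MSE of exactly 1, which is the natural baseline:

```python
import numpy as np

def normalized_mse(y_pred, y_true):
    """MSE normalized by the variance of the test outputs, the
    convention used for the SARCOS benchmark (Rasmussen & Williams,
    2006; Vijayakumar et al., 2002)."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2) / np.var(y_true)
```

Under this normalization, the linear regression baseline of 0.075 in Table 2 means that LR removes over 90% of the output variance, which is why inverse dynamics is often described as a nearly linear problem with important nonlinear residuals.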
As for IMLE, we used out-of-the-box default parameters, setting conservative, large values for its noise and length-scale parameters while using the usual value of p0=0.1. Using the input dimension to set the prior strengths according to the rule of thumb presented before would result in overly large values. We resort instead to the notion that, for actual robot movements, the generated data tend to be low-dimensional, with around four to six effective dimensions (Schaal, Vijayakumar, & Atkeson, 1998). This is a standard assumption in robotic applications to circumvent the curse of dimensionality: learning a sensorimotor map would require a full exploration of the input space if this condition did not hold, and in such a case, as the input dimension increased, the time required for exploration and learning would grow exponentially, making the learning task infeasible from a practical point of view. In our experiments, we found that a value of seven effective dimensions provided satisfactory learning behavior. We also set the prior strength to a low value of 8, due to the high uncertainty on these initial estimates. Varying this value over a considerably wide interval did not change IMLE convergence behavior significantly.
Unlike GPR and LWPR, the IMLE algorithm can directly provide multivariate output predictions without the need to train a different model for each output dimension. If the maps from the inputs to each output variable have the same length-scale properties, as frequently happens in robotic sensorimotor maps, a substantial economy of computational resources can be attained, as each local linear model describes the interaction between the inputs and the full output vector. We show in Table 2 the prediction error when the output consists of only the first joint torque, comparing it to the results presented in Rasmussen and Williams (2006) for LWPR, GPR, RBD, and LR. We also used IMLE to learn the full output torque vector. For both situations, we present the MSE and the number of activated experts after (1) a full pass over the training data and (2) 10 consecutive passes over the same data. For confirmation purposes, we also trained IMLE with p0=0.0, resulting in a model with a single expert that provided a global linear approximation to the training data; as expected, an MSE equal to that of the LR model was obtained.
|Method .||MSE .||Number of Models .||Method .||MSE .||Number of Models .|
|LR||0.075||—||IMLE1D (1 epoch)||0.019||313|
|RBD||0.104||—||IMLE1D (10 epochs)||0.010||563|
|LWPR||0.040||260||IMLE7D (1 epoch)||0.018||271|
|GPR||0.011||—||IMLE7D (10 epochs)||0.010||550|
Note: IMLE1D is the model corresponding to a map from to , while IMLE7D is the model obtained from the full map from to .
The extremely good convergence of the IMLE algorithm is noteworthy. After only a single pass through the training data, IMLE achieves a better approximation error than LWPR (after more than 100 passes through the same data) while activating a comparable number of linear models. If more points are presented to IMLE, cycling through the data ten times, an MSE comparable to state-of-the-art GPR is achieved. It is also worth noting that IMLE performance did not change much when the full 7-dimensional output vector was considered for the learning task: the MSE remained the same, while, perhaps surprisingly, the number of experts even dropped a bit.4 We did not notice any increase in computational time in this situation, as the slight increase in computation due to the higher output dimension was balanced by a smaller number of activated models. In this respect, IMLE compares favorably to GPR and LWPR, which would require seven times more computational power to learn the full inverse dynamics map.
5.2. Multivalued Function Approximation.
We now proceed to evaluate IMLE's learning capabilities under a multivalued target function scenario. As stated in section 3.2, learning multivalued functions in an online fashion poses several additional problems. Without further information, it is difficult to distinguish noise or outliers from new multivalued function branches. In addition, it is no longer possible to set large, vague values for the noise and length-scale priors in the hope that the learning process finds adequate estimates, as multivalued relations may then be interpreted as single-valued functions with large output noise. There is also the problem of time-varying functions. This issue can easily be addressed in online single-valued algorithms by introducing forgetting mechanisms that allow a quick adaptation of the internal model to the time-varying data. Things become more complicated with multivalued function approximation, however, where there is no clear border between fast time-changing training data and multivalued functions.
We tested IMLE against ROGER whenever possible. This latter algorithm, however, consists of an infinite mixture of SOGPs, thus inheriting the limitations mentioned in the previous section: the difficulty of finding a good parameter configuration and the slow operation for medium- or high-dimensional input spaces. Also, the ROGER implementation does not provide a set of multivalued solutions for the function being approximated, instead sampling a single solution from the infinite mixture. This imposes severe limitations on real applications, as ROGER predictions will permanently switch between different multivalued branches. To provide a fair comparison to IMLE, we adapted the ROGER code to produce a finite set of prediction solutions: for each input query, ROGER was used to predict a solution over 100 different trials, and the distinct solutions thus obtained were gathered into the set of predicted solutions.
The approximation error was measured by taking, for each input query, the multivalued prediction closest to the true output. Note that we do not address here the important problem of choosing a solution from the set of multivalued predictions produced by the algorithm. This can be seen as a form of context estimation, and it is a requirement when the multivalued model is to be used for control purposes. In the following experiments, the significance parameter was kept fixed; the lower its value, the higher, on average, the number of solutions found for the same query. However, when the branches of the multivalued function to learn are well separated in the output space, we found that the value of this parameter did not influence the final prediction much, and that setting it within a reasonably wide range resulted in similar prediction results.
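The closest-solution error metric just described can be sketched directly; the function below is a minimal implementation of that idea, not code from the IMLE package:

```python
import numpy as np

def closest_solution_error(solutions, y_true):
    """Error of a multivalued prediction, measured as the distance
    from the true output to the closest predicted solution (the
    metric used throughout section 5.2)."""
    solutions = np.atleast_2d(np.asarray(solutions, dtype=float))
    dists = np.linalg.norm(solutions - np.asarray(y_true, dtype=float), axis=1)
    return dists.min()
```

The test-set RMSE is then the square root of the mean of these squared per-query errors. Note that this metric deliberately sidesteps the solution-selection (context estimation) problem mentioned above.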
5.2.1. Synthetic Data Sets.
To illustrate the fundamental differences between multivalued and single-valued function approximation algorithms, we start with a simple toy example consisting of a multivalued sinusoidal target function. Standard function approximation methods, like LWPR or GPR, typically behave poorly in this setting. Training data were generated by alternating sequential sweeps over each of the two branches of the multivalued function, given respectively by f1(z)=cos(z) and f2(z)=cos(z)+4; each output was corrupted with gaussian noise with a standard deviation of 0.1. Figure 4a shows a sample of the training data, while Figure 4b depicts the estimates from IMLE, ROGER, LWPR, and GPR taken after training. As expected, LWPR and GPR tend to average the two branches of the multivalued function, while IMLE and ROGER correctly identify the two solutions. In particular, ROGER achieves an almost perfect approximation of the true function (RMSE = 0.014), compared to IMLE (RMSE = 0.039).
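The alternating-sweep training stream for this toy problem can be generated as follows; the input range (one period of the sinusoid) and the sweep length are assumptions, since the text does not state them:

```python
import numpy as np

rng = np.random.default_rng(1)

def multivalued_stream(n_sweeps=10, n_per_sweep=500, noise_sd=0.1):
    """Training data alternating sequential sweeps over the two
    branches x = cos(z) and x = cos(z) + 4, with gaussian output
    noise of standard deviation 0.1."""
    Z, X = [], []
    for s in range(n_sweeps):
        z = np.linspace(0, 2 * np.pi, n_per_sweep)
        branch = 0.0 if s % 2 == 0 else 4.0
        x = np.cos(z) + branch + rng.normal(0, noise_sd, n_per_sweep)
        Z.append(z)
        X.append(x)
    return np.concatenate(Z), np.concatenate(X)
```

A single-valued learner fed this stream sees the same inputs mapped to outputs 4 units apart on alternating sweeps, which is precisely what drives LWPR and GPR to average the two branches.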
In the next example, the codomain of the target function gives each input point two distinct solutions. IMLE was trained on random trajectories generated over the target function, and after 50,000 points it achieved an RMSE of less than 0.065 on an independent random test set. The multivalued predicted outputs for this test set are shown in Figure 6a, while the 39 linear models allocated by the algorithm are represented in Figure 6b.
The final toy example consists of a randomly generated piecewise constant target function, shown in Figure 7a. Although not multivalued, many function approximation algorithms are unable to learn it properly, as their smoothness assumptions conflict with the discontinuities of the target function. This typically results either in overfitting, when the complexity of the model is increased to approximate the function in the vicinity of the discontinuities, or in oversmoothing, where the predictions average the two values of the target function near the transitions.
Multivalued learning algorithms based on mixtures, like IMLE, provide an elegant solution to this problem. During training, there is no output interference near the discontinuities, since in those regions the data are simply interpreted as coming from a multivalued function. When predicting, a single-valued solution can nevertheless be provided by simply taking the most important solution from the multivalued prediction set, according to the sum of the weights wxj(zq) of the experts that contribute to that solution. This procedure is able to generate sharp transitions in the prediction, like the algorithm developed by Toussaint and Vijayakumar (2005) to deal specifically with discontinuous functions. Prediction results are shown in Figure 7 for IMLE and a single-valued learner. We do not undertake a full comparison with other algorithms for learning discontinuous functions, as the main point here is the proof of concept that IMLE (and other multivalued learning algorithms) can also efficiently approximate discontinuous functions without any special modifications.
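The single-valued prediction step just described, picking the solution backed by the largest total expert weight, amounts to a simple argmax; the function and argument names below are illustrative:

```python
import numpy as np

def dominant_solution(solutions, weights_per_solution):
    """Single-valued prediction from a multivalued set: pick the
    solution whose contributing experts carry the largest total
    activation weight, allowing sharp transitions at
    discontinuities instead of averaged (oversmoothed) outputs."""
    totals = [np.sum(w) for w in weights_per_solution]
    return solutions[int(np.argmax(totals))]
```

As the query input crosses a discontinuity, the total weight shifts abruptly from the experts on one side to those on the other, so the returned solution jumps rather than interpolates.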
5.2.2. The Humanoid iCub Robot.
The next experiment evaluates IMLE's multivalued prediction capabilities in a higher-dimensional problem. We consider the iCub, an anthropomorphic robot used for research on human cognition and artificial intelligence (Sandini, Metta, & Vernon, 2004). We simulated random trajectories in a seven-dimensional input space consisting of the joint angles of the robot waist (yaw, roll, and pitch) and right arm (shoulder yaw, pitch, and roll, and elbow flexion), the output space being the 3D position of the end effector. During the training phase, after acquiring 100,000 data points, a 28 cm tool was introduced in the kinematic chain of the arm, effectively changing the end-effector position. After 100,000 more training points, the tool was removed and training resumed. From the viewpoint of the learning algorithm, this unsignaled change of the kinematic structure can be represented by a multivalued function with two distinct branches corresponding to the tool and no-tool contexts. Traditional single-valued approximators must forget the previously learned map when the tool is introduced in order to learn the new kinematics. This wastes computational resources, as the original map must be learned again when the tool is removed and the original kinematic structure is once again presented to the robot. IMLE can keep both situations stored in its internal model: there is almost no increase in the RMSE when the tool is removed, as can be seen in Figure 8, where, for comparison with a single-valued learner, we also show the results of training LWPR on the same data set.
5.2.3. Simultaneous Learning of Forward and Inverse Models.
In the following experiments, we show how IMLE can use the learned mixture to predict both forward and inverse maps. We use the PUMA 560 robot arm described in section 5.1, controlling its first three joints to position the end effector. While the forward map for this configuration is a single-valued function, the inverse kinematics is multivalued and can exhibit up to four solutions. As a second example, we consider the 3-RPR parallel manipulator described in Merlet (2006). It consists of an end effector connected to a fixed base through three prismatic links, each attached to the base and end effector by free, unactuated rotational joints. Its movement is restricted to the x-y plane: actuating the link lengths L1, L2, and L3 changes the x-y end-effector position and orientation on this plane. Parallel robots typically exhibit a duality with serial chains with respect to the nature of the forward and inverse kinematics: while their inverse relation is usually unique and straightforward to calculate, obtaining a closed formula for the end-effector position and orientation as a function of the actuator values is difficult and frequently yields multiple valid solutions. This mechanism is known to have up to six different solutions for the same actuator configuration, which makes learning its forward kinematics infeasible for most standard single-valued function approximation techniques.
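The unique inverse map of the 3-RPR mechanism mentioned above is indeed straightforward: each prismatic link length is the distance between its base anchor and the corresponding platform anchor after the end effector's rotation and translation. The anchor coordinates below are placeholders, not the geometry from Merlet (2006):

```python
import numpy as np

# Illustrative anchor geometry (base and platform attachment points
# in the x-y plane); the actual 3-RPR studied in Merlet (2006) has
# its own coordinates.
BASE = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.9]])
PLATFORM = np.array([[-0.1, 0.0], [0.1, 0.0], [0.0, 0.15]])

def inverse_kinematics(x, y, phi):
    """Unique inverse map of a planar 3-RPR parallel manipulator:
    returns the three prismatic link lengths (L1, L2, L3) for an
    end-effector pose (x, y, phi)."""
    c, s = np.cos(phi), np.sin(phi)
    R = np.array([[c, -s], [s, c]])
    tips = np.array([x, y]) + PLATFORM @ R.T   # rotated, translated anchors
    return np.linalg.norm(tips - BASE, axis=1)
```

This closed-form inverse is exactly why IMLE finds a single inverse solution per test point for this mechanism, while the forward map (actuator lengths to pose) has no such closed form and is multivalued.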
We trained IMLE over 1 million points taken from a simulated random trajectory for both manipulators, adding gaussian noise to the outputs, with a standard deviation equal to 1/100 of the output range. We restricted the movement of the 3-RPR parallel manipulator to a square of 40 cm by 40 cm in the center of the mechanism, while the orientation angle was constrained to a limited interval. We ran 10 different trials, testing the resulting final mixtures on a noiseless random sequence of 100,000 points. The results are shown in Figure 9. Figure 9c also presents, for the PUMA robot, the frequency of the number of solutions found by IMLE inverse prediction, comparing it to the true value obtained by explicitly solving the inverse kinematics equations. The discrepancy between these numbers can be explained by the fact that, close to the workspace boundary of the PUMA 560, there are pairs of inverse solutions that become close to each other. In this situation, IMLE tends to merge these solutions. Lowering the value of the corresponding significance parameter would reduce this behavior; however, due to the curvature of the map to learn, this would have the undesired side effect of predictions from neighboring experts being erroneously taken as separate solutions. In general, choosing a value for this parameter is a compromise between these two effects. Nevertheless, IMLE still achieves a good inverse prediction error, since at the workspace boundary the merged solution provided by IMLE is approximately the average of two reasonably similar true solutions. As for the 3-RPR manipulator, IMLE found on average 1.88 forward solutions per test point, which, given the constrained workspace, agrees with the expected number of solutions. For every point in the test set, only a single inverse solution was found, in total agreement with the single-valued nature of the inverse kinematics of this kind of mechanism.
6. Discussion and Conclusion
The infinite mixture of linear models presented in this letter is, at its core, an algorithm that can efficiently deal with nonlinear function approximation in an online, incremental manner, comparable to current state-of-the-art online learning methods. It consists of a collection of linear experts, together with appropriate priors on the parameters being learned and a mechanism that efficiently grows the number of experts when newly acquired samples are poorly fitted by the current mixture. Its training is based on the generalized EM algorithm, where the expectation step is extended to allow for incremental updating of the sufficient statistics of the mixture of experts and the maximization step includes the allocation of a new expert each time a training point is poorly explained by the current mixture. Put together, this results in a very fast and scalable online learning algorithm, capable of processing hundreds or thousands of samples per second coming from continuous streams of data. This is a difficult learning setting, since when no knowledge is provided about the characteristics of the function to learn, this information must be estimated from a sequence of correlated training data that may correspond to only a small subset of the full input-output space. We tested IMLE in this kind of situation for single-valued regression and showed how it could equal or even surpass LWPR, a current state-of-the-art online learning algorithm, in terms of convergence of prediction error and number of allocated models.
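As an illustration of the incremental bookkeeping described above, the following sketch maintains weighted sufficient statistics for a single linear expert. It is a simplified stand-in, not the actual IMLE update: the class name, fields, and the assumption that a responsibility weight h is supplied externally are all illustrative.

```python
import numpy as np

class LinearExpert:
    """Minimal sketch of an expert keeping incremental sufficient statistics
    for a weighted linear fit.  In a full EM scheme, the responsibility h of
    each sample would come from the E-step."""

    def __init__(self, dim_in, dim_out, ridge=1e-6):
        self.Sxx = ridge * np.eye(dim_in + 1)       # weighted scatter of [x; 1]
        self.Sxy = np.zeros((dim_in + 1, dim_out))  # weighted cross-moments

    def update(self, x, y, h):
        z = np.append(x, 1.0)             # homogeneous input (adds intercept)
        self.Sxx += h * np.outer(z, z)    # incremental accumulation
        self.Sxy += h * np.outer(z, y)

    def predict(self, x):
        # Weighted least-squares solution from the accumulated statistics.
        beta = np.linalg.solve(self.Sxx, self.Sxy)
        return np.append(x, 1.0) @ beta

expert = LinearExpert(dim_in=1, dim_out=1)
for x in np.linspace(0.0, 1.0, 50):
    expert.update([x], [2.0 * x + 1.0], h=1.0)  # samples from y = 2x + 1
```

Because the statistics are simple running sums, each sample is processed in constant time, which is what makes this kind of update compatible with continuous data streams.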
However, a distinctive feature of IMLE, when compared to other online supervised learning algorithms, is its ability to deal with multivalued estimates for the same query data. The applications for such problems range from learning forward models of parallel robots to learning switching models, where the function to be approximated can alternate between different configurations over time. This constitutes an even more challenging learning problem. Besides the limitations coming from learning online from a stream of data, we now face the expert allocation dilemma, where it is difficult to distinguish between noise and outliers, on one hand, and a mixture requiring a new component, on the other. Additionally, underlying nonstationary relations make the problem even more difficult to learn. Analyzing the influence of nonstationarity on multivalued prediction, relating it to the separability of multivalued function branches, is a topic for further research.
We have also shown in this letter that the same procedure used to obtain a set of multivalued forward solutions can be applied to inverse queries, making IMLE capable of delivering both forward and inverse multivalued predictions from the same model, without the need for further training. This is a consequence of directly learning a multivariate output relation from ℝ^d to ℝ^D instead of a set of D distinct univariate output maps. As seen in the previous section, such multivalued learning capability can also prove useful when learning discontinuous functions, for which an undesirable prediction smoothing typically occurs in the vicinity of the discontinuities when using standard function approximation algorithms.
A current limitation of the proposed algorithm is the lack of a mixture shrinking mechanism that would be responsible for removing either experts providing wrong predictions or redundant mixture components. The experiments presented in section 5 show that the IMLE model will activate new experts more and more sparingly as training progresses, but eventually, after a lifetime of learning, too many experts may become activated, resulting in an increased consumption of computational resources. Moreover, in the event of some episodic outlier bursts, some experts may be activated to represent such erroneous training data. We do not yet have a definite answer to this problem. As discussed in section 3.2, growing the mixture incrementally for multivalued data is a delicate matter, and shrinking the same mixture in a principled way is even more troublesome. The main difficulty here is that there is no simple way to detect experts wrongly activated by an outlier (or a sequence of outliers), as they cannot easily be distinguished from experts originating from sporadic training on a new branch of the multivalued function being learned. Such “incorrect” experts can be detected, for instance, by a strong deviation from the characteristic input length scale or output noise (a strong disagreement between an expert's estimates and those of the rest of the mixture) or by low support on training data (low accumulation of sufficient statistics after a long period of training). However, removing an expert under this condition does not come without the risk of lowering the likelihood of the training data, since no guarantees exist that such an expert does not represent the true distribution of the multivalued function being learned. In general, choosing an expert to be removed from the mixture—due, for instance, to capacity limitations imposed on the mixture—is not easy to do without access to the full set of training points.
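A low-support pruning test of the kind mentioned above could be sketched as follows. The field names and thresholds are purely hypothetical, since no such mechanism exists in the current algorithm, and, as discussed, a pruned expert might in fact represent a genuine branch of the multivalued function.

```python
# Sketch of a support-based pruning heuristic (not part of the published
# IMLE algorithm): drop experts whose accumulated sufficient statistics
# remain low after a long training period.  All names are illustrative.
def prune_low_support(experts, samples_seen, min_support=10.0, min_age=10000):
    kept = []
    for e in experts:
        too_weak = e["support"] < min_support           # little accumulated weight
        old_enough = samples_seen - e["created_at"] > min_age
        if too_weak and old_enough:
            continue  # candidate outlier-induced expert: remove it
        kept.append(e)
    return kept

experts = [{"support": 0.5, "created_at": 0},    # likely spawned by outliers
           {"support": 50.0, "created_at": 0}]   # well-supported expert
survivors = prune_low_support(experts, samples_seen=20000)
```

The age condition is what distinguishes a genuinely unsupported expert from one that was only recently allocated and has not yet had time to accumulate statistics.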
Creating a probabilistically supported shrinking mechanism for the mixture is surely an improvement to the IMLE model that we would like to investigate in subsequent work.
IMLE has a large space for customization through the choice of different priors for the mixture parameters, the most evident being the prior on the regression coefficients. Feature selection is a very desirable property of a function approximation algorithm that allows learning the input subspace that effectively contributes to output variation. This typically makes the prediction more stable and less influenced by irrelevant or strongly correlated input dimensions. In a Bayesian setup, feature selection can be implemented if adequate priors for the loading matrices are defined. Adopting a gaussian prior, as IMLE currently does, leads to ridge regression (Hoerl & Kennard, 1970), while a Laplacian prior induces the LASSO (Tibshirani, 1996). Both priors involve the choice of a hyperparameter to control the degree of regularization and sparseness. Figueiredo (2003) uses a Jeffreys prior to overcome the necessity of such a hyperparameter. More recently, Ting, D'Souza, Vijayakumar, and Schaal (2010) proposed a slightly different formulation of the generative model corresponding to the linear regression performed by each expert, which, together with a careful choice of priors for its elements, can lead to fast and efficient high-dimensional feature selection and regression. In principle, any of these priors can be integrated into IMLE; this is a topic for future work. Another research direction is the inclusion of Dirichlet priors for mixture components, turning IMLE into a fully Bayesian learning method. In this situation, the major challenge would be retaining the online and scalability properties of the algorithm.
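For reference, the MAP estimate under a gaussian prior on the coefficients is ordinary ridge regression, sketched here on toy data; the data and the regularization weight are arbitrary choices for illustration.

```python
import numpy as np

# Ridge regression (Hoerl & Kennard, 1970) as the MAP estimate under a
# gaussian prior on the coefficients: beta = (X'X + lam*I)^{-1} X'y.
def ridge_fit(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
true_beta = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ true_beta + 0.01 * rng.standard_normal(200)
beta_hat = ridge_fit(X, y, lam=0.1)
```

A Laplacian prior instead yields the LASSO, which, unlike the closed form above, requires an iterative solver but drives irrelevant coefficients exactly to zero, hence its appeal for feature selection.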
Appendix: Posterior Distributions for IMLE
We thank the anonymous reviewers for their valuable comments and suggestions, which greatly contributed to improving the quality of this letter. We are also grateful to Lorenzo Jamone for sharing the iCub data set.
LWPR 1.2.3: http://wcms.inf.ed.ac.uk/ipab/slmc/research/software-lwpr. SOGP 2.0, ROGER 1.5: http://cs.brown.edu/people/dang/code.shtml.
IMLE 1.1: http://users.isr.ist.utl.pt/~bdamas/IMLE.
This is probably explained by a weaker tendency to overfit as the output dimension increases.
This behavior was unfortunately observed in the following experiments, and so ROGER was removed from the remaining tests. Note that the good ROGER results shown in Figure 4b were a consequence of a careful choice of its parameters: small changes in these values would typically lead to more than two solutions being predicted in many regions of the input space.