## Abstract

We present a supervised learning algorithm for estimating generic input-output relations in a real-time, online fashion. The proposed method is based on a generalized expectation-maximization approach to fit an infinite mixture of linear experts (IMLE) to an online stream of data samples. This probabilistic model, while not fully Bayesian, can efficiently choose the number of experts allocated to the mixture, effectively controlling the complexity of the resulting model. The result is an incremental, online, and localized learning algorithm that performs nonlinear regression on multivariate inputs and outputs, approximating the target function by a linear relation within each expert input domain, and that can allocate new experts as needed. A distinctive feature of the proposed method is its ability to learn multivalued functions, one-to-many mappings that naturally arise in some robotic and computer vision learning domains, using an approach based on a Bayesian generative model for the predictions provided by each of the mixture experts. As a consequence, it can directly provide forward and inverse relations from the same learned mixture model. We conduct an extensive set of experiments to evaluate the performance of the proposed algorithm; the results show that it can outperform state-of-the-art online function approximation algorithms in single-valued regression, while demonstrating good estimation capabilities in a multivalued function approximation context.

## 1. Introduction

Online and incremental learning algorithms that can efficiently deal with high-dimensional streams of data are increasingly required in the robotics community. As humanoid robots evolve and become more complex, their sensorimotor relations become more difficult to understand, leading to situations where analytical models fail to provide accurate approximations of these sensorimotor maps. These situations range from lack of knowledge of certain hard-to-measure physical parameters (e.g., friction) to highly nonlinear physical interactions, such as actuator nonlinearities and unmodeled mass distributions (Peters & Schaal, 2006; Nguyen-Tuong & Peters, 2008). Resorting to modern learning techniques is, in these cases, the only way to provide these systems with the necessary representation capability. Approaches such as neural networks (Bishop, 1995), memory-based learning (Hastie & Loader, 1993; Atkeson, Moore, & Schaal, 1997), support vector regression (SVR) (Smola & Schölkopf, 2004), and gaussian process regression (GPR) (Rasmussen & Williams, 2006), among many others, have been successfully applied to robotic learning tasks in the past few decades.

Many of these learning techniques, however, are not suitable when we desire to achieve continuous and autonomous operation, such that, in a biologically plausible way, the robot continuously learns and adapts as it interacts with the surrounding environment. In this setup, an offline learning phase does not exist. As a consequence, online learning algorithms are needed that can keep their computational demands low in the presence of potentially very large sensorimotor spaces to explore. This rules out many state-of-the-art algorithms like SVR and GPR: these methods fit nonlinear functions globally and require, in their original form, the presence of all training points in memory and a computational effort that grows rapidly with the number of samples to process. Other methods, like neural networks, can suffer from slow convergence and destructive interference when learning new data, a problem reported, for example, by Schaal and Atkeson (1998). To overcome this computational burden, one can identify two major trends: methods based on global, sparse approximations (Csató & Opper, 2002; Ma, Theiler, & Perkins, 2003; Quiñonero-Candela & Rasmussen, 2005) try to keep a representative reduced set of training samples to represent the function to approximate, while a second group of methods is based on local approximations of the function to learn (Meeds & Osindero, 2006; Nguyen-Tuong & Peters, 2008). The former class of methods may become inadequate in the presence of large streams of online training data, as it tries to represent the function to learn in a global manner and is particularly sensitive to shifts in the input data distribution.

Local methods have some strong support in neurobiology (see the references in Atkeson et al., 1997, for instance), and their ability to perform localized learning seems more suitable for online operation: they can often be seen as probabilistic mixtures of models, where each model describes the function to be learned in a particular region of the input space. They have their origins in the mixture of experts (ME) concept (Jacobs, Jordan, Nowlan, & Hinton, 1991; Jordan & Jacobs, 1994), where competing experts, assigned to different zones of the input space, are responsible for generating a corresponding output, with the final prediction produced by a gating network that combines their outputs. Particularly suited for real-time operation, mixtures of local linear models avoid the need to keep all training points in memory, achieving a dramatic increase in the speed of operation while possibly sacrificing the better approximation errors typical of state-of-the-art GP methods. Xu, Jordan, and Hinton (1995) provided one of the earliest examples of such a mixture of linear experts architecture, trained using the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977); in its simplest form it is an offline algorithm, for which the number of components (the number of experts) must be set beforehand. Sato and Ishii (2000) suggest an online version of the ME, based on the introduction of a discount factor in the update of the sufficient statistics vector of the mixture. Training, again, is done using the EM algorithm, although the regularization and allocation of new experts are performed in a somewhat heuristic way. LWPR (Vijayakumar, D'Souza, & Schaal, 2005) and XCSF (Wilson, 2002) are two other popular nonprobabilistic algorithms based on this mixture of linear experts concept.
LWPR has been widely used for online, real-time learning of robotic tasks; it uses a gradient descent on the prediction error, based on a stochastic leave-one-out cross-validation algorithm, to adapt the distance metrics of the receptive fields that partition the input space. Within each receptive field, a linear relation from input to output is obtained using an incremental partial least squares algorithm that efficiently deals with redundant and high-dimensional input spaces. XCSF updates the input distance metric of each model resorting to a steady-state genetic algorithm, while each linear model is fitted using a recursive least squares algorithm.

Unfortunately, almost all of the current state-of-the-art learning algorithms fail to deal with multivalued functions; these relations, also known as multifunctions or multimaps, are one-to-many maps where a single input can be associated with one or more different outputs and can be thought of as inverse relations for noninjective functions. They naturally arise in some robotic domains, such as perceptual aliasing in reinforcement learning (Whitehead & Ballard, 1991) or robotic learning from demonstration (Chernova & Veloso, 2008; Grollman & Jenkins, 2010). However, the most evident example of multivalued functions comes from serial and parallel robot kinematics that relate actuated joint space variables to the task space configuration vector of the robot. The forward kinematics for serial robots is straightforward to obtain and consists of a proper, single-valued function of the joint variables, while the inverse kinematics for such type of manipulators usually exhibits multiple joint space solutions for the same task space variable, even for nonredundant robots, where the task and joint space have the same dimension (Craig, 1989). The opposite situation occurs with parallel robots. Now, the same value of joint space variables may correspond to different values of the task space vector. Inverse kinematics, however, is usually easy to obtain and provides only a single solution for every task space configuration (Merlet, 2006).

For many supervised learning algorithms, this lack of ability to obtain different multivalued predictions for the same input query point is a result of assuming the single-valued model during the training phase, either explicitly or by allowing some sort of interference across multiple solution branches. As a consequence, many of the learning methods described can provide only single-valued predictions, and no inverse predictions are directly available from the learned forward model in its original form. D'Souza, Vijayakumar, and Schaal (2001), for instance, used LWPR to learn inverse kinematics for serial robots, but this approach, like other inverse kinematics learning methods, obtains only a single, local inverse solution by application of some sort of constraining or optimization criterion.

Three major approaches to the multivalued prediction problem that do not rely on the single-valued model can be devised. The first approach explicitly takes the multivalued model into account and mostly relies on artificial neural networks, using recurrent neural networks (Tomikawa & Nakayama, 1998), feedforward neural networks (Lee & Lee, 2001; Brouwer, 2004), or regularization networks that provide a direct algebraic representation of the multivalued function to learn (Shizawa, 1994). These implementations, however, were tested only on very low-dimensional toy problems, leaving their performance in more demanding problems an open question.

Alternatively, as Ghahramani and Jordan (1994) suggested, unsupervised learning algorithms can be applied to supervised learning tasks by means of the conditional densities obtained from the learned joint density function over both inputs and outputs. In principle, this makes it possible to obtain multivalued predictions for both forward and inverse relations (Lopes & Damas, 2007). Different techniques exist for learning the joint density function of the data; the mixture model is a popular approach, for which specializations exist such as the mixture of gaussians (Ghahramani & Jordan, 1994) or the mixture of factor analyzers (Ghahramani & Hinton, 1997). One of the major problems with this type of model is the choice of the number of components that constitute the mixture; several extensions to the mixture model exist that are capable of automatically selecting an appropriate number of components. The infinite mixture concept presented by Rasmussen (2000), for instance, is a Bayesian method that assigns a Dirichlet process prior on the mixing proportions of the mixture, responsible for the automatic generation of the correct number of components. This, however, usually requires expensive offline computational training based on Markov chain Monte Carlo sampling methods, not suitable for real-time learning, although some variational techniques can be used to accelerate the training process. Other extensions rely on the EM algorithm, either using a greedy approach to grow the mixture to an appropriate number of components (Vlassis & Likas, 2002) or, using the opposite idea, starting with a large number of components and, using carefully chosen priors, automatically shrinking the number of components of the mixture in a Bayesian fashion (Figueiredo & Jain, 2002). 
Unsupervised learning, however, is a more difficult problem than its supervised counterpart and usually leads to worse results, as it ignores that the joint data, apart from noise corruption, lie on a lower-dimensional manifold. This is the main drawback of the unsupervised approach to multivalued learning and prediction: it typically converges to suboptimal solutions that do not take the problem structure into account.

The last, and perhaps the most promising, approach to multivalued learning is to consider an ME framework, where different experts are allowed to share the same input space regions, while allocated to different solutions in the output space. This is done, for instance, in Rasmussen and Ghahramani (2002) and Meeds and Osindero (2006), using a mixture of gaussian processes (GP) experts, or in Bishop (1994) and Qin and Carreira-Perpinán (2008), resorting to the mixture density network (MDN). These examples, however, fail to deal with two aspects of the learning process that we would like to achieve. First, they are not originally designed for online operation, assuming the training to be done in a batch, offline way. The work of Grollman and Jenkins (2010) is based on the mixture of GP experts of Meeds and Osindero (2006) and changes it to allow for an online operation of the algorithm, using a sparse GP model for each of the experts; however, a compromise between accuracy and speed reduces its flexibility, since training requires performing both a computationally expensive Monte Carlo integration over particles and an equally expensive sparse GP online training, which severely reduces the algorithm performance in medium- to high-dimensional learning problems, as discussed in section 5. MDN does not have such computational cost, but it requires the number of experts to be set beforehand. Second, no explicit inversion of the relation to be learned is available; this means that two distinct models, for inverse and forward prediction, must be separately trained, for instance, when trying to simultaneously learn direct and inverse kinematics.

The infinite mixture of linear experts (IMLE) proposed in this letter builds on the mixture of linear models concept to obtain an online, real-time supervised learning algorithm that is able to learn multivalued functions while allocating experts to the mixture as needed, achieving, for single-valued regression, performance at least comparable to state-of-the-art online learning methods like LWPR. It is, at its core, an infinite version of the model for mixture of linear experts given by Xu et al. (1995) and Sato and Ishii (2000), with a careful choice of priors for some of its parameters and the adoption of a generalized EM approach that allows an online operation based on the concepts present in Neal and Hinton (1999) and Cappé and Moulines (2009) and the automatic allocation of experts to the mixture in order to adapt to the complexity of the function to be learned. The training procedure automatically rejects interference among different multivalued function branches, and by considering simple linear local relations from input to output, it is possible to easily obtain the inverse prediction from each local expert. The probabilistic model underlying the proposed algorithm is described in section 2, and the training procedure is presented in section 3. This is the first major contribution of this letter.

After the probabilistic model has been trained, an important question that remains is how to generate a possibly multivalued set of predictions for a given query data point. Qin and Carreira-Perpinán (2008), for instance, use an MDN and obtain the set of modes for the resulting conditional distribution. We instead suggest a different approach, based on clustering the individual expert predictions with a predictive probabilistic model that relates multivalued solutions to individual expert predictions. This is the topic of section 4, where we describe how to obtain forward and inverse multivalued predictions for any query, given the current state of the IMLE model. This topic is the second main contribution of this letter. Finally, section 5 provides a large set of experimental evaluations of the proposed algorithm, testing it in both single-valued and multivalued prediction situations. Section 6 provides the concluding remarks.

## 2. Probabilistic Model

Here *w*_{ij} denotes a latent or hidden indicator variable that equals 1 if data point *i* was generated by linear model *j* and 0 otherwise, with exactly one indicator active per data point (sometimes we will use the shorthand notation *w*_{ij} to denote the event *w*_{ij} = 1, as in the above equations). These variables can be gathered, for each *i*, in a binary vector *w*_{i} whose *j*th component is equal to *w*_{ij}. The parameter *m*_{j} indicates whether expert *j* is activated, effectively contributing to the mixture: it is equal to 1 if expert *j* is active and 0 otherwise. Each of the latent indicator variables *w*_{ij} consequently has an associated probability of 1/*M* if linear model *j* contributes to the mixture, where *M* is the total number of active experts. This uniform probability distribution on the latent variables is, in our opinion, a more natural approach to the online regression problem, since fully learned mixture coefficients depend heavily on the input training data distribution, which can vary greatly in an online data acquisition setting. Although we use a probabilistic mixture representation, the ultimate goal of this model is to describe a mapping from inputs to outputs, and in this context assigning the same importance to different parts of this mapping seems to make more sense.

Given *w*_{ij}, input *z*_{i} follows a normal distribution whose parameters are the expert input center and covariance, while output *x*_{i} follows a linear relation from *z*_{i}, with an output offset, a design matrix, and a diagonal covariance matrix corresponding to uncorrelated gaussian noise in *x*. This model, apart from the uniform distribution for *w*_{ij}, is similar to the one presented in Xu et al. (1995) and Sato and Ishii (2000), where each expert *j* models a linear relation from input *z* to output *x* in some region of the input domain, defined by the expert input center and covariance, this way softly partitioning the input space among the experts.

The prior on the activations *m*_{j} imposes an increasing penalty on the number of linear experts the learning phase tries to allocate; we detail this in section 3.2. As for the remaining priors, a multivariate inverse-wishart distribution is placed on the expert input covariance, normal priors on the input center, the output offset, and the rows of the design matrix, and univariate inverse-gamma priors on the diagonal elements of the output noise covariance. A set of constants determines the strength of the respective priors, expressed as an equivalent number of "fake" data points. We choose these distributions for convenience, since they are conjugate priors for the observed data distribution. The purpose of the common elliptical prior on the input covariance is threefold: it introduces some regularization, so that the covariance always has an inverse; it ensures that the experts' input region shapes do not differ too much from each other; and it prevents nonneighboring experts from competing for the same data in the initial phase of the learning process of each expert (a serious problem in ME models, referred to, for instance, in Schaal and Atkeson, 1998, and Vijayakumar, D'Souza, and Schaal, 2005), thus enforcing the principle of localized learning. The normal prior on the input center, on the other hand, controls the degree of mobility of this parameter: a vanishing prior strength makes it depend solely on the training data, while an infinite strength leads to a fixed center, as typically occurs in radial basis networks or in LWPR; the prior on the output offset has the same purpose. Finally, the inverse-gamma prior on the output noise defines a kind of average noise all experts share, while the prior on the rows of the design matrix performs a coefficient shrinkage similar to ridge regression. Its main purpose, however, is to impose a regularization mechanism that makes the matrix inversion required for the estimation of the design matrix full rank. Such a prior introduces some bias in the expert prediction, and its strength should consequently be kept low in order to make this undesired effect negligible.

Hyperpriors are additionally placed on the *k*th diagonal elements of the diagonal matrices representing the initial guesses for the common output noise and input length-scale. Free parameters control these hyperpriors' strength: setting them to zero, we get the corresponding uninformative priors. The (infinite) parameter vector that defines this mixture, to be learned from the data, gathers the parameters of all experts; the graphical model corresponding to the probabilistic model is shown in Figure 1.

## 3. Training

Given **Z**, a collection of input training data, and **X**, the corresponding output, we would like to obtain the posterior distribution over the mixture parameters and use it to obtain *p*(**x** | **z**_{q}, **Z**, **X**) and *p*(**z** | **x**_{q}, **Z**, **X**), respectively, the forward and inverse posterior predictive distributions. This is, however, intractable without also considering the latent variables **W**. Although several variational or Monte Carlo Bayesian methods exist to approximate this posterior, in this letter we use the EM algorithm (Dempster et al., 1977) to find a maximum a posteriori (MAP) estimate for the unknown parameters, due to its easy adaptation to online learning schemes. The EM algorithm alternates between the expectation step (E-step), which takes the expectation of the complete data log likelihood over the latent variables **W**, for the current value of the parameters, and the maximization step (M-step), which finds the new value of the parameters given the previous expectation.

*W*### 3.1. E-Step.

The Q-function is linear in the latent indicators *w*_{ij}, and hence it suffices, to obtain it, to calculate *h*_{ij}, the estimate of the posterior probability that data point *i* was effectively generated by expert *j*, also called the responsibility that expert *j* has generated data point *i*. Since *w*_{ij} depends only on *x*_{i} and *z*_{i}, this responsibility follows from Bayes' theorem.
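The responsibility computation can be sketched as follows; the expert parameter names (`nu`, `Sigma`, `mu`, `Lam`, `Psi`) and the dictionary interface are our placeholders, following the linear-expert model of section 2:

```python
import numpy as np

def gauss_pdf(v, mean, cov):
    """Multivariate normal density evaluated at v."""
    d = np.atleast_1d(v - mean)
    cov = np.atleast_2d(cov)
    quad = d @ np.linalg.solve(cov, d)
    norm = np.sqrt((2 * np.pi) ** len(d) * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def responsibilities(z, x, experts):
    """Posterior probabilities h_j that a point (z, x) was generated by
    each active expert, via Bayes' theorem with a uniform 1/M prior.

    Each expert is a dict with hypothetical keys: nu, Sigma (input
    center and covariance), mu, Lam (output offset and design matrix),
    and Psi (diagonal output noise variances).
    """
    lik = np.empty(len(experts))
    for j, e in enumerate(experts):
        p_z = gauss_pdf(z, e["nu"], e["Sigma"])
        p_xz = gauss_pdf(x, e["mu"] + e["Lam"] @ (z - e["nu"]), np.diag(e["Psi"]))
        lik[j] = p_z * p_xz          # joint density under expert j
    return lik / lik.sum()           # uniform prior 1/M cancels out
```

With a uniform prior over active experts, the mixing coefficient cancels in the normalization, so the responsibilities depend only on each expert's joint density at the training point.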

The M-step requires *S*^{t}, the expected value of the sufficient statistics vector, given the observed data and the current value of the parameter vector. For the mixture model, equations 2.1 and 2.2, it comprises the terms *S*^{t}_{hj}, *S*^{t}_{hzj}, *S*^{t}_{hxj}, *S*^{t}_{hzzj}, *S*^{t}_{hxxj}, and *S*^{t}_{hxzj}, for each active expert *j*, defined as responsibility-weighted sums over the data of, respectively, a constant, the inputs, the outputs, and their outer products. We also define, for convenience, an aggregate of the *S*^{t}_{hj} terms over all active experts.

The online version of the algorithm consists, at iteration *t*+1, of performing an update of the sufficient statistics using solely some data point *i*, according to *S*^{t+1} = *S*^{t} + *s*^{t+1}_{i} − *s*^{t}_{i}, instead of using the whole data set as in equations 3.4. For a continuous stream of data, each point is visited and used only once, and thus its index *i* can be associated with the corresponding iteration number *t*; the partial E-step then uses only the contribution of data point *i* = *t*+1. We follow a more recent result by Cappé and Moulines (2009), where more general conditions for convergence of online EM algorithms are provided and where an E-step of the form of a stochastic approximation update is suggested: the new sufficient statistics are a convex combination of the previous ones and the contribution of the current data point, weighted by a step size. Choosing step sizes that satisfy the usual stochastic approximation conditions guarantees the algorithm convergence under some mild assumptions, while introducing a time decay in the sufficient statistics that may be beneficial when slowly time-varying data are presented to the algorithm. Such a situation may typically occur within the context of robotic applications. Equation 3.6 can be reformulated using an equivalent set of sufficient statistics, then becoming the decaying statistics formulation presented in Sato and Ishii (2000). A unit discount corresponds to an accumulation of the sufficient statistics with no forgetting over time, equivalent to equation 3.5.
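The stochastic-approximation E-step above can be sketched for a single scalar sufficient statistic; the step-size schedule and constants below are illustrative choices of ours, not prescribed by the letter:

```python
import numpy as np

def online_e_step(S, s_new, t, alpha0=1.0, kappa=0.6):
    """One online E-step update: S <- (1 - a_t) * S + a_t * s_new.

    The step size a_t = alpha0 / t**kappa with 0.5 < kappa <= 1
    satisfies the usual stochastic approximation conditions
    (sum a_t diverges, sum a_t**2 converges).
    """
    a_t = alpha0 / t ** kappa
    return (1.0 - a_t) * S + a_t * s_new

# On an i.i.d. stream, S tracks the mean of the per-sample statistics.
rng = np.random.default_rng(1)
S = 0.0
for t in range(1, 5001):
    S = online_e_step(S, rng.normal(2.0, 1.0), t)
```

Step sizes that decay more slowly than 1/*t* keep some forgetting in the statistics, which is what makes the update robust to slowly time-varying data.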

### 3.2. M-Step.

Maximizing the Q-function with respect to the expert parameters yields closed-form updates based on the sufficient statistics at iteration *t*, where diag(·) denotes a diagonal matrix equal to the diagonal of its argument. For the common input variance parameter, however, we must obtain the partial derivatives of the Q-function with respect to each of its elements and equate them to zero; the resulting expression involves the *k*th element of the diagonal of the inverse of the input covariance and *M*^{t}, the effective number of experts in the mixture at iteration *t*. When we use the same procedure, a similar result holds for the common output noise parameter.

Equations 3.7b and 3.7f, on one hand, and 3.7e and 3.7g, on the other, are coupled, without an explicit closed-form solution for the parameters being estimated. To deal with this issue, the maximization step is relaxed, and each of these parameters is maximized individually, conditional on the others remaining fixed. This corresponds to the expectation conditional maximization algorithm, a particular case of the generalized variant of the EM algorithm, where the M-step is modified to an update that improves the Q-function without necessarily maximizing it (Dempster et al., 1977). Solving for the values of the activations that maximize the likelihood is, however, intractable, as it requires evaluating all the infinite combinations of values for *m _{j}* and picking the one that maximizes the Q-function.
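The relaxed M-step amounts to coordinate-wise conditional maximization. As a minimal illustration of the idea (the quadratic objective and variable names below are a toy of ours, not the letter's equations 3.7), each update maximizes the objective in one coupled parameter with the other held fixed, so the objective never decreases:

```python
def ecm_round(a, b, n_iter=50):
    """Alternate conditional maximizations of the toy objective
    Q(a, b) = -(a - b)**2 - (a - 2)**2 - (b - 1)**2, whose two
    stationarity conditions are coupled.  Each step applies the
    closed-form conditional maximizer of Q in one variable.
    """
    for _ in range(n_iter):
        a = (b + 2) / 2   # argmax_a Q(a, b), b fixed
        b = (a + 1) / 2   # argmax_b Q(a, b), a fixed
    return a, b
```

For this toy objective, the iteration contracts to the joint maximizer (a, b) = (5/3, 4/3); in the IMLE M-step the same scheme resolves the coupling between equations 3.7b/3.7f and 3.7e/3.7g.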

#### 3.2.1. Growing the Mixture.

Activations *m _{j}* define the number of experts constituting the current mixture and play a key role in defining the complexity of the global probabilistic model. Choosing the appropriate number of components for a mixture is a difficult problem, and several methods have been proposed to deal with it. Among them, Bayesian methods provide an elegant framework that automatically generates a trade-off between the fitness of the data to the model and the complexity of the same model. Moreover, the infinite mixture models based on the Dirichlet process nonparametric prior for the mixing coefficients allow for generative models where the number of components of the mixture is not defined a priori (Antoniak, 1974; Rasmussen & Ghahramani, 2002; Meeds & Osindero, 2006).

Unfortunately, training these infinite mixtures usually requires either computationally expensive Markov chain Monte Carlo sampling methods or variational approaches that typically rely on some sort of truncation that imposes a bound on the admissible number of components for the mixture. Furthermore, the online operation we are trying to achieve in this work imposes some additional difficulties. In an offline setting, Bayesian methods can in principle efficiently grow or annihilate mixture components, but when operating online, the full set of training points is no longer available, and decisions concerning the allocation or removal of components of the mixture must be made resorting only to the most recently available training points and the current mixture state, a far more demanding learning challenge. Sato (2001) derives an online variational Bayesian algorithm for learning mixture models, but it requires the maintenance of a parallel hypothesis about the number of components that can easily become too computationally expensive. Some recent work also views the variational Bayesian learning model under an online perspective but either relies on nondeterministic approaches based on Gibbs sampling (Wang & Blei, 2012) or requires processing the training points in smaller batches of data (Gomes, Welling, & Perona, 2008). For all these methods, an adequate computational speed that allows the processing of hundreds of samples per second, as required for online learning schemes in robotic applications, is yet to be shown.

In contrast to Bayesian methods, it is more difficult to assess an optimal value for the number of components of a mixture when using an EM algorithm for training. A well-known drawback of EM-based techniques is the fact that the maximized observed data likelihood will never decrease when new components are added to the mixture. This allows the derivation of a broad class of criteria to decide when to add new components, ranging from allowing the existence of only a single component, equivalent to performing a global linear regression on the data, to the activation of a new one for each data point processed, which would correspond to a memory-based learning approach, where predictions would be made resorting to all available training data. As a consequence, most deterministic EM methods impose some kind of penalty on the number of mixture components during optimization, such as Akaike's information criterion (AIC), the Bayesian inference criterion (BIC), or the minimum message length criterion (MML), to name just a few. A comprehensive comparison and review of this kind of penalty method can be found in McLachlan and Peel (2000).

In the probabilistic model given in equations 2.1 and 2.2, the prior distribution on *m*_{j} plays the role of such a penalty, making a high number of experts increasingly less probable. Changing, at iteration *t*, the value of a particular activation *m*_{j} from 0 to 1 will affect the Q-function in equation 3.3 only through the prior term, as the responsibilities *h*^{t}_{ij} will still be equal to 0 for all *i* and this particular value of *j*, following equation 3.2. This results in a decrease of the Q-function value every time a new expert is activated in the M-step; in the absence of the prior, equation 2.2f, the Q-function value would not change.

A decision on activating a new expert can be based on the log likelihood of the next training point (*x*_{t+1}, *z*_{t+1}) under the new parameter vector. This log likelihood is given, considering also the priors on the parameters, by marginalizing the latent variables *w*_{ij} out of the complete data likelihood, summing the contributions of all active experts *j*. If, alternatively, expert *j*^{+} = *M*^{t} + 1 is activated at the end of the M-step, by setting its activation to 1 and initializing its output and input centers to *x*_{t+1} and *z*_{t+1}, respectively, we obtain an alternative log likelihood under a changed version of the parameter vector, with expert *j*^{+} activated and with the next training point likelihood now including the contribution of the new expert.

Activating expert *j*^{+} will increase the log likelihood of the next training point whenever the alternative log likelihood exceeds the original one, and this could be used as a criterion for deciding when to activate a new expert. This approach, however, will often lead to too many local models being allocated. Instead, we take the statistical approach of activating a new expert only when strong evidence supports the alternative parameter vector against the null hypothesis, the current parameter vector, which can be seen as a special case of the alternative one with one less mixture component. This suggests using a likelihood ratio test to compare them, where the test statistic approximately follows a chi-squared distribution with degrees of freedom equal to the difference in the number of free parameters between the two hypotheses (Kendall, Stuart, Ord, Arnold, & O'Hagan, 1998). At the time of activation of a new expert *j*^{+}, only its input and output centers are effectively defined, so the change in the number of free parameters is equal to *d* + *D*. Taking the critical value of a chi-squared distribution with *d* + *D* degrees of freedom, corresponding to the number of free parameters introduced in the mixture, at a significance value *p*_{0}, a new expert should be activated, according to the likelihood ratio test, when the test statistic exceeds this critical value. Since *m*_{j+} affects the priors only through the term *p*(*m*_{j+}), given by equation 2.2f, and since expert *j*^{+} parameters are equal to their prior values (expert *j*^{+} has not yet accumulated any sufficient statistics), this condition simplifies to the activation criterion of equation 3.8.

The right side of equation 3.8 introduces an increasing penalty on the activation of new experts as the value of *M* grows. The adjustable parameter *p*_{0} regulates the propensity to activate new experts: the lower its value, the higher the critical value of the chi-squared distribution will be, making the experts' activation criterion harder to meet and resulting in fewer components in the mixture. In a general way, equation 3.8 tells us that a new expert should be activated when the next acquired training point is poorly explained by the current probabilistic model. This is a sensible approach to mixture growth in online algorithms. LWPR, for instance, creates a new linear model each time an input training point *z*_{i} fails to activate the nearest receptive field by more than a given threshold. Of course, when learning multivalued functions, an activation scheme must take into account both the input and output parts of the training point, as occurs in equation 3.8.
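The likelihood ratio test behind equation 3.8 can be sketched as follows; the function interface and the way the two log likelihoods are supplied are our illustrative assumptions:

```python
from scipy.stats import chi2

def should_activate(loglik_current, loglik_with_new, d, D, p0=0.05):
    """Likelihood-ratio-style test for activating a new expert.

    A newly activated expert introduces d + D free parameters (its
    input and output centers), so twice the log-likelihood gain is
    compared against the chi-squared critical value with d + D
    degrees of freedom at significance p0.
    """
    crit = chi2.ppf(1.0 - p0, df=d + D)
    stat = 2.0 * (loglik_with_new - loglik_current)
    return stat > crit
```

Lowering `p0` raises the critical value, so the mixture grows more conservatively, mirroring the role of *p*_{0} in equation 3.8.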

#### 3.2.2. Handling Outliers.

An additional latent indicator variable *w*_{0} is considered, corresponding to outliers that are not generated by any of the currently activated experts, with an improper constant distribution and a corresponding prior. Then the probability that a training point is an outlier generated by *w*_{0}, given the current mixture parameters, follows from Bayes' rule. This posterior probability dominates the posterior probabilities for the ME experts when the point's likelihood under every active expert falls below the outlier density; at the end of iteration *t*, this happens for a new point (*x*_{t+1}, *z*_{t+1}) only under a condition very similar to equation 3.8, since the first factor on the right side of equation 3.8 quickly approaches 1 as the number of active experts *M* increases.

The probabilistic outlier model in equation 3.9 depends on the current estimates for output noise and input length-scale . This is a better approach than considering a fixed threshold, as tuning this latter parameter would require some problem domain-specific knowledge. If we consider a newly activated, not-yet-trained expert (for which , and ), can be seen as the evaluation of the probability density of the expert at a point (** x**,

**) that lies over the equidensity contour that encircles the region, centered at , corresponding to a 1−**

*z**p*

_{0}probability.

The results above show us that the mechanisms for activating a new expert and for recognizing an outlier are essentially the same and correspond to identifying training points poorly explained by the current mixture. How do we know, then, whether a training point satisfying equation 3.8 is an outlier or, alternatively, an indication that a new expert is needed in the corresponding region of the input-output space? This is a challenging problem. For single-valued regression, outliers can be detected as training points that have enough support from the current model in the input space while presenting a large deviation from it in the output space. The same, however, does not happen in multivalued regression, where such a situation may simply correspond to a yet unseen branch of the multivalued function being learned. Furthermore, the online assumption does not allow us to look at the whole training set, where identifying outliers would in principle be easier, by searching for isolated points.

This question can be answered under the assumption that the training data are temporally correlated, as is often the case in online learning. In this situation, a training point will have a high probability of being poorly explained by the current mixture if the same occurred with the previous point. In contrast, the same does not happen with statistically independent outliers. Observing two consecutive training points satisfying equation 3.8 is then more likely to be caused by a lack of fit of the points to the current model than by the occurrence of two consecutive outliers. In such an event, we decide to activate a new expert.
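This two-consecutive-points rule can be sketched as a small state machine. The function names and the caller-supplied novelty test are illustrative only, not part of the IMLE implementation.

```python
def make_novelty_monitor(is_poorly_explained):
    """Track consecutive poorly explained points: one such point is
    treated as a possible outlier; two in a row trigger the activation
    of a new expert. `is_poorly_explained` is a caller-supplied test,
    playing the role of the criterion in equation 3.8."""
    state = {"pending": False}

    def observe(point):
        if is_poorly_explained(point):
            if state["pending"]:
                state["pending"] = False
                return "activate_new_expert"
            state["pending"] = True
            return "possible_outlier"
        state["pending"] = False
        return "explained"

    return observe

# Toy novelty test: points above 5 are "poorly explained".
monitor = make_novelty_monitor(lambda p: p > 5)
assert monitor(1.0) == "explained"
assert monitor(9.0) == "possible_outlier"
assert monitor(9.1) == "activate_new_expert"
```

A single isolated outlier resets the state on the next well-explained point, so it never causes a spurious expert activation.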

### 3.3. Computational Complexity.

Learning with IMLE is very fast. For a new observation (*z*_{t}, *x*_{t}), a complete update of the mixture parameters consists of (1) deciding whether a new expert should be activated in equation 3.8; (2) assigning responsibilities *h*_{ij} to active experts using equation 3.2 (E-step); (3) updating the sufficient statistics (E-step, equation 3.6); and (4) obtaining the new value for (M-step, equations 3.7). All these calculations have a computational complexity of , since the matrix inversions required in this process can be performed efficiently using the Sherman-Morrison formula as a rank 1 update of these quantities. The computational complexity of a complete update of IMLE parameters is consequently linear in *D* and in *M*, the number of experts, and quadratic in *d*, the number of input dimensions, making it directly comparable to the state-of-the-art LWPR in terms of computational complexity per training point. Like LWPR, this complexity can be made linear in *d* if the input distance metrics are constrained to be diagonal.

## 4. Prediction

The answer to a query *z*_{q} is represented by a distribution *p*(**x** | *z*_{q}, **X**, **Z**), where the dependence on learned mixture parameters and latent variables is marginalized out. The same occurs in inverse prediction, where now *p*(**z** | *x*_{q}, **X**, **Z**) is considered. However, these posterior distributions cannot be analytically calculated for most probabilistic models. Instead, EM-based learning algorithms will normally provide predictions based on , the point estimate for the parameter vector being learned. For IMLE forward prediction, this results in , where , and where *w*_{j} and *w*_{k} are a shorthand for *w*_{qj} = 1 and *w*_{kj} = 1; and follow from equation 2.1, with replaced by its estimate. Inverse prediction, on the other hand, leads to , where now we have .

The same reasoning used to obtain an inverse probability distribution from the IMLE model can also be used to produce more general predictions. Assuming the joint vector **y** comprises a query part *y*_{q} and an answer part *y*_{a}, such that , we can easily derive , for each expert *j*, by conditioning the joint distribution with respect to *y*_{q}.

Predictions are computed from the current estimates over (**x**, **z**) using the sufficient statistics *S*^{t}; distributions for and are derived in the appendix and are given by equations A.5 and A.1, respectively. To keep the notation simple, we will use to refer to either or in the rest of this section, depending on whether forward or inverse prediction is considered.

The resulting predictive distribution is a mixture of *M* normal densities, each corresponding to a point estimate provided by a different expert, together with an uncertainty value, and where the mixture weights are given by the posterior probabilities that the query point was generated by each expert. A single-valued forward prediction, together with an associated uncertainty, can be obtained from equation 4.3 by taking its mean and variance, where and *R*^{x}_{j} are, respectively, the mean and variance of , and where, from now on, we drop the dependence on *z*_{q} for notational convenience. This is the approach followed in LWPR, and it works reasonably well under the single-valued hypothesis, although it tends to overestimate the true variance of the data due to the cross-variance between expert estimates, given by the last term in equation 4.6. When multivalued functions are considered, however, this approach can generate only a single estimate, together with a large value for the associated uncertainty. As reported, for instance, by Ghahramani and Jordan (1994), merging the distinct solutions provided by each expert can result in a poor overall estimate for a nonconvex solution space, where the weighted mean of different experts’ predictions may itself be far from the true value to estimate. This is clearly unacceptable. Searching for all the modes of the underlying distribution (Carreira-Perpiñán, 2000), on the other hand, can introduce many low-weight spurious modes, corresponding to the contributions of distant experts; some kind of filtering must then be done to remove them. Even after removing low-weight components, the topography of the mixture can be complex in prediction spaces with more than one dimension, as analyzed by Ray and Lindsay (2005), where, counterintuitively, there may exist more modes than mixture components. Consequently, to obtain multivalued forward predictions, we must deal with two major issues: (1) how to identify the correct number of solutions for a given query *z*_{q} and (2) how to group and merge the experts’ predictions accordingly. These are the topics of the following sections. Although exemplified by forward prediction, the concepts and techniques are exactly the same for inverse or more general prediction.
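The single-valued collapse of the expert mixture, taking the mean and variance of equation 4.3 including the cross-variance between expert estimates, can be sketched in one dimension as follows; the helper name is ours.

```python
def mixture_moments(weights, means, variances):
    """Collapse a mixture of expert predictions into a single estimate:
    the mixture mean and variance. The variance includes the spread of
    the expert means (the cross-variance term), which inflates the
    uncertainty when experts disagree, as in a multivalued region."""
    total = sum(weights)
    w = [wi / total for wi in weights]
    mean = sum(wi * mi for wi, mi in zip(w, means))
    # Law of total variance: expected within-expert variance
    # plus the variance of the expert means.
    var = sum(wi * (vi + (mi - mean) ** 2)
              for wi, mi, vi in zip(w, means, variances))
    return mean, var

# Two confident but disagreeing experts: the collapsed mean sits
# between them and the variance is dominated by their disagreement.
mean, var = mixture_moments([0.5, 0.5], [-1.0, 1.0], [0.1, 0.1])
assert abs(mean) < 1e-12 and abs(var - 1.1) < 1e-9
```

The example makes the failure mode concrete: for a two-branch multivalued function the collapsed mean (0 here) can lie far from both true solutions (−1 and 1), which is precisely why IMLE groups expert predictions instead.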

### 4.1. Grouping Experts’ Predictions.

Given a query *z*_{q}, if the true number of solutions *N*_{sol} is assumed to be known, estimating these solutions reduces to the problem of clustering *M* observations into *N*_{sol} classes, where each observation comes with an associated variance *R*^{x}_{j} and weight *w*^{x}_{j}. To provide a Bayesian probabilistic model that relates experts’ predictions to the unknown multivalued solutions , for , and that additionally takes *R*^{x}_{j} and *w*^{x}_{j} into consideration, we propose the generative model , where *s*_{jk} is a latent indicator variable that signals whether *x*_{j}, expert *j*’s conditional mean given *z*_{q}, was produced by solution *k*. Expert *j*’s true conditional mean *x*_{j} is also unobserved: it relates to the point estimate given by equation A.3 according to equation A.7, which takes uncertainty about the current expert parameters into account. The rationale for the variance in the previous model follows from the traditional probabilistic view of weighted least squares and best linear unbiased estimators (Gelman, Carlin, Stern, & Rubin, 2004; Vijayakumar et al., 2005), where we incorporate each expert weight *w*^{x}_{j} in the respective predictor variance. Using equation A.7 and compounding the distribution, equation 4.7, with the posterior distribution for given results in an approximate normal distribution with variance . Putting these facts together, we get

The uncertainty for prediction is given by . When *w*^{x}_{j} = 1 for a particular expert *j*, meaning that only that expert contributes to the prediction, this uncertainty becomes equal to , which agrees with the predictive uncertainty of the ordinary least squares solution for multiple linear regression.
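A one-dimensional sketch of this precision-weighted merge, assuming each expert assigned to a solution contributes its mean with precision *w*_{j}/*R*_{j} (the function name is ours):

```python
def merge_predictions(preds):
    """Precision-weighted merge of the expert predictions assigned to
    one solution, in the spirit of weighted least squares: each expert
    contributes its mean x_j with precision w_j / R_j. With a single
    expert of weight 1, the returned uncertainty reduces to R_j."""
    precision = sum(w / R for (x, R, w) in preds)
    mean = sum(w * x / R for (x, R, w) in preds) / precision
    return mean, 1.0 / precision

# A single fully weighted expert: merged uncertainty equals R_j.
mean, unc = merge_predictions([(2.0, 0.5, 1.0)])
assert abs(mean - 2.0) < 1e-12 and abs(unc - 0.5) < 1e-12

# Two agreeing experts of equal weight: the uncertainty halves.
mean2, unc2 = merge_predictions([(0.0, 1.0, 1.0), (2.0, 1.0, 1.0)])
assert abs(mean2 - 1.0) < 1e-12 and abs(unc2 - 0.5) < 1e-12
```

The second example shows the expected pooling behavior: adding independent experts to a solution tightens its predicted uncertainty.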

### 4.2. A Test for Multivalued Prediction Goodness of Fit.

*D* degrees of freedom, given that was indeed generated by solution *k* using model 4.8. Under the null hypothesis that (1) generative model 4.8 corresponds to the true distribution for the observed , (2) *N*_{sol} is the true number of multivalued solutions for query *z*_{q}, and (3) the previous EM algorithm correctly grouped the experts’ predictions into *N*_{sol} different solutions, the statistic *T*_{k} follows a chi-squared distribution for every solution *k*, where again the sums are over experts assigned to solution *k* and *M*_{k} is the number of experts belonging to that solution. A low value for this statistic indicates a good fit of the observations to the estimated solutions . On the other hand, if the *p*-value for any solution *k* is lower than a given significance level , the current set of solutions is considered to be badly explained by the data.

A practical detail of the above test is that will not take arbitrarily large values for correspondingly small values of *w*^{x}_{j}, as predicted by model 4.8. As a consequence, *T*_{k} will assume much lower values than the ones expected under the null hypothesis distribution, since in a typical scenario most of the experts will have a negligible contribution to the prediction. In this situation, we found that replacing *M*_{k} − 1 in equation 4.10 with , the effective degrees of freedom of the mixture, provides a better fit of the statistic to its corresponding distribution under the null hypothesis.

### 4.3. Obtaining a Valid Set of Multivalued Solutions.

Multivalued prediction starts with the single-valued estimate, *N*_{sol} = 1. If the test described in the previous section rejects this single-valued solution, *N*_{sol} is increased to 2 and the EM iterations in equation 4.9 are performed. If the goodness-of-fit test rejects at least one of the resulting solutions, *N*_{sol} is incremented, and this procedure is repeated until a value of *N*_{sol} is found for which the test fails to reject the null hypothesis for any of the solutions thus obtained. , for , is then the set of multivalued forward solutions predicted by IMLE, while are the respective uncertainties. To speed up the prediction process, each EM procedure initializes with the values found in the previous run of the algorithm, while the extra solution starts near the solution *k* that produced the smallest *p*-value in the previous goodness-of-fit test, this way splitting in two the solution responsible for the null hypothesis rejection.

During this process, the significance level controls the number of solutions found. The lower its value, the harder it is to reject the null hypothesis, and the fewer solutions are likely to be found. Increasing helps to separate different solutions but, as an unwanted consequence, predictions from neighboring experts may stop being merged together due to the function curvature around *z*_{q}.

## 5. Experimental Results

In this section we evaluate IMLE in several different experimental settings, focusing especially on large training sets arising from continuous streams of data that particularly suit online learning. We compare our results with different online learning algorithms, namely, LWPR (Vijayakumar et al., 2005), probably the most widely used state-of-the-art online learning method for robotic applications; SOGP, a sparse online approximation to gaussian process regression (Csató & Opper, 2002); and ROGER, an online infinite mixture of SOGP experts (Grollman & Jenkins, 2010). All of these algorithms have their C++ implementation code available, and the most recent version to this date is used in our comparisons.^{1} A C++ implementation of the IMLE algorithm is also freely available for download.^{2} We also compare IMLE to standard GPR. This is not an online algorithm in its standard formulation, but it can give some insight into the performance loss we can expect when moving to an online operation setting. In the following experiments, we specify a gaussian likelihood with an isotropic squared exponential covariance function, using exact inference for training and prediction. Optimal values for the input length scale of the kernels and the output noise, the free hyperparameters of the model, are obtained using standard optimization techniques over the training set, using the GPML Matlab code.^{3}

Parameters of interest for tuning the LWPR algorithm comprise *D*_{init}, , *w*_{gen}, and penalty (for details on these parameters’ meanings, consult the related documentation). Additionally, in all experiments we set *diagOnly* to *false* and *useMeta* and *updateD* to *true*. We use SOGP with a gaussian kernel. The remaining tuning parameters for this algorithm are , the kernel width; , the expected output noise for the function to learn; and , the maximum number of training basis points to be kept by the algorithm. SOGP behavior is supposedly similar to standard GPR if no upper limit is set on this number of basis points. Besides the (common) parameters for each of the SOGP experts, ROGER also needs some parameters to be defined: the most important quantities, according to its authors, are *P*, the number of particles; , the Chinese restaurant process concentration parameter that drives the propensity of new SOGP experts to be created within each particle; and the common parameters for each of its SOGP experts. In general, the higher the number of particles and the SOGP capacity, the slower ROGER will run.

As for IMLE, there are 11 parameters that can be tuned to change the resulting behavior of the algorithm. Some of them typically do not need any tweaking, and the following experiments, unless otherwise noted, will keep them at their default values, namely, (a small value is needed for regularization), , and (expert locations fully learned). As the input space dimension increases, a stronger prior on the input covariance matrices and output noise is needed to make the learning process relatively invariant with respect to the trajectory nature of the training data acquisition process. A good rule of thumb is to set and then choose based on the confidence in the value of , with smaller values corresponding to a larger uncertainty about this parameter. A typical value for the forgetting factor lies in the range . The remaining parameters, , , and *p*_{0}, have a strong influence on the experts’ activation process, equation 3.8, and ultimately on the number of local experts created during the training phase. represents the expected output noise variance, while corresponds to the input activation region within which the function to be learned can be approximately represented by a linear relation.

While setting and tuning such an apparently high number of free parameters may appear challenging at first, the convergence of the probabilistic model is not very sensitive to the specific values of these parameters. Perhaps the most sensitive issue when tuning the IMLE free parameters is to ensure correct convergence of and , the input length-scale and output noise estimates; this problem is discussed thoroughly next.

### 5.1. Single-Valued Function Approximation.

In this section we evaluate IMLE’s ability to perform single-valued function approximation, applying the algorithm to three learning problems with different input dimensions and comparing its performance to that of LWPR, SOGP, and GPR. Some care must be taken when comparing these different learning schemes. In general, increasing the model complexity of each algorithm will produce smaller approximation errors, while incurring a heavier computational cost. The number of local linear models activated by IMLE and LWPR is a good measure of model complexity for these online algorithms. Its final value, after the training process, is a consequence of the choices for their tunable parameters and of the training data themselves. Since IMLE and LWPR have the same computational complexity per training point, the final number of activated models provides a fair comparison ground for IMLE and LWPR in terms of the approximation error/computational complexity trade-off. GPR and SOGP model complexity, on the other hand, is measured by the number of stored training points used for posterior prediction over the test data. For GPR, this number is set beforehand, while SOGP learns a sparse subset of the training data to be used for prediction, possibly limiting the maximum number of these inducing points to a value of . IMLE and LWPR computational demands are linear in the number of local models, while GPR and SOGP are much more penalized by an increase in the number of stored training points or inducing points, respectively.

The amount of information implicitly available to the algorithms is another important issue concerning a fair comparison between them: for regression, the input length scale of the data and the noise level present in the output are two critical properties of the function to be learned. GPR learns them offline by optimizing the likelihood of the training data with respect to these hyperparameters. Since GPR is an offline algorithm, its prediction performance strongly depends on the input locations of the stored training points. If they efficiently cover all the input space, we expect GPR to outperform methods based on local linear approximations in terms of prediction error. Yet such a highly desirable, informative training set may be unavailable or difficult to generate, as in typical robotic applications, where visiting all the input space can be very time-consuming.

Online methods alleviate this dependence on an initial representative training set by learning their models on the fly, adapting them as new training data arrive. This is achieved in SOGP by maintaining a representative subsample of the data. Parameters and , however, are not adapted during the learning process, and thus SOGP must rely on a good initialization of its input length scale and output noise parameters. Adequate values can be obtained, for instance, from an initial offline optimization, similar to GPR. Both IMLE and LWPR learn the input length scale and output noise for each of their local linear models—in fact, they learn a full input distance metric, represented by covariance matrices and *D*^{−1}_{j}, respectively. They differ in the way they initialize these quantities: LWPR initializes the input distance metrics to a constant value *D*_{init}; IMLE, on the other hand, puts a common prior on , defining then a vague hyperprior for . While LWPR initialization strongly influences the number of receptive fields created during learning, in IMLE the information conveyed in hyperpriors’ parameters and can quickly lose importance if and are small. This capability to learn, in an online fashion, the characteristic input length scale and output noise makes IMLE more robust to poor parameter initialization and less dependent on problem-specific knowledge; this complex probabilistic model structure, however, can make the learning convergence depend more on the training data input distribution, since hyperparameters and strongly influence the behavior of newly activated experts, which also contribute to the estimation of and . Ultimately this can lead to convergence to poor local maxima of the likelihood function.

#### 5.1.1. Cross Function.

We first ran IMLE on a sequential stream of data taken from the cross function suggested in Vijayakumar et al. (2005), a two-dimensional-input, univariate-output function displayed in Figure 2a. The training set consisted of points sampled from a random trajectory performed in the input space and the corresponding output values, to which we added gaussian noise with 0.1 standard deviation. A small sample of such training data can be seen in Figure 2a, superimposed on the target function.

We adopted all the suggested values for the LWPR parameters presented in the cross-2D example given in the LWPR source code, namely, . SOGP was left with its default parameters of , with no limit on the number of inducing points. As for IMLE, we chose to match the LWPR initial input covariance matrix, while defining and making . For comparison purposes, we varied the parameter *p*_{0}, considering two values, *p*_{0}=0.1 and *p*_{0}=0.2. Since the output was known to be single-valued, we did not group the experts’ predictions into different solutions when estimating the output for the test input data, setting for the hypothesis testing described in section 4. Finally, we also trained a standard GP model, using a random, nonsequential training set with two different sizes (M = 1000 and M = 5000). All of the online algorithms were trained on a set of 200,000 sequential points coming from a random trajectory in the input space, and the prediction root mean square error (RMSE) was evaluated on a noiseless test grid of 200 × 200 equally spaced input points and corresponding output values. For accuracy, the presented IMLE and LWPR results are averages over 100 random trials.
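For readers who wish to reproduce a scaled-down version of this setup, the following sketch generates noisy cross-function training data and a noiseless evaluation grid. The exact cross-function formula below is the one commonly stated in the LWPR literature and is taken here as an assumption; the sample and grid sizes are reduced from those in the experiment.

```python
import math
import random

def cross2d(x1, x2):
    """The 2-D cross function, as commonly stated in the LWPR
    literature (assumed here to be the benchmark used)."""
    return max(math.exp(-10.0 * x1 ** 2),
               math.exp(-50.0 * x2 ** 2),
               1.25 * math.exp(-5.0 * (x1 ** 2 + x2 ** 2)))

def rmse(predict, grid):
    """Root-mean-square error of a predictor over noiseless test points."""
    errors = [(predict(x1, x2) - cross2d(x1, x2)) ** 2 for x1, x2 in grid]
    return math.sqrt(sum(errors) / len(errors))

random.seed(0)
# Noisy training stream: outputs corrupted with gaussian noise of
# 0.1 standard deviation, as in the experiment.
train = []
for _ in range(1000):
    x1, x2 = random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)
    train.append((x1, x2, cross2d(x1, x2) + random.gauss(0.0, 0.1)))

# Equally spaced evaluation grid (the paper uses 200 x 200).
grid = [(-1.0 + 2.0 * i / 19, -1.0 + 2.0 * j / 19)
        for i in range(20) for j in range(20)]
```

A perfect predictor scores an RMSE of zero on the noiseless grid, so `rmse` measures only the approximation error of the learned model, not the training noise.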

Figure 2b shows a typical reconstruction of the original target function after learning (*p*_{0}=0.2). Figure 2c shows, for the online methods, the evolution of the RMSE and the number of created models as a function of the number of processed training points, and Figure 2d presents the final results after training. We can see that SOGP achieves the best RMSE, but at a high computational cost compared to the other online methods. LWPR, on the other hand, is the fastest algorithm but has a worse function approximation error than IMLE, even when IMLE resorts to fewer local models (*p*_{0}=0.1). Note that increasing *p*_{0} results in a better function approximation, at the cost of a larger number of activated linear experts and a consequent increase in computation time. As for offline GPR, low error rates can be obtained if the training set is large enough, but this comes at a prohibitive cost in terms of offline computation time and the memory required to perform the necessary matrix inversions. We also used a random set for training the GP; as stated before, this may not be easy to generate in many practical real-time applications. This may also explain why SOGP has a better RMSE than GPR while using significantly fewer training points for prediction, since SOGP keeps only the most informative points taken from the full training set.

#### 5.1.2. The PUMA 560 Serial Robot.

The Unimation PUMA 560 is a well-known six-degrees-of-freedom industrial robotic arm. Its forward kinematics function is described, for instance, in Craig (1989). To evaluate the single-valued prediction capabilities of IMLE, we simulated the PUMA 560 robot kinematics, defining a 10 cm tool extending along the *z*-axis of the frame associated with the sixth joint, and generated random trajectories over the joint space of the robot, calculating the corresponding 3D position of the tool tip. The kinematic function to be learned was thus a map from an input space of dimension 6 to an output space of dimension 3, even if the last joint was irrelevant, as it changed only the tool orientation. Note that the fully stretched arm plus tool measured more than 90 cm, which made the range of each output variable almost 2 m.

We trained both LWPR and IMLE on a set of 10 million training points, evaluating the final achieved RMSE on a different test set comprising 100,000 points. Output training values were corrupted with gaussian noise with standard deviation equal to 2 cm. This corresponds to approximately 1/100 of the output range and can model, for instance, moderate noise in a vision-based end-effector tracking process. SOGP, unfortunately, had to be left out of the comparisons: its parameters turned out to be difficult to tune, and its behavior was unstable and very slow in the face of a large stream of highly correlated input data. We also considered standard GPR, generating a random set of training data for hyperparameter optimization and testing on the independent test set. Once again, the results thus obtained do not compare fairly to the other online algorithms, as one of the most challenging difficulties of this data set is the massive, sequential, and correlated nature of the training data.

Experimental results are shown in Table 1, where the RMSE for two different instances of the GPR algorithm, trained with 1000 and 5000 data points, are presented. We set , *p*_{0}=0.1, and the recommended values and . Since we did not wish to give IMLE any information about the input length-scale and output noise characteristics of the function to be learned, we defined and , and set to a low value of 8 in order to quickly decay the influence of in the estimate of . As for LWPR, we picked the combination of parameters that made LWPR achieve the lowest error while not letting the number of created receptive fields grow to inadmissible values: , *w*_{gen}=0.1, , and . Tuning these parameters to produce roughly the same number of allocated models as IMLE never yielded an RMSE of less than 10 cm on each output dimension. Figure 3b shows the corresponding learning curves for IMLE and LWPR. Due to the conservatively large parameter values for and , IMLE converged slowly in the initial learning phase but quickly recovered after convergence of and to better output noise and input length-scale estimates, respectively.

| Method | RMSE | Number of Models | CPU Time (s) |
|---|---|---|---|
| IMLE | 0.0245 | 668 | 4181 + 141 |
| LWPR | 0.0560 | 4338 | 11,974 + 725 |
| GPR_{1000} | 0.0513 | — | — |
| GPR_{5000} | 0.0144 | — | — |


The LWPR algorithm was tested over an exhaustive combination of parameters: , , , and . These parameters affected the final error through the number of receptive fields they tended to create: the larger this number, the lower the error, as expected. The same happened to IMLE when varying its tunable parameters, with , , , , and . In general, a lower RMSE was achieved at the cost of an increasing number of local linear models. It is very illustrative to depict the final RMSE as a function of the number of models created for each parameter configuration of IMLE or LWPR, as in Figure 3a, since the number of local models affects the computational training and testing time in a similar way for these two algorithms. In the figure we can identify, for IMLE, three convergence behaviors. The first one, marked IMLE (Global), corresponds to a low value of : since this value of is lower than the actual output noise, IMLE activates a large number of linear experts in the initial learning phase, each one covering all the input space (due to a high ) and a particular region of the output space. This is of course an undesirable behavior that goes against the principle of localized learning. The second case, identified as IMLE (overfitting), is a consequence of choosing a low value of in combination with a low or medium value of , allowing the individual to shrink to a point where each new training point has a high probability of activating a new expert. This may result in a snowball effect, where more experts contribute to a decrease of , which leads to a smaller and, consequently, more experts being activated. In Figure 3a, this does not result in a reduction of the RMSE on the test set, a good indicator that IMLE is overfitting in that situation. These are, however, two extreme situations, caused by setting or to values smaller than the output noise and input length scale, respectively. All other combinations of parameters exhibit good robustness regarding training convergence. Note that in this case, IMLE has a much better performance than LWPR, achieving a much smaller RMSE while activating considerably fewer local models.

#### 5.1.3. The SARCOS Inverse Dynamics Data Set.

The SARCOS anthropomorphic robotic arm is a 7-degrees-of-freedom manipulator that has been used to test several function approximation algorithms. The learning task considered here is the estimation of its inverse dynamics from training examples. This is a nonlinear map from a 21-dimensional input space, consisting of the positions, velocities, and accelerations of each joint, to a 7-dimensional output space comprising the corresponding joint torques. The learned model can then be used to estimate the torques that achieve a desired trajectory in joint space. The data set consists of 48,933 training points and 4449 test points, taken from trajectories performed with the real robot. Output values are normalized by the variance of the outputs of the test set to make the results presented here directly comparable to the ones in Rasmussen and Williams (2006) and Vijayakumar, D'Souza, Shibata, Conradt, and Schaal (2002). We compare IMLE results on this data set to the following models:

- **Linear regression (LR):** A linear regression model is fitted to the data to provide a baseline for comparison.
- **Rigid body dynamics (RBD):** This is a parametric, physics-based model for the inverse dynamics function, estimated using a least-squares approach with the available training data.
- **LWPR:** This model was trained using diagonal distance metrics *D*_{j}, cycling through the training data over 6 million iterations. This roughly corresponds to 123 passes over the full training data.
- **GPR:** Due to the computational infeasibility of using the full training data for hyperparameter optimization and prediction over test data, a subset-of-regressors method was employed, with size 4096. A squared exponential covariance function was used, and its hyperparameters were optimized accordingly, using a subset of the training data.
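The output normalization mentioned above (MSE divided by the variance of the test-set outputs) can be sketched as follows; the function name is ours.

```python
def nmse(predictions, targets):
    """MSE normalized by the variance of the test-set outputs, so that
    a constant predictor at the target mean scores approximately 1.0
    and a perfect predictor scores 0.0."""
    n = len(targets)
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n
    mean = sum(targets) / n
    var = sum((t - mean) ** 2 for t in targets) / n
    return mse / var

# A perfect predictor scores 0; predicting the mean scores 1.
assert nmse([0.0, 2.0], [0.0, 2.0]) == 0.0
assert abs(nmse([1.0, 1.0], [0.0, 2.0]) - 1.0) < 1e-12
```

This normalization is what makes the error figures in Table 2 comparable across references that use the same convention.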

As for IMLE, we used out-of-the-box default parameters, setting conservatively large values for its noise and length-scale parameters, and , while setting the usual value of *p*_{0}=0.1. Using the input dimension to set the values of , , and according to the rule of thumb presented before would result in overly large prior strengths. We resort instead to the notion that for actual robot movements, the generated data tend to be low-dimensional, with around four to six effective dimensions (Schaal, Vijayakumar, & Atkeson, 1998). This is a standard assumption in robotic applications to circumvent the curse of dimensionality: learning a sensorimotor map would require a full exploration of the input space if this condition did not hold, and as the input dimension increased, the time required for exploration and learning would grow exponentially, making the learning task infeasible from a practical point of view. In our experiments, we found that a value of seven effective dimensions provided satisfactory learning behavior and thus set . We also set to a low value of 8 due to the high uncertainty about . Varying over a considerably wide interval around this value did not change IMLE convergence behavior significantly.

Unlike GPR and LWPR, the IMLE algorithm can directly provide multivariate output predictions without a need to train different models for each of the output dimensions. If the maps from inputs to each output variable have the same length-scale properties, as frequently happens in robotic sensorimotor maps, a huge economy of computational resources can be attained, as each local linear model can describe the interaction between the inputs and the full output vector. We show in Table 2 the prediction error when the output consists of only the first joint torque, comparing it to the results presented in Rasmussen and Williams (2006) for LWPR, GPR, RBD, and LR. We also used IMLE to learn the full output torque vector. For both situations, we present the MSE and number of activated experts after (1) a full pass over the training data and (2) 10 consecutive passes over the same data. For confirmation purposes, we also trained IMLE with *p*_{0}=0.0, resulting in a model with a single expert that provided a global linear approximation to the training data. As expected, an MSE equal to that of the LR model was then obtained.

| Method | MSE | Number of Models | Method | MSE | Number of Models |
|---|---|---|---|---|---|
| LR | 0.075 | — | IMLE_{1D} (1 epoch) | 0.019 | 313 |
| RBD | 0.104 | — | IMLE_{1D} (10 epochs) | 0.010 | 563 |
| LWPR | 0.040 | 260 | IMLE_{7D} (1 epoch) | 0.018 | 271 |
| GPR | 0.011 | — | IMLE_{7D} (10 epochs) | 0.010 | 550 |

Note: IMLE_{1D} is the model corresponding to a map from ℝ^{21} to ℝ, while IMLE_{7D} is the model obtained from the full map from ℝ^{21} to ℝ^{7}.
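The computational economy of a joint multivariate-output fit can be sketched numerically (the data and dimensions below are illustrative, merely chosen to mirror a 21-input, 7-output benchmark): a single least-squares solve covers all output columns at once and coincides with what *D* separate univariate fits would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

d, D, n = 21, 7, 1000                 # input dim, output dim, sample count (illustrative)
A_true = rng.normal(size=(D, d))      # ground-truth linear map
Z = rng.normal(size=(n, d))                         # inputs
X = Z @ A_true.T + 0.1 * rng.normal(size=(n, D))    # noisy multivariate outputs

# One multivariate fit: lstsq solves all D output columns simultaneously,
# reusing the same factorization of Z for every output dimension.
A_joint, *_ = np.linalg.lstsq(Z, X, rcond=None)

# Equivalent result from D separate univariate fits (what a univariate
# learner would need to train D times).
A_sep = np.column_stack(
    [np.linalg.lstsq(Z, X[:, i], rcond=None)[0] for i in range(D)]
)

assert np.allclose(A_joint, A_sep)
```

The two estimates are identical; the saving lies in sharing the input-side computation across all output dimensions, which is exactly what each IMLE linear expert does locally.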

The extremely good convergence of the IMLE algorithm is noteworthy. After only a single pass through the training data, IMLE achieves a better approximation error than LWPR (after more than 100 passes through the same data) while activating a comparable number of linear models. If more points are presented to IMLE, cycling through the data ten times, an MSE comparable to state-of-the-art GPR is achieved. It is also worth noticing that IMLE performance did not change much when the full 7-dimensional output vector was considered for the learning task: the MSE remained the same, while, perhaps surprisingly, the number of experts even dropped a bit.^{4} We did not notice any increased computational time in this situation, as the slight increase in computation due to the higher output dimension was balanced by a smaller number of activated models. In this respect, IMLE compares favorably to GPR and LWPR, which would require seven times more computational power to learn the full inverse dynamics map.

### 5.2. Multivalued Function Approximation.

We now proceed to evaluating IMLE learning capabilities under a multivalued target function scenario. As stated in section 3.2, learning multivalued functions in an online fashion poses several additional problems. Without further information, it is difficult to distinguish noise or outliers from new multivalued function branches. In addition, it is no longer possible to set large values for the noise and length-scale priors, hoping that the learning process finds adequate values for the input length scale and output noise estimate, as multivalued relations may then be interpreted as single-valued functions with large output noise. There is also the problem of time-varying functions. This issue can easily be addressed in online single-valued algorithms by introducing forgetting mechanisms that allow a quick adaptation of the internal model to the time-varying data. However, things become more complicated with multivalued function approximation, where there is no clear border between fast time-varying training data and multivalued functions.

We tested IMLE against ROGER whenever possible. This latter algorithm, however, consists of an infinite mixture of SOGPs, thus inheriting the limitations mentioned in the previous section: the difficulty in getting a good parameter configuration and the slow operation for medium- or high-dimensional input spaces. Also, the ROGER implementation code did not provide a set of multivalued solutions for the function being approximated, instead sampling a single solution from the infinite mixture. This feature imposes severe limitations on real applications, as ROGER predictions will constantly switch between different multivalued branches. To provide a fair comparison to IMLE, we adapted the ROGER code to provide a finite set of prediction solutions. For each input query, ROGER was used to predict a solution over 100 different trials, and the distinct solutions thus obtained were gathered to become the set of predicted solutions.

The approximation error was measured by taking, for each input query, the multivalued prediction closest to the true output. Note that we do not address here the important problem of choosing a solution among the set of multivalued predictions produced by the algorithm. This can be seen as a context estimation, and it is a requirement when the multivalued model is to be used for control purposes. In the following experiments, we kept the parameter that controls the separation of multivalued solutions fixed. The lower this value, the higher the number of solutions found for the same query on average. However, when the branches of the multivalued function to learn are well separated in the output space, we found that this parameter did not influence the final prediction much: setting it anywhere within a reasonably wide range resulted in similar prediction results.
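The error metric just described — for each query, score only the predicted solution closest to the true output — can be sketched as follows (the function name and data are illustrative, not from the authors' code):

```python
import math

def multivalued_rmse(predictions, targets):
    """RMSE where each query contributes the distance from the true output
    to the *closest* of its multivalued predicted solutions."""
    sq_sum = 0.0
    for solutions, y in zip(predictions, targets):
        # distance from the true output y to each predicted solution
        best = min(abs(s - y) for s in solutions)
        sq_sum += best ** 2
    return math.sqrt(sq_sum / len(targets))

# Two queries: the first prediction set contains the truth exactly,
# the second misses it by 0.5 at best.
preds = [[0.0, 4.0], [1.0, 5.0]]
truth = [4.0, 4.5]
print(multivalued_rmse(preds, truth))  # sqrt((0.0**2 + 0.5**2) / 2) ≈ 0.354
```

This metric deliberately ignores how many spurious solutions are predicted; the text below discusses the solution-count question separately.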

#### 5.2.1. Synthetic Data Sets.

To illustrate the fundamental differences between multi- and single-valued function approximation algorithms, we start with a simple toy example consisting of a multivalued sinusoidal target function. Standard function approximation methods, like LWPR or GPR, typically behave poorly in this setting. Training data were generated by alternating sequential sweeps over each of the two branches of the multivalued function, given, respectively, by *f*_{1}(*z*)=cos(*z*) and *f*_{2}(*z*)=cos(*z*)+4, and each output sample *x _{i}* was corrupted with gaussian noise with standard deviation equal to 0.1. Figure 4a shows a sample of the training data, while Figure 4b depicts estimates from IMLE, ROGER, LWPR, and GPR taken after training. As expected, LWPR and GPR tend to average the two branches of the multivalued function, while IMLE and ROGER correctly identify the two solutions. In particular, ROGER achieves an almost perfect approximation to the true function (RMSE = 0.014), when compared to IMLE (RMSE = 0.039).^{5}
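A data set of this kind can be reproduced as follows; the branch definitions and noise level follow the text, while the sweep length and count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def branch_sweeps(n_sweeps=10, points_per_sweep=200, noise=0.1):
    """Alternate sequential sweeps over the two branches of the multivalued
    sinusoid: f1(z) = cos(z) and f2(z) = cos(z) + 4, with gaussian noise."""
    zs, xs = [], []
    for k in range(n_sweeps):
        z = np.linspace(0.0, 2.0 * np.pi, points_per_sweep)
        offset = 0.0 if k % 2 == 0 else 4.0     # alternate between the branches
        x = np.cos(z) + offset + noise * rng.normal(size=z.shape)
        zs.append(z)
        xs.append(x)
    return np.concatenate(zs), np.concatenate(xs)

z, x = branch_sweeps()
```

A single-valued learner fed this stream must either average the branches or track whichever branch was swept last; a multivalued learner can retain both.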

Another toy example, presented in Lee and Lee (2001), consists of a cylindrical spiral surface, again a multivalued function, described by a spiral parameterization of the input space.

The codomain of the target function makes each input point have two distinct solutions. IMLE was trained with random trajectories generated over the target function, and after 50,000 points, it achieved an RMSE of less than 0.065 on an independent random test set. The multivalued predicted outputs for this test set are shown in Figure 6a, while the 39 linear models allocated by the algorithm are represented in Figure 6b.

The final toy example we present consists of a randomly generated piecewise constant target function, shown in Figure 7a. Although not multivalued, many function approximation algorithms will not be able to learn it properly, as their smoothness assumptions conflict with the discontinuities of the target function. This typically results either in overfitting, when the complexity of the model is increased to approximate the function in the vicinity of the discontinuities, or in oversmoothing behavior, where the predictions average the two values of the target function near the transitions.

Multivalued learning algorithms based on mixtures, like IMLE, can provide an elegant solution to this problem. During training, there is no output interference near the discontinuities, since in those regions the data are simply interpreted as coming from a multivalued function. When predicting, a single-valued solution can nevertheless be provided by simply taking the most important solution from the multivalued prediction set, according to the sum of weights *w*^{x}_{j}(*z*_{q}) of the experts that contribute to that solution. This procedure can generate sudden transitions in the prediction, like the algorithm developed by Toussaint and Vijayakumar (2005) to specifically deal with discontinuous functions. Prediction results are shown in Figure 7 for IMLE and a single-valued learner. We do not undertake a full comparison to other discontinuity-aware learning algorithms, as the main point here is a proof of concept: IMLE (and other multivalued learning algorithms) can also efficiently approximate discontinuous functions without the need for any special modifications.
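A minimal sketch of this selection rule (the weight values below are purely illustrative): group the experts' point predictions into candidate solutions, then return the one whose supporting experts carry the largest total weight.

```python
def single_valued_prediction(solutions):
    """Each solution is a pair (predicted_value, weights_of_contributing_experts).
    Return the prediction whose experts carry the largest summed weight."""
    return max(solutions, key=lambda s: sum(s[1]))[0]

# Near a discontinuity the mixture sees two candidate solutions; as the
# query crosses it, the dominant total weight switches abruptly, producing
# a sharp prediction transition instead of a smoothed average.
left_of_jump = [(0.0, [0.7, 0.2]), (1.0, [0.1])]
right_of_jump = [(0.0, [0.1]), (1.0, [0.6, 0.3])]
print(single_valued_prediction(left_of_jump))   # 0.0
print(single_valued_prediction(right_of_jump))  # 1.0
```

Because the winner is chosen by an argmax over summed weights rather than by averaging, the prediction jumps discontinuously exactly where the data do.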

#### 5.2.2. The Humanoid iCub Robot.

The next experiment evaluates IMLE multivalued prediction capabilities in a higher-dimensional problem. We consider iCub, an anthropomorphic robot used for research into human cognition and artificial intelligence (Sandini, Metta, & Vernon, 2004). We simulated random trajectories in a seven-dimensional input space consisting of the joint angles of the robot waist (yaw, roll, and pitch) and right arm (shoulder yaw, pitch and roll, and elbow flexion), the output space being the 3D position of the end effector. During the training phase, after acquiring 100,000 data points, a 28 cm tool was introduced in the kinematic chain of the arm, effectively changing the end-effector position. After 100,000 more training points, the tool was removed and training resumed. From the viewpoint of the learning algorithm, this nonsignaled change of the kinematic structure can be represented by a multivalued function with two distinct branches corresponding to the tool and no-tool contexts. Traditional single-valued approximators must forget the previously learned map when the tool is introduced in order to learn the new kinematics. This wastes computational resources, as the original map must be learned again when the tool is removed and the original kinematics structure is once again presented to the robot. IMLE can keep both situations stored in its internal model: there is almost no increase in the RMSE when the tool is removed, as can be seen in Figure 8, where, for comparison with a single-valued learner, we also show the results of training LWPR using the same data set.

#### 5.2.3. Simultaneous Learning of Forward and Inverse Models.

In the following experiments, we show how IMLE can use the learned mixture to predict both forward and inverse maps. We use the Puma 560 robot arm described in section 5.1, controlling its first three joints to position the end effector. The kinematic function for this robot configuration is a map from ℝ^{3} to ℝ^{3}. While the forward map consists of a single-valued function, the inverse kinematics is multivalued and can exhibit up to four solutions. As a second example, we consider the 3-RPR parallel manipulator described in Merlet (2006). It consists of an end effector connected to a fixed base through three prismatic links, each connecting to the base and end effector using free, unactuated rotational joints. Its movement is restricted to the *x*-*y* plane. Actuating on link lengths L_{1}, L_{2}, and L_{3} changes the *x*-*y* end-effector position and orientation on this plane. The kinematics of this mechanism is also a map from ℝ^{3} to ℝ^{3}. Parallel robots typically exhibit a duality with serial chains with respect to the nature of their forward and inverse kinematics. While their inverse relation is usually unique and straightforward to calculate, obtaining a closed formula for the end-effector position and orientation as a function of actuator values is difficult and frequently yields multiple valid solutions. This mechanism is known to have up to six different solutions for the same actuator configuration, which makes learning its forward kinematics infeasible for most standard single-valued function approximation techniques.
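The uniqueness of the parallel manipulator's inverse kinematics is easy to see in code: given a platform pose, each link length is simply the distance between its base anchor and the corresponding platform anchor. The anchor coordinates below are hypothetical, not those of Merlet's mechanism:

```python
import math

BASE = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.9)]         # fixed base anchors (assumed)
PLATFORM = [(-0.1, 0.0), (0.1, 0.0), (0.0, 0.15)]   # anchors in platform frame (assumed)

def inverse_kinematics(x, y, theta):
    """Unique link lengths (L1, L2, L3) for a 3-RPR platform pose (x, y, theta)."""
    c, s = math.cos(theta), math.sin(theta)
    lengths = []
    for (bx, by), (px, py) in zip(BASE, PLATFORM):
        # rotate the platform anchor by theta and translate it to (x, y)
        wx = x + c * px - s * py
        wy = y + s * px + c * py
        # prismatic link length = base-to-anchor distance: always a single value
        lengths.append(math.hypot(wx - bx, wy - by))
    return tuple(lengths)

L = inverse_kinematics(0.5, 0.4, 0.1)  # one pose -> exactly one length triple
```

Each pose maps to exactly one length triple, whereas inverting this map (the forward kinematics) requires solving coupled polynomial equations with up to six real solutions — hence the multivalued learning problem.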

We trained IMLE over 1 million points taken from a simulated random trajectory for both manipulators, adding gaussian noise to the outputs, with standard deviation equal to 1/100 of the output range. We restricted the movement of the 3-RPR parallel manipulator to a square of 40 cm by 40 cm in the center of the mechanism, while the platform orientation was constrained to a limited interval. We ran 10 different trials, testing the resulting final mixtures on a noiseless random sequence of 100,000 points. The results are shown in Figure 9. Figure 9c also presents, for the PUMA robot, the frequency of the number of solutions found by IMLE inverse prediction, comparing it to the real value, obtained by explicitly solving the inverse kinematics equations. The discrepancy between these numbers can be explained by the fact that close to the work space boundary of the PUMA 560, there are pairs of inverse solutions that become close to each other. In this situation, IMLE tends to merge these solutions. Lowering the value of the solution separation parameter would reduce this behavior; however, due to the curvature of the map to learn, this would have the undesired side effect of predictions of neighboring experts being erroneously taken as separate solutions. In general, choosing a value for this parameter is a compromise between these two effects. Nevertheless, IMLE still achieves a good inverse prediction error rate, since at the work space boundary, the merged solution provided by IMLE is approximately the average of two reasonably similar true solutions. As for the 3-RPR manipulator, IMLE found on average 1.88 forward solutions per test point, which, given the constrained work space, agrees with the expected number of solutions. For every point on the test set, only a single inverse solution was found. This is in total agreement with the single-valued nature of the inverse kinematics for this kind of mechanism.

## 6. Discussion and Conclusion

The infinite mixture of linear models presented in this letter is, at its core, an algorithm that can efficiently deal with nonlinear function approximation in an online, incremental manner, comparable to current state-of-the-art online learning methods. It consists of a collection of linear experts, together with appropriate priors on the parameters being learned and a mechanism that efficiently grows the number of experts whenever newly acquired samples are poorly explained by the current mixture. Its training is based on the generalized EM algorithm, where the expectation step is extended to allow for incremental updating of the sufficient statistics of the mixture of experts and the maximization step includes the allocation of a new expert each time a training point is poorly explained by the current mixture. Put together, this results in a very fast and scalable online learning algorithm, capable of processing hundreds or thousands of samples per second coming from continuous streams of data. This is a difficult learning setting: when no knowledge is provided about the characteristics of the function to learn, this information must be estimated from a sequence of correlated training data that may correspond to only a small subset of the full input-output space. We tested IMLE in this kind of situation for single-valued regression and showed how it could equal or even surpass LWPR, a current state-of-the-art online learning algorithm, in terms of convergence of prediction error and number of allocated models.

However, a distinctive feature of IMLE, when compared to other online supervised learning algorithms, is its ability to deal with multivalued estimates for the same query data. Applications for this kind of problem range from learning forward models of parallel robots to learning switching models, where the function to be approximated can alternate between different configurations over time. This constitutes an even more challenging learning problem. Besides the limitations coming from learning online from a stream of data, we now face the expert allocation dilemma, where it is difficult to distinguish between noise and outliers, on one hand, and a mixture requiring a new component, on the other. Additionally, underlying nonstationary relations make the problem even harder. Analyzing the influence of nonstationarity on multivalued prediction, relating it to the separability of multivalued function branches, is a promising topic for further research.

We have also shown in this letter that the same procedure used to obtain a set of multivalued forward solutions can be applied to inverse queries, making IMLE capable of delivering both forward and inverse multivalued predictions from the same model, without the need for further training. This is a consequence of directly learning a multivariate output relation from ℝ^{d} to ℝ^{D} instead of a set of *D* distinct univariate output maps. As seen in the previous section, such multivalued learning capability can also prove useful when learning discontinuous functions, for which an undesirable prediction smoothing typically occurs in the vicinity of the discontinuities when using standard function approximation algorithms.

A current limitation of the proposed algorithm is the lack of a mixture shrinking mechanism that would be responsible for removing either experts providing wrong predictions or redundant mixture components. The experiments presented in section 5 show that the IMLE model will activate new experts more and more sparingly as the training progresses, but eventually, after a lifetime of learning, too many experts may become activated, resulting in an increase in computational resource consumption. Moreover, in the event of some episodic outlier bursts, some experts may be activated to represent such erroneous training data. We do not yet have a definite answer to this problem. As discussed in section 3.2, growing the mixture incrementally from multivalued data is a delicate matter, and shrinking the same mixture in a principled way is even more troublesome. The main difficulty is that there is no simple way to detect experts wrongly activated by an outlier (or a sequence of outliers), as they cannot easily be distinguished from experts originating from sporadic training on a new branch of the multivalued function being learned. Such "incorrect" experts can be detected, for instance, by a strong deviation from the characteristic input length scale or output noise (a strong disagreement between the learned and prior values of these quantities) or by low support on training data (low accumulation of sufficient statistics after a long period of training). However, removing an expert under this condition does not come without the risk of lowering the likelihood of the training data, since no guarantees exist that such an expert does not represent the true distribution of the multivalued function being learned. In general, choosing an expert to be removed from the mixture—due, for instance, to capacity limitations imposed on the mixture—is not easy without access to the full set of training points.
Creating a probabilistically supported shrinking mechanism for the mixture is surely an improvement to the IMLE model that we would like to investigate in subsequent work.

IMLE has a large space for customization through the choice of different priors for the mixture parameters, the most evident being the prior on the regression coefficients. Feature selection is a very desirable property of a function approximation algorithm, allowing it to learn the input subspace that effectively contributes to output variation. This typically makes the prediction more stable and less influenced by irrelevant or strongly correlated input dimensions. In a Bayesian setup, feature selection can be implemented if adequate priors on the loading matrices are defined. Adopting a gaussian prior, as IMLE currently does, leads to ridge regression (Hoerl & Kennard, 1970), while a Laplacian prior induces the LASSO (Tibshirani, 1996). Both priors involve the choice of a hyperparameter to control the degree of regularization and sparseness. Figueiredo (2003) uses a Jeffreys prior to overcome the necessity of such a hyperparameter. More recently, Ting, D'Souza, Vijayakumar, and Schaal (2010) proposed a slightly different formulation of the generative model corresponding to the linear regression performed by each expert, which, together with a careful choice of priors for its elements, can lead to fast and efficient high-dimensional feature selection and regression. In principle, any of these priors can be integrated into IMLE; this is a topic for future work. Another research direction is the inclusion of Dirichlet priors for mixture components, turning IMLE into a fully Bayesian learning method. In this situation, the major challenge would be retaining the online and scalability properties of the algorithm.
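The gaussian prior currently used corresponds to the familiar ridge estimator, which has a closed form; the following sketch illustrates the shrinkage effect of its regularization hyperparameter (the data and λ values are illustrative, unrelated to IMLE's internals):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 10
Z = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]                 # sparse ground-truth coefficients
x = Z @ w_true + 0.05 * rng.normal(size=n)

def ridge(Z, x, lam):
    """Closed-form ridge regression: a gaussian prior on the coefficients
    shrinks them toward zero with strength lam (Hoerl & Kennard, 1970)."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ x)

w_small = ridge(Z, x, 0.01)   # near the unregularized least-squares solution
w_large = ridge(Z, x, 1e4)    # heavily shrunk toward zero
assert np.linalg.norm(w_large) < np.linalg.norm(w_small)
```

A Laplacian prior (the LASSO) has no such closed form and instead drives individual coefficients exactly to zero, which is what makes it attractive for feature selection.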

## Appendix: Posterior Distributions for IMLE

The posterior distributions used in section 3 follow from standard conjugate analysis, given the current estimates of the mixture parameters and the sufficient statistics accumulated up to time *t*. Since the normal-inverse Wishart is a conjugate prior for the center and covariance matrix of a multivariate normal data distribution (Gelman et al., 2004), the posterior for each expert *j* is again a normal-inverse Wishart, and the corresponding posterior predictive distribution is a multivariate *t*-Student distribution. The regression parameters can be brought into the same framework by expanding the input vector to accommodate a constant term that absorbs the bias; equation 2.1a can then be rewritten in this homogeneous form, with the corresponding priors, equations 2.2c and 2.2d, changed accordingly. Equations 3.7c to 3.7e arise from the resulting posterior distributions, as their right-hand sides are the modes of the respective distributions.

Each posterior predictive distribution is a *t*-Student distribution; when its number of degrees of freedom is large, it can be approximated by a multivariate normal. For an input query *z*_{q}, marginalizing the predictive distribution with respect to the expert parameters again yields a *t*-Student distribution whose variance includes a factor reflecting the uncertainty of the parameter estimates used in the posterior prediction. Note that as the training size increases, this term vanishes in equation A.4, while the fundamental source of noise due to the output noise remains. Approximating the resulting *t*-Student distribution by a normal one, under the assumption of a large number of degrees of freedom, leads to an equivalent hierarchical formulation: a sample point **x**(*z*_{q}) is generated, with noise, from an unknown mean *x*_{j}, for which posterior distribution A.7 is available given the current set of sufficient statistics and the input query *z*_{q}.
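For reference, the standard normal-inverse-Wishart update (Gelman et al., 2004) underlying these posteriors has the following form; the notation here is generic and is not tied to the letter's own symbols or equation numbering:

```latex
\begin{aligned}
\mu \mid \Sigma &\sim \mathcal{N}\!\left(\mu_0,\, \Sigma/\kappa_0\right), \qquad
\Sigma \sim \mathcal{IW}\!\left(\nu_0,\, \Lambda_0\right), \\[4pt]
\mu_n &= \frac{\kappa_0 \mu_0 + n\bar{z}}{\kappa_0 + n}, \qquad
\kappa_n = \kappa_0 + n, \qquad
\nu_n = \nu_0 + n, \\[4pt]
\Lambda_n &= \Lambda_0
  + \sum_{i=1}^{n}\left(z_i-\bar{z}\right)\left(z_i-\bar{z}\right)^{\mathsf{T}}
  + \frac{\kappa_0 n}{\kappa_0 + n}\left(\bar{z}-\mu_0\right)\left(\bar{z}-\mu_0\right)^{\mathsf{T}},
\end{aligned}
```

where $\bar{z}$ is the sample mean of the $n$ observations. The prior strength parameters $\kappa_0$ and $\nu_0$ play the same role as the prior strengths discussed in section 5.1.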

## Acknowledgments

We thank the anonymous reviewers for their valuable comments and suggestions, which greatly contributed to improving the quality of this letter. We are also grateful to Lorenzo Jamone for sharing the iCub data set.

## References

## Notes

^{1}

LWPR 1.2.3: http://wcms.inf.ed.ac.uk/ipab/slmc/research/software-lwpr. SOGP 2.0, ROGER 1.5: http://cs.brown.edu/people/dang/code.shtml.

^{2}

IMLE 1.1: http://users.isr.ist.utl.pt/~bdamas/IMLE.

^{3}

GPML 3.1: http://www.gaussianprocess.org/gpml/code/matlab/doc/.

^{4}

This is probably explained by a lesser tendency to overfit when the output dimension increases.

^{5}

This behavior was unfortunately observed in the following experiments, and so ROGER was removed from the remaining tests. Note that the good ROGER results shown in Figure 4b were a consequence of a careful choice of its parameters: small changes in these values would typically lead to more than two solutions being predicted in many regions of the input space.