We present an approach for constructing nonlinear empirical mappings from high-dimensional domains to multivariate ranges. We employ radial basis functions and skew radial basis functions for constructing a model using data that are potentially scattered or sparse. The algorithm progresses iteratively, adding a new function at each step to refine the model. The placement of the functions is driven by a statistical hypothesis test that accounts for correlation in the multivariate range variables. The test is applied to training and validation data and reveals nonstatistical or geometric structure when it fails. At each step, the added function is fit to data contained in a spatiotemporally defined local region to determine the parameters—in particular, the scale of the local model. The scale of the function is determined by the zero crossings of the autocorrelation function of the residuals. The model parameters and the number of basis functions are determined automatically from the given data, and there is no need to initialize any ad hoc parameters save for the selection of the skew radial basis functions. Compactly supported skew radial basis functions are employed to improve model accuracy, order, and convergence properties. The extension of the algorithm to higher-dimensional ranges produces reduced-order models by exploiting the existence of correlation in the range variable data. Structure is tested not just in a single time series but between all pairs of time series. We demonstrate the new methodologies on several illustrative problems, including modeling data on manifolds and the prediction of chaotic time series.
In data-driven modeling problems, the goal is generally to construct a representation of the data that is both descriptive and predictive. The requirement that the model be descriptive dictates that it accurately reproduce the relationships in the observed data, while the requirement that the model be predictive means that unobserved inputs, or domain values, be mapped to meaningful outputs, or range values. In general terms, it is the predictive quality of the model, or its ability to generalize from observations, that leads to added value and the potential for knowledge discovery.
When data may be viewed as being sampled from a manifold, we may appeal to some powerful theorems from differential geometry to guide our modeling approach. For example, Whitney's (1936) (easy) embedding theorem pertains to representing data as the graph of a nonlinear function and provides guidance concerning the dimension of the domain when it is not known a priori. More precisely, Whitney's theorem states that an md-dimensional manifold residing in an n-dimensional ambient space may be embedded in a 2md+1 dimensional subspace.1 The proof of the theorem is constructive and shows that there is a large set of projections such that there exists a nonlinear inverse that affords perfect reconstruction of the data.2 Whitney's theorem pertains not to data but to manifolds. When data may be viewed as a sampled manifold, the question arises concerning how to construct the nonlinear function that Whitney's theorem proves exists.
In contrast, Takens' theorem (1980) indicates that data sampled from an md-dimensional smooth manifold may be reconstructed (up to a diffeomorphic copy) by using a time-delay embedding in a space of dimension 2md+1. What is surprising about this theorem is that it states that only a single component, a scalar function, of the time-dependent trajectory must be observed to permit this reconstruction. This permits a now-well-known framework for the time-series prediction problem (Packard, Crutchfield, Farmer, & Shaw, 1980). These geometric theorems suggest that data on manifolds may possess a structure that is richer than simple nonlinearity and may be exploited in the modeling process. This is the primary motivation for the algorithm proposed in this letter.
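The delay-coordinate construction underlying Takens' theorem is simple to state concretely. The following sketch (the function name and interface are ours, not from the letter) maps a scalar time series into a space of dimension dim using lagged copies of the observable:

```python
import numpy as np

def delay_embed(x, dim, tau=1):
    """Embed a scalar time series x into R^dim using time delays.

    Row t of the output is (x[t], x[t+tau], ..., x[t+(dim-1)*tau]),
    the standard delay-coordinate map used in Takens' theorem.
    """
    x = np.asarray(x)
    n = len(x) - (dim - 1) * tau
    if n <= 0:
        raise ValueError("series too short for this embedding")
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])
```

For a manifold of dimension md, the theorem suggests taking dim = 2md + 1; the rows of the embedding then trace out a diffeomorphic copy of the underlying trajectory.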
The objective of this letter is to present a new approach for building nonlinear models for multivariate data that is particularly useful for modeling time-evolving trajectories on manifolds. The main contribution is a practical algorithm to model geometric structure not just in the individual components of the range of the model but also to identify and model structure that exists between all pairs of components. The proposed algorithm is an extension of our previous work (Jamshidi, 2004; Jamshidi & Kirby, 2007; Anderle & Kirby, 2001), which deals with modeling where the range dimension is one. As in the univariate case, the proposed algorithm progresses iteratively, adding a new function at each step to refine the model. The placement of the functions is driven by a statistical hypothesis test, now in higher dimensions, that reveals geometric structure when it fails. At each step, the added function is fit to data contained in a spatiotemporally defined local region to determine the model parameters. A great advantage of the proposed algorithm is that it does not require ad hoc parameters save for the selection of the skew radial basis functions. Thus, the number of basis functions required for an accurate fit is determined automatically by the algorithm, making it a powerful tool for modeling multivariate time series, especially when the additional manifold structure may be exploited.
This letter is organized as follows. In section 2, the geometric theorems are discussed in the context of data analysis. In section 3.1, basic facts about radial basis functions (RBFs) are reviewed, and a recent variation, skew RBFs (sRBFs), is described. In section 4, the theoretical aspects of the multivariate RBF algorithm are presented. In section 5, the basic multivariate RBF algorithm is presented in detail. Numerical experiments are described in section 6, illustrating applications of Whitney's and Takens' theorems. Section 7 summarizes the main contributions of this letter.
2. Geometric Framework
Here we outline the geometric framework of the algorithms. We shall see that at the heart of each problem is a need for construction of a nonlinear mapping of multivariate data.
2.1. Representing Data as a Graph of a Function.
If the rank of the projection is d, where d>2md, then Whitney's theorem ensures the existence of a global map from the range of the projector to its null space; that is, there exists an f such that equation 2.2 will hold. We have referred to the representation in equation 2.3 as the Whitney reduction network (Broomhead & Kirby, 2001). Note that the special case of principal component analysis is obtained if the linear term contains enough dimensions to fully reconstruct, or encapsulate, the data. However, if the data have the structure of a manifold, then it is more efficient to take the rank of the projection to be as small as possible and to have the nonlinear term encode the manifold structure. Determining parsimonious representations for f in this expression is one of our main motivations for developing efficient algorithms for multivariate RBFs.
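As a rough illustration of this decomposition (the helper below is hypothetical and uses a PCA projector for the linear term, which is only one possible choice, not the letter's specific construction), centered data can be split into coordinates in the range of a rank-d projector, which form the domain of f, and coordinates in its null space, which f must predict:

```python
import numpy as np

def graph_split(X, d):
    """Split centered data into rank-d projection coordinates (domain of f)
    and null-space coordinates (the values f must reproduce).

    Uses the top-d principal directions as an illustrative projector.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    domain = Xc @ Vt[:d].T        # coordinates in the range of the projector
    null_coords = Xc @ Vt[d:].T   # coordinates in its null space; target of f
    return domain, null_coords
```

Modeling the map from the first block of coordinates to the second is precisely the multivariate fitting problem the RBF algorithm addresses.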
2.2. Time-Delay Embedding.
3. Relationship to Other Work
Here we describe the background of RBFs and provide an overview of recent work in the general area of manifold learning.
3.1. Data Fitting with Radial Basis Functions.
RBFs have been widely used for data approximation. Multiquadrics, that is, functions of the form φ(r) = (r^2 + c^2)^(1/2), were introduced by Hardy (1997) for modeling scattered data on topographical surfaces. Thin plate splines, φ(r) = r^2 log r, were introduced by Harder and Desmarais (1972) for surface interpolation. Gaussian RBFs, φ(r) = exp(−r^2/σ^2), were proposed by Broomhead and Lowe (1988) for data fitting and classification. There is a significant literature treating theoretical aspects of RBFs, including universal approximation theorems (see, e.g., Powell, 1992; Park & Sandberg, 1991, 1993; Schaback & Wendland, 2000). A large number of algorithms have been proposed in the literature for computing the model parameters (Platt, 1991; Karayiannis & Mi, 1997; Holmes & Mallick, 1998; Yingwei, Sundararajan, & Saratchandran, 1998; Nabney, McLachlan, & Lowe, 1996). The research monographs (Lee & Haykin, 2001; Buhmann, 2003; Wendland, 2005) contain references treating additional theory, algorithms, and applications.
The issue now is to determine the parameters associated with the data-fitting problem. Of particular importance are the locations (centers) of the fitting functions and the widths of the basis functions, as well as the number of fitting functions Nc. A method for initializing these using the correlation and cross-correlation structure is described in section 4. Once initialized, these parameters, together with the remaining model parameters, may be refined using a variety of nonlinear optimization routines.
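For fixed centers and widths, the remaining weights follow from a linear least-squares problem. The sketch below is our own minimal illustration with Gaussian RBFs (names and interfaces are not from the letter); it also shows why multivariate ranges add no structural difficulty at this stage, since each output component simply receives its own weight vector:

```python
import numpy as np

def fit_rbf(X, Y, centers, width):
    """Solve for the linear weights of a Gaussian RBF model by least squares.

    Model: y(x) = sum_k w_k * exp(-||x - c_k||^2 / width^2); Y may have
    several columns, one per component of the multivariate range.
    """
    D2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # squared distances
    Phi = np.exp(-D2 / width**2)                               # design matrix
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return W, Phi

def eval_rbf(Xnew, centers, width, W):
    """Evaluate the fitted RBF model at new domain points."""
    D2 = ((Xnew[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-D2 / width**2) @ W
```

In the letter's algorithm, the centers and widths themselves are not free knobs but are supplied by the correlation-based procedure of section 4.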
3.2. Manifold Learning.
This letter is particularly concerned with modeling multivariate time series on manifolds, where locally the structure behaves like Euclidean space. This setting is attractive given the existence of Whitney's theorem and Takens' theorem described in section 2, which indicate that data on manifolds are particularly amenable to nonlinear modeling by constructing an empirical representation of a function f that captures the structure of the data from examples. The review presented here is not intended to be comprehensive but rather a sampling of the literature on this topic.
The modeling procedure presented here represents a special case of a more general area now referred to as manifold learning. We note that this field comprises a collection of approaches to model data assuming a special structure, in the same way lines of best fit are used to model data where we assume some linear trend exists. It is not essential that the data actually reside on a manifold to make these techniques useful in the same way that data need not fit exactly on a line when we use linear least squares. Of course, the more like a manifold the data are, the better we can expect the results to be.
There are a variety of ways to approach the manifold learning problem. A local approach based on data charts models the data in patches that, when taken together, form a complete manifold model. Each chart consists of a connected neighborhood of the manifold, equipped with a projection mapping from the manifold to a linear space (whose dimension is the topological dimension of the manifold) and an inverse function from the linear space to the manifold.
The composition of these functions behaves as the identity mapping. (For additional details on this approach, see Hundley, 1998; Hundley, Kirby, & Miranda, 1999.) A probabilistic approach for generating an atlas of charts is provided in Brand (2003). This algorithm employs a mapping that preserves local geometric relations in the manifold and is pseudo-invertible. The algorithm decomposes the sample data into locally linear low-dimensional patches that are merged into a single low-dimensional coordinate system by stochastic optimization.
A manifold learning approach referred to as local linear embedding (Roweis & Saul, 2000) represents each point in the ambient space as a weighted sum of its nearest neighbors. This weighting matrix is then used to generate a configuration of points with the same spatial relationships in a space of reduced dimension. We note that this approach results in an eigenvector problem where the matrix has the same size as the number of points in the sample. Although, strictly speaking, this is not an isometric embedding of the data, it may be viewed loosely as topology preserving.
Multidimensional scaling (MDS) approaches this problem from a different perspective (Mardia, Kent, & Bibby, 1980): given a set of distances between points, is it possible to locate these points in Euclidean space such that they have the same interpoint distances? MDS provides a solution in terms of the eigenvectors of the interpoint distance matrix. An extension of this approach to data on manifolds, referred to as ISOMAP, is based on computing geodesic distances numerically from sampled data and using these distances in the MDS eigenvector problem (Tenenbaum, de Silva, & Langford, 2000). The result is an embedding of the data in Euclidean space with distances between the data points, as measured on the manifold, preserved. Other techniques have been proposed for constructing isometric embeddings, notably Belkin and Niyogi (2003), Donoho and Grimes (2003), and Coifman and Lafon (2006).
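For concreteness, classical MDS can be sketched as an eigenvector computation on the doubly centered squared-distance matrix; ISOMAP performs the same step with geodesic distances substituted for Euclidean ones. The helper below is a generic textbook sketch, not code from any of the cited works:

```python
import numpy as np

def classical_mds(D, dim):
    """Classical MDS: embed points in R^dim from an interpoint distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]          # keep the top-dim eigenpairs
    scale = np.sqrt(np.maximum(w[idx], 0))   # clip small negative eigenvalues
    return V[:, idx] * scale
```

When D comes from a configuration that genuinely lies in a dim-dimensional Euclidean space, the recovered coordinates reproduce the interpoint distances exactly, up to a rigid motion.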
In Roweis, Saul, and Hinton (2002), the local linear models are represented by a mixture of factor analyzers, and the global coordination of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly as one traverses a connected path on the manifold even when the path crosses the domains of many different local models. To get an efficient algorithm that allows separate local models to learn consistent global representations, an automatic alignment procedure that maps the disparate internal representations learned by several local dimensionality-reduction experts into a single coherent global coordinate system for the original data space is studied in Teh and Roweis (2002).
Kernel principal component analysis (KPCA; Verbeek, Roweis, & Vlassis, 2004) treats the processing of data on manifolds by first flattening the data out using an appropriately selected kernel function, for example, a Veronese embedding (Schölkopf, Smola, & Müller, 1998). This approach has the advantage of turning nonlinear problems into tractable linear problems via the kernel trick. KPCA is also related to the well-known classification technique of support vector machines (Vapnik, 1995).
Projection pursuit was proposed as a linear approach for reducing the dimension of data to reveal clusters in data and their separation (Friedman & Tukey, 1974). It is often used in the context of an interactive data visualization system such as the projection pursuit guided tour (Cook, Buja, Cabrera, & Hurley, 1995).
A global approach to manifold learning that involves reducing data such that it is constrained to lie on a prescribed manifold in the reduced space was presented in Kirby and Miranda (1996). For example, one can construct the best topological circle through a data set using this approach. This idea was extended to spheres in Hundley, Kirby, and Miranda (1995).
4. Statistical Background for the Multivariate sRBF Algorithm
In general, the derivation of the asymptotic distribution of the sample cross-correlation function is quite complicated even for multivariate moving averages (Brockwell & Davis, 1991). The methods employed for the univariate case are not immediately adaptable to the multivariate case. An important special case arises when the two component time series have independent moving averages. The asymptotic distribution of the sample cross-correlation for such a process is given in the following theorem:
As reported in Brockwell and Davis (1991), without knowing the correlation function of each of the processes, it is impossible to decide whether the two processes are uncorrelated with one another. The problem is resolved by prewhitening the two series before computing the cross-correlation; that is, the two series are transformed to white noise by the application of suitable filters. In other words, any test for independence of the two component series cannot be based solely on estimated values of the cross-correlation without taking into account the nature of the two component series. Note that since in practice the true model is nearly always unknown, and since the data Xtj, t ≤ 0, are not available, it is convenient to replace the white noise sequences by the residuals, which, if we assume that the fitted models are in fact the true models, are white noise sequences. To test the hypothesis H0 that the two component series are independent, we observe that under H0, the corresponding prewhitened series are also independent. Under H0, theorem 1 implies that the sample cross-correlations of the two prewhitened series are asymptotically independent normal with mean 0 and variances L−1. An appropriate test for independence can therefore be obtained by comparing the values of these sample cross-correlations with the bounds ±1.96L−1/2. If we prewhiten only one of the two original series, then under H0, theorem 1 implies that the sample cross-correlations of the resulting pair are asymptotically normal with mean 0 and variances L−1; hence, for any fixed h, the sample cross-correlation at lag h also falls (under H0) between these bounds with a probability of approximately 0.95.
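A minimal sketch of this test, assuming both residual series have already been prewhitened (function names and the choice of maximum lag are ours, for illustration), compares the sample cross-correlations against the approximate 95% bounds ±1.96L−1/2:

```python
import numpy as np

def crosscorr(x, y, max_lag):
    """Sample cross-correlation (after mean removal) at lags 0..max_lag."""
    L = len(x)
    x = x - x.mean()
    y = y - y.mean()
    denom = np.sqrt((x * x).sum() * (y * y).sum())
    return np.array([(x[: L - h] * y[h:]).sum() / denom
                     for h in range(max_lag + 1)])

def iid_test(x, y, max_lag=20):
    """Return True if all sample cross-correlations stay inside the 95% bounds.

    Under H0 (independent prewhitened series), each sample cross-correlation
    is approximately N(0, 1/L), so |rho(h)| should remain below 1.96/sqrt(L).
    """
    L = len(x)
    bound = 1.96 / np.sqrt(L)
    return bool(np.all(np.abs(crosscorr(x, y, max_lag)) < bound))
```

In the algorithm of section 5, a failure of this test signals that residual structure remains and a new basis function should be allocated.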
5. Multivariate Algorithm Implementation
The main difference with the univariate algorithm is the statistical hypothesis test. Again, the question of whether a new basis function should be added is answered by the i.i.d. test. We shall see that this test also indicates where the new basis function should be initialized. First, we compute the autocorrelation functions of all m time series. If all of these pass the white noise (WN) or i.i.d. test, then the cross-correlations among the time series are considered. If there is structure in the autocorrelations or cross-correlations of the time series, then the i.i.d. hypothesis will be rejected.
As in the univariate case, the next requirement is to determine where the new basis function should be located to optimally reduce the structure in the model residuals. In our extension, we look for the point in the domain that makes the largest contribution to the auto- or cross-correlation that has caused the test to fail.
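One simple way to realize this search, sketched below under our own simplifying assumptions (a single residual series and a single failing lag; the letter's procedure handles all pairs of series), is to scan the lagged products that make up the offending sample correlation and take the index of the largest contribution:

```python
import numpy as np

def locate_center(residuals, lag):
    """Index of the sample contributing most to the failing correlation at `lag`.

    The sample correlation at lag h is a sum of products r[t]*r[t+h]; the
    index with the largest absolute product marks where structure remains,
    and the corresponding domain point seeds the next basis function.
    """
    r = residuals - residuals.mean()
    contrib = np.abs(r[:-lag] * r[lag:]) if lag > 0 else r ** 2
    return int(np.argmax(contrib))
```

The domain point associated with the returned index then serves as the initial center of the newly allocated basis function.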
6. Numerical Results
In this section, we provide numerical evidence that the proposed k-variate algorithm results in a model that significantly outperforms k univariate models. The enhanced parsimony of the model is a result of accounting for correlations between the multivariate ranges.
6.1. Multivariate Pringle Data Set.
To begin, we consider the application of the multivariate RBF algorithm to the Pringle data set (Broomhead & Kirby, 2000, 2001). As described below, we build a two-variate data set with correlation one between the first and second time series as a first illustration of the methodology. This problem illustrates the challenges of modeling data as the graph of a function with range dimension two.
The Pringle data set, as proposed in Broomhead and Kirby (2005), can be generated as the solution of a system of ordinary differential equations depending on two parameters. A numerically integrated trajectory of the attracting cycle (x, y, z) is shown in Figure 1a.3 A second data set, also displayed in Figure 1a, is constructed by amplifying the z coordinate as (x, y, 2z).
A good projection for this problem, in the sense of minimizing the Lipschitz constant of f, has been shown to be the x, y-plane (Broomhead & Kirby, 2005). Thus, the (x, y) data values are the inputs to our function, and the range data values (z, 2z) constitute the associated output of the function. The training set was selected to consist of 101 points (almost two cycles), while an additional 100 points were used for validation. In this experiment, the testing data set was selected to be the next 500 points, or approximately nine cycles. We use the Hanning-RBF in equation 3.6 in the numerical experiments in this section.
Figure 2 shows the output of the first RBF allocated by the multivariate algorithm. Figure 2a is the output associated with the first component of the multivariate range data, while Figure 2b is the output associated with the second component. The final model consists of four RBFs, and training was terminated when the residuals passed the i.i.d. test. The training, validation, and testing data sets and the output of the multivariate algorithm on the multivariate Pringle data set are shown in Figure 1b. The performance of the RBF fit on the multivariate Pringle data set in the RMSE sense is shown in Figure 3a. The confidence level of the fitted model on the training and validation data sets as new basis functions are added to the model is shown in Figure 3b. After four RBFs have been added to the model, the confidence level of the i.i.d. hypothesis test is well above 95% for both the training and validation data, indicating a lack of structure in the residuals.
This simple example illustrates the ideal behavior of the multivariate algorithm when the output time series are highly correlated. Two univariate models in this instance would double the model complexity. Our next example, while harder to visualize, is a more challenging test for the algorithm.
6.2. Multivariate Mackey-Glass.
Figure 4 shows the confidence level of the multivariate model as new Arctan-Hanning sRBFs are added to the model for the first case. We provide the confidence levels for all pairs of the time series. In almost all cases, the 95% confidence level was achieved for the training set and validation set with approximately the same number of RBFs. In general, we advocate that this confidence level be attained for both sets unless the RMSE on the validation set increases, suggesting the possibility of overfitting. The number of sRBFs required to achieve 95% confidence on both the training and validation data sets is 43 for the two-variate problem. In this case, the RMSE of the final model is 0.0175, which is below the noise floor. The associated univariate 25-steps-ahead prediction problem requires 20 sRBFs to achieve 95% confidence on the training and validation sets and results in a model with an RMSE of 0.0128, again below the noise floor. The univariate model for 50-steps-ahead prediction requires 28 sRBFs to reach 95% confidence on the training and validation sets, with an RMSE below the noise floor of 0.0192. In summary, for this data set, the sRBF representation of f using two univariate models consisting of 53 modes is over 20% larger than the two-variate model consisting of 43 sRBFs, while the errors for both representations are under the noise floor.
Figure 5 shows both outputs of the two-variate model for the first Mackey-Glass case. Figure 5a shows the output for the 25-steps-ahead prediction, while Figure 5b provides the model output for the 50-steps-ahead prediction. The log plot of the RMSE of the model as new sRBFs are added is shown in Figure 5c.
We repeated this experiment for a second case. Here, the representation using two univariate models was over 60% larger than the single two-variate model, consisting of 74 and 49 sRBFs, respectively, with comparable errors below the noise floor. This example provides convincing evidence of the savings one may obtain using the cross-correlation structure of the residuals of the model. Details of this second example may be found in Jamshidi (2008).
We observed that the extension of the ACF test for i.i.d. noise to k-variate ranges produces models of significantly smaller order than using k-univariate models while attaining comparable prediction errors. This is a consequence of the fact that the correlations as well as the cross-correlations of the residuals of the multivariate model are being exploited during the model-building process. We applied the proposed algorithm to the problem of representing data as the graph of a function as well as the benchmark problem of forecasting the chaotic time series produced by the Mackey-Glass equation. In each case, we saw a substantial reduction in the order of the k-variate model when compared to k-univariate models.
The opportunity for applications of the proposed RBFs is significant: various problems in nonlinear signal processing, optimal control, computer vision, pattern recognition, and prediction, including financial time series. In future work, we will apply these functions to representing data on manifolds as graphs of functions (Broomhead & Kirby, 2000, 2001), as well as to the low-dimensional modeling of dynamical systems (Broomhead & Kirby, 2005).
The algorithm proposed here exploits the existence of cross-correlation between the output time series. Hence, we expect the advantage of this approach to be especially significant in cases where the range data possess a related geometric or statistical structure.
This work is partially supported by NSF grants MSPA-MCS 0434351, ATM-530884, and DOD-USAF-Air Force FA-9550-08-1-0166. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the AFOSR.
This theorem is particularly helpful when n>>2md+1.
In mathematical terms, there is an open and dense set of projections that allow one to reduce the dimension to 2md+1.
In this example, we are concerned only with fitting data on the limit cycle and ignore transients.
Arta Jamshidi is now at the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K.