Abstract

We consider the problem of multiclass adaptive classification for brain-computer interfaces and propose the use of multiclass pooled mean linear discriminant analysis (MPMLDA), a multiclass generalization of the adaptation rule introduced by Vidaurre, Kawanabe, von Bünau, Blankertz, and Müller (2010) for the binary class setting. Using publicly available EEG data sets and tangent space mapping (Barachant, Bonnet, Congedo, & Jutten, 2012) as a feature extractor, we demonstrate that MPMLDA can significantly outperform state-of-the-art multiclass static and adaptive methods. Furthermore, efficient learning rates can be achieved using data from different subjects.

1.  Introduction

Brain-computer interfaces (BCI) (Vidal, 1973) aim to provide human subjects with control over devices or computer applications while bypassing the traditional muscular paths. In other words, these interfaces might allow humans to control devices using only their brain activity as measured by, for example, electroencephalogram (EEG) (Haas, 2003). The control of such devices could offer an additional channel of communication and action that could improve the quality of life for people with severe disabilities (Wolpaw, Birbaumer, McFarland, Pfurtscheller, & Vaughan, 2002).

The most common procedure for setting up a BCI is as follows. The user participates in a training (i.e., calibration) session in which he or she is instructed to perform a specific mental task while the brain activity generated is recorded. The recorded data (usually designated as training data) are used to extract discriminative features associated with various user intentions. The data are subsequently used to train a classifier that predicts the user's intention during the testing (feedback) session (van Gerven et al., 2009). To optimize the feature extraction procedure, a time window of interest should be selected, as well as the most discriminative frequency bands. This choice is typically experimental and subject dependent, and it usually requires some level of prior physiological knowledge. Several algorithms are available for automatically optimizing these parameters (Blankertz, Lemm, Treder, Haufe, & Müller, 2011; Fernandez-Vargas, Pfaff, Rodríguez, & Varona, 2013; Ang, Chin, Wang, Guan, & Zhang, 2012). In a multiclass BCI problem, the most common feature extractors are based on supervised data projections, such as one-against-one common spatial patterns (CSP) (Blankertz, Tomioka, Lemm, Kawanabe, & Müller, 2008; Dornhege, Blankertz, Curio, & Müller, 2004) or multiclass CSP (MCSP) (Grosse-Wentrup & Buss, 2008). Another interesting approach that has been introduced recently is tangent space mapping (TSM) (Barachant et al., 2012), an unsupervised nonlinear projection of the data covariance matrices that facilitates their classification.

Usually the binary class BCI classification problem is considered linear (Müller, Anderson, & Birch, 2003; Farquhar, 2009) and a linear discriminant analysis (LDA) classifier (Fisher, 1936) can be used effectively. When the dimension of the feature space is large in relation to the number of samples (training data), the classifier must be regularized (Bishop, 2007; Blankertz et al., 2011). In the multiclass case, the multiclass LDA (MLDA) has been shown to be the best choice (Yang, Xu, Liu, Xu, & Yao, 2009; Felix et al., 2005; Tang, Tang, & Gong, 2008).

One common problem encountered by the classifiers in EEG-based BCIs involves changes in the feature statistics over time due to the nonstationary character of the EEG data (Krauledat, 2008). These changes generally result in poor classifier generalization performance (Shenoy, Krauledat, Blankertz, Rao, & Müller, 2006; Millán, 2004). To overcome this problem, several methods that adapt the feature space or the classifier parameters have been proposed (Tomioka, Hill, Blankertz, & Aihara, 2006; Hasan & Gan, 2009; Llera, Gómez, & Kappen, 2012).

The most common (i.e., the safest) strategy is to update the model parameters that are class independent while keeping the rest fixed. In the case of binary LDA, this strategy can be used to update the global average covariance matrix (or its inverse) (Vidaurre, Schlögl, Cabeza, Scherer, & Pfurtscheller, 2006) or the global mean of the data (Vidaurre, Kawanabe, von Bünau, Blankertz, & Müller, 2010). When the adaptation is performed in the feature space and not in the classifier parameters, the strategy can be used to reduce the effect of nonstationarity by transforming the CSP filters (Tomioka et al., 2006) or, more generally, the actual testing data (Arvaneh, Guan, Ang, & Quek, 2013) in a linear fashion.

To adapt class-dependent classifier parameters, it is necessary to introduce uncertainty into the model, as in the unsupervised adaptive gaussian mixture model classifier (Hasan & Gan, 2009). To the best of our knowledge, the unsupervised adaptation of class-dependent feature space extraction parameters has yet to deliver any applicable results. Updating the global mean of the data allows the adaptation of the bias of the binary LDA discriminant function. This is usually a robust technique for BCI classifier adaptation and is referred to as pooled mean linear discriminant analysis (Pmean) (Vidaurre et al., 2010). It adapts to shifts in the feature space that are commonly attributed to non-class-related nonstationarity in binary EEG-based imagery movement BCI (Shenoy et al., 2006). Despite its simplicity, Pmean can achieve state-of-the-art binary unsupervised adaptive classification performance. Moreover, it has been shown to be a valuable tool for helping to increase the number of possible BCI users (Vidaurre, Sanelli, Müller, & Blankertz, 2011). Similar binary performance has been reported using data space adaptation (DSA), a feature-based adaptation method proposed recently by Arvaneh et al. (2013).

In the multiclass setting, the adaptation process clearly becomes more complex and poses more difficult challenges. State-of-the-art unsupervised multiclass methods, such as enhanced Bayesian LDA (EBLDA) (Xu, Yang, Lei, & Yao, 2011), perform unsupervised retraining of each of the pair-wise Bayesian LDA classifiers (BLDA) (MacKay, 1992). This adaptive approach uses a generative model for class-conditional distributions, and its performance is strongly dependent on the quality of the initialization. For binary problems, however, evidence suggests that the performance of the Pmean update approaches that obtained using supervised updates (Vidaurre et al., 2010), thus often outperforming adaptive unsupervised generative models (e.g., EBLDA).

In this letter, we introduce a novel multiclass extension of the binary Pmean adaptation of the LDA classifier and demonstrate that this kind of adaptation is better suited for multiclass adaptation than are the previously mentioned state-of-the-art methods.

In section 2, we present the proposed method, followed by a description of the three EEG data sets used in this work in section 3. The results are presented in section 4. The letter concludes with a discussion in section 5.

2.  Methods

In this section, we present the methods for feature extraction and classification that we consider in the rest of the work. In section 2.1 we specify the feature extraction procedure: tangent space mapping (TSM). We describe multiclass LDA in section 2.2 and present the proposed algorithm for multiclass adaptive classification, the MPMLDA, in section 2.3.

2.1.  Tangent Space Mapping.

Tangent space mapping (TSM) as a feature space was recently presented for BCIs in the context of tangent space linear discriminant analysis (TSLDA) (Barachant et al., 2012). The referenced work highlights the potential of TSM as a feature extractor for the multiclass classification of covariance matrices in the context of BCI.

In light of the observation that covariance matrices belong to the Riemannian manifold of symmetric positive-definite matrices $\mathcal{S}^+$ (Moakher & Zéraï, 2011), TSM performs a nonlinear projection of the spatial covariance matrices of the data into the tangent space (do Carmo, 1976) of $\mathcal{S}^+$ at the Riemannian (or geometric) mean (Heath, 1981) of the spatial covariance matrices of the training data. The Riemannian mean $\mathfrak{G}$ of a set of covariance matrices $C_1, \ldots, C_m$ is defined as
$$\mathfrak{G}(C_1, \ldots, C_m) = \arg\min_{C \in \mathcal{S}^+} \sum_{k=1}^{m} \delta_R^2(C, C_k), \tag{2.1}$$
where $\delta_R$ denotes the Riemannian distance induced by the Riemannian geometry on $\mathcal{S}^+$, and it can be computed as a generalized eigenvalue problem (Moakher, 2005). More precisely, for any two $C_1, C_2 \in \mathcal{S}^+$,
$$\delta_R(C_1, C_2) = \left\| \log\!\left(C_1^{-1} C_2\right) \right\|_F = \left[\, \sum_{i=1}^{n} \log^2 \lambda_i \right]^{1/2}, \tag{2.2}$$
where $\lambda_i$, $i = 1, \ldots, n$, are the eigenvalues of $C_1^{-1} C_2$.

The geometric mean exists within the considered manifold, and it is unique (Karcher, 1977). Although there is no closed-form solution for its computation, it can be computed efficiently using iterative algorithms. In this work we consider the algorithm presented by Fletcher and Joshi (2004).
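To make this concrete, the following is a minimal Python sketch of the Riemannian distance of equation 2.2 and of an iterative geodesic-averaging scheme for the mean of equation 2.1, in the spirit of Fletcher and Joshi (2004). The function names, the arithmetic-mean initialization, and the stopping rule are our own illustrative choices, not a reproduction of the original algorithm.

```python
# Hedged sketch of equations 2.1-2.2; names and initialization are ours.
import numpy as np
from scipy.linalg import eigvalsh, expm, inv, logm, sqrtm

def riemann_distance(C1, C2):
    """Equation 2.2: sqrt(sum_i log^2(lambda_i)), with lambda_i the
    eigenvalues of C1^{-1} C2 (a generalized eigenvalue problem)."""
    lam = eigvalsh(C2, C1)                 # solves C2 v = lambda C1 v
    return np.sqrt(np.sum(np.log(lam) ** 2))

def riemann_mean(covs, n_iter=50, tol=1e-8):
    """Iterative minimizer of equation 2.1 for a list of SPD matrices."""
    G = np.mean(covs, axis=0)              # start at the arithmetic mean
    for _ in range(n_iter):
        G_half = np.real(sqrtm(G))
        G_ih = inv(G_half)
        # average the log-mapped matrices in the tangent space at G ...
        T = np.mean([np.real(logm(G_ih @ C @ G_ih)) for C in covs], axis=0)
        G = G_half @ expm(T) @ G_half      # ... and map back to the manifold
        if np.linalg.norm(T) < tol:        # stop when the update is negligible
            break
    return G
```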

Given a set of covariance matrices $C_1, \ldots, C_m$ and denoting their geometric mean as $\mathfrak{C} = \mathfrak{G}(C_1, \ldots, C_m)$, the tangent space mapping at $\mathfrak{C}$ of a given covariance matrix $C_k$ is (after several trivial simplifications) given by
$$S_k = \log\!\left(\mathfrak{C}^{-1/2}\, C_k\, \mathfrak{C}^{-1/2}\right), \tag{2.3}$$
where log is the logarithm of a matrix, computed via its diagonalization (Barbaresco, 2008).

$S_k$ is a symmetric matrix that in vectorized form (after eliminating the redundant elements due to symmetry) can be used as a feature vector for classification in BCI problems (Barachant et al., 2012). In most cases, and particularly in this work, one unique $\mathfrak{C}$ is computed using all of the covariance matrices in the training set, thus rendering TSM an unsupervised feature extractor for covariance classification.
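As an illustration, the following hedged sketch computes TSM features for a batch of covariance matrices. The $\sqrt{2}$ weighting of the off-diagonal entries, which makes the Euclidean norm of the feature vector match the Frobenius norm of $S_k$, follows Barachant et al. (2012); the function name and array layout are our own assumptions.

```python
# Hedged sketch of equation 2.3 plus vectorization of the upper triangle.
import numpy as np
from scipy.linalg import inv, logm, sqrtm

def tangent_space_features(covs, G):
    """covs: (n_trials, n, n) covariances; G: geometric mean of training covs."""
    n = G.shape[0]
    G_inv_half = inv(np.real(sqrtm(G)))
    iu = np.triu_indices(n)
    w = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))     # preserve Frobenius norm
    feats = []
    for C in covs:
        S = np.real(logm(G_inv_half @ C @ G_inv_half))  # equation 2.3
        feats.append(w * S[iu])                         # vectorized upper triangle
    return np.array(feats)
```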

2.2.  Multiclass Linear Discriminant Analysis.

The multiclass linear discriminant analysis (MLDA) classifier (Bishop, 2007) is defined by a discriminant function of the input feature vector for each class pair (i, j). A majority vote or a probabilistic interpretation of the pairwise outputs can be used to produce a unique prediction from the set of binary classifiers (Tax & Duin, 2002).

Given a K-class classification problem and a set of m labeled data vectors $\{(x_t, y_t)\}_{t=1}^{m}$ with $y_t \in \{1, \ldots, K\}$, for each class $i$ we can compute the class-wise means $\mu_i$ and covariance matrices $\Sigma_i$. Defining the per-class-pair average covariance matrices as $\Sigma_{i,j} = \frac{1}{2}(\Sigma_i + \Sigma_j)$ for $i < j$, we define the discriminant function between classes i and j as
$$f_{i,j}(x) = w_{i,j}^{\top} x + b_{i,j}, \tag{2.4}$$
$$w_{i,j} = \Sigma_{i,j}^{-1} \left(\mu_i - \mu_j\right), \tag{2.5}$$
$$b_{i,j} = -w_{i,j}^{\top} \mu_{i,j}, \tag{2.6}$$
$$\mu_{i,j} = \tfrac{1}{2} \left(\mu_i + \mu_j\right), \tag{2.7}$$
where each pair $(w_{i,j}, b_{i,j})$ describes the vector of weights and the bias term of the discriminant function between classes i and j. For each pair $i < j$, an input feature vector x is classified as class j if $f_{i,j}(x) < 0$ and as class i otherwise. The output of the discriminant function can be interpreted probabilistically by assuming that the probabilities of classes i and j are given by a binomial distribution on the sigmoidal mapping of the discriminant function value (MacKay, 2003).
More concretely, given the discriminant function $f_{i,j}(x)$, $j > i$, between classes i and j, we define the probability of x belonging to class i as $Q_{i,j}(i|x) = \sigma(f_{i,j}(x))$ and, consequently, $Q_{i,j}(j|x) = 1 - Q_{i,j}(i|x)$, with
$$\sigma(z) = \frac{1}{1 + \exp(-z)}. \tag{2.8}$$
Note that this probabilistic interpretation of the discriminant function of the LDA classifier makes a connection between the LDA and the logistic regression model.
The MLDA procedure produces probabilistic output from all of the pairwise probabilities $Q_{i,j}$ in the following manner. For simplicity of notation, we define, for $j < i$, $Q_{i,j}(k|x) = Q_{j,i}(k|x)$. We then define the probability vector $P(x)$, containing in the ith coordinate the probability of class i given x, as
$$P_i(x) = \frac{2}{K(K-1)} \sum_{j \neq i} Q_{i,j}(i|x). \tag{2.9}$$
The final output of the MLDA classifier is the most probable class under this measure, $\arg\max_i P_i(x)$.
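The following minimal sketch assembles equations 2.4 to 2.9 into a pairwise MLDA trainer and a probabilistic combiner. All function and variable names are our own, and labels are assumed to lie in {0, ..., K-1}; the normalization in mlda_probs corresponds to equation 2.9.

```python
# Hedged sketch of pairwise MLDA (equations 2.4-2.9); names are ours.
import numpy as np
from itertools import combinations

def train_mlda(X, y, K):
    """X: (m, d) feature matrix; y: (m,) integer labels in {0, ..., K-1}."""
    mu = {i: X[y == i].mean(axis=0) for i in range(K)}
    Sig = {i: np.cov(X[y == i].T) for i in range(K)}
    clf = {}
    for i, j in combinations(range(K), 2):
        S_ij = 0.5 * (Sig[i] + Sig[j])            # per-class-pair covariance
        w = np.linalg.solve(S_ij, mu[i] - mu[j])  # weights, equation 2.5
        mu_ij = 0.5 * (mu[i] + mu[j])             # pairwise mean, equation 2.7
        clf[(i, j)] = (w, -w @ mu_ij, mu_ij)      # bias b_ij, equation 2.6
    return clf

def mlda_probs(clf, x, K):
    """Probability vector P(x) of equation 2.9."""
    P = np.zeros(K)
    for (i, j), (w, b, _) in clf.items():
        q = 1.0 / (1.0 + np.exp(-(w @ x + b)))    # Q_ij(i|x), equation 2.8
        P[i] += q                                 # evidence for class i
        P[j] += 1.0 - q                           # evidence for class j
    return P / (K * (K - 1) / 2)                  # normalize over all pairs
```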

2.3.  Multiclass Pooled Mean LDA.

In the binary setting, under the assumption of balanced classes, the bias of the discriminant function can be adapted in an unsupervised manner using the global data mean (Vidaurre et al., 2010). (See appendix A for additional information.)

In the multiclass setting, not every sample contributes to all discriminant functions. For this reason, not all discriminant functions must be updated with every sample. We extend the pooled-mean approach to the multiclass case using a probabilistic update for the pairwise class means of equation 2.7 as
$$\tilde{\mu}_{i,j} = \left(1 - \eta\, U_{i,j}(x)\right) \mu_{i,j} + \eta\, U_{i,j}(x)\, x, \tag{2.10}$$
where $\tilde{\mu}_{i,j}$ represents the updated $\mu_{i,j}$, $\eta$ is the learning rate, and $U_{i,j}(x)$ is defined using the probability vector $P(x)$ of equation 2.9 as
$$U_{i,j}(x) = P_i(x) + P_j(x). \tag{2.11}$$
The update in equation 2.10 allows for the adaptation of each bias $b_{i,j}$ through equation 2.6. The MPMLDA algorithm is summarized in algorithm 1.
In the binary setting, equations 2.10 and 2.11 reduce to the binary Pmean update, since $U_{1,2}(x) = P_1(x) + P_2(x) = 1$. The multiclass MPMLDA is thus a natural extension of the original pooled-mean adaptation rule.

One particular feature of MPMLDA is that, given a new data point x, the adaptation is automatically stronger for the discriminant functions between the pairs of classes that are most relevant to x, as equation 2.11 shows. In a supervised scenario (i.e., $P_k(x) = 1$, with k the true class label), it is easy to see that MPMLDA updates only the boundaries between the true class k and the other classes.
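Building on the MLDA sketch above, the following illustrates one pass of the MPMLDA adaptation loop of algorithm 1 (equations 2.10 and 2.11, followed by the bias refresh of equation 2.6). It is a sketch of the update rule under our own naming conventions, not the authors' reference implementation.

```python
# Hedged sketch of one MPMLDA step; reuses mlda_probs/clf from the sketch above.
import numpy as np

def mpmlda_step(clf, x, K, eta):
    P = mlda_probs(clf, x, K)                         # probability vector, eq. 2.9
    y_hat = int(np.argmax(P))                         # prediction for this sample
    for (i, j), (w, _, mu_ij) in clf.items():
        U = P[i] + P[j]                               # pair relevance, eq. 2.11
        mu_ij = (1 - eta * U) * mu_ij + eta * U * x   # pooled-mean update, eq. 2.10
        clf[(i, j)] = (w, -w @ mu_ij, mu_ij)          # bias refresh, eq. 2.6
    return y_hat
```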

3.  Data Sets and Evaluation

We consider three data sets, all containing multiclass imagery-movement tasks.

Physiobank (Goldberger et al., 2000; Schalk, McFarland, Hinterberger, Birbaumer, & Wolpaw, 2004) contains data from 109 subjects performing various combinations of real and imagined movements in one day of recordings. In this work, we consider only the data related to imagery movement of the left hand, right hand, and both feet. Each subject performed 22 trials of each class (3 seconds per trial), and the EEG data were recorded using 64 electrodes. Due to the computational load of the extended analysis presented in this work, we focus on 20 EEG electrodes over the sensorimotor cortex. The large number of users makes this data set convenient for evaluating the impact of feature extraction on a large scale.

BCI competition IV-2a (Brunner, Leeb, Müller-Putz, Schlögl, & Pfurtscheller, 2008) provides data from nine subjects performing four different imagery movement tasks—right hand (RH), left hand (LH), both feet (F), and tongue (T)—during two days of recordings. Each day the subjects performed 72 trials of each task (3 seconds per trial), and the EEG data were recorded using 20 electrodes.

BSI-RIKEN (Cichocki & Zhao, 2011) provides data from several subjects performing binary or multiclass imagery movement tasks. In this work, we consider only the two subjects who performed multiclass tasks (LH, RH, F) on different, well-defined days (subjects B and C). Subject B was recorded on two days and subject C on seven days. Each day they performed approximately 65 trials (3–4 seconds each) of each task, and the EEG data were recorded using five or six electrodes. In this work, we always consider the same five EEG electrodes: C3, Cp3, C4, Cp4, and Cz.

The data set also contains three additional long sessions for subject C, recorded on one further day. We denote these data as subject C2.

In this work, all of the data sets were preprocessed in a similar way before the feature extraction step. In all cases, the data were divided into trials containing the imagery movement period. Channels and trials contaminated with artifacts were eliminated from the training set using an automatic variance-based routine (Nolan, Whelan, & Reilly, 2010). The contaminated channels were also removed from the testing set, after which the data were detrended and band-pass filtered to the 8–30 Hz frequency band. To reduce the presence of filter-induced artifacts in the data, we discarded the first and last half-second of each trial.
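For concreteness, a minimal sketch of the per-trial filtering steps described above is given below, assuming trials stored as (channels × samples) arrays with sampling rate fs; the sampling rate, filter order, and function name are our own assumptions, and the variance-based artifact rejection (Nolan et al., 2010) is not reproduced here.

```python
# Hedged sketch: detrend, zero-phase band-pass to 8-30 Hz, trim half a second.
from scipy.signal import butter, detrend, filtfilt

def preprocess_trial(trial, fs):
    """trial: (n_channels, n_samples) array; fs: sampling rate in Hz."""
    trial = detrend(trial, axis=-1)                        # remove linear trends
    b, a = butter(4, [8.0, 30.0], btype="bandpass", fs=fs)
    trial = filtfilt(b, a, trial, axis=-1)                 # zero-phase filtering
    margin = int(0.5 * fs)                                 # drop filter edge artifacts
    return trial[:, margin:-margin]
```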

We measure BCI performance using Cohen's kappa coefficient (Kraemer, 1982), which assigns a value of zero to random classification and a value of one to perfect classification. This measure is commonly used in multiclass BCI problems (Schlögl, Kronegg, Huggins, & Mason, 2007).
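For a K-class problem, kappa can be computed from the confusion matrix of true and predicted labels as in the following sketch (the helper name is ours); with balanced classes, the chance agreement is approximately 1/K.

```python
# Minimal sketch of Cohen's kappa: 0 for chance-level, 1 for perfect output.
import numpy as np

def cohen_kappa(y_true, y_pred, K):
    conf = np.zeros((K, K))
    for t, p in zip(y_true, y_pred):
        conf[t, p] += 1
    n = conf.sum()
    p_o = np.trace(conf) / n                                  # observed agreement
    p_e = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n**2  # chance agreement
    return (p_o - p_e) / (1.0 - p_e)
```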

We use the Physiobank data set only to compare the different feature extractors, not to evaluate adaptive methods. Note that the reduced number of trials per class available per subject in this data set does not allow the proper evaluation of any adaptive method. In this case, we report classification results based on leave-one-out cross-validation.

We use the BCI-IV-2a and BSI-RIKEN data sets to evaluate and compare the performance of the different classifiers in terms of adaptation. We perform training on the data from one day and testing on data from a subsequent day. Because these data were recorded on different days, they are more likely to present hard nonstationarities.

For subject C2, however, each model was trained and tested on different sessions within the same day. Although all recordings denoted as subject C2 were performed on a single day, we decided to include them in the analysis because each session provides many trials and the three sessions were distributed throughout the day, allowing for clear nonstationary changes. This procedure yields nine evaluations for the BCI-IV-2a data set and 25 evaluations for the BSI-RIKEN data set: 1 for subject B, 21 for subject C (6+5+⋅⋅⋅+1), and 3 for subject C2.

4.  Results

Given that TSM has only recently been introduced and is largely unknown in the BCI community, we present novel results that confirm the quality of TSM as an algorithm for multiclass feature extraction when using static classification. We then analyze and compare the performance of the proposed multiclass adaptive method. In section 4.3, we consider the method's dependence on the learning rate, and in section 4.4, we provide a qualitative analysis of the dynamic changes occurring in the feature space. We assess the significance of the performance differences between methods according to Wilcoxon statistical tests (Demšar, 2006). The observed differences are shown with single or double asterisks to indicate p-values less than 0.05 and 0.01, respectively.

4.1.  TSM as Feature Space for Nonadaptive Classification.

In this section, we illustrate the ability of TSM to serve as multiclass feature extractor for BCI imagery movement problems on a large scale. We present results according to a static MLDA classifier (see section 2.2) and the three data sets considered in this work. Figure 1 provides a comparison of the kappa values obtained using TSM and multiclass CSP (MCSP) (Grosse-Wentrup & Buss, 2008), which is commonly used for multiclass BCI feature extraction. For the BCI-IV-2a data set (middle panel), subject numbers are indicated inside the circles. For the BSI-RIKEN data set (right panel), the circles represent subject C, the square represents subject B, and the crosses represent the different possible test sessions of subject C2 during one day. The mean kappa values are displayed in the upper left (for TSM) and lower right (MCSP) corners of each panel.

Figure 1:

Kappa values obtained through MCSP (x-axis) plotted against kappa values obtained by TSM for the three data sets under consideration. The values printed in the upper left and lower right corners of each panel represent the mean kappa values obtained by each method. Wilcoxon statistical tests indicate the significance of the observed differences, noted by single or double asterisks representing p-values smaller than 0.05 and 0.01, respectively.


For the Physiobank and BCI-IV-2a data sets, TSM provides a significantly better feature space than MCSP does. These results confirm that TSM can be considered a state-of-the-art feature extractor for multiclass BCI problems. For the remainder of this letter, we consider TSM as the feature space.

4.2.  Multiclass Adaptive Classification.

In this section, we analyze the problem of interday classifier adaptation in the multiclass setting, using the BCI-IV-2a and BSI-RIKEN data sets and considering TSM as the feature extractor. We analyze the behavior of the proposed method, MPMLDA, and we consider EBLDA and DSA for comparison (a brief description of the two methods is provided in appendixes B and C, respectively). For reference on the improvement achieved, we also present values computed for the performance of the initial (static) MLDA classifier.

In all cases, we first train an initial static MLDA (or BLDA) classifier using the data from one day of recordings and then select a different (future) day or session of the same subject as testing data in order to evaluate both the static and the adaptive classifiers. In this section, we optimize the learning rates for each combination of subject and adaptation method separately. In the next section, we analyze the influence of the learning rate in more detail.

Figure 2 compares MPMLDA with the other methods (in columns) in terms of kappa values for each of the data sets (in rows). Points above the discontinuous lines represent cases in which MPMLDA performs better. The values in the upper-left and lower-right corners indicate the mean kappa values of the corresponding methods averaged over all subjects.

Figure 2:

Comparison between MPMLDA and other multiclass methods. All adaptive methods use individual optimal learning rates. Each row represents a different data set, and each column compares it with a different model (MLDA, DSA, and EBLDA). The values displayed in the upper-left and lower-right corners are the mean kappa value averaged over all the subjects of the data set. Wilcoxon statistical tests indicate the significance of the observed differences. Single and double asterisks indicate p-values smaller than 0.05 and 0.01, respectively.


First, we note that, on average, any form of adaptation improves on the static classifier, which achieves a mean kappa of 0.51. Remarkably, MPMLDA significantly outperforms all of the other methods. Note that for the BSI-RIKEN data set, EBLDA yields very poor performance in some isolated cases.

DSA, in contrast, provides stable performance, but it remains significantly worse than MPMLDA. We conclude that, for the data sets under consideration, MPMLDA performs significantly better than both DSA and EBLDA.

4.3.  Influence of the Learning Rate.

In Figure 3, we present results of our analysis of the influence of the learning rate. We plot the mean kappa values averaged over all subjects as a function of the learning rate for the BCI-IV-2a data set (left panel) and the BSI-RIKEN data set (right panel). Horizontal lines indicate the performance of the static MLDA method and the adaptive EBLDA and DSA (the latter two adaptive methods use individually optimized learning rates).

Figure 3:

The continuous line represents the average kappa value across subjects as a function of the MPMLDA learning rate for the BCI-IV-2a and BSI-RIKEN data sets. Discontinuous and marked lines represent the average kappa values obtained by MLDA, EBLDA, and DSA. Individual optimal learning rates are used for EBLDA and DSA.


Interestingly, MPMLDA improves notably over the static MLDA for a wide range of learning rate values. Its performance is also superior or comparable to that of the other adaptive methods on both data sets.

Observe that the optimal learning rate differs by data set. These differences can be explained, at least in part, by the dependence on the number of classes K induced by equation 2.11: because the entries of P(x) sum to one, the factors $U_{i,j}(x)$ decrease on average as K grows, so a larger nominal learning rate is needed to achieve a comparable effective adaptation speed.

We now set the MPMLDA learning rate to the corresponding optimal values shown above and compare the performance of MPMLDA against the other adaptive methods, for which the learning rates have been fully optimized. The results are displayed in Figure 4.

Figure 4:

Comparison between MPMLDA and other multiclass methods. The learning rate values of all adaptive methods have been optimized for each subject independently, with the exception of MPMLDA, which uses the single value that maximizes the mean kappa across subjects (see Figure 3). The learning rate values used by MPMLDA are 0.03 and 0.01 for the BCI-IV-2a and BSI-RIKEN data sets, respectively. Each row represents a different data set, and each column compares MPMLDA to a different model (MLDA, EBLDA, and DSA). The values displayed in the upper-left and lower-right corners indicate the mean kappa value averaged over all subjects in the data set. Wilcoxon statistical tests indicate the significance of the observed differences. Single and double asterisks indicate p-values smaller than 0.05 and 0.01, respectively.


With this subject-independent but data set–dependent learning rate, MPMLDA still generally outperforms the static classifier and also EBLDA, although it does not yield any significant difference with respect to the optimal DSA (second column).

We conclude that effective subject-independent learning rates can be learned for a fixed paradigm and a fixed EEG montage. Although such rates deviate from the subject-dependent optima, this approach still significantly outperforms static classification and optimally tuned EBLDA.

4.4.  Analysis of the Feature Space Dynamics.

In order to understand why MPMLDA outperforms the other methods, we project the class-wise training and testing feature distributions onto the first two principal components derived from the training data (Jolliffe, 2002). The results for subjects 4 and 8 (BCI-IV-2a) are displayed in Figure 5, left.

Figure 5:

(Left) Distribution of training (continuous ellipses) and testing (discontinuous ellipses) features of four different tasks (shading) projected onto the first two principal components for subjects 4 and 8. (Right) The shift in the mean for each class between day 1 and day 2.


Note the clear shift between the training and testing distributions for subject 8. In the right column of Figure 5, we represent the shift in the mean for each class. Observe that the shifts are class dependent for subject 4 and largely class independent for subject 8. Adapting to class-independent shifts is obviously a much simpler task, which can be achieved using DSA or a naive multiclass Pmean update (i.e., setting $U_{i,j}(x) = 1$ for all class pairs in equation 2.11). Class-dependent changes, however, cannot be tracked by either of these methods. Further, note that the changes in the covariances are not very strong for either subject. Such behavior (strong mean shifts and small covariance changes) was generally observed in the BCI-IV-2a data set. By construction, MPMLDA can adapt to class-dependent shifts; this fact explains its superior performance on the BCI-IV-2a data set.

The PCA projections on different days for subject C (BSI-RIKEN) are displayed in Figure 6. The first row shows the change between the first and the second day. Once more, strong mean shifts capture most of the nonstationary changes. However, the nonstationarity becomes stronger when comparing the projections of the first and the seventh day for the same subject (second row). Note that shifts in the mean are insufficient to represent the nonstationarity in this case; a strong class-dependent covariance change is also present. Although this is a difficult scenario for any adaptive model, it is clear that correcting for the bias is a necessary condition for any adaptation strategy to be successful.

Figure 6:

(Left) Distribution of training (continuous ellipses) and testing (discontinuous ellipses) features of the three tasks (shading), projected onto the first two principal components for subject C. Training and testing were performed on days (1, 2), (1, 7), and (6, 7), respectively, as indicated by the corresponding legends. (Right) The shift in the mean for each class between the corresponding pair of days.


We now turn to the projections obtained after training on the sixth day and testing on the seventh, shown in Figure 6 (bottom row), instead of using the principal components from the first day. Remarkably, the distributions are more concentrated than in the previous cases. One possible explanation for this reduced variance is that the subject learned to execute the task better (Curran, 2003; Barbero Jimenez & Grosse-Wentrup, 2010). However, in this case as well, class-dependent changes (both mean shifts and covariance changes) are dominant.

To conclude, we note that in some cases the translation vectors are grouped (right columns), suggesting that class-dependent bias updates might add little over a naive Pmean that accounts for a single class-independent shift (i.e., setting $U_{i,j}(x) = 1$ in equation 2.11). Using this naive approach, however, results in mean kappa values of 0.541 (**) and 0.60 (**) for the BCI-IV-2a and BSI-RIKEN data sets, respectively; these values improve on the static MLDA but are significantly worse than the performance obtained by MPMLDA. This result confirms that class-dependent bias updates are useful for tracking such multiclass, class-dependent distribution shifts.

5.  Discussion

In this work, we propose MPMLDA, a novel method for adaptive multiclass classification in BCI. The method is a multiclass extension of the binary pooled mean LDA (Pmean) introduced by Vidaurre et al. (2010). We demonstrate that MPMLDA outperforms state-of-the-art adaptive methods such as EBLDA and DSA.

As feature space, we use TSM, as recently introduced by Barachant et al. (2012). Our results confirm previous findings: TSM-based features yield better classification performance than MCSP features.

The most important feature of MPMLDA is that its parameter updates are class dependent, thereby resulting in larger updates for discriminant functions between pairs of classes that are more suitable for explaining the current EEG pattern. Our results on different data sets suggest that such class-dependent updates are a key ingredient in explaining the improved performance of MPMLDA over the other methods.

One interesting observation is that MPMLDA can achieve higher classification performance than DSA. This is a remarkable result because, by construction, DSA has the potential to correct for stronger nonstationarities than MPMLDA: in principle, DSA can remove not only shifts but also rotations in feature space, provided they are common to all classes. However, the presented results clearly show the presence of class-dependent nonstationarity components that DSA cannot learn. As the empirical results indicate, MPMLDA outperforms DSA, and bias adaptation thus proves to be a very powerful tool in the multiclass setting.

Appendix A:  Binary Linear Discriminant Analysis with Pooled Mean Adaptation

Pmean (Vidaurre et al., 2010) is a binary unsupervised adaptive LDA algorithm that, under the assumption of balanced classes, identifies the global mean of the data $\mu$ with $\frac{1}{2}(\mu_1 + \mu_2)$, where $\mu_1$ and $\mu_2$ represent the class-wise means. Consequently, $\mu$ can be updated sequentially following
$$\tilde{\mu} = (1 - \eta)\, \mu + \eta\, x, \tag{A.1}$$
where $\tilde{\mu}$ is the updated $\mu$, x is the new observed feature vector, and $\eta$ is the learning rate controlling the adaptation. Because the bias of the LDA discriminant function is given by
$$b = -w^{\top} \mu, \tag{A.2}$$
with w representing its weights, the update of $\mu$ (through equation A.2) allows the unsupervised update of the bias.

Although Pmean provides state-of-the-art binary unsupervised classification performance, it is important to note that no multiclass extension of Pmean has previously been considered in the literature.

Appendix B:  Enhanced Bayesian Linear Discriminant Analysis

Bayesian linear discriminant analysis (BLDA) is a Bayesian version of regularized LDA in which regularization parameters are estimated with Bayesian regression (see Xu et al., 2011, for more detailed information).

In order to improve the performance of the classifier, EBLDA proposes training a new classifier by supplementing the training set with high-probability test samples. In the binary case, the probability assigned by the BLDA classifier to a test sample is computed. If this probability exceeds a threshold (e.g., 0.9), the test sample and its estimated label are added to the training set, and the classifier is retrained. In this work, we refer to this threshold parameter as the learning rate.

In the multiclass setting, EBLDA combines binary BLDA classifiers in the same way as MLDA does.
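The following hedged sketch illustrates the self-training loop described above, with scikit-learn's shrinkage LDA standing in for BLDA (which is not reproduced here); only the thresholded retraining logic is the point of the example, and the function name is ours.

```python
# Hedged sketch of confidence-thresholded self-training (EBLDA-style loop).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def confidence_retraining(X_train, y_train, X_test, threshold=0.9):
    clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
    clf.fit(X_train, y_train)
    for x in X_test:
        p = clf.predict_proba(x[None, :])[0]        # class probabilities
        k = clf.classes_[np.argmax(p)]              # estimated label
        if p.max() > threshold:                     # confident sample only
            X_train = np.vstack([X_train, x])       # add with estimated label
            y_train = np.append(y_train, k)
            clf.fit(X_train, y_train)               # retrain the classifier
    return clf
```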

Appendix C:  Unsupervised EEG Data Space Adaptation

The DSA adaptation procedure (Arvaneh et al., 2013) performs an adaptive linear transformation of the testing data instead of adapting the classifier parameters. It provides a direct approximation of the discrepancy between the band-pass filtered training and testing data distributions by comparing the average distributions of the EEG data obtained regardless of the class labels (under a gaussian assumption). Let $N_1 = \mathcal{N}(0, \bar{C}_{tr})$ be the average distribution of the training data, where $\bar{C}_{tr}$ is obtained by averaging the covariance matrices over all available EEG training trials. Denote the average distribution of the testing data after a linear transformation as $N_2 = \mathcal{N}(0, W C W^{\top})$, with W a linear transformation matrix and C the average covariance matrix of the testing data. The DSA method optimizes the matrix W by minimizing the Kullback-Leibler divergence between $N_2$ and $N_1$. Interestingly, W can be written in closed form as
$$W = \bar{C}_{tr}^{1/2}\, C^{-1/2}.$$
In DSA, the linear transformation W is recomputed sequentially after a certain number of trials, and each new testing pattern is projected onto W before the pretrained feature extraction and classification algorithms are applied. The number of trials used for recomputing C and W determines the length scale of the adaptation; we therefore refer to this parameter as the learning rate of the DSA method.
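A minimal sketch of one such recomputation step, assuming the closed form above and trials stored as (channels × samples) arrays, could look as follows (the function name and windowing are our own choices):

```python
# Hedged sketch of a DSA step: estimate the average test covariance C from
# recent trials, form W = Ctr^{1/2} C^{-1/2}, and apply W to the trials.
import numpy as np
from scipy.linalg import inv, sqrtm

def dsa_step(C_train_avg, recent_trials):
    """recent_trials: list of (n_channels, n_samples) arrays."""
    C = np.mean([np.cov(X) for X in recent_trials], axis=0)
    W = np.real(sqrtm(C_train_avg)) @ inv(np.real(sqrtm(C)))
    return W, [W @ X for X in recent_trials]   # linearly adapted testing data
```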

The original unsupervised DSA method, with common spatial patterns (CSP) as the feature extractor, is very similar to the unsupervised adaptation of CSP proposed by Tomioka et al. (2006). The latter updates the CSP filters F using W (i.e., $\tilde{F} = F W$), while DSA transforms the testing data X using W (i.e., $\tilde{X} = W X$). Consequently, in both cases the features are extracted from the linear transformation $F W X$.

In this work, we consider a multiclass variant of DSA with TSM as the feature space. In both respects, this unsupervised strategy differs from the proposals of Tomioka et al. (2006) and Arvaneh et al. (2013).

Acknowledgments

We gratefully acknowledge the support of the Brain-Gain Smart Mix Programme of the Netherlands Ministry of Economic Affairs and the Netherlands Ministry of Education, Culture and Science.

References

Ang, K. K., Chin, Z. Y., Wang, C., Guan, C., & Zhang, H. (2012). Filter bank common spatial pattern algorithm on BCI competition IV datasets 2a and 2b. Frontiers in Neuroscience, 6, 39. doi:10.3389/fnins.2012.00039

Arvaneh, M., Guan, C., Ang, K. K., & Quek, C. (2013). EEG data space adaptation to reduce intersession nonstationarity in brain-computer interface. Neural Computation, 25, 2146–2171.

Barachant, A., Bonnet, S., Congedo, M., & Jutten, C. (2012). Multiclass brain-computer interface classification by Riemannian geometry. IEEE Transactions on Biomedical Engineering, 59, 920–928.

Barbaresco, F. (2008). Innovative tools for radar signal processing based on Cartan's geometry of SPD matrices and information geometry. In Proceedings of the IEEE Radar Conference (pp. 1–3). Piscataway, NJ: IEEE.

Barbero Jimenez, A., & Grosse-Wentrup, M. (2010). Biased feedback in brain-computer interfaces. Journal of NeuroEngineering and Rehabilitation, 7. doi:10.1186/1743-0003-7-34

Bishop, C. M. (2007). Pattern recognition and machine learning. New York: Springer.

Blankertz, B., Lemm, S., Treder, M., Haufe, S., & Müller, K.-R. (2011). Single-trial analysis and classification of ERP components: A tutorial. NeuroImage, 56(2), 814–825.

Blankertz, B., Tomioka, R., Lemm, S., Kawanabe, M., & Müller, K.-R. (2008). Optimizing spatial filters for robust EEG single-trial analysis. IEEE Signal Processing Magazine, 25(1), 41–56.

Brunner, C., Leeb, R., Müller-Putz, G. R., Schlögl, A., & Pfurtscheller, G. (2008). BCI competition 2008, Graz dataset 2A. http://www.bbci.de/competition/iv

Cichocki, A., & Zhao, Q. (2011). EEG motor imagery dataset (Tech. Rep.). Saitama, Japan: Laboratory for Advanced Brain Signal Processing, BSI, RIKEN.

Curran, E. (2003). Learning to control brain activity: A review of the production and control of EEG components for driving brain-computer interface (BCI) systems. Brain and Cognition, 51, 326–336.

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.

do Carmo, M. P. (1976). Differential geometry of curves and surfaces. Englewood Cliffs, NJ: Prentice Hall.

Dornhege, G., Blankertz, B., Curio, G., & Müller, K.-R. (2004). Increase information transfer rates in BCI by CSP extension to multi-class. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems (pp. 733–740). Cambridge, MA: MIT Press.

Farquhar, J. (2009). A linear feature space for simultaneous learning of spatio-spectral features in BCI. Neural Networks, 22, 1278–1285.

Felix, L., Scherer, R., Leeb, R., Neuper, C., Bischof, H., & Pfurtscheller, G. (2005). A comparative analysis of multi-class EEG classification for brain computer interface. In Proceedings of the 10th Computer Vision Winter Workshop. Vienna: Austrian Computer Society.

Fernandez-Vargas, J., Pfaff, H. U., Rodríguez, F. B., & Varona, P. (2013). Assisted closed-loop optimization of SSVEP-BCI efficiency. Frontiers in Neural Circuits, 7, 27.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.

Fletcher, P. T., & Joshi, S. (2004). Principal geodesic analysis on symmetric spaces: Statistics of diffusion tensors. In ECCV Workshops CVAMIA and MMBIA (pp. 87–94). New York: Springer-Verlag.

Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. Ch., Mark, R. G., … Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101, e215–e220.

Grosse-Wentrup, M., & Buss, M. (2008). Multi-class common spatial patterns and information theoretic feature extraction. IEEE Transactions on Biomedical Engineering, 55, 1991–2000.

Haas, L. F. (2003). Hans Berger (1873–1941), Richard Caton (1842–1926), and electroencephalography. Journal of Neurology, Neurosurgery and Psychiatry, 74(1), 9.

Hasan, B. A. S., & Gan, J. Q. (2009). Sequential EM for unsupervised adaptive gaussian mixture model based classifier. In Machine Learning and Data Mining in Pattern Recognition (Lecture Notes in Computer Science, Vol. 5632, pp. 96–106). New York: Springer.

Heath, Sir T. (1981). A history of Greek mathematics. Vol. 1: From Thales to Euclid. New York: Dover.

Jolliffe, I. T. (2002). Principal component analysis. New York: Springer.

Karcher, H. (1977). Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30, 509–541.

Kraemer, H. C. (1982). Kappa coefficient. In Encyclopedia of statistical sciences. New York: Wiley.

Krauledat, M. (2008). Analysis of nonstationarities in EEG signals for improving brain-computer interface performance. Doctoral dissertation, Technische Universität Berlin, Fakultät IV—Elektrotechnik und Informatik.

Llera, A., Gómez, V., & Kappen, H. J. (2012). Adaptive classification on brain-computer interfaces using reinforcement signals. Neural Computation, 24, 2900–2923.

MacKay, D. J. C. (1992). Bayesian interpolation. Neural Computation, 4, 415–447.

MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.

Millán, J. del R. (2004). On the need for on-line learning in brain-computer interfaces. In Proceedings of the IEEE International Joint Conference on Neural Networks (Vol. 4, pp. 2877–2882). Piscataway, NJ: IEEE.

Moakher, M. (2005). A differential geometric approach to the geometric mean of symmetric positive-definite matrices. SIAM Journal on Matrix Analysis and Applications, 26, 735–747.

Moakher, M., & Zéraï, M. (2011). The Riemannian geometry of the space of positive-definite matrices and its application to the regularization of positive-definite matrix-valued data. Journal of Mathematical Imaging and Vision, 40, 171–187.

Müller, K.-R., Anderson, C. W., & Birch, G. E. (2003). Linear and nonlinear methods for brain-computer interface. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 11, 165–169.

Nolan, H., Whelan, R., & Reilly, R. B. (2010). FASTER: Fully automated statistical thresholding for EEG artifact rejection. Journal of Neuroscience Methods, 192, 152–162.

Schalk, G., McFarland, D. J., Hinterberger, T., Birbaumer, N., & Wolpaw, J. R. (2004). BCI2000: A general-purpose brain-computer interface (BCI) system. IEEE Transactions on Biomedical Engineering, 51, 1034–1043.

Schlögl, A., Kronegg, J., Huggins, J. E., & Mason, S. G. (2007). Evaluation criteria for BCI research. In G. Dornhege, J. del R. Millán, T. Hinterberger, D. McFarland, & K.-R. Müller (Eds.), Toward brain-computer interfacing (chap. 19, pp. 327–342). Cambridge, MA: MIT Press.

Shenoy, P., Krauledat, M., Blankertz, B., Rao, R. P., & Müller, K.-R. (2006). Towards adaptive classification for BCI. Journal of Neural Engineering, 3(1), R13–R23.

Tang, Y., Tang, J., & Gong, A. (2008). Multi-class EEG classification for brain computer interface based on CSP. In Proceedings of the International Conference on BioMedical Engineering and Informatics (Vol. 2, pp. 463–472). Washington, DC: IEEE Computer Society.

Tax, D. M. J., & Duin, R. P. W. (2002). Using two-class classifiers for multi-class classification. In Proceedings of the 16th IEEE International Conference on Pattern Recognition (Vol. 2, pp. 124–127). Piscataway, NJ: IEEE.

Tomioka, R., Hill, J. N., Blankertz, B., & Aihara, K. (2006). Adapting spatial filter methods for nonstationary BCIs. In Proceedings of the 2006 Workshop on Information-Based Induction Sciences (IBIS 2006) (pp. 65–70). Max-Planck Gesellschaft.

van Gerven, M., Farquhar, J., Schaefer, R., Vlek, R., Geuze, J., Nijholt, A., … Desain, P. (2009). The brain computer interface cycle. Journal of Neural Engineering, 6(4). doi:10.1088/1741-2560/6/4/041001

Vidal, J. J. (1973). Toward direct brain-computer communication. Annual Review of Biophysics and Bioengineering, 2, 157–180.

Vidaurre, C., Kawanabe, M., von Bünau, P., Blankertz, B., & Müller, K.-R. (2010). Toward an unsupervised adaptation of LDA for brain-computer interfaces. IEEE Transactions on Biomedical Engineering, 58, 587–597.

Vidaurre, C., Sanelli, C., Müller, K.-R., & Blankertz, B. (2011). Machine-learning based co-adaptive calibration. Neural Computation, 23, 791–816.

Vidaurre, C., Schlögl, A., Cabeza, R., Scherer, R., & Pfurtscheller, G. (2006). A fully on-line adaptive BCI. IEEE Transactions on Biomedical Engineering, 53, 1214–1219.

Wolpaw, J. R., Birbaumer, N., McFarland, D. J., Pfurtscheller, G., & Vaughan, T. M. (2002). Brain-computer interfaces for communication and control. Clinical Neurophysiology, 113, 767–791.

Xu, P., Yang, P., Lei, X., & Yao, D. (2011). An enhanced probabilistic LDA for multi-class brain computer interface. PLoS ONE, 6(1), e14634. doi:10.1371/journal.pone.0014634

Yang, P., Xu, L., Liu, T., Xu, P., & Yao, D.-Z. (2009). Probabilistic methods in multi-class brain computer interfaces. Journal of Electronic Science and Technology of China, 7(1), 12–16.
