## Abstract

We consider the problem of multiclass adaptive classification for brain-computer interfaces and propose the use of multiclass pooled mean linear discriminant analysis (MPMLDA), a multiclass generalization of the adaptation rule introduced by Vidaurre, Kawanabe, von Bünau, Blankertz, and Müller (2010) for the binary class setting. Using publicly available EEG data sets and tangent space mapping (Barachant, Bonnet, Congedo, & Jutten, 2012) as a feature extractor, we demonstrate that MPMLDA can significantly outperform state-of-the-art multiclass static and adaptive methods. Furthermore, efficient learning rates can be achieved using data from different subjects.

## 1. Introduction

Brain-computer interfaces (BCI) (Vidal, 1973) aim to provide human subjects with control over devices or computer applications while bypassing the traditional muscular paths. In other words, these interfaces might allow humans to control devices using only their brain activity as measured by, for example, electroencephalogram (EEG) (Haas, 2003). The control of such devices could offer an additional channel of communication and action that could improve the quality of life for people with severe disabilities (Wolpaw, Birbaumer, McFarland, Pfurtscheller, & Vaughan, 2002).

The most common procedure for setting up a BCI is as follows. The user participates in a training (i.e., calibration) session in which he or she is instructed to perform a specific mental task while the brain activity generated is recorded. The recorded data (usually designated as training data) are used to extract discriminative features associated with various user intentions. The data are subsequently used to train a classifier that predicts the user's intention during the testing (feedback) session (van Gerven et al., 2009). To optimize the feature extraction procedure, a time window of interest should be selected, as well as the most discriminative frequency bands. This choice is typically experimental and subject dependent, and it usually requires some level of prior physiological knowledge. Several algorithms are available for automatically optimizing these parameters (Blankertz, Lemm, Treder, Haufe, & Müller, 2011; Fernandez-Vargas, Pfaff, Rodríguez, & Varona, 2013; Ang, Chin, Wang, Guan, & Zhang, 2012). In a multiclass BCI problem, the most common feature extractors are based on supervised data projections, such as one-against-one common spatial patterns (CSP) (Blankertz, Tomioka, Lemm, Kawanabe, & Müller, 2008; Dornhege, Blankertz, Curio, & Müller, 2004) or multiclass CSP (MCSP) (Grosse-Wentrup & Buss, 2008). Another interesting approach that has been introduced recently is tangent space mapping (TSM) (Barachant et al., 2012), an unsupervised nonlinear projection of the data covariance matrices that can be used to optimize its classification.

Usually the binary class BCI classification problem is considered linear (Müller, Anderson, & Birch, 2003; Farquhar, 2009) and a linear discriminant analysis (LDA) classifier (Fisher, 1936) can be used effectively. When the dimension of the feature space is large in relation to the number of samples (training data), the classifier must be regularized (Bishop, 2007; Blankertz et al., 2011). In the multiclass case, the multiclass LDA (MLDA) has been shown to be the best choice (Yang, Xu, Liu, Xu, & Yao, 2009; Felix et al., 2005; Tang, Tang, & Gong, 2008).

One common problem encountered by the classifiers in EEG-based BCIs involves changes in the feature statistics over time due to the nonstationary character of the EEG data (Krauledat, 2008). These changes generally result in poor classifier generalization performance (Shenoy, Krauledat, Blankertz, Rao, & Müller, 2006; Millán, 2004). To overcome this problem, several methods that adapt the feature space or the classifier parameters have been proposed (Tomioka, Hill, Blankertz, & Aihara, 2006; Hasan & Gan, 2009; Llera, Gómez, & Kappen, 2012).

The most common (i.e., the safest) strategy is to update the model parameters that are class independent while keeping the rest fixed. In the case of binary LDA, this strategy can be used to update the global average covariance matrix (or its inverse) (Vidaurre, Schlöogl, Cabeza, Scherer, & Pfurtscheller, 2006) or the global mean of the data (Vidaurre, Kawanabe, von Bünau, Blankertz, & Müller, 2010). When the adaptation is performed in the feature space and not in the classifier parameters, the strategy can be used to reduce the nonstationary effect by transforming the CSP filters (Tomioka et al., 2006) or, more generally, the actual testing data (Arvaneh, Guan, Ang, & Quek, 2013) in a linear fashion.

To adapt class-dependent classifier parameters, it is necessary to introduce uncertainty into the model, as in the unsupervised adaptive gaussian mixture model classifier (Hasan & Gan, 2009). To the best of our knowledge, the unsupervised adaptation of class-dependent feature space extraction parameters has yet to deliver any applicable results. Updating the global mean of the data allows the adaptation of the bias of the binary LDA discriminant function. This is usually a robust technique for BCI classifier adaptation and is referred to as the pooled mean linear discriminant analysis (Pmean) (Vidaurre et al., 2010). It adapts to shifts in the feature space that are commonly attributed to non-class-related nonstationarity in EEG-based imaginary movement binary BCI (Shenoy et al., 2006). Despite its simplicity, Pmean can achieve state-of-the-art binary unsupervised adaptive classification performance. Moreover, it has been shown to be a valuable tool for helping to increase the number of possible BCI users (Vidaurre, Sanelli, Müller, & Blankertz, 2011). Similar binary performance has been reported using data space adaptation (DSA), a feature-based adaptation method proposed recently by Arvaneh et al. (2013).

In the multiclass setting, the adaptation process clearly becomes more complex and poses more difficult challenges. State-of-the-art unsupervised multiclass methods, such as enhanced Bayesian LDA (EBLDA) (Xu, Yang, Lei, & Yao, 2011), perform unsupervised retraining of each of the pair-wise Bayesian LDA classifiers (BLDA) (MacKay, 1992). This adaptive approach uses a generative model for class-conditional distributions, and its performance is strongly dependent on the quality of the initialization. For binary problems, however, evidence suggests that the performance of the Pmean update approaches that obtained using supervised updates Vidaurre et al. (2010), thus often outperforming adaptive unsupervised generative models (e.g., EBLDA).

In this letter, we introduce a novel multiclass extension of the binary Pmean adaptation of the LDA classifier and demonstrate that this kind of adaptation is better suited for multiclass adaptation than are the previously mentioned state-of-the-art methods.

## 2. Methods

In this section, we present the methods for feature extraction and classification that we consider in the rest of the work. In section 2.1 we specify the feature extraction procedure: tangent space mapping (TSM). We describe multiclass LDA in section 2.2 and present the proposed algorithm for multiclass adaptive classification, the MPMLDA, in section 2.3.

### 2.1. Tangent Space Mapping.

Tangent space mapping (TSM) as feature space was recently presented for BCIs in the context of the tangent space linear discriminant analysis (TSLDA) (Barachant et al., 2012). The referenced work highlights the potential of TSM as a feature extractor for the multiclass classification of covariance matrices in the context of BCI.

The geometric mean exists within the considered manifold, and it is unique (Karcher, 1977). Although there is no closed-form solution for its computation, it can be computed efficiently using iterative algorithms. In this work we consider the algorithm presented by Fletcher and Joshi (2004).

*C*is (after several trivial simplifications) given by where the log is the logarithm of a matrix derived from its diagonalization (Barbaresco, 2008).

_{k}is a symmetric matrix that in vectorized form (after eliminating redundant elements due to symmetry) can be used as features for classification in BCI problems (Barachant et al., 2012). In most cases and particularly in this work, one unique is computed using all of the covariances matrices in the training set, thus rendering TSM an unsupervised feature extractor for covariance classification.

### 2.2. Multiclass Linear Discriminant Analysis.

The multiclass linear discriminant analysis (MLDA) classifier (Bishop, 2007) is defined by a discriminant function of the input feature vector for each class pair (*i*, *j*). A majority vote or a probabilistic interpretation of the results can be used to produce unique output from each pair of binary classifiers (Tax & Duin, 2002).

*K*-class classification problem and a set of

*m*labeled data vectors , for each , we can compute the class-wise means and covariance matrices . Defining the per-class-pair average covariance matrices as for , we define the discriminant function between classes

*i*and

*j*as where each describes the vector of weights and the bias term of the discriminant function between classes

*i*and

*j*. For each , an input feature vector

**x**is classified as class

*j*if , and as class

*i*otherwise. The output of the discriminant function can be interpreted probabilistically by assuming that the probabilities of classes

*i*and

*j*are given by a binomial distribution on the sigmoidal mapping of the discriminant function value (MacKay, 2003).

*j*>

*i*between classes

*i*and

*j*, we define the probability of

**x**belonging to class

*i*as and, consequently with Note that this probabilistic interpretation of the discriminant function of the LDA classifier makes a connection between the LDA and the logistic regression model.

*Q*

_{i,j}in the following manner. For simplicity of notation, we define for

*j*<

*i*,

*Q*

_{i,j}(

*k*|

*x*)=

*Q*

_{j,i}(

*k*|

*x*). We then define the probability vector

**P**containing in the

*i*th coordinate the probability of class

*i*given

**x**as The final output of this MLDA classifier is assigned to the most probable class under this measure .

### 2.3. Multiclass Pooled Mean LDA.

In the binary setting, under the assumption of balanced classes, the bias of the discriminant function can be adapted in an unsupervised manner using the global data mean (Vidaurre et al., 2010). (See appendix A for additional information.)

**P**(

**x**) of equation 2.9 as The update, equation 2.10, allows for the adaptation of each bias

*b*

_{i,j}through equation 2.6. The MPMLDA algorithm is summarized in algorithm 1.

One particular feature of MPMLDA is that given a new data point **x**, the adaptation is automatically stronger for discriminant functions between pairs of classes that are more relevant, as demonstrated in equation 2.11. In a supervised scenario (i.e., , with *k* being the true class label), it is easy to see that MPMLDA updates only the boundaries between the real label *k* and the other classes.

## 3. Data Sets and Evaluation

We consider three data sets, all containing multiclass imagery-movement tasks.

**Physiobank** (Goldberger et al., 2000; Schalk, Mcfarl, Hinterberger, Birbaumer, & Wolpaw, 2004) contains data from 109 subjects performing various combinations of real and imagined movements in one day of recordings.^{1} In this work, we consider only the data related to imagery movement of the left hand, right hand, and both feet. Each subject performed 22 trials of each class (3 seconds per trial), and the EEG data were recorded using 64 electrodes. Due to the computational load required by the extended analysis presented in this work, we focus only on 20 EEG electrodes over the sensorimotor cortex. The large number of users makes this data set convenient to evaluate the impact of feature extraction on a large scale.

**BCI competition IV-2a** (Brunner, Leeb, Müller-Putz, Schlögl, & Pfurtscheller, 2008) provides data from nine subjects performing four different imagery movement tasks—right hand (RH), left hand (LH), both feet (F), tongue (T)—during two days of recordings.^{2} Each day the subjects performed 72 trials of each task (3 seconds per trial), and the EEG data were recorded using 20 electrodes.

**BSI-RIKEN** (Cichocki & Zhao, 2011) provides data from several subjects performing binary or multiclass imagery movement tasks.^{3} In this work we consider only the two subjects performing multiclass problems (LH, RH, F) on different, well-defined days (subjects B and C). Subject B was recorded on two days and subject C on seven days. Each day they performed approximately 65 trials (3–4 seconds) of each task, and the EEG data were recorded using five or six electrodes. In this work, we always consider the same five EEG electrodes: C3, Cp3, C4, Cp4, and Cz.

The data set also contains three long additional sessions for subject C ( trials per session) spread across a different day. We denote these data as subject C2.

In this work, all of the data sets were preprocessed in a similar way before the feature extraction step. In all cases, the data were divided into trials containing the imagery movement period. Channels and trials contaminated with artifacts were eliminated from the training set using an automatic variance-based routine (Nolan, Whelan, & Reilly, 2010). The contaminated channels were also removed from the testing set, after which the data were detrended and bandpass-filtered into the frequency bands 8 Hz to 30 Hz. To reduce the presence of filter-induced artifacts in the data, we discarded the starting and ending half seconds from each trial.

We measure BCI performance using Cohen's kappa coefficient (Kraemer, 1982), which assigns a value of zero to random classification and a value of one to perfect classification. This measure is commonly used in multiclass BCI problems (Schlögl, Kronegg, Huggins, & Mason, 2007).

We use the Physiobank data set only to compare the different feature extractors, not to evaluate adaptive methods. Note that the reduced number of trials per class available per subject in this data set does not allow the proper evaluation of any adaptive method. In this case, we report classification results based on leave-one-out cross-validation.

We use the BCI-IV-2a and BSI-RIKEN data sets to evaluate and compare the performance of the different classifiers in terms of adaptation. We perform training on the data from one day and testing on data from a subsequent day. Because these data were recorded on different days, they are more likely to present hard nonstationarities.

For subject C2, however, each model was trained and tested on the different sessions on the same day. Although all recordings denoted as subject C2 were performed within the same day, we decided to include them in the analysis, as each session provides many trials and the three sessions were distributed throughout the day, thereby allowing the presence of clear, nonstationary changes. This procedure yielded nine evaluations for the BCI-IV-2a data set and 25 evaluations for the BSI-RIKEN data set, which are decomposed to a value of 1 for subject B, 21 for subject C (6+5+⋅⋅⋅+1), and 3 for subject C2.

## 4. Results

Given that TSM has only recently been introduced and is largely unknown in the BCI community, we present novel results that confirm the quality of TSM as an algorithm for multiclass feature extraction when using static classification. We then analyze and compare the performance of the proposed multiclass adaptive method. In section 4.3, we consider the method's dependence on the learning rate, and in section 4.4, we provide a qualitative analysis of the dynamic changes occurring in the feature space. We assess the significance of the performance differences between methods according to Wilcoxon statistical tests (Demšar, 2006). The observed differences are shown with single or double asterisks to indicate *p*-values less than 0.05 and 0.01, respectively.

### 4.1. TSM as Feature Space for Nonadaptive Classification.

In this section, we illustrate the ability of TSM to serve as multiclass feature extractor for BCI imagery movement problems on a large scale. We present results according to a static MLDA classifier (see section 2.2) and the three data sets considered in this work. Figure 1 provides a comparison of the kappa values obtained using TSM and multiclass CSP (MCSP) (Grosse-Wentrup & Buss, 2008), which is commonly used for multiclass BCI feature extraction. For the BCI-IV-2a data set (middle panel), subject numbers are indicated inside the circles. For the BSI-RIKEN data set (right panel), the circles represent subject C, the square represents subject B, and the crosses represent the different possible test sessions of subject C2 during one day. The mean kappa values are displayed in the upper left (for TSM) and lower right (MCSP) corners of each panel.

For the Physiobank and BCI-IV-2a data sets, TSM provides a significantly better feature space than MCSP does. These results confirm that TSM can be considered a state-of-the-art feature extractor for multiclass BCI problems. For the remainder of this letter, we consider TSM as the feature space.

### 4.2. Multiclass Adaptive Classification.

In this section, we analyze the problem of interday classifier adaptation in the multiclass setting, using the BCI-IV-2a and BSI-RIKEN data sets and considering TSM as the feature extractor. We analyze the behavior of the proposed method, MPMLDA, and we consider EBLDA and DSA for comparison (a brief description of the two methods is provided in appendixes B and C, respectively). For reference on the improvement achieved, we also present values computed for the performance of the initial (static) MLDA classifier.

In all cases, we first train a initial static MLDA (or BLDA) classifier using data from a day of recordings, after which we selected a different (future) day or session of the same subject as testing data in order to evaluate both static and adaptive classifiers.^{4} In this section, we optimize the learning rates for each combination of subject and adaptation method separetely. In the next section, we analyze the influence of the learning rate in more detail.

Figure 2 shows the comparison of MPMLDA in terms of kappa values for the different methods (in columns) based on each of the data sets (in rows). Points above the discontinuous lines represent cases in which MPMLDA performance is superior. The values at the upper left and lower right corners indicate the mean kappa values of the corresponding method averaged over all subjects.

First, we note that on average, any adaptation improves the mean performance of the static classifier, which has kappa 0.51. Remarkably, MPMLDA significantly outperforms all of the other methods. Note that for the BSI-RIKEN data set, EBLDA yields very poor performance in some isolated cases.

In contrast, although DSA provides excellent stable performance, its performance is still significantly worse than that achieved by MPMLDA. We conclude that for the data sets under consideration, MPMLDA performs significantly better than DSA and EBLDA.

### 4.3. Influence of the Learning Rate.

In Figure 3, we present results of our analysis of the influence of the learning rate. We plot the mean kappa values averaged over all subjects as a function of the learning rate for the BCI-IV-2a data set (left panel) and the BSI-RIKEN data set (right panel). Horizontal lines indicate the performance of the static MLDA method and the adaptive EBLDA and DSA (the latter two adaptive methods use individually optimized learning rates).

Interestingly, the improvement of MPMLDA is notable with respect to the static MLDA for a wide range of learning rate values. The performance of this method is also superior or comparable to the other adaptive methods for both data sets.

Observe that the optimal learning rate differs by data set. These differences can be explained by the dependence of the learning rate on the number of classes induced by equation 2.11.

We now set the MPMLDA learning rate to the corresponding optimal values shown above and compare the performance of MPMLDA against the other adaptive methods, for which the learning rates have been fully optimized. The results are displayed in Figure 4.

With this subject-independent but data set–dependent learning rate, MPMLDA still generally outperforms the static classifier and also EBLDA, although it does not yield any significant difference with respect to the optimal DSA (second column).

We conclude that effective subject independent learning rates can be learned for a fixed paradigm and a fixed EEG montage. Although they deviate from subject-dependent optimal learning rates, this approach significantly outperforms static classification and optimal EBLDA.

### 4.4. Analysis of the Feature Space Dynamics.

In order to understand why MPMLDA outperforms the other methods, we project the class-wise training and testing feature distributions onto the first two principal components derived from the training data (Jolliffe, 2002). The results for subjects 4 and 8 (BCI-IV-2a) are displayed in Figure 5, left.

Note the clear shift observed between the training and testing distributions occurring for subject 8. In the right column of Figure 5, we represent the shift in the mean for each class. Observe that the shifts are class dependent in the case of subject 4 and largely class independent for subject 8. Adapting for class-independent shifts is obviously a much simpler task that can be achieved using DSA or a naive multi class Pmean update (i.e., assuming in equation 2.11). However, class-dependent changes cannot be tracked by any of this methods. Further, note that the changes in the covariances are not too strong for any of both subjects. Such behavior (i.e., strong mean shifts and small covariance changes) was observed in general in the BCI-IV-2a data set. By construction, MPMLDA can adapt to class-dependent shifts; this fact explains the superior performance of MPMLDA in the BCI-IV-2a data set.

The PCA projections on different days for subject C (BSI-RIKEN) are displayed in Figure 6. In the first row, we show the change between the first and the second day. Once more, strong means shifts represent properly the nonstationary changes. However, the nonstationary character becomes stronger when comparing the projections of the first and the seventh day for the same subject (second row). Note that the shifts in the mean are insufficient to represent the nonstationarity in this case; a strong class-dependent covariance change is also present. Although this is a difficult scenario for any adaptive model, it is clear that correcting for the bias is a necessary condition for any adaptation strategy to be succesfull.

Instead of using the principal components from the first day, we concentrate now on the projections after training and testing on the sixth and the seventh days, respectively, as shown in Figure 6 (bottom row). Remarkably, they are more concentrated than they were in the previous cases. One possible explanation for this reduced variance could be that the subject learned to execute the task better (Curran, 2003; Barbero Jimenez & Grosse-Wentrup, 2010). However, also in this case, class-dependent changes are dominant (both mean shifts and covariance changes).

To conclude, we note that in some cases, the translation vectors are grouped (right columns), suggesting that class-dependent bias updates add little improvement to a naive Pmean that would account for a class-independent shift (i.e., assuming in equation 2.11). Using this approach, however, results in mean kappa values of 0.541 (**) and 0.60 (**) for the BCI-IV-2a and BSI-RIKEN data sets, respectively, which improve the static MLDA but are significantly worse than the performance obtained by MPMLDA. This result confirms that the use of class-dependent bias updates is useful for tracking such multiclass class-dependent distribution shifts.

## 5. Discussion

In this work we propose a novel method for adaptive multiclass classification for BCI: MPMLDA. This method is a multiclass extension of the binary pooled mean LDA (Pmean) introduced in Vidaurre et al. (2010). We demonstrate that the performance of MPMLDA is superior to that of state-of-the-art adaptive methods as EBLDA and DSA.

As feature space we use TSM as recently introduced in Barachant et al. (2012). Our results confirm previous findings: TSM-based features yields better classification performance than MCSP features.

The most important feature of MPMLDA is that its parameter updates are class dependent, thereby resulting in larger updates for discriminant functions between pairs of classes that are more suitable for explaining the current EEG pattern. Our results on different data sets suggest that such class-dependent updates are a key ingredient in explaining the improved performance of MPMLDA over the other methods.

One interesting observation is that MPMLDA can achieve higher classification than is possible with DSA. This is a remarkable result. By construction, DSA has the potential to adapt for stronger nonstationarities than does MPMLDA. This is because, in principle, DSA can remove not only shifts but also rotations in feature space (if they are common to all classes). However, the presented results clearly show the presence of class-dependent nonstationarity components that cannot be learned using DSA. As indicated by the empirical results, the MPMLDA is also able to outperform DSA, and, as such, the bias adaptation has been proved to be a very powerful tool in the multiclass setting.

## Appendix

### A.1 Binary Linear Discriminant Analysis with Pooled Mean Adaptation.

*Pmean*(Vidaurre et al., 2010) is a binary unsupervised adaptive LDA algorithm that identifies the global mean of the data () with , where and represent the class-wise means, under the assumption of balanced classes. Consequently can be updated sequentially following where is the updated ,

**x**is the new observed feature vector, and is the learning rate controlling the adaptation. Because the bias of the LDA discriminant function is given by with

**representing its weights, the update of (through equation 2.12) allows the unsupervised update of the bias.**

*w*Although Pmean provides state-of-the-art binary unsupervised classification performance, it is important to note that no multiclass extension of Pmean has previously been considered in the literature.

### A.2 Enhanced Bayesian Linear Discriminant Analysis.

Bayesian linear discriminant analysis (BLDA) is a Bayesian version of regularized LDA in which regularization parameters are estimated with Bayesian regression (see Xu et al., 2011, for more detailed information).

In order to improve the performance of the classifier, EBLDA proposes training a new classifier by supplementing training sets with additional high-probability test samples. In the binary case, the probability from the BLDA classifier for a test sample is computed. If this probability exceeds a threshold (e.g., 0.9), this test sample and its estimated label are added to the training set for classifier retraining. In this work, we refer to the parameter value indicating this threshold as the learning rate.

In the multiclass setting, EBLDA uses combinations of binary BLDA classifiers as MLDA.

### A.3 Unsupervised EEG Data Space Adaptation.

*W*as a linear transformation matrix and

*C*as the average covariance matrix of the testing data. The DSA method optimizes the matrix

*W*by minimizing the KL divergence between

*N*

_{2}and

*N*

_{1}. Interestingly,

*W*can be written in closed form as In DSA, the linear transformation

*W*is recomputed sequentially after a certain number of trials, and the new testing pattern is projected onto

*W*before applying the pretrained feature extraction and classification algorithms. The number of trials used for recomputing

*C*and

*W*determines the length scale of the adaptation. We therefore refer to this parameter as the learning rate for the DSA method.

The original unsupervised DSA method considering common spatial patterns (CSP) as feature extractor is very similar to the unsupervised adaptation of CSP as proposed in Tomioka et al. (2006). The latter updates the CSP filters using *W* (i.e., ), while DSA filters the testing data (*X*) using *W* (i.e. ). Consequently in both cases the features are extracted from the linear transformation .

## Acknowledgments

We gratefully acknowledge the support of the Brain-Gain Smart Mix Programme of the Netherlands Ministry of Economic Affairs and the Netherlands Ministry of Education, Culture and Science.

## References

## Notes

^{4}

For EBLDA, a Bayesian LDA was used for training.