## Abstract

Unsupervised analysis of the dynamics (nonstationarity) of functional brain connectivity during rest has recently received a lot of attention in the neuroimaging and neuroengineering communities. Most studies have used functional magnetic resonance imaging, but electroencephalography (EEG) and magnetoencephalography (MEG) also hold great promise for analyzing nonstationary functional connectivity with high temporal resolution. Previous EEG/MEG analyses divided the problem into two consecutive stages: the separation of neural sources and then the connectivity analysis of the separated sources. Such nonoptimal division into two stages may bias the result because of the different prior assumptions made about the data in the two stages. We propose a unified method for separating EEG/MEG sources and learning their functional connectivity (coactivation) patterns. We combine blind source separation (BSS) with unsupervised clustering of the activity levels of the sources in a single probabilistic model. A BSS is performed on the Hilbert transforms of band-limited EEG/MEG signals, and coactivation patterns are learned by a mixture model of source envelopes. Simulation studies show that the unified approach often outperforms conventional two-stage methods, indicating further the benefit of using Hilbert transforms to deal with oscillatory sources. Experiments on resting-state EEG data, acquired in conjunction with a cued motor imagery or nonimagery task, also show that the states (clusters) obtained by the proposed method often correlate better with physiologically meaningful quantities than those obtained by a two-stage method.

## 1 Introduction

Unsupervised machine learning techniques play a fundamental role in the analysis of spontaneous (resting-state) neuroimaging signals by exploring the intrinsic statistical structures of such signals without relying on extrinsic covariates about tasks or stimulation protocols. The structures or features obtained can then be examined based on neurophysiological knowledge, often using the features in a group comparison or, possibly, by finding similar structures in other signals already associated with tasks or stimuli.

In recent years, there has been growing interest in exploring the patterns and dynamics of resting-state functional brain connectivity (Friston, 1994) based on unsupervised signal analyses. To find patterns in nonstationary functional connectivity, most studies have relied on such standard techniques as independent component analysis (ICA) (Brookes et al., 2011; Smith et al., 2012), principal component analysis (PCA) (Leonardi et al., 2013), and *K*-means clustering (Liu, Chang, & Duyn, 2013; Allen et al., 2014). New unsupervised analysis methods have also been developed (e.g., Haufe, Tomioka, Nolte, Müller, & Kawanabe, 2010; Zhang & Hyvärinen, 2010; Hyvärinen, Zhang, Shimizu, & Hoyer, 2010; Hirayama & Hyvärinen, 2012; Ramkumar, Parkkonen, Hari, & Hyvärinen, 2012; Ramkumar, Parkkonen, & Hyvärinen, 2014; Dähne et al., 2014) to incorporate the specific nature of neuroimaging signals.

Such methods for analyzing functional brain connectivity, based primarily on signals’ own statistics, are potentially very useful not only for neuroscientific investigations but also for neuroengineering applications. For example, a key challenge in an emerging new direction in brain-computer interface (BCI) research is to covertly acquire users’ unobserved states during everyday life behaviors (Zander & Kothe, 2011; Lance, Kerick, Ries, Oie, & McDowell, 2012). Since brain activity cannot be very well controlled in everyday life situations, no reliable class labels are available for discriminating the unobservable states of users. Such difficult data could be tackled if unsupervised analysis discovered connectivity patterns reflecting the user’s cognitive states. Unsupervised connectivity analysis might also shed light on the neurophysiological basis of the BCI paradigms commonly used so far, such as those based on motor imagery, that is, imagining body movements (Grosse-Wentrup, 2009, 2011).

Motivated by such potential applications, in this article we focus on developing an unsupervised analysis method for finding connectivity-related signal features from electroencephalography (EEG); our method may also be readily applied to magnetoencephalography (MEG) because of its fundamental similarity. EEG/MEG’s high temporal resolution is particularly useful for analyzing nonstationary connectivity, as compared to functional magnetic resonance imaging (fMRI), which is used in most connectivity studies.

The analysis of functional connectivity patterns in EEG/MEG, however, is not straightforward because the neural sources are mixed by volume conduction (or field spread) into sensor signals. Two-stage analysis has been conventionally performed by first separating the neural sources from the given sensor signals and analyzing the connectivity patterns based on those separated sources. In neuroimaging literature, electromagnetic inverse problems are often solved to separate (estimate) the activity of dipolar sources on the cortical grid, with an additional effort of physical forward modeling. In exploratory signal analysis related to BCI, blind source separation (BSS) methods (including ICA) are especially useful, since they greatly simplify the interpretation of the results by decomposing the data into components, and the inverse problem can be solved to localize the obtained components afterward (Hironaga & Ioannides, 2007; Doesburg & Ward, 2009). Note that these components are actually called “sources” in BSS literature, and a component can be an integration of multiple correlated dipolar sources.

The problem is that conventional two-stage analysis (i.e., first source separation and then connectivity analysis) is “neither a principled nor an optimal solution to the overall problem” (Makeig et al., 2012). Source separation methods rely on specific prior assumptions about the sources, which are not necessarily consistent with what the connectivity analysis assumes about them; such inconsistency between prior assumptions might bias the results. A more desirable unified treatment would be obtained by extending the conventional generative (forward) model used for source separation by including prior assumptions about the sources that are consistent with the connectivity analysis. Typically, connectivity analysis can be formulated as learning of a specific parametric model of the sources, and both layers of the model may be learned simultaneously, unified by the principle of statistical parameter estimation.

Here, we present a unified method for BSS and analyzing functional connectivity patterns in EEG/MEG sources, based on a novel two-layer extension of the conventional BSS/ICA generative model. In line with previous resting-state MEG studies (de Pasquale et al., 2010; Brookes et al., 2011; Ramkumar et al., 2014), we are particularly interested in finding coherent (frequently occurring) patterns of activity levels (coactivations) of oscillatory sources in any frequency band of interest. The connectivities are based on envelopes and ignore phase information, but our model is rather different from power-to-power coherences.

To properly model the source envelopes, our model uses a complex-valued formulation of BSS based on the Hilbert transform, a key departure from related extensions of ICA based on modeling real-valued data (Hyvärinen & Hoyer, 2000; Hyvärinen, Hoyer, & Inki, 2001; Valpola, Harva, & Karhunen, 2004; Karklin & Lewicki, 2005; Kawanabe & Müller, 2005; Osindero, Welling, & Hinton, 2006; Köster & Hyvärinen, 2010; Haufe et al., 2010; Hirayama & Hyvärinen, 2012). Another important novelty here is using a finite mixture model of sources, in which they are assumed to exhibit different coactivation patterns corresponding to a finite number of unobserved “brain states.” This corresponds to performing unsupervised clustering on the coactivations of the sources, inspired by the use of *K*-means clustering in previous resting-state fMRI studies (Liu et al., 2013; Allen et al., 2014). Due to the simplicity of the mixture model, our two-layer model is tractable, unlike previous two-layer models that are often difficult to learn. Our entire model can be readily estimated (optimized) by the maximum likelihood method without resorting to any approximations.

The rest of this article is organized as follows. First, in section 2, we present our proposed method based on a novel two-layer extension of the generative BSS/ICA model, the latent coactivity mixture model (LCMM). Then we provide simulation studies in section 3 and real EEG data analysis in section 4 to validate the unified approach. Finally, we discuss the results and open issues in section 5. Preliminary results were presented at the International Conference of the IEEE Engineering in Medicine and Biology Society (Hirayama, Ogawa, & Hyvärinen, 2014).

## 2 Latent Coactivity Mixture Model

### 2.1 Background: Blind Source Separation

Note that in equation 2.1, the number of sources is assumed to equal (effective) dimensionality *d* of the sensor signal as a fundamental setting in a standard ICA; this greatly simplifies the mathematical treatment. In practice, *d* can be selected to be smaller than the original number of sensor channels, typically by discarding the ineffective dimensions (with too small variances) using PCA. It should also be noted that each source (or component) does not necessarily correspond to any single electrical dipole; instead, each may describe the total effect of multiple correlated dipolar activities.

To solve the BSS problem, we need to make further assumptions about the statistical properties of the sources based on prior knowledge. Independent component analysis (ICA) typically assumes that *d* sources are non-gaussian and mutually independent (Hyvärinen, Karhunen, & Oja, 2001), which theoretically guarantees the identifiability of both and ’s, up to the scaling and permutation of the sources. However, the independence assumption might lead to a solution that is weakly functionally connected (i.e., weakly statistically dependent) even when the true sources are strongly dependent, which is not consistent with the goal of connectivity analysis. This motivated us to develop an appropriate model of functionally connected (dependent) sources.

### 2.2 Definition of Latent Coactivity Mixture Model

*i*is an imaginary unit. The envelope of is given by the modulus of . This simple algebraic dependence of the envelope on greatly simplifies the developments below.

Note that mixing matrix in equation 2.3 is identical to the original one in equation 2.1 because of the Hilbert transform’s linearity. This ensures that the 's columns can be interpreted directly as defining spatial topographies in the original sensor space. For simplicity, we constrain to be real valued, while a complex-valued could also be straightforwardly used, which might be useful to deal with sources synchronized in different phases (Hyvärinen, Ramkumar, Parkkonen, & Hari, 2010).

We next define a coactivation (connectivity) structure between sources . The fundamental assumption here is that a system generating the data can be in a finite number of different states, corresponding to different patterns of source amplitudes. Given the state at time point *t*, sources are generated based on a multivariate Student-*t* distribution (specified below), which implements the average source amplitudes specific to that state and generates random phases.

Thus, our model of connectivities is not based on explicitly measuring some form of correlations between the sources or their envelopes, as is typically done in electrophysiology. Instead, we characterize the interactions of the sources by dividing their joint activity into a number of typical patterns of envelopes that intuitively express the idea that certain sources tend to be coactivated. Such coactivation does imply correlations of envelopes (i.e., power-to-power coherence) but provides a more detailed analysis of the coactivation than merely computing correlations.

The proposed latent coactivity mixture model (LCMM) is thus summarized as a two-layer generative model of complex sensor signal vector as follows:

At each time point

*t*, the system generating the data takes one of a finite number of different states (clusters) indexed by , according to multinomial probability distribution with cluster probabilities , where and .- Given that the system belongs to the
*k*th state at time*t*, source vector is specifically generated by a complex multivariate Student-*t*distribution (Schreier & Scharf, 2010) with circular (see below) and mutually uncorrelated sources; they have state-conditional variances or expected powers (squared amplitudes), given by where denotes the conditional expectation given the*k*th state, is an integer called the degrees of freedom, and nonnegative vector , called the coactivation pattern, specifies the expected levels of the source envelopes.^{1} Complex sensor signal is given as a linear instantaneous mixture of with unknown mixing matrix , as in equation 2.3, common to each state

*k*.

The *N* Hilbert-transformed sensor signal vectors, , are simply assumed to be independent and identically distributed (i.i.d.), as is commonly done in many ICA methods. Such a simplification is mainly done for purposes of mathematical and computational tractability, but it could be relaxed by modeling the autocorrelation structures of the sources (in future work).

*t*distribution (Schreier & Scharf, 2010) with scatter matrix and degrees of freedom is given by where denotes the Hermitian transpose and denotes the gamma function. Source vector in LCMM has a state-conditional pdf given by . Figures 1a and 1b illustrate the conditional pdf for a bivariate case.

The particular choice of the Student-*t* model is mainly motivated by its robustness to outliers in the estimation of (co)variances (i.e., coactivation patterns), as has been thoroughly studied in the literature (Ollila & Koivunen, 2003; Mahot, Pascal, Forster, & Ovarlez, 2013).^{2} Scatter matrix is proportional to the covariance matrix if it exists, that is, for , and maximum likelihood estimate (MLE) of is often used as a robust alternative of the sample covariance matrix even with . Diagonal scatter matrix in LCMM implies equation 2.4, and the estimate of serves as a robust estimator of the state-conditional variances. Note that even if the variance does not exist when , can have a finite MLE, giving a more general scale parameter estimate. Typically, since EEG/MEG signals contain a large number of noises or artifacts from outside the brain, robustness is a desirable property in practice. Our model, however, includes no explicit noise term in the generative model of equation 2.1, which is mainly for computational simplicity, as in standard ICA methods.

Graphical representations of the dependency structure in LCMM are given in Figure 2. The sources in LCMM are not independent of each other (see Figure 2a; see also Figure 1c), that is, , in contrast to standard complex-valued BSS/ICA models. Unlike the related generalizations of ICA modeling energy correlations, the sources are even conditionally dependent (see Figure 2b), given the higher-order latent variables (here, state *k*). This is due to our choice of the Student-*t* model instead of a gaussian model. In the limit of , the pdf is reduced to a complex gaussian, and the sources become conditionally independent (see Figure 2c), but they remain dependent over the whole data set (with state *k* marginalized out).

### 2.3 Parameter Estimation

*t*distributions, in which the state-dependent scatter matrices are tied with common parameter . It can be readily seen from equation 2.7 that the probability density does not change if (or ) and are simultaneously replaced by (or ) and , respectively, where is any nonsingular diagonal matrix. This is the well-known scaling ambiguity inherent in BSS/ICA. We fix the scale by setting every column of to have unit Euclidean norm.

The parameters of interest in LCMM can be easily estimated by the maximum likelihood method: maximizing with respect to the model parameters when the latent variables are marginalized out. Importantly, this does not require any approximation, in contrast to other hierarchical BSS models. For simplicity, we fix degrees of freedom to a constant (we set in sections 3 and 4 below; this choice leads to an infinite variance and strong robustness) and learn the other parameters since typically has only a small effect on the final solution (at least if set relatively small for ensuring robustness). The detailed form of the objective function and its derivatives are given in the appendix.

^{3}to efficiently optimize the likelihood using reparameterization (Salakhutdinov, Roweis, & Ghahramani, 2003) given by so that the s automatically satisfy constraints and . We do not constrain the scaling of or during the optimization, but rescale them after obtaining the final solution so that every column of has a unit norm. Any standard optimization software can be used for solving unconstrained optimization on new parameter set . We found that this quasi-Newton method is more efficient than the well-known expectation-maximization method (simulations not shown).

### 2.4 Choosing the Number of States by BIC

*K*of states or clusters. We use the Bayesian information criterion (BIC) to select the best

*K*minimizing where denotes the maximum of the likelihood obtained numerically, as explained above, and the number of free parameters

*M*in LCMM is given by . The use of BIC for the model order selection in mixture models has been extensively studied in statistics. There are theoretical results regarding statistical consistency (Keribin, 2000), and BIC often exhibits state-of-the-art performance, as shown empirically (Steele & Raftery, 2010).

### 2.5 Relation to Previous Two-Layer Extensions of BSS/ICA

Many previous attempts have been made to extend BSS/ICA based on equation 2.1, particularly to deal with the residual dependency structures between power or magnitudes of the sources often observed in ICA results (Hyvärinen & Hoyer, 2000; Hyvärinen, Hoyer et al. 2001; Valpola et al., 2004; Karklin & Lewicki, 2005; Kawanabe & Müller, 2005; Zhang & Hyvärinen, 2010; Hirayama & Hyvärinen, 2012); these works were not necessarily concerned with EEG/MEG.

The following are the main problems with these previous models for EEG/MEG connectivity analysis: (1) they do not properly model the envelopes of the oscillatory sources, and (2) the likelihood is often intractable without resorting to approximations, because the continuous latent variable needs to be integrated out. A recent study (Cadieu & Olshausen, 2012) on the statistical modeling of natural movies actually addressed the first issue without resolving the second one.

*t*distribution belongs to the gaussian scale-mixture family (see Bishop, 2006, for a real case). We can equivalently reformulate our model by assuming that is a complex (circular) gaussian conditionally on state

*k*and introducing a scaling variable that follows an inverse gamma distribution. The conditional variance can then be written where denotes the squared envelopes. Now consider a simplified complex-valued counterpart of the previously used equation 2.11 given by The LCMM in equation 2.12 has essentially the same form if we can constrain it so that only a single variable

*u*takes a nonzero value at a time.

_{k}Hence, although closely related, LCMM has notable differences from previous energy-correlation models. First, it models the (squared) envelopes instead of the (squared) magnitudes , thus properly dealing with oscillatory sources (together with Hilbert transform). Second, the model is tractable and fast to learn because it has only one discrete latent variable instead of multiple continuous ones.

### 2.6 Real-Valued Variant of LCMM

*u*in equation 2.11 takes a nonzero value at a time, where nonlinearity is set to an identity function, implying that . More specifically, the real-valued LCMM is given as a constrained form of a mixture of multivariate Student-

_{k}*t*distributions and its pdf is given by where the Student-

*t*pdf for real vector is generally given by Again, the pdf can be expressed using as and the same quasi-Newton optimization is used for estimating (specifically with in sections 3 and 4). Finally, the state posterior is also given by

## 3 Simulations on Artificial Data

We next quantitatively compare the proposed method with existing approaches for the analysis of EEG/MEG data. We start with simulated EEG/MEG data so that the ground truth is known and can be systematically controlled; we provide a real EEG analysis in section 4. The goal of the simulation study is to validate the two key ideas of the proposed method: the complex-valued formulation and the unified estimation principle of the two stages of analysis.

### 3.1 Methods

The simulated EEG/MEG signals were created as follows. First, we applied a bandpass filter (– Hz) to 10 gaussian temporally white signals sampled virtually at Hz to simulate the alpha-range activities. Then these oscillatory signals were jointly amplitude-modulated block-wise in every 2 second window ( samples) to show the state-dependent coactivation patterns. For this purpose, we first created vectors whose entries were independently sampled from a standard gaussian distribution but set to zero if negative; this was repeated until at least two entries satisfied . Then the *j*th oscillatory signal was multiplied by with state *k* randomly chosen for each block with uniform probability (). The sources with nonzero *b _{jk}*’s were actually coactivated, while very small activity levels were avoided with this procedure.

So that the amplitudes of these coactivated sources have nonzero (positive) correlations within each state, they were further modulated globally by a noisy sinusoidal signal, generated by sampling from a gamma distribution ^{4} with where Hz and phase was randomly selected. Then gaussian white noise was added, where the noise variance was set to have a given value of a signal-to-noise ratio (SNR), defined as the ratio of the variance. Figure 3 illustrates an example of 10 sources before and after adding the noise. Finally, sources were linearly mixed into the same number (i.e., ) of sensor signals with square mixing matrix generated randomly from the standard gaussian distribution.

The clustering and source separation performances on these simulated data were compared among the following methods: LCMM, LCMM (real), ICA+MixT, and ICA+Kmeans. The first one, LCMM, is the proposed approach, based on the joint maximum likelihood estimation of the two layers of the complex-valued LCMM. The second one, LCMM (real), denotes the real-valued counterpart of LCMM (see section 2.6), which can be seen as lying between our proposed method and the previous two-layer BSS/ICA models. The latter two, ICA+MixT and ICA+Kmeans, perform two-stage analysis. Both first used the complex-valued FastICA (Bingham & Hyvärinen, 2000) (with real-valued ) for separating complex sources ; then ICA+MixT directly learned the mixture of the Student-*t* model (as in LCMM) on the separated sources, while ICA+Kmeans performed standard *k*-means clustering on log-amplitudes , where the mean log amplitude over the channels was subtracted at every *t* to compensate for the global modulation.

^{5}corrects normalized mutual information (NMI) between true cluster

*k*and estimated cluster for chance agreements: where () and denotes the expectation of under random permutations of the cluster labels where the numbers of clusters and cluster members are unchanged (

*I*and

*H*denote the sample mutual information and marginal entropy). AMI is thus expected to be zero under this random permutation and is upper-bounded by one; the upper bound is achieved only when the two clusterings are perfectly matched. The Amari index, a standard performance measure for linear BSS problems, is defined by where denotes the th element of matrix with estimated and true mixing matrices and , respectively. This index is nonnegative and equals zero if and only if the true mixing matrix is recovered up to the permutation and scaling of the columns.

### 3.2 Results

Figure 4 quantitatively compares the performances of clustering (panels on the left) and source separation (panels on the right) by the above four methods with different numbers of clusters estimated by the model, , while the number of true states is always 5. Each box plot displays the result of runs in each of the different sample sizes *N*. The SNR of the sources was specifically set to dB in this figure. In the panels on the right, “ICA” corresponds to both ICA+MixT and ICA+Kmeans.

As is clearly seen in the left three panels, LCMM achieved the highest AMI (medians) in every condition, and LCMM (real) consistently showed a lower AMI. In contrast, on the right-hand-side panels, these two methods showed very similar Amari indices. These results imply that the complex-valued formulation of our LCMM is particularly beneficial for obtaining better clustering without degenerating the source separation. The two-stage methods, ICA+MixT and ICA+Kmeans, also exhibited lower (i.e., worse) AMI values than those of LCMM (but not necessarily of LCMM (real)), while they exhibited higher (i.e., worse) Amari indices. The two-stage methods are less accurate in both clustering and source separation than the proposed (complex-valued) LCMM method.

Although we obtained the best performance with (the true number of clusters), the relative performance among the four methods was qualitatively similar for different *K*s. That is, the proposed method outperforms other methods even when the number of clusters *K* is misspecified.

To further examine how the result changes with different noise levels, we also conducted simulations with different SNRs for generating the source signals. The number of clusters *K* in the model was simply set at the true one (). Figure 5 shows the result in the same format as that of Figure 4. The relative performance of the four methods was qualitatively similar to Figure 4 in every SNR setting. This showed that LCMM improved clustering without degenerating the source separation over the other methods even when the SNR is relatively low.

Finally, we demonstrated the use of BIC for selecting the number of clusters. Here, we computed the BIC for and chose the *K* that minimizes the value. Note that in the simulation setting here, LCMM does not completely match the true data-generating model due to the additional gaussian noise in the sources. The number of clusters selected thus often exceeded five, as shown in Figure 6, while higher SNRs (e.g., or dB) resulted in values closer to five. In practice, since EEG/MEG usually has a low SNR, these results indicate that the number of clusters will likely be overestimated by BIC. However, spatial topographies also learned by LCMM can be used to identify and discard such irrelevant clusters that contain only noise or artifacts instead of physiologically meaningful patterns.

## 4 Experiments on Resting-State EEG Data

Next, we demonstrate the advantages of using LCMM compared to two-stage methods in a real EEG data analysis. The target data are resting-state EEGs acquired before and after a BCI-related task, which we expect to contain task-relevant brain states, possibly due to mental rehearsal or retrieval. We examined the patterns (states) found in the resting-state EEGs based on the labeled EEG data during task, as well as their spatial topographies on sensor channels.

### 4.1 EEG Data

Five healthy subjects (three males, two females, 28+/−11 years old) participated in our EEG experiment. We placed a head cap with EEG electrodes on their heads with electric-conductive gel SIGNAGEL (Parker Laboratories, Fairfield, NJ, USA) to reduce the impedance of the electrodes. We positioned 64-channel active electrodes based on the international 10–20 system, and connected them to an ActiveTwo amplifier (BioSemi, Amsterdam, Netherlands). An experimental protocol of this study was approved by the ethical committee at ATR.

The brain activity was measured in each subject during resting states with eyes open and while the subjects performed a cued motor imagery or nonimagery task. The experiment consisted of two resting-state (RS) sessions and six task sessions between the two RS sessions. In each RS session (5 minutes), the subject was instructed to relax without thinking of anything in particular and without sleeping and to focus on a fixation point at the screen’s center. In each task session, the subjects performed a number of task trials in each of which they randomly took one of the following three actions for 3 seconds after a visually cued onset: (1) left (covert imagination of a left-hand movement), (2) right (covert imagination of a right-hand movement), and (3) idle (no imagination of hand movements).

The EEG data were acquired at a sampling frequency of 256 Hz, bandpass-filtered offline to 1 to 50 Hz (fourth-order Butterworth, zero phase), and re-referenced to the common average. Some of the RS data and some trials in the task data were rejected due to gross contamination. Typical ocular, cardiac, and muscular artifacts were also identified and removed by FastICA (Hyvärinen, 1999) with visual inspection and frequency analysis, which was done separately for each subject and also for the RS and task data. The data were further bandpass-filtered in certain frequency bands of interest and then Hilbert-transformed. We focused on two frequency bands of interest, 8 to 12 Hz (alpha) and 13 to 30 Hz (beta), in line with previous neurophysiological studies on motor imagery (Pfurtscheller & Neuper, 1997) as well as those on resting-state brain networks (Brookes et al., 2011).

### 4.2 Method

All the LCMM parameters were estimated from the RS data alone where the two (pre- and post-) RS sessions were combined. We emphasize that the task data and labels were not used for the parameter estimation, only to validate the learned model in a post hoc manner. Before the parameter estimation, the RS data were spatially prewhitened and dimensionality-reduced by PCA, so that % of the sample variance was kept. Note that the effective dimensionality of the data had already been reduced above by removing the artifactual dimensions using ICA. As a result, the number of sources *d* was selected as , and for the five subjects in the alpha band and , and in the beta band. For comparison, we also applied the two-stage method ICA+MixT, as explained in section 3.1, to the same data to examine the effect of unifying the two stages of the analysis by LCMM. In both methods, we ran the algorithm 10 times, each from different initial parameters to converge, for every preselected number of states . The best *K* in LCMM was chosen so that the median of the 10 BIC values achieved the minimum, and the same number *K* was also used in ICA+MixT.

We then evaluated how well each coactivation pattern found in the RS data discriminates the two physiologically different brain states corresponding to the motor imagery (left and right) and nonimagery (idle) labeled in the task EEG data. We used a standard performance criterion for binary discrimination, the area under the receiver operating characteristic (ROC) curve or AUC, and evaluated them as follows. We used the *k*th signal model to detect whether a task trial was in the *k*th state. This state was supposed to be detected if log-likelihood , averaged over the imagery or nonimagery period after the cued onset (where the initial 0.5 second of this 3-second period was discarded to avoid the transient effect), was above or below a certain threshold. The log likelihood of the *k*th state, up to the irrelevant scaling and additive constants, can be evaluated at each time point by , as derived by setting in equation 2.5 and removing the time-invariant constant terms from its logarithm. The occurrence of the *k*th state may be associated arbitrarily with motor imagery or with nonimagery for each case of which the ROC curve was drawn by plotting the true-positive rate against the false-positive rate at many different threshold values. Thus, we had two AUC values (computed by a trapezoidal rule) for each state from which the greater one was simply chosen; the AUC becomes close to one if the state occurrence discriminates well between motor imagery and nonimagery, and it is close to if the occurrence does not discriminate it at all. Note that this AUC does not distinguish between left and right because that seemed too difficult in these data based on our preliminary analysis (not shown here).

To compare these states that exhibited high AUC values, we further examined the time courses of their log likelihood and the sensor-level topographies corresponding to them. The topographies were drawn by evaluating how the occurrence of each state changed the power in the frequency band of interest at each site of the electrodes. We first obtained a robust estimate of the state-conditional variance (power) for each sensor channel by (), where *a _{cj}* denotes the estimated mixing coefficient from the

*j*th source to the

*c*th sensor channel and total variance was simply normalized to one because it was undefined due to the choice of . The topographies were then plotted by spatially interpolating the percentage deviations from the “grand variance” over electrodes (see Figure 7 for the layout and channel names). Grand variance was set to the expectation of state-conditional variances during the task period, , where was reestimated by the posterior state probabilities during the imagery and nonimagery periods of the task trials.

### 4.3 Results

The discriminability of each state between the motor imagery and idling trials was examined in terms of the sample distribution of their AUC values. Figure 8 shows the normalized histograms of the AUC collected from the 10 different runs of the parameter estimation, where each panel shows the two normalized histograms by LCMM (“unified”) and ICA+MixT (“two-stage”) for a particular subject and frequency band. The number of samples (AUC values) summarized in each panel was thus , where the number of states *K* was selected by BIC as , and for the five subjects in the alpha band and , and in the beta band. The figure shows that in most cases, the unified method exhibited greater variation in AUC than the two-stage method. For example, the range of AUC clearly expanded to both sides of the distribution in subjects 1 (beta), 3 (beta), and 5 (alpha), while mostly to the right (greater) side in subjects 3 (alpha), 4 (alpha, beta), and 5 (beta). The greater variation in the AUC distribution, even if it was both-sided, implies that it has a greater chance to contain brain states that exhibit higher AUC values. The two vertical lines in each panel further indicate the upper quartiles of AUC for the unified (solid line) and two-stage methods (dashed line), showing consistent increases of the upper quartiles by LCMM from those by ICA+MixT. In other words, the lower bound of the top 25% of the AUC values consistently shifted to the right as a consequence of unifying the two stages of analysis.

To gain insight into the difference of these high-AUC states between the unified and two-stage methods, we further analyzed the average dynamics of the states that exhibited the best (highest) AUC in each subject and frequency band. Figures 9 and 10 display the trial-averaged time courses of the log likelihood of those states obtained for the alpha and the beta bands, respectively, around the 3 second period of motor imagery or nonimagery. For the alpha band (see Figure 9), the dynamics clearly differ between the motor imagery (left and right) and the non-imagery (idle) conditions in LCMM (left column), especially in subjects 1, 2, and 5, with smaller differences among them in ICA+MixT (right column). For the beta band (see Figure 10), the dynamics again exhibited similar differences between the imagery and non imagery conditions in every subject. The results by LCMM (left column) and ICA+MixT (right column) are very similar in subjects 2, 3, and 4, but in subjects 1 and 5, the dynamics again clearly differ between the imagery and nonimagery conditions in LCMM with fewer differences among them in ICA+MixT.

The corresponding spatial topographies of the power deviations, given in Figures 9 and 10, provide further neurophysiological insights in conjunction with those of the averaged dynamics. For example, Figure 9 (left column) suggests that the best-AUC states of subjects 1 and 2 are both associated with the event-related decrease of frontal alpha power in the idle condition but not in the two motor imagery conditions. This is readily seen in the decrease of the log likelihood during the idle condition (green line) from the baseline level and with those topographies that exhibit positive values on the central areas. The figure also suggests that the best-AUC state of subject 5 is associated with the event-related decrease of the alpha powers during the left and right conditions at the bilateral Rolandic (central) areas, the major regions of interest related to motor imagery (Pfurtscheller & Neuper, 1997). This is seen in the log-likelihood increase during motor imagery in conjunction with the negative deviations in the alpha power around those areas, as indicated by the topographies. The topographies obtained for the beta band (see Figure 9) seem more difficult to interpret. Further neurophysiological interpretation is beyond the scope of this article.

These topographic changes in power, seen at the sensor level, are actually caused by different coactivation patterns at the level of the underlying sources. Figure 11 shows an example of the coactivation patterns () and the topographies ( and their element-wise squares) corresponding to each source obtained by LCMM. The top row shows the coactivation pattern of the 12 sources, obtained for subject 2 in the alpha band, which corresponds to the best-AUC state shown in Figure 9 (left column, subject 2). Here in this state, sources 3 and 10 have the largest powers, followed by 5, 6, and 7, and the rest have relatively small powers. With the coactivation patterns for other states, perhaps these high-AUC states can be characterized by the relative deactivation of sources 1 and 2 or the relative activation of sources 5, 6, and 7, for example. Although we omit the details, the topographies given at the bottom will be useful for further interpretation.

## 5 Discussion

We presented a novel unsupervised method for nonstationary functional connectivity analysis of EEG/MEG sources. Our simulation studies confirmed that the proposed unified method often outperformed the conventional two-stage method in terms of both source separation and clustering performances. Real EEG data analysis also showed that the unified method finds coactivity patterns that discriminate well between motor imagery and idling states with a higher probability than the two-stage method. In addition, the proposed method performs clustering in the source space to give further neurophysiological insights beyond sensor-space clustering (Britz, Ville, & Michel, 2010; Shi & Lu, 2008), as demonstrated in section 4.3.

Our real EEG data analysis suggested that the log-likelihood values of the states, obtained by our unsupervised analysis, may provide better higher-order signal features relevant to BCI (e.g., for the onset detection of motor imagery) than those by conventional methods. However, as seen in section 4.3, LCMM often did not consistently improve the discriminability (AUC) of every state but rather strengthened their contrasts: the states obtained by LCMM may contain both more discriminative (higher AUC) and more indiscriminative (lower AUC) ones than those by the two-stage methods. Hence, we must carefully select the discriminative features (states) while avoiding the indiscriminative ones to successfully apply our method to BCI. Such a feature selection could be done—for example, based on neurophysiological interpretations or using additional task-based experimental calibration. This remains open for future investigation.

Some recent studies have combined BSS with effective (directional) connectivity analysis to analyze the causality between neural activities, for example, autoregressive (AR) models (Gómez-Herrero, Atienza, Egiazarian, & Cantero, 2008; Haufe et al., 2010, see also Fukushima, Yamashita, Knösche, & Sato, 2015, to solve the inverse problem rather than BSS), generalized autoregressive conditional heterokcedasticity (GARCH) models (Zhang & Hyvärinen, 2010), or a structural equation model (SEM) (Hirayama & Hyvärinen, 2012). Our idea of unifying BSS and EEG/MEG connectivity analysis in a hierarchical statistical model is thus not completely novel, but in this study, we focused on a different type of statistical connectivity: functional connectivity based on envelope correlations. More important, we focused on analyzing nonstationary functional connectivity in terms of underlying patterns or states and their dynamics. This is a conceptually crucial difference from previous studies that were concerned with static connectivity (i.e., those unchanging over time).

The LCMM proposed here was shown to be a reasonable statistical model to perform our specific analysis on resting-state EEG/MEG signals, but obviously it has some limitations from the perspective of generative signal modeling. First, each LCMM state can describe only the positive correlations in the source envelopes, as is readily seen from the interpretation of equation 2.12, where latent factor and coefficients are both positive. The states themselves are correlated negatively since they do not occur simultaneously due to the assumption of the finite mixture model. Hence, our method cannot analyze any brain states characterized by negative envelope correlations or states that are positively correlated with each other. Second, the i.i.d. assumption is too simplistic since brain activity is obviously not independent over time. More general models corresponding to equation 2.13, possibly with nonlinearly in equation 2.11, as well as some temporal dynamics model on latent variables, will make the model more realistic. However, the model complexity must be carefully controlled to achieve good predictability and maintain the model’s tractability.

Another practical limitation of our method is that it can currently handle only a single frequency band of interest. Although it is very typical to limit an EEG/MEG analysis to a certain frequency band, cross-frequency interactions are sometimes of particular interest. Some recent studies have actually combined a complex-valued ICA with time-frequency decomposition for spontaneous EEG/MEG analysis with more flexibility on the spectral nature (Hyvärinen, Ramkumar et al., 2010; Ramkumar et al., 2012). A similar technique can probably be used to extend our method to analyze the data in multiple frequency bands.

### Appendix: Maximum Likelihood Estimation of LCMM

*t*th term, , is given by where , according to equations 2.5 and 2.6. We minimize by a quasi-Newton method with respect to , , and using the reparameterization of equation 2.8. This requires the first derivatives of , given by where , the asterisk denotes a complex conjugate, and denotes the -element of the transposed inverse matrix of . denotes the posterior probability of the

*k*th state, given by where means that the left-hand side is proportional to the right-hand side up to a constant factor independent of

*k*. Notice that equation A.5 is equivalent to equation 2.9. To obtain equation A.3, above, we used the relation given by where and denote the real and imaginary parts of complex number

*z*, respectively.

## Acknowledgments

We gratefully thank Motoaki Kawanabe and Okito Yamashita for their helpful comments and discussions and Takayuki Suyama, who provided the opportunity for this research. This work was supported by a contract with the Ministry of Internal Affairs and Communications “Novel and Innovative R&D Making Use of Brain Structures” and by JSPS KAKENHI Grant Number 25730155. J.H. was partially supported by Strategic International Research Cooperative Program, Japan Science and Technology Agency (JST). A.H. was supported by the Academy of Finland, Centre-of-Excellence in Inverse Problems Research and Centre-of-Excellence Algorithmic Data Analysis.

## References

## Notes

^{1}

To be precise, the exact relation between and the source variances in equation 2.4 is no longer valid if , since the variance is infinite or undefined. However, even if , the estimated can still be interpreted as modeling the variance levels, while small implies canceling the effects from the outliers from the viewpoint of robust estimation (as explained in the text).

^{2}

Note that other heavy-tailed distributions in the complex elliptical family (Schreier & Scharf, 2010) commonly have the robustness property and thus could also be used as the source model in LCMM. However, the examination of this option is beyond the scope of this article.

^{3}

We used a Matlab implementation of the limited-memory BFGS by Mark Schmidt, available at http://www.di.ens.fr/∼ mschmidt/Software/minFunc.html.

^{4}

The pdf is given by if .

^{5}

We used the Matlab code available at https://sites.google.com/site/vinhnguyenx/softwares.