## Abstract

A driver's cognitive state of mental fatigue significantly affects his or her driving performance and, more importantly, public safety. Previous studies have leveraged reaction time (RT) as the metric for mental fatigue and aimed at estimating the exact value of RT using electroencephalogram (EEG) signals within a regression model. However, because RTs are easily corrupted and nonsmooth during data collection, methods that focus on predicting the exact value of a noisy measurement, RT, generally suffer from poor generalization performance. Considering that human RT is the reflection of brain dynamics preference (BDP) rather than a single regression output of EEG signals, we propose a novel channel-reliability-aware ranking (CArank) model for the multichannel ranking problem. CArank learns from BDPs using EEG data robustly and aims at preserving the ordering corresponding to RTs. In particular, we introduce a transition matrix to characterize the reliability of each channel used in the EEG data, which helps in learning with BDPs only from informative EEG channels. To handle large-scale EEG signals, we propose a stochastic generalized expectation-maximization (SGEM) algorithm to update CArank in an online fashion. Comprehensive empirical analysis on EEG signals from 40 participants shows that our CArank achieves substantial improvements in reliability while simultaneously detecting noisy or less informative EEG channels.

## 1  Introduction

According to the Sleep Health Foundation report by Adams et al. (2017), mental fatigue is a major cause in 33% to 45% of all road accidents. In general, mental fatigue (Boksem & Tops, 2008) refers to the inability to maintain optimal cognitive performance in a task with a high demand for cognitive activity. In the context of driving, such inability could lead to accidents with severe consequences (Adams et al., 2017). Individuals may find themselves in a mentally fatigued state because of lack of sleep, continuous driving for an extended period, monotonous driving late at night or before dawn, and driving while under the influence of sleeping drugs or with sleep disorders (Ji, Zhu, & Lan, 2004; Ting, Hwang, Doong, & Jeng, 2008). (See Zhang, Yao, Wang, Monaghan, & Mcalpine, 2019, for recent advances and references in brain dynamic analysis.)

In response to these critical issues, several methods (Cook, O'Connor, Lange, & Steffener, 2007; Blankertz et al., 2009; Fazli et al., 2009; Wascher et al., 2014; Tian, Wang, Dong, Pei, & Chen, 2018; Kaji, Iizuka, & Sugiyama, 2019) have been proposed to estimate and predict mental fatigue based on electroencephalography (EEG) and reaction time (RT) (see Figure 1a). Some of these methods, however, performed well for some participants but failed for others due to a lack of generalization. One of the challenges behind such poor generalization is determining how to use RT effectively. RT is easily affected by instrumental error, wandering attention, or other task-unrelated factors. A previous study (Wei et al., 2015) tried to overcome this problem by adopting different techniques to smooth RTs but still failed to make it work for all participants. Note that human RT is usually the result of preference (Izuma & Adolphs, 2013) in brain dynamics during the task rather than just a single value. Such preferences can be affected by changing levels of attention (Möckel et al., 2015), such as mind wandering (Lin et al., 2016) or a lower level of attention (Chuang et al., 2018). Therefore, the relationship between EEG signals and RTs, including extreme or abnormal RTs, should be modeled in a way that reflects human brain dynamics preferences (BDPs).

Figure 1:

(a) Regression model with EEG signals. (b) Proposed channel-reliability aware ranking (CArank) model with brain dynamics preferences.

Another important problem lies in the heterogeneous channels extracted from different brain regions, which are normally responsible for different functionalities. There was an attempt to choose different brain regions (Wascher et al., 2014) when evaluating mental fatigue, but these regions of the brain are not necessarily the same for all participants (Gramann, Müller, Schönebeck, & Debus, 2006). For example, Wascher et al. (2014) heuristically used frontal theta to represent different levels of mental fatigue for all participants. In such a case, the reliability of the learning model would inevitably degrade because the method may select noisy or less informative channels from different brain regions. Some previous work (de Naurois, Bourdin, Stratulat, Diaz, & Vercher, 2017) attempted to solve this issue by using artificial neural network models but still failed to provide convincing results. This previous work impels us to pursue a purely data-driven approach to predict mental fatigue while getting rid of the low versatility caused by various heuristic tricks.

To overcome these problems, we first formulate the task of monitoring mental fatigue as a multichannel ranking problem and solve it with our proposed channel-reliability aware ranking (CArank) model. In particular, CArank can learn from brain dynamics preferences (BDPs) using EEG data robustly, while effectively preserving the ordering of RTs (see Figure 1b). This approach corrects the defects of previous models, and the resulting performance loss, caused by noisy and extreme RTs. Furthermore, our model uses a transition matrix to identify the high-confidence sources among heterogeneous EEG channels, which contributes substantially to task performance. In order to handle large-scale EEG signals and obtain better generalization, we propose a stochastic generalized expectation-maximization (SGEM) algorithm. More precisely, we make the following key contributions:

• We formulate the task of monitoring mental fatigue into a multichannel ranking problem and tackle it with the CArank model. CArank is a purely data-driven approach to detect mental fatigue using informative channels only.

• We propose a stochastic generalized expectation-maximization algorithm for CArank, which extends CArank to large-scale applications.

• We conduct empirical experiments on EEG signals from 40 participants to demonstrate the superior reliability of CArank in terms of mental fatigue monitoring.

This letter is organized as follows. Section 2 introduces the topic of mental fatigue monitoring and motivates the practice of using brain dynamics preferences. In section 3, we address the multichannel ranking problem and introduce our channel-reliability aware ranking to solve it. Section 4 describes a stochastic generalized expectation-maximization algorithm. Section 5 demonstrates the reliability of the proposed CArank with EEG signals from 40 participants. Section 6 envisions the future work, and section 7 concludes.

## 2  Background

In this section, we introduce some preliminary information about mental fatigue monitoring and then discuss our motivation for learning from brain dynamics preferences.

Reaction time is an intuitive indicator used to assess human mental fatigue. Therefore, a common practice for monitoring mental fatigue is to find a robust way of mapping humans' reaction time to an emergent situation using previously recorded EEG signals (Lal, Craig, Boord, Kirkup, & Nguyen, 2003; Kohlmorgen et al., 2007; Jap, Lal, Fischer, & Bekiaris, 2009).

### 2.1  Overfitting of the Regression Model

A natural way to forecast the RT with EEG signals is to formulate it as a regression task (see Figure 2), namely, finding a (non)linear mapping (e.g., neural networks, SVR) from the EEG signals $x$ to the corresponding RT. However, due to the existence of extreme values in RTs during data collection (Wei et al., 2015; Huang, Pal, Chuang, & Lin, 2015), the scale of the regression loss with regard to various RTs varies significantly. Therefore, the regression loss, without discriminating the peculiarity of the RTs, would be dominated by the few extreme RTs while omitting normal RTs. This then leads to the overfitting of the regression model on the training data, with poor generalization performance on the test data (see Figures 2 and 5 and Table 1).
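
The dominance of extreme RTs in a squared-error loss can be illustrated with a few made-up numbers. The sketch below is not the letter's regression model: the reaction times and the constant prediction are hypothetical values chosen only to show how a single extreme RT can account for nearly all of the loss.

```python
# Hypothetical reaction times in seconds; the last one is an extreme value.
reaction_times = [0.6, 0.7, 0.65, 0.8, 0.75, 6.0]

# A constant prediction near the regular RTs (illustrative assumption only).
prediction = 0.7

# Per-trial squared errors, as they would enter a mean-squared-error loss.
squared_errors = [(rt - prediction) ** 2 for rt in reaction_times]
total_loss = sum(squared_errors)

# Fraction of the total loss contributed by the single extreme RT.
outlier_share = squared_errors[-1] / total_loss

print(f"share of squared-error loss from the extreme RT: {outlier_share:.3f}")
```

Under these assumed values, almost the entire loss comes from one trial, so any gradient-based fit would be driven mainly by that trial rather than by the regular RTs.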

Figure 2:

Overfitting of the two-layer regression model for mental fatigue monitoring. EEG signals from multiple channels are simply concatenated into a long feature vector, and the corresponding regression model is trained using this feature vector. The difference between the ground truth and the prediction is calculated with the root mean squared error. We collect the results only from the first participant for a showcase.

Table 1: Test Accuracy (in %).

| Method | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 | P15 | P16 | P17 | P18 | P19 | P20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVR | 71.74 | 78.92 | 85.79 | 69.76 | 84.17 | 66.61 | 76.38 | 80.41 | 71.10 | 58.52 | 77.12 | 87.01 | 73.92 | 83.79 | 69.10 | 73.65 | 63.77 | 62.49 | 72.64 | 68.66 |
| LR | 69.80 | 70.77 | 85.63 | 69.01 | 63.77 | 53.62 | 79.69 | 55.87 | 74.15 | 21.32 | 77.55 | 87.44 | 74.17 | 70.79 | 41.03 | 53.11 | 58.15 | 59.93 | 41.88 | 66.20 |
| Regression (C) | 71.63 | 79.21 | 80.22 | 72.39 | 83.65 | 68.38 | 60.31 | 54.99 | 77.98 | 59.01 | 82.72 | 89.80 | 79.56 | 85.45 | 68.60 | 65.88 | 54.30 | 50.58 | 68.65 | 61.80 |
| Regression (A) | 71.71 | 72.97 | 79.81 | 70.90 | 82.80 | 57.42 | 61.88 | 60.96 | 66.38 | 52.96 | 79.37 | 73.87 | 67.70 | 80.54 | 66.03 | 54.47 | 51.01 | 65.07 | 62.33 | 54.80 |
| Classification (C) | 76.85 | 82.48 | 82.40 | 74.77 | 83.12 | 65.69 | 76.12 | 70.84 | 83.02 | 63.74 | 76.41 | 85.08 | 77.74 | 88.03 | 69.09 | 71.80 | 58.44 | 77.31 | 80.85 | 63.56 |
| Classification (A) | 79.97 | 77.61 | 79.87 | 68.69 | 82.55 | 63.86 | 49.85 | 51.47 | 51.78 | 53.03 | 75.79 | 79.69 | 66.40 | 89.39 | 68.10 | 53.07 | 50.00 | 52.81 | 61.19 | 52.02 |
| CArank | 82.29 | 80.97 | 83.78 | 77.50 | 87.42 | 76.62 | 82.34 | 79.16 | 91.40 | 78.25 | 81.74 | 84.17 | 83.23 | 90.53 | 76.66 | 80.40 | 88.69 | 81.13 | 80.42 | 78.35 |

| Method | P21 | P22 | P23 | P24 | P25 | P26 | P27 | P28 | P29 | P30 | P31 | P32 | P33 | P34 | P35 | P36 | P37 | P38 | P39 | P40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVR | 72.71 | 73.43 | 78.98 | 67.00 | 76.72 | 72.56 | 75.94 | 85.95 | 81.63 | 82.19 | 67.78 | 87.73 | 71.21 | 76.69 | 80.61 | 88.76 | 77.47 | 74.22 | 63.52 | 46.24 |
| LR | 52.97 | 30.10 | 78.23 | 40.02 | 60.25 | 73.15 | 40.67 | 53.23 | 46.66 | 79.32 | 73.71 | 53.29 | 48.55 | 48.36 | 75.71 | 88.53 | 78.06 | 62.05 | 60.28 | 43.53 |
| Regression (C) | 69.84 | 50.58 | 80.73 | 56.85 | 78.72 | 67.76 | 65.06 | 84.75 | 79.59 | 82.59 | 63.41 | 66.46 | 56.78 | 61.81 | 66.70 | 87.21 | 81.98 | 57.71 | 84.41 | 67.48 |
| Regression (A) | 53.44 | 58.27 | 78.29 | 54.25 | 77.46 | 53.31 | 51.33 | 77.73 | 69.92 | 77.06 | 58.46 | 64.09 | 53.09 | 59.69 | 72.45 | 85.64 | 73.21 | 62.83 | 50.55 | 46.35 |
| Classification (C) | 68.22 | 79.82 | 84.36 | 68.10 | 84.28 | 69.60 | 77.09 | 86.46 | 82.11 | 86.85 | 74.22 | 85.05 | 60.49 | 71.58 | 73.03 | 90.40 | 83.51 | 75.62 | 80.37 | 69.07 |
| Classification (A) | 49.86 | 74.65 | 72.45 | 59.46 | 75.35 | 49.80 | 52.20 | 76.89 | 51.88 | 73.62 | 59.50 | 61.30 | 50.00 | 53.07 | 60.46 | 90.24 | 72.15 | 60.79 | 78.30 | 65.76 |
| CArank | 72.83 | 85.33 | 82.70 | 89.35 | 84.57 | 76.52 | 85.02 | 83.58 | 86.56 | 85.64 | 92.74 | 85.74 | 79.24 | 84.77 | 90.53 | 90.96 | 86.05 | 77.12 | 93.48 | 75.56 |

Notes: Higher is better. The shaded numbers indicate the best results.

This creates a dilemma: we need a reliable learning model to predict RT from complex EEG signals (indeed, this is exactly our target), yet the model need not closely approximate the exact value of RT, especially the extreme values. The problem, then, is how to learn efficiently from noisy or nonsmooth RTs when the exact value is not necessary.

As shown in Figure 2, extreme or abnormal RTs occur widely during data collection. Overfitting arises in the regression model because the regression loss forces the learning model to fit the extreme RTs while underweighting the regular RTs. Although various regularization methods (e.g., $L_2$ norm, $L_1$ norm, and Laplace priors) could alleviate the overfitting of the learning model (Hastie, Tibshirani, & Friedman, 2009; Zhang et al., 2015; Jin, Zhou, Gao, & Zhang, 2018), they cannot solve the overfitting issue if the regression loss is still adopted. The same is true for other heuristic approaches (e.g., early stopping) used for alleviating overfitting.

Wei et al. (2015) tried to overcome this problem by adopting different techniques to smooth RTs but still failed to make it work for all participants. Meanwhile, the performance varies significantly with the choice of the mapping function. The predefined smoothing techniques would excessively down-weight, or simply clip, the extreme or abnormal RTs in the MSE loss, which fails to reveal the real relationship between the EEG signals and RTs, especially the extreme or abnormal RTs (Möckel, Beste, & Wascher, 2015; Lin et al., 2016; Chuang et al., 2018).

For the sake of comparison, we apply the $L_2$ norm regularization to all baselines in this letter.

### 2.2  Consistency of the Ordinal Regression Model

Instead of using regression, we propose to transform the problem into an ordinal regression problem. In particular, the RTs are defined in the totally ordered space $\mathbb{R}$. The structure of this space is captured by the pairwise comparisons between the RTs, which preserve the complete relative ordering of the RTs while ignoring their absolute numerical values. Therefore, predicting the orderings of the pairwise comparisons may be regarded as a relaxed alternative to the previous regression model (see Figure 3).

Figure 3:

Consistency of the two-layer ordinal regression model using brain dynamics preferences. EEG signals from multiple channels are concatenated into a long feature vector, and the corresponding ordinal regression model is trained using this feature vector. In-degree sequences for the ground truth and the prediction are calculated. The root-mean-squared error (RMSE) is also measured between the in-degree sequences of the ground truth and the prediction. We collect the results only from the first participant as a showcase.

We showcase our motivation using a naive ordinal regression model for mental fatigue monitoring and present the results in Figure 3. Even the naive ordinal regression model captures some meaningful structure compared to the regression model. In particular, the relative structure information between the RTs is somewhat preserved: the boundary between large RTs and small RTs is clear. Meanwhile, large RTs could serve as an indicator for monitoring mental fatigue.

#### 2.2.1  Comparison between Ordinal Regression and Regression

The difference between ordinal regression and regression lies in the objective they aim to minimize. Ordinal regression aims to preserve the whole ordering of the RTs, while regression aims to approximate the exact value of each RT. Therefore, ordinal regression is less sensitive to outliers, that is, to the scale of RTs in mental fatigue monitoring.
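
To make the contrast concrete, the following minimal sketch compares two hypothetical RT sequences that differ only in the magnitude of one extreme value. The helper `pairwise_signs` and both sequences are illustrative assumptions, not part of the letter's pipeline; the point is only that the pairwise ordering targets are identical while a squared-error target changes substantially.

```python
def pairwise_signs(values):
    """Return the sign of values[i] - values[j] for every ordered pair i != j."""
    signs = {}
    for i, a in enumerate(values):
        for j, b in enumerate(values):
            if i != j:
                # (a > b) - (a < b) evaluates to 1, 0, or -1.
                signs[(i, j)] = (a > b) - (a < b)
    return signs

# Hypothetical RTs (seconds): same ordering, different size of the extreme value.
rts = [0.6, 0.7, 0.65, 0.8, 6.0]
rts_milder = [0.6, 0.7, 0.65, 0.8, 1.5]

# Ordinal targets are unchanged by the magnitude of the extreme RT.
same_ordering = pairwise_signs(rts) == pairwise_signs(rts_milder)

# A value-based target, in contrast, differs strongly between the two sequences.
mse_between = sum((a - b) ** 2 for a, b in zip(rts, rts_milder)) / len(rts)

print(same_ordering, round(mse_between, 2))
```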

#### 2.2.2  Reliability Issues Caused by Heterogeneous Channels

A naive ordinal regression method still suffers from overfitting, mainly because of the simple concatenation of the EEG signals. Since the EEG signals are from heterogeneous channels, if we simply concatenate the EEG signals without discriminating the reliability of each channel, the model's generalization would be degraded.

Remark 1

(Deficiencies of $L_1$ and $L_{2,1}$ Regularization for Eliminating Noisy Channels). In order to eliminate the noisy channels, the weights of the features should be set to zero for each channel as a whole. However, $L_1$ regularization can push only some, rather than all, of the weights of one channel to zero. $L_{2,1}$ regularization first computes the $L_2$ norm over the weights of each channel and then calculates the $L_1$ norm of all these $L_2$ norms. $L_{2,1}$ regularization could therefore be used to eliminate the noisy channels. However, it suffers from the following deficiencies: (1) it is difficult to extend to nonlinear models, such as deep neural networks, and (2) it relies heavily on tuning the balance factor of the $L_{2,1}$ regularization term.
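
The two norms discussed in this remark can be written out in a few lines. This is a minimal sketch assuming the weights are stored as one list per channel; the function names and the example weights are hypothetical.

```python
import math

def l21_norm(channel_weights):
    """L2 norm within each channel, then L1 norm (sum) across channels."""
    return sum(math.sqrt(sum(w * w for w in ws)) for ws in channel_weights)

def l1_norm(channel_weights):
    """Sum of absolute values of all weights, ignoring the channel grouping."""
    return sum(abs(w) for ws in channel_weights for w in ws)

# Hypothetical weights for 3 channels with 2 features each.
# Channel 2 is zero as a whole; channel 3 is only partially zero.
weights = [[3.0, 4.0], [0.0, 0.0], [0.0, 1.0]]

print(l21_norm(weights), l1_norm(weights))
```

Because the $L_{2,1}$ penalty acts on the per-channel norms, it can only shrink a channel's contribution by shrinking the whole group, whereas the $L_1$ penalty treats every weight independently.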

We next explore data-driven methods that can automatically up-weight reliable channels and down-weight unreliable ones.

## 3  Model and Methodology

In this section, we formulate the mental fatigue monitoring task as a multichannel ranking problem. Furthermore, we extend the ordinal classification model for brain dynamics preferences and introduce a transition matrix to evaluate the channel reliability. Then, we propose the CArank model to tackle the multichannel ranking problem.

Note that we used the term preference intentionally to show that brain dynamics keep changing with regard to human behaviors, and it happens because the human brain prefers one decision over others (Ekman & Davidson, 1994; Izuma & Adolphs, 2013; Franks, 2019). Therefore, we prefer preference to classification. We then refer to the pairwise comparison between brain dynamics as the brain dynamics preference (BDP).

### 3.1  Multichannel Ranking

Our aim is to correctly preserve the whole orderings between the pairwise RT comparisons (see Figure 1b). In particular, the collection of the pairwise RT comparisons $D$, which we call preference propositions, can be constructed as follows:
$$D = \{(T_i, T_j) \mid T_i, T_j \in T,\ i \neq j\}, \tag{3.1}$$
where $T$ is the set of reaction times. Note that the ground truth of each pairwise RT comparison is accessible since RTs are known. Since the connection between RT and BDP is based on human intuition, we call the ground truth of the pairwise RT comparison a preference proposition with regard to BDP.
For brevity, we represent the preference propositions as
$$D = \{\rho_m : (T_{m,1}, T_{m,2})\}_{m=1}^{M}, \tag{3.2}$$
where $M$ denotes the number of preference propositions and $\rho_m\ (\in D)$ denotes the $m$th preference proposition. There are usually two types of preference propositions: (1) $\rho_m \in \{1, -1\}$, in which the orderings between the RTs are significant, that is, $T_{m,1} \geq T_{m,2}$ or $T_{m,1} \leq T_{m,2}$, and (2) $\rho_m = 0$, in which the RTs in each comparison are comparable, that is, $T_{m,1} \approx T_{m,2}$.$^{1}$
Then the BDP can be constructed for each proposition using the corresponding pairwise EEG signals recorded from each channel:
$$\text{preference proposition } \rho_m : (T_{m,1}, T_{m,2}) \iff \text{BDP } (x_{n,m}^{1}, x_{n,m}^{2}), \tag{3.3}$$
where $n = 1, 2, \dots, N$. The BDP $(x_{n,m}^{1}, x_{n,m}^{2})$ denotes the EEG signals recorded within the $n$th channel for each preference proposition $\rho_m$, $\forall m = 1, 2, \dots, M$.
In summary, our problem is formulated as predicting the preference propositions (the ordering of the pairwise RT comparisons) by aggregating the BDPs from multiple channels:
$$f\big(\{x_{n,m}^{1}, x_{n,m}^{2}\}_{n=1}^{N}\big) \longrightarrow \rho_m, \quad \forall m = 1, 2, \dots, M. \tag{3.4}$$
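
As an illustration of how preference propositions might be derived from recorded RTs, the sketch below labels each pair of trials with 1, 0, or -1. The threshold `tie_margin` that decides when two RTs count as comparable is a hypothetical parameter introduced here for illustration; the letter's exact criterion for a tie is not reproduced. For brevity the sketch enumerates unordered pairs, whereas equation 3.1 ranges over all ordered pairs with $i \neq j$.

```python
from itertools import combinations

def build_propositions(rts, tie_margin):
    """Label each unordered RT pair with 1 (win), 0 (tie), or -1 (loss).

    tie_margin is a hypothetical threshold: pairs whose RTs differ by no more
    than this margin are treated as comparable (rho = 0).
    """
    propositions = []
    for i, j in combinations(range(len(rts)), 2):
        diff = rts[i] - rts[j]
        if abs(diff) <= tie_margin:
            rho = 0
        elif diff > 0:
            rho = 1
        else:
            rho = -1
        propositions.append(((i, j), rho))
    return propositions

# Hypothetical RTs in seconds for three trials.
rts = [0.60, 0.62, 1.40]
props = build_propositions(rts, tie_margin=0.05)
print(props)
```

Each labeled pair of trial indices would then be matched with the corresponding pair of EEG segments from every channel to form the BDPs.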

### 3.2  Beyond Ordinal Classification

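For a BDP $(x^{1}, x^{2})$,$^{2}$ the popular Bradley-Terry model, which is based on logistic regression, can be formulated as follows:
$$P(\rho \mid w, x^{1}, x^{2}) = \begin{cases} \sigma(w^\top \Delta x), & \rho = 1, \\ \sigma(-w^\top \Delta x), & \rho = -1, \end{cases} \tag{3.5}$$
where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function, $\sigma(-z) = 1 - \sigma(z)$, and $\Delta x = x^{1} - x^{2}$ denotes the difference between the two elements of the BDP $(x^{1}, x^{2})$.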

However, a preference proposition $\rho$ has three states, $1, 0, -1$, denoting win ($T_1 > T_2$), tie ($T_1 \approx T_2$), and loss ($T_1 < T_2$), respectively. Since binary classification fails to model the state of a tie ($T_1 \approx T_2$), binary classification (e.g., see equation 3.5) is very sensitive to subtle differences in reaction time. Other classification models, such as support vector machines, are also infeasible for our problem because they lack a normalized probability definition for three states. Meanwhile, the softmax function, a straightforward extension of binary classification, models different states equally. It also does not serve as a good candidate since it fails to capture the intrinsic connection between these two types of preference propositions.

Therefore, we define a normalized probability for the three states while considering the two types of preference propositions, first normalizing the probability over states $(1, -1)$ (exclusive to the significant preference proposition) to 1 and then generalizing the probability definition to state 0. This can be mathematically formulated as
$$P(\rho \mid w, x^{1}, x^{2}) = \begin{cases} \sigma(w^\top \Delta x)\,[1 - \kappa(w^\top \Delta x)], & \rho = 1, \\ \kappa(w^\top \Delta x), & \rho = 0, \\ \sigma(-w^\top \Delta x)\,[1 - \kappa(w^\top \Delta x)], & \rho = -1. \end{cases} \tag{3.6}$$
Following Weng and Lin (2011), the probability of a tie is modeled as the geometric mean between a win and a loss:
$$\kappa(w^\top \Delta x) = \sqrt{\sigma(w^\top \Delta x)\,\sigma(-w^\top \Delta x)}. \tag{3.7}$$
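
The three-state probability can be checked numerically. The sketch below takes the scalar score $z = w^\top \Delta x$ as input and reads the "geometric mean" in equation 3.7 as the square root of the product of the win and loss probabilities; the function names and the test scores are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tie_prob(z):
    """Geometric mean of the win and loss probabilities (equation 3.7)."""
    return math.sqrt(sigmoid(z) * sigmoid(-z))

def ternary_probs(z):
    """Return {1: P(win), 0: P(tie), -1: P(loss)} for a score z (equation 3.6)."""
    k = tie_prob(z)
    return {1: sigmoid(z) * (1.0 - k), 0: k, -1: sigmoid(-z) * (1.0 - k)}

# Hypothetical scores: an undecided pair (z = 0) and a clearly ordered pair (z = 4).
probs_at_zero = ternary_probs(0.0)
probs_far = ternary_probs(4.0)

print(probs_at_zero)
print(probs_far)
```

Under this reading, the tie state is most likely when the score is near zero and its probability decays as the score moves away from the decision boundary, while the three probabilities always sum to one.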

Note that we consider the linear mapping $w^\top \Delta x$ here since the EEG data are usually high-dimensional with low sample size.

Remark 2

(Ternary Classification versus Binary Classification). Ternary classification (see equation 3.6) is less sensitive to the subtle difference of reaction time. In terms of binary classification, a subtle discrepancy around the classification boundary would lead to the steepest gradient. However, the tie state (i.e., $ρ=0$), introduced in ternary classification, would flatten the steepest gradient and enhance the model robustness regarding the subtle difference of RT.

Remark 3

(Extension to Deep Models). For the sake of clarity, we elaborate our three-state ordinal classification with a linear formulation (see equation 3.6). In the case of a deep learning model, we can consider either (1) replacing the linear difference $w^\top x^{1} - w^\top x^{2}$ with the difference of the neural network outputs $g(x^{1}) - g(x^{2})$ or (2) replacing the raw feature $x$ in equation 3.6 with the output of the last layer of the encoder. To ensure end-to-end training, we chose the first approach in our experiment.

### 3.3  Channel Reliability

Because different regions in the human brain have different functions, the relative contributions of different channels to human RT may vary considerably. The state of each channel can be classified as informative or noisy according to its contribution with regard to human RT. Note that a channel is called "noisy" if the algorithms cannot extract useful brain information from the EEG signals of this channel (Alharbi, 2018; Lin et al., 2018). Therefore, if we directly model the EEG preferences recorded in each channel without distinguishing the channels by reliability (i.e., informative or noisy), the model's reliability would inevitably degrade.

In the following, a transition matrix $\Pi_n$ is introduced to characterize the reliability of each channel $n$ with regard to the learning task. Let $\rho$ denote the preference proposition and $\rho^{(n)}$ denote the prediction from the $n$th channel. Both $\rho$ and $\rho^{(n)}$ are defined on a finite state space $S = \{1, 0, -1\}$. Then we have
$$\Pi_n = P(\rho \mid \rho^{(n)}) = \begin{bmatrix} \pi_{11}^{n} & \pi_{12}^{n} & \pi_{13}^{n} \\ \pi_{21}^{n} & \pi_{22}^{n} & \pi_{23}^{n} \\ \pi_{31}^{n} & \pi_{32}^{n} & \pi_{33}^{n} \end{bmatrix}, \tag{3.8}$$
where $P_{i,j}(\rho \mid \rho^{(n)}) = P(\rho = S_j \mid \rho^{(n)} = S_i)$. According to the definition of the transition matrix, $\Pi_n$ should satisfy three constraints: (1) each entry of $\Pi_n$ should lie in $[0, 1]$; (2) each row of $\Pi_n$ should sum to 1; and (3) each column of $\Pi_n$ should sum to 1.
However, it is usually costly and redundant to estimate $\Pi_n$ (see equation 3.8) directly. In the following, we consider imposing more constraints on equation 3.8, so as to simplify the inference while enhancing interpretability. First, the transition between states $(1, -1)$ is constrained to be symmetric, since states $(1, -1)$ are exclusive to the preference proposition where the orderings between the RTs are significant, that is, $P(\rho = 1 \mid \rho^{(n)} = -1) = P(\rho = -1 \mid \rho^{(n)} = 1)$. Second, since the equal case between two real values is hard to measure when conducting prediction, the transition from the significant RT pairwise comparisons to comparable RT ones is not considered,$^{3}$ that is, $P(\rho = 0 \mid \rho^{(n)} \in \{1, -1\}) = 0$. Therefore, a simplified transition matrix can be represented as follows:
$$\Pi_n = P(\rho \mid \rho^{(n)}) = \begin{bmatrix} \pi_n & 0 & 1 - \pi_n \\ 0 & 1 & 0 \\ 1 - \pi_n & 0 & \pi_n \end{bmatrix}. \tag{3.9}$$
The parameter $\pi_n$ in the transition matrix $\Pi_n$, equation 3.9, indicates the reliability of the $n$th channel, $\forall n = 1, 2, \dots, N$. It also helps to divide the channels into three states:
1. Positive channels with $\pi_n$ close to 1: The ranking model, equation 3.6, can extract enough information from the $n$th channel and exactly predict the state of the preference proposition.

2. Noisy channels with $\pi_n$ approximating 0.5: The ranking model cannot extract any useful information from the $n$th channel.

3. Negative channels with $\pi_n$ close to 0: The ranking model can extract enough information from the $n$th channel, but the prediction states are exactly opposite the proposition states.

The identified positive and negative channels are all considered informative EEG channels, which helps in learning reliable models for the corresponding task.
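
The simplified transition matrix and the three channel states can be sketched as follows. The function names are hypothetical, and the default band used to call a channel noisy ($0.15 \leq \pi_n \leq 0.85$) is borrowed from the caption of Figure 6 rather than derived here.

```python
def transition_matrix(pi_n):
    """Simplified 3x3 transition matrix over states (1, 0, -1), equation 3.9.

    Row i is the channel's prediction rho^(n); column j is the proposition rho.
    """
    if not 0.0 <= pi_n <= 1.0:
        raise ValueError("pi_n must lie in [0, 1]")
    return [
        [pi_n, 0.0, 1.0 - pi_n],
        [0.0, 1.0, 0.0],
        [1.0 - pi_n, 0.0, pi_n],
    ]

def channel_state(pi_n, noisy_band=(0.15, 0.85)):
    """Classify a channel from its estimated reliability pi_n."""
    low, high = noisy_band
    if pi_n > high:
        return "positive"
    if pi_n < low:
        return "negative"
    return "noisy"

matrix = transition_matrix(0.9)
row_sums = [sum(row) for row in matrix]
col_sums = [sum(matrix[i][j] for i in range(3)) for j in range(3)]

print(row_sums, col_sums)
print([channel_state(p) for p in (0.95, 0.5, 0.05)])
```

Both the rows and the columns sum to one for any $\pi_n \in [0, 1]$, so the three constraints stated for equation 3.8 hold by construction.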

### 3.4  Channel-Reliability Aware Ranking

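With the incorporation of the transition matrix $\Pi_n$, equation 3.9, on top of the three-state learning-to-rank model, equation 3.6, the likelihood function for each preference proposition $\rho$ can be represented as
$$P(\rho \mid w, \Pi_n, x_n^{1}, x_n^{2}) = \mathbb{E}_{\rho^{(n)}}\, P(\rho \mid \rho^{(n)})\, P(\rho^{(n)} \mid w, x_n^{1}, x_n^{2}) = \begin{cases} [\pi_n \sigma(w^\top \Delta x_n) + (1 - \pi_n)\,\sigma(-w^\top \Delta x_n)]\,[1 - \kappa(w^\top \Delta x_n)], & \rho = 1, \\ \kappa(w^\top \Delta x_n), & \rho = 0, \\ [(1 - \pi_n)\,\sigma(w^\top \Delta x_n) + \pi_n \sigma(-w^\top \Delta x_n)]\,[1 - \kappa(w^\top \Delta x_n)], & \rho = -1, \end{cases} \tag{3.10}$$
where the subscripts $m$, indicating the index of the preference proposition, are omitted for simplicity.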
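
The marginalization over the channel prediction $\rho^{(n)}$ can be verified numerically. The sketch below sums $P(\rho \mid \rho^{(n)})\,P(\rho^{(n)} \mid w, x_n^{1}, x_n^{2})$ over the three states and compares the result with the closed form of equation 3.10. It works directly with a scalar score $z = w^\top \Delta x_n$ and reads the tie probability as the square root of the product of win and loss probabilities; the values of `z` and `pi_n` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ternary_probs(z):
    """P(rho^(n) | z) for states (1, 0, -1), following equation 3.6."""
    k = math.sqrt(sigmoid(z) * sigmoid(-z))
    return [sigmoid(z) * (1 - k), k, sigmoid(-z) * (1 - k)]

def transition_matrix(pi_n):
    """Row i: channel prediction rho^(n); column j: proposition rho (equation 3.9)."""
    return [[pi_n, 0.0, 1 - pi_n], [0.0, 1.0, 0.0], [1 - pi_n, 0.0, pi_n]]

def marginal_probs(z, pi_n):
    """Sum over rho^(n) of P(rho | rho^(n)) * P(rho^(n) | z)."""
    base, trans = ternary_probs(z), transition_matrix(pi_n)
    return [sum(trans[i][j] * base[i] for i in range(3)) for j in range(3)]

def closed_form(z, pi_n):
    """The three cases of equation 3.10 for states (1, 0, -1)."""
    s_pos, s_neg = sigmoid(z), sigmoid(-z)
    k = math.sqrt(s_pos * s_neg)
    return [
        (pi_n * s_pos + (1 - pi_n) * s_neg) * (1 - k),
        k,
        ((1 - pi_n) * s_pos + pi_n * s_neg) * (1 - k),
    ]

z, pi_n = 1.3, 0.8  # hypothetical score and channel reliability
summed = marginal_probs(z, pi_n)
direct = closed_form(z, pi_n)
print(summed)
print(direct)
```
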
Let $D$ denote the collection of preference propositions and $X$ represent the recorded EEG signals from $N$ different channels. We further extend equation 3.10 to a Bayesian formulation. A gaussian prior is introduced for $w$ (i.e., $w \sim \mathcal{N}(\mu, \Sigma)$). Since the transition matrix $\Pi_n$ depends only on the parameter $\pi_n$, we focus on estimating the parameter $\pi_n$, $\forall n = 1, 2, \dots, N$, in the following. Let $\pi$ denote $\{\pi_n\}_{n=1}^{N}$, and we introduce a beta prior for each $\pi_n$ (i.e., $\pi \sim \mathcal{B}(\alpha, \beta) = \prod_{n=1}^{N} \mathcal{B}(\alpha_n, \beta_n)$). Then, our CArank model, equation 3.11, for the multichannel ranking problem, equation 3.4, can be represented as
$$P(D, w, \pi \mid X) = P_0(\pi)\, P_0(w)\, P(D \mid w, \pi, X) = \mathcal{B}(\pi \mid \alpha, \beta)\, \mathcal{N}(w \mid \mu, \Sigma) \prod_{m=1}^{M} \prod_{n=1}^{N} P(\rho_m \mid w, \pi_n, \Delta x_{n,m}). \tag{3.11}$$
Here $M = |D|$ denotes the number of preference propositions, the index $n$ iterates over the channels, and the index $m$ iterates over the preference propositions. Due to the symmetry of the state probability, equation 3.6, and the transition matrix, equation 3.9, with regard to states 1 and $-1$, the resulting marginal likelihood, equation 3.10, and the corresponding Bayesian formulation, equation 3.11, remain symmetric with regard to states 1 and $-1$.

Now our aim is to estimate the model parameters ($w$ and $π$) by maximizing equation 3.11. In principle, any solution strategies for MAP estimation can be considered to solve this problem. (See section 4 for optimization details.)

### 3.5  Reliability Analysis and Channel State Estimation

CArank (see equation 3.11) indeed trains a mixture of two complementary classifiers, which share the same parameter $w$. It is different from classical mixture models since it clusters at the channel level instead of the sample level.

In particular, for the positive channels with $\pi_n$ close to 1, CArank relies on the first classifier to update the shared parameter $w$. For the negative channels with $\pi_n$ close to 0, equation 3.11 automatically switches to the opposite classifier, which can extract correct information from the negative channels and update the shared parameter $w$ accordingly. Furthermore, CArank is robust to the noisy channels with $\pi_n$ approximately equal to 0.5, because equation 3.11 gives up extracting information from the noisy channels by assigning a constant likelihood (i.e., 0.5) to each BDP. The estimated $\pi_n$ can be leveraged as an indicator to detect noisy channels with $\pi_n \approx 0.5$, $\forall n = 1, 2, \dots, N$. (See Figure 6 for more details.)
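
These three regimes can be seen directly in the mixture factor $\pi_n \sigma(z) + (1 - \pi_n)\,\sigma(-z)$ that appears in the win case of equation 3.10, with $z = w^\top \Delta x_n$. The sketch below is a minimal illustration with hypothetical scores; `win_factor` is a name introduced here, not one used in the letter.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def win_factor(z, pi_n):
    """Mixture of the two complementary classifiers for the win state."""
    return pi_n * sigmoid(z) + (1.0 - pi_n) * sigmoid(-z)

z = 2.0  # hypothetical score w^T (x1 - x2) for one BDP

# Positive channel (pi_n = 1): the first classifier is used as is.
positive = win_factor(z, 1.0)

# Negative channel (pi_n = 0): equivalent to the first classifier on the flipped score.
negative = win_factor(z, 0.0)
flipped = win_factor(-z, 1.0)

# Noisy channel (pi_n = 0.5): the factor no longer depends on the score.
noisy = [win_factor(s, 0.5) for s in (-3.0, 0.0, 2.5)]

print(positive, negative, flipped, noisy)
```

With $\pi_n = 0.5$ the factor equals 0.5 for every score, so the channel contributes no gradient signal about the ordering through this term, which matches the constant-likelihood behavior described above.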

Figure 4:

Sustained-attention driving task. (A) Different participants are independent during the data collection process. (B) Different EEG sensors used for recording are recorded independently from the scalp without influencing other sensors (Homan, Herman, & Purdy, 1987; Teplan, 2002). (C) Different trials are conducted independently during the data collection process. (D) The collected reaction time is slightly corrupted by inherent (basically irremovable) sources of noise, but the ranking relationships are preserved to some extent.

Figure 5:

In-degree sequence for CArank and other baselines (closer is better). The root-mean-squared error (RMSE) was also measured according to equation 5.3.

Figure 6:

Reliability of different channels for 40 participants estimated by CArank. Each column denotes the states of 33 channels for each participant. The channels with estimated reliability $0.15 \leq \pi_n \leq 0.85$, marked in red, are considered noisy channels.

### 3.6  Superiority of CArank over Previous Methods

CArank is superior in two ways: (1) using ordinal regression instead of regression makes it less sensitive to the scale of RTs, and (2) the data-driven noisy channel detection ensures that mental fatigue monitoring relies on informative channels only.

In terms of the overfitting caused by extreme values, Wei et al. (2015) adopted different techniques to smooth RTs but failed to make them work for all participants. Moreover, predefined smoothing techniques excessively down-weight the extreme or abnormal RTs in the MSE loss and therefore fail to reveal the real relationship between the EEG signals and RTs, especially for the extreme or abnormal RTs. In terms of the lower reliability caused by heterogeneous channels, Wascher et al. (2014) heuristically used frontal theta to represent different levels of mental fatigue, but the relevant regions of the brain are not necessarily the same for all participants (Gramann et al., 2006).

Different from existing work, which heavily relies on various heuristic tricks, CArank is the first purely data-driven approach to predict mental fatigue and therefore offers high versatility. Specifically, it first formulates the mental fatigue monitoring task as a multichannel ranking problem. Next, it evaluates the channel reliability of each EEG channel via a transition matrix. CArank therefore performs reliable mental fatigue prediction using informative channels only.

## 4  Stochastic Generalized Expectation-Maximization

In this section, we describe a generalized expectation-maximization (GEM) algorithm (Dempster, Laird, & Rubin, 1977) to solve the proposed CArank, equation 3.11. Since the feasible region of $\pi_n$ is restricted to $[0,1]$, gradient-based optimization methods would make our solution inaccurate and inefficient. The GEM algorithm is an efficient iterative procedure to compute the MAP solution in the presence of latent variables ($\rho_m^{(n)}$ in equation 3.11). GEM avoids directly calculating the derivative of the expectation over the latent variables and instead optimizes a surrogate lower bound. Therefore, GEM, a standard tool for MAP estimation with latent variables, can significantly simplify the optimization over parameter $\pi_n$ for equation 3.11.

### 4.1  GEM for CArank

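For each preference proposition $\rho_m$, we introduce an auxiliary variable $\delta_m^{(n)}\in\{1,0\}$ for the $n$th channel, representing the consistency between the preference proposition $\rho_m$ and the prediction $\rho_m^{(n)}$ given by the $n$th channel. Specifically, $\delta_m^{(n)}=1$ denotes that the prediction $\rho_m^{(n)}$ given by the first classifier is consistent with the preference proposition $\rho_m$, and $\delta_m^{(n)}=0$ denotes that the prediction $\rho_m^{(n)}$ estimated by the second classifier is consistent with the preference proposition $\rho_m$. We can therefore find an equivalent formulation of equation 3.10 for each preference proposition $\rho_m$ involving the auxiliary variable $\Xi_m=\{\delta_m^{(n)}\}_{n=1}^{N}$:
$P(\rho_m,\Xi_m\mid\pi,w,X)=\prod_{n=1}^{N}P(\rho_m,\delta_m^{(n)}\mid\pi_n,w,\Delta x_{n,m})=\begin{cases}\prod_{n=1}^{N}\big[\pi_n\sigma(w^{T}\Delta x_{n,m})\big]^{\delta_m^{(n)}}\big[(1-\pi_n)\sigma(-w^{T}\Delta x_{n,m})\big]^{1-\delta_m^{(n)}}\big[1-\kappa(w^{T}\Delta x_{n,m})\big], & \rho_m=1,\\ \prod_{n=1}^{N}\kappa(w^{T}\Delta x_{n,m}), & \rho_m=0,\\ \prod_{n=1}^{N}\big[(1-\pi_n)\sigma(w^{T}\Delta x_{n,m})\big]^{\delta_m^{(n)}}\big[\pi_n\sigma(-w^{T}\Delta x_{n,m})\big]^{1-\delta_m^{(n)}}\big[1-\kappa(w^{T}\Delta x_{n,m})\big], & \rho_m=-1.\end{cases}$
(4.1)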
This shows that we can deal with the joint distribution directly, which leads to significant simplifications for optimization. The complete log likelihood of CArank, equation 3.11, can be written as
$\log P(D,\Xi,w,\pi\mid X)=\log P_0(\pi)+\log P_0(w)+\sum_{m=1}^{M}\sum_{n=1}^{N}\log P(\rho_m,\delta_m^{(n)}\mid w,\pi_n,\Delta x_{n,m}).$
(4.2)
In the expectation step, we first calculate the expected value of the auxiliary variable $δm(n)$ with regard to its posterior distribution $P(δm(n)|π,w,ρm,xn,m)∀n=1,2,…,N,∀m=1,2,…,M$:
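$E[\delta_m^{(n)}]=\dfrac{P(\rho_m,\delta_m^{(n)}=1\mid w,\pi_n,\Delta x_{n,m})}{P(\rho_m\mid w,\pi_n,\Delta x_{n,m})}=\begin{cases}\left[1+\dfrac{(1-\pi_n)\sigma(-w^{T}\Delta x_{n,m})}{\pi_n\sigma(w^{T}\Delta x_{n,m})}\right]^{-1}, & \rho_m=1,\\ 1, & \rho_m=0,\\ \left[1+\dfrac{\pi_n\sigma(-w^{T}\Delta x_{n,m})}{(1-\pi_n)\sigma(w^{T}\Delta x_{n,m})}\right]^{-1}, & \rho_m=-1,\end{cases}$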
(4.3)
where $E[δm(n)]$ denotes the degree of the consistency between the prediction $ρm(n)$ and the preference proposition $ρm$. Then the expectation of equation 3.11 with regard to the posterior distribution $P(δm(n)|π,w,ρm,xn,m)∀n=1,2,…,N,∀m=1,2,…,M$ can be represented as
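$L(w,\pi)=E[\log P(D,\Xi,w,\pi\mid X)]=\sum_{n=1}^{N}\big[(\alpha_n-1)\log\pi_n+(\beta_n-1)\log(1-\pi_n)\big]-\tfrac{1}{2}(w-\mu)^{T}\Sigma^{-1}(w-\mu)+\sum_{m=1}^{M}\sum_{n=1}^{N}\Big[I(\rho_m=0)\log\kappa(w^{T}\Delta x_{n,m})+I(\rho_m\neq 0)\log\big[1-\kappa(w^{T}\Delta x_{n,m})\big]+I(\rho_m=1)\Big(E[\delta_m^{(n)}]\log\big[\pi_n\sigma(w^{T}\Delta x_{n,m})\big]+(1-E[\delta_m^{(n)}])\log\big[(1-\pi_n)\sigma(-w^{T}\Delta x_{n,m})\big]\Big)+I(\rho_m=-1)\Big(E[\delta_m^{(n)}]\log\big[(1-\pi_n)\sigma(w^{T}\Delta x_{n,m})\big]+(1-E[\delta_m^{(n)}])\log\big[\pi_n\sigma(-w^{T}\Delta x_{n,m})\big]\Big)\Big],$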
(4.4)
where $I(*)$ is the indicator function that equals one if the condition is true and zero otherwise.
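The expectation step of equation 4.3 can be sketched for a single channel and a single preference proposition as follows. The scalar `a` stands for $w^{T}\Delta x_{n,m}$; the function names are illustrative and not part of the original implementation.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-a))


def expected_delta(rho: int, pi_n: float, a: float) -> float:
    """Posterior mean of the auxiliary variable, following equation 4.3.

    rho is the preference label in {-1, 0, 1}, pi_n the channel
    reliability in (0, 1), and a the score w^T * delta_x.
    """
    if rho == 0:
        return 1.0
    if rho == 1:
        first = pi_n * sigmoid(a)
        second = (1.0 - pi_n) * sigmoid(-a)
    elif rho == -1:
        first = (1.0 - pi_n) * sigmoid(a)
        second = pi_n * sigmoid(-a)
    else:
        raise ValueError("rho must be -1, 0, or 1")
    # first / (first + second) equals [1 + second / first]^(-1).
    return first / (first + second)
```

Writing the ratio as `first / (first + second)` avoids a division by zero when the first term vanishes and is algebraically identical to the bracketed form in equation 4.3.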
In the generalized maximization step, we increase the objective function, equation 4.4 with regard to the model parameters $π$ and $w$, respectively. In terms of $π$, we set the gradient of equation 4.4 with regard to $πn$ to zero and obtain the following estimate for $πn$:
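$\pi_n^{\text{new}}=\dfrac{\sum_{m=1}^{M}\big[I(\rho_m=1)E[\delta_m^{(n)}]+I(\rho_m=-1)(1-E[\delta_m^{(n)}])\big]+\alpha_n-1}{\sum_{m=1}^{M}\big[I(\rho_m=1)+I(\rho_m=-1)\big]+\alpha_n+\beta_n-2},$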
(4.5)
where $n=1,2,…,N$.
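A minimal sketch of the closed-form update in equation 4.5 for one channel is given below. It assumes the expectations $E[\delta_m^{(n)}]$ have already been computed in the E-step and are passed in as a list; the function name and argument layout are illustrative.

```python
def update_pi(rhos, e_delta, alpha_n, beta_n):
    """MAP update of the reliability pi_n for one channel (equation 4.5).

    rhos:     preference labels in {-1, 0, 1}, one per proposition
    e_delta:  matching E[delta] values from the E-step
    alpha_n, beta_n: Beta prior hyperparameters for this channel
    """
    numerator = alpha_n - 1.0
    denominator = alpha_n + beta_n - 2.0
    for rho, e in zip(rhos, e_delta):
        if rho == 1:
            numerator += e
            denominator += 1.0
        elif rho == -1:
            numerator += 1.0 - e
            denominator += 1.0
        # Ties (rho == 0) carry no information about pi_n.
    return numerator / denominator


# Three strict pairs that all support a positive channel, plus one tie.
print(update_pi([1, 1, -1, 0], [1.0, 1.0, 0.0, 1.0], 100.0, 100.0))
```

With a flat prior ($\alpha_n=\beta_n=1$) the same data give $\pi_n=1$, whereas $\alpha_n=\beta_n=100$ keeps the estimate close to 0.5 until many strict pairs have been observed.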
In terms of $w$, due to the complexity of the sigmoid function, we cannot have a closed-form solution for $w$ and need to use gradient-based methods to optimize equation 4.4 with regard to $w$. In particular, the gradient function $g(w)$ can be represented as follows:
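$g(w)=-\Sigma^{-1}(w-\mu)+\sum_{m=1}^{M}\sum_{n=1}^{N}\left[\left(I(\rho_m=0)+\dfrac{I(\rho_m\neq 0)}{1-[\kappa(w^{T}\Delta x_{n,m})]^{-1}}\right)\dfrac{1-2\sigma(w^{T}\Delta x_{n,m})}{2}+I(\rho_m\neq 0)\big(E[\delta_m^{(n)}]-\sigma(w^{T}\Delta x_{n,m})\big)\right]\Delta x_{n,m}.$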
(4.6)
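The last term of equation 4.6 can be checked directly. Write $a=w^{T}\Delta x_{n,m}$ and $E=E[\delta_m^{(n)}]$ (shorthand introduced here only for readability), hold $E$ fixed as computed in the E-step, and use $\frac{d}{da}\log\sigma(a)=1-\sigma(a)$ and $\frac{d}{da}\log\sigma(-a)=-\sigma(a)$. The $w$-dependent part of either strict-preference case in equation 4.4 then differentiates as

$\dfrac{\partial}{\partial w}\big[E\log\sigma(a)+(1-E)\log\sigma(-a)\big]=\big[E(1-\sigma(a))-(1-E)\sigma(a)\big]\Delta x_{n,m}=\big(E-\sigma(a)\big)\Delta x_{n,m}.$

The $\log\pi_n$ and $\log(1-\pi_n)$ terms do not depend on $w$ and drop out, which is why the cases $\rho_m=1$ and $\rho_m=-1$ share the single term weighted by $I(\rho_m\neq 0)$.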
Regarding the linear rank mapping, we adopt the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) (Byrd, Lu, Nocedal, & Zhu, 1995) to optimize $w$. $wnew$ can be obtained with L-BFGS using $L(w)$ and $g(w)$:
$w^{\text{new}}=\text{L-BFGS}\big(L(w),g(w),D\big).$
(4.7)
The GEM algorithm (see algorithm 1) then iterates the E-step and the generalized M-step until convergence is achieved.
Remark 4

(Extension to Deep Models). The L-BFGS optimization method used in equation 4.7 aims to find the optimum $w$. It is easy to find its alternatives in deep learning literature, such as vanilla stochastic gradient descent (SGD) and its various variants (Kasai, 2017), if we replace the raw EEG feature $x$ with neural embedding.

Remark 5

(Computational Efficiency of GEM and Stochastic GEM for CArank). According to algorithm 1, the optimization of CArank involves $T$ iterations between the E-step with regard to $E[\delta_m^{(n)}]$ and the M-step with regard to $\pi_n$ and $w$. Note that the computation cost of calculating $E[\delta_m^{(n)}]$ and $\pi_n$ is marginal compared to that of optimizing $w$. Thus, the computation cost for CArank at each iteration is dominated by the subclassification problem, that is, optimizing $w$. Accordingly, the computation cost of GEM for CArank is about $T$ times that of optimizing a regular classification problem. Note that only a few iterations ($T<10$) are required for GEM to converge. This analysis also applies to stochastic GEM, where we update the model parameters with minibatch samples.

Note that this computation cost applies to the training stage only, while the computation costs of all methods for the test stage are similar. CArank requires the least storage during the test stage, since we can safely discard the EEG signals from noisy channels after the training stage. This does not sacrifice model performance, since CArank declines to extract information from noisy channels for decision making.

### 4.2  Stochastic GEM for CArank

The GEM approach introduced in section 4.1 is inefficient for large-scale data sets, because we need to iteratively calculate the gradient with regard to parameters $\pi$ and $w$ over all samples during each generalized maximization step. Motivated by the stochastic approximation literature (Roche, 2011), we introduce a stochastic generalized expectation-maximization (SGEM) approach, which resorts to stochastic minibatch optimization to learn the parameters. To be specific, SGEM approximates the updated $\pi$ and $w$ in batch EM with a single sample or minibatch samples. Since minibatch samples cannot be a perfect approximation to the whole data set, we interpolate between the new and former estimators with a decreasing step size $\eta_k$, as in Liang and Klein (2009).

#### 4.2.1  Sampling Step

Before the $t$th iteration, we randomly sample a minibatch $Dt$ from $D$. The number of preference propositions in $Dt$, denoted by $Mt$, is much smaller than the corresponding total data set size $M$.
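The sampling step and the decreasing step size can be sketched as follows. The schedule $\eta_k=(k+2)^{-\alpha}$ with $0.5<\alpha\le 1$ is a common choice in the online EM literature and is used here purely as an assumption, since the exact schedule is not stated in this section; the function names and the default $\alpha=0.7$ are likewise illustrative.

```python
import random


def step_size(k: int, alpha: float = 0.7) -> float:
    """Decreasing step size eta_k = (k + 2) ** (-alpha) (assumed schedule)."""
    return (k + 2) ** (-alpha)


def sample_minibatch(num_pairs: int, batch_size: int, rng: random.Random):
    """Draw the indices of M_t preference propositions without replacement."""
    return rng.sample(range(num_pairs), min(batch_size, num_pairs))


rng = random.Random(0)                       # fixed seed for repeatability
batch = sample_minibatch(10_000, 256, rng)   # M = 10000, M_t = 256
print(len(batch), step_size(0), step_size(10))
```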

#### 4.2.2  Expectation Step

The expectation step remains similar. The only difference is that we need to calculate the posterior expectation of the auxiliary variable $δm(n)$ over the mini-batch $Dt$.

#### 4.2.3  Generalized Maximization Step

In the generalized maximization step, we increase the objective function, calculated on the minibatch $Dt$, with regard to model parameters $π$ and $w$. In terms of parameter $πn$, since its marginal distribution belongs to the exponential family, we perform the stochastic update in the space of sufficient statistics (Cappé & Moulines, 2009). Let $φ˜n$ denote the noisy estimate of the sufficient statistic for $πn$:
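$\tilde{\varphi}_n=\dfrac{M}{M_t}\sum_{m\in D_t}\big[I(\rho_m=1)E[\delta_m^{(n)}]+I(\rho_m=-1)(1-E[\delta_m^{(n)}])\big],$
(4.8a)
$\varphi_n^{t}=(1-\eta_k)\varphi_n^{t-1}+\eta_k\tilde{\varphi}_n,$
(4.8b)
$\pi_n^{\text{new}}=\dfrac{\varphi_n^{t}+\alpha_n-1}{\sum_{m\in D_t}\big[I(\rho_m=1)+I(\rho_m=-1)\big]+\alpha_n+\beta_n-2},\quad n=1,2,\ldots,N.$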
(4.8c)
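The first two updates, equations 4.8a and 4.8b, can be sketched for a single channel as below. The sketch stops before the normalization in equation 4.8c; the function names are illustrative, and the expectations are assumed to come from the minibatch E-step.

```python
def noisy_statistic(rhos, e_delta, total_pairs):
    """Equation 4.8a: minibatch sufficient statistic rescaled by M / M_t.

    rhos and e_delta hold the labels and E[delta] values of the minibatch;
    total_pairs is M, the size of the full preference set.
    """
    m_t = len(rhos)
    partial = 0.0
    for rho, e in zip(rhos, e_delta):
        if rho == 1:
            partial += e
        elif rho == -1:
            partial += 1.0 - e
    return total_pairs / m_t * partial


def interpolate_statistic(phi_prev, phi_tilde, eta):
    """Equation 4.8b: blend the previous statistic with the noisy estimate."""
    return (1.0 - eta) * phi_prev + eta * phi_tilde


phi_tilde = noisy_statistic([1, -1, 0, 1], [0.8, 0.3, 1.0, 0.6], total_pairs=100)
print(phi_tilde, interpolate_statistic(10.0, phi_tilde, eta=0.2))
```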
In terms of parameter $w$, the above practice is infeasible due to its nonexponential marginal distribution. Inspired by the stochastic gradient EM algorithms in Cappé and Moulines (2009), we perform the stochastic update in the original space. First, a local optimal regression weight $wt$ can be obtained via iterative optimization over the minibatch $Dt$ using L-BFGS. Then we interpolate between a local optimum and the former estimations to form a global approximation with regard to the parameter $w$:
$w^{t}=\text{L-BFGS}\big(L(w),g(w),D_t\big),$
(4.9a)
$w^{\text{new}}=(1-\eta_k)w^{\text{old}}+\eta_k w^{t}.$
(4.9b)
Remark 6

(Convergence Analysis). The convergence issues of the proposed stochastic GEM algorithm are analogous to the discussion given by Cappé and Moulines (2009) for their stochastic gradient EM algorithms. The existence of such links is hardly surprising. In view of the discussions in section 3 of Cappé and Moulines (2009), the online update rule, equation 4.9b, could also be seen as a stochastic gradient recursion formula, namely, $w^{\text{new}}=w^{\text{old}}+\eta_k(w^{t}-w^{\text{old}})$.
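The equivalence between the interpolation in equation 4.9b and the recursion form quoted in this remark is easy to confirm numerically. In the sketch below the weight vectors are plain Python lists and the function names are illustrative.

```python
def interpolate_weights(w_old, w_t, eta):
    """Equation 4.9b: convex combination of old and minibatch-optimal weights."""
    return [(1.0 - eta) * o + eta * t for o, t in zip(w_old, w_t)]


def recursion_update(w_old, w_t, eta):
    """Same update written as a stochastic-gradient-style recursion."""
    return [o + eta * (t - o) for o, t in zip(w_old, w_t)]


w_old, w_t = [1.0, -2.0], [3.0, 0.0]
print(interpolate_weights(w_old, w_t, 0.25))
print(recursion_update(w_old, w_t, 0.25))
```

Both calls return the same vector up to floating-point rounding, with $w^{t}-w^{\text{old}}$ playing the role of the search direction.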

## 5  Empirical Analysis

In this section, we demonstrate the reliability of the proposed CArank, equation 3.11, with EEG signals from 40 participants.

We used the 33-channel EEG data recorded in Huang et al. (2015) from 40 adult participants while they performed a long, sustained attention task. These data contain one intrinsic non-EEG channel, the 33rd channel, which contains the information about only one axis in the direction of deviation. The experiment was conducted using a virtual-reality dynamic driving simulator (see Figures 4D and 4E). The task involved driving on a four-lane highway, where each lane-departure event was a randomly induced deviation toward the side of the road from the original position. Each participant was instructed to respond quickly and steer back to the original position. A complete trial in this study (see Figure 4A) includes a 10 s baseline, deviation onset, response onset, and response offset (see Figures 4B and 4C). The next trial occurs within an interval of 5 s to 10 s after the current trial finishes. Each participant completed $T$ trials within 1.5 h. For each trial $i$, the EEG signals $\{x_{n,i}\}_{n=1}^{N}$ from $N$ different channels were recorded simultaneously, and the corresponding reaction time $RT_i$ was also collected afterward. If a participant fell asleep during the experiment, there was no feedback to wake him or her up. The NuAmps amplifier (Compumedics Limited, Australia) was used to collect EEG data with a maximum sampling rate of 1000 Hz, 200 Hz bandwidth (DC), and 22-bit resolution.

In this letter, the 10 s baseline (see Figure 4B) is adopted as the feature vector; it is assumed to be long enough to detect any significant changes in brain activity (Zhang, 2000). We then explore the relationship between the 10 s baseline $x\in\mathbb{R}^{k}$ and the preference proposition $\rho_m$ under the four assumptions (A) to (D) listed in the caption of Figure 4.

#### 5.1.1  Data Preprocessing

Brain dynamics preferences for each participant were generated as follows: the trials of each participant were randomly divided into two parts, $50\%$ for training and $50\%$ for testing, and the EEG preferences were constructed according to the pairwise comparisons between the RTs. To be specific, two types of RT comparisons could be constructed: (1) significant RT pairwise comparisons $(T_{m,1},T_{m,2})$, where $T_{m,1}\gg T_{m,2}$ or $T_{m,2}\gg T_{m,1}$, and (2) comparable RT pairwise comparisons $(T_{m,1},T_{m,2})$, where $T_{m,1}\approx T_{m,2}$. Considering the time delay among the channels in the time domain, the Fourier transform was applied to the EEG signals to move from the time domain to the frequency domain. The fast Fourier transform (FFT) was applied using the Welch method (Welch, 1967) with a window size of 128 (so that each spectral decomposition covers 0.5 seconds) and a pad ratio of 2 without any overlap, which yields twice as many output features as the sampling rate. Further, to limit computational overhead, only EEG power within 0.5 Hz to 30 Hz was selected, which is considered to be the most relevant to the RTs (Huang et al., 2015). Other feature transformations can also be adopted for feature extraction if necessary. (See Hammon and de Sa, 2007, for an example of other features typically used for EEG.)
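The construction of pairwise comparisons from RTs can be sketched as follows. Two points are assumptions rather than details taken from the text: the criterion for a "significant" difference is modeled as an absolute gap larger than a hypothetical `margin` (in seconds), and $\rho_m=1$ is taken to mean that the first trial of the pair has the larger RT, which matches how equation 5.2 credits in-degrees.

```python
from itertools import combinations


def build_preferences(rts, margin):
    """Turn a list of reaction times into labeled pairwise comparisons.

    Returns tuples (i, j, rho) with rho = 1 if RT_i is much larger than
    RT_j, rho = -1 if it is much smaller, and rho = 0 if the two RTs are
    comparable. `margin` is a hypothetical significance threshold.
    """
    preferences = []
    for i, j in combinations(range(len(rts)), 2):
        gap = rts[i] - rts[j]
        if gap > margin:
            rho = 1
        elif gap < -margin:
            rho = -1
        else:
            rho = 0
        preferences.append((i, j, rho))
    return preferences


print(build_preferences([0.6, 2.5, 0.7], margin=0.5))
```

The feature differences $\Delta x_{n,m}$ would then be formed per channel from the spectral features of trials $i$ and $j$; that step is omitted here.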

#### 5.1.2  Baselines

First, we considered two popular linear methods: support vector regression (SVR) (Chang & Lin, 2011) and linear regression (LR), with the features from the multiple channels simply concatenated into a long feature vector. Then we compared CArank with widely adopted nonlinear regression and classification methods under the multiple channel concatenation formulation and the multiple channel aggregation formulation, respectively. In particular, we considered two regression models (Lin et al., 2014; Hajinoroozi, Mao, Jung, Lin, & Huang, 2016): (1) regression (C), where the EEG signals from multiple channels are simply concatenated into a long feature vector and the corresponding regression model is trained using this feature vector, and (2) regression (A), where the EEG signals from multiple channels are considered independently and the regression results are aggregated using majority voting afterward. Two ordinal classification models (Zarei, 2017; Zeng et al., 2018) are also considered: (1) classification (C), where the EEG signals from multiple channels are simply concatenated into a long feature vector and the corresponding classification model is trained using this feature vector, and (2) classification (A), where the EEG signals from multiple channels are considered independently and the classification results are aggregated using majority voting afterward.

#### 5.1.3  Metrics

First, we aggregate the predictions from different channels using a simple voting scheme,
$\hat{\rho}_m=\operatorname{sign}\Big(\sum_{n=1}^{N}\rho_m^{(n)}\big[I(\pi_n>\kappa)-I(\pi_n<1-\kappa)\big]\Big),$
where $ρm(n)$ denotes the predicted state (1 means win and −1 means loss) for the pairwise RT comparison ($Tm,1,Tm,2$) by the $n$th channel, using the brain dynamics preference ($xn,m1,xn,m2$). $ρ^m$ is the final estimated order for ($Tm,1,Tm,2$) by aggregating the predictions $ρm(n)$ over all channels. $I(*)$ is an indicator that returns one if the argument is valid and zero otherwise.
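The voting scheme can be sketched as below: positive channels vote with their prediction, negative channels vote with the opposite sign, and channels in between are ignored. The function name is illustrative, and the default threshold of 0.85 is the value reported for the experiments.

```python
def aggregate_vote(channel_preds, pis, kappa=0.85):
    """Combine per-channel predictions in {-1, 1} into one estimated order.

    channel_preds: predicted states, one per channel
    pis:           estimated reliabilities pi_n, one per channel
    Returns 1, -1, or 0 (no majority among the trusted channels).
    """
    total = 0
    for pred, pi_n in zip(channel_preds, pis):
        if pi_n > kappa:
            total += pred          # trusted positive channel
        elif pi_n < 1.0 - kappa:
            total -= pred          # trusted negative channel, flipped vote
    return (total > 0) - (total < 0)


# One positive channel, one noisy channel, and one negative channel.
print(aggregate_vote([1, 1, -1], [0.9, 0.5, 0.05]))
```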
Then we introduce two metrics to measure the performance of the CArank model from different perspectives. First, we adapted the Wilcoxon-Mann-Whitney statistics (Yan, Dodier, Mozer, & Wolniewicz, 2003) to evaluate the accuracy (in $\%$, higher is better) over all pairs:
$\mathrm{Acc}=\dfrac{1}{\bar{M}}\sum_{m=1}^{M}I(\rho_m=\hat{\rho}_m),\quad \bar{M}=\sum_{m=1}^{M}I(\rho_m\neq 0).$
(5.1)
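A minimal sketch of the accuracy in equation 5.1 follows. In line with the evaluation protocol of this section, it assumes that only pairs with a significant ordering ($\rho_m\neq 0$) are scored; the function name is illustrative.

```python
def pairwise_accuracy(true_rhos, pred_rhos):
    """Fraction of strict pairs whose predicted order matches the truth."""
    strict = [(t, p) for t, p in zip(true_rhos, pred_rhos) if t != 0]
    if not strict:
        raise ValueError("no pairs with a significant ordering")
    return sum(t == p for t, p in strict) / len(strict)


# Two of the three strict pairs are ordered correctly; the tie is skipped.
print(pairwise_accuracy([1, -1, 0, 1], [1, 1, 0, 1]))
```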
Further, we investigate the reliability of CArank in terms of preserving the global ordering with regard to RTs. Note that a totally ordered set could be equally represented by a fully directed graph, where the graph can be further encoded by its degree sequence. We only consider the in-degree sequence because the in-degree and out-degree of a vertex can be uniquely determined when the overall degree of each vertex is fixed. The in-degree of vertex $vi$ can be calculated as
$\widehat{\mathrm{Indeg}}(v_i)=\sum_{m\in N_1(v_i)}I(\hat{\rho}_m=1)+\sum_{m\in N_2(v_i)}I(\hat{\rho}_m=-1)+\sum_{m\in N_1(v_i)\cup N_2(v_i)}0.5\times I(\hat{\rho}_m=0),$
(5.2)
where $N1(vi),N2(vi)$ denote the index set of the pairwise comparisons with the RT of trial $i$ (vertex $vi$) appearing in the first and second positions, respectively. Further, we collected the in-degree sequences (Becirovic, 2017) of the constructed directed graph using the predicted RTs. Then the discrepancy between the predicted in-degree sequences and ground truth can be measured using the root-mean-squared errors (smaller is better), namely,
$\mathrm{RMSE}=\sqrt{\dfrac{1}{T}\sum_{i=1}^{T}\big[\mathrm{Indeg}(v_i)-\widehat{\mathrm{Indeg}}(v_i)\big]^{2}},$
(5.3)
where $T$ denotes the number of trials for each participant. $Indeg(vi)$ is the ground truth in-degree of vertex $vi$ while $Indeg^(vi)$ is the predicted in-degree of vertex $vi$.
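Equations 5.2 and 5.3 can be sketched together as follows. Each comparison is represented as a tuple `(i, j, rho_hat)` with trial `i` in the first position and trial `j` in the second; this data layout and the function names are assumptions made for illustration.

```python
import math


def in_degrees(pairs, num_trials):
    """In-degree of every trial vertex, following equation 5.2."""
    degrees = [0.0] * num_trials
    for i, j, rho_hat in pairs:
        if rho_hat == 1:
            degrees[i] += 1.0      # first-position trial wins
        elif rho_hat == -1:
            degrees[j] += 1.0      # second-position trial wins
        else:
            degrees[i] += 0.5      # a tie is split between both trials
            degrees[j] += 0.5
    return degrees


def rmse(true_degrees, pred_degrees):
    """Root-mean-squared error between two in-degree sequences (equation 5.3)."""
    squared = sum((t - p) ** 2 for t, p in zip(true_degrees, pred_degrees))
    return math.sqrt(squared / len(true_degrees))


predicted = in_degrees([(0, 1, -1), (0, 2, 0), (1, 2, 1)], num_trials=3)
print(predicted, rmse([0.5, 2.0, 0.5], predicted))
```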

Note that we only trust the predictions from informative channels with reliability $\pi_n>\kappa$ or $\pi_n<1-\kappa$; $\kappa$ is set to 0.85 for all participants in our experiment. SVR, LR, and regression (C) each reduce to a simple regression problem, and classification (C) to a simple classification problem, since the EEG signals from multiple channels are simply concatenated into a long vector. In terms of regression (A)/classification (A), considering the high-dimensional features with low sample size, we train a nonlinear model shared by all channels and aggregate the results from different channels to calculate the final predictions using the majority voting scheme. Since there is no mechanism for SVR, LR, regression (C/A), or classification (C/A) to evaluate the channel state, we trust all the channels by default. Further, we calculate the two metrics only on the preference propositions wherein the orderings between the RT pair are significant, since it is hard to evaluate when the orderings between the RT pair are comparable.

#### 5.1.4  Parameter Initialization

We implemented SVR using Libsvm with the parameters set to -s 3 -t 0 (epsilon-SVR with a linear kernel). Other methods are implemented in PyTorch (Paszke et al., 2017). We train a one-layer neural network for LR. For the sake of a fair comparison, we implemented a two-layer neural network for all nonlinear methods. In particular, we set network dimensions to d-100-1, where d is the input feature dimension, which varies between different baselines. All layers are densely (fully) connected. In terms of the channel reliability $\pi_n$, we aimed to eliminate the effects of noisy channels during the training process and therefore initialized the channel reliability $\pi_n$ to 0.5, $\forall n=1,2,\ldots,N$. The L2 norm is used, which is equivalent to adopting the standard gaussian distribution for $w$: $w\sim N(0,1)$. In terms of the hyperparameters $(\alpha_n,\beta_n)$, as we intended to eliminate the effects of noisy channels, we adopted a strong noninformative prior for $\pi_n$: $\alpha_n=\beta_n=100$, $\forall n=1,2,\ldots,N$, according to Bishop (2006). The Adam method is used to optimize the weight $w$. In terms of the maximum iteration number for our CArank, we set $MaxIter=7$ in our experiment to ensure the algorithm converged for each participant. The minibatch size is set to 256, and the learning rate is 0.001. In terms of LR and regression (C/A), the common mean square error (MSE) is adopted as the loss function. In terms of classification (C/A), the negative log-likelihood (see equation 4.4) is adopted as the loss function, except that $\pi_n$ is fixed to 1, $\forall n=1,2,\ldots,N$.

### 5.2  Empirical Results of CArank on Brain Dynamics Preferences

In this section, we compare the performance of our CArank and other baselines based on the two metrics, equations 5.1 and 5.3.

#### 5.2.1  Comparison Based on Wilcoxon-Mann-Whitney Statistics

The Wilcoxon-Mann-Whitney statistics of all methods on the test BDPs are presented in Table 1. In terms of SVR, LR, and regression (C), the Wilcoxon-Mann-Whitney statistics of the predicted RTs is calculated with regard to the ground truth on the test BDPs. In terms of regression (A), we first collected the predicted RTs on the test BDPs by aggregating the prediction from each channel using majority voting. Then we calculated the Wilcoxon-Mann-Whitney statistics of the predicted RTs following equation 5.1.

From Table 1, we offer the following observations:

1. $\text{CArank}>\text{other baselines}$. CArank exhibits consistent improvements over other baselines. In particular, it achieves the highest test accuracy on 30 participants and comparable results on the rest of the participants. This is consistent with our motivation that classification, serving as a relaxed alternative to regression, can effectively circumvent the overfitting caused by nonsmooth or extreme RTs and preserve the ordering with regard to RTs. Meanwhile, our channel-reliability-aware formulation can also automatically eliminate the effects of the EEG signals from noisy channels during the training process, in contrast to simple concatenation.

2. $\text{Classification}>\text{SVR}>\text{Regression}$. The test accuracy of classification-based methods for most participants is higher than that of their regression-based counterparts; namely, classification (C) outperforms SVR and regression (C) on 24 and 33 participants, respectively, and classification (A) outperforms regression (A) on 26 participants. This observation is consistent with our statement that regression-based models are prone to overfitting, especially when extreme values (RTs in our problem) exist.

3. $\text{Concatenation}>\text{Aggregation}$. It is interesting to note that the test accuracy of the methods based on multiple channel aggregation is significantly inferior to that of their counterparts based on simple feature concatenation. Specifically, regression (C) outperforms regression (A) on 33 participants, while classification (C) outperforms classification (A) on 38 participants. This is striking but reasonable. Since a shared regression/classification model is trained in the case of the multiple channel aggregation formulation, the generalization performance inevitably degrades when learning with noisy channels. Meanwhile, the noisy channels universally exist, and at least one noisy channel is detected for each participant according to Figure 6.

4. $\text{SVR}>\text{Nonlinear Regression}>\text{Linear Regression}$. Note that linear SVR shows superior performance to nonlinear regression (C) and LR on 26 and 32 participants, respectively. Since the input of SVR, LR, and regression (C) is the same, the only difference lies in the choice of the loss function. SVR adopts hinge loss, which is robust to outliers away from the boundary (Basak, Pal, & Patranabis, 2007). This is consistent with our analysis about the deficiency of the MSE loss used in the regression model (see section 2.1). Meanwhile, the performance of SVR is not stable, and it may achieve worse results on some participants (e.g., P10, P18, P31, P39, P40). Therefore, the hinge loss is also not the best choice compared to the classification setting, where our CArank can universally achieve accuracy above $75\%$ for the corresponding participants.

#### 5.2.2  Comparison Based on In-Degree Preservation

To further investigate the reliability of CArank in terms of preserving the global ordering corresponding to RTs, we first collected the in-degree sequences according to equation 5.2 using the predicted RTs and then measured the in-degree discrepancy between the calculated in-degree sequences and the ground truth using the root-mean-squared error, equation 5.3. The RMSEs for all participants are shown in Table 2.

Table 2:
Test RMSE (in Numbers). Smaller Is Better.
| Method | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 | P15 | P16 | P17 | P18 | P19 | P20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVR | 12.76 | 17.87 | 9.98 | 21.40 | 23.99 | 23.59 | 17.21 | 23.05 | 40.75 | 19.72 | 14.35 | 7.28 | 18.78 | 15.48 | 27.98 | 10.67 | 14.85 | 43.93 | 41.59 | 34.90 |
| LR | 13.26 | 22.81 | 10.14 | 21.69 | 54.85 | 32.89 | 13.56 | 46.83 | 38.37 | 34.34 | 14.01 | 7.34 | 18.65 | 25.34 | 47.00 | 18.76 | 18.26 | 45.59 | 73.37 | 38.72 |
| Regression (C) | 13.11 | 17.40 | 13.06 | 17.92 | 25.98 | 22.14 | 22.46 | 42.16 | 30.63 | 18.16 | 12.04 | 5.40 | 13.88 | 14.03 | 27.87 | 12.47 | 17.93 | 45.85 | 43.96 | 36.64 |
| Regression (A) | 11.92 | 19.12 | 13.46 | 18.27 | 24.65 | 26.14 | 21.17 | 38.24 | 37.44 | 20.53 | 11.82 | 11.56 | 19.15 | 16.71 | 26.91 | 14.82 | 17.30 | 35.59 | 46.23 | 41.59 |
| Classification (C) | 10.53 | 15.35 | 12.51 | 16.78 | 27.36 | 22.85 | 16.32 | 33.48 | 25.18 | 17.45 | 15.21 | 8.64 | 15.58 | 12.63 | 25.24 | 10.54 | 15.98 | 27.56 | 31.11 | 35.76 |
| Classification (A) | 9.20 | 18.28 | 13.54 | 18.35 | 27.15 | 22.83 | 25.98 | 44.48 | 49.43 | 20.36 | 14.61 | 9.65 | 18.83 | 12.23 | 25.20 | 15.19 | 17.42 | 42.07 | 48.42 | 44.38 |
| CArank | 8.66 | 16.97 | 12.02 | 16.27 | 20.71 | 19.97 | 13.98 | 25.71 | 12.52 | 13.70 | 12.73 | 9.41 | 12.43 | 10.06 | 23.95 | 8.99 | 5.74 | 25.00 | 31.39 | 28.00 |

| Method | P21 | P22 | P23 | P24 | P25 | P26 | P27 | P28 | P29 | P30 | P31 | P32 | P33 | P34 | P35 | P36 | P37 | P38 | P39 | P40 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVR | 33.58 | 16.29 | 32.49 | 33.89 | 40.51 | 35.73 | 32.63 | 24.47 | 29.83 | 34.87 | 23.58 | 10.37 | 18.67 | 19.49 | 13.07 | 8.35 | 15.17 | 38.28 | 14.50 | 28.71 |
| LR | 56.49 | 38.90 | 31.20 | 58.26 | 60.63 | 34.56 | 64.32 | 61.78 | 70.42 | 36.94 | 20.29 | 32.28 | 29.31 | 36.47 | 14.54 | 7.96 | 14.62 | 55.18 | 15.84 | 30.40 |
| Regression (C) | 38.13 | 25.26 | 26.88 | 40.08 | 38.27 | 36.48 | 38.22 | 26.56 | 36.39 | 31.98 | 26.09 | 22.99 | 24.67 | 25.53 | 13.85 | 9.94 | 13.74 | 52.71 | 6.66 | 18.75 |
| Regression (A) | 48.77 | 22.10 | 27.19 | 41.04 | 37.01 | 46.19 | 46.01 | 31.61 | 41.87 | 36.54 | 26.77 | 23.33 | 23.51 | 25.03 | 12.49 | 9.23 | 17.07 | 46.63 | 17.52 | 24.70 |
| Classification (C) | 40.79 | 13.76 | 23.97 | 31.98 | 30.39 | 36.36 | 28.85 | 24.44 | 33.09 | 26.21 | 19.45 | 13.54 | 22.04 | 20.48 | 13.41 | 7.22 | 13.46 | 39.68 | 9.65 | 17.33 |
| Classification (A) | 51.50 | 15.99 | 30.35 | 37.33 | 37.94 | 47.82 | 44.97 | 37.25 | 57.80 | 42.96 | 26.74 | 23.69 | 24.75 | 26.88 | 19.49 | 6.59 | 16.26 | 46.13 | 7.61 | 16.97 |
| CArank | 37.77 | 11.77 | 25.49 | 16.44 | 29.67 | 30.72 | 19.38 | 26.32 | 26.00 | 28.49 | 8.00 | 11.65 | 12.94 | 12.06 | 5.34 | 7.03 | 11.17 | 36.77 | 3.83 | 15.14 |

Note: The shaded numbers indicate the best results.

From Table 2, we can draw similar conclusions. (1) CArank generally achieves lower RMSE than the other baselines; in particular, it achieves the lowest test RMSE on 27 of the 40 participants. (2) Apart from CArank, classification (C) performs better than the remaining baselines. This is reasonable, since classification is robust to extreme RTs while the concatenation approach is less affected by the noisy channels compared to simple aggregation. (3) The differences among the other baseline methods become less clear-cut. This is because RMSE penalizes an estimation with a larger error more heavily.

#### 5.2.3  Visualization of Predicted In-Degrees

To further explore the superiority of our CArank, we visualized the results in Table 2 using the in-degree sequences. For the sake of intuitive interpretation, we showcase participants P9, P13, P22, P24, and P31, whose performance is the most representative, in Figure 5. For the rest of the participants, our CArank also performs well and in most cases achieves the lowest RMSE (see Table 2).

From Figure 5, we make five observations. First, overall, the in-degree sequences predicted by CArank closely align with the ground truth with slight fluctuations (small RMSE), while the in-degree sequences predicted by other baselines fluctuate significantly and fail to follow the trend of the ground truth (large RMSE). Second, the points located in the northeast denote the trials with high RTs (also called extreme RTs). The in-degree sequences predicted by CArank show smaller fluctuations than those of other baselines, which indicates that CArank can accurately detect the mental fatigue associated with higher RTs. However, other baselines either show large fluctuations (e.g., P9, P13, P24), leading to a high false-negative rate, or completely fail to maintain the trend, leading to a high error rate. Third, the points located in the southwest denote the trials with small RTs. The in-degree sequences predicted by other baselines show large fluctuations (e.g., P22), leading to a high false-positive rate. Fourth, it is worth noting that the in-degree sequences predicted by regression (C/A) usually fluctuate heavily for low in-degree trials (small RTs) and high in-degree trials (large RTs). This means that regression (C/A) overestimates the RTs with small values and underestimates the RTs with large values, which is consistent with our claim that the regression-based model is not suitable for tasks with a nonsmooth response variable (RT). Fifth, a simple classification using multichannel aggregation, that is, classification (A), also shows heavy fluctuations, since it lacks an effective mechanism to aggregate the predictions from multiple channels. Classification (C) shows better performance but is still prone to overfitting, since it also cannot eliminate the effects of noisy channels during the training process.

### 5.3  Noisy Channel Detection

We also investigated the reliability of our CArank from the perspective of noisy channel detection. According to our analysis, the parameter $πn$ in the transition matrix $Πn$ indicates channel reliability. Hereafter, we leverage $πn$ as the channel reliability indicator to detect noisy channels. Figure 6 lists the noisy channels (marked in red) detected with $0.15≤πn≤0.85$, $∀n=1,2,…,N$.
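The detection rule amounts to a simple threshold test on the estimated reliabilities, as in the sketch below. Channel indices are reported starting from 1 to match the channel numbering used in the text; the function name is illustrative.

```python
def noisy_channels(pis, low=0.15, high=0.85):
    """Return the 1-based indices of channels with low <= pi_n <= high."""
    return [n for n, pi_n in enumerate(pis, start=1) if low <= pi_n <= high]


# Channels 2 and 4 fall inside the noisy band; channels 1 and 3 are informative.
print(noisy_channels([0.97, 0.5, 0.03, 0.85]))
```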

Figure 6 shows, first, that noisy channels universally exist among the EEG signals. At least one noisy channel is detected for each participant. For example, the 33rd channel is recognized as a noisy channel by CArank for almost all participants. This is reasonable, since the 33rd channel is generally acknowledged to be irrelevant to any task (Lin et al., 2014). Second, for each participant, most channels are reliable, which ensures that we can always find enough support for training our CArank. Third, the detected noisy channels vary from participant to participant and do not possess the transitivity property between participants. The noise can arise due to the intrinsic noninformative EEG channel (e.g., the 33rd channel for all participants); channels for lateral mastoid references (e.g., the 23rd and 29th channels for the majority of participants) (Chatrian, Lettich, & Nelson, 1985); and improper experimentation or artifacts (for P13, P39, and P40) (Lin et al., 2018).

## 6  Limitations and Future Work

In this work, the cooperation mechanism among channels is simplified as a weighted majority voting system, while different trials are viewed independently. We intend to formulate it with more complex mechanisms, such as the Markov decision process (MDP), to conduct learning and decision making simultaneously. Some previous work (Chen, Jiao, & Lin, 2016; Chen, Lin, & Zhou, 2015) has studied the decision-making process among crowd (noisy) workers, which is a promising basis for investigating the cooperation mechanism among noisy channels in our setting. Efforts are underway to apply this approach in future work.
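
As a rough illustration of what a weighted majority vote over channels looks like, the Python sketch below aggregates per-channel pairwise votes into a single decision. It is only a sketch under stated assumptions: each channel is assumed to cast a vote of +1 or -1 on whether one trial has a larger RT than another, and the weights are hypothetical placeholders rather than the quantities CArank actually learns.

```python
def weighted_majority_vote(votes, weights):
    """Aggregate per-channel pairwise votes (+1 or -1) into one decision.

    Returns +1 or -1 for the winning side, or 0 on an exact tie.
    """
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    total = sum(w * v for v, w in zip(votes, weights))
    return (total > 0) - (total < 0)


# Five hypothetical channels: two reliable ones vote +1, three weak ones vote -1.
votes = [+1, +1, -1, -1, -1]
weights = [0.9, 0.8, 0.1, 0.2, 0.3]

decision = weighted_majority_vote(votes, weights)      # reliable channels win
unweighted = weighted_majority_vote(votes, [1.0] * 5)  # plain majority flips it
```

The contrast between the two calls shows why the weighting matters: a plain majority is dominated by the number of channels, whereas the weighted vote lets a few reliable channels outvote many unreliable ones.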

Furthermore, brain dynamics are nonstationary and characterized by significant trial-by-trial variability (Yarkoni, Barch, Gray, Conturo, & Braver, 2009). Because of this variability, CArank would incur repeated training and updating costs whenever new data arrive. We are considering extending CArank to a real-time mental fatigue monitoring system by calibrating it online. Following Weng and Lin (2011) and Jaini et al. (2017), Bayesian moment matching is a promising way to sequentially update a nonconjugate likelihood function (e.g., CArank) with analytic update rules.

## 7  Conclusion

This work proposes a CArank model to assess the state of mental fatigue. The efficacy of the model was demonstrated using EEG data collected in a sustained driving task from 40 participants. This model has been combined with a stochastic-generalized expectation-maximization (SGEM) algorithm to provide an efficient update in the large-scale setting. CArank uses a unique methodology with a relaxed alternative, ordinal classification, to circumvent overfitting to the extreme values of RTs. It has been demonstrated that the overall performance of CArank can be significantly improved with the introduction of a transition matrix, which enables the technique to evaluate the reliability of informative EEG channels while detecting noisy EEG channels. Empirical results show that CArank delivers significant improvements over simple classification and regression methods in terms of global ranking preservation.

## Notes

1. When applied to BDP, the subtle difference between the RTs may be caused not by an intrinsic difference between BDPs but by unknown noise.

2. In the following, we omit the subscripts for simplicity.

3. A promising approach to generalize the transition matrix $\Pi_n$, equation 3.9, is to introduce the concept of the confidence region to measure the equal cases (Pregibon, 1981).

4. Here, the stepsize is set to $\eta_t = (t+2)^{-\tau_0}$, where $t$ is the number of iterations and $0.5 < \tau_0 < 1$. The smaller $\tau_0$ is, the larger the update $\eta_t$ is, and the more quickly we forget (decay) our old parameters. This can lead to swift progress but also generates instability.

5. According to Huang et al. (2015), the couplings between pairs of MCC, ACC, lSMC, rSMC, PCC, and ESC regions increased at the intermediate level of attention. This reveals that an enhancement of the cortico-cortical interaction is necessary to maintain task performance and prevent mental fatigue. Further, it shows that higher connectivity is associated with optimal performance, while very few connected nodes are associated with poor performance. See Huang et al. (2015) for more information.

6. https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

7. In terms of the L-BFGS implementation, Matlab code can be downloaded from Granzow (2017).
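
The stepsize schedule in note 4 can be sketched in Python as follows. The formula for the stepsize is taken from the note; the `blend` helper is hypothetical and uses the usual stochastic-approximation interpolation found in online EM (Cappé & Moulines, 2009; Liang & Klein, 2009), which is assumed here for illustration rather than quoted from the SGEM update equations.

```python
def stepsize(t, tau0):
    """Stepsize eta_t = (t + 2) ** (-tau0) for iteration t = 0, 1, 2, ..."""
    if not 0.5 < tau0 < 1.0:
        raise ValueError("tau0 must satisfy 0.5 < tau0 < 1")
    return (t + 2) ** (-tau0)


def blend(old, new, eta):
    """Interpolate between the old value and a new estimate with weight eta."""
    return (1.0 - eta) * old + eta * new


# Compare two schedules over the first few iterations.
fast = [stepsize(t, tau0=0.6) for t in range(5)]  # larger steps, forgets faster
slow = [stepsize(t, tau0=0.9) for t in range(5)]  # smaller steps, more stable

# Hypothetical running statistic pulled toward a new mini-batch estimate of 1.0.
stat = 0.0
for t in range(5):
    stat = blend(stat, new=1.0, eta=stepsize(t, tau0=0.6))
```

With the smaller exponent, every stepsize in `fast` exceeds the corresponding one in `slow`, so old values are discounted more aggressively, which is the trade-off between swift progress and instability described in the note.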

## Acknowledgments

I.W.T. is supported by ARC under grants DP180100106 and DP200101328. M.S. was supported by the International Research Center for Neurointelligence (WPI-IRCN) at the University of Tokyo Institutes for Advanced Study.

## References

Adams, R. J., Appleton, S. L., Taylor, A. W., Gill, T. K., Lang, C., McEvoy, R. D., & Antic, N. A. (2017). Sleep health of Australian adults in 2016: Results of the 2016 Sleep Health Foundation national survey. Sleep Health: Journal of the National Sleep Foundation, 3(1), 35–42.

Alharbi, N. (2018). A novel approach for noise removal and distinction of EEG recordings. Biomedical Signal Processing and Control, 39, 23–33.

Basak, D., Pal, S., & Patranabis, D. C. (2007). Support vector regression. Neural Information Processing–Letters and Reviews, 11(10).

Becirovic, E. (2017). On social choice in social networks. Master's thesis, Linköping University. http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1117841&dswid=1470.

Bishop, C. M. (2006). Pattern recognition and machine learning. Berlin: Springer.

Blankertz, B., Tangermann, M., Vidaurre, C., Dickhaus, T., Sannelli, C., Popescu, F., … Müller, K.-R. (2009). Detecting mental states by machine learning techniques: The Berlin brain–computer interface. In B. Graimann (Ed.), Brain-computer interfaces (pp. 113–135). Berlin: Springer.

Boksem, M. A., & Tops, M. (2008). Mental fatigue: Costs and benefits. Brain Research Reviews, 59(1), 125–139.

Byrd, R. H., Lu, P., Nocedal, J., & Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5), 1190–1208.

Cappé, O., & Moulines, E. (2009). On-line expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3), 593–613.

Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–27.

Chatrian, G., Lettich, E., & Nelson, P. (1985). Ten percent electrode system for topographic studies of spontaneous and evoked EEG activities. American Journal of EEG Technology, 25(2), 83–92.

Chen, X., Jiao, K., & Lin, Q. (2016). Bayesian decision process for cost-efficient dynamic ranking via crowdsourcing. Journal of Machine Learning Research, 17(217), 1–40.

Chen, X., Lin, Q., & Zhou, D. (2015). Statistical decision making for optimal budget allocation in crowd labeling. Journal of Machine Learning Research, 16(1), 1–46.

Chuang, C.-H., Cao, Z., King, J.-T., Wu, B.-S., Wang, Y.-K., & Lin, C.-T. (2018). Brain electrodynamic and hemodynamic signatures against fatigue during driving. Frontiers in Neuroscience, 12, 181.

Cook, D. B., O'Connor, P. J., Lange, G., & Steffener, J. (2007). Functional neuroimaging correlates of mental fatigue induced by cognition among chronic fatigue syndrome patients and controls. NeuroImage, 36(1), 108–122.

de Naurois, C. J., Bourdin, C., Stratulat, A., Diaz, E., & Vercher, J.-L. (2017). Detection and prediction of driver drowsiness using artificial neural network models. Accident Analysis and Prevention, 126, 95–104.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological), 39, 1–38.

Ekman, P. E., & Davidson, R. J. (1994). The nature of emotion: Fundamental questions. New York: Oxford University Press.

Fazli, S., Popescu, F., Danóczy, M., Blankertz, B., Müller, K.-R., & Grozea, C. (2009). Subject-independent mental state classification in single trials. Neural Networks, 22(9), 1305–1312.

Franks, D. D. (2019). Neurosociology: Fundamentals and current findings. Berlin: Springer.

Gramann, K., Müller, H., Schönebeck, B., & Debus, G. (2006). The neural basis of ego- and allocentric reference frames in spatial navigation: Evidence from spatiotemporal coupled current density reconstruction. Brain Research, 1118(1), 116–129.

Granzow, B. (2017). A Matlab implementation of L-BFGS-B. https://github.com/bgranzow/L-BFGS-B.

Hajinoroozi, M., Mao, Z., Jung, T.-P., Lin, C.-T., & Huang, Y. (2016). EEG-based prediction of driver's cognitive performance by deep convolutional neural network. Signal Processing: Image Communication, 47, 549–555.

Hammon, P. S., & de Sa, V. R. (2007). Preprocessing and meta-classification for brain-computer interfaces. IEEE Transactions on Biomedical Engineering, 54(3), 518–525.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.

Homan, R. W., Herman, J., & Purdy, P. (1987). Cerebral location of international 10–20 system electrode placement. Electroencephalography and Clinical Neurophysiology, 66(4), 376–382.

Huang, C.-S., Pal, N. R., Chuang, C.-H., & Lin, C.-T. (2015). Identifying changes in EEG information transfer during drowsy driving by transfer entropy. Frontiers in Human Neuroscience, 9, 570.

Izuma, K., & Adolphs, R. (2013). Social manipulation of preference in the human brain. Neuron, 78(3), 563–573.

Jaini, P., Chen, Z., Carbajal, P., Law, E., Middleton, L., Regan, K., … Poupart, P. (2017). Online Bayesian transfer learning for sequential data modeling. In International Conference on Learning Representations. https://openreview.net/forum?id=HygBZnRctX.

Jap, B. T., Lal, S., Fischer, P., & Bekiaris, E. (2009). Using EEG spectral components to assess algorithms for detecting fatigue. Expert Systems with Applications, 36(2), 2352–2359.

Ji, Q., Zhu, Z., & Lan, P. (2004). Real-time nonintrusive monitoring and prediction of driver fatigue. IEEE Transactions on Vehicular Technology, 53(4), 1052–1068.

Jin, Z., Zhou, G., Gao, D., & Zhang, Y. (2018). EEG classification using sparse Bayesian extreme learning machine for brain–computer interface. Neural Computing and Applications, pp. 1–9. https://doi.org/10.1007/s00521-018-3735-3.

Kaji, H., Iizuka, H., & Sugiyama, M. (2019). ECG-based concentration recognition with multi-task regression. IEEE Transactions on Biomedical Engineering, 66(1), 101–110.

Kasai, H. (2017). SGDLibrary: A MATLAB library for stochastic gradient descent algorithms. arXiv:1710.10951.

Kohlmorgen, J., Dornhege, G., Braun, M., Blankertz, B., Curio, G., Hagemann, K., … Kincses, W. (2007). Improving human performance in a real operating environment through real-time mental workload detection. In G. Dornhege, J. del R. Millán, T. Hinterberger, D. McFarland, & K.-R. Müller (Eds.), Toward brain-computer interfacing (pp. 409–422). Cambridge, MA: MIT Press.

Lal, S. K., Craig, A., Boord, P., Kirkup, L., & Nguyen, H. (2003). Development of an algorithm for an EEG-based driver fatigue countermeasure. Journal of Safety Research, 34(3), 321–328.

Liang, P., & Klein, D. (2009). Online EM for unsupervised models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 611–619). Stroudsburg, PA: Association for Computational Linguistics.

Lin, C.-T., Chuang, C.-H., Huang, C.-S., Tsai, S.-F., Lu, S.-W., Chen, Y.-H., & Ko, L.-W. (2014). Wireless and wearable EEG system for evaluating driver vigilance. IEEE Transactions on Biomedical Circuits and Systems, 8(2), 165–176.

Lin, C.-T., Chuang, C.-H., Kerick, S., Mullen, T., Jung, T.-P., Ko, L.-W., … McDowell, K. (2016). Mind-wandering tends to occur under low perceptual demands during driving. Scientific Reports, 6, 21353.

Lin, C.-T., Huang, C.-S., Yang, W.-Y., Singh, A. K., Chuang, C.-H., & Wang, Y.-K. (2018). Real-time EEG signal enhancement using canonical correlation analysis and Gaussian mixture clustering. Journal of Healthcare Engineering.

Möckel, T., Beste, C., & Wascher, E. (2015). The effects of time on task in response selection: An ERP study of mental fatigue. Scientific Reports, 5, 10113.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., … Lerer, A. (2017). Automatic differentiation in PyTorch. https://openreview.net/forum?id=BJJsrmfCZ.

Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9(4), 705–724.

Roche, A. (2011). EM algorithm and variants: An informal tutorial. arXiv:1105.1476.

Teplan, M. (2002). Fundamentals of EEG measurement. Measurement Science Review, 2(2), 1–11.

Tian, S., Wang, Y., Dong, G., Pei, W., & Chen, H. (2018). Mental fatigue estimation using EEG in a vigilance task and resting states. In Proceedings of the 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 1980–1983). Piscataway, NJ: IEEE.

Ting, P.-H., Hwang, J.-R., Doong, J.-L., & Jeng, M.-C. (2008). Driver fatigue and highway driving: A simulator study. Physiology and Behavior, 94(3), 448–453.

Wascher, E., Rasch, B., Sänger, J., Hoffmann, S., Schneider, D., Rinkenauer, G., … Gutberlet, I. (2014). Frontal theta activity reflects distinct aspects of mental fatigue. Biological Psychology, 96, 57–65.

Wei, C.-S., Lin, Y.-P., Wang, Y.-T., Jung, T.-P., Bigdely-Shamlo, N., & Lin, C.-T. (2015). Selective transfer learning for EEG-based drowsiness detection. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (pp. 3229–3232). Piscataway, NJ: IEEE.

Welch, P. (1967). The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15(2), 70–73.

Weng, R. C., & Lin, C.-J. (2011). A Bayesian approximation method for online ranking. Journal of Machine Learning Research, 1, 267–300.

Yan, L., Dodier, R. H., Mozer, M., & Wolniewicz, R. H. (2003). Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. In Proceedings of the 20th International Conference on Machine Learning (pp. 848–855). Palo Alto, CA: AAAI.

Yarkoni, T., Barch, D. M., Gray, J. R., Conturo, T. E., & Braver, T. S. (2009). BOLD correlates of trial-by-trial reaction time variability in gray and white matter: A multi-study fMRI analysis. PLOS One, 4(1), e4257.

Zarei, R. (2017). Developing enhanced classification methods for ECG and EEG signals. PhD diss., Victoria University.

Zeng, H., Yang, C., Dai, G., Qin, F., Zhang, J., & Kong, W. (2018). EEG classification of driver mental states by deep learning. Cognitive Neurodynamics, 12(6), 597–606.

Zhang, G. P. (2000). Neural networks for classification: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 30(4), 451–462.

Zhang, X., Yao, L., Wang, X., Monaghan, J., & Mcalpine, D. (2019). A survey on deep learning based brain computer interface: Recent advances and new frontiers. arXiv:1905.04149.

Zhang, Y., Zhou, G., Jin, J., Zhao, Q., Wang, X., & Cichocki, A. (2015). Sparse Bayesian classification of EEG for brain–computer interface. IEEE Transactions on Neural Networks and Learning Systems, 27(11), 2256–2267.