## Abstract

A driver's cognitive state of mental fatigue significantly affects his or her driving performance and, more importantly, public safety. Previous studies have leveraged reaction time (RT) as the metric for mental fatigue and aim at estimating the exact value of RT from electroencephalogram (EEG) signals within a regression model. However, because RTs are easily corrupted and nonsmooth during data collection, methods that predict the exact value of such a noisy measurement generally suffer from poor generalization. Considering that human RT reflects a brain dynamics preference (BDP) rather than a single regression output of EEG signals, we propose a novel channel-reliability-aware ranking (CArank) model for the multichannel ranking problem. CArank learns robustly from BDPs using EEG data and aims at preserving the ordering corresponding to RTs. In particular, we introduce a transition matrix to characterize the reliability of each channel in the EEG data, which helps in learning BDPs only from informative EEG channels. To handle large-scale EEG signals, we propose a stochastic generalized expectation-maximization (SGEM) algorithm to update CArank in an online fashion. Comprehensive empirical analysis on EEG signals from 40 participants shows that CArank achieves substantial improvements in reliability while simultaneously detecting noisy or less informative EEG channels.

## 1 Introduction

According to the Sleep Health Foundation report by Adams et al. (2017), mental fatigue is a major cause in 33% to 45% of all road accidents. In general, mental fatigue (Boksem & Tops, 2008) refers to the inability to maintain optimal cognitive performance in a task with a high demand for cognitive activity. Such inability in the context of driving could lead to accidents with severe consequences (Adams et al., 2017). Individuals may find themselves in a mentally fatigued state because of lack of sleep, continuous driving for an extended period, monotonous driving late at night or before dawn, or driving while under the influence of sleeping drugs or with sleep disorders (Ji, Zhu, & Lan, 2004; Ting, Hwang, Doong, & Jeng, 2008). (See Zhang, Yao, Wang, Monaghan, & Mcalpine, 2019, for recent advances and references in brain dynamic analysis.)

In response to these critical issues, several methods (Cook, O'Connor, Lange, & Steffener, 2007; Blankertz et al., 2009; Fazli et al., 2009; Wascher et al., 2014; Tian, Wang, Dong, Pei, & Chen, 2018; Kaji, Iizuka, & Sugiyama, 2019) have been proposed to estimate and predict mental fatigue based on electroencephalography (EEG) and reaction time (RT) (see Figure 1a). Some of these methods, however, performed considerably well for some participants but failed for others due to a lack of generalization. One of the challenges behind such poor generalization is determining how to use RT effectively. RT is easily affected by instrumental error, wandering attention, or other task-unrelated factors. A previous study (Wei et al., 2015) tried to overcome this problem by adopting different techniques to smooth RTs but still failed to make them work for all participants. Note that human RT is usually the result of a preference (Izuma & Adolphs, 2013) in brain dynamics during the task rather than just a single value. Such preferences can be affected by changing levels of attention (Möckel et al., 2015), such as a wandering mind (Lin et al., 2016) or a lower level of attention (Chuang et al., 2018). Therefore, the relationship between EEG signals and RTs, including extreme or abnormal RTs, should be attended to in a way that reflects human brain dynamics preferences (BDPs).

Another important problem lies in the heterogeneous channels extracted from different brain regions, which are normally responsible for different functionalities. There was an attempt to choose different brain regions (Wascher et al., 2014) for a method during evaluation of mental fatigue, but these regions of the brain are not necessarily the same for all participants (Gramann, Müller, Schönebeck, & Debus, 2006). For example, Wascher et al. (2014) heuristically used frontal theta to represent different levels of mental fatigue for all participants. In such a case, the reliability of the learning model would inevitably degrade because of possibly noisy or less informative channels chosen, on different brain regions, by the method. Some previous work (de Naurois, Bourdin, Stratulat, Diaz, & Vercher, 2017) attempted to solve this issue by using artificial neural network models but still failed to provide convincing results. This previous work impels us to pursue a purely data-driven approach to predict mental fatigue while getting rid of the low versatility caused by various heuristic tricks.

To overcome these problems, we first formulate the task of monitoring mental fatigue as a multichannel ranking problem and solve it with our proposed channel-reliability-aware ranking (CArank) model. In particular, CArank learns robustly from brain dynamics preferences (BDPs) using EEG data, while effectively preserving the ordering of RTs (see Figure 1b). This corrects the performance defects of previous models caused by noisy and extreme RTs. Furthermore, our model uses a transition matrix to evaluate the high-confidence sources among heterogeneous EEG channels, which contributes highly toward task performance. To handle large-scale EEG signals and obtain higher generalization, we propose a stochastic generalized expectation-maximization (SGEM) algorithm. More precisely, we make the following key contributions:

- We formulate the task of monitoring mental fatigue as a multichannel ranking problem and tackle it with the CArank model. CArank is a purely data-driven approach to detecting mental fatigue using informative channels only.

- We propose a stochastic generalized expectation-maximization algorithm for CArank, which extends CArank to large-scale applications.

- We conduct empirical experiments on EEG signals from 40 participants to demonstrate the superior reliability of CArank in terms of mental fatigue monitoring.

This letter is organized as follows. Section 2 introduces the topic of mental fatigue monitoring and motivates the practice of using brain dynamics preferences. In section 3, we address the multichannel ranking problem and introduce our channel-reliability aware ranking to solve it. Section 4 describes a stochastic generalized expectation-maximization algorithm. Section 5 demonstrates the reliability of the proposed CArank with EEG signals from 40 participants. Section 6 envisions the future work, and section 7 concludes.

## 2 Background

In this section, we introduce some preliminary information about mental fatigue monitoring and then discuss our motivation for learning from brain dynamics preferences.

Reaction time is an intuitive indicator used to assess human mental fatigue. Therefore, a common practice for monitoring mental fatigue is to find a robust way of mapping humans' reaction time to an emergent situation using previously recorded EEG signals (Lal, Craig, Boord, Kirkup, & Nguyen, 2003; Kohlmorgen et al., 2007; Jap, Lal, Fischer, & Bekiaris, 2009).

### 2.1 Overfitting of the Regression Model

A natural way to forecast the RT from EEG signals is to formulate the task as regression (see Figure 2), namely, finding a (non)linear mapping (e.g., neural networks, SVR) from the EEG signals $x$ to the corresponding RT. However, due to the existence of extreme RT values during data collection (Wei et al., 2015; Huang, Pal, Chuang, & Lin, 2015), the scale of the regression loss varies significantly across RTs. The regression loss, which does not discriminate the peculiarity of individual RTs, is therefore dominated by the few extreme RTs while neglecting the normal ones. This leads to overfitting of the regression model on the training data, with poor generalization performance on the test data (see Figures 2 and 5 and Table 1).
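To see concretely how a few extreme RTs dominate the regression loss, consider the following sketch with hypothetical RT values: a single lapse-like RT of 6 s carries almost the entire MSE.

```python
import numpy as np

# Hypothetical RTs (seconds): mostly regular responses plus one extreme lapse.
rt = np.array([0.6, 0.7, 0.8, 0.7, 0.9, 6.0])
pred = np.full_like(rt, 0.75)  # a model predicting the typical RT

# Squared-error contribution of each trial to the regression loss.
per_trial_loss = (rt - pred) ** 2
share_of_extreme = per_trial_loss[-1] / per_trial_loss.sum()
print(f"extreme trial carries {share_of_extreme:.1%} of the total MSE")
```

Any gradient step on this loss is almost entirely driven by the single extreme trial, which is exactly the overfitting mechanism described above.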

Table 1: Test ACC for participants P1 to P20.

| Test ACC | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 | P15 | P16 | P17 | P18 | P19 | P20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVR | 71.74 | 78.92 | **85.79** | 69.76 | 84.17 | 66.61 | 76.38 | **80.41** | 71.10 | 58.52 | 77.12 | 87.01 | 73.92 | 83.79 | 69.10 | 73.65 | 63.77 | 62.49 | 72.64 | 68.66 |
| LR | 69.80 | 70.77 | 85.63 | 69.01 | 63.77 | 53.62 | 79.69 | 55.87 | 74.15 | 21.32 | 77.55 | 87.44 | 74.17 | 70.79 | 41.03 | 53.11 | 58.15 | 59.93 | 41.88 | 66.20 |
| Regression (C) | 71.63 | 79.21 | 80.22 | 72.39 | 83.65 | 68.38 | 60.31 | 54.99 | 77.98 | 59.01 | **82.72** | **89.80** | 79.56 | 85.45 | 68.60 | 65.88 | 54.30 | 50.58 | 68.65 | 61.80 |
| Regression (A) | 71.71 | 72.97 | 79.81 | 70.90 | 82.80 | 57.42 | 61.88 | 60.96 | 66.38 | 52.96 | 79.37 | 73.87 | 67.70 | 80.54 | 66.03 | 54.47 | 51.01 | 65.07 | 62.33 | 54.80 |
| Classification (C) | 76.85 | **82.48** | 82.40 | 74.77 | 83.12 | 65.69 | 76.12 | 70.84 | 83.02 | 63.74 | 76.41 | 85.08 | 77.74 | 88.03 | 69.09 | 71.80 | 58.44 | 77.31 | **80.85** | 63.56 |
| Classification (A) | 79.97 | 77.61 | 79.87 | 68.69 | 82.55 | 63.86 | 49.85 | 51.47 | 51.78 | 53.03 | 75.79 | 79.69 | 66.40 | 89.39 | 68.10 | 53.07 | 50.00 | 52.81 | 61.19 | 52.02 |
| CArank | **82.29** | 80.97 | 83.78 | **77.50** | **87.42** | **76.62** | **82.34** | 79.16 | **91.40** | **78.25** | 81.74 | 84.17 | **83.23** | **90.53** | **76.66** | **80.40** | **88.69** | **81.13** | 80.42 | **78.35** |

Table 1 (continued): Test ACC for participants P21 to P40.

| Test ACC | P21 | P22 | P23 | P24 | P25 | P26 | P27 | P28 | P29 | P30 | P31 | P32 | P33 | P34 | P35 | P36 | P37 | P38 | P39 | P40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVR | 72.71 | 73.43 | 78.98 | 67.00 | 76.72 | 72.56 | 75.94 | 85.95 | 81.63 | 82.19 | 67.78 | **87.73** | 71.21 | 76.69 | 80.61 | 88.76 | 77.47 | 74.22 | 63.52 | 46.24 |
| LR | 52.97 | 30.10 | 78.23 | 40.02 | 60.25 | 73.15 | 40.67 | 53.23 | 46.66 | 79.32 | 73.71 | 53.29 | 48.55 | 48.36 | 75.71 | 88.53 | 78.06 | 62.05 | 60.28 | 43.53 |
| Regression (C) | 69.84 | 50.58 | 80.73 | 56.85 | 78.72 | 67.76 | 65.06 | 84.75 | 79.59 | 82.59 | 63.41 | 66.46 | 56.78 | 61.81 | 66.70 | 87.21 | 81.98 | 57.71 | 84.41 | 67.48 |
| Regression (A) | 53.44 | 58.27 | 78.29 | 54.25 | 77.46 | 53.31 | 51.33 | 77.73 | 69.92 | 77.06 | 58.46 | 64.09 | 53.09 | 59.69 | 72.45 | 85.64 | 73.21 | 62.83 | 50.55 | 46.35 |
| Classification (C) | 68.22 | 79.82 | **84.36** | 68.10 | 84.28 | 69.60 | 77.09 | **86.46** | 82.11 | **86.85** | 74.22 | 85.05 | 60.49 | 71.58 | 73.03 | 90.40 | 83.51 | 75.62 | 80.37 | 69.07 |
| Classification (A) | 49.86 | 74.65 | 72.45 | 59.46 | 75.35 | 49.80 | 52.20 | 76.89 | 51.88 | 73.62 | 59.50 | 61.30 | 50.00 | 53.07 | 60.46 | 90.24 | 72.15 | 60.79 | 78.30 | 65.76 |
| CArank | **72.83** | **85.33** | 82.70 | **89.35** | **84.57** | **76.52** | **85.02** | 83.58 | **86.56** | 85.64 | **92.74** | 85.74 | **79.24** | **84.77** | **90.53** | **90.96** | **86.05** | **77.12** | **93.48** | **75.56** |

Notes: Higher is better. Boldface numbers indicate the best results.

This creates a dilemma: we require a reliable learning model to predict RT from the complex EEG signals (indeed, that is exactly our target), yet the model should not excessively approximate the exact value of RT, especially the extreme values. The problem, then, is how to learn efficiently from noisy or nonsmooth RTs when their exact values are not necessary.

As shown in Figure 2, extreme or abnormal RTs widely exist in the collected data. Overfitting arises in the regression model because the regression loss excessively forces the learning model to fit the extreme RTs while underrating the regular RTs. Although various regularization methods (e.g., $L_2$ norm, $L_1$ norm, and Laplace priors) can alleviate overfitting of the learning model (Hastie, Tibshirani, & Friedman, 2009; Zhang et al., 2015; Jin, Zhou, Gao, & Zhang, 2018), they cannot solve the overfitting issue as long as the regression loss is still adopted. The same is true for other heuristic approaches (e.g., early stopping) used for alleviating overfitting.

Wei et al. (2015) tried to overcome this problem by adopting different techniques to smooth RTs but still failed to make them work for all participants. Meanwhile, the performance varies significantly with different choices of the mapping function. The predefined smoothing techniques would excessively weigh down, or simply clip, the extreme or abnormal RTs in the MSE loss, which instead fails to reveal the real relationship between the EEG signals and RTs, especially the extreme or abnormal RTs (Möckel, Beste, & Wascher, 2015; Lin et al., 2016; Chuang et al., 2018).

For the sake of comparison, we apply the $L2$ norm regularization to all baselines in this letter.

### 2.2 Consistency of the Ordinal Regression Model

Instead of using regression, we propose to transform the problem into an ordinal regression problem. In particular, the RTs are defined in the totally ordered space $\mathbb{R}$. This space carries structural meaning, which is preserved by the pairwise comparisons between the RTs: the pairwise comparisons preserve the entire relative structure among the RTs while ignoring their absolute numerical values. Therefore, predicting the orderings of the pairwise comparisons may be regarded as a relaxed alternative to the previous regression model (see Figure 3).
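As an illustration, preference propositions can be generated from RT pairs with a simple rule; the RT values and the tie threshold `tau` below are hypothetical:

```python
import itertools

def preference(t1, t2, tau=0.2):
    """Map an RT pair to a preference state: 1 (win), -1 (loss), 0 (tie).
    tau is a hypothetical threshold separating significant from comparable pairs."""
    if t1 - t2 > tau:
        return 1
    if t2 - t1 > tau:
        return -1
    return 0

rts = [0.6, 1.5, 0.65]  # synthetic reaction times (seconds)
props = [(i, j, preference(rts[i], rts[j]))
         for i, j in itertools.combinations(range(len(rts)), 2)]
print(props)  # each pair is mapped to a win/tie/loss proposition
```

Only the orderings survive this transformation; the absolute RT magnitudes, including any extreme values, no longer enter the learning objective directly.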

We showcase our motivation using a naive ordinal regression model for mental fatigue monitoring and present the results in Figure 3: even this naive ordinal regression model captures meaningful structure compared to the regression model. In particular, the relative structure among the RTs is somewhat preserved: the boundary between large RTs and small RTs is clear. Meanwhile, large RTs can serve as an indicator for monitoring mental fatigue.

#### 2.2.1 Comparison between Ordinal Regression and Regression

The difference between ordinal regression and regression lies in the objective they minimize. Ordinal regression aims to preserve the overall ordering of RTs, while regression excessively approximates the exact value of each RT. Therefore, ordinal regression is less sensitive to outliers, that is, to the scale of RTs in mental fatigue monitoring.

#### 2.2.2 Reliability Issues Caused by Heterogeneous Channels

A naive ordinal regression method still suffers from overfitting, mainly because of the simple concatenation of the EEG signals. Since the EEG signals are from heterogeneous channels, if we simply concatenate the EEG signals without discriminating the reliability of each channel, the model's generalization would be degraded.

(Deficiencies of $L_1$ and $L_{2,1}$ Regularization for Eliminating Noisy Channels). To eliminate a noisy channel, the weights of the features should be set to zero for that channel as a whole. However, $L_1$ regularization can push only some, rather than all, of a channel's weights to zero. $L_{2,1}$ regularization first computes the $L_2$ norm over the weights of each channel and then takes the $L_1$ norm of all the $L_2$ norms; it can therefore be used to eliminate noisy channels. However, $L_{2,1}$ regularization suffers from the following deficiencies: (1) it is difficult to extend to nonlinear models, such as deep neural networks, and (2) it relies heavily on tuning the balance factor for the $L_{2,1}$ regularization term.
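The two penalties can be contrasted numerically; the channel-grouped weight matrix below is hypothetical:

```python
import numpy as np

# Hypothetical weights: 3 channels x 4 features per channel (one row per channel).
W = np.array([[0.9, -0.4, 0.3, 0.2],   # channel 1: informative
              [0.0,  0.0, 0.0, 0.0],   # channel 2: zeroed out as a whole group
              [0.1, -0.2, 0.5, 0.0]])  # channel 3: partially sparse

l1 = np.abs(W).sum()                   # L1: sparsity on individual weights only
l21 = np.linalg.norm(W, axis=1).sum()  # L2,1: L2 per channel, then L1 across channels
print(l1, l21)
```

The $L_{2,1}$ penalty treats each row (channel) as a unit, so it can zero out channel 2 entirely, whereas the $L_1$ penalty has no notion of channel grouping.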

We next explore data-driven methods that can automatically weigh up reliable channels and weigh down unreliable ones.

## 3 Model and Methodology

In this section, we formulate the mental fatigue monitoring task as a multichannel ranking problem. Furthermore, we extend the ordinal classification model for brain dynamics preferences and introduce a transition matrix to evaluate the channel reliability. Then, we propose the CArank model to tackle the multichannel ranking problem.

Note that we used the term *preference* intentionally to show that brain dynamics keep changing with regard to human behaviors, and it happens because the human brain prefers one decision over others (Ekman & Davidson, 1994; Izuma & Adolphs, 2013; Franks, 2019). Therefore, we prefer *preference* to *classification*. We then refer to the pairwise comparison between brain dynamics as the brain dynamics preference (BDP).

### 3.1 Multichannel Ranking


### 3.2 Beyond Ordinal Classification

The popular Bradley-Terry model, which is based on logistic regression, can be formulated as follows:

$$P(\rho = 1 \mid \Delta x) = \frac{1}{1 + \exp(-w^T \Delta x)}, \qquad \Delta x = x_1 - x_2.$$

However, a preference proposition $\rho$ has three states, $1, 0, -1$, denoting win ($T_1 > T_2$), tie ($T_1 \approx T_2$), and loss ($T_1 < T_2$), respectively. Since binary classification fails to model the tie state ($T_1 \approx T_2$), binary classification (e.g., see equation 3.5) is very sensitive to subtle differences in reaction time. Other classification models, such as support vector machines, are also infeasible for our problem because they lack a normalized probability definition over three states. Meanwhile, the softmax function, a straightforward extension of binary classification, models the different states equally; it also does not serve as a good candidate, since it fails to capture the intrinsic connection between these two types of preference propositions.

Note that we consider the linear mapping $w^T \Delta x$ here since the EEG data are usually high-dimensional with a low sample size.

(Ternary Classification versus Binary Classification). Ternary classification (see equation 3.6) is less sensitive to subtle differences in reaction time. In binary classification, a subtle discrepancy around the classification boundary leads to the steepest gradient. The tie state (i.e., $\rho = 0$) introduced in ternary classification, however, flattens this steepest gradient and enhances the model's robustness to subtle differences in RT.
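As an illustration (not the paper's equation 3.6, but one plausible three-state likelihood in the same spirit), the win and loss probabilities below are logistic functions of the pairwise score shifted by a hypothetical tie margin `eps`, so small score differences fall into the tie state and the gradient around the boundary is flattened:

```python
import numpy as np

def ternary_probs(score, eps=2.0):
    """Three-state preference probabilities from a pairwise score w^T(x1 - x2).
    eps is a hypothetical tie margin: the win/loss sigmoids are shifted by it,
    so near-zero scores are absorbed by the tie state."""
    p_win = 1.0 / (1.0 + np.exp(-(score - eps)))
    p_loss = 1.0 / (1.0 + np.exp(-(-score - eps)))
    p_tie = 1.0 - p_win - p_loss  # remaining mass; positive for any eps > 0
    return p_win, p_tie, p_loss

print(ternary_probs(0.0))  # at score 0 the tie state is the most probable
```

A tiny perturbation of the score near zero barely changes these probabilities, which is the robustness-to-subtle-RT-differences property described in the remark.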

(Extension to Deep Models). For the sake of clarity, we elaborate our three-state ordinal classification with a linear formulation (see equation 3.6). In the case of a deep learning model, we can consider either (1) replacing the linear difference $w^T x_1 - w^T x_2$ with the difference of the neural network outputs $g(x_1) - g(x_2)$ or (2) replacing the raw feature $x$ in equation 3.6 with the output of the last layer of an encoder. To ensure end-to-end training, we chose the first approach in our experiment.

### 3.3 Channel Reliability

Because different regions of the human brain have different functions, the relative contributions of different channels to human RT can vary considerably. The state of each channel can be classified as informative or noisy according to its contribution with regard to human RT. Note that a channel is called “noisy” if the algorithms cannot extract useful brain information from the EEG signals of that channel (Alharbi, 2018; Lin et al., 2018). Therefore, if we directly model the EEG preferences recorded in each channel without any distinction among the channels regarding channel reliability (i.e., informative versus noisy), the model's reliability will inevitably degrade.

That is, $P(\rho = 0 \mid \rho^{(n)} \in \{1, -1\}) = 0$. Therefore, a simplified transition matrix can be represented as follows:

$$Q^{(n)} = \begin{bmatrix} \pi_n & 0 & 1 - \pi_n \\ 0 & 1 & 0 \\ 1 - \pi_n & 0 & \pi_n \end{bmatrix},$$

where the rows and columns are indexed by the three states $1, 0, -1$ and $\pi_n$ characterizes the reliability of the $n$th channel:

- Positive channels with $\pi_n$ close to 1: the ranking model, equation 3.6, can extract enough information from the $n$th channel and exactly predict the state of the preference proposition.

- Noisy channels with $\pi_n$ close to 0.5: the ranking model cannot extract any useful information from the $n$th channel.

- Negative channels with $\pi_n$ close to 0: the ranking model can extract enough information from the $n$th channel, but the predicted states are exactly opposite to the proposition states.

The identified positive and negative channels are all considered as informative EEG channels, which helps in learning reliable models for the corresponding task.
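The role of $\pi_n$ can be sketched as a mixture of the model's prediction and its flipped counterpart; the probability values below are hypothetical:

```python
def channel_likelihood(p_model, pi_n):
    """Likelihood of an observed win/loss preference on channel n.
    p_model: probability the ranking model assigns to the observed state.
    pi_n:    channel reliability; 1 = positive, 0 = negative, 0.5 = noisy."""
    return pi_n * p_model + (1.0 - pi_n) * (1.0 - p_model)

print(channel_likelihood(0.9, 1.0))   # positive channel: trust the model
print(channel_likelihood(0.9, 0.0))   # negative channel: flip the prediction
print(channel_likelihood(0.9, 0.5))   # noisy channel: constant 0.5, no information
```

Positive and negative channels both contribute usable evidence (directly or flipped), while a noisy channel contributes a constant likelihood regardless of the model's output.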

### 3.4 Channel-Reliability Aware Ranking

### 3.5 Reliability Analysis and Channel State Estimation

CArank (see equation 3.11) in fact trains a mixture of two complementary classifiers that share the same parameter $w$. It differs from classical mixture models in that it clusters at the channel level instead of the sample level.

In particular, for positive channels with $\pi_n$ close to 1, CArank relies on the first classifier to update the shared parameter $w$. For negative channels with $\pi_n$ close to 0, equation 3.11 automatically switches to the opposite classifier, which extracts the correct information from the negative channels and updates the shared parameter $w$ accordingly. Furthermore, CArank is robust to noisy channels with $\pi_n$ approximately equal to 0.5, because equation 3.11 gives up extracting information from such channels by assigning a constant likelihood (i.e., 0.5) to each BDP. The estimated $\pi_n$ can thus be leveraged as an indicator to detect noisy channels with $\pi_n \approx 0.5, \forall n = 1, 2, \ldots, N$. (See Figure 6 for more details.)

### 3.6 Superiority of CArank over Previous Methods

CArank is superior in two ways: (1) using ordinal regression instead of regression enables it to be less sensitive to the scale of RTs, and (2) the data-driven noisy channel detection ensures performing mental fatigue monitoring using informative channels only.

In terms of the overfitting caused by extreme values, Wei et al. (2015) adopted different techniques to smooth RTs but failed to make it work for all participants. Meanwhile, the predefined smooth techniques would excessively weigh down the extreme or abnormal RTs in the MSE loss, which instead fails to reveal the real relationship between the EEG signals and RTs, especially the extreme or abnormal RTs. In terms of the lower reliability caused by heterogeneous channels, Wascher et al. (2014) heuristically used frontal theta to represent a different level of mental fatigue, but specific regions of the brain are not necessarily the same for all participants (Gramann et al., 2006).

Different from existing work, which heavily relies on various heuristic tricks, CArank is the first purely data-driven approach to predict mental fatigue and therefore offers high versatility. Specifically, it first formulates the mental fatigue monitoring task as a multichannel ranking problem. Next, it evaluates the channel reliability of each EEG channel via a transition matrix. CArank therefore performs reliable mental fatigue prediction using informative channels only.

## 4 Stochastic Generalized Expectation-Maximization

In this section, we describe a generalized expectation-maximization (GEM) algorithm (Dempster, Laird, & Rubin, 1977) to solve the proposed CArank, equation 3.11. Since the feasible region of $\pi_n$ is restricted to $[0, 1]$, gradient-based optimization methods would make our solution inaccurate and inefficient. The GEM algorithm is an efficient iterative procedure for computing the MAP solution in the presence of latent variables ($\rho_m^{(n)}$ in equation 3.11). GEM avoids directly differentiating the expectation over the latent variables and instead optimizes a surrogate lower bound. Therefore GEM, a silver bullet for MAP with latent variables, significantly simplifies the optimization over the parameter $\pi_n$ in equation 3.11.

### 4.1 GEM for CArank

(Extension to Deep Models). The L-BFGS optimization method used in equation 4.7 aims to find the optimal $w$. Alternatives are easy to find in the deep learning literature, such as vanilla stochastic gradient descent (SGD) and its various variants (Kasai, 2017), if we replace the raw EEG features $x$ with neural embeddings.

(Computational Efficiency of Stochastic GEM for CArank). According to algorithm 1, the optimization of CArank involves $T$ iterations alternating between the E-step, with regard to $E[\delta_m^{(n)}]$, and the M-step, with regard to $\pi_n$ and $w$. Note that the cost of computing $E[\delta_m^{(n)}]$ and $\pi_n$ is marginal compared to that of optimizing $w$. Thus, the computation cost of CArank at each iteration is dominated by the subclassification problem, that is, optimizing $w$. Accordingly, the computational cost of GEM for CArank is $T$ times that of optimizing a regular classification problem. Note that only a few iterations ($T < 10$) are required for GEM to converge. This analysis also applies to stochastic GEM, where we update the model parameters with minibatch samples.

Note that this computation cost applies to the training stage only; the computation costs of all methods at the test stage are similar. CArank enjoys the lowest storage cost during the test stage, since we can safely discard the EEG signals from noisy channels after training. This does not sacrifice model performance, since CArank declines to extract information from noisy channels for decision making.

### 4.2 Stochastic GEM for CArank

The GEM approach introduced in section 4.1 is inefficient for large-scale data sets, because we need to iteratively calculate the gradient with regard to the parameters $\pi$ and $w$ over all samples during each generalized maximization step. Motivated by the stochastic approximation literature (Roche, 2011), we introduce a stochastic generalized expectation-maximization (SGEM) approach, which resorts to stochastic minibatch optimization to learn the parameters. To be specific, SGEM approximates the batch-EM updates of $\pi$ and $w$ with a single sample or a minibatch of samples. Since a minibatch cannot perfectly approximate the whole data set, we interpolate between the new and former estimators with a decreasing step size $\eta_k$, as in Liang and Klein (2009).
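The interpolation can be sketched as follows; the step-size schedule $\eta_k = k^{-\alpha}$ and the scalar parameter stand in, hypothetically, for the actual CArank updates of $\pi$ and $w$:

```python
def sgem_interpolate(old, batch_estimate, k, alpha=0.6):
    """Interpolate between the former estimator and the minibatch estimator
    with a decreasing step size eta_k = k**(-alpha) (a hypothetical schedule)."""
    eta_k = k ** (-alpha)
    return old + eta_k * (batch_estimate - old)

# Noisy minibatch estimates of a parameter whose true value is 1.0.
w = 0.0
for k, est in enumerate([2.0, 0.5, 1.5, 0.8, 1.2], start=1):
    w = sgem_interpolate(w, est, k)
print(w)  # stays within the range of the estimates, damped toward their center
```

The shrinking step size damps the minibatch noise, which is the same mechanism behind the stochastic-approximation convergence argument cited above.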

#### 4.2.1 Sampling Step

Before the $t$th iteration, we randomly sample a minibatch $D_t$ from $D$. The number of preference propositions in $D_t$, denoted by $M_t$, is much smaller than the total data set size $M$.

#### 4.2.2 Expectation Step

The expectation step remains similar. The only difference is that we calculate the posterior expectation of the auxiliary variable $\delta_m^{(n)}$ over the minibatch $D_t$.

#### 4.2.3 Generalized Maximization Step

(Convergence Analysis). The convergence of the proposed stochastic GEM algorithm is analogous to the discussion given by Cappé and Moulines (2009) for their stochastic gradient EM algorithms. The existence of such links is hardly surprising. In view of the discussion in section 3 of Cappé and Moulines (2009), the online update rule, equation 4.9b, can also be seen as a stochastic gradient recursion, namely, $w_{\text{new}} = w_{\text{old}} + \eta_k (w_t - w_{\text{old}})$.

## 5 Empirical Analysis

In this section, we demonstrate the reliability of the proposed CArank, equation 3.11, with EEG signals from 40 participants.

### 5.1 Experiment Paradigm

We used the 33-channel EEG data recorded by Huang et al. (2015) from 40 adult participants performing a long, sustained-attention driving task. These data contain one intrinsic non-EEG channel, the 33rd channel, which records information about only one axis in the direction of deviation. The experiment was conducted in a virtual-reality dynamic driving simulator (see Figures 4D and 4E). The task involves driving on a four-lane highway while lane-departure events randomly induce a deviation toward the side of the road from the original position. Each participant was instructed to respond quickly and steer back to the original position. A complete trial in this study (see Figure 4A) includes a 10 s baseline, deviation onset, response onset, and response offset (see Figures 4B and 4C). The next trial occurs within an interval of 5 s to 10 s after the current trial finishes. Each participant completed $T$ trials within 1.5 h. For each trial $i$, the EEG signals $\{x_{n,i}\}_{n=1}^{N}$ from $N$ different channels were recorded simultaneously, and the corresponding reaction time $RT_i$ was collected afterward. If a participant fell asleep during the experiment, no feedback was given to wake them up. The NuAmps amplifier (Compumedics Limited, Australia) was used to collect the EEG data, with a maximum sampling rate of 1000 Hz, 200 Hz bandwidth (DC), and 22-bit resolution.

In this letter, we adopt the 10 s baseline (see Figure 4B) as the feature vector, which is assumed to be long enough to detect any significant change in brain activity (Zhang, 2000). We then explore the relationship between the 10 s baseline $x (\in \mathbb{R}^k)$ and the preference proposition $\rho_m$ under the following four assumptions:

#### 5.1.1 Data Preprocessing

Brain dynamics preferences for each participant were generated as follows. The trials of each participant were randomly divided into two parts, 50% for training and 50% for test, and the EEG preferences were constructed according to the pairwise comparisons between the RTs. To be specific, two types of RT pairwise comparisons can be constructed: (1) significant comparisons $(T_{m,1}, T_{m,2})$, where $T_{m,1} \gg T_{m,2}$ or $T_{m,2} \gg T_{m,1}$, and (2) comparable comparisons $(T_{m,1}, T_{m,2})$, where $T_{m,1} \approx T_{m,2}$. Considering the time delay among channels in the time domain, a Fourier transform was applied to transform the EEG time series into the frequency domain. The fast Fourier transform (FFT) was applied using the Welch method (Welch, 1967) with a window size of 128 samples (a spectral decomposition over about 0.5 s) and a pad ratio of 2 without any overlap, which yields twice as many output features as the sampling rate. Further, to avoid computational overhead, only the EEG power within 0.5 Hz to 30 Hz was selected, which is considered the most relevant to the RTs (Huang et al., 2015). Other feature transformations could also be adopted for feature extraction if necessary. (See Hammon and de Sa, 2007, for examples of other features typically used for EEG.)
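This preprocessing step can be sketched with `scipy.signal.welch` on synthetic data; the 250 Hz sampling rate is an assumption (under which a 128-sample window spans about 0.5 s), and the band selection follows the 0.5 Hz to 30 Hz range above:

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(0)
fs = 250                           # assumed sampling rate (Hz)
x = rng.standard_normal(fs * 10)   # 10 s of one synthetic EEG channel

# Welch PSD: 128-sample windows, no overlap, zero-padded to 256 points (pad ratio 2).
f, pxx = welch(x, fs=fs, nperseg=128, noverlap=0, nfft=256)

# Keep only the band considered most relevant to the RTs (0.5 Hz to 30 Hz).
band = (f >= 0.5) & (f <= 30.0)
features = np.log(pxx[band])       # log-power features for this channel
print(features.shape)
```

Repeating this per channel yields the per-channel frequency-domain feature vectors used to build the preference propositions.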

#### 5.1.2 Baselines

First, we considered two popular linear methods: support vector regression (SVR) (Chang & Lin, 2011) and linear regression (LR), with the features from the multiple channels simply concatenated into a long feature vector. Then we compared CArank with widely adopted nonlinear regression and classification methods under the multiple-channel concatenation formulation and the multiple-channel aggregation formulation, respectively. In particular, we considered two regression models (Lin et al., 2014; Hajinoroozi, Mao, Jung, Lin, & Huang, 2016): (1) regression (C), where the EEG signals from multiple channels are simply concatenated into a long feature vector and the corresponding regression model is trained on this feature vector, and (2) regression (A), where the EEG signals from multiple channels are considered independently and the regression results are aggregated using majority voting afterward. Two ordinal classification models (Zarei, 2017; Zeng et al., 2018) are considered: (1) classification (C), where the EEG signals from multiple channels are simply concatenated into a long feature vector and the corresponding classification model is trained on this feature vector, and (2) classification (A), where the EEG signals from multiple channels are considered independently and the classification results are aggregated using majority voting afterward.
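The two formulations can be sketched as follows (hypothetical helper names; `model` stands for any trained per-trial regressor or classifier, and the class count is an illustrative assumption):

```python
import numpy as np

def predict_concat(model, X):
    """Concatenation formulation: stack the N channel features of each
    trial into one long vector and apply a single model."""
    # X: (trials, channels, features) -> (trials, channels * features)
    flat = X.reshape(X.shape[0], -1)
    return model(flat)

def predict_aggregate(model, X, n_classes=2):
    """Aggregation formulation: score each channel independently with a
    shared model, then majority-vote the per-channel class labels."""
    n_trials, n_channels, _ = X.shape
    votes = np.stack([model(X[:, c, :]) for c in range(n_channels)], axis=1)
    # votes: (trials, channels) integer labels; majority vote per trial.
    counts = np.apply_along_axis(np.bincount, 1, votes, minlength=n_classes)
    return counts.argmax(axis=1)
```

Note that the aggregation path trains one model shared by all channels, so a noisy channel contaminates the shared weights as well as the vote, which is one plausible reading of why concatenation fares better in the experiments.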

#### 5.1.3 Metrics

Note that we trust only the predictions from informative channels with reliability $\pi_n > \kappa$ or $\pi_n < 1 - \kappa$; $\kappa$ is set to 0.85 for all participants in our experiment. In terms of SVR, LR, and regression (C)/classification (C), the task reduces to a standard regression (or classification) problem, since the EEG signals from multiple channels are simply concatenated into a long vector. In terms of regression (A)/classification (A), considering the high-dimensional features with low sample size, we train a nonlinear model shared by all channels and aggregate the results from different channels into the final predictions using the majority voting scheme. Since SVR, LR, regression (C/A), and classification (C/A) have no mechanism to evaluate the channel state, we trust all channels by default. Further, we calculate the two metrics only on the preference propositions wherein the ordering between the RT pair is significant, since evaluation is difficult when the RT pair is merely comparable.
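A minimal sketch of this channel-trust rule, assuming `pi` holds the learned per-channel reliabilities:

```python
def trusted_channels(pi, kappa=0.85):
    """Indices of informative channels: reliability decisively above
    kappa (trustworthy) or below 1 - kappa (consistently flipped, hence
    still informative once inverted). kappa = 0.85 matches the setting
    used for all participants in the experiment."""
    return [n for n, p in enumerate(pi) if p > kappa or p < 1.0 - kappa]
```

For example, `trusted_channels([0.90, 0.50, 0.10, 0.86])` keeps channels 0, 2, and 3; the channel with reliability 0.50 is discarded as uninformative.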

#### 5.1.4 Parameter Initialization

We implemented SVR using Libsvm^{6} with the parameters set to -s 3 -t 0. Other methods are implemented in PyTorch (Paszke et al., 2017). We train a one-layer neural network for LR. For the sake of a fair comparison, we implemented a two-layer neural network for all nonlinear methods. In particular, we set the network dimensions to d-100-1, where d is the input feature dimension, which varies between baselines. All layers are densely (fully) connected. In terms of the channel reliability $\pi_n$, we aimed to eliminate the effects of noisy channels during the training process and therefore initialized $\pi_n$ to 0.5, $\forall n = 1, 2, \ldots, N$. The L2 norm is used, which is equivalent to adopting a standard gaussian prior for $w$: $w \sim \mathcal{N}(0, 1)$. In terms of the hyperparameters $(\alpha_n, \beta_n)$, as we intended to eliminate the effects of noisy channels, we adopted a strong symmetric prior for $\pi_n$: $\alpha_n = \beta_n = 100$, $\forall n = 1, 2, \ldots, N$, following Bishop (2006). The Adam method is used to optimize the weights $w$.^{7} In terms of the maximum number of iterations for our CArank, we set $MaxIter = 7$ in our experiment to ensure that the algorithm converged for each participant. The minibatch size is set to 256, and the learning rate to 0.001. In terms of LR and regression (C/A), the common mean squared error (MSE) is adopted as the loss function. In terms of classification (C/A), the negative log-likelihood (see equation 4.4) is adopted as the loss function, except that $\pi_n$ is fixed to 1, $\forall n = 1, 2, \ldots, N$.
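The initialization described above can be summarized in a short sketch (a NumPy stand-in for the PyTorch implementation; all names are illustrative, and only the stated settings — d-100-1 dimensions, gaussian weights, $\pi_n = 0.5$, $\alpha_n = \beta_n = 100$ — are taken from the letter):

```python
import numpy as np

def init_carank(d, n_channels, hidden=100, alpha=100.0, beta=100.0, seed=0):
    """Initialize CArank parameters: a d-100-1 network with standard
    gaussian weights (the L2 penalty corresponds to a N(0, 1) prior on
    w), channel reliabilities pi_n = 0.5, and a strong symmetric
    Beta(100, 100) prior whose mean matches that initialization.

    Training then uses Adam with learning rate 0.001, minibatch size
    256, and at most MaxIter = 7 passes (per the letter)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((d, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal((hidden, 1)),
        "b2": np.zeros(1),
        # pi_n = 0.5 for every channel: no channel is trusted or
        # distrusted before training.
        "pi": np.full(n_channels, 0.5),
        # Beta prior hyperparameters (alpha_n, beta_n) = (100, 100).
        "prior": [(alpha, beta)] * n_channels,
    }
```

For the 33-channel data, `init_carank(d, 33)` would produce one shared network plus one reliability and one prior pair per channel.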

### 5.2 Empirical Results of CArank on Brain Dynamics Preferences

#### 5.2.1 Comparison Based on Wilcoxon-Mann-Whitney Statistics

The Wilcoxon-Mann-Whitney statistics of all methods on the test BDPs are presented in Table 1. In terms of SVR, LR, and regression (C), the Wilcoxon-Mann-Whitney statistic of the predicted RTs is calculated with regard to the ground truth on the test BDPs. In terms of regression (A), we first collected the predicted RTs on the test BDPs by aggregating the predictions from each channel using majority voting and then calculated the Wilcoxon-Mann-Whitney statistic of the predicted RTs following equation 5.1.
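Since equation 5.1 is not reproduced in this section, the sketch below shows one plausible reading of the statistic: the fraction of significant ground-truth pairs whose ordering the predicted RTs preserve (names and conventions are illustrative assumptions):

```python
def wmw_statistic(pred_rts, sig_pairs):
    """Wilcoxon-Mann-Whitney-style pairwise accuracy.

    sig_pairs holds significant ground-truth pairs (i, j) with
    RT_i > RT_j; the statistic is the fraction of those orderings
    that the predicted RTs preserve. This is an assumed reading of
    equation 5.1, which is not reproduced here."""
    if not sig_pairs:
        return float("nan")
    hits = sum(1 for i, j in sig_pairs if pred_rts[i] > pred_rts[j])
    return hits / len(sig_pairs)
```

Under this reading, a perfect ranking scores 1.0 and a random ranking scores about 0.5, which matches the accuracy-style numbers quoted below (e.g., "accuracy above 75%").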

From Table 1, we offer the following observations:

1. *CArank > other baselines.* CArank exhibits consistent improvements over the other baselines. In particular, it achieves the highest test accuracy on 30 participants and comparable results on the rest. This is consistent with our motivation: classification serves as a relaxed alternative to regression, can effectively circumvent the overfitting caused by nonsmooth or extreme RTs, and preserves the ordering with regard to RTs. Meanwhile, compared with simple concatenation, our channel-reliability-aware formulation automatically eliminates the effects of EEG signals from noisy channels during the training process.

2. *Classification > SVR > regression.* The test accuracy of the classification-based methods is higher than that of their regression-based counterparts for most participants: classification (C) outperforms SVR and regression (C) on 24 and 33 participants, respectively, and classification (A) outperforms regression (A) on 26 participants. This observation is consistent with our statement that regression-based models easily overfit, especially when extreme values (RTs in our problem) exist.

3. *Concatenation > aggregation.* It is interesting that the test accuracy based on multiple-channel aggregation is significantly inferior to that of its counterparts based on simple feature concatenation. Specifically, regression (C) outperforms regression (A) on 33 participants, while classification (C) outperforms classification (A) on 38 participants. This is striking but reasonable: since a shared regression/classification model is trained under the multiple-channel aggregation formulation, the generalization performance inevitably degenerates when learning with noisy channels. Meanwhile, noisy channels universally exist; at least one noisy channel is detected for each participant according to Figure 6.

4. *SVR > nonlinear regression > LR.* Note that linear SVR shows performance superior to nonlinear regression (C) and LR on 26 and 32 participants, respectively. Since the inputs of SVR, LR, and regression (C) are the same, the only difference lies in the choice of loss function. SVR adopts a hinge-type ($\epsilon$-insensitive) loss, which is robust to outliers away from the boundary (Basak, Pal, & Patranabis, 2007). This is consistent with our analysis of the deficiency of the MSE loss used in the regression model (see section 2.1). Meanwhile, the performance of SVR is not stable, and it may achieve worse results on some participants (e.g., P10, P18, P31, P39, P40). Therefore, the hinge-type loss is also not the best choice compared to the classification setting, where our CArank universally achieves accuracy above 75% for the corresponding participants.

#### 5.2.2 Comparison Based on In-Degree Preservation

To further investigate the reliability of CArank in terms of preserving the global ordering corresponding to RTs, we first collected the in-degree sequences according to equation 5.2 using the predicted RTs and then measured the in-degree discrepancy between the calculated in-degree sequences and the ground truth using the root-mean-squared error, equation 5.3. The RMSEs for all participants are shown in Table 2.
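Equations 5.2 and 5.3 are not reproduced here; the sketch below shows one plausible reading, in which a trial's in-degree counts the significant pairs where it is the slower (larger-RT) trial, and the RMSE compares a predicted in-degree sequence against the ground truth:

```python
import math

def in_degrees(rts, sig_pairs):
    """In-degree sequence over the preference graph: trial i's
    in-degree counts the significant pairs in which it has the larger
    RT. This is an assumed reading of equation 5.2."""
    deg = [0] * len(rts)
    for i, j in sig_pairs:
        slower = i if rts[i] > rts[j] else j
        deg[slower] += 1
    return deg

def in_degree_rmse(pred_deg, true_deg):
    """Root-mean-squared discrepancy between two in-degree sequences
    (an assumed reading of equation 5.3)."""
    n = len(true_deg)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_deg, true_deg)) / n)
```

A method that merely preserves the global ordering of RTs, without predicting their exact values, already achieves zero in-degree RMSE under this reading, which is why the metric rewards ranking reliability rather than regression accuracy.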

Table 2: Test RMSE for each participant.

| Method | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 | P12 | P13 | P14 | P15 | P16 | P17 | P18 | P19 | P20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVR | 12.76 | 17.87 | 9.98 | 21.40 | 23.99 | 23.59 | 17.21 | 23.05 | 40.75 | 19.72 | 14.35 | 7.28 | 18.78 | 15.48 | 27.98 | 10.67 | 14.85 | 43.93 | 41.59 | 34.90 |
| LR | 13.26 | 22.81 | 10.14 | 21.69 | 54.85 | 32.89 | 13.56 | 46.83 | 38.37 | 34.34 | 14.01 | 7.34 | 18.65 | 25.34 | 47.00 | 18.76 | 18.26 | 45.59 | 73.37 | 38.72 |
| Regression (C) | 13.11 | 17.40 | 13.06 | 17.92 | 25.98 | 22.14 | 22.46 | 42.16 | 30.63 | 18.16 | 12.04 | 5.40 | 13.88 | 14.03 | 27.87 | 12.47 | 17.93 | 45.85 | 43.96 | 36.64 |
| Regression (A) | 11.92 | 19.12 | 13.46 | 18.27 | 24.65 | 26.14 | 21.17 | 38.24 | 37.44 | 20.53 | 11.82 | 11.56 | 19.15 | 16.71 | 26.91 | 14.82 | 17.30 | 35.59 | 46.23 | 41.59 |
| Classification (C) | 10.53 | 15.35 | 12.51 | 16.78 | 27.36 | 22.85 | 16.32 | 33.48 | 25.18 | 17.45 | 15.21 | 8.64 | 15.58 | 12.63 | 25.24 | 10.54 | 15.98 | 27.56 | 31.11 | 35.76 |
| Classification (A) | 9.20 | 18.28 | 13.54 | 18.35 | 27.15 | 22.83 | 25.98 | 44.48 | 49.43 | 20.36 | 14.61 | 9.65 | 18.83 | 12.23 | 25.20 | 15.19 | 17.42 | 42.07 | 48.42 | 44.38 |
| CArank | 8.66 | 16.97 | 12.02 | 16.27 | 20.71 | 19.97 | 13.98 | 25.71 | 12.52 | 13.70 | 12.73 | 9.41 | 12.43 | 10.06 | 23.95 | 8.99 | 5.74 | 25.00 | 31.39 | 28.00 |

| Method | P21 | P22 | P23 | P24 | P25 | P26 | P27 | P28 | P29 | P30 | P31 | P32 | P33 | P34 | P35 | P36 | P37 | P38 | P39 | P40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVR | 33.58 | 16.29 | 32.49 | 33.89 | 40.51 | 35.73 | 32.63 | 24.47 | 29.83 | 34.87 | 23.58 | 10.37 | 18.67 | 19.49 | 13.07 | 8.35 | 15.17 | 38.28 | 14.50 | 28.71 |
| LR | 56.49 | 38.90 | 31.20 | 58.26 | 60.63 | 34.56 | 64.32 | 61.78 | 70.42 | 36.94 | 20.29 | 32.28 | 29.31 | 36.47 | 14.54 | 7.96 | 14.62 | 55.18 | 15.84 | 30.40 |
| Regression (C) | 38.13 | 25.26 | 26.88 | 40.08 | 38.27 | 36.48 | 38.22 | 26.56 | 36.39 | 31.98 | 26.09 | 22.99 | 24.67 | 25.53 | 13.85 | 9.94 | 13.74 | 52.71 | 6.66 | 18.75 |
| Regression (A) | 48.77 | 22.10 | 27.19 | 41.04 | 37.01 | 46.19 | 46.01 | 31.61 | 41.87 | 36.54 | 26.77 | 23.33 | 23.51 | 25.03 | 12.49 | 9.23 | 17.07 | 46.63 | 17.52 | 24.70 |
| Classification (C) | 40.79 | 13.76 | 23.97 | 31.98 | 30.39 | 36.36 | 28.85 | 24.44 | 33.09 | 26.21 | 19.45 | 13.54 | 22.04 | 20.48 | 13.41 | 7.22 | 13.46 | 39.68 | 9.65 | 17.33 |
| Classification (A) | 51.50 | 15.99 | 30.35 | 37.33 | 37.94 | 47.82 | 44.97 | 37.25 | 57.80 | 42.96 | 26.74 | 23.69 | 24.75 | 26.88 | 19.49 | 6.59 | 16.26 | 46.13 | 7.61 | 16.97 |
| CArank | 37.77 | 11.77 | 25.49 | 16.44 | 29.67 | 30.72 | 19.38 | 26.32 | 26.00 | 28.49 | 8.00 | 11.65 | 12.94 | 12.06 | 5.34 | 7.03 | 11.17 | 36.77 | 3.83 | 15.14 |


Note: The shaded numbers indicate the best results.

From Table 2, we can draw similar conclusions. (1) Our CArank consistently achieves lower RMSE than the other baselines; in particular, it achieves the lowest test RMSE on 27 of 40 participants. (2) Apart from our CArank, classification (C) shows better performance than the remaining baselines. This is reasonable, since classification is robust to extreme RTs, while the concatenation approach is less affected by noisy channels than simple aggregation. (3) The differences between the other baseline methods become less clear-cut, because RMSE assigns a heavier penalty to estimates with larger errors.

#### 5.2.3 Visualization of Predicted In-Degrees

To further explore the superiority of our CArank, we visualized Table 2 using the in-degree sequences. For an intuitive interpretation, we showcase participants P9, P13, P22, P24, and P31, whose performance is most representative, in Figure 5. For the rest of the participants, our CArank also achieves superior performance with the lowest RMSE (see Table 2).

From Figure 5, we make five observations. First, overall, the in-degree sequences predicted by CArank closely align with the ground truth with only slight fluctuations (small RMSE), while the in-degree sequences predicted by the other baselines fluctuate significantly and fail to follow the trend of the ground truth (large RMSE). Second, the points located in the northeast denote the trials with high RTs (also called extreme RTs). The in-degree sequences predicted by CArank show slighter fluctuations there than those of the other baselines, which indicates that CArank can accurately detect the mental fatigue associated with higher RTs. In contrast, the other baselines either show large fluctuations (e.g., P9, P13, P24), leading to a high false-negative rate, or completely fail to follow the trend, leading to a high error rate. Third, the points located in the southwest denote the trials with small RTs. The in-degree sequences predicted by the other baselines show large fluctuations there (e.g., P22), leading to a high false-positive rate. Fourth, it is worth noting that the in-degree sequences predicted by regression (C/A) usually fluctuate heavily for low in-degree trials (small RTs) and high in-degree trials (large RTs). That is, regression (C/A) overestimates the RTs with small values and underestimates the RTs with large values, consistent with our claim that regression-based models are not suitable for tasks with a nonsmooth response variable (RT). Fifth, simple classification using multichannel aggregation, that is, classification (A), also shows heavy fluctuations, since it lacks an effective mechanism to aggregate the predictions from multiple channels. Classification (C) shows better performance but is still prone to overfitting, since it too cannot eliminate the effects of noisy channels during the training process.

### 5.3 Noisy Channel Detection

We also investigated the reliability of our CArank from the perspective of noisy channel detection. According to our analysis, the parameter $\pi_n$ in the transition matrix $\Pi_n$ indicates channel reliability. Hereafter, we leverage $\pi_n$ as the channel reliability indicator to detect noisy channels. Figure 6 lists the noisy channels (marked in red) detected with $0.15 \le \pi_n \le 0.85$, $\forall n = 1, 2, \ldots, N$.
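A minimal sketch of this detection rule, assuming `pi` holds the learned per-channel reliabilities:

```python
def noisy_channels(pi, lo=0.15, hi=0.85):
    """Indices of channels flagged as noisy: reliability trapped in the
    ambiguous band lo <= pi_n <= hi, where a channel's preference
    labels behave close to random guessing. The band (0.15, 0.85)
    matches the threshold used in Figure 6."""
    return [n for n, p in enumerate(pi) if lo <= p <= hi]
```

For example, `noisy_channels([0.95, 0.50, 0.20, 0.05])` flags channels 1 and 2, while channels whose reliability is near 0 or 1 remain informative.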

Figure 6 shows, first, that noisy channels universally exist among the EEG signals: at least one noisy channel is detected for each participant. For example, the 33rd channel is recognized as a noisy channel by CArank for almost all participants, which is reasonable since the 33rd channel is generally acknowledged to be irrelevant to any task (Lin et al., 2014). Second, for each participant, most channels are reliable, which ensures that we can always find enough support to train our CArank. Third, the detected noisy channels vary from participant to participant and do not possess the transitivity property between participants. The noise can arise from the intrinsic noninformative EEG channel (e.g., the 33rd channel for all participants); channels used for lateral mastoid references (e.g., the 23rd and 29th channels for the majority of participants) (Chatrian, Lettich, & Nelson, 1985); and improper experimentation or artifacts (for P13, P39, and P40) (Lin et al., 2018).

## 6 Limitations and Future Work

In this work, the cooperation mechanism among channels is simplified as a weighted majority voting system, and different trials are viewed independently. We intend to formulate it with more complex mechanisms, such as a Markov decision process (MDP), to conduct learning and decision making simultaneously. Some previous work (Chen, Jiao, & Lin, 2016; Chen, Lin, & Zhou, 2015) has studied the decision-making process among crowd (noisy) workers, which offers a promising way to investigate the cooperation mechanism among noisy channels in our setting. Efforts to apply this approach are underway.

Furthermore, brain dynamics are nonstationary and characterized by significant trial-by-trial variability (Yarkoni, Barch, Gray, Conturo, & Braver, 2009). Due to this variability, CArank would suffer repeated training and updating costs with respect to all new data. We therefore consider extending CArank to a real-time mental fatigue monitoring system by calibrating CArank online. Inspired by the work of Weng and Lin (2011) and Jaini et al. (2017), Bayesian moment matching offers a promising way to sequentially update a nonconjugate likelihood function (e.g., CArank) with analytic update rules.

## 7 Conclusion

This work proposes a CArank model to assess the state of mental fatigue. The efficacy of the model was demonstrated using EEG data collected from 40 participants in a sustained driving task. The model is combined with a stochastic-generalized expectation-maximization (SGEM) algorithm to provide efficient updates in the large-scale setting. CArank adopts a relaxed alternative to regression, ordinal classification, to circumvent overfitting to the extreme values of RTs. We demonstrated that the overall performance of CArank can be significantly improved by introducing a transition matrix, which enables the technique to evaluate the reliability of informative EEG channels while detecting noisy EEG channels. Empirical results show that CArank delivers significant improvements over simple classification and regression methods in terms of global ranking preservation.

## Notes

^{1}

When applied to BDPs, a subtle difference between the RTs may be caused not by an intrinsic difference between the BDPs but by unknown noise.

^{2}

In the following, we omitted the subscripts for simplicity.

^{4}

Here, the step size is set to $\eta_t = (t+2)^{-\tau_0}$, where $t$ is the number of iterations and $0.5 < \tau_0 < 1$. The smaller $\tau_0$ is, the larger the update $\eta_t$ is, and the more quickly we forget (decay) our old parameters. This can lead to swift progress but also generates instability.

^{5}

According to Huang et al. (2015), the couplings between pairs of the MCC, ACC, lSMC, rSMC, PCC, and ESC regions increased at an intermediate level of attention, which reveals that an enhancement of the cortico-cortical interaction is necessary to maintain task performance and prevent mental fatigue. Further, higher connectivity corresponds to optimal performance, while very few connected nodes correspond to poor performance. See Huang et al. (2015) for more information.

^{6}

https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

^{7}

In terms of the L-BFGS implementation, a Matlab code can be downloaded from Granzow (2017).

## Acknowledgments

I.W.T. is supported by ARC under grant DP180100106 and DP200101328. M.S. was supported by the International Research Center for Neurointelligence (WPI-IRCN) at the University of Tokyo Institutes for Advanced Study.