## Abstract

We present a graphical model framework for decoding in the visual ERP-based speller system. The proposed framework allows researchers to build generative models from which the decoding rules are obtained in a straightforward manner. We suggest two models for generating brain signals conditioned on the stimulus events. Both models incorporate letter frequency information but assume different dependencies between brain signals and stimulus events. For both models, we derive decoding rules and perform a discriminative training. We show on real visual speller data how decoding performance improves by incorporating letter frequency information and using a more realistic graphical model for the dependencies between the brain signals and the stimulus events. Furthermore, we discuss how the standard approach to decoding can be seen as a special case of the graphical model framework. The letter also gives more insight into the discriminative approach for decoding in the visual speller system.

## 1. Introduction

The Farwell and Donchin speller (Farwell & Donchin, 1988) is a brain-computer interface that enables users to spell words by focusing their attention on letters in a letter grid displayed on a computer screen. The user's electroencephalogram (EEG) is recorded while a sequence of controlled stimulus events takes place on the letters over time. A stimulus event in the standard visual speller is a short increase of the brightness (“flash”) of a specific group of letters on the screen. The pattern of flashes of the letter that the user is focusing on evokes a characteristic EEG signal that is correlated with the sequence of flashes of that letter over time. A computer program analyzes the recorded EEG signal, inferring the letter on which the user is focusing. This decoding is not a trivial task, since the signal-to-noise ratio of the relevant EEG signals is poor.

Increasing the communication rate of the speller can be achieved in two ways: by decreasing the time interval between stimulus events or reducing the number of stimulus events necessary for inferring the user-selected letter. The latter can be achieved by optimizing the design of the sequence of stimulus events. In the standard design, each stimulus event involves the flashing of letters in a particular row or column of the letter grid. Other designs exist that in theory would need fewer stimulus events per communicated letter for a given decoding accuracy than the standard design. We say that these designs have good error-correction capabilities. A new design for the visual speller with good error correction capabilities was studied by Hill, Farquhar, Martens, Biessman, and Schölkopf (2009). Surprisingly, the study revealed that these error-correcting designs in practice perform worse than the standard design in the visual speller. This finding was explained by the fact that the new design increases the number of flashes per communicated letter, leading to a reduction of the signal-to-noise ratio of the EEG due to refractory effects (Martens, Hill, Farquhar, & Schölkopf, 2009).

It seems that the stimulus design in the visual speller involves a trade-off between error-correcting capabilities and the amount of refractory effect. One possible solution to this is to reduce the refractory effects, for example, by using a more salient stimulus type (Hill et al., 2009; Martens et al., 2009). However, it is not clear whether this is an effective solution for all subjects, including patient users with a reduced attention span. Also, if the time interval between subsequent stimulus events were decreased, the refractory effects might become more pronounced again.

In this letter, we evaluate an alternative solution by explicitly taking into account the possibility of refractory effects in the decoding of visual speller data. For this purpose, we introduce the use of graphical models as a framework for the decoding process in the speller system.

We begin in section 2.1 by introducing some terminology for the speller system and discussing standard decoding. We then propose in section 2.2 two graphical models, which represent the generation of brain signals in response to the stimulus events, and in section 2.3, we derive the decoding rules based on these graphical models. The first graphical model (which is related to the standard way of decoding visual speller data) does not take into account refractory effects, whereas the second model does. We show how prior knowledge, like letter frequency information, can be easily incorporated in the decoding. We discuss in sections 2.4 and 2.5 subtleties in the training that can be understood more easily in the graphical model framework. We demonstrate in section 2.5 that the commonly used decoding approach gives a maximum a posteriori solution under a number of conditions. Finally, in section 3, we test if an error-correction design outperforms the standard design on real speller data using the proposed decoding.

## 2. Methods

### 2.1. Encoding and Decoding in the Visual Speller System.

The letter grid in the visual speller may contain numbers, punctuation characters, and other symbols. For simplicity, we will refer to the items in the grid as letters. We distinguish three units in the speller system: encoding, modulation, and decoding (see Figure 1). The encoding describes how each letter is encoded as a sequence of stimulus events over time. In the modulation process, the stimulus events on the user-selected letter are translated into attention-modulated brain signals. The decoding consists of inferring which letter was selected based on the measured brain signals.

While designing a good encoding for the speller, it is helpful to write down
these stimulus events per letter as code words. These are bit strings of length *N* for which each entry corresponds to a stimulus event. An
entry has the value 1 if the letter participates in the stimulus event and value
0 otherwise. The code word weight refers to the number of 1's in the code word.
The Hamming distance between two binary code words of equal length is the number
of positions for which the two code words have different values (Hamming, 1950). The collection of code words for all
the letters in the letter grid will be referred to as the codebook. Each column
in this codebook represents a stimulus event at a given point in time (see
Figure 1).

The standard encoding is one in which the stimulus events take place on rows and
columns of letters (Farwell & Donchin, 1988). We will refer to this codebook as the row-column codebook
(*RC*). The minimum Hamming distance *d* is
the smallest Hamming distance between any two code words in the codebook and is
related to how many misclassified code word entries can be corrected. An RC code
of length 24 has *d*=4. In contrast, a Hadamard code *HAD* of length 24 (Levenshtein, 1964) has *d*=12 and is
therefore expected to exhibit superior error-correction properties.
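As an illustrative sketch (not the experimental stimulus software), the row-column construction and its minimum Hamming distance can be checked in a few lines; the grid size and number of repetitions below are toy choices matching the length-24 example:

```python
from itertools import combinations, product

def hamming(u, v):
    """Number of positions at which two equal-length code words differ."""
    return sum(a != b for a, b in zip(u, v))

def rc_codebook(rows=6, cols=6, repetitions=2):
    """Row-column codebook: each stimulus event flashes one row or one column.
    One pass gives code words of length rows+cols; passes are concatenated."""
    book = {}
    for r, c in product(range(rows), range(cols)):
        word = [0] * (rows + cols)
        word[r] = 1          # the letter's row flashes
        word[rows + c] = 1   # the letter's column flashes
        book[(r, c)] = word * repetitions
    return book

book = rc_codebook()                      # 36 code words of length 24
d = min(hamming(u, v) for u, v in combinations(book.values(), 2))
weight = sum(next(iter(book.values())))   # number of 1's per code word
```

Letters sharing a row (or a column) differ in only one event per pass, so two passes give a minimum distance of 4; with six passes (length 72) the same construction yields weight 12 and minimum distance 12, consistent with the codebook described in section 3.1.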

In the standard decoding approach, a binary classifier is applied to the EEG epoch recorded after each stimulus event, yielding a vector of classifier outputs **k** = (*k _{1}*, …, *k _{N}*). The inferred letter is the one whose code word's inner product with the vector of classifier outputs is largest; that is, the letter for which the code word satisfies

$$\hat{c} = \arg\max_{c \in C} \sum_{j=1}^{N} c_j k_j, \qquad (2.1)$$

where *C* denotes the codebook.

### 2.2. Graphical Models.

Graphical models are useful tools to model the (conditional) independencies
between random variables. In this letter, we focus on a subset of graphical
models called directed acyclic graphs (DAGs). A DAG consists of a set of nodes and a set of directed edges between the nodes. Each node in the graph represents a random variable *X _{i}*, and missing edges between the nodes represent conditional independencies. The graph is assumed to be acyclic; that is, there are no directed paths *X*_{i₁} → ⋯ → *X*_{i_k} with *i*_{1} = *i _{k}*. Each node *X _{i}* has a set of parent nodes pa(*X _{i}*), which can be the empty set. We denote the set of *n* random variables by *X* = (*X _{1}*, …, *X _{n}*). For DAGs, the joint probability factors into the probabilities of the variables conditioned on their parents (Koller & Friedman, 2009):

$$p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p\bigl(X_i \mid \mathrm{pa}(X_i)\bigr). \qquad (2.2)$$

For instance, consider graph *G _{1}* depicted in Figure 2. The graph is a DAG with variables *t*, *c _{j}*, and *b _{j}*, *j* = 1, …, *N*, which are spread over three layers. We can think of this graph as a generative model describing the modulation process in the visual speller system. The variable *t* represents a letter from the alphabet that consists of all the items in the letter grid. Variable *c _{j}* represents an entry from the code word that corresponds to the letter *t*, which is selected by the user, and can take a value from the set {0, 1}, whereas *b _{j}* represents an EEG segment in which we expect to capture the response to a stimulus event *c _{j}*. This observed brain signal is a multidimensional and continuous-valued variable.

The directed edges between the letter *t* and the code word entries *c _{j}* represent direct influences. In fact, each letter is associated with a unique bit string such that all code word entries are determined if the letter is fixed, and vice versa. The directed edges between the code word entries *c _{j}* and the brain signals *b _{j}* also denote direct influences, although their relationship is not deterministic. For example, when a code word entry is set to 0, the corresponding brain response may be of small duration and small amplitude. When a code word entry is 1, the corresponding brain response may be of longer duration and larger amplitude, such as a P300 event-related potential response. In practice, *b _{j}* also consists of non-task-related signals such as background EEG signals, measurement noise, and artifacts. The amplitude of the signal that represents the task-related brain response is small compared to the total observed brain signal, which makes the decoding a nontrivial task.

Figure 2 shows an additional graph, *G _{2}*, which models the modulation process in a slightly more complex way. Although the variables in *G _{1}* and *G _{2}* are the same, there is a different dependency among the variables. In *G _{2}* there are edges between the brain signals *b _{j}* at time point *j* and the code word entries *c*_{j−1} at the previous stimulus event, which are absent in *G _{1}*. These slanted edges aim at modeling refractory and overlap effects: Martens et al. (2009) have shown that the shape of the brain response depends on the target-to-target interval (TTI) value and therefore on preceding codebook entries. In particular, if a preceding target code word entry *c*_{j−1} was a 1 (target stimulus event), the brain may not be able to produce the same response if the current code word entry *c _{j}* is again a 1. In addition, the brain response to a target event overlaps with the brain response to the subsequent stimulus event. We may extend this dependency to higher orders by adding more edges, such that *b _{j}* depends not only on *c _{j}* and *c*_{j−1} but also on *c*_{j−2}, and so on.

The two graphs entail the following conditional independencies:

- In *G _{1}*, if the value of the code word entry *c _{j}* at time point *j* is given, the probability distribution of the observed brain signals *b _{j}* at time point *j* is determined. Moreover, if *c _{j}* is given, the probability distribution of *b _{j}* does not depend on the letter *t*, previous code word entries (*c _{1}*, …, *c*_{j−1}), or previous brain signals (*b _{1}*, …, *b*_{j−1}).
- In *G _{2}*, if the value of the code word entry *c _{j}* at time point *j* is given, the probability of the observed brain signals *b _{j}* at time point *j* is still uncertain, since the probability distribution of *b _{j}* also depends on *c*_{j−1}. However, if *c*_{j−1} and *c _{j}* are given, the probability distribution of *b _{j}* does not depend on *t*, earlier code word entries (*c _{1}*, …, *c*_{j−2}), or earlier brain signals (*b _{1}*, …, *b*_{j−1}).

Applying the factorization of equation 2.2 to the two graphs gives the joint probabilities

$$p(t, c, b) = p(t) \prod_{j=1}^{N} p(c_j \mid t)\, p(b_j \mid c_j) \qquad (2.5)$$

for *G _{1}* and

$$p(t, c, b) = p(t) \prod_{j=1}^{N} p(c_j \mid t)\, p(b_j \mid c_{j-1}, c_j) \qquad (2.6)$$

for *G _{2}*, where we set the value of *c*_{0} in equation 2.6 to 0 with probability 1. Later we will be interested in the joint probability *p*(*c*, *b*) of the code word and the brain signals, given by

$$p(c, b) = p(c) \prod_{j=1}^{N} p(b_j \mid c_j) \qquad (2.7)$$

for *G _{1}* and

$$p(c, b) = p(c) \prod_{j=1}^{N} p(b_j \mid c_{j-1}, c_j) \qquad (2.8)$$

for *G _{2}*, with

$$p(c) = \sum_{t} p(c \mid t)\, p(t) \qquad (2.9)$$

expressing how the letter prior *p*(*t*) induces a code word prior *p*(*c*). Note that the prior *p*(*c*) of a code word *c* is equal to the letter prior *p*(*t*) of the letter corresponding to that code word (and vanishes for nonvalid code words *c* ∉ *C*).

### 2.3. Decoding Based on Graphical Models.

We wish to infer the letter with the largest probability given the measured brain signals. Since there is a one-to-one correspondence between the letters and the code words in codebook *C*, we may equivalently select the code word with the largest probability given the measured brain signals:

$$\hat{c} = \arg\max_{c \in C}\; p(c \mid b). \qquad (2.10)$$

This is equivalent to selecting the code word with the largest joint probability, since *p*(*c* | *b*) = *p*(*c*, *b*)/*p*(*b*) and *p*(*b*) is independent of the code word:

$$\hat{c} = \arg\max_{c \in C}\; p(c, b). \qquad (2.11)$$

The joint probability *p*(*c*, *b*) of the code word and the brain signals was defined previously for *G _{1}* and *G _{2}* in equations 2.7 and 2.8, respectively. Therefore, to perform the decoding according to equation 2.11, we need to find the distribution of brain signals given the current, and possibly preceding, code word entries. This generative approach has been successfully adopted for *G _{1}* in Martens and Leiva (2010). Another approach is to turn around the conditional probabilities in the joint in equations 2.7 and 2.8 by applying Bayes' rule: *p*(*x* | *y*) = *p*(*y* | *x*) *p*(*x*)/*p*(*y*). The resulting expressions for the joint probability can be inserted in equation 2.11 to obtain the decoding rules for *G _{1}* and *G _{2}*:

$$\hat{c} = \arg\max_{c \in C}\; p(c) \prod_{j=1}^{N} \frac{p(c_j \mid b_j)}{p(c_j)}, \qquad (2.12)$$

$$\hat{c} = \arg\max_{c \in C}\; p(c) \prod_{j=1}^{N} \frac{p(c_{j-1}, c_j \mid b_j)}{p(c_{j-1}, c_j)}, \qquad (2.13)$$

where the factor *p*(*b _{j}*) is independent of the code word and has therefore been discarded. Notice that by learning the conditional probabilities *p*(*c _{j}* | *b _{j}*) and *p*(*c*_{j−1}, *c _{j}* | *b _{j}*) in equations 2.12 and 2.13 from data, we perform a discriminative training for a generative model.
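The decoding rule of equation 2.12 amounts to an argmax over code words of a prior-weighted product of per-bit posterior ratios; a minimal sketch (with hypothetical posterior and prior tables standing in for trained classifier outputs) is:

```python
import math

def decode(codebook, letter_prior, post, prior_bit):
    """MAP decoding for the model without refractory effects:
    choose the code word maximizing p(c) * prod_j p(c_j|b_j)/p(c_j).
    `post[j][v]` is the discriminatively trained posterior p(c_j = v | b_j)
    for the observed epoch j; `prior_bit[j][v]` is the per-bit prior p(c_j = v).
    Sums of logs are used to avoid numerical underflow for long code words."""
    best, best_score = None, -math.inf
    for letter, word in codebook.items():
        score = math.log(letter_prior[letter])
        for j, cj in enumerate(word):
            score += math.log(post[j][cj]) - math.log(prior_bit[j][cj])
        if score > best_score:
            best, best_score = letter, score
    return best
```

The rule of equation 2.13 has the same shape, with the tables indexed by the pair (*c*_{j−1}, *c _{j}*) instead of the single entry *c _{j}*.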

### 2.4. Homogeneity Assumptions.

We will assume homogeneity such that in *G _{1}*, the conditional distribution *p*(*b _{j}* | *c _{j}*) in equation 2.7 does not depend on the bit *j* (for fixed values of *b _{j}* and *c _{j}*). This means that the brain signal *b _{j}* generated by a stimulus event defined by *c _{j}* for bit *j* cannot be distinguished from the brain signal generated by another stimulus event defined by *c _{i}* at bit *i*, if the two stimulus events have the same value (*c _{j}* = *c _{i}*). Similarly, given *G _{2}*, we assume that *p*(*b _{j}* | *c*_{j−1}, *c _{j}*) in equation 2.8 does not depend on the bit *j* for fixed values of *b _{j}*, *c*_{j−1}, and *c _{j}*.

It is important to note that the homogeneity assumption in *G _{1}* implies a bit independence for the probability distribution *p*(*b _{j}* | *c _{j}*) but not necessarily for the conditional probability *p*(*c _{j}* | *b _{j}*). Indeed, by using Bayes' rule on the homogeneity assumption *p*(*b _{j}* | *c _{j}*) = *p*(*b _{i}* | *c _{i}*), it follows that the equation *p*(*c _{j}* | *b _{j}*) = *p*(*c _{i}* | *b _{i}*) holds only if *p*(*c _{j}*) = *p*(*c _{i}*). These homogeneity assumptions are relevant for the training phase, as will be explained in the next section.
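The Bayes'-rule argument above can be checked numerically: with identical likelihoods at two bits but different bit priors (the values below are arbitrary toy numbers), the posteriors differ:

```python
# Homogeneous likelihoods: p(b | c) is the same at every bit.
lik = {0: 0.7, 1: 0.8}   # p(observed epoch | c), identical at bits "i" and "j"

# Bit-dependent priors p(c_j = 1), e.g. induced by unequal column weights.
prior_1 = {"i": 0.5, "j": 0.2}

def posterior_1(bit):
    """p(c = 1 | b) via Bayes' rule at the given bit: the likelihoods are
    shared, but the bit-specific prior makes the posterior bit dependent."""
    p1 = prior_1[bit]
    num = lik[1] * p1
    return num / (num + lik[0] * (1 - p1))
```

Here the posterior at bit "i" is 8/15 ≈ 0.53 but only 2/9 ≈ 0.22 at bit "j", even though *p*(*b* | *c*) is identical at both bits.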

### 2.5. Bit Dependencies.

The per bit conditional probability factor *p*(*c _{j}* | *b _{j}*) in equation 2.12 may be estimated for each bit *j* individually, using the training examples corresponding to that code word entry *c _{j}*. However, we may want to accumulate the training examples of all bits and estimate a conditional probability *f*(*c _{j}* | *b _{j}*) on the complete training set aggregated over all bits *j*, in favor of a more accurate estimation. Unfortunately, from section 2.4 we know that *p*(*c _{j}*) is bit dependent, and therefore *p*(*c _{j}* | *b _{j}*) is in general not equal to *f*(*c _{j}* | *b _{j}*). Consequently, we may not simply substitute the per bit conditional probability *p*(*c _{j}* | *b _{j}*) by the global conditional probability *f*(*c _{j}* | *b _{j}*) in the decoding rule of equation 2.12.

Under the homogeneity assumptions, we may instead replace *p*(*b _{j}* | *c _{j}*) and *p*(*b _{j}* | *c*_{j−1}, *c _{j}*) in equations 2.7 and 2.8 by the bit-independent distributions *f*(*b _{j}* | *c _{j}*) and *f*(*b _{j}* | *c*_{j−1}, *c _{j}*), respectively, and apply Bayes' rule to find the following decoding rules:

$$\hat{c} = \arg\max_{c \in C}\; p(c) \prod_{j=1}^{N} \frac{f(c_j \mid b_j)}{f(c_j)}, \qquad (2.20)$$

$$\hat{c} = \arg\max_{c \in C}\; p(c) \prod_{j=1}^{N} \frac{f(c_{j-1}, c_j \mid b_j)}{f(c_{j-1}, c_j)}, \qquad (2.21)$$

where *f*(*b _{j}*) is independent of the code word and has been discarded.

Notice the similarity of these to equations 2.12 and 2.13. It turns out that we can use the conditional probability *f*(*c _{j}* | *b _{j}*) estimated from all the bits in the training set if we divide by the global bias *f*(*c _{j}*) instead of *p*(*c _{j}*) for each bit *j*. This *f*(*c _{j}*) depends only on the value of *c _{j}* and is therefore bit independent, in contrast to *p*(*c _{j}*). From now on, we refer to the factors *f*(*c _{j}*) in equation 2.20 and *f*(*c*_{j−1}, *c _{j}*) in 2.21 as global bias correction factors. They correct for the presence of bit dependencies of *p*(*c _{j}* | *b _{j}*) and *p*(*c*_{j−1}, *c _{j}* | *b _{j}*). A more rigorous derivation of the global bias correction factors can be found in appendix C.

There are two special cases for which the product of the global bias correction factors in equation 2.20 is constant for all code words and consequently negligible in the decoding: (1) all code words have the same weight, and (2) each value of *c _{j}* is equally probable in the training set. However, this is not necessarily true for the product of the global bias correction factors in equation 2.21.
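The weight dependence of the correction product in equation 2.20 is easy to see numerically; the code words and pooled target frequency below are toy values:

```python
from functools import reduce

def bias_product(word, f1):
    """Product of the global bias correction factors f(c_j) for a code word,
    where f1 = f(c_j = 1) is the fraction of 1's in the pooled training set."""
    return reduce(lambda p, cj: p * (f1 if cj else 1 - f1), word, 1.0)

f1 = 0.3                       # pooled frequency of target events (toy value)
heavy = (1, 1, 1, 1, 0, 0)     # weight-4 code word
light = (1, 0, 0, 0, 0, 0)     # weight-1 code word
```

With *f*(*c _{j}* = 1) = 0.3 the products for the two words differ by more than a factor of 10, so ignoring the correction biases the argmax; with a balanced training set (*f*(*c _{j}* = 1) = 0.5) the product is the same for every code word and can be dropped.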

Equation 2.20 also shows that
the practice of balancing the number of training examples for the two classes
(as, e.g., in Kaper et al., 2004; Martens
et al., 2009) yields *f*(*c _{j}*)=0.5. In that case, the global bias correction factor becomes
code word independent and can be neglected. But if the balancing is done by
throwing away examples of the abundant class, the resulting reduction of the
training set will lead to a less accurate estimation of the conditional
probability.

The standard decoding method for visual speller data as defined in equation 2.1 arises as a special case of *G _{1}* under the following three additional assumptions: the classifier outputs can be transformed into probabilistic quantities according to a logistic function, all letters in the letter grid are equally likely, and all code words have the same weight (see appendix A). If one uses a classifier that gives nonprobabilistic outputs, it is unclear how to incorporate factors such as *f*(*c _{j}*), *f*(*c*_{j−1}, *c _{j}*), and letter priors *p*(*c*) in the decoding.

### 2.6. Training by Regularized Logistic Regression.

We estimate *f*(*c _{j}* | *b _{j}*) in equation 2.20 and *f*(*c*_{j−1}, *c _{j}* | *b _{j}*) in equation 2.21 by a logistic regression. A logistic regression models the posterior probabilities of the classes by a generalized linear model while at the same time ensuring that the probabilities sum to 1 and remain in [0, 1] (Hastie, Tibshirani, & Friedman, 2001). The models are as follows:

$$f(c_j = 1 \mid b_j) = \frac{1}{1 + \exp\bigl(-(\mathbf{w}^{\top} b_j + w_0)\bigr)}, \qquad (2.22)$$

$$f(c_{j-1} = k, c_j = l \mid b_j) = \frac{\exp\bigl(\mathbf{w}_{kl}^{\top} b_j + w_{0,kl}\bigr)}{\sum_{k', l' \in \{0,1\}} \exp\bigl(\mathbf{w}_{k'l'}^{\top} b_j + w_{0,k'l'}\bigr)}. \qquad (2.23)$$

The parameters $\mathbf{w}$ and $w_0$ in the binary classification problem 2.22 and $\mathbf{w}_{kl}$ and $w_{0,kl}$ in the multiclass classification problem 2.23 can be learned by maximum likelihood. A regularization term is added to the log likelihood to reduce the chance of overfitting (see appendix B for a more detailed description).
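As a sketch of the training step, the binary model of equation 2.22 can be fitted by simple regularized gradient ascent on the penalized log likelihood; this stands in for the iteratively reweighted least squares procedure described in appendix B, and the one-dimensional data are toy values rather than EEG features:

```python
import math

def fit_logistic(X, y, reg=0.1, lr=0.2, epochs=3000):
    """L2-regularized logistic regression f(c=1|b) = sigmoid(w.b + w0),
    fitted by batch gradient ascent (a stand-in for IRLS)."""
    d = len(X[0])
    w, w0 = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, g0 = [-reg * wi for wi in w], 0.0   # bias is not penalized
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + w0
            err = t - 1.0 / (1.0 + math.exp(-z))  # t - f(c=1|b)
            for i in range(d):
                gw[i] += err * x[i]
            g0 += err
        w = [wi + lr * gi / len(X) for wi, gi in zip(w, gw)]
        w0 += lr * g0 / len(X)
    return w, w0

def predict(w, w0, x):
    """f(c = 1 | b) under the fitted model."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + w0)))
```

The four-class model of equation 2.23 is fitted analogously with a softmax instead of a sigmoid.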

### 2.7. Letter Prior.

Suppose we have trained on a data set with a given letter prior *p*(*t*), whereas the letters in the test set come from a different distribution *p̃*(*t*). This may happen if we let the subject do a copy-spelling task for training with randomly drawn letters and then let the subject communicate proper sentences. Since we want to optimize the letter decoding performance on the test set, we should simply replace *p*(*c*) in equations 2.20 and 2.21 by the code word prior *p̃*(*c*) induced by the letter prior *p̃*(*t*) of the test set.
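Swapping in the test-set prior is a one-line substitution in the decoding; the two-letter codebook and the frequency numbers below are hypothetical:

```python
def codeword_prior_from_letters(codebook, letter_prior):
    """Induce the code word prior p(c) from a letter prior p(t):
    each valid code word inherits the probability of its letter."""
    return {tuple(word): letter_prior[t] for t, word in codebook.items()}

# Training used randomly drawn letters (uniform prior); at test time the
# subject spells natural text, so we swap in a frequency-based prior.
codebook = {"E": (1, 0), "T": (0, 1)}
train_prior = {"E": 0.5, "T": 0.5}
test_prior = {"E": 0.6, "T": 0.4}          # hypothetical frequencies
prior_for_decoding = codeword_prior_from_letters(codebook, test_prior)
```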

## 3. Real Visual Speller Data

### 3.1. Setup.

Eleven subjects performed a copy-spelling task with the visual speller system
implemented in the BCPy2000 platform (http://www.bci2000.org/wiki/index.php/Contributions:BCPy2000).
The subject observed a PC screen on which a 6 × 6 letter grid was
displayed (as in Figure 1). The task was to
focus attention on a specified letter of the grid and passively count the number
of times a stimulus event occurred on that letter. All subjects used the system
with a standard letter intensification type of stimulus. The time interval
between the start of one stimulus event and the start of the next event, the
stimulus onset asynchrony (SOA), was set to 183 ms. Each intensification lasted
100 ms and was followed by a no-intensification period of 83 ms. We recorded a
16-channel common-average-reference EEG sampled at 500 Hz using a QuickAmp
system (BrainProducts GmbH). Each subject spelled sentences from the book *The Diving Bell and the Butterfly* by Bauby (1998) until the subject indicated that he or she was
tired, resulting in 73 to 113 trials (letters spelled) per subject. Feedback was
given to the subjects after their spelling session. Two different codebooks of
length *N*=72 were used: a standard row-column codebook
(RC) and a Hadamard codebook (HAD; see Figure 3). For the RC codebook, each stimulus event occurred on 6 letters
in the same row or column, whereas for the HAD codebook, each stimulus event
involved 13 to 20 letters spread over the grid.

The codebooks alternated per communicated letter. The HAD codebook was created by
selecting 36 code words of a Hadamard code of length 24, permuting the columns
to increase randomness of target events, concatenating code words three times,
and assigning each resulting code word of length 72 to a letter in the
grid.^{1} The RC has a small minimum Hamming distance of 12, and the HAD has a
large minimum Hamming distance of 36. The weight of the code words is 12 for the
RC code and between 33 and 39 for the HAD codebook. The large percentage of 1's
in the HAD codebook leads to small TTI values, whereas the small percentage of
1's in the RC codebook results in a widespread distribution of TTI values (see
Figure 3). We expect that the
error-correcting capabilities of the HAD code book are diminished by strong
refractory effects due to the large number of small TTI targets. Nevertheless,
by applying the decoding method based on , which models these refractory effects, the HAD codebook may
outperform the RC code.
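The TTI values discussed above are simply the gaps between consecutive 1's within a code word; a small helper (with made-up example words, not the actual length-72 codebooks) shows why a heavy codebook concentrates its targets at small TTIs:

```python
def target_to_target_intervals(word):
    """Gaps (in stimulus events) between consecutive target events (1's)
    in a code word; small TTIs are where refractory effects are expected."""
    targets = [j for j, c in enumerate(word) if c]
    return [b - a for a, b in zip(targets, targets[1:])]

# Hypothetical example words: a dense (heavy) word yields only TTIs of 1,
# while a sparse word spreads its targets out.
dense = (1, 1, 1, 1, 0, 0)
sparse = (1, 0, 0, 1, 0, 1)
```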

### 3.2. Signal Analysis.

The signal analysis was performed offline in Matlab. The EEG was
bandpass-filtered between 0.5 and 10 Hz with steep FIR Bartlett-Hanning filters,
and cut up in 600 ms EEG epochs synchronized by the stimulus cues. These epochs
were downsampled to 25 Hz. We trained and decoded on the complete 72-bit code
words. We performed the training on a subset of *L* letters and tested the decoding performance on the remaining
letters. We applied an *l*-fold cross-validation on the training
set with the percentage of correctly inferred letters as a criterion to select
the optimal regularization parameter, using *l*=10 folds
if the number of training letters *L* was larger than 10, and *l*=*L* folds otherwise. After the
cross-validation, a logistic regression was trained on the complete training set
using the selected regularization parameter.

Decoding was done on the test set according to equations 2.20 and 2.21, where the learned logistic regression parameters were applied to the data in the test set according to equations 2.22 and 2.23. We set the letter priors based on English character frequencies in about 1415 works of fiction (http://millikeys.sourceforge.net/freqanalysis.html; see Figure 4). We calculated the decoding performance as the percentage of correctly inferred letters in the test set.

## 4. Results

### 4.1. Effect of Global Bias Correction.

We investigated the impact of the global bias correction on the decoding
performance (see section 2.5 for
theoretical background). For this purpose, one subject used a different Hadamard
code, which we will refer to as HADspecial. This codebook consists of just two
code words with weights 39 and 3, respectively. The resulting global bias
corrections take on completely different values for the two code words (see
Figure 5). This particular data set
contained 27 trials in which one of these two code words was communicated by the
subject. To increase the test set size, we split each communicated code word up
into three code words of length *N*=24, as if each code
word had been communicated three times.

We recalculated the prior probabilities of the two code words after setting the probabilities of the other letters in Figure 4 to 0, giving *p*(*E*)=0.34 and *p*(−)=0.66. We performed a decoding according to *G _{1}* as in equation 2.20. In addition, we performed a naive decoding that uses a logistic regression trained on all bits but ignores the global bias correction factor in the joint, and another naive decoding that uses a logistic regression trained on all bits but uses a wrong global bias correction factor.

The decoding performance of both naive decodings was lower than that of the decoding using the correct global bias correction factor (see Figure 5). The difference in performance between the correct decoding and one of the naive decodings was significant at the 5% level (40 training letters, one-tailed Fisher's exact test, *p*=0.003), whereas the difference with the other naive decoding was marginally significant (*p*=0.06).

### 4.2. Effect of Letter Frequency Information.

We investigated the increase in decoding performance when letter frequency information is used. For this purpose, we analyzed the visual speller data from the 10 subjects who used the RC and HAD codebooks. One subject did not reach above-chance performance and was left out of the analysis. We used equation 2.20 with a realistic letter prior as in Figure 4 and also with a uniform letter prior.

Using realistic prior knowledge about the probability of the letters increased the decoding performance by up to 5% (see Figure 6). The difference in performance between the realistic and the uniform prior was significant for the HAD data (5 training letters, Pearson's chi-square test, *p*=0.03) but not for the RC data (*p*=0.14).

### 4.3. *G _{1}* Versus *G _{2}*.

The two decoding methods from equations 2.20 and 2.21 were tested on visual speller data from the 10 subjects who used the RC and HAD codebooks. One subject did not reach above-chance performance and was left out of the analysis. For large training set sizes, graph *G _{2}* showed on average the same decoding performance as graph *G _{1}* on the RC data (see Figure 7), whereas *G _{2}* performed significantly better than *G _{1}* on the HAD data (40 training letters, Pearson's chi-square test, *p*=0.01). For small training set sizes, *G _{1}* performed better than *G _{2}* on the RC data (*p*=0.03) and equally well on the HAD data.

### 4.4. Effect of Codebook.

The decoding performance of the RC codebook was superior to that of the HAD codebook, independent of the number of training trials (see Figure 7). Using *G _{2}* for decoding instead of *G _{1}* improved the performance of the HAD code, but not so much that it outperformed the RC code.

## 5. Conclusion

The aim of this letter is to promote a flexible framework using graphical models for maximum a posteriori decoding in the speller. The framework can be seen as an upper level in the decoding process in which the researcher picks or designs a realistic graphical model for the generation of brain signals in response to stimulus events. We proposed two graphical models, *G _{1}* and *G _{2}*, each with different dependencies between the variables. We have shown that the commonly used decoding approach can be seen as a special case of the simple graphical model *G _{1}*.

The lower level involves the training or learning on the selected graphical model. We showed how to do this training discriminatively, and this principle has been successfully applied in speech recognition (Kapadia, 1998) and natural language processing (Collins, 2002). Although we applied a regularized logistic regression classifier to perform the learning, one has the freedom to use his or her favorite classifier as long as it gives quantities that can be interpreted probabilistically. For example, a support vector machine classifier, whose outputs were squeezed through a logistic function, resulted in a similar decoding performance as the logistic regression we used.

The homogeneity assumption for the generation of the brain signals allows us to perform the learning on the complete training data set instead of bit-wise. This is common practice in the literature. One should, however, be cautious if the conditional probability of a code word entry being 0 or 1 is bit dependent. Therefore, during decoding, a global bias correction should be included that corrects for the global bias in the training data set. The necessity of the global bias correction directly follows from the homogeneity assumption. We showed that the global bias correction is crucial if a codebook is used in which the code words have different weights and the global probability of a bit being 1 is far from 0.5.

In both graphical models, letter frequency information is incorporated by code word priors. The results demonstrate that adding letter frequency information improves the decoding performance. A next step would be to use letter priors conditioned on previously communicated letters.

Graph *G _{2}* models dependencies between brain signals and previous stimulus events and therefore recognizes the presence of refractory effects in the brain signals. The training and decoding involve the classification of pairs of bits, a four-class problem. The results show that this graphical model yields a better decoding performance on data sets in which the code words are characterized by a large weight. For small training set sizes, however, *G _{2}* suffers from a slower learning curve and performs worse than *G _{1}*. This is to be expected, since *G _{2}* encodes a more general class of models than *G _{1}*. Therefore, there is a trade-off between model generality and learning speed of the training procedure with respect to the number of data points.

We tested two codebooks: the standard row-column (RC) code and a Hadamard (HAD) code. If the per bit classification accuracy were independent of the codebook, the HAD codebook would be superior to the RC codebook. However, refractory effects lead to lower per bit classification accuracies in codebooks with a large weight such as the HAD codebook. In our experiment, we used the HAD codebook and tried to make the decoding suffer less from refractory effects by using the more sophisticated graphical model *G _{2}*. The effort we made in modeling the refractory effects by *G _{2}* improved the performance on the HAD data significantly, but not so much that the HAD outperformed the RC codebook. Our explanation for this finding is that *G _{2}* cannot simply make up for the reduction in binary classification performance: it merely expresses uncertainty for the bits with refractory effects, whereas *G _{1}* misclassifies these bits. The more realistic prior knowledge in the form of the English letter prior in Figure 4 is apparently not strong enough to exploit this difference.

Future work consists of testing the HAD codebook with a more salient stimulus type as in Martens et al. (2009) for which the refractory effects are reduced. In this setting, the HAD codebook is expected to outperform the RC codebook. A further increase in bit rate could then be achieved by speeding up the presentation of stimulus events. At faster stimulus rates, refractory effects are likely to occur even if salient stimulus types are used, and the system would profit from a decoding that models these effects in combination with stronger prior knowledge in the form of a letter prior conditioned on previously communicated letters.

## Appendix A: Relationship Between Standard Decoding and MAP Decoding

Selecting the code word that maximizes $\sum_{j} c_j k_j$ as in equation 2.1 is equivalent to a MAP solution as in equation 2.20 under the following three conditions:

1. The classifier outputs *k _{j}* can be transformed into probabilistic quantities according to a logistic function. If

   $$f(c_j = 1 \mid b_j) = \frac{1}{1 + \exp(-k_j)},$$

   then

   $$k_j = \log \frac{f(c_j = 1 \mid b_j)}{f(c_j = 0 \mid b_j)}.$$

   We rewrite equation 2.1 as

   $$\hat{c} = \arg\max_{c \in C} \sum_{j=1}^{N} \Bigl[\log\bigl(1 - f(c_j = 1 \mid b_j)\bigr) + c_j k_j\Bigr],$$

   where the first term log(1 − *f*(*c _{j}*=1 | *b _{j}*)) may be added since it is independent of *c _{j}*. Separating the cases *c _{j}*=0 and *c _{j}*=1 gives

   $$\hat{c} = \arg\max_{c \in C} \sum_{j=1}^{N} \log f(c_j \mid b_j). \qquad (2.27)$$

   From this we see that the standard decoding maximizes the product of the per bit conditional probabilities *f*(*c _{j}* | *b _{j}*).

2. All letters in the letter grid are equally likely (i.e., the code word prior is uniform). If the marginal probability of the code words *p*(*c*) is constant, this factor can be ignored in equation 2.20.

3. All code words have the same weight. In that case, the global bias factor *f*(*c _{j}*) in equation 2.20 can be ignored, and equation 2.27 is equivalent to equation 2.20.

## Appendix B: Fitting the Logistic Regression Parameters

The parameters of the logistic regression in equation 2.22 are fitted by minimizing a loss function consisting of the negative log likelihood plus a regularization term, with *R* the regularization parameter. Notice that the factor *p*(*b _{j}*) in the likelihood in equation B.2 may be neglected, since it does not depend on the parameters $\mathbf{w}$ and $w_0$. Alternatively, one can derive equation B.3 as the MAP estimate of the parameters $\mathbf{w}$ and $w_0$ if we assume a gaussian prior over the weights. To minimize the loss function, we set its derivative with respect to $\mathbf{w}$ and $w_0$ to zero. The resulting equations can be solved using iteratively reweighted least squares (Hastie et al., 2001). The derivations for the multiclass model in equation 2.23 are calculated likewise.

## Appendix C: Alternative Way of Training the Logistic Regression Classifier

In this appendix, we show that the bit dependency of *p*(*c _{j}* | *b _{j}*) can be dealt with in two different ways and that these two approaches are asymptotically equivalent. The derivation is presented only for *G _{1}*, since for *G _{2}* it is similar.

The brain signals *b _{j}* are *d*-dimensional real vectors. We assume that for each bit *j*, the probability of the class *c _{j}* given the brain signal *b _{j}* is of the logistic regression form

$$p(c_j = 1 \mid b_j) = \frac{1}{1 + \exp\bigl(-(\mathbf{w}_j^{\top} b_j + w_{j0})\bigr)}, \qquad (\mathrm{C.1})$$

where $\mathbf{w}_j$ and $w_{j0}$ are a *d*-dimensional weight vector and a bias parameter that both depend on the bit *j*. The conditional probability of the brain signal given the class follows from Bayes' rule, *p*(*b _{j}* | *c _{j}*) = *p*(*c _{j}* | *b _{j}*) *p*(*b _{j}*)/*p*(*c _{j}*), in which we separate off a factor that does not depend on *c _{j}*; this will turn out to be convenient later.

The homogeneity assumption states that *p*(*b _{j}* | *c _{j}*) does not depend on the bit *j*. Writing this assumption out for both values of *c _{j}*, summing over those values, and forming the quotient of the resulting equations, the factors that do not depend on *c _{j}* drop out, and we obtain constraints that relate the parameters of the different bits.

These constraints force all *N* weight vectors $\mathbf{w}_j$ to be identical to some global weight vector $\mathbf{w}$, and all *N* bias parameters to be related by

$$w_{j0} = w_0 + \log \frac{p(c_j = 1)}{p(c_j = 0)}$$

for some global bias parameter $w_0$. Thus, using the homogeneity assumption, we can rewrite equation C.1 as

$$p(c_j = 1 \mid b_j) = \frac{1}{1 + \exp\bigl(-(\mathbf{w}^{\top} b_j + w_0 + a_j)\bigr)}, \qquad a_j = \log \frac{p(c_j = 1)}{p(c_j = 0)}. \qquad (\mathrm{C.7})$$

The different conditional probabilities are now expressed in terms of the global weight vector $\mathbf{w}$, the global bias parameter $w_0$, and the logarithms of the prior class probabilities *p*(*c _{j}*). Note that the only dependence on *j* is now via the offsets *a _{j}*. In other words, the homogeneity assumption boils down to sharing parameters in combination with *j*-specific offsets for the bias parameter.

The global weight vector and global bias parameter in equation C.7 can be trained by (regularized) maximum likelihood in a way very similar to ordinary (regularized) logistic regression training, with the only difference that the bit-dependent offsets (which are given at training time) have to be taken into account. Although this is a possible approach, a disadvantage is that this particular training method has to be implemented from scratch: most off-the-shelf methods do not support offsets in the bias parameter that can differ for each training point.

Alternatively, consider the conditional probability *f*(*c _{j}* | *b _{j}*) obtained by aggregating the training examples of all bits *j*; again, a logistic regression form appears for this conditional distribution. This suggests an alternative way of learning the global weight vector and the global bias parameter: simply aggregate the training examples corresponding to different bits into one large pool, and train a logistic regression classifier on the aggregated data. Asymptotically this will yield the same global weight vector as before, and the resulting global bias parameter differs by only the logarithm of the global bias correction factor. Although the two training methods are asymptotically equivalent (where some care needs to be taken with the regularization), the latter method is much easier to implement in practice, as one can use an off-the-shelf logistic regression classifier.

## Acknowledgments

We thank Karin Bierig for the experimental help.

## Notes

^{1}

For Hadamard codes, *d*=*N*/2, such that a
Hadamard code of length *N*=72 bits would not have yielded
a larger *d* than the proposed concatenated code.