Abstract

We present a new method for fusing scores corresponding to different detectors (two-hypotheses case). It is based on alpha integration, which we have adapted to the detection context. Three optimization methods are presented: least mean square error, maximization of the area under the ROC curve, and minimization of the probability of error. Gradient algorithms are proposed for the three methods. Different experiments with simulated and real data are included. Simulated data consider the two-detector case to illustrate the factors influencing alpha integration and demonstrate the improvements obtained by score fusion with respect to individual detector performance. Two real data cases have been considered. In the first, multimodal biometric data have been processed. This case is representative of scenarios in which the probability of detection is to be maximized for a given probability of false alarm. The second case is the automatic analysis of electroencephalogram and electrocardiogram records with the aim of reproducing the medical expert's detections of arousals during sleep. This case is representative of scenarios in which the probability of error is to be minimized. The generally superior performance of alpha integration confirms the value of optimizing the fusion parameters.

1  Introduction

There are many scenarios where multiple detectors are to be fused to improve their individual performance (Khaleghi, Khamis, Karray, & Razavi, 2013; Atrey, Hossain, El Saddik, & Kankanhalli, 2010; Yuksel, Wilson, & Gader, 2012; Kittler, Hatef, Duin, & Matas, 1998). In general, the input to a single detector is a vector of measures (observation or feature vector) processed to obtain a scalar statistic to be compared with a threshold, thus obtaining a binary decision. Then the fusion of detectors can be made at three different levels: measures, statistics, or decisions. Finding optimum fusion functions becomes simpler as we go from the measures level to the decisions level, but a price is paid in loss of information. Therefore, fusion at the statistics (intermediate) level becomes a reasonable compromise. On the one hand, the number of variables to be fused is reduced to the number of available detectors; on the other hand, it avoids the loss of information caused by thresholding. Usually the statistic is called a "score." Depending on the application, the score may or may not be normalized to a given range. Different normalization techniques exist (Jain, Nandakumar, & Ross, 2005), which are especially interesting when heterogeneous detectors are to be fused. Normalized scores between 0 and 1 may be thought of as estimates of the a posteriori probability assigned by the detector to one of the two hypotheses if they are properly calibrated (Zadrozny & Elkan, 2002).

In this letter, we concentrate on the fusion of scores for detection purposes. Moreover, we make use of α-integration, which was proposed to integrate stochastic models in Amari (2007). The case of integrating gaussian mixtures was considered in Wu (2009). It is noteworthy that α-integration can be used to fuse or combine any finite set of d positive numbers (with weights that are nonnegative and sum to one) in the form
$$
m_\alpha = f_\alpha^{-1}\!\left(\sum_{i=1}^{d} w_i\, f_\alpha(m_i)\right),
\qquad
f_\alpha(z) =
\begin{cases}
z^{\frac{1-\alpha}{2}}, & \alpha \neq 1 \\
\log z, & \alpha = 1
\end{cases}
\tag{1.1}
$$
Amari (2007) has demonstrated that if the $m_i$ and $m_\alpha$ are, respectively, associated with probability density functions $p_i(x)$ and $p(x)$ of some random variable $x$, then $m_\alpha$ corresponds to the probability density minimizing the cost function
$$
J(p) = \sum_{i=1}^{d} w_i\, D_\alpha(p_i \,\|\, p)
\tag{1.2}
$$
where $D_\alpha$ is the α-divergence (Amari, 2007; Wu, 2009) between the two probability densities. Particularly simple fusion rules are obtained for particular selections of the parameter α. Thus, assuming equal weights, we see that α = −1, α = 1, and α = 3, respectively, render the arithmetic mean, the geometric mean, and the harmonic mean. Similarly, letting α tend to +∞ or −∞ is equivalent to computing the minimum or the maximum, respectively. Notice that equation 1.2 can be applied to the approximation of every positive function from a set of d positive functions.
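As a concrete illustration, the special cases above can be checked numerically. The helper below is a minimal sketch of the α-mean of equation 1.1 (the function and variable names are ours, not the authors'):

```python
import math

def alpha_mean(scores, weights, alpha):
    """Alpha-integration of positive numbers (cf. equation 1.1):
    f_alpha(z) = z**((1 - alpha)/2) for alpha != 1, log(z) for alpha == 1;
    the fused value is f_alpha^{-1}(sum_i w_i * f_alpha(s_i))."""
    if alpha == 1:  # geometric-mean limit
        return math.exp(sum(w * math.log(s) for w, s in zip(weights, scores)))
    p = (1.0 - alpha) / 2.0
    return sum(w * s ** p for w, s in zip(weights, scores)) ** (1.0 / p)

s, w = [0.2, 0.8], [0.5, 0.5]
print(alpha_mean(s, w, -1))   # arithmetic mean -> 0.5
print(alpha_mean(s, w, 1))    # geometric mean -> 0.4
print(alpha_mean(s, w, 3))    # harmonic mean -> 0.32
print(alpha_mean(s, w, 60))   # approaches min(s)
print(alpha_mean(s, w, -60))  # approaches max(s)
```

Large positive α pushes the fused value toward the minimum score and large negative α toward the maximum, which is what makes α a useful shape parameter for the decision regions later in the letter.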
Choi, Choi, Katake, and Choe (2010) and Choi, Choi, and Choe (2013) present gradient descent algorithms to estimate both the parameter α and the coefficients $w_i$ minimizing the mean square error (MSE) of the approximation achieved by α-integration to some target values $t_j$:
$$
E = \frac{1}{N} \sum_{j=1}^{N} \bigl( t_j - m_\alpha^{(j)} \bigr)^2
\tag{1.3}
$$

Expressions for the gradients are obtained, and convergence is experimentally tested in some simulated data.

α-integration can be readily adapted to the fusion of scores in a detection context. Several detectors will produce several scores, which can be fused using α-integration to obtain a unique (fused) score. In this letter, we propose three methods for estimating the fusing parameters (α and the weights) given a set of labeled training data. The first one is appropriate when working with normalized scores and is a direct adaptation of the least mean square error (LMSE) criterion in equation 1.3 to the detection problem. The possibly unbalanced number of labeled data between both hypotheses and the different costs incurred by every type of erroneous decision (detection miss or false alarm) are accounted for by some simple modifications of the cost function. A second method is proposed based on the maximization of the area under the ROC curve (AUCmax). This is a cost function well suited to the detection framework and allows both normalized and nonnormalized scores. These two methods are appropriate in applications where the probability of detection is to be maximized for a given probability of false alarm. However, there are scenarios where minimizing the probability of error is more convenient. Hence, we propose a third method (MPE) where the α-integration parameters are estimated so that the probability of making wrong decisions is minimized. This method requires that the scores be normalized. Gradient algorithms are devised for the three methods.

The next section is devoted to the LMSE approach. AUCmax is considered in section 3. Some experiments with the LMSE and AUCmax criteria based on simulations are presented in section 4 with the aim of illustrating the concept and the interest of the new methods of α-integration. Section 5 presents the application of α-integration with the LMSE and AUCmax criteria to biometric data. Finally, the MPE method is considered in section 6, applied to a medical diagnosis problem: automatic detection of arousals during sleep. Minimizing the wrong detections (relative to a medical expert) is the essential objective in this application. Section 7 concludes the letter.

2  Estimating the α-Integration Parameters by the LMSE Criterion

In a detection scenario, we must decide between two hypotheses H0 and H1. Let us assume that we have d different detectors working on the same hypotheses and that each one contributes a score, in a manner that higher values of the score play in favor of selecting H1 and vice versa. Let us also assume that the scores are normalized between 0 and 1. Apart from this, there are no other constraints. Thus, the specific way in which every detector computes its score is of no concern here. Similarly, the detectors may share the same input of observations or have totally different inputs, they may be statistically independent or not, and so on.

What we want is a unique fused score. Considering equation 1.1, the α-integration solution is given by
$$
m_\alpha(\mathbf{s}) = f_\alpha^{-1}\!\left(\sum_{i=1}^{d} w_i\, f_\alpha(s_i)\right)
\tag{2.1}
$$
Let us assume that sequences of labeled scores are available; that is, we have a set of couples $(\mathbf{s}_j, y_j)$, where $\mathbf{s}_j$ is the vector of scores provided by the d detectors and $y_j$ is the corresponding known binary label ($y_j = 1$ if H1 is true and $y_j = 0$ if H0 is true). We can use this set to learn the parameters by minimizing a cost function as indicated in equation 1.3, which now becomes
$$
E = \frac{1}{N} \sum_{j=1}^{N} \bigl( y_j - m_\alpha(\mathbf{s}_j) \bigr)^2
\tag{2.2}
$$

We see that by minimizing the cost function, equation 2.2, we are trying to approximate the fused score to 1 when the true hypothesis is H1 and to 0 when the true hypothesis is H0.

In many detection scenarios, there is a significant imbalance between the sizes of the subsets of the training set corresponding to H1 and H0. This is the case in novelty detection (Pimentel, Clifton, Clifton, & Tarassenko, 2014) or detection of signals in a noise background (Soriano, Vergara, Moragues, & Miralles, 2014). In those cases, minimization of equation 2.2 will be "blind" to H1. To account for this problem, we propose a modification of the cost function, equation 2.2. Let us call N1 and N0 the sizes of the subsets corresponding, respectively, to H1 and H0; hence N = N1 + N0. Instead of minimizing the overall mean square error, we compute separately the mean square errors corresponding to H1 and H0. Then the mean of both values is to be minimized. Taking advantage of the binary value of $y_j$, the new cost function can be expressed in the form
$$
E = \frac{1}{2} \left[ \frac{1}{N_1} \sum_{j=1}^{N} y_j \bigl( 1 - m_\alpha(\mathbf{s}_j) \bigr)^2
+ \frac{1}{N_0} \sum_{j=1}^{N} (1 - y_j)\, m_\alpha(\mathbf{s}_j)^2 \right]
\tag{2.3}
$$

In this manner, the contributions to the error are normalized with respect to the size of the training subsets.

Similarly, we can consider the possibility of weighting the contributions of the different types of errors. These can be of two types: decide H0 when the true hypothesis is H1 (detection miss) or decide H1 when the true hypothesis is H0 (false alarm). A simple modification of equation 2.3 can consider this option:
$$
E = \beta\, \frac{1}{N_1} \sum_{j=1}^{N} y_j \bigl( 1 - m_\alpha(\mathbf{s}_j) \bigr)^2
+ (1 - \beta)\, \frac{1}{N_0} \sum_{j=1}^{N} (1 - y_j)\, m_\alpha(\mathbf{s}_j)^2,
\qquad 0 \le \beta \le 1
\tag{2.4}
$$

Notice that the modification in the new LMSE cost function of equation 2.4 implies a different weighting for every training sample's contribution to the MSE as computed in equation 2.2. Also notice that N0 and N1 are the numbers of available training samples of each class, and the error-weighting factor is a value fitted by the user depending on the importance given to each type of error. However, α and the weights can be estimated to minimize equation 2.4. In the following, we present gradient algorithms to estimate their optimum values.

We have to compute the derivatives of the error cost function in equation 2.4 with respect to α and the weights $w_i$. Let us define
formula
2.5
Then
formula
2.6a
formula
2.6b
formula
2.6c
Moreover,
formula
2.7a
formula
2.7b
Hence the corresponding gradient algorithms will be
formula
2.8
formula
2.9
where the derivatives with respect to α and $w_i$ are obtained by respectively using equations 2.6 and 2.7, evaluated at the current estimates where necessary. The learning-rate constants control the speed of convergence. In all the experiments in this letter, these values have been fitted using values similar to those recommended in Choi et al. (2010, 2013). Small variations around those values influenced the convergence speed, but the final estimates remained the same.
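To make the procedure concrete, the following toy sketch minimizes a balanced, error-weighted squared-error cost in the spirit of equation 2.4 over α for two detectors with uniformly distributed scores. It is an illustrative assumption: the data, the learning rate, and the finite-difference gradient (used here in place of the closed-form derivatives of equations 2.6 and 2.7) are our own choices:

```python
import numpy as np

def fuse(S, w, alpha):
    """Alpha-integrate each row of the N x d score matrix S with weights w."""
    p = (1.0 - alpha) / 2.0
    if abs(p) < 1e-6:  # geometric-mean limit (alpha -> 1)
        return np.exp(np.log(S) @ w)
    return (np.power(S, p) @ w) ** (1.0 / p)

def cost(alpha, w, S, y, beta=0.5):
    """Balanced, beta-weighted squared-error cost (spirit of equation 2.4)."""
    m = fuse(S, w, alpha)
    n1, n0 = y.sum(), (1 - y).sum()
    miss = np.sum(y * (1.0 - m) ** 2) / n1        # H1 samples pushed toward 1
    fa = np.sum((1 - y) * m ** 2) / n0            # H0 samples pushed toward 0
    return beta * miss + (1.0 - beta) * fa

rng = np.random.default_rng(0)
S = np.vstack([rng.uniform(0.4, 1.0, (200, 2)),    # hypothetical H1 scores
               rng.uniform(0.05, 0.65, (200, 2))]) # hypothetical H0 scores
y = np.r_[np.ones(200), np.zeros(200)]
w = np.array([0.5, 0.5])  # weights kept fixed; only alpha is learned here

alpha, mu, h = 0.0, 0.1, 1e-5
for _ in range(300):
    grad = (cost(alpha + h, w, S, y) - cost(alpha - h, w, S, y)) / (2 * h)
    alpha -= mu * grad
print(alpha, cost(alpha, w, S, y))  # cost should not exceed its initial value
```

In the letter's actual algorithms both α and the weights are updated with the analytic gradients, and the weights are kept on the simplex; the sketch above only illustrates the descent mechanics.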

3  Estimating the α-Integration Parameters by AUCmax

The LMSE criterion minimizes the MSE, where the error is defined as the (weighted) difference between the final (integrated) score and the target value (1 for H1, 0 for H0). This seems a priori a reasonable criterion for obtaining a good detector, but it by no means implies that the probability of detection is maximized for a given probability of false alarm. Ultimately, the detector performance depends on the statistical distribution of the integrated scores under every hypothesis. This suggests the convenience of a new criterion that directly incorporates the detector performance.

Different figures of merit have been proposed to evaluate the detector performance (Parker, 2013). Among them, the AUC is the most popular. Moreover, AUC has two advantages in comparison with MSE:

  • We can optimize the fusion in specific intervals of the probability of false alarm depending on the application requirements.

  • Scores of the labeled training set are not required to be normalized between 0 and 1.

An ROC curve represents the probability of detection Pd as a function of the probability of false alarm Pf; let us represent this curve by the function $P_d(P_f)$. We can compute the area associated with that function in a given interval of the independent variable Pf by integrating $P_d(P_f)$; the result of the integral will be the AUC corresponding to that interval. Let us define a normalized AUC in a given interval:
$$
\mathrm{nAUC} = \frac{1}{P_{f2} - P_{f1}} \int_{P_{f1}}^{P_{f2}} P_d(P_f)\, dP_f
\tag{3.1}
$$
where $P_{f1}$ and $P_{f2}$ limit the interval of interest where the normalized AUC is to be computed.
We must find the parameter set such that
$$
\{\hat{\alpha}, \hat{w}_1, \ldots, \hat{w}_d\} = \arg\max_{\alpha,\, w_1, \ldots, w_d} \mathrm{nAUC}
\tag{3.2}
$$
under the constraints
$$
\sum_{i=1}^{d} w_i = 1, \qquad w_i \ge 0, \quad i = 1, \ldots, d
\tag{3.3}
$$
In the following, we propose a new method to solve this optimization problem. The cost function will be obtained by means of the empirical nonparametric method recently proposed in Narasimhan and Agarwal (2013) for measuring the partial AUC, which was presented as an improvement of the one in Dodd and Pepe (2003).
The training set, consisting of instances of score vectors and the corresponding fused scores, can be divided into two subsets corresponding to each hypothesis, H1 or H0:
$$
S_1 = \bigl\{ m_\alpha(\mathbf{s}_j) : y_j = 1 \bigr\},
\qquad
S_0 = \bigl\{ m_\alpha(\mathbf{s}_j) : y_j = 0 \bigr\}
\tag{3.4}
$$
Let us denote the H0 subset sorted in descending order of fused scores:
$$
\bar{S}_0 = \bigl\{ \bar{s}_{(1)} \ge \bar{s}_{(2)} \ge \cdots \ge \bar{s}_{(N_0)} \bigr\}
\tag{3.5}
$$

We can evaluate the normalized area by numerical integration. This can be done by uniformly sampling the ROC curve, adding all the sampled values, and normalizing by the total number of samples. To define the sampling points, we take into account that the test is implemented by comparing the fused score with a threshold t; therefore, every threshold establishes one point of the ROC curve. We select consecutive values of the sorted H0 set in a given interval as thresholds. For every threshold, we count the number of values in S1 that are above the threshold; this number divided by N1 is an empirical estimate of Pd for that threshold. Summing all the values so obtained and dividing by the total number of summed values, we obtain an empirical estimate of the normalized AUC in a given interval. The selected interval of thresholds must be in concordance with the Pf interval. But notice that, as the elements of the sorted H0 set are arranged in descending order, the kth threshold corresponds to an empirical value $P_f = k/N_0$. Then the limits of the selected interval of threshold indices are obtained by rounding: the lower limit is rounded up to the next higher whole number and the upper limit is rounded down to the next lesser whole number. This leads to the empirical normalized AUC estimator of equations 3.6a and 3.6b. We consider separately in equation 3.6a the case in which the limits of the interval of integration are so close that, after truncation, the order of the limits is inverted. In that case, only one sample of Pd is used for estimating the normalized AUC. The other cases are all included in equation 3.6b. Notice that the truncation effects are compensated by the term a1:

  • If :
    formula
    3.6a
  • If :
    formula
    3.6b

Here, the logic function returns 1 when the evaluated relation is true and 0 otherwise. Defining a new variable as the difference between the compared quantities, a unit step function can be used instead of the logic function:
formula
3.7
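The threshold-counting estimator just described can be sketched directly in code. This is a simplified reading of equations 3.6a and 3.6b (the index handling is ours and omits the a1 correction term):

```python
import numpy as np

def empirical_nauc(s1, s0, f_lo=0.0, f_hi=1.0):
    """Empirical normalized AUC over the Pf interval [f_lo, f_hi].

    The thresholds are the H0 fused scores sorted in descending order:
    the k-th threshold yields an empirical Pf of k/N0, and Pd is estimated
    as the fraction of H1 fused scores above that threshold."""
    s0 = np.sort(np.asarray(s0, dtype=float))[::-1]
    s1 = np.asarray(s1, dtype=float)
    n0 = len(s0)
    k_lo = max(int(np.ceil(f_lo * n0)), 1)   # round interval limits to
    k_hi = int(np.floor(f_hi * n0))          # whole threshold indices
    pds = [np.mean(s1 > s0[k - 1]) for k in range(k_lo, k_hi + 1)]
    return float(np.mean(pds))

print(empirical_nauc([0.8, 0.9, 0.7], [0.1, 0.2, 0.3]))  # perfect separation -> 1.0
```

When the two score populations are identical the estimator hovers around 0.5, as expected for a chance-level detector.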
In order to transform the empirical normalized AUC into a differentiable function, a continuous approximation of the unit step function must be used. A natural choice is the sigmoid function (Herschtal & Raskutti, 2004):
$$
u(z) \approx \sigma(z) = \frac{1}{1 + e^{-\lambda z}}
\tag{3.8}
$$

As can be observed in Figure 1, the sigmoid function can approximate the unit step function with arbitrarily small approximation error by selecting a large enough slope value.
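This behavior is easy to verify numerically. The small sketch below measures the worst-case approximation error away from the jump for increasing slope values (the clipping is only there to avoid floating-point overflow in the exponential):

```python
import numpy as np

def sigmoid(z, lam):
    """Smooth approximation of the unit step; lam controls the slope.
    Clipping the exponent prevents floating-point overflow."""
    return 1.0 / (1.0 + np.exp(-np.clip(lam * z, -60.0, 60.0)))

z = np.linspace(-1.0, 1.0, 201)
step = (z > 0).astype(float)
errs = [float(np.max(np.abs(sigmoid(z, lam) - step)[np.abs(z) > 0.055]))
        for lam in (5.0, 50.0, 500.0)]
print(errs)  # the maximum error away from the jump shrinks as lam grows
```

Near the jump itself the error is unavoidably close to 0.5, which is why a large slope matters only when the compared scores are well separated from the threshold.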

Figure 1:

Approximating a unit step function using a sigmoid function.


A constrained nonlinear minimization problem is stated by using the sigmoid function in equations 3.6a and 3.6b:
formula
3.9

To solve this optimization problem, an interior-point algorithm can be used (Byrd, Hribar, & Nocedal, 1999; Waltz, Morales, Nocedal, & Orban, 2006).

Because the differentiable sigmoid function is used in expressions 3.6a and 3.6b, the gradient of the objective function can be obtained to improve the interior-point algorithm. Differentiating the estimator with respect to a generic parameter θ:

  • If :
    formula
    3.10a
  • If :
    formula
    3.10b

Continuing with the differentiation chain, the partial derivative of the sigmoid function must be obtained:
$$
\frac{\partial \sigma(z)}{\partial \theta}
= \lambda\, \sigma(z)\bigl(1 - \sigma(z)\bigr)\, \frac{\partial z}{\partial \theta}
\tag{3.11}
$$
The partial derivative of the new variable depends on the partial derivative of the fused score, which is known (see equations 2.6b and 2.7b):
formula
3.12

Substituting the generic parameter by α and $w_i$ in the differentiation chain of equations 3.10 to 3.12, 2.6b, and 2.7b, the gradient of the objective function can be obtained straightforwardly.

4  Experiments with Simulated Data

We have performed a number of simulations with the aim of illustrating the different factors influencing α-integration for the fusion of detectors, as well as the specific interest of the proposed modifications. We have considered the fusion of two detectors (d = 2). Every detector provides one score, which is modeled as a random variable uniformly distributed in a given interval that depends on the true hypothesis. Each interval has lower and upper limits that depend on the detector i and the hypothesis Hk.

We show in Figures 2 to 10 the results of nine experiments. In experiments 1 to 6, the LMSE gradient algorithm was used to estimate the optimum value of α and/or the weights. In experiments 7 to 9, the AUCmax criterion was used.

Figure 2:

Experiment 1: .


Every figure is formed by six subfigures showing (from left to right and from top to bottom):

  • The 2D distribution of the training set of scores

  • The curves of convergence of the parameter α and/or the coefficients w1 and w2 corresponding to the gradient algorithms of equations 2.8 and 2.9

  • The ROC curves of the three detectors (two individual detectors and the fused one) representing the probability of detection Pd in terms of the probability of false alarm Pf

  • The 2D contour curves defining the decision regions of the α-integrated detector

  • The uniform distributions of the scores s1 and s2 corresponding to every individual detector

  • The final distributions of the score obtained after α-integration

In all the experiments, the training (estimation of the optimum value of α and/or the weights) was made by using labeled scores. The evaluation of performance (ROC curves and fused-score distributions) was obtained from a set of 10,000 scores. Other experiments were made using different training and evaluation sizes, but the general conclusions remained the same.

Figure 2 corresponds to experiment 1. As we see, the parameter α is learned by means of the gradient algorithm, equation 2.8, and converges to a final value (0.6) after only some 15 iterations. The sizes of the training sets are the same for both hypotheses. The weighting coefficients are not estimated but are fixed to the same value. The error-weighting parameter is set so that no preference is given a priori to either hypothesis. The limits of the uniform distributions of the individual scores imply a large overlap between both hypotheses when working separately with the individual detectors. However, the distributions of the integrated score are no longer uniform, showing the better separation between hypotheses achieved after α-integration. This can also be observed by looking at the ROC curves.

Experiment 2 (see Figure 3) illustrates the interest of the modification included in equation 2.3 to account for possibly different sizes of the training sets under each hypothesis. Thus, in Figure 3 the sizes of the training sets under the two hypotheses are very different. The rest of the parameters are the same as those of the first experiment. We can see that the parameter α converges to the same value as in experiment 1; hence, the performance of the detector after fusion should be the same. This is verified by observing that the ROC curves, the 2D contours, and the distribution of the integrated score are practically the same in both experiments.

Figure 3:

Experiment 2: .


The next two experiments illustrate how the error-weighting parameter may be used to bias the α-integrated detector toward one of the two hypotheses. Thus, in Figure 4, we show the same case as in Figure 2, except that now the weighting is chosen so that the contribution to the global error due to deciding H0 when the true hypothesis is H1 is much more significant than the contribution due to deciding H1 when the true hypothesis is H0. We see in Figure 4 that α converges toward large negative values, that is, the α-integrated detector tends toward computing the maximum of the two individual scores, which clearly biases the decisions in favor of H1. This bias can also be observed in the form adopted by the 2D contour curves defining the decision regions and in the resulting distributions of the fused score. Finally, we see in the ROC curves that for a probability of false alarm greater than a certain value, the individual detectors have a greater probability of detection than the α-integrated detector.

Figure 4:

Experiment 3: .


Experiment 4 is similar to experiment 3, but now the weighting is reversed, so that the fusion of detectors is biased in favor of H0. We see in Figure 5 that α converges toward large positive values, that is, the fusion tends toward computing the minimum of the two individual scores. The 2D contour curves and the resulting distributions of the fused score are modified accordingly. Finally, we see in the ROC curves that for a probability of false alarm less than a certain value, the individual detectors have a greater probability of detection than the fused detector.

Figure 5:

Experiment 4: .


The next experiment illustrates that an optimum linear combiner (weighted arithmetic mean) of the individual scores is a particular constrained case of α-integration. Notice in equation 2.1 that if α = −1, then the fused score is the weighted arithmetic mean of the individual scores.
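A quick numerical check of this special case (reimplementing the α-mean of equation 2.1 as a hypothetical helper of our own):

```python
import math

def alpha_mean(scores, weights, alpha):
    """Alpha-integration (cf. equation 2.1); f_alpha(z) = z**((1-alpha)/2)."""
    if alpha == 1:
        return math.exp(sum(w * math.log(s) for w, s in zip(weights, scores)))
    p = (1.0 - alpha) / 2.0
    return sum(w * s ** p for w, s in zip(weights, scores)) ** (1.0 / p)

s, w = [0.3, 0.9], [0.25, 0.75]
fused = alpha_mean(s, w, -1.0)
linear = w[0] * s[0] + w[1] * s[1]
print(fused, linear)  # identical: alpha = -1 is the weighted arithmetic mean
```

With α pinned at −1, only the weights remain free, which is exactly the constrained setting of experiment 5 below.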

We show in Figure 6 the results of experiment 5, which is the same case as experiment 1, except that α is fixed to −1 and the weighting coefficients are learned by the gradient algorithm, equation 2.9. Because both individual detectors produce the same score distribution (both detectors perform the same), the gradient algorithm converges to equal weights. Notice that the contour curves are now straight lines, in concordance with equation 2.9. This implies some suboptimality with respect to experiment 1, where the optimum α was learned by the gradient algorithm and was different from −1. The suboptimality can also be appreciated by comparing the ROC curve of the α-integrated detector in Figure 6 with the corresponding curve of Figure 2.

Figure 6:

Experiment 5: .


Experiment 6 illustrates the case of combining two detectors having different performances. We have modified experiment 5 so that the uniform distribution of the scores of detector 1 under H1 is narrowed (between 0.4 and 1). This implies that detector 1 performs better than detector 2 under H1. Then we see in Figure 7 that the gradient algorithm converges to weights such that detector 1 has more influence in the final α-integrated detector. This produces a rotation of the 2D contours to accommodate a bias toward the decisions of detector 1.

Figure 7:

Experiment 6: .


In the next three experiments, we use the parameter estimation method based on AUCmax. Using this method, we can select the interval of probabilities of false alarm in which we want to obtain the best results. These experiments are like experiment 1, where two equal detectors are fused by means of α-integration, but now the new training method based on AUCmax is used. First, in experiment 7, we estimated all the parameters to maximize the normalized AUC over the whole Pf interval. The results are presented in Figure 8. In this case, we can see how the weighting parameters obtained are the same for each detector, and the estimated parameter α converges to a value such that the whole AUC of the ROC curve obtained after fusion is maximized. Notice that the ROC curves are quite similar to the ones in Figure 2.

Figure 8:

Experiment 7: Parameters are optimized by AUCmax.


In the two final experiments, we have changed the Pf interval of the ROC curves in which we want to maximize the AUC: one interval is used in experiment 8 (see Figure 9) and a different one in experiment 9 (see Figure 10).

Figure 9:

Experiment 8: Parameters are optimized by AUCmax.


Figure 10:

Experiment 9: Parameters are optimized by AUCmax.


In these two cases, due to the identical behavior of both detectors, the estimated weighting parameters are equal, but α, which controls the shape of the separation frontiers, converges to a value that allows a better probability of detection after fusion in the specified false-alarm intervals of the ROC curves.

5  Application of α-Integration in Biometric Score Fusion

Biometrics refers to the automatic identification of an individual based on his or her physiological traits (Jain et al., 2004). The performance of a biometric system can be measured by reporting its false accept rate (FAR), equivalent to the concept of probability of false alarm considered so far, and its false reject rate (FRR), equivalent to one minus the probability of detection. These systems are subject to low-FAR requirements (usually less than 0.1%).

Biometric systems based on a single source of information (unimodal systems) suffer from such limitations as the lack of uniqueness, nonuniversality, and noisy data (Jain & Ross, 2004) and hence may not be able to achieve the desired performance requirements of real-world applications. In contrast, multimodal biometric systems combine information from their component modalities to arrive at a decision (Ross & Jain, 2003). Multimodal biometric authentication requires fusing information from different modalities (e.g., fingerprint, face, iris, retina, voice). Several studies (Toh, Jiang, & Yau, 2004; Wang, Tan, & Jain, 2003) have demonstrated that by consolidating information from multiple sources, better performance can be achieved compared to the individual unimodal systems.

In a multimodal biometric system, integration can be done at the feature level, matching-score level, or decision level. Matching-score-level fusion is commonly preferred because matching scores are easily available and contain sufficient information to distinguish between a genuine and an impostor case. Given a number of biometric systems, one can generate matching scores for a prespecified number of users even without knowing the underlying feature extraction and matching algorithms of each biometric system. Thus, combining the information contained in the matching scores seems both feasible and practical (Dass, Nandakumar, & Jain, 2004).

In this letter, we have tested the use of α-integration to fuse the matching scores in a multimodal biometric system. In particular, we have used the public database Biometric Scores Set—Release 1 (BSSR1) (U.S. Department of Commerce, 2013). BSSR1 is a set of raw output similarity scores from two face recognition systems and one fingerprint system, operating on frontal faces and on left and right index live-scan fingerprints, respectively. The data are intended to permit interested parties to investigate a range of outstanding statistical problems related to biometrics. BSSR1 contains three partitions (see Table 1).

Table 1:
Description of the BSSR1 Partition Content.

Partition 1: 2 measures of 2 face matchers.
Partition 2: 1 measure of the right and 1 measure of the left index fingerprint of 1 fingerprint matcher.
Partition 3: 517 individuals; 4 detectors (1 measure of 2 face matchers, 1 measure of the right and 1 measure of the left index fingerprint of 1 fingerprint matcher); scores available by detector: Total: 517, Genuine: 517.

Many possible experiments may be devised from these three partitions. We have selected four experiments whose results are respectively shown in Tables 2 to 5. In all the experiments, we have obtained the genuine accept rates (GARs, equal to 1 − FRR) corresponding to three different FARs for different methods of score fusion. The GAR values shown are the average of 30 iterations. In every iteration, the available score sets of the corresponding BSSR1 partition have been randomly divided into two halves. The first half has been used for training and the second for evaluation.
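The evaluation metric can be sketched as follows: for a target FAR, the decision threshold is taken from the impostor-score distribution, and the GAR is the fraction of genuine scores that pass it (a minimal illustrative helper; the name and interface are our own):

```python
import numpy as np

def gar_at_far(genuine, impostor, far):
    """GAR at a fixed FAR: the threshold is the (1 - far) quantile of the
    impostor scores, and GAR is the fraction of genuine scores above it."""
    thr = np.quantile(np.asarray(impostor, dtype=float), 1.0 - far)
    return float(np.mean(np.asarray(genuine, dtype=float) >= thr))

impostor = np.linspace(0.0, 1.0, 1001)  # hypothetical impostor scores
genuine = np.array([0.95, 0.85])        # hypothetical genuine scores
print(gar_at_far(genuine, impostor, 0.1))  # threshold near 0.9 -> 0.5
```

In the tables below, the same computation is repeated at the three operating points FAR = 0.001%, 0.01%, and 0.1%.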

Table 2:
Experiment 1. GAR (%) Corresponding to Different Methods Applied to Partition 1 of BSSR1.

Method                          FAR 0.001%   FAR 0.01%   FAR 0.1%
Arithmetic mean                     97.859      98.823     99.510
Geometric mean                      96.229      98.691     96.609
Min                                 72.305      79.816     85.724
Max                                 97.424      98.622     99.426
α-integration (LMSE)                83.767      97.019     98.693
α-integration (AUCmax, nAUC)        98.851      99.135     99.601

Notes: Scores were normalized by using equation 5.1. Numbers in bold indicate the best result.

In Table 2, α-integration based on the LMSE and AUCmax criteria is compared with simpler rules. Partition 1 was considered, and the scores were normalized between 0 and 1, a requirement for α-integration based on LMSE. Normalization was done by computing the a posteriori probability of each hypothesis given the score:
$$
s_{\mathrm{norm}} = \frac{p(s \mid H_1)\, P(H_1)}{p(s \mid H_1)\, P(H_1) + p(s \mid H_0)\, P(H_0)}
\tag{5.1}
$$
where $s_{\mathrm{norm}}$ and $s$ are, respectively, the scores after and before normalization, $p(s \mid H_k)$ is the probability density of $s$ conditioned on hypothesis $H_k$, and $P(H_k)$ is the a priori probability of hypothesis $H_k$. The a priori probabilities were estimated from the percentages of instances of $H_k$ inside the training set of scores. Moreover, $p(s \mid H_k)$ has been estimated using nonparametric gaussian kernel methods. Other methods of normalization are possible (Jain et al., 2005), but their influence on the results is out of the scope of this work.
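A minimal sketch of this normalization, assuming gaussian-kernel density estimates with a Silverman-type bandwidth (the bandwidth rule and the helper names are our choices, not the authors'):

```python
import numpy as np

def kde(x, data, bw=None):
    """Gaussian kernel density estimate of p(s | Hk) at the point x."""
    data = np.asarray(data, dtype=float)
    if bw is None:  # Silverman-type rule of thumb (our assumption)
        bw = 1.06 * data.std() * len(data) ** -0.2
    z = (x - data) / bw
    return np.mean(np.exp(-0.5 * z * z)) / (bw * np.sqrt(2.0 * np.pi))

def normalize(s, s_h1, s_h0):
    """Posterior-probability normalization as in equation 5.1, with priors
    taken from the training-set proportions."""
    n1, n0 = len(s_h1), len(s_h0)
    p1, p0 = n1 / (n1 + n0), n0 / (n1 + n0)
    num = kde(s, s_h1) * p1
    return num / (num + kde(s, s_h0) * p0)

rng = np.random.default_rng(1)
genuine = rng.normal(0.8, 0.05, 300)   # hypothetical H1 training scores
impostor = rng.normal(0.2, 0.05, 300)  # hypothetical H0 training scores
print(normalize(0.8, genuine, impostor))  # close to 1
print(normalize(0.2, genuine, impostor))  # close to 0
```

A raw score deep inside the genuine cluster maps to a normalized score near 1, and one inside the impostor cluster maps near 0, as required by the LMSE targets.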

As we can see in Table 2, the best results are obtained with α-integration (AUCmax). Tuning the maximization of the AUC in an interval of the ROC curve is important in this experiment, as can be seen by comparing with the results obtained by α-integration (LMSE). In fact, notice that in some cases of Table 2, α-integration (LMSE) performs even worse than some of the simple rules. This is because no direct maximization of the GAR is made by α-integration (LMSE), which reinforces the interest of the newly proposed AUCmax criterion.

In experiments 2, 3, and 4, we considered the original scores without normalization; hence, α-integration (LMSE) was not applied. Each experiment corresponds to a different partition. Thus, we show in Tables 3, 4, and 5 the results obtained with partitions 1, 2, and 3, respectively. We can see in all cases the superior performance of fusion based on α-integration (AUCmax), thus showing the interest of optimizing the fusing parameters.

Table 3:
Experiment 2: GAR (%) Corresponding to Different Methods Applied to Partition 1 of BSSR1.

Method                          FAR 0.001%   FAR 0.01%   FAR 0.1%
Arithmetic mean                     92.990      93.901     96.172
Geometric mean                      90.799      92.864     95.404
Min                                 57.969      73.896     84.135
Max                                 87.161      90.223     93.436
α-integration (AUCmax, nAUC)        98.093      99.417     99.611

Notes: Scores are not normalized. Numbers in bold indicate the best result.

Table 4:
Experiment 3: GAR (%) Corresponding to Different Methods Applied to Partition 2 of BSSR1.

Method                          FAR 0.001%   FAR 0.01%   FAR 0.1%
Arithmetic mean                     88.393      91.170     93.895
Geometric mean                      85.410      89.007     92.304
Min                                 75.546      79.740     84.425
Max                                 86.570      90.298     93.311
α-integration (AUCmax, nAUC)        88.542      91.409     94.011

Notes: Scores are not normalized. Numbers in bold indicate the best results.

Table 5:
Experiment 4: GAR (%) Corresponding to Different Methods Applied to Partition 3 of BSSR1.

Method                          FAR 0.001%   FAR 0.01%   FAR 0.1%
Arithmetic mean                     50.752      65.018     77.320
Geometric mean                      65.135      74.904     83.998
Min                                 59.807      71.365     81.538
Max                                 49.176      63.914     76.416
α-integration (AUCmax, nAUC)        66.799      75.971     84.995

Notes: Scores are not normalized. Numbers in bold indicate the best result.

6  Estimating the α-Integration Parameters by MPE: An Application in Medical Diagnosis

So far we have considered that the ROC curve of the integrated detector is the essential element to be optimized by α-integration. This is implicitly done with the LMSE criterion by trying to obtain integrated scores as close as possible to 1 when the true hypothesis is H1 or to 0 when H0 is in force. On the other hand, ROC curves are explicitly optimized by AUCmax. This approach is appropriate in detection problems where having control of the probability of false alarm Pf is a crucial aspect. However, there are applications where it is better to minimize the probability of error Pe (i.e., the probability of selecting the wrong hypothesis). This is a typical criterion in digital transmission, where an error happens whenever a symbol "1" is decided in reception but the emitted symbol was "0," or vice versa. Thus, Pe becomes the essential figure of merit of digital communication system performance. There are other areas where minimizing the Pe of a detector is the appropriate optimization goal. One of them is automatic medical diagnosis. Often, long biosignal records (e.g., electrocardiogram (ECG) and electroencephalogram (EEG) recordings) must be visually analyzed by the medical expert to detect the possible presence of some predefined events in the signals. The number and sequencing of these events may help in the diagnosis of pathologies. This task can be eased and dramatically accelerated by replacing the expert with an automatic detector. In this type of problem, the goal is to reproduce the detections of the expert, which are considered correct detections, as closely as possible. Hence minimizing Pe is the best option.

In this section, we show the results obtained by α-integration in the implementation of an automatic detector that integrates two scores corresponding to different modalities (EEG and ECG). Before that, we propose a new method for estimating the α-integration parameters that is more appropriate for this kind of scenario: the minimum probability of error (MPE) criterion.

As in section 2, let us assume that we have a training set of couples $\{(\mathbf{s}_j, y_j)\},\; j = 1, \ldots, J$, where $\mathbf{s}_j$ is the vector of scores provided by the detectors and $y_j$ is the corresponding known binary decision ($y_j = 1$ if H1 is true and $y_j = 0$ if H0 is true). Minimization of the Pe corresponding to the foregoing set is equivalent to maximization of the probability of making correct decisions across the whole set of couples. Let us call $P_{cj}$ the probability of taking a correct decision $y_j$ from the fused score $s_j(\alpha, \mathbf{w})$; it can be expressed as

$$P_{cj} = y_j\, P[H_1 \mid \mathbf{s}_j] + (1 - y_j)\, P[H_0 \mid \mathbf{s}_j]. \tag{6.1}$$
We assume that the scores to be fused are normalized and calibrated (Jain et al., 2005; Zadrozny & Elkan, 2002), so that we can consider that every score is an estimate of the a posteriori probability of H1 given the corresponding observation. Therefore, after α-integration, we have that $s_j(\alpha, \mathbf{w}) = P[H_1 \mid \mathbf{s}_j]$. Then, substituting in equation 6.1,

$$P_{cj} = y_j\, s_j(\alpha, \mathbf{w}) + (1 - y_j)\,\bigl(1 - s_j(\alpha, \mathbf{w})\bigr). \tag{6.2}$$
Let us call $P_c$ the probability of making correct decisions across the whole set of couples $\{(\mathbf{s}_j, y_j)\}$. If the measurements are independent for different values of the index $j$, we can write

$$P_c = \prod_{j=1}^{J} P_{cj}. \tag{6.3}$$
Finally, taking logarithms in equation 6.3 and changing the sign, we define the cost function to be minimized with the MPE criterion:

$$C_{\mathrm{MPE}}(\alpha, \mathbf{w}) = -\log P_c = -\sum_{j=1}^{J} \log\bigl[\, y_j\, s_j(\alpha, \mathbf{w}) + (1 - y_j)\,\bigl(1 - s_j(\alpha, \mathbf{w})\bigr)\,\bigr]. \tag{6.4}$$
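The cost of equation 6.4 can be sketched in a few lines of Python for fused scores that are already available; the function name and array conventions here are ours, not the paper's.

```python
import numpy as np

def mpe_cost(s_fused, y):
    """Negative log-probability of correct decisions (eq. 6.4).

    s_fused : array of alpha-integrated scores in (0, 1), assumed to be
              calibrated estimates of P(H1 | scores).
    y       : array of binary ground-truth decisions (1 for H1, 0 for H0).
    """
    p_correct = np.where(y == 1, s_fused, 1.0 - s_fused)  # eq. 6.2, per couple
    return -np.sum(np.log(p_correct))                     # eqs. 6.3 and 6.4
```

Confident, correct scores (close to 1 under H1 and to 0 under H0) drive the cost toward zero, while confident mistakes are penalized heavily by the logarithm.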
Minimization can also be done by a gradient algorithm. Let us compute the required derivatives:

$$\frac{\partial C_{\mathrm{MPE}}}{\partial \alpha} = -\sum_{j=1}^{J} \frac{2 y_j - 1}{y_j\, s_j(\alpha, \mathbf{w}) + (1 - y_j)\,\bigl(1 - s_j(\alpha, \mathbf{w})\bigr)}\; \frac{\partial s_j(\alpha, \mathbf{w})}{\partial \alpha}, \tag{6.5a}$$

$$\frac{\partial C_{\mathrm{MPE}}}{\partial \mathbf{w}} = -\sum_{j=1}^{J} \frac{2 y_j - 1}{y_j\, s_j(\alpha, \mathbf{w}) + (1 - y_j)\,\bigl(1 - s_j(\alpha, \mathbf{w})\bigr)}\; \frac{\partial s_j(\alpha, \mathbf{w})}{\partial \mathbf{w}}, \tag{6.5b}$$
where $\partial s_j(\alpha, \mathbf{w}) / \partial \alpha$ can be computed using equations 2.6b and 2.6c, and $\partial s_j(\alpha, \mathbf{w}) / \partial \mathbf{w}$ can be computed using equation 2.7b. Hence, a gradient algorithm like the one in equations 2.8 and 2.9 can be implemented using these new derivatives to obtain the MPE parameters.
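A minimal end-to-end sketch of such a gradient algorithm follows. Two caveats: the weighted α-mean below is the standard form from the α-integration literature (Amari, 2007), since section 2 is not reproduced here; and central finite differences stand in for the analytic gradients of equations 6.5a and 6.5b for brevity. All function names are ours, and the cost uses the mean rather than the sum of equation 6.4 (same minimizer, better-scaled steps).

```python
import numpy as np

def alpha_mean(S, w, a):
    # Weighted alpha-mean of the scores (assumption: the standard form from
    # the alpha-integration literature; a = -1 gives the arithmetic mean,
    # a -> +inf approaches the minimum, a -> -inf the maximum).
    if np.isclose(a, 1.0):
        return np.exp(np.log(S) @ w)  # geometric-mean limit at a = 1
    return (S ** ((1.0 - a) / 2.0) @ w) ** (2.0 / (1.0 - a))

def mpe_cost(S, y, w, a):
    s = alpha_mean(S, w, a)
    return -np.mean(np.log(np.where(y == 1, s, 1.0 - s)))  # eq. 6.4, mean form

def fit_mpe(S, y, steps=2000, mu=0.05, eps=1e-5):
    # Gradient descent on the MPE cost; finite differences replace the
    # analytic derivatives of eqs. 6.5a and 6.5b in this sketch.
    a = 0.0
    w = np.full(S.shape[1], 1.0 / S.shape[1])
    for _ in range(steps):
        ga = (mpe_cost(S, y, w, a + eps) - mpe_cost(S, y, w, a - eps)) / (2 * eps)
        gw = np.array([(mpe_cost(S, y, w + eps * e, a)
                        - mpe_cost(S, y, w - eps * e, a)) / (2 * eps)
                       for e in np.eye(S.shape[1])])
        a = float(np.clip(a - mu * ga, -20.0, 20.0))  # numerically safe range
        w = np.clip(w - mu * gw, 1e-6, None)
        w /= w.sum()                                  # keep weights on the simplex
    return a, w
```

On synthetic data in which only the first detector is informative, the fitted weight vector shifts toward that detector, mirroring the unbalanced weights reported later for subject 4.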
With these estimated parameters, given any vector of scores, we are able to compute the integrated score $s(\alpha, \mathbf{w})$. Then we implement the test that makes the final decision in a form consistent with the essential objective of minimizing Pe. It is well known in detection theory (Hippenstiel, 2002) that the optimum detector that minimizes Pe from (in this case) the observation $s(\alpha, \mathbf{w})$ is obtained by the test

$$P[H_1 \mid s(\alpha, \mathbf{w})] \; \underset{H_0}{\overset{H_1}{\gtrless}} \; P[H_0 \mid s(\alpha, \mathbf{w})]. \tag{6.6}$$
But $P[H_0 \mid s(\alpha, \mathbf{w})] = 1 - P[H_1 \mid s(\alpha, \mathbf{w})]$, so equation 6.6 is equivalent to

$$P[H_1 \mid s(\alpha, \mathbf{w})] \; \underset{H_0}{\overset{H_1}{\gtrless}} \; \frac{1}{2}. \tag{6.7}$$
But we have assumed that $s(\alpha, \mathbf{w}) = P[H_1 \mid \mathbf{s}]$, so the MPE test will simply be

$$s(\alpha, \mathbf{w}) \; \underset{H_0}{\overset{H_1}{\gtrless}} \; \frac{1}{2}. \tag{6.8}$$
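The test of equation 6.8 and the resulting empirical probability of error reduce to a couple of one-liners; the function names are ours.

```python
import numpy as np

def mpe_decide(s_fused):
    # Eq. 6.8: decide H1 whenever the calibrated fused score exceeds 1/2.
    return (s_fused > 0.5).astype(int)

def empirical_pe(s_fused, y):
    # Fraction of decisions that disagree with the ground-truth labels.
    return float(np.mean(mpe_decide(s_fused) != y))
```

In the arousal-detection experiment below, `empirical_pe` corresponds to the complement of the coincidence percentages reported in Table 6.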
We have considered the MPE criterion in the estimation of the α-integration parameters in an application of medical diagnosis. The problem belongs to the area of computer-assisted sleep staging (Agarwal & Gotman, 2001). In particular, we want to build an automatic detector of arousals during sleep, since their frequency of appearance is related to the presence of apnea and epilepsy. Normally arousals are detected by a medical expert from a visual inspection of the so-called polysomnograms (PSG), a set of EEGs obtained from the patient while sleeping. This manual task is tedious and susceptible to error after a long period of analysis. Salazar, Vergara, and Miralles (2010) therefore proposed an automatic technique that, by extracting four features from the PSG signals, generates automatic detections of arousals in every 30-second epoch. The method consists of a Bayesian classifier that assumes a hidden Markov model for the evolution of the sleep stages and a nongaussian mixture model for the multivariate probability density in the feature space (see Salazar et al., 2010, for details).

Here we want to verify the possible improvement in the detection of arousals by combined use of EEG and ECG information. From the ECG records, and after some standard signal processing (Kaufmann, Sütterlin, Schulz, & Vögele, 2011), the heartbeats (R-peaks) are extracted. Then the sequence of RR intervals between consecutive R-peaks is formed. This is termed the heart rate variability (HRV) signal, which has been extensively used for health monitoring (see Bouziane, Yagoubi, Vergara, & Salazar, 2015). Three features are extracted in every 30-second epoch. Two of them are time domain features: the mean and the standard deviation of the RR intervals. The third feature is the quotient between the low-frequency (LF) (0.04–0.15 Hz) and high-frequency (HF) (0.15–0.4 Hz) powers, obtained from the power spectral density (PSD) of the RR sequence.
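The three HRV features just described can be sketched as follows. The rate at which the RR series is evenly resampled and the Welch segment length are our assumptions, since the text does not specify them.

```python
import numpy as np
from scipy.signal import welch

def hrv_features(rr, fs=4.0):
    """Mean and standard deviation of the RR intervals, plus the LF/HF
    power ratio, for one 30-second epoch.

    rr : RR-interval series (seconds), evenly resampled at fs Hz
         (assumption: the resampling rate is not stated in the text).
    """
    f, pxx = welch(rr, fs=fs, nperseg=min(len(rr), 128))
    df = f[1] - f[0]
    lf_band = (f >= 0.04) & (f < 0.15)   # low-frequency band
    hf_band = (f >= 0.15) & (f <= 0.40)  # high-frequency band
    lf = pxx[lf_band].sum() * df         # rectangle-rule band power
    hf = pxx[hf_band].sum() * df
    return float(np.mean(rr)), float(np.std(rr)), float(lf / hf)
```

Note that a 30-second epoch gives a coarse frequency resolution for the LF band; the sketch simply mirrors the per-epoch feature extraction described in the text.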

For the experiment, we had four subjects. EEG and ECG signals were synchronously recorded for every subject during sleep. Every recording session lasted about 7.5 hours (around 900 epochs of 30 seconds). A medical expert generated a binary decision for every epoch (presence or absence of arousal): the target decision or ground truth. For every subject, we used the first half of the recording session for training and the second half for testing the detectors' performance. Using the methods described in Salazar et al. (2010), a score is generated from the EEG information for every epoch. Similarly, a score is obtained from the ECG features described in the previous paragraph, using a support vector machine (SVM) classifier. Finally, both scores are α-integrated.

The goal is to reproduce the manual detections given by the expert as closely as possible; then every discrepancy with the expert will be considered an error, and the probability of error is to be minimized. Hence, the decisions corresponding to the EEG and ECG modalities are obtained by respectively introducing the EEG score and the ECG score into the test of equation 6.8. On the other hand, we have used the MPE criterion for estimating the α-integration parameters, and the α-integrated score is also fed into the test of equation 6.8 to generate decisions.

The left side of Table 6 shows the results in terms of the percentage of decisions that coincide with the expert decisions for the three possible automatic cases: isolated scores obtained from the EEG signals, isolated scores obtained from the ECG signals, and scores derived from α-integration of both. The corresponding α-integration parameters estimated with the MPE criterion are indicated on the right side of Table 6. We see that improvements after α-integration appear in subjects 1, 2, and 3; the percentage corresponding to subject 4 is the same as the one obtained with isolated ECG scores. The very large value of α corresponding to subject 4 confirms that the minimum score is selected, which seems to correspond to the ECG score in this case. Moreover, the weights are clearly unbalanced in favor of the ECG score in subject 4. In any case, notice that α-integration yields a performance that is at least as good as the best individual performance. Thus, even in the case of no improvement, α-integration is able to "select" the best automatic detector of the two available.
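The "min selection" effect of a very large α, invoked above to explain subject 4, can be checked numerically. The α-mean form below is the one commonly used in the α-integration literature, and the score values are illustrative, not taken from the experiment.

```python
import numpy as np

def alpha_mean(s, w, a):
    # Weighted alpha-mean of the scores; a -> +inf approaches the minimum,
    # a -> -inf the maximum (a != 1 assumed here).
    return float((w @ s ** ((1.0 - a) / 2.0)) ** (2.0 / (1.0 - a)))

s = np.array([0.80, 0.94])      # hypothetical scores from the two modalities
w = np.array([0.20, 0.80])      # weights unbalanced toward the second score
fused = alpha_mean(s, w, 96.0)  # alpha of the order estimated for subject 4
# fused lies much closer to min(s) than the weighted arithmetic mean does
```

Even with the weights favoring the larger score, the huge α pulls the fused value toward the smaller of the two scores.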

Table 6:
(Left) Percentage of Decisions Coincident with the Expert Decisions, Corresponding to EEG Scores, ECG Scores, and α-Integrated Scores. (Right) Estimated α-Integration Parameters with the MPE Criterion.
            EEG (%)   ECG (%)   α-int (%)       α   w1 (EEG)   w2 (ECG)
Subject 1     78.60     80.55       84.70   10.95     0.5053     0.4947
Subject 2     77.39     74.37       77.51   17.02     0.5552     0.4448
Subject 3     89.13     90.48       91.72   10.15     0.4306     0.5786
Subject 4     80.45     93.93       93.93   96.02     0.2009     0.7991

Note: Numbers in bold indicate the best results.

7  Conclusion

We have presented a new method for the fusion of scores obtained from different detectors, based on α-integration. It is a generalization of simpler rules that allows optimum fitting of the parameters and finds its rationale in the optimum integration of stochastic models. Three optimality criteria have been considered: LMSE, AUCmax, and MPE. While the first two relate, implicitly or explicitly, to optimizing the ROC curve (i.e., maximizing the probability of detection for a given probability of false alarm), the last one focuses on minimizing the probability of error.

We have proposed new gradient algorithms for the three criteria. In the LMSE case, we have adapted to the detection context a gradient algorithm previously proposed in the general framework of α-integration. Some variations have been included to account for the unbalanced distribution of the training data sizes and the relative significance of every type of error in the global MSE. Regarding AUCmax, a new algorithm has been proposed based on transforming an empirical nonparametric measure of the AUC into a differentiable function. A key advantage of AUCmax with respect to LMSE is that it allows tuned optimization in selected intervals of the ROC curve. In MPE, a new cost function is defined that is the negative of the log-probability of correct decisions.

We have included different experiments with simulated data with the aim of illustrating the different factors influencing α-integration with both LMSE and AUCmax. It has been shown that the fusion of two-detector scores leads to significant improvements of the ROC curves.

Finally, two real data cases have been considered. The first corresponds to the fusion of scores in multimodal biometric data. In this application, the goal is to attain the maximum genuine acceptance rate (equivalent to the probability of detection) for a given (rather small) false acceptance rate; hence, both LMSE and AUCmax have been considered. Different experiments have been done with different data sets, showing the superior performance of α-integration with respect to simpler rules, which do not allow the optimization of the fusing parameters. We have also demonstrated the interest of the capability of AUCmax to be tuned to a selected range of probabilities of false alarm.

The second real data case is in the area of automatic analysis of medical records to reproduce the manual decisions taken by the medical expert, so the best criterion is MPE. We have presented the theoretical analysis, including gradient computations, of α-integration based on MPE. The method has been applied in the fusion of two scores, respectively obtained from EEG and ECG records. The problem was the automatic detection of arousals during sleep, which the medical expert currently does manually. Experiments on four subjects have illustrated the potential interest of MPE α-integration in these kinds of problems.

Acknowledgments

This work has been supported by the Generalitat Valenciana under grants PROMETEOII 2014-032 and ISIC2012-006 and by the Spanish Administration under grant TEC2014-58438-R.

References

Agarwal, R., & Gotman, J. (2001). Computer-assisted sleep staging. IEEE Transactions on Biomedical Engineering, 48(12), 1412–1423.

Amari, S. (2007). Integration of stochastic models by minimizing α-divergence. Neural Computation, 19, 2780–2796.

Atrey, P., Hossain, M., El Saddik, A., & Kankanhalli, M. (2010). Multimodal fusion for multimedia analysis: A survey. Multimedia Systems, 16, 345–379.

Bouziane, A., Yagoubi, B., Vergara, L., & Salazar, A. (2015). The ANS sympathovagal balance using a hybrid method based on the wavelet packet and the KS-segmentation algorithm. In Proc. Int'l. Conf. Circuits, Systems, Signals and Telecomm. (pp. 75–83). WSEAS Press.

Byrd, R. H., Hribar, M. E., & Nocedal, J. (1999). An interior point algorithm for large scale nonlinear programming. SIAM Journal on Optimization, 9(4), 877–900.

Choi, H., Choi, S., & Choe, Y. (2013). Parameter learning for alpha integration. Neural Computation, 25, 1585–1604.

Choi, H., Choi, S., Katake, A., & Choe, Y. (2010). Learning α-integration with partially labeled data. In Proc. IEEE Int'l. Conf. Acoustics, Speech, and Signal Processing (pp. 2058–2061). Piscataway, NJ: IEEE.

Dass, S. C., Nandakumar, K., & Jain, A. K. (2004). A principled approach to score level fusion in multimodal biometric systems. In Proceedings of the Fifth International Conference on AVBPA (pp. 1049–1058). Berlin: Springer.

Dodd, L. E., & Pepe, M. S. (2003). Partial AUC estimation and regression. Biometrics, 59(3), 614–623.

Herschtal, A., & Raskutti, B. (2004). Optimising area under the ROC curve using gradient descent. In Proc. 21st International Conference on Machine Learning (pp. 49–56). New York: ACM.

Hippenstiel, R. D. (2002). Detection theory: Applications and digital signal processing. Boca Raton, FL: CRC Press.

Jain, A., Nandakumar, K., & Ross, A. (2005). Score normalization in multimodal biometric systems. Pattern Recognition, 38, 2270–2285.

Jain, A. K., & Ross, A. (2004). Multibiometric systems. Communications of the ACM, 47, 34–40.

Jain, A. K., Ross, A., & Prabhakar, S. (2004). An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14, 4–20.

Kaufmann, T., Sütterlin, S., Schulz, S. M., & Vögele, C. (2011). ARTiiFACT: A tool for heart rate artifact processing and heart rate variability analysis. Behavior Research Methods, 43(4), 1161–1170.

Khaleghi, B., Khamis, A., Karray, F. O., & Razavi, S. N. (2013). Multisensor data fusion: A review of the state-of-the-art. Information Fusion, 14(1), 28–44.

Kittler, J., Hatef, M., Duin, R. P., & Matas, J. (1998). On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 226–239.

Narasimhan, H., & Agarwal, S. (2013). A structural SVM based approach for optimizing partial AUC. In Proc. 30th International Conference on Machine Learning (pp. 516–524). JMLR W&CP.

Parker, Ch. (2013). On measuring the performance of binary classifiers. Knowledge and Information Systems, 35(1), 131–152.

Pimentel, M. A. F., Clifton, D. A., Clifton, L., & Tarassenko, L. (2014). A review of novelty detection. Signal Processing, 99, 215–249.

Ross, A., & Jain, A. K. (2003). Information fusion in biometrics. Pattern Recognition Letters, 24, 2115–2125.

Salazar, A., Vergara, L., & Miralles, R. (2010). On including sequential dependencies in ICA mixture models. Signal Processing, 90, 2314–2318.

Soriano, A., Vergara, L., Moragues, J., & Miralles, R. (2014). Unknown signal detection by one-class detector based on Gaussian copula. Signal Processing, 96, 315–320.

Toh, K. A., Jiang, X., & Yau, W. Y. (2004). Exploiting global and local decisions for multi-modal biometrics verification. IEEE Transactions on Signal Processing, 52, 3059–3072.

U.S. Department of Commerce, National Institute of Standards & Technology. Biometric scores set. http://www.nist.gov/itl/iad/ig/biometricscores.cfm

Waltz, R. A., Morales, J. L., Nocedal, J., & Orban, D. (2006). An interior algorithm for nonlinear optimization that combines line search and trust region steps. Mathematical Programming, 107(3), 391–408.

Wang, Y., Tan, T., & Jain, A. K. (2003). Combining face and iris biometrics for identity verification. In Proceedings of the Fourth International Conference on AVBPA (pp. 805–813). Berlin: Springer.

Wu, D. (2009). Parameter estimation for α-GMM based on maximum likelihood criterion. Neural Computation, 21, 1776–1795.

Yuksel, S. E., Wilson, J. N., & Gader, P. D. (2012). Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23, 1177–1193.

Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 694–699). New York: ACM.