## Abstract

This letter addresses the problem of separating two speakers from a single microphone recording. Three linear methods are tested for source separation, all of which operate directly on sound spectrograms: (1) eigenmode analysis of covariance difference to identify spectro-temporal features associated with large variance for one source and small variance for the other source; (2) maximum likelihood demixing in which the mixture is modeled as the sum of two gaussian signals and maximum likelihood is used to identify the most likely sources; and (3) suppression-regression, in which autoregressive models are trained to reproduce one source and suppress the other. These linear approaches are tested on the problem of separating a known male from a known female speaker. The performance of these algorithms is assessed in terms of the residual error of estimated source spectrograms, waveform signal-to-noise ratio, and perceptual evaluation of speech quality scores. This work shows that the algorithms compare favorably to nonlinear approaches such as nonnegative sparse coding in terms of simplicity, performance, and suitability for real-time implementations, and they provide benchmark solutions for monaural source separation tasks.

## 1 Introduction

The problem of recovering underlying source signals from their mixtures is called source separation. It has been an area of intense research for a long time and has received much attention recently because of the potential applications for hearing aid systems, signal preprocessing in speech recognition systems, and network tomography and medical signal processing in general (Hyvärinen, Karhunen, & Oja, 2001). When the problem is (over-) determined, that is, when the number of sources is no larger than the number of available mixtures or sensor recordings, then generic assumptions such as statistical independence of sources can be used for successful demixing (Hyvärinen et al., 2001). However, most of the time, the problem is underdetermined: the number of sources is larger than the number of available mixtures; hence, more specific assumptions must be made for demixing. One of the most difficult cases is source separation from a single microphone, which is the task of segregating a target source from a masker using a single-channel recording. This task has been proven to be extremely challenging (Wang & Brown, 2006) because one can only use the intrinsic acoustic properties of the target and the masker with no additional cues.

Essentially three general approaches have been proposed for monaural separation: speech enhancement, computational auditory scene analysis (CASA), and model-based methods such as nonnegative matrix factorization (NMF). Linear methods have mostly been used for speech enhancement in noisy environments. These methods include spectral subtraction, Wiener filtering, and subspace-based approaches (Loizou, 2013). The subspace methods for speech enhancement are based on the principle that sources are usually confined to a subspace of Euclidean space; consequently, methods have been developed that compute these source-specific subspaces. Some of the widely used techniques are singular value decomposition (SVD), eigenvalue decomposition (EVD), and subspace tracking algorithms (Loizou, 2013). These methods have primarily been used for the enhancement of noisy speech (Jensen, Hansen, Hansen, & Sorensen, 1995; Hansen & Jensen, 2005, 2007; Weiss, 2009). Another approach to source separation is to compute an ideal binary mask (IBM) for the target source, considered one of the main goals of CASA systems for speech segregation (Wang & Brown, 2006). An IBM is a binary matrix in which an element is 1 if, in the corresponding time-frequency bin of the mixture signal, the power of the target source is known to be higher than that of the masker, and 0 otherwise. Binary masking and related approaches have been used extensively for demixing (Aoki et al., 2001; Brungart, Chang, Simpson, & Wang, 2006; Han & Wang, 2012; Jourjine, Rickard, & Yilmaz, 2000; Nguyen, Belouchrani, Abed-Meraim, & Boashash, 2001; Rickard, Balan, & Rosca, 2001). Deep neural networks (DNNs) have also been used to learn binary masks (Wang & Wang, 2013; Zhao, Wang, & Wang, 2014) and soft masks (Narayanan & Wang, 2013, 2014; Huang, Kim, Hasegawa-Johnson, & Smaragdis, 2014). A major disadvantage of using an IBM is the loss of information in regions where the source and masker overlap.
Model-based approaches such as NMF that factorize a nonnegative matrix into (usually) two nonnegative matrices have been used extensively for monaural source separation and over a wide range of sources, including human speech, noise, and music (Schmidt, Larsen, & Hsiao, 2007).

This letter proposes various linear methods for separating (artificial) mixtures of two sources. The approaches are based on the power spectrogram, a time-frequency representation of sound that has the advantage of being phase insensitive and so allows speakers to be distinguished based on differences in the spectro-temporal sound patterns they produce. This work introduces new methods based on eigenanalysis, probabilistic demixing, and linear regression. These methods are evaluated by estimating (reconstructing) target spectrograms of single speakers from mixture speech. The reconstructed spectrograms are compared to those yielded by nonlinear approaches, that is, nonnegative sparse coding (NNSC) (Schmidt et al., 2007) and a supervised version thereof (which we refer to as non-blind NNSC, described in section 3.5).

## 2 Methods

All computations are performed using Matlab R2011a on a 64-bit machine with 8 GB of RAM and an Intel Core i7 processor with 4 cores and clock frequency 2.93 GHz.

### 2.1 The Problem and the Approach

Speaker separation is evaluated on audio speech data from the GRID corpus and the MOCHA-TIMIT database (TIMIT: Texas Instruments (TI) and Massachusetts Institute of Technology (MIT)). All sentences are sampled at 16 kHz. The training data consist of 20 sets of audio files, each set containing roughly 5 minutes of clean speech from a distinct speaker (10 male and 10 female speakers). In the training phase, source-specific filters are learned from the training set. The testing data consist of roughly 1 minute of clean speech from each speaker. On each run of testing, one male and one female speaker are randomly chosen; their signals are denoted by $x(t)$ and $y(t)$. The same set of speakers is used for training and testing. The test audio signals (mixture speech) are artificially created by summing the two underlying speech signals, $z(t) = x(t) + y(t)$. Note that since the STFT is a linear function, the mixture can be expressed in the time-frequency domain as $Z e^{i\phi_Z} = X e^{i\phi_X} + Y e^{i\phi_Y}$, where $Z$, $X$, and $Y$ are the absolute values of the FFT for the mixture, speaker $x$, and speaker $y$, respectively, and $\phi$ is the corresponding phase. The triangle inequality $Z \le X + Y$ holds true for any $X$, $Y$, and phases, and $\langle Z^2 \rangle = \langle X^2 \rangle + \langle Y^2 \rangle$ for a set of uncorrelated signals $x$ and $y$, where $\langle \cdot \rangle$ is the expected value.

In this letter, the approximation $Z^{\gamma} \approx X^{\gamma} + Y^{\gamma}$ is used, where $\gamma$ denotes the STFT exponent. This can be thought of as a generalization of the binary mask approach because the approximation is excellent ($Z^{\gamma}$ close to $X^{\gamma} + Y^{\gamma}$) for any $\gamma$ when $X \gg Y$ or $Y \gg X$. In general, the assumption is good for $\gamma$ close to 2 (Berouti et al., 1979). The assumption becomes worse as $\gamma$ deviates from 2, and thus the performances of the various methods are expected to be lower in that regime.
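This additivity assumption can be checked numerically. The sketch below uses uncorrelated white-noise stand-ins for the two speakers (an assumption for illustration, not the corpus data) and an exponent of 2, for which the cross terms average out across frames:

```python
import numpy as np

rng = np.random.default_rng(0)

# White-noise stand-ins for the two speakers (an assumption for illustration)
x = rng.standard_normal(160000)
y = rng.standard_normal(160000)
z = x + y  # single-microphone mixture

def power_spectrogram(s, n_fft=512, gamma=2.0):
    """Magnitude STFT raised to the exponent gamma (non-overlapping Hann frames)."""
    frames = s[: len(s) // n_fft * n_fft].reshape(-1, n_fft)
    return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** gamma

X, Y, Z = (power_spectrogram(s) for s in (x, y, z))

# For gamma = 2 and uncorrelated sources the cross terms average out,
# so Z ~= X + Y holds on average across frames
rel_err = np.abs(Z.mean(axis=0) - (X + Y).mean(axis=0)) / (X + Y).mean(axis=0)
assert rel_err.mean() < 0.15
```

For speech rather than white noise, the cross terms average out more slowly, which is why the approximation is only exact in expectation.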

The spectrograms of speech for the two speakers in the training set are denoted $X$ and $Y$. From the training set, the filters (linear methods) or libraries (NNSC) are derived. These filters and libraries are then used to estimate each speaker's spectrogram, $\hat{X}$ and $\hat{Y}$, from mixture spectrograms $Z$.

Note that for the subsequent methods, one spectrogram window is predicted at each time step and averaged over overlapping windows to obtain the estimated spectrogram, which is ultimately inverted into cleaned speech. The exception is suppression-regression, where single spectrogram columns are estimated directly from windows of mixture columns. Note also that source separation methods based on windows of multiple spectral columns are feasible for real-time implementation only when the number of columns is small, ideally a single column.

### 2.2 Mean Subtraction (Applicable to the Linear Methods)

All methods make the simplifying assumption that $Z = X + Y$, that is, the mixture spectrogram is the sum of the two individual speaker spectrograms. From this assumption, the following mean subtraction method is derived. The means $\mu_x = \langle X \rangle_t$ and $\mu_y = \langle Y \rangle_t$ are defined over individual training spectrograms, where $\langle \cdot \rangle_t$ represents the mean over time. Equally, the means $\boldsymbol{\mu}_x$ and $\boldsymbol{\mu}_y$ are defined over windows of training spectrograms. As the training sets for both speakers are chosen to be equally large, the overall mean of the training spectrogram columns is given by $\mu = (\mu_x + \mu_y)/2$, and likewise for the mean spectrogram window. The linear methods to be defined extract demixing filters $F_x$ and $F_y$ from the mean-subtracted training spectrograms $X - \mu_x$ and $Y - \mu_y$. The filter $F_x$ maps components of $X - \mu_x$ onto themselves and components of $Y - \mu_y$ onto zero, and vice versa for filter $F_y$.

For demixing, the steps are to subtract twice the overall mean from the mixture spectrogram, $z - 2\mu$, apply the demixing filters, and add the individual training spectrogram means $\mu_x$ and $\mu_y$ to the results to obtain the estimates $\hat{x} = F_x(z - 2\mu) + \mu_x$ and $\hat{y} = F_y(z - 2\mu) + \mu_y$, respectively. It is simple to show that by doing so (and assuming $F_x + F_y = I$), adding the estimates yields $\hat{x} + \hat{y} = z$, consistent with the assumption $Z = X + Y$.
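The bookkeeping of this mean-subtraction scheme can be checked numerically. The sketch below uses random surrogate data and an arbitrary filter pair summing to the identity (all names and values are illustrative assumptions, not the trained filters):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # number of frequency bins (illustrative)

# Individual training-spectrogram means and the overall mean
mu_x = rng.random(d)
mu_y = rng.random(d)
mu = 0.5 * (mu_x + mu_y)   # overall mean (equally sized training sets)

# Any pair of demixing filters that sums to the identity
F_x = rng.random((d, d)) * 0.1
F_y = np.eye(d) - F_x

z = rng.random(d)          # a mixture spectrogram column

# Demixing: subtract twice the overall mean, filter, add back the individual means
x_hat = F_x @ (z - 2 * mu) + mu_x
y_hat = F_y @ (z - 2 * mu) + mu_y

# The estimates sum back to the mixture, consistent with z = x + y
assert np.allclose(x_hat + y_hat, z)
```

The identity follows because $F_x + F_y = I$ and $\mu_x + \mu_y = 2\mu$, independent of the particular filter values.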

The following mean-subtraction methods are also evaluated, but their performance is worse (larger residuals and lower signal-to-noise ratios; data not shown) compared to the above chosen method for mean subtraction:

1. During training, the speaker-specific filters are computed on the individual-mean subtracted training spectrograms $X - \mu_x$ and $Y - \mu_y$. During testing, the mixture-mean subtracted mixture spectrogram $z - \mu_z$ is projected on these filters, and subsequently half of the mixture mean is added back to the projections to obtain the estimates $\hat{x}$ and $\hat{y}$.

2. The speaker-specific filters are computed on a set of overall-mean subtracted training spectrograms, $X - \mu$ and $Y - \mu$. During testing, $\hat{x}$ and $\hat{y}$ are estimated by projecting the overall-mean subtracted mixture spectrogram, $z - \mu$, on these filters; subsequently, the individual training spectrogram mean $\mu_x$ (resp. $\mu_y$) is added to the projections to obtain the estimates $\hat{x}$ and $\hat{y}$.

3. The speaker-specific filters are computed on a set of overall-mean subtracted training spectrograms, $X - \mu$ and $Y - \mu$, as above. However, during the testing phase, the mixture spectrogram with twice the overall mean subtracted, $z - 2\mu$, is projected on these filters; subsequently, the individual training spectrogram mean $\mu_x$ (resp. $\mu_y$) is added to the projections to obtain the estimates $\hat{x}$ and $\hat{y}$.

Note that methods 1 and 3 also satisfy the assumption $\hat{x} + \hat{y} = z$, but method 2 does not.

## 3 Algorithms

The three linear methods for source separation are introduced first, followed by two variants of a nonlinear method.

### 3.1 Eigenmode Analysis of Covariance Difference (EACD)

Eigenmode analysis is applied in order to compute filters that span directions associated with large variance for one speaker and small variance for the other speaker, an approach inspired by Machens, Romo, and Brody (2010). The speech mixture spectrograms are then projected onto these filters to selectively suppress one of the two speakers present in the mixture.

The covariance matrices of the mean-subtracted training data of the two speakers,

$$C_X = \langle (x - \mu_x)(x - \mu_x)^{\top} \rangle_t, \qquad C_Y = \langle (y - \mu_y)(y - \mu_y)^{\top} \rangle_t, \quad (3.1)$$

are computed, and the difference of the two covariance matrices is decomposed into its eigenmodes,

$$C_X - C_Y = W \Lambda W^{\top}, \quad (3.2)$$

where the columns of the orthonormal matrix $W$ are the eigenvectors and the diagonal matrix $\Lambda$ holds the eigenvalues. Substituting $C_X$ and $C_Y$ into equation 3.2 and using the fact that the average is a linear operator, one obtains $\Lambda = W^{\top} C_X W - W^{\top} C_Y W$. Thus, the diagonal elements of $\Lambda$ give the difference in variance between the two mean-subtracted data sets along the eigenvectors (columns of $W$). These eigenvectors span two subspaces. In the first subspace, speaker $x$ produces a higher variance than speaker $y$, and vice versa in the second. The sets of filters that span the first and the second subspaces are labeled $W_X$ and $W_Y$.

The estimates $\hat{x}$ and $\hat{y}$ as defined in equations 3.4 and 3.5 satisfy the assumption $\hat{x} + \hat{y} = z$ (because $W_X W_X^{\top} + W_Y W_Y^{\top} = W W^{\top}$ is the identity matrix; here $F_x = W_X W_X^{\top}$ and $F_y = W_Y W_Y^{\top}$, as mentioned in section 2). The algorithm works identically for windows of multiple spectral columns, as described in section 2, by simply replacing the variables by their boldface counterparts and averaging $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ over their overlapping regions.
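A minimal numerical sketch of EACD on surrogate data follows (the dimensions, variance profiles, and variable names are assumptions for illustration; the real inputs would be mean-subtracted spectrogram columns):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 16, 5000

# Surrogate mean-subtracted spectrogram columns: speaker x has large variance
# in the first half of the dimensions, speaker y in the second half
scales_x = np.r_[np.full(d // 2, 3.0), np.full(d // 2, 0.3)]
scales_y = scales_x[::-1]
X = scales_x[:, None] * rng.standard_normal((d, n))
Y = scales_y[:, None] * rng.standard_normal((d, n))

# Eigendecomposition of the covariance difference (symmetric, so W is orthonormal)
C_x = X @ X.T / n
C_y = Y @ Y.T / n
evals, W = np.linalg.eigh(C_x - C_y)

# Filters spanning the subspaces where each speaker dominates the variance
F_x = W[:, evals > 0]
F_y = W[:, evals <= 0]

# Demix mixture columns by projection onto each speaker's subspace
Z = X[:, :100] + Y[:, :100]
X_hat = F_x @ F_x.T @ Z
Y_hat = F_y @ F_y.T @ Z

# W W^T = I, so the two estimates always sum back to the mixture,
# and the projection reduces the error relative to the raw mixture
assert np.allclose(X_hat + Y_hat, Z)
assert np.mean((X_hat - X[:, :100]) ** 2) < np.mean((Z - X[:, :100]) ** 2)
```

The demixing step is a single pair of matrix products per column, which is what makes the method attractive for real-time use.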

### 3.2 Probabilistic Approach (Maximum Likelihood Demixing)

This approach performs source separation under the assumption that the two (mean-subtracted) sources are independent and gaussian distributed.

The sources are modeled as $p(x) = Z_x^{-1} \exp\left(-\tfrac{1}{2}\, x^{\top} C_X^{-1} x\right)$ and $p(y) = Z_y^{-1} \exp\left(-\tfrac{1}{2}\, y^{\top} C_Y^{-1} y\right)$, with $Z_x$ and $Z_y$ normalization constants.
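Under these gaussian assumptions, maximizing $p(x)\,p(y)$ subject to $x + y = z$ yields linear demixing filters; the closed form $C_X (C_X + C_Y)^{-1}$ is a standard consequence of the model, sketched below on surrogate covariances (variable names and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 16, 20000

# Surrogate mean-subtracted spectrogram columns with speaker-specific covariances
L_x = rng.standard_normal((d, d)) * 0.8
L_y = rng.standard_normal((d, d)) * 0.3
X = L_x @ rng.standard_normal((d, n))
Y = L_y @ rng.standard_normal((d, n))
C_x = X @ X.T / n
C_y = Y @ Y.T / n

# Most likely split of z into x + y under the two gaussian priors:
# minimizing x' C_x^{-1} x + y' C_y^{-1} y subject to x + y = z
# gives the linear filters below, which sum to the identity
F_x = C_x @ np.linalg.inv(C_x + C_y)
F_y = np.eye(d) - F_x

Z = X + Y
X_hat = F_x @ Z
Y_hat = F_y @ Z

# The ML estimate beats taking the mixture itself as the target estimate,
# and the two estimates always sum back to the mixture
assert np.mean((X_hat - X) ** 2) < np.mean((Z - X) ** 2)
assert np.allclose(X_hat + Y_hat, Z)
```

Because $F_x + F_y = I$ by construction, the assumption $\hat{x} + \hat{y} = z$ of section 2 is satisfied automatically.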

### 3.3 Suppression-Regression

Note that a large condition number of the matrix to be inverted can become a significant problem. Ill-conditioning is observed only for suppression-regression in the case of multiple spectral columns. One way of dealing with ill-conditioned matrices is regularization; in this work, Tikhonov regularization (Tikhonov & Arsenin, 1977) is used. For an inverse problem $A r = b$ that is ill posed (because of either the nonexistence or the nonuniqueness of $r$), the estimate of $r$ is usually computed using an ordinary least squares approach that minimizes the residual $\|A r - b\|^2$, where $\|\cdot\|$ represents the Euclidean norm. This may fail because $A$ is ill-conditioned or singular. To obtain a solution with desirable properties, a regularization term is included in this minimization, $\|A r - b\|^2 + \|\Gamma r\|^2$, where $\Gamma = \alpha I$ is called the Tikhonov matrix, $\alpha$ is a constant, and $I$ is the identity matrix. An explicit solution is $\hat{r} = (A^{\top} A + \alpha^2 I)^{-1} A^{\top} b$ for some regularization constant $\alpha$. In this work, the regularization constant is chosen such that it maximizes the performance of separation (higher SNR and PESQ score, lower residual). It is varied upward from 0, with step sizes initially increasing in multiples of 10 and a fixed small step size close to peak performance. The peak performance value was determined by cross-validation on an independent set of data. Figures 4 and 6b show the results for suppression-regression after regularization. The performance improvement is largest for a large number of columns, which is expected because the inversion problem becomes more ill posed as the number of columns grows. On average, after regularization the SNR values improve by 0.7 dB, the PESQ values by 0.3, and the residual values decrease by 0.05.
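A minimal sketch of Tikhonov-regularized suppression-regression on surrogate data (the closed-form ridge solution mirrors the explicit solution above; dimensions and the regularization constant are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 16, 4000

# Surrogate training spectrogram columns: target x and masker y have
# complementary variance profiles across frequency bins
X = rng.standard_normal((d, n)) * np.linspace(2.0, 0.2, d)[:, None]
Y = rng.standard_normal((d, n)) * np.linspace(0.2, 2.0, d)[:, None]
Z = X + Y  # training mixtures

# Suppression-regression: fit a linear map that reproduces x and suppresses y.
# The Tikhonov term lam * I keeps the inversion well conditioned.
lam = 1e-2
F_x = X @ Z.T @ np.linalg.inv(Z @ Z.T + lam * np.eye(d))

# The regularized regression beats the trivial estimate x_hat = z
assert np.mean((F_x @ Z - X) ** 2) < np.mean((Z - X) ** 2)
```

With multiple concatenated spectral columns, $Z Z^{\top}$ grows and its condition number worsens, which is where the regularizer matters most.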

### 3.4 Source Separation Using Nonnegative Sparse Coding (NNSC)

These linear approaches are compared to nonlinear approaches based on NMF, which encompasses a set of methods in which a nonnegative matrix $V$ is factorized into two nonnegative matrices $W$ and $H$ such that $V \approx WH$ (all components of $W$ and $H$ are nonnegative). Nonnegative matrix factorization leads to a parts-based representation because the factorization allows only additive, not subtractive, combinations (Eggert & Korner, 2004).

Sparsity of the basis coefficients is enforced by the second term in equation 3.26. This term guarantees that only a small subset of dictionary elements is used at any time, thus forcing the dictionary elements to be source specific. Note that NNSC is a nonlinear approach because finding the global optimum of equation 3.26 is not tractable (NP-hard).

There exist diverse algorithms for computing this factorization (Eggert & Korner, 2004; Hoyer, 2002; Lee & Seung, 1999, 2006; Lin, 2007; Schmidt et al., 2007) including the multiplicative update rule from Lee and Seung (1999, 2006). This update rule has been very popular due to the simplicity of its implementation. An advantage of multiplicative update rules over standard gradient descent update is that convergence is guaranteed because no step-size parameter is needed (Lee & Seung, 1999). Also the relaxation toward a local minimum is fast and computationally inexpensive (Lee & Seung, 1999). This work uses the multiplicative update rules of Schmidt et al. (2007), which are inspired by Eggert and Korner (2004) and Lee and Seung (1999, 2006). These update rules provide an extra advantage in that they allow the target dictionaries to be learned from the mixture and thus constitute a semisupervised method of learning.
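The flavor of these multiplicative updates can be illustrated with the basic Lee-Seung Euclidean rules plus an L1 sparsity penalty on the code. This is a simplified stand-in for the update rules of Schmidt et al. (2007), not their exact equations 3.30 to 3.32; all sizes and weights are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
eps = 1e-9

V = rng.random((32, 200))   # nonnegative data (stand-in for a magnitude spectrogram)
r, lam = 10, 0.1            # dictionary size and sparsity weight (assumptions)
W = rng.random((32, r))     # dictionary
H = rng.random((r, 200))    # sparse code

def objective(V, W, H, lam):
    return 0.5 * np.sum((V - W @ H) ** 2) + lam * np.sum(H)

obj = [objective(V, W, H, lam)]
for _ in range(100):
    # Multiplicative updates preserve nonnegativity automatically;
    # the L1 weight lam enters the denominator of the H update
    H *= (W.T @ V) / (W.T @ W @ H + lam + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    obj.append(objective(V, W, H, lam))

# The sparse reconstruction objective decreases monotonically
assert all(b <= a + 1e-6 for a, b in zip(obj, obj[1:]))
```

The absence of a step-size parameter is what makes the monotone decrease automatic, as noted above.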

The masker dictionary is first computed using the clean speech of the masker as follows (speaker $y$ being the masker and speaker $x$ the target in this particular case):

Start with a randomly initialized dictionary matrix and code matrix.

The above approach of semiblind NNSC is also modified by first learning the target dictionary from the clean data set and then learning the masker dictionary from the mixture using update rules analogous to equations 3.30 to 3.32. Note that learning the dictionary of the masker from its clean speech is very useful when the target model is unavailable. If the masker model is unavailable, one can learn the masker dictionary from the mixture while learning the target from its clean speech. We show results for both of these semiblind approaches.

### 3.5 Non-Blind NNSC

One might expect better performance when the dictionaries for both speakers are prelearned from clean training sentences, as was done in the linear methods (in the NNSC approach described in section 3.4, the dictionary of one speaker was computed from clean speech and the dictionary of the other speaker from the mixture). In this section, NNSC is applied to learn both dictionaries from clean training sentences; the code matrices are then inferred from the update rules, equations 3.30 and 3.31.

Parameter values for non-blind and semiblind NNSC were chosen by maximizing SNR as follows. The optimal parameter values reported in Schmidt et al. (2007) were taken as starting points, and each parameter was varied in turn over an empirically chosen range while the others were kept at their optimal values: two of the parameters were varied over the range [0, 0.1] in linear steps of 0.01, one over the range [0, 1] in linear steps of 0.1, and the two dictionary sizes over the range [1, 512] in powers of 2; the remaining parameters were kept fixed. The optimal values obtained in this manner were used for all NNSC results reported here.

## 4 Performance Criteria

The performances of the algorithms are evaluated using the following measures:

- The residual error of the estimated source spectrograms relative to the target spectrograms

- The waveform signal-to-noise ratio (SNR) between the target audio signal and the predicted audio signal

- Perceptual evaluation of speech quality (PESQ) scores (Hu & Loizou, 2008) computed over the target audio signal and the predicted audio signal

## 5 Simulation Results

After training the filters and dictionaries, artificial speech mixtures from two speakers (male and female), shown in Figure 2a, were created, and the corresponding mixture spectrogram in Figure 2b was computed. From this mixture spectrogram, the speech spectrogram of the target speaker (male) was estimated using the algorithms already presented.

The example in Figure 2 shows intervals of speech from only the target, only the masker, and a superposition of both. Masker suppression is strongest in the case of NNSC (masker learned). However, the target itself is also strongly suppressed in this case; as a result, the target speech suffers from low intelligibility. By contrast, masker suppression is weaker in the linear approaches and in NNSC (target learned and both learned), but the target speech is better preserved.

Shown in Figure 2c is an excerpt of the waveforms comparing the target, the mixture, and the estimated target using MLD/suppression-regression (the best method). The estimated waveform matches well with the target waveform.

The performance of the various algorithms is compared for different values of the STFT window size, the STFT exponent, and the number of spectrogram columns (see Figures 3 to 5). Starting from a fixed base parameter set, one of the three parameters is varied while the other two are kept fixed. In all figures, the plots show residual spectrogram errors and SNRs averaged over all speakers. The performances of the algorithms are further compared using the widely used perceptual PESQ metric recommended by the International Telecommunication Union--Telecommunication (ITU-T). The PESQ scores support the results shown in Figures 3 to 5 and align well with them (see Figure 6).

Figure 3 depicts the performances of the algorithms under varying FFT window size. For all algorithms except NNSC (masker learned), the SNR of the reconstructed signals peaks at either 64 ms or 128 ms. The normalized residual plots confirm this finding for the well-performing methods MLD/suppression-regression and NNSC (both dictionaries learned). For the remaining three methods, the minimal normalized residual is not reached even at the largest window size tested.

Figure 4 shows the performances of the algorithms for various numbers of spectral columns. Adding more spectral columns captures the context-dependent information in each analysis vector, leading to better performance. However, temporal context beyond a certain limit (depending on the amount of data available) is not useful and leads to performance reduction. Note that research has shown that the temporal context in human speech can typically vary from 20 ms to 200 ms (Rosen, 1992).

The performance comparison of the different algorithms gives a sense of how prone they are to overfitting. The better-performing algorithms MLD, suppression-regression, and NNSC (both learned and target learned) exhibit their peak performances for a small number of spectrogram columns, while the performances of the other nonlinear algorithms saturate. EACD shows a monotonically increasing performance, revealing that it is more robust to overfitting.

Shown in Figure 5 are the performances of the algorithms for various values of the STFT exponent. Overall, MLD/suppression-regression performs best, followed by NNSC (both speakers learned). The NNSC (masker learned) method performs worst.

The STFT exponent (sparseness factor) has a strong impact on performance. The assumption implicit in the linear methods is that the data are gaussian distributed. The nonlinear NNSC methods, however, assume exponentially distributed independent components in the mixture (Hoyer, 2002). As the exponent becomes small, the distribution of the spectrogram pixels approaches a gaussian; in the midrange, this distribution approaches an exponential; and for large exponents, it becomes heavy tailed. The approximation that the sources are additive in the spectrogram domain becomes worse the further the exponent deviates from 2 (Berouti et al., 1979).

Given this trade-off, it may not be surprising that the optimal performance for EACD and MLD is reached not at small exponent values but at intermediate ones.

The same argument can be applied to NNSC (both dictionaries learned), for which the performance is optimal in a similar range. In this range, MLD/suppression-regression and NNSC (both speakers learned) perform best. This can be seen in the estimated spectrograms (see Figure 3). The region where the masker speaker (the female speaker in this example) is solely speaking is suppressed best by NNSC (masker learned); however, this method also over-suppresses the target speaker, leading to lower SNR values compared to suppression-regression and NNSC (both dictionaries learned). The performances of EACD and MLD are still very good in this range.

### 5.1 Sparse NMF with Kullback-Leibler (KL) and Itakura-Saito Divergence Criteria

Apart from the Euclidean distance-based criterion for matrix factorization, we also tested other divergence criteria such as the Kullback-Leibler (KL) divergence and the Itakura-Saito (IS) divergence. These criteria are usually considered better suited for audio spectra than the Euclidean distance (Févotte & Idier, 2011). Using the KL divergence, the SNRs averaged over all speakers improved by 0.77 dB for NNSC (target learned), 1.18 dB for NNSC (masker learned), and 0.34 dB for non-blind NNSC. When the KL divergence was used, the residuals decreased on average by 0.09 for NNSC (target learned), 0.10 for NNSC (masker learned), and 0.01 for non-blind NNSC. The PESQ scores also improved on average by 0.06 for NNSC (target learned), 0.05 for NNSC (masker learned), and 0.06 for non-blind NNSC. The results are summarized in Table 1.

| NNSC Method | Mean SNR (dB), Eucl. | Mean SNR (dB), KL | Mean SNR (dB), IS | Mean Norm. Resid., Eucl. | Mean Norm. Resid., KL | Mean Norm. Resid., IS | Mean PESQ, Eucl. | Mean PESQ, KL | Mean PESQ, IS |
|---|---|---|---|---|---|---|---|---|---|
| NNSC (target learned) | 4.98 | 5.75 | 4.86 | 0.60 | 0.51 | 0.64 | 1.86 | 1.92 | 1.74 |
| NNSC (masker learned) | 3.27 | 4.45 | 3.00 | 0.66 | 0.56 | 0.70 | 1.73 | 1.78 | 1.71 |
| NNSC (both learned) | 5.98 | 6.32 | 5.80 | 0.41 | 0.40 | 0.45 | 2.08 | 2.14 | 2.02 |


Note: The KL divergence criterion yields a better performance than the Euclidean distance criterion. In contrast, the IS divergence criterion performs slightly worse.

By contrast, the results using the sparse IS divergence were slightly worse than those using the Euclidean distance. Compared to the SNRs obtained with the Euclidean distance, the SNRs decreased on average by 0.12 dB, 0.27 dB, and 0.18 dB, and the residual values increased by 0.037, 0.04, and 0.04 for NNSC (target learned), NNSC (masker learned), and non-blind NNSC, respectively. The PESQ scores for the IS divergence were reduced as well in comparison to the Euclidean distance-based measure: averaged over all speakers, they were reduced by 0.12, 0.02, and 0.05 for NNSC (target learned), NNSC (masker learned), and non-blind NNSC, respectively. The results comparing all the divergence criteria are summarized in Table 1.

### 5.2 Additional Layer of Wiener Filtering for Speech Enhancement

Wiener filtering is a widely used method in signal processing, particularly for signal denoising and source separation, and is typically applied to audio signals in the spectrogram domain (using the STFT). We use an adaptive Wiener filtering (AWF) approach that enforces the reconstruction constraint that the mixture spectrogram is the sum of the individual estimated spectrograms. Under this constraint, the new spectrogram estimates are given by $\hat{X}_{\mathrm{new}} = \frac{\hat{X}}{\hat{X} + \hat{Y}} \odot Z$ and $\hat{Y}_{\mathrm{new}} = \frac{\hat{Y}}{\hat{X} + \hat{Y}} \odot Z$, where the horizontal line represents pointwise division and $\odot$ represents pointwise multiplication of two matrices. This reconstruction constraint improves the separation performance of all methods studied. The results are summarized in Table 2.
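The reconstruction constraint can be sketched as follows (surrogate matrices; whether the Wiener gain uses the spectrogram estimates directly or their squares depends on the chosen exponent, and the direct ratio is an assumption here):

```python
import numpy as np

rng = np.random.default_rng(6)

Z = rng.random((32, 100)) + 0.1      # mixture spectrogram (surrogate)
X_hat = rng.random((32, 100)) + 0.1  # raw demixed estimates from any method
Y_hat = rng.random((32, 100)) + 0.1

# Adaptive Wiener filtering: redistribute the mixture pointwise
# according to the relative magnitude of the two estimates
gain = X_hat / (X_hat + Y_hat)
X_new = gain * Z
Y_new = (1.0 - gain) * Z

# The reconstruction constraint X_new + Y_new = Z now holds exactly
assert np.allclose(X_new + Y_new, Z)
```

Because the two gains sum to one in every bin, no mixture energy is created or lost by this post-processing step.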

| Method | Mean SNR (dB), without WF (AWF/CWF) | Mean Normalized Residual, without WF (AWF/CWF) | Mean PESQ Score, without WF (AWF/CWF) |
|---|---|---|---|
| EACD | 5.5 (**6.3**/**6.09**) | 0.47 (**0.45**/**0.46**) | 1.9 (**2.21**/**2.19**) |
| MLD/SR | 6.42 (**6.96**/**7.21**) | 0.38 (**0.34**/**0.35**) | 2.35 (**2.56**/**2.56**) |
| NNSC (target learned) | 4.98 (**5.13**/4.74) | 0.6 (**0.58**/0.65) | 1.86 (**1.9**/1.85) |
| NNSC (masker learned) | 3.27 (**3.4**/3.01) | 0.66 (**0.63**/0.74) | 1.73 (**1.87**/1.55) |
| NNSC (both learned) | 5.98 (**6.88**/**6.53**) | 0.41 (**0.38**/**0.377**) | 2.08 (**2.19**/**2.13**) |


Notes: The performance values improved over the methods without WF are shown in bold. AWF improves the performance of all methods. CWF does not improve the performance of NNSC (target learned) and NNSC (masker learned) but improves it for all other methods. MLD/SR stands for MLD/suppression-regression. Note that applying CWF on top of AWF did not, on average, improve the performance for any of the demixing methods.

We also tried the consistent Wiener filtering (CWF) approach proposed by Roux and Vincent (2013), which enforces consistency between neighboring STFT coefficients, as follows. Under gaussian assumptions, the negative log likelihood of the conditional distribution of the source STFT $S$ given the mixture is, up to a constant, $\sum_{t,f} |S_{tf} - \mu_{tf}|^2 / \sigma_{tf}^2$, where $(t,f)$ indexes a time-frequency bin, $\mu_{tf}$ is the mean, and $\sigma_{tf}^2$ is the variance of the conditional distribution. To enforce consistency in $S$, a necessary and sufficient condition is that the STFT of the inverse STFT of $S$ is equal to $S$ itself or, in other words, that $S$ belongs to the null space (kernel) of the linear operator $G$ defined by $G(S) = \mathrm{STFT}(\mathrm{iSTFT}(S)) - S$. The hard consistency constraint may be inadequate when the estimated source variances are unreliable. Therefore, the norm of $G(S)$ is used as a soft penalty term with a weight $\lambda$. The consistent estimate of $S$ is obtained by minimizing the objective function $\sum_{t,f} |S_{tf} - \mu_{tf}|^2 / \sigma_{tf}^2 + \lambda \|G(S)\|^2$ using a conjugate gradient descent method.
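The consistency operator $G$ can be written directly as an STFT/inverse-STFT round trip. The sketch below, using SciPy's `stft`/`istft` (window length and test signal are assumptions), verifies that actual spectrograms lie in its null space while arbitrary complex arrays do not:

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(7)
x = rng.standard_normal(4096)

def G(S, nperseg=256):
    """Consistency operator: STFT(iSTFT(S)) - S; zero iff S is a valid STFT."""
    _, s = istft(S, nperseg=nperseg)
    _, _, S2 = stft(s, nperseg=nperseg)
    n = min(S.shape[1], S2.shape[1])
    return S2[:, :n] - S[:, :n]

# The STFT of a real signal is consistent ...
_, _, S = stft(x, nperseg=256)
assert np.max(np.abs(G(S))) < 1e-8

# ... while an arbitrary complex array generally is not
R = rng.standard_normal(S.shape) + 1j * rng.standard_normal(S.shape)
assert np.max(np.abs(G(R))) > 1e-3
```

The default Hann window at 50% overlap satisfies the constant-overlap-add condition, which is what makes the round trip exact for consistent inputs.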

Applied as a post-processing step to demixed spectrograms, CWF makes the reconstructed spectrograms more amenable to inversion and therefore enhances the quality of the demixed audio signals. CWF requires as inputs the power spectral densities (PSDs) of both the target and the masker (these correspond to the quantities in equation 2.1). Using the PSDs of our estimated audio signals as inputs, we report in Table 2 the separation performance following CWF.

CWF applied to our linear methods using the optimal set of parameters improved the SNRs averaged over all speakers by 0.79 dB for MLD/suppression-regression and by 0.59 dB for EACD. The normalized residual values were reduced on average by 0.03 for MLD/suppression-regression and by 0.01 for EACD. The PESQ scores averaged over all speakers improved by 0.21 for MLD/suppression-regression and by 0.29 for EACD. Note that among the NNSC-based approaches, consistent Wiener filtering improved only the results for NNSC (both learned).

## 6 Real-Time Implementations

The suitability of the linear methods for real-time applications was tested by implementing two of them (EACD and MLD) in real time. The computer used for this purpose had a 64-bit operating system and an Intel Core i7 processor with a 2.70 GHz clock frequency and 8 GB of RAM. To minimize hardware latencies, an audio stream input/output (ASIO) sound card driver was used. ASIO bypasses the normal audio path from a user application through the layers of intermediary Windows operating system software so that the application can communicate directly with the sound card; each bypassed layer contributes to a reduction in latency. The audio signal is acquired at a sampling frequency of 44.1 kHz. The buffer size was kept at 512 samples (approximately 11.6 ms) at both the recording and playback ends. To achieve low latency and yet good performance, the STFT window size was kept short and the FFT window overlap at 75%. For simplicity, the number of spectral columns and the sparseness factor were kept fixed. The audio latency achieved was ≈46 ms, which was experienced as well tolerable. An attempt was also made to implement non-blind NNSC separation as a real-time algorithm. The dictionaries and code matrices for the two speakers were first learned using the training data. These prelearned code matrices were then used as the starting code matrices (instead of random matrices) for the iterative update, thereby reducing the number of required iterations of equations 3.30 and 3.31. However, the audio latency achieved was ≈500 ms, intolerably large for real-time separation purposes.

## 7 Conclusion

This letter presented novel linear approaches to audio source separation: (1) eigenmode analysis of covariance difference (EACD), in which spectro-temporal features associated with large variance for one source and small variance for the other source are identified; (2) maximum likelihood demixing (MLD), in which the mixture is modeled as the sum of two gaussian signals and maximum likelihood is applied to identify the most likely sources present in the mixture; and (3) suppression-regression (SR), in which autoregressive models are trained to reproduce one source and suppress the other. The approaches in this work use only a single microphone recording to perform source separation.

Unlike our proposed methods, which perform monaural source separation, various other source separation approaches require multiple microphone recordings (Hyvärinen et al., 2001; Pham & Cardoso, 2001; Souden, Araki, Kinoshita, Nakatani, & Sawada, 2013). Many of them are based on maximum likelihood considerations like ours (Degerine & Zaidi, 2004; Fevotte & Cardoso, 2005; Pham & Cardoso, 2001). Nevertheless, these approaches differ from the proposed methods not only in the number of inputs but also in other ways. For example, the probabilistic multispeaker model in Souden et al. (2013) is based on a latent variable assuming discrete states, and the speech signal is reconstructed assuming that each state is associated with only one speaker. As in binary masking-based methods, this assumption can lead to loss of information when time-frequency bins contain signals from multiple speakers; our methods are in principle not limited by this constraint. Pham and Cardoso (2001) propose demixing methods based on maximum likelihood and minimum mutual information principles for gaussian nonstationary sources. By contrast, our proposed methods assume stationary sources but incorporate temporal dependencies by concatenating consecutive spectral columns into a single vector, thereby increasing the timescale of the represented signal. Another multiple-microphone source separation approach, which exploits the statistical independence and nonstationarity of the sources, was proposed by Matsuoka, Ohoya, and Kawamoto (1995). In their algorithm, mixture signals are decorrelated using a neural network trained with stochastic gradient descent; their approach has the advantage of being independent of the distributions of the individual sources.
However, compared with the complex and iterative approaches proposed by Pham and Cardoso (2001) and by Matsuoka and colleagues (1995), our proposed methods learn the filters in a single step, making them more efficient and suitable for real-time applications.

Overall, the linear methods for single-microphone source separation proposed in this letter perform better than more computationally demanding nonlinear approaches such as NNSC (Schmidt et al., 2007) in terms of SNR, residual spectrograms, and PESQ scores. Unlike nonlinear NNSC, these linear approaches are not only simpler to implement but also faster to execute. Nevertheless, the semiblind NNSC approach has the advantage over the proposed linear approaches of being able to separate a target from an unknown masker or an unknown target from known noise. An interesting extension of this work would be to implement such abilities in future linear approaches.

This work is not the first to propose single-channel source separation methods based on the maximum likelihood approach (Jang & Lee, 2003). However, the proposed methods are more suitable for real-time applications: the method of Jang and Lee (2003) uses complex, iterative schemes for the separation task, whereas the methods presented here in essence require only fast forward and inverse Fourier transforms and matrix multiplications.
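To see why only matrix multiplications are needed at separation time, consider the following minimal sketch of the gaussian closed form underlying maximum-likelihood-style demixing; the covariances here are random stand-ins for the learned per-speaker spectral covariances, not trained models:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4   # toy feature dimension (stands in for stacked spectral bins)

# Random symmetric positive-definite stand-ins for the learned
# per-speaker covariance matrices C1 and C2.
A = rng.normal(size=(d, d)); C1 = A @ A.T + np.eye(d)
B = rng.normal(size=(d, d)); C2 = B @ B.T + np.eye(d)

s1 = rng.multivariate_normal(np.zeros(d), C1)   # hidden source 1
s2 = rng.multivariate_normal(np.zeros(d), C2)   # hidden source 2
x = s1 + s2                                     # single-channel mixture

# For zero-mean gaussian sources, the most likely decomposition of
# x = s1 + s2 is the linear (Wiener-style) estimate below; at run time
# this is one precomputed matrix multiplied with the mixture vector.
W1 = C1 @ np.linalg.inv(C1 + C2)
s1_hat = W1 @ x
s2_hat = x - s1_hat   # the two estimates sum exactly to the mixture
```

Because `W1` depends only on the training-time covariances, it can be precomputed, leaving a single matrix-vector product per spectral frame at run time.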

An alternative approach to source separation is the use of binary masks on mixture spectrograms, that is, assigning each time-frequency (TF) bin to the dominant source. The problem with binary masking and related approaches (Aoki et al., 2001; Brungart et al., 2006; Han & Wang, 2012; Jourjine et al., 2000; Nguyen et al., 2001; Rickard et al., 2001) is that artifacts or unnatural sounds may appear in the reconstructed signals. Efforts to combine ICA with binary masks have also been undertaken (Højen-Sørensen, Winther, & Hansen, 2002); however, a general problem with binary masking is the loss of information in overlapping TF bins where the target utterance has lower energy than the masker, because only one source is assumed to be active per TF bin. In contrast, the proposed linear methods do not exclusively assign a TF bin to a single speaker; instead, they compute the subspaces associated with each of the sources and project the mixture onto them, preserving the information of both speakers in a given TF bin. In a rather different approach to monaural source separation, Vishnubhotla and Espy-Wilson (2009) perform the segregation task by modeling the TF masking as a combination of complex sinusoids that are harmonics of the speakers' pitch frequencies, using a least-squares fitting approach. Deep neural networks have also been used to compute both binary masks (Wang & Wang, 2013) and soft masks (Huang et al., 2014) for source separation, but they are highly nonlinear and computationally expensive, whereas linear methods are simpler to implement.
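The information-loss argument against binary masking can be made concrete with a toy example; the soft weighting below is a simple ratio mask standing in for the subspace projections used by the proposed methods:

```python
import numpy as np

# Toy per-speaker power in three TF bins; in bin 1 both speakers are
# active but speaker 2 dominates.
P1 = np.array([4.0, 1.0, 2.0])   # speaker-1 power per bin
P2 = np.array([1.0, 3.0, 2.0])   # speaker-2 power per bin
mix = P1 + P2                    # mixture power per bin

# Binary masking assigns each bin exclusively to the dominant speaker,
# so speaker 1 loses all of its energy in bin 1.
binary_mask = (P1 > P2).astype(float)
s1_binary = binary_mask * mix    # -> [5., 0., 0.]

# A soft weighting keeps a share of every shared bin for both speakers.
soft_mask = P1 / (P1 + P2)
s1_soft = soft_mask * mix        # -> [4., 1., 2.]
```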

Human speech exhibits structure on multiple temporal scales, in line with natural sounds, which tend to vary slowly in time (Bregman, 1990; Rosen, 1992). Congruently, human auditory processing shows a bias toward the perception of continuity in sound streams (Bregman, 1990), motivating the inclusion of temporal continuity in source separation methods. The method proposed in Lim, Shinn-Cunningham, and Gardner (2012) represents any discrete time series as a set of time-frequency contours; applied to source separation, it allows sources to be extracted based on differences between the contour representations of target and masker signals. While this method works well when the timescales of the two underlying signals differ strongly, it is likely to fail in separation problems such as the multiple-speaker problem, where the underlying timescales of the two signals are very similar. In this work, temporal continuity is incorporated naively by simply concatenating consecutive spectral columns into a single vector, thereby increasing the timescale of the signal representation. Performance (in terms of SNR and residual) of all algorithms peaked for multiple-column representations, though less sensitively than expected; concatenating too many spectral columns reduced performance, presumably because of overfitting.
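The column-concatenation step admits a short sketch; the function name and toy sizes here are illustrative, not taken from the original code:

```python
import numpy as np

def stack_columns(spec, k):
    """Concatenate k consecutive spectrogram columns into one vector,
    giving each feature vector a k-frame temporal context."""
    n_freq, n_frames = spec.shape
    cols = [spec[:, t:t + k].reshape(-1, order="F")
            for t in range(n_frames - k + 1)]
    return np.stack(cols, axis=1)   # shape (n_freq * k, n_frames - k + 1)

spec = np.arange(12.0).reshape(3, 4)   # toy 3-bin, 4-frame spectrogram
X = stack_columns(spec, k=2)
# X has shape (6, 3); each column spans two consecutive frames.
```

Larger `k` lengthens the represented timescale at the cost of higher-dimensional covariances, which is consistent with the overfitting observed above for too many concatenated columns.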

Temporal continuity has also been addressed in a system proposed by Vincent and Rodet (2004), who modeled the activity of a source with a hidden Markov model. Such models are known to produce good separation results but are less suitable for real-time implementation. Virtanen (2007) incorporated the temporal continuity of features by introducing a dedicated cost function; because this cost function is computed iteratively, the algorithm might be difficult to implement in real time.

## Acknowledgments

This work was funded by Swiss National Science Foundation grant 200021-126844 “Early Auditory Based Recognition of Speech,” grant 200020-153565 “Fast Separation of Auditory Sounds,” and by European Research Council (ERC-Advanced Grant 268911).

## References
