Abstract

This letter addresses the problem of separating two speakers from a single microphone recording. Three linear methods are tested for source separation, all of which operate directly on sound spectrograms: (1) eigenmode analysis of covariance difference to identify spectro-temporal features associated with large variance for one source and small variance for the other source; (2) maximum likelihood demixing in which the mixture is modeled as the sum of two gaussian signals and maximum likelihood is used to identify the most likely sources; and (3) suppression-regression, in which autoregressive models are trained to reproduce one source and suppress the other. These linear approaches are tested on the problem of separating a known male from a known female speaker. The performance of these algorithms is assessed in terms of the residual error of estimated source spectrograms, waveform signal-to-noise ratio, and perceptual evaluation of speech quality scores. This work shows that the algorithms compare favorably to nonlinear approaches such as nonnegative sparse coding in terms of simplicity, performance, and suitability for real-time implementations, and they provide benchmark solutions for monaural source separation tasks.

1  Introduction

The problem of recovering underlying source signals from their mixtures is called source separation. It has been an area of intense research and has received much attention recently because of potential applications in hearing aid systems, signal preprocessing for speech recognition, network tomography, and medical signal processing in general (Hyvärinen, Karhunen, & Oja, 2001). When the problem is (over-)determined, that is, when the number of sources is no larger than the number of available mixtures or sensor recordings, generic assumptions such as statistical independence of the sources can be used for successful demixing (Hyvärinen et al., 2001). Most of the time, however, the problem is underdetermined: the number of sources is larger than the number of available mixtures, and more specific assumptions must be made for demixing. One of the most difficult cases is source separation from a single microphone, that is, segregating a target source from a masker using a single-channel recording. This task has proven extremely challenging (Wang & Brown, 2006) because one can rely only on the intrinsic acoustic properties of the target and the masker, with no additional cues.

Essentially three general approaches have been proposed for monaural separation: speech enhancement, computational auditory scene analysis (CASA), and model-based methods such as nonnegative matrix factorization (NMF). Linear methods have mostly been used for speech enhancement in noisy environments; they include spectral subtraction, Wiener filtering, and subspace-based approaches (Loizou, 2013). The subspace methods for speech enhancement are based on the principle that sources are usually confined to a subspace of Euclidean space, and methods have consequently been developed that compute these source-specific subspaces. Widely used examples are singular value decomposition (SVD), eigenvalue decomposition (EVD), and subspace tracking algorithms (Loizou, 2013). These methods have primarily been used for the enhancement of noisy speech (Jensen, Hansen, Hansen, & Sorensen, 1995; Hansen & Jensen, 2005, 2007; Weiss, 2009). Another approach to source separation is to compute an ideal binary mask (IBM) for the target source, considered one of the main goals of CASA systems for speech segregation (Wang & Brown, 2006). An IBM is a binary matrix in which an element is 1 if, in the corresponding time-frequency bin of the mixture signal, the power of the target source is known to be higher than that of the masker, and 0 otherwise. Binary masking and related approaches have been used extensively for demixing (Aoki et al., 2001; Brungart, Chang, Simpson, & Wang, 2006; Han & Wang, 2012; Jourjine, Rickard, & Yilmaz, 2000; Nguyen, Belouchrani, Abed-Meraim, & Boashash, 2001; Rickard, Balan, & Rosca, 2001). Deep neural networks (DNNs) have also been used to learn binary masks (Wang & Wang, 2013; Zhao, Wang, & Wang, 2014) and soft masks (Narayanan & Wang, 2013, 2014; Huang, Kim, Hasegawa-Johnson, & Smaragdis, 2014). A major disadvantage of the IBM is the loss of information in regions where the source and masker overlap. Model-based approaches such as NMF, which factorize a nonnegative matrix into (usually) two nonnegative matrices, have been used extensively for monaural source separation over a wide range of sources, including human speech, noise, and music (Schmidt, Larsen, & Hsiao, 2007).

This letter proposes various linear methods for separating (artificial) mixtures of two sources. The approaches are based on the power spectrogram, a time-frequency representation of sound that has the advantage of being phase insensitive and so allows speakers to be distinguished based on differences in the spectro-temporal sound patterns they produce. This work introduces new methods based on eigenanalysis, probabilistic demixing, and linear regression. These methods are evaluated by estimating (reconstructing) target spectrograms of single speakers from mixture speech. The reconstructed spectrograms are compared to those yielded by nonlinear approaches, that is, nonnegative sparse coding (NNSC) (Schmidt et al., 2007) and a supervised version thereof (which we refer to as non-blind NNSC, described in section 3.5).

2  Methods

All computations are performed using Matlab R2011a on a 64-bit machine with 8 GB of RAM and an Intel Core i7 processor with 4 cores and clock frequency 2.93 GHz.

Following Berouti, Schwartz, and Makhoul (1979), the signal in the time-frequency domain is represented as an element-wise exponentiated short-time Fourier transform (STFT):
$$X(f, t) = \big| \operatorname{STFT}\{x\}(f, t) \big|^{\gamma}. \qquad (2.1)$$
The STFT is computed using a Fourier window of $N$ samples and 75% overlap between successive Hanning windows. The algorithms are applied on $w$ adjacent columns of the spectrogram, concatenated into a single column vector. Each spectrogram column contains $d$ distinct frequency bands; the dimensionality of the data processed by the algorithms is therefore $w \cdot d$. The exponent $\gamma$ is referred to as the sparseness factor; it defines the sparseness of the sound representation (the sparseness of $X$ increases with $\gamma$). The performances of the algorithms are evaluated for various $N$, $\gamma$, and $w$. Note that for $\gamma = 2$ in equation 2.1, $X$ is referred to as the power spectral density (PSD).
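As an illustration of this representation, the following Python sketch (an assumption of this edit; the original study used Matlab) computes an exponentiated magnitude spectrogram with Hanning windows and 75% overlap. The names `n_fft` and `gamma` stand in for the window size and sparseness factor varied in the experiments.

```python
import numpy as np
from scipy.signal import stft

def exp_spectrogram(signal, fs, n_fft=1024, gamma=1.0):
    """Exponentiated magnitude STFT (equation 2.1): Hanning window, 75% overlap."""
    f, t, S = stft(signal, fs=fs, window="hann", nperseg=n_fft,
                   noverlap=3 * n_fft // 4)
    X = np.abs(S) ** gamma          # spectrogram columns x_t, one per time frame
    phase = np.angle(S)             # phase, reused later when inverting estimates
    return X, phase, f, t
```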

2.1  The Problem and the Approach

Speaker separation is evaluated on audio speech data from the GRID corpus and the MOCHA-TIMIT (Texas Instruments/Massachusetts Institute of Technology) database. All sentences are sampled at 16 kHz. The training data consist of 20 sets of audio files, each containing roughly 5 minutes of clean speech from a distinct speaker (10 male and 10 female speakers). In the training phase, source-specific filters are learned from the training set. The testing data consist of roughly 1 minute of clean speech from each speaker. On each test run, one male and one female speaker are randomly chosen and denoted by $X$ and $Y$. The same set of speakers is used for training and testing. The test audio signals (mixture speech) are created artificially by summing the two underlying speech signals $x(t)$ and $y(t)$. Since the STFT is a linear function, the mixture can be expressed in the time-frequency domain as $Z e^{i\theta_Z} = X e^{i\theta_X} + Y e^{i\theta_Y}$, where $Z$, $X$, and $Y$ are the absolute values of the FFT for the mixture, speaker $X$, and speaker $Y$, respectively, and $\theta$ denotes the corresponding phase. The inequality $Z \le X + Y$ holds for any phases $\theta_X$, $\theta_Y$, and $\theta_Z$, and for a pair of uncorrelated signals $x$ and $y$ the powers add in expectation, $\langle Z^2 \rangle = \langle X^2 \rangle + \langle Y^2 \rangle$, where $\langle \cdot \rangle$ denotes the expected value.

In this letter, the approximation $Z^{\gamma} \approx X^{\gamma} + Y^{\gamma}$ is used, which can be thought of as a generalization of the binary mask approach because the approximation is excellent ($Z^{\gamma}$ is close to $X^{\gamma} + Y^{\gamma}$) for any $\gamma$ whenever $X \gg Y$ or $Y \gg X$. In general, the assumption is good for $\gamma$ close to 2 (Berouti et al., 1979). The assumption becomes worse as $\gamma$ moves away from 2, and thus the performances of the various methods are expected to be lower in that regime.

The spectrograms of speech for the two speakers in the training set are denoted $X$ and $Y$. From the training set, the filters (linear methods) or libraries (NNSC) are derived. These filters and libraries are used to estimate each speaker's spectrogram, $\hat{X}$ and $\hat{Y}$, from mixture spectrograms $Z$.

From $\hat{X}$, an estimate of the underlying single-speaker sound waveform is also derived by inverting the spectrogram as follows:
$$\hat{x}_t(\cdot) = \operatorname{Re}\Big[ \mathcal{F}^{-1}\big\{ \hat{x}_t^{1/\gamma}\, e^{i\theta_t} \big\} \Big], \qquad (2.2)$$
where $\mathcal{F}^{-1}$ represents the inverse Fourier transform, $\operatorname{Re}$ the real part, and the phase signal $\theta_t$ is obtained by applying the STFT to the mixture signal $z(t)$. To obtain the final waveform from the set of overlapping estimates $\hat{x}_t(\cdot)$, each estimate is multiplied by a Hanning window, and their weighted mean (weighted by the amount of overlap) is computed.
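A minimal sketch of this inversion step, under the same assumptions as the previous snippet: the estimated spectrogram is raised to the power $1/\gamma$, combined with the mixture phase, and inverted with the same Hanning window and 75% overlap (scipy's `istft` performs a windowed overlap-add that approximates the weighted averaging described above).

```python
import numpy as np
from scipy.signal import istft

def invert_spectrogram(X_hat, mixture_phase, fs, n_fft=1024, gamma=1.0):
    """Invert an estimated exponentiated spectrogram using the mixture phase (eq. 2.2)."""
    S_hat = (X_hat ** (1.0 / gamma)) * np.exp(1j * mixture_phase)
    _, x_hat = istft(S_hat, fs=fs, window="hann", nperseg=n_fft,
                     noverlap=3 * n_fft // 4)
    return np.real(x_hat)
```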
Because speech signals typically show strong temporal correlations that extend beyond typical STFT windows, a reasonable extension is to consider more than individual spectrogram columns. To that end, the window vector $\mathbf{x}_t$ is defined to encompass multiple spectrogram columns,
$$\mathbf{x}_t = \big[ x_{t-k_l}^{\top}, \ldots, x_t^{\top}, \ldots, x_{t+k_r}^{\top} \big]^{\top}$$
for $k_l$ columns to the left and $k_r$ columns to the right of $x_t$, so that $\mathbf{x}_t$ comprises $w = k_l + 1 + k_r$ columns (depicted in Figure 1), and likewise for $\mathbf{y}_t$ and $\mathbf{z}_t$.
Figure 1:

A single spectrogram column $x_t$ and the window $\mathbf{x}_t$ comprising $k_l$ columns on the left and $k_r$ columns on the right.

Note that for the subsequent methods, one spectrogram window $\hat{\mathbf{x}}_t$ is predicted at each time step and averaged over overlapping windows to obtain $\hat{X}$, which is ultimately inverted into cleaned speech. The exception is suppression-regression, where single columns $\hat{x}_t$ and $\hat{y}_t$ are directly estimated from mixture windows $\mathbf{z}_t$. Note also that source separation methods based on windows $\mathbf{z}_t$ are feasible for real-time implementation only when the window is small, ideally a single column ($w = 1$).
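For concreteness, a sketch of how the window vectors $\mathbf{x}_t$ can be stacked from a spectrogram matrix whose columns are the $x_t$ (the names `k_left` and `k_right` correspond to the assumed $k_l$ and $k_r$ above):

```python
import numpy as np

def stack_windows(X, k_left=2, k_right=2):
    """Stack each spectrogram column with its k_left past and k_right future
    neighbors into one window vector (columns of the returned matrix)."""
    d, T = X.shape
    cols = []
    for t in range(k_left, T - k_right):
        window = X[:, t - k_left:t + k_right + 1]      # d x (k_left + 1 + k_right)
        cols.append(window.reshape(-1, order="F"))     # concatenate columns
    return np.column_stack(cols)                       # (w*d) x (T - k_left - k_right)
```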

2.2  Mean Subtraction (Applicable to the Linear Methods)

All methods make the simplifying assumption that $Z = X + Y$, that is, that the mixture spectrogram is the sum of the two individual speaker spectrograms. From this assumption, the following mean-subtraction method is derived. The means $\bar{x} = \langle x_t \rangle_t$ and $\bar{y} = \langle y_t \rangle_t$ are defined over the individual training spectrograms, where $\langle \cdot \rangle_t$ represents the mean over time. Equally, the means $\bar{\mathbf{x}}$ and $\bar{\mathbf{y}}$ are defined over windows of training spectrograms. As the training sets for both speakers are chosen to be equally large, the overall mean of the training spectrogram columns is given by $\bar{m} = (\bar{x} + \bar{y})/2$, and likewise for the mean spectrogram window. The linear methods to be defined extract demixing filters $F_X$ and $F_Y$ from the mean-subtracted training spectrograms $X - \bar{x}$ and $Y - \bar{y}$. The filter $F_X$ maps components of $X$ onto themselves and components of $Y$ onto zero, and vice versa for filter $F_Y$.

For demixing, the steps are to subtract from the mixture spectrogram twice the overall mean, $z_t - 2\bar{m} = z_t - \bar{x} - \bar{y}$, to apply the demixing filters, and to add the individual training spectrogram means $\bar{x}$ and $\bar{y}$ to the result to obtain the estimates $\hat{x}_t$ and $\hat{y}_t$, respectively. It is simple to show that by doing so (and assuming $F_X + F_Y = I$), the following is true: adding the estimates $\hat{x}_t$ and $\hat{y}_t$ yields $\hat{x}_t + \hat{y}_t = z_t$, consistent with the assumption $Z = X + Y$.
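The chosen scheme can be written compactly. The following Python sketch (assumed helper names, not the authors' Matlab code) applies arbitrary demixing filters with this mean handling:

```python
import numpy as np

def demix_columns(Z, F_X, F_Y, x_mean, y_mean):
    """Demix mixture columns Z (d x T): subtract twice the overall mean
    (= x_mean + y_mean), apply the speaker-specific filters, and add back
    the individual training means.  If F_X + F_Y = I, the estimates sum to Z."""
    Zc = Z - (x_mean + y_mean)[:, None]
    X_hat = F_X @ Zc + x_mean[:, None]
    Y_hat = F_Y @ Zc + y_mean[:, None]
    return X_hat, Y_hat
```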

The following mean-subtraction methods are also evaluated, but their performance is worse (larger residuals and lower signal-to-noise ratios; data not shown) compared to the above chosen method for mean subtraction:

  1. During training, the speaker-specific filters are computed on the individual-mean subtracted training spectrograms $X - \bar{x}$ and $Y - \bar{y}$. During testing, the mixture-mean subtracted mixture spectrogram $Z - \bar{z}$ is projected on these filters, and subsequently half of the mixture mean is added back to the projections to obtain the estimates $\hat{X}$ and $\hat{Y}$.

  2. The speaker-specific filters are computed on a set of overall-mean subtracted training spectrograms, $X - \bar{m}$ and $Y - \bar{m}$. During testing, $\hat{X}$ and $\hat{Y}$ are estimated by projecting the overall-mean subtracted mixture spectrogram, $Z - \bar{m}$, on these filters; subsequently, the individual training spectrogram mean $\bar{x}$ (resp. $\bar{y}$) is added to the projections to obtain the estimates $\hat{X}$ and $\hat{Y}$.

  3. The speaker-specific filters are computed on a set of overall-mean subtracted training spectrograms, $X - \bar{m}$ and $Y - \bar{m}$, as above. However, during the testing phase, the mixture spectrogram with twice the overall mean subtracted, $Z - 2\bar{m}$, is projected on these filters; subsequently, the individual training spectrogram mean $\bar{x}$ (resp. $\bar{y}$) is added to the projections to obtain the estimates $\hat{X}$ and $\hat{Y}$.

Note that methods 1 and 3 also satisfy the relation $\hat{X} + \hat{Y} = Z$, but method 2 does not.

3  Algorithms

The three linear methods for source separation are introduced followed by two variants of a nonlinear method.

3.1  Eigenmode Analysis of Covariance Difference (EACD)

Eigenmode analysis is applied in order to compute filters that span directions associated with large variance for one speaker and small variance for the other speaker, an approach inspired by Machens, Romo, and Brody (2010). The speech mixture spectrograms are then projected onto these filters to selectively suppress one of the two speakers present in the mixture.

Concretely, the covariance matrices for the two speakers are denoted by
$$C_X = \big\langle (x_t - \bar{x})(x_t - \bar{x})^{\top} \big\rangle_t, \qquad C_Y = \big\langle (y_t - \bar{y})(y_t - \bar{y})^{\top} \big\rangle_t. \qquad (3.1)$$
The difference of the covariance matrices is further defined as
$$C = C_X - C_Y.$$
Let $U$ be the matrix of eigenvectors that diagonalizes $C$:
$$U^{\top} C\, U = \Lambda, \qquad (3.2)$$
where $\Lambda$ is the diagonal matrix consisting of the eigenvalues of $C$. Since $C_X$ and $C_Y$ are symmetric matrices, the matrix $C$ is also symmetric. As a result, $U$ is an orthonormal matrix, and therefore $U U^{\top} = I$. By substituting the definitions of $C_X$ and $C_Y$ into equation 3.2, the following is obtained:
$$U^{\top} \big\langle (x_t - \bar{x})(x_t - \bar{x})^{\top} \big\rangle_t U \;-\; U^{\top} \big\langle (y_t - \bar{y})(y_t - \bar{y})^{\top} \big\rangle_t U \;=\; \Lambda.$$
Because the average is a linear operator,
$$\Lambda = \Lambda_X - \Lambda_Y, \qquad (3.3)$$
with $\Lambda_X = \big\langle \big(U^{\top}(x_t - \bar{x})\big)\big(U^{\top}(x_t - \bar{x})\big)^{\top} \big\rangle_t$ and $\Lambda_Y$ defined analogously. Thus, the diagonal elements of $\Lambda$ give the difference in variance between the two mean-subtracted data sets along the eigenvectors (columns of $U$). These eigenvectors span two subspaces. In the first subspace (positive eigenvalues), speaker $X$ produces a higher variance than speaker $Y$, and vice versa in the second. The sets of filters that span the first and the second subspaces are labeled $U_X$ and $U_Y$.
Once these filters are computed, the sources present in a mixture can be estimated simply by projecting the mean-subtracted mixture spectrogram onto the filters and subsequently adding back the individual means of the two speakers to obtain the estimated spectrograms for the two speakers:
$$\hat{x}_t = U_X U_X^{\top} \big( z_t - \bar{x} - \bar{y} \big) + \bar{x}, \qquad (3.4)$$
$$\hat{y}_t = U_Y U_Y^{\top} \big( z_t - \bar{x} - \bar{y} \big) + \bar{y}. \qquad (3.5)$$

The vectors $\hat{x}_t$ and $\hat{y}_t$ as defined in equations 3.4 and 3.5 satisfy $\hat{x}_t + \hat{y}_t = z_t$ (because $U_X U_X^{\top} + U_Y U_Y^{\top} = U U^{\top}$ is the identity matrix, and $Z = X + Y$ as mentioned in section 2). The algorithm works identically for windows of multiple spectral columns, $\mathbf{z}_t$, as described in section 2, by simply replacing the variables by their boldface counterparts and averaging $\hat{\mathbf{x}}_t$ and $\hat{\mathbf{y}}_t$ over their overlapping regions.
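A compact sketch of EACD under the notation above (assumed variable names; the training matrices hold mean-subtracted columns, and the resulting projectors can be passed to the `demix_columns` helper sketched in section 2.2):

```python
import numpy as np

def eacd_filters(X, Y):
    """X, Y: d x T matrices of mean-subtracted training columns for the two speakers.
    Returns the projection filters U_X U_X^T and U_Y U_Y^T (equations 3.1-3.5)."""
    C_X = np.cov(X, bias=True)
    C_Y = np.cov(Y, bias=True)
    eigvals, U = np.linalg.eigh(C_X - C_Y)     # symmetric difference matrix
    U_X = U[:, eigvals > 0]                    # directions with larger variance for X
    U_Y = U[:, eigvals <= 0]                   # directions with larger variance for Y
    return U_X @ U_X.T, U_Y @ U_Y.T
```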

3.2  Probabilistic Approach (Maximum Likelihood Demixing)

This approach performs source separation under the assumption that the two (mean-subtracted) sources are independent and gaussian distributed.

Let $C_X$ and $C_Y$ again be the covariance matrices of the two sources, which are assumed to be gaussian distributed:
$$p(x_t) = \frac{1}{Z_X} \exp\!\Big( -\tfrac{1}{2} (x_t - \bar{x})^{\top} C_X^{-1} (x_t - \bar{x}) \Big) \qquad (3.6)$$
and
$$p(y_t) = \frac{1}{Z_Y} \exp\!\Big( -\tfrac{1}{2} (y_t - \bar{y})^{\top} C_Y^{-1} (y_t - \bar{y}) \Big), \qquad (3.7)$$

with $Z_X$ and $Z_Y$ normalization constants.

With the assumption of independence of the two sources, the joint probability can be written as
$$p(x_t, y_t) = p(x_t)\, p(y_t).$$
The underlying sources $x_t$ and $y_t$ are estimated as the most likely constituents of the mixture. The estimate is derived by setting the derivative of the joint probability (given the mixture $z_t$) with respect to $x_t$ to zero:
$$\frac{\partial}{\partial x_t} \log p\big(x_t,\, z_t - x_t\big) = -\,C_X^{-1}(x_t - \bar{x}) + C_Y^{-1}\big(z_t - x_t - \bar{y}\big) = 0,$$
where in the last equation the assumption $y_t = z_t - x_t$ is used. Isolating $x_t$ yields
$$\hat{x}_t = \bar{x} + C_X \big(C_X + C_Y\big)^{-1} \big( z_t - \bar{x} - \bar{y} \big) \qquad (3.8)$$
and
$$\hat{y}_t = \bar{y} + C_Y \big(C_X + C_Y\big)^{-1} \big( z_t - \bar{x} - \bar{y} \big). \qquad (3.9)$$
(The detailed matrix operations are described in Petersen & Pedersen, 2008.) Because the matrices $C_X (C_X + C_Y)^{-1}$ and $C_Y (C_X + C_Y)^{-1}$ have to be computed only once, this method of demixing is extremely efficient. Note that for evaluating the performance of the algorithm for multiple spectral columns ($w > 1$), the procedure is exactly as in EACD for multiple columns (see section 3.1).
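A sketch of the MLD filters in equations 3.8 and 3.9 (assumed names; `solve` is used instead of an explicit matrix inverse for numerical stability). The filters can then be applied with the `demix_columns` helper from section 2.2.

```python
import numpy as np

def mld_filters(C_X, C_Y):
    """Maximum likelihood demixing filters F_X = C_X (C_X + C_Y)^{-1} and
    F_Y = C_Y (C_X + C_Y)^{-1} (equations 3.8 and 3.9); computed once and
    applied to every mean-subtracted mixture column."""
    S = C_X + C_Y
    F_X = np.linalg.solve(S.T, C_X.T).T   # C_X @ inv(S) without forming inv(S)
    F_Y = np.eye(S.shape[0]) - F_X        # equivalently C_Y @ inv(S)
    return F_X, F_Y
```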

3.3  Suppression-Regression

A simple approach to source separation is obtained through linear regression. The idea is to compute a speaker-specific filter $F_X$ that maps the (mean-subtracted) training spectrogram of speaker $Y$ onto zero (suppression) and that of speaker $X$ onto itself (regression):
$$x_t - \bar{x} = F_X \big( x_t - \bar{x} \big) + \epsilon_t, \qquad (3.10)$$
$$0 = F_X \big( y_t - \bar{y} \big) + \eta_t, \qquad (3.11)$$
where $\epsilon_t$ and $\eta_t$ are sources of uncorrelated gaussian noise. The linear map $F_X$ that minimizes the sum of squared errors, $\sum_{t=1}^{T} \big( \|\epsilon_t\|^2 + \|\eta_t\|^2 \big)$, is computed, where $T$ denotes the common length of the training spectrograms.
For the second source, the linear map $F_Y$ is obtained by minimizing the analogous sum of squared errors in the system
$$0 = F_Y \big( x_t - \bar{x} \big) + \epsilon_t, \qquad (3.12)$$
$$y_t - \bar{y} = F_Y \big( y_t - \bar{y} \big) + \eta_t. \qquad (3.13)$$
The solutions of the above equations are
$$F_X = C_X \big( C_X + C_Y \big)^{-1}, \qquad (3.14)$$
$$F_Y = C_Y \big( C_X + C_Y \big)^{-1}, \qquad (3.15)$$
where $C_X$ and $C_Y$ are the covariance matrices of the respective speakers. Based on the maps $F_X$ and $F_Y$, the two sources are estimated from the mixture as
$$\hat{x}_t = F_X \big( z_t - \bar{x} - \bar{y} \big) + \bar{x}, \qquad (3.16)$$
$$\hat{y}_t = F_Y \big( z_t - \bar{x} - \bar{y} \big) + \bar{y}. \qquad (3.17)$$
These source estimates are identical to equations 3.8 and 3.9 of MLD. Thus, in the one-column case, suppression-regression is identical to maximum likelihood demixing.
As mentioned earlier, correlations over longer durations are also considered and are modeled by demixing filters that span several spectrogram columns. Therefore, analogous to equations 3.10 to 3.13, $F_X$ and $F_Y$ are chosen such that they minimize the sum of squared errors in the equation systems
$$x_t - \bar{x} = F_X \big( \mathbf{x}_t - \bar{\mathbf{x}} \big) + \epsilon_t, \qquad (3.18)$$
$$0 = F_X \big( \mathbf{y}_t - \bar{\mathbf{y}} \big) + \eta_t, \qquad (3.19)$$
and
$$0 = F_Y \big( \mathbf{x}_t - \bar{\mathbf{x}} \big) + \epsilon_t, \qquad (3.20)$$
$$y_t - \bar{y} = F_Y \big( \mathbf{y}_t - \bar{\mathbf{y}} \big) + \eta_t, \qquad (3.21)$$
where $\bar{\mathbf{x}}$ and $\bar{\mathbf{y}}$ are the individual speaker means for the larger predictor window. From these equations, the following are derived:
$$F_X = C_{x\mathbf{x}} \big( C_{\mathbf{X}} + C_{\mathbf{Y}} \big)^{-1}, \qquad (3.22)$$
$$F_Y = C_{y\mathbf{y}} \big( C_{\mathbf{X}} + C_{\mathbf{Y}} \big)^{-1}, \qquad (3.23)$$
where $C_{\mathbf{X}}$ and $C_{\mathbf{Y}}$ are the covariance matrices of the windows $\mathbf{x}_t$ and $\mathbf{y}_t$, respectively, and $C_{x\mathbf{x}}$ and $C_{y\mathbf{y}}$ are the covariance matrices between single columns $x_t$ and their corresponding windows $\mathbf{x}_t$ and between the columns $y_t$ and their corresponding windows $\mathbf{y}_t$.
Having determined the linear maps $F_X$ and $F_Y$ on the training data, the mixture is separated according to
$$\hat{x}_t = F_X \big( \mathbf{z}_t - \bar{\mathbf{x}} - \bar{\mathbf{y}} \big) + \bar{x}, \qquad (3.24)$$
$$\hat{y}_t = F_Y \big( \mathbf{z}_t - \bar{\mathbf{x}} - \bar{\mathbf{y}} \big) + \bar{y}. \qquad (3.25)$$
Because suppression-regression is identical to MLD for single columns, the results for suppression-regression are not shown in Figures 3 and 5. In the multiple spectral column case ($w > 1$), unlike MLD, where the demixing filters are learned from the past, in suppression-regression the demixing filters are learned from both the past and the future, as shown in Figure 4.

Note that a large condition number of the matrix to be inverted can become a significant problem. Ill-conditioned matrices are observed only for suppression-regression in the case of multiple spectral columns ($w > 1$). One way of dealing with such ill-conditioned matrices is regularization; in this work, Tikhonov regularization (Tikhonov & Arsenin, 1977) is used. For an ill-posed inverse problem $A r = b$ (ill posed because of either the nonexistence or the nonuniqueness of $r$), the estimate of $r$ is usually computed using an ordinary least squares approach that minimizes the residual $\|A r - b\|^2$, where $\|\cdot\|$ represents the Euclidean norm; the problem may be ill posed because $A$ is ill conditioned or singular. To obtain a solution with desirable properties, a regularization term $\|\Gamma r\|^2$ is included in this minimization, where $\Gamma$ is called the Tikhonov matrix; here $\Gamma = \alpha I$, with $\alpha$ a constant and $I$ the identity matrix. An explicit solution is $\hat{r} = (A^{\top} A + \alpha^2 I)^{-1} A^{\top} b$ for some regularization constant $\alpha$. In this work, the regularization constant is chosen such that it maximizes the separation performance (higher SNR and PESQ score, lower residual). It is varied from 0 upward, with step sizes initially increasing in multiples of 10 and a fixed, fine step size close to peak performance. Peak performance was determined by cross-validation on an independent set of data. Figures 4 and 6b show the results for suppression-regression after regularization. The performance improvement is largest for a large number of columns, which is expected because the inversion problem becomes more ill posed as the number of columns grows. On average, the SNR values improve by 0.7 dB, the PESQ values by 0.3, and the residual values decrease by 0.05 after regularization.
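A hedged sketch of the windowed suppression-regression solution of equations 3.22 and 3.23 with the Tikhonov (ridge) term described above. The stacking convention and parameter `alpha` follow the assumptions of the earlier snippets; this is not the authors' exact implementation.

```python
import numpy as np

def sr_filters(Xw, Yw, x_cols, y_cols, alpha=0.0):
    """Xw, Yw: (w*d) x T matrices of mean-subtracted training windows.
    x_cols, y_cols: d x T matrices of the corresponding mean-subtracted single
    columns.  Returns ridge-regularized filters F_X, F_Y (eqs. 3.22-3.23)."""
    T = Xw.shape[1]
    C_ww = (Xw @ Xw.T + Yw @ Yw.T) / T           # window covariances of both speakers
    C_xw = (x_cols @ Xw.T) / T                   # column/window cross-covariance, speaker X
    C_yw = (y_cols @ Yw.T) / T                   # column/window cross-covariance, speaker Y
    A = C_ww + (alpha ** 2) * np.eye(C_ww.shape[0])
    F_X = np.linalg.solve(A.T, C_xw.T).T         # C_xw @ inv(A)
    F_Y = np.linalg.solve(A.T, C_yw.T).T         # C_yw @ inv(A)
    return F_X, F_Y
```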

3.4  Source Separation Using Nonnegative Sparse Coding (NNSC)

These linear approaches are compared to nonlinear approaches based on NMF, which encompasses a set of methods in which a matrix $V$ is factorized into two nonnegative matrices $D$ and $H$ such that $V \approx D H$ (all components of $D$ and $H$ are nonnegative). Nonnegative matrix factorization leads to a parts-based representation because the factorization allows only additive, not subtractive, combinations (Eggert & Korner, 2004).

Training spectrograms $X$ are factorized such that $D$ is a dictionary matrix whose columns contain a set of source-specific basis vectors and $H$ is the code matrix that contains nonnegative weights determining the linear combination of basis vectors used to approximate $X$. Inspired by Schmidt et al. (2007), in which NMF was applied to source separation, a variant of sparse NMF termed nonnegative sparse coding (NNSC) is applied, in which only a few filters are used to represent the data because most of the filter coefficients are constrained to take values close to zero. In NNSC, sparseness of $H$ is enforced by minimizing the following cost function,
$$E(D, H) = \tfrac{1}{2}\, \| X - D H \|_2^2 + \lambda \| H \|_1, \qquad (3.26)$$
where $\|\cdot\|_2$ represents the 2-norm, $\|\cdot\|_1$ represents the 1-norm (Manhattan norm), and $\lambda$ is a sparsity parameter.

Sparsity of the basis coefficients is enforced by the second term in equation 3.26. This term guarantees that only a small subset of dictionary elements is used at any time, thus forcing the dictionary elements to be source specific. Note that NNSC is a nonlinear approach; finding the global optimum of equation 3.26 is not tractable (the problem is NP-hard).

There exist diverse algorithms for computing this factorization (Eggert & Korner, 2004; Hoyer, 2002; Lee & Seung, 1999, 2006; Lin, 2007; Schmidt et al., 2007) including the multiplicative update rule from Lee and Seung (1999, 2006). This update rule has been very popular due to the simplicity of its implementation. An advantage of multiplicative update rules over standard gradient descent update is that convergence is guaranteed because no step-size parameter is needed (Lee & Seung, 1999). Also the relaxation toward a local minimum is fast and computationally inexpensive (Lee & Seung, 1999). This work uses the multiplicative update rules of Schmidt et al. (2007), which are inspired by Eggert and Korner (2004) and Lee and Seung (1999, 2006). These update rules provide an extra advantage in that they allow the target dictionaries to be learned from the mixture and thus constitute a semisupervised method of learning.
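The exact update rules of Schmidt et al. (2007) are not reproduced here. The following is a simplified sketch of a standard multiplicative NNSC scheme (Lee-Seung-style Euclidean updates with an L1 penalty on the code matrix and column-normalized dictionary), intended only to illustrate the alternating update structure; parameter names are assumptions.

```python
import numpy as np

def nnsc(V, n_components=256, sparsity=0.1, n_iter=200, rng=None):
    """Simplified NNSC: V ~= D @ H with nonnegative D, H and an L1 penalty on H.
    Multiplicative updates; D is renormalized column-wise after each step."""
    rng = np.random.default_rng(rng)
    d, T = V.shape
    D = rng.random((d, n_components)) + 1e-3
    H = rng.random((n_components, T)) + 1e-3
    eps = 1e-9
    for _ in range(n_iter):
        H *= (D.T @ V) / (D.T @ D @ H + sparsity + eps)      # code update with sparsity
        D *= (V @ H.T) / (D @ H @ H.T + eps)                 # dictionary update
        D /= np.linalg.norm(D, axis=0, keepdims=True) + eps  # keep columns unit norm
    return D, H
```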

The masker dictionary $D_Y$ is first computed using the clean speech of the masker as follows (speaker $Y$ being the masker and speaker $X$ the target in this particular case):

  1. Start with a randomly initialized dictionary matrix $D_Y$ and code matrix $H_Y$.

  2. Alternate the following updates until convergence:
    formula
    formula
    where $\otimes$ represents pointwise multiplication and the horizontal line pointwise division, $\lambda$ defines the sparsity parameter of single-speaker dictionaries, $H_Y$ is the full code matrix, and $\bar{D}_Y$ is the column-wise normalized dictionary matrix for speaker $Y$.
  3. Under the assumption of additive mixing of sources, the following is obtained:
    $$z_t \approx D_X h_{X,t} + D_Y h_{Y,t} = \big[ D_X \; D_Y \big] \begin{bmatrix} h_{X,t} \\ h_{Y,t} \end{bmatrix} = D\, h_t,$$
    where $D_X$ and $D_Y$ are the dictionary matrices and $h_{X,t}$ and $h_{Y,t}$ are single columns of the code matrices for speakers $X$ and $Y$, respectively; $D$ and $h_t$ denote the concatenated dictionaries and code vectors.

The target dictionaries are then learned directly from the mixture via the following iterative update rules until convergence:
formula
3.30
formula
3.31
formula
3.32
where $\lambda_X$ and $\lambda_Y$ are the sparsity parameters for the code matrices of speakers $X$ and $Y$, respectively. That is, when the dictionaries are trained on clean speech, the sparsity parameter $\lambda$ is applied, and when the dictionaries are applied to the mixture, the sparsity parameters $\lambda_X$ and $\lambda_Y$ are applied.
Once $D_X$ and $H_X$ are computed, they are used to estimate the target spectrogram as follows:
$$\hat{x}_t = D_X\, h_{X,t}, \qquad (3.33)$$
where $h_{X,t}$ are the single columns of the code matrix $H_X$ of speaker $X$.

The above approach of semiblind NNSC is also modified by first learning the target dictionary from the clean data set and then learning the masker dictionary from the mixture using update rules analogous to equations 3.30 to 3.32. Note that learning the dictionary of the masker from its clean speech is very useful when the target model is unavailable. If the masker model is unavailable, one can learn the masker dictionary from the mixture while learning the target from its clean speech. We show results for both of these semiblind approaches.

3.5  Non-Blind NNSC

One might expect better performance when the dictionaries for both speakers are prelearned from clean training sentences, as was done for the linear methods (in the NNSC approach described in section 3.4, the dictionary of one speaker was computed from clean speech and the dictionary of the other speaker from the mixture). In this section, NNSC is applied to learn both dictionaries $D_X$ and $D_Y$ from clean training sentences; the code matrices $H_X$ and $H_Y$ are then inferred from the mixture using the update rules in equations 3.30 and 3.31.

The individual sources in this non-blind NNSC method are then estimated as
$$\hat{X} = D_X H_X \qquad (3.34)$$
and
$$\hat{Y} = D_Y H_Y. \qquad (3.35)$$
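With both dictionaries fixed, separation reduces to inferring the code of the mixture over the concatenated dictionary and splitting the reconstruction (equations 3.34 and 3.35). A sketch under the same simplified update rule as above (a single sparsity value is used here instead of separate $\lambda_X$ and $\lambda_Y$; not the exact rules 3.30 and 3.31):

```python
import numpy as np

def separate_nonblind(Z, D_X, D_Y, sparsity=0.01, n_iter=200, rng=None):
    """Estimate X_hat = D_X H_X and Y_hat = D_Y H_Y from a mixture spectrogram Z,
    keeping the prelearned dictionaries fixed and updating only the codes."""
    rng = np.random.default_rng(rng)
    D = np.hstack([D_X, D_Y])                       # concatenated dictionary
    H = rng.random((D.shape[1], Z.shape[1])) + 1e-3
    eps = 1e-9
    for _ in range(n_iter):
        H *= (D.T @ Z) / (D.T @ D @ H + sparsity + eps)
    n_x = D_X.shape[1]
    return D_X @ H[:n_x], D_Y @ H[n_x:]
```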

Parameter values for non-blind and semiblind NNSC were chosen by maximizing SNR as follows. The optimal parameter values in Schmidt et al. (2007) were taken as starting points and varied one after the other over an empirically chosen range while keeping the others at their optimal values. The sparsity parameters $\lambda_X$ and $\lambda_Y$ were varied over the range [0, 0.1] in linear steps of 0.01, the single-speaker sparsity parameter $\lambda$ over the range [0, 1] in linear steps of 0.1, and the dictionary sizes $N_X$ and $N_Y$ over the range [1, 512] in powers of 2. The following optimal values for the NNSC parameters were obtained:

  • Sparsity parameters $\lambda_X$ and $\lambda_Y$ in equations 3.30 and 3.31

  • Number of components of the target and masker dictionaries, $N_X$ and $N_Y$

  • Sparsity parameter $\lambda$ used in equation 3.27 for learning single-speaker dictionaries

4  Performance Criteria

The performances of the algorithms are evaluated using the following measures:

  1. Normalized residual between estimated and target spectrograms (a computational sketch of the first two measures is given after this list),
    $$\mathrm{res} = \frac{\| \hat{X} - X \|^2}{\| X \|^2},$$
    where $\|\cdot\|^2$ represents the square of the 2-norm
  2. SNR computed over the target audio signal $x(t)$ and the predicted audio signal $\hat{x}(t)$,
    $$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t x(t)^2}{\sum_t \big( x(t) - \hat{x}(t) \big)^2}$$
  3. Perceptual evaluation of speech quality (PESQ) scores (Hu & Loizou, 2008) computed over the target audio signal and the predicted audio signal
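A sketch of the first two measures under the definitions assumed above (PESQ requires the ITU-T reference implementation and is therefore omitted):

```python
import numpy as np

def normalized_residual(X_hat, X):
    """Normalized residual between estimated and target spectrograms."""
    return np.sum((X_hat - X) ** 2) / np.sum(X ** 2)

def snr_db(x_hat, x):
    """Waveform SNR (in dB) of the predicted signal x_hat against the target x."""
    n = min(len(x), len(x_hat))                 # guard against off-by-one lengths
    err = x[:n] - x_hat[:n]
    return 10.0 * np.log10(np.sum(x[:n] ** 2) / np.sum(err ** 2))
```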

5  Simulation Results

After training the filters and dictionaries, artificial speech mixtures of two speakers (male and female) were created (see Figure 2a), and the corresponding mixture spectrogram was computed (see Figure 2b). From this mixture spectrogram, the speech spectrogram of the target speaker (male) was estimated using the algorithms presented above.

Figure 2:

(a) An excerpt of the sound waveforms from the target (male) speaker, the masker (female), and their mixture. (b) From top to bottom: target spectrogram $X$, masker spectrogram $Y$, mixture spectrogram $Z$, and estimated target spectrograms for the male speaker using EACD, MLD/suppression-regression, NNSC (TL, target learned), NNSC (ML, masker learned), and NNSC (BL, both learned). Log spectrograms are shown. (c) Waveform excerpts of the target (blue), the mixture (red), and the target estimated by MLD (black). The target is well preserved, and the masker is well suppressed in both the voiced and unvoiced regions for the linear methods, illustrating the strength of these methods. Note that the results for MLD and suppression-regression are exactly the same; hence only the MLD results are shown in the figure.

The example in Figure 2 contains regions with speech from only the target, only the masker, and a superposition of both. Masker suppression is strongest in the case of NNSC (masker learned). However, the target itself is also strongly suppressed in this case; as a result, the target speech suffers from low intelligibility. By contrast, masker suppression is weaker in the linear approaches and in NNSC (target learned and both learned), but the target speech is better preserved.

Shown in Figure 2c is an excerpt of the waveforms comparing the target, the mixture, and the estimated target using MLD/suppression-regression (the best method). The estimated waveform matches well with the target waveform.

The performance of the various algorithms is compared for different values of the STFT window size, the STFT exponent $\gamma$, and the number of spectrogram columns $w$ (see Figures 3 to 5). Starting from a fixed base parameter set, one of the three parameters is varied while the other two are kept fixed. In all figures, the plots show spectrogram residuals and SNRs averaged over all speakers. The performances of the algorithms are further compared using the widely used, International Telecommunication Union--Telecommunication (ITU-T)-recommended perceptual PESQ metric. The PESQ scores support the results shown in Figures 3 to 5 and align well with them (see Figure 6).

Figure 3:

(a) Normalized residual of the estimated target spectrogram and (b) SNR of the estimated target waveform versus STFT window size, comparing all the algorithms. Plots show averages over all speakers. The optimal performance was at 64 ms for MLD/suppression-regression and NNSC (both learned), 128 ms for EACD and NNSC (target learned), and 256 ms for NNSC (masker learned). Note that the results for MLD and suppression-regression are exactly the same; hence, only one of them is shown in this figure.

Figure 4:

(a) Normalized residual of the estimated target spectrogram and (b) SNR of the estimated target waveform versus number of spectral columns, comparing all the algorithms (plots show averages over all speakers).

Figure 5:

(a) Normalized residual of the estimated target spectrogram and (b) SNR of the estimated target waveform versus STFT exponent $\gamma$, comparing all the algorithms (plots show averages over all speakers). Results in panels a and b are well aligned with each other. Overall, MLD/suppression-regression outperforms all other methods. EACD outperforms the nonlinear NNSC variants only when the masker is learned. As expected, the performance of NNSC (both learned) is better than that of NNSC (target learned), followed by the masker-learned NNSC variant.

Figure 6:

PESQ scores for all algorithms, computed as averages over all speakers and plotted as a function of (a) STFT exponent $\gamma$, (b) number of spectral columns $w$, and (c) STFT window size. The PESQ scores are in alignment with the other measures of performance (SNR on the audio signal and normalized spectrogram residuals).

Figure 3 shows the performances of the algorithms under varying FFT window size. For all algorithms except NNSC (masker learned), the SNR of the reconstructed signals peaks at either 64 ms or 128 ms. The normalized residual plots confirm this finding for the well-performing methods MLD/suppression-regression and NNSC (both dictionaries learned). For the remaining three methods, the minimal normalized residual is not reached even at the largest window size tested.

Figure 4 shows the performances of the algorithms for various numbers of spectral columns. Adding more spectral columns captures the context-dependent information in each analysis vector, leading to better performance. However, temporal context beyond a certain limit (depending on the amount of data available) is not useful and leads to performance reduction. Note that research has shown that the temporal context in human speech can typically vary from 20 ms to 200 ms (Rosen, 1992).

The performance comparison of the different algorithms gives a sense of how prone they are to overfitting. The better-performing algorithms MLD, suppression-regression, and NNSC (both learned and target learned) exhibit their peak performances for a small number of spectrogram columns, while the performances of the other nonlinear algorithms saturate. EACD shows a monotonically increasing performance, revealing that it is more robust to overfitting.

Shown in Figure 5 are the performances of the algorithms for various values of the STFT exponent $\gamma$. Overall, MLD/suppression-regression performs best, followed by NNSC (both speakers learned). The NNSC (masker learned) method performs worst.

The sparseness factor $\gamma$ has a strong impact on performance. The assumption implicit in the linear methods is that the data are gaussian distributed. The nonlinear NNSC methods, however, assume exponentially distributed independent components in the mixture (Hoyer, 2002). As $\gamma \to 0$, the distribution of the spectrogram pixels approaches a gaussian; in the midrange this distribution approaches an exponential, and for large $\gamma$ it becomes heavy tailed. The approximation that the sources are additive in the spectrogram domain gets worse the further $\gamma$ deviates from 2 (Berouti et al., 1979).

Given this trade-off, it may not be surprising that the optimal performance for EACD and MLD is not obtained at small values of $\gamma$ but at an intermediate value.

The same argument can be applied to NNSC (both dictionaries learned), for which the performance is also optimal at an intermediate value of $\gamma$. In this range, MLD/suppression-regression and NNSC (both speakers learned) perform best. This can also be seen in the estimated spectrograms (see Figure 2). The region where the masker (the female speaker in this example) is speaking alone is suppressed best by NNSC (masker learned); however, this method also over-suppresses the target speaker, leading to lower SNR values compared to suppression-regression and NNSC (both dictionaries learned). The performances of EACD and MLD are still very good in this range.

5.1  Sparse NMF with Kullback-Leibler (KL) and Itakura-Saito Divergence Criteria

Apart from the Euclidean distance-based criterion for matrix factorization, we also tested other divergence criteria, namely the Kullback-Leibler (KL) divergence and the Itakura-Saito (IS) divergence. These criteria are usually considered better suited for audio spectra than the Euclidean distance (Févotte & Idier, 2011). Using the KL divergence, the SNRs averaged over all speakers improved by 0.77 dB for NNSC (target learned), 1.18 dB for NNSC (masker learned), and 0.34 dB for non-blind NNSC. The residuals decreased on average by 0.09 for NNSC (target learned), 0.10 for NNSC (masker learned), and 0.01 for non-blind NNSC. The PESQ scores also improved on average, by 0.06 for NNSC (target learned), 0.05 for NNSC (masker learned), and 0.06 for non-blind NNSC. The results are summarized in Table 1.

Table 1: 
Mean Values of the SNRs, Normalized Residuals, and PESQ Scores Averaged over All Speakers for the Sparse NMF-Based Methods, Comparing Different Divergence Criteria.

                          Mean SNR (dB)           Mean Normalized Residual    Mean PESQ Score
NNSC Method               Eucl.   KL      IS      Eucl.   KL      IS          Eucl.   KL      IS
NNSC (target learned)     4.98    5.75    4.86    0.6     0.51    0.64        1.86    1.92    1.74
NNSC (masker learned)     3.27    4.45    3.0     0.66    0.56    0.70        1.73    1.78    1.71
NNSC (both learned)       5.98    6.32    5.80    0.41    0.40    0.45        2.08    2.14    2.02

Note: The KL divergence criterion yields a better performance than the Euclidean distance criterion. In contrast, the IS divergence criterion performs slightly worse.

By contrast, the results using sparse IS divergence were slightly worse than those using the Euclidean distance. Compared to SNRs using the Euclidean distance, the SNRs decreased on average by 0.12 dB, 0.27 dB, and 0.18 dB, and residual values increased by 0.037, 0.04, and 0.04 for NNSC (target learned), NNSC (masker learned), and nonblind NNSC, respectively. The PESQ scores for the IS divergence were reduced as well in comparison to the 2-norm distance-based divergence measure. The PESQ scores averaged over all speakers were reduced by 0.12, 0.02, and 0.05 for NNSC (target learned), NNSC (masker learned), and nonblind NNSC, respectively. The results comparing all the divergence criteria are summarized in Table 1.

5.2  Additional Layer of Wiener Filtering for Speech Enhancement

Wiener filtering is a widely used method in signal processing, particularly for signal denoising and source separation; it is typically applied to audio signals in the spectrogram domain (using the STFT). We use an adaptive Wiener filtering (AWF) approach that enforces the reconstruction constraint that the mixture spectrogram is the sum of the individual estimated spectrograms. Under this constraint, the new spectrogram estimates are given by $\hat{X}_{\mathrm{AWF}} = Z \otimes \dfrac{\hat{X}}{\hat{X} + \hat{Y}}$ and $\hat{Y}_{\mathrm{AWF}} = Z \otimes \dfrac{\hat{Y}}{\hat{X} + \hat{Y}}$, where the horizontal line represents pointwise division and $\otimes$ represents pointwise multiplication of two matrices. This reconstruction constraint improves the separation performance of all methods studied. The results are summarized in Table 2.
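A sketch of this reconstruction-constrained (adaptive Wiener) step as described above; the small constant guarding against division by zero is an implementation detail assumed here.

```python
import numpy as np

def adaptive_wiener(Z, X_hat, Y_hat, eps=1e-12):
    """Rescale the estimates so that they sum to the mixture spectrogram Z:
    X_new = Z * X_hat / (X_hat + Y_hat), and analogously for Y_new."""
    total = X_hat + Y_hat + eps
    X_new = Z * X_hat / total
    Y_new = Z * Y_hat / total
    return X_new, Y_new
```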

Table 2: 
SNRs, Normalized Residuals, and PESQ Scores, Averaged over All Speakers, for the Proposed Demixing Methods without Wiener Filtering (WF) and with Additional Adaptive Wiener Filtering (AWF) or Consistent Wiener Filtering (CWF).

                          Mean SNR (dB)          Mean Normalized Residual   Mean PESQ Score
Method                    no WF (AWF/CWF)        no WF (AWF/CWF)            no WF (AWF/CWF)
EACD                      5.5 (6.3/6.09)         0.47 (0.45/0.46)           1.9 (2.21/2.19)
MLD/SR                    6.42 (6.96/7.21)       0.38 (0.34/0.35)           2.35 (2.56/2.56)
NNSC (target learned)     4.98 (5.13/4.74)       0.6 (0.58/0.65)            1.86 (1.9/1.85)
NNSC (masker learned)     3.27 (3.4/3.01)        0.66 (0.63/0.74)           1.73 (1.87/1.55)
NNSC (both learned)       5.98 (6.88/6.53)       0.41 (0.38/0.377)          2.08 (2.19/2.13)

Notes: The improved performance values (over the methods without WF) are shown in bold. AWF improves the performance of all methods. CWF does not improve the performance of NNSC (target learned) and NNSC (masker learned) but improves it for all other methods. MLD/SR is MLD/suppression-regression. Note that applying CWF on top of AWF did not, on average, improve the performance for any of the demixing methods.

We also tried the consistent Wiener filtering (CWF) approach proposed by Roux and Vincent (2013), which enforces consistency between neighboring STFT coefficients, as follows. Under gaussian assumptions, the negative log likelihood of the conditional distribution of the source STFT $S$ given the mixture is, up to an additive constant, $\sum_{f,t} |S_{ft} - \mu_{ft}|^2 / \sigma^2_{ft}$, where $(f,t)$ represents a time-frequency bin, and $\mu_{ft}$ and $\sigma^2_{ft}$ are the mean and the variance of the conditional distribution in that bin. To enforce consistency of $S$, a necessary and sufficient condition is that the STFT of the inverse STFT of $S$ is equal to $S$ itself or, in other words, that $S$ belongs to the null space (kernel) of the linear operator $S \mapsto \operatorname{STFT}(\operatorname{iSTFT}(S)) - S$. The hard consistency constraint may be inadequate when the estimated source variances are unreliable. Therefore, the norm of $\operatorname{STFT}(\operatorname{iSTFT}(S)) - S$ is used as a soft penalty term with a weighting constant. The consistent estimate of $S$ is obtained by minimizing the resulting objective function, the negative log likelihood plus the weighted consistency penalty, using a conjugate gradient method.

Applied as a post-processing step to the demixed spectrograms, CWF makes the reconstructed spectrograms more amenable to inversion and therefore enhances the quality of the demixed audio signals. CWF requires as inputs the power spectral densities (PSDs) of both the target and the masker (these correspond to the spectrograms in equation 2.1 with $\gamma = 2$). Using the PSDs of our estimated audio signals as inputs to CWF, we report in Table 2 the separation performance following CWF.

CWF applied to our linear methods, using the optimal set of parameters, improved the SNRs averaged over all speakers by 0.79 dB for MLD/suppression-regression and by 0.59 dB for EACD. The normalized residual values were reduced on average by 0.03 for MLD/suppression-regression and by 0.01 for EACD. The PESQ scores averaged over all speakers improved by 0.21 for MLD/suppression-regression and by 0.29 for EACD. Note that among the NNSC-based approaches, consistent Wiener filtering improved only the results for NNSC (both learned).

6  Real-Time Implementations

The suitability of the linear methods for real-time applications was tested by implementing two of them (EACD and MLD) in real time. The computer used for this purpose had a 64-bit operating system and an Intel Core i7 processor with 2.70 GHz clock frequency and 8 GB of RAM. To minimize hardware latencies, an audio stream input/output (ASIO) sound card driver was used. ASIO bypasses the normal audio path from a user application through layers of intermediary Windows operating system software, so that the application can communicate directly with the sound card; each bypassed layer contributes to a reduction in latency. The audio signal is acquired at a sampling frequency of 44.1 kHz. The buffer size was kept at 512 samples (≈11.6 ms) at both the recording and playback ends. To achieve low latency and yet good performance, a short STFT window was used with 75% overlap between FFT windows. For simplicity, the number of spectral columns $w$ and the sparseness factor $\gamma$ were kept fixed. The audio latency achieved was ≈46 ms, which was experienced as well tolerable. This work also attempted to implement non-blind NNSC separation as a real-time algorithm. The dictionaries and code matrices for the two speakers were first learned from the training data. These prelearned code matrices were then used as the starting code matrices (instead of random matrices) for the iterative updates, thereby reducing the number of required iterations in equations 3.30 and 3.31. However, the audio latency achieved was ≈500 ms, intolerably large for real-time separation purposes.
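To make the real-time setting concrete, here is a hedged sketch of a block-wise processing loop for a precomputed single-column filter (MLD or EACD); buffer management, the callback mechanism, and sound card I/O are abstracted away, and the original implementation used Matlab with an ASIO driver rather than this Python code.

```python
import numpy as np

def process_block(block, prev_tail, F_X, x_mean, y_mean, gamma, n_fft):
    """Process one incoming audio block with a precomputed single-column filter.
    prev_tail carries the last (n_fft - hop) samples to realize 75% overlap."""
    hop = n_fft // 4
    buf = np.concatenate([prev_tail, block])
    out = np.zeros_like(buf)
    window = np.hanning(n_fft)
    for start in range(0, len(buf) - n_fft + 1, hop):
        frame = buf[start:start + n_fft] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec) ** gamma, np.angle(spec)
        x_hat = F_X @ (mag - x_mean - y_mean) + x_mean       # demix one column (eq. 3.16)
        x_hat = np.maximum(x_hat, 0.0) ** (1.0 / gamma)      # back to magnitude
        # overlap-add of the windowed inverse FFT (window normalization omitted)
        out[start:start + n_fft] += np.real(
            np.fft.irfft(x_hat * np.exp(1j * phase))) * window
    return out, buf[-(n_fft - hop):]
```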

7  Conclusion

This letter presented novel linear approaches to audio source separation: (1) eigenmode analysis of covariance difference (EACD), in which spectro-temporal features associated with large variance for one source and small variance for the other source are identified; (2) maximum likelihood demixing (MLD), in which the mixture is modeled as the sum of two gaussian signals and maximum likelihood is applied to identify the most likely sources present in the mixture; and (3) suppression-regression (SR), in which autoregressive models are trained to reproduce one source and suppress the other. The approaches in this work use only a single-microphone recording to perform source separation.

Unlike our proposed methods that perform monaural source separation, there exist various other source separation approaches that require multiple microphone recordings (Hyvärinen et al., 2001; Pham & Cardoso, 2001; Souden, Araki, Kinoshita, Nakatani, & Sawada, 2013). Many of them are based on maximum likelihood considerations like ours (Degerine & Zaidi, 2004; Fevotte & Cardoso, 2005; Pham & Cardoso, 2001). Nevertheless, these approaches differ from the proposed methods not only in terms of the number of inputs but also in other ways. For example, the probabilistic multispeaker model in Souden et al. (2013) is based on a latent variable assuming discrete states. The speech signal is reconstructed assuming that each state is associated with only one speaker. This assumption, as in binary masking-based methods, can lead to loss of information when time-frequency bins contain signals from multiple speakers. Our methods in principle are not limited by this constraint. Pham and Cardoso (2001) propose demixing methods based on maximum likelihood and minimum mutual information principles for gaussian non-stationary sources. By contrast, our proposed methods assume stationary sources but incorporate the temporal dependencies by concatenating consecutive spectral columns into a single vector, thereby increasing the timescale of the represented signal. Another multiple microphone source separation approach that uses the statistical independence and nonstationarity of the sources was proposed by Matsuoka, Ohoya, and Kawamoto (1995). In their algorithm, mixture signals are decorrelated using a neural network trained with stochastic gradient descent. Their approach has the advantage of being independent of the type of distribution of the individual sources. However, by comparison with the complex and iterative approaches proposed by Pham and Cardoso (2001) and by Matsuoka and colleagues (1995), our proposed methods learn the filters in a single step, thereby making our methods more efficient and suitable for real-time applications.

Overall, the linear methods for single-microphone source separation proposed in this letter perform better than more computationally demanding nonlinear approaches such as NNSC (Schmidt et al., 2007) in terms of SNRs, residual spectrograms, and PESQ scores. Unlike nonlinear NNSC, these linear approaches are not only simpler to implement but also faster to execute. Nevertheless, the semiblind NNSC approach has the advantage over the proposed linear approaches of being able to separate a target from an unknown masker or an unknown target from known noise. An interesting extension of this work would be to implement such abilities in future linear approaches.

This work is not the first to propose single-channel source separation methods based on a maximum likelihood approach (Jang & Lee, 2003). However, the proposed methods are more suitable for real-time applications: the method of Jang and Lee (2003) uses complex and iterative schemes for the separation task, whereas the methods here in essence require only fast forward and inverse Fourier transforms and matrix multiplications.

An alternative approach to source separation is the use of binary masks on mixture spectrograms, that is, assigning each time-frequency (TF) bin to the dominant source. The problem with binary masking and related approaches (Aoki et al., 2001; Brungart et al., 2006; Han & Wang, 2012; Jourjine et al., 2000; Nguyen et al., 2001; Rickard et al., 2001) is that artifacts or unnatural sounds may appear in the reconstructed signals. Efforts to combine ICA with binary masks have also been undertaken in Højen-Sørensen, Winther, and Hansen (2002); however, a general problem with binary masking is the loss of information in the overlapping TF bins where the target utterance has lower energy than the masker. This occurs because only one source is assumed to be active per TF bin. In contrast, the proposed linear methods do not exclusively assign a TF bin to one speaker; rather, they compute the subspaces associated with each of the sources and project the mixture onto them, which preserves information for both speakers in a given TF bin. In a rather different approach to monaural source separation, Vishnubhotla and Espy-Wilson (2009) perform the segregation task by modeling the mixture as a combination of complex sinusoids that are harmonics of the pitch frequencies of the speakers, using a least-squares fitting approach. Deep neural networks have also been used to compute both binary masks (Wang & Wang, 2013) and soft masks (Huang et al., 2014) for the task of source separation. However, they are highly nonlinear and computationally expensive; linear methods are simpler to implement.

Human speech exhibits structure on multiple temporal scales, in line with natural sounds, which tend to vary slowly in time (Bregman, 1990; Rosen, 1992). Congruently, human auditory processing shows a bias toward the perception of continuity in sound streams (Bregman, 1990), motivating the inclusion of temporal continuity in source separation methods. The method proposed in Lim, Shinn-Cunningham, and Gardner (2012) represents any discrete time series as a set of time-frequency contours. Applied to source separation, this method allows sources to be extracted based on differences in the contour representations of target and masker signals. While it works well when the timescales of the two underlying signals are very different, it is likely to fail in separation problems, such as the multiple-speaker problem, in which the underlying timescales of the two signals are very similar. Temporal continuity is incorporated naively in this work by simply concatenating consecutive spectral columns into a single vector, thereby increasing the timescale of the signal representation. Performance (in terms of SNR and residual) of all algorithms peaked for multiple-column representations, though less sensitively than expected. Concatenating too many spectral columns led to a reduction in performance, presumably due to overfitting.

Temporal continuity has also been addressed in a system proposed by Vincent and Rodet (2004) who modeled the activity of a source with a hidden Markov model. Such models are known to produce good separation results but are less suitable for real-time implementation. Virtanen (2007) incorporated temporal continuity of features by introducing a dedicated cost function. Because this cost function is computed iteratively, it might be difficult to implement this algorithm in real time.

Acknowledgments

This work was funded by Swiss National Science Foundation grant 200021-126844 “Early Auditory Based Recognition of Speech,” grant 200020-153565 “Fast Separation of Auditory Sounds,” and by European Research Council (ERC-Advanced Grant 268911).

References

Aoki, M., Okamoto, M., Aoki, S., Matsui, H., Sakurai, T., & Kaneda, Y. (2001). Sound source segregation based on estimating incident angle of each frequency component of input signals acquired by multiple microphones. Acoust. Sci. Technology, 22, 149–157.

Berouti, M., Schwartz, R., & Makhoul, J. (1979). Enhancement of speech corrupted by acoustic noise. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 208–211). Piscataway, NJ: IEEE.

Bregman, A. S. (1990). Auditory scene analysis. Cambridge, MA: MIT Press.

Brungart, D. S., Chang, P. S., Simpson, B. D., & Wang, D. (2006). Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation. J. Acoust. Soc. Am., 120, 4007–4018.

Degerine, S., & Zaidi, A. (2004). Separation of an instantaneous mixture of gaussian autoregressive sources by the exact maximum likelihood approach. IEEE Trans. Signal Process., 52, 1492–1512.

Eggert, J., & Korner, E. (2004). Sparse coding and NMF. In Proceedings of the IEEE International Joint Conference on Neural Networks (pp. 2529–2533). Piscataway, NJ: IEEE.

Févotte, C., & Cardoso, J.-F. (2005). Maximum likelihood approach for blind audio source separation using time-frequency gaussian source models. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (pp. 78–81). Piscataway, NJ: IEEE.

Févotte, C., & Idier, J. (2011). Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation, 23(9), 2421–2456.

Han, K., & Wang, D. (2012). A classification based approach to speech segregation. J. Acoust. Soc. Am., 132, 3475–3483.

Hansen, P. C., & Jensen, S. H. (2005). Prewhitening for rank-deficient noise in subspace methods for noise reduction. IEEE Trans. on Signal Process., 53, 3718–3726.

Hansen, P. C., & Jensen, S. H. (2007). Subspace-based noise reduction for speech signals via diagonal and triangular matrix decompositions: Survey and analysis. EURASIP Journal on Advances in Signal Process., 1, 092953.

Højen-Sørensen, P.A.d.F.R., Winther, O., & Hansen, L. K. (2002). Analysis of functional neuroimages using ICA with adaptive binary sources. Neurocomputing, 49, 213–225.

Hoyer, P. O. (2002). Non-negative sparse coding. In Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing (pp. 557–565). Piscataway, NJ: IEEE.

Hu, Y., & Loizou, P. C. (2008). Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process., 16, 229–238.

Huang, P. S., Kim, M., Hasegawa-Johnson, M., & Smaragdis, P. (2014). Deep learning for monaural speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1581–1585). Piscataway, NJ: IEEE.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). What is independent component analysis? In S. Haykin (Ed.), Independent component analysis (pp. 145–164). New York: Wiley.

Jang, G.-J., & Lee, T.-W. (2003). A maximum likelihood approach to single-channel source separation. J. Mach. Learn. Res., 4, 1365–1392.

Jensen, S. H., Hansen, P. C., Hansen, S. D., & Sorensen, J. A. (1995). Reduction of broad-band noise in speech by truncated QSVD. IEEE Transactions on Speech and Audio Processing, 3(6), 439–448.

Jourjine, A., Rickard, S., & Yilmaz, O. (2000). Blind separation of disjoint orthogonal signals: Demixing N sources from 2 mixtures. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 2985–2988). Piscataway, NJ: IEEE.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.

Lee, D. D., & Seung, H. S. (2006). Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 556–562). Cambridge, MA: MIT Press.

Lim, Y., Shinn-Cunningham, B., & Gardner, T. J. (2012). Sparse contour representations of sound. IEEE Signal Process. Lett., 19, 684–687.

Lin, C.-J. (2007). Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19, 2756–2779.

Loizou, P. C. (2013). Speech enhancement: Theory and practice. Boca Raton, FL: CRC Press.

Machens, C. K., Romo, R., & Brody, C. D. (2010). Functional, but not anatomical, separation of "what" and "when" in prefrontal cortex. J. Neurosci., 30, 350–360.

Matsuoka, K., Ohoya, M., & Kawamoto, M. (1995). A neural net for blind separation of nonstationary signals. Neural Netw., 8, 411–419.

Narayanan, A., & Wang, D. L. (2013). Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 7092–7096). Piscataway, NJ: IEEE.

Narayanan, A., & Wang, D. L. (2014). Investigation of speech separation as a frontend for noise robust speech recognition. IEEE Trans. Audio Speech Lang. Process., 22, 826–835.

Nguyen, L. T., Belouchrani, A., Abed-Meraim, K., & Boashash, B. (2001). Separating more sources than sensors using time-frequency distributions. In Proceedings of the Sixth International Symposium on Signal Processing and Its Applications (pp. 583–586). Piscataway, NJ: IEEE.

Petersen, K. B., & Pedersen, M. S. (2008). The matrix cookbook. Technical University of Denmark.

Pham, D.-T., & Cardoso, J. F. (2001). Blind separation of instantaneous mixtures of nonstationary sources. IEEE Trans. Signal Process., 49, 1837–1848.

Rickard, S., Balan, R., & Rosca, J. (2001). Real-time time-frequency based blind source separation. In Proceedings of the International Workshop on Independent Component Analysis and Blind Source Separation (pp. 651–656). New York: Springer.

Rosen, S. (1992). Temporal information in speech: Acoustic, auditory and linguistic aspects. Phil. Trans. Roy. Soc. Lond. Series B: Biol. Sci., 336, 367–373.

Roux, J., & Vincent, E. (2013). Consistent Wiener filtering for audio source separation. IEEE Signal Processing Letters, 20(3), 217–220.

Schmidt, M. N., Larsen, J., & Hsiao, F.-T. (2007). Wind noise reduction using non-negative sparse coding. In Proceedings of the IEEE Workshop on Machine Learning for Signal Processing (pp. 431–436). Piscataway, NJ: IEEE.

Souden, M., Araki, S., Kinoshita, K., Nakatani, T., & Sawada, H. (2013). A multichannel MMSE-based framework for speech source separation and noise reduction. IEEE Trans. Audio Speech Lang. Process., 21, 1913–1928.

Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. Ultimo, NSW: Halsted Press.

Vincent, E., & Rodet, X. (2004). Music transcription with ISA and HMM. In Lecture Notes in Computer Science (pp. 1197–1204). New York: Springer.

Virtanen, T. V. T. (2007). Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process., 15, 1066–1074.

Vishnubhotla, S., & Espy-Wilson, C. Y. (2009). An algorithm for speech segregation of co-channel speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 109–112). Piscataway, NJ: IEEE.

Wang, D. L., & Brown, G. J. (2006). Fundamentals of computational auditory scene analysis. In D. L. Wang & G. J. Brown (Eds.), Computational auditory scene analysis: Principles, algorithms, and applications (pp. 1–44). Hoboken, NJ: Wiley and IEEE Press.

Wang, Y., & Wang, D. L. (2013). Towards scaling up classification-based speech separation. IEEE Trans. Audio Speech Lang. Process., 21, 1381–1390.

Weiss, R. J. (2009). Underdetermined source separation using speaker subspace models. Doctoral dissertation, Columbia University.

Zhao, X., Wang, Y., & Wang, D. L. (2014). Robust speaker identification in noisy and reverberant conditions. IEEE Trans. Audio Speech Lang. Process., 22, 836–845.