Abstract

This letter presents a new algorithm for blind dereverberation and echo cancellation based on independent component analysis (ICA) for actual acoustic signals. We focus on frequency domain ICA (FD-ICA) because its computational cost and speed of learning convergence are sufficiently reasonable for practical applications such as hands-free speech recognition. In applying conventional FD-ICA as a preprocessing step for automatic speech recognition in noisy environments, one of the most critical problems is how to cope with reverberations. To extract a clean signal from the reverberant observation, we model the separation process in the short-time Fourier transform domain and apply the multiple input/output inverse-filtering theorem (MINT) to the FD-ICA separation model. A naive implementation of this method is computationally expensive because its time complexity is the second order of the reverberation time. Therefore, the main issue in dereverberation is to reduce the high computational cost of ICA. In this letter, we reduce the computational complexity to the linear order of the reverberation time by using two techniques: (1) a separation model based on the independence of delayed observed signals with MINT and (2) spatial sphering for preprocessing. Experiments show that the computational cost grows in proportion to the linear order of the reverberation time and that our method improves the word correctness of automatic speech recognition by 10 to 20 points in an RT20 = 670 ms reverberant environment.

1.  Introduction

1.1.  Background and Motivation

The ultimate goal of our research is to develop a human-symbiotic robot. Such robots, for example, could rescue people from dangerous situations in difficult-to-access places and perform domestic duties. Communication through speech is essential and natural between humans and robots because speech is used in daily life to communicate. A robot equipped with microphones on its body or head hears its own speech, the speech of nearby people, reverberations (echoes), and unknown noise. For example, a person may interrupt and begin to speak in a TV-noise environment while the robot is still talking, that is, barge in, as often occurs in human-to-human communication. In this case, the robot hears its own and the person's speech at the same time. Thus, robots are required to distinguish target sounds from many disturbances as easily as humans do.

We present a comprehensive method for separating the robot's own speech and reducing reverberation with a reduced amount of a priori information. This method enables spoken dialogue systems to handle barge-in situations (Komatani, Kawahara, & Okuno, 2008; Matsuyama, Komatani, Ogata, & Okuno, 2009). Our target problems are reduced to a combination of blind dereverberation and echo cancellation using a microphone array in an acoustic and signal processing area. Here, blind dereverberation means removal of a target speaker's reverberation by using only observed signals, one of the most difficult problems in acoustic signal processing. Although dereverberation is an important problem, not many dereverberation methods work well in practical situations. Echo cancellation means separation of the system speech signal by using a reference signal and observed signals.

We outline our requirements and philosophy before describing related work. The method for speech separation must (1) have low computational cost, (2) require a minimal amount of prior information, and (3) have connectivity with other useful methods. The reason for requiring low cost is that the resources of robots are usually limited, while real-time responses are required in human-robot interactions. Therefore, the computational cost is an important issue for practical application. The reason for requiring a small amount of a priori information is that robots must be able to work even in unknown acoustic environments where they cannot make strong assumptions, such as about the positions of sound sources. The reason for requiring connectivity with other methods is that many noise reduction and recognition mechanisms (Sohn, Kim, & Sung, 1999; Raj & Stern, 2005) have been proposed, and integrating sound signal separation with them will drastically improve the overall system, including automatic speech recognition (ASR). Our philosophy is based on a macroviewpoint of constructing a whole system, such as an intelligent system or robot, from sound separation to speech recognition and spoken dialogue, not a micro one that deals with only a specific technique.

Speech separation is categorized as a computational auditory scene analysis (CASA) problem (Rosenthal & Okuno, 1998) because it requires discrimination of a target speech signal from noisy input signals. Figure 1 shows the standard scheme for achieving CASA in our situation. Sound source separation is used to separate mixed sound signals. The separated speech is then usually recognized by ASR. The result of ASR is used in the spoken dialogue system to produce system utterances. The system speech can be used in the separation process because the system's speech is included in the signals captured by the microphones.

Figure 1:

Problems and topics in this letter.


1.2.  Related Work

A great deal of research has been conducted on blind dereverberation and echo cancellation. However, little reported research has been done on dealing with both of them or on integrating them.

1.2.1.  Echo Cancellation

Echo cancellation is used to separate a known (reference) source signal from the microphone input. Many echo cancellation techniques have been developed, beginning with the Wiener filter, least mean squares (LMS), and the Kalman filter (Haykin, 1991), up to the latest achievements such as statistical Bayes filters based on state-space models (Bishop, 2006). These echo cancellation methods are generally not robust against noise (i.e., barge-in situations), and they usually require double-talk detectors (Ghose & Reddy, 2000; Gansler & Benesty, 2001). More robust echo cancellation methods have been reported, such as applications of the M-estimator or independent component analysis (ICA) (Gansler, Gray, Sondhi, & Benesty, 2000; Takeda, Nakadai, Komatani, Ogata, & Okuno, 2007; Miyabe, Hinamoto, Saruwatari, Shikano, & Tatekura, 2007; Yang & Sakai, 2008). However, these methods do not handle dereverberation problems. Herbordt, Buchner, Nakamura, and Kellermann (2007) integrated echo cancellation with generalized side-lobe cancellation (Hoshuyama, Sugiyama, & Hirano, 1999) in a multichannel microphone array, thereby achieving robust separation of known sources and emphasis of the target speech. However, this approach requires a rough location of the targeted sound source, so it does not satisfy our requirements.

1.2.2.  Dereverberation and Blind Dereverberation

Blind dereverberation means removal of a target speaker's reverberation by using only observed signals. Nakatani and Miyoshi (2003) achieved blind dereverberation by using the harmonic structure of the speech signal. Gomez, Even, Saruwatari, and Shikano (2008) applied a fast spectral subtraction to late reverberations by using a prerecorded impulse response. These and similar approaches are not suitable for integrating echo cancellation because they use schemes different from those of echo cancellation (e.g., different optimization criteria).

In contrast, a number of multichannel blind deconvolution methods that do not use any impulse responses have been proposed for blind dereverberation. For example, Nakatani, Yoshioka, Kinoshita, Miyoshi, and Juang (2008) proposed a blind dereverberation method based on maximum likelihood estimation using a time-varying gaussian model and a multichannel linear prediction model. However, they did not consider echo cancellation, even though most such methods employ a statistical approach compatible with echo cancellation. Other methods using only microphone arrays, including multichannel blind deconvolution, have not dealt with echo cancellation (Hyvarinen, Karhunen, & Oja, 2001; Furuya & Kataoka, 2005; Douglas, Sawada, & Makino, 2005; Larue, Mars, & Jutten, 2006).

1.3.  Our Approach and Principal Contribution

We make use of ICA (Hyvarinen et al., 2001), especially frequency domain ICA (FD-ICA) (Ikeda & Murata, 1999), to deal simultaneously with blind dereverberation and echo cancellation. We use FD-ICA because (1) it provides a natural framework for related problems such as blind source separation and adaptive filtering (Joho, Mathis, & Moschytz, 2001; Miyabe et al., 2007), (2) it is robust against gaussian noise, such as fan noise, and (3) its convergence and computational cost are excellent compared with time domain ICA. However, Araki, Mukai, Makino, Nishikawa, and Saruwatari (2003) identified a fundamental limitation in the performance of FD-ICA for actual acoustic signals: reverberations are barely separated, which causes a substantial deterioration of ASR (Nishiura, Hirano, Denda, & Nakayama, 2007). Although using FD-ICA enables the convolutive mixing problem to be converted into an instantaneous mixing problem that is easy to solve, another technique is required to achieve blind dereverberation for actual acoustic signals because we usually observe reverberations longer than the proper window size used in the Fourier transform for FD-ICA (Araki et al., 2003).

We apply the multiple input/output inverse filtering theorem (MINT) (Miyoshi & Kaneda, 1988) to the separation model of FD-ICA and adopt STFT domain modeling (Takeda et al., 2009; Nakatani et al., 2008) to overcome the separation performance limitation caused by reverberation. Since FD-ICA has already been extended to integrate blind source separation with echo cancellation (Takeda et al., 2009), combining MINT and FD-ICA–based echo cancellation achieves both blind dereverberation and echo cancellation. However, this naive application substantially increases the computational cost because it essentially separates all reflected sounds as additional sources. To overcome this problem, we developed a method comprising (1) a separation model using observed signal independence that holds under MINT conditions, (2) a spatial sphering technique for preprocessing, and (3) miscellaneous techniques for practical applications. The model provides a new learning algorithm for the separation filters and is a natural extension of FD-ICA. With this method, we can achieve low-computational-cost and fast-convergence ICA for acoustic signals. (Note that this letter is an extended version of Takeda et al., 2009.)

Our method essentially differs from conventional ICA and multichannel blind deconvolution methods in that it can separate multiple sound sources and their reverberations, except for the specific permutation problem of ICA. Our algorithm is essentially different from previous convolutive ICA algorithms (Hyvarinen et al., 2001; Nishikawa, Saruwatari, & Shikano, 2003; Douglas et al., 2005; Kokkinakis & Nandi, 2006; Hiroe, 2007) because it uses a trick derived from MINT and reduces to a simple learning rule for the separation filter that is similar to FD-ICA. In terms of solutions to the FD-ICA limitations, there are some applicable methods, such as Nishikawa et al. (2003), Araki, Makino, Aichner, Nishikawa, and Saruwatari (2005), and Hiroe (2007). However, some of them were not designed for use in the STFT domain, and some of them increase the computational cost because they apply other transformations. Moreover, they cannot handle echo cancellation and blind dereverberation.

1.4.  Organization of This Letter

Section 2 states our problem and explains our fundamental method. In section 3, we explain the MINT-based observation model and ICA and its problems. Section 4 explains our separation model and spatial sphering technique. Section 5 discusses our method based on experimental results on speech recognition.

2.  Problem Statement

This section formulates the blind dereverberation and echo cancellation problem and explains the common principles underlying our method.

2.1.  Our Strategy and Focus

The problem is to extract the target speaker's speech signal from signals observed with a microphone array, which include the target speaker's speech, the robot's speech, and their reverberations. Moreover, we need to satisfy the three requirements of low computational cost, a minimal amount of a priori information, and connectivity with other methods.

There are two approaches to model sound signals: modeling in the time domain and modeling in the subband domain, which includes frequency domain, wavelet domain, and other domains (Vaidyanathan, 1993). From the viewpoint of the whole robot audition/CASA system, it is suitable to use subband domain processing, especially the STFT domain (Hiroe, 2007; Takeda et al., 2007; Nakatani et al., 2008). This will be explained in section 2.2.

STFT domain processing is effective in terms of the following three main efficiency and technical aspects. First, since the speech features for ASR are often extracted framewise from the STFT domain signal, they can be extracted directly after the STFT domain speech separation and dereverberation, and we can eliminate unnecessary processing. Second, since misrecognition due to reverberation is caused by spectra remaining from several previous frames, we can naturally formulate a separation model in the STFT domain by regarding the reverberant spectra as delayed sound sources from previous frames. Moreover, reverberations in the same STFT frame or in a very delayed frame can be handled using ASR techniques, such as cepstral mean normalization (Acero & Stern, 1992) or acoustic model adaptations (Gauvain & Lee, 1994; Leggetter & Woodland, 1995). Third, there are many techniques for reducing stationary background noise in a high-SNR (signal-to-noise ratio) environment. A typical technique is spectral subtraction (Boll, 1979). These techniques are effective for improving SNR and are compatible with our separation scheme in the STFT domain.

Figure 2:

Our specifications and STFT-domain processing.


We define our speech separation problem as follows (see Figure 2):

  • Purpose: 

    Removal of reverberation and robot's speech

  • Input: 

    (1) Observed sound spectra in STFT domain (one person's speech signal, one reference speech signal, and their mixed reverberations) and (2) reference robot's speech spectrum in STFT domain

  • Output: 

    Separated target speaker's spectrum in STFT domain

  • Requirements: 

    (1) Low computational cost and (2) weak modeling assumptions

We use ICA to satisfy the second requirement. Since ICA involves few modeling assumptions, it is suitable for solving our problem. This method differs from other microphone array techniques in that it does not need the location of the sound source or equivalent information such as a transfer function. Thus, the remaining questions are how to treat reverberations in the STFT domain for ICA and how to satisfy the requirement for low computational cost. We describe the answers in section 3.

2.2.  Spectrum in STFT Domain

We model sound separation in the STFT domain. Sounds are usually captured by microphones and sampled discretely. We represent a sound signal as x(k), where k means the discrete time index. To describe the observed sound signal x(k) from sound source s(k), a convolutive model is often used (Vaidyanathan, 1993):
x(k) = \sum_{n=0}^{K_h - 1} h(n)\, s(k - n),
2.1
where h(n) represents the time-invariant transfer coefficients with Kh taps and models the direct and reflected sounds in the time domain.
Sound-signal processing is often designed in the frequency domain because the convolutive relationship becomes a scale relationship,
x_w = h_w s_w,
2.2
where w is the index of the frequency bin and x_w, h_w, and s_w correspond to the Fourier-transformed signals of x(k), h(n), and s(k).
In practice, since it is impossible to analyze the frequency content of all time durations, STFT is often used as an approximation. All signals in the time domain are analyzed using STFT with an analysis window of size Ta and shift Ts. For example, the sound signal x(k) is transformed into a spectrum x_w[t] using the discrete Fourier transform,
x_w[t] = \sum_{k=0}^{T_a - 1} f(k)\, x(t T_s + k)\, e^{-j 2\pi w k / T_a},
2.3
where t denotes the frame index, w denotes the frequency bin index in the STFT domain, and f is a window function for short-duration analysis. This process can be executed rapidly by using the fast Fourier transform algorithm at O(Ta log Ta) computational cost. Note that x_w[t] is a complex number because the STFT spectrum generally becomes a complex number even for a real-valued signal.
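To make the frame and bin indexing concrete, the following is a minimal numpy sketch of equation 2.3, assuming a Hanning window f(k) and the 16 kHz, 64 ms window / 20 ms shift settings used later in the letter; the function name and default arguments are illustrative rather than part of the original formulation.

```python
import numpy as np

def stft(x, Ta=1024, Ts=320):
    """Sketch of equation 2.3: frame-wise DFT with window size Ta and shift Ts
    (64 ms and 20 ms at 16 kHz). Returns a (frames x bins) complex array x_w[t]."""
    f = np.hanning(Ta)                          # window function f(k) (assumed Hanning)
    n_frames = 1 + (len(x) - Ta) // Ts
    X = np.empty((n_frames, Ta // 2 + 1), dtype=complex)
    for t in range(n_frames):
        frame = f * x[t * Ts : t * Ts + Ta]
        X[t] = np.fft.rfft(frame)               # FFT: O(Ta log Ta) per frame
    return X
```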

2.3.  Independent Component Analysis

The flow of ICA processing is outlined in Figure 3. We assume that a zero-mean original source vector, s = [s_1, …, s_L]^T, consists of L mutually independent complex random variables. They are mixed by a time-invariant linear system that is represented by an L × L nonsingular matrix, A. Let x = [x_1, …, x_L]^T be an observed signal vector. The relationship between s and x is represented as
x = A s.
2.4
ICA estimates the original source vector by using only the observed vector x,
y = W x,
2.5
where W is an L × L separation matrix estimated using ICA.
Figure 3:

Flow of ICA processing.


The higher-order ICA assumes the probabilistic density function (PDF) of the estimated vector y. ICA estimates W by minimizing the Kullback-Leibler divergence (KLD):
J(W) = \int p(y) \log \frac{p(y)}{q(y)}\, dy,
2.6
where p is the joint PDF of y and q corresponds to the product of the marginal PDFs of y. These parameters are usually estimated using an iterative gradient-based method because of the nonlinearity of J.
A well-known learning algorithm is the natural gradient algorithm (Amari, 1998):
W^{j+1} = W^{j} + \mu \left( I - E[\varphi(y)\, y^H] \right) W^{j},
2.7
where μ is a step-size parameter, E is an expectation operator, ·^j represents the number of iterations, and ·^H represents a conjugate transpose. φ(y) is a nonlinear function vector:
\varphi(y) = [\varphi(y_1), \ldots, \varphi(y_L)]^T.
2.8

ICA is ambiguous about the permutation and scaling of each element of the estimated vector, y. These two factors affect the quality of the resynthesized signals when using ICA in a decomposed domain, such as the frequency or wavelet domain (Hyvarinen et al., 2001). A solution to this problem is presented in section 4.5.
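As a concrete illustration, the following is a minimal numpy sketch of equations 2.5, 2.7, and 2.8, assuming the nonlinear function tanh(100|y|)e^{jθ(y)} that is adopted later in the letter; the function names, iteration count, and step size are illustrative choices, not prescriptions from the original.

```python
import numpy as np

def phi(y):
    # Nonlinear function used later in the letter: tanh(100|y|) exp(j*theta(y)).
    return np.tanh(100.0 * np.abs(y)) * np.exp(1j * np.angle(y))

def natural_gradient_ica(x, n_iter=20, mu=0.1):
    """Estimate an L x L separation matrix W from observations x (L x T complex)
    with the natural gradient rule of equation 2.7:
        W <- W + mu * (I - E[phi(y) y^H]) W."""
    L, T = x.shape
    W = np.eye(L, dtype=complex)
    for _ in range(n_iter):
        y = W @ x                                   # current estimate, equation 2.5
        corr = (phi(y) @ y.conj().T) / T            # E[phi(y) y^H]
        W = W + mu * (np.eye(L) - corr) @ W         # natural gradient step
    return W
```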

3.  Basic Techniques

We use several basic techniques to achieve both blind dereverberation and echo cancellation. We explain the echo cancellation based on ICA in the STFT domain and dereverberation based on ICA and MINT. Then we discuss the problem of the naive combination of these two methods.

3.1.  Model of Semiblind ICA for Echo Cancellation

Takeda et al. (2007) modeled echoes of the robot's speech (reference) spectrum in the STFT domain and developed a separation algorithm based on ICA. We denote the observed spectra as xw,j[t] at frequency bin w and frame t of the jth microphone and the user's and robot's (reference) speech spectrum as sw,u[t] and sw,r[t], respectively.

Here, we explain the model of semiblind ICA specific to the jth microphone, assuming the discrete-linear convolution model,
x_{w,j}[t] = a_{w,j}\, s_{w,u}[t] + \sum_{n=0}^{K_{w,r}-1} h^{r}_{w,j}[n]\, s_{w,r}[t-n],
3.1
where hrw,j[n] and aw,j are the transfer coefficients of the robot's and the user's spectrum and Kw,r is the number of filter taps. We transform this equation into matrix representation with a reference spectrum vector, \bar{s}_{w,r}[t] = [s_{w,r}[t], \ldots, s_{w,r}[t-M_{w,r}]]^T, and a transfer coefficients vector, \bar{h}^{r}_{w,j} = [h^{r}_{w,j}[0], \ldots, h^{r}_{w,j}[M_{w,r}]]^T,
\begin{bmatrix} x_{w,j}[t] \\ \bar{s}_{w,r}[t] \end{bmatrix} = \begin{bmatrix} a_{w,j} & (\bar{h}^{r}_{w,j})^T \\ 0 & I \end{bmatrix} \begin{bmatrix} s_{w,u}[t] \\ \bar{s}_{w,r}[t] \end{bmatrix},
3.2
where I is an (Mw,r + 1) × (Mw,r + 1) unit matrix and Mw,r = Kw,r. The reference spectrum is assumed to arrive without delay in the model. The independence between the user's and reference spectrum is evaluated in the ICA algorithm to suppress the reference spectrum. We call this algorithm "semiblind" because the reference is a known signal, whereas the remaining signals are unknown.

Since equation 3.2 is a nonsingular mixing process, we can obtain the user's spectrum by applying ICA. We have two reasons for applying ICA. The first is that the speech signal has a nongaussian property (Hyvarinen et al., 2001) and this property theoretically matches the assumption of ICA. The second is that since ICA is robust against disturbances that have gaussian property, it can estimate the separation filter in noisy situations by using an iterative or online processing (Yang & Sakai, 2008). Of course, in batch processing, we can use the Wiener filter (Haykin, 1991) as an initial value for the separation filter in terms of the second-order statistics.

Semiblind ICA can separate the robot's speech, including its reverberations, because the model accounts for the reverberations over several frames. However, it cannot cope with the user's reverberations.
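For concreteness, the following is a minimal numpy sketch of the observation model in equation 3.1 for one microphone and one frequency bin; the function name and arguments are illustrative.

```python
import numpy as np

def observe_semiblind(s_u, s_r, a, h_r):
    """Sketch of equation 3.1: the observed spectrum series is the user's direct
    spectrum scaled by a plus the reference (robot) spectrum convolved frame-wise
    with the transfer coefficients h_r (length K_r). s_u, s_r: (T,) complex arrays."""
    T = len(s_u)
    x = a * s_u.astype(complex)
    for n, h in enumerate(h_r):          # frame-wise convolution over the K_r taps
        if n >= T:
            break
        x[n:] += h * s_r[: T - n]
    return x
```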

3.2.  Model of ICA Based on Inverse Theorem in Acoustic Field

There is an inverse filter in a multiple-input system as long as the number of microphones is larger than the number of sound sources. This is almost the same as the multiple input/output inverse filtering theorem (MINT) defined in the time domain (Miyoshi & Kaneda, 1988). However, we distinguish our model in the STFT domain from the time-domain MINT.

The jth microphone's observation, xw,j[t], is given by a linear combination of the user speech spectra over time frames, sw,u[t],
x_{w,j}[t] = \sum_{n=0}^{K_{w,u}-1} h^{u}_{w,j}(n)\, s_{w,u}[t-n],
3.3
where huw,j(n) is the filter coefficient of microphone j with delay n and Kw,u is the filter length.
Next, we transform this into matrix form. The circulant transfer matrix for a source sw,y with filter length Kw,y and filter coefficients vector with delay i is given by
formula
3.4
The proper matrix size is discussed later in this section.
With the observed spectrum vector \bar{x}_w[t] = [x_{w,1}[t], \ldots, x_{w,L}[t], \ldots, x_{w,1}[t-N_w], \ldots, x_{w,L}[t-N_w]]^T and user spectrum vector \bar{s}_{w,u}[t] = [s_{w,u}[t], \ldots, s_{w,u}[t-M_{w,u}]]^T, the MINT-based observation model is represented as
\bar{x}_w[t] = H_{w,u}\, \bar{s}_{w,u}[t].
3.5
Here, Nw is the number of delayed frames of the observed spectrum vector, Mw,u is that of the user spectrum vector, and the size of H_{w,u} is L(Nw + 1) × (Mw,u + 1).

If the condition that L(Nw + 1) = (Mw,u + 1) holds and H_{w,u} is full rank, the whole mixing matrix is a nonsingular matrix. This ensures the existence of an inverse system, meaning that we can obtain the unique solution of the original sources. In practice, since there is no environment in which H_{w,u} is rank deficient (i.e., the transfer functions from the source to each microphone are all distinct), this theorem is always true with proper Nw and Mw,u. Note that this theorem also holds in a multiple-source case, although the explanation is omitted here.
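As an illustration of how the MINT-style observation vectors can be built, here is a minimal numpy sketch that stacks Nw + 1 delayed copies of the multichannel spectra of one frequency bin; under the condition above, the stacked mixing matrix is square when L(Nw + 1) = Mw,u + 1. The function name and the zero-padding of missing frames are illustrative choices.

```python
import numpy as np

def stack_delays(X, Nw):
    """Build stacked observation vectors bar{x}_w[t] = [x_w[t]; x_w[t-1]; ...; x_w[t-Nw]]
    for one frequency bin. X: (T x L) observed spectra; returns (T x L*(Nw+1)),
    with missing (negative-time) frames zero-padded."""
    T, L = X.shape
    out = np.zeros((T, L * (Nw + 1)), dtype=X.dtype)
    for n in range(Nw + 1):
        out[n:, n * L:(n + 1) * L] = X[: T - n]   # block n holds x_w[t - n]
    return out

# With L microphones, the square MINT condition implies Mw,u = L * (Nw + 1) - 1
# delayed source frames, so the stacked mixing matrix can be made invertible.
```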

3.3.  Simple Solution for Blind Dereverberation and Echo Cancellation

Semiblind ICA separates robot speech and its reverberations. However, the model ignores the reverberation of user speech, and therefore the separated user speech remains reverberant. On the other hand, the MINT, or multiple inputs, model has the inverse transfer function for user speech. If we can solve the inverse problem of the MINT formulation, the dereverberated user speech will be obtained. This means that both the blind dereverberation and echo cancellation can be achieved by combining the semiblind ICA and the MINT-based model.

4.  Efficient ICA-Based Separation

We first point out the bottleneck problem found in the naive combination of the semiblind ICA and the MINT-based model. Next, we describe how to solve the problem by using a property that holds in the MINT system and how the ICA separation model should be modified. Then we derive the algorithm for estimating the parameters in the ICA framework. Finally, we explain other solutions to and configurations for our new ICA problem.

4.1.  Problem of Naive Combination of ICA to MINT System

4.1.1.  The Model

Here, we model the mixing process by aggregating equations 3.1 and 3.3:
x_{w,j}[t] = \sum_{n=0}^{K_{w,u}-1} h^{u}_{w,j}[n]\, s_{w,u}[t-n] + \sum_{n=0}^{K_{w,r}-1} h^{r}_{w,j}[n]\, s_{w,r}[t-n].
4.1
With the circulant transfer function matrices of the user and robot speech, H_{w,u} and H_{w,r}, the observation model can be written in matrix representation:
\begin{bmatrix} \bar{x}_w[t] \\ \bar{s}_{w,r}[t] \end{bmatrix} = \begin{bmatrix} H_{w,u} & H_{w,r} \\ 0 & I \end{bmatrix} \begin{bmatrix} \bar{s}_{w,u}[t] \\ \bar{s}_{w,r}[t] \end{bmatrix},
4.2
where I is an (Mw,r + 1) × (Mw,r + 1) unit matrix and the sizes of H_{w,u} and H_{w,r} are L(Nw + 1) × (Mw,u + 1) and L(Nw + 1) × (Mw,r + 1), respectively. We assume Nw and Mw,u are set properly such that the whole mixing matrix is nonsingular.
If we assume that all elements of the source vectors \bar{s}_{w,u}[t] and \bar{s}_{w,r}[t] are independent of one another, equation 4.2 can be solved using a standard ICA because the mixing process is nonsingular. In this case, after \bar{s}_{w,u}[t] is estimated, we need to select the direct sound element, sw,u[t], from it. The separation model is represented by the following equation with separation matrices applied to the observed and reference vectors:
\hat{\bar{s}}_{w,u}[t] = W_{w,u}\, \bar{x}_w[t] + W_{w,r}\, \bar{s}_{w,r}[t].
4.3

4.1.2.  The Problem

If we apply a standard ICA to the MINT model, equation 4.3 is solved, and an estimate of \bar{s}_{w,u}[t] is obtained. We refer to this naive process as the MINT-ICA process in the sequel. The number of independent components to estimate, that is, the dimensionality of \bar{s}_{w,u}[t], is obviously proportional to the reverberation time. This is caused by the estimation of all components other than the direct element, such as sw,u[t − 1], sw,u[t − 2], and so on.

This extra processing increases the computational cost in solving permutation and scaling and estimating the separation matrices. The estimation cost in particular is proportional to the second order of the reverberation time, Nw. This increased computational cost should be reduced from the viewpoint of actual applications, that is, a real-time separation. The linear order of Nw would be preferable for practical applications. We must reduce the number of components to be estimated to reduce the computational cost of the calculation of the separation matrices.

4.2.  Independence Exchange Property in the MINT System

ICA based on MINT uses the temporally estimated source signals in learning the separation matrix, as shown in equation 2.7, to evaluate the time independence among sw,u[t] and sw,u[t − i], i = 1, …, Mw,u. This estimation increases the number of independent components to estimate because all elements of \bar{s}_{w,u}[t] are estimated. Replacing the independence condition with an equivalent one that has fewer independent components reduces the computational cost.

To derive the efficient model, we use the following proposition. Here, we define two vectors for readability: the stacked delayed source spectra, \bar{s}_{w,u}[t-d], and the stacked delayed observed spectra, \bar{x}_{w}[t-d].

Proposition 1: Independence exchange properties of higher-order-statistics ICA. We assume that the sound signals, sw,u, are time independent. In higher-order ICA, if the nonsingular condition in equation 4.2 holds, the following two conditions are equivalent:

    P1. sw,u[t] and the delayed source spectra \bar{s}_{w,u}[t-d] are mutually independent.

    P2. sw,u[t] and the delayed observed spectra \bar{x}_{w}[t-d] are mutually independent.

Here, d > 0 is the initial delay interval, to be explained below.

Proof. We prove only that the two independence conditions are equivalent. First, equation 3.5 results in the following relationship:
formula
4.4
(See Hyvarinen et al., 2001, for the proof.)
We also assume the relationship specified below:
formula
4.5
Therefore, we have
formula
4.6
Second, we show that if P1 holds, then P2 also holds:
formula
4.7
formula
4.8
formula
4.9
Finally, we show that if P2 holds, then P1 also holds:
formula
4.10
formula
4.11
formula
4.12
Figure 4:

Temporal dependence of speech spectrum.


In the same way, we can easily prove that the property also holds when the reference signal exists. Since the proof is straightforward, we skip it. By using this proposition, we gain two main advantages.

4.2.1.  Reduced Computational Cost

We can evaluate the time independence by using the delayed observed spectra instead of the estimates of the delayed source spectra. This exchange of the independence evaluation reduces the number of independent components to estimate, which leads to reduced computational costs for estimating the filter, permutation, and scaling.

4.2.2.  Time Independence with Initial Delay Interval Parameter

We can consider the time independence of speech signals by using the initial delay interval parameter, d. For example, Figure 4 shows the average time independence of speech signals over all frequency bins after being analyzed by STFT with a 64 ms Hanning window and an 8 ms shift. The metric for the independence is E[φ(x)x^H]: a smaller value means a higher degree of independence. Here, we use tanh(100|x|)e^{jθ(x)} for φ(x), where θ(x) represents the phase of the complex number x. This function is the actual independence metric used in ICA (Sawada, Mukai, & Araki, 2003). The vertical axis represents the independence among sw[t] and sw[t − i] with i ≥ 0, and the horizontal axis represents the frame interval length, i. From the graph, we can conclude that since the independence between the direct sound with i = 0 and its adjacent frames is not high enough, we should evaluate the independence with a certain interval d. Note that the size of the high-dependence frame interval depends on the speed of speech and that the interframe time independence varies with the type of sound source. For example, periodic or sustained sounds generally show less time independence. However, in natural speech interaction, since the speed of speech is almost constant and does not include sustained sounds, we can safely assume that sufficient time independence is obtained with a fixed frame interval d. Additionally, though the independence does not seem to be high, we have empirically confirmed that it is high enough for ICA to be applied to actual speech data.
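The following is a minimal numpy sketch of how such a time-dependence score can be computed for one spectrogram; the exact averaging used to produce Figure 4 is not reproduced here, so treat the normalization as an assumption.

```python
import numpy as np

def phi(y):
    # Nonlinearity used in the letter: tanh(100|y|) exp(j*theta(y)) (Sawada et al., 2003).
    return np.tanh(100.0 * np.abs(y)) * np.exp(1j * np.angle(y))

def time_dependence(S, i):
    """Score |E[phi(s_w[t]) conj(s_w[t-i])]| averaged over frequency bins for a
    (frames x bins) spectrogram S; smaller values indicate higher independence
    between a frame and its i-frame-delayed copy."""
    if i == 0:
        a, b = S, S
    else:
        a, b = S[i:], S[:-i]
    return np.mean(np.abs(np.mean(phi(a) * np.conj(b), axis=0)))
```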

4.3.  New Model and Estimation of Filter Based on ICA

4.3.1.  The Model

In accordance with the independence exchange properties of higher-order ICA, we can extract user speech by setting up the separation model so that the estimated signals, the delayed observed signals, and the reference signals are mutually independent. The separation model used here is
formula
4.13
where
formula
4.14
The estimated source and observed vectors will be explained in the next paragraph. Three separation matrices appear: one applied to the current observed vector, one to the delayed observed vector, and one to the reference vector; their sizes will also be explained in the next paragraph. The remaining blocks are proper-sized unit matrices.

The ideal dimensionality of the estimated source vector is the number of sound sources. If we know in advance, for example, that there is only one sound source, we can set its dimensionality to one in equation 4.13. In this case, we can theoretically extract the correct direct sound signal because equation 4.13 obviously becomes invertible. This enables us to reduce the computational cost in estimating the parameters. However, this is not the case in general, since it is usually difficult to estimate the number of sound sources in a reverberant environment. Consequently, we have to use all of the microphones; that is, the dimensionality of the estimated source vector is L, and the separation matrices become an L × L matrix and two correspondingly sized rectangular matrices, respectively. In such a case, the separated sounds include a direct sound signal sw,u[t] and some reflected sound signals. The way to extract the direct sound signal from the separated sounds will be discussed in the following section.

Note that the inverse observation model corresponding to equation 4.13 forms an autoregressive model (Nakatani et al., 2008), that is,
formula
4.15
where
formula
4.16
and the mixing matrices are of the corresponding proper sizes. The difference between our model and that of Nakatani et al. (2008) is that we solve the model with FD-ICA by re-forming it into a matrix and using an independence exchange property, while they use second-order statistics to solve it with a time-varying gaussian model and do not treat the reference signal.
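To make the structure of the separation model tangible, here is a minimal numpy sketch of one frequency bin of the forward separation step, under the assumption that the estimated sources are a linear combination of the current observation, the observations delayed by d, d + 1, …, frames, and the stacked reference spectra; the filter names W1, W2, W3 and the exact delay layout are illustrative assumptions, not the letter's notation.

```python
import numpy as np

def separate_bin(W1, W2, W3, X, S_ref_stack, d):
    """Sketch of a separation step in the spirit of equation 4.13 for one bin.
    X: (T x L) observed spectra, S_ref_stack: (T x (M_r+1)) stacked reference,
    W1: (L x L), W2: (L x L*Nw), W3: (L x (M_r+1))."""
    T, L = X.shape
    Nw = W2.shape[1] // L
    Y = np.zeros((T, W1.shape[0]), dtype=complex)
    for t in range(T):
        # Stacked delayed observations starting d frames back, zero-padded at the start.
        xd = np.concatenate([X[t - d - n] if t - d - n >= 0 else np.zeros(L, dtype=complex)
                             for n in range(Nw)])
        Y[t] = W1 @ X[t] + W2 @ xd + W3 @ S_ref_stack[t]
    return Y
```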

4.3.2.  Estimation Algorithm

We can derive the update procedures for equation 4.13 by using a standard learning algorithm of ICA. To obtain the optimum separation matrices, we minimize cost function J on the basis of KLD. Since the derivation of the algorithm is shown in the appendix, we show only the result here.

The learning algorithms for , , and are
formula
4.17
formula
4.18
formula
4.19
where
formula
4.20
Here, μ is a step-size parameter, and φ(·) is a nonlinear function vector. The constraint matrix is usually the unit matrix, but here we use a nonholonomic constraint matrix (Choi, Cichocki, & Liu, 1999) because its convergence is better than that of the unit matrix. We use tanh(100|x|)e^{jθ(x)} as the nonlinear function, φ(x) (Sawada et al., 2003). Equation 4.17 is used to estimate the blind source separation filter. Equations 4.18 and 4.19 are used to estimate the blind dereverberation filter and the so-called echo cancellation filter, respectively. Even when multiple sources are observed, we can use these rules because the independence exchange also holds for them. We can also use the learning algorithms for their separation without any modification to the equations.

The derived algorithms for estimating the separation matrices form a natural extension to those of FD-ICA. The learning rule for the blind source separation filter is the same as that of standard FD-ICA. Our algorithms add both a dereverberation filter and an echo cancellation filter. The algorithm is obviously different from algorithms based on the convolutive mixing model (Hyvarinen et al., 2001; Douglas et al., 2005; Kokkinakis & Nandi, 2006; Hiroe, 2007) because we use the independence exchange property for the instantaneous mixing model. This reduces computational costs and simplifies the algorithm.
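The exact update rules (equations 4.17-4.19) are derived in the appendix and are not reproduced here; as a rough, assumption-laden sketch of their structure, one natural-gradient-style iteration per frequency bin could look as follows. The filter names W1, W2, W3 and the precise form of the W2 and W3 updates are assumptions, not the letter's equations.

```python
import numpy as np

def phi(Y):
    return np.tanh(100.0 * np.abs(Y)) * np.exp(1j * np.angle(Y))

def update_filters(W1, W2, W3, Y, Xd, Sr, mu=0.05):
    """One sketched iteration: W1 follows the standard FD-ICA natural gradient rule
    with the nonholonomic constraint Lam = diag(E[phi(y) y^H]); W2 and W3 are driven
    by cross-moments between phi(y) and the delayed observations / reference stacks.
    Y: (T x L) estimates, Xd: (T x L*Nw) delayed observations, Sr: (T x (M_r+1))."""
    T = Y.shape[0]
    Ryy = (phi(Y).T @ Y.conj()) / T             # E[phi(y) y^H]
    Lam = np.diag(np.diag(Ryy))                 # nonholonomic constraint matrix
    W1 = W1 + mu * (Lam - Ryy) @ W1
    W2 = W2 + mu * (Lam @ W2 - (phi(Y).T @ Xd.conj()) / T)
    W3 = W3 + mu * (Lam @ W3 - (phi(Y).T @ Sr.conj()) / T)
    return W1, W2, W3
```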

4.4.  Spatial Sphering for Preprocessing

Sphering or whitening is a widely used preprocessing for standard ICA that accelerates the convergence of the separation matrix. This process is a linear transformation that de-correlates the input signals and normalizes their variances (Hyvarinen et al., 2001). However, this increases the computational cost because it needs an eigenvector or singular value decomposition of the correlation matrix of the input signal vector. Even if we use the Levinson-Durbin algorithm for linear-prediction-based prewhitening, the computational cost adds up to the second order of the reverberation time, Nw. Therefore we execute only spatial sphering and reference signal normalization. This spatial sphering may be executed in both batch and block-wise processing.

Spatial sphering decorrelates only the microphone inputs, ignoring the time correlations. First, the correlation matrix of the microphone inputs is decomposed:
E\left[ x_w[t]\, x_w[t]^H \right] = U_w D_w U_w^H,
4.21
\lambda_{w,r} = E\left[ |s_{w,r}[t]|^2 \right],
4.22
where U_w is an L × L unitary matrix consisting of eigenvectors, D_w is an L × L diagonal matrix with eigenvalues, and λw,r is the variance of the known (reference) signal.
With these values, we transform the input signals:
z_w[t] = V_w\, x_w[t],
4.23
\tilde{s}_{w,r}[t] = \lambda_{w,r}^{-1/2}\, s_{w,r}[t],
4.24
where
V_w = D_w^{-1/2} U_w^H.
4.25
Then the input signals are decorrelated and normalized, and the known signals are normalized.
After the spatial sphering preprocessing, we use z_w[t] and \tilde{s}_{w,r}[t] instead of x_w[t] and sw,r[t] in equations 4.13 to 4.19. Here, the independence exchange still holds because this transformation is nonsingular. Actually, applied to the stacked vectors, this transformation is described as a block diagonal matrix:
\mathrm{diag}\left( V_w, \ldots, V_w, \lambda_{w,r}^{-1/2}, \ldots, \lambda_{w,r}^{-1/2} \right).
4.26
Here, diag(V_w, …, V_w) is an L(Nw + 2) × L(Nw + 2) block diagonal matrix and diag(λ−1/2w,r, …, λ−1/2w,r) is an (Mw,r + 1) × (Mw,r + 1) diagonal matrix. The signal flow with our method is outlined in Figure 5.
Figure 5:

Signal flow with our method for the MINT-based model.


This eigenvalue decomposition of the microphone correlation matrix is also used in a sound-localization technique called MUSIC (multiple signal classification) (Schmidt, 1986). If we integrate the sound source separation with MUSIC, we can efficiently reuse the eigenvalue decomposition result.
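The following is a minimal numpy sketch of the spatial sphering and reference normalization described above; the eigenvalue floor and variable names are illustrative, and time correlations are deliberately not removed.

```python
import numpy as np

def spatial_sphering(X, S_ref):
    """Sketch of equations 4.21-4.25: decorrelate only the instantaneous microphone
    inputs X (T x L complex spectra of one bin) and normalize the reference spectra
    S_ref by the square root of their variance."""
    T = X.shape[0]
    R = (X.T @ X.conj()) / T                                # L x L spatial correlation matrix
    eigval, U = np.linalg.eigh(R)                           # R = U diag(eigval) U^H
    V = np.diag(np.maximum(eigval, 1e-12) ** -0.5) @ U.conj().T   # whitening matrix, z = V x
    X_sph = X @ V.T                                         # decorrelated, variance-normalized inputs
    lam_r = np.mean(np.abs(S_ref) ** 2)                     # variance of the known (reference) signal
    return X_sph, S_ref / np.sqrt(lam_r)
```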

4.5.  Solution to Scaling and Permutation Problems

4.5.1.  Scaling

We use the projection-back method (Murata & Ikeda, 2001) to solve the scaling problem. The elements of the inverse of the separation matrix are multiplied by the corresponding separated signals. In our case, a complete separation matrix W_w is formed from equation 4.14. The estimated signals are then scaled by using elements of the inverse matrix W_w^{-1}. The ith row and jth column element of W_w^{-1} is written as [W_w^{-1}]_{i,j}. The scale, cw,j, is multiplied by the estimated jth element of the separated vector:
c_{w,j} = [W_w^{-1}]_{i,j},
4.27
\hat{y}_{w,j}[t] = c_{w,j}\, y_{w,j}[t].
4.28
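As an illustration, here is a minimal numpy sketch of the projection-back scaling; the choice of the reference row index i (a microphone channel) is an assumption, since the letter does not fix it here.

```python
import numpy as np

def projection_back(W_full, Y, i=0):
    """Sketch of projection-back scaling (Murata & Ikeda, 2001): multiply the jth
    separated signal by the (i, j) element of the inverse of the complete separation
    matrix. W_full: (K x K) complete separation matrix, Y: (T x K) separated spectra."""
    A = np.linalg.inv(W_full)       # estimated mixing matrix
    c = A[i, :]                     # scales c_{w,j} = [W^{-1}]_{i,j}
    return Y * c[np.newaxis, :]
```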

4.5.2.  Permutation

Since we use all the microphones, we need to select the direct sound spectra from the estimated spectra at each frequency bin w. We solve the permutation problem by using the average power of the separated signals. If the separated signals include direct and reflected sounds, the power of the direct sound is strongest among the separated signals. Hence, the signal with the maximum power is selected:
\hat{s}_{w,u}[t] = y_{w,j^{*}}[t], \quad j^{*} = \arg\max_{j} E\left[ |y_{w,j}[t]|^2 \right].
4.29
Note that this criterion will not work when other sound sources exist. Since such a situation is not in the scope of this letter, we do not discuss it.
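A minimal numpy sketch of this maximum-power selection for one frequency bin follows; the function name is illustrative.

```python
import numpy as np

def pick_direct_sound(Y):
    """Sketch of equation 4.29: among the separated spectra Y (T x K) of one
    frequency bin, keep the component with the largest average power, which is
    assumed to be the direct sound."""
    powers = np.mean(np.abs(Y) ** 2, axis=0)
    return Y[:, np.argmax(powers)]
```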

4.6.  Other Configurations

4.6.1.  Initial Value of Separation Matrix

The initial value of the blind source separation matrix is critical because the learning rules of the dereverberation and echo cancellation filters are affected by it through the estimated sources. There are several techniques for finding an appropriate initial value (Araki et al., 2005; Saruwatari, Kawamura, Nishikawa, Lee, & Shikano, 2006). Since they require a geometrical model from the microphone to the sound source, obtaining such a model may be difficult if the microphones are installed on the robot.

The initial value of the separation matrix at frequency bin w is set to the estimated matrix at frequency bin w + 1, and then all rows of the matrix are row-normalized. We use the unit matrix for the initial value of the first separation matrix. Since reverberation at high frequencies decays faster than that at low frequencies, we start by estimating the separation matrix at the highest-frequency bin and then move to lower-frequency bins. Thus, we can maintain the overall accuracy of the separation matrix estimation with regard to the initial value configuration. This initialization works efficiently and effectively in practice.

4.6.2.  Step-Size Scheduling

In general, an ICA procedure with a fixed step-size parameter μ is affected by the temporal correlation that is not removed by sphering alone. We use a step-size parameter adjustment that combines annealing and the exponentially weighted step size (EW-SS) (Makino, Kaneda, & Koizumi, 1993) because they reduce the effect of the temporal correlation ignored by sphering. The step size, μk, of the separation matrix at the jth iteration in the kth delayed frame component, corresponding to xw,*[t − k] and sw,r[t − k], is defined by
formula
4.30
where α, β, and λ are constant values.

4.7.  Comparison of Computational Costs

The theoretical computational costs with our method and with simple MINT-ICA are summarized in Table 1. Here, L is the number of microphones, Mw,r is the size (or frame length) of the reference spectrum vector, and Nw is the size (or frame length) of the delayed observation vector.

Table 1:
Theoretical Computational Costs.

                 Naive MINT + ICA (MINT-ICA)      Our Method
Sphering         O((LNw + Mw,r)^3)                O(L^3)
ICA iteration    O(L^2 Nw^2 + L Nw Mw,r)          O(L^2 Nw + L Mw,r)
Scaling          O((LNw + Mw,r)^3)                O(L^3)
Permutation      O(LNw)                           O(L)

We focus on the order of Nw because we need Nw to be as large as possible to cope with long reverberations, and this enlargement process greatly affects the computational cost. The filter length of reference signal Mw,r also increases the cost linearly according to the reverberation time. Meanwhile, the number of microphones L is independent of such environmental conditions.

Simple MINT-ICA requires the third order of Nw for prewhitening, the second order for estimating separation matrix, and the third order for scaling because of the inverse operation of the matrix. The filters are iteratively estimated, and this critically affects the computational cost.

The cost of our method is reduced to a linear order of Nw, and processing is achieved at low computational cost in a long reverberant environment. Other algorithms (Hiroe, 2007; Nakatani et al., 2008) cannot achieve this at a low computational cost.

5.  Evaluation

We evaluate the performance of our method by comparing it with that of a conventional method in two different environments in terms of the word correctness of ASR as the metric. We first explain the experimental settings and then present the evaluation criteria and results.

5.1.  Experimental Settings

We conducted the experiments in two rooms, a normal room and a hall-like room. The room layouts are shown in Figure 6. The normal room is 4.2 × 7.0 m, and the hall-like room is 7.55 × 9.55 m. We used the microphone array embedded on a humanoid robot developed by HONDA.

Figure 6:

Layout of (left) normal and (right) hall-like rooms.


5.1.1.  Recording Conditions and Test Set Data

The impulse responses for the user's speech data were recorded at 16 kHz in both rooms. The reverberation time, RT20, was 240 ms in the normal room and 670 ms in the hall-like room. A loudspeaker, 1.2 m high, was located 1.5 m away from the two microphones installed on the humanoid's head. Impulse responses were recorded from five directions: 0, 45, 90, −45, and −90 degrees from the front direction of the robot. The impulse responses for the robot's speech data were also recorded by using a loudspeaker embedded in the humanoid's head. All data (16 bits, PCM) were normalized to [−1.0, 1.0]. These conditions are summarized in Table 2.

Table 2:
Conditions for Data Collection and Separation.

Impulse response             16-kHz sampling
Reverberation time (RT20)    240 and 670 ms
Distance and direction       1.5 m and 0°, 45°, 90°, −45°, −90°
Number of microphones        Two (mounted on the robot's head)
STFT analysis                Hamming: 64 ms; shift: 20 ms
Input wave data              [−1.0, 1.0] normalized

We used 200 Japanese sentences for the user's speech; they were convolved with the corresponding recorded impulse responses. The robot's speech data were 200 sentences spoken by a male. We mixed user and robot speech signals of the same length. The duration of the target data ranged from 1 to 10 s.

5.1.2.  Parameters for ASR and Separation

The recognizer Julius (Lee, Kawahara, & Shikano, 2001) was used for hidden Markov model (HMM)-based ASR with a statistical language model. Mel frequency cepstral coefficients (MFCC) (12 + Δ12 + ΔPow) were obtained after STFT with a window size of 512 points and a shift size of 160 points for the speech features. This was followed by cepstral mean normalization (CMN) (Acero & Stern, 1992). Note that we extracted the MFCC from the time-domain signal resynthesized from the separated spectrum. A triphone-based acoustic model (three states and four mixtures) was trained with 150 sentences of clean speech uttered by each of 200 male and female speakers (word closed). The statistical language model consisted of 21,000 words extracted from newspapers. The other experimental conditions are summarized in Table 3.

Table 3:
Conditions for Speech Recognition.

Test set          200 sentences
Training set      30,000 sentences (200 people; 150 sentences each)
Acoustic model    PTM-triphone: 3-state HMM
Language model    Statistical, vocabulary size of 21k
Speech analysis   Hamming: 32 ms; shift: 10 ms
Features          MFCC: 25 dim. (12 + Δ12 + ΔPow)

As the STFT parameters for the sound separation, the window size was 1,024 points (64 ms), which is known to be suboptimal (Araki et al., 2003), and the shift size was 320 points (20 ms). The tap lengths of the delayed observation vector and the robot's speech spectrum, sw,r[t], were set to the same value, Nw = Mw,r = N, over all frequency bins, w. The step-size parameters were λ = 0.9, α = 6.0 × 10−1, and β = 5.0 × 10−3 for batch processing. We fixed the maximum number of iterations for estimating the matrices to 20 because the time for separation is usually restricted in practical use. With more iterations, the performance may improve slightly. We empirically set the initial delayed frame value, d, to 2. For the permutation resolution, we assumed only one speech signal.

5.2.  Experiments

We conducted three experiments:

  • Experiment 1: 

    Comparison in terms of processing time, word correctness (WC), and SNR (signal-to-noise ratio) of our method and MINT-ICA

  • Experiment 2: 

    Comparison in terms of WC and SNR of our method, semiblind ICA, and cascade processing

  • Experiment 3:

     WC of our method under several conditions

“Cascade processing” means a sequential process of echo cancellation followed by dereverberation; that is, we first estimated only the echo cancellation filter and then estimated the remaining filters. In estimating the echo cancellation filter, we disabled the dereverberation part of equation 4.14 and estimated the filter by using equation 4.19. Next, with the estimated echo cancellation filter fixed, the blind source separation and dereverberation filters were estimated. The step-size scheduling rule in equation 4.30 is used in common for all of these estimates. In this experiment, we show the effectiveness of the simultaneous estimation of the three separation filters in terms of WC and SNR. The environments for these experiments are the same.

As a criterion, we defined WC as the percentage of correctly recognized words:
\mathrm{WC} = \frac{\text{number of correctly recognized words}}{\text{total number of words}} \times 100\ [\%].
5.1
WC increases as the number of correctly recognized words increases. "Word" included all word classes, not restricted to nouns. With this criterion, we can measure the quality of a separated speech signal in terms of ASR. We defined the SNR in the power spectrum domain by
formula
5.2
where sw[t] denotes the clean original speech spectrum and ŝw[t] denotes the estimated speech spectrum. The scaling parameter η was defined to maximize the SNR. SNR represents the degree of noise contamination of the separated speech signal.
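Since the exact expression for the SNR is not reproduced above, the following numpy sketch shows one plausible reading of it; the functional form and the least-squares choice of η are assumptions.

```python
import numpy as np

def snr_db(S_clean, S_est):
    """One plausible reading of equation 5.2 (an assumption): SNR = 10 log10(
    sum|s|^2 / sum|s - eta*s_hat|^2 ), with the real scale eta chosen by least
    squares, which maximizes the SNR for this form."""
    s = S_clean.ravel()
    y = S_est.ravel()
    eta = np.vdot(y, s).real / np.vdot(y, y).real   # least-squares scaling parameter
    err = s - eta * y
    return 10.0 * np.log10(np.sum(np.abs(s) ** 2) / np.sum(np.abs(err) ** 2))
```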

5.2.1.  Experiment 1: Comparison in Terms of Processing Time, WC, and SNR with MINT-ICA

In this experiment, we compared our method and MINT-ICA (experiment 1-1) in terms of the computational cost and examined the dereverberation performance of both methods in terms of WC and SNR (respectively, experiments 1-2a and 1-2b). We evaluated four patterns:

  1. MINT-ICA without sphering and with only dereverberation (Derev.)

  2. MINT-ICA with sphering and with only dereverberation (Derev.)

  3. Our method with only dereverberation (Derev.)

  4. Our method with dereverberation and echo cancellation (Derev. + E.C.)

The data for the processing-time evaluation were obtained from the front of the humanoid in the hall-like room. The total duration of the speech data was 1197 s for dereverberation and 1311 s for dereverberation and echo cancellation. The computer had an Intel Pentium D CPU with a clock speed of 3.20 GHz and 2 GB of memory. The OS was Red Hat Enterprise Linux WS release 3. The program was implemented without using a numerical library such as BLAS or LAPACK. The data for the dereverberation performance evaluation were obtained in both rooms, and they included only the target speaker's speech.

We also evaluated the real-time factor (RTF) for batch processing in all experiments. The RTF was calculated using P/I, where P is the processing time and I is the data amount in time (duration). Because the processing time did not include the time for buffering the data, there was a constant delay for this real-time processing. It can be eliminated by a method described elsewhere (Saruwatari et al., 2005).

5.2.2.  Experiment 2: Comparison in Terms of WC and SNR with Semiblind ICA and Cascade Processing

In this experiment, we compared the ability of our method to handle reverberation with those of conventional semiblind ICA and the cascade processing. Performance was evaluated with regard to the number of frames, N.

The data set was a mixture of user and robot utterances in both rooms. To show the upper-limit performance, separation was done by batch processing, not block-wise processing. Since semiblind ICA does not assume the use of multiple microphones, we evaluated it with only one microphone. The parameter settings for the cascade processing were the same as for our method.

5.2.3.  Experiment 3: Evaluation Under Several Conditions

In this experiment, we examined the WC and SNR of our methods for dereverberation and for dereverberation and echo cancellation. The number of frames N was changed for each condition.

The data were obtained in both rooms. We used only the user utterance data to evaluate the dereverberation function (experiment 3-1) and the mixed data from user and robot utterance data to evaluate the dereverberation and echo cancellation function (experiment 3-2).

We changed the length of the observed signal and the size of the data set used to estimate the separation matrices, that is, 1, 2, and 3 s block-separated data and all data (batch). When we were not batch processing, we set the exponential weight λ to 0.8. When we estimated the separation matrices, we used the estimated matrices of the previous period as the initial values for the next period. We show the relationship among the data length for separation, the number of frames N, and WC.

5.3.  Results

5.3.1.  Experiment 1: Comparison in Terms of Processing Time, WC and SNR with MINT-ICA

Figure 7 plots the results for experiment 1. In the panel for experiment 1-1, the horizontal axis represents the number of frames N of the delayed observation vector and sw,r[t], and the vertical axis represents the real-time factor. In the panels for experiments 1-2a and 1-2b, the vertical axis represents WC and SNR, respectively, averaged over the five positions of the speaker. "No proc." represents the results without any processing. The WC for clean speech, that is, without noise, is about 90%.

Figure 7:

Results of experiment 1: (Upper plot) RTF (real time factor). (Lower plots) WC (word correctness) and SNR (signal-to-noise ratio) performance.


With our method, the RTF increased in proportion to the number of frames, for both dereverberation only and the combination of dereverberation and echo cancellation. For example, the method can cope with the reverberation with N = 20 under these experimental conditions. The RTF of MINT-ICA increases polynomially with N, and the RTF exceeds 1.0 at N = 6. Since the cost increases further with the spatial sphering preprocessing, MINT-ICA is not suitable for real-time processing.

MINT-ICA does not perform well in terms of WC and SNR because it needs a large number of estimated independent components, resulting in a permutation error in equation 4.29. On the other hand, our method works well even if we use a long filter length.

5.3.2.  Experiment 2: Comparison in Terms of WC and SNR with Semiblind ICA and Cascade Processing

The plots on the left side of Figure 8 show the results for the RT20 = 240 ms room, and those on the right side show them for the RT20 = 670 ms room. The horizontal axis represents N, and the vertical axis represents the average WC (experiment 2a) and SNR (experiment 2b) over the five positions of the speaker. We can see that even when SNR is high, WC is not necessarily high.

Figure 8:

Results of experiment 2. (Upper plots) WC evaluation. (Lower plots) SNR evaluation of semiblind ICA, cascade processing, and our method.


Since conventional semiblind ICAs cannot remove the reverberations of user utterances, the WC and SNR do not improve, especially in the more strongly reverberant environment. In contrast, our method separates both reverberation and robot speech, and therefore WC is improved in both environments. For example, WC is improved by 10 points and SNR by about 1 dB in the 240 ms reverberant environment. WC was also improved by 31 points and SNR by 3 dB in the 670 ms reverberant environment. The optimal frame length N differs according to the reverberation time. The cascade processing does not perform as well as our method because the step-size parameter and other settings are the same as for our method. This means that with the cascade processing, the parameters for echo cancellation and dereverberation must be adjusted independently. In addition, the cascade processing may not be able to attain global optimization because it can fall into local minima. Our method thus has the advantage of easier parameter tuning because our algorithm is derived from one objective function.

5.3.3.  Experiment 3: Evaluation Under Several Conditions

In Figure 9, the results for the environments (RT20 = 240 and 670 ms) with only dereverberation are shown in the upper plots (experiments 3-1a and 3-1b) and those with both dereverberation and echo cancellation are shown in the lower plots (experiments 3-2a and 3-2b). The horizontal axis represents the number of frames, and the vertical axis represents the average WC and SNR over the five positions. The plots show the relationships for 1, 2, and 3 s data learning and batch processing. Tables 4 and 5 summarize the highest average WC and SNR for each condition.

Figure 9:

Results of experiment 3: WC and SNR of our method with 1, 2, and 3 s data learning and batch processing.


Our method achieves blind dereverberation with the optimal number of frames in the results of dereverberation only. Although the number of frames varies with the length of data learning used for estimating the separation matrices, the WC is about 5 points higher than without any processing for the weakly reverberant environment and about 40 points higher for the strong one when 2 s data learning is used for estimation. The improvements in both suppressing the reverberation and robot speech echoes, for example, are 40 points in the weaker reverberant environment and 29 points in the stronger one. The improvements in the SNR are 3.9 and 4.3 dB, respectively.

Performance improves as the length of data learning increases and there is an optimum frame length, N. For example, WC with 1 s learning is worse than with 3 s learning. This is particularly noticeable in barge-in situations of mixed user and robot speech. Moreover, WC becomes worse as the frame length becomes longer with 1 s learning.

The separated signals and spectra of separated speech with our method are shown in Figure 10: Figure 10A shows the ground truth, Figures 10B and 10D show observed signals, and Figures 10C and 10E show the signals separated using our method. Obviously, the reverberation and robot speech have been removed in Figure 10E. The differences between Figures 10A and 10C are caused by reverberation within the frame of the directly arriving signal and by insufficient separation.

Table 4:
Experiment 3-1: Highest Average WC and SNR.

                            Only User Speech    Dereverberation
Environment                 (No Processing)     1 s     2 s     3 s     Batch
WC (%)    RT20: 240 ms      74.3                77.7    81.4    83.3    84.2
          RT20: 670 ms      26.1                64.1    68.0    70.6    72.9
SNR (dB)  RT20: 240 ms      8.3                 9.6     9.9     10.2    10.3
          RT20: 670 ms      6.1                 9.3     10.2    10.5    10.8
Table 5:
Experiment 3-2: Highest Average WC and SNR.

                            User and Robot Speech   Dereverberation + Echo Cancel
Environment                 (No Processing)         1 s     2 s     3 s     Batch
WC (%)    RT20: 240 ms      28.2                    60.9    69.0    72.0    73.2
          RT20: 670 ms      11.0                    33.2    40.8    41.5    50.0
SNR (dB)  RT20: 240 ms      4.8                     7.8     8.8     9.0     9.2
          RT20: 670 ms      3.7                     6.6     8.1     8.5     9.0
Figure 10:

Separated signals and spectra for experiments 2 and 3. (A) Clean speech. (B) Reverberated speech. (C) Dereverberated speech in panel B by using our method. (D) Mixed reverberated speech and robot speech (barge-in case). (E) Separated speech signal by using our method.


5.4.  Discussion and Future Work

Our experiments demonstrate that our method efficiently and effectively separates both reverberation and robot speech and that it reduces the computational cost compared to the semiblind ICA and simple MINT-ICA methods. However, comparing the performance of block-wise processing with that of batch processing shows that we still need to improve the separation performance and stability when only a small number of samples is available for filter estimation.

First, we need to compensate for the lack of information caused by using only a small amount of data in the filter estimation. The performance of our method degrades when only a few samples are available because ICA is based on higher-order statistics and therefore needs a sufficient number of samples for filter estimation. This lack of samples can be compensated for by additional information acquired from other aspects of the environment or the speakers. For example, the locations of the sound sources are useful and can be obtained with sound localization techniques such as MUSIC (Schmidt, 1986) or with visual localization techniques (Asano et al., 2004). Since our method estimates separation matrices that are divided into the blind source separation filter and the blind dereverberation filter, we can use the sound location information as an initial value for the blind source separation filter. Since the dereverberation filter is still estimated blindly, such information does not compromise the blindness of our method.
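As a rough illustration of this idea, the following minimal sketch assumes a hypothetical MUSIC-style localizer that supplies directions of arrival and a free-field uniform linear array model; it initializes the blind source separation part of one frequency bin's separation matrix from the estimated directions instead of from the identity matrix, while the dereverberation filter would still be estimated blindly.

import numpy as np

def steering_vector(doa_rad, n_mics, freq_hz, mic_spacing=0.05, c=343.0):
    # Free-field steering vector of a uniform linear array (illustrative assumption).
    delays = np.arange(n_mics) * mic_spacing * np.sin(doa_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def init_bss_filter(doas_rad, n_mics, freq_hz):
    # Initialize W(omega) as the pseudo-inverse of the mixing matrix implied by the
    # localized directions of arrival (e.g., from MUSIC), instead of the identity.
    A = np.stack([steering_vector(d, n_mics, freq_hz) for d in doas_rad], axis=1)
    return np.linalg.pinv(A)

# Example: two sources at -30 and +45 degrees, two microphones, the 1 kHz bin.
W0 = init_bss_filter(np.deg2rad([-30.0, 45.0]), n_mics=2, freq_hz=1000.0)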

Second, we need to determine the number of iterations intelligently when estimating the filters so as to reduce the computation time further. In our experiments, we set the maximum number of iterations to 20; therefore, the RTF is less than 1.0. Better scheduling should shorten the processing time when more microphones are used, because the computational cost is proportional to the second order of the number of microphones, L. This can be achieved with optimum step-size methods such as that of Nakajima, Nakadai, Hasegawa, and Tsujino (2008): the natural gradient already modifies the direction of the gradient, as in Newton's method, so we only need to adjust its scale. There has been little research on low-cost step-size control for ICA in actual applications, and step-size scheduling therefore remains one of the most serious problems in real-time processing. We also need to optimize the exponential parameter, λ.
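As an illustration of what such scheduling could look like, the toy sketch below (not the method of Nakajima et al., 2008; the cost and gradient functions are left abstract) replaces the fixed 20-iteration budget with an early-stopping rule whose step size is halved whenever the surrogate cost stops decreasing.

def run_ica_updates(W, cost_fn, grad_fn, mu=0.1, max_iter=20, tol=1e-4, mu_min=1e-4):
    # Gradient iterations with a simple backtracking step size and early stopping.
    # cost_fn(W) returns a scalar surrogate cost; grad_fn(W) returns an update
    # direction (e.g., a natural gradient) with the same shape as W.
    prev_cost = cost_fn(W)
    for _ in range(max_iter):
        candidate = W + mu * grad_fn(W)
        cost = cost_fn(candidate)
        if cost < prev_cost - tol:
            W, prev_cost = candidate, cost   # accept the step and keep the step size
        else:
            mu *= 0.5                        # shrink the step; stop once it is negligible
            if mu < mu_min:
                break
    return W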

Finally, to improve the ASR performance, we must manage the reverberation that is not separated. Its effect can be overcome by adaptation or multicondition training of ASR acoustic models, missing-feature techniques (Raj & Stern, 2005), and other techniques such as Wiener filtering. Since such techniques will effectively improve the ASR performance, the next issue will be how to integrate them with our method. We will reconsider useful conventional methods and refine them to construct an efficient system that works in actual environments.

6.  Conclusion

We tackled the problem of separating a robot's (or system's) own speech and reducing reverberation with a minimal amount of a priori information. This problem is general and important in human-machine interaction because microphones capture at least these two types of sound, and the system cannot predict how the reverberation affects the target speech signal. From the viewpoint of computational auditory scene analysis (CASA), we developed a method that effectively overcomes the reverberation problem.

We dealt with the reverberation of user speech by extending conventional semiblind independent component analysis (ICA). This is achieved by introducing a separation model based on the independence of delayed observed signals with the multiple input/output inverse-filtering theorem (MINT) and by spatial sphering for preprocessing. Our method overcomes the dereverberation and echo cancellation problems simultaneously at a low computational cost that is proportional to the linear order of the reverberation time. Experimental results in several environments demonstrate the performance and efficiency of our method in terms of ASR word correctness and SNR.

Further improvements will come from optimizing the step-size parameters and using a priori information about the sound sources. Because the reverberation is not suppressed enough for current ASR systems, integration with existing ASR techniques (e.g., spectral subtraction or missing-data techniques) is necessary. In particular, since the dereverberation filter reflects the reverberation properties of the environment, those properties could be used to adapt the acoustic model of the ASR to the environment. We intend to analyze the properties of the estimated reverberation filters and integrate them with other methods.

Appendix:  Derivation of Estimation Algorithm

Equation 4.13 conforms to a standard ICA formulation. We can therefore derive a learning rule for the optimum filters by the same derivation. To obtain the optimum separation matrices, we minimize the cost function, J, based on the Kullback-Leibler divergence:
formula
A.1
where q denotes the product of the marginal PDFs of ,
formula
A.2
and
formula
A.3
formula
A.4
are defined as the differential entropy and the joint differential entropy, respectively. E[·] means the expectation operator over frame index t.
Since we assume the separation model is an invertible linear transformation, the joint entropy becomes
formula
A.5
Since , , and are fixed quantities during the estimation of , the cost function becomes simply
formula
A.6
The gradient of J with respect to is
formula
A.7
because the elements of in equation A.2 are fixed quantities, except for those in the first row. From equations 4.14 and A.6, the gradients in equation A.7 become
formula
A.8
formula
A.9
formula
A.10
where denotes the inverse and conjugate transpose of the matrix .
We use the following natural gradient because it has a better convergence property than the conventional gradient method (Amari, 1998). The natural gradient of equation A.6 is described as
formula
formula
A.11
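For orientation, the same steps in the plain, unstructured ICA setting (a single square separation matrix W per frequency bin, which is a simplification of our structured separation model) can be written as follows; this is a generic sketch of equations A.1 to A.11, not their exact structured form.

% Generic single-bin sketch (unstructured W); constants that do not affect the
% fixed points are dropped.
\begin{align}
J(W) &= \mathrm{KL}\Bigl(p(\mathbf{Y}) \,\Big\|\, \prod\nolimits_i q(Y_i)\Bigr)
      = \sum\nolimits_i H(Y_i) - H(\mathbf{Y}), \\
H(\mathbf{Y}) &= H(\mathbf{X}) + \log \lvert \det W \rvert^{2}
      \quad (\mathbf{Y} = W\mathbf{X} \text{ invertible}), \\
J(W) &\simeq -\,\mathrm{E}\Bigl[\sum\nolimits_i \log q(Y_i)\Bigr]
      - \log \lvert \det W \rvert^{2} + \mathrm{const.}, \\
\Delta W &\propto \bigl(I - \mathrm{E}[\boldsymbol{\varphi}(\mathbf{Y})\,\mathbf{Y}^{H}]\bigr)\,W
      \quad \text{(natural gradient).}
\end{align}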
Each natural gradient of is obtained by using equation 4.14 as follows:
formula
A.12
formula
A.13
formula
A.14
Therefore, we have the following learning algorithms for , , and from equations A.8 to A.11,
formula
A.15
formula
A.16
formula
A.17
where
formula
A.18
Here, μ is a step-size parameter, and is a nonlinear function vector. is usually the unit matrix, but here we use a nonholonomic constraint matrix (Choi et al., 1999) because its convergence is better than that of the unit matrix. We use tanh(100|x|)e^{jθ(x)} as the nonlinear function, φ(x) (Sawada et al., 2003). Equation 4.17 is used to estimate the blind source separation filter. Equations 4.18 and 4.19 are used to estimate the blind dereverberation filter and the echo cancellation filter, respectively. When multiple sound sources exist, we can still use these rules because the independence exchange also holds for them, and the above algorithm can be applied to separation without any changes except for handling the permutation problem of ICA.
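To make the form of these update rules concrete, the minimal sketch below implements one natural-gradient iteration for a single frequency bin in the plain (unstructured) FD-ICA case, with a nonholonomic constraint matrix and the polar nonlinearity tanh(100|x|)e^{jθ(x)}. It illustrates only the generic rule; the structured updates of equations A.15 to A.17 for the dereverberation and echo cancellation filters are not reproduced here.

import numpy as np

def phi(Y, gain=100.0):
    # Polar-coordinate nonlinearity tanh(gain * |y|) * exp(j * theta(y)) (Sawada et al., 2003).
    return np.tanh(gain * np.abs(Y)) * np.exp(1j * np.angle(Y))

def natural_gradient_step(W, X, mu=0.01, nonholonomic=True):
    # One natural-gradient ICA update for a single frequency bin:
    #   W <- W + mu * (Lambda - E[phi(Y) Y^H]) W,
    # where Lambda is diag(E[phi(Y) Y^H]) under the nonholonomic constraint,
    # or the identity matrix otherwise.
    # W: (L, L) complex separation matrix; X: (L, T) observed spectra (L mics, T frames).
    Y = W @ X
    R = (phi(Y) @ Y.conj().T) / X.shape[1]          # sample estimate of E[phi(Y) Y^H]
    Lam = np.diag(np.diag(R)) if nonholonomic else np.eye(W.shape[0])
    return W + mu * (Lam - R) @ W

# Example: 2 microphones, 200 frames of synthetic complex data, identity initialization.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200)) + 1j * rng.standard_normal((2, 200))
W = np.eye(2, dtype=complex)
for _ in range(20):
    W = natural_gradient_step(W, X)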

Acknowledgments

We thank Takuma Ohtsuka for his valuable comments. This research was partially supported by a Grant-in-Aid for Scientific Research (S), JSPS Fellows, and a Global COE Program.

References

Acero, A., & Stern, R. M. (1992). Cepstral normalization for robust speech recognition. In Proc. of the Speech Processing in Adverse Conditions (pp. 89–92). International Speech Communication Association.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Araki, S., Makino, S., Aichner, R., Nishikawa, T., & Saruwatari, H. (2005). Subband-based blind separation for convolutive mixtures of speech. IEICE Trans. on Fundamentals, E88-A, 3593–3603.
Araki, S., Mukai, R., Makino, S., Nishikawa, T., & Saruwatari, H. (2003). The fundamental limitation of frequency domain blind source separation for convolutive mixtures of speech. IEEE Trans. on Speech and Audio Processing, 11, 109–116.
Asano, F., Yamamoto, K., Hara, I., Ogata, J., Yoshimura, T., Motomura, Y., et al. (2004). Detection and separation of speech event using audio and video information fusion and its application to robust speech interface. EURASIP Journal on Applied Signal Processing, 11, 1727–1738.
Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
Boll, S. F. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. on Acoustics, Speech and Signal Processing, 27, 113–120.
Choi, S., Cichocki, A., & Liu, R.-W. (1999). Natural gradient learning with a nonholonomic constraint for blind deconvolution of multiple channels. In Proc. of International Workshop on ICA and BSS (pp. 371–376). Piscataway, NJ: IEEE.
Douglas, S. C., Sawada, H., & Makino, S. (2005). Natural gradient multichannel blind deconvolution and speech separation using causal FIR filters. IEEE Trans. on Speech and Audio Processing, 13, 92–104.
Furuya, K., & Kataoka, A. (2005). Robust speech dereverberation using multichannel blind deconvolution with spectral subtraction. IEEE Trans. on Audio, Speech and Language Processing, 15, 1579–1591.
Gansler, T., & Benesty, U. (2001). A frequency-domain double-talk detector based on a normalized cross-correlation vector. Signal Processing, 81, 1783–1787.
Gansler, T., Gray, S. L., Sondhi, M. M., & Benesty, J. (2000). Double-talk robust fast converging algorithm for network echo cancellation. IEEE Trans. on Speech and Audio Processing, 8, 656–663.
Gauvain, J.-L., & Lee, C.-H. (1994). Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. on Speech and Audio Processing, 2, 291–298.
Ghose, K., & Reddy, U. (2000). A double-talk detector for acoustic echo cancellation applications. Signal Processing, 80, 1459–1467.
Gomez, R., Even, J., Saruwatari, H., & Shikano, K. (2008). Distant-talking robust speech recognition using late reflection components of room impulse response. In Proc. of IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing (pp. 4581–4584). Piscataway, NJ: IEEE.
Haykin, S. (1991). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice-Hall.
Herbordt, W., Buchner, H., Nakamura, S., & Kellermann, W. (2007). Multichannel bin-wise robust frequency-domain adaptive filtering and its application to adaptive beamforming. IEEE Trans. on Audio, Speech and Language Processing, 15, 1340–1351.
Hiroe, A. (2007). Blind vector deconvolution: Convolutive mixture models in short-time Fourier transform domain. In Proceedings of the Conference on Independent Component Analysis and Signal Separation. New York: Springer.
Hoshuyama, O., Sugiyama, A., & Hirano, A. (1999). A robust adaptive beamformer for microphone arrays with blocking matrix using constrained adaptive filter. IEEE Trans. on Signal Processing, 47, 2677–2684.
Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. Hoboken, NJ: Wiley-Interscience.
Ikeda, S., & Murata, N. (1999). A method of ICA in time-frequency domain. In Proc. Int'l Workshop on Independent Component Analysis and Signal Separation (pp. 365–370). New York: Springer.
Joho, M., Mathis, H., & Moschytz, G. S. (2001). Combined blind/nonblind source separation based on the natural gradient. IEEE Signal Processing Letters, 8, 236–238.
Kokkinakis, K., & Nandi, A. K. (2006). Multichannel blind deconvolution for source separation in convolutive mixtures of speech. IEEE Trans. on Audio, Speech, and Language Processing, 14, 200–212.
Komatani, K., Kawahara, T., & Okuno, H. G. (2008). Predicting ASR errors by exploiting barge-in rate of individual users for spoken dialogue systems. In Proc. of Interspeech (pp. 183–186). International Speech Communication Association.
Larue, A., Mars, J. I., & Jutten, C. (2006). Frequency-domain blind deconvolution based on mutual information rate. IEEE Trans. on Signal Processing, 54, 1771–1781.
Lee, A., Kawahara, T., & Shikano, K. (2001). Julius: An open source real-time large vocabulary recognition engine. In Proc. of the Eighth European Conference on Speech Communication and Technology (pp. 1691–1694). European Speech Communication Association.
Leggeter, C. J., & Woodland, P. C. (1995). Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9, 171–185.
Makino, S., Kaneda, Y., & Koizumi, N. (1993). Exponentially weighted stepsize NLMS adaptive filter based on the statistics of a room impulse response. IEEE Trans. on Speech and Audio Processing, 1, 101–108.
Matsuyama, K., Komatani, K., Ogata, T., & Okuno, H. G. (2009). Enabling a user to specify an item at any time during system enumeration. In Proc. of Interspeech (pp. 252–255). International Speech Communication Association.
Miyabe, S., Hinamoto, Y., Saruwatari, H., Shikano, K., & Tatekura, Y. (2007). Interface for barge-in free spoken dialogue system based on sound field reproduction and microphone array. EURASIP Journal on Advances in Signal Processing, 2007, 57470.
Miyoshi, M., & Kaneda, Y. (1988). Inverse filtering of room acoustics. IEEE Trans. on Acoustics, Speech and Signal Processing, 36, 145–152.
Murata, N., & Ikeda, S. (2001). An approach to blind source separation based on temporal structure of speech signals. Neurocomputing, 41, 1–24.
Nakajima, H., Nakadai, K., Hasegawa, Y., & Tsujino, H. (2008). Adaptive step-size parameter control for real-world blind source separation. In Proc. of IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing (pp. 149–152). Piscataway, NJ: IEEE.
Nakatani, T., & Miyoshi, M. (2003). Blind dereverberation of single channel speech signal based on harmonic structure. In Proc. of IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing (Vol. 1, pp. 92–95). Piscataway, NJ: IEEE.
Nakatani, T., Yoshioka, T., Kinoshita, K., Miyoshi, M., & Juang, B.-H. (2008). Blind speech dereverberation with multichannel linear prediction based on short time Fourier transform representation. In Proc. of IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing (pp. 85–88). Piscataway, NJ: IEEE.
Nishikawa, T., Saruwatari, H., & Shikano, K. (2003). Blind source separation of acoustic signals based on multistage ICA combining frequency-domain ICA and time-domain ICA. IEICE Trans. Fundamentals, E86-A, 846–858.
Nishiura, T., Hirano, Y., Denda, Y., & Nakayama, M. (2007). Investigations into early and late reflections on distant-talking speech recognition toward suitable reverberation criteria. In Proc. of Interspeech (pp. 1082–1085). International Speech Communication Association.
Raj, S., & Stern, R. M. (2005). Missing-feature approaches in speech recognition. Signal Processing Magazine, 22, 101–116.
Rosenthal, D., & Okuno, H. G. (1998). Computational auditory scene analysis. Mahwah, NJ: Erlbaum.
Saruwatari, H., Kawamura, T., Nishikawa, T., Lee, A., & Shikano, K. (2006). Blind source separation based on a fast-convergence algorithm combining ICA and beamforming. IEEE Trans. on Speech and Audio Processing, 14, 666–678.
Saruwatari, H., Mori, Y., Takatani, T., Ukai, S., Shikano, K., Hiekata, T., et al. (2005). Two-stage blind source separation based on ICA and binary masking for real-time robot audition system. In Proc. of IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems (pp. 209–214). Piscataway, NJ: IEEE.
Sawada, H., Mukai, R., & Araki, S. (2003). Polar coordinate based nonlinear function for frequency-domain blind source separation. IEICE Trans. Fundamentals, E86-A, 505–510.
Schmidt, R. O. (1986). Multiple emitter location and signal parameter estimation. IEEE Trans. on Antennas and Propagation, 32, 276–280.
Sohn, J., Kim, N. S., & Sung, W. (1999). A statistical model-based voice activity detection. IEEE Signal Processing Letters, 6, 1–3.
Takeda, R., Nakadai, K., Komatani, K., Ogata, T., & Okuno, H. G. (2007). Exploiting known sound sources to improve ICA-based robot audition in speech separation and recognition. In Proc. of IEEE/RSJ Int'l Conf. on Intelligent Robots and Systems (pp. 1757–1762). Piscataway, NJ: IEEE.
Takeda, R., Nakadai, K., Takahashi, T., Komatani, K., Ogata, T., & Okuno, H. G. (2009). ICA-based efficient blind dereverberation and echo cancellation method for barge-in-able robot audition. In Proc. of IEEE Int'l Conf. on Acoustics, Speech and Signal Processing (pp. 3677–3680). Piscataway, NJ: IEEE.
Vaidyanathan, P. P. (1993). Multirate systems and filter banks. Upper Saddle River, NJ: Prentice Hall.
Yang, J.-M., & Sakai, H. (2008). A robust ICA-based adaptive filter algorithm for system identification. IEEE Trans. Circuits and Systems II, 55, 1259–1263.