## Abstract

This letter presents a new algorithm for blind dereverberation and echo cancellation based on independent component analysis (ICA) for actual acoustic signals. We focus on frequency domain ICA (FD-ICA) because its computational cost and convergence speed are reasonable enough for practical applications such as hands-free speech recognition. When conventional FD-ICA is applied as preprocessing for automatic speech recognition in noisy environments, one of the most critical problems is how to cope with reverberation. To extract a clean signal from the reverberant observation, we model the separation process in the short-time Fourier transform domain and apply the multiple input/output inverse-filtering theorem (MINT) to the FD-ICA separation model. A naive implementation of this method is computationally expensive because its time complexity is quadratic in the reverberation time. The main issue in dereverberation is therefore reducing the high computational cost of ICA. In this letter, we reduce the computational complexity to linear in the reverberation time by using two techniques: (1) a separation model based on the independence of delayed observed signals with MINT and (2) spatial sphering for preprocessing. Experiments show that the computational cost grows linearly with the reverberation time and that our method improves the word correctness of automatic speech recognition by 10 to 20 points in an RT_{20} = 670 ms reverberant environment.

## 1. Introduction

### 1.1. Background and Motivation

The ultimate goal of our research is to develop a human-symbiotic robot. Such robots, for example, could rescue people from dangerous situations in difficult-to-access places and perform domestic duties. Communication through speech between humans and robots is essential and natural because speech is how people communicate in daily life. A robot equipped with microphones on its body or head hears its own speech, the speech of nearby people, reverberations (echoes), and unknown noise. For example, a person may interrupt and begin to speak in a TV-noise environment while the robot is still talking, that is, barge in, as often occurs in human-to-human communication. In this case, the robot hears its own speech and the person's speech at the same time. Thus, robots are required to distinguish target sounds from many disturbances as easily as humans do.

We present a comprehensive method for separating the robot's own speech and reducing reverberation with a reduced amount of a priori information. This method enables spoken dialogue systems to handle barge-in situations (Komatani, Kawahara, & Okuno, 2008; Matsuyama, Komatani, Ogata, & Okuno, 2009). Our target problem reduces to a combination of blind dereverberation and echo cancellation using a microphone array in the acoustic signal processing field. Here, *blind dereverberation* means removing a target speaker's reverberation by using only the observed signals; it is one of the most difficult problems in acoustic signal processing. Although dereverberation is an important problem, few dereverberation methods work well in practical situations. *Echo cancellation* means separating out the system's speech signal by using a reference signal and the observed signals.

We outline our requirements and philosophy before describing related work. The method for speech separation (1) must have low computational cost, (2) must require a minimal amount of prior information, and (3) must have connectivity with other useful methods. The reason for requiring low computational cost is that the resources of robots are usually limited, while real-time responses are required in human-robot interaction; the computational cost is therefore an important issue for practical application. The reason for requiring a minimal amount of a priori information is that robots must be able to work even in unknown acoustic environments, where they cannot make strong assumptions such as the positions of sound sources. The reason for requiring connectivity with other methods is that many noise reduction and recognition mechanisms (Sohn, Kim, & Sung, 1999; Raj & Stern, 2005) have been proposed, and integrating sound source separation with them will drastically improve the overall system, including automatic speech recognition (ASR). Our philosophy is based on a macro viewpoint of constructing a whole system, such as an intelligent system or robot, from sound separation through speech recognition to spoken dialogue, not a micro viewpoint that deals with only a specific technique.

Speech separation is categorized as a computational auditory scene analysis (CASA) problem (Rosenthal & Okuno, 1998) because it requires discrimination of a target speech signal from noisy input signals. Figure 1 shows the standard scheme for achieving CASA in our situation. Sound source separation is used to separate mixed sound signals. The separated speech is then usually recognized by ASR. The result of ASR is used in the spoken dialogue system to produce system utterances. The system speech can be used in the separation process because the system's speech is included in the signals captured by the microphones.

### 1.2. Related Work

A great deal of research has been conducted on blind dereverberation and echo cancellation. However, little research has been reported on dealing with both of them together or on integrating them.

#### 1.2.1. Echo Cancellation

Echo cancellation is used to separate a known (reference) source signal from the microphone input. Many echo cancellation techniques have been developed, beginning with the Wiener filter, least mean squares (LMS), and the Kalman filter (Haykin, 1991), up to the latest achievements such as statistical Bayes filters based on state-space models (Bishop, 2006). These echo cancellation methods are generally not robust against noise (i.e., barge-in situations), and they usually require double-talk detectors (Ghose & Reddy, 2000; Gansler & Benesty, 2001). More robust echo cancellation methods have been reported, such as applications of the M-estimator or independent component analysis (ICA) (Gansler, Gray, Sondhi, & Benesty, 2000; Takeda, Nakadai, Komatani, Ogata, & Okuno, 2007; Miyabe, Hinamoto, Saruwatari, Shikano, & Tatekura, 2007; Yang & Sakai, 2008). However, these methods do not handle dereverberation problems. Herbordt, Buchner, Nakamura, and Kellermann (2007) integrated echo cancellation with generalized side-lobe cancellation (Hoshuyama, Sugiyama, & Hirano, 1999) in a multichannel microphone array, thereby achieving robust separation of known sources and emphasis of the target speech. However, this approach requires a rough location of the target sound source, so it does not satisfy our requirements.

#### 1.2.2. Dereverberation and Blind Dereverberation

Blind dereverberation means removal of a target speaker's reverberation by using only observed signals. Nakatani and Miyoshi (2003) achieved blind dereverberation by using the harmonic structure of the speech signal. Gomez, Even, Saruwatari, and Shikano (2008) applied fast spectral subtraction to late reverberations by using a prerecorded impulse response. These and similar approaches are not suitable for integration with echo cancellation because they use schemes different from those of echo cancellation (e.g., different optimization criteria).

In contrast, a number of multichannel blind deconvolution methods that do not use any impulse responses have been proposed for blind dereverberation. For example, Nakatani, Yoshioka, Kinoshita, Miyoshi, and Juang (2008) proposed a blind dereverberation method based on maximum likelihood estimation using a time-varying gaussian model and a multichannel linear prediction model. However, they did not consider echo cancellation, even though most such methods employ a statistical approach compatible with it. Other methods using only microphone arrays, including multichannel blind deconvolution, have not dealt with echo cancellation either (Hyvarinen, Karhunen, & Oja, 2001; Furuya & Kataoka, 2005; Douglas, Sawada, & Makino, 2005; Larue, Mars, & Jutten, 2006).

### 1.3. Our Approach and Principal Contribution

We make use of ICA (Hyvarinen et al., 2001), especially frequency domain ICA (FD-ICA) (Ikeda & Murata, 1999), to deal simultaneously with blind dereverberation and echo cancellation. We use FD-ICA because (1) it provides a natural framework for tasks such as blind source separation and adaptive filtering (Joho, Mathis, & Moschytz, 2001; Miyabe et al., 2007), (2) it is robust against gaussian noise, such as fan noise, and (3) its convergence and computational cost are excellent compared with time domain ICA. However, Araki, Mukai, Makino, Nishikawa, and Saruwatari (2003) identified a fundamental limitation in the performance of FD-ICA for actual acoustic signals: reverberations are barely separated, which causes a substantial deterioration of ASR (Nishiura, Hirano, Denda, & Nakayama, 2007). Although using FD-ICA enables the convolutive mixing problem to be converted into an instantaneous mixing problem that is easy to solve, another technique is required to achieve blind dereverberation for actual acoustic signals, because we usually observe reverberations longer than the proper window size used in the Fourier transform for FD-ICA (Araki et al., 2003).

We apply the multiple input/output inverse filtering theorem (MINT) (Miyoshi & Kaneda, 1988) to the separation model of FD-ICA and adopt STFT domain modeling (Takeda et al., 2009; Nakatani et al., 2008) to overcome the separation performance limitation caused by reverberation. Since FD-ICA has already been extended to integrate blind source separation with echo cancellation (Takeda et al., 2009), combining MINT and FD-ICA–based echo cancellation achieves both blind dereverberation and echo cancellation. However, this naive combination substantially increases the computational cost because it essentially separates all reflected sounds as additional sources. To overcome this problem, we developed a method comprising (1) a separation model using observed signal independence that holds under MINT conditions, (2) a spatial sphering technique for preprocessing, and (3) miscellaneous techniques for practical applications. The model provides a new learning algorithm for the separation filters and is a natural extension of FD-ICA. With this method, we achieve low-computational-cost, fast-convergence ICA for acoustic signals. (Note that this letter is an extended version of Takeda et al., 2009.)

Our method essentially differs from previous ICA and multichannel blind deconvolution methods in that it can separate multiple sound sources and their reverberations, apart from the ICA-specific permutation problem. Our algorithm differs from previous convolutive ICA algorithms (Hyvarinen et al., 2001; Nishikawa, Saruwatari, & Shikano, 2003; Douglas et al., 2005; Kokkinakis & Nandi, 2006; Hiroe, 2007) because it uses a trick derived from MINT and reduces to a simple learning rule for the separation filter that is similar to that of FD-ICA. As for solutions to the FD-ICA limitations, some applicable methods exist, such as Nishikawa et al. (2003), Araki, Makino, Aichner, Nishikawa, and Saruwatari (2005), and Hiroe (2007). However, some of them were not designed for use in the STFT domain, and some increase the computational cost because they apply other transformations. Moreover, they cannot handle echo cancellation and blind dereverberation together.

### 1.4. Organization of This Letter

Section 2 states our problem and explains our fundamental method. In section 3, we explain the MINT-based observation model and ICA and its problems. Section 4 explains our separation model and spatial sphering technique. Section 5 discusses our method based on experimental results on speech recognition.

## 2. Problem Statement

This section formulates the blind dereverberation and echo cancellation problem and explains the common principles underlying our method.

### 2.1. Our Strategy and Focus

The problem is to extract the target speaker's speech signal from observed signals by using a microphone array that includes the target speaker's speech, the robot's speech, and their reverberations. Moreover, we need to satisfy the three requirements of low computational cost, minimal amount of a priori information, and connectivity with other methods.

There are two approaches to model sound signals: modeling in the time domain and modeling in the subband domain, which includes frequency domain, wavelet domain, and other domains (Vaidyanathan, 1993). From the viewpoint of the whole robot audition/CASA system, it is suitable to use subband domain processing, especially the STFT domain (Hiroe, 2007; Takeda et al., 2007; Nakatani et al., 2008). This will be explained in section 2.2.

STFT domain processing is effective in three main efficiency and technical respects. First, since the speech features for ASR are often extracted framewise from the STFT domain signal, they can be extracted directly after STFT domain speech separation and dereverberation, eliminating unnecessary processing. Second, since misrecognition due to reverberation is caused by spectra remaining from several previous frames, we can naturally formulate a separation model in the STFT domain by regarding the reverberant spectra as delayed sound sources from previous frames. Moreover, reverberations within the same STFT frame or in very delayed frames can be handled using ASR techniques such as cepstral mean normalization (Acero & Stern, 1992) or acoustic model adaptation (Gauvain & Lee, 1994; Leggetter & Woodland, 1995). Third, there are many techniques for reducing stationary background noise in a high-SNR (signal-to-noise ratio) environment; a typical one is spectral subtraction (Boll, 1979). These techniques improve the SNR and are compatible with our separation scheme in the STFT domain.
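As a concrete illustration of the third point, a minimal Boll-style spectral subtraction over STFT frames can be sketched as follows (the array shapes and the flat noise estimate are assumptions for the sketch, not details from this letter):

```python
import numpy as np

def spectral_subtraction(spec, noise_mag, floor=0.01):
    """Boll-style magnitude spectral subtraction in the STFT domain.

    spec      : complex STFT spectra, shape (frames, bins)
    noise_mag : estimated stationary noise magnitude, shape (bins,)
    floor     : spectral floor that avoids negative magnitudes
    """
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return clean_mag * np.exp(1j * phase)  # reuse the noisy phase

# Toy usage with random complex spectra and a flat noise estimate.
rng = np.random.default_rng(0)
spec = rng.normal(size=(100, 64)) + 1j * rng.normal(size=(100, 64))
enhanced = spectral_subtraction(spec, np.full(64, 0.5))
```

In practice `noise_mag` would be estimated from speech-free frames; the `floor` term keeps the enhanced magnitudes nonnegative.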

We define our speech separation problem as follows (see Figure 2):

- Purpose:
Removal of reverberation and robot's speech

- Input:
(1) Observed sound spectra in STFT domain (one person's speech signal, one reference speech signal, and their mixed reverberations) and (2) reference robot's speech spectrum in STFT domain

- Output:
Separated target speaker's spectrum in STFT domain

- Requirements:
(1) Low computational cost and (2) weak modeling assumptions

We use ICA to satisfy the second requirement. Since ICA involves few modeling assumptions, it is suitable for solving our problem. It differs from other microphone array techniques in that it does not need the location of the sound source or equivalent information such as a transfer function. Thus, the remaining questions are how to treat reverberations in the STFT domain for ICA and how to satisfy the requirement for low computational cost. We describe the answers in section 3.

### 2.2. Spectrum in STFT Domain

In the time domain, a sound signal consists of the direct sound and the reflected sounds generated by convolution with an impulse response of *K* taps. The signal is converted into the STFT domain with a window of length *T*_{h} and shift *T*_{a}: each windowed segment is transformed into a spectrum *x*_{w}[*t*] using the discrete Fourier transform, where *t* denotes the frame index, *w* denotes the frequency bin index in the STFT domain, and *f* is a window function for short-duration analysis. This process can be executed rapidly by using the fast Fourier transform algorithm at O(*T* log *T*) computational cost in the window length *T*. Note that *x*_{w}[*t*] is a complex number because the STFT spectrum generally becomes complex valued even for a real-valued signal.
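The framewise transform described above can be sketched as follows (the window length, shift, and variable names are illustrative assumptions, not the letter's settings):

```python
import numpy as np

def stft(signal, win_len=512, shift=64):
    """STFT with a Hanning window: rows are frames t, columns bins w."""
    window = np.hanning(win_len)
    n_frames = (len(signal) - win_len) // shift + 1
    frames = np.stack([signal[t * shift:t * shift + win_len] * window
                       for t in range(n_frames)])
    # Each frame is transformed by the FFT in O(T log T) time.
    return np.fft.rfft(frames, axis=1)

# Even a real-valued tone yields complex-valued spectra x_w[t].
x = stft(np.sin(2 * np.pi * 0.05 * np.arange(4096)))
```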

### 2.3. Independent Component Analysis

Consider *L* mutually independent complex random variables, $\mathbf{s} = [s_1, \ldots, s_L]^T$. They are mixed by a time-invariant linear system that is represented by an *L* × *L* nonsingular matrix, $\mathbf{A}$. Let $\mathbf{x}$ be the observed signal vector; the relationship between $\mathbf{s}$ and $\mathbf{x}$ is $\mathbf{x} = \mathbf{A}\mathbf{s}$. ICA estimates the original source vector as $\mathbf{y} = \mathbf{W}\mathbf{x}$ by using only the observed vector $\mathbf{x}$, where $\mathbf{W}$ is an *L* × *L* separation matrix estimated using ICA.

The separation matrix is obtained by minimizing the Kullback-Leibler divergence $J(\mathbf{W}) = \mathrm{KL}(p \,\|\, q)$, where *p* is the joint PDF of $\mathbf{y}$ and *q* corresponds to the product of the marginal PDFs of $\mathbf{y}$. These parameters are usually estimated using an iterative gradient-based method because of the nonlinearity of *J*. The natural gradient update is

$$\mathbf{W}^{j+1} = \mathbf{W}^{j} + \mu\left(\mathbf{I} - \mathrm{E}\left[\varphi(\mathbf{y})\mathbf{y}^H\right]\right)\mathbf{W}^{j},$$

where *j* represents the number of iterations, ·^{H} represents a conjugate transpose, μ is a step size, and $\varphi(\cdot)$ is a nonlinear function vector.

ICA is ambiguous about the permutation and scaling of each element of the estimated vector, $\mathbf{y}$. These two factors affect the quality of the resynthesized signals when ICA is used in a decomposed domain, such as the frequency or wavelet domain (Hyvarinen et al., 2001). A solution to this problem is presented in section 4.5.
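A minimal sketch of this natural-gradient iteration for an instantaneous complex mixture follows (the mixing matrix, source distribution, sample count, and step size are toy assumptions):

```python
import numpy as np

def phi(y):
    # Polar nonlinearity for complex-valued ICA (cf. Sawada et al., 2003).
    return np.tanh(100 * np.abs(y)) * np.exp(1j * np.angle(y))

def ica(x, n_iter=200, mu=0.1):
    """Natural-gradient ICA for an instantaneous complex mixture x (L x T)."""
    L = x.shape[0]
    W = np.eye(L, dtype=complex)
    for _ in range(n_iter):
        y = W @ x
        corr = phi(y) @ y.conj().T / y.shape[1]   # sample E[phi(y) y^H]
        W = W + mu * (np.eye(L) - corr) @ W       # natural-gradient step
    return W

# Toy demo: two super-gaussian complex sources, nonsingular 2 x 2 mixing.
rng = np.random.default_rng(1)
s = rng.laplace(size=(2, 5000)) + 1j * rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6], [0.4, 1.0]], dtype=complex)
W = ica(A @ s)
```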

## 3. Basic Techniques

We use several basic techniques to achieve both blind dereverberation and echo cancellation. We explain the echo cancellation based on ICA in the STFT domain and dereverberation based on ICA and MINT. Then we discuss the problem of the naive combination of these two methods.

### 3.1. Model of Semiblind ICA for Echo Cancellation

Takeda et al. (2007) modeled echoes of the robot's speech (reference) spectrum in the STFT domain and developed a separation algorithm based on ICA. We denote the observed spectra as *x*_{w,j}[*t*] at frequency bin *w* and frame *t* of the *j*th microphone and the user's and robot's (reference) speech spectrum as *s*_{w,u}[*t*] and *s*_{w,r}[*t*], respectively.

The observed spectrum at the *j*th microphone is modeled, assuming the discrete-linear convolution model, as

$$x_{w,j}[t] = a_{w,j}\, s_{w,u}[t] + \sum_{n=0}^{K_{w,r}} h^{r}_{w,j}[n]\, s_{w,r}[t-n],$$

where $h^{r}_{w,j}[n]$ and $a_{w,j}$ are the transfer coefficients of the robot's and the user's spectrum, and $K_{w,r}$ is the number of filter taps. We transform this equation into matrix representation with a reference spectrum vector, $\mathbf{s}_{w,r}[t] = [s_{w,r}[t], \ldots, s_{w,r}[t-M_{w,r}]]^T$, and a transfer coefficient vector, where the identity block is an (*M*_{w,r} + 1) × (*M*_{w,r} + 1) unit matrix and *M*_{w,r} = *K*_{w,r}. The reference spectrum is assumed to arrive without delay in the model. The independence between the user's spectrum and the reference spectrum is evaluated in the ICA algorithm to suppress the reference spectrum. We call this algorithm "semiblind" because the reference is a known signal, whereas the remaining signals are unknown.

Since equation 3.2 is a nonsingular mixing process, we can obtain the user's spectrum by applying ICA. We have two reasons for applying ICA. The first is that the speech signal has a nongaussian property (Hyvarinen et al., 2001), which theoretically matches the assumption of ICA. The second is that since ICA is robust against disturbances that have a gaussian property, it can estimate the separation filter in noisy situations by using iterative or online processing (Yang & Sakai, 2008). Of course, in batch processing, we can use the Wiener filter (Haykin, 1991), in terms of second-order statistics, as an initial value for the separation filter.

Semiblind ICA can separate the robot's speech, including its reverberations, because the model accounts for the reverberations over several frames. However, it cannot cope with the user's reverberations.
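The block structure of the semiblind mixing model can be illustrated for a single microphone and a short reference filter (all sizes and coefficients here are toy assumptions, not the letter's configuration):

```python
import numpy as np

rng = np.random.default_rng(2)
M = 3                                # reference filter taps (toy size)
a = rng.normal()                     # user's direct transfer coefficient
h = rng.normal(size=M + 1)           # robot's transfer filter taps

# Observed frame stacked with the known reference frames
# s_r[t], ..., s_r[t - M]; the lower identity block encodes that the
# reference is observed directly.
mix = np.block([
    [np.array([[a]]), h[None, :]],
    [np.zeros((M + 1, 1)), np.eye(M + 1)],
])

# Block-triangular structure: det(mix) = a, so the mixture is
# nonsingular whenever the user's direct path is nonzero, and an
# inverse (which ICA estimates blindly) recovers the user's source.
sources = np.concatenate([[rng.normal()], rng.normal(size=M + 1)])
observed = mix @ sources
recovered = np.linalg.solve(mix, observed)
```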

### 3.2. Model of ICA Based on Inverse Theorem in Acoustic Field

An inverse filter exists in a multiple-input system as long as the number of microphones is larger than the number of sound sources. This is almost the same statement as the multiple input/output inverse filtering theorem (MINT) defined in the time domain (Miyoshi & Kaneda, 1988). However, we distinguish our model in the STFT domain from the time-domain MINT.

If the condition *L*(*N*_{w} + 1) = (*M*_{w,u} + 1) holds and the whole mixing matrix is of full rank, the mixing matrix is nonsingular. This ensures the existence of an inverse system, meaning that we can obtain the unique solution for the original sources. In practice, since there is no environment with a rank-reduced mixing matrix (i.e., the transfer functions from the source to each microphone are all unique), this theorem always holds with proper *N*_{w} and *M*_{w,u}. Note that this theorem also holds in the multiple-source case, although the explanation is omitted here.
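A small numerical check of this dimension condition, using a toy convolution model (the filter model and all sizes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
L, N_w = 2, 2                        # microphones, observed frames - 1
M_u = L * (N_w + 1) - 1              # condition L(N_w + 1) = M_u + 1
K = M_u + 1 - N_w                    # per-microphone filter length

def conv_matrix(h, n_rows, n_cols):
    """(n_rows x n_cols) valid-convolution matrix of filter h."""
    T = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        T[i, i:i + len(h)] = h
    return T

# Stack the per-microphone convolution matrices; with generic
# (nonshared) filters the square mixing matrix is full rank, so an
# exact inverse filter exists.
mix = np.vstack([conv_matrix(rng.normal(size=K), N_w + 1, M_u + 1)
                 for _ in range(L)])
full_rank = np.linalg.matrix_rank(mix) == M_u + 1
```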

### 3.3. Simple Solution for Blind Dereverberation and Echo Cancellation

Semiblind ICA separates robot speech and its reverberations. However, the model ignores the reverberation of user speech, and therefore the separated user speech remains reverberant. On the other hand, the MINT, or multiple inputs, model has the inverse transfer function for user speech. If we can solve the inverse problem of the MINT formulation, the dereverberated user speech will be obtained. This means that both the blind dereverberation and echo cancellation can be achieved by combining the semiblind ICA and the MINT-based model.

## 4. Efficient ICA-Based Separation

We first point out the bottleneck problem found in the naive combination of the semiblind ICA and the MINT-based model. Next, we describe how to solve the problem by using a property that holds in the MINT system and how the ICA separation model should be modified. Then we derive the algorithm for estimating the parameters in the ICA framework. Finally, we explain other solutions to and configurations for our new ICA problem.

### 4.1. Problem of Naive Combination of ICA to MINT System

#### 4.1.1. The Model

The naive model stacks the MINT-based observation over *N*_{w} + 1 frames together with the reference spectra into a single mixing system, where the identity block is an (*M*_{w,r} + 1) × (*M*_{w,r} + 1) unit matrix and the sizes of the user's and the robot's transfer matrices are *L*(*N*_{w} + 1) × (*M*_{w,u} + 1) and *L*(*N*_{w} + 1) × (*M*_{w,r} + 1), respectively. We assume *N*_{w} and *M*_{w,u} are set properly such that the whole mixing matrix is nonsingular.

Applying ICA to this model estimates all of the delayed source components, and we then extract the direct component, *s*_{w,u}[*t*], from them. The separation model is represented by the corresponding separation equation with one separation matrix for the observed spectra and one for the reference spectra.

#### 4.1.2. The Problem

If we apply a standard ICA to the MINT model, equation 4.3 is solved and an estimate of the stacked source vector is obtained. We refer to this naive process as the MINT-ICA process in the sequel. The number of independent components to estimate, that is, the dimensionality of the estimated vector, is obviously proportional to the reverberation time. This is caused by the estimation of all components other than the direct element, such as *s*_{w,u}[*t* − 1], *s*_{w,u}[*t* − 2], and so on.

This extra processing increases the computational cost of solving permutation and scaling and of estimating the separation matrices. The estimation cost in particular is quadratic in the reverberation time, *N*_{w}. This increased cost should be reduced from the viewpoint of actual applications, that is, real-time separation; a cost linear in *N*_{w} would be preferable. We must therefore reduce the number of components to be estimated in order to reduce the cost of calculating the separation matrices.

### 4.2. Independence Exchange Property in the MINT System

ICA based on MINT uses the temporarily estimated source signals in learning the separation matrix, as shown in equation 2.7, to evaluate the time independence among the estimated direct component and *s*_{w,u}[*t* − *i*], *i* = 1, …, *M*_{w,u}. This estimation increases the number of independent components to estimate because all elements of the delayed source vector are estimated. Replacing this independence condition with an equivalent one that involves fewer independent components reduces the computational cost.

To derive the efficient model, we use the following proposition. For readability, we define two vectors: the delayed source vector $\bar{\mathbf{s}}_{w,u}[t] = [s_{w,u}[t-d], \ldots, s_{w,u}[t-M_{w,u}]]^T$ and the delayed observation vector $\bar{\mathbf{x}}_{w}[t] = [\mathbf{x}_{w}[t-d]^T, \ldots, \mathbf{x}_{w}[t-N_{w}]^T]^T$.

**Proposition 1: Independence exchange properties of higher-order-statistics ICA.** *We assume that the sound signals, $s_{w,u}$, are time independent. In higher-order ICA, if the nonsingular condition in equation 4.2 holds, the following two conditions are equivalent:*

* *P1. $s_{w,u}[t]$ and $\bar{\mathbf{s}}_{w,u}[t]$ are mutually independent.*

* *P2. $s_{w,u}[t]$ and $\bar{\mathbf{x}}_{w}[t]$ are mutually independent.*

*Here, d > 0 is the initial delay interval, explained below.*

In the same way, we can easily prove that the property also holds when a reference signal exists. Since the proof is straightforward, we omit it. This proposition gives us two main advantages.

#### 4.2.1. Reduced Computational Cost

We can evaluate the time independence by using the delayed observed spectra instead of the estimates of the delayed source spectra. This exchange of the independence evaluation reduces the number of independent components to be estimated, which reduces the computational cost of estimating the filter and of solving permutation and scaling.

#### 4.2.2. Time Independence with Initial Delay Interval Parameter

We can consider the time independence of speech signals by using the initial delay interval parameter, *d*. For example, Figure 4 shows the average time independence of speech signals over all frequency bins after STFT analysis with a 64 ms Hanning window and an 8 ms shift. The metric for the independence is E[φ(*x*)*x*^{H}]: a smaller value means a higher degree of independence. Here, we use φ(*x*) = tanh(100|*x*|)*e*^{iθ(*x*)}, where θ(*x*) represents the phase of the complex number *x*. This function is the actual independence metric used in ICA (Sawada, Mukai, & Araki, 2003). The vertical axis represents the independence between *s*_{w}[*t*] and *s*_{w}[*t* − *i*] with *i* ≥ 0, and the horizontal axis represents the frame interval length, *i*. From the graph, we conclude that since the independence between the direct sound with *i* = 0 and its adjacent frames is not high enough, we should evaluate the independence with a certain interval *d*. Note that the size of the high-dependence frame interval depends on the speed of speech and that the interframe time independence varies with the type of sound source. For example, periodic or sustained sounds generally show less time independence. However, in natural speech interaction, since the speed of speech is almost constant and does not include sustained sounds, we can safely assume that sufficient time independence is obtained with a fixed frame interval *d*. Additionally, though the independence does not seem to be high, we have empirically confirmed that it is high enough for ICA to be applied to actual speech data.
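The decay of this dependence metric with the frame interval can be reproduced in a small experiment (the AR(1) surrogate for interframe-correlated spectra is an assumption for illustration, not the letter's speech data):

```python
import numpy as np

def phi(x):
    # Independence-score nonlinearity (Sawada, Mukai, & Araki, 2003).
    return np.tanh(100 * np.abs(x)) * np.exp(1j * np.angle(x))

def dependence(x, lag):
    """|E[phi(x[t]) x[t - lag]^H]|: smaller means more independent."""
    return np.abs(np.mean(phi(x[lag:]) * np.conj(x[:len(x) - lag])))

# AR(1) surrogate: successive frames are correlated, as in reverberant
# spectra, so the dependence decays as the frame interval grows,
# motivating a nonzero initial delay interval d.
rng = np.random.default_rng(4)
noise = rng.normal(size=20000) + 1j * rng.normal(size=20000)
x = np.zeros(20000, dtype=complex)
for t in range(1, 20000):
    x[t] = 0.9 * x[t - 1] + noise[t]
```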

### 4.3. New Model and Estimation of Filter Based on ICA

#### 4.3.1. The Model

Ideally, the sizes of the separated vector and the separation matrices correspond to the number of sound sources. If we know in advance, for example, that there is only one sound source, we can shrink the model in equation 4.13 accordingly. In this case, we can theoretically extract the correct direct sound signal because equation 4.13 obviously becomes invertible, and the computational cost of estimating the parameters is reduced. However, this is not the case in general, since it is usually difficult to estimate the number of sound sources in a reverberant environment. Consequently, we have to use all of the microphones; that is, the dimensionality of the separated vector is *L*, and the separation matrices become *L* × *L* matrices. In such a case, the separated sounds include the direct sound signal *s*_{w,u}[*t*] and some reflected sound signals. The way to extract the direct sound signal from them is discussed in the following sections.

#### 4.3.2. Estimation Algorithm

We can derive the update procedures for equation 4.13 by using a standard learning algorithm of ICA. To obtain the optimum separation matrices, we minimize cost function *J* on the basis of KLD. Since the derivation of the algorithm is shown in the appendix, we show only the result here.

We use φ(*x*) = tanh(100|*x*|)*e*^{iθ(*x*)} as the nonlinear function (Sawada et al., 2003). Equation 4.17 is used to estimate the blind source separation filter. Equations 4.18 and 4.19 are used to estimate the blind dereverberation filter and the so-called echo cancellation filter, respectively. Even when multiple sources are observed, we can use these rules because the independence exchange also holds for them, and we can use the learning algorithms for their separation without any modification to the equations.

The derived algorithms for estimating the separation matrices form a natural extension of FD-ICA. The learning rule for the blind source separation filter is the same as that of standard FD-ICA; our algorithm adds a dereverberation filter and an echo cancellation filter. The algorithm is clearly different from algorithms based on the convolutive mixing model (Hyvarinen et al., 2001; Douglas et al., 2005; Kokkinakis & Nandi, 2006; Hiroe, 2007) because we use the independence exchange property with the instantaneous mixing model. This reduces computational costs and simplifies the algorithm.
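To clarify the roles of the three filters, one plausible shape of the extended separation step is sketched below. This is an illustrative assumption about the model structure only; the letter's actual update rules are equations 4.17 to 4.19, which this sketch does not reproduce:

```python
import numpy as np

def separate_frame(x_hist, ref_hist, W, D, F, d=2):
    """Sketch of one separated frame (structure assumed, not eqs. 4.17-4.19):

    y[t] = W x[t] + sum_k D[k] x[t - d - k] + sum_k F[k] s_r[t - k]

    x_hist[k]   : observation vector x[t - k] (L-dimensional)
    ref_hist[k] : known reference sample s_r[t - k]
    W : blind source separation matrix; D : dereverberation filters,
    applied from the initial delay d; F : echo cancellation filters.
    """
    y = W @ x_hist[0]
    for k, Dk in enumerate(D):
        y = y + Dk @ x_hist[d + k]   # late-reverberation term (learned to cancel)
    for k, Fk in enumerate(F):
        y = y + Fk * ref_hist[k]     # known-reference term (learned to cancel)
    return y

# With zero dereverberation and echo filters, the model reduces to
# plain instantaneous FD-ICA separation.
W = np.eye(2)
x_hist = [np.array([1.0, 2.0])] * 5
y = separate_frame(x_hist, [0.0], W, [np.zeros((2, 2))], [np.zeros(2)])
```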

### 4.4. Spatial Sphering for Preprocessing

Sphering, or whitening, is widely used preprocessing for standard ICA that accelerates the convergence of the separation matrix. It is a linear transformation that decorrelates the input signals and normalizes their variances (Hyvarinen et al., 2001). However, full sphering increases the computational cost because it needs an eigenvalue or singular value decomposition of the correlation matrix of the whole input signal vector. Even if we use the Levinson-Durbin algorithm for linear-prediction-based prewhitening, the computational cost is quadratic in the reverberation time, *N*_{w}. Therefore, we execute only spatial sphering and reference signal normalization. This spatial sphering can be executed in both batch and block-wise processing.
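A minimal sketch of spatial sphering via the eigenvalue decomposition of the *L* × *L* spatial correlation matrix (the signal sizes and mixing matrix are toy assumptions):

```python
import numpy as np

def spatial_sphering(x):
    """Sphere only across the L microphone channels (x: L x T complex).

    Needs the eigendecomposition of the L x L spatial correlation
    matrix only, so the cost is O(L^3), independent of the
    reverberation time.
    """
    R = x @ x.conj().T / x.shape[1]      # spatial correlation matrix
    eigval, E = np.linalg.eigh(R)        # R is Hermitian
    V = np.diag(eigval ** -0.5) @ E.conj().T
    return V @ x, V

# Toy mixture of two complex signals; after sphering, E[z z^H] = I.
rng = np.random.default_rng(5)
s = rng.normal(size=(2, 8000)) + 1j * rng.normal(size=(2, 8000))
A = np.array([[1.0, 0.7], [0.3, 1.0]], dtype=complex)
z, V = spatial_sphering(A @ s)
```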

The spatial sphering transform is computed from the eigenvalue decomposition of the spatial correlation matrix of the microphone inputs, where the eigenvector matrix is an *L* × *L* unitary matrix, the eigenvalue matrix is an *L* × *L* diagonal matrix, and λ_{w,r} is the variance of the known (reference) signal, used for its normalization.

The sphered observed spectra and the normalized reference spectra replace the observed vector and *s*_{w,r}[*t*] in equations 4.13 to 4.19. The independence exchange still holds here because this transformation is nonsingular. In fact, the whole transformation can be described as a block diagonal matrix of size *L*(*N*_{w} + 2) × *L*(*N*_{w} + 2), whose reference block, diag(λ_{w,r}^{−1/2}, …, λ_{w,r}^{−1/2}), is an (*M*_{w,r} + 1) × (*M*_{w,r} + 1) diagonal matrix. The signal flow with our method is outlined in Figure 5.

This eigenvalue decomposition of the microphone correlation matrix is also used in a sound-localization technique called MUSIC (multiple signal classification) (Schmidt, 1986). If we integrate the sound source separation with MUSIC, we can efficiently reuse the eigenvalue decomposition result.

### 4.5. Solution to Scaling and Permutation Problems

#### 4.5.1. Scaling

The scale is recovered from the inverse of the estimated separation matrix: the *i*th-row, *j*th-column element of the inverse gives the scale, *c*_{w,j}, which is multiplied by the estimated *j*th element of the separated vector.

#### 4.5.2. Permutation

The permutation ambiguity arises independently at each frequency bin *w*. We solve the permutation problem by using the average power of the separated signals. If the separated signals include direct and reflected sounds, the power of the direct sound is the strongest among them; hence, the signal with the maximum power is selected as the direct sound. Note that this criterion will not work when other sound sources exist. Since such a situation is beyond the scope of this letter, we do not discuss it.
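The maximum-power selection rule can be sketched as follows (the array layout is an assumption for illustration):

```python
import numpy as np

def pick_direct(y):
    """Choose the maximum-average-power channel per frequency bin.

    y : separated spectra, shape (bins, channels, frames).
    """
    power = np.mean(np.abs(y) ** 2, axis=2)   # average power per channel
    return np.argmax(power, axis=1)

# Toy case: in every bin, channel 1 carries the strong direct sound
# and channel 0 only a weak reflection.
rng = np.random.default_rng(6)
y = rng.normal(size=(4, 2, 100)) + 1j * rng.normal(size=(4, 2, 100))
y[:, 1, :] *= 5.0
idx = pick_direct(y)
```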

### 4.6. Other Configurations

#### 4.6.1. Initial Value of Separation Matrix

The initial value of the separation matrix is critical because the learning rules of the dereverberation and echo cancellation filters are affected by the estimate of the blind source separation filter. There are several techniques for finding an appropriate initial value (Araki et al., 2005; Saruwatari, Kawamura, Nishikawa, Lee, & Shikano, 2006). However, they require a geometrical model from the microphones to the sound source, and obtaining such a model may be difficult if the microphones are installed on a robot.

The initial value of the separation matrix at frequency bin *w* is set to the estimated matrix at frequency bin *w* + 1, and then all rows of the matrix are normalized. We use the unit matrix as the initial value for the first separation matrix. Since reverberation at high frequencies decays faster than at low frequencies, we start by estimating the separation matrix at the highest-frequency bin and then move to lower-frequency bins. Thus, we can maintain the overall accuracy of the separation matrix estimation with regard to the initial value configuration. This initialization works efficiently and effectively in practice.
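The high-to-low frequency initialization can be sketched as follows (the per-bin learning step is replaced by a placeholder; bin count and matrix size are toy assumptions):

```python
import numpy as np

def row_normalize(W):
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def estimate_all_bins(learn, n_bins, L):
    """Estimate one separation matrix per bin, highest frequency first.

    `learn(w, W0)` stands in for the per-bin ICA iterations; each bin
    is initialized from the row-normalized estimate of the neighboring
    higher-frequency bin, starting from the unit matrix.
    """
    W = [None] * n_bins
    prev = np.eye(L, dtype=complex)
    for w in reversed(range(n_bins)):   # highest-frequency bin first
        W[w] = learn(w, row_normalize(prev))
        prev = W[w]
    return W

# Placeholder "learning" that returns the initial value unchanged.
Ws = estimate_all_bins(lambda w, W0: W0.copy(), 8, 2)
```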

#### 4.6.2. Step-Size Scheduling

The step size, μ_{k}, of the separation matrix update at the *j*th iteration for the *k*th delayed frame component, corresponding to *x*_{w,*}[*t* − *k*] and *s*_{w,r}[*t* − *k*], is defined by equation 4.30, where α, β, and λ are constant values.

### 4.7. Comparison of Computational Costs

The theoretical computational costs with our method and with simple MINT-ICA are summarized in Table 1. Here, *L* is the number of microphones, *M*_{w,r} is the size (or frame length) of , and *N _{w}* is the size (or frame length) of .

| Operation | Naive MINT + ICA (MINT-ICA) | Our Method |
|---|---|---|
| Sphering | O((LN_{w} + M_{w,r})^{3}) | O(L^{3}) |
| ICA iteration | O(L^{2}N_{w}^{2} + LN_{w}M_{w,r}) | O(L^{2}N_{w} + LM_{w,r}) |
| Scaling | O((LN_{w} + M_{w,r})^{3}) | O(L^{3}) |
| Permutation | O(LN_{w}) | O(L) |


We focus on the order of *N*_{w} because *N*_{w} needs to be as large as possible to cope with long reverberations, and this enlargement greatly affects the computational cost. The filter length of the reference signal, *M*_{w,r}, also increases the cost linearly with the reverberation time. Meanwhile, the number of microphones, *L*, is independent of such environmental conditions.

Simple MINT-ICA requires the third order of *N*_{w} for prewhitening, the second order for estimating the separation matrix, and the third order for scaling because of the matrix inversion. The filters are estimated iteratively, which critically affects the computational cost.
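To make the growth rates concrete, the following sketch compares how the per-bin cost scales as *N* grows, using only the dominant terms from Table 1 (illustrative code with our own function names; constant factors are ignored, and *M*_{w,r} = *N* as in the experiments):

```python
def cost_naive(N, L=2, M=None):
    """Dominant per-bin cost terms of naive MINT-ICA (Table 1)."""
    M = N if M is None else M             # experiments use M_{w,r} = N
    sphering = (L * N + M) ** 3           # O((L N_w + M_{w,r})^3)
    iteration = L**2 * N**2 + L * N * M   # O(L^2 N_w^2 + L N_w M_{w,r})
    return sphering + iteration

def cost_proposed(N, L=2, M=None):
    """Dominant per-bin cost terms of the proposed method (Table 1)."""
    M = N if M is None else M
    sphering = L**3                       # O(L^3), independent of N
    iteration = L**2 * N + L * M          # O(L^2 N_w + L M_{w,r})
    return sphering + iteration
```

Doubling *N* roughly doubles the proposed method's cost, while the cubic sphering term multiplies the naive cost by about eight.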

## 5. Evaluation

We evaluate the performance of our method by comparing it with that of a conventional method in two different environments, using the word correctness of ASR as the metric. We first explain the experimental settings and then present the evaluation criteria and results.

### 5.1. Experimental Settings

We conducted the experiments in two rooms: a normal room and a hall-like room. The room layouts are shown in Figure 6. The normal room is 4.2 × 7.0 m, and the hall-like room is 7.55 × 9.55 m. We used the microphone array embedded in a humanoid robot developed by HONDA.

#### 5.1.1. Recording Conditions and Test Set Data

The impulse responses for the user's speech data were recorded at 16 kHz in both rooms. The reverberation time, RT_{20}, was 240 ms in the normal room and 670 ms in the hall-like room. A loudspeaker, 1.2 m high, was located 1.5 m away from the two microphones installed on the humanoid's head. Impulse responses were recorded from five directions: 0, 45, 90, −45, and −90 degrees from the front direction of the robot. The impulse responses for the robot's speech data were also recorded, by using a loudspeaker embedded in the humanoid's head. All data (16-bit PCM) were normalized to [−1.0, 1.0]. These conditions are summarized in Table 2.

| Impulse Response | 16-kHz Sampling |
|---|---|
| Reverberation time (RT_{20}) | 240 and 670 ms |
| Distance and direction | 1.5 m and 0°, 45°, 90°, −45°, −90° |
| Number of microphones | Two (mounted on the robot's head) |
| STFT analysis | Hamming: 64 ms; shift: 20 ms |
| Input wave data | [−1.0, 1.0] normalized |


We used 200 Japanese sentences for the user's speech; they were convolved with the corresponding recorded impulse responses. The robot's speech data were 200 sentences spoken by a male. We mixed user and robot speech signals of the same length. The duration of the target data ranged from 1 to 10 s.

#### 5.1.2. Parameters for ASR and Separation

The recognizer Julius (Lee, Kawahara, & Shikano, 2001) was used for hidden Markov model (HMM)-based ASR with a statistical language model. For the speech features, mel-frequency cepstral coefficients (MFCC) (12 + Δ12 + ΔPow) were obtained after STFT with a window size of 512 points and a shift size of 160 points, followed by cepstral mean normalization (CMN) (Acero & Stern, 1992). Note that we extracted the MFCC from the time-domain signal resynthesized from the separated spectrum. A triphone-based acoustic model (three states and four mixtures) was trained with clean speech uttered by 200 male and female speakers (150 sentences each; word closed). The statistical language model consisted of 21,000 words extracted from newspapers. The other experimental conditions are summarized in Table 3.

| Test Set | 200 Sentences |
|---|---|
| Training set | 30,000 sentences (200 people; 150 sentences each) |
| Acoustic model | PTM-triphone: 3-state, HMM |
| Language model | Statistical, vocabulary size of 21k |
| Speech analysis | Hamming: 32 ms; shift: 10 ms |
| Features | MFCC: 25 dim. (12 + Δ12 + ΔPow) |

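As a small illustration of the CMN step mentioned above (hypothetical code, not the authors' implementation), cepstral mean normalization simply subtracts the per-coefficient temporal mean from the features:

```python
import numpy as np

def cepstral_mean_normalization(mfcc):
    """CMN: subtract the temporal mean of each cepstral coefficient,
    removing stationary channel effects from the utterance.

    mfcc: (n_frames, n_coeffs) feature matrix.
    """
    return mfcc - np.mean(mfcc, axis=0, keepdims=True)
```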

As the STFT parameters for the sound separation, the window size was 1,024 points (64 ms), which is known to be suboptimal (Araki et al., 2003), and the shift size was 320 points (20 ms). The tap lengths of the observed-signal vector and the robot's speech spectrum, *s*_{w,r}[*t*], were set to the same value, *N*_{w} = *M*_{w,r} = *N*, over all frequency bins, *w*. The step-size parameters were λ = 0.9, α = 6.0 × 10^{−1}, and β = 5.0 × 10^{−3} for batch processing. We fixed the maximum number of iterations for estimating the matrices to 20 because the time for separation is usually restricted in practical use; with more iterations, the performance may improve slightly. We empirically set the initial delayed frame value, *d*, to 2. For the permutation resolution, we assumed only one speech signal.

### 5.2. Experiments

We conducted three experiments:

- Experiment 1:
Comparison in terms of processing time, word correctness (WC), and SNR (signal-to-noise ratio) of our method and MINT-ICA

- Experiment 2:
Comparison in terms of WC and SNR of our method, semiblind ICA, and cascade processing

- Experiment 3:
WC of our method under several conditions

“Cascade processing” means a sequential process of echo cancellation followed by dereverberation: we first estimated only the echo cancellation filter and then estimated the dereverberation filter. In estimating the echo cancellation filter, we assumed that the separation and dereverberation filters in equation 4.14 were fixed and estimated it by using equation 4.19. Next, with the estimated echo cancellation filter fixed, the separation and dereverberation filters were estimated. The step-size scheduling rule in equation 4.30 is used in common for all three filters. In this experiment, we show the effectiveness of the simultaneous estimation of the separation, dereverberation, and echo cancellation filters in terms of WC and SNR. The environments for these experiments are the same.

Here, *s*_{w}[*t*] denotes the clean original speech spectrum, and ŝ_{w}[*t*] denotes the estimated speech spectrum. The scaling parameter η was defined to maximize the SNR. The SNR represents the degree of noise contamination of the separated speech signal.
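The SNR definition itself is not reproduced in this excerpt; a standard form consistent with the description above (an assumed reconstruction, not the letter's equation) is

```latex
\mathrm{SNR} = \max_{\eta}\; 10 \log_{10}
\frac{\sum_{t} \lvert s_{w}[t] \rvert^{2}}
     {\sum_{t} \lvert s_{w}[t] - \eta\,\hat{s}_{w}[t] \rvert^{2}} ,
```

where the maximization over η removes the arbitrary scale of the separated signal before measuring the residual noise.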

#### 5.2.1. Experiment 1: Comparison in Terms of Processing Time, WC, and SNR with MINT-ICA

In this experiment, we compared our method and MINT-ICA (experiment 1-1) in terms of the computational cost and examined the dereverberation performance of both methods in terms of WC and SNR (respectively, experiments 1-2a and 1-2b). We evaluated four patterns:

- 1.
MINT-ICA without sphering and with only dereverberation (Derev.)

- 2.
MINT-ICA with sphering and with only dereverberation (Derev.)

- 3.
Our method with only dereverberation (Derev.)

- 4.
Our method with dereverberation and echo cancellation (Derev. + E.C.)

The data for the processing-time evaluation were obtained from the front of the humanoid in the hall-like room. The total duration of the speech data was 1197 s for dereverberation and 1311 s for dereverberation and echo cancellation. The computer had an Intel Pentium D CPU with a clock speed of 3.20 GHz and 2 GB of memory, running Red Hat Enterprise Linux WS release 3. The program was implemented without using a numerical library such as BLAS or LAPACK. The data for the dereverberation-performance evaluation were obtained in both rooms and included only the target speaker's speech.

We also evaluated the real-time factor (RTF) for batch processing in all experiments. The RTF was calculated using *P*/*I*, where *P* is the processing time and *I* is the data amount in time (duration). Because the processing time did not include the time for buffering the data, there was a constant delay for this real-time processing. It can be eliminated by a method described elsewhere (Saruwatari et al., 2005).
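The RTF computation is a one-liner; a trivial sketch (our own function name):

```python
def real_time_factor(processing_time_s, data_duration_s):
    """RTF = P / I: processing time divided by the duration of the
    processed data. RTF < 1.0 means the method keeps up with real
    time, ignoring the constant buffering delay noted in the text."""
    return processing_time_s / data_duration_s
```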

#### 5.2.2. Experiment 2: Comparison in Terms of WC and SNR with Semiblind ICA and Cascade Processing

In this experiment, we compared the ability of our method to handle reverberation with those of conventional semiblind ICA and the cascade processing. Performance was evaluated with regard to the number of frames, *N*.

The data set was a mixture of user and robot utterances in both rooms. To show the upper-limit performance, separation was done by batch processing, not block-wise processing. Since semiblind ICA does not assume the use of multiple microphones, we evaluated it with only one microphone. The parameter settings for the cascade processing were the same as for our method.

#### 5.2.3. Experiment 3: Evaluation Under Several Conditions

In this experiment, we examined the WC and SNR of our method for dereverberation only and for dereverberation with echo cancellation. The number of frames, *N*, was varied for each condition.

The data were obtained in both rooms. We used only the user utterance data to evaluate the dereverberation function (experiment 3-1) and the mixed data from user and robot utterance data to evaluate the dereverberation and echo cancellation function (experiment 3-2).

We changed the length of the observed signal and the size of the data set used to estimate the separation, dereverberation, and echo cancellation matrices, that is, we used 1, 2, and 3 s block-separated data and all data (batch). When not batch processing, we set the exponential weight λ to 0.8. When estimating the separation matrices, we used the estimated matrices of the previous period as the initial values for the next period. We show the relationship among the data length for separation, the number of frames *N*, and WC.
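The block-wise scheme described above can be sketched as follows (illustrative code; `update` is a stand-in for the ICA estimation on one block, not the letter's algorithm):

```python
import numpy as np

def blockwise_estimate(blocks, update, n_ch):
    """Block-wise processing with warm starts: matrices are
    re-estimated on successive 1-3 s blocks, and each block starts
    from the matrices estimated in the previous block."""
    W = np.eye(n_ch, dtype=complex)    # initial value for the first block
    estimates = []
    for block in blocks:
        W = update(block, W)           # warm-started estimation
        estimates.append(W)
    return estimates
```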

### 5.3. Results

#### 5.3.1. Experiment 1: Comparison in Terms of Processing Time, WC, and SNR with MINT-ICA

Figure 7 plots the results for experiment 1. In the panel for experiment 1-1, the horizontal axis represents the number of frames, *N*, of the observed-signal vector and the robot's speech spectrum, *s*_{w,r}[*t*], and the vertical axis represents the real-time factor. In the panels for experiments 1-2a and 1-2b, the vertical axis represents WC and SNR, respectively, averaged over the five positions of the speaker. “No proc.” represents the results without any processing. The WC of clean speech, that is, speech without noise, is about 90%.

With our method, the RTF increased in proportion to the number of frames, both for dereverberation only and for the combination of dereverberation and echo cancellation; for example, the method can cope with the reverberation when *N* = 20 under these experimental conditions. The RTF of MINT-ICA increases polynomially with *N* and exceeds 1.0 at *N* = 6. Since the cost of the spatial-sphering preprocessing also grows with *N*, MINT-ICA is not suitable for real-time processing.

MINT-ICA does not perform well in terms of WC and SNR because it needs a large number of estimated independent components, resulting in a permutation error in equation 4.29. On the other hand, our method works well even if we use a long filter length.

#### 5.3.2. Experiment 2: Comparison in Terms of WC and SNR with Semiblind ICA and Cascade Processing

The plots on the left side of Figure 8 show the results for the RT_{20} = 240 ms room, and those on the right side show them for the RT_{20} = 670 ms room. The horizontal axis represents *N*, and the vertical axis represents the average WC (experiment 2a) and SNR (experiment 2b) over the five positions of the speaker. We can see that even when SNR is high, WC is not necessarily high.

Since conventional semiblind ICAs cannot remove the reverberations of user utterances, their WC and SNR do not improve, especially in the more strongly reverberant environment. In contrast, our method separates both the reverberation and the robot speech, and therefore WC improves in both environments. For example, WC improves by 10 points and SNR by about 1 dB in the 240 ms reverberant environment, and WC improves by 31 points and SNR by 3 dB in the 670 ms environment. The optimal frame length, *N*, differs according to the reverberation time. The cascade processing does not perform as well as our method because the step-size parameter and other settings are the same as for our method; this means that with cascade processing, the parameters for echo cancellation and dereverberation must be adjusted independently. In addition, cascade processing may fail to attain global optimization because it can fall into local minima. Our method thus has the advantage of easier parameter tuning because our algorithm is derived from a single objective function.

#### 5.3.3. Experiment 3: Evaluation Under Several Conditions

In Figure 9, the results for the environments (RT_{20} = 240 and 670 ms) with only dereverberation are shown in the upper plots (experiments 3-1a and 3-1b) and those with both dereverberation and echo cancellation are shown in the lower plots (experiments 3-2a and 3-2b). The horizontal axis represents the number of frames, and the vertical axis represents the average WC and SNR over the five positions. The plots show the relationships for 1, 2, and 3 s data learning and batch processing. Tables 4 and 5 summarize the highest average WC and SNR for each condition.

The dereverberation-only results show that our method achieves blind dereverberation at the optimal number of frames. Although the optimal number of frames varies with the length of the data used for estimating the separation matrices, with 2 s of learning data the WC is about 5 points higher than without any processing in the weakly reverberant environment and about 40 points higher in the strongly reverberant one. When both reverberation and robot speech echoes are suppressed, the improvements are, for example, 40 points in the weaker reverberant environment and 29 points in the stronger one; the SNR improvements are 3.9 and 4.3 dB, respectively.

Performance improves as the length of the data used for learning increases, and there is an optimal frame length, *N*. For example, WC with 1 s learning is worse than with 3 s learning; this is particularly noticeable in barge-in situations with mixed user and robot speech. Moreover, with 1 s learning, WC worsens as the frame length becomes longer.

The separated signals and the spectra of separated speech are shown in Figure 10: Figure 10A shows the ground truth, Figures 10B and 10D show the observed signals, and Figures 10C and 10E show the results separated by our method. Clearly, the reverberation and the robot speech have been removed in Figure 10E. The differences between Figures 10A and 10C are caused by reverberation within the frame of the directly arriving signal and by insufficient separation.

| | Environment | Only User Speech (No Processing) | Dereverberation, 1 s | 2 s | 3 s | Batch |
|---|---|---|---|---|---|---|
| WC (%) | RT_{20}: 240 ms | 74.3 | 77.7 | 81.4 | 83.3 | 84.2 |
| | RT_{20}: 670 ms | 26.1 | 64.1 | 68.0 | 70.6 | 72.9 |
| SNR (dB) | RT_{20}: 240 ms | 8.3 | 9.6 | 9.9 | 10.2 | 10.3 |
| | RT_{20}: 670 ms | 6.1 | 9.3 | 10.2 | 10.5 | 10.8 |


| | Environment | User and Robot Speech (No Processing) | Dereverberation + Echo Cancel, 1 s | 2 s | 3 s | Batch |
|---|---|---|---|---|---|---|
| WC (%) | RT_{20}: 240 ms | 28.2 | 60.9 | 69.0 | 72.0 | 73.2 |
| | RT_{20}: 670 ms | 11.0 | 33.2 | 40.8 | 41.5 | 50.0 |
| SNR (dB) | RT_{20}: 240 ms | 4.8 | 7.8 | 8.8 | 9.0 | 9.2 |
| | RT_{20}: 670 ms | 3.7 | 6.6 | 8.1 | 8.5 | 9.0 |


### 5.4. Discussion and Future Work

Our experiments demonstrate that our method efficiently and effectively separates both reverberation and robot speech and that it reduces the computational cost compared with the semiblind ICA and simple MINT-ICA methods. However, the comparison of block-wise processing with batch processing shows that we still need to improve its separation performance and stability when only a small number of samples is available for filter estimation.

First, we need to compensate for the lack of information caused by using only a few samples in the filter estimation. The performance of our method degrades with few samples because ICA is based on higher-order statistics and therefore needs sufficient samples for filter estimation. This lack of samples can be compensated for by using additional information acquired from another aspect of the environment or the speakers. For example, the location of the sound sources is useful and can be obtained using sound localization techniques such as MUSIC (Schmidt, 1986) or visual localization techniques (Asano et al., 2004). Since our method estimates separation matrices that divide into a blind source separation filter and a blind dereverberation filter, we can use the sound location information as an initial value for the blind source separation filter. Since we still estimate the dereverberation filter blindly, such information does not devalue our method.

Second, we need to determine the number of iterations intelligently when estimating the filters, to reduce the computation time further. In our experiments, we set the maximum number of iterations to 20 so that the RTF remained less than 1.0. Better scheduling should shorten the processing time when more microphones are used, because the computational cost is proportional to the second order of the number of microphones, *L*. This can be achieved with optimum step-size methods such as that of Nakajima, Nakadai, Hasegawa, and Tsujino (2008), in which the direction of the gradient is modified by a natural gradient, as in Newton's method, whereas we adjust only the scale of the gradient. There has been little research on the step size of ICA at a low computational cost for actual applications; consequently, step-size scheduling is one of the most serious problems in real-time processing. We also need to optimize the exponential parameter, λ.

Finally, to improve the ASR performance, we must manage reverberations that are not separated. The effect of residual reverberation can be overcome by adaptation or multicondition training of ASR acoustic models, missing-feature techniques (Raj & Stern, 2005), and other techniques such as Wiener filtering. Since such techniques will effectively improve the ASR performance, the next issue will be how to integrate them with our method. We will reconsider useful conventional methods and refine them to construct an efficient system that works in actual environments.

## 6. Conclusion

We tackled the problem of separating a robot's (or system's) own speech and reducing reverberation with the least amount of a priori information. This problem is general and important in human-machine interaction because microphones capture at least these two types of sound, and the system cannot predict how the reverberation affects the target speech signal. From the viewpoint of computational auditory scene analysis (CASA), we developed a method that effectively overcomes the reverberation problem.

We dealt with the reverberation of user speech by extending conventional semiblind independent component analysis (ICA). This was achieved by introducing a separation model based on the independence of delayed observed signals with the multiple input/output inverse-filtering theorem (MINT), together with spatial sphering for preprocessing. Our method overcomes the dereverberation and echo cancellation problems simultaneously at a low computational cost that is proportional to the linear order of the reverberation time. Experimental results in several environments demonstrate the performance and efficiency of our method in terms of the word correctness of ASR and the SNR criterion.

Further improvements will come from optimizing the step-size parameters and using a priori information about the sound sources. Because reverberation is not suppressed enough for current ASR systems, integration with existing ASR methods (e.g., spectral subtraction or missing-data techniques) is necessary. Since the reverberation filter in particular reflects the properties of reverberation in the environment, those properties can be used to adapt the acoustic model of ASR to the environment. We intend to analyze the properties of reverberation filters and integrate them with other methods.

## Appendix: Derivation of Estimation Algorithm

We define the objective function, *J*, on the basis of the Kullback-Leibler divergence, where *q* means the product of the marginal PDFs of the separated signals, the two entropy terms are defined as the differential entropy and the joint differential entropy, respectively, and E[·] means the expectation operator over frame index *t*.
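Written out, a Kullback-Leibler objective of this form (a standard reconstruction from the entropies named above, not a verbatim copy of the letter's equation) is

```latex
J = \mathrm{KL}\!\left( p(\mathbf{y}) \,\middle\|\, \prod_i p(y_i) \right)
  = -H(\mathbf{y}) + \sum_i H(y_i),
```

so that minimizing *J* over the separation filters drives the outputs toward mutual independence.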

We use tanh(|*x*|)*e*^{jθ(*x*)} as the nonlinear function, φ(*x*) (Sawada et al., 2003). Equation 4.17 is used to estimate the blind source separation filter, and equations 4.18 and 4.19, in order, are used to estimate the blind dereverberation filter and the echo cancellation filter. When multiple sound sources exist, we can still use these rules, because the independence relation also holds for them, and the above algorithm applies without any changes except for the permutation problem of ICA.

## Acknowledgments

We thank Takuma Ohtsuka for his valuable comments. This research was partially supported by a grant-in-aid for scientific research (S), JSPS fellows, and a Global COE Program.