## Abstract

This letter proposes a multichannel source separation technique, the multichannel variational autoencoder (MVAE) method, which uses a conditional VAE (CVAE) to model and estimate the power spectrograms of the sources in a mixture. By training the CVAE using the spectrograms of training examples with source-class labels, we can use the trained decoder distribution as a universal generative model capable of generating spectrograms conditioned on a specified class index. By treating the latent space variables and the class index as the unknown parameters of this generative model, we can develop a convergence-guaranteed algorithm for supervised determined source separation that consists of iteratively estimating the power spectrograms of the underlying sources, as well as the separation matrices. In experimental evaluations, our MVAE produced better separation performance than a baseline method.

## 1 Introduction

Blind source separation (BSS) is a technique for separating out individual source signals from microphone array inputs when the transfer characteristics between the sources and microphones are unknown. The frequency-domain BSS approach provides the flexibility of allowing us to utilize various models for the time-frequency representations of source signals and array responses. For example, independent vector analysis (IVA) (Kim, Eltoft, & Lee, 2006; Hiroe, 2006) allows us to efficiently solve frequency-wise source separation and permutation alignment in a joint manner by assuming that the magnitudes of the frequency components originating from the same source tend to vary coherently over time.

With a different approach, multichannel extensions of nonnegative matrix factorization (NMF) have attracted a lot of attention in recent years (Ozerov & Févotte, 2010; Kameoka, Yoshioka, Hamamura, Le Roux, & Kashino, 2010; Sawada, Kameoka, Araki, & Ueda, 2013; Kitamura, Ono, Sawada, Kameoka, & Saruwatari, 2016, 2017). NMF was originally applied to music transcription and monaural source separation tasks (Smaragdis, 2003; Févotte, Bertin, & Durrieu, 2009). The idea is to approximate the power (or magnitude) spectrogram of a mixture signal, interpreted as a nonnegative matrix, as a product of two nonnegative matrices. This amounts to assuming that the power spectrum of a mixture signal observed at each time frame can be approximated by a linear sum of a limited number of basis spectra scaled by time-varying amplitudes. Multichannel NMF (MNMF) is an extension of this approach to a multichannel case to allow the use of spatial information as an additional clue to separation. It can also be viewed as an extension of frequency-domain BSS that allows the use of spectral templates as a clue for jointly solving frequency-wise source separation and permutation alignment.
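The rank-limited factorization described above can be made concrete with a short NumPy sketch. This is a generic toy example, not the authors' implementation: it uses the simple Euclidean-distance multiplicative updates of Lee and Seung for readability, whereas MNMF and ILRMA are built on the Itakura-Saito divergence, and the matrix `V` merely stands in for a power spectrogram.

```python
import numpy as np

def nmf(V, K, n_iter=500, eps=1e-12):
    """Approximate a nonnegative matrix V (F x N) as W @ H using
    multiplicative updates for the Euclidean distance (Lee & Seung).
    Columns of W are basis spectra; rows of H are their activations."""
    rng = np.random.default_rng(0)
    F, N = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# toy "power spectrogram": an exactly rank-2 nonnegative matrix
rng = np.random.default_rng(1)
V = rng.random((64, 2)) @ rng.random((2, 100))
W, H = nmf(V, K=4)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err)  # small relative error, since V is genuinely low rank
```

Each frame (column) of `W @ H` is a nonnegative combination of the `K` basis spectra, which is exactly the modeling assumption the text describes.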

The original MNMF (Ozerov & Févotte, 2010) was formulated under a general problem setting where sources can outnumber microphones and a determined version of MNMF was subsequently proposed (Kameoka et al., 2010). While the determined version is applicable only to determined cases, it allows the implementation of a significantly faster algorithm than the general version. The determined MNMF framework was later called “independent low-rank matrix analysis (ILRMA)” (Kitamura et al., 2017). Kitamura et al. (2016) discussed the theoretical relation of MNMF to IVA, which has naturally allowed for the incorporation of the fast update rule of the separation matrix developed for IVA, called “iterative projection (IP)” (Ono, 2011), into the parameter optimization process in ILRMA. It has been shown that this has contributed not only to accelerating the entire optimization process but also to improving the separation performance. One important feature of ILRMA is that the log likelihood to be maximized is guaranteed to be nondecreasing at each iteration of the algorithm. However, one drawback is that it can fail to work for sources with spectrograms that do not comply with the NMF model.

As an alternative to the NMF model, some attempts have recently been made to use deep neural networks (DNNs) for modeling the spectrograms of sources for multichannel source separation (Nugraha, Liutkus, & Vincent, 2016; Mogami et al., 2018). The idea is to replace the process for estimating the power spectra of source signals in a source separation algorithm with the forward computations of pretrained DNNs. This can be viewed as a process of refining the estimates of the power spectra of the source signals at each iteration of the algorithm. While this approach is particularly appealing in that it can take advantage of the strong representation power of DNNs for estimating the power spectra of source signals, one weakness is that unlike ILRMA, the log likelihood is not guaranteed to be nondecreasing at each iteration of the algorithm.

To address the drawbacks of the methods mentioned above, we propose a multichannel source separation method using variational autoencoders (VAEs) (Kingma & Welling, 2014; Kingma, Rezende, Mohamed, & Welling, 2014) for source spectrogram modeling. It should be noted that a preprint paper on this work has already been made publicly available (Kameoka, Li, Inoue, & Makino, 2018). While there have recently been some attempts to apply VAEs to monaural and multichannel speech enhancement (Bando, Mimura, Itoyama, Yoshii, & Kawahara, 2018; Leglaive, Girin, & Horaud, 2018, 2019; Sekiguchi, Bando, Yoshii, & Kawahara, 2018), to the best of our knowledge, our work is the first to propose the application of VAEs to multichannel source separation.

## 2 Problem Formulation

## 3 Related Work

### 3.1 ILRMA

One important feature of ILRMA is that the log likelihood, equation 2.9, is nondecreasing at each iteration of the algorithm and is shown experimentally to converge quickly. However, one limitation is that since $v_j(f,n)$ is restricted to equation 2.10, it can fail to work for sources with spectrograms that do not follow equation 2.10. Figure 1 shows an example of the NMF model optimally fitted to a speech spectrogram. As can be seen from this example, there is still plenty of room for improvement in the model design.

### 3.2 DNN Approach

While this approach is noteworthy in that it can exploit the benefits of the representation power of DNNs for source power spectrum modeling, one drawback is that updating $v_j(f,n)$ in this way does not guarantee an increase in the log likelihood.

### 3.3 Source Separation Using Deep Generative Models

It is worth noting that there have been some attempts to apply deep generative models, including VAEs (Kingma & Welling, 2014; Kingma et al., 2014), and generative adversarial networks (GANs; Goodfellow et al., 2014) to monaural speech enhancement and source separation (Bando et al., 2018; Subakan & Smaragdis, 2018; Leglaive et al., 2018). As far as we know, their applications to multichannel source separation had yet to be proposed when our preprint paper on this work (Kameoka et al., 2018) was first made publicly available. Recently, it has been brought to our attention that several papers on applications of VAEs to multichannel speech enhancement have subsequently been published by different authors (Sekiguchi, Bando, Yoshii, & Kawahara, 2018; Leglaive et al., 2019). These methods are designed to enhance the speech of a particular speaker by using a VAE to model the spectrogram of that speaker. Hence, one limitation of these methods is that we must know which speaker is present in a test mixture.

## 4 Proposed Method

To address the limitations and drawbacks of the conventional methods, this letter proposes a multichannel source separation method using CVAEs for source spectrogram modeling. We briefly review the idea behind the VAEs and CVAEs in section 4.1 and present the proposed source separation algorithm in section 4.2, which we call the multichannel CVAE (MCVAE) or, more simply, the multichannel VAE (MVAE).

### 4.1 Variational Autoencoder

One notable feature of CVAEs is that they are able to learn a “disentangled” latent representation underlying the data of interest. For example, when a CVAE is trained using the MNIST data set of handwritten digits and $c$ as the digit class label, $z$ and $c$ are disentangled so that $z$ represents the factors of variation corresponding to handwriting styles. We can thus generate images of a desired digit with random handwriting styles from the trained decoder by specifying $c$ and randomly sampling $z$. Analogously, we would be able to obtain a generative model that can represent the spectrograms of a variety of sound sources if we could train a CVAE using class-labeled training examples.
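The encoder-decoder structure and the conditioning mechanism just described can be sketched with an untrained toy network. The sketch below uses random weights and illustrative dimensions (a 784-dimensional input as in MNIST and a 10-class one-hot label are our assumptions); it shows only how $c$ is fed to both the encoder and the decoder and how $z$ is drawn via the reparameterization trick, not the actual architecture of section 4.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Random weight matrix and zero bias for an untrained linear layer."""
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

# illustrative dimensions: data x, one-hot label c, latent z, hidden units
D_x, D_c, D_z, D_h = 784, 10, 16, 128

We, be = dense(D_x + D_c, D_h)          # encoder hidden layer
Wmu, bmu = dense(D_h, D_z)              # latent mean head
Wlv, blv = dense(D_h, D_z)              # latent log-variance head
Wd, bd = dense(D_z + D_c, D_h)          # decoder hidden layer
Wo, bo = dense(D_h, D_x)                # decoder output layer

def encode(x, c):
    """q(z | x, c): the class label is concatenated to the data."""
    h = relu(np.concatenate([x, c]) @ We + be)
    return h @ Wmu + bmu, h @ Wlv + blv

def reparameterize(mu, logvar):
    """z = mu + sigma * eps, keeping sampling differentiable in training."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def decode(z, c):
    """p(x | z, c): the same label also conditions the decoder."""
    h = relu(np.concatenate([z, c]) @ Wd + bd)
    return h @ Wo + bo

x = rng.standard_normal(D_x)
c = np.eye(D_c)[3]                      # one-hot label for class 3
mu, logvar = encode(x, c)
x_hat = decode(reparameterize(mu, logvar), c)
```

Because $c$ already carries the class identity, training pushes $z$ toward encoding only the residual, class-independent factors of variation; that is the disentanglement the text refers to.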

### 4.2 Multichannel VAE

Let $\tilde{S} = \{s(f,n)\}_{f,n}$ be the complex spectrogram of a particular sound source and $c$ be the class label of that source. Here, we assume that a class label comprises one or more categories, each consisting of multiple classes. We thus represent $c$ as a concatenation of one-hot vectors, each of which is filled with 1 at the index of a class in a certain category and with 0 everywhere else. For example, if we consider speaker identities as the only class category, $c$ will be represented as a single one-hot vector, where each element is associated with a different speaker.
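This label construction can be illustrated concretely as follows. The second category ("gender") is a hypothetical addition of ours purely for illustration; the experiments in section 5 use speaker identity as the only category.

```python
import numpy as np

# hypothetical label categories (the "gender" category is illustrative only)
CATEGORIES = {
    "speaker": ["SF1", "SF2", "SM1", "SM2"],
    "gender": ["female", "male"],
}

def make_label(**choices):
    """Build c as a concatenation of one one-hot vector per category."""
    parts = []
    for name, classes in CATEGORIES.items():
        v = np.zeros(len(classes))
        v[classes.index(choices[name])] = 1.0
        parts.append(v)
    return np.concatenate(parts)

c = make_label(speaker="SM1", gender="male")
# speaker one-hot [0, 0, 1, 0] followed by gender one-hot [0, 1]
```

With speaker identity as the only category, this reduces to a single one-hot vector, exactly as in the example in the text.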

The trained decoder distribution $p_\theta(\tilde{S}|z,c,g)$ can be used as a universal generative model that is able to generate spectrograms of all the sources involved in the training examples where the latent space variable $z$, the auxiliary input $c$, and the global scale $g$ can be interpreted as the model parameters. According to the properties of CVAEs, we consider that the CVAE training promotes disentanglement between $z$ and $c$ where $z$ characterizes the factors of intraclass variation, whereas $c$ characterizes the factors of categorical variation that represent source identities. Estimating $c$ from a test mixture corresponds to identifying which source is present in the mixture. There are, however, certain cases where we know which sources are present prior to separation. Thanks to the conditional modeling, we can also use our model in such cases by simply fixing $c$ at a specified index. We call $p_\theta(\tilde{S}|z,c,g)$ the CVAE source model.

The proposed MVAE is noteworthy in that it offers the advantages of the conventional methods concurrently: (1) it takes full advantage of the strong representation power of DNNs for source power spectrogram modeling, (2) the log likelihood is guaranteed to be nondecreasing at each iteration of the source separation algorithm, and (3) the criteria for CVAE training and source separation are consistent, thanks to the consistency between the expressions of the CVAE source model and the LGM. Figure 3 shows an example of the CVAE source model fitted to the speech spectrogram shown in Figure 1. We can confirm from this example that the CVAE source model is able to approximate the speech spectrogram somewhat better than the NMF model.

It is interesting to look at the differences between our method and the recently proposed VAE-based multichannel speech enhancement methods (Sekiguchi et al., 2018; Leglaive et al., 2019). The methods proposed in Sekiguchi et al. (2018) and Leglaive et al. (2019) model the spectrogram of a particular source to be enhanced using a regular VAE and express the spectrograms of the other sources using the NMF model. This allows these methods to handle semisupervised scenarios where interference sources are unseen in the training set. However, one limitation is that the target source to be enhanced must be specified prior to separation. With our method, one limitation is that it can handle only supervised scenarios where audio samples of all the sources in a test mixture are included in the training set. However, if there is a sufficiently wide variety of sources in the training set, our method can be applied even without being informed about which of the sources in the training set are present in a test mixture. Our method can also be flexibly adapted to a scenario where we know which sources are present by simply specifying (instead of having it estimate) $c_j$, thanks to the conditional modeling. Another important feature of our model lies in its ability to capture the time-frequency interdependence in the STFT coefficients of each source thanks to the network design for the encoder and decoder, as presented in section 4.3.

### 4.3 Network Architectures

We propose designing the encoder and decoder networks using fully convolutional architectures to allow the encoder to take a spectrogram as an input and allow the decoder to output a spectrogram of the same length instead of a single-frame spectrum. This allows the networks to capture time dependencies in spectral sequences. Although RNN-based architectures are a natural choice for modeling time series data, RNNs are unsuited to parallel implementations, and so both the training and inference processes can be computationally demanding. Motivated by the recent success of sequential modeling using convolutional neural networks (CNNs) in the field of natural language processing (Dauphin, Fan, Auli, & Grangier, 2017) and the fact that CNNs are more suited to parallel implementations than RNNs, we use CNN-based architectures to design the encoder and decoder, as detailed below.
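A minimal NumPy sketch of one such gated convolutional layer (the gated linear unit of Dauphin et al., 2017) is given below. The channel counts, kernel size, and toy input are illustrative assumptions of ours, not the actual configuration in Figure 4; here the frequency bins of a spectrogram play the role of input channels and the convolution runs along the time axis.

```python
import numpy as np

def conv1d_same(x, w, b):
    """'Same'-padded 1-D convolution along the time axis.
    x: (C_in, N), w: (C_out, C_in, K) with odd kernel size K, b: (C_out,)."""
    C_out, C_in, K = w.shape
    N = x.shape[1]
    xp = np.pad(x, ((0, 0), (K // 2, K // 2)))
    out = np.empty((C_out, N))
    for t in range(N):
        out[:, t] = np.tensordot(w, xp[:, t:t + K], axes=([1, 2], [0, 1])) + b
    return out

def glu_conv(x, w_a, b_a, w_g, b_g):
    """Gated linear unit layer: conv(x) * sigmoid(conv(x)).
    The sigmoid branch acts as a learned, data-dependent gate."""
    a = conv1d_same(x, w_a, b_a)
    g = conv1d_same(x, w_g, b_g)
    return a * (1.0 / (1.0 + np.exp(-g)))

# toy spectrogram: 8 frequency channels, 20 frames; 16 output channels, kernel 3
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 20))
w_a, w_g = rng.standard_normal((2, 16, 8, 3)) * 0.1
b_a = np.zeros(16)
b_g = np.zeros(16)
y = glu_conv(x, w_a, b_a, w_g, b_g)
```

Because every output frame depends only on a fixed local window of input frames, all frames can be computed in parallel, which is the practical advantage over RNNs noted in the text.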

## 5 Experiments

### 5.1 Experimental Settings

To confirm the effect of the incorporation of the CVAE source model, we conducted experiments involving a supervised determined source separation task using speech mixtures. We excerpted speech utterances from the Voice Conversion Challenge (VCC) 2018 data set (Lorenzo-Trueba et al., 2018), which consists of recordings of six female and six male U.S. English speakers. Specifically, we used the utterances of two female speakers, SF1 and SF2, and two male speakers, SM1 and SM2, for CVAE training and source separation. We considered speaker identities as the only source class category. Thus, $c$ was a four-dimensional one-hot vector. The audio files for each speaker were manually segmented into 116 short sentences (about 7 minutes in total), of which 81 and 35 sentences (about 5 and 2 minutes, respectively) were provided as training and evaluation sets, respectively.

We used two-channel recordings of two sources as the test data, which we synthesized using simulated room impulse responses (RIRs) generated with the image method (Allen & Berkley, 1979) and real RIRs measured in an anechoic room (ANE) and an echo room (E2A). Figure 5 shows the two-dimensional configuration of the room used to obtain the simulated RIRs, where ○ and × represent the positions of the microphones and sources, respectively. The reverberation time ($\mathrm{RT}_{60}$) (Schroeder, 1965) of the simulated RIRs could be controlled via the reflection coefficient of the walls. To simulate anechoic and echoic environments, we created test signals with the reflection coefficients set at 0.20 and 0.80, respectively; the corresponding $\mathrm{RT}_{60}$ values were 78 ms and 351 ms. For the measured RIRs, we used the data included in the RWCP Sound Scene Database in Real Acoustic Environments (Nakamura, Hiyane, Asano, & Endo, 1999). The $\mathrm{RT}_{60}$ values of the anechoic room (ANE) and the echo room (E2A) were 173 ms and 225 ms, respectively.

We generated 10 speech mixtures for each speaker pair, SF1 + SF2, SF1 + SM1, SM1 + SM2, and SF2 + SM2. Hence, there were 40 test signals for each recording condition, each of which was about 4 to 7 s long. All the speech signals were resampled at 16,000 Hz. The STFT frame length was set at 256 ms, and a Hamming window was used with an overlap length of 128 ms.
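The STFT settings above translate into samples as follows. This minimal sketch uses a plain NumPy STFT applied to a random test signal; it is a generic illustration, not the authors' processing pipeline.

```python
import numpy as np

fs = 16000                       # sampling rate after resampling (Hz)
frame_len = int(0.256 * fs)      # 256 ms frame -> 4096 samples
hop = int(0.128 * fs)            # 128 ms overlap -> 2048-sample hop (50%)
win = np.hamming(frame_len)      # Hamming analysis window, as in the text

def stft(x):
    """Plain STFT: windowed frames along rows, rfft along the frame axis."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([win * x[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T   # (frame_len // 2 + 1, n_frames)

x = np.random.default_rng(0).standard_normal(4 * fs)   # 4 s test signal
X = stft(x)
print(X.shape)   # 2049 frequency bins by n_frames
```

A 4 s signal at 16 kHz thus yields a complex spectrogram with 2049 frequency bins, which is the time-frequency representation the separation algorithm operates on.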

### 5.2 Baseline and Proposed Methods

We chose ILRMA (Kameoka et al., 2010; Kitamura et al., 2016, 2017) and the recently proposed DNN-based approach, independent deeply learned matrix analysis (IDLMA; Mogami et al., 2018), as baseline methods for comparison. With ILRMA, we set $K_j$ at 10 for all $j$. The IDLMA algorithm can be implemented by replacing steps (b) and (c) of our algorithm with equation 3.5. Thus, ILRMA, IDLMA, and the proposed method differ only in the way $v_j(f,n)$ is modeled and estimated, so comparisons with the baseline methods demonstrate the effect of our model. For a fair comparison, we used the same training data as described in section 5.1 to train the DNN in equation 3.5. Following the settings in Mogami et al. (2018), we designed the DNN with four fully connected layers, each of which had 2048 units and was followed by a rectified linear unit (ReLU). The source separation algorithms were run for 40 iterations for the proposed method and 100 iterations for the baseline methods. Although the original ILRMA is a fully blind (unsupervised) approach, we also tested its supervised version, in which the basis spectra were pretrained using the same training data, for a fair comparison. Specifically, we applied the NMF algorithm, which consists of performing equations 3.3 and 3.4, to the audio samples of each source to obtain the basis spectra, and then constructed $B$ by concatenating the obtained basis spectra of the sources. We refer to this supervised version of ILRMA as sILRMA. For the proposed method, $W$ was initialized by running ILRMA for 30 iterations, and Adam optimization (Kingma & Ba, 2015) was used both for CVAE training and for the estimation of $\Psi$ in the source separation algorithm. The network configuration used for the proposed method is shown in detail in Figure 4.

### 5.3 Results

To evaluate the source separation performance, we took the averages of the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio (SAR; Vincent, Gribonval, & Févotte, 2006) of the separated signals obtained with the baseline and proposed methods using 10 test signals for each speaker pair. Figures 6 to 9 show the average SDRs, SIRs, and SARs obtained with the baseline and proposed methods under each recording condition. As the results show, the proposed method significantly outperformed the baseline methods for most of the test data in terms of SDR, revealing the advantage of the proposed approach. (Audio samples are provided at http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/mvae-ass/.)
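As a rough illustration of the SDR metric, the sketch below computes a simplified, scale-aware SDR for a known reference. Note that the BSS Eval toolkit used in the paper (Vincent et al., 2006) additionally projects the estimate onto the reference and decomposes the residual into interference and artifact components to obtain SIR and SAR; the toy formula here omits that decomposition.

```python
import numpy as np

def simple_sdr(ref, est, eps=1e-12):
    """Simplified SDR in dB: 10 log10(||s||^2 / ||s - s_hat||^2).
    This treats the entire difference from the reference as distortion."""
    num = np.sum(ref ** 2) + eps
    den = np.sum((ref - est) ** 2) + eps
    return 10.0 * np.log10(num / den)

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)               # 1 s reference at 16 kHz
s_hat = s + 0.1 * rng.standard_normal(16000)  # estimate with -20 dB noise
val = simple_sdr(s, s_hat)
print(val)   # close to 20 dB for this noise level
```

An estimate contaminated by noise at one-tenth the reference amplitude scores about 20 dB, which gives a sense of scale for the SDR figures reported in Figures 6 to 9.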

As can be seen from comparisons between the results in Figures 6 and 7 and those in Figures 8 and 9, there were noticeable performance degradations with both the baseline and proposed methods when the reverberation became relatively long. We have recently successfully incorporated the idea of jointly solving dereverberation and source separation (Kameoka et al., 2010; Yoshioka, Nakatani, Miyoshi, & Okuno, 2011; Kagami, Kameoka, & Yukawa, 2018) into the method to overcome these degradations (Inoue, Kameoka, Li, Seki, & Makino, 2019).

## 6 Conclusion

This letter proposed a multichannel source separation technique, the multichannel variational autoencoder (MVAE) method. The method used VAEs to model and estimate the power spectrograms of the sources in mixture signals. The key features of the MVAE are that (1) it takes full advantage of the strong representation power of deep neural networks for source power spectrogram modeling, (2) the log likelihood is guaranteed to be nondecreasing at each iteration of the source separation algorithm, and (3) the criteria for the VAE training and source separation are consistent, which contributed to obtaining better separations than with conventional methods. While the MVAE method was formulated under determined mixing conditions, it can be generalized so that it can also deal with underdetermined cases (Seki, Kameoka, Li, Toda, & Takeda, 2018).

## Acknowledgments

This work was supported by JSPS KAKENHI 17H01763.