Abstract
We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention, trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn-taking compared to a text-based cascaded model.1,2
1 Introduction
In natural conversations, speakers spontaneously coordinate who is currently speaking and when the other person will speak next. As a result, conversations end up being a fluent succession of turns without much overlapping speech or long stretches of silence. Of course, silences and overlaps also occur naturally, and they carry significant information that is interpreted within the conversation setting. For instance, overlapping speech often contains content-neutral verbal information (e.g., “hmm”, “yeah”) or non-verbal vocalizations (e.g., laughter) used to convey a listening attitude (back-channelling) (Yngve, 1970; Schegloff, 1982). Short silences between turns do occur and show both cross-cultural variation and a universal dependence on dialogue-related variables; for instance, straight and positive answers to questions are typically produced faster than non-responses or negative responses (Stivers et al., 2009).
All of this turn-taking coordination comes naturally to humans and starts to be learned at an early age by infants (Nguyen et al., 2022). In contrast, it remains a challenging area of research in human/machine interaction (Skantze, 2021). One of the reasons is that much of the research on natural dialogue modeling takes place with text-based interfaces. Here, the coordination problem is primarily focused on the semantic coherence and appropriateness of the artificial agent in interaction with a human (see Ni et al., 2021, for a review). The turn-taking problem itself is taken care of by an artificially imposed walkie-talkie arrangement: each agent writes in turn and signals the end of its turn by pressing carriage return.
The situation is very similar within speech-based systems: current spoken assistants like Siri or Alexa are triggered by a predetermined wake word and wait for the end of an utterance followed by sufficient silence to segment the turns of the human interlocutor. This can give rise to slow and unnatural conversations. In fact, in human-human conversation, pauses within speaker turns tend to be on average longer than gaps between speaker turns (Brady, 1968; Ten Bosch et al., 2005; Heldner and Edlund, 2010), indicating that silence may not be the main cue humans use to switch turns. Because most speech-based systems rely on Automatic Speech Recognition (ASR), and many significant aspects of speech like prosody and non-verbal vocalizations are typically not annotated in naturalistic speech dialogues, current dialogue systems have struggled to generate naturalistic dialogue.
Here we capitalize on recent progress in self-supervised learning and textless speech processing (Borgholt et al., 2022; Borsos et al., 2022; Lakhotia et al., 2021) to investigate the possibility of directly training a spoken dialogue model from raw audio, bypassing the need for text or ASR. Briefly, we build on self-supervised discrete speech representation models, which we train on spontaneous conversations in which each speaker has his or her own audio channel. After training, the speech units come to represent not only verbal but also non-verbal material. We can then encode a conversation between two interlocutors as two parallel streams of discrete tokens. We introduce a novel dual-tower transformer architecture, where each channel is processed by one “tower” of the model that learns via an autoregressive loss, while the two towers also communicate via cross-attention in their hidden units. This cross-attention is critical for the correct synchronization of the two channels and results in a naturalistic distribution of turns, overlaps, and pauses. While this system is not trained on enough data to capture deep syntactic and semantic aspects of dialogue, and indeed scores below a text-based cascaded ASR+LM+TTS model on semantic content, it better captures surface characteristics of chitchat, accurately mimicking turn-taking and backchanneling. This can be seen as a proof of principle that previously difficult-to-capture aspects of spontaneous conversations can be modeled with minimally modified language modeling techniques. Finally, our model opens up new possibilities for creating more naturalistic human-machine dialogue systems in the future.
2 Related Work
Unsupervised Spoken Language Modeling.
Recently, great advances have been achieved in the area of representation learning from raw audio. Models trained on raw speech with either autoencoder objectives (Ondel et al., 2016; van den Oord et al., 2017) or masked objectives (CPC: van den Oord et al., 2018; APC: Chung and Glass, 2020; wav2vec 2.0: Baevski et al., 2020; HuBERT: Hsu et al., 2021a; MockingJay: Liu et al., 2020) can learn audio representations that are useful for a variety of downstream tasks (Yang et al., 2021); see Borgholt et al. (2022) for a review.
Most of these models build a codebook of discrete units, either as latent representations or as targets. These discrete representations can in turn be fed to a standard autoregressive language model, which can then be sampled to generate new speech sequences (Lakhotia et al., 2021; Dieleman et al., 2021). An interesting aspect of this procedure is that it can capture aspects of speech that are typically not available in written transcriptions, and can therefore model prosody and intonation (Kharitonov et al., 2021) or non-verbal vocalizations typical of emotional speech (Kreuk et al., 2021). Up to now, however, no such model has been applied to multi-party conversational speech.
Dialogue Generation.
Since the early work on end-to-end neural dialogue generation (Vinyals and Le, 2015; Li et al., 2015; Serban et al., 2016), empowered by scalable methods for language representation (Radford et al., 2018; Lewis et al., 2019), there has been enormous progress in the area of dialogue generation (Roller et al., 2020; Zhang et al., 2019; Adiwardana et al., 2020). More recent research has focused on utilizing retrieval-augmented generation methods (Lewis et al., 2020) for long-context, multi-session conversations (Xu et al., 2021) and on grounding responses in fresh information from the internet (Komeili et al., 2021; Shuster et al., 2022). However, all of this progress has centered around text dialogues, leaving out the non-lexical information present in human-human dialogues (Schuller et al., 2013; Ang et al., 2002), for example, emotion, pauses, laughter, hesitation, and interruption. Our work builds on end-to-end techniques while taking a speech-first approach to address this shortcoming, where prompts and generated sequences are represented as self-supervised discrete speech representations (Lakhotia et al., 2021). As a result, the capacity of our models is constrained by the amount of publicly available speech dialogues; for example, the LDC English Fisher dialogue corpus (Cieri et al., 2004) contains roughly 12M words, compared to tens of billions of words in the case of text-based dialogue systems. There have been recent calls for large-scale end-to-end benchmarks and datasets with spoken input to fill this gap (Faruqui and Hakkani-Tür, 2021).
Turn-taking Modeling.
Decades of research on conversation analysis (Duncan, 1972; Sacks et al., 1974; Schegloff, 2000; Gravano and Hirschberg, 2011; Levinson and Torreira, 2015; Ward, 2019) have shown that human turn-taking relies on a variety of complex signals, or cues, including prosodic cues, linguistic cues, and even non-verbal cues such as gaze or gestures, making turn-taking modeling a challenging problem. Simple turn-taking models using finite-state machines have been proposed to predict the distribution and durations of turn-taking events (Cassell et al., 2001; Thórisson, 2002; Raux and Eskenazi, 2009). More recently, more sophisticated machine learning-based models of turn-taking have been introduced (Meena et al., 2014; Skantze, 2017; Roddy et al., 2018; Masumura et al., 2018). These models use multi-modal features, including simple linguistic features and prosodic features extracted from the speech, to predict turn shifts. Most recently, Ekstedt and Skantze (2020) showed the possibility of turn-taking prediction in spoken dialogue using only linguistic features (text input). We use these definitions of turn-taking events to analyze the output of our models.
3 Approach
Our approach relies on the availability of a dataset constructed along the Fisher Telephone conversation collection protocol (Cieri et al., 2004), where each conversation involves two speakers and each speaker is recorded in a separate audio channel while having a very casual conversation. We follow the textless generative spoken language modeling pipeline of Lakhotia et al. (2021), which decomposes the problem of speech generation into three components: a Speech-to-Units encoder, a Units-to-Units language model, and a Units-to-Speech decoder. For the encoder we adopt HuBERT (Hsu et al., 2021a) followed by k-means clustering; for the decoder network we use a modified HiFi-GAN neural vocoder (Kong et al., 2020), similarly to Polyak et al. (2021). These models are trained on single-channel data from the Fisher dataset and applied to each channel separately; they therefore do not model cross-channel interactions. For the language model, we introduce our new Dialogue Transformer Language Model, or DLM. Figure 1 presents an overview of our system. Sections 3.1–3.3 present each component of our model at a high level, and Section 3.4 reviews the turn-taking terminology used in this study.
3.1 Discrete Phonetic Representation
Conversational speech contains casual expressions (filler words like ‘hmm’) and a variety of non-verbal sounds (e.g., laughter) that do not appear in formal or read speech. We therefore train a HuBERT model (Hsu et al., 2021a) directly on our conversation dataset in order to obtain domain-appropriate phonetic representations (see Appendix Section G for an analysis). Specifically, it is trained on the collection of voice segments extracted from all speakers in the dataset. The discrete units are then obtained by clustering the representations of the HuBERT model with the k-means algorithm. At inference time, the two-channel speech waveform is encoded channel-wise into two time-aligned streams of discrete units.
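For concreteness, a minimal sketch of this channel-wise encoding step is shown below; `hubert` (a feature extractor returning one 20 ms frame per step) and `kmeans` (a fitted quantizer with a scikit-learn-style `predict`) are assumed to be available, and the names are illustrative rather than the actual pipeline code.

```python
import torch
import torchaudio

def encode_channel(waveform, hubert, kmeans):
    """Map a mono 16 kHz waveform to a stream of discrete units (50 units/sec)."""
    with torch.no_grad():
        feats = hubert(waveform)                      # (n_frames, feat_dim)
    return kmeans.predict(feats.cpu().numpy()).tolist()

def encode_dialogue(path, hubert, kmeans):
    """Encode a two-channel conversation into two time-aligned unit streams."""
    audio, sr = torchaudio.load(path)                 # audio: (2, n_samples)
    assert audio.shape[0] == 2, "expects one speaker per channel"
    if sr != 16000:
        audio = torchaudio.functional.resample(audio, sr, 16000)
    return [encode_channel(audio[c], hubert, kmeans) for c in range(2)]
```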
3.2 Waveform Generation
For waveform generation, we use the discrete-unit-based HiFi-GAN vocoder from Polyak et al. (2021), trained on a small subset of high-quality single-channel voice segments of our conversation dataset, using discrete units obtained from the HuBERT model and 1-hot speaker information from the dataset. During generation, we synthesize each channel of discrete units with a different speaker and combine the audio generated from the two channels. Voices for the waveform generation are chosen from the speakers in the HiFi-GAN training set.
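A hedged sketch of this synthesis step: each unit stream is vocoded with a distinct speaker identity, and the resulting mono signals are combined. The `vocoder` callable and its `speaker_id` argument stand in for the discrete-unit HiFi-GAN interface; they are assumptions for illustration, not the released API.

```python
import torch

def synthesize_dialogue(units_a, units_b, vocoder, speaker_a: int, speaker_b: int):
    """Vocode each channel with its own voice and return (stereo, mixed) waveforms."""
    wav_a = vocoder(torch.tensor(units_a), speaker_id=speaker_a)   # mono waveform
    wav_b = vocoder(torch.tensor(units_b), speaker_id=speaker_b)
    n = min(wav_a.shape[-1], wav_b.shape[-1])
    stereo = torch.stack([wav_a[..., :n], wav_b[..., :n]], dim=0)  # keep channels apart
    return stereo, stereo.sum(dim=0)                               # mix for listening
```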
3.3 Dialogue Transformer Language Model
We introduce our Dialogue Transformer Language Model (DLM), which is a two-tower transformer network with Cross-Attention and shared weights trained with Edge Unit Prediction and Delayed Duration Prediction objectives. The model is illustrated in Figure 2 and its components will be detailed below, and we will perform ablations to test for the effects of each of these components.
We will also compare the two-tower model with a simpler single-tower model with dual inputs. This latter model is inspired by previous work on multi-stream language modeling (Kharitonov et al., 2021). It consists of a single transformer with two embedding heads at the input and two softmax heads at the output. This model combines the two speaker channels very early, at the embedding layer, and models them jointly, only separating them again in the last layer. We call this model MS-TLM (Multi-Stream Transformer Language Model).
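A minimal sketch of the MS-TLM input/output structure described above is given below, assuming PyTorch; the class and attribute names are ours and purely illustrative, not the actual implementation.

```python
import torch
import torch.nn as nn

class MSTLMHeads(nn.Module):
    """Early fusion of the two channels at the input, separate softmax heads at the output."""
    def __init__(self, vocab_size: int = 500, d_model: int = 512):
        super().__init__()
        self.embed_a = nn.Embedding(vocab_size, d_model)
        self.embed_b = nn.Embedding(vocab_size, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)   # combine both channels very early
        self.head_a = nn.Linear(d_model, vocab_size)  # next-unit logits for channel A
        self.head_b = nn.Linear(d_model, vocab_size)  # next-unit logits for channel B

    def embed(self, units_a: torch.Tensor, units_b: torch.Tensor) -> torch.Tensor:
        # (batch, time) unit ids -> (batch, time, d_model) fused embeddings
        return self.fuse(torch.cat([self.embed_a(units_a), self.embed_b(units_b)], dim=-1))

    def split(self, hidden: torch.Tensor):
        # shared transformer output -> per-channel next-unit logits
        return self.head_a(hidden), self.head_b(hidden)
```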
Cross-Attention Transformer Layer.
When modeling separate channels of dialogue, we would like the LM not only to use the history of each channel itself, but also to have access to information from the other channel. We therefore add an additional Multi-Head Cross-Attention block after the Multi-Head Self-Attention block to share information between channels (cf. Figure 2, right). We train a single Transformer model that we clone into the two towers with shared weights, which makes the model speaker-independent without having to resort to permutation-invariant training.
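The sketch below illustrates the idea of adding a cross-attention block after the self-attention block in each tower, with the same shared layer applied to both channels. It is a simplified PyTorch rendering under our own naming, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class CrossChannelLayer(nn.Module):
    """One tower layer: self-attention over its own channel, then cross-attention
    over the other channel's hidden states, then a feed-forward block."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, other: torch.Tensor, causal_mask: torch.Tensor):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal_mask)[0]
        # Queries come from this channel; keys/values come from the other channel.
        q, kv = self.norm2(x), self.norm2(other)
        x = x + self.cross_attn(q, kv, kv, attn_mask=causal_mask)[0]
        return x + self.ffn(self.norm3(x))

# The same layer object (shared weights) is applied to both towers, e.g.:
#   new_a = layer(hid_a, hid_b, mask); new_b = layer(hid_b, hid_a, mask)
```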
Edge Unit Prediction.
Delayed Duration Prediction.
Training Objective.
Model Inference for Generation.
For generation, we autoregressively generate edge units and the corresponding durations in both channels. Even though the loss is applied only at the edge units, the model may generate spurious and inconsistent data at other, non-edge time steps. We give precedence to the predicted duration associated with the first edge unit predicted in each channel and overwrite the network output with this edge unit for the corresponding number of steps. It is this overwritten content which is used as input to the network until the next edge unit. For example, if we predict a unit u at time t and the corresponding duration d at time t + 1, we replace the next d − 1 units of channel c with u and only alter the unit at time t + d. The predicted duration is rounded during generation.
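This overwrite rule can be summarized by a small helper like the one below; a sketch in plain Python with our own variable names, not the authors' decoding code.

```python
def fill_with_duration(stream: list, t: int, unit: int, duration: float) -> int:
    """Overwrite steps t .. t+d-1 of one channel with `unit`, where d is the rounded
    predicted duration, and return the step at which the next edge unit is sampled."""
    d = max(1, round(duration))          # durations are rounded during generation
    for step in range(t, t + d):
        if step < len(stream):
            stream[step] = unit          # discard whatever the network produced here
        else:
            stream.append(unit)
    return t + d                         # only this position is sampled afresh
```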
3.4 Definitions of Turn-taking Metrics
Because our model generates two audio channels in parallel, it is possible to use simple Voice Activity Detection (VAD) tools on the output to derive turn-taking metrics. Following Figure 3, we define an Inter-Pausal Unit (IPU) as a continuous stretch of speech in one speaker’s channel, delimited by a VAD silence of more than 200 ms on both sides. We define silence as sections of the recording with no voice signal on either channel and overlap as sections where there are voice signals on both channels. Silences can be subdivided into gaps (when they occur between two IPUs from distinct speakers) and pauses (when they occur between IPUs from the same speaker). Successive IPUs by the same speaker separated by a pause are regrouped into a turn. Overlap could also in principle be subdivided into backchannel (a rather short IPU contained within an IPU of the other speaker) and interruption (an IPU that starts within an IPU of the other channel and continues after its end), but the exact definition depends on high-level linguistic features, which we will not attempt to extract here. In our analysis, we therefore tally the distribution of durations of IPUs, gaps, pauses, and overlaps in the training corpus and in the dialogues generated by our various models.
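As an illustration, the sketch below derives IPUs and total overlap time from per-channel VAD segments (lists of (start, end) times in seconds); gaps and pauses follow the same logic by checking which speaker precedes and follows each silence. This is our own simplified rendering of the definitions above, not the evaluation code.

```python
def merge_into_ipus(segments, min_silence=0.2):
    """Merge VAD speech segments of one channel separated by < 200 ms into IPUs."""
    ipus = []
    for start, end in sorted(segments):
        if ipus and start - ipus[-1][1] < min_silence:
            ipus[-1][1] = max(ipus[-1][1], end)   # bridge the short silence
        else:
            ipus.append([start, end])
    return [tuple(ipu) for ipu in ipus]

def total_overlap(ipus_a, ipus_b):
    """Total time (seconds) during which both channels contain speech."""
    return sum(max(0.0, min(a1, b1) - max(a0, b0))
               for a0, a1 in ipus_a for b0, b1 in ipus_b)
```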
3.5 Cascaded Dialogue Baseline System
We compare our textless dialogue models with a traditional cascaded dialogue system consisting of an ASR model, followed by a text-based language model and a Text-To-Speech (TTS) module. We first transcribe each channel of the dialogue with the ASR model, then combine the transcribed text into a turn-based conversation3, ignoring any turns that are completely contained inside another turn. We train a Transformer Language Model on these conversations and finally employ a TTS module to synthesize the generated text into a turn-based conversation.
4 Experimental Setup
4.1 Training Set
In this work, we use the Fisher dataset (Cieri et al., 2004), a conversation corpus consisting of more than 16,000 English telephone conversations averaging 10 minutes in duration and covering various topics. The audio of each conversation was recorded in two separate channels, resulting in 2000 hours of transcribed speech.4
For the training of the HuBERT and HiFi-GAN models, we follow the preprocessing steps of Kuchaiev et al. (2019)5 to obtain a collection of single-channel voice segments from the Fisher dataset. The segments mostly range from 10 to 15 seconds, with a total duration of about 1800 hours. We divide the Fisher dataset into train/valid/test sets with a 98/1/1 split (different speakers in each split).
4.2 Model Training
We train a HuBERT Base model (Hsu et al., 2021a) from raw audio. The encoder contains seven 512-channel CNN layers with strides [5,2,2,2,2,2,2] and kernel widths [10,3,3,3,3,2,2], converting the signal rate from 16,000 samples/sec down to 50 frames/sec. It is followed by 12 Transformer blocks. The model is trained with a masked objective for 3 iterations following the same recipe as Hsu et al. (2021a), alternating between feature extraction/quantization and masked-prediction training in each iteration. We use the k-means algorithm with codebook sizes of 100, 500, and 500 to quantize the MFCC features, the 6th transformer layer features, and the 9th transformer layer features for the three HuBERT training iterations, respectively. After training, we quantize the final transformer layer features into 500 units for the DLM training. We choose a large codebook size of 500 to model various kinds of vocalizations beyond broad phonetic classes. Following Hsu et al. (2021a), we use 250k training updates in the first iteration and 400k updates in subsequent iterations using 32 V100 32GB GPUs. As the transformer does not change the input frame rate, the encoded discrete units have a frame rate of 50 units per second (one every 20 ms). We show in Table A1 that our HuBERT model trained on the Fisher dataset learns phonetic information better suited to conversational speech than the publicly available HuBERT model trained on audiobooks (Hsu et al., 2021a).
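As a quick sanity check of the frame rate quoted above, the CNN strides imply a total downsampling factor of 320 samples per frame:

```python
import math

strides = [5, 2, 2, 2, 2, 2, 2]
downsampling = math.prod(strides)            # 5 * 2**6 = 320 samples per frame
frame_rate = 16000 / downsampling            # 50 frames (units) per second
frame_length_ms = 1000 / frame_rate          # 20 ms per unit
print(downsampling, frame_rate, frame_length_ms)   # 320 50.0 20.0
```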
We train the HiFi-GAN model on a small subset of the Fisher dataset segments consisting of 120 speakers with 10 minutes each. These speakers were selected for high intelligibility using the average perplexity of a phone recognizer trained on the clean Librispeech 100h training subset (Rivière and Dupoux, 2021). The model is trained to generate the audio waveform given the HuBERT units of a segment and a speaker embedding vector.
For the DLM models, we use a transformer consisting of 6 layers, with 8 attention heads per layer and an embedding size of 512. When cross-attention is used, it is added to the top 4 transformer layers. We show in Table A2 the effect of the number of cross-attention layers on language modeling metrics. We train the DLM model on the parallel unit streams encoded from 2000 hours of stereo audio; each sample contains up to 6144 unit pairs, equivalent to 123 seconds. The models are trained on a total of 32 V100 32GB GPUs, with a batch size of 370 seconds of audio per GPU, for a total of 250k steps. We use the Adam optimizer (Kingma and Ba, 2015) with a maximum learning rate of 5 × 10−4. The DLM model is implemented with the fairseq toolkit (Ott et al., 2019). Training 100k steps took 66 hours on average for DLM models without edge unit prediction, and 95 hours with the additional edge unit prediction objective.
We also train a Multi-Stream Transformer Language Model (MS-TLM, Kharitonov et al., 2021), a single transformer model taking two streams of units as input and autoregressively predicting the next units in both streams. It is a standard Transformer Language Model with 6 layers, 8 attention heads per layer, and an embedding size of 512, with the difference that the embedding layer concatenates the embeddings of the two parallel units and the output layer produces two softmax heads to predict the next units in both streams. We train the MS-TLM model in the same way as the DLM models described above. Training 100k steps of the MS-TLM model took 40 hours.
For the cascaded system, we use a pre-trained ASR model6 to decode the Fisher dataset. We then train a standard 6-layer Transformer Language Model on the turn-based conversations obtained from the ASR. We pre-process the text using byte pair encoding (BPE, Sennrich et al., 2016) with 20k iterations and limit each sample to 512 tokens. We train the language model for 100k steps on 32 V100 32GB GPUs with a batch size of 2048 tokens per GPU, using the same optimizer as for the other models. Finally, we use the Google TTS API to synthesize the generated conversations, with two different voices indicating the two different speakers.
4.3 Evaluation Metrics
This section presents the evaluation metrics used to assess our dialogue models on two dimensions: Training and Generation.
4.3.1 Training Metrics
These metrics evaluate the dialogue modeling performance in each channel separately, using metrics close to the training loss. They are computed by encoding files from the development set and extracting statistics on the predicted outputs at each time step. They are used to compare the different versions of the DLMs and are therefore not applied to the cascaded model.
Edge Unit Prediction.
We report the Negative Log-Likelihood (NLL), or cross-entropy loss, when predicting edge units. We also compute the Prediction Accuracy.
Edge Duration Prediction.
We use the Mean Absolute Error (MAE), or L1 loss, when evaluating edge duration prediction (an MAE of 1 corresponds to 20 ms of error). The Duration Accuracy is also reported.
4.3.2 Dialogue Generation Metrics
We evaluate the generation properties of our models using descriptive statistics, automatic metrics, and human judgments. Unless otherwise stated, we perform conditional generation, producing 90-second continuations from 117 30-second prompts extracted from the development set, with the default generation temperature of 1.0. Units are generated by sampling among the top 20 most probable units.
Turn-taking Event Statistics.
We compute the turn-taking events defined in Section 3.4 on the samples generated by the models, using VAD from the pyannote library7 (Bredin et al., 2020). We then analyze the statistics of these turn-taking events (number of events and their durations) across the different models.
Turn-taking Event Consistency.
We evaluate the model’s capacity to generate consistent conversations in terms of turn-taking events. We measure the Pearson correlation between the total duration of events in each prompt and in the corresponding continuation.
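A minimal sketch of this consistency measure, assuming the per-dialogue event durations have already been computed (e.g., with the VAD-based procedure of Section 3.4):

```python
from scipy.stats import pearsonr

def event_consistency(prompt_durations, continuation_durations):
    """Pearson correlation between total event duration (seconds) in each prompt
    and in its generated continuation; both lists are aligned by dialogue."""
    r, p_value = pearsonr(prompt_durations, continuation_durations)
    return r, p_value

# e.g. event_consistency([3.1, 0.4, 1.8, 2.2], [2.7, 0.9, 1.5, 2.6])
```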
Natural Dialogue Event Statistics.
We evaluate the naturalness of the generated speech by focusing on the Speaking Rate (WPM, words per minute), Laughter Frequency (LPM, laughs per minute), Filler Word Rate (FWR, filler words per 100 words), and Floor Transfer Offset (FTO, the duration between two consecutive turns of the two speakers; a positive FTO represents a gap while a negative FTO represents an overlap). For this evaluation, we use the same ASR model used to decode the Fisher dataset6 to transcribe the generated speech. To detect laughs in the speech, we use an open-source model described in Gillick et al. (2021).8 To compute the FWR, we use the following set of filler words: {‘uh’, ‘um’, ‘like’, ‘i mean’, ‘you know’}.
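For illustration, the snippet below sketches two of these statistics; the filler-word count is a rough substring match and the turn list is assumed to be sorted by start time, so this approximates the metrics rather than reproducing the evaluation code.

```python
FILLERS = ("uh", "um", "like", "i mean", "you know")

def filler_word_rate(transcript: str) -> float:
    """Filler words per 100 words (rough substring count over an ASR transcript)."""
    text = transcript.lower()
    n_words = max(1, len(text.split()))
    return 100.0 * sum(text.count(f) for f in FILLERS) / n_words

def floor_transfer_offsets(turns):
    """`turns`: list of (speaker, start, end) sorted by start time.
    Positive offset = gap, negative offset = overlap at a speaker change."""
    return [cur_start - prev_end
            for (spk_prev, _, prev_end), (spk_cur, cur_start, _) in zip(turns, turns[1:])
            if spk_prev != spk_cur]
```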
Semantic Evaluation.
We use two evaluation metrics proposed in Lakhotia et al. (2021), perplexity (PPL) and VERT, to assess the generation quality and diversity of the models. We first transcribe the generated speech using the ASR system. As these metrics are calculated on text sequences, we combine the text from the two channels into a single turn-based text sequence,3 ignoring any turns that are completely contained inside another turn. We employ the open-source DialoGPT model9 (Zhang et al., 2019) to compute the perplexity of the turn-based sequences. We simply replace the speaker tokens (<A>, <B>) with the <|endoftext|> token, indicating a turn switch. For the VERT metrics, we also compute the self-BLEU and auto-BLEU on the turn-based text sequences. As the conversation texts contain many repetitions, we report the VERT-4 score instead of the VERT-2 score used in Lakhotia et al. (2021).
Since the PPL and VERT scores highly depend on the generation temperature, we generate at different temperatures ranging from 0.3 to 2.0. We then compute the PPL and VERT at each temperature, fit the resulting points with an exponential curve, and report the PPL@GT score (the PPL at the ground-truth VERT; cf. Figure 7). For the conditional generation case, we instead compute the conditional perplexity (cond. PPL), which is the perplexity of the generated sequence given the concatenation of the prompt sequence and the generated sequence as input to the DialoGPT model.
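A sketch of the PPL@GT computation, under the assumption that log-perplexity is fitted linearly in VERT (i.e., an exponential curve in the original scale); the helper name is ours.

```python
import numpy as np

def ppl_at_gt(verts, ppls, gt_vert):
    """Fit log(PPL) = a * VERT + b over the temperature sweep and evaluate the fit
    at the ground-truth VERT value."""
    verts, ppls = np.asarray(verts, dtype=float), np.asarray(ppls, dtype=float)
    a, b = np.polyfit(verts, np.log(ppls), deg=1)
    return float(np.exp(a * gt_vert + b))

# e.g. ppl_at_gt(verts=[0.25, 0.40, 0.65], ppls=[60.0, 120.0, 310.0], gt_vert=0.45)
```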
Human Opinion Score.
We perform a human evaluation on the generated examples. Opinions are collected along two dimensions: N-MOS (naturalness Mean Opinion Score), representing naturalness and turn-taking conversationality, and M-MOS (meaningfulness Mean Opinion Score), representing meaningfulness and content quality. For N-MOS, we asked the participants to concentrate on the fluidity and naturalness of the interaction as well as the expressiveness of the speakers, regardless of meaning. For M-MOS, they were asked to focus on what is being said and whether it is semantically coherent. Both evaluations used a scale of 1–5 (1: worst, 5: best). The CrowdMOS package (Ribeiro et al., 2011) was used for all subjective evaluations, following the recommended recipes for detecting and discarding inaccurate scores. Specifically, we first remove all workers whose correlation with the mean scores is lower than 0.25, and then filter out outlier workers whose correlation with the mean scores is lower than 0.6. We enforced at least six raters for each generated sample. Participants were recruited using a crowd-sourcing platform.
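The two-pass rater filtering described above can be sketched as follows (assuming every rater scored every sample; this follows our description of the procedure rather than the CrowdMOS implementation):

```python
import numpy as np

def filter_raters(scores, first_thr=0.25, second_thr=0.6):
    """`scores`: {rater_id: np.ndarray of scores over the same ordered sample list}.
    Returns the rater ids kept after the two correlation-based passes."""
    def keep(raters, thr):
        mean = np.mean([scores[r] for r in raters], axis=0)
        return [r for r in raters if np.corrcoef(scores[r], mean)[0, 1] >= thr]
    raters = keep(list(scores), first_thr)   # drop clearly inaccurate workers
    return keep(raters, second_thr)          # then drop remaining outliers
```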
5 Results
5.1 Content and Duration Modeling
Table 1 reports the modeling evaluation metrics on our development subset of the Fisher dataset. In rows Id 1–5, we compare different DLM models, while row Id 0 represents the MS-TLM model, which takes as input multiple unit streams from different channels, and predicts the next-step units only. We note that for models Id 1–3, the next-step unit prediction objective is also included in the training process, but when the duration prediction objective is employed (models Id 4–5), the next-step unit prediction objective is omitted.
We observe that using the cross-attention layers slightly improves the edge unit prediction metrics (u NLL: 3.07 vs 2.95). Comparing models Id 2 & 3, we observe a large improvement in edge unit NLL and Accuracy when introducing the edge unit prediction objective (u NLL: 2.95 vs 2.49). By introducing the duration prediction objective and removing the next-step unit prediction objective, the model performs even better on the edge unit prediction metrics (u NLL: 2.26); finally, the duration metrics improve greatly when we apply delayed duration prediction (d MAE: 1.47 vs 1.23).
Compared with the MS-TLM model, our best DLM model performs much better on content modeling. We believe the reason is related to the entangled modeling of content and duration in the MS-TLM model.
5.2 Turn-taking Event Statistics
In this section, we analyze the distribution of the turn-taking events (as described in Section 3.4) in the dialogue continuations generated by our models. The statistics are computed over 3 hours of generated speech per model.
Figure 4 shows the distribution of each of the 4 turn-taking events: IPU, pause, gap, and overlap. In this figure, the Ground Truth corresponds to the true continuation of the prompts in the original corpus. Despite a reasonably good modeling score (cf. Table 1), DLM-1, which has no cross-attention layers between the two transformer towers, performs poorly on turn-taking events, except for the IPU event. The lack of communication between the two channels during generation creates huge gaps and overlaps in the generated samples. The MS-TLM and DLM-2 models have similar distributions, with shorter overlaps and longer pauses and gaps. They were trained using the next-step prediction loss on duplicated unit sequences, which could lead to repeated unit generation, causing a slow pace and more extended silences in the generated audio. The opposite effect happens when we introduce edge unit prediction (DLM-3–5): these models manage to generate more overlaps, with pauses and gaps of shorter duration. These observations are further reinforced in Table 2, which details the number of events and their total durations per minute. It is interesting to note that all models, except DLM-1, manage to capture the empirical fact that intra-turn pauses tend to be longer than between-turn gaps (Brady, 1968; Ten Bosch et al., 2005; Heldner and Edlund, 2010).
Table 2: Turn-taking events: number of occurrences per minute and cumulated duration per minute.

| Id | Model | IPU (n/min) | Pause (n/min) | Gap (n/min) | Overlap (n/min) | IPU (s/min) | Pause (s/min) | Gap (s/min) | Overlap (s/min) |
|----|-------|-------------|---------------|-------------|-----------------|-------------|---------------|-------------|-----------------|
| 0 | MS-TLM | 19.4 | 10.6 | 5.1 | 3.3 | 49.4 | 8.9 | 2.9 | 1.3 |
| 1 | DLM-1 | 17.7 | 7.9 | 3.9 | 5.5 | 41.4 | 13.8 | 10.7 | 6.1 |
| 2 | DLM-2 | 20.0 | 10.4 | 5.5 | 3.6 | 48.9 | 9.1 | 3.6 | 1.7 |
| 3 | DLM-3 | 19.0 | 1.8 | 4.9 | 11.7 | 65.0 | 1.1 | 1.8 | 8.1 |
| 4 | DLM-4 | 18.9 | 3.2 | 5.6 | 9.4 | 60.7 | 2.4 | 2.9 | 6.1 |
| 5 | DLM-5 | 24.2 | 5.4 | 7.2 | 10.9 | 59.1 | 3.6 | 2.9 | 5.8 |
| 6 | Cascaded | 17.5 | 0.0 | 14.9 | 0.0 | 54.8 | 0.0 | 5.3 | 0.0 |
| – | Ground Truth | 21.6 | 7.0 | 7.5 | 6.5 | 53.5 | 5.5 | 4.4 | 3.6 |
| – | Training Set | 25.9 | 7.2 | 8.6 | 10.0 | 54.5 | 5.6 | 4.6 | 4.7 |
The cascaded model only produces alternating speech turns and therefore has almost no overlaps or pauses. This also results in low variance in the gap distribution, making the generated speech turns sound like a machine conversation.
5.3 Turn-taking Event Consistency
Figure 5 shows the correlation between the total duration of turn-taking events in the prompts and in the generated continuations. For the ground truth, we compute the correlation of the events' duration between the first 30 seconds and the following 90 seconds of each sample. We observe that in general all models except DLM-1 and the cascaded model have good correlations, showing their ability to maintain dialogue consistency. Unsurprisingly, the cascaded model has no correlation with the prompt events, except for the gaps, which are proportional to the number of turn changes.
5.4 Natural Dialogue Event Statistics
Table 3 reports the naturalness statistics on the generated samples of our models. We first notice that, compared to the ground truth, models without edge unit prediction (MS-TLM, DLM-1–2) tend to produce speech with less information and more hesitations (lower speaking rate, less laughter, more filler words) than those with edge unit prediction (DLM-3–5). Adding duration prediction effectively helps produce more natural speech, but the resulting models still produce more words than the ground truth. The cascaded model is unable to produce laughter, as the ASR and TTS modules cannot capture this information; it also generates nearly “non-stop” speech at a faster rate than natural speech. Looking at Figure 6, we indeed see that the cascaded model has no negative FTOs (overlaps), and its positive FTOs (gaps) fall mostly in the range of one second. In general, the other models have reasonable FTO distributions compared to the reference ground truth and training set.
Table 3: Speaking rate (WPM), laughter frequency (LPM), and filler word rate (FWR) of the generated samples.

| Id | Model | WPM | LPM | FWR |
|----|-------|-----|-----|-----|
| 0 | MS-TLM | 139.17 | 1.88 | 9.36 |
| 1 | DLM-1 | 123.60 | 1.98 | 9.39 |
| 2 | DLM-2 | 141.09 | 2.06 | 10.36 |
| 3 | DLM-3 | 281.41 | 7.08 | 3.40 |
| 4 | DLM-4 | 244.13 | 6.05 | 3.38 |
| 5 | DLM-5 | 211.98 | 3.62 | 5.50 |
| 6 | Cascaded | 216.73 | 0.00 | 7.08 |
| – | Ground Truth | 181.46 | 3.60 | 7.25 |
5.5 Semantic Evaluation
For the semantic metrics, we perform both conditional and unconditional generation. For conditional generation, we select 50 10-second prompts from the validation set. For each model and temperature, we generate 50 samples and limit the transcribed turn-based text sequences to 50 words.
We found that for certain models it is not possible to obtain PPL@GT, as they tend to generate repeated units at low temperatures, producing complete noise in the synthesis. We therefore also report the PPL scores at the default temperature of 1.0 (@t1). As shown in Table 4, the dialogue models fail to generate semantically coherent speech, resulting in high perplexity, especially in prompted generation. The cascaded model has a very good perplexity, as its language model was trained at the word and sub-word level; it even obtains a better (lower) PPL@GT than the ground truth in the unconditional case. For conditional generation, the cascaded model has a good PPL but still falls well short of the ground truth.
Table 4: Perplexity of transcribed generations under DialoGPT at the default temperature (@t1) and at the ground-truth VERT (@GT), for unconditional (PPL) and conditional (cond. PPL) generation; lower is better.

| Id | Model | PPL↓ @t1 (uncond.) | PPL↓ @GT (uncond.) | cond. PPL↓ @t1 | cond. PPL↓ @GT |
|----|-------|--------------------|--------------------|----------------|----------------|
| 0 | MS-TLM | 190.59 | 144.82 | 741.86 | – |
| 1 | DLM-1 | 145.85 | – | 195.89 | – |
| 2 | DLM-2 | 218.30 | – | 453.73 | – |
| 3 | DLM-3 | 155.17 | 161.58 | 463.27 | 329.74 |
| 4 | DLM-4 | 290.07 | 231.00 | 693.48 | 314.49 |
| 5 | DLM-5 | 179.65 | 187.16 | 605.84 | 365.08 |
| 6 | Cascaded | 32.23 | 80.80 | 45.93 | 117.06 |
| – | Ground Truth | 100.85 | 100.85 | 65.00 | 65.00 |
5.6 Human Evaluations
For this evaluation, we filter the prompts to contain genuine alternations between the two interlocutors and to be gender-balanced. We retained 50 10-second prompts and generated 10 20-second continuations for each prompt. Human evaluation results are reported in Table 5. The naturalness and meaningfulness MOS scores correlate well with the results of the previous sections. The DLM-5 model performs best among the dialogue models, while DLM-1 performs significantly worse on both scores. Interestingly, whereas there is a large gap between our best model and the ground truth on meaningfulness (1.73 points on the 5-point scale), this gap is much smaller for turn-taking (0.53 points). The cascaded model lacks naturalness while obtaining better meaningfulness scores than all dialogue models; however, it is still far below the ground truth despite its very good semantic scores. Overall, our models can generate dialogues mimicking natural turn-taking, while failing to maintain cross-sentence meaningfulness. We believe the lack of semantic coherence in the generated dialogues results from the fine-grained acoustic units used for modeling and from the small size of the training corpus.
Table 5: Human evaluation: naturalness (N-MOS) and meaningfulness (M-MOS) scores; higher is better.

| Id | Model | N-MOS↑ | M-MOS↑ |
|----|-------|--------|--------|
| 0 | MS-TLM | 3.31 ± 0.43 | 2.29 ± 0.49 |
| 1 | DLM-1 | 2.25 ± 0.60 | 1.70 ± 0.44 |
| 2 | DLM-2 | 2.95 ± 0.37 | 2.24 ± 0.47 |
| 3 | DLM-3 | 3.29 ± 0.43 | 2.20 ± 0.44 |
| 4 | DLM-4 | 3.36 ± 0.44 | 2.18 ± 0.46 |
| 5 | DLM-5 | 3.70 ± 0.46 | 2.48 ± 0.49 |
| 6 | Cascaded | 2.38 ± 0.63 | 2.70 ± 0.38 |
| – | Ground Truth | 4.23 ± 0.26 | 4.21 ± 0.25 |
6 Conclusion and Future Work
We have presented dGSLM, the first model for spoken dialogue generation trained from raw audio. The model reproduces naturalistic, intelligible speech while being trained on only 2k hours of audio from telephone conversations. Informal inspection of the generated samples1 shows that it is able to reproduce non-verbal vocalizations (laughter, backchannels). Detailed analysis of the turn-taking events shows that the model reproduces accurate synchronization, including the distribution and duration of turn-taking events such as IPUs, gaps, pauses, and overlaps. In particular, it reproduces the rather puzzling observation that intra-turn pauses tend to be on average longer than between-turn gaps, suggesting that pauses alone are not a sufficient signal to indicate a change of turn.
Although the model lacks the ability to produce semantically coherent speech, it paves the way for the construction of more naturalistic human-machine dialogue systems. The logic and timing of turn-taking, which has up to now been very difficult to model artificially, emerges naturally from our system, even though the system is clearly not yet able to process speech at a deep semantic level. This indicates that a model that correctly predicts synchronization between turns can be learned from a relatively small amount of data. This is surprising given that one major source of paralinguistic information, intonation, was not explicitly encoded in the input (or the output) of the system. Further work incorporating pitch (Kharitonov et al., 2021) could potentially improve the current results. Results from the cascaded system also suggest that either using larger linguistic units (like BPE) derived from raw audio (Borsos et al., 2022) or combining our model with text-based models could yield systems that generate more natural and meaningful conversations.
Acknowledgments
In this work, E.D. in his academic role (EHESS, ENS-PSL, CNRS) was supported by the Agence Nationale pour la Recherche (ANR-17-EURE-0017 Frontcog, ANR-10-IDEX-0001-02 PSL*, ANR-19-P3IA-0001 PRAIRIE 3IA Institute), a grant from CIFAR (Learning in Machines and Brains). B.S. was also supported by the Agence Nationale pour la Recherche (ANR-19-P3IA-0001 PRAIRIE 3IA Institute).
Appendix
A Phonetic Quality of HuBERT Fisher
In Table A1, we compare the HuBERT Base model (Hsu et al., 2021a) trained on 2000 hours of the Fisher dataset versus 1000 hours of the Librispeech dataset on the machine-ABX phonetic test. We used the Libri-light ABX (Kahn et al., 2020) for the Librispeech test. For Fisher, we generated a Fisher ABX dataset using the phonetic alignments obtained from the Fisher development set. The results clearly show a domain effect, whereby the Fisher dataset is a better training set than the Librispeech dataset for ABX discrimination on Fisher.
Table A1: ABX error rates (↓) on the Fisher and LibriSpeech test sets.

| Model | Fisher within↓ | Fisher across↓ | LibriSpeech within↓ | LibriSpeech across↓ |
|-------|----------------|----------------|---------------------|---------------------|
| HuBERT Base | 7.77 | 12.57 | 3.95 | 4.69 |
| HuBERT Fisher | 5.50 | 8.35 | 11.17 | 14.70 |
B Effects of Cross-Attention Layer
In Table A2, we show the NLL and Accuracy scores of Transformer Language Models as a function of the number of cross-attention layers. The models are two-tower transformer systems trained with only the next-step prediction objective. We find that more layers give better scores, but that 4 cross-attention layers give almost the same performance as 6 at a lower complexity.
Notes
1. Generation samples can be found at https://speechbot.github.io/dgslm.
2. Code and pre-trained models will be made available at https://github.com/facebookresearch/fairseq/tree/main/examples/textless_nlp/dgslm.
3. Example: <A> hi <B> hi how you doing <A> great <B> good good my name is marine.
4. The transcription was done using the Quick Transcription specification (Cieri et al., 2004), resulting in some inaccuracies and untranscribed portions. Here, we only used the transcriptions to obtain speech segments containing vocal activity to train the HiFi-GAN and HuBERT models. The DLM was trained on the unsegmented raw data.
References
Author notes
Action Editor: Shay Cohen
Work done while at Meta.