Abstract
We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to “reading”) and from semantic tokens to low-level acoustic tokens (“speaking”). Decoupling these two tasks enables training of the “speaking” module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the “reading” component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in naturalness and acoustic quality.
1 Introduction
Training a text-to-speech (TTS) system typically requires hundreds of hours of parallel data in the form of transcribed utterances. As a consequence, high-quality TTS is only available for a few dozen languages. Moreover, the audio generated by these systems is only as diverse as the parallel data that they are trained on, which should contain many speakers with various accents, of diverse demographics, and in heterogeneous recording conditions. At the same time, extremely diverse audio-only data can be relatively abundant online, in the form of podcasts, radio, and TV shows. This makes it attractive to use such audio-only data for building realistic-sounding TTS systems while minimizing the need for diverse parallel data.
In this paper, we investigate how audio-only data can be leveraged to reduce the need for supervision in training TTS systems. We introduce SPEAR-TTS,1 a multi-speaker TTS system that can be trained with as little as 15 minutes of parallel data from a single speaker. Moreover, SPEAR-TTS can synthesize a new voice using only 3 seconds of speech, without any speaker labels or explicit speaker representation. SPEAR-TTS leverages recent advances in the “textless” modeling of spoken language (Lakhotia et al., 2021; Polyak et al., 2021; Kreuk et al., 2021; Kharitonov et al., 2022; Borsos et al., 2023). These methods represent continuous audio waveforms as sequences of tokens from a finite vocabulary, casting speech generation as a language modeling task. In particular, AudioLM (Borsos et al., 2023) combines two types of discrete tokens: high-level semantic tokens and low-level acoustic tokens, which can be mapped to audio. Using these representations, we cast the TTS problem as a “translation” from text transcripts to acoustic tokens with semantic token representations serving as a pivot “language” (Utiyama and Isahara, 2007). This way, TTS is a composition of two sequence-to-sequence (seq2seq) tasks: translating text to semantic tokens, and translating semantic to acoustic tokens.
The key benefit of such an approach is that the supervision needed to learn how to map text into the intermediate semantic token representation (“reading”) and how to produce speech from it (“speaking”) become decoupled. While the “reading” stage relies on parallel text-audio data, the audio tokens used to train the “speaking” component are produced by self-supervised audio models and therefore can be extracted from a massive amount of unlabeled speech data. As a result, the quality and diversity of the generated speech become independent from the available parallel data.
Casting each stage of SPEAR-TTS as a seq2seq problem allows us to use standard Transformer models (Vaswani et al., 2017) and makes it easy to tap into the vast pool of ideas developed by the machine translation community to reduce the need for supervision. Specifically, we combine BART-style pretraining (Lewis et al., 2020) with backtranslation (Sennrich et al., 2016) to significantly reduce the amount of parallel supervision required to train SPEAR-TTS.
To control the voice used by SPEAR-TTS when producing an utterance, we leverage an example prompting mechanism that is closely related to prompting in textual language models (Brown et al., 2020). Here we condition the “speaking” model with an audio clip representing the target voice, steering it to use this voice when generating the utterance. This feature can simplify building controllable multi-speaker TTS systems for languages where only single-speaker parallel data is available.
Our experimental study on English speech shows that, by combining pretraining and backtranslation over a large dataset (551 hours) with just 15 minutes of parallel data from a single speaker, SPEAR-TTS:
generates speech with high fidelity to the input transcript, achieving a CER of 1.92% on LibriSpeech test-clean (Panayotov et al., 2015);
reliably reproduces the voice of an unseen speaker, when using a 3-second example from the target speaker;
achieves high acoustic quality, comparable to that of the ground-truth utterances (MOS 4.96 vs. 4.92).2
Overall, our approach to building TTS using massive self-supervised pretraining and backtranslation of discrete speech representations considerably differs from how existing TTS systems are implemented (Shen et al., 2018; Kong et al., 2020; Ren et al., 2020; Kim et al., 2021; Ao et al., 2022; Wang et al., 2023), significantly reducing the costs related to data collection and potentially providing high-quality multi-speaker TTS for languages that are not well covered today.
2 Related Work
Our work relates to several research directions that we overview in this Section.
Discretized Speech Processing
The work of Lakhotia et al. (2021) on generative spoken language modeling and generation pioneered the use of language models on discretized speech representations. Their work became a foundation for a range of extensions, including emotion transfer (Kreuk et al., 2021), prosody (Kharitonov et al., 2022), and dialog (Nguyen et al., 2023) modeling.
SPEAR-TTS is based on AudioLM (Borsos et al., 2023), a recent development in this line of work that achieves superior quality both in spoken language modeling and in the generated audio. SPEAR-TTS extends AudioLM along several dimensions: we adapt it to the text-to-speech scenario by conditioning it on a transcript input, and we show that combining pretraining and backtranslation dramatically reduces the amount of supervision required to train a high-fidelity TTS system. Finally, SPEAR-TTS explicitly incorporates a prompt-like mechanism for voice control.
Low- and Semi-supervised TTS
Guided-TTS (Kim et al., 2021) is another TTS system that leverages audio-only data. It combines (a) a denoising diffusion probabilistic model (DDPM) that learns to model audio-only data, and (b) a phoneme classifier that guides the generative process towards producing an utterance with a desired transcript. Guided-TTS 2 (Kim et al., 2022) extends it by allowing speaker adaptability either via finetuning or in a zero-shot manner, using a 10-second speech sample processed by a dedicated speaker embedding module. Another adaptable DDPM-based TTS system was proposed by Levkovitch et al. (2022), which uses the classifier guidance mechanism to steer generation towards a particular voice in a zero-shot manner.
In contrast to SPEAR-TTS, the above works rely on stronger supervision: a phoneme classifier that is trained on LibriSpeech 960 and a pre-trained speaker verification system. Conversely, SPEAR-TTS uses a parameter-less prompting mechanism which does not require any speaker labels.
Liu et al. (2020) combine a sequential autoencoder with vector quantization and temporal segmentation mechanisms to learn a phoneme-like discrete speech representation, along with a seq2seq model that maps these representations to phonemes. Similarly to SPEAR-TTS, this system can be trained with almost no supervision; however, the generated speech is single-speaker only and of much lower quality than ground-truth audio (2.33 vs. 4.81 in their experiments).
Prompted Audio Generation
Tortoise TTS (Betker, 2022) supports emotional prompts expressed in plain English: when a sentence is prepended with a prompt such as “[I am really sad,]”, the system synthesizes the text in a sad voice.
Wang et al. (2023) propose VALL-E, a TTS system that allows prompt-based conditioning of the synthesized voice and emotion. In contrast to the two-stage architecture of SPEAR-TTS, VALL-E predicts an equivalent of acoustic tokens directly from a phoneme representation of the text, whereas SPEAR-TTS prompts the model only with self-supervised audio tokens. Another difference is the amount of parallel training data used: VALL-E is trained on the 60k hours of ASR-transcribed LibriLight (Kahn et al., 2020).
3 Discrete Speech Representations
Below is a brief overview of the two self-supervised audio representations that we adapt from AudioLM. These representations occupy opposite ends of the reconstruction quality vs. bitrate trade-off: acoustic tokens are typically high-bitrate, allowing high-fidelity audio generation, while semantic tokens are low-bitrate, making it easier to achieve long-span coherence. A detailed discussion can be found in Borsos et al. (2023, Section III-B).
Semantic Tokens
The role of semantic tokens is to provide a coarse, high-level conditioning to subsequently produce acoustic tokens. Thus, they should provide a representation of speech where linguistic content—from phonetics to semantics—is salient, while paralinguistic information such as speaker identity and acoustic details are removed. To obtain such a representation, we train a self-supervised speech representation model based on w2v-BERT (Chung et al., 2021). After its training, we run a k-means clustering on the mean-variance normalized outputs of a specific layer. We use the centroid indices as discrete tokens.
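To make this step concrete, the snippet below is a minimal sketch of semantic-token extraction: it fits a 512-way k-means codebook on mean-variance normalized frame embeddings and maps frames to centroid indices. The w2v-BERT model itself is not shown; the random `layer_embeddings` array is a stand-in for the outputs of the chosen intermediate layer, and the use of scikit-learn is our own illustrative choice.

```python
# Minimal sketch: discretize frame embeddings into semantic tokens via k-means.
# `layer_embeddings` is a stand-in for per-frame outputs of a w2v-BERT layer.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
layer_embeddings = rng.normal(size=(5_000, 256))        # [frames, embedding_dim], dummy data

# Mean-variance normalization, with statistics computed over the training frames.
mean, std = layer_embeddings.mean(axis=0), layer_embeddings.std(axis=0) + 1e-8
normalized = (layer_embeddings - mean) / std

# Fit a 512-entry codebook (the semantic-token vocabulary size used in the paper).
kmeans = KMeans(n_clusters=512, n_init=4, random_state=0).fit(normalized)

def to_semantic_tokens(frames: np.ndarray) -> np.ndarray:
    """Map [T, D] frame embeddings to a sequence of integer semantic tokens."""
    return kmeans.predict((frames - mean) / std)

print(to_semantic_tokens(layer_embeddings[:10]))
```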
Acoustic Tokens
Acoustic tokens are discrete audio representations that provide high-fidelity reconstruction of the acoustic details. We train a SoundStream (Zeghidour et al., 2021) codec to reconstruct speech while compressing it into a small number of discrete units using a residual quantizer. To represent the hierarchy of residual quantizers in a sequence, we flatten the tokens corresponding to the different levels by interleaving them (Borsos et al., 2023). We use the SoundStream codec both to convert audio to acoustic tokens and to synthesize audio back from acoustic tokens.
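The flattening can be illustrated with a small helper; the offset scheme that gives every (level, index) pair a unique id is an assumption for readability, not a detail taken from the paper.

```python
# Sketch of flattening residual-quantizer outputs into one token sequence:
# the levels of each frame are interleaved, and each (level, index) pair is
# offset into a shared vocabulary of NUM_LEVELS * CODEBOOK_SIZE ids.
from typing import List

CODEBOOK_SIZE = 1024
NUM_LEVELS = 3

def flatten_acoustic_tokens(rvq_tokens: List[List[int]]) -> List[int]:
    flat = []
    for frame in rvq_tokens:                       # frame = [level-1 id, level-2 id, level-3 id]
        for level, index in enumerate(frame):
            flat.append(level * CODEBOOK_SIZE + index)
    return flat

def unflatten_acoustic_tokens(flat: List[int]) -> List[List[int]]:
    return [[tok % CODEBOOK_SIZE for tok in flat[i:i + NUM_LEVELS]]
            for i in range(0, len(flat), NUM_LEVELS)]

frames = [[17, 940, 3], [512, 8, 1023]]            # two frames of RVQ indices
assert unflatten_acoustic_tokens(flatten_acoustic_tokens(frames)) == frames
```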
4 SPEAR-TTS Overview
SPEAR-TTS extends AudioLM (Borsos et al., 2023) by enabling text as a form of conditioning. Similarly to AudioLM, SPEAR-TTS is organized in two main stages, as illustrated in Figure 1. In the first stage (S1), text inputs are translated into a sequence of discrete semantic tokens. The second stage (S2) maps semantic tokens into acoustic tokens, which are decoded to speech by the SoundStream decoder (Zeghidour et al., 2021). This way, S1 learns to map text to the internal representation provided by semantic tokens (“reading”), while S2 handles the production of speech from this intermediate representation (“speaking”).
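The composition of the two stages at inference time can be summarized as follows; the three callables are placeholders for the trained S1, S2, and SoundStream decoder, and their signatures are purely illustrative.

```python
# Sketch of the SPEAR-TTS inference pipeline as a composition of two seq2seq stages.
from typing import Callable, List
import numpy as np

def synthesize(
    text_tokens: List[int],
    s1: Callable[[List[int]], List[int]],                    # "reading": text -> semantic tokens
    s2: Callable[[List[int]], List[int]],                    # "speaking": semantic -> acoustic tokens
    soundstream_decode: Callable[[List[int]], np.ndarray],   # acoustic tokens -> waveform
) -> np.ndarray:
    semantic = s1(text_tokens)
    acoustic = s2(semantic)
    return soundstream_decode(acoustic)

# Dummy stand-ins so the sketch runs end to end.
waveform = synthesize(
    text_tokens=[4, 8, 15, 16],
    s1=lambda text: [t % 512 for t in text],
    s2=lambda semantic: [s % 3072 for s in semantic],
    soundstream_decode=lambda acoustic: np.zeros(16000, dtype=np.float32),
)
print(waveform.shape)
```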
By using semantic tokens as an intermediate representation, we achieve two goals. First, semantic tokens provide us with a high-level representation of speech. Thus, it is easier to learn a mapping from text transcripts to semantic tokens than directly between text and acoustic tokens. Second, as both semantic and acoustic tokens are derived from self-supervised models, the second stage can be trained using audio-only data. This turns out to be extremely beneficial for training S2, as the typical scale of available audio-only data is considerably larger than that of parallel data.
In turn, separating S1 from S2 allows us to pretrain the former with a denoising pretext task operating on succinct semantic tokens, further harnessing audio-only data.
5 S1: Improving Supervision Efficiency
The first stage S1 maps tokenized text into semantic tokens. We learn this mapping from parallel text-semantic token data, which we obtain by starting from a text-audio TTS dataset and extracting semantic tokens from the audio. As a result, S1 is reduced to a seq2seq task that can be implemented by encoder-decoder or decoder-only Transformer architectures (Vaswani et al., 2017).
5.1 Pretraining
We take inspiration from BART and pretrain an encoder-decoder Transformer on a denoising task (Lewis et al., 2020). In this task, the model is provided with a corrupted version of an original semantic token sequence and the goal is to produce the corresponding uncorrupted token sequence. Such pretraining does not require parallel data and we carry it out using a large audio-only dataset.
Typical corruption methods include random substitution, deletion and masking of individual tokens or entire spans of tokens (Lewis et al., 2020; Raffel et al., 2020). In preliminary studies, we observed that deleting individual tokens independently at random works better than other alternatives.
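A minimal sketch of this corruption step is given below; the deletion probability of 0.6 matches the value reported in § 8.2, and pairing the corrupted input with the original sequence as target follows the denoising setup described above.

```python
# Denoising pretext task: delete individual semantic tokens independently at random;
# the model must reconstruct the original sequence from the corrupted one.
import random
from typing import List, Tuple

def make_denoising_example(tokens: List[int], p_delete: float = 0.6,
                           rng: random.Random = random.Random(0)) -> Tuple[List[int], List[int]]:
    corrupted = [t for t in tokens if rng.random() >= p_delete]
    return corrupted, tokens                      # (model input, reconstruction target)

source, target = make_denoising_example(list(range(20)))
print(source)                                     # roughly 40% of tokens survive at p_delete = 0.6
print(target)
```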
After pretraining the model P, we finetune it for the S1 task. To achieve this, we freeze the upper layers of the encoder and all parameters of the decoder, excluding those used in the encoder-decoder cross-attention layers, and update the lower layers of the encoder. The exact number of layers to tune is a hyperparameter.
5.2 Backtranslation
The same text sequence can be rendered as audio in multiple ways, with varying voice, accent, prosody, emotional content, and recording conditions. This one-to-many relationship makes the text-to-speech problem highly asymmetric—unlike text translation, where, for example, English-French translation is roughly equally hard in either direction. Thus, it is very attractive to use backtranslation (Sennrich et al., 2016; Edunov et al., 2018), i.e., to use the available parallel data to train a speech-to-text model and use it to generate synthetic parallel data from an audio-only corpus.
The two-stage architecture of SPEAR-TTS is particularly suitable for backtranslation as it can be implemented as translation between semantic tokens and text. The benefits are two-fold: (a) a reduction in the computational complexity due to never dealing with raw audio or long acoustic token sequences,3 and (b) the ability to leverage the same semantic token-level pretraining (§ 5.1) when training the “backward”-direction model.
In order to obtain a backtranslation model, we start from the same pretrained model P as above. However, this time we freeze the encoder and only finetune the decoder. Afterwards, we transcribe the audio-only data using this model. Next, we use the synthetically generated parallel data to train the first stage of the TTS system, which, in turn, is also obtained by finetuning another copy of P (see § 5.1). After finetuning on the synthetic data, we continue finetuning on the original parallel data. Figure 2 illustrates this process.
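The data-generation half of this procedure reduces to a simple loop, sketched below with placeholder objects: a backward model mapping semantic tokens to text is applied to an audio-only corpus, and the resulting (transcript, semantic tokens) pairs become synthetic training data for S1.

```python
# Sketch of generating synthetic parallel data via backtranslation.
from typing import Callable, Iterable, List, Tuple

def build_synthetic_parallel_data(
    audio_only_corpus: Iterable[List[int]],           # semantic-token sequences from unlabeled audio
    backward_model: Callable[[List[int]], str],       # semantic tokens -> synthetic transcript
) -> List[Tuple[str, List[int]]]:
    synthetic = []
    for semantic_tokens in audio_only_corpus:
        transcript = backward_model(semantic_tokens)
        synthetic.append((transcript, semantic_tokens))   # (text input, semantic-token target)
    return synthetic

# Toy usage with a dummy backward model.
corpus = [[3, 3, 7], [1, 4, 1, 5]]
print(build_synthetic_parallel_data(corpus, backward_model=lambda s: " ".join(map(str, s))))
```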
6 S2: Controlling the Generation Process
The second stage model S2 maps semantic tokens into acoustic tokens. To train it, we extract pairs of semantic and acoustic token sequences from each utterance in an audio-only dataset. Next, we train a Transformer model to perform seq2seq translation between the two token sequences. The second stage generates utterances with randomly varying voice, tempo, and recording conditions, reproducing the distribution of the characteristics observed in the training data. As S1 and S2 are trained independently, this diversity of speech is preserved even when S1 is trained on a single-speaker dataset.
To control the characteristics of the speaker’s voice, we combine two findings from AudioLM (Borsos et al., 2023): (a) whenever the speech prefix is represented solely by semantic tokens, AudioLM generates continuations by sampling a different random voice each time; (b) however, when the conditioning also includes acoustic tokens, AudioLM maintains the voice characteristics captured by the acoustic tokens when generating the continuation. In contrast to AudioLM, we explicitly incorporate this ability during training, as illustrated in Figure 3. During training, we randomly select two non-overlapping windows of speech from each training example, from which we compute sequences of semantic and acoustic tokens. We consider one of the windows as the prompt and the other as the target output. Next, we concatenate the sequences in the following order: (a) semantic tokens from the prompt, (b) semantic tokens from the target, (c) acoustic tokens from the prompt, and (d) acoustic tokens from the target. During training of S2, (a)–(c) are used as the prefix and the model learns to generate the target acoustic tokens (d), preserving the speaker identity captured by the acoustic tokens from the prompt. At inference time, (a)–(c) are provided as input, and (d) is generated autoregressively. We also add a special separator token to inform the model about the expected discontinuity; this prevents boundary artifacts, which sometimes occur otherwise. Note that the text transcript of the prompt is not needed.
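The sequence layout for prompted generation can be sketched as below. The exact token ids, the value of the separator, and its precise placement are assumptions for illustration; the ordering of the four segments follows (a)–(d) above.

```python
# Sketch of assembling an S2 sequence for prompted generation:
# prompt semantics, target semantics, prompt acoustics (the prefix), then the
# target acoustics that the model is trained to generate.
from typing import List

SEP = 0   # hypothetical id reserved for the separator token

def build_s2_sequence(prompt_semantic: List[int], target_semantic: List[int],
                      prompt_acoustic: List[int], target_acoustic: List[int]) -> List[int]:
    prefix = prompt_semantic + [SEP] + target_semantic + [SEP] + prompt_acoustic + [SEP]
    return prefix + target_acoustic                 # suffix = tokens to be predicted

print(build_s2_sequence([11, 12], [13, 14, 15], [201, 202, 203], [204, 205, 206]))
```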
The speech samples generated by S2 might contain some background noise, since this is typically present in the training data. We consider two methods to control the noise level in the synthesized speech at inference time. First, in the case of prompted generation, it is possible to select prompts containing cleaner speech. Second, we can use stochastic sampling (e.g., temperature sampling), generate multiple sequences for the same input, and then use a no-reference audio quality metric to select the samples containing the least amount of noise. To this end, we use a MOS estimator model similar to DNSMOS (Reddy et al., 2021).
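The second method amounts to best-of-N selection with a no-reference quality estimator, sketched below with placeholder callables for one stochastic run of S2 and for the DNSMOS-like scorer.

```python
# Sketch of quality-based selection: sample several candidates and keep the one
# preferred by a no-reference MOS estimator.
import random
from typing import Callable, List

def sample_cleanest(
    semantic_tokens: List[int],
    sample_once: Callable[[List[int]], List[int]],   # one temperature-sampled run of S2
    estimate_mos: Callable[[List[int]], float],      # no-reference quality score (higher is better)
    num_candidates: int = 4,
) -> List[int]:
    candidates = [sample_once(semantic_tokens) for _ in range(num_candidates)]
    return max(candidates, key=estimate_mos)

# Toy usage with dummy callables.
best = sample_cleanest(
    [1, 2, 3],
    sample_once=lambda s: [random.randint(0, 3071) for _ in range(6)],
    estimate_mos=lambda acoustic: -sum(acoustic) / len(acoustic),   # stand-in scorer
)
print(best)
```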
7 Experimental Setup
7.1 Training and Validation Data
In this Section, we describe datasets that are used in this paper for training and evaluation. We provide an outline in Table 1.
| Name | Size, hours | Transcripts used? | Used for |
|---|---|---|---|
| LibriLight | 60k | ✗ | Acoustic & semantic tokens, pretraining P, and training S2 |
| LJSpeech | 0.25–24 | ✓ | Finetuning P for backtranslation and S1 |
| LibriTTS train | 551 | ✗ | Source of backtranslated data |
| LibriSpeech test-clean (shorter than 10 s) | 3 | ✓ | Intelligibility evaluation |
| LibriSpeech train-clean + test-clean | 105 | ✗ | Training the voice classifier used in the evaluation |
Acoustic and Semantic Tokens:
We use LibriLight (Kahn et al., 2020) to train the self-supervised representation models (SoundStream and w2v-BERT) as well as the k-means used to discretize w2v-BERT embeddings into semantic tokens. We use the largest unlab-60k split of LibriLight that contains around 60,000 hours of English audiobooks read by more than 7,000 speakers.
First Stage S1:
To experiment in the low-supervision regime, we train S1 on LJSpeech (Ito and Johnson, 2017), a single-speaker dataset containing 24 hours of parallel data. By using LJSpeech as the only source of parallel data, we also show that our method generalizes to multiple speakers, even if the parallel training data itself contains only a single speaker. Since LJSpeech does not specify a canonical train/dev/test split, we follow Liu et al. (2022, 2020) and randomly select 300 utterances as the development set and another 300 utterances as the test set (30 minutes each), using the rest as training data. To simulate scenarios in which very limited data is available, we uniformly sample subsets of 3 hours, 1 hour, 30 minutes, and 15 minutes from the training set. As an indicative figure, the 15-minute subset contains around 21k semantic tokens and 2k words.
Pretraining:
As indicated in Table 1, we use audio from LibriLight as the corpus for the denoising pretraining of P (§ 5.1).
Backtranslation:
In our experiments with backtranslation, we use LibriTTS (Zen et al., 2019) as a source of unlabelled speech (ignoring transcripts). We pool all training subsets of LibriTTS to obtain an audio-only dataset containing 551 hours of speech. Using LibriTTS as the source of audio-only data for backtranslation allows us to compare SPEAR-TTS with S1 trained on original and backtranslated LibriTTS transcripts.
Second Stage S2:
As reported in Table 1, we train S2 on LibriLight, extracting pairs of semantic and acoustic token sequences from each utterance as described in § 6.
7.2 Evaluation Data
We use LibriSpeech test-clean (Panayotov et al., 2015) to measure the character error rate (CER) (see § 7.4). As LJSpeech only contains sequences shorter than 10 seconds, we filter out sequences longer than that from LibriSpeech test-clean. As a result, we obtain 2,007 utterances, with a total duration of approximately 3 hours. Importantly, LibriSpeech test-clean has no intersection with any training or validation data we used.
7.3 Preprocessing
To prepare the data for training, we unroll standard abbreviations used in LJSpeech. Next, we apply the G2p_en phonemizer (Park and Kim, 2019). After removing the lexical stress information from its output, we obtain a string representation in a vocabulary of 47 tokens (39 phonemes from the CMU Dictionary, whitespace, and punctuation).
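A sketch of this preprocessing step, assuming the `g2p_en` package's `G2p` interface, is shown below; the abbreviation unrolling is omitted.

```python
# Phonemize text and strip lexical-stress digits, keeping phonemes, whitespace,
# and punctuation as the token vocabulary.
import re
from typing import List
from g2p_en import G2p

g2p = G2p()

def phonemize(text: str) -> List[str]:
    phonemes = g2p(text)                              # e.g. ['HH', 'AH0', 'L', 'OW1', ...]
    return [re.sub(r"\d", "", p) for p in phonemes]   # drop stress markers: 'AH0' -> 'AH'

print(phonemize("Hello, world."))
```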
7.4 Metrics
Character Error Rate (CER)
We transcribe the synthesized utterances using an in-house ASR system and we evaluate the faithfulness to the input transcript by measuring the character error rate (CER). We use the LibriSpeech test-clean dataset (Panayotov et al., 2015) to calculate CER, since it requires minimal postprocessing to be compared to the output of the adopted ASR system. On the original audio, CER is equal to 0.98%.
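For reference, CER can be computed as a character-level edit distance normalized by the reference length; a self-contained sketch (ignoring the ASR-specific text normalization) is shown below.

```python
# Character error rate: Levenshtein distance between the ASR transcript of the
# synthesized audio and the reference text, divided by the reference length.
def character_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower(), hypothesis.lower()
    prev = list(range(len(hyp) + 1))                  # DP row for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (0 cost if match)
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(character_error_rate("speak read and prompt", "speek read and prompt"))  # ~0.048
```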
Voice Preservation
When prompting the model with a short utterance, we evaluate the consistency of the speaker voice between the prompt and the generated speech. We use the same speaker classifier as Borsos et al. (2023), which is trained on a union of LibriSpeech train-clean-100 and test-clean (251 and 40 speakers, respectively). It computes predictions over a set of 291 speaker classes (see Appendix O for more details). We measure how often the speaker label predicted from the generated speech matches the one predicted from the prompt.
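The resulting metric boils down to a top-1 agreement rate between classifier predictions on the prompt and on the generated audio; the sketch below uses a placeholder `classify` function standing in for the 291-way speaker classifier.

```python
# Voice-preservation metric: fraction of (prompt, generated) pairs for which the
# speaker classifier predicts the same label for both.
from typing import Callable, List, Tuple

def speaker_top1_accuracy(
    pairs: List[Tuple[object, object]],              # (prompt_audio, generated_audio)
    classify: Callable[[object], int],               # audio -> predicted speaker id
) -> float:
    matches = sum(classify(prompt) == classify(generated) for prompt, generated in pairs)
    return matches / len(pairs)

# Toy usage with a dummy classifier.
print(speaker_top1_accuracy([(1, 1), (2, 3)], classify=lambda audio: hash(audio) % 291))
```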
Quality
We rely on human judgments to evaluate the perceived quality of SPEAR-TTS by collecting Mean Opinion Scores (MOS). In this context, human raters listen to individual audio segments and rate their audio quality and speech naturalness on a scale from Poor (1) to Excellent (5).
7.5 Baselines
As our main baseline, we consider a system explicitly trained to target the low-supervision scenario. Namely, we use a modification of FastSpeech2 (Ren et al., 2020), which is a non-autoregressive model that uses auxiliary duration, pitch, and energy predictors. Specifically, in our experiments we consider the adaptation to the low-supervision setting by Pine et al. (2022). The model takes as input the phoneme representation of the text and predicts a spectrogram, which is then vocoded with HiFi-GAN (Kong et al., 2020). We denote this modification as FastSpeech2-LR. In a subjective evaluation reported by Pine et al. (2022), FastSpeech2-LR trained on 1 (3) hour(s) of parallel data performed on par with an open-source implementation of Tacotron2 (Shen et al., 2018) trained with 10 (24) hours of parallel data. We use checkpoints trained on 15 minutes, 30 minutes, 1 hour, 3 hours, and 24 hours that were shared by the authors.5
We also compare SPEAR-TTS to VALL-E (Wang et al., 2023), a recent TTS system that demonstrates state-of-the-art results in zero-shot voice adaptation. Similarly to SPEAR-TTS, it is capable of voice transfer using a 3-second voice prompt. VALL-E maps the input text to coarse acoustic tokens, and uses a non-autoregressive refinement stage to predict fine-grained acoustic tokens. VALL-E is trained on an ASR-transcribed version of LibriLight (Kahn et al., 2020), containing roughly 60,000 hours of parallel data. Since the model is not publicly available, the comparison is based on the samples provided on its demo page.
8 Hyperparameters & Training Details
8.1 Discrete Speech Representations
We follow the setup of AudioLM (Borsos et al., 2023) to compute both semantic and acoustic tokens, with a few differences. The semantic tokens are obtained by quantizing the embeddings returned by the 7th layer of w2v-BERT using a codebook of size 512. As a result, 1 second of audio is represented by 25 semantic tokens with a vocabulary size of 512, resulting in a bitrate of 225 bit/s. We remove sequentially repeated semantic tokens, as done in Lakhotia et al. (2021) and Borsos et al. (2023).
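The run-length removal and the quoted bitrate are easy to verify with a few lines:

```python
# Collapse runs of identical semantic tokens and check the bitrate arithmetic:
# 25 tokens/s with a 512-entry vocabulary -> 25 * log2(512) = 225 bit/s.
import math
from itertools import groupby
from typing import List

def remove_repetitions(tokens: List[int]) -> List[int]:
    return [token for token, _ in groupby(tokens)]

print(remove_repetitions([7, 7, 7, 12, 12, 7, 3]))   # -> [7, 12, 7, 3]
print(25 * math.log2(512))                           # -> 225.0 bit/s before deduplication
```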
We extract acoustic tokens from a SoundStream neural codec (Zeghidour et al., 2021) with 3 quantization levels, each with a codebook of size 1024. We use a vocabulary with 3 × 1024 unique tokens and represent each frame as a flat sequence of tokens, interleaving the first, second, and third quantization layers, respectively. As a result, 1 second of audio is represented by 50 Hz × 3 = 150 acoustic tokens, a bitrate of 1500 bit/s.
8.2 First Stage (S1)
In all experiments, we use the Adafactor optimizer (Shazeer and Stern, 2018) with inverse square-root learning rate decay. As a regularization method, we use label smoothing with the smoothing parameter set to 0.1, except in the case of pretraining, when a large amount of data is available.
Pretraining
The pretraining task is configured so that the probability of deleting individual tokens is set to 0.6. This parameter was selected via grid search, inspecting the validation accuracy of S1 after finetuning. We apply dropout with probability equal to 0.5 and set the batch size to 256. We ran the pretraining for 1M updates and used the resulting checkpoint in all our experiments. As the architecture, we use T5-Large (Raffel et al., 2020), which is a 24-layer encoder-decoder model (see Table 2).
Finetuning
The same pretrained checkpoint is finetuned for different purposes (Figure 2). In all cases, we perform a grid search on the dropout rate ({0.1, 0.3, 0.5}) and the number of layers to finetune, selecting the combination with the highest validation accuracy (with teacher-forcing). More specifically, when finetuning on ground-truth parallel data (as an ablation), we freeze both the upper layers of the encoder and the entire decoder, while updating the weights of the encoder embeddings and the lower layers; the number of lower layers to tune is searched over {4, 6, 8}. When finetuning on synthetic parallel data, we search over the number of the encoder’s lower layers to be finetuned in {4, 6, 8, 10, 12, 24}. Next, we finetune the lower 4 layers of the encoder on the original parallel data (to avoid overfitting when very little data is available). Finally, when finetuning the decoder for backtranslation, we finetune the N top and N bottom layers, with N ∈ {2, 3, 4, 12}. During finetuning, we select the checkpoint with the best validation accuracy.
Training from Scratch
As an ablation experiment, we train S1 from scratch, experimenting with different variants of T5 architectures (Raffel et al., 2020), depending on the amount of data available. We adopt a decoder-only model without causal masking on the input sequence (Raffel et al., 2020), which led to better results in our preliminary experiments. We perform a grid search on the following hyperparameters: dropout probability ({0.1, 0.3, 0.5}); architecture size (T5-small or T5-base, see Table 2); and the number of layers (T5-small: 2, 4, 6, 8; T5-base: 4, 6, 8, 12).
8.3 Second Stage (S2)
For S2, we use a 12-layer decoder-only Transformer model, with each layer having 12 heads of dimensionality 64, an embedding dimensionality of 768, and an FFN size of 2048. The optimizer and the learning rate schedule are the same as for S1.
8.4 Inference
We use beam search to sample from S1 and temperature sampling to sample from S2. This combination ensures faithfulness to the transcript while enabling more diverse and natural-sounding speech. We use a beam size equal to 10, as larger values do not lead to improvements in CER but are more computationally expensive. When generating backtranslation data, we re-use the settings of S1, without running any additional hyperparameter search. For S2, we experiment with sampling temperatures T ∈ {0.50, 0.55, …, 0.95, 1.0} and select T = 0.75, which minimizes the CER on the LJSpeech validation set. In this case, the S1 model is trained on synthetically generated parallel data obtained by backtranslation, with the backtranslation model trained on the 15-minute split of LJSpeech.
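For completeness, the temperature sampling used for S2 amounts to rescaling the next-token logits before drawing from the resulting distribution; beam search for S1 is not shown.

```python
# Temperature sampling over next-token logits (T = 0.75 in our setup).
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float = 0.75,
                            rng: np.random.Generator = np.random.default_rng(0)) -> int:
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())             # softmax with numerical stabilization
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

print(sample_with_temperature(np.array([2.0, 1.0, 0.1])))
```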
9 Experiments
We evaluate SPEAR-TTS along several dimensions. First, we measure the faithfulness of the generated speech to the input transcript, for different training scenarios and amounts of parallel data available (§ 9.1). Then, we show that SPEAR-TTS is able to successfully control the speaker voice, without any degradation in terms of fidelity to the transcript (§ 9.2).
9.1 Intelligibility and Supervision Efficiency
When evaluating SPEAR-TTS, we consider the following training settings for S1: (a) training from scratch using parallel data; (b) finetuning the pretrained checkpoint P using parallel data; (c) finetuning the pretrained checkpoint to obtain the backtranslation model and then training the forward model from scratch on the synthetically generated data; (d) same as (c), but with both the backward and the forward models obtained by finetuning P, followed by an additional finetuning of the forward model on the original parallel data.
Table 3 reports the main results in terms of CER, as a proxy for the intelligibility of the generated speech. We observe that when decreasing the amount of parallel data, training from scratch (a) results in very high error rates. Conversely, thanks to pretraining (b), SPEAR-TTS maintains a relatively low CER (≤ 4%) when using as little as 3 hours of parallel data, which is comparable to the CER achieved with 24 hours of parallel data but without pretraining. Backtranslation (c) has a generally positive impact, especially when the amount of parallel data is reduced, achieving a CER of 2.88% with only 15 minutes. By combining backtranslation with pretraining (d), the CER is further reduced to 2.21% with the same amount of parallel data. This indicates that having a fixed decoder is useful for coping with the noisy nature of the synthetically generated training data obtained via backtranslation.
| Parallel training data | FastSpeech2-LR | SPEAR-TTS: from scratch (a) | SPEAR-TTS: pretraining (b) | SPEAR-TTS: backtranslation, from scratch (c) | SPEAR-TTS: backtranslation + pretraining (d) |
|---|---|---|---|---|---|
| 24 h | 1.99±0.20 | 3.67±0.21 | 2.38±0.13 | 2.26±0.14 | 2.06±0.12 |
| 3 h | 2.52±0.25 | 20.1±0.74 | 3.07±0.15 | 2.21±0.12 | 2.01±0.12 |
| 1 h | 2.74±0.27 | × | 5.51±0.21 | 2.23±0.13 | 2.16±0.13 |
| 30 min | 3.18±0.28 | × | 21.3±0.43 | 2.52±0.15 | 2.20±0.12 |
| 15 min | 4.90±0.34 | × | × | 2.88±0.19 | 2.21±0.12 |
We also compare SPEAR-TTS to FastSpeech2-LR, observing that when using 24 hours of parallel data, both systems perform approximately on par (FastSpeech2-LR: 1.99% vs. SPEAR-TTS: 2.06%). However, as the amount of parallel data is reduced, the CER of FastSpeech2-LR increases very rapidly. As a result, there is a significant gap when only 15 minutes are available, that is, FastSpeech2-LR: 4.90% vs. SPEAR-TTS: 2.21%.
In conclusion, the combination of pretraining and backtranslation allows SPEAR-TTS to synthesize speech that adheres to the input transcript, even with as little as 15 minutes of parallel data.
9.2 Prompted Generation
SPEAR-TTS is able to control the speaker voice via example prompting, as described in § 6. We evaluate SPEAR-TTS in a zero-shot scenario, in which the voice used for prompting was never seen by S1 or S2 during training, so its characteristics have to be reproduced from a single prompt example. Specifically, we fix S1, using the model trained on the 15-minute subset of LJSpeech, and we consider all 40 speakers from LibriSpeech test-clean as target speakers. For each speaker, we randomly select 5 speech prompts with a duration of 3 seconds each and transcripts from the same dataset. For each speech prompt and text transcript, we repeat the synthesis 5 times and average the metrics.
Table 4 reports the speaker accuracy, that is, how often the same voice is detected in both the prompt and the generated speech. We provide details on the classifier’s architecture and training in Appendix O. We observe a top-1 accuracy equal to 92.4%, showing that prompting allows SPEAR-TTS to preserve the speaker voice. Also, the synthesized voice is stable when repeating inference, as captured by a low value of voice diversity (0.41 bits). Moreover, we observe that with prompted generation SPEAR-TTS achieves a CER equal to 1.92%, which is lower than without prompting (2.21%). We believe that this improvement is due to using cleaner recordings for prompts, which steers the model to produce cleaner speech and consequently reduces ASR errors.
| CER (%) | Speaker accuracy, top-1 (%) | Speaker accuracy, top-3 (%) | Voice diversity (bits) |
|---|---|---|---|
| 1.92 | 92.4 | 98.1 | 0.41 |
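One plausible way to compute such a voice-diversity statistic (the exact definition used in the paper is given in its appendix and not reproduced here) is the entropy, in bits, of the speaker labels predicted across repeated generations from the same prompt; 0 bits corresponds to a perfectly stable voice.

```python
# Entropy (in bits) of predicted speaker labels across repeated synthesis runs.
import math
from collections import Counter
from typing import List

def label_entropy_bits(predicted_speakers: List[int]) -> float:
    counts = Counter(predicted_speakers)
    total = len(predicted_speakers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(label_entropy_bits([5, 5, 5, 5, 5]))   # 0.0 bits: always the same voice
print(label_entropy_bits([5, 5, 5, 5, 9]))   # ~0.72 bits
```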
We also compare the voice preservation abilities of SPEAR-TTS with those of VALL-E (Wang et al., 2023). Following the methodology of Wang et al. (2023) we compute the cosine similarity between embeddings computed from the prompt (encoded and decoded with SoundStream) and from the generated speech, using a publicly available speaker verification system based on WavLM (Chen et al., 2022). This is the same model as used by Wang et al. (2023), which makes our measurements directly comparable with scores reported in their paper. From the results reported in Table 5, we observe that SPEAR-TTS significantly outperforms YourTTS (Casanova et al., 2022) and almost matches the speaker similarity of VALL-E, despite being trained with 240,000× less parallel data.
10 Subjective Evaluation
Finally, we run subjective tests with human raters to compare the quality of SPEAR-TTS with the baselines and with ground-truth natural speech. We focus on the scenario with minimal supervision and use the model whose S1 is trained on the 15-minute LJSpeech (Ito and Johnson, 2017) subset. As baselines, we use the FastSpeech2-LR models (Ren et al., 2020; Pine et al., 2022) trained on the 15-minute, 1-hour, and 24-hour subsets of LJSpeech.
To ensure that the evaluation sentences are not part of the training set of SPEAR-TTS or the FastSpeech2-LR models, we extract sentences from an audiobook chapter released in 2022, read by the same voice as in LJSpeech.6 This chapter was published later than any of the datasets we use. We extract 20 sentences from it, each with a duration between 3 and 11 seconds, for a total of 133 seconds. We take the transcripts for these sentences from the text of the corresponding book.
The baselines are TTS systems trained to generate a single voice. To ensure a fair comparison, we prompt S2 with utterances extracted from the LJSpeech dataset, so that SPEAR-TTS generates speech with the same voice. To this end, we randomly select 3-second speech samples from LJSpeech, filter out samples that have more than 1 second of silence, and use the remaining samples as prompts.
We recruited raters using an internal crowd-sourcing platform. The raters took an English proficiency test. Samples are presented to raters one-by-one, and raters are asked to judge the overall quality on a scale from Poor (1) to Excellent (5). Before starting, the raters were provided with example utterances for each grade. Those reference samples represent clean natural speech, speech resynthesized by SoundStream, and speech synthesized from partially corrupted SoundStream token sequences. We additionally included some examples representing speech degradations common in online communication. Each audio sample is evaluated by 20 raters. For each treatment, we average all scores to compute the Mean Opinion Score (MOS).
Table 6 reports the results of the subjective tests. We observe that SPEAR-TTS achieves considerably higher quality than the baselines, even when the latter use more parallel data during training. The MOS score achieved by SPEAR-TTS (4.96) is comparable to the one obtained for the ground-truth speech (4.92), confirming the high quality of the generated speech, despite the fact that the model was trained only on 15 minutes of parallel data.
| | FastSpeech2-LR, 15 min | FastSpeech2-LR, 1 h | FastSpeech2-LR, 24 h | SPEAR-TTS, 15 min | Ground-truth |
|---|---|---|---|---|---|
| MOS | 1.72±0.04 | 2.08±0.04 | 2.11±0.04 | 4.96±0.02 | 4.92±0.04 |
We also compare SPEAR-TTS and VALL-E (Wang et al., 2023) in a small-scale subjective test using the examples provided on its demo page.7 These examples are generated by combining 8 transcripts with 3 prompts each, resulting in 24 speech utterances. Using the same instance of SPEAR-TTS described above (with S1 trained on 15 minutes of single-speaker LJSpeech), we synthesize 24 utterances from the same transcripts and prompts and conduct a subjective test following the same protocol. Table 7 shows that, on these examples, SPEAR-TTS achieves considerably better naturalness and higher speech quality (MOS 4.75) than VALL-E (3.35), despite using considerably less supervision (15 min of parallel data from 1 speaker vs. approximately 60,000 hours of parallel data spoken by over 7,000 speakers).
11 Limitations and Broader Impact
While our motivation is to enable high-quality, diverse, and controllable TTS for under-served languages, we started our investigation with English, which allowed us to address the problem using a collection of well-studied datasets. This leaves open whether our approach would apply to languages sufficiently different from English, e.g., tonal languages. For instance, would we be able to obtain a useful discretized representation of speech for such languages? Would they require a dramatically larger token vocabulary, and would such a vocabulary still allow effective speech modeling? We are nevertheless hopeful, given the increasing amount of progress in related research: for example, it was demonstrated that the underlying AudioLM framework successfully works across many languages and even for music (Borsos et al., 2023; Agostinelli et al., 2023; Rubenstein et al., 2023). However, this remains circumstantial evidence, and a direct investigation with SPEAR-TTS is in order.
Another potential caveat is that while training S2 does not require parallel text-audio data, it still needs a considerable amount of audio, on the order of a few thousand hours (see Appendix M). This can become a limitation for low-resource languages. A similar problem can arise when training a w2v-BERT model. Here, a natural potential solution would be to leverage cross-lingual transfer, particularly from phonetically close languages, which was shown to work remarkably well for some self-supervised models (Riviere et al., 2020).
We also acknowledge that the ability to mimic a voice can have numerous malicious applications, including bypassing biometric identification and impersonation (Delgado et al., 2021; Casanova et al., 2022). Thus, it is crucial to put safeguards against misuse in place; as an initial step, we verify that speech produced by SPEAR-TTS can be reliably detected by a classifier, with an accuracy of 82.5% on a balanced dataset, using the same protocol and classifier as Borsos et al. (2023).
Due to concerns about potential malicious use, we decided not to publicly release checkpoints and training code for our models.
12 Conclusions & Future Work
We introduced SPEAR-TTS, a multi-speaker TTS system with two features that set it apart. First, it requires only a minimal amount of parallel data for training: it can synthesize speech with high fidelity when trained on as little as 15 minutes of parallel data coming from a single speaker. Second, SPEAR-TTS is able to synthesize speech that maintains the voice characteristics of a previously unseen speaker, using a 3-second voice example.
These capabilities originate from harnessing abundant audio-only data. The key component that unlocks the usage of such data is the hierarchical discrete representation of speech that combines high-level semantic tokens with low-level acoustic tokens. Using these representations, SPEAR-TTS casts the TTS problem as a composition of two sequence-to-sequence tasks, “reading” (from tokenized text to semantic tokens) and “speaking” (from semantic tokens to acoustic tokens).
SPEAR-TTS uses audio-only data in three ways: (a) to train the “speaking” model, such that the hard task of speech generation benefits from large-scale data, (b) as a domain for pretraining, and (c) to generate synthetic parallel data for backtranslation.
Our experimental study on English data (§ 9) shows that by combining audio-only data from LibriTTS (Zen et al., 2019) with 15 minutes of parallel data sampled from LJSpeech (Ito and Johnson, 2017), SPEAR-TTS achieves intelligibility comparable to that of an adapted FastSpeech2-LR (Pine et al., 2022) trained on 24 hours of LJSpeech.
Next, our experiments show that SPEAR-TTS can maintain the voice characteristics of a previously unseen speaker, in a zero-shot manner, with high accuracy. Indeed, given a 3-second voice example of a speaker from LibriSpeech test-clean, SPEAR-TTS maintains that voice with 92.4% accuracy when synthesizing held-out text transcripts, according to our speaker classifier. Moreover, when measuring speaker similarity between prompts and generated speech, SPEAR-TTS obtains a cosine similarity close to the score reported for VALL-E (Wang et al., 2023) and significantly higher than that of YourTTS (Casanova et al., 2022).
Subjective evaluations of speech naturalness show that SPEAR-TTS achieves significantly higher quality than a strong single-voice baseline, even when trained with 96× less parallel data. Moreover, the MOS score of SPEAR-TTS is on par with that of natural speech. When comparing the quality of speech synthesized in a zero-shot voice-transfer task, SPEAR-TTS obtains a MOS that is considerably higher than that of VALL-E, with 240,000× less parallel data.
We believe that applying our findings to building a TTS system for truly low-resource languages is the main direction for further work.
Acknowledgments
The authors are grateful to Ron Weiss and Matthieu Geist for their feedback on a draft of this paper. We also thank Aidan Pine for helping us to obtain and run checkpoints from Pine et al. (2022). We would also like to thank the TACL reviewers and editors for their thorough reviews and constructive feedback.
Notes
SPEAR stands for “speak, read and prompt”.
Samples can be found on the demo site: https://google-research.github.io/seanet/speartts/examples/.
An acoustic representation is at least 6× longer than the semantic one.
In Appendix M, we study the impact of reducing the amount of data used for training on the overall performance.
https://www.microsoft.com/en-us/research/project/vall-e-x/vall-e/, “More Samples”.
A Influence of the Data Size on S2
In this experiment, we measure how sensitive S2 is to the amount of data used to train it. To this end, we downsample LibriLight (Kahn et al., 2020) by factors of 1, 2, 5, and 10 before training S2 models. All models share the same architecture and are trained for the same number of updates, and we select the checkpoint with the highest validation accuracy. Next, we combine the selected checkpoints with S1 trained on LibriTTS (Zen et al., 2019) (with pretraining) and measure the intelligibility of SPEAR-TTS on LibriSpeech dev-clean. We report the results in Table 8. We notice that reducing the data size 5× starts to affect the performance.
B Computational Cost
The cost of training SPEAR-TTS is dominated by the pretraining phase: training the pretrained model P took approximately 100 TPU-days on TPUv4 (Jouppi et al., 2023). For comparison, the second most computationally demanding job, finetuning on the backtranslated data, required 20 TPU-hours. At the same time, the high computational cost of the pretraining can be offset by reusing the pretrained model for multiple tasks.
C Speaker Classifier
We use the same speaker classifier as Borsos et al. (2023), which is a convolutional network that takes log-mel spectrograms as its input. The spectrograms are calculated with a window size of 25ms, a hop length of 10ms, and have 64 mel bins. The network contains 6 blocks, each cascading convolutions with kernels of 3×1 and 1×3. Each block is followed by a ReLU non-linearity and batch normalization (Ioffe and Szegedy, 2015). The per-block numbers of channels are [64, 128, 256, 256, 512, 512]. The classifier has an input span of 1 second and, to classify a longer utterance, we run a sliding window with a hop length of 250 ms and average predictions across the windows.
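A rough PyTorch sketch matching this description is given below; the max-pooling between blocks and the global-average-pool plus linear head are assumptions, since the text does not specify how the network is downsampled or read out.

```python
# Speaker classifier sketch: six blocks of cascaded 3x1 and 1x3 convolutions with
# ReLU and batch norm, channel widths [64, 128, 256, 256, 512, 512], applied to
# 1-second log-mel inputs (64 mel bins, 10 ms hop -> ~100 frames), 291 classes.
import torch
import torch.nn as nn

class SpeakerClassifier(nn.Module):
    def __init__(self, num_speakers: int = 291):
        super().__init__()
        channels = [64, 128, 256, 256, 512, 512]
        blocks, in_ch = [], 1
        for out_ch in channels:
            blocks += [
                nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1), padding=(1, 0)),
                nn.Conv2d(out_ch, out_ch, kernel_size=(1, 3), padding=(0, 1)),
                nn.ReLU(),
                nn.BatchNorm2d(out_ch),
                nn.MaxPool2d(2),          # assumed downsampling between blocks
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.head = nn.Linear(channels[-1], num_speakers)

    def forward(self, log_mel: torch.Tensor) -> torch.Tensor:
        # log_mel: [batch, 1, mel_bins=64, frames~100]
        x = self.features(log_mel)
        x = x.mean(dim=(2, 3))            # global average pooling
        return self.head(x)

print(SpeakerClassifier()(torch.randn(2, 1, 64, 100)).shape)   # torch.Size([2, 291])
```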
Author notes
Action Editor: Karen Livescu