Abstract
Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a ‘space’ delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark. 1
1 Introduction
One of the first tasks that infants face is to learn the words of their native language(s). To do so, they must solve a word segmentation problem, since words are rarely uttered in isolation but come up in multi-word utterances (Brent and Cartwright, 1996). Segmenting utterances into word or subword units is also an important step in NLP applications, where an input string of orthographic symbols is tokenized into words or sentence pieces before being fed to a language model. While many writing systems use white spaces that make (basic) tokenization a relatively simple task, others do not (e.g., Chinese) and turn tokenization into a challenging machine learning problem (Li and Yuan, 1998). A similar situation arises for ‘textless’ language models based on units derived from raw audio (Lakhotia et al., 2021), which, like infants, do not have access to isolated words or word-separating symbols.
The most successful approach to unsupervised word segmentation for text inputs is based on non-parametric Bayesian models (Goldwater et al., 2006, 2009; Kawakami et al., 2019; Kamper et al., 2017b; Berg-Kirkpatrick et al., 2010; Eskander et al., 2016; Johnson et al., 2007; Godard et al., 2018) that jointly segment utterances and build a lexicon of frequent word forms using a Dirichlet process. The intuition is that frequent word forms function as anchors in a sentence, enabling the segmentation of novel words (like ‘dax’ in ‘did you see the dax in the street?’). Such models are tested on phonemized texts obtained by converting text into a stream of phonemes after removing spaces and punctuation marks. Even though phonemized text may seem a reasonable approximation of continuous speech, such models have given disappointing results when applied directly to speech inputs. Since the direct comparison between speech-based and text-based models was introduced by Ludusan et al. (2014), the performance gap between these two types of inputs, as documented in three iterations of the Zero Resource Speech Challenge focused on word segmentation, has remained large (Versteegh et al., 2016; Dunbar et al., 2020). We attribute these difficulties to two major challenges posed by speech compared to text inputs: acoustic variability and temporal granularity, which we address in our contribution.
The first and most important challenge is acoustic variability. In text, all tokens of a word are represented the same. In speech, each token of a word is different, depending on background noise, intonation, speaker voice, speech rate, and so forth. Text-inspired speech segmentation algorithms (Kamper et al., 2017b) apply a clustering step to word tokens in order to get back to word types. We believe that this step is unnecessary and is responsible for the low performance of existing systems, as errors in the clustering step are not recoverable and negatively impact word segmentation. We propose an algorithm that segments speech utterances based on an instance lexicon of word tokens, instead of a lexicon of word types. Each speech segment instance of the training set is represented as a distinct memory trace, and these traces are used to estimate word form frequency using a k-NN algorithm without having to refer to a discrete word type but still applying the same Dirichlet process logic. As for the representation of these word tokens, we follow recent approaches that use fixed length Speech Sequence Embeddings (SSE) (Thual et al., 2018; Kamper et al., 2017b). We use state-of-the-art SSEs from Algayres et al. (2022) that have either been trained in a self-supervised fashion or with weak labels, in order to assess how the segmentation model behaves with inputs of increasing quality.
The second challenge is that, because the speech waveform is continuous in time, the number of possible segmentation points grows very large, making the optimization of the segmentation of large speech corpora with Bayesian techniques intractable. Some models reduce the number of segmentation points using phoneme boundaries (Lee and Glass, 2012; Bhati et al., 2021), syllable boundaries (Kamper et al., 2017b), or a constant discretization of the time axis into speech frames, subject to the following trade-off: frames that are too coarse may miss some short word boundaries, while frames that are too fine-grained may render segmentation intractable on large corpora because of the quadratic increase in the number of candidate segments with frame rate. Here we choose a compromise with 40ms frames, corresponding to half of the mean duration of a phoneme, yielding a 4x theoretical slowdown compared to phoneme-based transcripts. In addition, we introduce an accelerated version of a Dirichlet process segmentation algorithm that replaces Gibbs sampling of each boundary by sampling entire parse trees using Dynamic Programming Beam Search, and updating the lexicon in batches instead of continuously. This achieves a 10-times speed-up compared to Johnson et al. (2007)’s implementation, with similar or superior performances.
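To make the trade-off concrete, the following back-of-the-envelope sketch (not part of the model; the 10-second utterance length and the counting scheme are illustrative assumptions) counts candidate segments of at most 800ms under two frame durations:

```python
# Rough count of candidate word segments in an utterance: both the number of
# start positions and the number of admissible lengths per position scale with
# 1/frame_duration, hence the roughly quadratic growth in work.
def n_candidate_segments(utt_ms: int, frame_ms: int, max_word_ms: int = 800) -> int:
    n_frames = utt_ms // frame_ms
    max_len = max_word_ms // frame_ms  # longest candidate segment, in frames
    return sum(min(max_len, n_frames - start) for start in range(n_frames))

print(n_candidate_segments(10_000, 80))  # ~phoneme-sized frames: 1,205 candidates
print(n_candidate_segments(10_000, 40))  # 40ms frames: 4,810 candidates, ~4x more
```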
Because of our ‘no clustering’ approach, we cannot evaluate our model using the word type-based metrics of the Zero Resource speech segmentation challenge (Versteegh et al., 2016; Dunbar et al., 2017). Here, we use two classes of metrics: boundary-related metrics (Token and Boundary F-score) that do not refer to types, and high-level word embedding metrics. The idea is to use a downstream language model to learn high-level word embeddings and evaluate them along semantic and Part-Of-Speech (POS) dimensions. We rely on a semantic similarity task from Dunbar et al. (2020) and introduce two new metrics.
The combination of our contributions yields an overall system, DP-Parse, that sets a new state-of-the-art on speech segmentation for five speech corpora by a large margin. By doing various ablation and anti-ablation studies (replacing unsupervised components by weakly supervised ones), we pinpoint the components of the system where there is still margin for improvement. In particular, we show that using better embeddings obtained with weak supervision, it is possible to double the segmentation F-score and substantially improve our new semantic and POS scores.
2 Related Work
2.1 Speech Sequence Embeddings
Speech Sequence Embedding (SSE) models take as input a piece of speech signal of any length and output a fixed-size vector. The main objective of these models is to represent the phonetic content of a given speech segment. A naive SSE model would extract frame-level features of a speech sequence, using for instance Wav2vec2.0 or HuBERT (Baevski et al., 2020; Hsu et al., 2021), and mean-pool the frames along the time axis. A more subtle approach is to train a self-supervised system on positive pairs of speech sequences. The model learns speech sequence representations thanks to contrastive loss functions (e.g., the infoNCE loss as in Algayres et al. [2022]; Jacobs et al. [2021] or the Siamese loss as in Settle and Livescu [2016]; Riad et al. [2018]) or reconstructive losses (e.g., the Correspondence Auto-Encoder from Kamper [2018], where the first element of the pair is compressed and decoded into the second element). Recently, a growing body of work leverages multilingual labeled datasets to build SSEs that can, to an extent, generalize to unseen languages (Hu et al., 2020a, b; Jacobs et al., 2021). In this work, we use a state-of-the-art self-supervised SSE model from Algayres et al. (2022).
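As a point of reference, here is a minimal sketch of the naive mean-pooling baseline described above, assuming the HuggingFace wav2vec 2.0 checkpoint (the checkpoint name and layer index are illustrative choices, not the paper's exact setup):

```python
import torch
from transformers import Wav2Vec2Model

# Naive SSE: mean-pool frame-level features from one transformer layer.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def naive_sse(waveform_16khz: torch.Tensor, layer: int = 8) -> torch.Tensor:
    """waveform_16khz: (batch, samples) float tensor sampled at 16 kHz."""
    with torch.no_grad():
        out = model(waveform_16khz, output_hidden_states=True)
    frames = out.hidden_states[layer]      # (batch, time, 768) 20ms frames
    return frames.mean(dim=1)              # one fixed-size vector per segment
```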
2.2 Speech Segmentation
Three main classes of models have been proposed to solve speech segmentation. Matching-first models (Park and Glass, 2007; Jansen and Van Durme, 2011; Hurtad, 2017) attempt to find high-quality pairs of identical segments and cluster them based on similarity, thereby building a lexicon of word types that may not cover the entire corpus (Bhati et al., 2020; Räsänen et al., 2015). Segmentation-first models try to exhaustively parse a sentence (Lee et al., 2015; Kamper et al., 2017a, b; Kamper and van Niekerk, 2020), while jointly learning a lexicon (and therefore also involving matching and clustering). Segmentation-only models directly discover likely word boundaries in running speech using prediction error (Bhati et al., 2021; Cuervo et al., 2021), without relying on any lexicon of word types. Our approach is a new hybrid of the last two lines of research: like a segmentation-first model, we jointly model lexicon and segmentation, but our lexicon is not a lexicon of types, thereby escaping matching and clustering errors, as in segmentation-only models.
2.3 Evaluating Speech Segmentation
Across the different word segmentation studies, the evaluation has been done according to two general classes of metrics: matching and clustering metrics for word embeddings (MAP, NED, grouping F-score, Type F-score), and segmentation metrics for boundaries (Token and Boundary F-scores) (Carlin et al., 2011; Ludusan et al., 2014). All of these metrics presuppose that the optimal segmentation strategy is aligned with the text-based gold standard (based on spaces and punctuation). Yet, it is entirely possible that such segmentation is not optimal for downstream applications, as witnessed by the fact that most current language models use subword units like BPEs or sentence pieces (Devlin et al., 2018). Therefore, we propose a third class of segmentation metrics based on downstream language modeling tasks. Chung and Glass (2018) showed that gold word boundaries provide sufficient information to learn good quality semantic embeddings from raw audio using a skipgram or word2vec objective, as assessed by semantic similarity benchmarks adapted to speech inputs. These datasets were also used to compile the sSIMI benchmark in the Zero Speech Challenge 2021 (Nguyen et al., 2020), but turn out to give very low and noisy results (at least for small training sets). Here, we introduce two new metrics evaluating semantic and grammatical similarity that we show to be more stable.
3 Method: Instance-based Dirichlet Process Parsing
This section describes the building blocks and notations used by DP-Parse, as depicted in Figure 1 and Table 1, respectively. At a high level, DP-Parse is a new adaptation of Goldwater’s Dirichlet process algorithm (Goldwater et al., 2009) for speech inputs. At the heart of Goldwater’s algorithm is a hierarchical Bayesian model that generates a joint distribution of a lexicon of words (a word is a sequence of phonemes) and a corpus (a sequence of words). The associated learning algorithm optimizes the likelihood of an observed corpus as a function of the (unobserved) lexicon and corpus segmentation. It does so by alternating between two EM steps: (Step 1) estimating the most probable lexicon given a segmentation, and (Step 2) sampling among the most probable segmentations given the lexicon. More precisely, at Step 1, the algorithm estimates the two sets of numbers that define the probability of a segmentation: L0w, the probability that a given speech segment w is a word, and Lw, the frequency of a given word w in the (segmented) input corpus. They are estimated by building tables of counts of phoneme-n-grams and words given a segmentation, respectively. At Step 2, these numbers are combined using Equation 3 (Section 3.3) and a new segmentation is sampled. Our adjustments to the original algorithm are the following:
Section 3.1: Phonemes are replaced by 40ms speech frames (from a pretrained self-supervised learning algorithm), words are replaced by fixed-length SSEs (using a pretrained self-supervised embedder).
Section 3.2: The tables of counts for computing L0w and Lw are replaced by k-Nearest-Neighbors (k-NN) indexes of embeddings (L0 and L, respectively), which can be viewed as instance lexicons of all possible n-frames and of segmented ‘words’, respectively. The count estimates are obtained using Gaussian kernel density estimation over k-NN.
Section 3.3: We make a small adaptation to the Dirichlet process formula.
Section 3.4: Instead of alternating between the two steps at each boundary as in the original Gibbs sampling algorithm, which is very time-consuming, we sample segmentations over entire utterances using a segment lattice, and update the lexicon over a batch of utterances.
Section 3.5: We initialize the system by selecting as an initial lexicon a list of short enough utterances, and precompute all the constant values (L0 and L0w) to optimize for speed.
Table 1: Notations used by DP-Parse.

| Notation | Definition |
|---|---|
| u | a unit u is a block of 40 ms of speech signal |
| w = (u0, …, uw) | a segment w is a sequence of units |
| Ew ∈ IR^p | a fixed-length embedding of a segment w computed by an SSE model |
| C = {v0, …, vN} | a corpus C is a set of utterances (segments without silence, found by VAD) |
| Cseg | a segmented corpus Cseg is a set of sequences of segments |
| D = {w0, …, wP} | the list of all possible subsegments of C that are possible words (between 40ms and 800ms) |
| L0 | a k-NN index of the embeddings of D |
| L | a k-NN index of all embeddings of the segments in Cseg |
| L0w | the frequency of Ew in L0 |
| Lw | the frequency of Ew in L |
So modified, the algorithm is no longer dependent on input type and can work either with discrete representations (text, represented as sequences of 1-hot vectors of phonemes) or continuous ones (speech, represented as embeddings).
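Schematically, the modified algorithm alternates between estimating frequencies from instance lexicons and re-segmenting the corpus in batches. The sketch below summarizes this loop; the function names are illustrative and do not correspond to the released implementation:

```python
# Schematic DP-Parse loop over embedded speech (or 1-hot text) inputs.
def dp_parse(corpus, sse_model, n_iters=5):
    D = embed_all_candidate_segments(corpus, sse_model)     # every 40ms-800ms span
    L0 = build_knn_index(subsample(D, 1_000_000))           # instance lexicon of all spans
    segmentation = initial_segmentation(corpus)             # short utterances as seed tokens
    for _ in range(n_iters):
        # Step 1: estimate token frequencies from the current segmentation.
        L = build_knn_index(embed_segments(segmentation, sse_model))
        # Step 2: re-segment each utterance with the lattice of Section 3.4,
        # updating the instance lexicon L once per batch (here, the whole corpus).
        segmentation = [sample_parse(utt, L, L0) for utt in corpus]
    return segmentation
```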
3.1 Speech Sequence Embeddings
We represent speech as 20ms frames obtained by selecting the 8th layer of a pretrained Wav2vec2.0 Base system from Baevski et al. (2020). Every two successive frames are tied together so that a speech sentence becomes a series of 40ms speech blocks. To represent a speech segment, we use the SSE model from Algayres et al. (2022): a self-supervised system trained with contrastive learning where positive pairs are obtained by data augmentation of the speech signal. This model shows good performance across different languages as measured by the Mean Average Precision on pre-segmented spoken words. The SSE model takes as input the pre-extracted Wav2vec2.0 frames and applies a single 1-D convolution layer with GLU (kernel size: 4, number of channels: 512, stride: 1) and a single transformer layer (attention heads: 4, size of attention matrices: 512 neurons, and feed-forward layer: 2048 neurons). A final max-pooling layer along the time axis is applied to obtain a fixed-size vector. To save computation, we reduce the dimensionality of Algayres et al.’s (2022) SSE model from 512 to 64 with a PCA trained on random speech segments extracted from the corpus at hand.
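For concreteness, here is a minimal PyTorch sketch of an encoder with this architecture, using the hyper-parameters stated above (a sketch only; the actual model, its contrastive training, and the PCA step are those of Algayres et al., 2022):

```python
import torch.nn as nn

# 1-D conv + GLU, one transformer layer, max-pooling over time -> fixed-size SSE.
class SSEEncoder(nn.Module):
    def __init__(self, in_dim=768, hid=512):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, 2 * hid, kernel_size=4, stride=1, padding=2)
        self.glu = nn.GLU(dim=1)                   # halves the channels back to 512
        layer = nn.TransformerEncoderLayer(d_model=hid, nhead=4,
                                           dim_feedforward=2048, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, frames):                     # frames: (batch, time, in_dim)
        x = self.conv(frames.transpose(1, 2))      # (batch, 2*hid, time')
        x = self.glu(x).transpose(1, 2)            # (batch, time', hid)
        x = self.transformer(x)
        return x.max(dim=1).values                 # (batch, hid) fixed-size vector
```

The resulting 512-dimensional vectors are then reduced to 64 dimensions with PCA, as described above.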
In addition to their unsupervised SSE model, Algayres et al. (2022) provide a weakly supervised version. They trained it with the same contrastive loss (infoNCE) using positive pairs obtained with a time-aligned transcription of the speech signal. The weakly supervised version will be used as a topline model, as it scores much higher on Mean Average Precision than the self-supervised one.
3.2 Density Estimation from Gaussian-smoothed k-Nearest-Neighbors
F (Equation 1) is the Parzen-Rosenblatt window method for density estimation (Parzen, 1962) and returns a continuous value between 0 and k, estimating the frequencies L0w and Lw. It rests on two hyper-parameters: k, which is set to 1000, and β, the standard deviation of the Gaussian kernel. To set β, we follow the observation from Algayres et al. (2022): around 50% of the segments in the development dataset of the Zerospeech Challenge 2017 appear to have a frequency of one. Therefore, we set β at runtime so that 50% of the calculated frequencies equal one (i.e., an F value below a small ϵ > 0). We did not change these hyper-parameters for any of the tested languages.
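A minimal sketch of this estimator, assuming a FAISS flat index over an instance lexicon; the exact kernel and normalization used by DP-Parse may differ:

```python
import numpy as np
import faiss

def make_index(embeddings: np.ndarray) -> faiss.IndexFlatL2:
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(np.ascontiguousarray(embeddings, dtype=np.float32))
    return index

def soft_count(index, e_w: np.ndarray, k: int = 1000, beta: float = 0.1) -> float:
    """Gaussian-smoothed k-NN frequency of the segment embedded as e_w: a value
    in [0, k] obtained by summing a Gaussian kernel over the k nearest neighbors."""
    dists, _ = index.search(np.ascontiguousarray(e_w[None, :], dtype=np.float32), k)
    return float(np.exp(-dists[0] / (2 * beta ** 2)).sum())  # dists are squared L2
```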
3.3 Dirichlet Process Formula
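The equations of this section are not reproduced here. As a reading aid only, the standard Dirichlet-process predictive probability from Goldwater et al. (2009), with type counts replaced by the soft counts of Section 3.2, would take roughly the following form; the actual Equation 3 also involves the duration penalty q of Equation 5 and may differ in its details:

```python
def word_prob(L_w: float, L0_w: float, n_tokens: float, n_candidates: float,
              alpha0: float = 100.0) -> float:
    # Goldwater-style posterior predictive: interpolate the (soft) token count of w
    # in the segmented corpus with a base probability derived from the instance
    # lexicon of all candidate segments, controlled by the concentration alpha0.
    p0 = L0_w / n_candidates
    return (L_w + alpha0 * p0) / (n_tokens + alpha0)
```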
3.4 Segmentation Lattice
In Goldwater et al. (2009), the learning algorithm uses Gibbs sampling, where word boundaries are sampled one at a time, requiring all parameters of the Dirichlet Process model to be recomputed after each sample. On text data, this is not a problem, as updating the model’s parameters (Lw and L0w) is fast: no k-NN search is needed, and parameters are computed exactly by matching and counting strings. It becomes a bottleneck for speech segmentation, where the parameters are computed with density estimation and k-NN search. One way to alleviate this problem is to incorporate the dynamic programming version of the Gibbs sampler from Mochihashi et al. (2009). Another way, as in this paper, is to build a segmentation lattice over each utterance and sample among the n-best segmentations.
For each utterance in C, we create a segmentation lattice that provides a compact view for all possible segmentation paths. A segmentation path is a sequence of consecutive segmentation arcs that covers the full utterance. Each arc starts and finishes in-between the units and is bounded within a minimal and maximal length. An example of a segmentation lattice can be found in Figure 2. Each segmentation arc is associated with its S score from Equation 6. For each utterance, the n-best segmentations are computed with Dynamic Programming Beam Search, and we sample from a softmax of their total S scores. One advantage of this procedure is that it is possible to parallelize utterance segmentation across large batches and only update the lexicon L after each batch. In our experiments, we take the entire corpus C to be a single batch.
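A compact sketch of this decoding step: arcs are bounded in length, a left-to-right beam search keeps the n best partial paths at each frame position, and one full path is sampled from a softmax over the total S scores. Here `score_arc` stands in for Equation 6; all other details are illustrative:

```python
import numpy as np

def sample_segmentation(n_units, score_arc, min_len=1, max_len=20, beam=10, rng=None):
    rng = rng or np.random.default_rng()
    hyps = {0: [(0.0, [])]}                      # position -> list of (score, boundaries)
    for pos in range(1, n_units + 1):
        cands = []
        for start in range(max(0, pos - max_len), pos - min_len + 1):
            for score, bounds in hyps.get(start, []):
                cands.append((score + score_arc(start, pos), bounds + [pos]))
        hyps[pos] = sorted(cands, key=lambda c: -c[0])[:beam]   # keep the n-best paths
    scores = np.array([s for s, _ in hyps[n_units]])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return hyps[n_units][rng.choice(len(probs), p=probs)][1]    # sampled boundary list
```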
3.5 Initializing DP-Parse
We create a corpus C as a collection of utterances by applying the pyannote VAD (Bredin et al., 2019) with a threshold of 200 ms. As shown in Figure 1, initialization (Step 0) contains several sub-steps. The first (0.1) is to provide an initial segmentation of C using a simple heuristic: all sentences shorter than 800ms are treated as word tokens; the others are discarded. The intuition is that short sentences could be words in isolation, which provides the seed for an initial lexicon.
The second sub-step (0.2) is to create the list D composed of the embeddings of all possible speech segments in C that are possible words, which we set to be anything between 40ms and 800ms (by 40ms increments). To embed a speech segment, we use the SSE model in Section 3.1.
The third sub-step (0.3) consists in the construction of L0: a k-NN index of all embedded segments in D. In practice, we found that randomly subsampling D to one million embedded segments worked well (see grid-search in Appendix A.2.2), and we precompute L0w as explained in Section 3.2.
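The three sub-steps can be summarized as follows (a sketch; `subsample`, `make_index`, and `sse_model.embed` are illustrative placeholders, not the released API):

```python
def init_dp_parse(utterances, sse_model, frame_ms=40, max_ms=800, n_keep=1_000_000):
    # 0.1: utterances shorter than 800ms are treated as seed word tokens.
    seed_tokens = [u for u in utterances if u.duration_ms <= max_ms]
    # 0.2: embed every candidate span between 40ms and 800ms, by 40ms increments.
    D = []
    for u in utterances:
        n_frames = int(u.duration_ms // frame_ms)
        for start in range(n_frames):
            for length in range(1, min(max_ms // frame_ms, n_frames - start) + 1):
                D.append(sse_model.embed(u, start, start + length))
    # 0.3: subsample D to ~1M embeddings, build the k-NN index L0, precompute L0w.
    L0 = make_index(subsample(D, n_keep))
    return seed_tokens, D, L0
```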
4 Evaluation Metrics and Settings
4.1 Boundary Level Segmentation Metrics
In the Zerospeech Challenge 2017 (Dunbar et al., 2017), two metrics measure how well discovered boundaries match with the gold word boundaries. These metrics, the Boundary and Token F-score, are obtained by comparing the discovered sets of tokens to the gold ones obtained by force-aligning spoken sentences with their transcription. The evaluation is done in phoneme space and each discovered boundary is mapped to a real phoneme boundary: if a boundary overlaps a phoneme by more than 30ms or more than half of the phoneme length, the boundary is placed after that phoneme (otherwise it is placed before).
4.2 Semantic and POS Embedding Metrics
The idea here is to evaluate segmentation through its effect on a downstream language model. The evaluation is less direct than the segmentation metrics above, as it assumes that the segmentation is used to turn a spoken utterance into a series of fixed-size SSEs, themselves used to train a continuous-input Speech Language Model (LM). The metrics will therefore reflect each of these components (segmentation, SSE, Speech LM), but by keeping the SSE and Speech LM components constant, we can hope to study systematically the effect of the segmentation component. Our evaluation focuses on word-level representations, which we call here High-Level Speech Embeddings (HLSE). They are the speech equivalent of Word2Vec representations (Mikolov et al., 2013), yet coming from an inexact segmentation. Here, we assume that a Speech LM has been trained on SSE inputs on a given dataset with a given segmentation and can be used at test time to provide embeddings associated with a set of test words. As our segmentation models work on whole utterances, we present the test words in continuous utterances and then mean-pool the HLSEs from the Speech LM that overlap with that word to obtain a single vector. An overview of the building of generic HLSEs and their evaluation is shown in Figure 3.
We introduce two zero-shot tasks to evaluate HLSEs that do not require training a further classifier and are purely based on distances between these embeddings. They are based on machine-ABX, a type of distance-based metric introduced in Schatz et al. (2013) that computes a discrimination score over sets of (A,B,X) triplets. For each triplet, A and X belong to the same category and B plays the role of a distractor. The task for the model is to find the distractor in each triplet based on the cosine distance between the HLSEs of A, B, and X. In the first task (ABXsem), A and X are synonyms whereas B is semantically related to neither A nor X. In the second task (ABXPOS), A and X share the same POS tag whereas B has a different one. The triplets are all extracted from the Librispeech corpus training set. See Appendix A.1 for details on the construction of the triplets.
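A minimal sketch of the resulting score computation (ties and any triplet weighting used in practice are ignored here):

```python
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def abx_score(triplets) -> float:
    """triplets: iterable of (hlse_A, hlse_B, hlse_X) vectors; A and X share the
    category, B is the distractor. Score = % of triplets where X is closer to A."""
    correct = [cosine_dist(a, x) < cosine_dist(b, x) for a, b, x in triplets]
    return 100.0 * float(np.mean(correct))
```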
Another task, sSIMI from Nguyen et al. (2020), also evaluates high-level representations of spoken words in a zero-shot and distance-based fashion. Yet, sSIMI differs from our ABX tasks in two important ways. First, sSIMI presents test words as pre-segmented chunks of speech without the context of the original sentence to help the Speech LM. Second, the task for the Speech LM is to predict a semantic similarity score given by human annotators to pairs of words. The Speech LM encodes the pre-segmented spoken words into HLSEs and the distance between the HLSEs is correlated with the human judgements. The correlation coefficient r is used as the final sSIMI score.
4.3 Datasets and Experimental Settings
The Zerospeech Challenge 2017 (Dunbar et al., 2017) provides five corpora to evaluate speech segmentation systems. These corpora are composed of speech recordings from different languages split into sentences using Voice Activity Detection. Three corpora (Mandarin, 2h30; English, 45h; French, 24h) are used for development, and two ‘surprise’ corpora for testing (German, 25h; Wolof, 10h). On each corpus, a separate SSE model from Algayres et al. (2022) is trained and a separate run of DP-Parse is performed to produce a full-coverage segmentation. DP-Parse’s hyper-parameters from Table 2, as well as q’s formula (Equation 5), were grid-searched to maximize token-F1 segmentation scores over the three development datasets. The two remaining test sets are used to show generalization of DP-Parse to new, unseen languages. More details on the hyper-parameter search are given in Appendix A.2.2.
Table 2: DP-Parse hyper-parameters.

| Hyper-parameter | Value |
|---|---|
| Minimum segment length | 40ms |
| Maximum segment length | 800ms |
| No. of segments in L0 | 1M |
| Concentration parameter α0 (eq. 3) | 100 |
| No. of neighbors in k-NN (eq. 1) | 100 |
| Lattice beam size | 10 |
| Duration penalty q (eq. 5) | γ = 1.8; δ = 2 (text); δ = 4 (speech) |
For the ABX and sSIMI tasks, we train a BERT model as a Speech LM on the training set of the Librispeech dataset (Panayotov et al., 2015), composed of 960 hours of English recordings. We proceed by first training an SSE model from Algayres et al. (2022) on Librispeech. This dataset is then segmented using DP-Parse. Each segmented speech sequence is aggregated into a single vector using the pre-trained SSE model. Segmented sentences are used to train a BERT model with masked language modeling and the Noise Contrastive Estimation (NCE) loss (Gutmann and Hyvärinen, 2010) (see Figure 4). The BERT model is composed of 12 layers, 12 attention heads per layer with 768 neurons each, and an FFN size of 3072 neurons. To compute the NCE, two heads p and h are composed respectively of one fully connected linear layer with 256 neurons and two fully connected layers with ReLU activation, also with 256 neurons. Batches are composed of utterances from a single speaker, and the 200 negative samples for the NCE are chosen within the batch. Fifteen percent of SSEs are masked in each batch during training. Only the BERT is trained; gradients are not propagated into the SSEs. As BERT is a multi-layer transformer model, the scores for the ABX tasks and sSIMI are obtained by grid-searching the layer that performs best on the dev set of each task and evaluating that same layer on the corresponding test sets. The development and test sets for the ABX tasks are composed of triplets of sentences extracted from the training set of the Librispeech corpus. For sSIMI, the development and test set both contain two subsets: one with spoken words extracted from the Librispeech training set and one with spoken words synthesized with WaveNet (van den Oord et al., 2016).
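The following simplified sketch illustrates one such training step (masking, BERT-style encoding, and an NCE/InfoNCE-style loss over in-batch negatives). The `encoder`, `head_p`, and `head_h` modules are assumed to be defined elsewhere, and details such as the exact masking scheme are illustrative rather than the released training code:

```python
import torch
import torch.nn.functional as F

def nce_step(encoder, head_p, head_h, sses, mask_prob=0.15, n_neg=200):
    # sses: (batch, seq, dim) fixed-size segment embeddings from one speaker
    mask = torch.rand(sses.shape[:2]) < mask_prob
    inputs = sses.masked_fill(mask.unsqueeze(-1), 0.0)   # crude masking by zeroing
    preds = head_p(encoder(inputs))                      # (batch, seq, 256)
    targets = head_h(sses)                               # (batch, seq, 256)
    p, t = preds[mask], targets[mask]                    # masked positions only: (n, 256)
    neg_idx = torch.randint(0, t.shape[0], (t.shape[0], n_neg))
    pos_logit = (p * t).sum(-1, keepdim=True)            # (n, 1) positive scores
    neg_logits = (p.unsqueeze(1) * t[neg_idx]).sum(-1)   # (n, n_neg) in-batch negatives
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    return F.cross_entropy(logits, torch.zeros(t.shape[0], dtype=torch.long))
```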
5 Results
5.1 Results on Word-level Segmentation
Regarding text segmentation, DP-Parse compares favorably to the original Dirichlet Process Unigram model from Goldwater et al. (2009), as implemented in the Adaptor Grammar package of Johnson et al. (2007). It produces a 21-point increase in token F1 (and a 3-point increase in boundary F-score) over the original version, together with a 10× runtime speed-up (Table 3).
Table 3: Runtimes of DP Unigram and DP-Parse on the three development corpora.

| | Mandarin (2h30) | French (24h) | English (45h) |
|---|---|---|---|
| DP Unigram | 61m39s | 240m44s | 62m11s |
| DP-Parse | 0m55s | 25m12s | 62m11s |
| Speed-up factor | 67.3 | 9.6 | 9.8 |
Regarding speech segmentation, we compare DP-Parse with the three best speech segmentation models submitted to the Zerospeech Challenge 2017 (Bhati et al., 2020; Kamper et al., 2017b; Räsänen et al., 2015). Table 4 reports the token F1 and boundary F1 obtained by these models over the 5 datasets of the Zerospeech Challenge 2017. Across all corpora, DP-Parse outperforms its competitors by at least 5 points in both boundary and token F1. We also introduce a naive baseline that draws boundaries every 120 milliseconds, disregarding the content of the speech signal. It turns out to be surprisingly competitive with all speech segmentation systems except DP-Parse. The latter is the only existing speech segmentation system that beats the naive baseline on all languages.
Table 4: Token F1 and Boundary F1 scores on the five datasets of the Zerospeech Challenge 2017.

Token F1 ↑

| Model | Mand. (dev) | French (dev) | Engl. (dev) | German (test) | Wolof (test) | avg. |
|---|---|---|---|---|---|---|
| Speech data | | | | | | |
| Every 120ms | 10.7 | 11.7 | 9.8 | 5.4 | 12.4 | 10.0 |
| Syllables× | 4.4 | 5.1 | 4.1 | 3.0 | 4.2 | 4.2 |
| SEA+ | 12.1 | 8.3 | 8.6 | 7.5 | 14.8 | 10.3 |
| ES-Kmeans∨ | 8.1 | 6.30 | 19.2 | 14.5 | 10.9 | 11.8 |
| DP-Parse* | 16.0 | 15.3 | 21.9 | 13.4 | 17.5 | 16.8 |
| DP-Parse* + weak sup. | 28.2 | 30.9 | 31.3 | 34.4 | 39.2 | 35 |
| Text data | | | | | | |
| DP Unigram‡ | 34.9 | 57.0 | 64.7 | 33.6 | 38.8 | 45.7 |
| DP-Parse* | 50.0 | 68.1 | 78.5 | 67.4 | 69.1 | 66.6 |

Boundary F1 ↑

| Model | Mand. (dev) | French (dev) | Engl. (dev) | German (test) | Wolof (test) | avg. |
|---|---|---|---|---|---|---|
| Speech data | | | | | | |
| Every 120ms | 52.4 | 49.4 | 48.2 | 41.3 | 55.0 | 49.2 |
| Syllables× | 49.1 | 46.3 | 43.7 | 38.1 | 40.6 | 43.6 |
| SEA+ | 52.2 | 48.4 | 46.1 | 40.9 | 54.2 | 48.4 |
| ES-Kmeans∨ | 54.5 | 43.3 | 56.7 | 52.3 | 52.8 | 51.9 |
| DP-Parse* | 59.3 | 55.9 | 60.0 | 51.5 | 59.0 | 57.1 |
| DP-Parse* + weak sup. | 69.9 | 70.0 | 68.4 | 64.4 | 75.5 | 69.4 |
| Text data | | | | | | |
| DP Unigram‡ | 79.5 | 86.1 | 88.9 | 71.6 | 77.3 | 80.7 |
| DP-Parse* | 76.0 | 84.3 | 89.8 | 83.5 | 84.1 | 83.5 |
Most speech segmentation systems from the Zerospeech Challenge 2017 rely on off-the-shelf self-supervised representations of speech. The hope is that, without modification, these systems would benefit from future improvements in speech representations and mechanically lead to higher segmentation scores. Yet, such hopes have never been verified by explicit experiments. Here, we test the ability of DP-Parse to improve with better inputs. For that, we use the weakly supervised version of the SSE model from Algayres et al. (2022) trained with 10 hours of labeled speech data. On this type of input, DP-Parse doubles its token F1 score.
5.2 Results on Semantic and POS Metrics
In Table 5, the segment-based sections show how a BERT model can benefit from speech segmentation and SSE modeling on the tasks ABXsem, ABXPOS and sSIMI. To do that, we trained BERT models along the pipeline depicted in Figure 4.
Table 5: ABXsem, ABXPOS, and sSIMI scores; the bracketed number is the best-performing layer.

| Input Representations | Segments | SSE | BERT | ABXsem (dev) ↑ | ABXPOS (dev) ↑ | ABXsem (test) ↑ | ABXPOS (test) ↑ | sSIMI (dev) ↑ | sSIMI (test) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| No supervision, frame based | | | | | | | | | |
| CPC+ | – | – | – | 51.22 [2] | 53.67 [2] | 53.86 [2] | 54.68 [2] | 6.14 [2] | 4.34 [2] |
| W2V2 base× | – | – | – | 54.96 [8] | 55.78 [8] | 56.20 [8] | 56.86 [8] | 3.85 [9] | 5.21 [9] |
| HuBERT base‡ | – | – | – | 54.65 [8] | 55.65 [8] | 55.12 [8] | 56.77 [8] | 3.53 [8] | 0.68 [8] |
| W2V2 large× | – | – | – | 57.69 [13] | 59.86 [13] | 58.88 [13] | 60.93 [13] | 2.04 [11] | 6.82 [11] |
| HuBERT large‡ | – | – | – | 58.04 [23] | 57.77 [23] | 58.95 [23] | 57.19 [23] | 4.14 [7] | 0.55 [7] |
| No supervision, segment based | | | | | | | | | |
| W2V2 base [8]× | 120ms | unsup† | | 59.65 [7] | 70.66 [7] | 58.64 [7] | 71.68 [7] | 4.10 [9] | 4.19 [9] |
| W2V2 base [8]× | Syllables∨ | unsup† | | 59.09 [6] | 68.58 [6] | 58.49 [6] | 68.81 [6] | 0.96 [9] | 3.48 [9] |
| W2V2 base [8]× | DP-Parse | unsup† | | 61.09 [8] | 68.15 [7] | 62.73 [8] | 69.35 [7] | 4.54 [5] | 3.49 [5] |
| Weak supervision, segment based | | | | | | | | | |
| W2V2 base [8]× | 120ms | weak-sup† | | 59.64 [8] | 70.04 [8] | 59.43 [8] | 70.58 [8] | 5.83 [7] | 5.81 [7] |
| W2V2 base [8]× | Syllables∨ | weak-sup† | | 61.44 [6] | 70.89 [7] | 60.28 [6] | 71.03 [7] | 0.51 [9] | 0.79 [9] |
| W2V2 base [8]× | DP-Parse | weak-sup† | | 64.24 [8] | 72.65 [7] | 64.82 [8] | 72.67 [7] | 5.19 [9] | 4.51 [9] |
| Partial supervision, segment based | | | | | | | | | |
| W2V2 base [8]× | gold words | unsup† | | 67.89 [5] | 75.46 [7] | 67.17 [5] | 75.61 [7] | 6.82 [5] | 6.30 [5] |
| W2V2 base [8]× | gold words | weak-sup† | | 71.38 [5] | 77.75 [6] | 71.04 [5] | 77.64 [6] | 2.72 [5] | 6.83 [5] |
| text | DP-Parse | 1-hot | | 75.38 [4] | 78.57 [4] | 74.89 [4] | 78.49 [4] | 21.42 [2] | 19.54 [2] |
| Full supervision, segment based | | | | | | | | | |
| text | gold words | 1-hot | | 83.23 [3] | 81.07 [3] | 84.12 [3] | 80.09 [3] | 41.78 [1] | 36.83 [1] |
Let us first analyze the scores on the ABX tasks. The no supervision, segment based section shows that DP-Parse and two less-performant segmentation strategies lead to comparable ABX scores. Yet, the weak supervision section shows that, once the quality of the SSEs is improved with weak supervision, DP-Parse outperforms the other segmentation methods. The partial supervision section shows that with either perfect word boundaries or perfect segment representations (1-hot vectors), the ABX scores increase even more. Finally, the full supervision section is a regular text-based LM that serves as a topline model.
As baseline systems, we propose frame-based approaches that use neither speech segmentation nor SSEs. Speech-to-frames models like Wav2vec2.0, HuBERT, or CPC (Baevski et al., 2020; Hsu et al., 2021; van den Oord et al., 2019) are trained with masked language modeling in the spirit of text-based LMs but on the raw speech signal. Therefore, these models should have already acquired some knowledge of semantics and POS tags. We evaluate speech-to-frames models directly on ABXsem, ABXPOS, and sSIMI. To do that, we simply skip the Speech LM as well as the segmentation and SSE steps from the method in Figure 3. Speech-to-frames models are used to encode the whole speech sentences from the ABX triplets, and the frames within the word boundaries provided by the ABX tasks are mean-pooled to form the HLSEs. The results show that Wav2vec2.0 Large and HuBERT Large are strong baseline systems for our ABX metrics. Yet, they score on average below the segment-based approaches.
Regarding the scores obtained on sSIMI from Nguyen et al. (2020), the table of results (Table 5) shows important downsides of this task compared to our ABX tasks. First, the scores sometimes show large inconsistencies across development and test sets, which is not the case for the ABX tasks. Second, while our ABX tasks are sensitive to improvements in speech segmentation and SSE modeling, sSIMI is not and shows comparable scores across most sections of Table 5. One reason for that could be that self-supervised speech systems (Wav2vec2.0, HuBERT, and CPC) as well as the SSE models from Algayres et al. (2022) are not trained to encode short sequences of speech, especially when extracted as chunks from a sentence.
6 Conclusion and Open Questions
We introduce a new speech segmentation pipeline that sets the state-of-the-art on the Zerospeech datasets at 16.8 token F1. The whole pipeline needs no specific tuning of hyper-parameters, making it ready to use on any new language. We showed that the problem of speech segmentation can be reduced to the problem of learning discriminative speech representations. Indeed, using different levels of supervision, our pipeline reaches a token F1 score of up to 35. Therefore, as long as the field of unsupervised representation learning makes headway, this method should automatically produce higher token F1 scores.
A first avenue of improvement of the current work is in the SSE component. Here, we took the system described in Algayres et al. (2022) out of the box and showed good performance in speech segmentation compared to the state of the art, but there remains a large margin of improvement compared to text-based systems. A recent unpublished paper (Kamper, 2022), also based on the non-lexical principle, came to our attention and shows similar or slightly better results than ours on a subset of the ZR17 languages. Kamper (2022) also uses a segmentation lattice that resembles ours for inference. Yet, as we have shown, our system improves monotonically with input embedding quality, and could therefore reach much better performance if the speech sequence embedding component were better trained. Further work is needed to improve the SSE component based on purely unsupervised methods.
Despite our computational speedup, our current system would face challenges scaling up to datasets larger than Librispeech. While it uses FAISS, an optimized search library, it is unclear that storing each instance of a possible segment is necessary. One interesting avenue of research would therefore be to downsample both instance lexicons to alleviate space complexity and/or to trade storage for compute by recalculating the embeddings on-line.
Another interesting improvement would be to compute bigram probabilities instead of unigram probabilities, as previously explored in Goldwater et al. (2009). Yet, this possibility looks challenging, as explained in Appendix A.2.3.
Finally, our method showed promising results in terms of semantic and POS encoding when used as a preprocessor for a language model. Such a phenomenon is displayed by our new in-context semantic tasks, but not by a previous without-context semantic task from Nguyen et al. (2020). Further work is needed to show whether this can translate into high-quality generations when used as a component of generative systems (Lakhotia et al., 2021).
A Appendix
A.1 Construction of Triplets for ABXsem and ABXPOS
Let us write the series of triplets from our ABX tasks as (Ai, Bi, Xi) for 0 ≤ i < N. For all i, Ai is defined as a tuple containing a recording of a whole sentence, the temporal boundaries of a word within that sentence, and that word’s phonetic transcription. Bi and Xi are defined identically. The sentences are extracted from the Librispeech dataset, a 960-hour corpus of read English literature.
ABXsem is composed of 502 pairs (evenly split into a development and test set) of words created using a synonym dictionary. By sampling a list of distractors for each pair, we reached 1557 different triplets. For each unique word found in the triplets, we sample 10 occurrences from the sentences of the Librispeech corpus. In total, ABXsem is computed over 1.5M triplets. ABXPOS is composed of 9997 pairs of words built from the WordNet database (Fellbaum, 1998). The words in each pair have the same POS tag, which can be either noun, verb, or adjective. By sampling distractors and 10 occurrences of each word type, we reached 37.3M triplets.
A.2 Ablation Study
A.2.1 k-means Instead of k-NN
To estimate DP-Parse parameters (L0w and Lw), we followed a non-clustering approach with a k-NN density estimation. In this section, instead, we use k-means clustering to estimate DP-Parse parameters. We used the transcriptions of the development sets from the Zerospeech Challenge 2017 to give the k-means the true number of clusters it is supposed to find when clustering SSEs from L0 and L. The values of L0w and Lw are then given by the size of the cluster in which w falls. As shown in Table 6, the segmentation scores obtained using k-means are much lower than those obtained with k-NN. Indeed, as shown in a study by Algayres et al. (2020), k-means is subject to the uniform effect (Wu, 2012), which makes it ill-suited to estimating frequencies on highly skewed distributions, such as the distribution of word types, which follows Zipf's law (Zipf, 1949).
A.2.2 DP-Parse Hyper-parameters
In Table 7, we provide DP-Parse speech segmentation performance, as measured by token-F1 scores, over three datasets (Mandarin, English, and French) for different hyper-parameter values. Surprisingly, increasing the number of neighbors (k), the number of samples in L0 (#L0), or the beam size does not improve token-F1 scores. DP-Parse time complexity scales linearly with each of these three parameters, therefore we keep their values as low as possible.
Table 7: Token-F1 scores for different hyper-parameter values.

| Hyper-parameter | Mandarin | French | English | average |
|---|---|---|---|---|
| k = 10 | 16.8 | 14.7 | 21.5 | 17.7 |
| k = 100 | 16.7 | 15.3 | 21.9 | 18.0 |
| k = 200 | 16.9 | 15.4 | 21.9 | 18.0 |
| beam = 5 | 16.6 | 15.4 | 21.9 | 18.0 |
| beam = 10 | 16.7 | 15.3 | 21.9 | 18.0 |
| beam = 100 | 16.8 | 15.2 | 21.9 | 18.0 |
| #L0 = 300k | 16.9 | 14.7 | 21.5 | 17.7 |
| #L0 = 1M | 16.7 | 15.3 | 21.9 | 18.0 |
| #L0 = 3M | 16.6 | 15.4 | 22.0 | 18.0 |
| α0 = 10^-4 | 16.4 | 15.0 | 21.4 | 17.6 |
| α0 = 10 | 16.3 | 15.3 | 21.7 | 17.8 |
| α0 = 10^2 | 16.7 | 15.3 | 21.9 | 18.0 |
| α0 = 10^3 | 17.0 | 15.1 | 21.7 | 17.9 |
| α0 = 10^5 | 16.3 | 15.0 | 19.7 | 17.0 |
| δ = 1 | 9.0 | 13.4 | 14.5 | 12.3 |
| δ = 3 | 17.0 | 14.8 | 21.7 | 17.8 |
| δ = 4 | 16.7 | 15.3 | 21.9 | 18.0 |
| δ = 5 | 15.8 | 15.0 | 21.4 | 17.4 |
| δ = 9 | 15.2 | 14.8 | 19.9 | 16.6 |
| γ = 0 | 14.7 | 14.7 | 19.9 | 16.4 |
| γ = 1.6 | 16.0 | 15.0 | 21.7 | 17.6 |
| γ = 1.8 | 16.7 | 15.3 | 21.9 | 18.0 |
| γ = 2 | 16.9 | 14.9 | 21.6 | 17.8 |
Another unexpected result is the low impact of the concentration parameter (α0). The value of this parameter should be crucial, as it controls the implicit vocabulary by controlling the number of rare words in the segmentation. This observation led us to create the penalty function q from Equation 5, which offers control over word lengths and therefore also impacts the implicit vocabulary. This time, we noticed that q’s shape strongly impacts token-F1 scores. By carefully tuning the penalty function, we increased the token-F1 from 16.4, with γ = 0 (i.e., no penalty function), to 18.0, with γ = 1.8 and δ = 4.
A.2.3 Hierarchical Dirichlet Process
Goldwater’s bigram Hierarchical Dirichlet Process (bigram-HDP), presented in Goldwater et al. (2009) for text segmentation, computes the probability of segmenting two consecutive word candidates, P(wi | wi−1), instead of only one, P(wi), as in the unigram Dirichlet process (unigram-DP). In Goldwater et al. (2009), computing the bigram-HDP probability requires the number of different types that can be found in all previously segmented bigrams with wi as second word. This is particularly difficult to adapt to speech segmentation with DP-Parse, because counting types requires an explicit clustering step. We doubt that using k-means would work, as we have shown that clustering SSEs with k-means works poorly, even if the true number of clusters is known (see Table 6). More advanced clustering techniques could work better than k-means, as shown in Kamper et al. (2014), for example Chinese Whispers clustering, hierarchical k-means, or probabilistic clustering using GMMs; yet incorporating them into DP-Parse would require a large effort.
Notes
Our open-source implementation of DP-Parse: https://gitlab.cognitive-ml.fr/ralgayres/dp-parse.
We remove from the list the embeddings corresponding to overlapping segments.
References
Author notes
Action Editor: Mauro Cettolo