DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Abstract Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a ‘space’ delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.


Introduction
One of the first tasks that infants face is to learn the words of their native language(s). To do so, they must solve a word segmentation problem, since words are rarely uttered in isolation but come up in multi-word utterances (Brent and Cartwright, 1996). Segmenting utterances into word or subword units is also an important step in NLP applications, where an input string of orthographic symbols is tokenized into words or sentence pieces before being fed to a language model. While many writing systems use white spaces that make (basic) tokenization a relatively simple task, others do not (e.g. Chinese) and turn tokenization into a challenging machine learning problem (Li and Yuan, 1998). A similar situation arises for 'textless' language models based on units derived from raw audio (Lakhotia et al., 2021), which, like infants, do not have access to isolated words or word-separating symbols.
The most successful approach to unsupervised word segmentation for text inputs is based on non-parametric Bayesian models (Goldwater et al., 2006, 2009; Kawakami et al., 2019; Kamper et al., 2017b; Berg-Kirkpatrick et al., 2010; Eskander et al., 2016; Johnson et al., 2007; Godard et al., 2018) that jointly segment utterances and build a lexicon of frequent word forms using a Dirichlet process. The intuition is that frequent word forms function as anchors in a sentence, enabling the segmentation of novel words (like 'dax' in 'did you see the dax in the street?'). Such models are tested on phonemized texts obtained by converting text into a stream of phonemes after removing spaces and punctuation marks. Even though phonemized text may seem a reasonable approximation of continuous speech, such models have given disappointing results when applied directly to speech inputs. Since the direct comparison between speech-based and text-based models was introduced by Ludusan et al. (2014), the performance gap between these two types of inputs, as documented in three iterations of the Zero Resource Speech Challenge focused on word segmentation, has remained large (Versteegh et al., 2016; Dunbar et al., 2020). We attribute these difficulties to two major challenges posed by speech compared to text inputs: acoustic variability and temporal granularity, which we address in our contribution.
The first and most important challenge is acoustic variability. In text, all tokens of a word are represented the same. In speech, each token of a word is different, depending on background noise, intonation, speaker voice, speech rate, etc. Text-inspired speech segmentation algorithms (Kamper et al., 2017b) apply a clustering step to word tokens in order to get back to word types. We believe that this step is unnecessary and is responsible for the low performance of existing systems, as errors in the clustering step are not recoverable and negatively impact word segmentation. We propose an algorithm that segments speech utterances based on an instance lexicon of word tokens, instead of a lexicon of word types. Each speech segment instance of the training set is represented as a distinct memory trace, and these traces are used to estimate word form frequency using a k-NN algorithm, without having to refer to a discrete word type but still applying the same Dirichlet process logic. As for the representation of these word tokens, we follow recent approaches that use fixed-length Speech Sequence Embeddings (SSE) (Thual et al., 2018; Kamper et al., 2017b). We use state-of-the-art SSEs from Algayres et al. (2022) that have either been trained in a self-supervised fashion or with weak labels, in order to assess how the segmentation model behaves with inputs of increasing quality.
The second challenge is related to the fact that, the speech waveform being continuous in time, the number of possible segmentation points grows very large, making the optimization of segmentation of large speech corpora using Bayesian techniques intractable. Some models reduce the number of segmentation points using phoneme boundaries (Lee and Glass, 2012; Bhati et al., 2021), syllable boundaries (Kamper et al., 2017b) or a constant discretization of the time axis into speech frames, subject to the following trade-off: too coarse frames may miss some short word boundaries, while too fine-grained ones may render segmentation intractable on large corpora because of the quadratic increase in the number of segmentations with frame rate. Here we choose a compromise with 40ms frames, corresponding to half of the mean duration of a phoneme, yielding a 4x theoretical slowdown compared to phoneme-based transcripts. In addition, we introduce an accelerated version of a Dirichlet process segmentation algorithm that replaces Gibbs sampling of each boundary by sampling entire parse trees using Dynamic Programming Beam Search, and updates the lexicon in batches instead of continuously. This achieves a 10 times speed-up compared to Johnson et al. (2007)'s implementation, with similar or superior performances.
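To make the trade-off concrete, the number of candidate segmentations can be counted with a small dynamic program (an illustrative sketch only; the length bounds are hypothetical and the counter is not part of DP-Parse):

```python
from functools import lru_cache

def count_paths(n: int, min_len: int = 1, max_len: int = 20) -> int:
    """Count the ways to split an utterance of n frames into consecutive
    tokens whose length (in frames) lies in [min_len, max_len]."""
    @lru_cache(maxsize=None)
    def ways(i: int) -> int:
        if i == n:
            return 1
        return sum(ways(i + l) for l in range(min_len, max_len + 1)
                   if i + l <= n)
    return ways(0)

# Halving the frame duration doubles the number of units covering the
# same utterance and explodes the number of candidate parses:
print(count_paths(10))  # → 512
print(count_paths(20))  # → 524288
```

Bounding token length (here at most 20 frames, i.e. 800ms at 40ms per frame) is what keeps the lattice search tractable despite this growth.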
Because of our 'no clustering' approach, we cannot evaluate our model using the word type-based metrics of the Zero Resource speech segmentation challenge (Versteegh et al., 2016; Dunbar et al., 2017). Here, we use two classes of metrics: boundary-related metrics (Token and Boundary F-score) that do not refer to types, and high-level word embedding metrics. The idea is to use a downstream language model to learn high-level word embeddings and evaluate them on semantic and Part-Of-Speech (POS) dimensions. We rely on a semantic similarity task from Dunbar et al. (2020) and introduce two new metrics.
The combination of our contributions yields an overall system, DP-Parse, that sets a new state-of-the-art on speech segmentation for five speech corpora by a large margin. By performing various ablation and anti-ablation studies (replacing unsupervised components by weakly supervised ones), we pinpoint the components of the system where there is still margin for improvement. In particular, we show that using better embeddings obtained with weak supervision, it is possible to double the segmentation F-score and substantially improve our new semantic and POS scores.

Speech Sequence Embeddings
Speech Sequence Embedding (SSE) models take as input a piece of speech signal of any length and output a fixed-size vector. The main objective of these models is to represent the phonetic content of a given speech segment. A naive SSE model would extract frame-level features of a speech sequence, using for instance Wav2vec2.0 or HuBERT (Baevski et al., 2020; Hsu et al., 2021), and mean-pool the frames along the time axis. A more subtle approach is to train a self-supervised system on positive pairs of speech sequences. The model learns speech sequence representations thanks to contrastive loss functions (e.g. the infoNCE loss as in Algayres et al. (2022); Jacobs et al. (2021), or the Siamese loss as in Settle and Livescu (2016); Riad et al. (2018)) or reconstructive losses (e.g. the Correspondence Auto-Encoder from Kamper (2018), where the first element of the pair is compressed and decoded into the second element). Recently, a growing body of work leverages multilingual labelled datasets to build SSEs that can, to an extent, generalize to unseen languages (Hu et al., 2020a,b; Jacobs et al., 2021). In this work, we use a state-of-the-art self-supervised SSE model from Algayres et al. (2022).

Speech segmentation
Three main classes of models have been proposed to solve speech segmentation. Matching-first models (Park and Glass, 2007; Jansen and Van Durme, 2011; Hurtad, 2017) attempt to find high quality pairs of identical segments and cluster them based on similarity, thereby building a lexicon of word types which may not cover the entire corpus (Bhati et al., 2020; Räsänen et al., 2015). Segmentation-first models try to exhaustively parse a sentence (Lee et al., 2015; Kamper et al., 2017a,b; Kamper and van Niekerk, 2020), while jointly learning a lexicon (and therefore also involving matching and clustering). Segmentation-only models directly discover likely word boundaries in running speech using prediction error (Bhati et al., 2021; Cuervo et al., 2021), without relying on any lexicon of word types. Our approach is a new hybrid of the last two lines of research: like segmentation-first models, we jointly model lexicon and segmentation, but our lexicon is not a lexicon of types, thereby escaping matching and clustering errors, as in segmentation-only models.

Evaluating speech segmentation
Across the different word segmentation studies, evaluation has been done according to two general classes of metrics: matching and clustering metrics for word embeddings (MAP, NED, grouping F-score, Type F-score), and segmentation metrics for boundaries (Token and Boundary F-scores) (Carlin et al., 2011; Ludusan et al., 2014). All of these metrics presuppose that the optimal segmentation strategy is aligned with the text-based gold standard (based on spaces and punctuation). Yet, it is entirely possible that such a segmentation is not optimal for downstream applications, as witnessed by the fact that most current language models use subword units like BPEs or sentence pieces (Devlin et al., 2018). Therefore, we propose a third class of segmentation metrics based on downstream language modelling tasks. Chung and Glass (2018) showed that gold word boundaries provide sufficient information to learn good quality semantic embeddings from raw audio using a skipgram or word2vec objective, as assessed by semantic similarity benchmarks adapted to speech inputs. These datasets were also used to compile the sSIMI benchmark in the Zero Speech Challenge 2021 (Nguyen et al., 2020), but it turns out to give very low and noisy results (at least for small training sets). Here, we introduce two new metrics evaluating semantic and grammatical similarity that we show to be more stable.

Method: Instance-based Dirichlet Process Parsing
This section describes the building blocks and notations used by DP-Parse, as depicted in Figure 1 and Table 1, respectively. At a high level, DP-Parse is a new adaptation of Goldwater's Dirichlet process algorithm (Goldwater et al., 2009) to speech inputs. At the heart of Goldwater's algorithm is a hierarchical Bayesian model which generates a joint distribution of a lexicon of words (a word is a sequence of phonemes) and a corpus (a sequence of words). The associated learning algorithm optimizes the likelihood of an observed corpus as a function of the (unobserved) lexicon and corpus segmentation. It does so by alternating between two EM steps: (Step 1) estimating the most probable lexicon given a segmentation, and (Step 2) sampling among the most probable segmentations given the lexicon. More precisely, at each Step 1, the algorithm estimates the probability of a segmentation from two sets of numbers: the probability $P^0_W(w)$ that a given speech segment $w$ is a word, and $L_w$, the frequency of a given word $w$ in the (segmented) input corpus. They are estimated by building tables of counts of phoneme n-grams and words given a segmentation, respectively. At Step 2, these numbers are combined using Equation 3 (Section 3.3) and a new segmentation is sampled. Our adjustments to the original algorithm are the following:

• Section 3.1: Phonemes are replaced by 40ms speech frames (from a pretrained self-supervised learning algorithm), and words are replaced by fixed-length speech sequence embeddings (using a pretrained self-supervised embedder).
• Section 3.2: The tables of counts for computing $P^0_W(w)$ and $L_w$ are replaced by k-Nearest-Neighbors (k-NN) indexes of embeddings ($L_0$ and $L$, respectively), which can be viewed as instance lexicons of all possible n-frames and of segmented 'words', respectively. The count estimates are obtained using Gaussian kernel density estimation over the k-NN.
• Section 3.3: We make a small adaptation to the Dirichlet process formula.
• Section 3.4: Instead of alternating between the two steps at each boundary as in the original Gibbs sampling algorithm, which is very time-consuming, we sample segmentations over entire utterances using a segment lattice, and update the lexicon over a batch of utterances.
• Section 3.5: We initialize the system by selecting as an initial lexicon a list of short enough utterances, and precompute all the constant values ($L_0$ and $P^0_W(w)$) to optimize for speed.
So modified, the algorithm is no longer dependent on input type and can work either with discrete representations (text, represented as sequences of 1-hot vectors of phonemes) or continuous ones (speech, represented as embeddings).

Speech Sequence Embeddings
We represent speech as 20ms frames obtained by selecting the 8th layer of a pretrained Wav2vec2.0 Base system from Baevski et al. (2020). Each two successive frames are tied together so that a speech sentence is a series of 40ms speech blocks. To represent a speech segment, we use the Speech Sequence Embedding (SSE) model from Algayres et al. (2022): a self-supervised system trained with contrastive learning where positive pairs are obtained by data-augmentation of the speech signal. This model shows good performances across different languages as measured by the Mean Average Precision on pre-segmented spoken words. The SSE model takes as input the pre-extracted Wav2vec2.0 frames and applies a single 1-D convolution layer with GLU (kernel size: 4, number of channels: 512, stride: 1) and a single transformer layer (attention heads: 4, size of attention matrices: 512 neurons, feed-forward layer: 2048 neurons). A final max-pooling layer along the time axis is applied to get a fixed-size vector. To save computation, we reduce the dimensionality of Algayres et al. (2022)'s SSE model from 512 to 64 with a PCA trained on random speech segments extracted from the corpus at work.
In addition to their unsupervised SSE model, Algayres et al. (2022) provide a weakly-supervised version. They trained it with the same contrastive loss (infoNCE), using positive pairs obtained with a time-aligned transcription of the speech signal. The weakly-supervised version will be used as a topline model, as it scores much higher on Mean Average Precision than the self-supervised one.

Density estimation from Gaussian-smoothed k-Nearest-Neighbors

In text, the exact frequency of a string of letters can be computed by counting how often it occurs in the corpus. In speech, due to acoustic variability, most of the counts would be 1, and clustering methods cause too many errors to get reliable count estimates (more details in appendix A.2.1).
Here, we estimate the count of a segment $w$ by retrieving its $k$ nearest neighbors $n_1, ..., n_k$ in the index and summing Gaussian kernel values over them:

$F(w) = \sum_{i=1}^{k} \exp\left(-\frac{d(w, n_i)^2}{\beta^2}\right) \quad (1)$

$F$ is the Parzen-Rosenblatt window method for density estimation (Parzen, 1962) and returns a continuous value between 0 and $k$, estimating $L_{0w}$. It rests on two hyper-parameters: $k$, which is set to 1000, and $\beta$, the standard deviation of the Gaussian kernel. To set $\beta$, we follow the observation from Algayres et al. (2022): around 50% of the segments in the development dataset of the Zerospeech Challenge 2017 appear to have a frequency of one. Therefore, we set $\beta$ at runtime so that 50% of the calculated $L_{0w}$ get a frequency of one (i.e. an $F$ value below a small $\epsilon > 0$). We did not change these hyper-parameters for any of the tested languages.
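A minimal sketch of this Gaussian-smoothed count estimator (pure Python; a brute-force neighbor search stands in for the k-NN index, and `knn_frequency` and the toy 2-D embeddings are our own illustration):

```python
import math

def knn_frequency(query, index, k=1000, beta=0.1):
    """Parzen-Rosenblatt count estimate of a speech segment: each of the
    k nearest embeddings contributes a Gaussian kernel value in (0, 1],
    so the estimate lies between 0 and k."""
    dists = sorted(math.dist(query, x) for x in index)[:k]
    return sum(math.exp(-(d / beta) ** 2) for d in dists)

# A segment whose embedding has near-duplicates in the index gets a
# count close to its true frequency; a distant segment contributes ~0.
index = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (5.0, 5.0)]
print(knn_frequency((0.0, 0.0), index, k=4, beta=0.1))  # ≈ 2.98
```

In the real system the index holds up to a million 64-dimensional SSEs, the search is done with an optimized k-NN library, and β is calibrated at runtime as described above.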

Dirichlet Process formula
This section explains how Goldwater et al. (2009) formulate the Dirichlet Process, and our modification. Let $v$ be a non-silent section from a speech corpus $C$ found by a Voice Activity Detection (VAD). A segmentation of $v$ is written $v_{seg}$ and is composed of a series of segments $(w_1, ..., w_l)$. Under the unigram hypothesis, the probability of $v_{seg}$ is the product of the probabilities of each segment being a real spoken word:

$P(v_{seg}) = \prod_{i=1}^{l} P_W(w_i) \quad (2)$

To model the probability of a segment $w$ being a real word, Goldwater et al. (2009) use the following formulation of the Dirichlet Process:

$P_W(w) = \frac{L_w}{\#L + \alpha_0} + \frac{\alpha_0\, P^0_W(w)}{\#L + \alpha_0} \quad (3)$

The first term of Equation 3 accounts for how often the token $w$ has been segmented in $C_{seg}$.
Its numerator $L_w$ is the count estimate of $w$ in $L$. The second part of Equation 3 is the intrinsic probability of $w$ being a word, regardless of its appearance in $C_{seg}$. This intrinsic probability is called the base distribution $P^0_W$ and is controlled by the concentration parameter $\alpha_0$.
To find a formulation for $P^0_W$, the intuition from Goldwater et al. (2009) is simple: frequently appearing tokens are more likely to be true words than rare ones. Yet, their formulation of $P^0_W$ cannot be easily adapted to the segmentation of speech data. Our contribution here is to propose a new formula for $P^0_W$ which follows the same intuition:

$P^0_W(w) = \frac{L_{0w}}{\#L_0} \quad (4)$

where $L_{0w}$ is the count estimate of $w$ in $L_0$ (the lexicon of all possible segments in $C$). $\#L_0$ being the total number of segments in $L_0$, $P^0_W$ is then the probability of finding $w$ among the tokens in $C$.
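Putting the pieces together, the per-token probability of Equation 3 can be sketched as follows (a toy illustration with hypothetical counts; `word_prob` is our own naming, with the default α0 = 100 taken from the hyper-parameters in Table 2):

```python
def word_prob(L_w, total_L, L0_w, total_L0, alpha0=100.0):
    """Dirichlet-process probability that segment w is a word: an
    interpolation between its count L_w in the current segmentation
    and the base distribution P0(w) = L0_w / #L0 (Equation 3)."""
    p0 = L0_w / total_L0
    return (L_w + alpha0 * p0) / (total_L + alpha0)

# A segment frequent in the current segmentation gets a high probability;
# an unseen one falls back on the base distribution P0.
print(word_prob(L_w=50.0, total_L=1000.0, L0_w=200.0, total_L0=1e6))
print(word_prob(L_w=0.0, total_L=1000.0, L0_w=200.0, total_L0=1e6))
```

With speech inputs, `L_w` and `L0_w` are the Gaussian-smoothed k-NN count estimates rather than exact string counts.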

Segmentation lattice
In Goldwater et al. (2009), the learning algorithm uses Gibbs sampling, where word boundaries are sampled one at a time, requiring all the parameters of the Dirichlet Process model to be recomputed after each sample. On text data, this is not a problem, as updating the model's parameters ($L_w$ and $L_{0w}$) is fast: no k-NN search is needed; parameters are computed exactly by matching and counting strings. It becomes a bottleneck for speech segmentation, where the parameters are computed with density estimation and k-NN search. One way to alleviate this problem is to incorporate the dynamic programming version of the Gibbs sampler from Mochihashi et al. (2009). Another way, as in this paper, is to build a segmentation lattice over each utterance and sample among the N-best segmentations.
Here, we assume that the corpus $C$ has been pre-segmented into a set of utterances using a VAD algorithm. After each Step 1 from Figure 1, we can compute the log probability $\log(P_W(w))$ from Equation 3 that each token in an utterance is a real word. Yet, instead of directly using this probability, we introduce a per-token penalty score $q$ that favors short tokens (Equation 5; its exact form was grid-searched, see Appendix A.2.2). This penalty is added to the Dirichlet process log probability as follows:

$S(w) = \log(P_W(w) + \epsilon) + q(\mathrm{len}(w)) \quad (6)$

where $\epsilon$ is a very small number to handle cases where $P_W(w) = 0$, and $\mathrm{len}(w)$ is the number of 40ms speech frames in $w$.
For each utterance in $C$, we create a segmentation lattice that provides a compact view of all possible segmentation paths. A segmentation path is a sequence of consecutive segmentation arcs that covers the full utterance. Each arc starts and finishes in-between the units and is bounded within a minimal and maximal length. An example of a segmentation lattice can be found in Figure 2. Each segmentation arc is associated with its $S$ score from Equation 6. For each utterance, the N-best segmentations are computed with Dynamic Programming Beam Search, and we sample from a softmax of their total $S$ scores. One advantage of this procedure is that it is possible to parallelize utterance segmentation across large batches and only update the lexicon $L$ after each batch. In our experiments, we take the entire corpus $C$ to be a single batch.
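The lattice search can be sketched as follows (a simplified implementation of our own: arcs carry precomputed scores standing in for the S values of Equation 6, and we return the N best paths; the real system then samples one path from a softmax of the total scores):

```python
def n_best_segmentations(scores, n_best=10, min_len=1, max_len=6):
    """Dynamic-programming beam search over a segmentation lattice.
    scores[(i, j)] is the score of the arc spanning units i..j; returns
    up to n_best full paths as (total_score, boundary_list), best first."""
    T = max(j for _, j in scores)
    beams = {0: [(0.0, [0])]}  # beams[i]: best partial paths ending at i
    for i in range(1, T + 1):
        cands = []
        for l in range(min_len, max_len + 1):
            j = i - l
            if j < 0 or j not in beams or (j, i) not in scores:
                continue
            for s, path in beams[j]:
                cands.append((s + scores[(j, i)], path + [i]))
        if cands:
            beams[i] = sorted(cands, key=lambda c: -c[0])[:n_best]
    return beams.get(T, [])

# Toy 4-unit utterance whose arc scores favor the split [0-2][2-4]:
scores = {(0, 1): -2.0, (1, 2): -2.0, (2, 3): -2.0, (3, 4): -2.0,
          (0, 2): -1.0, (2, 4): -1.0, (1, 3): -2.5, (0, 3): -2.5,
          (1, 4): -2.5, (0, 4): -3.0}
best = n_best_segmentations(scores, n_best=3)
print(best[0])  # → (-2.0, [0, 2, 4])
```

Because each utterance is parsed independently given a frozen lexicon, this step parallelizes trivially across a batch, which is what allows updating the lexicon only once per batch.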

Initialising DP-Parse
We create a corpus $C$ as a collection of utterances by applying the pyannote VAD (Bredin et al., 2019) with a threshold of 200 ms. As shown in Figure 1, initialization (Step 0) contains several sub-steps. The first (0.1) is to provide an initial segmentation of $C$ using a simple heuristic: all sentences shorter than 800ms are treated as word tokens, the other ones are discarded. The intuition is that short sentences could be words in isolation, which provides the seed for an initial lexicon.

Figure 2: An example of a segmentation lattice for a small utterance of 6 units, with a constraint on word tokens to be bounded between 2 and 6 units. A segmentation path is a sequence of segmentation arcs that covers the whole utterance. Each arc starts and ends in-between units and is associated with its score S from Equation 6.

The second sub-step (0.2) is to create the list $D$ composed of the embeddings of all possible speech segments in $C$ that are possible words, which we set to be anything between 40ms and 800ms (by 40ms increments). To embed a speech segment, we use the SSE model from Section 3.1.
The third sub-step (0.3) consists in the construction of $L_0$: a k-NN index of all embedded segments in $D$. In practice, we found that randomly subsampling $D$ to one million embedded segments worked well (see grid-search in appendix A.2.2), and we precompute $L_{0w}$ as explained in Section 3.2.

Boundary level segmentation metrics
In the Zerospeech Challenge 2017 (Dunbar et al., 2017), two metrics measure how well discovered boundaries match the gold word boundaries. These metrics, the Boundary and Token F-score, are obtained by comparing the discovered sets of tokens to the gold ones obtained by force-aligning spoken sentences with their transcription. The evaluation is done in phoneme space, and each discovered boundary is mapped to a real phoneme boundary: if a boundary overlaps a phoneme by more than 30ms or more than half of the phoneme length, the boundary is placed after that phoneme (otherwise it is placed before).
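As a rough illustration of the boundary metric, here is a simplified sketch (our own: it matches boundaries within a fixed time tolerance, whereas the official evaluation snaps each boundary to phoneme edges as described above):

```python
def boundary_fscore(gold, found, tol=0.02):
    """Boundary precision/recall/F1: a discovered boundary (seconds) is
    a hit if it falls within tol of a not-yet-matched gold boundary."""
    unmatched = list(gold)
    hits = 0
    for b in sorted(found):
        match = next((g for g in unmatched if abs(g - b) <= tol), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    prec = hits / len(found) if found else 0.0
    rec = hits / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

p, r, f = boundary_fscore(gold=[0.0, 0.31, 0.55, 0.90],
                          found=[0.0, 0.30, 0.70, 0.90])
print(round(f, 2))  # → 0.75
```

The Token F-score follows the same logic but credits a token only when both of its boundaries are hits.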

Semantic and POS embedding metrics
The idea here is to evaluate segmentation through its effect on a downstream language model. The evaluation is less direct than the segmentation metrics above, as it assumes that the segmentation is used to turn a spoken utterance into a series of fixed-size SSEs, themselves used to train a continuous-input Speech Language Model. The metrics will therefore reflect each of these components (segmentation, SSE, Speech LM), but by keeping the SSE and Speech LM components constant, we can hope to study systematically the effect of the segmentation component. Our evaluation focuses on word-level representations, which we call here High-Level Speech Embeddings (HLSE). They are the speech equivalent of Word2Vec representations (Mikolov et al., 2013), yet coming from an inexact segmentation. Here, we assume that a Speech LM has been trained on SSE inputs on a given dataset with a given segmentation and can be used at test time to provide embeddings associated with a set of test words. As our segmentation models work on whole utterances, we present the test words in continuous utterances and then mean-pool the HLSEs from the Speech LM that overlap with each word to obtain a single vector. An overview of the building of generic HLSEs and their evaluation is shown in Figure 3.
We introduce two zero-shot tasks to evaluate HLSEs that do not require training a further classifier. Both are based on an ABX discrimination paradigm (Schatz et al., 2013), which computes a discrimination score over sets of (A, B, X) triplets. For each triplet, A and X belong to the same category and B plays the role of a distractor. The task for the model is to find the distractor in each triplet based on the cosine distance between the HLSEs of A, B and X. In the first task ($ABX_{sem}$), A and X are synonyms whereas B is semantically unrelated to either A or X. In the second task ($ABX_{POS}$), A and X share the same POS tags whereas B has a different one. The triplets are all extracted from the Librispeech training set. See appendix A.1 for details on the construction of the triplets.
Another task, sSIMI from Nguyen et al. (2020), also evaluates high-level representations of spoken words in a zero-shot and distance-based fashion. Yet, sSIMI differs from our ABX tasks in two important ways. First, sSIMI presents test words as pre-segmented chunks of speech, without the context of the original sentence to help the Speech LM. Second, the task for the Speech LM is to predict a semantic similarity score given by human annotators to pairs of words. The Speech LM encodes the pre-segmented spoken words into HLSEs, and the distance between the HLSEs is correlated with the human judgements. The correlation coefficient r is used as the final sSIMI score.

Datasets and Experimental settings
The Zerospeech Challenge 2017 (Dunbar et al., 2017) provides five corpora to evaluate speech segmentation systems. These corpora are composed of speech recordings from different languages, split into sentences using Voice Activity Detection. Three corpora (Mandarin, 2h30; English, 45h; French, 24h) are used for development, and two 'surprise' corpora for testing (German, 25h; Wolof, 10h). On each corpus, a separate SSE model from Algayres et al. (2022) is trained and a separate run of DP-Parse is performed to produce a full-coverage segmentation. DP-Parse's hyper-parameters from Table 2, as well as q's formula (Equation 5), were grid-searched to maximize token F1 segmentation scores over the three development datasets. The two remaining test sets are used to show generalization of DP-Parse to new unseen languages. More details on the hyper-parameter search are in Appendix A.2.2.
For the ABX and sSIMI tasks, we train a BERT model as a Speech Language Model on the training set of the Librispeech dataset (Panayotov et al., 2015), composed of 960 hours of English recordings. We proceed by first training an SSE model from Algayres et al. (2022) on the Librispeech. Then, this latter dataset is segmented using DP-Parse. Each segmented speech sequence is aggregated into a single vector using the pre-trained SSE model. Segmented sentences are used to train a BERT model with masked language modelling and the Noise Contrastive Estimation (NCE) loss (Gutmann and Hyvärinen, 2010) (see Figure 4). The BERT model is composed of 12 layers, 12 attention heads per layer with 768 neurons each and an FFN size of 3072 neurons. To compute the NCE, two heads p and h are composed respectively of one fully connected linear layer with 256 neurons and two fully connected layers with ReLU activation, also with 256 neurons. Batches are composed of utterances from a single speaker, and the 200 negative samples for the NCE are chosen within the batch. 15% of SSEs are masked in each batch during training. Only the BERT is trained; gradients are not propagated into the SSEs. As BERT is a multilayer transformer model, the scores for the ABX tasks and sSIMI are obtained by grid-searching the layer that performs best on the dev set of each task and evaluating that same layer on the corresponding test sets. The development and test sets for

Table 2: DP-Parse hyper-parameters.
Minimum segment length: 40ms
Maximum segment length: 800ms
Nb. of segments in $L_0$: 1M
Concentration parameter $\alpha_0$ (eq. 3): 100
Nb. of neighbors in k-NN (eq. 1): 100
Lattice beam size: 10
Duration penalty $q$ (eq. 5): grid-searched

Results on word-level segmentation
Regarding text segmentation, DP-Parse compares favorably to the original Dirichlet Process Unigram model from Goldwater et al. (2009). It produces a 21-point increase in token F1 compared to the original version (and a 3-point increase in boundary F-score) for a 10x runtime speed-up (Table 3) over the Adaptor Grammar implementation of Johnson et al. (2007).
Regarding speech segmentation, we compare DP-Parse with the three best speech segmentation models submitted at the Zerospeech Challenge 2017 (Bhati et al., 2020; Kamper et al., 2017b; Räsänen et al., 2015). Table 4 reports the token F1 and boundary F1 obtained by these models over the 5 datasets of the Zerospeech Challenge 2017. Across all corpora, DP-Parse outperforms its competitors by at least 5 points in both boundary and token F1. We also introduce a naive baseline that draws boundaries every 120 milliseconds, disregarding the content of the speech signal. It turns out to be surprisingly competitive with all speech segmentation systems except DP-Parse. The latter is the only existing speech segmentation system that beats the naive baseline on all languages.

Most speech segmentation systems from the Zerospeech Challenge 2017 rely on off-the-shelf self-supervised representations of speech. The hope is that, without modification, these systems would benefit from future improvements in speech representations and mechanically lead to higher segmentation scores. Yet, such hopes have never been verified by explicit experiments. Here, we test the ability of DP-Parse to improve with better inputs. For that, we use the weakly-supervised version of the SSE model from Algayres et al. (2022), trained with 10 hours of labelled speech data. On this type of input, DP-Parse doubles its token F1 score.

Results on semantic and POS metrics
In Table 5, the segment-based sections show how a BERT model can benefit from speech segmentation and SSE modelling on the tasks $ABX_{sem}$, $ABX_{POS}$ and sSIMI. To do so, we trained BERT models along the pipeline depicted in Figure 4.
Let us first analyse the scores on the ABX tasks. The unsupervised, segment-based section shows that DP-Parse and two less performant segmentation strategies lead to comparable ABX scores.
Yet, the weak supervision section shows that by improving the quality of the SSEs with weak supervision, DP-Parse performs better than other segmentation methods. The partial supervision section shows that with either perfect word boundaries or perfect segment representations (1-hot vectors), the ABX scores increase even more. Finally, the full supervision section is a regular text-based LM that serves as a topline model.
As baseline systems, we propose frame-based approaches that use neither speech segmentation nor SSEs. Speech-to-frames models like Wav2vec2.0, HuBERT or CPC (Baevski et al., 2020; Hsu et al., 2021; van den Oord et al., 2019) are trained with masked language modelling in the spirit of text-based LMs, but on the raw speech signal. Therefore, these models should already have acquired some knowledge of semantics and POS tagging. We evaluate speech-to-frames models directly on $ABX_{sem}$, $ABX_{POS}$ and sSIMI. To do so, we simply skip the Speech LM as well as the segmentation and SSE steps from the method in Figure 3. Speech-to-frames models are used to encode the whole speech sentences from the ABX triplets, and the frames within the word boundaries provided by the ABX tasks are mean-pooled to form the HLSEs. The results show that Wav2vec2.0 Large and HuBERT Large are strong baseline systems for our ABX metrics. Yet, they score on average below the segment-based approaches.
Regarding the scores obtained on sSIMI from Nguyen et al. (2020), Table 5 shows important downsides of this task compared to our ABX tasks. First, the scores sometimes show large inconsistencies across development and test sets, which is not the case for the ABX tasks. Second, while our ABX tasks are sensitive to improvements in speech segmentation and SSE modelling, sSIMI is not and shows comparable scores across most sections of Table 5. One reason for that could be that self-supervised speech systems (Wav2vec2.0, HuBERT and CPC) as well as the SSE models from Algayres et al. (2022) are not trained to encode short sequences of speech, especially when extracted as chunks from a sentence.

Conclusion and open questions
We introduce a new speech segmentation pipeline that sets the state-of-the-art on the Zerospeech datasets at 16.8 token F1. The whole pipeline needs no specific tuning of hyper-parameters, making it ready to use on any new language. We showed that the problem of speech segmentation can be reduced to the problem of learning discriminative speech representations. Indeed, using different levels of supervision, our pipeline reaches up to 35 token F1. Therefore, as long as the field of unsupervised representation learning makes headway, this method should automatically produce higher token F1 scores.
A first avenue of improvement is the SSE component. Here, we took the system described in Algayres et al. (2022) out of the box and showed good speech segmentation performance compared to the state of the art, but a large margin of improvement remains compared to text-based systems. A recent unpublished paper (Kamper, 2022), also based on the non-lexical principle, came to our attention; it shows similar or slightly better results than ours on a subset of the ZR17 languages. Kamper (2022) also uses a segmentation lattice for inference that resembles ours. Yet, as we have shown, our system improves monotonically with input embedding quality, and could therefore reach much better performance if the speech sequence embedding component were better trained. Further work is needed to improve the SSE component based on purely unsupervised methods.
Despite our computational speedup, our current system would face challenges in scaling to datasets larger than Librispeech. While it uses FAISS, an optimized search library, it is unclear whether storing each instance of a possible word would remain tractable. Another direction would be to move from the unigram model to the bigram Hierarchical Dirichlet Process of Goldwater et al. (2009). Yet, this possibility looks challenging, as explained in Appendix A.2.3.
Finally, our method showed promising results in terms of semantic and POS encoding when used as a preprocessor for a language model. This phenomenon is displayed by our new in-context semantic tasks, but not by a previous without-context semantic task from Nguyen et al. (2020). Further work is needed to show whether this can translate into high-quality generation when used as a component of generative systems (Lakhotia et al., 2021).

A Appendix
A.1 Construction of triplets for ABX_sem and ABX_POS
Let us write the series of triplets from our ABX tasks as (A_i, B_i, X_i)_{0≤i<N}. For all i, A_i is defined as a tuple (R^a_i, s^a_i, e^a_i, t^a_i), where R^a_i is a recording of a whole sentence and s^a_i and e^a_i are the temporal boundaries of a word with phonetic transcription t^a_i. B_i and X_i are defined identically. The sentences are extracted from the Librispeech dataset, a 960-hour corpus of read English literature. The Speech LM is asked to encode the whole sentences (R^a_i, R^b_i, R^x_i) and to compute the three word embeddings for the words of interest using the provided timestamps. The ABX score is the fraction of triplets for which X (of the same word type as A) is closer to A than to the distractor B:

ABX = (1/N) Σ_{0≤i<N} 1[d(f^a_i, f^x_i) < d(f^b_i, f^x_i)]

where T = (f^a_i, f^b_i, f^x_i)_{0≤i<N} is the collection of triplet word embeddings and d is the cosine distance. ABX_sem is composed of 502 pairs of words (evenly split into a development and a test set) created using a synonym dictionary. By sampling a list of distractors for each pair, we reached 1557 different triplets. For each unique word found in the triplets, we sample 10 occurrences from the sentences of the Librispeech corpus. In total, ABX_sem is computed over 1.5M triplets. ABX_POS is composed of 9997 pairs of words built from the WordNet database (Fellbaum, 1998). The words in each pair have the same POS tag, which can be noun, verb or adjective. By sampling distractors and 10 occurrences of each word type, we reached 37.3M triplets.
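The scoring rule can be sketched as below: a triplet counts as correct when X (of the same word type as A) is closer to A than to the distractor B under the cosine distance. This is a minimal sketch; the benchmark's exact aggregation (e.g. averaging within word pairs before the global mean) may differ.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two embeddings."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_score(triplets):
    """Fraction of triplets (f_a, f_b, f_x) for which X is closer to A
    (same word type) than to the distractor B; higher is better."""
    correct = sum(
        cosine_distance(f_a, f_x) < cosine_distance(f_b, f_x)
        for f_a, f_b, f_x in triplets
    )
    return correct / len(triplets)

# Toy check: X lies close to A and far from B, so the score is perfect.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
x = np.array([0.9, 0.1])
assert abx_score([(a, b, x)]) == 1.0
```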

A.2.1 k-means instead of k-NN
To estimate the DP-Parse parameters (L_0w and L_w), we followed a non-clustering approach based on k-NN density estimation. In this section, we instead use k-means clustering to estimate the DP-Parse parameters. We used the transcriptions of the development sets from the Zero Resource Speech Challenge 2017 to give the k-means the true number of clusters it is supposed to find when clustering SSEs from L_0 and L. The values of L_0w and L_w are given by the size of the cluster in which w is found. From Table 6, the segmentation scores obtained with k-means are much lower than those obtained with k-NN. Indeed, as shown in a study by Algayres et al. (2020), k-means is subject to the uniform effect (Wu, 2012), which makes it ill-suited to estimating frequencies on highly skewed distributions, such as the distribution of word types, which follows Zipf's law (Zipf, 1949).
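To make the clustering-based estimate concrete, here is a minimal Lloyd's k-means sketch on a Zipf-like toy set; the implementation and data are illustrative assumptions, not the paper's exact setup. The cluster sizes play the role of the frequency estimates L_w; on realistic, overlapping SSE distributions, the uniform effect biases these sizes toward equality.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (illustrative only). Returns cluster labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center, then recompute centers.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Zipf-like toy data: one frequent type (900 tokens), two rare types (50 each).
rng = np.random.default_rng(1)
X = np.concatenate([
    rng.normal(0.0, 0.05, size=(900, 2)),
    rng.normal(3.0, 0.05, size=(50, 2)),
    rng.normal(-3.0, 0.05, size=(50, 2)),
])
labels = kmeans(X, k=3)
# Cluster sizes are the frequency estimates; on overlapping real SSEs,
# the uniform effect pushes them toward equal sizes, unlike k-NN counts.
sizes = np.bincount(labels, minlength=3)
assert sizes.sum() == 1000
```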

A.2.2 DP-Parse hyper-parameters
We provide in Table 7 the DP-Parse speech segmentation performances, as measured by token-F1 scores, over three datasets (Mandarin, English and French) for different hyper-parameter values. Surprisingly, increasing the number of neighbors (k), the number of samples in L_0 (#L_0) or the beam size does not improve token-F1 scores. DP-Parse time complexity scales linearly with each of these three parameters; therefore, we keep their values as low as possible.

Table 7: Token F1-score as a function of the number of neighbors for the k-NN search (k), the concentration parameter (α_0), the beam-search size (beam), the number of samples in L_0 (#L_0) and the pair (δ, γ) from the penalty function. The default parameter values are k = 100, α_0 = 100, beam = 10, #L_0 = 1M, γ = 1.8 and δ = 4 (also in bold in the table).
Another unexpected result is the low impact of the concentration parameter (α_0). The value of this parameter should be crucial, as it controls the implicit vocabulary by controlling the amount of rare words in the segmentation. This observation led us to create the penalty function q from Equation 5, which offers control over word lengths and thereby also impacts the implicit vocabulary. This time, we noticed that the shape of q strongly impacts token-F1 scores. By careful tuning of the penalty function, we increased the token-F1 from 16.4, with γ = 0 (i.e. no penalty function), to 18, with γ = 1.8 and δ = 4.

A.2.3 Hierarchical Dirichlet Process
Goldwater's bigram Hierarchical Dirichlet Process (bigram-HDP), presented in Goldwater et al. (2009) for text segmentation, computes the probability of segmenting two consecutive word candidates, P(w_{i−1}, w_i), instead of only one, P(w_i), as in the unigram Dirichlet process (unigram-DP). In Goldwater et al. (2009), the computation of the bigram-HDP probability P(w_{i−1}, w_i) requires the number of different types found in all previously segmented bigrams with w_i as second word. This is particularly difficult to adapt to speech segmentation with DP-Parse, because counting types requires an explicit clustering step. We doubt that k-means would work, as we have shown that clustering SSEs with k-means works poorly even when the true number of clusters is known (see Table 6). More advanced clustering techniques could work better than k-means, as shown in Kamper et al. (2014), e.g. Chinese Whispers clustering, hierarchical k-means or probabilistic clustering using GMMs, yet it would require a large effort to incorporate them into DP-Parse.
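On text, the extra statistic that the bigram-HDP needs is straightforward: for each word, the number of distinct word types that precede it in the segmented corpus. The sketch below (on a toy corpus, for illustration only) shows that this count presupposes discrete word types, which is exactly what raw speech lacks:

```python
from collections import defaultdict

# Toy segmented corpus of three utterances, already split into word types.
corpus = [["the", "dog", "ran"], ["the", "cat", "ran"], ["a", "dog", "ran"]]

# For each word, collect the set of distinct types that precede it in a bigram.
predecessors = defaultdict(set)
for utterance in corpus:
    for prev, cur in zip(utterance, utterance[1:]):
        predecessors[cur].add(prev)

# The bigram-HDP statistic: number of distinct predecessor types per word.
n_types_before = {w: len(s) for w, s in predecessors.items()}
# "ran" is preceded by the types {"dog", "cat"}, so its count is 2.
# With speech, each token is a distinct embedding, so without clustering
# every predecessor would look like a new type and the count degenerates.
```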

Figure 1: The core components of DP-Parse. The algorithm loops between two main steps (Steps 1 and 2) in the spirit of the Expectation-Maximization algorithm. Given an initial coarse word segmentation, we estimate L_w and L_0w (i.e. the Dirichlet process parameters) with k-NN density estimation over SSEs. The estimated parameters are used to derive a new word segmentation, which in turn serves to re-estimate L_w and L_0w.

Figure 3: Method for deriving HLSEs for semantic and POS evaluations. Speech is segmented and converted into SSEs before being used to train a Speech LM. At test time, the HLSEs that overlap with the ABX timestamps by more than 40 ms are mean-pooled to form the ABX word embeddings (here, for the word 'asleep').

Figure 4: Our method to train a BERT model on SSEs. Speech is segmented into segments that are converted into vectors by a frozen SSE model. BERT is trained with the NCE loss. p and h are used to project vectors into a common space to compute the NCE objective. The negative samples are random SSEs from the same speaker.

Table 5: Semantic and part-of-speech discrimination scores (higher is better). Segment-based models compute SSEs for each segment (obtained by speech segmentation), on which a BERT model is trained with an NCE loss. Frame-based models are pre-trained baseline models without speech segmentation or BERT training. sSIMI scores are averaged over the Librispeech and synthetic subsets. Layers used are given in brackets. +: van den Oord et al. (2019), ×: Baevski et al. (2020), ‡: Hsu et al. (2021), ∨: Räsänen et al. (2015), †: Algayres et al. (2022)

Table 1: Notations used in Section 3

u: a unit, a block of 40 ms of speech signal
w = (u_0, ..., u_{|w|}): a segment, i.e. a sequence of units
E_w ∈ R^p: a fixed-length embedding of a segment w, computed by an SSE model
C = {v_0, ..., v_N}: a corpus, a set of utterances (segments without silence, found by VAD)
C_seg = {(w^0_1, ..., w^0_{p_0}), ...}: a segmented corpus, a set of sequences of segments
D = {w_0, ..., w_P}: the list of all subsegments of C that are possible words (between 40 ms and 800 ms)
L_0 = {E_{w_0}, ..., E_{w_P}}: a k-NN index of the embeddings from D
L = {E_{w_0}, ..., E_{w_n}}: a k-NN index of all embeddings from the segments in C_seg
L_0w: the frequency of E_w in L_0
L_w: the frequency of E_w in L

Frequencies are estimated as in Algayres et al. (2020), using Gaussian filtering over k-NNs. Let us assume that all speech segments w of a corpus are represented as embeddings E_w and stored in a k-NN index L_0. To compute L_0w, the frequency of w in L_0, we extract the k nearest neighbors of E_w, (E_1, ..., E_k), and sum over them, weighted by a decreasing function of their distance to E_w. Intuitively, embeddings close to E_w are more likely to be instances of the same segment than distant ones. For weighting, we use a Gaussian kernel centred on E_w:

L_0w = Σ_{i=1}^{k} exp(−d(E_w, E_i)² / (2σ²))
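A minimal sketch of this Gaussian-weighted k-NN frequency estimate follows. It uses brute-force search instead of FAISS, and the kernel width σ is an assumption; the paper's exact normalization may differ:

```python
import numpy as np

def knn_frequency(E_w, index, k=100, sigma=1.0):
    """Soft frequency estimate of a segment embedding E_w: take its k
    nearest neighbours in the index and sum Gaussian weights of their
    distances. Close neighbours (likely instances of the same word)
    contribute ~1, distant ones ~0. `sigma` is an illustrative choice."""
    d = np.linalg.norm(index - E_w, axis=1)  # distances to all entries
    nearest = np.sort(d)[:k]                 # brute-force k-NN (FAISS in the paper)
    return float(np.sum(np.exp(-nearest ** 2 / (2 * sigma ** 2))))

# Toy index: 50 near-duplicates of a frequent word, 5 tokens of a rare one.
rng = np.random.default_rng(0)
frequent = rng.normal(0.0, 0.01, size=(50, 16))
rare = rng.normal(5.0, 0.01, size=(5, 16))
index = np.concatenate([frequent, rare])

f_frequent = knn_frequency(frequent[0], index, k=20)
f_rare = knn_frequency(rare[0], index, k=20)
assert f_frequent > f_rare  # the frequent word gets the larger estimate
```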