Direct Speech Translation for Automatic Subtitling

Abstract
Automatic subtitling is the task of automatically translating the speech of audiovisual content into short pieces of timed text, i.e., subtitles and their corresponding timestamps. The generated subtitles need to conform to space and time requirements, while being synchronized with the speech and segmented in a way that facilitates comprehension. Given its considerable complexity, the task has so far been addressed through a pipeline of components that separately deal with transcribing, translating, and segmenting text into subtitles, as well as predicting timestamps. In this paper, we propose the first direct speech translation model for automatic subtitling that generates subtitles in the target language along with their timestamps with a single model. Our experiments on 7 language pairs show that our approach outperforms a cascade system in the same data condition, while also being competitive with production tools on both in-domain and newly released out-of-domain benchmarks covering new scenarios.


Introduction
With the growth of websites and streaming platforms such as YouTube and Netflix, the amount of audiovisual content available online has dramatically increased. Suffice it to say that the number of hours of Netflix original content increased by 2,400% from 2014 to 2019. This phenomenon has led to a huge demand for subtitles, which is becoming more and more difficult to satisfy with human resources alone. Consequently, automatic subtitling tools are spreading to reduce subtitlers' workload by providing them with suggested subtitles to be post-edited (Álvarez et al., 2015; Vitikainen and Koponen, 2021). In general, subtitles can be either intralingual (hereinafter captions), if source audio and subtitle text are in the same language, or interlingual (hereinafter subtitles), if the text is in a different language. In this paper, we focus on automating interlingual subtitling, framing it as a speech translation (ST) for subtitling problem.
Differently from ST, in automatic subtitling the generated text has to comply with multiple requirements related to its length, format, and the time it should be displayed on the screen (Cintas and Remael, 2021). These requirements, which depend on the type of video content and target language, are dictated by the need to keep users' cognitive effort as low as possible while maximizing comprehension and engagement (Perego, 2008; Szarkowska and Gerber-Morón, 2018). This often leads to a condensation of the original spoken content, aimed at reducing the time required for reading subtitles while increasing that of watching the video (Burnham et al., 2008; Szarkowska et al., 2016).
Being such a complex task, automatic subtitling has so far been addressed by dividing the process into different steps (Piperidis et al., 2004; Melero et al., 2006; Matusov et al., 2019; Koponen et al., 2020; Bojar et al., 2021): automatic speech recognition (ASR), timestamp extraction from audio, segmentation into captions, and their machine translation (MT) into the final subtitles. More recently, drawing from the evidence that direct models achieve competitive quality with cascade architectures (Ansari et al., 2020), Karakanta et al. (2020a) proposed an ST system that jointly translates and segments into subtitles, arguing that direct models are able to better exploit speech cues and prosody in subtitle segmentation. However, their system does not generate timestamps, hence missing a critical aspect to reach the goal of fully automatic subtitling. Furthermore, the current lack of benchmarks hinders a thorough evaluation of the technologies developed for automatic subtitling. In fact, the only corpus publicly available to date is MuST-Cinema (Karakanta et al., 2020b), which contains only single-speaker audio in the TED talks domain with verbatim translations.
To fill these gaps, this paper presents the first automatic subtitling system that performs the whole task with a single direct ST model, and introduces two new benchmarks. Our contributions can be summarized as follows:
• We propose the first direct ST model for automatic subtitling able to produce both subtitles and timestamps. Code and pre-trained models are released under the Apache License 2.0 at: https://github.com/hlt-mt/FBK-fairseq/;
• We introduce two (en→{de, es}) benchmarks for automatic subtitling, covering new domains, news/documentaries and interviews, with the presence of background noise and multiple speakers. We release them under the CC BY-NC 4.0 license at: https://mt.fbk.eu/ec-short-clips/ and https://mt.fbk.eu/europarl-interviews/;
• We conduct the first extensive comparison between automatic subtitling systems based on cascade and direct ST models on all the 7 language pairs of MuST-Cinema (en→{de, es, fr, it, nl, pt, ro}), showing the superiority of our direct solution, while also demonstrating its competitiveness with production systems on both MuST-Cinema and out-of-domain benchmarks.

Direct Speech Translation
While the first cascaded approach to ST was proposed decades ago (Stentiford and Steer, 1988; Waibel et al., 1991), direct models have recently become increasingly popular (Bérard et al., 2016; Weiss et al., 2017) due to their ability to avoid error propagation (Sperber and Paulik, 2020), their superior exploitation of prosody and better audio comprehension (Bentivogli et al., 2021), and their lower computational cost (Weller et al., 2021). Motivated by these advantages, direct models are rapidly evolving and their initial performance gap with cascade architectures (Niehues et al., 2019) has been significantly reduced, leading to a substantial parity in the latest IWSLT campaigns (Ansari et al., 2020; Anastasopoulos et al., 2021, 2022). Such improvements can be partly attributed to the development of specialized architectures for speech processing (Chang et al., 2020; Papi et al., 2021; Burchi and Vielzeuf, 2021; Kim et al., 2022; Andrusenko et al., 2022), which are all variants of a Transformer model (Vaswani et al., 2017) preceded by convolutional layers that reduce the length of the input sequence (Bérard et al., 2018; Di Gangi et al., 2019). Among them, Conformer (Gulati et al., 2020) is currently the best-performing model in ST (Inaguma et al., 2021). For this reason, we build our systems with this architecture and test, for the first time, its effectiveness in the challenging task of fully automatic subtitling.

Subtitling Requirements
Subtitles are short pieces of timed text, generally displayed at the bottom of the screen, which describe, transcribe, or translate the dialogue or narrative. A subtitle is composed of two elements: the text, shown in "blocks", and the corresponding start and end display times, or timestamps. Depending on the subtitle provider and the audiovisual content, different requirements have to be respected concerning both the text space and its timing. These constraints typically consist in: i) using at most two lines per block; ii) keeping linguistic units (e.g. noun and verb phrases) in the same line; iii) not exceeding a pre-defined number of characters per line (CPL), spaces included; iv) not exceeding a pre-defined reading speed, measured in number of characters per second (CPS). While a typical value used as maximum CPL threshold is 42 for most Latin languages, there is no agreement on the maximum CPS allowed. For instance, Netflix guidelines allow up to 17 CPS for adult and 15 for children programs, TED guidelines up to 21 CPS, and Amara guidelines up to 25 CPS.
To convey the meaning of the audiovisual product while adhering to time and space constraints, in some domains and scenarios subtitles require compression or condensation (Kruger, 2001;Gottlieb, 2004;Aziz et al., 2012;Liu et al., 2020a;Buet and Yvon, 2021).Due to the rehearsed nature of TED talks, the subtitles in MuST-Cinema have a limited degree of condensation, and the translation is mostly verbatim.In addition, the audio conditions (no background noise and a single speaker) are not representative of all the diverse contexts where subtitling is applied, such as news and movies.To fill this gap, we introduce two new benchmarks that feature different domains, scenarios (e.g., multiple speakers), and levels of subtitle condensation.

Automatic Subtitling
Attempts to (semi-)automate the subtitling process have been made with cascade systems composed of an ASR, a segmenter, and an MT model. Most works focused on adapting the MT module to subtitling with the goal of producing shorter and compressed texts. This has been performed either using statistical approaches trained on subtitling corpora (Volk et al., 2010; Etchegoyhen et al., 2014; Bywood et al., 2013) or by developing specifically tailored decoding solutions on statistical (Aziz et al., 2012) and neural models (Matusov et al., 2019). In particular, recent research efforts focused on controlling the MT output length so as to satisfy isometric requirements between source transcripts and target translations (Lakew et al., 2019; Matusov et al., 2020; Lakew et al., 2021, 2022). In addition, Öktem et al. (2019), Federico et al. (2020), Virkar et al. (2021), Tam et al. (2022), and Effendi et al. (2022) proved the usefulness of injecting prosody information about speech cues, such as pauses, in determining subtitle boundaries. Given the possibility for direct ST systems to access this information and their advantages mentioned in §2.1, Karakanta et al. (2020a, 2021) built the only (to the best of our knowledge) automatic subtitling system using a direct ST model, confirming with their results that the ability of direct ST systems to leverage prosody has particular importance for subtitle segmentation. However, their solution only covers the translation and segmentation into subtitles, neglecting the timestamp generation. Our study is hence the first to complete the entire subtitling process with a direct ST model and to evaluate its performance on all aspects of the subtitling task.

Direct Speech Translation for Subtitling
Motivated by all the advantages discussed in §2.1 and §2.3, we build the first automatic subtitling system solely based on a direct ST model (Figure 1). Our system works as follows: i) the audio is fed to a Subtitle Generator (§3.1) that produces the (untimed) subtitle blocks; ii) the computed encoder representations are passed to the Source Timestamp Generator (§3.2) to obtain the caption blocks and their corresponding timestamps; iii) the subtitle timestamps are estimated by the Source-to-Target Timestamp Projection (§3.3) from the generated subtitles, captions, and source timestamps. These modules are described in the rest of this section.

Subtitle Generation
We train a direct ST Conformer-based model that jointly performs the ST task and the segmentation of the generated translation into (untimed) subtitle blocks and lines. To this end, we add two special tokens to the vocabulary of our system, <eob> and <eol>, which respectively represent the end of a subtitle block and the end of a line within a block. Both at training and inference time, <eob> and <eol> are treated as any other token, without giving them different weights or adding a specific loss. Additionally, we do not incorporate losses aimed at minimizing the number of generated characters or explicitly optimizing for CPL and CPS compliance.
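As an illustration of how the two special tokens encode the subtitle layout, the following minimal sketch converts a generated token string into blocks and lines. The function name `split_into_blocks` and the whitespace tokenization are assumptions made for this example and are not part of the released code.

```python
from typing import List

EOB = "<eob>"  # end of a subtitle block
EOL = "<eol>"  # end of a line within a block


def split_into_blocks(text: str) -> List[List[str]]:
    """Split a model output string into blocks, each one a list of lines.

    Hypothetical helper: assumes words and markers are whitespace-separated.
    """
    blocks: List[List[str]] = []
    lines: List[str] = []
    words: List[str] = []
    for tok in text.split():
        if tok in (EOL, EOB):
            # Both markers close the current line.
            if words:
                lines.append(" ".join(words))
                words = []
            # Only <eob> also closes the current block.
            if tok == EOB and lines:
                blocks.append(lines)
                lines = []
        else:
            words.append(tok)
    # Flush trailing text that lacks a closing <eob>.
    if words:
        lines.append(" ".join(words))
    if lines:
        blocks.append(lines)
    return blocks


blocks = split_into_blocks("Hello there <eol> my friend <eob> Bye <eob>")
```

Here `blocks` contains two blocks: the first with two lines, the second with one.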

Source Timestamp Generation
Estimating timestamps for the generated subtitle blocks from source audio is a challenging task. Current sequence-to-sequence models, in fact, generate target sequences that are decoupled from the input and, therefore, their tokens do not have a clear relationship with the frames they correspond to. To recover this relationship, we start from the observation that direct ST models are often trained with an auxiliary Connectionist Temporal Classification or CTC loss (Graves et al., 2006) in the encoder to improve model convergence (Kim et al., 2017; Bahar et al., 2019). The CTC maps the input frames to the transcripts (in our use case, captions) and we propose to leverage this CTC module at inference time to estimate the block timestamps.
In particular, the encoder representations computed during the forward pass are fed to the CTC module that provides the frame-level probability distribution over the source vocabulary tokens (including <eob>, <eol>, and the additional CTC blank token). This sequence of CTC probabilities over the source vocabulary serves two purposes. First, it is used to predict the caption with the CTC beam search algorithm (Graves and Jaitly, 2014). Second, it is fed, together with the generated caption, to the CTC-based segmentation algorithm (Kürzinger et al., 2020), whose task is to find the most likely alignment between caption tokens and audio frames. The algorithm builds a trellis over the time steps for the generated tokens and, at each time step, only three paths are possible: i) staying at the same token (self-loop); ii) moving to the blank token; iii) moving to the next token. To avoid forcing the caption to start at the beginning of the audio, the transition cost for staying at the first token is set to 0. Otherwise, the transition cost is the CTC-predicted probability for a given token in that time step. The trellis is then backtracked from the time step with the highest probability in the last token of the generated caption, until the first token is reached. In our case, since we are interested in the timestamps of the subtitle blocks, we extract block-wise alignments that correspond to the start and the end time of each block. This means finding the time in which the first word of each subtitle is pronounced and the time in which the corresponding <eob> symbol is emitted by using the aforementioned algorithm.
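The trellis construction and backtracking can be sketched as a simplified Viterbi forced alignment. This is not the exact algorithm of Kürzinger et al. (2020): the function name `ctc_align`, the list-of-lists input format, and the merging of the blank and self-loop transitions into a single "stay" move are assumptions of this sketch, which also ignores the CTC rule that identical consecutive tokens must be separated by a blank.

```python
import math
from typing import List


def ctc_align(log_probs: List[List[float]], tokens: List[int], blank: int = 0) -> List[int]:
    """Return, for each token, the index of the frame where it is first emitted.

    log_probs[t][v] is the CTC log-probability of vocabulary item v at frame t.
    """
    T, N = len(log_probs), len(tokens)
    if T < N:
        raise ValueError("need at least one frame per token")
    NEG = -math.inf
    # trellis[t][j]: best score after t frames with the first j tokens emitted.
    trellis = [[NEG] * (N + 1) for _ in range(T + 1)]
    # moved[t][j]: True if token j-1 is emitted at frame t-1 on the best path.
    moved = [[False] * (N + 1) for _ in range(T + 1)]
    for t in range(T + 1):
        trellis[t][0] = 0.0  # free start: waiting before the first token costs nothing
    for t in range(1, T + 1):
        frame = log_probs[t - 1]
        for j in range(1, N + 1):
            tok = tokens[j - 1]
            stay = trellis[t - 1][j] + max(frame[blank], frame[tok])  # blank or self-loop
            move = trellis[t - 1][j - 1] + frame[tok]  # advance to the next token
            if move >= stay:
                trellis[t][j], moved[t][j] = move, True
            else:
                trellis[t][j] = stay
    # Backtrack from the frame where the last token has the highest score.
    t = max(range(1, T + 1), key=lambda i: trellis[i][N])
    j = N
    starts = [0] * N
    while j > 0:
        if moved[t][j]:
            starts[j - 1] = t - 1
            j -= 1
        t -= 1
    return starts
```

Block start and end times would then be read off the frame indices of the first token of each block and of its <eob>, multiplied by the duration of one encoder frame.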

Source-to-Target Timestamp Projection
After generating the untimed subtitles (§3.1), and captions with their timestamps (§3.2), the next step is to obtain the timestamps for subtitle blocks on the target side. In general, caption and subtitle segmentations may differ for many reasons (e.g. due to different syntactic patterns between languages) and imposing the caption segmentation on the subtitle side, as done in most cascade approaches (Georgakopoulou, 2019; Koponen et al., 2020), could be a sub-optimal solution. For this reason, we introduce a caption-subtitle alignment module that projects the source timestamps to the target blocks. To perform this task, we tested the three alternative methods described below.
Block-Wise Projection (BWP) This method operates at character level to project the predicted source-side (captions) timestamps on the target side (subtitles) without alterations. When the number of caption and subtitle blocks is equal, a condition that occurs in ∼80% of the cases, the timestamps of each caption block are directly assigned to the corresponding subtitle block. This process is depicted in Figure 2.a, in which "C" and "B" respectively stand for characters and blocks in the caption and subtitle. When the number of caption and subtitle blocks is different (Figure 2.b), the target segmentation is discarded and replaced with the caption segmentation. In this case, line and block boundaries (<eol>/<eob>) are inserted in the target side by matching the number of characters each line/block has in the caption. If the insertion falls in the middle of a word, the <eol>/<eob> is appended to the word. This approach has two main weaknesses. First, it assumes that, when captions and subtitles have the same number of blocks, these blocks contain the same linguistic content, while this is not guaranteed. Second, it ignores the subtitle segmentation in ∼20% of the cases.
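The two cases of BWP can be sketched as follows. This is a minimal illustration at block level only (<eol> handling is omitted), and the function names, the data layout, and the choice to put any leftover words in the last block are assumptions of the sketch rather than details given in the paper.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def resegment(words: List[str], lengths: List[int]) -> List[str]:
    """Re-split words into blocks following the caption block lengths (in characters).

    A boundary falling inside a word is moved to the end of that word.
    """
    blocks: List[str] = []
    current: List[str] = []
    count, k = 0, 0
    for word in words:
        current.append(word)
        count += len(word) + (1 if len(current) > 1 else 0)  # +1 for the space
        # Close the block once the caption character budget is reached;
        # the last block simply takes all remaining words.
        if k < len(lengths) - 1 and count >= lengths[k]:
            blocks.append(" ".join(current))
            current, count = [], 0
            k += 1
    if current:
        blocks.append(" ".join(current))
    return blocks


def block_wise_projection(
    caption_blocks: List[str],
    caption_times: List[Span],
    subtitle_blocks: List[str],
) -> List[Tuple[str, Span]]:
    """Assign caption timestamps to subtitle blocks as in the BWP description."""
    if len(caption_blocks) == len(subtitle_blocks):
        # Same number of blocks: timestamps are copied one-to-one.
        return list(zip(subtitle_blocks, caption_times))
    # Different number of blocks: discard the subtitle segmentation
    # and impose the caption segmentation by character count.
    words = " ".join(subtitle_blocks).split()
    new_blocks = resegment(words, [len(b) for b in caption_blocks])
    return list(zip(new_blocks, caption_times))
```

The second branch makes the weakness noted above visible: the segmentation predicted for the target language is thrown away whenever the block counts differ.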

Levenshtein-based Projection (LEV)
To overcome the above limitations, our second method exploits the Levenshtein distance-based alignment (Levenshtein, 1966) between captions and subtitles. This method estimates the target-side timestamps from the source-side timestamps without ever altering the original target-side segmentation. First, all the non-block characters are masked with a single symbol ("C"). For instance, "This is a block <eob>" is converted into "CCCCCCCCCCCCCCCB", where "B" stands for <eob>. Then, the masked caption and subtitle are aligned with the weighted version of Levenshtein distance, in which the substitution operation is forbidden so as to avoid the replacement of a character with a block and vice versa. If the positions of a block in the aligned caption and subtitle match, its caption timestamp is directly assigned to the subtitle block. If they do not match, the timestamps of the subtitle blocks are estimated from the caption timestamps based on the alignment of "B"s and the number of characters. For instance, given the caption "CCCBCCCCBCCCCCB" and the subtitle "CCCCCCBCCBCCCB", the optimal source-target alignment with the corresponding timestamp calculation is shown in Figure 3. In detail, the first subtitle block (CCC-CCC-B) is matched with the first two caption blocks (CCCBCCCCB) and the corresponding timestamp (00:01,5) is directly mapped. This also happens with the timestamp 00:02,5 of the last caption (BCC-CCCB) and subtitle block (CCCB). For the second subtitle block (CCB), the timestamp (00:01,9) is estimated proportionally from the caption (BCC-CCCB) using the character ratio between the orange block and the orange + green blocks.
Semantic-based Projection (SEM) The third method projects the predicted source-side timestamps on target blocks by looking at the semantic content of the generated captions and subtitles. The method is based on SimAlign (Jalili Sabet et al., 2020), which combines semantic embeddings from fastText (Bojanowski et al., 2017), VecMap (Artetxe et al., 2018), mBERT, and XLM-RoBERTa (Conneau et al., 2020) to align source and target texts at the word level. Specifically, we first align captions and subtitles word by word (<eol>/<eob> included) with SimAlign. Then, when all <eob>s of a subtitle are aligned with <eob>s in the caption (66% of the cases), we assign the corresponding timestamp (Figure 4). Otherwise, i.e. when at least one <eob> in the subtitle is aligned with a caption word or <eol> or is not aligned at all, one of the two previous methods is applied as a fallback solution.
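The masking and alignment steps of the LEV method can be sketched as below. The sketch covers only masking and the substitution-free alignment, not the proportional timestamp estimation; the function names, the treatment of <eol> as a space, and the use of unit insertion/deletion costs are assumptions made for illustration.

```python
from typing import Tuple

EOB, EOL = "<eob>", "<eol>"


def mask(text: str) -> str:
    """Replace every non-block character with 'C' and every <eob> with 'B'."""
    pieces = text.replace(EOL, " ").split(EOB)
    out = []
    for i, piece in enumerate(pieces):
        # Normalize whitespace so that spaces next to markers are not counted.
        out.append("C" * len(" ".join(piece.split())))
        if i < len(pieces) - 1:
            out.append("B")
    return "".join(out)


def align(src: str, tgt: str) -> Tuple[str, str]:
    """Align two masked strings with insertions and deletions only ('-' marks a gap)."""
    n, m = len(src), len(tgt)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(dist[i - 1][j], dist[i][j - 1]) + 1
            if src[i - 1] == tgt[j - 1]:  # a match is free; substitution is forbidden
                best = min(best, dist[i - 1][j - 1])
            dist[i][j] = best
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == tgt[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            a.append(src[i - 1]); b.append(tgt[j - 1]); i -= 1; j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            a.append(src[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(tgt[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))


caption, subtitle = align("CCCBCCCCBCCCCCB", "CCCCCCBCCBCCCB")
```

Because substitutions are excluded, a "C" in one string is never paired with a "B" in the other: every aligned position holds either two identical symbols or a gap.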

Data

Training Data
For the comparison between cascade and direct architectures (§6.2), we train the models in a controlled and easily reproducible data setting by using MuST-Cinema v1.1, the only publicly available subtitling corpus also containing the source speech. It covers one general domain (TED talks), and 7 language pairs, namely en→{de, es, fr, it, nl, pt, ro}. The number of hours in the training set of each language pair is shown in the first row of Table 1.
For the comparison with production tools (§6.3), we experiment in a more realistic unconstrained data scenario and we focus on en→de, and en→es. For training, we use MuST-Cinema, two ST datasets, Europarl-ST (Iranzo-Sánchez et al., 2020) and CoVoST2 (Wang et al., 2020b), and three ASR datasets, CommonVoice (Ardila et al., 2020), TEDlium (Hernandez et al., 2018) and VoxPopuli (Wang et al., 2021). We translate the ASR corpora with the Helsinki-NLP MT models (Tiedemann and Thottingal, 2020) and filter out data with a very high or low transcript/translation character ratio, following Gaido et al. (2022). The use of automatic translations as targets, also known as sequence-level knowledge distillation (Kim and Rush, 2016), is a popular data augmentation method used in the most recent IWSLT evaluation campaigns (Anastasopoulos et al., 2021, 2022) to enhance the performance of ST systems. Since none of the training sets, except for MuST-Cinema, includes the subtitle boundaries (<eob> and <eol>) in the target translation, we automatically insert them by employing the publicly-released multimodal and multilingual segmenter by Papi et al. (2022). The segmenter takes the source audio and the unsegmented text as input and outputs the segmented text, i.e., containing <eob> and <eol>. By doing this, we can train our system to jointly translate from speech and segment into subtitles without the need for manually-curated subtitle targets, which are hard to find and costly to create. The number of training hours is reported in Table 1.

Test Data
The models are tested in both in-domain and out-of-domain conditions. For in-domain experiments, we use the MuST-Cinema test set, for which we adopt both the original audio segmentation (for reproducibility and for the sake of comparison with previous and future work) and a more realistic automatic segmentation obtained with SHAS (Tsiamas et al., 2022). Notice that this audio segmentation is a completely different task from determining subtitle boundaries. Its only goal is splitting long audio files into smaller chunks (or utterances) that can be processed by ST systems, limiting performance degradation due to information loss caused by suboptimal splits (e.g., in the middle of a sentence). In general, each resulting utterance contains multiple subtitle blocks. For instance, in the MuST-Cinema training set there are ∼2.5 blocks per utterance, even though utterances are quite short (6.4s on average). When automatic segmentation methods like SHAS are applied, this ratio significantly increases, as audio segments are typically much longer, with many segments lasting between 14 and 20 seconds (Gaido et al., 2021b; Tsiamas et al., 2022).
For out-of-domain evaluations, we introduce the two new (en→{de,es}) test sets described below, which we also segment with SHAS.

EC Short Clips
The first test set is composed of short videos from the Audiovisual Service of the European Commission (EC) recorded between 2016 and 2022. These informative clips have an average duration of 2 minutes and cover various topics discussed in EC debates such as economy, environment, and international rights. This benchmark presents several additional difficulties compared to TED talks since the videos often contain multiple speakers, and background music is sometimes present during the speech. We selected the videos with the highest subtitle conformity (at least 80% of the subtitles conforming to 42 CPL, and 75% conforming to 21 CPS), and removed subtitles describing on-screen text. This resulted in 27 videos having a total duration of 1 hour. The target srt files contain ∼5,000 words per language.
EuroParl Interviews The second test set is compiled from publicly available video interviews from the European Parliament TV (2009-2015). We selected 12 videos of 1 hour total duration, amounting to ∼6,500 words per target language. The videos present multiple speakers and sometimes contain short interposed clips with news or narratives. Apart from the more challenging source audio properties compared to the clean single-speaker TED talks, here the target subtitles are not verbatim and demonstrate a high degree of compression and reduction. As a consequence, the CPL and CPS conformity is very high (∼100%) but this comes at the cost of being more difficult for automatic systems to perfectly match the non-verbatim translations. Nonetheless, to achieve real progress in automatic subtitling, it is particularly relevant to evaluate automatic systems on realistic and challenging benchmarks like the ones we provide.

Training Settings
Our systems are implemented on Fairseq-ST (Wang et al., 2020a), following the default settings unless stated otherwise. The input is represented by 80 audio features extracted every 10ms with a sample window of 25ms and pre-processed by two 1D convolutional layers with stride 2 to reduce the input length by a factor of 4. All segments longer than 30s in the training set are filtered out to speed up training. The models are based on encoder-decoder architectures and composed of a stack of 12 Conformer encoder layers and 8 Transformer decoder layers. We apply CTC loss to the 8th encoder layer and use its predictions to compress the input sequences to reduce RAM consumption (Liu et al., 2020b; Gaido et al., 2021a). Both the Conformer and Transformer layers have a 512 embedding dimension and 2,048 hidden units in the linear layer. We set dropout to 0.1 in the linear, attention, and convolutional modules. In the convolutional modules, we also set a kernel size of 31 for the point- and depth-wise convolutions.
For the comparison between cascade and direct architectures, we train a one-to-many multilingual ST model that prepends a token representing the selected target language for decoding (Inaguma et al., 2019) on all the 7 languages of MuST-Cinema.Conversely, for the comparison with production tools, we develop a dedicated ST model for each target language (de, es).For inference, we set the beam size to 5 for both subtitles and captions.
We train with the Adam optimizer (Kingma and Ba, 2015) (β1 = 0.9, β2 = 0.98) for 100,000 steps. The learning rate increases linearly up to 0.002 for the first 25,000 warm-up steps and then decays with an inverse square root policy, apart from fine-tunings, where it is the fixed value 0. [...] (Sennrich et al., 2016) with size 8,000 for the source language. For the multilingual model trained on MuST-Cinema, a shared vocabulary is built with a size of 16,000 while, for the two models developed to compare with production tools, we build German and Spanish vocabularies with a size of 16,000 subwords each. The ASR of our cascade model is trained using the same source language vocabulary of size 8,000 used in the translation setting. The MT model is trained using the standard hyper-parameters of the Fairseq multilingual MT task (Ott et al., 2019), with the same source and target vocabularies of the ST task. For all models, we stop the training when the validation loss does not improve for 10 epochs and the final models are obtained by averaging 7 checkpoints (the best, 3 preceding and 3 succeeding). Training is performed on 4 NVIDIA A100 (40GB RAM), with 40k max tokens per mini-batch and an update frequency of 2, except for the MT models for which 8 NVIDIA K80 (12GB RAM) are used with 4k max tokens and an update frequency of 1. Table 2 lists the total number of parameters of our direct models, showing that it is ∼1/3 of the cascade system used as a term of comparison.
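The learning-rate policy described in this section (linear warm-up to 0.002 over 25,000 steps, then inverse square root decay) can be written as a short function. This is a sketch of the general policy, assuming that the warm-up starts from 0; the actual initial value used in training is not stated here, and the function name is hypothetical.

```python
def learning_rate(step: int, peak: float = 2e-3, warmup: int = 25_000) -> float:
    """Linear warm-up to `peak`, then inverse square root decay.

    Assumption of this sketch: the warm-up starts from a learning rate of 0.
    """
    if step < warmup:
        return peak * step / warmup
    # After warm-up the rate decays proportionally to 1 / sqrt(step).
    return peak * (warmup / step) ** 0.5
```

Under these assumptions, the rate at the end of the 100,000 training steps is half of the peak value, since sqrt(25,000 / 100,000) = 0.5.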

Terms of Comparison
We compare our direct ST system both with a cascade pipeline trained under the same data conditions and with production tools.
Cascade We build an in-domain cascade composed of: an ASR, an audio forced aligner, a segmenter, and an MT system. The ASR has the same architecture as our ST system (Conformer encoder + Transformer decoder), and it is trained on MuST-Cinema transcripts without <eob> and <eol>. The audio forced aligner used to estimate the timestamps (Gretter et al., 2021) is based on the Kaldi acoustic model. The subtitle segmenter is the same multimodal segmenter we used to segment the training data for the direct system (§4.1). The MT is a multilingual model trained on the MuST-Cinema (transcript, translation) pairs without <eob> and <eol>. The pipeline works as follows. The audio is first transcribed by the ASR and word-level timestamps are estimated with the forced aligner. Then, the transcript is segmented into captions with the segmenter and each block timestamp is obtained by averaging the end time of the word before an <eob> and the start time of the word after it. The segmented text is then split into sentences according to the <eob> and, finally, these sentences are translated by the MT. The <eob>s are automatically re-inserted at the end of each sentence while <eol>s are added to the subtitle translation using the same segmenter.
Production Tools As a term of comparison for the unconstrained data condition, we use production tools for automatic subtitling. These tools take audio or video content as input and return the subtitles in various formats, including srt. We test three online tools, namely: MateSub, Sonix, and Zeemo. We also compare with the AppTek subtitling system, a cascade architecture whose ASR component is equipped with a neural model that predicts the subtitle boundaries before feeding the transcripts to the MT component (Matusov et al., 2019). For this system, two variants of the MT model are evaluated: a standard model and a model specifically trained to obtain shorter translations in order to better conform to length requirements (Matusov et al., 2020). Since we are not interested in comparing the tools with each other, all system scores are anonymized.
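The block timestamp computation of the cascade (averaging the end time of the word before an <eob> and the start time of the word after it) can be sketched as follows. The function name, the input layout, and the handling of a final <eob> with no following word (the end time of the last word is used) are assumptions of this illustration.

```python
from typing import List, Tuple

EOB, EOL = "<eob>", "<eol>"


def block_boundary_times(tokens: List[str], word_times: List[Tuple[float, float]]) -> List[float]:
    """Return one boundary time per <eob> in `tokens`.

    `word_times` holds the (start, end) time of each word, in order,
    as produced by a forced aligner.
    """
    boundaries: List[float] = []
    w = 0  # index of the next word in word_times
    for tok in tokens:
        if tok == EOB:
            if w == 0:
                continue  # no word precedes this marker
            prev_end = word_times[w - 1][1]
            # Assumption: with no following word, fall back to the previous end time.
            next_start = word_times[w][0] if w < len(word_times) else prev_end
            boundaries.append((prev_end + next_start) / 2)
        elif tok != EOL:
            w += 1
    return boundaries
```

For a pause between two blocks, the boundary falls in the middle of the silence, so the first block stays on screen slightly after its last word ends.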

Evaluation
Translation quality, timing, and segmentation of subtitles are measured with multiple metrics. First, we compute SubER (Wilken et al., 2022), a tailored TER-based metric (the lower, the better) that scores the overall subtitle quality by considering translation, segmentation and timing altogether.
We adopt the cased and with punctuation version of the metric since these aspects are crucial for the quality and comprehension of the subtitles. Next, specifically for translation quality, we use SacreBLEU (Post, 2018), on texts from which <eol> and <eob> have been removed. The quality of segmentation into subtitles is evaluated with Sigma from the EvalSub toolkit (Karakanta et al., 2022). Since BLEU and Sigma require the same audio segmentation between reference and predicted subtitles, we re-align the predictions in case of non-perfect alignment with the mWERSegmenter (Matusov et al., 2005). Lastly, to check the spatiotemporal compliance described in §2.2, we compute CPL conformity as the percentage of lines not exceeding 42 characters, and CPS conformity as the percentage of subtitle blocks having a maximum reading speed of 21 characters per second. Confidence intervals (CI) are computed with bootstrap resampling (Koehn, 2004).
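The two conformity measures defined above can be computed as in the following sketch. The block representation (a list of lines plus start and end times in seconds) and the function names are assumptions; the sketch also assumes that line breaks do not count as characters for CPS and that blocks with non-positive duration are non-conforming.

```python
from typing import List, Tuple

# A subtitle block: (lines of text, start time in seconds, end time in seconds).
Block = Tuple[List[str], float, float]


def cpl_conformity(blocks: List[Block], max_cpl: int = 42) -> float:
    """Percentage of lines whose length does not exceed `max_cpl` characters."""
    lines = [line for text, _, _ in blocks for line in text]
    if not lines:
        return 0.0
    return 100.0 * sum(len(line) <= max_cpl for line in lines) / len(lines)


def cps_conformity(blocks: List[Block], max_cps: float = 21.0) -> float:
    """Percentage of blocks whose reading speed does not exceed `max_cps`."""
    if not blocks:
        return 0.0
    conforming = 0
    for text, start, end in blocks:
        duration = end - start
        chars = sum(len(line) for line in text)  # spaces included, line breaks excluded
        if duration > 0 and chars / duration <= max_cps:
            conforming += 1
    return 100.0 * conforming / len(blocks)
```

Note that the two measures use different units: CPL conformity is computed over lines, while CPS conformity is computed over blocks.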

Results
In this section, we first (§6.1) choose the best timestamp projection method among those introduced in §3.3. Then (§6.2), we compare the cascade and direct approaches trained in the same data conditions. Lastly (§6.3), we show that our direct model, even though trained in laboratory settings, is competitive with production tools. In addition, in Appendix A, we analyze the performance of the CTC-segmentation algorithm for timestamp estimation compared to forced aligner tools.

Timestamp Projection
The quality of source-to-target timestamp projection (§3) is crucial to correctly estimate the target-side timestamps and, in turn, to produce good subtitles. To select the best strategy, we compare the methods in §3.3 using the constrained model on the MuST-Cinema test sets for en→{de, es}. To test the robustness of the various methods when gold-segmented audio is not available, we also report the results using the automatic audio segmentation in addition to that obtained using the gold one.
Results are shown in [...] method. We also report, as a baseline, a method that completely ignores the target segmentation and always maps the caption segmentation onto the subtitle as in BWP when the number of caption and subtitle blocks is different (§3.3). For the SEM method, if the source-target alignment is not found by SimAlign, the LEV method is applied instead. (We also applied the baseline and the BWP method as fallbacks for SEM, but this led to worse results.) The results highlight the superiority of the LEV method, which outperforms the others on almost all metrics, with similar trends for both language pairs. The gap is more marked in the realistic scenario of automatically-segmented audio, likely due to the fact that the audio segments produced by SHAS are longer than the manually-annotated ones (8.6s vs 5.5s). As such, each audio segment contains more blocks to align, so the difference between the methods emerges more clearly. The low scores obtained by the baseline confirm that the caption segmentation is not optimal for the target language. Furthermore, SEM yields results that are either comparable to or slightly better than those obtained by BWP, especially in terms of Sigma and CPL, while being always worse than LEV. In addition, SEM exhibits lower CPS conformity even compared to the baseline. Consequently, its performance suggests that semantically-motivated approaches are not the best solution for timestamp projection.
Focusing on the LEV method, we observe that segmentation quality (higher Sigma) and overall subtitle quality (lower SubER) are slightly better when the gold segmentation is used, as expected. Conversely, CPS conformity is higher with the automatic audio segmentation. This counter-intuitive result can be explained as follows: audio segmentation not only splits but sometimes also cuts the audio according to speakers' pauses, while the manual segmentation delimits speech boundaries more aggressively than the automatic one. In our case, manual segmentation results in audio segments that are about 2% shorter than those obtained with the automatic segmentation, thus "forcing" the generated subtitles to appear on screen for a shorter time, which in turn leads to a higher reading speed.

Cascade vs. Direct
After selecting LEV as our best timestamp projection method, we evaluate cascade and direct ST systems trained in the same data condition. Before this, to ensure the competitiveness of our cascade baseline, we compare it with the results obtained on the MuST-Cinema test set by the other cascade systems presented in the literature, namely en→{de, fr} by Karakanta et al. (2021) and en→fr by Xu et al. (2022). As these works report only BLEU with breaks, that is, BLEU computed including also <eob> and <eol>, we compare our cascade baseline with them on that metric.25 Although these works leverage large additional training corpora for both ASR (e.g., LibriSpeech; Panayotov et al., 2015) and MT (e.g., OPUS; Tiedemann, 2016, and WMT-14; Bojar et al., 2014), our cascade trained only on MuST-Cinema performs on par with them. It scores 20.2 on German and 26.2 on French, which are similar to or even better than, respectively, 19.9 and 26.9 of Karakanta et al. (2021), and 25.8 on French of Xu et al. (2022). These results confirm the strength of our baselines and the soundness of our experimental settings.
[Results table fragment for the constrained setting; column headers and remaining rows were lost in extraction. Average of the row preceding Dir.: 54.4. Dir. (metric label lost): 58.7±2.3, 46.7±2.1, 52.9±1.7, 50.4±1.7, 47.4±1.9, 44.6±1.7, 48.5±2.1; average 49.9. BLEU (↑), Casc.: 18.9±1.4, 32.4±1.8, 25.1±1.5, 26.0±1.6, 25.8±1.5, 31.4±1.7, 28.4 (28.3±1.6).]
The overall subtitle quality of the direct solution is significantly higher than that of the cascade on all language pairs, with a SubER decrease of 3.8-5.5 points, corresponding to an ∼8% improvement on average. Since SubER measures translation, segmentation and timestamp quality altogether, we leverage the other metrics to disentangle the contribution of each of these aspects. The higher Sigma of our system (+1.2 average improvement) demonstrates that the joint generation of subtitle content and boundaries results in superior segmentation. This finding corroborates previous research on the value of prosody (see §2.3) and on the ineffectiveness of projecting caption segmentation onto subtitles, as done by cascade approaches (Georgakopoulou, 2019; Koponen et al., 2020). The sub-optimal placement of block boundaries in the cascade system can also account for the superior translation quality of our method (+3.9 BLEU average improvement): as the MT component translates the caption block-by-block, inaccurate boundaries can impede access to information required for proper translation.
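The ∼8% figure above is a relative reduction of an error metric. As a minimal sketch of the arithmetic, the snippet below assumes that 54.4 and 49.9, the two averages that survive in the results table, are the average SubER of the cascade and of the direct system, respectively; the function name is ours and purely illustrative.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Relative reduction (in %) of an error metric where lower is better."""
    return 100.0 * (baseline - system) / baseline


# Assumption: these are the average SubER scores of the cascade and of the
# direct system, as read from the surviving part of the results table.
cascade_avg_suber = 54.4
direct_avg_suber = 49.9

absolute_gain = cascade_avg_suber - direct_avg_suber
relative_gain = relative_reduction(cascade_avg_suber, direct_avg_suber)
print(f"absolute: {absolute_gain:.1f} SubER, relative: {relative_gain:.1f}%")
```

Under this assumption, an average gain of 4.5 SubER points corresponds to a relative improvement of slightly more than 8%, in line with the value reported in the text.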
Looking at the conformity metrics, the direct system complies with the length requirement of 42 characters per line (CPL) in almost 90% of cases, while the cascade system does so in only 78.1%. This difference is explained by the higher number of <eol> generated by the direct model (10-15% more than the cascade), although it is still lower than that of the reference (8-10% less). According to the statistics computed on the outputs of the two systems, the cascade not only has a higher average number of characters per line (32 vs. 29), but its variance is also 1.5-2 times greater, with lines sometimes close to or even longer than 100 characters on all language pairs. In contrast, most of the CPL violations of the direct system are caused by lines shorter than 60 characters, and lines never exceed 70 characters. The trend for CPS is instead different, since the cascade generates subtitles with a higher conformity to the 21-CPS reading speed (72.0 vs 68.9). This can be partially explained by looking at the generated timestamps: upon a manual inspection of 100 subtitles, we noticed that the direct model tends to assign the start times of the subtitles slightly after those of the cascade (within 100ms of difference), and end times slightly before those of the cascade (mostly within 200ms). Overall, on the MuST-Cinema test sets, this leads to a total of ∼2,940s with subtitles on screen for the cascade, and ∼2,850s for the direct (∼3% lower).
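To make the two conformity metrics concrete, the following is a minimal sketch of how they could be computed. It assumes that CPL conformity is the percentage of lines with at most 42 characters and that CPS conformity is the percentage of subtitle blocks whose reading speed does not exceed 21 characters per second, with spaces counted as characters; the `Block` type and the function names are hypothetical and are not taken from the evaluation scripts used in the paper.

```python
from dataclasses import dataclass
from typing import List

MAX_CPL = 42    # maximum characters per line
MAX_CPS = 21.0  # maximum characters per second


@dataclass
class Block:
    """A subtitle block: one or more lines plus on-screen times in seconds."""
    lines: List[str]
    start: float
    end: float


def cpl_conformity(blocks: List[Block]) -> float:
    """Percentage of lines that do not exceed MAX_CPL characters."""
    lines = [line for block in blocks for line in block.lines]
    if not lines:
        return 0.0
    ok = sum(len(line) <= MAX_CPL for line in lines)
    return 100.0 * ok / len(lines)


def cps_conformity(blocks: List[Block]) -> float:
    """Percentage of blocks whose reading speed is at most MAX_CPS."""
    if not blocks:
        return 0.0
    ok = 0
    for block in blocks:
        chars = sum(len(line) for line in block.lines)
        duration = block.end - block.start
        if duration > 0 and chars / duration <= MAX_CPS:
            ok += 1
    return 100.0 * ok / len(blocks)


# Toy data: the same 40-character line conforms to CPS when shown for 2.0s
# (20 CPS) but not when shown for 1.8s (about 22 CPS).
text = "a" * 40
longer_display = [Block([text], start=0.0, end=2.0)]
shorter_display = [Block([text], start=0.1, end=1.9)]
print(cps_conformity(longer_display), cps_conformity(shorter_display))
```

The toy example mirrors the observation above: with identical text, a slightly later start time and a slightly earlier end time are enough to push a block over the reading-speed limit, while leaving CPL conformity untouched.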
To sum up, our direct system proves to be the best choice to address the automatic subtitling task in the constrained data condition, reaching better translation quality and more well-formed subtitles. Our results also indicate that improving the reading speed of the generated subtitles is one of the main aspects on which future work should focus.

MuST-Cinema
The results of the unconstrained models on the in-domain MuST-Cinema test set are shown in Table 5. Compared to production tools, our system shows better translation and segmentation quality as well as a significantly better overall quality on both languages. Gains in BLEU are more evident in Spanish, where we obtain a ∼6% improvement compared to the second-best model (System 4). Also, considerable Sigma improvements are observed, with gains of 5.3-34.5% for German and 2.9-24.2% for Spanish, which are in line with SubER improvements of, respectively, 2.6-12.0% and 8.8-27.6%. A perfect CPL conformity is reached by Systems 1 and 2 for both languages, while our system is on par with System 3 on en-es and falls slightly behind Systems 3 and 4 on en-de, with a ∼90% average conformity for the two language pairs. System 5 is by far the worst, as it violates the 42 CPL constraint in more than 50% of the lines. As for CPS conformity, we observe that our system achieves better scores than Systems 1 and 5 but worse than Systems 2, 3, and 4 on both language directions, highlighting again the need to improve this aspect in future work.
26 https://www.veed.io/.
27 E.g., see https://www.apptek.com/post/asr-in-captions-accessibility-series-article-7 and https://sonix.ai/articles/how-to-remove-background-audio-noise.
EC Short Clips
This out-of-domain test set presents additional difficulties compared to TED talks, namely the presence of multiple speakers and background music during speech. It is worth mentioning that our direct ST models have not been trained to be robust to these phenomena, as they are not present in the training data, whereas production tools are designed to deal with any condition, and may have dedicated modules to handle them.
Nevertheless, the results in Table 6 show that, even in these challenging conditions, our direct ST models are competitive with production tools on BLEU, Sigma, and SubER. Indeed, there is no clear winner between the systems, as the best score for each metric is obtained by a different model, which also varies across languages. Looking at the conformity constraints, Systems 1, 2, and 4 achieve a perfect CPL conformity (100%), while ours is comparable with System 3 and better than System 5. This difference is likely explained by the number of <eol> inserted by our system, which is considerably lower than that of System 4 (368 vs. 635 for German and 451 vs. 594 for Spanish). Instead, the results for CPS conformity follow the same trend observed in the constrained data condition (§6.2).
EuroParl Interviews
EuroParl Interviews represents the most difficult of the three test sets: it contains multiple speakers, and the target translations are not verbatim since they are compressed to perfectly fit the subtitling constraints (§2.2). This characteristic is very challenging for current automatic subtitling tools, especially for our direct model since it has not been trained on similar data.
The results are shown in Table 7. As on the EC test set, our system performs competitively with production tools, even achieving the best Sigma for German. For CPL, instead, most systems have high length conformity, even reaching 100%. As already noticed on the other test sets, the CPL conformity is strongly correlated with the number of <eol> inserted by a system: our model has an average conformity of 85.5% with only 451 <eol> inserted, nearly half of those inserted by System 1 (864), System 2 (711), and System 4 (774), which always comply with the CPL constraint. CPS conformity shows the same trend as with the other test sets.
Compared to the results in Tables 5 and 6, we can see that all systems struggle to achieve a comparable overall subtitle quality (SubER), high-quality segmentations (Sigma), and, above all, high translation quality (BLEU). The translation quality of all systems degrades by at least 10 BLEU compared to the values observed on the MuST-Cinema and EC test sets. However, as previously mentioned, these results are expected since the EuroParl Interviews test set contains condensed translations of the source speech.
All in all, we can conclude that our direct ST model, even though not developed as a production-ready system (it is not trained on huge amounts of data and different domains), is competitive with production tools. Indeed, considering the SubER metric computed over the three test sets (Table 8), our direct ST approach is the best on both German (67.0) and Spanish (57.2). As only the scores of System 2 fall within the confidence interval of our direct model in both cases, we can conclude that our model is on par with the best production system and outperforms the others in terms of SubER.
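The comparison above rests on checking whether a competing score falls inside the 95% confidence interval of our model. The sketch below illustrates this kind of check under explicit assumptions: it uses bootstrap resampling over segments, which is one common way to obtain such intervals but is not stated in this section, and it works on hypothetical per-segment edit counts and reference lengths standing in for a SubER-like edit rate. The function names and all numbers are illustrative only.

```python
import random
from typing import List, Tuple


def bootstrap_ci(edits: List[int], ref_lens: List[int],
                 n_resamples: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for a corpus-level edit rate (in %)."""
    rng = random.Random(seed)
    n = len(edits)
    scores = []
    for _ in range(n_resamples):
        # Resample segments with replacement, then recompute the corpus score.
        idx = [rng.randrange(n) for _ in range(n)]
        num = sum(edits[i] for i in idx)
        den = sum(ref_lens[i] for i in idx)
        scores.append(100.0 * num / den)
    scores.sort()
    low = scores[int((alpha / 2) * n_resamples)]
    high = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def within(score: float, ci: Tuple[float, float]) -> bool:
    """True if a competing system's score lies inside the interval."""
    low, high = ci
    return low <= score <= high


# Hypothetical per-segment statistics (not taken from the paper).
edits = [6, 9, 4, 12, 7, 5, 10, 8, 6, 11]
ref_lens = [10, 14, 8, 16, 12, 9, 15, 13, 10, 17]
ci = bootstrap_ci(edits, ref_lens)
print(ci, within(63.0, ci))
```

A system whose score lies inside the interval cannot be claimed to differ from ours at the chosen confidence level, which is the reasoning behind treating System 2 as on par with our model.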

Conclusions
In this paper, we proposed the first approach based on direct speech-to-text translation models to fully automatize the subtitling process, including translation, segmentation into subtitles, and timestamp estimation. Experiments in constrained data conditions on 7 language pairs demonstrated the potential of our approach, which outperformed the current cascade architectures with a ∼7% improvement in terms of SubER. In addition, to test the generalisability of our findings across subtitling genres, we extended our evaluation setting by collecting two new test sets for en→{de, es} covering different domains, degrees of subtitle condensation, and audio conditions. Finally, we compared our models with production tools in unconstrained data conditions on both existing benchmarks and the newly collected test sets. This comparison further highlighted that our approach represents a promising direction: although trained on a relatively limited amount of data, our systems achieved comparable quality with production tools, with improvements in SubER ranging from 0.2 to 5.0 on en→de and from 0.6 to 13.0 on en→es over the three test sets.

A Timestamp Extraction Method
To validate the effectiveness of extracting source-side timestamps with the CTC-based segmentation algorithm, we conduct an ablation study, where we replace it with the forced aligner tool of the cascade architecture (§5.2). Table 9 reports the scores. The forced aligner tool (FA) achieves similar results compared to the CTC-based segmentation algorithm (CTC), with a slightly worse SubER (+0.1) on average on the three test sets. Moreover, it is important to highlight that our method does not require an external model. These findings support our choice and align with previous research by Kürzinger et al. (2020), which highlighted the competitiveness of the CTC-based segmentation approach compared to widely used forced aligners (in their case, Gentle28).

B Effect of Background Noise
The presence of background noise in the test sets complicates both the audio segmentation (performed with SHAS) and the generation with the direct ST model. For this reason, for the sake of a fair comparison with production tools, we used Veed to remove the background noise from EC Short Clips and EuroParl Interviews, as mentioned in §6.3. Table 10 shows the impact of background noise on the resulting subtitling quality. By comparing 1. and 3., we notice that the presence of background noise causes an overall relative error increase of ∼5% on average over the two test sets and two language pairs. The degradation is caused both by the lower quality of the audio segmentation of SHAS and by worse outputs produced by the direct ST system, as removing the noise only for segmentation (2.) improves the results obtained without noise removal (3.) by an average of 1.7 SubER. Creating models robust to background noise, though, is a task per se (Seltzer et al., 2013; Li et al., 2014; Mitra et al., 2017) and goes beyond the scope of this work.

Figure 1: Architecture of the direct ST system for automatic subtitling.

Figure 2: Example of BWP projection with (a) same number of blocks and (b) different number of blocks between caption and subtitle.

Table 1: Number of hours of the training sets.

Table 3: Comparison of timestamp projection methods on the MuST-Cinema en→{de, es} test set.

Table 5: Unconstrained results on MuST-Cinema with 95% CI in parentheses.

Table 6: Unconstrained results on EC Short Clips with 95% CI in parentheses.

Table 7: Unconstrained results on EuroParl Interviews with 95% CI in parentheses.

Table 8: SubER (↓) over the three test sets with 95% CI in parentheses.

Table 9: SubER scores (↓) on MuST-Cinema test set (MC), EC Short Clips (ECSC), and EuroParl Interviews (EPI) when the CTC-based audio segmentation (CTC) or the forced aligner (FA) method is used to extract the source-side timestamps.

Table 10: SubER scores (↓) on EC Short Clips (ECSC) and EuroParl Interviews (EPI) with background noise removal for: both the audio segmentation with SHAS and the prediction of the direct ST system (1.); only the audio segmentation, but the noisy audio is fed as input to the direct ST model (2.); no noise removal (3.).