Consistent Transcription and Translation of Speech

The conventional paradigm in speech translation starts with a speech recognition step to generate transcripts, followed by a translation step with the automatic transcripts as input. To address various shortcomings of this paradigm, recent work explores end-to-end trainable direct models that translate without transcribing. However, transcripts can be an indispensable output in practical applications, which often display transcripts alongside the translations to users. We make this common requirement explicit and explore the task of jointly transcribing and translating speech. Although high transcription and translation accuracy is crucial, even highly accurate systems can suffer from inconsistencies between the two outputs that degrade the user experience. We introduce a methodology to evaluate consistency and compare several modeling approaches, including the traditional cascaded approach and end-to-end models. We find that direct models are poorly suited to the joint transcription/translation task, but that end-to-end models that feature a coupled inference procedure are able to achieve strong consistency. We further introduce simple techniques for directly optimizing for consistency, and analyze the resulting trade-offs between consistency, transcription accuracy, and translation accuracy.


Introduction
Speech translation (ST) is the task of translating acoustic speech signals into text in a foreign language. According to the prevalent framing of ST (e.g., Ney, 1999), given some input speech $x$, ST seeks an optimal translation $\hat{t} \in T$, while possibly marginalizing over transcripts $s \in S$:
$$\hat{t} = \operatorname*{argmax}_{t \in T} P(t \mid x) = \operatorname*{argmax}_{t \in T} \sum_{s \in S} P(t \mid s, x)\, P(s \mid x) \qquad (1)$$
According to this formulation, ST models primarily focus on translation quality, while transcription receives less emphasis. In contrast, practical ST user interfaces often display transcripts to the user alongside the translations. A typical example is a two-way conversational ST application that displays the transcript to the speaker for verification, and the translation to the conversation partner (Hsiao et al., 2006). Therefore, there is a mismatch between this practical requirement and the prevalent framing as described above.
1 We release human annotations of consistency under https://github.com/apple/ml-transcript-translation-consistency-ratings.
While traditional ST models often do commit to a single automatic speech recognition (ASR) transcript that is then passed on to a machine translation (MT) component (Stentiford and Steer, 1988; Waibel et al., 1991), researchers have invested much effort in mitigating the resulting error propagation issues by developing models that avoid making decisions on transcripts. Recent examples include direct models (Weiss et al., 2017) that bypass transcript generation, and lattice-to-sequence models (Sperber et al., 2017) that translate the ASR search space as a whole. Despite their merits, such models may not be ideal for scenarios that display both a translation and a corresponding transcript to users.
In this paper, we replace Eq. 1 by a joint transcription/translation objective to reflect this requirement:
$$\hat{s}, \hat{t} = \operatorname*{argmax}_{s \in S,\, t \in T} P(s, t \mid x) \qquad (2)$$
This change in perspective has significant implications not only for model design but also for evaluation. First, besides translation accuracy, transcription accuracy becomes relevant and equally important. Second, the issue of consistency between transcript and translation becomes essential. For example, let us consider a naive approach of transcribing and translating with two completely independent, potentially erroneous models. These independent models would be expected to produce inconsistencies, including inconsistent lexical choice caused by acoustic or linguistic ambiguity (Figure 1), and inconsistent spelling of named entities (Figure 2). Even if output quality is high on average, such inconsistencies may considerably degrade the user experience.
Figure 1: Example of lexical inconsistencies we encountered when generating transcript and translation independently. Although the transcript correctly contains replay, the German translation (mistakenly) chooses ersetzen (English: replace). The inconsistency is explained by the acoustic similarity between replay and replace, which is not obvious to a monolingual user.
Figure 2: Illustration of surface-level consistency between English transcript and German translation. Only translation 1 spells both named entities (Bill Gross and eSolar) consistently, and the German translation Solarthermaltechnologie (translation 1) is preferred over Solarwärme-Technologie (translation 2), by itself a correct choice but less similar on the surface level.
Our contributions are threefold: First, we introduce the notion of consistency between transcripts and translations and propose methods to assess consistency quantitatively. Second, we survey and extend existing models, and develop novel training and inference schemes, under the hypothesis that both joint model training and a coupled inference procedure are desirable for our goal of accurate and consistent models. Third, we provide a comprehensive analysis, comparing accuracy and consistency for a wide variety of model types across several language pairs to determine the most suitable models for our task and analyze potential trade-offs.

2 Evaluation Beyond Accuracy: The Need for Consistency
To better understand the desiderata of models that perform transcription and translation, it is helpful to discuss how one should evaluate such models. A first step is to evaluate transcription accuracy and translation accuracy in isolation. For this purpose, we can use well-established evaluation metrics such as word error rate (WER) for transcripts and BLEU (Papineni et al., 2002) for translations. When considering scenarios in which both transcript and translation are displayed, consistency is an essential additional requirement. Let us first clarify what we mean by this term.
Definition: Consistency between transcript and translation is achieved if both are semantically equivalent, with a preference for a faithful translation approach (Newmark, 1988), meaning that stylistic, lexical, and grammatical characteristics should be transferred whenever fluency is not compromised. Importantly, consistency measures are defined over the space of both well-formed and erroneous sentence pairs. In the case of ungrammatical sentence pairs, consistency may be achieved by adhering to a literal or word-for-word translation strategy.
Consistency is only loosely related to accuracy, and can even be in opposition in some cases. For instance, when a translation error cannot be avoided, consistency is improved at the cost of transcription accuracy by placing the backtranslated error in the transcript. Because accuracy and error metrics assess transcript or translation quality in isolation, these metrics cannot capture phenomena that involve the interplay between transcript and translation.

Motivational Use Cases
Although ultimately user studies must assess to what extent consistency improves user satisfaction, our intention in this paper is to provide a universally useful notion of consistency that does not depend too much on specific use cases. Nevertheless, our definition may be most convincing when put in the context of specific example use cases.
Lecture Use Case. Here, a person follows a presentation or lecture-like event, presented in a foreign language, by reading transcript and translation on a screen (Fügen, 2008). This person may have partial knowledge of the source language, but knows only the target language sufficiently well. She, therefore, pays attention mainly to the translation outputs, but may occasionally consult the transcription output in cases where the translation seems wrong. In this case, quick orientation can be critical, and inconsistencies would cause distraction and undermine trust and perceived transparency of the transcription/translation service.
Dialog Use Case. Next, consider the scenario of a dialog between two people who speak different languages. One person, the speaker, attempts to convey a message to the recipient, relying on an ST service that displays a transcript and a translation. Here, the transcript is shown to the speaker, who speaks only the source language, for purposes of verification and possibly correction. The translation is shown to the recipient, who only understands the target language, to convey the message (Hsiao et al., 2006). We can expect that if transcript and translation are error-free, then the message is conveyed smoothly. However, when the transcript or translation contains errors, miscommunication occurs. To efficiently recover from such miscommunication, both parties should agree on the nature and details of the mistaken content. In other words, errors that do occur should preferably be consistent between transcript and translation.

Estimating Consistency
Having argued for consistency as a desirable property, we now wish to empirically quantify the level of consistency between a particular model's transcripts and translations. To our knowledge, consistency has not been addressed in the context of ST before, perhaps because traditional cascaded models have not been observed to suffer from inconsistencies in the outputs. Therefore, we propose several metrics for estimating transcript/translation consistency in this section. In §7.3, we demonstrate strong agreement of these metrics with human ratings of consistency.

Lexical Consistency
Our first metric focuses on semantic equivalency in general, and consistent lexical choice in particular, as illustrated in Figure 1. To this end, we use a simple lexical coverage model based on word-level translation probabilities. This approach might also capture some aspects of grammatical consistency by rewarding the use of comparable function words. We sum negative translation log-probabilities for each utterance:
$$t_{t \to s} = -\sum_{t_j \in t} \max_{s_i \in s} \log p(t_j \mid s_i).$$
We then normalize across the test corpus $C$ and average over both translation directions:
$$\frac{1}{2}\left(\frac{1}{n}\sum_{(s,t) \in C} t_{t \to s} + \frac{1}{m}\sum_{(s,t) \in C} t_{s \to t}\right),$$
where $n$ and $m$ denote the number of translated and transcribed words in the corpus, respectively. In practice, we use fast_align (Dyer et al., 2013) to estimate probability tables from our training data. When a word has no translation probability assigned, including out-of-vocabulary cases, we use a simple smoothing method by assigning the lowest score found in the lexicon.
Although it may seem tempting to use a more elaborate translation model such as an encoder-decoder model, we deliberately choose this simple lexical approach. The main reason is that we need to estimate consistency for potentially erroneous transcript/translation pairs. In such cases, we found severe robustness issues when computing translation scores using a full-fledged encoder-decoder model.
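As an illustration, the corpus-level score can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation behind the reported results: it assumes the probability tables have already been estimated (e.g., with fast_align) and loaded into plain dictionaries of log-probabilities keyed by (predicted word, conditioning word), and the function names `directional_score` and `lexical_consistency` are hypothetical.

```python
import math


def directional_score(cond_words, pred_words, table, floor):
    """Negative sum over predicted words of the best log p(pred | cond).

    `table` maps (pred_word, cond_word) to a log-probability; pairs that are
    missing from the table fall back to `floor` (the smoothing step).
    Assumes `cond_words` is non-empty.
    """
    total = 0.0
    for pred in pred_words:
        best = max(table.get((pred, cond), floor) for cond in cond_words)
        total -= best
    return total


def lexical_consistency(corpus, logp_t_given_s, logp_s_given_t):
    """Average of both directions, each normalized by its corpus word count.

    `corpus` is a list of (transcript_words, translation_words) pairs.
    """
    # Smoothing: the lowest score found in each lexicon.
    floor_ts = min(logp_t_given_s.values())
    floor_st = min(logp_s_given_t.values())
    sum_ts = sum_st = 0.0
    n = m = 0  # n: translated words, m: transcribed words
    for s, t in corpus:
        sum_ts += directional_score(s, t, logp_t_given_s, floor_ts)
        sum_st += directional_score(t, s, logp_s_given_t, floor_st)
        n += len(t)
        m += len(s)
    return 0.5 * (sum_ts / n + sum_st / m)


# Toy tables (made-up numbers, for illustration only).
toy_t_given_s = {("haus", "house"): math.log(0.5), ("das", "the"): math.log(0.25)}
toy_s_given_t = {("house", "haus"): math.log(0.5), ("the", "das"): math.log(0.5)}
toy_corpus = [(["the", "house"], ["das", "haus"])]
toy_score = lexical_consistency(toy_corpus, toy_t_given_s, toy_s_given_t)
```

Because the score is an average of negative log-probabilities, lower values correspond to better lexical coverage between transcript and translation.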

Surface Form Consistency
Our consistency definition mentions a preference for a stylistic similarity between transcript and translation. One way of assessing stylistic aspects is to compare transcripts and translations at the surface level. This is most sensible when the source and target language are related, and could help capture phenomena such as consistent spelling of named entities, or translations using words with similar surface form as found in the transcript. Figure 2 provides an illustration.
We propose to assess surface form consistency through substring overlap. Our notion of substring overlap follows CharCut, which was proposed as a metric for reference-based MT evaluation (Lardilleux and Lepage, 2017). Following Eq. 2 of that paper, we determine substring insertions, deletions, and shifts in the translation, when compared with the transcript, and compute
$$1 - \frac{\text{deletions} + \text{insertions} + \text{shifts}}{|s| + |t|}.$$
Counts are aggregated and normalized at corpus level. To avoid spurious matches, we match only substrings of at least length $n$ (here: 5), compare in case-sensitive fashion, and deactivate CharCut's special treatment of longest common prefixes/suffixes. We note that surface form consistency is less suited to language pairs that use different alphabets, and leave it to future work to explore alternatives, such as the assessment of cross-lingual phonetic similarity in such cases.
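To make the idea of substring overlap concrete, the following is a deliberately simplified Python sketch. It is not CharCut: it greedily removes longest common substrings of at least 5 characters, counts the unmatched characters on the transcript side as deletions and those on the translation side as insertions, and ignores shift costs entirely. The function names are hypothetical.

```python
def longest_common_substring(a, b):
    """Return (length, start_in_a, start_in_b) of the longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def surface_overlap(s, t, min_len=5):
    """Simplified overlap score in [0, 1]; case-sensitive, shifts are free."""
    if not s and not t:
        return 1.0
    s_parts, t_parts = [s], [t]
    matched = 0
    while True:
        # Find the longest common substring over all remaining segment pairs.
        best = (0, 0, 0, 0, 0)
        for i, a in enumerate(s_parts):
            for j, b in enumerate(t_parts):
                length, ia, jb = longest_common_substring(a, b)
                if length > best[0]:
                    best = (length, i, j, ia, jb)
        length, i, j, ia, jb = best
        if length < min_len:
            break
        matched += length
        # Remove the match and keep the unmatched remainders as new segments.
        a, b = s_parts.pop(i), t_parts.pop(j)
        s_parts += [a[:ia], a[ia + length:]]
        t_parts += [b[:jb], b[jb + length:]]
    unmatched = (len(s) - matched) + (len(t) - matched)
    return 1.0 - unmatched / (len(s) + len(t))


example = surface_overlap("Bill Gross founded eSolar", "Bill Gross startete eSolar")
```

In the example, the two named entities are matched (together with adjacent spaces), while the verbs remain unmatched because their common substrings are shorter than the minimum length.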

Correlation of Transcription/Translation Error
This third metric bases consistency on well-established accuracy metrics or error metrics. We posit that a necessary (though not sufficient) condition for consistency is that the accuracy of the transcript should be correlated with the accuracy of the translation, where both are measured against some respective gold standard. We therefore propose to assess consistency through computing statistical correlation between utterance-level error metrics for transcript and translation.
Specifically, for a test corpus of size $N$, we compute Kendall's $\tau$ coefficient across utterance-level error metrics. On the transcript side, we use utterance-level WER as the error metric. Because BLEU is a poor utterance-level metric, we make use of CharCut on the translation side, which has been shown to correlate well with human judgment at utterance level (Lardilleux and Lepage, 2017). Formally, we compute:
$$\tau\left(\mathrm{WER}^{\text{clipped}}_{1:N},\ \mathrm{CharCut}_{1:N}\right), \qquad \mathrm{WER}^{\text{clipped}}_k = \min(1, \mathrm{WER}_k) \qquad (3)$$
Because CharCut is clipped above 1, we also apply clipping to utterance-level WER for stability.
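The correlation measure can be sketched as follows. This is a minimal sketch under stated assumptions: utterance-level CharCut scores are taken as given (computing them is outside the scope of the example), WER is computed with a plain word-level edit distance on non-empty references, and the tau-b variant of Kendall's coefficient is used to handle ties, which is a choice made here for illustration.

```python
import math


def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    dist = list(range(len(h) + 1))  # edit distances for the previous row
    for i in range(1, len(r) + 1):
        prev_diag, dist[0] = dist[0], i
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            prev_diag, dist[j] = dist[j], min(
                dist[j] + 1,       # deletion
                dist[j - 1] + 1,   # insertion
                prev_diag + cost,  # substitution or match
            )
    return dist[len(h)] / len(r)


def kendall_tau_b(xs, ys):
    """Kendall's tau-b over paired observations (O(N^2) pair enumeration)."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: counted in neither tie total
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


refs = ["a b c d", "a b c d", "a b"]
hyps = ["a b c d", "a x c d", "x y z w q"]
clipped_wers = [min(1.0, wer(r, h)) for r, h in zip(refs, hyps)]
charcut_scores = [0.0, 0.2, 0.9]  # made-up utterance-level CharCut values
tau = kendall_tau_b(clipped_wers, charcut_scores)
```

A value of tau close to 1 means that utterances with poor transcripts also tend to have poor translations, which is the necessary condition for consistency discussed above.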

Combined Metric for Dialog Task
The previous metrics estimate consistency in a fashion that is complementary to accuracy, such that it is possible to achieve good consistency despite poor accuracy. This allows trading off accuracy against consistency, depending on specific task requirements. Here, we explore a particular instance of such a task-specific trade-off that arises naturally through the formulation of a communication model. We consider a dialog situation (§2.1), and assume that communication will be successful if and only if both transcript and translation do not contain significant deviations from some reference, as motivated in Figure 3. Conceptually, the main difference to §3.3 is that here we penalize, rather than reward, the bad/bad situation (Figure 3). To estimate the probability of some generated transcript and translation allowing successful communication, given reference transcript and translation, we thus require that both the transcript and the translation are sufficiently accurate. For the utterance with index $k$:
$$P(\mathrm{succ}_k) = \mathrm{accuracy}(s_k) \cdot \mathrm{accuracy}(t_k) \qquad (4)$$
We then use utterance-level accuracy metrics as a proxy, computing $\mathrm{accuracy}(s_k) = 1 - \mathrm{WER}^{\text{clipped}}_k$ and $\mathrm{accuracy}(t_k) = 1 - \mathrm{CharCut}_k$. For a test corpus of size $N$ we compute corpus-level scores as $\frac{1}{N}\sum_{1 \le k \le N} P(\mathrm{succ}_k)$.
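A small sketch of the corpus-level computation follows. It assumes that the per-utterance success probability is the product of the two accuracy proxies, which is one natural reading of the requirement that both outputs be sufficiently accurate; the function name `dialog_success_score` and the input values are hypothetical.

```python
def dialog_success_score(wers, charcuts):
    """Mean over utterances of (1 - clipped WER) * (1 - CharCut).

    Assumption: the two accuracies are combined as a product.
    Both inputs are utterance-level error scores of equal length.
    """
    probs = []
    for w, c in zip(wers, charcuts):
        acc_s = 1.0 - min(1.0, w)  # transcript accuracy proxy
        acc_t = 1.0 - c            # translation accuracy proxy
        probs.append(acc_s * acc_t)
    return sum(probs) / len(probs)


# Made-up scores: a perfect pair, a half-wrong pair, and a pair whose
# transcript is so poor that its WER exceeds 1 before clipping.
score = dialog_success_score([0.0, 0.5, 2.0], [0.0, 0.5, 0.1])
```

Under this reading, an utterance whose transcript and translation are both poor contributes a score near zero, in contrast to the correlation measure, which rewards such agreement.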

Models for Transcription and Translation
We now turn to discuss model candidates for consistent transcription and translation of speech. We hypothesize that there are two desirable model characteristics in our scenario. First, motivated by Eq. 2, models may achieve better consistency by performing joint inference, in the sense that no independence assumptions between transcript and translation are made. We call this characteristic coupled inference. Second, shared representations through end-to-end (or joint) training may be of advantage in our scenario. We introduce several model variants, and also discuss whether they match these characteristics.

Model Basics
For a fair comparison, we keep the underlying architectural details as similar as possible across compared model types. All models are based on the attentional encoder-decoder framework (Bahdanau et al., 2015). For audio encoders, we roughly follow Chiu et al. (2018)'s multilayer bidirectional LSTM model, which encodes log-Mel speech features that are stacked and downsampled by a factor of 3 before being consumed by the encoder. When a model requires a text encoder (§4.2), we utilize residual connections and feed-forward blocks similar to Vaswani et al. (2017), although for simplicity we use LSTMs (Hochreiter and Schmidhuber, 1997) rather than self-attention in all encoder (and decoder) components. Similarly, decoder components use residual blocks of (unidirectional) LSTMs and feed-forward components (Domhan, 2018).
For ease of reference, we use enc(·) to refer to the encoder component that transforms speech inputs (or embedded text inputs) into hidden encoder representations, dec(·) to refer to the attentional decoder component that produces hidden decoder states auto-regressively, and SoftmaxOut(·) to refer to the output softmax layer that models discrete output token probabilities. We will subscript components with the parameter sets π, φ to indicate cases in which model components are separately parametrized.
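The stacking and downsampling of speech features mentioned above can be illustrated as follows. This is a sketch under assumptions: three consecutive frames are concatenated into one longer frame with a stride of three, and any leftover frames at the end are dropped; the exact handling in the actual system may differ. The function name is hypothetical.

```python
def stack_and_downsample(frames, factor=3):
    """Concatenate every `factor` consecutive frames into one stacked frame.

    `frames` is a list of equal-length feature vectors (lists of floats).
    Trailing frames that do not fill a complete group are dropped
    (an assumption made for this sketch).
    """
    usable = len(frames) - len(frames) % factor
    return [
        [value for frame in frames[i:i + factor] for value in frame]
        for i in range(0, usable, factor)
    ]


# Seven 40-dimensional frames become two 120-dimensional frames.
toy_frames = [[float(i)] * 40 for i in range(7)]
stacked = stack_and_downsample(toy_frames)
```

The effect is that the encoder sees a sequence that is three times shorter, with each input vector covering three original frames.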

Cascaded Model (CASC)
The cascaded model (Figure 4a) represents ST's traditional approach of using separately trained ASR and MT models (Stentiford and Steer, 1988; Waibel et al., 1991). Here, we use modern sequence-to-sequence ASR and MT components. CASC runs a speech input $x_{1:l}$ through an ASR model, decodes the best hypothesis transcript $\hat{s}$, and then applies a separate MT model to generate a translation.
With respect to the two desirable characteristics of a consistent model, notice that CASC uses a coupled inference procedure, in the sense that no strong independence assumptions are made between transcript and translation. CASC may therefore be a good candidate for consistent speech transcription/translation. However, it is less straightforward to apply end-to-end training to cascaded models.

Direct Models
To improve over the cascaded approach, recent work has focused on end-to-end trainable models, with direct ST models being the most prototypical end-to-end model. In the following, we describe straightforward ways of extending direct models in order to apply them to our joint transcription/translation task. Note that these direct models (Figure 4b-d) generate transcripts and translations independently at inference time. In other words, these models do not support coupled inference, which may degrade consistency between transcript and translation.
It is worth discussing how our consistent transcription/translation scenario relates to the issue of error propagation, an important issue in ST in which translations are degraded due to poor transcription decisions. Prior research on direct ST models has often been motivated by the observation that direct ST models elegantly avoid the error propagation problem. However, note that by shifting perspective to the joint transcription/translation goal, error propagation loses much of its relevance. First, error propagation is usually used to describe the negative effect of intermediate decisions, but here transcripts no longer function as intermediates. Second, strategies to mitigate error propagation often seek to make translations less influenced by transcription decisions. This is in conflict with our goal of achieving consistency between transcript and translation, which calls for precisely the opposite: Transcription and translation decisions should strongly depend on each other.

Independent Direct Model (DIRIND)
A simple way of using direct modeling strategies for our purposes is to use two independent direct models, one for transcription and one for translation (Figure 4b). Specifically, each model has its own parameters: one computes $g_{1:l} = \mathrm{enc}_\phi(x_{1:l})$ and decodes with $\mathrm{dec}_\phi$, while the other uses $\mathrm{enc}_\pi$ and $\mathrm{dec}_\pi$ (Eq. 7). We are not aware of prior work using independent models for transcription and translation. We include this model as a contrastive baseline for the subsequent two models.

Multitask Direct Model (DIRMU)
A major weakness of DIRIND is that transcription and translation models are trained separately. A better solution is to follow Weiss et al. (2017)'s approach and share the speech encoder between transcription and translation models while making use of multitask training. Compared with Eq. 7, $\mathrm{enc}_\phi$ and $\mathrm{enc}_\pi$ would be collapsed into a shared encoder (Figure 4c). Note that originally, Weiss et al. (2017) and follow-up works use the transcript decoder only to aid training and exploit additional data for ASR as a related task in multitask learning. However, it is straightforward to utilize the transcript decoder during inference for our purposes.

Shared Direct Model (DIRSH)
We can also take the amount of sharing to the extreme by sharing all weights, not just encoder weights. Increasing the number of shared parameters may positively impact transcription/translation consistency. We are not aware of prior work using this model variant for performing speech translation. Compared with Eq. 7, both $\mathrm{enc}_\phi$/$\mathrm{enc}_\pi$ and $\mathrm{dec}_\phi$/$\mathrm{dec}_\pi$ are collapsed into a shared encoder and a shared decoder (Figure 4d).

Joint Models
We previously discussed CASC as a model that features coupled inference but does not support end-to-end training. We also discussed several direct models, some of which support end-to-end training, but none of which follow a coupled inference procedure. This section introduces joint models that support both end-to-end training and coupled inference.

Two-Stage Model (2ST)
The two-stage model (Kano et al., 2017) is conceptually close to the cascaded approach but is end-to-end trainable because continuous transcript decoder states are passed on to the translation stage. Following Sperber et al. (2019)'s formulation, we re-use Eq. 5 to model a transcript $s$ and hidden decoder states $u_{1:m}$, and then compute the translation from these states. Beam search is applied to decode transcripts, as well as the corresponding hidden decoder states $u_{1:m}$, which are then translated. Note that in contrast to our paper, Kano et al. (2017) and Sperber et al. (2019) treat transcripts only as intermediate computations and do not report transcription accuracies.

Triangle Model (TRI)
The triangle model (Anastasopoulos and Chiang, 2018) extends 2ST by adding a second attention mechanism to the translation decoder that directly attends to the encoded speech inputs. Eq. 5 is reused for transcription, and translations are computed by a decoder that attends to both the transcript decoder states and the encoded speech (Eq. 9). TRI can be seen as combining DIRMU's advantage of featuring a direct connection between speech and translation, and 2ST's advantage of supporting joint inference. Anastasopoulos and Chiang (2018) evaluate both transcription and translation accuracy in a low-resource setting and report consistent improvements for the latter but less reliable gains for the former.

Concatenated Model (CONCAT)
Haghani et al. (2018) propose a sequence-to-sequence model that produces the concatenation of two output sequences in the context of spoken language understanding. To our knowledge it has not been utilized in an ST context before, but is a very natural fit for our joint transcription/translation scenario. CONCAT shares both the encoder and the decoder, leading to improved compactness:
$$\begin{aligned} r_{1:m+n} &:= s_1 \ldots s_m\, t_1 \ldots t_n \\ g_{1:l} &= \mathrm{enc}(x_{1:l}) \\ u_i &= \mathrm{dec}(u_{<i}, g_{1:l}, r_{i-1}) \\ P(r_i \mid r_{<i}, x_{1:l}) &= \mathrm{SoftmaxOut}(u_i) \end{aligned} \qquad (10)$$
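On the data side, the concatenated target can be illustrated with a short sketch. The boundary token `<sep>` is a hypothetical device introduced here so that a decoded sequence can be split again; how the boundary between transcript and translation is actually marked is not specified above. The function names are likewise hypothetical.

```python
SEP = "<sep>"  # hypothetical boundary token, an assumption of this sketch


def build_concat_target(transcript_tokens, translation_tokens):
    """Form the single target sequence: transcript tokens, then translation tokens."""
    return list(transcript_tokens) + [SEP] + list(translation_tokens)


def split_concat_output(tokens):
    """Recover (transcript, translation) from a decoded concatenated sequence."""
    if SEP not in tokens:
        return list(tokens), []  # decoder never reached the translation part
    boundary = tokens.index(SEP)
    return list(tokens[:boundary]), list(tokens[boundary + 1:])


target = build_concat_target(["please", "replay", "it"], ["bitte", "wiederholen"])
transcript, translation = split_concat_output(target)
```

Because a single auto-regressive decoder emits the whole sequence, every translation token is generated with the complete transcript in its history, which is what makes inference coupled in this model.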

Consistency as Training and Inference Objectives
Having surveyed models that are suitable for our task to various degrees, we next explore simple ways to further improve the consistency of the generated outputs through adjusting training or inference objectives.

Consistency as Training Objective
At training time, we wish to introduce a loss term that penalizes inconsistent outputs. Whereas the consistency measures discussed in §3 are all defined at either the utterance or the corpus level, we define our loss term at the token level for convenient integration with the standard cross entropy loss term. For convenience, we opt to follow the notion of surface-level consistency (§3.2), according to which we may encourage models to assign probability mass to transcript (subword) tokens that appear in the translation, and to translated tokens that appear in the transcript. Consider the standard cross entropy loss, which is computed against the ground-truth label distribution $q(y_i) = \delta_{y_i, y_i^*}$ for predicted label $y_i$ at target position $i$, assigning all probability mass to the reference token $y_i^*$. We modify the ground-truth label distribution for transcript and translation outputs, respectively, so that part of the probability mass is moved to tokens that appear in the other output sequence. This can be seen as an instance of non-uniform label smoothing with strength ε (Szegedy et al., 2016). In practice, we give this loss term a relative weight of 0.1 during training, while at the same time disabling label smoothing. Because this loss requires access to the complete transcript and translation, we do not apply it at inference time.
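The following sketch shows one possible form of such a smoothed label distribution. It is an assumption-laden illustration, not the formulation used in the experiments: it keeps mass 1 − ε on the reference token and spreads ε uniformly over the distinct tokens of the other sequence, and the value ε = 0.1 as well as all function names are chosen for the example only.

```python
import math


def consistency_label_distribution(reference_token, other_sequence, vocab, epsilon=0.1):
    """Target distribution for one position of one output sequence.

    Assumption: epsilon is spread uniformly over the distinct tokens of the
    other sequence (translation tokens when labeling the transcript, and
    vice versa); the reference token keeps the remaining mass.
    """
    dist = {token: 0.0 for token in vocab}
    shared = sorted(set(other_sequence) & set(vocab))
    if not shared:
        dist[reference_token] = 1.0  # fall back to the one-hot distribution
        return dist
    dist[reference_token] = 1.0 - epsilon
    for token in shared:
        dist[token] += epsilon / len(shared)
    return dist


def soft_cross_entropy(dist, log_probs):
    """Cross entropy of model log-probabilities against a soft target."""
    return -sum(q * log_probs[token] for token, q in dist.items() if q > 0.0)


vocab = ["a", "b", "c", "d"]
target_dist = consistency_label_distribution("a", ["b", "c", "b"], vocab)
uniform_log_probs = {token: math.log(0.25) for token in vocab}
loss = soft_cross_entropy(target_dist, uniform_log_probs)
```

In this toy case the reference token "a" keeps 0.9 of the mass, while "b" and "c", which occur in the other sequence, receive 0.05 each.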

Consistency as Inference Objective
We can also modify the inference objective to enforce more consistent outputs. A simple way of accomplishing this is via n-best rescoring. This is especially convenient when using consistency measures such as lexical consistency (§3.1), which can be computed without referring to a gold standard. Our approach here follows two simple steps: First, we compute n-best lists using standard beam search. Second, we select the (s, t)-pair that produces the best lexical consistency score. We expect this rescoring approach to yield improved consistency, while possibly degrading transcript or translation accuracy. Future work may explore ways for more explicitly balancing model and consistency scores.
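The two steps can be sketched as follows. This is a minimal sketch under assumptions: candidate pairs are formed as all combinations of an n-best transcript list and an n-best translation list (joint models might instead produce candidate pairs directly), and the consistency measure is passed in as a cost where lower is better. The toy lexicon and cost function are made up for the example and stand in for the lexical consistency score.

```python
from itertools import product


def rescore(nbest_transcripts, nbest_translations, consistency_cost):
    """Return the (transcript, translation) pair with the lowest consistency cost."""
    return min(
        product(nbest_transcripts, nbest_translations),
        key=lambda pair: consistency_cost(*pair),
    )


# Toy stand-in for a lexical consistency score (hypothetical lexicon).
TOY_LEXICON = {"replay": "wiederholen", "replace": "ersetzen"}


def toy_cost(transcript, translation):
    """Count transcript words whose lexicon translation is missing."""
    target_words = translation.split()
    return sum(
        1
        for word in transcript.split()
        if word in TOY_LEXICON and TOY_LEXICON[word] not in target_words
    )


best_pair = rescore(
    ["please replay it", "please replace it"],
    ["bitte ersetzen"],
    toy_cost,
)
```

In the toy example, the selected pair contains the transcript with replace, because it agrees with the only available translation; this mirrors the trade-off noted above, where consistency can improve at the expense of transcription accuracy.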

6 Experimental Setup

Data
We conduct experiments on the MuST-C corpus (di Gangi et al., 2019), the largest publicly available ST corpus, containing TED 5 talks paired with English transcripts and translations into several languages. We present results for German, Spanish, Dutch, and Russian as the target language, where the data size is 408-504 hours of English speech, corresponding to 234K-270K utterances. In TED, translated subtitles are not displayed simultaneously with the transcribed subtitles, and consistency is therefore not inherently required in this data. In practice, however, the manual translation workflow in TED results in a sufficient level of consistency between transcripts and translations. Specifically, transcripts are generated first, and translators are required to use the transcript as a starting point while also referring to the audio. 6 We use MuST-C dev for validation and report results on tst-COMMON.

Model and Training Details
We make use of the 40-dimensional log Mel filterbank speech features provided with the corpus. The only text preprocessing applied to the training data is subword tokenization using SentencePiece (Kudo and Richardson, 2018) with the unigram setting. Following most recent work on end-to-end ST models, we choose a relatively small vocabulary size of 1024, with transcription/translation vocabularies shared. No additional preprocessing steps are applied for training, but for transcript evaluation we remove punctuation and non-speech event markers such as (laughter), and compute case-insensitive WER. For translations, we remove non-speech markers from the decoded outputs and use SacreBleu (Post, 2019) to handle tokenization and scoring.
5 www.ted.com.
6 www.ted.com/participate/translate.
Model hyperparameters are manually tuned for the highest accuracy with DIRMU, our most relevant baseline. Unless otherwise noted, the same hyperparameters are used for all other model types. Weights for the speech encoder are initialized based on a pre-trained attentional ASR model that is identical to the ASR part of the direct multitask model. Other weights are initialized according to Glorot and Bengio (2010). The speech encoder is a 5-layer bidirectional LSTM with 700 dimensions per direction. Attentional decoders consist of 2 Transformer blocks (Vaswani et al., 2017) but use 1024-dimensional unidirectional LSTMs instead of self-attention as a sequence model, except for CONCAT and DIRSH, for which we increase to 3 layers. For CASC's MT model, encoder/decoder both contain 6 layers with 1024-dimensional LSTMs. Subword embeddings are of size 1024.
We regularize using LSTM dropout with p = 0.3, decoder input word-type dropout (Gal and Ghahramani, 2016), and attention dropout, both p = 0.1. We apply label smoothing with strength ε = 0.1. We optimize using Adam (Kingma and Ba, 2014) with $\alpha = 0.0005$, $\beta_1 = 0.9$, $\beta_2 = 0.98$, 4000 warm-up steps, and learning rate decay by using the inverse square root of the iteration. We set the batch size dynamically based on the sentence length, such that the average batch size is 128 utterances. The training is stopped when the validation score has not improved over 3 epochs, where the validation score is the product of corpus-level translation BLEU score and corpus-level transcription word accuracy.
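The learning rate schedule and stopping criterion can be sketched as below. Several details are assumptions made for this sketch: the warm-up is taken to be linear and to peak at α after 4000 steps, word accuracy is taken to be 1 − WER, and "not improved over 3 epochs" is read as the best score of the last 3 epochs not exceeding the best earlier score. All function names are hypothetical.

```python
def learning_rate(step, peak=0.0005, warmup=4000):
    """Linear warm-up to `peak`, then inverse-square-root decay."""
    step = max(step, 1)
    return peak * min(step / warmup, (warmup / step) ** 0.5)


def validation_score(bleu, wer):
    """Product of corpus-level BLEU and word accuracy (assumed to be 1 - WER)."""
    return bleu * (1.0 - wer)


def should_stop(scores, patience=3):
    """True if the last `patience` epochs brought no new best validation score."""
    if len(scores) <= patience:
        return False
    return max(scores[-patience:]) <= max(scores[:-patience])


lr_at_peak = learning_rate(4000)
lr_mid_warmup = learning_rate(2000)
lr_decayed = learning_rate(16000)
```

With these settings, the rate reaches 0.0005 at step 4000 and has halved again by step 16000.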
For decoding and generating n-best lists, we use beam size 10 and polynomial length normalization with exponent 1.5. Our implementation is based on PyTorch (Paszke et al., 2019) and xnmt (Neubig et al., 2018), and all training runs use single-GPU environments with Tesla V100 GPUs with 32 GB memory.
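As a brief illustration of polynomial length normalization, the sketch below divides a hypothesis log-probability by its length raised to the exponent. Whether normalization is applied during search or only to finished hypotheses is not stated above, so the sketch simply ranks finished hypotheses; the function name and the numbers are made up.

```python
def normalized_score(log_prob, length, exponent=1.5):
    """Length-normalized hypothesis score: log-probability / length ** exponent."""
    return log_prob / (length ** exponent)


# Two finished hypotheses as (total log-probability, length in tokens).
short_hyp = (-4.0, 4)
long_hyp = (-9.0, 9)
prefers_long = normalized_score(*long_hyp) > normalized_score(*short_hyp)
```

Without normalization the shorter hypothesis would win on raw log-probability; with exponent 1.5 the longer one is ranked higher, which counteracts the bias of beam search toward short outputs.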

Human Ratings
To obtain a gold standard to compare our proposed automatic consistency metrics against, we collect transcript/translation consistency ratings from human annotators. The annotators are presented with a single transcript/translation pair at a time, and are asked to judge the consistency on a 4-point Likert scale. We aimed for a balanced scale which assigned a score of 4 to cases with no or only minor mismatch, a score of 3 to indicate a purely stylistic mismatch, a score of 2 to indicate a partial semantic mismatch, and a score of 1 to a complete semantic mismatch. Instructions given to the annotators include an explanation of the definition given in §2 along with a table of several examples for each of the 4 categories. We displayed transcripts and translations in randomized order, so as to obfuscate the directionality of the translation, and do not provide the source speech utterances. Annotators are recruited from an in-house pool of trusted annotators and required to be proficient English and German speakers. For each of the 2641 speech utterances in the MuST-C English-German test set, we collect annotations for 8 transcript/translation pairs: 7 system outputs produced by the models in Table 1, and the reference transcript/translation pairs. Each transcript/translation item is rated individually and by at least three different annotators. In total, we used 58 raters to produce 63412 ratings. We fit a linear mixed-effects model on the result using the lme4 package (Bates et al., 2013), which allows estimating the consistency of the outputs for each system, while accounting for random effects of each annotator and of each input sentence. We refer to Norman (2010) and Gibson et al. (2011) for a discussion of using mixed-effects models in the context of Likert-scale ratings.
Table 1: Overview of models and key properties. All models except CASC/DIRIND are end-to-end (E2E) trained. Models also differ in whether translations are conditioned on transcripts (t|s), and whether conditioning is implemented through attention or through sequential decoder states.

Results
We start by presenting empirical results across all four language pairs, and will then focus on English-German to discuss details. Table 1 contrasts the different model types that we examine.

Accuracy Comparison
To validate our implementation and to evaluate the overall model accuracy, Table 2 compares models across four language pairs. The table confirms that, except for DIRIND, our models obtain strong overall accuracies, as compared with prior work on the same data by Di Gangi et al. (2019). 8 Overall, CASC outperforms CONCAT and the 3 direct models in terms of WER and BLEU. 2ST/TRI achieve similar or stronger translation accuracy compared with CASC. Joint model training (used by all models except CASC and DIRIND) seems to hurt transcription accuracy somewhat, although the differences are often not statistically significant. This may be caused by an inherent trade-off between translation and transcription accuracy, as discussed by He et al. (2011). Finally, CONCAT achieves favorable transcription accuracies, and translation accuracies fall between direct models and non-direct models in most cases. Table 2 also shows results for lexical consistency. Without exception, 2ST/TRI achieve the best results, followed by CASC and CONCAT. The direct models perform poorly in all cases. Given that CASC is by design a natural choice for joint transcription/translation, we did not necessarily expect 2ST/TRI to achieve better consistency. This encouraging evidence for the versatility of end-toend trainable models is also supported by human ratings ( §7.3).

Lexical Consistency Comparison
Categorizing models by inference procedure and end-to-end training (Table 1), we observe that coupled inference (all non-direct models) is most decisive for achieving good consistency, with conditioning on generated transcripts through sequential hidden states (CONCAT) being less effective than conditioning through attention (other non-direct models). End-to-end training also appears beneficial for consistency (CASC vs. 2ST/TRI and DIRIND vs. DIRMU/DIRSH). Encouragingly, lexical and surface form consistencies are aligned, and follow the same trends as the gold standard. The correlation-based measure agrees on the inferior consistency of direct models and the superior consistency of TRI, while producing slightly different orderings among the remaining models. According to our combined dialog-specific measure, TRI/2ST are tied for the best overall model.

Table 3: Detailed consistency results, including surface form consistency (Sur; §3.2), correlation of error (Cor; §3.3), and the combined task-specific metric (Cmb; §3.4). Bold font indicates the best score among automatic outputs. Results that are not statistically significantly worse than the best score in the same column are in italics.

Analysis of Consistency Metrics
One noteworthy observation is that the lexical consistency of the references is far worse than that of the 2ST/TRI outputs. This contradicts the gold standard outputs and is possibly caused by both the system outputs and the lexical consistency score being overly literal and biased toward high-frequency outputs. For comparison against references, the surface form consistency therefore appears to be a better choice.

Consistency vs. Accuracy
Tables 2 and 3 tend to assign better consistency scores to models with higher accuracy scores. We wish to verify whether this trend is due to the model characteristics or whether it indicates that our metrics fail to decouple accuracy and consistency. To this end, we again focus on English-German and introduce two new model variants. First, CINDP performs translation using CASC, but transcribes with an independently trained direct model. We expect such a model to show high accuracy but low consistency, a hypothesis that is confirmed by the results in Table 5, contrasted against DIRMU. Second, we train a weaker 2-stage model, 2ST/2, using only half the training data. For such a model, we would expect lower accuracy but not lower consistency, which is again confirmed by Table 5, at least to some extent (lexical consistency is worse, but the correlation measure improves). These findings indicate that accuracy and consistency are in fact reasonably well decoupled.

Table 5: Consistency vs. accuracy. CINDP achieves better accuracy than DIRMU, but worse consistency scores. 2ST/2 is trained on less data than 2ST, which hurts its accuracy but not its consistency scores.

Qualitative Analysis
Manual inspection of the outputs of DIRMU and TRI for the English-German model confirms our intuition and the quantitative findings presented above, namely, that DIRMU suffers from considerable consistency issues due to transcripts and translations being generated separately. Examples in the decoded test data are in fact easy to spot, whereas for TRI we do not find any consistency problems. Figures 6-8 show cherry-picked examples.

Related Work
To our knowledge there exists no prior work on consistency for joint transcription and translation of speech in particular, or other multitask conditional sequence generation models in general. The closest related prior work is perhaps Ribeiro et al. (2019), who analyze the case of contradictory model outputs in a question answering task in which multiple different but highly related questions are shown to the model. Other prior work examines the trade-off between transcription and translation quality in more traditional speech translation models theoretically (He and Deng, 2011) and empirically (He et al., 2011). Findings indicate that optimizing for WER does not necessarily lead to the best translations in a cascaded speech translation model, which is in line with the accuracy trade-offs observed in our experiment. Concurrent work explores synchronous decoding strategies for jointly transcribing and translating speech, but does not discuss the issue of consistency (Liu et al., 2020). With regard to our consistency evaluation metrics, a closely related line of research is work on quality estimation and cross-lingual similarity metrics (Fonseca et al., 2019). An important difference of transcription/translation consistency is that for purposes of assessing consistency there is no directionality, and both input sequences can be erroneous. It is therefore especially important for metrics to be robust against errors on both sides. Moreover, stylistic differences are often not accounted for in this line of prior work. We note the similarity of our proposed lexical consistency metric to work by Popović et al. (2011), and leave it for future work to explore whether metrics from other related work can and should be employed to measure consistency.
Finally, producing transcripts alongside translations may be framed as producing an explanation (the transcript) alongside the main output (the translation). Research on explainable machine learning systems (Smith-Renner et al., 2020, and references therein) may shed light on desirable properties of these explanations from a usability point of view, as well as on questions related to appropriate user interface design.

Conclusion
This paper investigates the task of jointly transcribing and translating speech, which is relevant for use cases in which both transcripts and translations are displayed to users. The main theme has been the discussion of consistency between transcripts and translations. To this end, we proposed a notion of consistency and introduced techniques to estimate it. We conducted a thorough comparison across a wide range of models, both traditional and end-to-end trainable, with regard to both accuracy and consistency. As an important model ingredient, we found that a coupled inference procedure, where translations are conditioned on transcripts through attention, is particularly helpful. We also found that end-to-end training improves consistency and translations but at the cost of degraded transcripts. We further introduced training and inference techniques that are effective at further improving consistency, which we found to also come with some trade-offs.
Future work should examine how consistency correlates with user experience in practice and establish specific trade-offs for various use cases. Moreover, our techniques are applicable to other multitask use cases that could potentially benefit from consistent outputs.