Abstract
When generating text from probabilistic models, the chosen decoding strategy has a profound effect on the resulting text. Yet the properties elicited by various decoding strategies do not always transfer across natural language generation tasks. For example, while mode-seeking methods like beam search perform remarkably well for machine translation, they have been observed to lead to incoherent and repetitive text in story generation. Despite such observations, the effectiveness of decoding strategies is often assessed on only a single task. This work—in contrast—provides a comprehensive analysis of the interaction between language generation tasks and decoding strategies. Specifically, we measure changes in attributes of generated text as a function of both decoding strategy and task, using human and automatic evaluation. Our results reveal both previously observed and novel findings. For example, the nature of the diversity–quality trade-off in language generation is very task-specific, and the length bias often attributed to beam search is not constant across tasks. Our code is available at https://github.com/gianwiher/decoding-NLG.
1 Introduction
Modern neural networks constitute an exciting new approach for the generation of natural language text. Much of the initial research into neural text generators went into designing different architectures (Sutskever et al., 2014; Rush et al., 2015; Serban et al., 2017). However, recent work has hinted that the choice of decoding strategy (i.e., the method used to generate strings from the model) may be more important than the model architecture itself. For instance, a well-replicated recent result is that, under a probabilistic neural text generator trained with the maximum-likelihood objective, the most probable string is often not human-like or high quality (Stahlberg and Byrne, 2019; Eikema and Aziz, 2020). In light of this finding, a plethora of decoding strategies have been introduced in the literature, each claiming to generate more desirable text than competing approaches.
Lamentably, empirical studies of decoding strategies are typically evaluated on a single natural language generation task—without investigation into how performance may change across tasks—despite the fact that these tasks differ qualitatively across a large number of axes. These qualitative differences manifest quantitatively as well: For example, we can see in Figure 1 that high-probability strings are favorable in some tasks, like machine translation (MT), while heavily disfavored in others, like story generation (SG). Consequently, we should not a priori expect a strategy that works well for one task to demonstrate the same performance in another. Indeed, several cases already show evidence of this: Beam search works remarkably well for machine translation but, outside of this context, has been observed to return dull or degenerate text (Holtzman et al., 2020; DeLucia et al., 2021). This raises a natural concern that decoding strategies have been optimized for performance on a specific task, and that task-agnostic claims about the effectiveness of one decoding strategy over another are potentially ill-founded. A broader analysis of decoding strategies—both within and across tasks—is needed in order to fully understand the extent of such a problem.
Figure 1: Quality-probability trade-off for different language generation tasks: story generation, unconditional language generation, abstractive summarization, dialogue, and machine translation. Notably, general trends in each curve differ drastically across tasks, despite training models with the same objective. See §4.3 for details on how quality scores are computed.
Our work fills this gap, providing the first comprehensive comparison of decoding strategies across natural language generation tasks. Empirically, we compare strategy performance on several axes, taxonomizing methods into groups such as deterministic and stochastic, to understand the importance of various strategy attributes for quantifiable properties of text. In summary, our main findings include the following:
Many previous empirical observations, among them the quality-diversity and quality-probability trade-offs (Ippolito et al., 2019; Zhang et al., 2021; Nadeem et al., 2020), manifest themselves in very task-specific ways. For example, our experiments reveal a distinct quality-diversity trade-off albeit only in a certain subset of tasks. This brings into question whether there is a single phenomenon under consideration or many distinct, but related, phenomena.
A group-level analysis shows the first empirical evidence of a distinct divide in preference for stochastic versus deterministic strategies across tasks: All directed generation tasks appear to favor deterministic strategies, albeit to notably varying degrees, while the preference is inverted for story generation.
We see these results as both a reference point for language generation practitioners, so that they can more confidently choose a decoding strategy that fits their needs, and as an indicator of potential strengths and weaknesses of today’s neural probabilistic language generators. We have reason to believe that there is a task-specific optimization happening in the literature whereby many of the proposed and (even celebrated) decoding strategies only outperform their competitors on specific tasks. Thus, our paper serves as a cautionary note about proper comparisons.
2 Probabilistic Language Generators
In order to learn the weights θ, we minimize some loss function L(θ), defined in terms of a corpus C. In theory, we want examples in C to be assigned high probability. Accordingly, our loss is typically their negative log-likelihood under q.1
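As a concrete illustration of this objective, the following minimal sketch computes the corpus-level negative log-likelihood for a locally normalized model; the callable interface of q and all names are illustrative assumptions rather than the implementation used in this work.

```python
import math

def negative_log_likelihood(q, corpus, bos="<bos>"):
    """Corpus-level NLL of a locally normalized model q.

    Assumes q(context) returns a dict mapping each vocabulary item to
    its conditional probability q(y_t | y_<t); `corpus` is a list of
    token sequences, each ending in an eos symbol.
    """
    nll = 0.0
    for y in corpus:
        context = [bos]
        for token in y:
            nll -= math.log(q(tuple(context))[token])  # -log q(y_t | y_<t)
            context.append(token)
    return nll
```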
3 The Decoding Problem
3.1 Deterministic Algorithms
Greedy Search.
Beam Search.
Beyond the log-probability, other scoring functions have been proposed as modifications to the vanilla beam search algorithm. For example, Vijayakumar et al. (2018) propose diverse beam search (DBS) to address the lack of diversity within the set of returned strings. The algorithm splits the beam into several sub-groups and adds an inner iteration at each time step to maximize inter-group diversity. We refer the reader to the original work for the full algorithm.
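To give a flavor of the mechanism, the sketch below shows only the per-group score adjustment under a Hamming dissimilarity, the setting we use in our experiments (§4, λ = 0.7); the data structures and function name are assumptions, and this is a simplified illustration rather than the full DBS algorithm.

```python
from collections import Counter

def hamming_adjusted_scores(logprobs, prior_group_tokens, lam=0.7):
    """Adjust next-token log-probabilities for one beam group in DBS.

    `logprobs` maps candidate tokens to log q(y_t | x, y_<t);
    `prior_group_tokens` lists the tokens already selected at this time
    step by earlier groups. With Hamming dissimilarity, a candidate is
    penalized (scaled by lam) once for every earlier group that chose it.
    """
    counts = Counter(prior_group_tokens)
    return {tok: lp - lam * counts[tok] for tok, lp in logprobs.items()}
```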
3.2 Stochastic Algorithms
Ancestral Sampling.
Instead of approximating y*, one can obtain generations by sampling y ∼ q(· ∣ x). Due to the local normalization scheme of the models that we consider, this can be achieved simply by setting y0 = bos and then drawing each yt ∼ q(· ∣ x, y<t) until eos is sampled.
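A minimal sketch of this procedure is given below; the callable q(x, prefix) returning a next-token distribution and the max_len cutoff are illustrative assumptions.

```python
import random

def ancestral_sample(q, x, bos="<bos>", eos="<eos>", max_len=200):
    """Draw y ~ q(. | x) one token at a time until eos is sampled.

    Assumes q(x, prefix) returns a dict of next-token probabilities.
    """
    y = [bos]
    while y[-1] != eos and len(y) < max_len:
        probs = q(x, tuple(y))
        tokens, weights = zip(*probs.items())
        y.append(random.choices(tokens, weights=weights, k=1)[0])
    return y
```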
Top-k Sampling.
Nucleus (Top-p) Sampling.
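Both strategies sample from a truncated and renormalized next-token distribution: top-k keeps the k most probable tokens, while nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability exceeds p (Holtzman et al., 2020). A minimal sketch of the truncation step follows, assuming the distribution is represented as a token-to-probability dict; the function name and interface are illustrative.

```python
def truncate_distribution(probs, k=None, p=None):
    """Restrict a next-token distribution before sampling.

    Top-k keeps the k most probable tokens; nucleus (top-p) keeps the
    smallest set of tokens whose cumulative probability exceeds p.
    The kept probabilities are renormalized.
    """
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if k is not None:
        items = items[:k]
    if p is not None:
        kept, total = [], 0.0
        for tok, pr in items:
            kept.append((tok, pr))
            total += pr
            if total >= p:
                break
        items = kept
    z = sum(pr for _, pr in items)
    return {tok: pr / z for tok, pr in items}
```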
Minimum Bayes Risk (MBR).
4 Experimental Setup
The strategies presented in Section 3 are compared across a variety of NLG tasks, covering open-ended as well as directed generation tasks. We define a task more formally as a triple (X, Y, q), where X denotes the input space, Y the output space,3 and q a model that defines a probability distribution over Y for every input x ∈ X. A high-level overview of these tasks (and the respective datasets used) can be found in Table 1. We use solely transformer-based models, all state-of-the-art for their respective tasks (Ng et al., 2019; Lewis et al., 2020; Zhang et al., 2020; Radford et al., 2019). We use open-sourced versions of models for reproducibility.
Table 1: Overview of the tasks considered in this work. Examples given for the input and output spaces are the intended input and output, respectively. Models q are evaluated on the test set of the specified dataset. Note that we fine-tune the GPT-2 models on the specified dataset, while other models are loaded from checkpoints provided by the Hugging Face framework (Wolf et al., 2020).
| Task | Input (X) | Output (Y) | Model q | Dataset |
|---|---|---|---|---|
| Machine Translation (MT) | sequence in source language | sequence in target language | FAIR’s wmt19 submission | newstest2019 |
| Abstractive Summarization (AS) | news article | summary | BART | CNN/Dailymail |
| Dialogue (Diag) | conversation history | response | DialoGPT | DialoGPT |
| Story Generation (SG) | short prompt | related story | GPT-2 (small and medium) | WritingPrompts |
| Unconditional Generation (ULG) | empty sequence | plausible natural language strings | GPT-2 (small and medium) | WikiText-103 |
Decoding Strategy Settings.
Most of the decoding algorithms specified in the previous section depend on certain parameters. For all our experiments we use the following settings:
We consider beam search with beam sizes K = 5 and K = 10, and DBS with Hamming distance as the dissimilarity function, λ = 0.7, and G = K = 5. The choice of dissimilarity function and hyperparameters is based on the recommendations from the original work. When we only want to return one string, we select the highest-scoring hypothesis.
For top-k sampling, we set k = 30 and for top-p sampling, we set p = 0.85 based on experiments in DeLucia et al. (2021) that suggest a parameter range p ∈ [0.7,0.9].
For MBR,4 we obtain 30 to 32 ancestral samples5 to approximate the expected risk in Eq. (9) using Monte Carlo (MC) estimation. The set of candidate sequences, for which we calculate the expected risk, consists of the ancestral samples used for the MC approximation together with the sequences obtained from the other decoders. The metric BEER (Stanojević and Sima’an, 2014) is used as the utility function u.
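For concreteness, the MBR selection rule just described can be sketched as follows; the `utility` callable stands in for a sentence-level metric such as BEER, and the interfaces are illustrative assumptions rather than the exact implementation from the referenced codebase.

```python
def mbr_decode(candidates, samples, utility):
    """Return the candidate with the highest Monte Carlo estimate of
    expected utility (equivalently, the lowest expected risk).

    `samples` are ancestral samples from q(. | x), used as pseudo-
    references; `candidates` contains these samples plus the outputs of
    the other decoders; `utility(hyp, ref)` is a sentence-level
    similarity such as BEER (here an arbitrary stand-in).
    """
    def expected_utility(hyp):
        return sum(utility(hyp, ref) for ref in samples) / len(samples)

    return max(candidates, key=expected_utility)
```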
4.1 Metrics
We use a number of different metrics to compare text across decoding strategies. An overview of all metrics can be found in Table 2. Note that we roughly divide the set of metrics into two categories: diversity metrics and quality metrics. Intuitively, we may expect that the two criteria are not always of equal importance. For example, in MT an accurate, high-quality translation of the input is often valued more highly than engaging or stylized language or a wider range of diverse outputs. On the other hand, a conversational agent that is able to talk about a diverse range of topics is likely highly preferred to one that repeats the safest phrases over and over (Li and Jurafsky, 2016). In our subsequent experiments, we provide a quantitative analysis of this trade-off.
Table 2: List of metrics considered in this work. For human evaluation metrics, the listed prompt is shown to raters.
| Metric | Description |
|---|---|
| Quality Metrics (Automatic) | |
| bleu | Corpus-level metric originally developed to assess translation quality of MT systems (Papineni et al., 2002). Produces a score between 0 and 1 based on modified n-gram precision. We use the sacrebleu (Post, 2018) framework. |
| meteor | Metric based on the harmonic mean of unigram precision and recall. Originally developed to evaluate MT. We use version 1.5 of the implementation from Denkowski and Lavie (2014). |
| comet | Neural framework to train multilingual MT evaluation systems, proposed by Rei et al. (2020). The nature of this metric makes it compatible only with the MT task. We use a pretrained model checkpoint provided by the original work. |
| rouge | Recall-oriented set of metrics originally developed to assess the quality of automatically generated summaries (Lin, 2004). We report the rouge-l measure, which is based on the longest common subsequence between candidate and reference. |
| bleurt | Trained evaluation metric based on bert (Devlin et al., 2019). Returns a score that indicates to what extent the candidate is grammatical and conveys the meaning of the reference (Sellam et al., 2020). We use a pretrained model checkpoint provided by the original work. |
| Quality Metrics (Human) | |
| adequacy | How well does the response/continuation fit in a given conversation history? |
| naturalness | To what degree does the text seem to be a natural English text? |
| quality | How high is the overall quality of the text? |
| accuracy | Given the context, is the text accurate? |
| fluency | How fluent is the given text? |
| Diversity Metrics | |
| dist-n | Number of distinct n-grams divided by the total number of n-grams (Li et al., 2016). |
| ent-n | dist-n does not account for the fact that infrequent n-grams contribute more to diversity than frequent ones. This limitation is addressed by the ent-n metric, first proposed by Zhang et al. (2018), which reflects how uniform the empirical n-gram distribution is for a given sentence. |
| n-gram div. | Average over dist-n measures for different values of n. We calculate the average over n ∈ {1,…,5}. |
| self-bleu | Average bleu score across strings when using all other strings in the set as references (Zhu et al., 2018). |
| repetition | If a phrase (minimum length of 2) is repeated at least three times until the end of the generation, it is labeled as a repetition. This definition of a repetition is taken from Holtzman et al. (2020). |
4.2 Evaluation of Quality
For tasks where one has access to a ground-truth reference, for example, MT, AS, and to some extent Diag, there are a variety of automatic metrics to evaluate quality. Most of these metrics are based on statistics of n-gram overlap between output and reference. This class of metrics has its limitations; consequently, we also consider human judgments of text quality using the criteria in Table 2. We use the Prolific framework to obtain ratings from 5 different annotators on 200 examples per decoding strategy; the criteria used for each of the tasks are given in Table 3. For each of the criteria, an 8-point Likert scale is used. We select the criteria based on which have been most commonly used to assess performance of text generators on a given task, as outlined by van der Lee et al. (2021), and describe them to the annotators as in Table 2. If a rater assigns high scores to multiple examples that do not fulfill the specified criteria at all, the rating is rejected and we obtain a fresh set of scores from a new rater. For SG, AS, and Diag, the raters are first presented with a prompt/news article/dialogue history, followed by the outputs of the different decoders and the reference in random order. For unconditional language generation we present the raters with generations and references in random order.6
4.3 Evaluation of Diversity
Automatic metrics to measure lexical diversity of generated text are mostly based on statistics of n-gram counts; while lexical diversity is a narrow definition of diversity, it is the definition commonly employed in language generation, as diverse word choice is arguably a large factor for this characteristic. Note that lexical diversity can be measured at the string level, that is, within a given string y, or across a set of strings {y(1),y(2),…}. While we provide some results for the former set of metrics, we focus largely on the latter, as practitioners are often more concerned with having a diverse set of generations per input. Specifically, we take measurements with respect to sets decoded by each strategy, namely, the size-K set returned by beam search or K items generated according to a specific stochastic scheme.7 For the stochastic decoders, we set K = 10. For each input we thus obtain a set of outputs per decoder, over which we calculate various metrics, such as self-bleu or n-gram diversity. Self-bleu is calculated on a per-string basis as the average of bleu scores when setting one of the generations as the hypothesis and all other strings in the set as references. To calculate dist-n, ent-n, and n-gram diversity metrics for a set of generations, we concatenate all outputs and perform calculations as described in Table 2. For ULG, where we only have one input x, we instead calculate scores over random (disjoint) subsets of size K = 10.
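As a concrete reference, the set-level dist-n and n-gram diversity computations described above can be sketched as follows; tokenized inputs and the helper names are assumptions, and self-bleu would additionally require a sentence-level bleu implementation.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(tokens, n):
    """Distinct n-grams divided by total n-grams (Li et al., 2016)."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0

def ngram_diversity(generations, max_n=5):
    """Average of dist-n over n = 1..max_n for a set of generations,
    computed on the concatenation of all outputs as described above."""
    tokens = [tok for y in generations for tok in y]
    return sum(dist_n(tokens, n) for n in range(1, max_n + 1)) / max_n
```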
5 Results
5.1 Quality
Human evaluations are aggregated across raters, using the median value for each string. Results are displayed in Figure 2. According to human raters, sampling directly from the model yields text with the lowest quality ratings on all tasks except SG, where we instead observe that mode-seeking strategies lead to degenerate text (further discussion in §5.4). In general, for the directed generation tasks (AS and Diag), beam search variants perform the best, even outperforming human-generated references. Interestingly, despite its limited exploration of the search space, greedy decoding generates text on par with beam search methods for Diag.
On the other hand, the results of stochastic methods are more nuanced: Although top-p and top-k decoding generate more highly rated texts than ancestral sampling, they often fail to reach the quality levels of the beam search based methods. MBR decoding, which as a decoding strategy perhaps falls somewhere between the classes of deterministic and stochastic, likewise performs somewhere between these classes in terms of quality metrics. Overwhelmingly, trends in performance are much more distinct when analyzing strategies as stochastic vs. deterministic, rather than individually, suggesting that small algorithmic differences between decoding strategies may not be as critical as prior work has made them seem.
We present automatic quality evaluation metrics for directed generation tasks in Table 4—the number in brackets in Table 4 shows how many of the decoders performed significantly worse than the best one in terms of the respective metric, as determined by an example-level permutation test. We use a significance level of 0.01; the resulting p-values were corrected for multiple testing using a Bonferroni correction. We observe similar trends as with our human evaluations: Beam search methods perform best, followed by top-p and top-k sampling, with ancestral sampling performing worst. Despite mixed results in Figure 2, MBR decoding yields competitive results in terms of automatic evaluation metrics, even matching the performance of beam search; this is perhaps not surprising given the poor correlation between human and automatic evaluation that is frequently observed in language generation. On Diag, we only observe a significant difference in performance between the best decoder and the worst three or four decoders, depending on the metric. Similarly, on MT, except for the bleu metric, we only observe a significant difference between the best and the worst three decoders. For AS, in contrast, the best-performing decoder significantly outperforms every other decoder except the other beam search methods. This contrasts with Diag and MT, where the mode-seeking decoders all appear to perform equally well.
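One standard way to carry out such an example-level test is a paired sign-flip permutation test, sketched below under assumed inputs (per-example scores for two decoders on the same test set); this is an illustrative sketch rather than the exact implementation behind Table 4. The Bonferroni correction then amounts to comparing each p-value against 0.01 divided by the number of comparisons.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided, example-level permutation test for the difference in
    mean score between two decoders evaluated on the same inputs."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        # Randomly flip the sign of each per-example difference.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted)) / len(permuted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```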
Table 4: Corpus-level quality metrics for Diag, AS, and MT. For Diag and AS, the human score is calculated by taking the mean over the two criteria upon which the text is rated.
| Decoder | Dialogue human | Dialogue bleu | Dialogue rouge-l | AS human | AS bleu | AS rouge-l | MT (De-En) bleu | MT (De-En) meteor | MT (De-En) comet | MT (En-De) bleu | MT (En-De) meteor | MT (En-De) comet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Greedy | 4.660 | 0.661 | 9.072(4) | 3.671 | 16.560 | 28.027 | 40.083 | 42.444 | 0.548 | 42.072 | 59.174 | 0.613 |
| BS (k = 5) | 4.495 | 0.758(3) | 8.796 | 5.235(5) | 17.197 | 31.138 | 41.049 | 43.005 | 0.561(3) | 42.746(5) | 59.602(3) | 0.622 |
| BS (k = 10) | 4.456 | 0.746 | 8.331 | 5.180 | 16.726 | 30.650 | 41.211(5) | 43.101(3) | 0.560 | 42.680 | 59.583 | 0.622(3) |
| DBS | 4.689(3) | 0.436 | 8.708 | 5.122 | 18.141(5) | 31.487(6) | 39.770 | 42.254 | 0.538 | 41.702 | 58.793 | 0.611 |
| MBR | 3.815 | 0.510 | 8.469 | 3.709 | 10.771 | 25.120 | 40.811 | 42.952 | 0.547 | 42.370 | 59.241 | 0.605 |
| Ancestral | 3.329 | 0.196 | 5.408 | 1.825 | 5.390 | 17.985 | 17.402 | 27.425 | −0.520 | 15.595 | 35.722 | −0.832 |
| Top-k | 4.234 | 0.308 | 6.961 | 4.276 | 11.644 | 25.961 | 27.574 | 35.651 | 0.376 | 27.091 | 47.839 | 0.458 |
| Top-p | 3.914 | 0.308 | 6.331 | 3.976 | 11.785 | 25.505 | 29.397 | 36.704 | 0.382 | 29.998 | 49.778 | 0.481 |
5.2 Diversity
We report diversity metrics for different strategies and tasks in Figure 3. Points are connected to better illustrate general trends across diversity metrics, not due to a quantitative relationship between the metrics themselves. We see that in general, the trends for a given task are quite consistent across diversity metrics, that is, lines of the same color follow the same trend across facets. On the other hand, trends across tasks are not as similar. For example, the gap in diversity between deterministic and stochastic methods is much more pronounced in SG than in MT.
Figure 3: Diversity metrics calculated at the set level. For ULG, the metrics are calculated for randomly chosen (disjoint) subsets of all generations. Note that low self-bleu indicates high diversity.
Across tasks, ancestral sampling consistently produces the most diverse outputs. Limiting the search space, as in top-k and top-p sampling, leads to a drop in diversity compared to pure sampling; notably, this drop appears to be much more significant for directed generation tasks. Interestingly, introducing a diversity-promoting term, as in DBS, increases diversity with respect to beam-based decoding algorithms, but still leads to substantially less diverse strings than stochastic methods.
At the task-level, responses for Diag seem to be more inherently diverse than for other tasks. Even methods known for producing repetitive sets (e.g., beam search) generate a relatively diverse set of solutions. This suggests that even though diverse options are often desired in Diag, we may not need to explicitly optimize for them via the chosen decoding strategy. On the other hand, diversity in SG is quite sensitive to the chosen decoding strategy, displaying drastic differences.
5.3 Quantitative Trade-offs in NLG Tasks
We provide an analysis of the importance of different metrics for each of the language generation tasks, looking specifically at their relationships with perceived quality.
The Probability–Quality Relationship.
Natural language generation is performed almost solely using probabilistic models. While ideally we would like high-quality text to be assigned high probability (and vice versa), in practice this is not always the case (Cohen and Beck, 2019; Stahlberg and Byrne, 2019; Holtzman et al., 2020; Zhang et al., 2021; DeLucia et al., 2021). The trends observed in Figure 1 reveal that while high probability is often a determinant of quality in directed generation tasks, such as MT and AS,8 there is a negative correlation between quality and probability in SG and ULG, at least up until a certain inflection point. Such relationships have been a main motivation behind research into new decoding strategies (e.g., Li et al., 2016; Shao et al., 2017; Holtzman et al., 2020). The quality scores in Figure 1 are obtained by taking the mean over human ratings. For MT, sentence-bleu is used.
This relationship also manifests in the divide in performance between deterministic strategies—all of which are to some extent mode-seeking—and stochastic strategies. Naturally, the deterministic decoding strategies we consider produce (on average) higher-probability strings, as probability is part of their decoding objectives. Figure 6 shows that, compared to ancestral samples, beam search generations are associated with higher (length-)normalized log-likelihood than the outputs of the sampling-based decoders. Thus, we might expect the results observed in Figure 1 to appear in a comparison of deterministic and stochastic strategies. We rank strategies within a task according to human ratings when available and calculate the highest rank obtained by each of the two groups. More specifically, for each input, we order generations according to their median human rating. Ranks are then assigned to each decoding strategy according to this ordering (lower is better). We then look at the highest rank achieved by the two subsets of decoding strategies. From Figure 4 we can see a distinct divide in preference for deterministic vs. stochastic strategies across tasks: All directed generation tasks appear to favor mode-seeking strategies. Yet there is a notable trend in the strength of this preference. As we might intuitively expect, we see an upward trend in the difference in rankings of mode-seeking vs. stochastic decoding methods as a task becomes more semantically constrained. At one end of the spectrum, in SG, we observe that in nearly all cases, the most highly ranked output from a deterministic strategy is still ranked below the worst of the stochastic strategies,9 indicating the ill-suitedness of mode-seeking strategies for such tasks. The opposite is true of MT at the other end of the spectrum.
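The per-input ranking procedure just described can be sketched as follows; the treatment of ties and the exact data structures are assumptions, since they are not specified above.

```python
from statistics import median

def best_rank_per_group(ratings, deterministic, stochastic):
    """For one input, rank decoders by median human rating (1 = best)
    and return the best rank achieved by each group of strategies.

    `ratings` maps decoder name -> list of rater scores for this input;
    `deterministic` and `stochastic` are collections of decoder names.
    """
    ordered = sorted(ratings, key=lambda d: median(ratings[d]), reverse=True)
    rank = {decoder: i + 1 for i, decoder in enumerate(ordered)}
    return (min(rank[d] for d in deterministic),
            min(rank[d] for d in stochastic))
```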
Figure 4: Highest ranks achieved by stochastic vs. deterministic strategies on each input; a rank of 1 means a generation from the respective group of decoding strategies was ranked 1st among all generations. We omit ULG since only stochastic strategies are considered for this task. Note that the lowest possible rank for a deterministic strategy is 4 and for a stochastic strategy is 5.
The Diversity–Quality Relationship.
Here we investigate how diversity—as quantified by the metrics in §4.1—relates to quality in a given task. Note that the probability–quality relationship has previously been attributed to a trade-off between diversity and quality (Zhang et al., 2021; Nadeem et al., 2020), albeit only in the investigation of a small subset of language generation tasks. However, we see in Figure 5 that the relationship between diversity and quality is not so easily defined: it changes quite drastically across tasks.
Figure 5: The relationship between diversity (n-gram div.) and quality (median human rating) across language generation tasks. Results are qualitatively the same when using other diversity metrics, as we might expect given the results in Figure 3.
Specifically, Figure 5 shows there is indeed a trade-off for the two quantities in AS and MT, yet there appears to be an interdependence for open-ended generation tasks. Notably, Diag appears to fall outside of this paradigm, which perhaps challenges its definition as a directed generation task. In conjunction with other results (e.g., Figure 6), the trends shown in Figure 5 suggest that within directed tasks, Diag falls closer to open-ended generation tasks on the task spectrum. We further see that stochastic and deterministic methods are distinctly divided along the diversity–quality trend in each task; although this result is perhaps to some extent expected, the separating line is surprisingly sharp in all cases.
Figure 6: Correlations between quality metrics and other quantitative attributes of text with different decoding schemes, separated by task.
5.4 Eliciting Metrics
We now look at the ability of different decoding strategies to elicit the qualitative metrics described in §4.1, the quantitative properties studied in §5.3, as well as certain undesirable attributes of text. Through this analysis, we hope to ascertain how the effectiveness of different decoding strategies generalizes across tasks, and which—if any—more general claims can be made about these strategies.
Figure 6 shows how different decoders correlate with various metrics, using ancestral samples as a baseline.10 Our first take-away is that these correlation plots differ notably across tasks, which further demonstrates the sensitivity of decoding strategy performance to the task at hand. Among these differences, though, we observe certain trends that provide insights into how decoders' abilities to generate certain types of text transfer across tasks. For example, the performance of decoders within the subsets of directed and open-ended generation tasks is reasonably consistent. We first discuss more specific trends with respect to quality metrics.
Quality Metrics.
We first note that no single decoding method consistently correlates most strongly with high-quality text, which further warns against general claims about decoder performance. Perhaps the most distinct result when looking at decoders' correlations with quality metrics is the difference in correlations for mode-seeking methods between open-ended and directed generation tasks. Here we see that on the directed tasks, the use of mode-seeking methods appears to correlate highly with quality, with no substantial differences among this class of methods even when, for example, also optimizing for intra-set diversity (as in DBS). Interestingly, the strengths of the correlations shown by stochastic methods are much more consistent across all tasks than those of the mode-seeking methods. While, in general, decoder performance with respect to quality metrics is relatively consistent for directed generation tasks, there are exceptions: MBR correlates well with quality metrics for MT, but underperforms in comparison to other decoders for both AS and Diag. On AS, greedy search tends to lead to poorer-quality text than top-p and top-k sampling, whereas for the other directed tasks, all mode-seeking methods provide generations of higher quality.
Diversity Metrics.
In comparison to quality, we observe that the behavior of different decoders changes less with respect to diversity. For diversity metrics calculated over a set of generations, ancestral sampling consistently generates the most diverse text (as demonstrated by the negative n-gram diversity/positive self-bleu correlations shown by all decoders). This is true even in comparison to DBS, which optimizes for intra-set diversity.11 Mode-seeking decoding strategies consistently have a stronger negative correlation with set-level diversity metrics (e.g., self-bleu) than their stochastic counterparts. This difference is more pronounced on certain tasks: For example, both Figure 6 and Figure 3 show a bigger jump in diversity scores between DBS and top-p sampling on SG compared to MT or AS. Interestingly, there is little consistency across tasks in terms of sequence-level string diversity.
Repetitions.
Probabilistic language generators are known to occasionally produce text with degenerate qualities (Dinan et al., 2020; Holtzman et al., 2020; Welleck et al., 2020b). One common form of degenerate behavior is repetition, where generation falls into a loop of repeating the same phrase until the decoding algorithm terminates. Here we analyze the fraction of times this behavior occurs for different strategies; results can be found in Figure 7. On the SG task, we observe a substantial amount of text degeneration for mode-seeking strategies; this holds true for both the small and medium variants of GPT-2. Across both open-ended tasks, the only stochastic decoding scheme that appears to elicit this degenerate behavior is top-p sampling; although it affects only a small percentage of samples, it is responsible for all of the degenerate behavior observed for the ULG task. Notably, for all tasks besides SG, we see repetitive behavior in less than 1% of generations. The exact repetition counts, together with the perplexity of the generated texts for SG, are shown in Table 5.
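One way to operationalize the repetition criterion from Table 2 (a phrase of length at least two repeated at least three times at the end of the generation) is sketched below; this is an assumed reading of the definition rather than the exact implementation used here.

```python
def ends_in_repetition(tokens, min_phrase_len=2, min_repeats=3):
    """Check whether a generation degenerates into a repetition loop:
    some phrase of length >= min_phrase_len is repeated at least
    min_repeats times at the end of the token sequence."""
    n = len(tokens)
    max_phrase_len = n // min_repeats
    for plen in range(min_phrase_len, max_phrase_len + 1):
        phrase = tokens[n - plen:]
        reps = 1
        # Walk backwards in phrase-sized chunks while they match.
        while (reps < min_repeats
               and tokens[n - (reps + 1) * plen : n - reps * plen] == phrase):
            reps += 1
        if reps >= min_repeats:
            return True
    return False
```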
Table 5: Perplexities and repetition counts for different strategies on the SG task. Mode-seeking strategies are able to produce text with very low perplexity, but these generations almost always degenerate into repetitions.
| Decoder | SG (small) % repetition | SG (small) ppl | SG (medium) % repetition | SG (medium) ppl |
|---|---|---|---|---|
| Greedy | 95.67 | 1.07 | 98.47 | 1.09 |
| BS (k = 5) | 92.58 | 1.11 | 90.70 | 1.11 |
| BS (k = 10) | 95.67 | 1.11 | 88.01 | 1.11 |
| DBS | 99.25 | 1.05 | 97.75 | 1.05 |
| MBR | 0.20 | 27.19 | 0 | 28.46 |
| Ancestral | 0.23 | 30.43 | 0.13 | 32.98 |
| Top-k | 1.97 | 7.10 | 0.53 | 7.38 |
| Top-p | 15.87 | 5.52 | 5.65 | 6.33 |
| Reference | 0 | 23.83 | 0 | 19.28 |
Figure 7: Fraction of generations that degenerate into repetition (see Table 2 for the definition). Note the different scales for the different tasks.
Length.
We further investigate how different decoding strategies affect the length of generated text. Length biases have frequently been observed in language generation tasks (Murray and Chiang, 2018; Welleck et al., 2020a), both toward shorter and toward longer strings. In this experiment, we hope to observe how much the decoding scheme can be held responsible for these biases. We report results in Figure 8 and Figure 9. For MT, all strategies manage to generate strings of lengths similar to the reference, with the exception of ancestral sampling, which produces slightly longer strings. Interestingly, there are no consistent trends for beam search variants across the other directed generation tasks; rather, trends seem to be inverted for Diag and AS.
Figure 8: Differences between the lengths of generated texts and reference strings. MAPE denotes the mean absolute percentage error between reference lengths and the lengths of generated texts. MPE denotes the mean percentage error, where we do not take the absolute value of the difference in lengths, in order to get a sense of whether generated strings are (on average) longer or shorter than the reference.
Figure 9: Mean lengths of generated text for open-ended tasks. Results are displayed for models based on the medium-sized version of GPT-2; we omit results for the small version, which were essentially the same.
We see large variation in the length of generated strings for the SG task, especially among mode-seeking strategies; for example, standard beam search produces rather short strings, while DBS and greedy decoding produce inordinately long strings. For the unconditional language generation task, we observe no large differences in generated sequence length among stochastic methods. Collectively, these results tell us that the previously observed length biases are task–decoder specific, rather than purely decoder specific.
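For reference, the MAPE and MPE statistics reported in Figure 8 can be computed as in the sketch below; measuring lengths in tokens and the function name are assumptions.

```python
def length_bias(generated_lengths, reference_lengths):
    """Mean absolute percentage error (MAPE) and mean percentage error
    (MPE) between generation and reference lengths, as in Figure 8.
    Assumes lengths are positive token counts."""
    pct_errors = [(g - r) / r * 100
                  for g, r in zip(generated_lengths, reference_lengths)]
    mape = sum(abs(e) for e in pct_errors) / len(pct_errors)
    mpe = sum(pct_errors) / len(pct_errors)
    return mape, mpe
```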
6 Discussion
When constructing a text generation pipeline, the choice of decoding strategy has a large effect on various aspects of the resulting text. Yet when making this choice for a specific language generation task, practitioners are currently limited to either basing their decision on non-comprehensive analyses, using expensive human annotations, or even resorting to guesswork. There are potential pitfalls in these practices: As evidenced by various results in this work, certain properties of decoding schemes—especially quality—do not transfer across tasks. This work aims to provide guidance for practitioners in the choice of decoding strategies, revealing their strengths and weaknesses with respect to individual tasks while also giving insights into whether one can expect these properties to transfer to tasks outside of this study. While not all of the takeaways from this work can be summarized in a few lines, we highlight some key observations below.
The relationships and trade-offs between certain properties of text change notably from one task to another. For example, as depicted in Figure 1, high-probability strings are typically also of high quality for MT, while there is an almost inverse relationship between these attributes for SG. As shown in Figure 5, a quality–diversity trade-off exists for directed generation tasks, whereas for open-ended generation tasks, the relationship is almost a co-dependence. These task-specific characteristics must be taken into account when both choosing and developing decoding strategies.
While decoder performance generally does not transfer faithfully across tasks, we can still identify some rules from our experiments that practitioners can use. For one, we see that on directed generation tasks, mode-seeking methods all perform competitively in terms of quality. Further, for stochastic decoders, we observe that restricting the sample space—as done in top-p and top-k decoding—greatly increases quality compared to ancestral sampling, albeit sacrificing some diversity. The ability of a decoder to elicit diversity in text—at least at the set-level—is perhaps the most consistent decoder quality across tasks. There are many other use-case specific insights that can be drawn from the results shown by figures and statistics in this work, which we hope serve as further guidance for practitioners.
It is worth noting that the behavior of decoders depends on their respective hyperparameters, for example, k or p in top-k and top-p sampling. This work does not perform a thorough search over hyperparameters, instead using the most widely adopted settings in order to maximize the usefulness of our results to practitioners, who are likely to use similar defaults. While, based on the results of other works, these settings should produce text that is representative of the respective decoding strategies, the lack of a hyperparameter search is a limitation of our work worth taking into consideration.
7 Conclusion
This work provides an extensive analysis of the effects of different decoding strategies on generated text across various language generation tasks. We show how different attributes of model-generated text change depending not just on decoding strategy, but also on the task at hand, using both human and automatic evaluations. Our results confirm several prior observations, for example, a trade-off between diversity and quality metrics for specific NLG tasks, while also revealing a number of previously unobserved trends in language generation, both with respect to decoding strategies and the tasks themselves. A main take-away of these results is that decoding strategies are perhaps optimized for specific language generation tasks, and that practitioners should take great care in basing their choice of decoding strategy on results reported for other tasks. We release our evaluation framework and generations in the hope that this type of analysis will be extended, for example, by ablating components of models or training strategies, in order to isolate which artifacts can be attributed to the nature of a specific generation task vs. design choices. We ultimately see this line of research as important for helping practitioners more confidently choose a decoding strategy that fits their needs without the use of valuable resources, for the further development of decoding strategies, and for better understanding the shortcomings of probabilistic language generators.
Ethical Concerns
We do not foresee any ethical concerns.
Acknowledgments
We would like to thank Bryan Eikema for his valuable insights and many fruitful discussions.
Notes
1. For certain tasks, this loss is typically augmented with label smoothing (Szegedy et al., 2016) to combat overfitting. In short, a certain probability mass is discounted from the ground-truth token and redistributed uniformly across all the other tokens.
2. We define directed generation tasks as involving a strong relationship between input and output (e.g., as in MT); for open-ended tasks, input contexts only pose a soft constraint on the output space, i.e., there is a considerable degree of freedom in what is a plausible output (e.g., in dialogue or story generation).
3. Note that formally the input and output spaces only differ by the model-specific vocabularies and maximum generation length l ∈ ℤ+.
4. We use code provided at github.com/Roxot/mbr-nmt.
5. To speed up the generation process, samples are generated in batches. Depending on the memory requirements of the different models, the batch size differs across tasks, thus creating small differences in the number of samples acquired.
6. We omit human annotations for MT because it has been observed that there is no significant gain over the automatic metrics when using crowd workers, due to large variations in evaluation (Freitag et al., 2021).
7. Because greedy and MBR decoding are methods that only return a single string, they are not considered in the latter set of metrics.
8. As computational constraints make it difficult (if not infeasible) to decode the highest-probability string from neural models, we do not observe behavior at the extreme end of Figure 1, which other works have observed to produce poor quality text.
9. This must be the case since the average maximum ranking for mode-seeking methods is almost 4.
10. Ancestral samples give us an unbiased sample of the type of text that is assigned probability mass by our model, thus making it a good baseline for observing the effects of decoding strategies.
11. Although in general, DBS seems to be relatively effective at optimizing for intra-set diversity in comparison to other decoders, even achieving low self-bleu on SG despite also causing degeneration, as shown in Figure 7.
Author notes
Action Editor: Emily Pitler