On Decoding Strategies for Neural Text Generators

Abstract
When generating text from probabilistic models, the chosen decoding strategy has a profound effect on the resulting text. Yet the properties elicited by various decoding strategies do not always transfer across natural language generation tasks. For example, while mode-seeking methods like beam search perform remarkably well for machine translation, they have been observed to lead to incoherent and repetitive text in story generation. Despite such observations, the effectiveness of decoding strategies is often assessed on only a single task. This work, in contrast, provides a comprehensive analysis of the interaction between language generation tasks and decoding strategies. Specifically, we measure changes in attributes of generated text as a function of both decoding strategy and task using human and automatic evaluation. Our results reveal both previously observed and novel findings. For example, the nature of the diversity–quality trade-off in language generation is very task-specific; the length bias often attributed to beam search is not constant across tasks.
https://github.com/gianwiher/decoding-NLG


Introduction
Modern neural networks constitute an exciting new approach for the generation of natural language text. Much of the initial research into neural text generators went into designing different architectures (Sutskever et al., 2014; Rush et al., 2015; Serban et al., 2017). However, recent work has hinted that the decoding strategy, i.e., the method used to generate strings from the model, may be more important than the model architecture itself. For instance, a well-replicated recent result is that, under a probabilistic neural text generator trained with the maximum-likelihood objective, the most probable string is often not human-like or high quality (Stahlberg and Byrne, 2019; Eikema and Aziz, 2020). In light of this finding, a plethora of decoding strategies have been introduced into the literature, each claiming to generate more desirable text than competing approaches.
Figure 1: Quality-probability trade-off for different language generation tasks: story generation, unconditional language generation, abstractive summarization, dialogue, and machine translation. Notably, general trends in each curve differ drastically across tasks, despite training models with the same objective. See §4.1.2 for details on how quality scores are computed.
Lamentably, empirical studies of decoding strategies are typically conducted with respect to a single natural language generation task, without investigation into how performance may change across tasks, despite the fact that these tasks differ qualitatively across a large number of axes.
These qualitative differences manifest quantitatively as well: for example, we can see in Fig. 1 that high probability strings are favorable in some tasks, like machine translation (MT), while heavily disfavored in others, like story generation (SG). Consequently, we should not a priori expect a strategy that works well for one task to demonstrate the same performance in another. Indeed, several cases already show evidence of this: beam search works remarkably well for machine translation but, outside of this context, has been observed to return dull or degenerate text (Holtzman et al., 2020; DeLucia et al., 2021). This raises a natural fear that decoding strategies have been optimized for performance on a specific task, and that task-agnostic claims about the effectiveness of one decoding strategy over another are potentially ill-founded. A broader analysis of decoding strategies, both within and across tasks, is needed in order to fully understand the extent of such a problem.
Our work fills this lacuna, providing the first comprehensive comparison of decoding strategies across natural language generation tasks. Empirically, we compare strategy performance on several axes, taxonomizing methods into groups such as deterministic and stochastic, to understand the importance of various strategy attributes for quantifiable properties of text. In summary, our main findings include the following:
• Many previous empirical observations, among them the quality-diversity and quality-probability trade-offs (Ippolito et al., 2019; Zhang et al., 2021; Nadeem et al., 2020), manifest themselves in very task-specific ways. For example, our experiments reveal a distinct quality-diversity trade-off, albeit only in a certain subset of tasks. This brings into question whether there is a single phenomenon under consideration or many distinct, but related, phenomena.
• A group-level analysis shows the first empirical evidence of a distinct divide in preference for stochastic versus deterministic strategies across tasks: all directed generation tasks appear to favor the latter, yet there is a notable trend in the strength of this preference, and the inverse even holds for story generation.
We see these results both as a reference point for language generation practitioners, so that they can more confidently choose a decoding strategy that fits their needs, and as an indicator of potential strengths and weaknesses of today's neural probabilistic language generators. We have reason to believe that there is a task-specific optimization happening in the literature whereby many of the proposed (and even celebrated) decoding strategies only outperform their competitors on specific tasks. Thus, our paper serves as a cautionary note about proper comparisons.

Probabilistic Language Generators
In this work, we consider models for language generation tasks that define a probability distribution over strings. More formally, these models are probability distributions $p$ over an output space $\mathcal{Y}$, (perhaps) conditioned on an input $x$, where $\mathcal{Y}$ is the set consisting of all possible strings that can be constructed from the vocabulary $\mathcal{V}$:
$$\mathcal{Y} := \{\text{BOS} \circ v \circ \text{EOS} \mid v \in \mathcal{V}^*\}$$
Here, BOS and EOS stand for special reserved beginning-of-sentence and end-of-sentence tokens, respectively, and $\mathcal{V}^*$ denotes the Kleene closure of $\mathcal{V}$. Today's language generators are typically parameterized by encoder-decoder architectures with attention mechanisms (Sutskever et al., 2014), notably the transformer (Vaswani et al., 2017), with trainable weights $\theta$. These models follow a local-normalization scheme, meaning that for all $t > 0$, $p(\cdot \mid x, y_{<t})$ defines a probability distribution over $\bar{\mathcal{V}} := \mathcal{V} \cup \{\text{EOS}\}$. The probability of a sequence $y = \langle y_0, y_1, \dots \rangle$ can thus be computed as:
$$p(y \mid x) = \prod_{t=1}^{|y|} p(y_t \mid x, y_{<t})$$
where $y_{<t} := \langle y_0, \dots, y_{t-1} \rangle$ and $y_{<1} = y_0 := \text{BOS}$. In order to learn the weights $\theta$, we minimize some loss function $L(\theta; \mathcal{C})$, defined in terms of a corpus $\mathcal{C}$. In theory, we want examples in $\mathcal{C}$ to be assigned high probability. Accordingly, our loss is typically their negative log-likelihood under $p$.
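The local-normalization scheme can be illustrated with a minimal Python sketch. The toy model `toy_next_token_probs` and its three-token vocabulary are hypothetical stand-ins, not the models used in this work; the sketch only shows how a sequence log-probability decomposes into a sum of conditional log-probabilities.

```python
import math

BOS, EOS = "<bos>", "<eos>"

def toy_next_token_probs(prefix):
    """Hypothetical locally normalized model over the tokens {"a", "b", EOS}.

    For simplicity the distribution depends only on the last token.
    """
    last = prefix[-1]
    if last == "a":
        return {"a": 0.3, "b": 0.3, EOS: 0.4}
    if last == "b":
        return {"a": 0.05, "b": 0.05, EOS: 0.9}
    return {"a": 0.5, "b": 0.4, EOS: 0.1}  # distribution after BOS

def sequence_log_prob(tokens, next_token_probs):
    """Sum log p(y_t | y_<t) over t >= 1; tokens[0] must be BOS."""
    assert tokens[0] == BOS
    total = 0.0
    for t in range(1, len(tokens)):
        total += math.log(next_token_probs(tokens[:t])[tokens[t]])
    return total

log_p = sequence_log_prob([BOS, "a", "b", EOS], toy_next_token_probs)
```

In this toy setting the sequence probability is the product 0.5 · 0.3 · 0.9 of the three conditional probabilities.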

The Decoding Problem
We define the decoding problem as the search for some string $y^*$ according to a given model $p$ and a set of decision rules. Given the probabilistic nature of most language generators, the natural choice for such a string would be the most probable sequence under the model:
$$y^* = \underset{y \in \mathcal{Y}}{\operatorname{argmax}}\; p(y \mid x)$$
Solving the above optimization problem is commonly referred to as maximum a posteriori (MAP) decoding. There are two main reasons why this direct optimization is not used in practice when decoding. First, because of the exponentially large space $\mathcal{Y}$ and the non-Markovian structure of commonly used neural generators, direct optimization is often computationally infeasible.
Second, recent research has shown that the mode, i.e., the MAP solution $y^*$, is often not human-like or high quality text (Eikema and Aziz, 2020). For example, in the domain of MT, the most likely string under the model is often the empty string (Stahlberg and Byrne, 2019). For open-ended generation, it has been observed that there is a positive correlation between likelihood and quality only up to a certain inflection point, after which the correlation becomes negative (Zhang et al., 2021). Thus, in practice, $y^*$ is almost exclusively approximated using heuristic methods. An overview of such (commonly used) methods is presented below.

Deterministic Algorithms
Greedy Search. One approximation of $y^*$ is obtained by greedily choosing the most probable token at each decoding step $t$, i.e., the following recursion is performed until the EOS symbol is chosen or some maximum time step $T$ is reached:
$$y_t = \underset{y \in \mathcal{V} \cup \{\text{EOS}\}}{\operatorname{argmax}}\; \log p(y \mid x, y_{<t})$$
Note that there is no formal guarantee that greedy decoding will return the global optimum of the decoding objective, since decisions are only locally optimal.
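As an illustration of the greedy recursion, the following Python sketch runs greedy search on a small hypothetical model (`toy_next_token_probs` is invented for this example). The toy distribution is chosen so that the locally optimal choices miss the most probable string.

```python
BOS, EOS = "<bos>", "<eos>"

def toy_next_token_probs(prefix):
    """Hypothetical locally normalized model over the tokens {"a", "b", EOS}."""
    last = prefix[-1]
    if last == "a":
        return {"a": 0.3, "b": 0.3, EOS: 0.4}
    if last == "b":
        return {"a": 0.05, "b": 0.05, EOS: 0.9}
    return {"a": 0.5, "b": 0.4, EOS: 0.1}  # distribution after BOS

def greedy_search(next_token_probs, max_steps=10):
    """Pick the most probable token at each step until EOS or max_steps."""
    y = [BOS]
    for _ in range(max_steps):
        probs = next_token_probs(y)
        y.append(max(probs, key=probs.get))
        if y[-1] == EOS:
            break
    return y

greedy_output = greedy_search(toy_next_token_probs)
```

Here greedy search returns the string `a` (probability 0.5 · 0.4 = 0.2), although the string `b` has higher probability (0.4 · 0.9 = 0.36) under the same toy model.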
Beam Search. Beam search is a simple extension of greedy search. Rather than considering only the highest probability continuation of our string at each step, we keep the $k \in \mathbb{Z}_+$ highest probability paths, where the hyperparameter $k$ is referred to as the beam size:
$$Y_t = \underset{Y' \subseteq B_t,\ |Y'| = k}{\operatorname{argmax}}\; \mathcal{L}(Y')$$
where $B_t$ is our beam, consisting of all possible one-token extensions of the strings $y \in Y_{t-1}$, and $\mathcal{L}$ is a scoring function that maps sets of strings $Y \subseteq \mathcal{Y}$ to real numbers. Typically, we choose $\mathcal{L}(Y) = \sum_{y \in Y} \log p(y \mid x)$. As with greedy decoding, the recursion is performed until all strings end in the EOS symbol or some maximum time step $T$ is reached. The highest scoring string $y^*$ is then chosen from this final set $Y_T$. Other scoring functions have been proposed as modifications to the vanilla beam search algorithm. For example, Vijayakumar et al. (2016) propose diverse beam search (DBS) to address the lack of diversity within the set of returned strings. The algorithm further splits the beam into several sub-groups and adds an inner iteration at each time step to maximize inter-group diversity, i.e., the score of a candidate $y$ is augmented with a term, weighted by a hyperparameter $\lambda$, based on $\Delta(y, Y_t^{(g)})$, a measure of dissimilarity between $y$ and the strings within a group $Y_t^{(g)}$. We refer the reader to the original work for the full algorithm.
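A minimal Python sketch of vanilla beam search with the additive log-probability score is given below. The toy model is hypothetical, and the sketch makes two simplifying assumptions: finished hypotheses are carried over unchanged, and no length normalization is applied. Because the score of a set is a sum over its members, keeping the k individually best candidates maximizes the set-level score.

```python
import math

BOS, EOS = "<bos>", "<eos>"

def toy_next_token_probs(prefix):
    """Hypothetical locally normalized model over the tokens {"a", "b", EOS}."""
    last = prefix[-1]
    if last == "a":
        return {"a": 0.3, "b": 0.3, EOS: 0.4}
    if last == "b":
        return {"a": 0.05, "b": 0.05, EOS: 0.9}
    return {"a": 0.5, "b": 0.4, EOS: 0.1}  # distribution after BOS

def beam_search(next_token_probs, k=2, max_steps=10):
    """Return the final beam as a list of (log-probability, tokens), best first."""
    beam = [(0.0, [BOS])]
    for _ in range(max_steps):
        candidates = []
        for score, y in beam:
            if y[-1] == EOS:
                candidates.append((score, y))  # finished hypothesis, keep as is
                continue
            for token, prob in next_token_probs(y).items():
                candidates.append((score + math.log(prob), y + [token]))
        # Keep the k highest-scoring candidates.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(y[-1] == EOS for _, y in beam):
            break
    return beam

best_score, best_tokens = beam_search(toy_next_token_probs, k=2)[0]
```

With k = 2 the search recovers the string `b` (probability 0.36) in this toy setting, whereas k = 1 reduces to greedy search and returns `a` (probability 0.2).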

Stochastic Algorithms
Ancestral Sampling. Instead of approximating $y^*$, one can obtain generations by sampling $y \sim p(\cdot \mid x)$. Due to the local normalization scheme of the models that we consider, this can be achieved simply by setting $y_0 = \text{BOS}$ and then drawing each $y_t \sim p(\cdot \mid x, y_{<t})$ until EOS is sampled.
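The sampling procedure can be sketched in a few lines of Python. The toy model is hypothetical, and the `max_steps` cap is an added safeguard that is not part of the definition above.

```python
import random

BOS, EOS = "<bos>", "<eos>"

def toy_next_token_probs(prefix):
    """Hypothetical locally normalized model over the tokens {"a", "b", EOS}."""
    last = prefix[-1]
    if last == "a":
        return {"a": 0.3, "b": 0.3, EOS: 0.4}
    if last == "b":
        return {"a": 0.05, "b": 0.05, EOS: 0.9}
    return {"a": 0.5, "b": 0.4, EOS: 0.1}  # distribution after BOS

def ancestral_sample(next_token_probs, rng, max_steps=50):
    """Draw y_t ~ p(. | y_<t) token by token until EOS is sampled."""
    y = [BOS]
    for _ in range(max_steps):
        probs = next_token_probs(y)
        tokens, weights = zip(*probs.items())
        y.append(rng.choices(tokens, weights=weights, k=1)[0])
        if y[-1] == EOS:
            break
    return y

rng = random.Random(0)  # seeded only to make the example reproducible
samples = [ancestral_sample(toy_next_token_probs, rng) for _ in range(20000)]
```

Since each token is drawn from the model's own conditional distribution, the empirical frequency of any string in `samples` approaches its probability under the toy model as the number of samples grows.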
Top-k Sampling. Perhaps due to the "unreliable tail" of the distribution (Holtzman et al., 2020), i.e., the subset of $\mathcal{V}$ containing unrealistic extensions of a string that are necessarily assigned probability mass due to the non-sparse nature of the softmax transformation, sampling directly from $p(\cdot \mid x)$ can lead to text that is incoherent and sometimes even unrelated to the subject (Fan et al., 2018). One way to overcome this issue is to limit the sampling space to the top-$k$ most likely tokens in each decoding step. Prior to sampling, the distribution over $\mathcal{V}$ is recomputed: let $Z(x, y_{<t}) := \sum_{y \in \mathcal{V}^{(k)}} p(y \mid x, y_{<t})$, where $\mathcal{V}^{(k)} \subseteq \mathcal{V}$ is defined to be the set of the $k$ most likely tokens. The truncated distribution, denoted $\pi$ here, is given by:
$$\pi(y \mid x, y_{<t}) = \begin{cases} \dfrac{p(y \mid x, y_{<t})}{Z(x, y_{<t})} & \text{if } y \in \mathcal{V}^{(k)} \\ 0 & \text{otherwise} \end{cases} \tag{5}$$
| Task | X | Y_out | Models p | Dataset |
| Story Generation (SG) | short prompt | related story | GPT-2 (small and medium) | WRITINGPROMPTS |
| Unconditional Generation (ULG) | empty sequence (BOS) | plausible natural language strings | GPT-2 (small and medium) | WIKITEXT-103 |
Table 1: Overview of the tasks considered in this work. Examples given for X and Y_out are the intended input and output, respectively. Models p are evaluated on the test set of the specified dataset. We finetune the GPT-2 models on the specified dataset; other models are loaded from checkpoints provided by the Hugging Face framework (Wolf et al., 2020).
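Nucleus Sampling. Rather than fixing the number of candidate tokens in advance, the truncation set can be chosen dynamically based on the spread of the probability distribution at each generation step. Formally, nucleus sampling (Holtzman et al., 2020) considers the smallest subset of tokens whose cumulative probability mass exceeds a chosen threshold $p$. For generation step $t$, threshold $p \in (0, 1]$, and probability distribution $p(\cdot \mid x, y_{<t})$, the set $\mathcal{V}^{(p)} \subseteq \mathcal{V}$ is the smallest set such that
$$\sum_{y \in \mathcal{V}^{(p)}} p(y \mid x, y_{<t}) \geq p$$
The truncated distribution is then computed similarly to Eq. (5) with $Z(x, y_{<t}) = \sum_{y \in \mathcal{V}^{(p)}} p(y \mid x, y_{<t})$.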
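Both truncation schemes amount to selecting a token subset and renormalizing. The following Python sketch shows this for a single decoding step; the example distribution and the function names are hypothetical. Tokens outside the retained set are simply omitted from the returned dictionary, which is equivalent to assigning them probability zero.

```python
def truncate_top_k(probs, k):
    """Keep the k most likely tokens and renormalize by their total mass Z."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    z = sum(probs[token] for token in kept)
    return {token: probs[token] / z for token in kept}

def truncate_nucleus(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    kept, mass = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        mass += probs[token]
        if mass >= p:
            break
    return {token: probs[token] / mass for token in kept}

# Hypothetical next-token distribution at one decoding step.
step_probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
top_k_dist = truncate_top_k(step_probs, k=2)
nucleus_dist = truncate_nucleus(step_probs, p=0.85)
```

In this example top-k with k = 2 keeps `the` and `a`, while the nucleus with threshold 0.85 additionally keeps `cat`, because the first two tokens only cover a mass of 0.8.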
Minimum Bayes Risk (MBR). Under probabilistic language generators, probability mass is often spread over a large set of likely candidates without clear preference (Ott et al., 2018). However, this set of likely strings should not be arbitrary when $p$ is good. Rather, these strings should capture the statistics of the training data well, containing a number of potentially good solutions (Eikema and Aziz, 2020). This motivates a decision rule that exploits all available information in this set. Let $u : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be a utility function that evaluates a string $y$ against a reference $\hat{y}$. According to statistical decision theory (Bickel and Doksum, 1977), the optimal decision $y^*$ is the one that minimizes expected risk (here we define risk as negative utility) or, equivalently, maximizes expected utility:
$$y^* = \underset{y' \in \mathcal{Y}}{\operatorname{argmin}}\; \mathbb{E}_{\hat{y} \sim p(\cdot \mid x)}\left[-u(y', \hat{y})\right] \tag{7}$$
Like MAP, it is generally computationally infeasible to solve the MBR objective exactly given the size of $\mathcal{Y}$. In practice, one can obtain an unbiased estimate of the expected risk by Monte Carlo (MC) methods and limit the search space for the optimization problem to the sampled set.
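The Monte Carlo approximation of MBR can be sketched as follows in Python. The utility `unigram_f1` is a hypothetical stand-in chosen to keep the example self-contained (it is not the utility used in the experiments), and the sample strings are invented.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Hypothetical stand-in utility: F1 overlap of whitespace tokens."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_decode(candidates, samples, utility):
    """Return the candidate with the highest Monte Carlo estimate of expected utility."""
    def expected_utility(candidate):
        return sum(utility(candidate, s) for s in samples) / len(samples)
    return max(candidates, key=expected_utility)

# Invented strings standing in for ancestral samples from a model.
model_samples = ["the cat sat", "the cat sat down", "a dog barked", "the cat sat"]
mbr_output = mbr_decode(model_samples, model_samples, unigram_f1)
```

The rule selects the string that agrees most with the other samples on average, which is why it tends to return a consensus output rather than an outlier.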

Experimental Setup
The strategies presented in §3 are compared across a variety of NLG tasks covering open-ended as well as directed generation tasks. We define a task more formally as a triple (X, Y_out, p), where X denotes the input space, Y_out ⊆ Y the output space, and p a model that defines a probability distribution over Y_out for every input x ∈ X. A high-level overview of these tasks (and the respective datasets used) can be found in Table 1. We use solely transformer-based models, all state-of-the-art for their respective tasks (Ng et al., 2019; Lewis et al., 2020; Zhang et al., 2020; Radford et al., 2019). We use open-sourced versions of models for reproducibility.
Decoding Strategy Settings. In this work, we consider beam search with beam sizes k = 5 and k = 10, and DBS with Hamming distance as a dissimilarity function, λ = 0.7 and G = k = 5. The choice of dissimilarity function and hyperparameters is based on the recommendations from the original work. When we only want to return one string, we select the hypothesis with the highest score according to log p. For top-k sampling, we set k = 30, and for top-p sampling, we set p = 0.85, based on experiments in DeLucia et al. (2021) that suggest a parameter range p ∈ [0.7, 0.9] for top-p sampling. For MBR, we obtain 30 to 32 ancestral samples to approximate the expected risk in Eq. (7) using Monte Carlo. To speed up the generation process, the samples are generated in batches. Depending on the memory requirements of the different models, the batch size differs across tasks, and we thus have small differences in the number of samples acquired. The candidate sequences, for all of which we calculate the expected risk, consist of the ancestral samples used for the Monte Carlo approximations together with sequences obtained from the other decoders. The metric BEER (Stanojević and Sima'an, 2014) is used as utility function u.
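For readers who wish to reproduce comparable settings, the hyperparameters above could be expressed as keyword arguments to the `generate` method of the Hugging Face `transformers` library. The mapping below is an assumption made for illustration, not the configuration used in this work; argument names follow `transformers` conventions, and support for some of them (notably the group beam search arguments) varies across library versions.

```python
# Hypothetical mapping from the decoding settings described above to
# `generate` keyword arguments (an assumption; check your library version).
DECODING_SETTINGS = {
    "greedy": {"do_sample": False, "num_beams": 1},
    "beam_5": {"do_sample": False, "num_beams": 5},
    "beam_10": {"do_sample": False, "num_beams": 10},
    # Diverse beam search: G = k = 5 groups, lambda = 0.7.
    "dbs": {"do_sample": False, "num_beams": 5,
            "num_beam_groups": 5, "diversity_penalty": 0.7},
    # top_k=0 and top_p=1.0 disable truncation, leaving pure ancestral sampling.
    "ancestral": {"do_sample": True, "top_k": 0, "top_p": 1.0},
    "top_k": {"do_sample": True, "top_k": 30, "top_p": 1.0},
    "top_p": {"do_sample": True, "top_k": 0, "top_p": 0.85},
}
```

Setting `top_k` to 0 explicitly matters for the sampling entries because the library's default value may otherwise apply a top-k filter.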

Table 2: List of metrics considered in this work. For human evaluation metrics, prompt shown is provided to raters.
Quality Metrics (Automatic)
BLEU: Corpus-level metric originally developed to assess translation quality of MT systems (Papineni et al., 2002). Produces a score between 0 and 1 based on modified n-gram precision. We use the SACRE-BLEU (Post, 2018) framework.
METEOR: Metric based on the harmonic mean of unigram precision and recall. Originally developed to evaluate MT. We use version 1.5 of the implementation from Denkowski and Lavie (2014).
COMET: Neural framework to train multilingual MT evaluation systems proposed by Rei et al. (2020). The nature of this metric makes it compatible only with the MT task. We use a pretrained model checkpoint provided by the original work.
ROUGE: Recall-oriented set of metrics originally developed to assess the quality of automatically generated summaries (Lin, 2004). We report the ROUGE-L measure, which is based on longest common subsequences between candidate and reference.
BLEURT: Trained evaluation metric based on BERT (Devlin et al., 2019). Returns a score that indicates to what extent the candidate is grammatical and conveys the meaning of the reference (Sellam et al., 2020). We use a pretrained model checkpoint provided by the original work.
Diversity Metrics
N-GRAM DIVERSITY: Average over dist-n measures for different values of n. We calculate the average over n ∈ {1, ..., 5}.
SELF-BLEU: Average BLEU score across strings when using all other strings in the set as references (Zhu et al., 2018).
REPETITION: If a phrase (minimum length of 2) is repeated at least three times until the end of the generation, it is labeled as a repetition. This definition of a repetition is taken from Holtzman et al. (2020).

Metrics
We employ a number of different metrics to compare text across decoding strategies. An overview of all metrics can be found in Table 2. Note that we roughly divide the set of metrics into two categories: diversity metrics and quality metrics. Intuitively, we may expect that the two criteria are not always of equal importance. For example, in MT an accurate, high quality translation of the input is more highly valued than generating engaging or stylized language or a wider range of diverse outputs. On the other hand, a conversational agent that is able to talk about a diverse range of topics is likely highly preferred to one that repeats the safest phrases over and over (Li and Jurafsky, 2016). In our subsequent experiments, we provide a quantitative analysis of this trade-off.

Evaluation of Quality
For tasks where one has access to a ground truth reference, e.g., MT, AS and to some extent Diag, there are a variety of automatic metrics to evaluate quality. Most of these metrics are based on statistics of n-gram overlap between output and reference. This class of metrics has its limitations; consequently, we also consider human judgements of text quality using the criteria in Table 2. We use the Prolific platform to obtain ratings from 5 different annotators on 200 examples per decoding strategy; the criteria used for each of the tasks are given in Table 3. For each of the criteria an 8-point Likert scale is used. We select the criteria that have been most commonly used to assess performance of text generators on a given task, as outlined by van der Lee et al. (2021), and describe them to the annotators as in Table 2. We manually check whether the obtained ratings have been filled out with care. If a rater assigns high scores to multiple examples that do not fulfill the specified criteria at all, the rating is rejected and we obtain a fresh set of scores from a new rater. For SG, AS and Diag, the raters are first presented with a prompt, news article, or dialogue history, respectively, followed by the outputs of the different decoders and the reference in random order. For unconditional language generation, we present the raters with generations and references in random order. We omit human annotations for MT because it has been observed that there is no significant gain over the automatic metrics when using crowd workers, due to large variations in evaluation (Freitag et al., 2021).

Evaluation of Diversity
Automatic metrics to measure lexical diversity of generated text are mostly based on statistics of n-gram counts; while lexical diversity is a narrow definition of diversity, it is the one commonly employed in language generation, as diverse word choice is arguably a large factor for this characteristic. Note that lexical diversity can be measured at the string level, i.e., within a given string y, or at the set level, i.e., across different strings {y^(1), y^(2), ...}. While we provide some results for the former set of metrics, we focus largely on the latter set, as practitioners are often more concerned with having a diverse set of generations per input. Specifically, we take measurements with respect to sets decoded by each strategy, i.e., the size-K set decoded by beam search or K items generated according to a specific stochastic scheme. For the beam search methods, K is equal to the beam size, while for the sampling decoders we set K = 10. For each input x ∈ X we thus obtain a set of strings per decoder, over which we calculate various metrics such as self-BLEU or n-gram diversity. Self-BLEU is calculated as the average of BLEU scores when setting one of the generations in the set as candidate and all other strings in the set as references. To calculate dist-n, ent-n, and n-gram diversity metrics for a set of generations, we concatenate all strings in the set and perform calculations as described in Table 2. For ULG, where we only have one input x, we instead calculate scores over random (disjoint) subsets of size K = 10.
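The set-level n-gram diversity computation can be sketched in Python as follows. The sketch assumes the common definition of dist-n as the number of unique n-grams divided by the total number of n-grams, and whitespace tokenization; both are assumptions, since the exact definitions are not restated here.

```python
def dist_n(tokens, n):
    """Fraction of unique n-grams among all n-grams (assumed dist-n definition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def ngram_diversity(generations, max_n=5):
    """Average dist-n for n = 1..max_n over the concatenation of a set of strings."""
    tokens = [token for g in generations for token in g.split()]
    return sum(dist_n(tokens, n) for n in range(1, max_n + 1)) / max_n

# Invented example sets: one repetitive, one lexically diverse.
repetitive_set = ["a b c", "a b c"]
diverse_set = ["a b c", "d e f"]
low = ngram_diversity(repetitive_set)
high = ngram_diversity(diverse_set)
```

A set that repeats the same string receives a lower score than a set of distinct strings, which is the behavior the set-level metric is meant to capture.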

Quality
Human evaluations are aggregated across raters, using the median value for each string. Results are displayed in Fig. 2. According to human raters, sampling directly from the model yields text with the lowest quality ratings across tasks; the clear exception is SG, where we observe that mode-seeking strategies lead to degenerate text (further discussion in §5.4). In general, for the directed generation tasks (AS and Diag), beam search variants perform the best, even outperforming human-generated references. Interestingly, despite its limited exploration of the search space, greedy decoding generates texts on par with beam search methods for Diag.
On the other hand, the results of stochastic methods are more nuanced: while top-p and top-k decoding generate more highly rated texts than ancestral sampling, they often fail to reach the quality levels of the beam search based methods. MBR decoding, which perhaps falls somewhere between the classes of deterministic and stochastic, likewise performs somewhere in between these classes in terms of quality metrics. Overwhelmingly, trends in performance are much more distinct when analyzing strategies as stochastic vs. deterministic, rather than individually, suggesting that small algorithmic differences in decoding strategies may not be as critical as prior work has made it seem (Holtzman et al., 2020).
We present automatic quality evaluation metrics for directed generation tasks in Table 4; the number in brackets in Table 4 shows how many of the decoders performed significantly worse than the best one, as determined by a permutation test. We use a significance level of 0.01; the resulting p-values were corrected for multiple testing using a Bonferroni correction. We observe similar trends as with our human evaluations: beam search methods perform best, followed by top-p and top-k sampling, with ancestral sampling performing worst. Despite mixed results in Fig. 2, MBR decoding yields competitive results in terms of automatic evaluation metrics, even matching the performance of beam search; this is perhaps not surprising given the poor correlation between human and automatic evaluation that is frequently observed in language generation. On Diag, for the metrics considered, we only observe a significant difference in performance between the best decoder and the worst 3 or worst 4 decoders, depending on the metric. Similarly, on MT, except for the BLEU metric, a significant difference is present only between the best and the worst 3 decoders. On the other hand, for AS the best performing decoder significantly outperforms every other decoder except the other beam search methods. This contrasts with the observation for Diag and MT, where the mode-seeking decoders all seem to perform equally well.
Figure 4: Highest rank achieved by stochastic and deterministic strategies on each input; a rank of 1 means a generation from the respective group of decoding strategies was ranked 1st among all generations. We omit ULG since only stochastic strategies are considered for this task. Note that the lowest possible rank for a deterministic strategy is 4 and for a stochastic strategy is 5.
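The significance testing described above can be illustrated with a short Python sketch. The exact permutation scheme is not specified in the text, so the sketch assumes a paired sign-flip permutation test on per-example score differences; the function names and the toy scores are hypothetical.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided paired permutation test via random sign flips of the differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_permutations):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Invented per-example scores for two decoders.
p_same = paired_permutation_test([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
p_diff = paired_permutation_test([1.0] * 20, [0.0] * 20)
corrected = bonferroni([0.004, 0.5])
```

A corrected p-value below the significance level of 0.01 would then be counted as a significant difference between the two decoders.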

Diversity
We report diversity metrics for different strategies and tasks in Fig. 3. Points are connected to better illustrate general trends across diversity metrics, not due to a quantitative relationship between the metrics themselves. We see that, in general, the trends for a given task are quite consistent across diversity metrics, i.e., lines of the same color follow the same trend. On the other hand, trends across tasks are not as similar. For example, the gap in diversity between deterministic and stochastic methods is much more pronounced in SG than in MT.
Across tasks, ancestral sampling consistently produces the most diverse outputs. Limiting the search space, as in top-k and top-p sampling, leads to a drop in diversity compared to pure sampling; notably, this drop appears to be much larger for directed generation tasks. Interestingly, introducing a diversity-promoting term, as in DBS, increases diversity with respect to other beam-based decoding algorithms, but still leads to substantially less diverse strings than stochastic methods.
At the task level, responses for Diag seem to be more inherently diverse than for other tasks. Even methods known for producing repetitive sets, e.g., beam search, generate a relatively diverse set of solutions. This suggests that even though diverse options are often desired in Diag, we may not need to explicitly optimize for them via the chosen decoding strategy. On the other hand, diversity in SG is quite sensitive to the chosen decoding strategy, displaying drastic differences.

Quantitative Trade-offs in NLG Tasks
We provide an analysis of the importance of different metrics for each of the language generation tasks, looking specifically at their relationships with perceived quality.
The Probability-Quality Relationship. Natural language generation is performed almost solely using probabilistic models. While ideally we would like high quality text to be assigned high probability (and vice versa), we see that in practice this is not always the case (Cohen and Beck, 2019; Stahlberg and Byrne, 2019; Holtzman et al., 2020; Zhang et al., 2021; DeLucia et al., 2021). The trends observed in Fig. 1 reveal that while high probability is often a determinant of quality in directed generation tasks, such as MT and AS,[7] there is a negative correlation between quality and probability in SG and ULG, at least up until a certain inflection point. Such relationships have been a main motivation behind research into new decoding strategies.
[7] As computational constraints make it difficult (if not infeasible) to decode the highest probability string from neural models, we do not observe behavior at the extreme end of Fig. 1, which other works have observed to produce poor quality text.
This relationship also manifests in the divide in performance between deterministic strategies (all of which are to some extent mode-seeking) and stochastic strategies. Naturally, deterministic decoding strategies produce (on average) higher probability strings, as probability is part of the algorithms' objectives. Fig. 6 shows that, when compared to ancestral samples, most beam search generations are more strongly associated with higher (length-normalized) log-likelihood than the output of the sampling-based decoders. Thus, we might expect the results observed in Fig. 1 to appear in a comparison of deterministic and stochastic strategies. We rank strategies within a task according to human ratings when available and calculate the highest rank obtained by each of the two groups. More specifically, for each input, we order generations according to their median human rating. Ranks are then assigned to each decoding strategy according to this ordering (lower is better). We then look at the highest rank achieved by the two subsets of decoding strategies.
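The group-level ranking procedure described above can be sketched in Python for a single input. The strategy labels, the grouping, and the ratings are hypothetical, and ties between equal medians are broken by insertion order, which is an assumption of this sketch.

```python
from statistics import median

# Hypothetical labels for the deterministic group; all others count as stochastic.
DETERMINISTIC = {"greedy", "beam_5", "beam_10", "dbs"}

def best_rank_per_group(ratings):
    """ratings maps strategy -> list of rater scores for one input.

    Returns the best (lowest) rank achieved by the deterministic and the
    stochastic group, where rank 1 is the highest median rating.
    """
    medians = {strategy: median(scores) for strategy, scores in ratings.items()}
    ordered = sorted(medians, key=medians.get, reverse=True)
    ranks = {strategy: i + 1 for i, strategy in enumerate(ordered)}
    best_det = min(r for s, r in ranks.items() if s in DETERMINISTIC)
    best_sto = min(r for s, r in ranks.items() if s not in DETERMINISTIC)
    return best_det, best_sto

# Invented ratings for one input where all stochastic outputs are preferred.
example_ratings = {
    "greedy": [2, 2, 3], "beam_5": [1, 2, 2], "beam_10": [2, 1, 1], "dbs": [3, 3, 2],
    "ancestral": [4, 4, 5], "top_k": [5, 5, 4], "top_p": [5, 4, 4],
}
det_rank, sto_rank = best_rank_per_group(example_ratings)
```

In this invented example every stochastic output is rated above every deterministic one, so the best deterministic rank is 4 while the best stochastic rank is 1.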
From Fig. 4 we can see a distinct divide in preference for deterministic vs. stochastic strategies across tasks: all directed generation tasks appear to favor mode-seeking strategies. Yet there is a notable trend in the strength of this preference. As we might intuitively expect, we see an upward trend in the difference in rankings of mode-seeking vs. stochastic decoding methods as a task becomes more semantically constrained. At one end of the spectrum, in SG, we observe that in nearly all cases the most highly ranked output from a deterministic strategy is still ranked below the worst of the stochastic strategies, indicating the ill-suitedness of mode-seeking strategies for such tasks. The opposite is true of MT at the other end of the spectrum.
Figure 5: The relationship between diversity (n-gram div) and quality (median human rating) across language generation tasks. Results are qualitatively the same when using other diversity metrics, as we might expect given results in Fig. 3.
The Diversity-Quality Relationship. Here we investigate how diversity, as quantified by the metrics in §4.1, relates to quality in a given task. Note that the probability-quality relationship has previously been attributed to a trade-off between diversity and quality (Zhang et al., 2021; Nadeem et al., 2020), albeit only in the investigation of a small subset of language generation tasks. However, we see in Fig. 5 that the relationship between diversity and quality is not so easily defined: it changes quite drastically across tasks. Specifically, Fig. 5 shows there is indeed a trade-off between the two quantities in AS and MT, yet there appears to be an interdependence for open-ended generation tasks. Notably, Diag appears to fall outside of this paradigm, which perhaps challenges its definition as a directed generation task. In conjunction with other results, e.g., Fig. 6, the trends shown in Fig. 5 suggest that, within directed tasks, Diag falls closer to open-ended generation tasks on the task spectrum. We further see that stochastic and deterministic methods are distinctly divided along the diversity-quality trend in each task; although this result is perhaps expected, the separating line is surprisingly sharp in all cases.

Eliciting Quantitative and Qualitative Metrics
We now look at the ability of different decoding strategies to elicit the qualitative metrics described in §4.1, the quantitative properties studied in §5.3, as well as certain undesirable attributes of text. Through this analysis, we hope to ascertain how the effectiveness of different decoding strategies generalizes across tasks, and which, if any, more general claims can be made about these strategies. Fig. 6 shows how different decoders correlate with various metrics, using ancestral samples as a baseline. Our first take-away is that these correlation plots differ notably across tasks, which further demonstrates the sensitivity of the performance of decoding strategies to the task at hand. Among these differences, though, we observe certain trends that provide insights into how decoders' abilities to generate certain types of texts transfer across tasks. For example, the performance of decoders within the subsets of directed and open-ended generation tasks is reasonably consistent. We first discuss more specific trends with respect to quality metrics.
Quality Metrics. We first note that there is no single decoding method that consistently correlates most strongly with high-quality text, which lends further weight to warnings against more general claims about decoder performance. Perhaps the most distinct result when looking at decoders' correlations with quality metrics is the difference in correlations for mode-seeking methods between open-ended and directed generation tasks. Here we see that on the directed tasks, the use of mode-seeking methods appears to correlate highly with quality, with no substantial differences among this class of methods even when, e.g., also optimizing for intra-set diversity (as in DBS). Interestingly, the strengths of the correlations shown by stochastic methods are much more consistent across all tasks than those of the mode-seeking methods.
While, in general, decoder performance w.r.t. quality metrics is relatively consistent for directed generation tasks, there are exceptions to this consistency: MBR correlates well with quality metrics for MT, but underperforms in comparison to other decoders for both AS and Diag. On AS, greedy search tends to lead to poorer quality text than top-p and top-k sampling, whereas for the other directed tasks, all mode-seeking methods provide generations of higher quality.
Diversity Metrics. In comparison to quality, we observe that the behavior of different decoders changes less with respect to diversity. For diversity metrics calculated over a set of generations, ancestral sampling consistently generates the most diverse text (as demonstrated by the negative n-gram diversity/positive self-BLEU correlations shown by all decoders). This is true even in comparison to DBS, which optimizes for intra-set diversity. Mode-seeking decoding strategies consistently have a stronger negative correlation with set-level diversity metrics, e.g., self-BLEU, than their stochastic counterparts. This difference is more pronounced on certain tasks: for example, both Fig. 6 and Fig. 3 show a bigger jump in diversity scores between DBS and top-p sampling on SG compared to MT or AS. Interestingly, there is little consistency across tasks in terms of sequence-level string diversity.
Repetitions. Probabilistic language generators are known to occasionally produce text with degenerate qualities (Dinan et al., 2020; Holtzman et al., 2020; Welleck et al., 2020b). One common form of degenerate behavior is repetition, where generation falls into a loop of repeating the same phrase until the decoding algorithm terminates.
Here we analyze the fraction of times this behavior occurs for different strategies; results can be found in Fig. 7. On the SG task, we observe a substantial amount of text degeneration for mode-seeking strategies; this holds true for both the small and medium variants of GPT-2. Across both open-ended tasks, the only stochastic decoding scheme that appears to elicit this degenerate behavior is top-p sampling; although this affects only a small percentage of samples, it accounts for all of the degenerate behavior observed for the ULG task.
Notably, for all tasks besides SG, we see repetitive behavior in less than 1% of generations. The exact repetition counts, together with the perplexity of the generated texts for SG, are shown in Table 5.
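To make the notion of "falling into a loop" concrete, the sketch below flags a generation whose final tokens consist of one short phrase repeated back to back. This is a hypothetical heuristic, not the paper's definition of repetition (which is given in Table 2 and not reproduced in this section). The names `ends_in_loop` and `repetition_fraction`, the thresholds `max_phrase_len` and `min_repeats`, and whitespace tokenization are all assumptions.

```python
def ends_in_loop(tokens, max_phrase_len=10, min_repeats=3):
    """Return True if `tokens` ends with a phrase of at most `max_phrase_len`
    tokens repeated at least `min_repeats` times in a row.

    Hypothetical heuristic; the thresholds are illustrative defaults.
    """
    for phrase_len in range(1, max_phrase_len + 1):
        needed = phrase_len * min_repeats
        if len(tokens) < needed:
            break  # longer phrases need even more tokens
        tail = tokens[-needed:]
        phrase = tail[:phrase_len]
        if all(tail[i] == phrase[i % phrase_len] for i in range(needed)):
            return True
    return False


def repetition_fraction(generations, **kwargs):
    """Fraction of generations (strings) that end in a repetition loop."""
    if not generations:
        return 0.0
    flagged = sum(ends_in_loop(g.split(), **kwargs) for g in generations)
    return flagged / len(generations)
```

Checking only the tail of the sequence reflects the description above, in which the loop continues until decoding terminates; a stricter or looser criterion would change the reported fractions.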
Length. We further investigate how different decoding strategies affect the length of generated text. Length biases have frequently been observed in language generation tasks (Murray and Chiang, 2018; Welleck et al., 2020a), both for shorter and longer strings. In this experiment, we hope to observe how much the decoding scheme can be held responsible for these biases. We report results in Fig. 8 and Table 6. For MT, all strategies manage to generate strings of lengths similar to the reference, with the exception of ancestral sampling, which produces slightly longer strings. Interestingly, there are no consistent trends for beam search variants across the other directed generation tasks; rather, trends seem to be inverted for Diag and AS. We see large variation in the length of generated strings for the SG task, especially among mode-seeking strategies; for example, standard beam search produces rather short strings while DBS and greedy decoding produce inordinately long strings. For the unconditional language generation task, we observe no big differences in generated sequence length among stochastic methods. Collectively, these results tell us that the previously observed length biases are specific to the combination of task and decoder, rather than to the decoder alone.
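The two length statistics defined in the Fig. 8 caption, MAPE and MPE, can be sketched as follows. This is an illustrative implementation under stated assumptions: lengths are token counts supplied by the caller, the function name is hypothetical, and the sign convention (generated minus reference, so a positive MPE means generations are longer on average) is an assumption, since the caption does not fix the sign.

```python
def length_errors(generated_lengths, reference_lengths):
    """Return (MAPE, MPE) in percent for paired generated/reference lengths.

    Assumed sign convention: error = (generated - reference) / reference,
    so MPE > 0 means generations are longer than references on average.
    """
    if len(generated_lengths) != len(reference_lengths) or not reference_lengths:
        raise ValueError("need two non-empty sequences of equal length")
    if any(r <= 0 for r in reference_lengths):
        raise ValueError("reference lengths must be positive")
    errors = [(g - r) / r for g, r in zip(generated_lengths, reference_lengths)]
    mape = 100.0 * sum(abs(e) for e in errors) / len(errors)
    mpe = 100.0 * sum(errors) / len(errors)
    return mape, mpe
```

Reporting both values is informative because errors of opposite sign cancel in MPE but not in MAPE: a decoder that is 20% too long on half the inputs and 20% too short on the other half has an MAPE of 20 and an MPE of 0.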

Discussion
When constructing a text generation pipeline, the choice of decoding strategy has a large effect on various aspects of the resulting text. Yet when making this choice for a specific language generation task, practitioners are currently limited to basing their decision on non-comprehensive analyses, using expensive human annotations, or even resorting to guesswork. There are potential pitfalls in these practices: as evidenced by various results in this work, certain properties of decoding schemes (especially quality) do not transfer across tasks. This work aims to provide guidance for practitioners in the choice of decoding strategies, revealing their strengths and weaknesses with respect to individual tasks while also giving insights into whether one can expect these properties to transfer to tasks outside of this study. While not all of the takeaways from this work can be summarized in a few lines, we highlight some key observations below.
The relationships and trade-offs between certain properties of text change notably from one task to another. For example, as depicted in Fig. 1, high-probability strings are typically also of high quality for MT, while there is an almost inverse relationship between these attributes for SG. As shown in Fig. 5, a quality-diversity trade-off exists for directed generation tasks, whereas for open-ended generation tasks, the relationship is almost a co-dependence. These task-specific characteristics must be taken into account both when choosing and when developing decoding strategies.
While decoder performance generally does not transfer faithfully across tasks, we can still identify some rules from our experiments that practitioners can use. For one, we see that on directed generation tasks, mode-seeking methods all perform competitively in terms of quality. Further, for stochastic decoders, we observe that restricting the sample space, as done in top-p and top-k decoding, greatly increases quality compared to ancestral sampling, albeit at the cost of some diversity. The ability of a decoder to elicit diversity in text, at least at the set level, is perhaps the most consistent decoder property across tasks. There are many other use-case-specific insights that can be drawn from the figures and statistics in this work, which we hope serve as further guidance for practitioners.
It is worth noting that the behavior of decoders depends on their respective hyperparameters, e.g., k or p in top-k and top-p sampling. This work does not perform a thorough search over hyperparameters; instead, we employ the most widely used settings so that our results are as useful as possible to practitioners, who are likely to use similar defaults. Based on the results of other works, these choices should yield text representative of each decoding strategy; nonetheless, this is a limitation of our work worth considering.
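To illustrate what the hyperparameters k and p control, the sketch below shows how top-k and top-p sampling restrict the next-token distribution before sampling, as these methods are commonly described. It is a simplified illustration over a plain list of probabilities, not the implementation used in this study; the function names are hypothetical, and ties between equally probable tokens are broken arbitrarily by sort order.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable token ids whose total
    probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


def sample(filtered, rng=random):
    """Draw one token id from a filtered {token_id: probability} mapping."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]
```

Setting k to the vocabulary size or p to 1 leaves the distribution untouched and recovers ancestral sampling, while smaller values concentrate sampling on high-probability tokens; this is the knob that the default settings discussed above fix in advance.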

Conclusion
This work provides an extensive analysis of the effects of different decoding strategies across various language generation tasks. We show how different attributes of model-generated text change depending not just on the decoding strategy, but also on the task at hand, using both human and automatic evaluations. Our results confirm several prior observations, e.g., a trade-off between diversity and quality metrics for specific NLG tasks, and reveal a number of previously unobserved trends in language generation, both with respect to decoding strategies and the tasks themselves. A main takeaway of these results is that decoding strategies may be optimized for specific language generation tasks, and that practitioners should take great care when basing their choice of decoding strategy on results reported for other tasks. We release the evaluation framework and generations in the hope that this type of analysis will be extended, e.g., by ablating components of the model or training strategy, in order to isolate which artefacts can be attributed to the nature of a specific generation task vs. design choices. We ultimately see this line of research as important for helping practitioners more confidently choose a decoding strategy that fits their needs without expending valuable resources, for the further development of decoding strategies, and for better understanding the shortcomings of probabilistic language generators.

Human
  ADEQUACY: How well does the response/continuation fit in a given conversation history?
  NATURALNESS: To what degree does the text seem to be a natural English text?
  QUALITY: How high is the overall quality of the text?
  ACCURACY: Given the context, is the text accurate?
  FLUENCY: How fluent is the given text?
Diversity Metrics
  DIST-n: Number of distinct n-grams divided by the total number of n-grams (Li et al., 2016)
  ENT-n: Entropy of the empirical n-gram distribution: $-\sum_{w \in S} \frac{F(w)}{\sum_{w' \in S} F(w')} \log \frac{F(w)}{\sum_{w' \in S} F(w')}$, where S denotes the set of all observed n-grams and F(w) is the count of w
  n-GRAM DIV.

Figure 3: Diversity metrics calculated at the set level. For ULG, the metrics are calculated for randomly chosen (disjoint) subsets of all generations. Note that low self-BLEU indicates high diversity.

Figure 4: Highest ranks achieved by stochastic vs. deterministic strategies on each input; a rank of 1 means a generation from the respective group of decoding strategies was ranked 1st among all generations. We omit ULG since only stochastic strategies are considered for this task. Note that the lowest possible rank for a deterministic strategy is 4 and for a stochastic strategy is 5.

Figure 6: Correlations between quality metrics and other quantitative attributes of text with different decoding schemes, separated by task. For each decoder, we concatenate the generations with ancestral samples and calculate the Pearson correlation between the metrics and a one-hot encoding indicating whether or not the generation is an ancestral sample.

Figure 7: Fraction of generations that degenerate into repetition (see Table 2 for definition). Note the different scales for the different tasks.

Figure 8: Differences between the lengths of generated texts and reference strings. MAPE denotes the mean absolute percentage difference between reference lengths and the lengths of generated texts. MPE denotes the mean percentage error, where we do not take the absolute value of the difference in lengths, in order to get a sense of whether generated strings are (on average) longer or shorter than the reference.

              SG       ULG
Greedy        921.96   -
BS (k = 5)    287.11   -
BS (k = 10)   247.89   -
DBS           917.73   -
MBR           681.94   450.90
Ancestral     578.57   440.56
Top-k         610.73   469.99
Top-p         667.85   458.32

Table 3: Criteria used for each task in human evals.

Table 4: Corpus-level quality metrics for Diag, AS, and MT. For Diag and AS, the human score is calculated by taking the mean over the two criteria upon which the text is rated.

Table 5: Perplexities and repetition counts for different strategies on the SG task. Mode-seeking strategies are able to produce text with very low perplexity, but these generations almost always degenerate into repetitions.