Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining

There is an increasing focus on model-based dialog evaluation metrics such as ADEM, RUBER, and the more recent BERT-based metrics. These models aim to assign a high score to all relevant responses and a low score to all irrelevant responses. Ideally, such models should be trained using multiple relevant and irrelevant responses for any given context. However, no such data is publicly available, and hence existing models are usually trained using a single relevant response and multiple randomly selected responses from other contexts (random negatives). To allow for better training and robust evaluation of model-based metrics, we introduce the DailyDialog++ dataset, consisting of (i) five relevant responses for each context and (ii) five adversarially crafted irrelevant responses for each context. Using this dataset, we first show that even in the presence of multiple correct references, n-gram based metrics and embedding based metrics do not perform well at separating relevant responses from even random negatives. While model-based metrics perform better than n-gram and embedding based metrics on random negatives, their performance drops substantially when evaluated on adversarial examples. To check if large scale pretraining could help, we propose a new BERT-based evaluation metric called DEB, which is pretrained on 727M Reddit conversations and then finetuned on our dataset. DEB significantly outperforms existing models, showing better correlation with human judgements and better performance on random negatives (88.27% accuracy). However, its performance again drops substantially when evaluated on adversarial responses, thereby highlighting that even large-scale pretrained evaluation models are not robust to the adversarial examples in our dataset. The dataset and code are publicly available.


Introduction
Open-domain conversational systems are increasingly in demand for several applications ranging from personal digital assistants to entertainers for recreation. While several automated dialogue agents such as Siri, Alexa, Cortana and Google Assistant have been built and deployed, there is no good automatic evaluation metric to measure the quality of their conversations. Researchers have usually adopted n-gram based metrics (Papineni et al., 2002; Banerjee and Lavie, 2005; Lin, 2004) or embedding based metrics (Forgues et al., 2014; Rus and Lintean, 2012; Zhang et al., 2020a) to compare the model's response with a single reference. These metrics assume that a valid response should be semantically or lexically similar to the reference without taking the context of the conversation into consideration. However, in open domain conversations, a given context can have a wide range of possible responses that may be lexically and semantically very different from each other. For example, the context, "I like dancing and swimming, what about you?" can be responded to with "I paint in my free time" or "I do not have time for hobbies right now", both of which are valid responses. As a result, n-gram and word embedding based metrics, which rely on lexical and/or semantic match, correlate very weakly with human judgements for dialogue evaluation (Liu et al., 2016).
Given the shortcomings of context-agnostic n-gram and embedding based metrics, the focus has now shifted to building neural network based, trainable dialogue evaluation models (Tao et al., 2018; Shimanaka et al., 2019; Ghazarian et al., 2019). Such models are trained to identify whether a given response can be considered as a valid continuation of the given context or not. In other words, the model should (i) assign a high score to all relevant responses no matter how diverse they are and (ii) assign a low score to all irrelevant responses, preferably with a clear margin of separation from relevant responses. Although there exist several open-domain dialogue datasets (Forsyth and Martell, 2007; Tiedemann, 2012; Ritter et al., 2010; Li et al., 2017b) that are used for training dialogue response generation systems, they are not suitable for training and testing such evaluation models. This is because these datasets have only a single relevant response and no irrelevant responses. Irrelevant responses can of course be generated by sampling random utterances from other contexts, but such examples typically do not have any overlap with the context and hence are easier for the model to distinguish from relevant responses (as we will show in our results later). We refer to the randomly sampled responses as random negatives.
Some efforts have been made to build dialog datasets with multiple relevant responses (i.e., multiple references), but these datasets are either very small (1000 contexts) (Moghe et al., 2018) or automatically constructed from Reddit conversations and hence potentially noisy (Gao et al., 2019). Further, these datasets do not have any carefully crafted adversarial irrelevant responses. We define an adversarial irrelevant response as one which has a significant word overlap with the context but is still an irrelevant response (hence harder to identify than randomly selected irrelevant examples, which may not have any relation to the context). To overcome this limitation of existing datasets, we propose a large scale multi-reference dataset, DailyDialog++, which is an extension of the DailyDialog dataset. In particular, for each of the 19K contexts derived from DailyDialog, we collect 5 additional reference responses with the help of human annotators. Further, for ∼11K contexts in DailyDialog, we also ask human annotators to carefully craft irrelevant responses which have a significant word overlap with the context. This dataset will be made publicly available and help towards better training and more robust evaluation of dialogue evaluation metrics.
Using this dataset, we extensively evaluate a wide range of n-gram-based and embedding-based metrics. In particular, we compute (i) the correlation of these metrics with binary human judgements and (ii) the accuracy obtained by using the scores assigned by the metrics to classify relevant/irrelevant responses. The performance of these metrics improves when presented with multiple references as opposed to a single reference, but they still leave a lot to be desired. On the other hand, most model-based evaluation metrics, when trained and evaluated using multiple relevant and random negative responses, perform significantly better than the n-gram-based and embedding-based methods. However, their performance drops substantially on the adversarial examples in our dataset.
Lastly, one could argue that dialog evaluation metrics could be improved by pretraining on large amounts of data. To check if this is indeed the case, we propose a new BERT-based evaluation metric called DEB (Dialog Evaluation using BERT), which is pretrained on 727M Reddit conversations. Indeed, this model performs significantly better on random negatives, with an accuracy of 88.27% in distinguishing the positive and random negative responses. It also correlates well with human judgments on responses generated by five dialog generation systems (Park et al., 2018; Zhang et al., 2020b). In particular, the Spearman rank correlation between human scores and DEB scores is 0.52 at the response level and 0.70 at the system level, where system-level scores are calculated by aggregating the scores on all responses by each system. However, once again, when evaluated on adversarial examples from our dataset, its performance drops substantially, underscoring that even large-scale pretrained models are not robust to adversarial examples.

Proposed Dataset
Our goal was to build a dataset with manually created multiple relevant and adversarial irrelevant responses. For this, we wanted to start with an existing base dataset, which already has one relevant response for every context, and then extend it to include multiple responses. For the base dataset, we considered several popular datasets such as Twitter (Ritter et al., 2010), Reddit (Henderson et al., 2019), Open Subtitles (Tiedemann, 2012), NPS Chat (Forsyth and Martell, 2007), PersonaChat (Zhang et al., 2018) and DailyDialog (Li et al., 2017b). Of these, Twitter and Reddit are generally considered noisy, so we chose not to use either of them as the base dataset. Similarly, Open Subtitles and NPS Chat did not have speaker aligned utterances and hence were not suitable for our purposes. We found that the DailyDialog dataset was clean, human-written, readily available, and covered a diverse set of generic topics such as ordinary life, school life, tourism, attitude & emotion, relationship, health, work, politics, culture & education and finance. It contains a total of 13K conversations with an average of 8 turns between exactly 2 speakers. Alternatively, we could have also chosen PersonaChat, which is of a similar size and also contains chit-chat style conversations, but we chose the antecedent DailyDialog dataset.
For shorter conversations in DailyDialog (having fewer than 8 turns) we collected multiple relevant responses only for the last utterance. For longer conversations (having 8 turns or more), we divided the conversation into two or more smaller chunks and collected multiple relevant responses for the last utterance in every chunk. In this way, from the 13K conversations in DailyDialog, we were able to create 19K sub-conversations with multiple relevant responses for the last utterance in each sub-conversation or context. The responses were created by in-house annotators. Each context was shown to 2-3 annotators, and each of them was asked to generate 1-3 alternative responses for the last utterance, capping the total number of alternative responses at 5 (in addition to the one response already available in DailyDialog). The annotators were strictly instructed to avoid short generic responses ("Okay", "Thank you", "Sure", etc.), and to write longer meaningful responses containing at least 8-10 words. These responses were then verified (and if needed, corrected and revalidated) by a different set of annotators.

Adversarial irrelevant responses
In addition to collecting multiple relevant responses for each context, we also wanted to collect irrelevant responses for each context. Most of the models which are trained for the task of dialogue evaluation (and dialogue generation) (Tao et al., 2018;Ghazarian et al., 2019;Li et al., 2017a) procure irrelevant responses by randomly sampling responses from other contexts. Such random negatives are often entirely out of context (unrelated) and hence are too easy for the model to distinguish. To allow for a more critical or adversarial examination of dialogue evaluation systems, we propose creating adversarially crafted irrelevant responses that have lexical or semantic overlap with the context but are still unacceptable as valid responses.
For obtaining such tricky negative responses, the annotators were asked to choose some words from the context and use them directly or indirectly while writing the responses. Indirect usage here refers to using words closely related to the context words, for example, synonyms, antonyms, homonyms, subwords, or other words that are known to frequently co-occur with the words in the context (e.g., the words "flexibility" and "injuries" co-occur with "acrobatics"). Once again, each context was shown to 2-3 annotators, and each of them was asked to generate 1-3 adversarially crafted responses for the last utterance, capping the total number of alternative responses at 5. Each response was then validated by two different annotators. The validating annotators were instructed to either eliminate or modify the responses that were not negative or were borderline. A final check was made by one more evaluator to ensure that the responses were adversarially crafted, irrelevant, and grammatically correct. We collected 5 such responses for 11429 contexts. Table 1 shows examples of relevant and irrelevant responses in our dataset and Table 2 shows some statistics about our dataset.
We acknowledge that, in practice, a given context can have a large number of relevant responses (>> 5). However, exhaustively collecting all such responses is prohibitively expensive and time consuming. While it is desirable to have even more than 5 responses for every context, we believe that having at least 5 is a good starting point given the dearth of such multi-reference conversation datasets. The proposed dataset thus serves as a pragmatic substitute for an ideal dataset which would have contained a large number of responses per context. Having said that, we would also like to point out that the value of the proposed dataset goes beyond having multiple relevant references, as it is also the first dataset containing adversarial irrelevant responses for given contexts.

Existing metrics
In this section, we present a brief overview of the existing automatic metrics used for dialogue evaluation. The existing metrics can be broadly classified into two categories, viz. (i) Untrained metrics, and (ii) Trained metrics. Untrained evaluation metrics, usually adopted from the NLG literature, use a predefined formula to compare the candidate response with a reference without taking the context into account. On the other hand, trained metrics are usually trained specifically for the task of dialogue response evaluation to identify valid and invalid responses for a given context.

Untrained Metrics
Untrained metrics can be further sub-classified into (i) n-gram based, (ii) word embedding based, and (iii) contextualized embedding based metrics.
N-gram based: N-gram based metrics score a candidate response based on the amount of n-gram overlap it has with a given reference. BLEU (Papineni et al., 2002), ROUGE-L (Lin, 2004) and METEOR (Banerjee and Lavie, 2005) are among the most commonly adopted n-gram based metrics to evaluate dialogue systems. BLEU is calculated using n-gram precision scores between the candidate response and the reference. ROUGE-L (Lin, 2004) is based on the F-measure of the longest common subsequence between the candidate and reference responses. METEOR (Banerjee and Lavie, 2005) relaxes the exact match criteria by including word stems, synonyms, and paraphrases. More recently, Galley et al. (2015) proposed deltaBLEU which takes in multiple references and rewards n-gram matches with positive references and penalizes the matches with the negative references.
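As an illustration of the ROUGE-L description above, the following is a minimal sketch of an F-measure over the longest common subsequence (LCS). It assumes whitespace tokenization, lowercasing, and equal weighting of precision and recall; the function names `lcs_length` and `rouge_l_f1` are illustrative and do not correspond to any official implementation, which may weight recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F-measure (assumption: precision and recall weighted equally)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("I paint in my free time", "I do not have time for hobbies"))
```

Because the score depends only on the candidate and the reference, two equally valid but lexically different responses to the same context can receive very different scores, which is the weakness discussed in the Introduction.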
Word embedding based: These methods use word embeddings to compute the similarity between the candidate response and the reference response. The most commonly used word embedding based metrics are Embedding Average (Wieting et al., 2016), Vector Extrema (Forgues et al., 2014) and Greedy Matching (Rus and Lintean, 2012). Embedding Average defines a sentence embedding as the average word embedding of the constituent words. The final score is calculated using the cosine similarity of candidate and reference sentence embeddings. Vector Extrema (Forgues et al., 2014) instead computes the sentence embedding by taking the most extreme value for each dimension. In other words, the value of the i-th dimension of the sentence embedding is computed by taking a maximum over the i-th dimension of all words in the sentence. Greedy Matching (Rus and Lintean, 2012) first computes the maximum cosine similarity that every word in the candidate response has with any word in the reference response. Similarly, the highest cosine similarity for each of the reference words with any of the candidate response words is calculated. The similarity between the candidate response and reference response is then computed by taking an average of the maximum cosine similarities computed above.
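The three word embedding based scores described above can be sketched as follows. This is a simplified illustration that operates on lists of word vectors (plain Python lists) rather than on real pretrained embeddings; the function names are illustrative, and "most extreme value" is interpreted here as the value with the largest magnitude in each dimension, which is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def embedding_average(vectors):
    """Sentence embedding as the per-dimension mean of the word vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def vector_extrema(vectors):
    """Sentence embedding that keeps, per dimension, the largest-magnitude value."""
    return [max(column, key=abs) for column in zip(*vectors)]


def greedy_matching(cand_vectors, ref_vectors):
    """Average of the two directional greedy word-to-word matching scores."""
    def one_way(source, target):
        return sum(max(cosine(s, t) for t in target) for s in source) / len(source)

    return (one_way(cand_vectors, ref_vectors) + one_way(ref_vectors, cand_vectors)) / 2


def sentence_similarity(cand_vectors, ref_vectors, pool=embedding_average):
    """Cosine similarity of pooled sentence embeddings (average or extrema)."""
    return cosine(pool(cand_vectors), pool(ref_vectors))


if __name__ == "__main__":
    cand = [[1.0, 0.0], [0.5, 0.5]]
    ref = [[0.9, 0.1], [0.0, 1.0]]
    print(sentence_similarity(cand, ref))
    print(sentence_similarity(cand, ref, pool=vector_extrema))
    print(greedy_matching(cand, ref))
```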
BERTScore: Recently, Zhang et al. (2020a) proposed BERTScore, which uses contextualized word embeddings of the candidate and reference sentences to compute the score. BERTScore is similar to greedy matching but uses contextualized embeddings from BERT instead of static word embeddings.

Trained Metrics
ADEM: The Automatic Dialogue Evaluation Model (ADEM) uses pretrained vector representations of the dialogue context $c$, the reference response $r$, and the proposed response $\hat{r}$ to compute the evaluation score as follows:
$$\mathrm{score}(c, r, \hat{r}) = \frac{c^{\top} M \hat{r} + r^{\top} N \hat{r} - \alpha}{\beta}$$
where $M, N \in \mathbb{R}^{n \times n}$ are learned matrices, and $\alpha$, $\beta$ are scalar constants used to re-scale scores in the range [1, 5]. The context, proposed response and reference response are encoded using a Hierarchical RNN (H-RNN) encoder consisting of utterance-level and context-level RNNs. The H-RNN encoder is pretrained on a Twitter dataset (Dhingra et al., 2016) in a generative setup using the latent variable hierarchical recurrent encoder decoder (VHRED) model. The weight matrices, $M$ and $N$, are later finetuned for the task of dialogue response evaluation.
RUBER: Tao et al. (2018) introduced an unreferenced evaluation model consisting of GRU encoders (Chung et al., 2014) to measure the relatedness between the dialogue context and a given response. The authors train the model on Chinese dialogue data with the hinge loss objective.
BERT regressor: Shimanaka et al. (2019) propose a BERT based evaluation model to score a candidate sentence based on a reference. Unlike BERTScore, the BERT model is finetuned to predict human judgement scores from the concatenated reference and candidate sentence. (Since we could not find an exact name for the evaluator model of Shimanaka et al. (2019), we adopt the name 'BERT regressor' from their paper's title.)
BERT+DNN: Ghazarian et al. (2019) use contextualized embeddings to compute a relatedness score between the dialogue context and response. The best performing model of Ghazarian et al. (2019) consists of a multi-layer perceptron that takes the concatenation of contextualized representations of the context and response as input. The contextualized representations are obtained by max-pooling the respective BERT embeddings for each token. Note that the BERT embeddings are not finetuned. (Due to the lack of a specific name for the models in Ghazarian et al. (2019), we refer to the model adopted from their work as 'BERT+DNN'.)
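To make the ADEM scoring step concrete, here is a minimal sketch that assumes the score takes the bilinear form (c^T M r̂ + r^T N r̂ − α) / β, with c, r and r̂ already encoded as fixed-size vectors. The encoders themselves (the H-RNN) are not modeled; the toy vectors, matrices and constants below are placeholders chosen purely for illustration.

```python
def bilinear(u, matrix, v):
    """Compute u^T * matrix * v for plain Python lists."""
    return sum(
        u[i] * matrix[i][j] * v[j]
        for i in range(len(u))
        for j in range(len(v))
    )


def adem_score(context, reference, proposed, M, N, alpha, beta):
    """Bilinear ADEM-style score (assumed form; inputs are pre-encoded vectors)."""
    return (bilinear(context, M, proposed) + bilinear(reference, N, proposed) - alpha) / beta


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    c, r, r_hat = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
    # alpha and beta are placeholder constants, not values from the paper.
    print(adem_score(c, r, r_hat, identity, identity, alpha=0.0, beta=1.0))
```

The first term measures how well the proposed response matches the context and the second how well it matches the reference, so ADEM, unlike the unreferenced RUBER model described above, still needs a reference response at evaluation time.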

Dialogue Evaluation using BERT
In the last two years, a lot of success in NLP has been driven by large pretrained transformer-based models (Radford et al., 2019; Devlin et al., 2019). These models are typically trained with a language model objective and leverage large amounts of unlabeled data. However, none of the trained metrics discussed in the previous section leverage pretraining on large-scale dialogue corpora. With the hope that such pretraining should also help dialog evaluation models, we introduce DEB (Dialog Evaluation using BERT), which is trained using a masked language model objective (similar to BERT) and a modified next response prediction objective. We set up the task of next response prediction as one of identifying whether the given response is a valid next response for the given context. Formally, given a context $C = \{w^c_1, \dots, w^c_n\}$ and a response $R = \{w^r_1, \dots, w^r_m\}$, we first pass the concatenated sequence $U = \{[\mathrm{CLS}], w^c_1, \dots, w^c_n, [\mathrm{SEP}], w^r_1, \dots, w^r_m\}$ through the BERT transformer and obtain $H_{cls} \in \mathbb{R}^{H}$, the last-layer activations corresponding to the special [CLS] token. We then make our final next response predictions as follows: $\hat{y} = \mathrm{softmax}(W H_{cls})$, where $W \in \mathbb{R}^{2 \times H}$ is a learnable matrix. We use cross entropy loss with binary targets for the next response prediction. In addition, we use the standard masked language model objective by randomly masking 15% of the words in $C$ and $R$.
Note that the proposed model is a straightforward extension of the standard BERT model used for language modeling. We do not claim any novelty on this front. The key contribution here is to assess if pretraining on large-scale dialogue corpora improves the performance of dialogue evaluation metrics. Existing BERT-based evaluation metrics (Shimanaka et al., 2019; Ghazarian et al., 2019) do not use such pretraining on any large-scale, domain-related corpora. In other words, they do not leverage the more successful recipe of (i) pretraining with a masked language modeling objective and (ii) finetuning with a task-specific objective (dialog evaluation in this case). The idea behind DEB is to check if this successful recipe can be replicated for dialog evaluation, making use of the dialogues in the large-scale Reddit corpus.
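The next response prediction head described in this section can be sketched in a few lines. The sketch below assumes that the [CLS] activation has already been produced by a BERT encoder (not modeled here) and only shows the input layout, the softmax over the two classes, and the cross entropy loss. Treating index 1 as the "valid response" class is an assumption of this illustration, as are all function names and toy values.

```python
import math


def build_input(context_tokens, response_tokens):
    """Concatenate context and response as [CLS] context [SEP] response."""
    return ["[CLS]"] + list(context_tokens) + ["[SEP]"] + list(response_tokens)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_response_probs(W, h_cls):
    """softmax(W h_cls), where W has 2 rows and len(h_cls) columns."""
    logits = [sum(w * h for w, h in zip(row, h_cls)) for row in W]
    return softmax(logits)


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class (0 or 1)."""
    return -math.log(probs[target])


if __name__ == "__main__":
    tokens = build_input(["what", "about", "you", "?"], ["i", "paint"])
    h_cls = [0.2, -0.1, 0.4]                     # toy stand-in for the [CLS] activation
    W = [[0.1, 0.3, -0.2], [0.5, -0.4, 0.6]]     # toy 2 x H weight matrix
    probs = next_response_probs(W, h_cls)
    # Assumption: index 1 denotes "valid next response".
    print(tokens, probs, cross_entropy(probs, target=1))
```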

Training details
For pretraining, we use a massive open-domain dialogue dataset of Reddit comments from 2005 to 2019 consisting of 256M threads with a total of 3.68B comments. From this dataset, we extracted a total of 727M {context, positive response} pairs with 654M for training and 73M for testing following the method described in Henderson et al. (2019). We used an equal number of negative responses by randomly sampling responses from other contexts. We use the BERT base model with 110M parameters consisting of 12 layers, 768 dimensional hidden space, and 12 attention heads per layer in all our experiments. We finetune the pretrained DEB model on our DailyDialog++ dataset for 1 epoch (we did not see any advantage of finetuning beyond 1 epoch). Note that during finetuning we only use the next response prediction objective.

Experimental Setup
Our goal is to check if the adversarial responses in our dataset, which are specifically crafted to target context-dependent model-based metrics (such as ADEM, RUBER, BERT+DNN, and DEB), indeed affect the performance of such models. To do so, we first need to benchmark the models' performance on random negatives and then check if the performance drops when evaluated on adversarial examples. Hence, in this section, we describe (i) the process of creating and validating such random negatives and (ii) the process used for training model-based metrics.
We randomly divide our dataset into train (80% of contexts), validation (10% of contexts) and test (10% of contexts) splits. Note that adversarial negatives are not used for training or finetuning the models unless explicitly specified.

Creating & validating random negatives
For every context in our dataset, which has 5 relevant responses, we also sample 5 random negatives. While sampling random negatives, we avoid short responses that may be generic and relevant for any context. To verify whether the sampled random negatives were indeed irrelevant, we asked human annotators to manually check 500 such sampled responses. More specifically, we showed them the original context and the sampled random negative response and asked them if it was a relevant or irrelevant response. In 95% of the cases, the annotators confirmed that the random negative response was irrelevant, thereby confirming that a random sampling strategy indeed results in irrelevant responses (although they may not be as hard as our adversarial negative examples as shown later).
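The sampling procedure described above can be sketched as follows. The exact length cutoff used to filter out short generic responses is not stated in the text, so the `min_words` parameter below is a hypothetical choice, as are the function name and the toy data. The candidate pool is rebuilt for every context for clarity, which would be too slow for a large corpus.

```python
import random


def sample_random_negatives(responses_by_context, k=5, min_words=4, seed=0):
    """For each context, sample k responses that belong to other contexts.

    responses_by_context maps a context id to its list of relevant responses.
    Responses with fewer than min_words words are skipped (hypothetical cutoff).
    """
    rng = random.Random(seed)
    negatives = {}
    for ctx_id in responses_by_context:
        pool = [
            response
            for other_id, responses in responses_by_context.items()
            if other_id != ctx_id
            for response in responses
            if len(response.split()) >= min_words
        ]
        negatives[ctx_id] = rng.sample(pool, min(k, len(pool)))
    return negatives


if __name__ == "__main__":
    data = {
        "c1": ["I paint in my free time", "Okay"],
        "c2": ["The flight leaves at nine tonight", "Sure"],
        "c3": ["My doctor told me to rest more", "Thank you"],
    }
    print(sample_random_negatives(data, k=2))
```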

Pretraining & finetuning trained metrics
We describe the pretraining and finetuning procedure for the various models used in our analysis below.
ADEM: As previously mentioned in Section 3, ADEM was pretrained on a Twitter corpus using the VHRED setup and then finetuned for dialogue response evaluation. We take this publicly available model and finetune it further using our DailyDialog++ dataset with a target of 5 for positive responses and 1 for random negatives. The reference response could be any of the other four relevant responses. Note that ADEM produces a score on a scale of 1 to 5 whereas the other models produce a score on a scale of 0 to 1. For easier comparison, we scale the output of ADEM so that it lies in the range of 0 to 1.
BERT regressor: We finetune the publicly available pretrained BERT base model (110M parameters) on our DailyDialog++ dataset. We train the model with a label of 1 for positive responses and 0 for random negative responses using any one of the other four positive responses as the reference. We train the model using cross-entropy loss and follow the same set of hyper-parameters as used by Shimanaka et al. (2019) during finetuning.
BERT+DNN: We use the best performing model from Ghazarian et al. (2019), which consists of a three layered feed-forward neural network and uses pretrained BERT embeddings as input. We train the model on our DailyDialog++ dataset with random negatives using cross entropy loss.
RUBER and RUBER-Large: We experiment with two variants of Tao et al. (2018)'s models with different sizes, viz., (i) RUBER (34M parameters), which consists of single-layer GRUs with a hidden size of 1024, and (ii) RUBER-Large (236M parameters), which consists of two layered GRUs with a hidden size of 2048. As shown in Vaswani et al. (2017), the training time for RNN based architectures is very high when compared to the transformer models that allow much greater parallelization. We observed an estimated time of over 200 days to train the RUBER-Large model on the 727M Reddit corpus on a 1080 Ti GPU, thereby making it practically infeasible to train such models on large-scale datasets. Taking the computational costs into consideration, we pretrained RUBER and RUBER-Large on a sample of 20M contexts with relevant and random irrelevant responses from Reddit. We then finetuned these models on our proposed dataset with random negatives.
DEB: We pretrained DEB on the entire 727M Reddit corpus using the masked language model and the modified next response prediction objective. Pretraining DEB took 4 days on a single Google Cloud TPUv2. We achieved a test accuracy of 90% on the next response prediction task and a perplexity of 15.47 (58% accuracy) on the masked language modelling task in the pretraining corpus. We then finetuned DEB on our dataset with random negatives.

Untrained metrics with multiple references
Untrained metrics like METEOR, Greedy Matching, etc. usually work with a single reference response but can also be adapted to work with multiple reference responses. For example, for a given candidate response $c$ and a set of reference responses $r_1, r_2, r_3, \dots, r_k$, we can compute the multi-reference METEOR score as:
$$\mathrm{METEOR}_{\text{multi}}(c) = \max_{i \in \{1, \dots, k\}} \mathrm{METEOR}(c, r_i)$$
Instead of the max function we can also use the average function. We use a similar formula for all the untrained metrics. A few metrics like BLEU, deltaBLEU, and ROUGE-L have their own standard formula to incorporate multiple references. BLEU calculates the number of matches for each n-gram based on the maximum number of times the n-gram occurs in common with any one of the references. deltaBLEU further extends the same idea to incorporate a score for each reference. We follow the implementation from Galley et al. (2015) to compute the deltaBLEU scores. For ROUGE-L, we follow the strategy in Sharma et al. (2017), where the score is an F-measure of the maximum precision and maximum recall over all the references. In addition to the average and maximum aggregations, we also report these standard multi-reference scores for BLEU, deltaBLEU and ROUGE-L.
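The two ways of handling multiple references described above can be illustrated with a short sketch: a generic wrapper that aggregates any single-reference metric with the max or the average over references, and the clipped n-gram counting that BLEU uses natively. The toy `jaccard` metric stands in for a real metric such as METEOR, and all names here are illustrative; the clipped precision omits BLEU's brevity penalty and the combination over n-gram orders.

```python
from collections import Counter


def multi_reference_score(metric, candidate, references, aggregate=max):
    """Aggregate a single-reference metric over several references."""
    return aggregate([metric(candidate, reference) for reference in references])


def mean(scores):
    return sum(scores) / len(scores)


def jaccard(candidate, reference):
    """Toy single-reference metric: word-set overlap (stand-in for METEOR etc.)."""
    cand, ref = set(candidate.split()), set(reference.split())
    return len(cand & ref) / len(cand | ref) if cand | ref else 0.0


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate, references, n=1):
    """Clip each candidate n-gram count by its maximum count in any one reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference.split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


if __name__ == "__main__":
    refs = ["i paint in my free time", "i do not have time for hobbies"]
    cand = "i paint when i have time"
    print(multi_reference_score(jaccard, cand, refs))                  # max aggregation
    print(multi_reference_score(jaccard, cand, refs, aggregate=mean))  # average aggregation
    print(clipped_ngram_precision(cand, refs, n=1))
```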

Results
In this section, we compare the performance of different dialog evaluation metrics in separating relevant references from (i) random negatives, (ii) synthetically crafted adversarial irrelevant responses (explained below), and (iii) manually crafted adversarial irrelevant responses (as in our DailyDialog++ dataset).

Performance on random negatives
For every context in our test split, we obtain the scores assigned by a given metric to the 5 positive and 5 random negative responses. In particular, we treat each of the 5 relevant and 5 random irrelevant responses as a candidate response. For all untrained metrics other than deltaBLEU, we consider the remaining 4 relevant responses as reference responses. For deltaBLEU, we consider the remaining 4 relevant responses as references with a score of 1 and the remaining 4 irrelevant responses as references with a score of -1. We expect a good evaluation metric to provide high scores on relevant responses and low scores on the irrelevant responses. We then quantify the performance of all metrics using two measures. First, we compute the Point Biserial correlation (PBC) between the scores assigned by a metric and the binary target, i.e., a score of 1 for positive responses and 0 for random negative responses. (Note that it can be shown that PBC is equivalent to the Pearson correlation when one of the variables is binary, as is the case above.) Second, we compute the classification accuracy of the metric by using a threshold and marking all responses having a score above this threshold as positive and others as negative. We use a threshold of 0.5 for the trained metrics. For all the untrained metrics, we perform a search from 0 to 1 with a step size of 0.01 and select the threshold that minimizes the error rate on the validation set. (With this approach of setting a threshold, we want to be lenient with the untrained metrics and investigate how best they can be adopted. One might also think of using the median of all the scores assigned by a metric as its threshold; however, such an approach is error-prone and has several boundary conditions that would defeat the purpose. We hence estimate the threshold by minimizing the risk.) Later in Section 6.1.1, we shall observe that if we use 0.5 as the threshold, the performance of most untrained metrics would be abysmally poor. Note that for the trained metrics we found that the scores were spread evenly in the range of 0 to 1 and there was no benefit of doing a grid search to find the threshold; a threshold of 0.5 was adequate.
In Table 3, we report PBC and accuracy of the different untrained metrics with both single and multiple references, and the trained metrics. When evaluating using single references, we use any one of the 5 relevant responses as a reference response (other than the one being used as a candidate). We observe that with a single reference, all the untrained metrics are poor at distinguishing between the positive and random negative responses, as inferred from the low accuracy and correlation values. When we use multiple responses, we observe a relatively better performance. We notice that the performance is largely similar across the aggregation techniques: average, maximum and standard (when applicable). Metrics such as BLEU-1, METEOR, ROUGE-L and BERTScore with multiple references are able to achieve modest correlations with the binary target. Interestingly, we observe that all the word embedding based methods, even in the presence of multiple references, perform badly in scoring the positive and random negative responses. In contrast, trained metrics such as BERT regressor, RUBER, BERT+DNN, and DEB perform substantially better than the untrained metrics. Our proposed DEB model achieves state-of-the-art performance with an accuracy of 88.27% and a strong correlation of 0.79.
Table 3 (excerpt): Point biserial correlation (PBC) and accuracy (%) on positive and random negative responses. All p-values are <1e-9. For untrained metrics, columns are Single reference and Multiple references (Avg, Max, Standard); "-" means not applicable.
Untrained metric | PBC Single | PBC Avg | PBC Max | PBC Standard | Acc. Single | Acc. Avg | Acc. Max | Acc. Standard
Vector Extrema (Forgues et al., 2014) | 0.24 | 0.35 | 0.33 | - | 59.22 | 63.70 | 63.90 | -
GreedyMatch (Rus and Lintean, 2012) | 0.24 | 0.36 | 0.32 | - | 60.02 | 63.99 | 65.56 | -
BERTScore (Zhang et al., 2020a) | 0.29 | 0.39 | 0.39 | - | 63.71 | 69.05 | 68.59 | -
Trained metric | PBC | Accuracy
ADEM | 0.40 | 64.74
BERT regressor (Shimanaka et al., 2019) | 0.52 | 73.40
BERT+DNN (Ghazarian et al., 2019) | 0.57 | 74.67
RUBER (Tao et al., 2018) | 0.64 | 78.18
RUBER-Large (Tao et al., 2018) | 0.69 | 82.36
DEB (ours) | 0.79* | 88.27*
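The two measures used in this section can be sketched as follows: PBC computed as a Pearson correlation between metric scores and binary labels, and accuracy at a threshold chosen by a grid search over 0 to 1 in steps of 0.01 on validation data. The function names and toy scores are illustrative; the toy scores are deliberately skewed towards high values to show why a fixed threshold of 0.5 can be a poor choice for an untrained metric.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; equals the point biserial correlation when ys is binary."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def accuracy(scores, labels, threshold):
    """Fraction of responses classified correctly when score > threshold means positive."""
    predictions = [1 if score > threshold else 0 for score in scores]
    return sum(p == label for p, label in zip(predictions, labels)) / len(labels)


def best_threshold(val_scores, val_labels, num_steps=100):
    """Grid search over 0, 0.01, ..., 1 for the threshold with the lowest error rate."""
    candidates = [i / num_steps for i in range(num_steps + 1)]
    return max(candidates, key=lambda t: accuracy(val_scores, val_labels, t))


if __name__ == "__main__":
    # Toy scores that are skewed towards high values.
    scores = [0.95, 0.90, 0.80, 0.78]
    labels = [1, 1, 0, 0]
    print("PBC:", pearson(scores, labels))
    print("accuracy at 0.5:", accuracy(scores, labels, 0.5))
    tuned = best_threshold(scores, labels)
    print("tuned threshold:", tuned, "accuracy:", accuracy(scores, labels, tuned))
```

In practice the threshold would be selected on the validation split and then applied unchanged to the test split, as described above.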

Analysis using Box Plots
We now visualize the box plots of the scores given by the various metrics to the positive and random negative responses. Figure 1 shows these box plots for the multi-reference untrained metrics (max aggregation) and the trained metrics. We observe several shortcomings of the untrained metrics. Firstly, all the untrained metrics have a significant overlap in the interquartile range of the positive and random negative scores, implying that there is a high degree of intermixing of scores given to the positive and random negative responses. The overlap is even higher for word embedding based metrics, which obtain low point biserial correlations. Secondly, we note that the score distributions of the untrained metrics are highly skewed. For instance, the scores of BERTScore are almost always greater than 0.75 even though it scores responses in the range [0,1]. Therefore, it is difficult to tell at what value of the metric a response can be safely considered relevant. These observations suggest that untrained metrics even with multiple references cannot be reliably used to score dialogue responses.
For the ADEM evaluation model, we observe that it outputs scores close to the mean score of 0.5 with little spread in their values. Sai et al. (2019) also made a similar observation about the clustering of the scores around the mean in ADEM, which they explain using linear system theory. In BERT regressor, there is a high overlap in the scores given to positives and random negatives. We further observe that RUBER and BERT+DNN are better able to distinguish the positive and random negative responses. Although there is separation in the interquartile range for the two classes in RUBER and BERT+DNN scores, there is a greater spread within each class and many points of the two classes overlap. RUBER-Large is able to reduce the overlap, while DEB further achieves better performance by pushing the scores for positive responses close to 1 and the scores for random negatives to 0 with high accuracy. We shall show in Section 7.3 that DEB achieves this by pushing the H_cls embeddings for the positive and random negative responses farther apart in space.
Table 4. The modifications of reversing and jumbling the word order in a relevant response make it irrelevant (grammatically wrong), and hence we expect to see more of the original true positives get classified as negatives. BERT+DNN classifies a majority of these responses as positives. One possible reason for this is that their model only uses a max pooled aggregation on BERT embeddings and does not explicitly model the sequential order of words. On the other hand, DEB fares better than the other models, as seen by the drop in the fraction of responses identified as positives. However, RUBER variants and BERT+DNN do better than DEB when retaining only nouns in a response. On removing punctuation, we expect that most of the positive responses without punctuation would remain positive and hence the percentage of responses marked positive should remain about the same. In this case, both DEB and BERT+DNN perform better than the RUBER models. For the modifications of removing stopwords and replacing words with synonyms, it is hard to generalize the trend that is observed. Hence, we perform human evaluations by presenting in-house annotators with contexts and modified responses. We ask them to provide scores in the range 0 to 3, with higher scores meaning better responses. We obtain human scores on 400 samples for this task and compute the Pearson correlation of the model predictions with the human judgements. In this case, we find that DEB is better correlated with human judgements on both the modifications.

Performance of model-based metrics on manually crafted adversarial responses
So far we have established that (i) untrained metrics perform poorly compared to trained metrics even for separating random negatives from positives, (ii) trained models like RUBER, BERT+DNN, RUBER-Large and DEB perform remarkably well in distinguishing relevant responses from random responses, and (iii) RUBER variants and DEB perform well on most synthetically mutated responses whereas BERT+DNN performs poorly against certain mutations. However, we still need to check if the trained models are robust to adversarial examples which are specifically crafted to fool such context-dependent, model-based metrics. Note that none of the untrained metrics are context dependent as they directly compute the similarity between the reference and candidate response without considering the context.
We consider the 5 relevant and the 5 adversarial irrelevant responses in our dataset and just as before compute the scores assigned by the different metrics to each of these responses. We then compute the accuracy of a metric using the target label as 0 for irrelevant responses and 1 for relevant responses. As expected, the accuracy of all the models drops, as seen in Figure 2. In particular, we observe that the models wrongly classify most of the irrelevant responses as positive/relevant responses. This can be seen from the confusion matrices in Table 5, where it is clear that the number of false positives is very high.
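As an illustration of how the accuracy and the confusion matrix entries relate, the following sketch counts the four outcomes for a set of scored responses. The function name, the dictionary keys, and the fixed decision threshold of 0.5 are assumptions made for the example; the toy scores are invented and are not taken from Table 5.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Count true/false positives and negatives for scored responses.

    labels: 1 for relevant responses, 0 for (adversarial) irrelevant ones.
    threshold: assumed decision boundary on the model score.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            counts["tp"] += 1
        elif predicted == 1 and label == 0:
            counts["fp"] += 1
        elif predicted == 0 and label == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

# Toy example: three relevant responses, then three adversarial ones.
scores = [0.9, 0.8, 0.4, 0.7, 0.6, 0.2]
labels = [1, 1, 1, 0, 0, 0]
counts = confusion_counts(scores, labels)
accuracy = (counts["tp"] + counts["tn"]) / len(labels)
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])
```

In this toy case two of the three adversarial responses are scored above the threshold, so the accuracy falls to chance level even though most relevant responses are recognised, which mirrors the failure pattern described above.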

Discussions
In this section, we do further analysis of DEB.

Ablation studies on DEB
There are different stages of training our DEB model. First, the underlying BERT model is already pretrained on English Wikipedia and the BooksCorpus. We then pretrain it further for our task using the Reddit corpus and finally finetune it on the DailyDialog++ dataset. We now evaluate the contributions of each of these stages of training (see Table 6). First, we find that the original BERT model, when adopted directly for the task of dialog evaluation, gives an accuracy of 72.65% and 58.10% on random and adversarial negatives respectively. On further analysis, we find that it has a high false positive rate, with more than 52% of the adversarial negatives getting classified as positives. After pretraining it with Reddit data, it achieves an accuracy of 84.16% on DailyDialog++ even though it has not seen any training instances from this dataset. However, there is only a marginal improvement on adversarial negatives. Finally, finetuning BERT on DailyDialog++ using only random negatives further improves the accuracy to 88.29% and 66.75% respectively.

Training with adversarial examples
We examine whether the evaluation models can learn to distinguish the adversarial negatives when specifically finetuned for that task. By training on DailyDialog++ with adversarial negatives rather than random negatives, we find that all models give an improved performance in identifying adversarial negatives (see Table 7). However, with such training, every model's performance drops when evaluated on DailyDialog++ with random negatives, with BERT+DNN dropping substantially to 60.49%. The best overall performance is seen when the models are finetuned with both random and adversarial negatives, with DEB achieving the highest accuracies on both test sets. While such improvement is expected given the capacity of the models, obtaining such adversarial examples for training is not always feasible.
Effect of the number of adversarial negatives added to training: Due to the difficulty in manually creating adversarial examples, we study the effect of the number of adversarial examples added to the training set. Our findings are presented in Figure 3, where we progressively increase the percentage of adversarial negative examples added as input to the DEB model during training with random negatives. As expected, the accuracy in identifying adversarial negatives improves as the model is exposed to more data points of the same type. In particular, we note the considerable improvement from 45.6% to 70.85% after adding just 1% of the adversarial negatives from our dataset (i.e., 100 contexts with 5 adversarial examples each). With the addition of more adversarial negatives, we find a small drop in the accuracy of identifying random negatives. There is also a slight decrease in the performance on the positive responses when the number of adversarial examples is small. We note that the adversarial negatives are hard negatives close to the positive responses in the embedding space, as we elaborate in Section 7.3, thereby confusing the model.
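One way to set up such an experiment is sketched below. It assumes a hypothetical in-memory representation in which each training context is a dictionary with the keys `context`, `positives`, `random_negatives` and `adversarial_negatives`; neither these names nor the sampling scheme are taken from the released code. Every context contributes its positives and random negatives, and only a randomly chosen fraction of contexts also contributes its adversarial negatives.

```python
import random

def mix_adversarial(train_contexts, fraction, seed=0):
    """Return (context, response, label) triples for training.

    fraction: share of contexts whose adversarial negatives are included.
    The data layout and the sampling by context are assumptions.
    """
    rng = random.Random(seed)
    num_selected = round(fraction * len(train_contexts))
    selected = set(rng.sample(range(len(train_contexts)), num_selected))
    examples = []
    for index, item in enumerate(train_contexts):
        for response in item["positives"]:
            examples.append((item["context"], response, 1))
        for response in item["random_negatives"]:
            examples.append((item["context"], response, 0))
        if index in selected:
            # Only the sampled contexts contribute adversarial negatives.
            for response in item["adversarial_negatives"]:
                examples.append((item["context"], response, 0))
    return examples

# Toy data: 200 contexts, each with 5 responses of every type.
toy_contexts = [
    {
        "context": f"context {i}",
        "positives": [f"pos {i}-{j}" for j in range(5)],
        "random_negatives": [f"rand {i}-{j}" for j in range(5)],
        "adversarial_negatives": [f"adv {i}-{j}" for j in range(5)],
    }
    for i in range(200)
]
mixed = mix_adversarial(toy_contexts, fraction=0.01)
num_adversarial = sum(1 for _, response, _ in mixed if response.startswith("adv"))
```

With 200 toy contexts, a fraction of 1% selects 2 contexts and therefore adds 10 adversarial negatives to the 2,000 other examples.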

Conicity analysis on DEB
We analyze the embeddings from the final embeddings projection space, that is, the one used by the softmax layer for next response prediction. We check for the spread of the embeddings of the positive and negative responses. Specifically, let P, R and A be the sets of embeddings of all positive responses, random negative responses and adversarial negative responses respectively for a given context. We want the spread of the set P to be low in the projected space (all positive responses embedded close to each other). At the same time, the spread of the union of the sets P, R and A should be high (positive responses separated from negative responses). We measure this spread using conicity analysis (Chandrahas et al., 2018). Conicity on a set of vectors V is defined as the average of the cosine similarity of the vectors with their mean vector, v̄. The lower the conicity, the higher the spread. For each utterance in DailyDialog++, we first construct the sets P, R and A using the pretrained DEB model. We find that the average conicity of the set P is 0.89 (averaged over all utterances), indicating that the positive responses get mapped very close to each other. The average conicity of the set P ∪ R is 0.59, indicating that the positive responses are well separated from the random negatives. However, the average conicity of the set P ∪ A is 0.74, indicating that the positive responses are not well separated from the adversarial negative responses. We illustrate this in Figure 4a by representing the mean vector of each of the sets along a corresponding highlighted region where the vectors of the set lie on average (see footnote 9). We then finetune the DEB model on the DailyDialog++ dataset. Once again, for every utterance we construct the sets P, R and A using this finetuned model. We now observe that the average conicities of the sets P, P ∪ R and P ∪ A are 0.86, 0.37 and 0.35 respectively.
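The conicity measure defined above can be sketched in a few lines. The sketch below works on plain Python lists standing in for the embeddings; the toy two-dimensional vectors are invented for illustration and do not come from DEB.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def conicity(vectors):
    """Average cosine similarity of each vector with the mean vector."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return sum(cosine(v, mean) for v in vectors) / len(vectors)

# Toy embeddings: positives point roughly along the x-axis,
# random negatives roughly along the y-axis.
P = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05]]
R = [[0.0, 1.0], [0.1, 0.9], [0.05, 0.95]]
conicity_P = conicity(P)
conicity_PR = conicity(P + R)
```

A tightly clustered set such as `P` gives a conicity close to 1, while the union `P + R` of two well-separated clusters gives a clearly lower value, which is the pattern one hopes to see for positives versus negatives.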
Thus, after finetuning, the model is able to achieve a clear separation between positive responses and random or adversarial negative responses. Furthermore, the positive responses are still close to each other (illustrated in Figure 4b).
We reiterate that we do not train the models on these datasets but simply evaluate the models trained on DailyDialog++ on these datasets. Table 8 shows that DEB outperforms the other unreferenced models on all the 3 datasets. With the Holl-E dataset being specific to conversations about movies rather than generic topics, we find the scores are relatively lower on it for all the models. The other evaluation models and metrics cannot be compared on PersonaChat and Twitter without additional reference responses, since the available single reference in these datasets is being evaluated. On the multi-reference test set of Holl-E, however, we find that their performance is lower than the three unreferenced models.
For the RNN-based models (HRED, VHRED, VHCR), we use a single-layer bidirectional encoder and single-layer decoder each with a hidden size of 1024. We pretrain the RNN-based models on the casual conversation subset of the Reddit dataset, consisting of 10M conversation exchanges. We finetune all the models on the DailyDialog++ dataset.
We conducted human evaluations to compare the extent to which the model-based metrics agree with human judgements. We randomly sampled 100 contexts from the test set of the DailyDialog++ dataset and obtained the responses generated by each of the above models. Annotators were shown a context-response pair and were asked to rate how human-like the response is with respect to the context, on a scale of 0-3. The annotators were asked to check for both fluency and coherence. A total of 15 in-house annotators participated in the human evaluation study. The annotators were Computer Science graduates competent in English. Each context-response pair was rated by 5 annotators and the final score was obtained by averaging the 5 scores. We also obtained scores at the system level by aggregating the scores for each model. In Table 9, we report the correlations of human judgments with the model scores at the response level and system level. We observe that the BERT+DNN model, in which only a feedforward neural network is learnable, does not have any significant correlation with human judgments. On the other hand, RUBER, consisting of pretrained GRUs, obtains low to moderate correlations. RUBER-Large further obtains improved correlations, indicating that using large-scale pretrained models helps. This trend is also observed in the comparisons of DEB with its ablated versions (without Reddit pretraining and without finetuning on DailyDialog++), indicating the contribution of these steps in training the final model. Our proposed DEB model obtains significantly higher correlations at the response level. We checked for significance using the Williams test to compare DEB with all other models and found the p-values to be < 1e-6. This establishes the effectiveness of DEB in scoring model generated responses. At the system level, we find that DEB correlates substantially higher than other models with the human rankings of the models.
However, the p-values in this case are not significant due to the limited number of systems. In hindsight, we realise that reporting system-level correlations is not very informative as the number of samples is very small (only as many as the number of systems). Hence, these numbers are not very reliable. However, following prior work, we still report the system-level correlations (along with the p-values) for the sake of completeness.
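The difference between response-level and system-level correlation can be made concrete with a small sketch. It assumes hypothetical records of the form (system name, averaged human score, metric score), uses Pearson correlation only, and aggregates per system by taking the mean; the actual aggregation and correlation coefficients behind Table 9 may differ.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def system_level(records):
    """Average human and metric scores per system (assumed aggregation)."""
    grouped = defaultdict(list)
    for system, human, metric in records:
        grouped[system].append((human, metric))
    humans, metrics = [], []
    for system in sorted(grouped):
        pairs = grouped[system]
        humans.append(sum(h for h, _ in pairs) / len(pairs))
        metrics.append(sum(m for _, m in pairs) / len(pairs))
    return humans, metrics

# Toy records: (system, mean human rating on 0-3, metric score).
records = [
    ("A", 3.0, 0.9), ("A", 2.0, 0.7),
    ("B", 1.0, 0.3), ("B", 0.0, 0.2),
]
response_r = pearson([h for _, h, _ in records], [m for _, _, m in records])
system_r = pearson(*system_level(records))
```

With only two systems the system-level Pearson correlation is trivially 1 or -1, which illustrates why correlations computed over a handful of systems carry little statistical weight.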

Related Work
We point the reader to Serban et al. (2018) for an excellent survey of existing datasets containing single reference responses. Recently, there has been some effort to create datasets containing multiple references but these datasets are either too small (around 1000 contexts) (Moghe et al., 2018; or noisy (Gao et al., 2019).
We have already reviewed all the existing dialog metrics in Section 3 and hence we do not discuss them again here. Instead, we quickly mention existing works which critically examine dialog evaluation metrics. For example, Liu et al. (2016) show that existing n-gram based metrics do not correlate well with human judgements for dialog evaluation. We report similar results but additionally show that the correlation improves in the presence of multiple references. Similarly, Sai et al. (2019) have critically examined ADEM and shown that in most cases it produces a score close to 2.5 (on a scale of 1 to 5) and hence does not clearly separate relevant and irrelevant responses.
Lastly, we also mention a very recent work, Zhang et al. (2020b), which has pretrained a large scale transformer on Reddit corpus for building conversation systems. However, their focus is on dialog generation and not on evaluation metrics.

Conclusions
We propose a multi-reference open-domain dialogue dataset with multiple relevant responses and adversarial irrelevant responses. We perform an extensive study of the existing dialogue evaluation metrics using this dataset and also propose a new transformer-based evaluator pretrained on large-scale dialogue datasets. We identify the strengths and weaknesses of such a model through studies of its performance on untrained and synthetically modified data. We find DEB to be easily adaptable to other open-domain dialogue datasets. We also show the value of the adversarial responses in our dataset for developing better evaluation metrics, since none of the current models perform well on them unless explicitly trained.