Less is More: Mitigate Spurious Correlations for Open-Domain Dialogue Response Generation Models by Causal Discovery

In this paper, we conduct the first study on spurious correlations for open-domain response generation models, based on CGDIALOG, a corpus we curate ourselves. Current models indeed suffer from spurious correlations and tend to generate irrelevant and generic responses. Inspired by causal discovery algorithms, we propose a novel model-agnostic method for training and inference using a conditional independence classifier. The classifier is trained by a constrained self-training method, coined CONSTRAIN, to overcome data sparsity. Experimental results based on both human and automatic evaluation show that our method significantly outperforms competitive baselines in terms of relevance, informativeness, and fluency.


Introduction
Open-domain response generation models have achieved impressive empirical success due to the recent advances in large-scale pre-trained transformers (Caldarini et al., 2022). However, although those models can generate fluent responses, it is still difficult for them to deeply understand conversation histories and produce coherent and semantically diverse responses, especially when the conversation histories are long (Sankar et al., 2019; Qiu et al., 2019). We conjecture that one of the key reasons is spuriously correlated utterances in histories, which do not directly result in responses. Although the vulnerability to spurious correlations is a well-known problem in deep learning models (Wang et al., 2021), to the best of our knowledge, there is no study on this topic from a causal perspective for response generation models.
To investigate spurious correlations in dialogues, we are concerned with identifying non-spurious ones, which are the direct causes of responses. In this work, a direct cause of a response refers to a text span or an utterance in a conversation history that directly results in the response. Table 1 shows an example dialogue between a help seeker and a peer supporter, randomly picked from the Emotion Support Conversation corpus (ESCONV) (Liu et al., 2021). The utterance u_3 serves as the direct cause of the response u_6, because it is the only utterance mentioning online learning. If we remove it from the history or significantly alter its meaning, the response u_6 becomes groundless. In contrast, if we remove an utterance non-causally related to a human response, such as u_1 or u_5 in relation to u_6, the direct causes still provide sufficient and necessary information for the response.
Causal discovery algorithms provide a theoretically grounded way to learn causal relations between random variables from observational data (Nogueira et al., 2021). Although they can in theory be applied to identify which utterances in conversation histories are direct causes of responses, research on such methods for natural language processing problems is still in its infancy.
In this work, we conduct the first study on spurious correlations for response generation models from a causal perspective. We empirically show that non-cause utterances, including spuriously correlated ones, have significantly more influence on response generation models than the direct cause utterances humans would rely on.
Supporter: Hello (u_0)
Help seeker: Hi, how are you? (u_1)
Supporter: Doing good.. How are you? (u_2)
Help seeker: I'm feeling really anxious these days. I'm finding the COVID online learning experience to be too much for me at this time. I want to stop school, but I don't think I can afford to. I need to get done with school. (u_3)
Supporter: I understand your frustration. All of us are challenged due to COVID. (u_4)
Help seeker: School was always hard. Now it's gotten harder. I think a lot of people are stressed. (u_5)
Human Supporter: How long are you doing the online school? (u_6)
BLENDERBOT Supporter: You are welcome. I wish you all the best in your future endeavors. Take care. (u_7)

Table 1: An emotion support dialogue annotated with direct causes of the human response (u_3, shown in bold in the original).

Inspired by causal discovery algorithms, we propose a model-agnostic training and inference method for mitigating spurious correlations in long conversations. The method aims to automatically identify key utterances in histories, which serve as direct causes for response generation. Herein we convert the cause identification problem into a problem of conditional independence (CI) tests. The CI tests are realized by building a classifier to infer whether an utterance in the history statistically depends on the response conditioned on its preceding utterance. As there is no training data for such a classifier, we start by manually annotating causal relations on a small portion of public open-domain dialogues. To overcome the scarcity of the training data, we propose a constrained self-training method, coined CONSTRAIN, which is able to identify causal relations with high precision and recall. This classifier is applied to filter out utterances in histories that are not direct causes of responses, before training response generation models. Furthermore, the classifier serves as a scoring function to select the most relevant response from all generated candidates. To sum up, our contributions are as follows:

• We conduct the first empirical study on spurious correlations for dialogue response generation models. To investigate this problem in depth, we curate a corpus CGDIALOG by annotating causal relations on dialogues.
• We reduce the direct cause identification problem to a problem of CI tests and propose a constrained self-training method, coined CONSTRAIN, to train the corresponding classifier.
• We propose to train response generation models by taking only direct causes as inputs and perform inference using the CI classifier.
• The extensive human evaluation results show that response generation models, such as BLENDERBOT, using our method outperform the baselines in terms of relevance, informativeness, and fluency.

Causal Discovery Background
Given a set of random variables, causal discovery from observational data is concerned with discovering the causal relations among them, typically represented as a directed graph in which an edge v_i → v_j denotes that v_i is a direct cause of v_j (Neal, 2020). A change in v_i results in a change in v_j, but an intervention in v_j does not necessarily lead to a change in v_i.
Our work is motivated by constraint-based causal discovery approaches (Nogueira et al., 2021), which iteratively apply independence and CI tests to infer causal structures. Those approaches make the faithfulness assumption that independencies in a distribution imply the structure of the corresponding causal graph. The most commonly used algorithm in this family is the PC algorithm (Spirtes et al., 2000). It starts by adding an undirected edge between two nodes if they fail an independence test, i.e., they are dependent. Then it removes the edge between two nodes if they are identified as conditionally independent after running CI tests. The algorithm continues with larger conditioning sets until the skeleton of the graph is identified. Finally, it orients the edges when possible by using heuristics and by identifying the specific structure v_i → v_k ← v_j, referred to as an immorality, as illustrated in Fig. 2b (Neal, 2020).
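The skeleton phase of the PC algorithm described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; `ci_test` is a hypothetical oracle standing in for a statistical (conditional) independence test:

```python
from itertools import combinations

def pc_skeleton(nodes, ci_test, max_cond=2):
    """Skeleton phase of the PC algorithm (sketch).

    ci_test(i, j, cond) is an assumed oracle returning True when
    variables i and j are independent given the conditioning set `cond`.
    """
    # Start fully connected, then prune edges that pass a CI test.
    edges = {frozenset(pair) for pair in combinations(nodes, 2)}
    for size in range(max_cond + 1):
        for edge in list(edges):
            i, j = tuple(edge)
            others = [n for n in nodes if n not in edge]
            # Remove the edge if any conditioning set of this size
            # renders i and j conditionally independent.
            if any(ci_test(i, j, set(c)) for c in combinations(others, size)):
                edges.discard(edge)
    return edges

# Toy oracle for the chain a -> b -> c: a and c are independent given {b}.
def oracle(i, j, cond):
    return {i, j} == {"a", "c"} and "b" in cond

print(pc_skeleton(["a", "b", "c"], oracle))
```

For the chain a → b → c the edge between a and c is pruned once the conditioning set {b} is tried, leaving the two genuine adjacencies.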
In this work, we do not need to recover the complete causal structure between utterances in dialogues. Instead, we only focus on identifying the direct causes of responses, namely the parents of the response nodes in a causal graph. A causal graph satisfies the Causal Markov Condition, which states that each variable is independent of all its non-descendants given its parents in the causal graph. Hence the value of a response variable is only determined by its parents (Pearl and Verma, 1991; Pearl, 2009). Under the faithfulness assumption, if a response variable v_j is dependent on v_i conditioned on any other nodes, and we know the influence direction is from v_i to v_j, then we conclude that v_i → v_j.

Spurious Correlations in Dialogues
The slogan "Spurious correlation is no proof of causation" is well known in statistics (Simon, 1954). A correlation between a response and an utterance in a conversational history is spurious if the utterance does not directly result in the response.
Spurious correlations are an inherent problem of statistical machine learning (ML) models. Wang et al. (2021) point out that ML models relying on core features may well achieve similar training errors on the same training data as those relying on spurious features. However, the models relying on spurious correlations lead to high test errors because spurious correlations are inconsistent across datasets. Overparameterization further exacerbates spurious correlations by memorizing examples containing spurious features (Sagawa et al., 2020). Unfortunately, almost all the SotA open-domain dialogue models are based on large-scale transformers, which are overparameterized with respect to small dialogue training datasets in target domains.
To study the impact of spurious correlations on dialogue models, we leverage two public dialogue corpora (ESCONV and MSC) to construct a small evaluation corpus for Causal Graphs in dialogues, coined CGDIALOG, and evaluate two SotA dialogue models, BLENDERBOT (Roller et al., 2021) and DIALOGPT (Zhang et al., 2020), on that corpus in terms of spurious correlations.

Annotation of Causal Graphs
We randomly sampled 80 dialogues each from ESCONV (Liu et al., 2021) and MSC (Xu et al., 2022), then employed four graduate computer science students and four well-trained crowd-workers to annotate the direct causes of responses. All annotators were first instructed on what counts as a direct cause of a response, and used Amazon Mechanical Turk (AMT) for annotation. We trained them by letting them first annotate a dry-run dataset, and provided feedback if there was a misunderstanding. After training, annotators were asked to read the provided responses and their conversation histories, then highlight which utterances or clauses serve as direct causes of the responses. We include clause-level annotations because sometimes only one clause in a long utterance is the direct cause of a response. For quality control, a human expert with a good grasp of this task reviewed all annotations and corrected mistakes. CGDIALOG-ESCONV is split into a training set, a validation set and a test set, containing 272, 211, and 211 context-response pairs respectively, while CGDIALOG-MSC contains 300, 250, and 250 context-response pairs, respectively.
We measured the inter-annotator agreement between the expert and an annotator at both the utterance level and the clause level. At the utterance level, we computed Cohen's Kappa and obtained 0.8149. At the clause level, because marked text boundaries may vary between annotators, we compute the averaged F1 score over all possible pairs of annotators, as detailed in Rajpurkar et al. (2016) and Poria et al. (2021). We obtained an F1 score of 0.8449, which indicates a high level of inter-annotator agreement.
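The span-level agreement above relies on token-overlap F1 between two annotators' marked spans, in the style of SQuAD evaluation (Rajpurkar et al., 2016). A minimal sketch, with an illustrative pair of spans:

```python
def span_f1(pred_tokens, gold_tokens):
    """Token-overlap F1 between two annotated spans (sketch).

    Counts the multiset overlap between the two token sequences,
    then combines precision and recall into F1.
    """
    common = 0
    remaining = list(gold_tokens)
    for tok in pred_tokens:
        if tok in remaining:
            remaining.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(span_f1("the online school".split(), "online school".split()))
```

Averaging this score over all annotator pairs (and all annotated spans) yields the reported clause-level agreement.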
We show the corpus statistics in Table 2 and Figure 1. Most of the preceding utterances of responses are annotated as direct causes: over 80% on ESCONV and over 95% on MSC. The proximity of utterances to responses matters: the closer an utterance is to the response, the higher its chance of being a direct cause.

Analysis of Spurious Correlations
We conduct experiments to investigate the impact of spurious correlations on two SotA response generation models: BLENDERBOT and DIALOGPT. Both models are fine-tuned on the training sets of ESCONV and MSC, taking full conversation histories as inputs. Inspired by Sankar et al. (2019), we perturb conversation histories by removing either direct causes or non-causes from histories. The outputs of a robust model should change little if only spuriously correlated utterances are removed. The removal is conducted in two ways: i) replacing each removed token with the pad token <pad>; ii) directly dropping the removed tokens. We apply such perturbations to the test set of CGDIALOG and compare the results with those obtained without any perturbation. If a response model captures the same genuine correlations between key utterances in histories and responses as humans do, the perplexities of human responses estimated by the model should change only slightly when non-cause utterances are excluded from conversation histories. However, as shown in Table 3, the increase in perplexity caused by dropping or replacing non-cause utterances is significantly sharper than that resulting from the removal of cause utterances.
To further investigate the effects of perturbing conversation histories, we apply the same decoding method of both models to the histories after perturbation. We compare the responses generated before and after perturbation in terms of BLEU; lower BLEU indicates larger changes in the generated outputs. As we can see, dropping or replacing direct causes leads to notably smaller changes in outputs than applying the same operations to non-cause utterances.
To eliminate the concern that the above observations are caused by the number of perturbed utterances, we remove or replace the same number of non-cause utterances as direct causes each time. More specifically, as the number of direct causes is always smaller than that of non-causes, we apply the perturbations to k utterances randomly chosen from non-cause utterances if the number of direct causes is k, and compute the corresponding perplexities and BLEU. To mitigate the influence of randomness, we repeat each experiment five times and compute statistical significance based on a two-sample t-test (Dror et al., 2020). As shown in Table 3, both generative models are sensitive to the removal of utterances that are weakly associated with human responses. Perturbations on an equal number of non-cause utterances lead to larger changes in the model outputs than those on causes, as indicated by BLEU. For DIALOGPT, the increase in perplexity from perturbing non-causes is still significantly higher than that from perturbing causes. Therefore, both models do not really rely on the utterances that humans use as causes to articulate responses, but rely heavily on non-cause utterances.
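The two perturbation schemes above (pad-replacement versus dropping) can be sketched as follows; the tokenized history and the choice of utterance indices are illustrative:

```python
def perturb(history, remove_idx, pad_token="<pad>", mode="drop"):
    """Perturb a conversation history (sketch of the two removal schemes).

    history: list of utterances, each a list of tokens.
    remove_idx: set of utterance indices to remove (e.g. non-causes).
    mode="replace" swaps every removed token for <pad>, preserving
    sequence positions; mode="drop" deletes the utterances outright.
    """
    out = []
    for i, utt in enumerate(history):
        if i in remove_idx:
            if mode == "replace":
                out.append([pad_token] * len(utt))
            # mode == "drop": skip the utterance entirely
        else:
            out.append(utt)
    return out

history = [["hi"], ["how", "are", "you"], ["fine", "thanks"]]
print(perturb(history, {1}, mode="replace"))
# -> [['hi'], ['<pad>', '<pad>', '<pad>'], ['fine', 'thanks']]
```

The perturbed histories are then fed to the fine-tuned models to measure perplexity of the human responses and BLEU against the unperturbed outputs.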

Response Generation Based on Causal Discovery
As shown by our empirical study, spurious correlations are detrimental to the SotA dialogue models.
To remedy this, we propose to automatically identify the utterances in conversation histories that serve as direct causes of responses, and to use only them as history representations during both training and inference. Based on the theoretical analysis in Sec. 2, this identification problem is reduced to running CI tests between responses and utterances in their histories. Herein, we propose a constrained self-training procedure to build a classifier for classifier-based CI tests (Lopez-Paz and Oquab, 2017; Sen et al., 2017, 2018; Bellot and Schaar, 2019). Formally, given a conversation history C_t = {u_0, ..., u_{t-1}} at time t, a dialogue model aims to produce a word sequence r_t as the response based on C_t. Both u_i and r_t are regarded as collections of random variables, where each variable in a collection denotes whether a single word is present or not. Because the same event can be expressed in various linguistic forms, we assume there is a projection function g(u), which maps an utterance to a latent random variable vector z ∈ Z denoting the meaning of the corresponding event.

Figure 1: Top: the ratio between the number of history-response pairs with a particular number of direct causes and all history-response pairs. Bottom: proximity between direct causes and responses, measured by the percentage of such pairs in all history-response pairs.
A causal graph in the semantic space is a directed acyclic graph G = {V, E}, where a node represents a latent random variable vector z_i and an edge denotes a causal relation between a pair of nodes. We do not define causal graphs in the word space because i) it is the meanings of utterances that are causally correlated, and ii) the same words in different contexts may be involved in different causal relations. Identifying direct causes of responses can thus be regarded as recognizing causal relations between those latent random variables. To simplify notation, we denote the output of g(u_i) by z_i, unless stated otherwise.

From Cause Identification to the Conditional Independence Tests
If a latent semantic vector z_i of an utterance is a direct cause of the meaning of a response z_j, then z_i is not independent of z_j given Z_{t,-i}, where Z_{t,-i} denotes any subset of latent random variables derived from the history C_t excluding z_i. In other words, z_i provides additional useful information about z_j given any other utterances in a history. However, it is computationally expensive to consider all possible subsets of a conversation history when running CI tests for a single utterance.
To address the computational challenge, we observe that a response often depends only on the preceding utterance and on at most two utterances in total. As evident in Fig. 1, 81% of the responses in CGDIALOG have one or two direct causes and 90% of the preceding utterances serve as direct causes of the following responses. Therefore, we can sharply reduce the computational overhead by making the following assumptions.
Assumption 1. For each response r_t, g(u_{t-1}) → g(r_t) always holds.

Assumption 2. There are at most two direct causes of the latent random variable vector of a response.

Assumption 3. If there is an edge between g(u_i) and g(u_j) in a causal graph and i < j, then g(u_i) → g(u_j).
The last assumption articulates the fact that what people said in the past influences what people will say in the future.If the temporal order in a conversation is known, there is no need to apply statistical methods to infer the orientation.
Under the above assumptions, for a given response r_t, there are only four possible neighborhood structures, as illustrated in Fig. 2. We have that z_t is dependent on z_j given z_{t-1} for Fig. 2a and Fig. 2b, but z_t is conditionally independent of z_j given z_{t-1} in the remaining cases. Herein, we make the faithfulness assumption that CIs imply graph structures. Under our assumptions, it is sufficient to determine whether an utterance u_j with j < t is a cause of r_t by checking whether z_t is dependent on z_j given z_{t-1}. Hence, we only need to run t - 2 CI tests for a response r_t. Note that it is important to run CI tests instead of dependence tests to find a direct cause of a response. As illustrated in Fig. 2c, although z_j is not a direct cause of z_t, the two are still dependent through z_k and z_{t-1} according to dependence tests. If we run a CI test conditioned on z_{t-1}, the path through z_k is blocked, so the test result reveals z_t ⊥⊥ z_j | z_{t-1}. More details on identifying independence structures in a graphical model can be found in (Neal, 2020; Pearl, 2009).

Conditional Independence Tests
To perform CI tests over a set of latent random variables z on observational data, we need to i) project utterances into the latent space, and ii) choose a scalable test method that can work with text. However, the first step is already challenging because the latent random variables are unknown and we do not even know how many there are for an arbitrary dialogue corpus.
To tackle both challenges, we opt for the classifier-based CI test. As z_t ⊥⊥ z_j | z_{t-1} implies p(z_t, z_j | z_{t-1}) = p(z_t | z_{t-1}) p(z_j | z_{t-1}), this family of tests builds a classifier to determine whether a sample of data is drawn from p(z_t | z_{t-1}) p(z_j | z_{t-1}) or from p(z_t | z_j, z_{t-1}) p(z_j | z_{t-1}). To train the classifier, we label a tuple (z_t, z_{t-1}, z_j) with l = 1 if it is drawn from p(z_t | z_j, z_{t-1}) p(z_j | z_{t-1}), and otherwise with l = 0. The classifier then aims to capture the conditional distribution p(l | z_t, z_{t-1}, z_j).
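Building the labeled examples for such a classifier can be sketched as follows. This is an illustration rather than the paper's exact procedure: negatives are constructed by shuffling u_j across examples to approximate draws from p(z_t | z_{t-1}) p(z_j | z_{t-1}); a more faithful variant would shuffle u_j only among tuples sharing a similar conditioning utterance u_{t-1}:

```python
import random

def make_ci_training_pairs(tuples, seed=0):
    """Build labeled examples for a classifier-based CI test (sketch).

    tuples: list of (r_t, u_prev, u_j) drawn from the joint distribution.
    Positives (l=1) keep the genuine u_j; negatives (l=0) shuffle u_j
    across examples, simulating draws from the product distribution
    p(z_t|z_{t-1}) p(z_j|z_{t-1}).
    """
    rng = random.Random(seed)
    pos = [(r, prev, uj, 1) for r, prev, uj in tuples]
    shuffled = [uj for _, _, uj in tuples]
    rng.shuffle(shuffled)
    neg = [(r, prev, uj, 0) for (r, prev, _), uj in zip(tuples, shuffled)]
    return pos + neg
```

The classifier is then trained to predict the label from the tuple, which amounts to estimating p(l | z_t, z_{t-1}, z_j).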
The recent advances in deep learning show that hidden representations of deep neural networks can well capture the meanings of input texts (Yang et al., 2020). Hence, it is straightforward to consider a deep encoder as a function g(u) from an utterance u to a hidden representation z. Specifically, we employ a pre-trained ROBERTA (Liu et al., 2019) as the encoder to map a tuple (r_t, u_{t-1}, u_j) to a sequence of hidden representations (z_t, z_{t-1}, z_j), where adjacent utterances are separated by the special token </s>. Taking the representations (z_t, z_{t-1}, z_j) as input, the CI classifier consists of a mean-pooling layer, a linear layer and a sigmoid layer for characterizing p(l | z_t, z_{t-1}, z_j).
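A minimal PyTorch sketch of this classifier head; the toy embedding table stands in for the pre-trained ROBERTA encoder, and the input tuple is assumed to be already tokenized into a single id sequence:

```python
import torch
import torch.nn as nn

class CIClassifier(nn.Module):
    """CI classifier head (sketch): mean pooling + linear + sigmoid.

    `encoder` stands in for the pre-trained RoBERTa encoder; any module
    mapping token ids to (batch, seq_len, hidden) works here.
    """
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        hidden = self.encoder(input_ids)       # (B, T, H)
        pooled = hidden.mean(dim=1)            # mean pooling over tokens
        # Sigmoid output approximates p(l = 1 | z_t, z_{t-1}, z_j).
        return torch.sigmoid(self.linear(pooled)).squeeze(-1)

# Toy stand-in encoder: an embedding table instead of RoBERTa.
toy_encoder = nn.Embedding(100, 16)
clf = CIClassifier(toy_encoder, hidden_size=16)
probs = clf(torch.randint(0, 100, (2, 10)))
print(probs.shape)  # torch.Size([2])
```

In the actual model the three utterances would be concatenated with </s> separators and encoded jointly before pooling.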
Inspired by Sun et al. (2019), we first train the pre-trained ROBERTA with the masked language model objective on the publicly available Reddit dataset (Baumgartner et al., 2020). The classifier is then trained with a constrained self-training loop, which repeatedly scores unlabeled tuples with the current classifier and adds a tuple (u_j, u_{t-1}, r_t) to the training set as a positive example only if two constraints hold: i) the probability p(l = 1 | u_j, u_{t-1}, r_t) exceeds a predefined threshold of 0.9; and ii) u_j is either u_{t-2} or u_{t-3} with respect to the response r_t.
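The constrained self-training loop, as we read the description above, can be sketched as follows; `train_fn` and `predict_fn` are hypothetical stand-ins for classifier training and scoring:

```python
def constrained_self_train(labeled, unlabeled, train_fn, predict_fn,
                           threshold=0.9, max_iters=5):
    """Constrained self-training loop (sketch of CONSTRAIN).

    labeled: seed examples of the form ((u_j, u_prev, r_t), label).
    unlabeled: tuples (u_j, u_prev, r_t, offset), where offset is the
    distance of u_j from the response (2 means u_{t-2}, 3 means u_{t-3}).
    predict_fn(model, tuple) returns p(l = 1 | u_j, u_prev, r_t).
    """
    data = list(labeled)
    model = None
    for _ in range(max_iters):
        model = train_fn(data)
        added = []
        for (uj, uprev, rt, offset) in unlabeled:
            p = predict_fn(model, (uj, uprev, rt))
            # Constraint i): confident prediction; ii): u_j near r_t.
            if p > threshold and offset in (2, 3):
                added.append(((uj, uprev, rt), 1))
        if not added:
            break  # converged: no new pseudo-labels satisfy the constraints
        data.extend(added)
        unlabeled = [x for x in unlabeled
                     if ((x[0], x[1], x[2]), 1) not in added]
    return model
```

Each round retrains on the seed data plus the accumulated pseudo-labeled positives, stopping when the constraints admit no further examples.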

Training and Inference for Generative Response Models

To overcome spurious correlations, we propose to feed only the direct causes of responses to dialogue models during training and inference, where the direct causes are selected by the CI classifier. This approach is model-agnostic because it only "cleans" the inputs of a response model, regardless of which neural architecture is used.
The training set of mainstream open-domain dialogue models consists of conversation history and response pairs {C_t, r_t}_{t=1}^n. Before training, we preprocess the training set by keeping only the direct causes in each conversation history. As u_{t-1} is always one of the direct causes according to Assumption 1, we find another cause by using the CI classifier. In particular, for each conversation history C_t, we perform max inference on all tuples (u_j, u_{t-1}, r_t) using the classifier, where j ∈ [0, t-2]. We select the u_j with the highest probability p(l = 1 | u_j, u_{t-1}, r_t) as another direct cause. Dialogue models are subsequently trained on the preprocessed training set.
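The preprocessing step can be sketched as below; `ci_prob` is the trained CI classifier's scoring function, and utterances are plain strings for illustration:

```python
def select_direct_causes(history, response, ci_prob):
    """Keep only the direct causes of a response (training preprocessing).

    history: [u_0, ..., u_{t-1}]; ci_prob(u_j, u_prev, r) returns the CI
    classifier's p(l = 1 | u_j, u_prev, r). u_{t-1} is always kept
    (Assumption 1); the argmax over earlier utterances is added as the
    second direct cause (Assumption 2).
    """
    u_prev = history[-1]
    candidates = history[:-1]   # u_0 .. u_{t-2}
    if not candidates:
        return [u_prev]
    best = max(candidates, key=lambda u: ci_prob(u, u_prev, response))
    return [best, u_prev]
```

The response model is then trained on these reduced histories instead of the full ones.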
The input selection for inference is conducted in a similar manner. In particular, we feed each possible (u_j, u_{t-1}) with j ∈ [0, t-2] to the trained dialogue model to generate a response by beam search. Then we apply the CI classifier to identify the tuple (u_j, u_{t-1}, r_t) with the highest p(l = 1 | u_j, u_{t-1}, r_t). To allow selecting responses based on p(r_t | u_j, u_{t-1}) or p(r_t | u_{t-1}), we choose the response conditioned on (u_j, u_{t-1}) if the highest p(l = 1 | u_j, u_{t-1}, r_t) exceeds the threshold 0.5, tuned on a validation set; otherwise we take the response conditioned on u_{t-1} alone.
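A sketch of the inference-time selection; `generate` stands in for the beam-search decoder of the dialogue model, and passing `None` as the first argument denotes conditioning on u_{t-1} alone:

```python
def select_response(history, generate, ci_prob, threshold=0.5):
    """Inference-time response selection (sketch).

    generate(u_j, u_prev) returns a beam-search response from the
    dialogue model; ci_prob(u_j, u_prev, r) scores p(l = 1 | ...).
    Falls back to conditioning on u_{t-1} alone when no candidate
    clears the tuned threshold.
    """
    u_prev = history[-1]
    best_r, best_p = None, -1.0
    for u_j in history[:-1]:
        r = generate(u_j, u_prev)
        p = ci_prob(u_j, u_prev, r)
        if p > best_p:
            best_r, best_p = r, p
    if best_p > threshold:
        return best_r
    return generate(None, u_prev)  # condition on u_{t-1} only
```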

Datasets
We experiment on the following two open-domain dialogue corpora, both of which have long conversation histories. The longer a conversation history is, the more likely utterances in the history are spuriously correlated with responses. In contrast, most open-domain dialogue corpora contain short conversations, in which there are far fewer spuriously correlated utterances. For example, DailyDialog (Li et al., 2017), WizardOfWikipedia (Dinan et al., 2019) and EmpatheticDialogues (Rashkin et al., 2019) have 7.9, 9, and 4.31 utterances per conversation, respectively.

Emotion Support Conversation (ESCONV).
ESCONV (Liu et al., 2021) contains conversations between mental health help seekers and supporters, with 29.8 utterances per dialogue on average. In each dialogue, help seekers talk about their problems, such as unemployment, losing a family member, or being infected with COVID. Dialogue response models play the role of supporters to provide supportive responses to help seekers. Each utterance from supporters is annotated with a strategy, such as providing suggestions, paraphrasing, or asking questions, which we do not consider in our models. The corpus is split into training, validation and test sets with ratios of 80%, 10% and 10%, respectively.
Multi-Session Chat (MSC). MSC (Xu et al., 2022) contains human-human chit-chat over five sessions, each of which contains up to 14 utterances. The average number of utterances per dialogue is 53.3. In each session, two interlocutors conduct a conversation based on given personas. Each persona describes personal information in multiple sentences. We experiment on its official splits of training, validation, and test sets.

Baseline Models
We compare our method CONSTRAIN and its variations, based on BLENDERBOT, with the following generative models: BLENDERBOT. This transformer-based encoder-decoder model achieves superior performance over prior models in terms of engagingness and humanness (Roller et al., 2021). We fine-tune the pre-trained model with varying settings of conversational histories. A conversational history contains either: 1) only the preceding utterance u_{t-1}; 2) the preceding two utterances (u_{t-2}, u_{t-1}) when available; 3) the preceding three utterances (u_{t-3}, u_{t-2}, u_{t-1}) when available; 4) the complete conversational history (u_0, ..., u_{t-1}); or 5) the preceding utterance u_{t-1} and a randomly selected utterance u_j with j between 0 and t-2. All hyperparameters remain the same across settings.
DialoFlow. Li et al. (2021) propose a dialogue model that captures dynamic information flow across utterances. The model generates a response based on a distributed representation predicted from past information flow.
Retrieval-guided Model. We implement the retrieval-guided response generation model proposed in Zhong et al. (2022) without using user ids, because they are not available in either corpus. Herein, we first map the tokens in the preceding utterance u_{t-1} and the tokens in the previous history {u_0, ..., u_{t-2}} into two sets of BERT embeddings. Then we compute a similarity matrix between the two sets of embeddings in terms of dot product. As there is a similarity vector for each token in the previous history, we score each token by the highest similarity score in the corresponding vector. We pick the top-30 scored tokens as the final set of retrieved tokens. The input to the response generation model is the concatenation of u_{t-1} and the corresponding retrieved tokens.
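The token-retrieval step of this baseline can be sketched as follows; the embeddings are assumed to be precomputed BERT vectors, here replaced by toy arrays:

```python
import numpy as np

def retrieve_tokens(prev_emb, hist_emb, hist_tokens, top_k=30):
    """Token retrieval for the retrieval-guided baseline (sketch).

    prev_emb: (n_prev, d) embeddings of the preceding utterance's tokens.
    hist_emb: (n_hist, d) embeddings of earlier history tokens.
    Each history token is scored by its highest dot-product similarity
    to any token of the preceding utterance; the top_k are kept.
    """
    sim = hist_emb @ prev_emb.T        # (n_hist, n_prev) similarity matrix
    scores = sim.max(axis=1)           # best match per history token
    order = np.argsort(-scores)[:top_k]
    return [hist_tokens[i] for i in sorted(order)]  # keep original order
```

The retrieved tokens are concatenated with u_{t-1} to form the generator's input.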
ESCONV Baseline. Liu et al. (2021) provide two response models on ESCONV. The first directly fine-tunes the BLENDERBOT model on ESCONV without using annotations of negotiation strategies. The second fine-tunes BLENDERBOT by taking as input both negotiation strategies and conversation histories. Both models consider the preceding five utterances as conversation history.
TransferTransfo. As MSC can be viewed as an extension of the PersonaChat dataset (Zhang et al., 2018), we consider TransferTransfo (Wolf et al., 2019), which reports the SotA performance on PersonaChat. We fine-tune this model on the training set of MSC for a fair comparison.
Retriever-generator. Xu et al. (2022) propose a model consisting of a retriever and a generator.The retriever selects relevant utterances from a history, while the generator produces responses conditioned on the utterances selected by the retriever.
Amongst the above models, BLENDERBOT, DialoFlow, and the retrieval-guided model are evaluated on both corpora. TransferTransfo is evaluated only on MSC because it performs worse than the model proposed in Liu et al. (2021) on ESCONV. Furthermore, the baseline of Liu et al. (2021) is only evaluated on ESCONV because it requires annotations of strategies.

Implementation Details
All models are implemented with PyTorch (Paszke et al., 2019) and the Transformers library (Wolf et al., 2020). We use the same BLENDERBOT model in all relevant experiments. All models are trained with the Adam (Kingma and Ba, 2015) optimizer, with hyperparameters tuned on the validation sets. As a result, we run Adam with β_1 = 0.9 and β_2 = 0.999. The learning rate is 2 × 10^-5 for the CI classifier and 5 × 10^-5 for the response model. We use a linear learning rate scheduler that dynamically decreases the learning rate after a warm-up period. CI classifiers were trained for 10 epochs with batch size 16 on one NVIDIA V100 16GB GPU; the response models were trained for 5 epochs with batch size 8. The beam search width is set to 5 during decoding.

Metrics
Human Evaluation In practice, we had the same observations as in (Belz and Kow, 2010; Callison-Burch et al., 2007; Kiritchenko and Mohammad, 2017): asking crowd-workers to directly score responses on a scale usually yields low-quality evaluations. Thus, following the evaluation design in (Novikova et al., 2018; Bojar et al., 2016; Zheng et al., 2021; Zhou et al., 2018; Liu et al., 2021), we opt for pairwise comparison between responses from different sources. In each comparison experiment, we compared our model with a baseline or human responses on a set of 100 conversations randomly sampled from our test set. Given a conversation history, we presented crowd-workers with a pair of responses, one generated by our model and the other either from humans or from a baseline. Five well-trained crowd-workers from Amazon Mechanical Turk (AMT) were asked to choose the better one in terms of four metrics: Empathy (Which response shows better understanding of the partner's feelings?), Fluency (Which response has better fluency and readability?), Relevance (Which response is more relevant and coherent to the context?) and Informativeness (Which response provides more information when both are relevant?). For quality control, we selected only crowd-workers with an approval rating greater than 90% and a minimum of 10,000 approved tasks. Inter-rater agreement using Krippendorff's α was 0.41. In addition, we presented both good and bad example responses for each metric to educate crowd-workers.
The results of all comparison experiments are summarized using ranking-based Best-Worst Scaling, a method shown to be more reliable than rating-based Likert scaling in prior studies (Kiritchenko and Mohammad, 2017; Puduppully and Lapata, 2021; Steen and Markert, 2021; Tang et al., 2022; Louviere et al., 2015). For each pair of models in comparison, the score of a model is calculated as the number of times it is rated best minus the number of times it is rated worst (Amplayo and Lapata, 2021; Puduppully and Lapata, 2021). Thus, for such a pair of models, their scores have the same absolute value but opposite signs. For example, in a comparison experiment between System A and System B, if the score of System A is 13, then that of System B is -13; knowing the score of one system yields the score of the other automatically. To summarize those results, we put the scores of the baselines and human responses, compared against our model, in one table. As our model is always used as the reference, we set its scores to zero in that table. Therefore, a negative score in the table means the corresponding system performs worse than our model, while a positive score indicates better performance.
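The Best-Worst Scaling score described above reduces to a simple count; the systems and judgments below are illustrative:

```python
def bws_scores(judgments):
    """Best-Worst Scaling scores (sketch): for each system, the number
    of times it is rated best minus the number of times rated worst.

    judgments: list of (best_system, worst_system) pairs, one per
    crowd-worker comparison.
    """
    scores = {}
    for best, worst in judgments:
        scores[best] = scores.get(best, 0) + 1
        scores[worst] = scores.get(worst, 0) - 1
    return scores

print(bws_scores([("A", "B"), ("A", "B"), ("B", "A")]))
# -> {'A': 1, 'B': -1}
```

In a two-system comparison the scores are mirror images of each other, which is why reporting one side suffices.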
Automatic Evaluation Although automatic metrics are still not reliable for response evaluation (Liu et al., 2016), to facilitate comparisons with prior work, we consider four automatic metrics for evaluating the quality of responses: BLEU (Papineni et al., 2002), BERTScore (Zhang et al., 2020), MAUVE (Pillutla et al., 2021), and METEOR (Banerjee and Lavie, 2005). In addition, we evaluate the diversity of model outputs in terms of Distinct-1/2 (Li et al., 2016).

Experimental Results
Response Generation. We compare BLENDERBOT using our method (CONSTRAIN) with the strongest baselines for response generation. Table 4 summarizes the human evaluation results based on Best-Worst Scaling. Our response model outperforms all baselines in terms of all metrics on both ESCONV and MSC, as indicated by their negative scores. Most of the results are statistically significant. The automatic evaluation results with MAUVE in Table 5, one of the best automatic metrics for NLG tasks, also demonstrate the strengths of our method over the baselines. This meets our expectation that responses generated based on direct causes are better than responses generated from histories that include spuriously correlated utterances.
Surprisingly, BLENDERBOT using our method outperforms human responses on ESCONV in terms of fluency and informativeness. A closer look at the results reveals that i) some of the responses generated by our model are longer than the corresponding human responses because they cover more specific details in the contexts, and ii) a significant number of responses in ESCONV contain grammatical errors, while the model-generated ones rarely do. Our model does not, however, reach human-level performance on MSC in terms of informativeness and relevance, where the majority of the multi-session conversations span more than 40 turns.
The two model variations in Liu et al. (2021) are the strongest reported baselines on ESCONV, while the retriever-generator model is the strongest one on MSC in the literature. Both the retriever-generator and the retrieval-guided model apply retrieval techniques to identify the most relevant texts in the context. The retrieval-guided model first employs the tokens in the preceding utterance u t−1 as queries to retrieve the most relevant tokens in the context {u 0 , ..., u t−2 }, and then concatenates them with the tokens in u t−1 as model inputs. In contrast, the retriever-generator identifies relevant utterances in histories. Despite this, all of them still fall short of our method according to both human and automatic evaluations. Those results indicate that retrieval techniques are still limited for identifying key utterances from conversation histories.

Table 4: Results of human evaluation using best-worst scaling (higher is better). Results in bold are better than all competitors. Systems significantly different from our method are marked with an asterisk * (using a one-way ANOVA with post hoc Tukey HSD tests; p < 0.05).
We compare different ways of selecting utterances from conversation histories as inputs to the same neural architecture. Table 4 and Table 5 include the corresponding results of BLENDERBOT on both corpora. Taking the full conversation history as input, which is widely used in practice, turns out to be a poor choice on both corpora: the responses generated in this setting are often too generic, such as "I'm sorry to hear that.", without touching specific details in the contexts. In comparison, using the preceding utterance proves to be a good heuristic on ESCONV, while the best heuristic on MSC is to use the preceding three utterances. The worst case is P (r t |u j , u t−1 ), which randomly selects an utterance between the first utterance and u t−2 to combine with u t−1 ; the corresponding ratio of spurious correlations is one of the highest among all settings. These results again demonstrate the harm of spuriously correlated utterances to generative models.
To demonstrate that our method is model-agnostic, we apply it to DIALOGPT (https://huggingface.co/microsoft/DialoGPT-medium) instead of BLENDERBOT, and evaluate the models on both ESCONV and MSC with varying input settings. As shown in Table 6, our method outperforms the other DIALOGPT models with different input settings in terms of all metrics. As DIALOGPT uses only a transformer-based decoder, this shows that our training and inference methods improve the performance of both decoder-only and encoder-decoder neural architectures.
Ablation Study of Response Generation. We conduct ablation studies to demonstrate that conditional dependence is crucial for selecting direct causes during training and inference. The corresponding results are summarized in Table 7.
Training generative models with the utterances selected by our method improves model performance significantly. Without our method, empathy, informativeness, and relevance drop for all BLENDERBOT variations on ESCONV; only fluency increases slightly when using the preceding two utterances as input during training. It is worth noting that training models with the utterances selected by our CI classifier consistently improves the diversity of response candidates.

Using BLENDERBOT trained with our method (CONSTRAIN), we compare our inference method, coined u MaxCI,t−1 , with alternative methods: i) randomly selecting u j between 0 and t−2 and combining it with u t−1 , coined u Random,t−1 ; ii) taking both u t−2 and u t−1 as input, coined u t−2,t−1 ; iii) applying the entropy-based method proposed in Csáky et al. (2019) to remove generic response candidates and select the optimal response, coined u Entropy,t−1 ; iv) replacing the CI classifier with a dependence classifier for inference, coined u MaxDep,t−1 . The dependence classifier is trained by treating (u t−1 , r t ) as positive samples and (u j , r t ) as negative samples, where u j , far from the responses, is randomly sampled from the dialogue histories. During inference, we generate response candidates in the same way as our method u MaxCI,t−1 , but select the candidate that has the highest dependence probability P depend (l = 1|u j , r j t ) as the final output. The results in Table 7 show that our inference method outperforms the alternative inference methods when the models are trained with our method. Replacing the CI classifier with the dependence classifier (u MaxDep,t−1 ) leads to substantial performance drops in terms of all metrics. It is also noteworthy that generating responses using the preceding two utterances (u t−2,t−1 ) is a fairly effective heuristic, which falls short of our method only in terms of empathy. This can be explained by the statistic that 40% of direct causes on ESCONV are the preceding two utterances, while the corresponding percentage on MSC is 29%. Selecting key utterances randomly or by entropy to pair with u t−1 is worse than that simple heuristic.
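The u MaxCI,t−1 inference rule can be sketched as follows; `generate` and `ci_prob` are assumed interfaces standing in for the response model and the CI classifier, and the simple argmax loop is an illustration rather than the exact implementation.

```python
def select_response(history, generate, ci_prob):
    """Sketch of the u_MaxCI,t-1 inference rule.

    `generate(u_j, u_prev)` produces a response candidate conditioned on
    the pair (u_j, u_{t-1}); `ci_prob(u_j, u_prev, r)` returns the CI
    classifier's probability that u_j and r are conditionally dependent
    given the preceding utterance. Both are hypothetical interfaces.
    """
    u_prev = history[-1]                      # u_{t-1}
    best_r, best_score = None, float("-inf")
    for u_j in history[:-1]:                  # candidates u_0 .. u_{t-2}
        r = generate(u_j, u_prev)             # candidate from (u_j, u_{t-1})
        score = ci_prob(u_j, u_prev, r)
        if score > best_score:
            best_r, best_score = r, score
    return best_r                             # candidate with the highest CI score
```

The dependence-classifier variant u MaxDep,t−1 differs only in the scoring function, which ignores the conditioning on u t−1.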
In addition, we compare our method with regularized beam search (Roller et al., 2021) in three settings: i) replacing the unregularized beam search with the regularized one using our method, ii) using only the preceding utterance as input, and iii) using the full conversation histories as input. In all settings, the beam search employs a width of 10 with 3-gram blocking and a minimum length of 20. Regularized beam search with full conversation histories (P (r t |u 0:t−1 )-Beam) or only the preceding utterance (P (r t |u t−1 )-Beam) achieves dramatically lower performance than our inference method. If the beam search is used together with the CI classifier (CONSTRAIN-Beam), the model performance increases slightly, but the differences are not statistically significant.
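For concreteness, the decoding configuration above corresponds to the following keyword arguments in a Hugging Face-style `generate` call; this is a configuration sketch only, and the model checkpoint is not shown.

```python
# Decoding configuration for the regularized beam search settings,
# expressed as Hugging Face `generate`-style keyword arguments.
beam_search_config = dict(
    num_beams=10,            # beam width of 10
    no_repeat_ngram_size=3,  # 3-gram blocking
    min_length=20,           # minimum response length
)
```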
Qualitative Studies. To further investigate the differences between the CI classifier and the dependence classifier, we apply the model to generate all candidate responses and score the candidates with the probabilities yielded by the dependence and the CI classifiers. Using the example conversation in Table 1, we show all generated candidate responses and the corresponding scores in Table 9. With u 3 , the direct cause used by humans, the corresponding response achieves the highest conditional dependence probability but not the highest dependence probability; perplexity is not reliable either. Moreover, the distributions of the conditional dependence scores are more skewed towards the true direct causes than those of the dependence scores. Hence, conditional dependence, which measures the conditional mutual information obtained from a selected utterance beyond that from the preceding utterance, is more informative and robust than the mutual information between responses and single utterances in contexts.
Furthermore, we apply our method to BLENDERBOT on example dialogues and show qualitative differences from the baselines. Table 9 shows the responses generated by our method and the baselines using the running example in Table 1. The responses generated by our method give a specific suggestion to "talk to a school counselor" or refer to the most specific detail of "online learning", while the remaining ones talk about school or irrelevant content. In addition, we provide the Best-Worst Scaling scores of five crowd-workers, who compare the baseline outputs with those of our method. Most crowd-workers consider our model's output better than those of the baselines in terms of informativeness and relevance.
For error analysis, we find that our model cannot always generate natural and relevant responses by relying on the same direct causes as humans. As shown in Table 10, although the direct causes of humans and our model overlap, the response generated by our model is reasonable and relevant, capturing the context-specific entities "son" and "boyfriend", while the other models fail to do so. In such cases, even if our model uses direct causes different from humans' for response generation, most of its responses are reasonable and fluent. To further investigate to what degree our model utilizes the same direct causes as humans, we apply our model to the test set of CGDIALOG and collect the direct causes used during inference.
The percentages of using exactly the same causes, partially overlapping causes, and totally different causes amount to 26.47%, 62.13%, and 11.40%, respectively. Overall, the model with our method produces more specific, relevant, and natural responses than the baselines, regardless of whether it uses the same direct causes as humans or not.

CI Classification Results. We evaluate our method CONSTRAIN on identifying direct causes of responses in the test sets of CGDIALOG, and compare it with two simple but strong baselines: "Always u t−1 " and "Always u t−2 , u t−1 ". The former always considers u t−1 of a response as the direct cause, while the latter considers the preceding two utterances as direct causes. In the test sets, we keep the manually annotated cause-response pairs as positive examples, while combining all non-cause utterances with u t−1 and r t as negative samples. As a result, the number of negative samples is much larger than the number of positive examples. Due to this imbalance, we adopt precision, recall, and F1 as the evaluation metrics.
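Under this imbalance, the metrics reduce to standard set-overlap computations over predicted versus gold cause utterances; the sketch below (with our own function name and set-based input format) makes the definitions explicit.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 over predicted vs. gold cause utterances,
    where each element identifies one candidate (utterance, response) pair."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because negatives vastly outnumber positives, accuracy would be dominated by the negative class, which is why it is not reported.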
Table 11 reports the results of cause identification.

Related Work

Dialogue Datasets. Recent state-of-the-art open-domain dialogue agents have utilized DailyDialog (Li et al., 2017), PersonaChat (Zhang et al., 2018), EmpatheticDialogues (Rashkin et al., 2019), and Wizard of Wikipedia (Dinan et al., 2019). Dialogues in these datasets usually have 3-15 turns, so dialogue agents trained on them struggle to handle dialogues with very long histories. This weakness has encouraged researchers to crowdsource long conversations, such as Emotion Support Conversation (Liu et al., 2021) and Multi-Session Chat (Xu et al., 2022).
The number of utterances per dialogue in these two datasets is 30 and 53, respectively.
Dialogue Models. Recently, seq2seq dialogue models such as DialoGPT, Blenderbot, and PLATO (Zhang et al., 2020; Roller et al., 2021; Bao et al., 2020) have shown significant improvements in generating fluent and relevant responses on various dialogue datasets.

Conclusion
We conduct the first study from a causal perspective to investigate and tackle spurious correlations in dialogues. Inspired by constraint-based causal discovery algorithms, we propose a novel constrained self-training method to build a CI classifier using a small corpus, CGDIALOG, which we manually annotated with causal graphs. The CI classifier is applied to filter out spuriously correlated utterances in conversation histories before training a response generation model; it also serves as a scoring function during inference to select the best response from all generated candidates. By identifying conditional dependencies between utterances and responses, our model-agnostic approach significantly improves the overall generation quality of response models in terms of relevance, informativeness, and fluency.

Figure 2 :
Figure 2: In Fig. a, the response variable has two direct causes, which may be connected through z k (k > j) or directly connected, while the response variable in Fig. b has two disconnected cause variables. In Figs. c and d there is only one direct cause z t−1 linking to z t .
to adapt it to dialogues. After training for 10 epochs with a learning rate of 5 × 10−5, we fine-tune the model with our self-training procedure detailed below.

Incremental Self-training with Constraints. It is straightforward to collect a small training dataset D L from the training set of CGDIALOG by considering tuples (u j , u t−1 , r t ) annotated with g(u j ) → g(r t ) as positive examples and the remaining ones as negative examples. However, D L is small, having only 922 examples in total. To address this scarcity, we adapt the self-training procedure introduced in Zou et al. (2019) to train the CI classifier. It starts with training an initial classifier f 0 on D L in a supervised manner. Then we apply this classifier to unlabeled utterance tuples. The tuples predicted with label 1 are added to the training set as positive examples if they satisfy the threshold and context constraints:

Table 2 :
Statistics of the CGDIALOG.

Table 3
For each response r t , negative examples are collected by randomly sampling u j from the utterances that are not selected as positive examples. We keep the number of positive examples the same as the number of negative examples in each batch. The extended training set is used to fine-tune the classifier, and the process is repeated until the classifier achieves the highest performance on the validation set of CGDIALOG. More details can be found in Algorithm 1. Note that the main difference from the original self-training algorithm is that we add a positive example to the training set only if u j is either u t−2 or u t−3 ; this constraint proves empirically useful in our experiments.
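One pseudo-labelling step of this constrained procedure can be sketched as follows; the function name, the tuple layout, and the confidence threshold value are illustrative assumptions, not the exact choices in Algorithm 1.

```python
def constrained_self_training_step(classifier_prob, unlabeled, threshold=0.9):
    """One pseudo-labelling step of the ConSTrain procedure (sketch).

    `unlabeled` holds tuples (u_j, u_t_minus_1, r_t, offset), where
    `offset` is the distance of u_j from the response (2 means u_{t-2},
    3 means u_{t-3}). A tuple is added as a positive example only if the
    classifier is confident enough (threshold constraint) AND u_j is one
    of the two utterances preceding u_{t-1} (context constraint). The
    threshold value here is a placeholder.
    """
    new_positives = []
    for u_j, u_prev, r_t, offset in unlabeled:
        confident = classifier_prob(u_j, u_prev, r_t) >= threshold
        if confident and offset in (2, 3):   # context constraint: u_{t-2} or u_{t-3}
            new_positives.append((u_j, u_prev, r_t))
    return new_positives
```

The accepted tuples are then balanced against freshly sampled negatives and used to fine-tune the classifier, and the step is repeated until validation performance peaks.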

Table 7 :
The comparisons between inference methods. All models are fine-tuned from BLENDERBOT. * indicates a significant difference from our model. "Beam" indicates regularized beam search that employs a width of 10 with 3-gram blocking and a minimum length of 20.


Table 9 :
Response candidates generated by CONSTRAIN and one baseline P (r t |u 0:t−1 ) based on the conversation history in Table 1. We use u MaxCI,t−1 to select the final responses, which are in bold. After each response generated by a baseline, we append the pair-wise comparison results between the baseline and our model, annotated by five workers, in the form (empathy, fluency, informativeness, relevance). In a pair-wise comparison, if the baseline is better, it gets a +1 score; if the baseline is worse, it gets a -1 score; if the baseline is the same as our model, both get a 0 score. The sum of the five workers' evaluations is the score shown in this table.
Izacard and Grave (2021) and Qu et al. (2021), among others, propose retrieval-based dialog systems that select relevant utterances from the history as input. However, such methods select utterances based on semantic relevance, so they may still suffer from spurious correlations in the input. Whang et al. (2021), Niu and Bansal (2018), Lee and Choi (2022), and Akama et al. (2020) seek to first generate or retrieve response candidates and then select the final responses using a dialog-response binary classifier. Such binary classifiers are trained to identify relevance or irrelevance. However, relevance covers both causation and spurious correlation, which cannot be distinguished by those classifiers.