ConvoSense: Overcoming Monotonous Commonsense Inferences for Conversational AI

Mastering commonsense understanding and reasoning is essential for conducting engaging conversations. While there have been several attempts to create datasets that facilitate commonsense inferences in dialogue contexts, existing datasets tend to lack in-depth details, restate information already present in the conversation, and often fail to capture the multifaceted nature of commonsense reasoning. In response to these limitations, we use GPT to compile ConvoSense, a new synthetic dataset for commonsense reasoning in dialogue contexts that boasts greater contextual novelty, offers a higher volume of inferences per example, and substantially enriches the detail conveyed by the inferences. Our dataset contains over 500,000 inferences across 12,000 dialogues with 10 popular inference types, which empowers the training of generative commonsense models for dialogue that are superior in producing plausible inferences with high novelty when compared to models trained on the previous datasets. To the best of our knowledge, ConvoSense is the first of its kind to provide such a multitude of novel inferences at such a large scale.


Introduction
Effective dialogue is accomplished by a profound grasp of language and a thorough comprehension of the world. Such comprehension is crucial to the construction of responses that are pertinent, coherent, and captivating within an ongoing dialogue. A pivotal element of this worldview is commonsense: self-evident information that is universally acknowledged among humans (Clark and Brennan, 1991).
Over time, there has been a concerted endeavor to create datasets that facilitate commonsense reasoning. Early work, such as the widely recognized ConceptNet (Speer et al., 2017), focused predominantly on physical commonsense related to entities.
Lately, efforts have shifted toward building datasets encompassing social- and event-based commonsense, such as ATOMIC (Hwang et al., 2021). This new wave of datasets targets complex human concepts, including emotions, desires, and motivations.
As human conversations largely revolve around sharing personal experiences and life events (Fillwock and Traum, 2018; Mitsuda et al., 2019), it is critical for virtual agents to possess a robust understanding of human experiences to conduct effective dialogue. Datasets such as ATOMIC hold promise as they provide insights directly relevant to human experience; however, a drawback lies in their lack of contextual awareness, as they hinge on isolated, concise phrases for commonsense inferences. This limitation poses challenges for dialogue-oriented tasks because utterances should not be viewed in isolation but must be interpreted within their context (Pan et al., 2019; Jin et al., 2022).
Several initiatives have recently aimed to curate commonsense inferences tailored for dialogue contexts (Gao et al., 2022; Ghosal et al., 2022; Zhou et al., 2022a). However, a trade-off currently exists between the breadth of inference types covered and the scope of dialogue contexts encompassed within these existing datasets. While some datasets cover a wide range of relations, they are limited to a small number of dialogues (Gao et al., 2022), whereas others capture a large number of dialogues but on a limited set of relations (Ghosal et al., 2022).
In addition, a few challenges can be encountered in these datasets. For example, the inferences in these datasets are often too succinct and derive only straightforward conclusions with minimal elaboration (Gao et al., 2022), which do not convey implicit commonsense. Some studies instruct annotators to recycle information from the ongoing conversation, undermining the speculative nature of inferences and detracting from the potential of offering fresh insights to enhance dialogue understanding (Ghosal et al., 2022). Moreover, although multiple plausible inferences can be drawn from a single dialogue context, only a few datasets support this multifaceted nature (Shen et al., 2022), impeding the development of models capable of generating diverse inferences and thus limiting their utility in real applications.
We present ConvoSense, a commonsense dataset generated by GPT encompassing 10 popular inference types with over 500,000 inferences across 12,000 dialogues (§4). Our dataset shows greater contextual novelty and enhanced inference diversity and detail while maintaining exceptional reasonability compared to existing datasets (§3). We also explore several strategies to build generative models producing inferences for dialogue contexts (§5). Our experiments show that models trained on ConvoSense excel in generating plausible inferences with greater detail and novelty, compared to ones trained on existing datasets (§6). To the best of our knowledge, this is the first dialogue-based commonsense dataset that not only covers an extensive array of inference types at large scale but also provides a plethora of diverse, novel inferences tailored to each dialogue context. Our ConvoSense dataset and inference models can be accessed through our open-source project: https://github.com/emorynlp/ConvoSense.

Related Work
Recent works have focused on integrating commonsense into various tasks, including story generation and explanation (Guan et al., 2020; Gabriel et al., 2021), dialogue summarization and explanation (Ghosal et al., 2021; Zhou et al., 2021; Kim et al., 2022), and response generation (Li et al., 2022; Sabour et al., 2022; Zhou et al., 2022b). Many of these works rely on existing datasets, such as ConceptNet (Li et al., 2022; Zhou et al., 2022b) and ATOMIC (Sabour et al., 2022), which only contain single-word or short-phrase premises and conclusions. Although there are commonsense datasets curated for long dialogue contexts, they tend to be of small size (Zhou et al., 2022a), express simple inferences (Gao et al., 2022), or copy context from the provided utterances (Ghosal et al., 2022).
On the other hand, GPT has recently been used to create a variety of datasets. Kim et al. (2023) and Zhan et al. (2023) constructed dyadic dialogue datasets at large scale, while West et al. (2022) generated commonsense triples in the ATOMIC style (Hwang et al., 2021). However, the ATOMIC-style inferences are not necessarily suitable for dialogue, as they struggle to handle long contexts and often lack depth.
Table 1 summarizes the inference types in existing dialogue-focused commonsense datasets and mappings of synonymous types among them. In particular, the following 3 datasets are used for comparisons with our work:
ComFact Gao et al. (2022) mapped dialogue utterances to reasonable inferences from the existing ATOMIC2020 dataset (Hwang et al., 2021) by using exact string matching and embedding similarity. Subsequently, human annotators verified the relevance of the retrieved inferences.
Cicero Human participants were tasked with composing responses to five commonsense questions (e.g., What is the event that directly causes or could cause Target?) based on dialogue contexts and explicitly instructed to incorporate information from the preceding or forthcoming utterances.The first version produced a single inference for each example (Ghosal et al., 2022), whereas the second version produced multiple examples of both good and bad inferences (Shen et al., 2022).
Reflect Zhou et al. (2022a) supplied both human-generated commonsense inferences and following utterance responses that could be derived from a specified commonsense inference. The inferences were collected by instructing human participants to answer a commonsense question, while the next-utterance responses were composed by new human participants who were provided with the dialogue context and one of the human-generated inferences.

Evaluating GPT-generated Inferences
In order to support the development of a large-scale and high-coverage commonsense dataset for dialogue that improves upon existing works, we hypothesize that we can leverage large language models (LLMs) to accomplish this task in an efficient and low-cost manner. From initial pilot tests of both closed-source (GPT) and open-source LLMs (Vicuna and Llama), we find that GPT provides greater reliability in following specific instructions and produces commonsense inferences of overall better quality than the open-source LLMs. Consequently, we choose to rely on GPT in this work.

Prompt Engineering
Prior to crafting the full ConvoSense dataset, we empirically assess GPT's efficacy in generating reasonable and novel commonsense inferences for dialogue. To mitigate any unintended bias from in-context examples in the GPT prompt, we adopt a zero-shot generation framework. GPT prompts are refined iteratively to achieve the optimal outcomes. An example of the final prompt design, specifically tailored for the Desire inference type, is illustrated in Table 2.
During our development process, we observe that the inferences generated from GPT frequently contain detailed and rich information, thus addressing one of the major limitations of existing works. In addition, to encourage novel inferences from GPT, we include the instruction "Your answers should provide novel information that is not explicitly shared in the conversation." as seen in Table 2. We observe that this instruction helps in reducing the redundancy of the generated inferences to the information already explicitly shared in the dialogue context, thus addressing a second major limitation of existing works.

Table 2: Example of the final prompt design for the Desire inference type (Q and A slots and instructions).
Q  Question: What does Speaker want to do next?
A  Answer: As a result, Speaker wants ...
In a list titled "Answers", generate several likely answers to this question for the target expression, keeping the rest of the conversation in mind.
Your answers should provide novel information that is not explicitly shared in the conversation.

For the prompt, each inference type is paired with a guiding question and an answer prefix, which respectively fill the Inference Question (Q) and Inference Answer Template (A) slots in the prompt and ensure uniformity in the generated content for the specific type. For every dialogue context, a sequence of utterances in the context is placed in the Dialogue Context (C) slot, and its final turn gets duplicated in the Target Utterance (T) slot. Finally, the GPT output, commencing with the header Answers and adopting a list-like format with newline separation, is parsed to extract the generated inferences. Table 3 details the questions and answer prefixes employed for the fifteen identified inference types derived from the previous studies in Table 1.
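The slot filling and output parsing described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the layout of PROMPT_TEMPLATE (the "Conversation:" and "Target:" labels and the slot order) and the function names build_prompt and parse_answers are assumptions, and only the question, the answer prefix, and the two instruction sentences are taken from Table 2.

```python
import re

# Hypothetical template: the paper names the C, T, Q, and A slots and the two
# instruction sentences, but the exact layout below is an assumption.
PROMPT_TEMPLATE = (
    "Conversation:\n{context}\n\n"
    "Target: {target}\n\n"
    "Question: {question}\n"
    "Answer: {answer_prefix} ...\n\n"
    'In a list titled "Answers", generate several likely answers to this '
    "question for the target expression, keeping the rest of the conversation "
    "in mind.\n"
    "Your answers should provide novel information that is not explicitly "
    "shared in the conversation."
)


def build_prompt(turns, question, answer_prefix):
    """Fill the C, T, Q, and A slots; the final turn is duplicated as the target."""
    context = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    target = "{}: {}".format(*turns[-1])
    return PROMPT_TEMPLATE.format(
        context=context,
        target=target,
        question=question,
        answer_prefix=answer_prefix,
    )


def parse_answers(output):
    """Extract newline-separated list items that follow an 'Answers' header."""
    _, found, body = output.partition("Answers")
    if not found:
        return []
    inferences = []
    for line in body.splitlines():
        # Strip list markers such as "1.", "2)", "-", or "*".
        item = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if item and item != ":":
            inferences.append(item)
    return inferences
```

Calling parse_answers on a response such as "Answers:" followed by numbered lines returns the bare inference strings, which can then be stored per dialogue context and inference type.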

Evaluation
To evaluate the quality of GPT-generated commonsense inferences for dialogues, we compare their reasonability and novelty against inferences from human datasets. First, we sample examples uniformly across inference types from each existing dataset. For every sample, we then prompt GPT to produce relevant inferences and randomly select one from the generated list. Finally, two human annotators are presented with the dialogue context, inference question, and both the GPT- and human-generated inferences and asked to categorize them for reasonability and novelty. For this evaluation, we enlist native English speakers via the Surge AI crowdsourcing platform (www.surgehq.ai) by paying them at a rate of $0.15 per sample with an estimated completion time of 45 seconds.
Reasonability Most prior commonsense datasets assess their inferences based on human-judged reasonability (Hwang et al., 2021; Ghosal et al., 2022; Shen et al., 2022; Zhou et al., 2022a). An inference is deemed reasonable if it makes sense in, is relevant to, and is consistent with the provided dialogue context. We follow Hwang et al. (2021), in which annotators categorize inferences by truth likelihood: always/likely, sometimes/possible, never/farfetched, or invalid/nonsense.
Novelty A key trait of commonsense for dialogue is its role in enhancing dialogue comprehension by providing relevant contextual information.While Ghosal et al. (2022) gauge creativity in human responses, creativity is not strictly focused on inference novelty.In our study, annotators evaluate the extent to which an inference contributes fresh information to the conversation, categorized as: new & detailed, new & simple, and purely repetitive.
Since we aim to elicit the natural commonsense understanding learned by each annotator through their life experience in our annotation tasks, we do not provide any training or explicit examples of what constitutes a "reasonable" or "novel" commonsense inference, to avoid artificially polluting their commonsense understanding of the world. Instead, we provide a description of the task with definitions of the different categories. Our instructions are intended to mitigate bias towards trivial inference properties by providing clear definitions of the characteristics under study and emphasizing important aspects to keep in mind, such as ignoring grammar errors unless they make an inference nonsensical. Furthermore, decomposing inference quality into two characteristics allows for their independent evaluation. We verified through pilots that this approach resulted in reliable and reasonable annotations from our annotators for both tasks.

[Figure 1: Example inferences from GPT and humans. Dialogue: Speaker: "We're all went out for a nice picnic lunch earlier." Listener: "Where did you go?" Speaker: "To the park, the place by the lake." Inferences: 1. the speaker is old fashioned. 2. the speaker is outdoorsy.]

Considering the reported quality of the existing datasets and our preliminary assessments of GPT-generated inferences, we expect much higher rates of positive classes than negative ones, resulting in a class imbalance. To overcome the vulnerability to prevalence skew exhibited by other agreement metrics like Cohen's kappa (Jeni et al., 2013; Wongpakaran et al., 2013; Quarfoot and Levine, 2016), we choose Gwet's AC1 inter-annotator agreement metric (Gwet, 2002). Our annotators obtain AC1 values of 0.8 and 0.6 for reasonability and novelty, respectively, implying substantial agreement.
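To make the choice of agreement metric concrete, the following is a small sketch of Gwet's AC1 for two annotators and nominal categories. It uses the standard two-rater form, AC1 = (p_a - p_e) / (1 - p_e), with chance agreement p_e = (1 / (q - 1)) * sum_k pi_k (1 - pi_k), where pi_k is the average share of category k across both annotators and q is the number of categories. The function name gwet_ac1 and the optional num_categories argument are illustrative choices, not part of the original study.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, num_categories=None):
    """Gwet's AC1 for two raters who each assign one nominal category per item."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(ratings_a)
    observed = set(ratings_a) | set(ratings_b)
    # q is the size of the rating scale; by default, the categories actually used.
    q = num_categories if num_categories is not None else len(observed)
    if q < 2:
        return 1.0  # a single category makes agreement trivial
    # Observed agreement: share of items labeled identically by both raters.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # pi_k: average share of category k across the two raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    shares = [counts[k] / (2 * n) for k in observed]
    # Chance agreement under AC1 (unused categories contribute zero).
    p_e = sum(pi * (1 - pi) for pi in shares) / (q - 1)
    return (p_a - p_e) / (1 - p_e)
```

With heavily skewed labels, for example two annotators who each mark 9 of 10 items positive and agree on 8 items, this yields an AC1 of about 0.76, whereas Cohen's kappa for the same labels is slightly negative, which illustrates the prevalence sensitivity discussed above.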

Table 4 demonstrates that GPT attains reasonability comparable to that of human-derived inferences, even exceeding the reasonability of the inferences in ComFact with statistical significance. Notably, the results also indicate that GPT surpasses the novelty of the human-generated inferences for the majority of the existing datasets. Furthermore, GPT outputs achieve higher detail than that observed from human-generated inferences. Figure 2 shows the percentage of new & detailed inferences out of all positive novelty inferences for each data source, clearly demonstrating the superiority of GPT inferences in terms of their expressed detail. Example inferences from GPT and humans are shown in Figure 1.

ConvoSense Dataset
Given our assessment of high-quality, novel, and detailed GPT-generated commonsense inferences across various dialogue contexts and inference types (Section 3), we construct a substantial conversational commonsense dataset using GPT, termed ConvoSense.

HumanGen: Human-generated Datasets
For fair comparisons to our work, we combine the three human-generated datasets (Section 2) into a single dataset, termed HumanGen (see footnote 4). Specifically, their train/validation/test sets are integrated independently. For ComFact and Cicero, this integration follows the provided splits, while for Reflect, data is sampled following an 80/10/10 distribution. To standardize HumanGen into a cohesive format, we perform the following preprocessing steps. First, we leverage the mapping outlined in Table 1 along with the specifications from Table 3 to identify relevant commonsense inference questions for each instance. Then, we combine consecutive utterances from the same speaker to ensure every dialogue turn represents a distinct speaker. Lastly, we apply Speaker and Listener tags in a similar manner to ConvoSense (Figure 3). Since human-generated inferences often contain nominal references to specific target entities, we additionally incorporate the names of conversational participants into the tags, as exemplified by "Speaker (A)".
The naming conventions vary across the different human-generated datasets. To maintain uniformity, we adopt the naming conventions used in Cicero for both ComFact and Reflect, as Cicero constitutes nearly 90% of HumanGen. In Cicero, participants are denoted as A and B. For ComFact, originally lacking speaker designations, we randomly assign A/B tags to each conversation. On the other hand, Reflect includes original speaker names; thus, we replace them with A/B tags accordingly. Since the speaker name frequently appears in Reflect's inferences, we uniformly replace it with "the speaker", aligning with the prevalent format in Cicero.
Footnote 4: Many commonsense types have a sparsity of training data when the human-generated datasets are viewed in isolation, which would impede the training of a neural model to adequately capture the commonsense type.
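The turn-merging step in the HumanGen preprocessing can be illustrated with a short sketch. The function name merge_consecutive_turns and the choice to join merged utterances with a single space are assumptions made for illustration; the source only states that consecutive utterances from the same speaker are combined.

```python
def merge_consecutive_turns(utterances):
    """Merge adjacent utterances from the same speaker into one dialogue turn.

    utterances: list of (speaker, text) pairs in conversation order.
    """
    merged = []
    for speaker, text in utterances:
        if merged and merged[-1][0] == speaker:
            # Same speaker as the previous turn: append the text (space-joined,
            # which is an assumption about the joining convention).
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    return merged
```

After this step every adjacent pair of turns has different speakers, so the alternating Speaker and Listener tags can be applied without ambiguity.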

ConvoSense: New GPT-generated Dataset
Constructing a practical dataset of commonsense inferences for dialogue benefits from covering a wide variety of dialogue situations.To this end, our construction process of ConvoSense first carefully selects the dialogues to include based on their topical diversity, trims the dialogue contexts to optimize utterance diversity, and finally generates the inferences for each context.

Dialogue Selection
We choose to sample the dialogues for ConvoSense as a subset of the dialogues in the high-quality and large-scale SODA dataset. SODA contains over a million dyadic dialogues generated by GPT covering situations based on ATOMIC commonsense tuples (Kim et al., 2023). For cost practicality, ConvoSense is constructed to contain 10,000 training dialogues, 1,000 validation dialogues, and 1,000 test dialogues.
To encourage diversity in ConvoSense, we employ BERTopic (Grootendorst, 2022), which clusters the dialogues selected from SODA into groups using the dimension reduction technique UMAP (McInnes et al., 2020) and the HDBSCAN clustering algorithm (McInnes et al., 2017) on the BERT embeddings of the dialogues. We configure the hyperparameters to effectively group dialogues while maintaining a well-balanced distribution of group sizes based on manual verifications. As a result, we obtain 100K dialogue groups, where each group consists of 6.3 dialogues on average. These groupings represent 100K unique dialogue topics, thus enabling the construction of ConvoSense to span a variety of topics by sampling dialogues from a subset of these groupings.
Next, we randomly select one dialogue from each of n groupings, where each selected dialogue contains at least 5 utterances and has a BERTopic score of at least 0.95 to its group. To maintain distinct dialogue scenarios in each split, each grouping can only be selected for one split. Through this procedure, we set n values as [10K, 1K, 1K] for assembling the training, validation, and test splits, respectively.
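The sampling procedure can be sketched as below, assuming the topic groups are already available. The data layout (a dict from group id to a list of dialogue dicts with "utterances" and "score" keys), the function name sample_splits, and the fixed seed are hypothetical; only the thresholds (at least 5 utterances, score of at least 0.95) and the rule that each group serves one split come from the text.

```python
import random


def sample_splits(groups, split_sizes, min_utterances=5, min_score=0.95, seed=0):
    """Pick one eligible dialogue per topic group, using each group in one split only.

    groups: dict mapping group id -> list of dialogue dicts that each hold
            "utterances" (list) and "score" (topic score to the group).
    split_sizes: dict such as {"train": 10000, "validation": 1000, "test": 1000}.
    """
    rng = random.Random(seed)
    # Keep only groups that contain at least one dialogue passing both filters.
    eligible = {}
    for gid, dialogues in groups.items():
        keep = [
            d for d in dialogues
            if len(d["utterances"]) >= min_utterances and d["score"] >= min_score
        ]
        if keep:
            eligible[gid] = keep
    group_ids = sorted(eligible)
    rng.shuffle(group_ids)
    if sum(split_sizes.values()) > len(group_ids):
        raise ValueError("not enough eligible groups for the requested splits")
    splits, start = {}, 0
    for name, size in split_sizes.items():
        # Consecutive, non-overlapping slices keep the groups disjoint across splits.
        chosen = group_ids[start:start + size]
        splits[name] = [rng.choice(eligible[gid]) for gid in chosen]
        start += size
    return splits
```

Slicing a single shuffled list of group ids is one simple way to guarantee that no topic group contributes to more than one split.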
Utterance Selection For each selected dialogue, we determine which utterance to perform inference generation on. We use the topic keywords identified for each group during the BERTopic grouping to pinpoint the most topically salient utterance in each dialogue and ensure that the diversity afforded by the grouping is maintained. This is achieved by selecting the utterance whose embedding yields the highest cosine similarity with the embedding of the four-word topic string assigned to the dialogue's respective group by BERTopic. Subsequently, we trim the dialogue's utterances such that the conversation ends at this selected utterance. This trimmed version becomes the final dialogue context used for commonsense inference generation, where the inferences are derived for the last utterance.
[Figure 3: Illustrative examples from each dataset. Inferences shown: 1. to ask the listener if she knows any shortcuts or tricks to find the perimeter quickly. 2. to learn the different types of shapes and their respective perimeters to improve her math skills. 3. to know the formula for calculating the perimeter so that she can apply it to the given shape. 4. to explore practical applications of finding perimeters in daily life, such as measuring the perimeter of her backyard. 5. to document the process of finding the perimeter step by step so that she can later revise it as a reference guide.]
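The selection of the most topically salient utterance can be sketched as follows, assuming the utterance embeddings and the topic-string embedding have already been computed by some sentence encoder (the encoder itself is outside this sketch). The function names cosine and trim_to_salient_utterance are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def trim_to_salient_utterance(utterances, utterance_embeddings, topic_embedding):
    """Trim the dialogue so it ends at the utterance closest to the topic string."""
    scores = [cosine(e, topic_embedding) for e in utterance_embeddings]
    # Index of the utterance with the highest similarity to the topic embedding.
    best = max(range(len(scores)), key=scores.__getitem__)
    return utterances[: best + 1]
```

The returned list is the dialogue context; its final element is the target utterance for which inferences are generated.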
Because commonsense inferences often relate to a central figure in a conversation, either the speaker or the listener, we introduce nominal tags for the two participants.The terminal utterance is labeled as Speaker, and its preceding utterance is labeled as Listener.These nominal tags are then assigned in alternating order to the remainder.
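The alternating tag assignment can be written compactly by counting back from the final utterance. This is a sketch under the assumption that turns strictly alternate between two participants; the function name assign_role_tags and the optional names argument (which mirrors the "Speaker (A)" form used for HumanGen) are illustrative.

```python
def assign_role_tags(utterances, names=None):
    """Tag the last utterance as Speaker and alternate backwards from there.

    utterances: list of utterance strings in conversation order.
    names: optional list of participant names, one per utterance, appended
           to the tag in the form "Speaker (A)".
    """
    n = len(utterances)
    tagged = []
    for i, text in enumerate(utterances):
        # An even distance from the final utterance means the same participant.
        role = "Speaker" if (n - 1 - i) % 2 == 0 else "Listener"
        if names is not None:
            role = f"{role} ({names[i]})"
        tagged.append((role, text))
    return tagged
```

Because the tags are anchored at the end of the dialogue, the participant who opens the conversation may be either Speaker or Listener depending on the number of turns.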
Inference Types For each preprocessed dialogue, GPT generates inferences for all included commonsense types following the procedure in Section 3. Specifically, ten commonsense types are included: Subsequent, Cause, Prerequisite, Motivation, Attribute, Reaction, Reaction_o, Desire, Desire_o, and Constituents (highlighted in Table 3). These types are selected based on their usage frequency in existing datasets and their lack of semantic overlap.

Data Statistics Table 5 presents data statistics for ConvoSense and HumanGen. ConvoSense significantly surpasses HumanGen in data volume, particularly regarding instances with polymorphic outputs, where multiple inferences can be derived per instance. Moreover, ConvoSense boasts greater vocabulary diversity and reduced redundancy among inferences. Illustrative examples from each dataset are shown in Figure 3.

Data Quality
The results in Section 3.3 demonstrate that GPT is generally capable of producing high-quality commonsense inferences regardless of the underlying dialogue source. Consequently, applying GPT to generate commonsense inferences for the SODA dialogues is expected to yield similarly high quality. To explicitly verify this, we conduct an evaluation of the ConvoSense dataset. An external conversational AI expert, unaffiliated with this study, evaluates the generated inferences for 100 ConvoSense examples (508 total inferences; average 5.08 inferences per example), with all ten inference types uniformly represented across examples. The human judge completes two evaluation tasks: grading reasonability and novelty of an inference (Sec. 3.2) and performing inference clustering to measure per-example output diversity (Sec. 6.2).
[Table 6 caption, partially recovered: ... data, including the % of total inferences judged to be reasonable and novel, the % of positive novelty inferences judged to be detailed (vs. simple), and the average number of unique inference clusters per example, with the average % of unique inferences per example in parentheses.]
Error Analysis We next perform an error analysis on the unreasonable inferences identified by the human judge. We observe that most unreasonable inferences are explained by being too niche to be likely given only the provided information in the dialogue context (26%; Desire examples #4-5 in Figure 3), or by their attribution to the wrong conversational participant (26%; Desire_o examples #4-5 in Figure 3). Relatively speaking, only a small percentage of unreasonable inferences are explained by a violation of common knowledge of human experiences (10%), a lack of relevance to the dialogue context (10%), or a contradiction of the dialogue context (7%). This suggests that ConvoSense inferences are predominantly accurate representations of commonsense understanding, although they can suffer from a lack of precision regarding situational nuances and speaker roles.

Training and Decoding Strategies
With the rich and diverse multi-inference examples provided in ConvoSense, we are well-positioned for training commonsense generation models that produce versatile outputs. Yet, a key question remains: how can we instill this versatility in the model?
A common method of enhancing diversity in generative outputs is to modify the decoding strategy (Gimpel et al., 2013; Vijayakumar et al., 2018; Ippolito et al., 2019). Through preliminary testing, we observe that diverse beam search decoding with Hamming distance reward following Vijayakumar et al. (2018) improves the output diversity with less impact on accuracy compared to other methods.
On the other hand, Cao and Wan (2020) propose modifying the model architecture by introducing latent variables to guide output variety.However, these approaches only approximate learning varied responses by relying on conditioning on random latent variables.In contrast, ConvoSense provides direct access to numerous inferences per input, enabling direct training of generative models that produce multiple inferences per example, with the set of inferences treated as target outputs during training.Therefore, we explore the performance of three strategies for diverse generation of commonsense inferences.

Monomorphic Beam Search (M)
This model receives as input a dialogue context C consisting of the previous six utterances delimited by their corresponding speaker tags, the current response r for which to generate inferences, and a commonsense question q pertaining to one of the ten inference types (Table 3) in the following format:

Polymorphic (P)
Using the same input as the M model, this model is trained to output a series of inferences as a sequence. To do this, the ground-truth inferences for each training example are concatenated into a list-like sequence, delimited by semicolons and prefixed by an integer representing their position in the list, as follows: (1) i_1; (2) i_2; (3) i_3; ... The order of the answers in the list is shuffled between each training epoch. During inference, standard beam search decoding is used to generate the top-1 output. A single output from this model is intended to represent the set of multiple diverse inferences for the input, without the need for any post-hoc decoding strategies, which other studies have observed to negatively impact the accuracy of the output generations (Ippolito et al., 2019).
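The list-like target sequence for the polymorphic strategy can be illustrated with a short sketch. The exact whitespace around the delimiters, the function names build_polymorphic_target and parse_polymorphic_output, and the regular expression used to split a generated sequence back into inferences are assumptions; the numbering and semicolon delimiters follow the description above.

```python
import random
import re


def build_polymorphic_target(inferences, shuffle=False, seed=None):
    """Join inferences as "(1) i_1; (2) i_2; ..."; optionally shuffle the order first."""
    items = list(inferences)
    if shuffle:
        # Re-shuffling per epoch keeps the model from memorizing one fixed order.
        random.Random(seed).shuffle(items)
    return "; ".join(f"({i}) {text}" for i, text in enumerate(items, start=1))


def parse_polymorphic_output(sequence, top_k=None):
    """Split a generated sequence on its "(n)" markers and return the first top_k items."""
    parts = re.split(r"\s*;?\s*\(\d+\)\s*", sequence)
    items = [p.strip().rstrip(";").strip() for p in parts if p.strip()]
    return items if top_k is None else items[:top_k]
```

Taking the first one or five parsed items is one straightforward way to obtain single-inference or multi-inference outputs from a single generated sequence. Note that this simple parser would mis-split an inference that itself contains a parenthesized number.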

Model Configuration
We develop six generative models: ConvoSenseM, ConvoSenseM*, ConvoSenseP, HumanGenM, HumanGenM*, and HumanGenP. Each model name denotes the training dataset, with the terminal letter indicating the model strategy. All of them use T5-3b (Raffel et al., 2020) as the base model, which is then finetuned on the corresponding dataset following the indicated model strategy. The ConvoSense* and HumanGen* models are finetuned for 5 or 10 epochs, respectively. The best-performing models and hyperparameters (see footnote 7) are selected through grid search based on their results on the validation sets.
For all models, decoding is performed with 10 beams.For ConvoSenseM* and HumanGenM*, the number of beam groups is 10, and the diversity penalty is 0.5 and 1.0, respectively.For P models, decoding also uses a repetition penalty of 5.0 to reduce output token repetition.
It is worth noting that only 16% of HumanGen examples feature multiple ground-truth inferences. Training a P model on the complete dataset yields a single-inference model, which defeats the purpose of the polymorphic model strategy. Instead, we develop the HumanGenP model exclusively on multi-inference instances to facilitate learning of polymorphic outputs.
Footnote 7: The Adafactor optimizer is used with a weight decay of 5e-3 and a learning rate of 5e-6, except for ConvoSenseP with 1e-6. The max source length is set to 768. The max target length is set to 400 for P models and 128 for other models. All models are trained using bf16 for memory efficiency. P models use a prefix of "provide several reasonable answers to the question based on the dialogue:\n" and other models use a prefix of "provide a reasonable answer to the question based on the dialogue:\n".

Generative Model Evaluation
We evaluate the six generative models (Section 5.2) on the ten commonsense inference types (Table 3) that exist in both the HumanGen (Section 4.1) and ConvoSense (Section 4.2) datasets.The model performance is evaluated using automatic reference metrics (Section 6.1), automatic diversity metrics (Section 6.2), and human evaluations of reasonability and novelty (Section 6.3).

Automatic Reference Metrics
Conventional evaluations of generative models against ground-truth references often overlook the diverse nature of the outputs. They typically assess individual model outputs against a single reference, focusing on best-case performance due to dataset constraints. However, such assessments are inadequate for our multi-inference dialogue generation objective. To address this, we structure our automated evaluation method to account for the concept of output diversity. This method, referred to as PolyAgg, serves as an aggregation function compatible with standard evaluation metrics. Its purpose is to gauge the model's capacity to encompass the complete set of ground-truth references in its generated outputs.
Algorithm 1 demonstrates the PolyAgg aggregation function. It computes a score matrix for each example, where rows represent model outputs and columns represent ground-truth references, and finds the maximal assignment of rows to columns following the linear sum assignment problem (Burkard and Cela, 1999), which seeks to find the optimal bijective mapping between rows and columns in a cost matrix. By mandating a one-to-one mapping from model outputs to references, we can accurately measure reference set coverage and prevent models that generate mere surface-level variations from scoring highly on datasets with [...]
[Algorithm 1 and Equation 2: PolyAgg aggregation function]
We use this evaluation scheme with three automatic metrics to measure the performance of the models.
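The one-to-one assignment at the core of PolyAgg can be sketched as below. This is not the authors' Algorithm 1: it brute-forces the maximal assignment with itertools.permutations, which is only practical for small matrices such as five outputs against a handful of references (for larger inputs, scipy.optimize.linear_sum_assignment with maximize=True solves the same problem efficiently). The function names best_assignment and poly_agg are hypothetical, and the final normalization (dividing the matched total by the number of matched pairs unless a normalizer is given) is an assumption, since the exact aggregation formula is not reproduced here.

```python
from itertools import permutations


def best_assignment(scores):
    """Find the one-to-one output/reference matching with the highest total score.

    scores[i][j] is the metric score of model output i against reference j.
    Returns (total, pairs) where pairs is a list of (output_index, reference_index).
    """
    n_out, n_ref = len(scores), len(scores[0])
    best_total, best_pairs = float("-inf"), []
    if n_out <= n_ref:
        # Every output is matched to a distinct reference.
        for refs in permutations(range(n_ref), n_out):
            total = sum(scores[i][j] for i, j in enumerate(refs))
            if total > best_total:
                best_total, best_pairs = total, list(enumerate(refs))
    else:
        # Every reference is matched to a distinct output.
        for outs in permutations(range(n_out), n_ref):
            total = sum(scores[i][j] for j, i in enumerate(outs))
            if total > best_total:
                best_total, best_pairs = total, [(i, j) for j, i in enumerate(outs)]
    return best_total, best_pairs


def poly_agg(scores, normalizer=None):
    """Aggregate matched scores; the default normalization is an assumption."""
    total, pairs = best_assignment(scores)
    return total / (normalizer or len(pairs))
```

For two near-duplicate outputs scored [[0.9, 0.2], [0.9, 0.2]] against two references, the matching forces one output onto the second reference, so the aggregate is 0.55 rather than the 0.9 that a per-output maximum would report.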
Results We evaluate each model in terms of both its best-case performance (Top-1 output) and its multi-inference performance (Top-5 outputs).In the Top-1 setting, the maximum score achieved by the top-1 output against all of the ground-truth references for an example is taken and averaged across the test data.In the Top-5 setting, the top-5 outputs from the models are taken and scores are calculated using Equation 2, before being averaged across the test data.For M(*) models, the top one or five beams are taken as the outputs for each setting.
For P models, the first one or five inferences in the outputted sequence are taken as the outputs for each setting. The results are shown in Table 7 for each model on the HumanGen and ConvoSense test splits, respectively. Overall, it is evident that using diversity-promoting decoding (M*) outperforms the direct generation of multiple inferences (P). This approach achieves the highest BLEU, BERTScore, and sentence similarity scores in the Top-5 assessment setting. This trend is particularly pronounced in the case of the ConvoSense-trained model, holding true for both the ConvoSense and HumanGen test splits. Enhancing training inference diversity as seen in ConvoSense appears to support the adoption of diversity-focused decoding strategies, yielding more contextually relevant outputs aligned with ground-truth references, even when applied to test examples from different datasets.
In the Top-1 setting, monomorphic models with standard beam search demonstrate superior performance for both HumanGen- and ConvoSense-trained models. However, the difference compared to diverse beam search is relatively minor, particularly when considering embedding-based metrics. Interestingly, the HumanGenP model displays the strongest ability to generalize to the ConvoSense test split among all HumanGen-trained models in the Top-1 scenario. Upon manual comparison of HumanGenP outputs against other HumanGen-trained models, we observe that HumanGenP is more inclined to specify a focal person in the inference (e.g., "the speaker/listener"). This often aligns better with ConvoSense references, although in a superficial manner with little impact on the underlying semantics.
It is also observed that the models produce low scores when evaluated against the test examples that are out-of-distribution with respect to their training data.This may not reflect the true underlying reasonability of the generated inferences, but rather a difference in inference content between the datasets, which is supported by evidence in Section 3.3 showing that human-written generations are more often repetitive with the dialogue context than GPT generations.To obtain a direct measure of the quality of the generated model inferences, we perform a human evaluation in Section 6.3.

Automatic Diversity Metrics
To assess the ability of each model in generating diverse inferences for a given dialogue context, we employ a clustering approach under the Top-5 evaluation scheme. This involves grouping the model generations for each example into clusters of inferences with similar meanings. The average number of inference clusters across examples serves as a measure of output diversity.
For each of the ten inference types, we draw 50 examples from the test splits of ConvoSense and HumanGen, except for the Constituents type in HumanGen due to its smaller test split (22 examples). We instruct GPT4 (gpt-4-0613 with a temperature setting of 0) to create groups of semantically similar inferences given a dialogue context, question, and a list of inferences. GPT4 demonstrates its proficiency by achieving an average B-cubed F1-score (Bagga and Baldwin, 1998) of 0.872 against clusters identified by one of the authors for 20 examples, where B-cubed is a common clustering evaluation metric that measures the precision and recall of each element's neighbors within the same cluster. This outperforms Amazon Mechanical Turk crowdworkers, who only achieved a score of 0.581.
Results. Table 8 displays diversity outcomes per model. For both HumanGen- and ConvoSense-trained models, the monomorphic model with diverse beam search generates the most unique outputs. While ConvoSenseM* slightly outperforms HumanGenM* in terms of inference diversity, both models exhibit similar unique inference cluster counts. Compared to the ConvoSense inferences themselves (Table 6), it is clear that none of the trained models are able to replicate the high inference diversity. Nonetheless, there is a large discrepancy in inference detail, which is revealed through human assessments in the next section.
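To make the B-cubed agreement measure concrete, the following is a minimal Python sketch of B-cubed precision, recall, and F1 for a single example. The function name `bcubed_f1` and the dictionary-based input format are assumptions made for illustration, and the sketch does not claim to reproduce the exact averaging used to obtain the reported scores.

```python
from collections import defaultdict


def bcubed_f1(predicted, gold):
    """Compute B-cubed precision, recall, and F1 for one clustering.

    predicted, gold: dicts mapping each inference id to a cluster label.
    Both dicts are assumed to cover the same set of inference ids.
    """
    def cluster_members(assignment):
        # Map every item to the full set of items sharing its cluster.
        groups = defaultdict(set)
        for item, label in assignment.items():
            groups[label].add(item)
        return {item: groups[label] for item, label in assignment.items()}

    pred_members = cluster_members(predicted)
    gold_members = cluster_members(gold)

    precision = 0.0
    recall = 0.0
    for item in gold:
        # Items that share a cluster with `item` in both clusterings.
        overlap = len(pred_members[item] & gold_members[item])
        precision += overlap / len(pred_members[item])
        recall += overlap / len(gold_members[item])

    precision /= len(gold)
    recall /= len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical clusterings of four inferences for one dialogue context.
gold = {1: "A", 2: "A", 3: "B", 4: "B"}
predicted = {1: "x", 2: "x", 3: "x", 4: "y"}
precision, recall, f1 = bcubed_f1(predicted, gold)
```

Because every item is counted as a member of its own cluster, the overlap is never zero, so the F1 denominator is always positive.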

Human Evaluations
We also evaluate the models through human assessment, in both the Top-1 and Top-5 settings.
Table 9: Percentage of reasonable (R) and novel (N) inferences from each model. Underline denotes a statistically significant result against both HumanGen models (chi-square proportions test, α = 0.05). The average number of inference clusters is also shown, along with the average % of unique inferences per example in parentheses (Clusters).
Results. Table 9 demonstrates ConvoSenseM*'s superior performance compared to the HumanGen models. ConvoSenseM* achieves a remarkable 93% reasonability and 98% novelty, averaging 3.4 unique inferences per example. Indeed, similar results hold even when considering the Top-1 output per model, showing that ConvoSenseM* exhibits strong performance regardless of whether a single-best inference or a diverse set of inferences is desired. Moreover, when considering the positive novelty inferences in the Top-5 setting, we observe that 75% are annotated as detailed for ConvoSenseM* whereas only 7% are indicated as such for HumanGenM*. This reveals a substantial improvement in the amount of detail present in the inferences produced by ConvoSense models as compared to HumanGen models, which results in richer information being provided by the model.
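For readers unfamiliar with the chi-square proportions test used for the significance markers in Table 9, the sketch below compares two proportions with a Pearson chi-square statistic on a 2x2 table. It assumes no continuity correction, and the counts in the usage lines are hypothetical placeholders rather than the study's annotation counts.

```python
import math


def chi_square_proportions(success_a, n_a, success_b, n_b):
    """Pearson chi-square test (1 degree of freedom) for two proportions.

    Assumes no continuity correction and that both outcomes occur at
    least once overall, so that no expected count is zero.
    """
    observed = [
        [success_a, n_a - success_a],
        [success_b, n_b - success_b],
    ]
    row_totals = [n_a, n_b]
    col_totals = [success_a + success_b, (n_a + n_b) - (success_a + success_b)]
    total = n_a + n_b

    statistic = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / total
            statistic += (observed[r][c] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: 90 of 100 positive labels versus 70 of 100.
statistic, p_value = chi_square_proportions(90, 100, 70, 100)
significant = p_value < 0.05
```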

Limitations and Ethical Considerations
This work does not intend to present an exhaustive set of commonsense inferences for dialogue. While we adhere to established inference types relevant to dialogue from existing literature, there could be overlooked types or unique challenges within specific dialogue domains that remain to be explored. Furthermore, it is important to recognize that some social commonsense inference types may be associated with stereotypes and biases. When employing a model that produces commonsense inferences in a setting that impacts human users, caution must be exercised to prevent unjust or prejudiced decisions. Although exploration of the prevalence of harmful biases is out of the scope of the current work, we welcome future investigations into quantifying these aspects of our resources.
Finally, we adhered to OpenAI's terms of service and related policies when utilizing GPT, and we acknowledge that any subsequent utilization of our models and data should refer to these policies.

Future Work
Although ConvoSense is composed of diverse multi-inference dialogue data (Table 6), it is clear from our experiments (Tables 8 and 9) that our trained models do not quite achieve the same degree of inference diversity. Further work is needed on improving the ability of distilled models to better capture the diversity present in the data.
In addition, the integration of commonsense understanding into dialogue applications has shown promising results in improving performance on tasks such as response generation, summarization, and reading comprehension in previous works. In light of this, our work on improving commonsense resources and models presents an opportunity for further advancements in these dialogue applications. In particular, future work exploring how to capitalize on our commonsense model for dialogue response generation is highly compelling, since commonsense errors are one of the most common issues for modern dialogue agents (Finch et al., 2023). However, previous works have revealed that naive integration of commonsense inferences into neural models does not necessarily produce improvements (Zhou et al., 2022a). As a result, we leave the integration of our commonsense model to future work to allow for thorough investigation of its impact on response generation, covering aspects such as the impact of different commonsense inference types, the filtering of relevant inference types per dialogue context, and the effect of synthesizing multiple inferences into dialogue responses.

Conclusion
In this work, we present ConvoSense, an automatically constructed dataset of multi-output commonsense inferences for dialogue. ConvoSense surpasses existing datasets in size, advances inference detail and novelty, and attains comparable (if not superior) reasonability when compared to existing datasets. Our investigation into various techniques for generating multiple inferences reveals that diverse beam search on single-output generative models yields the best outcomes. By publicly releasing our trained models, we enable other works to benefit from the remarkable improvements in commonsense reasonability and novelty achieved by this work.
Speaker: I just finished cleaning up my kitchen and getting the trash out.
Listener: I don't envy you. I hate cleaning.
Speaker: I'm the other way. I love cleaning, and then seeing my nice clean kitchen afterwards.
Target: I'm the other way. I love cleaning, and then seeing my nice clean kitchen afterwards.

Figure 2: Average % of new & detailed inferences out of all positive novelty inferences for each data source.
It is trained to output a single inference i. During training, instances with multiple correct inferences I generate several training examples, one for each target inference i ∈ I. During inference, standard beam search decoding is used to generate k outputs.
Monomorphic Diverse Beam Search (M*). This model adheres to the same design as the M model, except that during inference it uses Hamming-distance diverse beam search decoding to generate k outputs, following Vijayakumar et al. (2018).
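The training-time expansion described above, in which an instance with a set of correct inferences I yields one training example per target inference i ∈ I, can be sketched as follows. The field names `context`, `question`, and `inferences` are hypothetical and are not taken from the released data format.

```python
def expand_to_single_targets(instances):
    """Turn multi-inference instances into single-target training examples.

    Each instance is assumed to be a dict with hypothetical keys:
    "context" (dialogue text), "question" (inference-type prompt), and
    "inferences" (list of correct inferences for that context).
    """
    examples = []
    for instance in instances:
        for inference in instance["inferences"]:
            # One (input, target) pair per correct inference.
            examples.append(
                {
                    "context": instance["context"],
                    "question": instance["question"],
                    "target": inference,
                }
            )
    return examples


# Hypothetical instance with two correct inferences.
data = [
    {
        "context": "Speaker: I love cleaning.",
        "question": "What is a likely characteristic of Speaker?",
        "inferences": ["Speaker is tidy.", "Speaker enjoys order."],
    }
]
training_examples = expand_to_single_targets(data)
```

Under this scheme the M and M* models share the same training examples and differ only in the decoding strategy applied at inference time.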

Table 3: Question and answer prefixes used for generating each inference type from GPT for dialogue contexts. The ten inference types used in our work are represented in gray shading.
Figure 1: Cause and Attribute inferences written by humans (top, green) and generated by GPT (bottom, blue).
invalid/nonsense] are considered negative reasonability. Similarly, [new & detailed, new & simple] are designated as positive, and [purely repetitive] is classified as negative novelty. This setup, with 300+ annotated samples per dataset, allows us to detect differences of at least 10% between GPT- and human-generated datasets using McNemar's binary matched-pairs test at 80% power and a significance level of 0.05, assuming discordance probabilities of 0.24 or lower (compatible with pilots). In cases of annotator disagreement, one of the annotators' decisions is randomly selected. To mitigate the potential noise introduced by this random selection, we repeat the process 100 times and report the average result, only confirming statistical significance when every selection yields a significant result.
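The repeated random-selection procedure described above can be sketched as follows. This is an illustration under stated assumptions: it uses the exact (binomial) form of McNemar's test, which may differ from the variant used in the study, and the function names and input layout are hypothetical.

```python
import math
import random


def mcnemar_exact_p(only_first, only_second):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_first + only_second
    if n == 0:
        return 1.0
    smaller = min(only_first, only_second)
    tail = sum(math.comb(n, k) for k in range(smaller + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def repeated_comparison(pairs, repeats=100, alpha=0.05, seed=0):
    """Resolve annotator disagreement at random, many times over.

    pairs: list of (labels_first, labels_second), where each element is
    the list of binary annotator decisions (True = positive) for the
    matched item from each dataset. Returns the average positive rate
    of each dataset and whether every repetition was significant.
    """
    rng = random.Random(seed)
    rates_first, rates_second = [], []
    always_significant = True
    for _ in range(repeats):
        # Pick one annotator decision per item; agreement makes this a no-op.
        first = [rng.choice(labels) for labels, _ in pairs]
        second = [rng.choice(labels) for _, labels in pairs]
        only_first = sum(1 for a, b in zip(first, second) if a and not b)
        only_second = sum(1 for a, b in zip(first, second) if b and not a)
        if mcnemar_exact_p(only_first, only_second) >= alpha:
            always_significant = False
        rates_first.append(sum(first) / len(first))
        rates_second.append(sum(second) / len(second))
    mean_first = sum(rates_first) / repeats
    mean_second = sum(rates_second) / repeats
    return mean_first, mean_second, always_significant
```

Requiring significance in every repetition is the conservative choice: a single unlucky resolution of the disagreements is enough to withhold the significance claim.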

Hi, Taraji. How are you doing today? I'm doing fine, thank you. Just working on my math homework. Do you need any help with that? Yeah, I could use some help. Thank you. Let's take a look. What are you working on? I'm working on this problem where I have to find the perimeter of this shape.
Figure 3: Desire and Desire_o inferences in the ConvoSense dataset.
Table 5: Statistics of the ConvoSense and HumanGen datasets. Poly: polymorphic examples (multiple inferences). Examples: # of examples, Words: average # of words per inference, Inferences: average # of inferences per example with range shown in parentheses, U1/2(#): average # of unique unigrams/bigrams across all inferences, U1/2(%): average % of unique unigrams/bigrams between inferences within a single example, UL(%): average % of unique inferences across all examples. Averages are calculated at the macro level across all inference types.
Table 6 presents the results, confirming the high reasonability, novelty, detailedness, and diversity of the inferences in the ConvoSense dataset.

Table 6: Human evaluation results on 100 examples of ConvoSense

Table 7: Reference metric results on test splits. Columns BS denote BertScore. Underline indicates best metric with statistical significance under Bonferroni multi-test correction, except where indicated by † (t-test, α = 0.05).
A limitation of PolyAgg is that it can only match up to the number of generated outputs. If a model generates fewer outputs than there are references, PolyAgg will not measure against all references. However, this is a reflection of the model's coverage capability, which is valuable information. To capture this, we introduce a coverage moderator C for the PolyAgg score. Using cardinality notation | · |, where outs_e denotes the model outputs and refs_e denotes the ground-truth references for a single example e ∈ E, the moderated score is aggregated over examples, weighted by the number of references:
$$\frac{\sum_{e \in E} \mathrm{PolyAgg}(outs_e, refs_e) \cdot C \cdot |refs_e|}{\sum_{e \in E} |refs_e|}$$
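The reference-weighted aggregation of the coverage-moderated PolyAgg score can be sketched as below. Since the per-example definition of C is not reproduced in this section, the sketch takes C as a precomputed input rather than assuming a formula for it; the function name and tuple layout are hypothetical.

```python
def aggregate_moderated_polyagg(examples):
    """Aggregate coverage-moderated PolyAgg scores over a test set.

    examples: list of (polyagg_score, coverage, num_refs) tuples, one per
    example e, where coverage is the moderator C for that example (passed
    in as given, not derived here) and num_refs is |refs_e|.
    """
    numerator = sum(score * coverage * num_refs
                    for score, coverage, num_refs in examples)
    denominator = sum(num_refs for _, _, num_refs in examples)
    return numerator / denominator


# Hypothetical values: two examples with 5 and 10 references.
score = aggregate_moderated_polyagg([(0.8, 1.0, 5), (0.6, 0.5, 10)])
```

Weighting by |refs_e| means that examples with more references contribute proportionally more to the final score.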

Table 9: Percentage of reasonable (R) and novel (N)