Abstract
Mastering commonsense understanding and reasoning is a pivotal skill for conducting engaging conversations. While there have been several attempts to create datasets that facilitate commonsense inferences in dialogue contexts, existing datasets tend to lack in-depth detail, restate information already present in the conversation, and often fail to capture the multifaceted nature of commonsense reasoning. In response to these limitations, we compile ℂonvoSense, a new synthetic dataset for commonsense reasoning in dialogue contexts generated using GPT, which boasts greater contextual novelty, offers a higher volume of inferences per example, and substantially enriches the detail conveyed by the inferences. Our dataset contains over 500,000 inferences across 12,000 dialogues covering 10 popular inference types, which empowers the training of generative commonsense models for dialogue that are superior in producing plausible inferences with high novelty compared to models trained on previous datasets. To the best of our knowledge, ℂonvoSense is the first of its kind to provide such a multitude of novel inferences at such a large scale.
1 Introduction
Effective dialogue is accomplished by a profound grasp of language and a thorough comprehension of the world. Such comprehension is crucial to the construction of responses that are pertinent, coherent, and captivating within an ongoing dialogue. A pivotal element of this worldview is commonsense: self-evident information that is universally acknowledged among humans (Clark and Brennan, 1991).
Over time, there has been a concerted endeavor to create datasets that facilitate commonsense reasoning. Early work, such as the widely recognized ConceptNet (Speer et al., 2017), focused predominantly on physical commonsense related to entities. Lately, efforts have shifted toward building datasets encompassing social- and event-based commonsense, such as ATOMIC (Hwang et al., 2021). This new wave of datasets targets complex human concepts, including emotions, desires, and motivations.
As human conversations largely revolve around sharing personal experiences and life events (Fillwock and Traum, 2018; Mitsuda et al., 2019), it is critical for virtual agents to possess a robust understanding of human experiences to conduct effective dialogue. Datasets such as ATOMIC hold promise as they provide insights directly relevant to human experience; however, a drawback lies in their lack of contextual awareness as they hinge on isolated, concise phrases for commonsense inferences. This limitation poses challenges for dialogue-oriented tasks because utterances should not be viewed in isolation but must be interpreted within their context (Pan et al., 2019; Jin et al., 2022).
Several initiatives have recently aimed to curate commonsense inferences tailored for dialogue contexts (Gao et al., 2022; Ghosal et al., 2022; Zhou et al., 2022a). However, a trade-off currently exists between the breadth of inference types covered and the scope of dialogue contexts encompassed within these existing datasets. While some datasets cover a wide range of relations, they are limited to a small number of dialogues (Gao et al., 2022), whereas others capture a large number of dialogues but on a limited set of relations (Ghosal et al., 2022).
In addition, a few challenges can be encountered in these datasets. For example, the inferences in these datasets are often too succinct and derive only straightforward conclusions with minimal elaboration (Gao et al., 2022), which do not convey implicit commonsense. Some studies instruct annotators to recycle information from the ongoing conversation, undermining the speculative nature of inferences and detracting from the potential of offering fresh insights to enhance dialogue understanding (Ghosal et al., 2022). Moreover, although multiple plausible inferences can be drawn from a single dialogue context, only a few datasets support this multifaceted nature (Shen et al., 2022), impeding the development of models capable of generating diverse inferences, and thus limiting their utility in real applications.
We present ℂonvoSense, a commonsense dataset generated by GPT encompassing 10 popular inference types with over 500,000 inferences across 12,000 dialogues (§4). Our dataset shows greater contextual novelty and enhanced inference diversity and detail while maintaining exceptional reasonability compared to existing datasets (§3). We also explore several strategies to build generative models producing inferences for dialogue contexts (§5). Our experiments show that models trained on ℂonvoSense excel at generating plausible inferences with greater detail and novelty compared to ones trained on existing datasets (§6). To the best of our knowledge, this is the first dialogue-based commonsense dataset that not only covers an extensive array of inference types at a large scale but also provides a plethora of diverse, novel inferences tailored to each dialogue context. Our ℂonvoSense dataset and inference models can be accessed through our open-source project: https://github.com/emorynlp/ConvoSense.
2 Related Work
Recent work has focused on integrating commonsense into various tasks, including story generation and explanation (Guan et al., 2020; Gabriel et al., 2021), dialogue summarization and explanation (Ghosal et al., 2021; Zhou et al., 2021; Kim et al., 2022), and response generation (Li et al., 2022; Sabour et al., 2022; Zhou et al., 2022b). Much of this work relies on existing datasets, such as ConceptNet (Li et al., 2022; Zhou et al., 2022b) and ATOMIC (Sabour et al., 2022), which only contain single-word or short-phrase premises and conclusions. Although there are commonsense datasets curated for long dialogue contexts, they tend to be small in size (Zhou et al., 2022a), express simple inferences (Gao et al., 2022), or copy context from the provided utterances (Ghosal et al., 2022).
On the other hand, GPT has recently been used to create a variety of datasets. Kim et al. (2023) and Zhan et al. (2023) constructed dyadic dialogue datasets at large scale, while West et al. (2022) generated commonsense triples in the ATOMIC style (Hwang et al., 2021). However, ATOMIC-style inferences are not necessarily suitable for dialogue, as they struggle to handle long contexts and often lack depth. Table 1 summarizes the inference types in existing dialogue-focused commonsense datasets and mappings of synonymous types among them. In particular, the following three datasets are used for comparisons with our work:
| Type | Label(s) | Definition(s) | COM (ComFact) | CIC (Cicero) | REF (Reflect) |
|---|---|---|---|---|---|
| Subsequent | isBefore, Subsequent-Events | What could happen after this? [2]; What subsequent event happens or could happen following the Target? [3]; What might happen after? [4] | * | 22K | 600 |
| Antecedent | isAfter | What could have happened before this? [2]; What might have happened before? [4] | * | | 600 |
| Cause | xReason, Cause | What could be the cause of this event? [2]; What is the event that directly causes or could cause Target? [3] | 80 | 21K | |
| Prerequisite | xNeed, Prerequisites | What does X need to do before the event can happen? [1]; What is or could be the prerequisite of Target? [3] | 1K | 10K | |
| Motivation | xIntent, Motivation | Why does X cause the event? [1]; What is an emotion or basic human drive that motivates or could motivate Target? [3] | 800 | 12K | |
| Attribute | xAttr | How would X be described? [1]; How would you describe Speaker? [4] | 400 | | 600 |
| Reaction | xReact | How does X feel after the event? [1]; What is Speaker feeling now? [4] | 300 | | 600 |
| Reactionₒ | oReact | How do others feel after the event? [1]; What is the possible emotional reaction of the listener in response to target? [3]; What is Responder feeling now? [4] | 70 | 6K | 600 |
| Desire | xWant | What would X likely want to do after the event? [1] | 1K | | |
| Desireₒ | oWant | What would others likely want to do after the event? [1] | 100 | | |
| Constituents | HasSubEvent | What is a substep that happens within this event? [2] | 800 | | |
| Obstacle | HinderedBy | What could obstruct the occurrence of this event? [2] | 200 | | |
| Effect | Causes | What does this event cause to happen? [2] | 30 | | |
| Effects | xEffect | What effect does the event have on X? [1] | 400 | | |
| Effectₒ | oEffect | What effects does the event have on others? [1] | 90 | | |
ComFact
Cicero
Human participants were tasked with composing responses to five commonsense questions (e.g., What is the event that directly causes or could cause Target?) based on dialogue contexts and were explicitly instructed to incorporate information from the preceding or forthcoming utterances. The first version produced a single inference for each example (Ghosal et al., 2022), whereas the second version produced multiple inferences per example, including both good and bad ones (Shen et al., 2022).
Reflect
Zhou et al. (2022a) supplied both human-generated commonsense inferences and next-utterance responses that could be derived from a specified commonsense inference. The inferences were collected by instructing human participants to answer a commonsense question, while the next-utterance responses were composed by a separate set of participants who were given the dialogue context and one of the human-generated inferences.
3 Evaluating GPT-generated Inferences
To support the development of a large-scale, high-coverage commonsense dataset for dialogue that improves upon existing work, we hypothesize that we can leverage large language models (LLMs) to accomplish this task in an efficient and low-cost manner. From initial pilot tests of both a closed-source LLM (GPT) and open-source LLMs (Vicuna and Llama), we find that GPT is more reliable at following specific instructions and produces commonsense inferences of overall better quality than the open-source LLMs. Consequently, we choose to rely on GPT in this work.
3.1 Prompt Engineering
Prior to crafting the full ℂonvoSense dataset, we empirically assess GPT’s efficacy in generating reasonable and novel commonsense inferences for dialogue. To mitigate any unintended bias from in-context examples in the GPT prompt, we adopt a zero-shot generation framework.1 GPT prompts are refined iteratively to achieve the optimal outcomes. An example of the final prompt design, specifically tailored for the Desire inference type, is illustrated in Table 2.
During our development process, we observe that the inferences generated from GPT frequently contain detailed and rich information, thus addressing one of the major limitations of existing works. In addition, to encourage novel inferences from GPT, we include the instruction “Your answers should provide novel information that is not explicitly shared in the conversation.” as seen in Table 2. We observe that this instruction helps in reducing the redundancy of the generated inferences to the information already explicitly shared in the dialogue context, thus addressing a second major limitation of existing works.
For the prompt, each inference type is paired with a guiding question and an answer prefix, ensuring uniformity in the generated content for the specific type, which respectively fill the Inference Question (Q) and Inference Answer Template (A) slots in the prompt. For every dialogue context, the sequence of utterances in the context is placed in the Dialogue Context (C) slot, and its final turn is duplicated in the Target Utterance (T) slot. Finally, the GPT output, commencing with the header Answers and adopting a list-like format with newline separation, is parsed to extract the generated inferences. Table 3 details the questions and answer prefixes employed for the fifteen identified inference types derived from the previous studies in Table 1.
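As a concrete illustration, the following Python sketch shows how this slot-filling and parsing pipeline could be implemented with the legacy OpenAI SDK. The model and temperature follow the notes; apart from the quoted novelty instruction above, the surrounding prompt wording and helper names are illustrative assumptions rather than the exact Table 2 prompt.

```python
# Hypothetical sketch of the prompt assembly and output parsing described above.
# The slot layout and parsing mirror the text; the exact instruction wording is assumed.
import openai  # legacy (<1.0) SDK style

PROMPT_TEMPLATE = """Answer the question below about the Target Utterance of the conversation.
Your answers should provide novel information that is not explicitly shared in the conversation.

Dialogue Context:
{context}

Target Utterance:
{target}

Question: {question}

Answers:
{answer_prefix}"""

def generate_inferences(context_turns, question, answer_prefix):
    """Fill the C/T/Q/A slots, query GPT, and parse the newline-separated answer list."""
    prompt = PROMPT_TEMPLATE.format(
        context="\n".join(context_turns),
        target=context_turns[-1],  # the final turn is duplicated in the Target slot
        question=question,
        answer_prefix=answer_prefix,
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",  # model and temperature per the notes
        temperature=1.0,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response["choices"][0]["message"]["content"]
    # The output is a newline-separated, list-like block under the "Answers" header.
    answers = [line.lstrip("-*0123456789. ").strip() for line in text.split("\n")]
    return [a for a in answers if a]
```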
3.2 Evaluation
To evaluate the quality of GPT-generated commonsense inferences for dialogues, we compare their reasonability and novelty against inferences from the human-curated datasets. First, we sample a uniform distribution over inference types for each existing dataset. For every sample, we then prompt GPT to produce relevant inferences and randomly select one from the generated list. Finally, two human annotators are presented with the dialogue context, inference question, and both the GPT- and human-generated inferences, and are asked to categorize them for reasonability and novelty. For this evaluation, we enlist native English speakers via the Surge AI crowdsourcing platform (https://surgehq.ai), paying them at a rate of $0.15 per sample with an estimated completion time of 45 seconds.
Reasonability
Most prior commonsense datasets assess their inferences based on human-judged reasonability (Hwang et al., 2021; Ghosal et al., 2022; Shen et al., 2022; Zhou et al., 2022a). An inference is deemed reasonable if it makes sense in, is relevant to, and is consistent with the provided dialogue context. We follow Hwang et al. (2021), in which annotators categorize inferences into levels of truth likelihood: always/likely, sometimes/possible, never/farfetched, or invalid/nonsense.
Novelty
A key trait of commonsense for dialogue is its role in enhancing dialogue comprehension by providing relevant contextual information. While Ghosal et al. (2022) gauge creativity in human responses, creativity is not strictly focused on inference novelty. In our study, annotators evaluate the extent to which an inference contributes fresh information to the conversation, categorized as: new & detailed, new & simple, and purely repetitive.
Since our annotation tasks aim to elicit the natural commonsense understanding that each annotator has learned through their life experience, we do not provide any training or explicit examples of what constitutes a “reasonable” or “novel” commonsense inference, to avoid artificially polluting their commonsense understanding of the world. Instead, we provide a description of the task with definitions of the different categories. Our instructions are intended to mitigate bias towards trivial inference properties by providing clear definitions of the characteristics under study and emphasizing important aspects to keep in mind, such as ignoring grammar errors unless they make an inference nonsensical. Furthermore, decomposing inference quality into two characteristics allows for their independent evaluation. We verified through pilots that this approach resulted in reliable and reasonable annotations from our annotators for both tasks.
3.3 Results
Following Hwang et al. (2021), the two metrics in Section 3.2 are converted into binary representations. Thus, labels [always/likely, sometimes/possible] are categorized as positive and [never/farfetched, invalid/nonsense] are considered negative reasonability. Similarly, [new & detailed, new & simple] are designated as positive, and [purely repetitive] is classified as negative novelty. This setup, with 300+ annotated samples per dataset, allows us to detect differences of at least 10% between GPT- and human-generated datasets using McNemar’s binary matched-pairs test at 80% power and a significance level of 0.05, assuming discordance probabilities of 0.24 or lower (compatible with pilots).2 In cases of annotator disagreement, one of the annotators’ decisions is randomly selected. To mitigate the potential noise introduced by this random selection, we repeat the process 100 times and report the average result, only confirming statistical significance when every selection yields a significant result.
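The sketch below outlines this procedure under stated assumptions: binary labels have already been extracted per annotator, the variable names are illustrative, and statsmodels' McNemar implementation stands in for whatever test code was actually used.

```python
# Illustrative sketch of the repeated tie-breaking + McNemar procedure described above;
# not the authors' code. Labels are 1 (positive) or 0 (negative).
import random
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def resolve(a, b):
    """Break an annotator disagreement by randomly selecting one annotator's decision."""
    return a if a == b else random.choice([a, b])

def repeated_mcnemar(gpt_pairs, human_pairs, repeats=100, alpha=0.05):
    """gpt_pairs / human_pairs: per-sample (annotator1, annotator2) binary labels."""
    pvalues, rates = [], []
    for _ in range(repeats):
        gpt = [resolve(a, b) for a, b in gpt_pairs]
        human = [resolve(a, b) for a, b in human_pairs]
        table = np.zeros((2, 2), dtype=int)  # paired (GPT, human) outcomes per sample
        for g, h in zip(gpt, human):
            table[g, h] += 1
        pvalues.append(mcnemar(table, exact=False).pvalue)
        rates.append((np.mean(gpt), np.mean(human)))
    # Report the average positive rates; claim significance only if every run is significant.
    return np.mean(rates, axis=0), max(pvalues) < alpha
```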
Considering the reported quality of the existing datasets and our preliminary assessments of GPT-generated inferences, we expect much higher rates of positive classes than negative ones, resulting in a class imbalance. To overcome the vulnerability to prevalence skew exhibited by other agreement metrics like Cohen’s kappa (Jeni et al., 2013; Wongpakaran et al., 2013; Quarfoot and Levine, 2016), Gwet’s AC1 inter-annotator agreement metric is chosen (Gwet, 2002).3 Our annotators obtain AC1 values of 0.8 and 0.6 for reasonability and novelty, respectively, implying substantial agreement.
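For reference, Gwet's AC1 for two annotators with binary labels can be computed as sketched below; this is a minimal illustration following Gwet (2002), not the authors' implementation.

```python
# Gwet's AC1 for two raters and binary labels: (p_o - p_e) / (1 - p_e),
# where chance agreement p_e = 2 * pi * (1 - pi) uses the average positive prevalence pi.
def gwet_ac1(labels_a, labels_b):
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    pi = (sum(labels_a) + sum(labels_b)) / (2 * n)             # average positive prevalence
    p_e = 2 * pi * (1 - pi)                                    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy example: two raters judging reasonability of six inferences (1 = positive)
print(gwet_ac1([1, 1, 1, 0, 1, 1], [1, 1, 1, 1, 1, 1]))  # ~0.80
```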
Table 4 demonstrates that GPT can attain comparable reasonability in its generated inferences to those derived from humans, even exceeding the reasonability of the inferences in ComFact with statistical significance. Notably, the results also indicate that GPT surpasses the novelty of the human-generated inferences for the majority of the existing datasets. Furthermore, GPT outputs exhibit greater detail than human-generated inferences. Figure 2 shows the percentage of new & detailed inferences out of all positive novelty inferences for each data source, clearly demonstrating the superiority of GPT inferences in terms of their expressed detail. Example inferences from GPT and humans are shown in Figure 1.
| Dataset | R | N | # |
|---|---|---|---|
| GPT | 93 (0.17)* | 91 (0.21)* | 390 |
| ComFact | 81 (0.05) | 73 (0.04) | |
| GPT | 93 (0.10) | 80 (0.16)* | 300 |
| Cicero | 88 (0.05) | 70 (0.06) | |
| GPT | 89 (0.08) | 86 (0.08) | 300 |
| Reflect | 91 (0.09) | 82 (0.04) | |
4 ConvoSense Dataset
Given our assessment of high-quality, novel, and detailed GPT-generated commonsense inferences across various dialogue contexts and inference types (Section 3), we construct a substantial conversational commonsense dataset using GPT, termed ℂonvoSense.
4.1 HumanGen: Human-generated Datasets
For fair comparisons to our work, we combine the three human-generated datasets (Section 2) into a single dataset, termed ℍumanGen.4 Specifically, their train/validation/test sets are integrated independently. For ComFact and Cicero, this integration follows the provided splits, while for Reflect, data is sampled following an 80/10/10 distribution. To standardize ℍumanGen into a cohesive format, we perform the following preprocessing steps.
First, we leverage the mapping outlined in Table 1 along with the specifications from Table 3 to identify relevant commonsense inference questions for each instance. Then, we combine consecutive utterances from the same speaker to ensure every dialogue turn represents a distinct speaker. Lastly, we apply Speaker and Listener tags in a similar manner to ℂonvoSense (Figure 3). Since human-generated inferences often contain nominal references to specific target entities, we additionally incorporate the names of conversational participants into the tags, as exemplified by “Speaker (A)”.
The naming conventions vary across the different human-generated datasets. To maintain uniformity, we adopt the naming conventions used in Cicero for both ComFact and Reflect, as Cicero constitutes nearly 90% of ℍumanGen. In Cicero, participants are denoted as A and B. For ComFact, originally lacking speaker designations, we randomly assign A/B tags to each conversation. On the other hand, Reflect includes original speaker names; thus, we replace them with A/B tags accordingly. Since the speaker name frequently appears in Reflect’s inferences, we uniformly replace it with “the speaker”, aligning with the prevalent format in Cicero.
4.2 ConvoSense: New GPT-generated Dataset
Constructing a practical dataset of commonsense inferences for dialogue benefits from covering a wide variety of dialogue situations. To this end, our construction process of ℂonvoSense first carefully selects the dialogues to include based on their topical diversity, trims the dialogue contexts to optimize utterance diversity, and finally generates the inferences for each context.
Dialogue Selection
We sample the dialogues for ℂonvoSense from the high-quality, large-scale SODA dataset. SODA contains over a million dyadic dialogues generated by GPT, covering situations based on ATOMIC commonsense tuples (Kim et al., 2023). For cost practicality, ℂonvoSense is constructed to contain 10,000 training dialogues, 1,000 validation dialogues, and 1,000 test dialogues.
To encourage diversity in ℂonvoSense, we employ BERTopic (Grootendorst, 2022), which clusters the dialogues selected from SODA into groups by applying the UMAP dimensionality reduction technique (McInnes et al., 2020) and the HDBSCAN clustering algorithm (McInnes et al., 2017) to the BERT embeddings of the dialogues.5 We configure the hyperparameters6 to effectively group dialogues while maintaining a well-balanced distribution of group sizes based on manual verification. As a result, we obtain 100K dialogue groups, where each group consists of 6.3 dialogues on average. These groupings represent 100K unique dialogue topics, thus enabling the construction of ℂonvoSense to span a variety of topics by sampling dialogues from a subset of these groupings.
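A minimal sketch of this grouping step is shown below, assuming the embedding model and hyperparameters listed in the notes (all-mpnet-base-v2; neighbors: 5, components: 5, min_cluster_size: 2); the dialogue-loading helper is a hypothetical placeholder, not part of the released pipeline.

```python
# Sketch of the BERTopic-based dialogue grouping; not the authors' exact pipeline.
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer

dialogues = load_soda_dialogues()  # hypothetical helper returning SODA dialogues as flat strings

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-mpnet-base-v2"),  # BERT embeddings (see notes)
    umap_model=UMAP(n_neighbors=5, n_components=5),            # dimensionality reduction
    hdbscan_model=HDBSCAN(min_cluster_size=2),                 # clustering into dialogue groups
)
group_ids, group_scores = topic_model.fit_transform(dialogues)  # one group id (and score) per dialogue
topic_info = topic_model.get_topic_info()                       # topic keyword strings, reused later
```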
Next, we randomly select one dialogue from each of n groupings, where each selected dialogue contains at least 5 utterances and has a BERTopic score of at least 0.95 with respect to its group. To maintain distinct dialogue scenarios in each split, each grouping can only be selected for one split. Through this procedure, we set n to [10K, 1K, 1K] for assembling the training, validation, and test splits, respectively.
Utterance Selection
For each selected dialogue, we determine which utterance to perform inference generation on. We use the topic keywords identified for each group during the BERTopic grouping to pinpoint the most topically salient utterance in each dialogue and ensure that the diversity afforded by the grouping is maintained. This is achieved by selecting the utterance whose embedding yields the highest cosine similarity with the embedding of the four-word topic string assigned to the dialogue’s respective group by BERTopic. Subsequently, we trim the dialogue’s utterances such that the conversation ends at this selected utterance. This trimmed version becomes the final dialogue context used for commonsense inference generation, where the inferences are derived for the last utterance.
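The sketch below illustrates this selection and trimming step, assuming sentence-transformers embeddings and the four-word topic string produced by BERTopic; the function name is illustrative.

```python
# Sketch of selecting the most topically salient utterance and trimming the context there.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def trim_to_salient_utterance(utterances, topic_string):
    """Return the context truncated at the utterance most similar to the group's topic string."""
    utt_embs = encoder.encode(utterances, convert_to_tensor=True)
    topic_emb = encoder.encode(topic_string, convert_to_tensor=True)
    sims = util.cos_sim(utt_embs, topic_emb).squeeze(-1)  # cosine similarity per utterance
    salient_idx = int(sims.argmax())
    return utterances[: salient_idx + 1]                  # inferences target this final utterance
```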
Because commonsense inferences often relate to a central figure in a conversation, either the speaker or the listener, we introduce nominal tags for the two participants. The terminal utterance is labeled as Speaker, and its preceding utterance is labeled as Listener. These nominal tags are then assigned in alternating order to the remaining utterances.
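Concretely, the alternating tag assignment can be written as a small helper (an illustrative sketch, not the authors' code):

```python
# Assign Speaker/Listener tags backwards from the terminal utterance, alternating.
def tag_participants(utterances):
    tags = ["Speaker" if (len(utterances) - 1 - i) % 2 == 0 else "Listener"
            for i in range(len(utterances))]
    return [f"{tag}: {utt}" for tag, utt in zip(tags, utterances)]

print(tag_participants(["Hi, how was your trip?", "Great, I hiked every day!"]))
# ['Listener: Hi, how was your trip?', 'Speaker: Great, I hiked every day!']
```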
Inference Types
For each preprocessed dialogue, GPT generates inferences for all included commonsense types following the procedure in Section 3. Specifically, ten commonsense types are included: Subsequent, Cause, Prerequisite, Motivation, Attribute, Reaction, Reactionₒ, Desire, Desireₒ, and Constituents (highlighted in Table 3). These types are selected based on their usage frequency in existing datasets and their lack of semantic overlap.
Data Statistics
Table 5 presents data statistics for ℂonvoSense and ℍumanGen. ℂonvoSense significantly surpasses ℍumanGen in data volume, particularly regarding instances with polymorphic outputs, where multiple inferences can be derived per instance. Moreover, ℂonvoSense boasts greater vocabulary diversity and reduced redundancy among inferences. Illustrative examples from ℂonvoSense are shown in Figure 3.
| Dataset | Examples (All) | Words | Inferences | U1 (#) | U2 (#) | Examples (Poly) | U1 (%) | U2 (%) | UL (%) |
|---|---|---|---|---|---|---|---|---|---|
| ConvoSense | 120,000 | 14.6 | 5.1 (2–13) | 16,666 | 199,087 | 120,000 | 92.8 | 98.9 | 98.8 |
| ComFact | 3,909 | 3.2 | 1.4 (2–12) | 295 | 315 | 1,401 | 86.7 | 97.3 | 60.3 |
| Cicero | 52,644 | 11.6 | 1.3 (2–11) | 7,598 | 44,234 | 9,911 | 84.4 | 97.2 | 98.7 |
| Reflect | 3,000 | 6.4 | 1.1 (2–4) | 835 | 1,407 | 216 | 85.1 | 95.2 | 82.2 |
| HumanGen | 59,553 | 6.6 | 1.3 (1–12) | 2,886 | 15,420 | 11,528 | 86.7 | 97.0 | 78.3 |
Data Quality
The results in Section 3.3 demonstrate that GPT is generally capable of producing high-quality commonsense inferences regardless of the underlying dialogue source. Consequently, applying GPT to generate commonsense inferences for the SODA dialogues is expected to yield similarly high quality. To explicitly verify this, we conduct an evaluation of the ℂonvoSense dataset. An external conversational AI expert, unaffiliated with this study, evaluates the generated inferences for 100 ℂonvoSense examples (508 total inferences; 5.08 inferences per example on average), with all ten inference types uniformly represented across examples. The human judge completes two evaluation tasks: grading the reasonability and novelty of each inference (Section 3.2) and performing inference clustering to measure per-example output diversity (Section 6.2). Table 6 presents the results, confirming the high reasonability, novelty, detailedness, and diversity of the inferences in the ℂonvoSense dataset.
Error Analysis
We next perform an error analysis on the unreasonable inferences identified by the human judge. We observe that most unreasonable inferences are either too niche to be likely given only the information provided in the dialogue context (26%; Desire examples #4–5 in Figure 3) or attributed to the wrong conversational participant (26%; Desireₒ examples #4–5 in Figure 3). Relatively speaking, only a small percentage of unreasonable inferences are explained by a violation of common knowledge of human experiences (10%), a lack of relevance to the dialogue context (10%), or a contradiction of the dialogue context (7%). This suggests that ℂonvoSense inferences are predominantly accurate representations of commonsense understanding, although they can suffer from a lack of precision regarding situational nuances and speaker roles.
5 Generative Commonsense Models
5.1 Training and Decoding Strategies
With the rich and diverse multi-inference examples provided in ℂonvoSense, we are well-positioned to train commonsense generation models that produce versatile outputs. Yet a key question remains: how can we induce this versatility in the model?
A common method of enhancing diversity in generative outputs is to modify the decoding strategy (Gimpel et al., 2013; Vijayakumar et al., 2018; Ippolito et al., 2019). Through preliminary testing, we observe that diverse beam search decoding with Hamming distance reward following Vijayakumar et al. (2018) improves the output diversity with less impact on accuracy compared to other methods.
On the other hand, Cao and Wan (2020) propose modifying the model architecture by introducing latent variables to guide output variety. However, such approaches only approximate the learning of varied responses by conditioning on random latent variables. In contrast, ℂonvoSense provides direct access to numerous inferences per input, enabling direct training of generative models that produce multiple inferences per example, with the set of inferences treated as the target outputs during training. Therefore, we explore the performance of three strategies for diverse generation of commonsense inferences.
Monomorphic Beam Search (M)

This model is trained to produce a single inference per example and, at inference time, uses standard beam search to generate k outputs.
Monomorphic Diverse Beam Search (M*)
This model adheres to the same design as the M model, except that during inference it uses Hamming-distance diverse beam search decoding to generate k outputs, following Vijayakumar et al. (2018).
Polymorphic (P)

This model is trained to generate the set of inferences for an example as a single output sequence, so that multiple inferences are produced directly in one generation pass.
5.2 Model Configuration
We develop six generative models: ConvoSenseM, ConvoSenseM*, ConvoSenseP, HumanGenM, HumanGenM*, and HumanGenP. Each model name denotes the training dataset, with the terminal letter indicating the model strategy. All of them use T5-3b (Raffel et al., 2020) as the base model, which is then finetuned on the corresponding dataset following the indicated model strategy. The ConvoSense- and HumanGen-trained models are finetuned for 5 and 10 epochs, respectively. The best-performing models and hyperparameters7 are selected through grid search based on their results on the validation sets.
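Using the hyperparameters reported in the notes, the finetuning setup can be sketched with the HuggingFace Seq2SeqTrainer as below; this mirrors the described configuration but is not the authors' training script, and the batch size and dataset preparation are unstated assumptions.

```python
# Hedged sketch of the T5-3b finetuning configuration (see the notes for the reported values).
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

tokenizer = AutoTokenizer.from_pretrained("t5-3b")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b")

args = Seq2SeqTrainingArguments(
    output_dir="convosense_m",
    optim="adafactor",              # Adafactor optimizer
    learning_rate=5e-6,             # 1e-6 for the polymorphic (P) variant
    weight_decay=5e-3,
    num_train_epochs=5,             # 10 for HumanGen-trained models
    bf16=True,                      # memory-efficient mixed precision
    per_device_train_batch_size=4,  # assumption: batch size is not reported
)

# train_data / val_data: tokenized (prefix + question + dialogue -> inference) pairs, prepared elsewhere
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_data, eval_dataset=val_data)
trainer.train()
```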
For all models, decoding is performed with 10 beams. For ConvoSenseM* and HumanGenM*, the number of beam groups is 10, and the diversity penalty is 0.5 and 1.0, respectively. For P models, decoding also uses a repetition penalty of 5.0 to reduce output token repetition.
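These decoding settings map directly onto HuggingFace's generate() arguments, as sketched below for a finetuned model; the input string is illustrative, and the prompt prefixes and maximum target lengths follow the notes.

```python
# Sketch of the three decoding configurations (M, M*, P) using a finetuned T5 checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-3b")  # in practice, the finetuned checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b")

prefix = "provide a reasonable answer to the question based on the dialogue: \n"
inputs = tokenizer(prefix + "What does Speaker want? ... dialogue context ...", return_tensors="pt")

# M: standard beam search, taking the top beams as outputs
m_out = model.generate(**inputs, num_beams=10, num_return_sequences=5, max_new_tokens=128)

# M*: Hamming-distance diverse beam search (Vijayakumar et al., 2018)
mstar_out = model.generate(**inputs, num_beams=10, num_beam_groups=10, diversity_penalty=0.5,
                           num_return_sequences=5, max_new_tokens=128)

# P: one sequence containing multiple inferences, decoded with a repetition penalty
p_out = model.generate(**inputs, num_beams=10, repetition_penalty=5.0, max_new_tokens=400)
```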
It is worth noting that only 16% of HumanGen examples feature multiple ground-truth inferences. Training a P model on the complete dataset yields a single-inference model, which defeats the purpose of the polymorphic model strategy. Instead, we train the HumanGenP model exclusively on multi-inference instances to facilitate learning of polymorphic outputs.
6 Generative Model Evaluation
We evaluate the six generative models (Section 5.2) on the ten commonsense inference types (Table 3) that exist in both the ℍumanGen (Section 4.1) and ℂonvoSense (Section 4.2) datasets. Model performance is evaluated using automatic reference metrics (Section 6.1), automatic diversity metrics (Section 6.2), and human evaluations of reasonability and novelty (Section 6.3).
6.1 Automatic Reference Metrics
Conventional evaluations of generative models against ground-truth references often overlook the diverse nature of the outputs. They typically assess individual model outputs against a single reference, focusing on best-case performance due to dataset constraints. However, such assessments are inadequate for our multi-inference dialogue generation objective. To address this, we structure our automated evaluation method to account for the concept of output diversity. This method, referred to as PolyAgg, serves as an aggregation function compatible with standard evaluation metrics. Its purpose is to gauge the model’s capacity to encompass the complete set of ground-truth references in its generated outputs.
Algorithm 1 demonstrates the PolyAgg aggregation function. It computes a score matrix for each example, where rows represent model outputs and columns represent ground-truth references, and finds the maximal assignment of rows to columns following the linear sum assignment problem (Burkard and Cela, 1999), which seeks to find the optimal bijective mapping between rows and columns in a cost matrix. By mandating a one-to-one mapping from model outputs to references, we can accurately measure reference set coverage and prevent models that generate mere surface-level variations from scoring highly on datasets with diverse references. We use SciPy’s linear sum assignment solver, then calculate the mean of the assigned scores for the final metric value. Dou et al. (2021) utilize a similar aggregation for evaluating a diverse dialogue response generation model.
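A minimal sketch of PolyAgg is given below, using SciPy's linear sum assignment solver as described; the scoring function is a placeholder for the metrics reported in the paper (BLEU, BertScore, sentence similarity).

```python
# Sketch of the PolyAgg aggregation: score every output against every reference,
# find the optimal one-to-one assignment, and average the assigned scores.
import numpy as np
from scipy.optimize import linear_sum_assignment

def poly_agg(outputs, references, score_fn):
    """score_fn(output, reference) -> similarity score, higher is better."""
    scores = np.array([[score_fn(o, r) for r in references] for o in outputs])
    rows, cols = linear_sum_assignment(scores, maximize=True)  # optimal bijective mapping
    return scores[rows, cols].mean()

# Toy usage with a trivial token-overlap scorer standing in for BLEU/BertScore
overlap = lambda a, b: len(set(a.split()) & set(b.split())) / max(len(set(b.split())), 1)
print(poly_agg(["the speaker wants to travel soon", "the listener feels happy for them"],
               ["the speaker wants to go on a trip", "the listener is excited"],
               overlap))
```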
Results
We evaluate each model in terms of both its best-case performance (Top-1 output) and its multi-inference performance (Top-5 outputs). In the Top-1 setting, the maximum score achieved by the top-1 output against all of the ground-truth references for an example is taken and averaged across the test data. In the Top-5 setting, the top-5 outputs from the models are taken and scores are calculated using Equation 2, before being averaged across the test data. For M(*) models, the top one or five beams are taken as the outputs for each setting. For P models, the first one or five inferences in the outputted sequence are taken as the outputs for each setting. The results are shown in Table 7 for each model on the ℍumanGen and ℂonvoSense test splits, respectively.
Overall, it is evident that using diversity-promoting decoding (M*) outperforms the direct generation of multiple inferences (P). This approach achieves the highest BLEU, BertScore, and sentence similarity scores in the Top-5 assessment setting. This trend is particularly pronounced in the case of the ConvoSense-trained model, holding true for both the ℂonvoSense and ℍumanGen test splits. Enhancing training inference diversity as seen in ℂonvoSense appears to support the adoption of diversity-focused decoding strategies, yielding more contextually relevant outputs aligned with ground-truth references, even when applied to test examples from different datasets.
In the Top-1 setting, monomorphic models with standard beam search demonstrate superior performance for both HumanGen- and ConvoSense-trained models. However, the difference compared to diverse beam search is relatively minor, particularly when considering embedding-based metrics. Interestingly, the HumanGenP model displays the strongest ability to generalize to the ℂonvoSense test split among all HumanGen-trained models in the Top-1 scenario. Upon manual comparison of HumanGenP outputs against other HumanGen-trained models, we observe that HumanGenP is more inclined to specify a focal person in the inference (e.g., “the speaker/listener”). This often aligns better with ℂonvoSense references, although in a superficial manner with little impact on the underlying semantics.
It is also observed that the models produce low scores when evaluated against the test examples that are out-of-distribution with respect to their training data. This may not reflect the true underlying reasonability of the generated inferences, but rather a difference in inference content between the datasets, which is supported by evidence in Section 3.3 showing that human-written generations are more often repetitive with the dialogue context than GPT generations. To obtain a direct measure of the quality of the generated model inferences, we perform a human evaluation in Section 6.3.
6.2 Automatic Diversity Metrics
To assess the ability of each model to generate diverse inferences for a given dialogue context, we employ a clustering approach under the Top-5 evaluation scheme. This involves grouping the model generations for each example into clusters of inferences with similar meanings. The average number of inference clusters across examples serves as a measure of output diversity.
For each of the ten inference types, we draw 50 examples from the test splits of ℂonvoSense and ℍumanGen, except for the Constituents type in ℍumanGen due to its smaller test split (22 examples). We instruct GPT-4 to create groups of semantically similar inferences given a dialogue context, question, and a list of inferences. GPT-4 demonstrates its proficiency by achieving an average B-cubed F1-score (Bagga and Baldwin, 1998) of 0.872 against clusters identified by one of the authors for 20 examples, where B-cubed is a common clustering evaluation metric that measures the precision and recall of each element’s neighbors within the same cluster. This outperforms Amazon Mechanical Turk crowdworkers, who only achieved a score of 0.581.11
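For reference, B-cubed F1 over the inference clusters can be computed as in the sketch below; this is an illustrative implementation of Bagga and Baldwin (1998), not the paper's evaluation code.

```python
# B-cubed precision/recall/F1: for each element, compare its predicted cluster
# to its gold cluster, then average over elements.
def bcubed_f1(pred_clusters, gold_clusters):
    """Clusters are lists of sets of inference ids covering the same elements."""
    pred_of = {e: c for c in pred_clusters for e in c}
    gold_of = {e: c for c in gold_clusters for e in c}
    elements = list(pred_of)
    precision = sum(len(pred_of[e] & gold_of[e]) / len(pred_of[e]) for e in elements) / len(elements)
    recall = sum(len(pred_of[e] & gold_of[e]) / len(gold_of[e]) for e in elements) / len(elements)
    return 2 * precision * recall / (precision + recall)

print(bcubed_f1([{1, 2}, {3}, {4, 5}], [{1, 2, 3}, {4}, {5}]))  # ~0.77
```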
Results
Table 8 displays diversity outcomes per model. For both HumanGen- and ConvoSense-trained models, the monomorphic model with diverse beam search generates the most unique outputs.12 While ConvoSenseM* slightly outperforms HumanGenM* in terms of inference diversity, both models exhibit similar unique inference cluster counts. Compared to the ℂonvoSense inferences themselves (Table 6), it is clear that none of the trained models are able to replicate the high inference diversity. Nonetheless, there is a large discrepancy in inference detail, which is revealed through the human assessments in the next section.
| Model | Clusters | Words |
|---|---|---|
| ConvoSenseM | 2.680 (54%) | 12.179 |
| ConvoSenseM* | 3.509 (70%) | 12.928 |
| ConvoSenseP | 3.262 (74%) | 13.292 |
| HumanGenM | 3.031 (61%) | 6.492 |
| HumanGenM* | 3.452 (69%) | 5.544 |
| HumanGenP | 1.348 (69%) | 7.744 |
6.3 Human Evaluations
We also evaluate the models through human assessment, in both the Top-1 and Top-5 settings. Based on the automated evaluation outcomes, we compare ConvoSenseM* to both HumanGenM and HumanGenM*. An external conversational AI expert, unaffiliated with this study, evaluates the top five inferences for 60 examples per model in a blinded design, with all ten inference types and both datasets uniformly represented. The human judge completes two evaluation tasks: grading the reasonability and novelty of each inference (Section 3.2) and performing inference clustering (Section 6.2).
Results
Table 9 demonstrates ConvoSenseM*’s superior performance compared to the HumanGen models. ConvoSenseM* achieves a remarkable 93% reasonability and 98% novelty, averaging 3.4 unique inferences per example. Indeed, similar results hold even when considering the Top-1 output per model, showing that ConvoSenseM* exhibits strong performance regardless of whether a single-best inference is desired or a diverse set of inferences are desired. Moreover, when considering the positive novelty inferences in the Top-5 setting, we observe that 75% are annotated as detailed for ConvoSenseM* whereas only 7% are indicated as such for HumanGenM*. This reveals a substantial improvement in the amount of detail present in the inferences produced by ConvoSense models as compared to HumanGen models, which results in richer information being provided by the model.
| Model | R (Top-1) | N (Top-1) | R (Top-5) | N (Top-5) | Clusters (Top-5) |
|---|---|---|---|---|---|
| ConvoSenseM* | 90 | 98 | 93 | 98 | 3.42 (68%) |
| HumanGenM | 75 | 57 | 81 | 56 | 2.25 (45%) |
| HumanGenM* | 75 | 70 | 81 | 70 | 3.17 (63%) |
7 Limitations and Ethical Considerations
This work does not intend to present an exhaustive set of commonsense inferences for dialogue. While we adhere to established inference types relevant to dialogue from existing literature, there could be overlooked types or unique challenges within specific dialogue domains that remain to be explored.
Furthermore, it is important to recognize that some social commonsense inference types may be associated with stereotypes and biases. When employing a model that produces commonsense inferences in a setting that impacts human users, caution must be exercised to prevent unjust or prejudiced decisions. Although exploration of the prevalence of harmful biases is out of the scope of the current work, we welcome future investigations into quantifying these aspects of our resources.
Finally, we adhered to OpenAI’s terms of service and related policies when utilizing GPT, and we acknowledge that any subsequent utilization of our models and data should refer to these policies.
8 Future Work
Although ℂonvoSense is composed of diverse multi-inference dialogue data (Table 6), it is clear from our experiments (Tables 8 and 9) that our trained models do not quite achieve the same degree of inference diversity. Further work is needed on improving the ability of distilled models to better capture the diversity present in the data.
In addition, the integration of commonsense understanding into dialogue applications has shown promising results in improving performance on tasks such as response generation, summarization, and reading comprehension in previous work. In light of this, our work on improving commonsense resources and models presents an opportunity for further advancements in these dialogue applications. In particular, future work exploring how to capitalize on our commonsense model for dialogue response generation is highly compelling, since commonsense errors are one of the most common issues for modern dialogue agents (Finch et al., 2023). However, previous work has revealed that naive integration of commonsense inferences into neural models does not necessarily produce improvements (Zhou et al., 2022a). As a result, we leave the integration of our commonsense model to future work to allow for a thorough investigation of its impact on response generation, covering aspects such as the impact of different commonsense inference types, the filtering of relevant inference types per dialogue context, and the effect of synthesizing multiple inferences into dialogue responses.
9 Conclusion
In this work, we present ℂonvoSense, an automatically constructed dataset of multi-output commonsense inferences for dialogue. ℂonvoSense surpasses existing datasets in size, advances inference detail and novelty, and attains comparable (if not superior) reasonability when compared to existing datasets. Our investigation into various techniques for generating multiple inferences reveals that diverse beam search on single-output generative models yields the best outcomes. By publicly releasing our trained models, we enable other works to benefit from the remarkable improvements in commonsense reasonability and novelty achieved by this work.
Acknowledgments
We gratefully acknowledge the support of Amazon for this work. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Amazon. We would also like to thank the anonymous reviewers and the action editor for their valuable feedback.
Notes
gpt-3.5-turbo-0301 with a temperature setting of 1.0.
We observe Cohen’s kappa of 0.19 and 0.15 for reasonability and novelty, respectively.
Many commonsense types have a sparsity of training data when the human-generated datasets are viewed in isolation, which would impede the training of a neural model to adequately capture the commonsense type.
The all-mpnet-base-v2 model is used to compute the BERT embeddings.
neighbors: 5, components: 5, min_cluster_size: 2.
The Adafactor optimizer is used with a weight decay of 5e-3 and a learning rate of 5e-6, except for ConvoSenseP with 1e-6. The max source length is set to 768. The max target length is set to 400 for P models and 128 for other models. All models are trained using bf16 for memory efficiency. P models use a prefix of “provide several reasonable answers to the question based on the dialogue: \n” and other models use a prefix of “provide a reasonable answer to the question based on the dialogue: \n”.
BertScore: microsoft/deberta-xlarge-mnli.
SentenceBert: all-mpnet-base-v2.
gpt-4-0613 with a temperature setting of 0.
The self-serve SurgeAI crowdsourcing platform previously used in Section 3.2 was discontinued during this work.
High unique percentages for P models are due to low-count inference output (average of 4.4 and 2.0 outputted inferences for ConvoSenseP and HumanGenP, respectively).
Action Editor: Yejin Choi