A Theoretically Grounded Question Answering Data Set for Evaluating Machine Common Sense

ABSTRACT Achieving machine common sense has been a longstanding problem within Artificial Intelligence. Thus far, benchmark data sets that are grounded in a theory of common sense and can be used to conduct rigorous, semantic evaluations of common sense reasoning (CSR) systems have been lacking. One expectation of the AI community is that neuro-symbolic reasoners can help bridge this gap towards more dependable systems with common sense. We propose a novel benchmark, called Theoretically Grounded common sense Reasoning (TG-CSR), modeled as a set of question answering instances, with each instance grounded in a semantic category of common sense, such as space, time, and emotions. The benchmark is few-shot, i.e., only a few training and validation examples are provided in the public release, to reduce the possibility of overfitting. Results from recent evaluations suggest that TG-CSR is challenging even for state-of-the-art statistical models. Due to its semantic rigor, this benchmark can be used to evaluate the common sense reasoning capabilities of neuro-symbolic systems.


INTRODUCTION
Developing machines with common sense reasoning (CSR) abilities is a longstanding challenge in the Artificial Intelligence community [1,2]. Current CSR benchmarks largely use multiple-choice question answering (QA) instances to evaluate machine common sense [3]. Unfortunately, many of these QA benchmarks have been constructed in an ad hoc fashion, with little evidence that they are grounded in a formal theory of common sense, such as the comprehensive theory proposed by Gordon and Hobbs [4]. Because the benchmarks are not theoretically grounded, they cannot be used to rigorously conduct a semantic evaluation of machine learning models on CSR. This is particularly true for evaluating neuro-symbolic systems that leverage deductive reasoning and formal logic [5,6]. Additional background on this theoretical grounding is discussed subsequently in Section 3.
The full benchmark is split into eight datasets, each presenting prompts associated with one of four unique themes and contexts (vacationing abroad, bad weather, camping, and dental cleaning). Two datasets are associated with each unique theme and context. One of the datasets provides prompts in a multiple-choice format, and the other provides equivalent prompts in a True/False format. The context is a short phrase that provides a broad topical interpretation or context for the questions. For example, the context used in the first benchmark released, illustrated in Figure 1, is "vacationing abroad". The theme provides background details, including premises, constraints, or motivations associated with the context. It can be understood as an instantiation of the context within a specific setting. For more information on the choice and creation of the datasets, see Section 4.

Figure 1. Example questions from the "vacationing abroad" dataset, with annotated examples of the usage of the taxonomy. An AI system would be expected to provide a binary answer (e.g., Yes/No) to indicate whether a candidate answer is a suitable response to the question. More details on benchmark construction and preliminary evaluation are provided in Sections 4.2 and 6, respectively.

Context: Planning a vacation abroad. Theme: After almost two years without a vacation, Chloe is taking a whole month off. She's planning a three-week trip with a few close friends. They always thought about visiting Europe's most famous destinations, including Paris and London.

[Figure annotation labels: Emotion concepts (direct); Time concepts (direct); Activity concept (instantiation)]
Questions and candidate answers for each question are always related to the theme and grounded in one of nine fundamental common sense categories selected from the Gordon-Hobbs theory (see footnote a). One of the objectives of TG-CSR is to serve as a tool for profiling and benchmarking the common sense reasoning capabilities of AI implementations. The representation of the Gordon-Hobbs categories provides a formal semantics for the benchmark, making TG-CSR one of the first and only benchmarks, to the best of our knowledge, that seeks to evaluate CSR systems in a theoretically grounded manner by referencing a theory that is based on models of human cognition and includes semantics as a first-class citizen.
Additionally, unlike existing CSR benchmarks, which are released in a single phase and are prone to overfitting by language models shortly after release, TG-CSR proactively preempts such overfitting. While the full benchmark is now available for download, we have additionally released datasets for two of the four contexts and themes (see footnote b) on a competition leaderboard. This release is further split into training, development, and test partitions. Labels are provided for the training and development partitions but withheld for the test partition, which attempts to mitigate observer bias. Hence, system developers looking to test implementations of their CSR approaches and theories can use the leaderboard and compare their performance against other submitted systems. Researchers looking to investigate novel approaches can instead download the full dataset and have more experimental control. For example, they can use the benchmark to test whether their CSR approach is sufficiently generalized (i.e., does equally well across all four contexts) or sensitive to format (i.e., performs reasonably well across both the multiple-choice and True/False formats).
The rest of this article is structured as follows. Section 2 provides a background and brief synthesis of related work. Section 3 goes deeper into data acquisition, especially the Gordon-Hobbs theory of common sense that underlies the manner of benchmark construction. Section 4 describes the construction methodology, including the methodology underlying the prompt (questions and answers) and ground truth construction. Section 5 provides details on the benchmark's release, recommended usage, and some relevant descriptive statistics. Section 6 provides some details on the benchmark's evaluation, showing that even advanced large language models perform significantly below human performance on TG-CSR. We also discuss a preliminary evaluation of TG-CSR in a generative setting. Section 7 discusses applications and limitations of the data. Finally, Section 8 concludes the work and provides some guidance on promising future avenues of research in this space.

BACKGROUND AND RELATED WORK
Multiple-choice QA has emerged as a de facto standard for evaluating machine common sense [9]. While there are likely several reasons for this, we identify three main ones: the first-mover effect of early organized efforts [10], the ease of conducting automated evaluations using multiple-choice QA benchmarks, and funder-driven effects (e.g., the DARPA Machine Common Sense program [11] primarily used multiple-choice QA to evaluate the progress of funded teams on adult common sense reasoning). The second of these is easiest to understand, since one of the advantages of multiple-choice tests, including standardized collegiate tests such as the Scholastic Aptitude Test or SAT, is that they can be automatically evaluated given a 'ground-truth' or answer key.

a These include "Time", "Space", "Scheduling", "World States", "Physical Entities", "Activities", "Goals", "Values and Quantities", and "Emotions".
b Specifically, on planning for "vacationing abroad" and "bad weather."
Concerning the first-mover effect, the Recognizing Textual Entailment challenges [10], going back to 2010 and 2011, are perhaps the earliest organized efforts that proposed to evaluate aspects of machine common sense reasoning in a way that resembles current methodology. Another early example is Task 7 of SemEval-2012 [12], which required systems to reason about causality using a publicly available multiple-choice benchmark [13]. These early competitions focused more on choice-based single-hop tasks and multiple-choice QA, with later work seeking to increase the difficulty of the competitions. Later benchmarks also often relied on inexpensive crowdsourcing techniques for their construction, which can raise questions about quality and completeness. Initial applications of large pre-trained language models, almost all based on transformer neural networks, such as BERT [22], GPT [23], and derivatives [24,25], were able to quickly come within striking distance of human performance (although a non-trivial margin remains) on many existing multiple-choice QA benchmarks. Recently, however, specific multiple-choice CSR benchmarks have come under criticism [26], and researchers have used adversarial attack techniques (among others) to show that gauging the true machine common sense ability of these language representation models is currently problematic [27].
Another reason for the original adoption of QA benchmarks is for measuring progress across state-of-the-art large language models. This adoption has contributed to the standardization of multiple-choice QA as the evaluation modality for assessing progress, as multiple performers had to submit results on these benchmarks using a common leaderboard-style protocol and accuracy-based metrics. Table 1 provides a list of some well-known benchmarks that have become prominent in the machine common sense literature, at least within the context of natural language processing research. A major issue with many of the benchmarks listed in Table 1 is that they are often broadly constructed, with only loose or even ad hoc grounding in rigorous common sense theories, such as those inspired by cognitive science. For example, the Social IQa benchmark purports to test social interactions [19], but as we subsequently argue, this covers a broad range of human abilities and common sense sub-categories. In contrast, our benchmark takes a semantic, rather than ad hoc, approach by using the ontological properties of categories (such as time and space) to guide the construction and evaluation of machine common sense.
Although some benchmarks exist that are specifically designed to test a model's capabilities in number estimation, causal reasoning, social reasoning, and more, their semantics are not clearly defined, and their content can be overly diffuse. For example, the Social IQa benchmark purports to test social interactions [19] but arguably covers a broad range of human abilities and common sense sub-categories that are not clearly defined. Without a clear semantic framework providing a formal basis for the content of the benchmark, one can only guess from the description (or manually determine the semantics by reading all of the thousands of questions, which is infeasible and non-scalable). Similarly, ATOMIC [28] is a dataset for evaluating CSR capabilities on questions involving if-then reasoning, using nine proposed relation types. However, it is not clear whether these relations are representative in terms of common sense reasoning coverage. Cosmos QA [29] is a benchmark for evaluating text comprehension that requires contextual common sense reasoning. Its questions and answers were collected from four broad categories, including "causes", "effects", and "facts", which are claimed to cover the nine categories in [28]. However, there is no evidence of the rigor with which these categories were used, nor of the formal semantics of the categories.
An example of a semantically grounded common sense resource is CycIC, a benchmark derived from the Cyc platform [2] containing sentences to be evaluated as true/false. While the dataset is annotated using some selected categories, including "norms", "theory of mind", and more, the semantics of these categories are under-defined and, more importantly, the role of each category in the creation of the sentences is not specified.
In general, the lack of rigorous construction methods (see Paullada et al. [30] for a review of methods used to create machine learning datasets) and the loose theoretical grounding of many existing benchmarks make it difficult to argue that a machine common sense system or a neuro-symbolic reasoner is either complete (at least, to the extent that completeness is understood in studies of human cognition) or that it has a grasp of the foundational aspects (such as the semantics) of common sense tasks and questions [31].
Research efforts reported in recent literature, such as the development of foundational ontologies like DOLCE [32], indicate that considerable progress has been made on the axiomatization of common sense reasoning. Our work is based on the research of Gordon and Hobbs [4], who performed a comprehensive study of representational requirements for strategic planning and created a formal theory that includes axioms representing various aspects of common sense reasoning.

ACQUISITION OF THE DATA
One challenge preventing the sound evaluation of machine common sense reasoning is the lack of a formal, unified representation of the various types of reasoning associated with human common sense thinking. In contrast, TG-CSR is designed by grounding its questions and answers in the semantics of nine categories derived from the 48 representational areas defined in the Gordon-Hobbs theory. The first four are from the set of foundational areas: "time", "space", "objects" (we call these "physical entities"), and "values and quantities". They were selected based on a recent experiment with human annotators that resulted in higher agreement when using these representational areas as labels [33]. The next four areas, "activities", "goals", "scheduling" (all intuitively related to "events"), and "states" (we call these "world states"), were also selected because of high agreement in the same experiment. The ninth and final representational area included in our research is "emotions", which was included as a way of evaluating at least one psycho-social aspect of CSR due to its importance.
As previously mentioned, some of these foundational areas have been extensively explored in knowledge representation efforts, including in the Semantic Web [34]. Specifically, DOLCE is potentially the most prominent example of world modeling that aims to incorporate some common sense aspects in the Semantic Web. However, to the best of our knowledge, these efforts have not been leveraged collectively to develop a joint machine-interpretable representation that can serve as a formalization of human common sense reasoning.
This section provides an overview of an initial representation of common sense reasoning as an ontology. We start from the Gordon-Hobbs theory, focusing on nine representational areas and their associated terms. While we do not claim completeness, we review and consider the incorporation of certain existing taxonomies, vocabularies, and ontologies toward an envisioned unified representation of common sense. For the remainder of this article, we refer to "representational areas" as "categories" and to "terms" as "concepts" when used in the context of our knowledge representation and data annotation efforts.

Contributing Theories
The Gordon-Hobbs theory defines a representational area of common sense as a cluster of related terms (concepts). Collectively, the 48 representational areas or categories presumably cover all of the identified aspects of human common sense thinking and address the three main modes of reasoning: deduction, induction, and abduction. Gordon and Hobbs defined numerous predicates and axioms for their categories. We note that their representation is similar to, and in some cases based on, prominent knowledge representation and Semantic Web ontologies and vocabularies. Figure 2 denotes the ontologies and vocabularies that we have identified as overlapping with each of the nine selected categories underlying TG-CSR. We also provide an example of how they are used in the questions and answer options contained in the benchmark.
For example, the "Time" category is closely aligned with the owl-time ontology [34]. Both the Gordon-Hobbs theory of time and owl-time base their formalization on Allen's theory of action and time [35]. Concepts like duration, end time, and start time are defined in owl-time, as are concepts expressing relationships between time entities, such as before.
Although one could intuitively assume the "Space" category to be similar to "Time" in terms of its pervasiveness, the most noted representation of "Space" is associated with geography. GeoSPARQL [36] implements several topological relations, including concepts such as intersects, within, and distance. Some of these predicates map onto concepts of the Gordon-Hobbs theory of "Space".
For the "Emotions" category, Gordon and Hobbs derived many of the concepts from the work of Ortony et al. [37] on basic emotions. An existing emotion ontology (MFOEM) [38] associates many of these basic emotion types with a wide range of other emotions.
The "Physical Entities" category is concerned with physical objects in the world, including their shapes and composing parts. The Semanticscience Integrated Ontology (SIO) [39] is a well-established model that was initially developed to support biomedical research. It provides an interesting formalization for material entities that includes concepts like has part, is part of, and surrounds.
The "Values" category denotes the use of quantities to measure things, whether quantitative or qualitative. It includes concepts like quantity, value, and range. The XML Schema Definition (XSD) [40] supports the use of data types inside XML documents, including ontology definitions. Moreover, XSD formalizes some relationships between quantities, including maximum and minimum. The Units Ontology (UO) [41] provides several units of measurement that can be used to characterize values and quantities.

Figure 2. The Machine Common Sense taxonomy used for TG-CSR QA construction. The nine categories are represented in boxes, with the associated existing ontologies and vocabularies in ellipses. For each category, the identified elements in the question are linked to the respective formalization. The corresponding term from the Gordon-Hobbs theory is in bold. For the remaining categories, although terms are not explicitly in the text, they can be inferred and are also represented in bold. Because all candidate answers are derived from terms in the "Emotions" category, this specific question is categorized as "Emotions" in the benchmark, denoted by the gray box.

The remaining categories of "Goals", "Scheduling", "States", and "Activities" are closely related to "Events". Concepts in these categories include activities, their agents, objectives, and their organization. The PROV Ontology [42] provides a formalization that can be used to model activities and associated agents. It also contains predicates that allow the explicit representation of precedence between activities.

Terms
Gordon and Hobbs acknowledge that "concepts in common sense knowledge cannot be defined precisely with necessary and sufficient conditions". In their theory, they identify a list of terms that express concepts associated with each representational area. For instance, in the "Time" area, one term is before, defined as "A moment with a lesser position in an ordered set of moments." They formalize before as a predicate, e.g., (before t1 t2), where some time t1 is before some time t2.
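As a purely illustrative sketch (the theory itself is stated in first-order logic; the class and method names below are our own choices, not part of the theory or the benchmark), the before predicate over an ordered set of moments could be modeled as:

```python
# Minimal sketch of a Gordon-Hobbs-style temporal predicate over an
# ordered set of moments. Names here are illustrative assumptions.

class Timeline:
    def __init__(self, moments):
        # 'moments' is an ordered list of moment labels, e.g. dates.
        self.position = {m: i for i, m in enumerate(moments)}

    def before(self, t1, t2):
        # (before t1 t2): t1 has a lesser position than t2
        # in the ordered set of moments.
        return self.position[t1] < self.position[t2]

timeline = Timeline(["June 1st", "June 8th", "June 21st"])
print(timeline.before("June 1st", "June 21st"))  # True
print(timeline.before("June 21st", "June 8th"))  # False
```

The same pattern extends to other ordering terms in the "Time" area (e.g., an after predicate would simply reverse the comparison).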
In our work, the formalism defined by Gordon and Hobbs was used to support the construction of questions and answer candidates. Several questions created for the "Vacationing Abroad" dataset are displayed in Figure 1. One question asks: "How did Chloe feel after removing destinations in France from her trip?" A plausible answer is "sad", which is a defined emotion concept with a defined predicate, (sad p), implying that some person p is sad. This question is placed in the "Emotions" category along with similar questions that can be answered by the answer options associated with various emotion predicates. Another example in Figure 1 is a question from the "Time" category with instantiated time concepts as candidate answers. The question itself contains time concepts used directly, as well as an instantiated activity concept.

METHODOLOGY
Given the taxonomy introduced in the previous section, as well as a theme and context, we constructed questions and candidate answer sets for the multiple-choice dataset using the nine categories in the taxonomy. Additionally, we generated a True/False version of each dataset by systematically converting each of the question and (set-based) answer pairs in the original multiple-choice dataset to a binary True/False format.
Before describing the construction methodology, we note some important desiderata that need to be fulfilled by the QA portion of the benchmark:

• The QA instances must be designed for rigorously assessing performance on the major Gordon-Hobbs categories, such as time and space. They must also have a clear machine-readable structure and format.
• The QA instances must be constructed using clear annotation guidelines to generate reliable ground truths. Furthermore, the ground truths must be stable and exhibit high human agreement.
• Ideally, the QA instances should be amenable to assessment by both generative and discriminative language representation models (LRMs) [43,44], although the former is expected to take more manual effort, unless metrics inspired by communities such as machine translation (e.g., BLEU [45]) are used, with an understanding of their limitations for evaluating potentially open-ended answers.
• Given that there are already so many QA benchmarks with 'training' and 'development' sets openly available (see Table 1), and given the general nature of common sense reasoning, the proposed benchmark should be few-shot. In other words, the benchmark must be designed such that good performance from LRMs or other statistical models cannot be explained through careful benchmark-specific fine-tuning alone. In effect, this forces the model to rely on more fundamental aspects of common sense, including common sense theory and semantics, to truly perform well on the QA instances.
• Finally, given that many LRMs and other such systems often work better with binary QA modalities (such as Yes/No and True/False), especially if only a pre-trained version is used, the benchmark must also be available in such a modality. This is the primary reason that we release the benchmark using two modalities, as described in the next section.
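The first and last desiderata lend themselves to simple automatic scoring. As a hedged sketch (the official metrics are described in Section 6; the functions and answer IDs below are our own illustrations, not the benchmark's evaluation code), plain accuracy fits the True/False modality, while a set-based F1 fits the multiple-set questions:

```python
# Illustrative scoring functions for the two TG-CSR modalities.
# These are sketches under our own assumptions, not official metrics.

def tf_accuracy(predictions, gold):
    # predictions/gold: lists of booleans, one per T/F prompt.
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def answer_set_f1(predicted, gold):
    # predicted/gold: sets of answer IDs judged "correct" for a
    # multiple-set question (more than one answer may be correct).
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(tf_accuracy([True, False, True], [True, True, True]))  # ~0.667
print(answer_set_f1({"a1", "a3"}, {"a1", "a2", "a3"}))       # ≈ 0.8
```

Set-based F1 rewards a system for recovering all and only the correct answers, which matches the multiple-set design described in Section 4.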

Question-Answer (QA) Construction
Figure 3 summarizes the process that was used to create the datasets and associated ground truth for the benchmark. The guiding motivation behind our QA construction methodology was to emulate complex human interactions in real life, where multiple decisions involving common sense happen in a specific setting, often in a multi-hop manner. In support of this, we developed four contexts and themes. For each context/theme pair, several questions were created that require information contained in the respective context/theme to be answered correctly. This makes the problem more realistic, but also potentially challenging for purely statistical models that do not rely on context and instead treat each question as a standalone input. Using the taxonomy introduced earlier, benchmark developers created questions and candidate answers, both "correct" and "incorrect". We evaluated the usefulness of the foundational categories identified by Gordon and Hobbs as core content descriptors in our previous research measuring human annotator agreement on common sense sentences [33]. Category membership was determined by the answer, and the answer option was influenced by concepts that were incorporated into the taxonomy. For example, in the first question provided in Figure 1, the answer options are all examples of an emotion. In the second question, the answer options are all examples of a potential start time, but not all of these options can satisfy the duration constraints (e.g., three weeks) specified in the question. However, because all answer options are temporal, the question is categorized as "Time".
In contrast with existing benchmarks, the questions were created as multiple-set, rather than multiple-choice, since there could be more than one correct answer per question. This also makes the benchmark more realistic. The development of the dataset went through several iterations. Initially, each question had its own set of answer options. Although there could be multiple "correct" answers, for any given question, at least one of the answers in the answer option list had to be a "correct" answer. In order to create a set of answer options that could apply to all of the questions in a category, answers from all questions in a given category were combined into a global per-category set. Duplicate answers were removed and, in a few specific cases, the questions or the answers were rewritten. To support benchmark data analysis, a question ID was generated for each question and an answer ID was generated for each answer option.
All benchmark data (questions, per-category candidate answer sets, contexts, themes) and metadata (IDs, category annotations) were constructed in spreadsheets to support benchmark development and annotation. Then, to facilitate LRM experimentation and to make the dataset amenable to a competition-style leaderboard, all benchmark data was converted into several structured files (using JavaScript Object Notation, or JSON) that can be ingested and processed by a program relatively easily.
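As an illustration of this pipeline, the sketch below merges per-question answer options into a deduplicated per-category pool, assigns question and answer IDs, and serializes the result as JSON. The field names, ID formats, and the second example question are hypothetical assumptions for illustration; they are not taken from the released files:

```python
import json

# Hypothetical input: category -> question text -> its original answer
# options. The second question is invented for illustration only.
questions = {
    "Emotions": {
        "How did Chloe feel after removing destinations in France "
        "from her trip?": ["sad", "frustrated"],
        "How did Chloe feel when booking her flights?": ["happy", "excited"],
    },
}

benchmark = {}
for category, qs in questions.items():
    # Merge all answer options in the category into one deduplicated pool.
    pool = sorted({opt for opts in qs.values() for opt in opts})
    benchmark[category] = {
        # Assign an answer ID to each pooled option (format is assumed).
        "candidate_answers": {f"a{i}": opt
                              for i, opt in enumerate(pool, start=1)},
        # Assign a question ID to each question (format is assumed).
        "questions": {f"q{i}": text for i, text in enumerate(qs, start=1)},
    }

print(json.dumps(benchmark["Emotions"]["candidate_answers"], sort_keys=True))
```

The key property this mirrors from the text is that every question in a category shares the same global candidate-answer set, so most options will be irrelevant distractors for any single question.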

True/False (TF) Prompt Construction
The prompts in each True/False dataset are based on the question-answer pairs in the multiple-choice (MC) dataset for the same context. In a multiple-choice dataset, each category contains 2-5 questions, each with 8-13 answer options. Figure 4 displays the distribution of prompts per category for the datasets developed for each context.
In order to generate a T/F version of the dataset, a T/F prompt was created for each question-answer pair in the MC dataset by pairing an often slightly revised version of the MC question with one of its answer options. If the MC question had 10 answer options, then 10 T/F prompts were created. Table 2 contains some MC to T/F conversion examples for questions in the "Emotions" category across all contexts. In the table, the name of the context is provided in the first column. The MC version of the question is presented in the second column. The third column contains each of the answer options that are provided in the MC version. The last column contains the T/F prompt associated with the MC question and one MC answer option. The value of the answer option is displayed in bold type in the T/F prompt. Note that in many examples, the MC question was rewritten to yield a properly worded T/F prompt.
Table 2. Examples of MC to T/F conversion for questions in the "Emotions" category across contexts (the answer option appears in bold in the original table).

Camping:
- Fear: If Linda encounters a bear at their campsite, she will feel fear.
- Happy: If Linda encounters a bear at their campsite, she will feel happy.

Dental Cleaning. MC question: When Mary was booking the appointments, she was surprised to learn that her normal hygienist, Carol, was planning a vacation when she wanted her appointment. How did Mary feel when she learned this information?
- Happy: Mary felt happy when she learned that her normal hygienist, Carol, was planning a vacation when she wanted her appointment.
- Sad: Mary felt sad when she learned that her normal hygienist, Carol, was planning a vacation when she wanted her appointment.

Bad Weather. MC question: Kim and Bill were visiting relatives for a few days and came home to find a big tree laying on the roof of their house. How did they feel when they saw the tree laying on their roof?
- Upset: Kim and Bill were visiting relatives for a few days and came home to find a big tree laying on the roof of their house. They were upset when they saw the tree laying on their roof.
- Happy: Kim and Bill were visiting relatives for a few days and came home to find a big tree laying on the roof of their house. They were happy when they saw the tree laying on their roof.

Vacationing Abroad. MC question: How did Chloe feel after removing destinations in France from her trip?
- Frustrated: After removing destinations in France from her trip, Chloe felt frustrated.
- Sad: After removing destinations in France from her trip, Chloe felt sad.

When generating the T/F prompts, our goal was to find a single transformation of the MC question that would yield meaningful T/F prompts for all of the answer options associated with a single given MC question. For example, one of the questions asks: How did Chloe feel after removing destinations in France from her trip? The generated prompt is: After removing destinations in France from her trip, Chloe felt frustrated. The term frustrated is one of the answer options, and this term is replaced by each of the other answer options to generate all of the T/F prompts associated with the question. In this case, we also note that all of the answer options are single terms that describe an emotion.
In other categories, such as the "Time" category, the transformation was more challenging because the answer options, while valid instances of time, are not equivalent instances of time. For example, in the "Time" category, some of the answer options are examples of dates (June 1st), while other answer options are instances of time intervals (1/3 of the day). This complicates the transformation, resulting in T/F prompts that are not all equally meaningful. For example, consider the following question: Given that Chloe's vacation starts June 1st, and she only has three weeks of vacation, when should her flight depart? For this question, one of the answer options is June 1st. To create a T/F prompt, the MC question was transformed into a statement that specifies when Chloe's flight can depart, with June 1st offered as an option: Since Chloe's vacation starts June 1st, and she only has three weeks of vacation, her flight can depart on June 1st. The transformation is then applied to each of the other answer options. When the answer option is 1/3 of the day, the T/F prompt becomes: Since Chloe's vacation starts June 1st, and she only has three weeks of vacation, her flight can depart in less than 1/3 of the day. Although this sentence does not make as much sense as the June 1st example, it can still be evaluated by an annotator as True or False.
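The transformation described above amounts to instantiating one declarative template per MC question, once per answer option. A minimal sketch, using the vacationing-abroad example from the text (the function name and template slot are our own illustrative choices):

```python
# Sketch of the MC-to-T/F conversion: a single declarative template per
# MC question is filled in once per answer option. The template text is
# the example discussed above; the helper name is hypothetical.

def to_tf_prompts(template, answer_options):
    # 'template' contains one {answer} slot; one T/F prompt per option.
    return [template.format(answer=opt) for opt in answer_options]

template = ("After removing destinations in France from her trip, "
            "Chloe felt {answer}.")
prompts = to_tf_prompts(template, ["frustrated", "sad"])
for p in prompts:
    print(p)
```

As the time-category example shows, a single template cannot always produce equally natural prompts when the answer options are of heterogeneous types (dates versus intervals), which is why some prompts required manual review.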
Because the set of answer options for each question in a category tab was constructed using the answer options for all of the questions in that tab, there are cases where some of the answer options are obviously not relevant to the question. For example, in the "Physical Entities" tab, one question asks: What kind of foods can Chloe have for lunch at a museum? while another question in that tab asks: Chloe and her three friends Jacky, Caitlin, and Joann plan to go shopping, but Joann feels sick before going out; who should stay in the room and take a rest? The answer options for all of the questions in this tab include food items such as sandwiches, salad, steak, and wraps, as well as the names of people: Chloe, Caitlin, Jacky, and Joann. In the question What kind of foods can Chloe have for lunch at a museum?, the likely answers in the MC version are the food items. However, the equivalent T/F prompt must be written to address all of the answer options. The transformation thus created was: At a museum Chloe can have sandwiches for lunch. This transformation worked for all of the answer options except the answer option equivalent to Chloe. It worked for the options of Caitlin, Jacky, and Joann because one can have a friend for lunch (as in eating lunch with a friend), but because there is only one Chloe in the theme, the T/F prompt At a museum Chloe can have Chloe for lunch is considered awkward.
We manually evaluated the goodness of each of the T/F prompts. When there was a question about the meaning of a prompt, it was color coded, with yellow indicating that the T/F prompt does not make sense; light blue indicating that answers to the T/F prompt will likely differ from the ground truth results obtained in the MC annotation efforts; and tan indicating that, although the T/F prompt is oddly worded, it can likely be answered as True or False. Table 3 provides an example of a set of T/F prompts from one of the categories with color highlighting. All of the highlighted prompts were reviewed by several of the developers. In the dataset with the "vacationing abroad" context, prompts that were highlighted in yellow were removed because they were determined to be nonsensical. A total of five prompts were removed: four from the "Goals" category and one from the "Physical Entities" category.
Knowing that the multiple-choice question-answer pairs would be converted to a T/F format influenced how we developed the prompt data for the other contexts. As a result, we did not delete any T/F prompts in the other contexts.

Ground Truth Development
Following initial QA construction, as discussed earlier, the ground truth for the benchmark was derived by averaging the annotations of a group of human annotators. This allowed us to verify high agreement (correlation between annotators was greater than 0.9) and to obtain a more statistically rigorous annotation per QA instance. A more technical description of the averaging procedure is provided at the start of Section 6.
Additionally, to prevent possible bias that could influence the annotations, the category names of the questions were obfuscated in the distributed annotation spreadsheets. In the spreadsheets used by the annotators, one tab per category was provided, with all questions and answers in their respective tab. However, each tab label was replaced by a generic label such as "QuestionTab1" instead of (for example) "Time". We created and provided a concise set of guidelines to the human annotators. In these guidelines, the human annotators were asked to read the context and theme before answering the questions in each spreadsheet tab. For the MC data, the annotators were instructed to evaluate each answer option per question on a Likert scale of 1-4 (4: very good fit; 3: good fit; 2: not sure; and 1: bad fit). For the T/F data, annotators were asked to evaluate each prompt as True (T) or False (F). Annotators could optionally provide comments about uncertainty in answering a question. We took these comments into account when refining benchmark construction.

Table 3. Example T/F prompts with color highlighting. Yellow indicates that the T/F prompt does not make sense; light blue indicates that answers to the T/F prompt may differ from the ground truth results; and tan indicates that the T/F prompt is oddly worded but can likely still be answered as True or False. Example prompts: Chloe and friends have a reservation at 6pm for dinner at a restaurant. They need to ride the subway to get there. Since they need to ride the subway to get there, a good time to depart for the restaurant is June 30th. Chloe and friends have a reservation at 6pm for dinner at a restaurant. They need to ride the subway to get there. Since they need to ride the subway to get there, a good time to depart for the restaurant is between 1 and 2 hours.

BENCHMARK RELEASE, RECOMMENDED USAGE, AND STATISTICS
To make TG-CSR public-facing, we provide an overview on a website, along with data access instructions and a link to the competition leaderboard. The "vacationing abroad" and "bad weather" contexts have also been released on a leaderboard platform, called CodaLab, as shown in Table 4.
The license specification of the resource, and other corresponding links and details (including the website and competition link), are provided in Table 4. To access the TG-CSR dataset for the competitions, users are required to create a free CodaLab account and register for the competition. Once the registration is approved, users can log in and participate in the competition by downloading the Starting Kit and Public Data. The Starting Kit contains a detailed description of the dataset to help users understand the file structure and formats. It also contains starter code for validating that a file is in a format that the leaderboard can accept as a submission. CodaLab automatically evaluates and scores the submissions, allowing users to benchmark the performance of their submissions in real time.
The Public Data contains the dataset for the "vacationing abroad" context, including the training, development, and test sets. Since TG-CSR is initially designed to be a few-shot problem, we provide labels for the training and development sets, withholding labels for the test set. Users have to submit their predictions for this test set. However, to facilitate greater experimentation, we have separately released the full four-context benchmark (with corresponding ground truths for both the MC and T/F versions) in a Zenodo data repository, as listed in Table 4. Because TG-CSR is meant to be a few-shot benchmark, we recommend that, if developers split the questions into training and testing sets, the latter be much larger than the former. In general, the optimal way to use the benchmark for measuring machine common sense reasoning is to use only one of the contexts (e.g., vacationing abroad) for gaining familiarity with the dataset and its format, and for verifying performance against that obtained on the leaderboard. Questions in the other contexts should then be used in a zero-shot fashion (i.e., be considered as test instances).

Benchmark Statistics
For the "vacationing abroad" context, there are a total of 331 QA pairs in the dataset. In the leaderboard version, instead of returning a Likert-scale score of 1, 2, 3, or 4, each submitted model should only return Yes/No predictions for each QA pair, with Yes (encoded as 1) indicating that the provided answer is a very good fit or a good fit for the corresponding question, and No (encoded as 0) indicating not sure or a bad fit. The training set, development set, and test set contain 81, 77, and 173 QA pairs, respectively. Only the test set does not currently include publicly available labels on the competition leaderboard, although these labels can be obtained from the full benchmark published in the Zenodo repository. The leaderboard version of the "bad weather" context is similar. As noted earlier, the other two contexts have not been released on the leaderboard, but they can be accessed in the Zenodo repository and should ideally be used to evaluate zero-shot CSR. The overall sizes and distributions of these contexts are similar. Earlier, in Figure 4, we compared the per-category distributions of these benchmarks.

BENCHMARK EVALUATION
We use the F1-score for reporting the performance of submitted systems on the leaderboard. To evaluate human performance, we used a strategy inspired by leave-one-out cross-validation. Given k human annotators who had each independently graded a QA instance on a scale of 1-4, as described earlier, we first 'collapsed' 3 and 4 into the Yes label, and 1 and 2 into the No label. This is standard practice when using a finer-grained annotation scheme than the actual labeling scale needed (e.g., quaternary vs. binary, as in this exercise). Next, we use the labels of each annotator in turn as the ground truth, and take the mode of the remaining k−1 annotators as the human prediction. We compute an F1-score for this prediction. Repeating the exercise for each annotator ultimately yields k F1-score estimates, the average of which is used as the human performance.

Table 5. TG-CSR benchmark statistics for both the MC and T/F modalities across all four contexts. The training, development, and testing set partitions shown for the MC version of the "vacationing abroad" and "bad weather" contexts correspond to the version released on the CodaLab leaderboard. The Zenodo repository does not partition the prompts.

In the "vacationing abroad" context, the human performance (using the procedure above) was found to be 80.5% for the overall dataset (training, development, and testing), and 79.9% for only the test set, illustrating high consistency. The numbers are similar for the "bad weather" context, although human performance is slightly higher (82.7% for the overall dataset, and 83.8% for only the test set). More details are provided on the website. In contrast, a random baseline, which is computed based on the number of questions in the test set, was found to achieve only a 35% F1-score. These human performance estimates are similar to those observed for the other CSR benchmarks being used in the community, and are further evidence of the validity of the TG-CSR construction.
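The leave-one-out human performance computation described above can be sketched as follows; the Likert grades are invented for illustration, and ties in the mode are broken by annotator order, which is our own simplifying assumption:

```python
from statistics import multimode

def f1(preds, golds):
    """Binary F1-score for the Yes (1) class."""
    tp = sum(p == 1 and g == 1 for p, g in zip(preds, golds))
    fp = sum(p == 1 and g == 0 for p, g in zip(preds, golds))
    fn = sum(p == 0 and g == 1 for p, g in zip(preds, golds))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

# Hypothetical Likert grades (1-4) from k=3 annotators over 6 QA instances.
grades = [
    [4, 1, 3, 2, 4, 1],
    [4, 2, 4, 1, 4, 1],
    [3, 1, 2, 1, 4, 2],
]

# Collapse 3/4 into the Yes label (1) and 1/2 into the No label (0).
binary = [[1 if g >= 3 else 0 for g in row] for row in grades]

# Each annotator in turn serves as the ground truth; the mode of the
# remaining k-1 annotators is the "human prediction".
scores = []
for i, gold in enumerate(binary):
    rest = [row for j, row in enumerate(binary) if j != i]
    preds = [multimode(col)[0] for col in zip(*rest)]
    scores.append(f1(preds, gold))

human_performance = sum(scores) / len(scores)  # average of the k estimates
```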
We also benchmarked the "vacationing abroad" dataset using the family of T0* language representation models introduced in a recent paper [48]. These models are prompt-based encoder-decoder models fine-tuned on a large collection of datasets (also described in [48]) in a multitask manner. The T0 models have demonstrated solid zero-shot performance on various natural language processing tasks, including popular CSR benchmarks, making them a good candidate for evaluating our benchmark. We experimented with two sizes of these models, T0-3B and T0-11B, which contain 3 billion and 11 billion parameters, respectively. Moreover, to obtain the best possible performance on the benchmark, we used different versions of the 11B-parameter models and report the best results in Table 6.
To briefly summarize these models, the T0-11B models were pretrained on a set of Multiple-Choice QA, Extractive QA, Closed-Book QA, Structure-To-Text, Sentiment Analysis, Summarization, Topic Classification, and Paraphrase Identification benchmarks. The T0p-11B model had the same training benchmarks as T0-11B, with additional datasets from GPT-3's [46] evaluation suite. Finally, the T0pp-11B model had the same training benchmarks as T0p-11B, with a few additional datasets from the SuperGLUE benchmark [47].

Table 6. Baseline results of the T0* language representation models on the "vacationing abroad" context. As illustrated, the largest model with the most pretraining data (T0++) achieves the highest score, which is still far from human performance, showcasing the utility of TG-CSR. Since releasing the "vacationing abroad" context on the CodaLab leaderboard in the summer of 2022, we have not observed any better-performing submissions from 13 participants at the time of writing.

To apply each model to our benchmark, we put the context, question, and proposed answer in the input prompt. A model then classifies the prompt as correct or incorrect. As shown in Table 6, the highest F1-score of 60.3% was achieved by the T0++ model. While promising, this result also suggests that there is a considerable distance between a reasonable language representation model's performance and human performance (79.9%). Consistent with previous studies [49,48,47], the largest model with the most pretraining data (T0++) achieved the highest performance on the benchmark. Further analysis of the models' predictions on TG-CSR suggests that they are biased toward classifying most of the question-answer pairs as "No", which explains the low F1 scores. It also validates the use of the F1-score, rather than simple accuracy (the fraction of correct answers). This observation further emphasizes the importance of creating more diverse datasets and benchmarks as resources for both training and evaluating such models. It also provides evidence supporting our earlier claim that, by virtue of being more theoretically and semantically grounded, TG-CSR is currently more challenging for purely statistical models and may be a good fit for rigorously evaluating more neuro-symbolic approaches.
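Why the F1-score exposes this "No" bias where accuracy would not can be seen with a small worked example (the label distribution below is invented for illustration):

```python
# Hypothetical gold labels (1 = Yes, 0 = No) for ten QA pairs.
gold = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# A degenerate model that is biased toward answering "No" everywhere.
pred = [0] * len(gold)

accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)

tp = sum(p == 1 and g == 1 for p, g in zip(pred, gold))
fp = sum(p == 1 and g == 0 for p, g in zip(pred, gold))
fn = sum(p == 0 and g == 1 for p, g in zip(pred, gold))
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

# Accuracy looks respectable (0.7 here) while the F1-score on the Yes
# class is 0.0, exposing the bias that accuracy alone would hide.
```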
Given the ubiquity and cutting-edge performance of ChatGPT, we also evaluated its underlying model (GPT-3.5-turbo) on our benchmark competitions. While GPT-3.5-turbo is fundamentally a generative model, its utility extends to discriminative tasks as well. For example, in the MC modality, we incorporate system content in the message request dispatched to the ChatGPT API (which utilizes the GPT-3.5-turbo model), specifying: You are a helpful assistant that answers multiple-choice questions. Select all options that apply. This prompts the model to answer in the context of the specific task and MC question presented.
In the case of the TF modality, we prompt the system in the following manner: You are a helpful assistant that answers TF (true/false) questions. This drives the model's responses to be in line with the context of the specific task and TF statement. We append a question at the end, Is this statement true or false?, to solicit a true/false answer from the model. Specific prompts for the MC and TF competitions, which can be used directly to make requests to the ChatGPT API, can be found in the Supplementary Materials to facilitate future replication studies.
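As a sketch, requests for the two modalities can be assembled as follows; the system strings follow the quotes above, but the helper function, the numbered option formatting, and the example inputs are our own illustrative choices rather than the exact prompts in the Supplementary Materials:

```python
def build_messages(modality, text, options=None):
    """Assemble a chat-style message list for the MC or TF modality."""
    if modality == "MC":
        system = ("You are a helpful assistant that answers multiple-choice "
                  "questions. Select all options that apply.")
        # Number the answer options beneath the question.
        user = text + "\n" + "\n".join(
            f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    else:  # TF
        system = ("You are a helpful assistant that answers TF (true/false) "
                  "questions.")
        user = text + " Is this statement true or false?"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

mc = build_messages("MC",
                    "What kind of foods can Chloe have for lunch at a museum?",
                    ["sandwiches", "salad", "steak"])
tf = build_messages("TF", "At a museum Chloe can have sandwiches for lunch.")
# Either list can then be passed as the `messages` argument of a
# chat-completions request (e.g., with the "gpt-3.5-turbo" model).
```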
As detailed in Table 7, we assessed the performance of GPT-3.5-turbo on the "bad weather" and "vacationing abroad" contexts in both the MC and TF modalities. The model achieved F1-scores of 0.651 and 0.717 in the MC modality for "vacationing abroad" and "bad weather", respectively. In the TF modality, the performance was relatively lower, with F1-scores of 0.479 and 0.683 for "vacationing abroad" and "bad weather", respectively.

Table 7. The performance of the GPT-3.5-turbo model, the underlying architecture for ChatGPT, evaluated on the "bad weather" and "vacationing abroad" MC and TF contexts.

                        Vacationing Abroad   Bad Weather
Multiple-choice format  0.651                0.717
True/false format       0.479                0.683

These scores, while superior to the T0++ model's performance, still do not reach human performance levels (0.799 and 0.838 for the "vacationing abroad" and "bad weather" benchmarks, respectively). While the results suggest that larger, more extensively pretrained models such as GPT-3.5-turbo tend to yield better results, the difference in performance between the MC and TF formats also suggests a possible model bias toward the MC format. The differential performance across these formats further supports the argument that TG-CSR, with its more complex, theoretically grounded challenges, may be a better gauge for evaluating future neuro-symbolic language models.

Preliminary Evaluation of TG-CSR in a Generative Setting
We further evaluated our benchmark in a generative format, which fits better with the language modeling aspect of these models. With the increased popularity of GPT-3 [43], and the recent release of models such as ChatGPT, generative models are rapidly gaining prominence in the AI community. Although TG-CSR is meant for evaluating machine common sense in a discriminative setting (since answer choices are always provided in both the MC and T/F settings), the evaluation protocol can be slightly modified to facilitate generative evaluations. In conducting such a preliminary modification, we found that the results suggest some interesting possibilities.
Although the performance of the language models in Table 6 is well below human performance, we observed in subsequent analysis that the language models generally returned "No" for a QA instance, even when human annotators considered it to be a "Yes". This raises the question of whether, given the opportunity, a generative language model might return a good answer for a question when it is not prompted with pre-determined answer choices.
To evaluate the generative performance of a language model on TG-CSR, we again used the T0++ model [48], but in a generative manner. The prompt template, and an example of model input, is provided in Figure 5. As shown therein, we concatenated the theme and question and conditioned the model output on the prompt template; this is a common practice in the generative use of these models [48]. Since many of the generated answers did not match any of the answer options provided, we performed a manual review of the results. To measure the accuracy of a generative answer, we first look for exact or almost-exact matches between the generative answer and any answer options associated with a question. If there is a match, we then evaluate it by checking the related ground truth values. In the experiment, six generative answers closely matched a "Yes" (or good-fit) ground truth answer option in the following categories: Goals, Physical Entities, and World States; one generative answer exactly matched one of the ground truth "No" answer options in the "Values and Quantities" category. Further review of the answers that did not closely match any of the answer options (whether "Yes" or "No") showed that many of the generated answers were plausible fits, although there were also a few implausible fits. For example, the model generated "upset" as an answer for the first question in Figure 1. This value is not one of the provided answer choices, but it comports well with the other answer choices, such as "frustrated" and "angry".
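Our review of matches was manual, but the exact-or-almost-exact matching step could be approximated automatically with a string-similarity ratio; the 0.9 threshold and the option list below are illustrative assumptions, not the criterion we actually applied:

```python
from difflib import SequenceMatcher

def match_option(generated, options, threshold=0.9):
    """Return the answer option most similar to the generated answer,
    or None if no option is an exact or almost-exact match."""
    generated = generated.strip().lower()
    best, best_ratio = None, 0.0
    for option in options:
        ratio = SequenceMatcher(None, generated, option.strip().lower()).ratio()
        if ratio > best_ratio:
            best, best_ratio = option, ratio
    return best if best_ratio >= threshold else None

options = ["frustrated", "angry", "excited"]
exact = match_option("frustrated", options)  # exact match -> "frustrated"
plausible = match_option("upset", options)   # plausible but unmatched -> None
```

In this sketch, "upset" would be flagged for manual review rather than scored automatically, mirroring the plausible-but-unmatched answers discussed above.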
There were, however, a few generated answers that did not match any available answer option and were not considered plausible answers. One example is the generated answer of "London" to a Goals category question. While preliminary, the results of this experiment suggest that, in the future, generative language models could potentially be used to aid in QA construction by supplementing human answers.

APPLICATION AND LIMITATIONS OF THE DATA
During the development of questions and associated answer options, we observed some differences in opinion, among both the dataset developers and the annotators, on what is (or is not) considered to be common sense. For example, while creating questions and answer options about how to pack a suitcase, we encountered different preferences for the items to pack and for the packing order. We assume that the source of the difference is likely associated with differences in the age, sex, and cultural background of the dataset developers. Similarly, in another question about bringing food for a picnic, we observed differences in what types of food and drinks were considered good examples of common sense choices; e.g., bringing wine was considered questionable by some (in the United States, and many other countries, consumption of alcohol in open public spaces like parks is usually illegal). When we could not reach agreement on a particular QA instance, we removed the QA instance from the dataset prior to release. However, our observations suggest, perhaps unsurprisingly, that knowledge of a personal or cultural type can intersect in non-trivial ways with common sense knowledge, and the two are not always easy to distinguish theoretically. In the picnic example, for instance, the common sense knowledge appears to exist at a higher level of abstraction.

More experiments are needed to understand these observations in depth. If common sense knowledge indeed exists at a higher level of abstraction, it might suggest that, rather than providing physical entity values as answer options for these questions in the dataset, it might be more appropriate to provide answer choices that are less granular or more abstract, e.g., a drink versus a soda, a fruit versus strawberries, and so on.

In a previous experiment [33], annotators reported difficulty in classifying sentences about human emotions. However, in the development of the "vacationing abroad" and other contexts, we noticed that it was quite easy to create potential questions and answers for the "Emotions" category. We attribute this to the fact that we limited the number of candidate choices for the "Emotions" questions to ten. This in turn indicates the importance of constraints when designing and evaluating questions that might elicit a broad range of responses (at least in terms of vocabulary alone), and the difficulty of potentially evaluating answers in a generative (i.e., open-ended, where answer choices are not provided at all) setting using metrics such as BLEU [45].
In addition, our development efforts identified several challenges associated with using the Gordon-Hobbs theory, and the larger taxonomy, to construct a common sense dataset. For example, there was considerable internal discussion among the benchmark developers regarding which category a question should be placed into. This was particularly true for questions associated with time, activities, and scheduling, since they all have temporal aspects associated with them. For example, an answer to a question about when to schedule some activity could be a time value, e.g., 10:00 AM, or it could be some named activity, e.g., after eating breakfast. To resolve these issues, we classify a sentence by the answer value type; e.g., when June 4 is the answer value, the category is time. This method is used to enforce consistency in QA construction, rather than to make normative claims about the Gordon-Hobbs theory.
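The answer-value-type rule can be sketched as a small set of pattern checks; the regular expressions, the category names, and the fallback below are illustrative assumptions, not the actual classification procedure used during construction:

```python
import re

# Illustrative answer-value-type rules: the category is determined by the
# form of the answer value, with patterns checked in order.
RULES = [
    (re.compile(r"\b\d{1,2}:\d{2}\s*(AM|PM)?", re.I), "Time"),        # e.g., 10:00 AM
    (re.compile(r"\b(January|June|July)\s+\d{1,2}\b", re.I), "Time"),  # e.g., June 4
    (re.compile(r"\bafter\s+\w+ing\b", re.I), "Activities"),           # e.g., after eating breakfast
]

def category_for(answer, default="World States"):
    """Classify an answer value by its type, falling back to a default."""
    for pattern, category in RULES:
        if pattern.search(answer):
            return category
    return default
```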
A final limitation that we emphasize about TG-CSR also applies to the majority of common sense benchmarks (including the ones in Table 1) currently being used to evaluate systems such as the large language models cited earlier. Namely, these benchmarks rely on the specific modality of question answering (whether framed as multiple-choice or as true/false), with questions posed in natural language, to evaluate machine common sense. However, natural language may not be the best fit for evaluating more structured common sense problems (such as common sense planning, approximate numerical calculation, decision-making, and so on). Indeed, some recent research has shown that, in carefully constructed experiments, language models may have trouble with concepts such as rationality and negation [50,51]. Others have constructed novel benchmarks and methodologies for assessing capabilities such as numerical reasoning [52], uncertainty (including the ability of a model to detect that no correct answer exists in response to a question [53]), and the generalization of fine-tuned language models [54]. Nevertheless, the question remains as to whether current machine common sense evaluations are overly biased toward problems represented in natural language.

CONCLUSION AND FUTURE WORK
In this paper, we described TG-CSR, a theoretically grounded benchmark that enables the evaluation of machine-based common sense reasoning. TG-CSR is, to our knowledge, the first CSR benchmark to be grounded in a formal theory of common sense, based on the influential work of Gordon and Hobbs. In addition to the actual resource, a part of which is now hosted on a public competition-style leaderboard (and the remainder of which is available for download), we described how ontologies and vocabularies formally represented in a unified common sense representation can support the creation of theoretically grounded QA benchmarks. Specifically, we described how concepts from several ontologies were combined into a single taxonomy and used to guide the creation of the questions and answers that comprise TG-CSR.
With TG-CSR, we intend to open the pathway for the next generation of benchmarks that are grounded in ontological frameworks of common sense. More than this, we reiterate that, given the open-world and unpredictable nature of common sense reasoning, benchmarks should embrace this uncertainty and eventually be zero-shot.
In future work, benchmarks for categories such as planning and organization may be developed. In addition, recent research describes how noise can affect human judgements, including benchmark annotations [55]. In our current research, we are using TG-CSR as a testbed for analyzing the effects of noise during human labeling of natural language sentences. In this research direction, we will also investigate whether the transition between task formats (such as multiple-choice to True/False) influences the interpretation or performance of large language models and other CSR systems currently being developed.

Figure 3. A workflow illustrating the key steps of our QA benchmark and ground-truth construction methodology.

Figure 5. Example of a prompt template for generative evaluation of the benchmark using language models. Example of model input: …almost two years without a vacation, Chloe is taking a whole month off. She's planning a three-week trip with a few close friends. They always thought about visiting Europe's most famous destinations, including Paris and London. Question: How did Chloe feel after removing destinations in France from her trip?

Table 1. Selected multiple-choice machine common sense benchmarks, chosen based on their historical/novelty value and their adoption in the DARPA Machine Common Sense (MCS) evaluations.

Table 2. Examples of "Emotion" questions in MC format converted to a T/F prompt for each context. (One example in Table 2 asks: How did Chloe feel after removing destinations in France from her trip?)

Figure 4. Distribution of the Gordon-Hobbs categories within the benchmark contexts.
At a museum Chloe can have sandwiches for lunch.
At a museum Chloe can have salad for lunch.
At a museum Chloe can have wraps for lunch.
At a museum Chloe can have steak for lunch.
At a museum Chloe can have risotto for lunch.
At a museum Chloe can have a baked potato for lunch.
At a museum Chloe can have Chloe for lunch.
At a museum Chloe can have Caitlin for lunch.
At a museum Chloe can have Jacky for lunch.
At a museum Chloe can have Joann for lunch.

Table 4. Benchmark access and license details.