## Abstract

To ensure readability, text is often written and presented with due formatting. These text formatting devices help the writer to effectively convey the narrative. At the same time, these help the readers pick up the structure of the discourse and comprehend the conveyed information. There have been a number of linguistic theories on discourse structure of text. However, these theories only consider unformatted text. Multimedia text contains rich formatting features that can be leveraged for various NLP tasks. In this article, we study some of these discourse features in multimedia text and what communicative function they fulfill in the context. As a case study, we use these features to harvest structured subject knowledge of geometry from textbooks. We conclude that the discourse and text layout features provide information that is complementary to lexical semantic information. Finally, we show that the harvested structured knowledge can be used to improve an existing solver for geometry problems, making it more accurate as well as more explainable.

## 1. Introduction

The study of discourse focuses on the properties of text as a whole and how meaning is conveyed by making connections between component sentences. Writers often use certain linguistic devices to make a discourse structure that enables them to effectively communicate their narrative. The readers, too, comprehend text by picking up these linguistic devices and recognizing the discourse structure. There are a number of linguistic theories on discourse relations (Van Dijk 1972; Longacre 1983; Grosz and Sidner 1986; Cohen 1987; Mann and Thompson 1988; Polanyi 1988; Moser and Moore 1996) that specify relations between discourse units and how to represent the discourse structure of a piece of text (i.e., discourse parsing; Duverle and Prendinger 2009; Subba and Di Eugenio 2009; Feng and Hirst 2012; Ghosh, Riccardi, and Johansson 2012; Feng and Hirst 2014; Ji and Eisenstein 2014; Li et al. 2014; Lin, Ng, and Kan 2014; Wang and Lan 2015). These discourse features have been shown to be useful in a number of NLP applications such as summarization (Dijk 1979; Marcu 2000; Boguraev and Neff 2000; Louis, Joshi, and Nenkova 2010; Gerani et al. 2014), information retrieval (Wang et al. 2006; Lioma, Larsen, and Lu 2012), information extraction (Kitani, Eriguchi, and Hara 1994; Conrath et al. 2014), and question answering (Chai and Jin 2004; Sun and Chai 2007; Narasimhan and Barzilay 2015; Sachan et al. 2015).

Most linguistic theories of discourse consider written text without much formatting. However, in this multimedia age, text is often richly formatted. Be it newsprint, textbooks, brochures, or even scientific articles, text is usually appropriately formatted and stylized. For example, the text may have a heading. It may be divided into a number of sections with section subtitles. Parts of the text may be italicized or boldfaced to place appropriate emphasis wherever required. The text may contain itemized lists, footnotes, indentations, or quotations. It may refer to associated tables and figures. The tables and figures, too, usually have associated captions. All these text layout features ensure that the text is easy to read and understand. Even articles accepted for *Computational Linguistics* follow a due formatting scheme.

These text layout features are in addition to other linguistic devices such as syntactic arrangement or rhetorical forms. Relations between textual units that are not necessarily contiguous can thus be expressed thanks to typographical or dispositional markers. Such relations, which are out of reach of standard NLP tools, have only been studied within some specific layout contexts (Hovy 1998; Pascual 1996; Bateman et al. 2001a, inter alia)^{1} and there are not many comprehensive studies on the various kinds of discourse features and how they can be leveraged to improve NLP tasks.

In this article, we study some of these discourse features in multimedia text and what communicative function they fulfill in the context. As a case study, we study the problem of harvesting structured subject knowledge of geometry from textbooks and show that the formatting devices can indeed be used to improve a strong information extraction system in that domain. We show that the discourse and text layout features provide information that is complementary to lexical semantic information commonly used for information extraction.

*Can this rich contextual and typographical information (whenever available) be used to harvest these axioms in the form of structured rules?* Our goal is to not only extract the axiom mentioned in Figure 1 but also map it to a rule corresponding to the Pythagorean theorem:
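
As an illustration of the kind of target representation involved, the Pythagorean theorem could be written as an implication rule of the following shape. The predicate names $\text{isTriangle}$ and $\text{perpendicular}$ and the function $\text{length}$ are illustrative assumptions in the style of the *GEOS* vocabulary, not necessarily the exact rule produced by the system:

$$
\text{isTriangle}(ABC) \wedge \text{perpendicular}(AC, BC) \implies \text{length}(AC)^2 + \text{length}(BC)^2 = \text{length}(AB)^2
$$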

We present an automatic approach that can (a) harvest such subject knowledge from textbooks, and (b) parse the extracted knowledge to structured rules. We propose novel models that perform sequence labeling and alignment to extract redundant axiom mentions across various textbooks, and then parse the redundant axioms to structured rules. These redundant structured rules are then resolved to obtain the best structured rule for each axiom. We conduct a comprehensive feature analysis of the usefulness of various discourse features for the various stages of information extraction: shallow discourse features based on discourse markers, deep discourse features based on *Rhetorical Structure Theory* (Mann and Thompson 1988), and various text layout features in a multimedia document (Hovy 1998). Our experiments show that the typographical features are useful over and above the lexical semantic and discourse-level features considered for the task.

We use our model to extract and parse axiomatic knowledge from a novel data set of 20 publicly available math textbooks. We use this structured axiomatic knowledge to build a new axiomatic solver that performs logical inference to solve geometry problems. Our axiomatic solver outperforms *GEOS* on all existing test sets introduced in Seo et al. (2015) as well as a new test set of geometry questions collected from these textbooks. We also conducted user studies with school students studying geometry, who found our axiomatic solver more interpretable and useful than *GEOS*.

## 2. Background and Related Work

**Discourse Analysis:** Discourse analysis is the analysis of semantics conveyed by a coherent sequence of sentences, propositions, or speech. Discourse analysis is taken up in a variety of disciplines in the humanities and social sciences and a number of discourse theories have been proposed (Mann and Thompson 1988; Kamp and Reyle 1993; Lascarides and Asher 2008, inter alia). Their starting point lies in the idea that text is not just a collection of sentences, but also includes relations between all these sentences that ensure its coherence. It is often assumed that discourse analysis is a three-step process:

1. splitting the text into discourse units (DUs),
2. ensuring the attachment between DUs, and then
3. labeling links between DUs with discourse relations.

Discourse relations may be divided into two categories: nucleus-satellite (or subordinate) relations, which link an important argument to an argument supporting background information, and multinuclear (or coordinate) relations, which link arguments of equal importance. Most discourse theories (DRT, RST, SDRT, etc.) acknowledge that a discourse is hierarchically structured thanks to discourse relations. A number of discourse relations have been proposed under various theories for discourse analysis.

Discourse analysis has been shown to be useful for many NLP tasks, such as question answering (Chai and Jin 2004; Lioma, Larsen, and Lu 2012; Jansen, Surdeanu, and Clark 2014), summarization (Louis, Joshi, and Nenkova 2010), and information extraction (Kitani, Eriguchi, and Hara 1994). However, to the best of our knowledge, we do not have a theory or a working model of discourse in a multimedia setting.

**Formatting in Discourse:** Psychologists and educationists have frequently studied multimedia issues such as the impact of illustrations (pictures, tables, etc.) in text, design principles of multimedia presentations, and so forth (Dwyer 1978; Fleming, Levie, and Levie 1978; Hartley 1985; Twyman 1985). However, these discussions are usually too general and hard to build on from a computational perspective. Thus, most studies of multimedia text have only been theoretical in nature. Larkin and Simon (1987), Mayer (1989), and Petre and Green (1990) attempt to answer questions such as whether a graphical notation is superior to text notation, what makes a diagram (sometimes) worth ten thousand words, and how illustration affects thinking. Hovy (1998), Arens and Hovy (1990), Arens (1992), and Arens, Hovy, and Van Mulken (1993) provide a theory of the communicative function fulfilled by various formatting devices and use it in text planning. In a similar vein, Dale (1991a, b), White (1995), Pascual and Virbel (1996), Reed and Long (1997), and Bateman et al. (2001b) discuss the textual function of punctuation marks and use it in the text generation process. André et al. (1991) and André (2000) build a system, *WIP*, that generates multimedia presentations via a layered architecture (composed of the control layer, content layer, design layer, realization layer, and the presentation layer) and with the help of various content, design, user, and application experts. Mackinlay (1986) discusses the automatic generation of tables and charts. Luc, Mojahid, and Virbel (1999) study enumerations. Feiner (1988), Arens et al. (1988), Neal et al. (1990), Feiner and McKeown (1991), Wahlster et al. (1992), Arens, Hovy, and Vossers (1992), and Maybury (1998) discuss various aspects of processing and knowledge required for automatically generating multimedia. Finally, Stock (1993) discusses using hypermedia features for the task of information exploration.

However, all the aforementioned studies were merely theoretical. All the models were hand-coded and not trained from multimedia corpora. In this article, we provide a corpus analysis of multimedia text and use it to show that the formatting devices can indeed be used to improve a strong information extraction system in the geometry domain.

**Solving Geometry Problems:** Although the problem of using computers to solve geometry questions is old (Feigenbaum and Feldman 1963; Schattschneider and King 1997; Davis 2006), NLP and computer vision techniques were first used to solve geometry problems in Seo et al. (2015). Seo et al. (2014) only aligned geometric shapes with their textual mentions, but Seo et al. (2015) also extracted geometric relations and built *GEOS*, the first automated system to solve SAT style geometry questions. *GEOS* used a coordinate geometry based solution by translating each predicate into a set of manually written constraints. A Boolean satisfiability problem posed with these constraints was used to solve the multiple-choice question. *GEOS* had two key issues: (a) It needed access to answer choices that may not always be available for such problems, and (b) It lacked the deductive geometric reasoning used by students to solve these problems. In this article, we build an axiomatic solver that mitigates these issues by performing deductive reasoning using axiomatic knowledge extracted from textbooks. Furthermore, we use ideas from discourse to automatically extract these axiom rules from textbooks.

Automatic approaches that use logical inference for geometry theorem proving, such as Wu's method (Wen-Tsun 1986), the Gröbner basis method (Kapur 1986), and the angle method (Chou, Gao, and Zhang 1994), have been used in tutoring systems such as *Geometry Expert* (Gao and Lin 2002) and *Geometry Explorer* (Wilson and Fleuriot 2005). There has also been research in synthesizing geometry constructions, given logical constraints (Gulwani, Korthikanti, and Tiwari 2011; Itzhaky et al. 2013) or generating geometric proof problems (Alvin et al. 2014) for applications in tutoring systems. Our approach can be used to provide the axiomatic information necessary for these works.

**Other Related Tasks:** Our work is also related to Textbook Question Answering (Kembhavi et al. 2017), which proposes the task of multimodal machine comprehension where the context needed to answer questions consists of both text and images. The TQA data set is built from middle school science textbooks and pairs each question with a limited span of knowledge needed to answer it. Also related is the work on Diagram QA (Kembhavi et al. 2016), which proposes the task of understanding and answering questions based on diagrams from textbooks, and FigureSeer (Siegel et al. 2016), which parses figures in research papers.

**Information Extraction from Textbooks:** Our model for extracting structured rules of geometry from textbooks builds upon ideas from information extraction (IE), which is the task of automatically extracting structured information from unstructured and/or semi-structured documents. Although there has been a lot of work in IE on domains such as Web documents (Chang, Hsu, and Lui 2003; Etzioni et al. 2004; Cafarella et al. 2005; Chang et al. 2006; Banko et al. 2007; Etzioni et al. 2008; Mitchell et al. 2015) and scientific publication data (Shah et al. 2003; Peng and McCallum 2006; Saleem and Latif 2012), work on IE from educational material is much sparser. Most of the research in IE from educational material deals with extracting simple educational concepts (Shah et al. 2003; Canisius and Sporleder 2007; Yang et al. 2015; Wang et al. 2015; Liang et al. 2015; Wu et al. 2015; Liu et al. 2016b; Wang et al. 2016) or binary relational tuples (Balasubramanian et al. 2002; Clark et al. 2012; Dalvi et al. 2016) using existing IE techniques. On the other hand, our approach extracts axioms and parses them to Horn-clause rules. This is much more challenging. Directly applying the rule mining or sequence labeling techniques used to extract information from Web documents and scientific publications to educational material usually leads to poor results, because educational material contains less redundancy and labeled data is sparse. Our approach tackles these issues by making judicious use of typographical information, the redundancy of information, and ordering constraints to improve the harvesting and parsing of axioms. This has not been attempted in previous work.

**Language to Programs:** After harvesting axioms from textbooks, we also parse the axiom mentions to Horn-clause rules. This work is related to a large body of work on semantic parsing (Zelle and Mooney 1993, 1996; Kate et al. 2005; Zettlemoyer and Collins 2012, inter alia). Semantic parsers typically map natural language to formal programs such as database queries (Liang, Jordan, and Klein 2011; Berant et al. 2013; Yaghmazadeh et al. 2017, inter alia), commands to robots (Shimizu and Haas 2009; Matuszek, Fox, and Koscher 2010; Chen and Mooney 2011, inter alia), or even general-purpose programs (Lei et al. 2013; Ling et al. 2016; Yin and Neubig 2017; Ling et al. 2017). More specifically, Liu et al. (2016a) and Quirk, Mooney, and Galley (2015) learn “If-Then” and “If-This-Then-That” rules, respectively. In theory, these works can be adapted to parse axiom mentions to Horn-clause rules. However, this would require a large amount of supervision, which would be expensive to obtain. We mitigate this issue by using redundant axiom mention extractions from multiple textbooks and then combining the parses obtained from various textbooks to achieve a better final parse for each axiom.

## 3. Data Format

Large-scale corpus studies of multimedia text have been rare because of the difficulty in obtaining rich multimedia documents in analyzable data structures. A large proportion of text today is typeset using typesetting software such as LaTeX, Word, or HTML, which records the formatting features of the text explicitly. These features can serve as useful cues in downstream applications, so a model for text formatting is required.

Table 1 shows some excerpts of textbooks from our data set that describe complementary angles, exterior angles, and parallelogram diagonal bisection axioms. As shown, each excerpt contains rich typographical features, such as the section headings, italicization, boldface, coloring, explicit axiom name, supporting figures, and equations that can be used to harvest the axioms. We wish to leverage such rich contextual and typographical information to accurately harvest axioms and then parse them to Horn-clause rules. The textbooks are provided to us in a JSON format that retains their rich typesetting, as shown in Tables 2 and 3. For demonstration, we have manually marked the various typographical features that can be used to harvest the axioms. We will show how we can use these features to harvest axioms of geometry from textbooks and then parse them to structured rules.

## 4. Text Formatting Elements in Discourse

In this section, we review various text formatting devices used in a typical multimedia system and identify what communicative function they serve. This will help us come up with a theory for text formatting in discourse and also motivate how these features can be used in a typical NLP application like information extraction. This theory is inspired by various style suggestions for English writing (Strunk 2007). The goal of a text formatting device in a multimedia text is to delimit the portion of text for which certain exceptional conditions of interpretation hold. We categorize text formatting devices into four broad categories: *depiction*, *position*, *composition*, and *substantiation*, and describe the various text formatting devices here:

- **Depiction:** Depiction features are concerned with how a string of text is presented in the multimedia. These include features such as capitalization, font size/color, boldface, italicization, underline, strikethrough, parenthesis, quotation marks, use of bounding boxes, and so forth.
- **Position:** Position features are concerned with the positioning of a piece of text relative to the remaining material in the document. These features include in-lining, text offset, footnotes, headers and footers, and text separation or isolation (a block of text separated from the rest to create a special effect).
- **Composition:** Composition features are concerned with the internal structuring of a piece of text. Examples include graphical markers such as paragraph breaks, sections (having sections, chapters, etc., in the document), lists (itemization, enumeration), concept definition using a parenthesis or colon, and so on.
- **Substantiation:** Substantiation features are used to further substantiate the discourse argument. Examples include associated figures or tables, references to tables and figures (e.g., Figure 1.2), or external links that are very important in understanding a complex multimedia document.
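
The four categories above can be read as a small lookup structure. The following is a minimal sketch, assuming hypothetical device names (`boldface`, `footnote`, and so on) chosen only for illustration; it is not a schema taken from the data set:

```python
from enum import Enum


class DeviceCategory(Enum):
    DEPICTION = "depiction"
    POSITION = "position"
    COMPOSITION = "composition"
    SUBSTANTIATION = "substantiation"


# Hypothetical device names mapped to the four categories described above.
DEVICE_CATEGORY = {
    "boldface": DeviceCategory.DEPICTION,
    "italics": DeviceCategory.DEPICTION,
    "font_color": DeviceCategory.DEPICTION,
    "bounding_box": DeviceCategory.DEPICTION,
    "footnote": DeviceCategory.POSITION,
    "header": DeviceCategory.POSITION,
    "text_offset": DeviceCategory.POSITION,
    "section": DeviceCategory.COMPOSITION,
    "enumeration": DeviceCategory.COMPOSITION,
    "paragraph_break": DeviceCategory.COMPOSITION,
    "figure_reference": DeviceCategory.SUBSTANTIATION,
    "table_reference": DeviceCategory.SUBSTANTIATION,
    "external_link": DeviceCategory.SUBSTANTIATION,
}


def group_by_category(devices):
    """Group the devices observed on a text span by their category."""
    grouped = {category: [] for category in DeviceCategory}
    for device in devices:
        grouped[DEVICE_CATEGORY[device]].append(device)
    return grouped


grouped = group_by_category(["boldface", "section", "figure_reference"])
```

Grouping devices this way makes it straightforward to switch a whole category of features on or off in an ablation.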

## 5. Text Formatting Features for Information Extraction?

A key question for research is: *Are these text formatting features useful for NLP tasks?* In particular, in this article, we will try to identify whether these text formatting features are useful for information extraction. In a typical multimedia document, authors use various text formatting devices to better communicate the content to their readers. This helps the readers digest the material quickly and much more easily. Thus, can these text formatting features be useful in an information extraction system too? We experimentally validate our hypothesis in the application of harvesting axioms of geometry from richly formatted textbooks.

Then, we show that these harvested axioms can improve an existing solver for answering SAT style geometry problems. SAT geometry tests the student’s knowledge of Euclidean geometry in its classical sense, including the study of points, lines, planes, angles, triangles, congruence, similarity, solid figures, circles, and analytical geometry. A typical geometry problem is provided in Figure 2. Geometry questions include a textual description accompanied by a diagram. Various levels of understanding are required to solve geometry problems. An important challenge is understanding both the diagram (which consists of identifying visual elements in the diagram, their locations, their geometric properties, etc.) and the text simultaneously, and then reasoning about the geometrical concepts using well-known axioms of Euclidean geometry.

We first recap *GEOS*, a completely automatic solver for geometry problems. We will then use the rich contextual and typographical information in textbooks to extract structured knowledge of geometry. This structured knowledge of geometry will then be used to improve *GEOS*.

## 6. Background: GEOS

Our work reuses *GEOS* (Seo et al. 2015) to parse the question text and diagram into its formal problem description as shown in Figure 3. *GEOS* uses a logical formula, a first-order logic expression that includes known numbers or geometrical entities (e.g., 4 cm) as constants, unknown numbers or geometrical entities (e.g., O) as variables, geometric or arithmetic relations (e.g., *isLine*, *isTriangle*) as predicates, and properties of geometrical entities (e.g., *measure*, *liesOn*) as functions.
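
To make the four ingredients of such a logical formula concrete, the sketch below models constants, variables, functions, and predicates as plain Python data classes. The class names and the `equals` predicate are assumptions made for illustration and are not the internal representation used by *GEOS*:

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Constant:
    value: str                 # known number or entity, e.g. "4 cm"


@dataclass(frozen=True)
class Variable:
    name: str                  # unknown number or entity, e.g. "O"


@dataclass(frozen=True)
class Function:
    name: str                  # property of an entity, e.g. "measure"
    args: Tuple["Term", ...]


Term = Union[Constant, Variable, Function]


@dataclass(frozen=True)
class Literal:
    predicate: str             # geometric or arithmetic relation, e.g. "isLine"
    args: Tuple[Term, ...]
    confidence: float = 1.0    # parser confidence score for this literal

    def __str__(self):
        return f"{self.predicate}({', '.join(map(render, self.args))})"


def render(term):
    """Render a term in the usual functional notation."""
    if isinstance(term, Function):
        return f"{term.name}({', '.join(map(render, term.args))})"
    return term.value if isinstance(term, Constant) else term.name


line = Literal("isLine", (Variable("AB"),), confidence=0.95)
length = Literal("equals", (Function("measure", (Variable("AB"),)), Constant("4 cm")))
```

Here `str(line)` renders as `isLine(AB)` and `str(length)` as `equals(measure(AB), 4 cm)`.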

This is done by learning a set of relations that potentially correspond to the question text (or the diagram) along with a confidence score. For diagram parsing, *GEOS* uses a publicly available diagram parser for geometry problems (Seo et al. 2014) to obtain the set of all visual elements, their coordinates, their relationships in the diagram, and their alignment with entity references in the question text. The diagram parser also provides confidence scores for each literal to be true in the diagram. For text parsing, *GEOS* takes a multistage approach, which maps words or phrases in the text to their corresponding concepts, and then identifies relations between identified concepts.

Given this formal problem description, *GEOS* uses a numerical method to check the satisfiability of literals by defining a relaxed indicator function for each literal. These indicator functions are manually engineered for every predicate. Each predicate is mapped into a set of constraints over point coordinates.^{2} These constraints can be non-trivial to write, requiring significant manual engineering. As a result, *GEOS*’s constraint set is incomplete and it cannot solve a number of SAT-style geometry questions. Furthermore, this solver is not interpretable. As our user studies show, it is not natural for a student to understand the solution of these geometry questions in terms of satisfiability of constraints over coordinates. A more natural way for students to understand and reason about these questions is through deductive reasoning using axioms of geometry.
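
To give a sense of what a relaxed indicator over point coordinates might look like, the sketch below scores how close two lines are to being perpendicular. The function name, the exponential form, and the `sharpness` parameter are all assumptions made for illustration; this is not one of the hand-written *GEOS* constraints:

```python
import math


def relaxed_perpendicular(a, b, c, d, sharpness=10.0):
    """Soft score in (0, 1] for 'line AB is perpendicular to line CD'.

    Points are (x, y) tuples and each line must have non-zero length.
    The score is 1.0 when the direction vectors are exactly orthogonal
    and decays towards 0 as |cos(angle)| grows.
    """
    ux, uy = b[0] - a[0], b[1] - a[1]
    vx, vy = d[0] - c[0], d[1] - c[1]
    cos_angle = (ux * vx + uy * vy) / (math.hypot(ux, uy) * math.hypot(vx, vy))
    return math.exp(-sharpness * abs(cos_angle))


orthogonal = relaxed_perpendicular((0, 0), (1, 0), (0, 0), (0, 1))
parallel = relaxed_perpendicular((0, 0), (1, 0), (0, 0), (2, 0))
```

Even this one predicate requires choosing a functional form and a tolerance by hand, which illustrates why writing such constraints for every predicate takes significant engineering effort.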

## 7. Set-up for the Axiomatic Solver

To tackle the aforementioned issues with the numerical solver in *GEOS*, we replace the numerical solver with an axiomatic solver. We extract axiomatic knowledge from textbooks and parse it into Horn-clause rules. Then we build an axiomatic solver that performs logical inference with these Horn-clause rules and the formal problem description. A sample logical program (in Prolog notation) that solves the problem in Figure 2 is given in Figure 4. The logical program has a set of declarations from the *GEOS* text and diagram parsers that describe the problem specification; and the parsed Horn-clause rules describe the underlying theory. Normalized confidence scores from question text and diagram and axiom parsing models are used as probabilities in the program. Figure 5 shows a block diagram of the overall system that solves geometry problems. Also, Figure 6 pictorially shows the two-step procedure for obtaining structured axiomatic knowledge from textbooks:

1. **Axiom Identification and Alignment:** In this stage, we identify axiom mentions in all textbooks and align the mentions of the same axiom across different textbooks.
2. **Axiom Parsing:** In this stage, we parse each of these axiom mentions into implication rules and then resolve the implication rules for various axiom mentions referring to the same axiom.
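
The idea of logical inference over a problem description and Horn-clause rules can be illustrated with a small forward-chaining loop. This is a simplified, deterministic sketch: it ignores the confidence scores used as probabilities in the actual program, it is written in Python rather than Prolog, and the tuple encoding with `?`-prefixed variables is an assumption made for illustration:

```python
def unify(pattern, fact, binding):
    """Match one rule atom against one ground fact; variables start with '?'."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    binding = dict(binding)
    for p, f in zip(pattern[1:], fact[1:]):
        if p.startswith("?"):
            if binding.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return binding


def match_body(body, facts, binding):
    """Yield every variable binding that satisfies all atoms in the body."""
    if not body:
        yield binding
        return
    for fact in facts:
        extended = unify(body[0], fact, binding)
        if extended is not None:
            yield from match_body(body[1:], facts, extended)


def forward_chain(facts, rules):
    """Apply Horn-clause rules (body -> head) until no new fact is derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            # Materialize the bindings first so the fact set is not
            # modified while it is being iterated over.
            for binding in list(match_body(body, facts, {})):
                derived = tuple(binding.get(term, term) for term in head)
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


# Transitivity of parallelism as a Horn clause.
rules = [([("parallel", "?x", "?y"), ("parallel", "?y", "?z")],
          ("parallel", "?x", "?z"))]
facts = {("parallel", "AB", "CD"), ("parallel", "CD", "EF")}
closure = forward_chain(facts, rules)
```

Each derived fact can be traced back to the rule and the facts that produced it, which is the property that makes an axiomatic solution easier to explain than a numerical one.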

Next, we describe how we harvest structured axiomatic knowledge from textbooks.

## 8. Harvesting Axiomatic Knowledge

We present a structured prediction model that identifies axioms in textbooks and then parses them. Because harvesting axioms from a single textbook is a very hard problem, we use multiple textbooks and leverage the redundancy of information to accurately extract and parse axioms. We first define a joint model that identifies axiom mentions in each textbook and aligns repeated mentions of the same axiom across textbooks. Then, given a set of axioms (with possibly multiple mentions of each axiom), we define a parsing model that maps each axiom to a Horn-clause rule by utilizing the various mentions of the axiom.

Given a set of textbooks ℬ in machine readable form (JSON in our experiments), we extract chapters relevant for geometry in each of them to obtain a sequence of discourse elements (with associated typographical information) from each textbook. We assume that the textbook comprises an ordered set^{3} of **discourse elements** where a discourse element could be a natural language sentence, heading, title, figure, table, or caption. The discourse element (e.g., a sentence) could have additional typographical features. For example, the sentence could be written in boldface, underline, and so forth. These properties of discourse elements will be useful features that can be leveraged for the task of harvesting axioms. Let $\mathbf{S}_b = \{s_0^{(b)}, s_1^{(b)}, \ldots, s_{|\mathbf{S}_b|}^{(b)}\}$ denote the sequence of discourse elements in textbook *b*. $|\mathbf{S}_b|$ denotes the number of discourse elements in textbook *b*.
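
A sequence of discourse elements with typographical annotations could be represented along the following lines. This is a minimal sketch: the JSON keys `elements`, `kind`, `text`, and `typography` are a hypothetical schema invented for this example and do not describe the actual format of the textbook data:

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiscourseElement:
    kind: str     # "sentence", "heading", "title", "figure", "table", or "caption"
    text: str
    typography: List[str] = field(default_factory=list)   # e.g. ["bold", "underline"]


def load_elements(json_text):
    """Flatten a (hypothetical) JSON chapter into an ordered sequence S_b."""
    chapter = json.loads(json_text)
    return [
        DiscourseElement(e["kind"], e.get("text", ""), e.get("typography", []))
        for e in chapter["elements"]
    ]


example = """
{"elements": [
  {"kind": "heading", "text": "Pythagorean Theorem", "typography": ["bold"]},
  {"kind": "sentence", "text": "In a right triangle, the square of the hypotenuse equals the sum of the squares of the other two sides."},
  {"kind": "figure", "text": "Figure 2.1"}
]}
"""
S_b = load_elements(example)
```

Keeping the elements in document order matters, because both the sequence labeling and the alignment steps rely on position.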

### 8.1 Axiom Identification and Alignment

We decompose the problem of extracting axioms from textbooks into two tractable sub-problems:

1. identification of axiom mentions in each textbook using sequence labeling
2. alignment of repeated mentions of the same axiom across textbooks

Then, we combine the learned models for these sub-problems into a joint optimization framework that simultaneously learns to identify and align axiom mentions.

#### 8.1.1 Axiom Identification

Given $\{\mathbf{S}_b \mid b \in \mathcal{B}\}$, a sequence of discourse elements (with associated typographical information) from each textbook, the model labels each discourse element $s_i^{(b)}$ as **B**efore, **I**nside, or **O**utside an axiom. Hereon, a contiguous block of discourse elements labeled **B** or **I** will be considered as an axiom mention. Let $T = \{B, I, O\}$ denote the tag set. Let $y_i^{(b)}$ be the tag assigned to $s_i^{(b)}$ and $\mathbf{Y}_b$ be the tag sequence assigned to $\mathbf{S}_b$. The conditional random field defines a conditional distribution over the tag sequence $\mathbf{Y}_b$ given $\mathbf{S}_b$, parameterized by $\boldsymbol{\theta}$. We estimate the parameters $\boldsymbol{\theta}$ using maximum-likelihood estimation with L2 regularization.
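
For reference, a linear-chain conditional random field over adjacent tags, consistent with the feature description that follows, is commonly written in the form below. The normalizer $Z(\mathbf{S}_b; \boldsymbol{\theta})$ and the regularization weight $\lambda$ are notational assumptions introduced here, and the exact form of the regularizer is likewise an assumption:

$$
p(\mathbf{Y}_b \mid \mathbf{S}_b; \boldsymbol{\theta}) = \frac{1}{Z(\mathbf{S}_b; \boldsymbol{\theta})} \exp\left( \sum_{k=1}^{|\mathbf{S}_b|} \boldsymbol{\theta}^{\top} f\left(y_{k-1}^{(b)}, y_k^{(b)}, \mathbf{S}_b, k\right) \right)
$$

$$
\boldsymbol{\theta}^{*} = \arg\max_{\boldsymbol{\theta}} \sum_{b \in \mathcal{B}} \log p(\mathbf{Y}_b \mid \mathbf{S}_b; \boldsymbol{\theta}) - \lambda \lVert \boldsymbol{\theta} \rVert_2^2
$$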

**Features:** Features *f* look at a pair of adjacent tags $y_{k-1}^{(b)}$, $y_k^{(b)}$, the input sequence $\mathbf{S}_b$, and where we are in the sequence. The features (listed in Table 4) include various content-based features encoding various notions of similarity between pairs of discourse elements (in terms of semantic overlap, more refined match of geometry entities, and certain keywords) as well as various typographical features such as whether the discourse elements are annotated as an axiom (or theorem or corollary) in the textbook; contain equations, diagrams, or text that is bold or italicized; are in the same node of the JSON hierarchy; are contained in a bounding box, and so forth. We also use features directly from an existing RST parser (Feng and Hirst 2014); discourse structure can be useful to understand if two consecutive discourse elements are together part of an axiom (or not).
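
Two of the content features, n-gram overlap and the keyword indicator, can be sketched as follows. The whitespace tokenization, the Jaccard normalization of the overlap, and the feature-name format are assumptions made for this example; the text only specifies that the proportion of common unigrams and bigrams is used and that features are conjoined with tags:

```python
KEYWORDS = {"hence", "if", "equal", "twice", "proportion", "ratio", "product"}


def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(current, following, n):
    """Proportion of the n-grams in either element that occur in both (Jaccard)."""
    a, b = ngrams(current, n), ngrams(following, n)
    return len(a & b) / len(a | b) if a | b else 0.0


def content_features(current, following, tag, next_tag):
    """Content features for one position, conjoined with the relevant tags."""
    cur, nxt = current.lower().split(), following.lower().split()
    return {
        f"unigram_overlap|{tag}|{next_tag}": overlap(cur, nxt, 1),
        f"bigram_overlap|{tag}|{next_tag}": overlap(cur, nxt, 2),
        f"has_keyword|{tag}": float(any(tok in KEYWORDS for tok in cur)),
    }


feats = content_features(
    "If two angles are equal", "the two angles are congruent", "B", "I"
)
```

Conjoining each feature with the tags gives the model a separate weight for every combination of feature and tag.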

| Category | Feature | Description |
|---|---|---|
| Content | Sentence overlap | Semantic textual similarity between the current and next discourse element. We include features that compute the proportion of common unigrams and bigrams across the two discourse elements. This feature is conjoined with the tag assigned to the current and next sentence. |
| | Geometry entities | Number of geometry entities (constants, predicates, and functions), normalized by the number of tokens in this discourse element. This feature is conjoined with the tag assigned to the current discourse element. |
| | Keywords | Indicator that the current discourse element contains any one of the following words: hence, if, equal, twice, proportion, ratio, product. This feature is conjoined with the tag assigned to the current discourse element. |
| Discourse | RST edge | Indicator for the RST relation between the current and next discourse element. This feature is conjoined with the tag assigned to the current and next sentence. |
| Typography | Axiom, Theorem, Corollary mention | (a) The current (or previous) discourse element is mentioned as an Axiom, Theorem, or Corollary (e.g., Similar Triangle Theorem or Corollary 2.1). (b) The section or subsection in the textbook containing the current (or previous) discourse element mentions an Axiom, Theorem, or Corollary. This feature is conjoined with the tag assigned to the current (and previous) discourse element. |
| | Equation | The current (or next) discourse element contains an equation (e.g., $PA \times PB = PT^2$). This feature is conjoined with the tag assigned to the current (and next) sentence. |
| | Associated diagram | The current discourse element contains a pointer to a figure (e.g., “Figure 2.1”). This feature is conjoined with the tag assigned to the current discourse element. |
| | Bold/Underline | The discourse element (or previous discourse element) contains text that is in bold font or underlined. Conjoined with the tag assigned to the current (and previous) discourse element. |
| | Bounding box | Indicator that the current and previous discourse elements are bounded by a bounding box in the textbook. Conjoined with the tag assigned to the current (and previous) discourse element. |
| | JSON structure | Indicator that the current and previous discourse element are in the same node of the JSON hierarchy. Conjoined with the tag assigned to the current (and previous) discourse element. |

Some extracted axiom mentions contain pointers to a diagram (e.g., “Figure 2.1”). In all these cases, we consider the diagram to be a part of the axiom mention. We will discuss the impact of the various content- and typography-based features later in Section 11.

#### 8.1.2 Axiom Alignment

Next, we leverage the redundancy of information and the relatively fixed ordering of axioms in various textbooks. Most textbooks typically present all axioms of geometry in approximately the same order, moving from easier concepts to more advanced concepts. For example, all textbooks will introduce the definition of a right-angled triangle before introducing the Pythagorean theorem. We leverage this structure by aligning various mentions of the same axiom across textbooks and introducing structural constraints on the alignment.

*b*. Let $\mathbf{A}$ denote the collection of axiom mentions extracted from all textbooks. We assume a global ordering of axioms $A^{*} = A^{*}_{1}, A^{*}_{2}, \ldots, A^{*}_{U}$, where $U$ is some predefined upper bound on the total number of axioms in geometry. Then, we emphasize that the axiom mentions extracted from each textbook (roughly) follow this ordering. Let $Z_{ij}^{(b)}$ be a random variable that denotes if axiom $A_{i}^{(b)}$ extracted from book $b$ refers to the global axiom $A^{*}_{j}$. We introduce a log-linear model that factorizes over alignment pairs:

$Z(\mathbf{A}; \boldsymbol{\phi})$ is the partition function of the log-linear model. $\mathbf{g}$ denotes a feature function that measures the similarity of two axiom mentions (described in detail later). We introduce the following constraints on the alignment structure:

- C1: An axiom appears in a book at most once.

- C2: An axiom refers to exactly one theorem in the global ordering.

- C3: Ordering Constraint: If the *i*^{th} axiom in a book refers to the *j*^{th} axiom in the global ordering, then no axiom succeeding the *i*^{th} axiom can refer to a global axiom preceding *j*.
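The constraints C1 to C3 can be read as a simple feasibility test on the alignment of one textbook. The sketch below is only an illustration under stated assumptions: the function name `is_feasible` and the list encoding (one global axiom index per mention, in book order) are hypothetical and not taken from the original system.

```python
def is_feasible(alignment):
    """Check constraints C1-C3 for the axiom mentions of one textbook.

    `alignment[i]` is the (hypothetical) index of the global axiom that the
    i-th mention of the book refers to, with mentions listed in book order.
    Storing exactly one index per mention satisfies C2 by construction.
    """
    # C1: no global axiom is referred to by two mentions of the same book.
    if len(set(alignment)) != len(alignment):
        return False
    # C3: a later mention never refers to an earlier global axiom.
    return all(a <= b for a, b in zip(alignment, alignment[1:]))
```

For example, `is_feasible([0, 2, 5])` holds, whereas `[0, 2, 2]` violates C1 and `[0, 3, 1]` violates C3.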

**Learning with Hard Constraints:** We find the optimal parameters $\boldsymbol{\phi}$ using maximum-likelihood estimation with L2 regularization:

Note that the constraints C1–C3 define the feasible space of alignments. Our sampler always samples the next $Z_{ik}^{(b)}$ in this feasible space. $\mu$ is tuned on the development set.

**Learning with Soft Constraints:** We might want to treat some constraints, in particular the ordering constraints C3, as soft constraints. We can write down the constraint C3 using the alignment variables:

To model these constraints as soft constraints, we penalize the model for violating them. Let the penalty for violating this constraint be $\exp\left(\nu \max\left(0,\, 1 - Z_{ij}^{(b)} - Z_{kl}^{(b)}\right)\right)$.

$\boldsymbol{\phi}^{*}$. We perform Gibbs sampling to compute feature expectations. The sampling equation for $Z_{ik}^{(b)}$ is similar (Equation (1)), but:

**Features:** Now, we describe the features $\mathbf{g}$. These, too, include content-based features encoding various notions of similarity between pairs of axiom mentions (such as unigram, bigram, dependency, and entity overlap, longest common subsequence [LCS], alignment, MT, and summarization scores) as well as various typographical features, such as matching of the current (and parent) node of axiom mentions in the respective JSON hierarchies, equation template matching, and image caption matching. The features are listed in Table 5. We will further discuss the impact of the various content- and typography-based features later in Section 11.

Type | Feature | Description |
---|---|---|
Content | Unigram, Bigram, Dependency and Entity Overlap | Real valued features that compute the proportion of common unigrams, bigrams, dependencies, and geometry entities (constants, predicates, and functions) across the two axioms. When comparing geometric entities, we include geometric entities derived from the associated diagrams when available. |
Content | Longest Common Subsequence | Real valued feature that computes the length of longest common sub-sequence of words between two axiom mentions normalized by the total number of words in the two mentions. |
Content | Number of discourse elements | Real valued feature that computes the absolute difference in the number of discourse elements in the two mentions. |
Content | Alignment Scores | We use an off-the-shelf monolingual word aligner, JACANA (Yao et al. 2013) pretrained on PPDB, and compute alignment score between axiom mentions as the feature. |
Content | MT Metrics | We use two common MT evaluation metrics METEOR (Denkowski and Lavie 2010) and MAXSIM (Chan and Ng 2008), and use the evaluation scores as features. While METEOR computes n-gram overlaps controlling on precision and recall, MAXSIM performs bipartite graph matching and maps each word in one axiom to at most one word in the other. |
Content | Summarization Metrics | We also use Rouge-S (Lin 2004), a text summarization metric, and use the evaluation score as a feature. Rouge-S is based on skip-grams. |
Discourse (Typography) | JSON structure | Indicator matching the current (and parent) node of axiom mentions in respective JSON hierarchies; i.e., are both nodes mentioned as axioms, diagrams or bounding boxes? |
Discourse (Typography) | Equation Template | Indicator feature that matches templates of equations detected in the axiom mentions. The template matcher is designed such that it identifies various rewritings of the same axiom equation, e.g., PA × PB = PT^{2} and PA × PB = PC^{2} could refer to the same axiom with point T in one axiom mention being point C in another mention. |
Discourse (Typography) | Image Caption | Proportion of common unigrams in the image captions of the diagrams associated with the axiom mentions. If both mentions do not have associated diagrams, this feature does not fire. |
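Two of the content features above, n-gram overlap and the normalized longest common subsequence, are easy to illustrate. The following is a minimal sketch, not the original implementation: the function names are hypothetical, and "proportion of common n-grams" is assumed here to mean the Jaccard ratio (shared n-grams over all distinct n-grams), which the text does not specify.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(a, b, n=1):
    """Proportion of shared n-grams (assumed: Jaccard ratio)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def lcs_feature(a, b):
    """LCS length normalized by the total number of words in both mentions."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    total = len(a) + len(b)
    return prev[-1] / total if total else 0.0
```

With whitespace-tokenized mentions, `overlap(a, b, n=2)` gives the bigram variant of the same feature.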


#### 8.1.3 Joint Identification and Alignment.

*b* has tag *j*. We reuse the definitions of the alignment variables $Z_{ij}^{(b)}$ as before. We further define $Z_{i0}^{(b)}$ such that it denotes that the *i*^{th} axiom in textbook *b* is not aligned with any global axiom. We again define a log-linear model with factors that score axiom identification and axiom alignments.

$\mathbf{g}$ and hence would worsen the quality of axiom alignments. This motivates our joint modeling of axiom identification and alignment.

We again have the following model constraints:

- C1′: Every discourse element has a unique label.

- C2′: Tag O cannot be followed by tag I.

- C3′: Consistency between *Y*s and *Z*s; i.e., axiom boundaries defined by *Y*s and *Z*s must agree.

- C4′ = C3.

We use L-BFGS for learning. To compute feature expectations, we use a Metropolis-Hastings sampler that samples **Y**s and **Z**s alternately. Sampling for **Z**s reduces to Gibbs sampling, and the sampling equations are the same as before (Section 8.1.2). For better mixing, we sample **Y** in blocks. Consider blocks of **Y**s that denote axiom boundaries at time stamp *t*; we define three operations to sample axiom blocks at the next time stamp. The operations (shown in Figure 7) are:

**Update axiom:** The axiom boundary can be shrunk, expanded, or moved. The new axiom, however, cannot overlap with other axioms.

**Delete axiom:** The axiom can be deleted by labeling all its discourse elements as *O*.

**Introduce axiom:** Given a contiguous sequence of discourse elements labeled *O*, a new axiom can be introduced.
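The delete and introduce operations can be sketched as edits on a tag sequence. This is only an illustration under assumptions: the tags are taken to be B (begins an axiom), I (inside an axiom), and O (outside), and the helper names are hypothetical. The update operation (shrinking, expanding, or moving a boundary) can be composed from a delete followed by an introduce on a non-overlapping range.

```python
def axiom_spans(tags):
    """(start, end) pairs, end exclusive, of axiom blocks in a B/I/O sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag in ("B", "O") and start is not None:
            spans.append((start, i))  # the current block ends here
            start = None
        if tag == "B":
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def is_valid(tags):
    """Tag O (or the sequence start) cannot be followed by tag I."""
    prev = "O"
    for tag in tags:
        if tag == "I" and prev == "O":
            return False
        prev = tag
    return True


def delete_axiom(tags, span):
    """Delete an axiom by labeling all its discourse elements as O."""
    start, end = span
    return tags[:start] + ["O"] * (end - start) + tags[end:]


def introduce_axiom(tags, start, end):
    """Introduce a new axiom over a contiguous run of O-labeled elements."""
    if start >= end or any(tag != "O" for tag in tags[start:end]):
        raise ValueError("new axiom must cover a non-empty run of O tags")
    return tags[:start] + ["B"] + ["I"] * (end - start - 1) + tags[end:]
```

A sampler would propose one of these edits, then accept or reject it with the usual Metropolis-Hastings ratio; that acceptance step is omitted here.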

$U(\mathbf{Y}) = f_{AA}$. We again have two variants, where we model the ordering constraints (C4′) as soft or hard constraints.

### 8.2 Axiom Parsing

After harvesting axioms, we build a parser for these axioms that maps raw axioms to horn-clause rules. The axiom harvesting step provides us with a multiset of axiom extractions. Let $\mathbf{A} = \{A_1, A_2, \ldots, A_{|\mathbf{A}|}\}$ represent the multiset, where each axiom $A_i$ is mentioned at least once. Each axiom mention, in turn, comprises a contiguous sequence of discourse elements and optionally an accompanying diagram.

Semantic parsers map natural language to formal programs such as database queries (Liang, Jordan, and Klein 2011, inter alia), commands to robots (Shimizu and Haas 2009, inter alia), or even general purpose programs (Yin and Neubig 2017). More specifically, Liu et al. (2016a) learn “If-Then” program statements and Quirk, Mooney, and Galley (2015) learn “If-This-Then-That” rules. In theory, these works can be used to parse axioms to horn-clause rules. However, semantic parsing is a hard task and would require a large amount of supervision. In our setting, we can only afford a modest amount of supervision. We mitigate this issue by using the redundant axiom mention extractions from multiple sources (textbooks) and combining the parses obtained from various textbooks to achieve a better final parse for each axiom.

First, we describe a base parser that parses axiom mentions to horn-clause rules. Then, we utilize the redundancy of axiom extractions from various sources (textbooks) to improve our parser.

#### 8.2.1 Base Axiomatic Parser.

Our base parser identifies the *premise* and *conclusion* portions of each axiom and then uses *GEOS*’s text parser to parse the two portions into a logical formula. Then, the two logical formulas are put together to form horn-clause rules.

Axiom mentions (for example, the Pythagorean theorem mention in Figure 1) are often accompanied by equations or diagrams. When the mention has an equation, we simply treat the equation as the *conclusion* and the rest of the mention as the *premise*. When the axiom has an associated diagram, we always include the diagram in the *premise*. We learn a model to predict the split of the axiom text into two parts, forming the *premise* and the *conclusion* spans. Then, the *GEOS* parser maps the *premise* and *conclusion* spans to *premise* and *conclusion* logical formulas, respectively.

Let $Z_s$ represent the split that demarcates the *premise* and *conclusion* spans. We score the axiom split as a log-linear model: $p(Z_s \mid a; \mathbf{w}) \propto \exp\left(\mathbf{w}^{T}\mathbf{h}(a, Z_s)\right)$. Here, $\mathbf{h}$ are feature functions described later. We found that in most cases (>95%), the premise and conclusion are contiguous spans in the axiom mention where the left span corresponds to the *premise* and the right span corresponds to the *conclusion*. Hence, we search over the space of contiguous spans to infer $Z_s$. Joint search over the latent variables $Z_s$, $Z_p$, and $Z_c$ is exponential. Hence, we use a greedy procedure, beam search, with a fixed beam size (10) for inference. That is, in each step, we only expand the ten most promising candidates so far given by the current score. We first infer $Z_s$ to decide the split of the axiom and then infer $Z_p$ and $Z_c$ to obtain the parse of the premise and the conclusion, using the two-part approach described before. We use L-BFGS for learning.
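The first stage of this inference, scoring every contiguous split with the log-linear model and keeping the ten best, can be sketched as follows. This is an illustration only: `score_splits`, `toy_features`, and the two toy feature names and weights are hypothetical stand-ins for the real feature functions, and the later stages that parse the premise and conclusion are not shown.

```python
import math


def score_splits(tokens, weights, feature_fn, beam_size=10):
    """Rank split points k (premise = tokens[:k], conclusion = tokens[k:]).

    Returns up to `beam_size` pairs (k, probability) under a log-linear model
    whose score for a split is the dot product of weights and features.
    """
    if len(tokens) < 2:
        return []
    splits = list(range(1, len(tokens)))
    raw = [
        sum(weights.get(name, 0.0) * value
            for name, value in feature_fn(tokens, k).items())
        for k in splits
    ]
    top = max(raw)  # subtract the max for numerical stability
    unnorm = [math.exp(r - top) for r in raw]
    z = sum(unnorm)  # partition function over all candidate splits
    ranked = sorted(zip(splits, (u / z for u in unnorm)),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:beam_size]


def toy_features(tokens, k):
    """Two hypothetical indicator features for illustration."""
    return {
        "comma_before_split": float(tokens[k - 1] == ","),
        "then_starts_conclusion": float(tokens[k] == "then"),
    }


tokens = "if two sides are equal , then the angles are equal".split()
weights = {"comma_before_split": 1.5, "then_starts_conclusion": 2.0}
beam = score_splits(tokens, weights, toy_features)
```

In this toy example the highest-ranked split places "then the angles are equal" in the conclusion span.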

**Features:** We list the features $\mathbf{h}$ defined over candidate spans forming the text split in Table 6. The features are similar to those used in previous work on discourse analysis, in particular on the automatic detection of elementary discourse units (EDUs) in rhetorical structure theory (Mann and Thompson 1988) and discourse parsing (Marcu 2000; Soricut and Marcu 2003). These include ideas such as the use of a list of discourse markers, punctuation, and natural text and JSON organization as an indicator of discourse boundaries. We also use an off-the-shelf discourse parser and an *EDU* segmenter from Soricut and Marcu (2003). In addition, we use syntax-based cues, such as span lengths, head node attachment, distance to common ancestor/root, relative position of the two lexical heads and the text split, and *dominance*, which have been found to be useful in discourse parsing (Marcu 2000; Soricut and Marcu 2003). Finally, we use some semantic features, such as the similarity of the two spans (in terms of common words, geometry relations, and relation-arguments) and the number of geometry relations in the respective span parses. We will discuss the impact of the various features later in Section 11. Given a beam of *premise* and *conclusion* splits, we use the *GEOS* parser to obtain *premise* and *conclusion* logical formulas for each split in the beam and obtain a beam of axiom parses for each axiom in each textbook.

Type | Feature | Description |
---|---|---|
Content | Span Similarity | Proportion of (a) words, (b) geometry relations, and (c) relation-arguments shared by the two spans. |
Content | Number of Relations | Number of geometry relations represented in the two spans. We use the Lexicon Map from GEOS to compute the number of expressed geometry relations. |
Content | Span Lengths | The distribution of the two text spans is typically dependent on their lengths. We use the ratio of the length of the two spans as an additional feature. |
Content | Relative Position | Relative position of the two lexical heads and the text split in the discourse element sentence. We use the difference between the lexical head position and the text split position as the feature. |
Discourse (Typography) | Discourse Markers | Discourse markers (connectives, cue-words, or cue-phrases, etc.) have been shown to give good indications on discourse structure (Marcu 2000). We build a list of discourse markers using the training set, considering the first and last tokens of each span, culled to top 100 by frequency. We use these 100 discourse markers as features. We repeat the same procedure by using part-of-speech (POS) instead of words and use them as features. |
Discourse (Typography) | Punctuation | Punctuation at the segment border is another excellent cue for the segmentation. We include indicator features to show whether there is punctuation at the segment border. |
Discourse (Typography) | Text Organization | Indicator that the two text spans are part of the same (a) sentence, (b) paragraph. |
Discourse (Typography) | RST Parse | We use an off-the-shelf RST parser (Feng and Hirst 2014) and include an indicator feature that shows that the segmentation matches the parse segmentation. We also include the RST label as a feature. |
Discourse (Typography) | Soricut and Marcu Segmenter | Soricut and Marcu (2003) (section 3.1) presented a statistical model for deciding elementary discourse unit boundaries. We use the probability given by this model retrained on our training set as a feature. This feature uses both lexical and syntactic information. |
Discourse (Typography) | Head/ Common Ancestor/ Attachment Node | Head node is defined as the word with the highest occurrence as a lexical head in the lexicalized tree among all the words in the text span. The attachment node is the parent of the head node. We use features for the head words of the left and right spans, the common ancestor (if any), the attachment node, and the conjunction of the two head node words. We repeat these features with part-of-speech (POS) instead of words. |
Discourse (Typography) | Syntax | Distance to (a) root, and (b) common ancestor for the nodes spanning the respective spans. We use these distances and the difference in the distances as features. |
Discourse (Typography) | Dominance | Dominance (Soricut and Marcu 2003) is a key idea in discourse that looks at syntax trees and studies sub-trees for each span to infer a logical nesting order between the two. We use the dominance relationship as a feature. See Soricut and Marcu (2003) for details. |
Discourse (Typography) | JSON structure | Indicator that the two spans are in the same node in the JSON hierarchy. Conjoined with the indicator feature that shows that the two spans are part of the same paragraph. |


#### 8.2.2 Multisource Axiomatic Parser.

Now, we describe a multisource parser that utilizes the redundancy of axiom extractions from various sources (textbooks). Given a beam of 10-best parses for each axiom from each source, we use a number of heuristics to determine the best parse for the axiom:

- **Majority Voting:** For each axiom, pick the parse that occurs most frequently across beams.

- **Average Score:** Pick the parse that has the highest average parse score (only counting top 5 parses for each source) for each axiom.

- **Learn Source Confidence:** Learn a set of weights {*μ*_{1}, *μ*_{2}, …, *μ*_{S}}, one for each source, and then pick the parse that has the highest average weighted parse score for each axiom.

- **Predicate Score:** Instead of selecting from one of the top parses across various sources, treat each axiom parse as a bag of premise predicates and a bag of conclusion predicates. Then, pick a subset of premise and conclusion predicates for the final parse, using average scoring with thresholding.

## 9. Experiments

**Data Sets and Baselines:** We use a collection of grade 6–10 Indian high school math textbooks by four publishers/authors (*NCERT*, *R S Aggarwal*, *R D Sharma*, and *M L Aggarwal*)—a total of 5 × 4 = 20 textbooks to validate our model. Millions of students in India study geometry from these books every year and these books are readily available online. We manually marked chapters relevant for geometry in these books and then parsed them using Adobe Acrobat’s *pdf2xml* parser and AllenAI’s *Science Parse* project.^{4} Then, we annotated geometry axioms, alignments, and parses for grade 6, 7, and 8 textbooks by the four publishers/authors. We use grade 6, 7, and 8 textbook annotations for development, training, and testing, respectively. Grade 9 and 10 data are used as unlabeled data. Thus our method is semi-supervised. During training our axiom identification, alignment, and joint axiom identification and alignment models, the latent variables **Z** are fixed for the training set and are not sampled. For the remaining data, these variables are sampled using our Gibbs sampler. All the hyper-parameters in all the models are tuned on the development set using grid search. Then, these hyper-parameter values are fixed and the entire training + development set is used for training (along with the unlabeled data) and all the models are evaluated on the test set.

*GEOS* used 13 types of entities and 94 functions and predicates. We add some more entities, functions, and predicates to cover other more complex concepts in geometry not covered in *GEOS*. Thus, we obtain a final set of 19 entity types and 115 functions and predicates for our parsing model. We use Stanford CoreNLP (Manning et al. 2014) for feature generation. We use two data sets for evaluating our system: (a) practice and official SAT style geometry questions used in *GEOS*, and (b) an additional data set of geometry questions collected from the aforementioned textbooks. This data set consists of a total of 1,406 SAT style questions across grades 6–10, and is approximately 7.5 times the size of the data set used in *GEOS*. We split the data set into training (350 questions), development (150 questions), and test (906 questions), with equal proportion of grade 6–10 questions. We annotated the 500 training and development questions with ground-truth logical forms. We use the training set to train another version of *GEOS* with the expanded set of entity types, functions, and predicates. We call this system *GEOS++*, which will be used as a baseline for our method.

**Results:** We first evaluate the axiom identification, alignment, and parsing models individually.

For axiom identification, we compare the results of automatic identification with gold axiom identifications and compute the precision, recall, and F-measure on the test set. We use strict as well as relaxed comparison. In strict comparison mode the automatically identified mentions and gold mentions must match exactly to get credit, whereas in the relaxed comparison mode only a majority (>50%) of sentences in the automatically identified mentions and gold mentions must match to get credit. Table 7 shows the results of axiom identification, where we clearly see improvements in performance when we jointly model axiom identification and alignment. This is due to the fact that both components reinforce each other. We also observe that modeling the ordering constraints as soft constraints leads to better performance than modeling them as hard constraints. This is because the ordering of presentation of axioms is generally (yet not always) consistent across textbooks.
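The two comparison modes can be made concrete with a small sketch. It rests on assumptions that the text leaves open: each mention is represented as a set of sentence indices, and in relaxed mode a pair matches when the shared sentences make up more than 50% of both the predicted and the gold mention. The function name `prf` is hypothetical.

```python
def prf(predicted, gold, relaxed=False):
    """Precision, recall, and F-measure of predicted axiom mentions.

    Each mention is a frozenset of sentence indices. Strict mode requires an
    exact match; relaxed mode (assumed definition) requires the overlap to
    exceed half of both the predicted and the gold mention.
    """
    def match(p, g):
        if not relaxed:
            return p == g
        common = len(p & g)
        return common > 0.5 * len(p) and common > 0.5 * len(g)

    hit_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    hit_gold = sum(any(match(p, g) for p in predicted) for g in gold)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


predicted = [frozenset({1, 2, 3}), frozenset({7, 8})]
gold = [frozenset({1, 2, 3, 4}), frozenset({7, 8}), frozenset({10})]
```

On this toy input, the mention `{1, 2, 3}` earns credit only in relaxed mode, because it covers three of the four gold sentences but is not an exact match.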

Model | Strict P | Strict R | Strict F | Relaxed P | Relaxed R | Relaxed F |
---|---|---|---|---|---|---|
Identification | 64.3 | 69.3 | 66.7 | 84.3 | 87.9 | 86.1 |
Joint-Hard | 68.0 | 68.1 | 68.0 | 85.4 | 87.1 | 86.2 |
Joint-Soft | 69.7 | 71.1 | 70.4 | 86.9 | 88.4 | 87.6 |


To evaluate axiom alignment, we first view it as a series of decisions, one for each pair of axiom mentions, and compute precision, recall, and F-score by comparing automatic decisions with gold decisions. Then, we also use a standard clustering metric, Normalized Mutual Information (NMI) (Strehl and Ghosh 2002), to measure the quality of axiom mention clustering. Table 8 shows the results on the test set when gold axiom identifications are used. We observe improvements in axiom alignment performance, too, when we jointly model axiom identification and alignment, both in terms of F-score and NMI. Modeling ordering constraints as soft constraints again leads to better performance than modeling them as hard constraints in terms of both metrics.
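For reference, NMI between a predicted and a gold clustering of axiom mentions can be computed as below. This is a minimal sketch assuming the geometric-mean normalization, $I(X;Y)/\sqrt{H(X)H(Y)}$; the function name `nmi` and the convention for degenerate single-cluster inputs are choices made here, not details from the text.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings.

    `labels_a[i]` and `labels_b[i]` are the cluster ids assigned to item i.
    Normalization (assumed): mutual information over sqrt(H(a) * H(b)).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    entropy_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if entropy_a == 0 or entropy_b == 0:
        # Degenerate case: a clustering with a single cluster.
        return 1.0 if entropy_a == entropy_b else 0.0
    return mutual_info / math.sqrt(entropy_a * entropy_b)
```

The score does not depend on how clusters are named, so two clusterings that group the mentions identically score 1 even if their cluster ids differ.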

Model | P | R | F | NMI |
---|---|---|---|---|
Alignment | 71.8 | 74.8 | 73.3 | 0.60 |
Joint-Hard | 75.0 | 76.4 | 75.7 | 0.65 |
Joint-Soft | 79.3 | 81.4 | 80.3 | 0.69 |


To evaluate axiom parsing, we compute precision, recall, and F-score in (a) deriving literals in axiom parses, as well as for (b) the final axiom parses on our test set. Table 9 shows the results of axiom parsing for *GEOS* (trained on the training set) as well as various versions of our best performing system (*GEOS++* with our axiomatic solver) with various heuristics for multisource parsing. The results show that our system (single source) performs better than *GEOS*, as it is trained with the expanded set of entity types, functions, and predicates. The results also show that the choice of heuristic is important for the multisource parser—though all the heuristics lead to improvements over the single source parser. The average score heuristic that chooses the parse with the highest average score across sources performs better than majority voting, which chooses the best parse based on a voting heuristic. Learning the confidence of every source and using a weighted average is an even better heuristic. Finally, predicate scoring, which chooses the parse by scoring predicates on the premise and conclusion sides, performs the best leading to 87.5 F1 score (when computed over parse literals) and 73.2 F1 score (when computed on the full parse). The high F1 score for axiom parsing on the test set shows that our approach works well and we can accurately harvest axiomatic knowledge from textbooks.

System | Variant | Literals P | Literals R | Literals F | Full Parse P | Full Parse R | Full Parse F |
---|---|---|---|---|---|---|---|
GEOS | | 86.7 | 70.9 | 78.0 | 64.2 | 56.6 | 60.2 |
GEOS++ | Single Src. | 91.6 | 75.3 | 82.6 | 68.8 | 60.4 | 64.3 |
GEOS++ | Maj. Voting | 90.2 | 78.5 | 83.9 | 70.0 | 63.3 | 66.5 |
GEOS++ | Avg. Score | 90.8 | 79.6 | 84.9 | 71.7 | 66.4 | 69.0 |
GEOS++ | Src. Confid. | 91.0 | 79.9 | 85.1 | 73.3 | 68.1 | 70.6 |
GEOS++ | Pred. Score | 92.8 | 82.8 | 87.5 | 76.6 | 70.1 | 73.2 |


Finally, we use the extracted horn-clause rules in our axiomatic solver for solving geometry problems. For this, we over-generate a set of horn-clause rules by generating three horn-clause parses for each axiom and use them as the underlying theory in prolog programs such as the one shown in Figure 4. We use weighted logical expressions for the question description and the diagram derived from *GEOS++* as declarations, and the (normalized) score of the parsing model multiplied by the score of the joint axiom identification and alignment model as weights for the rules. Table 10 shows the results for our best end-to-end system and compares it to *GEOS* on the practice and official SAT data set from Seo et al. (2015) as well as questions from the 20 textbooks. On all three data sets, our system outperforms *GEOS*. On the data set from the 20 textbooks in particular (which is a harder data set and includes more problems that require complex reasoning based on geometry), *GEOS* does not perform very well, whereas our system still achieves a good score. *Oracle* shows the performance of our system when gold axioms (written down by an expert) are used along with automatic text and diagram interpretations in *GEOS++*. The gap to the *Oracle* shows that there is scope for further improvement in our approach.

## 10. Explainability

Students around the world solve geometry problems through rigorous deduction, whereas the numerical solver in *GEOS* does not provide such explainability. One of the key benefits of our axiomatic solver is that it provides an easy-to-understand student-friendly deductive solution to geometry problems.

To test the explainability of our axiomatic solver, we asked 50 grade 6–10 students (10 students in each grade) to use *GEOS* and our system (*GEOS++* with our axiomatic solver) as a Web-based assistive tool while learning geometry. The tool uses the probabilistic prolog solver (Fierens et al. 2015) to derive the most probable explanation (MPE) for a solution. Then, it lists, one by one, the various axioms used and the conclusion drawn from the axiom application, as shown in Figure 8. The students were each asked to rate how ‘explainable’ and ‘useful’ the two systems were on a scale of 1–5. Table 11 shows the mean rating by students in each grade on the two facets. We can observe that students of each grade found our system to be more interpretable as well as more useful to them than *GEOS*. This study lends support to our claims about the need for an interpretable deductive solver for geometry problems.

| Grade | Explainability: GEOS | Explainability: O.S. | Usefulness: GEOS | Usefulness: O.S. |
|---|---|---|---|---|
| Grade 6 | 2.7 | 2.9 | 2.9 | 3.2 |
| Grade 7 | 3.0 | 3.7 | 3.3 | 3.6 |
| Grade 8 | 2.7 | 3.5 | 3.1 | 3.5 |
| Grade 9 | 2.4 | 3.3 | 3.0 | 3.7 |
| Grade 10 | 2.8 | 3.1 | 3.2 | 3.8 |
| Overall | 2.7 | 3.3 | 3.1 | 3.6 |
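Because each grade contributed the same number of students (10), the overall rating should be close to the unweighted mean of the per-grade means. The following sketch checks this using the per-grade values from Table 11; the variable and function names are illustrative, and the agreement is only approximate in principle because the per-grade values are already rounded.

```python
from statistics import fmean

# Per-grade mean ratings from Table 11, in the order:
# (explainability GEOS, explainability O.S., usefulness GEOS, usefulness O.S.)
RATINGS = {
    "Grade 6": (2.7, 2.9, 2.9, 3.2),
    "Grade 7": (3.0, 3.7, 3.3, 3.6),
    "Grade 8": (2.7, 3.5, 3.1, 3.5),
    "Grade 9": (2.4, 3.3, 3.0, 3.7),
    "Grade 10": (2.8, 3.1, 3.2, 3.8),
}


def overall_means(per_grade):
    """Average each column over grades, rounded to one decimal place."""
    rows = list(per_grade.values())
    return tuple(round(fmean([row[i] for row in rows]), 1) for i in range(4))


overall = overall_means(RATINGS)
print(overall)  # (2.7, 3.3, 3.1, 3.6), matching the Overall row
```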

## 11. Feature Ablation

In this section, we measure the value of the various features in our axiom harvesting and parsing pipeline. We described three sets of features, **f**, **g**, and **h**, corresponding to the three steps in our pipeline (axiom identification, axiom alignment, and axiom parsing) in Tables 4, 5, and 6, respectively. We ablate the features in each of the three sets one at a time via **backward selection** (i.e., we remove a feature and observe how that affects performance).
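The ablation procedure can be sketched as a simple loop that removes one feature at a time and records the resulting drop in a score. This is a minimal illustration and not our experimental code: the `evaluate` callback stands in for retraining and scoring the pipeline, and the toy scorer and its feature contributions are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


def ablate_one_at_a_time(
    features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Tuple[float, Dict[str, float]]:
    """Return the unablated score and the score drop for each removed feature."""
    full = frozenset(features)
    baseline = evaluate(full)
    drops = {}
    for feature in sorted(full):
        # Score the model with exactly one feature removed.
        drops[feature] = baseline - evaluate(full - {feature})
    return baseline, drops


# Hypothetical additive scorer used only to make the sketch runnable.
TOY_CONTRIBUTION = {"keywords": 3.0, "bounding_box": 10.0, "xml_structure": 3.0}


def toy_evaluate(active: FrozenSet[str]) -> float:
    return 50.0 + sum(TOY_CONTRIBUTION[name] for name in active)


baseline, drops = ablate_one_at_a_time(TOY_CONTRIBUTION, toy_evaluate)
print(baseline, drops)
```

In the tables of this section, each row corresponds to one iteration of such a loop, and the "Unablated" row corresponds to the baseline.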

### 11.1 Ablating Axiom Identification Features

Table 12 shows the fall in axiom identification performance, as well as in overall performance, as we ablate the various axiom identification features listed in Table 4. We can observe that removal of any of the features results in a loss of performance. Thus, all the content as well as typographical features are important for performance. We observe that the content features such as sentence overlap, geometry entity sharing, and keyword usage are clearly important. At the same time, the various discourse features such as the RST relation, the axiom, theorem, and corollary annotation, the use of equations and diagrams, bold/underline, bounding box, and XML structure are all important. Most of these features depend on typographical information that is vital to the performance of the axiom identification component as well as the overall model. In particular, among the discourse features, the axiom, theorem, and corollary annotation and the bounding box features contribute most to the performance of the model, as they are direct indicators of the presence of an axiom mention.
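To make the comparison concrete, the following sketch ranks the features by the drop in strict-comparison axiom identification F1 relative to the unablated model, using the values reported in Table 12. The data structure and function names are illustrative only.

```python
UNABLATED_STRICT_F1 = 70.4

# (feature group, strict-comparison F1 after ablating the feature), from Table 12.
STRICT_F1 = {
    "Sentence Overlap": ("Content", 56.2),
    "Geometry entities": ("Content", 64.0),
    "Keywords": ("Content", 67.5),
    "RST edge": ("Discourse", 66.6),
    "Axm, Thm, Corr.": ("Discourse", 62.6),
    "Equation": ("Discourse", 66.2),
    "Associated Diagram": ("Discourse", 68.5),
    "Bold / Underline": ("Discourse", 68.2),
    "Bounding box": ("Discourse", 59.7),
    "XML structure": ("Discourse", 67.4),
}


def ranked_drops(scores, baseline):
    """Sort features by F1 drop (largest first), rounded to one decimal."""
    rows = [
        (name, group, round(baseline - f1, 1))
        for name, (group, f1) in scores.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


ranking = ranked_drops(STRICT_F1, UNABLATED_STRICT_F1)
for name, group, drop in ranking:
    print(f"{name:20s} {group:10s} {drop:5.1f}")
```

In this ranking, sentence overlap shows the largest drop overall, and the bounding box and the axiom, theorem, and corollary annotation features show the largest drops among the discourse features.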

| Feature group | Ablated feature | Axiom identification F1: Strict Comp. | Axiom identification F1: Relaxed Comp. | SAT score: Practice | SAT score: Official | SAT score: Textbook |
|---|---|---|---|---|---|---|
| Content | Sentence Overlap | 56.2 | 73.8 | 56 | 43 | 42 |
| | Geometry entities | 64.0 | 80.4 | 61 | 49 | 46 |
| | Keywords | 67.5 | 81.0 | 62 | 54 | 48 |
| Discourse (Typography) | RST edge | 66.6 | 78.9 | 58 | 46 | 44 |
| | Axm, Thm, Corr. | 62.6 | 77.8 | 57 | 47 | 43 |
| | Equation | 66.2 | 78.6 | 57 | 46 | 42 |
| | Associated Diagram | 68.5 | 84.4 | 61 | 52 | 49 |
| | Bold / Underline | 68.2 | 82.0 | 62 | 52 | 48 |
| | Bounding box | 59.7 | 75.5 | 55 | 47 | 40 |
| | XML structure | 67.4 | 80.6 | 60 | 51 | 46 |
| Unablated | | 70.4 | 87.6 | 64 | 55 | 51 |

### 11.2 Ablating Axiom Alignment Features

Table 13 shows the fall in axiom alignment performance, as well as in overall performance, as we ablate the various axiom alignment features listed in Table 5. We again observe that removal of any of the features results in a loss of performance. Thus, the various content as well as typographical features are important for performance. We observe that the content features such as unigram, bigram, and entity overlap, length of the longest common subsequence, number of sentences, and the various aligner, MT, and summarization scores are clearly important. At the same time, the various discourse features such as the XML structure, equation template, and image caption match are all important. Note that these features depend on typographical information that is again vital to performance. In particular, we can observe that the overlap and the XML structure features contribute most to the performance of the model.

| Feature group | Ablated feature | F1 | NMI | SAT score: Practice | SAT score: Official | SAT score: Textbook |
|---|---|---|---|---|---|---|
| Content | Overlap | 70.7 | 0.54 | 57 | 45 | 45 |
| | LCS | 78.7 | 0.64 | 61 | 53 | 49 |
| | Number of Sentences | 78.5 | 0.65 | 62 | 54 | 48 |
| | Alignment Scores | 72.6 | 0.57 | 59 | 49 | 48 |
| | MT Metrics | 74.8 | 0.60 | 62 | 52 | 49 |
| | Summarization Metrics | 75.9 | 0.63 | 62 | 54 | 50 |
| Typography | XML Structure | 71.5 | 0.57 | 58 | 47 | 46 |
| | Equation Template | 76.6 | 0.61 | 57 | 47 | 43 |
| | Image Caption | 77.9 | 0.65 | 62 | 53 | 47 |
| Unablated | | 80.3 | 0.69 | 64 | 55 | 51 |

### 11.3 Ablating Axiom Parsing Features

Table 14 shows the fall in axiom parsing performance, as well as in overall performance, as we ablate the various axiom parsing features listed in Table 6. We again observe that removal of any of the features results in a loss of performance. The axiom parsing component uses a few content-based features, such as span similarity, number of relations, span lengths, and relative position, and various discourse features, such as discourse markers, punctuation, text organization, RST parse, the existing discourse segmenter of Soricut and Marcu (2003), node attachment, syntax, dominance, and XML structure. All of these are clearly important. In particular, we can observe that the span similarity and punctuation features contribute most to the performance of the model.

| Ablated feature | F1: Literals | F1: Full Parse | SAT score: Practice | SAT score: Official | SAT score: Textbook |
|---|---|---|---|---|---|
| Span Similarity | 71.8 | 64.6 | 51 | 40 | 42 |
| No. of Relations | 82.3 | 70.5 | 60 | 51 | 49 |
| Span Lengths | 86.0 | 72.0 | 63 | 54 | 50 |
| Relative Position | 83.9 | 69.2 | 60 | 52 | 47 |
| Discourse Markers | 77.4 | 68.4 | 55 | 48 | 47 |
| Punctuations | 73.5 | 65.0 | 52 | 45 | 45 |
| Text Organization | 74.4 | 66.2 | 52 | 47 | 46 |
| RST Parse | 84.6 | 70.8 | 62 | 52 | 49 |
| Soricut & Marcu | 83.2 | 69.8 | 61 | 52 | 50 |
| Head Node, etc. | 85.3 | 71.6 | 62 | 54 | 49 |
| Syntax | 75.5 | 66.6 | 54 | 47 | 46 |
| Dominance | 73.9 | 66.1 | 53 | 47 | 44 |
| XML Structure | 77.6 | 68.0 | 59 | 51 | 46 |
| Unablated | 87.5 | 73.2 | 64 | 55 | 51 |

## 12. Axioms Harvested

We qualitatively analyze the structured axioms harvested by our method. Figure 9 shows the few most probable horn-clause rules for some popular named theorems in geometry, along with the confidence of our method in each rule being correct. Note that some parsed horn-clause rules can be incorrect. For example, the second most probable horn-clause rule for the Pythagorean theorem is partially incorrect (it does not state which angle is 90°). Similarly, the second and third most probable horn-clause rules for the circle secant tangent theorem are also incorrect. Our ProbLog solver can still use these redundant but weighted horn-clause rules for solving geometry problems.
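One way to see why redundant weighted rules are usable is the standard ProbLog treatment of several probabilistic clauses that share a head: each clause fires independently with its own probability, so the clauses combine as a noisy-or. The sketch below illustrates this for a single conclusion. The weights are hypothetical rather than the confidences in Figure 9, and the sketch assumes that the truth of each rule body is already known.

```python
from math import prod
from typing import Sequence


def derivation_probability(
    weights: Sequence[float], body_holds: Sequence[bool]
) -> float:
    """Noisy-or probability that at least one applicable rule fires.

    weights[i] is the probability attached to rule i, and body_holds[i]
    says whether the body of rule i is satisfied for the given problem.
    """
    not_fired = prod(1.0 - w for w, holds in zip(weights, body_holds) if holds)
    return 1.0 - not_fired


# Three hypothetical parses of the same axiom, all deriving the same head.
weights = [0.6, 0.3, 0.1]
all_apply = derivation_probability(weights, [True, True, True])
only_best = derivation_probability(weights, [True, False, False])
none_apply = derivation_probability(weights, [False, False, False])
print(all_apply, only_best, none_apply)
```

Under this reading, a low-weight rule adds little probability mass when it fires, so a partially incorrect parse with low confidence has limited influence on the most probable explanation.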

## 13. Example Solutions and Error Analysis

Next, we describe some example solutions of geometry problems and perform a qualitative error analysis. We first show some sample questions that our solver can answer correctly in Table 15. We also show the explanations generated by our deductive solver for these problems (constructed in the same way as described earlier). Note that these problems are diverse in terms of question types, as well as the reasoning required to answer them, and our solver can handle them.

We also show some failure cases of our approach in Table 16. Our approach can fail to answer a question correctly for a number of reasons: an error in parsing the diagram, an error in parsing the text, or incorrect or incomplete knowledge in the form of geometry rules. To quantify these, we performed a small error analysis on 100 textbook questions, of which our approach answered 52 correctly. Among the 48 incorrectly answered questions, our diagram parse was incorrect for 12 questions, and the text parse was incorrect for 15 questions. Our formal language was insufficiently defined to handle 6 questions (i.e., the semantics of the question could not be adequately captured by the formal language). Twenty-one questions were incorrectly answered due to missing knowledge of geometry in the form of rules. These counts sum to 54 rather than 48 because several questions were incorrectly answered due to a failure of multiple system components (for example, failure of both the text and the diagram parser).

## 14. Conclusion

We presented an approach to harvest structured axiomatic knowledge from math textbooks. Our approach uses rich features based on context and typography, the redundancy of axiomatic knowledge, and shared ordering constraints across multiple textbooks to accurately extract axiomatic knowledge and parse it into horn-clause rules. We used the parsed axiomatic knowledge to improve the best previously published automatic approach for solving geometry problems. A user study conducted with school students studying geometry found our approach to be more interpretable and useful than its predecessor. While this article focused on harvesting geometry axioms from textbooks as a case study, we would like to extend it to obtain valuable structured knowledge from textbooks in areas such as science, engineering, and finance.

## Notes

Please see related work (Section 2) for a complete list of references.

For example, the predicate *isPerpendicular*(AB, CD) is mapped to the constraint $\frac{y_B - y_A}{x_B - x_A} \times \frac{y_D - y_C}{x_D - x_C} = -1$.

Given a textbook in JSON format, we can construct this ordered set by preorder traversal of the JSON tree.
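As a minimal sketch of such a traversal, the code below visits a JSON tree in preorder and collects text in document order. The schema is an assumption made purely for illustration: each node is taken to be an object with an optional `text` field and an optional `children` list, which need not match the actual textbook format.

```python
import json
from typing import Any, Dict, List


def preorder_texts(root: Dict[str, Any]) -> List[str]:
    """Collect the text of every node, parent first, then children left to right."""
    ordered: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        text = node.get("text")
        if text:
            ordered.append(text)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(node.get("children", [])))
    return ordered


# Hypothetical miniature textbook.
book = json.loads("""
{"text": "Chapter 1",
 "children": [
   {"text": "Section 1.1", "children": [{"text": "Theorem A"}]},
   {"text": "Section 1.2", "children": [{"text": "Theorem B"}]}
 ]}
""")
order = preorder_texts(book)
print(order)
```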