Abstract
Answering questions that involve multi-step reasoning requires decomposing them and using the answers of intermediate steps to reach the final answer. However, state-of-the-art models in grounded question answering often do not explicitly perform decomposition, leading to difficulties in generalization to out-of-distribution examples. In this work, we propose a model that computes a representation and denotation for all question spans in a bottom-up, compositional manner using a CKY-style parser. Our model induces latent trees, driven by end-to-end (the answer) supervision only. We show that this inductive bias towards tree structures dramatically improves systematic generalization to out-of-distribution examples, compared to strong baselines on an arithmetic expressions benchmark as well as on CLOSURE, a dataset that focuses on systematic generalization for grounded question answering. On this challenging dataset, our model reaches an accuracy of 96.1%, significantly higher than prior models that almost perfectly solve the task on a random, in-distribution split.
1 Introduction
Humans can effortlessly interpret new language utterances if they are composed of previously observed primitives and structure (Fodor and Pylyshyn, 1988). Neural networks, conversely, do not exhibit this systematicity: although they generalize well to examples sampled from the distribution of the training set, they have been shown to struggle in generalizing to out-of-distribution (OOD) examples that contain new compositions in grounded question answering (Bahdanau et al., 2019a,b) and semantic parsing (Finegan-Dollak et al., 2018; Keysers et al., 2020). Consider the question in Figure 1. This question requires querying object sizes, comparing colors, identifying spatial relations, and computing intersections between sets of objects. Neural models succeed when these concepts are combined in ways that were seen at training time. However, they commonly fail when concepts are combined in new ways at test time.
A possible reason for this phenomenon is the expressivity of architectures such as LSTMs (Hochreiter and Schmidhuber, 1997) and Transformers (Vaswani et al., 2017), where “global” representations that depend on the entire input are computed. Contextualizing token representations using the entire utterance potentially lets the model avoid step-by-step reasoning, “collapse” reasoning steps, and rely on shortcuts (Jiang and Bansal, 2019; Subramanian et al., 2019, 2020). Such failures are revealed when evaluating models for systematic generalization on OOD examples. This stands in contrast to pre-neural log-linear models, where hierarchical representations were explicitly constructed over the input (Zettlemoyer and Collins, 2005; Liang et al., 2013).
In this work, we propose a model for visual question answering (QA) that, analogous to these classical pre-neural models, computes for every span in the input question a representation and a denotation, that is, the set of objects in the image that the span refers to (see Figure 1). Denotations for long spans are recursively computed from shorter spans using a bottom–up CKY-style parser without access to the entire input, leading to an inductive bias that encourages compositional computation. Because training is done from the final answer only, the model must effectively learn to induce latent trees that describe the compositional structure of the problem. We hypothesize that this explicit grounding of the meaning of sub-spans through hierarchical computation should result in better generalization to new compositions.
We evaluate our approach in two setups: (a) a synthetic arithmetic expressions dataset, and (b) CLOSURE (Bahdanau et al., 2019b), a visual QA dataset focusing on systematic generalization. On a random train/test split (i.i.d split), both our model and prior baselines obtain near-perfect performance. However, on splits that require generalization to new compositions (compositional split), our model dramatically improves performance: on the arithmetic dataset, a Transformer-based model fails to generalize and obtains 2.9% accuracy, whereas our model, Grounded Latent Trees (GLT), reaches 98.4%. On CLOSURE, our model's accuracy is 96.1%, 24 absolute points higher than strong baselines and even 19 points higher than models that use gold structures at training time or depend on domain knowledge.
To conclude, we propose a model with an inherent inductive bias for compositional computation, which leads to gains in systematic generalization, and induces latent structures that are useful for understanding its inner workings. This suggests that despite the success of general-purpose architectures built on top of contextualized representations, restricting information flow inside the network can benefit compositional generalization. Our code and data can be found at https://github.com/benbogin/glt-grounded-latent-trees-qa.
2 Compositional Generalization
Language is mostly compositional; humans can understand and produce a potentially infinite number of novel combinations from a finite set of components (Chomsky, 1957; Montague, 1970). For example, a person would know what a “winged giraffe” is even if she’s never seen one, assuming she knows the meaning of “winged” and “giraffe”. This ability, termed compositional generalization, is fundamental for building robust models that learn from limited data (Lake et al., 2018).
Neural networks have been shown to generalize well in many language understanding tasks with i.i.d splits (Devlin et al., 2019; Linzen, 2020). However, when evaluated on splits that require compositional generalization, a significant drop in performance is observed. For example, in SCAN (Lake and Baroni, 2018) and gSCAN (Ruis et al., 2020), synthetically generated commands are mapped into sequences of actions. When tested on unseen command combinations, models perform poorly. A similar trend was shown in text-to-SQL parsing (Finegan-Dollak et al., 2018), where splitting examples by the template of the target SQL query resulted in a dramatic performance drop. SQOOP (Bahdanau et al., 2019a) shows the same phenomenon on a synthetic visual task that tests for generalization over unseen combinations of object properties and relations. This has also led to methods that construct compositional splits automatically (Keysers et al., 2020).
In this work, we focus on answering complex grounded questions over images. The CLEVR benchmark (Johnson et al., 2017a) contains synthetic images and questions that require multi-step reasoning, for example, “Are there any cyan spheres made of the same material as the green sphere?”. While performance on this task is high, with accuracies of 97%–99% (Perez et al., 2018; Hudson and Manning, 2018), recent work (Bahdanau et al., 2019b) introduced CLOSURE: a new set of questions with identical vocabulary but different structure than CLEVR, asked over the same set of images. They evaluated the generalization of different models and showed that all fail on a large fraction of the new questions.
The most common approach for grounded QA is based on end-to-end differentiable models such as FiLM (Perez et al., 2018), MAC (Hudson and Manning, 2018), LXMERT (Tan and Bansal, 2019), and UNITER (Chen et al., 2020). These models do not explicitly decompose the problem into smaller sub-tasks, and are thus prone to fail on compositional generalization. A different approach (Yi et al., 2018; Mao et al., 2019) is to parse the image into a symbolic or distributed knowledge graph with objects, attributes, and relations, and then parse the question into an executable logical form, which is deterministically executed. Last, Neural Module Networks (NMNs; Andreas et al. 2016) parse the question into an executable program, where execution is learned: Each program module is a neural network designed to perform an atomic task, and modules are composed to perform complex reasoning. The latter two model families construct compositional programs and have been shown to generalize better on compositional splits (Bahdanau et al., 2019a,b) compared with end-to-end models. However, programs are not explicitly tied to question spans, and search over the space of programs is not differentiable, leading to difficulties in training.
In this work, we learn a latent structure for the question and tie each question span to an executable module in a differentiable manner. Our model balances distributed and symbolic approaches: We learn from downstream supervision only and output an inferred tree, describing how the answer was computed. We base our work on latent tree parsers (Le and Zuidema, 2015; Liu et al., 2018; Maillard et al., 2019; Drozdov et al., 2019) that produce representations for all spans, and softly weight all possible trees. We extend these parsers to answer grounded questions, grounding sub-trees in image objects.
Closest to our work is Gupta and Lewis (2018) (GL), who proposed to compute denotations for each span by constructing a CKY chart. Our model differs in several important aspects: First, while both models compute the denotations of all spans, we additionally compute a dense representation for each span using a learned composition function, increasing the expressiveness of our model. Second, we compute denotations using simpler and more generic composition operators, whereas in GL the operators are fine-grained and type-driven. Last, we work with images rather than a knowledge graph, and we propose several modifications to the chart construction mechanism to overcome scalability challenges.
3 Model
We first give a high-level overview of our proposed GLT model (§ 3.1), and then explain our model in detail.
3.1 High-level Overview
Problem Setup
Our task is visual QA: given a question $q = (q_0, \ldots, q_{n-1})$ and an image $I$, we aim to output an answer from a fixed set of natural language phrases. We train a model from a training set $\{(q^{\ell}, I^{\ell}, a^{\ell})\}_{\ell=1}^{N}$ of question-image-answer triples. We assume we can extract from the image up to $n_{\text{obj}}$ feature vectors of objects and represent them as a matrix $V \in \mathbb{R}^{n_{\text{obj}} \times h_{\text{dim}}}$ (details on object detection and representation are in § 3.4).
Our goal is to compute for every question span $q_{ij} = (q_i, \ldots, q_{j-1})$ a representation $h_{ij}$ and a denotation $d_{ij} \in [0,1]^{n_{\text{obj}}}$, interpreted as the probability that the question span refers to each object. We compute $h_{ij}$ and $d_{ij}$ in a bottom-up fashion, using CKY (Cocke, 1969; Kasami, 1965; Younger, 1967). Algorithm 1 provides a high-level description of the procedure. We compute representations and denotations for length-1 spans (we write $h_i = h_{i(i+1)}$ and $d_i = d_{i(i+1)}$ for brevity) by setting the representation to be the corresponding word representation in an embedding matrix $E$ and grounding each word in the image objects with the grounding function $f_{\text{ground}}$ (lines 3–4; $f_{\text{ground}}$ is described in § 3.4). Then, we recursively compute representations and denotations of larger spans (lines 5–6). Last, we pass the question representation $h_{0n}$ and the weighted sum of the visual representations, $d_{0n}V$, through a softmax layer to produce a final answer distribution (the final line of Algorithm 1), using a learned classification matrix $W$.
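The following sketch (ours, not the released implementation) illustrates the chart computation described above; the composition function `f_h`, the split scorer, the denotation function `f_d`, and the answer layer are hypothetical callables standing in for the learned components, and the dot-product grounding and soft split weighting are simplifying assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def glt_chart(E_q, V, f_h, f_d, split_scorer, answer_layer):
    """Minimal sketch of the bottom-up chart (cf. Algorithm 1).

    E_q: (n, h_dim) word embeddings of the question.
    V:   (n_obj, h_dim) visual object representations.
    f_h, f_d, split_scorer, answer_layer: hypothetical callables standing in
    for the learned composition, denotation, scoring, and answer functions.
    """
    n = E_q.shape[0]
    h, d = {}, {}  # span representations h[(i, j)] and denotations d[(i, j)]

    # Length-1 spans: word embedding + grounding in the image objects.
    for i in range(n):
        h[(i, i + 1)] = E_q[i]
        d[(i, i + 1)] = 1.0 / (1.0 + np.exp(-(V @ E_q[i])))  # per-object probs

    # Larger spans, bottom-up; all split points are weighted softly.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            splits = list(range(i + 1, j))
            cand_h = [f_h(h[(i, k)], h[(k, j)]) for k in splits]
            alpha = softmax(np.array(
                [split_scorer(h[(i, k)], h[(k, j)]) for k in splits]))
            h[(i, j)] = sum(a * c for a, c in zip(alpha, cand_h))
            # The denotation is computed once per span (see Section 3.2).
            d[(i, j)] = f_d(i, j, splits, alpha, h, d, V)

    # Answer distribution from the full-question representation and the
    # weighted sum of visual representations, d_{0n} V.
    return answer_layer(h[(0, n)], d[(0, n)] @ V)
```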
Computing $h_{ij}$ and $d_{ij}$ for all spans requires overcoming some challenges. Each span representation $h_{ij}$ should be a function of two sub-spans $h_{ik}, h_{kj}$. We use the term sub-spans to refer to all adjacent pairs of spans that cover $q_{ij}$: $\{(q_{ik}, q_{kj})\}_{k=i+1}^{j-1}$. However, we have no supervision for the “correct” split point $k$. Our model considers all possible split points and learns to induce a latent tree structure from the final answer only (§ 3.2). We show that this leads to an interpretable compositional structure and denotations that can be inspected at test time.
In § 3.3 we describe the form of the composition functions, which compute both span representations and denotations from two sub-spans. These functions must be expressive enough to accommodate a wide range of interactions between sub-spans, while avoiding reasoning shortcuts that might hinder compositional generalization.
3.2 Grounded Chart Parsing
Next, we compute the denotation $d_{ij}$ of each span. Conceptually, computing $d_{ij}$ could be analogous to $h_{ij}$; that is, a function $f_d$ would compute a candidate denotation for every split point $k$, and these candidates would be combined according to the split-point probabilities. However, the function $f_d$ (see § 3.3) interacts with the visual representations of all objects and is thus computationally costly. Therefore, we propose a less expressive but more efficient approach, where $f_d(\cdot)$ is applied only once for each span $q_{ij}$.
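To make this efficiency trade-off concrete, the sketch below contrasts the two options in simplified form; the exact inputs to f_d differ in the model, so mixing the sub-span quantities with the split weights before a single call is an illustrative assumption rather than the paper's formulation.

```python
def denotation_per_split(f_d, splits, alpha, d, h, V, i, j):
    """Expensive variant: apply the costly f_d once per split point and mix
    the resulting candidate denotations with the split weights alpha."""
    cands = [f_d(d[(i, k)], d[(k, j)], h[(i, k)], h[(k, j)], V) for k in splits]
    return sum(a * c for a, c in zip(alpha, cands))

def denotation_per_span(f_d, splits, alpha, d, h, V, i, j):
    """Cheaper variant: first mix the sub-span inputs with the split weights,
    then apply the costly f_d only once for the whole span (i, j)."""
    d_left = sum(a * d[(i, k)] for a, k in zip(alpha, splits))
    d_right = sum(a * d[(k, j)] for a, k in zip(alpha, splits))
    h_left = sum(a * h[(i, k)] for a, k in zip(alpha, splits))
    h_right = sum(a * h[(k, j)] for a, k in zip(alpha, splits))
    return f_d(d_left, d_right, h_left, h_right, V)
```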
3.3 Composition Functions
We now describe the exact form of the composition functions fh and fd.
Composing Representations
Composing Denotations
Next, we describe the function $f_d$, used to compute the span denotation $d_{ij}$ (Equation 5). This function has access only to words in the span $q_{ij}$ and not to the entire input utterance. We want $f_d(\cdot)$ to support both simple compositions that depend only on the denotations of sub-spans, and more complex functions that consider the visual representations of different objects (spatial relations, colors, etc.).
Skip
Intersection and Union
Visual
This module is responsible for visual computations, such as computing spatial relations (“left of the red sphere”) and comparing attributes of objects (“has the same size as the red sphere”). Unlike the other modules, it also uses the visual representations of the objects, $V$. For example, for the sub-spans “left of” and “the red object”, we expect the function to ignore the denotation of the left sub-span (the denotation of “left of” is irrelevant) and to return a denotation that assigns high probability to objects that are to the left of objects with high probability in the denotation of the right sub-span.
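As a toy illustration of the kind of computation such a module can express (a sketch under our own simplifying assumptions, not the module's actual parameterization):

```python
import numpy as np

def visual_compose(d_right, V, W_rel):
    """Toy "Visual"-style composition for a phrase like "left of the red object".

    d_right: (n_obj,) denotation of the right sub-span ("the red object").
    V:       (n_obj, h_dim) visual object representations.
    W_rel:   (h_dim, h_dim) relation weights; in the model, the effect of the
             relation phrase ("left of") would come from its learned
             representation, while its denotation is ignored.
    Returns per-object scores that are high for objects related (under W_rel)
    to objects scoring high in d_right.
    """
    pair_scores = V @ W_rel @ V.T                    # (n_obj, n_obj)
    pair_probs = 1.0 / (1.0 + np.exp(-pair_scores))  # pairwise relation probs
    return np.clip(pair_probs @ d_right, 0.0, 1.0)   # expected relatedness
```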
3.4 Grounding
In lines 3–4 of Algorithm 1, we handle length-1 spans. The span representation $h_i$ is initialized as the corresponding word embedding, and the denotation $d_i$ is computed with a grounding function. A simple implementation of $f_{\text{ground}}$ would use the dot product between the word representation and the visual representations of all objects.3 However, in the case of co-reference (“it”), we should ground the co-referring pronoun in the denotation of a previous span. We now describe this case.
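A minimal sketch of such a dot-product grounding; normalizing the per-object scores with a sigmoid is our assumption here, not a claim about the exact implementation.

```python
import numpy as np

def f_ground(h_word, V):
    """Ground a single word in the image objects via dot products.

    h_word: (h_dim,) word representation.
    V:      (n_obj, h_dim) visual object representations.
    Returns per-object probabilities that the word refers to each object
    (the sigmoid normalization is an assumption of this sketch).
    """
    return 1.0 / (1.0 + np.exp(-(V @ h_word)))
```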
Co-reference
Sentences such as “there is a red sphere; what is its material?” are challenging for a CKY parser, because the denotation of “its” depends on a distant span. We propose a simple heuristic that addresses the case where the referenced object is the denotation of a previous sentence. This could be expanded in future work to a wider array of coreference phenomena.
In every example that comprises two sentences: (a) we compute the denotation $d_{\text{first}}$ for the entire first sentence as previously described; (b) we ground each word in the second sentence as proposed above; (c) for each word in the second sentence, we predict whether it co-refers to $d_{\text{first}}$ using a learned gate, parameterized by $\text{FF}_{\text{coref}} \in \mathbb{R}^{h_{\text{dim}} \times 2}$; (d) we define the final word denotation as a gate-weighted combination of $d_{\text{first}}$ and the word's direct grounding from (b).
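A sketch of steps (b)–(d); the gate computation and the mixing rule are assumptions consistent with the stated shape of FF_coref, not the exact equations.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def ground_second_sentence(H2, d_first, V, FF_coref):
    """Sketch of the co-reference heuristic for two-sentence questions.

    H2:       (m, h_dim) word representations of the second sentence.
    d_first:  (n_obj,) denotation of the entire first sentence.
    V:        (n_obj, h_dim) object representations.
    FF_coref: (h_dim, 2) gate parameters (shape as in the text; the gate
              computation and mixing rule below are assumptions).
    """
    denotations = []
    for h_word in H2:
        d_hat = 1.0 / (1.0 + np.exp(-(V @ h_word)))   # direct grounding (b)
        p_coref = softmax(h_word @ FF_coref)[1]       # co-reference gate (c)
        denotations.append(p_coref * d_first + (1 - p_coref) * d_hat)  # (d)
    return denotations
```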
Visual Representations
Next, we describe how we compute the visual embedding matrix V. Two common approaches to obtaining visual features are (1) computing a feature map for the entire image and letting the model learn to attend to the correct feature positions (Hudson and Manning, 2018; Perez et al., 2018); and (2) predicting the locations of objects in the image and extracting features only for these objects (Anderson et al., 2018; Tan and Bansal, 2019; Chen et al., 2020). We use the latter approach, because it simplifies learning over discrete sets and has better memory efficiency: the model attends only to a small set of objects rather than the entire image feature map.
We train an object detector, Faster R-CNN (Ren et al., 2015), to predict the location of all objects, in the format of bounding boxes (horizontal and vertical positions).
We use gold scene data of 5,000 images from CLEVR for training (and 1,000 images for validation). To extract features for each predicted bounding box, we use the bottom-up top-down attention method of Anderson et al. (2018),4 which produces the feature matrix $V_{\text{pred}} \in \mathbb{R}^{n_{\text{obj}} \times D}$, where $D = 2048$. Bounding boxes and features are extracted and fixed as a pre-processing step.
Finally, to compute $V$, similar to LXMERT and UNITER, we augment the object representations in $V_{\text{pred}}$ with their position embeddings and pass them through a single Transformer self-attention layer to add context about other objects: $V = \text{TransformerLayer}(V_{\text{pred}} W_{\text{feat}} + bb_{\text{pred}} W_{\text{pos}})$, where $bb_{\text{pred}}$ are the predicted bounding-box coordinates and $W_{\text{feat}}$, $W_{\text{pos}}$ are learned matrices projecting the features and box coordinates, respectively, to $\mathbb{R}^{h_{\text{dim}}}$.
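A PyTorch-style sketch of this step, with hypothetical hyper-parameters (hidden size, number of heads) and PyTorch's stock encoder layer standing in for the paper's Transformer layer:

```python
import torch
import torch.nn as nn

class ObjectContextualizer(nn.Module):
    """Project R-CNN features and box coordinates to h_dim, then contextualize
    the objects with a single self-attention layer. Hyper-parameters and the
    use of PyTorch's stock encoder layer are assumptions of this sketch."""

    def __init__(self, feat_dim=2048, box_dim=4, h_dim=256, n_heads=4):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, h_dim, bias=False)  # plays W_feat
        self.w_pos = nn.Linear(box_dim, h_dim, bias=False)    # plays W_pos
        self.layer = nn.TransformerEncoderLayer(
            d_model=h_dim, nhead=n_heads, batch_first=True)

    def forward(self, v_pred, bb_pred):
        # v_pred:  (batch, n_obj, feat_dim) fixed Faster R-CNN features
        # bb_pred: (batch, n_obj, box_dim) predicted bounding-box coordinates
        return self.layer(self.w_feat(v_pred) + self.w_pos(bb_pred))
```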
Complexity
Similar to CKY, we go over all $O(n^2)$ spans in a sentence, and for each span we compute a representation for each of the possible $O(n)$ split points (there is no grammar constant, since the grammar has effectively one rule). To compute the denotations $d_{ij}$ for all $O(n^2)$ spans, we perform a linear computation over all $n_{\text{obj}}$ objects. Thus, the algorithm runs in time $O(n^3 + n^2 n_{\text{obj}})$, with similar memory consumption. This is higher than end-to-end models that do not compute explicit span representations.
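As a rough back-of-the-envelope illustration of these two terms (the concrete values of n and n_obj below are purely illustrative):

```python
def chart_cost(n, n_obj):
    """Dominant per-example counts from the analysis above: O(n^3) split
    computations for span representations plus O(n^2 * n_obj) object
    interactions for span denotations (constant factors omitted)."""
    return n ** 3, n ** 2 * n_obj

# e.g., a 25-token question over 10 detected objects (illustrative numbers):
splits, objects = chart_cost(25, 10)
print(splits, objects)  # 15625 and 6250 "units" of work
```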
3.5 Training
The model is fully differentiable, and we simply maximize the log probability of the correct answer a* (see Algorithm 1).
4 Experiments
In this section, we evaluate our model on both in-distribution and OOD splits.
4.1 Arithmetic Expressions
It has been shown that neural networks can be trained to perform numerical reasoning (Zaremba and Sutskever, 2014; Kaiser and Sutskever, 2016; Trask et al., 2018; Geva et al., 2020). However, models are often evaluated on expressions that are similar to the ones they were trained on, where only the numbers change. To test generalization, we create a simple dataset and evaluate on two splits that require learning the correct operator precedence. In the first split, sequences of operators that appear at test time do not appear at training time. In the second split, the test set contains longer sequences compared to the training set.
We define an arithmetic expression as a sequence containing n numbers with n − 1 arithmetic operators between each pair. The answer a is the result of evaluating the expression.
Evaluation Setups
The sampled operators are addition and multiplication, and we use only expressions where $a \in \{0, 1, \ldots, 100\}$, so that the task can be trained as a multi-class problem. At training time, we randomly pick the length $n$ to be up to $n_{\text{train}}$, and at test time we use a fixed length $n_{\text{test}}$. We evaluate on three setups (a data-generation sketch follows below): (a) Easy split: we choose $n_{\text{train}} = n_{\text{test}} = 8$ and randomly sample operators from a uniform distribution for both training and test examples. In this setup, we only verify that no expression is shared between the training and test sets. (b) Operation split: we randomly pick 3 operator positions and, for each, randomly assign exactly one operator that will appear at training time. In the test set, the operators in all three positions are flipped, so that they contain the unseen operator (Figure 6). The same lengths are used as in the easy split. (c) Length split: we train with $n_{\text{train}} = 8$ and test with $n_{\text{test}} = 10$. Examples for all setups are generated on-the-fly for 3 million steps (batch size 100).
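A sketch of how such data could be generated; the number range (single digits), the specific flipped positions, and the rejection sampling are our own illustrative choices, not the paper's generator.

```python
import random

def sample_expression(n, allowed_ops):
    """Sample an n-number expression; allowed_ops[k] lists the operators
    permitted at operator position k (this per-position control is what
    makes the operation split possible)."""
    tokens = [str(random.randint(0, 9))]
    for k in range(n - 1):
        tokens += [random.choice(allowed_ops[k]), str(random.randint(0, 9))]
    return tokens

def make_example(n, allowed_ops, max_answer=100):
    """Rejection-sample until the answer falls in the multi-class range."""
    while True:
        tokens = sample_expression(n, allowed_ops)
        answer = eval(" ".join(tokens))  # only digits, '+' and '*': safe here
        if 0 <= answer <= max_answer:
            return tokens, answer

# Operation split (illustrative): positions 1, 3, and 5 see one operator at
# training time and only the other operator at test time.
train_ops = [["+", "*"] for _ in range(7)]
test_ops = [["+", "*"] for _ in range(7)]
for pos, op in [(1, "+"), (3, "*"), (5, "+")]:
    train_ops[pos] = [op]
    test_ops[pos] = ["*"] if op == "+" else ["+"]

print(make_example(8, train_ops))
```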
Models
We compare GLT to a standard Transformer, where the input is the expression, and the output is predicted using a classification layer over the [CLS] token. All models are trained with cross-entropy loss given the correct answer.
For both models, we use an in-distribution validation set for hyper-parameter tuning. Since here we do not have an image, we only compute $h_{ij}$ for all spans and define $p(a \mid q) = \text{softmax}(W h_{0n})$.
GLT layers are almost entirely recurrent; that is, the same parameters are used to compute representations for spans of all lengths. The only exceptions are the layer-normalization parameters, which are not shared across layers. Thus, at test time, when processing an expression longer than any observed at training time, we use the layer-normalization parameters (a total of $2 \cdot h_{\text{dim}}$ parameters per layer) from the longest span seen at training time.5
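In code, this reuse rule amounts to something like the following (the container and names are ours):

```python
def layer_norm_params_for_length(ln_params, span_length, max_train_length):
    """Per-length layer-normalization parameters; span lengths beyond those
    seen at training time reuse the parameters of the longest training span."""
    return ln_params[min(span_length, max_train_length)]
```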
Results
Results are reported in Table 1. We see that both models almost completely solve the in-distribution setup, but on OOD splits the Transformer performs poorly, while GLT shows only a small drop in accuracy.
4.2 CLEVR and CLOSURE
We evaluate performance on grounded QA using CLEVR (Johnson et al., 2017a), consisting of 100,000 synthetic images with multiple objects of different shapes, colors, materials, and sizes. 864,968 questions were generated using 80 different templates, including simple questions (“what is the size of red cube?”) and questions requiring multi-step reasoning (Figure 1). The split in this dataset is i.i.d: the same templates are used for the training, validation, and test sets.
To test compositional generalization when training on CLEVR, we use the CLOSURE dataset (Bahdanau et al., 2019b), which includes seven new question templates, with a total of 25,200 questions, asked on the CLEVR validation set images. The templates are created by taking referring expressions of various types from CLEVR and combining them in novel ways.
A problem we found in CLOSURE is that sentences from the template embed_mat_spa are ambiguous. For example, in the question “Is there a sphere on the left side of the cyan object that is the same size as purple cube?”, the phrase “that is the same size as purple cube” can modify either “the sphere” or “the cyan object”, but the answer in CLOSURE always assumes the latter. Therefore, we deterministically compute both possible answers and keep two sets of question-answer pairs for this template over the entire dataset. We evaluate models6 on this template by taking the maximum score over these two sets (such that models must be consistent and choose a single interpretation of the template to get a perfect score).
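Concretely, the per-template score can be computed as follows (variable names are ours):

```python
def embed_mat_spa_score(predictions, answers_reading_1, answers_reading_2):
    """Score the ambiguous template by taking the maximum accuracy over the
    two consistent answer sets, so a model must commit to a single reading
    to obtain a perfect score."""
    def accuracy(gold):
        return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
    return max(accuracy(answers_reading_1), accuracy(answers_reading_2))
```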
Baselines
We evaluate against the baselines from Bahdanau et al. (2019b). The most comparable baselines are MAC (Hudson and Manning, 2018) and FiLM (Perez et al., 2018), which are differentiable and do not use program annotations. We also compare to NMNs, which require a few hundred program examples for training. We show results for PG+EE (Johnson et al., 2017b) and an improved version, PG-Vector-NMN (Bahdanau et al., 2019b). Last, we compare to NS-VQA, which, in addition to parsing the question, also parses the scene into a knowledge graph. NS-VQA requires additional gold data from CLEVR (colors, shapes, locations, etc.) to parse the image into a knowledge graph.
Setup
Baseline results are taken from previous papers (Bahdanau et al., 2019b; Hudson and Manning, 2018; Yi et al., 2018; Johnson et al., 2017b), except for MAC and FiLM on CLOSURE, which we re-executed due to the aforementioned evaluation change. For GLT, we use CLEVR's validation set for hyper-parameter tuning and early stopping, for a maximum of 40 epochs. We run 4 experiments to compute the mean and standard deviation on the CLOSURE test set.
Because of our model's run-time and memory demands (see § 3.4), running on CLEVR and CLOSURE, where question length goes up to 43 tokens, can be slow. Thus, we delete function words that typically have empty denotations and can be safely skipped,7 reducing the maximum length to 25. We run a single experiment in which stop words were not removed and report results in Table 3 (with stop-words), showing that performance changes only mildly.
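A sketch of this preprocessing step, using the token list from note 7 (the punctuation check is our simplification):

```python
STOP_TOKENS = {"the", "there", "is", "a", "as", "of", "are", "other", "on", "that"}

def remove_stop_words(tokens):
    """Drop punctuation and the function words listed in note 7 before
    parsing, reducing the maximum question length (here from 43 to 25)."""
    return [t for t in tokens if t.isalnum() and t.lower() not in STOP_TOKENS]
```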
CLEVR and CLOSURE
In this experiment we compare results on the i.i.d and compositional splits. Results are in Table 2. We see that GLT performs well on CLEVR and obtains the highest score on CLOSURE, improving by almost 24 points over comparable models and by 19 points over NS-VQA, and outperforming even the oracle GT-Vector-NMN, which uses gold programs at test time.
| Model | CLEVR | CLOSURE |
|---|---|---|
| MAC | 98.5 | 72.4 |
| FiLM | 97.0 | 60.1 |
| GLT (our model) | 99.1 | 96.1 ± 2.5 |
| NS-VQA †∓ | 100 | 77.2 |
| PG+EE (18K prog.) † | 95.4 | – |
| PG-Vector-NMN † | 98.0 | 71.3 |
| GT-Vector-NMN †‡ | 98.0 | 94.4 |
Removing Modules
We defined two modules specifically for CLEVR (§ 3.3): Intersection and Union. We evaluate performance without them, keeping only Visual and Skip, and show results in Table 3, observing only a moderate loss in accuracy and generalization. Removing these modules leads to more cases where the Visual function is used, effectively performing intersection and union as well. While the drop in accuracy is small, this model is harder to interpret, since Visual now performs multiple functions.
Next, we evaluate performance when training only with the Visual module. Results show that even a single high-capacity module is enough for in-distribution performance, but generalization to CLOSURE drops substantially, to 83.8.
| Model | CLEVR | CLOSURE | CLEVR-Humans |
|---|---|---|---|
| GLT | 99.1 ± 0.0 | 96.1 ± 2.7 | 75.8 ± 0.7 |
| {Visual, Skip} | 98.9 ± 0.1 | 91.8 ± 10.1 | 76.3 ± 0.6 |
| {Visual} | 99.0 ± 0.1 | 83.8 ± 1.8 | 72.0 ± 1.6 |
| with stop-words | 99.3 | 93.2 | 75.4 |
| contextualized input | 98.9 ± 0.1 | 87.5 ± 12.1 | 76.1 ± 0.9 |
| non-compositional | 97.7 ± 2.0 | 83.5 ± 0.5 | 75.8 ± 2.3 |
Contextualizing Question Tokens
We hypothesized that the fact that question spans do not observe the entire input might aid compositional generalization. To check this, we pass the question tokens q through a Transformer self-attention layer to obtain contextualized token embeddings. We show in Table 3 that performance on CLEVR remains the same, but on CLOSURE it drops to 87.5.
Non-compositional Representations
To assess the importance of compositional representations, we replace $h_{ij}$ (Equation 2) with a sequential encoder. Concretely, we set the representation of the span $q_{ij}$ to be $h_{ij} = \text{BiLSTM}(q_{ij})$, the output of a single-layer bi-directional LSTM that is given the tokens of the span as input (see the sketch below). We see a drop on both CLEVR and CLOSURE (experiments were more prone to overfitting), showing the importance of compositional representations.
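A sketch of this ablation encoder in PyTorch (the hidden size and the mean-pooling of outputs are our assumptions):

```python
import torch
import torch.nn as nn

class SpanBiLSTM(nn.Module):
    """Non-compositional ablation: encode a span's tokens with a single-layer
    BiLSTM and mean-pool the outputs as the span representation (the pooling
    choice is an assumption of this sketch)."""

    def __init__(self, h_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(h_dim, h_dim // 2, num_layers=1,
                            bidirectional=True, batch_first=True)

    def forward(self, span_tokens):  # (batch, span_len, h_dim)
        outputs, _ = self.lstm(span_tokens)
        return outputs.mean(dim=1)   # (batch, h_dim)
```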
Few-shot
In the few-shot (FS) setup, we train with 252 additional OOD examples: 36 questions for each CLOSURE template. Similar to Bahdanau et al. (2019b), we take a model that was trained on CLEVR and fine-tune it, adding oversampled CLOSURE examples (300x) to the original training set. To make results comparable to Bahdanau et al. (2019b), we perform model selection based on the CLOSURE validation set and evaluate on the test set. As we see in Table 4, GLT obtains the best accuracy. If we perform model selection based on CLEVR (the preferred way to evaluate in the OOD setup; Teney et al., 2020), accuracy remains similar at 97.0 ± 2.0 (and is still highest).
| Model | CLOSURE FS | CLEVR-Humans |
|---|---|---|
| MAC | 90.2 | 81.5 |
| FiLM | – | 75.9 |
| GLT (our model) | 97.4 ± 0.3 | 76.2 |
| NS-VQA | 92.9 | 67.0 |
| PG-Vector-NMN | 88.0 | – |
| PG+EE (18K prog.) | – | 66.6 |
CLEVR-Humans
To test performance on real language, we use CLEVR-Humans (Johnson et al., 2017b), which includes 32,164 questions about images from CLEVR. These questions, asked by humans, contain words and reasoning steps that were unseen in CLEVR. We train on CLEVR and fine-tune on CLEVR-Humans, similar to prior work. We use GloVe embeddings (Pennington et al., 2014) for words unseen in CLEVR. Table 4 shows that GLT obtains better results than models that use programs, showing its flexibility in learning new concepts and phrasings. It is also comparable to FiLM, but obtains lower results than MAC (see error analysis below).
4.3 Error analysis
We sampled 25 errors from each dataset (CLEVR, CLOSURE, and CLEVR-Humans) for analysis. On CLEVR, most errors (84%) are due to problems in visual processing, such as grounding the word “rubber” to a metal object, errors in bounding box prediction, or questions requiring subtle spatial reasoning, such as deciding whether an object is left of another object of a different size when the two are at an almost identical x-position. The remaining errors (16%) are due to failed comparisons of numbers or attributes (“does the red object have the same material as the cube”).
On CLOSURE, 60% of the errors were similar to those on CLEVR, for example, problematic visual processing or failed comparisons. In 4% of cases, the execution of the Visual module was wrong; for example, it collapsed two reasoning steps (both intersection and finding objects of the same shape) but did not output the correct denotation. The remaining errors (36%) are in the predicted latent tree, where the model was uncertain about the split point and softly predicted more than one tree, resulting in a wrong answer. In some cases (16%) this was due to question ambiguity (see § 4.2), and in other cases the cause was unclear (e.g., for the phrase “same color as the cube” the model gave similar probability to the splits after “same” and after “color”, leading to a wrong denotation for that span).
On CLEVR-Humans, the model successfully learned certain new “concepts” such as colors (“gold”), superlatives (“smallest”, “furthest”), relations (“next to”, “between”), negation (see Figure 7), and the “all” quantifier. It also correctly answered questions whose style differs from CLEVR (“Are there more blocks or balls?”). However, the model is not always robust to these new concepts, and it fails in other cases that require counting abstract concepts (such as colors or shapes) or that use CLEVR's operators differently from their original use. See Table 5 for our manual analysis, with examples, over 150 CLEVR-Humans validation questions.
| Category | Examples | Correct |
|---|---|---|
| Negation | – How many objects are not shiny? ✓ – What color is the cylinder that is not the same colr as the sphere? ✗ | 1/2 (50%) |
| Spelling mistakes | – Are there two green cumes? ✓ – How many rubber spehres are there here? ✗ | 1/5 (20%) |
| Superlatives | – What color is the object furthest to the right? ✓ – What shape is the smallest matte object? ✓ | 8/10 (80%) |
| Visual Concepts: obscuring, between | – Is the sphere the same color as the object that is obscuring it? ✓ – What color is the object in between the two large cubes? ✓ | 5/7 (71%) |
| Visual Concepts: reflection, shadow | – Which shape shows the largest reflection ✗ – Are all of the objects casting a shadow? ✗ | 0/2 (0%) |
| Visual Concepts: relations | – What color is the small ball near the brown cube? ✓ – What is behind and right of the cyan cylinder? ✗ | 3/8 (38%) |
| All quantifier | – Are all the spheres the same size? ✓ – Are all the cylinders brown? ✗ | 10/12 (83%) |
| Counting abstract concepts | – How many different shapes are there? ✓ – How many differently colored cubes are there? ✗ | 1/4 (25%) |
| Complex logic | – if these objects were lined up biggest to smallest, what would be in the middle? ✓ – if most of the items shown are shiny and most of the items shown are blue, would it be fair to say most of the items are shiny and blue? ✗ | 3/4 (75%) |
| Different question structure | – Are more objects metallic or matte? ✓ – Each shape is present 3 times except for the ✗ | 1/2 (50%) |
| Uniqueness | – What color object is a different material from the rest? ✓ – What color is the object that does not match the others? ✗ | 1/2 (50%) |
| Long-tail concepts | – Can you roll all the purple objects? ✓ – How many of these things could be stacked on top of each other? ✗ | 2/5 (40%) |
| Operators used differently than CLEVR | – Are the large cylinders the same color? ✓ – Are there more rubber objects than matte cylinders and green cubes? ✗ | 2/5 (40%) |
| Same operators as CLEVR, possibly different phrasing | – What color is the cube directly in front of the blue cylinder? ✓ – What color is the little cylinder? ✗ | 71/80 (89%) |
4.4 Limitations
While some of the errors shown in Table 5 could be addressed with careful design of module architectures, improved image features, or a pre-training mechanism, some question structures are inherently more difficult to answer correctly with our model. We describe two main limitations.
One key issue is a discrepancy between the text of the question and the actual reasoning steps required to answer it correctly. For example, in the question “What is the most common shape?”, the phrase “most common” entails grouping and counting shapes and then taking an argmax to produce the answer. This complex reasoning cannot be constructed compositionally from question spans.
The second issue is the supported types of phrase denotations. GLT only grounds phrases in image objects; however, in some cases the denotation should be a number or an attribute. For example, in comparison questions (“is the number of cubes higher than the number of spheres?”), a model would ideally have a numerical denotation for each group of objects.
Importantly, these limitations do not prevent the model from answering questions correctly, since even when denotations do not accurately represent the reasoning, the model can still “fall back” on the flexible answer function, which takes the question and image representations and can, in theory, perform any needed computation. In such cases, the model will not be interpretable, similar to other high-capacity models such as Transformers. In future work, we will explore combining the compositional generalization abilities of GLT with the advantages of high-capacity architectures.
4.5 Interpretability
A key advantage of latent trees is interpretability—one can analyze the structure of a given question. Next, we quantitatively inspect the quality of the intermediate outputs produced by the model by (a) comparing these outputs to ground truth and (b) assessing how helpful they are for humans.
Evaluating Intermediate Outputs
We assess how accurate our intermediate outputs are compared to ground truth trees. Evaluating against ground truth trees is not trivial, since for a given question, there could be many “correct” trees—that is, trees that represent a valid way to answer the question. For example, the two possible ways to split the phrase “large shiny block” are both potentially valid. Thus, even if we had one ground-truth tree for each question, comparing predicted trees to ground-truth trees might not be reliable.
Instead, we exploit the fact that for the synthetic questions in CLEVR, it is possible to define a set of constituents that must appear in any valid tree. For example, phrases starting with “the”, followed by adjectives and then a noun (“the shiny tiny sphere”), should always be considered a constituent. We use such manually defined rules to extract the set $c_q$ of obligatory constituents for every question q in the CLEVR and CLOSURE validation sets.
We compute a recall score for each question q by producing a predicted tree (greedily taking the maximum-probability split at each node), extracting the set of all its constituents, and then computing the proportion of constituents in $c_q$ that are also in the predicted set. We show results in Table 6 for both CLEVR and CLOSURE: 83.1% and 81.6% of the expected constituents, respectively, are correctly output by GLT. Next, we sample and analyze 25 cases from each dataset where the expected constituents were not output. We observe that almost all missed constituents were actually output as constituents by the model, but with an additional prefix that caused a wrong split. For example, instead of the expected constituent “tiny green objects” we found the constituent “any tiny green objects” (split after “tiny”). These mistakes did not cause any error in module execution, except for one case in CLOSURE.
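A sketch of this recall computation, assuming access to per-span split-point probabilities (the data structures and names are ours):

```python
def decode_tree(split_probs, i, j, spans=None):
    """Greedily decode a binary tree: at each node take the maximum-probability
    split point. split_probs[(i, j)] maps each split point k (i < k < j) to its
    probability. Returns the set of spans (i, j) in the decoded tree."""
    if spans is None:
        spans = set()
    spans.add((i, j))
    if j - i > 1:
        k = max(split_probs[(i, j)], key=split_probs[(i, j)].get)
        decode_tree(split_probs, i, k, spans)
        decode_tree(split_probs, k, j, spans)
    return spans

def constituent_recall(split_probs, obligatory_spans, n):
    """Proportion of obligatory constituents c_q found in the decoded tree."""
    predicted = decode_tree(split_probs, 0, n)
    return len(obligatory_spans & predicted) / len(obligatory_spans)
```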
| | CLEVR | CLOSURE |
|---|---|---|
| Constituents (%) | 83.1 | 81.6 |
| Denotations (F1) | 95.9 | 94.7 |
Additionally, we obtain the gold set of objects for each constituent in $c_q$ by deterministically mapping each constituent to a program that we execute on the scene. For each such constituent, we compare the predicted set of objects (objects with probability above 50%) with the gold set and report the average F1. As can be seen in Table 6, the score is 95.9% for CLEVR and 94.7% for CLOSURE.
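The per-constituent comparison is a standard set F1, thresholding predicted probabilities at 0.5 as described:

```python
def denotation_f1(predicted_probs, gold_objects, threshold=0.5):
    """F1 between the predicted object set of a constituent (probability above
    the threshold) and the gold set obtained by executing its program."""
    predicted = {obj for obj, p in enumerate(predicted_probs) if p > threshold}
    gold = set(gold_objects)
    if not predicted and not gold:
        return 1.0
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)
```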
Human Interpretability
Next, we want to assess how useful the intermediate outputs produced by the model are to humans. We perform the “forward prediction” test (Hu et al., 2018), which evaluates interpretability by showing humans questions accompanied by intermediate outputs and asking them to predict whether the model will answer correctly. This test is based on the assumption that humans can more easily predict the accuracy of an interpretable model than that of a non-interpretable one. Thus, for comparison, we also ask a separate group of participants to predict accuracy without seeing the intermediate outputs.
We show each of the 20 participants 12 questions, evenly split between CLEVR, CLOSURE, and CLEVR-Humans, where for each dataset we show two correct and two incorrect examples per person. We assign 11 participants to the baseline group (group A), which sees only the question, image, and gold answer, and 9 participants to the tested group (group B), which is also given the denotation tree predicted by our model, similar to Figure 1. Both groups are given the same 8 “training” examples of questions, with the predicted answer of the model along with the gold answer, to improve their understanding of the task. In these training examples, group B participants also see example denotation trees, along with a basic explanation of how to read them.
We show results in Table 7. We see that for CLOSURE and CLEVR-Humans, accuracy for group B is significantly higher (p < 0.05), but for CLEVR results are similar. We observe that in CLEVR, where accuracy is 99.1%, wrong predictions are mostly due to errors in the less interpretable Visual module, while for CLOSURE and CLEVR-Humans errors are more often due to wrong selection of constituents and modules, which can be spotted by humans.
5 Conclusion
Developing models that generalize to compositions that are unobserved at training time has recently sparked substantial research interest. In this work we propose a model with a strong inductive bias towards compositional computation, which leads to large gains in systematic generalization and provides an interpretable structure that can be inspected by humans. Moreover, our model also obtains high performance on real language questions (CLEVR-Humans). In future work, we will investigate the structures revealed by our model in other grounded QA setups, and will allow the model freedom to incorporate non-compositional signals, which go hand in hand with compositional computation in natural language.
Acknowledgments
This research was partially supported by The Yandex Initiative for Machine Learning, and the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC DELPHI 802800). We thank Jonathan Herzig and the anonymous reviewers for their useful comments. This work was completed in partial fulfillment of the Ph.D. degree of Ben Bogin.
Notes
1. We also use Dropout and Layer-Norm (Ba et al., 2016) throughout the paper, omitted for simplicity.
2. Using the left sub-span instead substantially reduces performance on CLEVR and CLOSURE.
3. To improve runtime efficiency in the case of a large knowledge-graph, one could introduce a fast filtering function to reduce the number of proposed objects.
4. Specifically, we use this version: https://github.com/airsplay/py-bottom-up-attention.
5. Removing layer normalization leads to an improved accuracy of 99% on the arithmetic expressions length split, but training convergence on CLEVR becomes too slow.
6. We update the scores on CLOSURE for MAC, FiLM, and GLT due to this change in evaluation. The scores of the remaining models were not affected.
7. The removed tokens are punctuation and ‘the’, ‘there’, ‘is’, ‘a’, ‘as’, ‘of’, ‘are’, ‘other’, ‘on’, ‘that’.