Neural Event Semantics for Grounded Language Understanding

Abstract We present a new conjunctivist framework, neural event semantics (NES), for compositional grounded language understanding. Our approach treats all words as classifiers that compose to form a sentence meaning by multiplying output scores. These classifiers apply to spatial regions (events) and NES derives its semantic structure from language by routing events to different classifier argument inputs via soft attention. NES is trainable end-to-end by gradient descent with minimal supervision. We evaluate our method on compositional grounded language tasks in controlled synthetic and real-world settings. NES offers stronger generalization capability than standard function-based compositional frameworks, while improving accuracy over state-of-the-art neural methods on real-world language tasks.


Introduction
Capturing the compositional semantics of grounded language is a long-standing goal in natural language processing. Composition yields systematicity, and is thus essential to developing systems that can generalize broadly in real-world settings. Recent progress with neural module networks (Andreas et al., 2016b; Hu et al., 2017) and related models (Johnson et al., 2017b; Yi et al., 2018; Bahdanau et al., 2019a) has moved neural network methods closer to this goal.
These works are largely based on functionism (Montague, 1970), the idea that semantic composition is function composition. In Figure 1(a), function predicates compose by nesting: predicates like "red" and "circle" operate on sets of elements, progressively filtering them at each step (circle(red(x))). The final relational predicate "above" is thus several steps removed from the original inputs x, y. Similarly, in module networks, atomic module blocks compose by sequentially passing outputs of intermediate blocks to later modules. The diverse composition ruleset needed to coordinate function inputs and outputs leads to complexity in this paradigm, which has practical implications for its fundamental learnability. Indeed, neural module network instantiations of this framework often depend on low-level ground truth module layout programs (Johnson et al., 2017b) or large amounts of training data to sustain end-to-end reinforcement learning methods (Yi et al., 2018; Mao et al., 2019).

Project website: https://neural-event-semantics.github.io/
While functionism is the dominant paradigm in linguistic semantics, there is an intriguing alternative: event semantics (Davidson, 1967). Conjunctivism (Pietroski, 2005) is a particularly powerful version of event semantics, wherein the only composition operator is conjunction; structure arises by routing event variables to the function predicates. We illustrate this key difference between paradigms in Figure 1: in a conjunctivist setting (Figure 1(b)), even the relational predicate "above" has events $e_1$, $e_2$ directly routed as input, rather than taking inputs that are output results of a sequence of filter operations. Overall meaning is still preserved, since $e_1$ concurrently routes to (red, circle) and $e_2$ to (green, square). All module outputs directly contribute to the final truth value without intermediate steps. Altogether, this shift from deriving compositional structure by functional module layout to conjunctive event routing offers a path to improved learnability; we explore the implications of this line of thinking in the context of compositional neural models.
Figure 1: (a) Prior neural network methods for compositional semantics, such as neural module networks, derive compositional structure through nested application of function modules. This paradigm, rooted in functionism, is powerful but retains drawbacks to learnability due to its complexity. (b) We propose neural event semantics (NES), a new framework based on conjunctivism, where all words are classifiers and output scores compose by simple multiplication. We call the input spatial regions to these classifiers events: NES derives semantic structure from language by learning how to route event inputs of classifiers for different words in a context-sensitive manner. By relaxing this routing operation with soft attention, NES enables end-to-end differentiable learning without low-level supervision for compositional grounded language understanding.

We propose neural event semantics (NES), a new conjunctivist framework for compositional grounded language understanding. Our work addresses the drawbacks of modern neural module approaches by re-examining the underlying semantics framework, shifting from functionism to conjunctivism. The focus of NES revolves around event variables, abstractions of entities in the world (e.g., in images, we can think of events as spatial regions). We treat all words as event classifiers: for each word, a single score indicates the presence of a concept on a specific input (e.g., red, above in Figure 1(b)). We compose output scores from classifiers by multiplication, generalizing logical conjunction. The structural heart of NES is the intermediate soft (attentional) event routing stage, which ensures that these otherwise independent word-level modules receive contextually consistent event inputs. In this way, the simple product of all classifier scores accurately represents the intended compositional structure of the full sentence.
Our NES framework is end-to-end differentiable, able to learn from high-level supervision by gradient descent while providing interpretability at the level of individual words.
We evaluate our NES framework on a series of grounded language tasks aimed at assessing its generalizability. We verify the merits of our conjunctivist design in a controlled comparison with functionist methods on the synthetic ShapeWorld benchmark (Kuhnle and Copestake, 2017). We show NES exhibits stronger systematic generalization than prior techniques, without requiring any low-level supervision. Further, we verify the flexibility of the framework in real-world language settings, offering significant gains (+4 to 6 points) in the state-of-the-art accuracy for language and zero-shot generalization tasks on the CiC reference game benchmark (Achlioptas et al., 2019).

Background and Related Work
Compositional Neural Networks. The advent of neural module networks (NMN) (Andreas et al., 2016a,b; Hu et al., 2017) and related techniques (Johnson et al., 2017b; Yi et al., 2018; Bahdanau et al., 2019a) has proven to be a driving force in compositional language understanding. These techniques share a key principle: small, reusable neural network modules stack together as functional building blocks in an overall executable neural program. A parsing system determines the programmatic layout, wiring the outputs of intermediate blocks to the inputs of other blocks.
The reliance of these techniques on prespecified module libraries, ground truth supervision on functional module layouts, and/or sample-inefficient reinforcement learning methods (Williams, 1992) has motivated subsequent work to eschew explicit semantics for recurrent attentional computation techniques (Hudson and Manning, 2018; Perez et al., 2018; Hu et al., 2018; Hudson and Manning, 2019). This class of more implicit semantics methods offers the benefits of end-to-end differentiability of traditional non-compositional neural networks (Lake et al., 2017), making them better suited for real-world settings. As a trade-off, however, these methods exhibit less systematic generalization than their more explicit counterparts (Marois et al., 2018; Jayram et al., 2019; Bahdanau et al., 2019b).
Recent work has also suggested that the modular network approach leads to limitations of systematic generalization: functional module layout can lead to entangled concept understanding (Bahdanau et al., 2019a;. While these works go on to propose mitigating measures, such as module-level pretraining, we consider an orthogonal approach: revisiting the underlying semantics foundation. This enables us to address the challenges jointly: our NES framework retains the end-to-end learnability of implicit methods, while improving upon the systematic generalizability of explicit ones.

Grounded Compositional Semantics. Our work is also closely related to the broader, pre-neural network body of prior work which developed models for compositional semantics in grounded language settings (Matuszek et al., 2012; Krishnamurthy and Kollar, 2013;. These methods all share the two-stage approach of semantic parsing and evaluation, and combine functionist and conjunctivist elements. The parsing stage typically leverages a (functionist) combinatory categorial grammar (CCG) parser (Zettlemoyer and Collins, 2005) to map input language to a discrete (conjunctive) logical form bound by an existential closure. The evaluation stage passes visual segments as input to these predicates to obtain a final score representing its truth condition. In our work, we aim to generalize these frameworks to a modular neural network setting, embracing conjunctivist design across all stages to improve end-to-end learnability. Our proposed soft event routing mechanism relaxes prior discrete constraints and offers an alternative to probabilistic program (Krishnamurthy et al., 2016) formulations. Together, NES is able to learn how to predict the (soft) conjunctive neural logical forms while jointly learning the underlying semantics of each concept (without prespecification) end-to-end from denotation alone.
Grounded Language Understanding. The space of grounded language understanding methods and tasks is large, encompassing tasks in image-caption agreement (Kuhnle and Copestake, 2017; Suhr et al., 2019), reference grounding (Monroe et al., 2017; Achlioptas et al., 2019), instruction following (Ruis et al., 2020; Vogel and Jurafsky, 2010; Chaplot et al., 2018), captioning (Chen et al., 2015), and question answering (Antol et al., 2015; Johnson et al., 2017a; Hudson and Manning, 2019), among others. Often, the ability to operate with only high-level labels is critical (Karpathy and Fei-Fei, 2015). Consistent with recent work (Bahdanau et al., 2019a), we center our analysis on foundational tasks of caption agreement and reference grounding, on both synthetic and real-world language data, with the understanding that core insights can translate to related tasks.

Prelude: Classical Conjunctivism to NES
To explain our proposed differentiable neural approach, we first revisit classical logic in our current context. In conjunctivist event semantics (Pietroski, 2005), we work with the space of existentially quantified conjunctions of predicates. For illustration, consider the partial logical form:

$$\exists e_1, e_2 \in V.\ [\![\mathrm{circle}(e_1) \wedge \mathrm{on}(e_1, e_2)]\!] \quad (1)$$

where $e_i$ are event variables and $V$ is the domain of candidate event values. To evaluate this expression, we need an interpretation of the variables: an assignment of event values in $V$ to each event variable $e_i$. We then route these events to the arguments of predicates based on the logical form. The logical form gives the abstract template for which events route to which inputs and, most crucially, which arguments are shared across predicates ($e_1$ routes to "circle" and the first argument of "on"). We make this routing explicit by a routing tensor $A_{wri} \in \{0, 1\}$: for each argument slot ($r$) of each predicate ($w$, for word), $A_{wr*} \in \{0, 1\}^n$ is a one-hot vector indicating which of the $n$ event variables $e_i \in e$ belongs in this argument slot. We can thus rewrite the matrix expression in Equation (1) as:

$$[\![\mathrm{circle}(A_{11*}e, A_{12*}e) \wedge \mathrm{on}(A_{21*}e, A_{22*}e)]\!] \quad (2)$$

Without loss of generality, we upgrade each predicate to take a fixed $m$ arguments; here $m = 2$. Equation (2) makes it clear that the routing tensor $A$ is the key syntactic element specifying the structure of the logical form in Equation (1). Having routed events $e_i$ to predicate arguments via $A$, we can evaluate the predicates ("circle", "on"). These predicates are Boolean functions, assigned by a lookup table (lexicon). We compose the outputs of these Boolean functions by conjunction to get the truth value of the entire matrix. This describes how we evaluate the matrix expression in Equation (1) for a specific assignment of $e_i$ in $V$.
We arrive at the final interpretation of Equation (1) by existential quantification: searching over the possible assignments to see if there exists one that makes the matrix true.

Figure 2: We propose neural event semantics (NES), an end-to-end differentiable framework based on conjunctivist event semantics (Section 3.1). NES parses input text to a neural logical form $F$, which can score a given set of input events. In NES, all words are event classifiers (Section 3.3) whose scores compose by multiplication (Section 3.4). The structural heart of NES is a differentiable event argument routing operation (Section 3.2), ensuring arguments to each event classifier are contextually correct. NES semantically grounds $F$ to an input world $W$ by existential event variable interpretation (Section 3.5), finding a satisfying assignment (if one exists) of events $e$ from values $V$.
We emphasize that the logical form is fully determined by the routing tensor A and the lexicon mapping each word/predicate to a Boolean function. Evaluation is specified by conjunctive composition and finding a satisfying variable interpretation. Our strategy to develop a learnable framework is to soften each of the key components: argument routing (Section 3.2), predicate evaluation (Section 3.3), conjunctive composition (Section 3.4), and existential event interpretation (Section 3.5).
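As a concrete illustration of this classical evaluation procedure, the following minimal Python sketch evaluates the example form with a discrete routing tensor and a brute-force existential search. The toy world, the `LEXICON` entries, and all function names are hypothetical and chosen only for illustration; in particular, the routing of the unused second slot of "circle" is an arbitrary assumption, as is the reading of "on" as a comparison of vertical positions.

```python
from itertools import product

# Hypothetical toy world: each candidate event value is a dict of facts.
V = [
    {"name": "v1", "shape": "circle", "y": 2},
    {"name": "v2", "shape": "square", "y": 1},
]

# Lexicon: every predicate is a Boolean function of m = 2 arguments.
# One-place predicates simply ignore their second argument.
LEXICON = {
    "circle": lambda a, b: a["shape"] == "circle",
    "on": lambda a, b: a["y"] > b["y"],  # assumed toy meaning of "on"
}

WORDS = ["circle", "on"]

# One-hot routing A[w][r][i]: predicate w, argument slot r, event variable i.
A = [
    [[1, 0], [1, 0]],  # circle(e_1, _): second slot is ignored (arbitrary routing)
    [[1, 0], [0, 1]],  # on(e_1, e_2)
]


def route(one_hot, events):
    """Select the event variable picked out by a one-hot routing vector."""
    return events[one_hot.index(1)]


def matrix_true(routing, words, events):
    """Conjunction of all predicates under one assignment of event variables."""
    return all(
        LEXICON[w](route(routing[k][0], events), route(routing[k][1], events))
        for k, w in enumerate(words)
    )


def exists(routing, words, domain, n_vars):
    """Existential closure: try every assignment of domain values to variables."""
    return any(
        matrix_true(routing, words, e) for e in product(domain, repeat=n_vars)
    )


print(exists(A, WORDS, V, 2))
```

In this toy world the form is satisfied by binding $(e_1, e_2)$ to (v1, v2). Swapping the two routing rows of "on" yields a form that no assignment satisfies, which shows that the routing tensor alone carries the structural information.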
Overview. We propose a neural event semantics (NES) framework, illustrated in Figure 2, which relaxes this classical logic into a differentiable computation that can be learned end-to-end. NES takes a text statement $T$ and constructs a neural logical form $F$. This form is specified by a now real-valued routing tensor $A_{wri} \in [0, 1]$ and a lexicon associating event classifiers $M_w$ to each word $w$. NES specifies composition via the product of classifier prediction scores, as a relaxation of conjunction. Finally, evaluation is completed by existentially interpreting event variables $e_i$ into a domain of event values $V$ (grounded representations extracted from a visual world $W$) by a max operator.

Differentiable Event Argument Routing
Figure 3: Words as Classifiers of Routed Events. All words $w$ correspond to modules $M_w$ of a single type signature. Predicted argument routing attention $A$ routes input events $e$ from the overall logical form $F$ to the specific arguments in the event classifier $M_w$ (per Equation (5)), ensuring contextual consistency between event classifiers for different words. $q_w$, a decontextualized word embedding, indicates to $M_w$ its lexical concept. $M_w$ is shown here with maximum arity $m = 2$ slots and $n = 3$ events (including the ungrounded background event $e_\emptyset$); since "circle" only binds to one argument $e_1$, the second slot is bound to $e_\emptyset$. See Sections 3.2 and 3.3.

Our first key operation in NES is to predict the argument routing tensor $A$ from the input language. Critically, we relax $A$ from its original discrete formulation in Section 3.1 to a continuous-valued one, $A_{wri} \in [0, 1]$, where $A_{wr*} \in [0, 1]^n$ is normalized by softmax over the index for the $n$ events $e_i$. This softened routing can be seen as a form of attention, determining which argument slot $r$ for a word $w$ will attend to which event variables $e_i$ (see Figure 3). We predict these attentions directly from the input tokenized text sequence $T = [t_1, \ldots, t_l]$, of length $l$. For each token word $t_w$, we pass a word embedding $q_w$ as input to a bidirectional LSTM (Graves and Schmidhuber, 2005) that serves as the sequence encoder and outputs forward/backward hidden states $\overrightarrow{h}_w$, $\overleftarrow{h}_w$ capturing the bidirectional context surrounding $t_w$. Passing the concatenated states through a linear layer, we obtain a final hidden state:

$$h_w = \mathrm{Linear}([\overrightarrow{h}_w ; \overleftarrow{h}_w]) \quad (3)$$
From $h_w$, a multilayer perceptron ($\mathrm{MLP}_{\mathrm{ROUTE}}$) network outputs for each argument slot $r$:

$$A_{wr*} = \mathrm{softmax}(\mathrm{MLP}_{\mathrm{ROUTE}}(h_w)_r) \quad (4)$$

Over the full input sequence of length $l$, we obtain the full argument routing tensor $A \in [0, 1]^{l \times m \times n}$, with $m$ argument slots per word and $n$ events. Note that the prediction of $A$ from input text $T$ plays the role of capturing syntax for NES, using the language to derive coordination of argument routings across different words (see footnote 2).

A key design aspect of the routing operation is that, because $A$ can route an ungrounded background event $e_\emptyset$ to (extra) argument slots, NES can implicitly learn the arity of each word. Further, the attention formulation enables partial routing of such background events; we observe later in Section 4.1.4 that this is critical to enabling the more complex coordination necessary to handle negation.
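The soft routing step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the logits below are hand-picked stand-ins for the output of the routing MLP (no LSTM or MLP is implemented here), and the function names `softmax` and `route` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one argument slot's event logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def route(attention, events):
    """Soft routing: attention-weighted sum of the event vectors."""
    dim = len(events[0])
    return [sum(a * ev[d] for a, ev in zip(attention, events)) for d in range(dim)]


# n = 3 events (e_1, e_2, background), each of dimension d_e = 2.
events = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

# Hand-picked logits for one word with m = 2 argument slots (an assumption,
# standing in for the routing network's output).
logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]

A_w = [softmax(row) for row in logits]        # shape m x n, rows sum to 1
routed = [route(row, events) for row in A_w]  # shape m x d_e

for slot, (attn, vec) in enumerate(zip(A_w, routed), start=1):
    print(slot, [round(a, 3) for a in attn], [round(v, 3) for v in vec])
```

With these logits the first slot attends almost entirely to $e_1$ and the second to the background event, mimicking a one-place word such as "circle". Less peaked logits would give a partial mixture of a grounded event and the background.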

Words as Event Classifiers
In NES, all words are event classifiers: words are associated with modules $M_w$ that output a real-valued score $s_w$ of how true a lexical concept is for a given set of (routed) event inputs (Section 3.2, Figure 3). Denoting events $e_i \in \mathbb{R}^{d_e}$, $e_\emptyset$ as a null background event, and $e = [e_1 \cdots e_{n-1} \mid e_\emptyset] \in \mathbb{R}^{n \times d_e}$, we can formalize the routed inputs as $A_{wr*}e \in \mathbb{R}^{d_e}$. The concatenation of these routed inputs over all $m$ argument slots is input to $M_w$.

Footnote 2: We emphasize that this is a language-only operation: coordination here is not conditional on the later grounding step to specific event values $V$ in the visual world (Section 3.5).
While in principle the modules can be completely separate for each word in the lexicon, we choose to share the weights of the different classifiers $M_w$: this improves memory efficiency for large vocabularies and is helpful in real-world language generalization settings. Thus, we can realize modules $M_w$ by an MLP network that receives the word embedding $q_w$ as further input (see Figure 3), computing its output $s_w$ as:

$$s_w = \sigma(\mathrm{MLP}([A_{w1*}e ; \ldots ; A_{wm*}e ; q_w])) \quad (5)$$

where $\sigma$ denotes the sigmoid function that normalizes the output score $s_w \in [0, 1]$.
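To make the shared-classifier idea concrete, here is a minimal sketch of a single scorer that is reused for every word and conditioned on the word embedding. It is not the trained model: the weights are random, the hidden size, the ReLU nonlinearity, and the absence of bias terms are assumptions, and `make_shared_classifier` is a hypothetical helper name.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_shared_classifier(d_e, m, d_q, hidden=8, seed=0):
    """Build one scorer shared by all words (random, untrained weights)."""
    rng = random.Random(seed)
    in_dim = m * d_e + d_q
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]

    def score(routed_inputs, q_w):
        # Concatenate the m routed event inputs with the word embedding q_w.
        x = [v for slot in routed_inputs for v in slot] + list(q_w)
        if len(x) != in_dim:
            raise ValueError("unexpected input size")
        # One hidden layer with ReLU (an assumption), then a sigmoid output.
        h = [max(0.0, sum(a * b for a, b in zip(row, x))) for row in w1]
        return sigmoid(sum(a * b for a, b in zip(w2, h)))

    return score


classifier = make_shared_classifier(d_e=2, m=2, d_q=3)
routed = [[1.0, 0.0], [0.0, 0.0]]  # slot 1: a grounded event, slot 2: background
s_red = classifier(routed, [1.0, 0.0, 0.0])     # hypothetical embedding for "red"
s_circle = classifier(routed, [0.0, 1.0, 0.0])  # hypothetical embedding for "circle"
print(s_red, s_circle)
```

The same weights serve both words; only the embedding passed alongside the routed events changes, which is what allows the parameter count to stay fixed as the vocabulary grows.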

Conjunctive Composition in NES
Per Section 3.1, the matrix of a classical conjunctive logical form (for a given interpretation of variables) is evaluated by composing Boolean predicate outputs by conjunction. For the neural logical form $F$ in NES, we consider the real-valued generalization of conjunction: we compose the $l$ word-level scores $s_w$ from the classifiers $M_w$ (Equation (5)) by multiplication ($\prod_w s_w$). For numerical stability, we calculate the combined score in log space:

$$\log s_F = \frac{1}{l} \sum_{w=1}^{l} \log s_w \quad (6)$$

where the length normalization is optional but helps with training on variable-length sequences.
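The composition step is simple enough to write out directly. The sketch below multiplies word-level scores in log space with optional length normalization; the function name `compose` and the clamping constant `eps` (used only to avoid taking the log of zero) are assumptions made for this illustration.

```python
import math


def compose(scores, normalize=True, eps=1e-12):
    """Product of scores computed in log space, optionally length-normalized."""
    log_sum = sum(math.log(max(s, eps)) for s in scores)
    if normalize:
        log_sum /= len(scores)  # turns the product into a geometric mean
    return math.exp(log_sum)


print(compose([0.9, 0.8], normalize=False))
print(compose([0.25, 1.0]))
print(compose([1.0, 0.0, 1.0], normalize=False))
```

With Boolean inputs the unnormalized product behaves like logical conjunction: it is 1 only when every score is 1, and a single near-zero score drives the whole product toward 0.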

Existential Event Variable Interpretation
In Sections 3.2-3.4, we described how NES translates input language to a neural logical form $F$, and how such a logical form can operate for a specific interpretation (binding) of events to candidate values $V$. Now, we describe the final existential variable interpretation step, which relaxes the existential quantification of classical logic (Equation (1)) into a max operation over possible event interpretations of a specific input domain $V$.
Candidate Event Values $V$. We decompose our input world $W$ into a set of candidate event proposals, with corresponding representation values $V$. In our experiments, we process input visual scenes $W$ with a pre-trained convolutional visual encoder $\phi$ (Simonyan and Zisserman, 2015; He et al., 2016) to provide a set of up to $k$ candidate event values. These candidate values capture the information corresponding to the localized image segment surrounding that specific event; we base our approach on recent findings of object-centric representations for compositional modular network approaches (Yi et al., 2018). To capture spatial information, we augment each representation with the spatial coordinates of the center of its bounding box; this enables NES and our relevant baseline methods (e.g., NMN) to assess the semantics of spatial relationships (e.g., "below") while operating directly on event values.
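The spatial augmentation described above amounts to appending two numbers to each feature vector. A minimal sketch follows; the `(x_min, y_min, x_max, y_max)` box format, the normalization by image size, and the function names are assumptions for illustration and are not specified in the text.

```python
def box_center(box):
    """Center of a bounding box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def event_value(features, box, image_size):
    """Append the (assumed normalized) box center to a visual feature vector."""
    cx, cy = box_center(box)
    width, height = image_size
    return list(features) + [cx / width, cy / height]


# Hypothetical pooled features for one region of a 100 x 100 image.
v_1 = event_value([0.1, 0.2], box=(10, 20, 30, 40), image_size=(100, 100))
print(v_1)
```

Because the coordinates travel with the features, a relational classifier such as the one for "below" can compare positions directly from its two routed inputs.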
Assignment and Final Scoring. Given the domain $V$ of candidate event values, an interpretation is thus an assignment of each of the $n - 1$ grounded event variables (we do not include $e_\emptyset$) to a unique value in $V$. We denote this assignment operation as $e \leftarrow V$. We translate the existential closure ($\exists e_1, e_2$ in Figure 2) as an operation that determines the best scoring assignment of event candidate values to event variables. The final grounded score is:

$$s^*_F = \max_{e \leftarrow V} s_F \quad (7)$$

Figure 4 visualizes output score tables (including $s_w$, $s_F$, $s^*_F$) with $k = 2$ candidate event values and $n = 3$ events including background $e_\emptyset$. We highlight that Figure 4 shows how each individual module provides consistent outputs depending on the specific event interpretation $e \leftarrow V$ (e.g., "below" is only true if $(e_1, e_2)$ bind to $(v_2, v_1)$, not $(v_1, v_2)$). The final score $s^*_F$ reflects the $s_F$ of that correct assignment, since it is the max score.
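The existential interpretation step can be sketched as an exhaustive search over assignments of distinct candidate values to the grounded event variables. The scoring function below is a toy stand-in for the composed sentence score (it hard-codes a "below" relation on made-up vertical positions), and all names are hypothetical.

```python
from itertools import permutations

# Hypothetical candidate event values: name -> vertical position (larger is higher).
V = {"v1": 2.0, "v2": 1.0}


def score_below(assignment):
    """Toy stand-in for the sentence score of "e_1 below e_2"."""
    e1, e2 = assignment
    return 0.95 if V[e1] < V[e2] else 0.05


def best_assignment(score_fn, values, n_grounded):
    """Max over assignments of distinct values to the grounded event variables."""
    best_score, best = float("-inf"), None
    for assignment in permutations(values, n_grounded):
        score = score_fn(assignment)
        if score > best_score:
            best_score, best = score, assignment
    return best_score, best


s_star, binding = best_assignment(score_below, list(V), 2)
print(s_star, binding)
```

Using `permutations` enforces that each variable receives a unique value. The number of assignments grows quickly with the number of candidates and variables, so this brute-force loop is only meant to show the semantics of the max, not an efficient implementation.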

Training: Learning from Denotation
We train our overall system end-to-end with gradient descent with a dataset of (statement $T$, world scene $W$, true/false denotation label $Y$) triplets. We apply a straightforward binary cross entropy loss at the level of text statements and their truth labels to the final output score $s^*_F$, without needing any low-level ground truth supervision of the neural logical form. Overall, our full NES framework offers advantages from both traditional neural module network methods and end-to-end differentiable implicit semantics techniques.

Figure 4: Example end-to-end NES results on ShapeWorld. We show an input world with two event candidates (for clarity) with representations $v_1$, $v_2$ for the red and green circles, respectively. We visualize the possible event assignments $(e_1, e_2) \in \{(v_1, v_2), (v_2, v_1)\}$ and the classifier scores $s_w \in [0, 1]$ for each assignment, including stop words. We find NES provides correct and consistent predictions across assignments and concepts, without any explicit logical form-level supervision. See Section 4.1.4.
The max operation in Equation (7) is a technical challenge for end-to-end training. To improve gradient flow, we propose to use a tunable approximation $f_{\max}$ (Equation (8)), which approaches the max as $\beta \to \infty$ and is always upper-bounded by it. In context, $s$ is a vector of all the scores $s_F$ (Equation (6)) corresponding to the assignments $e \leftarrow V$, and the output of Equation (8) is a bounded approximation of $s^*_F$ in Equation (7). See Appendix A for correctness and details. During test-time inference, we still use the original max operation shown in Equation (7).
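For intuition, the sketch below shows one smooth approximation with the two stated properties (it never exceeds the true max and converges to it as $\beta$ grows), followed by a binary cross entropy loss on the result. The specific form used here, a softmax-weighted average of the scores, is an assumption chosen for illustration and may differ from the $f_{\max}$ defined in Appendix A; the function names and the clamping constant `eps` are likewise hypothetical.

```python
import math


def soft_max_value(scores, beta):
    """Softmax-weighted average of scores (an ASSUMED form of a smooth max).

    A weighted average can never exceed the largest element, and the weights
    concentrate on the largest element as beta grows.
    """
    m = max(scores)
    weights = [math.exp(beta * (s - m)) for s in scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total


def binary_cross_entropy(pred, label, eps=1e-7):
    """BCE between a score in [0, 1] and a 0/1 truth label."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Scores of one statement under three hypothetical event assignments.
assignment_scores = [0.1, 0.9, 0.5]

for beta in (0.0, 1.0, 10.0, 200.0):
    approx = soft_max_value(assignment_scores, beta)
    print(beta, round(approx, 4), round(binary_cross_entropy(approx, 1), 4))
```

At small $\beta$ every assignment receives gradient, which is the motivation for smoothing during training; at large $\beta$ the value matches the hard max that is used at test time.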

Experiments: Synthetic Language
We design the first series of experiments to highlight key compositional and generalization properties of NES in a controlled, synthetic setting.

Dataset and Tasks
ShapeWorld. Our synthetic tasks and datasets are based on the ShapeWorld benchmark suite (Kuhnle and Copestake, 2017), which was designed specifically for evaluation of compositional models for grounded semantics. Here, events are based on simple objects: shapes with different color attributes and spatial relationships. Images are generated by sampling events from task-specific distributions with visual noise (e.g., hue, size variance), and are placed without hard grid constraints. For each image, multiple true/false language statements are generated with a templated grammar (Copestake et al., 2016). Negative statements are generated close to the distribution of positive statements to ensure difficulty: models must understand all aspects of the statement correctly to output a truth condition label. We visualize an example in the qualitative results (Figure 4).
Task A: Standard Generalization. This generalization task evaluates compositional models on the standard setting where train and evaluation splits are based on the same underlying input event distribution. This task is similar to the original SHAPES dataset (Andreas et al., 2016b), without shape positions locked to a 3 × 3 spatial grid.
Task B: Compositional Generalization. The compositional generalization task examines the systematic generalization of models to an unseen event distribution. During test time, every instance has at least one event sampled from a held-out distribution. For example, while red triangles and blue squares may be present at train time, blue triangles and red squares are only present during test time. Critically, any language associated with these unseen events is always false during training since these events are never actually present. Thus, models that overfit on complete phrases during training will not generalize well at test time.
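The held-out split in Task B can be illustrated with the color and shape example from the paragraph above. This sketch is not the ShapeWorld generator; the data structures and function names are hypothetical and only show the logic of the split.

```python
from itertools import product

COLORS = ["red", "blue"]
SHAPES = ["triangle", "square"]

# Combinations that never appear in training scenes (from the example above).
HELD_OUT = {("blue", "triangle"), ("red", "square")}


def split_combinations(colors, shapes, held_out):
    """Partition all (color, shape) pairs into train-time and test-only sets."""
    combos = list(product(colors, shapes))
    train = [c for c in combos if c not in held_out]
    test_only = [c for c in combos if c in held_out]
    return train, test_only


def is_valid_test_scene(scene, held_out):
    """A Task B test scene contains at least one held-out event."""
    return any(obj in held_out for obj in scene)


train, test_only = split_combinations(COLORS, SHAPES, HELD_OUT)
print(train)
print(test_only)
print(is_valid_test_scene([("red", "triangle"), ("blue", "triangle")], HELD_OUT))
```

Every individual color and shape still occurs at train time; only their pairing is new at test time, so a model has to recombine separately learned concepts rather than memorize whole phrases.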
Task Variant: Negation. For both tasks, we include a variation with negation to ensure NES can model non-intersective modifiers, which are prevalent in real-world grounded language. In these variants, true and false statements that include attribute-level negation (e.g., phrases like "not red") are also generated for each image.

Baseline and Model Details
Baselines. Across our synthetic experiments, we compare NES against baselines in 3 categories:
• Black-box neural networks. These baseline neural network models combine CNN, LSTM, and attention components (Johnson et al., 2017a) and represent standard end-to-end black-box techniques for language + vision tasks.
• Functionist approaches. For our functionist baselines, we consider the prevailing parameterizations of the neural module networks (NMN) framework (Andreas et al., 2016b). For the modules, we leverage the base generic module design introduced in the E2ENMN framework (Hu et al., 2017; Bahdanau et al., 2019a). Because our experiments are event-centric, the inputs and implementation of the framework are consistent with prior work (Yi et al., 2018; Mao et al., 2019;. Thus, each module takes as input a set of localized event values (originally from the image), an attention over these values (from a preceding module step), and a decontextualized word embedding. The module then applies the attention and processes the input, before outputting an updated attention to be used in dependent downstream module steps. For end-to-end (E2E) experiments, ground truth programs are used to pre-train the parsing module layout generator, which is the structural heart of NMN. This parser is implemented using a sequence-to-sequence Bi-LSTM (Hu et al., 2017; Johnson et al., 2017b). We emphasize that, in our experiments, we ensure consistent hidden state sizes for both the modules and the sequence encoder for NMN and NES, as well as consistent event-centric visual + decontextualized word embedding input.
• Implicit semantics methods. This class of models leverages recursive computation units with attention over visual and textual input to provide better compositionality than traditional end-to-end black-box neural network methods. We examine the MAC model (Hudson and Manning, 2018, 2019) as a representative baseline, following recent prior work (Bahdanau et al., 2019a). Similar to our NMN baseline, we report results with an event-centric version of the MAC model, following Mao et al. (2019), such that MAC is able to attend over a discrete set of localized event values. Thus, we can enable fair and consistent comparison of MAC, NMN, and other baselines with NES.
Implementation Details. Models and baselines are implemented in PyTorch (Paszke et al., 2019). Localized event candidate values $V$ are extracted by a pre-processing step. Our encoder $\phi$ is a ResNet-101 network (He et al., 2016), and localized event feature representations are based on conv4 features per prior work (Johnson et al., 2017a; Hudson and Manning, 2018) with pixel grid coordinates (per Section 3.5) to capture the necessary spatial and visual information for the downstream semantics. Following standard work in object detection (He et al., 2017), we use pooling to ensure all localized event values have the same dimension. Word embeddings are 300-dim GloVe.6B embeddings (Pennington et al., 2014). All text and visual inputs are consistent across all models for fair comparison.
As noted previously, model sizes are also kept consistent across models where applicable. Please refer to the supplement for implementation and additional details (see footnote 5).

Validating Conjunctivism
Overview. Our first experiments are centered around validating a fundamental design principle underlying our NES framework: that concept meaning can be effectively represented by conjunction of event classifiers. Both NMN and NES leverage syntax to guide their compositional structure: functional module layout (NMN) and event routing (NES), respectively (see footnote 6). Here, we isolate the impact of the design philosophy on the quality of the learned semantics by providing ground truth (GT) "syntax" (layout or routing) to each framework, assessing performance on Tasks A and B.

Figure 5: Validating Conjunctivism. Here, we provide ground truth (GT) logical forms for both functionist (NMN) and conjunctivist (NES) approaches. Controlling other factors, we observe that our conjunctivist NES framework provides better systematic generalization (Task B) than a functionist one. See Section 4.1.3.

Footnote 5: Available at https://neural-event-semantics.github.io/.

Footnote 6: We note that while we focus on the functionist realizations of NMNs prevalent in prior work, we recognize that the broader family of modular network approaches can include conjunctivist elements as well. A key intention of these experiments is to illustrate the value of our conjunctivist design as a compelling direction for future modular network design.
Systematic Generalization. Figure 5 shows the results for both NMN-GT and NES-GT. Both frameworks perform equally well on the standard generalization task (Task A), showing that the NES conjunctivist design preserves the efficacy of the functionist paradigm. In Task B however, while both frameworks perform reasonably well, NES exhibits stronger systematic generalization capability than the NMN model when evaluated on an unseen event distribution. These quantitative results suggest that NES enables a stronger decoupling of individual concepts, yielding higher accuracy when they are composed for unseen events.
To explore concept disentanglement further, we analyze the color sensitivity of color words in Figure 6. For this analysis, we take the trained models from Task B and examine the normalized response score of different modules (e.g., red) to a continuous spectrum of color input. We sample the input shapes for each color classifier from the unseen event distribution. Our analysis suggests that NES offers stronger disentanglement of attribute concepts: color words respond to separated and appropriate spectral regions, in contrast to NMN.

End-to-End Experiments
Overview. Having validated that conjunctivist composition can support strong performance with known event routings, our second set of synthetic experiments is designed to assess the full end-to-end learning capability of the NES framework, including the critical event routing stage. In this setting, we offer no ground truth logical form input or supervision to the NES model, and evaluate performance on all tasks. We perform the necessary program layout pre-training for the E2E-Func (NMN) baseline prior to end-to-end REINFORCE training.
Generalization. In Figure 7, we show that our initial findings in Section 4.1.3 hold in the more general end-to-end setting, across the broader set of model classes. While compositional methods consistently outperform the non-compositional baselines, there is a clear differentiation between MAC and NES/E2E-Func on Task B (systematic novel-event generalization). This suggests that MAC relies too strongly on correlative associations of text phrases for unseen events, overfitting at training.
In Figure 4, we visualize a table of NES score predictions on a specific input $V$, using a two-event setting for visual clarity. An input statement is considered true if there is an assignment (grounding to $V$) of the events with a high overall score. Across different event assignments $e \leftarrow V$, NES provides consistent and correct score outputs. Because NES considers each word as its own event classifier (with appropriate routing), it provides interpretable indicators for which attributes are specifically not present for each assignment.
In Figure 8, we visualize the event routing predictions from an example NES model trained end-to-end. Consistent with our observation in Figure 4, we see that the model can learn approximate routings and implicit arity of the different event classifiers. Though event routings are modeled as soft attention and classifier output scores are continuous, both have approached nearly discrete outputs by the end of learning, capturing the underlying logical structure of the domain.
Negation. Finally, we demonstrate that NES is capable of handling non-intersective modifiers by examining its ability to model property negation. In contrast with functionist models, conjunctivist event semantics must handle negation through modification of the input event to the given predicate (Pietroski, 2005). In Figure 9, we show the results from these experiments. First, we observe that NES can maintain the same level of generalization accuracy in variants of Task A and B that contain negation. Visualizing an example model, we see that NES learns to coordinate negation through its event routing stage: the presence of ''not'' in the textual input can lead NES to predict a soft routing A w1 * that attends to a combination of both e 1 and the ungrounded background e ∅ for the first argument of ''red'' (denoted as e 1 in the example). Now, when this specific ''red'' attribute classifier processes its updated event arguments, its classification behavior is reversed: a high score when the attribute is not present in the original e 1 .
Figure 7: End-to-End Methods. Generalization performance of end-to-end methods on ShapeWorld tasks. We observe that our conjunctivist NES framework offers stronger generalization performance on both standard (Task A) and systematic (Task B) compositional task settings. See Section 4.1.4 for additional details and analysis.
We compare with an ablation variant of NES that removes this routing flexibility: for attribute classifiers M w , we restrict their routing attention A w * * to only consider the n−1 grounded events in the first argument slot (removing e ∅ from consideration) and fix the second slot a 2 to the background e ∅ . Because individual event classifier modules only take decontextualized word embeddings, the event routing mechanism is the only way for context information to influence the classification. Thus, this ablation directly reflects the impact of the flexible event routing mechanism and its usage of the ungrounded background event to handle more complex language settings. We find that while the ablation maintains performance on the standard tasks, its accuracy significantly decreases in this setting where some input statements have negation. Overall, we observe that the rich, augmented event space and flexible event routing stage enable our conjunctivist framework to learn how to model non-intersective modifiers, a crucial step for real-world language (Section 4.2).
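The difference between full routing and the ablation can be illustrated with a small sketch of soft attention over event features. The feature vectors, the logits, and the convention that the last entry is the background event are hypothetical choices made for illustration. The sketch covers only the routing step and does not model how a learned classifier responds to the mixed input.

```python
import math

def softmax(logits):
    """Numerically stable softmax; -inf logits receive exactly zero weight."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [v / total for v in exps]

def route(logits, events, allow_background=True):
    """Mix event features with soft attention.

    Assumption: the last entry of `events` is the ungrounded background event.
    With allow_background=False the background is masked out, as in the ablation.
    """
    if not allow_background:
        logits = logits[:-1] + [float("-inf")]
    weights = softmax(logits)
    dim = len(events[0])
    mixed = [sum(w * e[d] for w, e in zip(weights, events)) for d in range(dim)]
    return weights, mixed

# Hypothetical features for e_1, e_2, and the background event.
events = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Hypothetical routing logits for the first argument of "red" when "not" is present.
logits = [2.0, 0.0, 2.0]

full_weights, full_arg = route(logits, events)
ablated_weights, ablated_arg = route(logits, events, allow_background=False)
```

With these logits, the full routing splits its attention between e_1 and the background, whereas the ablated routing renormalizes over the grounded events only, so the classifier never receives a background-modified argument.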

Experiments: Real-World Language
Having validated the efficacy of NES in a controlled synthetic setting, we now explore NES in a grounded reference game task to demonstrate its broader applicability. Because the overall end-to-end NES framework requires no low-level supervision during training, it mirrors the broader applicability of implicit semantics methods (MAC) to less structured, human-generated language.
Figure 9: Negation with NES. (a) We visualize one way in which NES can handle coordination for non-intersective modifiers (e.g., attribute negation) by leveraging the background event e ∅ . NES soft routing leads to a modified event argument input e 1 attending over e 1 and e ∅ , enabling the red classifier to output the opposite prediction (now, output score = 1.0 if original e 1 is not red). (b) NES performance on Task A and B negation variants remains consistent. Ablation (Section 4.1.4) highlights the impact of the event routing mechanism.
Chairs-in-Context (CiC). The Chairs-in-Context (CiC) dataset (Achlioptas et al., 2019) contains chairs and other objects from the ShapeNet dataset, paired with human-generated language collected in the context of a reference game. Each CiC input consists of a set of 3 chairs representing a contrastive communication context, with a human utterance (up to 33 tokens) intended to identify one of the chairs. In total, there are over 75k triplets with an 80-10-10 split for train-val-test. CiC also contains a zero-shot evaluation set with triplets of unseen object classes (e.g., tables). CiC is challenging due to its relatively long-tail language diversity and varied visual inputs.
Task A: Language Generalization. Our first CiC benchmark task is language generalization, where a model must ground the specific chair from the input set given a referring utterance. The dataset split ensures no overlap in speaker-listener pairs between training and evaluation, so models must generalize to new communication contexts.
Task B: Zero-Shot Generalization. Our second CiC benchmark task is zero-shot generalization, which examines the ability for the model to generalize from understanding attribute concepts learned in a chairs context to contexts with unseen object classes like tables and lamps. The overall task setting is the same as before, but during evaluation the triplets are composed of objects from a particular unseen class. For consistency with prior work, all models here are evaluated on an image-only setting (i.e., no 3D point-cloud representation). We provide a breakdown of the results on the full zero-shot transfer set by class.
Models and Implementation. Our main baseline is the recent ShapeGlot (SG) architecture (Achlioptas et al., 2019). The SG baseline leverages recurrent, convolutional, and attention components in an end-to-end architecture to achieve state-of-the-art performance on the language and zero-shot generalization datasets. We also consider a conjunctive baseline with event classifiers without the soft event routing stage, reminiscent of a product-of-experts (PoE) classification setting. This baseline serves to illustrate the impact of the flexible routing stage on compositionality, and in particular handling of non-intersective modifiers. We additionally report two compositional baselines from Section 4.1.2, MAC and NMN, following the protocols outlined for our previous end-to-end synthetic experiments (Section 4.1.4). Because CiC contains unstructured human-generated text and it is difficult to train NMN end-to-end from denotation alone, we initialize the sequence-to-sequence program generator in the NMN baseline by pretraining on auxiliary parse information for 1,000 examples (Suhr et al., 2019;Yi et al., 2018); all other baselines do not have any additional supervision data. Finally, we also consider a denser input event space for NES corresponding to sub-regions in the image input. Here, sub-events are additionally sampled from the (unannotated) final conv4 feature grid of the encoder network; we denote this as NES+ in our experiments. We adopt consistent experimental settings from Achlioptas et al. (2019), treating each chair as an event candidate space, with predictions normalized by 3-way softmax over possible target images. All model sizes are kept comparable in number of parameters for fair comparison. We leverage the same pre-trained VGG16 features (Simonyan and Zisserman, 2015;Chang et al., 2015) and GloVe (Wiki.6B) embeddings (Pennington et al., 2014). For completeness, we report results with VGG16 and ResNet-101 without ShapeNet pre-training for both tasks.
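To make the prediction step concrete, the sketch below combines per-word scores for each chair by a product, in the spirit of the PoE-style conjunctive baseline, and then normalizes with a 3-way softmax. The per-word scores are made-up values, and feeding the product scores directly to the softmax as logits is an assumption. The exact quantity that is normalized in the experiments is not specified here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [v / total for v in exps]

# Hypothetical scores of two word classifiers applied to each of the 3 chairs.
word_scores = [
    [0.9, 0.8],  # chair 0
    [0.2, 0.7],  # chair 1
    [0.5, 0.1],  # chair 2
]

# Conjunctive composition without routing: multiply the word scores per chair.
chair_scores = [math.prod(scores) for scores in word_scores]

# 3-way softmax over the candidate target images (assumed to take the scores as logits).
probs = softmax(chair_scores)
predicted = max(range(len(probs)), key=lambda i: probs[i])
```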
Analysis. We report our results in Table 1 and Table 2 against the prior state-of-the-art SG architecture (Achlioptas et al., 2019). The MAC baseline provides comparable performance to the prior state-of-the-art. The NMN baseline has reasonable accuracy, albeit lower than the MAC and SG baselines. This is likely due to the ambiguity in longer token sequences (up to 33 tokens), which can contain filler words and occasional disfluencies that hurt the efficacy of the sequence-to-sequence program generator. Nonetheless, NMN outperforms the PoE baseline, which serves as a simplistic conjunctive modular baseline without the NES event routing framework.
Table 2: CiC-Zero Shot Generalization. Zero-shot generalization to unseen objects on the Chairs-in-Context (CiC) dataset. Results suggest NES can learn words as event classifiers in a general, object-agnostic manner. *SG model from (Achlioptas et al., 2019).
We observe that our model improves over the prior state-of-the-art work on this dataset by a large margin on the original neural listener task. Further, NES significantly improves zero-shot generalization performance, indicating that it has learned event classifiers for attributes (e.g., ''messy'', ''tall'') that can generalize to entirely unseen input event distributions. We visualize qualitative results in Figure 10: NES can provide interpretable event classifier outputs at the word level without any additional low-level supervision, in both the main (chairs) and unseen zero-shot settings. We also show how learned event classifiers are lexically consistent by performing standalone retrieval of antonym pairs. We observe that high-ranked retrievals for a word classifier correlate with low-ranked retrievals of its antonym.

Overall Discussion
We provide additional discussion of the overall NES framework, considering its broader implications, limitations, and avenues for further work.
Broader Generality. In the above sections, we have described our key results of NES on the ShapeWorld and CiC benchmarks. However, modular neural network approaches like NMN are intuitively suited to settings where the visual and language environments are particularly regular, context-free, and unambiguous. In its current formulation, NES is similarly suited to such structured settings: effective generalization to highly irregular and context-sensitive vision and language settings in images and videos remains outside the scope of this paper. Nonetheless, we believe that careful consideration of some of the key elements in the NES framework, such as the proposed soft event routing system with ungrounded events used for coordinating richer meaning, can offer a promising route towards improving the state-of-the-art.
Computational Complexity. Through its existential quantification operating over events, the complexity of event assignment (Equation (7)) during inference scales as O(k^{n−1}), where k is the number of visual event candidates V and n − 1 the number of events e in the logical form F (excluding e ∅ ). This was not an issue in the domains examined here, but may become one in complex vision-language domains. Exploring potential relationships with concurrent techniques that increase computational complexity but also improve systematicity may prove insightful here as well.
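The growth rate can be checked with a short worked example. The sketch below assumes exhaustive enumeration in which each of the n − 1 grounded events may be assigned any of the k candidates independently. The helper names and the chosen values of k are illustrative.

```python
from itertools import product

def num_assignments(k, n_grounded):
    """Number of ways to assign k candidates to n_grounded events: k^(n-1)."""
    return k ** n_grounded

def enumerate_assignments(k, n_grounded):
    """Explicit enumeration, feasible only for small k and n_grounded."""
    return list(product(range(k), repeat=n_grounded))

# Closed form and explicit enumeration agree on a small case.
small_case = len(enumerate_assignments(3, 2))

# With k = 10 candidates, each additional grounded event multiplies the count by 10.
growth = {m: num_assignments(10, m) for m in range(1, 6)}
```

For k = 10, the count rises from 10 assignments for one grounded event to 100,000 for five, which is why exhaustive assignment may become costly in more complex vision-language domains.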

Conclusion
In this work, we introduced neural event semantics (NES) for compositional grounded language understanding. Our framework's conjunctivist design offers a compelling alternative to designs rooted primarily in function-based semantics: By deriving structure from events and their (soft) routings, NES operates with a simpler composition ruleset (conjunction) and effectively learns semantic concepts without any low-level ground truth supervision. Controlled synthetic experiments (ShapeWorld) show the generalization benefits of our framework, and we demonstrate broader applicability of NES on real-world language data (CiC) by significantly improving language and zero-shot generalization over prior state-of-the-art. Ultimately, our work shows that deep consideration of the mechanisms for compositional neural methods may yield techniques better suited for differentiable neural modeling, maintaining core expressivity for grounded language understanding tasks.