Maintaining Common Ground in Dynamic Environments

Common grounding is the process of creating and maintaining mutual understandings, which is a critical aspect of sophisticated human communication. While various task settings have been proposed in existing literature, they mostly focus on creating common ground under static context and ignore the aspect of maintaining it over time under dynamic context. In this work, we propose a novel task setting to study the ability of both creating and maintaining common ground in dynamic environments. Based on our minimal task formulation, we collected a large-scale dataset of 5,617 dialogues to enable fine-grained evaluation and analysis of various dialogue systems. Through our dataset analyses, we highlight novel challenges introduced in our setting, such as the usage of complex spatio-temporal expressions to create and maintain common ground. Finally, we conduct extensive experiments to assess the capabilities of our baseline dialogue system and discuss future prospects of our research.


Introduction
Common grounding is the process of creating, repairing and updating mutual understandings (i.e. common ground), which is a critical aspect of sophisticated human communication. Humans can create substantial common ground by expressing various information in natural language, which can be clarified or repaired to resolve misunderstandings at essential levels of detail. Furthermore, as the situation changes and relevant information gets outdated, humans can update their common ground accordingly by discarding old information and acquiring new information. Such ability plays a vital role in sustaining collaborative relationships and adapting to emerging problems in nonstationary, real-world environments.
However, despite the wide variety of tasks proposed in existing literature (Fang et al., 2015; Zarrieß et al., 2016; De Vries et al., 2017; Udagawa and Aizawa, 2019; Haber et al., 2019), most focus on creating common ground under static (time-invariant) context and ignore its dynamic aspects. While some recent dialogue tasks deal with dynamic information, they often lack suitable evaluation metrics (Pasunuru and Bansal, 2018), context updates in the course of the dialogue (Alamri et al., 2019) or diverse dynamics of the environment itself (De Vries et al., 2018; Suhr et al., 2019; Narayan-Chen et al., 2019; Thomason et al., 2019; Moon et al., 2020). Therefore, it remains unclear how well existing dialogue systems can adapt to diversely changing situations through advanced common grounding (§2).
To address this problem, we propose a novel dialogue task based on three design choices (§3). First, we formulate a novel sequential collaborative reference task as a temporal generalization of the collaborative reference task proposed in He et al. (2017) and Udagawa and Aizawa (2019). In our formulation, the goal of the agents is generalized to track and select the common entity at multiple timesteps, while the agents' observations change dynamically between each timestep. This setting requires both creation and maintenance of common ground, whilst enabling clear evaluation based on the length of successful timesteps.
Secondly, we focus on synthesizing the entity movements, as popularized in the recent video understanding benchmarks (Girdhar and Ramanan, 2020;Yi et al., 2020;Bakhtin et al., 2019). By leveraging such synthetic dynamics, we can minimize undesirable biases, maximize diversity and enable fully controlled evaluation and analysis.
Finally, we build upon OneCommon Corpus (Udagawa and Aizawa, 2019) to introduce natural difficulty of common grounding with minimal task complexity. To be specific, we represent entity attributes and their temporal dynamics based on continuous real values to introduce high ambiguity and uncertainty. In addition, we consider a partially-observable setting where each agent only has a partial view of the environment, which introduces various misunderstandings and partial understandings that need to be resolved. Based on this task design, we collected a large-scale dataset of 5,617 dialogues (including over 65K utterances) through careful crowdsourcing on Amazon Mechanical Turk (§4).
Figure 1: Example dialogue of our sequential collaborative reference task (§3). Each agent has a partial view of a 2-D plane with synthetic entities (grayscale dots of various sizes). During each turn, the entities move randomly on the 2-D plane. At the end of each turn, the agents communicate with each other to find and select one of the same, common entities. After each turn (if the selections match), both agents' views shift randomly and the next turn begins. * Note that the colored polygons (indicating the referents of the underlined expressions) are shown for illustration purposes only and are neither visible to the agents nor provided in the current dataset.
We show an exemplary dialogue of our task in Figure 1. Since the environment is dynamic, humans rely on various spatio-temporal expressions to express entity states at different timesteps ("started off on the left", "ends to the right") or how they changed dynamically ("moves very quickly", "come towards the left") to create common ground. Furthermore, in later turns, humans often leverage their previous common ground ("still see the same one?", "crosses underneath our old one") to update their common ground more reliably and efficiently. We conduct detailed analyses of the dataset to study such strategies in §5.
In our experiments (§6), we train a neural dialogue system based on prior OCC models. Through our extensive evaluation and analysis, we assess the current model's strengths as well as important limitations and show that substantial room remains for improvement.
Overall, our main contributions are:
• Proposal of a novel dialogue task to study common grounding in dynamic environments.
• Large-scale dataset of 5,617 dialogues to develop and test various data-driven models.
• Detailed dataset analyses which highlight novel challenges introduced in our setting.
• Extensive evaluation and analysis of a simple yet strong baseline dialogue system.

Table 1: Comparison with the major datasets. Environments are considered dynamic if they involve rich, spontaneous dynamics and contexts to be updated if new information is provided in the course of the dialogue.
Dataset | Context Source | Evaluation of Common Grounding
Twitch-FIFA (Pasunuru and Bansal, 2018) | Synthetic | N/A
AVSD (Alamri et al., 2019) | Real | Indirect
SIMMC (Moon et al., 2020) | Synthetic+Real | Indirect
MutualFriends (He et al., 2017) | Synthetic | Create
GuessWhat?! (De Vries et al., 2017) | Real | Create
Photobook Dataset (Haber et al., 2019) | Real | Create
OneCommon (Udagawa and Aizawa, 2019) | Synthetic | Create
Dynamic-OneCommon (Ours) | Synthetic | Create+Maintain

Related Work
The notion of common ground was originally introduced in Lewis (1969) and Stalnaker (1978) and theoretically elaborated in fields such as psycholinguistics (Clark and Brennan, 1991; Brennan et al., 2010). While formal approaches (rule/logic-based) exist to computationally model the process of common grounding (Traum, 1994; Van Ditmarsch et al., 2007; Poesio and Rieser, 2010), capturing their full complexities in realistic, situated conversations remains a formidable problem. From an empirical perspective, various dialogue tasks have been proposed to develop and evaluate data-driven models of common grounding. Most of the existing literature focuses on closed domain, goal-oriented settings to measure the ability both quantitatively and objectively (Fang et al., 2015; Zarrieß et al., 2016; De Vries et al., 2017). Recent works, summarized as the grounded agreement games in Schlangen (2019), introduce symmetric speaker roles to encourage more bilateral interaction. Udagawa and Aizawa (2019) also argue that continuous and partially-observable context is essential for requiring advanced common grounding (§3.1). Finally, Haber et al. (2019) propose a multi-round image identification task, where different combinations of images are provided to each agent at every round. While this setting is useful for studying subsequent references affected by the existing common ground (Brennan and Clark, 1996; Takmaz et al., 2020), the observations in each round are static, temporally independent images. Hence, all of these tasks focus on creating common ground under static context and lack evaluation metrics for maintaining common ground in dynamic environments.
We also note that some recent dialogue tasks require dealing with dynamic information, although common grounding usually takes place implicitly and may be difficult to measure directly. For instance, Alamri et al. (2019) proposed Q&A based dialogues grounded in video contexts. However, the information given to each agent remains fixed throughout the dialogue, requiring creation but minimal update of common ground. Many recent works also focus on dialogues grounded in external environments (De Vries et al., 2018; Suhr et al., 2019; Narayan-Chen et al., 2019; Thomason et al., 2019; Moon et al., 2020). These settings often involve dynamic change of the perspectives, but they usually assume that the environments themselves are stationary and do not change spontaneously (without direct intervention). In contrast to these works, we introduce both context updates in the course of the dialogue and diverse dynamics of the external environment to require advanced common grounding. We summarize our comparison with the major existing datasets in Table 1.
Finally, our work is relevant to the emerging literature on spatio-temporal grounding in computer vision and NLP. This includes video QA (Lei et al., 2018;Castro et al., 2020), video object grounding (Zhou et al., 2018;Sadhu et al., 2020) and video captioning (Krishna et al., 2017a), all of which are essential subtasks in our dialogue. However, existing resources often contain exploitable biases and lack visual/linguistic diversity as well as reliable evaluation metrics (esp. in language generation) (Aafaq et al., 2019). It is also challenging to probe model behaviors without the controllability of the video contexts (Girdhar and Ramanan, 2020). We have addressed such concerns based on our task design ( §3.2) and expect our resource to be useful for promoting this line of research as well.

Task Formulation
In this section, we review the collaborative reference task from OneCommon Corpus (OCC in short) and formulate our sequential counterpart as its temporal generalization.

Collaborative Reference Task
Based on Udagawa and Aizawa (2019), a collaborative reference task is a multi-agent cooperative game with entities $E = \{e_1, e_2, \ldots, e_m\}$ and agents $A = \{a_1, a_2, \ldots, a_n\}$. Each agent $a_j \in A$ has an observation of entities $\mathrm{obs}_j(E)$ and can exchange information with other agents in natural language. At the end of the game, each agent selects one of the observable entities, and the game is successful if and only if all the agents selected the same entity.³ This can be considered as a general framework for evaluating accurate mutual recognition of a common entity, which is often a critical step in general common grounding.
One main feature of OCC is that it represents all entity attributes (color, size and location on a 2-D plane) based on continuous real values. Unlike discrete/categorical attributes, this introduces high ambiguity and uncertainty to be expressed in symbolic natural language. In addition, it introduces partial observability where each agent only has a partial view of the 2-D plane, which requires collaborative resolution of various misunderstandings. We show an example of a successful dialogue from OCC in Figure 2. However, this current task formulation assumes each observation to be static and can only evaluate the ability of creating common ground.

Sequential Collaborative Reference Task
To address this limitation, we generalize each observation to be dynamic and collaborative reference to be sequential. Specifically, each agent $a_j \in A$ now receives observation $\mathrm{obs}_j(E, t)$ at each timestep $t \in [t_0, \infty)$, and the agents' goal is to communicate in natural language to select the same entity at multiple timesteps $t_1, t_2, \ldots \in (t_0, \infty)$.⁴ At each selection timestep $t_k$ ($k \in \mathbb{N}$), $a_j$ must select one entity observable at $t_k$ but has all previous observations up to $t_k$, $\{\mathrm{obs}_j(E, t) \mid t \in [t_0, t_k]\}$. The game ends when the selections no longer match at timestep $t_k$ ($k \in \mathbb{N}$): therefore, the success at $t_1$ measures the ability of creating common ground, and the length of successful timesteps (LST) $k - 1$ measures the ability of maintaining it. This is a general framework for evaluating both creation and maintenance of mutual entity recognition in dynamic environments. Based on this task formulation, we propose a minimal task setting extending OCC and incorporate dynamic change of the entity locations.
³ In contrast to the typical reference tasks (De Vries et al., 2017), agent roles are symmetric and they can agree upon any of the common entities (as long as it's the same).
⁴ We assume $t_{k-1} < t_k$ for all $k \in \mathbb{N}$.
Figure 2: Example dialogue from OneCommon Corpus (OCC). We can see that the human players are able to detect misunderstandings and make flexible clarifications to reduce ambiguity and uncertainty.
A: I see three in a line going up and to the right. The middle one is the largest and darkest
B: I don't see that. I have one large, medium gray dot that's under a small, darker gray dot
A: Is the larger dot slightly to the left
B: yes, slightly, let's choose the larger one
We refer to each time range $[t_{k-1}, t_k]$ as turn $k$. During each turn, we change the location of each entity $e_i \in E$ based on a simple parameterized movement, where the trajectory is determined by a quadratic Bézier curve (Bézier, 1974). See Figure 3 for an illustration, where $r_1, r_2$ are parameters of distance and $\theta_{k-1}, \Delta\theta$ represent angles. We sample $r_1, r_2, \Delta\theta$ from fixed uniform distributions each turn and update $\theta_k$ as $\theta_k \leftarrow \theta_{k-1} + \Delta\theta$ ($\theta_0$ is initialized randomly). This way, we can generate diverse, unbiased, coherent and fully controllable dynamics of the environment.
To enable fair comparison with OCC, we limit the number of agents to 2 and set the circular agent views to have the same diameter as OCC. At each selection timestep $t_k$, we ensure that each agent has 7 observable entities with only 4, 5 or 6 of them in common, which is also identical to OCC. Finally, we sample all entity attributes (color, size and initial location) from the same uniform distributions as OCC with minimal modifications. Therefore, we expect the (distribution of) observations at $t_k$ to be similar and enable mostly fair comparison with OCC (in §5 and §6).
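The movement model can be illustrated with a minimal Python sketch. The quadratic Bézier formula itself is standard, but the way the distances r1, r2 and the angles map to control points is defined by Figure 3, which is not reproduced here, so the geometry in `sample_turn` is an assumption. The sampling ranges and the function names are also hypothetical placeholders.

```python
import math
import random

def quadratic_bezier(p0, p1, p2, s):
    """Point at s in [0, 1] on the quadratic Bezier curve with control points p0, p1, p2."""
    a, b, c = (1 - s) ** 2, 2 * (1 - s) * s, s ** 2
    return (a * p0[0] + b * p1[0] + c * p2[0],
            a * p0[1] + b * p1[1] + c * p2[1])

def sample_turn(start, theta_prev, rng,
                r_range=(0.1, 0.3),                       # placeholder range
                dtheta_range=(-math.pi / 2, math.pi / 2)):  # placeholder range
    """Sample one turn of movement for a single entity."""
    r1 = rng.uniform(*r_range)
    r2 = rng.uniform(*r_range)
    dtheta = rng.uniform(*dtheta_range)
    theta = theta_prev + dtheta  # theta_k <- theta_{k-1} + delta_theta
    # Assumed geometry: leave the start along the previous heading,
    # then arrive at the end point along the new heading.
    p1 = (start[0] + r1 * math.cos(theta_prev),
          start[1] + r1 * math.sin(theta_prev))
    p2 = (p1[0] + r2 * math.cos(theta),
          p1[1] + r2 * math.sin(theta))
    return (start, p1, p2), theta

def trajectory(control_points, n_frames=10):
    """Discretize the curve into evenly spaced parameter values."""
    p0, p1, p2 = control_points
    return [quadratic_bezier(p0, p1, p2, i / (n_frames - 1))
            for i in range(n_frames)]
```

Under this assumed geometry, the curve's end tangent points along the new heading, so chaining turns (passing the returned angle and end point into the next call) yields a smooth path, which is one way the movements could stay coherent across turns.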
To ensure task difficulty, we also shift the perspective of each agent after each successful turn (see Figure 1) so that the overlapping regions differ every turn. The same dot is prohibited from staying in common for over 3 consecutive selection timesteps, requiring frequent updates of common ground. Finally, we limit the maximum number of turns to 5 for practical purposes (hence the maximum LST is 5 in each game).
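The evaluation metric follows directly from the rules above: the game stops at the first mismatch and is capped at 5 turns. A minimal sketch, assuming each agent's selections are given as a list of entity identifiers per turn (the function name and input format are illustrative, not part of the released dataset tooling):

```python
def length_of_successful_timesteps(selections_a, selections_b, max_turns=5):
    """Count consecutive turns, starting from the first, where both agents chose the same entity."""
    lst = 0
    for sel_a, sel_b in zip(selections_a[:max_turns], selections_b[:max_turns]):
        if sel_a != sel_b:
            break  # the game ends at the first mismatch
        lst += 1
    return lst
```

For example, selections `[3, 3, 7]` and `[3, 3, 2]` give an LST of 2, and a mismatch in the first turn gives an LST of 0.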

Dataset Collection
To collect large-scale, high-quality dialogues, we conducted careful crowdsourcing on Amazon Mechanical Turk. The web application is based on the CoCoA framework (He et al., 2017), and we used Scalable Vector Graphics (SVG) to animate entity movements and parallel shifts of the agent perspectives. Before working on our task, crowd workers were required to take a brief tutorial on the task setting, dialogue interface and instructions. Sample screenshots of our dialogue interface and tutorial are shown in Figure 4: note that animations up to the current turn could be replayed anytime for the ease of playing the game.
To ensure worker quality, we required crowd workers to have more than 500 completed HITs and acceptance rates higher than 99%. To encourage success, we rewarded $0.25 for every successful turn plus additional bonuses for longer LST achieved (up to $0.25 if LST = 5). Finally, we manually reviewed all submitted works and excluded dialogues which clearly violated the instructions (e.g. relying on premature guessing or other ineffective strategies). We did not exclude dialogues based on task failures (even if LST = 0), as long as they were based on valid strategies.
To solicit linguistic/strategic variety, we generally used a unique environment for each game. However, if the task was unsuccessful (i.e. LST = 0), we allowed the environment to be reused in another game. This way, we can expect to eventually collect successful (LST > 0) dialogues for the relatively difficult environments as well.
Overall, we collected 5,804 dialogues, and after the reviewing process, we were left with 5,617 qualified dialogues. We refer to this dataset as Dynamic-OneCommon Corpus (D-OCC). Note that our dataset is currently in English, but the dataset collection procedure is language-agnostic and can be applied to other languages.

Dataset Analysis
Next, we conduct detailed analyses of the dataset to study human common grounding strategies under dynamic context. Whenever possible, we give comparative analyses with OCC to highlight the effect of dynamic factors introduced in D-OCC. First, we summarize the overall statistics of OCC and D-OCC in Table 2.

Overall Statistics
In total, OCC and D-OCC have a comparable number of dialogues. However, dialogues can be much longer in D-OCC, since collaborative reference is repeated multiple times. On average, utterance lengths are slightly shorter in D-OCC: this can be mostly attributed to the increased (relative) frequency of short utterances like acknowledgments and shortened subsequent responses (e.g. "same again?" = "select the same black dot again?"). Note that long, complex utterances are also common in our dataset, as seen in Figure 1. Overall, we found 462 unique workers participated in D-OCC, which indicates reasonable diversity at the player level as well.
In terms of LST, the overall average was 3.31 with over half (53.5%) of the dialogues succeeding all 5 turns. This suggests that humans can solve the task reliably through sophisticated common grounding. After filtering dialogues with poor/careless workers (whose avg. LST < 2), we observed a slight improvement up to 3.57. If we only focus on the top 10 workers (with at least 10 tasks completed), avg. LST was significantly higher reaching 4.24. These results indicate that (at least potentially) much higher human ceiling performance can be achieved. Note that if we include the last unsuccessful turn in 46.5% of the dialogues, the average of all completed turns was slightly longer (3.77) in our dataset.
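The last figure can be checked with a short calculation. Since a game ends at the first mismatch, each of the 46.5% of dialogues that did not succeed in all 5 turns contains exactly one additional (unsuccessful) turn, so the average number of completed turns is the average LST plus that fraction:

```latex
\underbrace{3.31}_{\text{avg. LST}} \;+\; \underbrace{0.465}_{\text{fraction with a failed last turn}} \times 1 \;=\; 3.775
```

This agrees with the reported 3.77 up to rounding of the two inputs.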
Finally, we found that both datasets have a relatively small vocabulary size as well as a low occupancy of rare tokens (used fewer than 10 times in the dataset). This indicates minimal complexity at the lexical level, as observed in Udagawa and Aizawa (2019). We also found that the two datasets have a large vocabulary overlap, which is expected as D-OCC extends the setting of OCC.

Spatio-Temporal Expressions
At the utterance level, we observed an extensive usage of spatio-temporal expressions which are characteristic in dynamic environments. To study the frequency of such expressions, we manually annotated 100 dialogues in D-OCC with LST ≥ 2 (focusing on the more successful strategies).

Specifically, we detect whether each utterance contains 3 types of spatio-temporal expressions:¹¹
• Reference to current state describes location of entities at the end of the current turn (i.e. timestep $t_k$ if the utterance is in turn $k$).
• Reference to state change describes temporal change of entity locations (i.e. movements).
• Reference to previous state describes entity locations at previous timestep $t$ (where $t < t_k$).

Table 3: Examples and estimated frequencies of spatio-temporal expressions, with annotation agreement (Cohen's κ).
Current State (frequency: 23.8%, Cohen's κ: 0.91)
• It's to the right of where the grey one ended up for me after moving up and left.
• Now I have another triangle / Does it land next to two smaller gray dots?
• Does it have a lighter one below and to the left when they stop?
• Two similar shades close to each other (implicit)
State Change (frequency: 32.7%, Cohen's κ: 0.97)
• a small dark one traveling southwest / 2 other dots following it
• Do you have two dark med-size dots move slowly apart as they drift right?
• I have a large pale grey that moves down but starts out curving to the right and then takes a sharp turn to the south east
Previous State (frequency: 5.5%, Cohen's κ: 0.79)
• I still see the larger gray one that was next to it in the previous turn.
• I have the smaller dot that started out below it to the left.
• Before it moves, is there a lighter gray dot down and to the right of it?
We show examples and estimated frequencies of spatio-temporal expressions in Table 3. We also computed the agreement of our annotation based on 50 dialogues with 3 annotators, which we found to be reliable based on Cohen's κ (Cohen, 1968).
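For readers unfamiliar with the agreement statistic, the following is a minimal sketch of the unweighted two-rater form of Cohen's κ, applied to per-utterance labels such as "contains a reference to state change" (1) or not (0). How the three annotators were combined is not stated here, so averaging over annotator pairs would be an assumption; the function name is illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For instance, `cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])` has observed agreement 0.75 and chance agreement 0.5, giving κ = 0.5.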
Based on this result, we found that reference to state change is the most widely used strategy, which could be as simple as "moves northwest" or more complex as in Table 3. Reference to previous state is much less frequent compared to other types but still observed in many dialogues. Note that humans distinguish previous and current states in various ways, including temporal expressions ("was", "now"), motion verbs ("started out", "landed") and implicit/default reasoning.
We also found that expressions are often nuanced and pragmatic, which are characteristic under continuous and partially-observable context (Udagawa and Aizawa, 2019). Nuances are typically expressed by the degree modifiers to convey subtle differences in location, movements, confidence, etc. Following Paradis (2008), we categorize them into 2 main types (and 5 subtypes): scalar modifiers used for concepts in a range of scale (diminishers, moderators, boosters) and totality modifiers used for concepts with definite boundaries (approximators, maximizers). See Table 4 for examples and the estimated occurrences of such modifiers in OCC and D-OCC. Based on these results, we can verify that there are comparable numbers of various degree modifiers in D-OCC as well, which are used effectively to cope with complex ambiguity and uncertainty.
¹¹ Note that a single utterance may contain none or multiple types of such expressions, and expressions of color, size or possession are not considered as spatio-temporal expressions.
In Figure 5, we show examples of pragmatic expressions which require pragmatic (non-literal) interpretations (Monroe et al., 2017). For instance, trajectories of the expression "straight down" may not indicate vertical lines in the literal sense (e.g. could be curving or leaning to the left). Similarly, the expression of "(moving) right and (then) up" may be used for diverse movements ending up in various locations (e.g. even below the initial location!). While such expressions more or less deviate from literal semantics, they are pragmatically sufficient to convey the speaker's intention (i.e. identify the target among the distractors) (Grice, 1975): alternatively, the speaker may need to choose different expressions for the same movement depending on the context (distractors). We also show exemplary expressions of multiple entity interactions in Figure 6, which demonstrate interesting pragmatic usage as well. For instance, "toward each other" may be used for trajectories moving in orthogonal (rather than opposite) directions for most of the time.
Figure 5 (panel labels): "straight down"; "right (and) then up".
Overall, our analyses of spatio-temporal expressions reveal advanced language understanding and generation required in D-OCC, regardless of the task/lexical simplicity.

Turn-Level Strategies
Finally, we study and compare human strategies at different timesteps (in different turns). Table 5 shows detailed statistics of the dataset in the initial turn and later turns, where creation and maintenance of common ground are required, respectively. Note that we also distinguish later turns based on whether the previous selection (i.e. previous target) stays in common or leaves at least one agent's view: the former cases can retain the same common ground but the latter cases require an update of common ground.
First, if we focus on the 1st turn, we can verify that success rates are consistently higher in D-OCC than OCC, especially in difficult cases when the number of shared entities is smaller. This indicates that humans can create common ground more accurately by leveraging dynamic information (e.g. entity movements) unavailable in OCC.
In later turns, we found that human performance is near perfect with shorter dialogues when the previous target stays in common. This is natural because they can simply retain common ground and repeat the same selection. Notably, human performance is consistently higher than the 1st turn even when the previous target is no longer in common, which verifies that humans can leverage previous common ground to update common ground more reliably as well.
We show example utterances of both types of cases in Table 6. Note that the previous target may temporarily leave the view and come back in cases where it stays in common, which occasionally makes even retainment of the same common ground non-trivial. In cases where the previous target is no longer in common, humans either inform about the lost entities explicitly or implicitly, e.g. by ignoring old entities and starting to focus on the new ones.

Experiments
Finally, we conduct extensive experiments to assess our baseline model's capability of common grounding in dynamic environments.

Evaluation
To study the model's capability from various aspects, we design 3 (sub)tasks based on D-OCC.
First, we evaluate the model's ability of recognizing common ground based on the target selection task, originally proposed for OCC. This is an important subtask of (sequential) collaborative reference, where the model is given one player's observation and the (ground-truth) dialogue history to predict which target was selected by the player. Since there can be multiple selections in D-OCC, the model makes predictions at the end of each turn $k$ (at timestep $t_k$). The number of entities observable at $t_k$ is fixed at 7 for both OCC and D-OCC (§3.2), so this is a simple classification task evaluated based on accuracy.
Secondly, we estimate the model's ability of creating and maintaining common ground based on the selfplay dialogue task, where each model plays the full sequential collaborative reference task against an identical copy of itself. While this evaluation has the advantage of being scalable and automatic, succeeding on this setting is only necessary for human-level common grounding and not sufficient, since the model may only be able to coordinate with itself (and not with real humans).
Thirdly, we conduct human evaluation to test the model's ability of playing sequential collaborative reference against real human workers on AMT. Due to the high cost of this evaluation, we only focus on the top 3 variants of our baseline ranked by avg. LST in the selfplay dialogue task.

Model Architecture
For a fair comparison with prior work, we implement our baseline model following existing OCC models. The overall model architecture is shown in Figure 7.
To encode the dialogue tokens throughout the turns, we use a unidirectional GRU (Cho et al., 2014). To encode the observation during turn k, we first split the animation of entity movements into 10 frames and the agent view shift into 5 frames. Then, we process each observation frame based on the spatial encoder, followed by the temporal encoder to integrate these outputs.
The spatial encoder is used to extract spatial features and meta features from each observation frame. Spatial features represent the spatial attributes of each entity (color, size and location in the frame), which are encoded using an MLP and a relation network (Santoro et al., 2017). The relation network is used to represent the spatial attributes relative to a subset of entities $\tilde{E} \subset E$, which could be all entities observable in turn $k$ ($E_{\mathrm{all}}$) or selectable entities visible at $t_k$ ($E_{\mathrm{sel}}$). Hence, the spatial features of $e_i$ are computed from the MLP and relation network outputs, using the vector representation $\mathbf{e}_i$ of entity $e_i$ and vector concatenation. Meta features are binary information of each entity representing whether (or not) the entity (i) is visible in the frame, (ii) is visible at timestep $t_k$, (iii) was visible at timestep $t_{k-1}$, and (iv) was selected in the previous turn (i.e. is the previous target). Meta features are also encoded using an MLP, and we take the sum of spatial/meta features as the (entity-level) output of the spatial encoder.
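To make the data flow of the spatial encoder concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the exact form of the relation network is not given in this text, so encoding pairwise attribute differences and summing them over the reference subset is an assumption, as are the random stand-in weights, the omission of biases, and all function names.

```python
import random

def make_layer(in_dim, out_dim, rng):
    """Random weights standing in for a trained single-layer MLP."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]

def mlp(weights, x):
    """One linear layer followed by ReLU (biases omitted for brevity)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def spatial_encoder(entities, meta, subset, layers):
    """entities: [color, size, x, y] per entity; meta: 4 binary flags per entity;
    subset: indices of the reference entities (all observable or selectable ones)."""
    outputs = []
    for i, e_i in enumerate(entities):
        own = mlp(layers["own"], e_i)
        # Assumed relation features: sum of encoded differences to the subset.
        rel = [0.0] * len(layers["rel"])
        for j in subset:
            if j == i:
                continue
            diff = [a - b for a, b in zip(e_i, entities[j])]
            rel = [r + f for r, f in zip(rel, mlp(layers["rel"], diff))]
        spatial = own + rel  # vector concatenation
        meta_feat = mlp(layers["meta"], meta[i])
        # Entity-level output: element-wise sum of spatial and meta features.
        outputs.append([s + m for s, m in zip(spatial, meta_feat)])
    return outputs
```

With a hidden size h for the "own" and "rel" layers, the "meta" layer must output 2h values so that the element-wise sum is well defined; a GRU over the per-frame outputs would then play the role of the temporal encoder.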
Finally, we use the temporal encoder based on a GRU to encode the outputs of the spatial encoder. The final state of the temporal encoder is considered as the final representation of each entity.
Based on the outputs of these encoders, we use two attention modules (based on MLPs) to compute attention scores for each entity. The first attention module is used to weight the final representations of all entities $E_{\mathrm{all}}$ conditioned on the current dialogue state: then, the weighted sum of $E_{\mathrm{all}}$ is concatenated with the dialogue state to predict the next dialogue token (Xu et al., 2015). The second module is used to predict the target entity, where we simply take the (soft)max of attention scores for the selectable entities $E_{\mathrm{sel}}$ in turn $k$.
Note that there are only two main differences between our baseline and the best OCC model (TSEL-REF-DIAL). First, in TSEL-REF-DIAL, the final representation of each entity is its spatial features, i.e. the meta features and temporal encoder are not used (which are only meaningful in D-OCC). Second, TSEL-REF-DIAL is also trained on the reference resolution task (using an additional attention module), which is only available in OCC. Due to this architectural similarity, we can virtually pretrain our model on OCC by initializing the shared model parameters based on TSEL-REF-DIAL and then fine-tune the whole model on D-OCC.
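The two uses of attention scores described above can be sketched as follows. This is an illustrative sketch only: the scores are assumed to come from an MLP over each entity representation and the dialogue state (not shown), and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(entity_reprs, scores):
    """Weighted sum of entity representations under softmax-normalized scores."""
    weights = softmax(scores)
    dim = len(entity_reprs[0])
    return [sum(w * rep[d] for w, rep in zip(weights, entity_reprs))
            for d in range(dim)]

def select_target(scores, selectable):
    """Pick the selectable entity (by index) with the highest attention score."""
    return max(selectable, key=lambda i: scores[i])
```

For next-token prediction, the output of `attend` would be concatenated with the dialogue state (for Python lists, `attend(reprs, scores) + dialogue_state`) before the output layer; for target selection, `select_target` restricts the choice to the entities selectable in the current turn.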

Experiment Setup
All modules of our baseline (MLPs and GRUs) are single layered with 256 hidden units, except for the attention modules which are 2-layered. Dropout rate of 0.5 is applied at each layer during training, and we use the Adam optimizer (Kingma and Ba, 2015) with the initial learning rate set to 0.001. After manual tuning on the validation set, we weight the losses from next token prediction and target selection with the ratio of 2:1.
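The loss weighting can be written compactly as follows. The symbols are introduced here for illustration only, and whether the weights are normalized (e.g. 2/3 and 1/3) is not specified in the text:

```latex
\mathcal{L} \;=\; \lambda_{\mathrm{tok}} \, \mathcal{L}_{\mathrm{tok}} \;+\; \lambda_{\mathrm{sel}} \, \mathcal{L}_{\mathrm{sel}},
\qquad \lambda_{\mathrm{tok}} : \lambda_{\mathrm{sel}} = 2 : 1
```

Here $\mathcal{L}_{\mathrm{tok}}$ denotes the next token prediction loss and $\mathcal{L}_{\mathrm{sel}}$ the target selection loss.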
In terms of data splits, we use 500 dialogues with LST ≥ 2 for testing target selection, another 500 for validation and the rest for training. Note that we use all unsuccessful turns (where the players failed to agree upon the same entity) as well, assuming they are still based on valid strategies. For selfplay dialogue and human evaluation, we collect 2,000 and 200 dialogues in unseen environments, respectively. Each experiment is repeated 5 times with different random seeds (including data splits), except for human evaluation.
Finally, we conduct extensive ablations to study the effect of various model architectures, including pretraining, spatial attributes (color, size and location) and the meta feature (previous target).
In addition, we also ablate the dynamic information of the observation by only using the last frame in each turn as the input for the temporal encoder.

Results
We show the results for target selection in Table 7. The human performance is estimated by 3 annotators based on 50 dialogues with LST ≥ 2.
Based on these results, we can verify that all ablations hurt the performance of our baseline in some way. Pretraining on OCC is generally effective, and all spatial attributes contribute to the overall performance (especially color and size).
When the meta feature of the correct previous target is available, all models perform remarkably well in cases (previous target stays in common), which is natural since humans often repeated the same selection. Finally, dynamic information also contributes to the baseline performance, despite the effect being rather marginal. However, there is huge room left for improvement in the 1 st turn and even more so in cases (previous target no longer in common). These results indicate that recognizing the creation of common ground is still difficult, and recognizing how they are updated (rather than retained) remains even more challenging for the current baseline.
Next, we show the results for selfplay dialogue and human evaluation in Table 8. We also include the results of TSEL-REF-DIAL (trained on OCC without fine-tuning on D-OCC) as a reference. In selfplay dialogue, we can verify that the baseline model performs reasonably well, outperforming TSEL-REF-DIAL in the 1st turn of D-OCC (as well as OCC). However, it is worth noting that TSEL-REF-DIAL may be suffering from a minor covariate shift in D-OCC (cf. §3.2), and without pretraining, our baseline still underperforms this best OCC model. We also found that all ablations of spatial attributes hurt performance, while the locational attributes became more critical in the full dialogue task. The meta feature of the previous target (selected by the model) is also critical, as the models seem to rely heavily on this feature to both retain and update the target. However, we found that ablation of dynamic information does not degrade (and actually improves) performance in selfplay dialogue. This indicates that the last frame of each turn (current state) is sufficient for the baseline to coordinate with itself, and it is unlikely to be leveraging sophisticated temporal information (state change or previous state) like the human strategies seen in §5.2. Also, while the models perform near perfectly in cases where the previous target stays in common, the success rates drop or do not improve significantly in cases where it is no longer in common (compared to the 1st turn). This shows that current models can retain the same common ground easily but struggle to update it using the previous common ground, unlike the human strategies seen in §5.3.
Finally, in human evaluation, we could verify that our baseline performs the best of the top 3 models in the selfplay dialogue task, but the success rates were much lower than those observed in selfplay. This indicates that current models may not be using natural language in the same way humans use it (i.e. are not properly grounded, Bender and Koller, 2020), although they do become closer to it when all the features are available.
To summarize, our results in sequential collaborative reference show that the current baseline can leverage all spatial features and retain the same common ground, especially when the previous target is provided explicitly as the meta feature. However, it may not be using temporal information effectively, and the creation and update of common ground still remain challenging in dynamic environments, especially when conversing with real humans.

Discussion and Conclusion
In this work, we proposed a novel dialogue task to study the ability of creating, retaining and updating common ground in dynamic environments. The introduced dynamics are fully controllable in our setting to maximize diversity, minimize biases and enable reliable evaluation and analysis. Based on our dataset analyses and experiments, we demonstrated the advanced strategies of common grounding required and the open room for improvement in our newly developed Dynamic-OneCommon Corpus (D-OCC).
Table 8: Results for the sequential collaborative reference task (selfplay dialogue and human evaluation). Human performance is estimated based on the overall average of the crowd workers (cf. Tables 2 and 5).
In future work, we plan to utilize and enrich this dataset in several ways. For instance, we can conduct various causal analyses, e.g. by changing certain features of entities (such as movement) and studying the differences in model behavior, which is essential yet difficult to conduct in many existing datasets (cf. §2). Another promising direction is to add fine-grained annotation of reference resolution, as (partially) illustrated in Figure 1. We can also annotate spatio-temporal expressions, e.g. by following the procedure in . Such annotations would allow us to gain a deeper understanding of the intermediate process of common grounding: for instance, we can study whether the developed models recognize and use the spatio-temporal expressions appropriately and consistently in a human-like way (i.e. not only imitate at the superficial level, as observed in §6.4).
In order to improve the model performance, we are considering several approaches. One approach is to make the model learn from task success (and failure) through reinforcement learning. Due to the symmetric agent roles in our task, this is straightforward to conduct through selfplay (Lewis et al., 2017; Yarats and Lewis, 2018), and we can expect the models to avoid ineffective strategies like underspecification and premature guessing. We also expect the incorporation of pragmatic reasoning to be a fruitful area of future research. One representative approach is the Rational Speech Act (RSA) framework (Goodman and Frank, 2016), which has been applied in both continuous (Monroe et al., 2017) and partially-observable domains (Hawkins et al., 2021). However, application in dynamic domains would involve additional complexities that need to be taken into account, such as the dependencies on previous common ground. We are also planning to study a wider variety of model architectures and pretraining datasets, including video-processing methods (Carreira and Zisserman, 2017; Wang et al., 2018), vision-language grounding models (Lu et al., 2019; Le et al., 2020) and large-scale, open domain datasets (Krishna et al., 2017b; Sharma et al., 2018). Note that the entity-level representation of the observation (required in our baseline) can be obtained from raw video features, e.g. by utilizing object trackers (Bergmann et al., 2019).
Finally, we would like to discuss the main limitation of our current work, namely the ecological validity (De Vries et al., 2020) of D-OCC. Since we focused on the simplest task setting under continuous, partially-observable and dynamic context, direct application of our work in realistic settings may not be straightforward. However, the generic strategies required in our setting are fundamental in many real-world applications. As an illustration, imagine a navigation task in a dynamic environment, such as finding a lost child in an urban city. Since the target entity (the child) may not stay in one place, routing directions can no longer be fixed and need to be updated accordingly (as in "now head more to the west" or "go back to the previous block"). Furthermore, the landmark entities may not be stationary either and could be ephemeral (as in "following the group of travelers" or "in the middle of the crowd"). Lastly, if the child is not conspicuous and there are confusable distractors (e.g. many pedestrians around), the descriptions need to be precise and distinguishing (as in "wearing a little darker shirt" or "walking right towards the station"). In order to study such (nuanced and pragmatic) spatio-temporal expressions and references to previous common ground, we expect D-OCC to be an essential proving ground. In addition, our sequential collaborative reference task is defined generally (cf. §3.2), so we can easily scale up the task complexity to study the desired dynamics under consideration: the exploration of different, potentially more complex dynamics is an important research area left as future work.
Overall, we expect our task design, resource and analyses to be fundamental for developing dialogue systems that can both create and maintain common ground in dynamic environments.