Abstract
We study continual learning for natural language instruction generation, by observing human users’ instruction execution. We focus on a collaborative scenario, where the system both acts and delegates tasks to human users using natural language. We compare user execution of generated instructions to the original system intent as an indication of the system’s success in communicating its intent. We show how to use this signal to improve the system’s ability to generate instructions via contextual bandit learning. In interaction with real users, our system demonstrates dramatic improvements in its ability to generate language over time.
1 Introduction
Natural language provides an expressive and accessible avenue to instruct non-expert users. The ability to generate instructions is critical for systems that collaborate with users, for example, to delegate tasks. In such scenarios, the system generates language to communicate to the user a latent intent. When users are cooperative and proficient in the language, whether they accomplish the system’s intent provides an informative, albeit noisy signal of the quality of instruction generation.
This implicit signal is fundamentally different from supervised data, including via active learning, in that it does not label the system’s intent with a written instruction, but only provides evidence of the quality of a given instruction in relaying this intent. As a natural byproduct of interaction with users, it also differs from explicit user feedback in not requiring user action beyond what they already do as part of the interaction. Despite its potential and prevalence, this signal is understudied for learning to generate natural language.
In this paper, we study this learning signal. We formalize continually improving instruction generation by observing human users executing generated instructions. We learn by comparing instruction execution to the system intent, and demonstrate how this results in a system that continually improves its natural language generation ability through interaction with users. Figure 1 illustrates our learning process.
We design a task-oriented collaborative scenario using the CerealBar game environment (Suhr et al., 2019). In CerealBar, two agents, a leader and a follower, work together to complete tasks. The leader plans the tasks to complete, and communicates goals to the follower using natural language. CerealBar was originally introduced for studying follower instruction execution. We modify it to focus on generation of leader instructions, which are then executed by human followers. The collaborative, embodied setup effectively engages users, and aligns their incentives with executing the system’s instructions to the best of their abilities.
A major challenge is inferring a learning signal from observed user behavior. Given the user execution, we create positive and negative examples, depending on how the user execution aligns with the system’s plan and the user’s perceived correctness of their own execution. For example, consider an execution that does not align well with the system’s plan, but that the user considers correct given the instruction. Because of the misalignment, we cannot consider the instruction as a successful example given the system’s plan. However, given the user’s perceived correctness, we can generate a positive example treating the user’s execution as a plan paired with the instruction. In contrast to supervised learning with gold-standard per-token labels (Sutskever et al., 2014), such utterance-level binary labels form a challenging signal for learning, because they do not distinguish between correct and incorrect tokens.
We do not make the typical distinction between training and deployment; as human users follow generated instructions, we continually collect new data, periodically train using this data, and evaluate the system through the interaction itself. We formalize learning as an off-policy contextual bandit learning problem. We show that positive examples can be treated in a manner that reduces to supervised learning, allowing for simple effective use of the data. However, using negative examples is more challenging, because simply minimizing their likelihood gives an unbounded negative loss. We weigh negative examples using an inverse propensity score (IPS; Horvitz and Thompson, 1952; Wang et al., 2017) to address this issue.
We experiment with our approach through interaction with human users, tracking both task performance and how the generated language changes. We observe dramatic improvements in the quality of instructions generated as reflected in users’ execution: Task completion in accordance with the system intent increases from 44.7% to 79.3%. This is accompanied by significant language change: The occurrence of erroneous phrases decreases as desired, but the effective system vocabulary gradually shrinks.
Although using user feedback for improving language generation has been studied, as we discuss in Section 8, to the best of our knowledge, this study is the first to show effective instruction generation learning by observing user execution. Our experiments demonstrate the effectiveness of our process, but also illustrate limitations and important directions for future work. Code and data are available at https://lil.nlp.cornell.edu/cerealbar/.
2 Technical Overview and Notation
Our goal is to continually improve a natural language instruction generation model, by observing human executions of generated instructions.
Interaction Scenario
We focus on a collaborative scenario, where two agents, a leader and a follower, complete tasks in an environment. The system is the leader, and the human user is the follower. The leader plans tasks to accomplish, acts in the world, and instructs the follower using natural language. We use a deterministic procedure for planning and executing leader actions, and focus on learning the leader instruction generation model. The human follower acts in the world following the system instructions. We instantiate this scenario using CerealBar (Section 3), a collaborative game, where two agents collect sets of cards together by moving in a 3D environment.
Task
A world state s describes the current environment; in CerealBar, this includes the location of landmarks, cards, and both agents. A plan p̄ = (p1, …, pm) is a sequence of poses the system intends for the human user to take starting from a start state s1. In CerealBar, a plan includes moving in the environment with the intent of collecting cards; each pose pj is a tuple (hj, wj, αj), where hj and wj are height and width coordinates, and αj is a discrete orientation angle. An instruction x̄ = (x1, …, xn) is a sequence of tokens. An instruction execution ē is the sequence of poses a user takes executing x̄, starting in a start state s1. The generation distribution P(x̄ | s1, p̄; θ) is parameterized by θ. The goal of instruction generation is that given a generated instruction x̄, the user execution ē from s1 will follow the plan p̄. The user does not have access to p̄, but only to its description x̄.
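To make the notation concrete, here is a minimal Python sketch of the objects involved; the names and types are illustrative assumptions, not taken from the paper’s code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pose = Tuple[int, int, int]  # (h, w, alpha): height, width, discrete orientation


@dataclass
class InteractionRecord:
    start_state: object      # s1: landmarks, cards, and both agents
    plan: List[Pose]         # p̄: poses the system intends the user to take
    instruction: List[str]   # x̄: generated token sequence shown to the user
    execution: List[Pose]    # ē: poses the user actually took following x̄
```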
Learning
We use an encoder-decoder neural network model (Section 4), which we continually improve by observing user behavior. This process proceeds in rounds. At each round r, we first collect data using the current model parameters θr and then train our model by estimating new parameters θr+1. During data collection in round r, we sample from our model to generate instructions, and observe a human user’s execution of each instruction. An execution of an instruction x̄ generated for the plan p̄ with start state s1 creates a tuple (s1, p̄, x̄, ē, f), where ē is the user execution and f is structured user feedback solicited using binary questions (e.g., about the grammaticality of x̄). The learner does not observe the user’s actions executing x̄, but only their poses along the execution. Given these tuples, we create a dataset Dr of examples (s1(i), p̄(i), x̄(i), y(i)), where y(i) ∈ {−1, +1} is a binary label. Depending on the user execution and feedback, the plan p̄(i) is either the original plan used for generating x̄(i) or the user execution of x̄(i). We formulate estimating θr+1 as a contextual bandit learning problem with y as the reward. Section 5 describes the complete learning process.
Evaluation
Throughout the system’s lifetime, we measure how well human users complete tasks, and also use earth mover’s distance (EMD; Rubner et al., 1998) to quantify the similarity of the user execution ē to the plan p̄. We characterize language change over time by tracking vocabulary size, instruction length, and other statistics.
3 Interaction Scenario
Suhr et al. (2019) describe CerealBar in detail. CerealBar is a two-player, turn-based game where a leader and follower collaborate to collect sets of matching cards. The game objective is to collect as many valid sets as possible in a 3D environment. The environment includes landmarks (houses, mountains, ponds, etc.) that the players must move around, and may obscure a player’s view. A valid set consists of three cards with three distinct colors, shapes, and counts. Players move onto cards to select or deselect them. When the selected cards comprise a valid set, the players earn a point, all cards disappear,1 and new cards appear. The two players must collaborate effectively using natural language. The leader observes the entire environment, plans who should select which cards for the next set, executes their own part of this plan, and issues instructions to the follower. The follower executes leader instructions, only seeing a partial first-person view of the environment. Leader instructions must make use of the observed spatial environment, including landmarks, for the follower to be able to execute them given their partial view. Each interaction includes multiple instructions. Figure 2 shows the game and example generated instructions.
CerealBar was originally used for learning a follower instruction execution model from human demonstrations (Suhr et al., 2019). In contrast, we learn an instruction generation model for the leader, with the human user as the follower. The generated instructions must often specify multiple tasks to complete (i.e., when the follower is to select multiple cards), and how to navigate to the target cards, because the follower has only partial observability of the environment. This includes references to landmarks, spatial relations, and descriptions of paths. We focus on language generation, and use a deterministic planner to generate the plan, including which cards to select and how each player should move in their next turn, and execute the planned leader actions. The system uses the model we learn to map the follower’s part of the plan to a natural language instruction.
We learn through interactions with non-expert human followers, which CerealBar is particularly suited for. The game objective of earning a high score by collecting as many valid sets as possible incentivizes followers to execute the generated instructions as accurately as possible. In addition, CerealBar players need no expert knowledge to participate in the game, beyond familiarity with the simple game rules.
4 Model
We design a relatively simple encoder-decoder architecture to model the generation distribution P(x̄ | s1, p̄; θ), leaving more complex model development for future work. The inputs are a start state s1 and a plan p̄. The model parameters are θ. Our design considers the environment and plan to generate relevant, grounded instructions. Figure 3 illustrates the model.
Inputs
Similar to Suhr et al. (2019), we represent the world state s1 ∈ {0, 1}^(P×H×W) as a binary 3D tensor, where P is the number of position properties, and H and W are the environment’s height and width. Each of the W × H positions is represented as a binary properties vector of length P (encoding the type of object in the position, its color, etc.). The system plan p̄ is a sequence of follower poses along the intended execution. Each pose pj is a tuple (hj, wj, αj) of height hj and width wj coordinates, and a discrete orientation angle αj.
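As a concrete illustration, the state tensor can be built as below; the property count and map dimensions are placeholder assumptions for the sketch, not values from the paper.

```python
import numpy as np

P, H, W = 16, 25, 25  # assumed number of properties and map size


def encode_state(present_properties):
    """present_properties: iterable of (property_index, h, w) triples, one per
    property that holds at a position (object type, color, card count, agent, ...)."""
    s1 = np.zeros((P, H, W), dtype=np.uint8)
    for prop, h, w in present_properties:
        s1[prop, h, w] = 1
    return s1
```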
Encoder
The encoder computes a set of hidden states, which the decoder attends to during generation. We use a learned embedding function ϕs to map each position vector to a dense embedding of size Ns by summing the embeddings of each of the position’s properties. We combine the embeddings into a tensor S of size Ns × H × W, and compute S′ = CNN1(S), where CNN1 is a learned convolution and S′ has size Ns′ × H × W. Because the CerealBar environment is a grid of hexagons, we use HexaConv (Hoogeboom et al., 2018). We encode the plan positions into a sequence of vectors by cropping an Ns′ × Np × Np-sized tensor from S′ centered around each (hj, wj) and rotated by αj. These tensors represent the pose of the follower and its surroundings during execution. Each cropped tensor is encoded with an additional learned convolution, retaining its dimensionality.
Decoder
The decoder computes a probability distribution over token types conditioned on the prefix generated so far and the set of encoder outputs, which represents the environment state and plan. The decoder uses the first four layers of the GPT-2 Transformer architecture (Radford et al., 2019), which enables initializing with GPT-2 weights. We extend it with pseudo self-attention (Ziegler et al., 2019) to condition the generation on the encoder outputs. This adds a linear layer that projects the encoder outputs into the decoder self-attention space.
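The following is a minimal sketch of the pseudo self-attention idea: encoder outputs are projected by added linear layers into the decoder’s self-attention key/value space and prepended, so every decoder position can attend to them without a separate cross-attention block. It is a simplified re-implementation for illustration, not the paper’s or Ziegler et al.’s exact code.

```python
import torch
import torch.nn as nn


class PseudoSelfAttention(nn.Module):
    """Simplified pseudo self-attention: encoder outputs join the decoder's
    own keys/values via added projection layers (the only new parameters)."""

    def __init__(self, d_model: int, d_enc: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.enc_k = nn.Linear(d_enc, d_model)  # added: encoder -> key space
        self.enc_v = nn.Linear(d_enc, d_model)  # added: encoder -> value space
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, enc):
        # x: (B, T, d_model) decoder states; enc: (B, K, d_enc) encoder outputs.
        B, T, _ = x.shape
        q = self._split(self.q(x))
        k = self._split(torch.cat([self.enc_k(enc), self.k(x)], dim=1))
        v = self._split(torch.cat([self.enc_v(enc), self.v(x)], dim=1))
        L = k.size(2)  # total key length = K + T
        # Decoder positions are causally masked; encoder positions are always visible.
        mask = torch.ones(T, L, dtype=torch.bool, device=x.device)
        mask[:, L - T:] = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = (q @ k.transpose(-1, -2)) / self.d_head ** 0.5
        attn = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1)
        return self.out(self._merge(attn @ v))

    def _split(self, t):
        B, L, _ = t.shape
        return t.view(B, L, self.n_heads, self.d_head).transpose(1, 2)

    def _merge(self, t):
        B, H, L, D = t.shape
        return t.transpose(1, 2).reshape(B, L, H * D)
```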
Inference
We decode instructions from the generation distribution using temperature sampling with a temperature τ < 1 (Kreutzer et al., 2018b). This sharpens the sampling distribution to focus on higher probability outputs. We do not use beam search.
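A minimal sketch of the decoding step; the model interface (`next_token_logits`) and the end-of-sequence handling are assumptions, and recording the sampling log-probability is one choice of propensity for the IPS weighting described in Section 5.

```python
import torch


def sample_instruction(model, s1, plan, tau=0.5, max_len=64, eos_id=0):
    """Sample tokens with temperature tau < 1, which sharpens the distribution
    toward higher-probability tokens, and accumulate the sampling
    log-probability of the instruction."""
    tokens, logprob = [], 0.0
    for _ in range(max_len):
        logits = model.next_token_logits(s1, plan, tokens)  # hypothetical API
        probs = torch.softmax(logits / tau, dim=-1)
        tok = int(torch.multinomial(probs, num_samples=1))
        logprob += float(torch.log(probs[tok]))
        if tok == eos_id:
            break
        tokens.append(tok)
    return tokens, logprob
```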
5 Learning
We continually improve our model by observing users following generated instructions and re-estimating the model parameters. We initialize the model parameters θ1 using an existing language model and training on a static dataset of instructions (Section 5.1). We then perform a series of rounds; each round r includes deploying the model with human users and training on the collected interactions (Section 5.2). In round r, we collect interactions between our model parameterized by θr and human followers, to create a dataset Dr of start states s1(i), plans p̄(i), instructions x̄(i), and binary labels y(i). We estimate θr+1 using all data collected so far. Figure 1 illustrates our learning process.
5.1 Initialization
User interaction requires a minimal level of performance. Pilot experiments showed that a poorly initialized system is likely to frustrate users, who in turn provide little learning signal. Our initialization provides a sufficient level of grammaticality and plausibility to support user interaction, and thereby further learning.
5.2 Learning from User Behavior
Learning from interacting with human users alternates between generating instructions in interaction with users and training the model.
Interaction with Users
In each round r, we first deploy the model with parameters θr to interact with human users, with our system as the leader and the user as the follower. We do not update the model during this interaction phase.
The game environment is randomly generated for each interaction. Each game continues until the user leaves or the turns are exhausted. A game often includes collecting multiple sets of cards, and generating multiple instructions. Each instruction is generated with the current state as the start state s1;2 as both agents move and change the status of cards, the environment state changes throughout the game. At state s1, we generate the plan p̄ using a deterministic planner that determines (a) which cards should be selected or de-selected to make the next valid set, and (b) the shortest paths the leader and follower should take to visit all target cards. The actions the planner assigns to the follower form the plan p̄. The actions assigned to the leader are executed by the leader agent deterministically during its turn. The model is used to sample an instruction x̄, which is displayed to the user. The human user has no access to p̄, the set of target cards, or the game state s1. They only observe the instruction x̄ and what is ahead of them (Figure 2).
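For illustration only, here is a toy planner in the same spirit, visiting all target cards via shortest paths over a graph of walkable positions. It is not the paper’s deterministic planner, and the construction of the graph over CerealBar’s hex grid is assumed.

```python
from itertools import permutations

import networkx as nx  # assumed: walkable positions and adjacency as a graph


def follower_plan(graph, start, target_cards):
    """Toy version: try every visiting order of the (at most three) target
    cards and keep the shortest concatenation of shortest paths."""
    if not target_cards:
        return [start]  # 0-card plans: stay in place
    best = None
    for order in permutations(target_cards):
        path, pos = [start], start
        for card in order:
            leg = nx.shortest_path(graph, pos, card)
            path.extend(leg[1:])
            pos = card
        if best is None or len(path) < len(best):
            best = path
    return best
```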
During their turn, the user executes x̄ to the best of their ability, and indicates when done. If the user determines that the instruction cannot be followed, they can terminate the execution, which is treated just like marking the instruction as complete. The user execution ē is the entire sequence of poses they take while following the instruction.
When the user concludes or terminates an instruction x̄, we show them a top-down view of the entire environment with their execution path highlighted. They do not see the original system plan. We ask the user two binary feedback questions about the perceived correctness of their execution and the grammaticality of the instruction (Figure 4).
We create a tuple (s1, p̄, x̄, ē, f) for each execution, where s1 is the start state of the environment, p̄ is the plan generated in that state, x̄ is the sampled instruction, ē is the user execution, and f is the set of responses to the feedback questions. Once the user submits the answers to the feedback questions, the next instruction is generated.
Dataset Construction
We use all interactions in round r to construct the dataset Dr, which is made of tuples (s1, p̄, x̄, y), where p̄ is a plan and y is a binary label. Given a tuple (s1, p̄, x̄, ē, f), we use three heuristics to add examples to Dr:
1. If any feedback answer in f is negative, the instruction does not reflect the user’s execution or is not well written (i.e., ungrammatical). We add a negative example to Dr with the system plan p̄: (s1, p̄, x̄, −1).
2. If both feedback answers are positive, the user considers their execution accurate and the instruction well formed. This does not necessarily indicate the execution follows the system plan, but it means we can treat the execution as a plan. We add a positive example with the execution ē as the plan: (s1, ē, x̄, +1).
3. If both answers are positive and the execution follows the plan p̄,3 the instruction communicates the plan well. We add a positive example with the system plan: (s1, p̄, x̄, +1).
Overall, we add examples to Dr using both the original system plan and the user execution. The heuristics utilize the observational learning signal as much as possible while avoiding examples that are not beneficial for learning. For example, we do not add negative examples using the user execution, because these are less likely to be useful for learning. Although such executions can form negative examples if the user answered negatively to the correctness question, they tend to be relatively arbitrary, and it is unlikely the model conditioned on them will assign significant probability to the generated instruction, which is the behavior negative examples are meant to suppress. The sketch below illustrates how the three heuristics map an interaction to labeled examples.
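A minimal sketch of these heuristics, assuming `feedback` holds the two binary answers and `follows_plan` implements the plan-following check described in footnote 3; the field names are illustrative.

```python
def make_examples(s1, plan, instruction, execution, feedback, follows_plan):
    """Map one interaction tuple to labeled examples (Section 5.2)."""
    examples = []
    if not (feedback["correct"] and feedback["grammatical"]):
        # 1. Any negative answer: negative example with the system plan.
        examples.append((s1, plan, instruction, -1))
    else:
        # 2. Both answers positive: treat the user execution as a plan.
        examples.append((s1, execution, instruction, +1))
        # 3. ... and if the execution also follows the system plan,
        #    the instruction communicated the plan well.
        if follows_plan:
            examples.append((s1, plan, instruction, +1))
    return examples
```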
Parameter Estimation
We estimate the model parameters for the next round, θr+1, using all available data. We re-train our model, starting with GPT-2 parameters (Section 5.1).4
We formulate learning as an offline contextual bandit problem, treating the sentence labels y as rewards. Learning from the positive examples in the collected data forms a straightforward supervised learning problem, albeit one where the data is generated from system interaction. A key challenge is using the negative examples. Treating them like supervised examples requires optimizing the probability of their instructions to zero. Because the log-probability of an instruction is unbounded from below as its probability approaches zero, this leads to an unbounded negative loss that quickly dominates the objective. This is in contrast to positive examples, for which the loss is bounded below by zero. This issue is not present in existing work using offline contextual bandits to improve machine translation (Lawrence et al., 2017; Kreutzer et al., 2018b), where rewards are always non-negative.
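One plausible instantiation of this objective, consistent with the description above but not necessarily the paper’s exact formulation: positive examples use the standard negative log-likelihood, while negative examples contribute through the IPS ratio between the current model and the model that generated the instruction, which keeps their loss bounded.

```python
import torch


def bandit_loss(logp_theta, logp_behavior, y):
    """logp_theta: instruction log-probability under the current model.
    logp_behavior: log-probability under the model that sampled it (the
    recorded propensity). y in {-1, +1}.
    Positive examples reduce to supervised learning (-log p); negative
    examples are penalized through the IPS ratio, which is bounded below
    by zero, instead of an unbounded -log p term."""
    supervised = -logp_theta
    ips_ratio = torch.exp(logp_theta - logp_behavior)  # p_theta / p_behavior
    return torch.where(y > 0, supervised, ips_ratio).mean()
```

Footnote 5 points to an alternative worth exploring: applying IPS to all examples but clipping it at a maximal value, similar to clipping in PPO.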
6 Experimental Setup
Initialization Data
We create the supervised initialization dataset by sampling 360 interactions from the original CerealBar data (Suhr et al., 2019), which was collected in a wizard-of-oz (WOZ; Kelley, 1984) setup via human-human games. We select this number through pilot studies and qualitative analysis to minimize the amount of initialization data, while still maintaining sufficient model performance for early interactions to facilitate learning. Our goal is to use as little data as possible to study the target scenario where investment in supervised data is minimal, and most learning is left to interaction with users. This data includes 7,147 examples. We use the human demonstrations in the original data as plans.
Evaluation
Similar to Zhao et al. (2021), we observe that automated metrics, such as Bleu (Papineni et al., 2002) or BERTScore (Zhang et al., 2020), computed over a static held-out validation set are unreliable for evaluating instruction generation. Instead, we focus on task-completion measures via human execution. We measure task completion by considering the user execution as completing the intended task if the user visits all card locations included in the system plan; or, if the plan includes no target cards, the user stays in the starting position. We quantify the similarity of the user execution to the path in the system plan by computing earth mover’s distance (EMD; Rubner et al., 1998)6 between the two (Blukis et al., 2019). We also track the user answers to the feedback questions (Figure 4). We average each measure over the number of instructions in each round.
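A sketch of the two task-level measures, using POT for EMD (footnote 6); the uniform weighting over path positions, the Euclidean ground metric, and the (h, w) tuple representation of positions are assumptions of the sketch.

```python
import numpy as np
import ot  # POT: Python Optimal Transport


def task_completed(execution, target_cards, start):
    """The user visits every card location in the plan, or stays at the
    start position when the plan targets no cards. Positions are (h, w) tuples."""
    if not target_cards:
        return all(pos == start for pos in execution)
    visited = set(execution)
    return all(card in visited for card in target_cards)


def execution_emd(execution_xy, plan_xy):
    """EMD between the executed path and the planned path, each treated as a
    uniform distribution over its positions."""
    xs, xt = np.asarray(execution_xy, float), np.asarray(plan_xy, float)
    a = np.full(len(xs), 1.0 / len(xs))
    b = np.full(len(xt), 1.0 / len(xt))
    return ot.emd2(a, b, ot.dist(xs, xt, metric="euclidean"))
```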
Language Analysis
We quantitatively analyze how generated instructions change throughout training. For each round, we report mean instruction length, vocabulary size, and three measures of syntactic complexity using dependency trees (Xu and Reitter, 2016): (a) maximum depth: the longest path from the root to a leaf; (b) maximum width: the maximum out-degree of any word in the tree; and (c) average branching factor: the average out-degree of non-leaf words. We normalize the three measures by instruction length. We also qualitatively analyze errors in generated instructions by comparing 100 randomly sampled examples where the user failed to complete the intended task from each of the first and final rounds.
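A sketch of the three dependency-tree measures; the specific spaCy pipeline is an assumption, and the per-length normalization described above is left out.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English pipeline


def _depth(token):
    kids = list(token.children)
    return 1 if not kids else 1 + max(_depth(k) for k in kids)


def syntax_stats(instruction: str):
    """Return max depth, max width, and average branching factor of the
    dependency tree of an instruction."""
    doc = nlp(instruction)
    root = next(tok for tok in doc if tok.head == tok)     # dependency root
    max_depth = _depth(root)                               # longest root-to-leaf path
    max_width = max(len(list(tok.children)) for tok in doc)  # max out-degree
    internal = [tok for tok in doc if list(tok.children)]
    branching = (sum(len(list(tok.children)) for tok in internal) / len(internal)
                 if internal else 0.0)
    return max_depth, max_width, branching
```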
Interaction Setup
Except initialization, learning and evaluation are done through live interaction with users on Amazon MTurk. All workers passed a tutorial and a qualification quiz. We pay $0.15 per interaction, with a bonus of $0.10 per instruction to workers who follow our guidelines.
Implementation Details
As with performance evaluation, automated measures are unreliable for model selection. Instead, for both initialization and in each round, we train for N = 400 epochs, and take the final model. We find N via qualitative analysis of the initial model. We use an ensemble of four models. We uniformly sample one of the four models to generate each instruction, and use its probability as the propensity in IPS for negative examples. We use a sampling temperature τ = 0.5, and AdamW (Loshchilov and Hutter, 2018) for learning.
7 Results and Analysis
We conduct a long-term experiment with 14 rounds using our approach, and separate seven-round experiments to compare system variants. In both experiments, we collect roughly 100 interactions for each system per round. In the seven-round experiments, we deploy methods simultaneously to ensure that our observations are not sensitive to changes in user behavior, for example, because of adaptation and increased expertise. We do not inform workers about the model they are interacting with. We train each system only on data collected by the same method in previous rounds.
7.1 Long-term Study
We experiment with our approach for 14 rounds. We collect a total of 27,031 instructions from 1,445 interactions, with 103.2 interactions per round on average. The total cost is $2,895. Figure 5 shows both performance measures and language trends. For task measures and user feedback, we also break down performance according to the number of target cards in the system plan to evaluate performance changes for plans which may be more difficult to describe (e.g., because they require specifying more cards).7
Our learning method significantly improves the system performance across all measures. Task completion rate improves from 44.7% at round one to 79.3% at round 14, while EMD decreases from 1.73 to 0.88, showing increasing similarity between the execution and the plan. The user perception of the system also improves: The positive response rate for the perceived correctness question improves from 47.9% to 78.6%, and for grammaticality from 88.9% to 99.2%. The overall collaborative system performance improves as well; the game score increases from 4.5 to 10.4. The number of positive examples per round gradually increases, as the system improves and the interactions become longer. In contrast, the number of negative examples decreases over time.
We observe that the initial model struggles to describe plans containing more target cards, with a particularly low task completion rate of 1.6% for 3-card plans in the first round. This is potentially because only 0.7% of human follower executions in the initialization data demonstrate picking up three cards, while the planner generates 3-card plans 7.9% of the time. While initial improvement is slow for 3-card instructions, it picks up around round eight, and reaches a 32.9% task completion rate.
Language Analysis
We observe a consistent trend of decreasing sentence length and vocabulary size. Overall, these trends accompany a reduction in over-generation of erroneous phrases that are not grounded well in the environment. We also qualitatively observe that the system gradually generates slightly more underspecified instructions, for example by dropping mentions of landmarks crucial for navigating to a target card. This may explain the slight decrease in 1-card task completion rate in later rounds (Figure 5), because the planner usually has the follower travel further for 1-card instructions, which requires relying more on landmarks. A potential explanation for the decrease in vocabulary size is the ever increasing presence of system-generated sentences in training, which reinforces the system’s word choices. Alternatively, our learning signal may not account for the need for more descriptive language. For example, humans may compensate with exploration for omitted descriptions, which is not distinguished by how we convert the observed behavior to a learning signal. These trends outline important directions for future work.
We observe a small but significant increase in the branching factor over the system’s lifetime (p < 0.00001).8 We also see a slight decrease in maximum tree depth (p < 0.0001), and no significant change in maximum width.
Error Analysis
We analyze errors in the generated instructions at the first and final rounds. For each round, we randomly sample 100 instructions that the user did not execute according to the plan or answered negatively to a feedback question. Table 1 shows error types and example instructions. Overall, the frequency of erroneous instructions decreases from 68.5% of instructions in the first round to 26.8% in the final round. From the first to the final round, we observe a noticeable decrease in errors related to grounding of cards and landmarks. The overall frequency of errors related to incorrect directions and incorrect actions or conditions also decreases, and implausible instructions nearly disappear. However, there is an overall increase in underspecified instructions. This aligns with the decrease in vocabulary size and landmark use discussed above.
Confounding Factors
We identify two mechanisms unrelated to our approach that could explain the observed performance changes. We deploy two additional systems alongside our system during the final round. For each interaction, one of the three systems is randomly chosen. We do not inform the workers of the identity of the model for each interaction. First, we deploy the system following initialization during the final round to study if performance might be explained by user improvement over time. Second, because we train with a fixed number of epochs, later rounds have many more gradient updates, which may allow for better parameter estimation, even with the same amount of data. We train a system on the initialization dataset for the same number of gradient updates as when training the final full system.
Table 2 shows that these confounding factors do not explain the observed gains. We find minimal differences between evaluating the initial model (θ1) at the beginning and end of deployment, showing no significant effect from user improvement. Training the initial system longer (θ1′) shows a slight overall improvement, but it is negligible compared to the final system (θ14).
Table 2: Task completion rate (%) broken down by the number of target cards in the system plan.

Model | r | Overall | 0-card | 1-card | 2-card | 3-card
---|---|---|---|---|---|---
θ1 | 1 | 44.8 | 84.1 | 64.9 | 9.6 | 1.7
θ1 | 14 | 45.1 | 84.5 | 62.1 | 9.3 | 0.8
θ1′ | 14 | 49.6 | 76.6 | 63.8 | 24.8 | 7.4
θ14 | 14 | 79.4 | 99.6 | 81.9 | 72.1 | 33.0
7.2 System Variants Study
We vary different design decisions, and experiment for seven interaction rounds.9 We experiment with five systems: (a) Full: our full approach described in Section 5; (b) Pos-Only: use only examples with positive labels y = +1; (c) TC-Only: ignore the feedback questions; instead, if the user completes the task according to our task success measure, we add positive examples with both the system plan and the user execution, otherwise we add a negative example using the system plan; (d) No-Ensemble: train and deploy a single model each round, starting from an initial model randomly sampled from those we use for Full; and (e) Fine-Tuning: train model parameters θr+1 on the newly collected data for N epochs, starting from θr, avoiding overfitting with rehearsal (Rebuffi et al., 2017; Hawkins et al., 2020a); in rehearsal, half the examples in each batch are sampled randomly from the previously collected datasets (see the sketch below). Except for the specified variations, the systems are identical. We do not deploy a system ablating IPS, because we observe that training with negative examples without IPS results in a largely unusable system.
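A minimal sketch of the rehearsal scheme used by the Fine-Tuning variant, under the assumption that examples are drawn uniformly at random from the new and previously collected data.

```python
import random


def rehearsal_batches(new_data, old_data, batch_size, n_batches):
    """Each fine-tuning batch mixes half newly collected examples with half
    examples drawn from the previously collected datasets."""
    half = batch_size // 2
    for _ in range(n_batches):
        batch = random.sample(new_data, half) + \
                random.sample(old_data, batch_size - half)
        random.shuffle(batch)
        yield batch
```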
We collect a total of 63,189 instructions across all systems, with 3,173 interactions. Each round includes 453.2 interactions on average. The total cost is $7,165. All systems are used concurrently in each round, including re-deploying Full again starting from initialization. Figure 6 shows the results. Despite some differences between the system variants, our method is largely robust to variations in learning design decisions.
All systems achieve comparable improvements in task completion rate, except for Pos-Only, which slightly underperforms. We observe a faster decrease in the vocabulary size and instruction length for Pos-Only, which does not use negative examples. This is possibly because the loss from negative examples encourages a more uniform generation distribution, potentially slowing down the overall trend of making the generation distribution more peaked. TC-Only, which ignores the answers to the user feedback questions when constructing the dataset, shows fewer positive responses to the perceived correctness feedback, although task completion remains comparable.
We observe that using a single (No-Ensemble) model rather than an ensemble leads to limited difference in overall performance. However, because of the challenge of identifying a good automated metric to stop training, the performance of models following training varies significantly. This can lead to deploying a bad model, which provides users with a poor experience. Using an ensemble of models incurs higher computational cost, but makes such a worst-case scenario less likely. For example, in our long-term experiment, the maximum task completion performance gap we observe between the best and worst models in each round is 13%.
Finally, we observe that fine-tuning (Fine-Tuning) works as well as our re-training approach (Full), potentially with a more stable vocabulary size. This is in contrast to our initial experiments, which showed it is harder to get consistent improvements through fine-tuning. While the fine-tuning process is harder to design, because it requires choosing the fine-tuning procedure (e.g., rehearsal (Robins, 1995) or KL regularization (Yu et al., 2013)) and carefully optimizing additional hyperparameters, it can work just as well as re-training. Because fine-tuning is faster to train between rounds, it may be preferable in future work.
7.3 Comparison to Supervised Learning
We also separately study the learning trends of our method compared to training on an equivalent amount of supervised WOZ data. Supervised data is fundamentally different from our bandit data for two main reasons: (a) it is significantly costlier, because it requires a dedicated instruction-writing effort, whereas our data arises naturally from the system’s interaction with users during deployment; and (b) it provides per-token labels, whereas our data includes only utterance-level binary labels. For the supervised system, after each round, we expand the dataset by randomly drawing an equivalent amount of additional data from the complete dataset of Suhr et al. (2019), which includes 19,112 examples from 960 interactions.10 This dataset allows for seven rounds. We concurrently deploy a no-ensemble variant of our continual learning system. We collect a total of 22,216 instructions across both systems, with 1,166 interactions. This experiment’s total cost is $2,230.
Figure 7 shows our continual learning system consistently outperforms this supervised alternative in overall task completion rate. There are two potential explanations for this gap. First, the data our approach uses is made of examples the system is likely to generate, potentially providing a more effective learning signal. Second, there is a difference between the plans of human leaders and our planner. Our training is better suited to adapt to how the complete system is designed, whereas training on human-annotated data is bound to suffer from a distribution shift. However, the continual learning system did not consistently outperform the supervised alternative on 2-card and 3-card instructions, especially at early rounds. This is likely because the continual learning system generates few positive examples for more complex system plans (i.e., 2-card or 3-card) at earlier rounds. At later rounds, as the system improves, we observe more positive examples for such plans, creating an accelerating effect of improvement, which is best observed in our long-term experiment (Figure 5).
8 Related Work
Learning for instruction generation has been studied using supervised methods, with examples of task specifications (i.e., contexts) paired with human-written instructions (e.g., Daniele et al., 2016; Narayan-Chen et al., 2019), including to improve instruction following (Fried et al., 2018; Tan et al., 2019). We focus on continually learning by observing users executing generated instructions. This reduces annotation needs, and delegates much of the learning to interaction with users during system deployment. Language generation in context was also studied in scenarios that are not explicitly instructional, but aim to elicit specific behavior, such as negotiation games (e.g., Lewis et al., 2017) and referring expression generation (e.g., Dale and Reiter, 1995).
Gatt and Krahmer (2017) survey existing work on language generation, including using rule-based methods. Similar to our approach, some rule-based methods were evaluated with human followers in situated environments using task success (e.g., Koller et al., 2010; Janarthanam and Lemon, 2011). Such methods are accurate and reliable, but are limited to pre-specified rules and remain static following development. Our focus is on studying the potential for learning by observing human behavior. The two approaches can be combined, for example by using rule-based methods to generate initialization data for our approach.
Bandit learning has been studied with simulated user ratings for machine translation (Nguyen et al., 2017; Lawrence et al., 2017; Kreutzer et al., 2017) and semantic parsing (Lawrence and Riezler, 2018). We learn from real users, similar to recent studies in machine translation (Kreutzer et al., 2018a, b). In general, such learning assumes users can judge the system output, for example via proficiency in the language they wish to translate to. Our learning signal does not require such expertise, and is available naturally from the interaction.
Explicit human feedback has also been incorporated into reinforcement learning methods (Knox and Stone, 2009; Pilarski et al., 2011; Daniel et al., 2015; Mathewson and Pilarski, 2016; Warnell et al., 2018; MacGlashan et al., 2017; Arumugam et al., 2019), including in the context of dialogue system learning (Liu et al., 2018). Jaques et al. (2020) study forming a reward from implicit feedback for non-task-oriented dialogue language generation, by training multiple models to detect linguistic signals, such as sentiment and lexical overlap, that correlate with explicit user feedback. Learning from users has also been studied by asking users to rank system outputs (e.g., Wilson et al., 2012; Christiano et al., 2017), including for instruction following (Wang et al., 2016) and summarization (Stiennon et al., 2020). Unlike our approach, such ranking requires knowing the true system intent, and is not part of the system’s normal operation (i.e., instructing users in our case).
Incorporating human users into learning is related to active learning (Settles, 2009), where a policy selects examples for an oracle to label during learning. Unlike common active learning scenarios we do not select examples from a static underlying distribution (i.e., a training set) for annotation, but generate examples with the learned model. This is similar to query synthesis active learning (Angluin, 1988), where examples are generated for annotation, rather than being selected from a set of unannotated examples. A more significant difference is that active learning methods solicit model output annotations by presenting an oracle with model inputs. In contrast, our approach exposes users to model outputs (i.e., generated instructions). It does not solicit written instructions, as would be expected if requesting labels. We also do not show model inputs (i.e., plans) to users. Finally, our model interacts with users during system operation, while completing its task. It does not require oracle annotators.
Language learning from behavioral signals has been studied in the cognitive science and psychology literature.11 Krauss and Weinheimer (1966) study two types of feedback in human studies: concurrent linguistic feedback and behavioral intent confirmation, and show how both influence linguistic adaptation in an interaction over time. Studies of reference games reproduced the effect of confirmation feedback, showing that successful intent communication reinforces convention formation in the form of shorter references (Clark and Wilkes-Gibbs, 1986; Hawkins et al., 2020b). Our learning signal is a type of confirmation feedback. However, our interaction procures and makes use of more complex learning signals than a simple binary intent communication success, by using the path the listener takes in response to the generated instruction as an alternative intent when constructing data for learning (Section 5.2).12
9 Discussion
We propose a methodology to continually improve an instruction generation model by observing human users executing natural language instructions, and demonstrate its efficacy within a collaborative instruction following scenario. Our study shows that observation of user behavior is an informative signal for generating language to relay instructional intent. To the best of our knowledge, this type of learning signal has not been studied before. This learning setting facilitates continual learning through interaction with users, and is particularly compelling for interactions with collaborative agents, including robots and software agents. Such agents are likely to operate in constantly changing environments (e.g., robots in homes), where continual learning is necessary to adjust to changes. Our continual learning approach also provides systems the flexibility to co-adapt to human users, who are likely to change preferences and behaviors in response to system behavior.
Our experiments demonstrate the learning process is robust to various learning and process design choices. However, they also show it is accompanied by a reduction of language complexity, including reducing the effective vocabulary and sentence length. While much of the decrease in the effective vocabulary size throughout the system lifetime relates to generating fewer erroneous phrases, it also reduces the language diversity and descriptiveness. Our experiments show that this trend can be slowed down by using negative examples, and appears to be less pronounced when using fine-tuning. The combination of this decrease with the preference for shorter instructions makes it difficult for the system to describe longer, complex trajectories. Qualitatively, we observe this open problem is responsible for a significant portion of the remaining errors. An important direction for future work is experimenting with directly encouraging more diverse language. This can be combined with approaches that allow for introducing new word types, which is unlikely in our approach, even though it uses sub-word tokenization. A potential direction in this vein is combining active learning to solicit human-written oracle instructions for plans the system fails to communicate.
Our work highlights several other directions for future work. There is a strong need for a reliable automated metric to evaluate instruction generation. In the absence of such a metric, we use a simple, but likely sub-optimal, stopping criterion for learning. Beyond the learning signal we explored in our experiments, there are additional potential cues available during interaction, for example using a continuous-valued similarity between the system intent and the user execution, modeling follower quality to discount the learning signal from interactions with bad followers, or weighting the feedback questions differently for a more nuanced reward.
Finally, the decrease in utterance length and vocabulary size mirrors similar trends observed in studies of human communication (Clark and Wilkes-Gibbs, 1986; Hawkins et al., 2020b). This illustrates the potential of continual learning systems to reflect the dynamics of language change human participants expect in natural language interactions. Observations of human learning also indicate the potential of integrating our approach with conversational self-repair (Clark, 2020) and partner reformulation (Clark, 2018), both important components of child language acquisition that likely provide better credit assignment for learning compared to our binary bandit signal.
Acknowledgments
This research was supported by ARO W911NF-21-1-0106, a Google Focused Award, the Masason Foundation, a Facebook Fellowship, and NSF under grants no. 1750499 and DGE-1650441. We thank Jonathan Chang, Sasha Rush, the Cornell NLP Group, Robert Hawkins, Dipendra Misra, and John Langford for discussion and comments; Suyi Diao for Unity development; Anna Effenberger for code to compute syntax complexity; Ge Gao, Koji Shiono, and Takayuki Kojima for feedback on our interaction platform; and the crowdsourcing workers for participating in our data collection. Finally, we thank the action editor and the anonymous reviewers for detailed comments.
Notes
1. In Suhr et al. (2019), only the selected cards disappear. We introduced this modification to minimize inter-turn effects for the follower (i.e., memorizing card locations).
2. For simplicity, we do not index the game time step.
3. For instructions that target cards, we require getting the card selection right, and ignore the follower position. For instructions that require waiting (e.g., hold still), we require the position to remain the same, but allow orientation deviation.
4. Pilot studies showed re-training to be more stable than fine-tuning given new data, and we conduct the majority of our experiments with this method. However, we also observe that our process is overall robust to the initially observed instabilities of fine-tuning (Section 7).
5. An alternative, and important direction for future study, is to add IPS to all examples, but clip it at a certain maximal value, similar to clipping in PPO (Schulman et al., 2017).
6. We use POT (Flamary et al., 2021) to compute EMD.
7. 0-card plans target no cards (e.g., hold still).
8. We use a t-test (α = 0.01) comparing rounds 1 and 14.
9. This study is similar to ablation analysis, but aims to study different learning design decisions. Full-fledged repetitive ablations to identify the ideal system design are particularly challenging in this work, both because of experiment costs and the complex dynamics of interacting with users.
10. Interactions with the supervised system are not used for learning, but only for evaluation.
11. This review is not comprehensive, and only aims to highlight the relation to problems studied in related disciplines.