Draw Me a Flower: Processing and Grounding Abstraction in Natural Language

Abstract
Abstraction is a core tenet of human cognition and communication. When composing natural language instructions, humans naturally evoke abstraction to convey complex procedures in an efficient and concise way. Yet, interpreting and grounding abstraction expressed in NL has not yet been systematically studied in NLP, with no accepted benchmarks specifically eliciting abstraction in NL. In this work, we set the foundation for a systematic study of processing and grounding abstraction in NLP. First, we deliver a novel abstraction elicitation method and present Hexagons, a 2D instruction-following game. Using Hexagons we collected over 4k naturally occurring visually-grounded instructions rich with diverse types of abstractions. From these data, we derive an instruction-to-execution task and assess different types of neural models. Our results show that contemporary models and modeling practices are substantially inferior to human performance, and that model performance is inversely correlated with the level of abstraction, showing less satisfying performance on higher levels of abstraction. These findings are consistent across models and setups, confirming that abstraction is a challenging phenomenon deserving further attention and study in NLP/AI research.


Introduction
As human-computer interaction in natural language (NL) becomes more and more pervasive, e.g., via smart devices and chatbots, a cognitive phenomenon known as abstraction, which is prevalent in human communication and cognition, is taking a central role in the way users communicate their intentions and needs to artificial agents.
When communicating in NL, with a human or an artificial agent, a human may issue a request for a single action such as "send an email" or "set my alarm". However, when engaging with more complex tasks that require multiple actions, humans often evoke abstraction in order to communicate their intentions in an economic-yet-precise way. Examples of evoking abstraction when issuing a complex request that consists of multiple actions may be: "Schedule a group meeting every other Wednesday until the end of the year, unless there are holidays." For an autonomous car, an envisioned request might be "Circle the block looking for a shady parking spot, park at the first spot you see. Try this 4 times, and one more time allowing for non-shady parking. If no luck, try the adjacent block." In fact, even the individual request "send an email" is in itself an abstraction over a sequence of multiple individual actions such as "open your inbox, click new mail, select a recipient," etc.
Figure 1: On the left are instructions for drawing the target image (bottom-right), paired with their grounding on the HEXAGONS board on the right. Italics mark expressions of abstraction as referring to, e.g., objects (a flower), iterations, and conditions.
Abstraction is defined by Wing (2011) as "[letting] one object stand for many. It is used to capture essential properties common to a set of objects while hiding irrelevant distinctions among them". In the calendar example, a single utterance references multiple meetings on multiple days. Likewise, in the autonomous car example, the speaker evokes some sort of control structure in order to iterate a process several times. Abstraction so construed is both critical and pervasive in NL. Referring to multiple instances at once may be done by means of the shape they form, via a process that iterates them, or via an action/condition applied to select or manipulate them. To illustrate this, Figure 1 showcases an example of a natural language procedure for drawing a target image (the bottom-right image) on an empty board. The instructions start with the construction of an object, a red flower, covering 6 tiles. Then, the Instructor prescribes multiple flower patterns via repeat actions that realize a nested 'loop'. Finally, the Instructor states the color of the flower centers via a 'condition' (green for red, blue for yellow). This example goes to show both the essence and power of abstraction in NL; here, merely four NL utterances suffice to prescribe a complex image over a 180-tile board.
Despite its importance and widespread use, detecting and grounding abstraction in NL has not yet been systematically studied in NLP. Previous studies on grounding instructions target linguistic phenomena such as anaphora and ellipsis (Long et al., 2016), spatial relations (Jayannavar et al., 2020; Bisk et al., 2016a) and referring expressions (Haber et al., 2019) but do not specifically elicit abstraction. In studies on navigation (Anderson et al., 2018; Chevalier-Boisvert et al., 2018; Misra et al., 2018) abstract statements are also sparse: instructions often refer to specifics of the environment rather than abstract phenomena.
In this work we aim to add a new facet to the study of natural language understanding, that of interpreting abstraction. We set out to provide a foundation for systematically studying the phenomenon of processing and grounding diverse levels of abstraction found in naturally-occurring NL utterances. Achieving this goal is far from trivial. As standard in NLP, we would first need to establish an appropriate dataset for studying this phenomenon. Specifically, we'd like to collect naturally-occurring data that manifest abstraction. But how can we purposefully request the presence of abstraction in naturally-occurring data?
To overcome this challenge, we develop an abstraction elicitation protocol by adopting practices from STEM education, specifically from Computational Thinking (CT) research (Wing, 2011; Grover and Pea, 2013). The idea, in a nutshell, is to develop visual stimuli that evoke, and thus cultivate (and elicit) higher-order thinking, which is then narrated in NL. We implement the proposed protocol in a novel HEXAGONS game, a situated collaborative game where an Instructor provides instructions that should be grounded and executed in a virtual world (Long et al., 2016; Bisk et al., 2016a; Kim et al., 2019; Jayannavar et al., 2020). In contrast to previous studies, we use practices from CT research to design visual triggers of abstraction. Hence, on the one hand, we implicitly call for the presence of abstraction in the instructions, but on the other hand, we provide naturally elicited abstract instructions from workers not possessing formal knowledge of what abstraction is.
Using the HEXAGONS game and the task stimuli we collected over 4k human instructions manifesting a variety of formal abstractions (objects, control structures and functions) expressed naturally and intuitively in NL and grounded on the HEXAGONS board. To showcase how this data may be used for studying abstraction processing in NL, we derive an instruction-to-execution task, where the model needs to ground and execute NL instructions on the HEXAGONS board. We propose a naïve rule-based baseline as well as two neural modeling alternatives, one based on classification and one on generation, and assess their performance on the elicited abstraction data.
Our experiments show that, while our models perform better than the naïve rule-based baseline, they are substantially inferior to human performance. Moreover, we show that models' performance is inversely correlated with the level of abstraction, that is, the models execute concrete instructions quite well, but perform poorly on higher-level abstractions. This holds across different models, setups, task conditions, amount and type of training data, and board contexts. We further observe that the instruction's history is another important factor in models' performance; the longer the history, the better the performance.
The contribution of this paper is thus manifold. First, we bring to the fore of NLP research a critical aspect of human-computer communication, namely, the ability to detect, process and ground abstraction in natural language. Next, we devise a novel abstraction elicitation methodology and deliver the HEXAGONS dataset as a novel benchmark to explore the automatic processing of different levels of abstraction. This dataset may also serve broader communities such as AI researchers, linguists, cognitive psychologists and STEM educators in the study of human processing of abstraction. Finally, for the instruction-to-execution task we derive from the HEXAGONS data, we show experimental evidence that unequivocally confirms that abstract instructions in NL are indeed more challenging for current systems to process, and we expose abstraction as an important and challenging dimension for further study in NLP.

The Challenge: Eliciting and Processing Abstraction in NL
Abstraction is a cognitive phenomenon related to diverse human activities such as learning, decision making, and behavior regulation (cf. Burgoon et al. 2013). In the context of human-computer interaction, and in general in human problem-solving, abstraction is said to be one of a set of cognitive skills known as Computational Thinking (CT) skills, defined by Cuny, Snyder, and Wing (2010) as "the thought processes involved in formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an information-processing agent." Abstraction in this context refers to a process of information reduction (Burgoon et al., 2013), where multiple instances are conceived as arising from a single object, "consisting of their shared properties while discarding irrelevant distinctions" (Wing, 2011).
Abstraction is considered by many as the most important CT skill, allowing the human to think in terms of objects and concentrate on their essential features, while ignoring irrelevant details (Dijkstra, 1972; Denning et al., 1989; Koppelman and Van Dijk, 2010; Wing, 2011, 2017). Thus, abstraction enables speakers to be more precise and less error-prone (Dijkstra, Accessed 1 May 2021; Haberman, 2004) and to design more concise, elegant and efficient solutions (Ginat and Blau, 2017).
To illustrate how humans may exhibit different levels of abstraction, consider the simple example in Figure 2, where a human is requested to describe a pattern on a 2D HEXAGONS board. The first (top) NL procedure expresses low-level abstraction; it refers to each occurrence of a half-column as a unique event. This is a repetitive and lengthy procedure. In contrast, the second (bottom) NL procedure refers to all the occurrences of this half-column at once (via 'repeat but alternate'), discarding distinctions related to, e.g., tiles' positions and colors. The result of this abstraction is thus concise, clear and far more efficient.
In a broader sense, in order to express abstraction speakers employ so-called abstraction mechanisms, such as objects, functions and control flow (Koppelman and Van Dijk, 2010), and expertise in using them is considered an important part of humans' CT skills (Grover and Pea, 2013). Such mechanisms are invoked in Figure 1. This kind of communication is not limited to the simple HEXAGONS board used in Figures 1-2; it is relevant in countless other domains, as in the aforementioned calendar and car examples.
Due to the prevalence of abstraction in human communication, models would need to process varied levels of abstraction in order to correctly interpret NL instructions. But in order to successfully develop such models, we have to collect data that systematically reflect such phenomena, and to the best of our knowledge, this has not yet been done in NLP. The challenge of eliciting abstraction is genuine, as we aspire to elicit natural language that reflects authentic human communication of any speaker. But we cannot simply request crowdworkers to employ abstraction mechanisms as they are not familiar with these formal concepts. On the other hand, explicitly teaching them to employ abstraction undermines the naturalness of expression. So, how do we break out of this loop?
Figure 2, first (top) NL procedure:
1. In the first column paint the first 5 tiles downward orange
2. In the first column paint the last 5 tiles downward blue
3. In the 3rd column paint the first 5 tiles blue
...
17. In the 17th column paint the first 5 tiles orange
18. In the 17th column paint the last 5 tiles blue
Figure 2, second (bottom) NL procedure:
1. In first column on left of grid, start with 5 orange hexagons and complete it with 5 blue hexagons. Leave column 2 from left white.
2. Repeat these colors and this 5:5 scheme every other column but alternate the two colors so that each colored column is the opposite order of the one on either side of it.
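The gap between the two procedures of Figure 2 resembles the gap between listing primitive actions one by one and writing a loop. As a minimal sketch, the 'repeat but alternate' instruction can be read as the loop below. The function name, the 0-based (row, column) indexing and the dictionary representation are assumptions made for illustration only; the board size of 10 rows by 18 columns is the one reported for the HEXAGONS board.

```python
from typing import Dict, Tuple

Position = Tuple[int, int]  # (row, column), 0-based; an assumed convention


def alternating_half_columns(n_rows: int = 10, n_cols: int = 18,
                             colors: Tuple[str, str] = ("orange", "blue")
                             ) -> Dict[Position, str]:
    """Expand 'repeat every other column but alternate the two colors'."""
    painted: Dict[Position, str] = {}
    top, bottom = colors
    for col in range(0, n_cols, 2):          # every other column, from the first
        for row in range(n_rows):
            # upper half gets one color, lower half the other
            painted[(row, col)] = top if row < n_rows // 2 else bottom
        top, bottom = bottom, top            # alternate for the next colored column
    return painted


pattern = alternating_half_columns()
# Two NL instructions stand for this many primitive paint actions:
print(len(pattern))
```

Under these assumptions the two-step abstract procedure expands to 90 individual tile assignments, which is the economy that the lengthier first procedure forgoes.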

The Proposed Methodology
In this work we are interested in creating situations where humans express abstraction in NL spontaneously and naturally, towards learning models that can interpret such abstractions.To achieve this goal, we turn to the vast research in human learning and STEM education, on cultivating (and thus eliciting) higher-order CT skills in humans (Cuny et al., 2010;Shute et al., 2017).Eliciting such higher-order thinking requires a careful task design, drawing on literature on the development of instruments that probe and assess humans' CT skills (Ructtinger and Stevens, 2017;Relkin and Bers, 2019;Basu et al., 2021).
Our designed task stimuli are carefully crafted to evoke abstraction without explicitly requesting workers to do so. Our elicitation methodology extends a recent trend in grounded semantic parsing, where players engage in a referential game (situated collaborative scenarios in terms of Jayannavar et al. (2020)) where an Instructor provides instructions that should be grounded and executed in a (simulated) world (Long et al., 2016; Bisk et al., 2016a; Kim et al., 2019; Jayannavar et al., 2020). The remainder of this section elaborates on the virtual environment we devise, and the task stimuli we design for elicitation.

The HEXAGONS App and Game
In order to collect NL descriptions which express diverse abstraction levels, we design an online drawing app that enables users to construct increasingly complex images on a HEXAGONS board, a two-dimensional board paved with hexagonal tiles, of the kind illustrated in Figures 1-2.
The HEXAGONS board contains 18 columns and 10 rows, and the HEXAGONS App UI provides a drawing interface in which a user may paint tiles using a palette of eight colors.
In order to elicit NL instructions, we extended the app with an instruction-following game where a human agent is asked to describe the construction process of a given image (e.g., Figure 3) to a different user of the app, who has access to a similar but blank HEXAGONS board. The game has two different modes. The first mode is called Description, where a user is given an image from a pre-defined pool and has to provide instructions in NL on how to construct the image. Every line break in the textual description initiates a new instruction. The second mode is called Execution, where a user receives a sequence of instructions one by one, and needs to execute them sequentially to reconstruct the target image on the board.
We refer to each pair of an instruction and a corresponding execution as a drawing step.We call the sequence of drawing steps composing the full image a drawing procedure.

The Task Stimuli
The HEXAGONS app assumes a single primitive action that corresponds to the two-place predicate paint(position, color), which specifies a color for a specific tile among the 180 hexagonal tiles. The key idea is to ask Instructors to construct a complex image which manifests some regularity.
The regularity is intended to encourage Instructors to seek more efficient alternatives to the primitive-level operations, which then evokes CT skills such as decomposition and abstraction in order to deliver an economic and efficient construction.
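To make the primitive concrete, the following is a minimal sketch of a board with a single paint(position, color) action. It is an illustration rather than the app's actual implementation: the palette contents, the 0-based (row, column) convention and the helper paint_column are assumptions, while the board dimensions follow the description above.

```python
from typing import Dict, Tuple

N_ROWS, N_COLS = 10, 18  # board size stated in the text (180 tiles)
# Hypothetical palette: the text states eight colors without listing them.
PALETTE = {"red", "blue", "green", "yellow", "orange", "purple", "black", "white"}

Position = Tuple[int, int]   # (row, column), 0-based; an assumed convention
Board = Dict[Position, str]  # only painted tiles are stored


def paint(board: Board, position: Position, color: str) -> Board:
    """Apply the single primitive paint(position, color); returns a new board."""
    row, col = position
    if not (0 <= row < N_ROWS and 0 <= col < N_COLS):
        raise ValueError(f"position {position} is off the board")
    if color not in PALETTE:
        raise ValueError(f"unknown color {color!r}")
    new_board = dict(board)
    new_board[position] = color
    return new_board


def paint_column(board: Board, col: int, color: str) -> Board:
    """A hypothetical 'object-level' instruction expands into N_ROWS primitives."""
    for row in range(N_ROWS):
        board = paint(board, (row, col), color)
    return board


board = paint_column({}, 0, "green")  # e.g., "paint the first column green"
print(len(board), "of", N_ROWS * N_COLS, "tiles painted")
```

The point of the sketch is the ratio: one instruction at the level of an object (a column) stands for ten applications of the primitive.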
In what follows we briefly elaborate, for each abstraction mechanism that we target, how we design the form on the HEXAGONS board that potentially evokes this mechanism.
• Objects. Users may refer to a set of instances at once by means of the form they make (line, circle, triangle, etc.), discarding other details such as the position and color of individual tiles. Objects may be defined in one place, and referred to elsewhere (as in Figure 1).
• Bounded Iterations ('For' loops). To elicit bounded loops we design images that manifest periodic replication of an object. For example, Figure 3(a) shows a replication of a flower pattern 12 times.
• Conditional Iterations ('While' loops). To elicit conditional loops we design images that manifest a periodic replication of an object controlled by a certain condition. For example, in Figure 3(b), to replicate the lines with different lengths one may use the condition 'extend the lines out up to the boundaries of the board.'
• Conditional Statements ('if-then'). We design images that manifest random replication of steady variants of an object, where employing a condition enables users to capture all variants at once. For example, to capture the two variants of five-tile-long lines in Figure 3(c), one cannot simply use repetition, as the replication is not periodic. However, noticing that the red and blue 'tops' go with the green and yellow 'tails' respectively enables a user to achieve an economic description using a condition on the 'top' tiles.
• Functions. We design images that manifest replication of objects in different colors or positions, to encourage defining a 'block' and then applying it with different parameters. Moreover, we use a particular set of visual functions, of symmetrical operations, and particularly reflection and rotation (e.g., Figures 3(d,e)).
• Recursion. This is a unique type of function which is challenging to evoke. We approach this challenge by designing three types of stimuli: growing patterns, spirals, and self-similarity patterns, e.g., fractals (Figure 3(f)).
We note that the association between images and targeted abstraction mechanisms has been defined a priori.However, in effect, users may generate instructions with no abstraction or use a different abstraction mechanism to achieve the same result.All in all, since users (and in particular, crowdworkers) aim to be efficient, they tend (even if not explicitly told) to employ abstractions.

Data Collection and Curation
For the data collection we employed English-speaking workers from Amazon Mechanical Turk (MTurk) and adopted the methodology of controlled crowdsourcing (Roit et al., 2020; Pyatkin et al., 2020) to ensure a high-quality corpus. Specifically, the process includes four stages: pilot, recruitment, annotation and consolidation.
We collect drawing procedures for the task stimuli in a process that comprises two steps: (1) In the Collection phase an Instructor writes instructions for drawing a given image, step by step, via the Description mode of the game. Following this, the Instructor aligns each instruction she has written to its respective execution on the board via the Execution mode of the game. The result of this process is a drawing procedure where instructions are coupled with their actions grounded on the HEXAGONS board.
(2) In the Verification phase each drawing procedure from the Collection phase is given to two Verifiers who do not have access to the original image.The Verifiers are shown the instructions one by one in Execution mode.
Their task is to execute the instructions step by step until reconstructing the full image.
This two-phase process is intended to reveal faulty instructions and to ensure the quality and executability of the collected procedures, by making the Instructor's intentions explicit (step 1), and by exposing disagreements with Verifiers (step 2).

Pilot and Recruitment
We checked the flow, clarity and feasibility of the data collection in a pilot study, followed by two separate rounds of recruiting Instructors and Verifiers.
To recruit Instructors, we screened workers by examining understanding and engagement in the Instructor task. Appropriate candidates had to complete three Instructor tasks, that is, repeat the Collection phase with three randomly selected images from our pool. In no stage did we formally teach workers what abstraction is. Instead we engaged the workers in several tasks and encouraged them to write their instructions efficiently and to avoid tiresome repetitions of the primitive paint. Out of the 34 candidates, we assembled a group of 28 Instructors exhibiting diverse levels of abstraction, out of which 24 took an active role during the annotation procedure.
We follow a separate process in recruiting Verifiers using the Verification phase, instructing candidates to be as accurate as possible while executing drawing procedures.Out of 27 candidates, we recruited 16 workers exhibiting the most precise work.The groups of Instructors and Verifiers are disjoint, so drawing procedures are verified by workers other than those who generate them.
Annotation Procedure
The annotation procedure is based on a Generation-Validation cycle, which is similar to previous protocols for constructing large-scale corpora by untrained crowdworkers (FitzGerald et al., 2018).
Specifically, based on images from our crafted stimuli, drawing procedures are first generated and verified by the Instructors in the Collection phase, and then each procedure is given to two additional Verifiers, who work through the Verification phase to check the understandability and executability of the procedures.
The annotation process itself consisted of two rounds. In the first round we gave each image from our pool to three Instructors in order to generate three different drawing procedures for each image. Each of the generated procedures was verified as usual. In the second round, we presented Instructors with the opportunity to draw new images on a blank HEXAGONS board. The goal is to scale up the image pool with interesting compositions using crowdsourcing. Indeed, the collected images in this round reflect similar rationale to our own set of images (e.g., Figure 4(a-b)) yet demonstrate more complex interactions between structures and patterns (Figure 4(c-d)), with both abstract and figurative images (Figure 4(e-f)). This new pool of images then passed through the construction and verification phases as usual.
Consolidation
Having collected the raw dataset, we manually inspected all drawing procedures that had at least one disagreement between an Instructor and each of the Verifiers. Then, we developed a protocol to (i) detect Instructors' errors, (ii) classify the types of errors, and (iii) fix the Instructor execution. The protocol was applied to the data by the first two authors. The reported agreement on error classification was 0.95 and 0.98 Cohen's Kappa for the first two tasks and 95% agreement for the last one. Following this protocol, we detected cases where the Instructor's execution is not properly aligned with the instruction, and manually corrected the execution to match the instruction. Types of errors include: over-/under-execution, where the Instructor executes more/less than the instructions require; miscounting of positions on the board; error propagation from previous steps; and others such as using wrong colors. All in all we inspected 1461 drawing steps, out of which 20.8% were identified as having Instructors' errors and subsequently were manually corrected. This process results in data with fully aligned instruction-execution pairs for each of the instructions in each of the drawing procedures.

Annotation Costs
We used Amazon Mechanical Turk to recruit English-speaking workers for this study. Participants in the data collection rounds were paid higher rates than in Pilot and Recruitment. The payments for Collection and Verification were $1.5 and $0.5 respectively.

The HEXAGONS Dataset
Our finalized dataset is the collection of all drawing procedures, composed of instructions and their aligned executions collected in both annotation rounds and some from the recruitment stage, after having passed our quality assurance and consolidation process.In total, we collected 620 drawing procedures yielding 4177 drawing steps, that is, 4177 instructions with aligned executions, circa 100K tokens.Table 1 shows the data statistics.
Quantitative Analysis
In order to quantitatively evaluate the resulting dataset, we define Board-Based and Action-Based metrics that compare two different executions of an instruction.
Let us define a function $f(x)$ that accepts a board state $x$, and translates it into a set of elements $\langle \mathrm{position}, \mathrm{color} \rangle$ that indicate the colored tiles. Now let $b$ and $b'$ be two states of the board, where $b$ and $b'$ are considered gold and hypothesis respectively.
Based on the $f(x)$ function, we can define Precision and Recall as in Equations (1) and (2), respectively, where precision is the percentage of tiles correctly colored from all those colored in the hypothesis, and recall is the percentage of tiles correctly colored from all those colored in gold. We then define F1 as usual as the harmonic mean of the two (see illustration in Figure 5).
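Equations (1) and (2) are not reproduced in this text. Under the verbal definitions above, and writing $b$ for the gold state and $b'$ for the hypothesis (the prime is an assumed notation, since the mark distinguishing the two states is not legible here), they would presumably take the following form:

```latex
\begin{align}
\mathrm{Precision}(b, b') &= \frac{\lvert f(b) \cap f(b') \rvert}{\lvert f(b') \rvert} \tag{1} \\
\mathrm{Recall}(b, b') &= \frac{\lvert f(b) \cap f(b') \rvert}{\lvert f(b) \rvert} \tag{2} \\
F_1(b, b') &= \frac{2 \cdot \mathrm{Precision}(b, b') \cdot \mathrm{Recall}(b, b')}{\mathrm{Precision}(b, b') + \mathrm{Recall}(b, b')}
\end{align}
```

The numerator counts the ⟨position, color⟩ elements shared by both executions, so a tile painted in the right place but with the wrong color does not count as correct.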
The metrics we report come in two flavours. In Board-Based Metrics, $f(b)$ picks up all colored tiles in the resulting image after each step. In Action-Based Metrics, $f(b)$ focuses only on the tiles that changed color in the current step, that is, on the instruction's denotation (rather than the entire board state).
For assessing the quality of the dataset, we compare, for each drawing step, the board states (or actions) of the Instructor, considered as gold, to the board states (or actions) of a Verifier, considered the hypothesis. We report the Mean F1 and Exact Match (EM) averaged over the entire dataset. In addition we report the Max(/Min) Mean F1 and EM, which take into account only the higher(/lower)-scoring Verifier for each step.
Figure 5: In the current state (second row) the intersection between Gold and Prediction is the three red tiles. Therefore, recall and precision are 3/5 and 3/7, respectively, and Board-Based F1 is 3/6 = 0.5. There are three actions taken in Gold and five in Prediction (third row) with one red tile in the intersection of both sets.
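The two metric flavours, and the counts quoted for Figure 5, can be checked with a short sketch. The tile layout below is hypothetical and chosen only to match those counts (five gold tiles, seven predicted tiles, three shared; three gold actions, five predicted actions, one shared). The function names, the dictionary board representation and the treatment of two empty sets as a perfect match are likewise assumptions.

```python
from typing import Dict, Set, Tuple

Position = Tuple[int, int]
Tile = Tuple[Position, str]      # (position, color)
Board = Dict[Position, str]      # only colored tiles are stored


def f_board(board: Board) -> Set[Tile]:
    """Board-Based view: every colored tile after the step."""
    return set(board.items())


def f_action(before: Board, after: Board) -> Set[Tile]:
    """Action-Based view: tiles whose color changed in this step.
    Simplification: un-coloring a tile is not represented here."""
    return {(pos, color) for pos, color in after.items() if before.get(pos) != color}


def prf(gold: Set[Tile], hyp: Set[Tile]) -> Tuple[float, float, float]:
    if not gold and not hyp:     # assumption: two empty sets match perfectly
        return 1.0, 1.0, 1.0
    overlap = len(gold & hyp)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision, recall = overlap / len(hyp), overlap / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical layout matching the counts quoted for Figure 5.
previous = {(0, 0): "red", (0, 1): "red"}
gold = {**previous, (0, 2): "red", (1, 0): "blue", (1, 1): "blue"}
pred = {**previous, (0, 2): "red", (2, 0): "green", (2, 1): "green",
        (2, 2): "green", (2, 3): "green"}

board_p, board_r, board_f1 = prf(f_board(gold), f_board(pred))
action_p, action_r, action_f1 = prf(f_action(previous, gold), f_action(previous, pred))
print(round(board_f1, 3), round(action_f1, 3))
```

With these counts the Board-Based F1 is 0.5, as stated in the caption, while the Action-Based view is stricter because correctly inherited tiles from earlier steps no longer contribute to the overlap.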
Table 2 shows the dataset evaluation. The EM metric is stricter than Mean F1. The Max Mean-F1 score of circa 96 indicates that the images can most of the time be reproduced by at least one human following the instructions, despite the complexity of the instructions.
Qualitative Analysis: Overall Phenomena
In order to understand the distribution of the elicited NL phenomena in our dataset we sampled 24 drawing procedures with a total of 194 drawing steps (instructions), preserving the internal distribution of stimuli types and annotation rounds. We then manually categorized utterances according to abstract, linguistic and spatial phenomena, as summarized in Table 3.

Qualitative Analysis: Levels of Abstraction
To probe further into the levels of abstraction that are manifested in the dataset, the first two authors assigned each instruction in the dev set to one of four levels of abstraction that we identified.
• No abstraction (0): In this level Instructors generate concrete instructions which show no abstraction.The instructions specify the coordinates and colors to be painted, in an absolute (e.g., "Paint the third hexagon in the sixth column green") or a relative (e.g., "Paint the tile below this tile red") fashion.
• Low-Level Abstraction (1): In this level Instructors refer to a collection of tiles as a single object by means of the topographic shape they form, in cases where they form vertical or diagonal lines, which are endemic to the HEXAGONS board (e.g., "paint the first column from the left green", "connect a diagonal line between the two tiles you just painted").
• Mid-Level Abstraction (2): ... an abstraction mechanism on multiple tiles (e.g., first step in Figure 1).
• High-Level Abstraction (3): In this level Instructors use diverse abstraction mechanisms (Table 4) applied to multiple objects which themselves can be complex or abstract (as illustrated in the last three steps in Figure 1).
The annotation to levels was conducted in two stages. In the first stage the dev set was annotated into three levels where the last two levels are combined into a single category (Mid-to-High). In the second stage Mid-to-High cases were split into two levels. The inter-annotator agreement for the first stage is 0.923 Krippendorff's Alpha (Krippendorff, 2004) and for the second stage it is 0.94 Krippendorff's Alpha. For both coefficients we use the weighted scheme for ordinal variables (Gwet, 2015).
Table 5 shows the distribution of abstraction levels we defined within the drawing steps in the dev set.This analysis shows that most of the drawing steps (> 60%) contain abstract instructions, where 50% contain Mid-to-High abstraction level.

Experiments
The Task
Given the HEXAGONS dataset, we aim to devise models that interpret NL instructions and mimic an Executor's role, in order to assess how standard Pre-trained Language Models (PLM) interpret these utterances.
To this end, we define a computational task as follows. Let $D = d_1 \ldots d_n$ be a sequence of NL instructions of a drawing procedure and let $b = t_1 \ldots t_{180}$ be a board state naming all tiles' colors on the board at a given state. We aim to induce a function $f(d_1 \ldots d_n) = b_1 \ldots b_n$ that maps a given sequence of instructions to the sequence of board states that indicate the instructions' denotation on the board. Since such an $f$ is overly complex, we model each drawing procedure as a sequence of instruction-to-execution steps, where given an instruction $d_i$ in a procedure, we seek a model for $g(d_i) = (\{a_{ij}\}_{j=1}^{l_i})$ to predict the $l_i$ denoted actions $a_{i1} \ldots a_{il_i}$, where $a_{ij} = \langle \mathrm{row}_{ij}, \mathrm{column}_{ij}, \mathrm{color}_{ij} \rangle$. Intuitively, this means that in the execution of each drawing step, we classify each tile to a color label or no_action.
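The equivalence between a set of actions and a per-tile labeling can be sketched as follows. The row-major ordering of the 180 tiles and the 0-based indices are assumptions made for the illustration; only the no_action label and the ⟨row, column, color⟩ triple come from the task definition.

```python
from typing import List, Tuple

N_ROWS, N_COLS = 10, 18
NO_ACTION = "no_action"
Action = Tuple[int, int, str]  # (row, column, color), 0-based (assumed)


def actions_to_labels(actions: List[Action]) -> List[str]:
    """Turn a step's denotation into 180 labels, one per tile (row-major)."""
    labels = [NO_ACTION] * (N_ROWS * N_COLS)
    for row, col, color in actions:
        labels[row * N_COLS + col] = color
    return labels


def labels_to_actions(labels: List[str]) -> List[Action]:
    """Inverse mapping: keep only tiles that received a color label."""
    return [(i // N_COLS, i % N_COLS, label)
            for i, label in enumerate(labels) if label != NO_ACTION]


step_actions = [(0, 4, "red"), (1, 0, "green")]
labels = actions_to_labels(step_actions)
print(labels.count(NO_ACTION), labels_to_actions(labels))
```

Most tiles carry no_action in a typical step, so the labeling is heavily skewed towards that class.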
Input Configurations
For each of the models, we experiment with several input settings that differ in the type and extent of context provided:
(i) No-History. The input is only the instruction to be executed (current drawing step).
(ii) 1-Previous. The input contains the instruction to be executed (current step) and one previous instruction.
(iii) Full-History. The input contains the instruction to be executed and all previous instructions in that drawing procedure.
(iv) Oracle Board-state. The input contains the instruction to be executed and the gold board-state that is obtained prior to the current drawing step. No previous instruction history is included.
(v) Predicted Board-state. The input contains the instruction to be executed and the board-state predicted from all steps so far. No previous instruction history is included.
(vi) Full-History + Oracle/Predicted Board-state. A combination of full history (iii) with either (iv) or (v).
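A minimal, model-agnostic sketch of how these configurations could be assembled into a single input string is given below. The delimiter, the function and parameter names, and the choice to place the board-state before the instructions are all assumptions; the text specifies only which pieces of context each configuration contains.

```python
from typing import List, Optional

SEP = " [SEP] "  # hypothetical delimiter


def build_input(instructions: List[str], step: int, history: str = "none",
                board_state: Optional[List[str]] = None) -> str:
    """history: 'none' (i), 'one' (ii) or 'full' (iii).
    board_state: gold (iv) or predicted (v) tile colors before this step;
    combining it with history='full' gives configuration (vi)."""
    if history == "none":
        context: List[str] = []
    elif history == "one":
        context = instructions[max(0, step - 1):step]
    elif history == "full":
        context = instructions[:step]
    else:
        raise ValueError(f"unknown history setting {history!r}")
    text = SEP.join(context + [instructions[step]])
    if board_state is not None:
        # Assumed order: board-state first, then the instructions.
        text = ", ".join(board_state) + SEP + text
    return text


procedure = ["draw a red flower", "repeat it along the row", "paint the centers green"]
print(build_input(procedure, 2, history="one"))
```

Whether the board-state passed in is gold or predicted is decided by the caller, which is the only difference between configurations (iv) and (v) in this sketch.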
Data Splits
We split the dataset into train/dev/test with an 80/10/10 ratio of the drawing procedures (Table 6). Following up on recent practices (Finegan-Dollak et al., 2018; Herzig and Berant, 2020; Goldman et al., 2022) we randomly split the data in a way that avoids shared stimuli between the train and the dev/test sets. Specifically, (i) we make sure that there is no image overlap between the three sets (recall that each image is delivered to at least three Instructors; see Section 4), and (ii) we keep the same distribution of images in terms of the abstraction mechanisms they are designed to elicit (Section 3) as well as the same proportion between the images collected in different annotation rounds.
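Condition (i) amounts to splitting at the level of images rather than procedures. The sketch below illustrates only that condition, under simplifying assumptions: it applies the ratios to images, uses hypothetical identifiers, and does not implement the stratification by abstraction mechanism and annotation round required by condition (ii).

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_image(procedures: List[Tuple[str, str]],
                   ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
                   seed: int = 0) -> Dict[str, List[str]]:
    """procedures: (procedure_id, image_id) pairs.
    All procedures describing the same image land in the same split."""
    by_image: Dict[str, List[str]] = defaultdict(list)
    for proc_id, image_id in procedures:
        by_image[image_id].append(proc_id)
    images = sorted(by_image)
    random.Random(seed).shuffle(images)
    n_train = round(ratios[0] * len(images))
    n_dev = round(ratios[1] * len(images))
    parts = {"train": images[:n_train],
             "dev": images[n_train:n_train + n_dev],
             "test": images[n_train + n_dev:]}
    return {name: [p for img in imgs for p in by_image[img]]
            for name, imgs in parts.items()}


# Hypothetical data: 10 images, each described by 3 Instructors.
demo = [(f"img{i}_p{j}", f"img{i}") for i in range(10) for j in range(3)]
splits = split_by_image(demo)
print({name: len(ids) for name, ids in splits.items()})
```

Because whole images are assigned to a split, a model is never evaluated on a different description of an image it has seen during training.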

Models
We design two neural models based on two types of PLMs, a BERT-style encoder-only model and a generative encoder-decoder model.For each of these architectures, we modelled the task in a way that is most compatible with it, a classification task for the encoder model (DeBERTa) and a generation task with the encoder-decoder model (T5).
We describe here our two architectures in turn.
Classification-Based: For the classification model we fine-tune DeBERTa (He et al., 2020) with a classification head to predict an action/no_action for each of the tiles, resulting in 180 prediction steps for each instruction. The output of each prediction is one of 9 classes (8 colors and 1 no_action). We define the inputs for the task as follows: The current instruction is prepended with a given tile's coordinates, to indicate for which of the tiles the model is making a prediction; e.g., <row number> <column number> <current instruction>. In the Full-History setting, the previous instructions are concatenated with a delimiter. When adding the board-state, we represent it as a sequence of 180 colors and we additionally mark the given tile with delimiters (e.g., ... blue, white, TARGET_S, red, TARGET_E, red ...).
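The input scheme for the classification model can be sketched with plain string construction, without committing to any particular tokenizer or training code. The delimiter between segments, the ordering of board-state, history and current instruction, and the 0-based row-major tile order are assumptions; the coordinate prefix and the TARGET_S/TARGET_E markers follow the description above.

```python
from typing import List, Optional, Sequence

N_ROWS, N_COLS = 10, 18
SEP = " [SEP] "  # hypothetical delimiter between segments


def mark_target(board_state: List[str], row: int, col: int) -> str:
    """Serialize 180 tile colors, wrapping the target tile in marker tokens."""
    idx = row * N_COLS + col  # assumed row-major order
    tokens = (board_state[:idx]
              + ["TARGET_S", board_state[idx], "TARGET_E"]
              + board_state[idx + 1:])
    return ", ".join(tokens)


def classification_inputs(instruction: str, history: Sequence[str] = (),
                          board_state: Optional[List[str]] = None) -> List[str]:
    """Build one input string per tile: 180 inputs for a single instruction."""
    inputs = []
    for row in range(N_ROWS):
        for col in range(N_COLS):
            # Coordinates are prepended to the current instruction.
            text = SEP.join(list(history) + [f"{row} {col} {instruction}"])
            if board_state is not None:
                text = mark_target(board_state, row, col) + SEP + text
            inputs.append(text)
    return inputs


inputs = classification_inputs("paint the first column green")
print(len(inputs), inputs[0])
```

Each of the 180 strings would then be classified independently into one of the 9 labels, so predictions for different tiles of the same instruction do not condition on one another.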
Generation-Based: T5 (Raffel et al., 2020) is a generative transformer architecture which uses a text-to-text framework for a variety of NLP tasks. We formulate our text-to-actions task as a text-to-text task using a straightforward input/output scheme: the task's input is placed in a template that consists of a prefix and a suffix (simplify instructions: <current instruction>.simplified instructions:) and fed to the model. The gold output actions are formatted into text by first transforming them into triplets (<row number> <column number> <color>), which we then combine into a longer comma-separated string (e.g., 0 4 red, 0 5 blue, 1 0 green), ordering the actions by row and column. During inference, we generate the most likely continuation for the input at hand. We then take the generated sequence and parse it into actions, discarding malformed token sequences. Due to the generative nature of the process, the model's current prediction is conditioned on all previously predicted actions for a given step.
Baseline and Skyline Our naïve baseline model is a deterministic rule-based model based on pattern-matching, in line with previous work (Pišl and Mareček, 2017). The model detects patterns that reflect the basic predicate paint(position, color), where the position assumes coordinates on a (top-down, left-right) grid. For example, given the sentence "In the first column, color the 2nd tile blue", this model extracts the action Paint((2, 1), blue). The naïve model refers only to the current instruction.
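The text-to-text scheme of the generation-based model can be sketched as a pair of serialization helpers. The function names are hypothetical, and the exact whitespace and punctuation of the template are assumptions, since the description above only gives its prefix and suffix.

```python
import re

PREFIX = "simplify instructions: "      # assumed spacing
SUFFIX = " simplified instructions:"    # assumed spacing
_TRIPLET = re.compile(r"(\d+) (\d+) ([a-z]+)")


def format_input(instruction):
    """Wrap the current instruction in the prefix/suffix template."""
    return f"{PREFIX}{instruction}{SUFFIX}"


def actions_to_text(actions):
    """Serialize (row, column, color) triplets, ordered by row then column."""
    return ", ".join(f"{r} {c} {color}" for r, c, color in sorted(actions))


def text_to_actions(text):
    """Parse generated text back into triplets, skipping malformed chunks."""
    actions = []
    for chunk in text.split(","):
        match = _TRIPLET.fullmatch(chunk.strip())
        if match:
            row, col, color = match.groups()
            actions.append((int(row), int(col), color))
    return actions
```

A chunk such as `0 5` (missing color) fails the full match and is dropped, which mirrors the step of discarding malformed token sequences at inference time.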
As a skyline we use humans' performance on the task, presented in terms of the Action-Based Mean F1/EM for the dev and test sets.
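To make the pattern-matching baseline concrete, here is a minimal sketch that covers only the single sentence pattern quoted above. The regular expression, the ordinal vocabulary and the function names are illustrative assumptions; the actual rule set of the baseline is not reproduced here.

```python
import re

# Hypothetical, deliberately small ordinal vocabulary.
_WORDS = {"first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5}
_ORD = r"(\d+(?:st|nd|rd|th)|" + "|".join(_WORDS) + r")"
_PATTERN = re.compile(rf"{_ORD} column.*?{_ORD} tile (\w+)", re.IGNORECASE)


def _to_int(token):
    """Convert 'first' or '2nd' style ordinals to integers."""
    token = token.lower()
    return _WORDS[token] if token in _WORDS else int(token[:-2])


def extract_paint(sentence):
    """Return ((row, column), color) if the pattern matches, else None."""
    match = _PATTERN.search(sentence)
    if match is None:
        return None
    col, row, color = match.groups()
    return ((_to_int(row), _to_int(col)), color)
```

Applied to "In the first column, color the 2nd tile blue", the sketch returns `((2, 1), "blue")`, matching the Paint((2, 1), blue) example; sentences that do not fit the pattern yield `None`, i.e., no action.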
Evaluation Metrics To evaluate models' performance, we report Action-Based Mean F1/EM of the predicted actions compared to gold actions. That is, the Action-Based F1/EM (Section 5) are averaged over all instructions in the test set.
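A minimal sketch of the Action-Based Mean F1/EM computation, treating each instruction's actions as a set of (row, column, color) triplets. The function names are hypothetical, and scoring an empty prediction against an empty gold set as 1.0 is an assumption, not something stated above.

```python
def action_f1(gold, pred):
    """F1 between the gold and predicted action sets of one instruction."""
    gold, pred = set(gold), set(pred)
    if not gold and not pred:
        return 1.0  # assumption: "no action" correctly predicted
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def exact_match(gold, pred):
    """1.0 if the two action sets are identical, else 0.0."""
    return float(set(gold) == set(pred))


def mean_scores(examples):
    """Average F1 and EM over (gold, pred) pairs, one pair per instruction."""
    f1 = [action_f1(g, p) for g, p in examples]
    em = [exact_match(g, p) for g, p in examples]
    return sum(f1) / len(f1), sum(em) / len(em)
```

With three gold actions, five predicted actions and one shared action, as in the Figure 5 example, precision is 1/5, recall is 1/3, and the F1 comes out at 0.25.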

Results and Analysis
Table 7 shows results of the naïve baseline and the human skyline performance and Table 8 shows the performance of the neural models across the different input configurations, on the test set.
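The results show that all models perform substantially better than the naïve rule-based baseline, where the lowest results (obtained by the No-History condition) are still 23.23 F1 and 15.23 EM points over this baseline on the test set. At the same time, all models are substantially inferior to human performance, where the best model performance (Full-History) is 35.91 F1 and 45.81 EM points below human performance on the test set. DeBERTa and T5 both show the same trends for the different input configurations, with the generative model (T5) often performing better at EM and the classifier (DeBERTa) having higher F1.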
Our ablation experiments on input configurations are designed to empirically assess the contribution of two kinds of context: textual and board-state. We observe that textual context (previous instructions) is an important factor in model performance; the longer the context is, the better the model performs. Performance is lowest when predicting executions on the HEXAGONS board with only the current instruction as input (No-History). Adding more context proves to be beneficial, with the Full-History condition having the best realistic (non-oracle) performance. This result corroborates previous findings in studies which show that models benefit from textual history (Haber et al., 2019; Xu et al., 2022).
A different way of providing context for the execution of an instruction is via the state of previous executions on the board. Here, we experiment with either providing an oracle board-state at each step, or iteratively feeding the predicted board-states from the previous step to the current step. While providing the oracle board-state improves performance upon the No-History condition, our results show that it is not as informative as including the full instruction history.
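The difference between the oracle and the predicted board-state settings can be sketched as a step-by-step rollout. The `predict(instruction, board)` callable stands in for a trained model and is hypothetical, as is the dictionary representation of the board.

```python
def apply_actions(board, actions):
    """Return a new board (dict: (row, col) -> color) after painting tiles."""
    new_board = dict(board)
    for row, col, color in actions:
        new_board[(row, col)] = color
    return new_board


def rollout(instructions, predict, gold_actions=None):
    """Execute a drawing procedure one instruction at a time.

    If `gold_actions` is given, the board passed to the next step is built
    from the gold actions (oracle setting); otherwise it is built from the
    model's own predictions (predicted setting).
    """
    board, outputs = {}, []
    for step, instruction in enumerate(instructions):
        predicted = predict(instruction, board)
        outputs.append(predicted)
        applied = gold_actions[step] if gold_actions is not None else predicted
        board = apply_actions(board, applied)
    return outputs, board
```

In the predicted setting, any mistake made at one step is written into the board that the model sees at every later step, whereas the oracle setting always resets the context to the correct state.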
A possible reason may be that textual instructions often refer back to previously introduced (or decomposed) objects, while board states do not explicitly name these decomposed concepts. Adding both the oracle board-state and all previous instructions as input results in the best performance; however, this is not a realistic setup. The more realistic context setting is that of a predicted board and all previous instructions, but it performs worse than only providing full history, due to error propagation in the predicted states.
Abstraction Levels Figures 6-7 present the models' performance by abstraction levels (Section 5) across increasing train set sizes. The results on the full train set show that models' performance is inversely correlated with the abstraction level of the instructions; models' performance on executions of concrete primitive-like instructions exceeds that on instructions with Mid-to-High levels of abstraction. This result is significant across models, metrics and input configurations. Comparing these results with baseline and skyline by the levels of abstraction (Table 9), we observe that models' performance lies between these two boundaries while remaining substantially inferior to the skyline across all four levels of abstraction. The results on the gradually increasing train set size show that, although in general all levels of abstraction benefit from larger train sets, model performance on non-abstract instructions is still consistently better than on instructions exhibiting Mid-to-High levels of abstraction, keeping a mean gap of circa 38 Mean F1. Noticeably, the increase for the highest abstraction level is very mild, especially for EM. This hints that no substantial learning is happening at the highest abstraction level, and that a different architecture or training regime, geared towards abstraction, is needed.
We manually inspected executions of our models with respect to the levels of abstraction of the instruction. Looking at the successful executions of high-level instructions, we observe that those instructions are mainly instances where no actions should be performed, e.g., Goal/Result declarations, or instances where very common objects (e.g., flowers) are defined and drawn. More complex functions, such as repetitions with conditions, are harder for the models to interpret.
An example of how executions differ between different abstraction levels is displayed in Figure 8. The model correctly executes the first instruction, which contains no abstraction. The next instruction is of high abstraction, including replication of the triangle object. The model does not manage to identify the spots where the new triangles should be attached, or to generate appropriate triangles.
These findings are all consistent with the claim that abstract instructions pose a challenge for current NLP technology, orthogonally to data size and various other factors.

Related Work
Studying abstraction in collaborative communication is related to previous studies on collaborative games that focus on how interlocutors generate referring expressions accepted and understood by both the speaker and hearer (Clark and Wilkes-Gibbs, 1986; Khani et al., 2018; Haber et al., 2019; Udagawa and Aizawa, 2019). In such situations a speaker attempts to generate the shortest referring expression that will sufficiently communicate their intention. This phenomenon of minimizing speakers' effort is in line with Grice's (1975) maxim of quantity, stating that speakers will give as much information as needed and not more. Collaborative-game settings are very common in creating datasets for grounded semantic parsing, such as navigation tasks (Anderson et al., 1991; MacMahon et al., 2006; Anderson et al., 2018; Chevalier-Boisvert et al., 2018; Misra et al., 2018; Chen et al., 2019; Paz-Argaman and Tsarfaty, 2019; Suhr et al., 2019), the 2-D/3-D blocks world (Bisk et al., 2016a,b, 2018; Jayannavar et al., 2020) and other instruction-following scenarios (Long et al., 2016; Kim et al., 2019). Some of these studies observe abstraction as a phenomenon that indeed occurs in NL instructions (e.g., Anderson et al., 2018), implying that abstraction is a cross-domain and hence critical phenomenon for natural language understanding. However, eliciting naturally-occurring NL instructions that reflect a variety of abstraction levels in a systematic way is novel to the HEXAGONS data.
To confirm this, we inspected the 2-D Blocks dataset (Bisk et al., 2016a,b; Pišl and Mareček, 2017), which most resembles our setting. Sampling 594 instructions (5% of the train set), we found that almost all the instructions (96.5%) map to actions of shifting a single block to some location, with some spatial expressions (e.g., "place box 17 three spaces above box 20"). Such instructions are labeled "no abstraction" in our protocol (see Section 5). Notably, a fraction of the sample does express some low-level (2.7%) and mid-level (0.8%) abstraction.
Two other studies that are particularly related to our work are by Wang et al. (2017) and Wang et al. (2016). In the Voxelurn study (Wang et al., 2017), a community of users interacts with a computerized agent to deliver constructions in a 3-D blocks world. The community gradually and collaboratively builds increasingly complex and more abstract language from a core programming language via a process called "naturalization". SHRDLURN (Wang et al., 2016) exhibits similar constructions but as an individual rather than a community effort. Both studies indeed address abstraction, but from the opposite direction to ours: while these works assume a strict, narrow and synthetic language and build abstractions bottom-up, our work uncovers abstractions that are expressed in unrestricted informal NL and grounds them in an executable 'backend'. Thus, these studies and ours exhibit orthogonal ways to address abstraction.

Conclusion
We bring to the fore of NLP a novel and critical aspect of human-computer communication, namely, the ability to automatically detect, interpret and ground abstraction in NL. We devise an abstraction elicitation methodology and deliver a novel benchmark, HEXAGONS, manifesting the denotation of instructions rich and diverse in their levels of abstraction. Our results on the instruction-to-execution task derived from these data show that the models' performance is significantly inversely correlated with the level of abstraction, and this holds across models, contexts, and data sizes. This work opens a manifold of directions for future research, such as generating human-like abstractions or detecting the level of abstraction, as well as studying abstraction in adjacent fields such as linguistics, cognitive science and NL programming.
1. Make a red flower, by coloring in red all tiles adjacent to the 2nd tile from the top in the 2nd column from the left.
2. Repeat this flower pattern across the board to the right, alternating yellow and red, leaving a blank column between every 2 flowers.
3. Repeat this row of flowers 2 more times, but reverse the colors in each new row. You should get 6 red flowers and 6 yellow flowers in total.

Figure 1: Abstraction in the HEXAGONS Game. On the left are instructions for drawing the target image (bottom-right), paired with their grounding on the HEXAGONS board on the right. Italics mark expressions of abstraction as referring to, e.g., objects (a flower), iterations, and conditions.

Figure 2: Levels of Abstraction in NL. Two drawing procedures for drawing the illustrated image, manifesting low-level and high-level abstraction.

Figure 3: The HEXAGONS Image Gallery Sample.
Figure 4: A Sample of Crowdsourced HEXAGONS Images by MTurk Workers in the Second Round.

Figure 5: Board-Based and Action-Based Metrics. In the current state (second row) the intersection between Gold and Prediction is the three red tiles. Therefore, recall and precision are $\frac{3}{5}$ and $\frac{3}{7}$, respectively, and Board-Based F1 is $\frac{3}{6}=0.5$. There are three actions taken in Gold and five in Prediction (third row), with one red tile in the intersection of both sets. Thus, recall and precision are $\frac{1}{3}$ and $\frac{1}{5}$, respectively, and the Action-Based F1 score is $\frac{1}{4}=0.25$. EM for both metrics is 0.

Figure 6: Qualitative Learning Curve of DeBERTa and T5 on Full-History Models, Mean EM.

Figure 7: Qualitative Learning Curve of DeBERTa and T5 on Full-History Models, Mean F1.

Figure 8: Processing Different Abstraction Levels. DeBERTa's executions (left) and board-states (right) for different levels of abstraction. Instruction (a)-(b): "Use green to fill in the 2nd and 3rd spots on the 3rd column, 1st and 2nd spots on the 4th column, and 2nd spot on the 5th column.", F1=1; (c)-(d): "Create the same shape with green on all the purple and orange spots.", F1=0.08.

Table 2: Dataset Evaluation. Min/Max Mean F1 and EM scores are in brackets.

Table 3: Abstract, Linguistic and Spatial Phenomena in the HEXAGONS Dataset.

Table 5: Levels of Abstraction in the Dev Set.

Table 7: Baseline and Skyline on Test.

Table 8: Results of DeBERTa and T5 on Test.

Table 9: Baseline and Skyline on Dev Sets by Levels of Abstractions.