Modeling Non-Cooperative Dialogue: Theoretical and Empirical Insights

Abstract Investigating the cooperativity of interlocutors is central to studying the pragmatics of dialogue. Models of conversation that assume only cooperative agents fail to explain the dynamics of strategic conversations. Thus, we investigate the ability of agents to identify non-cooperative interlocutors while completing a concurrent visual-dialogue task. Within this novel setting, we study the optimality of communication strategies for achieving this multi-task objective. We use the tools of learning theory to develop a theoretical model for identifying non-cooperative interlocutors and apply this theory to analyze different communication strategies. We also introduce a corpus of non-cooperative conversations about images in the GuessWhat?! dataset proposed by De Vries et al. (2017). We use reinforcement learning to implement multiple communication strategies in this context and find that empirical results validate our theory.


Introduction
A robust dialogue agent cannot always assume a cooperative conversational counterpart when deployed in the wild. Even in goal-oriented settings, where the intent of an interlocutor may seem a given, bad actors and disinterested parties are free to interact with our dialogue systems. These non-cooperative interlocutors add harmful noise to data, which can elicit unexpected behaviors from our dialogue systems. Thus, the need to study non-cooperation increases daily as we build and deploy conversational systems which interact with people of different demographics, political views, and intents, continuously learning from the collected data. Examples include Amazon Alexa, task-oriented systems that help patients recovering from injuries or teach a person a new language, and systems that help predict deceptive behaviors in courtrooms. To communicate effectively in the presence of unwanted behaviors like bullying (Cercas Curry and Rieser, 2018), systems need to understand users' strategic behaviors (Asher and Lascarides, 2013) and be able to identify non-cooperative actions. Designing agents that learn to identify non-cooperative interlocutors is challenging, since it requires processing the context of the dialogue in addition to modeling the choices that interlocutors make under uncertainty, choices which typically also affect their ability to complete tasks unrelated to identifying non-cooperation. In light of this, we ask: What communication strategies are effective for identifying non-cooperative interlocutors, while also achieving the goals of a distinct dialogue task?
To answer this question, we appeal to a simple non-cooperative version of the visual-dialogue game GuessWhat?! (De Vries et al., 2017). See Figure 1 for an example. The game consists of a multi-round dialogue between two players: a question-player and an answer-player. [Figure 1 caption: The answer-player, who may be cooperative or non-cooperative, gives binary responses to the question-player's queries. In this example, the answer-player is non-cooperative and leads the question-player to an incorrect object (the orange). This is a real example produced by autonomous agents, described in Section 5.] Both have
access to the same image, whereas only the answer-player has access to an image-secret; i.e., a particular goal-object for the question-player to recognize. The question-player's goal is to ask the answer-player questions which will reveal the secret. A cooperative answer-player then provides good answers to assist in this goal. In the original game, the answer-player is always cooperative. Our modified game instead allows the answer-player to be non-cooperative with some non-zero probability. Unlike a cooperative answer-player, a non-cooperative answer-player will not necessarily act in assistance to the question-player, and instead may attempt to reveal an incorrect secret or otherwise hinder information exchange. In experiments, the specific strategies we study are learned from human non-cooperative conversation. The question-player, importantly, does not know whether the answer-player is non-cooperative. At the end of the dialogue, the question-player's final objective is not only to identify the goal-object, but also to determine whether the conversation took place with a cooperative or non-cooperative answer-player.
We propose a formal theoretical model for analyzing communication strategies in the described scenario. We frame the question-player's objective in terms of two distinct classification tasks and use tools from the theory of learning algorithms to analyze relationships between these tasks. Our main theoretical result identifies circumstances where the question-player's performance in identifying non-cooperation correlates with performance in identifying the goal-object. Building on this, we provide a mathematical definition of the efficacy of a non-cooperative player, based on the conceptual idea that cooperation is necessary to make progress in dialogue. Our analysis concludes that when the answer-player is effective in this sense, the question-player can gather useful information for both the object-identification task and the non-cooperation-identification task by selecting a communication strategy based only on the former objective.
To test the assumptions of our theoretical model as well as the value of the aforementioned communication strategy in practice, we implement this strategy using reinforcement learning (RL). Our experiments validate our theory: compared to heuristically justified baselines, the communication strategy motivated by our theory yields consistently better results. To conduct this experiment, we have collected a novel corpus of non-cooperative GuessWhat?! game instances, which is publicly available. Throughout experimentation, we provide a qualitative and quantitative analysis of the non-cooperative strategies present in our corpus. These results, in particular, demonstrate that non-cooperative autonomous agents that utilize dialogue history can better deceive question-players. This contrasts with the observation of Strub et al. (2017) that cooperative answer-players do not use this information.
In total, our work is positioned at the intersection of two foci: detection of non-cooperative dialogue and modeling of non-cooperative dialogue. Unlike many detection works, we consider detection in the context of interaction. Additionally, while many modeling works consider the intent of conversational agents and construct strategies for non-cooperative dialogue based on this, our strategies are motivated purely by a learning-theoretic argument. As far as we are aware, a theoretical description similar to ours has not been given before.

Related Works
The view that conversation is not necessarily cooperative is not novel, but the argument can be made that it has lacked sufficient investigation in the dialogue literature (Lee, 2000). Game-theoretic investigations of non-cooperation are plentiful, perhaps beginning with the work of Nash (1951).
Concepts from this space, such as the stochastic games introduced by Shapley (1953), have been used to model dialogue (Barlier et al., 2015) when non-cooperation between parties is allowed. Pinker et al. (2008) also consider a game-theoretic model of speech. In fact, even the dialogue game we consider in this text can be modelled through game-theoretic constructs; e.g., a Bayesian game (Kajii and Morris, 1997). Whereas game theory focuses primarily on analysis of strategies, studying non-cooperation in dialogue requires both the learning of strategies and the learning of utterance meaning. Aptly, our use of the theory of learning algorithms (rather than game theory) is suited to handle both of these. While we are the first to use learning theory, efforts to characterize non-cooperation in dialogue, learn non-cooperative strategies in autonomous agents, and detect non-cooperation in dialogue are not absent from the literature (Plüss, 2010; Georgila and Traum, 2011a; Shim and Arkin, 2013; Vourliotakis et al., 2014). We discuss these topics in detail in the following.
Modeling Non-Cooperative Dialogue. One of the earliest works on non-cooperation specific to dialogue is that of Jameson et al. (1994), which considers strategic conversation for advantage in commerce. Similarly, Traum et al. (2008) focus on negotiation, and Georgila and Traum (2011b) focus on learning negotiation strategies (i.e., argumentation) through reinforcement learning (RL). More recently, Efstathiou and Lemon (2014) consider using RL to teach agents to compete in a resource-trading game, and Keizer et al. (2017) use deep RL to model negotiation in a similar game. In most of these, the intent of interlocutors is assumed and utilized in model design. In the last, strategies are learned from data similarly to our work, but objectives for learning are not motivated by learning-theoretic analysis as in ours.
Detecting Non-Cooperative Dialogue. The work of Zhou et al. (2004) presents an early example of automated deception detection, focusing on indicators arising from the language used. Plüss (2014) also focuses on how (more general) non-cooperative dialogue can be identified at a linguistic level. Besides linguistic cues, several works employ additional features in the identification of deception. These include physiological responses (Abouelenien et al., 2014), human micro-expressions (Wu et al., 2018), and acoustics (Levitan, 2019). There are also many novel scenarios for detection of deception, including talk-show games (Soldner et al., 2019), interrogation games (Chou and Lee, 2020), and news (Conroy et al., 2015; Shu et al., 2017).
Other Visual Dialogue Games. As Galati and Brennan (2021) observe, conversation involving multiple media for information transfer (instead of a single medium) typically leads to increased understanding between interlocutors. Thus, visual dialogue is a particularly interesting setting for investigating both cooperation and non-cooperation. Appropriately, a number of cooperative visual-dialogue games have been proposed (Das et al., 2017; Schlangen, 2019; Haber et al., 2019), to name a few. In contrast, we focus on detection. Additionally, our theoretical results are broader and do not explicitly model adversarial intent. Identifying non-cooperation in dialogue is also related to detecting distribution shift in high-dimensional, distribution-independent settings (Gretton et al., 2012; Lipton et al., 2018; Rabanser et al., 2019; Atwell et al., 2022) as well as learning to generalize in the presence of such distribution shift (Ben-David et al., 2010; Ganin and Lempitsky, 2015; Zhao et al., 2018, 2019; Schoenauer-Sebag et al., 2019; Johansson et al., 2019; Germain et al., 2020; Sicilia et al., 2022). This connection is a strong motivation for our theoretical work, but we emphasize our results are not a trivial application of existing theory.

Dataset
In this section, we first describe our modified version of the GuessWhat?! game. Then, we describe the data acquisition process as well as the non-cooperative dataset used in this study. The dataset will be made publicly available upon publication.

Proposed Dialogue Game
As noted, our proposed dialogue game is a modification of the cooperative two-player visual-dialogue game GuessWhat?! (De Vries et al., 2017). Distinctly, our version incorporates non-cooperation.
Initialization. An image is randomly selected and an object within this image is randomly chosen to be the goal-object. With some probability, the game instance is designated as a cooperative game. Otherwise, the game is non-cooperative.
Players. Unlike the original GuessWhat?! game, there are three (not two) player roles: the question-player, the cooperative answer-player, and the non-cooperative answer-player. For cooperative game instances (decided at initialization), the cooperative answer-player is put in play. Otherwise, the non-cooperative answer-player is put in play. The question-player always plays and does not know whether the answer-player is cooperative or non-cooperative. To start, all active players are granted access to the image. The question-player asks yes/no questions about the image and objects within the image. At the end of the dialogue, the question-player will use the gathered information to guess both the unknown goal-object and the (cooperation) type of the active answer-player. Unlike the question-player, the active answer-player has knowledge of the game's goal-object and responds to the question-player's queries with yes, no, or n/a (not applicable).
Objectives. The question-player's goals are always to identify both the goal-object and the presence of non-cooperation if it exists (i.e., if the non-cooperative answer-player is in play). The cooperative answer-player's goal is to reveal the goal-object to the question-player by answering the yes/no questions appropriately. The non-cooperative answer-player's goal is instead to lead the question-player away from this goal-object; i.e., to ensure the question-player does not correctly guess this object. There is no specific way in which this misleading must be done (e.g., there is not always an alternate object). Instead, during data collection, participants are simply instructed to deceive the question-player.
Gameplay. The question-player and active answer-player converse until the question-player is ready to make a guess or a pre-specified maximum number of dialogue rounds have transpired. The question-player is then presented with a list of possible objects and must guess which of these was the secret goal-object. In addition, the question-player must guess whether the answer-player was cooperative or non-cooperative.
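For concreteness, the game flow above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in (the player interfaces, the scripted players, and the probability `p_noncoop`), not the implementation used in our experiments.

```python
import random

def play_game(image_objects, question_player, make_answer_player,
              p_noncoop=0.3, max_rounds=8, rng=random):
    # Initialization: sample a goal-object and the cooperation type.
    goal = rng.choice(image_objects)
    cooperative = rng.random() >= p_noncoop
    answer_player = make_answer_player(goal, cooperative)

    # Dialogue rounds: yes/no/n-a exchanges until the question-player
    # is ready to guess or the round limit is hit.
    history = []
    for _ in range(max_rounds):
        question = question_player.ask(image_objects, history)
        if question is None:  # question-player is ready to guess
            break
        history.append((question, answer_player(question)))

    # Final guesses: both the goal-object and the cooperation type.
    right_object = question_player.guess_object(image_objects, history) == goal
    right_coop = question_player.guess_cooperation(history) == cooperative
    return right_object, right_coop

class ScriptedQuestionPlayer:
    """Trivial stand-in: asks one question, then guesses the first object."""
    def ask(self, objects, history):
        return "is it the first object?" if not history else None
    def guess_object(self, objects, history):
        return objects[0]
    def guess_cooperation(self, history):
        return True  # always guesses "cooperative"

def make_answer_player(goal, cooperative):
    # A cooperative stand-in answers "yes"; a non-cooperative one answers "no".
    return lambda question: "yes" if cooperative else "no"

result = play_game(["orange", "apple", "cup"], ScriptedQuestionPlayer(),
                   make_answer_player, rng=random.Random(0))
```

The returned pair mirrors the question-player's two final guesses: object correctness and cooperation-type correctness.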

Data Collection
Collection. We developed a web application to collect dialogue from human participants taking the role of a non-cooperative answer-player. Participants were native English speakers recruited via an online crowd-sourcing platform and paid $15 per hour, in accordance with our institution's human subjects review board. Participants were asked to deceive an autonomous question-player pretrained to identify the goal-object only. For pretraining, we used the original GuessWhat?! game corpus and supervised learning setup (De Vries et al., 2017; Strub et al., 2017). Participants received an image and a crop that indicated the goal-object, both randomly sampled from the original GuessWhat?! game corpus. They were tasked with leading the question-player away from this goal-object by answering questions with yes, no, or n/a. Dialogue persisted until the question-player made a guess.
Dataset. We collected 3746 non-cooperative dialogues. Dataset statistics are shown in Table 1, while visualizations of the object and dialogue-length distributions are shown in Figure 2. Compared to the original GuessWhat?! corpus, both the dialogue-length and object distributions are similar. For objects, this is expected, as these are uniformly sampled from the original corpus. We see 16 of our 20 most likely objects are shared with the 20 most likely objects of the original GuessWhat?! distribution, and further, the first 4 objects have identical ordering (see Figure 3). Differences here are simply attributed to randomness and the increasing uniformity as the likelihood of an object decreases. For dialogue length, one might expect non-cooperative dialogue to be longer. Instead, the distributions are both right-skewed with an average near 5 (i.e., 4.99 in our dataset and 5.11 in the original GuessWhat?! corpus). The primary difference is that the original corpus has more outliers, which is most probably a result of its larger sample size. We likely observe consistency between our non-cooperative corpus and the original corpus because the question-player, who controls dialogue length, is autonomous and trained on a cooperative corpus. Hence, this and other aspects of our non-cooperative corpus may be influenced by pre-conditioning the question-player for cooperation. This issue is mitigated in our experiments (Section 5), where the question-player is also trained on simulated non-cooperative dialogue. Also note, while our collected dataset is smaller than the original cooperative corpus, we only use our data to train an autonomous, non-cooperative answer-player. When a larger sample is required (e.g., when training the question-player via RL), we use simulated non-cooperative data generated by the pretrained, non-cooperative answer-player, a standard technique in the literature (Strub et al., 2017).
Besides the statistics shown in Table 1 and Figure 2, we also point out that the question-player succeeded at identifying the goal-object in only 19% of the collected games. Comparatively, on an autonomously generated and fully cooperative test set, comparably trained question-players achieve 52.3% success (Strub et al., 2017). This indicates the deceptive strategies employed by the humans were effective at fooling the question-player into selecting the wrong goal-object. More detailed analysis of the strategies used by the participants is given in Section 5; these strategies are self-described by the participants and also automatically detected for a simple case. Finally, we also computed the answer distribution on the collected corpus: answers were 46% yes, 52% no, and 2% n/a.

A Theoretical Model
This section formally models the objectives of the question-player as two distinct learning tasks. We use results from the theory of learning algorithms to give a relationship between these tasks in Thm. 4.1. We then use Thm. 4.1 to analyze communication strategies in Section 4.3.

Setup
As described in Section 3, the question-player has two primary objectives: identification of the goal-object and identification of non-cooperation. To do so, the question-player is granted access to the image and may also converse with an answer-player. In the end, the question-player guesses based on this evidence (i.e., the image features and dialogue history). Mathematically, we encapsulate the question-player's guess as a learned hypothesis (i.e., a function) from the game features to the set of object labels or the set of cooperation labels.
Key Terms. We write Y to describe the finite set of object labels and Z = {CP, NC} for the set of cooperation labels; CP denotes cooperation and NC denotes non-cooperation. In relation to the example in Figure 1, Y might contain labels for the orange, apple, cups, and dining table. In the same example, the cooperation label would be NC to indicate a non-cooperative answer-player. We use X to denote the feature space which contains all possible game configurations. For example, each X ∈ X might capture the dialogue history, the image, and particular features of the image pre-extracted for the question-player (i.e., which objects are contained in the image at which locations). With this notation, the question-player's learned hypotheses may be described as an object-identification hypothesis o : X → Y and a cooperation-identification hypothesis c : X → Z. The question-player learns these functions by example. In particular, we assume the question-player is given access to a random sequence of m examples S = (X_i, Y_i, Z_i)_{i=1}^m, independently and identically distributed according to an unknown distribution P_θ over X × Y × Z. To abbreviate, we write S iid∼ P_θ and assume all samples are of size m for simplicity. The distribution P_θ is dependent on the question-player's communication policy π_θ, which we assume is uniquely determined by the real vector θ. Later, this allows us to select communication strategies using common reinforcement learning algorithms.
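Concretely, the objects of this setup can be written down with plain Python types; the feature encoding below is a hypothetical placeholder for the dialogue-plus-image state, not the representation used in our experiments.

```python
from typing import Callable, List, Tuple

Features = Tuple[str, ...]        # placeholder encoding of the game state X
ObjectLabel = str                 # an element of the finite label set Y
CoopLabel = str                   # an element of Z = {"CP", "NC"}

ObjectHypothesis = Callable[[Features], ObjectLabel]   # o : X -> Y
CoopHypothesis = Callable[[Features], CoopLabel]       # c : X -> Z

# A sample S = (X_i, Y_i, Z_i), i = 1..m, drawn i.i.d. from P_theta.
Sample = List[Tuple[Features, ObjectLabel, CoopLabel]]

S: Sample = [
    (("is it a fruit?", "yes"), "orange", "NC"),
    (("is it furniture?", "no"), "apple", "CP"),
]

o: ObjectHypothesis = lambda x: "orange"   # a (trivial) object hypothesis
c: CoopHypothesis = lambda x: "CP"         # a (trivial) cooperation hypothesis
```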
We emphasize that the dependence of P_θ on π_θ distinguishes our setup from typical scenarios in learning theory. Besides learning the hypotheses o and c, the question-player can also select the communication policy π_θ. This policy implicitly dictates the distribution over which the question-player learns, and thus can either improve or hurt the player's chance at success. As in reality, neither we nor the learner have knowledge of the mechanism through which changes to the communication policy π_θ modify the distribution P_θ. Our only assumption is that changing π_θ does not modify the probability of cooperation. That is, there is a constant p_NC ∈ (0, 1) such that for all π_θ,

    Pr(Z = NC) = p_NC, where (X, Y, Z) ∼ P_θ.    (1)

This agrees with the description in Section 3, where the game instance is designated cooperative or non-cooperative prior to dialogue. Given a random sample S, an unbiased estimate of p_NC is

    p̂_NC = (1/m) Σ_{i=1}^m 1[Z_i = NC],    (2)

where 1 is the indicator function.
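As a small illustration of the estimate in Eq. (2), p̂_NC is just the empirical frequency of NC labels in the sample; the tuple encoding of S below is an assumption for the sketch.

```python
def estimate_p_nc(sample):
    """Unbiased estimate of p_NC: the fraction of instances labeled NC."""
    m = len(sample)
    return sum(1 for (_, _, z) in sample if z == "NC") / m

S = [(None, "orange", "NC"), (None, "apple", "CP"),
     (None, "cup", "NC"), (None, "table", "CP")]
print(estimate_p_nc(S))  # 0.5
```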
Error. To measure the quality of the question-player's guesses, we report the observed error-rate on the sample S = (X_i, Y_i, Z_i)_{i=1}^m. In particular, the empirical object-identification error of any hypothesis o : X → Y is defined

    oer_S(o) = (1/m) Σ_{i=1}^m 1[o(X_i) ≠ Y_i].    (3)

Similarly, the empirical cooperation-identification error of any hypothesis c : X → Z is defined

    cer_S(c) = (1/m) Σ_{i=1}^m 1[c(X_i) ≠ Z_i].    (4)

In some cases, we instead restrict the sample over which we compute the empirical object-identification error. Specifically, restricting to cooperative game instances, we write

    oer_S(o | CP) = (1/m) Σ_{i=1}^m 1[o(X_i) ≠ Y_i, Z_i = CP],

i.e., the error computed over S̃ = ((X_i, Y_i) | Z_i = CP), the sample S with each triple where Z_i ≠ CP removed, weighted by the full sample size m. The case oer_S(o | NC) is defined similarly. Based on these, we further define the cooperation gap

    ∆_S(o) = oer_S(o | NC) − oer_S(o | CP).

This gap describes the observed change in (weighted) object-identification error induced by a change in cooperation. We often expect ∆_S to be positive. Finally, recall S iid∼ P_θ and P_θ is unknown, so in practice we can only report the observed errors discussed above. Still, we are typically more interested in the true or expected error for future samples from P_θ. This quantity tells us how the question-player's hypotheses generalize beyond the random samples we observe. Precisely, the expected cooperation-identification error of a hypothesis c : X → Z is defined

    cer_θ(c) = Pr(c(X) ≠ Z), where (X, Y, Z) ∼ P_θ.

The true (or expected) object-identification error oer_θ(o) is defined similarly.
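The empirical quantities above are straightforward to compute from a sample. A minimal sketch follows; note we normalize the conditional errors by the full sample size m, which is our reading of the "weighted" qualifier above.

```python
def oer(sample, o):
    """Empirical object-identification error, Eq. (3)."""
    return sum(o(x) != y for (x, y, _) in sample) / len(sample)

def cer(sample, c):
    """Empirical cooperation-identification error, Eq. (4)."""
    return sum(c(x) != z for (x, _, z) in sample) / len(sample)

def oer_given(sample, o, label):
    """Conditional object-identification error: restricted to instances with
    cooperation label `label`, but weighted by the full sample size m."""
    return sum(o(x) != y and z == label for (x, y, z) in sample) / len(sample)

def coop_gap(sample, o):
    """Cooperation gap: change in weighted error induced by non-cooperation."""
    return oer_given(sample, o, "NC") - oer_given(sample, o, "CP")

S = [("a", "orange", "CP"), ("b", "orange", "NC")]
o = {"a": "orange"}.get            # errs only on "b", a non-cooperative instance
print(oer(S, o), coop_gap(S, o))   # 0.5 0.5
```

The positive gap in the usage line reflects the expectation stated above: errors concentrate on non-cooperative instances.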

Applicability to Distinct Contexts
While the discussion above is specialized to our dialogue game to promote understanding, one of the benefits of our theoretical framework is that it is fairly general.
In fact, the reader may be concerned that our discussion above lacks precise definitions of seemingly important terms; i.e., the feature space X and the communication policy π_θ. These components are intentionally left abstract because our theoretical results make no assumptions on the mechanism through which π_θ influences P_θ, except Eq. (1). Further, our results make no assumptions on how the game configurations are represented in the feature space X. This space could correspond to any set of dialogues with or without some associated data (e.g., images). Lastly, the only assumptions on the label spaces are that Y is finite and Z is binary. In this sense, our theoretical discussion is applicable to very general scenarios beyond the simple visual-dialogue game considered. We emphasize some examples later in Section 6.

Bounding Cooperation Identification Error
To motivate our main result, we informally observe that identifying non-cooperation is essentially a problem of identifying distribution shift. Specifically, we are interested in differences between the two dialogue distributions induced by cooperative and non-cooperative answer-players, respectively. Luckily, there is a rich literature on the topic of distribution shift. We take insight, in particular, from the work of Ben-David et al. (2007, 2010), which measures shift using the symmetric difference hypothesis class. For a set of hypotheses O ⊆ {o | o : X → Y}, this class contains hypotheses characteristic to disagreements in O:

    O∆O = { x ↦ NC[o(x) ≠ o'(x)] : o, o' ∈ O },

where NC[•] acts like an indicator function, returning NC for true arguments and CP otherwise. Using this class, we identify a relationship between the true error when identifying non-cooperation, cer_θ, and the observed object-identification errors oer_S(• | CP) and oer_S(• | NC) against the cooperative and non-cooperative answer-player, respectively. While a more traditional learning-theoretic bound would relate cer_θ to the empirical observation cer_S for the same task, our novel bound reveals a connection to the seemingly distinct task of object-identification. Later, this relationship is useful for analyzing how the question-player's communication policy controls the data distribution so that both objectives are improved. Proofs of all results are provided in Section 4.4.
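A member of the symmetric difference class is simply a disagreement detector built from two object hypotheses. A minimal sketch, where the two example hypotheses are arbitrary stand-ins:

```python
def sym_diff_hypothesis(o1, o2):
    """Return the O-delta-O member NC[o1(x) != o2(x)]: it predicts NC
    exactly when the two object hypotheses disagree on x."""
    return lambda x: "NC" if o1(x) != o2(x) else "CP"

# Two toy hypotheses that agree only on question-like inputs:
o1 = lambda x: "orange"
o2 = lambda x: "orange" if x.endswith("?") else "apple"
c = sym_diff_hypothesis(o1, o2)
print(c("is it a fruit?"), c("no question"))  # CP NC
```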
Theorem 4.1. Define O as above and take C to be sufficiently complex so that O∆O ⊆ C. Let d be the VC-dimension of C. Then for any δ ∈ (0, 1), with probability at least 1 − δ over S iid∼ P_θ, for all o, o' ∈ O,

    cer_θ(ĉ) ≤ oer_S(o | CP) + oer_S(o | NC) + p̂_NC − ∆_S(o') + O(C),

where ĉ ∈ arg min_{c∈C} cer_S(c) and C = C(d, m, δ) is a complexity term that goes to 0 as m grows.
Remarks. Notice, one sensible choice of o and o' is to pick o which minimizes the observed object-identification error and o' which maximizes ∆_S; this produces the tightest bound on the expected cooperation-identification error. We leave these hypotheses unspecified because later we must make limiting assumptions on the properties of o and o' (e.g., Prop. 4.1). Greater generality here makes our results more broadly applicable. Besides this, we also observe that C goes to 0 as m grows. Ultimately, we ignore C in interpretation, but point out that bounds based on the VC-dimension (as above) are notoriously loose for most P_θ. As we are primarily interested in these bounds for purposes of interpretation and algorithm design, this is a non-issue. On the other hand, if practically computable bounds are desired, other (more data-dependent) techniques may be fruitful; e.g., see Dziugaite and Roy (2017).
Interpretation. As noted, the question-player has some control over the distribution P_θ through the communication policy π_θ. So, Thm. 4.1 can be interpreted to motivate indirect mechanisms for controlling the cooperation-identification error cer_θ(ĉ). Specifically, with respect to oer_S, we can infer that improving performance on the object-identification task should implicitly improve performance on the separate task of identifying non-cooperation. The term ∆_S also offers insight. It suggests certain non-cooperative answer-players, whose actions induce a large reduction in performance as compared to the cooperative answer-player, are easy to identify. Stated more plainly, non-cooperative agents reveal themselves by their non-cooperation; this is true, in particular, when their behavior causes large performance drops. In Section 4.3, we formalize these concepts further.

Analyzing Communication Strategies
In this section, we analyze methods for the question-player to select the communication policy π_θ. In recent dialogue literature, reinforcement learning (RL) has proven successful in teaching agents effective communication strategies. For example, Strub et al. (2017) show this to be the case in the fully cooperative version of GuessWhat?!. Selecting an appropriate reward structure is fundamental to any RL training regime. To this end, we use Thm. 4.1 to study different reward structures. We consider, in particular, an episodic RL scenario where the discount factor (often called γ) is set to 1 and the only non-zero reward comes at the end of the episode. So, the question-player holds a full dialogue with the answer-player, guesses the goal-object and the answer-player's cooperation based on this dialogue, and then receives a reward dependent on whether the guesses are correct. Under these assumptions, the question-player selects the communication policy π_θ to maximize

    J(θ) = E[ρ(X, Y, Z)], where (X, Y, Z) ∼ P_θ,

and ρ : X × Y × Z → R is the reward structure to be decided. In particular, selection of θ can often be achieved through policy-gradient methods. Williams (1992) and Sutton et al. (1999) are attributed with showing we can estimate ∇_θ J(θ) in an unbiased manner through Monte-Carlo estimation. In our implementation in Section 5, our particular policy-gradient technique is identical to previous work on communication strategies for the GuessWhat?! dataset (Strub et al., 2017). Thus, we focus discussion on the reward structure ρ and understanding its role through a theoretical lens.
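To illustrate the Monte-Carlo policy-gradient estimate in this episodic, terminal-reward setting, the following toy sketch runs REINFORCE (Williams, 1992) on a one-step, two-action policy. The sigmoid parameterization and bandit-style environment are stand-ins for illustration, not our dialogue model.

```python
import math
import random

def reinforce_step(theta, episodes, reward, lr=0.1, rng=random):
    """One Monte-Carlo policy-gradient (REINFORCE) update for a toy
    two-action policy pi_theta(a = 1) = sigmoid(theta). As in our setting,
    gamma = 1 and the only reward arrives at the end of the episode."""
    p = 1.0 / (1.0 + math.exp(-theta))
    grad = 0.0
    for _ in range(episodes):
        a = 1 if rng.random() < p else 0
        # Score function: d/dtheta log pi_theta(a) = (a - p) for this policy.
        grad += (a - p) * reward(a)
    return theta + lr * grad / episodes

theta, rng = 0.0, random.Random(0)
for _ in range(200):
    # Terminal reward 1 for the "correct guess" action a = 1, else 0.
    theta = reinforce_step(theta, episodes=16,
                           reward=lambda a: float(a == 1), rng=rng)
# After training, the policy strongly prefers the rewarded action.
```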
To select ρ, we first consider some obvious choices without appealing to complex analysis. Specifically, for c fixed, define ρ(X, Y, Z) = 1[c(X) = Z]. Then J(θ) = 1 − cer_θ(c). Thus, maximizing J(θ) is equivalent to minimizing the cooperation-identification error. This reward focuses only on identifying non-cooperation.
On the other hand, for some fixed o, define ρ(X, Y, Z) = 1[o(X) = Y]. Then J(θ) = 1 − oer_θ(o). So, in this case, maximizing J(θ) minimizes the expected object-identification error.
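The two reward structures, and the equivalence between maximizing J(θ) and minimizing the corresponding error, can be sketched directly; the sample encoding is an assumption for illustration.

```python
def reward_cooperation(c):
    """rho(X, Y, Z) = 1[c(X) = Z]; maximizing J(theta) then minimizes cer."""
    return lambda x, y, z: float(c(x) == z)

def reward_object(o):
    """rho(X, Y, Z) = 1[o(X) = Y]; maximizing J(theta) then minimizes oer."""
    return lambda x, y, z: float(o(x) == y)

def empirical_return(sample, rho):
    """Monte-Carlo estimate of J(theta) = E[rho(X, Y, Z)] from a sample."""
    return sum(rho(x, y, z) for (x, y, z) in sample) / len(sample)
```

For either choice of ρ, `empirical_return` equals one minus the corresponding empirical error on the same sample.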
It is easy to see the trade-off between the two choices discussed above. Each focuses distinctly on a single objective of the question-player, and it is not clear how these two objectives relate to each other. To properly answer this, we appeal to analysis. We first give some definitions.
Definition 4.1. We say a hypothesis o ∈ O is α-improved by θ* relative to θ if the expected object-identification errors satisfy oer_{θ*}(o) ≤ oer_θ(o) − α.

Simply, Def. 4.1 formally describes when a communication policy π_{θ*} improves the question-player's ability to identify the goal-object. Next, we define efficacy of an answer-player as a property of the errors induced by this player's dialogue.

Definition 4.2. We say a non-cooperative answer-player is effective with fixed parameter ε if for all δ > 0 there is n such that for all θ, θ' ∈ Θ, o ∈ O, and m ≥ n, we have

    Pr( |oer_S(o | NC) − oer_T(o | NC)| ≤ ε ) ≥ 1 − δ, where S iid∼ P_θ and T iid∼ P_{θ'}.

Def. 4.2 requires that the errors of all question-players converge in probability to the same O(ε)-sized region when playing against an effective answer-player. If a non-cooperative answer-player is effective, then regardless of the communication strategy employed by the question-player, we should not expect to observe large changes in object-identification performance against the non-cooperative opponent. Conceptually, this captures the following idea: without cooperation, we cannot expect interlocutors to make significant headway. This assumption is inherently related to an answer-player's failure to abide by the Gricean maxims of conversation: uninformative and deceitful responses violate the maxims of relation and quality, respectively. Instead of explicitly modelling these violations, Def. 4.2 focuses on the effect of violations, namely, failure to progress. While violations of the other Gricean maxims (i.e., quantity and manner) are less applicable to the simple game we consider, the definition of non-cooperation we give (as an observable effect) still applies.
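Def. 4.2 can be probed with a toy simulation: against an effective answer-player, the observed non-cooperative errors under any two policies should land in the same small region as m grows. In the hypothetical sketch below, the answer-player pins the error near a fixed rate regardless of the policy parameter, which is deliberately ignored.

```python
import random

def noncoop_error(policy_theta, m, base_error=0.8, rng=random):
    """Simulated object-identification error against an 'effective'
    non-cooperative answer-player. `policy_theta` is deliberately ignored:
    an effective player caps the question-player's progress no matter
    the communication strategy."""
    mistakes = sum(rng.random() < base_error for _ in range(m))
    return mistakes / m

rng = random.Random(0)
e1 = noncoop_error(policy_theta=0.2, m=5000, rng=rng)  # one policy
e2 = noncoop_error(policy_theta=0.9, m=5000, rng=rng)  # a very different policy
# Both errors concentrate near base_error; their gap shrinks as m grows.
```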
As alluded to, when the non-cooperative answer-player is effective, this non-cooperation is enough to reveal the answer-player to the question-player. The question-player may focus on communicating to identify the goal-object, and this will reduce all terms in the upper bound of Thm. 4.1; subsequently, we expect this communication strategy to be effective not only for identifying the goal-object, but also for identifying non-cooperation.
Proposition 4.1. Let o, o' ∈ O and θ*, θ ∈ Θ. Suppose the non-cooperative answer-player is effective, and further suppose both o and o' are α-improved by θ* relative to θ with α > ε. Then, for any δ > 0, there is n such that for all m ≥ n, with probability at least 1 − δ − γ, the upper bound of Thm. 4.1 computed on T is no larger than the same bound computed on S, up to additive O(ε) and O(C) terms, where S iid∼ P_θ and T iid∼ P_{θ*} are independent samples.

Remarks. Notice, the result assumes the hypotheses o, o' and policies π_θ, π_{θ*} are fixed prior to drawing S, T. Hence, the bound is only valid for test sets independent from training. Regardless, it is still useful for interpretation, and this style of bound produces tighter guarantees than conventional learning-theoretic bounds, from both analytic and empirical perspectives (Shalev-Shwartz and Ben-David, 2014; Sicilia et al., 2021). Like Thm. 4.1, we also use two hypotheses o, o' ∈ O, but the result is easily specified to the one-hypothesis case by taking o = o' (albeit, this may loosen the bound). In any case, the assumption is not unreasonable. A policy π_{θ*}, optimized with respect to just one hypothesis o, may also offer relative improvement for other hypotheses distinct from o. For greater certainty, the term δ in the probability can be made arbitrarily small provided a large enough sample. Sensibly, the term γ indicates the probability is also proportional to how much better the communication mechanism π_{θ*} is, where "better" is given precise meaning by comparing population statistics for the objective J(•) via α. At minimum, we require α > ε, but ε should be small for suitably effective answer-players anyway. Finally, we again safely ignore O(C) terms, which go to 0 as m grows.
Interpretation. The takeaway from Prop. 4.1 is an unexpectedly sensible strategy for game success: the question-player focuses its communication efforts only on identifying the goal-object. When the non-cooperative agent is effective, this communication strategy essentially reduces an upper bound on the true cooperation-identification error, all the while obviously assisting the object-identification task as well. We again note the implication that non-cooperative agents can reveal themselves through their non-cooperation; the question-player need not expend additional dialogue actions to uncover them.
Comparison to Thm. 4.1. While Thm. 4.1 alludes to the interpretation given above, since the object-identification error is shown to control the cooperation-identification error in part, Prop. 4.1 distinguishes itself because it considers all terms in the upper bound (not just oer). This subtlety is important. In particular, a priori, one cannot be certain that improving the object-identification error from S to T also improves the cooperation gap ∆. Instead, it could be the case that ∆ decreases and the overall bound on cer is worsened. Aptly, Prop. 4.1 isolates the circumstances (i.e., those related to Def. 4.2) which ensure this adverse effect does not occur. It shows us that, under reasonable assumptions, the communication strategy discussed in our interpretation controls the whole bound in Thm. 4.1 and not just some part of it. As noted, drawing inference from only a portion of a bound can have unexpected consequences; in fact, this is the topic of much recent work in the analysis of learning algorithms (Johansson et al., 2019; Wu et al., 2019; Zhao et al., 2019; Sicilia et al., 2022).
Comparison to the Cooperative Setting. It is also worth noting that setting the reward as ρ(X, Y, Z) = 1[o(X) = Y] is an appropriate strategy in the distinct, fully cooperative GuessWhat?! game. The authors of the original GuessWhat?! corpus propose exactly this reward in their follow-up work (Strub et al., 2017), which uses RL to learn communication strategies in the fully cooperative setting. Thus, the theoretical results of this section are exceedingly practical: they suggest that, for effective non-cooperative agents, we may sensibly employ the same techniques in both the fully cooperative setting and the partially non-cooperative setting. This is beneficial because the nature of our problem anticipates that we will not know the setting in which we operate.
Motivating a Mixed Objective. As a final note, we remark on how this result may be used to properly motivate a reward which, a priori, can only be heuristically justified. Specifically, a very reasonable suggestion would be to combine the rewards in Eq. (11) and Eq. (12) via a convex sum. Prior to our theoretical analyses, it is unclear that the two strategies would be complementary; instead, the objectives could be competing, and this mixed strategy could lead to sub-par performance on both tasks. In light of this, our theoretical results help us understand this heuristic more formally: they suggest the two strategies are, in fact, complementary and outline the assumptions necessary for this to be the case. In contrast, empirical analyses can be much more specific to the data used, among other factors. This, in general, is a key differentiation between the analysis we have provided here and the oft-used appeal to heuristics.
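Concretely, the mixed objective is just a convex sum of the two per-example rewards. A minimal sketch, where `mixed_reward` and the mixing weight `alpha` are our illustrative names rather than anything from the released code:

```python
def mixed_reward(r_object, r_coop, alpha=0.5):
    """Convex combination of the object-identification reward
    (Eq. 11) and the cooperation-identification reward (Eq. 12).
    `alpha` is a hypothetical mixing weight in [0, 1]; alpha=0.5
    corresponds to the simple average used in the experiments."""
    assert 0.0 <= alpha <= 1.0
    return alpha * r_object + (1.0 - alpha) * r_coop
```

With `alpha=1` this recovers the pure object-identification strategy, and with `alpha=0` the pure cooperation-identification strategy.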

Proofs
Here, we provide proofs of all theoretical results. We first remind the reader of some key definitions for easy reference; please see Section 4.1 for additional definitions and context.
Claim. Define O as above and take C to be sufficiently complex so that O∆O ⊆ C. Let d be the VC-dimension of C. Then, for any δ ∈ (0, 1), with probability at least 1 − δ, the stated bound holds for all o, o′ ∈ O, where S iid∼ P_θ and ĉ ∈ arg min_{c ∈ C} cer_S(c).
Proof. For any c ∈ C and δ ∈ (0, 1), we have a standard VC bound; e.g., Thm. 6.11 in Shalev-Shwartz and Ben-David (2014). Thus, it suffices to show that, for any sample S of size m and any choice of hypotheses o, o′ ∈ O, the remaining terms can be controlled. Notice first, by the choice of ĉ as an empirical minimizer, for any c ∈ C we have cer_S(ĉ) ≤ cer_S(c). By the definition of O∆O and its relation to C, for any choice of o, o′ ∈ O, there is some c′ ∈ C such that c′(X) = NC[o(X) ≠ o′(X)] for all X. Recall, NC[·] acts like an indicator function, returning NC for true arguments and CP otherwise. Thus, the stated equality follows by applying the definition of c′, appropriately grouping terms, and then simplifying. Now, the triangle inequality for classification error (Crammer et al., 2007; Ben-David et al., 2007) gives a pointwise bound for any (X, Y) ∈ X × Y and any o, o′ ∈ O. Applying these bounds to the result of Eq. (20) and rearranging terms completes the proof.
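For reference, the triangle inequality for classification error invoked above can be stated in the following standard form (our notation; err(·,·) denotes the probability of disagreement between two labeling functions):

```latex
% Triangle inequality for classification error
% (cf. Crammer et al., 2007; Ben-David et al., 2007).
% For labeling functions f, g, h and
%   err(f, g) = E_X [ 1[ f(X) \neq g(X) ] ],
\[
  \operatorname{err}(f, g) \;\le\; \operatorname{err}(f, h) + \operatorname{err}(h, g),
\]
% which holds because f(X) \neq g(X) implies
% f(X) \neq h(X) or h(X) \neq g(X).
```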
Claim. Let o, o′ ∈ O and θ*, θ ∈ Θ. Suppose the non-cooperative answer-player is effective, and further suppose both o and o′ are α-improved by θ* relative to θ with α > . Then, for any δ > 0, there is n such that for all m ≥ n, with probability at least 1 − δ − γ, the stated bound holds, where S and T are drawn i.i.d.

We first give a lemma.
Now, we proceed with the proof of Prop. 4.1.
Proof. We begin by bounding the probability of a few events of interest. First, we obtain bounds by two applications of Lemma 4.1. Second, by Hoeffding's inequality, for any δ ∈ (0, 1), the relevant empirical proportions concentrate within a deviation C that shrinks at rate (2m)^{-1/2}. Third, by assumption on the non-cooperative agent, we may pick large enough samples S and T for the effectiveness condition to hold. Applying Boole's inequality bounds the probability that any one of these events holds by δ + γ.
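For reference, the form of Hoeffding's inequality used here is the standard one below; the exact constant C in the text may be instantiated differently (e.g., to account for multiple applications via the union bound):

```latex
% Hoeffding's inequality for i.i.d. X_1, ..., X_m in [0, 1]
% with common mean p:
\[
  \Pr\!\left( \Bigl| \tfrac{1}{m}\textstyle\sum_{i=1}^m X_i - p \Bigr| \ge t \right)
  \;\le\; 2\exp\!\left(-2mt^2\right).
\]
% Setting the right-hand side to delta and solving for t gives the
% familiar deviation term of order sqrt(log(1/delta) / m):
\[
  t \;=\; \sqrt{\frac{\ln(2/\delta)}{2m}}.
\]
```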
Considering the complement yields a lower bound on the probability that every one of these events fails to hold; specifically, the lower bound is 1 − δ − γ. Thus, it is sufficient to show the result under assumption of the complement event. To this end, assume the complement. Then, we have directly that the object-identification terms are controlled. So, in the remainder, we concern ourselves with showing −∆_T(o′) ≤ −∆_S(o′) + + O(C). First note that for T a decomposition of ∆_T always holds, and a similar equation holds for S. Then, oer_T(o′) ≤ oer_S(o′) − by assumption, so expanding yields Eq. (34). We also assume |p̂_S − p_NC| ≤ C and |p_NC − p̂_T| ≤ C, so applying these to both sides of Eq. (34) yields Eq. (35). Finally, the bound on |oer_S(o′|NC) − oer_T(o′|NC)| may be applied to both sides of Eq. (35) to attain Eq. (36).

Experimentation
In this section, we empirically study the communication strategies just discussed in a theoretical context. We also give insights into the non-cooperative strategies found in the collected data.

Implementation
Our implementation makes use of the existing framework of De Vries et al. (2017). The primary difference in the game we consider is the added possibility that the answer-player is non-cooperative. As such, many of our model components are based on those proposed by the dataset authors (De Vries et al., 2017; Strub et al., 2017).
Question-Player. The question-player consists of: a hypothesis o, which predicts the goal-object given the object categories, object locations, and the dialogue-history; a hypothesis c, which predicts cooperation given the same information; and the communication policy π_θ, which generates dialogue given the image and the current dialogue-history. Each is modelled by a neural network. The architectures of o and the policy π_θ are identical to the guesser and questioner models described by Strub et al. (2017). We give an overview of the architectures in Figure 4 as well.
Answer-Player. The cooperative answer-player is modeled by a neural network with binary output, dependent only on the goal-object and the most immediate question. Strub et al. (2017) demonstrate, in the cooperative case, that additional features do not improve performance. On the other hand, non-cooperative behaviors may require more complex modeling, so we explore different features for the network modeling the non-cooperative answer-player. During experimentation, we condition on various combinations of input features (described in Section 5.2).

Training. As noted, o is assumed fixed before considering the task of c. In practice, we achieve this through supervised learning (SL) by training o on human games in the GuessWhat?! (GW) corpus. Similarly, the cooperative answer-player is trained via SL on the GW corpus. The non-cooperative answer-player uses our novel corpus of non-cooperative games (see Section 3). Following Strub et al. (2017), we pre-train the communication policy π_θ using SL on the GW corpus. In some cases, π_θ is then taught a specific communication strategy by fine-tuning with RL on simulated dialogue. Dialogue is simulated by sampling Z ∼ Bernoulli(p_NC), drawing an image-object pair uniformly at random from the GW corpus, and allowing the current policy π_θ and the already-trained answer-player indicated by Z to converse for 5 rounds. The hypothesis c is trained via SL on simulated dialogue, simultaneously with the RL phase of π_θ. We do so because c is assumed to minimize sample error in Thm. 4.1; while simultaneous gradient methods only approximate this goal, it is more in line with our assumptions than fixing c a priori. In general, hyper-parameters are fixed for all experiments and are detailed in the code, which is publicly available. When possible, we follow the parameter choices of Strub et al. (2017). As an exception, we shorten the RL phase to 10 epochs.
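As a rough sketch, the dialogue-simulation loop just described might look as follows; all function and variable names here are illustrative assumptions, not the released code's API:

```python
import random

def simulate_dialogue(policy, coop_answerer, noncoop_answerer,
                      image_object_pairs, p_nc=0.5, rounds=5):
    """Sketch of the simulation loop described above: sample the
    answer-player's type Z ~ Bernoulli(p_nc), draw an image-object
    pair uniformly at random, and let the current policy converse
    with the selected answer-player for `rounds` exchanges."""
    z_noncoop = random.random() < p_nc  # Z = 1 -> non-cooperative
    answerer = noncoop_answerer if z_noncoop else coop_answerer
    image, goal_object = random.choice(image_object_pairs)
    history = []
    for _ in range(rounds):
        question = policy(image, history)              # question-player acts
        answer = answerer(goal_object, question, history)
        history.append((question, answer))
    return history, z_noncoop
```

The returned pair (dialogue history, cooperation label Z) is exactly the kind of labeled example on which c can then be trained via SL during the RL phase.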
Recall, the new network c is trained in this phase as well; for c, the learning rate is 1e-4. The new non-cooperative answer-players are trained similarly to the cooperative answer-players, i.e., as in Strub et al. (2017), but we remove early stopping to avoid the need for a validation set. Our non-cooperative corpus is thus used in its entirety for training, since all trained agents are evaluated on novel generated dialogue (see Section 5.2). When training with the GW corpus, we use the original train/val split.
Comparison. Despite some slight deviations from the original GuessWhat?! training setup, our fully cooperative results are quite similar. In Figure 5, we show the error rate on simulated, cooperative test dialogues for our question-player trained solely on object-identification; the precise error rate is 48.8%. For the most similar training and testing setup used by Strub et al. (2017), the question-player achieves an error rate of 46.7%.

Results
We report error for cooperation-identification and object-identification. We use a sample S of simulated dialogue (see Training) between our trained question- and answer-players, using about 23K image-object pairs sampled from the GW test set. The objects/images are fixed for all experiments, but the dialogue will of course change depending on the question-player. Each data point in the figures corresponds to a single run using a specified percentage of cooperative examples; i.e., the answer-player's type is selected by sampling Bernoulli(p_NC), with p_NC set to the desired percentage.
Figure 5: The first three communication strategies (top to bottom in the legend) correspond to using RL with the objective described by Eq. (11), Eq. (12), or an average of both. The last two strategies correspond, respectively, to using no RL to learn a strategy (i.e., supervised learning only) and to making predictions at random. For object-identification error, parentheses indicate the subset of examples on which the error rate is computed. For non-cooperation detection, the error rate is computed on all samples. Overall, the results validate our theoretical argument.

Human Non-Cooperative Strategies. Between qualitative analysis of this data and conversations with the workers, we determined three primary human strategies for deception: spamming, absolute contradiction, and alternate goal objects. When spamming, participants would answer every question with the same answer, e.g., always answering no. Absolute contradiction was when participants determined the correct answer to the question-player's query and then provided its negation. Finally, alternate goal objects describes the strategy of selecting an incorrect object in the image and providing answers as if this object were the correct goal. Of these, spamming is fairly easy to detect automatically, i.e., by searching for games where all answers are identical. We find 19% of the collected non-cooperative dialogues contain entirely spam answers. This, of course, does not account for mixed strategies within a game, but it does indicate the dataset is not dominated by the least complex strategy. Lastly, we remind the reader that some non-cooperative strategies directly describe violations of Gricean maxims. In particular, absolute contradiction and alternate goal objects violate the maxim of quality, while spamming violates the maxim of relevance. Due to the answer-player's simple vocabulary and the greater control given to the question-player (i.e., in directing conversation topic and length), the maxims of manner and quantity are difficult for the answer-player to violate, so it is expected that the observed strategies do not violate these maxims.
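The automatic spam check described above amounts to a one-line test per game. A minimal sketch (the helper names are ours, not from the released code):

```python
def is_spam(answers):
    """Flag the 'spamming' strategy described above: a game in which
    every answer is identical (e.g., always 'no')."""
    return len(set(answers)) == 1

def spam_fraction(games):
    """Fraction of games consisting entirely of spam answers, where
    `games` is assumed to be a list of per-dialogue answer lists."""
    flagged = sum(1 for answers in games if is_spam(answers))
    return flagged / len(games)
```

Applied to the collected non-cooperative dialogues, this style of check yields the 19% figure reported above; note it cannot detect spamming used as part of a mixed strategy within a game.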
Modeling Human Non-Cooperation. We further studied the strategies of the autonomous non-cooperative answer-players. Notice, besides spamming, the human strategies may require knowledge of the full dialogue history as well as of other objects in the image. We tested whether the autonomous answer-player utilized this information by training multiple answer-players with different information access: the first produced answers conditioned only on the goal-object and the most immediate question (1), the next two were additionally conditioned on the full dialogue-history (2) or the full image (3), and the last was conditioned on all of these features (4). We paired these non-cooperative answer-players with a question-player whose communication strategy focused on the object-identification task, i.e., using Eq. (12). Answer-players 2, 3, and 4 induced an object-identification error outside a 95% confidence interval of answer-player 1. In contrast, Strub et al. (2017) found that cooperative answer-players only needed access to the goal-object and the most immediate question to perform well. This result indicates the complexities inherent to deception and suggests that distinct strategies were learned when non-cooperative answer-players had access to more information. In the remainder, we focus on non-cooperative answer-player 2, with access to the full dialogue history; our interpretation for answer-players 1, 3, and 4 is largely similar.
Empirical Validity of Def. 4.2. Our next observation concerns the formal definition of effective given in Def. 4.2 of Section 4. While the limiting property required by the definition is not easy to measure empirically, we observe in Figure 5 that the object-identification error on non-cooperative examples is relatively stable across question-player communication strategies. This fact, that the non-cooperative answer-player exhibits behavior consistent with an effective answer-player, points to the validity of our theory; recall, an effective answer-player is assumed in Prop. 4.1.
Empirical Validity of Proposition 4.1. Finally, the primary conclusion of our theoretical analysis was that communication strategies which focus only on the object-identification task should be effective for both object-identification and cooperation-identification. Figure 5 confirms this. Selecting a communication strategy based on improving object-identification improves object-identification, as expected. Further, on the potentially opposing objective of identifying non-cooperation, this strategy is also effective: it far improves over a random baseline and also improves over the baseline which uses no RL-based strategy. On the other hand, the communication strategy which focuses only on the identification of non-cooperation fails at the opposing task of object-identification. This strategy performs almost as badly as a random baseline when the percentage of non-cooperative examples is large, and it is also consistently worse than the baseline which uses no RL. The mixture of both strategies seems to achieve a good middle ground. Recall, while this strategy may be heuristically intuited, our theoretical results formally justify it as well.

Conclusion
Combining tools from learning theory, reinforcement learning, and supervised learning, we model partially non-cooperative communication strategies in dialogue. Understanding such strategies is essential when building robust agents capable of conversing with parties of varying intent. Our theoretical and empirical findings suggest non-cooperative agents may sufficiently reveal themselves through their non-cooperative communicative behavior.
Although the dialogue game studied is simple, the results have ramifications for more complex dialogue systems. Our theoretical results, in particular, are not limited in this sense and may apply to designing communication strategies in distinct contexts; as noted in Section 4.1.1, the limited assumptions we make facilitate this. For example, classifying intents and asking the right clarification questions is crucial to decision making in dialogue (Purver et al., 2003; DeVault and Stone, 2007; Khalid et al., 2020). Our theory is directly applicable to this setting and could be applied to inform learning objectives for any dialogue agent that asks clarification questions to make a classification. A real-world example is the online-banking setting studied by Dhole (2020), in which the dialogue agent asks clarification questions to decide the type of account a user would like to open. If we suppose some users may be non-cooperative in this context, our theoretical setup is satisfied: there is some feature space (the dialogues), the label space of user intents is finite, users are labeled with a binary indicator of cooperation, and the dialogue agent can control the distribution over which it learns by asking clarification questions. Our theoretical results should apply to many similar dialogue systems that can ask clarification questions or other types of questions. The only stipulations are that the theoretical setup is satisfied (e.g., in the manner just shown) and that our proposed assumptions on the nature of non-cooperative dialogue still hold (i.e., see Def. 4.2 in Section 4.3).
To promote continued research, the collected corpus as well as our code are publicly available.

Ethical Considerations
We have described a research prototype. The proposed dataset does not include sensitive or personal data. Our human-subjects board approved our protocol. Human subjects participated voluntarily and were compensated fairly for their time. The publicly available dataset is fully anonymized.
The proposed architecture relies on pre-trained models, such as word or image embeddings, so any harm or bias associated with these models may be present in ours. We believe general methods that propose to mitigate such harms can resolve these issues.

Figure 1: (Example) The question-player's objective is to identify a secret goal-object (the dining table). The answer-player, who may be cooperative or non-cooperative, gives binary responses to the question-player's queries. In this example, the answer-player is non-cooperative and leads the question-player to an incorrect object (the orange). This is a real example produced by autonomous agents (described in Section 5).

Figure 2: Our new non-cooperative dataset. Left shows the distribution of objects in the collected games; all 80 objects in the original GuessWhat?! corpus occur. Right shows the distribution of question counts per dialogue.

Figure 3: Original GuessWhat?! dataset. Left shows the distribution of objects in the original games. Right shows the distribution of question counts per dialogue, with 114 outliers larger than 27 removed for improved visualization.

Figure 4: Architecture used in our implementation. Object categories and words are represented using one-hot encodings, so an embedding is learned for each object/word. Locations are represented by assigning a common coordinate system to all images and reporting the object center's image-relative coordinates.
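A minimal sketch of the location feature described in this caption; the exact normalization to a common coordinate system is our assumption (the released code may scale differently), and the function name is illustrative:

```python
def encode_location(bbox, image_width, image_height):
    """Report an object's center in image-relative coordinates, as
    described in Figure 4. `bbox` is assumed to be (x, y, w, h) in
    pixels; the center is rescaled to a common [-1, 1] coordinate
    system shared by all images (an assumed normalization)."""
    x, y, w, h = bbox
    cx = (x + w / 2) / image_width   # center x in [0, 1]
    cy = (y + h / 2) / image_height  # center y in [0, 1]
    return (2 * cx - 1, 2 * cy - 1)  # map to [-1, 1]
```

Because the coordinates are image-relative, the same feature vector is comparable across images of different pixel sizes.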
are a growing area of study. We extend, in particular, the cooperative game GuessWhat?!

Table 1: Counts of unique images, objects, words, and questions within the collected non-cooperative games. (+3) gives the count of words with at least 3 occurrences. The first row is our proposed dataset; the second (GW) reports the same statistics computed on the original GuessWhat?! corpus.