Explainable Abuse Detection as Intent Classification and Slot Filling

Abstract
To proactively offer social media users a safe online experience, there is a need for systems that can detect harmful posts and promptly alert platform moderators. In order to guarantee the enforcement of a consistent policy, moderators are provided with detailed guidelines. In contrast, most state-of-the-art models learn what abuse is from labeled examples and as a result base their predictions on spurious cues, such as the presence of group identifiers, which can be unreliable. In this work we introduce the concept of policy-aware abuse detection, abandoning the unrealistic expectation that systems can reliably learn which phenomena constitute abuse from inspecting the data alone. We propose a machine-friendly representation of the policy that moderators wish to enforce, by breaking it down into a collection of intents and slots. We collect and annotate a dataset of 3,535 English posts with such slots, and show how architectures for intent classification and slot filling can be used for abuse detection, while providing a rationale for model decisions.¹


Introduction
The central goal of online content moderation is to offer users a safer experience by taking actions against abusive behaviours, such as hate speech. Researchers have been developing supervised classifiers to detect hateful content, starting from a collection of posts known to be abusive and non-abusive. To successfully accomplish this task, models are expected to learn complex concepts from previously flagged examples. For example, hate speech has been defined as "abusive speech targeting specific group characteristics, such as ethnic origin, religion, gender or sexual orientation" (Warner and Hirschberg, 2012), but there is no clear definition of what constitutes abusive speech.
¹ Accepted at TACL. Our code and data are available at https://github.com/Ago3/PLEAD.
Recent research (Dixon et al., 2018) has shown that supervised models fail to grasp these complexities; instead, they exploit spurious correlations in the data, become overly reliant on low-level lexical features, and flag posts because of, for instance, the presence of group identifiers alone (e.g., women or gay). Efforts to mitigate these problems focus on regularization, e.g., preventing the model from paying attention to group identifiers during training (Kennedy et al., 2020; Zhang et al., 2020); however, they do not seem effective at producing better classifiers (Calabrese et al., 2021). Social media companies, on the other hand, give moderators detailed guidelines to help them decide whether a post should be deleted, and these guidelines also help ensure consistency in their decisions (see Table 1). Models are not given access to these guidelines, and arguably this is the reason for many of their documented weaknesses.
Let us illustrate this with the following example. Assume we are shown two posts, the abusive "Immigrants are parasites", and the non-abusive "I love artists", and are asked to judge whether a new post "Artists are parasites" is abusive. While the post is insulting, it does not contain hate speech, as professions are not usually protected, but we cannot know that without access to moderation guidelines. Based on these two posts alone, we might struggle to decide which label to assign. We are then given more examples, specifically the non-abusive "I hate artists" and the abusive "I hate immigrants". In the absence of any other information, we would probably label the post "Artists are parasites" as non-abusive. The example highlights that 1) the current problem formulation (i.e., given post p and a collection of labelled examples C, decide whether p is abusive) is not adequate, since even humans would struggle to agree on the correct classification, and 2) relying on group identifiers is a natural consequence of the problem definition, and often not incorrect. Note that the difficulty does not arise from a lack of data annotated with real moderator decisions, which would presumably be made according to the policy. Rather, models are not able to distinguish between necessary and sufficient conditions for making a decision based on examples alone (Balkir et al., 2022).

Post: Artists are parasites
Policy: Posts containing dehumanising comparisons targeted to a group based on their protected characteristics violate the policy. Protected characteristics include race, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, serious disease and immigration status.
Old Formulation: Is the post abusive?
Our Formulation: Does the post violate the policy?
Table 1: While it is hard to judge whether a post is abusive based solely on its content, taking the policy into account facilitates decision making. The example is based on the Facebook Community Standards.
In this work we depart from the common approach that aims to mitigate undesired model behaviour by adding artificial constraints (e.g., ignoring group identifiers when judging hate speech) and instead re-define the task through the concept of policy-awareness: given post p and policy P, decide whether p violates P. This entails that models are given policy-related information in order to classify posts like "Artists are parasites"; e.g., they know that posts containing dehumanising comparisons targeted to a group based on their protected characteristics violate the policy, and that profession is not listed among the protected characteristics (see Table 1). To enable models to exploit the policy, we formalize the task as an instance of intent classification and slot filling and create a machine-friendly representation of a policy for hate speech by decomposing it into a collection of intents and corresponding slots. For instance, the policy in Table 1 expresses the intent "Dehumanisation" and has three slots: "target", "protected characteristic", and "dehumanising comparison". All slots must be present for a post to violate a policy. Given this definition, the post in Table 1 contains a target ("Artists") and a dehumanising comparison ("are parasites") but does not violate the policy since it does not have a value for protected characteristic.
We create and make publicly available the Policy-aware Explainable Abuse Detection (PLEAD) dataset, which contains (intent and slot) annotations for 3,535 abusive and non-abusive posts. To decide whether a post violates the policy and explain the decision, we design a sequence-to-sequence model that generates a structured representation of the input by first detecting and then filling slots. Intent is assigned deterministically based on the filled slots, leading to the final abusive/non-abusive classification. Experiments show our model is more reliable than classification-only approaches, as it delivers transparent predictions.

Related Work
We use abuse as an umbrella term covering any kind of harmful content on the Web, as this is accepted practice in the field (Vidgen et al., 2019; Waseem et al., 2017). Abuse is hard to recognise, due to ambiguity in its definition and differences in annotator sensitivity (Ross et al., 2016). Recent research suggests embracing disagreements by developing multi-annotator architectures that capture differences in annotator perspective (Davani et al., 2022; Basile et al., 2021; Uma et al., 2021). While this approach better models how abuse is perceived, it is not suitable for content moderation where one has to decide whether to remove a post and a prescriptive paradigm is preferable (Röttger et al., 2022). Zufall et al. (2020) adopt a more objective approach, as they aim to detect content that is illegal according to EU legislation. However, as they explain, illegal content constitutes only a tiny portion of abusive content, and no explicit knowledge about the legal framework is provided to their model. The problem is framed as the combination of two binary tasks: whether a post contains a protected characteristic, and whether it incites violence. The authors also create a dataset which, however, is not publicly available.
Most existing work ignores these annotation difficulties and models abuse detection with transformer-based models (Vidgen et al., 2021b; Kennedy et al., 2020; Mozafari et al., 2019). Despite impressive F1-scores, these models are black-box and not very informative for moderators. Efforts to shed light on their behaviour reveal that they are good at exploiting spurious correlations in the data but unreliable in more realistic scenarios (Calabrese et al., 2021; Röttger et al., 2021). Although explainability is considered a critical capability (Mishra et al., 2019) in the context of abuse detection, to our knowledge, Sarwar et al. (2022) represent the only explainable approach. Their model justifies its predictions by returning the k nearest neighbours that determined the classification outcome. However, such "explanations" may not be easily understandable to humans, who are less skilled at detecting patterns than transformers (Vaswani et al., 2017).

Table 2: Definition of policy guidelines, intents, and slots associated with them. Example posts and their annotations. Wording in the guidelines which is mapped onto slots is underlined.
In our work, we formalize the problem of policy-aware abuse detection as an instance of intent classification and slot filling (ICSF), where slots are properties like "target" and "protected characteristic" and intents are policy rules or guidelines (e.g., "dehumanisation").While Ahmad et al. (2021) use ICSF to parse and explain the content of a privacy policy, we are not aware of any work that infers policy violations in text with ICSF.State-of-the-art models developed for ICSF are sequence-to-sequence transformers built on top of pretrained architectures like BART (Aghajanyan et al., 2020), and also represent the starting point for our modeling approach.

Problem Formulation
Given a policy for the moderation of abusive content, and a post p, our task is to decide whether p is abusive. We further note that policies are often expressed as a set of guidelines $R = \{r_1, r_2, \ldots, r_N\}$, as shown in Table 2, and a post p is abusive when its content violates any $r_i \in R$. Aside from deciding whether a guideline has been violated, we also expect our model to return a human-readable explanation which should be specific to p (i.e., an extract from the policy describing the guideline being violated is not an explanation), since customised explanations can help moderators make more informed decisions, and developers better understand model behaviour.

Intent Classification and Slot Filling
The generation of post-specific explanations requires detection systems to be able to reason over the content of the policy. To facilitate this process, we draw inspiration from previous work (Gupta et al., 2018) on intent classification and slot filling (ICSF), a task where systems have to classify the intent of a query (e.g., IN:CREATE_CALL for the query "Call John") and fill the slots associated with it (e.g., "Call" is the filler for the slot SL:METHOD and "John" for SL:CONTACT). For our task, we decompose policies into a collection of intents corresponding to the guidelines mentioned above, and each intent is characterized by a set of properties, i.e., slots (see Table 2).
The canonical output of ICSF systems is a tree structure.Multiple representations have been defined, each with a different trade-off between expressivity and ease of parsing.For our use case, we adopt the decoupled representation proposed in Aghajanyan et al. (2020): non-terminal nodes are either slots or intents, the root node is an intent, and terminal nodes are words attested in the post (see Figure 1).In this representation, it is not necessary for all input words to appear in the tree (i.e., in-order traversal of the tree cannot reconstruct the original utterance).Although this ultimately renders the parsing task harder, it is crucial for our domain where words can be associated with multiple slots or no slots, and reasoning over long-term dependencies is necessary to recognise, e.g., a derogatory opinion (see Figure 1).
Importantly, we first identify the slots occurring in a post and then deterministically infer the author's intent, as this renders the output tree an explanation of the final classification outcome rather than a post-hoc justification (Biran and Cotton, 2017).Likewise, since we view the predicted slots as an explanation for intent, we cannot jointly perform intent classification and slot filling, to avoid producing inconsistent explanations (Camburu et al., 2020;Ye and Durrett, 2022).
Hate Speech Taxonomy As a case-study, we model the codebook² for hate speech annotations designed by the Alan Turing Institute (Vidgen et al., 2021b). This policy is very similar to the guidelines that social media platforms provide to moderators and users.³ We obtained an intent from each section of the policy, and associated it with a set of slots (see Table 2). We followed the policy guidelines closely and slots were mostly extracted verbatim from them (see underlined policy terms in Table 2 which give rise to slots). We refrained from renaming or grouping slots to create more abstract labels (e.g., using SL:AbusiveSpeech to replace SL:DehumanisingComparison, SL:ThreateningSpeech, SL:DerogatoryOpinion, and SL:NegativeOpinion). Note that commonsense knowledge is required to decide whether a span is the right filler for a slot. For instance, [SL:ThreateningSpeech dog] would be odd, while [SL:ThreateningSpeech should be shot] wouldn't.
In addition to slots corresponding to different types of hate speech, most intents have a Target who is being abused because of a ProtectedCharacteristic. In contrast to previous work (Sap et al., 2020; Ousidhoum et al., 2019), we distinguish targets from protected groups, as this allows annotators to better infer the target's characteristics from context. A post is deemed abusive (i.e., violates the policy) if and only if all slots for at least one of the (hateful) intents are filled. We also introduce a new intent (i.e., IN:NotHateful) to accommodate all posts that do not violate the policy.
² https://github.com/bvidgen/Dynamically-Generated-Hate-Speech-Dataset
³ e.g., https://transparency.fb.com/en-gb/policies/community-standards/hate-speech
Besides being more machine-friendly, our formulation is advantageous in reducing the number of abusive instances required for training, since a model can learn to predict slots even from non-abusive instances (e.g., slots SL:Target and SL:DehumanisingComparison are also present in the non-abusive "Artists are parasites"). This is particularly important in this domain, since in absolute terms, abusive posts are (luckily) relatively infrequent compared to non-abusive ones (Founta et al., 2018), and most harmful content is detected by moderators and subsequently deleted.
Counter Speech In a few cases, posts might quote hate speech, but the authors clearly distance themselves from the harmful message. To enable models to correctly recognise counter speech (speech that directly counters hate, for example by presenting facts or reacting with humour; Mathew et al., 2019), we introduce a new slot encoding the author's stance (i.e., SL:NegativeStance). For instance, the post "It's nonsense to say that Polish people are nasty" expresses a derogatory opinion which is based on a protected characteristic of a target (i.e., "Polish people"). Even though all slots for the Derogation intent are filled, the post is not abusive as the author is reacting to the hateful message. A post is hateful if and only if there are fillers for all associated slots but not for SL:NegativeStance.
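The decision rule described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping below is partial (the full intent-to-slot mapping is given in Table 2, which is not reproduced here), it includes only slot sets that the text mentions explicitly, and the intent label spellings (e.g., `IN:ProHateCrime`) and the function name `infer_intent` are hypothetical.

```python
# Partial, illustrative mapping from intents to the slots they require.
# Assumption: label spellings are hypothetical; the full mapping is in Table 2.
INTENT_SLOTS = {
    "IN:Dehumanisation": {"SL:Target", "SL:ProtectedCharacteristic",
                          "SL:DehumanisingComparison"},
    "IN:Derogation": {"SL:Target", "SL:ProtectedCharacteristic",
                      "SL:DerogatoryOpinion"},
    "IN:ProHateCrime": {"SL:HateEntity", "SL:Support"},
}


def infer_intent(filled_slots):
    """Deterministically map a collection of filled slot labels to an intent.

    A post is hateful iff all slots of at least one hateful intent are
    filled and SL:NegativeStance (counter speech) is not.
    """
    filled = set(filled_slots)
    if "SL:NegativeStance" in filled:
        return "IN:NotHateful"
    for intent, required in INTENT_SLOTS.items():
        if required <= filled:  # every required slot has a filler
            return intent
    return "IN:NotHateful"


# "Artists are parasites": target and comparison, but no protected characteristic.
print(infer_intent({"SL:Target", "SL:DehumanisingComparison"}))
```

Because the rule is a pure function of the filled slots, the slots themselves serve as the explanation for the predicted intent.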

The PLEAD Dataset
Post selection To validate our problem formulation and for model training we created a dataset consisting of posts with slot annotations (e.g., Target, ThreateningSpeech). We built our annotation effort on an existing dataset associated with the policy guidelines introduced in Section 3 and extended it with additional span-level labels. This dataset (Vidgen et al., 2021b) was created by providing annotators with a classification model trained on 11 other datasets, and asking them to write hateful and non-hateful sentences such that they fooled the model into predicting the opposite class (hateful for non-hateful and vice versa). The process was iterative; we used sentences from the second round onwards, which were annotated with policy violations. The dataset is not balanced, i.e., some policies are violated more frequently than others. To mitigate this and reduce annotation costs, we selected all posts from the less popular policies and a random sample of posts from the most popular ones. We further merged posts annotated with derogation and animosity classes as they are similar, the main difference being the extent to which the negative opinion is implied. The number of selected posts per intent is shown in Table 3. We note that this is a collection of hard examples, as they were written so as to fool a state-of-the-art model. Most non-abusive posts in the dataset have annotations for all slots save one, or they contain counter speech and are easily confusable with hate speech.
Annotation Task We performed two annotation tasks, one for hateful posts and one for non-hateful ones. For hateful posts, annotators were presented with the post, information about the target(s), its characteristics, and the slots. They were then asked to specify the spans of text corresponding to each slot. The dataset already contains annotations about which policy is being violated. For instance, for posts labelled as Pro Hate Crime, annotators look for spans corresponding to HateEntity and Support. Information about the target and its characteristics is also present in metadata distributed with the dataset, and we used it to steer annotators towards a correct reading of the posts. In general the original posts, metadata, and labels are of high quality; Vidgen et al. (2021b) report extremely high agreement for instances created during round 2 and moderate agreement for the following rounds, and disagreements were resolved by an expert annotator.
Each post can contain multiple targets, and each target can be associated with multiple protected characteristics (e.g., black woman indicates both the race and gender of a target). Our annotation scheme assumes that only one opinion is annotated for each post. For instance, the post "I love black people but hate women" contains both a non-hateful and hateful opinion, but we only elicit annotations for the hateful one. Likewise, when a post contains more than one hateful opinion⁴, annotators select the one that better fits the associated policy and target description. Equally, for non-hateful posts, we asked annotators to focus on a single opinion, with a preference for opinions that resemble hateful messages (e.g., the second opinion in "I love cats, but I wish all wasps dead"). Annotators could specify as many spans (and associated slots) as they thought appropriate, including none. If enough elements were selected for a post to violate a rule (e.g., both HateEntity and Support were specified), annotators were asked whether the post contained counter speech (and if so, to specify a span of text for NegativeStance) or derogatory terms used as reclaimed identity terms (e.g., the n-word used by a member of the Black community).
Annotator Selection We recruited annotators resident in English-speaking countries through the Amazon Mechanical Turk crowdsourcing platform.To ensure high-quality annotations we designed a quiz for each policy rule and assessed the fairness of the quiz through a two-phase pilot study: in the first phase annotators were shown the instructions and asked to annotate eight sentences.These annotations were then used as possible correct answers for the quiz or to clarify the instructions.During the second phase, new annotators were shown the updated instructions and asked to pass a quiz consisting of three questions.
The pilot showed that most crowdworkers who understood the task were able to pass the quiz, but no one was able to pass the quiz without understanding the task.Only successful annotators were granted a guideline-specific qualification that allowed them to annotate real instances.To enforce consistency, annotators were prompted to pass a quiz after every 30-post batch, and each batch contained posts associated with one rule.To ensure the data was annotated correctly, we included two control questions in each batch.These were not simple attention checks, but regular posts for which the correct answers were known (from the pilot study).For the annotation of non-hateful posts, we only admitted annotators who had submitted at least 300 annotations for hateful posts, and used the first batch of annotations as a further qualification test.Overall, 75% of annotations were produced by women, 91% by people who identify as straight, and 75% by people with ethnicity other than white.We will release a full breakdown of demographic information with our dataset.
Annotation Quality We collected three annotations per post. We measured inter-annotator agreement as the length of the longest common subsequence (LCS) normalized by the length of the longest span. LCS better reflects our annotation task than more standard measures like Cohen's Kappa (McHugh, 2012), which could be calculated at the token level. However, our annotators do not consider words individually; they primarily make decisions for sequences of tokens. Aside from presuming a different annotation task, token-based Kappa leads to extremely unbalanced data with the un-annotated tokens vastly outnumbering the labels of interest, artificially inflating agreement.
As shown in Table 3, average LCS scores are high for relations like Target and ProtectedCharacteristic, but drop to ∼45% for relations like DerogatoryOpinion. After a quick inspection, we discovered that poor scores were the result of non-overlapping but nevertheless valid annotations. For instance, in "Homosexuality means the end of the family. The next step is legalizing pedophilia.", both "means the end of the family" and "legalizing pedophilia" were annotated as derogatory opinions, but since there is no overlap between the spans, agreement is 0. To better assess the quality of the annotations, an expert annotator manually reviewed 50% of the annotations in each category. As shown in Table 3, the percentage of valid annotations was remarkably high, which leads us to conclude that they are reliable.
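The agreement measure described above can be illustrated with a short Python sketch. It rests on assumptions that the text does not spell out: spans are compared at the token level after whitespace tokenisation, "the longest span" is taken to be the longer of the two spans being compared, and two empty spans are treated as full agreement. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def span_agreement(span_a, span_b):
    """LCS length normalised by the length of the longer span (in tokens)."""
    a, b = span_a.split(), span_b.split()
    longest = max(len(a), len(b))
    # Assumption: two empty spans count as perfect agreement.
    return lcs_length(a, b) / longest if longest else 1.0


# The two valid but non-overlapping spans from the example above score 0.
print(span_agreement("means the end of the family", "legalizing pedophilia"))
```

The example makes the limitation discussed in the text concrete: a low score does not by itself indicate that either annotation is invalid.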
A Note on Ethics To protect annotators from exposure to hateful content, we tagged our project with the "offensive content" label on Amazon Mechanical Turk, included a warning in the task title, and asked for consent twice (first at the end of the information sheet, and then with a one-sentence checkbox). Annotators were presented with small batches of 30 sentences, and invited to take a break at the end of each session. They were also offered the option to quit anytime during the session, or to abandon the study at any point. A reminder to seek help in case they experienced distress was provided at the beginning of each session. The study was approved by the relevant ethics committee (details removed for anonymous peer review).

Abuse Detection Model
ICSF is traditionally modelled as a sequence-to-sequence problem where the input utterance represents the source sequence, and the target sequence is a linearised version of the corresponding tree.
For instance, the linearised version of the tree in Figure 1 [...]. In our setting, where posts can contain multiple sentences, all of which might have to be considered to discover policy violations (e.g., because of coreference), we adopt the conversational approach to ICSF introduced in Aghajanyan et al. (2020). In this setting, all sentences are parsed in a single session (rather than utterance-by-utterance), which is pertinent to our task, as we infer intent after filling the slots, and would otherwise have no information on which slots to carry over (e.g., detecting a target in the first utterance does not constrain the set of slots that could occur in the following ones).
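To make the notion of a linearised tree concrete, the following Python sketch turns a small decoupled tree into a bracketed string. The bracket convention (`[LABEL child ... ]`) follows the style commonly used for ICSF parses and is an assumption here, as is the `linearise` helper; the example tree is built from the "Artists are parasites" post discussed earlier, not from Figure 1.

```python
def linearise(node):
    """Linearise a tree whose nodes are words (str) or (label, children) pairs."""
    if isinstance(node, str):
        return node  # terminal: a word attested in the post
    label, children = node
    inner = " ".join(linearise(child) for child in children)
    return f"[{label} {inner} ]"


# Root is an intent, its children are slots, and leaves are words from the post.
tree = ("IN:NotHateful", [
    ("SL:Target", ["Artists"]),
    ("SL:DehumanisingComparison", ["are", "parasites"]),
])

print(linearise(tree))
# [IN:NotHateful [SL:Target Artists ] [SL:DehumanisingComparison are parasites ] ]
```

In the decoupled representation, words that belong to no slot simply do not appear in the output string.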
Our sequence-to-sequence model is built on top of BART (Lewis et al., 2020). However, in canonical ICSF, BART generates the intent first, and then uses it to look for the slots associated with it. In our case, intent is inferred post-hoc, based on the identified slots, not vice versa. Our model adopts a two-step approach where BART first generates a coarse representation of the input, namely a meaning sketch with coarse-grained slots, and then refines it (Dong and Lapata, 2018). The meaning sketch is a tree where non-terminal nodes are slots, and leaves are <mask> tokens. The sketch for the example in Figure 1 and its refined version are shown in Figure 2. Specifically, we first encode the source tokens $w_1, \ldots, w_{|x|}$ (Figure 2a), where $|x|$ is the number of tokens in post $x$. We then use the resulting hidden states to generate the meaning sketch, computing at each time step $t$ a probability distribution $p^c_t$ over the vocabulary that also depends on the incremental state of the decoder, $s^c_{t-i}$. From these distributions we decode the meaning sketch $z_1, \ldots, z_T$ and refine it (see Figure 2b) by first re-encoding the source tokens jointly with the meaning sketch (which is gold at training time, predicted otherwise). A second decoder then generates a new probability distribution $p^f$ over the vocabulary. At inference time, we use beam search to generate the final representation starting from $p^f$.
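The relation between a meaning sketch and its refined version can be illustrated at the level of the data format alone. The sketch below is not the model: in the paper both steps are performed by BART, whereas here the fillers are supplied by hand. It also assumes a bracketed string format and one `<mask>` leaf per slot, neither of which is confirmed by the text; the helper names are hypothetical.

```python
def meaning_sketch(slots):
    """Build a sketch string with one <mask> leaf per slot (assumed format)."""
    return " ".join(f"[{slot} <mask> ]" for slot in slots)


def fill_sketch(sketch, fillers):
    """Replace <mask> leaves, left to right, with the given filler spans."""
    refined = sketch
    for filler in fillers:
        refined = refined.replace("<mask>", filler, 1)
    return refined


sketch = meaning_sketch(["SL:Target", "SL:DehumanisingComparison"])
print(sketch)
# [SL:Target <mask> ] [SL:DehumanisingComparison <mask> ]
print(fill_sketch(sketch, ["Artists", "are parasites"]))
# [SL:Target Artists ] [SL:DehumanisingComparison are parasites ]
```

Separating the two steps means that the set of slots is fixed before any filler is generated, which is what allows intent to be read off the sketch.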
The training objective is to jointly learn to generate the correct sketch $z$ for post $x$, and the correct tree $t$ from $x$ and $z$. We define our loss function for tuple $(x, z, t)$ over the vocabulary $V$, with $i$ an index over the sequence length.
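One plausible form of such a joint objective, written with the symbols used in this section, is a sum of two token-level cross-entropy terms. This is an assumption offered for orientation only (including the loss name $\mathcal{L}_{gen}$), not necessarily the exact formula used by the authors:

```latex
\mathcal{L}_{gen}(x, z, t) = - \sum_{i} \sum_{v \in V}
  \Big( \mathbb{1}[z_i = v] \, \log p^{c}_{i}(v)
      + \mathbb{1}[t_i = v] \, \log p^{f}_{i}(v) \Big)
```

Here $p^{c}_{i}(v)$ and $p^{f}_{i}(v)$ denote the probabilities that the sketch decoder and the refinement decoder assign to vocabulary item $v$ at position $i$, and $z_i$ and $t_i$ are the gold sketch and tree tokens at that position.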
Although this loss penalises the model for hallucinating or missing slots, it does not discriminate between errors that cause the prediction of a wrong intent, and those that are less relevant (e.g., hallucinating a threat when no target has been detected).In fact, intent is not part of our sequence-to-sequence task since it is only predicted post-hoc.To help the model learn how combinations of slots relate to intents, we include intent classification as an additional training task.
We essentially predict intent starting from the probability of each slot to appear in the sketch (Figure 2a). In other words, we restrict $p^c_t$ to slot tokens (e.g., SL:Target) and normalise it, to obtain a new probability distribution $q_t$ over the set of slots. We aggregate these probabilities by taking the maximum value over the sequence length, thus obtaining a single score for each slot. Since each intent can be modelled as a disjunction of slot combinations (e.g., the NotHateful intent could result from a post containing only a target, or only a target AND a protected characteristic), we pass the slot scores through two linear layers with activation functions, thus obtaining a probability distribution $s_{intent}$ over intents $I$. $W_{s2s} \in \mathbb{R}^{|S| \times |S|}$ models slot-to-slot interactions, while $W_{s2i} \in \mathbb{R}^{|S| \times |I|}$ models interactions between combinations of slots and intents. We then modify our loss to include the new classification loss for an input post with intent $c$. The new loss aims to assign a higher penalty to meaning sketches that lead to intent misclassification. The two linear layers are trained on gold intents and sketches. The layers are then added to the BART-base architecture and kept frozen, so that the model cannot modify its weights to "cover up" wrong sketches by still mapping them to the right intents. Note that this additional classification task is only meant to improve the quality of the generated sketches: intent is added post-hoc in the output tree depending on the slots that have been detected (Figure 2a).
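The slot-score aggregation and the two linear layers can be sketched in plain Python as follows. The text does not say which activation functions are used, so the `tanh` hidden activation and the `softmax` output below are assumptions, as are the function names, the absence of bias terms, and the toy weights in the usage example.

```python
import math


def slot_scores(p_c, slot_ids):
    """Restrict each step's vocabulary distribution to slot tokens,
    renormalise it (q_t), and take the maximum over time for each slot."""
    q = []
    for dist in p_c:
        restricted = [dist[j] for j in slot_ids]
        total = sum(restricted)  # assumed to be > 0
        q.append([value / total for value in restricted])
    return [max(step[k] for step in q) for k in range(len(slot_ids))]


def matvec(vec, weights):
    """Multiply a row vector (length n) by an n x m matrix."""
    return [sum(vec[i] * weights[i][j] for i in range(len(vec)))
            for j in range(len(weights[0]))]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def intent_distribution(scores, w_s2s, w_s2i):
    """Two linear layers: slot-to-slot (|S| x |S|), then slot-to-intent (|S| x |I|)."""
    hidden = [math.tanh(h) for h in matvec(scores, w_s2s)]  # assumed activation
    return softmax(matvec(hidden, w_s2i))                   # assumed output activation


# Toy usage: 2 time steps, a 3-token vocabulary, slot tokens at indices 1 and 2.
p_c = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
scores = slot_scores(p_c, slot_ids=[1, 2])
dist = intent_distribution(scores, [[1, 0], [0, 1]], [[1, 0, 0], [0, 1, 0]])
```

Taking the maximum over time makes each slot score insensitive to where in the sketch the slot is generated.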

Experimental Results
We performed experiments on the PLEAD dataset (Section 4).Rather than learning complex structures with nested slots, we post-process an instance with T targets into T equivalent instances, one per target.Furthermore, we discarded instances with reclaimed identity terms as these are not taken into account by our current modelling of the policy, and are too infrequent (< 0.01%).We split the dataset into training, validation and test set (80%/10%/10%), keeping the same intent distribution over the splits.
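A split that preserves the intent distribution can be obtained by splitting each intent's instances separately. The following Python sketch shows one way to do this; the function name, the rounding scheme, and the fixed seed are assumptions and do not describe the authors' actual procedure.

```python
import random
from collections import defaultdict


def stratified_split(items, key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split items into train/validation/test, stratified by key(item)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[key(item)].append(item)

    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(round(fractions[0] * len(group)))
        n_val = int(round(fractions[1] * len(group)))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy usage with (post_id, intent) pairs.
posts = [(f"a{i}", "IN:Derogation") for i in range(100)]
posts += [(f"b{i}", "IN:NotHateful") for i in range(50)]
train, val, test = stratified_split(posts, key=lambda post: post[1])
```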

Why Explainability?
Our first experiment provides empirical support for our hypothesis that classifiers trained on collections of abusive and non-abusive posts do not necessarily learn representations directly related to abusive speech. We would further argue that if a model performs well on the test set, it has not necessarily learned to detect abuse. For this experiment, we trained RoBERTa (Vidgen et al., 2021b) with five different random seeds, and obtained an F1-score of ∼80% in the binary classification setting with a low standard deviation (see Table 4). We further examined the output of these five RoBERTa models using AAA (Calabrese et al., 2021) and HateCheck (Röttger et al., 2021). AAA stands for Adversarial Attacks against Abuse and is a metric that better captures a model's performance on hard-to-classify posts, by penalising systems which are biased towards low-level lexical features. It does so by adversarially modifying the test data (based on patterns found in the training data) to generate plausible test samples. HateCheck is a suite of functional tests for hate speech detection models. Firstly, we observe high standard deviations across AAA-scores. Models obtained with seeds 4 and 5 have identical F1-scores, but a gap of 12 points on AAA, suggesting that they may be modelling different phenomena. HateCheck tests on group identifiers confirm this hypothesis, as the model trained with random seed 5 misclassifies most neutral (GI_N) or positive (GI_P) sentences containing group identifiers as hateful, while the model trained with seed 4 can distinguish between different contexts and recognises most positive sentences as not hateful. Likewise, the models obtained with seeds 1 and 2 have identical F1-scores, and also similar AAA-scores, but a 20 point gap on the test containing attacks on individuals (IND). This suggests that classifiers tend to model different phenomena (like the presence of group identifiers or violent speech) rather than policy violations and that similarities in terms of F1-score disguise important differences amongst models.

Model Evaluation
Since the output of our model is a parse tree, we represent it as a set of productions and evaluate using F1 (Quirk et al., 2015) on: (a) the entire tree (PF1), (b) the top level (i.e., productions rooted in intent, PF1_I), and (c) the lower level (i.e., productions rooted in correctly detected slots, PF1_L). We also report exact match accuracy for the full tree (EMA_T).
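A minimal Python sketch of production-based F1 is given below. It assumes that a production pairs a non-terminal label with the ordered sequence of its children (child labels for non-terminals, words for terminals); the exact definition used in the evaluation may differ, and the helper names are hypothetical.

```python
def productions(node):
    """Collect the productions of a tree given as (label, children) pairs."""
    label, children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    found = {(label, rhs)}
    for child in children:
        if not isinstance(child, str):
            found |= productions(child)
    return found


def production_f1(predicted, gold):
    """F1 between the production sets of a predicted and a gold tree."""
    pred, ref = productions(predicted), productions(gold)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


gold = ("IN:NotHateful", [
    ("SL:Target", ["Artists"]),
    ("SL:DehumanisingComparison", ["are", "parasites"]),
])
# The prediction gets both slots right but truncates one filler.
predicted = ("IN:NotHateful", [
    ("SL:Target", ["Artists"]),
    ("SL:DehumanisingComparison", ["parasites"]),
])
score = production_f1(predicted, gold)  # 2 of 3 productions match on each side
```

Under this definition, the top-level score corresponds to the production rooted in the intent and the lower-level score to the productions rooted in slots.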
We compare our model (BART+MS+I) to ablated versions of itself, including a BART model without meaning sketches or an intent-aware loss, and a variant with meaning sketches but no intent-aware loss (BART+MS). We also compare against two baselines which encode the input post with an LSTM or BERT, respectively, and then use a feedforward neural network to predict which slot labels should be attached to each token (Weld et al., 2021). The LSTM was initialised with GloVe embeddings (Pennington et al., 2014). For BERT, we concatenate the hidden representation of each token to the embedding of the CLS token, and compute the slots associated with a word as the union of the slots predicted for the corresponding subwords. We enhance these baselines by modeling slot prediction as a multi-label classification task (i.e., one-vs-one) in line with Pawara et al. (2020). For each pair of slots $\langle s_1, s_2 \rangle$, we introduce an output node and use gold label 1 (−1) if $s_1$ ($s_2$) is the right tag for the token, and 0 otherwise.
As an upper bound, we report F1 score by comparing the annotations of one crowdworker against the others.Recall that annotation of hateful posts was simplified by asking participants to look for specific slots; as a result, some scores are only available for non-hateful instances where annotators could select from all the slots.
Our results are summarized in Table 5 (scores are means over five runs; hyperparameter values can be found in our code documentation). Our model achieves a production F1 of 52.96%, outperforming all comparison models. When looking at the top level of the tree (PF1_I), model performance on hateful instances (H) is considerably inferior to non-hateful ones (NH). This is not surprising, since hateful instances can be represented with ∼4 sketches while non-hateful ones are noisier and can present a larger number of slot combinations. Model performance at filling correctly detected slots for hateful and non-hateful instances is comparable (61.93% and 62.66%), approaching the human ceiling. EMA_T scores are slightly higher for the non-hateful class, but this is not unexpected since hateful trees all have at least three slots, while many non-hateful ones have only one (i.e., a target).
Our model achieves an F1 of 57.17% on intent classification. In the binary setting, F1 jumps to 74.84%, suggesting that some mistakes on intent classification are due to the model confusing different hateful intents. As with all other models in the literature, the AAA-score is just below random guessing (Calabrese et al., 2021). Overall, improvement with respect to baselines is significant for all metrics. We also observe that both sketches and our intent-aware loss have a large impact on the quality of the generated trees, and the intent predictions based on them. PF1_L scores for BART+MS are higher, but these are computed only on correctly detected slots; the proportion of correct slots detected by this model is lower than for the full model (see PF1_I for BART+MS vs. BART+MS+I).

Error Analysis
We sampled 50 instances from the test set, and manually reviewed the trees generated by the five variants of our model (one per random seed). Overall, we observe that error patterns are consistent among all variants. In posts containing multiple targets, a recurrent mistake is to link the hateful expression to the wrong target, especially if the mention of the correct target is implicit (see example 1 in Table 6).
We also see cases where the parsing is coherent with the selected target, but this prevents the model from detecting hateful messages towards a different target (e.g., "l3zv0z" in example 2). Some mistakes stem from difficulty in distinguishing DerogatoryOpinion from other slots, as in example 3 where the opinion is misclassified as a dehumanising comparison. This is a reasonable mistake, as comparisons to criminals are considered dehumanising according to the policy (and therefore annotation instructions) and are often annotated as DehumanisingComparison in the dataset. We also observe that for posts correctly identified as non-hateful, the model tends to miss protected characteristics even when they occur (example 4). The model also hallucinates values for slots due to stereotypes prominent in the dataset. In example 5, "women" is mistakenly generated as the target of a sentence about sexual promiscuity (of couples), and in example 6 the model hallucinates "apes" as the animal in the comparison. In future work, hallucinations could be addressed by explicitly constraining the decoder to the input post.
Finally, we analysed the behaviour of the model in AAA scenarios, and observed that it struggles with counter speech, as the negative stance is often expressed with a negative opinion about the proponent of the hateful opinion, and therefore tagged as DerogatoryOpinion. Adding words that correlate with the hateful class to non-hateful posts succeeds in misleading our model; non-hateful instances often differ from hateful ones by a slot, rendering distractors more effective. However, for the same reason, the addition of such words can also flip the label (e.g., adding "#kill" to a post containing a target and a protected characteristic), and the model is incorrectly penalised by AAA (which assumes the label remains the same).
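The sensitivity to a single slot follows from the fact that the intent is inferred deterministically from the predicted slots (Figure 2). The sketch below illustrates this with a toy policy. The intent names, the required slot sets, and the slot labels Target, ProtectedCharacteristic, and ThreateningSpeech are assumptions made for illustration and do not reproduce the actual policy or label inventory.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical policy: each hateful intent requires a specific set of slots.
POLICY: Dict[str, FrozenSet[str]] = {
    "Derogation": frozenset(
        {"Target", "ProtectedCharacteristic", "DerogatoryOpinion"}
    ),
    "Dehumanisation": frozenset(
        {"Target", "ProtectedCharacteristic", "DehumanisingComparison"}
    ),
    "Threatening": frozenset(
        {"Target", "ProtectedCharacteristic", "ThreateningSpeech"}
    ),
}
NOT_HATEFUL = "NotHateful"  # hypothetical name for the default intent


def infer_intent(predicted_slots: Iterable[str]) -> str:
    """Return the first intent whose required slots are all present,
    or the non-hateful intent when no guideline is satisfied."""
    slots = set(predicted_slots)
    for intent, required in POLICY.items():
        if required <= slots:
            return intent
    return NOT_HATEFUL


# A post with only a target and a protected characteristic violates nothing.
before = infer_intent({"Target", "ProtectedCharacteristic"})
# A distractor that is parsed as threatening speech completes a guideline.
after = infer_intent({"Target", "ProtectedCharacteristic", "ThreateningSpeech"})
```

Under this toy policy, a single added slot moves the post from the non-hateful intent to a hateful one, which is why an inserted word such as "#kill" can legitimately change the label.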

Discussion
The overwhelming majority of approaches to detecting abusive language online are based on training supervised classifiers with labelled examples. Classifiers are expected to learn what abuse is based on these examples alone. We depart from this approach, reformulate the problem as policy-aware abuse detection and model the policy explicitly as an Intent Classification and Slot Filling task. Our experiments show that conventional black-box classifiers learn to model one of the phenomena represented in the dataset, but small changes such as different random initialisation can lead the very same model to learn different ones. Our ICSF-based approach guides the model towards learning policy-relevant phenomena, and this can be demonstrated by the explainable predictions it produces.
We acknowledge that policies for hate speech, like most human-developed guidelines, leave some room for subjective interpretation. For instance, moderators might disagree on whether a certain expression represents a dehumanising comparison. However, the more detailed the policy is (e.g., by listing all possible types of comparisons), the less freedom moderators will have to make subjective judgments. The purpose of policies is to make decisions as objective as possible, and our new problem formulation shares the same goal.
While our model still makes errors, the proposed formulation allows us to precisely pinpoint where these errors occur and design appropriate mitigation strategies. This is in stark contrast with existing approaches, where instability is the consequence of spurious correlations in the data, it is hard to isolate errors and, consequently, mitigation strategies are often not grounded in human knowledge about abuse. For example, our analysis showed that our model can sometimes fail to generate the correct tree by mixing the targets and sentiments of multiple opinions. This suggests that it would be useful to have nested slots, e.g., a derogatory opinion as the child of its corresponding target. This could also help the model learn the difference between derogatory opinions (nested within a target node) and negative stance (nested within an opinion node), facilitating the detection of counter speech examples. Introducing a slot for the proponent of an opinion could also help, as the model would then recognise when a hateful opinion is expressed by someone other than the author.
Finally, we would like to emphasize that our modeling approach is not policy-specific and could be adapted to other policies used in industry or academia. Our formulation of abuse detection and the resulting annotation are compatible with more than one dataset (e.g., Vidgen et al. (2021a)) and could be easily modified, e.g., by removing or adding intents and slots. Extending our approach to other policies would require additional annotation effort; however, this would also be the case in the vanilla classification setting if one were to use a different inventory of labels.

Conclusions
In this work we introduced the concept of policy-aware abuse detection, which we argue allows us to develop more interpretable models and yields high-quality annotations to learn from. Humans that agree on the interpretation of a post also agree on its classification label. Our new task requires models to produce human-readable explanations that are specific to the input post. To enable models to reason over the policy, we formalise the problem of abuse detection as an instance of ICSF where each policy guideline corresponds to an intent, and is associated with a specific set of slots. We collect and release an English dataset where posts are annotated with such slots, and design a new neural model by adapting and enhancing ICSF architectures to our domain. The result is a model which is more reliable than existing approaches, and more "rational" in its predictions and mistakes. In the future, we would like to investigate whether and how the explanations our model produces influence moderator decisions.

Figure 1: Decoupled representation for a post. Example post: "Hitler was right all along. We are witnessing it at home EVERY day." Annotated slots: [HateEntity Hitler], [Support was right all along].

Figure 2: We first generate the meaning sketch based on the input post (a), and then refine it by filling the slots (b). The intent (in red) is inferred deterministically based on predicted slots y_slots. The model is trained with an intent-aware loss (a).

Table 3: Number of occurrences per slot for each intent; inter-annotator agreement measured by Longest Common Subsequence score (LCS), and percentage of annotations approved by expert (A).

Table 4: Performance of RoBERTa on PLEAD (measured by F1 and AAA) and HateCheck functionality tests for neutral (GI_N) and positive (GI_P) group identifiers and attacks on individuals (IND).

Table 6: Posts that are incorrectly parsed (but not necessarily incorrectly classified) by our model.