A Neighbourhood Framework for Resource-Lean Content Flagging

We propose a novel framework for cross-lingual content flagging with limited target-language data, which significantly outperforms prior work in terms of predictive performance. The framework is based on a nearest-neighbour architecture. It is a modern instantiation of the vanilla k-nearest neighbour model, as we use Transformer representations in all its components. Our framework can adapt to new source-language instances, without the need to be retrained from scratch. Unlike prior work on neighbourhood-based approaches, we encode the neighbourhood information based on query--neighbour interactions. We propose two encoding schemes and we show their effectiveness using both qualitative and quantitative analysis. Our evaluation results on eight languages from two different datasets for abusive language detection show sizable improvements of up to 9.5 F1 points absolute (for Italian) over strong baselines. On average, we achieve 3.6 absolute F1 points of improvement for the three languages in the Jigsaw Multilingual dataset and 2.14 points for the WUL dataset.


Introduction
Online content moderation is an increasingly important problem: small-scale websites and large-scale corporations alike strive to remove harmful content from their platforms (Vidgen et al., 2019; Pavlopoulos et al., 2017; Wulczyn et al., 2017). This is partly in anticipation of proposed legislation, such as the Digital Services Act (Commission, 2020) in the EU and the Online Harms Bill (Government, 2020) in the UK. Moreover, the lack of content moderation can have a significant impact on businesses (e.g., Parler was denied server space), on governments (e.g., the U.S. Capitol Riots), and on individuals, e.g., because hate speech is linked to self-harm (Jürgens et al., 2019).
A key challenge when developing content moderation systems is the lack of resources for many languages (other than English). With this in mind, here we aim to create a content flagging model for a target language with limited annotated data by transferring knowledge from another dataset in a different language, for which a large amount of training data is available.
Various approaches have been proposed in the literature to address the lack of enough training data in the target language. A popular approach is to fine-tune large-scale pre-trained multilingual language models such as XLM (Conneau and Lample, 2019), XLM-R (Conneau et al., 2020), or mBERT (Devlin et al., 2019) on the target dataset (Glavaš et al., 2020; Stappen et al., 2020). In order to incorporate knowledge from the source dataset, a sequential adaptation technique can be used that first fine-tunes a multilingual language model (LM) on the source dataset, and then on the target dataset (Garg et al., 2020). There are also existing approaches for mixing the source and the target datasets (Shnarch et al., 2018) in different proportions, followed by fine-tuning the multilingual language model on the resulting dataset. While sequential adaptation introduces the risk of forgetting the knowledge from the source dataset, such mixing methods are driven by heuristics that are effective, but not systematic. Crucially, as we argue in this paper, this is because they do not model the relationship between the source and the target datasets. Another problem arises if we consider that examples with novel labels can be added to the source dataset. This is a particularly pertinent issue for content moderation, as efforts to create new resources often lead to the introduction of new label inventories or taxonomies (Banko et al., 2020). In that case, model re-training becomes a requirement in order to be able to map the new label space to the output layer that is used for fine-tuning.
We propose a Transformer-based k-Nearest Neighbour (kNN+) framework, a one-stop solution and a significant improvement over the vanilla k-NN model (we use the '+' superscript to indicate this improvement). Our framework addresses the above-mentioned challenges, which are not easy to solve via simple fine-tuning of pre-trained language models. Moreover, to the best of our knowledge, our framework is the first attempt to use k-NN for transfer learning for the task of abusive content detection.
Given a query, which is a training or an evaluation data point from the target dataset, kNN+ retrieves its nearest neighbours using a language-agnostic sentence embedding model. Then, it constructs Transformer representations for the query and for its neighbours. After that, it computes interaction features, which are based on the interactions of the representations of the query with each of its neighbours. We borrow this terminology from information retrieval, as the interactions between a query and a document in deep matching models are computed in a similar way (Guo et al., 2016). At training time, the interaction features are optimised using supervised training signals computed from the label of the query and the neighbour, so that the features indicate their level of agreement. For example, if the query and its neighbour are both abusive, they agree on the labels. Thus, the interactions help the model learn a semantic similarity space in terms of labels. The framework further uses a self-attention mechanism to aggregate the interaction features from all the neighbours, and it uses the aggregated representation to classify the input query. This representation is computed from the interaction features and indicates the agreement of the query with the neighbourhood. As the predictions are made based on aggregated interaction features only, kNN+ can easily incorporate new examples with unseen labels without requiring re-training. The conceptual framework is shown in Figure 1; it is robust to neighbours with incorrect labels, as it can learn to disagree with them as part of its training process.
We instantiate two variants of our framework: Cross-Encoder (CE) kNN+ and Bi-Encoder (BE) kNN+. CE kNN+ concatenates the query and a neighbour, and passes that sequence through a Transformer to obtain interaction features. BE kNN+ computes representations of the query and of a neighbour by passing them individually through a Transformer, and computes interaction features from these representations. BE kNN+ is more efficient than CE kNN+, but it does not yield the same performance gains. Both models outperform six strong baselines both in cross-lingual and in multilingual settings. Our contributions can be summarised as follows:
• We address cross-lingual transfer learning for content flagging with limited labelled data from the target language.
• We demonstrate that neighbourhood methods, such as kNN, are viable candidates for approaching content flagging.
• We propose a novel framework, kNN+, which, unlike a vanilla kNN, models the relationship between a data point and each of its neighbours to represent the neighbourhood, using language-agnostic Transformers.
• Our evaluation results on eight languages from two different datasets for abusive language detection show sizable improvements of up to 9.5 F1 points absolute (for Italian) over strong baselines. On average, we achieve improvements of 3.6 F1 points for the three languages in the Jigsaw Multilingual dataset, and of 2.14 F1 points on the WUL dataset.
Below, we review recent work on abusive language detection and neighbourhood approaches.

Abusive Content Detection
Most approaches for abusive language detection use text classification models, which have been shown to be effective for related tasks such as sentiment analysis. This includes SVMs (MacAvaney et al., 2019), CNNs (Georgakopoulos et al., 2018; Badjatiya et al., 2019; Agrawal and Awekar, 2018), LSTMs (Arango et al., 2019; Agrawal and Awekar, 2018), BiLSTMs with attention (Agrawal and Awekar, 2018), Capsule networks (Srivastava et al., 2018), and fine-tuned Transformers (Glavaš et al., 2020). All these approaches focus on single data points, while we also model their neighbourhoods. See  for a recent survey of abusive language detection.
Several papers studied the bias in hate speech detection datasets and criticised the use of within-dataset evaluations (Arango et al., 2019; Davidson et al., 2019; Badjatiya et al., 2019), as this is not a realistic setting, and findings about generalisability based on such experimental settings are questionable. A more realistic and robust evaluation setting was investigated by Glavaš et al. (2020), who showed the performance of online abuse detectors in a zero-shot cross-lingual setting. They fine-tuned several multilingual language models, such as XLM-RoBERTa and mBERT (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020; Sanh et al., 2019), on English datasets and observed how these models transfer to datasets in five other languages. Other cross-lingual abuse detection efforts include using Twitter user features for detecting hate speech in English, German, and Portuguese (Fehn Unsvåg and Gambäck, 2018), cross-lingual embeddings (Ranasinghe and Zampieri, 2020), and using multilingual lexicons with deep learning (Pamungkas and Patti, 2019). A lot of relevant research was also done as part of the OffensEval shared task at SemEval (Zampieri et al., 2019a,b, 2020; Rosenthal et al., 2021).
While understanding the performance of zero-shot cross-lingual models is interesting from a natural language understanding point of view, in reality, a platform willing to deploy an abusive language detection system can almost always provide some examples of malicious content for training. Thus, a few-shot or a low-shot scenario is more realistic, and we approach cross-lingual transfer learning from that perspective. We hypothesise that a nearest-neighbour model is a reasonable choice in such a scenario, and we propose several improvements over such a model.

Neighbourhood Models
kNN models have been used for a number of NLP tasks such as part-of-speech tagging (Daelemans et al., 1996) and morphological analysis (Bosch et al., 2007), among many others. Their effectiveness is rooted in the underlying similarity function, and thus non-linear models such as neural networks can bring an additional boost to their performance. More recently, Kaiser et al. (2017) used a similarly differentiable memory that is learned and updated during training and is then applied to one-shot learning tasks.  introduced k-NN retrieval for improving language modelling, which Kassner and Schütze (2020) extended to question answering (QA). Guu et al. (2020) proposed a framework for retrieval-augmented language modelling (REALM), showing its effectiveness on three Open QA datasets.  explored retrieval-augmented generation for a variety of tasks, including fact-checking and QA, among others. Fan et al. (2021) introduced a k-NN framework for dialogue generation using pre-trained embeddings enhanced by learning an alignment function for retrieval from a set of external multi-modal evidence sources. Finally, Wallace et al. (2018) proposed a deep k-NN approach for interpreting the predictions from a neural network for the task of natural language inference.
All the above approaches use neighbours as additional information sources, but do not consider the interactions between the neighbours as we do. Moreover, there is no existing work on using deep kNN models for cross-lingual abusive content detection.

kNN+ Framework
We present our kNN+ framework below.

Problem Setting
Our goal is to learn a content flagging model from source and target datasets in different languages with different label spaces; see Figure 1 for an illustration of our framework.

Figure 2: Two variants based on the two encoding schemes used in our proposed kNN+, where $M_{\text{feature}}$ is the interaction feature computation model, $q$ is the query, and $c_i$ is a candidate neighbour. In the Bi-Encoder setup (Figure 2a), the query and each candidate are encoded separately using the same $M_{\text{feature}}$ model. Afterwards, in order to obtain a joint vector representation for each query-candidate tuple, the query's representation ($\text{rep}_q$) is concatenated with each candidate's representation ($\text{rep}_i$) along with the absolute element-wise difference between the two. In the Cross-Encoder setting (Figure 2b), the query and each candidate are passed together through the $M_{\text{feature}}$ model, which produces the joint vector representation ($\text{rep}_i$) for the query-candidate tuple. Finally, we pass each joint representation through (i) a linear layer to predict the label agreement between the query and the candidate, and (ii) a self-attention layer followed by a linear projection layer to predict the label of the example.
Formally, we assume access to a source dataset for content flagging, $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, where $x_i^s$ is a textual content and $y_i^s \in Y$. Further, a target dataset is given, $D_t = \{(x_j^t, y_j^t)\}_{j=1}^{n_t}$, with binary labels $y_j^t \in \{\text{flagged}, \text{neutral}\}$. The source dataset is resource-rich, i.e., $n_s \gg n_t$, and label-rich, i.e., $|Y| > 2$. The label space of $D_s$, $Y = \{\text{hate}, \text{insult}, \ldots, \text{neutral}\}$, contains fine-grained labels for different levels of abusiveness along with the neutral label. We convert the label space of $D_s$ to align it with the label space of $D_t$: every non-neutral label is mapped to flagged, i.e., $x \mapsto \text{flagged}$ for all $x \in Y$ with $x \neq \text{neutral}$, while neutral is kept as is. Note that this conversion is needed at training time to compute label agreement in our proposed neighbourhood framework. However, at inference time, a conversion of the label space of $D_s$ is not needed, as the label of an item from $D_t$ is predicted using the latent representations of the neighbours, rather than their labels. This process is described in more detail in Section 3.3.

Why a Neighbourhood Framework?
A vanilla kNN predicts a content label by aggregating the labels of k similar training instances. To this end, it uses the content as a query to retrieve neighbours from the training instances. We hypothesise that this retrieval step can be performed in a cross-lingual transfer learning scenario. In our setting, the queries are target dataset instances, and we index the source dataset for retrieval.
Note that the target instances could also be considered as neighbours for retrieval, but we exclude them, as the target dataset is small.
For a vanilla kNN model, the queries and the documents are represented using lexical features, and thus the model suffers from the curse of dimensionality (Radovanović et al., 2009). Moreover, the prediction pipeline becomes inefficient if the source dataset is considerably larger than the target dataset, as is our case here (Lu et al., 2012). Finally, for a vanilla kNN, there is no straightforward way to map between different languages for cross-lingual transfer.
We address these problems by using a Transformer-based multilingual representation space (Feng et al., 2020) that computes the similarity between two sentences expressed in different languages. We assume that efficiency issues are less critical here for two main reasons: (i) retrieval using dense vector sentence embeddings has become significantly faster with recent advances (Johnson et al., 2021), and (ii) the number of labelled source data examples is not expected to go beyond millions, because obtaining annotations for multilingual abusive content detection is costly and the annotation process can be very harmful for the human annotators as well (Schmidt and Wiegand, 2017;Waseem, 2016;Malmasi and Zampieri, 2018;Mathur et al., 2018).
Even though multilingual language models can make the vanilla kNN model a viable solution for our problem, it is hard to make predictions with that model. Once a neighbourhood is retrieved, a vanilla kNN uses a majority voting scheme for prediction, as the example in Figure 1 shows. Given a flagged Turkish query, our framework retrieves two neutral and one flagged English neighbours. Here, the majority voting prediction based on the neighbourhood is incorrect. The problem is this: A non-parametric vanilla kNN cannot make a correct prediction with an incorrectly retrieved neighbourhood. Thus, we propose a learned voting strategy to alleviate this problem.
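The failure mode described above can be made concrete with a minimal sketch. The function name `majority_vote` and the label strings are illustrative choices, not part of the framework; the neighbourhood mirrors the Figure 1 example of two neutral and one flagged neighbour.

```python
from collections import Counter

def majority_vote(neighbour_labels):
    """Vanilla kNN prediction: the most frequent label among the retrieved neighbours."""
    return Counter(neighbour_labels).most_common(1)[0][0]

# Figure 1 scenario: a flagged query with two neutral and one flagged neighbour.
gold_label = "flagged"
neighbour_labels = ["neutral", "flagged", "neutral"]
prediction = majority_vote(neighbour_labels)
print(prediction, prediction == gold_label)
```

Since the vote has no trainable parameters, nothing in this procedure can be adjusted to recover from such a neighbourhood, which is the motivation for a learned voting strategy.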

The Architecture of kNN+
We describe our kNN+ framework (shown in Figure 2), including the training and the inference procedures. The framework includes neighbourhood retrieval, interaction feature computation and aggregation, and a multi-task learning objective function for optimisation, which we describe in detail below.

Neighbourhood Retrieval
We construct a retrieval index $R$ from the given source dataset, $D_s$. For each given example $x_i^s \in D_s$, we compute its dense vector representation, $\mathbf{x}_i^s = M_{\text{retriever}}(x_i^s)$. Here, $M_{\text{retriever}}$ is a multilingual sentence embedding model that we use for retrieval. There are several multilingual sentence embedding models that we could use as $M_{\text{retriever}}$ (Artetxe and Schwenk, 2019; Reimers and Gurevych, 2020; Chidambaram et al., 2019; Feng et al., 2020). In this work, we use LaBSE (Feng et al., 2020), a strong multilingual sentence matching model, which has been trained with parallel sentence pairs from 109 languages. The model is trained on 17 billion monolingual sentences and 6 billion bilingual sentence pairs and it has achieved state-of-the-art performance for a parallel text retrieval task proposed by Zweigenbaum et al. (2017). We use $\mathbf{x}_i^s$ as a key, and we assign $(x_i^s, y_i^s)$ as its corresponding value. Our retrieval index $R$ stores all the key-value pairs computed from the source dataset.
Assume we have a training data point, $(x_j^t, y_j^t) \in D_t$, from the target dataset. We consider the content $x_j^t$ as our query $q$, i.e., $q = x_j^t$. We compute a vector representation of the query, $\mathbf{q} = M_{\text{retriever}}(q)$. We use $\mathbf{q}$ to score each key $\mathbf{x}_i^s$ of $R$ using cosine similarity, i.e., $\cos(\mathbf{q}, \mathbf{x}_i^s)$. We sort the items in $R$ in descending order of the scores of the keys, and we take the values of the top-$k$ items to construct the neighbourhood of $q$, $N_q = \{(c_1, l_1), (c_2, l_2), \ldots, (c_k, l_k)\}$. Thus, each neighbour is a tuple of a content and its label from the source dataset. We convert fine-grained neighbour labels to binary labels (flagged, neutral) as described in Section 3.1, to align the label space with the target dataset. Nevertheless, the original fine-grained labels of the neighbours can be used to get an explanation at inference time, as this is one of the core features of kNN-based models. However, our focus is on combining these models with Transformer-based ones. We leave the investigation of the explainability characteristics of kNN+ for future work.
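The retrieval step can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the two-dimensional vectors stand in for sentence embeddings, the exhaustive sort stands in for a dense vector index, and the names `cosine` and `retrieve` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve(query_vec, index, k):
    """index: list of (key_vector, (content, label)); returns the values of the top-k keys."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [value for _, value in ranked[:k]]

# Toy key-value pairs; the vectors are made up and stand in for embeddings of source texts.
index = [
    ([1.0, 0.0], ("source text A", "insult")),
    ([0.0, 1.0], ("source text B", "neutral")),
    ([0.8, 0.6], ("source text C", "hate")),
]
neighbourhood = retrieve([0.9, 0.1], index, k=2)
print(neighbourhood)
```

Each returned value keeps the original fine-grained label next to the content, so the binary conversion can be applied afterwards without losing the source annotation.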

Interaction Feature Modelling
As discussed in Section 3.2, the neighbourhood retrieval process might lead to prediction errors. Thus, we propose a learned voting strategy to mitigate this. Our proposed strategy depends on how $q$ relates to its neighbourhood $N_q$. To model this relationship, we compute the interaction features between $q$ and the content of its $j$-th neighbour, $c_j \in N_q$. We obtain a set of $k$ interaction features from $k$ neighbours, and we optimise them using query and neighbour labels.
Similarly to Reimers and Gurevych (2019), we apply two encoding schemes to compute the interaction features: a Cross-Encoder (CE) and a Bi-Encoder (BE). Under our kNN+ framework, we refer to the schemes as CE kNN+ for CE, and BE kNN+ for BE. The BE kNN+ is computationally inexpensive, while the CE kNN+ is more effective. We provide a justification for this as we describe the schemes in the following paragraphs.
For the CE kNN+ implementation (see Figure 2b), we first form a set of query-neighbour pairs $S_{ce} = \{(q, c_1), (q, c_2), \ldots, (q, c_k)\}$ by concatenating $q$ with the content of each of its neighbours. Then, we obtain the output representation, $\text{rep}_j = M_{\text{feature}}(q, c_j)$, of each $(q, c_j) \in S_{ce}$ from a pre-trained multilingual language model $M_{\text{feature}}$. In this way, we create a set of interaction features, $I_{ce} = \{\text{rep}_1, \text{rep}_2, \ldots, \text{rep}_k\}$, from $q$ and its neighbourhood. Throughout this paper, the [CLS] token representation of $M_{\text{feature}}$ is taken as its final output. We experiment with several implementations of $M_{\text{feature}}$. Note that the interaction feature model $M_{\text{feature}}$ is different from the neighbourhood retrieval model $M_{\text{retriever}}$. We optimise interaction features from $M_{\text{feature}}$, and we leave retrieval model optimisation for future work.
For the BE kNN+ scheme (see Figure 2a), we obtain the output representations of $q$ and each of the neighbours individually from $M_{\text{feature}}$. Given the representation of the query, $\text{rep}_q = M_{\text{feature}}(q)$, and the representation of its $j$-th neighbour, $\text{rep}_j = M_{\text{feature}}(c_j)$, we model their interaction features by concatenating them along with their vector difference. The interaction features obtained for the $j$-th neighbour are $(\text{rep}_q, \text{rep}_j, |\text{rep}_q - \text{rep}_j|)$, and we construct a set of interaction features $I_{be}$ from all the neighbours of $q$. We use the vector difference $|\text{rep}_q - \text{rep}_j|$ along with the content vectors $\text{rep}_q$ and $\text{rep}_j$ following the work of Reimers and Gurevych (2019). They trained a sentence embedding model using a Siamese neural network architecture with Natural Language Inference (NLI) data, and they compared several ways of combining the representations $u$ and $v$ of two sentences into features. Their empirical analysis showed that $(u, v, |u - v|)$ works the best for NLI data, and thus we apply this in our framework. We plan to explore other options in future work.
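As a small illustration of the bi-encoder feature construction, the sketch below concatenates two vectors with their absolute element-wise difference. The function name `be_interaction` and the toy two-dimensional vectors are assumptions made for the example; real representations would come from the encoder.

```python
def be_interaction(rep_q, rep_j):
    """Concatenate (rep_q, rep_j, |rep_q - rep_j|) into a single feature vector."""
    abs_diff = [abs(a - b) for a, b in zip(rep_q, rep_j)]
    return list(rep_q) + list(rep_j) + abs_diff

# Made-up representations of a query and of one neighbour.
rep_q = [0.5, -1.0]
rep_j = [0.25, 1.0]
features = be_interaction(rep_q, rep_j)
print(features)
```

The resulting vector is three times as long as a single representation, so the interaction feature dimensionality differs between the bi-encoder and the cross-encoder schemes.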
Both the cross-encoder and the bi-encoder architectures were shown to be effective in a wide variety of tasks including Semantic Textual Similarity and Natural Language Inference. Reimers and Gurevych (2019) showed that a bi-encoder is much more efficient than a cross-encoder, and that bi-encoder representations can be stored as sentence vectors. Thus, once $M_{\text{feature}}$ is trained, the vector representations $M_{\text{feature}}(x_i^s)$ of each $x_i^s \in D_s$ can be saved along with the textual contents and labels. Then, at inference time, only the representation of the query needs to be computed, which reduces the cost from $k$ forward passes through $M_{\text{feature}}$ to a single one, independent of $k$. Moreover, the model can easily adapt to new neighbours without the need for retraining. However, from an effectiveness perspective, the cross-encoder is usually a better option as it encodes the query and its neighbour jointly, thus enabling multi-head attention-based interactions among the tokens of the query and of the neighbour.
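The efficiency argument can be illustrated by counting encoder calls. The sketch below uses a toy stand-in for the feature model (the class `CountingEncoder`, its made-up output, and the `[SEP]` joining string are all hypothetical) and only tracks how many forward passes each scheme would need at inference time.

```python
class CountingEncoder:
    """Toy stand-in for the feature model that counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        # Made-up two-dimensional "representation" of the text.
        return [float(len(text)), float(text.count(" "))]

neighbours = ["first neighbour", "second one", "third"]
query = "a new query"

# Bi-encoder: neighbour vectors are computed once, offline, and cached.
bi_encoder = CountingEncoder()
cache = {content: bi_encoder(content) for content in neighbours}
offline_calls = bi_encoder.calls
rep_q = bi_encoder(query)
be_calls_at_inference = bi_encoder.calls - offline_calls

# Cross-encoder: every (query, neighbour) pair needs its own forward pass.
cross_encoder = CountingEncoder()
pair_reps = [cross_encoder(query + " [SEP] " + content) for content in neighbours]
ce_calls_at_inference = cross_encoder.calls

print(be_calls_at_inference, ce_calls_at_inference)
```

With cached neighbour vectors, the bi-encoder cost per query stays at one call regardless of the neighbourhood size, whereas the cross-encoder cost grows linearly with it.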

Choice of $M_{\text{feature}}$
We explore two $M_{\text{feature}}$ models for both the CE and the BE schemes: a pre-trained XLM-R model, which we will refer to as $M_{\text{feature}}^{\text{XLM-R}}$, as well as an XLM-R model augmented with paraphrase knowledge, which we will refer to as $M_{\text{feature}}^{\text{P-XLM-R}}$ (Reimers and Gurevych, 2020). Sentence representations from XLM-R are not aligned across languages (Ethayarajh, 2019) and $M_{\text{feature}}^{\text{P-XLM-R}}$ overcomes this problem. In particular, $M_{\text{feature}}^{\text{P-XLM-R}}$ is trained to learn sentence semantics with parallel data from 50 languages. Moreover, the training process includes knowledge distillation from a Sentence BERT model (Reimers and Gurevych, 2019) trained on 50 million English paraphrases. As such, we expect $M_{\text{feature}}^{\text{P-XLM-R}}$ to outperform $M_{\text{feature}}^{\text{XLM-R}}$, as it more accurately captures the semantics of the query and its neighbour sentences. Note that there is work on producing better alignments of multilingual vector spaces (Zhao et al., 2021), which would allow us to consider a variety of pre-trained sentence representation models, but exploring this is outside the scope of this paper.

Interaction Features Optimisation
Given a query $q$ and its $j$-th neighbour, we obtain features $\text{rep}_j \in I_{ce}$ and $(\text{rep}_q, \text{rep}_j, |\text{rep}_q - \text{rep}_j|) \in I_{be}$ from $M_{\text{feature}}$ for the CE kNN+ and BE kNN+ schemes, respectively. For both schemes, we optimise the interaction features to indicate whether a query and its neighbour have the same or different labels. We do this to later aggregate interaction features from all the neighbours of a query to model the overall agreement of the query with the retrieved neighbourhood. Our hypothesis is that understanding individual neighbour-level agreement and aggregating it will allow us also to understand the neighbourhood.
We apply a fully connected layer with two outputs over the interaction features to optimise them. The outputs indicate the label agreement between $q$ and its $j$-th neighbour, $(c_j, l_j) \in N_q$. There is a label agreement if both $q$ and the $j$-th neighbour are flagged or are both neutral, i.e., $y_j^t = l_j$. We learn the label agreement using a binary cross-entropy loss $L_{lal}$, which is computed using the output of a softmax layer for each example in a batch of training data. We refer to $L_{lal}$ as the label-agreement loss. In our implementation, a batch of data comprises a query and its $k$ neighbours. We provide more details about the training procedure in Section 4.4.
Note that as our model predicts label agreement, it also indirectly predicts the label of the query and of the neighbour. In this way, it learns representations that separate flagged from the non-flagged examples.
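To make the label-agreement objective concrete, the following sketch builds agreement targets for one query and its neighbours and averages a two-class cross-entropy over them. The logits are made up, the ordering of the two outputs as [disagree, agree] is an assumption, and the helper names are hypothetical.

```python
import math

def agreement_target(query_label, neighbour_label):
    """1 if the query and the neighbour share the binary label, 0 otherwise."""
    return 1 if query_label == neighbour_label else 0

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class under a softmax over the logits."""
    return -math.log(softmax(logits)[target])

# One batch: a query and its k = 3 neighbours (labels already converted to binary).
query_label = "flagged"
neighbour_labels = ["neutral", "flagged", "neutral"]
targets = [agreement_target(query_label, label) for label in neighbour_labels]

# Made-up outputs of the two-way layer, ordered as [disagree, agree].
logits = [[2.0, -1.0], [0.0, 0.0], [1.0, 0.5]]
loss_lal = sum(cross_entropy(z, t) for z, t in zip(logits, targets)) / len(targets)
print(targets, round(loss_lal, 4))
```

Note that the targets depend only on whether the two labels match, so a neutral query with a neutral neighbour yields the same target as a flagged query with a flagged neighbour.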

Interaction Features Aggregation
The main reason to use interaction features for label agreement is to predict whether $q$ should be flagged or not. In a vanilla kNN setup, there is no mechanism to back-propagate classification errors, as the only parameter to tune there is the hyper-parameter $k$. In our model, we propose to optimise the interaction features, using a self-attention module, to minimise the classification error with a fixed neighbourhood size $k$. To this end, we propose to aggregate the $k$ interaction features: $I_{ce}$ for CE kNN+ and $I_{be}$ for BE kNN+. The aggregated representation captures global information, i.e., the agreement between the query and its neighbourhood, whereas the individual interaction features capture it locally.
We use structured self-attention (Lin et al., 2017) to capture the neighbourhood information. At first, we construct an interaction features matrix, $H \in \mathbb{R}^{k \times h}$, from the set of $k$ neighbours ($I_{ce}$ or $I_{be}$), where $h$ is the dimensionality of the interaction feature space. Then, we compute structured self-attention as follows:

$$\mathbf{a} = \mathrm{softmax}\left(W_2 \tanh\left(W_1 H^{\top}\right)\right), \qquad \text{rep}_{N_q} = \mathbf{a} H$$

Here, $W_1 \in \mathbb{R}^{h_r \times h}$ is a matrix that encodes interactions between the representations and projects the interaction features into a lower-dimensional space, $h_r < h$, thus making the representation matrix $h_r \times k$ dimensional. We multiply another matrix $W_2 \in \mathbb{R}^{1 \times h_r}$ by the resulting representation, and we apply softmax to obtain a probability distribution over the $k$ neighbours. Then, we use this probability distribution to produce an attention vector that linearly combines the interaction features to generate the neighbourhood representation $\text{rep}_{N_q}$, which we eventually use for classification.
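The aggregation can be sketched with plain Python lists. This is a minimal illustration rather than the actual implementation: the tanh non-linearity is assumed from the structured self-attention formulation of Lin et al. (2017), the matrices are made-up toy values, and the helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(H, W1, W2):
    """H: k x h, W1: h_r x h, W2: 1 x h_r. Returns (attention over k, vector of size h)."""
    hidden = [[math.tanh(v) for v in row] for row in matmul(W1, transpose(H))]  # h_r x k
    scores = matmul(W2, hidden)[0]   # one score per neighbour
    attention = softmax(scores)      # probability distribution over the k neighbours
    rep = matmul([attention], H)[0]  # weighted sum of the interaction features
    return attention, rep

# Toy sizes: k = 2 neighbours, h = 3 feature dimensions, h_r = 2.
H = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
W1 = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.1]]
W2 = [[1.0, -1.0]]
attention, rep = aggregate(H, W1, W2)
print(attention, rep)
```

The attention weights sum to one, so the aggregated vector is a convex combination of the rows of H, and the largest weight points at the neighbour that contributes most to it.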

Classification Loss Optimisation
The aggregated interaction features, rep Nq , are used as an input to a softmax layer with two outputs (flagged or neutral), which we optimise using a binary cross-entropy loss, L cll . We refer to L cll as classification loss.
Optimising this loss means that the classification decision for a query is made by computing its agreement or disagreement with the neighbourhood as a whole. As the classification and the label-agreement tasks aid each other, we adopt a multi-task learning approach: the final loss combines the classification loss $L_{cll}$ and the label-agreement loss $L_{lal}$, and we balance the two using the hyper-parameter $\lambda$. The classification loss forces the model to predict a label for the query. As the model learns to predict a label for a query, it becomes easier for it to reduce the label-agreement loss $L_{lal}$. Moreover, as the model learns to predict label agreement, it learns to compute interaction features, which represent agreement or disagreement. This, in turn, helps to optimise $L_{cll}$.
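One common way to balance two losses with a single hyper-parameter is a convex combination. The sketch below assumes this form purely for illustration; the function name `multitask_loss`, the argument `lam` (standing in for λ), and the loss values are hypothetical.

```python
def multitask_loss(loss_cll: float, loss_lal: float, lam: float) -> float:
    """Convex combination of the two losses (an assumed form; lam plays the role of lambda)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * loss_cll + lam * loss_lal

# Made-up loss values for one training batch.
total = multitask_loss(loss_cll=0.8, loss_lal=0.4, lam=0.5)
print(total)
```

Under this assumed form, setting `lam` to 0 trains on the classification loss alone, and setting it to 1 trains on the label-agreement loss alone.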
Note that, at inference time, our framework requires neither the labels of the neighbours for classification, nor a heuristic-based label-aggregation scheme. The classification layer makes a prediction based on the pooled representation from the interaction features, thus removing the need for any heuristic-based voting strategy based on the labels of the neighbours. Each individual interaction feature from the query and a neighbour captures the agreement between them as we optimise the features via the $L_{lal}$ loss. The opinion of the neighbourhood is captured using an aggregation of individual interaction features, which is different from a vanilla kNN, where neighbourhood opinion is captured using individual neighbour labels. As our aggregation is performed using a self-attention mechanism, we obtain a probability distribution over the interaction features that we can use to find the neighbour that influenced the neighbourhood opinion the most. We also know both the original and the converted label of the neighbour (see Section 3.1 for further details about the label space conversion). The original label of the neighbour could help us better understand the prediction for the query. For example, if the query is flagged and the original label of the most influential neighbour is hate, we could infer that the query is hate speech. However, we do not explore this direction in this paper, and we leave it for future work.

Datasets
We conducted experiments on two different multilingual datasets covering eight languages from six language families: Slavic, Turkic, Romance, Germanic, Albanian, and Finno-Ugric. We used these datasets as our target datasets, and an English dataset as the source dataset, which contains a large number of training examples with fine-grained categorisation. Both the source and target datasets are from the same domain (Wikipedia), as we do not study domain adaptation techniques in the present work. We describe these three datasets in the following paragraphs. The number of examples per dataset and the corresponding label distributions are shown in Table 1.
Jigsaw English (Jigsaw, 2018) is an English dataset with over 159K manually reviewed comments, annotated with multiple labels. We map the labels (toxic, severe toxic, obscene, threat, insult, and identity hate) into a flagged label; if at least one of these six labels is present for some example, we consider that example as flagged, and as neutral otherwise. As Jigsaw English is a resource-rich dataset, covering different aspects of abusive language, we use it as the source dataset. We use all its examples for training, as we validate our models on the target datasets' dev sets.
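The label mapping described above amounts to a logical OR over the six fine-grained labels. A minimal sketch follows, assuming each annotation is a dictionary of 0/1 indicators; the key names and the function `to_binary_label` are hypothetical and only mirror the label names given in the text.

```python
# Hypothetical keys mirroring the six fine-grained labels named in the text.
ABUSE_LABELS = ("toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate")

def to_binary_label(annotation: dict) -> str:
    """Return 'flagged' if any fine-grained abuse label is set, and 'neutral' otherwise."""
    return "flagged" if any(annotation.get(name, 0) for name in ABUSE_LABELS) else "neutral"

# Made-up annotations: two abusive comments and one without any label set.
examples = [
    {"toxic": 1, "insult": 1},
    {"threat": 1},
    {},
]
binary = [to_binary_label(annotation) for annotation in examples]
print(binary)
```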

Jigsaw Multilingual (Jigsaw Multilingual, 2020) aims to improve toxicity detection by addressing the shortcomings of the monolingual setup. The dataset contains examples in Italian, Turkish, and Spanish. It has binary labels (toxic or non-toxic), and thus it aligns well with our experimental setup. The label distribution is fairly similar to that for Jigsaw English, as shown in Table 1. This dataset is used for experimenting in a resource-rich environment. As it does not have standard training, testing, and development sets, we split the examples in each language as follows: 1,500, 500, and 500 for Italian and Spanish, and 1,800, 600, and 600 for Turkish.
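Fixed-size splits of this kind can be produced reproducibly with a seeded shuffle. The sketch below is only an illustration: the random shuffling, the seed, the order of the three parts, and the function name `split` are assumptions, and integer ids stand in for the actual examples.

```python
import random

def split(examples, sizes, seed=0):
    """Shuffle with a fixed seed and cut into consecutive parts of the given sizes."""
    shuffled = list(examples)
    if sum(sizes) != len(shuffled):
        raise ValueError("sizes must add up to the number of examples")
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for size in sizes:
        parts.append(shuffled[start:start + size])
        start += size
    return parts

# 2,500 placeholder ids cut into parts of 1,500, 500, and 500 examples.
first, second, third = split(range(2500), (1500, 500, 500))
print(len(first), len(second), len(third))
```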
WUL (Glavaš et al., 2020) aims to create a fair evaluation setup for abusive language detection in multiple languages. The comments were originally in English, and multilinguality is achieved by translating them as accurately as possible into five different languages: German (DE), Hungarian (HR), Albanian (SQ), Turkish (TR), and Russian (RU). We use this dataset partially, by using the test set originally generated by Wulczyn et al. (2017), who focused on identifying personal attacks. In contrast to Jigsaw Multilingual, this dataset is used for experimenting in a low-resource environment. For each language, we have 600 examples, which are split as follows: 400 for training, 100 for development, and 100 for testing. As abusive content can be very culture-specific, there will be cases, even within the same language, where some utterances are offensive in one culture but not in another. Thus, a translation-based dataset such as WUL might not be an ideal choice, and we acknowledge this limitation.
The results from experimenting with the above datasets cannot be compared to those in the literature, as we use the test sets from these datasets to create our train/dev/test splits. The original setups for Jigsaw Multilingual and WUL provide English-only training data, and previous work observes the performance of different models in zero-shot transfer learning settings. Our setup is different, as we assume that there is a limited number of training examples in the target language. Thus, we produce results only on a subset of the original test set for both datasets. Therefore, our results are not directly comparable to the results from the literature, as both the training and the testing datasets differ.

Baselines
We compare our proposed approach against three families of strong baselines. The first one considers training models only on the target dataset, the second one is source adaptation, where we use Jigsaw English as our source dataset, and the third one consists of a traditional kNN classification method, but with dense vector retrieval using LaBSE (Feng et al., 2020). We use cosine similarity under a LaBSE representation space to retrieve neighbours for the baselines and for our proposed approaches.

Target Dataset Training This family of baselines uses only the target dataset for training:
Lexicon approach: After standard tokenization and normalization of the text, we count the number of terms it contains that are also listed in the abusive language lexicon HurtLex. Based on the development set, we learn a threshold for the minimum number of matches required to flag the text. Then, we apply the lexicon and the threshold to the test set.
fastText is a baseline that uses the mean of the token vectors obtained from fastText (Joulin et al., 2017) word embeddings to represent a textual example. These representations are then used in a binary logistic regression classifier.
XLM-R Target is a pre-trained XLM-R model, which we fine-tune on the target dataset.
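As an illustration of the lexicon baseline above, the sketch below learns the match threshold on a development set by maximising F1 and then applies it to unseen text. The tokenizer, the two-word lexicon, the search range, and the toy development examples are all hypothetical; HurtLex itself is not loaded here.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed normalisation)."""
    return re.findall(r"\w+", text.lower())


def count_matches(text, lexicon):
    """Count the tokens of the text that appear in the lexicon."""
    return sum(1 for token in tokenize(text) if token in lexicon)


def f1_score(gold, predicted):
    """F1 for the flagged class; returns 0.0 when there are no true positives."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def learn_threshold(dev_texts, dev_labels, lexicon, max_threshold=10):
    """Pick the smallest match count that maximises F1 on the dev set."""
    best_threshold, best_f1 = 1, -1.0
    for threshold in range(1, max_threshold + 1):
        predicted = [count_matches(t, lexicon) >= threshold for t in dev_texts]
        score = f1_score(dev_labels, predicted)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold


# Toy lexicon and dev set, invented for illustration only.
lexicon = {"idiot", "stupid"}
dev_texts = [
    "you stupid idiot",
    "that was a stupid mistake",
    "have a nice day",
    "what an idiot , a stupid idiot",
]
dev_labels = [True, False, False, True]

threshold = learn_threshold(dev_texts, dev_labels, lexicon)
is_flagged = count_matches("such a stupid idiot", lexicon) >= threshold
print(threshold, is_flagged)
```

The second development example shows why a threshold above one can help: a single lexicon match in a benign sentence would otherwise be flagged.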

Source Adaptation This family of baselines includes variations of XLM-R:
XLM-R Mix-Adapt is a baseline model, which we train by mixing source and target data. This is possible because the label inventories of our source and target datasets are the same: Y = {flagged, neutral}. The mixing is done by oversampling the target data to match the number of instances of the source dataset. As the number of instances in the target dataset is limited, this is preferable to undersampling.
XLM-R Seq-Adapt (Garg et al., 2020) is a Transformer pre-trained on the source and fine-tuned on the target data. Here, we fine-tune XLM-R on the Jigsaw English dataset, and then we do a second round of fine-tuning on the target dataset.
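The data mixing used by Mix-Adapt can be sketched as below. The text only states that the target data is oversampled to match the source size; repeating the target set in full and sampling the remainder at random is one possible way to do this, and the function names and toy data are hypothetical.

```python
import random


def oversample_to_match(target_examples, source_size, seed=0):
    """Repeat target examples until there are exactly source_size of them."""
    if not target_examples:
        raise ValueError("target_examples must not be empty")
    rng = random.Random(seed)
    repeats, remainder = divmod(source_size, len(target_examples))
    # Full copies first, then a random sample to fill the remaining slots.
    return target_examples * repeats + rng.sample(target_examples, remainder)


def mix(source_examples, target_examples, seed=0):
    """Concatenate source data with oversampled target data and shuffle."""
    mixed = source_examples + oversample_to_match(
        target_examples, len(source_examples), seed
    )
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy data: 10 source and 3 target examples, invented for illustration.
source = [("source", i) for i in range(10)]
target = [("target", i) for i in range(3)]
mixed = mix(source, target)
n_target = sum(1 for origin, _ in mixed if origin == "target")
print(len(mixed), n_target)
```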
Nearest Neighbour We apply two nearest neighbour baselines, using majority voting for label aggregation. We varied the number of neighbours from 3 to 20, and we found that using 10 neighbours works best (on the dev set).
LaBSE-kNN Here the source dataset is indexed using representations obtained from LaBSE sentence embeddings (Feng et al., 2020), and the neighbours are retrieved using cosine similarity.
Weighted LaBSE-kNN is a baseline that uses the same retrieval step as LaBSE-kNN, but with a weighted voting strategy: each label is scored by summing the cosine similarities for the retrieved flagged and neutral neighbours, respectively; then, the label with the highest score is returned.
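The two voting strategies above can be contrasted with a small sketch. The vectors stand in for sentence embeddings and are invented for illustration, and breaking majority-vote ties in favour of neutral is an assumption that the text does not specify.

```python
import numpy as np


def cosine_similarities(query, index):
    """Cosine similarity between one query vector and each row of the index."""
    query = query / np.linalg.norm(query)
    index = index / np.linalg.norm(index, axis=1, keepdims=True)
    return index @ query


def retrieve(query, index, k):
    """Return the positions and similarities of the k most similar rows."""
    sims = cosine_similarities(query, index)
    top = np.argsort(-sims)[:k]
    return top, sims[top]


def majority_vote(labels):
    """Return the most frequent label; ties go to 'neutral' (assumption)."""
    n_flagged = sum(1 for label in labels if label == "flagged")
    return "flagged" if n_flagged > len(labels) - n_flagged else "neutral"


def weighted_vote(labels, sims):
    """Sum the similarities per label and return the highest-scoring label."""
    scores = {"flagged": 0.0, "neutral": 0.0}
    for label, sim in zip(labels, sims):
        scores[label] += float(sim)
    return max(scores, key=scores.get)


# Toy index of four source vectors with their labels (invented values).
index = np.array([[1.0, 0.05], [1.0, 2.0], [1.0, 3.0], [-1.0, 0.0]])
index_labels = ["flagged", "neutral", "neutral", "flagged"]
query = np.array([1.0, 0.0])

top, top_sims = retrieve(query, index, k=3)
top_labels = [index_labels[i] for i in top]
print(majority_vote(top_labels), weighted_vote(top_labels, top_sims))
```

In this toy case, the single very close flagged neighbour outweighs two distant neutral ones under weighted voting, whereas majority voting returns neutral.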

Evaluation Measures
Following prior work on abusive language detection, we use the F1 measure for evaluation. The F1 measure combines precision and recall (using a harmonic mean), which are both important to consider for automatic abusive language detection systems. In particular, online platforms strive to remove all content that violates their policies, and thus, if the system were to achieve 100% recall, the flagged content could be further filtered by human moderators to weed out the benign content. However, if the system's precision were very low, it would mean that the moderators would have to read every piece of content on the platform.
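The trade-off described above can be checked with a short worked example. The sketch computes precision, recall, and F1 for the flagged class for a hypothetical system that flags everything; the ten toy labels are invented for illustration.

```python
def precision_recall_f1(gold, predicted, positive="flagged"):
    """Compute precision, recall, and F1 for the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: 2 of 10 comments are abusive, and the system flags all of them.
gold = ["flagged"] * 2 + ["neutral"] * 8
flag_everything = ["flagged"] * 10

precision, recall, f1 = precision_recall_f1(gold, flag_everything)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Recall is perfect here, but the low precision pulls the harmonic mean down to one third, which reflects the moderation cost of reading every piece of content.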

Fine-Tuning and Hyper-Parameters
We train all the models for 10 epochs with XLM-R as the base Transformer representation and a maximum sequence length of 256 tokens. However, we make an exception for SRC (see Section 5.1): we train it for a single epoch, as training a neighbourhood-based model on a large dataset is resource-intensive. For all the approaches, we use Adam with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon$ = 1e-08 as the optimiser setting. For the baseline models, we use a batch size of 64 and a learning rate of 4e-05. For kNN+-based models, we create a training batch from a query and its 10 nearest neighbours. For stable updates, we accumulate gradients from 50 batches before each parameter update. We selected the values of all of the aforementioned hyper-parameters based on the validation set. For kNN+-based models, the best learning rate is selected from {5e-05, 7e-05}.
Table 2 shows the performance of our model variants compared to the seven strong baselines we described above (rows 1-7). The first two rows represent non-contextual baselines, and they perform worse than the pre-trained XLM-R baselines fine-tuned with labelled data (rows 3-5). Specifically, the lexicon baseline performs the worst of all, which indicates the limited coverage of the hate speech lexicon and the loss in precision due to token mismatches and context obliviousness. For example, the word monkey is generally included in a hate speech lexicon, but the appearance of this token in a text does not necessarily mean that the text is abusive.
The highlighted rows in Table 2 show different variants of our framework, based on CE kNN+ and BE kNN+, i.e., using cross-encoders vs. bi-encoders. For each of the encoding schemes, we instantiate three different models by using three different pre-trained representations fine-tuned in our neighbourhood framework, namely: $M^{\text{XLM-R}}_{\text{feature}}$, which is a pre-trained XLM-RoBERTa model (XLM-R); $M^{\text{P-XLM-R}}_{\text{feature}}$, which is an XLM-R model fine-tuned under a knowledge distillation setting with 50 million paraphrases and parallel data in 50 languages (Reimers and Gurevych, 2020); and $M^{\text{P-XLM-R}}_{\text{feature}}$ → SRC, which is an $M^{\text{P-XLM-R}}_{\text{feature}}$ model fine-tuned with source data (here, 159,571 instances from Jigsaw English) in our neighbourhood framework.
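As a brief aside on the fine-tuning setup described earlier in this section, the sketch below illustrates gradient accumulation over 50 batches, each built from a query and its 10 nearest neighbours. It is only an illustration under stated assumptions: a linear model with a squared-error loss and plain gradient descent replace the Transformer and Adam, and all names and data are hypothetical.

```python
import numpy as np


def mse_gradient(weights, inputs, targets):
    """Gradient of the mean squared error of a linear model."""
    errors = inputs @ weights - targets
    return 2.0 * inputs.T @ errors / len(targets)


def accumulated_update(weights, micro_batches, learning_rate):
    """Average the gradients of all micro-batches, then update once."""
    accumulated = np.zeros_like(weights)
    for inputs, targets in micro_batches:
        accumulated += mse_gradient(weights, inputs, targets)
    accumulated /= len(micro_batches)
    return weights - learning_rate * accumulated


# 50 micro-batches of 11 rows each (a query plus 10 neighbours); random toy data.
rng = np.random.default_rng(0)
all_inputs = rng.normal(size=(50 * 11, 4))
all_targets = rng.normal(size=50 * 11)
micro_batches = [
    (all_inputs[start:start + 11], all_targets[start:start + 11])
    for start in range(0, 50 * 11, 11)
]

weights = np.zeros(4)
updated = accumulated_update(weights, micro_batches, learning_rate=5e-05)
full_batch = weights - 5e-05 * mse_gradient(weights, all_inputs, all_targets)
print(np.allclose(updated, full_batch))
```

With equally sized micro-batches, the accumulated update matches a single update on all 550 rows, which is why accumulation stabilises training without requiring the larger batch to fit in memory.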

Evaluation in a Cross-lingual Setting
In order to train with SRC, we use all the training data in Jigsaw English, and we retrieve neighbours from Jigsaw English using LaBSE sentence embeddings. Then, we use this training data to fine-tune $M^{\text{P-XLM-R}}_{\text{feature}}$ with our kNN+-based cross-encoder (CE kNN+ + $M^{\text{P-XLM-R}}_{\text{feature}}$ → SRC) and bi-encoder (BE kNN+ + $M^{\text{P-XLM-R}}_{\text{feature}}$ → SRC) experimental setups. This is analogous to applying sequential adaptation (Garg et al., 2020), but here we do it in our neighbourhood framework.
The SRC approach addresses one of the weaknesses of our kNN framework. The training data is created from instances in the target dataset and their neighbours from the source dataset. Thus, the neighbourhood model cannot use all source training data, as it pre-selects a subset of the source data based on similarity. This is a disadvantage compared to the sequential adaptation model, which uses all source training instances for pre-training. In order to overcome this, we use the neighbourhood approach to pre-train our models with source data.
Table 2 shows the F1 scores for eight language-specific training and evaluation sets stemming from two different datasets: Jigsaw Multilingual and WUL. Jigsaw Multilingual is an imbalanced dataset with 15% abusive content, whereas WUL is balanced (see Table 1). Thus, it is hard to achieve a high F1 score on Jigsaw Multilingual, whereas for WUL the F1 scores are relatively higher. In the majority of the cases, our CE kNN+ variants outperform all the baselines as well as our BE kNN+ variants.
The performance of the best and of the second-best models for each language is highlighted by bold-facing and underlining, respectively. We attribute the higher scores achieved by the CE kNN+ variants compared to the BE kNN+ variants to the late-stage interaction of the query and its neighbours.
The CE kNN+ variants show a large performance gain compared to the baseline models on the Italian and the Turkish test sets from Jigsaw Multilingual. Even though the additional SRC pre-training is not always helpful for the CE kNN+ model, it is always helpful for the BE kNN+ model. However, both models struggle to outperform the baseline for the Spanish test set. We analysed the training data distribution for Spanish, but we could not find any noticeable patterns.
Yet, it can be observed that the XLM-R Target baseline for Spanish (2nd row, 1st column) achieves a higher F1 score compared to the Seq-Adapt baseline, which yields better performance for Italian and Turkish. We believe that the in-domain training examples are good enough to achieve a reasonable performance for Spanish.
On the WUL dataset, BE kNN + + M P-XLM-R

Impact of the Learned Voting Strategy
To demonstrate the effectiveness of our learned voting strategy, we use our baselines (shown in Table 2, rows 3-7) to retrieve neighbours, and then we perform majority voting to predict the label of a test instance. The results for all the approaches are shown in Table 3. For comparison, we also add the best bi-encoder and cross-encoder versions of kNN+ (see Table 2, rows 10 and 13). In particular, these baseline models are pre-trained XLM-R models fine-tuned on different combinations of source and target language datasets (see Fine-Tuned kNN Baselines, Table 3). For each data case in the source dataset, we compute its representation as the [CLS] token from the classification model and we construct a list of vectors. Given a test data case from the target dataset, we also compute its representation based on the [CLS] token representation from the classification model. We then compute its cosine similarity with each of the [CLS] vectors from the source dataset. After that, we compute a ranked list of the top-10 neighbours based on similarity scores.
Next, we vary the number of neighbours from three to ten, considering them in the order they are ranked based on their similarity to the query, to obtain a majority vote and to classify the test example. We can see in Table 3 that the performance is similar to that for the LaBSE-kNN and for the Weighted LaBSE-kNN approaches, in which the neighbours are retrieved using a representation space constructed from sentence similarity data (see Sentence Similarity kNN Baselines, Table 3). The results in Table 3 show that when fine-tuned models are directly used in a nearest neighbours framework without additional modifications, their performance is lower by between 25 and 60 F1 points absolute, compared to our proposed kNN+ model. These results suggest that the interactions between the query and the retrieved neighbours captured by our model are an important prerequisite for achieving high performance.

Evaluation in a Multilingual Setting
In this subsection, we go beyond our cross-lingual setting and we analyse the effectiveness of our proposed model in a multilingual setting. A multilingual setting has been explored in recent work on abusive language detection (Pamungkas and Patti, 2019; Ousidhoum et al., 2019; Basile et al., 2019; Ranasinghe and Zampieri, 2020; Corazza et al., 2020; Glavaš et al., 2020; Leite et al., 2020) and it is desirable because online platforms are not limited to specific languages. An effective multilingual model unifies the two-stage process of language detection and prediction with a language-specific classifier. Moreover, abusive language is generally code-mixed (Saumya et al., 2021), which makes language-agnostic representation spaces more desirable.
We investigate a multilingual scenario, where all target languages in our cross-lingual setting are observed both at training and at testing time. To this end, we create new training, development, and testing splits in a 5:1:2 ratio from the 8,000 available data cases in the Jigsaw Multilingual dataset. Each split contains randomly sampled data in Italian, Spanish, and Turkish.
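The 5:1:2 split can be sketched as follows. Shuffling with a fixed seed and slicing by ratio is an assumed procedure, as the text only states the ratio and that the data is randomly sampled; the placeholder examples are hypothetical.

```python
import random


def split_by_ratio(examples, ratio=(5, 1, 2), seed=0):
    """Shuffle the examples and split them into train/dev/test by ratio."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratio)
    n_train = len(shuffled) * ratio[0] // total
    n_dev = len(shuffled) * ratio[1] // total
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder goes to the test split
    return train, dev, test


# 8,000 placeholder examples standing in for the Jigsaw Multilingual data.
examples = list(range(8000))
train, dev, test = split_by_ratio(examples)
print(len(train), len(dev), len(test))
```

For 8,000 examples, this yields 5,000 training, 1,000 development, and 2,000 testing examples.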
We train and evaluate our BE kNN+ and CE kNN+ using the aforementioned splits; the results are shown in Table 4. Here, we must note that our neighbourhood retrieval model is language-agnostic, and thus we can retrieve neighbours for queries in any language. We find that in a multilingual scenario, our BE kNN+ model with SRC pre-training performs better than the CE kNN+ model. Both the BE and the CE approaches outperform the best baseline model, Seq-Adapt. Compared to the cross-lingual setting, there is more data available, in a mix of languages. We hypothesise that the success of the bi-encoder model over the cross-encoder one stems from the increase in data size.

Analysis of the BE Representation
In order to understand the impact of the representations by BE kNN+ + $M^{\text{P-XLM-R}}_{\text{feature}}$ → SRC, a model variant instantiated from our proposed kNN framework, we computed the similarity between the query and its neighbours in the representation space. An example is shown in Table 5 (it is the example from the introduction). Given the Turkish flagged query, we use LaBSE (Feng et al., 2020) and our BE representation space to retrieve ranked lists of its ten nearest neighbours. The table shows the scores computed by both approaches, and we can see that our representation can help discriminate between flagged and neutral content better. When we compute the cosine similarity between the query and the nearest neighbours, the BE representation space assigns negative scores to the neutral content. The LaBSE sentence embeddings are optimised for semantic similarity, and thus using them does not allow us to discriminate between flagged and neutral content.
We further study the impact of our representation by comparing a voting-based kNN on the top-10 neighbours retrieved by LaBSE vs. a re-ranking using our BE representation. For both the LaBSE-based ranking and for our re-ranking, at each ranking point, we apply the majority voting kNN approach on the neighbourhood within that ranking point. Figure 3 shows the results for the test part of the Jigsaw Multilingual dataset (including the multilingual setup; see Section 5.3). We can see that the re-ranking step improves over LaBSE for all the different numbers of neighbours.
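The comparison above can be sketched as follows: neighbours arrive in a retrieval order, are re-ranked by cosine similarity in a second representation space, and a majority vote is taken at each ranking point. The vectors and labels are invented for illustration (they mimic a space where neutral content receives negative scores), and resolving ties in favour of neutral is an assumption.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def rerank(query_vector, neighbour_vectors, neighbour_labels):
    """Re-order the neighbour labels by similarity to the query, most similar first."""
    sims = np.array([cosine(query_vector, v) for v in neighbour_vectors])
    order = np.argsort(-sims, kind="stable")
    return [neighbour_labels[i] for i in order]


def vote_at_each_cutoff(ranked_labels):
    """Majority vote over the top-k labels for k = 1..len(ranked_labels)."""
    predictions, n_flagged = [], 0
    for k, label in enumerate(ranked_labels, start=1):
        n_flagged += label == "flagged"
        # Ties go to 'neutral' (assumption).
        predictions.append("flagged" if n_flagged > k - n_flagged else "neutral")
    return predictions


# Neighbours in their initial retrieval order, with toy vectors in the second space.
retrieved_labels = ["neutral", "flagged", "neutral", "flagged"]
second_space = np.array([[-1.0, 0.5], [1.0, 0.2], [-0.5, 1.0], [0.8, 0.6]])
query_vector = np.array([1.0, 0.0])

reranked_labels = rerank(query_vector, second_space, retrieved_labels)
before = vote_at_each_cutoff(retrieved_labels)
after = vote_at_each_cutoff(reranked_labels)
print(before)
print(after)
```

For a flagged query, the re-ranked list votes flagged at the first three ranking points in this toy case, whereas the initial order never does.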

Multi-Task Learning Parameter Sensitivity
Our approach uses multi-task learning, where we balance the weights of $L_{cll}$ and $L_{lal}$ using a hyper-parameter λ. Figure 4 shows the impact of different values for this hyper-parameter. On the horizontal axis, we increase the importance of the $L_{lal}$ loss, and we show the performance of all model variants on the development part of the Jigsaw Multilingual dataset. We can see that the models perform well if the weight for the label-agreement loss is set to 0.7, and that their performance degrades if it is increased further.
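One common way to balance two losses with a single hyper-parameter is a convex combination, sketched below. This form is an assumption made for illustration, as this section does not spell out the exact weighting; the function name and the loss values are hypothetical.

```python
def combined_loss(classification_loss, label_agreement_loss, lam):
    """Weight two losses with lam in [0, 1]; a larger lam favours label agreement.

    Assumed form: (1 - lam) * L_cll + lam * L_lal.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * classification_loss + lam * label_agreement_loss


# Toy loss values, invented for illustration.
for lam in (0.0, 0.7, 1.0):
    print(lam, combined_loss(1.0, 2.0, lam))
```

Under this assumed form, λ = 0 trains on the classification loss alone and λ = 1 on the label-agreement loss alone.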

Conclusion and Future Work
We proposed kNN+, a novel framework for cross-lingual content flagging, which significantly outperforms strong baselines with limited training data in the target language. We further demonstrated the effectiveness of our framework in a multilingual scenario, where a test data point can be in Turkish, Italian, or Spanish.
Moreover, we provided a qualitative analysis of the representations learned by our proposed BE kNN+ framework, and we demonstrated that, in the learned representation space, flagged content stays close to flagged content, while non-flagged content stays close to non-flagged content.
Our framework computes a neighbourhood representation for a query using an attention mechanism, thus indicating the influence of each individual neighbour. This and the kNN-based architecture offer an opportunity to obtain an explanation for the individual model predictions, and such explanations can be based not only on the textual content of the influential neighbours, but also on their original fine-grained labels.
In future work, we plan to understand the viability of such explanations in a user study. We also plan to evaluate our framework on other content flagging tasks, e.g., for detecting harmful memes (Dimitrov et al., 2021; Pramanick et al., 2021a,b), as the framework is not limited to abusive content detection.