Quantifying Social Biases in NLP: A Generalization and Empirical Comparison of Extrinsic Fairness Metrics

Measuring bias is key for better understanding and addressing unfairness in NLP/ML models. This is often done via fairness metrics which quantify the differences in a model's behaviour across a range of demographic groups. In this work, we shed more light on the differences and similarities between the fairness metrics used in NLP. First, we unify a broad range of existing metrics under three generalized fairness metrics, revealing the connections between them. Next, we carry out an extensive empirical comparison of existing metrics and demonstrate that the observed differences in bias measurement can be systematically explained via differences in parameter choices for our generalized metrics.


Introduction
The prevalence of unintended social biases in NLP models has recently been identified as a major concern for the field. A number of papers have published evidence of uneven treatment of different demographics (Dixon et al., 2018; Zhao et al., 2018; Rudinger et al., 2018; Garg et al., 2019; Borkan et al., 2019; Stanovsky et al., 2019; Gonen and Webster, 2020; Huang et al., 2020a; Nangia et al., 2020), which can reportedly cause a variety of serious harms, such as unfair allocation of opportunities or unfavorable representation of particular social groups.
Measuring bias in NLP models is key for better understanding and addressing unfairness. This is often done via fairness metrics, which quantify the differences in a model's behaviour across a range of social groups. The community has proposed a multitude of such metrics (Dixon et al., 2018; Garg et al., 2019; Huang et al., 2020a; Borkan et al., 2019; Gaut et al., 2020). In this paper, we aim to shed more light on how those varied means of quantifying bias differ and what facets of bias they capture. Developing such understanding is crucial for drawing reliable conclusions and actionable recommendations regarding bias. We focus on bias measurement for downstream tasks, as Goldfarb-Tarrant et al. (2021) have recently shown that there is no reliable correlation between bias measured intrinsically on, for example, word embeddings, and bias measured extrinsically on a downstream task. We narrow down the scope of this paper to tasks which do not involve prediction of a sensitive attribute.

♠ Work done during an internship at Amazon AI.
We survey 146 papers on social bias in NLP and unify the multitude of disparate metrics we find under three generalized fairness metrics. Through this unification we reveal the key connections between a wide range of existing metrics: we show that they are simply different parametrizations of our generalized metrics. Next, we empirically investigate the role of different metrics in detecting the systemic differences in performance for different demographic groups, i.e., differences in quality of service (Jacobs et al., 2020). We experiment on three transformer-based models (two models for sentiment analysis and one for named entity recognition (NER)), which we evaluate for fairness with respect to seven different sensitive attributes qualified for protection under United States federal anti-discrimination law: Gender, Sexual Orientation, Religion, Nationality, Race, Age and Disability. Our results highlight the differences in bias measurements across the metrics, and we discuss how these variations can be systematically explained via different parameter choices of our generalized metrics. Our proposed unification and observations can guide decisions about which metrics (and parameters) to use, allowing researchers to focus on the pressing matter of bias mitigation rather than reinventing parametric variants of the same metrics. While we focus our experiments on English, the metrics we study are language-agnostic and our methodology can be trivially applied to other languages.
We release our code with implementations of all metrics discussed in this paper. Our implementation mirrors our generalized formulation (Section 3), which simplifies the creation of new metrics. We build our code on top of CHECKLIST (Ribeiro et al., 2020), making it compatible with the CHECKLIST testing functionalities; i.e., one can evaluate the model using the fairness metrics as well as the CHECKLIST-style tests, like invariance, under a single bias evaluation framework.

Terminology
We use the term sensitive attribute to refer to a category by which people are qualified for protection, e.g., Religion or Gender. For each sensitive attribute we define a set of protected groups T , e.g., for Gender, T could be set to {female, male, non-binary}. Next, each protected group can be expressed through one of its identity terms, I; e.g., for the protected group female those terms could be {woman, female, girl} or a set of typically female names.
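To make the terminology concrete, here is a minimal Python sketch of these notions; the attribute, groups, and identity terms below are illustrative examples drawn from the text, not the paper's released data structures:

```python
# A sensitive attribute (here: Gender) maps each protected group t in T
# to its set of identity terms I. All names are illustrative.
gender = {
    "female": ["woman", "female", "girl"],
    "male": ["man", "male", "boy"],
    "non-binary": ["non-binary person", "nonbinary"],
}

protected_groups = set(gender)      # the set T of protected groups
identity_terms = gender["female"]   # the identity terms I for the group "female"
```

A protected group may thus be realized through several identity terms (or, e.g., a set of typically female names), which matters later when counterfactual sentence variations are generated.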

Definitions of Fairness in NLP
The metrics proposed to quantify bias in NLP models across a range of social groups can be categorized based on whether they operationalize notions of group or counterfactual fairness. In this section we give a brief overview of both, and encourage the reader to consult existing surveys (see Related Work) for a broader scope of the literature on fairness, dating back to the 1960s.
Group fairness requires parity of some statistical measure across a small set of protected groups (Chouldechova and Roth, 2020). Some prominent examples are demographic parity (Dwork et al., 2012), which requires an equal positive classification rate across different groups, or equalized odds (Hardt et al., 2016), which for binary classification requires equal true positive and false negative rates. In NLP, group fairness metrics are based on performance comparisons between the sets of examples associated with different protected groups.

Counterfactual fairness requires parity for two or more versions of an individual, one from the actual world and others from counterfactual worlds in which the individual belongs to a different protected group; i.e., it requires invariance to the change of the protected group (Kusner et al., 2017). Counterfactual fairness is often viewed as a type of individual fairness, which asks for similar individuals to be treated similarly (Dwork et al., 2012). In NLP, counterfactual fairness metrics are based on comparisons of performance for variations of the same sentence, which differ in the mentioned identity terms. Such data can be created by perturbing real-world sentences or by creating synthetic sentences from templates. In this work, we require that for each protected group there exists at least one sentence variation for every source example (pre-perturbation sentence or template). In practice, the number of variations for each protected group will depend on the cardinality of I (Table 1). In contrast to most NLP works (Dixon et al., 2018; Garg et al., 2019; Sheng et al., 2020), we allow for a protected group to be realized as more than one identity term. To allow for this, we separate the variations for each source example into |T| sets, each of which can be viewed as a separate counterfactual world.

Generalized Fairness Metrics
We introduce three generalized fairness metrics which are based on different comparisons between protected groups and are model and task agnostic. They are defined in terms of two parameters:

(i) A scoring function, φ, which calculates the score on a subset of examples. The score is a base measurement used to calculate the metric and can be either a scalar or a set (see Table 2 for examples).
(ii) A comparison function, d, which takes a range of different scores-computed for different subsets of examples-and outputs a single scalar value.
Each of the three metrics is conceptually different and is most suitable in different scenarios; the choice of the most appropriate one depends on the scientific question being asked. Through different choices for φ and d, we can systematically formulate a broad range of different fairness metrics, targeting different types of questions. We demonstrate this in Section 4 and Table 2, where we show that many metrics from the NLP literature can be viewed as parametrizations of the metrics we propose here. To account for the differences between group and counterfactual fairness (Section 2.2) we define two different versions of each metric.
Notation Let T = {t_1, t_2, ..., t_{|T|}} be the set of all protected groups for a given sensitive attribute, e.g., Gender, and let φ(A) be the score for some set of examples A. This score can be either a set or a scalar, depending on the parametrization of φ. For group fairness, let S be the set of all evaluation examples. We denote the subset of examples associated with a protected group t_i as S^{t_i}. For counterfactual fairness, let X = {x_1, x_2, ..., x_{|X|}} be a set of source examples, e.g., sentences pre-perturbation, and S = {S_1, S_2, ..., S_{|S|}} be a set of sets of evaluation examples, where S_j is the set of all variations of a source example x_j; i.e., there is a one-to-one correspondence between S and X. We use S_j^{t_i} to denote the subset of S_j associated with a protected group t_i. For example, if T = {female, male} and the templates were defined as in Table 1, then S_j^{female} would contain the female variations of the source template x_j.

Pairwise Comparison Metric
Pairwise Comparison Metric (PCM) quantifies how distant, on average, the scores for two different, randomly selected groups are. It is suitable for examining whether and to what extent the chosen protected groups differ from one another. For example, for the sensitive attribute Disability, are there any performance differences for cognitive vs mobility vs no disability? We define Group (1) and Counterfactual (2) PCM as follows:

\frac{1}{N} \sum_{\{t_i, t_j\} \in \binom{T}{2}} d\left(\phi(S^{t_i}), \phi(S^{t_j})\right)    (1)

\frac{1}{|S| \, N} \sum_{S_j \in S} \sum_{\{t_i, t_k\} \in \binom{T}{2}} d\left(\phi(S_j^{t_i}), \phi(S_j^{t_k})\right)    (2)

where N is a normalizing factor, e.g., \binom{|T|}{2}.
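A minimal Python sketch of Group PCM, assuming a scalar-valued φ (here, a hypothetical per-group FNR) and absolute difference as the comparison function d; the function name and example scores are illustrative, not from the released implementation:

```python
from itertools import combinations

def group_pcm(scores_by_group, d, N=None):
    """Group Pairwise Comparison Metric (a sketch of Eq. 1).

    scores_by_group: dict mapping protected group t_i -> phi(S^{t_i})
    d: comparison function over two scores, e.g. absolute difference
    N: normalizing factor; defaults to the number of group pairs, |T| choose 2
    """
    pairs = list(combinations(scores_by_group.values(), 2))
    N = N or len(pairs)
    return sum(d(a, b) for a, b in pairs) / N

# Example: one group's FNR deviates from the other two
fnr = {"female": 0.10, "male": 0.10, "non-binary": 0.40}
bias = group_pcm(fnr, d=lambda a, b: abs(a - b))   # averages the 3 pairwise gaps
```

With the scores above, the three pairwise gaps are 0.0, 0.3 and 0.3, giving an average of 0.2.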

Background Comparison Metric
Background Comparison Metric (BCM) relies on a comparison between the score for a protected group and the score of its background. The definition of the background depends on the task at hand and the investigated question. For example, if the aim is to answer whether the performance of a model for the group differs from the model's general performance, the background can be the set of all evaluation examples. Alternatively, if the question of interest is whether the groups considered disadvantaged are treated differently than some privileged group, the background can be the set of examples associated with that privileged group. In such a case, T should be narrowed down to the disadvantaged groups only. For counterfactual fairness the background could be the unperturbed example, allowing us to answer whether a model's behaviour differs for any of the counterfactual versions of the world. Formally, we define Group (3) and Counterfactual (4) BCM as follows:

\frac{1}{N} \sum_{t_i \in T} d\left(\phi(\beta_{t_i, S}), \phi(S^{t_i})\right)    (3)

\frac{1}{|S| \, N} \sum_{S_j \in S} \sum_{t_i \in T} d\left(\phi(\beta_{t_i, S_j}), \phi(S_j^{t_i})\right)    (4)

where N is a normalizing factor and \beta_{t_i, S} is the background for group t_i for the set of examples S.
Vector-valued BCM In its basic form BCM aggregates the results obtained for different protected groups in order to return a single scalar value. Such aggregation provides a concise signal about the presence and magnitude of bias, but it does so at the cost of losing information. Often, it is important to understand how different protected groups contribute to the resulting outcome. This requires the individual group results not to be accumulated; i.e., dropping the \frac{1}{N} \sum_{t_i \in T} aggregation from equations 3 and 4. We call this version of BCM the vector-valued BCM (VBCM).
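A sketch of Group BCM and its vector-valued variant in Python, again assuming a scalar-valued φ and absolute difference as d; the function name, flag, and scores are illustrative assumptions:

```python
def group_bcm(scores_by_group, background_score, d, vector_valued=False):
    """Group Background Comparison Metric (sketch of Eq. 3).

    Compares each protected group's score phi(S^{t_i}) against a
    background score, e.g. phi computed over all evaluation examples.
    With vector_valued=True the per-group results are not accumulated,
    which corresponds to the VBCM variant.
    """
    per_group = {t: d(background_score, s) for t, s in scores_by_group.items()}
    if vector_valued:
        return per_group                                  # VBCM: one value per group
    return sum(per_group.values()) / len(per_group)       # BCM: single scalar

fnr = {"female": 0.10, "male": 0.10, "non-binary": 0.40}
background = 0.20                                         # e.g. FNR on all examples
scalar = group_bcm(fnr, background, d=lambda a, b: abs(a - b))
per_group = group_bcm(fnr, background, d=lambda a, b: abs(a - b), vector_valued=True)
```

The scalar form dilutes the outlier group's gap across all groups, while the VBCM output pinpoints which group drives it.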

Multi-group Comparison Metric
Multi-group Comparison Metric (MCM) differs from the other two in that the comparison function d takes as arguments the scores for all protected groups. This metric can quantify the global effect that a sensitive attribute has on a model's performance; e.g., whether the change of Gender has any effect on a model's scores. It can provide a useful initial insight, but further inspection is required to develop a better understanding of the underlying bias, if it is detected. Group (5) and Counterfactual (6) MCM are defined as:

d\left(\phi(S^{t_1}), \phi(S^{t_2}), \ldots, \phi(S^{t_{|T|}})\right)    (5)

\frac{1}{|S|} \sum_{S_j \in S} d\left(\phi(S_j^{t_1}), \phi(S_j^{t_2}), \ldots, \phi(S_j^{t_{|T|}})\right)    (6)
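A sketch of Group MCM in Python. The choice of population standard deviation as d is my illustrative assumption (any function over all groups' scores would do, e.g., a test statistic):

```python
from statistics import pstdev

def group_mcm(scores_by_group, d):
    """Group Multi-group Comparison Metric (sketch of Eq. 5):
    d receives the scores for *all* protected groups at once."""
    return d(list(scores_by_group.values()))

# Example: spread of per-group scores as a global signal of attribute effect
fnr = {"female": 0.10, "male": 0.10, "non-binary": 0.40}
spread = group_mcm(fnr, d=pstdev)
```

A nonzero spread signals that the attribute affects the scores overall, but, as noted above, it does not say which group is affected or in which direction.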

Classifying Existing Fairness Metrics
Within the Generalized Metrics Existing fairness metrics span a range of NLP tasks, including language generation (Huang et al., 2020a) and dependency parsing (Blodgett et al., 2018). We arrive at this list by reviewing 146 papers that study bias and selecting metrics that meet three criteria: (i) the metric is extrinsic, i.e., it is applied to at least one downstream NLP task; (ii) it quantifies the difference in performance across two or more groups; and (iii) it is not based on the prediction of a sensitive attribute (metrics based on a model's predictions of sensitive attributes, e.g., in image captioning or text generation, constitute a specialized sub-type of fairness metrics). Out of the 26 metrics we find, only four do not fit within our framework, among them BPSN and BNSP (Borkan et al., 2019).

Importantly, many of the metrics we find are PCMs defined for only two protected groups, typically for male and female genders or white and non-white races. Only those that use a commutative d can be straightforwardly adjusted to more groups. Those which cannot be adjusted are marked with gray circles in Table 2.

Prediction vs Probability Based Metrics Beyond the categorization into PCM, BCM and MCM, as well as group and counterfactual fairness, the metrics can be further categorized into prediction based and probability based. The former calculate the score based on a model's predictions, while the latter use the probabilities assigned to a particular class or label (we found no metrics that make use of both probabilities and predictions). 13 out of 16 group fairness metrics are prediction based, while all counterfactual metrics are probability based. Some metrics required minor adjustments to fit our framework: the metrics of Jiang et al. (2020) relax the definition and allow for a separate φ function for the background (they require returning different confidence scores for the protected group and the background), and the metrics of Prabhakaran et al. (2019) ( 18 19 21 ) originally have not been defined in terms of protected groups.
In their paper, T is a set of different names, both male and female.

Table 2: Existing fairness metrics and how they fit in our generalized metrics. f(x, c), y(x) and ŷ(x) are the probability associated with a class c, the gold class and the predicted class for example x, respectively. MWU is the Mann-Whitney U test statistic and W_1 is the Wasserstein-1 distance between the distributions of X and Y. Metrics marked with * have been defined in the context of only two protected groups and do not define the normalizing factor. The metrics associated with gray circles cannot be applied to more than two groups (see Section 4).

Experimental Details
Having introduced our generalized framework and classified the existing metrics, we now empirically investigate their role in detecting the systemic performance difference across the demographic groups. We first discuss the relevant experimental details before presenting our results and analyses (Section 6).
Models We experiment on three RoBERTa (Liu et al., 2019) based models: (i) a binary sentiment classifier, (ii) a 3-class sentiment classifier, and (iii) an NER model which uses a CRF (Lafferty et al., 2001) to predict the tags. In the NER experiments we use the BILOU labeling scheme (Ratinov and Roth, 2009) and, for the probability-based metrics, we use the probabilities from the encoder's output. Table 5 reports the performance on the official dev splits for the datasets the models were trained on. (Our preliminary experiments also used models based on Electra (Clark et al., 2020), as well as models trained on the SST-2 and SST-3 datasets (Socher et al., 2013); for all models we observed similar trends in differences between the metrics. Due to space constraints we omit those results and leave a detailed cross-model bias comparison for future work.)
Evaluation Data For classification, we experiment on seven sensitive attributes, and for each attribute we devise a number of protected groups (Table 3). We analyze bias within each attribute independently and focus on explicit mentions of each identity. This is reflected in our choice of identity terms, which we have gathered from Wikipedia and Wiktionary, as well as from Dixon et al. (2018) and Hutchinson et al. (2020) (see Table 4 for an example). Additionally, for the Gender attribute we also investigate implicit mentions: female and male groups represented with names typically associated with these genders. We experiment on synthetic data created using handcrafted templates, as is common in the literature (Dixon et al., 2018; Kurita et al., 2019; Huang et al., 2020a).
For each sensitive attribute we use 60 templates with balanced classes: 20 negative, 20 neutral and 20 positive templates. For each attribute we use 30 generic templates, with adjective and noun phrase slots to be filled with identity terms, and 30 attribute-specific templates. In Table 6 we present examples of both generic templates and attribute-specific templates for Nationality. Note that the slots of generic templates are designed to be filled with terms that explicitly reference an identity (Table 4), and are unsuitable for experiments on female/male names. For this reason, for names we design an additional 30 name-specific templates (60 in total). We present examples of those templates in Table 6.

(We process the SemEval data as is commonly done for SST (Socher et al., 2013): for binary classification, we filter out the neutral class and compress the multiple fine-grained positive/negative classes into a single positive/negative class; for 3-class classification we do not filter out the neutral class. For Disability and Race we used the groups from Hutchinson et al. (2020).)
For NER, we only experiment on Nationality and generate the evaluation data from 22 templates with a missing {country} slot, for which we manually assign a BILOU tag to each token. The {country} slot is initially labeled as U-LOC and is later automatically adjusted to a sequence of labels if a country name filling the slot spans more than one token, e.g., B-LOC L-LOC for New Zealand. (We will release all templates upon acceptance.)
Metrics We experiment on the metrics which support more than two protected groups (i.e., the white-circled metrics in Table 2). As described in Section 2.2, for each source example we allow a number of variations for each group. Hence, for counterfactual metrics which require only one example per group (all counterfactual metrics but Average Individual Fairness 21 ) we evaluate on the |T|-ary Cartesian product over the sets of variations for all groups. For groups with large |I| we sample 100 elements from the Cartesian product, without replacement. We convert Counterfactual Token Fairness Gap 17 and Perturbation Score Sensitivity 18 into PCMs, since for templated data there is no single real-world example.
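The construction of counterfactual evaluation tuples described above can be sketched in Python as follows; the function name, the sample-size default, and the toy sentences are illustrative, and the released implementation may organize this differently:

```python
import itertools
import random

def counterfactual_tuples(variations_by_group, k=100, seed=0):
    """For one source example, build the |T|-ary Cartesian product over
    each protected group's sentence variations; if the product is larger
    than k, sample k tuples without replacement (as described above)."""
    product = list(itertools.product(*variations_by_group.values()))
    if len(product) > k:
        product = random.Random(seed).sample(product, k)  # without replacement
    return product

# Toy variations of one source template for two protected groups
variations = {
    "female": ["She is a doctor.", "The woman is a doctor."],
    "male": ["He is a doctor.", "The man is a doctor."],
}
tuples = counterfactual_tuples(variations, k=3)  # 4 combinations, sampled down to 3
```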
We also compute true-class (TC) versions of the probability-based metrics, which calculate the score only on the examples of a given gold class, e.g., PosAvgGF and NegAvgGF for Average Group Fairness 3 . Such versions are akin to equality of opportunity (Hardt et al., 2016) and can pinpoint the existence of bias more precisely, as we show later (Section 6).

Moving Beyond Binary Classification
14 out of 15 white-circled metrics from Table 2 are inherently classification metrics, 11 of which are defined exclusively for binary classification. We adapt binary classification metrics to (i) multi-class classification and (ii) sequence labeling, to support a broader range of NLP tasks.

Multi-class Classification Probability-based metrics that use the probability of the target class ( 18 19 20 ) do not require any adaptations for multi-class classification. For other metrics, we measure bias independently for each class c, using a one-vs-rest strategy for prediction-based metrics and the probability of class c for the scores of probability-based metrics ( 3 5 6 17 21 ).
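The one-vs-rest reduction used above for prediction-based metrics can be sketched as follows (the function name and labels are illustrative):

```python
def one_vs_rest(predictions, target_class):
    """Reduce multi-class predictions to binary ones: the chosen class c
    becomes the positive label and every other class the negative one,
    so binary metrics (FNR, F1, ...) can be computed per class."""
    return [1 if p == target_class else 0 for p in predictions]

preds = ["pos", "neu", "neg", "pos"]
binary_pos = one_vs_rest(preds, "pos")  # bias for class "pos" is measured on these
```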
Sequence Labeling We view sequence labeling as a case of multi-class classification, with each token being a separate classification decision. As for multi-class classification, we compute the bias measurements for each class independently. For prediction-based metrics, we use a one-vs-rest strategy and base the F1 and FNR scores on exact span matching. (We do not compute FPR-based metrics, since false positives are unlikely to occur for our synthetic data and are less meaningful if they occur.) For probability-based metrics, for each token we accumulate the probability scores for different labels of the same class; e.g., with the BILOU labeling scheme, the probabilities for B-PER, I-PER, L-PER and U-PER are summed to obtain the probability for the class PER. Further, for counterfactual metrics, to account for different identity terms yielding different numbers of tokens, we average the probability scores for all tokens of multi-token identity terms.

Figure 1: Results for sentiment analysis for all attributes on BCM, PCM and MCM metrics. Metrics marked with (all) are inherently multi-class and are calculated for all classes. Superscripts P and * mark the probability-based and correctly normalized metrics, respectively. We row-normalize the heatmap coloring, across the whole figure, using maximum absolute value scaling.

Fig. 1 shows the results for sentiment analysis for all attributes on BCM, PCM and MCM metrics. In each table we report the original bias measurements and row-normalize the heatmap coloring using maximum absolute value scaling to allow for some cross-metric comparison. (Even after normalization, bias measurements across metrics are not fully comparable: different metrics employ different base measurements, e.g., TPR, TNR, etc., and hence measure different aspects of bias.) Fig. 1 gives evidence of unintended bias for most of the attributes we consider, with Disability and Nationality being the most and least affected attributes, respectively. We highlight that since we evaluate on simple synthetic data in which the expressed sentiment is evident, even small performance differences can be concerning. Fig. 1 also gives an initial insight into how the bias measurements vary across the metrics. In Fig. 2 we present the per-group results for VBCM and BCM metrics for the example Gender attribute. Similarly, in Fig. 3 we show results for NER for the relevant LOC class.
The first set of results indicates that the most problematic Gender group is cis. For NER we observe a large gap in the model's performance between the most affluent countries and countries with lower GDP. In the context of those empirical results, we now discuss how different parameter choices affect the observed bias measurements.

Empirical Metric Comparison
Key Role of the Base Measurement Perhaps the most important difference between the metrics lies in the parametrization of the scoring function φ. The choice of φ determines what type and aspect of bias is being measured, making the metrics conceptually different. (In Fig. 2 we omit the per-group results for the remaining attributes due to the lack of space; for BCM, we do not include accumulated values in the normalization.) Consider, for example, the φ of Average Group Fairness 3 , {f(x, 1) | x ∈ A}, and of Positive Average Equality Gap 5 , {f(x, 1) | x ∈ A, y(x) = 1}. They are both based on the probabilities associated with class 1, but the former is computed on all examples in A, while the latter is computed on only those examples that belong to the positive class (i.e., have gold label 1). This difference causes them to measure different types of bias: the first targets demographic parity, the second equality of opportunity.
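The two set-valued scoring functions contrasted above can be sketched directly; the function names are my own, and `probs_class1` stands for the probabilities f(x, 1):

```python
def phi_all(probs_class1, gold):
    """phi of Average Group Fairness: {f(x, 1) | x in A}.
    (gold is unused; kept only for a uniform signature.)"""
    return list(probs_class1)

def phi_positive_only(probs_class1, gold):
    """phi of Positive Average Equality Gap:
    {f(x, 1) | x in A, y(x) = 1} -- gold-positive examples only."""
    return [p for p, y in zip(probs_class1, gold) if y == 1]

probs = [0.9, 0.2, 0.7, 0.4]  # probability of class 1 for four examples
gold = [1, 0, 1, 0]           # gold labels y(x)
```

On the same subset A, the first function yields all four probabilities while the second keeps only those of the gold-positive examples, which is exactly what shifts the measured notion from demographic parity to equality of opportunity.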
Further, consider FPED 1 and FNED 2 , which employ FPR and FNR for their scores, respectively. This difference alone can lead to entirely different results. E.g., in Fig. 2a FNED reveals prominent bias for the cis group, while FPED shows none. Taken together, these results signal that the model's behaviour for this group is notably different from that for the other groups, but this difference manifests itself only on the positive examples.
(In)Correct Normalization Next, we highlight the importance of correct normalization. We argue that fairness metrics should be invariant to the number of considered protected groups; otherwise the bias measurements are incomparable and can be misleadingly elevated. The latter is the case for three metrics: FPED 1 , FNED 2 and Disparity Score 7 . The first two lack any kind of normalization, while Disparity Score is incorrectly normalized (N is set to the number of groups, rather than the number of group pairs). In Fig. 1 we present the results for the original versions of those metrics and for their correctly normalized versions, marked with *. The latter result in much lower bias measurements. This is all the more important for FPED and FNED, as they have been very influential, with many works relying exclusively on these metrics (Rios, 2020; Huang et al., 2020b; Gencoglu, 2021; Rios and Lwowski, 2020).
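A simplified numeric sketch of the normalization issue described above (the summed pairwise gaps here are a stand-in for the actual Disparity Score definition, which the sketch does not reproduce exactly):

```python
from itertools import combinations

def summed_gaps_normalized(scores, by_pairs=True):
    """Sum of pairwise absolute gaps, divided either by the number of
    group pairs (corrected, * version) or by the number of groups
    (the incorrect normalization discussed above)."""
    gaps = [abs(a - b) for a, b in combinations(scores, 2)]
    n = len(gaps) if by_pairs else len(scores)
    return sum(gaps) / n

scores = [0.1, 0.1, 0.1, 0.4]                   # one outlier among four groups
incorrect = summed_gaps_normalized(scores, by_pairs=False)  # / |T|
corrected = summed_gaps_normalized(scores, by_pairs=True)   # / (|T| choose 2)
```

With four groups there are six pairs, so dividing by the number of groups (4) instead of pairs (6) inflates the measurement, and the inflation grows with |T|.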
Relative vs Absolute Comparison Next, we argue that the results of metrics based on relative comparison, e.g., FPR Ratio 4 , can be misleading and hard to interpret if the original scores are not reported. In particular, relative comparison can amplify bias in cases when both scores are low; in such a scenario even a very small absolute difference can be relatively large. Such amplification is evident in the FNR Ratio metric (the FNR equivalent of FPR Ratio) on female vs male names for RoBERTa fine-tuned on SemEval-2 (Fig. 2b). Similarly, when both scores are very high, the bias can be underestimated: a significant difference between the scores can seem relatively small if both scores are large. Indeed, such effects have also been widely discussed in the context of reporting health risks (Forrow et al., 1992; Stegenga, 2015; Noordzij et al., 2017). In contrast, the results of metrics based on absolute comparison can be meaningfully interpreted, even without the original scores, if the range of the scoring function is known and interpretable (which is the case for all metrics we review).
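A small numeric sketch of both effects described above, with invented scores:

```python
def relative_gap(a, b):
    # FPR/FNR Ratio style comparison
    return a / b

def absolute_gap(a, b):
    return abs(a - b)

# Both scores low: a tiny absolute gap of 0.01 looks like a 2x disparity.
low_a, low_b = 0.02, 0.01
# Both scores high: a sizeable absolute gap of 0.15 looks relatively small.
high_a, high_b = 0.95, 0.80
```

Here the relative gap is larger in the first case (2.0 vs ~1.19) even though the absolute gap is fifteen times smaller, which is exactly why relative metrics are hard to interpret without the underlying scores.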

Importance of Per-Group Results
Most group metrics accumulate the results obtained for different groups. Such accumulation leads to diluted bias measurements in situations where the performance differs only for a small proportion of all groups. This is evident in, for example, the per-group NER results for correctly-normalized metrics (Fig. 3). We emphasize the importance of reporting per-group results whenever possible.
Prediction vs Probability Based In contrast to prediction-based metrics, probability-based metrics also capture more subtle performance differences which do not lead to different predictions. This difference can be seen, for example, in the results for the aab Gender group for SemEval-2 (Fig. 2a) and the results for female/male names for SemEval-3 (Fig. 2d). We contend it is beneficial to employ both types of metrics, to understand the effect of behaviour differences on predictions and to allow for detection of more subtle differences.

Figure 3: Results for the NER model on the Nationality attribute for six groups defined by categorizing countries based on their GDP (six quantiles), for the (most relevant) LOC class. We present group metrics at the top and the counterfactual metrics at the bottom. The probability-based metrics not marked with (TC) use probability scores for LOC for all tokens, including O; hence they are less meaningful than their TC alternatives.
Signed vs Unsigned Out of the 15 white-circled metrics, only two are signed: Positive and Negative Average Equality Gap (AvgEG) 5 6 . Employing at least one signed metric allows for quick identification of the bias direction. For example, results for Average Equality Gap reveal that examples mentioning the cis Gender group are considered less positive than examples mentioning other groups and that, for NER, the probability of LOC is lower for the richest countries (the first and second quantiles have negative signs).

True Class Evaluation
We observe that the TC versions of probability-based metrics allow for a better understanding of bias location, compared to their non-TC alternatives. Consider Average Group Fairness 3 and its TC versions evaluated on the positive class (PosAvgGF) and negative class (NegAvgGF) for binary classification (Fig. 2a). The latter two reveal that the differences in behaviour apply solely to the positive examples.

Fairness Metrics vs Significance Tests
Just like fairness metrics, statistical significance tests can also detect the presence of systematic differences in the behaviour of a model, and hence are often employed as alternative means to quantify bias (Davidson et al., 2019; Zhiltsova et al., 2019). However, in contrast to fairness metrics, significance tests do not capture the magnitude of the differences. Rather, they quantify the likelihood of observing given differences under the null hypothesis. This is an important distinction with clear empirical consequences, as even very subtle differences between the scores can be statistically significant.
To demonstrate this, we present p-values for significance tests for which we use the probability of the positive class as a dependent variable (Table 7). Following previous work, we obtain a single probability score for each template by averaging the results across all identity terms per group. Since we evaluate on synthetic data which is balanced across all groups, we use the scores for all templates regardless of their gold class. We use the Friedman test for all attributes with more than two protected groups. For Gender with male/female names as identity terms we use the Wilcoxon signed-rank test. We observe that, despite the low absolute values of the metrics obtained for the Nationality attribute (Fig. 1), the behaviour of the models across the groups is unlikely to be equal. The same applies to the results for female vs male names for SemEval-3 (Fig. 2d). Employing a test for statistical significance can capture such nuanced presence of bias.
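The point that tiny but consistent differences can be highly significant can be illustrated with a minimal paired test. The sketch below uses an exact two-sided sign test as a simple stdlib stand-in for the Wilcoxon signed-rank test used above; the function name and scores are invented for illustration:

```python
from math import comb

def sign_test_pvalue(a, b):
    """Exact two-sided sign test on paired samples: under the null,
    each nonzero paired difference is equally likely to be positive
    or negative, so the count of positives is Binomial(n, 0.5)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    pos = sum(d > 0 for d in diffs)
    # two-sided exact binomial tail probability
    tail = sum(comb(n, k) for k in range(min(pos, n - pos) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Positive-class probabilities for ten templates: the two name groups
# differ by only 0.01, yet the difference is perfectly consistent.
female = [0.70 + 0.001 * i for i in range(10)]
male = [f - 0.01 for f in female]
p = sign_test_pvalue(female, male)  # far below 0.05 despite a tiny gap
```

A fairness metric based on absolute differences would report a negligible magnitude (0.01) here, while the test flags the behaviour as systematically unequal, which is exactly the distinction drawn above.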
Notably, Average Equality Gap metrics 5 6 occupy an atypical middle ground between being a fairness metric and a significance test. In contrast to other metrics from Table 2, they do not quantify the magnitude of the differences, but the likelihood of a group being considered less positive than the background.

Which Metrics to Choose?
In the previous section we highlighted important differences between the metrics which stem from different parameter choices. In particular, we emphasized the difference between prediction-based and probability-based metrics with regard to their sensitivity to bias, as well as the conceptual distinction between fairness metrics and significance tests. We also stressed the importance of correct normalization of metrics and of reporting per-group results whenever possible. However, one important question still remains unanswered: out of the many different metrics that can be used, which ones are the most appropriate? Unfortunately, there is no easy answer. The choice of metrics depends on many factors, including the task, the particulars of how and where the system is deployed, as well as the goals of the researcher.
In line with the recommendations of Olteanu et al. (2017) and others, we assert that fairness metrics need to be grounded in the application domain and carefully matched to the type of studied bias to offer meaningful insights. While we cannot be too prescriptive about the exact metrics to choose, we advise against reporting results for all the metrics presented in this paper. Instead, we suggest a three-step process which helps to narrow down the full range of metrics to those that are the most applicable.
Step 1. Identifying the type of question to ask and choosing the appropriate generalized metric to answer it. As discussed in Section 3, each generalized metric is most suitable in different scenarios; e.g., MCM metrics can be used to investigate whether the attribute has any overall effect on the model's performance, while (V)BCM allows one to investigate how the performance for particular groups differs with respect to the model's general performance.
Step 2. Identifying scoring functions which target the studied type and aspect of bias. At this stage it is important to consider the practical consequences behind potential base measurements. E.g., for sentiment classification, misclassifying positive sentences mentioning a specific demographic as negative can be more harmful than misclassifying negative sentences as positive, as it can perpetuate negative stereotypes. Consequently, the most appropriate φ would be based on FNR or the probability of the negative class. In contrast, in the context of convicting low-level crimes, a false positive has more serious practical consequences than a false negative, since it may have a long-term detrimental effect on a person's life. Further, the parametrization of φ should be carefully matched to the motivation of the study and the assumed type/conceptualization of bias.
Step 3. Making the remaining parameter choices. In particular, deciding on the comparison function most suitable for the selected φ and the targeted bias; e.g., absolute difference for a scalar-valued φ, or Wasserstein-1 distance for a set-valued φ.
The above three steps can identify the most relevant metrics, which can be further filtered down to the minimal set sufficient to identify studied bias. To get a complete understanding of a model's (un)fairness, our general suggestion is to consider at least one prediction-based metric and one probability-based metric. Those can be further complemented with a test for statistical significance. Finally, it is essential that the results of each metric are interpreted in the context of the score employed by that metric (see Section 6). It is also universally good practice to report the results from all selected metrics, regardless of whether they do or do not give evidence of bias.

Related Work
To our knowledge, we are the first to review and empirically compare fairness metrics used within NLP. Close to our endeavour are surveys which discuss types, sources and mitigation of bias in NLP or AI in general. The surveys of Mehrabi et al. (2019) and Chouldechova and Roth (2020) cover a broad scope of literature on algorithmic fairness. Shah et al. (2020) offer both a survey of bias in NLP and a conceptual framework for studying bias. Sun et al. (2019) provide a comprehensive overview of addressing gender bias in NLP. There are also many task-specific surveys, e.g., for language generation (Sheng et al., 2021) or machine translation (Savoldi et al., 2021). Finally, recent work outlines a number of methodological issues, such as providing vague motivations, which are common for papers on bias in NLP.

Conclusion
We conduct a thorough review of existing fairness metrics and demonstrate that they are simply parametric variants of the three generalized fairness metrics we propose, each suited to a different type of scientific question. Further, we empirically demonstrate that the differences in parameter choices for our generalized metrics have a direct impact on the bias measurement. In light of our results, we provide a range of concrete suggestions to guide NLP practitioners in their metric choices.
We hope that our work will facilitate further research in the bias domain and allow the researchers to direct their efforts towards bias mitigation. Since our framework is language and model agnostic, in the future we plan to experiment on more languages and use our framework as principled means of comparing different models with respect to bias.