Abstract
Measuring bias is key for better understanding and addressing unfairness in NLP/ML models. This is often done via fairness metrics, which quantify the differences in a model's behavior across a range of demographic groups. In this work, we shed more light on the differences and similarities between the fairness metrics used in NLP. First, we unify a broad range of existing metrics under three generalized fairness metrics, revealing the connections between them. Next, we carry out an extensive empirical comparison of existing metrics and demonstrate that the observed differences in bias measurement can be systematically explained via differences in parameter choices for our generalized metrics.
1 Introduction
The prevalence of unintended social biases in NLP models has been recently identified as a major concern for the field. A number of papers have published evidence of uneven treatment of different demographics (Dixon et al., 2018; Zhao et al., 2018; Rudinger et al., 2018; Garg et al., 2019; Borkan et al., 2019; Stanovsky et al., 2019; Gonen and Webster, 2020; Huang et al., 2020a; Nangia et al., 2020), which can reportedly cause a variety of serious harms, like unfair allocation of opportunities or unfavorable representation of particular social groups (Blodgett et al., 2020).
Measuring bias in NLP models is key for better understanding and addressing unfairness. This is often done via fairness metrics, which quantify the differences in a model’s behavior across a range of social groups. The community has proposed a multitude of such metrics (Dixon et al., 2018; Garg et al., 2019; Huang et al., 2020a; Borkan et al., 2019; Gaut et al., 2020). In this paper, we aim to shed more light on how those varied means of quantifying bias differ and what facets of bias they capture. Developing such understanding is crucial for drawing reliable conclusions and actionable recommendations regarding bias. We focus on bias measurement for downstream tasks, as Goldfarb-Tarrant et al. (2021) have recently shown that there is no reliable correlation between bias measured intrinsically on, for example, word embeddings, and bias measured extrinsically on a downstream task. We narrow down the scope of this paper to tasks that do not involve prediction of a sensitive attribute.
We survey 146 papers on social bias in NLP and unify the multitude of disparate metrics we find under three generalized fairness metrics. Through this unification we reveal the key connections between a wide range of existing metrics—we show that they are simply different parametrizations of our generalized metrics. Next, we empirically investigate the role of different metrics in detecting the systemic differences in performance for different demographic groups, namely, differences in quality of service (Jacobs et al., 2020). We experiment on three transformer-based models—two models for sentiment analysis and one for named entity recognition (NER)—which we evaluate for fairness with respect to seven different sensitive attributes, qualified for protection under the United States federal anti-discrimination law:1 Gender, Sexual Orientation, Religion, Nationality, Race, Age, and Disability. Our results highlight the differences in bias measurements across the metrics and we discuss how these variations can be systematically explained via different parameter choices of our generalized metrics. Our proposed unification and observations can guide decisions about which metrics (and parameters) to use, allowing researchers to focus on the pressing matter of bias mitigation, rather than reinventing parametric variants of the same metrics. While we focus our experiments on English, the metrics we study are language-agnostic and our methodology can be trivially applied to other languages.
We release our code with implementations of all metrics discussed in this paper.2 Our implementation mirrors our generalized formulation (Section 3), which simplifies the creation of new metrics. We build our code on top of CheckList3 (Ribeiro et al., 2020), making it compatible with the CheckList testing functionalities; that is, one can evaluate the model using the fairness metrics, as well as the CheckList-style tests, like invariance, under a single bias evaluation framework.
2 Background
2.1 Terminology
We use the term sensitive attribute to refer to a category by which people are qualified for protection (e.g., Religion or Gender). For each sensitive attribute we define a set of protected groups, T (e.g., for Gender, T could be set to {female, male, non-binary}). Next, each protected group can be expressed through one of its identity terms, I (e.g., for the protected group female those terms could be {woman, female, girl} or a set of typically female names).
2.2 Definitions of Fairness in NLP
The metrics proposed to quantify bias in NLP models across a range of social groups can be categorized based on whether they operationalize notions of group or counterfactual fairness. In this section we give a brief overview of both and encourage the reader to consult Hutchinson and Mitchell (2019) for a broader scope of literature on fairness, dating back to the 1960s.
Group fairness
requires parity of some statistical measure across a small set of protected groups (Chouldechova and Roth, 2020). Some prominent examples are demographic parity (Dwork et al., 2012), which requires an equal positive classification rate across different groups, and equalized odds (Hardt et al., 2016), which for binary classification requires equal true positive and false positive rates. In NLP, group fairness metrics are based on performance comparisons for different sets of examples, for example, the comparison of two F1 scores: one for examples mentioning female names and one for examples with male names.
Counterfactual fairness
requires parity for two or more versions of an individual, one from the actual world and others from counterfactual worlds in which the individual belongs to a different protected group; that is, it requires invariance to the change of the protected group (Kusner et al., 2017). Counterfactual fairness is often viewed as a type of individual fairness, which asks for similar individuals to be treated similarly (Dwork et al., 2012). In NLP, counterfactual fairness metrics are based on comparisons of performance for variations of the same sentence, which differ in mentioned identity terms. Such data can be created through perturbing real-world sentences or creating synthetic sentences from templates.
In this work, we require that for each protected group there exists at least one sentence variation for every source example (pre-perturbation sentence or a template). In practice, the number of variations for each protected group will depend on the cardinality of I (Table 1). In contrast to most NLP works (Dixon et al., 2018; Garg et al., 2019; Sheng et al., 2020), we allow for a protected group to be realized as more than one identity term. To allow for this, we separate the variations for each source example into |T| sets, each of which can be viewed as a separate counterfactual world.
Table 1: Example source templates and their variations for the female and male protected groups.

Source Example | Female | Male |
---|---|---|
I like {person}. | I like Anna. | I like Adam. |
 | I like Mary. | I like Mark. |
 | I like Liz. | I like Chris. |
{Person} has friends. | Anna has friends. | Adam has friends. |
 | Mary has friends. | Mark has friends. |
 | Liz has friends. | Chris has friends. |
3 Generalized Fairness Metrics
We introduce three generalized fairness metrics that are based on different comparisons between protected groups and are model and task agnostic. They are defined in terms of two parameters:
- (i) A scoring function, ϕ, which calculates the score on a subset of examples. The score is a base measurement used to calculate the metric and can be either a scalar or a set (see Table 2 for examples).
- (ii) A comparison function, d, which takes a range of different scores—computed for different subsets of examples—and outputs a single scalar value.
Each of the three metrics is conceptually different and is most suitable in different scenarios; the choice of the most appropriate one depends on the scientific question being asked. Through different choices for ϕ and d, we can systematically formulate a broad range of different fairness metrics, targeting different types of questions. We demonstrate this in Section 4 and Table 2, where we show that many metrics from the NLP literature can be viewed as parametrizations of the metrics we propose here. To account for the differences between group and counterfactual fairness (Section 2.2) we define two different versions of each metric.
Table 2: Existing fairness metrics from the literature expressed as instances of our generalized metrics (Section 3).

| Metric | Gen. Metric | ϕ(A) | d | N | |
---|---|---|---|---|---|---|
Group metrics | ||||||
① | False Positive Equality Difference (FPED) | BCM | False Positive Rate | |x − y| | 1 | S |
② | False Negative Equality Difference (FNED) | False Negative Rate | |x − y| | 1 | S | |
③ | Average Group Fairness (AvgGF) | {f(x,1)∣x ∈ A} | W1(X,Y ) | |T| | S | |
④ | FPR Ratio | VBCM | False Positive Rate | – | ||
⑤ | Positive Average Equality Gap (PosAvgEG) | {f(x,1)∣x ∈ A,y(x) = 1} | – | |||
⑥ | Negative Average Equality Gap (NegAvgEG) | {f(x,1)∣x ∈ A,y(x) = 0} | – | |||
⑦ | Disparity Score | PCM | F1 | |x − y| | |T| | – |
⑧ | *TPR Gap | True Positive Rate | |x − y| | – | ||
⑨ | *TNR Gap | True Negative Rate | |x − y| | – | ||
⑩ | *Parity Gap | |x − y| | – | |||
⑪ | *Accuracy Difference | Accuracy | x − y | 1 | – | |
⑫ | *TPR Difference | True Positive Rate | x − y | 1 | – | |
⑬ | *F1 Difference | F1 | x − y | 1 | – | |
⑭ | *LAS Difference | LAS | x − y | 1 | – | |
⑮ | *Recall Difference | Recall | x − y | 1 | – | |
⑯ | *F1 Ratio | Recall | 1 | – | ||
Counterfactual metrics | ||||||
⑰ | Counterfactual Token Fairness Gap (CFGap) | BCM | f(x,1), A = {x} | |x − y| | |T| | {xj} |
⑱ | Perturbation Score Sensitivity (PertSS) | VBCM | f(x, y(x)), A = {x} | |x − y| | |T| | {xj} |
⑲ | Perturbation Score Deviation (PertSD) | MCM | f(x, y(x)), A = {x} | std(X) | – | – |
⑳ | Perturbation Score Range (PertSR) | f(x, y(x)), A = {x} | max(X) −min(X) | – | – | |
㉑ | Average Individual Fairness (AvgIF) | PCM | {f(x,1)∣x ∈ A} | W1(X,Y ) | – | |
㉒ | *Average Score Difference | mean({f(x,1)∣x ∈ A}) | x − y | – |
Notation
Let T = {t_1, t_2, …, t_{|T|}} be the set of all protected groups for a given sensitive attribute, for example, Gender, and let ϕ(A) be the score for some set of examples A. This score can be either a set or a scalar, depending on the parametrization of ϕ. For group fairness, let S be the set of all evaluation examples. We denote the subset of examples associated with a protected group t_i as S^{t_i}. For counterfactual fairness, let X = {x_1, x_2, …, x_{|X|}} be a set of source examples, e.g., sentences pre-perturbation, and S′ = {S′_1, S′_2, …, S′_{|X|}} be a set of sets of evaluation examples, where S′_j is the set of all variations of a source example x_j; i.e., there is a one-to-one correspondence between S′ and X. We use S′^{t_i}_j to denote the subset of S′_j associated with a protected group t_i. For example, if T = {female, male} and the templates were defined as in Table 1, then S′^{female}_1 = {‘I like Anna.’, ‘I like Mary.’, ‘I like Liz.’}.
3.1 Pairwise Comparison Metric
Pairwise Comparison Metric (PCM) quantifies how distant, on average, the scores for two different protected groups are: the comparison function d is applied to the scores of pairs of protected groups and the results are aggregated across pairs.
3.2 Background Comparison Metric
Background Comparison Metric (BCM) compares the score for each protected group against the score computed on a background set of examples (e.g., all evaluation examples, or the source example for the counterfactual version) and aggregates the results across groups.
Vector-valued BCM
In its basic form, BCM aggregates the results obtained for different protected groups in order to return a single scalar value. Such aggregation provides a concise signal about the presence and magnitude of bias, but it does so at the cost of losing information. Often, it is important to understand how different protected groups contribute to the resulting outcome. This requires the individual group results not to be accumulated; that is, dropping the aggregation over groups from equations (3) and (4). We call this version of BCM the vector-valued BCM (VBCM).
3.3 Multi-group Comparison Metric
Multi-group Comparison Metric (MCM) differs from the other two in that the comparison function d takes as arguments the scores for all protected groups. This metric can quantify the global effect that a sensitive attribute has on a model's performance; for example, whether the change of Gender has any effect on a model's scores. It can provide a useful initial insight, but further inspection is required to develop a better understanding of the underlying bias, if it is detected.
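To make the parametrization concrete, below is a minimal Python sketch of how the three generalized metrics can be expressed as higher-order functions of ϕ and d. It follows the textual descriptions above rather than the exact equations and normalization constants N, so it should be read as illustrative, not as the released implementation.

```python
from itertools import combinations
from statistics import mean

# Assumptions (not from the released code):
#   phi(examples)        -> a scalar or set-valued score (the base measurement)
#   d(x, y) / d(*scores) -> a comparison function returning a scalar
#   groups               -> dict mapping each protected group t_i to its subset S^{t_i}
#   background           -> the background set (e.g., all evaluation examples S)

def pcm(groups, phi, d):
    """Pairwise Comparison Metric: average d over pairs of protected groups."""
    pairs = combinations(groups.values(), 2)
    return mean(d(phi(a), phi(b)) for a, b in pairs)

def vbcm(groups, background, phi, d):
    """Vector-valued BCM: per-group comparison against the background, unaggregated."""
    return {t: d(phi(background), phi(s_t)) for t, s_t in groups.items()}

def bcm(groups, background, phi, d):
    """Background Comparison Metric: the aggregated (averaged) version of VBCM."""
    return mean(vbcm(groups, background, phi, d).values())

def mcm(groups, phi, d):
    """Multi-group Comparison Metric: d consumes the scores of all groups at once."""
    return d(*(phi(s_t) for s_t in groups.values()))
```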
4 Classifying Existing Fairness Metrics Within the Generalized Metrics
Table 2 expresses 22 metrics from the literature as instances of our generalized metrics from Section 3. The presented metrics span a number of NLP tasks, including text classification (Dixon et al., 2018; Kiritchenko and Mohammad, 2018; Garg et al., 2019; Borkan et al., 2019; Prabhakaran et al., 2019), relation extraction (Gaut et al., 2020), text generation (Huang et al., 2020a) and dependency parsing (Blodgett et al., 2018).
We arrive at this list by reviewing 146 papers that study bias from the survey of Blodgett et al. (2020) and selecting metrics that meet three criteria: (i) the metric is extrinsic; that is, it is applied to at least one downstream NLP task,4 (ii) it quantifies the difference in performance across two or more groups, and (iii) it is not based on the prediction of a sensitive attribute—metrics based on a model's predictions of sensitive attributes, for example, in image captioning or text generation, constitute a specialized sub-type of fairness metrics. Out of the 26 metrics we find, only four do not fit within our framework: BPSN and BNSP (Borkan et al., 2019), the metric of De-Arteaga et al. (2019), and Perturbation Label Distance (Prabhakaran et al., 2019).5
Importantly, many of the metrics we find are PCMs defined for only two protected groups, typically for male and female genders or white and non-white races. Only those that use commutative d can be straightforwardly adjusted to more groups. Those that cannot be adjusted are marked with gray circles in Table 2.
Prediction vs. Probability Based Metrics
Beyond the categorization into PCM, BCM, and MCM, as well as group and counterfactual fairness, the metrics can be further categorized into prediction or probability based. The former calculate the score based on a model's predictions, while the latter use the probabilities assigned to a particular class or label (we found no metrics that make use of both probabilities and predictions). Thirteen out of 16 group fairness metrics are prediction based, while all counterfactual metrics are probability based. Since the majority of metrics in Table 2 are defined for binary classification, the prevalent scores for prediction-based metrics include false positive/negative rates (FPR/FNR) and true positive/negative rates (TPR/TNR). Most of the probability-based metrics are based on the probability associated with the positive/toxic class (class 1 in binary classification). The exceptions are the metrics of Prabhakaran et al. (2019), which utilize the probability of the target class ⑱ ⑲ ⑳.
Choice of ϕ and d
For scalar-valued ϕ the most common bivariate comparison function is the (absolute) difference between two scores. As outliers, Beutel et al. (2019) ④ use the ratio of the group score to the background score and Webster et al. (2018) ⑯ use the ratio between the first and the second group. Prabhakaran et al.’s (2019) MCM metrics use multivariate d. Their Perturbation Score Deviation metric ⑲ uses the standard deviation of the scores, while their Perturbation Score Range metric ⑳ uses the range of the scores (difference between the maximum and minimum score). For set-valued ϕ, Huang et al. (2020a) choose Wasserstein-1 distance (Jiang et al., 2020) ③ ㉑, while Borkan et al. (2019) define their comparison function using the Mann-Whitney U test statistic (Mann and Whitney, 1947).
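To illustrate these choices, the following sketch pairs a scalar-valued ϕ (false positive rate) with the absolute difference, as in FPED-style metrics ①, and a set-valued ϕ (per-example probabilities of class 1) with Wasserstein-1 distance, as in Average Group Fairness ③ and Average Individual Fairness ㉑. The helper names are ours, not from the released code.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Scalar-valued phi: false positive rate on a subset of (prediction, gold) pairs.
def false_positive_rate(preds, golds):
    preds, golds = np.asarray(preds), np.asarray(golds)
    negatives = golds == 0
    return float((preds[negatives] == 1).mean()) if negatives.any() else 0.0

# d for scalar scores: absolute difference between two group scores.
def abs_diff(x, y):
    return abs(x - y)

# Set-valued phi: probabilities assigned to class 1 for a subset of examples.
def positive_probs(probs_class1):
    return np.asarray(probs_class1)

# d for set-valued scores: Wasserstein-1 distance between two score distributions.
def w1(scores_a, scores_b):
    return wasserstein_distance(scores_a, scores_b)
```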
5 Experimental Details
Having introduced our generalized framework and classified the existing metrics, we now empirically investigate their role in detecting the systemic performance difference across the demographic groups. We first discuss the relevant experimental details before presenting our results and analyses (Section 6).
Models
We experiment on three RoBERTa (Liu et al., 2019) based models:6 (i) a binary classifier trained on the SemEval-2018 valence classification shared task data (Mohammad et al., 2018) processed for binary classification (SemEval-2),7 (ii) a 3-class classifier trained on SemEval-3, and (iii) a NER model trained on the CoNLL 2003 Shared Task data (Tjong Kim Sang and De Meulder, 2003), which uses RoBERTa to encode a text sequence and a Conditional Random Field (Lafferty et al., 2001) to predict the tags. In the NER experiments we use the bilou labeling scheme (Ratinov and Roth, 2009) and, for the probability-based metrics, we use the probabilities from the encoder's output. Table 5 reports the performance on the official dev splits of the datasets the models were trained on.
Evaluation Data
For classification, we experiment on seven sensitive attributes, and for each attribute we devise a number of protected groups (Table 3).8 We analyze bias within each attribute independently and focus on explicit mentions of each identity. This is reflected in our choice of identity terms, which we have gathered from Wikipedia, Wiktionary, as well as Dixon et al. (2018) and Hutchinson et al. (2020) (see Table 4 for an example). Additionally, for the Gender attribute we also investigate implicit mentions—female and male groups represented with names typically associated with these genders. We experiment on synthetic data created using hand-crafted templates, as is common in the literature (Dixon et al., 2018; Kiritchenko and Mohammad, 2018; Kurita et al., 2019; Huang et al., 2020a). For each sensitive attribute we use 60 templates with balanced classes: 20 negative, 20 neutral, and 20 positive templates. For each attribute we use 30 generic templates—with adjective and noun phrase slots to be filled with identity terms—and 30 attribute-specific templates.9 In Table 6 we present examples of both generic templates and attribute-specific templates for Nationality. Note that the slots of generic templates are designed to be filled with terms that explicitly reference an identity (Table 4), and are unsuitable for experiments on female/male names. For this reason, for names we design an additional 30 name-specific templates (60 in total). We present examples of those templates in Table 6.
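As a rough illustration of the data creation process, the sketch below expands templates into per-group evaluation examples. The templates and identity terms shown are placeholders, not the actual lists used in the experiments.

```python
# Illustrative placeholders, not the real template or term lists.
templates = [
    ("I like {identity} people.", "positive"),
    ("{identity} people are ordinary.", "neutral"),
]
identity_terms = {
    "female": ["female", "women"],
    "male": ["male", "men"],
}

def expand(templates, identity_terms):
    """Return {group: [(sentence, gold_label), ...]} with one variation per term."""
    data = {group: [] for group in identity_terms}
    for template, label in templates:
        for group, terms in identity_terms.items():
            for term in terms:
                data[group].append((template.format(identity=term), label))
    return data

per_group_examples = expand(templates, identity_terms)
```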
Table 3: Sensitive attributes and their protected groups.

Sensitive attribute | Protected groups (T) |
---|---|
Gender | aab, female, male, cis, many-genders, no-gender, non-binary, trans |
Sexual Orientation | asexual, homosexual, heterosexual, bisexual, other |
Religion | atheism, buddhism, baha’i-faith, christianity, hinduism, islam, judaism, mormonism, sikhism, taoism |
Race | african american, american indian, asian, hispanic, pacific islander, white |
Age | young, adult, old |
Disability | cerebral palsy, chronic illness, cognitive, down syndrome, epilepsy, hearing, mental health, mobility, physical, short stature, sight, unspecified, without |
Nationality | We define 6 groups by categorizing countries based on their GDP. |
Table 4: Example identity terms for selected Gender protected groups.

Protected group | Identity terms (I) |
---|---|
aab | AMAB, AFAB, DFAB, DMAB, female-assigned, male-assigned |
female | female (adj), female (n), woman |
male | male (adj), male (n), man |
many genders | ambigender, ambigendered, androgynous, bigender, bigendered, intersex, intersexual, pangender, pangendered, polygender, androgyne, hermaphrodite |
no-gender | agender, agendered, genderless |
Table 5: Performance of the models on the official dev splits of the datasets they were trained on.

SemEval-2 | SemEval-3 | CoNLL 2003 |
---|---|---|
Accuracy | Accuracy | F1 |
0.90 | 0.73 | 0.94 |
For NER, we only experiment on Nationality and generate the evaluation data from 22 templates with a missing {country} slot, for which we manually assign a bilou tag to each token. The {country} slot is initially labeled as u-loc and is later automatically adjusted to a sequence of labels if a country name filling the slot spans more than one token, for example, b-loc l-loc for New Zealand.
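The label adjustment can be sketched as follows, assuming whitespace tokenization of country names; this mirrors the description above rather than the exact preprocessing code.

```python
def bilou_labels_for_country(country, entity_type="loc"):
    """Expand a single u-<type> slot label into a bilou sequence for a
    (possibly multi-token) country name, e.g. 'New Zealand' -> ['b-loc', 'l-loc']."""
    tokens = country.split()
    if len(tokens) == 1:
        return [f"u-{entity_type}"]
    return ([f"b-{entity_type}"]
            + [f"i-{entity_type}"] * (len(tokens) - 2)
            + [f"l-{entity_type}"])

# bilou_labels_for_country("New Zealand")          -> ['b-loc', 'l-loc']
# bilou_labels_for_country("United Arab Emirates") -> ['b-loc', 'i-loc', 'l-loc']
```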
Metrics
We experiment on metrics that support more than two protected groups (i.e., the white-circled metrics in Table 2). As described in Section 2.2, for each source example we allow for a number of variations for each group. Hence, for counterfactual metrics that require only one example per group (all counterfactual metrics but Average Individual Fairness ㉑) we evaluate on the |T|-ary Cartesian products over the sets of variations for all groups. For groups with large |I| we sample 100 elements from the Cartesian product, without replacement. We convert Counterfactual Token Fairness Gap ⑰ and Perturbation Score Sensitivity ⑱ into PCMs because for templated data there is no single real-world example.
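For these counterfactual metrics, the |T|-ary tuples can be constructed roughly as follows; this is a sketch of the sampling procedure described above, with illustrative names.

```python
import random
from itertools import product

def counterfactual_tuples(variations_per_group, max_samples=100, seed=0):
    """variations_per_group: {group: [variations of one source example]}.
    Returns tuples with one variation per group; if the full Cartesian
    product is large, sample max_samples distinct tuples from it."""
    groups = sorted(variations_per_group)
    total = 1
    for g in groups:
        total *= len(variations_per_group[g])
    if total <= max_samples:
        return list(product(*(variations_per_group[g] for g in groups)))
    rng = random.Random(seed)
    sampled = set()
    while len(sampled) < max_samples:
        sampled.add(tuple(rng.choice(variations_per_group[g]) for g in groups))
    return list(sampled)
```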
Average Group Fairness ③, Counterfactual Token Fairness Gap ⑰, and Average Individual Fairness ㉑ calculate bias based on the probability of the positive/toxic class on all examples. We introduce alternative versions of these metrics which calculate bias only on examples with gold label c, which we mark with a (TC) suffix (for true class). The original versions target demographic parity (Dwork et al., 2012), while the TC versions target equality of opportunity (Hardt et al., 2016) and can pinpoint the existence of bias more precisely, as we show later (Section 6).
5.1 Moving Beyond Binary Classification
Fourteen out of 15 white-circled metrics from Table 2 are inherently classification metrics, 11 of which are defined exclusively for binary classification. We adapt binary classification metrics to (i) multiclass classification and (ii) sequence labeling to support a broader range of NLP tasks.
Multiclass Classification
Probability-based metrics that use the probability of the target class (⑱ ⑲ ⑳) do not require any adaptations for multiclass classification. For other metrics, we measure bias independently for each class c, using a one-vs-rest strategy for prediction-based metrics and the probability of class c for the scores of probability-based metrics (③ ⑤ ⑥ ⑰ ㉑).
Sequence Labeling
We view sequence labeling as a case of multiclass classification, with each token being a separate classification decision. As for multiclass classification, we compute the bias measurements for each class independently. For prediction-based metrics, we use a one-vs-rest strategy and base the F1 and FNR scores on exact span matching.10 For probability-based metrics, for each token we accumulate the probability scores for different labels of the same class. For example, with the bilou labeling scheme, the probabilities for b-per, i-per, l-per, and u-per are summed to obtain the probability for the class per. Further, for counterfactual metrics, to account for different identity terms yielding different numbers of tokens, we average the probability scores over all tokens of multi-token identity terms.
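The two aggregation steps described above can be sketched as follows, assuming each token's label distribution is given as a dict from bilou labels to probabilities; the function names are illustrative.

```python
def class_probability(token_label_probs, entity_type="per"):
    """Sum the probabilities of all bilou labels belonging to one class for a
    single token, e.g. b-per + i-per + l-per + u-per -> P(per)."""
    return sum(p for label, p in token_label_probs.items()
               if label.endswith(f"-{entity_type}"))

def identity_term_probability(per_token_label_probs, entity_type="per"):
    """Average the per-class probabilities over all tokens of a (possibly
    multi-token) identity term."""
    scores = [class_probability(t, entity_type) for t in per_token_label_probs]
    return sum(scores) / len(scores)
```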
6 Empirical Metric Comparison
Figure 1 shows the results for sentiment analysis for all attributes on BCM, PCM, and MCM metrics. In each table we report the original bias measurements and row-normalize the heatmap coloring using maximum absolute value scaling to allow for some cross-metric comparison.11 Figure 1 gives evidence of unintended bias for most of the attributes we consider, with Disability and Nationality being the most and least affected attributes, respectively. We highlight that, because we evaluate on simple synthetic data in which the expressed sentiment is evident, even small performance differences can be concerning. Figure 1 also gives an initial insight into how the bias measurements vary across the metrics.
In Figure 2 we present the per-group results for VBCM and BCM metrics, using the Gender attribute as an example.12 Similarly, in Figure 3 we show results for NER for the relevant loc class. The first set of results indicates that the most problematic Gender group is cis. For NER we observe a large gap in the model's performance between the most affluent countries and countries with lower GDP. In the context of these empirical results we now discuss how different parameter choices affect the observed bias measurement.
Key Role of the Base Measurement
Perhaps the most important difference between the metrics lies in the parametrization of the scoring function ϕ. The choice of ϕ determines what type and aspect of bias is being measured, making the metrics conceptually different. Consider, for example, the ϕ of Average Group Fairness ③—{f(x,1) ∣ x ∈ A}—and of Positive Average Equality Gap ⑤—{f(x,1) ∣ x ∈ A, y(x) = 1}. They are both based on the probabilities associated with class 1, but the former is computed on all examples in A, while the latter is computed only on those examples that belong to the positive class (i.e., have gold label 1). This difference causes them to measure different types of bias—the first targets demographic parity, the second equality of opportunity.
Further, consider FPED ① and FNED ②, which use FPR and FNR for their score, respectively. This difference alone can lead to entirely different results. For example, in Figure 2a FNED reveals prominent bias for the cis group while FPED shows none. Taken together, these results signal that the model’s behavior for this group is notably different from the other groups but this difference manifests itself only on the positive examples.
(In)Correct Normalization
Next, we highlight the importance of correct normalization. We argue that fairness metrics should be invariant to the number of considered protected groups; otherwise the bias measurements are incomparable and can be misleadingly elevated. The latter is the case for three metrics—FPED ①, FNED ②, and Disparity Score ⑦. The first two lack any kind of normalization, while Disparity Score is incorrectly normalized—N is set to the number of groups, rather than the number of group pairs. In Figure 1 we present results for the original versions of those metrics and for their correctly normalized versions, marked with *. The latter result in much lower bias measurements. This is all the more important for FPED and FNED, as they have been very influential, with many works relying exclusively on these metrics (Rios, 2020; Huang et al., 2020b; Gencoglu, 2021; Rios and Lwowski, 2020).
Relative vs Absolute Comparison
Next, we argue that the results of metrics based on relative comparison, for example, FPR Ratio ④, can be misleading and hard to interpret if the original scores are not reported. In particular, relative comparison can amplify bias when both scores are low; in such scenarios even a very small absolute difference can be relatively large. Such amplification is evident in the FNR Ratio metric (the FNR equivalent of FPR Ratio) on female vs male names for RoBERTa fine-tuned on SemEval-2 (Figure 2b). Similarly, when both scores are very high, the bias can be underestimated—a significant difference between the scores can seem relatively small if both scores are large. Indeed, such effects have also been widely discussed in the context of reporting health risks (Forrow et al., 1992; Stegenga, 2015; Noordzij et al., 2017). In contrast, the results of metrics based on absolute comparison can be meaningfully interpreted, even without the original scores, if the range of the scoring function is known and interpretable (which is the case for all metrics we review).
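A small numeric illustration of this amplification effect, with hypothetical scores chosen only to make the point:

```python
# Hypothetical false negative rates for a protected group and the background.
fnr_group, fnr_background = 0.03, 0.01

ratio = fnr_group / fnr_background         # 3.0  -> suggests a large disparity
abs_gap = abs(fnr_group - fnr_background)  # 0.02 -> the absolute gap is small
```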
Importance of Per-Group Results
Most group metrics accumulate the results obtained for different groups. Such accumulation leads to diluted bias measurements in situations where the performance differs only for a small proportion of all groups. This is evident in, for example, the per-group NER results for correctly normalized metrics (Figure 3). We emphasize the importance of reporting per-group results whenever possible.
Prediction vs Probability Based
In contrast to prediction-based metrics, probability-based metrics also capture more subtle performance differences that do not lead to different predictions. This difference can be seen, for example, in the results for the aab Gender group for SemEval-2 (Figure 2a) and the results for female/male names for SemEval-3 (Figure 2d). We contend that it is beneficial to use both types of metrics, to understand the effect of behavior differences on predictions and to allow for the detection of more subtle differences.
Signed vs Unsigned
Out of the 15 white-circled metrics only two are signed: Positive and Negative Average Equality Gap (AvgEG) ⑤ ⑥. Using at least one signed metric allows for quick identification of the bias direction. For example, results for Average Equality Gap reveal that examples mentioning the cis Gender group are considered less positive than examples mentioning other groups and that, for NER, the probability of loc is lower for the richest countries (the first and second quantiles have negative signs).
True Class Evaluation
We observe that the TC versions of probability-based metrics allow for a better understanding of where bias is located, compared with their non-TC alternatives. Consider Average Group Fairness ③ and its TC versions evaluated on the positive class (PosAvgGF) and the negative class (NegAvgGF) for binary classification (Figure 2a). The latter two reveal that the differences in behavior apply solely to the positive examples.
6.1 Fairness Metrics vs Significance Tests
Just like fairness metrics, statistical significance tests can also detect the presence of systematic differences in the behavior of a model, and hence are often used as alternative means to quantify bias (Mohammad et al., 2018; Davidson et al., 2019; Zhiltsova et al., 2019). However, in contrast to fairness metrics, significance tests do not capture the magnitude of the differences. Rather, they quantify the likelihood of observing given differences under the null hypothesis. This is an important distinction with clear empirical consequences, as even very subtle differences between the scores can be statistically significant.
To demonstrate this, we present p-values for significance tests for which we use the probability of the positive class as a dependent variable (Table 7). Following Kiritchenko and Mohammad (2018), we obtain a single probability score for each template by averaging the results across all identity terms per group. Because we evaluate on synthetic data, which is balanced across all groups, we use the scores for all templates regardless of their gold class. We use the Friedman test for all attributes with more than two protected groups. For Gender with male/female names as identity terms we use the Wilcoxon signed-rank test. We observe that, despite the low absolute values of the metrics obtained for the Nationality attribute (Figure 1), the behavior of the models across the groups is unlikely to be equal. The same applies to the results for female vs male names for SemEval-3 (Figure 2d). Utilizing a test for statistical significance can capture such nuanced presence of bias.
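These tests can be run with standard routines from scipy; the sketch below (with illustrative names) assumes one averaged positive-class probability per template and group, aligned across groups as described above.

```python
from scipy.stats import friedmanchisquare, wilcoxon

# per_group_scores: {group: [avg. positive-class probability per template]},
# with templates aligned across groups (repeated-measures design).
def attribute_p_value(per_group_scores):
    groups = sorted(per_group_scores)
    samples = [per_group_scores[g] for g in groups]
    if len(samples) > 2:
        # Friedman test for attributes with more than two protected groups.
        return friedmanchisquare(*samples).pvalue
    # Wilcoxon signed-rank test for exactly two groups (e.g., female vs male names).
    return wilcoxon(samples[0], samples[1]).pvalue
```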
Table 7: p-values of the significance tests for each sensitive attribute.

Attribute | SemEval-2 | SemEval-3 |
---|---|---|
Gender (names) | 8.72 × 10−1 | 3.05 × 10−6 |
Gender | 1.41 × 10−8 | 3.80 × 10−24 |
Sexual Orientation | 2.76 × 10−9 | 9.49 × 10−24 |
Religion | 1.14 × 10−23 | 8.24 × 10−36 |
Nationality | 1.61 × 10−2 | 1.45 × 10−14 |
Race | 2.18 × 10−5 | 8.44 × 10−5 |
Age | 4.86 × 10−2 | 4.81 × 10−8 |
Disability | 9.67 × 10−31 | 2.89 × 10−44 |
Notably, Average Equality Gap metrics ⑤ ⑥ occupy an atypical middle ground between being a fairness metric and a significance test. In contrast to other metrics from Table 2, they do not quantify the magnitude of the differences, but the likelihood of a group being considered less positive than the background.
7 Which Metrics to Choose?
In the previous section we highlighted important differences between the metrics which stem from different parameter choices. In particular, we emphasized the difference between prediction and probability-based metrics, in regards to their sensitivity to bias, as well as the conceptual distinction between the fairness metrics and significance tests. We also stressed the importance of correct normalization of metrics and reporting per-group results whenever possible. However, one important question still remains unanswered: Out of the many different metrics that can be used, which ones are the most appropriate? Unfortunately, there is no easy answer. The choice of the metrics depends on many factors, including the task, the particulars of how and where the system is deployed, as well as the goals of the researcher.
In line with the recommendations of Olteanu et al. (2017) and Blodgett et al. (2020), we assert that fairness metrics need to be grounded in the application domain and carefully matched to the type of studied bias to offer meaningful insights. While we cannot be too prescriptive about the exact metrics to choose, we advise against reporting results for all the metrics presented in this paper. Instead, we suggest a three-step process that helps to narrow down the full range of metrics to those that are the most applicable.
Step 1. Identifying the type of question to ask and choosing the appropriate generalized metric to answer it.
As discussed in Section 3, each generalized metric is most suitable in different scenarios; for example, MCM metrics can be used to investigate whether the attribute has any overall effect on the model's performance, while (V)BCM allows us to investigate how the performance for particular groups differs with respect to the model's general performance.
Step 2. Identifying scoring functions that target the studied type and aspect of bias.
At this stage it is important to consider the practical consequences behind potential base measurements. For example, for sentiment classification, misclassifying positive sentences mentioning a specific demographic as negative can be more harmful than misclassifying negative sentences as positive, as it can perpetuate negative stereotypes. Consequently, the most appropriate ϕ would be based on FNR or the probability of the negative class. In contrast, in the context of convicting low-level crimes, a false positive has more serious practical consequences than a false negative, since it may have a long-term detrimental effect on a person's life. Further, the parametrization of ϕ should be carefully matched to the motivation of the study and the assumed type/conceptualization of bias.
Step 3. Making the remaining parameter choices.
In particular, this involves deciding on the comparison function most suitable for the selected ϕ and the targeted bias; for example, absolute difference if ϕ is scalar-valued, or Wasserstein-1 distance if ϕ is set-valued.
The above three steps can identify the most relevant metrics, which can be further filtered down to the minimal set sufficient to identify studied bias. To get a complete understanding of a model’s (un)fairness, our general suggestion is to consider at least one prediction-based metric and one probability-based metric. Those can be further complemented with a test for statistical significance. Finally, it is essential that the results of each metric are interpreted in the context of the score employed by that metric (see Section 6). It is also universally good practice to report the results from all selected metrics, regardless of whether they do or do not give evidence of bias.
8 Related Work
To our knowledge, we are the first to review and empirically compare fairness metrics used within NLP. Close to our endeavor are surveys that discuss types, sources, and mitigation of bias in NLP or AI in general. Surveys of Mehrabi et al. (2019), Hutchinson and Mitchell (2019), and Chouldechova and Roth (2020) cover a broad scope of literature on algorithmic fairness. Shah et al. (2020) offer both a survey of bias in NLP as well as a conceptual framework for studying bias. Sun et al. (2019) provide a comprehensive overview of addressing gender bias in NLP. There are also many task specific surveys, for example, for language generation (Sheng et al., 2021) or machine translation (Savoldi et al., 2021). Finally, Blodgett et al. (2020) outline a number of methodological issues, such as providing vague motivations, which are common for papers on bias in NLP.
We focus on measuring bias exhibited on classification and sequence labeling downstream tasks. A related line of research measures bias present in sentence or word representations (Bolukbasi et al., 2016; Caliskan et al., 2017; Kurita et al., 2019; Sedoc and Ungar, 2019; Chaloner and Maldonado, 2019; Dev and Phillips, 2019; Gonen and Goldberg, 2019; Hall Maudslay et al., 2019; Liang et al., 2020; Shin et al., 2020; Papakyriakopoulos et al., 2020). However, such intrinsic metrics have been recently shown not to correlate with application bias (Goldfarb-Tarrant et al., 2021). In yet another line of research, Badjatiya et al. (2019) detect bias through identifying bias sensitive words.
Beyond the fairness metrics and significance tests, some works quantify bias through calculating a standard evaluation metric, for example, F1 or accuracy, or a more elaborate measure independently for each protected group or for each split of a challenge dataset (Hovy and Søgaard, 2015; Rudinger et al., 2018; Zhao et al., 2018; Garimella et al., 2019; Sap et al., 2019; Bagdasaryan et al., 2019; Stafanovičs et al., 2020; Tan et al., 2020; Mehrabi et al., 2020; Nadeem et al., 2020; Cao and Daumé III, 2020).
9 Conclusion
We conduct a thorough review of existing fairness metrics and demonstrate that they are simply parametric variants of the three generalized fairness metrics we propose, each suited to a different type of scientific question. Further, we empirically demonstrate that the differences in parameter choices for our generalized metrics have a direct impact on the bias measurement. In light of our results, we provide a range of concrete suggestions to guide NLP practitioners in their metric choices.
We hope that our work will facilitate further research in the bias domain and allow researchers to direct their efforts towards bias mitigation. Because our framework is language and model agnostic, in the future we plan to experiment on more languages and to use our framework as a principled means of comparing different models with respect to bias.
Acknowledgments
We would like to thank the anonymous reviewers for their thoughtful comments and suggestions. We also thank the members of Amazon AI for many useful discussions and feedback.
Notes
We do not consider language modeling to be a downstream task.
BPSN and BNSP can be defined as Group VBCM if we relax the definition and allow for a separate ϕ function for the background—they require returning different confidence scores for the protected group and the background. The metrics of Prabhakaran et al. (2019) ⑱ ⑲ ⑳ were not originally defined in terms of protected groups; in their paper, T is a set of different names, both male and female.
Our preliminary experiments also used models based on Electra (Clark et al., 2020) as well as those trained on SST-2 and SST-3 datasets (Socher et al., 2013). For all models, we observed similar trends in differences between the metrics. Due to space constraints we omit those results and leave a detailed cross-model bias comparison for future work.
We process the SemEval data as is commonly done for SST (Socher et al., 2013). For binary classification, we filter out the neutral class and compress the multiple fine-grained positive/negative classes into a single positive/negative class. For 3-class classification we do not filter out the neutral class.
For Disability and Race we used the groups from Hutchinson et al. (2020) and from the Racial and Ethnic Categories and Definitions for NIH Diversity Programs (https://grants.nih.gov/grants/guide/notice-files/not-od-15-089.html), respectively. For the remaining attributes, we rely on Wikipedia and Wiktionary, among other sources.
The templates can be found with the code.
We do not compute FPR based metrics, because false positives are unlikely to occur for our synthetic data and are less meaningful if they occur.
Even after normalization, bias measurements across metrics are not fully comparable—different metrics use different base measurements (TPR, TNR, etc.) and hence measure different aspects of bias.
We omit the per-group results for the remaining attributes due to the lack of space. For BCM, we do not include accumulated values in the normalization.
Author notes
Work done during an internship at Amazon AI.