Abstract
Text representations learned by machine learning models often encode undesirable demographic information of the user. Predictive models based on these representations can rely on such information, resulting in biased decisions. We present a novel debiasing technique, Fairness-aware Rate Maximization (FaRM), that removes protected information by making representations of instances belonging to the same protected attribute class uncorrelated, using the rate-distortion function. FaRM is able to debias representations with or without a target task at hand. FaRM can also be adapted to remove information about multiple protected attributes simultaneously. Empirical evaluations show that FaRM achieves state-of-the-art performance on several datasets, and learned representations leak significantly less protected attribute information against an attack by a non-linear probing network.
1 Introduction
Democratization of machine learning has led to the deployment of predictive models for critical applications like credit approval (Ghailan et al., 2016) and college application reviewing (Basu et al., 2019). It is therefore important to ensure that decisions made by these models are fair towards different demographic groups (Mehrabi et al., 2021). Fairness can be achieved by ensuring that demographic information does not get encoded in the representations used by these models (Blodgett et al., 2016; Elazar and Goldberg, 2018; Elazar et al., 2021).
However, controlling demographic information encoded in a model’s representations is a challenging task for textual data. This is because natural language text is highly indicative of an author’s demographic attributes even when it is not explicitly mentioned (Koppel et al., 2002; Burger et al., 2011; Nguyen et al., 2013; Verhoeven and Daelemans, 2014; Weren et al., 2014; Rangel et al., 2016; Verhoeven et al., 2016; Blodgett et al., 2016).
In this work, we debias information about a protected attribute (e.g., gender, race) from textual data representations. Previous debiasing methods (Bolukbasi et al., 2016; Ravfogel et al., 2020) project representations into a subspace that does not reveal protected attribute information. These methods are only able to guard protected attributes against an attack by a linear function (Ravfogel et al., 2020). Other methods (Xie et al., 2017; Basu Roy Chowdhury et al., 2021) adversarially remove protected information while retaining information about a target attribute. However, they are difficult to train (Elazar and Goldberg, 2018) and require a target task at hand.
We present a novel debiasing technique, Fairness-aware Rate Maximization (FaRM), that removes demographic information by controlling the rate-distortion function of the learned representations. Intuitively, in order to remove information about a protected attribute from a set of representations, we want the representations from the same protected attribute class to be uncorrelated with each other. We achieve this by maximizing the number of bits (rate-distortion) required to encode representations with the same protected attribute. Figure 1 illustrates the process. The representations are shown as points in a two-dimensional feature space, color-coded according to their protected attribute class. FaRM learns a function ϕ(x) such that representations of the same protected class become uncorrelated with each other and similar to representations from other classes, thereby making it difficult to extract information about the protected attribute from the learned representations.
We perform rate-distortion maximization based debiasing in the following setups: (a) unconstrained debiasing—we remove information about a protected attribute g while retaining remaining information as much as possible (e.g., debiasing gender information from word embeddings), and (b) constrained debiasing—we retain information about a target attribute y while removing information pertaining to g (e.g., removing racial information from representations during text classification). In the unconstrained setup, debiased representations can be used for different downstream tasks, whereas for constrained debiasing the user is interested only in the target task. For unconstrained debiasing, we evaluate FaRM for removing gender information from word embeddings and demographic information from text representations that can then be used for a downstream NLP task (we show their utility for biography and sentiment classification in our experiments). Our empirical evaluations show that representations learned using FaRM in an unconstrained setup leak significantly less protected attribute information compared to prior approaches against an attack by a non-linear probing network.
For constrained debiasing, FaRM achieves state-of-the-art debiasing performance on 3 datasets, and representations are able to guard protected attribute information significantly better than previous approaches. We also perform experiments to show that FaRM is able to remove multiple protected attributes simultaneously while guarding against intersectional group biases (Subramanian et al., 2021). To summarize, our main contributions are:
We present Fairness-aware Rate Maximization (FaRM) for debiasing textual data representations in unconstrained and constrained setups, by controlling their rate-distortion functions.
We empirically show that representations learned with FaRM leak significantly less protected information against a non-linear probing attack, outperforming prior approaches.
We present two variations of FaRM for debiasing multiple protected attributes simultaneously, which are also effective against attacks targeting intersectional group biases.
2 Related Work
Removing sensitive attributes from data representations for fair classification was initially introduced as an optimization task (Zemel et al., 2013). Subsequent works have used adversarial frameworks (Goodfellow et al., 2014) for this task (Zhang et al., 2018; Li et al., 2018; Xie et al., 2017; Elazar and Goldberg, 2018; Basu Roy Chowdhury et al., 2021). However, adversarial networks are difficult to train (Elazar and Goldberg, 2018) and cannot function without a target task at hand.
Unconstrained debiasing frameworks focus on removing a protected attribute from representations, without relying on a target task. Bolukbasi et al. (2016) demonstrated that GloVe embeddings encode gender information, and proposed an unconstrained debiasing framework for identifying gender direction and neutralizing vectors along that direction. Building on this approach, Ravfogel et al. (2020) proposed INLP, a robust framework to debias representations by iteratively identifying protected attribute subspaces and projecting representations onto the corresponding nullspaces. However, these approaches fail to guard protected information against an attack by a non-linear probing network. Dev et al. (2021) showcased that nullspace projection approaches can be extended for debiasing in a constrained setup as well.
In contrast to prior works, we present a novel debiasing framework based on the principle of rate-distortion maximization. Coding rate maximization was introduced as an objective function by Ma et al. (2007) for image segmentation. It has also been used in explaining feature selection by deep networks (Macdonald et al., 2019). Recently, Yu et al. (2020) proposed maximal coding rate (MCR2) based on rate-distortion theory, a representation-level objective function that can serve as an alternative to empirical risk minimization methods. Our work is similar to MCR2 as we learn representations using a rate-distortion framework, but instead of tuning representations for classification we remove protected attribute information from them.
3 Preliminaries
Our framework performs debiasing by making representations of the same protected attribute class uncorrelated. To achieve this, we leverage a principled objective function, the rate-distortion function, to measure the compactness of a set of representations. In this section, we introduce the fundamentals of rate-distortion theory.1
Rate-Distortion.
In lossy data compression (Cover, 1999), the compactness of a random distribution is measured by the minimal number of binary bits required to encode it. A lossy coding scheme encodes a finite set of vectors Z = {z1,…,zn}, zi ∈ ℝd, drawn from a distribution P(Z), such that the decoded vectors can be recovered up to a precision ϵ². The rate-distortion function R(Z,ϵ) measures the minimal number of bits per vector required to encode the sequence Z.
In rate-distortion theory, nR(Z,ϵ) bits are needed to encode the n vectors of Z. The optimal codebook also depends on the data dimension d and requires dR(Z,ϵ) bits to encode. Therefore, a total of (n + d)R(Z,ϵ) bits is required to encode the sequence Z. Ma et al. (2007) showed that this provides a tight bound even in cases where the underlying distribution P(Z) is degenerate. This enables the use of this loss function for real-world data, where the underlying distribution may not be well defined.
In general, a set of compact vectors (low information content) would require a small number of bits to encode, which would correspond to a small value of R(Z,ϵ) and vice versa.
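For concreteness, a standard closed-form estimate of this rate from Ma et al. (2007) and Yu et al. (2020), on which FaRM builds, can be written as follows, treating Z as the d × n matrix whose columns are the vectors zi; the paper's exact formulation may differ in constants:

```latex
R(Z, \epsilon) \;=\; \frac{1}{2} \log \det\!\left( I \,+\, \frac{d}{n\epsilon^{2}}\, Z Z^{\top} \right)
```

This quantity grows as the vectors in Z become more spread out (less correlated) and shrinks as they become more compact.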
Rate Distortion for a Mixed Distribution.
Consider a partition of Z into k subsets Z = Z1 ∪ … ∪ Zk, encoded by membership matrices Π = {Πj}, where Πj is a diagonal matrix whose i-th diagonal entry πij indicates the membership of vector zi in the j-th subset. The rate-distortion function Rc(Z,ϵ|Π) measures the number of bits per vector required to encode the data under this partition. For multi-class data, where a vector zi can only be a member of a single class, we restrict πij ∈ {0,1}, and the covariance matrix for the j-th subset is ZjZjT. In general, if the representations within each subset Zj are similar to each other, they will have low intra-class variance, which corresponds to a small Rc(Z,ϵ|Π), and vice versa.
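For reference, the class-conditional rate used in this line of work (Ma et al., 2007; Yu et al., 2020) takes the following form, with Π = {Πj} the membership matrices defined above; notation may differ slightly from the paper:

```latex
R_c(Z, \epsilon \mid \Pi) \;=\; \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi_j)}{2n} \log \det\!\left( I \,+\, \frac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^{2}}\, Z \Pi_j Z^{\top} \right)
```

With hard assignments, tr(Πj) is simply the number of vectors in subset Zj and ZΠjZ⊤ reduces to ZjZj⊤.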
4 Fairness-Aware Rate Maximization
In this section, we describe FaRM to debias representations in unconstrained and constrained setups.
4.1 Unconstrained Debiasing using FaRM
In this setup, we aim to remove information about a protected attribute g from data representations X while retaining the remaining information. To achieve this, the debiased representations Z should have the following properties:
- (a) Intra-class Incoherence: Representations belonging to the same protected attribute class should be highly uncorrelated. This would make it difficult for a classifier to extract any information about g from the representations.
- (b) Maximal Informativeness: Representations should be maximally informative about the remaining information.
Assuming that there are k protected attribute classes, we can write Z = Z1 ∪ … ∪ Zk. To achieve (a), we need to ensure that the representations in each subset Zj belonging to the same protected class are dissimilar and have large intra-class variance. An increased intra-class variance corresponds to an increase in the number of bits needed to encode samples within each class, so the rate-distortion function Rc(Z,ϵ|Πg) would be large. For (b), we want the representations Z to retain as much information from the input X as possible. Increasing the information content of Z requires a larger number of bits to encode it, which means that the rate-distortion R(Z,ϵ) should also be large.
The unconstrained debiasing routine is described in Algorithm 1; a sketch of the corresponding objective computation is shown below. We use a deep neural network ϕ as our feature map and obtain debiased representations z = ϕ(x). The objective function Ju is sensitive to the scale of the representations, so we normalize the Frobenius norm of the representations to ensure that individual input samples have an equal impact on the loss. We use layer normalization (Ba et al., 2016) to ensure that all representations have the same magnitude and lie on a sphere of radius r. The feature encoder ϕ is updated using gradients from the objective function Ju. The debiased representations are obtained by feeding the input data X through the trained network ϕ. An illustration of the debiasing process in the unconstrained setup is shown in Figure 1.
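For illustration, the core computation can be sketched in PyTorch as follows. This is not the paper's reference implementation: the rate estimators follow the closed forms sketched in Section 3, the sphere projection stands in for the Frobenius-norm scaling and layer normalization described above, and the equal weighting of the two rate terms in Ju is our assumption.

```python
import torch
import torch.nn.functional as F

def coding_rate(Z, eps=0.5):
    """R(Z, eps): bits per vector needed to encode the rows of Z (n x d)
    up to squared distortion eps^2."""
    n, d = Z.shape
    I = torch.eye(d, device=Z.device)
    return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * Z.T @ Z)

def class_conditional_rate(Z, g, eps=0.5):
    """R_c(Z, eps | Pi_g): rate of Z partitioned by protected-attribute labels g."""
    n, d = Z.shape
    I = torch.eye(d, device=Z.device)
    rate = 0.0
    for cls in g.unique():
        Z_j = Z[g == cls]                      # samples of one protected class
        n_j = Z_j.shape[0]
        rate = rate + (n_j / (2.0 * n)) * torch.logdet(
            I + (d / (n_j * eps ** 2)) * Z_j.T @ Z_j)
    return rate

def unconstrained_step(phi, optimizer, x, g, radius=1.0, eps=0.5):
    """One gradient step that *maximizes* J_u = R(Z, eps) + R_c(Z, eps | Pi_g).
    (The equal weighting of the two terms is an illustrative assumption.)"""
    Z = phi(x)
    Z = radius * F.normalize(Z, dim=-1)        # place representations on a sphere of radius r
    j_u = coding_rate(Z, eps) + class_conditional_rate(Z, g, eps)
    loss = -j_u                                # gradient ascent on J_u
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return j_u.item()
```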
4.2 Constrained Debiasing using FaRM
In this setup, we aim to remove information about a protected attribute g from data representations X while retaining information about a specific target attribute y. The learned representations should have the following properties, which together define the constrained objective sketched below:
- (a) Target-Class Informativeness: Representations should be maximally informative about the target task attribute y.
- (b) Inter-class Coherence: Representations from different protected attribute classes should be similar to each other. This would make it difficult to extract information about g from Z.
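A sketch of the constrained loss, consistent with the target loss and the bias regularizer R(Z,ϵ) − Rc(Z,ϵ|Πg) referred to as Equation 4 in Section 8 and the Notes, is given below; here f is the target classifier applied on top of ϕ (see Section 5.2), and the exact weighting and notation follow the paper:

```latex
\mathcal{J}_c \;=\; \mathcal{L}_{\mathrm{CE}}\big(f(\phi(x)),\, y\big) \;+\; \lambda \left[\, R(Z, \epsilon) \;-\; R_c(Z, \epsilon \mid \Pi_g) \,\right]
```

Minimizing the regularizer compresses the overall set of representations while keeping each protected-attribute class spread out, so the protected classes overlap and g becomes hard to recover; the cross-entropy term preserves information about y.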
5 Experimental Setup
In this section, we discuss the datasets, experimental setup, and metrics used for evaluating FaRM. The implementation of FaRM is publicly available at https://github.com/brcsomnath/FaRM.
5.1 Datasets
We evaluate FaRM using several datasets. Among these, the Dial and Biographies datasets are used for evaluating both constrained and unconstrained debiasing. Pan16 and GloVe embeddings are used only for constrained and unconstrained debiasing, respectively. We use the same train-test split as prior works for all datasets.
(a) Dial (Blodgett et al., 2016) is a Twitter-based sentiment classification dataset. Each tweet is associated with sentiment and mention labels (treated as the target attribute in constrained evaluation) and “race” information (protected attribute) of the author. The sentiment labels are “happy” or “sad” and the race categories are “African-American English” (AAE) or “Standard American English” (SAE).
(b) Biography classification dataset (De-Arteaga et al., 2019) contains biographies that are associated with a profession (target attribute) and gender label (protected attribute). There are 28 distinct profession categories and 2 gender classes.
(c) Pan16 (Rangel et al., 2016) is also a Tweet-classification dataset, where each Tweet is annotated with the author’s age and gender information, both of which are binary protected attributes. The target task is mention detection.
(d) GloVe embeddings: We follow the setup of Ravfogel et al. (2020) to debias the 150,000 most common GloVe word embeddings (Zhao et al., 2018). For training, we use the 7,500 most male-biased, most female-biased, and neutral words, determined by the magnitude of each word vector’s projection onto the gender direction, where the gender direction is the largest principal component of the space formed by difference vectors of gendered word pairs.
5.2 Implementation Details
We use a multi-layer neural network with ReLU non-linearity as our feature map ϕ in the unconstrained setup. This setup is optimized using stochastic gradient descent with a learning rate of 0.001 and momentum of 0.9. For constrained debiasing, we use BERTbase as ϕ and a 2-layer neural network as f. The constrained setup is optimized using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 2 × 10−5. We set λ = 0.01 for all experiments. Hyperparameters were tuned on the development set of the respective datasets. Our models were trained on a single Nvidia Quadro RTX 5000 GPU.
5.3 Probing Metrics
Following previous work (Elazar and Goldberg, 2018; Ravfogel et al., 2020; Basu Roy Chowdhury et al., 2021), we evaluate the quality of our debiasing by probing the learned representations for the protected attribute g and target attribute y. In our experiments, we probe all representations using a non-linear classifier. We use an MLP Classifier from the scikit-learn library (Pedregosa et al., 2011). We report the Accuracy and Minimum Description Length (MDL) (Voita and Titov, 2020) for predicting g and y. A large MDL signifies that more effort is needed by a probing network to achieve a certain performance. Hence, we expect debiased representations to have a large MDL for protected attribute g and a small MDL for predicting target attribute y. Also, we expect a high accuracy for y and low accuracy for g.
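As an illustration, a probing attack on frozen debiased representations can be run as follows; the MLP hyperparameters shown are placeholders rather than the paper's exact settings, and the MDL computation (online coding, Voita and Titov, 2020) is omitted:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def probe_attribute(Z_train, a_train, Z_test, a_test):
    """Train a non-linear probe to predict an attribute (g or y) from
    frozen representations and report its test accuracy."""
    clf = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=300)
    clf.fit(Z_train, a_train)
    return accuracy_score(a_test, clf.predict(Z_test))
```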
5.4 Group Fairness Metrics
TPR-GAP.
Based on the notion of equalized odds, De-Arteaga et al. (2019) introduced TPR-GAP, which measures the true positive rate (TPR) difference of a classifier between two protected groups.
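Concretely, for a binary protected attribute with groups g and g′, the per-class gap and its RMS aggregate (the GapgRMS reported in our tables) can be written as follows, where TPR_{g,y} is the true positive rate for target class y computed on group g; notation may differ slightly from the paper:

```latex
\mathrm{Gap}^{\mathrm{TPR}}_{g,y} \;=\; \mathrm{TPR}_{g,y} - \mathrm{TPR}_{g',y},
\qquad
\mathrm{Gap}^{\mathrm{RMS}}_{g} \;=\; \sqrt{\frac{1}{|\mathcal{Y}|}\sum_{y \in \mathcal{Y}} \left(\mathrm{Gap}^{\mathrm{TPR}}_{g,y}\right)^{2}}
```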
Demographic Parity (DP).
Demographic parity requires a classifier’s predictions to be independent of the protected attribute. Bickel et al. (1975) illustrated that the notions of demographic parity and equalized odds can strongly differ in real-world scenarios. For representation learning, Zhao and Gordon (2019) demonstrated an inherent tradeoff between the utility and fairness of representations. The TPR-GAP described above is not a good indicator of fairness if y and g are correlated, as debiasing then also leads to a drop in target task performance. For our experiments, we compare models using both metrics for completeness. However, as in prior work, we observe conflicting results in some cases due to this tradeoff.
6 Results: Unconstrained Debiasing
We evaluate FaRM for unconstrained debiasing in three setups: word embedding debiasing, and debiasing text representations for biography and sentiment classification. For the classification tasks, we retrieve text representations from a pre-trained encoder, debias them using FaRM (without taking the target task into account), and evaluate the debiased representations by probing for y and g. In all settings, we train the feature encoder ϕ and evaluate the retrieved representations Zdebiased = ϕ(X). All tables indicate the desired direction of each metric using ↑ (higher is better) or ↓ (lower is better).
6.1 Word Embedding Debiasing
We revisit the problem of debiasing gender information from word embeddings introduced by Bolukbasi et al. (2016).
Setup.
We debias gender information from GloVe embeddings using a 4-layer neural network with ReLU non-linearity as the feature map ϕ(x). We discuss the choice of the feature map ϕ in Section 8.
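A minimal sketch of such a feature map is shown below; the hidden and output widths (here equal to the 300-dimensional GloVe input) are assumptions, as only the depth and the ReLU non-linearity are specified above:

```python
import torch.nn as nn

def make_feature_map(in_dim=300, hidden_dim=300, out_dim=300, n_layers=4):
    """phi(x): an MLP with (n_layers - 1) hidden Linear+ReLU layers and a linear output."""
    layers, dim = [], in_dim
    for _ in range(n_layers - 1):
        layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
        dim = hidden_dim
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

phi = make_feature_map()   # 4-layer ReLU network for GloVe debiasing
```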
Results.
Table 1 presents the results of debiasing word embeddings for the baseline INLP (Ravfogel et al., 2020) and FaRM. Compared with INLP, FaRM reduces the probing network’s accuracy by an absolute margin of 32.4% and achieves a steep increase in MDL. FaRM is able to guard the protected attribute against an attack by a non-linear probing network (near-random accuracy). We also report the rank of the resulting word embedding matrix. The information content of a matrix is captured by its rank (the maximal number of linearly independent columns). The higher rank of the resulting embedding matrix indicates that FaRM retains more information in the representations, in general, than INLP.
Visualization.
We visualize the t-SNE (Van der Maaten and Hinton, 2008) projections of GloVe embeddings before and after debiasing in Figures 4a and 4b, respectively. Female- and male-biased word vectors are represented by red and blue dots, respectively. The figures clearly demonstrate that the gendered vectors are not separable after debiasing. To quantify the improvement, we perform k-means clustering with K = 2 (one cluster per gender label) and compute the V-measure (Rosenberg and Hirschberg, 2007), which quantifies the overlap between clusters. The V-measure drops from 99.9% in the original space to 0.006% after debiasing with FaRM (compared to 0.31% with INLP). This indicates that debiased representations from FaRM are more difficult to disentangle. We further analyze the quality of the debiased word embeddings in Section 8.
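The clustering check can be reproduced with a few lines of scikit-learn, as sketched below; the random seed is arbitrary:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

def gender_cluster_overlap(Z, gender_labels, seed=0):
    """Run k-means with K = 2 on the embeddings and measure how well the
    clusters align with gender labels (higher V-measure = more separable)."""
    cluster_ids = KMeans(n_clusters=2, random_state=seed).fit_predict(Z)
    return v_measure_score(gender_labels, cluster_ids)
```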
6.2 Biography Classification
Next, we evaluate FaRM by debiasing text representations in an unconstrained setup and using the representations for fair biography classification.
Setup.
We obtain the text representations using two methods: FastText (Joulin et al., 2017) and BERT (Devlin et al., 2019). For FastText, we sum the individual token representations in each biography. For BERT, we retrieve the final-layer hidden representation of the [CLS] token from a pre-trained BERTbase model. We choose the feature map ϕ(x) to be a 4-layer neural network with ReLU non-linearity.
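A sketch of both extraction routines is given below; the Hugging Face checkpoint name and the FastText lookup interface are illustrative choices, not the exact pipeline used in the experiments:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def bert_representation(text):
    """Final-layer hidden state of the [CLS] token from pre-trained BERT-base."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).last_hidden_state   # shape: (1, seq_len, 768)
    return hidden[:, 0, :].squeeze(0)              # [CLS] is the first token

def fasttext_representation(tokens, vectors):
    """Sum of FastText token vectors; `vectors` maps token -> embedding."""
    return sum(vectors[t] for t in tokens if t in vectors)
```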
Results.
Table 2 presents the unconstrained debiasing results of FaRM on this dataset. ‘Original’ in the table refers to the pre-trained embeddings from BERTbase or FastText. We observe that FaRM significantly outperforms INLP on the fairness metrics—DP (improvement of 92% for FastText and 91% for BERT) and GapgRMS (improvement of 93% for FastText and 18% for BERT). FaRM achieves near-random gender classification performance (majority baseline: 53.9%) against a non-linear probing attack, improving upon INLP’s gender leakage by an absolute margin of 9.8% and 39.4% for FastText and BERT, respectively. However, we observe a substantial drop in the accuracy of identifying professions (target attribute) using the debiased embeddings.3 This is possibly because gender is highly correlated with profession in this dataset, so removing gender information also removes profession information. Zhao and Gordon (2019) identified this phenomenon, noting the tradeoff between learning fair representations and performing well on the target task when the protected and target attributes are correlated. The results in this setup (Table 2) demonstrate this phenomenon: in unconstrained debiasing, we remove information about the protected attribute without taking the target task into account, so target task performance suffers as more debiasing is performed.4 This calls for constrained debiasing for such datasets. In Section 7, we show that FaRM is able to retain target performance while debiasing this dataset in the constrained setup.
Table 2: Unconstrained debiasing results on the Biographies dataset (gender as the protected attribute).

| Metric | Method | FastText | BERT |
|---|---|---|---|
| Profession Acc. (↑) | Original | 79.9 | 80.9 |
| | INLP | 76.3 | 77.8 |
| | FaRM | 54.8 | 55.8 |
| Gender Acc. (↓) | Original | 98.9 | 99.6 |
| | INLP | 67.4 | 94.9 |
| | FaRM | 57.6 | 55.6 |
| DP (↓) | Original | 1.65 | 1.68 |
| | INLP | 1.51 | 1.50 |
| | FaRM | 0.12 | 0.14 |
| GapgRMS (↓) | Original | 0.185 | 0.171 |
| | INLP | 0.089 | 0.096 |
| | FaRM | 0.006 | 0.079 |
6.3 Controlled Sentiment Classification
Lastly, for the Dial dataset, we perform unconstrained debiasing in a controlled setting.
Setup.
Following the setup of Barrett et al. (2019) and Ravfogel et al. (2020), we control the proportion of protected attribute classes within each target class (a construction sketched below). For example, a target class split of 80% means that the “happy” sentiment (target) class contains 80% AAE / 20% SAE tweets, while the “sad” class contains 20% AAE / 80% SAE (AAE and SAE are the protected class labels described in Section 5.1). We train DeepMoji (Felbo et al., 2017) followed by a 1-layer MLP for sentiment classification. We retrieve representations from the DeepMoji encoder and debias them using FaRM, choosing the feature map ϕ(x) to be a 7-layer neural network with ReLU non-linearity. After debiasing, we train a non-linear MLP to probe the quality of debiasing. We evaluate the debiasing performance of FaRM at various levels of label imbalance.
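A controlled split of this kind can be constructed by subsampling, as in the sketch below; the function name and the balancing strategy (downsampling to the largest size every group can support) are our own illustrative choices, not the original pipeline:

```python
import random
from collections import defaultdict

def controlled_subsample(examples, split=0.8, seed=0):
    """Build a controlled Dial subset: the 'happy' class contains `split` AAE
    and (1 - split) SAE tweets, while the 'sad' class is the mirror image.
    `examples` are (text, sentiment, race) triples with sentiment in
    {'happy', 'sad'} and race in {'AAE', 'SAE'}."""
    rng = random.Random(seed)
    pools = defaultdict(list)
    for ex in examples:
        pools[(ex[1], ex[2])].append(ex)
    for pool in pools.values():
        rng.shuffle(pool)

    # Desired race proportion within each sentiment class.
    proportions = {("happy", "AAE"): split, ("happy", "SAE"): 1 - split,
                   ("sad", "AAE"): 1 - split, ("sad", "SAE"): split}
    # Largest per-class size n that every pool can support.
    n = min(int(len(pools[key]) / p) for key, p in proportions.items() if p > 0)

    subset = []
    for key, p in proportions.items():
        subset.extend(pools[key][: int(round(n * p))])
    rng.shuffle(subset)
    return subset
```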
Results.
The results of this experiment are reported in Table 3. FaRM achieves the best fairness scores—an improvement in GapgRMS (≥12.5%) and DP (≥21%) across all setups. Considering the accuracy of identifying the protected attribute (race), FaRM significantly reduces the leakage of race information by an absolute margin of 11%–17% across the different target class splits, while achieving sentiment (target attribute) classification performance similar to INLP. We observe that the fairness scores for FaRM deteriorate with increasing correlation between the protected and target attributes. In cases where the two attributes are highly correlated (splits of 70% and 80%), we observe low sentiment classification accuracy (for both INLP and FaRM) compared to the original classifier. This mirrors the observation made for the Biographies dataset and shows that it is difficult to remove information about the protected attribute while retaining information about the target task when the two are highly correlated. In the constrained setup, FaRM is able to retain target performance (Section 7).
Table 3: Controlled sentiment classification on Dial under different target class splits (race as the protected attribute).

| Metric | Method | 50% | 60% | 70% | 80% |
|---|---|---|---|---|---|
| Sentiment Acc. (↑) | Original | 75.5 | 75.5 | 74.4 | 71.9 |
| | INLP | 75.1 | 73.1 | 69.2 | 64.5 |
| | FaRM | 74.8 | 73.2 | 67.3 | 63.5 |
| Race Acc. (↓) | Original | 87.7 | 87.8 | 87.3 | 87.4 |
| | INLP | 69.5 | 82.2 | 80.3 | 69.9 |
| | FaRM | 54.2 | 69.9 | 69.0 | 52.1 |
| DP (↓) | Original | 0.26 | 0.44 | 0.63 | 0.81 |
| | INLP | 0.16 | 0.33 | 0.30 | 0.28 |
| | FaRM | 0.09 | 0.10 | 0.17 | 0.22 |
| GapgRMS (↓) | Original | 0.15 | 0.24 | 0.33 | 0.41 |
| | INLP | 0.12 | 0.18 | 0.16 | 0.16 |
| | FaRM | 0.09 | 0.10 | 0.12 | 0.14 |
7 Results: Constrained Debiasing
In this section, we present the results of constrained debiasing using FaRM. For all experiments, we use a BERTbase model as ϕ and a 2-layer neural network with ReLU non-linearity as f (Figure 3).
7.1 Single Attribute Debiasing
In this setup, we focus on debiasing a single protected attribute g while retaining information about the target attribute y.
Setup.
We conduct experiments on 3 datasets: Dial (Blodgett et al., 2016), Pan16 (Rangel et al., 2016), and Biographies (De-Arteaga et al., 2019). We experiment with different target and protected attribute configurations in Dial (y: Sentiment/Mention, g: Race) and Pan16 (y: Mention, g: Gender/Age). For Biographies, we use the same setup as described in Section 6.2. For the protected attribute g, we report ΔF1—the difference between the F1-score and the majority baseline. We also report the fairness metrics GapgRMS and Demographic Parity (DP) of the learned classifier. We compare FaRM with the state-of-the-art AdS (Basu Roy Chowdhury et al., 2021), a fine-tuned BERTbase sequence classifier, and pre-trained BERTbase representations.
Results.
Table 4 presents the results of this experiment. In general, FaRM achieves good fairness performance while maintaining target performance; in particular, it achieves the best DP scores across all setups. On Pan16, FaRM achieves perfect fairness in terms of protected attribute probing accuracy (ΔF1 = 0) with performance comparable to AdS in terms of the MDL of g. On the Biographies dataset, the task accuracy of FaRM is the same as that of AdS, but FaRM outperforms AdS on the fairness metrics. We also observe that, for this dataset, some baselines perform very well on one (but not both) of the two fairness metrics, which can be attributed to the inherent tradeoff between them (see Section 5.4); FaRM achieves a good balance between the two. Overall, this shows that FaRM is able to robustly remove sensitive information about the protected attribute while achieving good target task performance.
Table 4 (Dial): Constrained debiasing on Dial with Sentiment and Mention as target attributes (y) and Race as the protected attribute (g).

| Method | Sentiment (y) F1↑ | MDL↓ | Race (g) ΔF1↓ | MDL↑ | DP↓ | GapgRMS↓ | Mention (y) F1↑ | MDL↓ | Race (g) ΔF1↓ | MDL↑ | DP↓ | GapgRMS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERTbase (pre-trained) | 63.9 | 300.7 | 10.9 | 242.6 | 0.41 | 0.20 | 66.1 | 290.1 | 24.6 | 258.8 | 0.20 | 0.10 |
| BERTbase (fine-tuned) | 76.9 | 99.0 | 18.4 | 176.2 | 0.30 | 0.14 | 81.7 | 49.1 | 28.7 | 199.2 | 0.06 | 0.03 |
| AdS | 72.9 | 56.9 | 5.2 | 290.6 | 0.43 | 0.21 | 81.1 | 7.6 | 21.7 | 270.3 | 0.06 | 0.03 |
| FaRM | 73.2 | 17.9 | 0.2 | 296.5 | 0.26 | 0.14 | 78.8 | 3.1 | 0.3 | 324.8 | 0.06 | 0.03 |
Table 4 (Pan16): Constrained debiasing on Pan16 with Mention as the target attribute (y) and Gender or Age as the protected attribute (g).

| Method | Mention (y) F1↑ | MDL↓ | Gender (g) ΔF1↓ | MDL↑ | DP↓ | GapgRMS↓ | Mention (y) F1↑ | MDL↓ | Age (g) ΔF1↓ | MDL↑ | DP↓ | GapgRMS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERTbase (pre-trained) | 72.3 | 259.7 | 7.4 | 300.5 | 0.11 | 0.056 | 72.8 | 262.6 | 6.1 | 302.0 | 0.14 | 0.078 |
| BERTbase (fine-tuned) | 89.7 | 4.0 | 15.1 | 267.6 | 0.04 | 0.007 | 89.3 | 4.8 | 7.4 | 295.4 | 0.04 | 0.006 |
| AdS | 89.7 | 7.6 | 4.9 | 313.9 | 0.04 | 0.007 | 89.2 | 6.0 | 1.1 | 315.1 | 0.04 | 0.004 |
| FaRM | 88.7 | 1.7 | 0.0 | 312.4 | 0.04 | 0.007 | 88.6 | 0.8 | 0.0 | 312.6 | 0.03 | 0.008 |
Table 4 (Biographies): Constrained debiasing on Biographies with Profession as the target attribute (y) and Gender as the protected attribute (g).

| Method | Profession (y) F1↑ | MDL↓ | Gender (g) ΔF1↓ | MDL↑ | DP↓ | GapgRMS↓ |
|---|---|---|---|---|---|---|
| BERTbase (pre-trained) | 74.3 | 499.9 | 45.2 | 27.6 | 0.43 | 0.169 |
| BERTbase (fine-tuned) | 99.9 | 2.2 | 8.3 | 448.9 | 0.46 | 0.001 |
| AdS | 99.9 | 3.3 | 3.1 | 449.5 | 0.45 | 0.003 |
| FaRM | 99.9 | 7.6 | 7.4 | 460.3 | 0.42 | 0.002 |
7.2 Multiple Attribute Debiasing
In this setup, we focus on debiasing multiple protected attributes gi simultaneously, while retaining information about the target attribute y. We evaluate FaRM on the Pan16 dataset with y as Mention, g1 as Gender, and g2 as Age. Subramanian et al. (2021) showed that debiasing categorical attributes individually can still reveal information about intersectional groups (e.g., if age (young/old) and gender (male/female) are two categorical protected attributes, then (age = old, gender = male) is an intersectional group). We report ΔF1 and MDL scores for probing intersectional groups.
Approach.
We present two variations of FaRM for removing multiple protected attributes simultaneously in a constrained setup. Assuming there are N protected attributes, the two variations, N-partition and 1-partition (evaluated in Table 5), differ in how the data is partitioned by the protected attributes.
Results.
We present the results of debiasing multiple attributes in Table 5. FaRM improves upon AdS’s ΔF1 scores for age and gender, with the N-partition and 1-partition setups performing equally well. The performance on the target task is comparable to AdS, although there is a slight rise in MDL. It is important to note that even though AdS performs reasonably well in preventing leakage about g1 and g2, it still leaks a significant amount of information about the intersectional groups. In both of its configurations, FaRM prevents leakage of intersectional biases while considering the protected attributes independently. This shows that robustly removing information about multiple attributes also helps prevent leakage about intersectional groups.
Table 5: Constrained debiasing of multiple protected attributes (Age, Gender) on Pan16 with Mention as the target attribute, including probing of intersectional groups.

| Setup | Mention (y) F1↑ | MDL↓ | Age (g1) ΔF1↓ | MDL↑ | DP↓ | GapgRMS↓ | Gender (g2) ΔF1↓ | MDL↑ | DP↓ | GapgRMS↓ | Inter. (g1,g2) ΔF1↓ | MDL↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERTbase (fine-tuned) | 88.6 | 6.8 | 14.9 | 196.4 | 0.06 | 0.009 | 16.5 | 192.0 | 0.04 | 0.014 | 20.7 | 117.2 |
| AdS | 88.6 | 5.5 | 2.2 | 231.5 | 0.05 | 0.006 | 1.6 | 230.9 | 0.04 | 0.017 | 9.1 | 118.5 |
| FaRM (N-partition) | 87.0 | 13.4 | 0.0 | 234.3 | 0.03 | 0.003 | 0.0 | 234.2 | 0.06 | 0.025 | 0.7 | 468.0 |
| FaRM (1-partition) | 86.4 | 15.6 | 0.0 | 234.6 | 0.05 | 0.006 | 0.0 | 234.2 | 0.02 | 0.009 | 0.0 | 467.7 |
8 Model Analysis
In this section, we present several analysis experiments to evaluate the functioning of FaRM.
Robustness to Label Corruption.
We evaluate the robustness of FaRM by randomly sub-sampling instances from the dataset and corrupting their protected attribute labels. In Figure 5a, we report the protected attribute leakage (ΔF1 score) from the debiased word embeddings for varying fractions of training-set label corruption. FaRM’s performance degrades as label corruption increases. This is expected: at high corruption ratios, most of the protected attribute labels are wrong, resulting in poor performance.
In the constrained setup (Figure 5b), we observe that FaRM is able to debias protected attribute information even at high corruption ratios (note that the y-axis scales in Figures 5a and 5b differ). We believe this improved robustness (compared to the unconstrained setup) is due to the additional supervision from the target loss, which enables FaRM to learn robust representations even with corrupted protected attribute labels.
Sensitivity to λ.
We measure the sensitivity of FaRM’s performance w.r.t. λ (Equation 4) in the constrained setup. In Figure 6, we show the MDL of the target attribute y (in blue) and the protected attribute g (in red) on Dial and Pan16 for different values of λ. When 10−4 ≤ λ ≤ 1, the performance of FaRM does not change much. For λ = 10, the MDL for y is quite large, showing that the model does not converge on the target task. This is expected, as the regularization term (Equation 4) is already much larger than the target loss term, and boosting it further with λ = 10 makes it difficult for the target task loss to converge. Similarly, when λ ≤ 10−5, the regularization term becomes much smaller than the target loss, and there is a substantial drop in the MDL for g. Overall, FaRM achieves good performance over a broad spectrum of λ, so reproducing the desired results does not require extensive hyperparameter tuning.
Probing Word Embeddings.
A limitation of using FaRM for debiasing word embeddings is that distances in the original embedding space are not preserved. By the Mazur–Ulam theorem (Fleming and Jamison, 2003), a surjective isometry between normed vector spaces must be affine; since FaRM uses a non-linear feature map ϕ(x), distances cannot be preserved. A linear map ϕ(x) is also not ideal, because it does not guard protected attributes against an attack by a non-linear probing network. We investigate the utility of the debiased embeddings through the following experiments:
(a) Word Similarity Evaluation: We evaluate the debiased embeddings on the following datasets: SimLex-999 (Hill et al., 2015), WordSim-353 (Agirre et al., 2009), and MTurk-771 (Halawi et al., 2012). In Table 6, we report the Spearman correlation between the gold similarity scores of word pairs and the cosine similarity scores obtained before (top row) and after (bottom row) debiasing the GloVe embeddings. We observe a significant drop in correlation with the gold scores, which is expected since debiasing removes some information from the embeddings. In spite of the drop, the correlation remains reasonable, indicating that FaRM retains a significant degree of semantic information.
Table 6: Spearman correlation with gold similarity scores before (GloVe) and after (FaRM) debiasing.

| Method | SimLex-999 | WordSim-353 | MTurk-771 |
|---|---|---|---|
| GloVe | 0.374 | 0.695 | 0.684 |
| FaRM | 0.242 | 0.503 | 0.456 |
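The word-similarity evaluation amounts to a Spearman correlation between gold scores and cosine similarities of the (debiased) vectors; a minimal sketch, with out-of-vocabulary pairs skipped, is shown below:

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_correlation(pairs, gold_scores, embeddings):
    """Spearman correlation between human similarity judgments and cosine
    similarity of word vectors. `pairs` is a list of (w1, w2) tuples and
    `embeddings` maps word -> vector."""
    predicted, gold = [], []
    for (w1, w2), score in zip(pairs, gold_scores):
        if w1 in embeddings and w2 in embeddings:
            v1, v2 = embeddings[w1], embeddings[w2]
            predicted.append(float(np.dot(v1, v2) /
                                   (np.linalg.norm(v1) * np.linalg.norm(v2))))
            gold.append(score)
    return spearmanr(gold, predicted).correlation
```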
(b) Part-of-speech Tagging: We evaluate debiased embeddings for detecting POS tags in a sentence using the Universal tagset (Petrov et al., 2012). GloVe embeddings achieve an F1-score of 95.2% and FaRM achieves an F1-score of 93.0% on this task. This shows FaRM’s debiased embeddings still possess a significant amount of morphological information about the language.
(c) Sentiment Classification: We perform sentiment classification using word embeddings on the IMDb movies dataset (Maas et al., 2011). GloVe embeddings achieve an accuracy of 80.9%, while debiased embeddings achieve an accuracy of 74.6%. The drop in this task is slightly more compared to POS tagging, but FaRM is still able to achieve reasonable performance on this task.
These experiments showcase that even though exact distances are not preserved by FaRM, the debiased embeddings still retain information that is useful in downstream tasks.
Evolution of Loss Components.
We examine how FaRM’s loss components evolve during training. In the unconstrained setup for GloVe debiasing, we track the evolution of the two components, R(Z,ϵ) (in red) and Rc(Z,ϵ|Πg) (in black). In Figure 7a, we observe that both terms start increasing simultaneously, with their difference remaining constant in the final iterations. In the constrained setup, the evolution of the target loss and the bias loss R(Z,ϵ) − Rc(Z,ϵ|Πg) for the Dial dataset is shown in Figure 7b. We observe that the bias term converges first, followed by the target loss. This is expected, as the magnitude of the rate-distortion loss is larger than that of the target loss, which forces the model to minimize it first.
Limitations.
A limitation of FaRM is that we lack a principled approach for selecting the feature map ϕ(x). In the unconstrained setup, we relied on empirical observations and found that a 4-layer ReLU network sufficed for GloVe and Biographies, while a 7-layer network was required for Dial. In the constrained setup, BERTbase proved expressive enough to perform debiasing in all setups. Future work can explore white-box network architectures (Chan et al., 2022) for debiasing.
9 Conclusion
We proposed Fairness-aware Rate Maximization (FaRM), a novel debiasing technique based on the principle of rate-distortion maximization. FaRM is effective in removing protected information from representations in both unconstrained and constrained debiasing setups. Empirical evaluations show that FaRM outperforms prior work in debiasing representations by a large margin on several datasets. Extensive analyses show that FaRM is sample efficient and robust to label corruption and minor hyperparameter changes. Future work can focus on leveraging FaRM to achieve fairness in complex tasks like language generation.
10 Ethical Considerations
In this work, we present FaRM—a robust representation learning framework for selectively removing protected information. FaRM is developed with the intent of enabling the development of fair learning systems. However, it could be misused to remove salient features from representations and perform classification by leveraging demographic information. Debiasing using FaRM is evaluated only on datasets with binary protected attribute variables. This may not be ideal when removing protected information about gender, which extends beyond binary categories; currently, we lack datasets with fine-grained gender annotation. It is important to collect data and develop techniques that benefit everyone in our community.
Notes
1. We borrow some notation from Yu et al. (2020) to explain concepts of rate-distortion theory.
2. Note that we cannot use the same regularization term (Equation 4) for unconstrained debiasing, as minimizing R(Z,ϵ) without the supervision of the target loss results in all representations converging to a compact space, thereby losing most of the information.
3. The majority baseline for profession classification is ≈29%.
4. In our experiments, profession accuracy was high with a shallow feature map or when training for fewer epochs, but gender leakage was significant in those scenarios.