We present the Assignment-Maximization Spectral Attribute removaL (AMSAL) algorithm, which erases information from neural representations when the information to be erased is implicit rather than directly aligned with each input example. Our algorithm works by alternating between two steps. In one, it finds an assignment of the input representations to the information to be erased, and in the other, it creates projections of both the input representations and the information to be erased into a joint latent space. We test our algorithm on an extensive array of datasets, including a Twitter dataset with multiple guarded attributes, the BiasBios dataset, and the BiasBench benchmark. The latter benchmark includes four datasets with various types of protected attributes. Our results demonstrate that bias can often be removed in our setup. We also discuss the limitations of our approach when there is a strong entanglement between the main task and the information to be erased.1

Developing a methodology for adjusting neural representations to preserve user privacy and avoid encoding bias in them has been an active area of research in recent years. Previous work shows it is possible to erase undesired information from representations so that downstream classifiers cannot use that information in their decision-making process. This previous work assumes that this sensitive information (or guarded attributes, such as gender or race) is available for each input instance. These guarded attributes, however, are sensitive, and obtaining them on a large scale is often challenging and, in some cases, not feasible (Han et al., 2021b). For example, Blodgett et al. (2016) studied the characteristics of African-American English on Twitter, and could not couple the ethnicity attribute directly with the tweets they collected due to the attribute’s sensitivity.

This paper introduces a novel debiasing setting in which the guarded attributes are not paired up with each input instance and an algorithm to remove information from representations in that setting. In our setting, we assume that each neural input representation is coupled with a guarded attribute value, but this assignment is unavailable. In cases where the domain of the guarded attribute is small (for example, with binary attributes), this means that the guarded attribute information consists of priors with respect to the whole population and not instance-level information.

The intuition behind our algorithm is that if we were to find a strong correlation between the input variable and a set of guarded attributes, grounded either in the form of an unordered list of records or as priors, then it is unlikely to be coincidental if the sample size is sufficiently large (§3.5). We implement this intuition by jointly finding projections of the input samples and the guarded attributes into a joint embedding space and an alignment between the two sets in that joint space.

Our resulting algorithm (§3), the Assignment-Maximization Spectral Attribute removaL algorithm (AMSAL), is a coordinate-ascent algorithm reminiscent of the hard expectation-maximization algorithm (hard EM; MacKay, 2003). It first loops between two steps, Assignment and Maximization: it finds an assignment of inputs to guarded attributes (A) based on existing projections, and then projects the representations and guarded attributes into a joint space based on the existing assignment (M). After these two steps are iteratively repeated and an alignment is identified, the algorithm takes another step to erase information from the input representations based on the projections identified. This step closely follows the work of Shao et al. (2023), who use Singular Value Decomposition to remove principal directions of the covariance matrix between the input examples and the guarded attributes. Figure 1 depicts a sketch of our setting and the corresponding algorithm, with xi being the input representations and zj being the guarded attributes. Our algorithm is modular: While our use of the algorithm of Shao et al. (2023) for the removal step is natural due to the nature of the AM steps, a user can use any such algorithm to erase the information from the input representations (§3.4).

Figure 1: A depiction of the problem setting and solution. The inputs are aligned to each guarded sample, based on strength using two projections U and V. We solve a bipartite matching problem to find the blue edges, and then recalculate U and V.

Our contributions are as follows: (1) We propose a new setup for removing guarded information from neural representations where there are few or no labeled guarded attributes; (2) We present a novel two-stage coordinate-ascent algorithm that iteratively improves (a) an alignment between guarded attributes and neural representations; and (b) information removal projections.

Using an array of datasets, we perform extensive experiments to assess how challenging our setup is and whether our algorithm is able to remove information without having aligned guarded attributes (§4). We find in several cases that little information is needed to align between neural representations and their corresponding guarded attributes. The consequence is that it is possible to erase the information such guarded attributes provide from the neural representations while preserving the information needed for the main task decision-making. We also study the limitations of our algorithm by experimenting with a setup where it is hard to distinguish between the guarded attributes and the downstream task labels when aligning the neural representations with the guarded attributes (§4.5).

For an integer n we denote by [n] the set {1,…,n}. For a vector v, we denote by ∥v∥₂ its ℓ₂ norm. For two vectors v and u, by default in column form, ⟨v,u⟩ = v^⊤u (dot product). Matrices and vectors are in boldface font (with uppercase or lowercase letters, respectively). Random variable vectors are also denoted by boldface uppercase letters. For a matrix A, we denote by a_ij the value of cell (i,j). The Frobenius norm of a matrix A is $\|\mathbf{A}\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$. The spectral norm of a matrix is $\|\mathbf{A}\|_2 = \max_{\|\mathbf{x}\|_2 = 1} \|\mathbf{A}\mathbf{x}\|_2$. The expectation of a random variable T is denoted by E[T].

In our problem formulation, we assume three random variables: X ∈ ℝ^d, Y ∈ ℝ, and Z ∈ ℝ^{d′}, such that d′ ≤ d and the expectation of all three variables is 0 (see Shao et al., 2023). Samples of X are the inputs for a classifier to predict corresponding samples of Y. The random vector Z represents the guarded attributes. We want to maintain the ability to predict Y from X, while minimizing the ability to predict Z from X.

We assume n samples of (X,Y) and m samples of Z, denoted by (x^(i), y^(i)) for i ∈ [n], and z^(i) for i ∈ [m] (m ≤ n). While originally, these samples were generated jointly from the underlying distribution p(X,Y,Z), we assume a shuffling of the Z samples in such a way that we are only left with m samples that are unique (no repetitions) and an underlying unknown many-to-one mapping π: [n] → [m] that maps each x^(i) to its original z^(j).

The problem formulation is such that we need to remove the information from the xs in such a way that we consider the samples of zs as a set. In our case, we do so by iterating between trying to infer π, and then using standard techniques to remove the information from xs based on their alignment to the corresponding zs.

Singular Value Decomposition

Let A = E[XZ^⊤], the matrix of cross-covariance between X and Z. This means that A_ij = Cov(X_i, Z_j) for i ∈ [d] and j ∈ [d′].

For any two vectors a ∈ ℝ^d, b ∈ ℝ^{d′}, the following holds due to the linearity of expectation:

$$\mathbf{a}^\top \mathbf{A} \mathbf{b} = \mathrm{Cov}\left(\langle \mathbf{a}, \mathbf{X} \rangle, \langle \mathbf{b}, \mathbf{Z} \rangle\right). \tag{1}$$

Singular value decomposition on A, in this case, finds the “principal directions”: directions in which the projections of X and Z maximize their covariance. The projections are represented as two matrices U ∈ ℝ^{d×d′} and V ∈ ℝ^{d′×d′}. Each column in these matrices plays the role of the vectors a and b in Eq. 1. SVD finds U and V such that for any i ∈ [d′] it holds that:

$$(\mathbf{U}_i, \mathbf{V}_i) = \arg\max_{(\mathbf{a}, \mathbf{b}) \in O_i} \mathrm{Cov}\left(\langle \mathbf{a}, \mathbf{X} \rangle, \langle \mathbf{b}, \mathbf{Z} \rangle\right),$$

where O_i is the set of pairs of vectors (a, b) such that ∥a∥₂ = ∥b∥₂ = 1, a is orthogonal to U_1,…,U_{i−1} and, similarly, b is orthogonal to V_1,…,V_{i−1}.

Shao et al. (2023) showed that SVD in this form can be used to debias representations. We calculate SVD between X and Z and then prune out the principal directions that denote the highest covariance. We will use their method, SAL (Spectral Attribute removaL), in the rest of the paper. See also §3.4.
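To make the removal step concrete, the following is a minimal NumPy sketch of an SAL-style projection. It assumes mean-centered matrices `X` (n × d) and `Z` (n × d′) with rows as samples and a hand-picked number `k` of directions to remove; it illustrates the covariance-SVD idea rather than reproducing the authors' implementation.

```python
import numpy as np

def sal_style_removal(X, Z, k):
    """Project out the k directions of X that co-vary most with Z.

    X: (n, d) mean-centered input representations (rows are samples).
    Z: (n, d_prime) mean-centered guarded attributes, row-aligned with X.
    Returns X projected onto the complement of the top-k left singular
    vectors of the empirical cross-covariance matrix.
    """
    n = X.shape[0]
    omega = (X.T @ Z) / n                     # empirical cross-covariance, (d, d_prime)
    U, _, _ = np.linalg.svd(omega)            # principal directions of covariance
    U_top = U[:, :k]                          # directions with the highest covariance
    P = np.eye(X.shape[1]) - U_top @ U_top.T  # projector onto their complement
    return X @ P

# Toy usage with random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16)); X -= X.mean(axis=0)
Z = rng.normal(size=(200, 2));  Z -= Z.mean(axis=0)
X_clean = sal_style_removal(X, Z, k=2)
```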

We view the problem of information removal with unaligned samples as a joint optimization problem of: (a) finding the alignment; (b) finding the projection that maximizes the covariance between the alignments, and using its complement to project the inputs. Such an optimization, in principle, is intractable, so we break it down into two coordinate-ascent style steps: A-step (in which the alignment is identified as a bipartite graph matching problem) and M-step (in which based on the previously identified alignment, a maximal-covariance projection is calculated). Formally, the maximization problem we solve is:
$$\arg\max_{\pi, \mathbf{U}, \mathbf{V}} \; \sum_{i=1}^{n} \left\langle \mathbf{U}^\top \mathbf{x}^{(i)},\, \mathbf{V}^\top \mathbf{z}^{(\pi(i))} \right\rangle, \tag{2}$$

where we constrain U and V to be matrices with orthonormal columns, with U ∈ ℝ^{d×k} and V ∈ ℝ^{d′×k}.

Note that the sum in the above equation has a term per pair (x^(i), z^(π(i))), which enables us to frame the A-step as an integer linear programming (ILP) problem (§3.1). The full algorithm is given in Figure 2, and we proceed in the next two subsections to further explain the A-step and the M-step.

Figure 2: The main Assignment-Maximization Spectral Attribute removaL (AMSAL) algorithm for removal of information without alignment between samples of X and Z.

3.1 A-step (Guarded Sample Assignment)

In the Assignment Step, we are required to find a many-to-one alignment π: [n] → [m] between {x^(1),…,x^(n)} and {z^(1),…,z^(m)}. Given U and V from the previous M-step, we can find such an assignment by solving the following optimization problem:

$$\pi = \arg\max_{\pi'} \; \sum_{i=1}^{n} \left\langle \mathbf{U}^\top \mathbf{x}^{(i)},\, \mathbf{V}^\top \mathbf{z}^{(\pi'(i))} \right\rangle.$$

This maximization problem can be formulated as an integer linear program of the following form:

$$\max_{p_{ij} \in \{0,1\}} \; \sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij} \left\langle \mathbf{U}^\top \mathbf{x}^{(i)},\, \mathbf{V}^\top \mathbf{z}^{(j)} \right\rangle
\quad \text{s.t.} \quad \sum_{j=1}^{m} p_{ij} = 1 \;\; \forall i \in [n],
\qquad b_{0j} \le \sum_{i=1}^{n} p_{ij} \le b_{1j} \;\; \forall j \in [m]. \tag{3}$$

This is a solution to an assignment problem (Kuhn, 1955; Ramshaw and Tarjan, 2012), where pij denotes whether x(i) is associated with the (type of) guarded attribute z(j). The values (b0j,b1j) determine lower and upper bounds on the number of xs a given z(j) can be assigned to. While a standard assignment problem can be solved efficiently using the Hungarian method of Kuhn (1955), we choose to use the ILP formulation, as it enables us to have more freedom in adding constraints to the problem, such as the lower and upper bounds.
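The sketch below illustrates the A-step objective under simplifying assumptions: it builds the score matrix of inner products in the joint projected space and, for the special case n = m, solves a one-to-one assignment with SciPy's Hungarian-style solver instead of the ILP; the paper's ILP additionally handles many-to-one assignments with the capacity bounds (b_{0j}, b_{1j}).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def a_step(X, Z, U, V):
    """One assignment step: match each x^(i) to a guarded record z^(j).

    X: (n, d) inputs, Z: (m, d_prime) guarded records, U: (d, k), V: (d_prime, k).
    Assumes n == m and a one-to-one matching; the ILP in Eq. 3 generalizes
    this to many-to-one matchings with per-record capacity bounds.
    """
    S = (X @ U) @ (Z @ V).T                  # S[i, j] = <U^T x_i, V^T z_j>
    rows, cols = linear_sum_assignment(-S)   # negate to maximize the total score
    pi = np.empty(X.shape[0], dtype=int)
    pi[rows] = cols                          # pi[i] = index of the z assigned to x_i
    return pi
```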

3.2 M-step (Covariance Maximization)

The result of an A-step is an assignment π such that π(i) = j implies x^(i) was deemed as aligned to z^(j). With that π in mind, we define the following empirical covariance matrix Ω_π ∈ ℝ^{d×d′}:

$$\boldsymbol{\Omega}_\pi = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}^{(i)} \left( \mathbf{z}^{(\pi(i))} \right)^\top. \tag{4}$$

We then apply SVD on Ωπ to get new U and V that are used in the next iteration of the algorithm with the A-step, if the algorithm continues to run. When the maximal number of iterations is reached, we follow the work of Shao et al. (2023) in using a truncated part of U to remove the information from the xs. We do that by projecting x(i) using the singular vectors of U with the smallest singular values. These projected vectors co-vary the least with the guarded attributes, assuming the assignment in the last A-step was precise. This method has been shown by Shao et al. (2023) to be highly effective and efficient in debiasing neural representations.
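A matching sketch of the M-step, under the same shape assumptions as the A-step sketch above: it builds the empirical cross-covariance of Eq. 4 under the current assignment and takes its SVD. Alternating `a_step` and `m_step` and then calling an SAL-style removal with the final U gives a rough picture of Figure 2; again, this is an illustration, not the released implementation.

```python
import numpy as np

def m_step(X, Z, pi, k):
    """One maximization step: recompute the projections from the assignment.

    Builds Omega_pi = (1/n) * sum_i x^(i) (z^(pi(i)))^T and returns the top-k
    left/right singular vectors plus the singular values.
    """
    n = X.shape[0]
    omega = (X.T @ Z[pi]) / n        # (d, d_prime) cross-covariance under pi
    U, s, Vt = np.linalg.svd(omega)
    return U[:, :k], Vt.T[:, :k], s
```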

3.3 A Matrix Formulation of the AM Steps

Let e1,…,em be the standard basis vectors. This means ei is a vector of length m with 0 in all coordinates except for the ith coordinate, where it is 1.

Let ε be the set of all matrices E such that E ∈ ℝ^{n×m} and each row of E is one of the e_i, i ∈ [m]. For E ∈ ε, the product EZ^⊤ is an n × d′ matrix (where Z ∈ ℝ^{d′×m} has the guarded attribute samples as its columns), such that the ith row of EZ^⊤ is a copy of the jth column of Z whenever the ith row of E is e_j. Therefore, the AM steps can be viewed as solving the following maximization problem using coordinate ascent:

$$\max_{\mathbf{E} \in \varepsilon} \; \max_{\mathbf{U}, \boldsymbol{\Sigma}, \mathbf{V}} \; \mathrm{tr}(\boldsymbol{\Sigma}) \quad \text{s.t.} \quad \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top = \mathbf{X} \mathbf{E} \mathbf{Z}^\top,$$

where U, V are orthonormal matrices, Σ is a diagonal matrix with non-negative elements, and X ∈ ℝ^{d×n} has the inputs as its columns. This corresponds to the SVD of the matrix XEZ^⊤.

In that case, the matrix E can be directly mapped to an assignment in the form of π, where π(i) would be the j such that the jth coordinate in the ith row of E is non-zero.
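As a small illustration of this matrix view, the helper below (a hypothetical name) encodes an assignment π as the 0/1 matrix E of this section; multiplying E against the matrix of guarded samples then stacks, for each input, its assigned guarded record.

```python
import numpy as np

def pi_to_E(pi, m):
    """Encode an assignment pi: [n] -> [m] as the 0/1 matrix E of Section 3.3."""
    E = np.zeros((len(pi), m))
    E[np.arange(len(pi)), pi] = 1.0   # row i is the standard basis vector e_{pi(i)}
    return E
```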

3.4 Removal Algorithm

The AM steps are best suited for the removal of information through SVD with an algorithm such as SAL. This is because the AM steps optimize an objective of the same type as SAL, relying on the projections U and V to project the inputs and guarded representations into a joint space. However, a by-product of the algorithm in Figure 2 is an assignment function π that aligns between the inputs and the guarded representations.

With that assignment, other removal algorithms can be used, for example, the algorithm of Ravfogel et al. (2020). We experiment with this idea in §4.

3.5 Justification of the AM Steps

We next provide a justification of our algorithm (which may be skipped on a first reading). Our justification is based on the observation that if indeed X and Z are linked together (this connection is formalized as a latent variable in their joint distribution), then for a given sample that is permuted, the singular values of Ω will be larger the closer the permutation is to the identity permutation. This justifies finding such a permutation that maximizes the singular values in an SVD of Ω.

More Details
Let ι: [n] → [n] be the identity permutation, ι(i) = i. We will assume the case in which n = m (but the justification can be generalized to the case m < n), and that the underlying joint distribution p(X,Z) is mediated by a latent variable H, such that

$$p(\mathbf{X}, \mathbf{Z}) = \sum_{h} p(\mathbf{H} = h)\, p(\mathbf{X} \mid \mathbf{H} = h)\, p(\mathbf{Z} \mid \mathbf{H} = h). \tag{5}$$

This implies there is a latent variable that connects X and Z, and that the joint distribution p(X,Z) is a mixture through H.

Proposition 1 (informal).

Let {(x^(i), z^(i))} be a sample of size n from the distribution in Eq. 5. Let π be a permutation over [n] uniformly sampled from the set of permutations. Then, with high likelihood, the sum of the singular values of Ω_π is smaller than the sum of the singular values under Ω_ι.

For full details of this claim, see Appendix A.
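The following is an illustrative simulation of Proposition 1 under an assumed latent-variable mixture (it is not part of the formal argument in Appendix A): the sum of singular values of the cross-covariance is typically much larger under the true alignment than under a random permutation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_prime = 2000, 10, 4
H = rng.integers(0, 2, size=n)                       # latent variable per sample
sign = np.where(H[:, None] == 1, 1.0, -1.0)
X = sign * np.ones(d) + rng.normal(size=(n, d))      # X depends on H plus noise
Z = sign * np.ones(d_prime) + rng.normal(size=(n, d_prime))
X -= X.mean(axis=0); Z -= Z.mean(axis=0)

def nuclear_norm_of_cov(X, Z):
    """Sum of singular values of the empirical cross-covariance."""
    return np.linalg.svd((X.T @ Z) / len(X), compute_uv=False).sum()

aligned = nuclear_norm_of_cov(X, Z)
shuffled = nuclear_norm_of_cov(X, Z[rng.permutation(n)])
print(f"aligned: {aligned:.3f}  shuffled: {shuffled:.3f}")  # aligned is larger
```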

In our experiments, we test several combinations of algorithms. As a baseline for the assignment of xs to zs, we use k-means (KMeans) as a substitute for the AM steps. In addition, for the removal step (once an assignment has been identified), we test two algorithms: SAL (Shao et al., 2023; resulting in AMSAL) and INLP (Ravfogel et al., 2020). We also compare these two algorithms in oracle mode (in which the assignment of guarded attributes to inputs is known), to see the loss in performance that happens due to noisy assignments from the AM or k-means algorithm (OracleSAL and OracleINLP).

When running the AM algorithm or k-means, we execute it with three random seeds (see also §4.6) for a maximum of a hundred iterations and choose the projection matrix with the largest objective value over all seeds and iterations. For the slack variables (b_{0j} and b_{1j} in Eq. 3), we use 20%–30% above and below the baseline of the guarded attribute priors according to the training set. With the SAL methods, we remove a number of directions according to the rank of the Ω matrix (between 2 and 6 in all experiments overall).
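For illustration, the capacity bounds can be derived from the priors as in the sketch below; the exact slack value (here 25%, within the 20%–30% range mentioned above) and the rounding are assumptions.

```python
import numpy as np

def capacity_bounds(priors, n, slack=0.25):
    """Lower/upper bounds (b0_j, b1_j) on how many x's each guarded value may receive."""
    counts = np.asarray(priors, dtype=float) * n
    b0 = np.floor(counts * (1 - slack)).astype(int)
    b1 = np.ceil(counts * (1 + slack)).astype(int)
    return b0, b1

# e.g., a binary attribute with a 60/40 prior over 1,000 training examples
print(capacity_bounds([0.6, 0.4], 1000))   # (array([450, 300]), array([750, 500]))
```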

In addition, we experiment with a partially supervised assignment process, in which a small seed dataset of aligned xs and zs is provided to the AM steps. We use it for model selection: Rather than choosing the assignment with the highest SVD objective value, we choose the assignment with the highest accuracy on this seed dataset. We refer to this setting as Partial (for “partially supervised assignment”).

Finally, in the case of a gender-protected attribute, we compare our results against a baseline in which the input x is compared against a list of words stereotypically associated with the genders of male or female.2 Based on the overlap with these two lists, we heuristically assign the gender label to x and then run SAL or INLP (rather than using the AM algorithm). While this wordlist heuristic is plausible in the case of gender, it is not as easy to derive in the case of other protected attributes, such as age or race. We give the results for this baseline using the marker WL in the corresponding tables.
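A sketch of the WL heuristic follows; the two word lists here are tiny placeholders for the curated lists referenced in the footnote above, and the tie-breaking behavior is an assumption.

```python
MALE_WORDS = {"he", "him", "his", "man", "men", "father", "brother"}
FEMALE_WORDS = {"she", "her", "hers", "woman", "women", "mother", "sister"}

def wordlist_gender_label(text):
    """Heuristically assign z in {1, -1} from word-list overlap, or None if undecided."""
    tokens = text.lower().split()
    m = sum(t in MALE_WORDS for t in tokens)
    f = sum(t in FEMALE_WORDS for t in tokens)
    if m == f:
        return None
    return 1 if m > f else -1
```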

Main Findings

Our overall main finding shows that our novel setting, in which guarded information is erased from individually unaligned representations, is viable. We discovered that AM methods perform particularly well when dealing with more complex bias removal scenarios, such as when multiple guarded attributes are present. We also found that having similar priors for the guarded attributes and downstream task labels may lead to poor performance on the task at hand. In these cases, using a small amount of supervision often effectively helps reduce bias while maintaining the utility of the representations for the main classification or regression problem. Finally, our analysis of alignment stability shows that our AM algorithm often converges to suitable solutions that align X with Z.

Due to the unsupervised nature of our problem setting, we advise validating the utility of our method in the following way. Once we run the AM algorithm, we check whether there is a high-accuracy alignment between X and Y (rather than Z, which is unavailable). If this alignment is accurate, then we run the risk of significantly damaging task performance. An example is given in §4.5.

4.1 Word Embedding Debiasing

As a preliminary assessment of our setup and algorithms, we apply our methods to GloVe word embeddings to remove gender bias, and follow the previous experiment settings of this problem (Bolukbasi et al., 2016; Ravfogel et al., 2020; Shao et al., 2023). We considered only the 150,000 most common words to ensure the embedding quality and omitted the rest. We sort the remaining embeddings by their projection on the he-she direction. Then we consider the top 7,500 word embeddings as male-associated words (z = 1) and the bottom 7,500 as female-associated words (z = −1).
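A minimal sketch of this construction, assuming `emb` is the matrix of the 150,000 retained GloVe vectors and `vocab` the matching list of words (both hypothetical variable names):

```python
import numpy as np

def split_by_gender_direction(emb, vocab, top_k=7500):
    """Sort embeddings by their projection on the he-she direction and return
    the indices of the top_k male-associated and top_k female-associated words."""
    direction = emb[vocab.index("he")] - emb[vocab.index("she")]
    direction /= np.linalg.norm(direction)
    scores = emb @ direction
    order = np.argsort(scores)
    female_idx = order[:top_k]     # most negative projections (z = -1)
    male_idx = order[-top_k:]      # most positive projections (z = +1)
    return male_idx, female_idx
```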

Our findings are that both the k-means and the AM algorithms perfectly identify the alignment between the word embeddings and their associated gender label (100%). Indeed, the dataset construction itself follows a natural perfect clustering that these algorithms easily discover. Since the alignments are perfectly identified, the results of predicting the gender from the word embeddings after removal are identical to the oracle case. These results are quite close to the results of a random guess, and we refer the reader to Shao et al. (2023) for details on experiments with SAL and INLP for this dataset. Considering Figure 3, it is evident that our algorithm essentially follows a natural clustering of the word embeddings into two clusters, female and male, as the embeddings are highly separable in this case. This is why the alignment score of X (embedding) to Z (gender) is perfect in this case. This finding indicates that this standard word embedding dataset used for debiasing is trivial to debias—debiasing can be done even without knowing the identity of the stereotypical gender associated with each word.

Figure 3: A t-SNE visualization of the word embeddings before and after gender information removal. In (a) we see the embeddings naturally cluster into the corresponding gender.

4.2 BiasBios Results

De-Arteaga et al. (2019) presented the BiasBios dataset, which consists of self-provided biographies paired with the profession and gender of their authors. A list of pronouns and names is used to obtain the authors’ gender automatically. They aim to expose the caveats of automated hiring systems by showing that even the simple task of predicting a candidate’s profession can be affected by the candidate’s gender, which is encoded in the biography representation. For example, we want to avoid a case in which a person being identified as “he” or “she” in their biography affects the likelihood of them being classified as an engineer or a teacher.

We follow the setup of De-Arteaga et al. (2019), predicting a candidate’s profession (y) based on a self-provided short biography (x), aiming to remove any information about the candidate’s gender (z). Due to computational constraints, we use only a random subset of 30K examples to learn the projections with both SAL and INLP (whether in the unaligned or aligned setting). For the classification problem, we use the full dataset. To obtain vector representations for the biographies, we use two different encoders, FastText word embeddings (Joulin et al., 2016) and BERT (Devlin et al., 2019). We stack a multi-class classifier on top of these representations, as there are 28 different professions. We use 20% of the training examples for the Partial setting. For BERT, we followed De-Arteaga et al. (2019) in using the last CLS token state as the representation of the whole biography. We used the BERT model bert-base-uncased.

Evaluation Measures

We use an extension of the True Positive Rate (TPR) gap, the root mean square (RMS) TPR gap of all classes, for evaluating bias in a multiclass setting. This metric was suggested by De-Arteaga et al. (2019), who demonstrated it is significantly correlated with gender imbalances, which often lead to unfair classification. The higher the metric value is, the bigger the gap between the two categories (for example, between male and female) for the specific main task prediction. For the profession classification, we report accuracy.
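The sketch below shows one way to compute the RMS TPR gap for a binary protected attribute, matching our reading of the metric; it is illustrative rather than the exact evaluation script.

```python
import numpy as np

def rms_tpr_gap(y_true, y_pred, protected):
    """Root mean square over classes of the TPR gap between the two protected groups.

    y_true, y_pred: integer class arrays; protected: boolean array (e.g., gender).
    """
    gaps = []
    for label in np.unique(y_true):
        tprs = []
        for group in (True, False):
            mask = (y_true == label) & (protected == group)
            if mask.sum() == 0:
                continue
            tprs.append((y_pred[mask] == label).mean())
        if len(tprs) == 2:
            gaps.append(tprs[0] - tprs[1])
    return float(np.sqrt(np.mean(np.square(gaps))))
```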

Results

Table 1 provides the results for the biography dataset. We see that INLP significantly reduces the TPR-GAP in all settings, but this comes at a cost: The representations are significantly less useful for the main task of predicting the profession. When inspecting the alignments, we observe that their accuracy is quite high with BERT: 100% with k-means, 85% with the AM algorithm, and 99% with Partial AM. For FastText, the results are lower, hovering around 55% for all three methods. The high BERT assignment performance indicates that the BiasBios BERT representations are naturally separated by gender. We also observe that the results of WL+SAL and WL+INLP are correspondingly identical to Oracle+SAL and Oracle+INLP. This comes as no surprise, as the gender label is derived from a similar word list, which enables the WL approach to get a nearly perfect alignment (over 96% agreement with the gender label).

Table 1: BiasBios dataset results. The top part uses BERT embeddings to encode the biographies, while the bottom part uses FastText embeddings.

4.3 BiasBench Results

Meade et al. (2022) conducted an empirical study of an array of datasets in the context of debiasing. They analyzed different methods and tasks, and we follow their benchmark evaluation to assess our AMSAL algorithm and other methods in the context of our new setting. We include a short description of the datasets we use in this section. We include full results in Appendix B, with a description of other datasets. We also encourage the reader to refer to Meade et al. (2022) for details on this benchmark. We use 20% of the training examples for the Partial setting.

StereoSet (Nadeem et al., 2021)

This dataset presents a word completion test for a language model, where the completion can be stereotypical or non-stereotypical. The bias is then measured by calculating how often a model prefers the stereotypical completion over the non-stereotypical one. Nadeem et al. (2021) introduced the language model score to measure the language model usability, which is the percentage of examples for which a model prefers the stereotypical or non-stereotypical word over some unrelated word.

CrowS-Pairs (Nangia et al., 2020)

This dataset includes pairs of sentences that are minimally different at the token level, but these differences lead to the sentence being either stereotypical or anti-stereotypical. The assessment measures how many times a language model prefers the stereotypical element in a pair over the anti-stereotypical element.

Results

We start with an assessment of the BERT model for the CrowS-Pairs gender, race, and religion bias evaluation (Table 2). We observe that all approaches for gender, except AM+INLP, reduce the stereotype score. Race and religion are more difficult to debias in the case of BERT. INLP with k-means works best when no seed alignment data is provided at all, but when we consider PartialSAL, in which we use the alignment algorithm with some seed aligned data, we see that the results are the strongest. When we consider the RoBERTa model, the results are similar, with PartialSAL significantly reducing the bias. Our findings from Table 2 overall indicate that the ability to debias a representation highly depends on the model that generates the representation. In Table 10 (Appendix B) we observe that the representations, on average, are not damaged for most GLUE tasks.

Table 2: (a) CrowS-Pairs Gender stereotype scores (Stt. score) in language models debiased by different debiasing techniques and assignment; (b) CrowS-Pairs Race stereotype scores; (c) CrowS-Pairs Religion stereotype scores. All models are deemed least biased if the stereotype score is 50%. The colored numbers are calculated as ||b − 50| − |s − 50||, where b is the top row score and s is the corresponding system score.

As Meade et al. (2022) have noted, when changing the representations of a language model to remove bias, we might cause such adjustments that damage the usability of the language model. To test which methods possibly cause such an issue, we also assess the language model score on the StereoSet dataset in Table 3. We overall see that often SAL-based methods give a lower stereotype score, while INLP methods more significantly damage the language model score. This implies that the SAL-based methods remove bias effectively while less significantly harming the usability of the language model representations.

Table 3: StereoSet stereotype scores (Stt. Score) and language modeling scores (LM Score) for the gender category. Stereotype scores indicate the least bias at 50% and the LM scores indicate high usability at 100%.

We also provide comprehensive results for other datasets (SEAT and GLUE) and categories of bias (race and religion). The results, especially for GLUE, demonstrate the effectiveness of our method of unaligned information removal: for GLUE, we consistently retain the baseline task performance almost in full. See Appendix B.

4.4 Multiple-Guarded Attribute Sentiment

We hypothesize that AM-based methods are better suited for setups where multiple guarded attributes should be removed, as they allow us to target several guarded attributes with different priors. To examine our hypothesis, we experiment with a dataset curated from Twitter (tweets encoded using BERT, bert-base-uncased), in which users are surveyed for their age and gender (Cachola et al., 2018). We bucket the age into three groups (0–25, 26–50, and above 50). Tweets in this dataset are annotated with their sentiment, ranging from 1 (very negative) to 5 (very positive). The dataset consists of more than 6,400 tweets written by more than 1,700 users. We removed users who no longer have public Twitter accounts and users with locations that do not exist based on a filter,3 resulting in a dataset with over 3,000 tweets, written by 817 unique users. As tweets are short by nature and their number is relatively small, the debiasing signal in this dataset (the amount of information it contains about the guarded attributes) might not be sufficient for the attribute removal. To amplify this signal, we concatenated each tweet in the dataset to at most ten other tweets from the same user.

We study the relationship between the main task of sentiment detection and the two protected attributes of age and gender. As the protected attribute z, we use the combination of both age and gender as a binary one-hot vector. This dataset thus presents a use case of a composed protected attribute for our algorithm. Rather than using a classifier for predicting the sentiment, we use linear regression. Following Cachola et al. (2018), we use Mean Absolute Error (MAE) to report the error of the sentiment predictions. Given that the sentiment is predicted as a continuous value, we cannot use the TPR gap as in previous sections. Rather, we use the following gap measure:

$$\mathrm{GAP} = \mathrm{std}\left( \mathrm{MAD}_{z=1}, \ldots, \mathrm{MAD}_{z=m} \right), \qquad \mathrm{MAD}_{z=j} = \frac{1}{n_j} \sum_{i=1}^{n_j} \eta_{ij}, \tag{6}$$

where i ranges over the n_j examples with protected attribute value j, μ_j is the average absolute Y prediction error for that set, and η_ij is the absolute difference between μ_j and the absolute error of example i.4 The function std in this case indicates the standard deviation of the m values MAD_{z=j}, j ∈ [m].
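A sketch of this gap measure, following our reading of Eq. 6 (the grouping by the one-hot protected value and the use of the population standard deviation are assumptions):

```python
import numpy as np

def mae_gap(y_true, y_pred, z):
    """Std. deviation over protected groups of the mean absolute deviation of errors."""
    errors = np.abs(y_true - y_pred)          # absolute prediction errors
    mads = []
    for group in np.unique(z):
        e = errors[z == group]
        mu = e.mean()                          # mu_j: mean absolute error for the group
        mads.append(np.abs(e - mu).mean())     # MAD_{z=j}
    return float(np.std(mads))
```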
Results

Table 4 presents our results. Overall, AMSAL reduces the gender and age gap in the predictions while not increasing the MAE by much. In addition, we can see that both AM-based methods outperform their k-means counterparts, which either increase unfairness (Kmeans+INLP) or significantly harm the downstream-task performance (Kmeans+SAL). We also consider Figure 4, which shows how the quality of the assignments of the AM algorithm changes as a function of the amount of labeled data used. As expected, the more labeled data we have, the more accurate the assignments are, but the differences are not very large.

Table 4: MAE and debiasing gap values on the Twitter dataset, when using BERT to encode the tweets. For age and gender, we give the MAE gap as in Eq. 6.
Figure 4: Accuracy of the AM steps with respect to age and gender separately (on unseen data), as a function of the fraction of the labeled dataset used by the AM algorithm.

4.5 An Example of Our Method Limitations

We now present the main limitation in our approach and setting. This limitation arises when the random variables Y and Z are not easily distinguishable through information about X.

We experiment with a binary sentiment analysis (y) task, predicted on users’ tweets (x), aiming to remove information regarding the authors’ ethnic affiliations. To do so, we use a dataset collected by Blodgett et al. (2016), which examined the differences between African-American English speakers and Standard American English speakers. As information about one’s ethnicity is hard to obtain, the user’s geolocation information was used to create a distantly supervised mapping between authors and their ethnic affiliations. We follow previous work (Shao et al., 2023; Ravfogel et al., 2020) and use the DeepMoji encoder (Felbo et al., 2017) to obtain representations for the tweets. The train and test sets are balanced regarding sentiment and authors’ ethnicity. We use 20% of the examples for the Partial setting. Table 5 gives the results for this dataset. We observe that the removal with the assignment (k-means, AM, or Partial) significantly harms the performance on the main task and reduces it to a random guess.

Table 5: The performance of removing race information from the DeepMoji dataset is shown for two cases: with balanced ratios of race and sentiment (left) and with ratios of 0.8 for sentiment and 0.5 for race (right). In both cases, the total size of the dataset used is 30,000 examples. To evaluate the performance of the unbalanced sentiment dataset, we use the F1 macro measure, because in an unbalanced dataset such as this one, a simple classifier that always returns one label will achieve an accuracy of 80%. Such a classifier would have an F1 macro score of approximately 0.444.

This presents a limitation of our algorithm. A priori, there is no distinction between Y and Z, as our method is unsupervised. In addition, the positive labels of Y and Z have the same prior probability. Indeed, when we check the assignment accuracy in the sentiment dataset, we observe that the k-means, AM, and Partial AM assignment accuracy for identifying Z are between 0.55 and 0.59. If we check the assignment against Y, we get an accuracy between 0.74 and 0.76. This means that all assignment algorithms actually identify Y rather than Z (both Y and Z are binary variables in this case). The conclusion from this is that our algorithm works best when sufficient information on Z is presented such that it can provide a basis for aligning samples of Z with samples of X. Suppose such information is unavailable or unidentifiable with information regarding Y. In that case, we may simply identify the natural clustering of X according to their main task classes, leading to low main-task performance.

In Table 5, we observe that this behavior is significantly mitigated when the priors over the sentiment and the race are different (0.8 for sentiment and 0.5 for race). In that case, the AM algorithm is able to distinguish between the race-protected attribute (z) and the sentiment class (y) quite consistently with INLP and SAL, and the gap is reduced.

We also observe that INLP changed neither the accuracy nor the TPR-GAP for the balanced scenario (Table 5) when using a k-means assignment or an AM assignment. Upon inspection, we found out that INLP returns an identity projection in these cases, unable to amplify the relatively weak signal in the assignment to change the representations.

4.6 Stability Analysis of the Alignment

In Figure 5, we plot the accuracy of the alignment algorithm (knowing the true value of the guarded attribute per input) throughout the execution of the AM steps for the first ten iterations. The shaded area indicates one standard deviation. We observe that the first few iterations are the ones in which the accuracy improves the most. For most of the datasets, the accuracy does not decrease between iterations, though in the case of DeepMoji we do observe a “bump.” This is indeed why the Partial setting of our algorithm, where a small amount of guarded information is available to determine at which iteration to stop the AM algorithm, is important. In the word embeddings case, the variance is larger because, in certain executions, the algorithm converged quickly, while in others, it took more iterations to converge to high accuracy.

Figure 5: Accuracy of the AM steps (in identifying the correct assignment of inputs to guarded information) as a function of the iteration number. Shaded gray gives upper and lower bound on the standard deviation over five runs with different seeds for the initial π. FastText refers to the BiasBios dataset, the BERT models are for the CrowS-Pairs dataset and Emb. refers to the word embeddings dataset from §4.1.

Figure 6 plots the relative change of the objective value of the ILP from §3.1 against the iteration number. The relative change is defined as the ratio between the objective value at a given iteration and its value before the algorithm begins. We see that the algorithm is relatively stable and that the AM steps converge quite quickly. We also observe that the DeepMoji dataset has a large increase in the objective value in the first iteration (around 5× the value the algorithm starts with), after which it remains stable.

Figure 6: Ratio of the objective value in iteration t and iteration 0 of the ILP for the AM steps as a function of the iteration number t. Shaded gray gives upper and lower bound on the standard deviation over five runs with different seeds for the initial π. See legend explanation in Table 5.

There has been an increasing amount of work on detecting and erasing undesired or protected information from neural representations, with standard software packages for this process having been developed (Han et al., 2022). For example, in their seminal work, Bolukbasi et al. (2016) showed that word embeddings exhibit gender stereotypes. To mitigate this issue, they projected the word embeddings to a neutral space with respect to a “he-she” direction. Influenced by this work, Zhao et al. (2018) proposed a customized training scheme to reduce the gender bias in word embeddings. Gonen and Goldberg (2019) examined the effectiveness of the methods mentioned above and concluded they remove bias in a shallow way. For example, they demonstrated that classifiers can accurately predict the gender associated with a word when fed with the embeddings of both debiasing methods.

Another related strand of work uses adversarial learning (Ganin et al., 2016), where an additional objective function is added for balancing undesired-information removal and the main task (Edwards and Storkey, 2016; Li et al., 2018; Coavoux et al., 2018; Wang et al., 2021). Elazar and Goldberg (2018) have also demonstrated that an ad-hoc classifier can easily recover the removed information from adversarially trained representations. Since then, methods for information erasure such as INLP and its generalization (Ravfogel et al., 2020, 2022), SAL (Shao et al., 2023) and methods based on similarity measures between neural representations (Colombo et al., 2022) have been developed. With a similar motivation to ours, Han et al. (2021b) aimed to ease the burden of obtaining guarded attributes at a large scale by decoupling the adversarial information removal process from the main task training. They, however, did not experiment with debiasing representations where no guarded attribute alignments are available. Shao et al. (2023) experimented with the removal of features in a scenario in which a low number of protected attributes is available.

Additional previous work showed that methods based on causal inference (Feder et al., 2021), train-set balancing (Han et al., 2021a), and contrastive learning (Shen et al., 2021; Chi et al., 2022) effectively reduce bias and increase fairness. In addition, there is a large body of work for detecting bias, its evaluation (Dev et al., 2021) and its implications in specific NLP applications. Savoldi et al. (2022) detected a gender bias in speech translation systems for gendered languages. Gender bias is also discussed in the context of knowledge base embeddings by Fisher et al. (2019); Du et al. (2022), and multilingual text classification (Huang, 2022).

We presented a new and challenging setup for removing information, with minimal or no available sensitive information alignment. This setup is crucial for the wide applicability of debiasing methods, as for most applications, obtaining such sensitive labels on a large scale is challenging. To ease this problem, we present a method to erase information from neural representations, where the guarded attribute information does not accompany each input instance. Our main algorithm, AMSAL, alternates between two steps (Assignment and Maximization) to identify an assignment between the input instances and the guarded information records. It then completes its execution by removing the information by minimizing covariance between the input instances and the aligned guarded attributes. Our approach is modular, and other erasure algorithms, such as INLP, can be used with it. Experiments show that we can reduce the unwanted bias in many cases while keeping the representations highly useful. Future work might include extending our technique to the kernelized case, analogously to the method of Shao et al. (2023).

The AM algorithm could potentially be misused by, rather than using the AM steps to erase information, using them to link records of two different types, undermining the privacy of the record holders. Such a situation may merit additional concern because the links returned between the guarded attributes and the input instances will likely contain mistakes. The links are unreliable for decision-making at the individual level. Instead, they should be used on an aggregate as a statistical construct to erase information from the input representations. Finally,5 we note that the automation of the debiasing process, without properly statistically confirming its accuracy using a correct sample, may promote a false sense of security that a given system is making fair decisions. We do not recommend using our method for debiasing without proper statistical control and empirical verification of correctness.

We thank the reviewers, the action editors and Marcio Fonseca for their thorough feedback. We also thank Daniel Preoţiuc-Pietro for his help with the Twitter data. We thank Kousha Etessami for being a sounding board for certain parts of the paper. The experiments in this paper were supported by compute grants from the Edinburgh Parallel Computing Center and from the Baskerville Tier 2 HPC service (University of Birmingham).

We provide here the full details for the claim in §3.5. Our first observation is that for a uniformly sampled permutation π: [n] → [n], the probability that it has exactly k ≤ n elements such that π(i) = i for all i in this set of elements is bounded from above by:6

$$\binom{n}{k} \cdot \frac{(n-k)!}{n!} = \frac{1}{k!}.$$

We also assume that E[XH] = 0 and E[ZH] = 0, and that the product of every pair of coordinates of X and Z is bounded in absolute value by a constant B > 0. Let {(x^(i), z^(i), h^(i))} be a random sample of size n from the joint distribution p(X,Z,H). Given a permutation π: [n] → [n], define I(π) = {i : π(i) = i}. For a given set M ⊆ [n], define

$$\boldsymbol{\Omega}_\pi^{M} = \sum_{i \in M} \mathbf{x}^{(i)} \left( \mathbf{z}^{(\pi(i))} \right)^\top.$$

For a matrix A ∈ ℝ^{d×d′}, let σ_j(A) be its jth largest singular value, and let σ⁺(A) = Σ_j σ_j(A). Let σ⁺ = σ⁺(E[Ω_ι]).

We first note that for any permutation π, it holds that E[Ω_π^K] = 0, where we define K = [n] ∖ I(π).

Lemma 1.
For any t > 0, the probability that

$$\max_{i \in [d],\, j \in [d']} \left| \left[ \boldsymbol{\Omega}_\pi^{I(\pi)} - E\!\left[ \boldsymbol{\Omega}_\pi^{I(\pi)} \right] \right]_{ij} \right| \ge t\,|I(\pi)| \tag{7}$$

is smaller than $2dd'\exp\!\left(-\frac{t^2 |I(\pi)|}{B^2}\right)$.

Proof.
By Hoeffding’s inequality, for any i ∈ [d], j ∈ [d′], the probability that for the |I(π)| i.i.d. random variables X_k, Z_k (the ith and jth coordinates of the samples indexed by I(π)) the following is true:

$$\left| \sum_{k \in I(\pi)} X_k Z_k - E\!\left[ \sum_{k \in I(\pi)} X_k Z_k \right] \right| \ge t\,|I(\pi)|$$

is smaller than $2\exp\!\left(-\frac{t^2 |I(\pi)|}{B^2}\right)$. Therefore, by a union bound on each element of the matrix Ω_π^{I(π)}, we get the upper bound on Eq. 7.

Lemma 2.
It holds that:

$$\left\| \boldsymbol{\Omega}_\pi^{K} - E\!\left[ \boldsymbol{\Omega}_\pi^{K} \right] \right\|_2$$

is smaller than 2|K|dd′B.

Proof.

Since X_i and Z_j are bounded as a product in absolute value by B, and the dimensions of Ω_π^K are d × d′, each cell being a sum of |K| values, the bound naturally follows.

Let n be such that nσ⁺ > 2k·dd′·B, where k = |K|. Then from Lemma 2, ∥Ω_π^K − E[Ω_π^K]∥₂ < nσ⁺. Consider the event σ⁺(Ω_ι) < σ⁺(Ω_π). Its probability is bounded from above by the probability of the event σ⁺(Ω_ι) ≤ nσ⁺ OR σ⁺(Ω_π) ≥ nσ⁺ (for any n as the above). Due to the inequality of Weyl (Theorem 1 in Stewart 1990; see below), the fact that Ω_π = Ω_π^K + Ω_π^{I(π)}, Lemma 1, and the fact that n − k ≤ n, the probability of this OR event is bounded from above by $4dd'\exp\!\left(-\frac{(n-k)(\sigma^+)^2}{(dd'B)^2}\right)$.

The conclusion from this is that if we were to sample uniformly a permutation π from the set of permutations over [n], then with quite high likelihood (because the fraction of elements that are preserved under π becomes smaller as n becomes larger), the sum of the singular values of Ωπ under this permutation will be smaller than the sum of the singular values of Ωι—meaning, when the xs and the zs are correctly aligned. This justifies our objective of aligning the xs and the zs with an objective that maximizes the singular values, following Proposition 1.

Inequality of Weyl (1912)

As mentioned by Stewart (1990), the following holds:

Lemma 3.

Let A and E be two matrices, and let Ã = A + E. Let σ_i be the ith singular value of A and σ̃_i be the ith singular value of Ã. Then |σ_i − σ̃_i| ≤ ∥E∥₂.

We include more results for the SEAT dataset from BiasBench and for the CrowS-Pairs dataset and StereoSet datasets for bias categories other than gender. A description of the SEAT and GLUE datasets (with metrics used) follows.

SEAT (May et al., 2019)

SEAT is a sentence-level extension of WEAT (Caliskan et al., 2017), which is an association test between two categories of words: attribute word sets and target word sets. For example, an attribute word set (in the case of gender bias) could be a set of words such as { he, him, man }, while a target word set might be words related to office work, such as { career, office }. If we see a high association between an attribute word set and a target word set, we may claim that a particular gender bias is encoded. The final evaluation is calculated by measuring the similarity between the different attribute and target word sets. To extend WEAT to a sentence-level test, May et al. (2019) incorporated the WEAT attribute and target words into synthetic sentence templates.

We use an effect size metric to report our results for SEAT. This measure is a normalized difference between cosine similarity of representations of the attribute words and the target words. Both attribute words and target words are split into two categories (for example, in relation to gender), so the difference is based on four terms, between each pair of each category set of words (target and attribute). An effect size closer to zero indicates less bias is encoded in the representations.
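For reference, a sketch of the WEAT/SEAT effect size as we understand it from Caliskan et al. (2017); `X`, `Y` are the two target sets and `A`, `B` the two attribute sets, given as lists of embedding vectors.

```python
import numpy as np

def _cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def _association(w, A, B):
    """s(w, A, B): mean cosine similarity to A minus mean cosine similarity to B."""
    return np.mean([_cos(w, a) for a in A]) - np.mean([_cos(w, b) for b in B])

def effect_size(X, Y, A, B):
    """WEAT/SEAT effect size; values closer to 0 indicate less encoded association."""
    sX = [_association(x, A, B) for x in X]
    sY = [_association(y, A, B) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)
```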

GLUE (Wang et al., 2019)

We follow Meade et al. (2022) and use the GLUE dataset to test the debiased model on an array of downstream tasks to validate their usability. GLUE is a highly popular benchmark for testing NLP models, containing a variety of tasks, such as classification tasks (e.g., sentiment analysis), similarity tasks (e.g., paraphrase identification), and inference tasks (e.g., question-answering).

The following tables of results are included:

  • Table 6 presents the StereoSet results for removing the race (a) and religion (b) guarded attributes.

  • Tables 7, 8, and 9 describe the SEAT effect sizes for the gender, race, and religion cases, respectively.

  • Table 10 presents the scores the debiased representations achieve for the GLUE benchmark.

Table 6: (a) StereoSet stereotype scores and language modeling scores (LM Score) for race debiased BERT, ALBERT, RoBERTa, and GPT-2 models. Stereotype scores are least biased at 50% and the LM Scores are best at 100%; (b) StereoSet stereotype scores and language modeling scores (LM Score) for religion debiased BERT, ALBERT, RoBERTa, and GPT-2 models. Stereotype scores are least biased at 50% and the LM Scores are best at 100%.
Table 7: SEAT effect sizes for gender-debiased representations of BERT, ALBERT, RoBERTa, and GPT-2 models. Effect sizes closer to 0 are indicative of less biased model representations. Statistically significant effect sizes at p < 0.01 are denoted by *. The final column reports the average absolute effect size across all six gender SEAT tests for each debiased model.
Table 8: SEAT effect sizes for race debiased BERT, ALBERT, RoBERTa, and GPT-2 models. Effect sizes closer to 0 are indicative of less biased model representations. Statistically significant effect sizes at p < 0.01 are denoted by *. The final column reports the average absolute effect size across all six gender SEAT tests for each debiased model.
Table 9: SEAT effect sizes for religion debiased BERT, ALBERT, RoBERTa, and GPT-2 models. Effect sizes closer to 0 are indicative of less biased model representations. Statistically significant effect sizes at p < 0.01 are denoted by *. The final column reports the average absolute effect size across all six gender SEAT tests for each debiased model.
Table 10: GLUE tests for gender-debiased BERT, ALBERT, RoBERTa, and GPT-2 models.
1. Our code is available at https://github.com/jasonshaoshun/AMSAL.

3. We used a list of cities, counties, and states in the United States, taken from https://tinyurl.com/4kmc6pyn. All users were in the United States when the data was collected by the original curators.

4. The absolute error of prediction a with true value b is |a − b|.

5. We thank the anonymous reviewer for raising this issue.

6. Choose k elements that are fixed, and let the rest vary arbitrarily.

Su Lin Blodgett, Lisa Green, and Brendan O'Connor. 2016. Demographic dialectal variation in social media: A case study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1119–1130, Austin, Texas. Association for Computational Linguistics.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Tauman Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pages 4349–4357.

Isabel Cachola, Eric Holgate, Daniel Preoţiuc-Pietro, and Junyi Jessy Li. 2018. Expressively vulgar: The socio-dynamics of vulgarity and its effects on sentiment analysis in social media. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2927–2938, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Jianfeng Chi, William Shand, Yaodong Yu, Kai-Wei Chang, Han Zhao, and Yuan Tian. 2022. Conditional supervised contrastive learning for fair text classification. ArXiv preprint, abs/2205.11485.

Maximin Coavoux, Shashi Narayan, and Shay B. Cohen. 2018. Privacy-preserving neural representations of text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1–10, Brussels, Belgium. Association for Computational Linguistics.

Pierre Colombo, Guillaume Staerman, Nathan Noiry, and Pablo Piantanida. 2022. Learning disentangled textual representations via statistical measures of similarity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2614–2630, Dublin, Ireland. Association for Computational Linguistics.

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 120–128.

Sunipa Dev, Tao Li, Jeff M. Phillips, and Vivek Srikumar. 2021. OSCaR: Orthogonal subspace correction and rectification of biases in word embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5034–5050, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Yupei Du, Qi Zheng, Yuanbin Wu, Man Lan, Yan Yang, and Meirong Ma. 2022. Understanding gender bias in knowledge base embeddings. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1381–1395, Dublin, Ireland. Association for Computational Linguistics.

Harrison Edwards and Amos J. Storkey. 2016. Censoring representations with an adversary. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings.

Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 11–21, Brussels, Belgium. Association for Computational Linguistics.

Amir Feder, Nadav Oved, Uri Shalit, and Roi Reichart. 2021. CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, 47(2):333–386.

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1615–1625, Copenhagen, Denmark. Association for Computational Linguistics.

Joseph Fisher, Dave Palfrey, Christos Christodoulopoulos, and Arpit Mittal. 2019. Measuring social bias in knowledge graph embeddings. ArXiv preprint, abs/1912.02761.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030.

Hila Gonen and Yoav Goldberg. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614, Minneapolis, Minnesota. Association for Computational Linguistics.

Xudong Han, Timothy Baldwin, and Trevor Cohn. 2021a. Balancing out bias: Achieving fairness through training reweighting. ArXiv preprint, abs/2109.08253.

Xudong Han, Timothy Baldwin, and Trevor Cohn. 2021b. Decoupling adversarial training for fair NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 471–477, Online. Association for Computational Linguistics.

Xudong Han, Aili Shen, Yitong Li, Lea Frermann, Timothy Baldwin, and Trevor Cohn. 2022. fairlib: A unified framework for assessing and improving classification fairness. ArXiv preprint, abs/2205.01876.

Xiaolei Huang. 2022. Easy adaptation to mitigate gender bias in multilingual text classification. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 717–723, Seattle, United States. Association for Computational Linguistics.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. ArXiv preprint, abs/1612.03651.

Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97.

Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. Towards robust and privacy-preserving text representations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 25–30, Melbourne, Australia. Association for Computational Linguistics.

David J. C. MacKay. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 622–628, Minneapolis, Minnesota. Association for Computational Linguistics.

Nicholas Meade, Elinor Poole-Dayan, and Siva Reddy. 2022. An empirical survey of the effectiveness of debiasing techniques for pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1878–1898, Dublin, Ireland. Association for Computational Linguistics.

Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online. Association for Computational Linguistics.

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics.

Lyle Ramshaw and Robert E. Tarjan. 2012. On minimum-cost assignments in unbalanced bipartite graphs. Technical Report HPL-2012-40R1, HP Labs, Palo Alto, CA, USA.

Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. 2020. Null it out: Guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7237–7256, Online. Association for Computational Linguistics.

Shauli Ravfogel, Michael Twiton, Yoav Goldberg, and Ryan D. Cotterell. 2022. Linear adversarial concept erasure. In International Conference on Machine Learning, pages 18400–18421. PMLR.

Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2022. Under the morphosyntactic lens: A multifaceted evaluation of gender bias in speech translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1807–1824, Dublin, Ireland. Association for Computational Linguistics.

Shun Shao, Yftah Ziser, and Shay B. Cohen. 2023. Gold doesn't always glitter: Spectral removal of linear and nonlinear guarded attribute information. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Aili Shen, Xudong Han, Trevor Cohn, Timothy Baldwin, and Lea Frermann. 2021. Contrastive learning for fair representations. ArXiv preprint, abs/2109.10645.

Gilbert W. Stewart. 1990. Perturbation theory for the singular value decomposition. Technical Report UMIACS-90-120 / CS-TR 2539, University of Maryland, College Park.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net.

Liwen Wang, Yuanmeng Yan, Keqing He, Yanan Wu, and Weiran Xu. 2021. Dynamically disentangling social bias from task-oriented representations with adversarial attack. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3740–3750, Online. Association for Computational Linguistics.

Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018. Learning gender-neutral word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4847–4853, Brussels, Belgium. Association for Computational Linguistics.

Author notes

* Equal contribution.

Action Editor: Jonathan Berant

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.