## Abstract

We present the Assignment-Maximization Spectral Attribute removaL (AMSAL) algorithm, which erases information from neural representations when the information to be erased is implicit rather than directly being aligned to each input example. Our algorithm works by alternating between two steps. In one, it finds an assignment of the input representations to the information to be erased, and in the other, it creates projections of both the input representations and the information to be erased into a joint latent space. We test our algorithm on an extensive array of datasets, including a Twitter dataset with multiple guarded attributes, the BiasBios dataset, and the BiasBench benchmark. The latter benchmark includes four datasets with various types of protected attributes. Our results demonstrate that bias can often be removed in our setup. We also discuss the limitations of our approach when there is a strong entanglement between the main task and the information to be erased.^{1}

## 1 Introduction

Developing a methodology for adjusting neural representations to preserve user privacy and avoid encoding bias in them has been an active area of research in recent years. Previous work shows it is possible to erase undesired information from representations so that downstream classifiers cannot use that information in their decision-making process. This previous work assumes that this sensitive information (or **guarded** attributes, such as gender or race) is available for each input instance. These guarded attributes, however, are sensitive, and obtaining them on a large scale is often challenging and, in some cases, not feasible (Han et al., 2021b). For example, Blodgett et al. (2016) studied the characteristics of African-American English on Twitter, and could not couple the ethnicity attribute directly with the tweets they collected due to the attribute’s sensitivity.

This paper introduces a novel debiasing setting in which the guarded attributes are not paired up with each input instance and an algorithm to remove information from representations in that setting. In our setting, we assume that each neural input representation is coupled with a guarded attribute value, but this assignment is unavailable. In cases where the domain of the guarded attribute is small (for example, with binary attributes), this means that the guarded attribute information consists of **priors** with respect to the whole population and not instance-level information.

The intuition behind our algorithm is that if we were to find a strong correlation between the input variable and a set of guarded grounded attributes either in the form of an unordered list of **records** or as priors, then it is unlikely to be coincidental if the sample size is sufficiently large (§3.5). We implement this intuition by jointly finding projections of the input samples and the guarded attributes into a joint embedding space *and* an alignment between the two sets in that joint space.

Our resulting algorithm (§3), the Alignment-Maximization Spectral Attribute removaL algorithm (AMSAL), is a coordinate-ascent algorithm reminiscent of the hard expectation-maximization algorithm (hard EM; MacKay, 2003). It first loops between two Alignment and Maximization steps, during which it finds an alignment (A) based on existing projections and then projects the representations and guarded attributes into a joint space based on an existing alignment (M). After these two steps are iteratively repeated and an alignment is identified, the algorithm takes another step to erase information from the input representations based on the projections identified. This step closely follows the work of Shao et al. (2023), who use Singular Value Decomposition to remove principal directions of the covariance matrix between the input examples and the guarded attributes. Figure 1 depicts a sketch of our setting and the corresponding algorithm, with **x**_{i} being the input representations and **z**_{j} being the guarded attributes. Our algorithm is modular: While our use of the algorithm of Shao et al. (2023) for the removal step is natural due to the nature of the AM steps, a user can use any such algorithm to erase the information from the input representations (§3.4).

Our contributions are as follows: (1) We propose a new setup for removing guarded information from neural representations where there are few or no labeled guarded attributes; (2) We present a novel two-stage coordinate-ascent algorithm that iteratively improves (a) an alignment between guarded attributes and neural representations; and (b) information removal projections.

Using an array of datasets, we perform extensive experiments to assess how challenging our setup is and whether our algorithm is able to remove information without having aligned guarded attributes (§4). We find in several cases that little information is needed to align between neural representations and their corresponding guarded attributes. The consequence is that it is possible to erase the information such guarded attributes provide from the neural representations while preserving the information needed for the main task decision-making. We also study the limitations of our algorithm by experimenting with a setup where it is hard to distinguish between the guarded attributes and the downstream task labels when aligning the neural representations with the guarded attributes (§4.5).

## 2 Problem Formulation and Notation

For an integer *n* we denote by [*n*] the set {1,…,*n*}. For a vector **v**, we denote by ||**v**||_{2} its *ℓ*_{2} norm. For two vectors **v** and **u**, by default in column form, $\langle \mathbf{v},\mathbf{u}\rangle = \mathbf{v}^{\top}\mathbf{u}$ (dot product). Matrices and vectors are in boldface font (with uppercase or lowercase letters, respectively). Random variable vectors are also denoted by boldface uppercase letters. For a matrix ***A***, we denote by *a*_{ij} the value of cell (*i*,*j*). The Frobenius norm of a matrix ***A*** is $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$. The spectral norm of a matrix is $\|A\|_2 = \max_{\|\mathbf{x}\|_2 = 1} \|A\mathbf{x}\|_2$. The expectation of a random variable **T** is denoted by $\mathbb{E}[\mathbf{T}]$.

In our problem formulation, we assume three random variables: **X**∈ ℝ^{d}, **Y**∈ ℝ, and **Z**∈ ℝ^{d′}, such that *d*′ ≤ *d* and the expectation of all three variables is 0 (see Shao et al., 2023). Samples of **X** are the inputs for a classifier to predict corresponding samples of **Y**. The random vector **Z** represents the guarded attributes. We want to maintain the ability to predict **Y** from **X**, while minimizing the ability to predict **Z** from **X**.

We assume *n* samples of (**X**,**Y**) and *m* samples of **Z**, denoted by (**x**^{(i)},**y**^{(i)}) for *i*∈ [*n*], and **z**^{(i)} for *i*∈ [*m*] (*m* ≤ *n*). While originally, these samples were generated jointly from the underlying distribution *p*(**X**,**Y**,**Z**), we assume a shuffling of the **Z** samples in such a way that we are only left with *m* samples that are unique (no repetitions) and an underlying unknown many-to-one mapping $\pi: [n] \to [m]$ that maps each **x**^{(i)} to its original **z**^{(j)}.

The problem formulation is such that we need to remove the information from the *x*s in such a way that we consider the samples of *z*s as a set. In our case, we do so by iterating between trying to infer *π*, and then using standard techniques to remove the information from *x*s based on their alignment to the corresponding *z*s.

##### Singular Value Decomposition

Let $A=E[XZ\u22a4]$, the matrix of cross-covariance between **X** and **Z**. This means that *A*_{ij} = Cov(X_{i},Z_{j}) for *i*∈ [*d*] and *j*∈ [*d*′].

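For any two vectors **a** ∈ ℝ^{d}, **b** ∈ ℝ^{d′}, the following holds due to the linearity of expectation:

$$\mathbf{a}^{\top} A\, \mathbf{b} = \mathrm{Cov}\left(\mathbf{a}^{\top}\mathbf{X},\ \mathbf{b}^{\top}\mathbf{Z}\right). \tag{1}$$

Singular value decomposition on ***A***, in this case, finds the “principal directions”: directions in which the projection of **X** and **Z** maximize their covariance. The projections are represented as two matrices ***U*** ∈ ℝ^{d×d} and ***V*** ∈ ℝ^{d′×d′}. Each column in these matrices plays the role of the vectors **a** and **b** in Eq. 1. SVD finds ***U*** and ***V*** such that for any *i* ∈ [*d*′] it holds that:

$$\mathrm{Cov}\left(U_i^{\top}\mathbf{X},\ V_i^{\top}\mathbf{Z}\right) = \max_{(\mathbf{a},\mathbf{b}) \in \mathcal{O}_i} \mathrm{Cov}\left(\mathbf{a}^{\top}\mathbf{X},\ \mathbf{b}^{\top}\mathbf{Z}\right), \tag{2}$$

where $\mathcal{O}_i$ is the set of pairs of vectors (**a**,**b**) such that ∥**a**∥_{2} = ∥**b**∥_{2} = 1, **a** is orthogonal to *U*_{1},…,*U*_{i−1} and similarly, **b** is orthogonal to *V*_{1},…,*V*_{i−1}.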

Shao et al. (2023) showed that SVD in this form can be used to debias representations. We calculate SVD between **X** and **Z** and then prune out the principal directions that denote the highest covariance. We will use their method, SAL (Spectral Attribute removaL), in the rest of the paper. See also §3.4.
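The removal idea described above can be illustrated with a minimal NumPy sketch. It assumes aligned, centered samples, and it assumes the cleaned representation is obtained by projecting onto the span of the left singular vectors that remain after dropping the top *k*. The function name `sal_projection` and the toy data are hypothetical and are not taken from Shao et al. (2023).

```python
import numpy as np

def sal_projection(X, Z, k):
    """X: (n, d) inputs, Z: (n, d') aligned guarded attributes, both assumed
    centered. Returns a (d, d) projection that removes the top-k left
    singular directions of the empirical cross-covariance."""
    n = X.shape[0]
    A = X.T @ Z / n                       # empirical E[X Z^T], shape (d, d')
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    U_keep = U[:, k:]                     # directions with the smallest singular values
    return U_keep @ U_keep.T

# Toy data (hypothetical): a binary attribute leaks into one coordinate of X.
rng = np.random.default_rng(0)
n, d = 2000, 6
z = rng.choice([-1.0, 1.0], size=(n, 1))
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * z[:, 0]
X -= X.mean(axis=0)
z -= z.mean(axis=0)

P = sal_projection(X, z, k=1)
X_clean = X @ P
cov_before = np.abs(X.T @ z / n).max()        # largest |covariance| before removal
cov_after = np.abs(X_clean.T @ z / n).max()   # largest |covariance| after removal
```

With a single guarded dimension the cross-covariance has rank one, so removing one direction drives the empirical covariance between the cleaned inputs and `z` to numerical zero.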

## 3 Methodology

We constrain ***U*** and ***V*** to be matrices with orthonormal columns in ℝ^{n×k}.

### 3.1 A-step (Guarded Sample Assignment)

In the A-step, we seek an assignment between the sets {**x**^{(1)},…,**x**^{(n)}} and {**z**^{(1)},…,**z**^{(m)}}. Given ***U*** and ***V*** from the previous M-step, we can find such an assignment by solving the following optimization problem:

$$\max_{p_{ij} \in \{0,1\}} \sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij} \left(U^{\top}\mathbf{x}^{(i)}\right)^{\top} \left(V^{\top}\mathbf{z}^{(j)}\right) \quad \text{s.t.} \quad \forall i: \sum_{j=1}^{m} p_{ij} = 1, \qquad \forall j: b_{0j} \le \sum_{i=1}^{n} p_{ij} \le b_{1j}. \tag{3}$$

This is a solution to an assignment problem (Kuhn, 1955; Ramshaw and Tarjan, 2012), where *p*_{ij} denotes whether **x**^{(i)} is associated with the (type of) guarded attribute **z**^{(j)}. The values (*b*_{0j},*b*_{1j}) determine lower and upper bounds on the number of *x*s a given **z**^{(j)} can be assigned to. While a standard assignment problem can be solved efficiently using the Hungarian method of Kuhn (1955), we choose to use the ILP formulation, as it enables us to have more freedom in adding constraints to the problem, such as the lower and upper bounds.
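To make the bounded assignment problem concrete, here is a small sketch that solves it by exhaustive enumeration rather than with an ILP solver. It is feasible only for toy sizes and is not the solver used in the paper. The score matrix stands in for the joint-space similarities between each **x**^{(i)} and each **z**^{(j)}; the function name and the numbers are hypothetical.

```python
from itertools import product

def a_step_bruteforce(scores, lower, upper):
    """scores[i][j]: similarity of x^(i) and z^(j) in the joint space.
    lower[j], upper[j]: bounds on how many x's may be assigned to z^(j).
    Returns the best assignment pi (a tuple with pi[i] = j) and its value."""
    n, m = len(scores), len(scores[0])
    best_pi, best_val = None, float("-inf")
    for pi in product(range(m), repeat=n):      # every many-to-one map [n] -> [m]
        counts = [pi.count(j) for j in range(m)]
        if any(c < lo or c > hi for c, lo, hi in zip(counts, lower, upper)):
            continue                            # violates the b_0j / b_1j bounds
        val = sum(scores[i][pi[i]] for i in range(n))
        if val > best_val:
            best_pi, best_val = pi, val
    return best_pi, best_val

# Four inputs, two guarded values, each value must receive exactly two inputs.
scores = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.6], [0.2, 0.5]]
pi, value = a_step_bruteforce(scores, lower=[2, 2], upper=[2, 2])
```

Without the bounds, the third input would also go to the first guarded value (0.7 > 0.6); the bounds force it to the second one, which shows how the prior-based constraints shape the assignment.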

### 3.2 M-step (Covariance Maximization)

The result of an A-step is an assignment *π* such that *π*(*i*) = *j* implies **x**^{(i)} was deemed as aligned to **z**^{(j)}. With that *π* in mind, we define the following empirical covariance matrix **Ω**_{π} ∈ ℝ^{d×d′}:

$$\mathbf{\Omega}_{\pi} = \sum_{i=1}^{n} \mathbf{x}^{(i)} \left(\mathbf{z}^{(\pi(i))}\right)^{\top}.$$

We then apply SVD on **Ω**_{π} to get new ***U*** and ***V*** that are used in the next iteration of the algorithm with the A-step, if the algorithm continues to run. When the maximal number of iterations is reached, we follow the work of Shao et al. (2023) in using a truncated part of ***U*** to remove the information from the *x*s. We do that by projecting **x**^{(i)} using the singular vectors of ***U*** with the smallest singular values. These projected vectors co-vary the least with the guarded attributes, assuming the assignment in the last A-step was precise. This method has been shown by Shao et al. (2023) to be highly effective and efficient in debiasing neural representations.
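A minimal NumPy sketch of the M-step follows, under the assumption that the empirical covariance is the unnormalized sum of outer products over the assigned pairs (any constant scaling leaves the singular vectors unchanged). The name `m_step`, the truncation parameter `k`, and the toy data are hypothetical.

```python
import numpy as np

def m_step(X, Z, pi, k):
    """X: (n, d) inputs; Z: (m, d') guarded records; pi[i] = index of the
    record assigned to x^(i). Returns the top-k singular vectors (U, V) and
    the singular values of the empirical cross-covariance Omega_pi."""
    omega = X.T @ Z[pi]                   # sum_i x^(i) z^(pi(i))^T, shape (d, d')
    U, s, Vt = np.linalg.svd(omega, full_matrices=False)
    return U[:, :k], Vt[:k].T, s

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
Z = np.eye(2)                             # two one-hot guarded records
pi = np.array([0, 0, 1, 1, 0, 1, 0, 1])
U, V, s = m_step(X, Z, pi, k=2)
```

The returned `U` and `V` would feed the next A-step, where each score is the inner product of `U.T @ x` and `V.T @ z`.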

### 3.3 A Matrix Formulation of the AM Steps

Let **e**_{1},…,**e**_{m} be the standard basis vectors. This means **e**_{i} is a vector of length *m* with 0 in all coordinates except for the *i*th coordinate, where it is 1.

Let $\mathcal{E}$ be the set of matrices where each $E \in \mathcal{E}$ is such that ***E*** ∈ ℝ^{n×m} and each row is one of **e**_{i}, *i* ∈ [*m*]. In that case, $EZ^{\top}$ is an *n* × *d*′ matrix, such that the *j*th row is a copy of the *i*th column of ***Z*** ∈ ℝ^{d′×m} whenever the *j*th row of ***E*** is **e**_{i}. Therefore, the AM steps can be viewed as coordinate ascent on a maximization problem over $E \in \mathcal{E}$, ***U***, ***V***, and **Σ**, where ***U***, ***V*** are orthonormal matrices, and **Σ** is a diagonal matrix with non-negative elements. This corresponds to the SVD of the matrix $XEZ^{\top}$.

In that case, the matrix ***E*** can be directly mapped to an assignment in the form of *π*, where *π*(*i*) would be the *j* such that the *j*th coordinate in the *i*th row of ***E*** is non-zero.
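The correspondence between the assignment *π* and the matrix ***E*** can be checked numerically. The sketch below assumes inputs and guarded records are stored as columns (***X*** of shape *d* × *n*, ***Z*** of shape *d*′ × *m*); the dimensions and the random data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_prime, n, m = 4, 3, 6, 2
X = rng.normal(size=(d, n))               # columns are the inputs x^(i)
Z = rng.normal(size=(d_prime, m))         # columns are the guarded records z^(j)
pi = np.array([0, 1, 1, 0, 0, 1])         # pi[i] = j

E = np.eye(m)[pi]                         # row i of E is the basis vector e_{pi(i)}
omega_matrix = X @ E @ Z.T                # matrix form, shape (d, d')
omega_sum = sum(np.outer(X[:, i], Z[:, pi[i]]) for i in range(n))

pi_recovered = E.argmax(axis=1)           # non-zero coordinate of each row
```

Both constructions give the same *d* × *d*′ matrix, and reading off the non-zero coordinate of each row of `E` recovers `pi`.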

### 3.4 Removal Algorithm

The AM steps are best suited for the removal of information through SVD with an algorithm such as SAL. This is because the AM steps optimize an objective of the same type as that of SAL, relying on the projections ***U*** and ***V*** to project the inputs and guarded representations into a joint space. However, a by-product of the algorithm in Figure 2 is an assignment function *π* that aligns between the inputs and the guarded representations.

### 3.5 Justification of the AM Steps

We next provide a justification of our algorithm (which may be skipped on a first reading). Our justification is based on the observation that if indeed **X** and **Z** are linked together (this connection is formalized as a latent variable in their joint distribution), then for a given sample that is permuted, the singular values of **Ω** will be larger the closer the permutation is to the identity permutation. This justifies finding such a permutation that maximizes the singular values in an SVD of **Ω**.

##### More Details

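Let *ι* be the identity permutation, *ι*(*i*) = *i*. We will assume the case in which *n* = *m* (but the justification can be generalized to the case *m* < *n*), and that the underlying joint distribution *p*(**X**,**Z**) is mediated by a latent variable **H**, such that

$$p(\mathbf{X},\mathbf{Z},\mathbf{H}) = p(\mathbf{H})\, p(\mathbf{X} \mid \mathbf{H})\, p(\mathbf{Z} \mid \mathbf{H}). \tag{5}$$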

This implies there is a latent variable that connects **X** and **Z**, and that the joint distribution *p*(**X**,**Z**) is a mixture through **H**.

*Let* {(**x**^{(i)},**z**^{(i)})} *be a sample of size* *n* *from the distribution in Eq. 5. Let* *π* *be a permutation over* [*n*] *uniformly sampled from the set of permutations. Then with high likelihood, the sum of the singular values of* **Ω**_{π} *is smaller than the sum of singular values under* **Ω**_{ι}.

For full details of this claim, see Appendix A.
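The claim can be illustrated empirically with a small simulation. This is a sketch under strong assumptions and not the argument of Appendix A: the latent variable is Gaussian, and **X** and **Z** are noisy linear functions of it. All dimensions, noise levels, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_prime, h = 2000, 10, 4, 3

H = rng.normal(size=(n, h))                                  # latent variable
X = H @ rng.normal(size=(h, d)) + 0.1 * rng.normal(size=(n, d))
Z = H @ rng.normal(size=(h, d_prime)) + 0.1 * rng.normal(size=(n, d_prime))

def singular_value_sum(X, Z, pi):
    """Sum of singular values of the empirical cross-covariance under pi."""
    omega = X.T @ Z[pi] / len(pi)
    return np.linalg.svd(omega, compute_uv=False).sum()

identity = np.arange(n)
aligned = singular_value_sum(X, Z, identity)
shuffled = [singular_value_sum(X, Z, rng.permutation(n)) for _ in range(20)]
```

In this simulation the aligned sample yields a larger sum of singular values than every one of the uniformly sampled permutations, which is the behavior the AM steps exploit.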

## 4 Experiments

In our experiments, we test several combinations of algorithms. As a baseline for the assignment of *x*s to *z*s, we use *k*-means (KMeans) in place of the AM steps. In addition, for the removal step (once an assignment has been identified), we test two algorithms: SAL (Shao et al., 2023; resulting in AMSAL) and INLP (Ravfogel et al., 2020). We also compare these two algorithms in *oracle* mode (in which the assignment of guarded attributes to inputs is known), to see the loss in performance that happens due to noisy assignments from the AM or *k*-means algorithm (OracleSAL and OracleINLP).

When running the AM algorithm or *k*-means, we execute it with three random seeds (see also §4.6) for a maximum of a hundred iterations and choose the projection matrix with the largest objective value over all seeds and iterations. For the slack variables (*b*_{0j} and *b*_{1j} variables in Eq. 3), we use 20%–30% above and below the baseline of the guarded attribute priors according to the training set. With the SAL methods, we remove the number of directions according to the rank of the **Ω** matrix (between 2 and 6 across all experiments).
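One plausible way to turn the priors into the bounds *b*_{0j} and *b*_{1j} is sketched below. It assumes the slack is applied multiplicatively to the expected count of each guarded value and that the bounds are rounded outward; the function name and the rounding choice are assumptions and are not specified in the text.

```python
import math

def assignment_bounds(priors, n, slack=0.2):
    """priors[j]: fraction of training examples with guarded value j.
    Returns (lower, upper) counts allowed for each guarded value."""
    lower = [math.floor((1 - slack) * p * n) for p in priors]
    upper = [math.ceil((1 + slack) * p * n) for p in priors]
    return lower, upper

priors = [0.5, 0.3, 0.2]
lower, upper = assignment_bounds(priors, n=1000, slack=0.2)
```

Rounding outward keeps the problem feasible: the lower bounds never sum to more than *n* and the upper bounds never sum to less.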

In addition, we experiment with a partially supervised assignment process, in which a small seed dataset of aligned *x*s and *z*s is provided to the AM steps. We use it for model selection: Rather than choosing the assignment with the highest SVD objective value, we choose the assignment with the highest accuracy on this seed dataset. We refer to this setting as Partial (for “partially supervised assignment”).

Finally, in the case of a gender-protected attribute, we compare our results against a baseline in which the input **x** is compared against a list of words stereotypically associated with the genders of male or female.^{2} Based on the overlap with these two lists, we heuristically assign the gender label to **x** and then run SAL or INLP (rather than using the AM algorithm). While this wordlist heuristic is plausible in the case of gender, it is not as easy to derive in the case of other protected attributes, such as age or race. We give the results for this baseline using the marker WL in the corresponding tables.

##### Main Findings

Our main finding is that our novel setting, in which guarded information is erased from individually unaligned representations, is viable. We discovered that AM methods perform particularly well when dealing with more complex bias removal scenarios, such as when multiple guarded attributes are present. We also found that having similar priors for the guarded attributes and downstream task labels may lead to poor performance on the task at hand. In these cases, using a small amount of supervision often effectively helps reduce bias while maintaining the utility of the representations for the main classification or regression problem. Finally, our analysis of alignment stability shows that our AM algorithm often converges to suitable solutions that align **X** with **Z**.

Due to the unsupervised nature of our problem setting, we advise validating the utility of our method in the following way. Once we run the AM algorithm, we check whether there is a high-accuracy alignment between **X** and **Y** (rather than **Z**, which is unavailable). If this alignment is accurate, then we run the risk of significantly damaging task performance. An example is given in §4.5.
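The check described above can be sketched as follows. Since assignment ids carry no inherent meaning, this version scores the assignment against the task labels under the best relabeling of the ids; maximizing over relabelings is an assumption about how the accuracy is computed, and the function name and toy data are hypothetical. The enumeration is practical only for a handful of label values.

```python
from itertools import permutations

def best_match_accuracy(assignment, labels):
    """Accuracy of a cluster-style assignment against labels, maximized over
    all relabelings of the assigned ids. Assumes there are at least as many
    distinct label values as distinct assigned ids."""
    ids = sorted(set(assignment))
    values = sorted(set(labels))
    best = 0.0
    for perm in permutations(values, len(ids)):
        mapping = dict(zip(ids, perm))
        acc = sum(mapping[a] == l for a, l in zip(assignment, labels)) / len(labels)
        best = max(best, acc)
    return best

assignment = [0, 0, 1, 1, 1, 0]           # output of the AM steps (toy)
y = [1, 1, 0, 0, 1, 1]                    # main-task labels
acc_y = best_match_accuracy(assignment, y)
```

A high value of `acc_y` would be the warning sign: the assignment tracks the task labels, so removing it risks damaging task performance.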

### 4.1 Word Embedding Debiasing

As a preliminary assessment of our setup and algorithms, we apply our methods to GloVe word embeddings to remove gender bias, and follow the previous experiment settings of this problem (Bolukbasi et al., 2016; Ravfogel et al., 2020; Shao et al., 2023). We considered only the 150,000 most common words to ensure the embedding quality and omitted the rest. We sort the remaining embeddings by their projection on the $\overrightarrow{he} - \overrightarrow{she}$ direction. Then we consider the top 7,500 word embeddings as male-associated words (*z* = 1) and the bottom 7,500 as female-associated words (*z* = −1).
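The dataset construction can be sketched in a few lines. The two-dimensional toy vectors below are hypothetical stand-ins for GloVe embeddings, and `k` plays the role of the 7,500 cutoff.

```python
def gender_split(embeddings, he, she, k):
    """embeddings: dict word -> vector (list of floats). Returns the k words
    with the largest and the k with the smallest projection on he - she."""
    direction = [a - b for a, b in zip(he, she)]

    def projection(vec):
        return sum(v * g for v, g in zip(vec, direction))

    ranked = sorted(embeddings, key=lambda w: projection(embeddings[w]), reverse=True)
    male = {w: 1 for w in ranked[:k]}       # z = 1
    female = {w: -1 for w in ranked[-k:]}   # z = -1
    return male, female

toy = {"king": [0.9, 0.1], "queen": [0.1, 0.9], "table": [0.5, 0.5],
       "uncle": [0.8, 0.3], "aunt": [0.2, 0.7]}
male, female = gender_split(toy, he=[1.0, 0.0], she=[0.0, 1.0], k=2)
```

Because the labels are defined by a threshold on a single direction, the two groups are linearly separable by construction, which is consistent with the perfect alignment reported for this dataset.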

Our findings are that both the *k*-means and the AM algorithms perfectly identify the alignment between the word embeddings and their associated gender label (100%). Indeed, the dataset construction itself follows a natural perfect clustering that these algorithms easily discover. Since the alignments are perfectly identified, the results of predicting the gender from the word embeddings after removal are identical to the oracle case. These results are quite close to the results of a random guess, and we refer the reader to Shao et al. (2023) for details on experiments with SAL and INLP for this dataset. Considering Figure 3, it is evident that our algorithm essentially follows a natural clustering of the word embeddings into two clusters, female and male, as the embeddings are highly separable in this case. This is why the alignment score of **X** (embedding) to **Z** (gender) is perfect in this case. This finding indicates that this standard word embedding dataset used for debiasing is *trivial to debias*—debiasing can be done even without knowing the identity of the stereotypical gender associated with each word.

### 4.2 BiasBios Results

De-Arteaga et al. (2019) presented the BiasBios dataset, which consists of self-provided biographies paired with the profession and gender of their authors. A list of pronouns and names is used to obtain the authors’ gender automatically. They aim to expose the caveats of automated hiring systems by showing that even the simple task of predicting a candidate’s profession can be affected by the candidate’s gender, which is encoded in the biography representation. For example, we want to avoid one being identified as “he” or “she” in their biography, affecting the likelihood of them being classified as engineers or teachers.

We follow the setup of De-Arteaga et al. (2019), predicting a candidate’s profession (**y**), based on a self-provided short biography (**x**), aiming to remove any information about the candidate’s gender (**z**). Due to computational constraints, we use only a random sample of 30K examples to learn the projections with both SAL and INLP (whether in the unaligned or aligned setting). For the classification problem, we use the full dataset. To obtain vector representations for the biographies, we use two different encoders, FastText word embeddings (Joulin et al., 2016), and BERT (Devlin et al., 2019). We stack a multi-class classifier on top of these representations, as there are 28 different professions. We use 20% of the training examples for the Partial setting. For BERT, we followed De-Arteaga et al. (2019) in using the last CLS token state as the representation of the whole biography. We used the BERT model bert-base-uncased.

##### Evaluation Measures

We use an extension of the True Positive Rate (TPR) gap, the root mean square (RMS) TPR gap of all classes, for evaluating bias in a multiclass setting. This metric was suggested by De-Arteaga et al. (2019), who demonstrated it is significantly correlated with gender imbalances, which often lead to unfair classification. The higher the metric value is, the bigger the gap between the two categories (for example, between male and female) for the specific main task prediction. For the profession classification, we report accuracy.
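As an illustration of the metric, the sketch below computes the RMS of the per-class TPR gaps for a binary guarded attribute. Skipping classes that are missing from one group is an assumption made here for robustness, and the toy labels are hypothetical.

```python
import math

def rms_tpr_gap(y_true, y_pred, z):
    """RMS over classes of the TPR difference between two groups (z in {0, 1})."""
    gaps = []
    for c in sorted(set(y_true)):
        tpr = []
        for g in (0, 1):
            idx = [i for i, (y, a) in enumerate(zip(y_true, z)) if y == c and a == g]
            if not idx:
                break                      # class absent for one group: skip it
            tpr.append(sum(y_pred[i] == c for i in idx) / len(idx))
        if len(tpr) == 2:
            gaps.append(tpr[0] - tpr[1])
    return math.sqrt(sum(g * g for g in gaps) / len(gaps))

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
z      = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0, 1, 1]
gap = rms_tpr_gap(y_true, y_pred, z)
```

Here class 0 has TPRs of 1.0 and 0.5 for the two groups, and class 1 has 0.5 and 1.0, so both per-class gaps have magnitude 0.5 and the RMS is 0.5.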

##### Results

Table 1 provides the results for the biography dataset. We see that INLP significantly reduces the TPR-GAP in all settings, but this comes at a cost: The representations are significantly less useful for the main task of predicting the profession. When inspecting the alignments, we observe that their accuracy is quite high with BERT: 100% with *k*-means, 85% with the AM algorithm, and 99% with Partial AM. For FastText, the results are lower, hovering around 55% for all three methods. The high BERT assignment performance indicates that the BiasBios BERT representations are naturally separated by gender. We also observe that the results of WL+SAL and WL+INLP are correspondingly identical to Oracle+SAL and Oracle+INLP. This comes as no surprise, as the gender label is derived from a similar word list, which enables the WL approach to get a nearly perfect alignment (over 96% agreement with the gender label).

### 4.3 BiasBench Results

Meade et al. (2022) conducted an empirical study of an array of datasets in the context of debiasing. They analyzed different methods and tasks, and we follow their benchmark evaluation to assess our AMSAL algorithm and other methods in the context of our new setting. We include a short description of the datasets we use in this section. We include full results in Appendix B, with a description of other datasets. We also encourage the reader to refer to Meade et al. (2022) for details on this benchmark. We use 20% of the training examples for the Partial setting.

##### StereoSet (Nadeem et al., 2021)

This dataset presents a word completion test for a language model, where the completion can be stereotypical or non-stereotypical. The bias is then measured by calculating how often a model prefers the stereotypical completion over the non-stereotypical one. Nadeem et al. (2021) introduced the language model score to measure the language model usability, which is the percentage of examples for which a model prefers the stereotypical or non-stereotypical word over some unrelated word.

##### CrowS-Pairs (Nangia et al., 2020)

This dataset includes pairs of sentences that are minimally different at the token level, but these differences lead to the sentence being either stereotypical or anti-stereotypical. The assessment measures how many times a language model prefers the stereotypical element in a pair over the anti-stereotypical element.

##### Results

We start with an assessment of the BERT model for the CrowS-Pairs gender, race, and religion bias evaluation (Table 2). We observe that all approaches for gender, except AM+INLP, reduce the stereotype score. Race and religion are more difficult to debias in the case of BERT. INLP with *k*-means works best when no seed alignment data is provided at all, but when we consider PartialSAL, in which we use the alignment algorithm with some seed aligned data, we see that the results are the strongest. When we consider the RoBERTa model, the results are similar, with PartialSAL significantly reducing the bias. Our findings from Table 2 overall indicate that the ability to debias a representation *highly depends on the model that generates the representation*. In Table 10, we observe that the representations, on average, *are not damaged for most GLUE tasks*; additional analysis is included in the full-version appendix.

As Meade et al. (2022) have noted, when changing the representations of a language model to remove bias, we might cause such adjustments that damage the usability of the language model. To test which methods possibly cause such an issue, we also assess the language model score on the StereoSet dataset in Table 3. We overall see that often SAL-based methods give a lower stereotype score, while INLP methods more significantly damage the language model score. This implies that the *SAL-based methods remove bias effectively while less significantly harming the usability of the language model representations*.

We also obtained comprehensive results for other datasets (SEAT and GLUE) and categories of bias (based on race and religion). The results, especially for GLUE, demonstrate the effectiveness of our method of unaligned information removal. For GLUE, we consistently retain the baseline task performance almost in full. See Appendix B.

### 4.4 Multiple-Guarded Attribute Sentiment

We hypothesize that AM-based methods are better suited for setups where multiple guarded attributes should be removed, as they allow us to target several guarded attributes with different priors. To examine our hypothesis, we experiment with a dataset curated from Twitter (tweets encoded using BERT, bert-base-uncased), in which users are surveyed for their age and gender (Cachola et al., 2018). We bucket the age into three groups (0–25, 26–50, and above 50). Tweets in this dataset are annotated with their sentiment, ranging from 1 (very negative) to 5 (very positive). The dataset consists of more than 6,400 tweets written by more than 1,700 users. We removed users who no longer have public Twitter accounts and users with locations that do not exist based on a filter,^{3} resulting in a dataset with over 3,000 tweets, written by 817 unique users. As tweets are short by nature and their number is relatively small, the debiasing signal in this dataset (the amount of information it contains about the guarded attributes) might not be sufficient for the attribute removal. To amplify this signal, we concatenated each tweet in the dataset to at most ten other tweets from the same user.

For the guarded attribute **z**, we use the combination of both age and gender as a binary one-hot vector. This dataset presents a use-case for our algorithm of a composed protected attribute. Rather than using a classifier for predicting the sentiment, we use linear regression. Following Cachola et al. (2018), we use Mean Absolute Error (MAE) to report the error of the sentiment predictions. Given that the sentiment is predicted as a continuous value, we cannot use the TPR gap as in previous sections. Rather, we use the following formula:

$$\mathrm{std}\left(\mathrm{MAD}_{z=j}\right), \qquad \mathrm{MAD}_{z=j} = \frac{1}{\ell} \sum_{i} \eta_{ij},$$

where *i* ranges over the set of size *ℓ* of examples with protected attribute value *j*, *μ*_{j} is the average of absolute **Y** prediction error for that set and *η*_{ij} is the absolute difference between *μ*_{j} and the absolute error of example *i*.^{4} The function std in this case indicates the standard deviation of the *m* values of MAD_{z=j}, *j* ∈ [*m*].
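The gap measure described above can be sketched as follows. The sketch assumes that MAD for a guarded value is the mean of the *η*_{ij} values within that group and that std is the population standard deviation; both are readings of the description rather than confirmed details. The function name and toy data are hypothetical.

```python
import statistics

def mad_gap(y_true, y_pred, z):
    """Std over guarded values j of the mean absolute deviation of the
    absolute prediction errors within group j."""
    mads = []
    for j in sorted(set(z)):
        errors = [abs(t - p) for t, p, a in zip(y_true, y_pred, z) if a == j]
        mu_j = sum(errors) / len(errors)              # mean absolute error in group j
        eta = [abs(mu_j - e) for e in errors]         # eta_ij
        mads.append(sum(eta) / len(eta))              # MAD_{z=j}
    return statistics.pstdev(mads)

y_true = [3, 3, 3, 3]
y_pred = [3, 5, 2, 4]
z      = [0, 0, 1, 1]
gap = mad_gap(y_true, y_pred, z)
```

In the toy example the first group has errors 0 and 2 (MAD of 1) and the second has errors 1 and 1 (MAD of 0), so the gap is the standard deviation of {1, 0}, namely 0.5.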

##### Results

Table 4 presents our results. Overall, AMSAL reduces the gender and age gap in the predictions without increasing the MAE by much. In addition, we can see that both AM-based methods outperform their *k*-means counterparts, which either increase unfairness (Kmeans + INLP) or significantly harm the downstream-task performance (Kmeans + SAL). We also consider Figure 4, which shows how the quality of the assignments of the AM algorithm changes as a function of the labeled data used. As expected, the more labeled data we have, the more accurate the assignments are, but the differences are not very large.

### 4.5 An Example of Our Method Limitations

We now present the main limitation in our approach and setting. This limitation arises when the random variables **Y** and **Z** are not easily distinguishable through information about **X**.

We experiment with a binary sentiment analysis (**y**) task, predicted on users’ tweets (**x**), aiming to remove information regarding the authors’ ethnic affiliations. To do so, we use a dataset collected by Blodgett et al. (2016), which examined the differences between African-American English speakers and Standard American English speakers. As information about one’s ethnicity is hard to obtain, the user’s geolocation information was used to create a distantly supervised mapping between authors and their ethnic affiliations. We follow previous work (Shao et al., 2023; Ravfogel et al., 2020) and use the DeepMoji encoder (Felbo et al., 2017) to obtain representations for the tweets. The train and test sets are balanced regarding sentiment and authors’ ethnicity. We use 20% of the examples for the Partial setting. Table 5 gives the results for this dataset. We observe that the removal with the assignment (*k*-means, AM, or Partial) significantly harms the performance on the main task and reduces it to a random guess.

This presents a limitation of our algorithm. A priori, there is no distinction between **Y** and **Z**, as our method is unsupervised. In addition, the positive labels of **Y** and **Z** have the same prior probability. Indeed, when we check the assignment accuracy in the sentiment dataset, we observe that the *k*-means, AM, and Partial AM assignment accuracy for identifying **Z** are between 0.55 and 0.59. If we check the assignment against **Y**, we get an accuracy between 0.74 and 0.76. This means that all assignment algorithms actually identify **Y** rather than **Z** (both **Y** and **Z** are binary variables in this case). The conclusion from this is that our algorithm works best when sufficient information on **Z** is presented such that it can provide a basis for aligning samples of **Z** with samples of **X**. If such information is unavailable, or cannot be told apart from information regarding **Y**, we may simply identify the natural clustering of **X** according to their main task classes, leading to low main-task performance.

In Table 5, we observe that this behavior is significantly mitigated when the priors over the sentiment and the race are different (0.8 for sentiment and 0.5 for race). In that case, the AM algorithm is able to distinguish between the race-protected attribute (**z**) and the sentiment class (**y**) quite consistently with INLP and SAL, and the gap is reduced.

We also observe that INLP changed neither the accuracy nor the TPR-GAP for the balanced scenario (Table 5) when using a *k*-means assignment or an AM assignment. Upon inspection, we found out that INLP returns an identity projection in these cases, unable to amplify the relatively weak signal in the assignment to change the representations.

### 4.6 Stability Analysis of the Alignment

In Figure 5, we plot the accuracy of the alignment algorithm (knowing the true value of the guarded attribute per input) throughout the execution of the AM steps for the first ten iterations. The shaded area indicates one standard deviation. We observe that the first few iterations are the ones in which the accuracy improves the most. For most of the datasets, the accuracy does not decrease between iterations, though in the case of DeepMoji we do observe a “bump.” This is indeed why the Partial setting of our algorithm, where a small amount of guarded information is available to determine at which iteration to stop the AM algorithm, is important. In the word embeddings case, the variance is larger because, in certain executions, the algorithm converged quickly, while in others, it took more iterations to converge to high accuracy.

Figure 6 plots the relative change of the objective value of the ILP from §3.1 against iteration number. The relative change is defined as the ratio between the objective value before the algorithm begins and the same value at a given iteration. We see that there is a relative stability of the algorithm and that the AM steps converge quite quickly. We also observe the DeepMoji dataset has a large increase in the objective value in the first iteration (around × 5 compared to the value the algorithm starts with), after which it remains stable.

## 5 Related Work

There has been an increasing amount of work on detecting and erasing undesired or protected information from neural representations, with standard software packages for this process having been developed (Han et al., 2022). For example, in their seminal work, Bolukbasi et al. (2016) showed that word embeddings exhibit gender stereotypes. To mitigate this issue, they projected the word embeddings to a neutral space with respect to a “he-she” direction. Influenced by this work, Zhao et al. (2018) proposed a customized training scheme to reduce the gender bias in word embeddings. Gonen and Goldberg (2019) examined the effectiveness of the methods mentioned above and concluded they remove bias in a shallow way. For example, they demonstrated that classifiers can accurately predict the gender associated with a word when fed with the embeddings of both debiasing methods.

Another related strand of work uses adversarial learning (Ganin et al., 2016), where an additional objective function is added for balancing undesired-information removal and the main task (Edwards and Storkey, 2016; Li et al., 2018; Coavoux et al., 2018; Wang et al., 2021). Elazar and Goldberg (2018) have also demonstrated that an ad-hoc classifier can easily recover the removed information from adversarially trained representations. Since then, methods for information erasure such as INLP and its generalization (Ravfogel et al., 2020, 2022), SAL (Shao et al., 2023), and methods based on similarity measures between neural representations (Colombo et al., 2022) have been developed. With a similar motivation to ours, Han et al. (2021b) aimed to ease the burden of obtaining guarded attributes at a large scale by decoupling the adversarial information removal process from the main task training. They did not, however, experiment with debiasing representations where no guarded attribute alignments are available. Shao et al. (2023) experimented with the removal of features in a scenario in which a low number of protected attributes is available.

Additional previous work showed that methods based on causal inference (Feder et al., 2021), train-set balancing (Han et al., 2021a), and contrastive learning (Shen et al., 2021; Chi et al., 2022) effectively reduce bias and increase fairness. In addition, there is a large body of work on detecting bias, evaluating it (Dev et al., 2021), and studying its implications in specific NLP applications. Savoldi et al. (2022) detected a gender bias in speech translation systems for gendered languages. Gender bias is also discussed in the context of knowledge base embeddings by Fisher et al. (2019) and Du et al. (2022), and in the context of multilingual text classification (Huang, 2022).

## 6 Conclusions and Future Work

We presented a new and challenging setup for removing information, with minimal or no available sensitive information alignment. This setup is crucial for the wide applicability of debiasing methods, as for most applications, obtaining such sensitive labels on a large scale is challenging. To ease this problem, we presented a method to erase information from neural representations when the guarded attribute information does not accompany each input instance. Our main algorithm, AMSAL, alternates between two steps (Assignment and Maximization) to identify an assignment between the input instances and the guarded information records. It then removes the information by minimizing the covariance between the input instances and the aligned guarded attributes. Our approach is modular, and other erasure algorithms, such as INLP, can be used with it. Experiments show that we can reduce the unwanted bias in many cases while keeping the representations highly useful. Future work might include extending our technique to the kernelized case, analogously to the method of Shao et al. (2023).
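The final erasure step, applied once an assignment has been fixed, can be illustrated with a minimal NumPy sketch. It assumes that the covariance is reduced in the spirit of SAL, by projecting the inputs away from the leading left singular vectors of the empirical cross-covariance between the inputs and the aligned guarded attributes. The function name `remove_covariance_directions`, the parameter `k`, and the synthetic data are illustrative assumptions and are not taken from the released code.

```python
import numpy as np

def remove_covariance_directions(X, Z, k):
    """Project X away from the top-k left singular vectors of the empirical
    cross-covariance between X and Z (hypothetical helper, SAL-style)."""
    Xc = X - X.mean(axis=0)            # center the inputs
    Zc = Z - Z.mean(axis=0)            # center the aligned guarded attributes
    omega = Xc.T @ Zc / len(X)         # d x d' cross-covariance matrix
    U, _, _ = np.linalg.svd(omega, full_matrices=False)
    U_k = U[:, :k]                     # directions that co-vary most with Z
    P = np.eye(X.shape[1]) - U_k @ U_k.T
    return X @ P, P

def cross_cov_norm(X, Z):
    """Frobenius norm of the empirical cross-covariance of X and Z."""
    return np.linalg.norm((X - X.mean(0)).T @ (Z - Z.mean(0)) / len(X))

# Synthetic data (assumption): a binary guarded attribute leaks into X[:, 0].
rng = np.random.default_rng(0)
n, d = 1000, 8
z = rng.integers(0, 2, size=n)
Z = np.eye(2)[z]                       # one-hot encoding, shape (n, 2)
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * z

# A centered one-hot binary attribute has rank one, so k=1 suffices here.
X_clean, P = remove_covariance_directions(X, Z, k=1)
before = cross_cov_norm(X, Z)
after = cross_cov_norm(X_clean, Z)
print(f"cross-covariance norm before: {before:.3f}, after: {after:.2e}")
```

In the unaligned setting, `Z` would be the output of the AM steps, so the quality of the erasure depends on the quality of that assignment.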

## Ethical Considerations

The AM algorithm could potentially be misused: rather than using the AM steps to erase information, one could use them to link records of two different types, undermining the privacy of the record holders. Such a situation may merit additional concern because the links returned between the guarded attributes and the input instances will likely contain mistakes. The links are unreliable for decision-making at the *individual level*. Instead, they should be used in aggregate, as a statistical construct to erase information from the input representations. Finally,^{5} we note that automating the debiasing process without statistically confirming its accuracy on a correct sample may promote a false sense of security that a given system is making fair decisions. We do not recommend using our method for debiasing without proper statistical control and empirical verification of correctness.

## Acknowledgments

We thank the reviewers, the action editors and Marcio Fonseca for their thorough feedback. We also thank Daniel Preoţiuc-Pietro for his help with the Twitter data. We thank Kousha Etessami for being a sounding board for certain parts of the paper. The experiments in this paper were supported by compute grants from the Edinburgh Parallel Computing Center and from the Baskerville Tier 2 HPC service (University of Birmingham).

## A Justification of the AM Algorithm: Further Details

[…] *k* ≤ *n* elements such that *π*(*i*) = *i* for all *i* in this set of elements is bounded from above by:^{6}

[…]

[…] **X** and **Z** is bounded in absolute value by a constant *B* > 0. Let {(**x**^{(i)}, **z**^{(i)}, **h**^{(i)})} be a random sample of size *n* from the joint distribution *p*(**X**, **Z**, **H**). Given a permutation $\pi\colon [n] \to [n]$, define *I*(*π*) = {*i* ∣ *π*(*i*) = *i*}. For a given set *M* ⊆ [*n*], define

[…]

For a matrix **A** ∈ ℝ^{d×d′}, let *σ*_{j}(**A**) be its *j*th largest singular value, and let $\sigma^{+}(\mathbf{A}) = \sum_{j} \sigma_{j}(\mathbf{A})$. Let $\sigma^{+} = \sigma^{+}(E[\boldsymbol{\Omega}_{\iota}])$.

We first note that for any permutation *π*, it holds that $E[\boldsymbol{\Omega}_{\pi \mid K}] = 0$ where we define *K* = [*n*] ∖ *I*(*π*).

**Lemma 1.** *For any* *t* > 0, *it holds that:*

[…]

*is smaller than* $2dd'\exp\left(-\frac{t^{2}}{|I(\pi)|B^{2}}\right)$.

*Proof.* *By Hoeffding’s inequality, for any* *i* ∈ [*d*], *j* ∈ [*d*′], *it holds that the probability that for* |*I*(*π*)| *i.i.d. random variables* **X**^{k}, **Z**^{k} *the following is true:*

[…]

*is smaller than* $2\exp\left(-\frac{t^{2}}{|I(\pi)|B^{2}}\right)$. *Therefore, by a union bound on each element of the matrix* **Ω**_{π}, *we get the upper bound on Eq. 7.*

**Lemma 2.** *For any* *t* > 0, *it holds that:*

[…]

*is smaller than* 2|*K*|*dd*′*B*.

*Proof.* *Since* *X*_{i} *and* *Z*_{j} *are bounded as a product in absolute value by* *B*, *and the dimensions of* **Ω**_{π∣K} *is* *d* × *d*′, *each cell being a sum of* |*K*| *values, the bound naturally follows.*

Let *n* be such that *nσ*^{+} > 2*kdd*′*B*, where *k* = |*K*|. Then from Lemma 2, $\|\boldsymbol{\Omega}_{\pi \mid K} - E[\boldsymbol{\Omega}_{\pi \mid K}]\|_{2} < n\sigma^{+}$. Consider the event *σ*^{+}(**Ω**_{ι}) < *σ*^{+}(**Ω**_{π}). Its probability is bounded from above by the probability of the event *σ*^{+}(**Ω**_{ι}) ≤ *nσ*^{+} OR *σ*^{+}(**Ω**_{π}) ≥ *nσ*^{+} (for any *n* as above). Due to the inequality of Weyl (Theorem 1 in Stewart 1990; see below), the fact that **Ω**_{π} = **Ω**_{π∣K} + **Ω**_{π∣I(π)}, Lemma 1, and the fact that *n* − *k* ≤ *n*, the probability of this OR event is bounded from above by $4dd'\exp\left(-\frac{(n-k)(\sigma^{+})^{2}}{(dd'B)^{2}}\right)$.

The conclusion from this is that if we were to sample uniformly a permutation *π* from the set of permutations over [*n*], then with quite high likelihood (because the fraction of elements that are preserved under *π* becomes smaller as *n* becomes larger), the sum of the singular values of **Ω**_{π} under this permutation will be smaller than the sum of the singular values of **Ω**_{ι}—meaning, when the *x*s and the *z*s are correctly aligned. This justifies our objective of aligning the *x*s and the *z*s with an objective that maximizes the singular values, following Proposition 1.
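This conclusion can be checked numerically with a small sketch. The synthetic data, the linear dependence of each **x** on its own **z**, and the helper name `sum_singular_values` are assumptions made for illustration only. The sketch also assumes that the matrix for a pairing is the sum over examples of the outer product of each **x** with the **z** paired to it.

```python
import numpy as np

def sum_singular_values(X, Z):
    """Sum of singular values of sum_i x_i z_i^T for row-paired X and Z
    (hypothetical helper)."""
    return np.linalg.svd(X.T @ Z, compute_uv=False).sum()

# Synthetic data (assumption): every x depends linearly on its own z.
rng = np.random.default_rng(0)
n, d, d_prime = 500, 6, 3
Z = rng.normal(size=(n, d_prime))
W = rng.normal(size=(d_prime, d))
X = Z @ W + 0.5 * rng.normal(size=(n, d))

aligned = sum_singular_values(X, Z)             # identity pairing
perm = rng.permutation(n)                       # uniformly sampled pairing
shuffled = sum_singular_values(X, Z[perm])      # x_i paired with z_perm[i]

fixed_points = int(np.sum(perm == np.arange(n)))   # size of I(pi)
print(f"aligned: {aligned:.1f}, permuted: {shuffled:.1f}, "
      f"fixed points: {fixed_points}")
```

On data like this, the correctly aligned pairing is expected to give a much larger sum of singular values than the random pairing, which preserves only a handful of fixed points.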

##### Inequality of Weyl (1912)

As mentioned by Stewart (1990), the following holds:

*Let* *A* *and* *E* *be two matrices, and let* $\tilde{A} = A + E$. *Let* *σ*_{i} *be the* *i*th *singular value of* *A* *and* $\tilde{\sigma}_{i}$ *be the* *i*th *singular value of* $\tilde{A}$. *Then* $|\sigma_{i} - \tilde{\sigma}_{i}| \leq \|E\|_{2}$.
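The inequality is easy to verify on a random instance. The following sketch uses arbitrary matrix sizes and an arbitrary perturbation scale, both chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 4))
E = 0.1 * rng.normal(size=(5, 4))                    # perturbation matrix

# np.linalg.svd returns singular values sorted in descending order,
# so the i-th entries of the two arrays correspond to each other.
sigma = np.linalg.svd(A, compute_uv=False)
sigma_tilde = np.linalg.svd(A + E, compute_uv=False)

spectral_norm_E = np.linalg.norm(E, ord=2)           # largest singular value of E
max_shift = np.max(np.abs(sigma - sigma_tilde))
print(f"largest shift: {max_shift:.4f}, spectral norm of E: {spectral_norm_E:.4f}")
```

Every singular value moves by at most the spectral norm of the perturbation, which is the property the argument above relies on.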

## B Comprehensive Results on the BiasBench Datasets

We include more results for the SEAT dataset from BiasBench and for the CrowS-Pairs and StereoSet datasets for bias categories other than gender. A description of the SEAT and GLUE datasets (with the metrics used) follows.

##### SEAT (May et al., 2019)

SEAT is a sentence-level extension of WEAT (Caliskan et al., 2017), which is an association test between two categories of words: attribute word sets and target word sets. For example, an attribute word set for gender bias could be { *he*, *him*, *man* }, while a target word set could contain words related to office work, such as { *career*, *office* }. If we see a high association between an attribute word set and a target word set, we may claim that a particular gender bias is encoded. The final evaluation is calculated by measuring the similarity between the different attribute and target word sets. To extend WEAT to a sentence-level test, May et al. (2019) incorporated the WEAT attribute and target words into synthetic sentence templates.

We use an *effect size* metric to report our results for SEAT. This measure is a normalized difference between cosine similarity of representations of the attribute words and the target words. Both attribute words and target words are split into two categories (for example, in relation to gender), so the difference is based on four terms, between each pair of each category set of words (target and attribute). An effect size closer to zero indicates less bias is encoded in the representations.
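As a rough illustration, the effect size can be computed following the WEAT definition of Caliskan et al. (2017): the difference between the mean association scores of the two target sets, normalized by the standard deviation of the association scores over all target items. The sketch below assumes that definition, uses the sample standard deviation (an assumption, since implementations differ on this point), and runs on random vectors in place of real sentence representations. All function and variable names are illustrative.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """Mean cosine similarity of w to attribute set A minus that to set B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def effect_size(X, Y, A, B):
    """Normalized difference of associations between target sets X and Y."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    # Sample standard deviation over all target items (assumption: ddof=1).
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

# Random stand-ins for sentence representations (assumption).
rng = np.random.default_rng(0)
dim = 16
X, Y = rng.normal(size=(8, dim)), rng.normal(size=(8, dim))   # two target sets
A, B = rng.normal(size=(8, dim)), rng.normal(size=(8, dim))   # two attribute sets

d_xy = effect_size(X, Y, A, B)
d_yx = effect_size(Y, X, A, B)   # swapping the target sets flips the sign
print(f"effect size: {d_xy:.3f}, with target sets swapped: {d_yx:.3f}")
```

The four similarity terms mentioned above correspond to the four pairings of the two target sets with the two attribute sets inside `association` and `effect_size`.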

##### GLUE (Wang et al., 2019)

We follow Meade et al. (2022) and use the GLUE dataset to test the debiased models on an array of downstream tasks to validate their usability. GLUE is a highly popular benchmark for testing NLP models, containing a variety of tasks, such as classification tasks (e.g., sentiment analysis), similarity tasks (e.g., paraphrase identification), and inference tasks (e.g., question-answering).

The following tables of results are included:

## Notes

Our code is available at https://github.com/jasonshaoshun/AMSAL.

We used a list of cities, counties, and states in the United States, taken from https://tinyurl.com/4kmc6pyn. All users were in the United States when the data was collected by the original curators.

The absolute error of prediction *a* with true value *b* is |*a* −*b*|.

We thank the anonymous reviewer for raising this issue.

Choose *k* elements that are fixed, and let the rest vary arbitrarily.

## References

## Author notes

Equal contribution.

Action Editor: Jonathan Berant