Erasure of Unaligned Attributes from Neural Representations

We present the Assignment-Maximization Spectral Attribute removaL (AMSAL) algorithm, which erases information from neural representations when the information to be erased is implicit rather than directly aligned to each input example. Our algorithm works by alternating between two steps. In one, it finds an assignment of the input representations to the information to be erased, and in the other, it creates projections of both the input representations and the information to be erased into a joint latent space. We test our algorithm on an extensive array of datasets, including a Twitter dataset with multiple guarded attributes, the BiasBios dataset, and the BiasBench benchmark. The latter benchmark includes four datasets with various types of protected attributes. Our results demonstrate that bias can often be removed in our setup. We also discuss the limitations of our approach when there is a strong entanglement between the main task and the information to be erased.


Introduction
Developing a methodology for adjusting neural representations to preserve user privacy and avoid encoding bias in them has been an active area of research in recent years. Previous work shows it is possible to erase undesired information from representations so that downstream classifiers cannot use that information in their decision-making process. This previous work assumes that the sensitive information (or guarded attributes, such as gender or race) is available for each input instance. These guarded attributes, however, are sensitive, and obtaining them on a large scale is often challenging and, in some cases, not feasible (Han et al., 2021b). (Code: jasonshaoshun/AMSAL.)
Figure 1: A depiction of the problem setting and solution. The inputs x^(i) are aligned to the guarded samples z^(j) based on alignment strength under two projections U and V. We solve a bipartite matching problem to find the matching (blue edges), and then recalculate U and V.
For example, Blodgett et al. (2016) studied the characteristics of African-American English (AAE) on Twitter, and could not couple the ethnicity attribute directly with the tweets they collected due to the attribute's sensitivity.
This paper introduces a novel debiasing setting in which the guarded attributes are not paired up with each input instance, and an algorithm to remove information from representations in that setting. In our setting, we assume that each neural input representation is coupled with a guarded attribute value, but this assignment is unavailable. In cases where the domain of the guarded attribute is small (for example, with binary attributes), this means that the guarded attribute information consists of priors with respect to the whole population and not instance-level information.
The intuition behind our algorithm is that if we find a strong correlation between the input variable and a set of guarded attributes, given either as an unordered list of records or as priors, it is unlikely to be coincidental if the sample size is sufficiently large (§3.5). We implement this intuition by jointly finding projections of the input samples and the guarded attributes into a joint embedding space, together with an alignment between the two sets in that space.
Our resulting algorithm (§3), the Assignment-Maximization Spectral Attribute removaL algorithm (AMSAL), is a coordinate-ascent algorithm reminiscent of the hard expectation-maximization algorithm (hard EM; MacKay 2003). It first loops between two steps: an Assignment step (A), in which it finds an assignment based on the existing projections, and a Maximization step (M), in which it projects the representations and guarded attributes into a joint space based on the existing assignment. After these two steps are iteratively repeated and an alignment is identified, the algorithm takes an additional step to erase information from the input representations based on the projections identified. This step closely follows the work of Shao et al. (2023), who use Singular Value Decomposition (SVD) to remove principal directions of the covariance matrix between the input examples and the guarded attributes. Figure 1 depicts a sketch of our setting and the corresponding algorithm, with x^(i) being the input representations and z^(j) being the guarded attributes. Our algorithm is modular: while our use of the algorithm of Shao et al. (2023) for the removal step is natural given the nature of the AM steps, a user can plug in any such algorithm to erase the information from the input representations (§3.4).
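To make the alternation concrete, the following is a minimal numerical sketch of the AM loop. It is not the paper's implementation: it assumes n = m, substitutes the Hungarian method for the ILP of the A-step, and uses made-up synthetic data.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def amsal_sketch(X, Z, k=2, n_iter=10, seed=0):
    """Toy AM loop: alternate an assignment (A) step and an SVD (M) step,
    then erase by keeping only the low-covariance directions of U."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    pi = rng.permutation(n)                      # random initial assignment
    for _ in range(n_iter):
        # M-step: SVD of the empirical cross-covariance under the current pi
        omega = X.T @ Z[pi] / n
        U, _, Vt = np.linalg.svd(omega)
        Uk, Vk = U[:, :k], Vt[:k].T
        # A-step: scores[i, j] = <Uk^T x_i, Vk^T z_j>; match with Hungarian
        scores = (X @ Uk) @ (Z @ Vk).T
        _, pi = linear_sum_assignment(scores, maximize=True)
    # removal step: project onto the directions that co-vary least with Z
    U_low = U[:, k:]
    return pi, X @ U_low @ U_low.T
```

In the real algorithm the A-step is an ILP with per-attribute count bounds and m ≤ n; the sketch only illustrates the coordinate-ascent structure.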
Our contributions are as follows: (1) We propose a new setup for removing guarded information from neural representations where there are few or no labeled guarded attributes; (2) We present a novel two-stage coordinate-ascent algorithm that iteratively improves (a) an alignment between guarded attributes and neural representations; and (b) information removal projections.
Using an array of datasets, we perform extensive experiments to assess how challenging our setup is and whether our algorithm is able to remove information without having aligned guarded attributes (§4). We find in several cases that little information is needed to align neural representations with their corresponding guarded attributes. The consequence is that it is possible to erase the information such guarded attributes provide from the neural representations while preserving the information needed for the main-task decision-making. We also study the limitations of our algorithm by experimenting with a setup where it is hard to distinguish between the guarded attributes and the downstream task labels when aligning the neural representations with the guarded attributes (§4.5).

Problem Formulation and Notation
For an integer n we denote by [n] the set {1, . . . , n}. For a vector v, we denote by ||v||_2 its ℓ2 norm. For two vectors v and u, by default in column form, ⟨v, u⟩ = v⊤u (dot product). Matrices and vectors are in boldface font (with uppercase or lowercase letters, respectively). Random variable vectors are also denoted by boldface uppercase letters. For a matrix A, we denote by a_ij the value of cell (i, j); its Frobenius norm is ||A||_F. In our problem formulation, we assume three random variables: X ∈ R^d, Y, and Z ∈ R^{d′}, where the expectation of all three variables is 0 (see Shao et al. 2023). Samples of X are the inputs for a classifier to predict corresponding samples of Y. The random vector Z represents the guarded attributes. We want to maintain the ability to predict Y from X, while minimizing the ability to predict Z from X.
We assume n samples of (X, Y) and m samples of Z, denoted by (x^(i), y^(i)) for i ∈ [n], and z^(i) for i ∈ [m] (m ≤ n). While these samples were originally generated jointly from the underlying distribution p(X, Y, Z), we assume a shuffling of the Z samples in such a way that we are only left with m samples that are unique (no repetitions) and an underlying unknown many-to-one mapping π : [n] → [m] that maps each x^(i) to its original z^(j).
The problem formulation is such that we need to remove the information from the xs while treating the samples of zs as a set. In our case, we do so by iterating between trying to infer π and then, using standard techniques, removing the information from the xs based on their alignment to the corresponding zs.
Figure 2: The main Assignment-Maximization Spectral Attribute removaL (AMSAL) algorithm for removal of information without alignment between samples of X and Z. The algorithm alternates between an M-step, which computes U and V from the top k singular vectors under the current assignment, and an A-step, which, with U and V fixed, finds π by solving the problem in Eq. 2. It returns the singular vectors from U that have the lowest singular values.
Let A = E[XZ⊤] ∈ R^{d×d′} be the cross-covariance matrix of X and Z (recall that both have zero mean). For any two vectors a ∈ R^d and b ∈ R^{d′}, the following holds due to the linearity of expectation:

a⊤Ab = Cov(a⊤X, b⊤Z). (1)

Singular value decomposition of A, in this case, finds the "principal directions": directions in which the projections of X and Z maximize their covariance. The projections are represented as two matrices U ∈ R^{d×d} and V ∈ R^{d′×d′}. Each column in these matrices plays the role of the vectors a and b in Eq. 1. SVD finds U and V such that for any i ∈ [d′] it holds that

Cov(u_i⊤X, v_i⊤Z) = σ_i,

where u_i and v_i are the ith columns of U and V, and σ_i is the ith largest singular value of A. Shao et al. (2023) showed that SVD in this form can be used to debias representations: we calculate the SVD between X and Z and then prune out the principal directions that denote the highest covariance. We will use their method, SAL (Spectral Attribute removaL), in the rest of the paper. See also §3.4.
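As a sanity check of Eq. 1, the following sketch estimates A empirically on synthetic zero-mean data (the data and its construction are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dz = 5000, 4, 2
H = rng.standard_normal((n, dz))
X = np.hstack([H, rng.standard_normal((n, d - dz))])   # X carries Z's signal
Z = H + 0.05 * rng.standard_normal((n, dz))

A = X.T @ Z / n                     # empirical cross-covariance (zero-mean data)
U, S, Vt = np.linalg.svd(A)

# Eq. 1: a^T A b equals Cov(a^T X, b^T Z) for arbitrary directions a, b
a, b = rng.standard_normal(d), rng.standard_normal(dz)
lhs = a @ A @ b
rhs = np.mean((X @ a) * (Z @ b))    # empirical covariance (means are ~0)
assert abs(lhs - rhs) < 1e-8

# the top singular pair (u_1, v_1) attains covariance sigma_1
assert np.isclose(U[:, 0] @ A @ Vt[0], S[0])
```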

Methodology
We view the problem of information removal with unaligned samples as a joint optimization problem of: (a) finding the alignment; and (b) finding the projection that maximizes the covariance under that alignment, using its complement to project the inputs. Such an optimization, in principle, is intractable, so we break it down into two coordinate-ascent style steps: an A-step (in which the alignment is identified as a bipartite graph matching problem) and an M-step (in which, based on the previously identified alignment, a maximal-covariance projection is calculated). Formally, the maximization problem we solve is:

max_{π, U, V} Σ_{i=1}^{n} ⟨U⊤x^(i), V⊤z^(π(i))⟩, (2)

where we constrain U ∈ R^{d×k} and V ∈ R^{d′×k} to be matrices with orthonormal columns.
Note that the sum in the above equation has one term per pair (x^(i), z^(π(i))), which enables us to frame the A-step as an integer linear programming (ILP) problem (§3.1). The full algorithm is given in Figure 2, and in the next two sections we further explain the A-step and the M-step.

A-step (Guarded Sample Assignment)
In the Assignment Step, we are required to find a many-to-one alignment π : [n] → [m] between {x^(1), . . . , x^(n)} and {z^(1), . . . , z^(m)}. Given U and V from the previous M-step, we can find such an assignment by solving the following optimization problem:

π = argmax_{π} Σ_{i=1}^{n} ⟨U⊤x^(i), V⊤z^(π(i))⟩.

This maximization problem can be formulated as an integer linear program of the following form:

max_{p} Σ_{i=1}^{n} Σ_{j=1}^{m} p_ij ⟨U⊤x^(i), V⊤z^(j)⟩
s.t. p_ij ∈ {0, 1}; Σ_j p_ij = 1 for all i ∈ [n]; b_0j ≤ Σ_i p_ij ≤ b_1j for all j ∈ [m].

This is a variant of the assignment problem (Kuhn, 1955; Ramshaw and Tarjan, 2012), where p_ij denotes whether x^(i) is associated with the (type of) guarded attribute z^(j). The values (b_0j, b_1j) determine lower and upper bounds on the number of xs a given z^(j) can be assigned to. While a standard assignment problem can be solved efficiently using the Hungarian method of Kuhn (1955), we choose the ILP formulation, as it gives us more freedom in adding constraints to the problem, such as the lower and upper bounds.
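A sketch of this A-step as an ILP, using SciPy's `milp` (requires SciPy ≥ 1.9). Here `S[i, j]` stands for the score of pairing x^(i) with z^(j), and the helper name is our own:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def assign_with_bounds(S, lo, hi):
    """A-step as an ILP: maximize sum_ij p_ij * S[i, j], with each x assigned
    to exactly one z-type and per-type counts bounded by (lo_j, hi_j)."""
    n, m = S.shape
    A_row = np.kron(np.eye(n), np.ones(m))   # sum_j p_ij = 1 for each i
    A_col = np.kron(np.ones(n), np.eye(m))   # lo_j <= sum_i p_ij <= hi_j
    cons = [LinearConstraint(A_row, 1, 1),
            LinearConstraint(A_col, lo, hi)]
    # milp minimizes, so negate the scores; all variables are binary
    res = milp(c=-S.ravel(), constraints=cons,
               integrality=np.ones(n * m), bounds=Bounds(0, 1))
    P = res.x.reshape(n, m).round().astype(int)
    return P.argmax(axis=1)                  # pi(i) = assigned z-type
```

The count bounds are what distinguishes this formulation from a plain Hungarian assignment: they let priors over the guarded attribute constrain how many inputs each type can absorb.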

M-step (Covariance Maximization)
The result of an A-step is an assignment π such that π(i) = j implies x^(i) was deemed aligned to z^(j). With that π in mind, we define the following empirical cross-covariance matrix Ω_π ∈ R^{d×d′}:

Ω_π = (1/n) Σ_{i=1}^{n} x^(i) (z^(π(i)))⊤. (3)

We then apply SVD to Ω_π to get new U and V that are used in the next iteration of the algorithm with the A-step, if the algorithm continues to run. When the maximal number of iterations is reached, we follow the work of Shao et al. (2023) in using a truncated part of U to remove the information from the xs. We do that by projecting each x^(i) using the singular vectors of U with the smallest singular values. These projected vectors co-vary the least with the guarded attributes, assuming the assignment in the last A-step was precise. This method has been shown by Shao et al. (2023) to be highly effective and efficient in debiasing neural representations.
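A sketch of the M-step plus the removal step on synthetic aligned data (our own toy example; `num_remove` plays the role of the number of pruned directions):

```python
import numpy as np

def sal_removal(X, Z_aligned, num_remove):
    """SVD of the empirical cross-covariance Omega (cf. Eq. 3), then project
    the inputs onto the singular directions with the smallest singular values."""
    n = X.shape[0]
    omega = X.T @ Z_aligned / n
    U, S, _ = np.linalg.svd(omega)
    U_keep = U[:, num_remove:]      # drop the top (highest-covariance) directions
    return X @ U_keep @ U_keep.T

rng = np.random.default_rng(1)
H = rng.standard_normal((2000, 1))                  # guarded signal
X = np.hstack([H, rng.standard_normal((2000, 3))])  # inputs contain the signal
X_clean = sal_removal(X, H, num_remove=1)
# after removal, no linear direction of X co-varies with the guarded attribute
assert np.abs(X_clean.T @ H / 2000).max() < 1e-8
```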

A Matrix Formulation of the AM Steps
Let e_1, . . . , e_m be the standard basis vectors of R^m. This means e_i is a vector of length m with 0 in all coordinates except for the ith coordinate, where it is 1.
Let ℰ be the set of matrices E ∈ R^{n×m} in which each row is one of the basis vectors e_i⊤, i ∈ [m]. Writing the guarded samples as the rows of Z ∈ R^{m×d′}, EZ is then an n × d′ matrix whose ith row is a copy of z^(π(i)), the guarded sample assigned to x^(i). Therefore, the AM steps can be viewed as solving the following maximization problem using coordinate ascent:

max_{E ∈ ℰ} max_{U, V} tr(U⊤ X⊤EZ V),

where U and V are orthonormal matrices, and, for a fixed E, the inner maximum is attained by the SVD X⊤EZ = UΣV⊤, where Σ is a diagonal matrix with non-negative elements (here the samples x^(i) are stacked as the rows of X ∈ R^{n×d}).
In that case, the matrix E can be directly mapped to an assignment π, where π(i) is the j such that the jth coordinate of the ith row of E is non-zero.
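The equivalence between π and E, and the identity X⊤EZ = n·Ω_π, can be checked directly (a toy sketch with made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, d, dz = 6, 3, 4, 2
X = rng.standard_normal((n, d))        # rows are the inputs x^(i)
Z = rng.standard_normal((m, dz))       # rows are the unique guarded samples z^(j)
pi = np.array([0, 2, 1, 0, 2, 1])      # a many-to-one assignment

E = np.zeros((n, m))                   # each row of E is a basis vector e_{pi(i)}
E[np.arange(n), pi] = 1.0

# EZ stacks z^(pi(i)) as rows, so X^T E Z equals n times Omega_pi from Eq. 3
omega_pi = X.T @ Z[pi] / n
assert np.allclose(X.T @ E @ Z, n * omega_pi)
```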

Removal Algorithm
The AM steps are best suited for removal of information through SVD with an algorithm such as SAL. This is because the AM steps optimize an objective of the same type as SAL, relying on the projections U and V to map the inputs and guarded representations into a joint space. However, a by-product of the algorithm in Figure 2 is an assignment function π that aligns the inputs with the guarded representations.
With that assignment, other removal algorithms can be used, for example, the algorithm of Ravfogel et al. (2020).We experiment with this idea in §4.

Justification of the AM Steps
Next, we justify our algorithm (this section may be skipped on a first reading). Our justification is based on the observation that if X and Z are indeed linked (this connection is formalized as a latent variable in their joint distribution), then for a permuted sample, the singular values of Ω will be larger the closer the permutation is to the identity. This justifies seeking the permutation that maximizes the singular values in an SVD of Ω.
More Details Let ι : [n] → [n] be the identity permutation, ι(i) = i. We will assume the case in which n = m (the justification generalizes to m < n), and that the underlying joint distribution p(X, Z) is mediated by a latent variable H, such that

p(X, Z) = Σ_h p(H = h) p(X | H = h) p(Z | H = h). (4)

This implies there is a latent variable that connects X and Z, and that the joint distribution p(X, Z) is a mixture through H.
Proposition 1 (informal). Let {(x^(i), z^(i))} be a sample of size n from the distribution in Eq. 4. Let π be a permutation over [n] sampled uniformly from the set of permutations. Then with high probability, the sum of the singular values of Ω_π is smaller than the sum of the singular values of Ω_ι.
For full details of this claim, see Appendix A.
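An empirical illustration of Proposition 1 under the mixture assumption of Eq. 4 (a synthetic sketch, with a Gaussian latent variable standing in for H):

```python
import numpy as np

rng = np.random.default_rng(3)
n, dz = 500, 2
H = rng.standard_normal((n, dz))                 # shared latent variable
X = np.hstack([H, rng.standard_normal((n, 3))])  # X and Z both depend on H
Z = H + 0.1 * rng.standard_normal((n, dz))

def sv_sum(perm):
    """Sum of singular values of Omega under the given permutation."""
    return np.linalg.svd(X.T @ Z[perm] / n, compute_uv=False).sum()

identity = sv_sum(np.arange(n))
shuffled = [sv_sum(rng.permutation(n)) for _ in range(20)]
# the identity alignment yields a much larger singular-value sum
assert identity > max(shuffled)
```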
Experiments

In our experiments, we test several combinations of algorithms. We use k-means (KMEANS) as a baseline substitute for the AM steps in the assignment of xs to zs. In addition, for the removal step (once an assignment has been identified), we test two algorithms: SAL (Shao et al. 2023; resulting in AMSAL) and INLP (Ravfogel et al., 2020). We also compare these two algorithms in oracle mode (in which the assignment of guarded attributes to inputs is known), to measure the loss in performance due to noisy assignments from the AM or k-means algorithm (ORACLESAL and ORACLEINLP).
When running the AM algorithm or k-means, we execute it with three random seeds (see also §4.6) for a maximum of one hundred iterations and choose the projection matrix with the largest objective value over all seeds and iterations. For the slack variables (b_0j and b_1j in Eq. 2), we use 20%-30% above and below the baseline counts implied by the guarded attribute priors according to the training set. With the SAL methods, we remove a number of directions according to the rank of the Ω matrix (between 2 and 6 in all experiments).
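The slack bounds can be derived from the priors as sketched below (a hypothetical helper; the function name and the exact rounding are our own choices, not the paper's):

```python
import numpy as np

def slack_bounds(priors, n, slack=0.25):
    """Per-type count bounds (b_0j, b_1j) from guarded-attribute priors,
    with `slack` (e.g. 20%-30%) above and below the expected counts."""
    expected = np.asarray(priors) * n
    lo = np.floor(expected * (1 - slack)).astype(int)
    hi = np.ceil(expected * (1 + slack)).astype(int)
    return lo, hi
```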
In addition, we experiment with a partially supervised assignment process, in which a small seed dataset of aligned xs and zs is provided to the AM steps.We use it for model selection: rather than choosing the assignment with the highest SVD objective value, we choose the assignment with the highest accuracy on this seed dataset.We refer to this setting as PARTIAL (for "partially supervised assignment").
Finally, in the case of a gender-protected attribute, we compare our results against a baseline in which the input x is compared against a list of words stereotypically associated with the male or female gender. Based on the overlap with these two lists, we heuristically assign the gender label to x and then run SAL or INLP (rather than using the AM algorithm). While this word-list heuristic is plausible in the case of gender, it is not as easy to derive for other protected attributes, such as age or race. We report the results for this baseline using the marker WL in the corresponding tables.
Main Findings Our overall main finding shows that our novel setting, in which guarded information is erased from individually unaligned representations, is viable. We discovered that AM methods perform particularly well in more complex bias-removal scenarios, such as when multiple guarded attributes are present. We also found that having similar priors for the guarded attributes and downstream task labels may lead to poor performance on the task at hand. In these cases, using a small amount of supervision often effectively helps reduce bias while maintaining the utility of the representations for the main classification or regression problem. Finally, our analysis of alignment stability shows that our AM algorithm often converges to suitable solutions that align X with Z.
Due to the unsupervised nature of our problem setting, we advise validating the utility of our method in the following way. Once we run the AM algorithm, we check whether there is a high-accuracy alignment between X and Y (rather than Z, which is unavailable). If this alignment is accurate, then we run the risk of significantly damaging task performance. An example is given in §4.5.

Word Embedding Debiasing
As a preliminary assessment of our setup and algorithms, we apply our methods to GloVe word embeddings to remove gender bias, following the previous experimental settings for this problem (Bolukbasi et al., 2016; Ravfogel et al., 2020; Shao et al., 2023). We consider only the 150,000 most common words to ensure embedding quality and omit the rest. We sort the remaining embeddings by their projection on the he − she direction. We then take the top 7,500 word embeddings as male-associated words (z = 1) and the bottom 7,500 as female-associated words (z = −1).
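A sketch of this labeling procedure (the function and variable names are our own; it assumes `vocab` is a list of words and `vecs` the matching embedding matrix):

```python
import numpy as np

def label_by_direction(vocab, vecs, pos="he", neg="she", top_k=7500):
    """Sort embeddings by their projection on the (he - she) direction and
    label the two extremes, as in the word-embedding debiasing setup."""
    direction = vecs[vocab.index(pos)] - vecs[vocab.index(neg)]
    direction = direction / np.linalg.norm(direction)
    proj = vecs @ direction
    order = np.argsort(proj)
    male = order[-top_k:]               # most "he"-aligned words (z = 1)
    female = order[:top_k]              # most "she"-aligned words (z = -1)
    return male, female
```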
Our finding is that both the k-means and the AM algorithms perfectly identify the alignment between the word embeddings and their associated gender label (100%). Indeed, the dataset construction itself follows a natural perfect clustering that these algorithms easily discover. Since the alignments are perfectly identified, the results of predicting the gender from the word embeddings after removal are identical to the oracle case. These results are quite close to those of a random guess, and we refer the reader to Shao et al. (2023) for details on experiments with SAL and INLP for this dataset. Considering Figure 3, it is evident that our algorithm essentially follows a natural clustering of the word embeddings into two clusters, female and male, as the embeddings are highly separable in this case. This is why the alignment score of X (embedding) to Z (gender) is perfect here. This finding indicates that this standard word embedding dataset used for debiasing is trivial to debias: debiasing can be done even without knowing the identity of the stereotypical gender associated with each word. The next dataset concerns bias in a downstream classification task, where, for example, we want to avoid a person being identified as "he" or "she" in their biography affecting the likelihood of them being classified as an engineer or a teacher.

BiasBios
We follow the setup of De-Arteaga et al. (2019), predicting a candidate's profession (y) from a self-provided short biography (x), aiming to remove any information about the candidate's gender (z). Due to computational constraints, we use only a random sample of 30K examples to learn the projections with both SAL and INLP (whether in the unaligned or aligned setting). For the classification problem, we use the full dataset. To get vector representations for the biographies, we use two different encoders: FastText word embeddings (Joulin et al., 2016) and BERT (Devlin et al., 2019). We stack a multiclass classifier on top of these representations, as there are 28 different professions. We use 20% of the training examples for the PARTIAL setting. For BERT, we follow De-Arteaga et al. (2019) in using the last [CLS] token state as the representation of the whole biography. We use the bert-base-uncased model.

Evaluation Measures
We use an extension of the True Positive Rate (TPR) gap, the root mean square (RMS) TPR gap over all classes, for evaluating bias in a multiclass setting. This metric was suggested by De-Arteaga et al. (2019), who demonstrated that it is significantly correlated with gender imbalances, which often lead to unfair classification. The higher the metric value, the bigger the gap between the two categories (for example, male and female) for the main task prediction. For the profession classification itself, we report accuracy.
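A minimal sketch of the RMS TPR-gap computation for a binary guarded attribute (our own helper; it assumes every class contains examples of both guarded groups):

```python
import numpy as np

def rms_tpr_gap(y_true, y_pred, z):
    """RMS over classes of the per-class TPR gap between the two values of z."""
    gaps = []
    for c in np.unique(y_true):
        tprs = []
        for g in np.unique(z):
            mask = (y_true == c) & (z == g)
            tprs.append(np.mean(y_pred[mask] == c))   # TPR of class c in group g
        gaps.append(tprs[0] - tprs[1])
    return float(np.sqrt(np.mean(np.square(gaps))))
```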
Results Table 1 provides the results for the biography dataset. We see that INLP significantly reduces the TPR-GAP in all settings, but this comes at a cost: the representations are significantly less useful for the main task of predicting the profession. When inspecting the alignments, we observe that their accuracy is quite high with BERT: 100% with k-means, 85% with the AM algorithm, and 99% with PARTIAL AM. FastText results are lower, hovering around 55% for all three methods. The high BERT assignment performance indicates that the BiasBios BERT representations are naturally separated by gender. We also observe that the results of WL+SAL and WL+INLP are essentially identical to those of ORACLE+SAL and ORACLE+INLP. This comes as no surprise, as the gender label is derived from a similar word list, which enables the WL approach to get a nearly perfect alignment (over 96% agreement with the gender label).

BiasBench Results
Meade et al. (2022) conducted an empirical study of an array of datasets in the context of debiasing. They analyzed different methods and tasks, and we follow their benchmark evaluation to assess our AMSAL algorithm and other methods in the context of our new setting. We include a short description of the datasets we use in this section; full results and descriptions of the other datasets appear in Appendix B. We also encourage the reader to refer to Meade et al. (2022) for details on this benchmark. We use 20% of the training examples for the PARTIAL setting.
StereoSet (Nadeem et al., 2021) This dataset presents a word-completion test for a language model, where the completion can be stereotypical or non-stereotypical. The bias is measured by calculating how often a model prefers the stereotypical completion over the non-stereotypical one. Nadeem et al. (2021) also introduced the language model score to measure language model usability: the percentage of examples for which a model prefers the stereotypical or non-stereotypical word over some unrelated word.
CrowS-Pairs (Nangia et al., 2020) This dataset includes pairs of sentences that are minimally different at the token level, but whose differences make the sentence either stereotypical or anti-stereotypical. The assessment measures how many times a language model prefers the stereotypical element of a pair over the anti-stereotypical element.

Results
We start with an assessment of the BERT model on the CrowS-Pairs gender, race, and religion bias evaluations (Table 2). We observe that all approaches for gender, except AM+INLP, reduce the stereotype score. Race and religion are more difficult to debias in the case of BERT. INLP with k-means works best when no seed alignment data is provided at all, but when we consider PARTIAL-SAL, in which we use the alignment algorithm with some seed aligned data, the results are the strongest. When we consider the RoBERTa model, the results are similar, with PARTIAL-SAL significantly reducing the bias. Our findings from Table 2 overall indicate that the ability to debias a representation highly depends on the model that generates the representation. In Table 10 we observe that the representations, on average, are not damaged for most GLUE tasks.
As Meade et al. (2022) have noted, when changing the representations of a language model to remove bias, we might make adjustments that damage the usability of the language model. To test which methods may cause such an issue, we also assess the language model score on the StereoSet dataset in Table 3. Overall, we see that SAL-based methods often give a lower stereotype score, while INLP methods more significantly damage the language model score. This implies that the SAL-based methods remove bias effectively while harming the usability of the language model representations less.
We also report comprehensive results for other datasets (SEAT and GLUE) and categories of bias (race and religion). The results, especially for GLUE, demonstrate the effectiveness of our method for unaligned information removal: for GLUE, we consistently retain the baseline task performance almost in full. See Appendix B.

Multiple-Guarded Attribute Sentiment
We hypothesize that AM-based methods are better suited for setups where multiple guarded attributes should be removed, as they allow us to target several guarded attributes with different priors.
To examine our hypothesis, we experiment with a dataset curated from Twitter (tweets encoded using BERT, bert-base-uncased), in which users were surveyed for their age and gender (Cachola et al., 2018). We bucket the age into three groups (0-25, 26-50, and above 50). Tweets in this dataset are annotated with their sentiment, ranging from one (very negative) to five (very positive). The dataset consists of more than 6,400 tweets written by more than 1,700 users. We removed users that no longer have public Twitter accounts and users whose listed locations do not exist based on a filter, resulting in a dataset with over 3,000 tweets written by 817 unique users. (All users were in the United States when the data was collected by the original curators.) As tweets are short by nature and their number is relatively small, the debiasing signal in this dataset, that is, the amount of information it contains about the guarded attributes, might not be sufficient for attribute removal. To amplify this signal, we concatenated each tweet in the dataset to at most ten other tweets from the same user.
We study the relationship between the main task of sentiment detection and the two protected attributes of age and gender. As the protected attribute z, we use the combination of both age and gender, encoded as a binary one-hot vector. This dataset thus presents a use case of our algorithm with a composed protected attribute. Rather than using a classifier for predicting the sentiment, we use linear regression. Following Cachola et al. (2018), we use the Mean Absolute Error (MAE) to report the error of the sentiment predictions. Given that the sentiment is predicted as a continuous value, we cannot use the TPR gap as in previous sections. Instead, we use the following quantity:

std({MAD_{z=j} : j ∈ [m]}),  where  MAD_{z=j} = (1/n_j) Σ_i η_ij,

where i ranges over the n_j examples with protected attribute value j, µ_j is the average absolute Y-prediction error for that set, and η_ij is the absolute difference between µ_j and the absolute error of example i. The function std indicates the standard deviation of the m values MAD_{z=j}, j ∈ [m].
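This metric can be computed as follows (a sketch with our own naming; `abs_err` holds the per-example absolute prediction errors):

```python
import numpy as np

def mad_gap(abs_err, z):
    """Std over guarded groups of the mean absolute deviation (MAD) of the
    per-example absolute regression errors within each group."""
    mads = []
    for g in np.unique(z):
        e = abs_err[z == g]
        mu = e.mean()                           # mean absolute error for group g
        mads.append(np.mean(np.abs(e - mu)))    # eta_ij averaged over the group
    return float(np.std(mads))
```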
Results Table 4 provides the results for this dataset. We can see that both AM-based methods outperform their k-means counterparts, which either increase unfairness (KMEANS+INLP) or significantly harm the downstream task performance (KMEANS+SAL). We also consider Figure 4, which shows how the quality of the assignments of the AM algorithm changes as a function of the labeled data used. As expected, the more labeled data we have, the more accurate the assignments are, but the differences are not very large.

An Example of Our Method Limitations
We now present the main limitation of our approach and setting. This limitation arises when the random variables Y and Z are not easily distinguishable through information about X.
We experiment with a binary sentiment analysis task (y), predicted from users' tweets (x), aiming to remove information regarding the authors' ethnic affiliations. To do so, we use a dataset collected by Blodgett et al. (2016), which examined the differences between African-American English (AAE) speakers and Standard American English (SAE) speakers. As information about one's ethnicity is hard to obtain, the users' geolocation information was used to create a distantly supervised mapping between authors and their ethnic affiliations. We follow previous work (Shao et al., 2023; Ravfogel et al., 2020) and use the DeepMoji encoder (Felbo et al., 2017) to obtain representations for the tweets. The train and test sets are balanced with respect to sentiment and authors' ethnicity. We use 20% of the examples for the PARTIAL setting.

Table 5 gives the results for this dataset. We observe that removal with an inferred assignment (k-means, AM, or PARTIAL) significantly harms the performance on the main task and reduces it to a random guess. This exposes a limitation of our algorithm: a priori, there is no distinction between Y and Z, as our method is unsupervised. In addition, the positive labels of Y and Z have the same prior probability. Indeed, when we check the assignment accuracy on this dataset, we observe that the k-means, AM, and PARTIAL AM assignment accuracies for identifying Z are between 0.55 and 0.59. If we check the assignments against Y, we get accuracies between 0.74 and 0.76. This means that all assignment algorithms actually identify Y rather than Z (both Y and Z are binary variables in this case). The conclusion is that our algorithm works best when sufficient information on Z is present to provide a basis for aligning samples of Z with samples of X.
Suppose such information is unavailable or unidentifiable with information regarding Y.In that case, we may simply identify the natural clustering of X according to their main task classes, leading to low main-task performance.
In Table 5, we observe that this behavior is significantly mitigated when the priors over sentiment and race are different (0.8 for sentiment and 0.5 for race). In that case, the AM algorithm is able to distinguish between the race-protected attribute (z) and the sentiment class (y) quite consistently with both INLP and SAL, and the gap is reduced.

Figure 6: Ratio of the objective value of the ILP for the AM steps at iteration t to its value at iteration 0, as a function of the iteration number t. The shaded gray area gives upper and lower bounds of one standard deviation over five runs with different seeds for the initial π. See the legend explanation in Table 5.
We also observe that INLP changed neither the accuracy nor the TPR-GAP in the balanced scenario (Table 5) when using a k-means or AM assignment. Upon inspection, we found that INLP returns an identity projection in these cases, unable to amplify the relatively weak signal in the assignment to change the representations.

Stability Analysis of the Alignment
In Figure 5, we plot the accuracy of the alignment algorithm (measured against the true value of the guarded attribute per input) throughout the execution of the AM steps for the first ten iterations. The shaded area indicates one standard deviation. We observe that the first few iterations are the ones in which the accuracy improves the most. For most of the datasets, the accuracy does not decrease between iterations, though in the case of DeepMoji we do observe a "bump." This is why the PARTIAL setting of our algorithm, in which a small amount of guarded information is available to determine at which iteration to stop the AM algorithm, is important. In the word embeddings case, the variance is larger because in certain executions the algorithm converged quickly, while in others it took more iterations to converge to high accuracy.
Figure 6 plots the relative change of the objective value of the ILP from §3.1 against the iteration number. The relative change is defined as the ratio between the objective value at a given iteration and its value before the algorithm begins. We see that the algorithm is relatively stable and that the AM steps converge quite quickly. We also observe that the DeepMoji dataset has a large increase in the objective value in the first iteration (around 5× the value the algorithm starts with), after which it remains stable.
Related Work

There has been an increasing amount of work on detecting and erasing undesired or protected information from neural representations, with standard software packages for this process having been developed (Han et al., 2022). For example, in their seminal work, Bolukbasi et al. (2016) showed that word embeddings exhibit gender stereotypes. To mitigate this issue, they projected the word embeddings onto a space neutral with respect to a "he-she" direction. Influenced by this work, Zhao et al. (2018) proposed a customized training scheme to reduce the gender bias in word embeddings. Gonen and Goldberg (2019) examined the effectiveness of the methods mentioned above and concluded that they remove bias only in a shallow way. For example, they demonstrated that classifiers can accurately predict the gender associated with a word when fed the embeddings produced by either debiasing method.
Another related strand of work uses adversarial learning (Ganin et al., 2016), where an additional objective function is added to balance undesired-information removal against the main task (Edwards and Storkey, 2016; Li et al., 2018; Coavoux et al., 2018; Wang et al., 2021). Elazar and Goldberg (2018) demonstrated that an ad-hoc classifier can easily recover the removed information from adversarially trained representations. Since then, methods for information erasure such as INLP and its generalization (Ravfogel et al., 2020, 2022), SAL (Shao et al., 2023), and methods based on similarity measures between neural representations (Colombo et al., 2022) have been developed. With a similar motivation to ours, Han et al. (2021b) aimed to ease the burden of obtaining guarded attributes at a large scale by decoupling the adversarial information-removal process from the main-task training. They, however, did not experiment with debiasing representations where no guarded-attribute alignments are available. Shao et al. (2023) experimented with the removal of features in a scenario in which only a small number of protected attributes is available.
Additional previous work showed that methods based on causal inference (Feder et al., 2021), training-set balancing (Han et al., 2021a), and contrastive learning (Shen et al., 2021; Chi et al., 2022) effectively reduce bias and increase fairness. In addition, there is a large body of work on detecting bias, its evaluation, and its implications in specific NLP applications. Savoldi et al. (2022) detected a gender bias in speech translation systems for gendered languages. Gender bias is also discussed in the context of knowledge base embeddings by Fisher et al. (2019) and Du et al. (2022), and in multilingual text classification (Huang, 2022).

Conclusions and Future Work
We presented a new and challenging setup for removing information, with minimal or no available sensitive-information alignment. This setup is crucial for the wide applicability of debiasing methods, since for most applications obtaining such sensitive labels on a large scale is challenging. To ease this problem, we presented a method to erase information from neural representations, where the guarded-attribute information does not accompany each input instance. Our main algorithm, AMSAL, alternates between two steps (Assignment and Maximization) to identify an assignment between the input instances and the guarded-information records. It then completes its execution by removing the information, minimizing the covariance between the input instances and the aligned guarded attributes. Our approach is modular, and other erasure algorithms, such as INLP, can be used with it. Experiments show that we can reduce the unwanted bias in many cases while keeping the representations highly useful. Future work might extend our technique to the kernelized case, analogously to the method of Shao et al. (2023).
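The two-step alternation can be illustrated with a minimal numerical sketch. The code below is our reconstruction, not the released implementation (jasonshaoshun/AMSAL): it substitutes SciPy's Hungarian solver (`linear_sum_assignment`) for the paper's ILP in the assignment step and uses a SAL-style SVD of the aligned cross-covariance for the maximization step; all names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def amsal_sketch(X, Z, k=2, iters=10, seed=0):
    """Alternate between a Maximization step (projections U, V from the
    SVD of the aligned cross-covariance) and an Assignment step (bipartite
    matching of inputs to guarded records). Illustrative sketch only.

    X: (n, d) input representations; Z: (n, d') guarded-attribute records.
    """
    n = X.shape[0]
    Xc, Zc = X - X.mean(0), Z - Z.mean(0)
    pi = np.random.default_rng(seed).permutation(n)  # random initial alignment
    for _ in range(iters):
        # M-step: joint projections from the cross-covariance under the
        # current alignment pi
        U, _, Vt = np.linalg.svd(Xc.T @ Zc[pi], full_matrices=False)
        U, V = U[:, :k], Vt[:k].T
        # A-step: re-align by maximum-weight bipartite matching of the
        # projected inputs against the projected guarded records
        scores = (Xc @ U) @ (Zc @ V).T
        _, pi = linear_sum_assignment(-scores)  # negate to maximize
    return pi, U, V
```

In the paper's PARTIAL setting, a small labeled subset would be used to decide at which iteration to stop this loop.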

Ethical Considerations
The AM algorithm could potentially be misused: rather than applying the AM steps to erase information, one could use them to link records of two different types, undermining the privacy of the record holders. Such a situation merits additional concern because the links returned between the guarded attributes and the input instances will likely contain mistakes. The links are unreliable for decision-making at the individual level. Instead, they should be used in the aggregate, as a statistical construct for erasing information from the input representations. Finally, we note that automating the debiasing process, without properly statistically confirming its accuracy on a correct sample, may promote a false sense of security that a given system is making fair decisions. We do not recommend using our method for debiasing without proper statistical control and empirical verification of correctness.
A Justification of the AM Algorithm: Further Details

We provide here the full details for the claim in §3.5. Our first observation is that for a uniformly sampled permutation π: [n] → [n], the probability that it has exactly k ≤ n elements such that π(i) = i is bounded from above by 1/k!. We also assume that E[X | H] = 0 and E[Z | H] = 0, and that the product of every pair of coordinates of X and Z is bounded in absolute value by a constant B > 0. Let {(x^(i), z^(i), h^(i))} be a random sample of size n from the joint distribution p(X, Z, H). Given a permutation π, define Ω_π = Σ_i x^(i) (z^(π(i)))^⊤, and for a set S ⊆ [n], let Ω_{π|S} denote the same sum restricted to indices i ∈ S. For a matrix A ∈ R^{d×d′}, let σ_j(A) be its jth largest singular value, and let σ⁺(A) = Σ_j σ_j(A).
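A uniform random permutation has exactly k fixed points with probability at most 1/k! (a standard fact; the exact probability tends to e⁻¹/k! as n grows). This can be checked with a quick Monte Carlo simulation of ours, not part of the paper:

```python
import math
import random

def fixed_points(perm):
    # number of indices i with perm[i] == i
    return sum(i == p for i, p in enumerate(perm))

def prob_exactly_k(n, k, trials=100_000, seed=0):
    """Estimate P(a uniform permutation of [n] has exactly k fixed points)."""
    rng = random.Random(seed)
    base = list(range(n))
    hits = 0
    for _ in range(trials):
        rng.shuffle(base)  # each shuffle gives a fresh uniform permutation
        if fixed_points(base) == k:
            hits += 1
    return hits / trials

for k in range(4):
    # empirical probability vs. the 1/k! upper bound
    print(k, prob_exactly_k(20, k), 1 / math.factorial(k))
```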
We first note that for any permutation π, it holds that E[Ω_{π|K}] = 0, where K = [n] \ I(π) and I(π) = {i : π(i) = i} is the set of fixed points of π.
Lemma 1. For any t > 0, it holds that

P( max_{i,j} |[Ω_{π|I(π)} − E[Ω_{π|I(π)}]]_{ij}| ≥ t|I(π)| ) ≤ 2dd′ exp(−t²|I(π)|/B²).   (6)

Proof. By Hoeffding's inequality, for any i ∈ [d], j ∈ [d′], the probability that a sum of |I(π)| i.i.d. random variables X_k Z_k, each bounded in absolute value by B, deviates from its expectation by at least t|I(π)| is smaller than 2 exp(−t²|I(π)|/B²). Therefore, by a union bound over the dd′ elements of the matrix, we get the upper bound in Eq. 6.
Lemma 2. It holds that ||Ω_{π|K} − E[Ω_{π|K}]||₂ ≤ 2kB√(dd′), where k = |K|.

Proof. Since each product X_i Z_j is bounded in absolute value by B, and the dimensions of Ω_{π|K} are d × d′, each cell being a sum of |K| values, each entry of Ω_{π|K} − E[Ω_{π|K}] is at most 2kB in absolute value, and the bound follows from the spectral norm being bounded by the Frobenius norm.
Let n be such that nσ⁺ > 2kdd′B, where k = |K|. Then, from Lemma 2, ||Ω_{π|K} − E[Ω_{π|K}]||₂ < nσ⁺. Consider the event σ⁺(Ω_ι) < σ⁺(Ω_π). Its probability is bounded from above by the probability of the event "σ⁺(Ω_ι) ≤ nσ⁺ or σ⁺(Ω_π) ≥ nσ⁺" (for any n as above). Due to the inequality of Weyl (Theorem 1 in Stewart 1990; see below), the fact that Ω_π = Ω_{π|K} + Ω_{π|I(π)}, Lemma 1, and the fact that n − k ≤ n, the probability of this event is bounded from above by a quantity that vanishes as n grows.

The conclusion is that if we sample a permutation π uniformly from the set of permutations over [n], then with quite high likelihood (because the fraction of elements preserved under π becomes smaller as n becomes larger), the sum of the singular values of Ω_π will be smaller than the sum of the singular values of Ω_ι, i.e., of the matrix obtained when the xs and the zs are correctly aligned. This justifies our objective of aligning the xs and the zs by maximizing the singular values, following Proposition 1.

Weyl's inequality (Weyl, 1912) As mentioned by Stewart (1990), the following holds:

Lemma 3. Let A and E be two matrices, and let Ã = A + E. Let σ_i be the ith singular value of A and σ̃_i be the ith singular value of Ã. Then |σ̃_i − σ_i| ≤ ||E||₂.
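Weyl's inequality for singular values, |σ̃_i − σ_i| ≤ ||E||₂, is easy to verify numerically. A small sketch of ours with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))          # base matrix
E = 0.1 * rng.normal(size=(6, 4))    # perturbation

sv = lambda M: np.linalg.svd(M, compute_uv=False)
s, s_tilde = sv(A), sv(A + E)
spectral_norm_E = sv(E)[0]  # ||E||_2 is the largest singular value of E

# Weyl: perturbing A by E moves each singular value by at most ||E||_2
assert np.all(np.abs(s_tilde - s) <= spectral_norm_E + 1e-12)
```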

B Comprehensive Results on the BiasBench Datasets
We include more results for the SEAT dataset from BiasBench, and for the CrowS-Pairs and StereoSet datasets for bias categories other than gender. A description of the SEAT and GLUE datasets (with the metrics used) follows.
SEAT (May et al., 2019) SEAT is a sentence-level extension of WEAT (Caliskan et al., 2017), which is an association test between two categories of words: attribute word sets and target word sets. For example, an attribute word set (in the case of gender bias) could be a set of words such as { he, him, man }, while a target word set might be words related to office work, such as { career, office }. If we see a high association between an attribute word set and a target word set, we may claim that a particular gender bias is encoded. The final evaluation is calculated by measuring the similarity between the different attribute and target word sets. To extend WEAT to a sentence-level test, May et al. (2019) incorporated the WEAT attribute and target words into synthetic sentence templates.
We use an effect-size metric to report our results for SEAT. This measure is a normalized difference between the cosine similarities of the representations of the attribute words and the target words. Both the attribute words and the target words are split into two categories (for example, in relation to gender), so the difference is based on four terms, one per pair of category sets (target and attribute). An effect size closer to zero indicates that less bias is encoded in the representations.
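As a concrete reference, the effect size can be computed as in the sketch below (our rendering of the standard WEAT formula of Caliskan et al. (2017); the function and variable names are ours):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def assoc(w, A, B):
    # s(w, A, B): how much more strongly w associates with attribute
    # set A than with attribute set B, on average
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def effect_size(X, Y, A, B):
    # Normalized difference between the mean associations of the two
    # target word sets X and Y; values closer to 0 indicate less bias.
    s_X = [assoc(x, A, B) for x in X]
    s_Y = [assoc(y, A, B) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)
```

The denominator pools the per-word association scores of both target sets, so identical mean associations yield an effect size of zero regardless of scale.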
GLUE (Wang et al., 2019) We follow Meade et al. (2022) and use the GLUE benchmark to test the debiased models on an array of downstream tasks to validate their usability. GLUE is a highly popular benchmark for testing NLP models, containing a variety of tasks, such as classification tasks (e.g., sentiment analysis), similarity tasks (e.g., paraphrase identification), and inference tasks (e.g., question answering).
The following tables of results are included:

• Table 6 presents the StereoSet results for removing the race (a) and religion (b) guarded attributes.
• Tables 7, 8, and 9 describe the SEAT effect sizes for the gender, race, and religion cases, respectively.

Figure 3: A t-SNE visualization of the word embeddings before and after gender-information removal. In (a) we see that the embeddings naturally cluster into the corresponding genders.
(a) CrowS-Pairs gender stereotype scores (Stt. score) for language models debiased by different debiasing techniques and assignments; (b) CrowS-Pairs race stereotype scores; (c) CrowS-Pairs religion stereotype scores. A model is deemed least biased if its stereotype score is 50%. The colored numbers are calculated as ||b − 50| − |s − 50||, where b is the top-row score and s is the corresponding system score.

Figure 5: Accuracy of the AM steps (in identifying the correct assignment of inputs to guarded information) as a function of the iteration number. Shaded gray gives upper and lower bounds on the standard deviation over five runs with different seeds for the initial π. FastText refers to the BiasBios dataset, the BERT models are for the CrowS-Pairs dataset, and Emb. refers to the word embeddings dataset from §4.1.
presents our results. Overall, AMSAL reduces the gender and age gap in the predictions while not increasing the MAE by much.

Table 5: The performance of removing race information from the DeepMoji dataset, shown for two cases: with balanced ratios of race and sentiment (left) and with ratios of 0.8 for sentiment and 0.5 for race (right). In both cases, the total size of the dataset used is 30,000 examples. To evaluate performance on the unbalanced sentiment dataset, we use the F1 macro measure, because in an unbalanced dataset such as this one, a simple classifier that always returns one label will achieve an accuracy of 80%. Such a classifier would have an F1 macro score of 0.44.
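These two baseline figures (80% accuracy, 0.44 F1 macro for the always-one-label classifier) can be verified directly; a small self-contained sketch:

```python
y_true = [0] * 80 + [1] * 20   # unbalanced labels: 0.8 / 0.2 class ratio
y_pred = [0] * 100             # a classifier that always returns label 0

def f1(c):
    # precision, recall, and F1 of class c for y_pred against y_true
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_macro = (f1(0) + f1(1)) / 2
print(accuracy, round(f1_macro, 2))  # 0.8 0.44
```

The majority class scores F1 = 8/9 while the minority class scores 0, so the macro average is 4/9 ≈ 0.44 despite the 80% accuracy.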

Table 7: SEAT effect sizes for gender-debiased representations of BERT, ALBERT, RoBERTa, and GPT-2 models. Effect sizes closer to 0 indicate less biased model representations. Statistically significant effect sizes at p < 0.01 are denoted by *. The final column reports the average absolute effect size across all six gender SEAT tests for each debiased model.

Table 8: SEAT effect sizes for race-debiased representations of BERT, ALBERT, RoBERTa, and GPT-2 models. Effect sizes closer to 0 indicate less biased model representations. Statistically significant effect sizes at p < 0.01 are denoted by *. The final column reports the average absolute effect size across all race SEAT tests for each debiased model.

Table 9: SEAT effect sizes for religion-debiased representations of BERT, ALBERT, RoBERTa, and GPT-2 models. Effect sizes closer to 0 indicate less biased model representations. Statistically significant effect sizes at p < 0.01 are denoted by *. The final column reports the average absolute effect size across all religion SEAT tests for each debiased model.