Gender Bias in Machine Translation

Machine translation (MT) technology has facilitated our daily tasks by providing accessible shortcuts for gathering, elaborating and communicating information. However, it can suffer from biases that harm users and society at large. As a relatively new field of inquiry, gender bias in MT still lacks internal cohesion, which advocates for a unified framework to ease future research. To this end, we: i) critically review current conceptualizations of bias in light of theoretical insights from related disciplines, ii) summarize previous analyses aimed at assessing gender bias in MT, iii) discuss the mitigating strategies proposed so far, and iv) point toward potential directions for future work.


Introduction
Interest in understanding, assessing, and mitigating gender bias is steadily growing within the natural language processing (NLP) community, with recent studies showing how gender disparities affect language technologies. Sometimes, for example, coreference resolution systems fail to recognize women doctors (Zhao et al., 2017;Rudinger et al., 2018), image captioning models do not detect women sitting next to a computer (Hendricks et al., 2018), and automatic speech recognition works better with male voices (Tatman, 2017). Despite a prior disregard for such phenomena within research agendas (Cislak et al., 2018), it is now widely recognized that NLP tools encode and reflect controversial social asymmetries for many seemingly neutral tasks, machine translation (MT) included. Admittedly, the problem is not new (Frank et al., 2004). A few years ago, Schiebinger (2014) criticized the phenomenon of "masculine default" in MT after running one of her interviews through a commercial translation system. In spite of several feminine mentions in the text, she was repeatedly referred to by masculine pronouns. Gender-related concerns have also been voiced by online MT users, who noticed how commercial systems entrench social gender expectations, e.g., translating engineers as masculine and nurses as feminine (Olson, 2018).
With language technologies entering widespread use and being deployed at a massive scale, their societal impact has raised concern both within (Hovy and Spruit, 2016;Bender et al., 2021) and outside (Dastin, 2018) the scientific community. To take stock of the situation, Sun et al. (2019) reviewed NLP studies on the topic. However, their survey is based on monolingual applications, whose underlying assumptions and solutions may not be directly applicable to languages other than English (Zhou et al., 2019;Zhao et al., 2020;Takeshita et al., 2020) and cross-lingual settings. Moreover, MT is a multifaceted task, which requires resolving multiple gender-related subtasks at the same time (e.g., coreference resolution, named entity recognition). Hence, depending on the languages involved and the factors accounted for, gender bias has been conceptualized differently across studies. To date, gender bias in MT has been tackled by means of a narrow, problem-solving oriented approach. While technical countermeasures are needed, failing to adopt a wider perspective and engage with related literature outside of NLP can be detrimental to the advancement of the field (Blodgett et al., 2020).
In this paper, we intend to put such literature to use for the study of gender bias in MT. We go beyond surveys restricted to monolingual NLP (Sun et al., 2019) or more limited in scope (Costa-jussà, 2019;Monti, 2020), and present the first comprehensive review of gender bias in MT. In particular, we 1) offer a unified framework that introduces the concepts, sources, and effects of bias in MT, clarified in light of relevant notions on the relation between gender and different languages; 2) critically discuss the state of the research by identifying blind spots and key challenges.

arXiv:2104.06001v3 [cs.CL] 7 May 2021
Bias is a fraught term with partially overlapping, or even competing, definitions (Campolo et al., 2017). In cognitive science, bias refers to the possible outcome of heuristics, i.e., mental shortcuts that can be critical to support prompt reactions (Tversky and Kahneman, 1973, 1974). AI research borrowed from such a tradition (Rich and Gureckis, 2019; Rahwan et al., 2019) and conceived bias as the divergence from an ideal or expected value (Glymour and Herington, 2019; Shah et al., 2020), which can occur if models rely on spurious cues and unintended shortcut strategies to predict outputs (Schuster et al., 2019; McCoy et al., 2019; Geirhos et al., 2020). Since this can lead to systematic errors and/or adverse social effects, bias investigation is not only a scientific and technical endeavour but also an ethical one, given the growing societal role of NLP applications (Bender and Friedman, 2018). As Blodgett et al. (2020) recently called out, and as has been endorsed in other venues (Hardmeier et al., 2021), analysing bias is an inherently normative process which requires identifying what is deemed as harmful behavior, how, and to whom. Here, we stress a human-centered, sociolinguistically-motivated framing of bias. By drawing on the definition by Friedman and Nissenbaum (1996), we consider as biased an MT model that systematically and unfairly discriminates against certain individuals or groups in favor of others. We identify bias per specific model's behaviors, which are assessed by envisaging their potential risks when the model is deployed (Bender et al., 2021) and the harms that could ensue (Crawford, 2017), with people in focus (Bender, 2019). Since MT systems are employed daily by millions of individuals, they could impact a wide array of people in different ways.
As a guide, we rely on Crawford (2017), who defines two main categories of harms produced by a biased system: i) Representational harms (R), i.e., detraction from the representation of social groups and their identity, which, in turn, affects attitudes and beliefs; ii) Allocational harms (A), i.e., cases in which a system allocates or withholds opportunities or resources to certain groups. Considering the real-world instances of gender bias reported so far (Schiebinger, 2014; Olson, 2018) and those addressed in the MT literature reviewed in this paper, (R) can be further distinguished into under-representation and stereotyping.
Under-representation refers to the reduction of the visibility of certain social groups through language by i) producing a disproportionately low representation of women (e.g., most feminine entities in a text are misrepresented as male in translation); or ii) not recognizing the existence of non-binary individuals (e.g., when a system does not account for gender neutral forms). For such cases, the misrepresentation occurs in the language employed to talk "about" such groups. Also, this harm can imply the reduced visibility of the language used "by" speakers of such groups by iii) failing to reflect their identity and communicative repertoires. In these cases, an MT system flattens their communication and produces an output that indexes unwanted gender identities and social meanings (e.g., women and non-binary speakers are not referred to by their preferred linguistic expressions of gender).
Stereotyping regards the propagation of negative generalizations of a social group, e.g., belittling feminine representation to less prestigious occupations (teacher (Feminine) vs. lecturer (Masculine)), or in association with attractiveness judgments (pretty lecturer (Feminine)).
Such behaviors are harmful as they can directly affect the self-esteem of members of the target group (Bourguignon et al., 2015). Additionally, they can propagate to indirect stakeholders. For instance, if a system fosters the visibility of the way of speaking of the dominant group, MT users can presume that such a language represents the most appropriate or prestigious variant, at the expense of other groups and communicative repertoires. These harms can aggregate, and the ubiquitous embedding of MT in web applications provides us with paradigmatic examples of how the two types of (R) can interplay. For example, if women or non-binary scientists are the subjects of a query, automatically translated pages run the risk of referring to them via masculine-inflected job qualifications. Such misrepresentations can lead them to experience feelings of identity invalidation (Zimman et al., 2017). Also, users may not be aware of being exposed to MT mistakes due to the deceptively fluent output of a system (Martindale and Carpuat, 2018). In the long run, stereotypical assumptions and prejudices (e.g., only men are qualified for high-level positions) will be reinforced (Levesque, 2011; Régner et al., 2019).
Regarding (A), MT services are consumed by the general public and can thus be regarded as resources in their own right. Hence, (R) can directly imply (A) as a performance disparity across users in the quality of service, i.e., the overall efficiency of the service. Accordingly, a woman attempting to translate her biography by relying on an MT system requires additional energy and time to revise wrong masculine references. If such disparities are not accounted for, the MT field runs the risk of producing systems that prevent certain groups from fully benefiting from such technological resources.
In the following, we operationalize such categories to map studies on gender bias to their motivations and societal implications (Tables 1 and 2).

Understanding Bias
To confront bias in MT, it is vital to reach out to other disciplines that foregrounded how the sociocultural notions of gender interact with language(s), translation, and implicit biases. Only then can we discuss the multiple factors that concur to encode and amplify gender inequalities in language technology. Note that, except for , current studies on gender bias in MT have assumed an (often implicit) binary vision of gender. As such, our discussion is largely forced into this classification. Although we reiterate on bimodal feminine/masculine linguistic forms and social categories, we emphasize that gender encompasses multiple biosocial elements not to be conflated with sex (Risman, 2018;Fausto-Sterling, 2019), and that some individuals do not experience gender, at all, or in binary terms (Glen and Hurrell, 2012).

Gender and Language
The relation between language and gender is not straightforward. First, the linguistic structures used to refer to the extra-linguistic reality of gender vary across languages (§3.1.1). Moreover, how gender is assigned and perceived in our verbal practices depends on contextual factors as well as assumptions about social roles, traits, and attributes (§3.1.2). Finally, language is conceived as a tool for articulating and constructing personal identities (§3.1.3).
Notional gender languages 4 (e.g., Danish, English). On top of lexical gender (mom/dad), such languages display a system of pronominal gender (she/he, her/him). English also hosts some marked derivative nouns (actor/actress) and compounds (chairman/chairwoman).
Grammatical gender languages (e.g., Arabic, Spanish). In these languages, each noun pertains to a class such as masculine, feminine, and neuter (if present). Although for most inanimate objects gender assignment is only formal, 5 for human referents masculine/feminine markings are assigned on a semantic basis. Grammatical gender is defined by a system of morphosyntactic agreement, where several parts of speech beside the noun (e.g., verbs, determiners, adjectives) carry gender inflections.
In light of the above, the English sentence "He/She is a good friend" has no overt expression of gender in a genderless language like Turkish ("O iyi bir arkadaş"), whereas Spanish spreads several masculine or feminine markings ("Él/Ella es un/a buen/a amigo/a"). Although general, such macrocategories allow us to highlight typological differences across languages. These are crucial to frame gender issues in both human and machine translation. Also, they exhibit to what extent speakers of each group are led to think and communicate via binary distinctions, 6 as well as underline the relative complexity in carving out a space for lexical innovations which encode non-binary gender (Hord, 2016; Conrod, 2020). In this sense, while English is bringing the singular they in common use and developing neo-pronouns (Bradley et al., 2019), for grammatical gender languages like Spanish neutrality requires the development of neo-morphemes ("Elle es une buene amigue").
4 Also referred to as natural gender languages. Following McConnell-Ginet (2013), we prefer notional to avoid terminological overlapping with "natural", i.e., biological/anatomical sexual categories. For a wider discussion on the topic, see Nevalainen and Raumolin-Brunberg (1993); Curzan (2003).
5 E.g., "moon" is masculine in German, feminine in French.
6 Outside of the Western paradigm, there are cultures whose languages traditionally encode gender outside of the binary (Epple, 1998; Murray, 2003; Hall and O'Donovan, 2014).

Social Gender Connotations
To understand gender bias, we have to grasp not only the structure of different languages, but also how linguistic expressions are connoted, deployed, and perceived (Hellinger and Motschenbacher, 2015). In grammatical gender languages, feminine forms are often subject to a so-called semantic derogation (Schulz, 1975), e.g., in French, couturier (fashion designer) vs. couturière (seamstress). English is no exception (e.g., governor/governess).
Moreover, bias can lurk underneath seemingly neutral forms. Such is the case of epicene (i.e., gender neutral) nouns where gender is not grammatically marked. Here, gender assignment is linked to (typically binary) social gender, i.e., "the socially imposed dichotomy of masculine and feminine role and character traits" (Kramarae and Treichler, 1985). As an illustration, Danish speakers tend to pronominalize dommer (judge) with han (he) when referring to the whole occupational category (Gomard, 1995; Nissen, 2002). Social gender assignment varies across time and space (Lyons, 1977; Romaine, 1999; Cameron, 2003) and regards stereotypical assumptions about what is typical or appropriate for men and women. Such assumptions impact our perceptions (Hamilton, 1988; Gygax et al., 2008; Kreiner et al., 2008) and influence our behavior, e.g., leading individuals to identify with and fulfill stereotypical expectations (Wolter and Hannover, 2016; Sczesny et al., 2018), and verbal communication, e.g., women are often misquoted in the academic community (Krawczyk, 2017).
Translation studies highlight how social gender assignment influences translation choices (Jakobson, 1959; Chamberlain, 1988; Comrie, 1999; Di Sabato and Perri, 2020). Primarily, the problem arises from typological differences across languages and their gender systems. Nonetheless, socio-cultural factors also influence how translators deal with such differences. Consider the character of the cook in Daphne du Maurier's "Rebecca", whose gender is never explicitly stated in the whole book. In the absence of any available information, translators of five grammatical gender languages represented the character as either a man or a woman (Wandruszka, 1969; Nissen, 2002). Although extreme, this case can illustrate the situation of uncertainty faced by MT: the mapping of one-to-many forms in gender prediction. But, as discussed in §4.1, mistranslations occur when contextual gender information is available as well.

Gender and Language Use
Language use varies between demographic groups and reflects their backgrounds, personalities, and social identities (Labov, 1972; Trudgill, 2000; Pennebaker and Stone, 2003). In this light, the study of gender and language variation has received much attention in socio- and corpus linguistics (Holmes and Meyerhoff, 2003; Eckert and McConnell-Ginet, 2013). Research conducted in speech and text analysis highlighted several gender differences, which are exhibited at the phonological and lexical-syntactic level. For example, women rely more on hedging strategies ("it seems that"), purpose clauses ("in order to"), first-person pronouns, and prosodic exclamations (Mulac et al., 2001; Mondorf, 2002; Brownlow et al., 2003). Although some correspondences between gender and linguistic features hold across cultures and languages (Smith, 2003), it should be kept in mind that they are far from universal and should not be interpreted in a stereotyped and oversimplified manner (Bergvall et al., 1996; Nguyen et al., 2016; Koolen and van Cranenburgh, 2017).
Drawing on gender-related features proved useful to build demographically informed NLP tools (Garimella et al., 2019) and personalized MT models (Mirkin et al., 2015;Bawden et al., 2016;Rabinovich et al., 2017). However, using personal gender as a variable requires a prior understanding of which categories may be salient, and a critical reflection on how gender is intended and ascribed (Larson, 2017). Otherwise, if we assume that the only relevant (sexual) categories are "male" and "female", our models will inevitably fulfill such a reductionist expectation (Bamman et al., 2014).

Gender Bias in MT
To date, an overview of how several factors may contribute to gender bias in MT does not exist. We identify and clarify concurring problematic causes, accounting for the context in which systems are developed and used (§2). To this aim, we rely on the three overarching categories of bias described by Friedman and Nissenbaum (1996), which foreground different sources that can lead to machine bias. These are: pre-existing bias, rooted in our institutions, practices and attitudes (§3.2.1); technical bias, due to technical constraints and decisions (§3.2.2); and emergent bias, arising from the interaction between systems and users (§3.2.3). We consider such categories as placed along a continuum, rather than being discrete.

Pre-existing Bias
MT models are known to reflect gender disparities present in the data. However, reflections on such generally invoked disparities are often overlooked. Treating data as an abstract, monolithic entity (Gitelman, 2013), or relying on "overly broad/overloaded terms like training data bias" (Suresh and Guttag, 2019), does not encourage reasoning on the many factors of which data are the product, first and foremost the historical, sociocultural context in which they are generated.
A starting point to tackle these issues is the Europarl corpus (Koehn, 2005), where only 30% of sentences are uttered by women (Vanmassenhove et al., 2018). Such an imbalance is a direct window into the glass ceiling that has hampered women's access to parliamentary positions. This case exemplifies how data might be "tainted with historical bias", mirroring an "unequal ground truth" (Hacker, 2018). However, other gender variables are harder to spot and quantify.
Empirical linguistics research pointed out that subtle gender asymmetries are rooted in languages' use and structure. For instance, an important aspect regards how women are referred to. Femaleness is often explicitly invoked when there is no textual need to do so, even in languages that do not require overt gender marking. A case in point regards Turkish, which differentiates cocuk (child) and kiz cocugu (female child) (Braun, 2000). Similarly, in a corpus search, Romaine (2001) found 155 explicit female markings for doctor (female, woman or lady doctor), compared to only 14 male doctor. Feminist language critique provided extensive analysis of such a phenomenon by highlighting how referents in discourse are considered men by default unless explicitly stated (Silveira, 1980;Hamilton, 1991). Finally, prescriptive top-down guidelines limit the linguistic visibility of gender diversity, e.g., the Real Academia de la Lengua Española recently discarded the official use of non-binary innovations and claimed the functionality of masculine generics (Mundo, 2018;López et al., 2020).
By stressing such issues, we are not condoning the reproduction of pre-existing bias in MT. Rather, the above-mentioned concerns are the starting point to account for when dealing with gender bias.

Technical Bias
Technical bias comprises aspects related to data creation, model design, training and testing procedures. If present in training and testing samples, asymmetries in the semantics of language use and gender distribution are respectively learnt by MT systems and rewarded in their evaluation. However, as just discussed, biased representations are not merely quantitative, but also qualitative. Accordingly, straightforward procedures (e.g., balancing the number of speakers in existing datasets) do not ensure a fairer representation of gender in MT outputs. Since datasets are a crucial source of bias, it is also crucial to advocate for careful data curation (Mehrabi et al., 2019; Paullada et al., 2020; Hanna et al., 2021; Bender et al., 2021), guided by pragmatically- and socially-informed analyses (Hitti et al., 2019; Sap et al., 2020; Devinney et al., 2020) and annotation practices (Gaido et al., 2020).
Overall, while data can mirror gender inequalities and offer adverse shortcut learning opportunities, it is "quite clear that data alone rarely constrain a model sufficiently" (Geirhos et al., 2020), nor do they explain the fact that models overamplify (Shah et al., 2020) such inequalities in their outputs. Focusing on models' components, Costa-jussà et al. (2020b) demonstrate that architectural choices in multilingual MT impact the systems' behavior: shared encoder-decoders retain less gender information in the source embeddings and less diversion in the attention than language-specific encoder-decoders (Escolano et al., 2021), thus disfavoring the generation of feminine forms. While discussing the loss and decay of certain words in translation, Vanmassenhove et al. (2019, 2021) attest to the existence of an algorithmic bias that leads under-represented forms in the training data, as may be the case for feminine references, to further decrease in the MT output. Specifically, Roberts et al. (2020) prove that beam search, unlike sampling, is skewed toward the generation of more frequent (masculine) pronouns, as it leads models to an extreme operating point that exhibits zero variability.
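The contrast between mode-seeking decoding and sampling can be illustrated with a toy sketch. It is not the experimental setup of Roberts et al. (2020): the two-pronoun distribution, its probabilities, and all function names below are assumptions chosen only for illustration, and greedy selection stands in for beam search on a single decoding step.

```python
import random
from collections import Counter

# Hypothetical model distribution over two pronouns (illustrative numbers only).
PRONOUN_PROBS = {"he": 0.6, "she": 0.4}

def greedy_decode(probs):
    """Return the most probable token; on a single step, beam search of any width also returns it."""
    return max(probs, key=probs.get)

def sample_decode(probs, rng):
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def pronoun_counts(decode, n=10_000):
    """Count which pronouns a decoding strategy produces over n repeated calls."""
    return Counter(decode() for _ in range(n))

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy = pronoun_counts(lambda: greedy_decode(PRONOUN_PROBS))
sampled = pronoun_counts(lambda: sample_decode(PRONOUN_PROBS, rng))

print("greedy :", dict(greedy))
print("sampled:", dict(sampled))
```

Under these assumptions, the greedy strategy never emits the less frequent pronoun even though the model assigns it substantial probability, whereas sampling roughly reproduces the underlying distribution.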
Thus, efforts towards understanding and mitigating gender bias should also account for the model front. To date, this remains largely unexplored.

Emergent Bias
Emergent bias may arise when a system is used in a different context than the one it was designed for, e.g., when it is applied to another demographic group. From car crash dummies to clinical trials, we have evidence of how not accounting for gender differences leads to the creation of male-grounded products with dire consequences (Liu and Dipietro Mager, 2016; Criado-Perez, 2019), such as higher death and injury risks in vehicle crashes and less effective medical treatments for women. Similarly, unbeknownst to their creators, MT systems that are not intentionally envisioned for a diverse range of users will not generalize for the feminine segment of the population. Hence, in the interaction with an MT system, a woman will likely be misgendered or not have her linguistic style preserved. Other conditions of users/system mismatch may be the result of changing societal knowledge and values. A case in point regards Google Translate's historical decision to adjust its system for instances of gender ambiguity. Since its launch twenty years ago, Google had provided only one translation for single-word gender-ambiguous queries (e.g., professor translated in Italian with the masculine professore). In a community increasingly conscious of the power of language to hardwire stereotypical beliefs and women's invisibility (Lindqvist et al., 2019; Beukeboom and Burgers, 2019), the bias exhibited by the system was confronted with a new sensitivity. The service's decision (Kuczmarski, 2018) to provide a double feminine/masculine output (professor→professoressa|professore) stems from current demands for gender-inclusive resolutions. For the recognition of non-binary groups (Richards et al., 2016), we invite studies on how such modeling could be integrated with neutral strategies (§6).

Assessing Bias
The first accounts of gender bias in MT date back to Frank et al. (2004). Their manual analysis pointed out how English-German MT suffers from a dearth of linguistic competence, as it shows severe difficulties in recovering syntactic and semantic information to correctly produce gender agreement. Similar inquiries were conducted on other target grammatical gender languages for several commercial MT systems (Abu-Ayyash, 2017; Monti, 2017; Rescigno et al., 2020). While these studies focused on contrastive phenomena, Schiebinger (2014) 9 went beyond linguistic insights, calling for a deeper understanding of gender bias. Her article on Google Translate's "masculine default" behavior emphasized how such a phenomenon is related to the larger issue of gender inequalities, also perpetuated by socio-technical artifacts (Selbst et al., 2019). All in all, these qualitative analyses demonstrated that gender problems encompass all three MT paradigms (neural, statistical, and rule-based), preparing the ground for quantitative work.
To attest the existence and scale of gender bias across several languages, dedicated benchmarks, evaluations, and experiments have been designed. We first discuss large scale analyses aimed at assessing gender bias in MT, grouped according to two main conceptualizations: i) works focusing on the weight of prejudices and stereotypes in MT ( §4.1); ii) studies assessing whether gender is properly preserved in translation ( §4.2). In accordance with the human-centered approach embraced in this survey, in Table 1 we map each work to the harms (see §2) ensuing from the biased behaviors they assess. Finally, we review existing benchmarks for comparing MT performance across genders ( §4.3).

MT and Gender Stereotypes
In MT, we record prior studies concerned with pronoun translation and coreference resolution across typologically different languages accounting for both animate and inanimate referents (Hardmeier and Federico, 2010;Le Nagard and Koehn, 2010;Guillou, 2012). For the specific analysis on gender bias, instead, such tasks are exclusively studied in relation to human entities.
9 http://genderedinnovations.stanford.edu/casestudies/nlp.html
Prates et al. (2018) [...] random, yet they show a strong masculine skew. 10 To further analyze the under-representation of she pronouns, Prates et al. (2018) focus on 22 macro-categories of occupation areas and compare the proportion of pronoun predictions against the real-world proportion of men and women employed in such sectors. In this way, they find that MT not only yields a masculine default, but it also underestimates feminine frequency at a greater rate than occupation data alone suggest. Such an analysis starts by acknowledging pre-existing bias (see §3.2.1), e.g., low rates of women in STEM, to attest the existence of machine bias, and defines it as the exacerbation of actual gender disparities.
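The comparison between predicted pronoun proportions and labor statistics can be sketched as follows. This is only a minimal illustration of the idea, not the procedure of Prates et al. (2018): the pronoun counts, the employment share, and the function names are hypothetical.

```python
# Hypothetical counts of pronouns produced by an MT system for one occupation area.
mt_pronoun_counts = {"she": 10, "he": 90}

# Hypothetical real-world share of women employed in the same area.
real_world_feminine_share = 0.25

def feminine_share(counts):
    """Fraction of the gendered pronouns in the MT output that are feminine."""
    total = counts["she"] + counts["he"]
    return counts["she"] / total

def underestimation(mt_share, real_share):
    """Positive when MT yields feminine pronouns less often than employment data would suggest."""
    return real_share - mt_share

mt_share = feminine_share(mt_pronoun_counts)
gap = underestimation(mt_share, real_world_feminine_share)

print(f"MT feminine share: {mt_share:.2f}")
print(f"Real-world feminine share: {real_world_feminine_share:.2f}")
print(f"Underestimation: {gap:.2f}")
```

A positive gap in such a comparison would correspond to the exacerbation of an already existing disparity, rather than to its mere reproduction.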
Going beyond word lists and simple synthetic constructions, Gonen and Webster (2020) inspect the translation into Russian, Spanish, German, and French of natural yet ambiguous English sentences. Their analysis on the ratio and type of generated masculine/feminine job titles consistently exhibits social asymmetries for target grammatical gender languages (e.g., lecturer masculine vs. teacher feminine). Finally, Stanovsky et al. (2019) assess that MT is skewed to the point of actually ignoring explicit feminine gender information in source English sentences. For instance, MT systems yield a wrong masculine translation of the job title baker, although it is referred to by the pronoun she. Besides the overlooking of overt gender mentions, the model's reliance on unintended (and irrelevant) cues for gender assignment is further confirmed by the fact that adding a socially connoted, but formally epicene, adjective (the pretty baker) pushes models toward feminine inflections in translation.
We observe that the propagation of stereotypes is a widely researched form of gender asymmetries in MT, one that so far has been largely narrowed down to occupational stereotyping. After all, occupational stereotyping has been studied by different disciplines (Greenwald et al., 1998), attested across cultures (Lewis and Lupyan, 2020), and it can be easily detected in MT across multiple language directions with consistent results. Current research should not neglect other stereotyping dynamics, as in the case of Stanovsky et al. (2019) and Cho et al. (2019), who include associations to physical characteristics or psychological traits. Also, the intrinsically contextual nature of societal expectations advocates for the study of culture-specific dimensions of bias. Finally, we signal that the BERT-based perturbation method by Webster et al. (2019) identifies other bias-susceptible nouns that tend to be assigned to a specific gender (e.g., fighter as masculine). As Blodgett (2021) underscores, however, "the existence of these undesirable correlations is not sufficient to identify them as normatively undesirable". It should thus be investigated whether such statistical preferences can cause harms, e.g., by checking if they map to existing harmful associations or quality of service disparities.
10 Cho et al. (2019) highlight that a higher frequency of feminine references in the MT output does not necessarily imply a bias reduction. Rather, it may reflect gender stereotypes, as for hairdresser that is skewed toward feminine. This observation points to the tension between frequency count, suitable for testing under-representation, and qualitative-oriented analysis on bias conceptualized in terms of stereotyping.

MT and Gender Preservation
Vanmassenhove et al. (2018) and  investigate whether speakers' gender 11 is properly reflected in MT. This line of research is preceded by findings on gender personalization of statistical MT (Mirkin et al., 2015;Bawden et al., 2016;Rabinovich et al., 2017), which claim that gender "signals" are weakened in translation.  conjecture the existence of age and gender stylistic bias due to models' underexposure to the writings of women and younger segments of the population. To test this hypothesis, they automatically translate a corpus of online reviews with available metadata about users . Then, they compare such demographic information with the prediction of age and gender classifiers run on the MT output. Results indicate that different commercial MT models systematically make authors "sound" older and male. Their study thus concerns the under-representation of the language used "by" certain speakers and how it is perceived (Blodgett, 2021). However, the authors do not inspect which linguistic choices MT overproduces, nor which stylistic features may characterize different socio-demographic groups.
Still starting from the assumption that demographic factors influence language use, Vanmassenhove et al. (2018) probe MT's ability to preserve speaker's gender translating from English into ten languages. To this aim, they develop gender-informed MT models (see §5.1), whose outputs are compared with those obtained by their baseline counterparts. Tested on a set for spoken language translation (Koehn, 2005), their enhanced models show consistent gains in terms of overall quality when translating into grammatical gender languages, where speaker's references are often marked. For instance, the French translation of "I'm happy" is either "Je suis heureuse" or "Je suis heureux" for a female/male speaker respectively. Through a focused cross-gender analysis, carried out by splitting their English-French test set into 1st person male vs. female data, they assess that the largest margin of improvement for their gender-informed approach concerns sentences uttered by women, since the results of their baseline disclose a quality of service disparity in favor of male speakers. Besides morphological agreement, they also attribute such improvement to the fact that their enhanced model produces gendered preferences in other word choices. For instance, it opts for think rather than believe, which is in concordance with corpus studies claiming a tendency for women to use less assertive speech (Newman et al., 2008). Note that the authors rely on manual analysis to ascribe performance differences to gender-related features. In fact, global evaluations on generic test sets alone are inadequate to pointedly measure gender bias.

Existing Benchmarks
MT outputs are typically evaluated against reference translations employing standard metrics such as BLEU (Papineni et al., 2002) or TER (Snover et al., 2006). This procedure poses two challenges. First, these metrics provide coarse-grained scores for translation quality, as they treat all errors equally and are rather insensitive to specific linguistic phenomena (Sennrich, 2017). Second, generic test sets containing the same gender imbalance present in the training data can reward biased predictions. Here, we describe the publicly available MT Gender Bias Evaluation Testsets (GBETs) (Sun et al., 2019), i.e., benchmarks designed to probe gender bias by isolating the impact of gender from other factors that may affect systems' performance. Note that different benchmarks and metrics respond to different conceptualizations of bias (Barocas et al., 2019). Common to them all in MT, however, is that biased behaviors are formalized by using some variants of averaged performance disparities across gender groups, comparing the accuracy of gender predictions on an equal number of masculine, feminine, and neutral references.
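The notion of "averaged performance disparities across gender groups" can be illustrated with a minimal sketch. It assumes that every test item has already been reduced to a pair of gold and predicted gender labels; the function names and the toy data are hypothetical and do not correspond to the scoring script of any specific benchmark.

```python
from collections import defaultdict


def gender_accuracy_by_group(records):
    """Compute gender accuracy separately for each reference gender.

    records: iterable of (reference_gender, predicted_gender) pairs.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for reference, predicted in records:
        total[reference] += 1
        if predicted == reference:
            correct[reference] += 1
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Largest difference in accuracy between any two gender groups."""
    values = list(per_group.values())
    return max(values) - min(values)


# Toy, hand-made data: an equal number of masculine and feminine references.
records = [
    ("masculine", "masculine"), ("masculine", "masculine"),
    ("masculine", "masculine"), ("masculine", "feminine"),
    ("feminine", "feminine"), ("feminine", "masculine"),
    ("feminine", "masculine"), ("feminine", "feminine"),
]
per_group = gender_accuracy_by_group(records)
gap = accuracy_gap(per_group)
```

Keeping the groups equally sized, as the benchmarks above do, prevents the larger group from dominating an averaged score.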

Escudé Font and Costa-jussà (2019) developed the bilingual English-Spanish Occupations test set. It consists of 1,000 sentences equally distributed across genders. The phrasal structure envisioned for their sentences is "I've known {her|him|<proper noun>} for a long time, my friend works as {a|an} <occupation>". The evaluation focuses on the translation of the noun friend into Spanish (amigo/a). Since gender information is present in the source context and sentences are the same for both masculine/feminine participants, an MT system exhibits gender bias if it disregards relevant context and cannot provide the correct translation of friend at the same rate across genders.
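As an illustration of how such a template can be instantiated, the sketch below fills the phrasal structure quoted above. The referent and occupation lists are hypothetical placeholders rather than the actual content of the test set, and the article choice is a crude spelling-based heuristic.

```python
def indefinite_article(noun):
    """Pick "a" or "an" from the first letter (ignores pronunciation)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def occupation_sentence(referent, occupation):
    """Fill the template with one referent and one occupation."""
    article = indefinite_article(occupation)
    return (
        f"I've known {referent} for a long time, "
        f"my friend works as {article} {occupation}"
    )


REFERENTS = ["her", "him", "Mary"]   # hypothetical values
OCCUPATIONS = ["engineer", "nurse"]  # hypothetical values

# Every occupation is paired with every referent, so the sentences are
# identical across genders except for the gender cue itself.
sentences = [
    occupation_sentence(referent, occupation)
    for occupation in OCCUPATIONS
    for referent in REFERENTS
]
```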

Stanovsky et al. (2019) created WinoMT by concatenating two existing English GBETs for coreference resolution (Rudinger et al., 2018;Zhao et al., 2018a). The corpus consists of 3,888 Winogradesque sentences presenting two human entities defined by their role and a subsequent pronoun that needs to be correctly resolved to one of the entities (e.g., "The lawyer yelled at the hairdresser because he did a bad job"). For each sentence, there are two variants with either he or she pronouns, so as to cast the referred annotated entity (hairdresser) into a proto- or anti-stereotypical gender role. By translating WinoMT into grammatical gender languages, one can thus measure systems' ability to resolve the anaphoric relation and pick the correct feminine/masculine inflection for the occupational noun. On top of quantifying under-representation as the difference between the total amount of translated feminine and masculine references, the subdivision of the corpus into proto- and anti-stereotypical sets also allows verifying if MT predictions correlate with occupational stereotyping.
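The two measurements described for WinoMT can be sketched as follows. The sketch follows the description given in this section (a count difference and an accuracy difference between the two subsets) and is not the official scoring script; the item format, the label strings, and the key names are assumptions, and the gender of each translated entity is taken as already extracted.

```python
def accuracy(items):
    """Share of items whose predicted gender matches the gold gender."""
    return sum(item["gold"] == item["pred"] for item in items) / len(items)


def winomt_style_scores(items):
    """Overall accuracy, stereotype gap, and masculine/feminine count gap."""
    proto = [item for item in items if item["stereotype"] == "proto"]
    anti = [item for item in items if item["stereotype"] == "anti"]
    n_feminine = sum(item["pred"] == "feminine" for item in items)
    n_masculine = sum(item["pred"] == "masculine" for item in items)
    return {
        "accuracy": accuracy(items),
        # Positive values: the system does better on stereotypical roles.
        "stereotype_gap": accuracy(proto) - accuracy(anti),
        # Positive values: masculine forms are produced more often.
        "masculine_minus_feminine": n_masculine - n_feminine,
    }


# Toy, hand-made predictions for four sentences.
items = [
    {"gold": "masculine", "pred": "masculine", "stereotype": "proto"},
    {"gold": "feminine", "pred": "feminine", "stereotype": "proto"},
    {"gold": "feminine", "pred": "masculine", "stereotype": "anti"},
    {"gold": "masculine", "pred": "masculine", "stereotype": "anti"},
]
scores = winomt_style_scores(items)
```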
Finally,  enriched the original version of WinoMT in two different ways. First, they included a third gender-neutral case based on the singular they pronoun, thus paving the way to account for non-binary referents. Second, they labeled the entity in the sentence which is not coreferent with the pronoun (lawyer). The latter annotation is used to verify the shortcomings of some mitigating approaches as discussed in §5.
The above-mentioned corpora are known as challenge sets, consisting of sentences created ad hoc for diagnostic purposes. In this way, they can be used to quantify bias related to stereotyping and under-representation in a sound environment. However, since they consist of a limited variety of synthetic gender-related phenomena, they hardly address the variety of challenges posed by real-world language and are relatively easy to overfit. As recognized by Rudinger et al. (2018), "they may demonstrate the presence of gender bias in a system, but not prove its absence".
The Arabic Parallel Gender Corpus (Habash et al., 2019) includes an English-Arabic test set retrieved from OpenSubtitles natural language data (Lison and Tiedemann, 2016). Each of the 2,448 sentences in the set exhibits a first person singular reference to the speaker (e.g., "I'm rich"). Among them, ∼200 English sentences require gender agreement to be assigned in translation. These were translated into Arabic in both gender forms, obtaining a quantitatively and qualitatively equal amount of sentence pairs with annotated masculine/feminine references. This natural corpus thus allows for cross-gender evaluations on MT production of correct speaker's gender agreement.
MuST-SHE (Bentivogli et al., 2020) is a natural benchmark for three language pairs (English-French/Italian/Spanish). Built on TED talks data (Cattoni et al., 2021), for each language pair it comprises ∼1,000 (audio, transcript, translation) triplets, thus allowing evaluation for both MT and speech translation (ST). Its samples are balanced between masculine and feminine phenomena, and incorporate two types of constructions: i) sentences referring to the speaker (e.g., "I was born in Mumbai"), and ii) sentences that present contextual information to disambiguate gender (e.g., "My mum was born in Mumbai"). Since every gender-marked word in the target language is annotated in the corpus, MuST-SHE grants the advantage of complementing BLEU- and accuracy-based evaluations on gender translation for a great variety of phenomena.
Unlike challenge sets, natural corpora quantify whether MT yields reduced feminine representation in authentic conditions and whether the quality of service varies across speakers of different genders. However, as they treat all gender-marked words equally, it is not possible to identify if the model is propagating stereotypical representations.
All in all, we stress that each test set and metric is only a proxy for framing a phenomenon or an ability (e.g., anaphora resolution), and an approximation of what we truly intend to gauge. Thus, as we discuss in §6, advances in MT should account for the observation of gender bias in real-world conditions, so that high scores on a mathematically formalized estimate do not lead to a false sense of security. Still, benchmarks remain valuable tools to monitor models' behavior. As such, we remark that evaluation procedures ought to cover both models' general performance and gender-related issues. This is crucial to establish the capabilities and limits of mitigating strategies.

Mitigating Bias
To attenuate gender bias in MT, different strategies dealing with input data, learning algorithms, and model outputs have been proposed. As attested by Birhane et al. (2020), since advancements are oftentimes exclusively reported in terms of values internal to the machine learning field (e.g., efficiency, performance), it is not clear how such strategies are meeting societal needs by reducing MT-related harms. In order to reconcile technical perspectives with the intended social purpose, in Table 2 we map each mitigating approach to the harms (see §2) they are meant to alleviate, as well as to the benchmark their effectiveness is evaluated against. Complementarily, we hereby describe each approach by means of two categories: model debiasing (§5.1) and debiasing through external components (§5.2).
Table 2 caption (truncated): ... in binary terms (b), or including non-binary (nb) identities. Finally, we indicate which (R)epresentational (under-representation and stereotyping) or (A)llocational Harm (reduced quality of service) the approach attempts to mitigate.

Model Debiasing
This line of work focuses on mitigating gender bias through architectural changes of general-purpose MT models or via dedicated training procedures.
Gender tagging. To improve the generation of speaker's referential markings, Vanmassenhove et al. (2018) prepend a gender tag (M or F) to each source sentence, both at training and inference time. As their model is able to leverage this additional information, the approach proves useful to handle morphological agreement when translating from English into French. However, this solution requires additional metadata regarding the speakers' gender that might not always be feasible to acquire. Automatic annotation of speakers' gender (e.g., based on first names) is not advisable, as it runs the risk of introducing additional bias by making unlicensed assumptions about one's identity. Elaraby et al. (2018) bypass this risk by defining a comprehensive set of cross-lingual gender agreement rules based on POS tagging. In this way, they identify speakers' and listeners' gender references in an English-Arabic parallel corpus, which is consequently labeled and used for training. The idea, originally developed for spoken language translation in a two-way conversational setting, can be adapted for other languages and scenarios by creating new dedicated rules. However, in realistic deployment conditions where reference translations are not available, gender information still has to be externally supplied as metadata at inference time. Stafanovičs et al. (2020) and  explore the use of word-level gender tags. While Stafanovičs et al. (2020) just report a gender translation improvement,  rely on the expanded version of WinoMT to identify a problem concerning gender tagging: it introduces noise if applied to sentences with references to multiple participants, as it pushes their translation toward the same gender.  also include a first non-binary exploration of neutral translation by exploiting an artificial dataset, where neutral tags are added and gendered inflections are replaced by placeholders. The results are however inconclusive, most likely due to the small size and synthetic nature of their dataset.
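Sentence-level gender tagging amounts to a small preprocessing step applied identically at training and inference time. The sketch below is a minimal illustration: the surface form of the tags ("<F>", "<M>") and the metadata values are assumptions, since the text only states that an M or F tag is prepended.

```python
# Hypothetical mapping from speaker metadata to tag tokens.
GENDER_TAGS = {"female": "<F>", "male": "<M>"}


def add_gender_tag(source_sentence, speaker_gender):
    """Prepend a speaker-gender tag to a source sentence.

    If no usable metadata is available, the sentence is returned unchanged
    instead of guessing the speaker's gender.
    """
    tag = GENDER_TAGS.get(speaker_gender)
    if tag is None:
        return source_sentence
    return f"{tag} {source_sentence}"


tagged = add_gender_tag("I'm happy", "female")
untagged = add_gender_tag("I'm happy", None)
```

Leaving untagged sentences untouched reflects the concern raised above: inferring gender automatically, for instance from first names, risks introducing additional bias.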
Adding context. Without further information needed for training or inference, Basta et al. (2020) adopt a generic approach and concatenate each sentence with its preceding one. By providing more context, they attest a slight improvement in gender translations requiring anaphoric coreference to be solved in English-Spanish. This finding motivates exploration at the document level, but it should be validated with manual (Castilho et al., 2020) and interpretability analyses, since the added context can be beneficial for gender-unrelated reasons, such as acting as a regularization factor (Kim et al., 2019).
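Concatenating each sentence with its predecessor can be sketched as a simple pass over a document. The separator token "<sep>" and the handling of the first sentence are assumptions made for illustration, not details taken from Basta et al. (2020).

```python
def with_previous_sentence(sentences, separator="<sep>"):
    """Return each sentence preceded by the one before it.

    The first sentence of the document has no predecessor and is kept as is.
    """
    augmented = []
    previous = None
    for sentence in sentences:
        if previous is None:
            augmented.append(sentence)
        else:
            augmented.append(f"{previous} {separator} {sentence}")
        previous = sentence
    return augmented


document = ["The doctor arrived.", "She was late."]
augmented = with_previous_sentence(document)
```

In this toy document, the second input now contains both the pronoun and its antecedent, which is the kind of cross-sentence cue the approach relies on.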
Debiased word embeddings. The two above-mentioned mitigations share the same intent: supply the model with additional gender knowledge. Instead, Escudé Font and Costa-jussà (2019) leverage pre-trained word embeddings, which are debiased by using the hard-debiasing method proposed by Bolukbasi et al. (2016) or the GN-GloVe algorithm (Zhao et al., 2018b). These methods respectively remove gender associations or isolate them from the representations of English gender-neutral words. Escudé Font and Costa-jussà (2019) employ such embeddings on the decoder side, the encoder side, and both sides of an English-Spanish model. The best results are obtained by leveraging GN-GloVe embeddings on both encoder and decoder sides, increasing BLEU scores and gender accuracy. The authors generically apply debiasing methods developed for English also to their target language. However, since Spanish is a grammatical gender language, other language-specific approaches should be considered to preserve the quality of the original embeddings (Zhou et al., 2019;Zhao et al., 2020). We also stress that it is debated whether depriving systems of some knowledge and "blinding" their perceptions is the right path toward fairer language models (Dwork et al., 2012;Caliskan et al., 2017;Gonen and Goldberg, 2019;Nissim and van der Goot, 2020). Also, Goldfarb-Tarrant et al. (2020) find that there is no reliable correlation between intrinsic evaluations of bias in word embeddings and cascaded effects on MT models' biased behavior.
Balanced fine-tuning. Costa-jussà and de Jorge (2020) rely on Gebiotoolkit (Costa-jussà et al., 2020c) to build gender-balanced datasets (i.e., featuring an equal amount of masculine/feminine references) based on Wikipedia biographies. By fine-tuning their models on such natural and more even data, the generation of feminine forms is overall improved. However, the approach is not as effective for gender translation on the anti-stereotypical WinoMT set. As discussed in §3.2.2, they employ a straightforward method that aims to increase the amount of feminine Wikipedia pages in their training data. However, such coverage increase does not mitigate stereotyping harms, as it does not account for the qualitatively different ways in which men and women are portrayed (Wagner et al., 2015).

Debiasing through External Components
Instead of directly debiasing the MT model, these mitigating strategies intervene in the inference phase with external dedicated components. Such approaches do not imply retraining, but introduce the additional cost of maintaining separate modules and handling their integration with the MT model.
Black-box injection. Moryossef et al. (2019) attempt to control the production of feminine references to the speaker and number inflections (plural or singular) for the listener(s) in an English-Hebrew spoken language setting. To this aim, they rely on a short construction, such as "she said to them", which is prepended to the source sentence and then removed from the MT output. Their approach is simple, can handle two types of information (gender and number) for multiple entities (speaker and listener), and improves systems' ability to generate feminine target forms. However, as in the case of Vanmassenhove et al. (2018) and Elaraby et al. (2018), it requires metadata about speakers and listeners.
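The prepend-then-remove idea can be sketched around any black-box translation function. Everything below is illustrative: `translate` is a placeholder for an arbitrary MT system, the toy translator simply upper-cases its input, and locating the translated prefix by exact string match is an assumption rather than the procedure of Moryossef et al. (2019).

```python
def translate_with_injection(translate, source, prefix, translated_prefix):
    """Translate `source` with a disambiguating prefix, then strip it.

    translate: callable mapping a source string to a target string.
    prefix: construction prepended to the source (e.g. "she said to them").
    translated_prefix: how the prefix is expected to appear in the output.
    """
    output = translate(f"{prefix} {source}")
    if output.startswith(translated_prefix):
        return output[len(translated_prefix):].lstrip()
    # Prefix not found where expected: return the output untouched.
    return output


def toy_translate(text):
    """Stand-in for an MT system; it only upper-cases the input."""
    return text.upper()


result = translate_with_injection(
    toy_translate, "I am happy", "she said to them", "SHE SAID TO THEM"
)
```

Because the MT system is only called, never modified, the same wrapper applies to any commercial or in-house model, which is what makes the approach black-box.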
Lattice re-scoring. Saunders and Byrne (2020) propose to post-process the MT output with a lattice re-scoring module. This module exploits a transducer to create a lattice by mapping gender-marked words in the MT output to all their possible inflectional variants. In their setup, developed for German, Spanish, and Hebrew, all the sentences corresponding to the paths in the lattice are re-scored with another model, which has been gender-debiased but at the cost of lower generic translation quality. Then, the sentence with the highest probability is picked as the final output. When tested on WinoMT, such an approach leads to an increase in the accuracy of gender forms selection. Note that the gender-debiased system is created by fine-tuning the model on an ad hoc built tiny set containing a balanced amount of masculine/feminine forms. Such an approach, also known as counterfactual data augmentation (Lu et al., 2020), requires creating identical pairs of sentences differing only in terms of gender references. In fact, Saunders and Byrne (2020) compile English sentences following this schema: "The <profession> finished <his|her> work". Then, the sentences are automatically translated and manually checked. In this way, they obtain a gender-balanced parallel corpus. Thus, to implement their method for other language pairs, the generation of new data is necessary. For the fine-tuning set, the effort required is limited as the goal is to alleviate stereotypes by focusing on a pre-defined occupational lexicon. However, data augmentation is very demanding for complex sentences that represent a rich variety of gender agreement phenomena such as those occurring in natural language scenarios.
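The re-scoring step can be illustrated with a toy sketch that enumerates all paths through a small lattice and keeps the best-scoring one. A plain dictionary stands in for the transducer and a hand-written function stands in for the gender-debiased model; both, like the Spanish example, are assumptions made for illustration only.

```python
from itertools import product


def rescore_variants(tokens, variants, score):
    """Pick the best-scoring sentence among all inflectional variants.

    tokens: the MT output as a list of words.
    variants: maps a gender-marked word to all of its inflectional forms.
    score: callable assigning a score to a candidate token list.
    """
    options = [variants.get(token, [token]) for token in tokens]
    candidates = [list(path) for path in product(*options)]
    return max(candidates, key=score)


def toy_score(tokens):
    """Stand-in for the re-scoring model: it prefers feminine forms here."""
    feminine_forms = {"la", "médica"}
    return sum(token in feminine_forms for token in tokens)


mt_output = ["el", "médico", "terminó", "su", "trabajo"]
variants = {"el": ["el", "la"], "médico": ["médico", "médica"]}
best = rescore_variants(mt_output, variants, toy_score)
```

Only gender-marked words branch in the lattice, so the remaining words of the original translation are preserved verbatim. Exhaustive enumeration is used here for brevity; the number of paths grows exponentially with the number of gender-marked words.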
Gender re-inflection. i) a two-step system that first identifies the gender of 1st person references in an MT output, and then re-inflects them in the opposite form; ii) a single-step system that always produces both forms from an MT output. Their method does not necessarily require speakers' gender information: if metadata are supplied, the MT output is re-inflected accordingly; otherwise, both feminine/masculine inflections are offered (leaving to the user the choice of the appropriate one). The implementation of the re-inflection component was made possible by the Arabic Parallel Gender Corpus (see §4.3), which demanded an expensive work of manual data creation. However, such a corpus grants research on English-Arabic the benefits of a wealth of gender-informed natural language data that have been curated to avoid hetero-centrist interpretations and preconceptions (e.g., proper names and speakers of sentences like "that's my wife" are flagged as gender-ambiguous). Along the same line, Google Translate also delivers two outputs for short gender-ambiguous queries (Johnson, 2020b). Among languages with grammatical gender, the service is currently available only for English-Spanish.
In light of the above, we remark that there is no conclusive state-of-the-art method for mitigating bias. The discussed interventions in MT tend to respond to specific aspects of the problem with modular solutions, but if and how they can be integrated within the same MT system remains unexplored. As we have discussed throughout the survey, the umbrella term "gender bias" refers to a wide array of undesirable phenomena. Thus, it is unlikely that a one-size-fits-all solution will be able to tackle problems that differ from one another, as they depend on, e.g., how bias is conceptualized, the language combinations, and the kinds of corpora used. As a result, we believe that generalization and scalability should not be the only criteria against which mitigating strategies are valued. Conversely, we should make room for openly context-aware interventions. Finally, gender bias in MT is a socio-technical problem. We thus highlight that engineering interventions alone are not a panacea and should be integrated with long-term multidisciplinary commitment and practices (D'Ignazio and Klein, 2020;Gebru, 2020) necessary to address bias in our community, hence in its artifacts, too.

Conclusion and Key Challenges
As studies confronting gender bias in MT are rapidly emerging, in this paper we presented them within a unified framework to critically overview current conceptualizations and approaches to the problem. Since gender bias is a multifaceted and interdisciplinary issue, in our discussion we integrated knowledge from related disciplines, which can be instrumental to guide future research and make it thrive. We conclude by suggesting several directions that can help this field move forward.
Model de-biasing. Neural networks rely on easy-to-learn shortcuts or "cheap tricks" (Levesque, 2014), as picking up on spurious correlations offered by training data can be easier for machines than learning to actually solve a specific task. What is "easy to learn" for a model depends on the inductive bias (Sinz et al., 2019;Geirhos et al., 2020) resulting from architectural choices, training data and learning rules. We think that explainability techniques (Belinkov et al., 2020) represent a useful tool to identify spurious cues (features) exploited by the model during inference. Discerning them can provide the research community with guidance on how to improve models' generalization by working on data, architectures, loss functions and optimizations. For instance, data responsible for spurious features (e.g., stereotypical correlations) might be recognized and their weight at training time might be lowered (Karimi Mahabadi et al., 2020). Besides, state-of-the-art architectural choices and algorithms in MT have mostly been studied in terms of overall translation quality without specific analyses regarding gender translation. For instance, current systems segment text into subword units with statistical methods that can break the morphological structure of words, thus losing relevant semantic and syntactic information in morphologically-rich languages (Niehues et al., 2016;Ataman et al., 2017). Several languages show complex feminine forms, typically derivative and created by adding a suffix to the masculine form, such as Lehrer/Lehrerin (de), studente/studentessa (it). It would be relevant to investigate whether, compared to other segmentation techniques, statistical approaches disadvantage (rarer and more complex) feminine forms. The MT community should not overlook focused hypotheses of such kind, as they can deepen our comprehension of the gender bias conundrum.
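The suggestion of lowering the weight of training data responsible for spurious features can be sketched as a weighted average of per-example losses. This is a generic re-weighting scheme written for illustration, not the method of Karimi Mahabadi et al. (2020); the flags marking spurious examples and the default weight are assumed to be given.

```python
def weighted_mean_loss(losses, is_spurious, spurious_weight=0.5):
    """Average per-example losses, down-weighting flagged examples.

    losses: per-example loss values.
    is_spurious: booleans marking examples believed to carry spurious cues.
    spurious_weight: weight (between 0 and 1) given to flagged examples.
    """
    weights = [spurious_weight if flag else 1.0 for flag in is_spurious]
    weighted_sum = sum(w * loss for w, loss in zip(weights, losses))
    return weighted_sum / sum(weights)


# Toy values: the second example is flagged as stereotypical.
loss = weighted_mean_loss([2.0, 4.0], [False, True])
```

With a weight of 1.0 the function reduces to the ordinary mean, so the same training loop can be used with and without re-weighting.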
Non-textual modalities. Gender bias for non-textual automatic translations (e.g., audiovisual) has been largely neglected. In this sense, ST represents a small niche (Costa-jussà et al., 2020a). For the translation of speaker-related gender phenomena, Bentivogli et al. (2020) prove that direct ST systems exploit speaker's vocal characteristics as a gender cue to improve feminine translation. However, as addressed by Gaido et al. (2020), relying on physical gender cues (e.g., pitch) for such a task implies reductionist gender classifications (Zimman, 2020), making systems potentially harmful for a diverse range of users. Similarly, although image-guided translation has been claimed useful for gender translation since it relies on visual inputs for disambiguation (Frank et al., 2018;Ive et al., 2019), it could bend toward stereotypical assumptions about appearance. Further research should explore such directions to identify potential challenges and risks, by drawing on bias in image captioning (van Miltenburg, 2019) and consolidated studies from the fields of automatic gender recognition and human-computer interaction (HCI) (Hamidi et al., 2018;Keyes, 2018;May, 2019).
Beyond Dichotomies. Besides a few notable exceptions for English NLP tasks (Manzini et al., 2019;Cao and Daumé III, 2020;Sun et al., 2021) and one in MT , the discussion around gender bias has been reduced to the binary masculine/feminine dichotomy. Although research in this direction is currently hampered by the absence of data, we invite considering inclusive solutions and exploring nuanced dimensions of gender. Starting from language practices, Indirect Non-binary Language (INL) overcomes gender specifications (e.g., using service, humankind rather than waiter/waitress or mankind); INL suggestions have also been recently implemented within Microsoft text editors (Langston, 2020). Whilst more challenging, INL can be achieved also for grammatical gender languages (Motschenbacher, 2014;Lindqvist et al., 2019), and it is endorsed for official EU documents (Papadimoulis, 2018). Accordingly, MT models could be brought to avoid binary forms and move toward gender-unspecified solutions. For example, adversarial networks including a discriminator that classifies speaker's linguistic expression of gender (masculine or feminine) could be employed to "neutralize" speaker-related forms (Li et al., 2018;Delobelle et al., 2020). Conversely, Direct Non-binary Language (DNL) aims at increasing the visibility of non-binary individuals via neologisms and neomorphemes (Bradley et al., 2019;Papadopoulos, 2019;Knisely, 2020). With DNL starting to circulate (Shroy, 2016;Santiago, 2018;López, 2019), the community is presented with the opportunity to promote the creation of inclusive data.
Finally, as already highlighted in legal and social science theory, discrimination can arise from the intersection of multiple identity categories (e.g., race and gender) (Crenshaw, 1989) which are not additive and cannot always be detected in isolation (Schlesinger et al., 2017). Following the MT work by , as well as other intersectional analyses from NLP (Herbelot et al., 2012;Jiang and Fellbaum, 2020) and AI-related fields (Buolamwini and Gebru, 2018), future studies may account for the interaction of gender attributes with other sociodemographic classes.
Human-in-the-loop. Research on gender bias in MT is still restricted to lab tests. As such, unlike other studies that rely on participatory design (Turner et al., 2015;Cercas Curry et al., 2020;Liebling et al., 2020), the advancement of the field is not measured with people's experience in focus or in relation to specific deployment contexts. However, these are fundamental considerations to guide the field forward and, as HCI studies show (Vorvoreanu et al., 2019), to propel the creation of gender-inclusive technology. In particular, representational harms are intrinsically difficult to estimate and available benchmarks only provide a rough idea of their extent. This advocates for focused studies on their individual or aggregate effects in everyday life (to the best of our knowledge, the Gender-Inclusive Language Models Survey is the first project of this kind that includes MT). Also, we invite the whole development process to be paired with bias-aware research methodology (Havens et al., 2020) and HCI approaches (Stumpf et al., 2020), which can help to operationalize sensitive attributes like gender (Keyes et al., 2021). Finally, MT is not only built for people, but also by people. Thus, it is vital to reflect on the implicit biases and backgrounds of the people involved in MT pipelines at all stages and how they could be reflected in the model. This means starting from bottom-level countermeasures, engaging with translators (De Marco and Toto, 2019;Lessinger, 2020) and annotators (Waseem, 2016;Geva et al., 2019), considering everyone's subjective positionality and, crucially, also the lack of diversity within technology teams (Schluter, 2018;Waseem et al., 2020).