Warning: This paper contains examples of stereotypes and biases.

The Bias Benchmark for Question Answering (BBQ) is designed to evaluate social biases of language models (LMs), but it is not simple to adapt this benchmark to cultural contexts other than the US because social biases depend heavily on the cultural context. In this paper, we present KoBBQ, a Korean bias benchmark dataset, and we propose a general framework that addresses considerations for cultural adaptation of a dataset. Our framework includes partitioning the BBQ dataset into three classes—Simply-Transferred (can be used directly after cultural translation), Target-Modified (requires localization in target groups), and Sample-Removed (does not fit Korean culture)—and adding four new categories of bias specific to Korean culture. We conduct a large-scale survey to collect and validate the social biases and the targets of the biases that reflect the stereotypes in Korean culture. The resulting KoBBQ dataset comprises 268 templates and 76,048 samples across 12 categories of social bias. We use KoBBQ to measure the accuracy and bias scores of several state-of-the-art multilingual LMs. The results clearly show differences in the bias of LMs as measured by KoBBQ and a machine-translated version of BBQ, demonstrating the need for and utility of a well-constructed, culturally aware social bias benchmark.

The evaluation of social bias and stereotypes in generative language models through question answering (QA) has quickly gained importance as it can help estimate bias in downstream tasks. For English, the Bias Benchmark for Question Answering (BBQ) (Parrish et al., 2022) has been widely used in evaluating inherent social bias within large language models (LLMs) through the QA task (Liang et al., 2023; Srivastava et al., 2023). Similarly, there has been an attempt to develop a Chinese benchmark (CBBQ) (Huang and Xiong, 2023). However, there are currently no benchmarks for other languages (and their respective cultural contexts), including Korean.

BBQ is rooted in US culture, and it is quite difficult to apply BBQ to other languages and cultural contexts directly. Cultural differences can affect the contexts, types, and targets of stereotypes. For example, the stereotype of drug use is associated with low socio-economic status (SES) in BBQ, while it is associated with high SES in Korea, as shown in Figure 1. Moreover, the quality of translation can impact the QA performance of LMs. Several studies (Lin et al., 2021; Ponti et al., 2020) have highlighted the serious shortcomings of relying solely on machine-translated datasets. Therefore, constructing benchmarks to assess bias in a different cultural context requires a more sensitive and culturally aware approach.

Figure 1: 

BBQ and KoBBQ assess LMs’ bias by asking the model discriminatory questions with ambiguous or disambiguated context. Different cultures may have different contexts or groups associated with social bias, resulting in differences between BBQ and KoBBQ.

In this paper, we propose a process for developing culturally adaptive datasets and present KoBBQ (Korean Bias Benchmark for Question Answering) that reflects the situations and social biases in South Korea. Our methodology builds upon the English BBQ dataset while taking into account the specific cultural nuances and social biases that exist in Korean society. We leverage cultural transfer techniques, adding Korea-specific stereotypes and validating the dataset through a large-scale survey. We categorize BBQ samples into three groups for cultural transformation: Sample-Removed, Target-Modified, and Simply-Transferred. We exclude Sample-Removed samples from the dataset since they include situations and biases not present in Korean culture. For the Target-Modified samples, we conduct a survey in South Korea and use the results to modify the samples. Additionally, we enrich the dataset by adding samples with four new categories (Domestic Area of Origin, Family Structure, Political Orientation, and Educational Background), referring to these samples as Newly-Created. For each stereotype, we ask 100 South Koreans to choose the target group if the stereotype exists in South Korea, and we exclude the samples if more than half of the people report having no related stereotypes or the skew towards one target group is less than a threshold. The final KoBBQ contains 76,048 samples with 268 templates across 12 categories.1

Our research proposes diverse approaches for analyzing social bias within LMs. Using KoBBQ, we evaluate and compare various existing multilingual LLMs and Korean-specialized LLMs. We simultaneously assess QA performance and bias by utilizing a bias score correlating with the accuracy. In addition, we analyze the response patterns of the LLMs to certain social categories. Our research also indicates that most LLMs have high bias scores on Newly-Created samples, implying that KoBBQ addresses culture-specific situations that existing LMs have overlooked. By comparing KoBBQ with machine-translated BBQ, we find distinctive characteristics in model performance and bias score, highlighting the importance of a hand-built dataset in bias detection.

Our main contributions include:

  • We propose a pipeline for cultural adaptation of existing social benchmark datasets into another culture. This process enables dataset construction more aligned with different cultural contexts, leading to more accurate and comprehensive bias measurement.

  • We present KoBBQ, a hand-built dataset for measuring intrinsic social biases of LMs considering social contexts in Korea. It will serve as a valuable resource to assess and understand bias in the Korean language context.

  • We evaluate and provide comprehensive analyses on existing state-of-the-art Korean and multilingual LMs in diverse ways by measuring performances and bias scores.

2.1 Social Bias in LLMs

Social bias refers to disparate treatment or outcomes between social groups that arise from historical and structural power asymmetries (Gallegos et al., 2023). These biases manifest in various forms, from toxic expressions towards certain social groups to stereotypical linguistic associations.

Recent studies have revealed inherent bias in LLMs across diverse categories, including gender, political ideologies, occupation, age, disability status, class, culture, gender identity, sexual orientation, race, ethnicity, nationality, and religion (Kotek et al., 2023; Motoki et al., 2023; Xue et al., 2023; Esiobu et al., 2023). Tao et al. (2023) observe LLMs’ cultural bias resembling English-speaking and Protestant European countries, and Nguyen et al. (2023) underscore the need for equitable and culturally aware AI and evaluation.

Bias in LLMs can be quantified through 1) embedding or probabilities of tokens or sentences and 2) distribution, classifier prediction, and lexicon of generated texts. Evaluation datasets for measuring bias leverage counterfactual inputs (a fill-in-the-blank task with masked token and predicting most likely unmasked sentences) or prompts (sentence completion and question answering) (Rudinger et al., 2018; Nangia et al., 2020; Gehman et al., 2020; Parrish et al., 2022), inter alia.2

2.2 Bias and Stereotype Datasets

BBQ-format Datasets.

The BBQ (Parrish et al., 2022) dataset is designed to evaluate models for bias and stereotypes using a multiple-choice QA format. It includes real-life scenarios and associated questions to address social biases inherent in LMs. As the QA format is highly adaptable for evaluating BERT-like models and generative LMs, it is used for assessing state-of-the-art LMs (Liang et al., 2023; Srivastava et al., 2023). However, BBQ mainly contains US-centric stereotypes, which poses challenges for direct implementation in Korean culture.

Huang and Xiong (2023) released CBBQ, a Chinese BBQ dataset tailored for Chinese social and cultural contexts. They re-define bias categories and types for Chinese culture based on the Employment Promotion Law, news articles, social media, and knowledge resource corpora in China. However, neither BBQ nor CBBQ verifies through a large-scale survey whether its samples convey social and cultural contexts appropriately. A more in-depth comparison of KoBBQ with other BBQ datasets is provided in §5.2.

English Datasets.

Winogender (Rudinger et al., 2018) and WinoBias (Zhao et al., 2018) shed light on gender bias with the use of gender pronouns (i.e., he, she, they), but the approach is difficult to apply in Korean where gender pronouns are rarely used. StereoSet (Nadeem et al., 2021) and CrowS-Pairs (Nangia et al., 2020) measure stereotypical bias in masked language models. UnQover (Li et al., 2020) quantifies biases in a QA format with underspecified questions, which share similar ideas with the questions with ambiguous contexts in BBQ. BOLD (Dhamala et al., 2021) is proposed to measure social bias in open-ended text generation with complex metrics that depend on another language model or pre-defined lexicons, including gender pronouns. These datasets deal with limited categories of social bias.

Korean Datasets.

There exist several Korean datasets that deal with bias. K-StereoSet3 is a machine-translated and post-edited version of the StereoSet development set; its data are noisy and limited in size. KoSBi (Lee et al., 2023a) is an extrinsic evaluation dataset to assess whether the outputs of generative LMs are safe. The dataset is created through a machine-in-the-loop framework, considering target groups that reflect Korean culture. They classify unsafe outputs into three types: stereotype, prejudice, and discrimination. Still, it is difficult to identify the different types of stereotypes that exist within Korean culture from these datasets.

2.3 Cross-cultural NLP

Several approaches for cultural considerations in LMs have been proposed in tasks such as word vector space construction or hate speech classification (Lin et al., 2018; Lee et al., 2023b), and culturally sensitive dataset constructions (Liu et al., 2021; Yin et al., 2021; Jeong et al., 2022). Recent studies have also presented methods for translating existing data in a culturally sensitive manner by automatically removing examples with social keywords, which refer to those related to social behaviors (e.g., weddings) (Lin et al., 2021), or performing cross-cultural translation with human translators by substituting or paraphrasing original concepts into similar meaning (Ponti et al., 2020). Our approach builds upon these methods by adapting cross-cultural translation, manually eliminating samples that do not fit Korean culture, and incorporating culturally fit target groups and handcrafted samples into a Korean-specific bias benchmark dataset.

3.1 BBQ-format Dataset

The task is to answer a discriminatory question given a context, where the context and question address a stereotype related to specific target social groups. The dataset builds upon templates with attributes for the target group, non-target group (groups far from the stereotype), and lexical variants. Each template with unique attributes involves a total of eight context-question pairs, with four different context types (either ambiguous or disambiguated, and either biased or counter-biased) and two different question types (biased or counter-biased).

Context Types.

The context describes a scenario where two individuals from different social groups engage in behavior related to the given stereotype. Let ‘target’ denote the one from the target group and ‘non-target’ the other. A biased context depicts a situation where the behavior of the ‘target’ aligns with the stereotype. In contrast, the roles of the two people are swapped in a counter-biased context.

The first half of each context only mentions the ‘target’ and ‘non-target’ without sufficient information to answer the questions accurately, referred to as an ambiguous context. The second half adds the necessary details to answer the question, making the whole context a disambiguated context.

Question Types.

A biased question asks which group conforms to a given stereotype, while a counter-biased question asks which group goes against it.

Answer Types.

The correct answer in ambiguous contexts is always ‘unknown.’ When given a disambiguated context, the correct answer under a biased context is always the biased answer, referring to answers conforming to social biases. Under a counter-biased context, the correct answer is always the counter-biased answer that goes against the social bias.
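The template structure described in this section can be sketched in code. The following is a minimal illustration under our own naming, not the authors' implementation: the function names and label strings are hypothetical, and only the enumeration of the eight context-question pairs and their correct answer types follows the text above.

```python
from itertools import product

# Hypothetical labels; the paper does not prescribe identifiers.
AMBIGUITY = ("ambiguous", "disambiguated")
POLARITY = ("biased", "counter-biased")


def correct_answer(ambiguity: str, context_polarity: str) -> str:
    """Return the correct answer type for a context, following Section 3.1."""
    if ambiguity == "ambiguous":
        # Ambiguous contexts never contain enough information to decide.
        return "unknown"
    # Disambiguated: the correct answer type follows the context polarity.
    return f"{context_polarity} answer"


def expand_template() -> list[dict]:
    """Enumerate the eight context-question pairs of one template."""
    pairs = []
    for ambiguity, ctx_polarity, q_polarity in product(AMBIGUITY, POLARITY, POLARITY):
        pairs.append({
            "context": (ambiguity, ctx_polarity),
            "question": q_polarity,
            "correct": correct_answer(ambiguity, ctx_polarity),
        })
    return pairs
```

Four context types times two question types give the eight pairs; the four pairs built on ambiguous contexts all share the answer ‘unknown.’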

3.2 Dataset Construction

The dataset curation process of KoBBQ consists of 5 steps: (1) categorization of BBQ templates, (2) cultural-sensitive translation, (3) demographic category construction, (4) creation of new templates, and (5) a large-scale survey on social bias. Each of the steps will be further explained below.

3.2.1 Categorization of BBQ Templates

Four of the authors, who are native Koreans, categorize the templates from the original BBQ dataset into three classes: Sample-Removed, Target-Modified, and Simply-Transferred. We go through a discussion to establish a consensus on all labels. Figure 2 shows examples for each class.

Figure 2: 

Examples of 4 types in KoBBQ. The yellow box indicates the answer to the biased question, asking which group conforms to the relevant social value. [N1] or [N2] represent the templated slots with one potential filler from target or non-target groups. A dotted box refers to the target groups that align with the relevant social bias. Any modified parts from BBQ are marked with strike lines, while cultural-sensitive translation parts are underlined.

Sample-Removed refers to samples that are not representative of the Korean cultural context. We exclude Sample-Removed samples from KoBBQ to accurately reflect Korean culture.

Target-Modified denotes samples whose inherent biases exist in Korean cultures but are stereotyped towards different target groups. Therefore, in addition to cultural-sensitive translation, we modify and collect target groups appropriate for Korean culture through a large-scale public survey of Korean citizens.

Simply-Transferred indicates samples revealing stereotypical biases that match Korean cultural background. These samples only go through cultural-sensitive translation when transformed into samples of KoBBQ.

3.2.2 Cultural-sensitive Translation

We initially use DeepL Translator4 to translate Simply-Transferred and Target-Modified samples. However, Peskov et al. (2021) pointed out that translated sentences may lack cultural context, highlighting the need for the adaptation of entities to the target culture, known as adaptation in the translation field (Vinay and Darbelnet, 1995) as part of cross-cultural translation (Sperber et al., 1994). To ensure a high-quality translation with Korean cultural contexts, we request a professional translator to perform culturally sensitive human-moderated translations. We specifically ask the translator to use Korean culture-familiar words, such as E-Mart5 instead of Walmart, bleached hair instead of dark hair,6 and basketball instead of rugby,7 to avoid awkwardness stemming from the cultural difference between US and Korean cultures.

3.2.3 Demographic Category Reconstruction

We reconstruct the stereotyped group categories of the original BBQ based on the categories and demographic groups of KoSBi (Lee et al., 2023a), which refers to UDHR8 and NHRCK.9 We (1) merge race/ethnicity and nationality into a single category and (2) add four categories that reflect unique social contexts of Korean cultures: domestic area of origin, educational background, family structure, and political orientation. The reason behind merging the two categories is that the distinction between race/ethnicity and nationality is vague in Korea, considering that Korea is an ethnically homogeneous nation compared to the US (Han, 2007). For the newly merged race/ethnicity/nationality category, we include groups potentially familiar to Korean people. These include races that receive social prejudice from Koreans (Lee, 2007), ethnicities related to North Korea, China, and Japan, and the top two countries with the highest number of immigrants from each world region determined by MOFA10 between 2000 and 2022.11 Moreover, by adding new categories, the dataset covers a wide range of social biases and corresponding target groups embedded within Korean society. The final KoBBQ comprises 12 categories in Table 1.

Table 1: Statistics of KoBBQ. ST, TM, SR, NC denote Simply-Transferred, Target-Modified, Sample-Removed, and Newly-Created, respectively. Numbers within parentheses indicate the number of templates before being filtered by the survey results. The number of samples means the number of unique pairs of the context and question.

Category                     Templates by class           # of Templates      # of Samples
                             (SR, TM, ST, NC;             (before →) after
                             empty cells omitted)
Age                          20                           (28 →) 21           3,608
Disability Status            20                           (25 →) 20           2,160
Gender Identity              25                           (29 →) 25           768
Physical Appearance          17                           (25 →) 20           4,040
Race/Ethnicity/Nationality   17, 33, 10                   (46 →) 43           51,856
Religion                     10                           (25 →) 20           688
Socio-Economic Status        16, 10                       (28 →) 27           6,928
Sexual Orientation           10                           (25 →) 12           552

Domestic Area of Origin      22                           (25 →) 22           800
Family Structure             23                           (25 →) 23           1,096
Political Orientation        11                           (28 →) 11           312
Educational Background       24                           (25 →) 24           3,240

Total                        48, 42, 107, 119             268                 76,048

3.2.4 Creation of New Templates

To create a fair and representative sample of Korean culture and balance the number of samples across categories, the authors manually devise templates and label them as Newly-Created. Our templates rely on sources backed by solid evidence, such as research articles featuring in-depth interviews with representatives of the target groups, statistical reports derived from large-scale surveys conducted on the Korean public, and news articles that provide expert analysis of statistical findings.

3.2.5 Large-scale Survey on Social Bias

In contrast to BBQ, we employ statistical evidence to validate social bias and target groups within KoBBQ by implementing a large-scale survey of the Korean public.12

Survey Setting.

We conduct a large-scale survey to verify whether the stereotypical biases revealed through KoBBQ match the general cognition of the Korean public. Moreover, we perform a separate reading comprehension survey, where we validate the contexts and associated questions. To ensure a balanced demographic representation of the Korean public, we require the participation of 100 individuals for each survey question while balancing gender and age groups.

For the social bias verification survey, we split the whole dataset into two types: 1) target or non-target groups must be modified or newly designated, and 2) only the stereotype needs to be validated with a fixed target group. All of the Target-Modified templates conform to the first type. Among Simply-Transferred and Newly-Created templates, those in religion, domestic area of origin, and race/ethnicity/nationality categories are also included in the first type unless the reference explicitly mentions the non-target groups. This is because, for those categories, it is hard to specify the non-target groups based only on the target groups. The others conform to the second type. As some samples within KoBBQ share the same stereotype, we extract unique stereotypes for survey question construction.

Target Modification.

In addition to target group selection, the non-target groups in KoBBQ differ from those of BBQ in that they comprise only groups far from the social stereotype, promoting a better comparison between target and non-target groups. In the survey, for the first type, we ask workers to select all possible target groups for a given social bias using a select-all-that-apply question format, with the prompt “Please choose all social groups that are appropriate as the ones corresponding to the stereotype ‘<social_bias>’ in the common perception of Korean society.” We provide a comprehensive list of demographic groups for each category, including a ‘no stereotype exists’ option for respondents who perceive no such stereotype.

We select as target groups those that received at least twice as many votes as they would under an equal distribution of votes across all options, and as non-target groups those that received at most half as many, ensuring that we only keep options with significant bias.13 If no group qualifies as a target group or as a non-target group, we eliminate the corresponding samples from the dataset. As a result, 8.3% of the stereotypes within this survey type are eliminated, resulting in a 3.0% decrease in the total number of templates.
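The selection rule above can be sketched as follows. This is an illustrative reading rather than the authors' procedure: we assume the equal-distribution baseline is the total number of votes divided by the number of options passed in, and the function name `select_groups` is hypothetical. Whether the ‘no stereotype exists’ option counts among the options is not specified here, so the caller decides what to include.

```python
def select_groups(votes: dict[str, int]) -> tuple[list[str], list[str], bool]:
    """Split candidate groups into target and non-target groups by vote count.

    Assumption: the baseline is an equal share of all votes across the
    options given in `votes`.
    """
    if not votes:
        raise ValueError("votes must contain at least one option")
    baseline = sum(votes.values()) / len(votes)
    # Target groups: at least twice the equal share.
    targets = [g for g, v in votes.items() if v >= 2 * baseline]
    # Non-target groups: at most half the equal share.
    non_targets = [g for g, v in votes.items() if v <= 0.5 * baseline]
    # The stereotype is kept only if both lists are non-empty.
    keep = bool(targets) and bool(non_targets)
    return targets, non_targets, keep
```

For example, with made-up counts `{"A": 60, "B": 10, "C": 30, "D": 20}` the equal share is 30 votes, so "A" becomes a target group and "B" a non-target group.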

Stereotype Validation.

References alone are not enough to demonstrate the existence of social biases in Korean society. To confirm such biases, we conduct a large-scale survey for the second type, in which workers are given the target and non-target groups and asked to identify which group corresponds to the given social bias. We use the prompt “When comparing <group1> and <group2> in the context of Korean society, please choose the social group that corresponds to the stereotype ‘<social_bias>’ as a fixed perception.” We also provide a ‘no stereotype exists’ choice for people with no related bias. The order of the target and non-target groups is randomly shuffled and templated into <group1> and <group2>.

After the survey, we keep only the templates for which more than two-thirds of the people who did not select ‘no stereotype exists’ chose the target group, eliminating those that do not demonstrate significant bias toward the target group. This approach guarantees a representative label that reflects the majority opinion. After doing so, the number of stereotypes is reduced by 13.6% in this survey type, and the overall count of the templates is decreased by 10.9%.
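The two-thirds criterion can be expressed compactly. The sketch below assumes that each respondent picks exactly one of the target group, the non-target group, or ‘no stereotype exists’; the function name is hypothetical, and integer arithmetic is used to avoid floating-point issues at the exact two-thirds boundary.

```python
def is_validated(target_votes: int, non_target_votes: int) -> bool:
    """Check whether more than two-thirds of decided respondents chose the target.

    'Decided' respondents are those who did not select 'no stereotype exists'.
    """
    decided = target_votes + non_target_votes
    if decided == 0:
        return False
    # Strictly more than 2/3, written without division.
    return 3 * target_votes > 2 * decided
```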

Data Filtering.

We finalize our dataset using two filtering methods: 1) ‘no stereotype exists’ count and 2) reading comprehension task. We apply this for both types of the survey.

Across the 290 unique stereotypes, an average of 18.8% of respondents chose the option “no stereotype exists.” To select stereotypes that align with common social stereotypes in Korean society, we exclude any stereotype for which more than 50% of our workers chose “no stereotype exists.” Using this method, we additionally eliminate 3.1% of the overall stereotypes, resulting in a 2.8% decrease in the total count of templates.

We construct a reading comprehension task for each template, using counter-biased contexts and counter-biased questions, as they demand more attention from the workers. We eliminate templates for which the ratio of correct answers to the corresponding context and question pair is below 50%. After this step, 3.9% of the remaining templates are discarded. The discarded samples include those whose disambiguated contexts were too ambiguous for human annotators to correctly answer the questions.
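Both filters in this subsection reduce to simple threshold checks. The following sketch assumes the survey results are already aggregated into per-template ratios; the record type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TemplateSurveyResult:
    """Hypothetical per-template aggregate of the two survey measures."""
    template_id: str
    no_stereotype_ratio: float   # share of workers choosing 'no stereotype exists'
    reading_accuracy: float      # share of correct reading comprehension answers


def passes_filters(result: TemplateSurveyResult) -> bool:
    """Keep a template unless either filter described in the text removes it."""
    # Filter 1: removed if over 50% chose 'no stereotype exists'.
    if result.no_stereotype_ratio > 0.5:
        return False
    # Filter 2: removed if reading comprehension accuracy is below 50%.
    if result.reading_accuracy < 0.5:
        return False
    return True
```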

3.3 Data Statistics

Table 1 shows the number of templates per class mentioned in §3.2.1 and the number of samples per category. Each template consists of multiple samples, as each target group and the non-target group is substituted with several specific examples of them. We also provide the number of templates before and after eliminating data following the survey result.

The categories from the original BBQ that comprise a significant portion of the social bias that exists within Korean society are mainly composed of Simply-Transferred types, such as age, disability status, and gender identity. With the demographic groups newly updated, for race/ethnicity/nationality, all the original templates except those that include social bias or context not applicable to Korean culture are classified as Target-Modified. In order to add social bias in Korean culture and to balance the dataset among categories, we created new samples for categories from the original BBQ, as shown in the Newly-Created counts. However, the survey results led to the removal of many templates in the sexual orientation and political orientation categories, as shown by the change in template counts before and after the survey, indicating that the Korean public does not hold a diverse range of social biases regarding those categories.

In this section, we evaluate state-of-the-art generative LLMs on KoBBQ. Our evaluation encompasses accuracy and bias scores, ensuring a comprehensive assessment of the models’ inherent bias.

4.1 Experimental Settings

The task is multiple-choice QA, in which the models are asked to choose the most appropriate answer when given a context, a question, and three choices (‘target,’ ‘non-target,’ and ‘unknown’).

Evaluation Prompts.

We use five different prompts with different instructions and different ‘unknown’ expressions. The gray text box below shows one of the prompts we use in the experiment. Following Izacard et al. (2023), we apply the cyclic permutation of the three choices (A, B, and C) to each prompt.

[Prompt box: example evaluation prompt (image not reproduced)]
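The cyclic permutation of the three choices mentioned above can be sketched as follows. The function names are our own, and the choice strings are placeholders; the actual prompts and option wordings are not reproduced here.

```python
def cyclic_permutations(choices: list[str]) -> list[list[str]]:
    """Return every rotation of the choice list (three rotations for three choices)."""
    return [choices[i:] + choices[:i] for i in range(len(choices))]


def label_choices(choices: list[str]) -> dict[str, str]:
    """Attach the option letters A, B, C to an ordered list of three choices."""
    return dict(zip("ABC", choices))
```

Each rotation places every choice once under each option letter, so a model's preference for a particular letter does not systematically favor one answer type.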

Evaluation Set.

Each template in KoBBQ comprises multiple target and non-target groups, along with alternative expressions. Due to the vast size and uneven distribution from all combinations in the dataset, we utilize a test set encompassing a randomly sampled example from each template. In total, our evaluation set comprises 32,160 samples (quadruples of the prompt, context, question, and choice permutation).14
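As a consistency check on the stated size, the count can be decomposed as below. The decomposition is our reading rather than something stated explicitly: one sampled instantiation per template, all eight context-question pairs, five prompts, and three cyclic permutations of the choices.

$$268 \ \text{templates} \times 8 \ \text{context-question pairs} \times 5 \ \text{prompts} \times 3 \ \text{permutations} = 32{,}160$$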

Models.

We only include the models that are capable of QA tasks in the zero-shot setting since fine-tuning or few-shot prompting can affect the bias of the models (Li et al., 2020; Yang et al., 2022). The following models are used in the experiments: Claude-v1 (claude-instant-1.2), Claude-v2 (claude-2.0),15 (Bai et al., 2022), GPT-3.5 (gpt-3.5-turbo-0613), GPT-4 (gpt-4-0613),16 CLOVA-X,17 and KoAlpaca (KoAlpaca-Polyglot-12.8B).18 For GPT-3.5 and GPT-4, we use the OpenAI API and set the temperature to 0 to use greedy decoding. The model inferences were run from August to September 2023.

Post-processing of Generated Answers.

The criteria for accepting responses generated by generative models are established to ensure that only valid answers are accepted. Specifically, responses must meet one of the following criteria: (i) contain only a single letter indicating one of the given options, (ii) exactly match the term provided in the options, optionally with the letter of that option, or (iii) include a specific expression that is intended to provide an answer, such as ‘answer is -’. Responses that fail to meet these criteria are considered out-of-choice answers and are excluded from scoring.
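A rough sketch of such a post-processing step is shown below. It is not the authors' parser: the regular expressions, the English phrase "answer is", and the function name `parse_answer` are all assumptions made for illustration, and the real prompts and responses are in Korean.

```python
import re
from typing import Optional


def parse_answer(response: str, options: dict[str, str]) -> Optional[str]:
    """Map a generated response to an option letter, or None if out-of-choice.

    `options` maps option letters to option texts, e.g. {"A": "...", "B": "..."}.
    """
    text = response.strip()
    letters = "".join(re.escape(k) for k in options)

    # (i) The response is a single option letter, e.g. "B", "(B)", or "B."
    m = re.fullmatch(rf"\(?([{letters}])\)?[.:]?", text)
    if m:
        return m.group(1)

    # (ii) The response exactly matches an option text, optionally with its letter.
    for letter, term in options.items():
        pattern = rf"(?:\(?{re.escape(letter)}\)?[.:]?\s*)?{re.escape(term)}\.?"
        if re.fullmatch(pattern, text):
            return letter

    # (iii) The response contains an explicit answer expression, e.g. "answer is B".
    m = re.search(rf"(?i:answer is)\s*:?\s*\(?([{letters}])\b", text)
    if m:
        return m.group(1)

    # Anything else is treated as an out-of-choice answer.
    return None
```

Responses for which this function returns `None` would be excluded from scoring, in line with the criteria above.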

4.2 Evaluation Metrics

Considering the nature of the BBQ-formatted dataset, it is essential to measure both the accuracy and bias score of models. In this section, we define the accuracy and diff-bias score using the notations shown in Table 2.

Table 2: Notations for counts for each case. $n_t$ denotes the number of templates corresponding to each combination. Amb, Dis, B, cB, and Unk are abbreviations of ambiguous, disambiguated, biased, counter-biased, and unknown, respectively. Each cell marked with an asterisk (*) indicates the correct answer type for a given context. Each context type contains cases for both biased and counter-biased questions, for a total of $2n_t$ cases.

Context         Answer: B     cB            Unk           Total
Amb   B / cB    $n_{ab}$      $n_{ac}$      $n_{au}$*     $n_a$ (= $4n_t$)
Dis   B         $n_{bb}$*     $n_{bc}$      $n_{bu}$      $n_b$ (= $2n_t$)
Dis   cB        $n_{cb}$      $n_{cc}$*     $n_{cu}$      $n_c$ (= $2n_t$)

Accuracy.

In ambiguous contexts, the correct answer is always ‘unknown’ regardless of question types. On the other hand, in disambiguated contexts, the correct answers correspond to the question types (e.g., under a biased context, the target group is the correct answer to a biased question). We denote the accuracy in ambiguous and disambiguated contexts as $\mathrm{Acc}_a$ and $\mathrm{Acc}_d$, which are calculated as Equation 1 and Equation 2, respectively.

$$\mathrm{Acc}_a = \frac{n_{au}}{n_a} \tag{1}$$

$$\mathrm{Acc}_d = \frac{n_{bb} + n_{cc}}{n_b + n_c} \tag{2}$$

Diff-bias Score.

In the BBQ-format datasets, the extent to which a language model reveals its inherent social bias depends on its QA performance. For instance, if the model answers the question perfectly based only on the context provided, it means that the model is not affected by any bias. In this section, we define diff-bias scores based on Parrish et al. (2022) to measure how frequently the model answers questions based on its bias. Furthermore, we provide their maximum values, which are determined by the model’s accuracy. This highlights the importance of evaluating both the bias score and accuracy in tandem.

In ambiguous contexts, we define the diff-bias score $\text{Diff-bias}_a$ as the difference between the prediction ratios of biased answers and counter-biased answers, as described in Equation 3. A higher value indicates that the model tends to produce more answers that align with social biases. Note that the absolute value of $\text{Diff-bias}_a$ is bounded by the accuracy, as shown in Equation 4.

$$\text{Diff-bias}_a = \frac{n_{ab} - n_{ac}}{n_a} \tag{3}$$

$$|\text{Diff-bias}_a| \le 1 - \mathrm{Acc}_a \quad (0 \le \mathrm{Acc}_a \le 1) \tag{4}$$

We define the diff-bias score of disambiguated context, $\text{Diff-bias}_d$, as the difference between the accuracies under biased context and under counter-biased context, as in Equation 5. Thereby, a higher diff-bias score indicates that the model is relatively more accurate for biased contexts ($\mathrm{Acc}_{db}$) than for counter-biased contexts ($\mathrm{Acc}_{dc}$). This performance difference could originate from the model’s inherent social bias. $\text{Diff-bias}_d$ is the difference between these two accuracies, while their mean equals $\mathrm{Acc}_d$ in Equation 2, given that $n_b = n_c = 2n_t$. This constrains the range of $\text{Diff-bias}_d$ as shown in Equation 6.

$$\text{Diff-bias}_d = \mathrm{Acc}_{db} - \mathrm{Acc}_{dc} = \frac{n_{bb}}{n_b} - \frac{n_{cc}}{n_c} \tag{5}$$

$$|\text{Diff-bias}_d| \le 1 - |2\,\mathrm{Acc}_d - 1| \quad (0 \le \mathrm{Acc}_d \le 1) = \begin{cases} 2\,\mathrm{Acc}_d & (0 \le \mathrm{Acc}_d \le 0.5) \\ 2\,(1 - \mathrm{Acc}_d) & (0.5 < \mathrm{Acc}_d \le 1) \end{cases} \tag{6}$$

In summary, the accuracy represents the frequency of the model generating correct predictions, while the diff-bias indicates the direction and the extent to which incorrect predictions are biased. An optimal model would exhibit an accuracy of 1 and a diff-bias score of 0. A uniformly random model would have an accuracy of 1/3 and a diff-bias score of 0. A model that consistently provides only biased answers would have a diff-bias score of 1, with an accuracy of 0 in ambiguous contexts and 0.5 in disambiguated contexts.
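The metrics of this section can be computed directly from the nine counts of Table 2. The sketch below is a minimal illustration; the class and function names are our own, and we assume each context total is the sum of its three answer counts (that is, out-of-choice answers have already been excluded).

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Answer counts per context type, named as in Table 2."""
    n_ab: int
    n_ac: int
    n_au: int
    n_bb: int
    n_bc: int
    n_bu: int
    n_cb: int
    n_cc: int
    n_cu: int


def diff_bias_scores(c: Counts) -> dict[str, float]:
    """Compute accuracies, diff-bias scores, and their accuracy-dependent bounds."""
    n_a = c.n_ab + c.n_ac + c.n_au
    n_b = c.n_bb + c.n_bc + c.n_bu
    n_c = c.n_cb + c.n_cc + c.n_cu
    acc_a = c.n_au / n_a                         # Equation 1
    acc_d = (c.n_bb + c.n_cc) / (n_b + n_c)      # Equation 2
    return {
        "acc_a": acc_a,
        "acc_d": acc_d,
        "diff_bias_a": (c.n_ab - c.n_ac) / n_a,        # Equation 3
        "max_bias_a": 1 - acc_a,                       # Equation 4
        "diff_bias_d": c.n_bb / n_b - c.n_cc / n_c,    # Equation 5
        "max_bias_d": 1 - abs(2 * acc_d - 1),          # Equation 6
    }
```

Passing in the counts of a model that always gives the biased answer reproduces the edge case described above: accuracies of 0 and 0.5, and both diff-bias scores equal to 1.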

4.3 Experimental Results

In this section, we present the evaluation results of the six LLMs on KoBBQ.

Accuracy and Diff-bias Score.

Table 3 shows the accuracy and diff-bias scores of the models on KoBBQ.19 Overall, the models show higher accuracy in disambiguated contexts compared to ambiguous contexts. Remarkably, all the models present positive diff-bias scores, with pronounced severity in ambiguous contexts. This suggests that the models tend to favor outputs that are aligned with prevailing societal biases.

Table 3: The diff-bias score and accuracy of the models across five different prompts. ‘max|bias|’ indicates the maximum absolute value of the diff-bias score given the accuracy. The rows are sorted by accuracy.

(a) Ambiguous Context
Model        accuracy (↑)     diff-bias (↓)    max|bias|
KoAlpaca     0.1732±0.0435    0.0172±0.0049    0.8268
Claude-v1    0.2702±0.1691    0.2579±0.0645    0.7298
Claude-v2    0.5503±0.2266    0.1556±0.0480    0.4497
GPT-3.5      0.6194±0.0480    0.1653±0.0231    0.3806
CLOVA-X      0.8603±0.0934    0.0576±0.0333    0.1397
GPT-4        0.9650±0.0245    0.0256±0.0152    0.0350

(b) Disambiguated Context
Model        accuracy (↑)     diff-bias (↓)    max|bias|
KoAlpaca     0.4247±0.0199    0.0252±0.0085    0.8495
CLOVA-X      0.7754±0.0825    0.0362±0.0103    0.4491
GPT-3.5      0.8577±0.0142    0.0869±0.0094    0.2847
Claude-v2    0.8762±0.0650    0.0321±0.0050    0.2475
Claude-v1    0.9103±0.0224    0.0322±0.0041    0.1793
GPT-4        0.9594±0.0059    0.0049±0.0070    0.0811

Specifically, GPT-4 achieves the highest accuracy, over 0.95 in both contexts, while also having low diff-bias scores. However, considering the ratio of its diff-bias score to the maximum possible value, GPT-4 still cannot be said to be free from bias. Claude-v1 and GPT-3.5 show the highest diff-bias scores in ambiguous and disambiguated contexts, respectively. Meanwhile, KoAlpaca exhibits both low accuracy and low bias scores, which we attribute to its tendency, in most cases, to choose randomly between the two options other than ‘unknown’.
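
To make the "ratio to the maximum value" comparison concrete, the following sketch divides each mean diff-bias score in Table 3 by the bound implied by the corresponding mean accuracy. The function names are hypothetical, and working from the rounded table means is an assumption: a ratio of means need not equal the mean of per-prompt ratios, and this is not necessarily the normalization used in the figures.

```python
# Mean (accuracy, diff-bias) pairs copied from Table 3.
AMBIGUOUS = {
    "KoAlpaca": (0.1732, 0.0172),
    "Claude-v1": (0.2702, 0.2579),
    "Claude-v2": (0.5503, 0.1556),
    "GPT-3.5": (0.6194, 0.1653),
    "CLOVA-X": (0.8603, 0.0576),
    "GPT-4": (0.9650, 0.0256),
}
DISAMBIGUATED = {
    "KoAlpaca": (0.4247, 0.0252),
    "CLOVA-X": (0.7754, 0.0362),
    "GPT-3.5": (0.8577, 0.0869),
    "Claude-v2": (0.8762, 0.0321),
    "Claude-v1": (0.9103, 0.0322),
    "GPT-4": (0.9594, 0.0049),
}


def max_bias_ambiguous(acc: float) -> float:
    """Bound on |diff-bias| in ambiguous contexts: 1 - accuracy."""
    return 1.0 - acc


def max_bias_disambiguated(acc: float) -> float:
    """Bound on |diff-bias| in disambiguated contexts: 1 - |2 * accuracy - 1|."""
    return 1.0 - abs(2.0 * acc - 1.0)


def bias_ratio(diff_bias: float, bound: float) -> float:
    """Fraction of the attainable bias that is realized (0 if the bound is 0)."""
    return abs(diff_bias) / bound if bound > 0 else 0.0


def ratios(table, bound_fn):
    """Map each model to its diff-bias / max|bias| ratio."""
    return {model: bias_ratio(diff, bound_fn(acc)) for model, (acc, diff) in table.items()}


if __name__ == "__main__":
    for name, table, fn in (
        ("ambiguous", AMBIGUOUS, max_bias_ambiguous),
        ("disambiguated", DISAMBIGUATED, max_bias_disambiguated),
    ):
        for model, ratio in ratios(table, fn).items():
            print(f"{name:14s} {model:10s} {ratio:.3f}")
```

For GPT-4, this gives a ratio of roughly 0.73 in ambiguous contexts and roughly 0.06 in disambiguated contexts: although its absolute diff-bias is small, most of its few errors in ambiguous contexts fall on the biased side.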

Bias Score by Category.

Figure 3 depicts the diff-bias score of the six models for each stereotyped group category. We observe significant differences in diff-bias scores among bias categories in both ambiguous and disambiguated contexts (p-value < 0.01, one-way ANOVA). In particular, stereotypes associated with socio-economic status show a significantly lower diff-bias score in disambiguated contexts than all other bias categories. Additionally, stereotypes associated with gender identity and race/ethnicity/nationality exhibit marginally lower diff-bias scores in ambiguous contexts, whereas those associated with age and political orientation show marginally higher scores. These differences are measured relative to the overall diff-bias score.
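
For readers unfamiliar with the test, the following is a minimal pure-Python sketch of the one-way ANOVA F statistic that underlies such a comparison across categories. The function name and the toy numbers are hypothetical and are not KoBBQ results; in practice a library routine such as `scipy.stats.f_oneway` would also return the p-value.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum(sum((v - m) ** 2 for v in g) for g, m in zip(groups, means))

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Toy scores for three hypothetical categories (illustrative values only).
    toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
    print(one_way_anova_f(toy_groups))
```

A large F relative to the F distribution with the returned degrees of freedom indicates that at least one group mean differs; a post-hoc procedure such as Tukey's HSD then identifies which ones.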

Figure 3: Tukey-HSD test on the normalized diff-bias scores for each stereotype group category with 99% confidence interval.


Scores by Label Type.

Figure 4 illustrates the accuracy and diff-bias scores of the models for each label type. In ambiguous contexts, the Newly-Created samples have the lowest accuracy and the highest diff-bias score. This suggests that the samples we added reveal previously unexamined inherent bias in LMs. The Target-Modified and Simply-Transferred samples show similar accuracy but a noticeable difference in diff-bias score in ambiguous contexts, which shows that bias scores can differ even when accuracy is similar. In disambiguated contexts, a higher accuracy tends to be associated with a lower bias score: the models achieve the highest QA performance and the lowest diff-bias score on the Newly-Created samples.

Figure 4: Tukey-HSD test on both the normalized accuracy and diff-bias scores for each sample type with 99% confidence interval.


5.1 KoBBQ vs. Machine-translated BBQ

To highlight the need for a hand-crafted bias benchmark that accounts for cultural differences, we compare the performance and bias of LMs on KoBBQ and on machine-translated BBQ (mtBBQ). Table 4 shows the accuracy and bias scores of the models for the Simply-Transferred (ST) and Target-Modified (TM) samples, which are included in both KoBBQ and mtBBQ. We perform a Wilcoxon rank-sum test to check for statistically significant differences between the two datasets for each model and label.
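
As an illustration of the test named above, here is a minimal pure-Python sketch of a two-sided Wilcoxon rank-sum test using the normal approximation. The function name and the toy inputs are hypothetical and are not KoBBQ or mtBBQ scores; the sketch assigns average ranks to ties but applies no tie correction to the variance. A library routine such as `scipy.stats.ranksums` implements the same normal-approximation test.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p_value), where z is the standardized rank sum of sample x.
    """
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(pooled)

    # Assign 1-based ranks, giving tied values the average of their ranks.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)

    # Mean and standard deviation of the rank sum under the null hypothesis.
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

    z = (rank_sum_x - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


if __name__ == "__main__":
    # Toy per-template scores for two hypothetical datasets.
    print(rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))
```

The rank-sum test compares two independent samples without assuming normality, which suits bounded scores such as accuracy and diff-bias.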

Table 4: Comparison of accuracy, bias scores, and Wilcoxon rank-sum test for KoBBQ and machine-translated BBQ (mtBBQ) in the ST (Simply-Transferred) and TM (Target-Modified) labels. P-values are calculated on KoBBQ and mtBBQ for each label and model. The colored cells indicate the statistically significant differences.


Regarding accuracy, the models score significantly higher on KoBBQ than on mtBBQ in disambiguated contexts, except for KoAlpaca, which shows low QA performance. Since the task in disambiguated contexts resembles machine reading comprehension, this suggests that manual translation improves contextual comprehension. There is no significant difference in accuracy between KoBBQ and mtBBQ in ambiguous contexts.

For the diff-bias score, differences between KoBBQ and mtBBQ exist in both contexts. In ambiguous contexts, model biases are generally higher on KoBBQ than on mtBBQ. This may be due to the models’ incomplete comprehension of the machine-translated texts, which makes mtBBQ less effective than the manually translated KoBBQ at measuring inherent model bias. In disambiguated contexts, some differences are significant, although there is no clear trend in the ordering between KoBBQ and mtBBQ.

Overall, KoBBQ and mtBBQ differ in both model performance and bias scores, even when the comparison is restricted to the labels they share (Simply-Transferred and Target-Modified) and excludes the labels they do not (Newly-Created and Sample-Removed). These findings highlight the importance of manual translation and cultural adaptation, as machine translation alone is insufficient for measuring a model’s bias.

5.2 KoBBQ vs. BBQ/CBBQ

In this work, we present a general framework that can be used to extend the BBQ dataset (Parrish et al., 2022) to different cultures. By categorizing templates according to their applicability, we label whether a sample is applicable with only minor revisions (Simply-Transferred), applicable with different target groups (Target-Modified), or not applicable at all (Sample-Removed). Our labeling results can aid research on Korean culture, and our framework can be used to build culturally adapted datasets for other cultures as well. Datasets constructed in this manner enable direct comparisons of cultural differences with the existing dataset. For example, Simply-Transferred samples can reveal how a multilingual LM varies across languages in shared contexts, and Target-Modified samples expose cultural distinctions through the comparison of different target groups associated with the same stereotypes.

KoBBQ is created directly by humans without the assistance of LLMs (except for initial translation). We explored the possibility of using LLMs within our framework but encountered certain limitations. First, we asked GPT-4 to choose all target groups associated with the given stereotypes, in the same way as in the human survey for target modification. Comparing GPT-4’s choices with the human survey results for Target-Modified samples reveals low agreement, with an accuracy (exact match) of 23.8% and an F1 score (average F1 over all target group classes) of 39.73%. Second, similar to the approach in CBBQ (Huang and Xiong, 2023), we experimented with letting GPT-4 generate disambiguated contexts, questions, and answers, given stereotypes and ambiguous contexts written by humans. We find several limitations of LLMs in context generation: 1) GPT-4 produces general expressions rather than specific or culturally grounded situations and keywords, so the contexts lack what is unique to Korean culture. 2) For counter-biased contexts, it still tends to create contexts in a biased manner, reflecting its inherent bias. 3) It struggles to construct a disambiguated context that contains the answers to both the biased and counter-biased questions. The results also include instances that fail to follow the template format or contain grammatical errors specific to Korean. Detailed examples are given in Table 5. These results demonstrate that human effort remains essential for constructing a culturally sensitive bias benchmark.
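
The two agreement measures mentioned above can be sketched as follows for a multi-label setting in which each stereotype is associated with a set of target groups. This is an illustrative reading of "exact match" and "average F1 of all target group classes", not the authors' evaluation code; the function names, the class labels, and the toy annotations are hypothetical, and scoring a class with no gold and no predicted instances as 0 is an assumed convention.

```python
def exact_match(gold, pred):
    """Share of items whose predicted label set equals the gold label set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, classes):
    """Unweighted mean of per-class F1 scores over the given classes."""
    f1_scores = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if c in g and c in p)
        fp = sum(1 for g, p in zip(gold, pred) if c not in g and c in p)
        fn = sum(1 for g, p in zip(gold, pred) if c in g and c not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class that never appears scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # Toy annotations: each item is the set of target groups chosen for one stereotype.
    classes = ["group_A", "group_B", "group_C"]
    human = [{"group_A"}, {"group_A", "group_B"}, {"group_C"}]
    model = [{"group_A"}, {"group_A"}, {"group_B"}]
    print(exact_match(human, model), macro_f1(human, model, classes))
```

Exact match is strict, since a partially correct set counts as a miss, whereas macro-averaged F1 gives partial credit per class; this is consistent with the F1 score reported above being higher than the exact-match accuracy.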

Table 5: Examples of disambiguated contexts written by humans and generated by GPT-4. Compared to human-written contexts, GPT-4 tends to 1) generate general contexts rather than specific or cultural contexts, 2) make grammatical errors or create a biased context when prompted to create a counter-biased context, and 3) fail to create a fully disambiguated context that includes the answers to the biased/counter-biased questions. The grammatical errors are underlined.


Although BBQ, CBBQ, and KoBBQ are all written based on relevant references, only KoBBQ incorporates a comprehensive large-scale survey targeting the domestic public. The survey not only validates the reliability of the benchmark but also reflects the intensity of certain stereotypes in South Korea. As these results could provide valuable insights into the stereotypes present in Korean society, we will release the raw survey results along with our dataset for future research.

We presented a Korean bias benchmark (KoBBQ) that contains question-answering data with situations related to biases existing in Korea. Starting from the BBQ dataset, an existing US-centric bias benchmark, we divided its samples into three classes (Simply-Transferred, Target-Modified, and Sample-Removed) to adapt it culturally. Additionally, we added four new categories that depict biases prevalent in Korean culture. KoBBQ consists of 76,048 samples across 12 categories of social bias. To ensure the quality and reliability of our data, we recruited a sufficient number of crowdworkers for the validation process. Using KoBBQ, we analyzed six large language models in terms of accuracy and diff-bias score. By showing the differences between KoBBQ and machine-translated BBQ, we emphasized the need for culturally sensitive and meticulously curated bias benchmarks.

Our method can be applied to other cultures, which can promote the development of culture-specific bias benchmarks. We leave the extension of the dataset to other languages, and of the framework to universal adaptation across more than two cultures, as future work. Furthermore, we expect KoBBQ to contribute to the safer use of LLM applications by assessing the inherent social biases present in the models.

While the perception of social bias can be subjective, we made an extensive effort to gather insights into prevalent social biases in Korean society through our large-scale survey. Nevertheless, caution should be taken before drawing definitive conclusions based solely on our findings. Furthermore, we acknowledge the potential existence of other social bias categories in Korean society that our study has not addressed.

It is crucial to understand that performance in QA tasks can influence bias measurements. Our metric does not entirely disentangle bias scores from QA performance. Hence, a holistic view that considers both aspects is essential to avoid potentially incomplete or skewed interpretations.

This research project was performed under approval from KAIST IRB (KH2023-069). We ensured that the wages of our translator and crowdworkers exceeded the minimum wage in the Republic of Korea in 2023, which is KRW 9,260 (approximately USD 7.25).²⁰ Specifically, we paid the translator around KRW 150 per word over a period of two weeks, resulting in a payment of KRW 2,500,000. For the large-scale survey verifying stereotypes in Korea, we paid Macromill Embrain KRW 4,200,000 for a contract period of 11 days. There was no discrimination regarding any demographics, including gender and age, when recruiting workers. They were informed that the content might be stereotypical or biased.

We acknowledge the potential risk associated with releasing a dataset that contains stereotypes and biases. This dataset must not be used as training data to automatically generate and publish biased language targeting specific groups. We will explicitly state in the terms of use that we do not condone any malicious use. We strongly encourage researchers and practitioners to utilize this dataset in beneficial ways, such as mitigating bias in language models.

This project was funded by the KAIST-NAVER hypercreative AI center. Alice Oh is funded by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics). The authors would like to thank Jaehong Kim from KAIST Graduate School of Culture Technology for his assistance in the survey design.

1. Our KoBBQ dataset, evaluation codes including prompts, and survey results are available at https://jinjh0123.github.io/KoBBQ.

2. Existing evaluation datasets for bias in LLMs are available at https://github.com/i-gallegos/Fair-LLM-Benchmark.

5. One of the largest discount stores in Korea (https://company.emart.com/en/company/business.do).

6. Typically, the natural hair color of Korean individuals is dark (Im et al., 2017).

8. Universal Declaration of Human Rights.

9. National Human Rights Commission of Korea.

10. Ministry of Foreign Affairs.

12. Done with Macromill Embrain, a Korean company specialized in online research (https://embrain.com/).

13. As there are 38 options for race/ethnicity/nationality, we exclude the specific countries while only including each region name for option counts to prevent thresholds being too low (e.g., excluding US and Canada while including North America).

14. We check that the average differences of both the accuracy and diff-bias scores on the evaluation set and the entire KoBBQ set are less than 0.005, and they result in no significant differences by Wilcoxon rank-sum test for Claude-v1, GPT-3.5, and CLOVA-X with 3 prompts. When calculating the scores for the entire set, we average the scores of samples from the same template, to mitigate the impact of the imbalance of samples for each template.

19. The average ratios of out-of-choice answers from each model are below 0.005, except for Claude-v2 (0.015), CLOVA-X (0.068), and KoAlpaca (0.098).

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosiute, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. 2022. Constitutional AI: Harmlessness from AI feedback. CoRR, abs/2212.08073v1.
Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. BOLD: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pages 862–872, New York, NY, USA. Association for Computing Machinery.
David Esiobu, Xiaoqing Tan, Saghar Hosseini, Megan Ung, Yuchen Zhang, Jude Fernandes, Jane Dwivedi-Yu, Eleonora Presani, Adina Williams, and Eric Smith. 2023. ROBBIE: Robust bias evaluation of large generative language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3764–3814, Singapore. Association for Computational Linguistics.
Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed. 2023. Bias and fairness in large language models: A survey.
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, Online. Association for Computational Linguistics.
Kyung-Koo Han. 2007. The archaeology of the ethnically homogeneous nation-state and multiculturalism in Korea. Korea Journal, 47(4):8–32.
Yufei Huang and Deyi Xiong. 2023. CBBQ: A Chinese bias benchmark dataset curated with human-AI collaboration for large language models. CoRR, abs/2306.16244v1.
Kyung Min Im, Tae-Wan Kim, and Jong-Rok Jeon. 2017. Metal-chelation-assisted deposition of polydopamine on human hair: A ready-to-use eumelanin-based hair dyeing methodology. ACS Biomaterials Science & Engineering, 3(4):628–636.
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2023. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research, 24(251):1–43.
Younghoon Jeong, Juhyun Oh, Jongwon Lee, Jaimeen Ahn, Jihyung Moon, Sungjoon Park, and Alice Oh. 2022. KOLD: Korean offensive language dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10818–10833, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Hadas Kotek, Rikker Dockum, and David Sun. 2023. Gender bias and stereotypes in large language models. In Proceedings of The ACM Collective Intelligence Conference, CI ’23, pages 12–24, New York, NY, USA. Association for Computing Machinery.
Ha-Ryoung Lee. 2007. Study on Social Prejudice towards Race: Centering on the Relationship of Social Distance to Stereotypes and emotions. Master’s thesis, Hanyang University, Seoul, KR.
Hwaran Lee, Seokhee Hong, Joonsuk Park, Takyoung Kim, Gunhee Kim, and Jung-woo Ha. 2023a. KoSBI: A dataset for mitigating social bias risks towards safer large language model applications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), pages 208–224, Toronto, Canada. Association for Computational Linguistics.
Nayeon Lee, Chani Jung, and Alice Oh. 2023b. Hate speech classifiers are culturally insensitive. In Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP), pages 35–46, Dubrovnik, Croatia. Association for Computational Linguistics.
Tao Li, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Vivek Srikumar. 2020. UNQOVERing stereotyping biases via underspecified questions. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3475–3489, Online. Association for Computational Linguistics.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D. Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri S. Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Andrew Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2023. Holistic evaluation of language models. Transactions on Machine Learning Research. Featured Certification, Expert Certification.
Bill Yuchen Lin, Seyeon Lee, Xiaoyang Qiao, and Xiang Ren. 2021. Common sense beyond English: Evaluating and improving multilingual language models for commonsense reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1274–1287, Online. Association for Computational Linguistics.
Bill Yuchen Lin, Frank F. Xu, Kenny Zhu, and Seung-won Hwang. 2018. Mining cross-cultural differences and similarities in social media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 709–719, Melbourne, Australia. Association for Computational Linguistics.
Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond Elliott. 2021. Visually grounded reasoning across languages and cultures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467–10485, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Fabio Motoki, Valdemar Pinho Neto, and Victor Rodrigues. 2023. More human than human: Measuring ChatGPT political bias. Public Choice.
Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online. Association for Computational Linguistics.
Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics.
Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Qingyu Tan, Liying Cheng, Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing. 2023. SeaLLMs – Large language models for Southeast Asia.
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. 2022. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105, Dublin, Ireland. Association for Computational Linguistics.
Denis Peskov, Viktor Hangya, Jordan Boyd-Graber, and Alexander Fraser. 2021. Adapting entities across languages and cultures. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3725–3750, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Online. Association for Computational Linguistics.
Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14, New Orleans, Louisiana. Association for Computational Linguistics.
Ami D. Sperber, Robert F. Devellis, and Brian Boehlecke. 1994. Cross-cultural translation: Methodology and validation. Journal of Cross-Cultural Psychology, 25(4):501–524.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Johan Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew M. Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, Cesar Ferri, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Christopher Waites, Christian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, C. Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodolà, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Xinyue Wang, Gonzalo Jaimovitch-Lopez, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Francis Anthony Shevlin, Hinrich Schuetze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James Koppel, James Zheng, James Zou, Jan Kocon, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B. Tenenbaum, Joshua S. Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh Dhole, Kevin Gimpel, Kevin Omondi, Kory Wallace Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros-Colón, Luke Metz, Lütfi Kerem Senel, Maarten Bosma, Maarten Sap, Maartje Ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramirez-Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael Andrew Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swądrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T., Nanyun Peng, Nathan Andrew Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter W. Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan Le Bras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Russ Salakhutdinov, Ryan Andrew Chi, Seungjae Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel Stern Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima Shammie Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven Piantadosi, Stuart Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsunori Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Venkatesh Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
Yan Tao, Olga Viberg, Ryan S. Baker, and Rene F. Kizilcec. 2023. Auditing and mitigating cultural bias in LLMs.
Jean-Paul Vinay and Jean Darbelnet. 1995. Comparative Stylistics of French and English: A methodology for translation. John Benjamins.
Mingfeng Xue, Dayiheng Liu, Kexin Yang, Guanting Dong, Wenqiang Lei, Zheng Yuan, Chang Zhou, and Jingren Zhou. 2023. Occuquest: Mitigating occupational bias for inclusive large language models.
Jingfeng Yang, Haoming Jiang, Qingyu Yin, Danqing Zhang, Bing Yin, and Diyi Yang. 2022. SEQZERO: Few-shot compositional semantic parsing with sequential prompts and zero-shot models. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 49–60, Seattle, United States. Association for Computational Linguistics.
Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, and Kai-Wei Chang. 2021. Broaden the vision: Geo-diverse visual commonsense reasoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2115–2129, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20, New Orleans, Louisiana. Association for Computational Linguistics.

Author notes

* Equal contribution. This work was done during internships at NAVER AI Lab.

Action Editor: Zeljko Agic

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.