Abstract
Warning: This paper contains examples of stereotypes and biases.
The Bias Benchmark for Question Answering (BBQ) is designed to evaluate social biases of language models (LMs), but it is not simple to adapt this benchmark to cultural contexts other than the US because social biases depend heavily on the cultural context. In this paper, we present KoBBQ, a Korean bias benchmark dataset, and we propose a general framework that addresses considerations for cultural adaptation of a dataset. Our framework includes partitioning the BBQ dataset into three classes—Simply-Transferred (can be used directly after cultural translation), Target-Modified (requires localization in target groups), and Sample-Removed (does not fit Korean culture)—and adding four new categories of bias specific to Korean culture. We conduct a large-scale survey to collect and validate the social biases and the targets of the biases that reflect the stereotypes in Korean culture. The resulting KoBBQ dataset comprises 268 templates and 76,048 samples across 12 categories of social bias. We use KoBBQ to measure the accuracy and bias scores of several state-of-the-art multilingual LMs. The results clearly show differences in the bias of LMs as measured by KoBBQ and a machine-translated version of BBQ, demonstrating the need for and utility of a well-constructed, culturally aware social bias benchmark.
1 Introduction
The evaluation of social bias and stereotypes in generative language models through question answering (QA) has quickly gained importance as it can help estimate bias in downstream tasks. For English, the Bias Benchmark for Question Answering (BBQ) (Parrish et al., 2022) has been widely used in evaluating inherent social bias within large language models (LLMs) through the QA task (Liang et al., 2023; Srivastava et al., 2023). Similarly, there has been an attempt to develop a Chinese benchmark (CBBQ) (Huang and Xiong, 2023). However, there are currently no benchmarks for other languages (and their respective cultural contexts), including Korean.
BBQ is rooted in US culture, and it is quite difficult to apply BBQ to other languages and cultural contexts directly. Cultural differences can affect the contexts, types, and targets of stereotypes. For example, the stereotype of drug use is associated with low socio-economic status (SES) in BBQ, while it is associated with high SES in Korea, as shown in Figure 1. Moreover, the quality of translation can impact the QA performance of LMs. Several studies (Lin et al., 2021; Ponti et al., 2020) have highlighted the serious shortcomings of relying solely on machine-translated datasets. Therefore, constructing benchmarks to assess bias in a different cultural context requires a more sensitive and culturally aware approach.
Figure 1: BBQ and KoBBQ assess LMs’ bias by asking the model discriminatory questions with ambiguous or disambiguated context. Different cultures may have different contexts or groups associated with social bias, resulting in differences between BBQ and KoBBQ.
In this paper, we propose a process for developing culturally adaptive datasets and present KoBBQ (Korean Bias Benchmark for Question Answering) that reflects the situations and social biases in South Korea. Our methodology builds upon the English BBQ dataset while taking into account the specific cultural nuances and social biases that exist in Korean society. We leverage cultural transfer techniques, adding Korea-specific stereotypes and validating the dataset through a large-scale survey. We categorize BBQ samples into three groups for cultural transformation: Sample-Removed, Target-Modified, and Simply-Transferred. We exclude Sample-Removed samples from the dataset since they include situations and biases not present in Korean culture. For the Target-Modified samples, we conduct a survey in South Korea and use the results to modify the samples. Additionally, we enrich the dataset by adding samples with four new categories (Domestic Area of Origin, Family Structure, Political Orientation, and Educational Background), referring to these samples as Newly-Created. For each stereotype, we ask 100 South Koreans to choose the target group if the stereotype exists in South Korea, and we exclude the samples if more than half of the people report having no related stereotypes or the skew towards one target group is less than a threshold. The final KoBBQ contains 76,048 samples with 268 templates across 12 categories.1
Our research proposes diverse approaches for analyzing social bias within LMs. Using KoBBQ, we evaluate and compare various existing multilingual and Korean-specialized LLMs. We assess QA performance and bias simultaneously, using a diff-bias score whose interpretation is tied to accuracy. In addition, we analyze the models’ response patterns toward certain social categories. Our results also indicate that most LLMs have high bias scores on Newly-Created samples, implying that KoBBQ addresses culture-specific situations that existing LMs have overlooked. By comparing KoBBQ with machine-translated BBQ, we find distinctive differences in model performance and bias scores, highlighting the importance of a hand-built dataset for bias detection.
Our main contributions include:
We propose a pipeline for the cultural adaptation of existing social bias benchmark datasets to another culture. This process enables the construction of datasets better aligned with different cultural contexts, leading to more accurate and comprehensive bias measurement.
We present KoBBQ, a hand-built dataset for measuring the intrinsic social biases of LMs in light of the social contexts of Korea. It will serve as a valuable resource for assessing and understanding bias in the Korean language context.
We evaluate existing state-of-the-art Korean and multilingual LMs and provide comprehensive analyses of them in diverse ways, measuring both performance and bias scores.
2 Related Work
2.1 Social Bias in LLMs
Social bias refers to disparate treatment or outcomes between social groups that arise from historical and structural power asymmetries (Gallegos et al., 2023). These biases manifest in various forms, from toxic expressions towards certain social groups to stereotypical linguistic associations.
Recent studies have revealed inherent bias in LLMs across diverse categories, including gender, political ideologies, occupation, age, disability status, class, culture, gender identity, sexual orientation, race, ethnicity, nationality, and religion (Kotek et al., 2023; Motoki et al., 2023; Xue et al., 2023; Esiobu et al., 2023). Tao et al. (2023) observe that LLMs exhibit cultural biases resembling those of English-speaking and Protestant European countries, and Nguyen et al. (2023) underscore the need for equitable and culturally aware AI and evaluation.
Bias in LLMs can be quantified 1) from the embeddings or probabilities of tokens or sentences, and 2) from generated text, via its distribution, classifier predictions, or lexicon matches. Evaluation datasets for measuring bias leverage counterfactual inputs (fill-in-the-blank tasks with masked tokens, or predicting the most likely unmasked sentence) or prompts (sentence completion and question answering) (Rudinger et al., 2018; Nangia et al., 2020; Gehman et al., 2020; Parrish et al., 2022), inter alia.2
2.2 Bias and Stereotype Datasets
BBQ-format Datasets.
The BBQ (Parrish et al., 2022) dataset is designed to evaluate models for bias and stereotypes using a multiple-choice QA format. It includes real-life scenarios and associated questions to address social biases inherent in LMs. As the QA format is highly adaptable for evaluating BERT-like models and generative LMs, it is used for assessing state-of-the-art LMs (Liang et al., 2023; Srivastava et al., 2023). However, BBQ mainly contains US-centric stereotypes, which poses challenges for direct implementation in Korean culture.
Huang and Xiong (2023) released CBBQ, a Chinese BBQ dataset tailored to Chinese social and cultural contexts. They re-define bias categories and types for Chinese culture based on the Employment Promotion Law, news articles, social media, and knowledge resource corpora in China. However, neither BBQ nor CBBQ verifies, through a large-scale survey, whether their samples convey social and cultural contexts appropriately. A more in-depth comparison of KoBBQ with the other BBQ datasets is provided in §5.2.
English Datasets.
Winogender (Rudinger et al., 2018) and WinoBias (Zhao et al., 2018) shed light on gender bias through the use of gender pronouns (i.e., he, she, they), but this approach is difficult to apply to Korean, where gender pronouns are rarely used. StereoSet (Nadeem et al., 2021) and CrowS-Pairs (Nangia et al., 2020) measure stereotypical bias in masked language models. UnQover (Li et al., 2020) quantifies biases in a QA format with underspecified questions, which share a similar idea with the ambiguous-context questions in BBQ. BOLD (Dhamala et al., 2021) measures social bias in open-ended text generation with complex metrics that depend on another language model or pre-defined lexicons, including gender pronouns. These datasets cover only limited categories of social bias.
Korean Datasets.
Several Korean datasets deal with bias. K-StereoSet3 is a machine-translated and post-edited version of the StereoSet development set, which is small and noisy. KoSBi (Lee et al., 2023a) is an extrinsic evaluation dataset for assessing whether the outputs of generative LMs are safe. It is created through a machine-in-the-loop framework, considering target groups that reflect Korean culture, and classifies unsafe outputs into three types: stereotype, prejudice, and discrimination. Still, it is difficult to identify from these datasets the different types of stereotypes that exist within Korean culture.
2.3 Cross-cultural NLP
Several approaches for cultural considerations in LMs have been proposed for tasks such as word vector space construction and hate speech classification (Lin et al., 2018; Lee et al., 2023b), as well as culturally sensitive dataset construction (Liu et al., 2021; Yin et al., 2021; Jeong et al., 2022). Recent studies have also presented methods for translating existing data in a culturally sensitive manner, either by automatically removing examples with social keywords, i.e., those related to social behaviors (e.g., weddings) (Lin et al., 2021), or by performing cross-cultural translation with human translators who substitute or paraphrase original concepts into ones of similar meaning (Ponti et al., 2020). Our approach builds upon these methods by adopting cross-cultural translation, manually eliminating samples that do not fit Korean culture, and incorporating culturally fitting target groups and handcrafted samples into a Korean-specific bias benchmark dataset.
3 KoBBQ Dataset
3.1 BBQ-format Dataset
The task is to answer a discriminatory question given a context, where the context and question address a stereotype related to specific target social groups. The dataset builds upon templates with attributes for the target group, non-target group (groups far from the stereotype), and lexical variants. Each template with unique attributes involves a total of eight context-question pairs, with four different context types (either ambiguous or disambiguated, and either biased or counter-biased) and two different question types (biased or counter-biased).
Context Types.
The context describes a scenario where two individuals from different social groups engage in behavior related to the given stereotype. Let ‘target’ denote the one from the target group and ‘non-target’ the other. A biased context depicts a situation where the behavior of the ‘target’ aligns with the stereotype. In contrast, the roles of the two people are swapped in a counter-biased context.
The first half of each context only mentions the ‘target’ and ‘non-target’ without sufficient information to answer the questions accurately, referred to as an ambiguous context. The second half adds the necessary details to answer the question, making the whole context a disambiguated context.
Question Types.
A biased question asks which group conforms to a given stereotype, while a counter-biased question asks which group goes against it.
Answer Types.
The correct answer in ambiguous contexts is always ‘unknown.’ When given a disambiguated context, the correct answer under a biased context is always the biased answer, referring to answers conforming to social biases. Under a counter-biased context, the correct answer is always the counter-biased answer that goes against the social bias.
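To make the template structure concrete, the sketch below (illustrative only, not the released KoBBQ code; field names such as `ambiguous` and `disambiguating` are our own) enumerates the eight context–question pairs of §3.1 and assigns each its gold answer:

```python
from itertools import product

def expand_template(ambiguous, disambiguating, target, non_target):
    """Expand one template into its 8 context-question pairs.

    `ambiguous` and `disambiguating` map a scenario ('biased' or
    'counter-biased') to the first (ambiguous) and second (disambiguating)
    halves of the context; the disambiguated context is their concatenation.
    """
    pairs = []
    for scenario, question in product(["biased", "counter-biased"], repeat=2):
        # Ambiguous context: not enough information, so gold is 'unknown'.
        pairs.append({"context": ambiguous[scenario],
                      "question_type": question, "gold": "unknown"})
        # Disambiguated context: gold names whoever actually performed the
        # behavior the question asks about -- the 'target' when the scenario
        # and question type agree, the 'non-target' otherwise.
        gold = target if scenario == question else non_target
        pairs.append({"context": ambiguous[scenario] + " " + disambiguating[scenario],
                      "question_type": question, "gold": gold})
    return pairs  # 2 scenarios x 2 context types x 2 question types = 8
```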
3.2 Dataset Construction
The dataset curation process of KoBBQ consists of five steps: (1) categorization of BBQ templates, (2) cultural-sensitive translation, (3) demographic category reconstruction, (4) creation of new templates, and (5) a large-scale survey on social bias. Each step is explained below.
3.2.1 Categorization of BBQ Templates
Four of the authors, who are native Koreans, categorize the templates from the original BBQ dataset into three classes: Sample-Removed, Target-Modified, and Simply-Transferred. We hold discussions to establish a consensus on all labels. Figure 2 shows examples of each class.
Figure 2: Examples of the four types in KoBBQ. The yellow box indicates the answer to the biased question, asking which group conforms to the relevant social value. [N1] and [N2] represent the templated slots, each filled with one potential filler from the target or non-target groups. A dotted box marks the target groups that align with the relevant social bias. Parts modified from BBQ are marked with strikethrough, while cultural-sensitive translation parts are underlined.
Sample-Removed
refers to samples that are not representative of the Korean cultural context. We exclude Sample-Removed samples from KoBBQ to accurately reflect Korean culture.
Target-Modified
denotes samples whose inherent biases exist in Korean culture but are stereotyped towards different target groups. Therefore, in addition to cultural-sensitive translation, we modify and collect target groups appropriate for Korean culture through a large-scale public survey of Korean citizens.
Simply-Transferred
indicates samples revealing stereotypical biases that match Korean cultural background. These samples only go through cultural-sensitive translation when transformed into samples of KoBBQ.
3.2.2 Cultural-sensitive Translation
We initially use DeepL Translator4 to translate Simply-Transferred and Target-Modified samples. However, Peskov et al. (2021) pointed out that translated sentences may lack cultural context, highlighting the need to adapt entities to the target culture, known as adaptation in the translation field (Vinay and Darbelnet, 1995) and part of cross-cultural translation (Sperber et al., 1994). To ensure a high-quality translation reflecting Korean cultural contexts, we ask a professional translator to perform culturally sensitive, human-moderated translations. We specifically ask the translator to use words familiar in Korean culture, such as E-Mart5 instead of Walmart, bleached hair instead of dark hair,6 and basketball instead of rugby,7 to avoid awkwardness stemming from the cultural differences between the US and Korea.
3.2.3 Demographic Category Reconstruction
We reconstruct the stereotyped group categories of the original BBQ based on the categories and demographic groups of KoSBi (Lee et al., 2023a), which refer to the UDHR8 and NHRCK.9 We (1) merge race/ethnicity and nationality into a single category and (2) add four categories reflecting social contexts unique to Korean culture: domestic area of origin, educational background, family structure, and political orientation. We merge the two categories because the distinction between race/ethnicity and nationality is vague in Korea, considering that Korea is an ethnically homogeneous nation compared to the US (Han, 2007). For the newly merged race/ethnicity/nationality category, we include groups potentially familiar to Korean people: races that receive social prejudice from Koreans (Lee, 2007), ethnicities related to North Korea, China, and Japan, and the top two countries with the highest number of immigrants from each world region, as determined by MOFA10 between 2000 and 2022.11 Moreover, by adding new categories, the dataset covers a wide range of social biases and corresponding target groups embedded within Korean society. The final KoBBQ comprises the 12 categories shown in Table 1.
Table 1: Statistics of KoBBQ. ST, TM, SR, and NC denote Simply-Transferred, Target-Modified, Sample-Removed, and Newly-Created, respectively. Numbers in parentheses indicate the number of templates before filtering by the survey results. The number of samples is the number of unique context–question pairs.
| Category | SR | TM | ST | NC | # of Templates | # of Samples |
|---|---|---|---|---|---|---|
| Age | 1 | 0 | 20 | 1 | (28 →) 21 | 3,608 |
| Disability Status | 0 | 0 | 20 | 0 | (25 →) 20 | 2,160 |
| Gender Identity | 0 | 0 | 25 | 0 | (29 →) 25 | 768 |
| Physical Appearance | 3 | 0 | 17 | 3 | (25 →) 20 | 4,040 |
| Race/Ethnicity/Nationality | 17 | 33 | 0 | 10 | (46 →) 43 | 51,856 |
| Religion | 10 | 7 | 4 | 9 | (25 →) 20 | 688 |
| Socio-Economic Status | 7 | 1 | 16 | 10 | (28 →) 27 | 6,928 |
| Sexual Orientation | 10 | 1 | 5 | 6 | (25 →) 12 | 552 |
| Domestic Area of Origin | 0 | 0 | 0 | 22 | (25 →) 22 | 800 |
| Family Structure | 0 | 0 | 0 | 23 | (25 →) 23 | 1,096 |
| Political Orientation | 0 | 0 | 0 | 11 | (28 →) 11 | 312 |
| Educational Background | 0 | 0 | 0 | 24 | (25 →) 24 | 3,240 |
| Total | 48 | 42 | 107 | 119 | 268 | 76,048 |
3.2.4 Creation of New Templates
To create a fair and representative sample of Korean culture and balance the number of samples across categories, the authors manually devise templates and label them as Newly-Created. Our templates rely on sources backed by solid evidence, such as research articles featuring in-depth interviews with representatives of the target groups, statistical reports derived from large-scale surveys conducted on the Korean public, and news articles that provide expert analysis of statistical findings.
3.2.5 Large-scale Survey on Social Bias
In contrast to BBQ, we employ statistical evidence to validate social bias and target groups within KoBBQ by implementing a large-scale survey of the Korean public.12
Survey Setting.
We conduct a large-scale survey to verify whether the stereotypical biases revealed through KoBBQ match the general cognition of the Korean public. Moreover, we perform a separate reading comprehension survey, where we validate the contexts and associated questions. To ensure a balanced demographic representation of the Korean public, we require the participation of 100 individuals for each survey question while balancing gender and age groups.
For the social bias verification survey, we split the whole dataset into two types: 1) templates whose target or non-target groups must be modified or newly designated, and 2) templates where only the stereotype needs to be validated, with a fixed target group. All Target-Modified templates conform to the first type. Among Simply-Transferred and Newly-Created templates, those in the religion, domestic area of origin, and race/ethnicity/nationality categories are also included in the first type, unless the reference explicitly mentions the non-target groups; for those categories, it is hard to specify the non-target groups based only on the target groups. All other templates conform to the second type. As some samples within KoBBQ share the same stereotype, we extract unique stereotypes for survey question construction.
Target Modification.
In addition to target group selection, the non-target groups in KoBBQ differ from those of BBQ in that they comprise only groups far from the social stereotype, enabling a sharper comparison between target and non-target groups. In the survey, for the first type, we ask workers to select all possible target groups for a given social bias using a select-all-that-apply question format, with the prompt “Please choose all social groups that are appropriate as the ones corresponding to the stereotype ‘<social_bias>’ in the common perception of Korean society.” We provide a comprehensive list of demographic groups for each category, including a ‘no stereotype exists’ option for respondents who hold no bias regarding the given social bias.
We select as target groups those that received at least twice the number of votes expected under an equal distribution of votes across all options, and as non-target groups those that received half that expectation or fewer, ensuring that we keep only options with significant bias.13 If either of the two groups is empty, we eliminate the corresponding samples from the dataset. As a result, 8.3% of the stereotypes within this survey type are eliminated, resulting in a 3.0% decrease in the total number of templates.
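A minimal sketch of this vote-threshold rule, assuming `votes` holds the per-option selection counts from the select-all-that-apply question (function and variable names are our own, not the authors’ released code):

```python
def split_target_nontarget(votes: dict[str, int]):
    """Select target / non-target groups from select-all-that-apply votes."""
    expected = sum(votes.values()) / len(votes)  # equal-distribution baseline
    targets = [g for g, v in votes.items() if v >= 2 * expected]
    non_targets = [g for g, v in votes.items() if v <= 0.5 * expected]
    return targets, non_targets

# A stereotype is dropped when either list comes back empty.
targets, non_targets = split_target_nontarget(
    {"Group A": 61, "Group B": 9, "Group C": 20})  # -> ['Group A'], ['Group B']
```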
Stereotype Validation.
References alone are not enough to demonstrate the existence of social biases in Korean society. To confirm such biases, for the second type, we conduct a large-scale survey where workers are asked to identify which of the provided target and non-target groups corresponds to the given social bias. We use the prompt “When comparing <group1> and <group2> in the context of Korean society, please choose the social group that corresponds to the stereotype ‘<social_bias>’ as a fixed perception.” We also provide a ‘no stereotype exists’ choice for respondents with no related bias. The order of the target and non-target groups is randomly shuffled into the <group1> and <group2> slots.
After the survey, we keep only the templates where more than two-thirds of the respondents who did not select ‘no stereotype exists’ chose the designated target group, eliminating those that do not demonstrate significant bias toward the target group. This guarantees a representative label reflecting the majority opinion. After this step, the number of stereotypes is reduced by 13.6% in this survey type, and the overall template count decreases by 10.9%.
Data Filtering.
We finalize our dataset using two filtering methods: 1) the ‘no stereotype exists’ count and 2) a reading comprehension task. We apply both methods to both survey types.
Across the 290 unique stereotypes, 18.8% of respondents chose the ‘no stereotype exists’ option on average. To keep only stereotypes that align with common social stereotypes in Korean society, we exclude any stereotype for which over 50% of the workers chose ‘no stereotype exists.’ This eliminates an additional 3.1% of the stereotypes, resulting in a 2.8% decrease in the total template count.
We construct a reading comprehension task for each template, using counter-biased contexts and counter-biased questions, since these require more attention to comprehend and thus demand higher focus from the workers. We eliminate templates for which the ratio of correct answers to the corresponding context–question pair is below 50%. This step discards 3.9% of the remaining templates. The discarded samples include those whose disambiguated contexts were too ambiguous for human annotators to answer correctly.
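The sketch below consolidates, for a single template of the second survey type, the stereotype-validation rule above and the two filtering methods of this subsection; the tally variable names are illustrative assumptions, with thresholds as stated in the text:

```python
def keep_template(n_target, n_non_target, n_no_stereotype, rc_accuracy):
    """Apply the survey-based acceptance rules of Section 3.2.5."""
    n_total = n_target + n_non_target + n_no_stereotype
    # Filter 1: drop stereotypes that over half of respondents say do not exist.
    if n_no_stereotype > 0.5 * n_total:
        return False
    # Stereotype validation: among respondents who did not pick
    # 'no stereotype exists', more than two-thirds must pick the target group.
    if n_target <= (2 / 3) * (n_target + n_non_target):
        return False
    # Filter 2: the reading comprehension check must reach 50% accuracy.
    return rc_accuracy >= 0.5
```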
3.3 Data Statistics
Table 1 shows the number of templates per class mentioned in §3.2.1 and the number of samples per category. Each template yields multiple samples, as the target and non-target groups are each substituted with several specific group members. We also report the number of templates before and after the survey-based filtering.
The categories from the original BBQ that capture a significant portion of the social biases existing within Korean society, such as age, disability status, and gender identity, are mainly composed of Simply-Transferred templates. For race/ethnicity/nationality, with its demographic groups newly updated, all the original templates are classified as Target-Modified, except those whose social bias or context is not applicable to Korean culture. To add social biases specific to Korean culture and to balance the dataset across categories, we create new samples for the original BBQ categories as well, as shown in the Newly-Created counts. However, based on the survey results, a significant number of templates are removed from the sexual orientation and political orientation categories, indicating that the Korean public does not hold a diverse range of social biases regarding those categories, as evidenced by the change in template counts before and after the survey.
4 Experiments
In this section, we evaluate state-of-the-art generative LLMs on KoBBQ. Our evaluation encompasses accuracy and bias scores, ensuring a comprehensive assessment of the models’ inherent bias.
4.1 Experimental Settings
The task is multiple-choice QA, in which the models are asked to choose the most appropriate answer when given a context, a question, and three choices (‘target,’ ‘non-target,’ and ‘unknown’).
Evaluation Prompts.
We use five different prompts with different instructions and different ‘unknown’ expressions. The gray text box below shows one of the prompts we use in the experiment. Following Izacard et al. (2023), we apply the cyclic permutation of the three choices (A, B, and C) to each prompt.
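For illustration, the snippet below shows how the three cyclic permutations of the answer choices can be generated and formatted; the prompt skeleton and choice strings are placeholders, not one of the five actual evaluation prompts:

```python
def cyclic_permutations(choices):
    """All rotations of the choice list (3 permutations for 3 choices)."""
    return [choices[i:] + choices[:i] for i in range(len(choices))]

choices = ["<target>", "<non-target>", "<unknown expression>"]
for perm in cyclic_permutations(choices):
    options = "\n".join(f"{label}: {text}" for label, text in zip("ABC", perm))
    prompt = f"{{context}}\n{{question}}\n{options}\nAnswer:"  # placeholder skeleton
```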
Evaluation Set.
Each template in KoBBQ comprises multiple target and non-target groups, along with alternative expressions. Because evaluating all combinations would be vast and unevenly distributed, we use a test set containing one randomly sampled example per template. In total, our evaluation set comprises 32,160 samples, i.e., quadruples of prompt, context, question, and choice permutation (268 templates × 8 context–question pairs × 5 prompts × 3 permutations).14
Models.
We only include models capable of QA tasks in the zero-shot setting, since fine-tuning or few-shot examples can affect the bias of the models (Li et al., 2020; Yang et al., 2022). The following models are used in the experiments: Claude-v1 (claude-instant-1.2) and Claude-v2 (claude-2.0)15 (Bai et al., 2022), GPT-3.5 (gpt-3.5-turbo-0613) and GPT-4 (gpt-4-0613),16 CLOVA-X,17 and KoAlpaca (KoAlpaca-Polyglot-12.8B).18 For GPT-3.5 and GPT-4, we use the OpenAI API and set the temperature to 0 for greedy decoding. The model inferences were run from August to September 2023.
Post-processing of Generated Answers.
We establish criteria for accepting responses generated by the models so that only valid answers are scored. Specifically, a response must meet one of the following criteria: (i) contain only a single letter indicating one of the given options, (ii) exactly match the text of one of the options, optionally preceded by its option letter, or (iii) contain an expression explicitly intended to provide an answer, such as ‘answer is -’. Responses failing all three criteria are considered out-of-choice answers and are excluded from scoring.
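A sketch of such a post-processor is below; the regular expression and option formats are illustrative assumptions rather than the exact patterns of the released evaluation code:

```python
import re

def parse_answer(response: str, options: dict[str, str]):
    """Map a raw model response to an option letter, or None if out-of-choice.

    `options` maps letters ('A', 'B', 'C') to option texts.
    """
    text = response.strip()
    # (i) A single letter naming one of the options.
    if text in options:
        return text
    # (ii) An exact match with an option text, optionally with its letter.
    for letter, option in options.items():
        if text in (option, f"{letter}: {option}", f"{letter}. {option}"):
            return letter
    # (iii) An explicit answer expression, e.g., 'answer is B'.
    match = re.search(r"answer is[^ABC]*([ABC])", text, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    return None  # counted as out-of-choice and excluded from scoring
```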
4.2 Evaluation Metrics
Considering the nature of the BBQ-formatted dataset, it is essential to measure both the accuracy and bias score of models. In this section, we define the accuracy and diff-bias score using the notations shown in Table 2.
Table 2: Notation for the counts of each case. $n_t$ denotes the number of templates corresponding to each combination. Amb, Dis, B, cB, and Unk abbreviate ambiguous, disambiguated, biased, counter-biased, and unknown, respectively. Each bolded cell indicates the correct answer type for the given context. Each context type contains cases for both biased and counter-biased questions, for a total of $2n_t$ cases.
| Context | Answer | B | cB | Unk | Total |
|---|---|---|---|---|---|
| Amb | B / cB | $n_{ab}$ | $n_{ac}$ | **$n_{au}$** | $n_a$ (= $4n_t$) |
| Dis | B | **$n_{bb}$** | $n_{bc}$ | $n_{bu}$ | $n_b$ (= $2n_t$) |
| Dis | cB | $n_{cb}$ | **$n_{cc}$** | $n_{cu}$ | $n_c$ (= $2n_t$) |
Accuracy.
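In terms of the counts in Table 2, accuracy is the fraction of correct answers in each context type. The formulas below are our reconstruction from that notation, consistent with the correct-answer rules in §3.1 and the max|bias| values in Table 3:

```latex
\mathrm{acc}_{\mathrm{amb}} = \frac{n_{au}}{n_a},
\qquad
\mathrm{acc}_{\mathrm{dis}} = \frac{n_{bb} + n_{cc}}{n_b + n_c}
```

That is, only ‘unknown’ counts as correct in ambiguous contexts, while in disambiguated contexts the correct answers are the biased ones under biased contexts ($n_{bb}$) and the counter-biased ones under counter-biased contexts ($n_{cc}$).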
Diff-bias Score.
In the BBQ-format datasets, the extent to which a language model reveals its inherent social bias depends on its QA performance. For instance, if the model answers the question perfectly based only on the context provided, it means that the model is not affected by any bias. In this section, we define diff-bias scores based on Parrish et al. (2022) to measure how frequently the model answers questions based on its bias. Furthermore, we provide their maximum values, which are determined by the model’s accuracy. This highlights the importance of evaluating both the bias score and accuracy in tandem.
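Concretely, with the notation of Table 2, the diff-bias scores and their accuracy-dependent maxima can be written as follows (again a reconstruction, chosen to match the max|bias| column reported in Table 3):

```latex
\text{diff-bias}_{\mathrm{amb}} = \frac{n_{ab} - n_{ac}}{n_a},
\qquad
\text{diff-bias}_{\mathrm{dis}} = \frac{n_{bb}}{n_b} - \frac{n_{cc}}{n_c}
```

```latex
\max \lvert \text{bias}_{\mathrm{amb}} \rvert = 1 - \mathrm{acc}_{\mathrm{amb}},
\qquad
\max \lvert \text{bias}_{\mathrm{dis}} \rvert = 2 \min\!\left(\mathrm{acc}_{\mathrm{dis}},\, 1 - \mathrm{acc}_{\mathrm{dis}}\right)
```

As a check against Table 3, KoAlpaca’s ambiguous accuracy of 0.1732 gives max|bias| = 1 − 0.1732 = 0.8268, and its disambiguated accuracy of 0.4247 gives 2 × min(0.4247, 0.5753) ≈ 0.8495, matching the reported values.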
In summary, the accuracy represents the frequency of the model generating correct predictions, while the diff-bias indicates the direction and the extent to which incorrect predictions are biased. An optimal model would exhibit an accuracy of 1 and a diff-bias score of 0. A uniformly random model would have an accuracy of 1/3 and a diff-bias score of 0. A model that consistently provides only biased answers would have a diff-bias score of 1, with an accuracy of 0 in ambiguous contexts and 0.5 in disambiguated contexts.
4.3 Experimental Results
In this section, we present the evaluation results of the six LLMs on KoBBQ.
Accuracy and Diff-bias Score.
Table 3 shows the accuracy and diff-bias scores of the models on KoBBQ.19 Overall, the models show higher accuracy in disambiguated contexts compared to ambiguous contexts. Remarkably, all the models present positive diff-bias scores, with pronounced severity in ambiguous contexts. This suggests that the models tend to favor outputs that are aligned with prevailing societal biases.
Table 3: The diff-bias scores and accuracy of the models across five different prompts. ‘max|bias|’ indicates the maximum absolute value of the diff-bias score attainable at the given accuracy. Rows are sorted by accuracy.
(a) Ambiguous Context

| Model | accuracy (↑) | diff-bias (↓) | max\|bias\| |
|---|---|---|---|
| KoAlpaca | 0.1732±0.0435 | 0.0172±0.0049 | 0.8268 |
| Claude-v1 | 0.2702±0.1691 | 0.2579±0.0645 | 0.7298 |
| Claude-v2 | 0.5503±0.2266 | 0.1556±0.0480 | 0.4497 |
| GPT-3.5 | 0.6194±0.0480 | 0.1653±0.0231 | 0.3806 |
| CLOVA-X | 0.8603±0.0934 | 0.0576±0.0333 | 0.1397 |
| GPT-4 | 0.9650±0.0245 | 0.0256±0.0152 | 0.0350 |

(b) Disambiguated Context

| Model | accuracy (↑) | diff-bias (↓) | max\|bias\| |
|---|---|---|---|
| KoAlpaca | 0.4247±0.0199 | 0.0252±0.0085 | 0.8495 |
| CLOVA-X | 0.7754±0.0825 | 0.0362±0.0103 | 0.4491 |
| GPT-3.5 | 0.8577±0.0142 | 0.0869±0.0094 | 0.2847 |
| Claude-v2 | 0.8762±0.0650 | 0.0321±0.0050 | 0.2475 |
| Claude-v1 | 0.9103±0.0224 | 0.0322±0.0041 | 0.1793 |
| GPT-4 | 0.9594±0.0059 | 0.0049±0.0070 | 0.0811 |
Specifically, GPT-4 achieves by far the highest accuracy, over 0.95 in both contexts, while also having low diff-bias scores. However, considering the ratio of its diff-bias score to the maximum attainable value, GPT-4 still cannot be said to be free from bias. Regarding diff-bias scores, Claude-v1 and GPT-3.5 show the highest values in ambiguous and disambiguated contexts, respectively. Meanwhile, KoAlpaca exhibits both low accuracy and low bias scores, which is attributable to its tendency to choose randomly between the two non-‘unknown’ options in most cases.
Bias Score by Category.
Figure 3 depicts the diff-bias score for each stereotyped group category across the six models. We observe significant differences in diff-bias scores among bias categories in both ambiguous and disambiguated contexts (p < 0.01, one-way ANOVA). In particular, stereotypes associated with socio-economic status show a significantly lower diff-bias score in disambiguated contexts than all other bias categories. Additionally, stereotypes associated with gender identity and race/ethnicity/nationality exhibit marginally lower diff-bias scores in ambiguous contexts, whereas those associated with age and political orientation show marginally higher scores; each of these categories is significantly lower or higher than the overall diff-bias score.
Figure 3: Tukey-HSD test on the normalized diff-bias scores for each stereotype group category with 99% confidence interval.
Scores by Label Type.
Figure 4 illustrates the accuracy and diff-bias scores for each label type across the models. In ambiguous contexts, the Newly-Created samples have the lowest accuracy and the highest diff-bias score, suggesting that the samples the authors added expose previously unexamined inherent bias in LMs. Target-Modified and Simply-Transferred samples show similar accuracy but a noticeable difference in diff-bias score in ambiguous contexts, showing that bias scores can differ even when accuracy is similar. In disambiguated contexts, higher accuracy tends to be associated with a lower bias score, and the models achieve the highest QA performance with the lowest diff-bias score on the Newly-Created samples.
Figure 4: Tukey-HSD test on both the normalized accuracy and diff-bias scores for each sample type with 99% confidence interval.
5 Discussion
5.1 KoBBQ vs. Machine-translated BBQ
To highlight the need for a hand-crafted bias benchmark considering cultural differences, we show the differences in performance and bias of LMs between KoBBQ and machine-translated BBQ (mtBBQ). Table 4 shows the accuracy and bias scores of models for the Simply-Transferred (ST) and Target-Modified (TM) samples, which are included in both KoBBQ and mtBBQ. We perform a Wilcoxon rank-sum test to examine the statistically significant differences between the two datasets for each model and label.
Table 4: Comparison of accuracy, bias scores, and Wilcoxon rank-sum test results between KoBBQ and machine-translated BBQ (mtBBQ) on the ST (Simply-Transferred) and TM (Target-Modified) labels. P-values are calculated between KoBBQ and mtBBQ for each label and model. Colored cells indicate statistically significant differences.
Regarding accuracy, the models show higher scores on KoBBQ than mtBBQ in disambiguated contexts, exhibiting a significant difference, except for KoAlpaca, which shows low QA performance. Since the task in disambiguated contexts resembles the machine reading comprehension task, this underscores how manual translation enhances contextual comprehension. There is no significant difference in ambiguous contexts between KoBBQ and mtBBQ.
For the diff-bias score, differences between KoBBQ and mtBBQ exist in both contexts. In general, model biases are higher on KoBBQ than on mtBBQ in ambiguous contexts. This may be due to the models’ incomplete comprehension of machine-translated text, which makes mtBBQ less effective at surfacing inherent model bias than the manually translated KoBBQ. Under disambiguated contexts, some significantly different cases exist, although there is no clear trend in the ordering between KoBBQ and mtBBQ.
Overall, KoBBQ and mtBBQ show differences in both model performance and bias scores, even when considering only the common labels (Simply-Transferred and Target-Modified) and excluding the labels unique to KoBBQ (Newly-Created and Sample-Removed). These findings highlight the importance of manual translation and cultural adaptation, as machine translation alone is insufficient for measuring a model’s bias.
5.2 KoBBQ vs. BBQ/CBBQ
In this work, we present a general framework for extending the BBQ dataset (Parrish et al., 2022) to various cultures. Through template categorization in terms of applicability, we label whether a sample is applicable with only minor revisions (Simply-Transferred), applicable with different target groups (Target-Modified), or not applicable at all (Sample-Removed). Our labeling results can aid research on Korean culture, and our framework can be used to build culturally adapted datasets for other cultures as well. Datasets constructed in this manner enable direct comparisons of cultural differences with the existing dataset: Simply-Transferred samples can reveal a multilingual LM’s variation across languages in shared contexts, and Target-Modified samples demonstrate cultural distinctions through the different target groups associated with the same stereotypes.
KoBBQ is created directly by humans without the assistance of LLMs (except for the initial translation). We explored the possibility of using LLMs within our framework but encountered certain limitations. First, we asked GPT-4 to choose all target groups associated with the given stereotypes, in the same way as the human survey for target modification. Comparing GPT-4 with the human survey results for Target-Modified samples reveals low agreement, with an accuracy (exact match) of 23.8% and an F1 score (average F1 over all target group classes) of 39.73%. Furthermore, similar to the approach in CBBQ (Huang and Xiong, 2023), we experimented with letting GPT-4 generate disambiguated contexts, questions, and answers, given stereotypes and ambiguous contexts written by humans. We find several limitations of LLMs in context generation: 1) GPT-4 produces generic expressions rather than specific or culturally grounded situations and keywords, failing to reflect Korea’s unique culture in the context; 2) for counter-biased contexts, it still tends to write contexts in a biased manner, reflecting its inherent bias; and 3) it struggles to construct a disambiguated context that contains answers for both the biased and counter-biased questions. The outputs also include instances that fail to follow the template format or contain Korean-specific grammatical errors. Detailed examples are shown in Table 5. These results demonstrate that human effort remains essential for constructing a culturally sensitive bias benchmark.
Table 5: Examples of disambiguated contexts generated by humans and GPT-4. Compared to human-written contexts, GPT-4 tends to 1) generate general rather than specific or cultural contexts, 2) make grammatical errors and create a biased context where it is prompted to create a counter-biased one, and 3) fail to create a fully disambiguated context that includes the answers to both the biased and counter-biased questions. The grammatical errors are underlined.
Although BBQ, CBBQ, and KoBBQ are all written based on the relevant references, only KoBBQ incorporates a comprehensive large-scale survey targeting the domestic public. It not only validates the reliability of the benchmark but also reflects the intensity of certain stereotypes in South Korea. As this result could provide valuable insights into the stereotypes present in Korean society, we will release the raw survey results along with our dataset for future research.
6 Conclusion
We presented KoBBQ, a Korean bias benchmark containing question-answering data on situations related to biases present in Korea. Starting from BBQ, an existing US-centric bias benchmark, we divided its samples into three classes (Simply-Transferred, Target-Modified, and Sample-Removed) to adapt it culturally, and we added four new categories depicting biases prevalent in Korean culture. KoBBQ consists of 76,048 samples across 12 categories of social bias. To ensure the quality and reliability of our data, we recruited a sufficient number of crowdworkers for the validation process. Using KoBBQ, we analyzed six large language models in terms of accuracy and diff-bias score. By showing the differences between KoBBQ and machine-translated BBQ, we emphasized the need for culturally sensitive and meticulously curated bias benchmarks.
Our method can be applied to other cultures, promoting the development of culture-specific bias benchmarks. We leave the extension of the dataset to other languages, and a framework for universal adaptation across more than two cultures, as future work. Furthermore, we expect KoBBQ to contribute to the safer use of LLM applications by assessing the inherent social biases present in the models.
Limitations
While the perception of social bias can be subjective, we made an extensive effort to gather insights into prevalent social biases in Korean society through our large-scale survey. Nevertheless, caution should be taken before drawing definitive conclusions based solely on our findings. Furthermore, we acknowledge the potential existence of other social bias categories in Korean society that our study has not addressed.
It is crucial to understand that performance in QA tasks can influence bias measurements. Our metric does not entirely disentangle bias scores from QA performance. Hence, a holistic view that considers both aspects is essential to avoid potentially incomplete or skewed interpretations.
Ethics Statement
This research project was performed under approval from KAIST IRB (KH2023-069). We ensured that the wages of our translator and crowdworkers exceeded the 2023 minimum wage in the Republic of Korea, KRW 9,620 per hour (approximately USD 7.25).20 Specifically, we paid the translator around KRW 150 per word over a duration of two weeks, resulting in a payment of KRW 2,500,000. For the large-scale survey verifying stereotypes in Korea, we paid Macromill Embrain KRW 4,200,000 with a contract period of 11 days. There was no discrimination in recruiting workers regarding any demographics, including gender and age. Workers were informed that the content might be stereotypical or biased.
We acknowledge the potential risk associated with releasing a dataset that contains stereotypes and biases. This dataset must not be used as training data to automatically generate and publish biased language targeting specific groups. We will explicitly state in the terms of use that we do not condone any malicious use. We strongly encourage researchers and practitioners to utilize this dataset in beneficial ways, such as mitigating bias in language models.
Acknowledgments
This project was funded by the KAIST-NAVER hypercreative AI center. Alice Oh is funded by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics). The authors would like to thank Jaehong Kim from the KAIST Graduate School of Culture Technology for his assistance with the survey design.
Notes
Our KoBBQ dataset, evaluation codes including prompts, and survey results are available at https://jinjh0123.github.io/KoBBQ.
Existing evaluation datasets for bias in LLMs are available at https://github.com/i-gallegos/Fair-LLM-Benchmark.
One of the largest discount stores in Korea (https://company.emart.com/en/company/business.do).
Typically, the natural hair color of Korean individuals is dark (Im et al., 2017).
Most popular sports activities in South Korea as of March 2023 (https://www.statista.com/forecasts/1389015/most-popular-sports-activities-in-south-korea).
Universal Declaration of Human Rights.
National Human Rights Commission of Korea.
Ministry of Foreign Affairs.
Done with Macromill Embrain, a Korean company specialized in online research (https://embrain.com/).
As there are 38 options for race/ethnicity/nationality, we exclude the specific countries while only including each region name for option counts to prevent thresholds being too low (e.g., excluding US and Canada while including North America).
We check that the average differences of both the accuracy and diff-bias scores on the evaluation set and the entire KoBBQ set are less than 0.005, and they result in no significant differences by Wilcoxon rank-sum test for Claude-v1, GPT-3.5, and CLOVA-X with 3 prompts. When calculating the scores for the entire set, we average the scores of samples from the same template, to mitigate the impact of the imbalance of samples for each template.
The average ratios of out-of-choice answers from each model are below 0.005, except for Claude-v2 (0.015), CLOVA-X (0.068), and KoAlpaca (0.098).
Author notes
Equal Contribution. This work was done during the internships at NAVER AI Lab.
Action Editor: Željko Agić