Abstract
Although commonsense reasoning is greatly shaped by cultural and geographical factors, previous studies have predominantly centered on cultures grounded in the English language, potentially resulting in an Anglocentric bias. In this paper, we introduce IndoCulture, aimed at understanding the influence of geographical factors on language model reasoning ability, with a specific emphasis on the diverse cultures found within eleven Indonesian provinces. In contrast to prior work that has relied on templates (Yin et al., 2022) and online scraping (Fung et al., 2024), we create IndoCulture by asking local people to manually develop a cultural context and plausible options, across a set of predefined topics. Evaluation of 27 language models reveals several insights: (1) the open-weight Llama–3 is competitive with GPT–4, while other open-weight models struggle, with accuracies below 50%; (2) models generally perform better for some provinces, such as Bali and West Java, and less well for others; and (3) the inclusion of location context enhances performance, especially for larger models like GPT–4, emphasizing the significance of geographical context in commonsense reasoning.1
1 Introduction
The reasoning abilities of multilingual language models are frequently evaluated using English texts, potentially amplifying an Anglocentric bias toward culture grounded in the English language, and leading to less inclusive models (Thomas, 1983; Ponti et al., 2020). Cultures, however, vary significantly from one location to another and profoundly shape the way speakers of a language reason (Hershcovich et al., 2022). Recent evaluations of models’ commonsense reasoning ability (OpenAI, 2023; Sengupta et al., 2023; Liu et al., 2023) have been conducted on English datasets such as Social IQA (Sap et al., 2019) and PIQA (Bisk et al., 2020), and thus often overlook geographical aspects, thereby risking cultural bias.
Culture is a multifaceted concept encompassing the way of life (Giddens and Sutton, 2021), including our thoughts and actions (Macionis, 2012). It includes tangible elements like food, art, and clothing, as well as intangible aspects such as ideas, values, attitudes, and norms. Culture is shaped by geographical location and ethnicity, influencing the commonsense reasoning of people within a region. For example, in Indonesia, it is culturally acceptable to eat rice with your hands but it is considered unusual to use chopsticks. Similarly, at traditional Indonesian weddings, it is common to sit on the floor while eating, whereas this practice is less common in Australia.
This work focuses on understanding the influence of geographical contexts in cultural commonsense reasoning, with a particular emphasis on Indonesian culture. Indonesia is a highly multicultural country (Putra et al., 2019), home to over 1,300 recognized ethnic groups and more than 700 languages (Zarbaliyev, 2017; Aji et al., 2022). As the largest archipelagic country in the world, Indonesia has a population exceeding 270 million spread across 38 provinces, stretching from Aceh province in the west to Papua province in the east. Few prior studies on commonsense reasoning in Indonesian contexts (Mahendra et al., 2021; Wibowo et al., 2024; Putri et al., 2024) have explicitly addressed the geographical nuances and rich diversity of Indonesian cultures.
This paper introduces IndoCulture, a novel dataset to evaluate cultural reasoning in eleven Indonesian provinces, manually developed by local people in each province based on predefined topics. In prior work, cultural reasoning has primarily relied on datasets constructed through templates (Yin et al., 2022), and online scraping (Nguyen et al., 2023; Fung et al., 2024). While these studies offer valuable insights, they may be susceptible to training data contamination when used to assess large language models (LLMs). For instance, Fung et al. (2024) reported a zero-shot accuracy of 92% when using ChatGPT (Ouyang et al., 2022) to evaluate low-resource data.
IndoCulture contains cultural commonsense knowledge data from eleven provinces in Indonesia (blue colored in Figure 1), namely, Aceh, North Sumatra, West Sumatra, West Java, Central Java, East Java, Bali, South Borneo, East Nusa Tenggara (NTT), South Sulawesi, and Papua. These provinces span the breadth of Indonesia, each representing a major island in the country, with the addition of Bali and NTT. Figure 1 also shows three examples in IndoCulture for three provinces: Aceh, North Sumatra, and Papua.2 The first example focuses on a cultural artifact, specifically the traditional wedding dress from Aceh. The second example examines family relationships, while the third focuses on cultural beliefs and norms regarding pregnancy in Papua.
IndoCulture covers eleven provinces spanning from eastern to western Indonesia. The highlighted regions in the map represent the provinces examined in IndoCulture. We present examples from Aceh, North Sumatra, and Papua, with three plausible options and correct answers indicated in bold. English translations are provided for illustrative purposes.
Can large language models effectively reason based on the diverse cultures of Indonesia? To capture the rich diversity of Indonesian cultures, we predefined 12 fine-grained topics as guidelines for data construction. Figure 2 displays the topic distribution in IndoCulture, with the majority focusing on food, weddings, art, pregnancy and children, and family relationships. We also pose the question: Is there any influence of geographical location on the commonsense reasoning of language models? We address these questions through comprehensive experiments across different language models, incorporating several levels of location granularity as additional context in the prompt.
Our contributions can be summarized as follows:
We present IndoCulture, a high-quality cultural reasoning dataset in the Indonesian language, covering eleven provinces of Indonesia and twelve fine-grained cultural topics. Our dataset has 2,429 instances, and was developed by local people with rigorous quality controls in place.
We assess 19 open-weight multilingual models, 6 open-weight Indonesian-centric models, and 2 closed-weight models. Although local individuals can answer all questions correctly (i.e., 100% accuracy), most open-weight models struggle to comprehend Indonesian cultures. Interestingly, we observed that Llama–3 (Dubey et al., 2024) is competitive with GPT–4 (OpenAI, 2023).
We conduct a thorough analysis over various dimensions: (1) model performance for each province and topic; (2) the influence of different granularities of location context (i.e., none, province, country); (3) model performance over English translations; and (4) analysis of model explanations for a given answer.
2 Related Work
Commonsense Reasoning in English
Many studies have focused on commonsense reasoning in English, often overlooking considerations of culture and geographical location. Early work included the Winograd Schema Challenge (Levesque et al., 2012) and WinoGrande (Sakaguchi et al., 2021) for pronoun coreference resolution. Other research areas include reasoning based on cause-effect relationships (Roemmele et al., 2011), physical activities (Bisk et al., 2020), social interactions (Sap et al., 2019), cloze story completion (Mostafazadeh et al., 2016), sentence completion (Zellers et al., 2019), numerical reasoning (Lin et al., 2020), and temporal reasoning (Qin et al., 2021). Additionally, pretrained language models have been employed in other work to extract structured commonsense knowledge by providing seed words (Davison et al., 2019), and using code language models (Madaan et al., 2022).
Cultural Commonsense Reasoning with Geographical Contexts
Previous studies have explored commonsense reasoning with geographical context. Shwartz (2022) investigated time perception (e.g., morning and night) across different locations, while Yin et al. (2022) examined cultural knowledge of language models across five countries using datasets built from templates and translations. Other work has focused on automatically extracting cultural knowledge from various sources, including Wikipedia (Fung et al., 2024), conversations (Fung et al., 2023), and Common Crawl (Nguyen et al., 2023), incorporating location context with the assistance of LLMs. Relatedly, Ziems et al. (2023) created a knowledge bank for situational norms, using English-speaking Mechanical Turk annotators and incorporating a country taxonomy. In contrast to these studies, IndoCulture specifically concentrates on cultural reasoning across Indonesian provinces, developed and validated manually by local people (experts). Compared to automatic methods and English-speaking crowd workers for data construction, IndoCulture arguably contains less noise, and is free from the training data contamination of LLMs.
Commonsense Reasoning with Indonesian Contexts
Table 1 shows a comparison of IndoCulture with other Indonesian datasets for cultural knowledge and reasoning evaluation. Commonsense reasoning in Indonesian language models has been studied using translated English–Indonesian datasets, such as XCOPA (Ponti et al., 2020) and XStoryCloze (Lin et al., 2022). However, these datasets potentially introduce a cultural bias toward culture grounded in the English language. IndoCloze (Koto et al., 2022) was the first commonsense reasoning dataset in Indonesian, developed by native Indonesian workers following the cloze story completion framework (Mostafazadeh et al., 2016). However, IndoCloze lacks local cultural nuances and fine-grained geographical context. Wibowo et al. (2024) followed the COPA framework (Roemmele et al., 2011) to build a dataset with contexts limited to Jakarta. In other work, Putri et al. (2024) studied the capability of LLMs in generating questions with cultural norms, for both general Indonesian and specific Sundanese contexts, while Liu et al. (2024) used proverbs and LLMs to generate conversational data. In contemporary work, Myung et al. (2024) released BLEnD, a large-scale cultural knowledge dataset, built using templates, translation, and human validations, covering the West Java province in Indonesia. BLEnD specifically focuses on short-answer questions, limiting its capacity for reasoning evaluation. Unlike most other datasets that do not consider geographical factors, IndoCulture has broad coverage across eleven provinces, thereby providing greater inclusivity for local communities in Indonesia.
Comparison of IndoCulture with other cultural knowledge and reasoning datasets containing instances in Indonesian. The metadata includes Size (number of Indonesian instances), Cultural? (whether the data considers cultural nuances), Location? (whether the data includes fine-grained location information, such as provinces, as context), #province (number of Indonesian provinces covered), and #topic (number of fine-grained topics covered). * indicates the dataset involves question generation with less emphasis on reasoning.
Dataset | Size | Data Construction Method | Cultural? | Location? | #province | #topic |
---|---|---|---|---|---|---|
IndoCulture (ours) | 2,429 | Manually built and validated by natives | ✓ | ✓ | 11 | 66 |
COPAL-ID (Wibowo et al., 2024) | 559 | Manually built and validated by natives | ✓ | – | – | – |
MAPS (Liu et al., 2024) | 371 | LLM generation & human generation | ✓ | – | – | 1 |
ID-CSQA (Putri et al., 2024)* | 4,416 | LLM generation & human generation | ✓ | – | – | 5 |
BLEnD (Myung et al., 2024) | 1,000 | Template, translation, human validation | ✓ | ✓ | 1 | 6 |
IndoCloze (Koto et al., 2022) | 2,335 | Manually built and validated by natives | – | – | – | – |
XCOPA (Ponti et al., 2020) | 600 | Translated from English data | – | – | – | – |
XStoryCloze (Lin et al., 2022) | 1,872 | Translated from English data | – | – | – | – |
3 IndoCulture
As illustrated in Figure 1, IndoCulture is a sentence completion task in the Indonesian language featuring a one-sentence premise, three plausible options, and one correct option, designed to evaluate reasoning ability and cultural knowledge across eleven Indonesian provinces. While sentence completion tasks are straightforward for humans, answering IndoCulture requires machines to engage in cultural reasoning to determine which of the three options is logically consistent with the first sentence (Huang and Chang, 2023). The dataset includes a total of 2,429 instances.
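To make the task format concrete, the following is a minimal sketch of how a single IndoCulture instance could be represented in code; the field names and placeholder values are illustrative assumptions, not the released data schema.

```python
# Illustrative representation of one IndoCulture instance.
# Field names and placeholder values are hypothetical, not the released schema.
instance = {
    "province": "Aceh",
    "topic": "wedding",
    "subtopic": "women's wedding clothes",
    "premise": "<one-sentence premise in Indonesian>",
    "options": [
        "<correct continuation>",
        "<plausible distractor 1>",
        "<plausible distractor 2>",
    ],
    "answer_idx": 0,            # index of the culturally correct continuation
    "province_specific": True,  # binary annotation from stage 1 of quality control
}

# The task: pick the option that is logically and culturally consistent with the premise.
predicted_idx = 0  # a model's prediction would go here
is_correct = predicted_idx == instance["answer_idx"]
```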
3.1 Data Construction
IndoCulture was constructed manually by humans, and verified through a two-step process.
Worker Recruitment
Culture generally arises from the shared experiences, traditions, and beliefs of a specific group over time, often closely intertwined with native populations. With this in mind, we engaged individuals from various provinces across Indonesia to assist in preparing data for the IndoCulture benchmark.
During recruitment, we presented a few examples of the intended IndoCulture data and requested each candidate to generate similar instances tailored to the context of their respective provinces. From a pool of 58 applicants, we carefully selected 22 expert workers representing 11 provinces (with 2 workers selected per province). These recruited expert workers are local residents and have resided in their respective provinces for a minimum of 10 years, thereby possessing a profound understanding of local customs and culture. The age range of our workforce spans from 21 to 35 years old, with educational backgrounds distributed as follows: 3 high school graduates, 14 bachelor’s degree holders, 4 master’s degree holders, and 1 PhD holder.
During data construction, each expert worker fulfilled the dual roles of instance writer and quality controller. Each worker was compensated above the monthly minimum wage in Indonesia.
Province Selection
The provinces covered in this study represent the diversity of Indonesian cultures. The 11 provinces (in Figure 1) are spread across 6 islands of the Indonesian archipelago, which are inhabited by different ethnic groups who speak different regional languages and adhere to different religions.
Topic Taxonomy
IndoCulture consists of 12 topics and 66 fine-grained subtopics, carefully constructed based on discussions and brainstorming with Indonesian natives. The selection of these topics and subtopics was guided by several criteria and motivations: (1) relevance to Indonesian culture; (2) diversity and coverage; (3) regional representation (e.g., religious holidays); (4) practicality; and (5) expert consultation (i.e., native speaker feedback). Compared to the other Indonesian datasets in Table 1, IndoCulture includes a richer array of fine-grained topics. Below is a list of the topics along with their detailed subtopics. The numbers following each topic indicate the total number of instances required to be written by one worker, with a total of 150 per worker (these quotas are tallied in the sketch after the list).
Food (22): breakfast (2); lunch (3); dinner (2); snacks (2); food souvenirs (3); traditional foods and beverages (5); eating habits (1); cutlery (1); cooking ware (1); fruit (2).
Wedding (20): traditions before marriage (3); traditions when getting married (3); traditions after marriage (3); men’s wedding clothes (2); women’s wedding clothes (2); invited guests (2); wedding location (1); foods at a wedding (2); gifts brought to weddings (2).
Family relationship (13): relationships within the main family (3); relationships in the extended family (3); relations with society/neighbors (5); clan/descendant system (2).
Pregnancy and kids (16): traditions during pregnancy (4); traditions after birth (2); how to care for a newborn baby (2); how to care for toddlers (2); how to care for children (2); how to care for teenagers (2); parents and children interactions as adults (2).
Death (10): when death occurs (2); the process of dealing with a corpse (2); traditions after the body is buried (2); the clothes of the mourners (2); inheritance matters (2).
Religious holiday (12): traditions before religious holidays (2); traditions leading up to religious holidays (4); traditions during religious holidays (5); traditions after religious holidays (1).
Agriculture (6): what to plant (2); traditions when planting (2); harvest (2).
Fisheries and trade (7): traditions of taking care of livestock/fish (5); buying and selling traditions (2).
Art (16): musical instruments (3); folk songs (3); traditional dances (3); use of art at certain events (5); poetry or similar literature (2).
Traditional games (5): game types (3); location played (2).
Daily activities (10): morning activities (1); afternoon activities (1); evening activities (1); leisure activities (3); house, household, and transportation (4).
Socio-religious aspects of life (13): regular religious activities (2); mystical things (2); traditional ceremonies (1); lifestyle (3); self care (1); traditional medicine (3); traditional sayings (1).
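As a sanity check, the per-topic quotas listed above can be tallied programmatically; the dictionary below simply restates the counts from this section.

```python
# Per-worker instance quota by topic, as listed above.
topic_quota = {
    "Food": 22,
    "Wedding": 20,
    "Family relationship": 13,
    "Pregnancy and kids": 16,
    "Death": 10,
    "Religious holiday": 12,
    "Agriculture": 6,
    "Fisheries and trade": 7,
    "Art": 16,
    "Traditional games": 5,
    "Daily activities": 10,
    "Socio-religious aspects of life": 13,
}
assert sum(topic_quota.values()) == 150  # total instances required per worker
```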
Instance Writing
For each instance, workers were asked to craft two culturally relevant sentences that align with the predefined subtopic. The first sentence serves as the premise context, and the last sentence acts as the correct answer. Subsequently, the annotator generates two additional plausible sentences as distractors by modifying cultural objects or activities from the correct sentence. These distractors are designed to reflect local cultural contexts, ensuring they are challenging yet unambiguous, and could potentially serve as correct answers in other regional contexts. Workers were given a period of two months to complete the task.
Two Stages of Quality Control
In stage 1, we implemented quality control by pairing two annotators from the same province. Each annotator was tasked with answering a set of questions prepared by the other annotator, and vice versa. During this phase, annotators were presented with a premise sentence and three shuffled options. They were allowed to search for the answer from any source if they were unsure. Instances that were incorrectly answered by the second annotator were discarded, as we hypothesize that these instances may contain incorrect answers or possess a level of ambiguity. Additionally, annotators were required to identify whether the instance is province-specific (binary annotation: True/False), indicating that it is uniquely relevant in their province and not in others.
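Conceptually, stage 1 amounts to a cross-annotator filter: an instance survives only if the paired annotator, given the premise and shuffled options, recovers the intended answer. The sketch below assumes the hypothetical instance fields introduced earlier and a stand-in peer_answer_fn for the second annotator's choice.

```python
import random

def stage1_filter(instances, peer_answer_fn):
    """Keep instances whose intended answer is recovered by the paired annotator.

    peer_answer_fn stands in for the second annotator: given a premise and a
    list of shuffled options, it returns the index of the chosen option.
    """
    kept = []
    for inst in instances:
        options = list(inst["options"])
        random.shuffle(options)
        chosen = peer_answer_fn(inst["premise"], options)
        if options[chosen] == inst["options"][inst["answer_idx"]]:
            kept.append(inst)  # answer recovered: keep the instance
        # otherwise: discard as potentially incorrect or ambiguous
    return kept
```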
In stage 2 of quality control, the first two authors of this paper performed post-editing of data that passed the first stage of quality control. We first focused on correcting the linguistic aspects of the text, including checking for spelling errors. Although the text is written in Indonesian, some annotators may use dialects or be influenced by the structure or style of regional languages. In these cases, we corrected the text to adhere to Indonesian grammar.
To maintain the quality of IndoCulture, we rigorously filtered instances that contained: (1) poor writing, in the case that it was difficult to post-edit to enhance their quality; (2) obvious answer options, which allow for easy guessing of the correct choice without understanding the cultural context; and (3) ambiguous contexts, where all options are equally valid as the correct answer. For example, in a topic about breakfast, the three options might include one traditional food alongside two other very commonly consumed foods in Indonesia, and be considered too obvious.
Furthermore, we manually verified the province-specific annotations for each instance using the Google search engine. We annotated whether the instance pertains to national-level culture or not. If the example is specific to a province, it will be annotated as uncommon in national culture, and vice versa.
3.2 Data Statistics
After the instance writing process, we initially collected 3,162 instances out of a target of 3,300 instances (22 workers × 150 instances). Although we requested each annotator to produce 150 instances, not all were able to complete their allotted tasks within the given timeframe. Unfortunately, we were unable to find additional candidates from the same local province to address the data deficiencies (Winata et al., 2023).
In stage 1 of quality control, the initial pool of 3,162 instances was reduced to 2,801 instances, and stage 2 of quality control further reduced the sample to 2,429 high-quality samples. The data distribution of IndoCulture per province is presented in Table 2. Approximately three-quarters of IndoCulture instances contain province-specific content, with an average length of around 35 words. IndoCulture covers multiple topics, as illustrated in Figure 2.
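The per-province statistics in Table 2 can be reproduced with a simple groupby over the instance metadata; a sketch assuming a dataframe with hypothetical columns province, province_specific, and text (premise and options concatenated):

```python
import pandas as pd

def province_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Reproduce Table 2-style statistics from instance-level data.

    Assumes columns: "province" (str), "province_specific" (bool),
    and "text" (str; premise and options concatenated).
    """
    df = df.assign(
        n_words=df["text"].str.split().str.len(),
        n_chars=df["text"].str.len(),
    )
    return df.groupby("province").agg(
        count=("text", "size"),
        province_specific_pct=("province_specific", lambda s: 100 * s.mean()),
        mu_word=("n_words", "mean"),
        mu_char=("n_chars", "mean"),
    )
```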
Overall statistics of IndoCulture by province.
Province | # | province-specific (%) | μ(word) | μ(char) |
---|---|---|---|---|
Aceh | 246 | 70.7 | 28.0 | 175.9 |
North Sumatra | 234 | 83.8 | 36.8 | 246.0 |
West Sumatra | 299 | 74.6 | 39.6 | 261.4 |
West Java | 231 | 58.0 | 37.5 | 244.8 |
Central Java | 171 | 66.7 | 39.3 | 260.5 |
East Java | 233 | 69.5 | 46.0 | 310.4 |
Bali | 241 | 76.3 | 33.3 | 216.1 |
NTT | 103 | 72.8 | 31.8 | 203.6 |
South Borneo | 233 | 83.7 | 33.3 | 226.0 |
South Sulawesi | 185 | 90.3 | 33.6 | 227.8 |
Papua | 253 | 88.1 | 37.3 | 245.0 |
All | 2429 | 76.0 | NA | NA |
4 Experiments
4.1 Set-Up
We evaluate 27 language models in zero-shot settings: (1) nineteen open-weight multilingual language models of varying sizes, namely, BLOOMZ (Muennighoff et al., 2023), mT0 (Muennighoff et al., 2023), Bactrian-X (Li et al., 2023), Llama–2 (Touvron et al., 2023), and Llama–3 (Dubey et al., 2024); (2) two Southeast Asian language models, namely, SeaLLM (Nguyen et al., 2024) and SeaLion (Singapore, 2023); (3) four Indonesian-centric language models, namely, IndoBART (Cahyawijaya et al., 2021), IndoGPT (Cahyawijaya et al., 2021), Merak (Ichsan, 2023), and Komodo (Owen et al., 2024); and (4) two closed-weight models, namely, ChatGPT: gpt-3.5-turbo (Ouyang et al., 2022) and GPT–4: gpt-4-0613 (OpenAI, 2023). Please refer to Appendix A for further details.
Templates for sentence completion and multiple-choice question prompts.
For GPT–3.5 and GPT–4, we exclude experiments with sentence completion because the closed-weight models do not provide an overall probability score. For multiple-choice questions, we select the first generated token that corresponds to the letters A, B, or C using a regular expression.
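The two evaluation methods described above can be sketched as follows, assuming a HuggingFace causal language model for the completion setting; the helper names and prompt handling are illustrative, and token-boundary effects in the completion scorer are simplified.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # any open-weight causal LM from Table 7
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def completion_score(prompt: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens given the prompt."""
    enc = tokenizer(prompt + " " + option, return_tensors="pt")
    prefix_len = len(tokenizer(prompt).input_ids)  # token boundaries approximated
    with torch.no_grad():
        logits = model(**enc).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = enc.input_ids[0, 1:]
    option_positions = range(prefix_len - 1, targets.shape[0])
    return sum(log_probs[i, targets[i]].item() for i in option_positions)

def pick_by_completion(prompt: str, options: list) -> int:
    """Completion method: choose the option with the highest log-probability."""
    return max(range(len(options)), key=lambda i: completion_score(prompt, options[i]))

def extract_mcq_answer(generated_text: str):
    """MCQ method: take the first generated token corresponding to A, B, or C."""
    match = re.search(r"\b([ABC])\b", generated_text)
    return match.group(1) if match else None
```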
4.2 Results
Overall Observation
The results presented in Table 3 display the performance across various models and settings. The overall observation is that most open-weight models struggle to understand Indonesian culture, contrasting sharply with the 100% accuracy achieved by humans (i.e., natives of the given province). Among open-weight models, Llama–3 achieves the highest accuracy of 73.3%. Other open-weight models such as Merak and mT0xxl achieve accuracies of 52–53%, while closed-weight models, such as GPT–3.5 and GPT–4, achieve 62.7% and 75.9%, respectively. These findings underscore the challenging nature of the IndoCulture dataset.
Zero-shot accuracy across various models and settings. “MCQ” refers to the multiple-choice question method, and l denotes the location as additional context (“Ind” and “Prov” denote the country of Indonesia, and the corresponding province). The bold numbers highlight the highest score within each model group.
Model (#parameter) | Completion, l = None | Completion, l = Ind | Completion, l = Prov | MCQ, l = None | MCQ, l = Ind | MCQ, l = Prov |
---|---|---|---|---|---|---|
Human | – | – | 100.0 | – | – | 100.0 |
Random | 33.3 | 33.3 | 33.3 | 33.3 | 33.3 | 33.3 |
BLOOMZ (560M) | 37.2 | 35.3 | 35.3 | 32.5 | 32.4 | 32.5 |
BLOOMZ (1.1B) | 36.3 | 36.9 | 37.2 | 32.4 | 32.4 | 32.4 |
BLOOMZ (3B) | 38.6 | 40.7 | 41.5 | 47.0 | 48.6 | 49.2 |
BLOOMZ (7B) | 41.3 | 44.1 | 44.6 | 49.5 | 50.6 | 50.5 |
mT0small (300M) | 28.3 | 28.1 | 28.3 | 34.1 | 33.1 | 32.6 |
mT0base (580M) | 28.4 | 28.1 | 28.5 | 35.4 | 35.1 | 35.6 |
mT0large (1.2B) | 29.6 | 29.5 | 30.1 | 35.6 | 35.7 | 35.8 |
mT0xl (3.7B) | 31.9 | 31.0 | 31.2 | 49.8 | 50.5 | 50.7 |
mT0xxl (13B) | 33.2 | 33.5 | 34.3 | 52.7 | 51.4 | 52.1 |
Bactrian-XLLaMA (7B) | 33.8 | 34.2 | 34.2 | 38.0 | 38.6 | 38.9 |
Bactrian-XLLaMA (13B) | 33.3 | 35.2 | 35.1 | 38.6 | 38.2 | 38.6 |
Llama–2 (7B) | 37.2 | 37.5 | 37.7 | 40.5 | 39.9 | 38.8 |
Llama–2 chat (7B) | 37.3 | 37.4 | 37.9 | 40.6 | 41.3 | 40.7 |
Llama–2 (13B) | 39.6 | 40.2 | 40.2 | 47.6 | 47.6 | 47.3 |
Llama–2 chat (13B) | 38.6 | 38.9 | 39.3 | 47.8 | 49.6 | 49.6 |
Llama–3 (8B) | 41.0 | 42.2 | 43.4 | 54.4 | 54.4 | 55.1 |
Llama–3 Instruct (8B) | 41.9 | 41.5 | 42.3 | 56.7 | 57.6 | 59.0 |
Llama–3 (70B) | 51.2 | 51.7 | 54.3 | 68.6 | 69.9 | 72.7 |
Llama–3 Instruct (70B) | 49.2 | 49.6 | 52.2 | 68.5 | 69.3 | 73.3 |
IndoBART (132M) | 42.4 | 41.3 | 42.1 | 32.6 | 32.4 | 32.7 |
IndoGPT (117M) | 42.6 | 41.9 | 42.4 | 33.7 | 33.8 | 34.7 |
Merak (7B) | 41.0 | 41.5 | 43.5 | 51.9 | 53.1 | 53.2 |
SeaLLM (7B) | 39.1 | 39.3 | 41.1 | 52.2 | 53.1 | 53.0 |
SEA-LION (7B) | 38.8 | 38.9 | 39.7 | 33.8 | 33.0 | 33.3 |
Komodo (7B) | 45.1 | 45.4 | 46.1 | 37.6 | 35.1 | 36.1 |
GPT–3.5 (NA) | – | – | – | 59.8 | 60.9 | 62.7 |
GPT–4 (NA) | – | – | – | 69.1 | 71.8 | 75.9 |
The Multiple-choice Question Method is Generally Better.
Our findings suggest that the multiple-choice question method tends to outperform the sentence completion method, with exceptions noted for BLOOMZ (560M, 1.1B), IndoBART, IndoGPT, and Komodo. Interestingly, in the sentence completion task, the Indonesian-focused language model Komodo (7B) outperforms nearly all large multilingual models, with the exception of Llama–3 (70B). However, Komodo experiences a significant decline for the multiple-choice question method, with a notable margin of 10–12 points. This discrepancy could potentially be attributed to differences in the nature of language model training and instruction-tuning.
Impact of Location Context on Model Performance
Our investigation reveals that incorporating various levels of location granularity has a noticeable effect on zero-shot performance, especially for models with larger parameter sizes. Detailed location context notably enhances the accuracy of BLOOMZ (7B), Llama–2 (13B), Llama–3 (70B), Merak (7B), SeaLLM (7B), Komodo (7B), GPT–3.5, and GPT–4. For instance, for GPT–4, the accuracy gap between l = None and l = Indonesia is 2.7 points, and this gap widens to 6.8 points when l = Province is assigned.
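The three location settings differ only in how much geographical context is prepended to the prompt. A hypothetical sketch of the prompt assembly (the exact Indonesian template in Figure 3 is not reproduced here):

```python
def build_mcq_prompt(premise: str, options: list, location: str = None) -> str:
    """Assemble an MCQ prompt with optional location context.

    location may be None (l = None), "Indonesia" (l = Ind), or a province
    name such as "Bali" (l = Prov). The English wording is illustrative and
    not the exact template used in the paper.
    """
    context = "" if location is None else f"The following situation takes place in {location}.\n"
    letters = ["A", "B", "C"]
    body = "\n".join(f"{letter}. {opt}" for letter, opt in zip(letters, options))
    return f"{context}{premise}\n{body}\nAnswer with A, B, or C."

# Example: province-level context (l = Prov)
# prompt = build_mcq_prompt(instance["premise"], instance["options"], location="Bali")
```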
4.3 Analysis
Given the exceptional performance of models with large parameter sizes using the multiple-choice question method and location l = Province, we employ these configurations for our analysis. In this section, our main focus is on the top three performing models: Llama–3 Instruct (70B), Merak (7B), and GPT–4. These models represent a multilingual open-weight model, Indonesian-centric open-weight model, and closed-weight model, respectively.
Results by Province
Table 4 highlights that the top 3 performing LLMs exhibit a nuanced understanding of culture within Indonesian provinces, particularly excelling in the cultures of West Java and Bali compared to other provinces. Llama–3 and GPT–4, for instance, achieve their best accuracies of more than 90% in these provinces, while Merak achieves its best performance in Bali, Papua, and West Java, with accuracies ranging between 55% and 79%. In other provinces, such as West Sumatra and South Borneo, the models typically exhibit poorer performance. For Llama–3, the performance gap relative to Bali ranges from 10% to 30%. This highlights the presence of cultural biases and a lack of inclusivity in model reasoning abilities, likely stemming from the distribution of training data. The proximity of West Java to Jakarta (Indonesia's capital) and Bali's global status as a tourism destination may contribute to the abundance of textual data on these two cultures.
Top-3 model accuracy by province. “PS” indicates instances containing province-specific context, while “¬PS” indicates otherwise. The green and red cells indicate the top three and bottom three scores, respectively.
We also note a consistent disparity between non-province and province-specific contexts across all models, with models generally finding non-province contexts easier to comprehend. On average, this gap ranges from 12 to 13 points for the three models, highlighting the challenge posed by province-specific content and emphasizing the significant influence of location context on the reasoning ability of LLMs.
Results by Topic
Table 5 shows the accuracy of the top 3 performing models across different topics. Similar to Table 4, the models perform worse in province-specific contexts for all topics, with the notable exception of food. For province-specific contexts, GPT–4 excels on the themes of food, religious holidays, and art, while for non-province-specific contexts, Llama–3 achieves accuracies of more than 90% for agriculture, daily activities, and religious holidays.
Results by Fine-grained Cultural Elements
We tasked two expert workers with annotating 200 random samples from IndoCulture based on six cultural elements, derived from Axtell and Fornwald (1998) and Williams (2014).3,4 While these elements may not encompass every cultural aspect, we contend that they cover the most prominent or pivotal elements: (1) symbols (material or non-material objects representing meaning); (2) artifacts (material or non-material objects produced by society); (3) values and beliefs (principles, ideas, and concepts assumed to be ideal and correct in society); (4) norms (rules guiding values and beliefs); (5) language; and (6) rituals (established procedures and ceremonies); a seventh category, other, covers examples that do not fit into any of the defined elements. This annotation is a multi-label task, and the average Kappa score across the cultural elements is 0.56, with per-element scores ranging from 0.4 to 0.75. These scores indicate moderate to substantial agreement.
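Because the annotation is multi-label, agreement can be computed per element as Cohen's kappa over the two annotators' binary presence/absence labels; a sketch using scikit-learn, with the annotation format assumed rather than taken from the released data:

```python
from sklearn.metrics import cohen_kappa_score

ELEMENTS = ["symbols", "artifacts", "values_beliefs", "norms", "language", "rituals", "other"]

def per_element_kappa(annotations_1, annotations_2):
    """Per-element Cohen's kappa for two annotators.

    Each argument is a list (one entry per sample) of sets of element names
    assigned by that annotator. Elements never assigned by either annotator
    yield an undefined (NaN) kappa.
    """
    kappas = {}
    for element in ELEMENTS:
        y1 = [int(element in labels) for labels in annotations_1]
        y2 = [int(element in labels) for labels in annotations_2]
        kappas[element] = cohen_kappa_score(y1, y2)
    return kappas
```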
Table 6 displays the distribution of each cultural element in our dataset, along with the performance breakdown across Merak, Llama–3, and GPT–4. Among the 200 random samples, we observe that 42.5% of our data contains artifacts, 37.5% norms, and 30% rituals. Only 4% of the data pertains to symbols, while 7.5% belongs to the other category. Merak shows lower accuracy in norms, with a 24% decrease compared to values and beliefs. Conversely, Llama–3 performs best in values and beliefs with 85% accuracy, but accuracy drops by 23% for norms. GPT–4 maintains relatively stable accuracies across cultural elements, with differences averaging between 3% and 5%. Furthermore, language presents a challenge for Merak, achieving only 38% accuracy, whereas Llama–3 and GPT–4 achieve 66% and 72% accuracy, respectively.
Accuracy comparison of Merak, Llama–3–Instruct (70B), and GPT–4 across 200 random samples, categorized by cultural elements. The numerical value following each cultural element indicates its proportion within the samples.
Cultural element (%) | Merak | Llama–3 | GPT–4 |
---|---|---|---|
Symbols (4) | 50.0 | 50.0 | 70.8 |
Artifacts (42.5) | 55.3 | 74.1 | 67.8 |
Values and Beliefs (10.5) | 61.9 | 85.7 | 69.8 |
Norms (37.5) | 38.7 | 62.7 | 73.6 |
Language (19.5) | 38.5 | 66.7 | 72.0 |
Ritual (30) | 53.3 | 65.0 | 70.7 |
Other (7.5) | 66.7 | 66.7 | 69.2 |
Can the Model Provide a Reasonable Explanation to Support the Answer?
We conduct a manual investigation of the text generation output for Merak (7B), Llama–3, and GPT–4 across 200 random samples. This involves manually examining the generated answer along with its explanation. To obtain the explanation, we modify the Indonesian prompt in Figure 3 by adding the string Jelaskan jawabanmu! “Explain your answer!”. Our annotation process is binary, categorizing explanations as either True or False. We label an explanation as False if it is absent, contains hallucinations, or provides inaccurate information.5
As anticipated, there is a substantial drop in accuracy for Merak (7B) from 53.2% (score reported in Table 3) to 29.5% (see Figure 4). This discrepancy underscores the limitations of relying solely on token probabilities to assess the true capability of a language model. Interestingly, only 4.5% of the samples are answered correctly with the appropriate explanation by Merak, despite it being the top performer among the Indonesian-centric language models. Larger models like Llama–3 and GPT–4 achieve more robust accuracies of 70.5% and 69.5% (shown in Figure 5), which are 3–5% lower than those indicated in Table 3. However, both models encounter challenges in generating appropriate explanations for correctly-answered samples, with Llama–3 and GPT–4 producing explanation errors in 29% and 34% of cases, respectively.
Performance comparison between Merak (7B), Llama–3 Instruct (70B), and GPT–4 based on text generation output. “Answer (T)” indicates that the generated answer is true, while “Exp(F)” denotes that the answer explanation is false.
The accuracy of Indonesian and English translations across BLOOMZ (7B), mT0xxl (13B), Llama–2 chat (13B), Llama–3 Instruct (70B), Merak (7B), GPT–3.5, and GPT–4.
Does Language Affect Model Performance?
We automatically translated IndoCulture to English using the Google Translate API6 and used the English prompt in Figure 3 to evaluate the models. Specifically for this part, we include more models for comparison. All results over English text dropped, except for Llama–2 and Merak. This could be attributed to two reasons. First, Llama–2 is an English-centric model, and Merak is fine-tuned from Llama–2. Second, the performance drop for other models could be caused by translation errors. We further investigated this with 100 random samples and found that 81 samples had acceptable translations. We observed translation errors such as pronoun mismatches, inaccurate proverb translations, and inaccurate translations of local terms, such as pupuik translated as “fertilizer”.7 To better understand the cultural gap in language models, we followed the approach of Liu et al. (2024) to manually correct the translations and reevaluate the models. We found that GPT–4’s performance over the 100 random samples was 77.0 for the original Indonesian text, 68.0 for the English machine-translated text, and 72.0 for the English translation fixed by humans.
5 Discussion
A recent study (Wang et al., 2024) has demonstrated that evaluating language models using token probabilities in multiple-choice question types does not align well with the generated text. This discrepancy is reported to be more pronounced in models fine-tuned on conversational or safety data. In response to this issue, we conducted a manual evaluation of 200 random samples (in Section 4.3) and found that the performance in the generated text deteriorates, especially for Merak, the best open-weight Indonesian-centric language model. However, this issue is less apparent in larger models such as Llama–3 and GPT–4. Conducting manual evaluation on all data and models is expensive, and we plan to address this issue in future work. This work primarily focuses on introducing a novel dataset constructed for evaluating cultural commonsense reasoning within the Indonesian context, including preliminary evaluation results based on standard methods used in previous studies (OpenAI, 2023; Touvron et al., 2023; Koto et al., 2023; Li et al., 2024; Koto et al., 2024).
6 Conclusion
IndoCulture is a cultural commonsense reasoning dataset encompassing the diversity of Indonesian cultures, spanning from Aceh province in the west to Papua province in the east. Through collaboration with local individuals across eleven provinces and rigorous quality control measures, we introduce IndoCulture for the purpose of evaluating language models. Our findings reveal that large language models, whether Indonesian-centric or multilingual, demonstrate a limited understanding of Indonesian cultures. Notably, incorporating location as additional context significantly enhances model performance, particularly for GPT–3.5 and GPT–4.
Limitations
IndoCulture is specifically designed to explore the influence of geographical location on cultural commonsense reasoning, with a focus on the present time. It does not consider temporal aspects. Our dataset was created in the year 2023, and we recognize that cultures may evolve over time, as discussed by Mesoudi (2016).
Furthermore, as demonstrated in Section 4.3, a significant portion of IndoCulture comprises cultural elements such as artifacts, norms, and rituals. Symbols, values and beliefs, and language represent smaller proportions, ranging from 4% to 20%. We encourage future research to further explore these cultural elements and to expand the geographical coverage beyond the eleven provinces studied in this paper.
We also acknowledge that our dataset is relatively small given the number of provinces it covers. However, compared to existing Indonesian datasets that have been manually curated by natives (see COPAL–ID in Table 1), ours is significantly larger in terms of both size and regional coverage. Future work may extend the size and the coverage of IndoCulture to get a more holistic picture of Indonesian cultures.
Ethical Considerations
IndoCulture is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.8 Our data is intended for academic research and non-commercial purposes. Workers were compensated above the minimum monthly salary in Indonesia and are fully aware that the data will be released to the public. It is important to note that no private or sensitive information of the workers is included in IndoCulture.
Acknowledgments
We acknowledge the thorough feedback and impactful suggestions from the reviewers and the action editor. We also extend our gratitude to the 22 annotators from eleven provinces in Indonesia for their valuable contributions in constructing IndoCulture. This project was supported by MBZUAI and UI through PUTI-Q2 grant (NKB-1192/UN2.RST/HKP.05.00/2022).
Notes
IndoCulture is available from https://huggingface.co/datasets/indolem/IndoCulture.
Although Papua consists of six provinces, for the purpose of this study, we treat it as a single entity (referred to as Papua) due to the relatively recent establishment of most of these provinces.
We use the Google search engine to verify the correctness of the explanation.
Accessed in March 2024.
Pupuik is a traditional musical instrument in West Sumatra. It is worth noting that the word pupuik closely resembles the word pupuk in Indonesian, which means fertilizer.
References
A Model Details
Table 7 lists all model artifacts evaluated in our experiments.
With the exception of GPT–3.5 and GPT–4, all the models used in this study were sourced from HuggingFace (Wolf et al., 2020).
Models (#parameters) | Source |
---|---|
BLOOMZ (560M) | bigscience/bloomz-560m |
BLOOMZ (1.1B) | bigscience/bloomz-1b1 |
BLOOMZ (1.7B) | bigscience/bloomz-1b7 |
BLOOMZ (3B) | bigscience/bloomz-3b |
BLOOMZ (7.1B) | bigscience/bloomz-7b1 |
mT0small (300M) | bigscience/mt0-small |
mT0base (580M) | bigscience/mt0-base |
mT0large (1.2B) | bigscience/mt0-large |
mT0xl (3.7B) | bigscience/mt0-xl |
mT0xxl (13B) | bigscience/mt0-xxl |
Llama–2 (7B) | meta-llama/Llama-2-7b |
Llama–2 chat (7B) | meta-llama/Llama-2-7b-chat |
Llama–2 (13B) | meta-llama/Llama-2-13b |
Llama–2 chat (13B) | meta-llama/Llama-2-13b-chat |
Llama–3 (8B) | meta-llama/Meta-Llama-3-8B |
Llama–3 Instruct (8B) | meta-llama/Meta-Llama-3-8B-Instruct |
Llama–3 (70B) | meta-llama/Meta-Llama-3-70B |
Llama–3 Instruct (70B) | meta-llama/Meta-Llama-3-70B-Instruct |
Bactrian-XLLaMa (7B) | MBZUAI/bactrian-x-llama-7b-merged |
Bactrian-XLLaMa (13B) | MBZUAI/bactrian-x-llama-13b-merged |
IndoBART (132M) | indobenchmark/indobart-v2 |
IndoGPT (117M) | indobenchmark/indogpt |
Merak (7B) | Ichsan2895/Merak-7B-v5-PROTOTYPE1 |
SeaLLM (7B) | SeaLLMs/SeaLLM-7B-v2 |
SEA-LION (7B) | aisingapore/sea-lion-7b |
Komodo (7B) | Yellow-AI-NLP/komodo-7b-base |
Author notes
Action Editor: Miguel Ballesteros