Abstract
One widely cited barrier to the adoption of LLMs as proxies for humans in subjective tasks is their sensitivity to prompt wording—but interestingly, humans also display sensitivities to instruction changes in the form of response biases. We investigate the extent to which LLMs reflect human response biases, if at all. We look to survey design, where human response biases caused by changes in the wordings of “prompts” have been extensively explored in social psychology literature. Drawing from these works, we design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior, particularly in models that have undergone RLHF. Furthermore, even if a model shows a significant change in the same direction as humans, we find that it is often also sensitive to perturbations that do not elicit significant changes in humans. These results highlight the pitfalls of using LLMs as human proxies, and underscore the need for finer-grained characterizations of model behavior.1
1 Introduction
In what ways do large language models (LLMs) display human-like behavior, and in what ways do they differ? The answer to this question is not only of intellectual interest (Dasgupta et al., 2022; Michaelov and Bergen, 2022), but also has a wide variety of practical implications. Works such as Törnberg (2023), Aher et al. (2023), and Santurkar et al. (2023) have demonstrated that LLMs can largely replicate results from humans on a variety of tasks that involve subjective labels drawn from human experiences, such as annotating human preferences, social science and psychological studies, and opinion polling. The seeming success of these models suggests that LLMs may be able to serve as viable participants in studies—such as surveys—in the same way as humans (Dillion et al., 2023), allowing researchers to rapidly prototype and explore many design decisions (Horton, 2023; Chen et al., 2022). Despite these potential benefits, the application of LLMs in these settings, and many others, requires a more nuanced understanding of where and when LLMs and humans behave in similar ways.
Separately, another widely noted concern is the sensitivity of LLMs to minor changes in prompts (Jiang et al., 2020; Gao et al., 2021; Sclar et al., 2023). In the context of simulating human behavior though, sensitivity to small changes in a prompt may not be a wholly negative thing; in fact, humans are also subconsciously sensitive to certain instruction changes (Kalton and Schuman, 1982). These sensitivities—which come in the form of response biases—have been well studied in the literature on survey design (Weisberg et al., 1996) and can manifest as a result of changes to the specific wording (Brace, 2018), format (Cox III, 1980), and placement (Schuman and Presser, 1996) of survey questions. Such changes often cause respondents to deviate from their original or “true” responses in regular, predictable ways. In this work, we investigate the parallels between LLMs’ and humans’ responses to these instruction changes.
Our Contributions.
Using biases identified from prior work in survey design as a case study, we generate question pairs (i.e., questions that do or do not reflect the bias), gather a distribution of responses across different LLMs, and evaluate model behavior in comparison to trends from prior social science studies, as outlined in Figure 1. As surveys are a primary method of choice for obtaining the subjective opinions of large-scale populations (Weisberg et al., 1996) and are used across a diverse set of organizations and applications (Hauser and Shugan, 1980; Morwitz and Pluzinski, 1996; Al-Abri and Al-Balushi, 2014), we believe that our results would be of broad interest to multiple research communities.
We evaluate LLM behavior across 5 different response biases, as well as 3 non-bias perturbations (e.g., typos) that are known to not affect human responses. To understand whether aspects of model architecture (e.g., size) and training schemes (e.g., instruction fine-tuning and RLHF) affect LLM responses to these question modifications, we selected 9 models—including both open models from the Llama2 series and commercial models from OpenAI—to span these considerations. In summary, we find:
(1) LLMs do not generally reflect human-like behaviors as a result of question modifications:
All models showed behavior notably unlike humans, such as significant changes in the opposite direction of known human biases and significant changes in response to non-bias perturbations. Furthermore, unlike humans, models are unlikely to show significant changes due to bias modifications when they are more uncertain about their original responses.
(2) Behavioral trends of RLHF-ed models tend to differ from those of vanilla LLMs:
Among the Llama2 base and chat models, we find that the RLHF-ed chat models demonstrate less significant changes in response to bias-inducing question modifications but are more affected by non-bias perturbations than their non-RLHF-ed counterparts, highlighting the potential undesirable effects of additional training schemes.
(3) There is little correspondence between exhibiting response biases and other desirable metrics for survey design:
We find that a model’s ability to replicate human opinion distributions is not indicative of how well an LLM reflects human behavior.
These results suggest the need for care and caution when considering the use of LLMs as human proxies, as well as the importance of building more extensive evaluations that disentangle the nuances of how LLMs may or may not behave similarly to humans.
2 Methodology
In this section, we overview our evaluation framework, which consists of three parts (Figure 1): (1) dataset generation, (2) collection of LLM responses, and (3) analysis of LLM responses.
2.1 Dataset Generation
When evaluating whether humans exhibit hypothesized response biases, prior social science studies typically design a set of control questions and a set of treatment questions, which are intended to elicit the hypothesized bias (McFarland, 1981; Gordon, 1987; Hippler and Schwarz, 1987; Schwarz et al., 1992, inter alia). In line with this methodology, we similarly create sets of questions (q, q′) ∈ Q that contain both original (q) and modified (q′) forms of multiple-choice questions to study whether an LLM exhibits a response bias behavior given a change in the prompt.
The first set of question pairs, which we refer to as the bias question set, is one where q′ corresponds to questions that are modified in a way that is known to induce that particular bias in humans. However, we may also want to know whether a shift in LLM responses elicited by the change between (q, q′) in the bias question set is largely unique to that change. One way to test this is by evaluating models on non-bias perturbations, i.e., changes in prompts that humans are known to be robust against, such as typos or certain randomized letter changes (Rawlinson, 2007; Sakaguchi et al., 2017; Belinkov and Bisk, 2017; Pruthi et al., 2019). Thus, we also generate a perturbation question set, where q is an original question that also appears in the bias question set, and q′ is a transformed version of q using these perturbations. Examples of questions from both sets are in Table 1.
We created the bias and perturbation question sets by modifying a set of existing “unbiased” survey questions that have been curated and administered by experts. The original forms q of these question pairs come from survey questions in Pew Research’s American Trends Panel (ATP), detailed in Appendix A.1. We opted to use the ATP because its question topics are very close to those used in prior social psychology studies that have investigated response biases, such as politics, technology, and family, among others. Given the similarity in domain, we expect that the trends in human behavior measured in prior studies broadly extend to these questions as well. Concretely, we selected questions from the pool of ATP questions curated by Santurkar et al. (2023), who studied whether LLMs reflect human opinions; in contrast, we study whether changes in LLM opinions as a result of question modification match known human behavioral patterns, and then investigate how well these different evaluation metrics align.
We looked to prior social psychology studies to identify well-studied response biases whose implementation in existing survey questions is relatively straightforward and whose impact on human decision outcomes has been explicitly demonstrated in prior studies with humans. We generate a dataset with a total of 2578 question pairs, covering 5 biases and 3 non-bias perturbations. The modified forms of the questions for each bias were generated either by manually rewording them ourselves (as was the case for acquiescence and allow/forbid) or through systematic modifications such as automatically appending an option, removing an option, or reversing the order of options (for odd/even, opinion float, and response order). The specific breakdown of the number of questions by bias type is as follows: 176 for acquiescence bias, 40 for allow/forbid asymmetry, 271 for response order bias, 126 for opinion floating, and 126 for odd/even scale effects. For each perturbation, we generate a modified version of each original question in the bias question set. Specific implementation details are provided in Appendix A.2, and a minimal sketch of one such systematic modification is given below.
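To make the systematic modifications concrete, the following is a minimal sketch (in Python) of how a (q, q′) pair could be constructed for the response order bias. The data structure, field names, and example question are illustrative only and do not reflect the released dataset schema.

```python
from dataclasses import dataclass, field, replace

@dataclass
class SurveyQuestion:
    key: str                                   # ATP question identifier
    text: str                                  # question wording
    options: list[str] = field(default_factory=list)

def make_response_order_pair(q: SurveyQuestion) -> tuple[SurveyQuestion, SurveyQuestion]:
    """Return (q, q') where the modified form q' reverses the option order."""
    q_mod = replace(q, key=q.key + "_reversed", options=list(reversed(q.options)))
    return q, q_mod

# Hypothetical example question (not an actual ATP item)
q = SurveyQuestion(
    key="EXAMPLE1",
    text="How concerned are you, if at all, about online privacy?",
    options=["Very concerned", "Somewhat concerned",
             "Not too concerned", "Not at all concerned"],
)
original, modified = make_response_order_pair(q)
print(modified.options)   # options now appear in reverse order
```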
2.2 Collecting LLM Responses
To mimic data that would be collected from humans in real-world user studies, we assume that all LLM output should take the form of samples with a pre-determined sample size for each treatment condition. The collection process entailed sampling a sufficiently large number of LLM outputs for each question in every question pair in both the bias and perturbation sets. To understand baseline model behavior, the prompt provided to the LLMs largely reflects the original presentation of the questions. The primary modifications are appending an alphabetical letter to each response option and adding an explicit instruction to answer with one of the alphabetical options provided.2 We provide the prompt template in Appendix B.2. We then query each LLM with a temperature of 1 until we get a valid response3 (e.g., one of the letter options) to elicit answers from the original probability distribution of the LLM. For each pair of questions, we sample 50 responses per form to build the response distributions for the bias and perturbation sets.
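The sampling procedure above can be summarized with the following sketch. Here `query_model` is a hypothetical stand-in for the underlying model API; we assume only that it returns a single-token completion for a given prompt.

```python
import string
from collections import Counter

def collect_responses(prompt: str, n_options: int, query_model, n_samples: int = 50) -> Counter:
    """Sample single-letter answers at temperature 1 until n_samples valid responses are collected."""
    valid_letters = set(string.ascii_uppercase[:n_options])   # e.g., {"A", "B", "C", "D"}
    responses = Counter()
    while sum(responses.values()) < n_samples:
        answer = query_model(prompt, temperature=1.0, max_tokens=1).strip().upper()
        if answer in valid_letters:        # discard refusals and off-format outputs
            responses[answer] += 1
    return responses
```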
We selected LLMs to evaluate based on multiple axes of consideration: open-weight versus closed-weight models, whether the model has been instruction fine-tuned, whether the model has undergone reinforcement learning from human feedback (RLHF), and the number of model parameters. We evaluate a total of nine models, which include variants of Llama2 (Touvron et al., 2023) (7b, 13b, 70b), Solar4 (an instruction fine-tuned version of Llama2 70b), and variants of the Llama2 chat family (7b, 13b, 70b), which has had both instruction fine-tuning and RLHF (Touvron et al., 2023), along with models from the GPT series (Brown et al., 2020) (GPT 3.5 turbo, GPT 3.5 turbo instruct).5
2.3 Analysis of LLM Responses
Paralleling prior social psychology work, we measure whether there is a deviation between the response distributions for the original and modified forms of each question, and, like these studies, whether such deviations form an overall trend in behavior. Based on the implementation of each bias, we compute changes on a particular subset of relevant response options, following Table 2. We refer to the degree of change as Δb. Here, there is no notion of a ground-truth label (e.g., whether the LLM is getting the “correct answer” before and after some modification), which differs from most prior work in this space (Dasgupta et al., 2022; Michaelov and Bergen, 2022; Sinclair et al., 2022; Zheng et al., 2023; Pezeshkpour and Hruschka, 2023).
| Bias Type | Δb |
|---|---|
| Acquiescence | count(q′[a]) − count(q[a]) |
| Allow/forbid | count(q[b]) − count(q′[a]) |
| Response order | count(q′[d]) − count(q[a]) |
| Opinion floating | count(q[c]) − count(q′[c]) |
| Odd/even scale | count(q′[b]) + count(q′[d]) − count(q[b]) − count(q[d]) |
To determine whether there is a consistent deviation across all questions, we compute the average change Δ̄b across all questions and conduct a Student’s t-test where the null hypothesis is that Δ̄b for a given model and bias type is 0. Together, the p-value and direction of Δ̄b inform us whether we observe a significant change across questions that aligns with known human behavior.6 We then evaluate LLMs on the perturbation question pairs following the same process (i.e., selecting the subset of relevant response options for the bias) to compute Δp, with the expectation that Δ̄p across questions should not be statistically different from 0. A minimal sketch of this computation is given below.
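For concreteness, the following sketch shows the per-question change for acquiescence (Table 2) and the aggregate test; the input counts are assumed to come from the collection step in Section 2.2.

```python
from statistics import mean
from scipy import stats

def acquiescence_delta(orig_counts, mod_counts) -> int:
    # Table 2: delta_b = count(q'[a]) - count(q[a])
    return mod_counts["A"] - orig_counts["A"]

def aggregate_bias_effect(deltas):
    """Mean change across questions and a one-sample t-test against a mean of 0."""
    t_result = stats.ttest_1samp(deltas, popmean=0)
    return mean(deltas), t_result.pvalue
```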
3 Results
3.1 General Trends in LLM Behavior
As shown in Figure 2, we evaluate a set of 9 models on 5 different response biases, summarized in the first column of each grid, and compare the behavior of each model on 3 non-bias perturbations, as presented in the second, third, and fourth columns of each grid. Ideally, we expect to see significant positive changes across response biases and non-significant changes across all non-bias perturbations.
Overall, we find that LLMs generally do not exhibit human-like behavior across the board. Specifically, (1) no model aligns with known human patterns across all biases, and (2) unlike humans, all models display statistically significant changes in response to non-bias perturbations, regardless of whether they responded to the bias modification itself. The model that demonstrated the most “human-like” behavior was Llama2 70b, but even it exhibits a significant change as a result of non-bias perturbations on three of the five bias types.
Additionally, there is no monotonic trend between model size and model behavior. When comparing results across both the base Llama2 models and Llama2 chat models, which vary in size (7b, 13b, and 70b), we do not see a consistent monotonic trend between the number of parameters and the size of Δ̄b, which aligns with multiple prior works (McKenzie et al., 2023; Tjuatja et al., 2023). There are only a handful of biases where increasing the number of parameters leads to a consistent increase or decrease in Δ̄b (e.g., allow/forbid and opinion float for the base Llama2 7b to 70b).
3.2 Comparing Base Models with Their Modified Counterparts
Instruction fine-tuning and RLHF can improve a model’s ability to generalize to unseen tasks (Wei et al., 2022a; Sanh et al., 2022) and to be steered towards a user’s intent (Ouyang et al., 2022); how do these training schemes affect other behaviors, such as exhibiting human-like response biases? To disentangle the effect of these additional training schemes, we focus our comparisons on the base Llama2 models and their instruction fine-tuned (Solar, chat) and RLHF-ed (chat) counterparts. As we do not observe a clear effect from instruction fine-tuning,7 we center our analysis on the use of RLHF by comparing the base models with their chat counterparts:
RLHF-ed models are less sensitive to bias-inducing changes than their vanilla counterparts.
We find that the base models are more likely to exhibit a change for the bias modifications, especially for those with changes in the wording of the question like acquiescence and allow/forbid. An interesting exception is odd/even, where all but one of the RLHF-ed models (3.5 turbo instruct) have a larger positive effect size than the Llama2 base models. Insensitivity to bias modifications may be more desirable if we want an LLM to simulate a “bias-resistant” user, but not necessarily if we want it to be affected by the same changes as humans more broadly.
RLHF-ed models tend to show more significant changes resulting from perturbations.
We also see that RLHF-ed models tend to show larger effect sizes for the non-bias perturbations. For every perturbation setting that has a significant effect in both model pairs, the RLHF-ed chat models have a greater magnitude of effect size in 21 out of 27 of these settings, with effect sizes that are on average 68% larger than those of the base models, a noticeably less human-like (and arguably less desirable) behavior.
4 Examining the Effect of Uncertainty
Out of the nine models tested, we did not observe consistent correspondence between the uncertainty measure and the magnitude of Δ̄b.
Across all nine models, we do not observe a correspondence between the uncertainty measure and the magnitude of Δ̄b given a modified form of the question, which provides further evidence of dissimilarities between human and LLM behavior. However, the RLHF-ed models tended to have more biases where there was a weak positive correlation (0.2 ≤ r ≤ 0.5) between the uncertainty measure and the magnitude of Δ̄b than their non-RLHF-ed counterparts. Specific values for all models are provided in Table 4.
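As an illustration of this analysis, the sketch below correlates per-question uncertainty with the magnitude of the bias-induced change. For illustration only, uncertainty is taken here to be the entropy of the model’s response distribution on the original question; the helper names are assumptions, not the exact implementation.

```python
import numpy as np
from scipy import stats

def entropy(counts) -> float:
    """Entropy of a response distribution given a Counter of letter -> count."""
    p = np.array(list(counts.values()), dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p)).sum())

def uncertainty_vs_change(orig_counts_per_q, deltas_per_q):
    """Pearson correlation between per-question uncertainty and |delta_b|."""
    uncertainties = [entropy(c) for c in orig_counts_per_q]
    magnitudes = [abs(d) for d in deltas_per_q]
    return stats.pearsonr(uncertainties, magnitudes)   # returns (r, p value)
```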
5 Comparison to Other Desiderata for LLMs as Human Proxies
Beyond aspects of behavior like response biases, use cases where LLMs may be used as proxies for humans involve many other factors of model performance. In the case of completing surveys, we may also be interested in whether LLMs can replicate the opinions of a certain population. Thus, we explore the relationship between how well a model reflects human opinions and the extent to which it exhibits human-like response biases.
To see how well LLMs can replicate population-level opinions, we compare the distribution of answers generated by the models in the original question to that of human responses (Santurkar et al., 2023; Durmus et al., 2023; Argyle et al., 2022). We first aggregate the LLM’s responses on each unmodified question q to construct Dmodel for the subset of questions used in our study. Then from the ATP dataset, which provides human responses, we construct Dhuman for each q. Finally, we compute a measure of similarity between Dmodel and Dhuman for each question, which Santurkar et al. (2023) refer to as representativeness. We use the repository provided by Santurkar et al. (2023) to calculate the representativeness of all nine models and find that they are in line with the range of values reported in their work.
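For illustration, the sketch below compares Dmodel and Dhuman for a single question. The representativeness values reported here are computed with the metric and code from Santurkar et al. (2023); the 1 minus total variation distance used below is only a simple stand-in similarity measure.

```python
import numpy as np

def opinion_similarity(d_model: np.ndarray, d_human: np.ndarray) -> float:
    """Similarity between two answer distributions over the same ordered options
    (1 minus total variation distance); a stand-in, not the exact representativeness
    metric of Santurkar et al. (2023)."""
    d_model = d_model / d_model.sum()
    d_human = d_human / d_human.sum()
    return 1.0 - 0.5 * float(np.abs(d_model - d_human).sum())
```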
The ability to replicate human opinion distributions is not indicative of how well an LLM reflects human behavior.
Figure 3 shows the representativeness score between human and model response distributions. While Llama2 70b’s performance, when compared to the ideal setting in Figure 3 (left), shows the most “human-like” behavior and also has the highest representativeness score, the relative orderings of model performance are not consistent across both evaluations. For example, Llama2 7b-chat and 13b-chat exhibit very similar changes from question modifications as well as close representativeness scores, whereas with GPT 3.5 turbo and turbo instruct we observe very different behaviors but extremely close representativeness scores.
6 Related Work
LLM Sensitivity to Prompts.
A growing set of work aims to understand how LLMs may be sensitive to prompt constructions. These works have studied a variety of permutations of prompts which include—but are not limited to—adversarial prompts (Wallace et al., 2019; Perez and Ribeiro, 2022; Maus et al., 2023; Zou et al., 2023), changes in the order of in-context examples (Lu et al., 2022), changes in multiple-choice questions (Zheng et al., 2023; Pezeshkpour and Hruschka, 2023), and changes in formatting of few-shot examples (Sclar et al., 2023). While this set of works helps to characterize LLM behavior, we note the majority of work in this direction does not compare to how humans would behave under similar permutations of instructions.
A smaller set of works has explored whether changes in performance also reflect known patterns of human behavior, focusing on tasks relating to linguistic priming and cognitive biases (Dasgupta et al., 2022; Michaelov and Bergen, 2022; Sinclair et al., 2022) in settings that are often removed from actual downstream use cases. Thus, such studies may offer limited guidance on when and where it is appropriate to use LLMs as human proxies. In contrast, Jones and Steinhardt (2022) use cognitive biases as motivation to generate hypotheses for failure cases of language models, with code generation as a case study. Similarly, we conduct our analysis by making comparisons against known general trends of human behavior, which enables a much larger scale of evaluation while remaining grounded in a more concrete use case: survey design.
When making claims about whether LLMs exhibit human-like behavior, we also highlight the importance of selecting stimuli that have been verified in prior human studies. Webson and Pavlick (2022) initially showed that LLMs can perform unexpectedly well when given irrelevant and intentionally misleading instructions, under the assumption that humans would not be able to do so. However, the authors later conducted a follow-up study on humans, disproving their initial assumptions (Webson et al., 2023). Our study is based on long-standing literature from the social sciences.
Comparing LLMs and Humans.
Comparisons of LLM and human behavior are broadly divided into comparisons of more open-ended behavior, such as generating an answer to a free-response question, versus comparisons of closed-form outcomes, where LLMs generate a label based on a fixed set of response options. Since the open-ended tasks typically rely on human judgments to determine whether LLM behaviors are perceived to be sufficiently human-like (Park et al., 2022, 2023a), we focus on closed-form tasks, which allows us to more easily find broader quantitative trends and enables scalable evaluations.
Prior works have conducted evaluations of LLM and human outcomes on a number of real-world tasks including social science studies (Park et al., 2023b; Aher et al., 2023; Horton, 2023; Hämäläinen et al., 2023), crowdsourcing annotation tasks (Törnberg, 2023; Gilardi et al., 2023), and replicating public opinion surveys (Santurkar et al., 2023; Durmus et al., 2023; Chu et al., 2023; Kim and Lee, 2023; Argyle et al., 2022). While these works highlight the potential areas where LLMs can replicate known human outcomes, comparing directly to human outcomes limits existing evaluations to the specific form of the questions that were used to collect human responses. Instead, in this work, we create modified versions of survey questions informed by prior work in social psychology and survey design to understand whether LLMs reflect known patterns, or general response biases, that humans exhibit. Relatedly, Scherrer et al. (2023) analyzes LLM beliefs in ambiguous moral scenarios using a procedure that also varies the formatting of the prompt, though their work does not focus on the specific effects of these formatting changes.
7 Conclusion
We conduct a comprehensive evaluation of LLMs on a set of desired behaviors that would potentially make them more suitable human proxies, using survey design as a case study. Across the 9 models that we evaluated, however, we found that LLMs generally do not reflect human-like behavior. We also observe distinct differences in behavior between the Llama2 base models and their chat counterparts, which uncover the effects of additional training schemes, namely RLHF. Thus, while the use of RLHF is useful for enhancing the “helpfulness” and “harmlessness” of LLMs (Fernandes et al., 2023), it may lead to other potentially undesirable behaviors (e.g., greater sensitivity to specific types of perturbations). Furthermore, we show that the ability of a language model to replicate human opinion distributions generally does not correspond to its ability to show human-like response biases. Taken together, we believe our results highlight the limitations of using LLMs as human proxies in survey design and the need for more critical evaluations to further understand their similarities to and dissimilarities from humans.
8 Limitations
In this work, our experiments focused on English-language, U.S.-centric survey questions. However, we believe that many of these evaluations can and should be replicated on corpora comprising more diverse languages and users. On the evaluation front, since we do not explicitly compare LLM responses to human responses on the extensive set of modified questions and perturbations, we focus on the trends of human behavior in response to these modifications/perturbations that have been extensively studied, rather than on specific magnitudes of change. Additionally, the response biases studied in this work are neither representative nor comprehensive of all biases. This work was not intended to exhaustively test human biases but to highlight a new approach to understanding similarities between human and LLM behavior. Finally, while we observed the potential effects of additional training schemes, namely RLHF, our experiments were limited to the 3 pairs of Llama2 models.
Notes
Our code, dataset, and collected samples are available: https://github.com/lindiatjuatja/BiasMonkey.
We also explored prompt templates where models were allowed to generate more tokens to explain the “reasoning” behind their answer, with chain of thought (Wei et al., 2022b), but found minimal changes in model behavior.
We report the average number of queries per model in Appendix B.3.
We also attempted to evaluate GPT 4 (0613) in our experimental setup, but found it extremely difficult to get valid responses, likely due to OpenAI’s generation guardrails. We provide specific numbers in Appendix B.4.
While we also report the magnitude of Δ̄b to better illustrate LLM behavior across biases, we note that prior user studies generally do not focus on magnitudes.
We note that SOLAR and the Llama2 chat models use different fine-tuning datasets, which may mask potential common effects of instruction fine-tuning more broadly.
References
A Stimuli Implementation
A.1 American Trends Panel Details
The full ATP dataset is available from the Pew Research Center. We use a subset of the dataset that has been formatted into CSVs by Santurkar et al. (2023). Since our study is focused on subjective questions, we further filtered for opinion-based questions, so questions asking about people’s daily habits (e.g., how often they smoke) or other “factual” information (e.g., if they are married) are out of scope. Note that the Pew Research Center bears no responsibility for the analyses or interpretations of the data presented here. The opinions expressed herein, including any implications for policy, are those of the authors and not of Pew Research Center.
A.2 Bias and Perturbation Question Details
We briefly describe how we implement each response bias and non-bias perturbation. We will release the entire dataset of bias and perturbation question pairs.
Acquiescence
(McClendon, 1991; Choi and Pak, 2005). Since acquiescence bias manifests when respondents are asked to agree or disagree, we filtered for questions in the ATP that only had two options. For consistency, all q′ are reworded to suggest the first of the original options, allowing us to compare the number of ‘a’ responses.
Allow/Forbid Asymmetry
(Hippler and Schwarz, 1987). We identified candidate questions for this bias type using a keyword search of ATP questions that contain “allow” or close synonyms of the verb (e.g., asking if a behavior is “acceptable”).
Response Order
(Ayidiya and McClendon, 1990; O’Halloran et al., 2014). Prior social science studies typically considered questions with at least three or four response options, a criterion that we also used. We constructed q′ by flipping the order of the responses. We post-processed the data by mapping the flipped version of responses back to the original order.
Odd/Even Scale Effects
(O’Muircheartaigh et al., 2001). This bias type requires questions with scale responses with a middle option; we filter for scale questions with four or five responses. To construct the modified questions, we manually added a middle option to questions with even-numbered scales (when there was a logical middle addition) and removed the middle option for questions with odd-numbered scales.
Opinion Floating
(Schuman and Presser, 1996). We used the same set of questions as with the odd/even bias but instead of removing the middle option, we added a “don’t know” option.
Middle Random
(Rawlinson, 2007). We sample an index (excluding the first and last letters) from each word and swap the character at that index with its neighboring character. This was only applied to words of length ≥ 4.
Key Typo
(Rawlinson, 2007). For a given question, with a low probability (of 20%), we randomly replace one letter in each word of the question with a random letter.
Letter Swap
(Rawlinson, 2007). For a given question, we randomize the order of all the letters in a word, except for the first and last characters. Again, this perturbation is only applied to words of length ≥ 4.
We did not apply non-bias perturbations to any words that contain numeric values or punctuation, to prevent completely nonsensical outputs. A minimal sketch of these perturbations is given below.
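The sketch below illustrates one possible implementation of the three perturbations; exact sampling details in the released dataset may differ.

```python
import random
import string

def eligible(word: str) -> bool:
    # skip short words and words containing digits or punctuation
    return len(word) >= 4 and word.isalpha()

def middle_random(word: str) -> str:
    if not eligible(word):
        return word
    i = random.randrange(1, len(word) - 2)           # exclude first and last letters
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swap with the neighboring character
    return "".join(chars)

def key_typo(word: str, p: float = 0.2) -> str:
    if not word.isalpha() or random.random() > p:    # replace a letter with probability p
        return word
    i = random.randrange(len(word))
    return word[:i] + random.choice(string.ascii_lowercase) + word[i + 1:]

def letter_swap(word: str) -> str:
    if not eligible(word):
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)                           # randomize interior letters
    return word[0] + "".join(middle) + word[-1]

def perturb_question(question: str, fn) -> str:
    return " ".join(fn(w) for w in question.split())
```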
A.3 Full Results
The full set of results for all stimuli is in Table 4.
B LLM Details
B.1 Model Access
We provide links to model weights (where applicable):
Base Llama2 (7b, 13b, 70b) and Llama2 chat (7b, 13b, 70b).
Accessed from https://huggingface.co/meta-llama.
Solar (Instruction fine-tuned Llama2 70b).
Accessed from https://huggingface.co/upstage/SOLAR-0-70b-16bit.
GPT 3.5 turbo.
Specific model version is gpt-3.5-turbo-0613. Accessed through the OpenAI API.
GPT 3.5 turbo instruct.
Specific model version is gpt-3.5-turbo-instruct-0914. Accessed through the OpenAI API.
B.2 Prompt template
This prompt is used for all models. We have the models generate only one token with a temperature of 1.
Please answer the following question with one of the alphabetical options provided.
Question: [question]
A. [option]
B. [option]
...
E. [option]
Answer:
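For reference, the sketch below shows one way the template above could be rendered for a question and its options; the function name is illustrative.

```python
import string

def render_prompt(question: str, options: list[str]) -> str:
    """Fill in the prompt template above for one question; letters are assigned in order."""
    lines = ["Please answer the following question with one of the alphabetical options provided.",
             "",
             f"Question: {question}"]
    for letter, option in zip(string.ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)
```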
B.3 Number of Queries Required per Model
As mentioned in Section 2.2, we repeatedly queried the models until we generated a total of 50 valid responses. To better contextualize their performance in this survey setting, we gathered additional statistics on the number of queries required. In each query, we generate 100 single-token responses. To estimate the average number of queries needed, we randomly sampled 10 pairs of questions (q, q′) per bias and generated 50 valid responses for each form of the question, for a total of 100 questions. Table 3 shows the average number of queries per model; we note that while Llama2-7b and 13b do require a relatively high number of queries, they were free to query and thus did not present a prohibitive cost for experimentation.
| Model | Average # of queries |
|---|---|
| Llama2-7b | 69.63 |
| Llama2-13b | 56.93 |
| Llama2-70b | 22.36 |
| Llama2-7b-chat | 32.77 |
| Llama2-13b-chat | 12.99 |
| Llama2-70b-chat | 2.05 |
| SOLAR | 1.00 |
| GPT-3.5-turbo | 1.00 |
| GPT-3.5-turbo-instruct | 1.20 |
B.4 Initial Explorations with GPT-4
In addition to the models above, we also attempted to use GPT-4-0613 in our experimental setup, but found it difficult to generate valid responses for many questions, most likely due to OpenAI’s generation guardrails. As an initial experiment, we tried generating 50 responses per question for all (q, q′) in the bias question set (747 questions × 2 conditions) and counting the number of valid responses that GPT-4 generated out of the 50. On average, GPT-4 generated ∼21 valid responses per question, with nearly a quarter of the questions having 0 valid responses. For these questions, GPT-4 tended to generate “As” or “This” (and when set to generate more tokens, GPT-4 generated “As a language model” or “This is subjective” as the start of its response).
This is in stark contrast to GPT-3.5, which had an average of ∼48 valid responses per question with none of the questions having 0 valid responses. Histograms for the ratio of valid responses are shown in Figure 4. Based on these observations, the number of repeated queries that would be required for evaluating GPT-4 would be prohibitively expensive and potentially infeasible for certain questions in our dataset.
| Model | Bias type | Δ̄b | p value | Δ̄p (key typo) | p value | Δ̄p (middle random) | p value | Δ̄p (letter swap) | p value | Pearson r | p value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2 7b | Acquiescence | 1.921 | 0.021 | −3.920 | 0.007 | −4.480 | 0.000 | −4.840 | 0.004 | 0.182 | 0.015 |
| | Response Order | 24.915 | 0.000 | 1.680 | 0.382 | −0.320 | 0.871 | 2.320 | 0.151 | −0.503 | 0.000 |
| | Odd/even | 1.095 | 0.206 | 0.720 | 0.625 | 1.360 | 0.355 | 1.680 | 0.221 | −0.102 | 0.255 |
| | Opinion Float | 4.270 | 0.000 | 0.720 | 0.625 | 1.360 | 0.355 | 1.680 | 0.221 | −0.252 | 0.004 |
| | Allow/forbid | −60.350 | 0.000 | −5.400 | 0.007 | −10.250 | 0.000 | −7.700 | 0.000 | −0.739 | 0.000 |
| Llama2 13b | Acquiescence | −11.852 | 0.000 | −6.800 | 0.001 | −5.760 | 0.000 | −9.320 | 0.000 | −0.412 | 0.000 |
| | Response Order | 45.757 | 0.000 | 11.600 | 0.000 | 11.640 | 0.000 | 11.720 | 0.000 | −0.664 | 0.000 |
| | Odd/even | −3.492 | 0.000 | 5.840 | 0.000 | 3.600 | 0.031 | 4.000 | 0.007 | 0.192 | 0.031 |
| | Opinion Float | 4.127 | 0.000 | 5.840 | 0.000 | 3.600 | 0.031 | 4.000 | 0.007 | −0.023 | 0.799 |
| | Allow/forbid | −55.100 | 0.000 | −9.100 | 0.000 | −5.700 | 0.000 | −7.600 | 0.000 | −0.739 | 0.000 |
| Llama2 70b | Acquiescence | 7.296 | 0.000 | −2.440 | 0.218 | −3.080 | 0.173 | −3.320 | 0.146 | −0.018 | 0.809 |
| | Response Order | 5.122 | 0.000 | −1.080 | 0.597 | 3.240 | 0.113 | 2.000 | 0.306 | −0.140 | 0.021 |
| | Odd/even | 12.191 | 0.000 | 0.920 | 0.540 | 0.600 | 0.687 | −0.800 | 0.618 | 0.12 | 0.179 |
| | Opinion Float | 2.444 | 0.000 | 0.920 | 0.540 | 0.600 | 0.687 | −0.800 | 0.618 | −0.033 | 0.714 |
| | Allow/forbid | −42.200 | 0.000 | −6.200 | 0.004 | 2.250 | 0.332 | 0.350 | 0.877 | −0.628 | 0.000 |
| Llama2 7b-chat | Acquiescence | 1.136 | 0.647 | −7.807 | 0.000 | −12.034 | 0.000 | −5.546 | 0.000 | −0.099 | 0.189 |
| | Response Order | −9.801 | 0.000 | 7.173 | 0.000 | 12.679 | 0.000 | 1.594 | 0.253 | −0.315 | 0.000 |
| | Odd/even | 20.079 | 0.000 | 8.460 | 0.000 | 15.810 | 0.000 | 9.175 | 0.000 | −0.315 | 0.000 |
| | Opinion Float | −1.254 | 0.283 | 8.460 | 0.000 | 15.801 | 0.000 | 9.175 | 0.000 | −0.086 | 0.339 |
| | Allow/forbid | −7.050 | 0.367 | −18.700 | 0.000 | −24.600 | 0.000 | −16.200 | 0.002 | −0.161 | 0.321 |
| Llama2 13b-chat | Acquiescence | 1.909 | 0.434 | −9.239 | 0.000 | −11.534 | 0.000 | −5.284 | 0.000 | −0.095 | 0.209 |
| | Response Order | −9.292 | 0.000 | 7.653 | 0.000 | 10.753 | 0.000 | 0.472 | 0.719 | −0.324 | 0.000 |
| | Odd/even | 21.254 | 0.000 | 10.159 | 0.000 | 14.460 | 0.000 | 9.492 | 0.000 | −0.163 | 0.069 |
| | Opinion Float | −0.191 | 0.870 | 10.159 | 0.000 | 14.460 | 0.000 | 9.492 | 0.000 | −0.106 | 0.238 |
| | Allow/forbid | −7.300 | 0.333 | −15.950 | 0.000 | −23.450 | 0.000 | −16.200 | 0.000 | −0.131 | 0.422 |
| Llama2 70b-chat | Acquiescence | 11.114 | 0.000 | 2.320 | 0.523 | −5.280 | 0.312 | 4.040 | 0.166 | 0.452 | 0.000 |
| | Response Order | −0.495 | 0.745 | 0.200 | 0.904 | 15.040 | 0.002 | 1.200 | 0.459 | 0.465 | 0.000 |
| | Odd/even | 26.476 | 0.000 | 3.280 | 0.210 | −2.040 | 0.656 | −7.240 | 0.018 | −0.231 | 0.009 |
| | Opinion Float | 1.556 | 0.039 | 3.280 | 0.210 | −2.040 | 0.656 | −7.240 | 0.018 | 0.440 | 0.000 |
| | Allow/forbid | 4.000 | 0.546 | −4.750 | 0.258 | −16.000 | 0.021 | −0.950 | 0.811 | 0.280 | 0.080 |
| Solar | Acquiescence | 18.511 | 0.000 | 0.120 | 0.970 | 2.560 | 0.596 | 0.600 | 0.833 | 0.187 | 0.013 |
| | Response Order | 9.683 | 0.000 | 2.280 | 0.336 | 8.680 | 0.012 | 4.360 | 0.017 | 0.248 | 0.000 |
| | Odd/even | 17.508 | 0.000 | 0.480 | 0.815 | 2.960 | 0.223 | −1.000 | 0.661 | −0.385 | 0.000 |
| | Opinion Float | 1.921 | 0.017 | 0.480 | 0.815 | −2.960 | 0.223 | −1.000 | 0.661 | 0.291 | 0.001 |
| | Allow/forbid | 6.800 | 0.207 | −2.950 | 0.343 | −8.500 | 0.131 | −8.050 | 0.001 | 0.145 | 0.373 |
| GPT 3.5 Turbo | Acquiescence | 5.523 | 0.040 | −11.720 | 0.008 | −28.680 | 0.000 | −19.120 | 0.000 | 0.334 | 0.000 |
| | Response Order | −2.709 | 0.147 | 4.960 | 0.121 | 15.960 | 0.002 | 8.000 | 0.011 | 0.198 | 0.001 |
| | Odd/even | 25.048 | 0.000 | −5.480 | 0.082 | −14.800 | 0.001 | −5.800 | 0.062 | −0.273 | 0.002 |
| | Opinion Float | −11.905 | 0.000 | −5.480 | 0.082 | −14.800 | 0.001 | −5.800 | 0.062 | 0.467 | 0.000 |
| | Allow/forbid | 25.300 | 0.000 | −12.000 | 0.008 | −23.200 | 0.001 | −6.950 | 0.058 | 0.206 | 0.202 |
| GPT 3.5 Turbo Instruct | Acquiescence | 6.455 | 0.024 | 2.600 | 0.445 | −11.800 | 0.008 | −2.800 | 0.326 | 0.334 | 0.000 |
| | Response Order | −11.114 | 0.000 | 3.880 | 0.169 | 11.920 | 0.001 | 3.800 | 0.147 | 0.275 | 0.000 |
| | Odd/even | 2.032 | 0.390 | 1.560 | 0.433 | −7.120 | 0.061 | −0.840 | 0.711 | −0.073 | 0.416 |
| | Opinion Float | 0.143 | 0.891 | 1.560 | 0.433 | −7.120 | 0.061 | −0.840 | 0.711 | 0.360 | 0.000 |
| | Allow/forbid | 8.550 | 0.111 | −4.500 | 0.216 | −10.050 | 0.139 | 4.100 | 0.261 | 0.437 | 0.005 |
Author notes
Equal contribution.
Action Editor: Kristina Toutanova