Abstract
Recently, large language models (LLMs), especially those pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs. Despite promising results, there is a notable lack of a comprehensive evaluation of these models' language-to-code generation capabilities. Existing studies often focus on specific tasks, model architectures, or learning paradigms, leading to a fragmented understanding of the overall landscape. In this work, we present L2CEval, a systematic evaluation of the language-to-code generation capabilities of LLMs on 7 tasks spanning the domains of semantic parsing, math reasoning, and Python programming, analyzing the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition, we assess confidence calibration and conduct human evaluations to identify typical failures across different tasks and models. L2CEval offers a comprehensive understanding of the capabilities and limitations of LLMs in language-to-code generation. We release the evaluation framework[1] and all model outputs, hoping to lay the groundwork for future research.
1 Introduction
Language-to-code (L2C[2]) is the task of automatically mapping natural language descriptions to programs, which are then executed to satisfy the user's request (Yin and Neubig, 2017; Austin et al., 2021). As illustrated in Figure 1, language-to-code is the foundation of many applications in AI, such as task-oriented dialogue systems (Andreas et al., 2020), coding assistants (Agashe et al., 2019; Lai et al., 2023), language interfaces to databases (Pasupat and Liang, 2015; Yu et al., 2018), and robotic control (Zhou et al., 2022; Shridhar et al., 2020). It has also served as a useful testbed for evaluating various language understanding capabilities of NLP systems, such as logical and math reasoning (Gao et al., 2023; Han et al., 2022), grounded language understanding (Xie et al., 2022; Huang et al., 2023), and tool use (Schick et al., 2024; Paranjape et al., 2023).
Recent progress on large language models (LLMs) (OpenAI, 2023; Chowdhery et al., 2023; Touvron et al., 2023a), especially those specifically trained for coding (Fried et al., 2023; Nijkamp et al., 2022; Chen et al., 2021; Li et al., 2023), has shown that LLMs trained on a mixture of text and code can perform language-to-code generation under few-shot or even zero-shot learning settings (Rajkumar et al., 2022; Ni et al., 2023b). However, the modeling factors that affect the performance of LLMs on such L2C tasks, including model size, training data mixture, prompting methods, and instruction tuning, are poorly understood. In addition, there is no consistent evaluation of different LLMs on the same spectrum of language-to-code tasks, making it difficult for users to decide which models to use for certain tasks, or whether they should resort to finetuning their own models. Beyond model performance, properties such as robustness to prompts and confidence calibration are also crucial for understanding the reliability of LLMs, but they have not been systematically studied for L2C tasks.
In this work, we present L2CEval, a systematic evaluation of the language-to-code generation capabilities of LLMs. L2CEval includes a wide range of state-of-the-art models, specifically 56 models from 13 different organizations, all evaluated on three core domains of language-to-code generation: semantic parsing, math reasoning, and Python programming. Our evaluations cover models ranging from as small as 1 billion parameters to significantly larger ones such as Falcon-180B, as well as the davinci and GPT-4 models from OpenAI. We also benchmark models trained on different mixtures of data of varying sizes (35B ∼ 3.5T tokens), as well as instruction-tuned models, from both the open-source and open-access proprietary categories. Our work is the first to conduct extensive and thorough comparisons of LLMs for language-to-code generation across multiple dimensions of variation. To summarize, the main contributions of L2CEval are as follows:
We standardize the evaluation (e.g., prompts, metrics) of 7 L2C tasks across the domains of semantic parsing, math reasoning, and Python programming to allow controlled comparisons among 56 models from 13 organizations;
We study the scaling effects of model size and pretraining compute/data mixture, as well as several modeling choices (e.g., instruction tuning, zero-/few-shot prompting) for L2C tasks;
We analyze the robustness and calibration measurements of different models, and identify their common error cases;
We release the code for our evaluation framework, and model outputs (i.e., texts and logits) for reproducibility and future studies.
Through our work, we hope to provide insight into applying LLMs to L2C applications, as well as building future LLMs.
2 L2CEval
The main motivation behind L2CEval is to provide a comprehensive evaluation of language-to-code generation capabilities and to understand what affects them. In the following sections, we first discuss the key desiderata in the design of L2CEval (§ 2.1) and the formulation of L2C (§ 2.2); in § 2.3 we introduce the domains and specific tasks covered by L2CEval, as well as the reasons for choosing them; finally, in § 2.4, we describe the models included in L2CEval and the model selection process.
2.1 Desiderata
To ensure that the L2CEval framework serves as a comprehensive resource for evaluating language-to-code (L2C) capabilities, we outline a set of key desiderata that guided its construction.
Task Inclusion. Diverse Task Representation: The benchmark aims to capture a wide scope of L2C tasks, specifically incorporating semantic parsing, Python programming, and math reasoning. Task Complexity: Under each of these domains, we include 2 to 3 sub-tasks to represent a combination of different levels of language understanding, reasoning, and programming abilities.
Model Evaluation. Open-Source and Commercial Models: L2CEval is designed to accommodate both open-source and commercial models to provide a holistic view of available L2C capabilities. Our focus is more on the open-source models, however, as proprietary models do not disclose certain basic information, which makes drawing scientific conclusions difficult. Model Size Variability: The benchmark includes models of different sizes to explore the correlation between model size and performance. Specialized vs. General Models: We examine the performance trade-offs between models exclusively trained on code and general language models to understand the advantages or disadvantages of specialization.
Evaluation Setup. Standardized Prompts: All tasks and models are evaluated using standardized prompts, overcoming the inconsistencies prevalent in prior work. Reproducibility: Our evaluation setup is clearly described to facilitate reproducible experiments. Fair Comparison: Universal evaluation metrics are employed across diverse tasks and models to enable equitable comparisons.
Transparency and Reusability. Documentation: The framework is thoroughly documented to promote community engagement and constructive feedback. Interoperability: L2CEval is built to be easily updated or extended, allowing for the incorporation of new tasks or models as the field evolves. We also share the code and model outputs to support future research in this domain.
By adhering to these desiderata, the L2CEval framework aims to be a comprehensive, fair, and practical evaluation resource for the community.
2.2 Language-to-Code Generation (L2C)
Problem Formulation.
Execution-based Evaluation.
2.3 Tasks
We evaluate the language-to-code capabilities of LLMs in three representative application scenarios shown in Figure 1: semantic parsing, math reasoning, and Python programming. Collectively, these tasks assess a model's ability to understand natural language in different contexts, reason about the steps for solving a problem, and convert the solution into executable code (see Figure 1). Semantic parsing focuses on the transformation of natural language queries into structured, domain-specific languages; math reasoning challenges the models' numerical and logical reasoning abilities by requiring them to solve problems that involve multiple steps of calculation and reasoning; and Python programming tests the models' proficiency in generating functional code that aligns with a user's intent, reflecting a real-world application of LLMs in software development. A summary of the L2CEval benchmarks is shown in Table 1, and we discuss each of these tasks in detail below.
| Domain | Dataset | Split | Size | Input | Output |
|---|---|---|---|---|---|
| Semantic Parsing | Spider (Yu et al., 2018) | Dev | 1,032 | DB schema + NL | SQL Query |
| | WikiTQ (Pasupat and Liang, 2015) | Dev | 2,831 | Table headers* + NL | SQL Query |
| Math Reasoning | GSM8k (Cobbe et al., 2021) | Dev[3] | 1,495 | Math problem in NL | Python solution |
| | SVAMP (Patel et al., 2021) | All | 1,992 | Math problem in NL | Python solution |
| Python Programming | MBPP (Austin et al., 2021) | Test | 500 | NL spec. + 1 test | Python function |
| | HumanEval (Chen et al., 2021) | All | 164 | NL spec. + 1–3 tests | Python function |
| | DS-1000 (Lai et al., 2023) | All | 1,000 | NL spec. | Python lines |
Semantic Parsing.
Semantic parsing is the task of translating a user's natural language utterance (e.g., who averaged the most pots in the last season? in Figure 1) into machine-executable programs (e.g., an SQL database query), and has been a long-standing problem in NLP (Zettlemoyer and Collins, 2005; Berant et al., 2013). A prompt to an LLM consists of an NL utterance and descriptions of the relevant structured context, such as the schema information of a database (e.g., the columns in each table). The target output is a program defined in some domain-specific language, such as SQL. Intuitively, semantic parsing challenges LLMs on grounded language understanding (Xie et al., 2022; Cheng et al., 2023), where a model needs to associate NL concepts in utterances (e.g., “last season”) with relevant structured knowledge (e.g., a superlative operation on the column season) in order to synthesize the program (Pasupat and Liang, 2015; Yu et al., 2018; Yin et al., 2020). We choose text-to-SQL as a representative task as it closely ties to applications such as natural language interfaces to databases (Androutsopoulos et al., 1995; Affolter et al., 2019). Recent work (Rajkumar et al., 2022; Ni et al., 2023b) shows that LLMs are effective at text-to-SQL parsing. We use two widely used datasets, Spider (Yu et al., 2018) and WikiTQ (Pasupat and Liang, 2015), for benchmarking the semantic parsing capabilities of LLMs. Following Xie et al. (2022), we concatenate the natural language utterance with the database schema or table headers as the LLM input.[6]
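To make this setup concrete, the sketch below shows one way a few-shot text-to-SQL prompt can be assembled by concatenating a serialized schema with the question; the serialization format and the exemplar are illustrative assumptions, not the exact prompt templates used in our experiments.

```python
# Illustrative few-shot prompt construction for text-to-SQL parsing.
# The serialization format and exemplar below are assumptions for illustration,
# not the exact templates used in L2CEval.
def serialize_schema(tables: dict[str, list[str]]) -> str:
    """Render a database schema as one 'table: col1, col2, ...' line per table."""
    return "\n".join(f"{name}: {', '.join(cols)}" for name, cols in tables.items())

def build_prompt(exemplars: list[tuple[str, str, str]], schema: str, question: str) -> str:
    """Concatenate (schema, question, SQL) exemplars, then the test instance."""
    blocks = [f"{s}\nQuestion: {q}\nSQL: {sql}" for s, q, sql in exemplars]
    blocks.append(f"{schema}\nQuestion: {question}\nSQL:")
    return "\n\n".join(blocks)

# Hypothetical single-table example.
schema = serialize_schema({"games": ["player", "season", "points"]})
exemplar = (schema, "how many games are recorded?", "SELECT count(*) FROM games")
prompt = build_prompt([exemplar], schema, "which player scored the most points?")
```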
Math Reasoning.
To solve a math word problem, a model needs to abstract the mathematical relations from the natural language description and reason about the steps required to solve it. Compared to semantic parsing, where the target programs are table-lookup queries, programs for math reasoning tasks usually require multiple steps of calculation as well as numerical and logical reasoning. Because of this, math word problems are widely adopted as testbeds for evaluating the reasoning abilities of LLMs (Cobbe et al., 2021; Wei et al., 2022b; Ni et al., 2023a; Welleck et al., 2022). We choose the GSM8k (Cobbe et al., 2021) and SVAMP (Patel et al., 2021) datasets, which contain grade-school-level math problems described in natural language, selected for their moderate difficulty and popularity. Following Welleck et al. (2022) and Gao et al. (2023), we prompt the models to answer math word problems by generating Python programs as solutions, which are then executed by a Python interpreter to produce the answer.
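As a concrete illustration of this program-as-solution format (a minimal sketch: the generated solution below is hypothetical, and the convention of binding the result to an `answer` variable mirrors the prompting setup discussed in § 3.3):

```python
# Sketch of the program-aided setup for math word problems: the model is
# prompted to emit a short Python solution that binds its result to `answer`,
# which we then execute to obtain the prediction. The solution below is a
# hypothetical example, not actual model output.
generated_solution = """
eggs_laid_per_day = 16
eggs_eaten = 3
eggs_used_for_baking = 4
price_per_egg = 2
answer = (eggs_laid_per_day - eggs_eaten - eggs_used_for_baking) * price_per_egg
"""

namespace: dict = {}
exec(generated_solution, namespace)  # executed by a Python interpreter
prediction = namespace["answer"]     # 18
```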
Python Programming.
One of the most important applications of LLMs trained on code is to assist programmers in developing software. Typically, a model is given a developer's natural language intent (e.g., write a merge sort function), optionally with additional specifications (Austin et al., 2021) such as input/output examples or unit tests (e.g., assert merge_sort([5,7,3]) == [3,5,7]), and generates code (e.g., a Python function) that implements the user's intent. To evaluate the basic programming skills of the LLMs, we use the MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021) datasets, for which the model needs to implement basic Python functions that pass the test cases. Moreover, we also include DS-1000 (Lai et al., 2023), which focuses on data-science-related questions and libraries.[7]
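Execution-based scoring of such completions can be sketched as follows: the generated function is run together with its test assertions in a subprocess with a timeout. This is a simplified stand-in for the actual evaluation harnesses and omits the sandboxing needed for untrusted, model-generated code (Chen et al., 2021).

```python
import subprocess
import sys

def passes_tests(generated_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True if the generated program plus its unit tests runs without error.

    Simplified sketch: real harnesses additionally sandbox execution, since
    model-generated code is untrusted.
    """
    program = generated_code + "\n\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# Hypothetical MBPP-style example.
code = "def merge_sort(xs):\n    return sorted(xs)\n"
test = "assert merge_sort([5, 7, 3]) == [3, 5, 7]"
print(passes_tests(code, test))  # True
```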
2.4 Models
We evaluate 56 models that vary in size, training data mixture, context length, and training methods. Table 2 summarizes the open-source models we evaluated and several key properties.
| Organization | Model Series | Variants | Sizes | # All Tokens | # Code Tokens | Context Length | Code-Specific |
|---|---|---|---|---|---|---|---|
| Salesforce | CodeGen (Nijkamp et al., 2022) | multi/mono | 6.1/16.1B | 505∼577B | 119∼191B | 2K | ✓ |
| | CodeGen-2.5 (Nijkamp et al., 2023) | multi/mono/instruct† | 7B | 1.4T | 1.4T | 2K | ✓ |
| Eleuther AI | GPT-J (Wang and Komatsuzaki, 2021) | | 6.1B | 402B | 46B | 2K | ✗ |
| | GPT-NeoX (Black et al., 2022) | | 20.6B | 472B | 54B | 2K | ✗ |
| | Pythia (Biderman et al., 2023) | | 1.4/6.9/12B | 300B | 35B | 2K | ✗ |
| Databricks | Dolly-v2 (Conover et al., 2023) | | 6.9/12B | – | – | 2K | ✗ |
| BigCode | SantaCoder (Allal et al., 2023) | | 1.1B | 236B | 236B | 2K | ✓ |
| | StarCoder (Li et al., 2023) | base/plus | 15.5B | 1∼1.6T | 1T | 8K | ✓ |
| Meta | InCoder (Fried et al., 2023) | base/instruct† | 1.3/6.7B | 52B | 52B | 2K | ✓ |
| | LLaMA (Touvron et al., 2023a) | | 7/13/30B | 1∼1.4T | 45∼63B | 2K | ✗ |
| | LLaMA-2 (Touvron et al., 2023b) | | 7/13/70B | 2T | – | 4K | ✗ |
| | CodeLLaMA (Rozière et al., 2023) | | 7/13/34B | 2.5T | 435B | 16K | ✓ |
| Stanford | Alpaca† (Taori et al., 2023) | | 7/13/30B | – | – | 2K | ✗ |
| Replit | Replit-v1-3b (rep, 1) | | 3B | 525B | 525B | 2K | ✓ |
| WizardLM | WizardCoder-v1 (Luo et al., 2023) | | 15B | – | – | 2K | ✓ |
| MosaicML | MPT (Team, 2023a, b) | base/instruct† | 7/30B | 1T | 135B | 2K/8K | ✗ |
| MistralAI | Mistral-v0.1 (Jiang et al., 2023) | base/instruct† | 7B | – | – | 32K | ✗ |
| XLANG | Lemur-v1 (Xu et al., 2023) | | 70B | – | – | 4K | ✓ |
| TII | Falcon (Almazrouei et al., 2023) | base/instruct† | 7/40/180B | 1∼3.5T | – | 2K | ✗ |
| OpenAI | Codex (Chen et al., 2021) | code-cushman-001 / code-davinci-002 | 12B / – | 400B | 100B | 2K/8K | ✓ |
| | InstructGPT† (Ouyang et al., 2022) | text-davinci-002/3 | – | – | – | 4K | ✗ |
| | ChatGPT† (OpenAI, 2022) | turbo-0301/0613 | – | – | – | | |
| | GPT-4† (OpenAI, 2023) | 0314/0613 | – | – | – | 8K | ✗ |
Selection Criteria.
While it is not possible to evaluate every LLM on these tasks, we strive to provide a comprehensive evaluation of current LLMs on L2C generation by covering a diverse selection of LLMs of varying sizes that are trained on different mixtures of data. For example, the models we consider range from 1B parameters (e.g., SantaCoder (Allal et al., 2023)) to 170B+ (e.g., Falcon-180B (Almazrouei et al., 2023) and OpenAI's GPT-4). Though we prioritize the evaluation of code-specific models, i.e., models for which the majority of training tokens come from code (e.g., CodeLLaMA (Rozière et al., 2023), StarCoder (Li et al., 2023)), we also include the most competitive general LLMs, such as LLaMA-2-70B (Touvron et al., 2023b) and Falcon-180B, for comparison. To evaluate the effect of instruction tuning and its data mixtures on L2C tasks, we also include several instruction-tuned versions of the LLMs, such as Alpaca (Taori et al., 2023) and Dolly (Conover et al., 2023). We prioritize the evaluation of open-source models and mainly present our findings based on them, as the technical details of proprietary models (e.g., model size, training data) are unclear and we hesitate to speculate about them.
Model Access.
3 Results and Analysis
We organize the experimental results and analysis as follows. We first discuss scaling effects in § 3.1; in § 3.2 and § 3.3, we analyze how the pretraining data mixture and instruction tuning affect the models on L2C tasks. To study model robustness, in § 3.4 we measure the models' sensitivity to the few-shot demonstrations, and in § 3.5 we examine model confidence calibration. Finally, we present an error analysis in § 3.6.
3.1 Scaling
Model Size.
Pretraining Compute.
To study the scaling of compute during pretraining, we plot the average model performance across all 7 tasks against the estimated FLOPs[12] needed during training, as shown in Figure 2. From the figure, we can see that code LMs are much more compute-efficient than general LMs in terms of L2C performance. This is particularly noticeable when the compute budget is constrained: for example, SantaCoder-1B outperforms Pythia-6.9B with an order of magnitude less compute, and InCoder-1B outperforms Pythia-1.4B using only 1/5 of the pretraining compute. That code LMs generally require fewer FLOPs to achieve comparable performance on L2C tasks[13] is to be expected, given that general LMs are also optimized for many natural language tasks unrelated to coding. Moreover, this trend seems to diminish when scaling up (e.g., comparing CodeLLaMA-34B and LLaMA-2-70B), which suggests that as model and pretraining data sizes grow, general LMs may become as compute-efficient as code LMs for L2C tasks.
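For concreteness, the compute estimates behind this comparison follow the approximation noted in the footnotes (FLOPs ≈ 6 × parameters × training tokens); the sketch below reproduces that calculation, with parameter and token counts taken from Table 2 and treated as approximate.

```python
# Rough pretraining-compute estimate following the Kaplan et al. (2020)
# approximation cited in the notes; counts below are approximate values
# taken from Table 2.
def estimate_flops(n_params: float, n_tokens: float) -> float:
    """FLOPs ~= 6 * number of parameters * number of training tokens."""
    return 6.0 * n_params * n_tokens

santacoder = estimate_flops(1.1e9, 236e9)   # ~1.6e21 FLOPs
pythia_6_9b = estimate_flops(6.9e9, 300e9)  # ~1.2e22 FLOPs, roughly 8x more
print(f"{santacoder:.2e} vs. {pythia_6_9b:.2e}")
```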
3.2 Data Mixture
While all of the models we evaluate in L2CEval have seen code during pretraining, the distributions of their training data mixtures vary significantly, as illustrated in Table 2. In Figure 3, we show 12 different models of around 7B parameters, ordered by their average performance on all 7 tasks, and plot the amounts of code and non-code tokens in their pretraining data. As we can see from the figure, the number of code tokens in the pretraining data affects model performance much more than the amount of non-code tokens, and model performance increases almost monotonically with the amount of code tokens in the training data. Given that the CodeLLaMA models are further pretrained on code tokens on top of LLaMA-2, comparing their 7B and 13B versions in Figure 3 and Table 3 shows that training on more code tokens not only drastically improves text-to-SQL parsing and Python programming, but also math reasoning. As mentioned in § 3.1, since the generated Python solutions for GSM8K and SVAMP are both simple in syntax (i.e., straight-line programs with numeric operations), we hypothesize that training on more code tokens improves the reasoning abilities of LMs in general. This is also consistent with previous findings (Fu et al., 2022; Mishra et al., 2022; Wei et al., 2022b).
3.3 Instruction Tuning
Instruction tuning (Ouyang et al., 2022) is a method that enhances the ability of LLMs to follow instructions written in natural language. Here we compare instruction-tuned models with their base models and show their few- and zero-shot results in Table 4.
Zero-shot Results.
First, we observe that zero-shot results generally improve after instruction tuning for all models on Spider and MBPP. This is perhaps a little surprising for the Dolly and MPT-instruct models, as their instruction-tuning data does not explicitly include coding tasks. Besides the fact that instruction tuning trains the models to focus more on the instruction, we also hypothesize that instruction tuning generally improves language understanding abilities, which are essential for L2C tasks. We also note that the zero-shot performance on GSM8k is zero for all selected models. By inspecting the model outputs, we find that the models fail to follow the instruction to provide the answer by ending the Python solution with answer = x.
Few-shot Results.
As for few-shot performance, models instruction-tuned on coding tasks, such as Alpaca and the CodeLLaMA-instruct models, yield much more consistent improvements than Dolly and MPT-instruct across all tasks. Notably, CodeLLaMA-34B-instruct improves over the base model by 7.0, 9.2, and 7.2 points on the Spider, GSM8K, and MBPP datasets, respectively. Though some few-shot results deteriorate for Dolly and MPT-instruct, such performance degradations are quite minimal, as half of them are within 2%. Ouyang et al. (2022) suggest that instruction tuning generally decreases few-shot performance, as it shifts the attention of the model from the few-shot exemplars to the instructions, but from these results, we believe that whether instruction tuning improves few-shot performance largely depends on how similar the instruction-tuning tasks are to the evaluation tasks.
3.4 Sensitivity to Prompt
Here we study how sensitive the models are to the number of few-shot demonstrations and to the choice of examples in the prompt.
Number of Few-shot Demonstrations.
Figure 4 illustrates the correlation between model performance and the number of exemplars in the prompt.[15] While increasing the number of few-shot exemplars in the prompt generally improves execution accuracy, such improvement is not consistent across models and tasks. For example, on the MBPP dataset, increasing from 3 to 8 exemplars in the prompt actually decreases performance for most of the selected models, e.g., by 4.0% for codex-cushman. We hypothesize that this is because the programs in the prompt bias the model towards generating similar programs and ignoring the specification. Supporting evidence appears in Table 5, where codex-cushman is shown to be more sensitive to the choice of exemplars. This effect has also been observed by Li et al. (2022b).
| Models | Spider (2-shot) | GSM8k (2-shot) | MBPP (3-shot) |
|---|---|---|---|
| code-davinci | 73.7±0.3 | 66.4±1.0 | 59.0±1.9 |
| code-cushman | 50.4±0.7 | 24.2±1.1 | 39.3±3.3 |
| CodeGen-6B-mono | 32.4±0.6 | 13.8±0.2 | 35.5±0.5 |
| StarCoder-15.5B | 54.9±2.7 | 32.3±0.8 | 44.1±2.2 |
| Alpaca-7B | 20.1±3.5 | 7.3±1.2 | 13.6±0.6 |
Different Examples as Demonstrations.
Moreover, we also measure the sensitivity of the models to the choice of exemplars, reporting in Table 5 the variance of model performance across runs with different exemplars in the prompt. While the variances differ across models and tasks, none of them are large enough to alter the ranking of the models or to affect the conclusions presented in this work.
3.5 Model Calibration
3.6 Error Modes
In Figure 6, we present an error analysis of the four best models, based on manually[16] examining a fixed set of 100 examples from the GSM8k and MBPP datasets for each of the 4 selected models. We categorize the errors into 5 cases:
- 1) execution error, where deformed programs are generated;
- 2/3) missing/extra steps, where some key steps are missing or extraneous lines are generated in the predicted code;
- 4) wrong steps, where the model only makes subtle mistakes in certain steps in the code;
- 5) ambiguous specification, where the NL specification itself is ambiguous or unclear.
From the results shown in Figure 6, we can see that on GSM8k, while StarCoder and code-cushman miss or add extra steps about as often as the stronger models (e.g., code-davinci and GPT-4), they make more mistakes in predicting intermediate steps and generate more deformed programs. On MBPP, however, the weaker models are prone to missing crucial steps in the implementation, which indicates a lack of understanding of the problem as well as of planning abilities. Hallucination (Ji et al., 2023) is a common issue in natural language generation. While we find it rare for models to generate extraneous lines of code, hallucination can also manifest as using wrong operators or introducing variable values that do not exist in the natural language description, which we categorize as “wrong steps” in Figure 6.
4 Limitations
While we strive to provide a comprehensive and fair evaluation of LLMs in L2C tasks, here we also discuss the limitations of L2CEval.
Generation Using Greedy Decoding.
In this work, we use greedy decoding to generate a single program per example as the model's output. While this is the most efficient way to generate programs and ensures a fair comparison across models, as it is not affected by factors such as sampling temperature, it is also relatively noisy (Nijkamp et al., 2022; Chen et al., 2021). For tasks such as MBPP, and Python programming in general, sampling k solutions and measuring pass@k (whether any of the k programs is correct) or n@k (the number of the k programs that are correct) is preferable, as it gives the model k tries to generate a correct program and lowers the variance. For Python programming tasks, such methods are also closer to practical use cases, as we typically have test cases that can filter out some incorrect programs among the samples. For other tasks, evaluating pass@k also opens opportunities for post-generation reranking methods (Shi et al., 2022; Zhang et al., 2023; Ni et al., 2023b). However, evaluating pass@k or n@k costs k times the compute of greedy decoding, so we only evaluate greedy decoding in this work and leave sampling-based evaluation to future work.
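For reference, such sampling-based evaluation typically uses the unbiased pass@k estimator of Chen et al. (2021), sketched below for n samples per problem, of which c pass all tests.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = samples drawn per problem, c = samples that pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 20 samples per problem, 5 of which pass the tests
print(pass_at_k(20, 5, 1))   # 0.25
print(pass_at_k(20, 5, 10))  # ~0.98
```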
Execution-based Evaluation.
Moreover, we mainly rely on execution-based evaluation (i.e., execution accuracy) in this work. However, such evaluation may reward spurious programs, i.e., false-positive programs that achieve the correct execution result by chance (Zhong et al., 2020; Ni et al., 2020). We use human evaluation to measure this problem of spuriousness and find a non-trivial portion of “correct” programs to be spurious for Spider, but not for the other datasets. In addition, execution may not always be straightforward in practice, especially when complex dependencies and potentially harmful programs are involved (Chen et al., 2021).
Confounding Factors During Comparison.
When comparing models, especially across different model series, multiple performance-impacting factors are typically at play at the same time, some of which are not studied in this work, such as model architecture and pretraining objective. Such confounding factors may limit the validity of the conclusions we draw from model comparisons. We try to mitigate this by fixing as many variables as possible during a comparison, for example by making observations within the same model series. While the general trends can still be observed across different model series, readers should be mindful of such confounding factors when interpreting comparisons between models.
Lack of Information for Proprietary Models.
For the open-access proprietary LLMs (e.g., the OpenAI models), due to the lack of basic information and mismatches between the models described in the papers and the actual API engines, very few scientific conclusions can be drawn from these results. We evaluate such proprietary models mainly to provide baselines and to help practitioners choose models for their use cases. We also present human evaluations of some of the strong models to discuss differences in common error modes. However, our findings generally rely on open-source models, to avoid being misled by speculation about the details of closed-source models.
5 Related Work
Code Generation Evaluation.
Several code generation benchmarks are collected from raw GitHub and StackOverflow data and involve professional annotators to enhance data quality (Iyer et al., 2018; Agashe et al., 2019; Yin et al., 2018). While such benchmarks focus more on lexical-based evaluation, ODEX (Wang et al., 2023) introduces execution-based evaluation, which has also been widely applied in recent code generation benchmarks such as DS-1000 (Lai et al., 2023), HumanEval (Chen et al., 2021), and MBPP (Austin et al., 2021). More recently, there has been an increasing focus on assessing the generalization capabilities of code generation models across multiple programming languages (Athiwaratkun et al., 2023), with benchmarks such as CodeGeeX (Zheng et al., 2023) and MultiPL-E (Cassano et al., 2023). In our work, we focus on studying whether LLMs can map natural language instructions to code in the most popular programming language for each domain (i.e., SQL for semantic parsing and Python for math reasoning and programming). While the study of different programming languages is orthogonal to our work, we refer readers to these existing multi-lingual evaluation benchmarks.
Other Code-related Tasks.
Large language models have also shown significant success in other code-related directions. One popular direction is code understanding. For example, CodeXGLUE (Lu et al., 2023) comprises three widely used code understanding tasks: defect detection, clone detection, and code search. However, CONCODE (Iyer et al., 2018) is the only language-to-code task included in CodeXGLUE, and it uses surface-form-based evaluation metrics such as BLEU. BigCloneBench (Krinke and Ragkhitwetsagul, 2022) measures the similarity between code pairs to predict whether they have the same functionality. CodeSearchNet (Husain et al., 2019) is a benchmark for semantic code search given natural language queries. Beyond code understanding, there are other tasks such as code translation (Roziere et al., 2020) and program repair (Gupta et al., 2017). We leave a systematic evaluation of LLMs on those tasks as important future work.
6 Conclusions
In this paper, we present L2CEval, a comprehensive evaluation framework for natural-language-to-code generation, in which we evaluate 56 models from 13 organizations on 7 tasks from 3 core domains. L2CEval investigates model performance along a variety of axes, such as model scale, training data mixture, and sensitivity to few-shot exemplars, as well as the impact of instruction tuning, inter alia. We also present an analysis of model calibration and conduct a human evaluation of common error modes across different models. We hope our study provides useful insights for the community into applying LLMs to downstream code applications and into future model development efforts.
Acknowledgments
We would like to thank Rui Zhang and Tao Yu for the initial discussions about this work, and Hailey Schoelkopf and Zhangir Azerbayev for their helpful discussions and suggestions. The authors would also like to thank the TACL reviewers and action editor David Chiang for their careful reviews and valuable feedback. This work is supported in part by a gift from Salesforce Research.
Notes
All future evaluations (e.g., LLaMA-3, StarCoder2, etc) will be updated on the project website: https://l2c-eval.github.io/.
We refer to “natural language” whenever we use the term “language” in this work.
Here we use the split from Ni et al. (2023a).
We discuss the limitation of greedy decoding in § 4.
See § 4 for the limitations of execution-based evaluation.
While more challenging datasets exist for text-to-SQL (e.g., BIRD-SQL (Li et al., 2024)), we believe the observations should generalize due to the similar task format.
Evaluation of more recent smaller models such as LLaMA-3 (8B) shows an average performance of 45.1, outperforming CodeLLaMA-base (13B).
Here we base our estimation on Kaplan et al. (2020): FLOPS ≈ 6 * model size (B) * training tokens (B).
The only exceptions are the CodeGen-multi/mono models, which are trained on far fewer code tokens than other code LLMs.
To be consistent with previous work, we use the evaluation harness from the BigCode project (https://github.com/bigcode-project/bigcode-evaluation-harness) for evaluating HumanEval and DS-1000. However, it does not output logits, which are essential for calculating calibration scores.
While the range of the number of shots differs across tasks due to different task prompt lengths (e.g., database schema encoding for Spider), we keep it consistent across different models on the same task for a fair comparison.
Two of the authors performed this annotation.
References