An Empirical Study of the Non-Determinism of ChatGPT in Code Generation
SHUYIN OUYANG and JIE M. ZHANG, King’s College London, London, UK
MARK HARMAN, University College London, London, UK
MENG WANG, University of Bristol, Bristol, UK
There has been a recent explosion of research on Large Language Models (LLMs) for software engineering
tasks, in particular code generation. However, results from LLMs can be highly unstable, non-deterministically returning very different code for the same prompt. Such non-determinism affects the correctness and consistency of the generated code, undermines developers’ trust in LLMs, and yields low reproducibility in
LLM-based papers. Nevertheless, there is no work investigating how serious this non-determinism threat is.
To fill this gap, this article conducts an empirical study on the non-determinism of ChatGPT in code
generation. We chose to study ChatGPT because it is already highly prevalent in the code generation research
literature. We report results from a study of 829 code generation problems across three code generation
benchmarks (i.e., CodeContests, APPS and HumanEval) with three aspects of code similarity: semantic
similarity, syntactic similarity, and structural similarity. Our results reveal that ChatGPT exhibits a high degree
of non-determinism under the default setting: the ratio of coding tasks with zero equal test output across
different requests is 75.76%, 51.00% and 47.56% for three different code generation datasets (i.e., CodeContests,
APPS and HumanEval), respectively. In addition, we find that setting the temperature to 0 does not guarantee
determinism in code generation, although it indeed brings less non-determinism than the default configuration
(temperature = 1). In order to put LLM-based research on firmer scientific foundations, researchers need to
take into account non-determinism in drawing their conclusions.
Additional Key Words and Phrases: code generation, non-determinism, large language model
1 Introduction
Large language models (LLMs) are non-deterministic by nature [34]. This is because LLMs
predict the probability of a word or token given the context, represented by a sample of words. The
This work was supported by the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence (EP/S023356/1).
Authors’ Contact Information: Shuyin Ouyang (corresponding author), King’s College London, London, UK; e-mail:
shuyin.ouyang@kcl.ac.uk; Jie M. Zhang, King’s College London, London, UK; e-mail: jie.zhang@kcl.ac.uk; Mark
Harman, University College London, London, UK; e-mail: mark.harman@ucl.ac.uk; Meng Wang, University of Bristol,
Bristol, UK; e-mail: meng.wang@bristol.ac.uk.
This work is licensed under a Creative Commons Attribution International 4.0 License.
© 2025 Copyright held by the owner/author(s).
ACM 1557-7392/2025/1-ART42
https://doi.org/10.1145/3697010
randomness in LLMs typically comes from the sampling methods used during text generation, such
as top-k sampling or nucleus sampling [31, 50]. As a result, identical instructions or prompts can
yield completely different responses to separate requests.
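For illustration, the following minimal sketch shows how temperature and nucleus (top-p) sampling introduce randomness into next-token selection; it is not ChatGPT's actual decoder, and the toy logits are purely hypothetical.

```python
# Minimal illustration of temperature and nucleus (top-p) sampling over a toy
# next-token distribution; a sketch of why repeated generations can diverge.
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.95, rng=None):
    rng = rng or np.random.default_rng()
    if temperature == 0:
        # Greedy decoding: in theory fully deterministic for a fixed input.
        return int(np.argmax(logits))
    probs = np.exp(np.asarray(logits, dtype=float) / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # Keep the smallest set of most-likely tokens whose cumulative probability >= top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

toy_logits = [2.0, 1.5, 0.3, 0.1]                                        # hypothetical 4-token vocabulary
print([sample_next_token(toy_logits) for _ in range(5)])                 # may differ from run to run
print([sample_next_token(toy_logits, temperature=0) for _ in range(5)])  # always token 0
```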
This non-determinism (i.e., the inconsistency in the code candidates generated in different
requests with identical prompts)1 is an essential consideration when using LLMs in practice [59].
Unreliable and inconsistent code snippets can have significant negative effects on the process of
software development, particularly in safety-critical applications where consistency and reliability
are paramount [11, 30]. It may also undermine developers’ trust in LLMs when completely different
suggestions are given at different times [64].
Moreover, non-determinism affects the reliability and reproducibility of empirical software
engineering [54]. Indeed, compared to other tasks of ChatGPT, such as question answering and text
summarisation, the non-determinism threat in code-related tasks is much more serious, because the
inconsistency (especially semantic inconsistency) often indicates errors in the generated code [28].
It is therefore of vital importance to understand how serious the non-determinism is for LLM-based
software engineering tasks and call for actionable solutions to alleviate this issue.
This article presents the first systematic empirical study on the threat of non-determinism of
ChatGPT in code generation tasks. We chose the code generation tasks because code generation
with LLMs, such as ChatGPT, has recently attracted significant attention due to its impressive and
cutting-edge performance [10, 15, 37]. Indeed, many publications have emerged from both the
software engineering community and the machine learning community on evaluating the capability
of ChatGPT in code generation [6, 10, 16, 41, 69].
This article focuses on ChatGPT (including GPT-3.5 and GPT-4), rather than other LLMs, for
the following two reasons: (1) ChatGPT is the most widely adopted LLM in code generation in
the literature [15, 16, 23, 42, 44, 65, 72]; (2) ChatGPT has the best performance in code generation
and represents the state-of-the-art so far [4, 15]. Thus, as the first work on the non-determinism of
LLMs in software engineering tasks, we focus on ChatGPT in this article but encourage other work
to continue to investigate the non-determinism issue in other LLMs.
We conduct a series of experiments using the ChatGPT models on three widely-studied code
generation benchmarks (i.e. CodeContests, APPS, and HumanEval) with 829 coding problems. For
each code generation task, we let ChatGPT make five predictions. We then compare the similarity
of the five code candidates from three aspects, namely semantic similarity, syntactic similarity and
structural similarity. We also explore the influence of temperature (i.e., a parameter that controls the
randomness of the response generated by ChatGPT) on non-determinism, as well as the correlation
between non-determinism and coding task features such as the length of coding instruction and
the difficulty of the task. We show the non-determinism with different models of ChatGPT, namely,
GPT-3.5 and GPT-4. Finally, we compare the non-determinism of code generation with different
prompt engineering strategies.
Our results reveal that the threat of non-determinism in ChatGPT for code generation is serious,
especially under the default setting. In particular, (1) the ratio of problems with not a single equal test output among the five code candidates is around 50% or higher for all the benchmarks we study; (2) the maximum difference of the test pass rate reaches 1.00 for all three datasets, and does so for 39.63% of
the problems in HumanEval, the most widely used code generation benchmark. In addition, contrary
to the widely held belief (and practice followed to minimise non-determinism) [7, 13, 39], setting
the temperature to zero does not guarantee determinism in code generation. Also interestingly,
our result analysis suggests that the length of coding instructions has a negative correlation with
1 There are other terms in the literature that also refer to non-determinism, such as inconsistency, variance, randomness and
instability.
almost all our similarity measurements, meaning that longer descriptions tend to yield code candidates that are less similar to one another and more buggy. Different prompt engineering strategies also
yield different degrees of non-determinism in code generation.
To understand how the literature handles the non-determinism threat, we collect 76 LLM-based
code generation papers that appeared in the last 2 years. Our manual analysis results highlight that
only 21.1% of these papers consider the non-determinism threat in their experiments. This indicates that there is currently a significant threat to the validity of scientific conclusions. We call
for researchers to take into account the non-determinism threat in drawing their conclusions.
To summarise, this article makes the following contributions:
— We present the first study of the non-determinism threat in code generation tasks on ChatGPT,
with three widely-studied datasets (CodeContests, APPS, HumanEval) and three types of
similarity measurements. Our results reveal that the non-determinism threat is serious and
deserves attention from both academia and industry.
— We study the influence of temperature on the non-determinism of ChatGPT and find that
setting temperature to zero does not guarantee determinism in code generation, which is
contrary to many people’s beliefs.
— We study the correlation between coding task features and the degree of non-determinism. The
results reveal that the length of coding instruction has a negative correlation with syntactic
and structural similarity, as well as the average correctness of the generated code.
— We study the influence of different prompt engineering techniques on code generation non-
determinism. We find that prompts with a Chain-of-Thought strategy lead to more non-
determinism when temperature = 0, while code candidates generated from prompts requesting
simple and concise code are more stable.
We release our data, code, and results at our homepage [3]. The rest of the article is organised as
follows. Section 2 introduces the main procedure of our study. Section 3 describes the design of the
experiments, including research questions, benchmarks, selected models and measurement tools.
Section 4 presents the results and discusses some interesting findings based on the experimental
results we obtained. Section 5 discusses the threats to validity in two aspects, as well as the
limitations of this study. Section 6 introduces the related work of our study. Section 7 discusses the
implications for software developers and researchers and future work. Section 8 concludes.
2 Method
Figure 1 shows an overview of our experimental procedure. For each code generation task, our
study first produces a prompt with a coding instruction, then feeds this prompt to the ChatGPT API
[2] to generate code (zero-shot). We call the API five times to let ChatGPT make five predictions
with the same prompt. We then extract code from each of the five responses, to get five code
candidates. Our non-determinism analysis compares the five code candidates in terms of their
semantic similarity, syntactic similarity and structural similarity.
Prompt Synthesis: The first step in our study is prompt preparation. There are many ways to
conduct prompt engineering for code generation. In this article, we follow the common practice
in LLM-based code generation assessment [5, 15]. In particular, (1) we ask ChatGPT to generate
Python code for each code generation task with zero-shot prompting; (2) we use the basic prompt
design directly followed by the programming task description. To guarantee that ChatGPT produces code rather than pure natural language in its response, we augment the original coding problem description with an instruction requesting Python code.
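The following sketch illustrates this step with the openai Python client (version 1.x); the model name, instruction wording, and helper names are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Sketch of prompt synthesis and repeated requests with an identical prompt.
# The model name and instruction wording are illustrative, not our exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(task_description: str) -> str:
    # Zero-shot prompt: the original problem description plus an explicit request
    # for Python code, so the response contains code rather than plain prose.
    return f"{task_description}\n\nPlease write a Python 3 program that solves this problem."

def request_candidates(task_description: str, n_requests: int = 5, temperature: float = 1.0):
    prompt = build_prompt(task_description)
    responses = []
    for _ in range(n_requests):  # five independent requests with the same prompt
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        responses.append(completion.choices[0].message.content)
    return responses
```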
One challenge in extracting the code from the API response is that there is no clear marker to distinguish code from plain text in the response, unlike ChatGPT’s web chat interface.
Code Extraction: After receiving the response from ChatGPT, we apply code extraction to retrieve the
code from the generated text. We compile the code directly without making any modifications. Our
experiments are mainly run on Google Deep Learning VM instances, with the Linux environment
pre-installed from open images.2 All of the necessary libraries are pre-installed, which minimises import errors caused by missing libraries when the generated code is run.
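One possible implementation of this extraction step is sketched below; the preference for fenced code blocks and the plain-text fallback are heuristics assumed for illustration, not necessarily our exact extraction rules.

```python
# Sketch of code extraction: prefer fenced code blocks in the response text and
# fall back to the whole response if no fences are found (an assumed heuristic).
import re

FENCE = re.compile(r"```(?:python|py)?\s*\n(.*?)```", re.DOTALL | re.IGNORECASE)

def extract_code(response_text: str) -> str:
    blocks = FENCE.findall(response_text)
    if blocks:
        return "\n\n".join(block.strip() for block in blocks)
    return response_text.strip()
```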
Test Case Execution: To evaluate the semantics of ChatGPT’s generated code, we use the test
suite corresponding to each benchmark. We not only record whether each test passes or not but also record every specific test output, which enables us to compare the similarity of test outputs even when both candidates fail. For the CodeContests and HumanEval datasets, every problem has a timeout of 3 seconds. The APPS dataset does not provide a default timeout value, and we set the value
to be 3 seconds as well. We use single-threaded scripts to run the tests to ensure that the test cases
are executed sequentially to avoid race conditions that may arise from concurrent executions.
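A sketch of this execution step is shown below; the file layout and the exact output comparison rule are assumptions for illustration.

```python
# Sketch of executing one code candidate on one test case with a 3-second timeout,
# recording the concrete output rather than only a pass/fail verdict.
import subprocess
import sys

def run_test(candidate_path: str, test_input: str, expected_output: str, timeout: float = 3.0):
    try:
        proc = subprocess.run(
            [sys.executable, candidate_path],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        output = proc.stdout.strip() if proc.returncode == 0 else f"ERROR: {proc.stderr.strip()}"
    except subprocess.TimeoutExpired:
        output = "TIMEOUT"
    return {"output": output, "passed": output == expected_output.strip()}
```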
Similarity Checking: To measure the similarity between code candidates, we introduce similarity
measurement tools that evaluate the semantic, syntactic, and structural similarity between the generated code solutions. Semantic similarity is measured by comparing test execution outputs. Syntactic similarity is measured by comparing the text similarity between code candidates. Structural similarity is evaluated by comparing the code candidates’ abstract syntax trees (ASTs). More details about our similarity measurement methods are given in Section 3.4.
3 Experimental Design
3.1 Research Questions
This study answers the following questions:
RQ1: To what extent is ChatGPT susceptible to non-determinism in code generation under the default
setting? This RQ investigates the non-determinism of ChatGPT in terms of the semantic, syntactic,
and structural similarity among the code candidates generated with identical instructions under
the default setting. There are three sub-RQs:
— Sub-RQ1.1: To what extent is ChatGPT susceptible to non-determinism in terms of semantic
similarity?
2 https://cloud.google.com/compute/docs/images
3 https://judge.u-aizu.ac.jp
4 https://atcoder.jp
5 https://www.codechef.com
6 https://codeforces.com
7 https://www.hackerearth.com
60.20% interview problems, 19.60% introductory problems, and 20.20% competition problems. APPS
evaluates models not only on their ability to code syntactically correct programs but also on their
ability to understand task descriptions and devise algorithms to solve these tasks [27].
HumanEval: The HumanEval dataset is an evaluation set first proposed in [12], which contains
164 hand-written coding problems. Each problem includes a function signature, docstring, body,
and several unit tests, with an average of 9.24 test cases per problem. We use the whole dataset to
benchmark our experiments.
As mentioned in Section 2, we focus in particular on code generated in Python 3,
since it is one of the most widely studied programming languages in code generation [5, 12, 17, 37,
61, 63, 66].
8 https://platform.openai.com/docs/api-reference/chat/create
9 Although the benchmarks are very widely studied, their test suites can be inadequate. This article is less affected by the
inadequate test suite issue as we focus on the similarity of test pass rate rather than the absolute value of test pass rate.
4.1 RQ1: Non-Determinism of ChatGPT with Three Types of Similarities under Default
Setting
4.1.1 RQ1.1: Semantic Similarity. Semantic similarity is measured by the following metrics: test pass rate, OER, and OER excluding exceptions. As mentioned in Section 3.4, each coding problem has five test pass rates; we use the variance and maximum difference of these five values to indicate ChatGPT’s non-determinism in generating code for the task. We also report the mean value, which represents the average correctness of the generated code. For OER and OER (no ex.), we compare the equivalence across all five code candidates as well as between every pair of candidates. For each dataset, we report the distribution of the different measurements in Figures 2 and 3. The mean measurement values for all the coding problems (the mean value inside each bar in each bar chart) in a dataset are shown in Table 2. The max diff refers to the maximum value of the max diff among all the coding problems.
10 https://github.com/fyrestone/pycode_similar
Fig. 3. RQ1.1: Distribution of semantic similarity in terms of test output equivalence rate (OER and OER
(no ex.)).
OER and OER (no ex.) are the output equivalence rate and the equivalence rate excluding
exceptions.
In addition, Table 2 also shows the ‘Ratio of worst cases’, which is the ratio of problems whose maximum diff of test pass rate is 1 or whose OER is 0.
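For concreteness, the sketch below shows one plausible way to compute these per-problem statistics from the five candidates' test results; the exact OER definition is given in Section 3.4, so the per-test equality check used here is an assumption.

```python
# Sketch of the per-problem semantic measurements: mean, variance and maximum
# difference of the five test pass rates, plus a per-test output equivalence rate
# (one plausible reading of OER; the exact definition is in Section 3.4).
from statistics import mean, pvariance

def semantic_measures(pass_rates, outputs):
    # pass_rates: five floats in [0, 1]
    # outputs: five lists, outputs[c][t] = output of candidate c on test case t
    n_tests = len(outputs[0])
    equal_tests = sum(
        all(outputs[c][t] == outputs[0][t] for c in range(1, len(outputs)))
        for t in range(n_tests)
    )
    return {
        "mean_pass_rate": mean(pass_rates),
        "variance": pvariance(pass_rates),
        "max_diff": max(pass_rates) - min(pass_rates),
        "oer": equal_tests / n_tests,  # 0 means not a single test output is equal across candidates
    }
```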
From Figure 2, Figure 3 and Table 2, we observe that ChatGPT is very unstable in generating
semantically consistent code candidates. In particular, the ratios of tasks with zero equal test output
(i.e., OER = 0) among the five code candidates are 75.76%, 51.00% and 47.56% for the three datasets,
respectively. This indicates that for the majority of the cases, ChatGPT generates code candidates
with completely different semantics from identical instructions.
The mean variance of the test pass rate in Table 2 is relatively small, ranging between 0.03 and 0.09. This is because the test pass rates of different code candidates are often equally poor, as can be observed from Figure 2(a). However, the max diff of the test pass rate reaches 1.00 for all three datasets, and does so for 39.63% of the problems in HumanEval, the most widely used code generation benchmark. This indicates that the correctness of code candidates generated from the same instruction can vary significantly. The large difference across datasets also highlights the importance of using multiple datasets when assessing the code generation performance of LLMs.
Our statistical analysis with the Kruskal–Wallis test shows that, for 92.1% of CodeContests problems, 39.4% of APPS problems and 40% of HumanEval problems, the outputs of the code candidates are indeed significantly different, with a p-value under the Kruskal–Wallis test of less than 0.05.
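Such a per-problem test can be run with SciPy as sketched below; treating each candidate's per-test results as one sample is an assumption about the grouping, which Section 3.4 specifies precisely.

```python
# Sketch of the per-problem Kruskal-Wallis H-test with SciPy. Each candidate's
# per-test results form one sample; the grouping used here is an assumption.
from scipy.stats import kruskal

def outputs_differ_significantly(per_candidate_results, alpha=0.05):
    flat = [value for group in per_candidate_results for value in group]
    if len(set(flat)) == 1:
        return False, 1.0  # the test is undefined when all values are identical
    statistic, p_value = kruskal(*per_candidate_results)
    return p_value < alpha, p_value
```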
Answer to RQ1.1: The semantic difference among the code generated by ChatGPT in different
requests is significant. In particular, the ratio of coding tasks with not a single equal test
output among the five different requests is 75.76%, 51.00% and 47.56% for CodeContests,
APPS and HumanEval, respectively. In addition, the maximum difference of the test pass rate
reaches 1.00 for all three datasets and accounts for 39.63% of the problems in HumanEval,
the most widely used code generation benchmark.
4.1.2 RQ1.2: Syntactic Similarity. Syntactic similarity measures the text similarity among code
candidates. In our experiment, the syntactic similarity is evaluated by the following metrics: LCS
and LED (more details in Section 3.4). For the five code candidates for each coding problem, we use
the first code candidate as a reference and calculate the LCS and LED between the reference and
the remaining four candidates. In addition, we calculate LCS and LED with code candidates in pairs,
for each pair combination. Thus, each problem has four LCS values and LED values, and 20 LCS
and LED values in pairs, each value indicating a syntactic similarity. We use the mean of these four
values as well as the worst of them (i.e., the smallest value for LCS and the largest value for LED),
and the mean of these 20 values calculated in pairs to represent each problem’s syntactic similarity.
Figure 4 shows the distribution of LCS and LED for all the problems in each dataset. Table 3 shows
the mean, mean worst and pair mean LCS and LED values for all the coding problems (the mean
value inside each bar in the figures) in a dataset.
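The following sketch gives plausible character-level implementations of the two metrics; the study's exact tokenisation and normalisation (e.g., line- or token-level comparison) are described in Section 3.4 and may differ.

```python
# Sketches of the two syntactic metrics: a normalised longest common subsequence
# (LCS) ratio and the Levenshtein edit distance (LED). Character-level comparison
# and normalisation by the longer string are illustrative assumptions.
def lcs_ratio(a: str, b: str) -> float:
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / max(m, n, 1)

def levenshtein(a: str, b: str) -> int:
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                      # deletion
                               current[j - 1] + 1,                   # insertion
                               previous[j - 1] + (ch_a != ch_b)))    # substitution
        previous = current
    return previous[-1]
```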
We observe that the code candidates generated from the same instruction also differ substantially in syntax. Specifically, the mean LCS (i.e., the mean normalised length of the longest common subsequence among the code candidates) is 0.22, 0.23 and 0.42 for CodeContests, APPS and HumanEval, respectively.
Across the three datasets, we can see from Table 3 that the lowest LCS and the largest LED values both occur for the CodeContests dataset. By contrast, the largest LCS and the smallest LED values both occur for HumanEval. This indicates that ChatGPT is most unstable syntactically for the
code generation tasks in CodeContests, and most stable for HumanEval. We further explore the
correlation between different similarities and code task features in Section 4.4.
Answer to RQ1.2: Code candidates generated by ChatGPT in different requests also differ
significantly in syntax. The mean syntax similarity (LCS) is only 0.22, 0.23 and 0.42 for
CodeContests, APPS and HumanEval, respectively.
4.1.3 RQ1.3: Structural Similarity. Structural similarity measures the similarity of code candidates based on their abstract syntax trees (ASTs). In our experiment, structural similarity is mainly measured by the tool pycode_similar with two different settings, namely United_Diff and Tree_Diff (more details in Section 3.4).
Fig. 4. RQ1.2: Distribution of syntactic similarity (LCS and LED). Lower LCS and higher LED indicate less
syntactic similarity.
For the five code candidates of each coding problem, we use the first code candidate as a reference and calculate the structural similarity between it and each of the remaining four candidates under the United_Diff and Tree_Diff settings. We also calculate the structural similarity between code candidates in pairs, with a total of 20 pairwise values. Thus, each problem has four reference-based values and 20 pairwise values for United_Diff and Tree_Diff, respectively, with each value indicating a structural similarity measure. We use the mean of the four reference-based values, the worst of them (i.e., the smallest value for United_Diff and Tree_Diff), and the mean of the pairwise values to represent each
problem’s structural similarity. Figure 5 shows the distribution of United_Diff and Tree_Diff for all
the problems in each dataset. Table 4 shows the mean, mean worst values, and pair mean values
under United_Diff and Tree_Diff settings for all the coding problems (the mean value inside each
bar in the figures) in a dataset.
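As a minimal, self-contained illustration of comparing code structure rather than text, the sketch below walks the candidates' ASTs with Python's standard library; it is not pycode_similar's United_Diff or Tree_Diff algorithm, only the basic idea behind structural comparison.

```python
# Minimal illustration of AST-based structural comparison using only the standard
# library; NOT pycode_similar's United_Diff/Tree_Diff, just the underlying idea.
import ast
from difflib import SequenceMatcher

def ast_node_sequence(code: str):
    # Keep only node types, discarding identifiers and literal values, so that
    # renaming variables does not affect the comparison.
    return [type(node).__name__ for node in ast.walk(ast.parse(code))]

def structural_similarity(code_a: str, code_b: str) -> float:
    return SequenceMatcher(None, ast_node_sequence(code_a), ast_node_sequence(code_b)).ratio()

# Two syntactically different but structurally identical snippets score 1.0.
print(structural_similarity("def f(x):\n    return x + 1", "def g(y):\n    return y + 2"))
```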
We observe that the code candidates generated from the same instruction show great similarity
in structure. Specifically, the mean values are 0.33, 0.43 and 0.60 under the United_Diff setting and
0.41, 0.54 and 0.62 under Tree_Diff setting for CodeContests, APPS and HumanEval, respectively.
Across the three datasets, we can see from Table 4 that the lowest values under United_Diff and Tree_Diff occur for the CodeContests dataset. By contrast, the largest values under the two settings both occur for HumanEval. This indicates that ChatGPT is most unstable in structure for
the code generation tasks in CodeContests, and most stable for HumanEval. We further explore the
correlation between different similarities and task features in RQ4.
Answer to RQ1.3: Code candidates generated from the same instruction show high structural similarity under the United_Diff and Tree_Diff settings. Specifically, the mean values are 0.33, 0.43 and 0.60 under the United_Diff setting, and 0.41, 0.54 and 0.62 under the Tree_Diff setting for CodeContests, APPS and HumanEval, respectively.
11 https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature
as an example, there are still 43.64% (CodeContests), 27.40% (APPS) and 18.29% (HumanEval)
of problems with no equal test output among the five code candidates. This is contrary to many
people’s belief that setting the temperature to 0 can make ChatGPT deterministic [7, 13, 39]: at temperature 0, the model applies greedy sampling, which should imply full determinism, since the logit value for the next token is a pure function of the input sequence and the model weights. The reason for such non-determinism at temperature zero is still debated [1], with hypotheses including non-deterministic floating-point and GPU calculations, and a sparse Mixture-of-Experts (MoE) architecture that fails to enforce per-sequence determinism [33, 55].
for all the non-deterministic coding tasks and their test outputs with temperature = 0 are on our
homepage [3].
When temperature = 0.5, we observe that ChatGPT tends to generate code candidates that are more deterministic than with temperature = 1 but less deterministic than with temperature = 0. This is as expected, because a higher temperature brings more creativity to ChatGPT and affects its ability to generate similar code (as can be observed from the other measurements, such as LCS and LED). Nevertheless, we observe that the test pass rates among the three different temperatures are similar, which indicates that a low temperature might be a better choice given the comparable test pass rate and the lower degree of non-determinism.
Answer to RQ2: Contrary to the widely held belief (and common practices), setting the
temperature to 0 does not guarantee determinism in code generation, although it indeed
brings more determinism than the default configuration (temperature = 1) for all three types
of similarities. We also observe that the values of test pass rate among the three different
temperatures are similar, indicating that low temperature might be a better choice for code
generation tasks.
4.3 RQ3: Non-Determinism Comparison with Top Candidates in the Same Prediction
RQ1 and RQ2 compare the similarity of 5 code candidates generated in multiple requests. Each
candidate is the top candidate in each request. However, ChatGPT can also generate 5 code candidates within the same request (the top 5 candidates ranked by their predictive probabilities). This
RQ compares the non-determinism degree of code candidates for the two request configurations
mentioned above (with temperature = 1 and temperature = 0). Table 6 shows the results for CodeContests; the results for the two other datasets are on our homepage [3]. For ease of presentation, we use R1 to refer to one-time requests and R2 to refer to multiple requests.
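The two request modes can be realised with the chat completions API's n parameter, as sketched below; the client version and model name are illustrative assumptions.

```python
# Sketch of the two request modes: R1 asks for five candidates in one request via
# the API's `n` parameter; R2 issues five separate requests with the same prompt.
# (Assumed openai 1.x client; the model name is illustrative.)
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo"

def r1_single_request(prompt: str, temperature: float = 1.0):
    completion = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        n=5,  # five candidates from one prediction
        temperature=temperature,
    )
    return [choice.message.content for choice in completion.choices]

def r2_multiple_requests(prompt: str, temperature: float = 1.0):
    return [
        client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        ).choices[0].message.content
        for _ in range(5)
    ]
```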
Our results reveal that when temperature = 1, it is difficult to tell which way of requesting is more deterministic. For semantic similarity, R1 and R2 perform similarly across the three datasets. Code candidates requested with R1 are slightly more random than those requested with R2 in terms of syntactic similarity, since those requested with R1 have lower LCS values and higher LED values. However, code candidates requested with R1 are slightly more stable than those requested with R2 in terms of structural similarity, because those requested with R1 have higher structural similarity values under both the United_Diff and Tree_Diff settings.
When the temperature is 0, the difference between the two request modes is obvious: code candidates requested with R1 show higher determinism than those requested with R2. With R1, the ratio of worst cases (problems whose max diff of test pass rate is 1) is close to 0 (1.20%), and the OER and OER (no ex.) are higher than those of R2 and close to 1. The LCS values are higher, and the LED values lower, than those under the other temperatures, which indicates higher determinism. Across the three datasets, the structural similarity values are also higher than those under the other temperatures, which means the code candidates are closer to each other in terms of their AST structure.
Answer to RQ3: The top-5 code candidates from a single request exhibit non-determinism similar to that of the 5 top-1 candidates from different requests when the temperature is 1 (ChatGPT’s default), but higher determinism when the temperature is 0.
Fig. 6. RQ4: Correlations between coding tasks and non-determinism (CodeContests, temperature = 1). Only
significant correlations are displayed on the heatmap, while the insignificant correlations (i.e., p-value >
0.05) are masked by ‘−’.
This problem exhibits high non-determinism, as indicated by its measurement results across multiple tests (i.e., the test pass rate variance is 0.13, the OER value is zero, the LCS mean value is 0.15, the mean LED value is 111.5, and both the United_Diff and Tree_Diff values are zero), suggesting rather high fluctuation. The detailed description potentially covers a wide array of scenarios, which may distract the LLM’s attention, resulting in inconsistent test results and higher non-determinism.
The second example ‘1575_M. Managing Telephone Poles’, with a description length of 1,511,
illustrates that a shorter description leads to more stability in code generation. Below is the description of this problem; we present only the core part due to the extensive length of the overall content.
The test pass rates are consistently 1.0 across all tests, with a variance of 0.0, showing no deviation in correctness among the generated code candidates. The LCS mean value is 0.74 and the LED mean value is 3.5, which indicates high syntactic stability. Structural similarity is 0.21 and 0.38 under the United_Diff and Tree_Diff settings, which shows the code candidates still vary in their ASTs. Here, the shorter
description does not introduce ambiguity but rather lets ChatGPT focus on critical details, leading
to a uniform understanding of the code problem and better generation performance.
Answer to RQ4: A coding task with a longer description and higher difficulty tends to suffer
from more non-determinism in the generated code in terms of code syntax and structure.
The generated code also tends to be more buggy.
Answer to RQ5: The non-determinism issue of GPT-4 is slightly less severe than that of GPT-3.5 under temperature = 1, while it is similar to that of GPT-3.5 under temperature = 0.
and Concise, as can also be seen from the high ratio of worst cases in both OER and OER (no ex.), at 46.06% and 54.55%. In contrast to CoT, code candidates generated from the Concise prompt are more semantically deterministic. Code candidates generated by the CoT prompt have a low mean LCS value (0.38) and a high LED value (39.31), while those generated from the Concise prompt have a mean LCS value of 0.07 and a low LED value (11.77). The other LCS and LED measurements also support this observation. When it comes to structural similarity, under the two different measurement settings, code candidates generated from the CoT prompt have significantly higher randomness than the code generated from the Concise prompt. Our experimental results show a similar situation for both APPS and HumanEval, where code generated from the Concise prompt ends up considerably more deterministic than code generated from the CoT prompt.
5 Threats to Validity
The threats to internal validity mainly lie in the implementation of our experiment and result
analysis. To reduce the first threat, we checked our code twice, once during the experiment stage and
once during the record analysis stage. To reduce the second threat, two authors independently analysed the experiment results and drew conclusions separately. When their analysis results differed, a third author discussed the discrepancy with them to determine the final result.
The threats to external validity mainly lie in the datasets, GPT versions, and prompt design in
our study. To reduce the threat in datasets, we use three diverse datasets that are widely used in
code generation tasks. Additionally, the problems in our dataset are from different contests with
different difficulties. For example, CodeContests is the most challenging dataset, while HumanEval
is the easiest, in terms of the average difficulty of coding problems. To reduce the threat in GPT
versions, we consider the two newest versions of GPT: GPT-3.5 and GPT-4, and compare their
non-determinism from multiple aspects. To reduce the threat of prompt design, we use the most
typical prompts that are the most widely used in LLM-based code generation and design an RQ to
study their influence on non-determinism.
Another primary concern highlighted in our analysis revolves around the operationalisation of
semantic, syntactic, and structural similarities into measurable metrics for assessing code similarity.
The approach of measuring semantic similarity through the comparison of test execution outputs,
while practical, presents a notable limitation. It potentially oversimplifies the multifaceted nature of
semantic similarity, which should ideally encapsulate the code’s meaning and functionality rather
than merely its output. This method risks ignoring the intricate logic and diverse correct solutions
that different pieces of code may offer. To reduce the threat in measurement tools, we consider
three types of similarities and choose at least two measurements for each type of similarity, and we
also apply statistical analysis techniques to enhance our experiment results. For the HumanEval
dataset, we evaluate our measurements on an external test set, EvalPlus [42]. The results are similar, which supports the robustness of our chosen measurements.
However, it is important to acknowledge certain limitations within our study that may affect
the breadth of its applicability and the generalisability of its findings. Firstly, our analysis does not
extend to the impact that different programming languages might have on the non-determinism
of code generation. Programming languages vary widely in syntax, semantics, and complexity,
which can influence how LLMs like ChatGPT interpret and generate code, potentially affecting
the degree of non-determinism in the output. Secondly, our work only adopts a few methods
for measuring code similarity. There is no unified standard for measuring code similarity. It is
challenging to cover all the code similarity measurements. Other methods include embedding-based similarity measures that use pre-trained code language models, such as CodeBERT [17] and GraphCodeBERT [22]. Thirdly, the influence of the prompt on non-determinism is not
fully considered. The specificity, clarity, and technical depth of prompts provided to ChatGPT
can significantly influence the model’s output, suggesting that prompts could be a crucial factor
in understanding non-determinism. Fourthly, our study focuses exclusively on ChatGPT. While
ChatGPT is a prominent LLM used for code generation, it is not the only one. The landscape of
LLMs is diverse, with models trained on different datasets, architectures, and objectives. Therefore,
our findings may not apply to other LLMs used for similar purposes.
6 Related Work
6.1 Code Generation
Code generation produces programs that need to satisfy all the constraints defined by the underlying task. Usually, the constraints are represented in various forms, e.g., input/output pairs, examples,
problem descriptions, partial programs, and assertions. Relatively early work includes deductive
synthesis approaches [19, 46] and inductive synthesis approaches [8, 57, 58, 60]. The deductive
synthesis approach operated under the assumption that a comprehensive and precise formal
specification of the user’s desired intention would be provided. However, in many instances, this
turned out to be just as intricate and challenging as creating the actual program. The inductive synthesis approach, in contrast, was based on inductive specifications such as input/output pairs and examples, as in work on Lisp programs [8, 57, 60], Pygmalion [58] and, more recently, FlashFill [20]. More information can be found in a survey [21], which covers notable work on the development
of program synthesis approaches.
In recent years, more and more researchers have applied neural networks in code generation. Yin
and Neubig [70] combine the grammar rules with the decoder and propose a syntax-driven neural
architecture to improve code generation performance. Instead of RNN, Sun et al. [61] propose a
grammar-based structural CNN to capture long dependencies in code. Wei et al. [66] propose a dual learning framework that jointly trains a code generation model and a code summarisation model to achieve better performance in both tasks. Xu et al. [68] present a user study of in-IDE code generation, demonstrating challenges such as time efficiency, correctness, and code quality, as well as developers’ willingness to use code generation tools.
12 https://openai.com/blog/openai-codex
13 https://github.com/features/copilot
version for trial by academia and industry. Because neither Codex nor GitHub Copilot is open-sourced, there have been several attempts to reproduce their performance, such as PYCODEGPT-CERT [73], CodeParrot,14 and GPT-CC.15 Encoder-decoder pre-trained models are composed of an encoder and a decoder. AlphaCode [37], which is pre-trained on GitHub repositories with 715.1 GB of
code, uses an encoder-decoder transformer architecture. It achieves on average a ranking in the
top 54% in competitions with more than 5,000 participants in simulated evaluations.
ChatGPT, a language model developed by OpenAI, has the potential to play a role in code generation. As is widely known, ChatGPT offers a chat window to enable interaction in a conversational way. In addition to its powerful capabilities for natural language processing tasks, ChatGPT inherits the code generation capabilities of Codex and can perform even better, so the OpenAI team has announced the deprecation of the Codex series models in its official documents. Several research works have examined its ability in code-related areas, including mathematical capability [18], bug-solving capability [62] and software testing [29]. ChatGPT’s ‘Regenerate response’ function demonstrates the diversity of its output, but at the same time it also raises concerns about the consistency of its output given the same input. Currently, people are amazed by its apparent performance in code generation; however, there is still no research work focused on the threat of non-determinism. Therefore, we think it is necessary to make a comprehensive evaluation of ChatGPT’s ability in code generation. More detailed information can be found on its official blog [2].
14 https://huggingface.co/codeparrot/codeparrot
15 https://github.com/CodedotAl/gpt-code-clippy
visions only, which yields a set of 76 papers. After an in-depth reading of the experimental design
and discussion in each paper, we find that only 35.5% (27/76) of the 76 papers mention non-determinism or related terms (e.g., stability, randomness and variance). Among them, 21.1% (16/76) of the papers consider non-determinism in their experimental evaluation, for example by fixing random seeds, running experiments multiple times with different fixed seeds, and reporting results with error bars or standard deviations. In addition, 14.5% (11/76) of the papers do not consider non-determinism in their experiments but discuss the threat of non-determinism in their paper.
7 Discussion
In this section, we discuss the implications, tradeoffs of non-determinism, and future research
directions for code generation with LLMs.
Looking deeper into the consistency of the errors, we find that generated code candidates are
more likely (at least 65.85%, 73.83% and 90.00% in CodeContests, APPS and HumanEval) to share the
same error type if all of them fail to pass the test cases. The most common error types they share
are IndexError (46.03% in CodeContests), IndexError (34.78% in APPS) and NameError (33.33% in
HumanEval), respectively, under temperature = 0.
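One way to perform this error-type comparison is sketched below; the traceback-parsing heuristic is an assumption, and exception names that do not end in ‘Error’ or ‘Exception’ would need extra handling.

```python
# Sketch of comparing runtime error types across failing candidates: extract the
# exception class name from each candidate's captured stderr and check agreement.
import re
from collections import Counter

ERROR_NAME = re.compile(r"^(\w+(?:Error|Exception))\b", re.MULTILINE)

def error_types(stderr_texts):
    types = []
    for text in stderr_texts:
        matches = ERROR_NAME.findall(text)
        types.append(matches[-1] if matches else None)  # last raised exception wins
    return types

def share_same_error(stderr_texts) -> bool:
    counts = Counter(t for t in error_types(stderr_texts) if t)
    return len(counts) == 1 and sum(counts.values()) == len(stderr_texts)
```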
8 Conclusion
This work studies the non-determinism threat of code generation with ChatGPT. We perform
experiments on three widely studied code generation benchmarks and find that the correctness, test
outputs, as well as syntax and structure of code candidates generated from the same instruction,
vary significantly across different requests. We hope that this article can raise awareness of the threat
of non-determinism in future code generation tasks when using LLMs.
References
[1] GitHub. Retrieved from https://152334h.github.io/blog/non-determinism-in-gpt-4/
[2] OpenAI. Retrieved from https://chat.openai.com/chat
[3] GitHub. Retrieved from https://github.com/ShuyinOuyang/LLM-is-a-box-of-chocolate
[4] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida,
Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul
Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro,
Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim
Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson,
Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark
Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai,
Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila
Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman,
Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha
Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei
Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey,
Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain,
and Shawn Jain. 2023. Gpt-4 technical report. arXiv:2303.08774. Retrieved from https://arxiv.org/pdf/2303.08774
[5] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie
Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program synthesis with large language models. arXiv:2108.07732.
Retrieved from https://arxiv.org/pdf/2108.07732
[6] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do, Y. Xu, and P. Fung.
2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity.
arXiv:2302.04023. Retrieved from https://arxiv.org/pdf/2302.04023
[7] Bhavya Bhavya, Jinjun Xiong, and Chengxiang Zhai. 2022. Analogy generation by prompting large language models:
A case study of instructgpt. arXiv:2210.04186. Retrieved from https://arxiv.org/pdf/2210.04186
[8] Alan W. Biermann. 1978. The inference of regular LISP programs from examples. IEEE Transactions on Systems, Man,
and Cybernetics 8, 8 (1978), 585–600.
[9] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proceedings of the 34th International
Conference on Neural Information Processing Systems, 1877–1901.
[10] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee,
Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023.
Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv:2303.12712. Retrieved from https:
//arxiv.org/pdf/2303.12712
[11] Subhashis Chatterjee, Deepjyoti Saha, Akhilesh Sharma, and Yogesh Verma. 2022. Reliability and optimal release time
analysis for multi up-gradation software with imperfect debugging and varied testing coverage under the effect of
random field environments. Annals of Operations Research 312, 1 (May 2022), 65–85. DOI: https://doi.org/10.1007/
s10479-021-04258-y
[12] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf,
Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser,
Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert,
Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak,
Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan
Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie
Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021.
Evaluating large language models trained on code. arXiv:2107.03374. Retrieved from https://arxiv.org/pdf/2107.03374
[13] Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2023.
Large language models are edge-case fuzzers: Testing deep learning libraries via fuzzgpt. arXiv:2304.02014. Retrieved
from https://arxiv.org/pdf/2304.02014
[14] Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. arXiv:1601.01280. Retrieved from
https://arxiv.org/pdf/1601.01280
[15] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large
language models for software engineering: Survey and open problems. In Proceedings of the 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, 31–53.
[16] Yunhe Feng, Sreecharan Vanam, Manasa Cherukupally, Weijian Zheng, Meikang Qiu, and Haihua Chen. 2023.
Investigating code generation performance of ChatGPT with crowdsourcing social data. In Proceedings of the IEEE
47th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 876–885.
[17] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting
Liu, Daxin Jiang, and Ming Zhou. 2020. Codebert: A pre-trained model for programming and natural languages.
arXiv:2002.08155. Retrieved from https://arxiv.org/pdf/2002.08155
[18] Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and
Julius Berner. 2024. Mathematical capabilities of chatgpt. Advances in Neural Information Processing Systems 36 (2024).
[19] Cordell Green. 1981. Application of theorem proving to problem solving. In Readings in Artificial Intelligence. Elsevier,
202–222.
[20] Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. ACM Sigplan
Notices 46, 1 (2011), 317–330.
[21] Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. 2017. Program synthesis. Foundations and Trends® in Pro-
gramming Languages 4, 1–2 (2017), 1–119.
[22] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy,
Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang,
and Ming Zhou. 2020. Graphcodebert: Pre-training code representations with data flow. arXiv:2009.08366. Retrieved
from https://arxiv.org/pdf/2009.08366
[23] Qi Guo, Junming Cao, Xiaofei Xie, Shangqing Liu, Xiaohong Li, Bihuan Chen, and Xin Peng. 2024. Exploring
the potential of chatgpt in automated code refinement: An empirical study. In Proceedings of the 46th IEEE/ACM
International Conference on Software Engineering, 1–13.
[24] Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S. Liang. 2018. A retrieve-and-edit framework for
predicting structured outputs. In Proceedings of the 32nd International Conference on Neural Information Processing
Systems.
[25] Hossein Hassani and Emmanuel Sirmal Silva. 2023. The role of ChatGPT in data science: How AI-assisted conversational
interfaces are revolutionizing the field. Big Data and Cognitive Computing 7, 2 (2023), 62.
[26] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir
Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring coding challenge competence with apps.
arXiv:2105.09938. Retrieved from https://arxiv.org/pdf/2105.09938
[27] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2020. Aligning
ai with shared human values. arXiv:2008.02275. Retrieved from https://arxiv.org/pdf/2008.02275
[28] Jeevana Priya Inala, Chenglong Wang, Mei Yang, Andres Codas, Mark Encarnación, Shuvendu Lahiri, Madanlal
Musuvathi, and Jianfeng Gao. 2022. Fault-aware neural code rankers. In Proceedings of the 36th International Conference
on Neural Information Processing Systems, 13419–13432.
[29] Sajed Jalil, Suzzana Rafi, Thomas D. LaToza, Kevin Moran, and Wing Lam. 2023. Chatgpt and software testing education:
Promises & perils. In Proceedings of the IEEE International Conference on Software Testing, Verification and Validation
Workshops (ICSTW). IEEE, 4130–4137.
[30] Andrej Kiviriga. 2023. Efficient Model Checking: The Power of Randomness. Aalborg Universitetsforlag.
[31] Kalpesh Krishna, Yapei Chang, John Wieting, and Mohit Iyyer. 2022. Rankgen: Improving text generation with large
ranking models. arXiv:2205.09726. Retrieved from https://arxiv.org/pdf/2205.09726
[32] Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S. Liang. 2019. SPoC:
Search-based pseudocode to code. In Proceedings of the 33rd International Conference on Neural Information Processing
Systems.
[33] Emanuele La Malfa, Aleksandar Petrov, Simon Frieder, Christoph Weinhuber, Ryan Burnell, Anthony G. Cohn, Nigel
Shadbolt, and Michael Wooldridge. 2023. The ARRT of language-models-as-a-service: Overview of a new paradigm
and its challenges. arXiv:2309.16573. Retrieved from https://arxiv.org/pdf/2309.16573
[34] Mina Lee, Percy Liang, and Qian Yang. 2022. Coauthor: Designing a human-ai collaborative writing dataset for
exploring language model capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing
Systems, 1–19.
[35] Jia Li, Ge Li, Zhuo Li, Zhi Jin, Xing Hu, Kechi Zhang, and Zhiyi Fu. 2023. Codeeditor: Learning to edit source code
with pre-trained models. ACM Transactions on Software Engineering and Methodology 32, 6 (2023), 1–22.
[36] Jia Li, Yongmin Li, Ge Li, Zhi Jin, Yiyang Hao, and Xing Hu. 2023a. Skcoder: A sketch-based approach for automatic
code generation. In Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE,
2124–2135.
[37] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling,
Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun
Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme
Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-level
code generation with alphacode. Science 378, 6624 (2022), 1092–1097.
[38] Zongjie Li, Chaozheng Wang, Zhibo Liu, Haoxuan Wang, Dong Chen, Shuai Wang, and Cuiyun Gao. 2023. Cctest:
Testing and repairing code completion systems. In Proceedings of the IEEE/ACM 45th International Conference on
Software Engineering (ICSE). IEEE, 1238–1250.
[39] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. 2023.
Code as policies: Language model programs for embodied control. In Proceedings of the IEEE International Conference
on Robotics and Automation (ICRA). IEEE, 9493–9500.
[40] Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočisky, Andrew Senior, Fumin Wang, and Phil
Blunsom. 2016. Latent predictor networks for code generation. arXiv:1603.06744. Retrieved from https://arxiv.org/
pdf/1603.06744
[41] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by chatgpt really
correct? Rigorous evaluation of large language models for code generation. arXiv:2305.01210.
[42] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2024. Is your code generated by chatgpt really
correct? Rigorous evaluation of large language models for code generation. In Proceedings of the 37th International
Conference on Neural Information Processing Systems.
[43] Yiheng Liu, Tianle Han, Siyuan Ma, Jiayue Zhang, Yuanyuan Yang, Jiaming Tian, Hao He, Antong Li, Mengshen He, Zhengliang Liu, Zihao Wu, Lin Zhao, Dajiang Zhu, Xiang Li, Ning Qiang, Dinggang Shen, Tianming Liu, and Bao Ge. 2023. Summary of ChatGPT/GPT-4 research and perspective towards the future of large language models. arXiv:2304.01852. Retrieved from https://arxiv.org/pdf/2304.01852
[44] Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D. Le, and David Lo. 2024. Refining ChatGPT-generated code: Characterizing and mitigating code quality issues. ACM Transactions on Software Engineering and Methodology 33, 5 (Jun 2024), 1–26. DOI: https://doi.org/10.1145/3643674
[45] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain,
Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan
Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv:2102.04664. Retrieved from https://arxiv.org/pdf/2102.04664
[46] Zohar Manna and Richard J. Waldinger. 1971. Toward automatic program synthesis. Communications of the ACM 14,
3 (1971), 151–165.
[47] Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi, Matteo Ciniselli, Simone Scalabrino, Rocco Oliveto, and Gabriele Bavota. 2023. On the robustness of code generation techniques: An empirical study on GitHub Copilot. In Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2149–2160.
[48] Patrick E. McKight and Julius Najab. 2010. Kruskal-Wallis test. In The Corsini Encyclopedia of Psychology. John Wiley
& Sons, Inc., 1–1.
[49] Patrick E. McKnight and Julius Najab. 2010. Mann-Whitney U test. In The Corsini Encyclopedia of Psychology. John
Wiley & Sons, Inc., 1–1.
[50] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. 2023. DetectGPT: Zero-shot machine-generated text detection using probability curvature. In Proceedings of the International Conference on Machine Learning. PMLR, 24950–24962.
[51] Prabhat Nagarajan, Garrett Warnell, and Peter Stone. 2018. Deterministic implementations for reproducibility in deep
reinforcement learning. arXiv:1809.05676. Retrieved from https://arxiv.org/pdf/1809.05676
[52] OpenAI. 2023. GPT-4 technical report. arXiv:2303.08774. Retrieved from https://arxiv.org/pdf/2303.08774
[53] Hung Viet Pham, Shangshu Qian, Jiannan Wang, Thibaud Lutellier, Jonathan Rosenthal, Lin Tan, Yaoliang Yu, and
Nachiappan Nagappan. 2020. Problems and opportunities in training deep learning software systems: An analysis of
variance. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 771–783.
[54] Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani.
2022. Synchromesh: Reliable code generation from pre-trained language models. arXiv:2201.11227. Retrieved from
https://arxiv.org/pdf/2201.11227
[55] Joan Puigcerver, Carlos Riquelme, Basil Mustafa, and Neil Houlsby. 2023. From sparse to soft mixtures of experts.
arXiv:2308.00951. Retrieved from https://arxiv.org/pdf/2308.00951
[56] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by
Generative Pre-Training. Technical Report, OpenAI.
[57] David E. Shaw, William R. Swartout, and C. Cordell Green. 1975. Inferring LISP programs from examples. In Proceedings of the 4th International Joint Conference on Artificial Intelligence (IJCAI), Vol. 75, 260–267.
[58] David Canfield Smith. 1975. Pygmalion: A Creative Programming Environment. Stanford University.
[59] Ioana Baldini Soares, Dennis Wei, Karthikeyan Natesan Ramamurthy, Moninder Singh, and Mikhail Yurochkin. 2022.
Your fairness may vary: Pretrained language model fairness in toxic text classification. In Proceedings of the Annual
Meeting of the Association for Computational Linguistics.
[60] Phillip D. Summers. 1977. A methodology for LISP program construction from examples. Journal of the ACM 24, 1
(1977), 161–175.
[61] Zeyu Sun, Qihao Zhu, Lili Mou, Yingfei Xiong, Ge Li, and Lu Zhang. 2019. A grammar-based structural CNN decoder for code generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 7055–7062.
[62] Nigar M. Shafiq Surameery and Mohammed Y. Shakor. 2023. Use Chat GPT to solve programming bugs. International Journal of Information Technology & Computer Engineering 3, 1 (2023), 17–22.
[63] Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode Compose: Code generation using transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1433–1443.
[64] Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman. 2022. Expectation vs. experience: Evaluating the usability
of code generation tools powered by large language models. In Proceedings of the CHI Conference on Human Factors in
Computing Systems Extended Abstracts, 1–7.
[65] Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering 50, 4 (2024), 911–936. DOI: https://doi.org/10.1109/TSE.2024.3368208
[66] Bolin Wei, Ge Li, Xin Xia, Zhiyi Fu, and Zhi Jin. 2019. Code generation as a dual task of code summarization. In
Proceedings of the 33rd International Conference on Neural Information Processing Systems.
[67] Xiongfei Wu, Liangyu Qin, Bing Yu, Xiaofei Xie, Lei Ma, Yinxing Xue, Yang Liu, and Jianjun Zhao. 2020. How are
deep learning models similar? An empirical study on clone analysis of deep learning software. In Proceedings of the
28th International Conference on Program Comprehension, 172–183.
[68] Frank F. Xu, Bogdan Vasilescu, and Graham Neubig. 2022. In-IDE code generation from natural language: Promise and challenges. ACM Transactions on Software Engineering and Methodology 31, 2 (2022), 1–47.
[69] Burak Yetiştiren, Işık Özsoy, Miray Ayerdem, and Eray Tüzün. 2023. Evaluating the code quality of AI-assisted code generation tools: An empirical study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. arXiv:2304.10778. Retrieved from https://arxiv.org/pdf/2304.10778
[70] Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation.
arXiv:1704.01696. Retrieved from https://arxiv.org/pdf/1704.01696
[71] Pengcheng Yin and Graham Neubig. 2018. TranX: A transition-based neural abstract syntax parser for semantic parsing and code generation. arXiv:1810.02720. Retrieved from https://arxiv.org/pdf/1810.02720
[72] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao Xie. 2024. CoderEval: A benchmark of pragmatic code generation with generative pre-trained models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, 1–12.
[73] Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022. CERT: Continual pre-training on sketches for library-oriented code generation. arXiv:2206.06888. Retrieved from https://arxiv.org/pdf/2206.06888