
Evaluating Large Language Models on Academic Literature Understanding and Review: An Empirical Study among Early-stage Scholars

Jiyao Wang∗
The Hong Kong University of Science and Technology (Guangzhou)
Guangzhou, China
jwanggo@connect.ust.hk

Haolong Hu∗
The Hong Kong University of Science and Technology (Guangzhou)
Guangzhou, China
hhu574@connect.hkust-gz.edu.cn

Zuyuan Wang
The Hong Kong University of Science and Technology (Guangzhou)
Guangzhou, China
zwang534@connect.hkust-gz.edu.cn

Yan Song
The Hong Kong University of Science and Technology (Guangzhou)
Guangzhou, China
syan931@connect.hkust-gz.edu.cn

Youyu Sheng
The Hong Kong University of Science and Technology (Guangzhou)
Guangzhou, China
ysheng330@connect.hkust-gz.edu.cn

Dengbo He†
The Hong Kong University of Science and Technology (Guangzhou)
Guangzhou, China
dengbohe@ust.hk

ABSTRACT

The rapid advancement of large language models (LLMs) such as ChatGPT makes LLM-based academic tools possible. However, little research has empirically evaluated how scholars perform different types of academic tasks with LLMs. Through an empirical study followed by a semi-structured interview, we assessed 48 early-stage scholars' performance in conducting core academic activities (i.e., paper reading and literature reviews) under different levels of time pressure. Before conducting the tasks, participants received different training programs regarding the limitations and capabilities of the LLMs. After completing the tasks, participants completed an interview. Quantitative data regarding the influence of time pressure, task type, and training program on participants' performance in academic tasks was analyzed. Semi-structured interviews provided additional information on the influential factors of task performance, participants' perceptions of LLMs, and concerns about integrating LLMs into academic workflows. The findings can guide more appropriate usage and design of LLM-based tools in assisting academic work.

CCS CONCEPTS

• Human-centered computing → Empirical studies in HCI; • Computing methodologies → Natural language processing.

KEYWORDS

large language model, academic tasks, user perception, human-AI collaboration

ACM Reference Format:
Jiyao Wang, Haolong Hu, Zuyuan Wang, Yan Song, Youyu Sheng, and Dengbo He. 2024. Evaluating Large Language Models on Academic Literature Understanding and Review: An Empirical Study among Early-stage Scholars. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11–16, 2024, Honolulu, HI, USA. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3613904.3641917

∗ Both authors contributed equally to this research.
† Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CHI '24, May 11–16, 2024, Honolulu, HI, USA
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0330-0/24/05
https://doi.org/10.1145/3613904.3641917

1 INTRODUCTION

The rapid advancement of artificial intelligence (AI) and natural language processing (NLP) has led to the development of sophisticated large language models (LLMs), such as ChatGPT (https://openai.com/chatgpt), GPT-4 (https://openai.com/gpt-4), and Claude (https://www.claude.co.id/). These models have demonstrated impressive capabilities in generating human-like text, understanding context, and solving complex language tasks. The application scope of LLM technology is extensive, and relevant scholars have been actively analyzing the impact of such technology on fields such as healthcare [1, 39], education [35, 36], and creative writing [26] (e.g., helping journalists extract ideas from documents [56], enabling scholars to communicate findings mutually [25], and assisting writers in exploring more ways of writing stories [12]).

In recent years, LLMs have been employed for various academic tasks [33], such as literature reviews, paper reading, writing polishing, etc. Different from other areas where LLMs are applicable, academic work requires extensive training in acquiring, judging, and synthesizing relevant information [46]. Moreover, the academic community also demands high standards for logical coherence, information accuracy, and idea novelty [6], and thus requires more responsible AI tools compared to other domains. However, the application of LLMs in academic contexts remains under-investigated.

Further, as emerging conversational systems, the promotion of LLMs results in more diverse user behaviors as well as new social norms and user expectations [7]. Hence, it is imperative to evaluate the capacity boundary of LLMs in academic settings through user studies, so that LLMs can be better designed to be integrated into academic workflows and, subsequently, contribute to academic research. A few studies have evaluated the effectiveness of LLMs in assisting selected academic tasks. For example, Gordijn and Have [27] argued that the capacity of ChatGPT to develop a whole scientific paper is restricted. LLMs have also been found to alleviate some time pressure by automating certain processes during academic tasks [14]. However, academic tasks are diverse, and different tasks may require completely different cognitive resources. For instance, compared to extracting key information from a paper (in which the source of the information is known, but the information is unknown), literature reviews require locating and summarizing information from a wider range of studies (in which the targeted information is partially known, but the source of the information is unknown). Some research [2, 16, 52] has also pointed out that LLMs may introduce inaccuracies and biases in academic tasks, especially in understanding and summarizing the content of the literature, which has been flagged as a priority concern [16]. Furthermore, task complexity can also be moderated by time pressure [44], which is prevalent in academia [11]. Thus, a more comprehensive investigation is needed to better understand the role of LLMs in academic tasks of different complexity, as moderated by task type and time pressure.

On the other hand, given that LLMs can be regarded as a special type of automation that can help gather, analyze, and summarize information, users' perceptions of the LLM may also influence task performance. Although a few empirical studies discussed the implications and limitations of LLMs when they are used for specific academic tasks (e.g., literature review [3], idea generation [22, 48]), no research has discussed the different strategies young scholars may take when using LLMs for different tasks, nor compared how task difficulties (e.g., as moderated by time pressure) and training may influence users' performance, although these factors have been widely acknowledged as influential factors of users' reliance on automation [32].

Thus, using a mixed-methods approach combining an experimental study with semi-structured interviews, this study aims to investigate:

• When using LLMs, whether and why there are discrepancies in young academic users' performance in conducting different academic tasks, as defined by time pressure and required cognitive resources.
• How young academic users' perceptions of LLM limitations affect their performance or strategies when using LLMs for academic tasks.
• What young academic users' expectations of LLMs and LLM training are when LLMs are used for different academic tasks.

Given that the younger generation has a higher acceptance of emerging technologies [9] and may lack experience in conducting academic tasks, this study targeted young scholars, specifically graduate students who have just started their academic careers as researchers. This decision was based on the fact that LLMs are new to most scholars and, based on research in other domains, new users may have highly uncertain and potentially inappropriate strategies when they first start to use LLMs [30]. Thus, understanding the strategies new users adopt can help support young scholars in better using LLM tools, or at least shorten their familiarization process by providing new users with appropriate training materials. In the study, an onsite experiment was conducted, followed by a semi-structured interview regarding the usage of LLMs in the experiment and in daily life. Together, the empirical and interview data offered a nuanced perspective on the opportunities and challenges of using LLMs for academic tasks.

2 RELATED WORKS

2.1 Natural Language Technologies

In recent years, a remarkable evolution has been happening in the field of Computational Linguistics, also known as Natural Language Processing (NLP), primarily driven by the development of neural network models trained on vast datasets [45, 68]. Compared to traditional rule-based systems, recent data-driven models have shown remarkable results across various NLP tasks [19, 53]. Deep learning techniques have become mainstream in developing these NLP models [40]. Current popular architectures include Long Short-Term Memory (LSTM) [67] and transformer models [63].

A significant paradigm shift in natural language technologies occurred over the past half-decade, primarily attributed to the advent of large language models (LLMs) [17]. These techniques involve an initial training phase on a comprehensive dataset, followed by fine-tuning for specific tasks. Pre-trained models like BERT [15], BART [42], XLNet [66], and LLaMA [62] have demonstrated substantial performance improvements across a variety of NLP tasks.

However, the challenges of smaller models persist in LLMs. For instance, LLMs still lack an explicit factual model, which makes them prone to producing inaccurate information [34]. Even innocuous prompts can lead to the generation of toxic content from these models [24, 55]. Their performance varies, excelling in some areas while faltering in others [25]. Guiding these models to deliver specific outputs remains a challenge, leading to the emergence of prompt engineering as a sub-field [4, 45]. Ethical concerns surrounding these models are wide-ranging, from environmental to socio-political considerations [5].

Our research acknowledges the limitations of current LLMs and assumes that they cannot fully replace humans in creative writing tasks. However, they can significantly aid academic writing across various contexts to a certain extent. This perspective motivates our exploration of users' concerns when using LLMs as a peer-level writing tool.

2.2 Large Language Models in Academic Tasks

Large Language Models (LLMs), exemplified by ChatGPT, harness broad internet-based datasets to mimic human language patterns and create realistic text [57]. This capability has attracted interest across academia. For instance, the broader implications of AI in academic research have been scrutinized by Grimaldi and Ehrler [28] and Hutson [33]. These tasks include the compilation of essential

components of the manuscript, such as the abstract, introduction, literature review, methodology, results, discussion, and conclusions.

Scholars, researchers, and students in the academic community have utilized LLMs like ChatGPT for a variety of academic and non-academic tasks. Dowling and Lucey [18] explored the application of ChatGPT and found it to be particularly effective for initial idea generation, literature synthesis, and creating testing frameworks. Yet, according to Gordijn et al. [27], ChatGPT still fails to produce a complete scientific article on par with a skilled researcher. However, it is expected that the capabilities and uses of these tools will continue to grow, making them capable of conducting more academic tasks, including experiment design, manuscript writing, peer reviews, and editorial decision support [16]. Additionally, the ability of ChatGPT to generate and understand texts in multiple languages is believed to improve the efficiency of publishing and accessing literature for non-native English speakers [43]. In general, scientists in many fields are positive about the potential of using LLMs in academic tasks [52].

However, the performance of LLMs in academic tasks is still less than ideal. For example, Aydın and Karaarslan [3] pointed out that when using ChatGPT for a literature review in healthcare, the content generated by the LLM still lacks synthesis and may suffer from plagiarism. In another study, Gao et al. [23] reported that abstracts generated by LLMs can still be identified as AI-generated using an AI output detector. Particularly, through multiple experimental trials, van Dis et al. [16] reminded researchers to pay extra attention and remain vigilant when applying LLMs to literature comprehension and summarization tasks. However, to the best of our knowledge, no empirical research has been conducted to understand how scholars use LLMs and how LLMs can influence scholars' performance in academic tasks. Given that the LLM can still only work as a collaborator, it is necessary to consider the characteristics of the user-LLM combined system instead of the LLM alone. Furthermore, most of the existing research focused on the attitudes and opinions of senior researchers on the use of LLMs [2, 52]. Little research has focused on younger scholars, who may have a higher propensity to accept new technologies and may lack the necessary expert knowledge to supervise the application of LLMs in academic tasks [16].

3 METHODOLOGY

We adopted a mixed approach consisting of an empirical experiment and a semi-structured interview. Quantitative performance data was gathered to evaluate the performance and the strategies participants adopted for different academic tasks. Post-experiment interviews focused on researchers' evaluations of current LLM limitations in academia, their subjective understanding of the factors influencing their performance across tasks, and their concerns about integrating LLMs into their workflows.

3.1 Participants

In the scope of this study, a diverse (in terms of academic background) pool of 48 young participants (age <= 30; 30 males and 18 females) was selected, all from research institutions or universities in China where English is the principal language of instruction. Recruited through online posters on social networks and on-campus posters, all participants were native Mandarin speakers. Each participant was given a unique experiment ID number from P1 to P48. Table 1 provides a comprehensive overview of the academic profiles of the participants. All participants were actively involved in academia, including 22 Ph.D. students, 17 MPhil students, and 9 research assistants (with a minimum of a bachelor's degree). An examination of their academic publication history reveals that 35 participants had 1 to 3 publications (including journal articles, conference proceedings, and edited books); 3 participants had 4-6 publications; while 10 participants were still striving for their first publication. Moreover, according to their self-reported current research topics, we classified them into three types, i.e., AI-related, Other STEM (science, technology, engineering, and mathematics), and Social Science & Business. We tried to balance the distribution of participant backgrounds within each experiment condition.

Notably, the study focused on participants with limited exposure to LLMs in their academic career, specifically those who "sometimes used LLMs for academic purposes" or less. This criterion was adopted given that frequent users of LLM tools may have developed their own strategies for using the LLM, which can hardly be controlled in the experiment. More importantly, as illustrated in previous human-automation interaction research [31, 32], new users may encounter performance and trust degradation when using unfamiliar LLMs. Thus, focusing on this group of users can help optimize the design of LLMs to better support them.

Table 1: Background Statistics of Each Group of Participants.

Background          Type                       Group 1  Group 2
Academic Position   Ph.D. student              11       11
                    MPhil student              8        9
                    Research assistant         5        4
Publication Number  0                          3        7
                    1-3                        19       16
                    4-6                        2        1
Gender              Male                       15       15
                    Female                     9        9
Usage Experience    Never used                 3        3
                    Rarely used                9        9
                    Sometimes used             12       12
Research Interest   AI-related                 3        3
                    Other STEM                 13       12
                    Social Science & Business  8        9

Notes: In this table and the following tables, Group 1 refers to the group of participants who received additional training on LLM limitations, while participants in Group 2 only received basic training on how to use the LLM. In our recruitment questionnaire, the options for LLM tool usage experience were: Never used; Rarely used; Occasionally used, but not frequent; Sometimes used - about half the time.

3.2 Tasks

Two types of academic tasks (i.e., literature review (LR) and paper understanding (PU)) were used in the study, given that they require different levels of skills and are the types of tasks for which LLM users need to pay the most attention [16]. The literature review

task requires the users to search and identify relevant information when the sources of the information are unknown but the targeted information is partially known; in the paper understanding task, the source of the information is known, but the information is unknown. In addition, given that task complexity can moderate the relationships between users' trust in and reliance on the system [54], and that time pressure is common in academia [11], we set two levels of time constraints for the tasks (i.e., 10 minutes and 20 minutes) to construct comparable pairs within the same task type. These two time limits were set based on users' feedback in pilot tests. Note that, given that we aim to understand how young scholars use LLMs for academic tasks, the comparisons would be unfair if the selected topics were within a research domain the participants are familiar with. Thus, we chose to provide experimental materials (i.e., scientific papers and review topics) from a field that no participants were from or familiar with, so that all participants were at a similar level of familiarity with the materials. We ended up choosing topics from human factors in transportation, because this field is minor in our target universities and the experimenters are all familiar with it. This decision was also based on the belief that the human factors domain has long been regarded as 'common sense' [61]. While this assumption may not be entirely accurate, it suggests that research in this area should be relatively comprehensible for non-experts. The task type and time constraints led to four task conditions as follows:

• Paper Understanding-More Time (PU-MT): Given a published scientific paper, answer five questions related to the paper we provided within 20 minutes.
• Paper Understanding-Limited Time (PU-LT): Given another published scientific paper, answer the other five questions related to the paper we provided within 10 minutes.
• Literature Review-More Time (LR-MT): Given a topic, complete a literature review of approximately 500 words on the topic within 20 minutes.
• Literature Review-Limited Time (LR-LT): Given another topic, complete a literature review of approximately 500 words on the topic within 10 minutes.

To control the level of difficulty within each experimental condition, for the PU task we used five similar questions for the two target papers, and the two papers were of the same length (5 pages) and from the same academic conference in the same year; for the LR task, we chose topics from the same field, and a preliminary search in Google Scholar showed that the two topics yield a similar number of publications in recent years.

ChatGPT (GPT-3.5; https://openai.com/chatgpt), a popular LLM tool that leverages advanced language technology, was selected for the experiment. To maintain fairness, we restricted the use of other LLM tools, allowing only the official ChatGPT interface. While it is challenging to determine the popularity of such tools, ChatGPT appeared to be the most widely recognized at the time of the study. Participants were free to use ChatGPT for the tasks when they felt it necessary. We established a virtual machine (VM) on Microsoft Azure for users to access ChatGPT. The VM also featured pre-installed Google Chrome (https://www.google.cn/chrome/index.html) and Microsoft Office packages. Mimicking real-world situations, participants were permitted to use Chrome and ChatGPT during tasks voluntarily. All experiments were conducted in the same meeting room with minimal external interference.

[Figure 1: Coding framework and themes.]

3.3 Experiment Design and Procedures

In addition to the four within-subject experimental conditions, we provided participants with or without training materials regarding the limitations and potential errors of the LLMs, on top of the basic training for using LLM tools (e.g., how to open the interface and what LLMs can do in general), as the between-subjects factor. Thus, participants were divided into two between-subjects study groups, one of which (Group 1) was informed about potential errors and limitations of the selected ChatGPT in a pre-experiment video (provided in the supplementary materials), while the other group (Group 2) was not.

Upon arrival, participants' informed consent was obtained. Subsequently, participants received basic training (around 10 minutes) on how to use the LLM. Half of the participants received an additional pre-experiment training video regarding the LLM limitations. Participants were then asked to complete four academic tasks (i.e., PU-MT, PU-LT, LR-MT, LR-LT) on the same laptop. To eliminate learning and fatigue effects from task execution order, we counterbalanced the four experimental conditions. Throughout the experiment, the experimenter strictly adhered to a non-intrusive approach, refraining from interrupting the participants unless they sought assistance unrelated to the ongoing tasks. Participants were allowed a maximum 5-minute break between two tasks. Following the experiment, we conducted semi-structured interviews. The questions used in the semi-structured interview can be found in Appendix C.

The entire experiment lasted approximately two hours, and participants received 120 Chinese Yuan as compensation. The Hong Kong University of Science and Technology's Human and Artefacts Research Ethics Committee approved this study (protocol number: HREP-2023-0159).

3.4 Analysis and Coding Methods

3.4.1 Quantitative Analysis. Two experts from the human factors field (senior Ph.D. students who had authored at least one peer-reviewed publication in the field) were invited to evaluate the answers participants generated. The two raters followed the same scoring standard that was decided before they started the evaluation and

Table 2: Individual and Group Performance Statistics of Paper Understanding.

PU-MT PU-LT
Group 1 Group 2 Group 1 Group 2
LLM Usage Grade:Time LLM Usage Grade:Time LLM Usage Grade:Time LLM Usage Grade:Time
y 90:100 n 70:100 y 70:100 y 15:100
y 75:80 y 65:100 y 60:100 y 5:100
n 70:100 y 55:100 n 75:100 y 95:100
y 55:100 y 75:80 y 65:100 y 65:100
y 85:85 y 75:100 y 65:100 y 85:90
y 65:90 n 60:100 y 80:100 y 60:100
y 80:100 y 75:100 y 90:100 y 45:100
y 80:100 y 85:95 y 90:100 y 80:50
y 80:85 n 75:100 y 70:100 n 50:100
y 90:95 y 90:80 y 70:100 y 85:100
y 80:80 y 65:100 y 65:90 y 45:100
y 75:100 y 55:90 y 55:100 y 60:100
n 35:100 y 20:100 y 55:100 y 90:100
y 75:75 y 85:90 y 80:100 y 85:100
y 50:100 y 70:70 y 40:100 y 50:100
y 55:100 y 65:100 n 60:100 y 50:100
y 30:100 y 45:100 y 15:100 n 65:100
y 75:100 n 50:80 y 70:100 y 25:100
y 85:80 y 65:90 y 30:100 n 75:80
n 35:100 y 85:100 n 35:100 y 80:100
y 50:100 n 75:100 y 40:100 n 55:100
y 90:100 n 55:100 y 70:100 n 75:100
y 65:100 y 20:100 y 65:100 y 25:100
y 50:100 n 45:100 n 25:100 y 20:100
n/y=3/21 average=67.5:94.6 n/y=7/17 average=63.5:94.8 n/y=4/20 average=60.0:99.6 n/y=5/19 average=57.7:94.8
Notes: The column 'LLM Usage' indicates whether participants used the LLM tool to assist in completing the task ('y' = used, 'n' = not used). Time is reported as the percentage of the allowed task time that the participant used (see Section 3.4.1).
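The summary row of Table 2 (the n/y adoption counts and the average Grade:Time) can be reproduced mechanically from the per-participant cells. The sketch below is illustrative only (the `summarize` helper is ours, not the authors'), applied to the first four Group 1 PU-MT cells:

```python
def summarize(cells):
    """cells: list of (llm_used, "grade:time") tuples, as in Tables 2-3.
    Returns the n/y adoption counts, average grade, and average
    percentage of allowed time used."""
    n_no = sum(1 for used, _ in cells if used == "n")
    n_yes = sum(1 for used, _ in cells if used == "y")
    grades = [int(gt.split(":")[0]) for _, gt in cells]
    times = [int(gt.split(":")[1]) for _, gt in cells]
    return {"n/y": (n_no, n_yes),
            "avg_grade": sum(grades) / len(grades),
            "avg_time": sum(times) / len(times)}

# First four Group 1 PU-MT cells of Table 2
sample = [("y", "90:100"), ("y", "75:80"), ("n", "70:100"), ("y", "55:100")]
print(summarize(sample))  # {'n/y': (1, 3), 'avg_grade': 72.5, 'avg_time': 95.0}
```

Applying the same computation to all 24 cells of a group reproduces the table's last row (e.g., 67.5 and 94.6 for Group 1 under PU-MT).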

conducted the evaluation independently. An Intraclass Correlation Coefficient (ICC) analysis was conducted, and the two raters reached an ICC of 0.94 (95% CI = [0.91, 0.97], p < .0001), which indicates high consistency and inter-rater reliability of the grades (i.e., from 0 to 100). The guiding principles employed for scoring, as well as detailed experimental materials, can be found in Appendix B. Other metrics of task performance in the empirical experiment include task completion Time (%) (i.e., the actual completion time divided by the time allowed for the current task) and the LLM tool adoption rate (i.e., the number of participants in each group who used the LLM during that task divided by the total number of participants in the group). The criterion for LLM tool usage was whether participants had fully accepted responses generated by the LLM in their answers while fulfilling each task.

For the quantitative analysis, in order to quantify the combined effects of participants' backgrounds and the controlled experiment conditions, regression analyses were performed in "SAS OnDemand for Academics". Mixed linear regression models (using Proc MIXED) were built for the two continuous dependent variables (Time and Grade), which included all demographic factors, the three experimental conditions, and their two-way interactions as independent variables. Repeated measures were accounted for through a generalized estimating equation, which can model multiple responses from a single subject. Backward stepwise selection procedures were employed based on model fitting criteria, and the Variance Inflation Factor (VIF) was used to mitigate the issue of multicollinearity. To examine the significance of variables within each sub-structure, Tukey-Kramer post-hoc tests [38] were conducted. Variables demonstrating a significance level of p < .05 were considered statistically significant in the analyses.

3.4.2 Qualitative Analysis. As for the answers in the semi-structured interview, Figure 1 illustrates the coding framework and its corresponding themes. We transcribed the interviews of 48 participants using automated transcription software (https://www.feishu.cn/product/minutes), followed by content calibration to ensure alignment between the original audio and the transcribed text. Our approach blends the strengths of qualitative and quantitative analysis to investigate textual content. This dual approach not only facilitates more robust inferences but also opens avenues for additional reflection, hypothesis refinement, and further investigation [47].

To gain a deeper understanding of the interview content, two researchers (co-authors of this paper) identified several topics of interest based on the research questions and interview outline, including training, academic task types, pressure, concerns, and individual differences. They independently read all the interview texts and extracted segments related to these topics. At the same time, they performed open coding (i.e., taking apart the information collected, assigning concepts, and then reassembling it in new ways) and

Table 3: Individual and Group Performance Statistics of Literature Review.

LR-MT LR-LT
Group 1 Group 2 Group 1 Group 2
LLM Usage Grade:Time LLM Usage Grade:Time LLM Usage Grade:Time LLM Usage Grade:Time
y 50:85 y 46:90 y 65:70 y 26:90
y 58:85 y 46:100 y 33:100 y 40:100
y 41:100 y 49:60 y 26:100 y 60:100
y 46:100 y 59:55 y 17:100 y 34:100
y 58:75 y 50:75 y 50:100 y 18:100
y 66:100 y 55:90 y 46:100 y 41:80
y 52:95 y 42:70 y 31:100 y 27:100
y 41:100 y 38:100 y 36:100 y 50:60
y 31:100 y 32:55 y 33:100 y 47:40
y 26:100 y 50:100 y 11:100 y 32:100
y 41:65 y 48:100 y 41:70 y 34:100
y 53:85 y 42:75 y 49:50 y 45:100
y 12:100 y 42:75 y 23:100 y 16:100
y 43:100 y 50:100 y 29:100 y 47:100
y 15:85 y 55:80 y 12:100 y 38:100
y 30:100 y 26:100 y 33:100 y 30:100
y 12:100 y 39:100 y 3:100 y 16:100
y 83:100 y 28:100 y 80:100 y 44:100
y 41:75 y 17:100 y 51:100 y 16:100
y 17:100 y 40:100 y 19:100 y 3:100
y 64:100 y 37:100 y 63:100 y 45:100
y 39:100 y 37:100 y 49:100 y 45:100
y 69:60 y 10:100 y 67:100 y 23:100
y 13:100 y 33:80 y 14:100 y 33:100
n/y=0/24 average=41.7:92.1 n/y=0/24 average=40.5:87.7 n/y=0/24 average=36.7:95.4 n/y=0/24 average=33.8:94.6
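The inter-rater reliability reported in Section 3.4.1 (ICC = 0.94) can be checked on the two raters' grade matrices. The paper does not state which ICC form was used; the sketch below implements one common choice, ICC(2,1) (two-way random effects, absolute agreement, single rater), on a hypothetical pair of raters:

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    ratings: (n_subjects, k_raters) array."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means
    # Mean squares from the two-way ANOVA decomposition
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # raters
    sse = np.sum((ratings - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical 0-100 grades from two raters for six answers
scores = np.array([[90, 88], [70, 72], [55, 60], [80, 78], [35, 40], [65, 63]])
print(round(icc_2_1(scores), 2))  # 0.98
```

High agreement between raters (small within-pair differences relative to the spread across answers) yields an ICC close to 1, as in the study.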

assigned descriptive labels to key paragraphs or viewpoints in the text. Subsequently, the two researchers jointly integrated the high-frequency repetitive labels and established overall themes based on their discussions. To achieve a comprehensive understanding and delve deeper into the participants’ perspectives, attitudes, and emotions, the two researchers jointly developed broader categories and labels. By comparing and integrating different themes, they established the final coding framework, capturing the core content of the discussion through keywords, phrases, or topic sentences.

After completing all coding work, a third researcher (another co-author of this paper) joined the discussion to check and further analyze the constructed coding framework. In addition, we conducted a statistical analysis of the qualitative data to explore the frequency, distribution, and correlations of the codings. Finally, the three researchers discussed the results of the qualitative and quantitative analyses, cross-referenced them, and synthesized each thematic category to analyze the participants’ strategies, attitudes, and perceived changes when using LLM tools. It is important to note that the interview outline consisted of open-ended and semi-open-ended questions, and during the interviews we flexibly adjusted the questions based on the responses of the interviewees. Thus, not all 48 interviewees were asked, and responded to, the same questions, even though the outline of the interviews was fixed. Therefore, only the interviewees who responded to a particular question were coded and discussed, rather than the entire group of 48 interviewees.

4 RESULTS

In this section, we present both the quantitative and qualitative results of the study.

4.1 Quantitative results from the empirical experiment

As mentioned previously, the quantitative metrics were extracted from the empirical experiment and are summarized in Tables 2 and 3. The results of the regression analysis are shown in Tables 4 and 5.

To verify the difficulty of the experimental materials before the formal quantitative analysis, we first used paired-samples t-tests to compare the scores of two cohorts who conducted the same type of academic task with different materials but under the same experimental conditions (same time pressure and same training level). We found no significant discrepancy, either between the scores of the two cohorts who read different papers or between those who performed literature reviews on different topics (p>.05). Therefore, we argue that the difference in the difficulty levels of the experimental materials we prepared for the same type of academic task was minor.

Next, as shown in the last rows of Tables 2 and 3, under low time pressure, 10 out of 48 participants chose to finish the paper understanding task without the LLM tool, while under high time pressure, 9 participants chose to finish the tasks without the LLM. Among these 9 participants, non-use of the LLM occurred only in the paper understanding task, and all participants chose to use LLM tools
understanding tasks, and all participants chose to use LLM tools
Evaluating Large Language Models on Academic Literature Understanding and Review CHI ’24, May 11–16, 2024, Honolulu, HI, USA

Table 4: Summary of Statistical Results.

Dependent Variable (DV)   Independent Variable (IV)       F-value               p

Time                      Usage Experience                F(2, 42) = 2.55       .09∗
Time                      Training                        F(1, 42) = 7.50       .009∗∗
Time                      Training * Usage Experience     F(2, 42) = 4.84       .01∗∗
Time                      Task Type                       F(1, 47) = 4.64       .04∗∗
Time                      Time Pressure                   F(1, 47) = 5.51       .02∗∗
Grade                     Usage Experience                F(2, 45) = 2.21       .1
Grade                     Task Type                       F(1, 45) = 105.19     <.0001∗∗
Grade                     Usage Experience * Task Type    F(2, 45) = 8.98       .0005∗∗
Grade                     Time Pressure                   F(1, 47) = 9.59       .003∗∗
Notes: In this table and the following tables, ∗ marks marginally significant results (p<.1) and ∗∗ marks significant results (p<.05).

Table 5: Significant Post-hoc Results for Discrete Independent Variables.

DV IV IV Level IV Level compared to Estimation (95% CI) t value p


Time   Task Type                     PU                           LR                               3.49 [0.23, 6.79]       t(47)=2.15    .04∗∗
Time   Time Pressure                 MT                           LT                               -3.80 [-7.06, -0.54]    t(47)=-2.35   .02∗∗
Time   Training                      Without training             With training                    -5.72 [-9.93, -1.50]    t(42)=-2.74   .009∗∗
Time   Training * Usage Experience   Without training*Never used  Without training*Sometimes used  -13.65 [-25.40, -1.90]  t(42)=-3.47   .015∗∗
Time   Training * Usage Experience   Without training*Never used  With training*Never used         -15.00 [-29.86, -0.14]  t(42)=-3.01   .047∗∗
Grade  Task Type                     PU                           LR                               24.60 [19.77, 29.43]    t(45)=10.26   <.0001∗∗
Grade  Time Pressure                 MT                           LT                               6.26 [2.19, 10.33]      t(47)=3.10    .003∗∗
Grade  Usage Experience * Task Type  Never used*PU                Never used*LR                    30.00 [12.99, 47.02]    t(45)=5.25    <.0001∗∗
Grade  Usage Experience * Task Type  Rarely used*PU               Rarely used*LR                   12.97 [3.15, 22.80]     t(45)=3.93    .004∗∗
Grade  Usage Experience * Task Type  Rarely used*PU               Sometimes used*PU                -16.25 [-30.11, -2.39]  t(45)=-3.49   .01∗∗
Grade  Usage Experience * Task Type  Sometimes used*PU            Sometimes used*LR                30.83 [22.34, 39.34]    t(45)=10.78   <.0001∗∗
Notes: the estimate is the difference between “IV Level” and “IV Level compared to”.
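The shape of the analyses behind Tables 4 and 5 can be sketched in code. The snippet below is illustrative only (synthetic data and generic scipy calls of our own choosing, not the authors' analysis scripts): it mirrors the paired-samples material-difficulty check described in Section 4.1 and an F-test in the spirit of the Task Type rows above.

```python
# Illustrative sketch of the statistical checks in Section 4.1.
# All data here are synthetic; means, SDs, and names are our assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# (1) Material-difficulty check: paired-samples t-test comparing scores on the
# same task type with different materials under identical conditions.
scores_a = rng.normal(40, 10, 24)           # cohort using material A
scores_b = scores_a + rng.normal(0, 3, 24)  # cohort using material B
t_mat, p_mat = stats.ttest_rel(scores_a, scores_b)
# p > .05 here would support "difficulty difference was minor", as reported.

# (2) Task-type effect on Grade: a one-way F-test analogous in spirit to the
# F(1, 45) = 105.19 row for Task Type in Table 4.
grade_pu = rng.normal(61, 8, 24)  # paper-understanding grades (synthetic)
grade_lr = rng.normal(37, 8, 24)  # literature-review grades (synthetic)
f_task, p_task = stats.f_oneway(grade_pu, grade_lr)

print(f"material check: p={p_mat:.3f}; task type: F={f_task:.1f}, p={p_task:.2g}")
```

The paper's actual models additionally include interaction terms (e.g., Training * Usage Experience) and post-hoc contrasts, which a full reproduction would fit with a factorial ANOVA rather than isolated tests.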

when they were conducting the literature review task. This indicates that scholars presented varied preferences for the use of LLM tools on different tasks. Referring to [13], such attitudes may be determined by the perceived ease of use and usability of the tool, which we will discuss further in the qualitative analysis.

Second, to better model the influence of users’ backgrounds and the three experimental conditions, as well as their interaction effects, we built two models for the Time (%) and Grade of participants. Referring to Table 4, we found that the type of training, the task type, and the time pressure were significant predictors of the time spent on the task, while the task type and the time pressure were influential factors on grades. Specifically, as shown in Table 5, one would spend more time and gain higher scores when conducting a paper understanding task compared to a literature review task. At the same time, people under higher time pressure spent a higher percentage of time on tasks but obtained lower scores. We also found that the training made a difference: participants who received limitation-related training spent more time on the task compared to those who received only basic training, while no significant effects of training were observed on grades. Finally, two significant interaction effects related to LLM usage experience were identified. Within the group without limitation-related training, participants who had more experience in utilizing LLMs in academic tasks spent more time on tasks compared to those who used LLMs less frequently. At the same time, when conducting paper understanding tasks, more experienced LLM users were always more likely to obtain higher grades compared to less experienced LLM users.

4.2 Qualitative results from the semi-structured interview

We extracted four categories of topics from the interviews: training for initial users, variations in two academic tasks, strategies under time pressure, and concerns about LLM tools. Each category was further divided into three subtopics, which encompass the common themes emerging from participants’ responses. Figure 2 illustrates the detailed statistics. Through coding and discussing diverse topics, we aim to delve into participants’ attitudes, strategies, and reflections on various aspects of LLM tools.

4.2.1 Training for initial users. The majority (47/48) of participants would like to obtain some kind of guidance or training before using LLM tools, but one subject explicitly stated that she did not need to know any information or knowledge to use the LLM tool for the first time and that she could use it in a straightforward way. A total of 16 types of information that participants wished to know before using the LLM were identified, and these were categorized into 6 themes through thematic analysis. These categories, listed in descending order of frequency in Figure 2, are pre-use techniques, features and limitations of the LLM, basic methods and operations to use the LLM, ethics and compliance, historical or current tool development, and others. In addition to the most frequently mentioned and emphasized “questioning techniques”, many participants emphasized the importance of crafting effective “prompts”. For instance, P43 said, “If you don’t use prompt engineering and instead express yourself naturally, there’s a good chance you won’t get what you’re
CHI ’24, May 11–16, 2024, Honolulu, HI, USA Wang et al.

Figure 2: Quantitative statistics from the interviews. The categorization of codings and theme groups corresponds to Figure 1.

looking for; if you don’t get the results you intended, LLM is not very useful in academic assignments.” It is noteworthy that although 16 participants felt that understanding the limitations or flaws of the LLM tool was necessary, only 2 of them considered it the most crucial skill when using the LLM for academic tasks.

When discussing the learning resources for the LLM tool, we found that the official guides or documentation provided by the LLM tools were not the primary learning resources. Only 4 participants indicated that they would read or watch the official learning materials. Most individuals tended to rely on third-party educational resources when acquiring new skills. P23 mentioned that he would learn how to use LLM tools through user-generated content platforms; P38 said that he would check out posts shared on the Internet; P30 emphasized the role of watching reviews of LLM tools on short-video platforms; and P40 said that he would check online forums or use a search engine to find relevant information.

To compare the effect of different training, we provided general training for all participants and conducted limitation-related training for half of the participants, in which we emphasized the shortcomings, limitations, and academic integrity issues related to the LLM tool. By comparing these two training methods, we found that the individuals who did not receive LLM-limitation-related training expressed greater satisfaction with the actual effectiveness of the tools, and a higher percentage of them (83.3% versus 62.5%) believed that the LLM tool provided important assistance in completing the tasks. Further, participants who did NOT receive limitation-related training mentioned more content outside of our limitation-related training; e.g., they mentioned limitations of content generation more frequently in the semi-structured interview compared to those who received limitation-related training. For example, P8 said, “The current training data of LLM is also based on a more general data site, so my current experience is that there is still a lack of specialized knowledge. The generated answers are still limited and not professional enough.”

4.2.2 Variations in the role of LLMs in different academic tasks. Only 27% of individuals stated that the LLM tool was merely useful in assisting the two academic tasks in the experiment, while the rest of the respondents indicated that the LLM tool was useful to some extent. For example, P37 said, “I find the ChatGPT very helpful, especially for summarizing existing literature and quickly locating answers. I think it’s incredibly useful.”

Regarding the types of assistance gained from the LLM tools, we categorized them into six primary themes using thematic analysis, ordered by frequency from high to low: 1. literature summarization, which aims to help users understand and summarise the content of the literature; 2. information retrieval, which helps find relevant information for a specific problem or gives advice on how to solve the problem; 3. linguistic optimization, which involves polishing texts and correcting grammar, spelling, and expression; 4. data analysis, which helps users process and analyze data; 5. writing aids, which support users with writing inspiration, content continuation, and so on; 6. framework establishment, which helps users create a framework or structure to present their ideas or research results. Figure 2 presents detailed data on these six themes.

Information search is a noteworthy feature of LLM tools, which is believed to have the potential to replace search engines and encyclopedias. As P15 said, “I study chemistry, and when I come across some unfamiliar compounds, I will ask the LLM tool directly, which is more accurate and direct than the results obtained from a search engine.” P8 also mentioned that “asking questions to the LLM tool is like asking a Wikipedia.” It is worth mentioning that a few participants (2/48) mentioned the assistance of LLMs in personalized tasks (e.g., language translation, coding). For example, P37 mentioned that “the LLM tool can judge my solutions, then identify some shortcomings, and help me to correct them.”

It is also interesting to find that participants exhibited significant divergence regarding the role of the LLM tool in different tasks. As illustrated in Figure 2, 21 respondents believed that the LLM was more helpful in assisting literature review tasks compared to paper understanding tasks. P19 said, “I think it (LLM) was more useful in the literature review; it not only helps us target some key information but at the same time relieves us of writing burdens.” In contrast, 16 participants held the opposite view, and the remaining 6 participants expressed uncertainty about the comparison of the role of the LLM in the two tasks used in the experiment. For example, P21 mentioned that “The LLM is more useful in supporting paper understanding. The LLM tool can give me a general outline. It can explain terms I don’t understand, and it also can summarise the paper a little bit.” “When it comes to the literature review, I think it’s better to refer to relevant published literature reviews that are more capable or conduct this myself, instead of referring to a bunch of literature summarized by the current LLM tool.”

In addition, some thought-provoking ideas were identified. For example, P41 emphasized that “Reading a paper is a process of comprehension, and the use of the LLM tool removes this purpose.” Implicitly, there is a concern that LLM tools may negatively impact one’s capability in reading and comprehending papers. In contrast, P13 indicated that the LLM can help the comprehension process, as he mentioned that using the LLM for paper reading is like “going on a treasure hunt with a treasure map”, highlighting the function of LLM tools as an aid. Nevertheless, although the LLM tool can guide and speed up the paper reading task, a deeper comprehension of the paper still requires one’s personal reflection.

4.2.3 Strategies under time pressure. Under different levels of time pressure, there were significant divergences in the impact of LLM tools on the paper understanding and literature review tasks. 21 out of 48 participants felt that the LLM tool was more useful under low time pressure compared to high time pressure. For instance, P40 said, “During the literature review, the tool is more useful when 20 minutes were allowed for the task.” At the same time, 10 out of 48 participants felt that the LLM tool worked better under higher time pressure. The remaining 17 participants thought that the time pressure did not make a difference.

At the same time, referring to Figure 2, 23 out of 48 participants specifically compared the role of the LLM tool in completing the paper understanding task under different time pressures. Among these 23 participants, 6 believed that time pressure would not affect the completion of the paper understanding task. In comparison, 2 participants stated that they did not use the LLM tool at all in the paper understanding task regardless of time pressure. Further, 24 out of 48 participants specifically compared the role of the LLM tool in completing the literature review task under different time pressures. Nearly half (11/24) of these participants indicated that the LLM tool would be more effective under low time pressure, while 5 held the opposite opinion.

These discrepancies and divergences also led to variations in participants’ attitudes toward tasks under different levels of time pressure. We employed creative coding to differentiate these attitudes and found that under lower time pressure, participants tended to exhibit more positive attitudes toward the LLM. When the time pressure was low, only 3 respondents regarded the help from LLM tools as negligible, while the rest held positive attitudes towards the LLM in accomplishing academic tasks. It is likely that as the time pressure was reduced, participants could engage more in introspective thinking (e.g., contemplating ways to ask questions, strategies for using the LLM, and double-checking the accuracy of the responses generated by the LLM), which was mentioned in the interviews.

Figure 3: The 2x2 LLM tool usage strategies matrix. The x-axis indicates the time dimension, where more time means less time pressure. The two rows on the y-axis represent the Literature Review and Paper Understanding tasks, respectively.

Conversely, when time pressure got higher, almost all participants leaned towards negative attitudes toward the LLM, mainly due to concerns about the lack of time to check the replies generated by the LLM. Interestingly, a few participants chose to prioritize completion of the task over concerns about the LLM; e.g., P8 said, “I will use the LLM to generate an approximate answer to satisfy the basic requirement of completing the task first, and then check if the answer is what I want when I have the time later.” These distinct attitudes also affected participants’ strategies in using LLMs. Figure 3 summarises the most widely adopted strategies when using the LLM tool in different situations. As time pressure decreased, participants showed a stronger willingness to examine the content generated by the LLM tool.

The lack of familiarity with the LLM was one of the primary obstacles preventing the timely completion of tasks in the study. Throughout the interviews, reasons for not being able to finish tasks within the designated time frame were mentioned 15 times. Frequently mentioned reasons include unfamiliarity with the LLM tool (mentioned 3 times), slow responses from the LLM (mentioned 7 times), and inaccurate time management (mentioned 5 times). Notably, unfamiliarity with LLM tools stood out as one of the prominent barriers, as mentioned by P25: “I might not be proficient with that software, so when I use it, I feel a bit flustered.” Hence, appropriate training for tool usage is necessary. Throughout the interviews, there were a total of 39 instances of tasks being completed ahead of schedule. Most cases happened in literature review tasks with low time pressure, where 18 participants completed tasks ahead of schedule. The primary facilitating factor for early completion was experience in usage, as indicated by P45: “After completing the task once, I gained experience or a sense of how to finish this task more quickly.” Participants were likely to become more familiar with the experimental process, leading to better comprehension of the responses and optimized strategies. This familiarity can also play a role in real-world academic tasks, manifesting as increasing efficiency when conducting similar tasks or using LLM tools repeatedly. Another intriguing discovery was that some participants had to lower their interactions with the LLM tool due to time pressure. They stopped scrutinizing the generated content before adopting it, which paradoxically led to early task completion. For example, P44 said, “Because it might just be time pressure, I didn’t expect as much from the LLM tool. So I didn’t bother to make any further adjustments to the answer, and finished the task ahead of schedule.” This raises concerns about over-reliance on AI tools in high-pressure situations [8, 54].

4.2.4 Concerns about LLM tools. Most participants expressed concerns about the impacts of the LLM tool. The top five most frequently mentioned negative effects include: 1) concerns about the accuracy of LLM-generated responses, where the answers provided by the tool may be erroneous or imprecise; 2) impact on human cognitive abilities, where overuse of the tool may weaken the user’s ability to think independently and dependency on LLM tools may develop; 3) copyright and originality concerns, wherein the tool may infringe upon others’ intellectual property rights while generating content, and users may be questioned about the originality of their work when utilizing AI-generated materials; 4) time consumption, where users might spend excessive amounts of time seeking accurate answers or rectifying incorrect content; 5) hindering basic learning, where users may stop developing basic skills due to over-reliance on the tool. The statistics of these five negative impacts can be found in Figure 2. For instance, P38 said, “Relying too much on the LLM for assistance in academic tasks might lead to academic misconduct or errors within the academic process.” He also mentioned, “If you overly depend on this tool and turn to it for solutions whenever you encounter problems, it might hinder critical thinking and innovation by impeding our natural thought processes.”

More specifically, among the 48 participants, concerns regarding the current LLM tool were primarily about the accuracy of responses rather than other issues such as privacy and copyright. Indeed, although many participants did not explicitly mention concerns over the accuracy of the responses, they often expressed this concern implicitly. For example, though P14 did not mention the accuracy issue directly, he still expressed concerns about the correctness of LLM-generated content when describing his strategy for using the LLM: “I may double-check the LLM responses, and beyond the logic, I will also pay attention to some of the parts that may not match my perception and may do further validation.” Surprisingly, 5 participants indicated that they did not have any concerns about the LLM tool. For example, P42 mentioned that “LLM has not been used in a particularly bad way, so I do not have any obvious concerns,” while P19 also indicated that “I have no concern about LLM. I think that as long as the responses are scrutinized and full, while reasonable inputs are provided, a high degree of accuracy can be achieved.”

When the academic background of the participants was considered, we were also surprised to find that none of the participants without

publication mentioned copyright concerns proactively. Even after being reminded, only 2 (out of 10) of them said they would consider copyright an issue. Further, only one of them mentioned academic integrity issues when using the LLM. In contrast, among participants who had at least one publication, a larger portion (23 out of 38) regarded copyright or academic integrity as a potential issue of using the LLM. For example, P46 said, “There may be academic misconduct …… I’m also afraid that my intellectual property will be compromised. I prefer not to send the paper I’m working on directly to LLM. Instead, I’ll probably send small segments and have the GPT do some writing polishing.” It seems that researchers who received more extensive academic training were more aware of violating academic rules when an AI-based tool was used.

Regarding issues in the design of the LLM, the majority of participants believed that the current design of LLM tools does not provide users with sufficient information. Among the 48 participants, 33 expressed that the current design of LLM tools does not provide official guidance on how to use prompts efficiently. For example, P44 said, “The design only offers an interface for input and output, but it doesn’t provide specific guidance on how to better utilize and master the tool. Most of the learning comes from seeking information through other channels.” Similarly, P31 said, “The interface is very simple, and the content is quite brief. It doesn’t provide me with proper guidance.” However, some interviewees held different opinions. They believed that the simplicity of the interface makes the tool easy to operate, as P39 mentioned: “The LLM tool itself is quite simple. After having several conversations with it, you naturally become familiar with the pattern. It doesn’t require excessive design.”

5 DISCUSSION

5.1 Strategies in using LLMs for different academic tasks

Combining results from the quantitative and qualitative analyses, our research indicates that young scholars performed better when using LLM tools for paper understanding (PU) tasks compared to literature review (LR) tasks. However, young scholars spent more time on, and had lower intentions to use LLMs for, PU tasks versus LR tasks. In the field of human-computer interaction, it is widely recognized that user reliance on automation can be moderated by task complexity [54]. Similarly, in our experiments, compared to the PU task, where the information source was known, participants needed to search a wider range of unknown sources in LR tasks. Further, most participants perceived LLM tools as being good at handling complex tasks such as developing process frameworks. Both may explain why participants relied on the LLM more in LR tasks, especially when the time pressure was high, as many participants felt that copying and typing text from PDF files into the LLM in PU tasks was more complicated and time-consuming than the procedures in LR tasks. However, it should be noted that, given the limitations of the LLM we provided in the experiment (i.e., source of bias [2]), current LLM tools cannot provide the most up-to-date results in LR tasks. Thus, LLMs can provide only limited assistance in LR tasks, which may explain why participants obtained lower scores in LR tasks (which were judged based on scoring standards such as the number of references, source accuracy, and citation quality; for details, please refer to the supplementary materials) compared to PU tasks.

We also found that participants could adaptively change their strategies when conducting different types of academic tasks. In order to further reveal the participants’ strategies, a flowchart was obtained by summarising the interview data and the experimenters’ observation notes [21]. We combined various factors such as interview transcripts, task materials, task completion time, scores, and interaction styles to create a basic flowchart showing the process of completing the task for most of the participants (Figure 4). In the figure, we aggregated and abstracted the steps the majority of participants took.

In general, at the beginning of a task, participants would judge whether they needed help from LLM tools, taking the task type, the difficulty of the task (as moderated by the time pressure), and their own capabilities into consideration. Then, during their interactions with LLM tools, participants might repeatedly modify their strategies (e.g., adjusting the context in their prompts) to optimize the LLM-generated results. Different strategies were adopted for different types of tasks. Specifically, participants were highly uniform in their strategies when using LLM tools for the PU tasks. Most of them would divide the articles into small segments and ask questions based on those segments. When conducting LR tasks, participants chose more diverse strategies. For example, some participants asked the LLM to generate a complete review; others only let the LLM generate the outline. Some participants even chose to provide a framework for the LLM to refer to. Figure 4 depicts the two strategies that the participants used the most in LR tasks. Finally, another difference in strategies between the LR and PU tasks was that participants usually used the LLM throughout the whole task procedure for the PU task, whereas in LR tasks they preferred to conduct self-modification and refinement of the responses generated by LLM tools. It is likely that the participants had different levels of trust in the LLM tool when completing different tasks, which led to different levels of reliance on the LLM tool.

In addition, when conducting PU tasks, participants were more inclined to complete the task on their own compared to when conducting LR tasks (see Tables 2 and 3). Further statistical tests show that those who did not use the LLM tool obtained lower scores and took a significantly longer time to complete the task (paired t-test, p<.0001). Based on the interviews, we found two potential reasons explaining the low usage rate of the LLM in PU tasks: first, participants were confident in their ability to comprehend the scientific literature; second, they were trying to avoid the deterioration of their learning ability as a result of over-reliance on the LLM. We speculate that the abandonment of the LLM in tasks may also be related to one’s personality traits and early-stage scholars’ wish to develop their skills for future academic success. However, we cannot validate these assumptions given that only early-stage scholars participated in the experiment, and future experiments are needed.

5.2 Strategic choices under time pressure

Time pressure can influence researchers’ strategies when using LLMs and their attitudes toward LLM tools. Although previous results have shown that time pressure can affect the strategies that users adopt to learn new knowledge or skills [50], it is still unknown

Figure 4: Flowchart of conducting the two tasks with LLMs. Notes: This flowchart only summarizes the basic processes for the two tasks in the experiment. The actual behaviors of participants were more diverse under different levels of time pressure, as presented in Figure 3.

how time pressure may influence one’s strategies when an AI-based assistant, the LLM tool, is available for academic tasks. The qualitative analyses in our study indicate that with relatively low time pressure, researchers exhibited a more positive attitude toward the LLM and were more confident in fulfilling tasks using LLM tools. In contrast, under high time pressure, most researchers showed a more hesitant and negative attitude toward using the LLM for academic tasks. It is possible that researchers still have concerns over the capability of the LLM. Thus, under high time pressure, participants tended to adopt more conservative methods rather than use new tools [64], so that they did not need to double-check the content generated by the LLM.

However, users’ attitudes toward LLM tools may not directly reflect their choice of strategies. The observational data in our study show that, in PU tasks with low time pressure, participants were more likely to abandon LLM tools, whereas under high time pressure, participants exhibited higher LLM tool usage rates. At the same time, under high time pressure, some participants chose to skip verifying the responses generated by LLM tools. This indicates that when faced with more urgent deadlines, researchers may prioritize efficiency over skepticism and potentially sacrifice the quality of their work. This result is in line with previous findings in the human-automation interaction domain, which suggest that external stressors of tasks may influence users’ adoption of new technologies [41]. Specifically, when users are under a high workload or in a stressful situation, they tend to rely more on the technologies, even if they do not fully trust them. Thus, LLM tool designers should carefully balance efficiency and effectiveness to better support users. For instance, the trade-off between response speed and the accuracy of the responses could be made customizable to better cater to users’ needs under different levels of time pressure.
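One way a tool could act on this design suggestion is to expose the speed/accuracy trade-off as an explicit, user-selectable response mode. The sketch below is hypothetical (the mode names, fields, and thresholds are our illustrative assumptions, not an existing LLM API):

```python
# Hypothetical sketch of a user-selectable speed/accuracy trade-off for an
# LLM front-end; mode names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class ResponseMode:
    name: str
    verify_outputs: bool      # re-check generated claims before showing them
    latency_budget_s: float   # soft time budget per reply

PRESETS = {
    # Under tight deadlines: fast, unverified drafts.
    "deadline": ResponseMode("deadline", verify_outputs=False, latency_budget_s=5.0),
    # With slack: slower replies that are checked before display.
    "careful": ResponseMode("careful", verify_outputs=True, latency_budget_s=60.0),
}

def pick_mode(minutes_left: float, threshold: float = 10.0) -> ResponseMode:
    """Favour speed under high time pressure, verification otherwise."""
    return PRESETS["deadline"] if minutes_left < threshold else PRESETS["careful"]
```

A real tool would likely let users override such an automatic choice, since the observations above show that attitudes and actual strategies under time pressure can diverge.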

5.3 Users' attitudes and concerns of LLM
In general, researchers hold a positive and forward-looking attitude toward LLM tools. They mentioned the functionality of LLM tools and how to effectively utilize them more than the limitations of the tools. On the one hand, this is an encouraging discovery, as it suggests that young scholars focused more on harnessing the benefits of LLM tools rather than dwelling on the shortcomings, and thus they may be more willing to use them. On the other hand, this may lead to misuse of the LLM tools. For example, during the interviews, although most participants mentioned their concerns about the limitations of the LLM tool (similar to the findings from [2, 29, 52]), very few participants could comprehensively and systematically acknowledge the constraints and boundaries of LLM tools, and even fewer participants were aware of the potential privacy and copyright issues of the LLMs. In particular, those with little academic experience (i.e., had no academic publications) were inclined to overlook the potential personal privacy, academic copyright, and ethics issues caused by LLM tools. This finding provides a different perspective on the opinions of adopting the LLMs for academic tasks compared to the previous study, which focused more on senior scholars [52]. Additionally, young scholars may intentionally choose to ignore the limitations of the LLMs, similar to how human beings rely on heuristics to make decisions in urgent situations [58]. In our study, under time pressure, some participants indicated that they intentionally ignored the deficiencies of LLM tools, even when they were aware of these issues. For instance, when striving to complete PU tasks under high time pressure, some participants indicated that they lowered their expectations of the performance of the LLM tool and might cease to verify the content generated by these tools.
The associations among academic experience, attitudes toward LLMs, and strategies when using LLM indicate that, in the academic community, users' willingness to use the LLM tools is a dynamic process, and there is a chance that young scholars prefer to use the LLM tools, especially under high time pressure. Thus, LLM tool designers should try to make users aware of the limitations and boundaries of LLM tools so that users can use the LLMs more effectively and responsibly. For instance, appropriate system transparency [37, 49] can be an effective way to address the concerns regarding the accuracy of tool outputs. Specifically, designers can incorporate features such as confidence scores or explanatory annotations [60] in the responses generated by the LLMs, which would help users better understand the reliability of the generated content so that they can make more informed decisions when using the tool, even under high time pressure. On the other hand, before adopting the LLMs, young scholars should also receive training regarding the limitations of LLM tools so that they can make more informed decisions on when and how to use the LLM tools.
At the same time, from the perspective of Human Machine Interface (HMI) design for LLM, understanding the differences in users' performance and level of reliance can aid in designing LLM tools to fit the needs of different academic tasks and improve users' satisfaction with the LLM tools. Sakirin et al. showed that users preferred to use the dialogue interface supported by the LLM tool [59], but our study found that the interface of the LLM tool did not provide users with enough information, and the oversimplified design may not give users enough hints and feedback. It is recommended that the functionality and interface of the LLM be improved so that it is more suitable for different academic tasks. For example, preset "Prompt" options for different scenarios can be proactively provided, to reduce the learning cost for researchers.
More alarming, some participants overstated the abilities of LLM tools (i.e., overlooked potential limitations or risks of applying LLMs to academic tasks), which coincides with several voices supporting the use of LLM tools in academic tasks [51]. However, overestimating the capabilities of the LLMs may lead to over-reliance on LLM tools. From an academic performance perspective, this may result in erroneous or inaccurate conclusions in academic tasks. From an educational perspective, this may negatively impact young scholars' critical thinking and academic skills, potentially affecting their overall academic development. This finding points to another important topic in LLM usage: the training of users.

5.4 The role and future improvement of training
Training is pivotal for the appropriate use of the LLM tool. Previous research has pointed out that it is important to train users to refine their mental models, and subsequently facilitate user-LLM collaboration performance [65]. Our study reveals that individuals who received limitation-related training expressed lower satisfaction with the effectiveness of the LLM tool and discussed the accuracy of the LLM-generated responses more in the post-experiment interview. This implies that the trained individuals were more skeptical of the content generated by the LLM, which may explain why, among those who had never used LLM, the participants who received limitation-based training spent more time on academic tasks than those who did not.
It is also interesting to notice that, in addition to the knowledge passed on in the limitation-related training, users can also gain experience during interactions with LLMs before the experiment. For example, we found that compared to those who had little to no prior experience with the LLMs, the participants who had relatively more experience with the LLM (i.e., those who self-reported sometimes using LLMs; 12 of them received limitation-based training and 12 did not) tended to be more aware of the strengths and weaknesses of the LLM tool and tried to find the best strategies when using LLM tools. Specifically, they adjusted their interactions with LLM tools more frequently than those who had less experience, and they also provided more insightful comments on LLMs. For example, 7 out of the 12 LLM users who did not receive limitation-based training mentioned that they double-checked and re-examined the answers generated by LLM, which cost additional time in the task. In contrast, users who lacked LLM experience and did not receive limitation-related training tended to show low confidence in the LLM tools. In particular, among the 3 participants who had no LLM experience and did not receive limitation-related training, 2 of them abandoned LLM tools during the PU task, and they did not adopt the strategies most experienced users would take (as shown in Figure 4) in LR tasks. The above findings indicate that limitation-related training can not only shorten the period that users may take
CHI ’24, May 11–16, 2024, Honolulu, HI, USA Wang et al.
to develop appropriate strategies to use new technologies, but also help promote the adoption of LLMs among new users.
Unfortunately, the training or materials provided by the official providers of the LLM tools may not be enough. Many participants reported receiving their training from third-party platforms on the Internet rather than from official sources. This could be attributed to uncertainties regarding the comprehensibility of official documentation or the usability issues of the official documents. Such a phenomenon has also been observed in other domains; for example, researchers found that a very low portion of users read the manual of their vehicles regarding driving automation [20]. Therefore, it is recommended that LLM developers or maintainers explore better ways to present necessary information to new users, or actively engage with relevant online forums and social media groups to assist users in addressing their usage-related queries. However, it should be noted that the training methods adopted in our study are still preliminary, and future research should continue to optimize training methods and content, and better incorporate the training methods into the LLM tool design to improve users' performance in academic tasks with the LLM tools.
Especially for the PU tasks, the results showed that most early-stage scholars would prefer to read and understand the literature on their own, as they did not want to "rely too much on the LLM to constrain their learning ability". Hence, future LLMs can provide more translation or search functions for key information in PU scenarios. For LR tasks, designers should try to reduce the chances of noisy responses appearing or provide confidence scores [60] for the LLM-generated responses. Personalized training and support services may also be necessary to help participants make better use of LLM tools. By tailoring support based on researchers' experience, proficiency with the tools, and the type of tasks they are undertaking, individualized assistance can be provided. This could include training on specific usage techniques and strategies for a particular type of task, thus enabling scholars to perform better in their academic endeavors.

6 LIMITATIONS
We recognize that although our study has followed standards in the field of human-computer interaction to some extent [10], there are still some limitations. First, as we intentionally limited our targeted user group to young scholars, the findings may not generalize well to the senior academic community. Users with different levels of familiarity with academic tasks may hold different attitudes toward AI-based tools and may adopt different strategies for using them. In future research, senior scholars with different backgrounds should be recruited. Secondly, limited by the sample size, we had to focus on two common but typical types of academic tasks in a single academic domain in this experiment, which may not cover all scenarios in which LLM tools are used for academic tasks. We also only considered the task difficulty controlled by the time pressure. In daily academic tasks, task difficulty may be moderated by many factors. Future research may consider introducing more types of academic tasks (e.g., academic writing, data analysis, and experimental design) from more academic domains, and modeling the influence of other task-difficulty-related factors to assess the impact of LLM tools on academic tasks more comprehensively. Further, as an empirical study, though we tried to replicate realistic scenarios in daily life, the scenarios the users encountered were still artificial to some level, and users may have exhibited biased behaviors in the experiment. Future research may consider observational studies to better reveal the strategies users may adopt when LLMs are used for academic tasks. Lastly, considering the rapid development of LLMs, more advanced models or interfaces are being introduced (e.g., GPT-4V 7, Semantic Reader 8). In this work, we were not able to adopt these up-to-date tools, as they were not publicly accessible when our experiment began. Thus, readers should be aware that some findings in our study may not apply to some emerging LLM tools, and future assessments of how users' behaviors change adaptively with the evolution of LLM tools are needed.

7 https://openai.com/research/gpt-4v-system-card
8 https://openreader.semanticscholar.org/

7 FINDINGS AND CONCLUSIONS
In this study, we conducted an empirical study involving 48 early-stage scholars to understand how LLM tools can be utilized for academic tasks and affect early-stage scholars' workflow. Specifically, we discussed the influences of user perspectives on LLMs, evaluated users' performance when using LLM in two typical but different academic tasks, and analyzed the influence of time pressure in these tasks. Besides, the qualitative analysis based on a post-experiment interview revealed the strategies users adopted when using LLM tools. In general, several key findings are summarized as:

• We found that young scholars can adaptively change their strategies when using the LLM for different tasks. Specifically, we observed more diverse questioning styles and less reliance on the LLM tools when using LLM for LR tasks, while a more monotonic strategy was observed when the LLM was used for PU tasks. Future LLM design may consider customizing the tool to better satisfy users' needs in different scenarios.
• Time pressure can influence users' attitudes toward the LLM tools and the strategies they take to cooperate with the LLM. However, the strategies they took may not necessarily match the attitudes they held. High time pressure led to declined attitudes toward the LLM, but increased the adoption rate of the LLM. It is likely that the users were not satisfied with the performance of LLM tools, but they had to use them to reduce the time pressure. Future LLM tools may need to allow users to customize the LLM tools to reach a balance between accuracy and efficiency.
• Young scholars had an overall positive attitude towards the LLMs in academic tasks, but due to their lack of academic experience, they were also inclined to ignore the academic ethical and privacy risks introduced by LLM tools, and tended to voluntarily disregard the risks from LLMs when the complexity of the task increased. Thus, training might be necessary to support better use of LLM tools among young scholars.
• We investigated the effects of a specific training, the limitation-based training, on users' attitudes, strategies, and performance in academic tasks when using the LLM tools. The results
show that training can increase users' awareness of the limitations of the LLMs and lead to more appropriate strategies when using LLM tools, similar to what experience with LLMs can do. Given that users criticized the current LLM developers for not providing adequate training materials, more authoritative and comprehensive online training materials for LLMs are expected in the future.

ACKNOWLEDGMENTS
This work was supported by the Guangzhou Municipal Science and Technology Project (No. 2023A03J0011) and the Guangzhou Science and Technology Program City-University Joint Funding Project (No. 2023A03J0001). We first appreciate the assistance provided by Wenbo Zhang throughout the entire experiment. We also would like to express our gratitude to all participants for their valuable time and to the reviewers for their helpful feedback on our paper.

REFERENCES
[1] Ian L Alberts, Lorenzo Mercolli, Thomas Pyka, George Prenosil, Kuangyu Shi, Axel Rominger, and Ali Afshar-Oromieh. 2023. Large language models (LLM) and ChatGPT: what will the impact on nuclear medicine be? European Journal of Nuclear Medicine and Molecular Imaging 50, 6 (2023), 1549–1552.
[2] Muath Alser and Ethan Waisberg. 2023. Concerns with the usage of ChatGPT in Academia and Medicine: A viewpoint. American Journal of Medicine Open 100036 (2023).
[3] Ömer Aydın and Enis Karaarslan. 2022. OpenAI ChatGPT generated literature review: Digital twin in healthcare. Available at SSRN 4308687 (2022).
[4] Stephen H Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, et al. 2022. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279 (2022).
[5] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big?. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 610–623.
[6] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).
[7] Petter Bae Brandtzaeg and Asbjørn Følstad. 2018. Chatbots: changing user needs and motivations. Interactions 25, 5 (2018), 38–43.
[8] Dawn Branley-Bell, Rebecca Whitworth, and Lynne Coventry. 2020. User trust and understanding of explainable ai: Exploring algorithm visualisations and user biases. In International Conference on Human-Computer Interaction. Springer, 382–399.
[9] Tim Broady, Amy Chan, and Peter Caputi. 2010. Comparison of older and younger adults' attitudes towards and abilities with computers: Implications for training and learning. British Journal of Educational Technology 41, 3 (2010), 473–485.
[10] Kelly Caine. 2016. Local standards for sample size at CHI. In Proceedings of the 2016 CHI conference on human factors in computing systems. 981–992.
[11] Lydia Carson, Christoph Bartneck, and Kevin Voges. 2013. Over-competitiveness in academia: A literature review. Disruptive Science and Technology 1, 4 (2013), 183–190.
[12] John Joon Young Chung, Wooseok Kim, Kang Min Yoo, Hwaran Lee, Eytan Adar, and Minsuk Chang. 2022. TaleBrush: visual sketching of story generation with pretrained language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–4.
[13] Fred D Davis. 1989. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly (1989), 319–340.
[14] Ismail Dergaa, Karim Chamari, Piotr Zmijewski, and Helmi Ben Saad. 2023. From human writing to artificial intelligence generated text: examining the prospects and potential threats of ChatGPT in academic writing. Biology of Sport 40, 2 (2023), 615–622.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[16] Eva Dis, Johan Bollen, Willem Zuidema, Robert Rooij, and Claudi Bockting. 2023. ChatGPT: five priorities for research. Nature 614 (02 2023), 224–226. https://doi.org/10.1038/d41586-023-00288-7
[17] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. Advances in Neural Information Processing Systems 32 (2019).
[18] Michael Dowling and Brian Lucey. 2023. ChatGPT for (finance) research: The Bananarama conjecture. Finance Research Letters 53 (2023), 103662.
[19] Armin Esmaeilzadeh and Kazem Taghva. 2022. Text classification using neural network language model (nnlm) and bert: An empirical comparison. In Intelligent Systems and Applications: Proceedings of the 2021 Intelligent Systems Conference (IntelliSys) Volume 3. Springer, 175–189.
[20] Yannick Forster, Sebastian Hergeth, Frederik Naujoks, Josef Krems, and Andreas Keinath. 2019. User education in automated driving: Owner's manual and interactive tutorial support mental model formation and human-automation interaction. Information 10, 4 (2019), 143.
[21] Patricia H Fowler, Janet Craig, Lawrence D Fredendall, and Uzay Damali. 2008. Perioperative workflow: barriers to efficiency, risks, and satisfaction. AORN Journal 87, 1 (2008), 187–208.
[22] Catherine A Gao, Frederick M Howard, Nikolay S Markov, Emma C Dyer, Siddhi Ramesh, Yuan Luo, and Alexander T Pearson. 2022. Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. BioRxiv (2022), 2022–12.
[23] Catherine A. Gao, Frederick M. Howard, Nikolay S. Markov, Emma C. Dyer, Siddhi Ramesh, Yuan Luo, and Alexander T. Pearson. 2022. Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. BioRxiv (2022). https://doi.org/10.1101/2022.12.23.521610
[24] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. 2020. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462 (2020).
[25] Katy Ilonka Gero, Vivian Liu, and Lydia Chilton. 2022. Sparks: Inspiration for science writing using language models. In Designing Interactive Systems Conference. 1002–1019.
[26] Katy Ilonka Gero, Tao Long, and Lydia B Chilton. 2023. Social dynamics of AI support in creative writing. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–15.
[27] Bert Gordijn and Henk ten Have. 2023. ChatGPT: evolution or revolution? Medicine, Health Care and Philosophy 26, 1 (2023), 1–2.
[28] Gianluca Grimaldi and Bruno Ehrler. 2023. AI et al.: machines are about to change scientific publishing forever. ACS Energy Letters 8, 1 (2023), 878–880.
[29] Mohanad Halaweh. 2023. ChatGPT in education: Strategies for responsible implementation. (2023).
[30] Dengbo He, Dina Kanaan, and Birsen Donmez. 2022. Distracted when using driving automation: a quantile regression analysis of driver glances considering the effects of road alignment and driving experience. Frontiers in Future Transportation 3 (2022), 772910.
[31] Sebastian Hergeth, Lutz Lorenz, and Josef F Krems. 2017. Prior familiarization with takeover requests affects drivers' takeover performance and automation trust. Human Factors 59, 3 (2017), 457–470.
[32] Kevin Anthony Hoff and Masooda Bashir. 2015. Trust in automation: Integrating empirical evidence on factors that influence trust. Human Factors 57, 3 (2015), 407–434.
[33] Matthew Hutson. 2022. Could AI help you to write your next paper? Nature 611, 7934 (2022), 192–193.
[34] Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics 8 (2020), 423–438.
[35] Andreas Jungherr. 2023. Using ChatGPT and Other Large Language Model (LLM) Applications for Academic Paper Assignments. (2023).
[36] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103 (2023), 102274.
[37] René F Kizilcec. 2016. How much information? Effects of transparency on trust in an algorithmic interface. In Proceedings of the 2016 CHI conference on human factors in computing systems. 2390–2395.
[38] CY Kramer. 1956. Extension of multiple range tests to group means with unequal numbers of replication. Biometrics 12 (1956), 307–310.
[39] Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. 2023. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digital Health 2, 2 (2023), e0000198.
[40] Ivano Lauriola, Alberto Lavelli, and Fabio Aiolli. 2022. An introduction to deep learning in natural language processing: Models, techniques, and tools. Neurocomputing 470 (2022), 443–456.
[41] John D Lee and Katrina A See. 2004. Trust in automation: Designing for appropriate reliance. Human Factors 46, 1 (2004), 50–80.
[42] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
[43] Michael Liebrenz, Roman Schleifer, Anna Buadze, Dinesh Bhugra, and Alexander Smith. 2023. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. The Lancet Digital Health 5, 3 (2023), e105–e106.
[44] Peng Liu and Zhizhong Li. 2012. Task complexity: A review and conceptualization framework. International Journal of Industrial Ergonomics 42, 6 (2012), 553–568.
[45] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023), 1–35.
[46] Alexandra Luccioni and Joseph Viviano. 2021. What's in the box? an analysis of undesirable content in the Common Crawl corpus. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 182–189.
[47] Thorleif Lund. 2012. Combining qualitative and quantitative approaches: Some arguments for mixed methods research. Scandinavian Journal of Educational Research 56, 2 (2012), 155–165.
[48] Muneer M Alshater. 2022. Exploring the role of artificial intelligence in enhancing academic performance: A case study of ChatGPT. Available at SSRN (2022).
[49] Jan Maarten Schraagen, Sabin Kerwien Lopez, Carolin Schneider, Vivien Schneider, Stephanie Tönjes, and Emma Wiechmann. 2021. The role of transparency and explainability in automated systems. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 65. SAGE Publications Sage CA: Los Angeles, CA, 27–31.
[50] Giuliana Mazzoni and Cesare Cornoldi. 1993. Strategies in study time allocation: Why is study time sometimes not effective? Journal of Experimental Psychology: General 122, 1 (1993), 47.
[51] Jesse G Meyer, Ryan J Urbanowicz, Patrick CN Martin, Karen O'Connor, Ruowang Li, Pei-Chen Peng, Tiffani J Bright, Nicholas Tatonetti, Kyoung Jae Won, Graciela Gonzalez-Hernandez, et al. 2023. ChatGPT and large language models in academia: opportunities and challenges. BioData Mining 16, 1 (2023), 20.
[52] Meredith Ringel Morris. 2023. Scientists' Perspectives on the Potential for Generative AI in their Fields. arXiv preprint arXiv:2304.01420 (2023).
[53] Michael Muller, Lydia B Chilton, Anna Kantosalo, Charles Patrick Martin, and Greg Walsh. 2022. GenAICHI: generative AI and HCI. In CHI conference on human factors in computing systems extended abstracts. 1–7.
[54] Raja Parasuraman and Victor Riley. 1997. Humans and automation: Use, misuse, disuse, abuse. Human Factors 39, 2 (1997), 230–253.
[55] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models. arXiv preprint arXiv:2202.03286 (2022).
[56] Savvas Petridis, Nicholas Diakopoulos, Kevin Crowston, Mark Hansen, Keren Henderson, Stan Jastrzebski, Jeffrey V Nickerson, and Lydia B Chilton. 2023. Anglekindling: Supporting journalistic angle ideation with large language models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–16.
[57] Md Mizanur Rahman, Harold Jan Terano, Md Nafizur Rahman, Aidin Salamzadeh, and Md Saidur Rahaman. 2023. ChatGPT and academic research: a review and recommendations based on practical examples. Journal of Education, Management and Development Studies 3, 1 (2023), 1–12.
[58] Torsten Reimer and Jörg Rieskamp. 2007. Fast and frugal heuristics. Encyclopedia of Social Psychology (2007), 346–348.
[59] Tam Sakirin and Rachid Ben Said. 2023. User preferences for ChatGPT-powered conversational interfaces versus traditional methods. Mesopotamian Journal of Computer Science 2023 (2023), 24–31.
[60] Anuschka Schmitt, Thiemo Wambsganss, and Andreas Janson. 2022. Designing for conversational system trustworthiness: the impact of model transparency on trust and task performance. (2022).
[61] Steven Shorrock. 2019. What Human Factors Isn't: 1. Common Sense. https://humanisticsystems.com/2019/07/10/what-human-factors-isnt-1-common-sense/
[62] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[63] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[64] Christopher D Wickens, William S Helton, Justin G Hollands, and Simon Banbury. 2021. Engineering psychology and human performance. Routledge.
[65] Bart D Wilkison, Arthur D Fisk, and Wendy A Rogers. 2007. Effects of mental model quality on collaborative system performance. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 51. SAGE Publications Sage CA: Los Angeles, CA, 1506–1510.
[66] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems 32 (2019).
[67] Lirong Yao and Yazhuo Guan. 2018. An improved LSTM structure for natural language processing. In 2018 IEEE International Conference of Safety Produce Informatization (IICSPI). IEEE, 565–569.
[68] Nan Zhong, Zhenxing Qian, and Xinpeng Zhang. 2021. Deep neural network retrieval. In Proceedings of the 29th ACM International Conference on Multimedia. 3455–3463.

A INTERVIEW QUANTITATIVE STATISTICAL DATA
During the coding and analysis phases, a large amount of data was collected. Thus, only the data involved in Figure 2 is shown in Table 6 as an additional illustration for ease of reference.

B EXPERIMENT MATERIALS AND QUESTIONS
B.1 Paper Understanding 1 (P1)
Paper link: https://journals.sagepub.com/doi/pdf/10.1177/1071181322661442
(1) What is ADAS? Please list some typical functions it includes.
(2) What is the purpose of this research, and how can this study benefit future studies?
(3) Please briefly describe the procedures of how the survey data was collected, the participants' criteria, and how valid data was selected after the data collection.
(4) Please briefly conclude the findings in this paper, and how these findings can benefit future studies.
(5) Please indicate the limitations of this paper.

B.2 Paper Understanding 2 (P2)
Paper link: https://journals.sagepub.com/doi/pdf/10.1177/1071181322661400
(1) Please give a definition for AV and TAM respectively, and indicate how TAM is relevant to AV.
(2) What is the purpose of this research, and how can this study benefit future studies?
(3) Please summarize the types of information collected in the survey and how the valid data was selected after the data collection.
(4) Please briefly conclude the findings in this paper, and how these findings can benefit future studies.
(5) Please indicate the limitations of this paper.

B.3 Literature Review Topic 1 (T1)
Topic No.1: Novice driver training

B.4 Literature Review Topic 2 (T2)
Topic No.2: Hazard perception in driving

C OUTLINE OF INTERVIEW
Table 7 shows the outline of the interview, including types, questions, and time. It should be noted that in addition to these questions, we relied on the participants' answers for more in-depth discussion.
Table 6: Interview quantitative statistical data.
Columns: Type | Theme | Label | Number of Mentions

Type: Training
Theme: Training Content
- Pre-use Techniques: 39
- Features and Limitations of LLM: 25
- Basic Methods and Operations to Use LLM: 22
- Ethics and Compliance: 10
- Historical or Current Tool Development: 6
- Others: 4
Theme: Most Important Training
- Questioning Techniques: 15
- Others: 7
- Background Knowledge: 5
- Operating Principles: 5
- Instrumental Role: 3
- Access: 3
- Academic Ethics: 3
Theme: Learning Path
- Use the Internet: 25
- Others: 7
- Self-exploration: 7
- Read the Official Documentation: 4
- School Programs: 4
- Ask Other Users: 3
- Do not Know: 1
Theme: Help with LLM tools
- Helpful: 25
- Little Helpful: 13
- Very Helpful: 10

Type: Variations
Theme: Types of LLM Assistance
- Literature Summarization: 32
- Information Retrieval: 17
- Linguistic Optimization: 14
- Data Analysis: 10
- Writing Aids: 7
- Framework Establishment: 5

Type: Strategies
Theme: LLM Impact on LR and PU
- LLM is More Useful to LR: 21
- LLM is More Useful to PU: 16
- Incomparable: 6
- No Answer: 5
Theme: LLM Effects on PU
- Highly Effective over Long Periods: 8
- Highly Effective When Time is Short: 7
- Time Has No Impact: 6
- No Answer: 2
Theme: LLM Effects on LR
- Highly Effective over Long Periods: 11
- Time Has No Impact: 8
- Highly Effective When Time is Short: 5

Type: Concerns
Theme: Participants' Worries about LLM
- Accuracy of Responses: 35
- Privacy: 24
- Copyright: 23
- No Worries: 6
- Content Limitations: 6
- Others: 5
- Academic Integrity: 5
Theme: LLM Tools Design
- Do Not Provide Enough Information: 33
- Already Provide Enough Information: 13
- No Answer: 1
- Depends on Use: 1
Theme: Adverse Effects of LLM
- Accuracy of LLM-generated Responses: 22
- Impact on Human Cognitive Abilities: 15
- Others: 7
- Copyright and Originality Concerns: 5
- Time-consuming: 3
- Hindering Basic Learning: 3

Table 7: Interview Outline

Type        No.  Question                                                                  Time/min

Training    1    What information/knowledge do you think a user should have before         10
                 using LLM tools for academic tasks?
                 * Classify and rank them by importance.
                 * Why do you think XXX is necessary?
                 * Do you think the designers have provided adequate information to
                   the users? If not, how can users acquire the necessary information
                   about LLM tools?

Academic    2    Would you find it helpful to have LLM tools to finish academic tasks?     5
task             And why?
                 * If not, under what circumstances do you think the use of LLM tools
                   would have a negative impact?
                 In our experiment, did your evaluation of the LLM tool change when
                 you were doing different academic tasks?

Pressure    3    When you were doing our experiments with sufficient time, do you          5
                 think the LLM tool helped you to finish the task well? Why?
                 When you were asked to finish the task in a shorter time, do you
                 think the LLM tool helped you to finish the task well? Why?

Individual  4    On <task x> you did not finish it in the required time:                   5
specific         * What do you think were the main obstacles?
                 * Did the LLM tool help you? Why?
            5    On <task x> you finished it before the required time:                     5
                 * Which factors do you think contributed to that?
                 * Did the LLM tool help you? Why?

Ending      6    Do you have any concerns about LLM tools used for academic tasks?         5
                 From the perspective of:
                 * Privacy
                 * Copyright
                 * Reliability of responses
            7    Have we missed anything?                                                  2
