ABSTRACT

The rapid advancement of large language models (LLMs) such as ChatGPT makes LLM-based academic tools possible. However, little research has empirically evaluated how scholars perform different types of academic tasks with LLMs. Through an empirical study followed by a semi-structured interview, we assessed 48 early-stage scholars’ performance in conducting core academic activities (i.e., paper reading and literature reviews) under different levels of time pressure. Before conducting the tasks, participants received different training programs regarding the limitations and capabilities of the LLMs. After completing the tasks, participants were interviewed. Quantitative data regarding the influence of time pressure, task type, and training program on participants’ performance in academic tasks was analyzed. The semi-structured interviews provided additional information on the factors influencing task performance, participants’ perceptions of LLMs, and concerns about integrating LLMs into academic workflows. The findings can guide more appropriate usage and design of LLM-based tools in assisting academic work.

CCS CONCEPTS

• Human-centered computing → Empirical studies in HCI; • Computing methodologies → Natural language processing.

KEYWORDS

large language model, academic tasks, user perception, human-AI collaboration

ACM Reference Format:
Jiyao Wang, Haolong Hu, Zuyuan Wang, Yan Song, Youyu Sheng, and Dengbo He. 2024. Evaluating Large Language Models on Academic Literature Understanding and Review: An Empirical Study among Early-stage Scholars. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24), May 11–16, 2024, Honolulu, HI, USA. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3613904.3641917

∗ Both authors contributed equally to this research.
† Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CHI ’24, May 11–16, 2024, Honolulu, HI, USA
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0330-0/24/05
https://doi.org/10.1145/3613904.3641917

1 INTRODUCTION

The rapid advancement of artificial intelligence (AI) and natural language processing (NLP) has led to the development of sophisticated large language models (LLMs), such as ChatGPT¹, GPT-4², and Claude³. These models have demonstrated impressive capabilities in generating human-like text, understanding context, and solving complex language tasks. The application scope of LLM technology is extensive, and scholars have been actively analyzing the impact of such technology on fields such as healthcare [1, 39], education [35, 36], and creative writing [26] (e.g., helping journalists extract ideas from documents [56], enabling scholars to communicate findings with one another [25], and assisting writers in exploring more ways of writing stories [12]).

In recent years, LLMs have been employed for various academic tasks [33], such as literature reviews, paper reading, and writing polishing. Unlike other areas where LLMs are applicable, academic work requires extensive training in acquiring, judging, and synthesizing relevant information [46]. Moreover, the academic community demands high standards for logical coherence, information accuracy, and idea novelty [6], and thus requires more responsible AI tools compared to other domains. However, the application of LLMs in academic contexts remains under-investigated.

1 https://openai.com/chatgpt
2 https://openai.com/gpt-4
3 https://www.claude.co.id/
Further, as emerging conversational systems, the promotion of LLMs results in more diverse user behaviors as well as new social norms and user expectations [7]. Hence, it is imperative to evaluate the capacity boundary of LLMs in academic settings through user studies, so that LLMs can be better designed to be integrated into academic workflows and, subsequently, contribute to academic research. A few studies have evaluated the effectiveness of LLMs in assisting selected academic tasks. For example, Gordijn and Have [27] argued that the capacity of ChatGPT to develop a whole scientific paper is restricted. LLMs have also been found to alleviate some time pressure by automating certain processes during academic tasks [14]. However, academic tasks are diverse, and different tasks may require completely different cognitive resources. For instance, compared to extracting key information from a paper (in which the source of the information is known, but the information is unknown), literature reviews require locating and summarizing information from a wider range of studies (in which the targeted information is partially known, but the source of the information is unknown). Indeed, some research [2, 16, 52] has also pointed out that LLMs may introduce inaccuracies and biases in academic tasks, especially in understanding and summarizing the content of the literature, which has been identified as a priority concern [16]. Furthermore, task complexity can be moderated by time pressure [44], which is prevalent in academia [11]. Thus, a more comprehensive investigation is needed to better understand the role of LLMs in academic tasks of different complexity, as moderated by task type and time pressure.

On the other hand, given that LLMs can be regarded as a special type of automation that can help gather, analyze, and summarize information, users’ perceptions of the LLM may also influence task performance. Although a few empirical studies have discussed the implications and limitations of LLMs when they are used for specific academic tasks (e.g., literature review [3], idea generation [22, 48]), no research has discussed the different strategies young scholars may take when using LLMs for different tasks, nor compared how task difficulties (e.g., as moderated by time pressure) and training may influence users’ performance, although these factors have been widely acknowledged as influencing users’ reliance on automation [32].

Thus, using a mixed-methods approach combining an experimental study with semi-structured interviews, this study aims to investigate:

   • When using LLMs, whether and why there are discrepancies in young academic users’ performance in conducting different academic tasks, as defined by time pressure and required cognitive resources.
   • How young academic users’ perceptions of LLMs’ limitations affect their performance or strategies when using LLMs for academic tasks.
   • What young academic users’ expectations of LLMs and LLM training are when LLMs are used for different academic tasks.

Given that the younger generation has a higher acceptance of emerging technologies [9], and may lack experience in conducting academic tasks, this study targeted young scholars, specifically graduate students who have just started their academic careers as researchers. This decision was based on the fact that LLMs are new to most scholars and, based on research in other domains, new users may have highly uncertain and potentially inappropriate strategies when they first start to use LLMs [30]. Thus, understanding the strategies new users adopt can help support young scholars in better using LLM tools, or at least shorten their familiarization process by providing new users with appropriate training materials. In the study, an onsite experiment was conducted, followed by a semi-structured interview regarding the usage of LLMs in the experiment or in daily life. Together, the empirical and interview data offered a nuanced perspective on the opportunities and challenges of using LLMs for academic tasks.

2 RELATED WORKS

2.1 Natural Language Technologies

In recent years, a remarkable evolution has been happening in the field of Computational Linguistics, also known as Natural Language Processing (NLP), primarily driven by the development of neural network models trained on vast datasets [45, 68]. Compared to traditional rule-based systems, recent data-driven models have shown remarkable results across various NLP tasks [19, 53]. Deep learning techniques have become mainstream in developing these NLP models [40]. Currently popular architectures include Long Short-Term Memory (LSTM) [67] and transformer models [63].

A significant paradigm shift in natural language technologies occurred over the past half-decade, primarily attributed to the advent of large language models (LLMs) [17]. These techniques involve an initial training phase on a comprehensive dataset, followed by fine-tuning for specific tasks. Pre-trained models like BERT [15], BART [42], XLNet [66], and LLaMA [62] have demonstrated substantial performance improvements across a variety of NLP tasks.

However, the challenges of smaller models persist in LLMs. For instance, LLMs still lack an explicit factual model, which makes them prone to producing inaccurate information [34]. Even innocuous prompts can lead to the generation of toxic content from these models [24, 55]. Their performance varies, excelling in some areas while faltering in others [25]. Guiding these models to deliver specific outputs remains a challenge, leading to the emergence of prompt engineering as a sub-field [4, 45]. Ethical concerns surrounding these models are wide-ranging, from environmental to socio-political considerations [5].

Our research acknowledges the limitations of current LLMs and assumes that they cannot fully replace humans in creative writing tasks. However, they can significantly aid academic writing across various contexts to a certain extent. This perspective motivates our exploration of users’ concerns when using LLMs as a peer-level writing tool.

2.2 Large Language Models in Academic Tasks

Large Language Models (LLMs), exemplified by ChatGPT, harness broad internet-based datasets to mimic human language patterns and create realistic text [57]. This capability has attracted interest across academia. For instance, the broader implications of AI in academic research have been scrutinized by Grimaldi and Ehrler [28] and Hutson [33]. These tasks include the compilation of essential
components of the manuscript such as the abstract, introduction, literature review, methodology, results, discussion, and conclusions.

   Scholars, researchers, and students in the academic community have utilized LLMs like ChatGPT for a variety of academic and non-academic tasks. Dowling and Lucey [18] explored the application of ChatGPT and found it to be particularly effective for initial idea generation, literature synthesis, and creating testing frameworks. Yet, according to Gordijn and Have [27], ChatGPT still fails to produce a complete scientific article on par with a skilled researcher. However, it is expected that the capabilities and uses of these tools will continue to grow, and that they will become capable of conducting more academic tasks, including experiment design, manuscript writing, peer review, and editorial decision support [16]. Additionally, the ability of ChatGPT to generate and understand texts in multiple languages is believed to improve the efficiency of publishing and accessing literature for non-native English speakers [43]. In general, scientists in many fields are positive about the potential of using LLMs in academic tasks [52].

   However, the performance of LLMs in academic tasks is still less than ideal. For example, Aydın and Karaarslan [3] pointed out that when using ChatGPT for a literature review in healthcare, the content generated by the LLM still lacks synthesis and may suffer from the problem of plagiarism. In another study, Gao et al. [23] reported that abstracts generated by LLMs can still be identified as AI-generated using an AI output detector. In particular, through multiple experimental trials, van Dis et al. [16] reminded researchers to pay extra attention and remain vigilant when applying LLMs to literature comprehension and summarization tasks. However, to the best of our knowledge, no empirical research has been conducted to understand how scholars use LLMs and how LLMs can influence scholars’ performance in academic tasks. Given that the LLM can still only work as a collaborator, it is necessary to consider the characteristics of the user-LLM combined system instead of the LLM alone. Furthermore, most of the existing research has focused on the attitudes and opinions of senior researchers on the use of LLMs [2, 52]. Little research has focused on younger scholars, who may have a higher propensity to accept new technologies and lack the necessary expert knowledge to supervise the application of LLMs in academic tasks [16].

3 METHODOLOGY

We adopted a mixed approach consisting of an empirical experiment and a semi-structured interview. Quantitative performance data was gathered to evaluate the performance of, and strategies taken by, participants in different academic tasks. Post-experiment interviews focused on researchers’ evaluations of current LLM limitations in academia, their subjective understanding of the factors influencing their performance across tasks, and their concerns about integrating LLMs into their workflows.

   Recruited through online posters on social networks and on-campus posters, all participants were native Mandarin speakers. Each participant was given a unique experiment ID number from P1 to P48. Table 1 provides a comprehensive overview of the academic profiles of the participants. All participants were actively involved in academia, including 22 Ph.D. students, 17 MPhil students, and 9 research assistants (with a minimum of a bachelor’s degree). An examination of their academic publication history reveals that 35 participants had 1 to 3 publications (including journal articles, conference proceedings, and edited books); 3 participants had 4–6 publications; while 10 participants were still striving for their first publication. Moreover, according to their self-reported current research topics, we classified them into three types, i.e., AI-related, Other STEM (science, technology, engineering, and mathematics), and Social Science & Business. We tried to balance the distribution of participants’ backgrounds within each experiment condition.

   Notably, the study focused on participants with limited exposure to LLMs in their academic career, specifically those who “sometimes used LLMs for academic purposes” or less. This criterion was adopted given that frequent users of LLM tools may have developed their own strategies for using the LLM, which can hardly be controlled in the experiment. More importantly, as illustrated in previous human-automation interaction research [31, 32], new users may encounter performance and trust degradation when using unfamiliar LLMs. Thus, focusing on this group of users can help optimize the design of LLMs to better support them.

Table 1: Background Statistics of Each Group of Participants.

   Background            Type                          Group 1   Group 2
   Academic Position     Ph.D. student                    11        11
                         MPhil student                     8         9
                         Research assistant                5         4
   Publication Number    0                                 3         7
                         1-3                              19        16
                         4-6                               2         1
   Gender                Male                             15        15
                         Female                            9         9
   Usage Experience      Never used                        3         3
                         Rarely used                       9         9
                         Sometimes used                   12        12
   Research Interest     AI-related                        3         3
                         Other STEM                       13        12
                         Social Science & Business         8         9

Notes: In this table and the following tables, Group 1 refers to the group of participants who received additional training on LLM limitations, while participants in Group 2 only received basic training on how to use the LLM. In our recruitment questionnaire, the options for LLM tool usage experience were: Never used; Rarely used; Occasionally used, but not frequent; Sometimes used (about half the time).
                                    PU-MT                                                                     PU-LT
                   Group 1                        Group 2                                    Group 1                        Group 2
         LLM Usage      Grade:Time      LLM Usage      Grade:Time                  LLM Usage      Grade:Time      LLM Usage      Grade:Time
             y            90:100            n            70:100                        y            70:100            y            15:100
             y             75:80            y            65:100                        y            60:100            y             5:100
             n            70:100            y            55:100                        n            75:100            y            95:100
             y            55:100            y             75:80                        y            65:100            y            65:100
             y             85:85            y            75:100                        y            65:100            y             85:90
             y             65:90            n            60:100                        y            80:100            y            60:100
             y            80:100            y            75:100                        y            90:100            y            45:100
             y            80:100            y             85:95                        y            90:100            y             80:50
             y             80:85            n            75:100                        y            70:100            n            50:100
             y             90:95            y             90:80                        y            70:100            y            85:100
             y             80:80            y            65:100                        y             65:90            y            45:100
             y            75:100            y             55:90                        y            55:100            y            60:100
             n            35:100            y            20:100                        y            55:100            y            90:100
             y             75:75            y             85:90                        y            80:100            y            85:100
             y            50:100            y             70:70                        y            40:100            y            50:100
             y            55:100            y            65:100                        n            60:100            y            50:100
             y            30:100            y            45:100                        y            15:100            n            65:100
             y            75:100            n             50:80                        y            70:100            y            25:100
             y             85:80            y             65:90                        y            30:100            n             75:80
             n            35:100            y            85:100                        n            35:100            y            80:100
             y            50:100            n            75:100                        y            40:100            n            55:100
             y            90:100            n            55:100                        y            70:100            n            75:100
             y            65:100            y            20:100                        y            65:100            y            25:100
             y            50:100            n            45:100                        n            25:100            y            20:100
          n/y=3/21   average=67.5:94.6   n/y=7/17   average=63.5:94.8               n/y=4/20   average=60.0:99.6   n/y=5/19   average=57.7:94.8
Notes: The column ’LLM Usage’ indicates whether participants used the LLM tool to assist in completing the task: ’y’ means ’used’ and ’n’ means ’not used’ in each task. The unit of time is minutes.
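The per-group summary row in the table above (the n/y adoption counts and the average Grade:Time) can be reproduced mechanically from the raw cells. A minimal sketch in Python, using a few hypothetical rows rather than the study’s actual data:

```python
# Hedged sketch: deriving a group's summary statistics (LLM adoption
# counts and average Grade:Time) from raw "usage, grade:time" cells.
# The rows below are hypothetical examples, not copied from the study.

def summarize(cells):
    """cells: list of (usage, 'grade:time') tuples for one group/condition."""
    n_no = sum(1 for usage, _ in cells if usage == "n")
    n_yes = len(cells) - n_no
    grades, times = [], []
    for _, grade_time in cells:
        g, t = grade_time.split(":")
        grades.append(int(g))
        times.append(int(t))
    # Averages are taken over all participants in the group, users and
    # non-users alike (an assumption consistent with the table layout).
    return n_no, n_yes, sum(grades) / len(cells), sum(times) / len(cells)

rows = [("y", "90:100"), ("n", "70:100"), ("y", "75:80"), ("y", "65:100")]
n_no, n_yes, avg_grade, avg_time = summarize(rows)
```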
conducted the evaluation independently. An Intraclass Correlation Coefficient (ICC) analysis was conducted, and the two raters reached an ICC of 0.94 (95% CI = [0.91, 0.97], p < .0001), which indicates high consistency and inter-rater reliability of the grades (i.e., from 0 to 100). The guiding principles employed for scoring, as well as detailed experimental materials, can be found in Appendix B. Other task performance metrics in the empirical experiment part of the study include task completion Time (%) (i.e., the actual completion time / the time allowed for the current task) and the LLM tool adoption rate (i.e., the number of participants in each group who used the LLM during that task / the total number of participants in each group). The criterion for LLM tool usage was whether participants had fully accepted responses from the LLM in their answers while fulfilling each task.

   For the quantitative analysis, in order to quantify the combined effects of participants’ backgrounds and the controlled experiment conditions, regression analyses were performed in “SAS OnDemand for Academics”. Mixed linear regression models (using Proc MIXED) were built for the two continuous dependent variables (Time and Grade), which included all demographic factors, the three experimental conditions, and their two-way interactions as independent variables. Repeated measures were accounted for through a generalized estimating equation, which can be used to model multiple responses from a single subject. Backward stepwise selection procedures were employed based on model fitting criteria, and the Variance Inflation Factor (VIF) was used to mitigate the issue of multicollinearity. To examine the significance of variables within each sub-structure, Tukey-Kramer post-hoc tests [38] were conducted. Variables demonstrating a significance level of p < .05 were considered statistically significant in the analyses.

3.4.2 Qualitative Analysis. As for the answers in the semi-structured interview, Figure 1 illustrates the coding framework and its corresponding themes. We transcribed interviews from 48 participants using automated transcription software⁶, followed by content calibration to ensure alignment between the original audio and the transcribed text. Our approach blends the strengths of qualitative and quantitative analysis to investigate textual content. This dual approach not only facilitates more robust inferences but also opens avenues for additional reflection, hypothesis refinement, and further investigation [47].

   To gain a deeper understanding of the interview content, two researchers (co-authors of this paper) identified several topics of interest based on the research questions and interview outline, including training, academic task types, pressure, concerns, and individual differences. They independently read all the interview texts and extracted segments related to these topics. At the same time, they performed open coding (i.e., taking apart the information collected, assigning concepts, and then reassembling it in new ways) and

6 https://www.feishu.cn/product/minutes
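For reference, the inter-rater ICC reported above can be computed from two-way ANOVA mean squares. The paper does not state which ICC form (consistency vs. absolute agreement) was used, so this Python sketch, run on hypothetical scores, returns both single-rater variants:

```python
# Hedged sketch: inter-rater reliability via a two-way ICC computed
# from ANOVA mean squares. Pure Python; no statistics package assumed.

def icc_two_way(scores):
    """scores: one row per graded answer, one column per rater."""
    n = len(scores)                          # rated subjects (answers)
    k = len(scores[0])                       # raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                  # between-subjects mean square
    msc = ss_cols / (k - 1)                  # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))       # residual mean square
    icc_c = (msr - mse) / (msr + (k - 1) * mse)                        # ICC(3,1), consistency
    icc_a = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # ICC(2,1), agreement
    return icc_c, icc_a

# Hypothetical grades (0-100) from two raters for five answers.
grades = [[90, 85], [75, 70], [35, 40], [80, 80], [55, 60]]
icc_consistency, icc_agreement = icc_two_way(grades)
```

A sanity check on the formulas: if one rater’s scores are a constant offset of the other’s, the consistency ICC is exactly 1 while the absolute-agreement ICC is penalized for the offset.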
                                    LR-MT                                                           LR-LT
                   Group 1                        Group 2                          Group 1                        Group 2
         LLM Usage      Grade:Time      LLM Usage      Grade:Time        LLM Usage      Grade:Time      LLM Usage      Grade:Time
             y             50:85            y             46:90              y             65:70            y             26:90
             y             58:85            y            46:100              y            33:100            y            40:100
             y            41:100            y             49:60              y            26:100            y            60:100
             y            46:100            y             59:55              y            17:100            y            34:100
             y             58:75            y             50:75              y            50:100            y            18:100
             y            66:100            y             55:90              y            46:100            y             41:80
             y             52:95            y             42:70              y            31:100            y            27:100
             y            41:100            y            38:100              y            36:100            y             50:60
             y            31:100            y             32:55              y            33:100            y             47:40
             y            26:100            y            50:100              y            11:100            y            32:100
             y             41:65            y            48:100              y             41:70            y            34:100
             y             53:85            y             42:75              y             49:50            y            45:100
             y            12:100            y             42:75              y            23:100            y            16:100
             y            43:100            y            50:100              y            29:100            y            47:100
             y             15:85            y             55:80              y            12:100            y            38:100
             y            30:100            y            26:100              y            33:100            y            30:100
             y            12:100            y            39:100              y             3:100            y            16:100
             y            83:100            y            28:100              y            80:100            y            44:100
             y             41:75            y            17:100              y            51:100            y            16:100
             y            17:100            y            40:100              y            19:100            y             3:100
             y            64:100            y            37:100              y            63:100            y            45:100
             y            39:100            y            37:100              y            49:100            y            45:100
             y             69:60            y            10:100              y            67:100            y            23:100
             y            13:100            y             33:80              y            14:100            y            33:100
          n/y=0/24   average=41.7:92.1   n/y=0/24   average=40.5:87.7     n/y=0/24   average=36.7:95.4   n/y=0/24   average=33.8:94.6
when they were conducting the literature review task. This indicates
that scholars presented varied preferences for the use of LLM tools
on different tasks. Referring to [13], such attitudes may be determined
by the perceived ease of use and usability of the tool, which we will
discuss further in the qualitative analysis.
   Second, to better model the influence of users' backgrounds, the
three experimental conditions, and their interaction effects, we built
two models for participants' Time (%) and Grade. Referring to Table 4,
we found that the type of training, task type, and time pressure were
significant predictors of time spent on task; task type and time
pressure were influential factors for grades. Specifically, as shown
in Table 5, one would spend more time and gain higher scores when
conducting a paper understanding task than when conducting a literature
review task. At the same time, people under higher time pressure spent
a higher percentage of time on tasks but obtained lower scores. We also
found that the training made a difference: participants who received
limitation-related training spent more time on task than those who
received only basic training, while no significant effects of training
were observed on grades. Finally, two significant interaction effects
related to LLM usage experience were identified. We found that, within
the group without limitation-related training, participants who had more
experience in utilizing LLMs in academic tasks spent more time on tasks
than those who used LLMs less frequently. At the same time, when
conducting paper understanding tasks, more experienced LLM users were
consistently more likely to obtain higher grades than less experienced
LLM users.

4.2    Qualitative results from semi-structured interview
We extracted four categories of topics from the interview: training
for initial users, variations in two academic tasks, strategies under
time pressure, and concerns about LLM tools. Each category was
further divided into three subtopics, which encompass the common
themes emerging from participants' responses. Figure 2 illustrates
the detailed statistics. Through coding and discussing diverse topics,
we aim to delve into participants' attitudes, strategies, and
reflections on various aspects of LLM tools.

4.2.1 Training for initial users. The majority (47/48) of participants
would like to obtain some kind of guidance or training before using
the LLM tools, but one subject explicitly stated that she did not
need to know any information or knowledge to use the LLM tool
for the first time and could use it in a straightforward way. A
total of 16 types of information that participants wished to know
before using the LLM were identified, and these were categorized
into 6 themes through thematic analysis. These categories, listed in
descending order of frequency in Figure 2, are pre-use techniques,
features and limitations of LLM, basic methods and operations to
use LLM, ethics and compliance, historical or current tool development,
and others. In addition to the most frequently mentioned "questioning
techniques", many participants emphasized the importance of crafting
effective "prompts." For instance, P43 said, "If you don't use prompt
engineering and instead express yourself naturally, there's a good
chance you won't get what you're looking for; if you don't get the
results you intended, LLM is not very useful in academic assignments."
It is noteworthy that although 16 participants felt that understanding
the limitations or flaws of the LLM tool was necessary, only 2 of them
considered it the most crucial skill when using LLM for academic tasks.

CHI '24, May 11–16, 2024, Honolulu, HI, USA                                        Wang et al.

Figure 2: Interview quantitative statistical data. The categorization of coding and theme group corresponds to Figure 1.

   When discussing the learning resources for the LLM tool, we
found that the official guides or documentation provided by the LLM
tools were not the primary learning resources. Only 4 participants
indicated that they would read or watch the official learning materials.
Most individuals tended to rely on third-party educational resources
when acquiring new skills. P23 mentioned that he would learn how to
use LLM tools through some user-generated content platforms; P38 said
that he would check out posts shared on the Internet; P30 emphasized
the role of watching reviews of LLM tools on short video platforms;
and P40 said that he would check online forums or use a search engine
to find relevant information.
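The two regression models described earlier (Tables 4 and 5) regress time spent and grade on the dummy-coded experimental factors. The paper does not publish its analysis code, so the following is only an illustrative sketch: the column names, coding, and effect sizes are all invented, and the simulated outcomes merely mirror the reported directions (PU tasks take more time and earn higher grades; high pressure raises the time share but lowers grades; limitation-related training adds time without changing grades).

```python
import numpy as np

# Hypothetical dummy-coded trial data: training (0 = basic only,
# 1 = limitation-related), task (0 = literature review, 1 = paper
# understanding), pressure (0 = low, 1 = high time pressure).
rng = np.random.default_rng(0)
n = 96
training = rng.integers(0, 2, size=n)
task = rng.integers(0, 2, size=n)
pressure = rng.integers(0, 2, size=n)

# Simulated outcomes loosely mirroring the reported effect directions.
time_pct = 80 + 5 * task + 8 * pressure + 4 * training + rng.normal(0, 3, n)
grade = 40 + 6 * task - 5 * pressure + rng.normal(0, 3, n)

# Design matrix with an intercept; one least-squares fit per outcome.
X = np.column_stack([np.ones(n), training, task, pressure])
beta_time, *_ = np.linalg.lstsq(X, time_pct, rcond=None)
beta_grade, *_ = np.linalg.lstsq(X, grade, rcond=None)

print("time model coefficients :", beta_time.round(2))
print("grade model coefficients:", beta_grade.round(2))
```

In the study itself, such models would also include participant-level random effects and significance tests per predictor; the sketch only shows the fixed-effect structure of the two fits.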
Evaluating Large Language Models on Academic Literature Understanding and Review                                     CHI ’24, May 11–16, 2024, Honolulu, HI, USA
    To compare the effect of different training, we provided general
training for all participants and conducted limitation-related training
for half of them, in which we emphasized the shortcomings, limitations,
and academic integrity issues related to the LLM tool. By comparing
these two training methods, we found that individuals who did not
receive the limitation-related training expressed greater satisfaction
with the actual effectiveness of the tools, and a higher percentage of
them (83.3% versus 62.5%) believed that the LLM tool provided important
assistance in completing the tasks. Further, participants who did NOT
receive limitation-related training mentioned more content outside of
our 'limitation-related training'; e.g., they mentioned limitations of
content generation more frequently in the semi-structured interview
than those who received the training. For example, P8 said, "The current
training data of LLM is also based on a more general data site, so my
current experience is that there is still a lack of specialized
knowledge. The generated answers are still limited and not professional
enough."

4.2.2 Variations in the role of LLMs in different academic tasks.
Only 27% of individuals stated that the LLM tool was only marginally
useful in assisting the two academic tasks in the experiment, while
the rest of the respondents indicated that the LLM tool was useful to
some extent. For example, P37 said, "I find the ChatGPT very helpful,
especially for summarizing existing literature and quickly locating
answers. I think it's incredibly useful."
    Regarding the types of assistance gained from the LLM tools, we
categorized them into six primary themes using thematic analysis,
ordered by frequency from high to low: 1. literature summarization,
which aims to help users understand and summarise the content of the
literature; 2. information retrieval, which helps find relevant
information for a specific problem or gives advice on how to solve it;
3. linguistic optimization, which involves polishing texts and
correcting grammar, spelling, and expression; 4. data analysis, which
helps users process and analyze data; 5. writing aids, which support
users with writing inspiration, content continuation, and so on;
6. framework establishment, which helps users create a framework or
structure to present their ideas or research results. Figure 2 presents
detailed data on these six themes.
    Information search is a noteworthy feature of LLM tools, which
is believed to have the potential to replace search engines and
encyclopedias. As P15 said, "I study chemistry, and when I come
across some unfamiliar compounds, I will ask the LLM tool directly,
which is more accurate and direct than the results obtained from a
search engine." P8 also mentioned that "asking questions to the LLM
tool is like asking a Wikipedia." It is worth mentioning that a few
participants (2/48) mentioned the assistance of LLMs in personalized
tasks (e.g., language translation, coding). For example, P37 mentioned
that "the LLM tool can judge my solutions, then identify some
shortcomings, and help me to correct them."
    It is also interesting that participants exhibited significant
divergence regarding the role of the LLM tool in different tasks. As
illustrated in Figure 2, 21 respondents believed that the LLM was
more helpful in assisting literature review tasks than paper
understanding tasks. P19 said, "I think it (LLM) was more useful in
the literature review; it not only helps us target some key information
but at the same time relieves us of writing burdens." In contrast,
16 participants held an opposite view, and the remaining 6 participants
expressed uncertainty about the comparison of the role of the LLM in
the two tasks used in the experiment. For example, P21 mentioned that
"The LLM is more useful in supporting paper understanding. The LLM tool
can give me a general outline. It can explain terms I don't understand,
and it also can summarise the paper a little bit." "When it comes to the
literature review, I think it's better to refer to relevant published
literature reviews that are more capable or conduct this myself, instead
of referring to a bunch of literature summarized by the current LLM tool."
    In addition, some thought-provoking ideas were identified. For
example, P41 emphasized that "Reading a paper is a process of
comprehension, and the use of the LLM tool removes this purpose."
Implicitly, there is a concern that LLM tools may negatively impact
one's capability in reading and comprehending papers. In contrast,
P13 indicated that LLM can help the comprehension process, mentioning
that using LLM for paper reading is like "going on a treasure hunt with
a treasure map", highlighting the function of LLM tools as an aid.
Nevertheless, although the LLM tool can guide and speed up the paper
reading task, a deeper comprehension of the paper still requires the
involvement of one's personal reflection.

4.2.3 Strategies under time pressure. Under different levels of time
pressure, there are significant divergences in the impact of LLM
tools on paper understanding and literature review tasks. 21 out
of 48 participants felt that the LLM tool was more useful under
less time pressure than under high time pressure. For instance,
P40 said, "During the literature review, the tool is more useful when
20 minutes were allowed for the task." At the same time, 10 out of 48
participants felt that the LLM tool worked better under higher time
pressure than under low time pressure. The remaining 17 participants
thought that the time pressure did not make a difference.
    At the same time, referring to Figure 2, 23 out of 48 participants
specifically compared the role of the LLM tool in completing the
paper understanding task under different time pressures. Among
these 23 participants, 6 believed that time pressure would not affect
the completion of the paper understanding task. In comparison,
2 participants stated that they did not use the LLM tool at all in the
paper understanding task regardless of time pressure. Further, 24 out
of 48 participants specifically compared the role of the LLM tool in
completing the literature review task under different time pressures.
Nearly half (11/24) of these participants indicated that the LLM tool
would be more effective under low time pressure, while 5 participants
held the opposite opinion.
    These discrepancies and divergences also led to variations in
participants' attitudes toward tasks under different levels of time
pressure. We employed creative coding to differentiate these attitudes
and found that under lower time pressure, participants tended to exhibit
more positive attitudes toward LLM. When the time pressure was low,
only 3 respondents regarded the help from LLM tools as negligible,
while the rest held positive attitudes toward LLM in accomplishing
academic tasks. It is likely that as the time pressure reduced,
participants could engage more in introspective thinking (e.g.,
contemplating ways to ask questions, strategies for using LLM, and
double-checking the accuracy of the responses generated by LLM), which
was mentioned in the interview.
publication mentioned copyright concerns proactively. Even after
being reminded, only 2 (out of 10) of them said they would consider
copyright as an issue. Further, only one of them mentioned academic
integrity issues when using LLM. In contrast, for participants who had
at least one publication, a larger portion (23 out of 38) regarded
copyright or academic integrity as a potential issue of using LLM.
For example, P46 said, "There may be academic misconduct …… I'm also
afraid that my intellectual property will be compromised. I prefer not
to send the paper I'm working on directly to LLM. Instead, I'll probably
send small segments and have the GPT do some writing polishing." It
seems that researchers who received more extensive academic training
were more aware of violating academic rules when an AI-based tool was
used.
    Regarding the issues in the design of the LLM, the majority of
participants believed that the current design of LLM tools does not
provide users with sufficient information. Among the 48 participants,
33 expressed that the current design of LLM tools does not provide
official guidance on how to use prompts efficiently. For example,
P44 said, "The design only offers an interface for input and output,
but it doesn't provide specific guidance on how to better utilize and
master the tool. Most of the learning comes from seeking information
through other channels." Similarly, P31 said, "The interface is very
simple, and the content is quite brief. It doesn't provide me with
proper guidance." However, some interviewees held different opinions.
They believed that the simplicity of the interface makes the tool easy
to operate, as P39 mentioned, "The LLM tool itself is quite simple.
After having several conversations with it, you naturally become
familiar with the pattern. It doesn't require excessive design."

5 DISCUSSION

5.1 Strategies in using LLMs for different academic tasks
Combining results from quantitative and qualitative analysis, our
research indicates that young scholars performed better when using
LLM tools for paper understanding (PU) tasks than for literature
review (LR) tasks. However, young scholars spent more time on and
had lower intentions to use LLMs for PU tasks versus LR tasks. In
the field of human-computer interaction, it is widely recognized that
user reliance on automation can be moderated by task complexity [54].
Similarly, in our experiments, compared to the PU task, where the
information source was known, participants needed to search a wider
range of unknown sources in LR tasks. Further, most participants
perceived LLM tools as being good at handling complex tasks such as
developing process frameworks. Both may explain why participants
relied on LLM more in LR tasks, especially when the time pressure was
high, as many participants felt that copying and typing text from PDF
files into LLM in PU tasks was more complicated and time-consuming
than the procedures in LR tasks. However, it should be noted that,
given the limitations of the LLM we provided in the experiment (i.e.,
source of bias [2]), current LLM tools cannot provide the most
up-to-date results in LR tasks. Thus, the LLMs can provide limited
assistance in LR tasks, which may explain why participants obtained
lower scores in LR tasks (which were judged based on scoring standards
such as the number of references, source accuracy, and citation
quality; please refer to the supplementary materials for details) than
in PU tasks.
the task once, I gained experience or a sense of how to finish this task
more quickly." Participants were likely to become more familiar with
the experimental process, leading to better comprehension of responses
and optimized strategies. This familiarity can also play a role in
real-world academic tasks, manifesting as increased efficiency when
conducting similar tasks or using LLM tools repeatedly. Another
intriguing discovery was that some participants reduced their
interactions with the LLM tool due to time pressure. They stopped
scrutinizing the generated content before adopting it, which
paradoxically led to early task completion. For example, P44 said,
"Because it might just be time pressure, I didn't expect as much from
the LLM tool. So I didn't bother to make any further adjustments to the
answer, and finished the task ahead of schedule." This raises concerns
about over-reliance on AI tools in high-pressure situations [8, 54].
    We also found that participants can adaptively change their
strategies when conducting different types of academic tasks. To
further reveal the participants' strategies, a flowchart was obtained
by summarising the interview data and the experimenter's report of
observation notes [21]. We combined various factors such as interview
transcripts, task materials, task completion time, scores, and
interaction styles to create a basic flowchart showing the process of
completing the task for most of the participants (Figure 4). In the
figure, we aggregated and abstracted the steps the majority of
participants took.
    In general, at the beginning of a task, participants would judge
whether they needed help from LLM tools, taking the task type, the
difficulty of the task (as moderated by the time pressure), and their
capabilities into consideration. Then, during their interactions with
LLM tools in tasks, participants might repeatedly modify their
strategies (e.g., adjusting the context in their prompts) to optimize
the LLM-generated results. Different strategies were adopted for
different types of tasks. Specifically, participants were highly
uniform in their strategies when using LLM tools for the PU tasks.
Most of them would divide the articles into small segments and ask
questions based on the segments. When conducting LR tasks, participants
chose more diverse strategies. For example, some participants asked
the LLM to generate a complete review; others only let the LLM generate
the outline. Some participants even chose to provide a framework for
the LLM to refer to. Figure 4 depicts the two strategies that the
participants used the most in LR tasks. Finally, another difference in
strategies between the LR and PU tasks was that participants usually
used LLM throughout the whole task procedure for the PU task, whereas
participants preferred to conduct self-modification and refinement of
the responses generated by LLM tools in LR tasks. It is likely that the
participants had different levels of trust in the LLM tool when
completing different tasks, which led to different levels of reliance
on the LLM tool in the tasks.
    In addition, when conducting PU tasks, participants were more
inclined to complete the task on their own than when conducting the LR
(see Tables 2 and 3). Further statistical tests show that those who did
not use the LLM tool obtained lower scores and took a significantly
longer time to complete the task (paired t-test, p<.0001). Based on the
interviews, we found two potential reasons explaining the low usage
rate of LLM in PU tasks. First, participants were confident in their
ability to comprehend the scientific literature; second, they were
trying to avoid the deterioration of their learning ability as a result
of over-reliance on LLM. We speculate that the abandonment of LLM in
tasks may also be related to one's personality traits and the
early-stage scholars' wish to develop their skills for future academic
success. However, in this experiment, we cannot validate these
assumptions given that only early-stage scholars participated in the
experiment, and future experiments are needed.

5.2    Strategic choices under time pressure
Time pressure can influence researchers' strategies when using
the LLMs and their attitudes toward LLM tools. Although previous
results have shown that time pressure can affect the strategies that
users adopt to learn new knowledge or skills [50], it is still unknown
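The paired comparison reported above (LLM users versus non-users on scores and completion time) relies on a paired t-test. A minimal, dependency-free sketch of the statistic follows; this is not the authors' analysis code, and the numbers in the toy example are made up.

```python
import math

def paired_t(x, y):
    """Return the paired t statistic and degrees of freedom for two
    matched samples (e.g., one value per participant per condition)."""
    assert len(x) == len(y) and len(x) > 1
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)
    se = math.sqrt(var_d / n)  # standard error of the mean difference
    return mean_d / se, n - 1

# Toy example: four participants' scores with and without the tool.
t, df = paired_t([5, 6, 7, 8], [3, 4, 5, 7])
print(t, df)  # t = 7.0 on 3 degrees of freedom
```

Converting t to a p-value (the paper reports p<.0001) requires the t distribution, e.g. `scipy.stats.t.sf`, which is omitted here to keep the sketch free of external dependencies.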
Figure 4: Flowchart of conducting two tasks with LLMs. Notes: This flowchart only summarizes the basic processes for the two
tasks in the experiment. The actual behaviors of participants were more diverse when under different levels of time pressure,
which are presented in Figure 3.
how time pressure may influence one's strategies when an AI-based
assistant, the LLM tool, is available for academic tasks. The
qualitative analyses in our study indicate that with relatively low
time pressure, researchers exhibited a more positive attitude toward
the LLM and were more confident in fulfilling tasks using LLM tools.
In contrast, under high time pressure, most researchers showed a
more hesitant and negative attitude toward using LLM for academic
tasks. It is possible that researchers still have concerns over the
capability of LLM. Thus, under high time pressure, participants
tended to adopt more conservative methods rather than use new
tools [64], so that they did not need to double-check the content
generated by LLM.
    However, users' attitudes toward the LLM tools may not directly
reflect their choice of strategies. The observational data in our study
show that, in PU tasks, with low time pressure, participants were
more likely to abandon LLM tools, whereas under high time pressure,
participants exhibited higher LLM tool usage rates. At the same time,
under high time pressure, some participants chose to skip verifying
the responses generated by LLM tools. This indicates that when faced
with more urgent deadlines, researchers may prioritize efficiency over
skepticism and potentially sacrifice the quality of their work. This
result is in line with previous findings in the human-automation
interaction domain, which suggested that external stressors of the
tasks may influence users' adoption of new technologies [41].
Specifically, when users are under a high workload or in a stressful
situation, they tend to rely more on the technologies, even if they do
not fully trust them. Thus, LLM tool designers should carefully balance
efficiency and effectiveness to better support users. For instance, the
trade-off between response speed and the accuracy of the responses
could be made customizable to better cater to users' needs under
different levels of time pressure.
5.3    Users' attitudes and concerns of LLM
In general, researchers hold a positive and forward-looking attitude
toward LLM tools. They mentioned more about the functionality of LLM
tools and how to effectively utilize them than about the limitations
of the tools. On the one hand, this is an encouraging discovery, as it
suggests that young scholars focused more on harnessing the benefits
of LLM tools than on dwelling on the shortcomings, and thus they may
be more willing to use them. On the other hand, this may lead to misuse
of the LLM tools. For example, during the interviews, although most
participants mentioned their concerns about the limitations of the LLM
tool (similar to the findings from [2, 29, 52]), very few participants
could comprehensively and systematically acknowledge the constraints
and boundaries of LLM tools, and even fewer participants were aware of
the potential privacy and copyright issues of the LLMs. In particular,
those with little academic experience (i.e., with no academic
publications) were inclined to overlook the potential personal privacy,
academic copyright, and ethics issues caused by LLM tools. This finding
provides a different perspective on the opinions of adopting the LLMs
for academic tasks compared to the previous study, which focused more
on senior scholars [52]. Additionally, young scholars may intentionally
choose to ignore the limitations of the LLMs, similar to how human
beings rely on heuristics to make decisions in urgent situations [58].
In our study, under time pressure, some participants indicated that
they intentionally ignored the deficiencies of LLM tools, even when
they were aware of these issues. For instance, when striving to
complete PU tasks under high time pressure, some participants indicated
that they lowered their expectations of the performance of the LLM tool
and might cease to verify the content generated by these tools.
    The associations among academic experience, attitudes toward LLMs,
and strategies when using LLM indicate that, in the academic community,
users' willingness to use the LLM tools is a dynamic process, and there
is a chance that young scholars prefer to use the LLM tools, especially
under high time pressure. Thus, LLM tool designers should try to make
the users aware of the limitations and boundaries of LLM tools so that
the users can use the LLMs
not give users enough hints and feedback. It is recommended that the
functionality and interface of the LLM be improved so that it is more
suitable for different academic tasks. For example, the option of
"Prompt" for different scenarios could be proactively provided to
reduce the learning cost of the researcher.
    More alarmingly, some participants overstated the abilities of LLM
tools (i.e., overlooked potential limitations or risks of LLM usage in
academic tasks), which coincides with several voices supporting the use
of LLM tools in academic tasks [51]. However, overestimating the
capabilities of the LLMs may lead to over-reliance on LLM tools. From
an academic performance perspective, this may result in erroneous or
inaccurate conclusions in academic tasks. From an educational
perspective, this may negatively impact young scholars' critical
thinking and academic skills, potentially affecting their overall
academic development. This finding points to another important topic
in LLM usage: the training of the users.

5.4    The role and future improvement of training
Training is pivotal for the appropriate use of the LLM tool. Previous
research has pointed out that it is important to train users to refine
their mental models and subsequently facilitate user-LLM collaboration
performance [65]. Our study reveals that individuals who received
limitation-related training expressed lower satisfaction with the
effectiveness of the LLM tool and discussed the accuracy of the
LLM-generated responses more in the post-experiment interview. This
implies that the trained individuals were more skeptical of the content
generated by the LLM, which may explain why, among those who had never
used LLM, those who received limitation-based training spent more time
on academic tasks than those who did not receive limitation-related
training.
    It is also interesting to notice that, in addition to the
experience imparted in the limitation-related training, users could
also gain experience during interactions with LLMs before the
experiment. For example, we found that compared to those who had little to
more effectively and responsibly. For instance, appropriate system                 no prior experience with the LLMs, the participants who had rel-
transparency [37, 49] can be an effective way to address the con-                  atively more experience (i.e., those who self-reported sometimes
cerns regarding the accuracy of tool outputs. Specifically, designers              using LLMs, of which 12 of them received limitation-based train-
can incorporate features such as confidence scores or explanatory                  ing and 12 did not) with the LLM tended to be more aware of the
annotations [60] in the responses generated by the LLMs, which                     strengths and weaknesses of the LLM tool and tried to find the
would help users better understand the reliability of the generated                best strategies when using LLM tools. Specifically, they adjusted
content so that the users can make more informed decisions when                    their interactions with LLM tools more constantly compared to
using the tool, even under high time pressure. On the other hand,                  those who had less experience and they also provided more insight-
before adopting the LLMs, young scholars should also receive train-                ful comments on LLMs. For example, 7 out of the 12 LLM users
ing regarding the limitations of LLM tools so that they can make                   who did not receive limitation-based training mentioned that they
more informed decisions on when and how to use the LLM tools.                      double-checked and double-examined the answers generated by
    At the same time, from the perspective of Human Machine Inter-                 LLM, which cost additional time in the task. In contrast, users who
face (HMI) design for LLM, understanding the differences in users’                 lacked LLM experience and did not receive limitation-related train-
performance and level of reliance can aid in designing LLM tools to                ing tended to show low confidence in the LLM tools. In particular,
fit the needs of different academic tasks and improve users’ satisfac-             in the study, among the 3 participants who had no LLM experience
tion with the LLM tools. Sakirin et al. showed that users preferred                and did not receive limitation-related training at the same time, 2
to use the dialogue interface supported by the LLM tool [59], but                  of them abandoned LLM tools during the PU task, and they did not
our study found that the interface of the LLM tool did not provide                 adopt the strategies most experienced users would take (as shown
users with enough information, and the oversimplified design may                   in Figure 4) in LR tasks. The above findings indicate that limitation-
                                                                                   related training can not only shorten the period that users may take
CHI ’24, May 11–16, 2024, Honolulu, HI, USA                                                                                           Wang et al.
to develop appropriate strategies to use new technologies, but also help promote the adoption of LLMs among new users.

   Unfortunately, the training or materials provided by the official providers of the LLM tools may not be enough. Many participants reported receiving their training from third-party platforms on the Internet rather than from official sources. This could be attributed to uncertainties regarding the comprehensibility of official documentation or the usability issues of the official documents. Such a phenomenon has also been observed in other domains. For example, researchers found that a very low portion of users read the manual of their vehicles regarding driving automation [20]. Therefore, it is recommended that LLM developers or maintainers explore better ways to present necessary information to new users, or actively engage with relevant online forums and social media groups to assist users in addressing their usage-related queries. However, it should be noted that the training methods adopted in our study are still preliminary, and future research should continue to optimize the training methods and content, and better incorporate them into the LLM tool design to improve users' performance in academic tasks with the LLM tools.

   In particular, for the PU tasks, the results showed that most early-stage scholars would prefer to read and understand the literature on their own, as they did not want to “rely too much on the LLM to constrain their learning ability”. Hence, future LLMs can provide more translation or search functions for key information in PU scenarios. For LR tasks, designers should try to reduce the chances of noisy responses appearing, or provide confidence scores [60] for the LLM-generated responses. Personalized training and support services may also be necessary to help participants make better use of LLM tools. By tailoring support based on researchers' experience, proficiency with the tools, and the type of tasks they are undertaking, individualized assistance can be provided. This could include training on specific usage techniques and strategies for a particular type of task, thus enabling scholars to perform better in their academic endeavors.

6    LIMITATIONS
We recognize that although our study has followed standards in the field of human-computer interaction to some extent [10], there are still some limitations. First, as we intentionally limited our targeted user group to young scholars, the findings may not generalize well to the senior academic community. Users with different levels of familiarity with academic tasks may hold different attitudes toward AI-based tools and may adopt different strategies for using them. In future research, senior scholars with different backgrounds should be recruited. Secondly, limited by the sample size, we had to focus on two types of common but typical academic tasks in a single academic domain in this experiment, which may not cover all scenarios in which LLM tools are used for academic tasks. We also only considered the task difficulty controlled by the time pressure. In daily academic tasks, task difficulty may be moderated by many factors. Future research may consider introducing more types of academic tasks (e.g., academic writing, data analysis, and experimental design) from more academic domains, and modeling the influence of other task-difficulty-related factors to assess […]. Further, as an empirical study, though we tried to replicate realistic scenarios in daily life, the scenarios the users encountered were still artificial to some level, and users may have biased behaviors in the experiment. Future research may consider observational studies to better reveal the strategies users may adopt when LLMs are used for academic tasks. Lastly, considering the rapid development of LLMs, more advanced models or interfaces are being introduced (e.g., GPT-4V⁷, Semantic Reader⁸). In this work, we were not able to adopt these up-to-date tools, as they were not publicly accessible when our experiment began. Thus, readers should be aware that some findings in our study may not apply to some emerging LLM tools, and future assessments of how users' behaviors change adaptively with the evolution of LLM tools are needed.

7    FINDINGS AND CONCLUSIONS
In this study, we conducted an empirical study involving 48 early-stage scholars to understand how LLM tools can be utilized for academic tasks and how they affect early-stage scholars' workflow. Specifically, we discussed the influences of user perspectives on LLMs, evaluated users' performance when using LLMs in two typical but different academic tasks, and analyzed the influence of time pressure in these tasks. Besides, the qualitative analysis based on a post-experiment interview revealed the strategies users adopted when using LLM tools. In general, several key findings are summarized as follows:

    • We found that young scholars can adaptively change their strategies when using the LLM for different tasks. Specifically, we observed more diverse questioning styles and less reliance on the LLM tools when the LLM was used for LR tasks, while a more monotonic strategy was observed when the LLM was used for PU tasks. Future LLM design may consider customizing the tool to better satisfy users' needs in different scenarios.
    • Time pressure can influence users' attitudes toward the LLM tools and the strategies they take to cooperate with the LLM. However, the strategies they took may not necessarily match the attitudes they held. High time pressure led to declined attitudes toward the LLM, but increased the adoption rate of the LLM. It is likely that the users were not satisfied with the performance of LLM tools, but they had to use them to reduce the time pressure. Future LLM tools may need to allow users to customize the LLM tools to reach a balance between accuracy and efficiency.
    • Young scholars had an overall positive attitude towards LLMs in academic tasks, but due to their lack of academic experience, they were also inclined to ignore the academic ethical and privacy risks introduced by LLM tools, and tended to voluntarily give up their attention to the risks from LLMs when the complexity of the task increased. Thus, training might be necessary to support better use of LLM tools among young scholars.
    • We investigated the effect of a specific training program, the limitation-based training, on users' attitudes, strategies, and performance in academic tasks when using the LLM tools. The results

⁷ https://openai.com/research/gpt-4v-system-card
      show that training can increase users' awareness of the limitations of the LLMs and lead to more appropriate strategies when using LLM tools, similar to what experience with LLMs can do. Given that users criticized the current LLM developers for not providing adequate training materials, more authoritative and comprehensive online training materials for LLMs are expected in the future.

ACKNOWLEDGMENTS
This work was supported by the Guangzhou Municipal Science and Technology Project (No. 2023A03J0011) and the Guangzhou Science and Technology Program City-University Joint Funding Project (No. 2023A03J0001). We first appreciate the assistance provided by Wenbo Zhang throughout the entire experiment. We also would like to express our gratitude to all participants for their valuable time and to the reviewers for their helpful feedback on our paper.

REFERENCES
 [1] Ian L Alberts, Lorenzo Mercolli, Thomas Pyka, George Prenosil, Kuangyu Shi, Axel Rominger, and Ali Afshar-Oromieh. 2023. Large language models (LLM) and ChatGPT: what will the impact on nuclear medicine be? European Journal of Nuclear Medicine and Molecular Imaging 50, 6 (2023), 1549–1552.
 [2] Muath Alser and Ethan Waisberg. 2023. Concerns with the usage of ChatGPT in Academia and Medicine: A viewpoint. American Journal of Medicine Open 100036 (2023).
 [3] Ömer Aydın and Enis Karaarslan. 2022. OpenAI ChatGPT generated literature review: Digital twin in healthcare. Available at SSRN 4308687 (2022).
 [4] Stephen H Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, et al. 2022. PromptSource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279 (2022).
 [5] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 610–623.
 [6] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).
 [7] Petter Bae Brandtzaeg and Asbjørn Følstad. 2018. Chatbots: changing user needs and motivations. Interactions 25, 5 (2018), 38–43.
 [8] Dawn Branley-Bell, Rebecca Whitworth, and Lynne Coventry. 2020. User trust and understanding of explainable AI: Exploring algorithm visualisations and user biases. In International Conference on Human-Computer Interaction. Springer, 382–399.
 [9] Tim Broady, Amy Chan, and Peter Caputi. 2010. Comparison of older and younger adults' attitudes towards and abilities with computers: Implications for training and learning. British Journal of Educational Technology 41, 3 (2010), 473–485.
[10] Kelly Caine. 2016. Local standards for sample size at CHI. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. 981–992.
[11] Lydia Carson, Christoph Bartneck, and Kevin Voges. 2013. Over-competitiveness in academia: A literature review. Disruptive Science and Technology 1, 4 (2013), 183–190.
[12] John Joon Young Chung, Wooseok Kim, Kang Min Yoo, Hwaran Lee, Eytan Adar, and Minsuk Chang. 2022. TaleBrush: visual sketching of story generation with pretrained language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–4.
[13] Fred D Davis. 1989. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly (1989), 319–340.
[14] Ismail Dergaa, Karim Chamari, Piotr Zmijewski, and Helmi Ben Saad. 2023. From human writing to artificial intelligence generated text: examining the prospects and potential threats of ChatGPT in academic writing. Biology of Sport 40, 2 (2023), 615–622.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[16] Eva A. M. van Dis, Johan Bollen, Willem Zuidema, Robert van Rooij, and Claudi Bockting. 2023. ChatGPT: five priorities for research. Nature 614 (02 2023), 224–226. https://doi.org/10.1038/d41586-023-00288-7
[17] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. Advances in Neural Information Processing Systems 32 (2019).
[18] Michael Dowling and Brian Lucey. 2023. ChatGPT for (finance) research: The Bananarama conjecture. Finance Research Letters 53 (2023), 103662.
[19] Armin Esmaeilzadeh and Kazem Taghva. 2022. Text classification using neural network language model (NNLM) and BERT: An empirical comparison. In Intelligent Systems and Applications: Proceedings of the 2021 Intelligent Systems Conference (IntelliSys) Volume 3. Springer, 175–189.
[20] Yannick Forster, Sebastian Hergeth, Frederik Naujoks, Josef Krems, and Andreas Keinath. 2019. User education in automated driving: Owner's manual and interactive tutorial support mental model formation and human-automation interaction. Information 10, 4 (2019), 143.
[21] Patricia H Fowler, Janet Craig, Lawrence D Fredendall, and Uzay Damali. 2008. Perioperative workflow: barriers to efficiency, risks, and satisfaction. AORN Journal 87, 1 (2008), 187–208.
[22] Catherine A Gao, Frederick M Howard, Nikolay S Markov, Emma C Dyer, Siddhi Ramesh, Yuan Luo, and Alexander T Pearson. 2022. Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. BioRxiv (2022), 2022–12.
[23] Catherine A. Gao, Frederick M. Howard, Nikolay S. Markov, Emma C. Dyer, Siddhi Ramesh, Yuan Luo, and Alexander T. Pearson. 2022. Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. BioRxiv (2022). https://doi.org/10.1101/2022.12.23.521610 arXiv:https://www.biorxiv.org/content/early/2022/12/27/2022.12.23.521610.full.pdf
[24] Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462 (2020).
[25] Katy Ilonka Gero, Vivian Liu, and Lydia Chilton. 2022. Sparks: Inspiration for science writing using language models. In Designing Interactive Systems Conference. 1002–1019.
[26] Katy Ilonka Gero, Tao Long, and Lydia B Chilton. 2023. Social dynamics of AI support in creative writing. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–15.
[27] Bert Gordijn and Henk ten Have. 2023. ChatGPT: evolution or revolution? Medicine, Health Care and Philosophy 26, 1 (2023), 1–2.
[28] Gianluca Grimaldi and Bruno Ehrler. 2023. AI et al.: machines are about to change scientific publishing forever. ACS Energy Letters 8, 1 (2023), 878–880.
[29] Mohanad Halaweh. 2023. ChatGPT in education: Strategies for responsible implementation. (2023).
[30] Dengbo He, Dina Kanaan, and Birsen Donmez. 2022. Distracted when using driving automation: a quantile regression analysis of driver glances considering the effects of road alignment and driving experience. Frontiers in Future Transportation 3 (2022), 772910.
[31] Sebastian Hergeth, Lutz Lorenz, and Josef F Krems. 2017. Prior familiarization with takeover requests affects drivers' takeover performance and automation trust. Human Factors 59, 3 (2017), 457–470.
[32] Kevin Anthony Hoff and Masooda Bashir. 2015. Trust in automation: Integrating empirical evidence on factors that influence trust. Human Factors 57, 3 (2015), 407–434.
[33] Matthew Hutson. 2022. Could AI help you to write your next paper? Nature 611, 7934 (2022), 192–193.
[34] Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics 8 (2020), 423–438.
[35] Andreas Jungherr. 2023. Using ChatGPT and Other Large Language Model (LLM) Applications for Academic Paper Assignments. (2023).
[36] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103 (2023), 102274.
[37] René F Kizilcec. 2016. How much information? Effects of transparency on trust in an algorithmic interface. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. 2390–2395.
[38] C. Y. Kramer. 1956. Extension of multiple range tests to group means with unequal numbers of replications. Biometrics 12 (1956), 307–310.
[39] Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. 2023. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digital Health 2, 2 (2023), e0000198.
[40] Ivano Lauriola, Alberto Lavelli, and Fabio Aiolli. 2022. An introduction to deep learning in natural language processing: Models, techniques, and tools. Neurocomputing 470 (2022), 443–456.
[41] John D Lee and Katrina A See. 2004. Trust in automation: Designing for appropriate reliance. Human Factors 46, 1 (2004), 50–80.
[42] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
[43] Michael Liebrenz, Roman Schleifer, Anna Buadze, Dinesh Bhugra, and Alexander Smith. 2023. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. The Lancet Digital Health 5, 3 (2023), e105–e106.
[44] Peng Liu and Zhizhong Li. 2012. Task complexity: A review and conceptualization framework. International Journal of Industrial Ergonomics 42, 6 (2012), 553–568.
[45] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. Comput. Surveys 55, 9 (2023), 1–35.
[46] Alexandra Luccioni and Joseph Viviano. 2021. What's in the box? An analysis of undesirable content in the Common Crawl corpus. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 182–189.
[47] Thorleif Lund. 2012. Combining qualitative and quantitative approaches: Some arguments for mixed methods research. Scandinavian Journal of Educational Research 56, 2 (2012), 155–165.
[48] Muneer M Alshater. 2022. Exploring the role of artificial intelligence in enhancing academic performance: A case study of ChatGPT. Available at SSRN (2022).
[49] Jan Maarten Schraagen, Sabin Kerwien Lopez, Carolin Schneider, Vivien Schneider, Stephanie Tönjes, and Emma Wiechmann. 2021. The role of transparency and explainability in automated systems. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 65. SAGE Publications Sage CA: Los Angeles, CA, 27–31.
[50] Giuliana Mazzoni and Cesare Cornoldi. 1993. Strategies in study time allocation: Why is study time sometimes not effective? Journal of Experimental Psychology: General 122, 1 (1993), 47.
[51] Jesse G Meyer, Ryan J Urbanowicz, Patrick CN Martin, Karen O'Connor, Ruowang Li, Pei-Chen Peng, Tiffani J Bright, Nicholas Tatonetti, Kyoung Jae Won, Graciela Gonzalez-Hernandez, et al. 2023. ChatGPT and large language models in academia: opportunities and challenges. BioData Mining 16, 1 (2023), 20.
[52] Meredith Ringel Morris. 2023. Scientists' Perspectives on the Potential for Generative AI in their Fields. arXiv preprint arXiv:2304.01420 (2023).
[53] Michael Muller, Lydia B Chilton, Anna Kantosalo, Charles Patrick Martin, and Greg Walsh. 2022. GenAICHI: generative AI and HCI. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–7.
[54] Raja Parasuraman and Victor Riley. 1997. Humans and automation: Use, misuse, disuse, abuse. Human Factors 39, 2 (1997), 230–253.
[55] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models. arXiv preprint arXiv:2202.03286 (2022).
[56] Savvas Petridis, Nicholas Diakopoulos, Kevin Crowston, Mark Hansen, Keren Henderson, Stan Jastrzebski, Jeffrey V Nickerson, and Lydia B Chilton. 2023. AngleKindling: Supporting journalistic angle ideation with large language models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–16.
[57] Md Mizanur Rahman, Harold Jan Terano, Md Nafizur Rahman, Aidin Salamzadeh, and Md Saidur Rahaman. 2023. ChatGPT and academic research: a review and recommendations based on practical examples. Journal of Education, Management and Development Studies 3, 1 (2023), 1–12.
[58] Torsten Reimer and Jörg Rieskamp. 2007. Fast and frugal heuristics. Encyclopedia of Social Psychology (2007), 346–348.
[59] Tam Sakirin and Rachid Ben Said. 2023. User preferences for ChatGPT-powered conversational interfaces versus traditional methods. Mesopotamian Journal of Computer Science 2023 (2023), 24–31.
[60] Anuschka Schmitt, Thiemo Wambsganss, and Andreas Janson. 2022. Designing for conversational system trustworthiness: the impact of model transparency on trust and task performance. (2022).
[61] Steven Shorrock. 2019. What Human Factors Isn't: 1. Common Sense. https://humanisticsystems.com/2019/07/10/what-human-factors-isnt-1-common-sense/
[62] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[63] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,

CA: Los Angeles, CA, 1506–1510.
[66] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems 32 (2019).
[67] Lirong Yao and Yazhuo Guan. 2018. An improved LSTM structure for natural language processing. In 2018 IEEE International Conference of Safety Produce Informatization (IICSPI). IEEE, 565–569.
[68] Nan Zhong, Zhenxing Qian, and Xinpeng Zhang. 2021. Deep neural network retrieval. In Proceedings of the 29th ACM International Conference on Multimedia. 3455–3463.

A    INTERVIEW QUANTITATIVE STATISTIC DATA
During the coding and analysis phases, a large amount of data was collected. Thus, only the data involved in Figure 2 is shown in Table 6 as an additional illustration for ease of reference.

B    EXPERIMENT MATERIALS AND QUESTIONS

B.1    Paper Understanding 1 (P1)
Paper link:
https://journals.sagepub.com/doi/pdf/10.1177/1071181322661442
    (1) What is ADAS? Please list some typical functions it includes.
    (2) What is the purpose of this research, and how can this study benefit future studies?
    (3) Please briefly describe the procedures of how the survey data was collected, the participants' criteria, and how valid data was selected after the data collection.
    (4) Please briefly summarize the findings in this paper, and how these findings can benefit future studies.
    (5) Please indicate the limitations of this paper.

B.2    Paper Understanding 2 (P2)
Paper link:
https://journals.sagepub.com/doi/pdf/10.1177/1071181322661400
    (1) Please give a definition for AV and TAM, respectively, and indicate how TAM is relevant to AV.
    (2) What is the purpose of this research, and how can this study benefit future studies?
    (3) Please summarize the types of information collected in the survey and how the valid data was selected after the data collection.
    (4) Please briefly summarize the findings in this paper, and how these findings can benefit future studies.
    (5) Please indicate the limitations of this paper.

B.3    Literature Review Topic 1 (T1)
Topic No.1: Novice driver training

B.4    Literature Review Topic 2 (T2)
Topic No.2: Hazard perception in driving
     Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all            C     OUTLINE OF INTERVIEW
     you need. Advances in Neural Information Processing Systems 30 (2017).
[64] Christopher D Wickens, William S Helton, Justin G Hollands, and Simon Banbury.
                                                                                           Table 7 shows the outline of the interview, including types, ques-
     2021. Engineering psychology and human performance. Routledge.                        tions, and time. It should be noted that in addition to these questions,
[65] Bart D Wilkison, Arthur D Fisk, and Wendy A Rogers. 2007. Effects of mental           we rely on the participant’s answers for more in-depth discussion.
     model quality on collaborative system performance. In Proceedings of the Human
     Factors and Ergonomics Society Annual Meeting, Vol. 51. SAGE Publications Sage