Overview of Large Language Models
Humza Naveed a, Asad Ullah Khan b,∗, Shi Qiu c,∗, Muhammad Saqib d,e,∗, Saeed Anwar f,g, Muhammad Usman f,g, Naveed Akhtar h,j, Nick Barnes i, Ajmal Mian j

a The University of Sydney, Sydney, Australia
b University of Engineering and Technology (UET), Lahore, Pakistan
c The Chinese University of Hong Kong (CUHK), HKSAR, China
d University of Technology Sydney (UTS), Sydney, Australia
e Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia
f King Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia
g SDAIA-KFUPM Joint Research Center for Artificial Intelligence (JRCAI), Dhahran, Saudi Arabia
h The University of Melbourne (UoM), Melbourne, Australia
i Australian National University (ANU), Canberra, Australia
j The University of Western Australia (UWA), Perth, Australia
Abstract

Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs, robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise yet comprehensive overview of the recent developments in this field. This article provides an overview of the literature on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to provide not only a systematic survey but also a quick, comprehensive reference for researchers and practitioners to draw insights from extensive, informative summaries of the existing works to advance LLM research.

Keywords:
Large Language Models, LLMs, ChatGPT, Augmented LLMs, Multimodal LLMs, LLM training, LLM Benchmarking
1. Introduction

Language plays a fundamental role in facilitating communication and self-expression for humans and their interaction with machines. The need for generalized models stems from the growing demand for machines to handle complex language tasks, including translation, summarization, information retrieval, conversational interactions, etc. Recently, significant breakthroughs have been witnessed in language models, primarily attributed to transformers [1], increased computational capabilities, and the availability of large-scale training data. These developments have brought about a revolutionary transformation by enabling the creation of LLMs that can approximate human-level performance on various tasks [2, 3].
∗ Equal contribution
Email addresses: humza_naveed@yahoo.com (Humza Naveed), aukhanee@gmail.com (Asad Ullah Khan), shiqiu@cse.cuhk.edu.hk (Shi Qiu), muhammad.saqib@data61.csiro.au (Muhammad Saqib), saeed.anwar@kfupm.edu.sa (Saeed Anwar), muhammad.usman@kfupm.edu.sa (Muhammad Usman), naveed.akhtar1@unimelb.edu.au (Naveed Akhtar), nick.barnes@anu.edu.au (Nick Barnes), ajmal.mian@uwa.edu.au (Ajmal Mian)

Figure 1: The trend of papers released over the years containing keywords "Large Language Model", "Large Language Model + Fine-Tuning", and "Large Language Model + Alignment".
Figure 2: Chronological display of LLM releases: blue cards represent ‘pre-trained’ models, while orange cards correspond to ‘instruction-tuned’ models. Models
on the upper half signify open-source availability, whereas those on the bottom are closed-source. The chart illustrates the increasing trend towards instruction-tuned
and open-source models, highlighting the evolving landscape and trends in natural language processing research.
Large Language Models (LLMs) have emerged as cutting-edge artificial intelligence systems that can process and generate text with coherent communication [4] and generalize to multiple tasks [5, 6].

The historical progress in natural language processing (NLP) evolved from statistical to neural language modeling and then from pre-trained language models (PLMs) to LLMs. While conventional language modeling (LM) trains task-specific models in supervised settings, PLMs are trained in a self-supervised setting on a large corpus of text [7, 8, 9] with the aim of learning a generic representation that is shareable among various NLP tasks. After fine-tuning for downstream tasks, PLMs surpass the performance of traditional language modeling. Larger PLMs bring more performance gains, which has led to the transition of PLMs to LLMs by significantly increasing model parameters (tens to hundreds of billions) [10] and training data (many GBs and TBs) [10, 11]. Following this development, numerous LLMs have been proposed in the literature [10, 11, 12, 6, 13, 14, 15]. The increasing trend in the number of released LLMs and the names of a few significant LLMs proposed over the years are shown in Fig 1 and Fig 2, respectively.

The early work on LLMs, such as T5 [10] and mT5 [11], employed transfer learning until GPT-3 [6] showed that LLMs are zero-shot transferable to downstream tasks without fine-tuning. LLMs accurately respond to task queries when prompted with task descriptions and examples. However, pre-trained LLMs fail to follow user intent and perform worse in zero-shot settings than in few-shot. Fine-tuning them on task instruction data [16, 17, 18, 19] and aligning them with human preferences [20, 21] enhances generalization to unseen tasks, improving zero-shot performance significantly and reducing misaligned behavior.

In addition to better generalization and domain adaptation, LLMs appear to have emergent abilities, such as reasoning, planning, decision-making, in-context learning, answering in zero-shot settings, etc. These abilities are known to be acquired due to their gigantic scale, even when the pre-trained LLMs are not trained specifically to possess them [22, 23, 24]. Such abilities have led LLMs to be widely adopted in diverse settings, including multi-modal applications, robotics, tool manipulation, question answering, autonomous agents, etc. Various improvements have also been suggested in these areas, either by task-specific training [25, 26, 27, 28, 29, 30, 31] or better prompting [32].

The ability of LLMs to solve diverse tasks with human-level performance comes at the cost of slow training and inference, extensive hardware requirements, and higher running costs. Such requirements have limited their adoption and opened up opportunities to devise better architectures [15, 33, 34, 35] and training strategies [36, 37, 21, 38, 39, 40, 41]. Parameter-efficient tuning [38, 41, 40], pruning [42, 43], quantization [44, 45], knowledge distillation, and context length interpolation [46, 47, 48, 49], among others, are some of the methods widely studied for efficient LLM utilization.

Due to the success of LLMs on a wide variety of tasks, the research literature has recently experienced a large influx of LLM-related contributions. Researchers have organized the LLM literature in surveys [50, 51, 52, 53] and topic-specific surveys [54, 55, 56, 57, 58]. In contrast to these surveys, our contribution focuses on providing a comprehensive yet concise overview of the general direction of LLM research. This article summarizes architectural and training details of pre-trained LLMs and delves deeper into the details of concepts like fine-tuning, multi-modal LLMs, augmented LLMs, datasets, evaluation, applications, challenges, and others to provide a self-contained comprehensive overview. Our key contributions are summarized as follows.

• We present a survey on the developments in LLM research, providing a concise, comprehensive overview of the direction.
• We present extensive summaries of pre-trained models that include fine-grained details of architecture and training.
• We summarize major findings of the popular contributions and provide a detailed discussion on the key design and development aspects of LLMs to help practitioners effectively leverage this technology.
• In this self-contained article, we cover a range of concepts to present the general direction of LLMs comprehensively, including background, pre-training, fine-tuning, multi-modal LLMs, augmented LLMs, LLM-powered agents, datasets, evaluation, etc.
Figure 3: A broader overview of LLMs, dividing LLMs into seven branches: 1. Pre-Training 2. Fine-Tuning 3. Efficient 4. Inference 5. Evaluation 6. Applications 7. Challenges
We loosely follow the existing terminology to ensure a standardized outlook of this research direction. For instance, following [50], our survey discusses pre-trained LLMs with 10B parameters or more. We refer the readers interested in smaller pre-trained models to [51, 52, 53].

The organization of this paper is as follows. Section 2 discusses the background of LLMs. Section 3 focuses on the overview of LLMs, architectures, training pipelines and strategies, fine-tuning, and utilization in different domains. Section 4 highlights the configuration and parameters that play a crucial role in the functioning of these models. Summary and discussions are presented in section 3.8. The LLM training and evaluation, datasets, and benchmarks are discussed in section 5, followed by challenges and future directions and the conclusion in sections 7 and 8, respectively.
2. Background

We provide the relevant background to understand the fundamentals related to LLMs in this section. We briefly discuss the necessary components of LLMs and refer readers interested in details to the original works.

Masked Language Modeling: In this training objective, tokens or spans (a sequence of tokens) are masked randomly and the model is asked to predict the masked tokens given the past and future context. An example is shown in Figure 5.

Unified Language Modeling: Unified language modeling [94] is a combination of causal, non-causal, and masked language training objectives. Here, in masked language modeling, the attention is not bidirectional but unidirectional, attending either to the left-to-right or right-to-left context.

2.4. Activation Functions

The activation functions serve a crucial role in the curve-fitting abilities of neural networks [69]. We discuss the activation functions used in LLMs in this section.

ReLU [70]: The Rectified Linear Unit (ReLU) is defined as:

ReLU(x) = max(0, x)    (1)
2.11. LLMs Scaling Laws

Scaling laws study the optimal combination of model parameters, dataset size, and computational resources that predicts the improvement in model performance. It has been shown that the loss scales according to a power law with model size, dataset size, and compute resources [95]. This study suggests that larger models are more important than big data for better performance. Another variant of the scaling law [96] suggests that the model size and the number of training tokens should be scaled equally.
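To make the allocation concrete, the following minimal sketch computes a compute-optimal split under two common simplifications that are not taken from [96] verbatim: the training-cost approximation C ≈ 6·N·D (FLOPs for N parameters and D tokens) and the roughly 20-tokens-per-parameter rule of thumb often associated with Chinchilla.

# A minimal sketch, assuming C = 6*N*D and a fixed tokens-per-parameter ratio.
def compute_optimal_split(flops_budget: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that exhaust the budget with D = k*N."""
    # C = 6*N*D and D = k*N  =>  N = sqrt(C / (6*k))
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal_split(5.76e23)   # roughly Chinchilla's compute budget
print(f"params = {n/1e9:.0f}B, tokens = {d/1e9:.0f}B")   # ~69B params, ~1.4T tokens

This recovers the paper's headline configuration (a roughly 70B model on roughly 1.4T tokens) from the budget alone, which is the practical content of the "scale both equally" result.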
2.12. LLMs Adaptation Stages

This section discusses the fundamentals of LLMs adaptation stages, from pre-training to fine-tuning for downstream tasks and utilization. An example of different training stages and inference in LLMs is shown in Figure 6. In this paper, we refer to alignment-tuning as aligning with human preferences, while occasionally the literature uses the term alignment for different purposes.

2.12.1. Pre-Training

In the very first stage, the model is trained in a self-supervised manner on a large corpus to predict the next tokens given the input. The design choices of LLMs vary from encoder-decoder to decoder-only architectures, with different building blocks and loss functions discussed in sections 2.5, 2.4, and 2.10.

2.12.2. Fine-Tuning

There are different styles to fine-tune an LLM. This section briefly discusses fine-tuning approaches.

Transfer Learning: Pre-trained LLMs perform well on various tasks [6, 15]. However, to improve the performance for
a downstream task, pre-trained models are fine-tuned with task-specific data [10, 11], known as transfer learning.

Instruction-tuning: To enable a model to respond to user queries effectively, the pre-trained model is fine-tuned on instruction-formatted data, i.e., instructions and input-output pairs. Instructions generally comprise multi-task data in plain natural language, guiding the model to respond according to the prompt and the input. This type of fine-tuning improves zero-shot generalization and downstream task performance. Details on formatting instruction data and its various styles are available in [16, 50, 97].

Alignment-tuning: LLMs are prone to generating false, biased, and harmful text. To make them helpful, honest, and harmless, models are aligned using human feedback. Alignment involves asking LLMs to generate unexpected responses and then updating their parameters to avoid such responses [20, 21, 98]. It ensures LLMs operate according to human intentions and values. A model is defined to be an "aligned" model if it fulfills the three criteria of helpful, honest, and harmless, or "HHH" [99].

Researchers employ reinforcement learning with human feedback (RLHF) [100] for model alignment. In RLHF, a model fine-tuned on demonstrations is further trained with reward modeling (RM) and reinforcement learning (RL), shown in Figure 6. Below, we briefly discuss the RM and RL pipelines in RLHF.

Reward modeling: trains a model to rank generated responses according to human preferences using a classification objective. To train the classifier, humans annotate LLM-generated responses based on the HHH criteria.

Reinforcement learning: in combination with the reward model, is used for alignment in the next stage. The previously trained reward model ranks LLM-generated responses into preferred vs. non-preferred, which is used to align the model with proximal policy optimization (PPO). This process repeats iteratively until convergence.
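The reward model is typically trained with a pairwise ranking objective. The sketch below shows one standard formulation (a Bradley-Terry-style log-sigmoid margin loss over preferred vs. rejected responses); it illustrates the idea rather than the exact recipe of [100], and the function and variable names are ours.

# A minimal sketch of the pairwise reward-modeling loss, not the exact RLHF recipe.
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-sigmoid of the score margin, averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Usage: r_chosen / r_rejected are scalar scores the reward model assigns to the
# human-preferred and non-preferred completions of the same prompt.
loss = reward_model_loss(torch.tensor([1.3, 0.2]), torch.tensor([0.4, 0.9]))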
2.12.3. Prompting/Utilization

Prompting is a method to query trained LLMs for generating responses, as illustrated in Figure 6. LLMs can be prompted in various prompt setups, where they can be adapted to the instructions without fine-tuning or, in other cases, with fine-tuning on data containing different prompt styles [16, 101, 102]. A good guide on prompt engineering is available at [32]. Below, we discuss various widely used prompt setups.

Zero-Shot Prompting: LLMs are zero-shot learners capable of answering queries never seen before. This style of prompting requires LLMs to answer user questions without seeing any examples in the prompt.

In-context Learning: Also known as few-shot learning, here, multiple input-output demonstration pairs are shown to the model to generate the desired response. A discussion on formatting in-context learning (ICL) templates is available in [54, 50, 18, 16].
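As a concrete illustration, the following sketch assembles a few-shot prompt; the task, demonstrations, and formatting conventions are illustrative assumptions, not a template prescribed by the cited works.

# A minimal sketch of a few-shot (in-context learning) prompt.
demonstrations = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]
query = "The plot was thin, but the acting saved it."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in demonstrations:          # k demonstrations = k-shot
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"    # the model completes this line
print(prompt)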
Reasoning in LLMs: LLMs are zero-shot reasoners and can be provoked to generate answers to logical problems, task planning, critical thinking, etc., with reasoning. Generating reasons is possible only by using different prompting styles, whereas to improve LLMs further on reasoning tasks, many methods [16, 97] train them on reasoning datasets. We discuss various prompting techniques for reasoning below.

Chain-of-Thought (CoT): A special case of prompting where demonstrations contain reasoning information aggregated with inputs and outputs so that the model generates outcomes with step-by-step reasoning. More details on CoT prompts are available in [55, 103, 101].

Self-Consistency: Improves CoT performance by generating multiple responses and selecting the most frequent answer [104].
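A minimal sketch of this voting scheme follows; sample_cot is a hypothetical stand-in for sampling the model with a non-zero temperature, and the answer-extraction convention is an assumption.

# A minimal sketch of self-consistency [104]: sample several CoT responses,
# extract each final answer, and return the majority vote.
from collections import Counter

def extract_final_answer(response: str) -> str:
    # Illustrative convention: each sampled response ends with "Answer: <value>".
    return response.rsplit("Answer:", 1)[-1].strip()

def self_consistency(sample_cot, prompt: str, n_samples: int = 10) -> str:
    answers = [extract_final_answer(sample_cot(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]   # most frequent answer wins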
Tree-of-Thought (ToT): Explores multiple reasoning paths with possibilities to look ahead and backtrack for problem-solving [105].

Single-Turn Instructions: In this prompting setup, LLMs are queried only once with all the relevant information in the prompt. LLMs generate responses by understanding the context either in a zero-shot or few-shot setting.

Multi-Turn Instructions: Solving a complex task requires multiple interactions with LLMs, where feedback and responses from the other tools are given as input to the LLM for the next rounds. This style of using LLMs in the loop is common in autonomous agents.

3. Large Language Models

This section reviews LLMs, briefly describing their architectures, training objectives, pipelines, datasets, and fine-tuning details.

3.1. Pre-Trained LLMs

Here, we provide summaries of various well-known pre-trained LLMs with significant discoveries that changed the course of research and development in NLP. These LLMs have considerably improved the performance on NLU and NLG tasks and are widely fine-tuned for downstream tasks. Moreover, we also identify key findings and insights of pre-trained LLMs in Tables 1 and 2 that improve their performance.

3.1.1. General Purpose

T5 [10]: An encoder-decoder model employing a unified text-to-text training for all NLP problems, shown in Figure 7. T5 places layer normalization outside the residual path of a conventional transformer model [64]. It uses masked language modeling as a pre-training objective, where spans (consecutive tokens) are replaced with a single mask instead of separate masks for each token. This type of masking speeds up the training as it produces shorter sequences. After pre-training, the model is fine-tuned using adapter layers [106] for downstream tasks.

GPT-3 [6]: The GPT-3 architecture is the same as GPT-2's [5] but with dense and sparse attention in transformer layers, similar to the Sparse Transformer [67]. It shows that large models can train on larger batch sizes with a lower learning rate; to decide the batch size during training, GPT-3 uses the gradient noise scale as in [107]. Overall, GPT-3 increases model parameters to 175B, showing that the performance of large language
models improves with scale and is competitive with fine-tuned models.

Figure 7: Unified text-to-text training example, source image from [10].

mT5 [11]: A multilingual T5 model [10] trained on the mC4 dataset covering 101 languages. The dataset is extracted from the public common crawl scrape. The model uses a larger vocabulary size of 250,000 to cover multiple languages. To avoid over-fitting or under-fitting for a language, mT5 employs a data sampling procedure to select samples from all languages. The paper suggests using a small amount of pre-training data covering all languages when fine-tuning for a task using English language data. This allows the model to generate correct non-English outputs.
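A minimal sketch of such temperature-based language sampling follows; the functional form p(L) proportional to |L|^alpha is standard for balancing high- and low-resource languages in multilingual pre-training, but the exponent value here is illustrative rather than quoted from the mT5 paper.

# A minimal sketch of temperature-based language sampling; alpha is illustrative.
def language_sampling_probs(token_counts: dict, alpha: float = 0.3) -> dict:
    """p(L) ~ |L|**alpha: alpha < 1 boosts low-resource languages."""
    weights = {lang: n ** alpha for lang, n in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

probs = language_sampling_probs({"en": 3_000_000_000, "sw": 10_000_000})
# English is 300x larger here but is sampled only ~5.5x more often at alpha = 0.3.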
PanGu-α [108]: An autoregressive model that has a query layer at the end of the standard transformer layers (an example is shown in Figure 8) to predict the next token. Its structure is similar to the transformer layer but with an additional embedding for the next position in the attention mechanism, given in Eq. 3:

a = p_n W_h^q (W_h^k)^T H_L^T    (3)

Figure 8: The image is from the article of [108], showing an example of the PanGu-α architecture.

CPM-2 [12]: Cost-efficient Pre-trained language Models (CPM-2) pre-trains bilingual (English and Chinese) 11B and 198B mixture-of-experts (MoE) models on the WuDaoCorpus [109] dataset. The tokenization process removes "_" white space tokens in the sentencepiece tokenizer. The models are trained with knowledge inheritance, starting with only the Chinese language in the first stage and then adding English and Chinese data. This trained model gets duplicated multiple times to initialize the 198B MoE model. Moreover, to use the model for downstream tasks, CPM-2 experimented with both complete fine-tuning and prompt fine-tuning, as in [40], where only prompt-related parameters are updated by inserting prompts at various positions: front, middle, and back. CPM-2 also proposes INFMOE, a memory-efficient framework with a strategy to dynamically offload parameters to the CPU for inference at a 100B scale. It overlaps data movement with inference computation for lower inference time.

ERNIE 3.0 [110]: ERNIE 3.0 takes inspiration from multi-task learning to build a modular architecture using Transformer-XL [111] as the backbone. The universal representation module is shared by all the tasks and serves as the basic block for the task-specific representation modules, which are all trained jointly for natural language understanding, natural language generation, and knowledge extraction. This LLM is primarily focused on the Chinese language. It claims to train on the largest Chinese text corpora for LLM training and achieved state-of-the-art results in 54 Chinese NLP tasks.

Jurassic-1 [112]: A pair of auto-regressive language models, including a 7B-parameter J1-Large model and a 178B-parameter J1-Jumbo model. The training vocabulary of Jurassic-1 comprises word pieces, complete words, and multi-word expressions without any word boundaries, where possible out-of-vocabulary instances are interpreted as Unicode bytes. Compared to the GPT-3 counterparts, the Jurassic-1 models apply a more balanced depth-to-width self-attention architecture [113] and an improved tokenizer for faster prediction based on broader resources, achieving comparable performance in zero-shot learning tasks and superior performance in few-shot learning tasks, given the ability to feed more examples as a prompt.

HyperCLOVA [114]: A Korean language model with the GPT-3 architecture.

Yuan 1.0 [115]: Trained on a Chinese corpus with 5TB of high-quality text collected from the Internet. A Massive Data Filtering System (MDFS) built on Spark is developed to process the raw data via coarse and fine filtering techniques. To speed up the training of Yuan 1.0, with the aim of saving energy expenses and carbon emissions, various factors that improve the performance of distributed training are incorporated in the architecture and training: increasing the hidden state size improves pipeline and tensor parallelism performance, larger micro-batches improve pipeline parallelism performance, and a larger global batch size improves data parallelism performance. In practice, the Yuan 1.0 model performs well on text classification, Winograd schema, natural language inference, and reading comprehension tasks.

Gopher [116]: The Gopher family of models ranges from 44M to 280B parameters in size, built to study the effect of scale on LLM performance. The 280B model beats GPT-3 [6], Jurassic-1 [112], MT-NLG [117], and others on 81% of the evaluated tasks.

ERNIE 3.0 TITAN [35]: ERNIE 3.0 Titan extends ERNIE 3.0 by training a larger model with 26x the number of parameters of the latter. This bigger model outperformed other state-of-the-art models in 68 NLP tasks. LLMs produce text with incorrect facts. In order to have control of the generated text with factual consistency, ERNIE 3.0 Titan adds another task, Credible and Controllable Generations, to its multi-task learning setup.
It introduces additional self-supervised adversarial and controllable language modeling losses to the pre-training step, which enables ERNIE 3.0 Titan to beat other LLMs in their manually selected Factual QA task set evaluations.

GPT-NeoX-20B [118]: An auto-regressive model that largely follows GPT-3, with a few deviations in architecture design, trained on the Pile dataset without any data deduplication. GPT-NeoX has parallel attention and feed-forward layers in a transformer block, given in Eq. 4, which increases throughput by 15%. It uses rotary positional embedding [66], applying it to only 25% of the embedding vector dimensions, as in [119]. This reduces the computation without performance degradation. As opposed to GPT-3, which uses dense and sparse layers, GPT-NeoX-20B uses only dense layers. Hyperparameter tuning at this scale is difficult; therefore, the model chooses hyperparameters from the method in [6] and interpolates values between the 13B and 175B models for the 20B model. The model training is distributed among GPUs using both tensor and pipeline parallelism.

x + Attn(LN_1(x)) + FF(LN_2(x))    (4)
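The sketch below contrasts the conventional cascaded block with the parallel formulation of Eq. 4; the module sizes and choices are illustrative, not GPT-NeoX-20B's actual configuration.

# A minimal sketch of cascaded vs. parallel transformer blocks (Eq. 4).
import torch
import torch.nn as nn

d = 256
attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
ln1, ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

def cascaded_block(x):
    h = x + attn(ln1(x), ln1(x), ln1(x))[0]   # attention first ...
    return h + ff(ln2(h))                     # ... then FF on its output

def parallel_block(x):
    # Eq. 4: both sub-layers read the same input x, so their matrix
    # multiplications can be fused/overlapped, giving the reported speed-up.
    return x + attn(ln1(x), ln1(x), ln1(x))[0] + ff(ln2(x))

y = parallel_block(torch.randn(1, 16, d))     # (batch, sequence, hidden)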
                                                                        times more data. It outperforms Gopher [116], GPT-3 [6], and
  OPT [14]: It is a clone of GPT-3, developed to open-source            others on various downstream tasks, after fine-tuning.
a model that replicates GPT-3 performance. Training of OPT                AlexaTM [122]: An encoder-decoder model, where encoder
employs dynamic loss scaling [120] and restarts from an earlier         weights and decoder embeddings are initialized with a pre-
checkpoint with a lower learning rate whenever loss divergence          trained encoder to speed up training. The encoder stays frozen
is observed. Overall, the performance of OPT-175B models is             for the initial 100k steps and is later unfrozen for end-to-end
comparable to the GPT3-175B model.                                      training. The model is trained on a combination of denoising
 BLOOM [13]: A causal decoder model trained on the ROOTS                and causal language modeling (CLM) objectives, concatenat-
corpus to open-source an LLM. The architecture of BLOOM is              ing a [CLM] token at the beginning for mode switching. Dur-
shown in Figure 9, with differences like ALiBi positional em-           ing training, the CLM task is applied for 20% of the time, which
bedding, an additional normalization layer after the embedding          improves the in-context learning performance.
layer as suggested by the bitsandbytes1 library. These changes             PaLM [15]: A causal decoder with parallel attention and
stabilize training with improved downstream performance.                feed-forward layers similar to Eq. 4, speeding up training by
 GLaM [91]: Generalist Language Model (GLaM) represents a               a factor of 15. Additional changes to the conventional trans-
family of language models using a sparsely activated decoder-           former model include SwiGLU activation, RoPE embeddings,
only mixture-of-experts (MoE) structure [121, 90]. To gain              multi-query attention that saves computation cost during decod-
more model capacity while reducing computation, the experts             ing, and shared input-output embeddings. During training, loss
are sparsely activated where only the best two experts are used         spiking was observed, and to fix it, model training was restarted
to process each input token. The largest GLaM model, GLaM               from a 100-step earlier checkpoint by skipping 200-500 batches
(64B/64E), is about 7× larger than GPT-3 [6], while only part of        around the spike. Moreover, the model was found to memo-
the parameters are activated per input token. The largest GLaM          rize around 2.4% of the training data at the 540B model scale,
(64B/64E) model achieves better overall results as compared             whereas this number was lower for smaller models.
to GPT-3 while consuming only one-third of GPT-3’s training                PaLM-2 [123]: A smaller multi-lingual variant of PaLM,
energy.                                                                 trained for larger iterations on a better quality dataset. PaLM-
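The sketch below illustrates top-2 routing of this kind; it is a simplified rendering (real MoE layers add load-balancing losses and expert-capacity limits), and all names and sizes are illustrative.

# A minimal sketch of top-2 expert routing in a GLaM-style sparse MoE layer.
import torch
import torch.nn as nn

def top2_moe(x, experts, router):
    """x: (tokens, d); experts: list of expert networks; router: d -> n_experts."""
    weights, idx = torch.topk(router(x).softmax(-1), k=2)   # best two experts per token
    out = torch.zeros_like(x)
    for slot in range(2):               # only the two selected experts run per token
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

d, n_experts = 64, 8
experts = [nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
           for _ in range(n_experts)]
router = nn.Linear(d, n_experts)
y = top2_moe(torch.randn(10, d), experts, router)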
MT-NLG [117]: A 530B causal decoder based on the GPT-2 architecture, with roughly 3x GPT-3's model parameters. MT-NLG is trained on filtered high-quality data collected from various public datasets and blends various types of datasets in a single batch, which beats GPT-3 on several evaluations.

Chinchilla [96]: A causal decoder trained on the same dataset as Gopher [116] but with a slightly different data sampling distribution (sampled from MassiveText). The model architecture is similar to the one used for Gopher, with the exception of the AdamW optimizer instead of Adam. Chinchilla identifies the relationship that the model size should be doubled for every doubling of training tokens. Over 400 language models, ranging from 70 million to over 16 billion parameters and trained on 5 to 500 billion tokens, are used to get the estimates for compute-optimal training under a given budget. The authors train a 70B model with the same compute budget as Gopher (280B) but with 4 times more data. It outperforms Gopher [116], GPT-3 [6], and others on various downstream tasks after fine-tuning.

AlexaTM [122]: An encoder-decoder model, where the encoder weights and decoder embeddings are initialized with a pre-trained encoder to speed up training. The encoder stays frozen for the initial 100k steps and is later unfrozen for end-to-end training. The model is trained on a combination of denoising and causal language modeling (CLM) objectives, concatenating a [CLM] token at the beginning for mode switching. During training, the CLM task is applied 20% of the time, which improves the in-context learning performance.

PaLM [15]: A causal decoder with parallel attention and feed-forward layers similar to Eq. 4, speeding up training by a factor of 15. Additional changes to the conventional transformer model include SwiGLU activation, RoPE embeddings, multi-query attention that saves computation cost during decoding, and shared input-output embeddings. During training, loss spiking was observed, and to fix it, model training was restarted from a checkpoint 100 steps earlier, skipping 200-500 batches around the spike. Moreover, the model was found to memorize around 2.4% of the training data at the 540B model scale, whereas this number was lower for smaller models.

PaLM-2 [123]: A smaller multi-lingual variant of PaLM, trained for more iterations on a better-quality dataset. PaLM-2 shows significant improvements over PaLM while reducing training and inference costs due to its smaller size. To lessen toxicity and memorization, it appends special tokens to a fraction of the pre-training data, which shows a reduction in generating harmful responses.

U-PaLM [124]: This method trains PaLM for 0.1% additional compute with the UL2 (also named UL2Restore) objective [125], using the same dataset; it outperforms the baseline significantly on various NLP tasks, including zero-shot, few-shot, commonsense reasoning, CoT, etc. Training with UL2R involves converting a causal decoder PaLM to a non-causal decoder PaLM and employing 50% sequential denoising, 25% regular denoising, and 25% extreme denoising loss functions.

1 https://github.com/TimDettmers/bitsandbytes
UL2 [125]: An encoder-decoder architecture trained using a mixture of denoisers (MoD) objective. The denoisers include: 1) R-Denoiser, a regular span masking; 2) S-Denoiser, which corrupts consecutive tokens of a large sequence; and 3) X-Denoiser, which corrupts a large number of tokens randomly. During pre-training, UL2 includes a denoiser token from {R, S, X} to represent the denoising setup. This helps improve fine-tuning performance for downstream tasks that bind the task to one of the upstream training modes. This MoD style of training outperforms the T5 model on many benchmarks.

GLM-130B [33]: GLM-130B is a bilingual (English and Chinese) model trained using an auto-regressive mask infilling pre-training objective similar to the GLM [126]. This training style makes the model bidirectional, as compared to GPT-3, which is unidirectional. As opposed to GLM, the training of GLM-130B includes a small amount of multi-task instruction pre-training data (5% of the total data) along with self-supervised mask infilling. To stabilize the training, it applies embedding layer gradient shrink.

LLaMA [127, 21]: A set of decoder-only language models varying from 7B to 70B parameters. The LLaMA model series is the most famous among the community for parameter efficiency and instruction tuning.

LLaMA-1 [127]: Implements efficient causal attention [128] by not storing and computing masked attention weights and key/query scores. Another optimization is reducing the number of activations recomputed in the backward pass, as in [129].

LLaMA-2 [21]: This work is more focused on fine-tuning a safer and better LLaMA-2-Chat model for dialogue generation. The pre-trained model has 40% more training data with a larger context length and grouped-query attention.

LLaMA-3/3.1 [130]: A collection of models trained on a seven times larger dataset as compared to LLaMA-2, with double the context length, outperforming its previous variants and other models.

PanGu-Σ [92]: An autoregressive model with parameters copied from PanGu-α and extended to a trillion scale with Random Routed Experts (RRE); the architectural diagram is shown in Figure 10. RRE is similar to the MoE architecture, with distinctions at the second level, where tokens are randomly routed to experts in a domain instead of using a learnable gating method. The model has bottom layers densely activated and shared across all domains, whereas the top layers are sparsely activated according to the domain. This training style allows for extracting task-specific models and reduces catastrophic forgetting effects in the case of continual learning.

Mixtral 8x22B [131]: A mixture-of-experts (MoE) model with eight distinct experts that routes each token to two experts at each layer and combines the outputs additively.

Snowflake Arctic [132]: Arctic LLM is a hybrid of dense and mixture-of-experts (MoE) architectures. The MoE (128x3.66B MLP experts) is parallel to the dense transformer (10B), with only two experts activated. The model has many experts compared to other MoE LLMs [131, 133] to increase the model capacity and provide an opportunity to choose among many experts for a diverse configuration. The model has 480B parameters, and only 17B are active during a forward pass, reducing the computation significantly.

Grok [133, 134]: Grok is a family of LLMs, including Grok-1 and Grok-1.5, released by xAI.

Grok-1 [133]: Grok-1 is a 314B-parameter MoE language model (eight experts), where two experts are activated per token.

Grok-1.5 [134]: Grok-1.5 is a multi-modal LLM with a larger context length and improved performance.

Gemini [135, 136]: Gemini replaces Bard (based on PaLM) with multi-modal capabilities and significant language modeling performance improvements.

Gemini-1 [135]: The first auto-regressive model to achieve human-level capabilities on the MMLU benchmark.

Gemini-1.5 [136]: A multi-modal LLM with an MoE architecture that builds on the findings of Gemini-1. The model has a 2M context window and can reason over information up to 10M tokens. Such large context windows were never achieved previously and were shown to have a huge impact on performance gains.

Nemotron-4 340B [137]: A decoder-only model that has been aligned on 98% synthetic data and only 2% manually annotated data. Utilizing synthetic data in a large proportion improves the model performance significantly. The paper suggests introducing alignment data with a smaller subset of previously seen data during the late stage of model pre-training, enabling a smooth transition from the pre-training stage to the final training stage. To train better instruction-following models, weaker models are trained into stronger models iteratively: the synthetic data generated by the weaker instruction-tuned model is used to train a base model, which is later supervised fine-tuned, outperforming the weaker model.

DeepSeek [138]: DeepSeek studies the LLM scaling laws in detail to determine the optimal non-embedding model size and training data. The experiments were performed for 8 budgets ranging from 1e17 to 3e20 training FLOPs. Each compute budget was tested against ten different model/data scales. The batch size and learning rate were also fitted for the given compute budget, finding that the batch size should increase with the compute budget while the learning rate should decrease. The fitted equations for the optimal batch size (B), learning rate (η), model size (M), and data (D) are:

B_opt = 0.2920 · C^0.3271
η_opt = 0.3118 · C^(−0.1250)
M_opt = M_base · C^a        (5)
D_opt = D_base · C^b

where M_base = 0.1715, D_base = 5.8316, a = 0.5243, b = 0.4757.
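Plugging a compute budget into these fitted rules is straightforward; the sketch below evaluates Eq. 5 as stated, with the caveat that the fits were made for budgets between 1e17 and 3e20 FLOPs, and the function name is ours.

# A minimal sketch evaluating the fitted scaling rules of Eq. 5 for a budget C (FLOPs).
def scaling_rule_hparams(C: float):
    B_opt  = 0.2920 * C ** 0.3271    # optimal batch size grows with compute
    lr_opt = 0.3118 * C ** -0.1250   # optimal learning rate shrinks with compute
    M_opt  = 0.1715 * C ** 0.5243    # non-embedding model scale
    D_opt  = 5.8316 * C ** 0.4757    # data scale
    return B_opt, lr_opt, M_opt, D_opt

B, lr, M, D = scaling_rule_hparams(3e20)   # the largest budget studied in [138]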
DeepSeek-v2 [139]: An MoE model that introduces multi-head latent attention (MLA) to reduce inference costs by compressing the Key-Value (KV) cache into a latent vector. MLA achieves better performance than multi-head attention (MHA) and other efficient attention mechanisms such as grouped-query attention (GQA), multi-query attention (MQA), etc. Because of MLA, DeepSeek-v2 achieves 5.76 times faster inference throughput as compared to DeepSeek [138].
3.1.2. Coding

CodeGen [140]: CodeGen has an architecture similar to PaLM [15], i.e., parallel attention, MLP layers, and RoPE embeddings. The model is trained on natural language and programming language data sequentially (trained on the first dataset, then the second, and so on) on the following datasets: 1) PILE, 2) BIGQUERY, and 3) BIGPYTHON. CodeGen proposed a multi-step approach to synthesizing code. The purpose is to simplify the generation of long sequences, where the previous prompt and generated code are given as input with the next prompt to generate the next code sequence. CodeGen open-sources a Multi-Turn Programming Benchmark (MTPB) to evaluate multi-step program synthesis.

Codex [141]: This LLM is trained on a subset of public Python GitHub repositories to generate code from docstrings. Computer programming is an iterative process where the programs are often debugged and updated before fulfilling the requirements. Similarly, Codex generates 100 versions of a program by repetitive sampling for a given description, which produces a working solution for 77.5% of the problems passing unit tests. Its more powerful version powers GitHub Copilot2.

AlphaCode [142]: A set of large language models, ranging from 300M to 41B parameters, designed for competition-level code generation tasks. It uses multi-query attention [143] to reduce memory and cache costs. Since competitive programming problems highly require deep reasoning and an understanding of complex natural language algorithms, the AlphaCode models are pre-trained on filtered GitHub code in popular languages and then fine-tuned on a new competitive programming dataset named CodeContests. The CodeContests dataset mainly contains problems, solutions, and test cases collected from the Codeforces platform3. The pre-training employs standard language modeling objectives, while GOLD [144] with tempering [145] serves as the training objective for the fine-tuning on CodeContests data. To evaluate the performance of AlphaCode, simulated programming competitions are hosted on the Codeforces platform: overall, AlphaCode ranks in the top 54.3% among over 5000 competitors, and its Codeforces rating is within the top 28% of recently participated users.

CodeT5+ [34]: CodeT5+ is based on CodeT5 [146], with a shallow encoder and deep decoder, trained in multiple stages, initially on unimodal data (code) and later on bimodal data (text-code pairs). Each training stage has different training objectives and activates different model blocks (encoder, decoder, or both) according to the task. The unimodal pre-training includes span denoising and CLM objectives, whereas the bimodal pre-training objectives contain contrastive learning, matching, and CLM for text-code pairs. CodeT5+ adds special tokens to the text to enable task modes, for example, [CLS] for contrastive loss, [Match] for text-code matching, etc.

StarCoder [147]: A decoder-only model with the SantaCoder architecture, employing Flash attention to scale up the context length to 8k. StarCoder trains an encoder to filter names, emails, and other personal data from the training data. Its fine-tuned variant outperforms PaLM, LLaMA, and LaMDA on the HumanEval and MBPP benchmarks.

3.1.3. Scientific Knowledge

Galactica [148]: Trained on a large curated corpus of human scientific knowledge, comprising 48 million papers, textbooks, lecture notes, millions of compounds and proteins, scientific websites, encyclopedias, and more, using the metaseq library, which is built on PyTorch and fairscale [149]. The model wraps reasoning datasets with the <work> token to provide step-by-step reasoning context to the model, which has been shown to improve the performance on reasoning tasks.

3.1.4. Dialog

LaMDA [150]: A decoder-only model pre-trained on public dialog data, public dialog utterances, and public web documents, where more than 90% of the pre-training data is in English. LaMDA is trained with the objective of producing responses that exhibit high levels of quality, safety, and groundedness. To achieve this, discriminative and generative fine-tuning techniques are incorporated to enhance the model's safety and quality aspects. As a result, the LaMDA models can be utilized as general language models performing various tasks.

3.1.5. Finance

BloombergGPT [151]: A non-causal decoder model trained using both financial ("FINPILE" from the Bloomberg archive) and general-purpose datasets. The model's architecture is similar to BLOOM [13] and OPT [14]. It allocates 50B parameters to different blocks of the model using the approach of [113]. For effective training, BloombergGPT packs documents together with <|endoftext|> to use the maximum sequence length, uses a warmup batch size starting from 1024 and going to 2048, and manually reduces the learning rate multiple times during training.
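A minimal sketch of this packing strategy follows; the tokenizer, separator id, and sequence length are illustrative stand-ins rather than BloombergGPT's actual settings.

# A minimal sketch of packing documents into fixed-length training sequences
# with an end-of-text separator so no context window is wasted on padding.
def pack_documents(docs, tokenize, eot_id: int, seq_len: int = 2048):
    """Concatenate tokenized docs separated by eot_id, then cut full-length sequences."""
    stream = []
    for doc in docs:
        stream.extend(tokenize(doc))
        stream.append(eot_id)                 # marks a document boundary
    return [stream[i:i + seq_len]             # drop the trailing partial sequence
            for i in range(0, len(stream) - seq_len + 1, seq_len)]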
Xuan Yuan 2.0 [152]: A Chinese financial chat model with BLOOM's [13] architecture, trained on a combination of general-purpose, financial, general-purpose instruction, and financial institution datasets. Xuan Yuan 2.0 combined the pre-training and fine-tuning stages to avoid catastrophic forgetting.

3.2. Fine-Tuned LLMs

Pre-trained LLMs have excellent generalization abilities to unseen tasks. However, because they are generally trained with the objective of next-token prediction, LLMs have limited capacity to follow user intent and are prone to generating unethical, toxic, or inaccurate responses [20]. For their effective utilization, LLMs are fine-tuned to follow instructions [16, 17, 97] and generate safe responses [20], which also results in increased zero-shot, few-shot, and cross-task generalization [97, 16, 18], with minimal compute increment, e.g., 0.2% of the total pre-training compute for PaLM 540B [16]. We review various fine-tuned LLMs and strategies for effective fine-tuning in this section.

2 https://github.com/features/copilot
3 https://codeforces.com/
Table 1: Noteworthy findings and insights of pre-trained Large Language Models.

T5
• An encoder-decoder with shared parameters performs on par with one whose parameters are not shared
• Fine-tuning model layers (adapter layers) works better than the conventional way of training only the classification layers

GPT-3
• Few-shot performance of LLMs is better than zero-shot, suggesting that LLMs are meta-learners

mT5
• Large multi-lingual models perform equivalently to single-language models on downstream tasks; however, smaller multi-lingual models perform worse

CPM-2
• Prompt fine-tuning requires updating very few parameters while achieving performance comparable to full model fine-tuning
• Prompt fine-tuning takes more time to converge than full model fine-tuning
• Inserting prompt tokens in between sentences can allow the model to understand relations between sentences and long sequences
• In an analysis, CPM-2 finds that prompts work as a provider (of additional context) and an aggregator (aggregating information with the input text) for the model

ERNIE 3.0
• A modular LLM architecture with a universal representation module and a task-specific representation module helps in the fine-tuning phase
• Optimizing the parameters of a task-specific representation network during fine-tuning is an efficient way to take advantage of the powerful pre-trained model

HyperCLOVA
• By employing prompt-based tuning, the performance of models can be improved, often surpassing that of state-of-the-art models when the backward gradients of inputs are accessible

Yuan 1.0
• A model architecture that excels in pre-training and fine-tuning may exhibit contrasting behavior in zero-shot and few-shot learning

Gopher
• Relative encodings enable the model to be evaluated on sequences longer than those seen during training

ERNIE 3.0 Titan
• An additional self-supervised adversarial loss that distinguishes between real and generated text improves model performance compared to ERNIE 3.0

GPT-NeoX-20B
• Parallel attention + FF layers speed up training by 15% with the same performance as cascaded layers
• Initializing feed-forward output layers before residuals with the scheme in [153] prevents activations from growing with increasing depth and width
• Training on the Pile outperforms GPT-3 in the five-shot setting

OPT
• Restart training from an earlier checkpoint with a lower learning rate if the loss diverges
• The model is prone to generating repetitive text and getting stuck in a loop

Galactica
• Galactica's performance continues to improve across validation, in-domain, and out-of-domain benchmarks, even with multiple repetitions of the corpus, which is superior to existing research on LLMs
• A working-memory token approach can achieve strong performance over existing methods on the mathematical MMLU and MATH benchmarks. It sets a new state-of-the-art on several downstream tasks, such as PubMedQA (77.6%) and MedMCQA dev (52.9%)

GLaM
• Model capacity can be maintained at reduced computation by replacing the feed-forward layer in each transformer layer with a mixture-of-experts (MoE)
• A model trained on filtered data shows consistently better performance on both NLG and NLU tasks, where the effect of filtering is more significant on the former
• Filtered pre-training corpora play a crucial role in the generation capability of LLMs, especially for downstream tasks
• Scaling GLaM MoE models can be achieved by increasing the size or number of experts in the MoE layer. Given a fixed budget of computation, more experts contribute to better performance

LaMDA
• The model can be fine-tuned to learn to call different external information resources and tools

AlphaCode
• For higher effectiveness and efficiency, a transformer model can be asymmetrically constructed with a shallower encoder and a deeper decoder
• To achieve better performance, it is necessary to employ strategies such as massively scaled upsampling, followed by the filtering and clustering of samples into a compact set
• The utilization of novel sampling-efficient transformer architectures designed to facilitate large-scale sampling is crucial
• Simplifying problem descriptions can effectively improve the model's performance

Chinchilla
• The model size and the number of training tokens should be scaled proportionately: for each doubling of the model size, the number of training tokens should be doubled as well

PaLM
• English-centric models produce better translations when translating to English than to non-English languages
• Generalized models can match the performance of specialized small models on language translation
• Larger models memorize a higher percentage of their training data
• Performance has not yet saturated even at the 540B scale, which means larger models are likely to perform better

AlexaTM
• An encoder-decoder architecture is more suitable for training LLMs than decoder-only, given its bidirectional attention to the context
• A Causal Language Modeling (CLM) task can be added to give the model efficient in-context learning
• Placing layer norm at the beginning of each transformer layer improves training stability

U-PaLM
• Training with a mixture of denoisers outperforms PaLM when trained further for a few more FLOPs
• Training with a mixture of denoisers improves the infilling ability and the diversity of open-ended text generation

GLM-130B
• Pre-training data with a small proportion of multi-task instruction data improves the overall model performance

CodeGen
• Multi-step prompting for code synthesis leads to better user-intent understanding and code generation

PanGu-Σ
• Sparse models provide the benefits of large models at a lower computation cost
• Randomly Routed Experts reduce catastrophic forgetting effects, which in turn is essential for continual learning
• Randomly Routed Experts allow extracting a domain-specific sub-model in deployment, which is cost-efficient while maintaining performance similar to the original

BloombergGPT
• Pre-training with general-purpose and task-specific data improves task performance without hurting other model capabilities

XuanYuan 2.0
• Combining the pre-training and fine-tuning stages in a single training run avoids catastrophic forgetting

StarCoder
• The HHH prompt by Anthropic allows the model to follow instructions without fine-tuning

LLaMA-2
• A model trained on unfiltered data is more toxic but may perform better on downstream tasks after fine-tuning
• A model trained on unfiltered data requires fewer samples for safety alignment

LLaMA-3/3.1
• Increasing the batch size gradually stabilizes training without loss spikes
• High-quality data at the final stages of training improves model performance
• Increasing the model's context window step-wise allows it to better adapt to various sequence lengths

Nemotron-4 340B
• A model aligned iteratively on synthetic data, with data generated by the previously aligned model, achieves competitive performance

DeepSeek
• Batch size should increase with the increase in compute budget, while the learning rate should decrease

DeepSeek-v2
• Multi-head latent attention (MLA) performs better than multi-head attention (MHA) while requiring a significantly smaller KV cache, therefore achieving faster generation
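The parallel attention + feed-forward finding above (GPT-NeoX-20B, and likewise PaLM) can be made concrete with a short sketch. The block below is a hedged PyTorch illustration with assumed dimensions and omitted details such as dropout and causal masking; it only shows how both sub-layers read the same normalized input and are summed, instead of being cascaded.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Sketch of a transformer block with parallel attention and feed-forward
    sub-layers (in the spirit of GPT-NeoX/PaLM), instead of the usual cascaded
    order. The sizes and single shared pre-norm are illustrative assumptions."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.norm(x)                  # one shared pre-norm for both branches
        attn_out, _ = self.attn(h, h, h)  # attention branch
        ffn_out = self.ffn(h)             # feed-forward branch, computed in parallel
        return x + attn_out + ffn_out     # cascaded variant would instead feed
                                          # the attention output into the FFN

x = torch.randn(2, 16, 512)               # (batch, sequence, features)
print(ParallelBlock()(x).shape)            # torch.Size([2, 16, 512])
```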
Table 2: Key insights and findings from the study of instruction-tuned Large Language Models.

WebGPT
• To aid the model in effectively filtering and utilizing relevant information, human labelers play a crucial role in answering questions regarding the usefulness of the retrieved documents
• Letting a fine-tuned language model interact with a text-based web-browsing environment can improve end-to-end retrieval and synthesis via imitation learning and reinforcement learning
• Generating answers with references makes it easier for labelers to judge the factual accuracy of answers

OPT-IML
• Creating a batch with multiple task examples is important for better performance
• Example-proportional sampling alone is not enough; training datasets should also be proportional for better generalization/performance
• Performance on fully held-out and partially supervised tasks improves by scaling the number of tasks or categories, whereas fully supervised tasks see no effect
• Including a small amount, i.e., 5%, of pre-training data during fine-tuning is effective
• Only 1% reasoning data improves performance; adding more deteriorates it
• Adding dialogue data makes the performance worse

Sparrow
• Labelers' judgment and well-defined alignment rules help the model generate better responses
• Good dialogue goals can be broken down into detailed natural-language rules for the agent and the raters
• The combination of reinforcement learning (RL) with reranking yields optimal performance in terms of preference win rates and resilience against adversarial probing

WizardCoder
• Fine-tuning on instruction-tuning data re-written into a more complex set improves performance

LLaMA-2-Chat
• The model learns to write safe responses through fine-tuning on safe demonstrations, while an additional RLHF step further improves model safety and makes it less prone to jailbreak attacks

LIMA
• A small amount of high-quality data is enough for fine-tuned model generalization
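OPT-IML's observation that example-proportional sampling alone is insufficient relates to the examples-proportional mixing commonly used in multi-task fine-tuning, where each dataset's sampling rate is proportional to its size but capped so that huge datasets do not dominate. The sketch below is an assumed, minimal rendition of this mixing-rate computation (the cap value and datasets are made up), not OPT-IML's exact recipe.

```python
# Hedged sketch of examples-proportional mixing with a size cap, in the spirit
# of T5/OPT-IML-style multi-task mixtures; all numbers here are illustrative.

def mixing_rates(dataset_sizes, cap=500_000):
    """Sampling rate r_m = min(e_m, cap) / sum_k min(e_k, cap), so datasets
    larger than `cap` stop gaining sampling probability."""
    capped = {name: min(size, cap) for name, size in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: c / total for name, c in capped.items()}

sizes = {"nli": 100_000, "qa": 2_000_000, "summarization": 400_000}
for name, rate in mixing_rates(sizes).items():
    print(f"{name}: {rate:.3f}")
# nli: 0.100, qa: 0.500, summarization: 0.400
```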
Figure 10: This example illustrates the PanGu-Σ architecture, as depicted in the image sourced from [92].

Figure 11: An example image shows an instance of the Flan training paradigm, taken from [16].

3.2.1. Instruction-Tuning with Manually Created Datasets

Numerous hand-crafted instruction-tuning datasets with different design choices have been proposed in the literature to instruction-tune LLMs. The performance of fine-tuned LLMs depends on multiple factors, such as dataset, instruction diversity, prompting templates, model size, and training objectives. Keeping this in view, diverse fine-tuned models have emerged in the literature using manually created datasets.
The models T0 [17] and mT0 (multi-lingual) [154] employ templates to convert existing datasets into prompt datasets. They have shown improvements in generalization to zero-shot and held-out tasks. Tk-Instruct [18] fine-tuned the T5 model with in-context instructions to study generalization on unseen tasks when given in-context instructions during test time. The model outperformed Instruct-GPT, despite being smaller in size, i.e., 11B parameters as compared to the 175B of GPT-3.
Increasing Tasks and Prompt Setups: Zero-shot and few-shot performance improves significantly by expanding the task collection and prompt styles. OPT-IML [97] and Flan [16] curated larger datasets of 2k and 1.8k tasks, respectively. Since increasing task size alone is not enough, OPT-IML and Flan add more prompting setups in their datasets: zero-shot, few-shot, and CoT. In continuation, CoT Collection [101] fine-tunes Flan-T5 further on 1.88M CoT samples. Another method [102] uses symbolic tasks along with the tasks in T0, Flan, etc.

3.2.2. Instruction-Tuning with LLM-Generated Datasets

Generating an instruction-tuning dataset requires carefully writing instructions and input-output pairs, which are often written by humans, smaller in size, and less diverse. To overcome this, self-instruct [19] proposed an approach to prompt available LLMs to generate instruction-tuning datasets. Self-instruct outperformed models trained on the manually created dataset SUPER-NATURALINSTRUCTIONS (a dataset with 1600+ tasks) [18] by 33%. It starts with a seed of 175 tasks, 1 instruction, and 1 sample per task and iteratively generates new instructions (52k) and instances (82k input-output pairs) using GPT-3 [6]. Contrary to this, Dynosaur [155] uses the meta-data of datasets on Huggingface to prompt LLMs to generate multiple task instruction-tuning datasets.
LLaMA Tuned: Various models in the literature instruction-tune LLaMA [156] with GPT-3 [6] or GPT-4 [157] generated datasets. Among these, Alpaca [158], Vicuna [159], and LLaMA-GPT-4 [160] are a few general-purpose fine-tuned models, where Alpaca is trained on 52k samples from text-davinci-003, Vicuna on 70k samples from ShareGPT.com, and LLaMA-GPT-4 by re-creating Alpaca instructions from GPT-4. Goat [161] fine-tunes LLaMA for arithmetic tasks (1 million samples) by generating data from ChatGPT and outperforms GPT-4, PaLM, BLOOM, OPT, etc., attributing its success to LLaMA's consistent tokenization of numbers. HuaTuo [162] is a medical knowledge model, fine-tuned with a generated QA dataset of 8k instructions.
Complex Instructions: Evol-Instruct [163, 164] prompts LLMs to convert given instructions into a more complex set. The instructions are iteratively evolved by re-writing them in more complex wording and creating new instructions. With this style of automated instruction generation, WizardLM [163] (LLaMA fine-tuned on 250k instructions) outperforms Vicuna and Alpaca, and WizardCoder [164] (fine-tuned StarCoder) beats Claude-Plus, Bard, and others.

3.2.3. Aligning with Human Preferences

Incorporating human preferences into LLMs presents a significant advantage in mitigating undesirable behaviors and ensuring accurate outputs. The initial work on alignment, such as InstructGPT [20], aligns GPT-3 using a 3-step approach: instruction-tuning, reward modeling, and fine-tuning with reinforcement learning (RL). GPT-3, supervised fine-tuned on demonstrations, is queried to generate responses, which human labelers rank according to human values, and a reward model is trained on the ranked data. Lastly, GPT-3 is trained with proximal policy optimization (PPO) using rewards on the generated data from the reward model. LLaMA 2-Chat [21] improves alignment by dividing reward modeling into helpfulness and safety rewards and using rejection sampling in addition to PPO. The initial four versions of LLaMA 2-Chat are fine-tuned with rejection sampling and then with PPO on top of rejection sampling.
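The reward-modeling step of the RLHF recipe above trains on ranked response pairs; the standard pairwise objective maximizes the margin between the scores of the preferred and rejected response. The snippet below is a minimal, assumed sketch of that loss in PyTorch (the scoring model itself is abstracted away, and the scalar rewards are made up), not OpenAI's implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores, rejected_scores):
    """Pairwise ranking loss used in RLHF reward modeling:
    -log sigmoid(r(x, y_chosen) - r(x, y_rejected)),
    averaged over a batch of labeled comparisons."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy example with made-up scalar rewards for three comparison pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])     # r(x, y_w) from the reward model
rejected = torch.tensor([0.4, 0.5, 1.1])   # r(x, y_l) from the reward model
print(pairwise_reward_loss(chosen, rejected))  # tensor(0.5035) approximately
```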
Aligning with Supported Evidence: This style of alignment allows the model to generate responses with proofs and facts, reduces hallucination, and assists humans more effectively, which increases trust in the model's output. Similar to the RLHF training style, a reward model is trained to rank generated responses containing web citations in answers to questions, and is later used to train the model, as in GopherCite [165], WebGPT [166], and Sparrow [167]. The ranking model in Sparrow [167] is divided into two branches, preference reward and rule reward, where human annotators adversarially probe the model to break a rule. These two rewards together rank a response to train with RL.
Aligning Directly with SFT: The PPO in the RLHF pipeline is complex, memory-intensive, and unstable, requiring multiple models: reward, value, policy, and reference models. Avoiding this sophisticated alignment pipeline is possible by incorporating minimal changes in the supervised fine-tuning (SFT) pipeline, as in [168, 169, 170], with better or comparable performance to PPO. Direct preference optimization (DPO) [168] trains a model directly on the human-preferred responses to maximize the likelihood of preferred over unpreferred responses, with a per-sample importance weight. Reward-ranked fine-tuning (RAFT) [169] fine-tunes the model on responses ranked by the reward model. Preference ranking optimization (PRO) [171] and RRHF [170] penalize the model to rank responses according to human preferences with a supervised loss. On the other hand, chain-of-hindsight (CoH) [172] provides feedback to the model in language rather than as a reward, to learn good versus bad responses.
Aligning with Synthetic Feedback: Aligning LLMs with human feedback is slow and costly. The literature suggests a semi-automated process to align LLMs by prompting LLMs to generate helpful, honest, and ethical responses to queries and fine-tuning on the newly created dataset. Constitutional AI [173] replaces human feedback in RLHF with AI, calling it RL from AI feedback (RLAIF). AlpacaFarm [174] designs prompts to imitate human feedback using LLM APIs. Opposite to constitutional AI, AlpacaFarm injects noise into the feedback to replicate human mistakes. Self-Align [98] prompts the LLM with ICL examples, instructing the LLM about what the response should contain to be considered useful and ethical. The same LLM is later fine-tuned with the new dataset.
Aligning with Prompts: LLMs can be steered with prompts to generate desirable responses without training [175, 176]. The self-correction prompting in [176] concatenates instructions and CoT with questions, guiding the model to answer its instruction following a strategy to ensure moral safety before the actual answer. This strategy is shown to reduce the harm in generated responses significantly.
Red-Teaming/Jailbreaking/Adversarial Attacks: Under adversarial probing, LLMs exhibit harmful behaviors, hallucinations, leakage of personal information, and other shortcomings. The models are susceptible to generating harmful responses even though they are aligned for safety [177, 178]. Red-teaming is a common approach to address illicit outputs, where the LLMs are prompted to generate harmful outputs [178, 179]. The dataset collected through red-teaming is used to fine-tune models for safety. While red-teaming largely relies on human annotators, another work [180] red-teams automatically, using an LLM to find prompts that lead other LLMs to produce harmful outputs.

3.2.4. Continued Pre-Training

Although fine-tuning boosts a model's performance, it leads to catastrophic forgetting of previously learned information. Concatenating fine-tuning data with a few randomly selected pre-training samples in every iteration avoids network forgetting [181, 152]. This is also effective in adapting LLMs for cases where the fine-tuning data is small and the original capacity is to be maintained. Prompt-based continued pre-training (PCP) [182] trains the model with text and instructions related to tasks and then finally instruction-tunes the model for downstream tasks.

3.2.5. Sample Efficiency

While fine-tuning data is generally many-fold smaller than the pre-training data, it still has to be large enough for acceptable performance [16, 97, 18] and requires proportional computing resources. Studying the effects on performance with less data, the existing literature [183, 184] finds that models trained on less data can outperform models trained with more data. In [183], 25% of the total downstream data is found to be enough for state-of-the-art performance. Selecting a coreset-based 0.5% of the total instruction-tuning data improves the model performance by 2% in [184], as compared to tuning on the complete data. Less is more for alignment (LIMA) [185] uses only 1000 carefully created demonstrations to fine-tune the model and has achieved comparable performance to GPT-4.

3.3. Increasing Context Window

LLMs are trained with limited context windows due to expensive attention and high memory requirements. A model trained on limited sequence lengths fails to generalize to unseen lengths at inference time [186, 49]. Alternatively, LLMs with ALiBi [65] positional encodings can perform zero-shot length extrapolation. However, ALiBi has less expressive power [66] and inferior performance on multiple benchmarks [46], and many LLMs use RoPE positional embedding, which is unable to perform zero-shot extrapolation. A larger context length has benefits such as a better understanding of longer documents, more samples in in-context learning, execution of bigger reasoning processes, etc. Expanding the context length during fine-tuning is slow, inefficient, and computationally expensive [49]. Therefore, researchers employ various context window extrapolation techniques, discussed below.
Position Interpolation: Rather than extrapolating, [49] shows that interpolating position encodings within the pre-trained context window is more effective. The work demonstrates that only 1000 steps of fine-tuning are enough to achieve better results on larger windows without reducing performance compared to the original context size. Giraffe [46] uses power scaling in RoPE, and YaRN [47] proposed NTK-aware interpolation.
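The core of position interpolation is to rescale positions so that a longer input is squeezed into the position range seen during pre-training, rather than querying RoPE at unseen positions. The sketch below illustrates the idea on RoPE rotation angles; the dimensions and lengths are illustrative assumptions, and real implementations fold this scaling into the model's rotary embedding code.

```python
import numpy as np

def rope_angles(positions, dim=64, base=10000.0):
    """RoPE rotation angles for given (possibly fractional) positions."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per pair
    return np.outer(positions, inv_freq)               # (len(positions), dim/2)

train_len, target_len = 2048, 8192
positions = np.arange(target_len)

# Extrapolation would query RoPE at unseen positions 2048..8191;
# interpolation instead rescales them into the pre-trained range [0, 2048).
scaled = positions * (train_len / target_len)          # e.g., 8191 -> 2047.75
angles = rope_angles(scaled)
print(angles.shape, scaled.max())                      # (8192, 32) 2047.75
```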
Efficient Attention Mechanism: Dense global attention is one of the major constraints in training larger-context-window LLMs. Using efficient attention variants, such as local, sparse, and dilated attention, reduces the computation cost significantly. LongT5 [48] proposes transient global attention (TGlobal), applying attention to local and global tokens (windowed token averaging). The model replaces the attention in T5 [10] with TGlobal attention, pre-trains the model on a sequence length of 4098, fine-tunes on larger window sizes, as large as 16k, and improves task performance on longer inputs. This shows the extrapolation ability of TGlobal attention with only fine-tuning. COLT5 [187] uses two branches, one with lightweight and the other with heavyweight attention and feed-forward layers. All tokens are processed by the lightweight branch, and only important tokens are routed to the heavyweight branch. LongNet [188] replaces standard attention with dilated attention, expanding the sequence length to 1 billion tokens. LongLoRA [189] proposes shift-short attention, used during fine-tuning to reduce dense attention costs. However, the model uses dense attention during inference and achieves performance similar to full-attention fine-tuning.
Extrapolation without Training: LM-Infinite [186] and parallel context windows (PCW) [190] show length extrapolation is possible using pre-trained LLMs. LM-Infinite suggested a Λ-shaped attention applied within the original context window limits. Likewise, PCW chunks larger inputs into the pre-trained context lengths and applies the same positional encodings to each chunk.

3.4. Augmented LLMs

LLMs are capable of learning from the examples concatenated with the input, known as context augmentation, in-context learning (ICL), or few-shot prompting. They show excellent generalization to unseen tasks with few-shot prompting, enabling LLMs to answer queries beyond the capacity acquired during training [6, 55]. These emergent abilities allow for adapting the model without fine-tuning, which is a costly process. Aside from this, hallucination, i.e., producing inaccurate, unsafe, or factually incorrect responses, is common for LLMs and is avoided by augmenting contextual data. While the user can provide in-context samples in the query [54, 32], here we specifically refer to the methods that access external storage programmatically, calling them augmented LLMs.
The literature suggests various external memory designs to augment LLMs: long-term [191, 192, 193, 194], short-term [195], symbolic [196], and non-symbolic [197, 198]. The memory can be maintained in different formats such as documents, vectors, or databases. A few systems maintain intermediate memory representations to retain information across multiple iterations [194, 192], while others extract important information from the datasets and save it in memory for recall [199]. The memory read and write operations are performed either with or without the LLM's cooperation [192, 200, 194, 201], acting as a feedback signal in [195]. We discuss different types of augmented LLMs below.

Figure 12: A flow diagram of Retrieval Augmented LLMs. The retriever extracts a similar context to the input and forwards it to the LLM either in simple language or encoded through Fusion-in-Decoder (FiD). Depending on the task, retrieval and generation may repeat multiple times.

3.4.1. Retrieval Augmented LLMs

LLMs may have limited memory and outdated information, leading to inaccurate responses. Retrieving relevant information from external, up-to-date storage enables the LLMs to answer accurately with references and to utilize more information. With retrieval augmentation, smaller models have been shown to perform on par with larger models; for instance, an 11B model can become competitive with the 540B PaLM in [25], and a 7.5B model with the 280B Gopher in [193]. Retrieval-augmented language modeling (RALM) has two major components, shown in Figure 12, namely: 1) the retriever and 2) the language model. In RALM, the retriever plays a crucial role in driving the LLM's response, where incorrect information can steer the LLM to false behavior. This has led to the development of various methods to retrieve accurate information and fuse it with the query for better performance.
Zero-Shot Retrieval Augmentation: This kind of augmentation keeps the original LLM architecture and weights unchanged and uses BM25 [202], nearest neighbors, or frozen pre-trained models like BERT [7] as the retriever. The retrieved information is provided as input to the model for response generation, shown to improve performance over LLMs without retrieval [198, 203]. In some scenarios, multiple retrieval iterations are required to complete the task. The output generated in the first iteration is forwarded to the retriever to fetch similar documents. Forward-looking active retrieval (FLARE) [197] initially generates the response and corrects the output by retrieving relevant documents if the response contains low-confidence tokens. Similarly, RepoCoder [204] fetches code snippets recursively for code completion.
Training with Retrieval Augmentation: To reduce failures in retrieval-augmented generation (RAG), researchers train or fine-tune retrievers and LLMs with a retrieval augmentation pipeline. We discuss the literature below based on its focus on the respective training processes of the pipeline.
Training LLM: Retrieval-enhanced transformer (RETRO) [193] shows that pre-training smaller LLMs with a RAG pipeline outperforms larger LLMs trained without RAG, such as GPT-3. RETRO uses a 2-trillion token subset of MassiveText as a database. The retrieval pipeline divides the input query into subsets and retrieves relevant chunks from the database for each subset, which are encoded together with the input's intermediate representations for generating tokens. It uses chunked cross-attention to attend to previous chunks auto-regressively. A study on RETRO [205] shows that models pre-trained without RAG but fine-tuned using RAG lack the performance gains obtained by pre-training with RAG.
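The retrieval-augmentation pattern discussed in this subsection can be made concrete with a minimal zero-shot loop: embed the corpus and the query with a frozen encoder, pick the nearest documents, and prepend them to the prompt. Everything in the sketch (the toy embedding function, corpus, and prompt format) is an assumption for illustration, not a specific system from the literature.

```python
import numpy as np

def embed(text):
    """Stand-in for a frozen retriever encoder (e.g., a BERT-style model);
    here a toy hash-based bag-of-words embedding so the sketch is runnable."""
    v = np.zeros(64)
    for tok in text.lower().split():
        v[hash(tok) % 64] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

def retrieve(query, corpus, k=2):
    """Return the k documents whose embeddings are most similar to the query."""
    sims = [float(embed(doc) @ embed(query)) for doc in corpus]
    top = np.argsort(sims)[::-1][:k]
    return [corpus[i] for i in top]

corpus = [
    "RoPE rotates query and key vectors by position-dependent angles.",
    "ALiBi biases attention scores linearly with token distance.",
    "BM25 is a classic lexical ranking function for retrieval.",
]
query = "How does ALiBi handle position information?"
context = "\n".join(retrieve(query, corpus))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # the frozen LLM would now be called on this augmented prompt
```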
Training Retriever: The quality of the responses generated by LLMs is highly dependent on the in-context examples. Therefore, [206, 207, 208, 209] train retrievers to retrieve accurate few-shot samples while keeping the LLM frozen for generation. Retrieved samples are ranked to build ground-truth data to train retrievers with contrastive learning in [206, 208]. RoBERTa is trained for downstream tasks in [207] for ICL sample retrieval. REPLUG [209] trains the retriever with supervised signals from the frozen LLM's generated outputs.
Training Retriever and LLM: Further benefits are achieved by training both the retriever and the model in [25, 210, 211]. In this case, the error propagates back to the retriever, updating both the language model and the retriever. While masked language modeling (MLM) is a common pre-training objective [25, 211], the retrieval pre-trained transformer (RPT) [210] used document chunk prediction as a pre-training objective for long text modeling.
Encoded Context Augmentation: Concatenating retrieved documents with the query becomes infeasible as the sequence length and sample size grow. Encoding the context and fusing it with the decoder (Fusion-in-Decoder) using cross-attention makes it possible to augment more samples without increasing computation costs significantly [212, 193, 210, 25].
Web Augmented: Locally stored memory, external to the LLM, has limited information. However, a large amount of information is available on the internet, which gets updated regularly. Rather than storing information locally, various methods retrieve query-related context through a web search and forward it to the LLMs [213, 214, 166].

3.4.2. Tool Augmented LLMs

While RAG relies on the retriever to provide context to the LLM to answer queries, tool-augmented LLMs capitalize on the reasoning abilities of LLMs to iteratively plan by dividing tasks into sub-tasks, selecting necessary tools, and taking actions to complete the task [215, 216, 217, 27]. A generic pipeline of tool-augmented LLMs is shown in Figure 13, where different modules are selected in a loop until task completion.

Figure 13: A basic flow diagram of tool augmented LLMs. Given an input and a set of available tools, the model generates a plan to complete the task. The tool augmented LLMs utilize different modules iteratively, such as retriever, tool execution, read-write to memory, feedback, etc., depending on the task.

Zero-Shot Tool Augmentation: LLMs' in-context learning and reasoning abilities enable them to interact with tools without training. Automatic reasoning and tool-use (ART) [217] builds a task library with demonstrations of reasoning steps and calls to external tools. It retrieves similar task examples and provides the context to the LLM for inference. Aside from this, [218] shows tool documentation is enough to teach LLMs to use tools without demonstrations. RestGPT [219] integrates LLMs with RESTful APIs by decomposing tasks into planning and API selection steps. The API selector understands the API documentation to select a suitable API for the task and plan the execution. ToolkenGPT [220] uses tools as tokens by concatenating tool embeddings with other token embeddings. During inference, the LLM generates the tool tokens representing the tool call, stops text generation, and restarts using the tool execution output.
Training with Tool Augmentation: LLMs are trained to interact with diverse tools, enhancing planning abilities to overcome the limitations of zero-shot tool augmentation [221, 27, 222, 223]. Gorilla [221] instruction-tunes LLaMA with information retrieval from API documentation. It uses the self-instruct [19] data generation pipeline with GPT-4, providing in-context examples retrieved from API documentation. The tool-augmented language model (TALM) [27] fine-tunes T5 [10] for tool use with a self-play approach, where it iteratively completes tool manipulation tasks and includes them back in the training set. ToolLLM [223] collects 16k APIs from RapidAPI. It samples APIs from the list to generate an instruction-tuning dataset using ChatGPT in single-tool and multi-tool scenarios. For high-quality datasets, ToolLLM suggested a depth-first search-based decision tree (DFSDT) method to generate ground truths with diverse reasoning and planning.
Multimodal Tool Augmentation: The compositional reasoning capacity of LLMs allows them to manipulate tools in multimodal settings [215, 216, 224]. Following the pipeline shown in Figure 13, the LLM outlines a plan, generally executing in a sequence: Plan → Tool selection → Execute → Inspect → Generate, to respond to the user query. Here, the database of tools is rich in modalities, including text, images, etc. Many of the multimodal tool augmentation systems employ multimodal LLMs [31, 225, 224, 216], while others utilize single-modality LLMs and generate a plan for using different modality tools to solve multimodal queries [226].
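The Plan → Tool selection → Execute → Inspect → Generate sequence can be written down as a small control loop around the model. The sketch below is a generic, assumed illustration of such a loop (the `llm` function, tool registry, and stopping convention are all placeholders), not the pipeline of any specific system discussed above.

```python
# Hedged sketch of a tool-augmented LLM loop; `llm` is a placeholder for a
# model call that returns either a tool invocation or a final answer as JSON.

import json

def calculator(expression):
    # Toy tool; a real system would sandbox this instead of using eval.
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def llm(prompt):
    # Placeholder "model": it always plans one calculator call, then answers.
    if "Observation:" not in prompt:
        return json.dumps({"tool": "calculator", "input": "37 * 41"})
    return json.dumps({"answer": "37 * 41 = 1517"})

def run_agent(task, max_steps=4):
    prompt = f"Task: {task}\n"
    for _ in range(max_steps):           # plan -> select -> execute -> inspect
        step = json.loads(llm(prompt))
        if "answer" in step:             # generate: the model is done
            return step["answer"]
        result = TOOLS[step["tool"]](step["input"])   # execute selected tool
        prompt += f"Observation: {result}\n"          # inspect: feed result back
    return "step budget exhausted"

print(run_agent("What is 37 * 41?"))     # 37 * 41 = 1517
```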
3.5. LLMs-Powered Agents

AI agents are autonomous entities capable of planning, decision-making, and performing actions to achieve complex goals. In the early days, AI agents were rule-based, designed for narrow tasks, and had limited capabilities, such as Clippy [227] and Deep Blue [228]. In contrast to this, LLMs' ability to respond to dynamic scenarios has made it possible to incorporate them in diverse applications, including LLMs-powered agents [224, 216], where LLMs behave as the brain of the agent. LLMs have been incorporated in web agents [166, 167], coding agents [229], tool agents [27, 223], embodied agents [26], and conversational agents [195], requiring minimal to no fine-tuning. Below we summarize the research in LLMs-based autonomous agents. For a more detailed discussion, please refer to [230, 231].
LLMs Steering Autonomous Agents: LLMs are the cognitive controllers of autonomous agents. They generate plans, reason about tasks, incorporate memory to complete tasks, and adapt the outline depending on the feedback from the environment. Depending on the acquired capabilities of LLMs, many methods fine-tune the model, propose a better prompting approach, or utilize different modules to enhance agents' performance. The modules and strategies employed in autonomous agents are briefly discussed below.
Planning and Reasoning: Completing a complex task requires human-like logical thinking, planning the necessary steps, and reasoning about current and future directions. Prompting methods like chain-of-thought [103], tree-of-thoughts [105], and self-consistency [104] are central to agents, eliciting LLMs to reason about their actions and choose among different paths for task completion. When LLMs are prompted with a task description and a sequence of actions, they can accurately generate plan actions without any fine-tuning [232]. Reasoning via planning (RAP) [233] incorporates a re-purposed LLM as a world model to reason about future outcomes and explore alternative paths for task completion. Retroformer [234] uses a retrospective LLM to improve the main LLM's planning and reasoning capabilities by providing helpful task cues.
Feedback: LLMs in open-loop systems generate plans and assume that the agent will complete them successfully. However, the actual scenario is different, with failures and variable responses from the environment. To correctly complete tasks, many methods use LLMs in a closed loop, where the action response is provided as feedback to the LLM to re-assess and update the plan as required [235, 236, 237, 195]. Another direction of research exploits LLMs as reward functions to train reinforcement learning (RL) policies instead of humans [238].
Memory: LLMs can learn from the context provided in the prompt. In addition to internal memory, various systems employ external memory to save the response history. Reflexion [195] maintains an episodic memory to use previous responses as feedback to improve future decision-making. Retroformer [234] improves its responses by employing short-term and long-term memory, where short-term memory contains recent responses and long-term memory keeps summarized failed attempts to add to the prompt as reflections.
Multi-Agent Systems: LLMs can play user-defined roles and behave like specific domain experts. In multi-agent systems, each LLM is assigned a unique role, simulating human behavior and collaborating with other agents to complete a complex task [229, 239].
LLMs in Physical Environments: LLMs are good at instruction-following; however, utilizing them for physically grounded tasks requires adaptation, as they lack real-world knowledge. This can lead to illogical responses for a particular physical situation [240, 26]. SayCan [240] makes the LLM aware of the available low-level task operations: the LLM (Say) builds a high-level plan to complete the task, and a learned affordance function (Can) explores the possibility of executing the plan in the real world. SayCan uses RL to train the language-conditioned affordance function. PaLM-E enables the LLM to solve grounded tasks by training a multi-modal LLM fed with inputs directly from the sensors.
Manipulation: In the area of manipulation [236, 241], LLMs enhance a robot's dexterity and adaptability, excelling in tasks like object recognition, grasping, and collaboration. They analyze visual and spatial information to determine the most effective approach to interact with objects.
Navigation: LLMs enhance a robot's ability to navigate complex environments with precision and adaptability [242, 243, 244, 245]. They generate feasible paths and trajectories for robots, accounting for intricate environmental details [246]. This ability is valuable in scenarios requiring precise and dynamically adaptable navigation in environments like warehouses, transport, healthcare facilities, and residences.

3.6. Efficient LLMs

Deploying LLMs in production is expensive. Reducing their running costs while preserving performance is an appealing area of research. This section summarizes the approaches suggested to enhance LLMs' efficiency.

3.6.1. Parameter Efficient Fine-Tuning

Fine-tuning LLMs with tens or hundreds of billions of parameters, such as GPT-3 (175B), BLOOM (176B), and MT-NLG (530B), is computationally intensive and time-consuming. To avoid complete model fine-tuning, numerous parameter-efficient fine-tuning (PEFT) techniques [40, 247, 41, 38, 39] try to achieve acceptable fine-tuning performance at reduced costs. Compared to full fine-tuning [248], PEFT performs better in low-resource setups, achieves comparable performance in medium-resource scenarios, and performs worse than full fine-tuning under high-resource availability. An overview of different PEFT approaches is shown in Figure 14.
Adapter Tuning: Adds a few trainable parameters within the transformer block. The adapter layer is a sequence of feature downscaling, non-linearity, and upscaling [106]. Variants of adapter tuning inject adapter layers sequentially [106] or in parallel [38], whereas the mixture of adapters (AdaMix) [249] employs multiple adapter modules in a single layer. AdaMix routes input instances randomly to one of the multiple downscale and upscale modules. The mixture of adapters is averaged out for inference to avoid additional latency.
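The bottleneck adapter just described (down-project, non-linearity, up-project, plus a residual connection) is only a few lines of code. Below is a hedged PyTorch sketch with assumed sizes; in practice such modules are inserted after the attention and feed-forward sub-layers of a frozen transformer, and only the adapter parameters are trained.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: downscale -> non-linearity -> upscale, added back
    to the input through a residual connection (sizes are illustrative)."""
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, h):
        # Residual connection keeps the frozen network's behavior recoverable.
        return h + self.up(self.act(self.down(h)))

# Only ~2 * d_model * bottleneck parameters are trained per adapter,
# while the surrounding transformer weights stay frozen.
h = torch.randn(2, 16, 768)
print(Adapter()(h).shape)    # torch.Size([2, 16, 768])
```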
Low-Rank Adaptation (LoRA) [250] learns low-rank decomposed matrices while freezing the original weights. The learned weights are fused with the original weights for inference, avoiding latency.

Figure 14: Illustration of parameter-efficient fine-tuning paradigms, where x is input and h is hidden state, figure courtesy [38]. Parallel adapter and LoRA fall in the adapter tuning category.

Prompt Tuning: Prompting is an effective way to adapt a pre-trained LLM for a downstream task. However, manual prompts bring uncertainty into the model's predictions, where a change in a single word can drop the performance [247]. Prompt tuning alleviates this problem by fine-tuning only 0.001%-3% additional parameters [251]. It concatenates trainable prompt parameters with the model embeddings [247, 40, 251]. Task-specific fixed discrete prompts are concatenated with the input embeddings in [40]. As discrete prompts bring instability, prompts are encoded through a learnable mapping in P-Tuning [247], named continuous prompts, which are appended to the discrete prompts. Only the prompt encoder is trainable in the model. In an extension of P-Tuning, continuous prompts are concatenated with each layer of the network in [251]. Progressive prompts [252] avoid catastrophic forgetting and transfer previously learned knowledge by sequentially adding trainable prompt embeddings to the previously frozen task embeddings.
Prefix Tuning: A set of trainable task-specific prefix vectors is appended to the frozen transformer layers in prefix tuning [41]. The prefix vectors are virtual tokens attended to by the context tokens on the right. In addition, adaptive prefix tuning [253] applies a gating mechanism to control the information from the prefix and the actual tokens.
Bias Tuning: Fine-tuning only the bias terms in small to medium training-data regimes has been found effective in BitFit [254]. This method achieves full fine-tuning performance for tasks with less training data and comparable performance with more training data.

3.6.2. Quantization

LLMs require extensive computing and memory for inference. Deploying a 175B parameter GPT-3 model needs at least five 80GB A100 GPUs and 350GB of memory to store it in FP16 format [44]. Such demanding requirements for deploying LLMs make it harder for smaller organizations to utilize them. Model compression is an effective solution but comes at the cost of degraded performance, especially at large scales greater than 6B. These models exhibit very large magnitude outliers that do not exist in smaller models [255], making quantizing LLMs challenging and requiring specialized methods [44, 256].
Post-Training Quantization: Minimal or no training is required in this type of quantization, without significantly compromising the model performance. LLM.int8() [255] uses full-precision matrix multiplication for weights associated with outlier features and 8-bit multiplication for the remaining features. The lower-precision multiplication outputs are converted to FP-16 and concatenated with the others. The quantized models have homogeneous word embeddings, which may degrade their performance. To fix this, token-level knowledge distillation is employed in [45], along with independent quantization scaling factors for each module due to varying weight distributions. Feature distributions are asymmetric and appear in different channels; outlier suppression [257] shifts and scales per-channel activation distributions for effective quantization. SmoothQuant [44] quantizes activations and weights to INT8 format by smoothing activations and migrating the quantization difficulty toward the weights. It multiplies the inverse of the smoothing factor with the weights, which introduces a few outliers in the weights but makes them easier to quantize than the unsmoothed activations. OPTQ [256] uses the optimal brain compression (OBC) [258] algorithm to quantize the model layer-by-layer and updates weights to compensate for quantization error. To improve speed and performance, OPTQ updates weights in arbitrary order, employs lazy updates, and uses better Cholesky kernels. Outlier-aware weight quantization (OWQ) [259] uses the OPTQ algorithm for quantization but assigns higher precision to the vulnerable weights causing outliers and lower precision to the rest.
Quantization-Aware Training: To compensate for performance degradation, a quantized model is fine-tuned in quantization-aware training (QAT) [260, 261, 262]. AlphaTuning quantizes the model using binary-coding quantization (BCQ) [263] and fine-tunes only the quantization scaling factors. This approach improves performance over parameter-efficient fine-tuning of the pre-trained model.
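At its simplest, post-training weight quantization maps floating-point weights to integers with a per-tensor (or per-channel) scale, which is the building block the methods above refine. The sketch below shows plain symmetric absmax INT8 quantization with numpy; it is a baseline illustration under assumed settings, not LLM.int8(), SmoothQuant, or OPTQ.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric absmax quantization: w ~ scale * q with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(q.dtype, f"mean abs error = {err:.5f}")   # int8, small reconstruction error

# A single outlier stretches the scale and hurts every other weight's precision,
# which is why the outlier-aware schemes above treat such weights separately.
w[0, 0] = 50.0
q, scale = quantize_int8(w)
print(f"with outlier: mean abs error = {np.abs(w - dequantize(q, scale)).mean():.5f}")
```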
parameter-efficient fine-tuning of the pre-trained model. Sim-            now facilitating LLMs to perceive different modalities of infor-
ilarly, parameter-efficient and quantization-aware adaptation             mation like image [269, 270, 271], video [272, 273, 274], au-
(PEQA) [264] reduces the precision of fully-connected layers              dio [275, 274, 276], etc. Multimodal LLMs (MLLMs) present
and fine-tunes only quantization scaling parameters. LLM-                 substantial benefits compared to standard LLMs that process
QAT [262] generates training data from the pre-trained network            only text. By incorporating information from various modal-
and trains a quantized student model with knowledge distilla-             ities, MLLMs can achieve a deeper understanding of context,
tion. QLoRA [261] fine-tunes 4-bit quantized pre-trained LLM              leading to more intelligent responses infused with a variety of
with LoRA [250] using a 4-bit normal float, which shows better            expressions. Importantly, MLLMs align closely with human
performance over a 4-bit integer and float.                               perceptual experiences, leveraging the synergistic nature of our
                                                                          multisensory inputs to form a comprehensive understanding of
3.6.3. Pruning                                                            the world [276, 26]. Coupled with a user-friendly interface,
   Pruning is an alternative approach to quantization to com-             MLLMs can offer intuitive, flexible, and adaptable interactions,
press model size, thereby reducing LLMs deployment costs                  allowing users to engage with intelligent assistants through a
significantly. Compared to task-agnostic pruning, task-specific           spectrum of input methods. According to the ways of construct-
pruning is easily achievable with good performance, where a               ing models, current MLLMs can be generally divided into three
model is fine-tuned on the downstream task and pruned for                 streams: pre-training, fine-tuning, and prompting. In this sec-
faster inference. It is possible to prune LLMs for individual             tion, we will discuss more details of these main streams, as well
tasks, but the cost of pruning and deploying task-specific mod-           as the important application of MLLMs in visual reasoning.
els is high. To overcome this, many structured and unstructured           Pre-training: This stream of MLLMs intends to support differ-
pruning methods for LLMs have been proposed to maintain rea-              ent modalities using unified end-to-end models. For instance,
sonable performance across all tasks while shrinking the model            Flamingo [269] applies gated cross-attention to fuse vision and
size [265, 42, 266].
Unstructured Pruning: This kind of pruning removes less important weights without maintaining any structure. Existing LLM pruning methods take advantage of a characteristic unique to LLMs and uncommon in smaller models, namely that a small subset of hidden states is activated with large magnitude [255]. Pruning by weights and activations (Wanda) [265] prunes weights in every row based on importance, calculated by multiplying the weights with the norm of the input. The pruned model does not require fine-tuning, thereby saving computational costs. Outlier weighed layerwise sparsity (OWL) [267] extends Wanda with non-uniform layer pruning. It shows that the number of outliers varies across layers; therefore, the model should have a variable pruning ratio per layer for better performance. Contrastive pruning (CAP) [43] iteratively prunes the model by training the sparse model using a contrastive loss between the pre-trained model, the fine-tuned model, and snapshots of previous sparse models, to learn task-specific and task-agnostic knowledge.
Structured Pruning: Here, the parameters are removed in groups, rows, columns, or matrices, which speeds up inference because of effective hardware tensor core utilization [265]. LLM-Pruner [42] employs a 3-stage structured pruning strategy: identifying the groups of hidden states that cause each other to activate during the forward pass, keeping important groups and removing less important ones, and fine-tuning the pruned model with LoRA. Sparsity-induced mask learning (SIMPLE) [268] prunes the network using learnable masks. Similarly, another method prunes LLMs by learning masks and removing unimportant rank-1 components of the factorized weight matrix [266].
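To make the Wanda criterion concrete, we sketch the per-layer scoring and masking below; the function and its calibration interface are our illustration (assumed names), not the reference implementation of [265].

```python
import torch

def wanda_prune_layer(weight: torch.Tensor, inputs: torch.Tensor, sparsity: float = 0.5):
    """Wanda-style pruning sketch for one linear layer.

    weight:   (out_features, in_features) matrix of a linear layer.
    inputs:   (n_tokens, in_features) calibration activations fed to the layer.
    sparsity: fraction of weights to remove in every output row.
    """
    # Importance of each weight = |w_ij| * L2 norm of the j-th input feature,
    # computed over a small calibration set (no gradients, no fine-tuning).
    feature_norm = inputs.norm(p=2, dim=0)        # (in_features,)
    importance = weight.abs() * feature_norm      # broadcasts over output rows

    # Within every row, zero out the weights with the lowest importance.
    k = int(weight.shape[1] * sparsity)
    _, prune_idx = torch.topk(importance, k, dim=1, largest=False)
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return weight * mask
```

Because the score only needs weights and a handful of calibration activations, the pruned model can be used directly, without retraining.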
3.7. Multimodal LLMs

Inspired by the success of LLMs in natural language processing applications, an increasing number of research works are extending them to modalities beyond text, producing multimodal LLMs (MLLMs).
Pre-training: This stream of MLLMs aligns the vision and language modalities, which are collected from a pre-trained and frozen visual encoder and LLM, respectively. Moreover, BLIP-2 [270] proposes a two-stage strategy to pre-train a Querying Transformer (Q-Former) for the alignment between the vision and language modalities: in the first stage, vision-language representation learning is bootstrapped from a frozen visual encoder; and in the second stage, a frozen LLM bootstraps vision-to-language generative learning for zero-shot image-to-text generation. Similarly, MiniGPT-4 [277] deploys a pre-trained and frozen ViT [278], Q-Former, and Vicuna LLM [159], training only the linear projection layer to align the vision and language modalities.
Fine-tuning: Derived from instruction tuning [16] for NLP tasks [20, 16, 97], researchers fine-tune pre-trained LLMs using multimodal instructions. Following this method, LLMs can be easily and effectively extended as multimodal chatbots [277, 271, 29] and multimodal task solvers [279, 30, 280]. The key issue of this stream of MLLMs is collecting multimodal instruction-following data for fine-tuning [58]. To address this issue, the solutions of benchmark adaptation [279, 281, 282], self-instruction [19, 31, 283], and hybrid composition [284, 280] are employed, respectively. To mitigate the gap between the original language modality and the additional modalities, a learnable interface is introduced to connect the different modalities of frozen pre-trained models. In particular, the learnable interface is expected to work in a parameter-efficient tuning manner: e.g., LLaMA-Adapter [285] applies an efficient transformer-based adapter module for training, and LaVIN [284] dynamically learns the multimodal feature weights using a mixture-of-modality adapter. Different from the learnable interface, expert models can directly convert multimodalities into language: e.g., VideoChat-Text [272] incorporates Whisper [286], a speech recognition expert model, to generate captions of given videos for the understanding of the following LLM.
Prompting: Different from the fine-tuning technique that directly updates the model parameters given task-specific datasets, the prompting technique provides certain context, examples, or instructions to the model, fulfilling specialized tasks without changing the model parameters. Since prompting can significantly reduce the need for large-scale multimodal data, this technique is widely used to construct MLLMs. Particularly, to solve multimodal Chain of Thought (CoT) problems [103], LLMs are prompted to generate both the reasoning process and the answer given multimodal inputs [287]. On this front, different learning paradigms are exploited in practice: for example, Multimodal-CoT [287] involves two stages of rationale generation and answer inference, where the input of the second stage is a combination of the original input and the output of the first stage (see the sketch below); and CoT-PT [288] applies both prompt tuning and specific visual bias to generate a chain of reasoning implicitly. In addition to CoT problems, LLMs can also be prompted with multimodal descriptions and tools, effectively dividing complex tasks into sub-tasks [289, 290].
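A minimal sketch of this two-stage scheme is given below; `mllm` is a hypothetical callable (image + prompt -> text) standing in for any multimodal model, not an actual API of [287].

```python
def multimodal_cot(mllm, image, question: str):
    """Two-stage multimodal CoT sketch: rationale first, answer second."""
    # Stage 1: rationale generation conditioned on the image and question.
    rationale = mllm(image=image,
                     prompt=f"Question: {question}\n"
                            f"Describe the image and reason step by step:")
    # Stage 2: answer inference conditioned on the original input plus the
    # generated rationale, mirroring the two-stage design of Multimodal-CoT.
    answer = mllm(image=image,
                  prompt=f"Question: {question}\nRationale: {rationale}\n"
                         f"Therefore, the answer is:")
    return rationale, answer
```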
Visual Reasoning Application: Recent visual reasoning systems [291, 292, 216, 293] tend to apply LLMs for better visual information analysis and visual-language integration. Different from previous works [294, 295] that rely on limited VQA datasets and small-scale neural networks, current LLM-aided methods offer the benefits of stronger generalization ability, emergent ability, and interactivity [58]. To realize visual reasoning with the help of LLMs, prompting and fine-tuning techniques can also be utilized: for example, PointCLIP V2 [292] applies LLMs to generate 3D-specific prompts, which are encoded as textual features and then combined with visual features for 3D recognition; and GPT4Tools [31] employs LoRA [250] to fine-tune LLMs following tool-related instructions. Serving as a controller [293], decision maker [296], or semantics refiner [291, 297], LLMs significantly facilitate the progress of visual reasoning research.
3.8. Summary and Discussion

The choice of positional encoding affects the performance and training stability of LLMs. BLOOM [13] finds that ALiBi outperforms learned and rotary positional encodings; contrary to this, GLM-130B [33] identifies rotary positional encoding as better than ALiBi. So, there is no conclusion in the literature about positional encodings yet.
Parallel Attention: In this type of attention, the feed-forward and attention layers are parallel to each other rather than sequential in a transformer block, i.e., the block output is computed as x + Attn(LN(x)) + FFN(LN(x)) instead of feeding the attention output into the feed-forward sublayer. It has been shown to reduce training time by 15%. There is no evidence of a performance drop due to this change in the literature, and it is used by the models PaLM [15], GPT-NeoX [118], and CodeGen [140].
Multi-Query Attention: It has shared key and value attention heads in a transformer block, while query attention heads are projected as usual. This reduces memory usage and speeds up sampling in autoregressive decoding. No performance degradation has been observed with this change, and it makes training efficient by allowing larger batch sizes. Multi-query attention is used in [15, 142]; a minimal sketch follows.
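The sketch below makes the shared key/value head explicit; it is our minimal illustration of the mechanism, not code from any of the cited models.

```python
import torch
import torch.nn.functional as F

def multi_query_attention(x, w_q, w_k, w_v, n_heads: int):
    """Multi-query attention sketch: n_heads query heads share ONE key/value head.

    x:   (batch, seq, d_model) input activations.
    w_q: (d_model, d_model)    query projection, split into n_heads heads.
    w_k: (d_model, head_dim)   single shared key projection.
    w_v: (d_model, head_dim)   single shared value projection.
    """
    b, s, d = x.shape
    head_dim = d // n_heads

    q = (x @ w_q).view(b, s, n_heads, head_dim).transpose(1, 2)  # (b, h, s, hd)
    k = (x @ w_k).unsqueeze(1)                                   # (b, 1, s, hd), shared
    v = (x @ w_v).unsqueeze(1)                                   # (b, 1, s, hd), shared

    # The single key/value head broadcasts across all query heads, so the
    # KV cache is n_heads times smaller, speeding up autoregressive decoding.
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5           # (b, h, s, s)
    causal = torch.triu(torch.ones(s, s, dtype=torch.bool), 1)
    scores = scores.masked_fill(causal, float("-inf"))
    out = F.softmax(scores, dim=-1) @ v                          # (b, h, s, hd)
    return out.transpose(1, 2).reshape(b, s, d)
```

Grouped-query attention, used by LLaMA-2, LLaMA-3.1, and DeepSeek in Table 5, generalizes this design by sharing each key/value head across a group of query heads instead of across all of them.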
Mixture of Experts: This type of architecture enables easily scaling models to trillions of parameters [92, 91]. Only a few experts are activated during computation, making the models compute-efficient. The performance of MoE models is better than that of dense models trained on the same amount of data, and they require less computation during fine-tuning to achieve performance similar to dense models, as discussed in [91]. MoE architectures are less prone to catastrophic forgetting and are therefore more suited for continual learning [92]. Extracting smaller sub-models for downstream tasks is possible without losing any performance, making the MoE architecture hardware-friendly [92].
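The top-k routing at the core of such layers can be sketched as follows (a toy router for illustration, not the GLaM or PanGu-Σ implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a top-k routed mixture-of-experts feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model). The router picks k experts per token, so only
        # a small fraction of the parameters is active for any given token.
        gate = F.softmax(self.router(x), dim=-1)            # (n_tokens, n_experts)
        weight, idx = gate.topk(self.k, dim=-1)             # (n_tokens, k)
        weight = weight / weight.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                sel = idx[:, slot] == e                     # tokens routed to expert e
                if sel.any():
                    out[sel] += weight[sel, slot, None] * expert(x[sel])
        return out
```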
Sparse vs. Dense Activated: GPT-3 [6] uses sparse transformers [67], whereas GLaM [91] and PanGu-Σ [92] use MoE [121] architectures to lower computational costs and increase the model size and capacity. According to the literature, sparse modules do not degrade the model's performance [67]. However, more experiments are required to verify this statement.

Models | Publication Venue | License Type | Model Creators | Purpose | No. of Params | Commercial Use | Steps Trained | Data/Tokens | Data Cleaning | No. of Processing Units | Processing Unit Type | Training Time | Calculated Train. Cost | Training Parallelism | Library
T5 [10]                JMLR'20        Apache-2.0    Google     General 11B      ✓        1M     1T   Heur+Dedup              1024       TPU v3      -           -        D+M     Mesh TensorFlow
GPT-3 [6]              NeurIPS'20          -        OpenAI     General 175B     ×          -  300B    Dedup+QF                 -         V100       -           -         M               -
mT5 [11]               NAACL'21       Apache-2.0    Google     General 13B      ✓        1M     1T         -                   -           -        -           -          -              -
PanGu-α [108]          arXiv'21       Apache-2.0    Huawei     General 200B     ✓       260k 1.1TB Heur+Dedup                2048     Ascend 910    -           -    D+OP+P+O+R      MindSpore
CPM-2 [12]             AI Open'21        MIT       Tsinghua    General 198B     ✓        1M 2.6TB       Dedup                  -           -        -           -        D+M        JAXFormer
Codex [141]            arXiv'21            -        OpenAI     Coding 12B       ×          -  100B      Heur                   -           -        -           -          -              -
ERNIE 3.0 [110]        arXiv'21            -         Baidu     General 10B      ×       120k∗ 375B Heur+Dedup                 384        V100       -           -         M∗       PaddlePaddle
Jurassic-1 [112]       White-Paper'21 Apache-2.0     AI21      General 178B     ✓          -  300B         -                  800        GPU        -           -       D+M+P     Megatron+DS
HyperCLOVA [114]       EMNLP'21            -         Naver     General 82B      ×          -  300B Clf+Dedup+PF              1024        A100     321h      1.32 Mil      M           Megatron
Yuan 1.0 [115]         arXiv'21       Apache-2.0       -       General 245B     ✓        26k∗ 180B Heur+Clf+Dedup            2128        GPU        -           -       D+T+P             -
Gopher [116]           arXiv'21            -        Google     General 280B     ×          -  300B    QF+Dedup               4096       TPU v3    920h     13.19 Mil     D+M        JAX+Haiku
ERNIE 3.0 Titan [35] arXiv'21              -         Baidu     General 260B     ×          -  300B Heur+Dedup                  -      Ascend 910    -           -     D+M+P+D*     PaddlePaddle
GPT-NeoX-20B [118] BigScience'22 Apache-2.0 EleutherAI         General 20B      ✓       150k 825GB      None                  96       40G A100     -           -         M    Megatron+DS+PyTorch
OPT [14]               arXiv'22          MIT         Meta      General 175B     ✓       150k 180B       Dedup                 992      80G A100     -           -        D+T          Megatron
BLOOM [13]             arXiv'22        RAIL-1.0 BigScience     General 176B     ✓          -  366B    Dedup+PR                384      80G A100 2520h       3.87 Mil    D+T+P     Megatron+DS
Galactica [148]        arXiv'22       Apache-2.0     Meta      Science 120B     ×       225k 106B       Dedup                 128     80GB A100     -           -          -          Metaseq
GLaM [91]              ICML'22             -        Google     General 1.2T     ×       600k∗ 600B       Clf                 1024       TPU v4      -           -         M           GSPMD
LaMDA [150]            arXiv'22            -        Google     Dialog 137B      ×        3M   2.81T    Filtered              1024       TPU v3    1384h     4.96 Mil     D+M           Lingvo
MT-NLG [117]           arXiv'22       Apache-v2.0 MS.+Nvidia   General 530B     ×          -  270B         -                 4480      80G A100     -           -       D+T+P     Megatron+DS
AlphaCode [142]        Science'22     Apache-v2.0 Google       Coding 41B       ✓       205k 967B Heur+Dedup                   -        TPU v4      -           -         M         JAX+Haiku
Chinchilla [96]        arXiv'22            -        Google     General 70B      ×          -   1.4T   QF+Dedup                 -        TPUv4       -           -          -        JAX+Haiku
PaLM [15]              arXiv'22            -        Google     General 540B     ×       255k 780B       Heur                 6144       TPU v4      -           -        D+M         JAX+T5X
AlexaTM [122]          arXiv'22       Apache v2.0 Amazon       General 20B      ×       500k 1.1T      Filtered               128        A100     2880h     1.47 Mil      M              DS
U-PaLM [124]           arXiv'22            -        Google     General 540B     ×        20k     -         -                  512       TPU v4    120h      0.25 Mil       -              -
UL2 [125]              ICLR'23        Apache-2.0    Google     General 20B      ✓        2M     1T         -                  512       TPU v4      -           -         M          JAX+T5X
GLM [33]               ICLR'23        Apache-2.0 Multiple      General 130B     ×          -  400B         -                  768      40G A100 1440h       3.37 Mil      M               -
CodeGen [140]          ICLR'23        Apache-2.0 Salesforce    Coding 16B       ✓       650k 577B Heur+Dedup                   -        TPU v4      -           -        D+M        JAXFormer
LLaMA [127]            arXiv'23            -         Meta      General 65B      ×       350k 1.4T Clf+Heur+Dedup             2048      80G A100 504h        4.12 Mil     D+M          xFormers
PanGu-Σ [92]          arXiv'23          -        Huawei     General 1.085T   ×          -  329B         -                  512     Ascend 910 2400h          -    D+OP+P+O+R      MindSpore
BloombergGPT [151]    arXiv'23          -      Bloomberg    Finance 50B      ×       139k 569B       Dedup                 512      40G A100 1272h       1.97 Mil      M           PyTorch
Xuan Yuan 2.0 [152]   arXiv'23      RAIL-1.0 Du Xiaoman     Finance 176B     ✓          -  366B     Filtered                -      80GB A100     -           -         P              DS
CodeT5+ [34]           arXiv'23         BSD-3     Salesforce   Coding 16B       ✓       110k 51.5B      Dedup                 16       40G A100     -           -          -             DS
StarCoder [147]        arXiv'23      OpenRAIL-M BigCode        Coding 15.5B     ✓       250k    1T Dedup+QF+PF                512      80G A100 624h        1.28 Mil    D+T+P      Megatron-LM
LLaMA-2 [21]           arXiv'23       LLaMA-2.0      Meta      General 70B      ✓       500k    2T Minimal Filtering           -       80G A100 1.7Mh           -          -              -
PaLM-2 [123]          arXiv'23            -        Google     General    -     ×          -     -   Dedup+PF+QF              -           -        -           -          -              -
LLaMA-3.1 [130]        arXiv'24       LLaMA-3.0      Meta      General 405B     ✓       1.2M   15T    Dedup+QF                16k      80G H100 30.84Mh         -      D+T+P+C        PyTorch
Mixtral 8x22B [131] web'24            Apache-2.0 Mistral AI    General 141B     ✓          -     -         -                   -           -        -           -          -              -
Snowflake Arctic [132] web'24         Apache-2.0 Snowflake     General 480B     ✓          -   3.5T        -                   -                    -           -        T+P             DS
Nemotron-4 340B [137] web'24         Nvidia      Nvidia     General 340B     ✓          -    9T         -                 6144      80G H100     -           -       D+T+P             -
DeepSeek [138]         arXiv'24          MIT       DeepSeek    General 67B      ✓          -    2T    Dedup+QF                 -           -     300.6Kh        -       D+T+P            DS
DeepSeek-v2 [139]      arXiv'24          MIT       DeepSeek    General 67B      ✓          -   8.1T      QF                    -         H800    172.8Kh        -        D+P         HAI-LLM
Table 4: Summary of instruction tuned LLMs (>10B). All abbreviations are the same as Table 3. Entries in “Data/Tokens” starting with “S-” represent the number
of training samples.
Models | Publication Venue | License Type | Model Creators | Purpose | No. of Params | Commercial Use | Pre-trained Models | Steps Trained | Data/Tokens | No. of Processing Units | Processing Unit Type | Train. Time | Calculated Train. Cost | Train. Parallelism | Library
WebGPT [166]      arXiv'21             -        OpenAI          General 175B        ×         GPT-3           -        -              -           -     -              -            -           -
T0 [17]           ICLR'22          Apache-2.0 BigScience        General 11B         ✓           T5            -      250B           512        TPU v3 270h         0.48 Mil         -           -
Tk-Instruct [18]  EMNLP'22           MIT          AI2+          General 11B         ✓           T5          1000       -            256        TPU v3  4h         0.0036 Mil        -       Google T5
OPT-IML [97]      arXiv'22             -          Meta          General 175B        ×          OPT           8k       2B            128       40G A100  -              -          D+T        Megatron
Flan-U-PaLM [16] ICLR'22           Apache-2.0    Google         General 540B        ✓       U-PaLM          30k        -            512        TPU v4   -              -            -       JAX+T5X
mT0 [154]         ACL'23           Apache-2.0 HuggingFace+      General 13B         ✓          mT5            -        -              -           -     -              -            -           -
Sparrow [167]     arXiv'22             -         Google         Dialog 70B          ×       Chinchilla        -        -             64        TPU v3   -              -           M            -
WizardCoder [164] arXiv'23         Apache-2.0 HK Bapt.          Coding 15B          ×       StarCoder       200     S-78k             -           -     -              -            -           -
Alpaca [158]      Github'23        Apache-2.0   Stanford        General 13B         ✓        LLaMA        3-Epoch   S-52k            8        80G A100 3h            600          FSDP       PyTorch
Vicuna [159]      Github'23        Apache-2.0   LMSYS           General 13B         ✓        LLaMA        3-Epoch   S-125k            -           -     -              -          FSDP       PyTorch
LIMA [185]        arXiv'23             -         Meta+          General 65B         -        LLaMA       15-Epoch   S-1000            -           -     -              -            -           -
Koala [300]       Github'23        Apache-2.0 UC-Berkley        General 13B         ×        LLaMA        2-Epoch   S-472k           8          A100   6h            100            -      JAX/FLAX
Models | Type | Training Objective | Attention | Vocab | Tokenizer | Norm | PE | Activation | Bias | nL | nH | HS
(Here, “nL” is the number of layers, “nH” the number of attention heads, and “HS” the hidden-state size.)
 T5 (11B)                    Enc-Dec           Span Corruption           Standard         32k        SentencePiece       Pre-RMS     Relative     ReLU         ×     24   128    1024
 GPT3 (175B)                Causal-Dec              Next Token        Dense+Sparse          -              -               Layer     Learned      GeLU         ✓     96    96   12288
 mT5 (13B)                   Enc-Dec           Span Corruption           Standard         250k       SentencePiece       Pre-RMS     Relative     ReLU          -     -     -       -
 PanGu-α (200B)             Causal-Dec              Next Token           Standard         40k             BPE              Layer        -           -           -    64   128   16384
 CPM-2 (198B)                Enc-Dec           Span Corruption           Standard         250k       SentencePiece       Pre-RMS     Relative     ReLU          -    24    64       -
 Codex (12B)                Causal-Dec              Next Token           Standard           -            BPE+           Pre-Layer    Learned      GeLU          -    96    96   12288
 ERNIE 3.0 (10B)            Causal-Dec              Next Token           Standard           -          WordPiece        Post-Layer   Relative     GeLU          -    48    64    4096
 Jurassic-1 (178B)          Causal-Dec              Next Token           Standard         256k       SentencePiece∗     Pre-Layer    Learned      GeLU         ✓     76    96   13824
 HyperCLOVA (82B)           Causal-Dec              Next Token        Dense+Sparse          -            BPE*           Pre-Layer    Learned      GeLU          -    64    80   10240
 Yuan 1.0 (245B)            Causal-Dec              Next Token           Standard           -              -                 -          -           -           -    76     -   16384
 Gopher (280B)              Causal-Dec              Next Token           Standard         32k        SentencePiece       Pre-RMS     Relative     GeLU         ✓     80   128   16384
 ERNIE 3.0 Titan (260B)     Causal-Dec              Next Token           Standard           -          WordPiece        Post-Layer   Relative     GeLU          -    48   192   12288
 GPT-NeoX-20B               Causal-Dec              Next Token            Parallel        50k             BPE              Layer     Rotary       GeLU         ✓     44    64       -
 OPT (175B)                 Causal-Dec              Next Token           Standard           -             BPE                -          -         ReLU         ✓     96    96       -
 BLOOM (176B)               Causal-Dec              Next Token           Standard         250k            BPE              Layer      ALiBi       GeLU         ✓     70   112   14336
 Galactica (120B)           Causal-Dec              Next Token           Standard         50k         BPE+custom           Layer     Learned      GeLU         ×     96    80   10240
 GLaM (1.2T)                 MoE-Dec                Next Token           Standard         256k       SentencePiece         Layer     Relative     GeLU         ✓     64   128   32768
 LaMDA (137B)               Causal-Dec              Next Token           Standard         32k             BPE              Layer     Relative    GeGLU          -    64   128    8192
 MT-NLG (530B)              Causal-Dec              Next Token           Standard         50k             BPE           Pre-Layer    Learned      GeLU         ✓    105   128   20480
 AlphaCode (41B)             Enc-Dec                Next Token         Multi-query         8k        SentencePiece           -          -           -           -    64   128    6144
 Chinchilla (70B)           Causal-Dec              Next Token           Standard         32k     SentencePiece-NFKC     Pre-RMS     Relative     GeLU         ✓     80    64    8192
 PaLM (540B)                Causal-Dec              Next Token     Parallel+Multi-query   256k       SentencePiece         Layer      RoPE       SwiGLU        ×    118    48   18432
 AlexaTM (20B)               Enc-Dec                 Denoising           Standard         150k       SentencePiece      Pre-Layer    Learned      GeLU         ✓     78    32    4096
 Sparrow (70B)              Causal-Dec          Pref.&Rule RM                -            32k     SentencePiece-NFKC     Pre-RMS     Relative     GeLU         ✓    16∗    64    8192
 U-PaLM (540B)            Non-Causal-Dec                  MoD      Parallel+Multi-query   256k       SentencePiece         Layer      RoPE       SwiGLU        ×    118    48   18432
 UL2 (20B)                   Enc-Dec                      MoD            Standard         32k        SentencePiece           -          -           -           -    64    16    4096
 GLM (130B)               Non-Causal-Dec      AR Blank Infilling         Standard         130k       SentencePiece         Deep       RoPE       GeGLU         ✓     70    96   12288
 CodeGen (16B)              Causal-Dec              Next Token            Parallel          -             BPE              Layer      RoPE          -           -    34    24       -
 LLaMA (65B)                Causal-Dec              Next Token           Standard         32k             BPE            Pre-RMS      RoPE       SwiGLU         -    80    64    8192
 PanGu-Σ (1085B)            Causal-Dec              Next Token           Standard           -             BPE          Fused Layer      -       FastGeLU        -    40    40    5120
 BloombergGPT (50B)         Causal-Dec              Next Token           Standard         131k          Unigram            Layer      ALiBi       GeLU         ✓     70    40    7680
 Xuan Yuan 2.0 (176B)       Causal-Dec              Next Token              Self          250k            BPE              Layer      ALiBi       GeLU         ✓     70   112   14336
 CodeT5+ (16B)               Enc-Dec       SC+NT+Cont.+Match             Standard           -        Code-Specific           -          -           -           -     -     -       -
 StarCoder (15.5B)          Causal-Dec                     FIM         Multi-query        49k             BPE                -       Learned        -           -    40    48    6144
 LLaMA-2 (70B)              Causal-Dec              Next Token        Grouped-query       32k             BPE            Pre-RMS      RoPE       SwiGLU         -     -     -       -
 PaLM-2                         -                         MoD             Parallel          -              -                 -          -           -           -     -     -       -
 LLaMA-3.1 (405B)           Causal-Dec              Next Token        Grouped-query       128k            BPE            Pre-RMS      RoPE       SwiGLU         -   126   128   16384
 Nemotron-4 (340B)          Causal-Dec              Next Token           Standard         256k       SentencePiece           -        RoPE        ReLU         ×     96    96   18432
 DeepSeek (67B)             Causal-Dec              Next Token        Grouped-query       100k           BBPE            Pre-RMS      RoPE       SwiGLU         -    95    64    8192
 DeepSeek-v2 (67B)           MoE-Dec                Next Token      Multi-Head Latent     100k           BBPE            Pre-RMS      RoPE       SwiGLU         -    60   128    5120
5.2. Evaluation Datasets and Tasks

The evaluation of LLMs is important in gauging their proficiency and limitations. This process measures the model's ability to comprehend, generate, and interact with human language across a spectrum of tasks. Evaluating a language model (LM) is divided into two broader categories: 1) natural language understanding (NLU) and 2) natural language generation (NLG). It is emphasized that tasks in NLU and NLG are softly categorized and are often used interchangeably in the literature.
Natural Language Understanding: This measures the language understanding capacity of LMs. It encompasses multiple tasks, including sentiment analysis, text classification, natural language inference (NLI), question answering (QA), commonsense reasoning (CR), mathematical reasoning (MR), reading comprehension (RC), etc.
Natural Language Generation: This assesses the language generation capabilities of LLMs by understanding the provided input context. It includes tasks such as summarization, sentence completion, machine translation (MT), dialogue generation, etc.
Numerous datasets are proposed for each task, evaluating LLMs against different characteristics. To provide an overview of evaluation datasets, we briefly discuss a few famous datasets within each category and offer a comprehensive list of datasets in Table 9. Moreover, we show a detailed overview of the training datasets and evaluation tasks and benchmarks used by various pre-trained LLMs in Table 10 and fine-tuned LLMs in Table 11. We also compare the top-performing LLMs in various NLP tasks in Table 12.

5.2.1. Multi-task
MMLU [307]: A benchmark that measures the knowledge acquired by models during pretraining and evaluates models in zero-shot and few-shot settings across 57 subjects, testing both world knowledge and problem-solving ability.
SuperGLUE [2]: A more challenging and diverse successor to the GLUE [309] benchmark, SuperGLUE includes a variety of language understanding tasks, such as question answering, natural language inference, and co-reference resolution. It is designed to provide a rigorous test of language understanding and requires significant progress in areas like sample-efficient, transfer, multi-task, and unsupervised or self-supervised learning.
BIG-bench [308]: BIG-bench (Beyond the Imitation Game Benchmark) is a large-scale benchmark designed to test the abilities of LLMs across a wide range of tasks, including reasoning, creativity, ethics, and understanding of specific domains.
GLUE [309]: The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. It includes a variety of tasks that test a wide range of linguistic phenomena, making it a comprehensive tool for evaluating language understanding in AI. A common scoring recipe for the multiple-choice benchmarks above is sketched below.
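Multiple-choice benchmarks such as MMLU are commonly scored by ranking the log-likelihood a model assigns to each candidate answer. The sketch below assumes placeholder `model` and `tokenizer` callables (logits of shape (1, seq, vocab) and text-to-token-ids, respectively) rather than a particular framework:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_choice(model, tokenizer, question: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` given `question`."""
    prompt_ids = tokenizer(question)              # list of token ids (assumed)
    choice_ids = tokenizer(choice)
    ids = torch.tensor([prompt_ids + choice_ids])
    logits = model(ids)                           # (1, seq, vocab) (assumed)
    logprobs = F.log_softmax(logits, dim=-1)
    total = 0.0
    # Each choice token is predicted from the position just before it.
    for pos, tok in enumerate(choice_ids, start=len(prompt_ids)):
        total += logprobs[0, pos - 1, tok].item()
    return total

# Zero-shot prediction: pick the highest-scoring option, e.g.
# predicted = max("ABCD", key=lambda o: score_choice(model, tokenizer, q, " " + o))
```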
Table 6: Summary of optimization settings used for pre-trained LLMs. The values for weight decay, gradient clipping, and dropout are 0.1, 1.0, and 0.1, respectively,
for most of the LLMs.
Table 7: Summary of optimization settings used for instruction-tuned LLMs. Values for gradient clipping and dropout are the same as the pre-trained models, while
no model uses weight decay for instruction tuning.
5.2.2. Language Understanding
WinoGrande [354]: A large-scale dataset inspired by the original Winograd Schema Challenge [357], WinoGrande tests models on their ability to resolve pronoun ambiguity and encourages the development of models that understand the broad context in natural language text.
CoQA [316]: A conversational question-answering dataset, CoQA challenges models with questions that rely on conversation history and require free-form text answers. Its diverse content from seven domains makes it a rigorous test for models' ability to handle a wide range of topics and conversational contexts.
WiC [317]: This dataset assesses a model's ability to discern word meanings based on context, aiding in tasks related to Word Sense Disambiguation.

Table 8: Details of various well-known pre-training and fine-tuning datasets. Here, alignment means aligning with human preferences.
Wikitext103 [318]: With over 100 million tokens from Wikipedia's top articles, this dataset is a rich resource for tasks that require understanding long-term dependencies, such as language modeling and translation.
PG19 [319]: This is a digital library of diverse books from Project Gutenberg. It is specifically designed to facilitate research in unsupervised learning and language modeling, with a special focus on long-form content.
C4 [10]: A clean, multilingual dataset, C4 offers billions of tokens from web-crawled data. It is a comprehensive resource for training advanced Transformer models on various languages.
LCQMC [320]: The Large-scale Chinese Question Matching Corpus (LCQMC) is a dataset for evaluating the performance of models in semantic matching tasks. It contains pairs of questions in Chinese and their matching status, making it a valuable resource for research in Chinese language understanding.

5.2.3. Story Cloze and Sentence Completion
StoryCloze [334]: It introduces a new "StoryCloze Test", a commonsense reasoning framework for evaluating story understanding, generation, and script learning. It considers a model's ability to understand and generate coherent and sensible stories.
LAMBADA [335]: This dataset evaluates contextual text understanding through a word prediction task. Models must predict the last word of a passage, which is easy for humans when given the whole passage, but not when given only the last sentence. A minimal accuracy loop for this task is sketched below.
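The loop is only a few lines; `lm_next_word` below is a placeholder for any wrapper that returns a model's predicted next word, not a specific API:

```python
def lambada_accuracy(lm_next_word, passages):
    """LAMBADA-style evaluation sketch over full passages.

    lm_next_word: placeholder callable (context text -> predicted next word).
    passages:     iterable of multi-word passages whose final word is the target.
    """
    correct, total = 0, 0
    for passage in passages:
        context, target = passage.rsplit(" ", 1)  # strip the final word
        if lm_next_word(context).strip() == target:
            correct += 1
        total += 1
    return correct / max(total, 1)
```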
5.2.4. Physical Knowledge and World Understanding
PIQA [340]: A dataset that probes the physical knowledge of models, aiming to understand how well they are learning about the real world.
TriviaQA [341]: A dataset that tests models on reading comprehension and open-domain question answering (QA) tasks, with a focus on Information Retrieval (IR)-style QA.
ARC [342]: A larger version of the ARC-Challenge, this dataset contains both easy and challenging grade-school level, multiple-choice science questions. It is a comprehensive test of a model's ability to understand and answer complex questions.
ARC-Easy [342]: A subset of the ARC dataset, ARC-Easy contains questions that are answered correctly by either a retrieval-based algorithm or a word co-occurrence algorithm.
                                       Table 9: Categorized evaluation datasets used in evaluating LLMs.
 Type                             Datasets/Benchmarks
 Multi-Task                       MMLU [307], SuperGLUE [2], BIG-bench [308], GLUE [309], BBH [308], CUGE [310], ZeroCLUE [311], FewCLUE [312], Blended Skill Talk [313], HELM [314], KLUE-STS [315]
 Language Understanding           CoQA [316], WiC [317], Wikitext103 [318], PG19 [319], LCQMC [320], QQP [321], WinoGender [322], CB [323], FinRE [324], SanWen [325], AFQMC [311], BQ Corpus [326], CNSS [327], CKBQA [328], CLUENER [311], Weibo [329], AQuA [330], OntoNotes [331], HeadQA [332], Twitter Dataset [333]
 Story Cloze and Sentence Completion   StoryCloze [334], LAMBADA [335], LCSTS [336], AdGen [337], E2E [338], CHID [339], CHID-FC [312]
 Physical Knowledge and World Understanding   PIQA [340], TriviaQA [341], ARC [342], ARC-Easy [342], ARC-Challenge [342], PROST [343], OpenBookQA [344], WebNLG [345], DogWhistle Insider & Outsider [346]
 Contextual Language Understanding     RACE [347], RACE-Middle [347], RACE-High [347], QuAC [348], StrategyQA [349], Quiz Bowl [350], cMedQA [351], cMedQA2 [352], MATINF-QA [353]
 Commonsense Reasoning            WinoGrande [354], HellaSwag [355], COPA [356], WSC [357], CSQA [358], SIQA [359], C3 [360], CLUEWSC2020 [311], CLUEWSC [311], CLUEWSC-FC [312], ReCoRD [361]
 Reading Comprehension            SQuAD [362], BoolQ [363], SQuADv2 [364], DROP [365], RTE [366], WebQA [367], CMRC2017 [368], CMRC2018 [369], CMRC2019 [370], COTE-BD [371], COTE-DP [371], COTE-MFW [371], MultiRC [372], Natural Questions [373], CNSE [327], DRCD [374], DuReader [375], DuReader-robust [376], DuReader-QG [375], SciQ [377], Sogou-log [378], DuReader-robust-QG [376], QA4MRE [379], KorQuAD 1.0 [380], CAIL2018-Task1 & Task2 [381]
 Mathematical Reasoning           MATH [382], Math23k [383], GSM8K [384], MathQA [385], MGSM [386], MultiArith [387], ASDiv [388], MAWPS [389], SVAMP [390]
 Problem Solving                  HumanEval [141], DS-1000 [391], MBPP [392], APPS [382], CodeContests [142]
 Natural Language Inference & Logical Reasoning   ANLI [393], MNLI-m [394], MNLI-mm [394], QNLI [362], WNLI [357], OCNLI [311], CMNLI [311], ANLI R1 [393], ANLI R2 [393], ANLI R3 [393], HANS [395], OCNLI-FC [312], LogiQA [396], StrategyQA [349]
 Cross-Lingual Understanding      MLQA [397], XNLI [398], PAWS-X [399], XSum [400], XCOPA [401], XWinograd [402], TyDiQA-GoldP [403], MLSum [404]
 Truthfulness and Fact Checking   TruthfulQA [405], MultiFC [406], Fact Checking on Fever [407]
 Biases and Ethics in AI          ETHOS [408], StereoSet [409], BBQ [410], Winobias [411], CrowS-Pairs [412]
 Toxicity                         RealToxicityPrompts [413], CivilComments toxicity classification [414]
 Language Translation             WMT [415], WMT20 [416], WMT20-enzh [416], EPRSTMT [312], CCPM [417]
 Scientific Knowledge             AminoProbe [148], BioLAMA [148], Chemical Reactions [148], Galaxy Clusters [148], Mineral
                                  Groups [148]
 Dialogue                         Wizard of Wikipedia [418], Empathetic Dialogues [419], DPC-generated [96] dialogues, ConvAI2 [420],
                                  KdConv [421]
 Topic Classification             TNEWS-FC [312], YNAT [315], KLUE-TC [315], CSL [311], CSL-FC [312], IFLYTEK [422]
It is a great starting point for models beginning to explore advanced question-answering.
ARC-Challenge [342]: A rigorous question-answering dataset, ARC-Challenge includes complex, grade-school level questions that demand reasoning beyond simple retrieval, testing the true comprehension capabilities of models.

5.2.5. Contextual Language Understanding
RACE [347]: The RACE dataset is a reading comprehension dataset collected from English examinations in China, which benchmarks AI models for understanding and answering questions on long and complex passages, simulating the challenge of a real-world examination.
RACE-Middle [347]: A subset of the RACE [347] dataset, RACE-Middle contains middle school-level English exam questions. It offers a slightly less challenging but academically oriented evaluation of a model's comprehension skills.
RACE-High [347]: Another subset of the RACE [347] dataset, RACE-High consists of high school-level English exam questions. It is designed to evaluate the comprehension ability of models in a more academic and challenging context.
QuAC [348]: This dataset simulates an information-seeking dialog between students and teachers using hidden Wikipedia text. It introduces unique challenges not found in machine comprehension datasets, making it a valuable resource for advancing dialog systems.

5.2.6. Commonsense Reasoning
HellaSwag [355]: A dataset that challenges models to pick the best ending to a context, HellaSwag uses Adversarial Filtering to create a ‘Goldilocks’ zone of complexity, where generated text is absurd to humans but often misclassified by models.
COPA [356]: This dataset evaluates a model's progress in open-domain commonsense causal reasoning. Each question comprises a premise and two alternatives, and the model must select the more plausible alternative, testing a model's ability to understand and reason about cause and effect.
WSC [357]: The Winograd Schema Challenge (WSC) is a
Table 10: An illustration of training datasets and evaluation tasks employed by pre-trained LLMs. Here, “QA” is question-answering, “Clf” is classification, “NLI”
is natural language inference, “MT” is machine translation, “RC” is reading comprehension, “CR” is commonsense reasoning, “MR” is mathematical reasoning,
“Mem.” is memorization.
 Models | Training Dataset | BIG-bench | MMLU | SuperGLUE | QA | Clf | NLI | MT | Cloze/Completion | RC | CR | MR | Coding | Truthful/Bias/Toxicity/Mem.
 T5                  C4 [10]                                                           ✓      ✓          ✓     ✓        ✓        ✓    ✓    ✓
 GPT-3               Common Crawl, WebText, Books Cor-                                 ✓      ✓                ✓        ✓        ✓                          ✓
                     pora, Wikipedia
 mT5                 mC4 [11]                                                                 ✓          ✓     ✓
 PanGu-α             1.1TB Chinese Text Corpus                                                ✓          ✓              ✓        ✓    ✓
 CPM-2               WuDaoCorpus [109]                                                                                           ✓         ✓
 Codex               54 million public repositories from Github                                                                                   ✓
 ERNIE-3.0           Chinese text corpora, Baidu Search, Web                           ✓      ✓    ✓     ✓     ✓        ✓        ✓         ✓
                     text, QA-long, QA-short, Poetry and Cou-
                     plet Domain-specific data from medical,
                     law, and financial area Baidu knowledge
                     graph with more than 50 million facts
 Jurassic-1          Wikipedia, OWT, Books, C4, Pile [301],                                   ✓          ✓              ✓        ✓
                     arXiv, GitHub
 HyperCLOVA          Korean blogs, Community sites, News,                                                      ✓
                     KiN Korean Wikipedia, Wikipedia (En-
                     glish and Japanese), Modu-Corpus: Mes-
                     senger, News, Spoken and written lan-
                     guage corpus, Web corpus
 Yuan 1.0            Common Crawl, SogouT, Sogou News,                                        ✓    ✓     ✓                       ✓
                     Baidu Baike, Wikipedia, Books
 Gopher              subsets of MassiveWeb Books, C4, News,        ✓         ✓         ✓      ✓                                       ✓    ✓                ✓
                     GitHub and Wikipedia samples from Mas-
                     siveText
 ERNIE-3.0 TITAN     Same as ERNIE 3.0 and ERNIE 3.0 ad-                                      ✓    ✓     ✓              ✓        ✓
                     versarial dataset, ERNIE 3.0 controllable
                     dataset
 GPT-NeoX-20B        Pile [301]                                                        ✓      ✓          ✓              ✓             ✓    ✓
 OPT                 RoBERTa [299], Pile [301], PushShift.io                                  ✓    ✓                                  ✓                     ✓
                     Reddit [423]
 BLOOM               ROOTs [13]                                                        ✓                 ✓     ✓        ✓                         ✓         ✓
 Galactica           arXiv, PMC, Semantic Scholar, Wikipedia,      ✓         ✓                ✓                                            ✓                ✓
                     StackExchange, LibreText, Open Text-
                     books, RefSeq Genome, OEIS, LIPID
                     MAPS, NASAExoplanet, Common Crawl,
                     ScientificCC, AcademicCC, GitHub repos-
                     itories Khan Problems, GSM8K, OneS-
                     mallStep
 GLaM                Filtered Webpages, Social media conversa-                                ✓          ✓              ✓        ✓    ✓
                     tions Wikipedia, Forums, Books, News
 LaMDA               Infiniset : Public documents, Dialogs, Ut-                                                                                             ✓
                     terances
 MT-NLG              Two snapshots of Common Crawl and                                                   ✓              ✓        ✓    ✓                     ✓
                     Books3, OpenWebText2, Stack Exchange,
                     PubMed Abstracts, Wikipedia, PG-19
                     [242], BookCorpus2, NIH ExPorter, Pile,
                     CC-Stories, RealNews
 AlphaCode           Selected GitHub repositories, CodeCon-                                                                                       ✓
                     tests: Codeforces, Description2Code, Co-
                     deNet
 Chinchilla          MassiveWeb, MassiveText Books, C4,            ✓         ✓                ✓                                  ✓    ✓                     ✓
                     News, GitHub, Wikipedia
 PaLM                webpages, books, Wikipedia, news, arti-       ✓                          ✓                ✓                      ✓           ✓         ✓
                     cles, source code, social media conversa-
                     tions
 AlexaTM             Wikipedia, mC4                                                    ✓                 ✓     ✓                      ✓                     ✓
 U-PaLM              Same as PaLM                                  ✓                   ✓      ✓          ✓              ✓        ✓    ✓
 UL2                 -                                                                 ✓      ✓    ✓     ✓                                 ✓                ✓
 GLM-130B            -                                             ✓         ✓                                          ✓
 CodeGen             Pile, BigQuery, BigPython                                                                                                    ✓
 LLaMA               CommonCrawl, C4, Github, Wikipedia,                     ✓                ✓                                  ✓    ✓    ✓      ✓         ✓
                     Books, arXiv, StackExchange
 PanGu-Σ             WuDaoCorpora, CLUE, Pile, C4, Python                                     ✓    ✓     ✓     ✓        ✓                         ✓
                     code
 BloombergGPT        inPile, Pile, C4, Wikipedia                   ✓         ✓                           ✓              ✓        ✓    ✓                     ✓
 CodeT5+             CodeSearchNet, Github Code                                                                                            ✓      ✓
 StarCoder           The Stack v1.2                                          ✓                                                             ✓      ✓         ✓
 LLaMA-2             ✓                                             ✓                   ✓                                         ✓    ✓    ✓      ✓
 PaLM-2              Web documents, Code, Books, Maths,                                ✓      ✓    ✓     ✓     ✓        ✓        ✓    ✓    ✓      ✓         ✓
                     Conversation
                                                                                       30
           Table 11: An illustration of training datasets and evaluation benchmarks used in fine-tuned LLMs. “SNI” is short for Super-NaturalInstructions.
 Models | Training Dataset | BIG-bench | MMLU | BBH | RAFT | FLAN | SNI | PromptSource | TyDiQA | HumanEval | MBPP | Truthful/Bias/Toxicity
 T0                Pool of Prompts                      ✓
 WebGPT            ELI5      [424],    ELI5    fact-                                                                                                         ✓
                   check [166], TriviaQA [341],
                   ARC-Challenge [342],      ARC-
                   Easy [342], Hand-written data,
                   Demonstrations of humans, Com-
                   parisons between model-generated
                   answers
 Tk-INSTRUCT       SNI [18]                                                                      ✓
 mT0               xP3 [154]
 OPT-IML           PromptSource [17], FLAN [16],                ✓       ✓       ✓        ✓       ✓          ✓
                   SNI [425], UnifiedSKG [426],
                   CrossFit [427], ExMix [428],
                   T5 [10], Reasoning
 Flan              Muffin, T0-SF, NIv2, CoT                     ✓       ✓                                                 ✓
 WizardCoder       Code Alpaca                                                                                                        ✓          ✓
in formulating new hypotheses and research questions, since their ability to process large-scale datasets allows them to unveil insights that might not be immediately apparent to human researchers [458]. Moreover, for scientific writing, LLMs can help researchers draft documents, suggest improvements, and ensure adherence to specific formatting guidelines [459, 460]. This not only saves time but also improves the clarity of scientific communication, enabling interdisciplinary teams to work together more effectively.
Maths: In addition to providing mathematical research and education support, LLMs can assist in solving mathematical problems by giving step-by-step explanations and guiding users through complex proofs and calculations. They can help identify errors in reasoning or computation and suggest corrections, serving as an invaluable tool for both learning and verification purposes [461, 462]. LLMs can be employed to check the validity of mathematical proofs, offering a preliminary filter before human review. While they are not a substitute for the meticulous work of mathematicians, they can help simplify the process of proof verification [463, 464]. Moreover, LLMs enhance accessibility to mathematics by translating complex concepts and findings into understandable language for non-specialists [465], bridging the gap between theoretical mathematics and applied contexts such as physics, engineering, and economics.
Law: LLMs can assist with the thematic analysis of legal documents, including generating initial coding for datasets, identifying themes, and classifying data according to these themes. This collaborative effort between legal experts and LLMs has proved to be effective in analyzing legal texts such as court opinions on theft, improving both the efficiency and quality of the research [466]. Additionally, LLMs have been evaluated for their ability to generate explanations of legal terms, focusing on improving factual accuracy and relevance by incorporating sentences from case law. By feeding relevant case law into the LLM, the augmented models can generate higher-quality explanations with less factually incorrect information [467]. Moreover, LLMs can be trained with specialized domain knowledge to perform legal reasoning tasks [468] and answer legal questions [469].
Finance: LLMs like BloombergGPT [151], trained on extensive proprietary financial datasets, exhibit superior performance on financial tasks. This indicates the value of domain-specific training in creating LLMs that can more accurately understand and process industry-specific language and concepts. The introduction of FinGPT [470] as an open-source model offers transparent and accessible resources to develop novel applications such as robo-advising, algorithmic trading, and low-code solutions, ultimately expanding the capabilities of financial services. Both BloombergGPT and FinGPT show the adaptability of LLMs to the financial domain, with the former showing the power of custom datasets and the latter emphasizing a data-centric approach and low-rank adaptation techniques for customization. Moreover, LLMs demonstrate an ability to break down complex financial tasks into actionable plans, enabling end-to-end solutions that were previously unfeasible with a single model [471].
Robotics: In robotics research, LLMs have promising applications, such as enhancing human-robot interaction [28, 472, 473, 474], task planning [237], motion planning [246], navigation [246, 475], object manipulation [236], personalized robots [476], etc. LLMs enable robots to understand the environment effectively and generate plans to complete tasks collaboratively [240, 26]. They can facilitate continuous learning by allowing robots to access and integrate information from a wide range of sources, helping robots acquire new skills, adapt to changes, and refine their paths [224, 233, 234].

7. Challenges and Future Directions

LLMs such as GPT-4 and its predecessors have significantly advanced natural language processing. Nevertheless, they also bring along a set of challenges. The computational cost, adversarial robustness, and interpretability are among the technical challenges that are intrinsic to these models. Furthermore, as these models are scaled up to handle more complex
LLMs such as GPT-4 and its predecessors have significantly advanced natural language processing. Nevertheless, they also bring along a set of challenges. The computational cost, adversarial robustness, and interpretability are among the technical challenges intrinsic to these models. Furthermore, as these models are scaled up to handle more complex tasks or to operate in more complex or dynamic environments, new challenges emerge in scalability, privacy, and real-time processing. On the frontier of foundational research, the integration of multi-modality and the effectiveness of transfer learning are being keenly explored. Additionally, the continuous learning aspect of these models, which aims to have models that can adapt to new information over time, presents a fresh set of challenges. These challenges not only underscore the technical intricacies involved but also highlight the broader impact and the future trajectory of LLMs in real-world applications. The following sections delve into these challenges, shedding light on the ongoing and potential efforts to address them.

Computational Cost: Training LLMs requires extensive computational resources, which increases production costs and raises environmental concerns due to the substantial energy consumption of large-scale training. Performance improves as computational resources increase, but the rate of improvement gradually decreases when both the model and dataset size remain fixed, following a power law of diminishing returns [477].
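The diminishing-returns behavior can be made concrete with a short numerical sketch. The Python snippet below evaluates a Kaplan-style power law for loss as a function of parameter count; the constants follow the form reported in scaling-law studies [95] but should be read as illustrative assumptions rather than values fitted here:

    # Power-law scaling sketch: L(N) = (N_c / N) ** alpha, so each
    # tenfold increase in N multiplies the loss by the same factor
    # (about 0.84 for alpha = 0.076) and absolute gains keep shrinking.
    def power_law_loss(n_params: float,
                       n_c: float = 8.8e13, alpha: float = 0.076) -> float:
        return (n_c / n_params) ** alpha

    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"N = {n:.0e}  ->  L(N) = {power_law_loss(n):.3f}")

Tabulating these values shows why extra compute on a fixed model and dataset yields progressively smaller loss reductions.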
Bias and Fairness: LLMs can inherit and amplify societal biases present in their training data. These biases can manifest in the model's outputs, leading to potential ethical and fairness issues [478].

Overfitting: Although LLMs possess substantial learning capabilities, they are susceptible to overfitting noisy and peculiar patterns within their extensive training data. Consequently, this may cause them to generate illogical responses [479]. The debate about memorization vs. generalization in LLMs is about finding the right balance. Memorization allows the model to remember specific details from its training data, ensuring it can provide accurate answers to precise questions. However, generalization enables the model to make inferences and produce responses for inputs it has not seen before, which is essential for handling various real-world tasks. Striking the right balance is the challenge: too much memorization can lead to overfitting, making the model inflexible and leaving it struggling with new inputs [480].

Economic and Research Inequality: The high cost of training and deploying LLMs may concentrate their development within well-funded organizations, potentially worsening economic and research inequalities in AI [481].

Reasoning and Planning: Some reasoning and planning tasks, even ones as seemingly simple as common-sense planning, which humans find easy, remain well beyond the current capabilities of LLMs when evaluated using an assessment framework. This is not entirely unexpected, considering that LLMs primarily generate text completions based on likelihood and offer no solid guarantees in terms of reasoning abilities [482].

Hallucinations: LLMs exhibit “hallucinations”, where they generate responses that, while sounding plausible, are incorrect or do not align with the provided information [483]. Hallucinations fall into three categories:
• Input-conflicting hallucination, wherein LLMs produce content that diverges from the input given by users.

• Context-conflicting hallucination, where LLMs generate content that contradicts information they have generated earlier.

• Fact-conflicting hallucination, which involves an LLM's generation of content that does not align with established world knowledge.

Prompt Engineering: Prompts serve as inputs to LLMs, and their syntax and semantics play a crucial role in determining the model's output. Prompt variations, sometimes counter-intuitive to humans, can result in significant changes in model output; they are addressed through prompt engineering, which involves designing natural language queries to guide LLM responses effectively [484, 32].
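To make this sensitivity concrete, the sketch below sends the same task to a model under three phrasings; generate is a hypothetical stand-in for whatever completion API is in use, not a real library call:

    # Prompt-engineering sketch: one task, three phrasings. In practice,
    # seemingly equivalent variants like these can produce noticeably
    # different outputs, which is why prompts are designed and evaluated
    # systematically. `generate` is a placeholder, not a real API.
    def generate(prompt: str) -> str:
        return "<completion for: " + prompt.splitlines()[0][:40] + ">"

    variants = [
        "Translate to French: The weather is nice today.",
        "You are a professional translator. Render this sentence in "
        "French: The weather is nice today.",
        "English: The weather is nice today.\nFrench:",
    ]
    for v in variants:
        print(repr(v), "->", generate(v))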
Limited Knowledge: Information acquired during pretraining is limited and may become obsolete after some time. Retraining the model on updated data is costly. To generate factually accurate responses, practitioners use a retrieval augmentation pipeline [198]. However, pre-trained models are not trained with retrieval-augmented generation (RAG) [6, 21]; hence, adapting the training pipeline is necessary [193, 25].
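The retrieval-augmentation pipeline mentioned above is easy to sketch: embed the query, fetch the most similar documents, and prepend them to the prompt. In the Python sketch below, embed is a toy hashed bag-of-words stand-in for a real sentence encoder, and the prompt template is an assumption:

    # Minimal RAG wiring: retrieve top-k similar documents, then build
    # an augmented prompt. embed() is a toy stand-in for a real encoder.
    import numpy as np

    def embed(text: str, dim: int = 64) -> np.ndarray:
        v = np.zeros(dim)
        for tok in text.lower().split():
            v[hash(tok) % dim] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    def retrieve(question: str, docs: list[str], k: int = 3) -> list[str]:
        sims = [float(embed(question) @ embed(d)) for d in docs]
        order = sorted(range(len(docs)), key=lambda i: -sims[i])
        return [docs[i] for i in order[:k]]

    def build_prompt(question: str, docs: list[str], k: int = 3) -> str:
        context = "\n".join(retrieve(question, docs, k))
        return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

The resulting prompt is then passed to the LLM as usual; the works cited above differ mainly in whether the model is also trained with such retrieved contexts in the loop.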
Safety and Controllability: Using LLMs carries the risk of generating harmful, misleading, or inappropriate content, whether by accident or in response to specific prompts. Ensuring that these models are utilized safely is a significant concern [485].

Security and Privacy: LLMs are prone to leaking personal information and generating false, unethical, or misaligned responses. Researchers have explored various security attacks, e.g., backdoor attacks, jailbreaking, prompt injection, and data poisoning, that break LLMs' security. Therefore, developing better defense mechanisms is essential to ensure LLMs are safe, reliable, and trustworthy for complex AI applications [486].

Multi-Modality: Multi-modal learning, where LLMs are trained on diverse data like text, images, and videos, aims to create models with richer understanding but faces challenges in data alignment, fusion strategies, and higher computational demands.

Catastrophic Forgetting: LLMs are often pre-trained on large datasets and then fine-tuned on domain-specific data, reducing training resources. However, they face issues like domain adaptation and catastrophic forgetting, which hinder the retention of original knowledge when learning new tasks.
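A common partial remedy is rehearsal: mixing a small fraction of replayed general-domain examples into every fine-tuning batch so that earlier capabilities keep receiving gradient signal. The sketch below assumes in-memory example lists and an arbitrary 20% replay fraction:

    # Rehearsal sketch against catastrophic forgetting: each batch mixes
    # new-domain examples with replayed general-domain ones. The 0.2
    # replay fraction and in-memory lists are illustrative assumptions.
    import random

    def mixed_batches(new_data: list, replay_data: list,
                      batch_size: int = 8, replay_frac: float = 0.2):
        n_replay = max(1, int(batch_size * replay_frac))
        while True:
            batch = random.sample(new_data, batch_size - n_replay)
            batch += random.sample(replay_data, n_replay)
            random.shuffle(batch)
            yield batch  # feed to the usual fine-tuning step

More elaborate alternatives constrain the weights themselves, for example by regularizing updates toward the pre-trained parameters, but the replay idea above conveys the basic trade-off.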
Adversarial Robustness: LLMs have shown great capability across various tasks but are vulnerable to adversarial attacks, where slight, deliberate input alterations can mislead them. Especially with models like BERT, adversarial fine-tuning can enhance robustness, although it sometimes compromises generalization [487]. As LLMs integrate more deeply into complex systems, examining their security properties becomes crucial, given the emerging field of adversarial attacks on LLMs within trustworthy ML [488]. This vulnerability is especially notable in safety-critical domains, necessitating robust adversarial evaluation tools to ensure LLM reliability [489].
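The underlying mechanism is easy to demonstrate on a toy model. The PyTorch sketch below applies an FGSM-style perturbation, stepping an input embedding along the sign of the loss gradient; real attacks on LLMs must search over discrete tokens, so this illustrates only the gradient-based idea, not an actual LLM attack:

    # FGSM-style sketch: a small step along the gradient sign often
    # changes the prediction. Toy linear classifier for illustration;
    # LLM attacks additionally have to handle discrete token inputs.
    import torch

    torch.manual_seed(0)
    model = torch.nn.Linear(16, 2)                 # stand-in classifier
    emb = torch.randn(1, 16, requires_grad=True)   # stand-in embedding
    label = torch.tensor([0])

    loss = torch.nn.functional.cross_entropy(model(emb), label)
    loss.backward()

    eps = 0.5
    adv = emb + eps * emb.grad.sign()              # adversarial step
    print("clean:", model(emb).argmax().item(),
          "adversarial:", model(adv).argmax().item())

Adversarial fine-tuning in the BERT setting referenced above amounts to generating such perturbed inputs during training and optimizing the model to classify them correctly.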
Interpretability and Explainability: The “black-box” nature of LLMs poses challenges in understanding their decision-making, which is crucial for broader acceptance and trust, especially in sensitive domains. Despite their advanced capabilities, the lack of insight into their operation limits their effectiveness and trustworthiness [490, 491]. Efforts are being made to make LLMs more explainable, to promote user trust, and to ensure responsible AI usage. Understanding the logic behind LLMs' responses is essential for fostering trust and ensuring they align with human values and legal standards.

Privacy Concerns: Privacy concerns around LLMs have escalated with their growth in complexity and size, particularly regarding data sharing and potential misuse. There is a risk of malicious content creation, filter bypass, and data privacy issues, especially in e-commerce, where protecting customer privacy is crucial. If models are trained on private data, additional concerns arise when such models are made publicly available. LLMs tend to memorize phrases from their training sets, which an adversary could exploit to extract sensitive data, posing a threat to personal privacy [492, 493].
Real-Time Processing: Real-time processing in LLMs is pivotal for various applications, especially with the rising popularity of mobile AI applications and concerns regarding information security and privacy. However, LLMs often have hundreds of layers and millions of parameters, which impede real-time processing due to high computational demands and limited weight storage on hardware platforms, particularly in edge computing environments [494]. While certain efforts like MobileBERT aim to reduce memory requirements, they still face substantial execution overhead due to the large number of model layers, leading to high inference latency.

Long-Term Dependencies: LLMs have shown considerable progress in understanding and generating text, yet they often struggle with preserving context and handling long-term dependencies, particularly in complex, multi-turn conversations or long documents. This limitation can lead to incoherent or irrelevant responses.

Hardware Acceleration: The growth of LLMs presents significant hardware challenges due to the increasing computational and memory demands associated with training and deploying these models. GPUs have played a crucial role in meeting the hardware requirements for training LLMs, with the networking industry also evolving to optimize hardware for training workloads. However, the growing size of LLMs, which has been outpacing hardware progress, makes model inference increasingly costly. Model quantization is a promising approach to bridge the widening gap between LLM size and hardware capacity [495]. Although specialized hardware acceleration like GPUs or TPUs can significantly reduce the computational cost, making real-time applications more feasible, they may not fully resolve all limitations, necessitating further advancements in hardware technology.
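The simplest form of quantization is easy to sketch: symmetric absmax rounding of a weight matrix to int8, which cuts memory roughly fourfold versus fp32 at the price of rounding error. Production schemes such as SmoothQuant [44] add per-channel scales and activation handling; the snippet below shows only the core idea:

    # Post-training weight quantization sketch: symmetric absmax int8.
    import numpy as np

    def quantize_int8(w: np.ndarray):
        scale = np.abs(w).max() / 127.0            # one scale per tensor
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int8(w)
    print("max abs rounding error:", np.abs(w - dequantize(q, s)).max())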
Regulatory and Ethical Frameworks: The rapid advancements in artificial intelligence have given rise to sophisticated LLMs like OpenAI's GPT-4 [157] and Google's Bard. These developments underscore the imperative for regulatory oversight to manage the ethical and social challenges accompanying LLMs' widespread use [496]. For instance, LLMs can generate content that can be used positively or negatively, emphasizing the need for proactive ethical frameworks and policy measures to guide their responsible use and assign accountability for their outputs [497]. Auditing is identified as a promising governance mechanism to ensure that AI systems, including LLMs, are designed and deployed in ways that are ethically and legally sound and technically robust [498].

8. Conclusion

This article has comprehensively reviewed the developments in LLMs. It summarizes the significant findings on LLMs in the existing literature and provides a detailed analysis of the design aspects, including architectures, datasets, and training pipelines. We identified crucial architectural components and training strategies employed by different LLMs; these aspects are presented as summaries and discussions throughout the article. Moreover, we have discussed the performance differences of LLMs in zero-shot and few-shot settings, explored the impact of fine-tuning, and compared supervised and generalized models and encoder vs. decoder vs. encoder-decoder architectures. A comprehensive review of multi-modal LLMs, retrieval-augmented LLMs, LLM-powered agents, efficient LLMs, datasets, evaluation, applications, and challenges is also provided. This article is anticipated to serve as a valuable resource for researchers, offering insights into the recent advancements in LLMs and providing fundamental concepts and details to develop better LLMs.

Acknowledgement: The authors would like to acknowledge the support received from the Saudi Data and AI Authority (SDAIA) and King Fahd University of Petroleum and Minerals (KFUPM) under the SDAIA-KFUPM Joint Research Center for Artificial Intelligence Grant No. JRC-AI-RFP-11.

References

[1] A. Chernyavskiy, D. Ilvovsky, P. Nakov, Transformers: “the end of history” for natural language processing?, in: Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part III 21, Springer, 2021, pp. 677–693. 1
[2] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, Superglue: A stickier benchmark for general-purpose language understanding systems, Advances in neural information processing systems 32 (2019). 1, 26, 29
[3] D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al., Towards a human-like open-domain chatbot, arXiv preprint arXiv:2001.09977 (2020). 1
[4] B. A. y Arcas, Do large language models understand us?, Daedalus 151 (2) (2022) 183–197. 2
[5] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (8) (2019) 9. 2, 7
[6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901. 2, 6, 7, 8, 9, 16, 18, 23, 24, 25, 34
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018). 2, 18, 24
[8] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: NAACL-HLT, Association for Computational Linguistics, 2018, pp. 2227–2237. 2
[9] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, arXiv preprint arXiv:1910.13461 (2019). 2
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (1) (2020) 5485–5551. 2, 7, 8, 18, 19, 24, 25, 28, 30, 31
[11] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mt5: A massively multilingual pre-trained text-to-text transformer, arXiv preprint arXiv:2010.11934 (2020). 2, 7, 8, 24, 25, 28, 30
[12] Z. Zhang, Y. Gu, X. Han, S. Chen, C. Xiao, Z. Sun, Y. Yao, F. Qi, J. Guan, P. Ke, et al., Cpm-2: Large-scale cost-effective pre-trained language models, AI Open 2 (2021) 216–224. 2, 8, 25
[13] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al., Bloom: A 176b-parameter open-access multilingual language model, arXiv preprint arXiv:2211.05100 (2022). 2, 4, 9, 11, 23, 24, 25, 30
[14] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al., Opt: Open pre-trained transformer language models, arXiv preprint arXiv:2205.01068 (2022). 2, 9, 11, 24, 25
[15] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., Palm: Scaling language modeling with pathways, arXiv preprint arXiv:2204.02311 (2022). 2, 6, 9, 11, 23, 24, 25
[16] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416 (2022). 2, 7, 11, 16, 17, 22, 24, 25, 28, 31
[17] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, et al., Multitask prompted training enables zero-shot task generalization, arXiv preprint arXiv:2110.08207 (2021). 2, 11, 16, 25, 28, 31
[18] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, et al., Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 5085–5109. 2, 7, 11, 16, 17, 24, 25, 28, 31
[19] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, H. Hajishirzi, Self-instruct: Aligning language model with self generated instructions, arXiv preprint arXiv:2212.10560 (2022). 2, 16, 19, 22, 28
[20] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, Advances in Neural Information Processing Systems 35 (2022) 27730–27744. 2, 7, 11, 16, 22
[21] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023). 2, 7, 10, 16, 25, 34
[22] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al., Emergent abilities of large language models, arXiv preprint arXiv:2206.07682 (2022). 2
[23] T. Webb, K. J. Holyoak, H. Lu, Emergent analogical reasoning in large language models, Nature Human Behaviour 7 (9) (2023) 1526–1541. 2
[24] D. A. Boiko, R. MacKnight, G. Gomes, Emergent autonomous scientific research capabilities of large language models, arXiv preprint arXiv:2304.05332 (2023). 2
[25] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, E. Grave, Few-shot learning with retrieval augmented language models, arXiv preprint arXiv:2208.03299 (2022). 2, 18, 19, 34
[26] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., Palm-e: An embodied multimodal language model, arXiv preprint arXiv:2303.03378 (2023). 2, 20, 22, 33
[27] A. Parisi, Y. Zhao, N. Fiedel, Talm: Tool augmented language models, arXiv preprint arXiv:2205.12255 (2022). 2, 19, 20
[28] B. Zhang, H. Soh, Large language models as zero-shot human models for human-robot interaction, arXiv preprint arXiv:2303.03548 (2023). 2, 33
[29] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi, Y. Shi, et al., mplug-owl: Modularization empowers large language models with multimodality, arXiv preprint arXiv:2304.14178 (2023). 2, 22
[30] W. Wang, Z. Chen, X. Chen, J. Wu, X. Zhu, G. Zeng, P. Luo, T. Lu, J. Zhou, Y. Qiao, et al., Visionllm: Large language model is also an open-ended decoder for vision-centric tasks, arXiv preprint arXiv:2305.11175 (2023). 2, 22
[31] R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li, Y. Shan, Gpt4tools: Teaching large language model to use tools via self-instruction, arXiv preprint arXiv:2305.18752 (2023). 2, 19, 22, 23
[32] E. Saravia, Prompt Engineering Guide, https://github.com/dair-ai/Prompt-Engineering-Guide (12 2022). 2, 7, 18, 34
[33] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al., Glm-130b: An open bilingual pre-trained model, arXiv preprint arXiv:2210.02414 (2022). 2, 10, 23, 24, 25
[34] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, S. C. Hoi, Codet5+: Open code large language models for code understanding and generation, arXiv preprint arXiv:2305.07922 (2023). 2, 11, 24, 25
[35] S. Wang, Y. Sun, Y. Xiang, Z. Wu, S. Ding, W. Gong, S. Feng, J. Shang, Y. Zhao, C. Pang, et al., Ernie 3.0 titan: Exploring larger-scale knowledge enhanced pre-training for language understanding and generation, arXiv preprint arXiv:2112.12731 (2021). 2, 8, 24, 25
[36] J. Rasley, S. Rajbhandari, O. Ruwase, Y. He, Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–3506. 2, 5
[37] S. Rajbhandari, J. Rasley, O. Ruwase, Y. He, Zero: Memory optimizations toward training trillion parameter models, in: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2020, pp. 1–16. 2, 4, 24
[38] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, G. Neubig, Towards a unified view of parameter-efficient transfer learning, arXiv preprint arXiv:2110.04366 (2021). 2, 20, 21
[39] Z. Hu, Y. Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. Lee, L. Bing, S. Poria, Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models, arXiv preprint arXiv:2304.01933 (2023). 2, 20
[40] B. Lester, R. Al-Rfou, N. Constant, The power of scale for parameter-efficient prompt tuning, arXiv preprint arXiv:2104.08691 (2021). 2, 8, 20, 21
[41] X. L. Li, P. Liang, Prefix-tuning: Optimizing continuous prompts for generation, arXiv preprint arXiv:2101.00190 (2021). 2, 20, 21
[42] X. Ma, G. Fang, X. Wang, Llm-pruner: On the structural pruning of large language models, arXiv preprint arXiv:2305.11627 (2023). 2, 22
[43] R. Xu, F. Luo, C. Wang, B. Chang, J. Huang, S. Huang, F. Huang, From dense to sparse: Contrastive pruning for better pre-trained language model compression, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 11547–11555. 2, 22
[44] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, S. Han, Smoothquant: Accurate and efficient post-training quantization for large language models, in: ICML, Vol. 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 38087–38099. 2, 21
[45] C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, N. Wong, Compression of generative pre-trained language models via quantization, arXiv preprint arXiv:2203.10705 (2022). 2, 21
[46] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, S. Naidu, Giraffe: Adventures in expanding context lengths in llms, arXiv preprint arXiv:2308.10882 (2023). 2, 17
[47] B. Peng, J. Quesnelle, H. Fan, E. Shippole, Yarn: Efficient context window extension of large language models, arXiv preprint arXiv:2309.00071 (2023). 2, 17
[48] M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y.-H. Sung, Y. Yang, Longt5: Efficient text-to-text transformer for long sequences, arXiv preprint arXiv:2112.07916 (2021). 2, 18
[49] S. Chen, S. Wong, L. Chen, Y. Tian, Extending context window of large language models via positional interpolation, arXiv preprint arXiv:2306.15595 (2023). 2, 17
[50] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223 (2023). 2, 3, 7
[51] U. Naseem, I. Razzak, S. K. Khan, M. Prasad, A comprehensive survey on word representation models: From classical to state-of-the-art word representation language models, Transactions on Asian and Low-Resource Language Information Processing 20 (5) (2021) 1–35. 2, 3
[52] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heinz, D. Roth, Recent advances in natural language processing via large pre-trained language models: A survey, arXiv preprint arXiv:2111.01243 (2021). 2, 3
[53] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, L. He, et al., A comprehensive survey on pretrained foundation models: A history from bert to chatgpt, arXiv preprint arXiv:2302.09419 (2023). 2, 3
[54] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, Z. Sui, A survey for in-context learning, arXiv preprint arXiv:2301.00234 (2022). 2, 7, 18
[55] J. Huang, K. C.-C. Chang, Towards reasoning in large language models: A survey, arXiv preprint arXiv:2212.10403 (2022). 2, 7, 18
[56] Y. Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, Q. Liu, Aligning large language models with human: A survey, arXiv preprint arXiv:2307.12966 (2023). 2
[57] X. Zhu, J. Li, Y. Liu, C. Ma, W. Wang, A survey on model compression for large language models, arXiv preprint arXiv:2308.07633 (2023). 2
[58] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, E. Chen, A survey on multimodal large language models, arXiv preprint arXiv:2306.13549 (2023). 2, 22, 23
[59] J. J. Webster, C. Kit, Tokenization as the initial phase in nlp, in: COLING 1992 volume 4: The 14th international conference on computational linguistics, 1992. 4
[60] T. Kudo, Subword regularization: Improving neural network translation models with multiple subword candidates, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 66–75. 4
[61] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725. 4
[62] M. Schuster, K. Nakajima, Japanese and korean voice search, in: 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, 2012, pp. 5149–5152. 4
[63] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, et al., Between words and characters: A brief history of open-vocabulary modeling and tokenization in nlp, arXiv preprint arXiv:2112.10508 (2021). 4
[64] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017). 4, 7
[65] O. Press, N. Smith, M. Lewis, Train short, test long: Attention with linear biases enables input length extrapolation, in: International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0 4, 17
[66] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, Y. Liu, Roformer: Enhanced transformer with rotary position embedding, arXiv preprint arXiv:2104.09864 (2021). 4, 9, 17
[67] R. Child, S. Gray, A. Radford, I. Sutskever, Generating long sequences with sparse transformers, arXiv preprint arXiv:1904.10509 (2019). 4, 7, 23
[68] T. Dao, D. Fu, S. Ermon, A. Rudra, C. Ré, Flashattention: Fast and memory-efficient exact attention with io-awareness, Advances in Neural Information Processing Systems 35 (2022) 16344–16359. 4
[69] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural networks 2 (5) (1989) 359–366. 4
[70] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in: Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814. 4
[71] D. Hendrycks, K. Gimpel, Gaussian error linear units (gelus), arXiv preprint arXiv:1606.08415 (2016). 4
[72] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The journal of machine learning research 15 (1) (2014) 1929–1958. 4
[73] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, C. Pal, Zoneout: Regularizing rnns by randomly preserving hidden activations, arXiv preprint arXiv:1606.01305 (2016). 4
[74] N. Shazeer, Glu variants improve transformer, arXiv preprint arXiv:2002.05202 (2020). 4
[75] Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, Language modeling with gated convolutional networks, in: International conference on machine learning, PMLR, 2017, pp. 933–941. 4
[76] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450 (2016). 4
[77] B. Zhang, R. Sennrich, Root mean square layer normalization, Advances in Neural Information Processing Systems 32 (2019). 4
[78] A. Baevski, M. Auli, Adaptive input representations for neural language modeling, arXiv preprint arXiv:1809.10853 (2018). 4
[79] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, F. Wei, Deepnet: Scaling transformers to 1,000 layers, arXiv preprint arXiv:2203.00555 (2022). 4
[80] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, B. Catanzaro, Megatron-lm: Training multi-billion parameter language models using model parallelism, arXiv preprint arXiv:1909.08053 (2019). 4, 5
[81] BMTrain: Efficient training for big models. URL https://github.com/OpenBMB/BMTrain 4, 5
[82] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, 2020, pp. 38–45. 5
[83] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, et al., Jax: composable transformations of python+ numpy programs (2018). 5
[84] S. Li, J. Fang, Z. Bian, H. Liu, Y. Liu, H. Huang, B. Wang, Y. You, Colossal-ai: A unified deep learning system for large-scale parallel training, arXiv preprint arXiv:2110.14883 (2021). 5
[85] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, J. Tang, Fastmoe: A fast mixture-of-expert training system, arXiv preprint arXiv:2103.13262 (2021). 5
[86] Huawei Technologies Co., Ltd., Huawei mindspore ai development framework, in: Artificial Intelligence Technology, Springer, 2022, pp. 137–162. 5
[87] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., Pytorch: An imperative style, high-performance deep learning library, Advances in neural information processing systems 32 (2019). 5
[88] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: a system for large-scale machine learning, in: Osdi, Vol. 16, Savannah, GA, USA, 2016, pp. 265–283. 5
[89] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems, arXiv preprint arXiv:1512.01274 (2015). 5
[90] W. Fedus, B. Zoph, N. Shazeer, Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, The Journal of Machine Learning Research 23 (1) (2022) 5232–5270. 5, 9
[91] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, et al., Glam: Efficient scaling of language models with mixture-of-experts, in: International Conference on Machine Learning, PMLR, 2022, pp. 5547–5569. 5, 9, 23, 24, 25
[92] X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang, W. Wang, P. Li, X. Zhang, A. Podolskiy, G. Arshinov, et al., Pangu-Σ: Towards trillion parameter language model with sparse heterogeneous computing, arXiv preprint arXiv:2303.10845 (2023). 5, 10, 16, 23, 24, 25
[93] T. Wang, A. Roberts, D. Hesslow, T. Le Scao, H. W. Chung, I. Beltagy, J. Launay, C. Raffel, What language model architecture and pretraining objective works best for zero-shot generalization?, in: International Conference on Machine Learning, PMLR, 2022, pp. 22964–22984. 5
[94] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, H.-W. Hon, Unified language model pre-training for natural language understanding and generation, Advances in neural information processing systems 32 (2019). 6
[95] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020). 6
[96] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al., Training compute-optimal large language models, arXiv preprint arXiv:2203.15556 (2022). 6, 9, 25, 29
[97] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura, et al., Opt-iml: Scaling language model instruction meta learning through the lens of generalization, arXiv preprint arXiv:2212.12017 (2022). 7, 11, 16, 17, 22, 25, 28
[98] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox, Y. Yang, C. Gan, Principle-driven self-alignment of language models from scratch with minimal human supervision, arXiv preprint arXiv:2305.03047 (2023). 7, 17
[99] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, et al., A general language assistant as a laboratory for alignment, arXiv preprint arXiv:2112.00861 (2021). 7
[100] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, G. Irving, Fine-tuning language models from human preferences, arXiv preprint arXiv:1909.08593 (2019). 7
[101] S. Kim, S. J. Joo, D. Kim, J. Jang, S. Ye, J. Shin, M. Seo, The cot collection: Improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning, arXiv preprint arXiv:2305.14045 (2023). 7, 16
[102] Q. Liu, F. Zhou, Z. Jiang, L. Dou, M. Lin, From zero to hero: Examining the power of symbolic tasks in instruction tuning, arXiv preprint arXiv:2304.07995 (2023). 7, 16
[103] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837. 7, 20, 23
[104] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, arXiv preprint arXiv:2203.11171 (2022). 7, 20
[105] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, K. Narasimhan, Tree of thoughts: Deliberate problem solving with large language models, arXiv preprint arXiv:2305.10601 (2023). 7, 20
[106] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for nlp, in: International Conference on Machine Learning, PMLR, 2019, pp. 2790–2799. 7, 20
[107] S. McCandlish, J. Kaplan, D. Amodei, O. D. Team, An empirical model of large-batch training, arXiv preprint arXiv:1812.06162 (2018). 7
[108] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, X. Jiang, Z. Yang, K. Wang, X. Zhang, et al., Pangu-α: Large-scale autoregressive pre-trained chinese language models with auto-parallel computation, arXiv preprint arXiv:2104.12369 (2021). 8, 23, 24, 25
[109] S. Yuan, H. Zhao, Z. Du, M. Ding, X. Liu, Y. Cen, X. Zou, Z. Yang, J. Tang, Wudaocorpora: A super large-scale chinese corpora for pre-training language models, AI Open 2 (2021) 65–68. 8, 30
[110] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu, et al., Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation, arXiv preprint arXiv:2107.02137 (2021). 8, 25
[111] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, R. Salakhutdinov, Transformer-xl: Attentive language models beyond a fixed-length context, arXiv preprint arXiv:1901.02860 (2019). 8
[112] O. Lieber, O. Sharir, B. Lenz, Y. Shoham, Jurassic-1: Technical details and evaluation, White Paper. AI21 Labs 1 (2021). 8, 24, 25
[113] Y. Levine, N. Wies, O. Sharir, H. Bata, A. Shashua, Limits to depth efficiencies of self-attention, Advances in Neural Information Processing Systems 33 (2020) 22640–22651. 8, 11
[114] B. Kim, H. Kim, S.-W. Lee, G. Lee, D. Kwak, D. H. Jeon, S. Park, S. Kim, S. Kim, D. Seo, et al., What changes can large-scale language models bring? intensive study on hyperclova: Billions-scale korean generative pretrained transformers, arXiv preprint arXiv:2109.04650 (2021). 8, 25
[115] S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu, F. Li, H. Zhu, J. Luo, L. Xu, et al., Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning, arXiv preprint arXiv:2110.04725 (2021). 8, 24, 25
[116] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al., Scaling language models: Methods, analysis & insights from training gopher, arXiv preprint arXiv:2112.11446 (2021). 8, 9, 25, 28
[117] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti, et al., Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model, arXiv preprint arXiv:2201.11990 (2022). 8, 9, 24, 25
[118] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, et al., Gpt-neox-20b: An open-source autoregressive language model, arXiv preprint arXiv:2204.06745 (2022). 9, 23, 24, 25
[119] W. Ben, K. Aran, Gpt-j-6b: A 6 billion parameter autoregressive language model (2021). 9
[120] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al., Mixed precision training, arXiv preprint arXiv:1710.03740 (2017). 9, 23
[121] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, arXiv preprint arXiv:1701.06538 (2017). 9, 23
[122] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza, H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky, et al., Alexatm 20b: Few-shot learning using a large-scale multilingual seq2seq model, arXiv preprint arXiv:2208.01448 (2022). 9, 23, 24, 25
[123] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al., Palm 2 technical report, arXiv preprint arXiv:2305.10403 (2023). 9, 25
[124] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So, S. Shakeri, X. Garcia, H. S. Zheng, J. Rao, A. Chowdhery, et al., Transcending scaling laws with 0.1% extra compute, arXiv preprint arXiv:2210.11399 (2022). 9, 24, 25
[125] Y. Tay, M. Dehghani, V. Q. Tran, X. Garcia, J. Wei, X. Wang, H. W. Chung, D. Bahri, T. Schuster, S. Zheng, et al., Ul2: Unifying language learning paradigms, in: The Eleventh International Conference on Learning Representations, 2022. 9, 10, 24, 25
[126] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, J. Tang, Glm: General language model pretraining with autoregressive blank infilling, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 320–335. 10
[127] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023). 10, 23, 25
[128] M. N. Rabe, C. Staats, Self-attention does not need O(n²) memory, arXiv preprint arXiv:2112.05682 (2021). 10
[129] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, B. Catanzaro, Reducing activation recomputation in large transformer models, Proceedings of Machine Learning and Systems 5 (2023). 10
[130] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024). 10, 25
[131] https://mistral.ai/news/mixtral-8x22b/. 10, 25
[132] https://github.com/Snowflake-Labs/snowflake-arctic. 10, 25
[133] https://github.com/xai-org/grok-1. 10
[134] https://x.ai/blog/grok-1.5. 10
[135] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al., Gemini: a family of highly capable multimodal models, arXiv preprint arXiv:2312.11805 (2023). 10
[136] M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-B. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al., Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, arXiv preprint arXiv:2403.05530 (2024). 10
        Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al., Gem-              ration, arXiv preprint arXiv:2305.14327 (2023). 16
        ini 1.5: Unlocking multimodal understanding across millions of tokens              [156] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu,
        of context, arXiv preprint arXiv:2403.05530 (2024). 10                                   C. He, X. Yue, et al., Llama-adapter v2: Parameter-efficient visual in-
[137]   B. Adler, N. Agarwal, A. Aithal, D. H. Anh, P. Bhattacharya, A. Brun-                    struction model, arXiv preprint arXiv:2304.15010 (2023). 16, 24
        dyn, J. Casper, B. Catanzaro, S. Clay, J. Cohen, et al., Nemotron-4 340b           [157] Openai. gpt-4 technical report (2023). 16, 35
        technical report, arXiv preprint arXiv:2406.11704 (2024). 10, 25                   [158] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang,
[138]   X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong,                     T. B. Hashimoto, Stanford alpaca: An instruction-following llama
        Q. Du, Z. Fu, et al., Deepseek llm: Scaling open-source language models                  model,         https://github.com/tatsu-lab/stanford_alpaca
        with longtermism, arXiv preprint arXiv:2401.02954 (2024). 10, 25                         (2023). 16, 25, 28
[139]   DeepSeek-AI, A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao,                   [159] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng,
        C. Deng, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li,                        S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An
        F. Lin, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Yang,                        open-source chatbot impressing gpt-4 with 90%* chatgpt quality (March
        H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang,                    2023).
        J. Guo, J. Ni, J. Li, J. Chen, J. Yuan, J. Qiu, J. Song, K. Dong, K. Gao,                URL https://lmsys.org/blog/2023-03-30-vicuna/ 16, 22, 25,
        K. Guan, L. Wang, L. Zhang, L. Xu, L. Xia, L. Zhao, L. Zhang, M. Li,                     28
        M. Wang, M. Zhang, M. Zhang, M. Tang, M. Li, N. Tian, P. Huang,                    [160] B. Peng, C. Li, P. He, M. Galley, J. Gao, Instruction tuning with gpt-4,
        P. Wang, P. Zhang, Q. Zhu, Q. Chen, Q. Du, R. J. Chen, R. L. Jin, R. Ge,                 arXiv preprint arXiv:2304.03277 (2023). 16, 28
        R. Pan, R. Xu, R. Chen, S. S. Li, S. Lu, S. Zhou, S. Chen, S. Wu, S. Ye,           [161] T. Liu, B. K. H. Low, Goat: Fine-tuned llama outperforms gpt-4 on
        S. Ma, S. Wang, S. Zhou, S. Yu, S. Zhou, S. Zheng, T. Wang, T. Pei,                      arithmetic tasks, arXiv preprint arXiv:2305.14201 (2023). 16
        T. Yuan, T. Sun, W. L. Xiao, W. Zeng, W. An, W. Liu, W. Liang, W. Gao,             [162] H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, T. Liu, Huatuo:
        W. Zhang, X. Q. Li, X. Jin, X. Wang, X. Bi, X. Liu, X. Wang, X. Shen,                    Tuning llama model with chinese medical knowledge, arXiv preprint
        X. Chen, X. Chen, X. Nie, X. Sun, Deepseek-v2: A strong, economical,                     arXiv:2304.06975 (2023). 16
        and efficient mixture-of-experts language model, CoRR abs/2405.04434               [163] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, D. Jiang,
        (2024). 10, 25                                                                           Wizardlm: Empowering large language models to follow complex in-
[140]   E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese,                   structions, arXiv preprint arXiv:2304.12244 (2023). 16
        C. Xiong, Codegen: An open large language model for code with multi-               [164] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin,
        turn program synthesis, arXiv preprint arXiv:2203.13474 (2022). 11,                      D. Jiang, Wizardcoder: Empowering code large language models with
        23, 25, 28                                                                               evol-instruct, arXiv preprint arXiv:2306.08568 (2023). 16, 25
[141]   M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Ed-          [165] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, F. Song, M. Chadwick,
        wards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large lan-                   M. Glaese, S. Young, L. Campbell-Gillingham, G. Irving, et al., Teach-
        guage models trained on code, arXiv preprint arXiv:2107.03374 (2021).                    ing language models to support answers with verified quotes, arXiv
        11, 25, 29, 31                                                                           preprint arXiv:2203.11147 (2022). 17
[142]   Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond,                [166] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim,
        T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al., Competition-level                 C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al., Webgpt: Browser-
        code generation with alphacode, Science 378 (6624) (2022) 1092–1097.                     assisted question-answering with human feedback, arXiv preprint
        11, 23, 25, 29                                                                           arXiv:2112.09332 (2021). 17, 19, 20, 25, 31
[143]   N. Shazeer, Fast transformer decoding: One write-head is all you need,             [167] A. Glaese, N. McAleese, M. Trebacz,
                                                                                                                                  ˛     J. Aslanides, V. Firoiu, T. Ewalds,
arXiv preprint arXiv:1911.02150 (2019). 11
[144] R. Y. Pang, H. He, Text generation by learning from demonstrations, arXiv preprint arXiv:2009.07839 (2020). 11
[145] R. Dabre, A. Fujita, Softmax tempering for training neural machine translation models, arXiv preprint arXiv:2009.09372 (2020). 11
[146] Y. Wang, W. Wang, S. Joty, S. C. Hoi, Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation, arXiv preprint arXiv:2109.00859 (2021). 11
[147] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, et al., Starcoder: may the source be with you!, arXiv preprint arXiv:2305.06161 (2023). 11, 25
[148] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, R. Stojnic, Galactica: A large language model for science, arXiv preprint arXiv:2211.09085 (2022). 11, 24, 25, 29
[149] FairScale authors, Fairscale: A general purpose modular pytorch library for high performance and large scale training, https://github.com/facebookresearch/fairscale (2021). 11
[150] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al., Lamda: Language models for dialog applications, arXiv preprint arXiv:2201.08239 (2022). 11, 25
[151] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, G. Mann, Bloomberggpt: A large language model for finance, arXiv preprint arXiv:2303.17564 (2023). 11, 25, 33
[152] X. Zhang, Q. Yang, D. Xu, Xuanyuan 2.0: A large chinese financial chat model with hundreds of billions parameters, arXiv preprint arXiv:2305.12002 (2023). 11, 17, 25
[153] B. Wang, Mesh-transformer-jax: Model-parallel implementation of transformer language model with jax (2021). 12, 24
[154] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z.-X. Yong, H. Schoelkopf, et al., Crosslingual generalization through multitask finetuning, arXiv preprint arXiv:2211.01786 (2022). 16, 25, 28, 31
[155] D. Yin, X. Liu, F. Yin, M. Zhong, H. Bansal, J. Han, K.-W. Chang, Dynosaur: A dynamic growth paradigm for instruction-tuning data curation, arXiv preprint arXiv:2305.14327 (2023).
[167] A. Glaese, N. McAleese, M. Trębacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker, et al., Improving alignment of dialogue agents via targeted human judgements, arXiv preprint arXiv:2209.14375 (2022). 17, 20, 25
[168] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, C. Finn, Direct preference optimization: Your language model is secretly a reward model, arXiv preprint arXiv:2305.18290 (2023). 17
[169] H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao, J. Zhang, K. Shum, T. Zhang, Raft: Reward ranked finetuning for generative foundation model alignment, arXiv preprint arXiv:2304.06767 (2023). 17
[170] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, F. Huang, Rrhf: Rank responses to align language models with human feedback without tears, arXiv preprint arXiv:2304.05302 (2023). 17
[171] F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y. Li, H. Wang, Preference ranking optimization for human alignment, arXiv preprint arXiv:2306.17492 (2023). 17
[172] H. Liu, C. Sferrazza, P. Abbeel, Languages are rewards: Hindsight fine-tuning using human feedback, arXiv preprint arXiv:2302.02676 (2023). 17
[173] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al., Constitutional ai: Harmlessness from ai feedback, arXiv preprint arXiv:2212.08073 (2022). 17
[174] Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. Liang, T. B. Hashimoto, Alpacafarm: A simulation framework for methods that learn from human feedback, arXiv preprint arXiv:2305.14387 (2023). 17
[175] C. Si, Z. Gan, Z. Yang, S. Wang, J. Wang, J. Boyd-Graber, L. Wang, Prompting gpt-3 to be reliable, arXiv preprint arXiv:2210.09150 (2022). 17
[176] D. Ganguli, A. Askell, N. Schiefer, T. Liao, K. Lukošiūtė, A. Chen, A. Goldie, A. Mirhoseini, C. Olsson, D. Hernandez, et al., The capacity for moral self-correction in large language models, arXiv preprint arXiv:2302.07459 (2023). 17
[177] A. Wei, N. Haghtalab, J. Steinhardt, Jailbroken: How does llm safety training fail?, arXiv preprint arXiv:2307.02483 (2023). 17
[178] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al., Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned, arXiv preprint arXiv:2209.07858 (2022). 17, 28
[179] S. Casper, J. Lin, J. Kwon, G. Culp, D. Hadfield-Menell, Explore, establish, exploit: Red teaming language models from scratch, arXiv preprint arXiv:2306.09442 (2023). 17
[180] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, G. Irving, Red teaming language models with language models, arXiv preprint arXiv:2202.03286 (2022). 17
[181] T. Scialom, T. Chakrabarty, S. Muresan, Fine-tuned language models are continual learners, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 6107–6122. 17
[182] Z. Shi, A. Lipani, Don’t stop pretraining? make prompt-based fine-tuning powerful learner, arXiv preprint arXiv:2305.01711 (2023). 17
[183] H. Gupta, S. A. Sawant, S. Mishra, M. Nakamura, A. Mitra, S. Mashetty, C. Baral, Instruction tuned models are quick learners, arXiv preprint arXiv:2306.05539 (2023). 17
[184] H. Chen, Y. Zhang, Q. Zhang, H. Yang, X. Hu, X. Ma, Y. Yanggong, J. Zhao, Maybe only 0.5% data is needed: A preliminary exploration of low training data instruction tuning, arXiv preprint arXiv:2305.09246 (2023). 17
[185] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al., Lima: Less is more for alignment, arXiv preprint arXiv:2305.11206 (2023). 17, 25, 28
[186] C. Han, Q. Wang, W. Xiong, Y. Chen, H. Ji, S. Wang, Lm-infinite: Simple on-the-fly length generalization for large language models, arXiv preprint arXiv:2308.16137 (2023). 17, 18
[187] J. Ainslie, T. Lei, M. de Jong, S. Ontañón, S. Brahma, Y. Zemlyanskiy, D. Uthus, M. Guo, J. Lee-Thorp, Y. Tay, et al., Colt5: Faster long-range transformers with conditional computation, arXiv preprint arXiv:2303.09752 (2023). 18
[188] J. Ding, S. Ma, L. Dong, X. Zhang, S. Huang, W. Wang, F. Wei, Longnet: Scaling transformers to 1,000,000,000 tokens, arXiv preprint arXiv:2307.02486 (2023). 18
[189] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, J. Jia, Longlora: Efficient fine-tuning of long-context large language models, arXiv preprint arXiv:2309.12307 (2023). 18
[190] N. Ratner, Y. Levine, Y. Belinkov, O. Ram, I. Magar, O. Abend, E. Karpas, A. Shashua, K. Leyton-Brown, Y. Shoham, Parallel context windows for large language models, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 6383–6402. 18
[191] W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, F. Wei, Augmenting language models with long-term memory, arXiv preprint arXiv:2306.07174 (2023). 18
[192] X. Xu, Z. Gou, W. Wu, Z.-Y. Niu, H. Wu, H. Wang, S. Wang, Long time no see! open-domain conversation with long-term persona memory, arXiv preprint arXiv:2203.05797 (2022). 18
[193] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark, et al., Improving language models by retrieving from trillions of tokens, in: International conference on machine learning, PMLR, 2022, pp. 2206–2240. 18, 19, 34
[194] W. Zhong, L. Guo, Q. Gao, Y. Wang, Memorybank: Enhancing large language models with long-term memory, arXiv preprint arXiv:2305.10250 (2023). 18
[195] N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, S. Yao, Reflexion: Language agents with verbal reinforcement learning, arXiv preprint arXiv:2303.11366 (2023). 18, 20
[196] C. Hu, J. Fu, C. Du, S. Luo, J. Zhao, H. Zhao, Chatdb: Augmenting llms with databases as their symbolic memory, arXiv preprint arXiv:2306.03901 (2023). 18
[197] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, G. Neubig, Active retrieval augmented generation, arXiv preprint arXiv:2305.06983 (2023). 18
[198] O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, Y. Shoham, In-context retrieval-augmented language models, arXiv preprint arXiv:2302.00083 (2023). 18, 34
[199] X. Li, X. Qiu, Mot: Pre-thinking and recalling enable chatgpt to self-improve with memory-of-thoughts, arXiv preprint arXiv:2305.05181 (2023). 18
[200] D. Schuurmans, Memory augmented large language models are computationally universal, arXiv preprint arXiv:2301.04589 (2023). 18
[201] A. Modarressi, A. Imani, M. Fayyaz, H. Schütze, Ret-llm: Towards a general read-write memory for large language models, arXiv preprint arXiv:2305.14322 (2023). 18
[202] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: Bm25 and beyond, Foundations and Trends® in Information Retrieval 3 (4) (2009) 333–389. 18
[203] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, D. Zhou, Rationale-augmented ensembles in language models, arXiv preprint arXiv:2207.00747 (2022). 18
[204] F. Zhang, B. Chen, Y. Zhang, J. Liu, D. Zan, Y. Mao, J.-G. Lou, W. Chen, Repocoder: Repository-level code completion through iterative retrieval and generation, arXiv preprint arXiv:2303.12570 (2023). 18
[205] B. Wang, W. Ping, P. Xu, L. McAfee, Z. Liu, M. Shoeybi, Y. Dong, O. Kuchaiev, B. Li, C. Xiao, et al., Shall we pretrain autoregressive language models with retrieval? a comprehensive study, arXiv preprint arXiv:2304.06762 (2023). 19
[206] L. Wang, N. Yang, F. Wei, Learning to retrieve in-context examples for large language models, arXiv preprint arXiv:2307.07164 (2023). 19
[207] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, W. Chen, What makes good in-context examples for gpt-3?, arXiv preprint arXiv:2101.06804 (2021). 19
[208] O. Rubin, J. Herzig, J. Berant, Learning to retrieve prompts for in-context learning, arXiv preprint arXiv:2112.08633 (2021). 19
[209] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, W.-t. Yih, Replug: Retrieval-augmented black-box language models, arXiv preprint arXiv:2301.12652 (2023). 19
[210] O. Rubin, J. Berant, Long-range language modeling with self-retrieval, arXiv preprint arXiv:2306.13421 (2023). 19
[211] K. Guu, K. Lee, Z. Tung, P. Pasupat, M. Chang, Retrieval augmented language model pre-training, in: International conference on machine learning, PMLR, 2020, pp. 3929–3938. 19
[212] S. Hofstätter, J. Chen, K. Raman, H. Zamani, Fid-light: Efficient and effective retrieval-augmented text generation, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 1437–1447. 19
[213] M. Komeili, K. Shuster, J. Weston, Internet-augmented dialogue generation, arXiv preprint arXiv:2107.07566 (2021). 19
[214] A. Lazaridou, E. Gribovskaya, W. Stokowiec, N. Grigorev, Internet-augmented language models through few-shot prompting for open-domain question answering, arXiv preprint arXiv:2203.05115 (2022). 19
[215] D. Gao, L. Ji, L. Zhou, K. Q. Lin, J. Chen, Z. Fan, M. Z. Shou, Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn, arXiv preprint arXiv:2306.08640 (2023). 19
[216] P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu, S.-C. Zhu, J. Gao, Chameleon: Plug-and-play compositional reasoning with large language models, arXiv preprint arXiv:2304.09842 (2023). 19, 20, 23
[217] B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, M. T. Ribeiro, Art: Automatic multi-step reasoning and tool-use for large language models, arXiv preprint arXiv:2303.09014 (2023). 19
[218] C.-Y. Hsieh, S.-A. Chen, C.-L. Li, Y. Fujii, A. Ratner, C.-Y. Lee, R. Krishna, T. Pfister, Tool documentation enables zero-shot tool-usage with large language models, arXiv preprint arXiv:2308.00675 (2023). 19
[219] Y. Song, W. Xiong, D. Zhu, C. Li, K. Wang, Y. Tian, S. Li, Restgpt: Connecting large language models with real-world applications via restful apis, arXiv preprint arXiv:2306.06624 (2023). 19
[220] S. Hao, T. Liu, Z. Wang, Z. Hu, Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings, arXiv preprint arXiv:2305.11554 (2023). 19
[221] S. G. Patil, T. Zhang, X. Wang, J. E. Gonzalez, Gorilla: Large language model connected with massive apis, arXiv preprint arXiv:2305.15334 (2023). 19
[222] Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, J. Zhang, On the tool manipulation capability of open-source large language models, arXiv preprint arXiv:2305.16504 (2023). 19
[223] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al., Toolllm: Facilitating large language models to master 16000+ real-world apis, arXiv preprint arXiv:2307.16789 (2023). 19, 20
[224] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, Y. Zhuang, Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface, arXiv preprint arXiv:2303.17580 (2023). 19, 20, 33
[225] Y. Liang, C. Wu, T. Song, W. Wu, Y. Xia, Y. Liu, Y. Ou, S. Lu, L. Ji, S. Mao, et al., Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis, arXiv preprint arXiv:2303.16434 (2023). 19
[226] D. Surís, S. Menon, C. Vondrick, Vipergpt: Visual inference via python execution for reasoning, arXiv preprint arXiv:2303.08128 (2023). 20
[227] A. Maedche, S. Morana, S. Schacht, D. Werth, J. Krumeich, Advanced user assistance systems, Business & Information Systems Engineering 58 (2016) 367–370. 20
[228] M. Campbell, A. J. Hoane Jr, F.-h. Hsu, Deep blue, Artificial intelligence 134 (1-2) (2002) 57–83. 20
[229] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, et al., Metagpt: Meta programming for multi-agent collaborative framework, arXiv preprint arXiv:2308.00352 (2023). 20
[230] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, et al., The rise and potential of large language model based agents: A survey, arXiv preprint arXiv:2309.07864 (2023). 20
[231] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al., A survey on large language model based autonomous agents, arXiv preprint arXiv:2308.11432 (2023). 20
[232] W. Huang, P. Abbeel, D. Pathak, I. Mordatch, Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, in: International Conference on Machine Learning, PMLR, 2022, pp. 9118–9147. 20
[233] S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, Z. Hu, Reasoning with language model is planning with world model, arXiv preprint arXiv:2305.14992 (2023). 20, 33
[234] W. Yao, S. Heinecke, J. C. Niebles, Z. Liu, Y. Feng, L. Xue, R. Murthy, Z. Chen, J. Zhang, D. Arpit, et al., Retroformer: Retrospective large language agents with policy gradient optimization, arXiv preprint arXiv:2308.02151 (2023). 20, 33
[235] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, T. Jackson, N. Brown, L. Luu, S. Levine, K. Hausman, brian ichter, Inner monologue: Embodied reasoning through planning with language models, in: 6th Annual Conference on Robot Learning, 2022.
URL https://openreview.net/forum?id=3R3Pz5i0tye 20
[236] C. Jin, W. Tan, J. Yang, B. Liu, R. Song, L. Wang, J. Fu, Alphablock: Embodied finetuning for vision-language reasoning in robot manipulation, arXiv preprint arXiv:2305.18898 (2023). 20, 33
[237] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, A. Garg, Progprompt: Generating situated robot task plans using large language models, in: 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2023, pp. 11523–11530. 20, 33
[238] W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. G. Arenas, H.-T. L. Chiang, T. Erez, L. Hasenclever, J. Humplik, et al., Language to rewards for robotic skill synthesis, arXiv preprint arXiv:2306.08647 (2023). 20
[239] X. Tang, A. Zou, Z. Zhang, Y. Zhao, X. Zhang, A. Cohan, M. Gerstein, Medagents: Large language models as collaborators for zero-shot medical reasoning, arXiv preprint arXiv:2311.10537 (2023). 20
[240] A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, et al., Do as i can, not as i say: Grounding language in robotic affordances, in: Conference on Robot Learning, PMLR, 2023, pp. 287–318. 20, 33
[241] H. Ha, P. Florence, S. Song, Scaling up and distilling down: Language-guided robot skill acquisition, arXiv preprint arXiv:2307.14535 (2023). 20
[242] A. Rajvanshi, K. Sikka, X. Lin, B. Lee, H.-P. Chiu, A. Velasquez, Saynav: Grounding large language models for dynamic planning to navigation in new environments, arXiv preprint arXiv:2309.04077 (2023). 20
[243] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, Y. Su, Llm-planner: Few-shot grounded planning for embodied agents with large language models, arXiv preprint arXiv:2212.04088 (2022). 20
[244] V. S. Dorbala, J. F. Mullen Jr, D. Manocha, Can an embodied agent find your “cat-shaped mug”? llm-based zero-shot object navigation, arXiv preprint arXiv:2303.03480 (2023). 20
[245] C. Huang, O. Mees, A. Zeng, W. Burgard, Visual language maps for robot navigation, in: 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2023, pp. 10608–10615. 20
[246] Y. Ding, X. Zhang, C. Paxton, S. Zhang, Task and motion planning with large language models for object rearrangement, arXiv preprint arXiv:2303.06247 (2023). 20, 33
[247] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, J. Tang, Gpt understands, too, arXiv preprint arXiv:2103.10385 (2021). 20, 21
[248] G. Chen, F. Liu, Z. Meng, S. Liang, Revisiting parameter-efficient tuning: Are we really there yet?, arXiv preprint arXiv:2202.07962 (2022). 20
[249] Y. Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, J. Gao, Adamix: Mixture-of-adapter for parameter-efficient tuning of large language models, arXiv preprint arXiv:2205.12410 (2022). 20
[250] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021). 21, 22, 23
[251] X. Liu, K. Ji, Y. Fu, W. Tam, Z. Du, Z. Yang, J. Tang, P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2022, pp. 61–68. 21
[252] A. Razdaibiedina, Y. Mao, R. Hou, M. Khabsa, M. Lewis, A. Almahairi, Progressive prompts: Continual learning for language models, arXiv preprint arXiv:2301.12314 (2023). 21
[253] Z.-R. Zhang, C. Tan, H. Xu, C. Wang, J. Huang, S. Huang, Towards adaptive prefix tuning for parameter-efficient language model fine-tuning, arXiv preprint arXiv:2305.15212 (2023). 21
[254] E. B. Zaken, S. Ravfogel, Y. Goldberg, Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models, arXiv preprint arXiv:2106.10199 (2021). 21
[255] T. Dettmers, M. Lewis, Y. Belkada, L. Zettlemoyer, Llm.int8(): 8-bit matrix multiplication for transformers at scale, arXiv preprint arXiv:2208.07339 (2022). 21, 22
[256] E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh, Gptq: Accurate post-training quantization for generative pre-trained transformers, arXiv preprint arXiv:2210.17323 (2022). 21
[257] X. Wei, Y. Zhang, Y. Li, X. Zhang, R. Gong, J. Guo, X. Liu, Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling, arXiv preprint arXiv:2304.09145 (2023). 21
[258] E. Frantar, D. Alistarh, Optimal brain compression: A framework for accurate post-training quantization and pruning, Advances in Neural Information Processing Systems 35 (2022) 4475–4488. 21
[259] C. Lee, J. Jin, T. Kim, H. Kim, E. Park, Owq: Lessons learned from activation outliers for weight quantization in large language models, arXiv preprint arXiv:2306.02272 (2023). 21
[260] S. J. Kwon, J. Kim, J. Bae, K. M. Yoo, J.-H. Kim, B. Park, B. Kim, J.-W. Ha, N. Sung, D. Lee, Alphatuning: Quantization-aware parameter-efficient adaptation of large-scale pre-trained language models, arXiv preprint arXiv:2210.03858 (2022). 21
[261] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Efficient finetuning of quantized llms, arXiv preprint arXiv:2305.14314 (2023). 21, 22
[262] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, V. Chandra, Llm-qat: Data-free quantization aware training for large language models, arXiv preprint arXiv:2305.17888 (2023). 21, 22
[263] Y. Guo, A. Yao, H. Zhao, Y. Chen, Network sketching: Exploiting binary structure in deep cnns, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5955–5963. 21
[264] J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, D. Lee, Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization, arXiv preprint arXiv:2305.14152 (2023). 22
[265] M. Sun, Z. Liu, A. Bair, J. Z. Kolter, A simple and effective pruning approach for large language models, arXiv preprint arXiv:2306.11695 (2023). 22
[266] Z. Wang, J. Wohlwend, T. Lei, Structured pruning of large language models, arXiv preprint arXiv:1910.04732 (2019). 22
[267] L. Yin, Y. Wu, Z. Zhang, C.-Y. Hsieh, Y. Wang, Y. Jia, M. Pechenizkiy, Y. Liang, Z. Wang, S. Liu, Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity, arXiv preprint arXiv:2310.05175 (2023). 22
[268] C. Tao, L. Hou, H. Bai, J. Wei, X. Jiang, Q. Liu, P. Luo, N. Wong, Structured pruning for efficient generative pre-trained language models, in: Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 10880–10895. 22
[269] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al., Flamingo: a visual language model for few-shot learning, Advances in Neural Information Processing Systems 35 (2022) 23716–23736. 22
[270] J. Li, D. Li, S. Savarese, S. Hoi, Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, arXiv preprint arXiv:2301.12597 (2023). 22
[271] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, arXiv preprint arXiv:2304.08485 (2023). 22
[272] K. Li, Y. He, Y. Wang, Y. Li, W. Wang, P. Luo, Y. Wang, L. Wang, Y. Qiao, Videochat: Chat-centric video understanding, arXiv preprint arXiv:2305.06355 (2023). 22
[273] M. Maaz, H. Rasheed, S. Khan, F. S. Khan, Video-chatgpt: Towards detailed video understanding via large vision and language models, arXiv preprint arXiv:2306.05424 (2023). 22
[274] H. Zhang, X. Li, L. Bing, Video-llama: An instruction-tuned audio-visual language model for video understanding, arXiv preprint arXiv:2306.02858 (2023). 22
[275] X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y. Zou, W. Wang, Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research, arXiv preprint arXiv:2303.17395 (2023). 22
[276] C. Lyu, M. Wu, L. Wang, X. Huang, B. Liu, Z. Du, S. Shi, Z. Tu, Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration, arXiv preprint arXiv:2306.09093 (2023). 22
[277] D. Zhu, J. Chen, X. Shen, X. Li, M. Elhoseiny, Minigpt-4: Enhancing vision-language understanding with advanced large language models, arXiv preprint arXiv:2304.10592 (2023). 22
[278] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020). 22
[279] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, S. Hoi, Instructblip: Towards general-purpose vision-language models with instruction tuning, arXiv preprint arXiv:2305.06500 (2023). 22
[280] Z. Xu, Y. Shen, L. Huang, Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning, arXiv preprint arXiv:2212.10773 (2022). 22
[281] Z. Zhao, L. Guo, T. Yue, S. Chen, S. Shao, X. Zhu, Z. Yuan, J. Liu, Chatbridge: Bridging modalities with large language model as a language catalyst, arXiv preprint arXiv:2305.16103 (2023). 22
[282] L. Li, Y. Yin, S. Li, L. Chen, P. Wang, S. Ren, M. Li, Y. Yang, J. Xu, X. Sun, et al., M3it: A large-scale dataset towards multi-modal multilingual instruction tuning, arXiv preprint arXiv:2306.04387 (2023). 22
[283] R. Pi, J. Gao, S. Diao, R. Pan, H. Dong, J. Zhang, L. Yao, J. Han, H. Xu, L. K. T. Zhang, Detgpt: Detect what you need via reasoning, arXiv preprint arXiv:2305.14167 (2023). 22
[284] G. Luo, Y. Zhou, T. Ren, S. Chen, X. Sun, R. Ji, Cheap and quick: Efficient vision-language instruction tuning for large language models, arXiv preprint arXiv:2305.15023 (2023). 22
[285] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, Y. Qiao, Llama-adapter: Efficient fine-tuning of language models with zero-init attention, arXiv preprint arXiv:2303.16199 (2023). 22
[286] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518. 22
[287] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, A. Smola, Multimodal chain-of-thought reasoning in language models, arXiv preprint arXiv:2302.00923 (2023). 23
[288] J. Ge, H. Luo, S. Qian, Y. Gan, J. Fu, S. Zhan, Chain of thought prompt tuning in vision language models, arXiv preprint arXiv:2304.07919 (2023). 23
[289] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, N. Duan, Visual chatgpt: Talking, drawing and editing with visual foundation models, arXiv preprint arXiv:2303.04671 (2023). 23
[290] Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu, M. Zeng, L. Wang, Mm-react: Prompting chatgpt for multimodal reasoning and action, arXiv preprint arXiv:2303.11381 (2023). 23
[291] T. Wang, J. Zhang, J. Fei, Y. Ge, H. Zheng, Y. Tang, Z. Li, M. Gao, S. Zhao, Y. Shan, et al., Caption anything: Interactive image description with diverse multimodal controls, arXiv preprint arXiv:2305.02677 (2023). 23
[292] X. Zhu, R. Zhang, B. He, Z. Zeng, S. Zhang, P. Gao, Pointclip v2: Adapting clip for powerful 3d open-world learning, arXiv preprint arXiv:2211.11682 (2022). 23
[293] T. Gupta, A. Kembhavi, Visual programming: Compositional visual reasoning without training, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14953–14962. 23
[294] P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, H. Li, Dynamic fusion with intra-and inter-modality attention flow for visual question answering, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6639–6648. 23
[295] Z. Yu, J. Yu, Y. Cui, D. Tao, Q. Tian, Deep modular co-attention networks for visual question answering, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6281–6290. 23
[296] H. You, R. Sun, Z. Wang, L. Chen, G. Wang, H. A. Ayyubi, K.-W. Chang, S.-F. Chang, Idealgpt: Iteratively decomposing vision and language reasoning via large language models, arXiv preprint arXiv:2305.14985 (2023). 23
[297] R. Zhang, X. Hu, B. Li, S. Huang, H. Deng, Y. Qiao, P. Gao, H. Li, Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15211–15222. 23
[298] T. Q. Nguyen, J. Salazar, Transformers without tears: Improving the normalization of self-attention, CoRR abs/1910.05895 (2019). 24
[299] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019). 24, 30
[300] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine, D. Song, Koala: A dialogue model for academic research, Blog post (April 2023).
URL https://bair.berkeley.edu/blog/2023/04/03/koala/ 25
[301] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al., The pile: An 800gb dataset of diverse text for language modeling, arXiv preprint arXiv:2101.00027 (2020). 28, 30
[302] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. Villanova del Moral, T. Le Scao, L. Von Werra, C. Mou, E. González Ponferrada, H. Nguyen, et al., The bigscience roots corpus: A 1.6 tb composite multilingual dataset, Advances in Neural Information Processing Systems 35 (2022) 31809–31826. 28
[303] Wikipedia.
URL https://en.wikipedia.org/wiki/Main_Page 28
[304] Together Computer, Redpajama: An open source recipe to reproduce llama training dataset (Apr. 2023).
URL https://github.com/togethercomputer/RedPajama-Data 28
[305] O. Honovich, T. Scialom, O. Levy, T. Schick, Unnatural instructions: Tuning language models with (almost) no human labor, arXiv preprint arXiv:2212.09689 (2022). 28
[306] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al., Training a helpful and harmless assistant with reinforcement learning from human feedback, arXiv preprint arXiv:2204.05862 (2022). 28
[307] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, arXiv preprint arXiv:2009.03300 (2020). 26, 29
[308] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, arXiv preprint arXiv:2206.04615 (2022). 26, 29
[309] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, Glue: A multi-task benchmark and analysis platform for natural language understanding, arXiv preprint arXiv:1804.07461 (2018). 26, 29
[310] Y. Yao, Q. Dong, J. Guan, B. Cao, Z. Zhang, C. Xiao, X. Wang, F. Qi, J. Bao, J. Nie, et al., Cuge: A chinese language understanding and generation evaluation benchmark, arXiv preprint arXiv:2112.13610 (2021). 29
[311] L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, et al., Clue: A chinese language understanding evaluation benchmark, arXiv preprint arXiv:2004.05986 (2020). 29
[312] L. Xu, X. Lu, C. Yuan, X. Zhang, H. Xu, H. Yuan, G. Wei, X. Pan, X. Tian, L. Qin, et al., Fewclue: A chinese few-shot learning evaluation benchmark, arXiv preprint arXiv:2107.07498 (2021). 29
[313] E. M. Smith, M. Williamson, K. Shuster, J. Weston, Y.-L. Boureau, Can you put it all together: Evaluating conversational agents’ ability to blend skills, arXiv preprint arXiv:2004.08449 (2020). 29
[314] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al., Holistic evaluation of language models, arXiv preprint arXiv:2211.09110 (2022). 29
[315] S. Park, J. Moon, S. Kim, W. I. Cho, J. Han, J. Park, C. Song, J. Kim, Y. Song, T. Oh, et al., Klue: Korean language understanding evaluation, arXiv preprint arXiv:2105.09680 (2021). 29
[316] S. Reddy, D. Chen, C. D. Manning, Coqa: A conversational question answering challenge, Transactions of the Association for Computational Linguistics 7 (2019) 249–266. 27, 29
[317] M. T. Pilehvar, J. Camacho-Collados, Wic: 10,000 example pairs for evaluating context-sensitive representations, arXiv preprint arXiv:1808.09121 (2018). 27, 29
[318] S. Merity, C. Xiong, J. Bradbury, R. Socher, Pointer sentinel mixture models, arXiv preprint arXiv:1609.07843 (2016). 28, 29
[319] J. W. Rae, A. Potapenko, S. M. Jayakumar, T. P. Lillicrap, Compressive transformers for long-range sequence modelling, arXiv preprint arXiv:1911.05507 (2019). 28, 29
[320] X. Liu, Q. Chen, C. Deng, H. Zeng, J. Chen, D. Li, B. Tang, Lcqmc: A large-scale chinese question matching corpus, in: Proceedings of the 27th international conference on computational linguistics, 2018, pp. 1952–1962. 28, 29
[321] S. Iyer, N. Dandekar, K. Csernai, First quora dataset release: Question pairs, https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs. 29
[322] R. Rudinger, J. Naradowsky, B. Leonard, B. Van Durme, Gender bias in coreference resolution, arXiv preprint arXiv:1804.09301 (2018). 29
[323] M.-C. De Marneffe, M. Simons, J. Tonhauser, The commitmentbank: Investigating projection in naturally occurring discourse, in: proceedings of Sinn und Bedeutung, Vol. 23, 2019, pp. 107–124. 29
[324] Z. Li, N. Ding, Z. Liu, H. Zheng, Y. Shen, Chinese relation extraction with multi-grained information and external linguistic knowledge, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4377–4386. 29
[325] J. Xu, J. Wen, X. Sun, Q. Su, A discourse-level named entity recognition and relation extraction dataset for chinese literature text, arXiv preprint arXiv:1711.07010 (2017). 29
[326] J. Chen, Q. Chen, X. Liu, H. Yang, D. Lu, B. Tang, The bq corpus: A large-scale domain-specific chinese corpus for sentence semantic equivalence identification, in: Proceedings of the 2018 conference on empirical methods in natural language processing, 2018, pp. 4946–4951. 29
[327] B. Liu, D. Niu, H. Wei, J. Lin, Y. He, K. Lai, Y. Xu, Matching article pairs with graphical decomposition and convolutions, arXiv preprint arXiv:1802.07459 (2018). 29
[328] P. Li, W. Li, Z. He, X. Wang, Y. Cao, J. Zhou, W. Xu, Dataset and neural recurrent sequence labeling model for open-domain factoid question answering, arXiv preprint arXiv:1607.06275 (2016). 29
[329] N. Peng, M. Dredze, Named entity recognition for chinese social media with jointly trained embeddings, in: Proceedings of the 2015 conference on empirical methods in natural language processing, 2015, pp. 548–554. 29
[330] W. Ling, D. Yogatama, C. Dyer, P. Blunsom, Program induction by rationale generation: Learning to solve and explain algebraic word problems, arXiv preprint arXiv:1705.04146 (2017). 29
[331] R. Weischedel, S. Pradhan, L. Ramshaw, M. Palmer, N. Xue, M. Marcus, A. Taylor, C. Greenberg, E. Hovy, R. Belvin, et al., Ontonotes release 4.0, LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium (2011). 29
[332] D. Vilares, C. Gómez-Rodríguez, Head-qa: A healthcare dataset for complex reasoning, arXiv preprint arXiv:1906.04701 (2019). 29
[333] S. L. Blodgett, L. Green, B. O’Connor, Demographic dialectal variation in social media: A case study of african-american english, arXiv preprint arXiv:1608.08868 (2016). 29
[334] N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, J. Allen, A corpus and evaluation framework for deeper understanding of commonsense stories, arXiv preprint arXiv:1604.01696 (2016). 28, 29
[335] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, R. Fernández, The lambada dataset: Word prediction requiring a broad discourse context, arXiv preprint arXiv:1606.06031 (2016). 28, 29
[336] B. Hu, Q. Chen, F. Zhu, Lcsts: A large scale chinese short text summarization dataset, arXiv preprint arXiv:1506.05865 (2015). 29
[337] Z. Shao, M. Huang, J. Wen, W. Xu, X. Zhu, Long and diverse text generation with planning-based hierarchical variational model, arXiv preprint arXiv:1908.06605 (2019). 29
[338] J. Novikova, O. Dušek, V. Rieser, The e2e dataset: New challenges for end-to-end generation, arXiv preprint arXiv:1706.09254 (2017). 29
[339] C. Zheng, M. Huang, A. Sun, Chid: A large-scale chinese idiom dataset for cloze test, arXiv preprint arXiv:1906.01265 (2019). 29
[340] Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al., Piqa: Reasoning about physical commonsense in natural language, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 34, 2020, pp. 7432–7439. 28, 29
[341] M. Joshi, E. Choi, D. S. Weld, L. Zettlemoyer, Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension, arXiv preprint arXiv:1705.03551 (2017). 28, 29, 31
[342] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? try arc, the ai2 reasoning challenge, arXiv preprint arXiv:1803.05457 (2018). 28, 29, 31
[343] S. Aroca-Ouellette, C. Paik, A. Roncone, K. Kann, Prost: Physical reasoning of objects through space and time, arXiv preprint arXiv:2106.03634 (2021). 29
[344] T. Mihaylov, P. Clark, T. Khot, A. Sabharwal, Can a suit of armor conduct electricity? a new dataset for open book question answering, arXiv preprint arXiv:1809.02789 (2018). 29
[345] T. C. Ferreira, C. Gardent, N. Ilinykh, C. Van Der Lee, S. Mille, D. Moussallem, A. Shimorina, The 2020 bilingual, bi-directional webnlg+ shared task overview and evaluation results (webnlg+ 2020), in: Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+), 2020. 29
[346] C. Xu, W. Zhou, T. Ge, K. Xu, J. McAuley, F. Wei, Blow the dog whistle: A chinese dataset for cant understanding with common sense and world knowledge, arXiv preprint arXiv:2104.02704 (2021). 29
[347] G. Lai, Q. Xie, H. Liu, Y. Yang, E. Hovy, Race: Large-scale reading comprehension dataset from examinations, arXiv preprint arXiv:1704.04683 (2017). 29
[348] E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y. Choi, P. Liang, L. Zettlemoyer, Quac: Question answering in context, arXiv preprint arXiv:1808.07036 (2018). 29
[349] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, J. Berant, Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies, Transactions of the Association for Computational Linguistics 9 (2021) 346–361. 29, 31
[350] J. Boyd-Graber, B. Satinoff, H. He, H. Daumé III, Besting the quiz master: Crowdsourcing incremental classification games, in: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, 2012, pp. 1290–1301. 29
[351] S. Zhang, X. Zhang, H. Wang, J. Cheng, P. Li, Z. Ding, Chinese medical question answer matching using end-to-end character-level multi-scale cnns, Applied Sciences 7 (8) (2017) 767. 29
[352] S. Zhang, X. Zhang, H. Wang, L. Guo, S. Liu, Multi-scale attentive interaction networks for chinese medical question answer selection, IEEE Access 6 (2018) 74061–74071. 29
[353] C. Xu, J. Pei, H. Wu, Y. Liu, C. Li, Matinf: A jointly labeled large-scale dataset for classification, question answering and summarization, arXiv preprint arXiv:2004.12302 (2020). 29
[354] K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, Winogrande: An adversarial winograd schema challenge at scale, Communications of the ACM 64 (9) (2021) 99–106. 27, 29
[355] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, Hellaswag: Can a machine really finish your sentence?, arXiv preprint arXiv:1905.07830 (2019). 29
[356] M. Roemmele, C. A. Bejan, A. S. Gordon, Choice of plausible alternatives: An evaluation of commonsense causal reasoning, in: AAAI spring symposium: logical formalizations of commonsense reasoning, 2011, pp. 90–95. 29
[357] H. Levesque, E. Davis, L. Morgenstern, The winograd schema challenge, in: Thirteenth international conference on the principles of knowledge representation and reasoning, 2012. 27, 29
[358] A. Talmor, J. Herzig, N. Lourie, J. Berant, Commonsenseqa: A question answering challenge targeting commonsense knowledge, arXiv preprint arXiv:1811.00937 (2018). 29, 31
[359] M. Sap, H. Rashkin, D. Chen, R. LeBras, Y. Choi, Socialiqa: Commonsense reasoning about social interactions, arXiv preprint arXiv:1904.09728 (2019). 29
[360] K. Sun, D. Yu, D. Yu, C. Cardie, Investigating prior knowledge for challenging chinese machine reading comprehension, Transactions of the Association for Computational Linguistics 8 (2020) 141–155. 29
[361] S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, B. Van Durme, Record: Bridging the gap between human and machine commonsense reading comprehension, arXiv preprint arXiv:1810.12885 (2018). 29
[362] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, Squad: 100,000+ questions for machine comprehension of text, arXiv preprint arXiv:1606.05250 (2016). 29, 31
[363] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, K. Toutanova, Boolq: Exploring the surprising difficulty of natural yes/no questions, arXiv preprint arXiv:1905.10044 (2019). 29, 31
[364] P. Rajpurkar, R. Jia, P. Liang, Know what you don’t know: Unanswerable questions for squad, arXiv preprint arXiv:1806.03822 (2018). 29, 31
[365] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, M. Gardner, Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs, arXiv preprint arXiv:1903.00161 (2019). 29, 31
[366] I. Dagan, O. Glickman, B. Magnini, The pascal recognising textual entailment challenge, in: Machine learning challenges workshop, Springer, 2005, pp. 177–190. 29, 31
[367] Y. Chang, M. Narang, H. Suzuki, G. Cao, J. Gao, Y. Bisk, Webqa: Multihop and multimodal qa, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16495–16504. 29, 31
[368] Y. Cui, T. Liu, Z. Chen, W. Ma, S. Wang, G. Hu, Dataset for the first evaluation on chinese machine reading comprehension, arXiv preprint arXiv:1709.08299 (2017). 29
[369] Y. Cui, T. Liu, W. Che, L. Xiao, Z. Chen, W. Ma, S. Wang, G. Hu, A span-extraction dataset for chinese machine reading comprehension, arXiv preprint arXiv:1810.07366 (2018). 29, 31
[370] Y. Cui, T. Liu, Z. Yang, Z. Chen, W. Ma, W. Che, S. Wang, G. Hu, A sentence cloze dataset for chinese machine reading comprehension, arXiv preprint arXiv:2004.03116 (2020). 29
[371] Y. Li, T. Liu, D. Li, Q. Li, J. Shi, Y. Wang, Character-based bilstm-crf incorporating pos and dictionaries for chinese opinion target extraction, in: Asian Conference on Machine Learning, PMLR, 2018, pp. 518–533. 29
[372] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, D. Roth, Looking beyond the surface: A challenge set for reading comprehension over multiple sentences, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 252–262. 29
[373] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al., Natural questions: a benchmark for question answering research, Transactions of the Association for Computational Linguistics 7 (2019) 453–466. 29
[374] C. C. Shao, T. Liu, Y. Lai, Y. Tseng, S. Tsai, Drcd: A chinese machine reading comprehension dataset, arXiv preprint arXiv:1806.00920 (2018). 29
[375] W. He, K. Liu, J. Liu, Y. Lyu, S. Zhao, X. Xiao, Y. Liu, Y. Wang, H. Wu, Q. She, et al., Dureader: a chinese machine reading comprehension dataset from real-world applications, arXiv preprint arXiv:1711.05073 (2017). 29
[376] H. Tang, J. Liu, H. Li, Y. Hong, H. Wu, H. Wang, Dureaderrobust: A chinese dataset towards evaluating the robustness of machine reading comprehension models, arXiv preprint arXiv:2004.11142 (2020). 29
[377] J. Welbl, N. F. Liu, M. Gardner, Crowdsourcing multiple choice science questions, arXiv preprint arXiv:1707.06209 (2017). 29
[378] C. Xiong, Z. Dai, J. Callan, Z. Liu, R. Power, End-to-end neural ad-hoc ranking with kernel pooling, in: Proceedings of the 40th International ACM SIGIR conference on research and development in information retrieval, 2017, pp. 55–64. 29
[379] A. Peñas, E. Hovy, P. Forner, Á. Rodrigo, R. Sutcliffe, R. Morante, Qa4mre 2011-2013: Overview of question answering for machine reading evaluation, in: Information Access Evaluation. Multilinguality, Multimodality, and Visualization: 4th International Conference of the CLEF Initiative, CLEF 2013, Valencia, Spain, September 23-26, 2013. Proceedings 4, Springer, 2013, pp. 303–320. 29
[380] S. Lim, M. Kim, J. Lee, Korquad1.0: Korean qa dataset for machine reading comprehension, arXiv preprint arXiv:1909.07005 (2019). 29
[381] C. Xiao, H. Zhong, Z. Guo, C. Tu, Z. Liu, M. Sun, Y. Feng, X. Han, Z. Hu, H. Wang, et al., Cail2018: A large-scale legal dataset for judgment prediction, arXiv preprint arXiv:1807.02478 (2018). 29
[382] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, et al., Measuring coding challenge competence with apps, arXiv preprint arXiv:2105.09938 (2021). 29, 31
[383] Y. Wang, X. Liu, S. Shi, Deep neural solver for math word problems, in: Proceedings of the 2017 conference on empirical methods in natural language processing, 2017, pp. 845–854. 29, 31
[384] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168 (2021). 29, 31
[385] J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, C. Sutton, Program synthesis with large language models, CoRR abs/2108.07732 (2021). 29
[386] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, et al., Language models are multilingual chain-of-thought reasoners, arXiv preprint arXiv:2210.03057 (2022). 29
[387] S. Roy, D. Roth, Solving general arithmetic word problems, arXiv preprint arXiv:1608.01413 (2016). 29
[388] S.-Y. Miao, C.-C. Liang, K.-Y. Su, A diverse corpus for evaluating and developing english math word problem solvers, arXiv preprint arXiv:2106.15772 (2021). 29
[389] R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman, H. Hajishirzi, Mawps: A math word problem repository, in: Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies, 2016, pp. 1152–1157. 29
[390] A. Patel, S. Bhattamishra, N. Goyal, Are nlp models really able to solve simple math word problems?, arXiv preprint arXiv:2103.07191 (2021). 29
[391] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, W.-t. Yih, D. Fried, S. Wang, T. Yu, Ds-1000: A natural and reliable benchmark for data science code generation, in: International Conference on Machine Learning, PMLR, 2023, pp. 18319–18345. 29
[392] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al., Program synthesis with large language models, arXiv preprint arXiv:2108.07732 (2021). 29
[393] Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, D. Kiela, Adversarial nli: A new benchmark for natural language understanding, arXiv preprint arXiv:1910.14599 (2019). 29, 31
[394] A. Williams, N. Nangia, S. R. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, arXiv preprint arXiv:1704.05426 (2017). 29
[395] R. T. McCoy, E. Pavlick, T. Linzen, Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference, arXiv preprint arXiv:1902.01007 (2019). 29
[396] J. Liu, L. Cui, H. Liu, D. Huang, Y. Wang, Y. Zhang, Logiqa: A challenge dataset for machine reading comprehension with logical reasoning, arXiv preprint arXiv:2007.08124 (2020). 29
[397] P. Lewis, B. Oğuz, R. Rinott, S. Riedel, H. Schwenk, Mlqa: Evaluating cross-lingual extractive question answering, arXiv preprint arXiv:1910.07475 (2019). 29
[398] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk, V. Stoyanov, Xnli: Evaluating cross-lingual sentence representations, arXiv preprint arXiv:1809.05053 (2018). 29, 31
[399] Y. Yang, Y. Zhang, C. Tar, J. Baldridge, Paws-x: A cross-lingual adversarial dataset for paraphrase identification, arXiv preprint arXiv:1908.11828 (2019). 29, 31
[400] S. Narayan, S. B. Cohen, M. Lapata, Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization, arXiv preprint arXiv:1808.08745 (2018). 29
[401] E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, A. Korhonen, Xcopa: A multilingual dataset for causal commonsense reasoning, arXiv preprint arXiv:2005.00333 (2020). 29
[402] A. Tikhonov, M. Ryabinin, It’s all in the heads: Using attention heads as a baseline for cross-lingual transfer in commonsense reasoning, arXiv preprint arXiv:2106.12066 (2021). 29
[403] J. H. Clark, E. Choi, M. Collins, D. Garrette, T. Kwiatkowski, V. Nikolaev, J. Palomaki, Tydi qa: A benchmark for information-seeking ques-
[417] W. Li, F. Qi, M. Sun, X. Yi, J. Zhang, Ccpm: A chinese classical poetry matching dataset, arXiv preprint arXiv:2106.01979 (2021). 29
[418] E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, J. Weston, Wizard of wikipedia: Knowledge-powered conversational agents, arXiv preprint arXiv:1811.01241 (2018). 29
[419] H. Rashkin, E. M. Smith, M. Li, Y.-L. Boureau, Towards empathetic open-domain conversation models: A new benchmark and dataset, arXiv preprint arXiv:1811.00207 (2018). 29
[420] E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe, et al., The second conversational intelligence challenge (convai2), in: The NeurIPS’18 Competition: From Machine Learning to Intelligent Conversations, Springer, 2020, pp. 187–208. 29
[421] H. Zhou, C. Zheng, K. Huang, M. Huang, X. Zhu, Kdconv: A chinese multi-domain dialogue dataset towards multi-turn knowledge-driven conversation, arXiv preprint arXiv:2004.04100 (2020). 29
[422] IFLYTEK CO., LTD., Iflytek: a multiple categories chinese text classifier, competition official website (2019). 29
[423] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, J. Blackburn, The pushshift reddit dataset, in: Proceedings of the international AAAI conference on web and social media, Vol. 14, 2020, pp. 830–839. 30
[424] A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, M. Auli, Eli5: Long form question answering, arXiv preprint arXiv:1907.09190 (2019). 31
[425] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al.,
        tion answering in typologically diverse languages, Transactions of the                 Benchmarking generalization via in-context instructions on 1,600+ lan-
        Association for Computational Linguistics 8 (2020) 454–470. 29                         guage tasks, arXiv preprint arXiv:2204.07705 (2022). 31
[404]   T. Scialom, P.-A. Dray, S. Lamprier, B. Piwowarski, J. Staiano,                  [426] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C.-S. Wu,
        Mlsum: The multilingual summarization corpus, arXiv preprint                           M. Zhong, P. Yin, S. I. Wang, et al., Unifiedskg: Unifying and multi-
        arXiv:2004.14900 (2020). 29                                                            tasking structured knowledge grounding with text-to-text language mod-
[405]   S. Lin, J. Hilton, O. Evans, Truthfulqa: Measuring how models mimic                    els, arXiv preprint arXiv:2201.05966 (2022). 31
        human falsehoods, arXiv preprint arXiv:2109.07958 (2021). 29, 32                 [427] Q. Ye, B. Y. Lin, X. Ren, Crossfit: A few-shot learning challenge
[406]   I. Augenstein, C. Lioma, D. Wang, L. C. Lima, C. Hansen,                               for cross-task generalization in nlp, arXiv preprint arXiv:2104.08835
        C. Hansen, J. G. Simonsen, Multifc: A real-world multi-domain                          (2021). 31
        dataset for evidence-based fact checking of claims, arXiv preprint               [428] V. Aribandi, Y. Tay, T. Schuster, J. Rao, H. S. Zheng, S. V. Mehta,
        arXiv:1909.03242 (2019). 29                                                            H. Zhuang, V. Q. Tran, D. Bahri, J. Ni, et al., Ext5: Towards extreme
[407]   J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, Fever: a                      multi-task scaling for transfer learning, arXiv preprint arXiv:2111.10952
        large-scale dataset for fact extraction and verification, arXiv preprint               (2021). 31
        arXiv:1803.05355 (2018). 29                                                      [429] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge cor-
[408]   I. Mollas, Z. Chrysopoulou, S. Karlos, G. Tsoumakas, Ethos: an online                  pus for sentence understanding through inference, in: Proceedings of
        hate speech detection dataset, arXiv preprint arXiv:2006.08328 (2020).                 the 2018 Conference of the North American Chapter of the Associ-
        29, 32                                                                                 ation for Computational Linguistics: Human Language Technologies,
[409]   M. Nadeem, A. Bethke, S. Reddy, Stereoset: Measuring stereotypical                     Volume 1 (Long Papers), Association for Computational Linguistics,
        bias in pretrained language models, arXiv preprint arXiv:2004.09456                    New Orleans, Louisiana, 2018, pp. 1112–1122. doi:10.18653/v1/
        (2020). 29, 32                                                                         N18-1101.
[410]   A. Parrish, A. Chen, N. Nangia, V. Padmakumar, J. Phang, J. Thomp-                     URL https://aclanthology.org/N18-1101 31
        son, P. M. Htut, S. R. Bowman, Bbq: A hand-built bias benchmark for              [430] Y. Zhang, J. Baldridge, L. He, PAWS: Paraphrase adversaries from word
        question answering, arXiv preprint arXiv:2110.08193 (2021). 29                         scrambling, in: Proceedings of the 2019 Conference of the North Amer-
[411]   J. Zhao, T. Wang, M. Yatskar, V. Ordonez, K.-W. Chang, Gender bias                     ican Chapter of the Association for Computational Linguistics: Human
        in coreference resolution: Evaluation and debiasing methods, arXiv                     Language Technologies, Volume 1 (Long and Short Papers), Associa-
        preprint arXiv:1804.06876 (2018). 29                                                   tion for Computational Linguistics, Minneapolis, Minnesota, 2019, pp.
[412]   N. Nangia, C. Vania, R. Bhalerao, S. R. Bowman, Crows-pairs: A chal-                   1298–1308. doi:10.18653/v1/N19-1131.
        lenge dataset for measuring social biases in masked language models,                   URL https://aclanthology.org/N19-1131 32
        arXiv preprint arXiv:2010.00133 (2020). 29                                       [431] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, D. Yang, Is chat-
[413]   S. Gehman, S. Gururangan, M. Sap, Y. Choi, N. A. Smith, Realtoxic-                     GPT a general-purpose natural language processing task solver?, in: The
        ityprompts: Evaluating neural toxic degeneration in language models,                   2023 Conference on Empirical Methods in Natural Language Process-
        arXiv preprint arXiv:2009.11462 (2020). 29                                             ing, 2023.
[414]   D. Borkan, L. Dixon, J. Sorensen, N. Thain, L. Vasserman, Nuanced                      URL https://openreview.net/forum?id=u03xn1COsO 32
        metrics for measuring unintended bias with real data for text classifica-        [432] M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh,
        tion, in: Companion proceedings of the 2019 world wide web confer-                     N. Akhtar, J. Wu, S. Mirjalili, et al., Large language models: a com-
        ence, 2019, pp. 491–500. 29                                                            prehensive survey of its applications, challenges, limitations, and future
[415]   O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow,                           prospects, TechRxiv (2023). 32
        M. Huck, A. J. Yepes, P. Koehn, V. Logacheva, C. Monz, et al., Find-             [433] X. L. Dong, S. Moon, Y. E. Xu, K. Malik, Z. Yu, Towards next-
        ings of the 2016 conference on machine translation, in: Proceedings of                 generation intelligent assistants leveraging llm techniques, in: Proceed-
        the First Conference on Machine Translation: Volume 2, Shared Task                     ings of the 29th ACM SIGKDD Conference on Knowledge Discovery
        Papers, 2016, pp. 131–198. 29                                                          and Data Mining, 2023, pp. 5792–5793. 32
[416]   B. Loïc, B. Magdalena, B. Ondřej, F. Christian, G. Yvette, G. Ro-               [434] K. Pandya, M. Holia, Automating customer service using langchain:
        man, H. Barry, H. Matthias, J. Eric, K. Tom, et al., Findings of the                   Building custom open-source gpt chatbot for organizations, arXiv
        2020 conference on machine translation (wmt20), in: Proceedings of                     preprint arXiv:2310.05421 (2023). 32
        the Fifth Conference on Machine Translation, Association for Compu-              [435] J. Li, B. Hui, G. Qu, B. Li, J. Yang, B. Li, B. Wang, B. Qin, R. Cao,
        tational Linguistics„ 2020, pp. 1–55. 29                                               R. Geng, et al., Can llm already serve as a database interface? a
                                                                                    45
        big bench for large-scale database grounded text-to-sqls, arXiv preprint                   International Journal of Advanced Computer Science and Applications
        arXiv:2305.03111 (2023). 32                                                                14 (6) (2023). 32
[436]   A. Rao, J. Kim, M. Kamineni, M. Pang, W. Lie, M. D. Succi, Evaluating              [456]   J. Irons, C. Mason, P. Cooper, S. Sidra, A. Reeson, C. Paris, Exploring
        chatgpt as an adjunct for radiologic decision-making, medRxiv (2023)                       the impacts of chatgpt on future scientific work, SocArXiv (2023). 32
        2023–02. 32                                                                        [457]   P. G. Schmidt, A. J. Meir, Using generative ai for literature searches and
[437]   M. Benary, X. D. Wang, M. Schmidt, D. Soll, G. Hilfenhaus, M. Nas-                         scholarly writing: Is the integrity of the scientific discourse in jeopardy?,
        sir, C. Sigler, M. Knödler, U. Keller, D. Beule, et al., Leveraging large                  arXiv preprint arXiv:2311.06981 (2023). 32
        language models for decision support in personalized oncology, JAMA                [458]   Y. Zheng, H. Y. Koh, J. Ju, A. T. Nguyen, L. T. May, G. I. Webb, S. Pan,
        Network Open 6 (11) (2023) e2343689–e2343689. 32                                           Large language models for scientific synthesis, inference and explana-
[438]   C. M. Chiesa-Estomba, J. R. Lechien, L. A. Vaira, A. Brunet, G. Cam-                       tion, arXiv preprint arXiv:2310.07984 (2023). 33
        maroto, M. Mayo-Yanez, A. Sanchez-Barrueco, C. Saga-Gutierrez, Ex-                 [459]   B. Aczel, E.-J. Wagenmakers, Transparency guidance for chatgpt usage
        ploring the potential of chat-gpt as a supportive tool for sialendoscopy                   in scientific writing, PsyArXiv (2023). 33
        clinical decision making and patient information support, European                 [460]   S. Altmäe, A. Sola-Leyva, A. Salumets, Artificial intelligence in sci-
        Archives of Oto-Rhino-Laryngology (2023) 1–6. 32                                           entific writing: a friend or a foe?, Reproductive BioMedicine Online
[439]   S. Montagna, S. Ferretti, L. C. Klopfenstein, A. Florio, M. F. Pengo,                      (2023). 33
        Data decentralisation of llm-based chatbot systems in chronic disease              [461]   S. Imani, L. Du, H. Shrivastava, Mathprompter: Mathematical reasoning
        self-management, in: Proceedings of the 2023 ACM Conference on In-                         using large language models, arXiv preprint arXiv:2303.05398 (2023).
        formation Technology for Social Good, 2023, pp. 205–212. 32                                33
[440]   D. Bill, T. Eriksson, Fine-tuning a llm using reinforcement learning from          [462]   Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, C. Zhou, Scaling relationship
        human feedback for a therapy chatbot application (2023). 32                                on learning mathematical reasoning with large language models, arXiv
[441]   M. Abbasian, I. Azimi, A. M. Rahmani, R. Jain, Conversational health                       preprint arXiv:2308.01825 (2023). 33
        agents: A personalized llm-powered agent framework, arXiv preprint                 [463]   K. Yang, A. M. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil,
        arXiv:2310.02374 (2023). 32                                                                R. Prenger, A. Anandkumar, Leandojo: Theorem proving with retrieval-
[442]   K. V. Lemley, Does chatgpt help us understand the medical literature?,                     augmented language models, arXiv preprint arXiv:2306.15626 (2023).
        Journal of the American Society of Nephrology (2023) 10–1681. 32                           33
[443]   S. Pal, M. Bhattacharya, S.-S. Lee, C. Chakraborty, A domain-specific              [464]   K. M. Collins, A. Q. Jiang, S. Frieder, L. Wong, M. Zilka, U. Bhatt,
        next-generation large language model (llm) or chatgpt is required for                      T. Lukasiewicz, Y. Wu, J. B. Tenenbaum, W. Hart, et al., Evaluating
        biomedical engineering and research, Annals of Biomedical Engineering                      language models for mathematics through interactions, arXiv preprint
        (2023) 1–4. 32                                                                             arXiv:2306.01694 (2023). 33
[444]   Y. Du, S. Zhao, Y. Chen, R. Bai, J. Liu, H. Wu, H. Wang, B. Qin, The               [465]   Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He,
        calla dataset: Probing llms’ interactive knowledge acquisition from chi-                   Z. Liu, et al., Summary of chatgpt-related research and perspective
        nese medical literature, arXiv preprint arXiv:2309.04198 (2023). 32                        towards the future of large language models, Meta-Radiology (2023)
[445]   A. Abd-Alrazaq, R. AlSaad, D. Alhuwail, A. Ahmed, P. M. Healy,                             100017. 33
        S. Latifi, S. Aziz, R. Damseh, S. A. Alrazak, J. Sheikh, et al., Large             [466]   J. Drápal, H. Westermann, J. Savelka, Using large language models
        language models in medical education: Opportunities, challenges, and                       to support thematic analysis in empirical legal studies, arXiv preprint
        future directions, JMIR Medical Education 9 (1) (2023) e48291. 32                          arXiv:2310.18729 (2023). 33
[446]   A. B. Mbakwe, I. Lourentzou, L. A. Celi, O. J. Mechanic, A. Dagan,                 [467]   J. Savelka, K. D. Ashley, M. A. Gray, H. Westermann, H. Xu, Explain-
        Chatgpt passing usmle shines a spotlight on the flaws of medical educa-                    ing legal concepts with augmented large language models (gpt-4), arXiv
        tion (2023). 32                                                                            preprint arXiv:2306.09525 (2023). 33
[447]   S. Ahn, The impending impacts of large language models on medical                  [468]   N. Guha, J. Nyarko, D. E. Ho, C. Ré, A. Chilton, A. Narayana,
        education, Korean Journal of Medical Education 35 (1) (2023) 103. 32                       A. Chohlas-Wood, A. Peters, B. Waldon, D. N. Rockmore, et al., Legal-
[448]   E. Waisberg, J. Ong, M. Masalkhi, A. G. Lee, Large language model                          bench: A collaboratively built benchmark for measuring legal reasoning
        (llm)-driven chatbots for neuro-ophthalmic medical education, Eye                          in large language models, arXiv preprint arXiv:2308.11462 (2023). 33
        (2023) 1–3. 32                                                                     [469]   J. Cui, Z. Li, Y. Yan, B. Chen, L. Yuan, Chatlaw: Open-source legal
[449]   G. Deiana, M. Dettori, A. Arghittu, A. Azara, G. Gabutti, P. Castiglia,                    large language model with integrated external knowledge bases, arXiv
        Artificial intelligence and public health: Evaluating chatgpt responses to                 preprint arXiv:2306.16092 (2023). 33
        vaccination myths and misconceptions, Vaccines 11 (7) (2023) 1217. 32              [470]   H. Yang, X.-Y. Liu, C. D. Wang, Fingpt: Open-source financial large
[450]   L. De Angelis, F. Baglivo, G. Arzilli, G. P. Privitera, P. Ferragina, A. E.                language models, arXiv preprint arXiv:2306.06031 (2023). 33
        Tozzi, C. Rizzo, Chatgpt and the rise of large language models: the new            [471]   Y. Li, S. Wang, H. Ding, H. Chen, Large language models in finance: A
        ai-driven infodemic threat in public health, Frontiers in Public Health 11                 survey, in: Proceedings of the Fourth ACM International Conference on
        (2023) 1166120. 32                                                                         AI in Finance, 2023, pp. 374–382. 33
[451]   N. L. Rane, A. Tawde, S. P. Choudhary, J. Rane, Contribution and per-              [472]   A. Lykov, D. Tsetserukou, Llm-brain: Ai-driven fast generation of
        formance of chatgpt and other large language models (llm) for scientific                   robot behaviour tree based on large language model, arXiv preprint
        and research advancements: a double-edged sword, International Re-                         arXiv:2305.19352 (2023). 33
        search Journal of Modernization in Engineering Technology and Science              [473]   E. Billing, J. Rosén, M. Lamb, Language models for human-robot inter-
        5 (10) (2023) 875–899. 32                                                                  action, in: ACM/IEEE International Conference on Human-Robot Inter-
[452]   W. Dai, J. Lin, H. Jin, T. Li, Y.-S. Tsai, D. Gašević, G. Chen, Can large                 action, March 13–16, 2023, Stockholm, Sweden, ACM Digital Library,
        language models provide feedback to students? a case study on chatgpt,                     2023, pp. 905–906. 33
        in: 2023 IEEE International Conference on Advanced Learning Tech-                  [474]   Y. Ye, H. You, J. Du, Improved trust in human-robot collaboration with
        nologies (ICALT), IEEE, 2023, pp. 323–325. 32                                              chatgpt, IEEE Access (2023). 33
[453]   E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva,                    [475]   Y. Ding, X. Zhang, C. Paxton, S. Zhang, Leveraging commonsense
        F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, et al.,                      knowledge from large language models for task and motion planning,
        Chatgpt for good? on opportunities and challenges of large language                        in: RSS 2023 Workshop on Learning for Task and Motion Planning,
        models for education, Learning and individual differences 103 (2023)                       2023. 33
        102274. 32                                                                         [476]   J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg,
[454]   N. Rane, Enhancing the quality of teaching and learning through chat-                      S. Rusinkiewicz, T. Funkhouser, Tidybot: Personalized robot assistance
        gpt and similar large language models: Challenges, future prospects,                       with large language models, arXiv preprint arXiv:2305.05658 (2023).
        and ethical considerations in education, Future Prospects, and Ethical                     33
        Considerations in Education (September 15, 2023) (2023). 32                        [477]   E. Strubell, A. Ganesh, A. McCallum, Energy and policy considerations
[455]   J. C. Young, M. Shishido, Investigating openai’s chatgpt potentials in                     for deep learning in nlp, arXiv preprint arXiv:1906.02243 (2019). 34
        generating chatbot’s dialogue for english as a foreign language learning,          [478]   E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the dan-
                                                                                      46
        gers of stochastic parrots: Can language models be too big?, in: Pro-
        ceedings of the 2021 ACM conference on fairness, accountability, and
        transparency, 2021, pp. 610–623. 34
[479] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning (still) requires rethinking generalization, Communications of the ACM 64 (3) (2021) 107–115. 34
[480] M. Tänzer, S. Ruder, M. Rei, Memorisation versus generalisation in pre-trained language models, arXiv preprint arXiv:2105.00828 (2021). 34
[481] S. M. West, M. Whittaker, K. Crawford, Discriminating systems, AI Now (2019) 1–33. 34
[482] K. Valmeekam, A. Olmo, S. Sreedharan, S. Kambhampati, Large language models still can’t plan (a benchmark for llms on planning and reasoning about change), arXiv preprint arXiv:2206.10498 (2022). 34
[483] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, et al., Siren’s song in the ai ocean: A survey on hallucination in large language models, arXiv preprint arXiv:2309.01219 (2023). 34
[484] A. Webson, E. Pavlick, Do prompt-based models really understand the meaning of their prompts?, arXiv preprint arXiv:2109.01247 (2021). 34
[485] O. Shaikh, H. Zhang, W. Held, M. Bernstein, D. Yang, On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning, arXiv preprint arXiv:2212.08061 (2022). 34
[486] B. C. Das, M. H. Amini, Y. Wu, Security and privacy challenges of large language models: A survey, arXiv preprint arXiv:2402.00888 (2024). 34
[487] X. Liu, H. Cheng, P. He, W. Chen, Y. Wang, H. Poon, J. Gao, Adversarial training for large neural language models, ArXiv (April 2020). URL https://www.microsoft.com/en-us/research/publication/adversarial-training-for-large-neural-language-models/ 34
[488] E. Shayegani, M. A. A. Mamun, Y. Fu, P. Zaree, Y. Dong, N. Abu-Ghazaleh, Survey of vulnerabilities in large language models revealed by adversarial attacks (2023). arXiv:2310.10844. 34
[489] X. Xu, K. Kong, N. Liu, L. Cui, D. Wang, J. Zhang, M. Kankanhalli, An llm can fool itself: A prompt-based adversarial attack (2023). arXiv:2310.13345. 34
[490] H. Zhao, H. Chen, F. Yang, N. Liu, H. Deng, H. Cai, S. Wang, D. Yin, M. Du, Explainability for large language models: A survey (2023). arXiv:2309.01029. 35
[491] S. Huang, S. Mamidanna, S. Jangam, Y. Zhou, L. H. Gilpin, Can large language models explain themselves? a study of llm-generated self-explanations (2023). arXiv:2310.11207. 35
[492] H. Brown, K. Lee, F. Mireshghallah, R. Shokri, F. Tramèr, What does it mean for a language model to preserve privacy?, in: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022, pp. 2280–2292. 35
[493] R. Plant, V. Giuffrida, D. Gkatzia, You are what you write: Preserving privacy in the era of large language models, arXiv preprint arXiv:2204.09391 (2022). 35
[494] W. Niu, Z. Kong, G. Yuan, W. Jiang, J. Guan, C. Ding, P. Zhao, S. Liu, B. Ren, Y. Wang, Real-time execution of large-scale language models on mobile (2020). arXiv:2009.06823. 35
[495] C. Guo, J. Tang, W. Hu, J. Leng, C. Zhang, F. Yang, Y. Liu, M. Guo, Y. Zhu, Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization, in: Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–15. 35
[496] B. Meskó, E. J. Topol, The imperative for regulatory oversight of large language models (or generative ai) in healthcare, npj Digital Medicine 6 (1) (2023) 120. 35
[497] J. Zhang, X. Ji, Z. Zhao, X. Hei, K.-K. R. Choo, Ethical considerations and policy implications for large language models: Guiding responsible development and deployment, arXiv preprint arXiv:2308.02678 (2023). 35
[498] J. Mökander, J. Schuett, H. R. Kirk, L. Floridi, Auditing large language models: a three-layered approach, AI and Ethics (2023) 1–31. 35