LLM Forgetting
Abstract
1 Introduction
Large language models (LLMs) have demonstrated impressive general capabilities in tackling a wide range of tasks (Bubeck et al., 2023). In real-world applications, however, users often find that certain specific abilities need enhancement. In such cases, relevant task-specific data are used to fine-tune the model with instructions to improve its performance on particular tasks (Touvron et al., 2023; Scialom et al., 2022). However, widely used LLMs such as LLAMA-Chat (Touvron et al., 2023), ChatGPT (https://openai.com/), and Claude-2 (https://www.anthropic.com/news/claude-2) have already been instruction-tuned, and continual instruction tuning on specific tasks may affect the general knowledge stored in these LLMs, since the original training data are inaccessible. The study of continual instruction tuning is therefore crucial for the practical application of LLMs. For example, in Table 1, an LLM is first fine-tuned for title generation and then learns to generate explanations. This scenario is related to the continual learning paradigm in machine learning, where a model is trained on a sequence of tasks (Lopez-Paz & Ranzato, 2017; Wu et al., 2022; Ke & Liu, 2022). A significant challenge in this paradigm is catastrophic forgetting (CF), in which a model forgets previously learned knowledge due to parameter
updates during the learning of new tasks, leading to decreased performance on earlier tasks
(Li & Hoiem, 2017; Lopez-Paz & Ranzato, 2017).
Scialom et al. (2022) investigate the CF issues of T0 models during continual instruction
tuning. However, their analysis primarily focuses on the forgetting phenomenon observed
in downstream tasks, such as summarization and style transfer. The evolution of general
knowledge stored in the original pre-trained LLMs during instruction tuning remains
unexplored. Luo et al. (2023b) conduct an analysis using probing methods on pre-trained
language models to examine the problem of generality destruction and general knowledge
forgetting during continual fine-tuning. Nevertheless, their study is restricted to encoder-
only models and classification tasks. In this work, we draw attention to the following
fundamental questions regarding forgetting in generative LLMs:
1. Is the general knowledge stored in LLMs forgotten during continual instruction tuning?
2. What are the effects of model scale, model architecture, and general instruction tuning on the forgetting problem?
instruction tuning process. When comparing BLOOMZ with mT0 at a comparable model scale, we find that BLOOMZ experiences a relatively milder forgetting problem, suggesting that the decoder-only architecture may be better at retaining information during continual instruction tuning. Lastly, empirical results on LLAMA and its instruction-tuned version (i.e., ALPACA) indicate that diverse instruction tuning can help alleviate the CF phenomenon for LLMs in further continual fine-tuning.
The contributions of our paper can be summarized as follows:
1. We take an initial step toward analyzing the catastrophic forgetting (CF) problem during continual instruction tuning through an empirical study, in which a dedicated evaluation setting is designed from the perspective of general knowledge, covering domain knowledge, reasoning, reading comprehension, and bias.
2. We provide initial evidence that the CF problem generally exists in the continual instruction tuning process for different models such as BLOOMZ, mT0, LLAMA, and ALPACA. We also show that model architecture and model scale have different effects on the CF problem.
3. Experimental results further show that general instruction data can help mitigate the CF problem to some extent.
2 Related Work
Instruction tuning has proven to be effective in aligning responses from pre-trained language
models with human intents or preferences (Ouyang et al., 2022; Stiennon et al., 2020; Min
et al., 2021). This technique refines a model’s ability to predict a specific response to a given
prompt, which may optionally include an instruction that outlines a task for the model.
Examples of such models include T0 (Sanh et al., 2021), mT0 (Muennighoff et al., 2022),
and BLOOMZ (Muennighoff et al., 2022). It has been demonstrated that instruction tuning
can enhance the ability of language models to generalize to unseen tasks without prior
exposure (Wei et al., 2021; Sanh et al., 2021). In this work, we focus on fine-tuning LLMs
in a continual manner and analyze the catastrophic forgetting (CF) phenomenon during
training. Specifically, instructions for a particular type of task (such as generating headlines)
are used to tune the LLMs in each training phase, and the model does not have access to
previously learned tasks.
Various training strategies have been proposed to address the problem of catastrophic
forgetting (CF) in continual learning (Riemer et al., 2019; Buzzega et al., 2020; Ke et al.,
2022; Chen et al., 2022; Luo et al., 2023a). Previous studies have primarily measured CF by
evaluating the performance decrease in previously learned tasks during continual learning
or the average performance of learned tasks at the end of training. However, Davari et al.
(2022) discovered that even when the model performance on previously learned tasks is
preserved, the representations still suffer from significant drift due to parameter updates.
As a result, they propose using an optimal linear classifier over learned tasks to measure performance, with its change serving as a surrogate to quantify CF. Similarly, Wu et al. (2022) employ layer-wise and task-wise probing to analyze CF in each layer for previously
learned tasks. Luo et al. (2023b) propose using a series of probing tasks to evaluate the
knowledge stored in LLMs and analyze the generality of the models. However, their study
is limited to classification tasks and encoder-only model architectures. To the best of our
knowledge, we are the first to evaluate the forgetting of general knowledge in generative
large language models during continual instruction tuning.
Figure 1: The framework of our empirical study of continual instruction tuning. The initial model M0 is continually trained on different instruction tasks and evaluated from the perspective of general knowledge tasks, including domain knowledge, reasoning, reading comprehension, and bias.
Table 2: Details of the evaluation sets for the CF phenomenon in LLMs. DK, Rs, and RC
represent domain knowledge, reasoning, and reading comprehension.
Set Elements
DK STEM, Social, Human, Other
Rs BoolQ, PIQA, Winogrande, Hellaswag, MathQA, Mutual
RC RACE-high, RACE-middle
Bias Sexual Orientation, Physical Appearance, Religion, Nationality, Race/Color, Gender,
Socioeconomic, Disability, Age
3 Method
Formally, in the continual instruction tuning of LLMs, a model sequentially learns several generation tasks denoted as $\mathcal{T} = \{T^m\}_{m=1}^{N}$, where $N$ is the length of the task sequence. During the training of each task $T^m \in \mathcal{T}$, only the corresponding data $D^m = \{(x_i^m, y_i^m)\}$ are available, where $x_i^m$ is the input text together with an instruction and $y_i^m$ is the corresponding generation label. Given an initial LLM denoted by $M_0$, we continually train the model on the data $D^m$, obtaining the trained model $M_m$. The training and evaluation framework is shown in Figure 1.
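To make this protocol concrete, the following is a minimal sketch of the sequential training-and-probing loop. It is illustrative only: `fine_tune` and `evaluate` are caller-supplied placeholders standing in for the actual instruction-tuning and evaluation routines, not functions from our code.

```python
def continual_instruction_tuning(model, task_stream, general_eval_sets,
                                 fine_tune, evaluate):
    """Sketch of the Figure 1 framework: obtain M_m by tuning M_{m-1} on
    D^m alone, then probe every checkpoint for general knowledge.

    fine_tune(model, data) -> model and evaluate(model, eval_set) -> float
    are caller-supplied routines; they stand in for the actual
    instruction-tuning and evaluation (e.g., lm-evaluation-harness) runs.
    """
    history = []
    for task_data in task_stream:            # D^1, ..., D^N in a fixed order
        model = fine_tune(model, task_data)  # only D^m is visible; no replay
        # Probe checkpoint M_m on domain knowledge, reasoning, reading
        # comprehension, and bias.
        history.append({name: evaluate(model, eval_set)
                        for name, eval_set in general_eval_sets.items()})
    return model, history
```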
Following Scialom et al. (2022), we adopt instruction tasks dissimilar to the training and evaluation tasks of BLOOMZ and mT0. Specifically, we select the following 5 tasks from Scialom et al. (2022):
1. Text Simplification (Simp) (Jiang et al., 2020; Alva-Manchego et al., 2020) requires paraphrasing a text into simpler language;
2. Empathetic Dialogue Generation (Emdg) (Rashkin et al., 2019) requires the model to generate a response for a conversational context under a given emotional situation;
3. Inquisitive Question Generation (InqQG) (Fan et al., 2019) requires the model to generate a question for a long-form answer;
4. Explanation Generation (Exp) (Camburu et al., 2018) aims to train a model to generate natural language explanations for a given premise, hypothesis, or label;
5. Headline Generation with Constraint (HGen) (Scialom et al., 2022) aims to train a model to generate headlines under specific constraints, such as containing the keywords X at the beginning, at the end, or anywhere.
During the instruction tuning, we first add a general prompt template to the beginning of the data: ‘Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request...’, followed by a specific prompt for each task. We adopt the specific prompts designed by Scialom et al. (2022), and 100,000 data samples are used for training. The details of the instructions are shown in Appendix A. For simplicity, we train the model on one instruction task order for the empirical study: Simp → Emdg → InqQG → Exp → HGen.
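As an illustration, here is a minimal sketch of how one training sample could be assembled. The general preamble is the one quoted above; the `###`-style section markers follow the common Alpaca convention and are an assumption here, and `TASK_PROMPTS` holds shortened stand-ins for the task-specific prompts (full samples in Appendix A).

```python
GENERAL_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"  # assumed markers
)

# Stand-ins for the task-specific prompts; wording follows Appendix A.
TASK_PROMPTS = {
    "Simp": "Reformulate this text with simpler words: {text}",
    "HGen": "Make a title for this article with '{keyword}': {text}",
}

def build_prompt(task: str, **fields) -> str:
    """Compose the general preamble with one task-specific instruction."""
    return GENERAL_TEMPLATE.format(
        instruction=TASK_PROMPTS[task].format(**fields))

# e.g. build_prompt("Simp", text="His father Robert Alda was an actor ...")
```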
To evaluate the general/basic knowledge stored in the LLMs, we adopt several general
evaluation tasks (Table 2), which can be categorized into four sets:
Domain Knowledge: We employ the Massive Multitask Language Understanding bench-
mark (MMLU) (Hendrycks et al., 2020) to assess the knowledge stored in the LLMs. MMLU
covers a wide range of domains, including STEM, Human, Social, and Other.
Reasoning: We utilize commonly used commonsense reasoning datasets, such as Hellaswag
(Zellers et al., 2019), BoolQ (Clark et al., 2019), Winogrande (Sakaguchi et al., 2021), and
PIQA (Bisk et al., 2020). Additionally, we evaluate the models on mathQA (Amini et al.,
2019) for math reasoning and Mutual (Cui et al., 2020) for dialog reasoning.
Reading Comprehension: We assess the LLMs’ performance on the RACE dataset (Lai et al.,
2017), which includes both middle and high school level reading comprehension tasks.
Bias: To investigate biases in the continually trained models, we employ the CrowS-Pairs dataset (Nangia et al., 2020), which covers various bias types, including gender, race/color, religion, and more.
We quantify forgetting with an FG score computed for each evaluation element e from R_e^o, the result of e on the initial LLM, and R_e^{-1}, the result after the final instruction task; a larger FG indicates more severe forgetting. We obtain the evaluation results using the open-source evaluation framework lm-evaluation-harness (Gao et al., 2021).
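As a rough sketch of the quantity involved, the snippet below computes an average relative performance drop over the elements of one evaluation set. This is one plausible reading of the R_e^o / R_e^{-1} / FG columns in Tables 4 and 6, stated here as an assumption rather than a verified reproduction of the exact formula.

```python
def fg_score(initial: dict[str, float], final: dict[str, float]) -> float:
    """Average relative performance drop (in %) across the elements of one
    evaluation set, e.g. {"STEM": ..., "Social": ..., ...}.

    ASSUMPTION: FG is taken here as the mean of (R_e^o - R_e^{-1}) / R_e^o;
    the paper's exact definition may differ.
    """
    drops = [(initial[e] - final[e]) / initial[e] * 100.0 for e in initial]
    return sum(drops) / len(drops)

# Hypothetical numbers: a positive FG means performance was forgotten.
print(fg_score({"BoolQ": 60.0, "PIQA": 70.0},
               {"BoolQ": 54.0, "PIQA": 63.0}))  # -> 10.0
```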
For most multiple-choice evaluation elements, including MathQA, Hellaswag, BoolQ, PIQA, Mutual, Winogrande, and RACE, we adopt zero-shot accuracy to measure model performance. For MMLU, we adopt the 5-shot setting for evaluation. For CrowS-Pairs, we follow Nangia et al. (2020) and measure the model’s preference for the stereotypical sentence based on the perplexity of the given stereotypical and anti-stereotypical sentences, where a larger value means a stronger bias in the model.
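To make the bias measurement concrete, here is a minimal sketch of the perplexity comparison for one CrowS-Pairs sentence pair, assuming a Hugging Face causal LM; lm-evaluation-harness has its own scoring implementation, so this is illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prefers_stereotype(model, tokenizer, stereo: str, anti: str) -> bool:
    """True if the model gives the stereotypical sentence a lower perplexity
    (i.e., a higher likelihood) than the anti-stereotypical one."""
    def mean_nll(sentence: str) -> float:
        ids = tokenizer(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels=ids the model returns the mean token NLL as .loss;
            # perplexity = exp(mean NLL), so comparing NLLs compares PPLs.
            return model(input_ids=ids, labels=ids).loss.item()
    return mean_nll(stereo) < mean_nll(anti)

# The reported bias score is the fraction of pairs where the stereotypical
# sentence is preferred; values near 50% indicate little preference.
```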
4 Experimental Setting
4.1 Models
We adopt BLOOMZ for the empirical study since BLOOMZ is available at diverse scales and can be directly compared with the encoder-decoder model mT0, which is fine-tuned on the same instruction datasets as BLOOMZ. We also consider the widely used LLAMA and ALPACA to further study the effect of general instruction tuning.
Table 3: The performance of the LLMs before and after instruction tuning on the corresponding task during continual learning. ‘Initial’ refers to the performance of the original LLMs, and ‘Tuned’ to the performance after instruction tuning on that task. R1 and BS denote ROUGE-1 and BERTScore, respectively.
Simp (SARI) Emdg (BS) InqQG (BS) Exp (BS) HGen (R1)
Initial Tuned Initial Tuned Initial Tuned Initial Tuned Initial Tuned
mT0-3.7b 39.01 39.92 48.29 51.70 52.66 56.25 51.62 61.91 30.50 31.77
BLOOMZ-3b 37.95 46.72 46.10 53.27 49.06 59.72 49.75 67.10 27.88 31.72
BLOOMZ-7.1b 45.26 47.24 49.68 53.30 52.30 59.69 51.47 68.71 31.50 32.93
BLOOM-7.1b 42.65 47.14 44.98 52.37 44.30 59.99 49.57 68.76 30.50 32.41
LLAMA-7b 43.02 46.92 44.28 49.54 43.90 47.54 50.72 54.22 32.42 33.80
ALPACA-7b 45.37 48.22 52.56 54.70 56.91 62.13 52.66 70.49 32.06 36.73
4.2 Implementation
We train our models on 8 GPUs (Tesla A100 40G) using the Adam optimizer (Kingma & Ba, 2014); the 1b-scale models are trained on 4 GPUs to save resources. For all models, the batch size is 4 on each device and the learning rate is 2e-5; the scheduler is constant for BLOOMZ and mT0, following Muennighoff et al. (2022). For LLAMA and ALPACA, we follow the hyperparameters of Taori et al. (2023): the scheduler is cosine and the learning rate is 2e-5. The maximum sequence length of the inputs is 512. We train each model for 3 epochs, and the final checkpoints are used for evaluation.
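For reference, these settings map onto a standard Hugging Face `TrainingArguments` configuration roughly as follows; this is a sketch of the stated hyperparameters, not our released training script.

```python
from transformers import TrainingArguments

def make_training_args(model_family: str) -> TrainingArguments:
    # Constant schedule for BLOOMZ/mT0 (Muennighoff et al., 2022);
    # cosine schedule for LLAMA/ALPACA (Taori et al., 2023).
    scheduler = "constant" if model_family in ("bloomz", "mt0") else "cosine"
    return TrainingArguments(
        output_dir=f"ckpt-{model_family}",
        per_device_train_batch_size=4,  # batch size 4 on each device
        learning_rate=2e-5,
        lr_scheduler_type=scheduler,
        num_train_epochs=3,             # final checkpoint used for evaluation
        # The 512-token maximum input length is enforced at tokenization time.
    )
```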
5 Experimental Results
In this section, we first show that the forgetting phenomenon generally exists in LLMs during continual instruction tuning (Section 5.1). We then analyze the factors that affect the extent of forgetting, namely model scale, model architecture, and general instruction tuning (Sections 5.2-5.4).
Figure 2: The FG values of BLOOMZ at different model scales after continual training.
Table 4: Main results of forgetting in LLMs during continual instruction tuning. R_e^o and R_e^{-1} refer to the evaluation results at the beginning and the end of instruction tuning.
Domain Knowledge Reasoning Reading Comprehension
R_e^o R_e^{-1} FG R_e^o R_e^{-1} FG R_e^o R_e^{-1} FG
mT0-1.2b 26.82 22.47 9.18 45.43 40.22 7.75 35.06 29.54 17.45
mT0-3.7b 30.99 20.14 20.15 48.61 38.39 16.73 41.10 30.45 28.42
BLOOMZ-1.1b 27.19 23.84 9.54 47.37 41.97 6.73 36.77 27.28 18.04
BLOOMZ-1.7b 28.72 24.52 10.72 48.30 44.96 6.48 42.65 30.09 24.29
BLOOMZ-3b 30.04 24.29 14.63 56.17 47.03 11.09 48.29 31.38 27.56
BLOOMZ-7.1b 33.08 25.61 18.37 59.15 49.24 13.62 48.79 33.05 26.75
Table 6: Results of CF in models with and without general instruction tuning, including the pairs (BLOOM, BLOOMZ) and (LLAMA, ALPACA).
Domain Knowledge Reasoning Reading Comprehension
R_e^o R_e^{-1} FG R_e^o R_e^{-1} FG R_e^o R_e^{-1} FG
BLOOM-7.1b 29.42 24.83 13.54 52.79 47.76 6.67 38.25 31.55 12.00
BLOOMZ-7.1b 33.08 25.61 18.37 59.15 49.24 13.62 48.79 33.05 26.75
LLAMA-7b 37.27 24.05 34.57 58.73 40.38 31.33 41.36 27.62 31.72
ALPACA-7b 39.29 29.88 18.14 60.11 53.68 7.56 44.47 37.61 10.31
(Figure 3: detailed knowledge-evolution results for BLOOMZ-1.1B and BLOOMZ-7.1B across the stages Initial → Simp → Emdg → InqQG → Exp → HGen.)
Figure 4: The detailed results of knowledge evolution for BLOOMZ and mT0 at a comparable model scale. The first row refers to BLOOMZ-3b and the second to mT0-3.7b. The first to third columns show the FG values for domain knowledge (MMLU), reasoning, and reading comprehension, respectively.
substantial compared to that of BLOOMZ-1.1b, the performance of both models ends at rela-
tively similar values, which would result in a more significant FG value for BLOOMZ-7.1B.
The main results of forgetting are reported in Table 4. We observe that the FG values for
domain knowledge, reasoning, and reading comprehension are all above zero, indicating
that general knowledge is forgotten during continual instruction tuning. Reading compre-
hension performance suffers the most drastic forgetting, followed by domain knowledge.
(Figure 5: detailed knowledge-evolution results for LLaMA-7B and ALPACA-7B across the stages Initial → Simp → Emdg → InqQG → Exp → HGen.)
For example, the FG values of BLOOMZ-7.1b are 26.75%, 18.37%, and 13.62% in reading
comprehension, domain knowledge, and reasoning, respectively. Interestingly, we observe
that the FG values for bias (Table 5) are mostly above zero in the experiments, which sug-
gests that model biases, such as those related to race, color, gender, and so on, are mitigated
during continual instruction tuning. For instance, in sentences describing physical appear-
ance, BLOOMZ-7.1b initially prefers stereotype-conforming sentences with a probability of
75.0%, but this preference decreases to 63.88% after continual instruction tuning.
increases to 3b, BLOOMZ-3b suffers less forgetting compared to mT0-3.7b. For example, the FG value of BLOOMZ-3b is 11.09, which is 5.64 lower than that of mT0-3.7b. These results
suggest that BLOOMZ, which has a decoder-only model architecture, can maintain more
knowledge during continual instruction tuning. This difference may be attributed to the
autoregressive nature of the model or the differences in training objectives. Furthermore,
the results imply that as the model scale increases, decoder-only models may suffer from
less catastrophic forgetting compared to encoder-decoder models. As we observe, knowledge degrades more drastically in mT0.
We also conduct experiments to analyze the effect of general instruction tuning on the CF problem during continual instruction tuning (Table 6). We compare BLOOM-7.1b with BLOOMZ-7.1b and LLAMA-7b with ALPACA-7b. We observe that BLOOMZ-7.1b outperforms BLOOM-7.1b by a large margin in the initial performance on domain knowledge, reasoning, and reading comprehension. Due to this difference in initial performance, BLOOMZ-7.1b experiences more significant forgetting. However, in the case of LLAMA and ALPACA, there is no substantial gap in initial performance, and ALPACA maintains more general knowledge after continual fine-tuning. The evolution of general knowledge is illustrated in Figure 5. We observe that LLAMA-7b suffers significant forgetting in the first instruction-tuning stage, which suggests that models without general instruction tuning may have less ability to retain knowledge during continual fine-tuning. This better retention implies that general instruction tuning can mitigate catastrophic forgetting in LLMs during further continual fine-tuning. This finding highlights the importance of general instruction tuning in preserving the acquired knowledge and skills of LLMs when they undergo subsequent task-specific fine-tuning.
To further demonstrate the effect of general instruction tuning, we mix 10,000 general instruction data samples from ALPACA (Taori et al., 2023) with the continual instruction tasks to train the BLOOMZ-7.1b and LLAMA-7b models. For brevity, we present the performance on one data split from each evaluation set (MMLU-human, Hellaswag, and RACE-middle) to illustrate the effect in Figure 6. The results clearly show that forgetting during continual instruction tuning can be mitigated to a certain extent by incorporating general instruction data. For instance, the performance of the initial LLAMA-7b model on MMLU-human is 34.72%, but it decreases to 26.8% when trained solely on the task instruction data; when trained on the mixed data, the performance recovers to 30%. These findings further show that general instruction tuning can help alleviate the CF problem encountered during continual instruction tuning.
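A minimal sketch of this data-mixing step is shown below; `alpaca_pool` stands in for the general instruction data of Taori et al. (2023) and `task_data` for one continual-tuning task’s samples, both hypothetical variable names rather than our code.

```python
import random

def mix_in_general_data(task_data: list, alpaca_pool: list,
                        k: int = 10_000, seed: int = 0) -> list:
    """Blend k general instruction samples into one task's training set so
    that each continual stage also rehearses general instruction-following."""
    rng = random.Random(seed)
    mixed = task_data + rng.sample(alpaca_pool, k)
    rng.shuffle(mixed)
    return mixed
```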
6 Conclusion
In this study, we conducted an empirical investigation into the catastrophic forgetting (CF)
phenomenon experienced by large language models (LLMs) during continual instruction
tuning. Our findings revealed that the CF problem is generally prevalent in the continual
fine-tuning of various LLMs. Moreover, as the model scale increases, LLMs exhibit a
more severe degree of forgetting in domain knowledge, reasoning abilities, and reading
comprehension skills. Furthermore, our comparative analysis showed that the decoder-only
model, BLOOMZ, demonstrates a superior ability to retain knowledge and skills during
continual fine-tuning when compared to the encoder-decoder model, mT0. Additionally,
we discovered that employing general instruction tuning techniques may help alleviate the
CF problem in LLMs. Our empirical study suggests that exploring more effective methods
to mitigate CF in LLMs during continual fine-tuning is a promising research direction.
Meanwhile, since our work is an empirical study constrained by computational resources, there is still ample room to investigate the forgetting phenomenon, for example at larger model scales (70b or above), since a larger model may need smaller parameter changes to fit downstream tasks. When applying LLMs, practitioners should remain vigilant and pay close attention to the issue of knowledge forgetting that may occur after instruction tuning.
Addressing this challenge is crucial to ensure the reliable and consistent performance of
LLMs in real-world applications.
7 Limitations
In this study, we take an initial step toward analyzing the CF problem during continual instruction tuning. Due to restricted computational resources, we could not carry out experiments on models at larger scales, but we can still observe the forgetting phenomenon at model scales from 1b to 7b. To simplify the analysis, we restrict the experiments to a single task order, which may affect the observed forgetting. Meanwhile, although there are plenty of benchmarks for evaluating the performance of LLMs, we only adopt some popular ones to analyze general knowledge; otherwise, the computational cost of conducting the experiments would be prohibitively high.
References
Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot,
and Lucia Specia. Asset: A dataset for tuning and evaluation of sentence simplification
models with multiple rewriting transformations. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, pp. 4668–4679, 2020.
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Han-
naneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with
operation-based formalisms. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), pp. 2357–2367, 2019.
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical
commonsense in natural language. In Proceedings of the AAAI conference on artificial
intelligence, volume 34, pp. 7432–7439, 2020.
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz,
Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial
general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara.
Dark experience for general continual learning: a strong, simple baseline. Advances in
neural information processing systems, 33:15920–15930, 2020.
Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. e-snli:
Natural language inference with natural language explanations. Advances in Neural
Information Processing Systems, 31, 2018.
Yulong Chen, Yang Liu, Li Dong, Shuohang Wang, Chenguang Zhu, Michael Zeng, and
Yue Zhang. Adaprompt: Adaptive model training for prompt-based nlp. arXiv preprint
arXiv:2202.04824, 2022.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and
Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
pp. 2924–2936, 2019.
Leyang Cui, Yu Wu, Shujie Liu, Yue Zhang, and Ming Zhou. Mutual: A dataset for multi-
turn dialogue reasoning. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pp. 1406–1416, 2020.
MohammadReza Davari, Nader Asadi, Sudhir Mudur, Rahaf Aljundi, and Eugene
Belilovsky. Probing representation forgetting in supervised and unsupervised contin-
ual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 16712–16721, 2022.
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli.
Eli5: Long form question answering. In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, pp. 3558–3567, 2019.
Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster,
Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang,
Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A
framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300, 2020.
Chao Jiang, Mounica Maddela, Wuwei Lan, Yang Zhong, and Wei Xu. Neural crf model for
sentence alignment in text simplification. In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, 2020.
Zixuan Ke and Bing Liu. Continual learning of natural language processing tasks: A survey.
arXiv preprint arXiv:2211.12701, 2022.
Zixuan Ke, Haowei Lin, Yijia Shao, Hu Xu, Lei Shu, and Bing Liu. Continual training of
language models for few-shot learning. arXiv preprint arXiv:2210.05549, 2022.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale
reading comprehension dataset from examinations. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language Processing, pp. 785–794, 2017.
Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern
analysis and machine intelligence, 40(12):2935–2947, 2017.
David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual
learning. Advances in neural information processing systems, 30, 2017.
Yun Luo, Xiaotian Lin, Zhen Yang, Fandong Meng, Jie Zhou, and Yue Zhang. Mitigating
catastrophic forgetting in task-incremental continual learning with adaptive classification
criterion. arXiv preprint arXiv:2305.12270, 2023a.
Yun Luo, Zhen Yang, Xuefeng Bai, Fandong Meng, Jie Zhou, and Yue Zhang. Investigat-
ing forgetting in pre-trained representations through continual learning. arXiv preprint
arXiv:2305.05968, 2023b.
Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. Metaicl: Learning to
learn in context. arXiv preprint arXiv:2110.15943, 2021.
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman,
Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al.
Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786,
2022.
Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel Bowman. Crows-pairs: A chal-
lenge dataset for measuring social biases in masked language models. In Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.
1953–1967, 2020.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language
models to follow instructions with human feedback. Advances in Neural Information
Processing Systems, 35:27730–27744, 2022.
Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. Towards empathetic
open-domain conversation models: A new benchmark and dataset. In Proceedings of
the 57th Annual Meeting of the Association for Computational Linguistics. Association for
Computational Linguistics, 2019.
Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald
Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing in-
terference. In International Conference on Learning Representations. International Conference
on Learning Representations, ICLR, 2019.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An
adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106,
2021.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai,
Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted
training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021.
Thomas Scialom, Tuhin Chakrabarty, and Smaranda Muresan. Fine-tuned language models
are continual learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural
Language Processing, pp. 6107–6122, 2022.
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec
Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human
feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin,
Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following
llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al.
Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,
2023.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi,
and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated
instructions, 2022.
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan
Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners.
arXiv preprint arXiv:2109.01652, 2021.
Tongtong Wu, Massimo Caccia, Zhuang Li, Yuan-Fang Li, Guilin Qi, and Gholamreza
Haffari. Pretrained language model in continual learning: A comparative study. In The
Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-
29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=figzpGMrdD.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a
machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
A Instruction Details
We show the instruction samples for the continual instruction tasks adopted in this study in Table 7.
Simp:
Instruction: Reformulate this text with simpler words: ‘His father Robert Alda -LRB- born
Alphonso Giuseppe Giovanni Roberto D ’Abruzzo -RRB- was an actor and singer, and his
mother Joan Browne was a homemaker and former beauty-pageant winner. ’
Output: His father, Robert Alda, was an actor, and his mother, Joan, was once crowned Miss
New York.
Emdg:
Instruction: The associated emotion is ‘guilty’ and the input prompt is ‘I felt guilty when
I was driving home one night and a person tried to fly into my lane, and didn’t see me. I
honked and they swerved back into their lane, slammed on their brakes, and hit the water
cones.’. Now what would be your response, given the following dialogue context:===- Yeah
about 10 years ago I had a horrifying experience. It was 100% their fault but they hit the
water barrels and survived. They had no injuries but they almost ran me off the road.
Output: No I wasn’t hit. It turned out they were drunk. I felt guilty but realized it was his
fault.
InqQG:
Instruction: The del operator shows up in the Laplacian, which is in the kinetic energy term,
when the Schrodinger equation is written in the position basis. ∇^(2) is a differential operator, equal to (d/dx)^(2) + (d/dy)^(2) + (d/dz)^(2), in Cartesian coordinates (the derivatives are partial derivatives). h^(2)∇^(2)/2m is the kinetic energy operator in the position basis.===
Given the above text, write the possible curious question it answers:
Output: What is the del operator in the Schrodinger equation?
Exp:
Instruction: Explain why the two following sentences are unrelated: Sentence 1: ‘This
church choir sings to the masses as they sing joyous songs from the book at a church’;
Sentence 2: ‘The church has cracks in the ceiling.’
Output: There is no indication that there are cracks in the ceiling of the church.
HGen:
Instruction: Make a title for this article with ‘computer’: japan’s NEC corp. and UNK
computer corp. of the United States said Wednesday they had agreed to join forces in
supercomputer sales.
Output: NEC UNK in computer sales tie-up.