
DeepLyrics: GPT2 for lyrics generation with finetuning and prompting techniques


Stanford CS224N Custom Project

Li Tian, Department of Statistics, Stanford University, lii@stanford.edu
Xiaoli Yang, Department of Statistics, Stanford University, xiaoliy2@stanford.edu

Abstract

Lyrics are an incredibly important part of a song’s success. Fast-advancing natural language processing (NLP) methods have been successful at lyrics generation, making AI-assisted lyrics creation possible. However, as these methods become more resource-consuming and data-demanding, we identify a neglected research direction: exploring efficient ways of model learning that simplify existing lyrics generation methods. Our project proposes DeepLyrics, a GPT-2 model that uses tuning-free prompting (in-context learning) on lyrics of highly successful songs from the past several decades to assist creative generation. Our tuning-free method, DeepLyrics, achieves comparable, and sometimes better, performance than thoroughly fine-tuned lyrics generation models. Our work shows the practicability of replacing a large amount of the training and fine-tuning work in existing lyrics generation methods with a well-engineered prompting technique.

1 Key Information to include


• Mentor: N/A
• External collaborators: Ying Lin @ Stanford Culture Lab (data preparation only)
• Sharing project: N/A

2 Introduction

Good lyrics appeal to the audience emotionally, but writing them is usually a creative human process that is hard to replicate given the idiosyncrasies of different cultures. With the emergent abilities of Large Language Models, many lyrics generation methods have been created to automate this process or to serve as first-line inspiration for creators. This relies on the language models’ ability to learn the structure, meaning, and style of lyrics well enough to mimic the creative process.
Existing efforts are usually constrained to basic fine-tuning techniques with natural language prompting, so we see potential for higher-quality lyrics generation that saves training/fine-tuning work through a more thoughtful fine-tuning and prompting design.
We explore two main areas of training and prompting techniques: Fixed Prompt Training (fine-tuning and prefix-tuning) and Prompt Construction. In the process, we aim to learn about fine-tuning techniques for large language models and about alternative prompting regimes that yield high-quality lyrics generation. As a result of our project, we created DeepLyrics, a GPT-2 based model that uses prompts engineered from lyrics of highly successful songs from the past several decades for the downstream generation task. It requires no fine-tuning and achieves comparable, and sometimes better, performance than thoroughly fine-tuned language models.



3 Related Work

Lyrics generation, an important aspect of everyday literary creation, has been explored by researchers in the area of deep learning. Most research in this area focuses on fine-tuning pre-trained large language models (LLMs) to their full capacity, supplementing the pre-trained model with extra information, and/or setting proper generation constraints during training for better lyrics generation. Lu et al. (2019) and Huang and You (2021) explored augmenting Seq2Seq models with a piece of the original melody and with melody emotions, respectively, to improve lyrics generation. Ma et al. (2021) proposed AI-Lyricist, a lyrics-generating system consisting of four modules: a music structure analyzer, a SeqGAN-based lyrics generator trained on all parameters with multi-adversarial training, a deep coupled music-lyrics embedding model, and a polisher. As model training and data augmentation become more and more comprehensive, Youling (Zhang et al. (2022)), an AI-assisted lyrics creation system, was launched as a functioning web app. It takes in information including, but not limited to, music style, rhyme, rhythm, and prompts and revisions provided by users. Although these methods present exciting and promising future directions for lyrics generation, we find a missing piece in current research: exploring simple and efficient ways to generate high-quality lyrics with minimal manually supplemented information. Ventura and Toker (2022) proposed an efficient prompting technique that utilizes paraphrasing and lyrics understanding. However, it still requires training the prompt on a text summarization task, which requires extra work in preparing task ground truths (text summaries). Hence, our work fills the gap of exploring and proposing creative, efficient fine-tuning and prompting methods that simplify existing, complicated lyrics generators.

4 Approach

Figure 1: Fine-tuning and prompting experiments on the GPT2-medium lyrics generation task

We approach the lyrics generation task by fine-tuning the off-the-shelf Huggingface GPT-2 model on successful lyrics. In particular, we explore optimizing the task along two dimensions: fine-tuning and prompting. We use the GPT-2 medium model before fine-tuning as a baseline and adopt the official Huggingface tutorial and the prefix-tuning reference implementation to fine-tune the GPT-2 model.
Our approach is a mix of Promptless Fine-tuning and Fixed-LM Prompt-tuning, in reference to the fine-tuning/prompting taxonomy introduced in Liu et al. (2023).
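As a point of reference, the baseline model can be loaded directly from Huggingface. The snippet below is a minimal sketch of that setup (the model identifier and the pad-token workaround are standard Huggingface usage, not code taken from our experiments):

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Off-the-shelf GPT2-medium checkpoint (345M parameters) used as the untuned baseline.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# GPT-2 ships without a pad token; reuse the EOS token so batched lyrics can be padded.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id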

• Promptless Fine-tuning: Promptless fine-tuning refers to fine-tuning the language model for the downstream task while keeping the prompt fixed. GPT2-medium is a non-trivial model with 345 million parameters, which makes fine-tuning all parameters challenging. We experiment with fine-tuning different parts of the model to build intuition on which layer, or combination of layers, is most responsible for task-specific performance; a sketch of this block-wise fine-tuning is given in Section 5.3.
1. Fine-tune the first transformer block [1]
2. Fine-tune the middle transformer block [5]
3. Fine-tune the last transformer block [12]
4. Prefix embedding tuning (Li and Liang (2021))
Prefix-tuning prepends a prefix to the input so that GPT-2 operates on $z = [\mathrm{PREFIX}; x; y]$. $P_{\mathrm{idx}}$ denotes the sequence of prefix indices, and $|P_{\mathrm{idx}}|$ denotes the length of the prefix. In the prefix-tuning model, two affine layers with a tanh activation are added at the beginning to embed the prefix; the sequence of added modules is shown in Figure 2. The middle dimension is set to mid_dim = 512 by default.

Figure 2: Prefix modification on the GPT2 model structure


In training, we follow the recurrence relation $h_i = \mathrm{LM}_\phi(z_i, h_{<i})$, with the same training objective as fine-tuning, $\max_\phi \log p_\phi(z_i \mid h_{<i})$. Prefix-tuning initializes a trainable matrix $P_\theta$ to store the prefix parameters. Below is an annotated example of a training sample.

Figure 3: Prefix-tuning annotated example


• Fixed-LM Prompt-tuning: Fixed-LM prompt-tuning refers to searching for and designing prompt structures for the downstream task while keeping the language model parameters fixed. For lyrics generation, we experiment with different prompt structures and in-context learning. Our experiments fall within the realm of "tuning-free prompting" introduced in Liu et al. (2023).
1. Natural Language Prompt:
e.g. "This is a pop song lyrics by Justin Bieber:"
2. Random Selection: randomly select an answered example and prepend it before prompting the model to generate on the target,
e.g. "Rock, Breakfast Club: Every time somebody says to me ... I’ve gotta let my
feelings show. pop, OneRepublic: Hope when you take that jump, you don’t fear the
fall Hope when the water rises... The only way you can know is give it all you have.
pop, Justin Bieber:"
3. (DeepLyrics) Similarity Selection: select, by similarity, an answered example from the same genre and prepend it before prompting the model to generate on the target (see the sketch after this example):

i.e. "pop, Adele: If you’re not the one for me Then how come I can bring you to your
knees?... And if I’m not the one for you You’ve gotta stop holding me the way you
do. pop, OneRepublic: Hope when you take that jump, you don’t fear the fall Hope
when the water rises... The only way you can know is give it all you have. pop, Justin
Bieber:"

5 Experiments
5.1 Data

The dataset consists of Billboard Top 100 songs from 1958 to 2016. It records artist, genre, and full song lyrics. We preprocessed the full song lyrics by removing non-Latin characters and record the dataset metadata below. We split the whole dataset into train/validation/test sets of 21,211/4,714/2,357 songs and maintained the genre and artist distributions across all three splits. During the promptless fine-tuning experiments, we truncated lyrics at length 1024, the maximum length accepted by the GPT-2 tokenizer.

Dataset Metadata

                               Songs    Genre       Artist                     Lyrics
Number / Avg. length (chars)   23568    5580        19                         1595.9
Examples                       –        R&B, pop    Justin Bieber, ColdPlay    Every time somebody says to me ...

5.2 Evaluation method

We quantify the quality of lyrics generation using three evaluation criteria: (1) perplexity, (2) BertScore (F1, recall, precision), and (3) human evaluation. Human evaluation was performed by survey: six participants were each given an artist, a genre, and a list of lyrics generated by the different models, and were asked to rank the lyrics according to readability, coherence, and how well they represent the artist’s style and the music genre. The average rank of each model across the 6 participants was then computed; a lower score represents a higher-ranked model by human evaluators.
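Both automatic metrics can be computed with standard tooling; the sketch below uses the Huggingface model for perplexity and the bert-score package for BertScore, and is a hedged illustration rather than our exact evaluation script (the placeholder lists stand in for the real generated and gold lyrics):

import math
import torch
from bert_score import score as bert_score

@torch.no_grad()
def perplexity(model, tokenizer, text, device="cpu"):
    """Perplexity = exp(mean negative log-likelihood of the text under the model)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).to(device)
    out = model(**enc, labels=enc["input_ids"])  # Huggingface returns the mean cross entropy
    return math.exp(out.loss.item())

# BertScore (precision, recall, F1) between generated lyrics and gold lyrics.
generated_lyrics = ["Ooh, you got that right ..."]        # placeholder outputs
gold_lyrics = ["Every time somebody says to me ..."]      # placeholder references
P, R, F1 = bert_score(generated_lyrics, gold_lyrics, lang="en")
print(P.mean().item(), R.mean().item(), F1.mean().item())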

5.3 Experimental details

For the prefix-tuning setup, we set the prefix embedding layers to be a multilayer perceptron (MLP) with a middle dimension of 512, as in the original paper (Li and Liang (2021)), and we set the prefix sequence length to 10.
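A minimal sketch of this prefix parameterization is shown below. It follows the MLP reparameterization of Li and Liang (2021) with preseqlen = 10 and mid_dim = 512 for GPT2-medium’s dimensions, but it is a simplified illustration rather than the reference implementation (which additionally reshapes the output into per-layer key/value pairs and feeds them to the frozen model as past key values):

import torch
import torch.nn as nn

class PrefixEncoder(nn.Module):
    """Embed trainable prefix indices into activations for a frozen GPT-2."""
    def __init__(self, preseqlen=10, mid_dim=512, n_layer=24, n_embd=1024):
        super().__init__()
        self.prefix_ids = torch.arange(preseqlen)          # P_idx
        self.embedding = nn.Embedding(preseqlen, n_embd)   # trainable matrix P_theta
        # Two affine layers with a tanh activation, as described in Section 4.
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, mid_dim),
            nn.Tanh(),
            nn.Linear(mid_dim, n_layer * 2 * n_embd),      # keys and values for every layer
        )

    def forward(self, batch_size):
        ids = self.prefix_ids.unsqueeze(0).expand(batch_size, -1)
        return self.mlp(self.embedding(ids))  # (batch, preseqlen, n_layer * 2 * n_embd)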
For model tuning, in both fine-tuning and prefix-tuning, we update the GPT2-medium model with backpropagation using the cross-entropy loss for causal language modeling against the gold lyrics text, for 3 epochs (≈ 7k steps) with a batch size of 2, using the Adam optimizer without a learning rate scheduler. We save the model every 1000 steps, evaluate every 500 steps, and use early stopping. The loss function is the cross entropy for causal language modeling, i.e. the negative log likelihood:
$$\mathrm{Loss} = -\sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i})$$
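As a concrete illustration of the block-wise fine-tuning from Section 4 trained with this objective, the sketch below freezes all GPT2-medium parameters except one transformer block. The data loader, learning rate, and chosen block index are illustrative assumptions; the optimizer and loss match the description above:

import torch
from transformers import GPT2LMHeadModel

def finetune_one_block(train_loader, block_to_tune=5, lr=5e-5):
    """Fine-tune a single GPT2-medium transformer block; all other weights stay frozen."""
    model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
    for p in model.parameters():
        p.requires_grad = False
    for p in model.transformer.h[block_to_tune].parameters():
        p.requires_grad = True

    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)

    model.train()
    for batch in train_loader:  # batches of tokenized lyrics (assumed)
        input_ids = batch["input_ids"]
        # labels = input_ids yields the causal-LM cross entropy, -sum_i log p(x_i | x_<i).
        loss = model(input_ids=input_ids,
                     attention_mask=batch["attention_mask"],
                     labels=input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model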
For the Fixed-LM prompt-tuning experiments, we directly encoded the different prompts with the pretrained GPT-2 tokenizer and generated lyrics following the prompt text using the pre-trained GPT2-medium model provided by HuggingFace. We truncated each selected lyric to 150 characters to limit the total length of the prompt, and we set truncation to True with a max_length of 500 for the overall prompt text. The two baseline models for this section are the prefix-tuning model and the best-performing promptless fine-tuned model (middle block) from the previous experiments.
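A minimal sketch of this tuning-free generation step is shown below; the sampling parameters are illustrative choices rather than values reported here, and the prompt string stands in for one produced by a scheme from Section 4:

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
model.eval()

prompt = "pop, Adele: If you're not the one for me ... pop, Justin Bieber:"  # example prompt
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=500)

# Complete the prompt with the frozen (untuned) GPT2-medium model.
output_ids = model.generate(**inputs,
                            max_new_tokens=120,
                            do_sample=True,
                            top_p=0.95,
                            pad_token_id=tokenizer.eos_token_id)
# Strip the prompt tokens and keep only the newly generated lyrics.
lyrics = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
print(lyrics)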

5.4 Results

5.4.1 Finetuning and Prefix-tuning


For both fine-tuning and prefix-tuning with gold lyrics text, the training and evaluation losses consistently decrease, with no sign of overfitting. We observe that all three fine-tuning variants (first, middle, and last block training) reach similar training loss. However, the evaluation loss differs significantly: fine-tuning the first transformer block results in comparatively worse evaluation loss, while fine-tuning the middle versus the last transformer block of GPT-2 achieves similar final evaluation loss. Comparing rates of convergence, middle-block fine-tuning shows a slightly faster loss decrease during training than last-block fine-tuning.
The prefix-tuning result is slightly better than first-block fine-tuning. The training loss decreases at a similar rate, but the evaluation loss plateaus compared to the fine-tuning counterparts.

Figure 4: Training and evaluation loss for fine-tuning and prefix-tuning GPT2 on lyrics

Table of Experiment Results for Fine-tuning Strategies Exploration

Finetuning        Untuned (Baseline)   First Block   Middle Block   Last Block   Prefix Tuning
train loss        NA                   2.6360        2.5516         2.6334       2.6801
evaluation loss   102.1126             3.0557        2.4962         2.5595       2.6421
perplexity        2.2e+44              20.6204       12.0282        12.8544      14.0426
BertScore (F1)    NA                   0.781         0.785          0.784        0.780

5.4.2 Prompting and Evaluation


The three tuning-free prompting methods in general achieve comparable performance across all evaluation metrics. In particular, DeepLyrics (i.e., the similarity-selection prompt) achieves the best performance among the three prompting methods, as expected. It is worth noting that there is a large variance between the perplexity scores achieved by different models but a small variance in BertScore and human evaluation.

Table of Experiment Results for Prompting Schemes

Test Evaluations        Middle Block (B)   Prefix Tuning (B)   Natural Lang   Random Sel   Similar Sel (DeepLyrics)
Perplexity              9.04               14.26               9.95           14.38        15.03
BertScore (F1)          0.8167             0.8119              0.7924         0.8073       0.8083
BertScore (Recall)      0.8235             0.8179              0.7992         0.8171       0.8175
BertScore (Precision)   0.8102             0.8062              0.7864         0.7978       0.7998
Human Eval (rank)       2.5                3.2                 2.5            2.7          1.7

We also randomly sampled a task (Rock song, Annette) and print the lyrics generated by each model below as a reference:

Promptless Fine-tuning, Middle Block


Ooh, you got that right

You better really feel it
Love the way
I touch your body

Prefix Tuning
Love iiiiiht iideiiiush baby
I want to make all her happy
That’s my love Yo
wanna see me * Run from him

Fixed-LM Prompt-tuning, Natural Language Prompt


From the spirituals,
nostalgia meets Love
through some great musicianship
and not much else.

Fixed-LM Prompt-tuning, Random Sel Prompt


He was no saint
but he did help
No Reason For Hatred
had it stopped

DeepLyrics: Fixed-LM Prompt-tuning, Similar Sel Prompt


He was no saint
but he did help
us had Faith
in Jesus Christ God

6 Analysis
6.1 Fine-tuning and Prefix-tuning

The results section shows that fine-tuning the middle or the last transformer block gives the best performance, compared to first-block fine-tuning or prefix-tuning. The poor performance of tuning the first block is likely due to the long "distance" between the first block and the final generations: an adjustment made in the first block must propagate through all later layers before it affects the output, which makes learning less effective and higher-variance. Fine-tuning the middle or last block is more effective for the same reason.
The underperformance of prefix-tuning is unexpected, however, as its effectiveness has been demonstrated in previous work on various tasks. One possible reason for the failure is the trade-off between model performance and the number of tasks the method can handle: prefix-tuning provides the flexibility to handle multiple tasks by keeping the backbone model task-agnostic and preserving task-specific information in a set of linear layers of customizable size. It adds further flexibility through hyperparameters such as preseqlen and the prefix embedding dimension. When handling one specific task, fine-tuning a particular block of the language model directly is more efficient; prefix-tuning may require a much larger prefix network at the beginning to capture information equivalent to a transformer block, which contains attention and historical information within itself.

6.2 Fixed-LM Prompt-tuning

According to the results section, the non-human evaluation scores of the fine-tuned models (the two baselines) and the three tuning-free prompting models show no significant difference, but the human evaluation indicates that DeepLyrics performs far better than the others. Comparing the qualitative results from the five models, we notice that the two baselines are better at capturing the genre and the artist’s style by using more style-specific words. The models with the natural language prompt and the random selection prompt are weaker at imitating the artist’s work but are more natural in their semantic meaning, which is an expected advantage of pretrained natural language models. Finally, DeepLyrics in general produces lyrics that are both more in-style and more natural. This implies that two samples of similarly-styled lyrics are enough for GPT-2 to capture the proper style. DeepLyrics also has an advantage over the fine-tuned baselines in naturalness: since lyrics are usually "broken pieces of language" compared to normal sentences, this advantage likely comes from the baseline models losing the flow of natural language when they are fine-tuned on too many lyrics with unrelated genres/styles.

7 Conclusion
In this paper, we identified a neglected area of research in lyrics generation and explored simple but efficient lyrics generation methods that require little or no parameter training to achieve comparable, and sometimes better, performance than training-heavy methods. We proposed, designed, and evaluated a method, DeepLyrics, that achieves superior performance in lyrics generation while requiring no training or fine-tuning. We designed the Similarity Selection prompting method, which takes in two pieces of lyrics of the same genre and prompts the model to generate new lyrics in that genre in the style of a given artist. Our work shows the practicability of greatly simplifying many existing lyrics generation methods by saving training and fine-tuning work with an effective prompting (in-context learning) technique.
Inevitably, one limitation of our work is that we compared our method against a limited number of baselines and had it evaluated by a limited number of human evaluators. Extending the method to simplify models with more complicated fine-tuning processes is an interesting future direction, for example, designing tuning-free pipelines that achieve performance similar to TRBLLmaker (Ventura and Toker (2022)). Another direction is to evaluate our proposed method with a different backbone language model, for example, T5 (Roberts et al. (2020)).

References
Yin-Fu Huang and Kai-Cheng You. 2021. Automated generation of chinese lyrics based on melody
emotions. IEEE Access, 9:98060–98071.
Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation.
CoRR, abs/2101.00190.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023.
Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language
processing. ACM Comput. Surv., 55(9).
Xu Lu, Jie Wang, Bojin Zhuang, Shaojun Wang, and Jing Xiao. 2019. A syllable-structured,
contextually-based conditionally generation of chinese lyrics. In PRICAI 2019: Trends in Artificial
Intelligence: 16th Pacific Rim International Conference on Artificial Intelligence, Cuvu, Yanuca
Island, Fiji, August 26-30, 2019, Proceedings, Part III 16, pages 257–265. Springer.
Xichu Ma, Ye Wang, Min-Yen Kan, and Wee Sun Lee. 2021. Ai-lyricist: Generating music and vocab-
ulary constrained lyrics. In Proceedings of the 29th ACM International Conference on Multimedia,
MM ’21, page 1002–1011, New York, NY, USA. Association for Computing Machinery.
Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into
the parameters of a language model? In Empirical Methods in Natural Language Processing
(EMNLP).
Mor Ventura and Michael Toker. 2022. Trbllmaker–transformer reads between lyrics lines maker.
arXiv preprint arXiv:2212.04917.
Rongsheng Zhang, Xiaoxi Mao, Le Li, Lin Jiang, Lin Chen, Zhiwei Hu, Yadong Xi, Changjie Fan,
and Minlie Huang. 2022. Youling: an ai-assisted lyrics creation system.

A Appendix (optional)
