5.1.1 Background Knowledge.
The background knowledge is the information used to choose the songs from the catalogue in order to construct a playlist that matches the target characteristics. The background knowledge should be represented in some machine readable form, in such a way that it can be used by algorithms. In their 2014 survey on APG-I, Bonnin and Jannach [
39] identify several categories of background knowledge. In this section, we review those categories, while adding fresh details from the more recent work that we have surveyed. We organize APG-I algorithms based on their background knowledge in
Table 1. We highlight differences between the research surveyed by Bonnin and Jannach [
39] and the research we exclusively survey by dividing algorithms published up to 2013 from those published from 2014 onward.
Content-Based Data. Researchers in the field of MIR have been concerned for a long time with extracting information, or
features, from the music audio signal, a research topic which is often referred to as content-based MIR [
249].
These content-based features can be high-level, such as the emotions evoked by a musical piece [
324,
366], its genre [
76], timbre [
279], chords [
101], pitch [
373], and
beats per minute (BPM) [
307], or low-level, such as representations of the audio signal, for example
mel-frequency cepstral coefficients (MFCCs) [
371], or such as learned embedding representations, as extracted, for example, by
convolutional neural networks (CNNs) [
162,
273]. Often, low-level features are used for extracting high-level features, e.g. see [
72,
287,
359]. We refer the reader to [
249] for a survey of content-based MIR.
Some APG-I algorithms rely on high-level content-based features. For example, Griffiths et al. extract the emotion evoked by songs, and construct a playlist that matches the user's current emotion, which is inferred from several sensors [
135]. The same approach is taken in [
136,
324]. The work in [
40] is similar to the above, except that users manually input their current emotion, e.g., melancholy. Liebman et al. [
221,
222] propose an RL algorithm for APG-I in which songs are represented as vectors containing timbre, pitch, BPM, and statistics thereof.
Some other content-based algorithms rely on low-level content-based features. For example, Pohle et al. compute the similarity between songs in the catalogue based on an MFCC representation, to create a playlist in which consecutive songs are as similar as possible, so as to guarantee a coherent listening experience [
285]. A similar approach is followed by [
21,
111,
226,
228]. In [
71,
162], instead, song representations extracted by a CNN are used as input to a
recurrent neural network (RNN), so as to predict the next song in the playlist.
Metadata and Expert Annotations. Following [
39], we use the word metadata to refer to any information describing the playlist or its songs that is not derived from the audio signal. An example of playlist metadata is the playlist title or caption, when assigned by an expert. (When assigned by an end-user, such titles might better be classified as examples of social web data, like user tags; see below.) Examples of song metadata are the year of release, the record label, the lyrics, and the genre,
among others [
118]. Usually, experts manually annotate songs with their metadata. A notable example is the Music Genome Project [
50], a database of songs and their metadata created and maintained by experts employed by the Pandora music streaming service.
Different types of metadata are sometimes represented in a single structure, for example in a knowledge graph, where nodes represent both songs and heterogeneous types of metadata [
103], and edges express the relationships between songs and metadata, and between different pieces of metadata. For example, song names, album names, and years can be represented as nodes; a “belongs to” edge can link a specific song name to its album name, and a “released in” edge can link the album name to its year of release [
118]. Knowledge graphs are also used in [
87,
165,
237,
251,
252,
333]. Edges in a knowledge graph might also represent relationships between classes, subclasses, and instances, e.g. between genres and subgenres. Indeed, ontologies and taxonomies offer an alternative to knowledge graphs, placing the focus on classes, subclasses, instances, and their properties. For example, Ben-Elazar et al. use a taxonomy of musical genres [
32].
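To make this structure concrete, the following sketch builds a toy knowledge graph of this kind in Python, using the networkx library (our choice for illustration; the surveyed systems use various graph and triple-store representations, and the song and album names are merely examples).

```python
# A minimal sketch of a music knowledge graph, using networkx
# (an illustrative choice of library, not one taken from the cited systems).
import networkx as nx

kg = nx.MultiDiGraph()

# Nodes for a song, its album, a year, and a genre (illustrative names).
kg.add_node("song:Wish You Were Here", kind="song")
kg.add_node("album:Wish You Were Here", kind="album")
kg.add_node("year:1975", kind="year")
kg.add_node("genre:progressive rock", kind="genre")

# Typed edges express the relationships described in the text.
kg.add_edge("song:Wish You Were Here", "album:Wish You Were Here", relation="belongs to")
kg.add_edge("album:Wish You Were Here", "year:1975", relation="released in")
kg.add_edge("song:Wish You Were Here", "genre:progressive rock", relation="has genre")

# Traversing typed edges retrieves the metadata attached to a song.
for _, target, data in kg.out_edges("song:Wish You Were Here", data=True):
    print(data["relation"], "->", target)
```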
One way to use metadata in APG-I is by allowing users to specify constraints on the metadata and then generate a playlist that satisfies those constraints. For example, in [
277] users can input constraints on the song genre, release year, and length (in seconds), and an optimization algorithm is used to generate a playlist which satisfies these constraints. The same strategy is followed in [
5,
15,
55,
155,
263,
275,
276].
Social Web Data. Social web data is data shared online by Internet users. Following [
39], we list three types of social web data:
User tags. A user tag
is a free text annotation that a user applies to a musical item, e.g. a song or an artist [
201]. User tags can be very rich and varied, as they can cover a wide range of different topics, such as musical genres (e.g., “rock”), years (e.g., “90s”), countries (e.g., “Ireland”), activities (e.g., “chill”), seasons (e.g., “summer”), among others.
Ratings. A rating is a piece of explicit user feedback for a musical item, often expressed on a 1-to-5 rating scale or as a “like” or “dislike” judgment. The usage of ratings as background knowledge is becoming less and less common, as ratings are difficult to gather from the majority of the user base [
305]. In particular, we do not encounter any work that uses ratings as background knowledge in the literature from 2014 to the date of writing.
The social graph. A social graph connects people by different relationships, such as “friend” or “spouse,” and to musical items, e.g. person
\(x\) “likes” artist
\(y\), in social networks such as Facebook.
Social graphs are sometimes used as background knowledge, under the assumption that people who are closely connected in the graph have similar musical tastes [
130]. A playlist for a user can be constructed, for example, by including music that is liked by the user's friends, giving an automated version of word-of-mouth recommendation.
Usage Data. In streaming services, usage data record interactions between users and musical items.
Listening logs. Listening logs record the songs a user listens to, including those that they skip, those that they listen to in full, and those that they download. As such, listening logs provide indications about user tastes. For example, it is common for algorithms to interpret a skip as an indication of a song that the user does not like [
32,
66,
100,
157,
192,
265], although, of course, it could just signal that the song does not suit the user's context, such as their mood or activity. Listening logs can be used to compute embedding representations. One strategy is to rely on the word2vec algorithm [
240], by treating songs as words, and listening logs as phrases, by analogy with natural language.
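As an illustration of this songs-as-words analogy, the following minimal Python sketch trains word2vec embeddings on toy listening logs using the gensim library (the song IDs and hyper-parameter values are arbitrary placeholders, not taken from any surveyed system).

```python
# A minimal sketch of the songs-as-words analogy with gensim's word2vec.
# Listening "sessions" here are toy song-ID sequences; real logs would be used.
from gensim.models import Word2Vec

listening_logs = [
    ["song_12", "song_47", "song_88", "song_47"],
    ["song_47", "song_88", "song_13"],
    ["song_12", "song_13", "song_88"],
]

# Treat each log as a "sentence" whose "words" are song IDs.
model = Word2Vec(sentences=listening_logs, vector_size=32, window=3,
                 min_count=1, sg=1, epochs=50, seed=0)

embedding = model.wv["song_88"]                        # learned song embedding
neighbours = model.wv.most_similar("song_88", topn=2)  # similar songs
print(neighbours)
```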
Popularity. Usage data gives a clear indication of the popularity of musical items, e.g., of songs. We can, for example, simply count the occurrences of each song in the listening logs of all the users. Popularity is sometimes used as background knowledge for building simple but effective heuristics for APG-I. For example, Bonnin and Jannach show that it is possible to build high-quality playlists by simply including the most popular songs made by artists similar to the artists that the user likes [
38]. Moreover, popularity can be employed as a fallback strategy for users who are new to the streaming service and whose musical tastes are therefore still unknown.
Manually constructed playlists. Users frequently create playlists for convenience [
112] and self-expression [
358]. These user playlists can be used as background knowledge for creating new playlists. For example, McFee and Lanckriet learn song-to-song transition probabilities based on a database of user playlists, and they use these probabilities to generate new playlists [
238]. Manually constructed playlists can also be used to compute embedding representations; see our discussion of listening logs, where we mentioned the word2vec algorithm as a possible way of computing these embeddings.
Discussion. The categories of background knowledge we review above differ in their availability, consistency, and abundance:
Availability. The availability of some background knowledge may not be guaranteed for all the songs in the catalogue. For example, recently added songs may have no user tags, or may occur only a few times, or never, in listening logs or in manually constructed playlists. The same goes for “long-tail” songs [
195], i.e., those songs which are rarely listened to, that constitute the large majority of the catalogue [
51]. The unavailability of background knowledge for such portions of the catalogue leads to biases against new songs (the cold-start problem [
89]) and against long-tail songs (popularity bias [
169]). In fact, algorithms cannot evaluate a song for inclusion in a playlist if there is no background knowledge to match the song to the target characteristics of the playlist. Content-based data is the only category of background knowledge which can be available for every song in the catalogue, as content-based data is extracted from the song audio itself. As such, content-based data allows for the construction of “fair” algorithms, in the sense that they can select any song in the catalogue for inclusion in the playlist.
Consistency. Some background knowledge may be noisier than other background knowledge. For example, the level of noise in metadata is often low because metadata annotations are typically made manually by domain experts. Nevertheless, inconsistencies in metadata may exist, especially because some metadata is not objective. For example, Flexer et al. find that different annotators may disagree on the musical genre of songs [
110]. Similarly, the level of noise in content-based data also tends to be low because content-based data is extracted by automatic procedures. Nevertheless, inconsistencies in content-based data may arise because those automatic procedures are never 100% accurate. For example, predicting the tempo of a song is challenging and the results can be inaccurate [
58,
278].
Usage and social web data are typically much noisier than content-based and metadata, as they record the unpredictable behavior of Internet users. For example, Lamere analyzes a dataset of user tags and finds misspellings, spelling variants and synonyms among user tags, as well as user tags with little to no relevance to music, such as the user tag “random” [
201]. Similarly, Hagen finds that manually constructed playlists often do not have clearly defined target characteristics but may be used as a randomly arranged container of the user's favorite music [
140].
Abundance. Usage and social web data are by far the most abundant categories of background knowledge, as they are generated in large quantities by billions of Internet users every day. Content-based data and metadata are less abundant, as their extraction depends, respectively, on computationally demanding procedures [
308] and on the expensive annotation work of domain experts.
Using one category of background knowledge rather than another influences the quality of the generated playlists. For example, there is some evidence that algorithms relying solely on content-based data produce playlists of low quality, especially when compared with other types of systems, e.g. those which rely on usage data [
321] or metadata [
338]. However, it is wrong to consider one category of background knowledge to be superior to another. A more correct view is to consider them as complementary: while content-based data can help to create coherent playlists in terms of acoustic properties, usage data gives information about the musical tastes of the user, allowing the creation of personalized playlists. Hence, it is common to combine different categories of background knowledge. There are several papers that show how the quality of generated playlists is enhanced by making effective use of more than one category of background knowledge, e.g. [
32,
171,
237]. One way to readily include different sources of background knowledge is to organize them in a unifying structure, for example a knowledge graph (described earlier) [
103]. In addition to representing songs and their metadata, knowledge graphs can represent listening logs: nodes for users would link to nodes for the songs they listened to [
261].
If we refer back to
Table 1, we can see the differences between the research up to 2013 (the
first period), which was already surveyed by Bonnin and Jannach [
39], with respect to the research from 2014 onward (the
second period). The majority of algorithms from the
first period rely on content-based data and metadata for their background knowledge, especially because catalogues before the streaming era were small enough to allow the manual annotation of songs and the extraction of content-based data. In the
second period, when streaming became the prevalent type of music access [
160], the emphasis shifted to usage data, mainly due to the availability of that type of background knowledge, easily recorded by the music streaming service.
5.1.3 Algorithm Type.
We review algorithms for APG-I based on their type. In their 2014 survey on APG-I, Bonnin and Jannach [
39] identify several types of algorithms. In this section, we review those types, while adding two types that emerge from the more recent work: sequence modeling and RL. We organize APG-I algorithms based on their type in
Table 3. We highlight differences between the research surveyed by Bonnin and Jannach [
39] and the more recent research by dividing algorithms published up to 2013 from those published from 2014 onward.
Similarity. Similarity algorithms use song similarity to construct playlists. Song similarity can be derived from different kinds of background knowledge, such as content-based data [
14,
21,
46,
161,
265,
285], metadata [
274,
283], tags [
284,
304], manually constructed playlists [
38,
233,
292], listening logs [
346], ratings [
56,
315], or any combination of the above [
174]. For example, both Pohle et al. and Cai et al. compute the similarity between songs based on an MFCC representation [
21,
46,
285]; Pauws and Eggen and Polignano et al. count the values of metadata features that two songs have in common [
274,
286]; and Bonnin and Jannach consider how often two songs co-occur in manually constructed playlists [
38]. More recently, we have seen increasing reliance on song embedding representations [
85,
199,
346], learned using the word2vec algorithm [
240]. These embeddings are obtained by treating songs as words and treating manually constructed playlists as phrases, by analogy with natural language. Song embedding representations can also be given as input to a clustering algorithm, such as
\(k\)-means, to generate playlists of similar songs by sampling from the clusters [
116], as sketched below.
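The following sketch illustrates this clustering strategy with scikit-learn's k-means: random vectors stand in for learned song embeddings, and a playlist is sampled from one cluster (all names and sizes are illustrative).

```python
# A sketch of playlist generation by clustering song embeddings and
# sampling songs from a single cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
song_ids = [f"song_{i}" for i in range(1000)]
embeddings = rng.normal(size=(1000, 32))   # stand-in for learned embeddings

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(embeddings)

def playlist_from_cluster(cluster_id: int, length: int = 10) -> list:
    """Sample a playlist of similar songs from one cluster."""
    members = [s for s, c in zip(song_ids, kmeans.labels_) if c == cluster_id]
    return list(rng.choice(members, size=min(length, len(members)), replace=False))

print(playlist_from_cluster(3))
```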
Integrating multiple sources of background knowledge is beneficial when computing similarity. For example, in the context of judging the similarity between a song that could be added to a playlist and the songs already in the playlist, research shows that users consider content-based features, such as energy and tempo, as well as metadata, such as musical styles and lyrical content [
22,
323].
The perception of song similarity is subjective. Even expert listeners are found to disagree when asked to rate the similarity between songs [
109]. For example, some people weight content-based data more heavily than metadata when judging similarity, while others do the opposite [
205]. Some work in APG-I integrates personalization in the similarity computation. For example, Sotiropoulos et al. allow users to set different weights for different features when assessing similarity, e.g., a user might weight content-based data more than metadata or vice-versa [
316]. And, in [
302], Sandvold et al. propose a system where users can assign tags to songs, drawing from a vocabulary of tags. Then, the system learns how to tag new songs, so that the predicted tags reflect the user's tagging style. Finally, playlists can be created using a similarity measure based on both kinds of tags (those that are assigned and those that are predicted), which means that similarity is personalized based on the user's tagging style.
Once similarities are computed, songs can be chosen for their similarity with the seed songs [
14,
38,
226], or for their similarity with other songs liked by the user [
158]. Another possibility is to create playlists so as to maximize the similarity between songs [
194].
Playlists generated with similarity-based algorithms are expected to be coherent. However, coherence is not the only quality criterion for playlists; other criteria exist, such as diversity [
290]. Also, a risk with optimizing for similarity is that the playlist may become monotonous [
205], e.g., containing songs from the same album. See
Section 6.1.2 for a discussion of coherence and diversity.
One use-case for similarity algorithms is that of playlist sequencing, the special case of APG-I where the target characteristics are given as a set of songs, and algorithms are tasked simply with arranging those songs, without any further song selection, in such a way that the music flows coherently from one song to the next [
36]. For example, Bittner et al. and Cliff both use a similarity algorithm working on content-based similarity [
36,
74]. Their approach compares the distance between songs based on several features and then arranges the songs in such a way that those distances are minimized. Sarroff and Casey [
303] instead build a machine learning predictor, working on content-based features, that distinguishes suitable from unsuitable song-to-song transitions. Finally, Furini and his co-authors move toward personalized sequencing [
115,
117]. They analyze song-to-song transitions in a user's playlists so that they can sequence playlists in a personalized way.
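A minimal version of the distance-minimizing arrangement idea is the greedy nearest-neighbour ordering sketched below, assuming each song is described by a content-based feature vector (the cited systems use richer features and more elaborate orderings).

```python
# A greedy nearest-neighbour sketch of playlist sequencing: arrange a fixed
# set of songs so that consecutive songs are close in feature space.
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4))   # stand-in content-based feature vectors

def sequence(features: np.ndarray, start: int = 0) -> list:
    """Order songs so that each next song is nearest to the previous one."""
    remaining = set(range(len(features))) - {start}
    order = [start]
    while remaining:
        last = features[order[-1]]
        nearest = min(remaining, key=lambda i: np.linalg.norm(features[i] - last))
        order.append(nearest)
        remaining.remove(nearest)
    return order

print(sequence(features))
```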
CF. CF is a common approach in the RSs literature. It is based on the heuristic that if the active user agreed with certain users in the past, then these users are similar to the active user, and items that these users liked should be relevant to, and can be recommended to, the active user [
296]. Hence, the use of CF algorithms is facilitated by the existence of usage data, recording the preferences of other users. Specifically, they typically assume a sparse user–item matrix that may record user ratings for items or user interactions with items. CF is then a family of methods for predicting the rating a user would assign to an item or the relevance of an item to a user, based on the data that is given in the rest of the matrix. It is the fact that this data may come from other users that makes these approaches “collaborative.”
The most common way of employing CF for APG-I is by applying the playlists-as-users analogy [
38,
298], in which a user is a playlist and an item is a song: instead of a user–item matrix that records each interaction between a user and an item, we have a playlist–song matrix that records information about the presence of each song in a playlist. Another common analogy is the titles-as-users analogy [
298,
351,
369], in which a user is a playlist title and items are again the songs. Which analogy to employ depends on the target characteristics. The playlists-as-users analogy fits the case in which the target characteristics are given as seed songs. The titles-as-users analogy fits the case in which the target characteristics are given as a playlist title, i.e., a special case of free-form text. For simplicity of exposition, most of the rest of this section uses only the playlists-as-users analogy.
In the playlists-as-users analogy, the playlists–songs matrix can be unary [
172], i.e., recording a
\(1\) if a playlist contains the song. Or, it can be non-unary; e.g., it may assign a value to a song according to its position in the playlist, giving more weight to the later songs [
12,
331]. Given a playlists–songs matrix, CF algorithms that would ordinarily predict the relevance of an item to a user can be re-purposed to predict the suitability of a song for a playlist.
One option is to employ nearest neighbors CF algorithms [
255]. For example, the system in [
341] computes the similarity between the active playlist and the other playlists in the dataset as the cosine similarity of their row-vector representations in the playlists–songs matrix. Then, for each song in the catalogue, it computes a score by summing the similarities of the active playlist to the
\(k\) most similar playlists in the dataset that contain the song, where
\(k\) is a positive integer hyper-parameter. Finally, the highest-scoring song is selected to be added to the playlist. The algorithm above corresponds to the user-based
\(k\)-nearest neighbors CF algorithm in the RSs literature [
198]. The user-based
\(k\)-nearest neighbors algorithm is also employed in [
166,
172].
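The user-based scoring scheme described above can be sketched in a few lines of numpy; the toy playlists-songs matrix and the choice of \(k\) below are illustrative, not taken from [341].

```python
# A sketch of user-based k-NN scoring on a toy binary playlists-songs matrix
# (rows: playlists, columns: songs).
import numpy as np

R = np.array([[1, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [1, 1, 0, 1, 0]], dtype=float)
active = np.array([1, 1, 0, 0, 0], dtype=float)   # the playlist being extended

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

sims = np.array([cosine(active, row) for row in R])
k = 2

scores = np.zeros(R.shape[1])
for j in range(R.shape[1]):
    if active[j]:                                  # skip songs already present
        continue
    holders = np.where(R[:, j] == 1)[0]            # playlists containing song j
    top = holders[np.argsort(sims[holders])[::-1][:k]]  # k most similar of them
    scores[j] = sims[top].sum()

next_song = int(np.argmax(scores))
print(next_song)
```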
Another option is to use the item-based
\(k\)-nearest neighbors algorithm [
91]. For example, the system in [
340,
342] computes the similarity between songs as the cosine similarity of their column-vector representations in the playlists–songs matrix. Then, for each song in the catalogue, it computes a score by summing the similarity of the last song in the playlist to the
\(k\) most similar songs in the catalogue.
Some work uses similarity functions other than cosine; e.g., Jaccard [
331]. Additionally, the similarity functions can be augmented with heuristics, e.g., by giving higher weight to unpopular items [
189].
But there exist CF algorithms, other than nearest neighbors, for constructing playlists. For example, Aizenberg et al. [
3] use a
matrix factorization (MF)-based approach, in which the playlists–songs matrix is factorized into two low-dimensional matrices, containing the playlist embeddings of every playlist in the dataset and the song embeddings of every song in the catalogue. Then, for each song in the catalogue, the system computes a score by taking the dot-product of the playlist and song embeddings. Finally, the highest-scoring song is selected to be added to the playlist.
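The following is a compact sketch of this MF recipe, with embeddings learned by stochastic gradient descent on a toy binary playlists-songs matrix; it is a generic formulation, not the specific model of Aizenberg et al., and it naively treats absent songs as negatives.

```python
# A compact MF sketch: factorize a playlists-songs matrix via SGD, then
# score catalogue songs for one playlist by a dot product.
import numpy as np

R = np.array([[1, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0]], dtype=float)
n_playlists, n_songs, d = R.shape[0], R.shape[1], 8

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_playlists, d))   # playlist embeddings
Q = rng.normal(scale=0.1, size=(n_songs, d))       # song embeddings

lr, reg = 0.05, 0.01
for _ in range(500):
    for u in range(n_playlists):
        for i in range(n_songs):
            err = R[u, i] - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])

# Score every song for playlist 0, mask songs already present, pick the best.
scores = Q @ P[0]
scores[R[0] == 1] = -np.inf
next_song = int(np.argmax(scores))
print(next_song)
```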
In some cases, MF is the first step of a two-step APG-I algorithm, especially in the case of APC, e.g. [
298,
351]. MF is used to learn a model that can predict the relevance of every song to a playlist. These relevances are then used to filter the catalogue down to a more manageable (but still large) set of candidate songs to add to the playlist. These remaining songs are associated with features such as their popularity [
351] or the degrees of homogeneity and diversity they would bring to the playlist [
298]. A second model learns to re-rank the remaining candidates based on these features. Re-ranking algorithms commonly used in APG-I include gradient boosted trees [
351], for example, XGBoost [
62].
The years since the publication of the Bonnin and Jannach survey have also seen the rise of deep learning. Deep learning is a form of machine learning that is based on the use of many-layered artificial neural networks. It has led to advances in different application fields of AI, such as natural language modeling [
42] and object classification in images [
200]. Given those promising results, deep learning has recently been applied to the task of APG-I. We will discuss its use in sequence modeling in a later subsection, but here we can see the effect it has had on CF-style approaches to APG-I.
One example is to be found in the work of Zhao et al. [
369]. They take a similar approach to Aizenberg et al. above, i.e., using an MF algorithm to factorize the playlists–songs matrix. But then, where Aizenberg et al. compute scores between playlist and song embeddings, Zhao et al. provide for extra learning: the playlist and song embeddings are fed into a feed-forward neural network, which outputs a score indicating the fit of the song for the playlist.
A well-known family of deep learning models is that of autoencoders. In the simplest case, an autoencoder consists of two components, an encoder and a decoder, both of which are usually feed-forward neural networks. A more sophisticated variant is the adversarial autoencoder, in which the hidden vector distribution is regularized so as to match a Gaussian prior distribution [
234]. Autoencoders of both kinds are used for APG-I by setting the input to be a binary vector indicating which songs are in the playlist [
335,
361]. This, in effect, is the playlists-as-users analogy, and so we can regard these as CF approaches to APG-I. The output is a vector that approximates the input vector and that can be used to select further songs for addition to the playlist. Vagliano et al. [
335] successfully integrate additional background knowledge into an adversarial autoencoder for APG-I, by concatenating embeddings of textual data, such as the playlist title, to the hidden representation.
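A minimal (non-adversarial) autoencoder of this kind can be sketched in PyTorch as follows; the layer sizes, the single toy playlist, and the training loop are illustrative only.

```python
# A minimal autoencoder over binary playlist vectors: a sketch of the general
# idea, not a specific published architecture.
import torch
import torch.nn as nn

n_songs, hidden = 5000, 64

class PlaylistAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_songs, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_songs)  # logits over the catalogue

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = PlaylistAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

x = torch.zeros(1, n_songs)      # a playlist as a binary song-indicator vector
x[0, [10, 42, 123]] = 1.0

for _ in range(100):             # overfit one toy example, for illustration
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)
    loss.backward()
    optimizer.step()

# At inference, high-scoring songs outside the playlist are candidates.
scores = model(x).detach().squeeze()
scores[x.squeeze() == 1] = float("-inf")
candidates = torch.topk(scores, k=10).indices
```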
The playlists-as-users analogy has three main limitations: (1) the constructed playlists are not personalized, (2) performance depends on the number of songs in the playlist, and (3) performance may be poor on songs that occur in very few playlists. Concerning limitation (1), the songs are chosen so that they are tailored to those already in the playlist, but not to the listener's musical tastes. Some work tweaks CF approaches so that they become personalized. One approach, for example, is to modify the active playlist by adding songs from other playlists created by the same listener [
3,
170,
185]. Concerning limitation (2), CF algorithms are affected by the cold-start problem, which manifests with small or newly created playlists [
59]. In fact, the accuracy of CF algorithms is positively correlated with the number of seed songs, i.e., the algorithm will generate a better playlist when provided with more seed songs [
364]. And, CF algorithms are not able to generate a playlist if no seed songs are provided. In such cases, it is necessary to resort to a fall-back strategy, for example by working with other target characteristics or by employing simple heuristics based on song popularity. The solution to limitation (3) is usually some form of hybrid. For example, Vall et al. combine song embedding representations extracted from content-based data, metadata, and manually curated playlists by means of a deep feed-forward neural network, so as to model the probability that a specific song is a good fit for a specific playlist [
337,
339].
The titles-as-users analogy shares the same three limitations. Some authors propose a solution to alleviate the cold-start problem when using the titles-as-users analogy, which consists of clustering similar titles together, so as to increase the number of songs for each title [
369]. One way to cluster titles is to rely on simple text pre-processing pipelines, which transform the text to a common format, for example by removing special characters [
362,
363,
369]. Another way to cluster titles (although here employed in a recurrent neural network) is by employing text-embedding procedures, such as FastText [
175], and by running a clustering algorithm on those embeddings [
245]. Yang et al. [
361], on the other hand, treat playlist titles as sequences of characters and use a CNN to process the characters, obtaining an embedding vector that can be used to predict the songs in the playlist, given the title. They combine the CNN with an autoencoder to obtain a system that follows both a titles-as-users approach and a playlists-as-users approach.
One last drawback of CF approaches, however, is that they are not designed for the specific challenges of APG-I, and aspects such as song coherence have to be addressed separately [
39].
Frequent Pattern Mining. Frequent pattern mining approaches work by mining patterns from a dataset of manually constructed playlists. A pattern can be expressed in the form \(S_{1}\Rightarrow S_{2}\), where \(S_{1}\) and \(S_{2}\) are two sets of songs. The pattern signifies that the songs in \(S_{2}\) are likely to be found after the songs in \(S_{1}\). In sequential pattern mining, \(S_{1}\) and \(S_{2}\) would be sequences of songs, not sets of songs.
Patterns can be used to generate playlists. For example, consider three songs:
\(s_{1}\),
\(s_{2}\) and
\(s_{3}\); given a playlist with
\(s_{1}\) and
\(s_{2}\) as seed songs, if we have extracted the pattern
\(S_{1}\Rightarrow S_{2}\), where
\(S_{1}\) is the set
\(\{s_{1},s_{2}\}\), and
\(S_{2}\) is the set
\(\{s_{3}\}\), then a candidate continuation for the playlist is
\(s_{3}\). Pattern mining is not often applied to APG-I. However, Bonnin and Jannach show that patterns and sequential patterns can, in fact, achieve comparable performance to other types of algorithms that were in use in 2014 [
38,
39]. Chen et al. furnish one example of the use of sequential patterns [
61]. They propose to use a simple bigram model that extracts
\(S_{1}\Rightarrow S_{2}\) rules, where \(S_{1}\) and \(S_{2}\) are here single songs, by counting how frequently \(S_{2}\) follows \(S_{1}\) in the dataset, corrected with Witten-Bell discounting [
177].
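The counting step of such a bigram model is easy to sketch; the code below computes maximum-likelihood transition probabilities from toy playlists and omits the Witten-Bell discounting that Chen et al. apply.

```python
# A sketch of a bigram transition model over playlists (maximum-likelihood
# counts only; discounting/smoothing is omitted for brevity).
from collections import Counter, defaultdict

playlists = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["b", "c", "d"],
]

counts = defaultdict(Counter)
for pl in playlists:
    for prev, nxt in zip(pl, pl[1:]):
        counts[prev][nxt] += 1

def next_song_probs(prev: str) -> dict:
    """P(next song | previous song), estimated from the counts."""
    total = sum(counts[prev].values())
    return {s: c / total for s, c in counts[prev].items()}

print(next_song_probs("b"))   # {'c': 0.666..., 'd': 0.333...}
```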
One problem with this approach is that patterns based on songs will be very rare, and sequential patterns even rarer. For example, even in a very large dataset of playlists, the number of times that Alice Coltrane's
Wisdom Eye follows
Post Requisite by Flying Lotus will be small. The patterns will be associated with very low confidence values, and many reasonable patterns will not be seen at all. The solution is to mine patterns based not on songs but on representations of songs, e.g., hand-crafted features or latent features. In [
144], for example, the PrefixSpan algorithm [
142] is used to mine sequential patterns on song latent embeddings. The song latent embeddings are obtained by applying Latent Dirichlet Allocation to the songs’ tags [
37].
Statistical Models. Statistical models work by modeling the probability of adding a song to the playlist.
One class of statistical models is the Markov models, i.e., those that model the probability of adding a song to the playlist based on the current “state.” In the APG-I research cited below, the state is defined as the last song in the playlist. Markov models are proposed in [
61,
87,
237,
246,
333,
345,
367].
The core component of Markov models is the estimation of the song-to-song transition probabilities. McFee and Lanckriet [
238] offer a comparison of a number of Markov models, which differ on how the probabilities are estimated: some of them count song co-occurrences in manually constructed playlists, while others rely on content-based or metadata similarity, and in some cases latent representations. This recalls the problem with frequent pattern mining (above): song co-occurrences will typically be low frequency; using song representations or Hidden Markov Models (e.g., [
367]) can overcome this.
Markov models may lead to the construction of problematic playlists [
342]. For example, adding a song based only on the previous one may lead to a lack of coherence throughout the playlist.
Some other statistical models are not Markov models, and model the probability of adding a song to the playlist based on the other songs in the playlist. For example, Hu and Ogihara [
158] consider a playlist as a time series and use an autoregressive integrated moving average model [
146] to predict the next song.
The most sophisticated statistical models are also personalized, i.e., they model the probability of adding a song to the playlist based on the other songs in the playlist and based on the user's musical tastes. For example, Ben-Elazar et al. [
32] propose a Bayesian classification model whose parameters are estimated via variational inference based on the playlist songs and on the other songs liked by the user. Two similar models are proposed in [
330,
370].
Other notable statistical models are proposed for the scenario in which the target characteristics are specified using natural language. For example, Chung et al. propose a statistical model for linking a word to a song [
73]. It is trained on a dataset of manually constructed playlists and their titles. In practice, they learn an embedding for every word and song, in such a way that a particular song embedding is aligned with a particular word embedding if that song is likely to appear in a playlist which contains that word in its title.
Case-Based Reasoning (CBR). CBR is an approach to problem-solving that involves reasoning with prior experiences. CBR can be effective when two tenets hold [
204]: similar problems have similar solutions; the types of problems an agent encounters tend to recur. Case-based APG-I assumes that existing playlists encode the results of prior reasoning, and that it is therefore worthwhile to re-use existing playlists when creating new playlists.
Given a dataset or case base of existing playlists and an initial seed playlist, Gatzioura and Sànchez-Marrè use CBR to recommend a set of songs for playlist continuation (APC) [
124]. Their system retrieves from the case base a set of
\(k\) playlists that are similar to the user's seed playlist. In this system, similarity is an aggregate of song similarity, where song similarity is based on shared meta-data. The system recommends songs taken from the
\(k\) playlists, based on the playlist similarity scores.
The approach is extended to include time-of-creation pre-filtering [
125] and shared latent topic pre-filtering [
126,
127].
By contrast, Baccigalupo and Plaza [
18] deal with case-based playlist generation from a seed song (APG-I), rather than case-based playlist continuation, and treat the order of the songs in the playlists in the case base as significant. In their approach, the system retrieves and combines a set of so-called relevant patterns. Relevant patterns are subsequences that contain the user's seed song and which recur across multiple playlists in the case base.
There is an even greater risk of low frequency patterns than there was in frequent pattern mining and Markov model algorithms: Baccigalupo and Plaza's algorithm works with songs, rather than representations of songs, and the patterns it mines must include the seed song.
Both Gatzioura and Sànchez-Marrè and Baccigalupo and Plaza equip their systems with additional scoring mechanisms that try to take coherence and diversity into account, these being concepts that we discuss further in
Section 6.1.2.
Discrete Optimization. A different way to approach playlist generation is by setting up a discrete optimization problem. Given the catalogue of songs and a set of explicitly specified constraints that capture the desired target characteristics, the goal is to construct a sequence of songs that satisfies the constraints, while maximizing some utility function [
39].
Discrete optimization approaches to APG-I differ in their constraints. Some approaches impose constraints on consecutive songs, for example by requiring that their similarity should be higher than some value, as measured by a song similarity measure [
5,
263]. Other approaches impose constraints directly on metadata or content-based data [
6,
15,
145,
155,
263,
275]. In the case of [
15], for example, this is done by requiring that at least
\(n\) songs in the playlist should have a specific musical genre. In [
156], constraints are sequential patterns that have been mined from a user's listening history.
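As a toy illustration of the constraint-based formulation, the backtracking search below builds a playlist that satisfies a minimum genre count and a crude BPM-based coherence constraint (the catalogue, constraints, and similarity test are all invented for the example; real systems use the solvers discussed below).

```python
# A toy backtracking search for a playlist satisfying explicit constraints:
# at least `min_rock` rock songs, and BPM-coherent consecutive transitions.
songs = {
    "s1": {"genre": "rock", "bpm": 120},
    "s2": {"genre": "rock", "bpm": 125},
    "s3": {"genre": "jazz", "bpm": 90},
    "s4": {"genre": "pop",  "bpm": 118},
}

def similar(a: str, b: str) -> bool:
    """A crude consecutive-song similarity constraint based on BPM."""
    return abs(songs[a]["bpm"] - songs[b]["bpm"]) <= 10

def search(playlist, length=3, min_rock=2):
    if len(playlist) == length:
        n_rock = sum(songs[s]["genre"] == "rock" for s in playlist)
        return playlist if n_rock >= min_rock else None
    for s in songs:
        if s in playlist:
            continue
        if playlist and not similar(playlist[-1], s):
            continue                    # prune incoherent transitions
        result = search(playlist + [s], length, min_rock)
        if result:
            return result
    return None

print(search([]))   # e.g. ['s1', 's2', 's4']
```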
In addition to constraints that must be satisfied, there may be a utility function to be maximized. For example, the approaches in [
145,
194,
284] seek to maximize the similarity of consecutive songs in the playlist, as measured by a song similarity measure [
195].
Also, different approaches use different strategies to solve the optimization problem. Some use linear programming [
5,
6]; others use constraint satisfaction [
15,
263,
275]; yet others use simulated annealing [
145,
277]; and at least one uses each of genetic algorithms [
155], ant-colony optimization algorithms [
243], and tabu search [
156]. In all cases, the greatest challenge is coping with the combinatorial explosion that results when scaling up to large music collections.
Sequence Modeling. We have already discussed the recent contribution that deep learning has made to new CF algorithms for APG-I. We have also seen some algorithm types (most notably sequential pattern mining and statistical models) that depend on the sequence of songs in a corpus of playlists. In what we are here referring to as sequence modeling algorithms, we look at the application of deep learning methods to the song sequence data.
One well-known family of deep learning models is that of RNNs [
224]. RNNs are particularly suited to learning from sequential data. At their core is the concept of a hidden state, which is updated at each step of the sequence as a function of the current and past elements of the sequence. The hidden state contributes to the prediction of the next value in the sequence.
RNNs can be applied to APG-I by considering that a playlist is a sequence of songs. As such, RNNs can naturally predict the next song in the sequence, i.e., the song to add to the playlist. They can be used to construct a playlist from scratch, song by song, but may be particularly suited to APC, where they propose a continuation of an existing playlist. In [
172,
340–
342] a particular RNN, known as GRU4Rec [
150], originally proposed for SARSs, is used for APG-I. The authors train the RNN on a dataset of manually constructed playlists, with the objective of correctly replicating those playlists, i.e., the RNN is fed the playlist up to song
\(n\), and its parameters are optimized so that it correctly predicts the song in position
\(n+1\). Related work makes use of other RNN models, such as LSTMs [
71,
162]. Shih and Chi [
312] propose an additional RNN training step, in which other training objectives are included, such as song diversity and freshness, by resorting to a policy-guided RL algorithm [
154]. Moreover, by treating a playlist title as a sequence of characters, it is possible to use RNNs to process the characters, obtaining an embedding vector that can be used to predict the songs in the playlist, given the title [
191].
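A minimal next-song RNN in the spirit of the models above can be sketched in PyTorch as follows; it is a generic GRU language-model-style predictor, not GRU4Rec itself, and the catalogue size, dimensions, and single training example are placeholders.

```python
# A minimal PyTorch GRU that predicts the next song ID in a playlist.
import torch
import torch.nn as nn

n_songs, emb_dim, hid_dim = 1000, 32, 64

class NextSongGRU(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_songs, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, n_songs)   # scores over the catalogue

    def forward(self, song_ids):                 # song_ids: (batch, seq_len)
        h, _ = self.gru(self.embed(song_ids))
        return self.out(h[:, -1])                # predict from the last step

model = NextSongGRU()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Train to predict song n+1 from the playlist up to song n.
playlist = torch.tensor([[3, 17, 42, 7, 99]])
inputs, target = playlist[:, :-1], playlist[:, -1]
optimizer.zero_grad()
loss = loss_fn(model(inputs), target)
loss.backward()
optimizer.step()
```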
Another family of deep learning models is that of CNNs. The use of CNNs was popularized in computer vision, where they yield state-of-the-art accuracy in tasks such as image recognition [
200]. While not always seen as sequence models, CNNs have been adapted to do language modeling [
178] and this inspires ways of using CNNs for APG-I. For example, by applying the songs-as-words analogy, it is possible to use a CNN to predict the next song in the playlist [
351].
Our final type of deep learning model is the transformer [
347]. It is only very recently that transformers have been applied to APG-I, e.g., [
33,
34,
311]. In the context of APG-I from seed songs, Bendada and his co-authors [
33,
34] report an A/B-test comparison of two approaches: a transformer and a latent factor model. The transformer model resulted in longer listening times, which is a positive result. However, for more mature users, this was accompanied by a reduction in actions such as adding songs from the playlist to a list of favorites. We mention transformers in
Section 8 as a promising direction for future work. Indeed, sequential modeling in general is a promising approach, but it does require the availability of large quantities of reasonably high-quality playlists.
RL. RL is a form of machine learning in which an agent, through interaction with its environment, learns how to take specific actions so as to maximize a long-term numerical reward. In each step, the agent takes an action and the environment transitions from one state to another state. After each action, the agent may observe a reward. The agent aims to learn a policy that defines which action should be taken in each state in order to receive the greatest cumulative reward [
325].
RL is suitable for modeling sequential problems, in which each action is taken as a consequence of the previous action. Playlist construction can be modeled as an RL problem, by considering an action to be the addition of a particular song to the playlist, for which there is a reward. The goal is to learn a policy that maximizes cumulative reward. The survey of APG-I by Bonnin and Jannach does not contain any RL algorithms for APG-I, since most were published after 2014.
In APG-I work that is based on RL, the current state is given by the list of songs in the playlist. For example, if the playlist has been constructed up to the 10th song, and the agent is tasked with choosing the 11th song, then the state is the list of those 10 songs; an action is the addition of a specific song to the playlist.
The observed rewards depend on the user, and on the newly added song. In some work, rewards are observed implicitly; for example, a skip is a negative reward, while listening to a song to completion is a positive reward [
192]. In other work, rewards are observed explicitly, for example by asking the user to rate the newly added song on a numerical scale [
221]. In some work, rewards are calculated from the system's background knowledge. For example, in [
299,
300], reward is based on similarity in an embedding of acoustic features, and this is extended in [
301] to similarity also in a knowledge graph embedding combined with measures of popularity and novelty. In this way, rewards help balance smooth transitions, diversity, and discovery.
In some APG-I work based on RL, the agent learns a policy from the observed rewards directly. The work in [
66,
157,
192], for example, uses Q-learning [
357] to learn a policy from observed rewards. In other work, the agent estimates a reward function from the observed rewards and uses this reward function to determine the policy. The work in [
221,
222], for example, parameterizes the estimated reward function as a linear transformation of the newly added song's content-based data.
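As a toy illustration, the tabular Q-learning sketch below collapses the state to the last song only and simulates rewards with a hidden per-song score; real systems observe rewards from users, as described above, and use the richer state representations discussed next.

```python
# A toy tabular Q-learning sketch for playlist construction. State = last
# song only (a drastic simplification); rewards are simulated, standing in
# for observed user feedback such as skips and completions.
import numpy as np

n_songs, episodes = 50, 2000
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

Q = np.zeros((n_songs, n_songs))     # Q[state = last song, action = next song]
taste = rng.normal(size=n_songs)     # hidden per-song reward, for simulation

for _ in range(episodes):
    state = rng.integers(n_songs)    # seed song
    for _ in range(10):              # build a 10-step playlist
        if rng.random() < eps:
            action = rng.integers(n_songs)        # explore
        else:
            action = int(np.argmax(Q[state]))     # exploit
        reward = taste[action]                    # simulated user feedback
        Q[state, action] += alpha * (reward + gamma * Q[action].max()
                                     - Q[state, action])
        state = action
```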
One apparent issue with the formalization of APG-I as RL above is that of combinatorial explosion. For example, if the catalogue size is just 10 million songs (much smaller than it is in some music streaming services [
318]), then there are
\(10^{70}\) possible states just for playlists that contain 10 songs, and
\(10^{7}\) possible actions. One way to tackle the state explosion problem is to factorize song representations in terms of their features, such as content-based features [
192,
221,
222], metadata [
157] or mood [
66], and/or by applying windowing, e.g. by representing just the last three songs of a playlist in a state [
66,
157].
At the time of writing, RL algorithms for APG-I are promising but relatively under-explored. Their need for reward data is their greatest limitation, and it is not clear that calculating rewards from other data is an adequate substitute for human reward data.
Having now reviewed the different algorithm types, we finish this section with some topics that are algorithmic but which cut across the different algorithm types.
Discussion: Trends. Our survey of algorithm types enables us to identify some trends in the research. We gave an overview of these in
Section 3, where we were contrasting our survey with previous ones. But, now, at the risk of some repetition, we can use
Table 3 and our presentation of the algorithms above to confirm them. There are differences between the research up to 2013 (the
first period), which was already surveyed by Bonnin and Jannach [
39], and the research from 2014 onward (the
second period), that we exclusively survey. We can explain the changes in terms of at least three factors: (1) the paradigm shift in music consumption from small personal music collections to streaming services, requiring algorithms that scale well to millions of candidate songs; (2) consequently, the availability of certain kinds of data, most notably usage data, such as the MPD, released in 2018 for the ACM RecSys Challenge, which enables approaches that train on collections of manually constructed playlists; and (3) the rise of deep learning across ML in general.
Accordingly, work on Similarity algorithms has declined a little since the first period, perhaps because some approaches do not scale well. The more recent Similarity algorithms use embeddings learned from datasets of manually constructed playlists. Use of CF algorithms has grown enormously, exploiting, for example, the MPD, mentioned above. A little of the CF work uses nearest-neighbors methods, but these do not always scale well. Instead, MF has become common, and the most recent work combines MF with models for re-ranking or with multi-layered neural networks, autoencoders, and the like. Discrete optimization methods have largely not survived the transition to music streaming services. When there are millions of songs, it is nearly impossible to use discrete optimization algorithms, as their worst-case computational complexity is exponential.
Approaches that try to learn from the order in which songs appear in playlists have changed greatly. Neural approaches to sequence modeling, such as recurrent neural networks and transformers, have superseded frequent pattern mining and, to some extent, statistical models. Finally, RL, which seems to promise much in terms of modeling and of handling user feedback, has grown strongly.
Discussion: Hybrids. The types of algorithms we reviewed differ in their performance. There is, for example, some evidence that CF and sequence modeling excel in generating high-quality playlists, especially in the case where the target characteristics are specified as a list of seed songs [
39,
364]. However, it is wrong to consider one type of algorithm to be superior to another. A more correct view is to consider them as complementary. Algorithms of different types usually leverage different sources of background knowledge. For example, while CF algorithms are mostly limited to usage data, similarity algorithms can easily include content-based data. Also, algorithms of different types usually accommodate different ways of specifying the target characteristics. For example, while CF algorithms are mostly limited to the case in which a static playlist is generated from a list of seed songs, RL algorithms can generate dynamic playlists that adapt to the user feedback in real time.
It is therefore often necessary to employ hybrids that combine algorithms of different types, or that use different data, in different circumstances, depending on the background knowledge available and on the way the target characteristics are specified. For example, Schedl et al. [
304] use a similarity-based algorithm to generate a playlist, which is then adapted in real time based on the contextual information, gathered from sensors and processed by a statistical model. Frequently, different types of algorithms are combined with the goal of attaining playlists of higher quality [
172,
248,
326,
364]. One common way of combining algorithms is to compute a weighted average of their predictions [
229,
351].
Deep learning is often used as a powerful tool to combine heterogeneous features and information sources [
320]. For example, as noted above, Vall and his co-authors [
337,
339] combine song embedding representations extracted from content-based data, metadata, and manually curated playlists by means of a deep feed-forward neural network, so as to model the probability that a specific song is a good fit for a specific playlist.
We refer the reader to [
44] for an understanding of the possible ways in which RS algorithms can be combined, many of which can be adapted to APG-I.
Discussion: Re-Ranking. Finally, we discuss re-ranking, which can be thought of as a particular type of hybrid algorithm. Systems that use re-ranking typically have a two-stage architecture. In the first stage, a model ranks the candidate songs for their relevance to a playlist. In the second stage, the candidates are re-ranked by a second algorithm, typically using data that was not used in the first stage.
In APG-I, there are at least two motivations for using re-ranking. One motivation is to improve scalability. The first stage would use a model that can score all the songs in the catalogue for relevance but at speed. A common choice is an MF model. Only those candidates with the highest scores from the MF model are passed to the second stage, where they are ranked by a model that takes different data into account and may not operate as quickly as the model in the first stage [
298,
351].
The second motivation for re-ranking is to improve the top-
\(n\) song recommendations that are selected for display to the user of, e.g., an APC system. Songs appearing lower in the ranking produced by the first stage might be ‘promoted’ in order to produce a set of
\(n\) recommendations that satisfy additional criteria. In [
187], for example, the intuition is that the set of songs that is recommended for continuation of a playlist should match the level of diversity of songs that are already in the playlist. Kaya and Bridge use sub-profile aware diversification [
188] to implement this intuition, measuring an increase in accuracy. A similar approach is taken in [
229], where diversity is measured by means of content-based features. The intuition of [
116] is, instead, that some songs recommended for continuation of a playlist should be familiar to the user, while others should be novel. All these papers implement their intuitions using a re-ranking approach.
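A generic two-stage scheme can be sketched as follows: a fast first-stage scorer shortlists candidates, and a greedy second stage re-ranks them by trading off first-stage relevance against diversity (the scores, features, and trade-off weight are placeholders; the cited papers implement more refined criteria).

```python
# A two-stage sketch: stage 1 shortlists by fast relevance scores (e.g. MF
# dot products); stage 2 greedily re-ranks, rewarding distance from songs
# already picked.
import numpy as np

rng = np.random.default_rng(0)
n_songs = 10_000
first_stage_scores = rng.random(n_songs)    # stand-in relevance scores
features = rng.normal(size=(n_songs, 16))   # stand-in content-based features

shortlist = np.argsort(first_stage_scores)[::-1][:500]   # stage 1: top 500

def rerank(shortlist, n=10, weight=0.5):
    """Greedy re-ranking trading off relevance against diversity."""
    picked = [shortlist[0]]
    while len(picked) < n:
        def gain(i):
            diversity = min(np.linalg.norm(features[i] - features[p])
                            for p in picked)
            return first_stage_scores[i] + weight * diversity
        rest = [i for i in shortlist if i not in picked]
        picked.append(max(rest, key=gain))
    return picked

top_n = rerank(shortlist)
print(top_n)
```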
5.1.4 Evaluating APG-I Research.
Up to now, we have referred to playlist quality as the way we would measure the performance of APG-I algorithms. Playlist quality is, however, an ill-defined concept, difficult to pin down to a mathematical definition that would allow its measurement. In fact, playlist quality depends on the musical tastes of the user, on the user's familiarity with the music [
356], and on the listening context [
2]. For example, two different listeners may rate the quality of the same playlist differently, because they may have different musical preferences, because they may already know the songs, or because they are listening to the playlist in two different locations, e.g., at the beach or on the bus.
In their 2014 survey on APG-I, Bonnin and Jannach [
39] review the different strategies for measuring the quality of playlists. They identify three categories of evaluation protocols:
(1)
User studies, where users are involved in rating the quality of playlists;
(2)
Objective measures, where statistics of the constructed playlists are computed, under the assumption that those statistics (e.g., coherence or diversity of the songs’ musical genres) reflect the notion of playlist quality; and
(3)
Ground truth playlists, where algorithms are tested for how well they can recreate manually constructed playlists or listening logs, under the assumption that the manually constructed playlists or listening logs reflect a gold standard.
These three categories are still valid today, covering also the evaluation protocols in papers published from 2014 onward. In the following, we focus on how APG-I algorithms are evaluated in the papers that are exclusive to our survey, i.e., papers published from 2014 onward.
Evaluation protocol (3) is probably the most common and can be described as a three step procedure: (a) preparation, in which a number \(N\) of songs are withheld from a ground truth playlist; (b) recommendation, in which an APG-I algorithm is used to get a ranked list of \(K\) candidate songs to be added to the playlist; (c) scoring, in which metric \(M\) is used to measure the fitness of the \(K\) recommended songs relative to the \(N\) withheld songs. \(K\) can assume any value from \(1\) to the size of the song catalogue. \(N\) can assume any value from \(1\) to the playlist size. The three steps are repeated for every ground truth playlist in the dataset, and the resulting values of \(M\) are averaged.
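As a concrete instance of this protocol, the sketch below computes hit-rate with \(N\) withheld songs and \(K\) recommendations, using a trivial popularity recommender as a stand-in for a real APG-I algorithm (all data are toy examples).

```python
# A sketch of evaluation protocol (3) with M = hit-rate: withhold N songs
# per ground-truth playlist, recommend K, and check for overlap.
def hit_rate(ground_truth_playlists, recommender, n_withheld=1, k=10):
    hits = 0
    for playlist in ground_truth_playlists:
        seed, withheld = playlist[:-n_withheld], set(playlist[-n_withheld:])
        recommended = recommender(seed, k)          # ranked list of K songs
        hits += bool(withheld & set(recommended))   # a hit if any overlap
    return hits / len(ground_truth_playlists)

# A trivial popularity recommender, standing in for a real algorithm.
playlists = [["a", "b", "c"], ["b", "c", "d"], ["a", "c", "d"]]
popularity = {"c": 3, "a": 2, "b": 2, "d": 2}

def popularity_recommender(seed, k):
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    return [s for s in ranked if s not in seed][:k]

print(hit_rate(playlists, popularity_recommender, n_withheld=1, k=2))
```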
Different instances of evaluation protocol (3) differ for the choice of
\(N\),
\(K\) and
\(M\). For example, in a comparative evaluation of APG-I algorithms, Bonnin and Jannach set
\(N\) to
\(1\) and allowed
\(K\) to range from
\(1\) to 1,000 [
38,
39]. They used hit-rate as
\(M\), which, for a ground truth playlist, measures whether the set of
\(K\) recommended songs contains the withheld song. They reported the percentage of ground truth playlists for which there was a hit. By contrast, in the ACM RecSys Challenge 2018,
\(N\) is different for each ground truth playlist,
\(K\) is set to
\(500\), and
\(M\) is set to a number of different metrics related to hit-rate, including Normalized Discounted Cumulative Gain and a metric they called R-precision [
364].
In general, there is no agreement on what combination of
\(N\),
\(K\) and
\(M\) to use, but there is work that offers insights into some combinations to adopt or avoid. For example, Bonnin and Jannach show that using average log-likelihood as
\(M\), which was used in some other work [
238], leads to conclusions that are inconsistent with those from hit-rate-related metrics, and recommend avoiding its use [
38,
39]. Kamehkhosh and Jannach carry out a user trial where users are asked to choose the most appropriate continuation for a playlist among four song alternatives, three of which are generated using an algorithm and the last is the withheld song [
183]. They find that users are likely to select the withheld song as a favorite continuation. Their experiment provides evidence that the choice of
\(N\) as
\(1\),
\(K\) as
\(1\), and
\(M\) as hit-rate is a reliable setting. In contrast, Vall et al. criticize the choice of setting a specific cut-off for
\(K\), by showing that the relative ordering in performance of a set of competing algorithms changes when varying
\(K\) from
\(1\) to the size of the song catalogue, while keeping
\(N\) fixed to
\(1\) and
\(M\) to be hit-rate [
341].
Evaluation protocol (3) is explicitly designed to work in the case where the target characteristics are specified using seed songs, which is the most common scenario in the recent literature, see
Table 2. However, evaluation protocol (3) can be adapted to work when the target characteristics are specified in different ways. For example, Chung et al. propose an APG-I algorithm for which the target characteristics are input as free-form text, and they evaluate the algorithm using ground truth playlists, setting
\(N\) to the playlist size,
\(K\) to vary from
\(5\) to
\(20\), and hit-rate as
\(M\) [
73].
Evaluation protocol (3) works under the assumption that the ground truth playlists represent a gold standard, and thus the ability to replicate those playlists reflects the ability to construct high-quality playlists. However, the assumption may be too strong in some circumstances. For example, Hagen finds that manually constructed playlists often do not have clearly defined target characteristics [
140]. For example, they may sometimes be used as a randomly arranged container of the user's favorite music, created for convenience of access. Similarly, although listening logs can be treated as playlists, they may contain spurious interactions, such as songs recommended by the automatic continuation features of streaming services during periods when the user is not paying attention to the recommendations. In some work, listening logs are filtered before running the evaluation, for example by removing skipped songs [
41]. Ideally, the quality of the ground-truth playlists must be checked before running the evaluation, which circles back to the original question of how to evaluate playlist quality. One guideline to distinguish suitable datasets of manually constructed playlists for evaluation is offered in [
81], as they find that playlists manually constructed by users for sharing with other users usually meet high standards of quality and have clearly defined target characteristics.
Lastly, evaluation protocol (3) is undermined to a degree by several biases, most notably by popularity bias [
31]. Since most of the ground truth playlists tend to contain popular songs [
51], an APG-I strategy that constructs playlists in a popularity-driven fashion will usually yield good performance [
38,
39]. However, while a playlist with popular songs would satisfy a large share of users, it would not suit minorities of listeners. There exist several strategies for de-biasing evaluation protocol (3) with respect to popularity, for example see [
60]. A simple strategy is used in [
341], where evaluation protocol (3) is run separately for popular songs and for the rest of the songs, finding that the relative ranking in performance of algorithms changes for these two segments of the song catalogue.
Evaluation protocol (2) is sometimes adopted for evaluating APG-I algorithms. It works by computing statistics on the constructed playlists, under the assumption that those statistics reflect aspects of playlist quality. For example, several papers measure song diversity and coherence from song tags or musical genres [
73,
170,
171,
185]. Additionally, there are papers that measure song popularity [
73], and others that measure song novelty, i.e., the degree to which songs are unknown to the listener, and freshness, i.e., the degree to which the songs are recently released [
312]. These statistics do offer insights into the characteristics of constructed playlists, but it is not clear how those characteristics relate to the concept of playlist quality. For example, it is not clear what value of song diversity in a playlist is ideal: research suggests that songs should be somewhat diverse, while staying coherent overall [
86,
182]. See
Section 6.1.2 for a discussion of coherence and diversity.
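As one concrete example of such a statistic, intra-list diversity is often computed as the mean pairwise distance between the feature vectors of the songs in a playlist; the sketch below implements this common variant (one of several in the literature).

```python
# Intra-list diversity as the mean pairwise Euclidean distance between the
# songs' feature vectors (one common variant among several).
import numpy as np
from itertools import combinations

def intra_list_diversity(playlist_features: np.ndarray) -> float:
    if len(playlist_features) < 2:
        return 0.0
    dists = [np.linalg.norm(playlist_features[i] - playlist_features[j])
             for i, j in combinations(range(len(playlist_features)), 2)]
    return float(np.mean(dists))

print(intra_list_diversity(np.array([[0.0, 0.0], [3.0, 4.0]])))  # 5.0
```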
Evaluation protocol (1) consists of user studies. For example, RL algorithms require real-time interaction with the listener and are evaluated with user studies in which the quality of an algorithm is estimated by monitoring implicit user signals, such as the number of skips [
192], or by explicitly asking the user if they like the music or not [
221,
222]. As another example, Ikeda et al. employ a user study to evaluate the smoothness of song-to-song transitions in playlists [
161]. However, user studies have the disadvantages that they may not be reproducible and that they are costly and time-consuming. Another strategy for involving users in the evaluation of playlists is via A/B tests, in which users of streaming services are partitioned, and each partition receives playlists constructed by a different algorithm. Users in each partition are monitored for their engagement with the playlists, for example by monitoring the play counts [
40], which is an indicator of playlist quality. Unfortunately, A/B tests require resources not accessible to most researchers, such as the availability of a music streaming service platform in which the A/B test can be conducted. A middle ground between an A/B test and evaluation protocol (3) is counterfactual evaluation, which allows estimation of the performance of a candidate algorithm, as if it were in production, by relying on listening logs extracted from whatever algorithm is currently in production. Researchers at Spotify share their recipe for counterfactual evaluation in [
138], showing that its estimates correlate highly with the outcomes of actual A/B tests.
This concludes our survey of APG for individual users. We turn now to the case of APG for groups of users.