5.1.1 Background Knowledge.
The background knowledge is the information used to choose the songs from the catalogue in order to construct a playlist that matches the target characteristics. The background knowledge should be represented in some machine readable form, in such a way that it can be used by algorithms. In their 2014 survey on APG-I, Bonnin and Jannach [
39] identify several categories of background knowledge. In this section, we review those categories, while adding fresh details from the more recent work that we have surveyed. We organize APG-I algorithms based on their background knowledge in
Table 1. We highlight differences between the research surveyed by Bonnin and Jannach [
39] and the research we exclusively survey by dividing algorithms published up to 2013 from those published from 2014 onward.
Content-Based Data. Researchers in the field of MIR have been concerned for a long time with extracting information, or
features, from the music audio signal, a research topic which is often referred to as content-based MIR [
249].
These content-based features can be high-level, such as the emotions evoked by a musical piece [
324,
366], its genre [
76], timbre [
279], chords [
101], pitch [
373], and
beats per minute (BPM) [
307], or low-level, such as representations of the audio signal, for example
mel-frequency cepstral coefficients (MFCCs) [
371], or such as learned embedding representations, as extracted, for example, by
convolutional neural networks (CNNs) [
162,
273]. Often, low-level features are used for extracting high-level features, e.g. see [
72,
287,
359]. We refer the reader to [
249] for a survey of content-based MIR.
Some APG-I algorithms rely on high-level content-based features. For example, Griffiths et al. extract the emotion evoked by songs, and construct a playlist that matches the user's current emotion, which is inferred from several sensors [
135]. The same approach is taken in [
136,
324]. The work in [
40] is similar to the above, except that users manually input their current emotion, e.g., melancholy. Liebman et al. [
221,
222] propose an RL algorithm for APG-I in which songs are represented as vectors containing timbre, pitch, BPM, and statistics thereof.
Some other content-based algorithms rely on low-level content-based features. For example, Pohle et al. compute the similarity between songs in the catalogue based on an MFCC representation, to create a playlist in which consecutive songs are as similar as possible, so as to guarantee a coherent listening experience [
285]. A similar approach is followed by [
21,
111,
226,
228]. In [
71,
162], instead, song representations extracted by a CNN are used as input to a
recurrent neural network (RNN), so as to predict the next song in the playlist.
Metadata and Expert Annotations. Following [
39], we use the word metadata to refer to any information describing the playlist or its songs that is not derived from the audio signal. An example of playlist metadata is the playlist title or caption, when assigned by an expert. (When assigned by an end-user, such titles might better be classified as examples of social web data, like user tags; see below.) Examples of song metadata are the year of release, the record label, the lyrics, and the genre,
among others [
118]. Usually, experts manually annotate songs with their metadata. A notable example is the Music Genome Project [
50], a database of songs and their metadata created and maintained by experts employed by the Pandora music streaming service.
Different types of metadata are sometimes represented in a single structure, for example in a knowledge graph, where nodes represent both songs and heterogeneous types of metadata [
103], and edges express the relationships between songs and metadata, and between different pieces of metadata. For example, song names, album names, and years can be represented as nodes; a “belongs to” edge can link a specific song name to its album name, and a “released in” edge can link the album name to its year of release [
118]. Knowledge graphs are also used in [
87,
165,
237,
251,
252,
333]. Edges in a knowledge graph might also represent relationships between classes, subclasses, and instances, e.g. between genres and subgenres. Indeed, ontologies and taxonomies offer an alternative to knowledge graphs, placing the focus on classes, subclasses, instances, and their properties. For example, Ben-Elazar et al. use a taxonomy of musical genres [
32].
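To make this structure concrete, the following sketch builds a toy knowledge graph of this kind in Python, using the networkx library (our choice for illustration; the surveyed systems use various graph and triple-store representations, and the song and album names are merely examples).

```python
# A minimal sketch of a music knowledge graph, using networkx
# (an illustrative choice of library, not one taken from the cited systems).
import networkx as nx

kg = nx.MultiDiGraph()

# Nodes for a song, its album, a year, and a genre (illustrative names).
kg.add_node("song:Wish You Were Here", kind="song")
kg.add_node("album:Wish You Were Here", kind="album")
kg.add_node("year:1975", kind="year")
kg.add_node("genre:progressive rock", kind="genre")

# Typed edges express the relationships described in the text.
kg.add_edge("song:Wish You Were Here", "album:Wish You Were Here", relation="belongs to")
kg.add_edge("album:Wish You Were Here", "year:1975", relation="released in")
kg.add_edge("song:Wish You Were Here", "genre:progressive rock", relation="has genre")

# Traversing typed edges retrieves the metadata attached to a song.
for _, target, data in kg.out_edges("song:Wish You Were Here", data=True):
    print(data["relation"], "->", target)
```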
One way to use metadata in APG-I is by allowing users to specify constraints on the metadata and then generate a playlist that satisfies those constraints. For example, in [
277] users can input constraints on the song genre, release year, and length (in seconds), and an optimization algorithm is used to generate a playlist which satisfies these constraints. The same strategy is followed in [
5,
15,
55,
155,
263,
275,
276].
Social Web Data. Social web data is data shared online by Internet users. Following [
39], we list three types of social web data:
User tags. A user tag
is a free text annotation that a user applies to a musical item, e.g. a song or an artist [
201]. User tags can be very rich and varied, as they can cover a wide range of different topics, such as musical genres (e.g., “rock”), years (e.g., “90s”), countries (e.g., “Ireland”), activities (e.g., “chill”), seasons (e.g., “summer”), among others.
Ratings. A rating is a piece of explicit user feedback for a musical item, often expressed on a 1-to-5 rating scale or as a “like” or “dislike” judgment. The usage of ratings as background knowledge is becoming less and less common, as ratings are difficult to gather from the majority of the user base [
305]. In particular, we do not encounter any work that uses ratings as background knowledge in the literature from 2014 to the date of writing.
The social graph. A social graph connects people by different relationships, such as “friend” or “spouse,” and to musical items, e.g. person
\(x\) “likes” artist
\(y\), in social networks such as Facebook.
Social graphs are sometimes used as background knowledge, under the assumption that people who are closely connected in the graph have similar musical tastes [
130]. A playlist for a user can be constructed, for example, by including music that is liked by the user's friends, giving an automated version of word-of-mouth recommendation.
Usage Data. In streaming services, usage data record interactions between users and musical items.
Listening logs. Listening logs record the songs a user listens to, including those that they skip, those that they listen to in full, and those that they download. As such, listening logs provide indications about user tastes. For example, it is common for algorithms to interpret a skip as an indication of a song that the user does not like [
32,
66,
100,
157,
192,
265], although, of course, it could just signal that the song does not suit the user's context, such as their mood or activity. Listening logs can be used to compute embedding representations. One strategy is to rely on the word2vec algorithm [
240], by treating songs as words, and listening logs as phrases, by analogy with natural language.
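As an illustration of this songs-as-words analogy, the following minimal Python sketch trains word2vec embeddings on toy listening logs using the gensim library (the song IDs and hyper-parameter values are arbitrary placeholders, not taken from any surveyed system).

```python
# A minimal sketch of the songs-as-words analogy with gensim's word2vec.
# Listening "sessions" here are toy song-ID sequences; real logs would be used.
from gensim.models import Word2Vec

listening_logs = [
    ["song_12", "song_47", "song_88", "song_47"],
    ["song_47", "song_88", "song_13"],
    ["song_12", "song_13", "song_88"],
]

# Treat each log as a "sentence" whose "words" are song IDs.
model = Word2Vec(sentences=listening_logs, vector_size=32, window=3,
                 min_count=1, sg=1, epochs=50, seed=0)

embedding = model.wv["song_88"]                        # learned song embedding
neighbours = model.wv.most_similar("song_88", topn=2)  # similar songs
print(neighbours)
```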
Popularity. Usage data gives a clear indication of the popularity of musical items, e.g., of songs. We can, for example, simply count the occurrences of each song in the listening logs of all the users. Popularity is sometimes used as background knowledge for building simple but effective heuristics for APG-I. For example, Bonnin and Jannach show that it is possible to build high-quality playlists by simply including the most popular songs made by artists similar to the artists that the user likes [
38]. Moreover, popularity can be employed as a fallback strategy for users who are new to the streaming service and whose musical tastes are therefore still unknown.
Manually constructed playlists. Users frequently create playlists for convenience [
112] and self-expression [
358]. These user playlists can be used as background knowledge for creating new playlists. For example, McFee and Lanckriet learn song-to-song transition probabilities based on a database of user playlists, and they use these probabilities to generate new playlists [
238]. Manually constructed playlists can also be used to compute embedding representations; see our discussion of listening logs, where we mentioned the word2vec algorithm as a possible way of computing these embeddings.
Discussion. The categories of background knowledge we review above differ in their availability, consistency, and abundance:
Availability. The availability of some background knowledge may not be guaranteed for all the songs in the catalogue. For example, recently added songs may have no user tags, or may occur only a few times, or never, in listening logs or in manually constructed playlists. The same goes for “long-tail” songs [
195], i.e., those songs which are rarely listened to, that constitute the large majority of the catalogue [
51]. The unavailability of background knowledge for such portions of the catalogue leads to biases against new songs (the cold-start problem [
89]) and against long-tail songs (popularity bias [
169]). In fact, algorithms cannot evaluate a song for inclusion in a playlist if there is no background knowledge to match the song to the target characteristics of the playlist. Content-based data is the only category of background knowledge which can be available for every song in the catalogue, as content-based data is extracted from the song audio itself. As such, content-based data allows for the construction of “fair” algorithms, in the sense that they can select any song in the catalogue for inclusion in the playlist.
Consistency. Some background knowledge may be noisier than other background knowledge. For example, the level of noise in metadata is often low because metadata annotations are typically made manually by domain experts. Nevertheless, inconsistencies in metadata may exist, especially because some metadata is not objective. For example, Flexer et al. find that different annotators may disagree on the musical genre of songs [
110]. Similarly, the level of noise in content-based data also tends to be low because content-based data is extracted by automatic procedures. Nevertheless, inconsistencies in content-based data may arise because those automatic procedures are never 100% accurate. For example, predicting the tempo of a song is challenging and the results can be inaccurate [
58,
278].
Usage and social web data are typically much noisier than content-based and metadata, as they record the unpredictable behavior of Internet users. For example, Lamere analyzes a dataset of user tags and finds misspellings, spelling variants and synonyms among user tags, as well as user tags with little to no relevance to music, such as the user tag “random” [
201]. Similarly, Hagen finds that manually constructed playlists often do not have clearly defined target characteristics but may be used as a randomly arranged container of the user's favorite music [
140].
Abundance. Usage and social web data are by far the most abundant categories of background knowledge, as they are generated in large quantities by billions of Internet users every day. Content-based data and metadata are less abundant, as their extraction depends, respectively, on computationally demanding procedures [
308] and on the expensive annotation work of domain experts.
Using one category of background knowledge rather than another influences the quality of the generated playlists. For example, there is some evidence that algorithms relying solely on content-based data produce playlists of low quality, especially when compared with other types of systems, e.g. those which rely on usage data [
321] or metadata [
338]. However, it is wrong to consider one category of background knowledge to be superior to another. A more correct view is to consider them as complementary: while content-based data can help to create coherent playlists in terms of acoustic properties, usage data gives information about the musical tastes of the user, allowing the creation of personalized playlists. Hence, it is common to combine different categories of background knowledge. There are several papers that show how the quality of generated playlists is enhanced by making effective use of more than one category of background knowledge, e.g. [
32,
171,
237]. One way to readily include different sources of background knowledge is to organize them in a unifying structure, for example a knowledge graph (described earlier) [
103]. In addition to representing songs and their metadata, knowledge graphs can represent listening logs: nodes for users would link to nodes for the songs they listened to [
261].
If we refer back to
Table 1, we can see the differences between the research up to 2013 (the
first period), which was already surveyed by Bonnin and Jannach [
39], with respect to the research from 2014 onward (the
second period). The majority of algorithms from the
first period rely on content-based data and metadata for their background knowledge, especially because catalogues before the streaming era were small enough to allow the manual annotation of songs and the extraction of content-based data. In the
second period, when streaming became the prevalent type of music access [
160], the emphasis shifted to usage data, mainly due to the availability of that type of background knowledge, easily recorded by the music streaming service.
5.1.3 Algorithm Type.
We review algorithms for APG-I based on their type. In their 2014 survey on APG-I, Bonnin and Jannach [
39] identify several types of algorithms. In this section, we review those types, while adding two types that emerge from the more recent work: sequence modeling and RL. We organize APG-I algorithms based on their type in
Table 3. We highlight differences between the research surveyed by Bonnin and Jannach [
39] and the more recent research by dividing algorithms published up to 2013 from those published from 2014 onward.
Similarity. Similarity algorithms use song similarity to construct playlists. Song similarity can be derived from different kinds of background knowledge, such as content-based data [
14,
21,
46,
161,
265,
285], metadata [
274,
283], tags [
284,
304], manually constructed playlists [
38,
233,
292], listening logs [
346], ratings [
56,
315], or any combination of the above [
174]. For example, both Pohle et al. and Cai et al. compute the similarity between songs based on an MFCC representation [
21,
46,
285]; Pauws and Eggen and Polignano et al. count the values of metadata features that two songs have in common [
274,
286]; and Bonnin and Jannach consider how often two songs co-occur in manually constructed playlists [
38]. More recently, we have seen increasing reliance on song embedding representations [
85,
199,
346], learned using the word2vec algorithm [
240]. These embeddings are obtained by treating songs as words and treating manually constructed playlists as phrases, by analogy with natural language. Song embedding representations can also be given as input to a clustering algorithm, such as
\(k\)-means, to generate playlists of similar songs by sampling from the clusters [
116], as sketched below.
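The following sketch illustrates this clustering strategy with scikit-learn's k-means: random vectors stand in for learned song embeddings, and a playlist is sampled from one cluster (all names and sizes are illustrative).

```python
# A sketch of playlist generation by clustering song embeddings and
# sampling songs from a single cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
song_ids = [f"song_{i}" for i in range(1000)]
embeddings = rng.normal(size=(1000, 32))   # stand-in for learned embeddings

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(embeddings)

def playlist_from_cluster(cluster_id: int, length: int = 10) -> list:
    """Sample a playlist of similar songs from one cluster."""
    members = [s for s, c in zip(song_ids, kmeans.labels_) if c == cluster_id]
    return list(rng.choice(members, size=min(length, len(members)), replace=False))

print(playlist_from_cluster(3))
```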
Integrating multiple sources of background knowledge is beneficial when computing similarity. For example, in the context of judging the similarity between a song that could be added to a playlist and the songs already in the playlist, research shows that users consider content-based features, such as energy and tempo, as well as metadata, such as musical styles and lyrical content [
22,
323].
The perception of song similarity is subjective. Even expert listeners are found to disagree when asked to rate the similarity between songs [
109]. For example, some people weight content-based data more heavily than metadata when judging similarity, while others do the opposite [
205]. Some work in APG-I integrates personalization in the similarity computation. For example, Sotiropoulos et al. allow users to set different weights for different features when assessing similarity, e.g., a user might weight content-based data more than metadata or vice-versa [
316]. And, in [
302], Sandvold et al. propose a system where users can assign tags to songs, drawing from a vocabulary of tags. Then, the system learns how to tag new songs, so that the predicted tags reflect the user's tagging style. Finally, playlists can be created using a similarity measure based on both kinds of tags (those that are assigned and those that are predicted), which means that similarity is personalized based on the user's tagging style.
Once similarities are computed, songs can be chosen for their similarity with the seed songs [
14,
38,
226], or for their similarity with other songs liked by the user [
158]. Another possibility is to create playlists so as to maximize the similarity between songs [
194].
Playlists generated with similarity-based algorithms are expected to be coherent. However, coherence is not the only quality criterion for playlists; other criteria exist, such as diversity [
290]. Also, a risk with optimizing for similarity is that the playlist may become monotonous [
205], e.g., containing songs from the same album. See
Section 6.1.2 for a discussion of coherence and diversity.
One use-case for similarity algorithms is that of playlist sequencing, the special case of APG-I where the target characteristics are given as a set of songs, and algorithms are tasked simply with arranging those songs, without any further song selection, in such a way that the music flows coherently from one song to the next [
36]. For example, Bittner et al. and Cliff both use a similarity algorithm working on content-based similarity [
36,
74]. Their approach compares the distance between songs based on several features and then arranges the songs in such a way that those distances are minimized. Sarroff and Casey [
303] instead build a machine learning predictor, working on content-based features, that distinguishes suitable from unsuitable song-to-song transitions. Finally, Furini and his co-authors move toward personalized sequencing [
115,
117]. They analyze song-to-song transitions in a user's playlists so that they can sequence playlists in a personalized way.
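A minimal version of the distance-minimizing arrangement idea is the greedy nearest-neighbour ordering sketched below, assuming each song is described by a content-based feature vector (the cited systems use richer features and more elaborate orderings).

```python
# A greedy nearest-neighbour sketch of playlist sequencing: arrange a fixed
# set of songs so that consecutive songs are close in feature space.
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4))   # stand-in content-based feature vectors

def sequence(features: np.ndarray, start: int = 0) -> list:
    """Order songs so that each next song is nearest to the previous one."""
    remaining = set(range(len(features))) - {start}
    order = [start]
    while remaining:
        last = features[order[-1]]
        nearest = min(remaining, key=lambda i: np.linalg.norm(features[i] - last))
        order.append(nearest)
        remaining.remove(nearest)
    return order

print(sequence(features))
```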
CF. CF is a common approach in the RSs literature. It is based on the heuristic that if the active user agreed with certain users in the past, then these users are similar to the active user, and items that these users liked should be relevant to, and can be recommended to, the active user [
296]. Hence, the use of CF algorithms is facilitated by the existence of usage data, recording the preferences of other users. Specifically, they typically assume a sparse user–item matrix that may record user ratings for items or user interactions with items. CF is then a family of methods for predicting the rating a user would assign to an item or the relevance of an item to a user, based on the data that is given in the rest of the matrix. It is the fact that this data may come from other users that makes these approaches “collaborative.”
The most common way of employing CF for APG-I is by applying the playlists-as-users analogy [
38,
298], in which a user is a playlist and an item is a song: instead of a user–item matrix that records each interaction between a user and an item, we have a playlist–song matrix that records information about the presence of each song in a playlist. Another common analogy is the titles-as-users analogy [
298,
351,
369], in which a user is a playlist title and items are again the songs. Which analogy to employ depends on the target characteristics. The playlists-as-users analogy fits the case in which the target characteristics are given as seed songs. The titles-as-users analogy fits the case in which the target characteristics are given as a playlist title, i.e., a special case of free-form text. For simplicity of exposition, most of the rest of this section uses only the playlists-as-users analogy.
In the playlists-as-users analogy, the playlists–songs matrix can be unary [
172], i.e., recording a
\(1\) if a playlist contains the song. Or, it can be non-unary; e.g., it may assign a value to a song according to its position in the playlist, giving more weight to the later songs [
12,
331]. Given a playlists–songs matrix, CF algorithms that would ordinarily predict the relevance of an item to a user can be re-purposed to predict the suitability of a song for a playlist.
One option is to employ nearest neighbors CF algorithms [
255]. For example, the system in [
341] computes the similarity between the active playlist and the other playlists in the dataset as the cosine similarity of their row-vector representations in the playlists–songs matrix. Then, for each song in the catalogue, it computes a score by summing the similarities of the active playlist to the
\(k\) most similar playlists in the dataset that contain the song, where
\(k\) is a positive integer hyper-parameter. Finally, the highest-scoring song is selected to be added to the playlist. The algorithm above corresponds to the user-based
\(k\)-nearest neighbors CF algorithm in the RSs literature [
198]. The user-based
\(k\)-nearest neighbors algorithm is also employed in [
166,
172].
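The user-based scoring scheme described above can be sketched in a few lines of numpy; the toy playlists-songs matrix and the choice of \(k\) below are illustrative, not taken from [341].

```python
# A sketch of user-based k-NN scoring on a toy binary playlists-songs matrix
# (rows: playlists, columns: songs).
import numpy as np

R = np.array([[1, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [1, 1, 0, 1, 0]], dtype=float)
active = np.array([1, 1, 0, 0, 0], dtype=float)   # the playlist being extended

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

sims = np.array([cosine(active, row) for row in R])
k = 2

scores = np.zeros(R.shape[1])
for j in range(R.shape[1]):
    if active[j]:                                  # skip songs already present
        continue
    holders = np.where(R[:, j] == 1)[0]            # playlists containing song j
    top = holders[np.argsort(sims[holders])[::-1][:k]]  # k most similar of them
    scores[j] = sims[top].sum()

next_song = int(np.argmax(scores))
print(next_song)
```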
Another option is to use the item-based
\(k\)-nearest neighbors algorithm [
91]. For example, the system in [
340,
342] computes the similarity between songs as the cosine similarity of their column-vector representations in the playlists–songs matrix. Then, for each song in the catalogue, it computes a score by summing the similarity of the last song in the playlist to the
\(k\) most similar songs in the catalogue.
Some work uses similarity functions other than cosine; e.g., Jaccard [
331]. Additionally, the similarity functions can be augmented with heuristics, e.g., by giving higher weight to unpopular items [
189].
But there exist CF algorithms, other than nearest neighbors, for constructing playlists. For example, Aizenberg et al. [
3] use a
matrix factorization (MF)-based approach, in which the playlists–songs matrix is factorized into two low-dimensional matrices, containing the playlist embeddings of every playlist in the dataset and the song embeddings of every song in the catalogue. Then, for each song in the catalogue, the system computes a score by taking the dot-product of the playlist and song embeddings. Finally, the highest-scoring song is selected to be added to the playlist.
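The following is a compact sketch of this MF recipe, with embeddings learned by stochastic gradient descent on a toy binary playlists-songs matrix; it is a generic formulation, not the specific model of Aizenberg et al., and it naively treats absent songs as negatives.

```python
# A compact MF sketch: factorize a playlists-songs matrix via SGD, then
# score catalogue songs for one playlist by a dot product.
import numpy as np

R = np.array([[1, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0]], dtype=float)
n_playlists, n_songs, d = R.shape[0], R.shape[1], 8

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_playlists, d))   # playlist embeddings
Q = rng.normal(scale=0.1, size=(n_songs, d))       # song embeddings

lr, reg = 0.05, 0.01
for _ in range(500):
    for u in range(n_playlists):
        for i in range(n_songs):
            err = R[u, i] - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])

# Score every song for playlist 0, mask songs already present, pick the best.
scores = Q @ P[0]
scores[R[0] == 1] = -np.inf
next_song = int(np.argmax(scores))
print(next_song)
```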
In some cases, MF is the first step of a two-step APG-I algorithm, especially in the case of APC, e.g. [
298,
351]. MF is used to learn a model that can predict the relevance of every song to a playlist. These relevances are then used to filter the catalogue down to a more manageable (but still large) set of candidate songs to add to the playlist. These remaining songs are associated with features such as their popularity [
351] or the degrees of homogeneity and diversity they would bring to the playlist [
298]. A second model learns to re-rank the remaining candidates based on these features. Re-ranking algorithms commonly used in APG-I include gradient boosted trees [
351], for example, XGBoost [
62].
The years since the publication of the Bonnin and Jannach survey have also seen the rise of deep learning. Deep learning is a form of machine learning that is based on the use of many-layered artificial neural networks. It has led to advances in different application fields of AI, such as natural language modeling [
42] and object classification in images [
200]. Given those promising results, deep learning has recently been applied to the task of APG-I. We will discuss its use in sequence modeling in a later subsection, but here we can see the effect it has had on CF-style approaches to APG-I.
One example is to be found in the work of Zhao et al. [
369]. They take a similar approach to Aizenberg et al. above, i.e., using an MF algorithm to factorize the playlists–songs matrix. But then, where Aizenberg et al. compute scores between playlist and song embeddings, Zhao et al. provide for extra learning: the playlist and song embeddings are fed into a feed-forward neural network, which outputs a score indicating the fit of the song for the playlist.
A well-known family of deep learning models is that of autoencoders. In the simplest case, an autoencoder consists of two components, an encoder and a decoder, both of which are usually feed-forward neural networks. A more sophisticated variant is the adversarial autoencoder, in which the hidden vector distribution is regularized so as to match a Gaussian prior distribution [
234]. Autoencoders of both kinds are used for APG-I by setting the input to be a binary vector indicating which songs are in the playlist [
335,
361]. This, in effect, is the playlists-as-users analogy, and so we can regard these as CF approaches to APG-I. The output is a vector that approximates the input vector and that can be used to select further songs for addition to the playlist. Vagliano et al. [
335] successfully integrate additional background knowledge into an adversarial autoencoder for APG-I, by concatenating embeddings of textual data, such as the playlist title, to the hidden representation.
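A minimal (non-adversarial) autoencoder of this kind can be sketched in PyTorch as follows; the layer sizes, the single toy playlist, and the training loop are illustrative only.

```python
# A minimal autoencoder over binary playlist vectors: a sketch of the general
# idea, not a specific published architecture.
import torch
import torch.nn as nn

n_songs, hidden = 5000, 64

class PlaylistAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_songs, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_songs)  # logits over the catalogue

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = PlaylistAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

x = torch.zeros(1, n_songs)      # a playlist as a binary song-indicator vector
x[0, [10, 42, 123]] = 1.0

for _ in range(100):             # overfit one toy example, for illustration
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)
    loss.backward()
    optimizer.step()

# At inference, high-scoring songs outside the playlist are candidates.
scores = model(x).detach().squeeze()
scores[x.squeeze() == 1] = float("-inf")
candidates = torch.topk(scores, k=10).indices
```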
The playlists-as-users analogy has three main limitations: (1) the constructed playlists are not personalized, (2) performance depends on the number of songs in the playlist, and (3) performance may be poor on songs that occur in very few playlists. Concerning limitation (1), the songs are chosen so that they are tailored to those already in the playlist, but not to the listener's musical tastes. Some work tweaks CF approaches so that they become personalized. One approach, for example, is to modify the active playlist by adding songs from other playlists created by the same listener [
3,
170,
185]. Concerning limitation (2), CF algorithms are affected by the cold-start problem, which manifests with small or newly created playlists [
59]. In fact, the accuracy of CF algorithms is positively correlated with the number of seed songs, i.e., the algorithm will generate a better playlist when provided with more seed songs [
364]. And, CF algorithms are not able to generate a playlist if no seed songs are provided. In such cases, it is necessary to resort to a fall-back strategy, for example by working with other target characteristics or by employing simple heuristics based on song popularity. The solution to limitation (3) is usually some form of hybrid. For example, Vall et al. combine song embedding representations extracted from content-based data, metadata, and manually curated playlists by means of a deep feed-forward neural network, so as to model the probability that a specific song is a good fit for a specific playlist [
337,
339].
The titles-as-users analogy shares the same three limitations. Some authors propose a solution to alleviate the cold-start problem when using the titles-as-users analogy, which consists of clustering similar titles together, so as to increase the number of songs for each title [
369]. One way to cluster titles is to rely on simple text pre-processing pipelines, which transform the text to a common format, for example by removing special characters [
362,
363,
369]. Another way to cluster titles (although here employed in a recurrent neural network) is by employing text-embedding procedures, such as FastText [
175], and by running a clustering algorithm on those embeddings [
245]. Yang et al. [
361], on the other hand, treat playlist titles as sequences of characters and use a CNN to process the characters, obtaining an embedding vector that can be used to predict the songs in the playlist, given the title. They combine the CNN with an autoencoder to obtain a system that follows both a titles-as-users approach and a playlists-as-users approach.
One last drawback of CF approaches, however, is that they are not designed for the specific challenges of APG-I, and aspects such as song coherence have to be addressed separately [
39].
Frequent Pattern Mining. Frequent pattern mining approaches work by mining patterns from a dataset of manually constructed playlists. A pattern can be expressed in the form \(S_{1}\Rightarrow S_{2}\), where \(S_{1}\) and \(S_{2}\) are two sets of songs. The pattern signifies that the songs in \(S_{2}\) are likely to be found after the songs in \(S_{1}\). In sequential pattern mining, \(S_{1}\) and \(S_{2}\) would be sequences of songs, not sets of songs.
Patterns can be used to generate playlists. For example, consider three songs:
\(s_{1}\),
\(s_{2}\) and
\(s_{3}\); given a playlist with
\(s_{1}\) and
\(s_{2}\) as seed songs, if we have extracted the pattern
\(S_{1}\Rightarrow S_{2}\), where
\(S_{1}\) is the set
\(\{s_{1},s_{2}\}\), and
\(S_{2}\) is the set
\(\{s_{3}\}\), then a candidate continuation for the playlist is
\(s_{3}\). Pattern mining is not often applied to APG-I. However, Bonnin and Jannach show that patterns and sequential patterns can, in fact, achieve comparable performance to other types of algorithms that were in use in 2014 [
38,
39]. Chen et al. furnish one example of the use of sequential patterns [
61]. They propose to use a simple bigram model that extracts
\(S_{1}\Rightarrow S_{2}\) rules, where \(S_{1}\) and \(S_{2}\) are here single songs, by counting how frequently \(S_{2}\) follows \(S_{1}\) in the dataset, corrected with Witten-Bell discounting [
177].
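The counting step of such a bigram model is easy to sketch; the code below computes maximum-likelihood transition probabilities from toy playlists and omits the Witten-Bell discounting that Chen et al. apply.

```python
# A sketch of a bigram transition model over playlists (maximum-likelihood
# counts only; discounting/smoothing is omitted for brevity).
from collections import Counter, defaultdict

playlists = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["b", "c", "d"],
]

counts = defaultdict(Counter)
for pl in playlists:
    for prev, nxt in zip(pl, pl[1:]):
        counts[prev][nxt] += 1

def next_song_probs(prev: str) -> dict:
    """P(next song | previous song), estimated from the counts."""
    total = sum(counts[prev].values())
    return {s: c / total for s, c in counts[prev].items()}

print(next_song_probs("b"))   # {'c': 0.666..., 'd': 0.333...}
```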
One problem with this approach is that patterns based on songs will be very rare, and sequential patterns even rarer. For example, even in a very large dataset of playlists, the number of times that Alice Coltrane's
Wisdom Eye follows
Post Requisite by Flying Lotus will be small. The patterns will be associated with very low confidence values, and many reasonable patterns will not be seen at all. The solution is to mine patterns based not on songs but on representations of songs, e.g., hand-crafted features or latent features. In [
144], for example, the PrefixSpan algorithm [
142] is used to mine sequential patterns on song latent embeddings. The song latent embeddings are obtained by applying Latent Dirichlet Allocation to the songs’ tags [
37].
Statistical Models. Statistical models work by modeling the probability of adding a song to the playlist.
One class of statistical models is the Markov models, i.e., those that model the probability of adding a song to the playlist based on the current “state.” In the APG-I research cited below, the state is defined as the last song in the playlist. Markov models are proposed in [
61,
87,
237,
246,
333,
345,
367].
The core component of Markov models is the estimation of the song-to-song transition probabilities. McFee and Lanckriet [
238] offer a comparison of a number of Markov models, which differ on how the probabilities are estimated: some of them count song co-occurrences in manually constructed playlists, while others rely on content-based or metadata similarity, and in some cases latent representations. This recalls the problem with frequent pattern mining (above): song co-occurrences will typically be low frequency; using song representations or Hidden Markov Models (e.g., [
367]) can overcome this.
Markov models may lead to the construction of problematic playlists [
342]. For example, adding a song based only on the previous one may lead to a lack of coherence throughout the playlist.
Some other statistical models are not Markov models, and model the probability of adding a song to the playlist based on the other songs in the playlist. For example, Hu and Ogihara [
158] consider a playlist as a time series and use an autoregressive integrated moving average model [
146] to predict the next song.
The most sophisticated statistical models are also personalized, i.e., they model the probability of adding a song to the playlist based on the other songs in the playlist and based on the user's musical tastes. For example, Ben-Elazar et al. [
32] propose a Bayesian classification model whose parameters are estimated via variational inference based on the playlist songs and on the other songs liked by the user. Two similar models are proposed in [
330,
370].
Other notable statistical models are proposed for the scenario in which the target characteristics are specified using natural language. For example, Chung et al. propose a statistical model for linking a word to a song [
73]. It is trained on a dataset of manually constructed playlists and their titles. In practice, they learn an embedding for every word and song, in such a way that a particular song embedding is aligned with a particular word embedding if that song is likely to appear in a playlist which contains that word in its title.
Case-Based Reasoning (CBR). CBR is an approach to problem-solving that involves reasoning with prior experiences. CBR can be effective when two tenets hold [
204]: similar problems have similar solutions; the types of problems an agent encounters tend to recur. Case-based APG-I assumes that existing playlists encode the results of prior reasoning, and that it is therefore worthwhile to re-use existing playlists when creating new playlists.
Given a dataset or case base of existing playlists and an initial seed playlist, Gatzioura and Sànchez-Marrè use CBR to recommend a set of songs for playlist continuation (APC) [
124]. Their system retrieves from the case base a set of
\(k\) playlists that are similar to the user's seed playlist. In this system, similarity is an aggregate of song similarity, where song similarity is based on shared meta-data. The system recommends songs taken from the
\(k\) playlists, based on the playlist similarity scores.
The approach is extended to include time-of-creation pre-filtering [
125] and shared latent topic pre-filtering [
126,
127].
By contrast, Baccigalupo and Plaza [
18] deal with case-based playlist generation from a seed song (APG-I), rather than case-based playlist continuation, and treat the order of the songs in the playlists in the case base as significant. In their approach, the system retrieves and combines a set of so-called relevant patterns. Relevant patterns are subsequences that contain the user's seed song and which recur across multiple playlists in the case base.
There is an even greater risk of low frequency patterns than there was in frequent pattern mining and Markov model algorithms: Baccigalupo and Plaza's algorithm works with songs, rather than representations of songs, and the patterns it mines must include the seed song.
Both Gatzioura and Sànchez-Marrè and Baccigalupo and Plaza equip their systems with additional scoring mechanisms that try to take coherence and diversity into account, these being concepts that we discuss further in
Section 6.1.2.
Discrete Optimization. A different way to approach playlist generation is by setting up a discrete optimization problem. Given the catalogue of songs and a set of explicitly specified constraints that capture the desired target characteristics, the goal is to construct a sequence of songs that satisfies the constraints, while maximizing some utility function [
39].
Discrete optimization approaches to APG-I differ in their constraints. Some approaches impose constraints on consecutive songs, for example by requiring that their similarity should be higher than some value, as measured by a song similarity measure [
5,
263]. Other approaches impose constraints directly on metadata or content-based data [
6,
15,
145,
155,
263,
275]. In the case of [
15], for example, this is done by requiring that at least
\(n\) songs in the playlist should have a specific musical genre. In [
156], constraints are sequential patterns that have been mined from a user's listening history.
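As a toy illustration of the constraint-based formulation, the backtracking search below builds a playlist that satisfies a minimum genre count and a crude BPM-based coherence constraint (the catalogue, constraints, and similarity test are all invented for the example; real systems use the solvers discussed below).

```python
# A toy backtracking search for a playlist satisfying explicit constraints:
# at least `min_rock` rock songs, and BPM-coherent consecutive transitions.
songs = {
    "s1": {"genre": "rock", "bpm": 120},
    "s2": {"genre": "rock", "bpm": 125},
    "s3": {"genre": "jazz", "bpm": 90},
    "s4": {"genre": "pop",  "bpm": 118},
}

def similar(a: str, b: str) -> bool:
    """A crude consecutive-song similarity constraint based on BPM."""
    return abs(songs[a]["bpm"] - songs[b]["bpm"]) <= 10

def search(playlist, length=3, min_rock=2):
    if len(playlist) == length:
        n_rock = sum(songs[s]["genre"] == "rock" for s in playlist)
        return playlist if n_rock >= min_rock else None
    for s in songs:
        if s in playlist:
            continue
        if playlist and not similar(playlist[-1], s):
            continue                    # prune incoherent transitions
        result = search(playlist + [s], length, min_rock)
        if result:
            return result
    return None

print(search([]))   # e.g. ['s1', 's2', 's4']
```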
In addition to constraints that must be satisfied, there may be a utility function to be maximized. For example, the approaches in [
145,
194,
284] seek to maximize the similarity of consecutive songs in the playlist, as measured by a song similarity measure [
195].
Also, different approaches use different strategies to solve the optimization problem. Some use linear programming [
5,
6]; others use constraint satisfaction [
15,
263,
275]; yet others use simulated annealing [
145,
277]; and at least one uses each of genetic algorithms [
155], ant-colony optimization algorithms [
243], and tabu search [
156]. In all cases, the greatest challenge is coping with the combinatorial explosion that results when scaling up to large music collections.
Sequence Modeling. We have already discussed the recent contribution that deep learning has made to new CF algorithms for APG-I. We have also seen some algorithm types (most notably sequential pattern mining and statistical models) that depend on the sequence of songs in a corpus of playlists. In what we are here referring to as sequence modeling algorithms, we look at the application of deep learning methods to the song sequence data.
One well-known family of deep learning models is that of RNNs [
224]. RNNs are particularly suited to learning from sequential data. At their core is the concept of a hidden state, which is updated at each step of the sequence as a function of the current and past elements of the sequence. The hidden state contributes to the prediction of the next value in the sequence.
RNNs can be applied to APG-I by considering that a playlist is a sequence of songs. As such, RNNs can naturally predict the next song in the sequence, i.e., the song to add to the playlist. They can be used to construct a playlist from scratch, song by song, but may be particularly suited to APC, where they propose a continuation of an existing playlist. In [
172,
340–
342] a particular RNN, known as GRU4Rec [
150], originally proposed for SARSs, is used for APG-I. The authors train the RNN on a dataset of manually constructed playlists, with the objective of correctly replicating those playlists, i.e., the RNN is fed the playlist up to song
\(n\), and its parameters are optimized so that it correctly predicts the song in position
\(n+1\). Related work makes use of other RNN models, such as LSTMs [
71,
162]. Shih and Chi [
312] propose an additional RNN training step, in which other training objectives are included, such as song diversity and freshness, by resorting to a policy-guided RL algorithm [
154]. Moreover, by treating a playlist title as a sequence of characters, it is possible to use RNNs to process the characters, obtaining an embedding vector that can be used to predict the songs in the playlist, given the title [
191].
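A minimal next-song RNN in the spirit of the models above can be sketched in PyTorch as follows; it is a generic GRU language-model-style predictor, not GRU4Rec itself, and the catalogue size, dimensions, and single training example are placeholders.

```python
# A minimal PyTorch GRU that predicts the next song ID in a playlist.
import torch
import torch.nn as nn

n_songs, emb_dim, hid_dim = 1000, 32, 64

class NextSongGRU(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_songs, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, n_songs)   # scores over the catalogue

    def forward(self, song_ids):                 # song_ids: (batch, seq_len)
        h, _ = self.gru(self.embed(song_ids))
        return self.out(h[:, -1])                # predict from the last step

model = NextSongGRU()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Train to predict song n+1 from the playlist up to song n.
playlist = torch.tensor([[3, 17, 42, 7, 99]])
inputs, target = playlist[:, :-1], playlist[:, -1]
optimizer.zero_grad()
loss = loss_fn(model(inputs), target)
loss.backward()
optimizer.step()
```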
Another family of deep learning models is that of CNNs. The use of CNNs was popularized in computer vision, where they yield state-of-the-art accuracy in tasks such as image recognition [
200]. While not always seen as sequence models, CNNs have been adapted to do language modeling [
178] and this inspires ways of using CNNs for APG-I. For example, by applying the songs-as-words analogy, it is possible to use a CNN to predict the next song in the playlist [
351].
Our final type of deep learning model is the transformer [
347]. It is only very recently that transformers have been applied to APG-I, e.g., [
33,
34,
311]. In the context of APG-I from seed songs, Bendada and his co-authors [
33,
34] report an A/B-test comparison of two approaches: a transformer and a latent factor model. The transformer model resulted in longer listening times, which is a positive result. However, for more mature users, this was accompanied by a reduction in actions such as adding songs from the playlist to a list of favorites. We mention transformers in
Section 8 as a promising direction for future work. Indeed, sequential modeling in general is a promising approach, but it does require the availability of large quantities of reasonably high-quality playlists.
RL. RL is a form of machine learning in which an agent, through interaction with its environment, learns how to take specific actions so as to maximize a long-term numerical reward. In each step, the agent takes an action and the environment transitions from one state to another state. After each action, the agent may observe a reward. The agent aims to learn a policy that defines which action should be taken in each state in order to receive the greatest cumulative reward [
325].
RL is suitable for modeling sequential problems, in which each action is taken as a consequence of the previous action. Playlist construction can be modeled as an RL problem, by considering an action to be the addition of a particular song to the playlist, for which there is a reward. The goal is to learn a policy that maximizes cumulative reward. The survey of APG-I by Bonnin and Jannach does not contain any RL algorithms for APG-I, since most were published after 2014.
In APG-I work that is based on RL, the current state is given by the list of songs in the playlist. For example, if the playlist has been constructed up to the 10th song, and the agent is tasked with choosing the 11th song, then the state is the list of those 10 songs; an action is the addition of a specific song to the playlist.
The observed rewards depend on the user, and on the newly added song. In some work, rewards are observed implicitly; for example, a skip is a negative reward, while listening to a song to completion is a positive reward [
192]. In other work, rewards are observed explicitly, for example by asking the user to rate the newly added song on a numerical scale [
221]. In some work, rewards are calculated from the system's background knowledge. For example, in [
299,
300], reward is based on similarity in an embedding of acoustic features, and this is extended in [
301] to similarity also in a knowledge graph embedding combined with measures of popularity and novelty. In this way, rewards help balance smooth transitions, diversity, and discovery.
In some APG-I work based on RL, the agent learns a policy from the observed rewards directly. The work in [
66,
157,
192], for example, uses Q-learning [
357] to learn a policy from observed rewards. In other work, the agent estimates a reward function from the observed rewards and uses this reward function to determine the policy. The work in [
221,
222], for example, parameterizes the estimated reward function as a linear transformation of the newly added song's content-based data.
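As a toy illustration, the tabular Q-learning sketch below collapses the state to the last song only and simulates rewards with a hidden per-song score; real systems observe rewards from users, as described above, and use the richer state representations discussed next.

```python
# A toy tabular Q-learning sketch for playlist construction. State = last
# song only (a drastic simplification); rewards are simulated, standing in
# for observed user feedback such as skips and completions.
import numpy as np

n_songs, episodes = 50, 2000
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

Q = np.zeros((n_songs, n_songs))     # Q[state = last song, action = next song]
taste = rng.normal(size=n_songs)     # hidden per-song reward, for simulation

for _ in range(episodes):
    state = rng.integers(n_songs)    # seed song
    for _ in range(10):              # build a 10-step playlist
        if rng.random() < eps:
            action = rng.integers(n_songs)        # explore
        else:
            action = int(np.argmax(Q[state]))     # exploit
        reward = taste[action]                    # simulated user feedback
        Q[state, action] += alpha * (reward + gamma * Q[action].max()
                                     - Q[state, action])
        state = action
```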
One apparent issue with the formalization of APG-I as RL above is that of combinatorial explosion. For example, if the catalogue size is just 10 million songs (much smaller than it is in some music streaming services [
318]), then there are
\(10^{70}\) possible states just for playlists that contain 10 songs, and
\(10^{7}\) possible actions. One way to tackle the state explosion problem is to factorize song representations in terms of their features, such as content-based features [
192,
221,
222], metadata [
157] or mood [
66], and/or by applying windowing, e.g. by representing just the last three songs of a playlist in a state [
66,
157].
At the time of writing, RL algorithms for APG-I are promising but relatively under-explored. Their need for reward data is their greatest limitation, and it is not clear that calculating rewards from other data is an adequate substitute for human reward data.
Having now reviewed the different algorithm types, we finish this section with some topics that are algorithmic but which cut across the different algorithm types.
Discussion: Trends. Our survey of algorithm types enables us to identify some trends in the research. We gave an overview of these in
Section 3, where we were contrasting our survey with previous ones. But, now, at the risk of some repetition, we can use
Table 3 and our presentation of the algorithms above to confirm them. There are differences between the research up to 2013 (the
first period), which was already surveyed by Bonnin and Jannach [
39], and the research from 2014 onward (the
second period), that we exclusively survey. We can explain the changes in terms of at least three factors: (1) the paradigm shift in music consumption from small personal music collections to streaming services, requiring algorithms that scale well to millions of candidate songs; (2) consequently, the availability of certain kinds of data, most notably usage data, such as the MPD, released in 2018 for the ACM RecSys Challenge, which enables approaches that train on collections of manually constructed playlists; and (3) the rise of deep learning across ML in general.
Accordingly, work on Similarity algorithms has declined a little since the first period, perhaps because some approaches do not scale well. The more recent Similarity algorithms use embeddings learned from datasets of manually constructed playlists. Use of CF algorithms has grown enormously, exploiting, for example, the MPD, mentioned above. A little of the CF work uses nearest-neighbors methods, but these do not always scale well. Instead, MF has become common, and the most recent work combines MF with models for re-ranking or with multi-layered neural networks, autoencoders, and the like. Discrete optimization methods have largely not survived the transition to music streaming services. When there are millions of songs, it is nearly impossible to use discrete optimization algorithms, as their worst-case computational complexity is exponential.
Approaches that try to learn from the order in which songs appear in playlists have changed greatly. Neural approaches to sequence modeling, such as recurrent neural networks and transformers, have superseded frequent pattern mining and, to some extent, statistical models. Finally, RL, which seems to promise much in terms of modeling and of handling user feedback, has grown strongly.
Discussion: Hybrids. The types of algorithms we reviewed differ in their performance. There is, for example, some evidence that CF and sequence modeling excel in generating high-quality playlists, especially in the case where the target characteristics are specified as a list of seed songs [
39,
364]. However, it is wrong to consider one type of algorithm to be superior to another. A more correct view is to consider them as complementary. Algorithms of different types usually leverage different sources of background knowledge. For example, while CF algorithms are mostly limited to usage data, similarity algorithms can easily include content-based data. Also, algorithms of different types usually accommodate different ways of specifying the target characteristics. For example, while CF algorithms are mostly limited to the case in which a static playlist is generated from a list of seed songs, RL algorithms can generate dynamic playlists that adapt to the user feedback in real time.
It is therefore often necessary to employ hybrids that combine algorithms of different types, or that use different data, in different circumstances, depending on the background knowledge available and on the way the target characteristics are specified. For example, Schedl et al. [
304] use a similarity-based algorithm to generate a playlist, which is then adapted in real time based on the contextual information, gathered from sensors and processed by a statistical model. Frequently, different types of algorithms are combined with the goal of attaining playlists of higher quality [
172,
248,
326,
364]. One common way of combining algorithms is to compute a weighted average of their predictions [
229,
351].
Deep learning is often used as a powerful tool to combine heterogeneous features and information sources [
320]. For example, as noted above, Vall and his co-authors [
337,
339] combine song embedding representations extracted from content-based data, metadata, and manually curated playlists by means of a deep feed-forward neural network, so as to model the probability that a specific song is a good fit for a specific playlist.
We refer the reader to [
44] for an understanding of the possible ways in which RS algorithms can be combined, many of which can be adapted to APG-I.
Discussion: Re-Ranking. Finally, we discuss re-ranking, which can be thought of as a particular type of hybrid algorithm. Systems that use re-ranking typically have a two-stage architecture. In the first stage, a model ranks the candidate songs for their relevance to a playlist. In the second stage, the candidates are re-ranked by a second algorithm, typically using data that was not used in the first stage.
In APG-I, there are at least two motivations for using re-ranking. One motivation is to improve scalability. The first stage would use a model that can score all the songs in the catalogue for relevance but at speed. A common choice is an MF model. Only those candidates with the highest scores from the MF model are passed to the second stage, where they are ranked by a model that takes different data into account and may not operate as quickly as the model in the first stage [
298,
351].
The second motivation for re-ranking is to improve the top-
\(n\) song recommendations that are selected for display to the user of, e.g., an APC system. Songs appearing lower in the ranking produced by the first stage might be ‘promoted’ in order to produce a set of
\(n\) recommendations that satisfy additional criteria. In [
187], for example, the intuition is that the set of songs that is recommended for continuation of a playlist should match the level of diversity of songs that are already in the playlist. Kaya and Bridge use sub-profile aware diversification [
188] to implement this intuition, measuring an increase in accuracy. A similar approach is taken in [
229], where diversity is measured by means of content-based features. The intuition of [
116] is, instead, that some songs recommended for continuation of a playlist should be familiar to the user, while others should be novel. All these papers implement their intuitions using a re-ranking approach.
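A generic two-stage scheme can be sketched as follows: a fast first-stage scorer shortlists candidates, and a greedy second stage re-ranks them by trading off first-stage relevance against diversity (the scores, features, and trade-off weight are placeholders; the cited papers implement more refined criteria).

```python
# A two-stage sketch: stage 1 shortlists by fast relevance scores (e.g. MF
# dot products); stage 2 greedily re-ranks, rewarding distance from songs
# already picked.
import numpy as np

rng = np.random.default_rng(0)
n_songs = 10_000
first_stage_scores = rng.random(n_songs)    # stand-in relevance scores
features = rng.normal(size=(n_songs, 16))   # stand-in content-based features

shortlist = np.argsort(first_stage_scores)[::-1][:500]   # stage 1: top 500

def rerank(shortlist, n=10, weight=0.5):
    """Greedy re-ranking trading off relevance against diversity."""
    picked = [shortlist[0]]
    while len(picked) < n:
        def gain(i):
            diversity = min(np.linalg.norm(features[i] - features[p])
                            for p in picked)
            return first_stage_scores[i] + weight * diversity
        rest = [i for i in shortlist if i not in picked]
        picked.append(max(rest, key=gain))
    return picked

top_n = rerank(shortlist)
print(top_n)
```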
5.1.4 Evaluating APG-I Research.
Up to now, we have referred to playlist quality as the way we would measure the performance of APG-I algorithms. Playlist quality is, however, an ill-defined concept, difficult to pin down to a mathematical definition that would allow its measurement. In fact, playlist quality depends on the musical tastes of the user, on the user's familiarity with the music [
356], and on the listening context [
2]. For example, two different listeners may rate the quality of the same playlist differently, because they may have different musical preferences, because they may already know the songs, or because they are listening to the playlist in two different locations, e.g., at the beach or on the bus.
In their 2014 survey on APG-I, Bonnin and Jannach [
39] review the different strategies for measuring the quality of playlists. They identify three categories of evaluation protocols:
(1)
User studies, where users are involved in rating the quality of playlists;
(2)
Objective measures, where statistics of the constructed playlists are computed, under the assumption that those statistics (e.g., coherence or diversity of the songs’ musical genres) reflect the notion of playlist quality; and
(3)
Ground truth playlists, where algorithms are tested for how well they can recreate manually constructed playlists or listening logs, under the assumption that the manually constructed playlists or listening logs reflect a gold standard.
These three categories are still valid today, covering also the evaluation protocols in papers published from 2014 onward. In the following, we focus on how APG-I algorithms are evaluated in the papers that are exclusive to our survey, i.e., papers published from 2014 onward.
Evaluation protocol (3) is probably the most common and can be described as a three step procedure: (a) preparation, in which a number \(N\) of songs are withheld from a ground truth playlist; (b) recommendation, in which an APG-I algorithm is used to get a ranked list of \(K\) candidate songs to be added to the playlist; (c) scoring, in which metric \(M\) is used to measure the fitness of the \(K\) recommended songs relative to the \(N\) withheld songs. \(K\) can assume any value from \(1\) to the size of the song catalogue. \(N\) can assume any value from \(1\) to the playlist size. The three steps are repeated for every ground truth playlist in the dataset, and the resulting values of \(M\) are averaged.
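As a concrete instance of this protocol, the sketch below computes hit-rate with \(N\) withheld songs and \(K\) recommendations, using a trivial popularity recommender as a stand-in for a real APG-I algorithm (all data are toy examples).

```python
# A sketch of evaluation protocol (3) with M = hit-rate: withhold N songs
# per ground-truth playlist, recommend K, and check for overlap.
def hit_rate(ground_truth_playlists, recommender, n_withheld=1, k=10):
    hits = 0
    for playlist in ground_truth_playlists:
        seed, withheld = playlist[:-n_withheld], set(playlist[-n_withheld:])
        recommended = recommender(seed, k)          # ranked list of K songs
        hits += bool(withheld & set(recommended))   # a hit if any overlap
    return hits / len(ground_truth_playlists)

# A trivial popularity recommender, standing in for a real algorithm.
playlists = [["a", "b", "c"], ["b", "c", "d"], ["a", "c", "d"]]
popularity = {"c": 3, "a": 2, "b": 2, "d": 2}

def popularity_recommender(seed, k):
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    return [s for s in ranked if s not in seed][:k]

print(hit_rate(playlists, popularity_recommender, n_withheld=1, k=2))
```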
Different instances of evaluation protocol (3) differ for the choice of
\(N\),
\(K\) and
\(M\). For example, in a comparative evaluation of APG-I algorithms, Bonnin and Jannach set
\(N\) to
\(1\) and allowed
\(K\) to range from
\(1\) to 1,000 [
38,
39]. They used hit-rate as
\(M\), which, for a ground truth playlist, measures whether the set of
\(K\) recommended songs contains the withheld song. They reported the percentage of ground truth playlists for which there was a hit. By contrast, in the ACM RecSys Challenge 2018,
\(N\) is different for each ground truth playlist,
\(K\) is set to
\(500\), and
\(M\) is set to a number of different metrics related to hit-rate, including Normalized Discounted Cumulative Gain and a metric they called R-precision [
364].
In general, there is no agreement on what combination of
\(N\),
\(K\) and
\(M\) to use, but there is work that offers insights into some combinations to adopt or avoid. For example, Bonnin and Jannach show that using average log-likelihood as
\(M\), which was used in some other work [
238], leads to conclusions that are inconsistent with those from hit-rate-related metrics, and recommend avoiding its use [
38,
39]. Kamehkhosh and Jannach carry out a user trial where users are asked to choose the most appropriate continuation for a playlist among four song alternatives, three of which are generated using an algorithm and the last is the withheld song [
183]. They find that users are likely to select the withheld song as a favorite continuation. Their experiment provides evidence that the choice of
\(N\) as
\(1\),
\(K\) as
\(1\), and
\(M\) as hit-rate is a reliable setting. In contrast, Vall et al. criticize the choice of setting a specific cut-off for
\(K\), by showing that the relative ordering in performance of a set of competing algorithms changes when varying
\(K\) from
\(1\) to the size of the song catalogue, while keeping
\(N\) fixed to
\(1\) and
\(M\) to be hit-rate [
341].
Evaluation protocol (3) is explicitly designed to work in the case where the target characteristics are specified using seed songs, which is the most common scenario in the recent literature, see
Table 2. However, evaluation protocol (3) can be adapted to work when the target characteristics are specified in different ways. For example, Chung et al. propose an APG-I algorithm for which the target characteristics are input as free-form text, and they evaluate the algorithm using ground truth playlists, setting
\(N\) to the playlist size,
\(K\) to vary from
\(5\) to
\(20\), and hit-rate as
\(M\) [
73].
Evaluation protocol (3) works under the assumption that the ground truth playlists represent a gold standard, and thus the ability to replicate those playlists reflects the ability to construct high-quality playlists. However, the assumption may be too strong in some circumstances. For example, Hagen finds that manually constructed playlists often do not have clearly defined target characteristics [
140]. For example, they may sometimes be used as a randomly arranged container of the user's favorite music, created for convenience of access. Similarly, although listening logs can be treated as playlists, they may contain spurious interactions, such as songs recommended by the automatic continuation features of streaming services during periods when the user is not paying attention to the recommendations. In some work, listening logs are filtered before running the evaluation, for example by removing skipped songs [
41]. Ideally, the quality of the ground-truth playlists must be checked before running the evaluation, which circles back to the original question of how to evaluate playlist quality. One guideline to distinguish suitable datasets of manually constructed playlists for evaluation is offered in [
81], as they find that playlists manually constructed by users for sharing with other users usually meet high standards of quality and have clearly defined target characteristics.
Lastly, evaluation protocol (3) is undermined to a degree by several biases, most notably by popularity bias [
31]. Since most of the ground truth playlists tend to contain popular songs [
51], an APG-I strategy that constructs playlists in a popularity-driven fashion will usually yield good performance [
38,
39]. However, while a playlist with popular songs would satisfy a large share of users, it would not suit minorities of listeners. There exist several strategies for de-biasing evaluation protocol (3) with respect to popularity, for example see [
60]. A simple strategy is used in [
341], where evaluation protocol (3) is run separately for popular songs and for the rest of the songs, finding that the relative ranking in performance of algorithms changes for these two segments of the song catalogue.
Evaluation protocol (2) is sometimes adopted for evaluating APG-I algorithms. It works by computing statistics on the constructed playlists, under the assumption that those statistics reflect aspects of playlist quality. For example, several papers measure song diversity and coherence from song tags or musical genres [
73,
170,
171,
185]. Additionally, there are papers that measure song popularity [
73], and others that measure song novelty, i.e., the degree to which songs are unknown to the listener, and freshness, i.e., the degree to which the songs are recently released [
312]. These statistics do offer insights into the characteristics of constructed playlists, but it is not clear how those characteristics relate to the concept of playlist quality. For example, it is not clear what value of song diversity in a playlist is ideal: research suggests that songs should be somewhat diverse, while staying coherent overall [
86,
182]. See
Section 6.1.2 for a discussion of coherence and diversity.
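As one concrete example of such a statistic, intra-list diversity is often computed as the mean pairwise distance between the feature vectors of the songs in a playlist; the sketch below implements this common variant (one of several in the literature).

```python
# Intra-list diversity as the mean pairwise Euclidean distance between the
# songs' feature vectors (one common variant among several).
import numpy as np
from itertools import combinations

def intra_list_diversity(playlist_features: np.ndarray) -> float:
    if len(playlist_features) < 2:
        return 0.0
    dists = [np.linalg.norm(playlist_features[i] - playlist_features[j])
             for i, j in combinations(range(len(playlist_features)), 2)]
    return float(np.mean(dists))

print(intra_list_diversity(np.array([[0.0, 0.0], [3.0, 4.0]])))  # 5.0
```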
Evaluation protocol (1) consists of user studies. For example, RL algorithms require real-time interaction with the listener and are evaluated with user studies in which the quality of an algorithm is estimated by monitoring implicit user signals, such as the number of skips [
192], or by explicitly asking the user if they like the music or not [
221,
222]. As another example, Ikeda et al. employ a user study to evaluate the smoothness of song-to-song transitions in playlists [
161]. However, user studies have the disadvantages that they may not be reproducible and that they are costly and time-consuming. Another strategy for involving users in the evaluation of playlists is via A/B tests, in which users of streaming services are partitioned, and each partition receives playlists constructed by a different algorithm. Users in each partition are monitored for their engagement with the playlists, for example by monitoring the play counts [
40], which is an indicator of playlist quality. Unfortunately, A/B tests require resources not accessible to most researchers, such as the availability of a music streaming service platform in which the A/B test can be conducted. A middle ground between an A/B test and evaluation protocol (3) is counterfactual evaluation, which allows estimation of the performance of a candidate algorithm, as if it were in production, by relying on listening logs extracted from whatever algorithm is currently in production. Researchers at Spotify share their recipe for counterfactual evaluation in [
138], showing that its estimates correlate highly with the outcomes of actual A/B tests.
This concludes our survey of APG for individual users. We turn now to the case of APG for groups of users.