In an era of ever-increasing digital music catalogs, the task of discovering new music tailored to individual preferences can be daunting. This capstone project addresses this challenge by developing a music recommendation system capable of proposing the top 10 songs to a user based on their predicted listening likelihood. Our goal is to enrich the user's music exploration experience by providing fast, accurate, and personalized recommendations.
The core problem is to leverage data science to build a recommendation system that effectively suggests the top 10 songs to a user, aligned with their probable listening habits.
To achieve this, we addressed several key questions:
- How do we define the musical preference of a user?
- What factors contribute to a user's musical preference?
- What data is most useful when defining a user's preference?
- What musical content is a user most likely to consume?
Our solution involved extensive data preprocessing, exploration of various recommendation algorithms, rigorous hyperparameter tuning, and detailed performance evaluation.
The project utilizes the Taste Profile Subset from the Million Song Dataset.
Data Source: http://millionsongdataset.com/
- `song_data`: contains details about songs.
  - `song_id`: a unique ID for every song.
  - `title`: title of the song.
  - `release`: name of the released album.
  - `artist_name`: name of the artist.
  - `year`: year of release.
- `count_data`: records user listening activity.
  - `user_id`: a unique ID for the user.
  - `song_id`: a unique ID for the song.
  - `play_count`: number of times the song was played by the user.
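For orientation, the two raw tables can be loaded with pandas as below (a minimal sketch; the `data/` paths match the repository layout described under Installation and may differ in your setup):

```python
import pandas as pd

# Load the two raw tables of the Taste Profile Subset.
count_df = pd.read_csv("data/count_data.csv")  # user_id, song_id, play_count
song_df = pd.read_csv("data/song_data.csv")    # song_id, title, release, artist_name, year

print(count_df.shape, song_df.shape)
```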
`count_df` observations:
- Contains 2,000,000 entries.
- Contains 10,000 unique song IDs, implying approximately 200 entries (user listens) per song on average.
- No missing values.
- Average `play_count` is 3, with a minimum of 1 and a maximum of 2,213 (a potential outlier).
`song_df` observations:
- Contains 1,000,000 entries.
- Contains 999,056 unique song IDs and 702,428 unique song titles, indicating multiple songs share the same title.
- Minor missing values (15 for `title`, 5 for `release`).
- Year-based observations: some entries carry a release year imputed as `0`; these are handled during preprocessing.
The raw datasets were preprocessed through the following stages to prepare them for modeling:
1. Merge DataFrames: `count_df` and `song_df` were merged (left merge on `song_id`), after dropping duplicate `song_id` entries from `song_df` and the `Unnamed: 0` column.
2. Encode IDs: the `user_id` and `song_id` attributes were label-encoded from strings to numeric values.
3. Filter Dataset:
   - 3a. Users who listened to fewer than 90 songs were removed.
   - 3b. Songs listened to by fewer than 120 users were removed.
4. Generate Rating Scale: `play_count` was re-scaled to a 1-5 rating by restricting the data to users who listened to a song at most 5 times.
5. Drop Imputed Years: entries with `year = 0` were removed.
These steps reduced the dataset from 2,000,000 rows to 97,227 rows (4.86% of the original size), keeping the computation tractable; a condensed pandas sketch follows the table below.
| Step | Shape before step | Shape after step |
|---|---|---|
| 3a | (2000000, 7) | (438390, 7) |
| 3b | (438390, 7) | (130398, 7) |
| 4 | (130398, 7) | (117876, 7) |
| 5 | (117876, 7) | (97227, 7) |
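The pipeline can be condensed into the following pandas sketch (column names follow the schema above; the notebook's actual implementation may differ in details):

```python
from sklearn.preprocessing import LabelEncoder

# Step 1: left-merge listening counts with song metadata on song_id.
song_meta = song_df.drop_duplicates(subset="song_id")
df = count_df.merge(song_meta, on="song_id", how="left")
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

# Step 2: label-encode the string IDs to compact integers.
df["user_id"] = LabelEncoder().fit_transform(df["user_id"])
df["song_id"] = LabelEncoder().fit_transform(df["song_id"])

# Step 3a: keep users with at least 90 listened songs.
user_counts = df.groupby("user_id")["song_id"].count()
df = df[df["user_id"].isin(user_counts[user_counts >= 90].index)]

# Step 3b: keep songs listened to by at least 120 users.
song_counts = df.groupby("song_id")["user_id"].count()
df = df[df["song_id"].isin(song_counts[song_counts >= 120].index)]

# Step 4: treat play_count as a 1-5 rating by capping at 5 plays.
df = df[df["play_count"] <= 5]

# Step 5: drop entries whose release year was imputed as 0.
df = df[df["year"] != 0]
```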
After preprocessing, the final dataset contains:
- 3,154 unique users.
- 478 unique song IDs.
- 476 unique song titles (confirming duplicate titles for different song IDs).
- 185 unique artists.
- No missing values.
Our recommendation system workflow follows standard machine learning practices:
- A training set is fed to a recommendation algorithm, which produces a recommendation model.
- A held-out test set is used to generate predictions from the learned model for each user-item pair.
- Predictions with known true values are then input into an evaluation algorithm to produce performance results.
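With the Surprise library (used throughout this project), this workflow can be sketched as follows, where `df` is the preprocessed DataFrame from the sketch above:

```python
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split

# Wrap the (user, item, rating) triples for Surprise; ratings live on a 1-5 scale.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[["user_id", "song_id", "play_count"]], reader)

# Split, train, and evaluate exactly as the workflow describes.
trainset, testset = train_test_split(data, test_size=0.20, random_state=42)
model = SVD(random_state=42)
model.fit(trainset)                 # recommendation algorithm -> model
predictions = model.test(testset)   # model + held-out pairs -> predictions
accuracy.rmse(predictions)          # predictions + true values -> performance
```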
This case study built and analyzed recommendation systems using four distinct categories of algorithms:
- Rank-Based Algorithm:
  - Recommendations are based on the overall popularity of a song, defined by the frequency and average of its play counts.
- Collaborative-Filtering Algorithms:
  - Similarity/Neighborhood-Based:
    - User-user similarity (and an improved version).
    - Item-item similarity (and an improved version).
  - Model-Based:
    - Matrix Factorization: Singular Value Decomposition (SVD) and its optimized version.
- Cluster-Based Algorithm:
  - Groups similar users based on listening patterns and recommends songs popular within the same cluster.
- Content-Based Recommendation System:
  - Recommends items based on the features/attributes of the items themselves, as sketched below.
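As an illustration of the content-based idea, here is a simplified sketch using TF-IDF and cosine similarity over song metadata (both listed under Technologies Used below); this is not the notebook's exact pipeline, and the choice of text fields as features is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Build one text document per song from its metadata fields.
songs = df.drop_duplicates(subset="song_id").reset_index(drop=True)
docs = (songs["title"] + " " + songs["release"] + " " + songs["artist_name"]).fillna("")

# TF-IDF turns each song into a sparse feature vector.
tfidf = TfidfVectorizer(stop_words="english")
features = tfidf.fit_transform(docs)

def similar_songs(title, k=10):
    """Return the k songs most similar to `title` by cosine similarity."""
    idx = songs.index[songs["title"] == title][0]
    scores = cosine_similarity(features[idx], features).ravel()
    best = scores.argsort()[::-1][1 : k + 1]  # skip the query song itself
    return songs.iloc[best][["title", "artist_name"]]
```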
The Surprise library was extensively used to demonstrate all Collaborative Filtering algorithms. Grid Search Cross-Validation was applied to find optimal hyperparameters and improve model performance.
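For example, Surprise's `GridSearchCV` tunes an algorithm's hyperparameters like this (the parameter values shown are illustrative, not the grid actually used):

```python
from surprise import SVD
from surprise.model_selection import GridSearchCV

param_grid = {
    "n_factors": [20, 50, 100],  # number of latent features
    "lr_all": [0.002, 0.005],    # learning rate for all parameters
    "reg_all": [0.02, 0.1],      # regularization for all parameters
}

gs = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3)
gs.fit(data)  # `data` is the Surprise Dataset built earlier

print(gs.best_score["rmse"])   # best cross-validated RMSE
print(gs.best_params["rmse"])  # hyperparameters that achieved it
```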
A key focus was on improving system performance. We conducted extensive experiments via a Grid Search across critical hyperparameters:
- Train/Test Data Split Ratio: `80/20`, `70/30`, or `60/40`.
- Handling Imputed Year Values (`year = 0`): `True` (drop) or `False` (keep).
- Relevance Threshold Parameter (for `precision@k` and `recall@k`): `1.0`, `1.1`, `1.2`, `1.3`, `1.4`, `1.5`.
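The split ratio and year-dropping flag are data-level settings rather than Surprise model hyperparameters, so a natural way to run this grid is an outer loop over data configurations; a sketch of that experiment structure (metric computation elided, exact mechanics assumed):

```python
from itertools import product

split_ratios = [0.20, 0.30, 0.40]            # test fractions for 80/20, 70/30, 60/40
drop_year_options = [True, False]            # drop entries with year == 0?
thresholds = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]  # relevance threshold for precision/recall@k

for test_size, drop_year, threshold in product(split_ratios, drop_year_options, thresholds):
    subset = df[df["year"] != 0] if drop_year else df
    # ... build the Surprise Dataset from `subset`, split with `test_size`,
    # fit each model, and score precision@k / recall@k at `threshold` ...
```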
Key Takeaway from Tuning: The relevance threshold parameter was found to be the most influential hyperparameter affecting performance metrics, with other parameters yielding similar results regardless of their choice. This insight is attributed to the inherent characteristics of the data.
Final Hyperparameter Values for Analysis:
- Train/Test Data Split Ratio: `80/20`
- Dropping Imputed Year Values (`year = 0`): `True`
- Relevance Threshold Parameter: `1.0`
These values yield comparable results across models, making even the simplest models viable for deployment.
For recommendation systems, evaluating the top-N recommendations is crucial. Therefore, we focused on Precision@k, Recall@k, and their harmonic mean, the F1-score@k.
Definitions:
- `Precision@k`: the proportion of recommended items within the top-k set that are actually relevant to the user.
  - Formula: `(Number of recommended items that are relevant) / (Total number of recommended items)`
- `Recall@k`: the proportion of all relevant items that were successfully found within the top-k recommendations.
  - Formula: `(Number of recommended items that are relevant) / (Total number of relevant items)`
- `F1-score@k`: the harmonic mean of Precision@k and Recall@k, providing a single metric that balances both.
  - Formula: `2 * (Precision@k * Recall@k) / (Precision@k + Recall@k)`
An item is considered relevant if its true rating `r_ui` is greater than a given threshold. An item is considered recommended if its estimated rating `r_hat_ui` is greater than the threshold AND it is among the k highest estimated ratings.
Note: In edge cases where division by zero occurs, Precision@k and Recall@k values are conventionally set to 0.
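Following these definitions, Precision@k and Recall@k can be computed per user from Surprise predictions roughly as below (a sketch close to the recipe in the Surprise documentation; the relevance rule uses the strict inequality stated above):

```python
from collections import defaultdict

def precision_recall_at_k(predictions, k=10, threshold=1.0):
    """Compute Precision@k and Recall@k per user from Surprise predictions."""
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, ratings in user_est_true.items():
        ratings.sort(key=lambda x: x[0], reverse=True)  # sort by estimated rating
        n_rel = sum(true_r > threshold for _, true_r in ratings)
        n_rec_k = sum(est > threshold for est, _ in ratings[:k])
        n_rel_and_rec_k = sum((true_r > threshold) and (est > threshold)
                              for est, true_r in ratings[:k])
        # Division-by-zero edge cases are conventionally set to 0.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel else 0
    return precisions, recalls
```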
We performed an in-depth analysis of Precision@k and Recall@k, varying k from 10 to 40 in increments of 10. A significant takeaway is that at k = 40, all tested models achieved Precision@40 = Recall@40 = 1: within the top 40 recommendations, every recommended item is relevant and every relevant item is recommended. Similarly, for a 60/40 data split, models reached this perfect score at k = 60.
The table below summarizes the evaluation metric results for the models. Detailed analysis and visualizations are available in the project's Jupyter Notebook.
| Metric | User-User | User-User-Optimized | Item-Item | Item-Item-Optimized | SVD | SVD-Optimized | CoCluster | CoCluster-Optimized |
|---|---|---|---|---|---|---|---|---|
| RMSE | 1.088 | 1.041 | 1.024 | 1.02 | 1.009 | 1.009 | 1.033 | 1.04 |
| Precision@10 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Recall@10 | 0.948 | 0.948 | 0.948 | 0.948 | 0.948 | 0.948 | 0.948 | 0.948 |
| F1-score@10 | 0.973 | 0.973 | 0.973 | 0.973 | 0.973 | 0.973 | 0.973 | 0.973 |
| Recall@20 | 0.994 | 0.994 | 0.994 | 0.994 | 0.994 | 0.994 | 0.994 | 0.994 |
| F1-score@20 | 0.997 | 0.997 | 0.997 | 0.997 | 0.997 | 0.997 | 0.997 | 0.997 |
| Recall@30 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| F1-score@30 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
| Recall@40 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| F1-score@40 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
All models performed exceptionally well, particularly at higher k values.
| Algorithm | min_predicted_play_count | max_predicted_play_count | Predicted rating for (user 6958, song 1671, actual 2) | Predicted rating for (user 6958, song 3232, not yet heard) |
|---|---|---|---|---|
| user-user | 1.67 | 1.67 | | |
| user-user-optimized | 2.69 | 3.05 | 1.91 | 1.29 |
| item-item | 1.31 | 1.3 | | |
| item-item-optimized | 2.04 | 2.72 | 1.91 | 1.29 |
| SVD | 1.31 | 1.23 | | |
| SVD-optimized | 2.04 | 2.72 | 1.36 | 1.44 |
| coCluster | 1.2 | 1.02 | | |
| coCluster-optimized | 1.98 | 3.14 | 1.0 | 1.0 |
Based on predictions for user 6958, the user-user-optimized model provided the most accurate prediction for a known song (1.91 vs. 2.0 actual).
We propose the Singular Value Decomposition (SVD) model as a robust and ready-to-use solution. SVD is a personalized, model-based collaborative filtering algorithm that relies on latent features extracted from past user behavior, making recommendations independent of additional external information.
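Producing the top 10 songs for a single user from a trained SVD model can be sketched as follows (a minimal version; `model` and `df` refer to the earlier snippets, and the helper function is ours):

```python
def top_10_songs(model, df, user_id):
    """Rank songs the user has not yet heard by predicted rating."""
    heard = set(df.loc[df["user_id"] == user_id, "song_id"])
    candidates = [s for s in df["song_id"].unique() if s not in heard]
    scored = [(s, model.predict(user_id, s).est) for s in candidates]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:10]

print(top_10_songs(model, df, user_id=6958))
```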
An example implementation strategy would involve:
- Establishing a Goal: e.g., implementing the model to recommend songs when a `genre` feature is added.
- Scheduling Milestones: e.g., adding a new feature and testing its usefulness every month.
- Defining Success: e.g., successful implementation and training in 4 weeks or less; model still in use 3 months or more after implementation.
- Re-assessing Metrics: continuously evaluating the time, cost, and human resources needed for implementation.
The current analysis was performed on Google Colab (free tier). For a large-scale recommendation system, a cloud solution (e.g., IaaS, PaaS, or SaaS) is recommended over an on-premise setup due to data volume, efficiency, and cost offsets in maintenance. A detailed cost analysis would require an economist or business analyst.
- Threshold Parameter Variability: the optimal relevance threshold (currently `1.0`) may need to change if more data is added, requiring re-evaluation of all models.
- Content-Based Accuracy Dilemma: our content-based model, trained for top 10 recommendations, sometimes provided only half as many genuinely similar songs (e.g., 5 for "Learn To Fly" instead of 10). This raises an ethical dilemma regarding whether to "recommend" or merely "suggest." Simple NLP techniques could improve this.
We believe the system can be significantly enhanced by:
- Incorporating a Larger NLP Component: Utilizing ready-to-use NLP models could bring substantial benefits.
- Exploring Neural Networks: While computationally expensive, Neural Networks could improve recommendations if cost is not a limiting factor.
- Expanding Data Sources:
  - Including more original features like `danceability`, `energy`, `loudness`, `mode`, `similar artists`, `song hotttnesss`, and `tempo` for a more robust content-based model.
  - Integrating Spotify-related data, such as `genre` information (missing from the current dataset), via vast Spotify datasets or the `Spotipy` library.
  - Utilizing data from radio stations (new songs, oldies) or music discovery apps like `Shazam` to capture user preferences.
- Genre-Based Clustering: A cluster-based system using genre information would be beneficial for users with limited musical tastes.
- Sentiment Analysis: For users with eclectic tastes, sentiment analysis could provide more accurate predictions.
- Playlist-Based Recommendations: Instead of individual songs, recommending entire playlists, potentially integrated with IoT wearable device data (e.g., BPM to suggest "Relaxing/Meditative" or "High Energy/Dancing" playlists).
- Programming Language: Python
- Data Manipulation: Pandas, NumPy
- Data Preprocessing: scikit-learn (`LabelEncoder`)
- Recommendation Algorithms: Surprise (`KNNBasic`, `SVD`, `CoClustering`, `Reader`, `Dataset`, `GridSearchCV`, `RandomizedSearchCV`, `accuracy`, `train_test_split`)
- Natural Language Processing: NLTK (`punkt`, `stopwords`, `wordnet`, `word_tokenize`, `WordNetLemmatizer`), `re` (regular expressions), `TfidfVectorizer` (from scikit-learn)
- Similarity Calculation: scikit-learn (`cosine_similarity`)
- Visualization: Matplotlib, Seaborn
- Environment: Google Colab
```
├── music_recommendation_system_final_submission.ipynb  # The main Jupyter Notebook for the project.
├── data/                                               # Directory for raw and processed datasets (e.g., song_data.csv, count_data.csv).
├── README.md                                           # This README file.
└── requirements.txt                                    # Lists Python package dependencies for reproducibility.
```
To set up and run this project locally:
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/your-repo-name.git
   cd your-repo-name
   ```

   (Replace `yourusername` and `your-repo-name` with your actual GitHub details.)

2. Obtain the datasets:

   - The `data/` directory in this repository contains placeholder files for `song_data.csv` and `count_data.csv`. The actual files are very large and are not directly included in this repository.
   - Please download the Taste Profile Subset from the Million Song Dataset website.
   - Place the downloaded `song_data.csv` and `count_data.csv` files inside the `data/` subdirectory within your cloned repository.

3. Install dependencies:

   - Ensure you have Python (3.6+) and `pip` installed.
   - Install all required libraries using the `requirements.txt` file:

     ```bash
     pip install -r requirements.txt
     ```

4. Run the Jupyter Notebook:

   - Launch Jupyter Notebook or JupyterLab from your repository's root directory:

     ```bash
     jupyter notebook
     ```

     or

     ```bash
     jupyter lab
     ```

   - Open `music_recommendation_system_final_submission.ipynb` and run all cells to execute the analysis, modeling, and results generation.
Viewing on GitHub:
- You can directly view the `music_recommendation_system_final_submission.ipynb` file on GitHub, as it supports rendering Jupyter Notebooks directly in the browser.
This project is licensed under the MIT License - see the LICENSE file for details.
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.