Final-Term Project Topics

The document outlines eight distinct projects focused on various data science techniques, including hybrid recommender systems, Bayesian networks, anomaly detection, and clustering methods. Each project includes a description, dataset information, detailed student instructions, and submission guidelines for Google Classroom. Global submission guidelines emphasize consistency in file naming and the importance of original work and proper citation.

Uploaded by

tuthinh2004

Project 1: Hybrid Recommender System for Movie Ratings Using Weighted Fusion
(Session 2)

Description: Develop a hybrid recommender system that integrates content-based filtering,
derived from movie genre features, with collaborative filtering based on user rating patterns.
Apply a weighted fusion approach, assigning 60 percent weight to content-based predictions
and 40 percent to collaborative predictions, to mitigate challenges like recommending for
new users. Evaluate performance using root mean square error and precision at the top 10
recommendations. This project aligns with session discussions on hybrid methods to
enhance recommendation accuracy and address individual approach limitations.

Dataset: The MovieLens 100K dataset, available at MovieLens 100K, contains 100,000
ratings from 943 users on 1,682 movies, along with metadata like genres. It is suitable due
to its representation of user-item interactions and sparsity, ideal for testing hybrid systems.

Detailed Student Instructions:

1. Data Preprocessing Step: Import necessary libraries for data handling and
recommender tools. Load the ratings and movies files from the dataset. Address any
missing values in the genres column by replacing them with a default label such as
"unknown". Divide the data into training and testing sets using an 80-20 split to allow
for accurate model validation.
2. Content-Based Model Implementation Step: Transform movie genres into
numerical representations through text vectorization. Calculate similarity scores
between movies based on these representations to form the basis for content-based
recommendations.
3. Collaborative Model Implementation Step: Train a matrix factorization model on
the training set to identify patterns in user ratings and predict preferences for unseen
items.
4. Hybrid Fusion Step: For each prediction, compute a combined score as a weighted
average of the content-based and collaborative predictions, starting from the 60/40
weighting in the description and prioritizing content-based results in cases of sparse
data. Experiment with different weights and select the optimal combination based on
preliminary performance checks.
5. Evaluation Step: Generate predictions on the test set and assess error using root
mean square error. Additionally, measure precision at the top 10 by determining how
many recommended items align with actual high ratings from users.
6. Report Preparation Step: In the report, explain the rationale for fusion, such as how
it overcomes data sparsity issues from the session. Include visualizations of sample
recommendation lists and compare results against standalone models.
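
The fusion and evaluation steps above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names, the toy predictions, and the rating threshold of 4.0 for a "high" rating are assumptions made for the example, and the two input prediction lists stand in for the outputs of your own content-based and collaborative models.

```python
import numpy as np

def hybrid_score(content_pred, collab_pred, w_content=0.6):
    """Weighted average of content-based and collaborative predicted ratings."""
    content_pred = np.asarray(content_pred, dtype=float)
    collab_pred = np.asarray(collab_pred, dtype=float)
    return w_content * content_pred + (1.0 - w_content) * collab_pred

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def precision_at_k(scores, true_ratings, k=10, threshold=4.0):
    """Fraction of the k highest-scored items whose actual rating is >= threshold."""
    scores = np.asarray(scores, dtype=float)
    true_ratings = np.asarray(true_ratings, dtype=float)
    top_k = np.argsort(-scores)[:k]
    return float(np.mean(true_ratings[top_k] >= threshold))

# Toy predictions for one user on four held-out movies (made-up numbers).
content_pred = [4.0, 3.0, 5.0, 2.0]
collab_pred = [3.0, 4.0, 4.0, 1.0]
actual = [4.0, 3.0, 5.0, 1.0]

fused = hybrid_score(content_pred, collab_pred)  # 60/40 weighting
print("RMSE:", rmse(actual, fused))
print("Precision@2:", precision_at_k(fused, actual, k=2))
```

To experiment with other weights, call `hybrid_score` with different `w_content` values and compare the resulting error on held-out data.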

Submission on Google Classroom: Prepare a zip file with your project notebook, PDF
report, and any supporting files like visualizations. Name it "YourName_Project1.zip". Upload
to the assignment post by selecting "Add or create" and then "File". Click "Turn in" to
complete submission and check for confirmation via email.

Project 2: Content-Based Book Recommender with Cosine Similarity and Pearson
Correlation (Session 2)

Description: Create a content-based recommender system that utilizes text features from
book summaries for similarity assessments and incorporates rating correlations for
refinement. Evaluate using mean average precision at the top K recommendations, with a
focus on session metrics for recommendation diversity, such as overlap measures between
suggested items.

Dataset: The Book Recommendation Dataset, available at Book Recommendation Dataset,
includes ratings and summaries for approximately 278,000 users and books. It is suitable for
content-based approaches due to its rich textual metadata and rating data.

Detailed Student Instructions:

1. Preprocessing Step: Import libraries for data manipulation and statistical
computations. Load the ratings and books files, merging them on shared identifiers
like ISBN. Replace any missing summaries with empty strings to prevent processing
errors.
2. Feature Extraction Step: Convert book summaries into vector forms using text
processing techniques. Compute similarity matrices from these vectors and
correlations between user ratings for pairs of books.
3. Recommendation Step: For a given book, rank similar books by combining
similarity scores, giving higher priority to vector-based matches over rating
correlations.
4. Evaluation Step: On a reserved validation set, calculate mean average precision at
the top 10 by comparing recommended books to users' actual preferences.
5. Report Preparation Step: Discuss how these similarity assessments promote
relevant recommendations, referencing session analogies like adapting to user
tastes. Include performance tables for clarity.
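
As a rough sketch of the feature extraction, recommendation, and evaluation steps above: the example below computes cosine similarity on small term-count vectors, Pearson correlation on co-rated books, a combined ranking, and average precision at K. The toy matrices, the 0.7/0.3 combination weights, and all function names are assumptions for illustration only; in a real run the vectors would come from your text vectorizer and the ratings from the merged dataset.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # empty summaries get similarity 0 instead of NaN
    Xn = X / norms
    return Xn @ Xn.T

def pearson(a, b):
    """Pearson correlation over users who rated both books (NaN = not rated)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    both = ~np.isnan(a) & ~np.isnan(b)
    if both.sum() < 2:
        return 0.0
    a, b = a[both] - a[both].mean(), b[both] - b[both].mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def average_precision_at_k(recommended, relevant, k=10):
    """Average precision of one ranked list; average over users to get MAP@K."""
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / min(len(relevant), k) if relevant else 0.0

# Toy term-count vectors for three book summaries (rows = books).
X = np.array([[2, 1, 0], [1, 1, 0], [0, 0, 3]])
# Toy ratings (rows = users, columns = books); NaN marks a missing rating.
ratings = np.array([[5, 4, 1], [3, 2, np.nan], [1, np.nan, 5]], dtype=float)

sim = cosine_similarity_matrix(X)
corr_to_0 = np.array([pearson(ratings[:, 0], ratings[:, j]) for j in range(3)])
combined = 0.7 * sim[0] + 0.3 * corr_to_0  # vector similarity weighted higher
ranked = [int(j) for j in np.argsort(-combined) if j != 0]
print("Books most similar to book 0:", ranked)
```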

Submission on Google Classroom: Zip your notebook, report in PDF, and visuals as
"YourName_Project2.zip". Upload to the relevant assignment and turn in, ensuring you
receive a confirmation.

Project 3: Bayesian Network for Heart Disease Diagnosis with Variable Elimination
Inference (Session 3)

Description: Build a Bayesian network with nodes representing symptoms and disease
outcomes, estimating conditional probability distributions from data. Use variable elimination
to compute posterior probabilities given evidence. Evaluate with accuracy and area under
the receiver operating characteristic curve (ROC AUC).

Dataset: The Heart Disease UCI dataset, available at Heart Disease UCI, features 303
instances with 14 attributes like age and chest pain. It is suitable for probabilistic modeling
due to its clinical features and binary outcomes.

Detailed Student Instructions:

1. Model Setup Step: Define the network structure in a Bayesian modeling library,
discretizing continuous features into categories.
2. Parameter Estimation Step: Fit conditional probability distributions to the data, for
example with maximum likelihood or Bayesian estimation.
3. Inference Step: Perform queries to obtain posterior probabilities for the target variable
given specific evidence values.
4. Evaluation Step: Split the data for testing, then measure accuracy by comparing
predicted outcomes to actual ones and ROC AUC from the predicted posterior probabilities.
5. Report Preparation Step: Relate the process to session steps on network
construction and analogies like tracking preferences.
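
To make the inference step concrete, here is a small hand-written variable elimination routine over binary variables. It is a sketch under stated assumptions: the three-node chain (A = older patient, D = disease, C = chest pain), every probability in the tables, and the factor representation are invented for illustration and are not taken from the dataset. A Bayesian modeling library would normally handle this, and discretized features may have more than two categories.

```python
from itertools import product

# A factor is (variables, table): a tuple of variable names and a dict mapping
# a tuple of binary values (one per variable) to a non-negative number.

def multiply(f, g):
    f_vars, f_tab = f
    g_vars, g_tab = g
    out_vars = f_vars + tuple(v for v in g_vars if v not in f_vars)
    out_tab = {}
    for values in product((0, 1), repeat=len(out_vars)):
        row = dict(zip(out_vars, values))
        out_tab[values] = (f_tab[tuple(row[v] for v in f_vars)]
                           * g_tab[tuple(row[v] for v in g_vars)])
    return out_vars, out_tab

def sum_out(f, var):
    """Marginalize var out of factor f."""
    f_vars, f_tab = f
    idx = f_vars.index(var)
    out_tab = {}
    for values, p in f_tab.items():
        key = values[:idx] + values[idx + 1:]
        out_tab[key] = out_tab.get(key, 0.0) + p
    return f_vars[:idx] + f_vars[idx + 1:], out_tab

def restrict(f, var, value):
    """Keep only rows consistent with the evidence var=value, dropping var."""
    f_vars, f_tab = f
    if var not in f_vars:
        return f
    idx = f_vars.index(var)
    out_tab = {v[:idx] + v[idx + 1:]: p for v, p in f_tab.items() if v[idx] == value}
    return f_vars[:idx] + f_vars[idx + 1:], out_tab

def variable_elimination(factors, query, evidence, order):
    """Posterior over `query` given `evidence`, eliminating hidden variables in `order`."""
    factors = list(factors)
    for var, value in evidence.items():
        factors = [restrict(f, var, value) for f in factors]
    for var in order:
        related = [f for f in factors if var in f[0]]
        if not related:
            continue
        factors = [f for f in factors if var not in f[0]]
        prod = related[0]
        for f in related[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    assert result[0] == (query,)
    total = sum(result[1].values())
    return {k[0]: p / total for k, p in result[1].items()}

# Made-up conditional probability tables for the chain A -> D -> C.
f_a = (("A",), {(0,): 0.6, (1,): 0.4})
f_d = (("A", "D"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.7, (1, 1): 0.3})
f_c = (("D", "C"), {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.8})

posterior = variable_elimination([f_a, f_d, f_c], "D", {"C": 1}, order=["A"])
print("P(D=1 | C=1) =", posterior[1])
```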

Submission on Google Classroom: Use "YourName_Project3.zip" for your files, upload,
and turn in as described.

Project 4: Credit Risk Bayesian Network with Belief Propagation and Markov Chain
Monte Carlo (Session 3)

Description: Construct a Bayesian network for credit risk assessment, applying belief
propagation for exact inference and Markov chain Monte Carlo for approximate inference.
Compare their efficiency and evaluate using root mean square error on predicted values.

Dataset: The German Credit Data dataset, available at German Credit Data, includes 1,000
instances with 20 attributes. It is suitable for risk modeling due to its financial indicators.

Detailed Student Instructions:

1. Network Fitting Step: Estimate conditional distributions from the dataset.
2. Inference Step: Conduct queries using belief propagation for exact results and
Markov chain Monte Carlo for sampled approximations.
3. Evaluation Step: Assess runtimes and error metrics like root mean square error on
test data.
4. Report Preparation Step: Discuss session aspects like scalability and ethical
concerns such as bias in financial data.
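
The comparison between exact and sampled inference can be illustrated on a three-node toy network. The sketch below is an assumption-laden example: the network (H = bad credit history, U = unemployed, both parents of D = default) and all probabilities are invented, plain enumeration stands in for belief propagation (both are exact, so they return the same posterior here), and Gibbs sampling is used as one possible Markov chain Monte Carlo method.

```python
import random
import time

P_H = 0.3  # P(H=1), made-up value
P_U = 0.2  # P(U=1), made-up value
# P(D=1 | H, U), made-up values
P_D = {(0, 0): 0.05, (0, 1): 0.30, (1, 0): 0.40, (1, 1): 0.80}

def prior(p, value):
    return p if value == 1 else 1.0 - p

def exact_posterior_h():
    """P(H=1 | D=1) by summing the joint distribution over U."""
    joint = {h: sum(prior(P_H, h) * prior(P_U, u) * P_D[(h, u)] for u in (0, 1))
             for h in (0, 1)}
    return joint[1] / (joint[0] + joint[1])

def gibbs_posterior_h(n_samples=20000, burn_in=1000, seed=0):
    """Estimate P(H=1 | D=1) by Gibbs sampling over the hidden pair (H, U)."""
    rng = random.Random(seed)
    h, u = 0, 0
    hits = 0
    for step in range(n_samples + burn_in):
        # Resample H from P(H | U=u, D=1).
        w1 = P_H * P_D[(1, u)]
        w0 = (1.0 - P_H) * P_D[(0, u)]
        h = 1 if rng.random() < w1 / (w0 + w1) else 0
        # Resample U from P(U | H=h, D=1).
        w1 = P_U * P_D[(h, 1)]
        w0 = (1.0 - P_U) * P_D[(h, 0)]
        u = 1 if rng.random() < w1 / (w0 + w1) else 0
        if step >= burn_in:
            hits += h
    return hits / n_samples

start = time.perf_counter()
exact = exact_posterior_h()
exact_seconds = time.perf_counter() - start

start = time.perf_counter()
approx = gibbs_posterior_h()
gibbs_seconds = time.perf_counter() - start

print(f"exact={exact:.4f} ({exact_seconds:.6f}s), gibbs={approx:.4f} ({gibbs_seconds:.6f}s)")
```

On a network this small the exact method is far cheaper; the interesting comparison for the report is how the two runtimes and the sampling error change as the network grows.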

Submission on Google Classroom: Follow the zip file naming and upload process, then
turn in.

Project 5: Isolation Forest Anomaly Detection in Fraud with SHAP Interpretability
(Session 4)

Description: Train an isolation forest model for detecting fraudulent transactions and apply
SHAP for explaining feature impacts. Evaluate with F1-score and precision-recall curves.

Dataset: The Credit Card Fraud Detection dataset, available at Credit Card Fraud, has
284,807 transactions. It is suitable for anomaly tasks due to its imbalance.

Detailed Student Instructions:

1. Training Step: Fit the model with a low contamination rate on preprocessed
features.
2. Interpretability Step: Generate explanations for feature contributions on test data.
3. Evaluation Step: Compute F1-score and visualize precision-recall.
4. Report Preparation Step: Connect to session metrics like false positive rates and
analogies.
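
The evaluation step can be sketched without any modeling library, as below. This covers only the metrics, not the isolation forest or SHAP themselves. The labels and anomaly scores are made-up toy values, the function names are hypothetical, and higher scores are assumed to mean "more anomalous". If you use scikit-learn's `IsolationForest`, note that its `score_samples` returns lower values for more abnormal points, so the scores would need to be negated first.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels where 1 = fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def pr_curve(y_true, scores):
    """(threshold, precision, recall) points, flagging every score >= threshold."""
    points = []
    for thr in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= thr else 0 for s in scores]
        p, r, _ = precision_recall_f1(y_true, y_pred)
        points.append((thr, p, r))
    return points

# Toy labels and anomaly scores for six transactions (made-up numbers).
y_true = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.2, 0.9, 0.7, 0.6, 0.3]

flags = [1 if s >= 0.6 else 0 for s in scores]
precision, recall, f1 = precision_recall_f1(y_true, flags)
print("precision, recall, F1:", precision, recall, f1)
for thr, p, r in pr_curve(y_true, scores):
    print(f"threshold={thr:.1f} precision={p:.2f} recall={r:.2f}")
```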

Submission on Google Classroom: Zip as "YourName_Project5.zip", upload, and turn in.

Project 6: Z-Score Time Series Anomaly Detection in Stocks with LIME (Session 4)

Description: Apply Z-score methods for identifying anomalies in stock returns and use LIME
for model explanations on a classifier.

Dataset: The Huge Stock Market Dataset, available at Huge Stock Market, provides time
series for various stocks. It is suitable for time-based anomaly detection.

Detailed Student Instructions:

1. Detection Step: Calculate standardized scores on returns and set a threshold for
anomalies.
2. Interpretability Step: Apply LIME to explain predictions from a supporting classifier.
3. Evaluation Step: Measure recall and visualize detected points.
4. Report Preparation Step: Link to session time series analysis.
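
A minimal sketch of the detection step, using only the standard library. The price series is invented, the threshold of 3.0 is a common convention rather than a value given in this handout, and the Z-score is computed over the whole series (a rolling window is another option you may prefer for non-stationary data). The LIME step is not covered here.

```python
import statistics

def simple_returns(prices):
    """Period-over-period returns: (p[t] - p[t-1]) / p[t-1]."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def zscore_anomalies(values, threshold=3.0):
    """Indices whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs((v - mean) / std) > threshold]

# Made-up closing prices with one abrupt jump from 100 to 150.
prices = [100, 101, 100, 101, 100, 101, 100, 101,
          100, 101, 100, 150, 151, 150, 151, 150]

returns = simple_returns(prices)
anomalies = zscore_anomalies(returns)
print("Anomalous return indices:", anomalies)
```

Keep in mind that with very short series a single outlier cannot reach a Z-score of 3, because the outlier itself inflates the standard deviation; this is one reason to test the threshold on realistic sample sizes.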

Submission on Google Classroom: Use the standard zip and upload method.

Project 7: DBSCAN Clustering for E-Commerce Customer Segmentation (Session 5)

Description: Use DBSCAN with parameters like epsilon 0.5 and minimum samples 5 for
segmenting customers based on purchase behavior. Evaluate with silhouette score.

Dataset: The Online Retail Dataset, available at Online Retail, contains transaction records.
It is suitable for clustering due to behavioral features.

Detailed Student Instructions:

1. Clustering Step: Fit the model on scaled features representing recency, frequency,
and monetary value.
2. Evaluation Step: Calculate silhouette score to assess cluster quality.
3. Report Preparation Step: Discuss session concepts like core and noise points.
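
To show what core and noise points mean in practice, here is a compact hand-written DBSCAN using the epsilon 0.5 and minimum samples 5 from the description. It is a teaching sketch, not a replacement for a library implementation: the two-feature toy data are invented and assumed to be already scaled, and a point counts itself among its neighbors when checking the minimum-samples rule.

```python
import numpy as np

NOISE = -1

def dbscan(X, eps=0.5, min_samples=5):
    """Minimal DBSCAN: returns (labels, is_core); label -1 marks noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, NOISE)
    cluster = 0
    for i in range(n):
        if labels[i] != NOISE or not is_core[i]:
            continue
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] != NOISE:
                continue
            labels[j] = cluster  # core or border point joins the cluster
            if is_core[j]:
                frontier.extend(neighbors[j])  # only core points expand it
        cluster += 1
    return labels, is_core

# Two tight groups of five customers plus one isolated customer (made-up data).
blob = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
X = np.vstack([blob, blob + 5.0, [[10.0, 0.0]]])

labels, is_core = dbscan(X)
print("labels:", labels.tolist())
print("noise points:", int(np.sum(labels == NOISE)))
```

When computing a silhouette score on real output, decide explicitly how to treat points labeled -1, since they do not form a cluster.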

Submission on Google Classroom: Zip files accordingly and turn in.

Project 8: Gaussian Mixture Model for Mall Customer Segmentation with BIC Tuning
(Session 5)

Description: Fit a Gaussian mixture model, tuning the number of components using
Bayesian information criterion for optimal segmentation by demographics and spending.

Dataset: The Mall Customers dataset, available at Mall Customers, has 200 instances with
age, income, and spending scores. It is suitable for mixture-based clustering.

Detailed Student Instructions:

1. Tuning and Fitting Step: Test different component numbers and select the best
based on information criterion scores.
2. Evaluation Step: Visualize clusters and compute purity against known labels if
available.
3. Report Preparation Step: Explain session ideas like covariance roles in modeling.
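
The tuning step can be sketched with a small one-dimensional Gaussian mixture fitted by expectation-maximization. Everything specific here is an assumption for illustration: the synthetic "spending score" data, the quantile-based initialization, the fixed 200 iterations, and the restriction to a single feature. For one dimension, a mixture with k components has 3k - 1 free parameters (k means, k variances, k - 1 weights), and BIC is that count times ln(n) minus twice the log-likelihood, so lower is better.

```python
import numpy as np

def component_densities(x, weights, means, variances):
    """Weighted Gaussian density of every point under every component, shape (n, k)."""
    diff = x[:, None] - means[None, :]
    return weights * np.exp(-0.5 * diff ** 2 / variances) / np.sqrt(2.0 * np.pi * variances)

def fit_gmm_1d(x, k, n_iter=200):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances, log_lik)."""
    x = np.asarray(x, dtype=float)
    means = np.quantile(x, (np.arange(k) + 0.5) / k)  # deterministic, spread-out start
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = component_densities(x, weights, means, variances)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    dens = component_densities(x, weights, means, variances)
    log_lik = float(np.log(dens.sum(axis=1)).sum())
    return weights, means, variances, log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion for a 1-D mixture with k components."""
    n_params = 3 * k - 1
    return n_params * np.log(n) - 2.0 * log_lik

# Synthetic spending scores drawn from two groups (not the real dataset).
rng = np.random.default_rng(0)
spending = np.concatenate([rng.normal(20, 3, 100), rng.normal(80, 5, 100)])

scores = {}
for k in (1, 2, 3):
    *_, log_lik = fit_gmm_1d(spending, k)
    scores[k] = bic(log_lik, k, len(spending))
best_k = min(scores, key=scores.get)
print("BIC by k:", scores, "selected k:", best_k)
```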

Submission on Google Classroom: Follow the zip upload and turn-in process.

Global Submission Guidelines for All Projects

For consistency across all projects, prepare a zip file named
"YourName_Project[Number].zip" with your notebook, PDF report, modified dataset if
applicable, and outputs. Log into Google Classroom, find the "Final Project Submission"
assignment, upload the file by selecting "Add or create" and "File", add any notes, and click
"Turn in". Confirm receipt via email. Deadlines must be met; contact the instructor for
extensions or issues. Ensure all work is original and datasets are cited properly.
