This document is a configuration manual for Muhammad Imran Shaikh's MSc research project on data analytics. It details the system configuration including hardware of an Intel Core i5 processor with 16GB RAM and software including Microsoft Office 365 and Python coding libraries. It describes the project development process involving data extraction, preprocessing, and implementation of recommendation engine models including content-based filtering, collaborative filtering, and matrix factorization. Evaluation of models is done using k-fold cross validation and leave one out cross validation to optimize hyperparameters and accuracy.


Configuration Manual

MSc Research Project


Data Analytics

Muhammad Imran Shaikh


Student ID: x17119308

School of Computing
National College of Ireland

Supervisor: Dr. Muhammad Iqbal


National College of Ireland
Project Submission Sheet
School of Computing

Student Name: Muhammad Imran Shaikh


Student ID: x17119308
Programme: Data Analytics
Year: 2020
Module: MSc Research Project
Supervisor: Dr. Muhammad Iqbal
Submission Due Date: 17/08/2020
Project Title: Configuration Manual
Word Count: 1458
Page Count: 12

I hereby certify that the information contained in this (my submission) is information
pertaining to research I conducted for this project. All information other than my own
contribution will be fully referenced and listed in the relevant bibliography section at the
rear of the project.
ALL internet material must be referenced in the bibliography section. Students are
required to use the Referencing Standard specified in the report template. To use other
authors' written or electronic work is illegal (plagiarism) and may result in disciplinary
action.

Signature:

Date: 25th September 2020

PLEASE READ THE FOLLOWING INSTRUCTIONS AND CHECKLIST:

Attach a completed copy of this sheet to each project (including multiple copies).
Attach a Moodle submission receipt of the online project submission to each project
(including multiple copies).
You must ensure that you retain a HARD COPY of the project, both for your own
reference and in case a project is lost or mislaid. It is not sufficient to keep a copy
on computer.

Assignments that are submitted to the Programme Coordinator office must be placed
into the assignment box located outside the office.

Office Use Only


Signature:

Date:
Penalty Applied (if applicable):
Configuration Manual
Muhammad Imran Shaikh
x17119308

1 Introduction
This configuration manual documents the system setup and the software and hardware
compatibility required to run the code that underpins the research project and report.
The manual covers System Configuration, Project Development, Code Implementation,
and Experiments with different machine learning models.

2 System Configuration
2.1 Hardware
Processor: 3rd Generation Intel Core i5-3320M (2.6 GHz, 3 MB L3 cache, 2 cores), up to
3.30 GHz; RAM: 16 GB; System type: 64-bit OS; Graphics: NVIDIA Quadro K2000M with
2 GB dedicated DDR3; Operating system: Windows 10 Pro (2019)

2.2 Software
Microsoft Office 365: Microsoft Word (for all professional written material), Microsoft
Excel (for storing the dataset in CSV and Excel formats and for visualization), and
Microsoft PowerPoint (presentation slides).
Python programming language: loading libraries, data cleaning, data preprocessing
and engineering, initial data analysis, train/test splitting, model implementation,
hyperparameter tuning, and evaluation. Python IDEs: Jupyter Notebook and PyCharm.

3 Project Development
The main steps involved in the project development phase are: selection of a suitable
IDE (Jupyter Notebook) for the coding tasks; loading the required libraries; data
cleaning (checking for null values and imputing with aggregations); data preprocessing
and engineering (grouping and joining datasets, merging data, describing columns,
removing unnecessary features, and removing pipes by splitting the Genre strings); and
initial data visualization (word cloud, several bar charts).
A separate class is prepared for the movie dataset to implement the recommendation
system techniques (content-based filtering and collaborative filtering), and the data is
split into training and test sets to fit the various models of our recommendation engine.
Top-N movie results are obtained from the recommendation machine learning models using
different techniques. Model hyperparameters are tuned to extract the best parameters for
optimal results. The models are evaluated with K-fold cross-validation and LOO (Leave
One Out) cross-validation over several splits to obtain better accuracy results, and
multiple evaluation plots are drawn for visual analysis.

3.1 Data Extraction and Pre-processing


The dataset was collected and generated by the GroupLens Research Group. The dataset
(ml-latest-small)1 consists of almost 100k ratings by different users, with over 1,200
movie tags across 9,125 movies; 'ml' stands for MovieLens. Each selected user had rated
at least 20 movies. Four files are included in this dataset: 'movies.csv', 'ratings.csv',
'tags.csv', and 'links.csv', but for recommendation purposes we consider only two of
them, 'movies.csv' and 'ratings.csv'. The entire coding is done in the Python programming
language. Various Python libraries are imported depending on the recommendation technique
being implemented. All recommendation system models are imported from the Surprise
library, a Python library dedicated to recommender systems, as can be seen in Figure 1.
Data preprocessing and engineering are achieved by grouping and joining

Figure 1: Importing python libraries

datasets, merging data, describing columns, deleting unnecessary features, removing
pipes by splitting the Genre strings, and creating a function that counts the number of
times each genre appears, as can be seen in Figure 2.
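The genre-counting step can be sketched as follows. This is a minimal illustration assuming pandas and the pipe-separated `genres` column of `movies.csv`; the rows shown here are synthetic stand-ins, not the project's data.

```python
import pandas as pd
from collections import Counter

# Synthetic stand-in for movies.csv (pipe-separated genres, as in MovieLens)
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "genres": ["Adventure|Animation|Comedy",
               "Adventure|Children|Fantasy",
               "Action|Crime|Thriller"],
})

def count_genres(df):
    """Split the pipe-separated genres column and count each genre's occurrences."""
    counts = Counter()
    for entry in df["genres"]:
        counts.update(entry.split("|"))
    return counts

genre_counts = count_genres(movies)
```

The resulting counts feed directly into the bar charts and word cloud described above.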

4 Implementation of Recommendation Engine Machine Learning Models
Our recommendation engine is tested with three recommendation system techniques
(content-based filtering, collaborative filtering, and matrix factorization), producing
Top-N movie results for users and, in content-based filtering, for movies as well. A
separate Movie class is generated to combine both the movies and ratings CSV files by
grouping them on user and movie IDs, as can be seen in the figure. Different machine
learning models are implemented with their defined parameters to obtain better
recommendations. The Surprise library, a Python library dedicated to recommender
systems, has been utilized to implement the machine learning models for our
recommendation engine (Figure 1). The following steps were taken to create and evaluate
our recommendation engine.

1 https://grouplens.org/datasets/movielens/latest/

Figure 2: Create a function that counts the number of times each genre appears

4.1 Experiment with Movielens Dataset Analysis


The analysis of the MovieLens dataset is based on merging the two dataset files,
'movies.csv' and 'ratings.csv', as an inner join on movie IDs (see figure). With the
help of this merged dataset, we visualize the movie genres using a word cloud and a
histogram to analyze which genres are most popular, as can be seen in Figure 3 and
Figure 4. Moreover, the top 25 movies with the highest ratings are also plotted to
analyze which movies are rated highest by different users (Figure 5).
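The merge and the per-title rating aggregation behind these plots can be sketched with pandas. The two DataFrames below are synthetic stand-ins for `movies.csv` and `ratings.csv`; the actual project loads the full MovieLens files.

```python
import pandas as pd

# Synthetic stand-ins for movies.csv and ratings.csv
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "genres": ["Adventure|Animation|Comedy",
               "Adventure|Children|Fantasy",
               "Action|Crime|Thriller"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [1, 2, 1, 3, 2],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Inner join on movieId, as described above
merged = movies.merge(ratings, on="movieId", how="inner")

# Mean rating per title, sorted to surface the highest-rated movies
top_rated = (merged.groupby("title")["rating"]
             .mean()
             .sort_values(ascending=False))
```

Plotting `top_rated` (e.g. with a horizontal bar chart) reproduces the kind of "top movies by rating" view described above.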

4.2 Experiment with Content Based Filtering


Content-based filtering is based on a user's interest in different items: given that
interest, similar items are recommended to the user. The recommendations become more
accurate as the user provides more input. In our content-based recommendation engine, we
find the 10 nearest neighbours of a movie of interest by applying the KNNBaseline
algorithm with the Pearson-baseline similarity metric (Figure 6).

Figure 3: Word cloud to analyze which Genre are the popular ones

Figure 4: Histogram to analyze which Genre are the popular ones

Figure 5: Top 25 movies with highest ratings

Figure 6: Content based filtering technique by using KNNBaseLine algorithm

4.3 Experiment with User and Item (Memory) Based Collaborative Filtering
User and item-based collaborative filtering is one of the most extensively used
techniques in recommendation systems. It works by finding a group of similar users who
have reacted similarly to the item of interest. A rating matrix is created to find
similar users and items based on the ratings given by the users. The KNNWithMeans
machine learning algorithm, with the cosine similarity metric, is utilized to get the
Top-10 nearest-neighbour movies for a specific user (Figure 7, Figure 8).

Figure 7: User based collaborative filtering technique by using KNNWithMean algorithm

Figure 8: Item based collaborative filtering technique by using KNNWithMean algorithm

4.4 Experiment with Matrix Factorization (Model) Based Collaborative Filtering
Matrix factorization, or model-based collaborative filtering, is a dimensionality
reduction technique, similar in spirit to Principal Component Analysis (PCA). Matrix
factorization breaks the large user-item matrix down into smaller matrices: the hidden
features are captured by latent factors derived from the item and user row and column
matrices. In our matrix factorization method, we implement two matrix factorization
algorithms, SVD (Singular Value Decomposition) and SVD++ (SVDpp), to get Top-10
recommendation results, shown in Figure 9 and Figure 10.

Figure 9: Matrix Factorization technique by using SVD algorithm

Figure 10: Matrix Factorization technique by using SVD++ algorithm

5 Experiment with Models Hyperparameter Tuning


For model hyperparameter tuning we use GridSearchCV from the Python Surprise library,
which provides the best parameters for obtaining optimal results from our machine
learning models when training on the dataset. The main parameters considered for the KNN
and matrix factorization based algorithms are different K values, the number of epochs,
the learning rate, the similarity options, and the accuracy measures (RMSE and MAE), as
can be seen in Figure 11 and Figure 12.

Figure 11: Hyperparameter tuning with GridSearchCV for different params of collaborative filtering models

Figure 12: Hyperparameter tuning with GridSearchCV for different params of Matrix
Factorization models

6 Experiment with Models Evaluation


For model cross-validation, we use two cross-validators to test the accuracy of our
recommendation engine models across different splits: the K-Fold cross-validator and the
LOOCV (Leave One Out) cross-validator, as can be seen in Figure 13 and Figure 14.

Figure 13: Models Evaluation with K-fold Cross Validation

Figure 14: Models Evaluation with LOO(Leave One Out) Cross Validation

