MSCI641_project

SENTIMENT ANALYSIS OF IMDb Movie REVIEWS USING LSTM (Long Short-Term Memory) Neural Networks

MEMBER 1
- Name: Kabiir Krishna
- Email: k7krishn@uwaterloo.ca
- WatIAM: k7krishn
- Student Number: 21106092

About the Project

In the digital age, online reviews significantly influence decisions about movies and TV shows. This project explores the use of Long Short-Term Memory (LSTM) neural networks for sentiment analysis of IMDb reviews. Using DistilBERT and VADER, we generate continuous sentiment scores ranging from -1 to +1 for our training dataset. The LSTM model is trained on these scores so that it can handle the sequential nature of textual data and identify the consensus and overall sentiment across reviews.

This approach helps the entertainment industry & users understand audience preferences, guiding marketing strategies, recommendation systems, and content creation. Consumers benefit from the wisdom of the crowd, which helps them make better choices.

The technology can also be extended to other areas such as product reviews and social media monitoring. Experiments show that the model effectively captures and analyzes sentiment from large-scale data, demonstrating the potential of sentiment analysis to improve decision-making and tailor content to audience expectations.

Project Structure

Project Root/
│   .gitignore
│   00-scrape.py
│   01.1-clean_score.py
│   01.2-Training_DistilBERT.ipynb
│   02-msci641_project.ipynb
│   03-sentiment_finetuning_w_distilbert.ipynb
│   04-dashboard.py
│   load_model_score.py
│   requirements.txt
│   README.md
│
├───media
│       # Contains media assets for README.md
│
├───models
│   ├───00-baseline
│   │       lstm.pt
│   │       vocab.pth
│   │
│   └───02-final
│           lstm_final.pt
│           vocab.pth
│
└───reviews
    ├───00-scraped
    │       reviews_tt0111161.csv
    │       reviews_tt0455944.csv
    │       reviews_tt0468569.csv
    │       reviews_tt15398776.csv
    │
    └───01-cleaned_scored
        ├───VADER
        │       cleaned_scored_reviews_tt0111161.csv
        │       cleaned_scored_reviews_tt0455944.csv
        │       cleaned_scored_reviews_tt0468569.csv
        │       cleaned_scored_reviews_tt15398776.csv
        │
        └───VADER_DISTILBERT_FINAL
                vader_dbert_scored.csv

Important Files/Folders

Preparing training Data

00-scrape.py: Contains the logic for scraping reviews. This script generated the initial scraped data, which was later cleaned and scored with 01.1-clean_score.py. To run it separately, type:
```
python3 00-scrape.py
```
01.1-clean_score.py: Cleans the scraped reviews and scores them with VADER. It works on the data scraped by the previous script and writes the cleaned, scored .csv files to reviews/01-cleaned_scored/VADER/. It can be run separately using:
```
python3 01.1-clean_score.py
```
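The exact cleaning steps in 01.1-clean_score.py are not described here, so the following is only a minimal sketch of what cleaning a scraped review before scoring might look like. The function name `clean_review` and the specific steps (dropping HTML tags, decoding entities, collapsing whitespace) are assumptions for illustration, not the script's actual logic.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")      # matches simple HTML tags such as <br/>
WHITESPACE_RE = re.compile(r"\s+")   # matches any run of whitespace


def clean_review(text: str) -> str:
    """Hypothetical cleaner: strip tags, decode entities, collapse whitespace."""
    text = TAG_RE.sub(" ", text)     # replace each tag with a space
    text = html.unescape(text)       # e.g. "&amp;" becomes "&"
    return WHITESPACE_RE.sub(" ", text).strip()


raw = "Loved it &amp; the cast.<br/><br/>  Would   watch again."
print(clean_review(raw))
```

A cleaned string like this could then be passed to a sentiment scorer such as VADER.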

Training the DistilBERT & LSTM Model

01.2-Training_DistilBERT.ipynb: Notebook for training the DistilBERT model, a strong classifier.
02-msci641_project.ipynb: Training notebook for the LSTM model. It also exports the model for later use.
load_model_score.py: Contains the LSTM definition (needed for loading the saved model) and a function to score reviews using a pre-loaded model.
04-dashboard.py: Contains the main logic for the dashboard web page.
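To make the idea of an LSTM that outputs a continuous score in [-1, +1] concrete, here is a minimal PyTorch sketch. It is not the class defined in load_model_score.py: the name `SentimentLSTM`, the layer sizes, the single-layer design, and the `tanh` output head are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class SentimentLSTM(nn.Module):
    """Hypothetical LSTM regressor; the project's real architecture may differ."""

    def __init__(self, vocab_size: int, embed_dim: int = 64,
                 hidden_dim: int = 128, pad_idx: int = 0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # hidden: (1, batch, hidden_dim)
        # tanh squashes the output so it lies between -1 and +1
        return torch.tanh(self.head(hidden[-1])).squeeze(-1)


model = SentimentLSTM(vocab_size=1000)            # vocabulary size is made up
model.eval()
batch = torch.randint(1, 1000, (4, 20))           # 4 fake reviews, 20 token ids each
with torch.no_grad():
    scores = model(batch)
print(scores.shape)
```

Bounding the output with `tanh` keeps predictions on the same scale as the training targets, which is one plausible design choice when regressing onto scores in [-1, +1].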

Augmenting data with scores from VADER and DistilBERT:

03-sentiment_finetuning_w_distilbert.ipynb: Contains logic for loading DistilBERT model trained with 01.2-Training_DistilBERT.ipynb & using it to augment the VADER-scored reviews.
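How the notebook combines the two signals is not spelled out here. As one possible illustration, the sketch below blends a VADER compound score (already in [-1, +1]) with a classifier's positive-class probability (in [0, 1]) using a weighted average. The function `blend_scores`, its parameters, and the default weight of 0.5 are hypothetical and are not taken from the project.

```python
def blend_scores(vader_compound: float, positive_prob: float,
                 vader_weight: float = 0.5) -> float:
    """Hypothetical blend of two sentiment signals into one score in [-1, 1].

    vader_compound: VADER compound score, expected in [-1, 1].
    positive_prob:  assumed classifier probability of "positive", in [0, 1].
    vader_weight:   assumed mixing weight in [0, 1]; not from the project.
    """
    if not -1.0 <= vader_compound <= 1.0:
        raise ValueError("vader_compound must be in [-1, 1]")
    if not 0.0 <= positive_prob <= 1.0:
        raise ValueError("positive_prob must be in [0, 1]")
    if not 0.0 <= vader_weight <= 1.0:
        raise ValueError("vader_weight must be in [0, 1]")
    classifier_score = 2.0 * positive_prob - 1.0   # rescale [0, 1] to [-1, 1]
    return vader_weight * vader_compound + (1.0 - vader_weight) * classifier_score


print(blend_scores(0.5, 1.0))
```

Because both inputs are mapped to [-1, +1] before averaging, the blended score stays in the same range as the training targets described above.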

Data Folders

reviews/00-scraped/: Contains the initial reviews scraped by the web crawler.
reviews/01-cleaned_scored/VADER/: Contains the cleaned reviews, scored by VADER only.
reviews/01-cleaned_scored/VADER_DISTILBERT_FINAL/: Contains the reviews scored by DistilBERT, along with the final fine-tuned (VADER + DistilBERT) score for each review. This was the final training data used to train the model.
models/: Contains the initial and final LSTM models used in this project. (The DistilBERT model could not be included in this repo because it was too large to push.)

NOTE: Though not required to run the project, the DistilBERT model can be regenerated locally by running 01.2-Training_DistilBERT.ipynb.

Getting Started

How To Run the Project locally

Clone the repository

git clone https://github.com/Kabiirk/MSCI641_project.git

Navigate to the project root
```
cd MSCI641_project
```
Install dependencies
```
pip install -r requirements.txt
```
Run the Dashboard
```
streamlit run 04-dashboard.py
```
This automatically opens a new browser window/tab with the dashboard running on localhost.

Using the Dashboard

When the project is first run, the dashboard loads the pre-scraped reviews of the movie "The Equalizer" so that sample visualizations are visible right away. Users can start real-time scraping and analysis of any new movie by following these steps:

Type the movie/TV-series ID as per IMDb in the text box (under "IMDb ID") at the top and press the Scrape Data button.
A live progress bar will appear, indicating the status of operations (scraping, scoring, etc.).
Once the reviews are loaded and the analysis is done, the dashboard updates the existing visualizations to reflect the new reviews scored by the model.
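Before scraping, it can help to check that the text entered under "IMDb ID" looks like a title ID. The sketch below assumes title IDs are `tt` followed by seven or more digits, as in the file names under reviews/00-scraped/. The helpers `normalize_imdb_id` and `scraped_csv_path` are hypothetical; they mirror the repository's file naming and are not necessarily what 04-dashboard.py does.

```python
import re
from pathlib import Path

# Assumed ID shape: "tt" plus at least seven digits (e.g. tt0111161, tt15398776)
IMDB_ID_RE = re.compile(r"tt\d{7,}")


def normalize_imdb_id(raw: str) -> str:
    """Trim and lowercase the input, then check it against the assumed ID shape."""
    candidate = raw.strip().lower()
    if not IMDB_ID_RE.fullmatch(candidate):
        raise ValueError(f"not a valid IMDb title ID: {raw!r}")
    return candidate


def scraped_csv_path(imdb_id: str, root: Path = Path("reviews/00-scraped")) -> Path:
    """Build a CSV path following the reviews_<id>.csv naming used in the repo."""
    return root / f"reviews_{normalize_imdb_id(imdb_id)}.csv"


print(scraped_csv_path(" tt0111161 ").as_posix())
```

Rejecting malformed input early avoids starting a scrape that cannot succeed.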

Demo Video

Note: The demo video has been trimmed and sped up for brevity. Each time the user enters a movie/TV-show ID, the project scrapes the latest reviews afresh. After building the DataFrame of reviews, the script scores the Reviews column using Pandas' apply() function, which can take some time depending on the compute resources of the machine running the dashboard. For quick results, try movies with fewer reviews.
