Unit 3 4

The document outlines a project to classify movie reviews as positive or negative using Support Vector Machine (SVM) with a dataset from IMDb. It details the steps of loading and preprocessing the data, performing text preprocessing, training the SVM classifier, and experimenting with feature extraction techniques like TF-IDF and Word2Vec. The results indicate that both methods achieved high accuracy, with Word2Vec slightly outperforming TF-IDF due to its ability to capture semantic relationships.

Uploaded by

mcanarender

Sentiment Analysis of Movie Reviews

Objective: Classify movie reviews as positive or negative using SVM.

Dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?form=MG0AV3
Tasks:
1. Load and preprocess a dataset of movie reviews (e.g., IMDb dataset).
2. Perform text preprocessing (e.g., tokenization, stopword removal, vectorization).
3. Train an SVM classifier to predict the sentiment of reviews.
4. Experiment with different feature extraction techniques (e.g., TF-IDF, word embeddings).
5. Evaluate the model's performance and discuss the results.

#Load and Preprocess the Dataset

import pandas as pd

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/nkphd/IMDB Dataset.csv")

# Display dataset info

print(df.head())
print(df.info())

# Convert labels to numeric (positive = 1, negative = 0)

df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

print(df['sentiment'].value_counts())
#Perform Text Preprocessing
import pandas as pd
import nltk
import string
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
# Download the 'punkt_tab' resource
nltk.download('punkt_tab')

# Load stopwords once (Fix for LookupError)

stop_words = set(stopwords.words('english'))

# Function to clean text

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    tokens = word_tokenize(text)  # Tokenization
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/nkphd/IMDB Dataset.csv")

# Display dataset info

print(df.head())
print(df.info())

# Convert labels to numeric (positive = 1, negative = 0)

df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
print(df['sentiment'].value_counts())

# Apply preprocessing
df['cleaned_review'] = df['review'].apply(preprocess_text)

print(df[['review', 'cleaned_review']].head())
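
To make the effect of each cleaning step concrete, the sketch below applies the same sequence of steps to one made-up sentence. It is a simplified stand-in, not the code above: `STOP_WORDS` is a tiny hypothetical list rather than NLTK's English stopwords, and `str.split` replaces `word_tokenize`, so no downloads are needed. The `tag_replacement` parameter is an assumption added for illustration; it shows that replacing HTML tags with an empty string can fuse the words on either side of a tag.

```python
import re
import string

# Hypothetical, tiny stopword set standing in for NLTK's English list.
STOP_WORDS = {"the", "a", "was", "it", "is", "this", "and"}

def clean(text, tag_replacement=""):
    """Apply the same cleaning steps, with a configurable tag replacement."""
    text = text.lower()                                   # lowercase
    text = re.sub(r"<.*?>", tag_replacement, text)        # strip HTML tags
    text = re.sub(r"\d+", "", text)                       # strip digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                                 # whitespace split (assumption)
    return " ".join(t for t in tokens if t not in STOP_WORDS)

sample = "The plot was good.<br /><br />The ending is weak, 2/10!"

fused = clean(sample)                        # tags replaced with nothing
spaced = clean(sample, tag_replacement=" ")  # tags replaced with a space

print(fused)   # "good" and "the" are fused into one token
print(spaced)  # the two words stay separate
```

If fused tokens matter for a given experiment, replacing tags with a space is one possible adjustment. The results reported later in this document come from the code as shown above.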
#Feature Extraction using TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text into numerical features using TF-IDF

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['cleaned_review'])

# Labels
y = df['sentiment']

print(f"Feature matrix shape: {X.shape}")

Feature matrix shape: (50000, 5000)
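
As a rough illustration of what each entry in that matrix means, the sketch below computes TF-IDF weights by hand for three made-up documents. It assumes scikit-learn's default settings for `TfidfVectorizer` (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row). The toy corpus and the helper name `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration).
docs = [
    "great film great cast",
    "boring film",
    "great story",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]

for doc, vec in zip(docs, vectors):
    weights = {t: round(w, 3) for t, w in zip(vocab, vec) if w > 0}
    print(doc, "->", weights)
```

In the second document, "boring" occurs in only one document while "film" occurs in two, so "boring" receives the larger weight. This is the sense in which TF-IDF highlights terms that are distinctive for a review.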

#Train an SVM Classifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Train SVM model

svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predictions
y_pred = svm_model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy * 100:.2f}%')
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
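
The numbers printed by the evaluation calls can be reproduced by hand. The sketch below uses made-up labels and a hypothetical helper, `confusion_counts`, to derive the confusion matrix, accuracy, precision, recall, and F1 for the positive class. It follows the layout scikit-learn uses for binary labels 0 and 1: rows are true classes, columns are predicted classes, giving [[TN, FP], [FN, TP]].

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels, where 1 = positive and 0 = negative."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
matrix = [[tn, fp], [fn, tp]]             # rows: true class, columns: predicted
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)                # of predicted positives, how many are right
recall = tp / (tp + fn)                   # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print("Confusion matrix:", matrix)
print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, F1: {f1:.2f}")
```

Because the IMDb dataset is balanced between the two classes, accuracy is a reasonable headline metric here, but the per-class precision and recall show whether the errors lean toward one label.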

#Experiment with Word Embeddings

import gensim
from gensim.models import Word2Vec
import numpy as np

# Tokenize reviews
df['tokenized'] = df['cleaned_review'].apply(lambda x: x.split())

# Train Word2Vec model

w2v_model = Word2Vec(sentences=df['tokenized'], vector_size=100, window=5,
                     min_count=2, workers=4)

# Function to convert reviews into vectors

def get_word2vec_vectors(review):
    vectors = [w2v_model.wv[word] for word in review if word in w2v_model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(100)

# Convert dataset to word embeddings

X_w2v = np.array([get_word2vec_vectors(review) for review in df['tokenized']])

# Train SVM on Word Embeddings

X_train, X_test, y_train, y_test = train_test_split(X_w2v, y, test_size=0.2, random_state=42)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy with Word2Vec: {accuracy * 100:.2f}%')

Model Accuracy with Word2Vec: 85.98%
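
The averaging step that turns a review into a single vector can be illustrated without a trained model. In the sketch below, `EMBEDDINGS` is a hypothetical lookup table of 3-dimensional vectors (not output from Word2Vec), and `average_vector` mirrors the logic of `get_word2vec_vectors`: skip words without a vector, average the rest, and fall back to zeros when nothing is known.

```python
# Hypothetical 3-dimensional word vectors, invented for illustration.
EMBEDDINGS = {
    "good": [1.0, 0.0, 2.0],
    "bad": [-1.0, 0.0, 1.0],
    "plot": [0.0, 3.0, 0.0],
    "acting": [2.0, 1.0, 1.0],
}
DIM = 3

def average_vector(tokens):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        return [0.0] * DIM
    # zip(*known) groups the values of each dimension across all words.
    return [sum(dim_values) / len(known) for dim_values in zip(*known)]

a = average_vector("good plot bad acting".split())
b = average_vector("bad plot good acting".split())
c = average_vector(["unseen", "words"])

print(a)  # average over four known words
print(b)  # same words in a different order
print(c)  # no known words, so the zero vector
```

The first two reviews say opposite things about the plot and the acting, yet they map to the same vector because averaging ignores word order. This is one limitation to keep in mind when interpreting the Word2Vec result.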

#Evaluate the model's performance

1. TF-IDF Feature Extraction:

• Accuracy: 85.52%
• The SVM classifier performed well with a 5000-feature TF-IDF representation of the movie reviews.
• The sparse, high-dimensional feature space captured term patterns relevant to sentiment classification.

2. Word2Vec Embeddings:

• Accuracy: 85.98%
• Averaged Word2Vec vectors gave results comparable to TF-IDF while using only 100 dimensions per review. The 0.46-point difference is small, and the two runs used slightly different splits (the TF-IDF split was stratified, the Word2Vec split was not), so it may not reflect a real advantage.

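Discussion of Results

1. Performance Comparison:

• TF-IDF Accuracy: 85.52%

• Word2Vec Accuracy: 85.98% (0.46 points higher, which is consistent with embeddings capturing semantic relationships but is too small a gap to attribute to that with confidence).

2. Strengths:

• TF-IDF: Simple, interpretable, effective for key term identification.

• Word2Vec: Places words with similar usage close together, which can help generalization to wording not seen in training. Averaging the word vectors discards word order.

3. Trade-offs:

• TF-IDF creates sparse, high-dimensional data (5000 features per review here), which must be kept in a sparse format to stay memory-efficient.

• Word2Vec uses dense, low-dimensional vectors (100 per review here) but needs extra computation to train the embedding model, and its quality depends on that model.

4. Conclusion: Both methods perform well, with Word2Vec slightly ahead in this run.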
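
To put the gap between the two accuracies in context, the sketch below estimates the sampling uncertainty of each figure. It assumes a test set of 10,000 reviews (20% of the 50,000 in the dataset) and uses the binomial standard error sqrt(p(1 - p) / n). Treating the two test sets as independent is a simplifying assumption; because both runs evaluate on largely the same reviews, a paired comparison on identical splits would be more rigorous.

```python
import math

n_test = 10_000      # assumed: 20% of the 50,000 reviews
acc_tfidf = 0.8552   # reported TF-IDF accuracy
acc_w2v = 0.8598     # reported Word2Vec accuracy

def standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1 - acc) / n)

se_tfidf = standard_error(acc_tfidf, n_test)
se_w2v = standard_error(acc_w2v, n_test)

gap = acc_w2v - acc_tfidf
# Standard error of the difference, assuming independent test sets.
se_gap = math.sqrt(se_tfidf ** 2 + se_w2v ** 2)

print(f"TF-IDF:   {acc_tfidf:.4f} +/- {se_tfidf:.4f}")
print(f"Word2Vec: {acc_w2v:.4f} +/- {se_w2v:.4f}")
print(f"Gap: {gap:.4f}, standard error of gap: {se_gap:.4f}")
```

Under these assumptions the gap is smaller than its own standard error, so the ranking of the two methods could plausibly flip with a different random split.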
