Sentiment Analysis of Movie Reviews
Objective: Classify movie reviews as positive or negative using SVM.
Dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?form=MG0AV3
Tasks:
1. Load and preprocess a dataset of movie reviews (e.g., IMDb dataset).
2. Perform text preprocessing (e.g., tokenization, stopword removal, vectorization).
3. Train an SVM classifier to predict the sentiment of reviews.
4. Experiment with different feature extraction techniques (e.g., TF-IDF, word
embeddings).
5. Evaluate the model's performance and discuss the results.
#Load and Preprocess the Dataset
import pandas as pd
# Load dataset
df = pd.read_csv("/content/drive/MyDrive/nkphd/IMDB Dataset.csv")
# Display dataset info
print(df.head())
print(df.info())
# Convert labels to numeric (positive = 1, negative = 0)
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
print(df['sentiment'].value_counts())
#Perform Text Preprocessing
import nltk
import string
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
# Download the 'punkt_tab' resource
nltk.download('punkt_tab')
# Load stopwords once (Fix for LookupError)
stop_words = set(stopwords.words('english'))
# Function to clean text
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    # Replace HTML tags (e.g. <br />) with a space so the words on either side do not merge
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    tokens = word_tokenize(text)  # Tokenization
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)
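To make the effect of each cleaning step concrete, here is a minimal sketch that mirrors the same sequence using only the standard library. The stopword set `DEMO_STOP_WORDS`, the function name `preprocess_demo`, and the sample sentence are assumptions made for illustration; whitespace splitting stands in for `word_tokenize`, and tags are replaced with a space so that the words around `<br />` stay separate.

```python
import re
import string

# Assumption: a tiny illustrative stopword set, not NLTK's full English list.
DEMO_STOP_WORDS = {"the", "was", "and", "it", "a", "i", "this"}

def preprocess_demo(text):
    """Apply the same cleaning steps as above with stdlib-only stand-ins."""
    text = text.lower()                       # lowercase
    text = re.sub(r"<.*?>", " ", text)        # replace HTML tags with a space
    text = re.sub(r"\d+", "", text)           # drop digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    tokens = text.split()                     # whitespace split stands in for word_tokenize
    return " ".join(t for t in tokens if t not in DEMO_STOP_WORDS)

sample = "I LOVED this movie!<br /><br />It was 10/10, and the acting was great."
print(preprocess_demo(sample))
```

Only the content words survive. The rating "10/10" disappears entirely, which is worth keeping in mind because numeric ratings can carry sentiment.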
# Apply preprocessing
df['cleaned_review'] = df['review'].apply(preprocess_text)
print(df[['review', 'cleaned_review']].head())
#Feature Extraction using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
# Convert text into numerical features using TF-IDF
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['cleaned_review'])
# Labels
y = df['sentiment']
print(f"Feature matrix shape: {X.shape}")
Feature matrix shape: (50000, 5000)
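For readers who want to see what a TF-IDF weight is, the sketch below computes it by hand on a toy corpus. It assumes the smoothed IDF formula that `TfidfVectorizer` uses by default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document vector. The three-document corpus and the helper names (`df_counts`, `idf`, `tfidf_vector`) are illustrative only.

```python
import math
from collections import Counter

# Assumption: a toy corpus standing in for the cleaned reviews.
docs = ["great movie great cast", "boring movie", "great fun"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: how many documents contain each term at least once.
df_counts = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (TfidfVectorizer default, smooth_idf=True).
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf_vector(doc_tokens):
    tf = Counter(doc_tokens)                                  # raw term counts
    weights = {t: tf[t] * idf(t) for t in tf}                 # tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))    # L2 norm of the document
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[1])  # "boring movie"
print(vec)
```

In the second document, "boring" receives a larger weight than "movie" because it appears in fewer documents, which is the property that lets rare, sentiment-bearing words stand out.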
#Train an SVM Classifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
# Train SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)
# Predictions
y_pred = svm_model.predict(X_test)
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy * 100:.2f}%')
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
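As a reading aid for the classification report and confusion matrix, the sketch below derives accuracy, precision, recall, and F1 from a 2x2 matrix laid out the way `confusion_matrix` returns it (rows are true labels, columns are predicted labels, in label order 0 then 1). The counts are hypothetical and chosen only for illustration; they are not results from this notebook.

```python
# Hypothetical counts for illustration only; NOT output from the model above.
cm_demo = [
    [4300, 700],   # true negative reviews: [predicted 0, predicted 1]
    [600, 4400],   # true positive reviews: [predicted 0, predicted 1]
]
(tn, fp), (fn, tp) = cm_demo

total = tn + fp + fn + tp
acc_demo = (tp + tn) / total                 # share of all reviews classified correctly
precision_pos = tp / (tp + fp)               # of reviews predicted positive, share truly positive
recall_pos = tp / (tp + fn)                  # of truly positive reviews, share found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={acc_demo:.4f} precision={precision_pos:.4f} "
      f"recall={recall_pos:.4f} f1={f1_pos:.4f}")
```

The same arithmetic applied to the negative class (swapping the roles of the two rows and columns) gives the other row of the classification report.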
#Experiment with Word Embeddings
import gensim
from gensim.models import Word2Vec
import numpy as np
# Tokenize reviews
df['tokenized'] = df['cleaned_review'].apply(lambda x: x.split())
# Train Word2Vec model
w2v_model = Word2Vec(
    sentences=df['tokenized'], vector_size=100, window=5, min_count=2, workers=4)
# Function to convert reviews into vectors
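def get_word2vec_vectors(review):
    # Collect vectors for the tokens that are in the Word2Vec vocabulary
    vectors = [w2v_model.wv[word] for word in review if word in w2v_model.wv]
    # Average them; fall back to a zero vector when no token is known
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v_model.wv.vector_size)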
# Convert dataset to word embeddings
X_w2v = np.array([get_word2vec_vectors(review) for review in df['tokenized']])
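The averaging step can be illustrated without training a model. The sketch below uses a hypothetical dictionary of 3-dimensional vectors (`toy_vectors`) in place of `w2v_model.wv`; the values and the helper name `average_vector` are made up for the example.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; the Word2Vec vectors above have 100 dimensions.
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
    "boring": np.array([-1.0, 0.0, -2.0]),
}
DIM = 3

def average_vector(tokens, vectors, dim):
    """Mean of the known token vectors, or a zero vector if none are known."""
    found = [vectors[t] for t in tokens if t in vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim)

a = average_vector(["great", "movie", "unknownword"], toy_vectors, DIM)
b = average_vector(["movie", "great"], toy_vectors, DIM)   # same words, different order
c = average_vector(["unknownword"], toy_vectors, DIM)      # nothing in vocabulary
print(a, b, c)
```

Out-of-vocabulary tokens are skipped, and the first two calls return the same vector: averaging ignores word order, so "not good, but great" and "not great, but good" would map to the same features.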
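# Train SVM on Word Embeddings
# Note: this split is not stratified (unlike the TF-IDF split), so the two test sets are not guaranteed to match
X_train, X_test, y_train, y_test = train_test_split(X_w2v, y, test_size=0.2, random_state=42)
# Refitting reuses svm_model and overwrites the classifier trained on TF-IDF features
svm_model.fit(X_train, y_train)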
y_pred = svm_model.predict(X_test)
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy with Word2Vec: {accuracy * 100:.2f}%')
Model Accuracy with Word2Vec: 85.98%
#Evaluate the model's performance
1. TF-IDF Feature Extraction:
   Accuracy: 85.52%
   The SVM classifier performed well with a 5000-feature TF-IDF representation of the movie reviews.
   The sparse, high-dimensional feature space captured term patterns relevant to sentiment classification.
2. Word2Vec Embeddings:
   Accuracy: 85.98%
   Averaged 100-dimensional Word2Vec vectors gave results comparable to TF-IDF, 0.46 percentage points higher on a 10,000-review test set.
Discussion of Results
1. Performance Comparison:
   TF-IDF Accuracy: 85.52%
   Word2Vec Accuracy: 85.98% (0.46 percentage points higher; the two runs used different test splits, so a gap this small may not reflect a real advantage).
2. Strengths:
   TF-IDF: Simple, interpretable, effective for key term identification.
   Word2Vec: Places words used in similar contexts close together, so related words yield similar features.
3. Trade-offs:
   TF-IDF creates sparse, high-dimensional features (5000 columns here), and the cost grows with vocabulary size.
   Word2Vec uses dense 100-dimensional vectors but requires training an embedding model first, and averaging the word vectors discards word order.
4. Conclusion: Both methods perform well at roughly 86% accuracy, with Word2Vec marginally ahead.
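To put the 0.46-point gap in context, the sketch below estimates the sampling uncertainty of each accuracy figure with the binomial standard error, sqrt(p(1 - p) / n). It assumes a test set of 10,000 reviews (20% of 50,000) and, as a simplification, treats the two test sets as independent samples; the helper name `standard_error` is illustrative.

```python
import math

n_test = 10_000        # 20% of the 50,000 reviews
acc_tfidf = 0.8552     # reported TF-IDF accuracy
acc_w2v = 0.8598       # reported Word2Vec accuracy

def standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n test examples."""
    return math.sqrt(acc * (1 - acc) / n)

se_tfidf = standard_error(acc_tfidf, n_test)
se_w2v = standard_error(acc_w2v, n_test)

# Assumption: the two test sets are treated as independent samples.
se_diff = math.sqrt(se_tfidf ** 2 + se_w2v ** 2)
diff = acc_w2v - acc_tfidf

print(f"difference = {diff:.4f}, standard error = {se_diff:.4f}, "
      f"ratio = {diff / se_diff:.2f}")
```

Under these assumptions the gap is smaller than one standard error of the difference, so the two feature sets are best read as performing about equally on this task. A paired comparison on an identical test split would give a firmer answer.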