Unit 3 4

The document outlines a project to classify movie reviews as positive or negative using Support Vector Machine (SVM) with a dataset from IMDb. It details the steps of loading and preprocessing the data, performing text preprocessing, training the SVM classifier, and experimenting with feature extraction techniques like TF-IDF and Word2Vec. The results indicate that both methods achieved high accuracy, with Word2Vec slightly outperforming TF-IDF due to its ability to capture semantic relationships.

Uploaded by

mcanarender

Sentiment Analysis of Movie Reviews

Objective: Classify movie reviews as positive or negative using SVM.


Dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?form=MG0AV3
Tasks:
1. Load and preprocess a dataset of movie reviews (e.g., IMDb dataset).
2. Perform text preprocessing (e.g., tokenization, stopword removal, vectorization).
3. Train an SVM classifier to predict the sentiment of reviews.
4. Experiment with different feature extraction techniques (e.g., TF-IDF, word embeddings).
5. Evaluate the model's performance and discuss the results.

#Load and Preprocess the Dataset


import pandas as pd

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/nkphd/IMDB Dataset.csv")

# Display dataset info
print(df.head())
df.info()  # info() prints its summary directly and returns None

# Convert labels to numeric (positive = 1, negative = 0)


df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

print(df['sentiment'].value_counts())
#Perform Text Preprocessing
import pandas as pd
import nltk
import string
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
# Download the 'punkt_tab' resource
nltk.download('punkt_tab')

# Load stopwords once (Fix for LookupError)


stop_words = set(stopwords.words('english'))

# Function to clean text


def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    tokens = word_tokenize(text)  # Tokenization
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)


# Apply preprocessing
df['cleaned_review'] = df['review'].apply(preprocess_text)

print(df[['review', 'cleaned_review']].head())
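
As a quick check of the cleaning steps, the sketch below applies the same sequence (lowercase, strip HTML tags, digits, punctuation, stopwords) to one made-up review using only the standard library. The tiny `SAMPLE_STOPWORDS` set and the whitespace `split()` are stand-ins for NLTK's stopword list and `word_tokenize`, so it runs without any downloads. It also illustrates one thing worth checking: IMDb reviews contain `<br />` tags, and removing a tag with an empty string can fuse the words on either side, so the hypothetical `tag_replacement` parameter lets the two behaviours be compared.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stopword list (illustration only).
SAMPLE_STOPWORDS = {"the", "was", "a", "it", "and", "is", "i"}


def clean_review(text, tag_replacement=" "):
    """Lowercase, strip tags/digits/punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"<.*?>", tag_replacement, text)  # HTML tags
    text = re.sub(r"\d+", "", text)                 # digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in SAMPLE_STOPWORDS]
    return " ".join(tokens)


review = "The plot was great.<br /><br />Acting was a 10/10!"
print(clean_review(review))                      # plot great acting
print(clean_review(review, tag_replacement=""))  # plot greatacting
```

With the empty replacement, "great." and "Acting" merge into a single token that is unlikely to match any other review, which is why substituting a space may be the safer choice.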
#Feature Extraction using TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text into numerical features using TF-IDF


vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['cleaned_review'])

# Labels
y = df['sentiment']

print(f"Feature matrix shape: {X.shape}")

Feature matrix shape: (50000, 5000)
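
To make the numbers inside the feature matrix concrete, here is a small standard-library sketch of how a TF-IDF weight can be computed. It assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The three-document corpus and the helper `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


docs = ["great movie great cast", "boring movie", "great fun"]
vocab, rows = tfidf_rows(docs)
print(vocab)
print([round(w, 3) for w in rows[1]])
```

In the second document, "boring" appears in fewer documents than "movie", so it receives the larger weight; this is the sense in which TF-IDF highlights distinctive terms.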


#Train an SVM Classifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train SVM model


svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predictions
y_pred = svm_model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy * 100:.2f}%')
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
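
One possible variation on the training step, not part of the original notebook: fitting `TfidfVectorizer` inside a scikit-learn `Pipeline` means the vocabulary and IDF weights are learned from the training split only, and `LinearSVC` is a linear SVM implementation that generally scales better than `SVC(kernel='linear')` on large sparse matrices. The eight toy reviews and their labels below are invented for illustration; with the real data, `df['cleaned_review']` and `df['sentiment']` would take their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy data standing in for the cleaned IMDb reviews.
texts = [
    "great film loved acting", "wonderful story great cast",
    "loved every minute wonderful", "great fun wonderful film",
    "terrible film hated acting", "awful story boring cast",
    "hated every minute awful", "boring terrible awful film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is fitted on X_train only, when pipeline.fit is called.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("svm", LinearSVC()),
])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
print(f"Toy accuracy: {accuracy_score(y_test, y_pred):.2f}")
```

Because the test reviews never reach `fit`, the reported score is not influenced by terms that occur only in the held-out data.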

#Experiment with Word Embeddings


import gensim
from gensim.models import Word2Vec
import numpy as np

# Tokenize reviews
df['tokenized'] = df['cleaned_review'].apply(lambda x: x.split())

# Train Word2Vec model


w2v_model = Word2Vec(
    sentences=df['tokenized'], vector_size=100, window=5, min_count=2, workers=4
)

# Function to convert reviews into vectors


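def get_word2vec_vectors(review):
    # Average the vectors of all in-vocabulary words; fall back to zeros if none are known
    vectors = [w2v_model.wv[word] for word in review if word in w2v_model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v_model.vector_size)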

# Convert dataset to word embeddings


X_w2v = np.array([get_word2vec_vectors(review) for review in df['tokenized']])

# Train SVM on Word Embeddings


X_train, X_test, y_train, y_test = train_test_split(X_w2v, y, test_size=0.2, random_state=42)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy with Word2Vec: {accuracy * 100:.2f}%')

Model Accuracy with Word2Vec: 85.98%
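
The document-vector step can be illustrated in isolation. The sketch below averages word vectors taken from a tiny hand-written lookup table (`toy_vectors`, a hypothetical stand-in for `w2v_model.wv`) and falls back to a zero vector when no word is known, mirroring `get_word2vec_vectors`. It also shows one limitation of averaging: word order is discarded, so two reviews with the same words in a different order receive the same vector.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; a trained model would supply these.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "not": np.array([-1.0, 1.0, 0.0]),
    "bad": np.array([0.0, -1.0, -2.0]),
}
VECTOR_SIZE = 3


def average_vector(tokens, vectors=toy_vectors, size=VECTOR_SIZE):
    """Mean of the known word vectors, or zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(size)


print(average_vector(["not", "good"]))       # mean of two known words
print(average_vector(["good", "not"]))       # same vector: order is ignored
print(average_vector(["unknown", "words"]))  # zero-vector fallback
```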

#Evaluate the model's performance

1. TF-IDF Feature Extraction:

• Accuracy: 85.52%
• The SVM classifier performed well with a 5000-feature TF-IDF representation of the movie reviews.
• The sparse, high-dimensional feature space captured term patterns relevant to sentiment classification.

2. Word2Vec Embeddings:

• Accuracy: 85.98%
• Word2Vec gave results comparable to TF-IDF (0.46 percentage points higher) while using dense 100-dimensional vectors that encode semantic relationships between words.
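Discussion of Results

1. Performance Comparison:

• TF-IDF Accuracy: 85.52%
• Word2Vec Accuracy: 85.98%, which is 0.46 percentage points higher. Capturing semantic relationships is one plausible reason, but the gap is small and the two runs used different test splits (only the TF-IDF split was stratified), so it may not be significant.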

2. Strengths:

• TF-IDF: Simple, interpretable, effective for key term identification.
• Word2Vec: Captures context and word semantics, which can improve generalization.

3. Trade-offs:

• TF-IDF creates sparse, high-dimensional data (5000 features per review here), requiring more memory as the vocabulary grows.
• Word2Vec uses dense vectors (100 dimensions here) but needs extra computation to train the embeddings and depends on the quality of the embedding model.

4. Conclusion: Both methods perform well at roughly 86% accuracy, with Word2Vec ahead by 0.46 percentage points.
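
As a rough check on how much weight the gap between 85.52% and 85.98% can bear, the sketch below computes a normal-approximation 95% interval for each accuracy. It assumes a test set of 10,000 reviews (20% of the 50,000 rows, following the `test_size=0.2` splits) and treats each prediction as an independent trial, which is a simplification. The helper `accuracy_interval` is introduced here for illustration.

```python
import math


def accuracy_interval(accuracy, n_test, z=1.96):
    """Normal-approximation confidence interval for an accuracy estimate."""
    stderr = math.sqrt(accuracy * (1 - accuracy) / n_test)
    return accuracy - z * stderr, accuracy + z * stderr


N_TEST = 10_000  # assumed: 20% of 50,000 reviews

intervals = {}
for name, acc in [("TF-IDF", 0.8552), ("Word2Vec", 0.8598)]:
    low, high = accuracy_interval(acc, N_TEST)
    intervals[name] = (low, high)
    print(f"{name}: {acc:.2%} (95% interval {low:.2%} to {high:.2%})")
```

Under these assumptions each interval is about 0.7 percentage points wide on either side and the two overlap, so the difference may not be meaningful without a comparison on the same test split.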
