SUMMER INTERNSHIP REPORT
SUBMITTED BY
ASFAQ AHAMED M (812022205007)
In partial fulfillment for the award of the Degree
Of
BACHELOR OF ENGINEERING
in
DEPARTMENT OF INFORMATION TECHNOLOGY
(Internship Duration: Start Date to End Date)
(30 Days : 23.06.2025 - 28.07.2025)
M.A.M COLLEGE OF ENGINEERING AND TECHNOLOGY
TIRUCHIRAPPALLI – 621105
ANNA UNIVERSITY : CHENNAI-600025
JUNE - 2025
BONAFIDE CERTIFICATE
This is to certify that ASFAQ AHAMED M has satisfactorily completed the one-month
Summer Internship at “THE MIND IT SOLUTION TRICHY- 620002”. This
report is being submitted in partial fulfillment for the award of the degree of Bachelor of
Technology in the Department of Information Technology to M.A.M College of
Engineering and Technology, under my guidance.
SIGNATURE
Dr. K. Geetha, M.E., Ph.D.,
Head of the Department
Department of IT
M.A.M College of Engineering and Technology.
Trichy-621105

SIGNATURE
Ms. T. Gurudharshini, B.E., M.Tech.,
Assistant Professor
Department of IT
M.A.M College of Engineering and Technology.
Trichy-621105
CERTIFICATE
DECLARATION
I, ASFAQ AHAMED M, hereby declare that the Internship report submitted to
M.A.M College of Engineering and Technology in partial fulfillment of the
requirement for the award of the Degree of Bachelor of Technology in the Department of
Information Technology is a record of original training undergone by me during
the period of June 2025 under the supervision and guidance of
Ms. T. Gurudharshini, B.E., M.Tech., Assistant Professor, Department of
Information Technology, M.A.M College of Engineering and Technology, and that it has
not formed the basis for the award of any Degree / Fellowship or other similar title to any
candidate of any University.
Place:
Date : Signature of the Student
ACKNOWLEDGEMENT
With a warm heart and immense pleasure, I thank the Almighty for His grace and
blessings bestowed on me, which drove me to the successful completion of this
project. I take this opportunity to express my sincere thanks to the respected Director
Dr. M.A. Maluk Mohammed, M.E., Ph.D., and Secretary & Correspondent Mrs.
Fathima Bathool Makul, who are the guiding lights for all activities in our college.
I express my sincere and humble thanks to our Principal Dr. X. Susan
Christina, M.E., Ph.D., for providing me with all the facilities needed for the
successful completion of my work.
I would like to thank our Head of the Department Dr. K. Geetha, M.E., Ph.D., for her
cooperation, advice and suggestions at every stage of my project work.
I am very proud to extend my sincere thanks and gratitude to our Supervisor
Ms. T. Gurudharshini, B.E., M.Tech., Assistant Professor, Department of
Information Technology, M.A.M College of Engineering & Technology, for her
excellent guidance, advice and encouragement, which boosted my energy throughout
the project development.
I also thank all the teaching and non-teaching staff of the Department of
Information Technology, my parents, and all my friends for their help and support
in completing this project successfully.
CONTENT
S.NO TITLE
1 Introduction to Artificial Intelligence
2 Python Fundamentals for AI
3 Control Structures and Data Handling
4 Functions, Modules, and AI Libraries
5 Working with AI Datasets
6 Machine Learning Algorithms
7 Deep Learning with Neural Networks
8 Natural Language Processing
9 Implementation
10 Model Training and Testing
11 Sample Projects
CHAPTER 1
Introduction to Artificial Intelligence
Artificial Intelligence (AI) is a multidisciplinary branch of computer science that aims to
create machines capable of simulating human-like intelligence. This includes the ability to
perceive the environment, process information, learn from experience, reason logically, and
make decisions. AI combines principles from mathematics, statistics, neuroscience,
linguistics, and computer science to design algorithms that can adapt and improve
performance over time without direct human intervention.
History and Evolution :
The roots of AI date back to the 1950s, when pioneers like Alan Turing, John McCarthy,
Marvin Minsky, and Herbert Simon laid the conceptual foundation. The term “Artificial
Intelligence” was formally introduced in 1956 at the Dartmouth Conference. The 1960s and
70s saw the development of symbolic AI systems and early neural networks, though limited
computing power restricted their capabilities. The 1980s popularized expert systems, but their
rigidity led to a decline in interest. From the 2000s onward, the explosion of big data, advances
in machine learning, and improved computational power revitalized AI, and in the 2010s deep
learning began to match human-level performance on specific tasks such as image
classification and speech recognition.
Categories of AI :
AI can be classified based on capability:
• Narrow AI – Focused on specific tasks like translation or speech recognition.
• General AI – Hypothetical AI with the ability to perform any intellectual task a human can
do.
• Super AI – A theoretical form of AI surpassing human intelligence.
Functionally, AI can also be divided into Reactive Machines, Limited Memory AI, Theory
of Mind AI, and Self-aware AI.
Applications of AI :
AI applications span healthcare, finance, education, manufacturing, and entertainment. In
healthcare, AI aids diagnosis, predicts disease risks, and assists in personalized treatment
planning. In finance, AI detects fraud, manages risk, and automates trading. In manufacturing,
AI improves production efficiency, conducts predictive maintenance, and ensures quality
control. AI also powers personal assistants, chatbots, recommendation systems, and
autonomous vehicles.
Challenges and Ethics :
Despite its benefits, AI raises concerns about data privacy, algorithmic bias, job
displacement, and decision transparency. Ensuring ethical AI involves fairness,
accountability, transparency, and human oversight. Governance frameworks and regulatory
guidelines are being developed globally to promote responsible AI use.
Core Areas of AI :
• Machine Learning
• Natural Language Processing
• Computer Vision
• Robotics
• Expert Systems
Future Scope and Trends in Artificial Intelligence
Artificial Intelligence is expected to evolve rapidly, influencing almost every sector of
human life. Its future scope extends beyond current applications, focusing on higher levels
of autonomy, intelligence, and adaptability.
CHAPTER 2
Python Fundamentals for AI
Python is the most widely used language for AI development due to its simplicity,
readability, and robust ecosystem of libraries. Its syntax resembles natural language,
reducing the learning curve and enabling faster development. Python supports rapid
prototyping, making it ideal for research and production environments.
Advantages of Python for AI
Python’s popularity in AI stems from its ease of learning, versatility, and large community
support. It integrates easily with C++, Java, and R, allowing hybrid solutions. Python’s
extensive set of AI-focused libraries eliminates the need to build algorithms from scratch,
accelerating project timelines.
Essential Python Concepts for AI :
AI development requires familiarity with Python’s core concepts, including data types
(integers, floats, strings, booleans), data structures (lists, tuples, sets, dictionaries), control
structures (if-else statements, loops), and object-oriented programming. Exception handling
ensures robustness by managing errors gracefully.
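The concepts listed above can be illustrated with a minimal sketch. The sample record, the tag list, and the helper name parse_rating are hypothetical and chosen only for illustration.

```python
# Hypothetical sample record using the four basic data types.
record = {
    "id": 7,                 # int
    "rating": 4.5,           # float
    "review": "Works well",  # str
    "verified": True,        # bool
}

tags = ["battery", "price", "battery"]  # list: ordered, allows duplicates
unique_tags = set(tags)                 # set: duplicates removed
shape = (3, 2)                          # tuple: fixed-size, immutable


def parse_rating(raw):
    """Convert raw input to a float, returning None when it cannot be parsed."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        # Bad input is handled gracefully instead of crashing the program.
        return None


# Control structure (a loop written as a list comprehension) over mixed inputs.
ratings = [parse_rating(value) for value in ["4.5", "n/a", None, 3]]
print(ratings)  # [4.5, None, None, 3.0]
```

The try/except block is what keeps one malformed value from stopping the whole run, which matters when a dataset contains thousands of rows of uncertain quality.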
Python Library Ecosystem for AI :
• NumPy – Numerical computations and matrix operations.
• Pandas – Data manipulation and preprocessing.
• Matplotlib & Seaborn – Visualization tools for data analysis.
• Scikit-learn – Machine learning algorithms and preprocessing utilities.
• TensorFlow & PyTorch – Deep learning model development.
• NLTK & SpaCy – Natural language processing.
• OpenCV – Computer vision tasks.
Python in the AI Workflow :
Python is used across the AI pipeline, from data collection and preprocessing to model
training, evaluation, and deployment. Its integration with web frameworks like Flask and
FastAPI enables AI model deployment as APIs for real-time applications.
AI Tools and Frameworks :
AI tools and frameworks are software platforms, libraries, and environments that
help developers build, train, and deploy artificial intelligence models efficiently.
They provide pre-built functions, algorithms, and optimization techniques,
reducing the time and complexity of development.
• TensorFlow – An open-source framework by Google for building and training
deep learning models, supporting CPU and GPU processing.
• PyTorch – Developed by Meta (Facebook), known for its flexibility and
dynamic computation graphs, making research and prototyping easier.
• Keras – A high-level API that runs on top of TensorFlow, simplifying the
process of building neural networks.
• Scikit-learn – A Python library for machine learning, offering simple tools for
classification, regression, clustering, and model evaluation.
CHAPTER 3
Control Structures and Data Handling
Control structures determine the flow of execution in a program, enabling decision-making
and iterative processing. In AI, control structures help manage dataset iterations, conditional
logic, and automated workflows.
Conditional Statements in AI
AI models often require conditional checks, such as filtering data, selecting algorithms
based on input size, or applying specific preprocessing steps. This allows dynamic
adaptation of workflows based on conditions.
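A small sketch of such conditional checks is given below. The function names, the size threshold of 1000 samples, and the algorithm choices are illustrative assumptions, not rules taken from the internship material.

```python
def choose_algorithm(n_samples, has_labels):
    """Pick an algorithm family from simple dataset properties (illustrative only)."""
    if not has_labels:
        return "k-means clustering"
    elif n_samples < 1000:  # hypothetical threshold for a "small" dataset
        return "logistic regression"
    else:
        return "neural network"


def keep_valid(rows):
    """Filter out rows whose 'text' field is missing or empty."""
    return [row for row in rows if row.get("text")]


rows = [{"text": "good product"}, {"text": ""}, {"id": 3}]
print(choose_algorithm(500, has_labels=True))  # logistic regression
print(keep_valid(rows))                        # [{'text': 'good product'}]
```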
Loop Structures in AI Applications
Loops allow repetitive tasks, such as training models over multiple epochs, processing large
datasets, and applying transformations to each data element. Efficient looping techniques
are essential for handling AI workloads.
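As a rough illustration of looping over epochs, the sketch below fits a single weight with gradient descent. The toy data, learning rate, and epoch count are made-up values for demonstration.

```python
# Toy data that follows y = 2x exactly (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0               # single trainable weight
learning_rate = 0.01
epochs = 200

for epoch in range(epochs):        # outer loop: one full pass per epoch
    grad = 0.0
    for x, y in zip(xs, ys):       # inner loop: visit every sample
        error = w * x - y
        grad += 2 * error * x      # derivative of the squared error w.r.t. w
    w -= learning_rate * grad / len(xs)

print(round(w, 4))  # 2.0
```

The outer loop repeats the same pass many times, and each pass moves the weight a little closer to the value that fits the data.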
Data Handling in AI
Data handling involves acquiring, cleaning, transforming, and storing data for analysis. AI
relies heavily on both structured (tabular) and unstructured (images, audio, text) data.
Effective handling ensures high-quality inputs, leading to better model performance.
Data Management Challenges
AI projects often face issues like missing values, inconsistent formats, and unbalanced
datasets. Robust preprocessing techniques and validation steps are critical to ensuring
reliable results.
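The three issues mentioned above (missing values, inconsistent formats, and unbalanced classes) can be sketched with the standard library alone. The rows below are hypothetical, and mean imputation is only one of several possible strategies.

```python
from collections import Counter
from statistics import mean

# Hypothetical rows with a missing age and inconsistently written labels.
rows = [
    {"age": 25, "label": "Yes"},
    {"age": None, "label": "yes "},
    {"age": 35, "label": "NO"},
    {"age": 30, "label": "yes"},
]

# 1. Missing values: replace None with the mean of the observed ages.
observed = [r["age"] for r in rows if r["age"] is not None]
fill_value = mean(observed)
for r in rows:
    if r["age"] is None:
        r["age"] = fill_value

# 2. Inconsistent formats: normalize whitespace and letter case.
for r in rows:
    r["label"] = r["label"].strip().lower()

# 3. Unbalanced classes: count how often each label occurs.
class_counts = Counter(r["label"] for r in rows)
print(class_counts)  # Counter({'yes': 3, 'no': 1})
```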
CHAPTER 4
Functions, Modules, and AI Libraries
Functions, modules, and libraries form the foundation of organized AI development in
Python. Functions encapsulate specific tasks into reusable blocks of code. Modules group
related functions, classes, and variables into a single file. Libraries are collections of modules
designed to perform a wide variety of operations, including AI-specific computations. This
modular approach improves maintainability, scalability, and collaboration in AI projects.
Role of Functions in AI
Functions allow developers to break down AI workflows into manageable steps, such as data
preprocessing, model training, and performance evaluation. This modularity enhances
reusability, ensuring that the same function can be applied to different datasets or models with
minimal changes. It also improves debugging efficiency by isolating logical errors to specific
components.
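A minimal sketch of this decomposition is shown below. The three function names are hypothetical, and the "model" is deliberately trivial (a majority-class baseline) so that the structure, not the algorithm, is the focus.

```python
def preprocess(texts):
    """Lowercase and strip each text."""
    return [t.strip().lower() for t in texts]


def train(texts, labels):
    """Majority-class baseline: remember the most frequent label."""
    return max(set(labels), key=labels.count)


def evaluate(model, labels):
    """Accuracy obtained by always predicting the stored majority label."""
    correct = sum(1 for y in labels if y == model)
    return correct / len(labels)


texts = ["  Great phone ", "Love it", "Broke in a week"]
labels = ["pos", "pos", "neg"]

clean = preprocess(texts)
model = train(clean, labels)
accuracy = evaluate(model, labels)
print(clean[0], model, round(accuracy, 2))  # great phone pos 0.67
```

Because each step is a separate function, a different dataset or a real learning algorithm can be swapped in by changing one function while the rest of the workflow stays the same.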
Importance of Modules in AI Projects
Modules help structure AI applications by grouping related functionalities. For example, a
preprocessing module may contain all data-cleaning routines, while a model module might
include training and evaluation methods. This separation fosters a clean architecture, making
large AI projects easier to understand and extend.
Key AI Libraries
• NumPy – Fundamental for numerical computation and matrix manipulation.
• Pandas – Handles structured data efficiently.
• Matplotlib/Seaborn – Provides visualization capabilities for dataset analysis and model
results.
• Scikit-learn – Offers classical machine learning algorithms and preprocessing utilities.
• TensorFlow/Keras/PyTorch – Enable deep learning model creation and training.
• OpenCV – Supports computer vision tasks such as image recognition and object detection.
Benefits of Using Libraries in AI
Using pre-built libraries saves development time, reduces errors, and ensures optimized
performance. Libraries are maintained by large developer communities, ensuring regular
updates and compatibility with the latest technologies.
Integration of Libraries in AI Workflows
AI projects often integrate multiple libraries in a single pipeline. For example, Pandas may
be used for data cleaning, Scikit-learn for feature selection, and TensorFlow for deep learning.
This interoperability is one of Python’s strongest advantages in AI.
Modules in AI Development :
In AI programming, a module is a file or collection of files containing Python definitions,
functions, and classes that can be reused in other programs. Modules help organize code into
logical sections, making it easier to maintain, debug, and scale AI projects. They can be
built-in (such as math, os, and random) or user-defined, created to handle specific tasks like data
preprocessing or model evaluation. Modules are also widely used to import specialized
functionality from external libraries, such as NumPy for numerical computations, Pandas for
data handling, and TensorFlow or PyTorch for building and training models. By using
modules, developers can avoid rewriting common code, improve efficiency, and ensure better
project structure.
Common AI Libraries Used :
• TensorFlow – Deep learning framework for large-scale AI models
• PyTorch – Flexible framework for research and prototyping
• Scikit-learn – Machine learning algorithms and utilities
• OpenCV – Computer vision and image processing tasks
CHAPTER 5
Working with AI Datasets
AI models rely on datasets for training, validation, and testing. A dataset can be structured
(tables), unstructured (images, audio, text), or semi-structured (JSON, XML). The quality,
size, and diversity of a dataset directly affect the performance and generalization
capability of an AI model.
Data Sources
AI datasets may come from public repositories, APIs, IoT devices, or proprietary company
databases. Popular public sources include Kaggle, UCI Machine Learning Repository, and
ImageNet. Data collection should align with project objectives and ethical guidelines.
Dataset Preparation
Raw data is rarely ready for use. Preprocessing steps include cleaning (removing
duplicates, handling missing values), normalization (scaling numerical values), and
encoding categorical variables. Data must also be split into training, validation, and testing
sets so that overfitting can be detected and performance measured on unseen data.
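The preparation steps described above can be sketched in plain Python. The helper names, the 70/15/15 split, and the fixed seed are assumptions made for this example; in practice libraries such as Pandas and Scikit-learn provide equivalent utilities.

```python
import random


def min_max_scale(values):
    """Normalization: rescale numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encoding: turn each category into a 0/1 vector (categories sorted by name)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def split_dataset(items, train_pct=70, val_pct=15, seed=42):
    """Shuffle, then split into training, validation, and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


print(min_max_scale([10, 20, 30]))     # [0.0, 0.5, 1.0]
print(one_hot(["red", "blue", "red"]))  # [[0, 1], [1, 0], [0, 1]]
train_set, val_set, test_set = split_dataset(range(20))
print(len(train_set), len(val_set), len(test_set))  # 14 3 3
```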
Data Quality and Bias
High-quality datasets are accurate, complete, and representative of the problem domain.
Biased datasets can lead to unfair or inaccurate AI predictions. Data augmentation
techniques can help improve diversity and balance within datasets.
Tools for Dataset Handling
Python libraries such as Pandas, NumPy, and OpenCV simplify dataset manipulation. For
large-scale datasets, tools like Apache Spark and Dask enable distributed processing.
Challenges in Dataset Management
Challenges include data scarcity, privacy concerns, unbalanced class distribution, and
maintaining dataset relevance over time. Addressing these challenges is critical for
developing robust AI models.
CHAPTER 6
Machine Learning Algorithms
Machine Learning (ML) is a core subset of AI that enables systems to learn patterns from
data and improve performance without being explicitly programmed. ML algorithms adapt
their behavior based on input data, making them essential for predictive analytics and
intelligent decision-making.
Types of Machine Learning :
• Supervised Learning – Uses labeled datasets for training, such as classification and
regression tasks.
• Unsupervised Learning – Works with unlabeled data to find patterns and groupings, such
as clustering.
• Reinforcement Learning – Learns through interaction with an environment by maximizing
rewards.
Popular Machine Learning Algorithms :
• Linear Regression – Predicts continuous values.
• Logistic Regression – Used for binary classification.
• Decision Trees & Random Forests – Handle complex decision-making tasks.
• Support Vector Machines (SVM) – Classifies data by finding the optimal hyperplane.
• K-Means Clustering – Groups similar data points in unsupervised learning tasks.
Applications of Machine Learning
ML powers recommendation engines, fraud detection systems, medical diagnosis tools,
speech recognition, and autonomous systems. In business, it helps in customer segmentation,
sales forecasting, and operational optimization.
Advantages and Limitations
Advantages include adaptability, automation, and the ability to uncover hidden patterns.
Limitations involve dependency on data quality, high computational requirements, and the
risk of overfitting.
The Role of ML in AI Development
Machine Learning serves as the backbone for many AI applications, bridging the gap between
raw data and actionable intelligence. It transforms datasets into predictive models capable of
handling real-world complexity.
Supervised Learning Algorithms :
Supervised learning algorithms are trained using labeled datasets, where each input has a
corresponding correct output. The model learns patterns from this data to make predictions
on new, unseen inputs. Common algorithms include Linear Regression for predicting
continuous values, Logistic Regression for classification tasks, Decision Trees and Random
Forests for structured predictions, Support Vector Machines (SVM) for separating classes,
and K-Nearest Neighbors (KNN) for instance-based learning.
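As one concrete instance of supervised learning, the sketch below implements K-Nearest Neighbors from scratch with the standard library. The labeled points are invented for illustration, and a library implementation would normally be preferred in a real project.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical labeled data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)))  # a
print(knn_predict(points, labels, (7, 9)))  # b
```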
Unsupervised Learning Algorithms :
Unsupervised learning algorithms work with unlabeled datasets, where the system tries to
identify hidden patterns, relationships, or structures without predefined outputs. They are
mainly used for clustering, dimensionality reduction, and association rule mining.
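Clustering, the most common of these tasks, can be sketched with a small from-scratch K-Means loop. The points and the starting centroids are made up, and the centroids are fixed rather than random so that the result is reproducible.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain K-Means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # no labels are used
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (8, 9), (9, 8)]
```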
CHAPTER 7
Deep Learning with Neural Networks
Deep Learning is a specialized branch of Machine Learning that focuses on algorithms inspired
by the structure and functioning of the human brain. It uses multi-layered neural networks,
known as deep neural networks, to automatically learn hierarchical features from large
datasets. Unlike traditional ML algorithms, deep learning can automatically extract relevant
features from raw data, reducing the need for manual feature engineering. This capability has
led to breakthroughs in image recognition, speech processing, and natural language
understanding. Deep learning thrives on large datasets and high computational power, making
it ideal for complex AI applications where traditional ML methods may fail.
Structure of Neural Networks
A neural network consists of interconnected layers of nodes, called neurons.
• Input Layer – Receives raw data, such as pixel values in images or word embeddings in text.
• Hidden Layers – Contain multiple neurons that transform input features into more abstract
representations. Each neuron applies a weighted sum of its inputs followed by a non-linear
activation function.
• Output Layer – Produces the final prediction, such as a class label or probability score.
The depth of the network refers to the number of hidden layers. Deep neural networks often
have dozens or even hundreds of layers, enabling them to model highly complex relationships.
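The layer structure described above can be sketched as a forward pass through a tiny network with two inputs, three hidden neurons, and one output. All weights and inputs are arbitrary illustrative numbers, and the sigmoid is just one possible activation function.

```python
import math


def sigmoid(z):
    """Non-linear activation that squashes any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One layer: each neuron computes sigmoid(weighted sum of inputs + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


x = [0.5, -1.0]                                  # input layer (2 features)
hidden = dense(x,                                # hidden layer (3 neurons)
               weights=[[1.0, 2.0], [-1.0, 0.5], [0.0, 0.0]],
               biases=[0.0, 0.0, 0.0])
output = dense(hidden,                           # output layer (1 neuron)
               weights=[[1.0, 1.0, 1.0]],
               biases=[0.0])
print(len(hidden), len(output))  # 3 1
```

Stacking more calls to dense() between the input and the output is what makes a network "deeper".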
Training Deep Neural Networks
Training involves feeding data through the network, computing prediction errors using a loss
function, and updating weights via backpropagation and optimization algorithms like
stochastic gradient descent. Proper training requires large amounts of labeled data,
regularization techniques to prevent overfitting, and careful selection of hyperparameters such
as learning rate, batch size, and number of layers.
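The training loop can be sketched for the simplest possible case: a single sigmoid neuron learning the logical OR function with stochastic gradient descent. With only one neuron, backpropagation reduces to a single chain-rule step. The data, learning rate, epoch count, and seed are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Labeled data for logical OR (inputs, target).
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

weights, bias = [0.0, 0.0], 0.0
learning_rate = 0.5          # hyperparameter chosen for this toy example
rng = random.Random(0)       # fixed seed so the run is reproducible

for epoch in range(2000):
    rng.shuffle(data)        # "stochastic": samples are visited in random order
    for x, target in data:
        # Forward pass.
        pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        # Backward pass: for cross-entropy loss with a sigmoid output,
        # the gradient with respect to the weighted sum is (pred - target).
        delta = pred - target
        weights = [w - learning_rate * delta * xi for w, xi in zip(weights, x)]
        bias -= learning_rate * delta


def predict(x):
    """Return 0 or 1 using the trained weights."""
    return round(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias))


print([predict(x) for x in ([0, 0], [0, 1], [1, 0], [1, 1])])  # [0, 1, 1, 1]
```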
Applications of Deep Learning
Deep learning powers many modern AI applications:
• Image Classification & Object Detection – Face recognition, medical imaging.
• Natural Language Processing – Machine translation, sentiment analysis, chatbots.
• Speech Recognition – Virtual assistants like Siri and Alexa.
• Generative AI – Creating images, music, and text.
• Autonomous Vehicles – Perception systems for navigation.
Types of Deep Learning Architectures
Different neural network architectures are suited to different tasks:
• Convolutional Neural Networks (CNNs) – Designed for spatial data such as images,
CNNs use convolutional filters to detect features like edges, textures, and patterns.
• Recurrent Neural Networks (RNNs) – Suited for sequential data such as text and time
series. Variants like LSTMs and GRUs address issues like vanishing gradients.
• Transformer Models – Highly effective in NLP tasks due to their ability to handle long-
range dependencies in text.
• Autoencoders – Used for unsupervised learning tasks like dimensionality reduction and
anomaly detection.
Challenges and Future Trends
Challenges include high computational costs, dependence on massive datasets, and the
black-box nature of neural networks. Future research focuses on explainable AI, energy-
efficient models, and integration with quantum computing for faster processing.
CHAPTER 8
Natural Language Processing (NLP)
Natural Language Processing (NLP) is a field of AI that enables machines to understand,
interpret, and generate human language. It bridges computer science, linguistics, and AI to
create systems capable of reading text, listening to speech, and responding in a human-like
manner. NLP plays a crucial role in applications such as chatbots, translation services, and
voice assistants.
Core Components of NLP
NLP involves multiple tasks, including:
• Tokenization – Breaking text into words or phrases.
• Part-of-Speech Tagging – Identifying grammatical roles of words.
• Named Entity Recognition (NER) – Detecting names of people, organizations, locations.
• Parsing – Analyzing sentence structure.
• Sentiment Analysis – Determining the emotional tone of text.
NLP Techniques and Models
Traditional NLP relied on statistical methods like Hidden Markov Models and Conditional
Random Fields. Modern NLP uses deep learning models such as:
• Word Embeddings (Word2Vec, GloVe) – Represent words in a continuous vector space.
• Transformer-based Models (BERT, GPT) – Capture context more effectively and handle
long-range dependencies.
• Sequence-to-Sequence Models – Power tasks like machine translation and text
summarization.
Applications of NLP
• Machine Translation – Google Translate, DeepL.
• Conversational Agents – Chatbots, virtual assistants.
• Information Retrieval – Search engines, document indexing.
• Text Summarization – Condensing lengthy documents into concise overviews.
• Speech-to-Text – Converting audio into written form.
Challenges in NLP
Language is inherently ambiguous, with slang, idioms, and cultural variations making
interpretation difficult. Low-resource languages often lack sufficient training data, and
biases in datasets can lead to discriminatory outputs.
Future of NLP
Research is moving toward zero-shot and few-shot learning, where models can understand
tasks without large labeled datasets. Multimodal NLP, integrating text with images or audio,
is also gaining traction.
Components of NLP :
The main components of Natural Language Processing (NLP) are designed to help machines
understand and process human language effectively. Morphological Analysis focuses on the
structure of words and their formation. Syntax Analysis examines the grammatical structure
of sentences. Semantics deals with the meaning of words and sentences, while Pragmatics
interprets language based on context and intent. Discourse analysis studies how sentences
relate within a larger text. Additionally, Phonology handles the sound structure in speech-
based NLP. Together, these components enable NLP systems to process, interpret, and
generate language for applications like translation, sentiment analysis, and conversational
AI.
Popular NLP Algorithms :
Natural Language Processing (NLP) uses various algorithms to perform tasks like text
classification, sentiment analysis, and machine translation. Common algorithms include
Naïve Bayes Classifier, often used for spam filtering and sentiment analysis, and Support
Vector Machines (SVM) for text categorization.
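A minimal from-scratch sketch of a multinomial Naïve Bayes text classifier is given below, using add-one (Laplace) smoothing. The four training messages and the function names train_nb and predict_nb are hypothetical and far too small for real use.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs is a list of (tokens, label) pairs. Returns the counts needed for prediction."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {word for tokens, _ in docs for word in tokens}
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-probability for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)               # log prior
        total_words = sum(word_counts[label].values())
        for word in tokens:
            # Add-one smoothing avoids zero probability for unseen words.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [
    ("win money now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train_nb(docs)
print(predict_nb(model, "free money".split()))     # spam
print(predict_nb(model, "meeting notes".split()))  # ham
```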
CHAPTER 9
Implementation
Implementation is the process of converting AI designs and models into fully functional
systems. It involves integrating trained models into applications, ensuring they operate
efficiently in real-world environments, and maintaining them over time.
Stages of AI Implementation
1. Requirement Analysis – Identify the problem, define objectives, and determine feasibility.
2. Data Preparation – Collect and preprocess relevant datasets.
3. Model Development – Select, train, and validate suitable AI algorithms.
4. Integration – Embed the model into the target application.
5. Testing – Evaluate performance in real-world scenarios.
6. Deployment – Launch the AI system for end-user access.
7. Monitoring and Maintenance – Ensure the model remains accurate over time.
Tools and Frameworks
AI implementation uses frameworks like TensorFlow, PyTorch, and Scikit-learn for model
training, and Flask, FastAPI, or Django for deployment as APIs. For large-scale
deployments, cloud platforms such as AWS, Azure, and Google Cloud provide scalable
infrastructure.
Challenges in Implementation
Challenges include integrating AI into existing systems, managing latency for real-time
applications, ensuring security, and complying with data privacy regulations. Additionally,
models may degrade over time due to changes in data distributions, requiring retraining.
Best Practices
Best practices involve modular design, continuous integration, automated testing, version
control for models, and user feedback loops. Explainability and interpretability are critical,
especially in regulated industries like healthcare and finance.
Future Trends in AI Implementation
Trends include edge AI, where models run on local devices to reduce latency, and AI-as-a-
Service (AIaaS) platforms that simplify implementation for businesses. MLOps (Machine
Learning Operations) is emerging as a framework for managing AI projects from
development to deployment.
Data Preprocessing :
Data preprocessing is a crucial step in AI and machine learning projects, ensuring the dataset
is clean, consistent, and suitable for analysis. It involves data cleaning (handling missing
values, removing duplicates, correcting errors), data transformation (normalization, scaling,
encoding categorical variables), and data reduction (removing irrelevant features or
dimensionality reduction). In NLP tasks, preprocessing may include tokenization, stop-word
removal, stemming, and lemmatization. Proper preprocessing improves model accuracy,
reduces noise, and enhances training efficiency. By preparing data effectively, the system
can learn meaningful patterns and deliver reliable results during both training and real-world
deployment.
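The NLP preprocessing steps named above can be sketched with the standard library. The stop-word list is a tiny illustrative sample, and crude_stem is a rough stand-in for a real stemmer or lemmatizer such as those provided by NLTK or spaCy.

```python
import re

# Tiny illustrative stop-word list; real lists contain hundreds of words.
STOP_WORDS = {"the", "is", "a", "an", "and", "was", "were", "to", "of"}


def tokenize(text):
    """Tokenization: lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Stop-word removal: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Stemming (very rough): strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The cameras were working and the screen cracked."))
# ['camera', 'work', 'screen', 'crack']
```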
Model Training :
Model training is the process of feeding preprocessed data into a selected algorithm so it can
learn patterns and relationships. During training, the model adjusts its parameters to
minimize errors using optimization techniques. This step is crucial for enabling accurate
predictions on unseen data.
CHAPTER 10
CODE :
import argparse
import logging
import os
import sys
from collections import Counter
from typing import List, Tuple, Iterable

import pandas as pd
from transformers import pipeline
import spacy
import matplotlib.pyplot as plt
import seaborn as sns

# The wordcloud package is optional; the script still runs without it.
try:
    from wordcloud import WordCloud
    WORDCLOUD_AVAILABLE = True
except ImportError:
    WORDCLOUD_AVAILABLE = False

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s - %(message)s",
    datefmt="%H:%M:%S"
)
log = logging.getLogger("review_analyzer")

def load_sample_data() -> pd.DataFrame:
    """Return a small embedded dataset used when no input file is given."""
    sample = {
        "review_text": [
            "Great quality and fast delivery! Very satisfied.",
            "Poor battery life. Not worth the price.",
            "It's okay, nothing special but works as expected.",
            "Amazing sound quality and battery backup!",
            "Delivery was late and packaging was damaged.",
            "Absolutely love the product, great value for money.",
            "Terrible service, I will never buy from here again.",
            "Battery lasts all day, camera quality is excellent.",
            "Item arrived broken and customer support was useless.",
            "The design is sleek and easy to use.",
            "Excellent build quality; feels premium.",
            "Not impressed. The app is buggy and crashes.",
            "Five stars! Superb customer service and quick refund.",
            "Mediocre performance under heavy load, overheating issue noticed.",
            "Affordable and performs well for the price."
        ]
    }
    df = pd.DataFrame(sample)
    return df


def read_csv_if_exists(path: str) -> pd.DataFrame:
    """Load a CSV file and check that it has a 'review_text' column."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Input file not found: {path}")
    df = pd.read_csv(path)
    if 'review_text' not in df.columns:
        raise ValueError("Input CSV must contain a column named 'review_text'")
    return df


def sanitize_text_for_model(text: str, max_len: int = 512) -> str:
    """Return the text cut to max_len characters (empty string for non-string input)."""
    if not isinstance(text, str):
        return ""
    return text[:max_len]


def flatten(list_of_lists: Iterable[Iterable[str]]) -> List[str]:
    """Turn a list of token lists into one flat list of tokens."""
    return [item for sub in list_of_lists for item in sub]


def get_top_ngrams(words: List[str], n: int = 1, top_k: int = 10) -> List[Tuple[str, int]]:
    """Return the top_k most frequent n-grams as (ngram, count) pairs."""
    if n < 1:
        return []
    if n == 1:
        counts = Counter(words)
        return counts.most_common(top_k)
    # zip over n shifted copies of the word list to build consecutive n-grams
    ngrams = zip(*(words[i:] for i in range(n)))
    ngram_strings = [" ".join(gram) for gram in ngrams]
    counts = Counter(ngram_strings)
    return counts.most_common(top_k)

class CustomerReviewAnalyzer:
    """Runs sentiment analysis, keyword extraction, and plotting on customer reviews."""

    def __init__(self, df: pd.DataFrame):
        if 'review_text' not in df.columns:
            raise ValueError("DataFrame must contain 'review_text' column")
        self.df = df.copy()
        self.df['review_text'] = self.df['review_text'].astype(str)
        self.sentiment_pipe = None
        self.nlp = None

    def setup_models(self):
        log.info("Loading sentiment model (transformers pipeline)...")
        # Hugging Face will download the default sentiment pipeline model if not available
        self.sentiment_pipe = pipeline("sentiment-analysis")
        log.info("Loading spaCy model (en_core_web_sm)...")
        self.nlp = spacy.load("en_core_web_sm")
        log.info("Models loaded.")

    def run_sentiment_analysis(self, trunc: int = 512):
        if self.sentiment_pipe is None:
            raise RuntimeError("Sentiment model not loaded. Call setup_models() first.")
        labels, scores = [], []
        log.info("Running sentiment analysis on reviews...")
        for i, txt in enumerate(self.df['review_text']):
            # Truncate by characters so very long reviews do not overload the model
            safe_txt = sanitize_text_for_model(txt, max_len=trunc)
            try:
                result = self.sentiment_pipe(safe_txt)[0]
                labels.append(result.get('label', 'UNKNOWN'))
                # 'score' is the model's confidence in the predicted label
                scores.append(result.get('score', None))
            except Exception as e:
                log.warning("Sentiment analysis failed for row %d: %s", i, str(e))
                labels.append("ERROR")
                scores.append(None)
        self.df['sentiment_label'] = labels
        self.df['sentiment_score'] = scores
        log.info("Sentiment analysis complete.")

    def extract_keywords_spacy(self):
        if self.nlp is None:
            raise RuntimeError("spaCy model not loaded. Call setup_models() first.")
        log.info("Extracting keywords with spaCy...")
        kw_list = []
        for doc in self.nlp.pipe(self.df['review_text'].tolist(), disable=["ner"]):
            # Keep lowercase lemmas of alphabetic, non-stop-word tokens
            tokens = [token.lemma_.lower() for token in doc
                      if token.is_alpha and not token.is_stop]
            kw_list.append(tokens)
        self.df['keywords'] = kw_list
        log.info("Keyword extraction complete.")

    def compute_ngram_stats(self, top_k: int = 20) -> dict:
        all_keywords = flatten(self.df['keywords'].tolist())
        unigrams = get_top_ngrams(all_keywords, n=1, top_k=top_k)
        bigrams = get_top_ngrams(all_keywords, n=2, top_k=top_k)
        return {'unigrams': unigrams, 'bigrams': bigrams}

    def plot_sentiment_distribution(self, show: bool = True, save_path: str = None):
        log.info("Plotting sentiment distribution...")
        plt.figure(figsize=(8, 5))
        sns.countplot(x=self.df['sentiment_label'],
                      order=self.df['sentiment_label'].value_counts().index)
        plt.title("Customer Sentiment Distribution")
        plt.xlabel("Sentiment")
        plt.ylabel("Number of Reviews")
        plt.tight_layout()
        if save_path:
            plt.savefig(save_path)
            log.info("Saved sentiment distribution to %s", save_path)
        if show:
            plt.show()
        plt.close()

    def generate_wordcloud(self, max_words: int = 100, show: bool = True, save_path: str = None):
        if not WORDCLOUD_AVAILABLE:
            log.warning("WordCloud library not found. Skipping wordcloud generation.")
            return
        all_text = " ".join(flatten(self.df['keywords'].tolist()))
        if not all_text.strip():
            log.warning("No keyword text available for wordcloud.")
            return
        wc = WordCloud(width=800, height=400, max_words=max_words).generate(all_text)
        plt.figure(figsize=(10, 5))
        plt.imshow(wc, interpolation="bilinear")
        plt.axis("off")
        plt.title("Keyword WordCloud")
        if save_path:
            plt.savefig(save_path)
            log.info("Saved wordcloud to %s", save_path)
        if show:
            plt.show()
        plt.close()

    def save_results(self, out_csv: str):
        # Assumed implementation: the listing calls save_results() but its body
        # is not included in the report, so a plain CSV export is used here.
        self.df.to_csv(out_csv, index=False)
        log.info("Saved results to %s", out_csv)

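def main():
    # The argument parsing and data loading steps are missing from the listing.
    # They are reconstructed here from the names used below (args.topk, args.output,
    # args.no_plots, args.no_wordcloud) and from the program output; the --input
    # flag name and all default values are assumptions.
    parser = argparse.ArgumentParser(description="Customer review analyzer")
    parser.add_argument("--input", default=None, help="CSV file with a 'review_text' column")
    parser.add_argument("--output", default="review_analysis_output.csv", help="Output CSV path")
    parser.add_argument("--topk", type=int, default=15, help="Number of top n-grams to print")
    parser.add_argument("--no-plots", action="store_true", help="Skip all plots")
    parser.add_argument("--no-wordcloud", action="store_true", help="Skip the wordcloud")
    args = parser.parse_args()

    if args.input:
        df = read_csv_if_exists(args.input)
    else:
        log.info("No input provided. Using embedded sample dataset.")
        df = load_sample_data()

    analyzer = CustomerReviewAnalyzer(df)
    try:
        analyzer.setup_models()
    except Exception as e:
        log.error("Failed to load models: %s", str(e))
        log.error("Make sure 'transformers' and 'en_core_web_sm' are installed.")
        sys.exit(1)

    # Run analysis steps
    analyzer.run_sentiment_analysis()
    analyzer.extract_keywords_spacy()
    ngram_stats = analyzer.compute_ngram_stats(top_k=args.topk)

    # Print summary outputs to console
    log.info("=== Sentiment Counts ===")
    print(analyzer.df['sentiment_label'].value_counts().to_string(), "\n")
    log.info("=== Top Unigrams ===")
    for tok, cnt in ngram_stats['unigrams'][:args.topk]:
        print(f"{tok}: {cnt}")
    print()
    log.info("=== Top Bigrams ===")
    for tok, cnt in ngram_stats['bigrams'][:args.topk]:
        print(f"{tok}: {cnt}")
    print()

    # Plots and wordcloud
    if not args.no_plots:
        analyzer.plot_sentiment_distribution(show=True,
                                             save_path="sentiment_distribution.png")
        if not args.no_wordcloud:
            analyzer.generate_wordcloud(show=True, save_path="wordcloud.png")
        else:
            log.info("Wordcloud generation skipped by flag.")
    else:
        log.info("Plotting skipped by flag (--no-plots).")

    # Save CSV
    analyzer.save_results(out_csv=args.output)
    log.info("Processing complete. Output CSV: %s", args.output)


if __name__ == "__main__":
    main()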
OUTPUT :
21:10:01 INFO - No input provided. Using embedded sample dataset.
21:10:01 INFO - Loading sentiment model (transformers pipeline)...
21:10:05 INFO - Loading spaCy model (en_core_web_sm)...
21:10:05 INFO - Models loaded.
21:10:05 INFO - Running sentiment analysis on reviews...
21:10:06 INFO - Sentiment analysis complete.
21:10:06 INFO - Extracting keywords with spaCy...
21:10:07 INFO - Keyword extraction complete.
21:10:07 INFO - === Sentiment Counts ===
POSITIVE 9
NEGATIVE 4
NEUTRAL 2
21:10:07 INFO - === Top Unigrams ===
quality: 3
delivery: 2
battery: 2
service: 2
price: 2
design: 1
sound: 1
backup: 1
support: 1
money: 1
value: 1
app: 1
issue: 1
refund: 1
performance: 1
review_text                                          sentiment_label  sentiment_score
Great quality and fast delivery! Very satisfied.     POSITIVE         0.999
Poor battery life. Not worth the price.              NEGATIVE         0.998
It's okay, nothing special but works as expected.    NEUTRAL          0.672