CHAPTER 1
INTRODUCTION
1.1 About the Company
Rooman Technologies, established in 1999 in Bangalore by a group of technology
enthusiasts, is a premier training and skill development company. It uses state-of-the-art
IT training and creative solutions to empower people and improve their lives. Its highly
qualified technical staff are experts at creating successful and efficient learning
programs, and they are dedicated to turning dreams into reality by giving people knowledge
and skills pertinent to the industry [1].
The Wadhwani Foundation strives to empower and enhance people's lives through
impactful entrepreneurship, skilling, and business acceleration programs. With the
expertise of its dedicated team, it provides innovative and scalable solutions to drive
economic growth and job creation. By transforming aspirations into reality, it enables
individuals and enterprises to achieve their full potential [2].
IBM strives to empower and enhance people's lives through cutting-edge
technology, innovation, and enterprise solutions. With the expertise of its highly skilled
technical teams, it delivers effective and efficient IT services, AI-driven
solutions, and cloud computing advancements. IBM is committed to transforming
aspirations into reality by equipping individuals and businesses with industry-leading
tools, knowledge, and digital transformation strategies [3].
1.2 Company Profile
Company Name: Rooman Technologies
Main Office: #30, 12th Main, 1st Stage Rajajinagar, Bengaluru - 560 010
Email: online@rooman.net
Year Established: 1999
Company Category: IT Software
CHAPTER 2
TECHNICAL SKILLS ACQUIRED
As part of the internship, we underwent 165 hours of intensive technical training provided
by Rooman Technologies, focusing on industry-relevant skills and practical application.
The curriculum was thoughtfully designed to cover foundational and advanced topics
essential for careers in AI and Machine Learning. We began with Python programming,
mastering core concepts such as data structures, functions, and file handling. The training
expanded into prompt engineering and the use of AI coding assistants to enhance
development efficiency. We also explored the fundamentals of Linux and networking,
building a solid understanding of system-level operations. Web development with Flask
and database integration provided us with full-stack capabilities. Cloud computing
modules introduced us to deployment strategies and orchestration platforms. Using Power
BI, we learned to analyze data and visualize insights through interactive dashboards.
Additionally, we developed and deployed machine learning models within the Power BI
environment. The hands-on approach and structured practice sessions enabled us to
translate theoretical concepts into real-world skills.
2.1 Programming and Scripting Skills
As part of the internship program, I received in-depth training in Python programming,
starting from the fundamentals and gradually moving toward more advanced topics. I
began with learning the basic syntax and structure of Python, which included
understanding how to write clean, readable, and efficient code. I then explored control
structures such as conditional statements (if, elif, else) and loops (for, while), which
allowed me to control the logical flow of programs effectively. The module also
emphasized the importance of functions, teaching me how to build modular and reusable
code blocks that enhance clarity and maintainability. A significant focus was placed on
understanding and working with various data structures like lists, tuples, sets, and
dictionaries, which are essential for organizing and processing data efficiently.
Furthermore, I gained practical experience in file handling, including reading from and
writing to files in different formats such as .txt and .csv. These skills laid the groundwork
for performing a wide range of programming tasks, from automation to data
preprocessing, and provided a strong technical base for the rest of the internship modules.
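To illustrate how these fundamentals fit together, the following short sketch (an illustrative example, not taken from the training material; the file and column names are hypothetical) combines data structures, a reusable function, and CSV file handling:

import csv

def summarize_scores(path):
    """Read a CSV of (name, score) rows and return basic statistics."""
    records = []                                     # list of dictionaries, one per row
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            records.append({"name": row["name"], "score": float(row["score"])})

    scores = [r["score"] for r in records]           # list comprehension
    unique_names = {r["name"] for r in records}      # a set removes duplicate names
    return {
        "count": len(records),
        "average": sum(scores) / len(scores) if scores else 0.0,
        "participants": sorted(unique_names),
    }

if __name__ == "__main__":
    print(summarize_scores("scores.csv"))            # hypothetical input file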
2.2 Web Development and Databases
The web development segment of the internship equipped me with a comprehensive set
of skills and theoretical understanding necessary for building full-stack web applications.
This training module introduced me to the principles of web architecture and the
foundational technologies that drive modern web services. Using Flask, a lightweight and
flexible Python web framework, I gained hands-on experience in designing and
implementing web applications that encompass routing mechanisms, HTML templating
using Jinja2, form handling, and backend logic. I explored how to structure a web
application effectively, separating concerns between frontend views and backend
processing. In addition to Flask, I was introduced to essential web development concepts
such as the client-server model, HTTP request and response cycles, HTTP methods
(GET, POST, PUT, DELETE), and the structure and use of RESTful APIs for enabling
communication between distributed components. These concepts allowed me to
understand how data flows within a web application and how users interact with the
system. Furthermore, database management was a key component of this module. I was
trained in using SQL (Structured Query Language) to perform a wide range of database
operations, including designing relational schemas, creating and modifying tables,
inserting records, filtering data using conditional queries, and performing joins between
multiple tables. This knowledge was critical in understanding how applications persist
and manage user data.
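Before turning to database integration, a rough illustration of the web concepts above (routing, HTTP methods, Jinja2 templating, and a REST-style endpoint) is given in the minimal Flask sketch below; the route names and in-memory store are assumptions rather than the exact application built during training:

from flask import Flask, request, render_template_string, jsonify

app = Flask(__name__)

customers = []   # in-memory store standing in for a database table

# A tiny Jinja2 template kept inline for brevity
PAGE = "<h1>Customers</h1><ul>{% for c in customers %}<li>{{ c }}</li>{% endfor %}</ul>"

@app.route("/")
def index():
    # Render the customer list through Jinja2 templating
    return render_template_string(PAGE, customers=customers)

@app.route("/api/customers", methods=["GET", "POST"])
def customers_api():
    # A simple REST-style endpoint: GET lists records, POST creates one
    if request.method == "POST":
        customers.append(request.get_json(force=True))
        return jsonify({"status": "created"}), 201
    return jsonify(customers)

if __name__ == "__main__":
    app.run(debug=True)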
A major focus was placed on integrating Flask applications with SQL databases using
libraries such as SQLAlchemy, which allowed for seamless interaction between the web
application and its underlying data layer. I also learned about establishing database
connections, executing CRUD (Create, Read, Update, Delete) operations
programmatically, and securing data transactions. This integration enabled the
development of dynamic, data-driven web applications that respond to user input and
display real-time information. Through various mini-projects and hands-on lab exercises,
I applied these concepts to build end-to-end applications that mimicked real-world
systems. This comprehensive exposure not only strengthened my technical proficiency in
web development but also helped me appreciate the practical challenges and solutions
involved in deploying full-stack applications. Gaining this experience has significantly
improved my confidence in working with web technologies and has prepared me for
further exploration in both backend and frontend development roles.
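A minimal sketch of the Flask and SQLAlchemy integration described above is shown below, assuming the Flask-SQLAlchemy extension and a local SQLite database; the model fields and routes are illustrative rather than the exact ones used in the internship labs:

from flask import Flask, request, jsonify
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///customers.db"   # hypothetical local database
db = SQLAlchemy(app)

class Customer(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(80), nullable=False)
    email = db.Column(db.String(120), unique=True)

@app.route("/customers", methods=["POST"])
def create_customer():
    data = request.get_json(force=True)
    customer = Customer(name=data["name"], email=data.get("email"))
    db.session.add(customer)                 # Create
    db.session.commit()
    return jsonify({"id": customer.id}), 201

@app.route("/customers/<int:cid>", methods=["GET", "DELETE"])
def customer_detail(cid):
    customer = Customer.query.get_or_404(cid)   # Read
    if request.method == "DELETE":
        db.session.delete(customer)              # Delete
        db.session.commit()
        return jsonify({"status": "deleted"})
    return jsonify({"id": customer.id, "name": customer.name})

if __name__ == "__main__":
    with app.app_context():
        db.create_all()                      # create tables on first run
    app.run(debug=True)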
2.3 Cloud Computing and Deployment
Cloud computing formed a crucial component of the internship curriculum, emphasizing
a balanced approach to both theoretical foundations and hands-on practical skills. The
module began by introducing the fundamental concepts of cloud computing, helping me
understand how computing resources such as storage, servers, and networking are
delivered over the internet on-demand. I explored the different service models—
Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service
(SaaS)—each catering to different levels of abstraction and user control. The deployment
models, including public, private, hybrid, and community clouds, were also covered in
detail, offering insight into how organizations choose cloud environments based on their
scalability, security, and operational requirements. Through detailed sessions, I gained a
deeper appreciation of how cloud platforms are structured to offer reliability, cost-
efficiency, and global scalability for applications and services. The training included case
studies and real-world examples that demonstrated how businesses leverage cloud
computing to reduce infrastructure costs and increase agility. I learned how cloud services
support distributed computing and enable rapid deployment and scaling of applications
without the need for extensive physical infrastructure. One of the key focuses of this
module was understanding cloud orchestration platforms—tools and services that help
automate the deployment, configuration, and management of applications across multi-
cloud and hybrid environments.
We were introduced to the concepts of Continuous Integration and Continuous
Deployment (CI/CD), Infrastructure as Code (IaC), and other DevOps practices that are
essential for modern software delivery pipelines. A highlight of the cloud computing
training was the practical sessions where I got hands-on experience in deploying web
applications to cloud platforms. I practiced configuring virtual machines, setting up
cloud-based environments, and managing resources such as storage and databases. These
exercises helped me understand important concepts like virtualization, which allows
multiple virtual systems to run on a single physical machine, and containerization, which
packages applications and their dependencies into isolated units for consistent execution
across environments. Additionally, I explored the use of cloud dashboards, billing
systems, and monitoring tools that are critical for managing costs and performance in
cloud-based projects. This segment of the training gave me the ability to understand how
applications can be securely hosted and accessed online, and how fault-tolerance and
availability are ensured using cloud-native tools. Overall, this module significantly
enhanced my technical proficiency in working with cloud technologies. It bridged the gap
between theoretical knowledge and real-world application, preparing me to confidently
approach scenarios involving cloud migration, application deployment, and modern IT
infrastructure management in professional environments.
2.4 Data Analytics and Machine Learning
The final and most technical phase of the internship focused on data analytics and
machine learning, two of the most critical and high-demand domains in the modern
technology landscape. The training began with an in-depth introduction to Power BI, a
robust business intelligence and data visualization tool widely used across industries for
making data-driven decisions. I learned how to connect to various data sources, perform
data transformations using Power Query, and clean datasets to ensure consistency and
accuracy. A significant part of the Power BI training involved using Data Analysis
Expressions (DAX) to create calculated columns and measures, enabling deeper insights
through custom aggregations and logic.
Evaluation of machine learning models was another key area of focus. I learned how to
assess model performance using metrics such as accuracy, precision, recall, F1-score, and
Root Mean Squared Error (RMSE), depending on whether the task was classification or
regression. Understanding how to interpret these metrics and improve model performance
through hyperparameter tuning, cross-validation, and model optimization techniques was
also part of our training.
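The snippet below is a compact sketch of this evaluation and tuning process using scikit-learn; the synthetic dataset, the chosen model, and the parameter grid are assumptions for illustration only:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic data stands in for the internship datasets
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameter tuning with 5-fold cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

y_pred = search.best_estimator_.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))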
This immersive experience gave me a holistic understanding of the machine learning
workflow, from understanding the business problem to delivering a functioning analytical
solution. It greatly enhanced my technical capabilities, deepened my understanding of
model interpretability, and improved my problem-solving approach. More importantly, it
taught me how data analytics and machine learning can be applied strategically to drive
business value, making this phase of the internship both enriching and transformative in
my journey as a budding AI and Machine Learning engineer.
CHAPTER 3
SOFT SKILLS & PROFESSIONAL DEVELOPMENT
Alongside technical training, the internship placed strong emphasis on soft skills and
professional development, particularly through the 75-hour module facilitated by the
Wadhwani Foundation. This training focused on enhancing our verbal and written
communication, ensuring we could express ideas clearly and confidently in a professional
setting. We were also trained in professional etiquette, which included appropriate
workplace behavior, grooming, punctuality, and respectful interactions with colleagues
and superiors. The sessions helped us become more team-oriented, promoting effective
collaboration, active listening, and the ability to contribute meaningfully in group tasks.
We worked on leadership skills, such as decision-making, time management, and taking
initiative during challenging scenarios. The training also introduced structured approaches
to problem-solving, enabling us to analyze issues systematically and propose practical
solutions. Additionally, through IBM’s modules, we gained exposure to Agile
methodology, Scrum practices, and Design Thinking, which are widely used in modern
project workflows. These experiences collectively helped us grow as well-rounded
professionals ready to adapt and thrive in real-world work environments.
3.1 Communication and Etiquette
During the internship, special emphasis was placed on developing strong verbal and
written communication skills, delivered through the Wadhwani Foundation’s online
learning platform. The training was engaging and interactive, featuring animated,
cartoon-style videos that presented real-life workplace scenarios in a fun and relatable
manner. After each video, there were questions, exercises, and quizzes that tested our
understanding and encouraged reflection. These sessions helped us improve our ability to
articulate ideas clearly during presentations, group discussions, and professional
conversations. We also learned to draft effective emails, reports, and formal documents,
enhancing our written communication. To further support our learning, Wadhwani
representatives occasionally conducted live sessions or check-ins, offering guidance,
answering queries, and motivating us throughout the training. The platform made the
learning process enjoyable while ensuring we grasped important communication
principles that are crucial in the workplace.
3.2 Teamwork and Leadership
The internship included structured modules on teamwork and collaboration, especially
through the Wadhwani Foundation’s online platform. Although there were no direct
group tasks or live team activities, the training effectively conveyed the importance of
teamwork using animated videos, case studies, and reflective exercises. These sessions
simulated real workplace situations where team collaboration is essential.
In parallel, leadership skills were developed through similar engaging content. The
modules demonstrated how to take initiative, lead by example, and guide a team through
challenges using storytelling-based scenarios. Additionally, the training highlighted the
difference between authority-based leadership and influence-based leadership,
encouraging us to adopt a more inclusive and thoughtful leadership style. Through
consistent exposure to these concepts, we built awareness of how leaders should adapt
their communication, manage time, and support their teams to drive collective success.
3.3 Problem-Solving and Agile Mindset
The internship program placed significant focus on developing a structured approach to
problem-solving, critical thinking, and adaptability—essential traits for any technology
professional. Through animated case studies and real-world inspired tasks on the
Wadhwani Foundation platform, we were introduced to scenarios requiring analytical
thinking and decision-making. These modules helped us understand how to break down
complex challenges into manageable parts, identify the root cause of issues, and evaluate
multiple solutions before choosing the most effective one.
In addition to problem-solving, we were introduced to the Agile mindset, which plays a
crucial role in modern software development and project management. Through sessions
conducted as part of the IBM training, we learned about Agile principles, the Scrum
framework, and the Design Thinking methodology. We explored concepts like user stories,
sprints, daily stand-ups, iterative development, and team roles such as Scrum Master and
Product Owner. These methodologies emphasized adaptability, continuous feedback, and
collaboration—key elements in dynamic work environments. By combining Agile
thinking with structured problem-solving, we gained a mindset focused on continuous
improvement, flexibility, and innovation.
CHAPTER 4
FOUNDATIONS OF MACHINE LEARNING AND
MODEL BUILDING
As part of the internship program, a specialized module on Foundations of Machine
Learning and Model Building was offered to build both conceptual understanding and
hands-on skills. The training began with orientation and foundational sessions on Agile
methodologies, Git, GitHub, and Design Thinking. Participants were introduced to key
libraries such as Scikit-learn, NumPy, and Pandas, with practical exercises in data
aggregation, data splitting, and model evaluation. The module systematically covered the
machine learning workflow, including both supervised and unsupervised techniques.
Cloud-based services, real-time data sources, and feature engineering were explored to
strengthen real-world applicability. Labs focused on AutoAI, classification algorithms,
RNN with MNIST, and sentiment analysis using cloud-based tools. Weekly recap
sessions ensured continuous learning and concept reinforcement. The learning platform
also supported collaborative tools like discussion forums, project mapping, and progress
tracking. Quizzes and lab activities were embedded into the learning cycle to test and
apply knowledge. Each week progressively developed the participants' ability to build
and deploy ML models. The module balanced theory with practical application, preparing
learners to solve industry problems. Overall, this training acted as a bridge between
academic concepts and real-world machine learning implementation. It laid a solid
foundation for future work in AI and data science.
4.1 Learning Structure and Platform Features
The IBM SkillsBuild platform provided a comprehensive and structured learning
environment tailored to help learners progress effectively through both foundational and
advanced concepts. The platform was enriched with multiple integrated tools such as
discussion forums—where learners could post questions and interact with mentors—lab
access, detailed project descriptions, project-student mapping, and a Learning
Management System (LMS) user guide that made navigation seamless.
The curriculum was distributed across multiple weeks, beginning with an orientation and
moving into essential tech skills like Agile methodologies, Git & GitHub, and Design
Thinking using IBM Mural. In subsequent weeks, technical depth increased with topics
such as Scikit-learn models, data aggregation in Python, data splitting, and machine
learning evaluation metrics. Recap sessions helped consolidate learning from Agile, Git,
Python, Pandas, NumPy, and machine learning essentials.
Weeks 4 through 11 continued this progression with concepts like ML workflows, feature
engineering, data visualization, and exposure to IBM Cloud Services. Although full
details for Weeks 6–11 were not specified, they included continuous engagement through
daily sessions and interactive labs.
By Week 12, project documentation began in earnest, culminating in the submission of
Phase 1. Subsequent weeks (Weeks 13 onward) involved further documentation of phases
2 through 4, including preprocessing, model training, and deployment. Each phase was
matched with its own quiz, assessing students’ understanding of both theoretical and
applied components.
Practical understanding was also reinforced through a variety of assignments, such as
Predicting House Prices, Customer Churn Prediction, and Clustering Algorithm Analysis,
all of which were completed and graded. A series of lab experiments focused on core ML
topics like classification, regression, clustering, RNN with MNIST datasets, and
sentiment analysis using Watson NLU.
Finally, regular feedback forms were available during Weeks 12 and 13, giving learners a
channel to reflect on their learning experience and suggest improvements. This structured,
multifaceted approach made IBM SkillsBuild an effective and interactive platform for
mastering applied machine learning and real-world project development.
4.2 Weekly Course Content and Technical Modules
The learning journey began with foundational sessions designed to build a strong base in
modern software development and collaboration practices. Week 1 introduced us to Agile
methodologies, emphasizing iterative development and team collaboration. This was
followed by hands-on learning in Git and GitHub, essential tools for version control and
project management. We also explored Design Thinking using IBM Mural, which
encouraged creative problem-solving and user-centered design.
In Week 2, the focus shifted to technical prerequisites for machine learning. We explored
Scikit-learn models, gaining exposure to widely used algorithms and tools in the Python
ecosystem. Sessions on data aggregation using Python taught us how to clean and
consolidate datasets, while lessons on data splitting prepared us for model training and
evaluation.
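A small sketch of what such data aggregation and splitting look like in practice is given below; the transaction table is invented for illustration:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical transaction-level data
sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "amount":      [120.0, 80.0, 200.0, 50.0, 310.0],
    "churned":     [0, 0, 1, 1, 0],
})

# Aggregation: one row per customer with total and average spend
features = sales.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    churned=("churned", "max"),
).reset_index()

# Splitting: hold out 20% of customers for evaluation
X = features[["total_spend", "avg_spend"]]
y = features["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)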
Week 3 deepened our understanding with content on model evaluation metrics, where we
explored accuracy, precision, recall, F1-score, and confusion matrices—critical tools for
assessing machine learning performance. This week also included two recap sessions
covering key concepts from Agile, Scrum, Jira, Git, and programming fundamentals in
Python, including string manipulation and Pandas. The final session of the week revisited
NumPy, Scikit-learn, and core machine learning principles, reinforcing theoretical and
practical knowledge.
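For example, a confusion matrix and the related per-class metrics can be produced in a few lines with scikit-learn; the labels below are made up purely for illustration:

from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true and predicted labels from a binary classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# Precision, recall, and F1-score per class, plus overall accuracy
print(classification_report(y_true, y_pred))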
By Week 4, we were introduced to the machine learning workflow, including data
preparation, model building, and evaluation. A demo of the Learning Management
System (LMS) provided us with tips on tracking progress and accessing resources. We
also examined different types of machine learning, such as supervised, unsupervised, and
reinforcement learning, along with real-world examples.
Week 5 covered IBM Cloud Services, offering insight into scalable cloud-based tools and
infrastructure. Sessions on data sources explored structured and unstructured data
formats, followed by feature engineering techniques that emphasized the importance of
selecting and transforming input variables for better model performance. The week
concluded with an engaging session on data visualization, highlighting the role of charts
and plots in exploratory data analysis and presentation.
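As a brief illustration of feature engineering and visualization (with invented records, not the actual course data), a derived ratio feature can be created and plotted as follows:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical customer records
df = pd.DataFrame({
    "Annual Income (k$)": [15, 40, 75, 120],
    "Spending Score (1-100)": [39, 81, 6, 77],
    "Age": [21, 35, 52, 28],
})

# Feature engineering: derive a spend-to-income ratio as a new input variable
df["Spend_Income_Ratio"] = df["Spending Score (1-100)"] / df["Annual Income (k$)"]

# Exploratory visualization of the engineered feature against age
sns.scatterplot(data=df, x="Age", y="Spend_Income_Ratio")
plt.title("Engineered feature vs. Age")
plt.show()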
Although the detailed schedules for Weeks 6 to 11 were not explicitly outlined, these
weeks continued to build on our learning with advanced concepts, lab experiments, and
project guidance. By Week 12, we transitioned into the project documentation phases,
starting with problem analysis, and moving on to data preprocessing, model training and
evaluation, and model deployment and interface development in Week 13.
Throughout the program, technical modules were reinforced through a mix of
assignments, quizzes, and lab experiments focused on real-world use cases such as
predicting house prices, customer churn, clustering analysis, and digit recognition using
RNNs and MNIST datasets. These tasks not only enhanced our technical skills but also
prepared us for industry-relevant challenges.
This structured weekly progression ensured a balanced blend of theory, practice, and
project-based learning, resulting in a robust understanding of machine learning and its
applications.
4.3 Assignments, Labs, and Hands-on Practice
The course also featured multiple assignments and lab exercises that allowed us to apply
theoretical knowledge in a practical way. Assignments included predicting house prices,
customer churn prediction, and clustering analysis, all of which were carefully designed
to simulate real-world problem-solving scenarios. These tasks not only tested our
understanding of machine learning concepts but also helped us develop critical thinking,
data preprocessing, and model evaluation skills.
Each assignment was aligned with the weekly learning objectives and often required the
use of Python libraries such as Pandas, NumPy, Matplotlib, and Scikit-learn. For
example, in the House Price Prediction assignment, we implemented regression
techniques, handled missing values, and performed feature selection. Lab sessions further
reinforced our understanding through practical, hands-on exploration. We experimented
with AutoAI to automatically build and optimize machine learning pipelines using IBM
Watson Studio. Another lab focused on supervised learning, where we applied various
classification and regression techniques using Scikit-learn. A particularly engaging lab
involved digit recognition using Recurrent Neural Networks (RNNs) with the MNIST
dataset, giving us exposure to deep learning techniques and image processing.
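A hedged sketch of the kind of regression pipeline used for the House Price Prediction assignment mentioned above is shown below; the file name, column names, and choice of estimator are assumptions, since the actual assignment data was provided on the platform:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical dataset with numeric feature columns and a "price" target
df = pd.read_csv("house_prices.csv")
X = df.drop(columns=["price"])
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Impute missing values, keep the most predictive features, then fit a regression model
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("select", SelectKBest(score_func=f_regression, k=5)),
    ("regress", LinearRegression()),
])
model.fit(X_train, y_train)

rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print("RMSE:", rmse)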
Several labs required integration with IBM Cloud and made use of Watson Notebooks,
giving us hands-on experience with cloud-based data science platforms. This exposure
was instrumental in understanding how to scale models, manage resources, and
collaborate in real-world environments.
The inclusion of quizzes, project checkpoints, and feedback forms throughout these
modules ensured that learning was continuous and reflective. Collectively, the
assignments and labs transformed theoretical content into a dynamic, skill-based
experience that prepared us to tackle real-time machine learning problems with
confidence and competence.
CHAPTER 5
INTERNSHIP PROJECT OVERVIEW
5.1 Introduction to the Project
The project titled "Advanced Market Segmentation using Deep Clustering" is aimed at
uncovering customer segments within a retail dataset through unsupervised machine
learning techniques. This segmentation approach empowers businesses to understand
their customer base, enabling more targeted marketing strategies and product offerings.
In this project, customer data—such as Annual Income and Spending Score—is collected,
preprocessed, and analyzed using K-Means Clustering, a popular unsupervised learning
algorithm. Data normalization is performed with StandardScaler to ensure uniformity and
scale-independence. The model was trained using an autoencoder architecture to learn
compact, meaningful representations of the input data for effective feature extraction and
clustering. The clustering results are visualized using Seaborn and Matplotlib, allowing
stakeholders to interpret the natural groupings within the data.
An interactive web interface is built using Streamlit, enabling real-time visualization and
predictions. Users can explore raw data, observe dynamic clustering visuals, and input
hypothetical customer profiles to predict their likely segment. The system’s architecture
ensures a smooth user experience, integrating data science, visualization, and user
interaction in a single application.
This project not only demonstrates practical implementation of machine learning but also
showcases skills in data preprocessing, visual analytics, and interactive web app
development, providing a comprehensive solution for customer segmentation challenges
faced by modern businesses.
5.2 System Requirements
The system requirements for the Advanced Market Segmentation using Deep Clustering
application encompass both functional and non-functional aspects to ensure effective
operation. Functionally, the system must load customer data from a structured CSV file
and preprocess the data using techniques such as StandardScaler to normalize features
like Annual Income and Spending Score. It should implement the K-Means clustering
algorithm to categorize customers into distinct segments based on these attributes. The
model was trained using an autoencoder architecture to learn compact, meaningful
representations of the input data for effective feature extraction and clustering. The
system should provide visualizations of the clusters using Matplotlib and Seaborn,
enabling clear insights into customer groupings. An interactive Streamlit interface must
allow users to view the dataset, explore cluster plots, and input custom values via sliders
to predict customer segments in real-time using the trained model.
From a non-functional perspective, the application should be responsive and deliver
results efficiently, even with large datasets. It must be scalable to accommodate future
enhancements like additional clustering algorithms or advanced customer profiling
features. Security is essential, particularly in handling user inputs to prevent misuse or
data corruption. Usability is also a key factor, requiring a clean, intuitive interface that is
accessible to both technical and non-technical users. Lastly, the application should ensure
consistent performance and reliability, offering accurate clustering and robust handling of
edge cases or missing data. Together, these requirements define a scalable, interactive,
and insightful market segmentation tool suitable for strategic customer analysis.
5.2.1 Functional Requirements
The functional requirements define the core capabilities of the customer segmentation
application, focusing on data processing, clustering, interactive prediction, and
visualization. The following features are implemented in the system:
Data Import & Preprocessing
Users can upload customer data in CSV format. The system processes relevant
features (e.g., Annual Income and Spending Score) using StandardScaler for
normalization before clustering.
Feature Extraction using Autoencoder
An autoencoder model is employed to learn compressed, meaningful
representations of the input features, enhancing the clustering performance by
reducing noise and dimensionality.
Customer Segmentation using K-Means
The application applies the K-Means clustering algorithm to divide customers into
distinct groups based on selected numerical features, revealing behavioral patterns
for targeted marketing.
Interactive Cluster Prediction
A sidebar interface allows users to input new customer values (Annual Income
and Spending Score) using sliders. The system predicts and displays the
corresponding customer segment using the trained model.
Data Visualization
The application uses Seaborn and Matplotlib to display clustered data in a 2D
scatter plot, with axes labeled by feature names (e.g., "Annual Income (k$)",
"Spending Score (1–100)") and clusters color-coded for clarity.
Display of Uploaded Dataset
Users can view the first few rows of the uploaded dataset using a checkbox,
helping to verify file correctness and data structure.
Real-time UI Interaction
Built with Streamlit, the application responds instantly to user inputs without full-
page reloads, ensuring a smooth and dynamic user experience.
Error Handling & Validation
Input validation and exception handling are implemented to catch missing values,
incorrect data types, or unsupported file formats, ensuring reliable application
behavior.
Feature-specific Segmentation
The system is designed to work with features such as "Annual Income" and
"Spending Score," but it can be easily extended to include more variables (e.g.,
age, gender) for deeper segmentation.
5.2.2 Non-Functional Requirements
The non-functional requirements describe the quality attributes of the clustering
application, focusing on performance, scalability, usability, and security. These factors
ensure the application remains efficient, reliable, and user-friendly:
Performance
The application delivers fast data processing and model predictions, with
visualizations and clustering operations rendered in real-time via Streamlit for a
responsive user experience.
Reliability
The system is designed to operate consistently with minimal downtime. Proper
error handling ensures that issues like missing data or incorrect file formats do not
disrupt the user experience.
Security
Although the system does not handle sensitive financial data, it prevents code
injection, unauthorized file access, and improper user input through validation and
Streamlit’s secure execution environment.
Maintainability
The codebase is well-structured and documented, making it easy to update
models, switch algorithms, or integrate new UI components without breaking
existing functionality.
Compatibility
Built using Python and Streamlit, the system runs on all platforms (Windows,
macOS, Linux) and can be deployed on the web or cloud platforms, ensuring
broad compatibility.
Usability
A user-friendly interface with sliders, checkboxes, and visual feedback makes the
application intuitive for both technical and non-technical users to interact with
customer data and understand clustering results.
Testing
Manual testing ensures that all UI components, model predictions, and
visualizations function as expected. Future updates may include unit testing for
clustering logic and automated validation for input data.
Cost
The application uses lightweight libraries and local resources, ensuring minimal
cost for deployment. It can also be hosted on free tiers of platforms like Streamlit
Cloud or Heroku for demonstration purposes.
5.3 Software Requirements
The following software and tools are essential for developing, testing, and deploying the
customer segmentation application using K-Means clustering efficiently:
Operating System: Windows 10 or higher / Linux / macOS
Programming Language: Python 3.7 or higher (for data processing, modeling,
and deployment)
Development Environment: Visual Studio Code / Jupyter Notebook (for writing
and debugging code)
Data Analysis & Machine Learning Libraries:
o Pandas (for data manipulation)
o NumPy (for numerical computations)
o Scikit-learn (for K-Means clustering, scaling, and preprocessing)
o TensorFlow/Keras (for the autoencoder model)
o Matplotlib and Seaborn (for data visualization)
Web App Framework: Streamlit (for creating interactive UI and deploying the
clustering application)
Model Deployment Tools: Streamlit Cloud / Localhost (for running the app in
browser)
Version Control: Git & GitHub (for source code management and collaboration)
Virtual Environment Manager: venv or conda (for managing project
dependencies)
Data Format: CSV files (used to load and analyze customer data)
Testing Tools: Streamlit's built-in tools for UI validation and manual testing for
logic correctness
5.4 Methodology
The methodology adopted for this project involves several systematic steps, from data
acquisition to the deployment of a Streamlit-based interactive application for customer
segmentation. The project leverages K-Means clustering, a popular unsupervised machine
learning algorithm, to group customers based on their behavioral attributes.
Data Collection
The dataset used in this project was obtained from a CSV file named
customers.csv. It contains customer information such as CustomerID, Annual
Income (k$), and Spending Score (1-100). This dataset serves as the foundation
for clustering analysis.
Data Preprocessing
Before applying any machine learning algorithm, data preprocessing is essential to
ensure accuracy and efficiency:
Irrelevant columns such as CustomerID were removed.
Only relevant numerical features (Annual Income (k$) and Spending Score (1-
100)) were selected for clustering.
StandardScaler was used to normalize the feature values, ensuring equal
contribution of both features to the clustering algorithm.
Model Training (Autoencoder + K-Means Clustering)
The clustering model was trained in two stages:
An autoencoder model was constructed using the Keras library to learn
compressed, meaningful representations of the scaled input features.
The encoded (latent) features obtained from the autoencoder were then passed
to the K-Means clustering algorithm implemented using the scikit-learn
library.
Customers were grouped into clusters based on the encoded representations,
reflecting income and spending behavior more effectively.
Cluster labels were appended to the original dataset for further analysis and
visualization.
Building the Streamlit Application
An interactive web application was built using the Streamlit framework to make
the solution user-friendly and accessible:
A sidebar was created for user input to simulate prediction on unseen customer
data (based on income and spending score).
Visualization tools like Matplotlib and Seaborn were used to display cluster
separation graphically.
The model predicts which customer segment a new user belongs to using the
trained K-Means model and StandardScaler.
Model Evaluation
Though unsupervised learning does not use labeled outcomes, the performance
was visually evaluated using scatter plots showing distinct customer segments.
The separation of clusters indicates the effectiveness of segmentation.
Deployment
The entire application was deployed locally using Streamlit and can be further
deployed using cloud-based platforms like Streamlit Cloud or Heroku for public
access.
CHAPTER 6
INTERNSHIP PROJECT IMPLEMENTATION
6.1 Introduction
The Customer Segmentation Application identifies customer groups based on annual
income and spending behavior, enabling personalized marketing. The project uses an
autoencoder to compress and extract meaningful features from the data, followed by K-
Means Clustering to segment customers into distinct behavioral groups.
Built using Python and Streamlit, the system provides an interactive interface where users
can explore customer clusters visually and input new customer data to predict their
segment. The application leverages Pandas for data manipulation, Matplotlib and Seaborn
for visualizations, and Scikit-learn for data scaling and clustering.
To ensure consistency in clustering, StandardScaler is applied to normalize the input
features, and the clustering model is trained on two primary features: Annual Income (k$)
and Spending Score (1-100). The resulting clusters are visualized in real time through a
web interface, enhancing understanding of customer distribution.
This tool supports real-time prediction by allowing users to simulate new customer data
via a sidebar and instantly receive a cluster assignment, making it an effective
demonstration of unsupervised learning for business insights.
6.2 Algorithm Used
The core algorithm used in this project is K-Means Clustering, an unsupervised machine
learning algorithm that partitions a dataset into a predefined number of distinct clusters
based on feature similarity. The goal of K-Means is to identify underlying patterns in
customer behavior, enabling effective segmentation for targeted marketing strategies.
K-Means Algorithm Overview:
Initialization:
Select k initial centroids randomly, where k is the number of clusters
predefined by the user.
Assignment Step:
Each data point (e.g., a customer represented by annual income and
spending score) is assigned to the nearest centroid based on Euclidean
distance. This forms k clusters.
Update Step:
The centroids of each cluster are recalculated as the mean of all data points
assigned to that cluster.
Iteration:
The assignment and update steps are repeated until convergence, i.e., the centroids no longer
change significantly, or a maximum number of iterations is reached.
Prediction:
Once trained, the algorithm can assign new customers to one of the learned
clusters based on their features.
Preprocessing Techniques Used:
StandardScaler: Before clustering, features like 'Annual Income (k$)' and
'Spending Score (1-100)' are standardized to ensure fair distance computation, as
these features may have different units and scales.
6.2.1 Workflow
The project follows a systematic workflow that combines data preprocessing, machine
learning, and visualization to segment customers effectively. Below is the step-by-step
workflow of the clustering-based customer segmentation system:
Step 1: Initialize the Project Environment
Set up a Python environment using venv or Anaconda.
Install necessary libraries:
Pandas (Data processing)
NumPy (Numerical computations)
Matplotlib & Seaborn (Data visualization)
Scikit-learn (Machine learning & preprocessing)
Streamlit (Web-based interactive interface)
Step 2: Data Loading and Exploration
Load the customer data from a CSV file.
Inspect data types, null values, and statistical summaries.
Perform Exploratory Data Analysis (EDA) through visualizations to understand
feature distributions and relationships.
Step 3: Data Preprocessing
Select relevant features (Annual Income (k$) and Spending Score (1-100)) for
clustering.
Handle missing values, if any.
Normalize the selected features using StandardScaler to ensure all features
contribute equally to the distance calculations.
Step 4: Feature Extraction using Autoencoder
Use an autoencoder to learn compressed, meaningful representations of the scaled
features, reducing dimensionality and noise.
Step 5: K-Means Clustering Implementation
Choose the optimal number of clusters using the Elbow Method.
Fit the K-Means model on the scaled feature set.
Predict the cluster labels and append them to the original dataset.
Step 6: Cluster Visualization
Plot the resulting clusters using Matplotlib or Seaborn.
Use distinct colors for each cluster.
Label axes with real feature names (Annual Income (k$) and Spending Score (1-
100)) for better interpretability.
Step 7: Real-Time Interaction with Streamlit
Create an intuitive Streamlit app interface.
Add sliders or input fields in the sidebar for users to simulate a new customer’s
income and spending score.
Use the trained model to predict the cluster for the new input in real-time.
Dynamically display the cluster number and visualize where the customer fits in
the overall segmentation plot.
6.3 Code Snippets
Code snippets are small, reusable pieces of code that demonstrate the implementation of
specific functionalities within a program. They help in understanding the logic and
structure of key components in the system.
Importing Libraries and Loading Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import streamlit as st
These libraries are essential for data manipulation (pandas, numpy), visualization
(matplotlib, seaborn), machine learning (KMeans, StandardScaler), and building the user
interface (streamlit).
Reading and Displaying Data
df = pd.read_csv('Customers_augmented.csv')
st.write("Dataset Preview:")
st.dataframe(df.head())
The dataset is loaded from a CSV file and previewed in the Streamlit interface to give
users a quick glance at the data.
Feature Selection and Preprocessing
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]   # column names as used in the visualization below
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Key features are selected for clustering. StandardScaler is applied to normalize the values,
which is critical for effective distance-based clustering.
Elbow Method to Determine Optimal Clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
This loop computes the WCSS (Within-Cluster Sum of Squares) for 1 to 10 clusters to
help determine the optimal number using the Elbow Method.
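The WCSS values can then be plotted to locate the "elbow" visually; the short addition below (not part of the original snippet) sketches that step using the libraries already imported:

plt.figure(figsize=(8, 4))
plt.plot(range(1, 11), wcss, marker='o')          # WCSS against number of clusters
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal k')
st.pyplot(plt)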
Defining an Autoencoder model and training
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

input_dim = X_scaled.shape[1]
input_layer = Input(shape=(input_dim,))
encoded = Dense(2, activation='relu')(input_layer)
decoded = Dense(input_dim, activation='linear')(encoded)
autoencoder = Model(inputs=input_layer, outputs=decoded)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=16, verbose=0)
encoder = Model(inputs=input_layer, outputs=encoded)
X_encoded = encoder.predict(X_scaled)
An autoencoder model is built with a 2-neuron bottleneck layer to learn compressed
representations of the input data. After training it to reconstruct the original features, the
encoder part is used to reduce the dimensionality of the normalized dataset for clustering.
Applying K-Means Clustering
optimal_clusters = 5  # example value, chosen using the Elbow Method
kmeans = KMeans(n_clusters=optimal_clusters, init='k-means++', random_state=42)
cluster_labels = kmeans.fit_predict(X_encoded)
df['Cluster'] = cluster_labels
The KMeans algorithm is trained on the encoded features produced by the autoencoder, and
each data point is assigned a cluster label. The cluster labels are stored in the original
DataFrame for analysis and plotting.
Cluster Visualization
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Annual Income (k$)', y='Spending Score (1-100)',
hue='Cluster', palette='Set2', s=100)
plt.title('Customer Segments')
st.pyplot(plt)
The scatterplot shows customer groups based on their annual income and spending score.
Each cluster is colored differently to visualize groupings clearly.
Streamlit Sidebar for Interactivity
st.sidebar.header('Filters')
gender_filter = st.sidebar.multiselect('Select Gender:', df['Gender'].unique(),
default=df['Gender'].unique())
age_range = st.sidebar.slider('Select Age Range:', int(df['Age'].min()),
int(df['Age'].max()), (20, 40))
filtered_df = df[(df['Gender'].isin(gender_filter)) &
                 (df['Age'].between(age_range[0], age_range[1]))]
Sidebar widgets allow users to filter customers by gender and age range. The filtered
subset can then be re-clustered and visualized dynamically.
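The report also describes predicting the segment of a new customer entered through the sidebar. The snippet below is a hedged sketch of how that step could be wired up with the scaler, encoder, and K-Means objects defined earlier; the slider labels and ranges are illustrative assumptions:

st.sidebar.header('Predict a Segment')
income = st.sidebar.slider('Annual Income (k$)', 0, 150, 60)
score = st.sidebar.slider('Spending Score (1-100)', 1, 100, 50)

# Apply the same preprocessing as the training data: scale, then encode
new_point = scaler.transform([[income, score]])
new_encoded = encoder.predict(new_point)

# Assign the encoded point to the nearest learned cluster
segment = kmeans.predict(new_encoded)[0]
st.sidebar.write(f"Predicted Segment: Cluster {segment}")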
6.4 Results
The project effectively segmented retail customers using the deep clustering pipeline,
producing clearly separated clusters. Visualization tools validated
meaningful customer groups. The interactive dashboard enabled real-time insights,
supporting targeted marketing decisions. The system proved scalable, interpretable, and
impactful for retail customer analysis.
Figure 6.1: Streamlit Web App Interface
Figure 6.1 shows a Streamlit web app for customer segmentation using K-Means
clustering. Users input income and spending score to predict a cluster. The interface
displays dataset columns, sample data, and a scatter plot visualizing distinct customer
segments based on annual income and spending patterns.
Figure 6.2: Interface showing Raw Data
Figure 6.2 displays a Streamlit app interface for customer segmentation using K-Means.
Users enter annual income and spending score to predict the customer’s cluster. The
prediction result is shown as "Cluster 1". The interface also includes dataset columns, raw
data preview, and a section for visualizing clustered customer segments.
Figure 6.3: Interface showing Cluster Visualization
Figure 6.3 shows a cluster visualization of customer segments based on annual income
and spending score. Different colors represent distinct K-Means clusters. Each dot
corresponds to a customer, allowing easy identification of behavioral patterns. The left
panel predicts a customer’s cluster from the input values, shown as “Predicted Segment:
Cluster 1”.
CHAPTER 7
ASSESSMENT & CONCLUSION
7.1 Assessment of Internship
The internship provided hands-on experience across the complete machine learning
workflow, from data preprocessing and model building to deployment, strengthening my
understanding of how data-driven applications are designed and delivered. Working through
the structured training and the market segmentation project improved my collaboration
skills, including communication, decision-making, documentation, and problem-solving.
Regular interactions with mentors and trainers reinforced corporate work ethics,
emphasizing quality assurance, time management, and professionalism. Additionally,
developing a real-world customer segmentation system refined my ability to debug issues,
optimize model performance, and handle edge cases and missing data. Exposure to Python,
Scikit-learn, TensorFlow/Keras, Power BI, IBM Cloud services, and Streamlit provided a
comprehensive learning experience, equipping me with industry-relevant skills for future
roles in AI and machine learning engineering.
7.2 Conclusion
The project “Advanced Market Segmentation Using Deep Clustering”, developed as part
of the AI Machine Learning Engineer internship conducted by Rooman Technologies,
IBM, and the Wadhwani Foundation, enabled us to apply machine learning techniques to
solve a real-world business problem. Through unsupervised learning, we successfully
grouped customers based on behavioral patterns and spending habits, providing
actionable insights for targeted marketing and business optimization. This end-to-end
implementation, from data preprocessing to model deployment using tools like Python and
Streamlit, helped reinforce our understanding of AI concepts and real-time application
development. The internship offered a valuable blend of technical learning and practical
experience, equipping us with the skills required to build and deploy intelligent, data-
driven solutions in a professional setting.
REFERENCES
[1] Rooman Technologies, "Rooman Technologies - Skill Development, Training, and IT
Services," https://rooman.com. Accessed: April 23, 2025.
[2] Wadhwani Foundation, "Wadhwani Opportunity - Youth to Better Employment Outcomes,"
https://wadhwanifoundation.org/. Accessed: April 23, 2025.
[3] IBM, "IBM - Let's Create: Solutions, Services, Training," https://www.ibm.com.
Accessed: April 23, 2025.
[4] Md Ashraf Uddin, Md Alamin Talukder, Md Redwan Ahmed, et al., "Data-driven
strategies for digital native market segmentation using clustering," Volume 5,
pp. 178-191, April 30, 2024.
[5] Mahmoud SalahEldin Kasem, Mohamed Hamada, Islam Taj-Eddin, et al., "Customer
profiling, segmentation, and sales prediction using AI in direct marketing," Volume 36,
December 23, 2023.