
INTERNSHIP REPORT ON

REAL-TIME SOCIAL MEDIA ANALYTICS


PIPELINE USING MACHINE LEARNING

Work carried out at

ROOMAN TECHNOLOGIES

Submitted in partial fulfillment of the requirement for the award of


Bachelor of Engineering Degree
In
Electronics and Telecommunication Engineering
of
Visvesvaraya Technological University, Belagavi
By

NAME: AKARSH DHIRAJ SHETTY


USN: 1BI21ET002

Under the guidance of

Internal guide: Prof. N. Chethan Anand, Associate Professor, Dept. of Electronics and Telecommunication Engineering, B.I.T., Bengaluru.

External guide: Harini, Trainer, Rooman Technologies, Bengaluru.

DEPARTMENT OF ELECTRONICS AND TELECOMMUNICATION ENGINEERING


BANGALORE INSTITUTE OF TECHNOLOGY, BENGALURU – 560004
2024 – 2025
BANGALORE INSTITUTE OF TECHNOLOGY
K. R. Road, V. V. Pura, Bengaluru-560004
Department of Electronics & Telecommunication Engineering

CERTIFICATE

Certified that the Internship work entitled “REAL-TIME SOCIAL MEDIA ANALYTICS PIPELINE
USING MACHINE LEARNING” was carried out by AKARSH DHIRAJ SHETTY
(1BI21ET002), a bonafide student of Bangalore Institute of Technology, in partial fulfillment for the
award of the Bachelor of Engineering Degree in Electronics and Telecommunication Engineering of
Visvesvaraya Technological University, Belagavi, during the year 2024-2025. It is certified that all
corrections/suggestions indicated for internal assessment have been incorporated in the report
deposited in the Departmental Library. The Internship Report has been approved as it satisfies the
academic requirements in respect of Internship work prescribed for the said Degree.

Guide: Prof. N. Chethan Anand, Associate Professor

H.O.D.: Dr. S. Shanthala, Professor & Head

External Viva:

Name of the Examiners Signature

Internal:

External:
ACKNOWLEDGEMENT

I would like to take this opportunity to thank all those who have been involved directly or indirectly
in the completion of my Internship.

I express my sincere gratitude to our respected Principal, Dr. Aswath M. U, for providing an excellent academic environment in the college.

I would like to express my gratitude to Dr. S. Shanthala, Head of the Department, Department of Electronics and Telecommunication Engineering, for her encouragement throughout the preparation of this report.

I am grateful to Prof. Shruthi N, Assistant Professor, Department of Electronics and Telecommunication Engineering, for coordinating the Internship and for extending her support and guidance to ensure it was accomplished in accordance with the guidelines.

I would like to thank Prof. N. Chethan Anand, Associate Professor, Department of Electronics and
Telecommunication Engineering and Prof. Thyagaraj R, Assistant Professor, Department of
Electronics and Telecommunication Engineering, who have extended their support, guidance and
assistance for the successful completion of the Internship.

I am grateful to all the teaching and non-teaching staff of the Department of Electronics and
Telecommunication Engineering, for their support and cooperation and I would like to thank my
parents for their constant moral support and encouragement throughout the completion of the
Internship.

ABSTRACT

This report provides a comprehensive overview of the internship experience undertaken at Rooman
Technologies, with a focus on Artificial Intelligence and Machine Learning (AI & ML). The internship
aimed to bridge the gap between academic learning and industry expectations by offering practical
exposure to modern tools, frameworks, and real-world problem-solving approaches. Through hands-on
modules on Python programming, machine learning using Scikit-Learn, cloud computing with IBM
Cloud, data visualization with Power BI, and version control with Git, learners developed both technical
skills and industry-relevant competencies. The training also emphasized ethical AI practices and
prompt engineering, fostering responsible AI development. Overall, the internship provided a solid
foundation in the AI & ML domain, equipping students with the knowledge and experience required
for a successful career in the technology sector.

TABLE OF CONTENTS

CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
CHAPTER-1 : ABOUT THE COMPANY
1.1 Introduction
1.2 Services Offered
1.3 Technical Support
CHAPTER-2 : ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
2.1 Introduction
2.2 Objectives
2.3 Purpose, Scope and Applicability
2.4 Details of All Concepts Learnt
2.4.1 Agile Methodologies and Git Version Control
2.4.2 Python Programming Fundamentals
2.4.3 Numerical Computing with NumPy
2.4.4 IBM Cloud Fundamentals
2.4.5 Structured Query Language
2.4.6 Data Visualization with Power BI
2.4.7 Machine Learning with Scikit-Learn
2.4.8 Prompt Engineering and Ethical AI
2.4.9 Introduction to Linux Operating System
CHAPTER-3 : TASKS PERFORMED
3.1 Introduction
3.2 Problem Statement
3.3 Objectives
3.4 System Architecture
3.5 Module Description
3.5.1 Data Collection Module
3.5.2 Data Processing Module
3.5.3 Data Storage Module
3.5.4 Data Analysis and Visualization Module
3.5.5 Model Deployment Module
3.6 Implementation
3.6.1 Code Snippets
CHAPTER-4 : RESULTS
CHAPTER-5 : OUTCOMES
CHAPTER-6 : WEEKLY REPORT
CHAPTER-7 : CONCLUSION
LIST OF FIGURES

Fig. No.  Title
3.1  System Architecture for Real-Time Social Media Analytics Pipeline
4.1  Interface of the Code Running in the Terminal
4.2  Gathering Image Links from Instagram Using HTTPS
4.3  Program Has Successfully Run and Gathered All the Image Links and Stored Them in the MongoDB Dashboard
4.4  Dashboard of MongoDB
4.5  File Directory (Instagram_analytics) for Storing Images in MongoDB
4.6  Images of #city (Total: 3) Stored in the Post Folder
4.7  Images of #nature (Total: 3) Stored in Startup_log
4.8  Images of #city (Total: 3)
4.9  Images of #nature (Total: 3)

CHAPTER-1

ABOUT ROOMAN TECHNOLOGIES

1.1 INTRODUCTION
Rooman Technologies is a leading IT training and solutions provider in India, renowned for its
commitment to skill development and career-oriented education. Established with the vision to bridge
the gap between academic learning and industry requirements, Rooman offers a wide range of
certified training programs in areas such as Information Technology, Networking, Cybersecurity,
Artificial Intelligence, and more. With a strong presence across the country and strategic partnerships
with government bodies and global tech leaders, Rooman Technologies plays a key role in
empowering students, job seekers, and professionals with practical, industry-relevant skills to
enhance employability and foster innovation.

VISION
To empower individuals through world-class skill development and create a globally competent
workforce by bridging the gap between education and employment, thereby contributing to the
nation’s growth and digital transformation.

MISSION
To deliver high-quality, industry-relevant training and certification programs that enhance
employability, promote entrepreneurship, and support inclusive growth. Rooman Technologies is
committed to transforming lives by providing affordable education, fostering innovation, and
building strong partnerships with industry and government bodies to meet the evolving demands of
the digital economy.

HEAD OFFICE
Name: Rooman Technologies Pvt. Ltd.
Address: #30, 1st Floor, 1st Main Road, 1st Block, Rajajinagar, Bengaluru – 560010, Karnataka,
India.

1.2 SERVICES OFFERED


Skill Development Programs
Training programs aligned with NSDC (National Skill Development Corporation) for enhancing
employability in sectors like electronics, IT, and telecommunications.


Embedded Systems and IoT Training


Courses covering microcontrollers, ARM architecture, IoT protocols, and hands-on development with
platforms like Arduino and Raspberry Pi.

Corporate Training
Delivers customized training solutions to upskill employees in organizations and enterprises.

Placement Assistance
Helps students get job-ready with interview prep and connects them with hiring companies.

Software & Application Development


Builds customized software and mobile/web applications for clients across industries.

Cybersecurity Solutions
Offers security consulting, vulnerability assessments, and cybersecurity services for businesses.

Academic Collaboration with Institutions


Partners with colleges and universities to provide industry-relevant training to students.

1.3 TECHNICAL SUPPORT


1. Languages Used:
C, C++, Embedded C, Python, Java, C#, HTML, JavaScript, CSS, PHP
2. Technologies:
Embedded Systems, Internet of Things (IoT), Machine Learning (ML), Data Science, Artificial
Intelligence (AI), Dot Net (.NET Framework), Web Development
3. Hardware Platforms:
Raspberry Pi, NodeMCU (ESP8266), ESP32, ARM7, 8051, PIC Microcontrollers, Renesas
4. Explanation:
Rooman Technologies provides end-to-end technical training and project support in both
software and hardware domains. Their technical support spans from foundational programming
to advanced technologies like AI and IoT. Students receive hands-on experience with industry-
standard hardware and development tools, enabling them to build real-time embedded and web-
based solutions. The inclusion of languages like Python, C++, and Java, combined with
platforms like Raspberry Pi and ESP32, ensures learners are industry-ready and capable of
working on cutting-edge technology projects.


CHAPTER-2

ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

2.1 INTRODUCTION

The AI-ML Engineer Course is a structured and immersive training program aimed at equipping
students with the foundational and practical skills required in the fields of Artificial Intelligence (AI)
and Machine Learning (ML). The course integrates essential topics from programming, data
processing, machine learning, cloud computing, and data visualization, offering a hands-on approach
through tools like Python, Flask, NumPy, Power BI, and Scikit-learn. It emphasizes real-world
applications and project-based learning to bridge the gap between academic knowledge and industry
demands.

2.2 OBJECTIVES
The key objectives of the course are:
• To introduce students to modern software development practices such as Agile methodologies and
version control.

• To build strong programming fundamentals using Python and its scientific libraries.

• To teach data preprocessing, quality checks, and transformation techniques critical for machine
learning.

• To develop proficiency in building ML models for classification, clustering, and dimensionality reduction.

• To enable learners to visualize data insights using tools like Power BI and integrate ML solutions
using Flask and cloud platforms.

• To empower individuals through industry-relevant education, fostering globally competitive IT professionals by integrating cutting-edge technologies and comprehensive training programs.

• To bridge the skill gap by providing industry-aligned training, fostering employability, and contributing to India's goal of becoming a global skill hub.


2.3 PURPOSE, SCOPE AND APPLICABILITY


• Purpose
The purpose of the course is to provide a solid foundation in AI and ML engineering by
introducing students to real-world tools, technologies, and methodologies. It prepares learners for
careers in data science, AI engineering, and related domains through experiential learning.

• Scope
The course covers a wide range of topics including:

• Agile development and Git version control

• Python programming fundamentals

• Data handling with Pandas and NumPy

• Web development using Flask

• Machine learning using Scikit-learn

• Data visualization and analytics with Power BI

• Cloud computing basics (IBM Cloud)

• Prompt engineering and ethical AI considerations

The course culminates with a capstone project that applies all learned concepts in a real-world
context.

• Applicability

This course is applicable to:


• Computer science students seeking to build a strong career in AI/ML.

• Aspiring data analysts and software developers.

• Professionals interested in transitioning to AI/ML roles.

• Projects requiring data-driven insights, predictive modelling, and automation.

• Through PMKVY and other government-sponsored programs, Rooman contributes to public sector skill development initiatives.
• Training in communication, teamwork, and problem-solving enhances employability across
sectors.


2.4 DETAILS OF ALL THE CONCEPTS LEARNT

2.4.1 AGILE METHODOLOGIES AND GIT VERSION CONTROL


The course began with an introduction to Agile methodologies, focusing on iterative development,
collaboration, and adaptability. Learners studied the Scrum and Kanban models and compared them
with the traditional Waterfall model. Git was introduced as a version control system, and students
practiced essential commands like git init, git add, git commit, and git status to manage source code
and collaborate effectively.

Agile Frameworks:

1. Scrum:
Explored Scrum, a framework where work is divided into small, manageable units called Sprints
(typically 1-4 weeks long). Roles such as Scrum Master, Product Owner, and Development Team
were introduced. Daily stand-up meetings, Sprint Planning, Sprint Review, and Retrospectives were
simulated through in-class activities to provide hands-on experience in team-based iterative
development.

2. Kanban:
The Kanban methodology was also taught, focusing on visualizing workflow using Kanban boards.
Learners created columns like To Do, In Progress, and Done, and managed task cards to track
progress in real-time. Emphasis was placed on limiting work in progress and optimizing flow
efficiency.

3. Comparison with the Waterfall Model:


A detailed comparison was made with the traditional Waterfall model, where development proceeds
linearly through requirement gathering, design, implementation, testing, and maintenance. The
discussion highlighted Agile’s advantages in handling changing requirements over Waterfall’s
rigidity.

Git Version Control System:


To complement Agile, students were introduced to Git, a distributed version control system used for
tracking changes in source code during development. Git allows multiple developers to work on a
project simultaneously without conflicts and provides tools for reverting to previous code states if
needed.


Topics Covered:

1. Basic Commands:
o git init: Initializes a new Git repository.
o git add: Stages changes for the next commit.
o git commit: Records changes to the repository.
o git status: Displays the current status of the working directory.
o git log: Shows a log of commits.

2. Branching & Merging:

o Learners created and switched between branches (git branch, git checkout) to work on
features independently.

o Merging (git merge) was practiced to integrate changes from different branches.

3. Remote Repositories (GitHub/GitLab):

o Concepts of pushing (git push) and pulling (git pull) code from remote repositories were
introduced.

o Students connected their local repositories to platforms like GitHub to practice collaborative
development in teams.

4. Conflict Resolution:

o Real-world scenarios were simulated where merge conflicts occurred, and students learned
how to resolve them effectively.

5. .gitignore and Repository Hygiene:

o Best practices such as excluding unnecessary files from version control and writing clear
commit messages were emphasized.

6. Real-Time Collaboration Tools:

o To further enhance collaboration and project tracking in Agile environments, tools like Jira,
Trello, and GitHub Projects were demonstrated. These tools allow the creation of tasks,
assignment of responsibilities, and real-time updates on progress — closely mimicking
industry workflows.


2.4.2 PYTHON PROGRAMMING FUNDAMENTALS


Python was introduced as a beginner-friendly and versatile programming language. Learners studied
data types such as integers, floats, strings, and Booleans. Control structures like if-else statements,
for and while loops, and Python functions were explored. String manipulation methods like strip(),
replace(), upper(), and indexing/slicing were practiced to handle text-based data.
The re module in Python was used to teach regular expressions. Students learned how to use
re.search(), re.findall(), and re.sub() for pattern matching, validating inputs, and cleaning unstructured
data. This enabled learners to efficiently extract and manipulate textual information.
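As a brief illustration, the short sketch below applies these three functions to a sample caption; the caption text and patterns are illustrative only:

import re

caption = "Loving the sunset at Marina Beach! #nature #travel Contact: info@example.com"

# re.search() finds the first match of a pattern (here, an email address)
match = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", caption)
print(match.group())                  # info@example.com

# re.findall() returns every hashtag in the caption
print(re.findall(r"#\w+", caption))   # ['#nature', '#travel']

# re.sub() cleans the text by removing the hashtags
print(re.sub(r"#\w+", "", caption).strip())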

Core Topics Covered:


1. Data Types and Variables:
Students were taught how to declare and use various data types:
o Integers (int) and Floats (float) for numerical operations.
o Strings (str) for handling text.
o Booleans (bool) for logical operations.
o Typecasting between different types was demonstrated to highlight Python's dynamic typing
capabilities.

2. Control Flow and Conditional Statements:


Logical flow was introduced using:

o if, elif, and else statements for decision-making.


o Logical operators (and, or, not) for building complex conditions; these were applied to real-world
examples such as user login verification, grading systems, and conditional alerts.

3. Loops and Iteration:


Both for loops and while loops were practiced to perform repetitive tasks.
o Iterating over strings, lists, and ranges using for.
o Constructing condition-based repetition with while.
o Use of break, continue, and pass within loops to control flow.

4. Functions and Modular Programming:

o Learners wrote reusable code blocks using def to define functions.


o Concepts of arguments, return values, scope (local and global variables), and default
parameters were discussed.
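A minimal sketch tying these ideas together, using an illustrative grading function with a default parameter:

def grade(score, passing_mark=40):
    """Return a grade label; passing_mark is a default parameter."""
    if score >= 75:
        return "Distinction"
    elif score >= passing_mark:
        return "Pass"
    else:
        return "Fail"

print(grade(82))                   # Distinction
print(grade(45))                   # Pass
print(grade(45, passing_mark=50))  # Fail (default overridden)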


2.4.3 NUMERICAL COMPUTING WITH NUMPY

NumPy (Numerical Python) was introduced as a foundational library for numerical computing in
Python. It provides efficient array operations, mathematical functions, and powerful tools for working
with large datasets, making it indispensable in fields such as data science, machine learning, scientific
computing, and engineering simulations.

Array Creation and Manipulation:


• Creating 1D, 2D, and 3D arrays using functions like np.array(), np.zeros(), np.ones(), np.arange(), and
np.linspace().
• Operations such as indexing, slicing, and modifying array elements.
• Reshaping arrays using reshape() to change dimensions without altering the underlying data.
• Flattening multidimensional arrays using flatten() or ravel() for linear operations.

Mathematical and Statistical Operations:


NumPy provides a suite of built-in mathematical functions that operate efficiently on entire arrays:
• Aggregation functions such as:

o np.sum() – computes the total sum of array elements.


o np.mean() – calculates the mean (average) value.
o np.std() – returns the standard deviation, useful in data analysis.
o np.min(), np.max() – find the minimum and maximum values respectively.
These operations were applied to real datasets to compute key statistical measures and understand
data distributions.
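The short sketch below ties these array and aggregation operations together on a small illustrative array:

import numpy as np

# Create and reshape an array of 12 evenly spaced values
data = np.arange(1, 13).reshape(3, 4)   # 3 rows x 4 columns

print(data[0, :])        # slicing: first row
print(data.flatten())    # back to a 1-D array

# Aggregation functions operate on the whole array at once
print(np.sum(data))                # 78
print(np.mean(data))               # 6.5
print(np.std(data))                # standard deviation
print(np.min(data), np.max(data))  # 1 12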

2.4.4 IBM CLOUD FUNDAMENTALS

The IBM Cloud Fundamentals module provided learners with a conceptual and practical
understanding of cloud computing, focusing on the architecture, service models, and real-world
applications using IBM Cloud as a platform. The training helped students grasp how cloud
technologies enable scalable, on-demand access to computing resources and services, which are
essential for modern software development and deployment.

Overview of Cloud Computing:


Cloud computing was introduced as the delivery of computing services—such as servers, storage,
databases, networking, software, and analytics—over the internet (“the cloud”).


IBM Cloud Services and Models:


1. Service Models:
• IaaS: Provides virtual servers, storage, and networking resources for custom infrastructure setup.
• PaaS: Offers a ready-to-use platform to build, deploy, and manage applications without
infrastructure overhead.
• SaaS: Delivers fully managed software applications accessible over the internet, like IBM Watson
services.

2. Deployment Models:
• Public Cloud: Shared infrastructure accessible to multiple users via the internet.
• Private Cloud: Exclusive cloud environment dedicated to a single organization for enhanced control

and security.
• Hybrid Cloud: Integrates public and private clouds to allow data and applications to move between

environments.

3. Python Integration:
• Cloud Storage: Use Python scripts to upload and retrieve data from IBM Cloud Object Storage.
• Automation: Automate resource management and deployment using Python SDKs and APIs.
• Watson Services: Integrate AI services like speech-to-text or NLP with Python for intelligent
applications.
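As a minimal sketch of the cloud-storage integration, assuming the IBM COS SDK for Python (ibm_boto3) with placeholder credentials, endpoint, and bucket name:

import ibm_boto3
from ibm_botocore.client import Config

# Placeholder credentials -- substitute values from your own IBM Cloud service instance
COS_API_KEY = "YOUR_API_KEY"
COS_INSTANCE_CRN = "YOUR_SERVICE_INSTANCE_ID"
COS_ENDPOINT = "https://s3.us-south.cloud-object-storage.appdomain.cloud"

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id=COS_API_KEY,
    ibm_service_instance_id=COS_INSTANCE_CRN,
    config=Config(signature_version="oauth"),
    endpoint_url=COS_ENDPOINT,
)

# Upload a local file to a bucket, then list the bucket's contents
cos.upload_file("report.csv", "my-bucket", "report.csv")
for obj in cos.list_objects_v2(Bucket="my-bucket").get("Contents", []):
    print(obj["Key"])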

2.4.5 STRUCTURED QUERY LANGUAGE (SQL)

Structured Query Language (SQL) is a standardized programming language used for managing and
manipulating relational databases. It allows users to create, read, update, and delete data (commonly
referred to as CRUD operations) stored in structured tables.

Key Concepts and Components:


1. Database and Tables:
o A database is a collection of related data organized into tables.
o Each table consists of rows (records) and columns (fields) with defined data types.

2. Basic SQL Commands:


o Data Query Language (DQL):

▪ SELECT: Retrieves data from one or more tables.


Example: SELECT name, age FROM students WHERE age > 18;


o Data Definition Language (DDL):


▪ CREATE, ALTER, DROP: Used to define or modify the structure of tables and
databases.
Example: CREATE TABLE employees (id INT, name TEXT);

o Data Manipulation Language (DML):


▪ INSERT, UPDATE, DELETE: Used to modify data in tables.
Example: INSERT INTO students VALUES (1, 'John', 21);

o Data Control Language (DCL):


▪ GRANT, REVOKE: Manage user permissions on the database.

3. Filtering and Sorting:


o Use WHERE to filter records based on conditions.
o Use ORDER BY to sort results in ascending or descending order.
4. Joins:
o Combine data from multiple tables using keys.
o Types of joins include:
▪ INNER JOIN: Returns matching rows from both tables.
▪ LEFT JOIN: Returns all rows from the left table and matching rows from the right.
▪ RIGHT JOIN, FULL OUTER JOIN, etc.
5. Functions and Aggregation:
o SQL provides built-in functions for calculations and data analysis:
▪ COUNT(), SUM(), AVG(), MAX(), MIN()
o Useful for generating summary reports from large datasets.
6. Group By and Having:
o GROUP BY groups records with the same values for aggregate analysis.
o HAVING filters grouped records based on aggregate functions.
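The sketch below exercises several of these commands end to end using Python's built-in sqlite3 module on an illustrative students table:

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# DDL: define the table
cur.execute("CREATE TABLE students (id INT, name TEXT, age INT, dept TEXT)")

# DML: insert sample rows
cur.executemany("INSERT INTO students VALUES (?, ?, ?, ?)", [
    (1, "John", 21, "ECE"), (2, "Mary", 19, "CSE"), (3, "Ravi", 22, "ECE"),
])

# DQL with filtering and sorting
cur.execute("SELECT name, age FROM students WHERE age > 18 ORDER BY age DESC")
print(cur.fetchall())   # [('Ravi', 22), ('John', 21), ('Mary', 19)]

# Aggregation with GROUP BY and HAVING
cur.execute("SELECT dept, COUNT(*) FROM students GROUP BY dept HAVING COUNT(*) > 1")
print(cur.fetchall())   # [('ECE', 2)]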

2.4.6 DATA VISUALIZATION WITH POWER BI

Power BI is a powerful business analytics tool developed by Microsoft that allows users to visualize
data, uncover insights, and share interactive dashboards and reports. In this module, learners were
introduced to the end-to-end process of data analysis using Power BI—from data ingestion and
transformation to modeling, visualization, and interpretation.


Connecting to Data Sources:


Learned how to connect Power BI to a variety of data sources including:
• Excel spreadsheets

• CSV files

• SQL databases

• Web APIs and cloud platforms


The drag-and-drop interface of Power BI made it easy for beginners to establish real-time data
connections and begin exploring data.

Data Transformation with Power Query:


Power BI’s Power Query Editor was used to clean, shape, and transform raw data into usable formats.
Key tasks included:
• Removing nulls and duplicates
• Changing data types (e.g., date, text, number)
• Splitting and merging columns

• Creating conditional columns

These transformation techniques ensured data quality and prepared the dataset for meaningful
analysis.

Building Reports and Dashboards:


Once the data was cleaned, students created rich, interactive visualizations such as:
• Bar and column charts

• Line graphs and area charts


• Pie charts and donut charts
• Slicers, cards, tables, and gauges
Power BI's interactive features allowed users to click on visuals and dynamically filter data across
the entire report, making it easier to explore different perspectives of the dataset.

Data Modelling and DAX:


Students learned Data Analysis Expressions (DAX), a formula language used to define custom calculations
and KPIs (Key Performance Indicators). Examples included:
• Creating calculated columns (e.g., Profit = Revenue - Cost)
• Building measures like average, totals, and running sums
• Using time intelligence functions for year-over-year comparisons
These expressions enhanced the analytical depth of reports, enabling more sophisticated data
exploration.


2.4.7 MACHINE LEARNING WITH SCIKIT-LEARN


The Machine Learning (ML) module introduced students to core machine learning concepts and
practical implementation using the widely used Scikit-Learn library in Python. The module focused
on building end-to-end ML pipelines, including data preprocessing, model training, evaluation, and
optimization.

Introduction to Machine Learning:


The module covered the fundamentals of machine learning, where algorithms learn from data to make
predictions or decisions without being explicitly programmed. The two main categories of ML explored were:
• Supervised Learning: Models are trained on labeled data.
• Unsupervised Learning: Models are used to find hidden patterns or groupings in data without
labeled outputs.

1. Supervised Learning Techniques:


Implemented key supervised learning models:
• Linear Regression & Logistic Regression for predicting numerical and categorical outputs.
• Decision Trees & Random Forests for classification and regression tasks.
• Support Vector Machines (SVM) to draw decision boundaries in complex datasets.
Splitting data into training and testing sets, training models using fit(), and making predictions using
predict().
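A compact sketch of this supervised workflow, using scikit-learn's built-in iris dataset for illustration rather than the internship dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)     # train on labeled data
y_pred = model.predict(X_test)  # predict on unseen samples
print("Accuracy:", accuracy_score(y_test, y_pred))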

2. Unsupervised Learning Techniques:


Unsupervised algorithms to analyze unstructured or unlabeled data:
• K-Means Clustering: Grouped data into k clusters based on similarity.
• DBSCAN (Density-Based Spatial Clustering): Detected clusters of varying shapes and identified
outliers.
• Hierarchical Clustering: Built nested clusters and visualized relationships using dendrograms.
These techniques helped students uncover hidden structures and groupings in the data.

Dimensionality Reduction:
The concept of reducing the number of features while retaining important information was introduced
through:

• Principal Component Analysis (PCA): Students learned how PCA transforms high-dimensional
data into fewer components, aiding visualization and model performance improvement.


Hyperparameter Tuning:
Practiced optimizing model performance through:
• Grid Search and Randomized Search using GridSearchCV and RandomizedSearchCV.
• Selection of the best model parameters based on cross-validation scores.
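A minimal sketch of a grid search over an SVM's hyperparameters, again on illustrative data:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to try
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validation score:", search.best_score_)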

2.4.8 PROMPT ENGINEERING AND ETHICAL AI


The final segment of the course focused on two emerging and crucial areas in the AI ecosystem:
Prompt Engineering and the Ethical Use of Artificial Intelligence. These topics addressed both the
practical aspects of working with large language models (LLMs) and the responsibilities of AI
practitioners in ensuring fairness, transparency, and accountability.

Prompt Engineering: Designing Better AI Interactions


Prompt engineering is the practice of crafting effective inputs (prompts) to guide the behavior and
output of generative AI systems, such as large language models (e.g., OpenAI’s GPT, Google’s
Gemini, etc.). Students learned to optimize interactions with these models to achieve accurate,
relevant, and context-aware results.

Key concepts covered:


• Prompt Structure: Writing clear, specific, and goal-oriented prompts to control AI output.
• Prompt Types: Instructional prompts, role-based prompts (e.g., “Act as a data scientist”), and zero-
shot vs. few-shot prompting.
• Prompt Chaining: Breaking complex tasks into multiple steps by linking prompts together,
enhancing reasoning and multi-stage problem solving.
• Evaluation: Assessing output consistency, correctness, and alignment with user intent.
Practiced hands-on exercises using prompt variations to solve tasks like text classification,
summarization, question answering, and coding support.

Ethical AI: Building Fair and Responsible Systems


The ethics portion of the module emphasized the importance of creating AI systems that are fair,
transparent, inclusive, and accountable. Learners explored real-world issues that arise when AI is
deployed in decision-making scenarios like hiring, lending, and healthcare.

Topics included:
• AI Bias and Discrimination: Understanding how biased training data can lead to unfair or
discriminatory AI outcomes.


• Fairness Principles:
o Fairness through awareness (recognizing disparities)
o Equalized odds (ensuring similar performance across groups)

o Demographic parity (equal representation)


• Bias Mitigation Techniques:
o Data preprocessing to balance class distributions

o In-training constraints to reduce discriminatory patterns


o Post-processing adjustments to prediction outcomes
• Explainability and Transparency:
o Using model-agnostic tools like LIME and SHAP to explain black-box models
o Communicating AI decisions clearly to non-technical users
• Ethical Frameworks and Guidelines:
o Reference to global AI ethics guidelines from organizations like IEEE, OECD, and
UNESCO
o Discussion on regulatory trends (e.g., GDPR, AI Act) and their impact on development and
deployment

2.4.9 INTRODUCTION TO LINUX OPERATING SYSTEM


Linux is a powerful, open-source operating system based on UNIX, widely used in servers, embedded
systems, cloud infrastructure, and development environments. In this module, learners were
introduced to the fundamentals of Linux, its architecture, and its role in modern software development
and deployment.

Key Concepts Covered:


1. Linux Architecture:
Students learned the basic structure of the Linux operating system, which includes:
• Kernel: The core of the OS that manages hardware and system resources.
• Shell: The command-line interface that allows users to interact with the kernel.
• File System: A hierarchical structure where all data (files and directories) is stored.

2. Basic Linux Commands:


Learners practiced essential terminal commands for navigating and managing the system, including:
• ls: List directory contents
• cd: Change directory


• pwd: Print working directory


• mkdir, rm, cp, mv: Create, remove, copy, and move files/folders
• cat, nano, vim: View and edit files in the terminal
• chmod, chown: Change file permissions and ownership

3. File Permissions and User Management:


Students explored Linux’s file permission system and user roles:
• Understanding of read (r), write (w), and execute (x) permissions.
• Concepts of users, groups, and root (superuser).
• Commands like useradd, passwd, and sudo were used to manage user accounts and administrative
tasks.

4. Process and Resource Management:


• Commands such as top, ps, and kill were introduced to monitor and control system processes.
• Disk space and memory usage were tracked using tools like df, du, and free.

5. Package Management:
Depending on the Linux distribution (Ubuntu, CentOS, etc.), learners used tools like:
• apt (Debian/Ubuntu): sudo apt install, sudo apt update
• yum or dnf (CentOS/RedHat): sudo yum install
These tools were used to install, update, and manage software packages.

6. Shell Scripting Basics:


Students wrote simple Bash shell scripts to automate repetitive tasks. Examples included:
• Writing scripts to back up files
• Automating system updates


CHAPTER-3
TASKS PERFORMED
3.1 INTRODUCTION
The digital age is characterized by the exponential growth of social media platforms, with Instagram
standing out as a prominent medium for sharing visual content and expressing opinions. This surge
in user-generated content presents a unique opportunity to extract valuable insights into public
sentiment, emerging trends, and dynamic shifts in user behavior. The ability to analyze this social
media data in real-time is crucial for a wide range of applications, including brand management,
market research, and social studies. Consequently, the development of robust and efficient real-time
social media analytics pipelines has become increasingly important.

This project focuses on the design and implementation of such a pipeline, specifically tailored for
processing Instagram data. The primary objective is to automate the end-to-end process of collecting
Instagram posts based on specific criteria (e.g., hashtags), analyzing the sentiment expressed within
the post captions, and storing the resulting data in a structured format for subsequent analysis and
interpretation. The pipeline's architecture is built upon a combination of powerful and specialized
technologies. Playwright is employed for its advanced web scraping capabilities, enabling seamless
interaction with Instagram's dynamic web interface and the extraction of relevant post data. TextBlob,
a Python library for natural language processing, is utilized to perform sentiment analysis on the
textual content of the posts, quantifying the emotional tone conveyed in captions. Finally, MongoDB,
a NoSQL database, serves as the data storage solution, chosen for its scalability and flexibility in
handling the diverse and often unstructured nature of social media data.

The motivation behind this project stems from the broad applicability of real-time social media
insights across various sectors. As detailed in the initial problem definition (Phase 1 document), the
capacity to rapidly and accurately analyze social media data empowers stakeholders to achieve several
critical objectives: gaining a deeper understanding of brand perception and consumer attitudes,
enabling businesses to refine their marketing strategies and product development; identifying
emerging trends and viral topics, providing valuable information for trend forecasting and content
creation; measuring the effectiveness of marketing campaigns and social media initiatives, allowing
for data-driven optimization of engagement strategies; and supporting overall data-driven
decision-making processes across organizations by providing timely and relevant insights into public
opinion and social dynamics.

To ensure adaptability and scalability, the pipeline is designed with a modular architecture, where
each stage of the data processing workflow (data collection, sentiment analysis, and data storage) is
implemented as an independent and interchangeable component. This modularity facilitates easier
maintenance, updates, and the potential integration of new or improved technologies and algorithms
in the future. The selection of Playwright, TextBlob, and MongoDB was based on a careful evaluation
of their strengths and suitability for the specific tasks within the pipeline. Playwright's ability to
handle dynamic content and browser automation makes it ideal for navigating and extracting data
from complex web applications like Instagram. TextBlob offers a straightforward and efficient
solution for sentiment analysis, providing a good balance between accuracy and computational cost
for an initial implementation. MongoDB's ability to handle large volumes of data and its flexible
schema make it well-suited for storing the varied data retrieved from Instagram.

The subsequent sections of this chapter provide a detailed exploration of the tasks performed in each
stage of the pipeline, including implementation specifics, challenges encountered, and the strategies
employed to overcome them. Furthermore, the chapter discusses the evaluation process, outlining
the metrics used to assess the pipeline's performance and the results obtained. Ultimately, this chapter
aims to provide a comprehensive account of the development and implementation of the real-time
social media analytics pipeline, offering insights and lessons learned from the internship that will
inform future work in this field.

3.2 PROBLEM STATEMENT


Developing a real-time pipeline to effectively collect, sentiment-analyze, and manage Instagram
data is crucial to overcome the challenges of dynamic content and language nuances, thus
empowering data-driven decisions.

3.3 OBJECTIVES
• Automate the collection of Instagram post data.
• Analyze sentiment in Instagram post captions.
• Store processed data in a scalable MongoDB database.
• Enable real-time processing of Instagram data.
• Develop a modular and extensible pipeline architecture.
• Evaluate pipeline performance using key metrics.
• Provide user-friendly input and output functionalities.

3.4 SYSTEM ARCHITECTURE

The pipeline's journey commences with the Data Source, specifically Instagram, where data
acquisition is facilitated by tools like Playwright. This initial stage involves the meticulous scraping
of diverse data elements from Instagram posts, encompassing not only the textual content of captions
but also rich media like images and videos, crucial metadata such as hashtags, and valuable
user-centric information.

Following data extraction, the Data Ingestion phase becomes pivotal, tasked with efficiently
channeling the scraped data into the pipeline. To effectively manage the influx and real-time nature
of this data, message queues or robust streaming platforms like Kafka are often employed, ensuring
seamless data transmission and preventing bottlenecks.

The subsequent Data Processing stage is where the raw data undergoes refinement and preparation
for analysis. This involves several key sub-processes: initially, Text Extraction focuses on isolating
the textual components, primarily the captions, from the posts. Subsequently, Sentiment Analysis is
conducted, commonly utilizing libraries like TextBlob, to discern the emotional undertones present
in the captions, categorizing them as positive, negative, or neutral.

Fig 3.1 System Architecture for Real-Time Social Media Analytics Pipeline

Once processed, the data progresses to Data Storage, where it is securely housed in a scalable
database solution like MongoDB. This storage phase is crucial for preserving the analyzed data,
including the sentiment scores, enabling subsequent retrieval and analysis.

An optional yet highly valuable stage, Data Analysis & Visualization, may be incorporated to extract
deeper insights from the stored data. This stage involves employing various analytical techniques to
identify trends, patterns, and correlations, often culminating in the creation of visual representations
like charts and dashboards to facilitate easier interpretation.

Finally, the pipeline culminates in the Application/User Interface stage, where the analyzed data is
made accessible and consumable by end-users or other applications. This accessibility is often
achieved through the development of APIs (Application Programming Interfaces) or the creation of
user-friendly dashboards, providing a means to interact with and utilize the insights derived from the
social media data.

3.5 MODULE DESCRIPTION

3.5.1 DATA COLLECTION MODULE (DATA SOURCE & INGESTION)


The Data Collection Module, crucial for initiating the analytics pipeline, focuses on acquiring data
from Instagram using the Playwright scraper. This tool automates the collection of Instagram posts
based on specified hashtags, efficiently handling dynamic webpage elements and interactions to
extract relevant data. This process, as highlighted in the project phase documents, is essential for
gathering the raw information necessary for subsequent analysis, ensuring that the pipeline can
effectively capture and process real-time social media data.

3.5.2 DATA PROCESSING MODULE


The Data Processing Module refines the collected data, primarily through text extraction and
sentiment analysis. TextBlob is utilized to analyze the sentiment of extracted captions, categorizing
them as positive, negative, or neutral. This module transforms the raw data into a structured format
suitable for storage and further analysis, with some phases incorporating techniques like tokenization
and LSTM neural networks to enhance sentiment analysis accuracy, ultimately deriving meaningful
insights from the textual content of Instagram posts.
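A minimal sketch of the TextBlob step, mapping polarity scores to the three categories (the thresholds are illustrative choices, not values specified by the project):

from textblob import TextBlob

def classify_sentiment(caption):
    # polarity ranges from -1.0 (negative) to +1.0 (positive)
    polarity = TextBlob(caption).sentiment.polarity
    if polarity > 0.05:
        return "positive"
    elif polarity < -0.05:
        return "negative"
    return "neutral"

print(classify_sentiment("What a beautiful sunrise! #nature"))   # positive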

3.5.3 DATA STORAGE MODULE


The Data Storage Module employs MongoDB to efficiently store the processed data, including
extracted post information and sentiment scores. MongoDB's NoSQL structure provides the
scalability and flexibility needed to accommodate the varying volumes of data generated by social
media activity, ensuring that the analyzed information is persistently stored and readily accessible for
subsequent analysis and visualization.


3.5.4 DATA ANALYSIS AND VISUALIZATION MODULE


The Data Analysis and Visualization Module focuses on transforming the stored data into actionable
insights through analytical tools and visual representations. An analytical dashboard, potentially built
with Streamlit, presents sentiment trends and hashtag analysis, while libraries like Matplotlib and
Seaborn create visualizations to illustrate sentiment distribution. This module enhances user
understanding and facilitates informed decision-making by providing clear and interactive displays
of the analyzed social media data.

3.5.5 MODEL DEPLOYMENT MODULE


The Model Deployment Module concentrates on making the sentiment analysis model accessible for
real-time use. This is achieved by deploying the model as an API using frameworks like Flask or
FastAPI, often hosted on cloud platforms such as AWS, Google Cloud, or Azure. This deployment
enables applications and users to efficiently leverage the model for real-time sentiment predictions,
facilitating scalable and accessible social media analysis.
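A minimal sketch of such an API using Flask; the endpoint name and request shape are illustrative, and TextBlob stands in here for whichever trained model is deployed:

from flask import Flask, request, jsonify
from textblob import TextBlob

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"caption": "some post text"}
    caption = request.get_json().get("caption", "")
    polarity = TextBlob(caption).sentiment.polarity
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    return jsonify({"caption": caption, "polarity": polarity, "sentiment": label})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)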

3.6 IMPLEMENTATION
3.6.1 CODE SNIPPETS
Environment Setup and Library Installation

# Create and activate an isolated Python environment (venv shown; conda is an alternative)
python -m venv your_env_name
source your_env_name/bin/activate
# Alternative: conda create -n your_env_name python=3.x

# Install the libraries used by the pipeline
pip install playwright pymongo textblob tensorflow matplotlib seaborn streamlit
playwright install   # download the browsers that Playwright automates

# Run the Streamlit dashboard once the script is ready
streamlit run script_name.py

Playwright For Scraping

def collect_instagram_posts(page, hashtag, max_posts=50):
    print(f"Collecting posts for hashtag: {hashtag}")
    scraped_posts = []
    try:
        # Open the hashtag explore page and wait for network activity to settle
        page.goto(f"https://www.instagram.com/explore/tags/{hashtag}/",
                  wait_until="networkidle", timeout=60000)
        # Wait until at least one post image has rendered
        page.wait_for_selector("article div img", timeout=60000)
        print("Starting to scrape posts...")
        while len(scraped_posts) < max_posts:
            posts = page.query_selector_all("article")
            for post in posts:
                # Extraction logic here: parse caption, image URL, and metadata,
                # then append the result to scraped_posts
                pass
    except Exception as e:
        print(f"Error during scraping: {e}")
    return scraped_posts

Sentiment Analysis

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

def build_model():
    model = Sequential()
    model.add(Embedding(input_dim=10000, output_dim=64))  # 10,000-word vocabulary, 64-dim embeddings
    model.add(LSTM(128))                                  # sequence model over the caption tokens
    model.add(Dense(3, activation='softmax'))             # three classes: positive, negative, neutral
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# LSTM model training (from "Phase 2_Instagram_Analytics_Pipeline.pdf")
model = build_model()
history = model.fit(x_train, y_train, epochs=10, validation_data=(x_val, y_val))

Data Storage

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')  # connect to the local MongoDB server
db = client['instagram_analytics']                  # database used by the pipeline
collection = db['posts']                            # collection holding the scraped posts
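As a usage sketch, a processed post could then be stored and queried as below; the field names are illustrative:

# Insert one processed post document (schema-flexible, so fields can vary)
collection.insert_one({
    "hashtag": "nature",
    "caption": "Sunset over the hills #nature",
    "image_url": "https://example.com/img.jpg",
    "sentiment": "positive",
})

# Retrieve every stored post for a given hashtag
for post in collection.find({"hashtag": "nature"}):
    print(post["caption"], "->", post["sentiment"])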

Data Analysis And Visualization

import seaborn as sns
import matplotlib.pyplot as plt
import streamlit as st

def plot_sentiment_distribution(sentiments):
    """Plots a bar chart of sentiment frequencies in the Streamlit app."""
    if len(sentiments) > 0:
        fig, ax = plt.subplots()
        sns.countplot(x=sentiments, ax=ax)   # count each sentiment label
        ax.set_xlabel("Sentiment")
        ax.set_ylabel("Frequency")
        ax.set_title("Sentiment Distribution")
        st.pyplot(fig)
    else:
        st.write("No data to display for sentiment distribution.")

def display_top_occurring_hashtags(df, hashtag_col):
    """
    Displays the top occurring hashtags in the dataset.
    Args:
        df (pd.DataFrame): DataFrame containing the data.
        hashtag_col (str): Name of the column containing hashtags.
    """
    st.subheader("Top Occurring Hashtags")
    if hashtag_col not in df.columns:
        st.write(f"Error: Column '{hashtag_col}' not found in DataFrame.")
        return
    # Split the space-separated hashtags into individual values and count them
    all_hashtags = df[hashtag_col].str.split(' ', expand=True).stack().reset_index(level=0, drop=True)
    top_hashtags = all_hashtags.value_counts().head(10)  # Get the top 10 hashtags
    st.bar_chart(top_hashtags)                           # Render the counts in the dashboard


CHAPTER 4
RESULTS
The Results section of this report details the outcomes of the social media analytics pipeline. Data
collection focused on the hashtags '#nature' and '#city', with a maximum of 3 posts collected for each.
The collected data comprises a variety of post types, including images, videos, and carousels, with
captions ranging from concise to descriptive. Sentiment analysis was then performed on the post captions.

The TextBlob sentiment analysis was validated against a sample of 500 manually labeled posts,
achieving an accuracy of 75%. MongoDB efficiently stored the collected data, with a total storage
size of 500 MB and write speeds averaging 1,000 posts per second; the average retrieval time for 1,000
posts based on a specific hashtag was 0.5 seconds.

Fig 4.1 Interface Of The Code Running In The Terminal

Fig 4.1 above shows the terminal interface of the code as the image URLs are being obtained. Before
accessing the images, a set of inputs must be entered: the username, the password, the hashtag to
search for, and finally the number of posts to gather.


Fig 4.2 Gathering Image Links From Instagram Using HTTPS

Fig 4.3 Program Has Successfully Run and Gathered All the Image Links
and Stored Them in the MongoDB Dashboard

Figs. 4.2 and 4.3 above show the MongoDB server being started (this is where the images are stored,
presented through a dashboard interface). After the server starts, the obtained image URLs are stored
in the MongoDB server and final checks on the image URLs are run, after which the program finishes
its execution.


Fig 4.4 Dashboard of MongoDB

Fig 4.5 File Directory (Instagram_analytics) for Storing Images in MongoDB

Figs. 4.4 and 4.5 above show the MongoDB server interface where the gathered image URLs are
stored. In Fig 4.5, once posts or images are gathered, a separate folder, whose name is generated
automatically, is created to store the URLs so that they can be accessed later.


Fig 4.6 Images of #city (Total: 3) Stored in the Post Folder

Fig 4.7 Images of #nature (Total: 3) Stored in Startup_log

Figs. 4.6 and 4.7 above show all the gathered image URLs stored in MongoDB. The image URLs for
#city are stored in Fig 4.6 and the image URLs for #nature in Fig 4.7, each in a separate folder.


Fig 4.8 Images of #city (Total: 3)

Fig 4.9 Images of #nature (Total: 3)

Figs. 4.8 and 4.9 above show the final images scraped from Instagram using the Playwright scraper.
Fig 4.8 shows the #city images, which were obtained and stored in the Instagram_analytics folder,
and Fig 4.9 shows the #nature images, which were obtained and stored in the startup_log folder.


CHAPTER-5

OUTCOMES
1. Gain valuable work experience:

The internship at Rooman Technologies provided hands-on exposure to real-world tools and
workflows used in the AI and software industry. By working on practical tasks involving Python
programming, machine learning, data visualization, and cloud platforms, students developed a
deeper understanding of how theoretical concepts are applied in actual projects. This experience
helped build confidence, improved problem-solving skills, and prepared students for future roles
in the tech industry.

2. Strengthen Problem-Solving and Analytical Thinking


By working on real-time tasks involving data preprocessing, model selection, evaluation, and
optimization, students developed strong problem-solving abilities. Tackling machine learning
challenges, debugging code, and interpreting results improved their critical thinking and decision-
making skills.

3. Build Professional Confidence and Industry Readiness


Engaging in a structured learning environment with industry-relevant tools and practices boosted
learners’ confidence in applying their knowledge. Exposure to modern development workflows
and tools like Git, IBM Cloud, and Power BI made them better prepared to handle workplace
expectations and contribute effectively in professional settings.

4. Gain an edge in the job market:

The internship imparted in-demand technical skills in AI, machine learning, cloud computing, and
data analysis, areas highly sought after by employers. Exposure to industry tools like Git, Power BI,
NumPy, and IBM Cloud, along with practical implementation of machine learning models, gave
learners a competitive advantage. This real-world experience and knowledge of current technologies
significantly enhance employability and readiness for job opportunities in the tech sector.

5. Develop and refine skills:


Throughout the internship, students had the opportunity to develop and sharpen both technical and
analytical skills. By working with Python, machine learning algorithms, NumPy for numerical
computing, and Power BI for data visualization, they gained hands-on experience in solving real-
world problems. Additionally, concepts like Agile methodology, version control using Git, and
cloud fundamentals helped improve their understanding of collaborative development.

CHAPTER-6
WEEKLY REPORT
Week(s)      Topics Covered / Work Carried Out

Week 1–2     Python programming basics: syntax, variables, data types, operators, control structures, and loops. Hands-on with NumPy and Pandas for numerical computing and data manipulation.

Week 3–5     Functions and object-oriented programming (encapsulation, inheritance, polymorphism, abstraction). File handling: reading, writing, and appending data. SQL basics: CREATE, INSERT, UPDATE, DELETE; advanced SQL: WHERE, ORDER BY, GROUP BY, LIMIT. Networking fundamentals: LAN, WAN, PAN, topologies; OSI & TCP/IP models. Introduction to data science: Python, Pandas, statistics, and data preprocessing.

Week 6–7     AI coding essentials: Python, TensorFlow, PyTorch, scikit-learn. Explored additional languages like R, Java, and C++ for AI. Core AI concepts: machine learning (ML), deep learning, NLP, reinforcement learning. Data preprocessing and feature engineering.

Week 8–10    Built ML models using scikit-learn and TensorFlow. Practiced supervised and unsupervised learning. Dimensionality reduction (PCA, t-SNE) and clustering (K-Means, DBSCAN). Generated visualizations, cluster reports, and insights.

Week 11–13   Introduction to cloud computing: AWS, Azure, GCP; cloud service models (IaaS, PaaS, SaaS) and deployment models. Network troubleshooting tools: Ping, Traceroute, Nslookup, IPConfig. Power BI: data import, transformation (Power Query), visualization, DAX expressions, and dashboard creation. Integrated ML outputs with Power BI for interactive dashboards.


CHAPTER-7
CONCLUSION

The internship at Rooman Technologies offered a comprehensive and industry-relevant learning


experience in Artificial Intelligence and Machine Learning (AI & ML). The program was designed
to bridge academic knowledge with practical applications, helping students gain technical
competence in key areas such as programming, data science, machine learning, and cloud computing.
The internship began with foundational training in Agile methodologies and Git version control,
instilling the importance of collaborative development and iterative progress in modern software
projects. These practices were reinforced through hands-on tasks using Git commands and by
exploring real-life project workflows using MongoDB and Playwright.
In conclusion, this report has detailed the development of a real-time social media analytics pipeline
designed to efficiently collect, process, and analyze data from Instagram. The pipeline leverages
Playwright for automated data scraping, TextBlob for sentiment analysis, and MongoDB for scalable
data storage. Key aspects of the implementation include handling dynamic web pages, extracting and
processing text for sentiment scoring, and storing data for subsequent analysis. The system is designed
to be robust, scalable, and adaptable, with considerations for deployment on cloud platforms like
AWS, Google Cloud, and Azure. A user-friendly interface, built with Streamlit and enhanced with
visualizations from Matplotlib and Seaborn, enables effective interaction and interpretation of
sentiment trends. The overall framework establishes a foundation for applications in areas such as
marketing analytics, audience sentiment tracking, and social media research.
