Dsbda Mini Covid 2
A
Mini Project Report
on
“Covid Vaccine Dataset”
Submitted in partial fulfillment of the requirement for the award of the degree of
BACHELOR OF ENGINEERING IN
COMPUTER ENGINEERING
[T.E.Computer Engineering]
By
This is to certify that the Mini Project report “Covid Vaccine Statewise Dataset”
submitted by Patil Parag Dilip in partial fulfillment of the requirement for the
award of the Bachelor of Engineering in COMPUTER ENGINEERING at
Sandip Institute of Engineering Management, Nashik, as laid down by the
Savitribai Phule Pune University, is a record of the work carried out under my
supervision and guidance during academic year 2024 - 2025.
Place: - Nashik.
Date: - / / 2025
This report would not have been completed without the encouragement and support
of the many people who gave their precious time throughout the period. I want to
thank my advisers and everyone else for their patience and assistance during my
on-site training. In particular, I would like to thank Prof. V. V. Mahale. Thanks to
their guidance, I was able to prepare a clean dataset, build visualizations, and learn
about data analytics.
Acknowledgment
Abstract
1 INTRODUCTION
1.1 Introduction
1.2 Title
1.3 Objectives
2 Literature Survey
2.1 Literature Survey
3 METHODOLOGICAL DETAILS
3.1 Methodology
3.2 Dataset Description
3.3 Implementation
5 REFERENCES
Chapter 1
INTRODUCTION
1.1 Introduction
The outbreak of the COVID-19 pandemic in late 2019 created an unprecedented
global health crisis. In response, governments and health organizations around the
world initiated large-scale vaccination programs to immunize populations and curb the
spread of the virus. India, one of the most populous countries in the world,
undertook a massive vaccination drive starting in January 2021. The file
covid_vaccine_statewise.csv from the Kaggle dataset “COVID-19 in India” provides
comprehensive data about the COVID-19 vaccination campaign carried out across the
states and union territories of India.
This dataset serves as a critical resource for understanding how the vaccination efforts
unfolded across India. It contains time-series information detailing the number of
vaccine doses administered in each state, updated daily, along with several
demographic breakdowns. By analyzing this dataset, one can evaluate the pace and
scale of the vaccine rollout at the state level. It allows researchers,
policymakers, and data analysts to study regional disparities, identify trends, and
uncover insights into public health responses.
Dataset Overview: The covid_vaccine_statewise.csv dataset includes the following
key fields:
State: Identifies the Indian state or union territory for which the data has been
recorded. It serves as the primary grouping variable for state-level analysis.
Updated On: Records the date on which the vaccination data was updated, enabling
time-series analysis of the vaccination progress.
Total Doses Administered: A cumulative count of all COVID-19 vaccine doses given,
both first and second doses.
First Dose Administered: The number of people who have received their first dose of
the COVID-19 vaccine.
Second Dose Administered: The count of individuals who have received the full dosage
of the vaccine.
Male (Doses), Female (Doses), Transgender (Doses): Gender-wise vaccination numbers,
which help evaluate the inclusiveness and demographic coverage of the vaccine rollout.
Covaxin (Doses) and Covishield (Doses): India primarily used two vaccines during the
early phase of its campaign: Covaxin (developed by Bharat Biotech) and Covishield
(developed by Serum Institute of India based on the Oxford-AstraZeneca formula).
These fields show the type of vaccine administered in each state.
AEFI: Adverse Events Following Immunization, if any, are recorded here. Monitoring
AEFI is critical for ensuring vaccine safety.
18-44 Years (Doses), 45-60 Years (Doses), 60+ Years (Doses): Age-wise breakdowns that
provide insights into how the vaccination strategy prioritized various age groups,
particularly the elderly and working-age adults.
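The field list above can be checked quickly once the file is loaded. The following is a minimal sketch rather than the report's actual code: it reads a small inline excerpt instead of the real file, the sample values are made up, and the day/month/year date format and exact column spellings are assumptions that should be verified against the downloaded CSV.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for covid_vaccine_statewise.csv.
# The numbers are invented for illustration; column names follow the
# field list in this report and may differ in the real file.
SAMPLE_CSV = """Updated On,State,Total Doses Administered,First Dose Administered,Second Dose Administered,AEFI
16/01/2021,India,48276,48276,0,
16/01/2021,Kerala,8062,8062,0,
17/01/2021,Kerala,8125,8125,0,2
"""

# With the real file this would be: pd.read_csv("covid_vaccine_statewise.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)         # number of rows and columns
print(df.dtypes)        # which columns were parsed as numbers or as text
print(df.isna().sum())  # missing values per column
```

Inspecting the dtypes and missing-value counts first shows which columns need type conversion or cleaning before any aggregation.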
This dataset is highly valuable for multiple stakeholders:
Public Health Authorities: By analyzing vaccination trends across states, authorities
can measure the success of the rollout and identify regions requiring more resources.
Data Scientists and Researchers: It provides a basis for statistical modeling,
time-series forecasting, and correlation analysis with infection rates or mortality.
Policymakers: It enables evidence-based decision-making regarding vaccine distribution,
logistics planning, and awareness campaigns.
Academicians and Students: It is useful for educational purposes in courses related to
public health, epidemiology, data analytics, and statistics.
This dataset is valuable for multiple fields including public health, data science,
epidemiology, and public policy. It supports analysis at both granular (state-level) and
national levels, enabling researchers to study regional trends, measure performance,
and draw meaningful comparisons.
1.2 Title
Data Analysis Using Python. Data analysis is the process of inspecting, cleaning,
transforming, and modeling data to discover useful information, draw conclusions, and
support decision-making. In today’s world, data is produced in enormous quantities every
second. As organizations, researchers, and analysts seek to leverage this vast amount
of information, efficient tools for analyzing and visualizing data have become critical.
Python, with its rich ecosystem of libraries, has emerged as a dominant tool in the field
of data science. Among these libraries, Pandas and Matplotlib stand out as essential
tools for data manipulation and visualization, respectively.
This section provides an overview of the role of Python, Pandas, and Matplotlib
in data analysis. It covers the fundamental concepts, their applications, and the
functionality they offer to make data analysis tasks more accessible, efficient,
and insightful.
Community and Support: Python’s large and active community ensures that resources,
tutorials, and forums are readily available for anyone looking to learn or
troubleshoot issues.
Integration with Other Tools: Python integrates with other tools and
databases, including SQL, Hadoop, Spark, and more, making it highly adaptable for
various data analysis workflows.
1.3 Objectives
The primary objective of analyzing the covid_vaccine_statewise.csv dataset is to gain
comprehensive insights into the state-wise and demographic distribution of COVID-19
vaccinations across India. This dataset, compiled during one of the largest
immunization drives in the world, offers a rich and structured source of information
that can be used to assess the progress, performance, and equity of India’s vaccination
strategy during the COVID-19 pandemic.
This study aims to examine how effectively vaccines were distributed across different
states and union territories, how well demographic groups such as different age
categories and genders were covered, and how the use of different vaccines (Covaxin
and Covishield) varied across regions. By breaking down the data on a temporal basis,
the objective also includes identifying trends over time, such as acceleration or
stagnation of vaccination rates, which can offer insights into logistical efficiency,
public response, and policy effectiveness.
Another key objective is to understand the inclusiveness of the vaccine rollout. The
dataset provides gender-disaggregated and age-based data, which helps in analyzing
whether women, the elderly, and other vulnerable groups received adequate attention
during the campaign. Additionally, the dataset includes data on Adverse Events
Following Immunization (AEFI), which can be used to evaluate vaccine safety and public
confidence.
Specifically, the objectives can be outlined as:
To analyze the state-wise progress of COVID-19 vaccine administration across India.
To evaluate demographic distribution (gender and age) of vaccine recipients.
To compare the usage of different vaccines (Covaxin and Covishield) across states.
To identify time-series trends in vaccine administration at the national and regional
levels.
To explore patterns in AEFI reports and assess public health safety responses.
To assess the equity and reach of the vaccination drive in urban vs. rural or
well-developed vs. under-developed states.
To provide a data-driven foundation for policy-making, resource allocation, and future
immunization campaigns.
Overall, this analysis is intended not only to measure the numerical
success of the vaccination campaign but also to evaluate its inclusiveness, safety, and
effectiveness, thereby contributing valuable insights for future pandemic preparedness.
Chapter 2
Literature Survey
individuals and people with disabilities indicates the need for more inclusive
vaccination policies. The dataset provides gender-wise and age-group specific data,
making it an important resource to examine the inclusiveness of India’s vaccination
efforts.
Chapter 3
METHODOLOGICAL DETAILS
3.1 Methodology
The methodology followed for analyzing the covid_vaccine_statewise.csv dataset involves
a structured, multi-step approach to ensure comprehensive, accurate, and insightful
analysis. The steps include data collection, preprocessing, exploratory data analysis
(EDA), visualization, and interpretation of results. The aim is to extract meaningful
insights regarding the COVID-19 vaccination progress across different Indian states
and demographic groups.
1. Data Collection: The dataset was sourced from the publicly available “COVID-19
in India” dataset hosted on Kaggle, which is maintained by Sudalai Rajkumar. The
specific file used, covid_vaccine_statewise.csv, contains state-wise and time-series
vaccination records from the beginning of the vaccine rollout in India.
2. Data Preprocessing: Before analysis, the dataset was cleaned and prepared for
use through the following steps:
Handling Missing Values: Rows with missing or null values in critical columns such as
"State", "Updated On", or "Total Doses Administered" were either filled with
appropriate estimates or removed based on context.
Date Formatting: The "Updated On" column was converted into a datetime format to
facilitate time-series analysis.
Data Type Conversion: Columns representing numerical data (e.g., doses administered)
were converted from string/object types to numeric types to support aggregation and
plotting.
Removal of Non-State Entries: Aggregated entries like "India" or test center data were
excluded from state-level analysis.
Deduplication: Any duplicate records were identified and removed to ensure accuracy.
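The preprocessing steps above could be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the exact code used in the project: the helper name clean_vaccine_data is hypothetical, the date format is assumed to be day/month/year, rows with missing critical values are simply dropped rather than estimated, and the sample rows are invented.

```python
import pandas as pd


def clean_vaccine_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying the five preprocessing steps in order."""
    df = raw.copy()

    # Date formatting: parse day/month/year strings (format is an assumption).
    df["Updated On"] = pd.to_datetime(
        df["Updated On"], format="%d/%m/%Y", errors="coerce"
    )

    # Data type conversion: every column except the keys becomes numeric;
    # values that cannot be parsed turn into NaN.
    numeric_cols = [c for c in df.columns if c not in ("State", "Updated On")]
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

    # Missing values: this sketch drops rows lacking a critical field.
    df = df.dropna(subset=["State", "Updated On", "Total Doses Administered"])

    # Removal of non-state entries: exclude the national aggregate row.
    df = df[df["State"] != "India"]

    # Deduplication: keep one record per state and date.
    df = df.drop_duplicates(subset=["State", "Updated On"])

    return df.sort_values(["State", "Updated On"]).reset_index(drop=True)


# Invented sample rows covering each case handled above.
raw = pd.DataFrame(
    {
        "State": ["India", "Kerala", "Kerala", "Kerala", "Goa", None],
        "Updated On": ["16/01/2021", "16/01/2021", "16/01/2021",
                       "17/01/2021", "16/01/2021", "16/01/2021"],
        "Total Doses Administered": ["48276", "8062", "8062", "8125", None, "10"],
    }
)
clean = clean_vaccine_data(raw)
print(clean)
```

Converting types before dropping rows means that unparseable dates or counts are treated the same way as values that were missing from the start.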
Comparing vaccine administration across age groups (18-44, 45-60, 60+) and genders
(male, female, transgender).
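A comparison of this kind can be expressed as each group's percentage share of all doses. The sketch below assumes a table holding the latest cumulative row for each state; the helper name group_shares is hypothetical, the column names follow the field list in Chapter 1, and the figures are invented for illustration.

```python
import pandas as pd


def group_shares(latest: pd.DataFrame, columns: list) -> pd.Series:
    """Hypothetical helper: percentage share of each column in the combined total."""
    totals = latest[columns].sum()
    return (totals / totals.sum() * 100).round(1)


# Invented latest cumulative figures for two placeholder states.
latest = pd.DataFrame(
    {
        "State": ["State A", "State B"],
        "Male (Doses)": [640, 400],
        "Female (Doses)": [600, 350],
        "Transgender (Doses)": [4, 6],
        "18-44 Years (Doses)": [500, 300],
        "45-60 Years (Doses)": [400, 300],
        "60+ Years (Doses)": [300, 200],
    }
)

gender_share = group_shares(
    latest, ["Male (Doses)", "Female (Doses)", "Transgender (Doses)"]
)
age_share = group_shares(
    latest, ["18-44 Years (Doses)", "45-60 Years (Doses)", "60+ Years (Doses)"]
)
print(gender_share)
print(age_share)
```

Shares are easier to compare across states of very different sizes than raw dose counts.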
6. Interpretation and Insights: Once the analysis and visualizations were
completed, the data was interpreted to draw insights such as:
Which states led or lagged in vaccine administration?
Was there gender disparity in vaccine access?
Were certain vaccines more prevalent in particular regions?
Were elderly populations adequately covered?
How did the vaccination rates change over time?
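Two of these questions, which states led and how the rate changed over time, can be answered directly from the cumulative totals. The following sketch assumes that "Total Doses Administered" is cumulative per state, as described in the dataset overview; the state names and numbers are placeholders.

```python
import pandas as pd

# Invented cumulative totals for two placeholder states over three days.
df = pd.DataFrame(
    {
        "State": ["A", "A", "A", "B", "B", "B"],
        "Updated On": pd.to_datetime(
            ["2021-01-16", "2021-01-17", "2021-01-18"] * 2
        ),
        "Total Doses Administered": [100, 250, 450, 80, 120, 300],
    }
).sort_values(["State", "Updated On"])

# Leaders and laggards: the last cumulative value per state, largest first.
latest_totals = (
    df.groupby("State")["Total Doses Administered"]
    .last()
    .sort_values(ascending=False)
)

# Change over time: day-to-day difference of the cumulative series per state.
# The first day of each state has no previous value, so it stays NaN.
df["Daily Doses"] = df.groupby("State")["Total Doses Administered"].diff()

print(latest_totals)
print(df)
```

Differencing within each state group avoids subtracting one state's total from another's at the group boundaries.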
7. Tools and Technologies Used:
Python: For data analysis and visualization (libraries used: Pandas, NumPy,
Matplotlib, Seaborn, Plotly).
Jupyter Notebook / Google Colab: For executing Python code and documenting the
workflow.
4. Utility and Scope: This dataset is instrumental for:
Monitoring the progress of the vaccination campaign.
Identifying regional disparities in vaccine access.
Analyzing demographic trends and gender equity.
Evaluating vaccine safety through AEFI tracking.
Supporting data-driven public health decision-making.
5. Limitations:
The dataset may not include real-time updates if used offline.
Some entries may have missing or inconsistent data, requiring cleaning.
AEFI data is limited to numerical reports without qualitative detail (severity, type).
No data on booster doses or newer vaccines like Corbevax (depending on version/date).
3.3 Implementation
The implementation of the COVID-19 vaccination dataset analysis was carried out
using Python, a powerful programming language well-suited for data science tasks.
Libraries such as Pandas, NumPy, Matplotlib, and Seaborn were used to process and
visualize the data.
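As an illustration of the visualization step, the sketch below draws a horizontal bar chart of total doses per state with Pandas and Matplotlib. It is not the project's actual plotting code; the state names and totals are placeholders, and the non-interactive "Agg" backend is chosen only so that the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Invented latest cumulative totals for three placeholder states,
# sorted so the largest bar appears at the top of the chart.
latest_totals = pd.Series(
    {"State A": 450_000, "State B": 300_000, "State C": 120_000},
    name="Total Doses Administered",
).sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(latest_totals.index, latest_totals.values, color="tab:blue")
ax.set_xlabel("Total doses administered")
ax.set_title("Total doses by state (illustrative data)")
fig.tight_layout()

# Uncomment to write the chart to disk:
# fig.savefig("total_doses_by_state.png")
```

A horizontal layout keeps long state names readable when the chart covers every state and union territory.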
Chapter 4
One of the key takeaways from my journey thus far is the importance of data in
today’s world. Businesses, governments, and institutions are increasingly reliant on data
to navigate complex problems, forecast trends, and achieve their objectives. Whether
it is through improving customer experience, streamlining operations, or identifying
new growth opportunities, data analysis provides a foundation for informed decisions.
The ability to extract insights from large datasets and present them in a way that
is meaningful and actionable is a skill that holds tremendous value in virtually every
industry.
During my internship, I had the opportunity to experience firsthand how data analysis
can influence business strategies and decisions. The hands-on experience I gained from
working with real-world datasets taught me how critical it is to approach data with a
strategic mindset. It is not just about knowing how to clean and analyze data; it is about
understanding how to translate those findings into meaningful recommendations that
can directly impact a business. For example, during my internship, I worked with
Amazon India’s sales data for Q2 2022, and the analysis of product category performance,
sales trends, and customer behavior allowed me to offer actionable insights. Identifying
growth opportunities, uncovering underperforming areas, and recommending targeted
strategies based on data is what makes the role of a data analyst so impactful. As I
continue to develop my skills in data analysis, I understand that the field is vast and
constantly evolving.
4.2 Conclusion
The COVID-19 pandemic posed one of the most significant public health challenges
in modern history. In response, the development and deployment of vaccines played a
critical role in controlling the spread of the virus and reducing mortality rates. This
study focused on analyzing the state-wise COVID-19 vaccination data from India, using
the covid_vaccine_statewise.csv dataset sourced from Kaggle. Through data cleaning,
processing, visualization, and comparative analysis, several key insights and patterns
were uncovered.
Key Takeaways
State-Wise Vaccination Distribution: The analysis revealed that states
like Maharashtra, Uttar Pradesh, Rajasthan, and Gujarat were among the top
performers in terms of total doses administered. These states managed to vaccinate a
significant portion of their population quickly, reflecting both high population density
and effective administrative strategies.
Demographic Coverage: Gender-based analysis showed that male and female individuals
received vaccinations in fairly equal proportions, although some states reported
slight disparities. Vaccination among transgender individuals was considerably low,
suggesting a need for more inclusive public health outreach.
Age Group Vaccination Trends: The highest number of vaccinations occurred in the
18–44 years age group, which reflects the larger size of this demographic as well as
the phased rollout strategy adopted by the Indian government. However, considerable
attention was also given to the 45–60 and 60+ age groups, which were prioritized due
to their vulnerability.
Vaccine Type Utilization: Covishield was the more widely administered vaccine
compared to Covaxin, likely due to higher production capacity and early availability. The
dominance of one vaccine over the other varied across states, based on supply chains
and regional preferences.
Adverse Events Reporting: The number of Adverse Events Following Immunization
(AEFI) reported was relatively low across the dataset, indicating a good safety profile
for the vaccines. However, the exact nature of these adverse events was not detailed in
the dataset, and more granular reporting could help assess risk more effectively.
Temporal Trends: The time-series analysis highlighted the initial slow pace of
vaccinations, followed by a significant ramp-up as vaccine production and distribution
improved. Peaks in the vaccination curve aligned with major government drives such
as "Vaccination Maha Abhiyan" and the extension to younger age groups.
The dataset provided valuable insights into how India’s COVID-19 vaccination
campaign unfolded across states and demographic groups. The findings underscore the
importance of data-driven decision-making in managing public health crises.
Additionally, the dataset helped identify gaps in equitable vaccine distribution, such as
gender disparity in certain regions or underrepresentation of transgender individuals,
which are crucial for improving future health campaigns.
Going forward, more granular datasets including booster dose coverage, vaccine
hesitancy factors, and real-time AEFI monitoring could provide even deeper insights.
Nevertheless, this study demonstrates the power of open data and analytics in evaluating
and supporting a country’s public health strategy during a global emergency.
1. Inclusion of Booster Dose Data: At the time of data collection, most of the
focus was on first and second doses of COVID-19 vaccines. However, with the emergence
of virus variants and waning immunity, booster (precautionary) doses have become
critical. Future datasets that include booster dose information will allow for:
Tracking of long-term immunity across the population.
Understanding booster dose coverage by age and region.
Correlating booster dose uptake with a reduction in COVID-19 resurgence.
2. Integration with COVID-19 Case and Mortality Data: By linking vaccination data
with COVID-19 case counts, hospitalizations, and death statistics, more advanced
analysis can be done, such as:
Effectiveness of vaccines in reducing infections and fatalities.
State-wise correlation between vaccination rate and case severity.
Predictive modeling for future waves based on vaccination coverage.
4. Behavioral and Social Analysis: Including survey or qualitative data (like vaccine
hesitancy, awareness campaigns, misinformation impact) would enhance understanding
of:
Public perception and behavior towards vaccination.
Socio-economic factors affecting vaccine uptake.
Communication strategy effectiveness by governments and NGOs.
5. Geographic and Rural-Urban Analysis: Further spatial analysis using GIS tools and
rural-urban splits can help identify underserved regions and populations:
Heatmaps of rural vs. urban coverage.
Pinpointing vaccine deserts (areas with very low coverage).
Targeted resource allocation for mobile vaccination units.
8. Vaccine Equity and Accessibility Studies: Deeper demographic analysis could
address:
Gender disparity trends across rural/urban areas.
Transgender and marginalized group inclusion.
Vaccination of people with disabilities or co-morbidities.
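The integration with case and mortality data proposed in item 2 could start from a simple join on the state name. The sketch below is purely illustrative: neither table exists in the current dataset, the column names "Doses per 100 People" and "Case Fatality Rate (%)" are hypothetical, and every value is invented.

```python
import pandas as pd

# Hypothetical per-state vaccination coverage (invented values).
vaccinations = pd.DataFrame(
    {
        "State": ["State A", "State B", "State C", "State D"],
        "Doses per 100 People": [40.0, 55.0, 70.0, 85.0],
    }
)

# Hypothetical per-state outcome data from a separate source (invented values).
cases = pd.DataFrame(
    {
        "State": ["State A", "State B", "State C", "State D", "State E"],
        "Case Fatality Rate (%)": [2.0, 1.8, 1.1, 0.9, 1.5],
    }
)

# An inner join keeps only the states present in both tables.
merged = vaccinations.merge(cases, on="State", how="inner")

# Pearson correlation between coverage and fatality rate.
correlation = merged["Doses per 100 People"].corr(
    merged["Case Fatality Rate (%)"]
)
print(f"Pearson correlation: {correlation:.2f}")
```

A correlation computed this way only describes association across states; it does not by itself establish that vaccination caused the difference in outcomes.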