Internship Project: DIAGNOSTIC LAB SYSTEMS
2025
Internship Project
MBA BA
I Year II Trimester
Medicaid Drug
Spending Data
Analysis
[DIAGNOSTIC LAB SYSTEMS]
This project presents a data-driven analysis of Medicaid drug spending trends by integrating cost,
utilization, and patient reach. Using clustering and Pareto analysis, it identifies high-impact drugs
and key spending drivers. Insights are visualized through interactive dashboards to support
informed, strategic healthcare decision-making. The goal is to enhance Medicaid’s financial
sustainability without compromising patient care
Anurag University 1 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Contents
Project cover sheet ................................................................................................................................ 5
Executive Summary .............................................................................................................................. 6
Project Milestone................................................................................................................................... 7
Introduction ........................................................................................................................................... 8
Problem Statement................................................................................................................................ 8
Objectives............................................................................................................................................... 9
Requirement Specifications .................................................................................................................. 9
User Requirements: ............................................................................................................................. 9
Functional Specifications .................................................................................................................. 10
System Specifications ....................................................................................................................... 11
Data Architecture Diagrams .............................................................................................................. 13
Activity flow..................................................................................................................................... 13
ER Diagram ..................................................................................................................................... 16
Data Model ...................................................................................................................................... 18
Work Flow ........................................................................................................................................ 19
Data Set ................................................................................................................................................ 20
Data Description ............................................................................................................................... 20
Dataset Overview .............................................................................................................................. 20
Data Dictionary ................................................................................................................................. 21
Dimensions ....................................................................................................................................... 21
Measures ........................................................................................................................................... 22
Key Performance Indicators ............................................................................................................. 23
ETL ...................................................................................................................................................... 23
Extract (E) ......................................................................................................................................... 23
Transform (T).................................................................................................................................... 24
LOAD (L) ......................................................................................................................................... 30
Exploratory Data Analysis ................................................................................................................. 32
Outlier detection: .............................................................................................................................. 32
Correlation Analysis: ........................................................................................................................ 33
Scatter Plot: ....................................................................................................................................... 34
Impact of Beneficiaries on Dosage Units (Year-wise) .................................................................. 34
Relationship Between Claims and Dosage Units (Year-wise) ...................................................... 35
Impact of Beneficiaries on Claims Across Years .......................................................................... 36
Trend analysis: .................................................................................................................................. 37
Anurag University 2 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Trend of Total Spending Over Years in Billions ........................................................................... 37
Trend Analysis of Total Beneficiaries Over Years ........................................................................ 38
Trend Analysis of Total Claims Over Years .................................................................................. 39
Bar Plots:........................................................................................................................................... 40
Top 10 Manufacturers by Total Claims ......................................................................................... 40
Top 10 Manufacturers with Highest Total Spending..................................................................... 41
YOY Trend Analysis ........................................................................................................................... 42
Objective: .......................................................................................................................................... 42
Data Preparation................................................................................................................................ 42
Data Cleaning: .................................................................................................................................. 42
Data Transformation: ........................................................................................................................ 42
Data Aggregation .............................................................................................................................. 42
YoY Growth Rate Calculation .......................................................................................................... 42
a) Year-over-Year Change in Total Spending .......................................................................... 43
b) Year-over-Year Change in Claims ....................................................................................... 44
c) Year-over-Year Change in Beneficiaries ............................................................................. 45
Pareto Analysis .................................................................................................................................... 46
Introduction ....................................................................................................................................... 46
Methodology ..................................................................................................................................... 46
Data Collection and Preprocessing: .............................................................................................. 46
Data Cleaning: .............................................................................................................................. 46
Data Visualization: ........................................................................................................................ 47
Clustering Analysis ............................................................................................................................. 49
Data Collection and Preprocessing: .................................................................................................. 49
Data Cleaning: .................................................................................................................................. 49
Data Aggregation: ............................................................................................................................. 49
Normalize the Data: .......................................................................................................................... 50
Determining Optimal Number of Clusters (Elbow Method): ........................................................... 50
Apply KMeans Clustering with Optimal k ....................................................................................... 51
Analyse each cluster ......................................................................................................................... 51
Visualize the Clusters Using PCA: ................................................................................................... 52
Cluster Size Check: ........................................................................................................................... 53
Cluster Profiling – In-Depth Insights ................................................................................................ 54
Visualize Cluster Profiles (Bar Plot) ................................................................................................. 54
Cluster Evaluation:............................................................................................................................ 55
Anurag University 3 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Inertia (Within-Cluster Sum of Squares) ....................................................................................... 55
Silhouette Score ............................................................................................................................ 55
Dashboard............................................................................................................................................ 56
Medicaid Drug Spending & Usage Dashboard ................................................................................. 56
Drug-Level Cost Distribution Analysis............................................................................................. 58
Manufacturer-Level Cost & Utilization Dashboard .......................................................................... 59
Conclusion ............................................................................................................................................ 61
Table 1. Data Dictionary ........................................................................................................................ 21
Table 2. Data Dimensions ...................................................................................................................... 22
Table 3. Data Measure .......................................................................................................................... 22
Table 4. Key Performance Indicators ..................................................................................................... 23
Table 5. Summary Statistics................................................................................................................... 27
Table 6. Summary Statistics................................................................................................................... 28
Table 7. Cluster Summary Statistics ...................................................................................................... 51
Table 8. Cluster Size Table ..................................................................................................................... 54
Figure 1.Activity flow ............................................................................................................................ 13
Figure 2. ER Diagram ............................................................................................................................. 16
Figure 3. Data Model............................................................................................................................. 18
Figure 4. Project Workflow ................................................................................................................... 19
Figure 5. Dataset ................................................................................................................................... 20
Figure 6. Box Plot .................................................................................................................................. 32
Figure 7. Correlation Plot ...................................................................................................................... 33
Figure 8.Impact of Beneficiaries on Dosage Units (Year-wise).............................................................. 34
Figure 9.Relationship Between Claims and Dosage Units (Year-wise) .................................................. 35
Figure 10. Impact of Beneficiaries on Claims Across Years ................................................................... 36
Figure 11. Trend of Total Spending Over Years in Billions ..................................................................... 37
Figure 12. Trend Analysis of Total Beneficiaries Over Years .................................................................. 38
Figure 13. Trend Analysis of Total Claims Over Years ............................................................................ 39
Figure 14. Top 10 Manufacturers by Total Claims ................................................................................. 40
Figure 15. Top 10 Manufacturers with Highest Total Spending ............................................................ 41
Figure 16. Year-over-Year Change in Total Spending ............................................................................. 43
Figure 17. Year-over-Year Change in Claims .......................................................................................... 44
Figure 18. Year-over-Year Change in Beneficiaries ................................................................................ 45
Figure 19. Manufacturers' Pareto Curve ............................................................................................... 47
Figure 20. Brands' Pareto Curve ............................................................................................................ 48
Figure 21. Elbow Graph ......................................................................................................................... 50
Figure 22. Cluster Chart ........................................................................................................................ 53
Figure 23. Cluster Profiles ..................................................................................................................... 54
Figure 24. Medicaid Drug Spending & Usage Dashboard ..................................................................... 56
Figure 25. Drug-Level Cost Distribution Analysis .................................................................................. 58
Figure 26. Manufacturer-Level Cost & Utilization Dashboard .............................................................. 59
Anurag University 4 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Project cover sheet
Project Title Diagnostic Lab Course MBA BA 1 Year – II
Systems TRIMESTER 2
Start Date Guide/Client Anvitha
End Date Team Lead G Vaishali
TEAM MEMBERS
SNO NAME HALL TICKET NO
1 G Vaishali 24MG202A35
2 Gayathri 24MG202A34
3 Sanjay 24MG202A30
Signature:
Guide/Client Name: Sai Krupa
Date:
Anurag University 5 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Executive Summary
This research project focuses on comprehensive clustering analysis of Medicaid drug spending data
to reveal intrinsic patterns in spending, claims, and beneficiary reach across a multi-year horizon.
Drawing upon a curated dataset of 13,889 records and 46 features, we first executed rigorous
exploratory data analysis (EDA) and feature engineering—to ensure data integrity and enhance
model performance. Key quantitative attributes—including total spending, number of claims, and
unique beneficiary counts—were systematically examined to reveal temporal trends and
distributional characteristics.
Focusing on core metrics—total drug spending, claim volume, beneficiary counts, and dosage units—
we employed K-Means clustering to segment drugs into coherent groups reflecting similar cost-
utilization profiles. To bolster interpretability and mitigate high-dimensionality challenges, Principal
Component Analysis (PCA) was applied, extracting independent components that capture the
majority of variance and facilitate visual examination of cluster structures.
Interactive dashboards are developed using Python and Power BI, that portray cluster characteristics,
temporal spending trends, and utilization disparities. These visual assets enable stakeholders—data
scientists, formulary managers, and policymakers—to dynamically explore segment-specific insights,
benchmark drug performance, and prioritize budgetary and clinical interventions.
By integrating unsupervised learning with advanced visualization, this project delivers actionable
intelligence for optimizing Medicaid drug formulary design, containing costs, and guiding strategic
resource allocation in healthcare finance.
Anurag University 6 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Project Milestone
SNO Milestone Key Activities Target
Duration
1. Domain Study o Understand the domain- Learn 2 Week
domain-specific terminology
o Formulate the problem statement-
Define objectives of the study
2. Data Collection o Understand client/business 3 Weeks
requirements
o Identify and access relevant datasets
o Import and store data
3. System Design o ER Diagram: Define data structure 2 Week
o Business Process Diagram: Model
high-level flow
o Activity Flow Chart: Detail step-
by-step process
4. Data Preparation o Data Cleaning: Handle missing 1 Week
values, fix data types, remove
duplicates
o Preprocessing: Normalize,
standardize, rename columns
o Feature Engineering: Create new
derived features
5. Exploratory o EDA (Python): Explore 1 Week
Analysis & distributions, patterns
Reporting o Trend/YoY Analysis
o Summarize findings via visual
summaries (plots/tables)
6. Data Modelling o Apply clustering (e.g., K-Means) 2 Week
o Tune and evaluate models
o Interpret clusters
o Evaluate clusters
7. Data Visualization o Build Power BI dashboards 1 Week
o Graphically summarize results
8. Report Writing, and o Interpret results and models 1 Week
Presentation o Write final report
o Deliver final presentation
Anurag University 7 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Introduction
Medical expenditure has become a critical area of analysis in modern healthcare systems,
particularly with rising costs and increasing pressure on public funding. In the United States,
Medicaid—a key government-funded healthcare program—plays a vital role in providing
medical coverage to millions of low-income individuals and families.
Medicaid is a joint federal and state healthcare program in the United States that provides
medical coverage to low-income individuals and families, including children, elderly adults,
and people with disabilities. Established in 1965 under Title XIX of the Social Security Act,
Medicaid serves as a critical component of the U.S. healthcare system, covering nearly one in
five Americans. The program is funded jointly by the federal government and individual
states, with each state managing its own Medicaid plan within broad federal guidelines. One
of the most significant and growing components of Medicaid spending is on prescription
drugs, which has prompted increased attention toward optimizing drug costs, ensuring access
to essential medications, and managing the financial sustainability of the program.
Understanding how Medicaid resources are distributed across drugs and manufacturers is
essential for policymakers, healthcare professionals, and public health administrators. By
analysing drug-level spending, claim volumes, and beneficiary reach, we can gain valuable
insights into cost drivers, identify outlier drugs with unusually high spending, and make
informed decisions to improve cost-effectiveness and accessibility of treatment.
Problem Statement
“Comprehensive Analysis and Clustering of Medicaid Drug Spending Trends Based on Cost,
Utilization, and Patient Reach"
Medicaid program is facing mounting financial strain due to the rising costs of prescription drugs.
Traditional spending often fails to consider critical dimensions such as cost, utilization intensity, and
patient reach. This project conducts a comprehensive analysis of Medicaid drug data by integrating
these key factors into a unified analytical framework. Utilizing advanced clustering algorithms, the
study will segment medications into meaningful groups to uncover patterns, identify high-impact
drugs, and highlight inefficiencies or anomalies in spending. The insights generated will support
evidence-based decision-making for policymakers ultimately contributing to more strategic, data-
driven management of Medicaid pharmaceutical resources.
Anurag University 8 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Objectives
1. To analyse year-over-year changes in key healthcare metrics, including spending,
claims, and beneficiaries, to identify emerging trends and shifts in resource allocation
and utilization.
2. To perform Pareto analysis to identify the primary drugs and manufacturers driving
the majority of Medicaid drug expenditures, facilitating targeted strategies for cost
containment and resource optimization.
3. To develop derived features and conduct clustering analysis to identify patterns in
drug cost, utilization, and beneficiary reach, thereby enabling precise segmentation
and optimizing resource allocation and strategic decision-making.
4. To design and deploy interactive dashboards consisting of:
▪ The Medicaid Drug Spending & Usage overview
▪ Manufacturer Level Cost & Utilization Analysis
▪ Drug-Level Cost Distribution Analysis
Requirement Specifications
User Requirements:
▪ Spending Trends Overview: Users require a comprehensive view of Medicaid drug
spending patterns over time, with the ability to track changes across key financial
indicators.
▪ Multidimensional Analysis: Users need insights that integrate multiple factors—
such as drug cost, utilization frequency, and patient reach—for a holistic evaluation
of spending impact.
▪ Clustering-Based Grouping: Users expect the application of advanced clustering
techniques (e.g., K-means, hierarchical clustering) to segment drugs into meaningful
categories, aiding in the identification of high-cost, high-impact, or low-efficiency
groups.
▪ Pareto Analysis: Users require Pareto-based insights to highlight the “vital few”
drugs contributing to the majority of Medicaid spending, helping prioritize cost-
containment strategies.
▪ Anomaly and Outlier Detection: Users want to identify drugs with unusual cost or
usage patterns that may signal inefficiencies, fraud, or improper prescribing practices.
▪ Interactive Visualizations: Users prefer dynamic dashboards and visual
representations (e.g., heatmaps, Pareto charts, cluster plots) for intuitive exploration
of spending trends and clusters.
▪ Policy and Decision Support: Users need actionable, data-driven insights that
support strategic policy decisions and optimize procurement and formulary design.
▪ Methodological Transparency: Users expect full visibility into the data sources,
clustering algorithms, and assumptions used in the analysis, ensuring interpretability
and trust in the results.
Anurag University 9 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Functional Specifications
1. Core Features
Data Integration:
o Raw data (raw_file.csv)
o Transformed Medicaid data (medicare_tf.csv)
- Data after cleaning and transformation
o Cleaned Medicaid data (combined_data.csv)
- Data after Feature engineering
o Final Medicaid data (fnl_cln.excel)
- Combining both the above files
Clustering & Analysis:
o Clustering for Drug Categorization:
Using clustering techniques (e.g., K-means), the dataset will segment drugs
into meaningful categories based on cost, utilization, and patient reach.
o Pareto Analysis: This will assist in identifying high-impact drugs that require
strategic focus.
Dashboard Plan:
Data Visualization:
o Visual representations such as scatter plots, bar charts, and Pareto charts for
drug cost and utilization.
o Clustering Visualization: Display drug categories and clusters.
o Filters will be available for users to explore key features (e.g., drug cost,
utilization rate, patient reach, etc.).
2. Analytical Capabilities
Correlation Analysis:
o The system will calculate the Pearson correlation between drug spending, claims and
beneficiaries.
o The system will identify potential causal relationships using statistical methods
3. Visualization and Reporting
Dashboard:
o Visualizations of Trends:
- Yearly Trends: Tracking total beneficiaries, spending, and claims from 2018 to
2022.
- Outlier Trends: Displays the growth in flagged outlier brands over time.
- Manufacturer & Drug Spending Trends: Shows yearly total spending by
manufacturer and by drug brand.
Anurag University 10 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
o Scatter Plots to Visualize Correlations:
- Beneficiaries vs. Claims: A positive correlation between the number of beneficiaries
and the total claims.
- Average Spending: Comparisons between average spend per claim vs. beneficiary.
o Correlation & Distribution Visuals:
- Key Influencer Analysis: Identifies key drivers for total claims count.
- Brand/Manufacturer Distributions: Analysis by outlier classification, brand count
by manufacturer, and dosage unit-level spending.
4. User Roles
- Internal Use (Student/Researcher):
- Manually running and validating models.
- Testing performance for academic evaluation.
- All operations run using Google Collab or Python scripts.
5. Interface / Ease of Use
- All code run via Python scripts.
- Data manually pre-processed.
6. User Interface (UI)
- Upload and manage their datasets.
- View and interact with visualizations.
- Generate and export reports.
7. Data Export
Users will be able to export the analysis results and reports in CSV, Excel, and PDF
formats.
System Specifications
1. Hardware Requirements:
o Processor: Intel i5/i7 or equivalent
o RAM: Minimum 8 GB (16 GB preferred)
o Storage: 500 GB HDD / 256 GB SSD
o Display: 13" screen, 1366x768 resolution or higher
o Graphics: Basic GPU for visual rendering (optional)
o Network: Stable internet for email/alert features
2. Software
o Operating System:
- Windows 10 / 11
Anurag University 11 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
- macOS 11+
o Programming Language:
- Python 3.x with libraries such as pandas, NumPy, matplotlib, seaborn, and
stats models for statistical analysis.
o Data Analysis Tools:
- Jupyter Notebook for interactive coding and analysis
- Google Collab Notebook for AI affiliated coding assistance
o Visualization Tools:
- Power BI for creating dashboards
- Python visualization libraries (e.g., matplotlib, seaborn, plotly).
-
3. Data Requirements
o Source:
This dataset presents information on spending for drugs prescribed to Medicare
beneficiaries enrolled in Part D by physicians and other healthcare providers.
o Format: CSV, Excel, or SQL database.
4. Network Requirements
o Internet connection: Stable internet connection required for accessing real-time
data, research papers, and cloud-based tools like Google Collab or AWS and
sending email/WhatsApp alerts.
o Cloud Storage: For collaboration and secure data backup (e.g., Google Drive,
OneDrive).
5. Machine Learning
Libraries Used:
- Scikit-learn: For clustering (K-means), data splitting, and preprocessing.
- K-means Clustering: Used to group drugs based on cost, utilization, and
patient reach.
- Pandas & Numpy: For data manipulation, cleaning, and feature engineering.
Model Evaluation
- Silhouette Score: Measures cluster quality based on how well-separated and
compact the clusters are.
- Inertia: Evaluates the compactness of the clusters (lower is better).
- Cluster Purity: Assesses the homogeneity within clusters.
- Visual Analysis: Uses plots to visualize and interpret cluster groups.
- Pareto Analysis: Identifies the top drugs contributing to most of the spending.
Anurag University 12 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Data Architecture Diagrams
Activity flow
Figure 1.Activity flow
Anurag University 13 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Interpretations for Activity Flow
The Medicaid Service Lifecycle is structured into five key phases to ensure comprehensive
patient care and efficient claims management
1. Enrolments Phase
o Applicant submits required documentation for Medicaid.
o State agency performs eligibility verification.
o Upon meeting criteria, the application is approved.
o Medicaid ID card is issued to the approved individual.
o All enrolment data is recorded into MMIS (Medicaid Management Information
System).
2. Medical Visit Phase
o Patient visits a Medicaid-approved healthcare provider.
o Patient provides their Medicaid ID for verification.
o Doctor verifies patient ID through system checks.
o Provider conducts medical diagnosis and prescribes treatment.
o Diagnosis and prescription data are submitted electronically to the Managed Care
Organization (MCO).
3. Prescription Phase
o Patient visits a participating pharmacy with a valid prescription.
o Pharmacy verifies Medicaid coverage and prescription authenticity.
o If verified, the medication is dispensed to the patient.
o Pharmacy sends an electronic reimbursement claim to Medicaid.
4. Claims Processing Phase
o Healthcare providers and pharmacies submit claims to MMIS/MCO.
o MMIS/MCO systems perform:
- Claim validation.
- Application of payment rules.
- Error, duplication, and fraud checks.
o Based on validation:
- Valid claims are approved for payment.
- Invalid or non-compliant claims are denied.
Anurag University 14 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
5. Data Collection and Storage Phase
o All medical and prescription service data is logged into MMIS.
o State compiles and submits data reports to CMS (Centre’s for Medicare & Medicaid
Services).
o State analytics systems monitor data for: Service quality, Fraud detection Policy
compliance and reporting.
Anurag University 15 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
ER Diagram
Figure 2. ER Diagram
Anurag University 16 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Interpretations for ERD
The ER diagram illustrates the intricate relationships between the key entities involved in the
Medicaid service framework. At its core, the diagram captures the interactions between
Beneficiaries, Providers/Doctors, Pharmacies, Prescriptions, Claims, Drugs, and
Governing Agencies such as the State Medicaid Agency, MMIS, and CMS.
o The Beneficiary is a central entity that interacts with multiple components of the
healthcare system. A beneficiary receives services from a Provider/Doctor, who
documents the encounter and writes prescriptions based on the diagnosis.
o These prescriptions are submitted to a Pharmacy, where they are processed and linked
to Drugs, each defined by attributes such as generic name, brand name, dosage, and a
unique ID.
o The drugs are produced by Manufacturers, each identified by their name, contact
information, and address.
o Pharmacies act as intermediaries that verify eligibility, submit prescriptions, and
contribute to the claims process.
o The Claims entity represents the financial processing of medical services and
medications. Each claim, uniquely identified by an ID and status, is processed through
the MMIS (Medicaid Management Information System), which validates and
manages the data.
o MMIS serves as a critical system that not only processes claim but also maintains
comprehensive service records. These records are shared with CMS (Centres for
Medicare & Medicaid Services) for regulatory oversight and reporting.
o The State Medicaid Agency acts as an administrative body that governs state-level
operations, manages provider data, and facilitates coordination with MMIS and CMS.
o The diagram also emphasizes the flow of information and accountability.
o Providers/Doctors log encounter records, which are used for claims validation and
oversight. Prescriptions and claims are connected, ensuring financial transparency.
Meanwhile, beneficiaries’ data such as age, name, and ID are consistently referenced
across interactions, supporting continuity of care.
This ER model serves as a structured representation of how Medicaid entities are
interconnected, offering insights into the data dependencies that support healthcare delivery,
financial claims, and policy enforcement. It underlines how administrative, clinical, and
pharmaceutical components integrate within the Medicaid ecosystem to ensure efficient,
compliant, and patient-centred care delivery.
Anurag University 17 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Data Model
Figure 3. Data Model
Interpretations for Data Model
The data model represents the relationships and attributes involved in analysing drug
spending within Medicaid. The central entity is Drugs, linked to details like Generic Name
and Manufacturer. Related tables include Spending, which tracks total and average
expenditures, and Claims, which logs the number of claims per drug per year. Beneficiaries
and Dosage Units provide data on the number of patients and dosage units respectively.
Growth Metrics captures yearly spending changes and growth rates, while Outliers flags
drugs with abnormal spending patterns. Manufacturers store company-level details,
enabling brand-to-manufacturer analysis. Overall, the model enables comprehensive drug
cost tracking and trend analysis over time.
Anurag University 18 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Work Flow
Figure 4. Project Workflow
Anurag University 19 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Data Set
Figure 5. Dataset
Data Description
The dataset consists of structured records related to Medicaid drug spending across various
U.S. states over a span of five years (2018–2022). Each data point represents a unique entry
corresponding to a specific drug, claim type, year, and state. These points capture detailed
information on healthcare utilization and costs. The dataset has been prepared in tabular
format where each row indicates one data point.
Dataset Overview
Thia dataset has 13,889 rows and 46 columns. It includes detailed drug-related information
from 2018 to 2022.
Categorical columns: 3
Numerical columns: 43
Anurag University 20 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Data Dictionary
Column Name Description Data
Type
Brand Name Brand name of the drug String
Generic Name Generic name of the drug String
Total Manufacturers Total number of manufacturers Integer
Manufacturer Name Name of the manufacturer (or "Overall" String
summary)
Year Reporting year Integer
Total Claims Total number of prescription claims Integer
Total Beneficiaries Total number of beneficiaries who Integer
received the drug
Average Spending per Average spending per beneficiary Float
Beneficiary
Average Spending per Claim Average spending per claim Float
Average Spending per Dosage Average spending per dosage unit, Float
Unit (Weighted) weighted
Outlier Flag Binary flag indicating if data is an outlier Integer
(1 = outlier, 0 = not)
Total Dosage Units Total dosage units dispensed Integer
Total Spending Total spending on the drug Float
Table 1. Data Dictionary
Dimensions
Dimensions are the categorical attributes used to segment and analyse the data. They provide
context to the measures and allow grouping, filtering, and comparison across different
categories. The following are the main dimensions in the dataset:
Anurag University 21 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Dimension Name Description Data Type
Brand Name Brand name of the drug String
Generic Name Generic name of the drug String
Manufacturer Name Manufacturer name String
Year Year of report Integer
Outlier Flag Indicates if data is considered outlier Integer
Table 2. Data Dimensions
Measures
Measures are numerical values that are used for quantitative analysis and statistical
operations. These are the key variables used to assess performance, trends, and predictions in
the healthcare system. The primary measures in the dataset include:
Measure Name Description Data
Type
Total Manufacturers Total number of Integer
manufacturers
Total Claims Number of prescription Integer
claims
Total Beneficiaries Number of beneficiaries Integer
Average Spending per Avg. spending per Float
Beneficiary beneficiary
Average Spending per Avg. spending per claim Float
Claim
Average Spending per Avg. spending per Float
Dosage Unit (Weighted) weighted dosage unit
Total Dosage Units Total dosage units Integer
dispensed
Total Spending Total spending on the Float
drug
Table 3. Data Measure
Anurag University 22 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Key Performance Indicators
KPIs are calculated insights used to monitor performance.
KPI Name Description Formula /
YoY Growth in Measures yearly growth (Tot_Spndng_2022 -
Spending/Claims/Beneficiaries Tot_Spndng_2021) /
Tot_Spndng_2021 * 100
Total Claims/Beneficiaries Number of Count
prescriptions claims or
beneficiaries
Total spending Total amount spent on a Float
specific drug or
manufacturer over a
given period.
Total Dosage units Sum of dosage units Float
dispensed for the drug.
Total Manufacturers Number of int
manufacturers
supplying the drug
Outlier Flag Count Number or percentage int
of flagged outlier
entries
Table 4. Key Performance Indicators
ETL
Extract (E)
Data Loading and Initial Inspection:
o Objective:
- To collect and load Medicare Part D by Drug dataset from reliable sources into
python juypter notebook or google collab for further processing and analysis.
- The key columns namely total claims, spending, beneficiaries, dosage, drug and
manufacturer name along with outlier flag were selected for analysis.
- Initial rows were reviewed using python libraries such as pandas to understand
the structure of the data (datatypes, description, shape, head and tail)
o Data Sources:
- CSV files containing yearly Medicaid spending, claims, and beneficiary data
from 2018 to 2022 downloaded from data.gov.in.
Anurag University 23 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
o Outcome:
- Identified initial structure and prepared the dataset for preprocessing.
- Observed presence of missing values and some duplicate entries.
Transform (T)
Data Cleaning and Preprocessing Steps for Medicare Part D Dataset
1. Handling Missing Values
o All rows containing missing values were dropped to ensure a clean and
complete dataset for analysis.
2. Removing Duplicates
o Duplicate entries across key columns (e.g., Drug Name, Total Spending, Total
Claims) were identified and removed to prevent redundancy.
3. Data Transformation
o The dataset was reshaped using the melt function, converting it from wide to
long format to support better analysis and visualization.
4. Feature Engineering
o Three new columns were added to capture year-over-year percentage changes
for key metrics
o These features help in identifying trends and fluctuations across years.
5. Outlier Detection
o Z-scores were computed for numerical features such as Total Spending, Total
Claims, and Total Dosage Units to identify potential outliers.
o While outliers were detected, they were not removed, as they form key
elements of the dataset and may reflect significant patterns or high-impact
drugs.
6. Scaling and Normalization
o Numerical features were scaled using StandardScaler to bring all values to a
similar range.
o This ensured that no feature dominated due to differences in units or
magnitude, especially important for clustering and machine learning models.
7. Final Cleaning and Export
Anurag University 24 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
o The cleaned and transformed dataset was saved as
cleaned_medicare_drug_data.csv.
o A final inspection was done to confirm dataset shape, consistency, and
readiness for analysis.
Descriptive Analysis
1. Data Loading and Overview:
o The dataset was loaded using pandas and converted into a Data Frame for
analysis.
o The head () method displayed the first few rows to inspect the structure and
content of the data.
o The info () function provided an overview of data types, non-null counts, and
memory usage.
OUTPUT:
Anurag University 25 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Anurag University 26 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
2. Missing Value Analysis:
o Missing values in each column were identified using is null().sum() and
removed using df.dropna(inplace=True)
o This ensured that any gaps in the dataset were accounted for during
preprocessing.
OUTPUT:
3. Statistical Summary:
o Descriptive statistics for numerical columns, including mean, median, standard
deviation, and percentiles, were computed using describe ().
o For categorical columns, the describe () method was applied to review the
count, unique values, and most frequent entries.
OUTPUT:
Statistic Index Total Year Total Total Total Total
manufactu Claims Beneficiari Dosage Spending
rers es Units
count 69445.0 69445.0 69445.0 69445.0 69445.0 58681.0 58681.0
mean 1.4852 2020.0 252420.23 81038.42 8393.86 34033833.3 34033833.3
8 8
std 2.3328 1.4142 1341310.29 361426.78 48298.71 235801438. 235801438.
70 70
min 1.0 2018.0 11.0 11.0 0.0967 26.43 26.43
25% 1.0 2019.0 745.0 354.0 75.41 75958.34 75958.34
50% 1.0 2020.0 10963.0 4723.0 277.55 1091290.97 1091290.97
75% 1.0 2021.0 252420.0 81038.0 1687.06 8305301.43 8305301.43
max 41.0 2022.0 64797302.0 15885735.0 1595175.55 152198122 152198122
94.0 94.0
Table 5. Summary Statistics
Anurag University 27 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Statistic Avg Avg Spnd_Per_Clm Avg Spnd Per Dsg Unt Outlier_Flag
Spnd_Per_Bene Wghtd
count 56896.0 58681.0 58681.0 58652.0
mean 1717.64 198.66 0.0917 18949723.48
std 8310.42 2162.81 0.2887 118616660.89
min 0.0356 0.0002 0.0 0.7
25% 30.14 0.4821 0.0 21516.0
50% 111.24 2.1141 0.0 251768.0
75% 547.54 13.3261 0.0 3039349.5
max 337266.86 233364.03 1.0 4654341154.0
Table 6. Summary Statistics
4. Categorical Data Analysis:
o Unique values in each categorical column were displayed to understand the
diversity of entries and identify any anomalies.
OUTPUT:
Anurag University 28 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Anurag University 29 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
5. Duplicate Analysis:
o The dataset was checked for duplicate rows, and the count was displayed to
assess data quality.
6. Feature Engineering:
o Three new columns were added to capture year-over-year percentage changes
for key metrics:
- Claims_YoY_Change – annual change in claims
- Beneficiaries_YoY_Change – annual change in beneficiaries
- Total_Spending_YoY_Change – annual change in total spending
o These features help in identifying trends and fluctuations across years.
OUTPUT:
LOAD (L)
Objective:
To store the cleaned and transformed data in a structured format for further use in analysis,
modelling, and reporting.
Target Storage Options:
o Local CSV File (for use in dashboards, ML models, and visualization)
Export Method:
Save to CSV: df.to_csv("cleaned_medicaid_spending.csv" index=False)
Anurag University 30 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Outcome of ETL Process:
• The Medicaid Spending dataset was cleaned, enriched, and transformed into an
analysis-ready format.
• Outliers and missing data were effectively handled.
• Key KPIs and derived metrics were engineered for use in dashboards and machine
learning models.
• Data is now structured for advanced analysis, cluster modelling and cost-efficiency
analytics.
Tools and Libraries Used
Component Tool/Library Used
Extraction Python, Pandas
Transformation Pandas, NumPy, SciPy, Scikit-learn, Seaborn
Visualization Matplotlib, Seaborn
Loading Pandas (CSV), SQL (Database)
Anurag University 31 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Exploratory Data Analysis
Outlier detection:
o Boxplots for each numerical column were created to identify outliers. These plots
visually depict the data spread, interquartile range (IQR), and extreme values.
o Outliers were observed in Total Spending, Average Spending per Claim, and
Beneficiaries, indicating the presence of high-cost drugs or unusual usage patterns.
OUTPUT:
Figure 6. Box Plot
INTERPRETATIONS:
▪ X-Axis: Represents different healthcare-related variables such as total claims,
beneficiaries, spending per claim/unit, and total spending.
▪ Y-Axis: Represents the values for each variable.
▪ The box plot shows the spread, median, and outliers for each variable's distribution.
▪ Variables like Tot_Spndng and Tot_Dsg_Unts show a wide range with many
extreme outliers, indicating high variability.
▪ Variables such as Total_Claims and Avg_Spnd_Per_Bene have values concentrated
near the lower end, resulting in compressed boxes.
▪ The data is right-skewed overall, with most values on the lower end and a few
extremely high ones.
▪ Large scale differences between variables affect visibility, suggesting the need for
scaling or transformation for better comparison.
Anurag University 32 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Correlation Analysis:
Figure 7. Correlation Plot
INTERPRETATIONS:
The correlation heatmap illustrates the Pearson correlation coefficients among various
healthcare-related variables, providing insights into their linear relationships.
Key Insights:
▪ Total_Claims and Total_Benes exhibit a strong positive correlation (0.95),
indicating that an increase in the number of claims is closely associated with an
increase in the number of beneficiaries.
▪ Tot_Dsg_Unts (Total Dosage Units) also shows a strong correlation with both
Total_Claims (0.90) and Total_Benes (0.83), suggesting that higher claim and
beneficiary counts tend to result in a greater volume of prescribed dosage units.
Anurag University 33 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
▪ Avg_Spnd_Per_Clm and Avg_Spnd_Per_Bene demonstrate a high positive
correlation (0.82), implying that the average spending per claim is strongly aligned
with the average spending per beneficiary.
▪ Tot_Spndng (Total Spending) shows weak correlations with all other variables
(highest being 0.24 with Total_Claims), suggesting that total spending is influenced
by a broader combination of factors rather than any single variable.
▪ Variables such as Avg_Spnd_Per_Bene and Total_Claims, or Avg_Spnd_Per_Clm
and Total_Benes, exhibit negligible or near-zero correlations, indicating minimal
linear dependency between claim volumes and average spending metrics.
Conclusion:
The heatmap reveals strong interdependencies between volume-related variables
(claims, beneficiaries, dosage units), while spending-related metrics show more
complex, less direct relationships.
Scatter Plot:
Impact of Beneficiaries on Dosage Units (Year-wise)
Figure 8.Impact of Beneficiaries on Dosage Units (Year-wise)
INTERPRETATIONS:
Anurag University 34 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
▪ The X-axis represents the total number of beneficiaries.
▪ The Y-axis represents the total dosage units prescribed.
▪ Each point in the scatter plot corresponds to data from a specific provider or
entity, color-coded by year (2018 to 2021).
▪ The plot shows a positive linear relationship between total beneficiaries and
dosage units — as the number of beneficiaries increases, the dosage units
also tend to increase.
▪ A regression line is included to highlight the trend, with a shaded region
representing the confidence interval.
▪ Bubble sizes vary and may represent an additional variable spending, adding
a third dimension to the visualization.
▪ The data points are fairly clustered along the trend line, indicating a strong
correlation.
▪ A few outliers are observed in the top-right area, indicating exceptionally
high values for both beneficiaries and dosage units.
▪ This trend remains consistent across all four years, suggesting stable year-
wise behaviour in this relationship.
Relationship Between Claims and Dosage Units (Year-wise)
Figure 9.Relationship Between Claims and Dosage Units (Year-wise)
INTERPRETATIONS:
▪ The X-axis represents the total number of claims submitted.
▪ The Y-axis represents the total dosage units prescribed.
Anurag University 35 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
▪ Each point in the scatter plot corresponds to data from a specific provider or
entity, and is color-coded by year (2018 to 2021).
▪ The plot shows a positive linear relationship between total claims and
dosage units — as the number of claims increases, the dosage units also tend
to increase.
▪ A regression line is included to highlight the trend, with a shaded area
representing the confidence interval.
▪ The size of the bubbles varies and may represent total spending,
introducing a third dimension to the visualization.
▪ The data points are tightly clustered around the trend line, indicating a
strong correlation between claims and dosage units.
▪ A few outliers are visible in the top-right section, indicating exceptionally
high values for both claims and dosage units.
▪ The observed trend is consistent across all four years, suggesting stable
year-wise behaviour in this relationship.
Impact of Beneficiaries on Claims Across Years
Figure 10. Impact of Beneficiaries on Claims Across Years
INTERPRETATIONS:
▪ The X-axis represents the total number of beneficiaries.
▪ The Y-axis represents the total number of claims submitted.
Anurag University 36 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
▪ Each point in the scatter plot corresponds to data from a specific provider or
entity, and is color-coded by year (2018 to 2021).
▪ The plot shows a positive linear relationship between beneficiaries and
claims — as the number of beneficiaries increases, the number of claims also
tends to increase.
▪ A dashed regression line is drawn to represent the trend, with a shaded
confidence interval area.
▪ The bubble sizes vary, likely indicating total spending, adding an
additional dimension to the chart.
▪ Most data points are tightly aligned along the trend line, indicating a
strong correlation between beneficiaries and claims.
▪ Some outliers are visible in the upper-right region, reflecting unusually high
values for both beneficiaries and claims.
▪ The trend is consistently observed across all four years, suggesting a
stable year-wise relationship between these two variables.
Trend analysis:
Trend of Total Spending Over Years in Billions
Figure 11. Trend of Total Spending Over Years in Billions
INTERPERTATIONS:
▪ The X-axis represents the year (from 2018 to 2022).
Anurag University 37 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
▪ The Y-axis shows the total spending in billions (B).
▪ Each point on the line chart represents the total annual spending for that year,
with exact values labelled at each point.
▪ The chart displays a consistently increasing trend in total spending over the
five-year period.
▪ Total spending rose from $326.96B in 2018 to $480.67B in 2022, reflecting a
steady year-over-year increase.
▪ The steepest increase is observed between 2021 and 2022, where spending
jumped from $431.07B to $480.67B, indicating a possible surge in healthcare
costs or utilization.
▪ The consistent upward trajectory suggests growing demand, rising drug
prices, or broader coverage over time.
▪ The smooth and linear pattern implies predictable growth, which can be
useful for future forecasting and policy planning.
Trend Analysis of Total Beneficiaries Over Years
Figure 12. Trend Analysis of Total Beneficiaries Over Years
INTERPRETATIONS:
▪ The X-axis represents the years from 2018 to 2022.
▪ The Y-axis shows the total number of beneficiaries, measured in crores (cr).
Anurag University 38 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
▪ Each point on the line chart reflects the total number of beneficiaries for
that particular year, with exact values labeled.
▪ In 2018, the total beneficiaries stood at 87.5 crores.
▪ There is a sharp increase in 2019, rising to 92.5 crores.
▪ In 2020, there is a slight decline to 90.5 crores, indicating a drop in total
beneficiaries.
▪ The number of beneficiaries increases again in 2021 to 93.4 crores, showing
a recovery.
▪ This upward trend continues strongly into 2022, reaching 98.3 crores,
marking the highest point in the observed period.
▪ There is no shaded region around the line, indicating that this graph
represents actual recorded values without an uncertainty or
confidence range.
Trend Analysis of Total Claims Over Years
Figure 13. Trend Analysis of Total Claims Over Years
INTERPRETATIONS:
▪ The X-axis represents the years from 2018 to 2022.
▪ The Y-axis shows the total claims measured in billions (B).
Anurag University 39 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
▪ Each point on the line chart reflects the total number of claims for that specific
year, with exact values labelled.
▪ In 2018, the total claims were 2.82 billion.
▪ There is a steady increase in 2019 to 2.93 billion, followed by a further rise to
2.97 billion in 2020.
▪ Growth continues gradually in 2021, reaching 3.00 billion.
▪ In 2022, the total claims climb sharply to 3.09 billion, marking the highest value
across the period.
▪ There is no shaded region around the line, indicating the graph shows recorded
data without uncertainty bands.
Bar Plots:
Top 10 Manufacturers by Total Claims
Figure 14. Top 10 Manufacturers by Total Claims
INTERPRETATIONS:
▪ X-axis: Represents the Total Claims for each manufacturer, with units in
Thousands (K), Millions (M), or Billions (B), depending on the claim
volume.
▪ Y-axis: Represents the Manufacturer Name, listing the top 10 manufacturers
based on the total number of claims filed.
▪ The total claims across the top 10 manufacturers vary significantly, with some
manufacturers showing substantially higher claim volumes than others.
Anurag University 40 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
▪ The difference in claim volume between manufacturers is noticeable, with a
few having a much higher number of claims, indicating either higher product
usage or potential quality-related issues.
Top 10 Manufacturers with Highest Total Spending
Figure 15. Top 10 Manufacturers with Highest Total Spending
INTERPRETATIONS:
▪ X-axis: Represents the Total Spending by each manufacturer, with units in
Thousands (K), Millions (M), or Billions (B), depending on the spending
volume.
▪ Y-axis: Represents the Manufacturer Name, listing the top 10
manufacturers based on total spending
▪ The graph displays the total spending across the top 10 manufacturers, with
each bar corresponding to the total spending of a particular manufacturer.
▪ The length of the bars indicates the magnitude of total spending, showing
how much each manufacturer is investing or spending.
▪ Manufacturers on the right side with longer bars indicate higher spending,
suggesting they have substantial budgets for their operations or a higher
volume of products/services involved.
Anurag University 41 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
YOY Trend Analysis
Objective:
▪ The purpose of this analysis is to examine the Year-over-Year (YoY) trends in key
metrics such as Total Spending, claims and beneficiaries.
▪ By analysing the trends over multiple years, we can gain insights into patterns,
growth or decline rates, and shifts in the healthcare market related to drug
spending.
Data Preparation
Dataset Overview: Ensure that the dataset includes the necessary columns such as Drug
Name, Year, Manufacturer Name, Total Spending, Total Claims, Total Beneficiaries.
Data Cleaning:
▪ Check for missing values in the dataset and handle them appropriately (either by
filling or dropping rows with missing data).
▪ Ensure that the year column is in the correct format and use it for aggregating data
over time.
Data Transformation:
▪ Convert any date columns (if applicable) into a datetime format to ensure accurate
time-based analysis.
▪ Convert claims and beneficiaries into integer format
Data Aggregation
Group the data by Year and calculate the following aggregated values:
▪ Total Spending: Sum of all spending for each year.
▪ Total Claims: Sum of all claims for each year.
▪ Total Beneficiaries: Sum of all beneficiaries for each year.
This allows us to focus on year-based trends and identify overall changes over time.
YoY Growth Rate Calculation
Calculate the Year-over-Year (YoY) Growth Rate for each metric:
▪ YoY Growth for Spending: The percentage change in spending compared to the
previous year.
▪ YoY Growth for Claims: The percentage change in claims compared to the
previous year.
▪ YoY Growth for Beneficiaries: The percentage change in the number of
beneficiaries compared to the previous year.
Anurag University 42 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
The YoY growth rate gives an understanding of the percentage change from one year to the
next, allowing for a comparison of annual performance.
a) Year-over-Year Change in Total Spending
Figure 16. Year-over-Year Change in Total Spending
INTERPRETATIONS:
▪ The X-axis represents the years from 2018 to 2022.
▪ The Y-axis shows the percentage change in total spending compared to the
previous year.
▪ Each point on the line chart reflects the year-over-year (YoY) percentage
growth in total spending, with exact values labelled.
▪ There is a sharp increase in 2019, with a YoY change of 10.7%, following
the baseline value of 0.0% in 2018.
▪ The rate of increase moderates slightly in 2020 (9.3%) and 2021 (8.9%),
indicating a slowing growth rate during this period.
▪ In 2022, the YoY change climbs again to 11.5%, indicating a renewed
acceleration in spending growth.
▪ The shaded region around the line represents the confidence interval,
indicating variability or uncertainty in the change estimates.
Anurag University 43 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
b) Year-over-Year Change in Claims
Figure 17. Year-over-Year Change in Claims
INTERPRETATIONS:
▪ The X-axis represents the years from 2018 to 2022.
▪ The Y-axis shows the percentage change in claims compared to the previous
year.
▪ Each point on the line chart reflects the year-over-year (YoY) percentage
change in claims, with exact values labelled.
▪ In 2019, there is a sharp increase in claims, reaching a 4.0% YoY change, up
from the baseline of 0.0% in 2018.
▪ The growth rate decelerates significantly in 2020 to 1.4%, and further slows
down to 0.9% in 2021, indicating a period of reduced growth in claims.
▪ In 2022, the YoY change rebounds, rising to 3.0%, showing a renewed
increase in claims.
▪ The shaded region around the line represents the confidence interval,
indicating some variability or uncertainty around the estimates
of claims change.
Anurag University 44 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
c) Year-over-Year Change in Beneficiaries
Figure 18. Year-over-Year Change in Beneficiaries
INTERPRETATIONS:
▪ The X-axis represents the years from 2018 to 2022.
▪ The Y-axis shows the percentage change in the number of beneficiaries
compared to the previous year.
▪ Each point on the line chart reflects the year-over-year (YoY) percentage
change in beneficiaries, with exact values labelled.
▪ There is a sharp increase in 2019, with a 5.7% YoY rise following the
baseline of 0.0% in 2018.
▪ In 2020, there is a significant decline, with the YoY change dropping to -
2.2%, indicating a reduction in the number of beneficiaries.
▪ In 2021, the trend reverses upward, reaching a 3.2% YoY increase, marking a
recovery.
▪ The growth continues in 2022, rising further to 5.2%, indicating sustained
positive growth.
▪ The shaded region around the line shows the confidence interval, suggesting
the range of variability or uncertainty around the change estimates.
Anurag University 45 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Pareto Analysis
Introduction
Pareto Analysis, rooted in the Pareto Principle (commonly known as the 80/20 Rule), asserts
that in numerous situations, a significant majority of outcomes are influenced by a relatively
small portion of causes. Specifically, it suggests that 80% of effects come from just 20% of
the causes.
In the context of this analysis, we have applied Pareto Analysis to examine the distribution of
spending, claims, and benefits across various manufacturers and brands. By leveraging this
technique, we aim to identify the key contributors—those critical few entities that account for
the majority of expenditures and claims.
The goal is to optimize resource allocation by focusing efforts on the areas that most
significantly impact the overall results. This approach ensures that we direct attention and
resources to the manufacturers and brands that drive the majority of spending and claims,
thus maximizing the effectiveness of our strategy.
Methodology
Data Collection and Preprocessing:
▪ The data collected for analysis should consist of manufacturer and brand-level
information, including metrics such as Total Spending, Total Claims, and Total
Beneficiaries.
Data Cleaning:
• Data Type Conversion:
To ensure the integrity of the data for analysis, we converted the columns Total
Claims and Total Beneficiaries into integer type
• Data Aggregation:
We grouped the data by Manufacturer Name and Brand Name to aggregate the total
spending, total claims, and total benefits. The aggregation was performed using the
groupby() function.
• Feature engineering-Cumulative Percentage Calculation:
To apply the Pareto principle, we calculated the cumulative percentage of each metric
(spending, claims, and benefits) for both manufacturers and brands using the
cumulative sum method:
Anurag University 46 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Data Visualization:
The next step after calculations is to identify the vital few which includes identifying
the small number of causes that are responsible for the majority of the effect which
can be done using the following process:
To visually understand the distribution of total spending across manufacturers and
brands, Pareto Charts were generated using Matplotlib. These visualizations are
essential to identify the "vital few" contributors — the few manufacturers and brands
that account for the majority of spending.
- Manufacturers' Pareto Curve: Show the cumulative percentage of spending by
manufacturers. This helps visualize which manufacturers are contributing the
most to total spending.
Figure 19. Manufacturers' Pareto Curve
INTERPRETATIONS:
▪ X-axis: Manufacturers, sorted from highest to lowest based on total
spending.
Anurag University 47 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
▪ Y-axis: Cumulative percentage of total Medicare spending attributed to
each manufacturer.
▪ Blue curve with dots: Cumulative percentage of total spending as
more manufacturers are included.
▪ Red dashed line at 80%: Indicates the threshold for the classic 80/20
Pareto principle (i.e., 80% of outcomes come from 20% of causes).
▪ Annotation ("Teva USA, 80.30%"): Indicates the point at which the
cumulative spending reaches 80%. It names the manufacturer that
crosses this threshold.
▪ The curve is steep at the beginning and flattens out as you go right.
▪ This indicates that a small number of manufacturers account for the
majority of total spending.
▪ The point marked “Teva USA” shows that up to this manufacturer,
cumulative spending reaches 80.30% of the total.
▪ After Teva USA, the curve flattens, meaning the rest of the
manufacturers individually contribute very little to overall spending.
- Brands' Pareto Curve: Shows the cumulative percentage of spending by
manufacturers. This helps visualize which brands are contributing the most to
total spending.
Figure 20. Brands' Pareto Curve
INTERPRETATIONS:
▪ The graph demonstrates a classic Pareto distribution, where a small
proportion of brands accounts for a large share of total spending.
Anurag University 48 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
▪ X-axis: Represents brands, sorted from highest to lowest based on total
Medicare spending.
▪ Y-axis: Displays the cumulative percentage of total Medicare spending
attributed to these brands.
▪ Blue curve with dots: Reflects the cumulative share of total spending
as more brands are added.
▪ Red dashed line at 80%: Marks the classic Pareto threshold, suggesting
that around 80% of total spending often comes from a minority of top
brands.
▪ The steep rise at the beginning of the curve demonstrates that a small
number of brands account for a disproportionately large share of
spending.
▪ As the curve progresses to the right, it flattens, signifying that the
remaining brands each contribute relatively little to total spending.
▪ The point marked “Azopt” denotes the cut-off brand, with cumulative
spending up to this point reaching 80.02% of the total.
▪ Brands to the right of Azopt, while numerous, individually contribute
marginally, confirming the 80/20 rule: a limited number of brands
drive the bulk of the spending.
Clustering Analysis
Clustering analysis helps identify distinct groups of drug brands based on key metrics like
claims, spending, and dosage. This segmentation enables targeted strategies, such as resource
allocation, marketing focus, and optimizing drug management, by uncovering patterns and
relationships within the dataset that might otherwise go unnoticed.
Data Collection and Preprocessing:
▪ The data collected for analysis should consist of brand-level information, including
metrics such as Total Spending, Total Claims, and Total Beneficiaries and Dosage
units.
Data Cleaning:
▪ Data Type Conversion:
To ensure the integrity of the data for analysis, we converted the columns Total
Claims and Total Beneficiaries into integer type.
Data Aggregation:
This step prepares the data for clustering by aggregating metrics by brand. Rather than
analyzing individual records, we compute average values per brand, allowing us to group
Anurag University 49 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
brands with similar performance or behavior pattern. It Groups the dataset by the Brnd_Name
column, so that all records for each brand are processed together. It selects only the relevant
numerical columns for aggregation. Computes the average of each selected column for each
brand. This helps in representing each brand with a set of average metrics, ideal for
clustering.
Normalize the Data:
This step Normalizes the data using StandardScaler() from sklearn.preprocessing
Normalization (or standardization) is a crucial step in the clustering pipeline to ensure that all
features contribute equally to the distance calculations used by clustering algorithms like
KMeanN.
Determining Optimal Number of Clusters (Elbow Method):
This step helps determine the optimal number of clusters (k) for the KMeans algorithm
using the Elbow Method. Choosing the right k ensures meaningful and well-separated
clusters.
Figure 21. Elbow Graph
INTERPRETATIONS:
▪ The X-axis represents the number of clusters (k), ranging from 1 to 10.
▪ The Y-axis represents inertia, or the total within-cluster sum of squares (WCSS).
▪ As k increases, inertia decreases — this is expected because more clusters lead to
tighter groupings.
Anurag University 50 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
▪ There is a sharp drop in inertia from k = 1 to k = 3, indicating significant
improvement in cluster compactness.
▪ After k = 4, the rate of improvement starts to slow down.
▪ The "elbow" (point of inflection) appears to occur around k = 4 or 5.
▪ The elbow point is the ideal balance between Too few clusters (underfitting — high
inertia) and Too many clusters (overfitting — unnecessary complexity).
▪ k = 4 seems to be a good candidate, suggesting that the data naturally groups into 4
distinct clusters.
▪ This number of clusters provides meaningful segmentation without over-
complicating the model.
Apply KMeans Clustering with Optimal k
Now that we have determined the optimal number of clusters (k = 4), we apply the KMeans
algorithm to group similar brands based on their standardized metrics. This includes the
following steps:
- Initializes the KMeans algorithm to form 4 clusters.
- Trains the KMeans model on the normalized data.
- Assigns each brand to one of the 4 clusters.
- Returns a list of cluster labels (0, 1, 2, 3) for each brand.
- Adds a new column to the grouped DataFrame, indicating the cluster to which each
brand belongs.
Analyse each cluster
After applying KMeans algorithm next step is to groups the data by the cluster label assigned
during KMeans clustering and calculates the mean of all numerical columns for each cluster.
Cluster Avg. Avg. Avg. Avg. Dosage
Total Spending Beneficiaries Units
Claims
0 60,538 33.6 19,144 4.4 million
million
1 2.35 713.8 701,315 128.2 million
million million
2 299,077 30.8 212,004 682.5 million
million
3 14.4 10 billion 2.6 million 1.2 billion
million
Table 7. Cluster Summary Statistics
INTERPRETATIONS:
Anurag University 51 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
▪ Cluster 0 — Low Usage, Low cost, Niche Brands
- Small number of claims & beneficiaries
- Moderate spending ($33M)
- Low dosage volume
Interpretation:
This cluster likely represents small-scale or niche brands, perhaps targeting
specialized treatments or rare diseases. May include brand-name or high-price-per-unit
items, but low total usage. These brands have limited reach and usage in terms of both
patients and prescriptions.
▪ Cluster 1 — High-Cost Drugs, Moderate Usage
- High number of claims (2.35M) and large spending ($714M)
- Medium-to-high beneficiary count
- Low total dosage units compared to others
Interpretation:
Brands in this group are very popular and heavily used, serving large patient
populations. They may include widely prescribed in small doses but expensive
medications — possibly branded versions of high-demand drugs.
▪ Cluster 2 — High Dosage, Lower Spending
- Moderate claims and beneficiaries
- Huge dosage units (682 million) with only $30M spending
Interpretation:
This is an unusual but distinct group — these brands might be low-cost medications
administered in very large quantities, like generics for chronic conditions. High dosage
with relatively lower spend suggests cost-effective or essential drugs.
▪ Cluster 3 — Very High Usage and Spending
- Extremely high values across all metrics
- Massive claims, spending, beneficiaries, and dosage units
Interpretation:
These are flagship or blockbuster brands, likely central to national-scale treatment
programs. Their large footprint in all metrics suggests broad-spectrum or chronic-use
drugs with high economic and public health impact.
Visualize the Clusters Using PCA:
The next logical step is to visualize the clusters and possibly profile them in more
depth. It shows how well-separated or overlapping the clusters are and also highlight
patterns or trends in the data.
Anurag University 52 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Figure 22. Cluster Chart
INTERPRETATIONS:
This PCA (Principal Component Analysis) scatter plot visualizes the drug brand
clusters in 2D space using the first two principal components of the standardized
data.
- Cluster 0 (Blue):
Contains the majority of brands; represents low-cost, low-volume drugs with
consistent patterns — likely generics.
- Cluster 1 (Red):
Shows moderate variability; likely includes mid-range brands with average
spending and usage.
- Cluster 2 (Pink):
More dispersed; may indicate high-dosage, lower-cost drugs — possibly
essential medications used in large quantities.
- Cluster 3 (Cyan):
An outlier cluster with very high values in all metrics; likely represents
blockbuster or high-impact drugs.
Cluster Size Check:
The next step is to counts how many brands fall into each cluster and ensures the
clusters are listed in order.
- Cluster 0 has the largest number of brands – it's the most common type.
- Cluster 3 has very few – possibly outlier brands with extreme
characteristics.
Anurag University 53 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Cluster Number of Brands Interpretation
0 3,273 Majority Cluster
1 73 Moderate Cluster
2 8 Small Cluster
3 1 Outlier Cluster
Table 8. Cluster Size Table
Cluster Profiling – In-Depth Insights
After performing clustering, the next crucial step is cluster profiling, where we
analyse the clusters in-depth to understand their characteristics and practical
significance. This is a vital step for translating numerical clusters into meaningful,
actionable insights.
- Examine Cluster Members (Sample Brands)
- Calculate Average Metrics for Each Cluster
- Assign Descriptive Labels to Clusters
• Cluster 0: "Generic/Low-Impact Brands"
• Cluster 1: "Mid-Tier/Moderate Brands"
• Cluster 2: "High-Dosage Essentials"
• Cluster 3: "Blockbuster Brands"
Visualize Cluster Profiles (Bar Plot)
The bar chart provides a clear visualization of the number of brands in each cluster,
highlighting dominant and niche segments. It aids in interpreting clustering results
and supports data-driven decision-making.
Figure 23. Cluster Profiles
Anurag University 54 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Cluster Evaluation:
Evaluating the quality of the clusters is crucial to ensure that the data has been
appropriately grouped. Two common evaluation metrics for this purpose are Inertia
(within-cluster sum of squares) and Silhouette Score.
Inertia (Within-Cluster Sum of Squares)
- Inertia measures how internally cohesive the clusters are. It calculates the sum
of squared distances between each point in a cluster and the cluster's centroid.
- A lower inertia value indicates that the data points within each cluster are
closer to the cluster's centroid, suggesting tighter and more well-defined
clusters.
Silhouette Score
- The Silhouette Score evaluates how well each data point has been clustered,
taking into account both Cohesion and separation. The score ranges from -1 to
+1. The Silhouette Score gives a global assessment of the quality of
clustering, helping you understand the overall coherence and separation of
the clusters.
Interpretation:
A Silhouette Score close to 1 (0.917) indicates that clusters are highly cohesive
and well-separated. The relatively low Inertia confirms that data points lie close to
their respective centroids. Together, these metrics validate that the chosen
clustering (k = 4) produces tight, distinct groupings.
Anurag University 55 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Dashboard
Medicaid Drug Spending & Usage Dashboard
Figure 24. Medicaid Drug Spending & Usage Dashboard
This dashboard provides a comprehensive analysis of Medicaid's drug-related expenditures
and usage patterns across multiple years. It combines high-level metrics, detailed visual
trends, and interactive slicers to enable data-driven decisions and performance monitoring.
1. Key Metrics Overview
Purpose: Provides high-level statistics summarizing overall Medicaid drug usage and
spending.
Key Data Points:
• Total Beneficiaries: 5 billion
• Total Spending: $2.00 trillion
• Total Claims: 15 billion
• Total Dosage Units: 1.11 trillion
Insights:
• The massive volume of claims and dosage units indicates a high utilization rate of
Medicaid drug benefits.
Anurag University 56 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
• Trillions in spending highlight the significant financial investment involved in public
healthcare drug programs.
2. Brand Distribution by Outlier Classification
Purpose: Displays the proportion of drug brands marked as outliers.
Key Data Points:
• Outlier Brands (Flag = 1): 4K (7.03%)
• Non-Outlier Brands (Flag = 0): 52.89K (92.97%)
Insights:
• A very small proportion of drug brands are classified as outliers, indicating that
anomalies or high-variance cases are rare but worth monitoring.
3. Outlier Trends Over Time
Purpose: Shows the yearly trend of brands flagged as outliers.
Key Data Points:
• Gradual increase in outliers from 9K in 2018 to 12K in 2022.
Insights:
• There is a steady rise in flagged outliers, possibly indicating either more scrutiny in
monitoring or growing irregularities in claims and spending.
4. Relationship Between Beneficiaries and Claims
Purpose: Illustrates correlation between number of beneficiaries and total claims.
Insights:
• The strong positive correlation suggests that as the number of beneficiaries increases,
the number of claims rises proportionally.
• The data may help predict drug demand based on enrolment trends.
5. Yearly Trends of Beneficiaries, Spending, and Claims
Purpose: Displays yearly progression in drug usage metrics.
Key Data Points:
• From 2018 to 2022:
o Beneficiaries: Increased from 6.2K to 7.6K
o Claims: Increased from 5.5K to 6.4K
o Spending: Rose from $327B to $481B
Insights:
• All three metrics have increased year-over-year, reflecting rising demand and costs.
• Spending shows the steepest increase, potentially due to drug price inflation or higher
utilization per beneficiary.
6. Average Spending per Claim vs. Beneficiary
Purpose: Analyses cost efficiency and spending distribution.
Insights:
Anurag University 57 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
• A positive trend is observed: higher average spending per beneficiary correlates with
higher spending per claim.
• Certain data points are outliers, suggesting unusually high or low-cost treatment
clusters.
Drug-Level Cost Distribution Analysis
Figure 25. Drug-Level Cost Distribution Analysis
This dashboard delivers granular insights into drug-level costs under Medicaid, focusing on
individual brands, claims, dosage units, and beneficiary spending patterns. It helps
identify cost-intensive drugs and evaluate financial impacts across different brands and
manufacturers.
1. Drug Count Overview
o The dashboard analyses a total of 3,355 unique drug brands, offering a wide
scope for comparative cost analysis.
2. Top Spending Brands by Metrics
o Average Spend per Dosage Unit:
Stensiq (₹62K) and Myalept (₹53K) are among the highest costing drugs per
dosage unit.
o Average Spend per Claim:
Cabivii (₹1.6M) and Myalept (₹1.3M) top this metric, indicating high per-
patient treatment costs.
Anurag University 58 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
o Average Spend per Beneficiary:
Stensiq (₹13M) and Exondys (₹11M) are notable outliers for overall cost
burden per beneficiary.
3. Yearly Drug Spending Trend by Brand
o Consistent yearly growth in spending across major drugs:
▪ Eliquis and Humira lead in terms of absolute growth, with Eliquis
rising to ₹30B in 2022.
▪ Total spending jumped from ₹47.48B in 2018 to ₹108.98B in 2022,
highlighting rising drug expenditures.
4. Brand-Level Spending Table
o A tabulated section shows details like:
▪ Total spending
▪ Number of beneficiaries
▪ Number of claims
▪ Dosage units
▪ Manufacturer names
o Total drug spending from the table amounts to ₹19.96 trillion with over
56,892 claims and 22,056 beneficiaries, showcasing extensive utilization.
5. Interactive Slicers
o Year Slicer: Filters data from 2018 to 2022 to enable trend-based analysis.
o Manufacturer Name Slicer: Enables deep dives into specific manufacturer
data, identifying brands driving cost surges.
Manufacturer-Level Cost & Utilization Dashboard
Figure 26. Manufacturer-Level Cost & Utilization Dashboard
Anurag University 59 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
This dashboard provides an analytical overview of manufacturer-level costs, claims, and
utilization metrics under Medicaid. It supports stakeholders in understanding how different
pharmaceutical manufacturers contribute to healthcare expenditures and how their drug
utilization trends evolve over time.
1. Total Manufacturers
o 917 manufacturers are part of the dataset.
o This metric reflects the diversity and scale of drug providers within the
Medicaid program.
2. Brand Count by Manufacturer
o Displays the number of unique drugs contributed by each manufacturer.
o Eli Lilly & Co. leads with approximately 3.4K brands, indicating a broad
product portfolio.
3. Claims & Beneficiaries per Manufacturer
o Visualizes the number of claims and unique beneficiaries per manufacturer.
o Teva USA shows the highest claim and beneficiary count among individual
manufacturers, suggesting widespread drug use.
4. Key Influencers Analysis
o Utilizes AI to identify what factors drive the number of total claims:
▪ Increase in beneficiaries leads to a 4.24x increase in claims.
▪ Reduction in number of manufacturers slightly boosts claims,
possibly due to market concentration.
5. Yearly Total Spending by Manufacturer (2018–2022)
Purpose: Tracks annual expenditure trends.
o Spending steadily rises from $0.22T in 2018 to $0.34T in 2022.
o Larger manufacturers like Abbvie, AstraZeneca, and Eli Lilly consistently
contribute significantly to yearly totals.
o Steady increase in total spending indicates rising healthcare costs or expanding
treatment volumes.
o Spending is concentrated among top pharmaceutical firms, showing their
sustained market dominance.
6. Manufacturer Details Table
o Lists individual manufacturers with associated spending, claims, dosage units,
and brand counts.
o Example: Mylan Inst-Nova has $544.83 in spending and 2 beneficiaries,
indicating low-cost high-frequency drugs.
o Summary Totals
- Total Spending: ₹19,96,51,63,35,773.32
Anurag University 60 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
- Total Beneficiaries: 22,056
- Total Claims: 56,892
- Total Brands: 3,355
Overall Insights
• A small number of manufacturers are responsible for a large proportion of total
brand variety, claims, and expenditures.
• Beneficiary count is the strongest driver of claims volume, underscoring the link
between patient outreach and drug demand.
• Spending trends show a consistent annual rise, indicating increasing financial
pressure on healthcare systems or expanded drug accessibility.
• This dashboard is instrumental in cost containment strategies, supplier
performance tracking, and policy decision-making.
Conclusion
Through this project, we successfully segmented thousands of drug brands based on usage,
spending, dosage, and beneficiaries using K-Means clustering. The analysis uncovered four
distinct brand profiles that provide critical insights into how different drugs are utilized and
distributed across the healthcare system. Key findings include:
1. Cluster 0 – Generic/Low-Impact Brands:
o Largest cluster (3,273 brands) with low claims, spending, and dosage.
o Likely consists of low-cost generics or niche medications with limited usage.
o Represents the baseline tier of pharmaceutical products.
2. Cluster 1 – Mid-Tier/Moderate Brands:
o Includes 73 brands with moderate usage but relatively high spending.
o These may be brand-name drugs prescribed in smaller doses but with higher
costs.
o Key targets for pricing review and formulary optimization.
3. Cluster 2 – High-Dosage Essentials:
o Comprises 8 brands with massive dosage volumes but relatively low spending.
o Likely essential, cost-effective drugs used in chronic or widespread
treatments.
o Ideal candidates for bulk procurement and public health programs.
4. Cluster 3 – Blockbuster Brands:
o A single outlier brand with extremely high claims, beneficiaries, spending, and
dosage.
o Likely a flagship treatment or high-demand national-scale drug.
o Warrants close monitoring due to its economic and clinical impact.
Anurag University 61 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
The clustering analysis provided significant insights into the structure and dynamics of drug
brand utilization, along with targeted recommendations for Medicaid policymakers, offering
several practical implications:
1. Optimize Procurement and Budget Allocation
o Prioritize bulk purchasing agreements for Cluster 2 drugs (High-Dosage
Essentials), as they offer high patient coverage at low cost.
o Allocate resources efficiently by reducing overinvestment in Cluster 0 brands
with minimal clinical or economic impact.
2. Monitor and Evaluate High-Impact Brands
o Closely track Cluster 3 (Blockbuster Brand) performance, considering its
disproportionate influence on national spending and utilization.
o Conduct periodic cost-benefit analyses and clinical outcome reviews to ensure
these flagship drugs deliver value aligned with their resource consumption
3. Implement Tiered Pricing and Reimbursement Strategies
o Apply differentiated pricing models and negotiate lower prices or alternative
reimbursement models for Cluster 1 brands where costs are high but usage is
moderate.
o Streamline subsidies for Cluster 2, promoting access to essential medicines.
Anurag University 62 MBA BA I – II Trimester
Internship Project: DIAGNOSTIC LAB SYSTEMS
Anurag University 63 MBA BA I – II Trimester