Projects
Analysis Of Patient Data (Domain: Healthcare)
This project requires learners to analyze the patient data of those suffering from
different diseases across various summaries. The facility, chain organizations, and
dialysis stations analysis is required to be carried out where the patients are
undergoing dialysis. The project also focuses on the payment mode aspect wherein if
any discounts or reduction in payments have happened then those are analyzed.
# Correct the data preparation
# Set number of rows
num_rows = 55
# Define sample data with consistent length
data = {
"Patient_ID": [f"P{str(i).zfill(3)}" for i in range(1, num_rows - 1)] + [None],
"Facility_Name": ["CityCare Hosp", "MediLife Center", "Wellness Clinic", None, "CityCare
Hosp"] * 11,
"Chain_Organization": ["HealthPlus", "Independent", None, "HealthPlus", "MediLife
Group"] * 11,
"Dialysis_Station_ID": ["Station_" + str(random.randint(1, 5)) for _ in range(num_rows)],
"Disease_Type": ["Kidney Failure", "Acute Kidney Disease", "Chronic Kidney Disease",
None, "Kidney Failure"] * 11,
"Payment_Mode": ["Cash", "Insurance", "Govt Aid", None, "Insurance"] * 11,
"Total_Cost": [random.choice([5000, 4800, 5200, None, 5100, "Five Thousand", "4,800"])
for _ in range(num_rows)],
"Discount_Applied": [random.choice([0, 500, 700, None, "Five Hundred", 600]) for _ in
range(num_rows)],
"Final_Amount": [random.choice([4500, 4800, 4700, 4400, None, "Four Thousand Five
Hundred"]) for _ in range(num_rows)],
"Visit_Date": pd.date_range(start="2024-01-01", periods=num_rows, freq="7D").tolist()
}
# Create DataFrame
df = pd.DataFrame(data)
# Manually inject more errors
df.loc[5, 'Facility_Name'] = "medilife center" # Lowercase error
df.loc[10, 'Payment_Mode'] = "cash" # lowercase
df.loc[15, 'Total_Cost'] = "five thousand" # text instead of number
df.loc[20, 'Discount_Applied'] = "six hundred" # text
df.loc[25, 'Final_Amount'] = "four thousand five hundred" # text
df.loc[30, 'Disease_Type'] = "kidney failure" # lowercase
df.loc[35, 'Chain_Organization'] = None # missing value
# Save to CSV
csv_path = '/mnt/data/patient_dialysis_data_with_errors_final.csv'
df.to_csv(csv_path, index=False)
csv_path
1. Dataset (Expected Columns)
Column Name Description
Patient_ID Unique ID of each patient
Facility_Name Hospital/facility name
Chain_Organization Organization that owns the facility
Dialysis_Station_ID Station where dialysis is done
Disease_Type Type of kidney disease
Payment_Mode Payment method (Cash, Insurance, Govt Aid)
Total_Cost Original treatment cost
Discount_Applied Discount amount given
Final_Amount Final amount after discount
Visit_Date Date of visit
2. Solution Guide — Step-by-Step
Step 1: Understand the Problem
You need to analyze patient data, dialysis facilities, chains, stations, and payment
modes.
You need to find patterns, summaries, and insights.
Step 2: Load and Clean the Data
Load Data: Load CSV or Excel file into Python (Pandas) or Power BI/Excel.
Check Null Values: If missing values exist, decide to fill or remove them.
Correct Data Types: Ensure cost columns are numbers, dates are dates, etc.
Remove Duplicates: Check and clean.
Tools: Pandas, Excel, or Power Query (Power BI).
Step 3: Exploratory Data Analysis (EDA)
Here you find answers to important questions:
✅ Facility Level
How many patients per facility?
Which facility has the highest number of dialysis sessions?
✅ Chain Organization Level
How many facilities belong to each chain?
Compare the patient numbers between chains.
✅ Dialysis Station Level
Which stations are used most?
Are some stations underutilized?
✅ Payment Mode Analysis
How are patients paying? (cash, insurance, govt aid, etc.)
What percentage received discounts?
Average discount amount?
✅ Revenue Analysis
Total revenue = sum(Final_Amount)
Revenue lost due to discounts = sum(Discount_Applied)
Tools: Pandas groupby, pivot tables, bar charts.
Step 4: Visualization
Create graphs and charts like:
Bar chart of patients per facility
Pie chart of payment mode distribution
Line chart showing monthly revenue
Heatmap of station usage
Tools: Power BI, Tableau, Matplotlib, or Excel Charts.
Step 5: Insights and Recommendations
Write a summary report with conclusions, such as:
"Facility X handles 30% of all dialysis cases."
"Chain Y has the highest discount application, affecting revenue."
"Most payments happen via insurance — 60%."
"Station 4 and 5 are underused — optimization needed."
Step 6: Final Deliverables
You can submit:
A clean Excel file or Python notebook (Jupyter) or Power BI Dashboard.
A PDF report or PowerPoint summarizing the key insights and charts.
Optional: a few recommendations for management based on findings.
I even added mistakes/errors like:
Null values
Wrong spellings
Data types mistakes (text instead of numbers)
document all their assumptions.
(Example: If filling missing payment modes with "Unknown", write it clearly.)
Data Cleaning Steps
👉 Step 1: Handle Missing Values
Identify missing data (nulls) in key columns.
Fill or remove rows depending on analysis needs.
o Example: If Facility_Name or Patient_ID is missing → drop that row.
o If Disease_Type is missing → can fill as 'Unknown'.
👉 Step 2: Standardize Text Data
Correct lowercase/case issues.
o Example: Change "cash" to "Cash", "medilife center" to "MediLife Center".
Remove unnecessary spaces.
👉 Step 3: Correct Data Types
Convert Total_Cost, Discount_Applied, Final_Amount to numeric values.
o Text like "Five Thousand" → convert to 5000
o Remove commas inside numbers like "4,800" → 4800
👉 Step 4: Check Duplicates
Check if any Patient_ID is duplicated.
Keep only the latest Visit_Date record if needed.
👉 Step 5: Validate Logical Consistency
Check if:
Final Amount=Total Cost−Discount Applied\text{Final Amount} = \text{Total Cost} -
\text{Discount Applied}Final Amount=Total Cost−Discount Applied
o If not, fix or flag incorrect rows.
Data Analysis Steps
👉 Step 6: Facility-wise Patient Count
How many patients went to each facility?
👉 Step 7: Chain Organization Summary
How many facilities under each chain?
Number of patients handled per chain.
👉 Step 8: Dialysis Station Usage
Number of treatments per Dialysis_Station_ID.
Find busiest stations.
👉 Step 9: Payment Mode Analysis
How many patients used Cash, Insurance, Govt Aid?
Average Final Amount per payment mode.
👉 Step 10: Discount Analysis
How many patients received discounts?
Total amount of discounts given (sum).
👉 Step 11: Timeline Trend
Plot number of visits per month.
Are patients increasing or decreasing over time?
Reporting and Insights
👉 Step 12: Key Findings
Facility with most patients.
Chain organization handling maximum cases.
Popular payment mode.
Average discount given.
Busiest dialysis station.
👉 Step 13: Visualizations (Optional)
Bar chart: Facility vs Number of Patients
Pie chart: Payment Mode share
Line chart: Visits over time
Step Task Output
1 Handle Missing Values Cleaned data
2 Standardize Text Proper text formatting
3 Correct Data Types Numeric columns fixed
4 Remove Duplicates Unique patients
5 Validate Amounts Logical data
6-11 Analyze Business insights
12-13 Report Final dashboard