Phase-2 Submission – Data Analytics
Student Name: BHAVAN S
Register Number: 512223104012
Institution: SKP ENGINEERING COLLEGE
Department: CSE
Date of Submission:
GitHub Repository Link: github profile
1. Problem Statement
The healthcare industry faces significant challenges in early disease detection and
personalized treatment. Traditional diagnostic methods often rely on reactive approaches,
leading to delayed interventions and higher costs. This project aims to leverage AI and
machine learning to predict diseases early by analyzing patient data such as medical
history, lifestyle factors, and biometric measurements. By transitioning from reactive to
proactive healthcare, we can improve patient outcomes, reduce treatment costs, and
optimize resource allocation.
2. Project Objectives
The primary goal is to develop an AI-powered system that predicts diseases (e.g.,
diabetes, cardiovascular diseases) based on patient data. Key objectives include:
- Identifying patterns and risk factors in patient data that correlate with specific diseases.
- Building predictive models to assess disease likelihood and recommend preventive
measures.
- Providing actionable insights to healthcare providers for early intervention.
- Ensuring the model is interpretable and scalable for real-world deployment.
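A minimal sketch of what such a predictive model could look like, assuming scikit-learn and a logistic regression baseline (chosen here for illustration because its coefficients are easy to interpret; it is not necessarily the project's final model). The data is synthetic, and the feature names (age, glucose, bmi) and the rule that generates the labels are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic patient data for illustration only (not the project's dataset).
rng = np.random.default_rng(seed=42)
n = 500
age = rng.integers(20, 80, size=n)
glucose = rng.normal(100, 20, size=n)
bmi = rng.normal(27, 4, size=n)

# Hypothetical rule: risk grows with age, glucose, and BMI.
logit = 0.04 * (age - 50) + 0.05 * (glucose - 100) + 0.1 * (bmi - 27)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([age, glucose, bmi])

# Hold out 25% of the patients to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Standardize the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Predicted probability of the positive class serves as a disease risk score.
risk = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, risk)

# Coefficients on standardized features hint at each feature's influence.
coefs = model.named_steps["logisticregression"].coef_[0]
for name, coef in zip(["age", "glucose", "bmi"], coefs):
    print(f"{name}: {coef:.3f}")
print(f"ROC AUC on held-out data: {auc:.3f}")
```

The risk scores could feed the early-intervention insights listed above, and the same pipeline structure would accept a different estimator (for example a gradient-boosted model) in place of the logistic regression.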
3. Flowchart of the Project Workflow
Data Collection
- EHRs, Wearables, Surveys
- Lab results, Demographics
Data Cleaning
- Missing values
- Outlier removal
- Standardization.
Exploratory Data Analysis (EDA)
- Distributions
- Correlations
- Visualizations
Feature Selection
- Statistical tests
- Domain knowledge
- Feature importance.
Insight Extraction
- SHAP value analysis
- Key risk factor identification
- Patient stratification
Visualization
- Interactive dashboards
- Risk prediction charts
- Trend analysis graphs
Reporting & Recommendations
- Automated PDF reports
- Executive summaries
- Personalized prevention plans
4. Data Description
• Data Source: Public datasets (e.g., Kaggle, UCI ML Repository) or synthetic data
mimicking real-world patient records.
• Data Type: Structured tabular data (e.g., CSV files).
• Number of Rows and Columns: 1,00 rows × 12 columns
• Dataset Nature: Static (data does not change in real time)
Key Fields Relevant to the Problem:
- Patient_ID, Age, Gender
- Medical history (e.g., past diagnoses, family history)
- Biometrics (e.g., blood pressure, cholesterol levels)
- Lifestyle factors (e.g., smoking, exercise habits)
- Target variable: Disease diagnosis (binary/multi-class)
5. Data Preprocessing
To ensure accurate analysis, we performed the following data cleaning and preparation
steps:
• Handling Missing Values:
- Mean/median imputation for numerical fields (e.g., blood pressure, glucose levels).
- Mode imputation for categorical fields (e.g., gender, disease history).
• Removing Duplicates:
Each patient is uniquely identified using a Patient_ID. Duplicates are removed to
avoid bias in model training and disease prediction outcomes.
• Formatting and Parsing:
- Dates (e.g., admission, diagnosis, follow-up) are standardized to datetime format.
- Clinical values are formatted as float/int to ensure compatibility with ML models.
• Encoding Categorical Variables:
- Label Encoding for binary features like gender (Male/Female).
- One-Hot Encoding for multi-class variables like symptoms or departments visited.
• Outlier Detection and Treatment:
- Interquartile Range (IQR) and Z-score methods are used to detect anomalies in
lab results (e.g., extremely high cholesterol).
- Outliers are capped, or removed if medically implausible.
• Transformations:
- Creating New Fields: New fields like Efficiency_Score =
Performance_Score / Monthly_Hours_Worked were created to better reflect
productivity.
- Deeper Insights: These transformations helped in uncovering deeper insights.
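The cleaning steps above can be sketched with pandas roughly as follows. This is an illustrative sketch, not the project's actual pipeline: the toy records and the column names Blood_Pressure, Cholesterol, and Department are hypothetical, median imputation is picked from the mean/median options, and capping is picked as the outlier treatment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; column names are illustrative, not the real schema.
df = pd.DataFrame({
    "Patient_ID": [1, 2, 2, 3, 4, 5],
    "Gender": ["Male", "Female", "Female", None, "Male", "Male"],
    "Blood_Pressure": [120.0, 135.0, 135.0, np.nan, 128.0, 140.0],
    "Cholesterol": [190.0, 210.0, 210.0, 205.0, 950.0, 200.0],
    "Department": ["Cardiology", "Endocrinology", "Endocrinology",
                   "Cardiology", "General", "General"],
})

# 1. Remove duplicate patients, keeping the first record per Patient_ID.
df = df.drop_duplicates(subset="Patient_ID", keep="first").reset_index(drop=True)

# 2. Impute missing values: median for numerical, mode for categorical fields.
df["Blood_Pressure"] = df["Blood_Pressure"].fillna(df["Blood_Pressure"].median())
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

# 3. Cap outliers with the IQR rule: values beyond 1.5 * IQR from the
#    quartiles are clipped to the nearest fence.
q1, q3 = df["Cholesterol"].quantile([0.25, 0.75])
iqr = q3 - q1
df["Cholesterol"] = df["Cholesterol"].clip(lower=q1 - 1.5 * iqr,
                                           upper=q3 + 1.5 * iqr)

# 4. Encode categoricals: a 0/1 label for binary Gender,
#    one-hot columns for the multi-class Department field.
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})
df = pd.get_dummies(df, columns=["Department"], dtype=int)

print(df)
```

Capping keeps every patient row, which matters for small datasets; removing rows instead would be a one-line filter on the same IQR fences.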
6. Exploratory Data Analysis (EDA)
• Univariate Analysis:
Histograms for age distribution, bar charts for disease prevalence.
• Bivariate/Multivariate Analysis:
Scatter plots (e.g., glucose vs. diabetes), correlation heatmaps.
• Key Insights:
- High cholesterol and age are strong predictors of cardiovascular diseases.
- Lifestyle factors (e.g., sedentary habits) correlate with higher diabetes risk.
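As a rough illustration of the numbers behind these plots, the sketch below computes an age distribution, a correlation matrix, and disease prevalence by age band with pandas. The data is synthetic and the columns Age, Glucose, and Diabetes are hypothetical stand-ins, so the values it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; not the project's dataset.
rng = np.random.default_rng(seed=0)
n = 200
age = rng.integers(20, 80, size=n)
glucose = 80 + 0.5 * age + rng.normal(0, 10, size=n)

# Hypothetical label: higher glucose raises the chance of a positive diagnosis.
prob = 1 / (1 + np.exp(-(glucose - 110) / 10))
diabetes = (rng.random(n) < prob).astype(int)

df = pd.DataFrame({"Age": age, "Glucose": glucose, "Diabetes": diabetes})

# Univariate: counts per 10-year age bin (the data behind a histogram).
age_bins = (
    pd.cut(df["Age"], bins=range(20, 90, 10), right=False)
    .value_counts()
    .sort_index()
)

# Bivariate/multivariate: Pearson correlation matrix (the data behind a heatmap).
corr = df.corr()

# Disease prevalence per age band (the data behind a bar chart).
age_band = pd.cut(df["Age"], bins=[20, 40, 60, 80], right=False)
prevalence = df.groupby(age_band, observed=True)["Diabetes"].mean()

print(age_bins)
print(corr.round(2))
print(prevalence.round(2))
```

Passing corr to seaborn.heatmap(corr, annot=True) would render the correlation heatmap, and df["Age"].plot.hist() would draw the age histogram.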
7. Tools and Technologies Used
• Programming Language: Python
• Notebook/IDE: Google Colab, Jupyter Notebook
• Libraries Used:
- Data Processing: pandas, numpy
- Visualization: matplotlib, seaborn, plotly
- ML Models: scikit-learn, XGBoost, TensorFlow (for deep learning)
• Optional Tools:
- pandas-profiling – For quick automated EDA reports
These tools helped efficiently clean, explore, and visualize the data for analysis.
8. Team Members and Contributions
Name        Contribution
BHAVAN S    Data Cleaning, EDA
C K YESU    Data Collection, Visualization, Insights
GOKUL       Documentation, Flowchart Design, Presentation