This project applies logistic regression to the Cleveland Heart Disease dataset to predict heart disease presence based on patient health records. All tasks are performed in Python using pandas and scikit-learn
- Source: Kaggle, Cleveland Heart Disease (
cherngs/heart-disease-cleveland-uci) - File Used:
heart_cleveland_upload.csv - Target Variable:
condition(presence of heart disease) - Dataset was downloaded using
kagglehub.
- Data loaded and features/target selected.
- Train-test split (80/20) with stratification.
- Feature scaling via StandardScaler.
- Logistic Regression model trained (
max_iter=5000). - Model evaluations: accuracy, confusion matrix, classification report (precision, recall, F1-score).
Key metrics from model evaluation:
- Accuracy: 91.66%
- Confusion Matrix:
[[32 0]
[ 5 23]]
- Classification Report:
precision recall f1-score support
0 0.86 1.00 0.93 32
1 1.00 0.82 0.90 28
accuracy 0.92 60
macro avg 0.93 0.91 0.91 60
weighted avg 0.93 0.92 0.92 60
- Ensure Python >=3.7, pandas, scikit-learn, and kagglehub are installed.
- Set up your Kaggle API credentials for downloading datasets.
- Run the
main.pyscript (provided in this repo).
main.py— main code with all stepsREADME.md— explanation and summaryreport.pdf— screenshots of code and outputs