A comprehensive Python survival analysis workflow using statsmodels. Covers Kaplan-Meier estimation, log-rank tests, Cox proportional hazards regression, and alternative approaches like Nelson-Aalen and Accelerated Failure Time models with practical code examples.
Survival analysis (also called time-to-event analysis) models the time until an event of interest occurs. This repository provides a complete workflow for:
- Nonparametric survival estimation using Kaplan-Meier and Nelson-Aalen methods
- Comparison of survival curves with various weighted log-rank tests
- Semiparametric regression via Cox Proportional Hazards models
- Alternative approaches including Accelerated Failure Time models
- Model diagnostics and assumption checking
The main analysis is contained in the Jupyter notebook survival_and_duration_analysis.ipynb, which includes:
- Environment Setup and Imports - All required libraries and configurations
- Introduction to Survival Analysis - Key concepts and mathematical foundations
- Nonparametric Methods - Kaplan-Meier estimation and visualization
- Survival Curve Comparison - Multiple hypothesis testing approaches
- Cox Proportional Hazards Regression - Multivariate modeling with interpretation
- Model Diagnostics - Checking proportional hazards assumption
- Alternative Approaches - Nelson-Aalen, AFT models, and time-varying covariates
- References - Key academic and documentation sources
Kaplan-Meier estimation:
- Survival function estimation with right-censored data
- Quantile estimation with confidence intervals
- Simultaneous confidence bands
- Comparative visualization of multiple groups
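The survival-function and quantile estimation listed above can be illustrated from first principles. The following is a minimal pure-Python sketch of the Kaplan-Meier product-limit estimator, not the statsmodels implementation; the helper names `kaplan_meier` and `survival_quantile`, the toy data, and the quantile convention (smallest event time with S(t) <= 1 - p) are assumptions made for illustration. In the notebook itself these quantities come from `sm.SurvfuncRight`, which also provides quantile confidence intervals and simultaneous confidence bands.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time for right-censored data.

    times  : follow-up time per subject
    events : 1 if the event was observed at that time, 0 if right-censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # every subject leaves the risk set at their own time
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            surv *= 1.0 - d / at_risk  # product-limit step: (1 - d_j / n_j)
            curve.append((t, surv))
        at_risk -= leaving[t]  # remove events and censored subjects at t
    return curve


def survival_quantile(curve, p):
    """Smallest event time t with S(t) <= 1 - p, or None if never reached."""
    for t, s in curve:
        if s <= 1.0 - p:
            return t
    return None


# Toy data (made up for illustration): six subjects, two of them censored.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
median = survival_quantile(curve, 0.5)
for t, s in curve:
    print(f"t={t}  S(t)={s:.4f}")
print("median survival time:", median)
```

Censored subjects never trigger a drop in the curve, but they do shrink the risk set for later event times, which is why the step at t=3 is larger than it would be with complete data.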
Survival curve comparison:
- Log-rank test (default)
- Fleming-Harrington test with parameter tuning
- Gehan-Breslow test (weights early events more heavily)
- Tarone-Ware test (intermediate weighting)
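The four tests above differ only in the weight applied at each event time. The sketch below computes a two-group weighted log-rank statistic by hand to make that visible; it is an illustration under stated assumptions, not the statsmodels code. The function name `weighted_logrank`, the toy data, and the one-parameter Fleming-Harrington form (weight S(t-)^p from the pooled Kaplan-Meier estimate) are choices made here. In statsmodels the corresponding weightings are selected through the `weight_type` argument of `statsmodels.duration.survfunc.survdiff`.

```python
import math


def weighted_logrank(times, events, groups, weight="logrank", fh_p=1.0):
    """Two-group weighted log-rank chi-square statistic and p-value (1 df).

    groups : 0/1 group label per subject
    weight : "logrank" (w = 1), "gb" (w = n), "tw" (w = sqrt(n)),
             or "fh" (w = S(t-)**fh_p, S = pooled Kaplan-Meier estimate)
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    num = 0.0        # sum of w * (observed - expected) events in group 1
    var = 0.0        # sum of w**2 * hypergeometric variance
    surv_prev = 1.0  # pooled Kaplan-Meier estimate just before the current time
    for t in event_times:
        n = sum(1 for s in times if s >= t)
        n1 = sum(1 for s, g in zip(times, groups) if s >= t and g == 1)
        d = sum(1 for s, e in zip(times, events) if s == t and e)
        d1 = sum(1 for s, e, g in zip(times, events, groups)
                 if s == t and e and g == 1)
        if weight == "logrank":
            w = 1.0
        elif weight == "gb":
            w = float(n)
        elif weight == "tw":
            w = math.sqrt(n)
        elif weight == "fh":
            w = surv_prev ** fh_p
        else:
            raise ValueError(f"unknown weight: {weight}")
        num += w * (d1 - d * n1 / n)
        if n > 1:
            var += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        surv_prev *= 1.0 - d / n
    chi2 = num * num / var
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square upper tail, 1 df
    return chi2, p_value


# Toy data (made up): group 1 fails early, group 0 fails late, no censoring.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
groups = [1, 1, 1, 0, 0, 0]
chi2_lr, p_lr = weighted_logrank(times, events, groups)
chi2_gb, p_gb = weighted_logrank(times, events, groups, weight="gb")
chi2_tw, p_tw = weighted_logrank(times, events, groups, weight="tw")
print(f"log-rank:      chi2={chi2_lr:.3f}  p={p_lr:.4f}")
print(f"Gehan-Breslow: chi2={chi2_gb:.3f}  p={p_gb:.4f}")
print(f"Tarone-Ware:   chi2={chi2_tw:.3f}  p={p_tw:.4f}")
```

Because the Gehan-Breslow weight is the size of the risk set, it is largest at early event times; Tarone-Ware uses the square root of that size and so sits between Gehan-Breslow and the unweighted log-rank test.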
Cox proportional hazards regression:
- Multivariate regression with hazard ratio interpretation
- Formula interface for model specification
- Complete coefficient diagnostics with confidence intervals
- Assessment of proportional hazards assumption
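To make the hazard ratio interpretation concrete: a Cox coefficient is a log hazard ratio, so exponentiating the coefficient and its Wald interval endpoints gives the hazard ratio and its confidence interval. The sketch below uses hypothetical coefficients and standard errors, not values from the notebook, and the helper name `hazard_ratio` is an assumption.

```python
import math


def hazard_ratio(coef, se, z=1.96):
    """Hazard ratio and approximate 95% Wald interval from a log-hazard coefficient."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)


# Hypothetical coefficients for illustration only (NOT results from the notebook).
hr_pos, lo_pos, hi_pos = hazard_ratio(0.10, 0.02)   # covariate that raises the hazard
hr_neg, lo_neg, hi_neg = hazard_ratio(-0.25, 0.05)  # protective covariate
print(f"HR={hr_pos:.3f} (95% CI {lo_pos:.3f}-{hi_pos:.3f})")
print(f"HR={hr_neg:.3f} (95% CI {lo_neg:.3f}-{hi_neg:.3f})")
```

A hazard ratio above 1 with an interval that excludes 1 indicates a covariate associated with higher hazard per unit increase; a ratio below 1 with an interval that excludes 1 indicates a protective association.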
Alternative approaches:
- Nelson-Aalen cumulative hazard estimation
- Introduction to Accelerated Failure Time (AFT) models
- Discussion of time-varying covariates approaches
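As a companion to the Nelson-Aalen item above, here is a minimal from-scratch sketch of the estimator, which accumulates d_j / n_j (events over subjects at risk) across event times. The function name `nelson_aalen` and the toy data are assumptions for illustration; the notebook's own implementation may differ.

```python
import math
from collections import Counter


def nelson_aalen(times, events):
    """Return [(t, H(t))]: cumulative hazard at each distinct event time."""
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # every subject leaves the risk set at their own time
    at_risk = len(times)
    cum_hazard = 0.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            cum_hazard += d / at_risk  # increment d_j / n_j
            curve.append((t, cum_hazard))
        at_risk -= leaving[t]
    return curve


# Toy data (made up for illustration): six subjects, two of them censored.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
na_curve = nelson_aalen(times, events)
# exp(-H(t)) gives an alternative estimate of the survival function.
surv_from_na = [(t, math.exp(-h)) for t, h in na_curve]
for (t, h), (_, s) in zip(na_curve, surv_from_na):
    print(f"t={t}  H(t)={h:.4f}  exp(-H(t))={s:.4f}")
```

The survival estimate exp(-H(t)) is never below the Kaplan-Meier estimate on the same data, and the two are close when the risk sets are large relative to the number of events at each time.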
The analysis demonstrates:
- Survival quantiles: 25th percentile at 3995 days (CI: 3776-4166 days) for females
- Group comparisons: Marginal differences between sexes (p ≈ 0.05 across tests)
- Cox model findings: Age and lambda free light chain levels significantly increase hazard, while female sex is protective
statsmodels >= 0.14.0
matplotlib >= 3.7.0
numpy >= 1.24.0
pandas >= 2.0.0
Uses the flchain dataset from the R survival package, available through statsmodels.datasets.get_rdataset().
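import statsmodels.api as sm
from statsmodels.duration.hazard_regression import PHReg
# Load flchain; the 0/1 "female" indicator is derived from the "sex" column (assumed coding "F"/"M")
data = sm.datasets.get_rdataset("flchain", "survival").data
data["female"] = (data["sex"] == "F").astype(int)
data = data.dropna(subset=["creatinine"])  # creatinine has missing values
# Kaplan-Meier estimation
sf = sm.SurvfuncRight(data["futime"], data["death"])
sf.plot()
# Cox Proportional Hazards model
mod = PHReg.from_formula("futime ~ age + female + creatinine",
                         data=data, status=data["death"])
rslt = mod.fit()
This analysis framework is applicable to: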
- Medical research: Patient survival analysis
- Clinical trials: Treatment efficacy evaluation
- Engineering: Time-to-failure analysis
- Social sciences: Event history analysis
- Business: Customer churn prediction
Key concepts covered:
- Survival function: $S(t) = P(T > t)$
- Hazard function: $\lambda(t) = \frac{f(t)}{S(t)}$, where $f(t)$ is the density of $T$
- Censoring mechanisms (right, left, interval)
- Hazard ratios and their interpretation
- Partial likelihood estimation
- Proportional hazards assumption
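The relationship between the survival and hazard functions can be checked numerically. The sketch below uses a Weibull distribution as an assumed example (the shape and scale values are arbitrary) and verifies that f(t) / S(t) agrees with the equivalent form -d/dt log S(t); all function names are hypothetical helpers.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = P(T > t) for a Weibull distribution."""
    return math.exp(-((t / scale) ** shape))


def weibull_density(t, shape, scale):
    """f(t) = -dS/dt for a Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_survival(t, shape, scale)


def hazard(t, shape, scale):
    """lambda(t) = f(t) / S(t)."""
    return weibull_density(t, shape, scale) / weibull_survival(t, shape, scale)


def hazard_from_log_survival(t, shape, scale, h=1e-6):
    """Central-difference approximation of -d/dt log S(t)."""
    def log_s(u):
        return math.log(weibull_survival(u, shape, scale))
    return -(log_s(t + h) - log_s(t - h)) / (2 * h)


shape, scale = 1.5, 10.0  # arbitrary illustrative parameters
grid = (1.0, 5.0, 20.0)
hazards = [hazard(t, shape, scale) for t in grid]
approx = [hazard_from_log_survival(t, shape, scale) for t in grid]
for t, lam, num in zip(grid, hazards, approx):
    print(f"t={t:5.1f}  f/S={lam:.6f}  -d/dt log S={num:.6f}")
```

With shape greater than 1 the hazard increases over time; setting shape to 1 gives the exponential distribution, whose hazard is constant at 1 / scale.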
Limitations and considerations:
- Currently handles only right-censoring
- Requires careful checking of model assumptions
- Sample size considerations for quantile estimation
- Interpretation of borderline statistical significance
Potential enhancements include:
- Implementation of left and interval censoring
- Competing risks analysis
- Frailty models for clustered data
- Machine learning approaches (random survival forests, etc.)
- Bayesian survival analysis
See the notebook for complete references to:
- Therneau & Grambsch (2000) on Cox models
- Kaplan & Meier (1958) original paper
- Cox (1972) seminal work
- StatsModels official documentation
This notebook serves as an educational resource. For questions, suggestions, or improvements, please open an issue or discussion.
This project is licensed under the MIT License - see the LICENSE file for details.
This notebook is designed for educational and research purposes. Real-world applications may require additional considerations and validation.