A comprehensive Python survival analysis workflow using statsmodels. Covers Kaplan-Meier estimation, log-rank tests, Cox proportional hazards regression, and alternative approaches like Nelson-Aalen and Accelerated Failure Time models with practical code examples.

esosetrov/survival_and_duration_analysis

survival_and_duration_analysis

Project Overview

Survival analysis (also called time-to-event analysis) models the time until an event of interest occurs. This repository provides a complete workflow for:

  • Nonparametric survival estimation using Kaplan-Meier and Nelson-Aalen methods
  • Comparison of survival curves with various weighted log-rank tests
  • Semiparametric regression via Cox Proportional Hazards models
  • Alternative approaches including Accelerated Failure Time models
  • Model diagnostics and assumption checking

Repository Structure

The main analysis is contained in the Jupyter notebook survival_and_duration_analysis.ipynb, which includes:

  1. Environment Setup and Imports - All required libraries and configurations
  2. Introduction to Survival Analysis - Key concepts and mathematical foundations
  3. Nonparametric Methods - Kaplan-Meier estimation and visualization
  4. Survival Curve Comparison - Multiple hypothesis testing approaches
  5. Cox Proportional Hazards Regression - Multivariate modeling with interpretation
  6. Model Diagnostics - Checking proportional hazards assumption
  7. Alternative Approaches - Nelson-Aalen, AFT models, and time-varying covariates
  8. References - Key academic and documentation sources

Key Features

1. Kaplan-Meier Estimation

  • Survival function estimation with right-censored data
  • Quantile estimation with confidence intervals
  • Simultaneous confidence bands
  • Comparative visualization of multiple groups

2. Hypothesis Testing

  • Log-rank test (default)
  • Fleming-Harrington test with parameter tuning
  • Gehan-Breslow test (gives more weight to early events)
  • Tarone-Ware test (intermediate weighting)
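The four tests above map onto the `weight_type` argument of `survdiff` in statsmodels. The sketch below runs them on simulated two-group data; the data, the group effect, and the choice `fh_p=1` are assumptions made for illustration only.

```python
import numpy as np
from statsmodels.duration.survfunc import survdiff

# Simulated data: two hypothetical groups, group 1 tends to survive longer.
rng = np.random.default_rng(1)
n = 400
group = rng.integers(0, 2, size=n)
scale = np.where(group == 1, 1300.0, 1000.0)
event_time = rng.exponential(scale)
censor_time = rng.exponential(2000.0, size=n)
time = np.minimum(event_time, censor_time)
status = (event_time <= censor_time).astype(int)

# Each call returns (chi-square statistic, p-value).
results = {
    "log-rank": survdiff(time, status, group),
    "Fleming-Harrington (p=1)": survdiff(time, status, group,
                                         weight_type="fh", fh_p=1),
    "Gehan-Breslow": survdiff(time, status, group, weight_type="gb"),
    "Tarone-Ware": survdiff(time, status, group, weight_type="tw"),
}
for name, (chisq, pvalue) in results.items():
    print(f"{name}: chi2={chisq:.3f}, p={pvalue:.4f}")
```

Because the weightings emphasize different parts of the follow-up period, the p-values can disagree when the curves differ mainly early or mainly late.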

3. Cox Proportional Hazards Model

  • Multivariate regression with hazard ratio interpretation
  • Formula interface for model specification
  • Complete coefficient diagnostics with confidence intervals
  • Assessment of proportional hazards assumption
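As a rough illustration of the formula interface and hazard ratio interpretation, the sketch below fits `PHReg` to simulated data. The covariate effects (log hazard ratios of 0.05 per year of age and -0.5 for females) are assumptions chosen for the simulation, not results from the notebook.

```python
import numpy as np
import pandas as pd
from statsmodels.duration.hazard_regression import PHReg

# Simulated cohort (assumption: not the flchain data).
rng = np.random.default_rng(2)
n = 1000
df = pd.DataFrame({
    "age": rng.normal(60, 10, size=n),
    "female": rng.integers(0, 2, size=n),
})
# Assumed true log hazard ratios: +0.05 per year of age, -0.5 for females.
log_hr = (0.05 * (df["age"] - 60) - 0.5 * df["female"]).to_numpy()
event_time = rng.exponential(1000.0 / np.exp(log_hr))
censor_time = rng.exponential(2000.0, size=n)
df["futime"] = np.minimum(event_time, censor_time)
df["death"] = (event_time <= censor_time).astype(int)

mod = PHReg.from_formula("futime ~ age + female", data=df, status=df["death"])
rslt = mod.fit()

# Exponentiate coefficients and their 95% CIs to get hazard ratios.
hazard_ratios = np.exp(rslt.params)   # order: age, female
hr_ci = np.exp(rslt.conf_int())
print(hazard_ratios)
print(hr_ci)
```

A hazard ratio above 1 indicates a higher event rate per unit increase in the covariate, and a value below 1 indicates a lower one, holding the other covariates fixed.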

4. Complementary Methods

  • Nelson-Aalen cumulative hazard estimation
  • Introduction to Accelerated Failure Time (AFT) models
  • Discussion of time-varying covariates approaches
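The Nelson-Aalen estimator sums, over distinct event times, the number of events divided by the number still at risk. The function below is a hand-written sketch of that formula in plain Python; `nelson_aalen` is a hypothetical helper, not a statsmodels function, and the toy data are made up.

```python
from collections import Counter


def nelson_aalen(times, events):
    """Return [(t, H(t))] at each distinct event time.

    H(t) is the running sum of d_i / n_i, where d_i is the number of events
    at time t_i and n_i is the number of subjects at risk just before t_i.
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # events and censorings both leave the risk set
    at_risk = len(times)
    cum_hazard = 0.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            cum_hazard += d / at_risk
            curve.append((t, cum_hazard))
        at_risk -= leaving[t]
    return curve


# Toy data: 1 = event observed, 0 = right-censored.
times = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]
for t, h in nelson_aalen(times, events):
    print(t, round(h, 4))
```

Subjects censored at an event time are counted as at risk at that time, which is the usual convention.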

Sample Results

The analysis demonstrates:

  • Survival quantiles: 25th percentile at 3995 days (CI: 3776-4166 days) for females
  • Group comparisons: Marginal differences between sexes (p ≈ 0.05 across tests)
  • Cox model findings: Age and serum lambda free light chain levels are associated with significantly higher hazard, while female sex is protective

Technical Implementation

Dependencies

statsmodels >= 0.14.0
matplotlib >= 3.7.0
numpy >= 1.24.0
pandas >= 2.0.0

Data Source

Uses the flchain dataset from the R survival package, available through statsmodels.datasets.get_rdataset().

Code Examples

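import statsmodels.api as sm
from statsmodels.duration.hazard_regression import PHReg

# Load flchain (assumed preprocessing; the notebook's steps may differ)
data = sm.datasets.get_rdataset("flchain", "survival").data
data = data.drop(columns="chapter").dropna()
data["female"] = (data["sex"] == "F").astype(int)  # 0/1 indicator used below

# Kaplan-Meier estimation
sf = sm.SurvfuncRight(data["futime"], data["death"])
sf.plot()

# Cox Proportional Hazards model
mod = PHReg.from_formula("futime ~ age + female + creatinine",
                         data=data, status=data["death"])
rslt = mod.fit()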

Applications

This analysis framework is applicable to:

  • Medical research: Patient survival analysis
  • Clinical trials: Treatment efficacy evaluation
  • Engineering: Time-to-failure analysis
  • Social sciences: Event history analysis
  • Business: Customer churn prediction

Key Concepts Covered

  • Survival function $S(t) = P(T > t)$
  • Hazard function $\lambda(t) = \frac{f(t)}{S(t)}$
  • Censoring mechanisms (right, left, interval)
  • Hazard ratios and their interpretation
  • Partial likelihood estimation
  • Proportional hazards assumption
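
The concepts listed above are linked by a few standard identities, collected here for reference. The symbols $f$, $H$, $\lambda_0$, $\beta$, $x_i$, $\delta_i$ (event indicator), and $R(t_i)$ (risk set at time $t_i$) follow common textbook notation and are assumed here rather than taken from the notebook. The partial likelihood is written for the case of no tied event times.

$$
\lambda(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\log S(t), \qquad
H(t) = \int_0^t \lambda(u)\,du, \qquad
S(t) = \exp\{-H(t)\}
$$

$$
\lambda(t \mid x) = \lambda_0(t)\exp(x^\top \beta), \qquad
L(\beta) = \prod_{i:\,\delta_i = 1} \frac{\exp(x_i^\top \beta)}{\sum_{j \in R(t_i)} \exp(x_j^\top \beta)}
$$

Under the proportional hazards assumption, the ratio of hazards for two covariate values does not depend on $t$, so $\exp(\beta_k)$ is the hazard ratio for a one-unit increase in the $k$-th covariate.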

Limitations and Considerations

  • Currently handles only right-censoring
  • Requires careful checking of model assumptions
  • Sample size considerations for quantile estimation
  • Interpretation of borderline statistical significance

Further Extensions

Potential enhancements include:

  • Implementation of left and interval censoring
  • Competing risks analysis
  • Frailty models for clustered data
  • Machine learning approaches (random survival forests, etc.)
  • Bayesian survival analysis

References

See the notebook for complete references to:

  • Therneau & Grambsch (2000) on Cox models
  • Kaplan & Meier (1958) original paper
  • Cox (1972) seminal work
  • StatsModels official documentation

Contributing

This notebook serves as an educational resource. For questions, suggestions, or improvements, please open an issue or discussion.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Note

This notebook is designed for educational and research purposes. Real-world applications may require additional considerations and validation.
