Survival 2
Sayan Ghosh
Abstract
This document provides a comprehensive treatment of survival analysis with
emphasis on the derivation and interpretation of the log-rank test. We cover:
• Fundamentals of survival analysis and censoring
• Kaplan–Meier estimator and its properties
• Derivation of the log-rank test statistic, expected values under the null hy-
pothesis, variance, and asymptotic chi-square distribution
• Construction and role of contingency tables at each event time
• A real-life example: comparing court case disclosure times between two peri-
ods (2015–2020 vs. 2020–2025), including data structure, handling of censor-
ing, Kaplan–Meier curves, log-rank computation, and interpretation
• Guidelines for implementing analyses and plotting in software
All mathematical expressions, tables, and explanatory text are provided in LaTeX
format for direct compilation or inclusion in reports.
Contents
1 Introduction
1.1 Types of Censoring
5 Real-Life Example: Court Case Disclosure Times
5.1 Context and Data Structure
5.2 Handling Censoring
5.3 Kaplan–Meier Estimation
5.4 Applying the Log-Rank Test
5.4.1 Step-by-Step Computation
5.4.2 Illustrative Contingency Table at a Given Time
5.5 Interpretation of Results
5.6 Software Implementation Notes
9 Practical Considerations
10 References
1 Introduction
Survival analysis concerns the study of time until the occurrence of an event of inter-
est (often called “failure time”, “event time”, or “time to event”) in the presence of
potentially censored observations. It arises in many fields: medicine (time to death or
relapse), engineering (time to failure of components), social sciences (duration until an
event such as job change), and in our motivating example, legal studies (time until court
case disclosure).
Key challenges:
• Right-censoring: For some subjects, the event has not occurred by the end of
observation or they are lost to follow-up. We know only that their survival time
exceeds their last observed time.
• Left-censoring: The event of interest occurred before the observation period
(or enrolment) began, so we know only that it happened before a certain time.
This is rarely encountered in practice. It should not be confused with truncation,
which is deliberate and results from the study design. Legal example: A case was
disclosed before official records began, so the exact date of disclosure is unknown.
• Progressive censoring: In progressive censoring, subjects are removed from the
study at various times according to a pre-specified rule, not necessarily because the
event of interest occurred. This allows for more flexibility in study design and often
reduces costs.
Legal example: In a long-running case tracking system, certain cases are periodi-
cally removed from observation due to lack of resources or re-prioritization. The
outcome of these cases remains unknown after removal, even though they were
under observation earlier.
In the context of survival analysis, the survival function is often estimated using
methods like the Kaplan-Meier estimator, which accounts for censored data.
Density Function
The density function f (t) describes the instantaneous rate at which events occur
at time t. It is the derivative of the cumulative distribution function F (t):
\[
f(t) = \frac{d}{dt} F(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t}.
\]
In survival analysis, f (t) gives the probability density of the event happening exactly
at time t.
The relationship between the density function and the survival function S(t) is:
\[
f(t) = -\frac{d}{dt} S(t),
\]
because S(t) = 1 − F (t).
From Density Function to Survival Function
Let T be the random variable denoting time to event. Its probability density func-
tion (pdf) is denoted by f (t), and the cumulative distribution function (cdf) is
\[
F(t) = P(T \le t) = \int_0^t f(u)\,du.
\]
The survival function S(t) is the probability that the event has not occurred by
time t, i.e.,
\[
S(t) = P(T > t) = 1 - F(t) = 1 - \int_0^t f(u)\,du.
\]
In other words, the survival function is the complement of the cumulative distribu-
tion function, representing the probability of surviving beyond time t.
• The hazard function is
\[
h(t) = \lim_{\Delta t \to 0^+} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}.
\]
The hazard function can be interpreted as the conditional failure rate at time t.
A high hazard at a particular time indicates that the event is very likely to occur
immediately after that time, assuming it has not occurred yet.
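As a numerical illustration of the relations f(t) = −dS/dt and h(t) = f(t)/S(t), the following minimal sketch uses a Weibull model. The Weibull choice and the parameter values are assumptions made only for this example; they do not come from the court-case data.

```python
import math

def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t/scale)^shape) for a Weibull time-to-event model."""
    return math.exp(-((t / scale) ** shape))

def weibull_density(t, shape, scale):
    """f(t) = -dS/dt, written out in closed form."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_survival(t, shape, scale)

def hazard(t, shape, scale):
    """h(t) = f(t) / S(t), the conditional failure rate at time t."""
    return weibull_density(t, shape, scale) / weibull_survival(t, shape, scale)

# Hypothetical parameters, chosen only for illustration.
shape, scale = 1.5, 10.0
t, dt = 4.0, 1e-6

# Central-difference approximation of -dS/dt at time t.
numeric_density = -(
    weibull_survival(t + dt, shape, scale) - weibull_survival(t - dt, shape, scale)
) / (2 * dt)

# Known closed-form Weibull hazard, used to check h = f / S.
closed_form_hazard = (shape / scale) * (t / scale) ** (shape - 1)

print("f(t) closed form:", weibull_density(t, shape, scale))
print("f(t) numeric    :", numeric_density)
print("h(t) = f/S      :", hazard(t, shape, scale))
print("h(t) closed form:", closed_form_hazard)
```

With shape > 1 the hazard increases with t, which matches the reading of h(t) as a conditional failure rate that can change over time.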
Cumulative Hazard Function
The cumulative hazard function H(t) represents the total accumulated risk of
experiencing the event by time t. It is defined as the integral of the hazard function
over time:
\[
H(t) = \int_0^t h(u)\,du.
\]
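A short derivation links the cumulative hazard back to the survival function. It uses only the definition of h(t), the relation f(t) = −dS(t)/dt, and the assumption S(0) = 1 (no events at time zero):

```latex
\[
h(t) = \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\log S(t)
\quad\Longrightarrow\quad
H(t) = \int_0^t h(u)\,du = -\log S(t) + \log S(0) = -\log S(t),
\]
\[
S(t) = \exp\{-H(t)\}.
\]
```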
Properties of the Kaplan–Meier estimate $\hat{S}(t)$:
• It is a step function with downward jumps at each observed event time.
• Censored times contribute to risk set counts up to the time of censoring but do not
produce jumps.
• One can compute pointwise confidence intervals, e.g. using Greenwood’s formula
for the variance of $\hat{S}(t)$.
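For reference, Greenwood's formula can be written as follows. Here $n_j$ is the risk set size and $d_j$ the number of events at event time $t_{(j)}$. The plain (untransformed) 95% interval is only one common choice; software often uses a log or log-log transformation instead.

```latex
\[
\widehat{\mathrm{Var}}\bigl[\hat{S}(t)\bigr] \approx \hat{S}(t)^2 \sum_{t_{(j)} \le t} \frac{d_j}{n_j\,(n_j - d_j)},
\qquad
\hat{S}(t) \pm 1.96\,\sqrt{\widehat{\mathrm{Var}}\bigl[\hat{S}(t)\bigr]} .
\]
```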
3 Log-Rank Test: Theory and Derivation
3.1 Objective
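Compare survival experiences of two groups (Group 1 vs. Group 2) under the null
hypothesis $H_0$ that the two groups have equal hazard rates at all times. Let
$t_{(1)} < \cdots < t_{(J)}$ be the distinct event times, and at each $t_{(j)}$ let
\[
N_{1,j} = \#\{\text{subjects in Group 1 still at risk just before } t_{(j)}\}, \quad
N_{2,j} = \#\{\text{subjects in Group 2 still at risk just before } t_{(j)}\},
\]
and let $O_{1,j}$ and $O_{2,j}$ be the numbers of events observed at $t_{(j)}$ in Group 1 and
Group 2. Also let
\[
N_j = N_{1,j} + N_{2,j}, \quad O_j = O_{1,j} + O_{2,j}.
\]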
Right-censoring affects the “at risk” counts but does not directly produce events; censored
subjects are removed from the risk set after their censoring time.
Similarly,
\[
E_{2,j} = O_j \, \frac{N_{2,j}}{N_j}.
\]
(When Nj is large, the finite-population correction (Nj − 1) matters little; often approx-
imate forms drop that term, but the exact form ensures correctness.)
3.5 Aggregating over All Event Times
Define the total observed-minus-expected for Group 1:
\[
U_1 = \sum_{j=1}^{J} \left( O_{1,j} - E_{1,j} \right).
\]
The variance of $U_1$ is taken as the sum of the per-time variances, $\sum_j V_{1,j}$,
neglecting covariance terms since event times are distinct and counts at different times
are (approximately) uncorrelated under the standard counting process framework.
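The p-value is $p = 2\,(1 - \Phi(|Z|)) = 1 - F_{\chi^2_1}(Z^2)$ with $Z = U_1 / \sqrt{\sum_j V_{1,j}}$,
where $\Phi$ is the standard normal CDF and $F_{\chi^2_1}$ is the chi-square CDF with 1 df. If p is
below a chosen significance level (e.g. 0.05), reject $H_0$: conclude the survival curves differ.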
4 Kaplan–Meier Curves and Plotting
To visualize survival (or disclosure-time) differences between two groups, plot the
Kaplan–Meier curves:
\[
\hat{S}_g(t) = \prod_{t_{(j)} \le t} \left( 1 - \frac{d_{g,j}}{n_{g,j}} \right), \quad g = 1, 2,
\]
where $n_{g,j}$ and $d_{g,j}$ are the risk set size and number of events in group g at time $t_{(j)}$. On
a single plot:
• The horizontal axis is time t (e.g., time until disclosure, in days/months).
Interpretation: If one curve lies consistently below another (i.e., drops faster), it
indicates shorter times to event (e.g., faster disclosures). However, formal inference uses
the log-rank test.
• Group 1: Cases initiated during 2015–2020.
• time: duration from initiation until disclosure or censoring (in appropriate units,
e.g. days or months)
• The case remains undisclosed by the study end date (e.g., last follow-up at December
2025).
• Data loss or subject dropout—but in legal data, typically censoring means still
pending at cutoff.
Censored cases contribute to risk sets up to their censoring time and are removed
thereafter.
Software (e.g., R’s survival::survfit(), Python lifelines, or other) can produce esti-
mates and plots. The plot helps visualize any separation between curves.
5.4 Applying the Log-Rank Test
5.4.1 Step-by-Step Computation
1. Pool all observed event times across both groups; order distinct times t(1) < · · · <
t(J) .
2. For each event time t(j) :
• Determine N1,j : number of Group 1 cases with Ti ≥ t(j) .
• Determine N2,j : number of Group 2 cases with Ti ≥ t(j) .
• Determine O1,j : number of Group 1 cases with event exactly at t(j) (δi = 1).
• Determine O2,j : number of Group 2 cases with event at t(j) .
• Compute totals: Nj = N1,j + N2,j , Oj = O1,j + O2,j .
• Compute expected event count in Group 1:
\[
E_{1,j} = O_j \, \frac{N_{1,j}}{N_j}.
\]
• Compute variance:
\[
V_{1,j} = \frac{N_{1,j} \, N_{2,j} \, O_j \, (N_j - O_j)}{N_j^2 \, (N_j - 1)}.
\]
• Record the contribution O1,j − E1,j and V1,j .
3. Sum contributions:
\[
U_1 = \sum_{j=1}^{J} \left( O_{1,j} - E_{1,j} \right), \qquad \sigma^2 = \sum_{j=1}^{J} V_{1,j}.
\]
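The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a replacement for a tested library routine: the function name `logrank_two_groups`, the group coding (1 and 2), and the six-case toy dataset are all assumptions made for this example.

```python
import math

def logrank_two_groups(times, events, groups):
    """Two-group log-rank test following the step-by-step recipe.

    times  : observed durations (event or censoring time)
    events : 1 if the event was observed, 0 if censored
    groups : 1 or 2 (hypothetical coding for this sketch)
    Returns (U1, variance, chi-square statistic, p-value).
    """
    data = list(zip(times, events, groups))
    # Step 1: distinct event times, pooled over both groups.
    event_times = sorted({t for t, e, _ in data if e == 1})
    u1 = 0.0
    variance = 0.0
    for tj in event_times:
        # Step 2: risk set sizes (T_i >= t_(j)) and event counts at t_(j).
        n1 = sum(1 for t, _, g in data if g == 1 and t >= tj)
        n2 = sum(1 for t, _, g in data if g == 2 and t >= tj)
        o1 = sum(1 for t, e, g in data if g == 1 and e == 1 and t == tj)
        o2 = sum(1 for t, e, g in data if g == 2 and e == 1 and t == tj)
        n, o = n1 + n2, o1 + o2
        e1 = o * n1 / n                      # expected events in Group 1
        u1 += o1 - e1
        if n > 1:                            # variance term is undefined when N_j = 1
            variance += n1 * n2 * o * (n - o) / (n ** 2 * (n - 1))
    # Step 3: aggregate into the chi-square statistic (1 df).
    chi2 = u1 ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # upper tail of chi-square, 1 df
    return u1, variance, chi2, p_value

# Hypothetical toy data: three cases per group, one censored in Group 1.
toy_times = [2, 4, 6, 3, 5, 7]
toy_events = [1, 1, 0, 1, 1, 1]
toy_groups = [1, 1, 1, 2, 2, 2]
u1, variance, chi2, p_value = logrank_two_groups(toy_times, toy_events, toy_groups)
print(u1, variance, chi2, p_value)
```

The quadratic scan over the data at each event time keeps the code close to the written recipe; for large datasets one would sort once and update the risk sets incrementally.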
5.5 Interpretation of Results
After computing the overall statistic:
\[
\chi^2 = \frac{\left( \sum_j \left( O_{1,j} - E_{1,j} \right) \right)^2}{\sum_j V_{1,j}},
\]
Since p < 0.05, we reject H0 and conclude that disclosure times differ between 2015–2020
and 2020–2025. One might then inspect which group has shorter times: if Group 2’s
Kaplan–Meier curve drops earlier, it indicates faster disclosures in 2020–2025 compared
to 2015–2020. In reporting:
“The log-rank test comparing disclosure times yields χ²(1) = 5.12, p = 0.024,
indicating a statistically significant difference in disclosure-time distributions
between the two time periods. The Kaplan–Meier curves (Figure 1) show
that cases initiated in 2020–2025 tend to be disclosed earlier than those in
2015–2020.”
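The p-value quoted in the report sentence can be checked from the statistic alone. For 1 df, the chi-square upper tail equals the two-sided standard normal tail, so it can be computed with the complementary error function from the standard library. The helper name `chi2_1df_pvalue` is a hypothetical name used only in this sketch.

```python
import math

def chi2_1df_pvalue(statistic):
    """Upper-tail probability of a chi-square(1) variable.

    Uses P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for Z ~ N(0, 1).
    """
    return math.erfc(math.sqrt(statistic / 2.0))

p = chi2_1df_pvalue(5.12)
print(round(p, 3))
```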
• R (survival):
library(survival)
# Assume df has columns: time, status (1 = event, 0 = censored), group (factor with two levels)
fit <- survfit(Surv(time, status) ~ group, data = df)
plot(fit, col = c("blue", "red"), lty = 1:2, xlab = "Time", ylab = "Survival Probability")
legend("topright", legend = levels(df$group), col = c("blue", "red"), lty = 1:2)
# Log-rank test:
lr <- survdiff(Surv(time, status) ~ group, data = df)
print(lr)
# survdiff gives the chi-square statistic; print(lr) shows the p-value from the chi-square distribution
• Python (lifelines):
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Assume df is a pandas DataFrame with columns: time, status (1 = event, 0 = censored), group (1 or 2)
mask1 = (df['group'] == 1)
mask2 = (df['group'] == 2)
# Fit one Kaplan–Meier curve per group
kmf1 = KaplanMeierFitter()
kmf2 = KaplanMeierFitter()
kmf1.fit(df.loc[mask1, 'time'], df.loc[mask1, 'status'], label='Group1')
kmf2.fit(df.loc[mask2, 'time'], df.loc[mask2, 'status'], label='Group2')
# Plot both curves on the same axes
ax = kmf1.plot_survival_function()
kmf2.plot_survival_function(ax=ax)
plt.xlabel('Time')
plt.ylabel('Survival Probability')
plt.title('Kaplan–Meier Curves: Disclosure Times')
plt.show()
# Log-rank:
results = logrank_test(
    df.loc[mask1, 'time'], df.loc[mask2, 'time'],
    event_observed_A=df.loc[mask1, 'status'],
    event_observed_B=df.loc[mask2, 'status']
)
print(results.test_statistic, results.p_value)
• Ensure correct coding of group indicator and status. Check assumptions: non-
informative censoring, independence, etc.
• For small sample sizes or few events, large-sample approximations may be poor;
consider exact methods or permutation if feasible.
Derivation: The conditional probability of surviving beyond $t_{(j)}$ given survival just prior
is $1 - d_j / n_j$. Multiply across event times (product-limit).
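The product-limit construction can be sketched directly in Python. This is a minimal illustration under assumed inputs: the function name `kaplan_meier` and the five-observation toy sample are hypothetical, and a subject censored at an event time is counted as still at risk at that time.

```python
def kaplan_meier(times, events):
    """Return [(event time, S_hat just after that time), ...].

    times  : observed durations (event or censoring time)
    events : 1 if the event was observed, 0 if censored
    """
    survival = 1.0
    curve = []
    for tj in sorted({t for t, e in zip(times, events) if e == 1}):
        n_j = sum(1 for t in times if t >= tj)                          # at risk just before t_(j)
        d_j = sum(1 for t, e in zip(times, events) if t == tj and e == 1)  # events at t_(j)
        survival *= 1.0 - d_j / n_j                                     # conditional survival factor
        curve.append((tj, survival))
    return curve

# Hypothetical toy sample: events at 2, 3, 5; censoring at 3 and 8.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(toy_times, toy_events)
for tj, s in km_curve:
    print(tj, s)
```

Note that the censored observation at time 3 lowers the risk set for later times without producing a jump of its own.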
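Use of Hypergeometric Distribution in Log-Rank Test
At a fixed event time $t_{(j)}$, suppose $N_{1,j}$ and $N_{2,j}$ subjects are at risk in Groups 1 and 2
($N_j$ in total) and $O_j$ events occur.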
Under the null hypothesis H0 (equal hazard rates between groups), all at-risk indi-
viduals have the same probability of experiencing the event. Hence, the assignment
of the Oj events to the two groups is random and without replacement.
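This follows the hypergeometric distribution:
\[
P(O_{1,j} = k) = \frac{\binom{N_{1,j}}{k} \binom{N_{2,j}}{O_j - k}}{\binom{N_j}{O_j}},
\]
with mean $E_{1,j} = O_j N_{1,j} / N_j$ and variance
$V_{1,j} = N_{1,j} N_{2,j} O_j (N_j - O_j) / \bigl( N_j^2 (N_j - 1) \bigr)$.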
Summing over times yields aggregate mean zero, variance sum of per-time variances
(approximate independence across times).
Thus
\[
Z = \frac{U_1}{\sqrt{\sum_j V_{1,j}}} \approx N(0, 1), \qquad Z^2 \approx \chi^2_1.
\]
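The hypergeometric mean and variance used for each event time can be verified by brute-force enumeration. The counts below (6 at risk in Group 1, 4 in Group 2, 3 events) are hypothetical values chosen only to keep the check small.

```python
from math import comb

def hypergeom_pmf(k, n1, n2, o):
    """P(O1 = k): k of the o events fall in Group 1 (n1 at risk) rather than Group 2 (n2 at risk)."""
    return comb(n1, k) * comb(n2, o - k) / comb(n1 + n2, o)

# Hypothetical risk-set sizes and event count at one event time.
n1, n2, o = 6, 4, 3
n = n1 + n2

# Feasible values of O1: cannot exceed o or n1, and o - k cannot exceed n2.
support = range(max(0, o - n2), min(o, n1) + 1)
total_prob = sum(hypergeom_pmf(k, n1, n2, o) for k in support)
mean = sum(k * hypergeom_pmf(k, n1, n2, o) for k in support)
var = sum((k - mean) ** 2 * hypergeom_pmf(k, n1, n2, o) for k in support)

# Closed forms used in the log-rank test.
e1_formula = o * n1 / n
v1_formula = n1 * n2 * o * (n - o) / (n ** 2 * (n - 1))

print(mean, e1_formula)
print(var, v1_formula)
```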
6.4 Connection to Chi-Square Goodness-of-Fit
The log-rank test can be viewed as a generalized chi-square test comparing observed vs.
expected counts across the risk sets over time. At each event time, the contingency table
yields a 2×2 comparison; aggregating over times accumulates evidence of departure from
the null pattern of proportional event occurrence. The final statistic squares the
summed deviations (observed minus expected) and divides by the summed variance,
akin to a chi-square statistic, yielding an overall chi-square with 1 df for two groups.
Compute:
\[
E_{1,j} = O_j \, \frac{N_{1,j}}{N_j}, \qquad
V_{1,j} = \frac{N_{1,j} \, N_{2,j} \, O_j \, (N_j - O_j)}{N_j^2 \, (N_j - 1)}.
\]
Define
\[
d_{1,j} = O_{1,j}, \qquad d_{2,j} = O_{2,j}.
\]
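As a worked instance of these two formulas, consider a single hypothetical event time with 10 subjects at risk in each group and 2 events, both in Group 1. These counts are invented for illustration and are not taken from the court-case data.

```latex
\[
N_{1,j} = 10,\quad N_{2,j} = 10,\quad N_j = 20,\quad O_{1,j} = 2,\quad O_{2,j} = 0,\quad O_j = 2,
\]
\[
E_{1,j} = 2 \cdot \frac{10}{20} = 1, \qquad
V_{1,j} = \frac{10 \cdot 10 \cdot 2 \cdot (20 - 2)}{20^2 \cdot (20 - 1)} = \frac{3600}{7600} = \frac{9}{19} \approx 0.474,
\]
\[
O_{1,j} - E_{1,j} = 2 - 1 = 1 .
\]
```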
8.1 Hypothetical Dataset
• At t = 4.0:
• At t = 5.0: adjust risk sets removing those with event at 4.0; and so on.
Compute each time’s table, accumulate $U_1$ and $\sum_j V_{1,j}$. Finally compute Z.
9 Practical Considerations
• Ties: If multiple events occur at exactly the same time, different methods exist
(exact, Breslow, Efron) in the Cox model context. For the log-rank test, ties are
handled by treating $O_j$ as the total number of events at that time; the
hypergeometric model handles ties naturally.
• Sample size and number of events: The approximation to chi-square is better
with a larger number of observed events. With few events, consider exact or
permutation tests if feasible.
• Multiple groups: For more than two groups, one can extend the log-rank test to a
multi-sample test, leading to a $\chi^2_{k-1}$ distribution for k groups. Here we focus on
the two-group case (1 df).
10 References
[1] Klein, J. P., & Moeschberger, M. L. (2003). Survival Analysis: Techniques for Cen-
sored and Truncated Data. Springer.
[2] Collett, D. (2003). Modelling Survival Data in Medical Research. CRC Press.
[3] Hosmer, D. W., Lemeshow, S., & May, S. (2008). Applied Survival Analysis: Regres-
sion Modeling of Time-to-Event Data. Wiley.
[4] Therneau, T. M., & Grambsch, P. M. (2000). Modeling Survival Data: Extending the
Cox Model. Springer.