Topic 3: Survival analysis and expectation-maximisation Flashcards
Describe Principles of survival analysis
Survival analysis uses concepts such as censored data, survival functions, hazard rates and hazard ratios, and time-to-event data, together with methods such as the Kaplan–Meier estimator, the log-rank test, and the Cox model (which relies on the proportional hazards assumption).
Survival function: S_i, the probability of surviving past age i − 1.
Hazard function: h(t) = f(t)/S(t), the instantaneous risk of dying at time t, given survival up to t.
Time-to-event data: each record has an event (e.g. death, failure) and the time elapsed until that event, measured in days, months, or years from a defined origin such as the start of treatment.
Censoring: Censored data is a big part of survival analysis. An observation is censored when the study does not follow a patient all the way to the event, e.g. the patient was still alive when the experiment ended, so we know they survived at least that long but not what happened afterwards.
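A minimal sketch of how censored observations arise, assuming every subject has a true event time and a fixed end of follow-up. The function name `observe` and the numbers are illustrative only. Each subject is recorded as the observed time plus an indicator (1 = event observed, 0 = censored).

```python
def observe(event_times, censor_times):
    """Turn true event times and follow-up limits into (time, indicator) pairs."""
    data = []
    for event, censor in zip(event_times, censor_times):
        t = min(event, censor)          # we only see whichever comes first
        d = 1 if event <= censor else 0  # 1 = event observed, 0 = censored
        data.append((t, d))
    return data

# Hypothetical study that stops following every patient after 10 months
true_event = [5.0, 12.0, 3.5, 20.0]
follow_up = [10.0, 10.0, 10.0, 10.0]
observed = observe(true_event, follow_up)
print(observed)
```

The two patients whose true event times exceed 10 months are recorded at time 10 with indicator 0: we know only that they survived at least that long.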
What is Hazard rate?
The hazard rate for the continuous case can be written as:
$$h(t) = \frac{f(t)}{S(t)}$$
https://docs.google.com/document/d/1k75C-8K8lC7icJQMzUlberUe9UZ_Oy2s-zOtgipR5WI/edit?tab=t.0
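Here f(t) is the density function and S(t) is the survival function, S(t) = 1 − F(t), the complement of the cumulative distribution function.
The hazard rate is the instantaneous rate of death at time t, conditional on survival up to t: for a small interval dt, h(t) dt is approximately the probability of dying in [t, t + dt) given survival to t. It is a rate, not a probability, so it can exceed 1.
For the exponential distribution the hazard rate is constant, h(t) = λ. This is the memoryless property: the probability of surviving a further time t is the same regardless of how long an individual has already survived.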
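As a small numerical check of the two statements above, the sketch below evaluates f(t)/S(t) for an exponential distribution and compares a conditional with an unconditional survival probability. The rate value `lam = 0.5` and the function names are assumptions made for illustration.

```python
import math

lam = 0.5  # hypothetical rate parameter of the exponential distribution

def exp_pdf(t):
    """Density f(t) of the exponential distribution."""
    return lam * math.exp(-lam * t)

def exp_surv(t):
    """Survival function S(t) = P(T > t)."""
    return math.exp(-lam * t)

def hazard(pdf, surv, t):
    """Hazard rate h(t) = f(t) / S(t)."""
    return pdf(t) / surv(t)

# The hazard is the same at every time point (equal to lam)
hazards = [hazard(exp_pdf, exp_surv, t) for t in (0.5, 2.0, 10.0)]

# Memoryless property: P(T > s + t | T > s) equals P(T > t)
s, t = 3.0, 2.0
conditional = exp_surv(s + t) / exp_surv(s)
unconditional = exp_surv(t)
print(hazards, conditional, unconditional)
```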
What is Hazard ratio?
The hazard ratio (HR) is the ratio of the hazard rates corresponding to the conditions characterised by e.g. two treatments.
For example, in a clinical study the treated population dies at twice the rate per time unit compared to the control population, so:
$$HR = \frac{h_{\text{treated}}(t)}{h_{\text{control}}(t)} = 2$$
https://docs.google.com/document/d/17N7zAmP6-SXwm4mgYxq3kDAHZ51qZhioTcmETlNIepk/edit?tab=t.0
This indicates that there’s a higher hazard of death from the treatment.
- HR > 1: The treated group has a higher risk of the event occurring than the control group.
- HR < 1: The treated group has a lower risk of the event occurring than the control group.
- HR = 1: There is no difference in risk between the two groups.
Under the proportional hazards assumption the hazard ratio is constant over time, so a single hazard ratio is estimated from all the data rather than at one specific time.
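A rough sketch of a hazard ratio computed from data, under the extra assumption (not made in the notes) that each group has a constant hazard. In that case the estimated rate is simply the number of observed events divided by the total time at risk. All names and numbers are made up for illustration.

```python
def event_rate(times, events):
    """Events per unit of time at risk (constant-hazard estimate)."""
    return sum(events) / sum(times)

# Hypothetical follow-up times and event indicators (1 = death, 0 = censored)
treated_times, treated_events = [2.0, 3.0, 1.0, 4.0], [1, 1, 0, 1]
control_times, control_events = [5.0, 6.0, 4.0, 5.0], [1, 1, 1, 0]

hr = event_rate(treated_times, treated_events) / event_rate(control_times, control_events)
print(hr)
```

Here the treated group has 3 deaths in 10 time units and the control group 3 deaths in 20, so the treated group dies at twice the rate. Without the constant-hazard assumption, the hazard ratio is usually estimated with a Cox model instead.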
Describe Kaplan–Meier estimate
Kaplan–Meier curves provide a graphical comparison that takes proper account of censoring. Observations z_i for censored data problems are of the form:
$$z_i = (t_i, d_i)$$
https://docs.google.com/document/d/1WJHJXjKo7WlcQTT_a_9wQUwvjFG6gOB7LRBJmUnklgg/edit?tab=t.0
where t_i equals the observed survival time while d_i indicates whether or not there was censoring:
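$$d_i = \begin{cases} 1 & \text{if death was observed} \\ 0 & \text{if death was not observed (censored)} \end{cases}$$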
The Kaplan–Meier estimate for survival probability:
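$$\hat{S}(z_{(j)}) = \prod_{k \le j} \left( \frac{n - k}{n - k + 1} \right)^{d_{(k)}}$$
where the n observations are ordered by observed time, $z_{(1)}, \dots, z_{(n)}$, and $d_{(k)}$ is the indicator of the k-th ordered observation (assuming no tied times).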
This means that $\hat{S}$ jumps down at each observed death (indicator 1) and stays constant at each censored observation (indicator 0).
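A minimal sketch of the Kaplan–Meier estimate in plain Python, assuming the data are (t_i, d_i) pairs with d_i = 1 for an observed death and 0 for a censored observation. The function name `kaplan_meier` and the toy data are illustrative. At each death time the current estimate is multiplied by 1 − deaths / (number at risk).

```python
def kaplan_meier(data):
    """Return [(time, S_hat)] at each distinct death time.

    data: list of (t_i, d_i) pairs, d_i = 1 for death, 0 for censored.
    """
    data = sorted(data)
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose observed time equals t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1 - deaths / n_at_risk  # the curve drops only at deaths
            curve.append((t, surv))
        n_at_risk -= leaving  # deaths and censored subjects both leave the risk set
    return curve

# Hypothetical sample: deaths at times 1, 3, 4 and censoring at times 2, 5
sample = [(1, 1), (2, 0), (3, 1), (4, 1), (5, 0)]
curve = kaplan_meier(sample)
for time, s_hat in curve:
    print(time, round(s_hat, 4))
```

The censored subject at time 2 causes no drop in the curve, but it shrinks the risk set, which makes the later drops larger.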
Describe the log-rank test
Key concepts:
- It tests the null hypothesis that there is no difference between the survival distributions of two groups
- Used for censored data
- It’s non-parametric, so it doesn’t assume any specific survival distribution, making it broadly applicable
null hypothesis: both groups have the same distribution
alternative hypothesis: both groups have different distributions
Assumptions of the log-rank test:
- Independence: The occurrence of an event for one individual should not influence the occurrence of an event for another individual
- Non-Informative Censoring: Censoring should not be related to the event being studied or to the group assignment. At the time of censoring, censored and non-censored patients should have similar risks of experiencing the event.
- Proportional hazards: The ratio of the hazard rates (the risk of an event occurring) of the compared groups should remain constant over time. The hazards themselves may change, but one group's hazard should be the same multiple of the other's at all times; the test loses power when this fails, for example when the survival curves cross.
The log-rank statistic, $Z$, is defined to be:
EQUATION
https://docs.google.com/document/d/1595ChGwDcKrCR62Uxg-362B0IZkW50f5y5cseNApgCA/edit?tab=t.0
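A sketch of a log-rank statistic in plain Python. It uses the commonly stated form Z = Σ(O − E) / sqrt(ΣV), where at each death time O is the number of deaths observed in the first group, E is the number expected under the null hypothesis, and V is the hypergeometric variance. The notation may differ from the equation in the notes, and the function name and data are hypothetical.

```python
import math

def log_rank_z(group_a, group_b):
    """Log-rank Z for two groups given as lists of (t_i, d_i) pairs."""
    pooled = group_a + group_b
    death_times = sorted({t for t, d in pooled if d == 1})
    o_minus_e = 0.0
    variance = 0.0
    for t in death_times:
        n_a = sum(1 for s, _ in group_a if s >= t)  # at risk in group A
        n_b = sum(1 for s, _ in group_b if s >= t)  # at risk in group B
        d_a = sum(1 for s, d in group_a if s == t and d == 1)
        d_b = sum(1 for s, d in group_b if s == t and d == 1)
        n, d = n_a + n_b, d_a + d_b
        # Under the null, deaths split between groups in proportion to n_a / n
        o_minus_e += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return o_minus_e / math.sqrt(variance)

# Hypothetical data: group A tends to die earlier than group B
group_a = [(1, 1), (2, 1), (3, 1), (4, 0)]
group_b = [(3, 1), (5, 1), (6, 0), (7, 1)]
z = log_rank_z(group_a, group_b)
print(z)
```

Under the null hypothesis Z is approximately standard normal, so a large |Z| is evidence against equal survival distributions. A positive Z here means group A had more deaths than expected.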
Describe Cox’s proportional hazard model
The Kaplan–Meier estimator is a one-sample device, dealing with data coming from a single distribution. The log-rank test makes two-sample comparisons. Proportional hazards ups the ante to allow for a full regression analysis of censored data.
The individual data points can be described as:
$$z_i = (c_i, t_i, d_i)$$
https://docs.google.com/document/d/17iDsWAJRvS8Q_xqkdfFpna3dW1XbrG4vYQsWtqFxZLE/edit?tab=t.0
t_i = observed survival time
d_i = censor indicator
c_i = 1 × p vector of covariates whose effect on survival we wish to assess
$$h_i(t) = h_0(t)\, e^{c_i \beta}$$
https://docs.google.com/document/d/17iDsWAJRvS8Q_xqkdfFpna3dW1XbrG4vYQsWtqFxZLE/edit?tab=t.0
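This says that the hazard rate of individual i at time t is the baseline hazard rate multiplied by the exponential of c_i β, the product of the row vector of covariates c_i and the parameter vector $\beta$.
To estimate the vector $\beta$ you construct the partial likelihood, which does not involve the baseline hazard. The estimate of $\beta$ is the maximiser of the partial likelihood (equivalently, the minimiser of the negative log partial likelihood), and the parameter covariance matrix is estimated from the inverse of the second-derivative (Hessian) matrix of the negative log partial likelihood at that estimate.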
Assumption:
Hazards should be proportional: the hazard ratio between any two individuals should be constant over time.
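A sketch of fitting a Cox model by maximising the partial likelihood, restricted to one covariate and no tied event times to keep it short. It uses Newton–Raphson steps on the log partial likelihood. The function name `cox_fit`, the fixed number of iterations, and the data are assumptions for illustration, not a reference implementation.

```python
import math

def cox_fit(times, events, covariate, n_iter=25):
    """Return (beta_hat, standard_error) for a one-covariate Cox model."""
    beta = 0.0
    info = 0.0
    for _ in range(n_iter):
        grad, info = 0.0, 0.0
        for i, (t_i, d_i) in enumerate(zip(times, events)):
            if d_i == 0:
                continue  # censored subjects only appear in risk sets
            # Risk set: everyone still under observation at time t_i
            risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
            weights = [math.exp(beta * covariate[j]) for j in risk]
            total = sum(weights)
            mean = sum(w * covariate[j] for w, j in zip(weights, risk)) / total
            mean_sq = sum(w * covariate[j] ** 2 for w, j in zip(weights, risk)) / total
            grad += covariate[i] - mean      # derivative of log partial likelihood
            info += mean_sq - mean ** 2      # minus its second derivative
        beta += grad / info                  # Newton-Raphson update
    # info was evaluated at the last pre-update beta, which equals the
    # estimate once the iteration has converged
    return beta, 1.0 / math.sqrt(info)

# Hypothetical data: covariate 1 = treated, 0 = control; one censored subject
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
events = [1, 1, 1, 1, 1, 0, 1, 1]
treated = [1, 1, 0, 1, 0, 1, 0, 0]
beta_hat, se = cox_fit(times, events, treated)
print(beta_hat, math.exp(beta_hat), se)
```

`math.exp(beta_hat)` is the estimated hazard ratio of treated versus control. The baseline hazard never appears in the computation, which is the point of the partial likelihood.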
Describe applying survival analysis methods in real problems
Kaplan-Meier Estimator (Non-Parametric): Visualizes survival curves and estimates survival probabilities over time. It is intuitive because it plots one curve per group, for example for pre- and post-menopausal groups, allowing visual comparison. (Real-life: Tracking cancer survival rates over months.)
Log-Rank Test: Statistically compares survival curves between groups, accounting for censored data. Assumes proportional hazards (hazard ratio constant over time). (Real-life: Testing if survival differs between treatment and control groups.)
Cox Proportional Hazards Model (Semi-Parametric): Models survival as a function of multiple covariates (e.g., age, tumor size, therapy). Estimates hazard ratios between groups. Flexible and handles time-to-event data but depends on proportional hazards. (Real-life: Identifying key risk factors for cancer recurrence.)
Describe expectation maximisation algorithm
EM Algorithm for Bivariate Data with Missing Values
Initial Setup:
We have a bivariate dataset (x₁, x₂)
All x₁ values are present
Some x₂ values are missing
Goal: Find maximum likelihood estimates of parameters
Why Standard MLE Won’t Work:
The closed-form maximum likelihood estimates require complete data
They cannot be computed directly when some x₂ values are missing
- Start: Make initial guess for missing x₂ values
- M-Step: Calculate ML estimates using “complete” data
- E-Step: Replace each missing x₂ value with its expected value given x₁ under the new estimates
- Iterate: Repeat the M-step and E-step until convergence
Convergence Check:
Stop when |θₖ₊₁ - θₖ| < ε
In practical terms: when parameter estimates stop changing significantly
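A sketch of the loop above for one concrete case, assuming (x₁, x₂) is bivariate normal, which the notes do not state explicitly. It deviates from the outline in two ways. It starts from parameter guesses rather than guessed x₂ values, which enters the same loop at a different point. Its E-step also keeps the expected value of x₂², not only of x₂, because a normal-model EM needs it to avoid underestimating the variance. All names and data are hypothetical.

```python
def em_bivariate(x1, x2, tol=1e-8, max_iter=1000):
    """EM for a bivariate normal; x2 entries may be None (missing).

    Returns (mu1, mu2, s11, s12, s22).
    """
    n = len(x1)
    observed = [b for b in x2 if b is not None]
    # x1 is complete, so its mean and variance are estimated directly
    mu1 = sum(x1) / n
    s11 = sum((a - mu1) ** 2 for a in x1) / n
    # Initial guesses for the x2 parameters from the observed x2 values
    mu2 = sum(observed) / len(observed)
    s22 = sum((b - mu2) ** 2 for b in observed) / len(observed)
    s12 = 0.0
    for _ in range(max_iter):
        # E-step: expected x2 and x2^2 for the missing entries, given x1
        slope = s12 / s11
        resid_var = s22 - s12 ** 2 / s11
        e_x2, e_x2sq = [], []
        for a, b in zip(x1, x2):
            if b is None:
                m = mu2 + slope * (a - mu1)
                e_x2.append(m)
                e_x2sq.append(m * m + resid_var)
            else:
                e_x2.append(b)
                e_x2sq.append(b * b)
        # M-step: ML estimates from the completed sufficient statistics
        new_mu2 = sum(e_x2) / n
        new_s22 = sum(e_x2sq) / n - new_mu2 ** 2
        new_s12 = sum(a * m for a, m in zip(x1, e_x2)) / n - mu1 * new_mu2
        change = max(abs(new_mu2 - mu2), abs(new_s22 - s22), abs(new_s12 - s12))
        mu2, s22, s12 = new_mu2, new_s22, new_s12
        if change < tol:  # convergence check: estimates stopped changing
            break
    return mu1, mu2, s11, s12, s22

# Hypothetical data: the last two x2 values are missing
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.8, None, None]
mu1, mu2, s11, s12, s22 = em_bivariate(x1, x2)
print(mu1, mu2, s12 / s11)
```

The mean of the four observed x₂ values is 2.5, but the estimated mean of x₂ comes out higher, because the missing cases have large x₁ and x₂ increases with x₁.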