Topic 3: Survival analysis and expectation-maximisation Flashcards

Question 1

Q

Describe Principles of survival analysis

Answer

A

Survival analysis deals with time-to-event data. Its core concepts are censored data, the survival function, the hazard rate and hazard ratios. Its main tools are the Kaplan–Meier estimator, the log-rank test and the Cox model (with its proportional hazards assumption).

Survival function: S(t) = P(T > t), the probability of surviving past time t.

Hazard function: h(t) = f(t)/S(t), the instantaneous risk of dying at time t, given survival up to time t.

Time-to-event data: each observation records whether an event (death, failure) occurred and the time elapsed from a defined starting point (e.g. start of treatment) until the event, measured in days, months or years.

Censoring: an observation is censored when its event time is not fully observed, for example when a patient is lost to follow-up or is still alive when the study ends. We then only know that the patient survived at least until the last time they were observed.

Question 2

Q

What is Hazard rate?

Answer

A

The hazard rate for the continuous case can be written as:

$h(t) = \frac{f(t)}{S(t)}$
https://docs.google.com/document/d/1k75C-8K8lC7icJQMzUlberUe9UZ_Oy2s-zOtgipR5WI/edit?tab=t.0

Here f(t) is the density function and S(t) is the survival function, i.e. the complement of the cumulative distribution function: S(t) = 1 − F(t).

The hazard rate measures the instantaneous rate of death at time t, given survival up to time t. It is a rate rather than a probability, so it can exceed 1.

In the case of the exponential distribution, the hazard rate is constant, so the risk does not change over time. The exponential distribution is memoryless: the probability of surviving a further time s, given survival until time t, is the same as the probability of surviving beyond s from the start, P(T > t + s | T > t) = P(T > s). How long an individual has already survived does not change their remaining survival prospects.
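
The constant hazard and the memoryless property can be checked numerically. The following is a minimal sketch; the rate 0.5 and the times t = 2, s = 3 are arbitrary illustrative values, not taken from the course material.

```python
import math

def exp_density(t, rate):
    """Density f(t) of the exponential distribution."""
    return rate * math.exp(-rate * t)

def exp_survival(t, rate):
    """Survival function S(t) = P(T > t) of the exponential distribution."""
    return math.exp(-rate * t)

def hazard(t, rate):
    """Hazard rate h(t) = f(t) / S(t)."""
    return exp_density(t, rate) / exp_survival(t, rate)

rate = 0.5  # illustrative value (assumption)

# The hazard equals the rate at every time point
hazards = [hazard(t, rate) for t in (0.1, 1.0, 5.0)]

# Memorylessness: P(T > t + s | T > t) equals P(T > s)
t, s = 2.0, 3.0
conditional = exp_survival(t + s, rate) / exp_survival(t, rate)
unconditional = exp_survival(s, rate)
```

Here `hazards` contains the same value (the rate) three times, and `conditional` agrees with `unconditional` up to floating-point rounding.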

Question 3

Q

What is Hazard ratio?

Answer

A

The hazard ratio (HR) is the ratio of the hazard rates corresponding to the conditions characterised by e.g. two treatments.

For example, in a clinical study where the treated population dies at twice the rate per unit time of the control population:

$HR = \frac{h_{treated}(t)}{h_{control}(t)} = 2$
https://docs.google.com/document/d/17N7zAmP6-SXwm4mgYxq3kDAHZ51qZhioTcmETlNIepk/edit?tab=t.0

This indicates that there’s a higher hazard of death from the treatment.

HR > 1: The treated group has a higher risk of the event occurring than the control group.
HR < 1: The treated group has a lower risk of the event occurring than the control group.
HR = 1: There is no difference in risk between the two groups.

The hazard ratio is calculated from all the data rather than at a specific time, because it is assumed to be constant over time (the proportional hazards assumption).
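
As a rough illustration, a hazard ratio can be computed from event counts and follow-up time. This sketch assumes a constant hazard within each group (an exponential model, which is stronger than proportional hazards), in which case the estimated hazard is the number of events divided by the total follow-up time. The toy data and the function name `event_rate` are hypothetical.

```python
def event_rate(times, events):
    """Events per unit of follow-up time.

    Under a constant-hazard (exponential) model this is the maximum
    likelihood estimate of the hazard; censored subjects add follow-up
    time but no events.
    """
    return sum(events) / sum(times)

# Hypothetical toy data: follow-up time and indicator (1 = death, 0 = censored)
treated_times = [2.0, 3.0, 1.0, 4.0]
treated_events = [1, 1, 1, 1]
control_times = [3.0, 2.0, 4.0, 1.0]
control_events = [1, 0, 1, 0]

hazard_ratio = (event_rate(treated_times, treated_events)
                / event_rate(control_times, control_events))
```

With these numbers the treated group has 4 events in 10 time units and the control group 2 events in 10 time units, so the hazard ratio is 2. In practice the hazard ratio is usually estimated with a Cox model rather than this way.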

Question 4

Q

Describe Kaplan–Meier estimate

Answer

A

Kaplan–Meier curves provide a graphical comparison that takes proper account of censoring. Observations z_i for censored data problems are of the form:

$z_i = (t_i, d_i)$
https://docs.google.com/document/d/1WJHJXjKo7WlcQTT_a_9wQUwvjFG6gOB7LRBJmUnklgg/edit?tab=t.0

where t_i equals the observed survival time while d_i indicates whether or not there was censoring:

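$d_i = \begin{cases} 1 & \text{if death was observed} \\ 0 & \text{if the observation was censored} \end{cases}$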

The Kaplan–Meier estimate for survival probability:

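$\hat{S}_{(j)} = \prod_{k \le j} \left( \frac{n-k}{n-k+1} \right)^{d_{(k)}}$

Here the n observations are ordered by time, t_(1) < t_(2) < ... < t_(n), d_(k) is the indicator belonging to t_(k), and \hat{S}_{(j)} is the estimated probability of surviving past t_(j).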

This means that \hat{S} jumps down a step at each observed death (d = 1) and stays constant at each censored observation (d = 0).
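
The step behaviour can be seen in a small implementation. This is a minimal sketch using the common "deaths divided by number at risk" form of the Kaplan–Meier estimator, which also handles tied times. The function name `kaplan_meier` and the toy data are illustrative assumptions.

```python
def kaplan_meier(times, events):
    """Return [(t, S_hat(t))] at each distinct observed death time.

    times  : observed times t_i
    events : indicators d_i (1 = death observed, 0 = censored)
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all observations that share this time (handles ties)
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Step down only when at least one death is observed
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored subjects both leave the risk set
        n_at_risk -= leaving
    return curve

# Hypothetical toy data: subjects 2 and 5 are censored
times = [1, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
```

The curve steps down at times 1, 3 and 4 with factors 4/5, 2/3 and 1/2. The censored observations at times 2 and 5 do not cause a step, but they shrink the number at risk for later deaths.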

Question 5

Q

Describe the log-rank test

Answer

A

Key concepts:
- Two survival curves can look clearly similar or clearly different by eye, but we need an objective way to quantify the difference between them. The log-rank test provides this.

H_0: There is no difference between the population survival curves.
H_1: There IS a difference between the population survival curves.

Used for censored data
It’s non-parametric, so it doesn’t assume any specific survival distribution, making it broadly applicable

The result of log rank test: We get a p-value, then we can reject or not reject the null hypothesis.

Assumptions of the log-rank test:
(1) the censoring patterns are the same for the two treatment groups, and
(2) the hazard functions for the two treatment groups are proportional.

The log-rank statistic, $Z$, is defined to be:

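$Z = \frac{\sum_{i=1}^{N} (y_i - E_i)}{\left( \sum_{i=1}^{N} V_i \right)^{1/2}}$

where, at the i-th of the N distinct death times, y_i is the number of deaths observed in the first group, and E_i and V_i are its expected value and variance under H_0 (from the hypergeometric distribution). Under H_0, Z is approximately standard normal.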

https://docs.google.com/document/d/1595ChGwDcKrCR62Uxg-362B0IZkW50f5y5cseNApgCA/edit?tab=t.0
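
A log-rank statistic of the standard "observed minus expected" form can be sketched in a few lines. The code below is an illustration under stated assumptions: hypergeometric mean and variance at each death time, a normal approximation for the p-value, hypothetical toy data, and function names that are not from any particular library.

```python
import math

def log_rank_z(times_a, events_a, times_b, events_b):
    """Log-rank Z statistic comparing group A with group B.

    Assumes at least one death time at which both groups are still at risk
    (otherwise the variance is zero and Z is undefined).
    """
    death_times = sorted(
        {t for t, d in zip(times_a, events_a) if d}
        | {t for t, d in zip(times_b, events_b) if d}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in death_times:
        n_a = sum(1 for u in times_a if u >= t)  # at risk in A just before t
        n_b = sum(1 for u in times_b if u >= t)  # at risk in B just before t
        n = n_a + n_b
        d_a = sum(1 for u, d in zip(times_a, events_a) if u == t and d)
        d_b = sum(1 for u, d in zip(times_b, events_b) if u == t and d)
        d = d_a + d_b
        # Hypergeometric mean and variance of deaths in A under H_0
        observed_minus_expected += d_a - n_a * d / n
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))
    return observed_minus_expected / math.sqrt(variance)

def two_sided_p_value(z):
    """Two-sided p-value from the standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical toy data: group A dies early, group B dies late, no censoring
times_a, events_a = [1, 2, 3], [1, 1, 1]
times_b, events_b = [4, 5, 6], [1, 1, 1]
z = log_rank_z(times_a, events_a, times_b, events_b)
p = two_sided_p_value(z)
```

A positive Z here means group A had more deaths than expected under H_0. Swapping the two groups flips the sign of Z but leaves the p-value unchanged.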

Question 6

Q

Describe Cox’s proportional hazard model

Answer

A

Cox’s proportional hazard model: models the relationship between covariates and survival time, estimating a hazard ratio for each covariate. It is used when you want to adjust for multiple variables or assess the effect of covariates on survival, and it allows a full regression analysis of censored data.
The individual data points can be described as:

$z_i = (c_i, t_i, d_i)$
https://docs.google.com/document/d/17iDsWAJRvS8Q_xqkdfFpna3dW1XbrG4vYQsWtqFxZLE/edit?tab=t.0

t_i = observed survival time

d_i = censor indicator

c_i = 1 x p vector of covariates whose effect on survival we wish to assess

The Core Formula

hᵢ(t) = h₀(t) × exp(xᵢ’β)

Where:

hᵢ(t) is the hazard for individual i at time t
h₀(t) is the baseline hazard, shared by all individuals
xᵢ is the vector of covariates for individual i (the same vector written c_i above)
β is the vector of coefficients we want to estimate

This tells us that the hazard rate for individual i at time t is the baseline hazard rate multiplied by the exponential of the linear predictor xᵢ’β, where the prime (’) denotes the transpose, i.e. xᵢ’ is a row vector.

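To estimate the vector $\beta$ you construct the partial likelihood, which does not involve the baseline hazard h₀(t). The estimate of $\beta$ is the maximiser of the partial likelihood (equivalently, the minimiser of the negative log partial likelihood), and the parameter covariance matrix is estimated from the inverse of the Hessian of the negative log partial likelihood at that point.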

Assumption:

The hazards should be proportional: the hazard ratio between any two individuals (or groups) should be constant over time.
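
To make the partial-likelihood fit concrete, here is a minimal sketch for a single covariate with no tied event times, maximised with Newton-Raphson. The function names, the toy data and the choice of Newton-Raphson are illustrative assumptions. Real analyses would use an established survival library.

```python
import math

def score_and_information(beta, times, events, x):
    """First derivative and negative second derivative of the
    log partial likelihood for one covariate (no tied event times)."""
    score = 0.0
    information = 0.0
    for t_i, d_i, x_i in zip(times, events, x):
        if d_i == 0:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at time t_i
        risk_x = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
        weights = [math.exp(beta * x_j) for x_j in risk_x]
        s0 = sum(weights)
        s1 = sum(w * x_j for w, x_j in zip(weights, risk_x))
        s2 = sum(w * x_j * x_j for w, x_j in zip(weights, risk_x))
        mean_x = s1 / s0
        score += x_i - mean_x
        information += s2 / s0 - mean_x ** 2
    return score, information

def fit_cox_single_covariate(times, events, x, tol=1e-10, max_iter=50):
    """Return (beta_hat, standard_error) by Newton-Raphson from beta = 0."""
    beta = 0.0
    for _ in range(max_iter):
        score, information = score_and_information(beta, times, events, x)
        step = score / information
        beta += step
        if abs(step) < tol:
            break
    # Standard error from the inverse information at the final estimate
    _, information = score_and_information(beta, times, events, x)
    return beta, 1.0 / math.sqrt(information)

# Hypothetical toy data: x = 1 for treated, 0 for control; 0 in events = censored
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 0, 1, 1, 0]
x = [1, 1, 0, 1, 0, 0, 1, 0]
beta_hat, std_err = fit_cox_single_covariate(times, events, x)
hazard_ratio = math.exp(beta_hat)
```

Note that the baseline hazard h₀(t) never appears in the code: it cancels out of each risk-set ratio, which is what makes the partial likelihood usable without specifying it. The estimated hazard ratio for the covariate is exp(β̂).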

Question 7

Q

Describe applying survival analysis methods in real problems

Answer

A

Kaplan-Meier Estimator (Non-Parametric): Visualizes survival curves and estimates survival probabilities over time. It’s intuitive: for example, plotting separate curves for pre- and post-menopausal groups allows a direct visual comparison. (Real-life: Tracking cancer survival rates over months.)

Log-Rank Test: Statistically compares survival curves between groups, accounting for censored data. Assumes proportional hazards (hazard ratio constant over time). (Real-life: Testing if survival differs between treatment and control groups.)

Cox Proportional Hazards Model (Semi-Parametric): Models survival as a function of multiple covariates (e.g., age, tumor size, therapy). Estimates hazard ratios between groups. Flexible and handles time-to-event data but depends on proportional hazards. (Real-life: Identifying key risk factors for cancer recurrence.)

Question 8

Q

Describe expectation maximisation algorithm

Answer

A

EM Algorithm for Bivariate Data with Missing Values

Initial Setup:

We have a bivariate dataset (x₁, x₂)
All x₁ values are present
Some x₂ values are missing

Goal: Find maximum likelihood estimates of parameters

Why Standard MLE Won’t Work:
Regular maximum likelihood estimation requires complete data
Can’t compute likelihood with missing x₂ values

(1) Start: Make an initial guess for the missing x₂ values
(2) M-Step: Calculate ML estimates using the “complete” data
(3) E-Step: Update the missing x₂ values using the new estimates
(4) Iterate: Repeat steps 2-3 until convergence

Convergence Check:

Stop when |θₖ₊₁ - θₖ| < ε
In practical terms: when parameter estimates stop changing significantly
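
The loop above can be sketched for a concrete model. The code below assumes the data are bivariate normal (the card does not name a distribution) and that missing x₂ values are marked with `None`. Under that assumption the E-step fills in the conditional expectation of both x₂ and x₂² given x₁, rather than only a single imputed value. The function name and toy data are hypothetical.

```python
def em_bivariate_normal(x1, x2, tol=1e-10, max_iter=1000):
    """EM estimates (mu1, mu2, s11, s12, s22) when some x2 are None."""
    n = len(x1)
    # x1 is fully observed, so its ML mean and variance are available directly
    mu1 = sum(x1) / n
    s11 = sum((a - mu1) ** 2 for a in x1) / n
    # Start: initial guess from the observed x2 values, zero covariance
    observed = [b for b in x2 if b is not None]
    mu2 = sum(observed) / len(observed)
    s22 = sum((b - mu2) ** 2 for b in observed) / len(observed)
    s12 = 0.0
    for _ in range(max_iter):
        # E-step: expected x2 and x2^2 for missing entries given x1
        slope = s12 / s11
        residual_var = s22 - s12 * slope
        sum_x2 = sum_x2_sq = sum_x1_x2 = 0.0
        for a, b in zip(x1, x2):
            if b is None:
                e_x2 = mu2 + slope * (a - mu1)
                e_x2_sq = e_x2 ** 2 + residual_var
            else:
                e_x2, e_x2_sq = b, b * b
            sum_x2 += e_x2
            sum_x2_sq += e_x2_sq
            sum_x1_x2 += a * e_x2
        # M-step: ML estimates from the completed sufficient statistics
        new_mu2 = sum_x2 / n
        new_s22 = sum_x2_sq / n - new_mu2 ** 2
        new_s12 = sum_x1_x2 / n - mu1 * new_mu2
        # Convergence check: largest parameter change below tolerance
        change = max(abs(new_mu2 - mu2), abs(new_s22 - s22), abs(new_s12 - s12))
        mu2, s22, s12 = new_mu2, new_s22, new_s12
        if change < tol:
            break
    return mu1, mu2, s11, s12, s22

# Hypothetical toy data: the last two x2 values are missing
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, None, None]
mu1, mu2, s11, s12, s22 = em_bivariate_normal(x1, x2)
```

For this missingness pattern the answer can be cross-checked: regressing x₂ on x₁ in the four complete cases gives slope 1.94 and intercept 0.15, and the EM estimate of the x₂ mean converges to 0.15 + 1.94 × 3.5 = 6.94.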