Survival Analysis I Flashcards
What is survival analysis?
A statistical method to analyse time-to-event data, focusing on the time until an event occurs (e.g., death, disease onset)
What are some common applications of survival analysis?
Medical research (time to death), clinical trials, engineering (failure time of components), and social sciences (time to job acquisition)
What is censoring in survival analysis?
A censored observation is one with incomplete information.
A situation where we do not observe the event of interest for some subjects during the study period
What is right censoring?
When an event has not occurred by the end of the study period or the subject is lost to follow-up
What are the assumptions of survival analysis?
- Non-informative censoring
- Survival probabilities are the same for participants recruited early and late in the study
- Events occur at recorded times
What are the two key variables in survival data?
- Time variable
- Failure/censoring indicator (1 = event occurred, 0 = censored)
What is an example row of data in a survival dataset for someone who never had an event?
ID: 1
Entry date: 01 Jan 2023
Event date: .
Censor date: 01 Jan 2025
Time: 2.0
Event: 0
What is an example row of data in a survival dataset for someone who had an event?
ID: 2
Entry date: 01 Jan 2023
Event date: 01 May 2024
Censor date: .
Time: 1.3
Event: 1
What is the Kaplan-Meier method?
Also known as the “product-limit estimator”. A non-parametric estimator of the survival probability over time that accounts for censored data. It does not assume the frequency of the event remains constant over time, so the result is not easily summarised by a single number. It estimates the cumulative probability of remaining event-free up to a certain time point (one minus this gives the cumulative probability of experiencing the event by that time)
How is the Kaplan-Meier survival function estimated?
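By multiplying the conditional probabilities of remaining event-free at each event time up to t: S(t) = 1 x (1 - d1/n1) x (1 - d2/n2) x ... x (1 - dj/nj), where dj is the no. of events and nj is the no. at risk at event time tj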
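The multiplication of conditional probabilities can be sketched in Python. This is a minimal illustration, not a replacement for Stata's sts commands: the function name kaplan_meier and the five-person dataset are made up for the example, and people censored at an event time are assumed to still be at risk at that time.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(event_time, survival_probability)] using the product-limit method.

    times  : follow-up time for each person
    events : 1 if the event occurred at that time, 0 if censored
    """
    # Number of events at each distinct event time
    n_events = Counter(t for t, e in zip(times, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(n_events):
        # Everyone whose follow-up lasts until at least t is at risk at t
        n_at_risk = sum(1 for u in times if u >= t)
        # Multiply by the probability of no event at this time
        survival *= 1 - n_events[t] / n_at_risk
        curve.append((t, survival))
    return curve

# Hypothetical data: 5 people, two of them censored (event = 0)
times = [1.0, 2.0, 3.0, 4.0, 5.0]
events = [1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

The curve only changes at event times; censored people simply leave the number at risk for later times.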
What does a Kaplan-Meier curve show?
The probability of remaining event-free over time
What does a step-down in the Kaplan-Meier curve indicate?
The occurrence of an event at that time
How do we interpret the median survival time from a Kaplan-Meier curve?
The time at which 50% of subjects have experienced the event
How is the incidence rate calculated?
Incidence rate = (number of events) / (total person-years at risk)
How do we interpret an IRR?
IRR > 1: Higher event rate in group 1 compared to group 2
IRR = 1: Same event rate in both groups
IRR < 1: Lower event rate in group 1 compared to group 2
Example calculation: If men have an IR of 12 per 100 PY and women 18 per 100 PY, what is the IRR for women compared to men?
IRR = 18/12 = 1.5 (women have a 50% higher event rate than men)
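The IRR arithmetic can be checked with a few lines of Python. This is only a sketch; the function name incidence_rate_ratio is made up for the example, and both rates must be expressed per the same amount of person-time.

```python
def incidence_rate_ratio(rate_group1, rate_group2):
    """Rate in group 1 divided by rate in group 2 (same person-time units)."""
    return rate_group1 / rate_group2

# Rates per 100 PY from the card: women 18, men 12
irr = incidence_rate_ratio(18, 12)
print(irr)                                        # 1.5
print(f"{(irr - 1) * 100:.0f}% higher rate in women")  # 50% higher rate in women
```

Swapping the groups gives 12/18 = 0.67, i.e., the same comparison read as men versus women.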
What test is commonly used to compare equality of survival curves?
Log-rank test (similar to Mann-Whitney U/Wilcoxon rank sum test)
Used to compare curves on a Kaplan-Meier plot (same formula as used in the chi-square test - calculates the observed no. of events and compares this to the no. expected if there were in reality no difference between groups)
What is the H0 of the log-rank test?
There is no difference in survival probabilities between groups at any time point
What does a significant log-rank test indicate?
That there is evidence the survival distributions differ between the groups (it does not tell us when or by how much they differ)
What are the limitations of the log-rank test?
It works best under proportional hazards and may not detect differences when survival curves cross (e.g., when comparing a medicine with a surgical intervention)
What Stata command is used to set survival data?
stset <time>, failure(<event>)
<time> = time from beginning of study to death or end of study follow-up
How do you generate Kaplan-Meier survival estimates?
sts graph
By group: sts graph, by(<group>)
How can you test the equality of survival curves in Stata?
sts test <group>
What command provides summary statistics for survival data?
stdescribe
This provides basic information like average follow-up per person, shortest follow-up time, etc. Used to check for obvious errors in the dataset. It gives the total PYs of follow-up, which is used to calculate the overall incidence rate
How do you estimate the median survival time in Stata?
stci, median
What is the Cox proportional hazards model?
A regression model estimating the hazard ratio for different covariates while controlling for confounders
What are parametric survival models?
Models that assume a specific distribution for survival times (e.g., Weibull, Exponential)
What is a hazard function?
The instantaneous rate of event occurrence at a given time point
What is the proportional hazards assumption?
The hazard ratio between groups remains constant over time
What is survival analysis also known as?
Time-to-event analysis
What happens in real-life studies?
People drop out before the end of follow-up or join late. Therefore, calculating incidence risk in the usual way (no. of events / no. of people) risks underestimating it
Why take into account differing amounts of follow-up?
Those followed for the whole study had a greater chance of experiencing the event, simply because they were followed for longer. Estimates of the event incidence are likely to be underestimated if you assume everyone was followed for the same length of time. If the pattern of censoring differs between groups, there is the potential for comparisons to be biased
What example demonstrates how not considering differing amounts of follow-up may bias comparisons?
If looking at the effect of SES on heart attack risk, the wealthier may be more likely to participate for the whole study period than those of low SES. This may result in heart attack risk being underestimated in the lower SES group
How does Stata refer to events?
As ‘failures’
Properties of time-to-event data:
- Analysing the length of time until the occurrence of a binary event
- The data are positive (i.e., time to event cannot be <0) and the distribution is usually skewed (so linear regression is not appropriate)
- Data are usually censored (people are usually event-free before censoring)
Examples of right censoring:
- Administrative censoring: person has not (yet) experienced the event at the time the study database is closed for analysis
- Loss-to-follow-up: person drops out before experiencing the event and before study has ended
We don’t know the mechanism by which someone’s data has been censored - treat both types the same.
How many outcome events per person are considered?
Maximum of one
Why consider maximum one event per person?
- Survival times are independent of each other, given the values of the predictor variables
- This means the survival time of one individual (/observation) should not influence the survival time of another (/observation)
E.g., if someone has 2 heart attacks, the second one isn’t counted. You only have one line of data per person otherwise assumptions will be violated
Under what scenarios is the assumption of one event per person reasonable?
- Only possible for an event to occur once (e.g., death)
- Underlying process is altered by the first event occurring (e.g., myocardial infarction). The predictors of the first may be different to those of the second.
If an event can occur multiple times (e.g., COVID-19), can consider “time to first event”
What should you check with dates in Stata?
For Americanised date formats (month/day/year rather than day/month/year)
How is the ‘time’ column made in a time-to-event dataset?
By subtracting the entry date from the event/censoring date
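The date subtraction can be sketched in Python using the two example rows from the earlier cards. The function name follow_up_years is made up for the example, and dividing by 365.25 days per year is an assumption about how the time column was converted to years.

```python
from datetime import date

def follow_up_years(entry_date, end_date):
    """Time in years from entry to the event or censoring date."""
    days = (end_date - entry_date).days
    if days < 0:
        # A negative follow-up time points to a data-entry error in the dates
        raise ValueError("end date is before entry date")
    return days / 365.25  # assumed conversion from days to years

# ID 1: censored on 01 Jan 2025 (no event)
print(round(follow_up_years(date(2023, 1, 1), date(2025, 1, 1)), 1))  # 2.0
# ID 2: event on 01 May 2024
print(round(follow_up_years(date(2023, 1, 1), date(2024, 5, 1)), 1))  # 1.3
```

The end date is the event date for people who had the event and the censor date for everyone else.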
How is survival analysis different to other regression models?
The outcome has 2 columns instead of 1 (one column for time and another for the event)
Why can we not calculate prevalence/incidence risk (% people who have an event)?
People are followed for different lengths of time. We can use rates (death or incidence rates which cannot be expressed as a percentage) to consider PYs at risk
Breakdown of calculating rates:
- Numerator = number of events (d)
- Denominator = total person-time at risk e.g., PYs at risk (rather than no. of people)
- Rate per PY = number of events (d) / PYs at risk (py)
- Rate (/100 PYs) = d / py * 100
How do you know how many PYs to use to calculate rates?
E.g., whether 100 or 1000 is used is down to individual judgement as long as it makes sense
How do you calculate total PYs at risk?
Add up the time people were in the study for
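The rate calculation, with total PYs obtained by adding up each person's follow-up, can be sketched in Python. The function name incidence_rate and the five-person dataset are made up for the example.

```python
def incidence_rate(follow_up_years, events, per=100):
    """Events divided by total person-years at risk, scaled to `per` PYs."""
    total_py = sum(follow_up_years)   # denominator: total person-time at risk
    d = sum(events)                   # numerator: number of events
    return d / total_py * per

# Hypothetical data: years of follow-up and event indicator (1 = event, 0 = censored)
years = [2.0, 1.3, 0.7, 2.0, 1.0]     # total = 7.0 PYs
event = [0, 1, 1, 0, 0]               # 2 events
print(round(incidence_rate(years, event), 1))            # 28.6 per 100 PYs
print(round(incidence_rate(years, event, per=1000), 1))  # 285.7 per 1000 PYs
```

Changing `per` only rescales the same rate; it does not change the comparison between groups.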
Properties of the incidence rate:
- Assumes the rate is constant over time - the frequency with which an event occurs in the 1st year of follow-up is the same as the frequency in the 2nd, 3rd, 10th, … year
- May be expressed relative to any period of time (per 100 PYs, per 1000 person-months, etc.) depending on event’s frequency
- Can compare rates in two groups with the IRR (rate in group 1 / rate in group 2)
Why may using the incidence rate not be appropriate?
It is one number that averages the frequency of the event over the whole time period. This may not be appropriate if analysing survival after surgery, where the risk of death is highest in the first few days after surgery
How does the Kaplan-Meier fix percentages?
By adjusting the event percentages upwards slightly, correcting the underestimation that results from not considering differing follow-up periods
What does ‘&’ mean?
Multiplication of independent events
E.g., by time 0.5, someone has been censored, but no one has yet experienced the event. Up until now the sample size is n = 22. Fill in a row in a table of Kaplan-Meier estimates
Year: 0.6
No. at risk: 21
No. events: 1
No. censored: 0
Prob. event at this time: 1/21 = 0.048
Prob. no event at this time (p(t)): 20/21 = 0.952
Prob. remaining event free up to and including this time: 1.00 x 0.952 = 0.952
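The arithmetic for this row can be reproduced in Python. This is a small sketch; the function name km_row is made up for the example.

```python
def km_row(n_at_risk, n_events, survival_so_far):
    """One row of a Kaplan-Meier table.

    Returns (prob. event, prob. no event, prob. remaining event-free so far).
    """
    p_event = n_events / n_at_risk
    p_no_event = 1 - p_event
    return p_event, p_no_event, survival_so_far * p_no_event

# Row at year 0.6: 21 at risk, 1 event, everyone event-free until now (1.00)
p_event, p_no_event, survival = km_row(21, 1, 1.00)
print(round(p_event, 3), round(p_no_event, 3), round(survival, 3))
# prints 0.048 0.952 0.952
```

The third value is passed in as survival_so_far when computing the next row.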
In what order are censoring and an event considered, e.g., if both happened at time 0.5 independently of one another?
The event is considered at time 0.5
By convention, the censoring is not considered until the next time (0.6), so the censored person still counts as at risk at time 0.5
In Kaplan-Meier estimates, what happens to the denominator over time?
It decreases cumulatively as people have events or are censored, so the no. at risk shrinks and the steps get bigger
If the probability of remaining event-free up to time 0.7 was 0.952, and the probability of no event at time 0.8 is 0.950, what’s the probability of remaining event-free up to time 0.8?
0.952 x 0.950 = 0.904
What can we calculate on a Kaplan-Meier plot?
CIs to show the range within which the true probability is likely to lie at any given timepoint
Do Kaplan-Meier plots always have to start at the top?
No, they can start at the bottom (at zero) and go up. The y-axis is often truncated to reduce white space. However, this can make events look more common than they are. Therefore, some journals want the scale to run from 0 to 1
If everybody in a study had been followed for the same amount of time, what could you do instead?
A chi-squared test on a tabulation and a logistic regression to estimate predictors of death
What’s the first thing to check after setting survival data in Stata?
Check no. of exclusions - Stata will drop those with a negative follow-up time (e.g., erroneously inputting a date so follow-up is -10 years)
Need to check observations and failures in cross-tab as well as longest someone was followed for
What happens to the CIs by the end of the Kaplan-Meier plot?
They widen because fewer people remain at risk, so the estimates of the true population probability are less precise
How can we get the exact numbers rather than read off a Kaplan-Meier graph in Stata?
sts list
Other commands will allow you to see how many were alive at 5 years, for example
What two ways can you plot a Kaplan-Meier graph?
Either the probability of having an event (curve goes upwards) or of remaining event-free (curve goes downwards)
When should plots be stopped?
When the no. of patients remaining under follow-up and event-free is small (<10?)
How should the no. remaining at risk in each group be shown in a Kaplan-Meier plot?
Should be shown at regular intervals under the x-axis
Can we always estimate the median survival time?
No - sometimes not enough data (e.g., only 5% experienced event by end of follow-up)
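Reading the median off a set of Kaplan-Meier estimates can be sketched in Python. The function name median_survival and both curves are made up for the example; the median is taken here as the first time the probability of remaining event-free is 0.5 or lower.

```python
def median_survival(curve):
    """First time at which the event-free probability is <= 0.5.

    curve : list of (time, probability of remaining event-free), sorted by time.
    Returns None if the curve never reaches 0.5.
    """
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical curve that crosses 50%
print(median_survival([(0.6, 0.90), (1.2, 0.70), (2.0, 0.45), (3.1, 0.30)]))  # 2.0
# Hypothetical curve where only 5% had the event by the end of follow-up
print(median_survival([(0.6, 0.98), (1.5, 0.95)]))  # None
```

The second call shows why the median cannot always be estimated: the curve never drops to 50%.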
What are we assuming with censoring?
By treating observations as censored, we assume that, had people been followed after censoring, they would have experienced the same event rate as those not censored (i.e., those who are censored are similar to those who remain under follow-up). This may not be the case if censoring is due to some other event that happened to the patient
What is good practice when computing graphs?
Informative and neat graphs through labels and legends
Why is log rank test popular for comparing survival curves?
Takes whole follow-up period into account and does not require us to know anything about the shape of the survival curve or distribution of survival times
Log rank test statistic:
Sum of (O - E)^2 / E across the groups (O = observed no. of events, E = expected no. of events), which is then compared to a chi-squared distribution to obtain the relevant p-value
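The test statistic can be sketched in Python for two groups. This is an illustration of the simple sum of (O - E)^2 / E form only: the function name log_rank and the dataset are made up, the p-value line assumes exactly two groups (1 degree of freedom), and statistical software may use a slightly different variance formula, so results may not match exactly.

```python
import math

def log_rank(times, events, groups):
    """Approximate log-rank statistic: sum over groups of (O - E)^2 / E."""
    labels = sorted(set(groups))
    observed = {g: 0.0 for g in labels}
    expected = {g: 0.0 for g in labels}
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = {g: 0 for g in labels}
        d = {g: 0 for g in labels}
        for u, e, g in zip(times, events, groups):
            if u >= t:
                at_risk[g] += 1          # still under follow-up at time t
            if u == t and e == 1:
                d[g] += 1                # event at time t
        n_total = sum(at_risk.values())
        d_total = sum(d.values())
        for g in labels:
            observed[g] += d[g]
            # Events expected in this group if there were no difference
            expected[g] += d_total * at_risk[g] / n_total
    chi2 = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in labels)
    # Upper tail of a chi-squared distribution with 1 df (two groups only)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical data: group A has all its events early, group B late
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
chi2, p = log_rank(times, events, groups)
print(round(chi2, 2), round(p, 3))
```

The sketch divides by E, so it needs every group to have someone at risk at one or more event times.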
Computing a log rank test in Stata:
sts test <varname>
Where <varname> refers to a group, such as one defined by age
When is the log rank test most likely to detect a difference between groups?
When the risk of an event is consistently greater for one group than another
What should you always do when analysing survival data?
Survival curves should always be plotted. Never just do p-values - they are there to complement the graph
Why is censoring being unrelated to prognosis the most important assumption of log rank test and Kaplan-Meier method?
It assumes those who drop out aren’t different to those who stay. If those who dropped out were really sick, drop-out would not be random: the death rate would be calculated from healthier participants (who will comprise the majority by the end of study follow-up) and so would be underestimated.
In the real world, there may be some differential drop-out. If more than 10% are lost to follow-up, this may be a concern.
Assumptions of log rank test and Kaplan-Meier method:
- Censoring is unrelated to prognosis (non-informative censoring)
- Survival probabilities are the same for subjects recruited early and late
- Events happened at the times specified
- Independence