Midterm Flashcards
Association vs. Causality
Causality requires meeting criteria such as a temporal relationship, strength of association, and a dose-response relationship. Experimental studies tend to examine causality.
Association is when there is limited knowledge and you cannot say for sure that the exposure causes the outcome. Observational studies tend to look at association.
When a study is about association, the hypothesis will state “is associated with,” while a causality study will say “increases/decreases the risk”.
Descriptive Study
A study that describes the distribution of disease (e.g. person, place, or time).
Often an implicit hypothesis such as “the distribution of disease varies by person, place or time”. But can also be explicit as well.
Analytic Study
Motivation is often to identify a causal determinant and find an association between exposure and outcome.
Relative Risk
RR can mean incidence rate ratio, risk ratio (cumulative incidence ratio), hazard ratio, and odds ratio
Bias
Systematic error in the design or conduct of a study that results in a measure of association among study participants that is meaningfully different from the true measure of association (e.g. such as that in the source population)
Information Bias
Error due to collection of incorrect information about study participants.
Participants are classified into incorrect exposure or disease categories (misclassification)
Selection Bias
Error arising from 1) criteria or procedures used to select study participants or 2) nonparticipation (occurring at initial enrollment or due to losses to follow-up)
Direction of bias for RR
Axis 1: Upward vs. downward (this does not indicate whether the strength of association is being over- or underestimated)
Axis 2: Toward the null vs. away from the null
When assessing direction of bias the reference point is always the true RR. (e.g. if the True OR is .8 and the Obs OR is .2, then the bias is downward and away from the null).
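As a sketch, the two axes can be coded as a small helper. Measuring distance from the null on the log scale is my assumption (the card does not name a metric); it treats RR = 0.5 and RR = 2 as equally far from the null of 1.0.

```python
import math

def bias_direction(true_rr, obs_rr):
    """Classify bias on both axes, relative to the true RR (null = 1.0)."""
    axis1 = "upward" if obs_rr > true_rr else "downward"
    # Distance from the null measured on the log scale, so that
    # RR = 0.5 and RR = 2.0 are equally far from 1.0 (an assumption).
    if abs(math.log(obs_rr)) > abs(math.log(true_rr)):
        axis2 = "away from the null"
    else:
        axis2 = "toward the null"
    return axis1, axis2

# The card's example: true OR = 0.8, observed OR = 0.2
# -> downward and away from the null
```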
Strength of Association
The further from the null, the stronger the association.
Bias away from null overestimates the strength of association
Bias towards the null underestimates the strength of association
Source population in a cohort study
The population that gave rise to the study sample. (should always include calendar time)
General cohort
Defined by a factor unrelated to any particular exposure
Typically a convenience sample based on logistical advantages (e.g. willingness to participate, ease of recruitment, and/or follow-up)
Use of an internal comparison group
Uses RR
Specific-exposure cohort
Defined by a specific exposure
Use of an external comparison group (e.g. general population).
Method to analyze is indirect standardization
Uses RR
Susceptible to selection bias (such as the healthy worker effect) - The main issue is that the exposed cohort and nonexposed external comparison group are not selected in the same fashion from the same source population. Selection from different source populations may result in different disease risk for reasons other than the exposure under study.
Sources of selection bias
- different criteria are used to select exposed and unexposed participants
- Selection of exposed or nonexposed participants is related to the development of the outcome of interest
- Loss to follow-up is related to both the exposure and the outcome of interest (differential losses to follow-up)
Susceptibility to selection bias
Cohorts with internal comparison groups are less prone to selection bias than specific-exposure cohorts. Study participants are selected before the development of the disease and it is unlikely that future events will bias selection process. Cohorts using internal groups could have selection bias due to differential losses to follow-up.
Cohort using an external comparison group - healthy worker effect - RR is biased downward
specific-exposure cohorts are extremely prone to selection bias
Differential losses to follow-up
a situation in research where participants who drop out of a study have different characteristics than those who stay in the study
Source Population in case-control
The population that gave rise to the cases. Essentially, the population of persons who would have been identified as cases if they had developed the condition of interest during the time period in which the cases were identified.
Calendar time should be included
Types of Source populations
Primary source population - well-defined (e.g. residence, calendar period), and specified a priori. Determines case ascertainment
Examples include:
- residents of a defined geographic area
-members of a health plan
-members of a general cohort
Secondary source population (more prone to selection bias than primary) - theoretically defined and inferred based on the method of case ascertainment. case ascertainment method is defined a priori. “Would/if criterion” is employed.
Examples include:
- cases ascertained through a hospital “person who would attend the hospital if..”
-cases recruited through advertisements “person who would answer the ad if they were…”
Case-control studies
a method of sampling controls from the source population such that the controls reflect exposure distribution in the source population that gave rise to the cases. Controls should be randomly sampled and representative of source population.
Uses odds ratio.
case selection: ideally includes all cases that arise in the source population. In reality, usually only a sample of cases is included, but they need to be representative of all cases.
Selection bias in case-control
If the exposure under study is not similar among study cases compared to all cases that arose in the source population.
If the exposure under study is not similar among study controls compared to the source population.
Prone to selection bias. cases and controls are often selected through fundamentally different processes
- imperfect method of case ascertainment
- case non-participation
- case refusal, inability to locate cases, case too sick, case died
Controls: - non-participation, control refusal, inability to locate, random sampling from primary source is hard, secondary source pop is difficult to operationally define
partial non-participation among cases and/or controls
Timeline of case and control recruitment
ascertain and recruit incident cases
accumulate controls during the study period at same rate that cases are being accumulated
source population is restricted to persons at risk of becoming a case
a control who later becomes a case serves as both a control and a case
2 x 2 table
              cases   controls
exposed         a        b
non-exposed     c        d
odds ratio = ad/bc
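The table maps directly to a one-line calculation (cell labels follow the card's layout):

```python
def odds_ratio(a, b, c, d):
    """OR from a 2x2 table:
    a = exposed cases,     b = exposed controls,
    c = non-exposed cases, d = non-exposed controls."""
    return (a * d) / (b * c)

# e.g. odds_ratio(20, 10, 10, 20) -> 4.0
```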
Types of Case-control Studies
Population-based
- primary source population
- cases: all new cases of disease x that arise
- control: rep sample of the source pop with respect to exposure
Hospital-based
- secondary source population
-cases: same as above but in a hospital
-control: same as above, but it’s hard to achieve in a secondary source pop
Source pop can come from place of residence, insurance, access to a regular physician, etc.
One exception: if most residents of a defined geographic area would attend hospital A and no other hospitals if they contracted a disease then cases could be considered population-based and population-based controls can be used.
Nested
- primary source population
-case and control same as above
- typically conducted when the exposures of interest are measured by assay of stored biologic specimens
Pros and Cons of hospital-based case-control studies
Pro
- easily accessible and high participation rate
- protect against recall bias
Con
-nonrandom sample of the source pop, most of whom are healthy
-some may not even be members of source population
Strategies for selection of hospital controls
only include patients admitted for diseases for which there is no suspicion of an association with the exposure under study
include controls with a variety of diseases
include diseases thought to have a comparable source population as the disease under study
base exclusions on diagnosis at the current hospitalization, not on past medical history
pros and cons of nested
pro
- exposure measured at baseline before development of disease
-selection of controls by random sampling from a well-defined, primary source population
sources of selection bias
- incomplete case ascertainment
-cohort losses to follow-up
-selection bias associated with participant selection in the entire cohort itself
Confounding
A variable that is associated with both the exposure and the outcome and is not a mediator on the causal pathway.
Can be caused by an imbalance b/w exposed and nonexposed groups in another, extraneous exposure (confounder)
If there is confounding and the variable is identified and measured, then can adjust as long as there was no bias in selection of cases or controls within each stratum of the covariate.
For example - if SES is only associated with the exposure, but there is over-selection of high-SES controls, then there is an artifactual inverse association with the outcome, leading to confounding. Can be addressed through stratification (Mantel-Haenszel method).
Key test of validity of a case control study
Controls and the source pop should be alike with respect to the exposure under study
The best indication of the presence of confounding is
A meaningful difference between the unadjusted RR and the adjusted RR
Calculate and inspect the RR for each stratum of the potential confounder. If the stratum-specific RRs are similar to each other (but differ from the crude RR), that suggests confounding. If they differ from each other, it may be effect modification.
When and how to address confounding
Design phase - identify potential confounders by consulting the lit
data collection - measure potential confounders accurately
analysis - check theoretical confounders and other study variables. determine if there is confounding.
Methods used to adjust for confounding in analysis stage
methods based on stratification
multivariable statistical models
standardization (direct or indirect)
Magnitude of confounding
(RRunadj − RRadj) / RRadj × 100
If this percentage exceeds 10%, the confounding is considered meaningful. No need to look at a p-value here.
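A minimal sketch of the calculation and the card's >10% rule of thumb (the helper names are mine):

```python
def confounding_magnitude(rr_unadj, rr_adj):
    """Percent difference between unadjusted and adjusted RR."""
    return (rr_unadj - rr_adj) / rr_adj * 100

def is_meaningful_confounding(rr_unadj, rr_adj, threshold=10.0):
    """Apply the >10% rule of thumb (no p-value involved)."""
    return abs(confounding_magnitude(rr_unadj, rr_adj)) > threshold
```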
Mantel-Haenszel summary RRs
To calculate:
Set up i 2x2 tables (where i is the # of strata or categories of a potential confounding variable)
Compute the weighted average of the stratum-specific RRs (# of subjects or person-time experience in each stratum)
For cohort - risk ratio or rate ratio
For case-control studies - odds ratio
ORmh = Σ(ai·di / Ni) / Σ(bi·ci / Ni)
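The ORmh formula can be sketched as a short function; each stratum is an (a, b, c, d) table and Ni is the stratum total:

```python
def mh_odds_ratio(strata):
    """Mantel-Haenszel summary OR.
    strata: list of (a, b, c, d) 2x2 tables, one per stratum of the
    potential confounder; Ni = a + b + c + d is the stratum total."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den
```

With a single stratum this reduces to the ordinary OR = ad/bc.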
When to use MH
adjustment for a single confounder that is a categorical variable
simultaneous adjustment for 2 or 3 confounders, as long as the number of strata for each confounder is relatively small
MH is only to be used for categorical variables
cannot use for large strata - too cumbersome
Generalized linear models
Linear - no RR estimated
Y = b0 + b1 *X1
To estimate the effect of a 10-unit increase in X1 (e.g. 10 years), multiply: b1 × 10
Null value is 0
The following models are all log-transformed:
Logistic - odds ratio
- used in case-control studies
-other studies with binary dependent variables
-risk prediction
-ignores time
ln(odds of Y) = b0 + b1*X…
Poisson (log-linear) - IRR
- cohort studies with person-time data
- incidence rate studies that use aggregate level data
ln(incidence rate of Y) = b0 + b1*X…
Cox Proportional Hazards - HRR
- studies with binary outcome and person-time data
- cohort studies
- RCTs
- Survival analysis
ln[h(t)] = ln[h0(t)] + b1*X…
Unconditional logistic regression vs conditional logistic regression
unconditional - used in unmatched case-control studies (can also use stratification, such as the traditional Mantel-Haenszel method with stratification by the matching factors, for frequency-matched case-control)
conditional - used in individually matched case-control studies
Deriving RR (per N-year increase of age) in a log-transformed model
This applies to any continuous variable; age is used as the example.
ex: N = 10
beta(per year) = 0.049 (= ln 1.05)
RR(per year) = e^0.049 ≈ 1.05
beta(per year) × 10 = 0.49, then e^0.49 ≈ 1.63, so RR(per 10-year increase) ≈ 1.63 (= 1.05^10)
This is done in relation to a reference level - the reference could be age 10 or age 20, but because the model is log-linear, the RR per 10-year increase will always be the same.
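With consistent numbers, the derivation is just e^(beta × N); a quick check in code:

```python
import math

def rr_per_n_units(beta_per_unit, n):
    """RR for an n-unit increase of a continuous variable
    in a log-transformed model: e^(beta * n)."""
    return math.exp(beta_per_unit * n)

beta = math.log(1.05)               # beta per year when RR(per year) = 1.05
rr_10yr = rr_per_n_units(beta, 10)  # 1.05**10, about 1.63
```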
Categorical or continuous in model for a natural order variable? (test for trend)
Can model it as a single term; if the relationship is linear (on the log scale), it can be used as continuous.
Test for trend can be used to assess evidence of an exponential trend (linear on a log scale). only applied to exposures with a natural order.
To do this:
1) model variable as categorical variable to capture shape of the dose-response relationship
2) model variable as a single term in a separate regression model to test for trend
p-value for trend = p-value for b1 in the single term model. if p<.05 then it is significant
Standardization
Stratification-based method of comparing rates of an outcome between two populations that have different distributions of one or more confounders. The goal is to make the comparison fair (i.e. to remove confounding) by forcing the two populations to have the same covariate distribution.
Indirect Standardization
Used for retrospective (historical) cohorts with an external comparison group, such as specific-exposure cohorts.
Standardization covariates MUST be categorical
Standardized incidence/mortality ratio
standardized incidence ratio (SIR) = total observed cases / total expected cases
standardized mortality ratio (SMR) = total observed deaths / total expected deaths
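A sketch of the expected-count step, assuming expected events are obtained by applying reference-population stratum-specific rates to the cohort's person-time (the numbers below are hypothetical):

```python
def expected_events(strata):
    """Expected events: sum over strata of
    (reference-population rate x cohort person-time in that stratum)."""
    return sum(ref_rate * person_time for ref_rate, person_time in strata)

def standardized_ratio(observed, strata):
    """SIR or SMR = total observed events / total expected events."""
    return observed / expected_events(strata)

# Hypothetical two-stratum cohort:
# expected = 0.01*1000 + 0.02*500 = 20; SMR for 30 observed deaths = 1.5
```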
Residual confounding
When your study adjusts for a variable or set of related variables in a way that does not completely remove the confounding by that variable or those variables.
Coarse categorization: This may be because you use categories that are too broad, so that there are heterogeneous groups of people within each stratum. This is problematic because these heterogeneous groups of people could also differ with respect to their exposure prevalence and risk of the outcome.
Suboptimal modeling of the confounder in a multivariable model (e.g. modeling a covariate as continuous when the true dose-response curve is U-shaped)
Inadequate adjustment for complex, multidimensional confounders, such as smoking, SES, and health status
Inadequate measurement of the confounder (measurement error - unvalidated data collection instrument), collection of insufficiently detailed information
**If confounding remains due to not adjusting at all for a particular confounder this is NOT considered residual confounding.
Health status as confounder
Healthy vaccinee effect - seniors at high short-term risk of death are less likely to be vaccinated, making vaccination appear protective.
Addressing residual confounding
Measurement - measure potential confounders as carefully as the exposure under study. Especially if multidimensional
Data analysis - Use sufficiently fine covariate categorization, optimize modeling of covariates in multivariable models, strive to capture full dimensionality of multidimensional confounders in multivariable models
*however need to take into account statistical imperative of model parsimony - ratio of # of outcomes to # of covariates should be more than 10.
interpretation - Be transparent about the residual confounding in interpretation and how it could be better accounted for.
Matching in Cohort Studies
Adjust for one or more potential confounders in the design phase of your study
Select non-exposed participants who are similar to the exposed participants with respect to the distribution of one or more potential confounders.
Potential confounders are called matching factors. When matched, there is no need to account for them in the analysis phase, but only if there is complete follow-up.
Matching in case-control studies
to adjust for one or more potential confounders in the design phase of the study
selection of controls who are similar to cases with respect to their distribution of one or more potential confounders
However matching in the design phase alone does not completely remove confounding and so will need to still adjust in the analysis phase
matching intentionally introduces selection bias and creates a new, superimposed confounding toward the null
Matching on a true confounder increases statistical efficiency by optimizing precision
Frequency matching
Selection of controls such that the distribution(s) of one or more potential confounders is/are similar in cases and controls
Often used when matching factors are demographic variables (e.g. age, sex, race)
For example, if some strata have 0 individuals, you risk not being able to use the data from all subjects in the study, leading to reduced statistical efficiency.
Individual Matching
Selection of one or more controls that are identical to a given case with respect to one or more potential confounders
Useful for:
- controlling for a confounder using “fine stratification” (each matched set is a mini stratum)
- matching factors that are multidimensional confounders
- risk-set sampling of controls in nested case-control studies
The matched set is the stratum
Cannot do twin studies with unmatched case-control
Must use conditional logistic regression (no need to include the matching factors) OR stratification - the Mantel-Haenszel matched analysis (McNemar test) - which gives the matched OR
Nested Case-Control Studies with Matching
For each case, N matched controls can be randomly sampled from the case’s risk set
- can restrict the risk set by matching factors
Enables selection of control with the same risk set as case
- same concurrent time at risk for development of outcome
Simplest Mantel-Haenszel matched analysis
four possible combinations of matched pairs
two concordant (both exposed = q; both unexposed = t) and two discordant (case exposed / control unexposed = r; case unexposed / control exposed = s)
r and s are the discordant pairs; only these are needed
r/s = Matched Odds Ratio
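Both the matched OR and the McNemar statistic use only the discordant counts; a minimal sketch:

```python
def matched_odds_ratio(r, s):
    """r = discordant pairs with case exposed / control unexposed,
    s = discordant pairs with case unexposed / control exposed."""
    return r / s

def mcnemar_chi2(r, s):
    """McNemar test statistic (1 df), based only on discordant pairs."""
    return (r - s) ** 2 / (r + s)

# e.g. 30 vs. 10 discordant pairs: matched OR = 3.0, chi-square = 10.0
```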
Can association between matching factors and disease be studied?
No, because matching forces controls to be the same as cases with respect to the matching factor therefore, there is no way to find the association.
Overmatching
Overmatching generally refers to matching that is counterproductive, either causing bias or reducing efficiency. It creates a new superimposed confounding toward the null and leads to loss of statistical efficiency.
Overmatching must be corrected in analysis phase
Matching on a mediator in a causal pathway between exposure and disease will bias the effect estimate towards the null
Matching on a non-confounder that is associated with exposure, but not a risk factor for disease
Survival analysis
Study of the distribution of time elapsed from a baseline time to an outcome(event)
Study of how exposures (including treatments) affect the distribution of time to event
Used for two study designs: cohort studies and RCTs
Baseline data examples: date of entry into a cohort, birth date, etc.
outcome examples: death, incident disease, disease cure, etc.
It is better to experience a beneficial outcome earlier than later
It is better to experience an adverse outcome later than earlier
Cumulative incidence vs. cumulative survival
CI (0 to 1) is the proportion of a specified population at risk that experienced the outcome under study during a specified time period
Probability (risk) of experiencing the outcome under study in the specified time period
CS (0 to 1) is the proportion of a specified population at risk that does NOT experience the outcome under study (i.e. “survives”) during a specified time period
Probability (risk) of NOT experiencing the outcome under study in the specified time period
Can both be calculated directly if closed cohort
CS + CI =?
1
CI curve vs CS curve
CI curve is the proportion of subjects who have experienced the event as a function of time since baseline
CS curve is the proportion of subjects who have NOT experienced the event as a function of time since baseline
Median survival time
where CI = CS = .5
How to plot cumulative incidence/survival as a step function
- rank survival times from lowest to highest
- create intervals that start when one or more events occur
- calculate cumulative incidence during interval
- calculating cumulative survival would just be subtraction/total instead of addition/total
Cumulative Incidence in open cohort
cumulative incidence will be underestimated because the direct calculation assumes that those who withdrew, were lost to follow-up, or died of other causes did not experience the outcome.
Kaplan Meier method
- Rank survival times from lowest to highest
- divide survival time into intervals that start when one or more events occur (ei = # of events, ci = # censored) and calculate the # at risk at the start of each interval (ni)
- calculate probability of surviving each interval (pi = (ni-ei)/ni)
- calculate cumulative survival at each interval (Si = Si−1 × pi), with S0 = 1
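The steps above can be sketched as a small function (a minimal version; it follows the usual convention that events at time t occur before censorings at t):

```python
def kaplan_meier(subjects):
    """subjects: list of (time, event) pairs, event = 1 for outcome,
    0 for censored. Returns [(time, S(t))] at each event time."""
    subjects = sorted(subjects)
    n_at_risk = len(subjects)
    surv = 1.0                       # S0 = 1
    curve = []
    i = 0
    while i < len(subjects):
        t = subjects[i][0]
        events = removed = 0
        # group all subjects sharing this time
        while i < len(subjects) and subjects[i][0] == t:
            events += subjects[i][1]
            removed += 1
            i += 1
        if events:                   # intervals start when >= 1 event occurs
            surv *= (n_at_risk - events) / n_at_risk   # pi = (ni - ei)/ni
            curve.append((t, surv))
        n_at_risk -= removed         # censored subjects leave the risk set
    return curve
```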
Censoring
Termination of follow-up for a subject on a specified date because it is unknown whether the outcome occurred or would have occurred after that date.
Kaplan-Meier survival estimates calculated cumulative incidence/survival taking censoring into account, but assumes that censoring is unbiased
Log-rank test
Compares K-M curves for 2 or more groups.
Stratified log rank test
Compares K-M curves for 2 or more groups using stratification to control for confounding
limitation: method breaks down if data becomes too sparse
Interpreting and presenting K-M curves
The further to the right, the fewer subjects at risk and the more uncertainty
Good practice to end the plot at a follow-up time when only 10-20% of subjects are still at risk.
Two main survival analysis methods
KM survival curves (descriptive and cannot readily calculate RR or adjust for multiple covariates) and cox proportional hazards regression
Cox proportional hazards regression
allows baseline hazard to vary over time
assumes the hazard ratio is constant over time which is equivalent to stating that the exposure-outcome relationship is NOT modified by follow-up time (therefore not an effect modifier) - if PH assumption is violated then follow-up time is a modifier and stratification by follow-up time would be needed
allows adjustment of multiple covariates and provides an RR
Hazard
the instantaneous incidence rate at a point in time (change in the number of new cases at that time point) - approximately the slope of the cumulative incidence curve at that point.
Incidence rate could change with time
When is Proportional Hazards assumption not met
When the hazard (or survival) curves for the groups cross one another.
Cause-specific mortality is always ___ overall mortality
less than or equal to
cause-specific survival is always _____ overall survival
more than or equal to
Measuring cause-specific mortality is ____ logistically challenging than measuring overall mortality
more
Methods for assessing cause-specific mortality
direct methods (gold standard) - determine the cause of death for each decedent. This can be done by review of medical records or death certificates; medical record review is preferred.
indirect methods - take overall-mortality estimates and apply a correction to them, in order to estimate the number of deaths due to a specific cause - through relative survival
Relative survival
Provides an estimate of cause-specific survival in a cohort. corrects for deaths from causes other than the disease under study
RS = observed OS/expected OS
If expected OS = 1, RS = observed OS
If expected OS < 1, RS ≥ observed OS
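The RS formula as code (the example values are hypothetical):

```python
def relative_survival(observed_os, expected_os):
    """RS = observed overall survival / expected overall survival,
    where expected OS comes from a comparison population of the same
    demographics and calendar period."""
    return observed_os / expected_os

# e.g. observed 5-year OS = 0.60, expected OS = 0.80 -> RS = 0.75
```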
Expected OS
Usually expected OS of person of the same demographics and calendar period from publicly available vital statistics data
Key assumption: OS in the diseased cohort would be the same as the OS of the comparison population, if the cohort members did not have the disease (assuming that the only difference between the two cohorts is the disease).
Effect modification
Variation in the magnitude of the association between an exposure and an outcome across strata of a second exposure (the effect modifier)
Has an underlying public health, clinical, biologic, or psychosocial basis. Not merely a statistical phenomenon.
Can be assessed through stratified analysis and multivariable models
Effect modification is reciprocal since there is an interaction
Effect modification via stratified analysis
If, after stratifying, the RRs across strata are not similar, there is potential effect modification. If this is the case, you can then calculate a p-value for heterogeneity (interaction) (a likelihood ratio test).
If p-value for heterogeneity is significant, then effect modification/interaction, if not then no effect modification/interaction.
calculate the p-value of the interaction term (if multiple, of the interaction terms in aggregate)