Epi Methods 753 Flashcards
2 categories where 1 is reference group (typically “unexposed”)
Dichotomous Variable
Parameterization of variable into discrete categories
Categorical Variable
Categorical variable that assigns 0 or 1
Binary Variable
Categorical variable that doesn’t have ordering/order not of interest; collection of k-1 binary indicator variables
Nominal Variable
Categorical variable that has ordering/order of interest; collection of binary variables assigned score; step between categories constrained to be equal
Ordinal Variable
Test Ho that B=0, where B is coefficient for category score variable; if p<0.05 best estimate for step from one category to next is different from 0
Mantel Test for Trend
Variable can take any value between lower & upper limit
Continuous Variable
Divide continuous variable by factor; coefficient of variable affected
Rescaling
Subtract constant (e.g. mean) from continuous variable; intercept affected
Centering
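A minimal sketch of the rescaling and centering cards, assuming a simple one-predictor least-squares fit. The ages and blood pressures are made up for illustration, and `ols` is a hypothetical helper, not a library function.

```python
def ols(x, y):
    """Return (intercept, slope) of a simple least-squares regression of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Made-up data: age in years and systolic blood pressure in mmHg.
age = [30, 40, 50, 60, 70]
sbp = [118, 123, 131, 134, 144]
mean_age = sum(age) / len(age)

b0, b1 = ols(age, sbp)                                      # per 1 year
b0_scaled, b1_scaled = ols([a / 10 for a in age], sbp)      # per 10 years
b0_centered, b1_centered = ols([a - mean_age for a in age], sbp)

print(b0, b1)                    # original fit
print(b0_scaled, b1_scaled)      # slope is 10x larger, intercept unchanged
print(b0_centered, b1_centered)  # slope unchanged, intercept = mean outcome at mean age
```

Rescaling changes only the unit in which the coefficient is read; centering moves the intercept to a meaningful reference value.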
Closely related to counterfactual; compare observed outcome to non-observed (counterfactual) outcome; estimate measures of causal effect by measures of association assuming exchangeability (differences due to confounding)
Potential Outcomes
Observed outcomes in unexposed are good stand-in for unobserved potential outcomes for exposed persons under no exposure & vice versa; not testable but met in expectation with randomization
Exchangeability Assumption
Comparison of pre-treatment covariates in exposed & unexposed groups; comparability doesn’t guarantee assumption met
Exchangeability Assessment
Relax exchangeability assumption to be conditional on covariates; assumes no unmeasured confounders
Conditional Exchangeability
Used to assess how average value of continuous outcome varies systematically with X’s; E[Y] = B0+B1X1+…; B1=average difference (cross-sectional) or change (longitudinal) in Y per 1-unit X1
Linear Regression
Used to assess how log odds binary outcome varies systematically with X’s; log(odds Y)=B0+B1X1+…; B1=difference in log(odds Y) per 1-unit X1; PrOR for cross-sectional or ROR for longitudinal
Logistic Regression
Risk or prevalence > 10%
OR Overestimates RR or PrR
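A short illustration of why the odds ratio drifts away from the risk ratio once the outcome is common. The risks below are invented; both scenarios have the same risk ratio of 2.

```python
def risk_ratio(r1, r0):
    """Risk in exposed divided by risk in unexposed."""
    return r1 / r0

def odds_ratio(r1, r0):
    """Odds in exposed divided by odds in unexposed."""
    return (r1 / (1 - r1)) / (r0 / (1 - r0))

# (risk in exposed, risk in unexposed); made-up values.
rare = (0.02, 0.01)
common = (0.40, 0.20)

rr_rare, or_rare = risk_ratio(*rare), odds_ratio(*rare)
rr_common, or_common = risk_ratio(*common), odds_ratio(*common)

print(rr_rare, or_rare)      # OR is close to RR when the outcome is rare
print(rr_common, or_common)  # OR is farther from the null when the outcome is common
```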
Used to assess how log probability binary outcome varies systematically with X’s; log(Pr(Y=1))=B0+B1X1+…; B1=difference in log(prob Y) per 1-unit X1; PrR for cross-sectional or RR for longitudinal
Log-Binomial Regression
Path from E to O that starts with E & all arrows point in same direction
Causal Path
Any other path from E to O; unconditionally open backdoor paths are confounded vs. unconditionally closed backdoor paths are blocked at collider
Non-Causal Path
Covariate set that leaves all causal paths open & non-causal paths closed vs. does this without any extra variables
Sufficient vs. Minimally Sufficient
Variables only causally associated with exposure; decreases precision if put into model
Instrument
Not necessary for confounder control but may increase precision
Variables Associated with Outcome
Confounding is causal concept but collapsibility is statistical concept; depends on prevalence of outcome & type of measure of association
Problems with Collapsibility Definition
Stratify table by exposure, do not include outcome, & do not include p-values
Causal Inference Table 1
Give similar results when number of confounders is small & no confounders are continuous
Stratified Analysis vs. Regression
Expresses incomplete adjustment of confounding variables due to mismeasurement or misspecification
Residual Confounding
Prioritize accurate representation & interpretation of exposure but fit for confounders
Causal Inference Modeling
2 or more risk factors modify effect of each other with regard to occurrence/level of outcome; effect of E on O differs across strata of X; potential outcomes indexed by E only & estimated conditional on X (1 exchangeability assumption)
Effect Measure Modifier
Risk of O in presence of both E & X differs from what would be expected based on effect of E alone & X alone; potential outcomes indexed by both E & X (2 exchangeability assumptions)
Causal Interaction
Difference of risk differences (linear), ratio of odds ratios (logistic), or ratio of risk ratios (log-binomial)
Coefficient of Product Term
Difference of risk differences expressed as proportion of reference risk (R00); RERI=RR11-RR10-RR01+1
Relative Excess Risk due to Interaction (RERI)
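A worked sketch of RERI = RR11 - RR10 - RR01 + 1, checked against the difference of risk differences divided by the reference risk. The four stratum risks are invented, and which subscript denotes E versus X is an assumption (the formula is symmetric in the two).

```python
def reri(r00, r10, r01, r11):
    """RERI from the four stratum risks, using risk ratios relative to r00."""
    rr10, rr01, rr11 = r10 / r00, r01 / r00, r11 / r00
    return rr11 - rr10 - rr01 + 1

# Made-up risks: neither factor, first only, second only, both.
r00, r10, r01, r11 = 0.05, 0.10, 0.15, 0.30

reri_value = reri(r00, r10, r01, r11)

# Same quantity computed as a difference of risk differences over the reference risk.
diff_of_rds = (r11 - r01) - (r10 - r00)
reri_from_rds = diff_of_rds / r00

print(reri_value, reri_from_rds)  # a value above 0 suggests more-than-additive joint effect
```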
P-value of Wald test for interaction coefficient; LRT (or F-test for linear models); underpowered & likely to return false negatives
Test of Homogeneity
Used to identify potential associations with outcome; hypothesis generating; non-causal, potential for multiple comparisons, different across studies
Risk Factor Analysis
Stratify table by outcome, no p-values
Risk Factor Analysis Table 1
Prioritize interpretability & model fit of all covariates
Risk Factor Analysis Modeling
Assign predicted probability of condition based on baseline characteristics; use logistic regression then convert to probability
Prediction Model
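A minimal sketch of the "logistic regression then convert to probability" step: the linear predictor is a log odds, and the inverse logit turns it into a predicted probability. The intercept and coefficients are hypothetical numbers, not estimates from any real model.

```python
import math

def predicted_probability(intercept, coefs, x):
    """Inverse logit of the linear predictor intercept + sum(coef * x)."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical model: log odds = log(1/3) + log(3) * smoker
intercept = math.log(1 / 3)
coefs = [math.log(3)]

p_nonsmoker = predicted_probability(intercept, coefs, [0])  # odds 1/3
p_smoker = predicted_probability(intercept, coefs, [1])     # odds 1
print(p_nonsmoker, p_smoker)
```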
Use baseline characteristics to predict current disease stage; useful if gold standard is invasive or expensive; assessed against gold standard
Diagnostic Model
Use baseline characteristics to predict future disease state; assessed by future outcomes data
Prognostic Model
Degree of closeness of measured/predicted quantity to actual/gold standard value
Accuracy
Accuracy of output from prediction model applied to data used to develop model; calibration & discrimination
Model Accuracy
Accuracy of output from prediction model applied to data not used to develop model; calibration & discrimination
Model Predictive Accuracy
Ability to correctly estimate disease state or risk/probability of future event
Calibration
Ability to separate persons with/without disease or various disease states
Discrimination
Prioritize model fit & parsimony
Prediction Modeling
Continuous outcome; R2 closer to 1 vs. intercept = 0 & slope = 1
Good Discrimination vs. Calibration
Plot sensitivity vs. 1-specificity for binary outcome; each point corresponds to different cutoff of what defines “positive test”
ROC Curve
Area under ROC curve; if one person with & one without disease were randomly selected, probability that person with disease has higher predicted probability
C-Statistic
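The ROC and C-statistic cards can be sketched together. Assuming a small made-up set of outcomes and predicted probabilities, the code builds ROC points over all cutoffs, takes the trapezoidal area, and compares it with the pairwise definition (probability that a randomly chosen case has a higher prediction than a randomly chosen non-case, ties counted as one half). All function names are illustrative.

```python
def roc_points(y, p):
    """(1 - specificity, sensitivity) at each cutoff, where 'positive' means p >= cutoff."""
    pos = sum(y)
    neg = len(y) - pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(p), reverse=True):
        tp = sum(1 for yi, pi in zip(y, p) if pi >= cutoff and yi == 1)
        fp = sum(1 for yi, pi in zip(y, p) if pi >= cutoff and yi == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def c_statistic(y, p):
    """Share of case/non-case pairs in which the case has the higher prediction."""
    cases = [pi for yi, pi in zip(y, p) if yi == 1]
    controls = [pi for yi, pi in zip(y, p) if yi == 0]
    score = sum(1.0 if c > n else 0.5 if c == n else 0.0
                for c in cases for n in controls)
    return score / (len(cases) * len(controls))

# Made-up outcomes (1 = disease) and predicted probabilities.
y = [1, 1, 0, 1, 0, 0, 1, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

auc = auc_trapezoid(roc_points(y, p))
c_stat = c_statistic(y, p)
print(auc, c_stat)
```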
Measure of calibration for binary outcomes; measures closeness of distributions of observed & predicted values; tests Ho that observed=expected, p<0.05 indicates poor fit
Hosmer-Lemeshow X2 Goodness-of-Fit Test
Stratify by dataset, no p-values
Prediction Table 1
Describes models whose output reflects statistical “noise” in particular dataset rather than underlying, stable relationships that may be reproducible
Overfitting
Correlation between predictors high enough to degrade precision of regression coefficient estimates substantially for some/all correlated predictors; do not tolerate VIF > 10
Collinearity
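For intuition on the VIF threshold, a sketch of the two-predictor special case, where VIF = 1/(1 - r^2) and r is the correlation between the two predictors. With more predictors, r^2 would instead come from regressing each predictor on all the others. The data are made up.

```python
def pearson_r(x, z):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz / (sxx * szz) ** 0.5

def vif_two_predictors(x, z):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson_r(x, z)
    return 1 / (1 - r ** 2)

x = [1, 2, 3, 4, 5]
z_moderate = [2, 1, 4, 3, 5]             # r = 0.8
z_near_copy = [1.1, 2.0, 2.9, 4.2, 5.0]  # almost a copy of x

vif_moderate = vif_two_predictors(x, z_moderate)
vif_severe = vif_two_predictors(x, z_near_copy)
print(vif_moderate, vif_severe)  # only the second exceeds the VIF > 10 rule of thumb
```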
Measurement of predictive accuracy; discrimination/calibration of model on data not used to derive model
Validation
Predictive accuracy measured within same population (training set vs. validation set); e.g. split sample or k-fold cross-validation
Internal Validation
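A bare-bones sketch of how cross-validation partitions one dataset into training and validation sets. It assigns rows to folds by position without shuffling, which is a simplification; in practice rows are usually randomized first. `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for each of k folds over n rows."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train, validation

splits = list(k_fold_indices(10, 3))
for train, validation in splits:
    # The model would be fit on `train`; discrimination and calibration
    # would be measured on `validation`.
    print(train, validation)
```

Each row is used for validation exactly once and never appears in its own training set.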
Predictive accuracy measured within different population
External Validation
2-state, non-recurrent event
Outcome for Survival Analysis
Reflects beginning of time individuals biologically & methodologically at risk; elapsed time measured from this point (aligns individuals)
Time Origin
Yardstick by which time is measured; controls for that measurement of time
Time Metric
Time at beginning of individual’s observation in study
Entry Time
Time origin < study entry; assume individuals representative of all other participants & those who don’t enter at all
Late Entry
Time during which study outcome cannot occur because individual not under observation; downwardly biased outcome rate & upwardly biased survival curve
Immortal Person-Time
Exclusion of prevalent cases
Left Censoring
Individual did not experience outcome under follow-up & can’t be further observed (no longer methodologically at risk); administrative censoring, LTFU, or competing risk; assumed to be non-informative
Right Censoring
Assumption that risk of outcome at any given moment of follow-up is similar across individuals
Equivalence of Person-Time at Risk
Group of individuals aligned by time origin & at risk for event at time t; used for comparisons in survival analysis; assembled at each time of event (continuous) or period (discrete)
Risk Set
Instantaneous rate of event among those who survive without event to that time point; estimated using p(t)/width
Continuous Time Hazard
Conditional probability of event among those who survive without event to that time period; # events/# at risk; determines whether risk is increasing, decreasing, or constant
Discrete Time Hazard
Cumulative probability of surviving beyond time j; S(tj-1)(1-h(tj)) or S(tj-1)(1-p(tj)); plot using Kaplan-Meier
Survival
Cumulative probability of having event at or before time j; complement of survival function; plot using Kaplan-Meier
Cumulative Incidence
Cumulation of hazard between t0 & tj for individual; shape represents behavior of hazard function in continuous time; estimated using Kaplan-Meier –> plot -ln(S(tij))
Cumulative Hazard
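The survival, cumulative incidence, and cumulative hazard cards chain together. A sketch with made-up counts and no censoring, using S(tj) = S(tj-1)(1 - h(tj)), cumulative incidence = 1 - S(tj), and cumulative hazard estimated as -ln(S(tj)):

```python
import math

def life_table(events, at_risk):
    """Per-period hazard, survival, cumulative incidence, and cumulative hazard."""
    rows = []
    survival = 1.0
    for d, n in zip(events, at_risk):
        hazard = d / n               # discrete-time hazard: # events / # at risk
        survival *= (1 - hazard)     # S(tj) = S(tj-1) * (1 - h(tj))
        rows.append({
            "hazard": hazard,
            "survival": survival,
            "cum_incidence": 1 - survival,
            "cum_hazard": -math.log(survival),
        })
    return rows

# Made-up risk sets: 100 at risk, then 90, then 72 (no censoring assumed).
table = life_table(events=[10, 18, 18], at_risk=[100, 90, 72])
for row in table:
    print(row)
```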
One record for each person-period when individual at risk (often multiple rows of data per person); define late entries, exclude person-time prior to study entry or after study exit, & identify gaps
Discrete Time Data Setup
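A sketch of the person-period layout, assuming period j covers the interval (j-1, j] and that a person contributes a row only for periods after study entry. The function and field names are hypothetical.

```python
def person_period_rows(person_id, entry, exit_time, had_event):
    """One row per period at risk; the event flag is set only on the last row."""
    rows = []
    for period in range(entry + 1, exit_time + 1):
        rows.append({
            "id": person_id,
            "period": period,
            "event": int(had_event and period == exit_time),
        })
    return rows

# Person A: enters at time origin, has the event in period 3.
rows_a = person_period_rows("A", entry=0, exit_time=3, had_event=True)
# Person B: late entry at time 2, censored at the end of period 4.
rows_b = person_period_rows("B", entry=2, exit_time=4, had_event=False)

for row in rows_a + rows_b:
    print(row)
```

Person B contributes no rows for periods 1 and 2, so person-time before study entry is excluded from the risk sets.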
Models discrete-time hazard function for truly discrete hazard; log hazard odds=[a1D1+…]+BXi; aj=log hazard odds for time period j when X’s=0 (estimates hazard in each time period); B=log hazard OR in exposed vs. unexposed
Pooled Logistic Regression
Truly discrete hazard (hazard is conditional probability & constant within each time period) & proportional hazard odds (hazard OR constant across periods)
Assumptions of Pooled Logistic Regression
Models discrete-time hazard function for underlying continuous event processes; ln(-ln(1-h(tij|Xij)))=[a1D1+…]+BXij; B=log HR outcome in exposed vs. unexposed
Discrete Time Proportional Hazards Regression (cloglog)
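A numerical check of why the cloglog link recovers a log hazard ratio while the logit link gives a log hazard odds ratio. It assumes a constant continuous-time rate within one period, so the discrete hazard is 1 - exp(-rate x width); the rates are made up.

```python
import math

def logit(h):
    return math.log(h / (1 - h))

def cloglog(h):
    return math.log(-math.log(1 - h))

def discrete_hazard(rate, width=1.0):
    """Probability of the event within one period, given a constant rate."""
    return 1 - math.exp(-rate * width)

# Made-up continuous-time rates with a true hazard ratio of 2.
h_unexposed = discrete_hazard(0.10)
h_exposed = discrete_hazard(0.20)

b_cloglog = cloglog(h_exposed) - cloglog(h_unexposed)  # equals ln(2), the log HR
b_logit = logit(h_exposed) - logit(h_unexposed)        # log hazard OR, farther from 0

print(math.exp(b_cloglog), math.exp(b_logit))
```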
Continuous-time hazard & proportional hazards
Assumptions of Discrete Time Proportional Hazards Regression
One record for each individual (can be multiple if time-varying covariates); define late entries & exclude person-time prior to study entry or after study exit
Continuous Time Data Setup
Models continuous-time hazard function; log(h(t))=log(h0(t))+B1X1+…; B1=log HR outcome in exposed vs. unexposed; semi-parametric; sensitive to ties
Cox Proportional Hazards Regression
Shape of hazard allowed to vary & proportional hazards
Assumptions of Cox Proportional Hazards Regression
Parallel lines for plot of ln(H(t)), i.e. ln(-ln(S(t))), vs. time or ln(time); horizontal line or correlation of 0 for plot of Schoenfeld residuals vs. time
Assessing Proportional Hazards Assumption
Tests Ho of no difference between survival functions; p<0.05 indicates survival differs in at least 1 group
Log-Rank Test
Unit of analysis is time period in which variable is constant; include additional rows of data for each transition time
Time-Varying Covariates
Conditional logistic regression to calculate matched OR (discordant pairs); rare disease assumption met & OR is valid estimate of HR (representative subsample & cohort is reasonable size)
Analysis for Nested Case-Control Studies
Cox proportional hazards regression with late entries for cases outside subcohort; rare disease assumption met (cohort reasonable size & few ties)
Analysis for Case-Cohort Studies
Used to assess how log incidence rate of count outcome varies systematically with X’s; log(IRk)=uj+B0+B1X1+… or log(A)=uj+B0+B1X1+…+log(T); B1=difference in log(IR Y) per 1-unit X1 (same as log IRR)
Poisson Regression
Equivalent to hazard when hazard is constant or average hazard when hazard isn’t constant
Incidence Rate
Communication, no meaningful time origin, no multi-level data, or outcome is count
Reasons to Estimate IR
ln(person-time) in Poisson Regression
Offset
Each row corresponds to one bin of person-time; each row needs covariate(s) values, # events, & amount of person-time
Poisson Data Setup
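A sketch of the binned layout and the role of the offset, using two made-up bins and one binary covariate. With this saturated setup the Poisson coefficients can be written down directly (B0 = log IR in unexposed, B1 = log IRR), so no fitting routine is needed; the expected count is exp(B0 + B1X + log(T)).

```python
import math

# One row per bin of person-time: covariate value, # events, person-time.
bins = [
    {"exposed": 0, "events": 12, "person_time": 1200.0},
    {"exposed": 1, "events": 30, "person_time": 1000.0},
]

def incidence_rate(row):
    return row["events"] / row["person_time"]

ir_unexposed = incidence_rate(bins[0])
ir_exposed = incidence_rate(bins[1])

b0 = math.log(ir_unexposed)               # log IR when X = 0
b1 = math.log(ir_exposed / ir_unexposed)  # log IRR

def expected_events(row):
    """Model-based count: log(person-time) enters as an offset with coefficient 1."""
    return math.exp(b0 + b1 * row["exposed"] + math.log(row["person_time"]))

irr = math.exp(b1)
print(irr, [expected_events(r) for r in bins])
```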
Constant multiplicative effect, constant average hazard, mean=variance
Assumptions of Poisson Regression
Used to assess how log incidence rate of count outcome varies systematically with X’s; relaxes mean=variance assumption using dispersion parameter (a); log(IRk)=uj+B0+B1X1+… or log(A)=uj+B0+B1X1+…+log(T); B1=difference in log(IR Y) per 1-unit X1 (same as log IRR)
Negative Binomial Regression
LRT with Ho: a=0; if p<0.05 then use NB
Evaluating Overdispersion
Variance>mean in dataset where outcome assumed to be Poisson distributed; may occur if confounder not included in model or outcomes correlated across time bins; can produce underestimated SE & overestimated test statistics
Overdispersion
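A crude descriptive check for overdispersion, comparing the sample variance of counts with their mean. This is only a marginal screen on made-up counts; it does not condition on covariates and is not the likelihood ratio test of the dispersion parameter.

```python
def mean_and_variance(counts):
    """Sample mean and sample variance (n - 1 denominator)."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return mean, variance

# Made-up event counts per bin.
counts = [0, 1, 0, 7, 2, 0, 9, 1, 0, 0]

mean, variance = mean_and_variance(counts)
dispersion_ratio = variance / mean  # about 1 under Poisson; well above 1 here

print(mean, variance, dispersion_ratio)
```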