Week 2 - Propensity Score Estimation Flashcards
What are the steps of propensity score estimation?
1. Data preparation
2. Propensity score estimation
3. Propensity score method implementation
4. Covariate balance evaluation
5. Treatment effect estimation
6. Sensitivity analysis
How can the success of propensity score estimation be determined?
a) The propensity score estimation converged to a solution.
b) Common support is adequate for estimation of the treatment effect of interest.
c) Adequate covariate balance is obtained. (Checks for b and c are sketched below.)
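The convergence check (a) depends on the estimation routine, but (b) and (c) can be checked numerically. A minimal sketch of those two checks, assuming a pandas DataFrame df with hypothetical columns "treat" (binary treatment indicator) and "ps" (estimated propensity score), plus covariate names in a list covariates:

```python
import numpy as np
import pandas as pd

def common_support(df):
    """Region where treated and control propensity scores overlap."""
    treated = df.loc[df["treat"] == 1, "ps"]
    control = df.loc[df["treat"] == 0, "ps"]
    return (max(treated.min(), control.min()),
            min(treated.max(), control.max()))

def standardized_mean_differences(df, covariates):
    """Standardized mean difference per covariate; absolute values
    below roughly 0.1 are commonly taken to indicate balance."""
    treated = df[df["treat"] == 1]
    control = df[df["treat"] == 0]
    smd = {}
    for cov in covariates:
        pooled_sd = np.sqrt((treated[cov].var() + control[cov].var()) / 2)
        smd[cov] = (treated[cov].mean() - control[cov].mean()) / pooled_sd
    return pd.Series(smd)
```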
What are true confounders?
A true confounder is a covariate that has a direct effect on the probability of treatment assignment and a direct effect on the outcome.
What is the consequence of including variables that are related to the outcome but not to treatment assignment in the propensity score model?
Including variables related to the outcome but not to exposure (treatment assignment) in the PS model does not affect bias but decreases the variance of the estimated treatment effect.
What is the consequence of including variables that are related to the treatment assignment but not to the outcome in the propensity score model?
Including variables related to exposure (treatment assignment) but not to the outcome does not affect bias but increases the variance of the estimated treatment effect.
What are three strategies that can be used to select covariates for the propensity score model?
- Theoretical analysis of factors influencing the selection mechanism and their relationship with outcomes.
- Pilot study focused on identifying the selection mechanism.
- Expert reviews and interviews with participants and other persons knowledgeable about the selection process.
- Use a sub-sample of the original data.
Why is it important not to use the outcome data in the process of selecting covariates for the propensity score model?
To maintain researcher objectivity in the implementation of propensity score methods, and to parallel the design of randomized experiments.
What are two strategies to use multiple imputation to deal with missing data in the propensity score estimation process?
Multiple imputation of covariates, followed by averaging the multiple propensity scores to create a single propensity score vector (sketched below).
OR
Multiple imputation followed by separate propensity score analysis of each imputed dataset.
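A minimal sketch of the first strategy, assuming imputed_datasets is a list of pandas DataFrames produced by a multiple-imputation routine (same rows in the same order), each with a binary "treat" column and the covariates named in covariates; all names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def averaged_propensity_scores(imputed_datasets, covariates):
    """Fit one propensity score model per imputed dataset and average
    the scores into a single propensity score vector."""
    scores = []
    for df in imputed_datasets:
        model = LogisticRegression(max_iter=1000)
        model.fit(df[covariates], df["treat"])
        scores.append(model.predict_proba(df[covariates])[:, 1])
    return np.mean(scores, axis=0)
```

The second strategy would instead carry each imputed dataset through the full propensity score analysis and pool the resulting treatment effect estimates.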
Identify three methods that can be used to estimate propensity scores.
- Logistic regression (stats)
- Probit regression (stats)
- Classification trees (data mining)
- Boosting (data mining)
- Bagging (data mining)
- Random forests (data mining)
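A minimal sketch contrasting one estimator from each family on toy data (all variable names hypothetical); in each case the estimated propensity score is the predicted probability of treatment:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                        # toy covariates
treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # toy treatment indicator

# Statistical approach: probit regression
probit = sm.Probit(treat, sm.add_constant(X)).fit(disp=0)
ps_probit = probit.predict(sm.add_constant(X))

# Data-mining approach: boosting
boost = GradientBoostingClassifier().fit(X, treat)
ps_boost = boost.predict_proba(X)[:, 1]
```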
What is the main challenge for using logistic regression to estimate propensity scores?
If the model includes a large number of variables and covariate balance is not achieved on one or more of them, determining the cause of the problem (e.g., missing interaction effects) and specifying an appropriate revised model can prove difficult.
How do classification trees produce estimates of propensity scores?
- At each step, a node is split on a variable into two child nodes that are more homogeneous than the parent node with respect to the outcome variable (treatment assignment, when estimating propensity scores).
- The algorithm calculates the impurity reduction for every possible split and selects the split yielding the largest reduction.
- Variables may be used more than once, so trees automatically capture interactions and non-linear effects.
- The algorithm iteratively splits nodes until a stopping criterion is met (see the sketch below).
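A minimal sketch on toy data, using scikit-learn's DecisionTreeClassifier as one possible tree implementation (all names hypothetical); the estimated propensity score for a unit is the proportion of treated cases in its leaf:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                        # toy covariates
treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # treatment depends on X[:, 0]

# max_depth and min_samples_leaf act as stopping criteria; the default
# "gini" criterion measures node impurity.
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20)
tree.fit(X, treat)
ps_tree = tree.predict_proba(X)[:, 1]  # proportion treated in each leaf
```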
What are the limitations of classification trees for propensity score estimation?
The results have a high level of variability when many covariates are used, and single trees frequently produce poor estimates of propensity scores.
Classification trees tend to overfit the data, producing trees with many branches that reflect random variation in the data and do not cross-validate to other datasets.
What is the difference between classification trees and bagging?
Bagging (bootstrap aggregation) improves upon classification trees by growing a large number of trees on bootstrap samples of the same size as the original sample, taken with replacement. These trees use all available variables and are grown without pruning. The results are then combined into a composite tree, which is less affected by random variability in the data than a single tree.
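A minimal sketch on the same kind of toy data as the classification-tree sketch above (all names hypothetical); the averaged class-1 probabilities form the composite propensity score:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                        # toy covariates
treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # toy treatment indicator

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # unpruned trees using all variables
    n_estimators=500,
    bootstrap=True,            # bootstrap samples taken with replacement
    max_samples=1.0,           # each sample matches the original size
)
bag.fit(X, treat)
ps_bag = bag.predict_proba(X)[:, 1]  # composite propensity score
```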
What is the difference between bagging and random forests?
The random forest algorithm is similar to bagging, except that only a random subset of the complete set of variables is considered at each split.
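A minimal sketch; max_features is the only substantive change from the bagging sketch above (all names hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                        # toy covariates
treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # toy treatment indicator

forest = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",  # random subset of variables considered at each split
)
forest.fit(X, treat)
ps_forest = forest.predict_proba(X)[:, 1]
```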
What are the advantages of random forests over bagging?
Random forests prevent any single variable from dominating the others and ensure that all variables participate in building at least some of the trees (Strobl et al., 2009).