18. Exploratory and Confirmatory Data Analysis Flashcards
When should an exploratory analysis be used?
Conducted when:
1. You have a hypothesis but no clear analysis plan
2. You have many variables that may be associated with an outcome variable, but no clear prediction about how
(Can provide tools for hypothesis generation via visualisation)
What are the issues with exploratory data analysis?
- No clear prediction about how variables are related
- Not obvious which predictors are important to the question
- The hypothesis can be tested in multiple ways
How can exploratory analysis be done wrong?
- Post-hoc analysis is damaging: when the starting hypothesis isn't supported, results get reframed after the fact as if they were predicted all along
- P-values are conditional on the intended sample size and stopping rule (raising issues of overfitting and multiple comparisons)
- Done poorly = only reporting the variables that happen to come out significant, which leads to overfitting and to papers that only produce the desired outcome (see the simulation sketch after this list)
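A minimal simulation sketch of that last point, on hypothetical pure-noise data (NumPy and SciPy assumed; they are not part of the original notes): testing many predictors and reporting only the significant ones makes noise look like signal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_predictors = 100, 20

# Outcome and predictors are all pure noise: there are no true effects.
y = rng.normal(size=n)
X = rng.normal(size=(n, n_predictors))

# Test each predictor separately, then keep only the "significant" ones.
p_values = [stats.pearsonr(X[:, j], y)[1] for j in range(n_predictors)]
significant = [p for p in p_values if p < 0.05]

# With 20 tests at alpha = .05 we expect about one false positive on
# average; reporting only those presents noise as a finding.
print(f"{len(significant)} of {n_predictors} predictors came out 'significant'")
```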
What are the steps of exploratory data analysis?
- Check the coding of the data and visualise the variables of interest
- Compare models of interest to see whether your variables of interest are good predictors of your outcome variable
- Compute the K-fold cross-validation MSE for each of your models (sketched after this list)
- Identify the best-fitting model
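A sketch of the model-comparison step, assuming scikit-learn and made-up data (the flashcards don't specify any tooling): candidate models of increasing flexibility are scored by 5-fold cross-validated MSE, and the lowest score marks the best-fitting model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(150, 1))
y = 0.5 * x.ravel() ** 2 + rng.normal(scale=1.0, size=150)  # true curve + noise

# Candidate models of interest: polynomial fits of increasing flexibility.
for degree in (1, 2, 8):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold cross-validated MSE (scikit-learn reports it negated).
    mse = -cross_val_score(model, x, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: CV MSE = {mse:.2f}")
# The best-fitting model is the one with the lowest cross-validated MSE.
```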
What is overfitting?
A statistical model fits sample-specific noise as if it were signal.
Overfitting refers to a model that models the training data too well: it learns the detail and noise in the training data to the extent that this hurts its performance on new data.
You can't trust estimates evaluated on the same sample the model was trained on (demonstrated below).
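A minimal demonstration, again with scikit-learn on hypothetical data: a deliberately over-flexible polynomial fit looks excellent on the training sample and poor on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(60, 1))
y = x.ravel() + rng.normal(scale=1.0, size=60)  # the true signal is linear

x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

# A very flexible model fits the training noise as if it were signal...
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(x_tr, y_tr)

# ...so error looks small in-sample and blows up out-of-sample.
print("train MSE:", mean_squared_error(y_tr, model.predict(x_tr)))
print("test MSE: ", mean_squared_error(y_te, model.predict(x_te)))
```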
How does R2 change in overfitting?
Optimistic: R2 is inflated, claiming the model accounts for more of the DV than it really does out of sample (one driver of the replication crisis, where studies don't find the same results). The sketch below illustrates this.
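A sketch of that optimism on hypothetical noise-only data (scikit-learn assumed): with many predictors relative to sample size, in-sample R2 looks impressive while out-of-sample R2 collapses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 50, 30  # many predictors relative to sample size
X = rng.normal(size=(n, p))  # predictors unrelated to the outcome
y = rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# .score() returns R^2: high in-sample, near zero or negative out-of-sample.
print("in-sample R^2:    ", model.score(X_tr, y_tr))
print("out-of-sample R^2:", model.score(X_te, y_te))
```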
What is p-hacking?
A special case of overfitting.
Procedural overfitting: it takes place in parallel to model estimation (during data cleaning, outlier exclusion, etc.)
The exploitation of flexibility in data analysis to discover patterns that can be presented as statistically significant when in reality there is no underlying effect (see the sketch below).
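One illustrative form of p-hacking is optional stopping. This simulation sketch (SciPy assumed; the flashcards don't name a specific procedure) repeatedly peeks at a t-test on two identical groups and stops as soon as p < .05, inflating the false positive rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_max, alpha, n_sims = 100, 0.05, 1000
false_positives = 0

# Two groups drawn from the SAME distribution: the true effect is zero.
for _ in range(n_sims):
    a, b = rng.normal(size=n_max), rng.normal(size=n_max)
    # "Peek" after every 10 participants and stop as soon as p < .05.
    for n in range(10, n_max + 1, 10):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
            false_positives += 1
            break

# Well above the nominal 5%, because the test was run many times.
print(f"false positive rate: {false_positives / n_sims:.2%}")
```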
What is the best way to avoid overfitting when exploring data?
Understand and manage the bias-variance trade-off.
What is bias?
Tendency for a model to consistently produce answers that are wrong in a particular direction
What is variance?
Extent to which a model’s fitted parameters will tend to deviate from central tendency across different data sets
What are the two different approaches in the bias-variance trade-off? Which one is preferred?
- Liberal, flexible data analysis (low bias but high variance): almost any pattern can be detected, at the cost of a high rate of spurious identifications (exploratory data analysis)
- A rigid approach that keeps variance low: only a limited range of patterns can be identified, so the risk of pattern hallucination is low (confirmatory data analysis)
Use the first approach for exploration, but control its variance (e.g. with cross-validation), otherwise you are just assessing noise. The sketch below contrasts the two.
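A sketch of the trade-off on hypothetical data (scikit-learn assumed): refitting a rigid and a flexible model on many fresh samples shows the rigid model is systematically off at a test point (bias) while the flexible one scatters widely around it (variance).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)
x0 = np.array([[2.0]])  # evaluation point; the true value there is 2**2 = 4

def simulate_fit(degree):
    """Refit a polynomial model on many fresh data sets; collect predictions at x0."""
    preds = []
    for _ in range(200):
        x = rng.uniform(-3, 3, size=(40, 1))
        y = x.ravel() ** 2 + rng.normal(scale=1.0, size=40)  # true curve is quadratic
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(x, y).predict(x0)[0])
    return np.array(preds)

for degree, label in ((1, "rigid (high bias, low variance)"),
                      (12, "flexible (low bias, high variance)")):
    preds = simulate_fit(degree)
    print(f"degree {degree:2d}, {label}: "
          f"bias = {preds.mean() - 4.0:+.2f}, variance = {preds.var():.2f}")
```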
How does what you do in exploratory research help tell you what to do in confirmatory research?
- Data sets are large enough to support training models
- Accurately estimate prediction error to assess performance and improve the model
- Exert control over bias-variance trade-off when appropriate
What is cross-validation?
To assess models, you need to quantify out-of-sample prediction error.
Cross-validation is a resampling method that uses different portions of the data to train and test a model across different iterations (see the sketch below).
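A look at the resampling mechanics, assuming scikit-learn's KFold and invented data: across the five iterations, each observation serves as test data exactly once.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=100)

# Each iteration trains on 4 folds and tests on the held-out 5th fold,
# so every observation is used for testing exactly once.
fold_mse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("out-of-sample MSE per fold:", np.round(fold_mse, 2))
print("cross-validated MSE:", np.mean(fold_mse).round(2))
```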
What is canonical cross-validation?
Classic replication.
A model fitted on one data set is then tested on a completely independent data set.
More common in experimental designs than in correlational research (sketched below).
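A minimal sketch of canonical cross-validation, with a hypothetical shared data-generating process standing in for two independent studies (scikit-learn assumed): the model is fitted on the original study and then evaluated, untouched, on the replication.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)

def draw_dataset(n):
    """Hypothetical data-generating process shared by both studies."""
    X = rng.normal(size=(n, 2))
    y = X @ np.array([0.8, -0.3]) + rng.normal(size=n)
    return X, y

# Fit on the original study, then test on a completely independent replication.
X_study, y_study = draw_dataset(200)
X_replication, y_replication = draw_dataset(200)

model = LinearRegression().fit(X_study, y_study)
print("R^2 in original study:   ", r2_score(y_study, model.predict(X_study)))
print("R^2 in independent study:", r2_score(y_replication, model.predict(X_replication)))
```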
What are some issues that can prevent canonical cross-validation from occurring?
Sometimes you can't collect enough data: the population is limited, the funds to collect more data are limited, or the one giant study you want to build on was only run once.