18. Exploratory and Confirmatory Data Analysis Flashcards
When should an exploratory analysis be used?
Conducted when:
1. You have a hypothesis but no clear analysis plan
2. You have many variables that may be associated with an outcome variable, but no clear prediction about how
(Can provide tools for hypothesis generation via visualisation)
What are the issues with exploratory data analysis?
- No clear prediction about how variables are related
- Not obvious which predictors are important to the question
- The hypothesis can be tested in multiple ways
How can exploratory analysis be done wrong?
- Post-hoc analysis is damaging: when the starting hypothesis isn't supported, results get reframed after the fact as if they were predicted all along
- P-values are conditional on the intended sample size and stopping rule (raising issues of overfitting and multiple comparisons)
- Done poorly = only reporting the variables that happen to come out significant, which leads to overfitting and to papers that only produce the desired outcome (see the simulation sketch after this list)
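A minimal simulation sketch of that last point, on hypothetical pure-noise data (NumPy and SciPy assumed; they are not part of the original notes): testing many predictors and reporting only the significant ones makes noise look like signal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_predictors = 100, 20

# Outcome and predictors are all pure noise: there are no true effects.
y = rng.normal(size=n)
X = rng.normal(size=(n, n_predictors))

# Test each predictor separately, then keep only the "significant" ones.
p_values = [stats.pearsonr(X[:, j], y)[1] for j in range(n_predictors)]
significant = [p for p in p_values if p < 0.05]

# With 20 tests at alpha = .05 we expect about one false positive on
# average; reporting only those presents noise as a finding.
print(f"{len(significant)} of {n_predictors} predictors came out 'significant'")
```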
What are the steps of exploratory data analysis?
- Check the coding of the data and visualise the variables of interest
- Compare models of interest to see whether your variables of interest are good predictors of your outcome variable
- Compute the K-fold cross-validation MSE for each of your models (sketched after this list)
- Identify the best-fitting model
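A sketch of the model-comparison step, assuming scikit-learn and made-up data (the flashcards don't specify any tooling): candidate models of increasing flexibility are scored by 5-fold cross-validated MSE, and the lowest score marks the best-fitting model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(150, 1))
y = 0.5 * x.ravel() ** 2 + rng.normal(scale=1.0, size=150)  # true curve + noise

# Candidate models of interest: polynomial fits of increasing flexibility.
for degree in (1, 2, 8):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold cross-validated MSE (scikit-learn reports it negated).
    mse = -cross_val_score(model, x, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: CV MSE = {mse:.2f}")
# The best-fitting model is the one with the lowest cross-validated MSE.
```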
What is overfitting?
A statistical model fits sample-specific noise as if it were signal.
Overfitting refers to a model that models the training data too well: it learns the detail and noise in the training data to the extent that this hurts its performance on new data.
You can't trust estimates evaluated on the same sample the model was trained on (demonstrated below).
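A minimal demonstration, again with scikit-learn on hypothetical data: a deliberately over-flexible polynomial fit looks excellent on the training sample and poor on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(60, 1))
y = x.ravel() + rng.normal(scale=1.0, size=60)  # the true signal is linear

x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

# A very flexible model fits the training noise as if it were signal...
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(x_tr, y_tr)

# ...so error looks small in-sample and blows up out-of-sample.
print("train MSE:", mean_squared_error(y_tr, model.predict(x_tr)))
print("test MSE: ", mean_squared_error(y_te, model.predict(x_te)))
```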
How does R2 change in overfitting?
Optimistic: R2 is inflated, claiming the model accounts for more of the DV than it really does out of sample (one driver of the replication crisis, where studies don't find the same results). The sketch below illustrates this.
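A sketch of that optimism on hypothetical noise-only data (scikit-learn assumed): with many predictors relative to sample size, in-sample R2 looks impressive while out-of-sample R2 collapses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 50, 30  # many predictors relative to sample size
X = rng.normal(size=(n, p))  # predictors unrelated to the outcome
y = rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# .score() returns R^2: high in-sample, near zero or negative out-of-sample.
print("in-sample R^2:    ", model.score(X_tr, y_tr))
print("out-of-sample R^2:", model.score(X_te, y_te))
```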
What is p-hacking?
A special case of overfitting.
Procedural overfitting: it takes place in parallel to model estimation (during data cleaning, outlier exclusion, etc.)
The exploitation of flexibility in data analysis to discover patterns that can be presented as statistically significant when in reality there is no underlying effect (see the sketch below).
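One illustrative form of p-hacking is optional stopping. This simulation sketch (SciPy assumed; the flashcards don't name a specific procedure) repeatedly peeks at a t-test on two identical groups and stops as soon as p < .05, inflating the false positive rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_max, alpha, n_sims = 100, 0.05, 1000
false_positives = 0

# Two groups drawn from the SAME distribution: the true effect is zero.
for _ in range(n_sims):
    a, b = rng.normal(size=n_max), rng.normal(size=n_max)
    # "Peek" after every 10 participants and stop as soon as p < .05.
    for n in range(10, n_max + 1, 10):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
            false_positives += 1
            break

# Well above the nominal 5%, because the test was run many times.
print(f"false positive rate: {false_positives / n_sims:.2%}")
```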
What is the best way to avoid overfitting when exploring data?
Understand and manage the bias-variance trade-off.
What is bias?
Tendency for a model to consistently produce answers that are wrong in a particular direction
What is variance?
Extent to which a model’s fitted parameters will tend to deviate from central tendency across different data sets
What are the two different approaches in the bias-variance trade-off? Which one is preferred?
- Liberal, flexible data analysis (low bias but high variance): almost any pattern can be detected, at the cost of a high rate of spurious identifications (exploratory data analysis)
- A rigid approach that keeps variance low: only a limited range of patterns can be identified, so the risk of pattern hallucination is low (confirmatory data analysis)
Use the first approach for exploration, but control its variance (e.g. with cross-validation), otherwise you are just assessing noise. The sketch below contrasts the two.
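A sketch of the trade-off on hypothetical data (scikit-learn assumed): refitting a rigid and a flexible model on many fresh samples shows the rigid model is systematically off at a test point (bias) while the flexible one scatters widely around it (variance).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)
x0 = np.array([[2.0]])  # evaluation point; the true value there is 2**2 = 4

def simulate_fit(degree):
    """Refit a polynomial model on many fresh data sets; collect predictions at x0."""
    preds = []
    for _ in range(200):
        x = rng.uniform(-3, 3, size=(40, 1))
        y = x.ravel() ** 2 + rng.normal(scale=1.0, size=40)  # true curve is quadratic
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(x, y).predict(x0)[0])
    return np.array(preds)

for degree, label in ((1, "rigid (high bias, low variance)"),
                      (12, "flexible (low bias, high variance)")):
    preds = simulate_fit(degree)
    print(f"degree {degree:2d}, {label}: "
          f"bias = {preds.mean() - 4.0:+.2f}, variance = {preds.var():.2f}")
```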
How does what you do in exploratory research help tell you what to do in confirmatory research?
- Data sets are large enough to support training models
- Accurately estimate prediction error to assess performance and improve the model
- Exert control over bias-variance trade-off when appropriate
What is cross-validation?
To assess models, you need to quantify out-of-sample prediction error.
Cross-validation is a resampling method that uses different portions of the data to train and test a model across different iterations (see the sketch below).
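A look at the resampling mechanics, assuming scikit-learn's KFold and invented data: across the five iterations, each observation serves as test data exactly once.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=100)

# Each iteration trains on 4 folds and tests on the held-out 5th fold,
# so every observation is used for testing exactly once.
fold_mse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("out-of-sample MSE per fold:", np.round(fold_mse, 2))
print("cross-validated MSE:", np.mean(fold_mse).round(2))
```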
What is canonical cross-validation?
Classic replication.
A model fitted on one data set is then tested on a completely independent data set.
More common in experimental designs than in correlational research (sketched below).
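A minimal sketch of canonical cross-validation, with a hypothetical shared data-generating process standing in for two independent studies (scikit-learn assumed): the model is fitted on the original study and then evaluated, untouched, on the replication.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(7)

def draw_dataset(n):
    """Hypothetical data-generating process shared by both studies."""
    X = rng.normal(size=(n, 2))
    y = X @ np.array([0.8, -0.3]) + rng.normal(size=n)
    return X, y

# Fit on the original study, then test on a completely independent replication.
X_study, y_study = draw_dataset(200)
X_replication, y_replication = draw_dataset(200)

model = LinearRegression().fit(X_study, y_study)
print("R^2 in original study:   ", r2_score(y_study, model.predict(X_study)))
print("R^2 in independent study:", r2_score(y_replication, model.predict(X_replication)))
```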
What are some issues that can prevent canonical cross-validation from occurring?
Sometimes you can't collect enough data: the population is limited, the funds to collect more data are limited, or the one giant study you want to build on was only run once.