18. Exploratory and Confirmatory Data Analysis Flashcards

1
Q

When is/should an exploratory analysis be used?

A

Conducted when:
1. You have a hypothesis but no clear analysis plan
2. You have many variables that may be associated with an outcome variable, but no clear prediction of how

(can provide tools for hypothesis generation via visualisation)

2
Q

What are the issues with exploratory data analysis?

A
  • No clear prediction about how the variables are related
  • Not obvious which predictors are important to the question
  • The hypothesis can be tested in multiple ways
3
Q

How can exploratory analysis be done wrong?

A
  • Without a starting hypothesis, results come from post-hoc analysis, which is damaging to their reliability
  • The p-value is conditional on the intended sample size (raising issues of overfitting and multiple comparisons)
  • Done poorly: only the variables that reach significance are reported (leading to overfitting and to papers that only produce the desired outcome)
4
Q

What are the steps of exploratory data analysis?

A
  1. Check the coding of the data and visualise the variables of interest
  2. Compare models of interest to find whether your variables of interest are good predictors of your outcome variable
  3. Compute the k-fold cross-validation MSE for each of your models (see the sketch after this list)
  4. Identify the best-fitting model
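
A minimal sketch of steps 2–4, assuming scikit-learn and pandas; the data, column names, and candidate models are hypothetical:

```python
# Compare candidate predictor sets by 10-fold cross-validated MSE.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: x1 genuinely predicts the outcome, x2 is pure noise.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["outcome"] = 2.0 * df["x1"] + rng.normal(size=200)

# Step 2: candidate models, each a different set of predictors.
models = {"x1 only": ["x1"], "x2 only": ["x2"], "x1 + x2": ["x1", "x2"]}

# Steps 3-4: compute each model's k-fold CV MSE; the lowest MSE wins.
for name, predictors in models.items():
    scores = cross_val_score(LinearRegression(), df[predictors], df["outcome"],
                             scoring="neg_mean_squared_error", cv=10)
    print(f"{name}: 10-fold CV MSE = {-scores.mean():.3f}")
```
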
5
Q

What is overfitting?

A

A statistical model fits sample-specific noise as if it were signal

Overfitting refers to a model that models the training data too well: it learns the detail and noise in the training data to the extent that this negatively impacts its performance on new data

You can't trust estimates based on a model that is evaluated on the same sample it was trained on

6
Q

How does R2 change in overfitting?

A

Optimistically: R² is higher, claiming the model accounts for more of the DV than it actually does (a contributor to the replication crisis, in which studies fail to find the same results)
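
A minimal sketch of this optimism, assuming scikit-learn: fitting many pure-noise predictors yields a high in-sample R² but a near-zero (or negative) out-of-sample R². All data here are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 40))  # 40 noise predictors, 50 cases
y_train = rng.normal(size=50)        # outcome unrelated to any predictor
X_test, y_test = rng.normal(size=(50, 40)), rng.normal(size=50)

model = LinearRegression().fit(X_train, y_train)
print("in-sample R2:    ", r2_score(y_train, model.predict(X_train)))  # high
print("out-of-sample R2:", r2_score(y_test, model.predict(X_test)))    # ~0 or below
```
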

7
Q

What is p-hacking?

A

A special case of overfitting:
procedural overfitting, which takes place in parallel to model estimation (during data cleaning etc.)

Exploiting the data analysis to uncover patterns that can be presented as statistically significant when in reality there is no underlying effect

8
Q

What is the best way to avoid overfitting when exploring data?

A

Managing the bias-variance trade-off

9
Q

What is bias?

A

Tendency for a model to consistently produce answers that are wrong in a particular direction

10
Q

What is variance?

A

The extent to which a model's fitted parameters tend to deviate from their central tendency across different data sets

11
Q

What are the two different approaches in the bias-variance trade-off? Which one is preferred?

A
  1. A liberal, flexible analysis (low bias but high variance): almost any pattern can be detected, at the cost of a high rate of spurious identifications (exploratory data analysis)
  2. A conservative approach that fixes the high-variance problem (high bias but low variance): only a limited range of patterns can be identified, so the risk of pattern hallucination is low (confirmatory data analysis)

Use the first approach when exploring; otherwise you are just assessing noise

12
Q

How does good practice in exploratory research inform what to do in confirmatory research?

A
  • Data sets are large enough to support training models
  • Accurately estimate prediction error to assess performance and improve the model
  • Exert control over the bias-variance trade-off when appropriate
13
Q

What is cross-validation?

A

To assess models, we need to quantify out-of-sample prediction error
Cross-validation is a resampling method that uses different portions of the data to train and test a model across different iterations

14
Q

What is canonical cross-validation?

A

Classic replication:
A model fitted to one data set is then tested on a completely independent data set
More common in experimental designs than correlational ones

15
Q

What are some issues that can prevent canonical cross-validation from occurring?

A

Sometimes you can't collect enough data: the population is limited, funds to collect more data are limited, or the one giant study you want to build on was only run once

16
Q

What solves these issues with canonical cross-validation?

A

K-folding (K = number of folds)

  • Recycles the data set
  • With two folds: one fold (half the data) is used for training and the other for testing
  • In the second fold, the halves are reversed, i.e. the training and testing roles swap (see the sketch below)
  • A typical number of folds is 10
  • Minimises error in the estimates of prediction error
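
A minimal sketch of the two-fold role reversal described above, with hand-rolled splits so the logic is explicit (in practice a library routine such as scikit-learn's KFold does this); the data are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(size=100)

half = len(X) // 2
first, second = np.arange(half), np.arange(half, len(X))

fold_mses = []
# Fold 1: train on the first half, test on the second; fold 2: roles reversed.
for train_idx, test_idx in [(first, second), (second, first)]:
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("cross-validated MSE estimate:", np.mean(fold_mses))
```
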
17
Q

What is confirmatory research analysis?

A

Conducted when you have a specific research question to test (putting a hypothesis on trial)

Specify, prior to data collection, the statistical analyses you intend to run and the relationships you expect

18
Q

What is mean squared error and what is it used for?

A

The average squared difference between the estimated values and the actual values

Used to assess model fit
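
In symbols, with yᵢ the actual value and ŷᵢ the model's estimate for case i of n:

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```
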

19
Q

What would an increased MSE suggest about a model?

A

It is a worse model, as there is a larger average difference between estimated and actual values

(by contrast, an MSE of 0 would mean perfect accuracy)

20
Q

What is a disadvantage of using MSE?

A

It weights outliers heavily (some use mean absolute error, MAE, instead; see the sketch below)
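
A minimal sketch of this outlier sensitivity, assuming scikit-learn; the numbers are made up:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual    = [10, 12, 11, 13, 12]
predicted = [11, 11, 12, 12, 52]  # last prediction is off by 40 (an outlier)

# Squaring makes the single large error dominate MSE; MAE is far less affected.
print("MSE:", mean_squared_error(actual, predicted))   # 320.8
print("MAE:", mean_absolute_error(actual, predicted))  # 8.8
```
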

21
Q

What is a sensitivity analysis?

A

Sensitivity analysis determines how different values of an independent variable affect a particular dependent variable under a given set of assumptions. In other words, sensitivity analyses study how various sources of uncertainty in a mathematical model contribute to the model’s overall uncertainty.
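
A minimal sketch of one common form, assuming scikit-learn: refit the same model under different analytic assumptions (here, hypothetical outlier cutoffs) and check how much the estimate of interest moves:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(size=300)

# Vary one assumption (which cases count as outliers) and watch the slope.
for cutoff in (np.inf, 3.0, 2.0):  # keep cases with |y| below the cutoff
    keep = np.abs(y) < cutoff
    slope = LinearRegression().fit(x[keep, None], y[keep]).coef_[0]
    print(f"cutoff {cutoff}: slope estimate = {slope:.3f}")
```
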

22
Q

What is a limitation of using purely confirmatory data analysis?

A

It can only detect a limited range of patterns when others may exist in the data, so it may miss important results

23
Q

Is direct replication of a prior experiment a form of cross-validation?

A

Yes