17. Understanding Linear Models Flashcards

Question 1

Q

What is causality?

Answer

A

One event directly leads to another event

(Different from covariance where two variables change together)

Question 2

Q

What are the different conditions for causality?

Answer

A

Covariation
- The two factors must vary together
- Not sufficient on its own: two factors can occur together without one causing the other
- E.g. Ice Cream and Shark Attacks

Plausibility
- Is the causation actually plausible to occur?

Temporal precedence

A (the cause) happens before B (the effect), and B does not lead to A

No reasonable alternatives
- Hard to establish
- Failing to account for alternative explanations may lead to spurious correlations
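
A minimal sketch of the ice cream and shark attacks example, assuming a hypothetical third variable (temperature) drives both. All numbers and variable names are made up for illustration. The raw correlation is strong, but it largely disappears once temperature is accounted for.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible


def mean(values):
    return sum(values) / len(values)


def slope_intercept(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def correlation(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, xs):
    """Part of ys left over after removing the linear effect of xs."""
    slope, intercept = slope_intercept(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


n = 2000
# Hypothetical confounder: temperature influences both variables.
temperature = [random.gauss(20, 5) for _ in range(n)]
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
shark_attacks = [0.5 * t + random.gauss(0, 2) for t in temperature]

# Covariation without causation: neither variable appears in the other's formula.
raw_r = correlation(ice_cream, shark_attacks)

# Correlation after removing the effect of temperature from both variables.
partial_r = correlation(
    residuals(ice_cream, temperature), residuals(shark_attacks, temperature)
)

print(f"raw correlation: {raw_r:.2f}, controlling for temperature: {partial_r:.2f}")
```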

Question 3

Q

How can causality be tested?

Answer

A

Causal relationships are identified mainly through study design rather than statistical testing

e.g. experimental vs observational design (an experiment manipulates one variable and observes its effect on the other)

Needs a well-designed study in the first place, as many studies are poorly designed

OR

Statistical methods such as propensity score matching or instrumental variable analysis (use stats to approximate a control group)

Question 4

Q

What is a marginal distribution?

Answer

A

The distribution of a variable's values on its own, regardless of the values of other variables

Question 5

Q

What is a conditional distribution?

Answer

A

The distribution of a variable's values, given the value of another variable
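
A small sketch contrasting the two definitions, using a made-up table of joint counts. The variable names (study level, exam result) and the counts are assumptions for illustration only.

```python
# Hypothetical joint counts of (study_level, exam_result) for 100 students.
joint_counts = {
    ("low", "fail"): 30,
    ("low", "pass"): 20,
    ("high", "fail"): 10,
    ("high", "pass"): 40,
}


def marginal_result():
    """Distribution of exam_result, ignoring study_level."""
    totals = {}
    for (_, result), count in joint_counts.items():
        totals[result] = totals.get(result, 0) + count
    grand_total = sum(totals.values())
    return {result: count / grand_total for result, count in totals.items()}


def conditional_result(given_study_level):
    """Distribution of exam_result among students with the given study_level."""
    subset = {
        result: count
        for (study, result), count in joint_counts.items()
        if study == given_study_level
    }
    subtotal = sum(subset.values())
    return {result: count / subtotal for result, count in subset.items()}


print("marginal:", marginal_result())
print("conditional on high study:", conditional_result("high"))
```

The marginal distribution uses all 100 students, while the conditional distribution only uses the 50 students in the chosen study level.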

Question 6

Q

What is endogeneity?

Answer

A

Theoretically, it occurs when the marginal distribution of the predictor variable is not independent of the conditional distribution of the outcome variable, given the predictor variable

In practice, it occurs when predictor variable x is correlated with the error term ϵ - causes bias in beta estimates

i.e. the covariance between x and ϵ is not equal to 0
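
As a sketch, the condition can be written out for a simple linear model with one predictor. The notation (subscript i for cases, a hat for the estimate) is an assumption and is not taken from the cards. The last line is the standard result that the least-squares slope converges to the true slope plus a bias term that is zero only under exogeneity.

```latex
\begin{align*}
y_i &= \beta_0 + \beta_1 x_i + \epsilon_i \\
\text{Exogeneity:}\quad & \operatorname{Cov}(x_i, \epsilon_i) = 0 \\
\text{Endogeneity:}\quad & \operatorname{Cov}(x_i, \epsilon_i) \neq 0 \\
\hat{\beta}_1 &\xrightarrow{\;p\;} \beta_1 + \frac{\operatorname{Cov}(x, \epsilon)}{\operatorname{Var}(x)}
\end{align*}
```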

Question 7

Q

What is an endogenous variable?

Answer

A

An endogenous variable is any variable in the regression model that is correlated with the error term.

The variable's value is determined within the model

Question 8

Q

What is an exogenous variable?

Answer

A

An exogenous variable is an explanatory variable that is not correlated with the error term

The variable's value is determined outside of the model, not by the model

Question 9

Q

What are the problems with endogeneity?

Answer

A

Can’t easily test whether variables are endogenous
- If a model has both endogenous and exogenous variables, the model’s estimate of the error will be biased by the endogenous variable
Even if you detect endogeneity, you must still determine why it’s there to solve the issue

Question 10

Q

What are the different sources of endogeneity? (Name only)

Answer

A

Simultaneity bias
Omitted/Confounding variables
Measurement Error

Question 11

Q

What is simultaneity bias?

Answer

A

x causes y, which in turn causes x

E.g. Farmer’s income <-> crop yield

y = β0 + β1·x1 (exogenous) + β2·x2 (endogenous) + ϵ

If endogeneity is due to simultaneity (x and y are determined at the same time), then a change in the exogenous x leads to a change in y, which in turn changes the endogenous x, because the endogenous x is linked to the DV/model

More endogenous variables = Effect is more pronounced

Question 12

Q

How do we solve simultaneity bias?

Answer

A

Use statistical methods developed for this situation (two-stage least squares regression)
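
A minimal sketch of the two-stage idea on simulated data. It assumes a hypothetical instrument z that affects x but has no direct effect on y, and it simulates endogeneity through a shared unobserved term u rather than a full simultaneous-equations system. All names and coefficients are made up. In practice a dedicated routine would be used, since the standard errors from a hand-run second stage are not correct.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible


def mean(values):
    return sum(values) / len(values)


def slope_intercept(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


n = 5000
true_beta = 2.0
z = [random.gauss(0, 1) for _ in range(n)]  # instrument (assumed valid)
u = [random.gauss(0, 1) for _ in range(n)]  # unobserved term shared by x and y
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_beta * xi + 3.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

# Ordinary regression of y on x: biased, because x is correlated with the error.
ols_beta, _ = slope_intercept(x, y)

# Stage 1: predict the endogenous x from the instrument z.
stage1_slope, stage1_intercept = slope_intercept(z, x)
x_hat = [stage1_intercept + stage1_slope * zi for zi in z]

# Stage 2: regress y on the predicted x, which only carries variation due to z.
tsls_beta, _ = slope_intercept(x_hat, y)

print(f"true: {true_beta}, OLS: {ols_beta:.2f}, two-stage: {tsls_beta:.2f}")
```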

Question 13

Q

How do omitted/confounding variables explain endogeneity?

Answer

A

In a perfectly exogenous model - effect of x on y is separated from the error term

When x is correlated with both the outcome and an omitted variable z, the variance explained by z falls on ϵ, so x becomes correlated with ϵ
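
A sketch of this on simulated data, with made-up coefficients. The outcome depends on both x and z, and x is correlated with z. Leaving z out inflates the estimate for x; accounting for z (here by removing z from both x and y before regressing, which gives the same slope as including z in the model) recovers it.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible


def mean(values):
    return sum(values) / len(values)


def slope_intercept(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(ys, xs):
    """Part of ys left over after removing the linear effect of xs."""
    slope, intercept = slope_intercept(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


n = 5000
true_beta_x = 1.0
z = [random.gauss(0, 1) for _ in range(n)]  # the confounder
x = [0.8 * zi + random.gauss(0, 1) for zi in z]  # x is correlated with z
y = [true_beta_x * xi + 2.0 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

# z omitted: its effect ends up in the error term, which is now correlated with x.
naive_beta, _ = slope_intercept(x, y)

# z accounted for: regress the z-free part of y on the z-free part of x.
adjusted_beta, _ = slope_intercept(residuals(x, z), residuals(y, z))

print(f"true: {true_beta_x}, z omitted: {naive_beta:.2f}, z included: {adjusted_beta:.2f}")
```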

Question 14

Q

What is the solution for endogeneity when omitted/confounding variables cause it?

Answer

A

Ensure confounds are measured and included in the model. This is no small task and requires thorough knowledge of the topic

Question 15

Q

How does measurement error cause endogeneity?

Answer

A

Instead of measuring x, you measure x∗, which is a measurement of x with error (r) included

E.g. Reporting errors and coding errors

Similar to the case of omitted variables, the measurement error becomes part of the error term but will be associated with the measured x∗, leading to endogeneity
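
A sketch of the effect on simulated data, with made-up values. The outcome depends on the true x, but the regression only has access to x∗ = x + r. With this setup the slope is pulled towards zero.

```python
import random

random.seed(4)  # fixed seed so the sketch is reproducible


def slope(xs, ys):
    """Least-squares slope for predicting ys from xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


n = 5000
true_beta = 2.0
x = [random.gauss(0, 1) for _ in range(n)]  # true predictor (not observed)
r = [random.gauss(0, 1) for _ in range(n)]  # measurement error
x_star = [xi + ri for xi, ri in zip(x, r)]  # what is actually recorded
y = [true_beta * xi + random.gauss(0, 1) for xi in x]

beta_with_true_x = slope(x, y)
beta_with_x_star = slope(x_star, y)

print(f"using x: {beta_with_true_x:.2f}, using x*: {beta_with_x_star:.2f}")
```

Because the error r has the same variance as x here, the slope estimated from x∗ is roughly half the true value.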

Question 16

Q

How is the issue of measurement error in causing endogeneity solved?

Answer

A

Careful planning and study design

E.g. through pilot testing

Question 17

Q

What is prediction?

Answer

A

Important application of understanding causality

E.g. an aim of a linear model (LM) is to produce a model that predicts the outcome variable

Question 18

Q

How do we predict values outwith the original data set?

Answer

A

When collecting data, the range of values sampled for the predictor and outcome may not span the full range of the variables as they exist in the world

E.g. Hours spent studying

Can think about it as two sets of unknown values
- Those within range used to estimate model
- Those outside range used to estimate model

Question 19

Q

What is interpolation?

Answer

A

Obtaining a value from a model within the range of given data or points

Question 20

Q

What is extrapolation?

Answer

A

Obtaining value from model from outside a range of given data points

Extrapolation is not recommended - especially as the trajectory outside the observed range is unknown
There are no data points on either side of the new value to show the pattern
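
A sketch of why extrapolation is risky, using the hours spent studying example. The saturating relationship between hours and score is hypothetical and only used to generate data. A straight line fitted to 0-10 hours predicts reasonably inside that range and very poorly far outside it.

```python
hours = list(range(0, 11))  # observed range: 0 to 10 hours


def true_score(h):
    """Hypothetical saturating relationship, used only to generate data."""
    return 100 * h / (h + 5)


scores = [true_score(h) for h in hours]


def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return b, my - b * mx


slope, intercept = fit_line(hours, scores)


def predict(h):
    return intercept + slope * h


def is_interpolation(h):
    """True if h lies within the range of data used to fit the model."""
    return min(hours) <= h <= max(hours)


for h in (5.5, 30):
    kind = "interpolation" if is_interpolation(h) else "extrapolation"
    print(f"{h} hours ({kind}): predicted {predict(h):.1f}, true {true_score(h):.1f}")

interp_error = abs(predict(5.5) - true_score(5.5))
extrap_error = abs(predict(30) - true_score(30))
```

At 30 hours the fitted line predicts a score above 100, which the data-generating relationship never reaches.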

Question 21

Q

What is missing data?

Answer

A

Missing data, or missing values, occur when you don’t have data stored for certain variables or participants

Question 22

Q

What are the reasons for missing data?

Answer

A

Participant (PPT) non-responses, errors in data collection, errors in data entry, missing by design

Question 23

Q

What are the two main things to worry about when it comes to missing data?

Answer

A

Loss of efficiency (smaller sample size, so reduced power) and bias (incorrect estimates)

Question 24

Q

What are the types of missing data? (name only)

Answer

A

Missing at random (MAR)
Missing completely at random (MCAR)
Missing not at random (MNAR)

Question 25

Q

What is missing at random? (MAR)

Answer

A

When the probability of missing data on a variable y is related to other variables in the model but not to the values of y itself

e.g. Those with low self-control are more likely to have missing data on aggression

No way to confirm if this relationship is true

Question 26

Q

What is missing completely at random (MCAR)?

Answer

A

Genuinely random missingness
Missingness is not related to Y or to any other variable in the model
e.g. people at all levels of self-control and aggression = Equal chance of missingness

Question 27

Q

What is missing not at random (MNAR)?

Answer

A

When the probability of missingness on Y is related to the values of Y itself

e.g. Those high in aggression (y) = increased missing data even when x is controlled

No way to verify
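
A sketch simulating the three mechanisms with the self-control and aggression example. The coefficients and missingness probabilities are made up. It compares the mean of the observed aggression scores with the mean of the full data under each mechanism.

```python
import random

random.seed(5)  # fixed seed so the sketch is reproducible

n = 5000
self_control = [random.gauss(0, 1) for _ in range(n)]
# Hypothetical relationship: lower self-control goes with higher aggression.
aggression = [-0.5 * x + random.gauss(0, 1) for x in self_control]


def observed_mean(values, is_missing):
    """Mean of the values that are not flagged as missing."""
    kept = [v for v, miss in zip(values, is_missing) if not miss]
    return sum(kept) / len(kept)


# MCAR: everyone has the same 30% chance of a missing aggression score.
mcar = [random.random() < 0.3 for _ in range(n)]

# MAR: missingness depends on self-control (x), not on aggression itself.
mar = [random.random() < (0.6 if x < 0 else 0.1) for x in self_control]

# MNAR: missingness depends on the aggression score (y) itself.
mnar = [random.random() < (0.6 if y > 0 else 0.1) for y in aggression]

full_mean = sum(aggression) / n
mcar_mean = observed_mean(aggression, mcar)
mar_mean = observed_mean(aggression, mar)
mnar_mean = observed_mean(aggression, mnar)

print(f"full: {full_mean:.2f}  MCAR: {mcar_mean:.2f}  "
      f"MAR: {mar_mean:.2f}  MNAR: {mnar_mean:.2f}")
```

In this setup the observed mean stays close to the full mean only under MCAR. Under MAR the bias can be addressed by using self-control, which is observed; under MNAR the information needed is exactly what is missing.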

Question 28

Q

What are the three methods to deal with MNAR? (Name only)

Answer

A

Pattern Mixture Models
Random Coefficient Models
Selection Models

Question 29

Q

What are pattern mixture models as a method to solve MNAR?

Answer

A

Stratifies the sample according to different missing data patterns (separates into patterns such as complete data and missing data)

Then, estimate the substantive model within each pattern (one model for each pattern)

Pool the patterns together to create a weighted parameter estimate

Good to include as part of a sensitivity analysis
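
A very reduced sketch of the pooling step used as a sensitivity analysis. It assumes only two patterns (observed and missing) and an assumed shift, here called delta, between the mean of the missing pattern and the mean of the observed pattern. The function name, delta, and the numbers are hypothetical; delta cannot be estimated from the data and is varied to see how much conclusions change.

```python
def pattern_mixture_mean(observed_mean, prop_missing, delta):
    """Pooled mean across two patterns, weighted by pattern size.

    delta is the assumed difference between the mean of the cases with
    missing data and the mean of the observed cases.
    """
    missing_mean = observed_mean + delta
    return (1 - prop_missing) * observed_mean + prop_missing * missing_mean


observed_mean = 10.0  # hypothetical mean aggression among observed cases
prop_missing = 0.25   # hypothetical share of cases with missing aggression

# Sensitivity analysis: repeat the pooling under different assumed shifts.
for delta in (0.0, 2.0, 4.0):
    pooled = pattern_mixture_mean(observed_mean, prop_missing, delta)
    print(f"delta = {delta}: pooled mean = {pooled:.2f}")
```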

Question 30

Q

What are random coefficient models as a method to solve MNAR?

Answer

A

A random coefficient regression is a special type of linear mixed model. It can be used when we want to explore the relationship between a response variable (y) and a continuous explanatory variable (x) and we have repeated measurements of x and y on individual subjects.

Question 31

Q

What are selection models as a method to solve MNAR?

Answer

A

Combine model for predicting missingness as well as analysis model of interest

Selection model = Predict missingness on aggression from covariates
Substantive model = Predict aggression from self-control

Parameter estimates in the substantive model are adjusted by the selection model

Makes strong, untestable assumptions

Question 32

Q

What are the two deletion methods for missing data? (Name and describe)

Answer

A

Listwise deletion/Complete Case Analysis (Not recommended)

Delete everyone who has missing data
Will be biased unless data MCAR
Even if MCAR = Power will be reduced

Pairwise deletion/available case analysis (Again not recommended)

Uses available data
Different cases contribute to different correlations in the matrix (a case is left out only of the correlations that involve its missing variables)
Doesn’t reduce power as much as listwise, but if data are not MCAR = Biased results
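
A small sketch of which cases each method keeps, using a made-up data set of five cases and three variables where None marks a missing value.

```python
# Hypothetical data: None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 2.0, "y": None, "z": 1.0},
    {"x": 3.0, "y": 4.0, "z": None},
    {"x": None, "y": 5.0, "z": 2.0},
    {"x": 5.0, "y": 6.0, "z": 4.0},
]


def listwise(data):
    """Keep only cases with no missing values on any variable."""
    return [row for row in data if all(v is not None for v in row.values())]


def pairwise(data, a, b):
    """Keep every case that has both variables a and b observed."""
    return [
        (row[a], row[b])
        for row in data
        if row[a] is not None and row[b] is not None
    ]


print("listwise cases:", len(listwise(rows)))
for a, b in (("x", "y"), ("x", "z"), ("y", "z")):
    print(f"pairwise cases for {a}-{b}:", len(pairwise(rows, a, b)))
```

Listwise deletion keeps 2 of the 5 cases, while each pairwise correlation is based on 3 cases, and a different set of 3 each time.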

Question 33

Q

What are the three imputation methods? (Name only)

Answer

A

Mean imputation (not recommended)
Regression Imputation
Multiple imputation (recommended)

Question 34

Q

What is mean imputation method?

Answer

A

Replacing missing values with mean of that variable

Issues: Artificially reduces the variability of the data and gives biased estimates even when data are MCAR
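
A sketch of the variance problem on simulated data with made-up values, where 30% of scores are removed completely at random. Filling every gap with the same mean leaves the mean unchanged but shrinks the variance.

```python
import random
import statistics

random.seed(6)  # fixed seed so the sketch is reproducible

n = 1000
complete = [random.gauss(50, 10) for _ in range(n)]

# MCAR: each value has the same 30% chance of being missing (None).
with_missing = [v if random.random() > 0.3 else None for v in complete]

observed = [v for v in with_missing if v is not None]
observed_mean = statistics.mean(observed)

# Mean imputation: every missing value is replaced by the observed mean.
imputed = [v if v is not None else observed_mean for v in with_missing]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(f"mean: observed {observed_mean:.2f}, imputed {statistics.mean(imputed):.2f}")
print(f"variance: observed {var_observed:.2f}, imputed {var_imputed:.2f}")
```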

Question 35

Q

What is regression imputation?

Answer

A

Replaces missing values with values predicted from regression (use lm to create a predicted value)
- Estimate a set of regression equations where the incomplete variables are predicted from the complete variables
- Use the regression equations to calculate the predicted values on the incomplete variables

Based on the principle of using information from the complete data to estimate the missing data

Two forms:

Normal regression
Stochastic regression (adds residual term to overcome loss of variance)

Stochastic regression is preferred and is unbiased if MAR
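
A sketch comparing the two forms on simulated data with made-up coefficients. Python is used here only for illustration; it is not the lm-based workflow the card refers to. Normal regression imputation places every imputed value exactly on the regression line, while the stochastic form adds a random residual.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

n = 4000
x = [random.gauss(0, 1) for _ in range(n)]          # complete variable
y_true = [1 + 2 * xi + random.gauss(0, 2) for xi in x]
missing = [random.random() < 0.4 for _ in range(n)]  # y missing for about 40%

# Fit the regression of y on x using the complete cases only.
cc_x = [xi for xi, m in zip(x, missing) if not m]
cc_y = [yi for yi, m in zip(y_true, missing) if not m]
mx, my = statistics.mean(cc_x), statistics.mean(cc_y)
slope = sum((a - mx) * (b - my) for a, b in zip(cc_x, cc_y)) / sum(
    (a - mx) ** 2 for a in cc_x
)
intercept = my - slope * mx

# Residual standard deviation, used as the spread of the added noise.
resid = [b - (intercept + slope * a) for a, b in zip(cc_x, cc_y)]
resid_sd = (sum(e ** 2 for e in resid) / (len(resid) - 2)) ** 0.5

# Normal regression imputation: predicted value only.
y_normal = [
    intercept + slope * xi if m else yi for xi, yi, m in zip(x, y_true, missing)
]
# Stochastic regression imputation: predicted value plus a random residual.
y_stochastic = [
    intercept + slope * xi + random.gauss(0, resid_sd) if m else yi
    for xi, yi, m in zip(x, y_true, missing)
]

var_true = statistics.variance(y_true)
var_normal = statistics.variance(y_normal)
var_stochastic = statistics.variance(y_stochastic)
print(f"variance of y: true {var_true:.2f}, normal {var_normal:.2f}, "
      f"stochastic {var_stochastic:.2f}")
```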

Question 36

Q

What is multiple imputation (recommended method)?

Answer

A

Imputes the missing data multiple times to create multiple data sets; analyses are conducted on each data set, and results are pooled across sets to estimate parameters + SE (20+ sets is ideal)

SE takes account of additional uncertainty due to missingness

Include as many higher-order effects as possible in the imputation model

Unbiased under MAR
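
A simplified sketch of the impute, analyse, pool cycle on simulated data, estimating the mean of y. The pooling follows Rubin's rules. The imputation step is a simplification: it reuses one fitted regression with random residuals and does not redraw the regression parameters for each data set, which a full multiple imputation procedure would do. All names and values are made up.

```python
import random
import statistics

random.seed(8)  # fixed seed so the sketch is reproducible


def pool_rubin(estimates, variances):
    """Pool estimates and their squared SEs across imputed data sets."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)       # average sampling variance
    between = statistics.variance(estimates)  # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, within, between, total


n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]  # true mean of y is 1
# MAR: y is more likely to be missing when x is high.
missing = [random.random() < (0.5 if xi > 0 else 0.1) for xi in x]

cc_x = [xi for xi, m in zip(x, missing) if not m]
cc_y = [yi for yi, m in zip(y, missing) if not m]
mx, my = statistics.mean(cc_x), statistics.mean(cc_y)
slope = sum((a - mx) * (b - my) for a, b in zip(cc_x, cc_y)) / sum(
    (a - mx) ** 2 for a in cc_x
)
intercept = my - slope * mx
resid = [b - (intercept + slope * a) for a, b in zip(cc_x, cc_y)]
resid_sd = (sum(e ** 2 for e in resid) / (len(resid) - 2)) ** 0.5

estimates, variances = [], []
for _ in range(20):  # 20 imputed data sets
    completed = [
        intercept + slope * xi + random.gauss(0, resid_sd) if m else yi
        for xi, yi, m in zip(x, y, missing)
    ]
    estimates.append(statistics.mean(completed))
    variances.append(statistics.variance(completed) / n)  # squared SE of mean

pooled_mean, within_var, between_var, total_var = pool_rubin(estimates, variances)
complete_case_mean = my
print(f"complete cases: {complete_case_mean:.2f}, pooled: {pooled_mean:.2f}, "
      f"pooled SE: {total_var ** 0.5:.3f}")
```

The between-imputation variance is what makes the pooled SE larger than the SE from any single imputed data set.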

Question 37

Q

What is the maximum likelihood estimation (MLE) approach to missing data? (Recommended)

Answer

A

Uses all information in model to create estimates as if they are complete

The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed

Doesn’t compute individual values

Unbiased estimates under MAR

Assumes multivariate normality

Under MCAR - Superior to deletion methods as it uses more information

Easier to implement than imputation

Question 38

Q

Which conditions suit maximum likelihood estimation (MLE), and which suit multiple imputation (MI)?

Answer

A

MLE = Better when

The substantive model includes interactions
For structural equation models (more on this in dapR3)
For the inexperienced (easier to learn and implement)

MI = Better when

The structural equation model has categorical indicators
There is missing data on predictors
Including auxiliary variables (variables that are not in the substantive model but are related to the missing data)