17. Understanding Linear Models Flashcards
What is causality?
One event directly leads to another event
(Different from covariation, where two variables merely change together)
What are the different conditions for causality?
Covariation
- The cause and the effect must vary together (be associated)
- Necessary but not sufficient: e.g. ice cream sales and shark attacks covary, but neither causes the other
Plausibility
- Is there a plausible mechanism by which the cause could produce the effect?
Temporal precedence
- The cause (A) must happen before the effect (B), and B must not lead back to A
No reasonable alternatives
- Hard to establish
- Failing to account for alternative explanations may lead to spurious correlations
How can causality be tested?
Causal relationships are examined mainly through study design rather than statistical testing
e.g. experimental vs observational designs (an experiment manipulates one variable and observes its effect on the other)
- Requires a well-designed causal test in the first place, as many studies are poorly designed
OR
Statistical methods that approximate an experiment, e.g. propensity score matching (uses statistics to simulate a control group) or instrumental variable analysis
What is a marginal distribution?
The distribution of a variable's values on its own, ignoring the values of other variables
What is a conditional distribution?
The distribution of a variable's values, given a particular value of another variable
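A small sketch of the difference between the two, using made-up records of whether students studied and whether they passed (the data and the function names are hypothetical, for illustration only). The marginal distribution ignores studying; the conditional distribution looks only at students with a given studying value.

```python
from collections import Counter

# Hypothetical data: (studied, outcome) for 10 students.
records = [
    ("yes", "pass"), ("yes", "pass"), ("yes", "pass"), ("yes", "fail"),
    ("no", "pass"), ("no", "fail"), ("no", "fail"), ("no", "fail"),
    ("yes", "pass"), ("no", "fail"),
]

def marginal(rows):
    """Distribution of the outcome, ignoring the other variable."""
    counts = Counter(outcome for _, outcome in rows)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def conditional(rows, studied):
    """Distribution of the outcome among rows with the given 'studied' value."""
    subset = [(s, o) for s, o in rows if s == studied]
    return marginal(subset)

marginal_outcome = marginal(records)        # P(outcome)
given_studied = conditional(records, "yes")  # P(outcome | studied = yes)
given_not = conditional(records, "no")       # P(outcome | studied = no)

print(marginal_outcome, given_studied, given_not)
```

In this toy data the marginal pass rate sits between the two conditional pass rates, because it averages over both groups.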
What is endogeneity?
Theoretically, it occurs when the marginal distribution of the predictor variable is not independent of the conditional distribution of the outcome variable given the predictor variable
In practice, it occurs when predictor variable x is correlated with the error term ϵ, which biases the beta estimates
i.e. Cov(x, ϵ) ≠ 0
What is an endogenous variable?
An endogenous variable is any variable in the regression model that is correlated with the error term.
Its value is determined within the model
What is an exogenous variable?
An exogenous variable is an explanatory variable that is not correlated with the error term
Its value is determined outside the model, not by the model
What are the problems with endogeneity?
- It is not easy to test whether variables are endogenous
- In a model with both endogenous and exogenous variables, the model's estimates will be biased by the endogenous variable
- Even if you detect endogeneity, you must still determine why it is there to solve the issue
What are the different sources of endogeneity? (Name only)
Simultaneity bias
Omitted/Confounding variables
Measurement Error
What is simultaneity bias?
x causes y, which in turn causes x
E.g. Farmer’s income <-> crop yield
y = β0 + β1·x1 (exogenous) + β2·x2 (endogenous) + ϵ
If endogeneity is due to simultaneity (x and y are determined at the same time), a change in the exogenous x1 changes y, which in turn changes the endogenous x2, because x2 is linked to the DV within the model
More endogenous variables = Effect is more pronounced
How do we solve simultaneity bias?
Use statistical methods developed for this situation (two-stage least squares regression)
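A minimal sketch of two-stage least squares on simulated data, not a definitive implementation. Everything here is an assumption made for illustration: the feedback system, the coefficient values, and especially the instrument w (a variable that affects x but affects y only through x), which the cards above do not mention. Stage 1 regresses x on w; stage 2 regresses y on the fitted values from stage 1.

```python
import random

def ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

rng = random.Random(42)
n = 20000
beta, gamma = 0.5, 0.4  # hypothetical: effect of x on y, feedback of y on x

w = [rng.gauss(0, 1) for _ in range(n)]   # assumed instrument
e1 = [rng.gauss(0, 1) for _ in range(n)]  # error in the y equation
e2 = [rng.gauss(0, 1) for _ in range(n)]  # error in the x equation

# Simultaneous system: y = beta*x + e1 and x = gamma*y + w + e2.
# Solving the two equations for x gives the line below.
x = [(gamma * a + c + b) / (1 - beta * gamma) for a, b, c in zip(e1, e2, w)]
y = [beta * xi + a for xi, a in zip(x, e1)]

# Ordinary regression: biased, because x depends on e1 through y.
_, ols_slope = ols(x, y)

# Stage 1: predict x from the instrument only.
a0, a1 = ols(w, x)
x_hat = [a0 + a1 * wi for wi in w]

# Stage 2: regress y on the predicted x.
_, tsls_slope = ols(x_hat, y)

print("OLS slope:", ols_slope)
print("2SLS slope:", tsls_slope, "true value:", beta)
```

Running the two stages by hand like this gives the point estimate only; the standard errors from the second-stage regression are not valid as they stand, so in practice a dedicated 2SLS routine is the safer choice.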
How do omitted/confounding variables explain endogeneity?
In a perfectly exogenous model - effect of x on y is separated from the error term
When an omitted variable z is correlated with both x and the outcome, the variance explained by z falls on ϵ, so x becomes correlated with ϵ
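The omitted-variable case can be illustrated with a small simulation. This is a sketch under assumed coefficients (true effect of x on y is 1, z affects both x and y); the helper names are hypothetical. Leaving z out pushes its contribution into the error term and inflates the slope for x; controlling for z (here by removing z from both x and y before regressing) recovers the true effect.

```python
import random

def slope(x, y):
    """OLS slope of y on x (model includes an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """What is left of y after regressing it on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

rng = random.Random(1)
n = 20000
z = [rng.gauss(0, 1) for _ in range(n)]        # the confound
x = [0.8 * zi + rng.gauss(0, 1) for zi in z]   # x is correlated with z
y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

# z omitted: its effect is absorbed by the error term, which x correlates with.
naive_slope = slope(x, y)

# z included: regress the z-free part of y on the z-free part of x,
# which gives the same slope for x as a regression of y on x and z together.
adjusted_slope = slope(residuals(z, x), residuals(z, y))

print("slope with z omitted:", naive_slope)
print("slope with z controlled:", adjusted_slope)
```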
What is the solution for endogeneity when omitted/confounding variables cause it?
Ensure confounds are measured and included in the model; this is no small task and requires thorough knowledge of the topic
How does measurement error cause endogeneity?
Instead of measuring x, you measure x∗, which is a measurement of x with error (r) included
E.g. Reporting errors and coding errors
Similar to the case of omitted variables, the measurement error becomes part of the error term but is associated with the measured predictor x∗, leading to endogeneity
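A short simulation of this effect, assuming a true slope of 2 and measurement error r with the same variance as x (both values are made up for illustration). Regressing y on the noisy x∗ instead of x pulls the estimated slope towards zero.

```python
import random

def slope(x, y):
    """OLS slope of y on x (model includes an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(3)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]            # true predictor
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]       # outcome depends on true x
x_star = [xi + rng.gauss(0, 1) for xi in x]        # x measured with error r

slope_true_x = slope(x, y)        # uses the error-free predictor
slope_noisy_x = slope(x_star, y)  # uses the mismeasured predictor

print("slope using x:", slope_true_x)
print("slope using x*:", slope_noisy_x)
```

With equal variances for x and r, the slope on x∗ is expected to be roughly half the true slope.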
How is the issue of measurement error in causing endogeneity solved?
Careful planning and study design
E.g. through pilot testing
What is prediction?
Using a model to estimate values of the outcome variable; an important application of understanding causality
E.g. it is the aim of a linear model (LM): to produce a model that predicts the outcome variable
How do we predict values outwith the original data set?
When collecting data, the range of sampled values in the predictor and outcome may not span the full range of the variables as they exist in the world
E.g. Hours spent studying
Can think about it as two sets of unknown values
- Those within range used to estimate model
- Those outside range used to estimate model
What is interpolation?
Obtaining a value from a model within the range of given data or points
What is extrapolation?
Obtaining a value from a model outside the range of the given data points
Extrapolation is not recommended, especially as the trajectory outside the data is unknown
There are no data points on either side of the new value to show the pattern
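A sketch of why extrapolation is risky, using hours spent studying. The "true" curve below is an invented assumption (scores level off and then drop with very long hours); the model only sees 0 to 6 hours and fits a straight line. Predicting at 3 hours is interpolation; predicting at 15 hours is extrapolation.

```python
def true_score(hours):
    """Hypothetical real-world relationship (unknown to the analyst)."""
    return 40 + 12 * hours - 0.6 * hours ** 2

# Observed data only cover 0 to 6 hours of study.
hours = [0, 1, 2, 3, 4, 5, 6]
scores = [true_score(h) for h in hours]

# Fit a straight line by least squares.
n = len(hours)
mean_h, mean_s = sum(hours) / n, sum(scores) / n
b1 = (sum((h - mean_h) * (s - mean_s) for h, s in zip(hours, scores))
      / sum((h - mean_h) ** 2 for h in hours))
b0 = mean_s - b1 * mean_h

def predict(h):
    """Prediction from the fitted linear model."""
    return b0 + b1 * h

# Interpolation: 3 hours lies inside the observed range.
interp_error = abs(predict(3) - true_score(3))
# Extrapolation: 15 hours lies far outside the observed range.
extrap_error = abs(predict(15) - true_score(15))

print("error at 3 hours:", interp_error)
print("error at 15 hours:", extrap_error)
```

The line fits well where there is data, but nothing in the sample reveals that the trend changes beyond 6 hours, so the extrapolated prediction is far off.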
What is missing data?
Missing data, or missing values, occur when you don’t have data stored for certain variables or participants
What are the reasons for missing data?
Participant non-response, errors in data collection, errors in data entry, missing by design
What are the two main things to worry about when it comes to missing data?
Loss of efficiency (smaller sample size, so reduced power) and bias (incorrect estimates)
What are the types of missing data? (name only)
Missing at random (MAR)
Missing completely at random (MCAR)
Missing not at random (MNAR)
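A rough simulation of the two worries about missing data, with invented numbers. Deleting scores completely at random mainly costs sample size (efficiency), while deleting high scores more often, one example of a not-at-random pattern, also biases the estimated mean. The 50% and 80% deletion rates and the cut-off of 55 are arbitrary assumptions.

```python
import random
from statistics import mean

rng = random.Random(7)
n = 20000
scores = [rng.gauss(50, 10) for _ in range(n)]  # complete data

# Completely at random: every score has the same 50% chance of being kept.
kept_random = [s for s in scores if rng.random() < 0.5]

# Not at random: scores above 55 go missing 80% of the time,
# so missingness depends on the value itself.
kept_dependent = [s for s in scores if not (s > 55 and rng.random() < 0.8)]

print("complete:", len(scores), mean(scores))
print("random deletion:", len(kept_random), mean(kept_random))
print("value-dependent deletion:", len(kept_dependent), mean(kept_dependent))
```

Under random deletion the mean stays close to the complete-data mean with fewer observations; under value-dependent deletion the mean is pulled downwards.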