17. Understanding Linear Models Flashcards
What is causality?
One event directly leads to another event
(Different from covariance where two variables change together)
What are the different conditions for causality?
Covariation
- The two factors change together (are associated)
- Not sufficient on its own: two factors can occur together without one causing the other
- E.g. Ice cream sales and shark attacks
Plausibility
- Is it actually plausible that one causes the other?
Temporal precedence
- A (the cause) happens before B (the effect), and B does not lead to A
No reasonable alternatives
- Hard to establish
- Failing to account for alternative explanations may lead to spurious correlations
How can causality be tested?
Causal relationships are identified through study design rather than statistical testing
e.g. experimental rather than observational designs (manipulating one variable and seeing its effect on the other)
- Needs a well-designed causal test in the first place, as many studies are poorly designed
OR
Statistical methods such as propensity score matching or instrumental variable analysis (use statistics to simulate a control group)
What is a marginal distribution?
The distribution of a variable’s values on its own, ignoring the values of other variables
What is a conditional distribution?
The distribution of a variable’s values, given the value of another variable
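A minimal sketch of the difference between the two, using made-up records (the `studied`/`passed` data and both function names are hypothetical, not from the course):

```python
from collections import Counter

# Hypothetical data: (studied, result) for ten students.
records = [
    ("yes", "pass"), ("yes", "pass"), ("yes", "pass"), ("yes", "fail"),
    ("no", "pass"), ("no", "fail"), ("no", "fail"), ("no", "fail"),
    ("yes", "pass"), ("no", "fail"),
]

def marginal(records):
    """Distribution of the result on its own, ignoring whether they studied."""
    counts = Counter(result for _, result in records)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def conditional(records, given):
    """Distribution of the result among students where studied == given."""
    subset = [result for studied, result in records if studied == given]
    counts = Counter(subset)
    return {k: v / len(subset) for k, v in counts.items()}

print(marginal(records))            # result ignoring studying
print(conditional(records, "yes"))  # result given studied == "yes"
```

The marginal pass rate here is 0.5, while the pass rate conditional on having studied is 0.8.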
What is endogeneity?
Theoretically, it occurs when the marginal distribution of the predictor variable is not independent of the conditional distribution of the outcome variable, given the predictor variable
In practice, it occurs when predictor variable x is correlated with the error term, which causes bias in the beta estimates
i.e. cov(x, ϵ) is not equal to 0
What is an endogenous variable?
An endogenous variable is any variable in the regression model that is correlated with the error term.
Its value is determined within the model
What is an exogenous variable?
An exogenous variable is an explanatory variable that is not correlated with the error term
Its value is determined outside the model, not by the model
What are the problems with endogeneity?
- Can’t easily test whether variables are endogenous
- In a model with both endogenous and exogenous variables, the model’s estimate of error will be biased by the endogenous variable
- Even if you detect endogeneity, you must still determine why it is there in order to solve the issue
What are the different sources of endogeneity? (Name only)
Simultaneity bias
Omitted/Confounding variables
Measurement Error
What is simultaneity bias?
x causes y, which in turn causes x
E.g. Farmer’s income <-> crop yield
y = β0 + β1(exogenous x) + β2(endogenous x) + ϵ
If endogeneity is due to simultaneity (the variables are determined at the same time), a change in the exogenous x leads to a change in y, which in turn changes the endogenous x because it is linked to the DV/model
More endogenous variables = Effect is more pronounced
How do we solve simultaneity bias?
Use statistical methods developed for this situation (two-stage least squares regression)
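A rough sketch of the two-stage idea, under stated assumptions: the simulation below makes x endogenous through an unobserved component `u` shared with y (a simplification, not a full feedback loop), and `z` is a hypothetical instrument that affects x but not y directly. All coefficients are invented for illustration.

```python
import random

def fit(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b = sxy / sxx
    return my - b * mx, b

def two_stage_least_squares(z, x, y):
    a0, a1 = fit(z, x)                     # stage 1: predict x from the instrument
    x_hat = [a0 + a1 * zi for zi in z]
    return fit(x_hat, y)[1]                # stage 2: regress y on predicted x

def simulate(n=20000, seed=2):
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(n)]    # unobserved, shared by x and y
    z = [rng.gauss(0, 1) for _ in range(n)]    # hypothetical instrument
    x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
    y = [0.5 * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]  # true slope 0.5
    naive = fit(x, y)[1]                       # ordinary regression, biased upwards
    return naive, two_stage_least_squares(z, x, y)

naive, tsls = simulate()
print(round(naive, 2), round(tsls, 2))
```

The ordinary regression slope lands well above the true 0.5, while the two-stage slope should sit close to it. The sketch only recovers the slope; standard errors taken from the second stage alone would not be correct.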
How do omitted/confounding variables explain endogeneity?
In a perfectly exogenous model, the effect of x on y is separated from the error term
When x is correlated with an omitted variable z that also affects the outcome, the variance explained by z falls into ϵ, so x becomes correlated with ϵ
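A small simulation can illustrate this. It is a sketch with invented coefficients (true effect of x set to 1.0, confounder `z` affecting both x and y); the helper names are hypothetical.

```python
import random

def slope(x, y):
    """Simple-regression slope of y on x: cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(y, x):
    """Residuals from regressing y on x (with an intercept)."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

def simulate(n=20000, seed=1):
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]          # confounder
    x = [0.8 * zi + rng.gauss(0, 1) for zi in z]     # x depends on z
    y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1)       # true effect of x is 1.0
         for xi, zi in zip(x, z)]
    omitted = slope(x, y)                            # z left in the error term
    # Including z: remove z from both x and y, then regress the residuals.
    # This gives the same x coefficient as the multiple regression y ~ x + z.
    included = slope(residuals(x, z), residuals(y, z))
    return omitted, included

omitted, included = simulate()
print(round(omitted, 2), round(included, 2))
```

With z omitted, the slope for x is pushed far above 1.0; once z is accounted for, the estimate returns to roughly the true value.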
What is the solution for endogeneity when omitted/confounding variables cause it?
Ensure confounds are measured and included in the model. This is no small task and requires thorough knowledge of the topic
How does measurement error cause endogeneity?
Instead of measuring x, you measure x∗, which is a measurement of x with error (r) included
E.g. Reporting errors and coding errors
Similar to the case of omitted variables, the measurement error becomes part of the error term but is associated with the measured x, leading to endogeneity
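As an illustration, the sketch below simulates a predictor measured with error (all numbers are invented). Since y = βx + e and x∗ = x + r, regressing y on x∗ leaves −βr in the error term, which is correlated with x∗.

```python
import random

def slope(x, y):
    """Simple-regression slope of y on x: cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate(n=20000, seed=3):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]         # true predictor
    r = [rng.gauss(0, 1) for _ in range(n)]         # measurement error
    x_star = [xi + ri for xi, ri in zip(x, r)]      # what we actually record
    y = [2.0 * xi + rng.gauss(0, 1) for xi in x]    # true slope is 2.0
    return slope(x, y), slope(x_star, y)

true_slope, noisy_slope = simulate()
print(round(true_slope, 2), round(noisy_slope, 2))
```

With equal variance in x and r, the slope estimated from x∗ shrinks to about half the true value.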
How is the issue of measurement error in causing endogeneity solved?
Careful planning and study design
E.g. through pilot testing
What is prediction?
Important application of understanding causality
E.g. it is the aim of a linear model (LM): to produce a model that predicts the outcome variable
How do we predict values outwith the original data set?
When collecting data, the range of sampled values for the predictor and outcome may not span the full range of the variables as they exist in the world
E.g. Hours spent studying
Can think about it as two sets of unknown values
- Those within range used to estimate model
- Those outside range used to estimate model
What is interpolation?
Obtaining a value from a model within the range of the given data points
What is extrapolation?
Obtaining a value from a model outside the range of the given data points
Extrapolation is not recommended, especially as the trajectory outside the observed range is unknown
There are no data points on either side of the new value to show the pattern
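A sketch of why extrapolation is risky, assuming a made-up curved relationship between hours studied and score (`true_score` is hypothetical and would be unknown in practice). A straight line is fitted only to hours between 0 and 4.

```python
import math

def fit(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b = sxy / sxx
    return my - b * mx, b

def true_score(hours):
    """Hypothetical curved relationship, unknown to the analyst."""
    return 20 * math.sqrt(hours)

hours = [0.5 * i for i in range(9)]        # observed range: 0 to 4 hours
scores = [true_score(h) for h in hours]
b0, b1 = fit(hours, scores)

def predict(h):
    return b0 + b1 * h

interp_error = abs(predict(2.25) - true_score(2.25))  # inside the observed range
extrap_error = abs(predict(16) - true_score(16))      # far outside it
print(round(interp_error, 1), round(extrap_error, 1))
```

The prediction inside the observed range is off by a small amount; the prediction at 16 hours is off by far more, because the line keeps rising where the true curve flattens.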
What is missing data?
Missing data, or missing values, occur when you don’t have data stored for certain variables or participants
What are the reasons for missing data?
Participant non-response, errors in data collection, errors in data entry, missing by design
What are the two main things to worry about when it comes to missing data?
Loss of efficiency (a smaller sample size, so reduced power) and bias (incorrect estimates)
What are the types of missing data? (name only)
Missing at random (MAR)
Missing completely at random (MCAR)
Missing not at random (MNAR)
What is missing at random? (MAR)
When the probability of missing data on a variable y is related to other variables in the model but not to the values of y itself
e.g. Those with low self-control are more likely to have missing data on aggression
- No way to confirm if this relationship is true
What is missing completely at random (MCAR)?
Genuinely random missingness
Missingness is unrelated to Y and to the other variables in the model
e.g. People at all levels of self-control and aggression have an equal chance of missingness
What is missing not at random (MNAR)?
When the probability of missingness on Y is related to the values of Y itself
e.g. Those high in aggression (y) have more missing data, even when x is controlled
- No way to verify
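The three mechanisms above can be sketched in a simulation. The self-control/aggression relationship and the missingness probabilities below are invented for illustration.

```python
import random

def observed_mean(y, missing):
    """Mean of y over the cases that are not missing."""
    kept = [yi for yi, m in zip(y, missing) if not m]
    return sum(kept) / len(kept)

def simulate(n=20000, seed=4):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]              # self-control
    y = [-0.5 * xi + rng.gauss(0, 1) for xi in x]        # aggression
    # MCAR: every case has the same chance of being missing.
    mcar = [rng.random() < 0.35 for _ in range(n)]
    # MAR: missingness on y depends on x (low self-control), not on y itself.
    mar = [rng.random() < (0.6 if xi < 0 else 0.1) for xi in x]
    # MNAR: missingness on y depends on the value of y itself.
    mnar = [rng.random() < (0.6 if yi > 0 else 0.1) for yi in y]
    return {
        "full": sum(y) / n,
        "mcar": observed_mean(y, mcar),
        "mar": observed_mean(y, mar),
        "mnar": observed_mean(y, mnar),
    }

for name, value in simulate().items():
    print(name, round(value, 2))
```

The observed mean of y stays close to the full-data mean under MCAR but drifts away from it under MAR and MNAR when the missing cases are simply dropped.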
What are the three methods to deal with MNAR? (Name only)
Pattern Mixture Models
Random Coefficient Models
Selection Models
What are pattern mixture models as a method to solve MNAR?
Stratifies the sample according to the different missing data patterns (e.g. separates cases with complete data from those with missing data)
Then estimates the substantive model within each pattern
Pools the estimates across patterns to create weighted parameter estimates
Good to include as part of a sensitivity analysis
What are random coefficient models as a method to solve MNAR?
A random coefficient regression is a special type of linear mixed model. It can be used when we want to explore the relationship between a response variable (y) and a continuous explanatory variable (x) and we have repeated measurements of x and y on individual subjects.
What are selection models as a method to solve MNAR?
Combine model for predicting missingness as well as analysis model of interest
- Selection model = Predict missingness on aggression from covariates
- Substantive model = Predict aggression from self-control
Parameter estimates in the substantive model are adjusted by the selection model
Makes strong, untestable assumptions
What are the two deletion methods for missing data? (Name and describe)
Listwise deletion/Complete Case Analysis (Not recommended)
- Delete everyone who has missing data
- Will be biased unless data are MCAR
- Even if data are MCAR, power will be reduced
Pairwise deletion/Available Case Analysis (again not recommended)
- Uses available data
- Different cases contribute to different correlations in the matrix (each correlation uses only the cases with data on both variables)
- Doesn’t reduce power as much as listwise deletion, but results are biased if data are not MCAR
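A small sketch of how many cases each method keeps, using a made-up data set with three hypothetical variables `a`, `b` and `c` (`None` marks a missing value):

```python
from itertools import combinations

# Hypothetical data: five participants, three variables.
rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 2, "b": None, "c": 1},
    {"a": 3, "b": 4, "c": None},
    {"a": None, "b": 1, "c": 2},
    {"a": 5, "b": 6, "c": 7},
]

def listwise(rows):
    """Keep only participants with no missing values at all."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_counts(rows):
    """Number of cases available for each pair of variables."""
    variables = list(rows[0])
    return {
        (a, b): sum(1 for r in rows if r[a] is not None and r[b] is not None)
        for a, b in combinations(variables, 2)
    }

print(len(listwise(rows)))      # cases left after listwise deletion
print(pairwise_counts(rows))    # cases used for each correlation
```

Listwise deletion keeps 2 of the 5 participants, whereas each pairwise correlation is based on 3, drawn from different subsets of participants.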
What are the three imputation methods? (Name only)
Mean imputation (not recommended)
Regression Imputation
Multiple imputation (recommended)
What is mean imputation method?
Replacing missing values with mean of that variable
Issues: Artificially reduces the variability of the data and gives biased estimates even when data are MCAR
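A sketch of the variance problem, assuming simulated scores with roughly 30% of values removed completely at random (the numbers and function names are invented):

```python
import random
import statistics

def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = statistics.mean(observed)
    return [m if v is None else v for v in values]

def simulate(n=5000, seed=5):
    rng = random.Random(seed)
    complete = [rng.gauss(50, 10) for _ in range(n)]
    # Remove about 30% of values completely at random (MCAR).
    with_gaps = [None if rng.random() < 0.3 else v for v in complete]
    imputed = mean_impute(with_gaps)
    return statistics.variance(complete), statistics.variance(imputed)

var_complete, var_imputed = simulate()
print(round(var_complete, 1), round(var_imputed, 1))
```

Every imputed case sits exactly at the mean, so the variance of the imputed variable is noticeably smaller than the variance of the complete data.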
What is regression imputation?
Replaces missing values with values predicted from regression (use lm to create a predicted value)
- Estimate a set of regression equations where the incomplete variables are predicted from the complete variables
- Use the regression equations to calculate the predicted values on the incomplete variables
Based on the principle of using information from the complete data to estimate the missing data
Two forms:
- Normal regression
- Stochastic regression (adds residual term to overcome loss of variance)
Stochastic regression is preferred and is unbiased if data are MAR
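The two forms can be compared in a sketch like the one below. It assumes simulated data where y is missing more often when x is high (MAR), and written in Python rather than with lm; all coefficients and names are invented.

```python
import random
import statistics

def fit(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b = sxy / sxx
    return my - b * mx, b

def simulate(n=20000, seed=6):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [0.5 * xi + rng.gauss(0, 1) for xi in x]
    # MAR: y is more likely to be missing when x is high.
    miss = [rng.random() < (0.7 if xi > 0 else 0.2) for xi in x]
    cx = [xi for xi, m in zip(x, miss) if not m]
    cy = [yi for yi, m in zip(y, miss) if not m]
    b0, b1 = fit(cx, cy)                       # regression on complete cases
    resid_sd = statistics.pstdev(
        [yi - (b0 + b1 * xi) for xi, yi in zip(cx, cy)])
    # Normal regression imputation: fill in the predicted value only.
    normal = [b0 + b1 * xi if m else yi for xi, yi, m in zip(x, y, miss)]
    # Stochastic regression imputation: predicted value plus a random residual.
    stochastic = [b0 + b1 * xi + rng.gauss(0, resid_sd) if m else yi
                  for xi, yi, m in zip(x, y, miss)]
    return (statistics.variance(y), statistics.variance(normal),
            statistics.variance(stochastic))

full, normal, stochastic = simulate()
print(round(full, 2), round(normal, 2), round(stochastic, 2))
```

Normal regression imputation understates the variance of y because every imputed value lies exactly on the regression line; adding a random residual brings the variance back close to that of the full data.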
What is multiple imputation (recommended method)?
Imputes data multiple times to create multiple data sets; the analysis is conducted on each data set, and the results are pooled across sets to estimate parameters and SEs (20+ sets is ideal)
SE takes account of additional uncertainty due to missingness
Include as many higher-order effects as possible
Unbiased under MAR
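The pooling step is commonly done with Rubin's rules, sketched below. The estimates and standard errors passed in are made-up numbers standing in for the results of the same analysis run on three imputed data sets, and `pool` is a hypothetical helper name.

```python
import math
import statistics

def pool(estimates, standard_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.mean([se ** 2 for se in standard_errors])
    # Between-imputation variance: how much the estimates differ across sets.
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical results from three imputed data sets.
estimate, se = pool([1.0, 1.2, 0.8], [0.1, 0.1, 0.1])
print(round(estimate, 2), round(se, 3))
```

The pooled SE (about 0.25) is larger than the SE from any single data set (0.1) because the between-imputation variance carries the extra uncertainty due to missingness.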
What is the maximum likelihood estimation (MLE) approach to missing data? (Recommended)
Uses all information in model to create estimates as if they are complete
The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed
Doesn’t compute individual values
Unbiased estimates under MAR
Assumes multivariate normality
Under MCAR, superior to deletion methods as it uses more information
Easier to implement than imputation
What conditions best suit Maximum Likelihood Estimation (MLE), and what conditions best suit Multiple Imputation (MI)?
MLE = Better
- When the substantive model includes interactions
- For structural equation models (more on this in dapR3)
- For the inexperienced (easier to learn and implement)
MI = Better
- When structural equation models have categorical indicators
- When there is missing data on predictors
- When including auxiliary variables (any variable about which information is available prior to data collection)