17. Understanding Linear Models Flashcards

1
Q

What is causality?

A

One event directly leads to another event

(Different from covariance where two variables change together)

2
Q

What are the different conditions for causality?

A

Covariation
- The two variables must vary together (necessary, but not sufficient on its own)
- E.g. ice cream sales and shark attacks covary, yet neither causes the other

Plausibility
- Is the causation actually plausible to occur?

Temporal precedence

  • The cause (A) must happen before the effect (B), and B must not lead to A

No reasonable alternatives
- Hard to establish
- Failing to account for alternative explanations may lead to spurious correlations

3
Q

How can causality be tested?

A

Causal relationships are identified through study design rather than statistical testing

E.g. use an experimental rather than an observational design (manipulate one variable and observe its effect on the other)

  • Requires a well-designed study in the first place, as many studies are poorly designed

OR

Propensity score matching or instrumental variable analysis (use statistics to stand in for experimental control, e.g. by simulating a control group)

4
Q

What is a marginal distribution?

A

The distribution of a variable’s values, irrespective of the values of other variables

5
Q

What is a conditional distribution?

A

The distribution of a variable’s values, given the value of another variable
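
The difference between a marginal and a conditional distribution can be illustrated with a small table of counts. The sketch below is illustrative only: the records (whether a student studied for more than five hours and whether they passed) are made-up values, not data from the course.

```python
from fractions import Fraction

# Hypothetical records: (studied_over_5_hours, passed_exam)
records = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]

def proportion_passed(rows):
    """Share of rows in which the exam was passed."""
    return Fraction(sum(passed for _, passed in rows), len(rows))

# Marginal: P(pass), ignoring study time altogether
marginal_pass = proportion_passed(records)

# Conditional: P(pass | studied) and P(pass | did not study)
pass_given_studied = proportion_passed([r for r in records if r[0]])
pass_given_not_studied = proportion_passed([r for r in records if not r[0]])

print(marginal_pass, pass_given_studied, pass_given_not_studied)
```

The marginal proportion uses every record, while each conditional proportion uses only the records that match the condition.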

6
Q

What is endogeneity?

A

Theoretically, occurs when the marginal distribution of the predictor variable is not independent of the conditional distribution of the outcome variable given the predictor variable

In practice, occurs when predictor variable x is correlated with the error term ϵ, which biases the beta estimates

Cov(x, ϵ) ≠ 0

7
Q

What is an endogenous variable?

A

An endogenous variable is any variable in the regression model that is correlated with the error term.

The variable’s value is determined within the model

8
Q

What is an exogenous variable?

A

An exogenous variable is an explanatory variable that is not correlated with the error term

The variable’s value is determined outside the model, not by it

9
Q

What are the problems with endogeneity?

A
  • Can’t easily test whether variables are endogenous
    • If a model contains both endogenous and exogenous variables, the model’s estimate of the error will be biased by the endogenous variable
  • Even if you detect endogeneity, you must still determine why it is there in order to solve the issue
10
Q

What are the different sources of endogeneity? (Name only)

A

Simultaneity bias
Omitted/Confounding variables
Measurement Error

11
Q

What is simultaneity bias?

A

x causes y, which in turn causes x

E.g. Farmer’s income <-> crop yield

y = β0 + β1 x1 (exogenous) + β2 x2 (endogenous) + ϵ

If endogeneity is due to simultaneity (x and y are determined at the same time), a change in the exogenous x1 changes y, which in turn changes the endogenous x2, because x2 is itself linked to the DV within the model

More endogenous variables = Effect is more pronounced

12
Q

How do we solve simultaneity bias?

A

Use statistical methods developed for this situation (two-stage least squares regression)
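
A minimal simulation sketch of two-stage least squares, written with the Python standard library only. Everything in it is an assumption made for illustration: the coefficient values, the feedback strength, and the instrument z (a variable that affects x but is unrelated to the error in the y equation). It is not the course's own implementation.

```python
import random

def ols(x, y):
    """Intercept and slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(1)
n = 20000
true_slope = 2.0   # assumed effect of x on y
feedback = 0.3     # assumed effect of y back on x (the simultaneity)

z = [rng.gauss(0, 1) for _ in range(n)]    # instrument: moves x, unrelated to e
e = [rng.gauss(0, 2) for _ in range(n)]    # error in the y equation
v = [rng.gauss(0, 0.5) for _ in range(n)]  # error in the x equation

# Solve y = true_slope*x + e and x = feedback*y + z + v jointly for x
x = [(feedback * ei + zi + vi) / (1 - feedback * true_slope)
     for ei, zi, vi in zip(e, z, v)]
y = [true_slope * xi + ei for xi, ei in zip(x, e)]

_, naive_slope = ols(x, y)       # biased: x is correlated with e

a0, a1 = ols(z, x)               # stage 1: predict x from the instrument
x_hat = [a0 + a1 * zi for zi in z]
_, tsls_slope = ols(x_hat, y)    # stage 2: regress y on the predicted x

print(round(naive_slope, 2), round(tsls_slope, 2))
```

In this set-up the naive slope should overshoot the true value while the two-stage slope should land close to it. Running the second stage by hand like this gives a usable point estimate, but its standard errors are not valid, so dedicated 2SLS software is the safer choice in practice.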

13
Q

How do omitted/confounding variables explain endogeneity?

A

In a perfectly exogenous model, the effect of x on y is separate from the error term

When x is correlated with both the outcome and an omitted variable z, the variance explained by z falls into ϵ, so x becomes correlated with the error term
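
Omitted variable bias can be sketched with a short simulation. The coefficients below (a true effect of 1.0 for x and 1.5 for the confounder z) are assumed values chosen for illustration. The adjusted estimate removes z from x before regressing, which gives the same coefficient for x as a regression of y on both x and z.

```python
import random

def slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(2)
n = 20000
z = [rng.gauss(0, 1) for _ in range(n)]          # confounder
x = [0.7 * zi + rng.gauss(0, 1) for zi in z]     # x depends on z
y = [1.0 * xi + 1.5 * zi + rng.gauss(0, 1)       # y depends on x and z
     for xi, zi in zip(x, z)]

# Model that omits z: the effect of z leaks into the error term
omitted_slope = slope(x, y)

# Model that includes z: partial z out of x, then regress y on what is left
b_xz = slope(z, x)
mean_x, mean_z = sum(x) / n, sum(z) / n
x_resid = [xi - mean_x - b_xz * (zi - mean_z) for xi, zi in zip(x, z)]
adjusted_slope = slope(x_resid, y)

print(round(omitted_slope, 2), round(adjusted_slope, 2))
```

With z left out, the slope for x should be inflated well above 1.0. Once z is accounted for, the estimate should return to roughly the assumed true effect.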

14
Q

What is the solution for endogeneity when omitted/confounding variables cause it?

A

Ensure confounds are measured and included in the model. This is no small task: it requires thorough knowledge of the topic

15
Q

How does measurement error cause endogeneity?

A

Instead of measuring x, you measure x∗, which is x plus a measurement error r (x∗ = x + r)

E.g. Reporting errors and coding errors

Similar to the case of omitted variables, the measurement error becomes part of the error term but is also associated with the measured predictor x∗, leading to endogeneity
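
The effect of measurement error in a predictor can be sketched as follows. The true slope of 2.0 and the error variance are assumed values for illustration. With equal variance in x and in the measurement error r, the slope estimated from x∗ is expected to shrink to about half the true value.

```python
import random

def slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(3)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]           # true predictor
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]      # outcome depends on true x
x_star = [xi + rng.gauss(0, 1) for xi in x]       # observed x* = x + r

slope_true_x = slope(x, y)        # what we would get with perfect measurement
slope_noisy_x = slope(x_star, y)  # what we get with the error-laden measure

print(round(slope_true_x, 2), round(slope_noisy_x, 2))
```

The estimate is pulled towards zero because part of the variation in x∗ is noise that has no relationship with y.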

16
Q

How is the issue of measurement error in causing endogeneity solved?

A

Careful planning and study design

E.g. through pilot testing

17
Q

What is prediction?

A

Important application of understanding causality

E.g. it is the aim of the linear model (LM): to produce a model that predicts the outcome variable

18
Q

How do we predict values outwith the original data set?

A

When collecting data, the range of values sampled for the predictor and outcome may not span the full range of the variables as they exist in the world

E.g. Hours spent studying

Can think about it as two sets of unknown values
- Those within the range of data used to estimate the model
- Those outside the range of data used to estimate the model

19
Q

What is interpolation?

A

Obtaining a value from a model within the range of the given data points

20
Q

What is extrapolation?

A

Obtaining a value from a model outside the range of the given data points

Extrapolation is not recommended, especially as the trajectory beyond the data is unknown
There are no data points on either side of the predicted value to show the pattern
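
A small sketch of why extrapolation is risky. It assumes a hypothetical curved relationship between hours studied and exam score (diminishing returns), samples only 0 to 5 hours, fits a straight line, and then compares a prediction inside the sampled range with one far outside it. The function and its numbers are made up for illustration.

```python
def fit_line(x, y):
    """Intercept and slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def true_score(hours):
    """Hypothetical true relationship: returns diminish as hours increase."""
    return 40 + 12 * hours - 0.6 * hours ** 2

hours = [h * 0.5 for h in range(11)]        # sampled range: 0 to 5 hours
scores = [true_score(h) for h in hours]
b0, b1 = fit_line(hours, scores)

def predict(h):
    return b0 + b1 * h

interp_error = abs(predict(2.5) - true_score(2.5))   # inside the range
extrap_error = abs(predict(15) - true_score(15))     # far outside the range

print(round(interp_error, 2), round(extrap_error, 2))
```

Within the sampled range the straight line is a reasonable approximation. Outside it, the line keeps rising while the assumed true curve bends, so the prediction error grows sharply.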

21
Q

What is missing data?

A

Missing data, or missing values, occur when you don’t have data stored for certain variables or participants

22
Q

What are the reasons for missing data?

A

Participant non-response, errors in data collection, errors in data entry, missing by design

23
Q

What are the two main things to worry about when it comes to missing data?

A

Loss of efficiency (a smaller sample size means reduced power) and bias (incorrect estimates)

24
Q

What are the types of missing data? (name only)

A

Missing at random (MAR)
Missing completely at random (MCAR)
Missing not at random (MNAR)

25
Q

What is missing at random? (MAR)

A

When the probability of missing data on a variable y is related to other variables in the model but not to the values of y itself

E.g. those with low self-control are more likely to have missing data on aggression

  • No way to confirm that this relationship holds
26
Q

What is missing completely at random (MCAR)?

A

Genuinely random missingness
Missingness is unrelated to Y and to every other variable in the model
E.g. people at all levels of self-control and aggression have an equal chance of missingness

27
Q

What is missing not at random (MNAR)?

A

When the probability of missingness on Y is related to the values of Y itself

E.g. those high in aggression (y) are more likely to have missing data, even when x is controlled

  • No way to verify
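
The three missingness mechanisms can be compared in a simulation sketch. It reuses the self-control and aggression example, but the effect size and the missingness probabilities are assumed values chosen for illustration. In the simulated population, mean aggression is 0 and the slope of aggression on self-control is -0.5.

```python
import random

def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(3)
n = 20000
self_control = [rng.gauss(0, 1) for _ in range(n)]
aggression = [-0.5 * s + rng.gauss(0, 1) for s in self_control]

def observed(p_missing):
    """Keep each case with probability 1 - p_missing(self_control, aggression)."""
    return [(s, a) for s, a in zip(self_control, aggression)
            if rng.random() >= p_missing(s, a)]

mcar = observed(lambda s, a: 0.3)                     # same chance for everyone
mar = observed(lambda s, a: 0.6 if s < 0 else 0.1)    # depends on self-control
mnar = observed(lambda s, a: 0.6 if a > 0 else 0.1)   # depends on aggression itself

mcar_mean = mean([a for _, a in mcar])
mar_mean = mean([a for _, a in mar])
mnar_mean = mean([a for _, a in mnar])
mar_slope = slope([s for s, _ in mar], [a for _, a in mar])

print(round(mcar_mean, 2), round(mar_mean, 2), round(mnar_mean, 2), round(mar_slope, 2))
```

Under MCAR the observed mean should stay near 0. Under MAR and MNAR the observed mean is pulled away from 0. In this particular MAR set-up the regression slope should still be close to -0.5, because missingness depends only on a variable that is in the model.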
28
Q

What are the three methods to deal with MNAR? (Name only)

A

Pattern Mixture Models
Random Coefficient Models
Selection Models

29
Q

What are pattern mixture models as a method to solve MNAR?

A

Stratifies the sample according to the different missing data patterns (e.g. separates cases with complete data from those with missing data)

Then estimates the substantive model within each pattern (one sub-model per pattern)

Pools the groups together to create weighted parameter estimates

Good to include as part of a sensitivity analysis

30
Q

What are random coefficient models as a method to solve MNAR?

A

A random coefficient regression is a special type of linear mixed model. They can be used when we want to explore the relationship between a response variable (y) and a continuous explanatory variable (x) and we have repeated measurements of x and y on individual subjects.

31
Q

What are selection models as a method to solve MNAR?

A

Combine model for predicting missingness as well as analysis model of interest

  • Selection model = Predict missingness on aggression from covariates
  • Substantive model = Predict aggression from self-control

Parameter estimates in the substantive model are adjusted by the selection model

Makes strong, untestable assumptions

32
Q

What are the two deletion methods for missing data? (Name and describe)

A

Listwise deletion/complete case analysis (not recommended)

  • Delete everyone who has missing data
  • Will be biased unless data are MCAR
  • Even if data are MCAR, power will be reduced

Pairwise deletion/available case analysis (also not recommended)

  • Uses all available data
  • Different cases contribute to different correlations in the matrix (each correlation uses only the cases with data on both variables involved)
  • Doesn’t reduce power as much as listwise deletion, but results are biased if data are not MCAR
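
The difference between the two deletion methods shows up in how many cases each one keeps. The sketch below uses a tiny made-up data set (the variable names and values are hypothetical), with None marking a missing value.

```python
from itertools import combinations

names = ("self_control", "aggression", "age")

# Hypothetical data: one tuple per participant, None = missing
rows = [
    (1, 2, None),
    (2, None, 5),
    (3, 4, 6),
    (4, 5, None),
    (None, 6, 8),
    (6, 7, 9),
]

# Listwise deletion: keep only participants with no missing values at all
complete_cases = [r for r in rows if None not in r]
listwise_n = len(complete_cases)

# Pairwise deletion: for each pair of variables, keep every participant
# who has data on both variables in that pair
pairwise_n = {}
for i, j in combinations(range(len(names)), 2):
    pairwise_n[(names[i], names[j])] = sum(
        1 for r in rows if r[i] is not None and r[j] is not None
    )

print(listwise_n)
print(pairwise_n)
```

Listwise deletion leaves one fixed sample for every analysis. Pairwise deletion keeps more data, but each correlation is then based on a different subset of participants.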
33
Q

What are the three imputation methods? (Name only)

A

Mean imputation (not recommended)
Regression Imputation
Multiple imputation (recommended)

34
Q

What is mean imputation method?

A

Replacing missing values with mean of that variable

Issues: Artificially reduces the variability of the data and gives biased estimates even when data are MCAR
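
A short sketch of what mean imputation does to the spread of a variable. The data are simulated and the 40% missingness rate is an assumed value. Every imputed case sits exactly at the mean, so the variance of the completed variable is smaller than the variance of the observed values.

```python
import random
import statistics

rng = random.Random(4)
y = [rng.gauss(50, 10) for _ in range(1000)]

# Remove roughly 40% of values completely at random (None = missing)
y_missing = [v if rng.random() > 0.4 else None for v in y]

observed = [v for v in y_missing if v is not None]
observed_mean = statistics.mean(observed)

# Mean imputation: every missing value is replaced by the observed mean
imputed = [observed_mean if v is None else v for v in y_missing]

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)

print(round(var_observed, 1), round(var_imputed, 1))
```

The mean is unchanged by the imputation, but the variance shrinks, which in turn distorts correlations and standard errors that rely on it.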

35
Q

What is regression imputation?

A

Replaces missing values with values predicted from a regression (use lm to create a predicted value)
- Estimate a set of regression equations where the incomplete variables are predicted from the complete variables
- Use the regression equations to calculate the predicted values on the incomplete variables

Based on the principle of using information from the complete data to estimate the missing data

Two forms:

  • Normal regression
  • Stochastic regression (adds a residual term to overcome the loss of variance)

Stochastic regression is preferred and is unbiased if data are MAR
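
The two forms of regression imputation can be compared in a simulation sketch. The model y = 1 + 2x + error and the 40% missingness rate are assumed values for illustration, and the residual draw from a normal distribution is one simple way of adding the residual term, not necessarily the one used by any particular package.

```python
import random
import statistics

def ols(x, y):
    """Intercept and slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(5)
n = 10000
x = [rng.gauss(0, 1) for _ in range(n)]                # complete variable
y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]         # true variance of y is 5
missing = [rng.random() < 0.4 for _ in range(n)]       # which y values are lost

# Fit the regression on cases where y was observed
obs_x = [xi for xi, m in zip(x, missing) if not m]
obs_y = [yi for yi, m in zip(y, missing) if not m]
b0, b1 = ols(obs_x, obs_y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(obs_x, obs_y)]
resid_sd = (sum(r * r for r in residuals) / (len(residuals) - 2)) ** 0.5

# Normal regression imputation: fill in the predicted value
deterministic = [b0 + b1 * xi if m else yi
                 for xi, yi, m in zip(x, y, missing)]

# Stochastic regression imputation: predicted value plus a random residual
stochastic = [b0 + b1 * xi + rng.gauss(0, resid_sd) if m else yi
              for xi, yi, m in zip(x, y, missing)]

var_det = statistics.variance(deterministic)
var_stoch = statistics.variance(stochastic)

print(round(var_det, 2), round(var_stoch, 2))
```

Imputed values from the normal form all lie exactly on the regression line, so the completed variable has too little variance. Adding a residual draw restores the variance to roughly its true level.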

36
Q

What is multiple imputation (recommended method)?

A

Imputes the data multiple times to create multiple data sets, runs the analysis on each data set, then pools the results across sets to estimate the parameters and their SEs (20+ sets is ideal)

The SE takes account of the additional uncertainty due to missingness

Include as many higher-order effects as possible

Unbiased under MAR
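
The pooling step is commonly done with Rubin's rules, which the sketch below implements for a single parameter. The estimates and standard errors in the example are made-up numbers, standing in for the results of fitting the same model to five imputed data sets.

```python
import math
import statistics

def pool(estimates, standard_errors):
    """Pool one parameter across imputed data sets using Rubin's rules."""
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)
    # Within-imputation variance: average squared standard error
    within = statistics.mean([se ** 2 for se in standard_errors])
    # Between-imputation variance: how much the estimates differ across sets
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)

# Hypothetical slopes and SEs from five imputed data sets
slopes = [0.52, 0.48, 0.55, 0.50, 0.45]
ses = [0.10, 0.11, 0.10, 0.12, 0.10]

pooled_slope, pooled_se = pool(slopes, ses)
print(round(pooled_slope, 3), round(pooled_se, 3))
```

The between-imputation term is what lets the pooled SE reflect the extra uncertainty due to missingness: the more the estimates disagree across imputed data sets, the larger the pooled SE.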

37
Q

What is the maximum likelihood estimation (MLE) approach to missing data? (Recommended)

A

Uses all available information in the model to create estimates as if the data were complete

The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed

Doesn’t compute individual values

Unbiased estimates under MAR

Assumes multivariate normality

Under MCAR, superior to deletion methods as it uses more information

Easier to implement than imputation

38
Q

What conditions best suit when to use Maximum Likelihood Estimate (MLE) compared to conditions that suit Multiple imputation?

A

MLE = Better when

  • The substantive model includes interactions
  • Using structural equation models (more on this in dapR3)
  • The analyst is inexperienced (easier to learn and implement)

MI = Better when

  • The structural equation model has categorical indicators
  • There is missing data on predictors
  • Including auxiliary variables (any variable about which information is available prior to data collection)