Linear models refresher Flashcards
What are deterministic models?
Given the same input, the model will always return exactly the same output
e.g. y = mx + c
What are statistical models?
they contain some pattern but there is variation around this pattern
e.g. given someone’s age we could attempt to predict their height
What are linear models?
They are a type of statistical model
outcome = (model) + error
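A minimal sketch of "outcome = (model) + error", assuming a straight-line model and normally distributed error. The function name, coefficient values and noise level are all hypothetical, chosen only for illustration.

```python
import random

def simulate_outcomes(xs, intercept, slope, noise_sd, seed=0):
    """Return y values built as (intercept + slope * x) + random error."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    ys = []
    for x in xs:
        model_part = intercept + slope * x      # the pattern
        error_part = rng.gauss(0.0, noise_sd)   # variation around the pattern
        ys.append(model_part + error_part)
    return ys

xs = [1, 2, 3, 4, 5]
# With no error the model is deterministic: the same x always gives the same y.
deterministic = simulate_outcomes(xs, 2.0, 0.5, noise_sd=0.0)
# With error the same pattern is still there, but observations scatter around it.
noisy = simulate_outcomes(xs, 2.0, 0.5, noise_sd=1.0)
```

Setting noise_sd to 0 recovers the deterministic case; any positive value gives a statistical model.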
Interaction terms
These are used when the effect of one predictor on y depends on the value of another predictor, so we include the product of the two predictors (x1 * x2) as an extra term in the model
“as we move up 1 unit in x1, the slope of x2 changes by…”
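A small sketch of how an interaction coefficient is read, assuming the model y = b0 + b1*x1 + b2*x2 + b3*x1*x2. The coefficient values are made up for illustration and are not estimates from any data.

```python
def predict(x1, x2, b0, b1, b2, b3):
    """Prediction from a model with an x1-by-x2 interaction term."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def slope_of_x2(x1, b2, b3):
    """Slope of x2 at a given value of x1: it shifts by b3 per unit of x1."""
    return b2 + b3 * x1

# Hypothetical coefficients
b0, b1, b2, b3 = 1.0, 2.0, 0.5, 0.25

slope_at_0 = slope_of_x2(0, b2, b3)
slope_at_1 = slope_of_x2(1, b2, b3)
change_in_slope = slope_at_1 - slope_at_0  # equals b3
```

Here moving up 1 unit in x1 changes the slope of x2 by b3.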
What is standard error?
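how much we estimate a sample statistic (e.g. a mean or a coefficient) would vary from sample to sample due to random sampling error
it is the standard deviation of that statistic's sampling distribution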
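As one common case, the standard error of a sample mean can be estimated as the sample standard deviation divided by the square root of n. The sketch below assumes that case only; the data values are invented.

```python
import math
import statistics

def standard_error_of_mean(values):
    """Estimated standard error of the mean: s / sqrt(n)."""
    n = len(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(n)

sample = [4, 8, 6, 5, 7]  # hypothetical observations
se = standard_error_of_mean(sample)
```

Larger samples give smaller standard errors, because n sits in the denominator.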
What is the p-value?
It is the probability of obtaining a result equal to or more extreme than the one we observed, if the null is true
We use this to evaluate our null hypothesis
The smaller the p-value, the less compatible our data are with the null (it is not the probability that the null is true)
What is the sums of squares residual (SSresidual)?
the sum of the squared differences between each observed value of y and the model line
sum of (y - y hat)^2
allows us to consider the model as a whole - rather than as specific coefficients
What is the sums of squares total (SStotal)?
the sum of the squared differences between each observed value of y and the mean of y
sum of (y - y bar)^2
What is the sums of squares model (SSmodel)?
the sum of the squared differences between the model line and the mean of y
sum of (y hat - y bar)^2
What is R squared?
= the proportion of variance in y that is explained by our predictors (we want this to be high)
R^2 = SSmodel / SStotal
R^2 = 1 - ( SSresidual / SStotal )
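The three sums of squares and both R^2 formulas can be checked on a toy example. This is a sketch assuming a single predictor fitted by ordinary least squares; the data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sums_of_squares(xs, ys):
    """Return (SSresidual, SStotal, SSmodel) for the fitted line."""
    intercept, slope = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    y_hat = [intercept + slope * x for x in xs]
    ss_residual = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_model = sum((yh - y_bar) ** 2 for yh in y_hat)
    return ss_residual, ss_total, ss_model

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

ss_residual, ss_total, ss_model = sums_of_squares(xs, ys)
r_squared_a = ss_model / ss_total
r_squared_b = 1 - ss_residual / ss_total
```

For a least-squares fit with an intercept, SSmodel + SSresidual = SStotal, which is why the two R^2 formulas agree.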
What are joint tests?
They allow us to make inferences about the improvement in model fit from including additional parameters (i.e. they test the reduction in SSresidual)
In R, this is what the anova() function is for
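A sketch of the F statistic usually used to compare two nested linear models, written in Python rather than R. It only computes the statistic: turning it into a p-value needs the F distribution, which is not in Python's standard library. The function name and input values are hypothetical.

```python
def nested_f_statistic(ss_res_restricted, ss_res_full,
                       n_obs, n_params_restricted, n_params_full):
    """F statistic for the reduction in SSresidual from adding parameters.

    n_params_* count every estimated coefficient, including the intercept.
    """
    extra_params = n_params_full - n_params_restricted
    df_full = n_obs - n_params_full  # residual degrees of freedom, full model
    reduction_per_param = (ss_res_restricted - ss_res_full) / extra_params
    residual_variance = ss_res_full / df_full
    return reduction_per_param / residual_variance

# Hypothetical values: intercept-only model (SSresidual = 6.0) versus a model
# with one added predictor (SSresidual = 2.4), fitted to 5 observations
f_stat = nested_f_statistic(6.0, 2.4, n_obs=5,
                            n_params_restricted=1, n_params_full=2)
```

A large F means the extra parameters reduce SSresidual by more than we would expect from noise alone.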
What are the assumptions of a linear model?
Our residuals (y - y hat) reflect everything we don’t account for in our models - in an ideal world these residuals would be just like random noise
We check how closely the residuals resemble random noise:
- mean of 0
- normal distribution
- constant variance ( = no detectable plotted pattern)
- independent and identically distributed
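A rough sketch of the starting point for these checks: compute the fitted values and residuals that we would then plot. It assumes a single-predictor least-squares fit; the data and function names are hypothetical. Note that when the model includes an intercept, least squares forces the residual mean to be 0, so the other checks are the informative ones.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residual_summary(xs, ys):
    """Return fitted values, residuals (y - y hat) and the residual mean."""
    intercept, slope = fit_line(xs, ys)
    fitted = [intercept + slope * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    mean_residual = sum(residuals) / len(residuals)
    return fitted, residuals, mean_residual

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

fitted, residuals, mean_residual = residual_summary(xs, ys)
# Pairs to plot (fitted value on x, residual on y) when looking for
# non-constant variance or any other visible pattern
points_to_plot = list(zip(fitted, residuals))
```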
What to do if our model does not meet assumptions:
- check if the model is mis-specified
- transform the outcome variable?
- bootstrap?
None of these will help if the independence assumption is violated
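A minimal sketch of the bootstrap option, assuming one particular variant: resampling (x, y) pairs with replacement and refitting the slope each time. The spread of the refitted slopes gives a standard error that does not rely on normally distributed residuals. The data, function names and number of resamples are hypothetical, and the resampling itself still treats observations as independent.

```python
import random
import statistics

def fit_slope(xs, ys):
    """Least-squares slope for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def bootstrap_slope_se(xs, ys, n_resamples=2000, seed=0):
    """Standard deviation of slopes refitted on resampled (x, y) pairs."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(xs)
    slopes = []
    while len(slopes) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # slope is undefined if every resampled x is identical
        slopes.append(fit_slope(bx, by))
    return statistics.stdev(slopes)

# Hypothetical data
xs = list(range(1, 11))
ys = [2.1, 2.9, 3.2, 4.8, 5.1, 5.8, 7.4, 7.9, 9.2, 9.8]

slope_se = bootstrap_slope_se(xs, ys)
```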