L21 Part 2 - Single regression (chapter 8 part 1) Flashcards
What is linear regression?
Models the relationship between a scalar dependent variable y and one or more explanatory variables x
↪ outcome = model prediction + error
- One explanatory variable → single linear regression
- The relationship is modelled using linear predictor functions whose unknown parameters are estimated from the data
What is the formula for linear regression?
Picture 1 - the regression equation: Yi = b0 + b1*Xi + ei; it expresses how our model predicts (how much is accuracy and how much is error?)
Y - outcome variable
bs - the model parameters; they represent what we’re interested in estimating
- b0 - the intercept: the baseline level of the outcome that we predict when the predictor is 0
- b1 - the regression coefficient for our single predictor variable; it quantifies how strong the association is between our predictor and outcome variable
↪ we multiply it by our predictor variable X, and this product gives us the model prediction
↪ b1 is calculated from the correlation between the two variables (for a single predictor, b1 = r × SDy/SDx), so the higher the correlation, the stronger the predictive value the predictor has
- when we add hats to the bs, they are estimates of the population parameters based on a sample
e - errors in the prediction of our sample model (the residuals)
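A minimal sketch of this equation in code, assuming a small made-up dataset (x = predictor, y = outcome; all names and numbers are illustrative):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

fit = stats.linregress(x, y)           # estimates b0 (intercept) and b1 (slope)
y_hat = fit.intercept + fit.slope * x  # model prediction: b0 + b1*X
residuals = y - y_hat                  # e: the errors of the prediction

print(f"b0 = {fit.intercept:.3f}, b1 = {fit.slope:.3f}, r = {fit.rvalue:.3f}")
```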
Assumptions for linear regression
Picture 14 - general procedure for fitting a regression model, showing what to do about each assumption
- Continuous variables
- Linearity
- Independent errors across observations
- Sensitivity (outliers)
- Homoscedasticity (equivalent to equal variances in ANOVA)
- Normality (model residuals are normally distributed; visualised with QQ plots)
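As a sketch of the normality check just mentioned, residuals from a toy fit can be put on a Q-Q plot (data simulated, names illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)  # toy data

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

stats.probplot(residuals, dist="norm", plot=plt)  # points near the line -> roughly normal residuals
plt.title("Q-Q plot of model residuals")
plt.show()
```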
What is linearity?
For this assumption to hold, the predictors must have a linear relationship with the outcome variable
- checked through: correlations and a matrix scatterplot of the predictors and the outcome variable
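A minimal sketch of these two checks, assuming simulated data and made-up column names:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({"predictor": rng.normal(size=100)})
df["outcome"] = 1.0 + 0.8 * df["predictor"] + rng.normal(scale=0.5, size=100)

print(df.corr())                # pairwise correlations
pd.plotting.scatter_matrix(df)  # matrix scatterplot of predictor(s) and outcome
plt.show()
```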
What is sensitivity?
Potential influence of outliers
We look at outliers through:
- Extreme residuals
- Cook’s distance
- Check Q-Q plots, residual plots, and casewise diagnostics (Cook’s distance)
What is the difference between unstandardized residuals and standardized residuals?
Residuals represent the error present in the model (small residual = model fits the sample data well)
Unstandardized residuals - raw differences between predicted and observed values of the outcome variable
↪ measured in the same units as the outcome variable, which makes them difficult to generalize
Standardized residuals - residuals converted to z-scores and so are expressed in SD units (mean 0, sd 1)
- With standardized residuals we can assess which data points are outside of the general pattern of the data set
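A sketch of the conversion, using the simple z-score definition given above (toy data; statsmodels also offers studentized variants that divide by a leverage-adjusted standard error instead):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(scale=0.4, size=100)

fit = stats.linregress(x, y)
raw = y - (fit.intercept + fit.slope * x)      # unstandardized: same units as the outcome
standardized = (raw - raw.mean()) / raw.std()  # z-scores: mean 0, sd 1

print(np.where(np.abs(standardized) > 3)[0])   # cases outside the general pattern
```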
Diagnostic statistics
What is leverage?
It gauges the influence of the observed value of the outcome variable over the predicted values
- The average leverage is defined as (k+1)/n, where k is the number of predictors in the model and n is the number of cases
- Leverage can vary from 0 (no influence) to 1 (the case has complete influence over predictions)
- If no cases exert undue influence over the model, all leverage values should be close to the average value of (k+1)/n
- Values greater than twice the average should be investigated
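A sketch of computing leverage values and the rule of thumb above, assuming simulated data (statsmodels’ hat_matrix_diag gives each case’s leverage):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 1))  # one predictor, so k = 1
y = 2.0 + 0.7 * X[:, 0] + rng.normal(scale=0.5, size=40)

results = sm.OLS(y, sm.add_constant(X)).fit()
leverage = results.get_influence().hat_matrix_diag

k, n = 1, len(y)
average = (k + 1) / n                       # average leverage
print(np.where(leverage > 2 * average)[0])  # cases worth investigating
```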
What is cook’s distance?
A measure of the overall influence of a case on the model
↪ computed for every observation separately; it assesses to what extent our results would change with and without that observation (if removing a participant changes our results, that case has a high Cook’s distance)
↪ Cook’s distance should be < 1 for this assumption to be met
↪ combines the point’s leverage and its residual; a point with high leverage and a high residual will have a large Cook’s distance, and therefore a strong influence on the fitted values across the model
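A sketch using statsmodels’ cooks_distance on simulated data with one planted high-leverage, off-trend point (all numbers illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 1))
y = 2.0 + 0.7 * X[:, 0] + rng.normal(scale=0.5, size=40)
X[0, 0], y[0] = 4.0, 10.0  # plant one influential observation

results = sm.OLS(y, sm.add_constant(X)).fit()
cooks_d, _ = results.get_influence().cooks_distance

print(np.where(cooks_d > 1)[0])  # removing these cases would change the results
```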
How does an outlier affect our correlation?
Picture 2
The correlation is higher when the outlier is removed, because the outlier does not follow the general pattern of the data
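A tiny demonstration of this effect, with made-up numbers where the last point breaks the pattern:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.1, 8.0, 2.0])  # last point is the outlier

r_with, _ = stats.pearsonr(x, y)
r_without, _ = stats.pearsonr(x[:-1], y[:-1])
print(f"with outlier: r = {r_with:.2f}, without: r = {r_without:.2f}")
```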
What do we do when there is an outlier?
We must always follow up on the outlier and investigate why it occurred; we should never just remove it from our data
- see outliers as a source of information, not as an annoyance
What is homoscedasticity? How do we assess it?
The variance of the residuals should be equal (equally distributed) across all predicted values → no systematic errors
- Assess by looking at a scatterplot of standardized residuals: predicted values vs. residuals → a roughly round, patternless shape is needed (points spread out equally, no pattern in the errors)
- Done after the analysis is complete, because it is based on the residuals
- Picture 2
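A sketch of the residuals-vs-predicted plot described above, on simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(scale=0.4, size=100)

fit = stats.linregress(x, y)
predicted = fit.intercept + fit.slope * x
resid = y - predicted
z_resid = (resid - resid.mean()) / resid.std()  # standardized residuals

plt.scatter(predicted, z_resid)  # want an even, patternless cloud around 0
plt.axhline(0)
plt.xlabel("predicted values")
plt.ylabel("standardized residuals")
plt.show()
```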
What is cross-validation? Why do we use it?
Assessing the accuracy of a model across different samples
- To generalise, model must be able to accurately predict the same outcome variable from the same set of predictors in a different group of people
- If the model is applied to a different sample and there is a severe drop in its predictive power, then the model does not generalise
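A minimal sketch of this idea: fit the model on one simulated sample, then compare R² in a second, fresh sample (a severe drop would suggest poor generalisation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def make_sample(n=100):
    x = rng.normal(size=n)
    return x, 1.0 + 0.5 * x + rng.normal(scale=0.6, size=n)

x_a, y_a = make_sample()  # sample the model is built on
x_b, y_b = make_sample()  # a different group of people
fit = stats.linregress(x_a, y_a)

def r_squared(x, y):
    pred = fit.intercept + fit.slope * x
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"R^2 in sample A: {r_squared(x_a, y_a):.2f}, in sample B: {r_squared(x_b, y_b):.2f}")
```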
How large should our sample be?
It depends on the size of the effects that we’re trying to detect and how much power we want to have to detect these effects
Why does our sample size matter in terms of R?
The estimate of R that we get from regression is dependent on the number of predictors, k, and the sample size, N
- the expected R for random data (which ought to be 0) is k/(N−1), so with small sample sizes random data can appear to show a strong effect
- E.g. 6 predictors, 21 cases of data; R = 6/(21-1) = 0.3
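A simulation sketch of this point (here the long-run average of R² under pure noise is k/(N−1), matching the 0.3 in the example; all values simulated):

```python
import numpy as np

rng = np.random.default_rng(7)
k, n = 6, 21
r2_values = []
for _ in range(2000):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    y = rng.normal(size=n)  # pure noise: no real effect at all
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    pred = X @ beta
    r2_values.append(1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2))

print(np.mean(r2_values))  # close to k/(N-1) = 6/20 = 0.3, not 0
```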
What can we do as a first step in the analysis of our data?
Create a scatterplot to see whether the data are somehow associated and in which direction
- The strength of the correlation is decided later; this is just to get an idea of the data
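A sketch of that first look, with simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
x = rng.normal(size=60)
y = 1.0 + 0.6 * x + rng.normal(scale=0.7, size=60)

plt.scatter(x, y)  # eyeball the direction and rough shape of the association only
plt.xlabel("predictor")
plt.ylabel("outcome")
plt.show()
```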