9: REGRESSION Flashcards
linear regression
used when the relationship between variables x and y can be described with a straight line
correlation determines the strength of a relationship between x and y (doesn’t tell us how much y changes based on a given change in x)
regression allows us to estimate how much y will change as a result of a given change in x
terminology (regression): variables x and y
regression distinguishes between the variable being predicted and the variable(s) used to predict (in simple linear regression, only one predictor variable)
predicted variable : y
- the outcome variable
- the dependent variable
- the criterion variable
predictor variable : x
- the predictor variable
- the independent variable
- the explanatory variable
uses of linear regression (interpretation)
researchers might use regression to
- investigate the strength of the effect x has on y
- estimate how much y will change as a result of a given change in x
- predict a future value of y, based on a known value of x
unlike correlation, regression makes the assumption that y is (to some extent) dependent on x
- this may not reflect causal dependency
regression does NOT provide direct evidence of causality
stages of linear regression
1. analysing the relationship between variables
- determining the strength and direction of the relationship (correlation)
2. proposing a model to explain that relationship
- a line of best fit: y = a + bx
- find a (the intercept) and b (the gradient); see the sketch after this list
3. evaluating the model
- assessing the goodness of fit
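a minimal sketch of stage 2, assuming made-up example data (not from the source): fitting a and b by ordinary least squares

```python
# A minimal sketch of stage 2, assuming made-up example data:
# fit the line of best fit y = a + b*x by ordinary least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical predictor (x) scores
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # hypothetical outcome (y) scores

# np.polyfit with degree 1 returns the gradient (b) first, then the intercept (a)
b, a = np.polyfit(x, y, 1)
print(f"intercept a = {a:.3f}, gradient b = {b:.3f}")

# one use of the model: predict a future value of y from a known value of x
print(f"predicted y at x = 6: {a + b * 6:.3f}")
```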
linear regression: evaluating the model (stage 3): goodness of fit
simplest model:
- no relationship between x and y (b = 0): predicts the mean of y for every value of x
best model:
- based on relationship between x and y
- the regression line
is our regression model better at predicting y than the simplest model?
linear regression: calculating goodness of fit
- SSt (total sum of squares): the sum of the squared differences between the observed values of y and the mean of y (i.e. the variance in y not explained by the simplest model, b = 0)
- SSr (residual sum of squares): the sum of the squared differences between the observed values of y and those predicted by the regression line (i.e. the variance in y not explained by the regression model)
- SSm (model sum of squares): the difference between SSt and SSr (SSm = SSt - SSr); this reflects the improvement in prediction using the regression model compared to the simplest model (i.e. the reduction in unexplained variance)
- the larger SSm, the bigger the improvement in prediction using the regression model over the simplest model (see the sketch below)
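a minimal sketch of the three sums of squares, assuming the same made-up data and fitted a, b as in the earlier fitting example

```python
# A minimal sketch of SSt, SSr and SSm, assuming the same made-up data
# and the a, b fitted in the earlier example.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x                      # values predicted by the regression line

SSt = np.sum((y - y.mean()) ** 2)      # unexplained by the simplest model (b = 0)
SSr = np.sum((y - y_hat) ** 2)         # unexplained by the regression model
SSm = SSt - SSr                        # improvement in prediction
print(f"SSt = {SSt:.3f}, SSr = {SSr:.3f}, SSm = {SSm:.3f}")
```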
linear regression: assessing the goodness of fit
- we can use an F-test (i.e. ANOVA) to evaluate the improvement due to the model (SSm) relative to variance the model does not explain (SSr)
- rather than using the Sums of Squares (SS) values, the F-test uses Mean Squares (MS) values, which take the degrees of freedom into account (MS = SS / df)
- the F ratio provides a measure of how much the model has improved the prediction of y, relative to the level of inaccuracy of the model
F = MSm / MSr
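a minimal sketch of this F-test, assuming the same made-up data as above (with one predictor, df for the model = 1 and df for the residuals = n - 2)

```python
# A minimal sketch of the F-test, assuming the same made-up data as above.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b, a = np.polyfit(x, y, 1)
SSt = np.sum((y - y.mean()) ** 2)
SSr = np.sum((y - (a + b * x)) ** 2)
SSm = SSt - SSr

# with one predictor: df for the model = 1, df for the residuals = n - 2
df_m, df_r = 1, len(y) - 2
F = (SSm / df_m) / (SSr / df_r)        # F = MSm / MSr
p = stats.f.sf(F, df_m, df_r)          # probability of an F this large (or larger) under the null
print(f"F({df_m}, {df_r}) = {F:.2f}, p = {p:.4f}")
```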
linear regression: interpreting goodness of fit
- if the regression model is good, MSm will be large while MSr will be small (i.e. a large F value)
- null hypothesis: the regression model predicts no better than the simplest model (MSm = 0)
- p expresses the probability of finding an improvement of the magnitude we have obtained (or larger), when the null is true
- a significant result suggests the regression model provides a better fit than the simplest model
linear regression: assumptions
- linearity: x and y must be linearly related
- absence of outliers
- normality of residuals: residuals should be normally distributed around the predicted outcome
- homoscedasticity: variance of residuals about the outcome should be the same for all predicted scores
Dancey and Reidy state that the outcome variable should be normally distributed, but this is a simplification: strictly, it is the residuals that should be normally distributed
there is no non-parametric equivalent of regression - if assumptions are violated, we can only attempt a 'fix'
linear regression: Assumptions: Normal P-P Plot of Regression Standardized Residual
ideally, data points will lie in a reasonably straight diagonal line, from bottom left to top right
- would suggest no major deviations from normality
linear regression: Assumptions: Scatterplot of Regression Standardized Residual
ideally, residuals will be roughly rectangularly distributed, with most scores concentrated in the centre (0)
- don't want to see a systematic pattern to the residuals
outliers: standardised residuals > 3.3 or < -3.3
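a sketch of both diagnostic checks, assuming simulated data; the variable names, the rough standardisation of the residuals, and the use of scipy's probability plot are all illustrative

```python
# A sketch of both diagnostic checks, assuming simulated data that
# meet the assumptions (everything here is illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 + 0.5 * x + rng.normal(0, 1, 100)

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)
std_resid = residuals / residuals.std(ddof=2)   # roughly standardised residuals

# outlier check: standardised residuals beyond +/- 3.3
print("potential outliers at indices:", np.where(np.abs(std_resid) > 3.3)[0])

# normality check: a probability plot; r close to 1 suggests the points lie
# on a straight diagonal line, i.e. no major deviation from normality
(osm, osr), (slope, intercept, r) = stats.probplot(std_resid)
print(f"probability-plot straightness: r = {r:.3f}")
```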
linear regression: SPSS coefficients (location)
coefficients table
- intercept (a): the B value in the (Constant) row
- slope (b): the B value in the predictor variable's row
- standardised b value: the Beta value in the predictor variable's row
t statistic tests the null that the value of b is 0
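a minimal sketch (made-up data) of the quantities the coefficients table reports: a, b, and the t-test of the null hypothesis that b = 0

```python
# A minimal sketch (made-up data) of the coefficients-table quantities:
# a, b, and the t-test of the null hypothesis that b = 0.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

res = stats.linregress(x, y)
t = res.slope / res.stderr              # t statistic for the slope
print(f"intercept (a) = {res.intercept:.3f}, slope (b) = {res.slope:.3f}")
print(f"t = {t:.2f}, p = {res.pvalue:.4f}")   # p-value for the test that b = 0
```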
estimating variance explained
- R^2: the amount of variance in y explained by model relative to total variance in y (R^2 = SSm/SSt)
- can express R^2 as a percentage (multiply by 100)
- r^2 expresses the proportion of shared variance between 2 variables; in regression we assume x explains the variance in y
- though r^2 = R^2 if we only have 1 predictor
variance not explained by x = (1-R^2)
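a minimal sketch, assuming the same made-up data as earlier, showing R^2 = SSm/SSt and that r^2 = R^2 with a single predictor

```python
# A minimal sketch, assuming the same made-up data as earlier,
# showing R^2 = SSm / SSt and that r^2 = R^2 with one predictor.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b, a = np.polyfit(x, y, 1)

SSt = np.sum((y - y.mean()) ** 2)
SSm = SSt - np.sum((y - (a + b * x)) ** 2)
R2 = SSm / SSt
r = stats.pearsonr(x, y)[0]
print(f"R^2 = {R2:.4f}, r^2 = {r ** 2:.4f}")   # equal with one predictor
print(f"variance not explained: {1 - R2:.4f}")
```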
multiple regression
- allows us to assess the influence of several predictor variables (x1,x2 …) on y
- we obtain a measure of how much variance in the outcome variable (y) the predictor variables explain in combination (via a model which incorporates the slope of each predictor variable)
- we also obtain measures of how much variance in the outcome variable our predictor variables explain when considered separately
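a minimal sketch of multiple regression with two made-up predictors, fitted by least squares; R^2 here is the variance they explain in combination

```python
# A minimal sketch of multiple regression with two made-up predictors,
# fitted by least squares.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 0.8 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=50)

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix with intercept column
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coefs
y_hat = X @ coefs

SSt = np.sum((y - y.mean()) ** 2)
SSr = np.sum((y - y_hat) ** 2)                    # variance the model leaves unexplained
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}, R^2 = {1 - SSr / SSt:.3f}")
```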
multiple regression: assumptions (sample size)
sufficient sample size
advice:
- combined effect of several predictors: N ≥ 50 + 8M, where M is the number of predictors (e.g. for 3 predictors, at least 74 participants)
- separate effect of each predictor: N ≥ 104 + M (e.g. for 3 predictors, at least 107 participants)
too few participants may result in overly optimistic results (see the sketch below)
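a tiny sketch of the two rules of thumb above (M = number of predictors); the function names are illustrative

```python
# A tiny sketch of the two sample-size rules of thumb above
# (M = number of predictors); the function names are illustrative.
def n_combined(m: int) -> int:
    """Minimum N to test the combined effect of M predictors."""
    return 50 + 8 * m

def n_separate(m: int) -> int:
    """Minimum N to test the separate effect of each of M predictors."""
    return 104 + m

for m in (3, 5):
    print(f"M = {m}: combined >= {n_combined(m)}, separate >= {n_separate(m)}")
```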