Simple linear regression Flashcards
What is the equation for simple linear regression?
Y = b0 + b1X + e
What is b0?
The intercept
- the point at which the regression line crosses the Y axis
- The value of Yi when X = 0
(labelled as the constant in SPSS)
What is b1?
The slope/gradient
- a measure of how much Y changes as X changes
- regardless of sign (pos/neg), the larger the value of b1, the steeper the slope
What is e?
Residual/prediction error
- difference between observed value of outcome variable and what the model predicts (e=Yobs - Ypred)
- represents how wrong we are in making the prediction for the particular case
What is the equation for Ypred?
Ypred = b0 + b1X
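A minimal Python sketch of prediction and residual (b0, b1 and the observed case are made-up, illustrative values):

```python
# Prediction from the regression equation: Ypred = b0 + b1*X.
# b0, b1 and the observed case below are made-up illustrative values.
b0 = 2.0   # intercept: predicted Y when X = 0
b1 = 0.5   # slope: change in predicted Y per one-unit change in X

def y_pred(x):
    return b0 + b1 * x

# Residual e = Yobs - Ypred for one observed case
x_obs, y_obs = 4.0, 4.5
e = y_obs - y_pred(x_obs)   # 4.5 - (2.0 + 0.5 * 4.0) = 0.5
```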
What is a regression line?
Line of best fit - line that best represents the data and minimises residuals
What is a prediction?
Best guess at Y given X
X doesn’t have to cause Y or come before Y in time
What values show how well the model fits the observed data? (goodness of fit)
R2
F-ratio
What does the model refer to?
The regression line
What values show how the variables relate to each other?
The Intercept
Beta values (slope)
What is residual sum of squares? (SSR)
Square the residuals and then add them up - a gauge of how well the model (line) fits the data: the smaller the SSR, the better the fit
- can also be thought of as error variance - how much error there is in the model
(Residual/error variance)
What is the equation for total sum of squares (SST)?
SSTotal = SSModel + SSResidual
What is the model sum of squares (SSM)?
Sum of squared differences between Ypred and sample mean - represents improvement from baseline model to regression model
(Model variance)
In any regression model, what is the overall variation of the outcome variable (Y) due to?
- Model/regression - how much variance in the observed Y the predicted values explain. This variance would be measured by the deviations of the predicted values from the sample mean, Y̅.
- Error/residual - how much variance is left over in observed Y after we accounted for the predicted values - measured by deviations of observed values from predicted values
What is the Total sum of squares (SST)?
Total variance in outcome variable - partitioned into model variance and residual/error variance
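The partition can be checked numerically; the sketch below fits a least-squares line to a tiny made-up dataset and computes all three sums of squares (SST = SSM + SSR):

```python
# Partition of total variation: SST = SSM + SSR.
# xs and ys are made-up illustrative values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least-squares estimates of slope and intercept
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
preds = [b0 + b1 * x for x in xs]

ss_total = sum((y - y_bar) ** 2 for y in ys)              # SST: observed vs mean
ss_model = sum((p - y_bar) ** 2 for p in preds)           # SSM: predicted vs mean
ss_resid = sum((y - p) ** 2 for y, p in zip(ys, preds))   # SSR: observed vs predicted
```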
What is the equation for R2?
R2 = SSM/SST
Variance in outcome explained by model / total variance in outcome variable to be explained
What is R2?
- provides proportion of variance accounted for by model
- Value ranges between 0-1 (the higher the value, the better the model)
- interpreted as a percentage when multiplied by 100, eg. R2 = .69 x 100 = 69% of the variance in the outcome variable is explained by the model
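R2 can be computed directly from the sums of squares; a sketch with made-up data:

```python
# R2 = SSM / SST (proportion of variance explained by the model).
# xs and ys are made-up illustrative values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
preds = [b0 + b1 * x for x in xs]
ss_total = sum((y - y_bar) ** 2 for y in ys)
ss_model = sum((p - y_bar) ** 2 for p in preds)
r2 = ss_model / ss_total   # between 0 and 1; multiply by 100 for a percentage
```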
What is the equation for the F ratio?
F = MSM / MSR
Model mean squares / residual or error mean squares
What is the equation for model mean squares (MSM)?
MSM = SSM / dfM
What is the equation for residual/error mean squares?
MSR = SSR / dfR
What is the F ratio?
The ratio of explained variance to unexplained variance (error) in the model
- MSM should be larger than MSR (F-statistic greater than 1)
- also called ANOVA - comparing ratio of systematic variance to unsystematic variance
What is dfM?
k - the number of predictors
What is the equation for dfR?
dfR = N-k-1 (N minus the number of estimated coefficients: k slopes plus the intercept)
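Putting the pieces together, a sketch of the F-ratio with made-up data (k = 1 predictor):

```python
# F = MSM / MSR, with MSM = SSM / dfM and MSR = SSR / dfR.
# xs and ys are made-up illustrative values; k = 1 predictor.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
n, k = len(xs), 1
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
preds = [b0 + b1 * x for x in xs]
ss_model = sum((p - y_bar) ** 2 for p in preds)
ss_resid = sum((y - p) ** 2 for y, p in zip(ys, preds))

df_model, df_resid = k, n - k - 1   # dfM = k, dfR = N - k - 1
msm = ss_model / df_model
msr = ss_resid / df_resid
f_ratio = msm / msr                 # well above 1 when the model explains a lot
```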
What are the 2 ways the hypothesis (overall test) in regression can be phrased?
Can the scores on Y be predicted based on the scores on X and the regression line?
- Null hyp: Predicted values of Y are the same regardless of the value of X (or simply, there is no relationship between Y and X).
Does the model (Ypred) explain significant amount of variance in outcome variable (Yobs)?
- Null hyp: Population R2 = 0
- Ratio of model variance to error variance tested using F-test (ANOVA)
OR:
H1: The regression line is a significantly better model than the flat model
H0: The regression line is no better than the flat model (the mean of Y)
What do the coefficients refer to?
The characteristics of the regression line:
- Beta values: the slope of the regression line
- The intercept
What is the unstandardised beta?
The value of the slope (b1)
- the change in the value of y for every one-unit change in x
- expressed in the outcome's original units of measurement
- If b1 is 0, there is no relationship between x and y (flat line - as the predictor variable changes, the predicted value of the outcome is constant and does not change)
- If variable significantly predicts outcome, b value should be different from 0 - tested using a t-test (H0: b = 0) - if test is significant, interpret as supporting that predictor variable contributes significantly to ability to estimate values of outcome.
What is the standardised beta?
- a measure of the slope
The standardised change in y for a one standard deviation change in x - as x increases by one standard deviation, y changes by β standard deviations
In simple regression (1 predictor) what is b1 equal to?
b1 = r(xy)
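This can be verified numerically: if both variables are converted to z-scores first, the fitted slope equals Pearson's r. A sketch with made-up data (population SDs used for standardising):

```python
import math

# Made-up illustrative values
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
n = len(xs)

def z_scores(vals):
    mean = sum(vals) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / n)   # population SD
    return [(v - mean) / sd for v in vals]

zx, zy = z_scores(xs), z_scores(ys)
r = sum(a * b for a, b in zip(zx, zy)) / n                          # Pearson r(xy)
beta = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)  # slope on z-scores
# beta equals r: the standardised slope is the correlation
```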
When should you use unstandardised b?
- when you want coefficients to refer to meaningful units
- when you want a regression equation to predict values of Y
When should you use standardised β?
(independent of units)
- when you want an effect size measure eg. small/med/large β is equivalent to small/med/large r (.1/.3/.5)
- when you want to compare the strength of a relationship between predictor and outcome
What is covariance?
The extent to which variables co-vary (change together)
High covariance means there is a large overlap between patterns of change (variance) observed in each variable
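Covariance can be sketched as the averaged product of deviations from each mean (made-up values; n - 1 denominator for a sample):

```python
# Sample covariance: positive when the variables tend to rise and fall together.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up illustrative values
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
```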
What should you do before running a regression analysis?
- Detect bias from unusual cases (outliers)
- Check assumptions of linear regression
What are outliers in linear regression?
An observation with a large residual (the difference between observed and predicted values - e = Yobs - Ypred)
- may distort results by pulling regression line away from line of best fit for most people
- a case has the potential to be an influential outlier if its standardised score (Z score) on one or more predictors, or its standardised residual, is in excess of +/-3.29
Why are outliers an issue in regression?
They influence the model’s ability to predict all cases
What does how influential an outlier is depend on?
Distance between Yobs and Ypred (residual) - the larger the distance, the weaker the prediction
Leverage (unusual value on a predictor) - large leverage can either weaken or strengthen prediction depending on where the case lies relative to the trend - on trend = strengthens results.
Large leverage + large distance -> negative impact of pulling or tilting the regression line away from the LOBF
What is the minimum value that makes standardised residuals or predictors potential influential outliers?
+/-3.29 (p<.001)
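A sketch of flagging cases by that cut-off: standardise the residuals and flag |z| > 3.29 (residuals below are made-up, with one deliberately extreme case):

```python
import math

# Made-up residuals: 40 small ones plus one deliberately extreme case
residuals = [0.1, -0.1] * 20 + [5.0]
n = len(residuals)
mean = sum(residuals) / n
sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / n)

z = [(r - mean) / sd for r in residuals]                   # standardised residuals
flagged = [i for i, zi in enumerate(z) if abs(zi) > 3.29]  # potential outliers
```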
How can outliers be dealt with?
- check data were entered and coded correctly - can justifiably remove outliers that are due to errors in data entry or failure to follow the procedure (eg. reaction times that are impossibly short or long)
Outliers CAN represent genuine data - for every 100 ppts, expect about 1 score beyond +/-2.58 SD (and only about 1 in 1000 beyond +/-3.29)
What are the 4 assumptions of linear regression?
- Linearity
- Independence
- Normality of residuals
- Homogeneity of variance (homoscedasticity)
What is linearity?
The outcome (continuous variable) is linearly related to predictors
What is the independence assumption?
Observations are randomly and independently chosen from population - residuals are not related to each other.
Residuals not independent in cases such as:
- repeated obs on same ppt
- obs from related ppts (twins, students in same class)
If this assumption is violated, model standard errors (SEs) will be invalid, as will confidence intervals (CIs) and sig tests based on them.
- ensure INDEPENDENT sampling in design
What is the normality of residuals assumption?
Residuals (not IVs or DVs) should be normally distributed
- check using histogram and normal probability plot
want observed and expected frequencies to be very similar - 45 degree straight line.
- in small samples, a lack of normality invalidates confidence intervals and significance tests BUT in large samples, it will not, due to the central limit theorem.
What is the homoscedasticity assumption (homogeneity of variance)?
The variability of residuals should be the same for all values of Ypred.
Violating this assumption invalidates confidence intervals and significance tests
Check using residual scatterplot - should be NO funnelling of residuals.
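As a rough numeric stand-in for eyeballing the scatterplot, the sketch below compares residual spread at low vs high predicted values (made-up numbers; the 2x threshold is an arbitrary illustration, not a formal test):

```python
import statistics

# Made-up predicted values and residuals whose spread grows with Ypred
preds = [1, 2, 3, 4, 5, 6, 7, 8]
resids = [0.1, -0.1, 0.2, -0.2, 1.0, -1.2, 1.5, -1.8]

half = len(preds) // 2
low_spread = statistics.pstdev(resids[:half])    # spread for small Ypred
high_spread = statistics.pstdev(resids[half:])   # spread for large Ypred
funnelling = high_spread > 2 * low_spread        # crude heteroscedasticity flag
```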