W2: Simple Linear Regression Flashcards
Regression
Deriving an equation for predicting one variable from another
What a regression model needs to answer:
How well can we predict Y, given a value of X?
How much variance in Y can we explain?
Types of regression line
Simple linear regression: a single independent variable (one IV)
Multiple linear regression: two or more independent variables (multiple IVs)
Aim of the total sum of squares (TSS)
Aims to capture the total variability in Y that we want to explain or predict
TSS measures the variability of Y around its own mean, so it does not involve X at all
How do we know if the regression line predicts Y well?
The closer the actual scores are to the predicted scores, the better the model predicts Y and the less variability there is around the line (better for generalisation)
The further the actual scores are from the line (the predicted scores), the worse the model predicts Y and the more variability there is around the line
What is a residual?
A residual is the difference between the actual Y and the predicted Y for a given value of X
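The residual definition can be sketched numerically. This is an illustration in Python with made-up data (the course itself appears to use Stata); `np.polyfit` is used here as a convenient least-squares fitter:

```python
import numpy as np

# Hypothetical data: X values and actual Y scores
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Fit a least-squares line y_hat = b*x + a
b, a = np.polyfit(x, y, 1)
y_hat = b * x + a          # predicted Y for each X
residuals = y - y_hat      # actual minus predicted
```

Each entry of `residuals` is one residual: positive where the actual score sits above the line, negative where it sits below.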
Residual SS vs Regression SS
Residual sum of squares → want a small value (little unexplained variability)
Regression sum of squares → want a large value (much explained variability)
Whereabouts are the negative and positive residuals?
(+) residual: above the regression line
(-) residual: below the regression line
Why do the residuals always sum to 0? And what would happen to the sum if there were an outlier?
The residuals always sum to 0 because the least-squares line sits in the 'middle' of the data points: the positive residuals above the line exactly balance the negative residuals below it
Even with an outlier, the residual sum is still 0, as the regression line shifts to compensate and the residuals rebalance
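This balancing act can be checked directly. A small Python sketch with hypothetical data, including a deliberate outlier in Y:

```python
import numpy as np

# Hypothetical data with a clear outlier in the last Y value
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 30.0])

b, a = np.polyfit(x, y, 1)       # least-squares line shifts toward the outlier
residuals = y - (b * x + a)

residual_sum = residuals.sum()   # effectively zero (up to floating-point error)
```

The line tilts toward the outlier, yet the positive and negative residuals still cancel out.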
Conditional and marginal distribution
Marginal distribution of Y: spread or variance of scores around the mean
* Wider and more spread out → more variability compared to the conditional distribution
Conditional distribution of Y: spread or variance of scores around the regression line, for a given value of X
* Narrower and tighter → less variability than the marginal distribution
R^2: coefficient of determination
Total variability in Y is split into regression and error
If variability is explained → regression
If variability is not explained → residual
R^2 effect:
R^2 ranges from 0-1 (often expressed as a percentage)
Closer to 0 (or 0%): the weaker the effect, the less variance the model explains
Closer to 1 (or 100%): the stronger the effect, the more variance that the model explains
Different ways to find correlation
pwcorr (Stata's pairwise correlation command) or the square root of R^2 → the correlation between the predicted values of Y and the actual values of Y
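The R^2 / correlation link can be verified by hand. A Python illustration with hypothetical data (standing in for Stata's `pwcorr`), computing R^2 from the sums of squares and comparing it with the predicted-vs-actual correlation:

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.5, 3.2, 2.8, 5.1, 5.9, 7.2])

b, a = np.polyfit(x, y, 1)
y_hat = b * x + a

ss_res = ((y - y_hat) ** 2).sum()       # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()    # total sum of squares
r_squared = 1 - ss_res / ss_tot

# Correlation between predicted and actual Y: equals sqrt(R^2)
r_pred_actual = np.corrcoef(y_hat, y)[0, 1]
```

So taking the square root of R^2 recovers the same correlation that a pairwise-correlation command would report between predicted and actual Y.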
What are the effects tested in regression? What statistical significance test is used for each?
Effects that are tested:
* Model-as-a-whole
* Individual variable or predictor effects
Statistical significance:
* Model as a whole: F ratio and p-value
* Individual predictor: t statistic and p-value
How is mean square (MS) calculated?
Sums of squares divided by degrees of freedom (MS = SS / df)
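The MS and F calculations can be sketched end-to-end. A Python illustration with hypothetical data; for simple regression the regression df is 1 (one predictor) and the residual df is n − 2:

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.3, 3.1, 4.8, 4.9, 6.2, 7.5, 7.9])
n = len(x)

b, a = np.polyfit(x, y, 1)
y_hat = b * x + a

ss_reg = ((y_hat - y.mean()) ** 2).sum()  # regression SS, df = 1
ss_res = ((y - y_hat) ** 2).sum()         # residual SS, df = n - 2

ms_reg = ss_reg / 1         # mean square = SS / df
ms_res = ss_res / (n - 2)
f_ratio = ms_reg / ms_res   # F ratio for the model as a whole
```

A large F ratio means the explained mean square dwarfs the unexplained one, which is what the model-as-a-whole test looks for.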
F Ratio (Model as a whole) H0 and H1
H0 (null hypothesis): the model is no better than the intercept only model (aka the null model)
H0: all regression coefficients = 0
H1 (alternative hypothesis): the regression model is significantly better than the null model
H1: at least one regression coefficient is not equal to 0
t statistic (individual variable) H0 and H1
H0: b=0
H1: b≠0
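The t test for the slope can also be computed by hand. A Python sketch with hypothetical data; it also checks a handy fact for simple regression: the model F ratio equals the slope's t statistic squared:

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.1, 2.4, 2.9, 4.6, 4.9, 6.4, 7.2, 7.8])
n = len(x)

b, a = np.polyfit(x, y, 1)
residuals = y - (b * x + a)

# Standard error of the slope b
ms_res = (residuals ** 2).sum() / (n - 2)
se_b = np.sqrt(ms_res / ((x - x.mean()) ** 2).sum())

t_stat = b / se_b  # tests H0: b = 0

# With a single predictor, F for the whole model equals t squared
ss_reg = (((b * x + a) - y.mean()) ** 2).sum()
f_ratio = (ss_reg / 1) / ms_res
```

This is why, in simple linear regression, the model-as-a-whole test and the slope test always agree on significance.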
Statistical significance of the p-value
p<.05
Model as a whole R^2 effect size
R-squared
* 2% - 12%: small effect
* 13% - 25%: medium effect
* 26% and above: large effect
Individual variable: statistical significance via the beta coefficient
Beta coefficient (unstandardised): no standard cut-offs
Need to understand the variable's scale to know whether the effect is big or small
Significance can still be judged from the p-value
Beta coefficient: standardised
Interpret similarly to correlation coefficient
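The link to the correlation coefficient is exact in simple regression: the standardised beta is the unstandardised slope rescaled by the ratio of standard deviations, and it comes out equal to Pearson's r. A Python sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data on different scales
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([55.0, 60.0, 68.0, 71.0, 80.0, 84.0])

b, a = np.polyfit(x, y, 1)                     # unstandardised slope
beta_std = b * x.std(ddof=1) / y.std(ddof=1)   # standardised coefficient

# In simple regression the standardised beta equals Pearson's r
r = np.corrcoef(x, y)[0, 1]
```

This is why a standardised beta can be read like a correlation: same sign, same −1 to 1 range.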
Why is there a need for standardised regression coefficients?
Because unstandardised effect sizes are only useful if the 'natural' scale is known
* Without a natural scale, direct comparisons cannot be made, as variables may be on different scales
* A one-point increase can be a small change or a big change (e.g. one point on a 1-7 scale is a bigger change than one point on a 0-100 scale)
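One way to remove the scale problem is to z-score both variables before fitting: the slope of the refitted line is then the standardised beta. A Python sketch with hypothetical data matching the example scales above (1-7 predictor, 0-100 outcome):

```python
import numpy as np

# Hypothetical: predictor on a 1-7 scale, outcome on a 0-100 scale
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([20.0, 35.0, 38.0, 55.0, 62.0, 74.0, 88.0])

# z-score both variables, then refit: the slope is now the standardised beta
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta_std, _ = np.polyfit(zx, zy, 1)
# beta_std is scale-free, so it can be compared across predictors
```

After z-scoring, "one point" means one standard deviation for every variable, so slopes become directly comparable.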