W2: Simple Linear Regression Flashcards
Regression
Deriving an equation for predicting one variable from another
A regression model needs to answer:
How well can we predict Y, given a value of X?
How much variance in Y can we explain?
Types of regression line
Simple linear regression: a single independent variable (one IV)
Multiple linear regression: two or more independent variables (all the IVs)
Total sum of squares (TSS): its aim
Aims to capture the total variability in Y that we want to explain or predict
Does not have anything to do with X; it depends only on how far each Y score is from the mean of Y
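A minimal formula sketch (standard definition; \bar{Y} is the mean of Y):
```latex
\mathrm{TSS} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2
```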
How do we know how well the regression line predicts Y?
The closer the actual scores are to the predicted scores, the better the model predicts Y and the less variability there is around the line (better for generalisation)
The further the actual scores are from the line (the predicted scores), the worse the model predicts Y and the more variability there is around the line
What is a residual?
A residual is the difference between the actual Y and the predicted Y for any given value of X
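A minimal formula sketch (\hat{Y}_i is the predicted value; actual minus predicted, which matches the sign convention below):
```latex
e_i = Y_i - \hat{Y}_i
```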
Residual SS vs Regression SS
Residual sum of squares → the variability not explained by the model; we want this to be small
Regression sum of squares → the variability explained by the model; we want this to be big
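A sketch of how the two pieces relate (the standard decomposition of the total sum of squares):
```latex
\mathrm{TSS} = \mathrm{Regression\ SS} + \mathrm{Residual\ SS}
```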
Whereabouts are the negative and positive residuals?
(+) residual: above the regression line
(-) residual: below the regression line
Why do the residuals always sum to 0? And what happens to the sum if there is an outlier?
The residuals always sum to 0 (when the model includes an intercept) because the regression line sits in the middle of the data points, so the positive and negative residuals balance each other out exactly
Even with an outlier, the residual sum would still be 0, because the regression line would skew toward the outlier to compensate and adjust
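A minimal Python sketch (made-up illustrative data, numpy assumed) showing that least-squares residuals sum to roughly 0 even with an outlier:
```python
import numpy as np

# Small illustrative dataset; the last Y value is an outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 30.0])

# Fit a least-squares line (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)

# Residuals = actual Y minus predicted Y
residuals = y - (slope * x + intercept)

# The residuals still sum to (numerically) zero: the line skews
# toward the outlier so positive and negative residuals balance
print(round(residuals.sum(), 10))  # ~0.0
```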
Conditional and marginal distributions
Marginal distribution of Y: spread or variance of scores around the mean
* Wider and more spread out → more variability compared to the conditional distribution
Conditional distribution of Y: spread or variance of scores around the regression line, for any given value of X
* Narrower and tighter → less variability than the marginal distribution
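A minimal Python sketch (made-up data again) comparing the two spreads: variance around the mean of Y (marginal) versus variance around the fitted line (conditional):
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.1])

slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

marginal_var = np.var(y)                  # spread around the mean of Y
conditional_var = np.var(y - predicted)   # spread around the regression line

# The conditional variance is smaller: knowing X tightens our guess about Y
print(marginal_var, conditional_var)
```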
R^2: coefficient of determination
The variability in Y is split into regression and error
If variability is explained → regression
If variability is not explained → residual
R^2 effect:
R^2 ranges from 0-1 (often expressed as a percentage)
Closer to 0 (or 0%): the weaker the effect, the less variance the model explains
Closer to 1 (or 100%): the stronger the effect, the more variance that the model explains
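A minimal formula sketch tying R^2 to the sums of squares above:
```latex
R^2 = \frac{\mathrm{Regression\ SS}}{\mathrm{TSS}}
    = 1 - \frac{\mathrm{Residual\ SS}}{\mathrm{TSS}}
```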
Different ways to find the correlation
pwcorr (in Stata) or the square root of R^2 → both give the correlation between the predicted values of Y and the actual values of Y
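A minimal Python sketch (made-up data, numpy assumed) checking that the square root of R^2 matches the correlation between predicted and actual Y:
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.1])

slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

# R^2 = 1 - Residual SS / Total SS
rss = np.sum((y - predicted) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1 - rss / tss

# Correlation between predicted and actual Y
r = np.corrcoef(predicted, y)[0, 1]

print(np.sqrt(r_squared), r)  # the two values match
```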
What are the effects tested in regression? What statistical significance is used to test them?
Effects that are tested:
* Model-as-a-whole
* Individual variable or predictor effects
Statistical significance:
* Model as a whole: F ratio and p-value
* Individual predictors: t-statistic and p-value
How is mean square (MS) calculated?
Sum of squares divided by degrees of freedom
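A minimal formula sketch, also showing where the F ratio above comes from:
```latex
MS = \frac{SS}{df}, \qquad
F = \frac{MS_{\mathrm{regression}}}{MS_{\mathrm{residual}}}
```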