Simple Linear Regression Flashcards
what is correlation?
- looking at how two variables are related to each other
-> we aren't making predictions from one to the other
-> relationship is symmetrical
what is regression?
- trying to predict one variable from another using the model
-> predict the criterion variable from the predictor variable
-> relationship is asymmetrical
-> assuming one (the predictor) precedes the other (outcome)
what is the whole idea of a regression?
predict an outcome (dependent / criterion variable) from a predictor (independent) variable
what's an example of a regression question?
How can you predict university success from school results?
* Tariff score and Honours Classification
How can we predict regression?
Y = b0 + b1X
b0
intercept
-> where our line crosses the y axis - it's constant
b1
"gradient/slope"
how does the "slope" work?
the gradient of the line has been fitted to the data
* for every unit X goes up
* Y goes up (or down) in line with the gradient
i.e. if the slope is 0.5, for every unit of X that goes up, Y goes up 0.5 of a unit [assuming a perfect prediction]
X = 2. What is Y?
Y = 0 + 0.5 (2)
Y is 1
If b0 = 3.75 and b1 (slope) = .469. An individual scores 7 on their maths test. What is Y?
Y = 3.75 + .469(7)
Y = 7.03
what is the issue with Y though?
fit of our line is not perfect, yet weāre interested in being able to quantify the gap
b0 = 11.35 and b1 = -0.722. What is the equation?
Y = 11.35 - 0.722(X)
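The prediction equations on these cards can be sketched as a small Python function; the b0 and b1 values are the ones from the cards, while the X = 4 plugged into the negative-slope equation is made up for illustration:

```python
def predict(b0, b1, x):
    """Simple linear regression prediction: Y-hat = b0 + b1 * X."""
    return b0 + b1 * x

print(predict(0, 0.5, 2))                   # slope example: X = 2 -> 1.0
print(round(predict(3.75, 0.469, 7), 2))    # maths-test example -> 7.03
print(round(predict(11.35, -0.722, 4), 3))  # hypothetical X = 4 for the negative slope
```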
what is the regression outcome?
statistics we look at to predict how good our predictor is at predicting our outcome variable
What is the technique about making decisions about the data?
aim is to ensure the line of best fit produces small residuals
* not always a good fit, but it's the best fit -> we can measure how good a fit is and estimate how good our regression is (how good is our equation at predicting the outcome, knowing the predictor)
* and whether it's significant
There are two outcomes
What are these two outcomes?
R^2: how good the model / regression is at predicting [testing the null hypothesis that r = 0]
F ratio: is it significant or not [the null hypothesis is that there is no predictive relationship / no variation explained]
what are the questions we are asking ourselves?
- The general question we are asking is: how good is our model at predicting the actual data (Y, the dependent measure, the criterion variable)?
- The technical question is how much of the variance in the Y data set can we predict/account for using our model?
- Outcome of the analysis is what proportion of the variation in the data set can we predict using our model
what can we use to calculate this proportion?
- model
- data the model produces (the predicted Y score)
- the actual data (observed/actual Y scores)
There are different types of variation. What is the first one called?
the residual -> differences between the observed and predicted Y scores
* actual Y score minus the predicted Y score using the equation and X value
* squared to stop them cancelling each other out
* the gap between the actual and the predicted
the gap between the actual score and the predicted score - what does this tell us?
The weaker the prediction, the greater the residual variance
* the bigger the gap between the actual scores and the scores that our model predicts
if the gap is small?
youāve got a good prediction
if the gap is large?
you donāt have a good prediction
what is variation not predicted by?
the model/equation/regression
what does the residual tell us?
the difference between the score predicted by the equation and the score we actually have
How do we calculate the SSResidual?
Y = score for each participant
Ŷ = score for each participant calculated by the equation (predicted Y)
Y - Ŷ = actual score for each participant minus the score calculated by the equation
(Y - Ŷ)^2 = that difference squared (so the gaps don't cancel each other out)
The Equation: ∑(Y - Ŷ)^2
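The SSresidual steps above can be sketched in Python; the Y scores here are made-up numbers, not data from the cards:

```python
def ss_residual(y_actual, y_predicted):
    """Sum of squared residuals: sum of (Y - Y-hat)^2 over all participants."""
    return sum((y - yhat) ** 2 for y, yhat in zip(y_actual, y_predicted))

# Made-up scores: a perfect fit gives 0; two gaps of 1 each give 1 + 1 = 2
print(ss_residual([2, 4], [2, 4]))  # 0
print(ss_residual([2, 4], [1, 5]))  # 2
```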
what is the total variance?
- the total variance of Y scores in the data set
All the variation that there is to explain
how to calculate SS total?
sum of (Y-M)^2
* each data point minus the mean for all Y data points
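The SS total formula above, as a sketch with made-up data points:

```python
def ss_total(y):
    """Total sum of squares: sum of (Y - mean of Y)^2 over all data points."""
    mean_y = sum(y) / len(y)
    return sum((yi - mean_y) ** 2 for yi in y)

print(ss_total([1, 2, 3]))  # mean is 2 -> 1 + 0 + 1 = 2.0
```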
So.. How do we figure out the variance?
Sum of Squares of the Residual (SS residual): an estimate of the amount of variation that is not predicted by our regression in our sample (gap between the actual and the predicted)
Total Sum of Squares (SS total): an estimate of all the variation in the sample
What do we need to find out?
an estimate of how much of the variation is actually predicted by our model
How do we find an estimate of how much of the variation is actually predicted by our model?
Take away the SS residual from the SS total -> sum of squares model
what is SSm (Sum of Square of the Model) / SS reg (Sum of square of the regression)
an estimate of the amount of variance explained by the regression or the model
How can we calculate SSreg directly
take the mean of the actual Y scores away from each predicted Y score (then square and sum the differences)
-> gives you the variance explained by the regression equation or model
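The direct SSreg calculation can be sketched as below; the scores are made up, and for a least-squares fit this comes out the same as SStotal minus SSres:

```python
def ss_regression(y_predicted, y_actual):
    """SSreg computed directly: sum of (Y-hat - mean of actual Y)^2."""
    mean_y = sum(y_actual) / len(y_actual)
    return sum((yhat - mean_y) ** 2 for yhat in y_predicted)

# Made-up case: a perfect prediction, so SSreg equals SStotal (here 2.0)
print(ss_regression([1, 2, 3], [1, 2, 3]))
```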
what is SS total?
an estimate of all the variance in the data set
what is SSm or SSreg
estimate of the variance accounted for by the model/regression (gives us an idea of the variation explained)
What is SSreg/m affected by?
sample size and amount of total variation in the sample
- you can't compare it across different studies and samples, as different sample sizes etc. produce different estimates -> yet being able to generalise and compare results would be very useful
instead we need a standardised measure of the total proportion of the variation explained by the regression
what is the standardised measure of the total proportion of the variation explained by the regression?
R^2
R^2
Proportion of the variance predicted by the regression equation
* SSreg divided by SStotal
* Between 0 and 1 -> the larger, the better
* can be expressed as a percentage i.e. 80% of the variance is explained by the model/regression
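The R^2 calculation can be sketched in Python using the two sums of squares defined above (the Y scores are made-up numbers):

```python
def r_squared(y_actual, y_predicted):
    """R^2 = SSreg / SStotal, equivalently 1 - (SSres / SStotal)."""
    mean_y = sum(y_actual) / len(y_actual)
    ss_tot = sum((y - mean_y) ** 2 for y in y_actual)
    ss_res = sum((y - yhat) ** 2 for y, yhat in zip(y_actual, y_predicted))
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3], [1, 2, 3]))                # perfect prediction -> 1.0
print(round(r_squared([1, 2, 3], [1.1, 2, 2.9]), 2))  # small residuals -> 0.99
```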
Sstotal
an estimate of all the variance in the data set
Ssres
A measure of the amount of variance not explained by our regression
SSreg or SSm
-> an estimate of the variance accounted for by the model / regression
Take SSres from SStotal and that leaves us with the amount of variance explained by our equation
R^2
Standardise this by dividing SSreg by SStotal
-> what proportion of the total variation is explained by the regression/model
what is the F ratio?
ratio between variance that is predicted and the variance that is not predicted (error)
* a way to see whether a significant amount of the variance is explained
If F ratio is high -> this means the effect is strong; there is lots of variance explained in relation to the variance that is not explained (we should get a significant result)
How do we calculate the F ratio?
"mean square error"
* SS divided by degrees of freedom
what is F ratio?
the ratio between the two mean square errors
-> SS / Df
The degrees of freedom for the regression model is simply the number of predictors
SS reg/m divided by the number of predictors in the model/regression ("k")
how many predictors are there in the linear regression?
1
SS res is divided by N minus the number of parameters in our model. What are these parameters?
- These are the intercept and the predictors
- There is always one intercept and "k" predictors
how many degrees of freedom is the F ratio reported with?
2 -> one for each of the mean square errors (df of MSreg/m, df of MSres)
What does it mean if the F value (found in the F table) is large and the p value is significant?
it's predicting a significant amount of the variance -> and a lot of variance too
what does the p-value mean?
tells us that the result is significant
-> allows us to make decisions about the null hypothesis
If p < 0.05
can reject the null hypothesis
in regression, the null hypothesis means
the variance explained by the model is 0
in t-tests, the null hypothesis means
there is no difference between the two means (or that the data comes from the same population)
F
The ratio of the Mean square model (or āregressionā) error to the mean square residual error.
Big F ->
little p values
where is the R-squared?
in the model summary; we sometimes report adjusted R squared next to it
what are the assumptions of a simple regression?
- variable type: the outcome must be continuous (the predictor can be continuous or discrete)
- non-zero variance: predictors must not have zero variance
- independence: all values of outcomes should come from a different person or item
- linearity: the relationship we model is, in reality, linear (plotting x and y is still important to see if there's a relationship)
- homoscedasticity: for each value of predictors, the variance of the error term should be constant
- independence of errors: plot ZRESID (y-axis) against ZPRED (x-axis)
- normally-distributed errors: the residuals must be normally distributed (if they don't form a normal distribution, we have some problems with the data)
→ do a normal probability plot or "save" the residuals and then compute all the usual tests for normality
How to calculate F?
1. MSreg/m = SSreg/m divided by "k" (the number of predictors)
2. MSres = SSres divided by N - k - 1
3. F = MSreg/m divided by MSres
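The F-ratio steps can be sketched as below; the sums of squares and N are made-up numbers, and k defaults to 1 for a simple linear regression:

```python
def f_ratio(ss_reg, ss_res, n, k=1):
    """F = MSreg / MSres, with df1 = k (predictors) and df2 = n - k - 1."""
    ms_reg = ss_reg / k            # step 1: mean square for the model
    ms_res = ss_res / (n - k - 1)  # step 2: mean square for the residual
    return ms_reg / ms_res         # step 3: the ratio

# Made-up values: SSreg = 10, SSres = 5, N = 12, one predictor
print(f_ratio(10, 5, 12))  # 10 / 0.5 = 20.0
```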
Regression
way of predicting an outcome
SStotal
total sum of squares of the differences between data points and the mean of y (all the variance there is to explain/account for)
SSres
total sum of squares of the differences between the data points and the line of best fit (variation that is not explained by the model)
(an estimate of the variance that is not accounted for by the model/regression)
SSmodel/regression
difference between SStotal and SSres
-> variation explained by the model
R^2
SSmodel/regression / SStotal
-> proportion of variance explained by the model
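To tie the cards together: the b0 and b1 values used throughout come from a least-squares fit. A minimal sketch using the standard estimates (b1 = covariance of X and Y divided by variance of X, b0 = mean of Y minus b1 times mean of X), with made-up data points:

```python
def fit_simple_regression(x, y):
    """Least-squares fit: returns (b0, b1) for Y-hat = b0 + b1 * X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Made-up data lying exactly on Y = 2X, so the fit recovers b0 = 0, b1 = 2
print(fit_simple_regression([1, 2, 3], [2, 4, 6]))  # (0.0, 2.0)
```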