Weeks 4 & 5 - Regression Flashcards
Which variable is also known as the predictor variable?
independent variable (X)
Which variable is also known as the outcome variable?
dependent variable (Y)
What kind of relationship is there between variables in a regression analysis?
An asymmetrical relationship - scores on one variable (IV) predict scores on the other (DV)
Formula for a straight line function
y = mx + b, where y is the DV, x is the IV, m is the constant slope of the straight line, b is a constant for the y intercept (the value of y when x = 0)
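A minimal sketch of the straight-line function, using hypothetical values m = 2 and b = 5:

```python
# Sketch of the straight-line function y = mx + b
# (m = 2 and b = 5 are hypothetical values for illustration).
def straight_line(x, m=2.0, b=5.0):
    """Return the y value on the line y = mx + b for a given x."""
    return m * x + b

print(straight_line(0))  # 5.0 -- the y intercept (value of y when x = 0)
print(straight_line(3))  # 11.0 -- 2*3 + 5
```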
Why use a straight line?
Because a straight line is the simplest summary of a linear relationship between the variables; the observed points in a scatterplot will not lie exactly on it unless there is a perfect correlation between the variables (which is unlikely).
Does it matter which variable is on the X and Y axes?
Yes, because one variable is predicting the other, the predictor variable (IV) should be on the X axis and the outcome variable (DV) should be on the Y axis.
What does the regression line mean?
It measures the summary characteristics of the relationship between two variables.
What method is used to find the regression line of best fit?
The least squares regression line ensures that there is the smallest deviation between the observed and predicted scores (obs & regression line)
What is minimised in the least squares regression line?
The sum of squared residuals (must be squared because the sum of residuals is equal to zero). The line of best fit has the smallest SSres.
What is the method of least squares?
The method of obtaining the line of best fit in a regression model that has the smallest possible SSres.
What is the least squares estimator?
The estimator used to obtain the line of best fit in a regression model. This estimator finds the estimated value for the slope and y intercept constant that minimises the SSres for the set of observed scores on the X and Y axis
What is the least squares linear regression function?
The line of best fit produced by the method of least squares
The full regression equation (full simple regression equation)
Y = a + bX + e (where a = constant intercept parameter; b = slope/regression coefficient; X = score on IV; e = residual score)
Regression model equation (simple regression model equation)
Y_hat = a + bX (excludes the residual score from the right-hand side of the equation and uses the predicted Y score (Y_hat) on the left-hand side of the equation)
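A sketch of the least squares fit behind these equations, assuming made-up data and using numpy's polyfit to estimate a and b:

```python
import numpy as np

# Sketch: estimating a (intercept) and b (slope) for Y_hat = a + bX
# by least squares; X and Y are hypothetical data.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

b, a = np.polyfit(X, Y, deg=1)  # polyfit returns [slope, intercept]
Y_hat = a + b * X               # regression model equation
e = Y - Y_hat                   # residuals from the full equation Y = a + bX + e

# With an intercept in the model, least squares residuals sum to zero
print(round(float(np.sum(e)), 10))
```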
What is the regression coefficient?
The slope parameter b in the regression model & full regression equation
What is a negative residual?
A residual score obtained when the predicted score is greater than the observed score
What is a positive residual?
A residual score where the observed score is greater than the predicted score
What is SSTotal?
The total variation in observed scores on the dependent variable (Y).
Measures the sum of squared differences between the observed Y scores and the mean (average).
What is SSReg?
Variation in the predicted scores in Y (DV).
Sum of squared deviations between the predicted scores and the mean.
Represents the variation in predicted scores accounted for by the model. Bigger SSReg means that the regression model is a good predictor.
What is SSres?
Variation in the difference between observed and predicted scores.
Represents the sum of squared deviations between the observed and predicted data (the residuals). A large SSres means the regression model is not a good predictor. A SSres of zero means a perfect correlation: observed and predicted scores fit perfectly along a straight line (but not very likely to happen!)
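The three sums of squares can be sketched on hypothetical data, checking that the least squares decomposition SSTotal = SSReg + SSres holds:

```python
import numpy as np

# Sketch: SSTotal, SSReg, and SSres for a least squares fit
# on hypothetical data.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.5, 5.5, 8.0, 10.0])

b, a = np.polyfit(X, Y, 1)
Y_hat = a + b * X

SS_total = np.sum((Y - Y.mean()) ** 2)      # observed vs mean of Y
SS_reg   = np.sum((Y_hat - Y.mean()) ** 2)  # predicted vs mean of Y
SS_res   = np.sum((Y - Y_hat) ** 2)         # observed vs predicted (residuals)

print(bool(np.isclose(SS_total, SS_reg + SS_res)))  # True
```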
What is R squared (R2)?
The proportion of total variation in Y accounted for by the model.
Measures the overall strength of prediction.
An R squared value can range between zero and +1 (can’t be negative because it’s a squared value).
An R squared value of zero means that the IV does not predict the DV (they are independent of each other)
An R squared value of 1 means that 100% of the variability in Y can be predicted by X (also v. unlikely), the larger the R squared value the greater the strength of prediction in the regression model.
How is R squared calculated?
R squared is calculated by dividing the SSReg by the SSTotal.
What are the alternative ways of calculating R squared?
By subtracting the SSres from the SSTotal (which produces the SSReg) and dividing this by SSTotal
By dividing the SSReg by SSReg + SSres (which equals the SSTotal)
These methods all produce the same measure of strength of prediction, but use the three measures of variability found in the regression model differently to obtain the same results.
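The three equivalent formulas can be checked with a quick sketch, using hypothetical sums of squares:

```python
# Sketch: three equivalent ways of computing R squared
# (SS_reg and SS_res are hypothetical values).
SS_reg, SS_res = 60.0, 40.0
SS_total = SS_reg + SS_res

r2_a = SS_reg / SS_total                # SSReg / SSTotal
r2_b = (SS_total - SS_res) / SS_total   # (SSTotal - SSres) / SSTotal
r2_c = SS_reg / (SS_reg + SS_res)       # SSReg / (SSReg + SSres)
print(r2_a, r2_b, r2_c)  # 0.6 0.6 0.6
```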
Provide a good way of showing the relationship between these measures of variability.
SSTotal = SSReg + SSres
What is R?
The multiple correlation coefficient (Multiple R) = square root of R squared.
What does Multiple R measure in a regression analysis?
The extent to which higher predicted scores (Y hat) for the DV are associated with higher observed scores (Y) on the DV.
What are the df reg?
The number of independent variables
What are the df res?
df res= n - no. of IVs - 1
What are the df total?
df total = df reg + df res
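The three df formulas can be sketched with hypothetical numbers (n = 50 cases, k = 3 IVs):

```python
# Sketch: degrees of freedom for a regression with hypothetical
# n = 50 cases and k = 3 independent variables.
n, k = 50, 3
df_reg = k            # df reg = number of IVs
df_res = n - k - 1    # df res = n - no. of IVs - 1
df_total = df_reg + df_res
print(df_reg, df_res, df_total)  # 3 46 49
```

Note that df total always works out to n - 1.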
What theoretical probability distribution is equivalent to R squared as an estimator of the overall strength of prediction in the regression model at a population level?
The F distribution
What techniques are used to make inferences about the overall strength of prediction in a regression model at a population level?
Null Hypothesis significance testing of R squared
What is the population parameter that corresponds to R squared?
P squared (rho squared)
What does the null hypothesis state for R squared at the population level?
Ho: P2 (rho squared) = 0
What does the alternative hypothesis state when making inferences from R squared to P squared?
Ha: P2 (rho squared) is not equal to zero (or is larger than zero)
What do sums of squares (SS) measure?
SS measures variation (they are squared deviation scores)
What are the Mean Sums of Squares (MS)?
Measures the average variance (average SS)
How are the Mean Sums of Squares (MSReg & MSres) obtained?
MSReg = SSReg/df reg
MSres = SSres/df res
How is the Tobs calculated for the null hypothesis test on R2?
Tobs = MSReg/MSres
What theoretical probability distribution is the Tobs for a null hypothesis test on R2 equivalent to?
F distribution
Tobs = F(dfReg, dfres) = MSReg/MSres
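The observed F can be sketched from the mean squares, with hypothetical sums of squares and dfs:

```python
# Sketch: the observed F for the null hypothesis test on R2
# (sums of squares and dfs are hypothetical values).
SS_reg, SS_res = 60.0, 80.0
df_reg, df_res = 2, 47

MS_reg = SS_reg / df_reg   # MSReg = SSReg / df reg
MS_res = SS_res / df_res   # MSres = SSres / df res
F_obs = MS_reg / MS_res    # Tobs = F(df reg, df res)
print(round(F_obs, 3))     # 17.625
```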
What is the shape of the F distribution?
Positively skewed
What value range can an F statistic have?
The F statistic must be 0 or greater (it can't be negative); it has no upper bound.
Where is the critical region in the F distribution?
In the upper (right) tail (because the distribution is positively skewed)
What other distribution is the F distribution similar to?
The Chi Square distribution
What is the limitation of using a point estimate in a null hypothesis test on R2?
The use of the point estimate can only tell us if P2 (corresponding to R2) is equal, or not equal, to zero. It can't provide a range of plausible values that P2 could take.
What are the advantages of placing a confidence interval around R2?
- R2 can only range between 0 and 1, therefore the CI can immediately and clearly show the precision with which R2 is being estimated.
- If the lower bound of the CI is not 0, we immediately know that the null hypothesised value of 0 would be rejected.
- The CI can indicate extreme bias in R2 (ie. R2 may not be contained within CI, indicating extreme bias).
What factors influence the precision of a confidence interval on R2?
- The number of IVs
- The sample size
- The size of R2
- fewer IVs = greater precision
- larger sample size = greater precision
- larger R2 = greater precision
How can a confidence interval for Multiple R (multiple correlation coefficient) be obtained?
By taking the square root of the upper and lower bounds, a CI for Multiple R can be obtained.
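A sketch of that conversion, with hypothetical CI bounds on R squared:

```python
import math

# Sketch: turning a CI on R squared into a CI on Multiple R
# by square-rooting the bounds (bounds are hypothetical).
r2_lower, r2_upper = 0.09, 0.36
r_lower = math.sqrt(r2_lower)
r_upper = math.sqrt(r2_upper)
print(round(r_lower, 6), round(r_upper, 6))  # 0.3 0.6
```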
How is a CI for Multiple R interpreted?
The CI for Multiple R provides an estimate for the expected correlation between the observed and predicted scores on the dependent variable at the population level.
Is R2 a biased estimator of P2?
Yes, but it is also a consistent estimator
When can Multiple R be a better measure of the overall strength of prediction than R2?
When the R2 is very small, Multiple R can be used to determine if there is a significant correlation between observed and predicted scores on the DV.
Is the unbiased or adjusted CI value more accurate?
The unbiased estimate is better than the adjusted (but SPSS doesn’t produce it)
What does an R2 value greater than the upper bound of the CI mean?
That there is a large upward bias in the observed R2, and that the population P2 is likely to lie within a lower interval than the observed R2 suggests.
What possible causes are there for extreme bias in R2?
A large number of IVs and small sample size.
How can bias in R2 reduced?
By using a larger sample size and fewer IVs
Which estimate of R2 should be used for a smaller sample size?
The unbiased R2 is the only accurate point estimator of P2 in a small sample (the adjusted R2 has a slight negative bias and the unadjusted R2 is positively biased). However, the unadjusted R2 is the only accurate interval estimator.
Which estimator of R2 should be used for a larger sample size?
All three estimators of R2 are accurate (they are all consistent), and all are good interval estimators.
What is an unstandardised partial regression coefficient?
“b” - indicates the expected change in the scores for the DV for a unit change on the focal IV (while holding constant scores on all other IVs)
What are two ways of measuring the strength of prediction of each IV using the partial regression coefficient?
- The semipartial correlation
- The standardised partial regression coefficient
What theoretical probability distribution does a hypothesis test of the partial regression coefficient correspond to?
The t-distribution
What are the df for the t-distribution hypothesis test on the partial regression coefficient?
df = n - dfReg - 1
How is the tobs calculated for a partial regression coefficient?
tobs = (partial regression coefficient - population regression coefficient) / standard error of the regression coefficient
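A sketch of that calculation, with hypothetical values and a null-hypothesised population coefficient of 0:

```python
# Sketch: tobs for a partial regression coefficient, testing
# H0: population coefficient = 0 (all numbers hypothetical).
b = 0.75      # sample partial regression coefficient
b_pop = 0.0   # null-hypothesised population coefficient
se_b = 0.25   # standard error of the coefficient
t_obs = (b - b_pop) / se_b
print(t_obs)  # 3.0
```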
How is a 95% CI for a partial regression coefficient interpreted?
A 95% CI indicates the range of expected change in the DV for a unit change in the focal IV (holding constant all other IVs).
What is a semipartial correlation?
The Pearson correlation between scores on the DV and that part of the scores on an IV not accounted for by all other IVs in the regression model.
What is the notation for the semipartial correlation?
Sr
What is homoscedasticity?
Irrespective of what the predicted scores on the dependent variable are, the degree of variability of the residuals is the same.
What is heteroscedasticity?
Systematic variability in the residual variances according to the predicted values (e.g. low variability with low scores on the DV and high variability with high scores on the DV)
What diagnostic tool is used to determine if heteroscedasticity is present (or not)?
A scatterplot of the residuals against predicted scores on the DV.
Why do we use predicted scores in a scatterplot when testing for heteroscedasticity?
Because the predicted scores on the DV represent a linear combination of all scores on the IV (so don’t need to check each IV individually)
What is plotted on the X axis of a scatterplot when checking for heteroscedasticity?
Standardised predicted scores (Z transformation of Y hat scores)
What is plotted on the Y axis of a scatterplot used for checking for heteroscedasticity?
The residual scores (can be standardised, studentised, or studentised deleted residuals)
What are standardized residuals?
Obtained by applying a Z transformation to raw residuals (mean = 0, SD = 1)
What are studentized residuals?
Transforming raw score residuals by an estimate of their standard error (mean approx. 0, SD = 1)
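The Z transformation behind standardised residuals can be sketched on hypothetical raw residuals (using the population SD, an assumption of this sketch):

```python
import numpy as np

# Sketch: standardised residuals as a Z transformation of raw
# residuals (raw values hypothetical; ddof=0 SD assumed).
raw = np.array([1.5, -0.5, 0.0, -2.0, 1.0])
z = (raw - raw.mean()) / raw.std()
print(bool(np.isclose(z.mean(), 0)), bool(np.isclose(z.std(), 1)))  # True True
```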
What are studentized deleted residuals and how are they obtained?
Residuals obtained by refitting the same regression model to the sample data with one case left out at a time.
The difference between a case's observed score and the score predicted by the regression equation fitted without that case is called the deleted residual.
The studentised deleted residual for a case is therefore its deleted residual value divided by an estimate of its standard error.
What is a residual outlier?
A studentised deleted residual identified on a scatterplot with a value greater than 2.5 to 3 (or less than -2.5 to -3).
Studentised deleted residuals are plotted on the Y axis and predicted scores on the X axis.
What is the advantage of using studentised deleted residuals?
- Good at picking up outliers and extreme data points, especially when the sample size is small.
A studentised deleted residual of +3 or above (or -3 or below) is an extreme data point.
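The cut-off rule can be sketched as a simple flagging step, assuming the studentised deleted residuals have already been computed (values here are hypothetical):

```python
import numpy as np

# Sketch: flagging residual outliers with the 2.5-to-3 rule of
# thumb (hypothetical studentised deleted residuals).
sdr = np.array([0.4, -1.2, 3.1, 0.8, -2.9])
outliers = np.abs(sdr) >= 2.5
print(np.where(outliers)[0].tolist())  # [2, 4] -- cases 2 and 4 flagged
```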
What does fanning in (or out) indicate in a scatterplot of residuals?
Indicative of heteroscedasticity (systematic variation of the residuals as the predicted values increase/decrease).
Does sample size affect the ability to detect heteroscedasticity?
Yes! In a small sample it is almost impossible to see on a scatterplot unless it is really obvious.
What effect can an aberrant or extreme score have on a small sample size?
It can dramatically change the results of the regression model.
What two diagnostic techniques allow extreme data points to be observed in a regression model?
- Examine the studentised deleted residuals for extreme scores
- Investigate a measure of influential cases on our data using Cook’s d statistic.
What size of studentised deleted residual indicates an extreme score?
A studentised deleted residual of more than 2.5 or 3, or less than -2.5 or -3.
What is Cook’s d?
A statistic that is calculated for each data value in the regression model and assesses the influence of each case on the model, when that case has been removed from the model.
What is the range of values for Cook’s d?
Minimum value of 0, and a large value (e.g. +1 or more) is indicative of an extreme datapoint.
What value of Cook’s d indicates an extreme data point?
A cook’s d value of +1 or higher.
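The rule of thumb can be sketched as a flagging step, assuming the Cook's d values have already been produced (e.g. by SPSS; the values here are hypothetical):

```python
import numpy as np

# Sketch: flagging influential cases with Cook's d
# (hypothetical Cook's d values, one per case).
cooks_d = np.array([0.02, 0.15, 1.30, 0.05])
influential = cooks_d >= 1.0   # the "+1 or more" rule of thumb
print(np.where(influential)[0].tolist())  # [2] -- case 2 flagged
```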
How is non-linearity in a regression model established?
Systematic patterning indicating non-linearity should be evident in a scatterplot of the residuals and predicted values.
What is the meaning of the intercept parameter (a intercept)
Expected value on the DV when the scores on all IVs = 0. Intercept always = 0 in a standardised regression equation.
Only need to understand this if an expected score of 0 is meaningful (otherwise forget it).