Weeks 4 & 5 - Regression Flashcards

1
Q

Which variable is also known as the predictor variable?

A

independent variable (X)

2
Q

Which variable is also known as the outcome variable?

A

dependent variable (Y)

3
Q

What kind of relationship is there between variables in a regression analysis?

A

An asymmetrical relationship - scores on one variable (IV) predict scores on the other (DV)

4
Q

Formula for a straight line function

A

y = mx + b, where y is the DV, x is the IV, m is the constant slope of the straight line, and b is a constant for the y intercept (the value of y when x = 0)
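A minimal sketch of this function in code (illustrative, not part of the original cards):

```python
def line(x, m, b):
    """Straight-line function y = mx + b: m is the constant slope, b the y intercept."""
    return m * x + b
```

Note that line(0, m, b) returns b, the value of y when x = 0.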

5
Q

Why use a straight line?

A

Because it summarises the relationship between the variables in the simplest (strictly linear) form. The observed points in a scatterplot will not all fall exactly on the line unless there is a perfect correlation between the variables.

6
Q

Does it matter which variable is on the X and Y axes?

A

Yes, because one variable is predicting the other, the predictor variable (IV) should be on the X axis and the outcome variable (DV) should be on the Y axis.

7
Q

What does the regression line mean?

A

It summarises the characteristics of the relationship between the two variables.

8
Q

What method is used to find the regression line of best fit?

A

The least squares regression line, which ensures the smallest possible deviation between the observed and predicted scores (i.e. between the observations and the regression line)

9
Q

What is minimised in the least squares regression line?

A

The sum of squared residuals (must be squared because the sum of residuals is equal to zero). The line of best fit has the smallest SSres.

10
Q

What is the method of least squares?

A

The method of obtaining the line of best fit in a regression model that has the smallest possible SSres.
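The method can be sketched in code (a hypothetical illustration using the standard closed-form formulas b = sum of cross-products of deviations / sum of squared x deviations, and a = mean of y - b * mean of x):

```python
def least_squares(x, y):
    """Return (a, b): the intercept and slope that minimise the sum of squared residuals."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # Slope: sum of cross-products of deviations over sum of squared x deviations
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
        sum((xi - x_bar) ** 2 for xi in x)
    # Intercept: forces the line through the point of means (x_bar, y_bar)
    a = y_bar - b * x_bar
    return a, b

def ss_res(x, y, a, b):
    """Sum of squared residuals for the line y_hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
```

Nudging either a or b away from the least-squares values can only increase SSres, which is exactly what "line of best fit" means here.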

11
Q

What is the least squares estimator?

A

The estimator used to obtain the line of best fit in a regression model. This estimator finds the estimated value for the slope and y intercept constant that minimises the SSres for the set of observed scores on the X and Y axis

12
Q

What is the least squares linear regression function?

A

The line of best fit produced by the method of least squares

13
Q

The full regression equation (full simple regression equation)

A

Y = a + bX + e (where a = constant intercept parameter; b = slope/regression coefficient; X = score on the IV; e = residual score)

14
Q

Regression model equation (simple regression model equation)

A

Y_hat = a + bX (excludes the residual score from the right-hand side of the equation and uses the predicted Y score (Y_hat) on the left-hand side of the equation)

15
Q

What is the regression coefficient?

A

The slope parameter b in the regression model & full regression equation

16
Q

What is a negative residual?

A

A residual score obtained when the predicted score is greater than the observed score

17
Q

What is a positive residual?

A

A residual score where the observed score is greater than the predicted score

18
Q

What is SSTotal?

A

The total variation in observed scores on the dependent variable (Y).

Measures the sum of squared differences between the observed Y scores and the mean (average).

19
Q

What is SSReg?

A

Variation in the predicted scores in Y (DV).

Sum of squared deviations between the predicted scores and the mean.

Represents the variation in predicted scores accounted for by the model. Bigger SSReg means that the regression model is a good predictor.

20
Q

What is SSres?

A

Variation in the difference between observed and predicted scores.

Represents the sum of squared deviations between the observed and predicted data (the residuals). Large SSres means regression model is not a good predictor. A SSres of zero means a perfect correlation, that observed and predicted scores fit perfectly along a straight line (but not v. likely to happen!)

21
Q

What is R squared (R2)?

A

The proportion of total variation in Y accounted for by the model.

Measures the overall strength of prediction.

An R squared value can range between zero and +1 (can’t be negative because it’s a squared value).

An R squared value of zero means that the IV does not predict the DV (they are independent of each other)

An R squared value of 1 means that 100% of the variability in Y can be predicted by X (also v. unlikely), the larger the R squared value the greater the strength of prediction in the regression model.

22
Q

How is R squared calculated?

A

R squared is calculated by dividing the SSReg by the SSTotal.

23
Q

What are the alternative ways of calculating R squared?

A

By subtracting the SSres from the SSTotal (which produces the SSReg) and dividing this by SSTotal

By dividing the SSReg by SSReg + SSres (which equals the SSTotal)

These methods all produce the same measure of strength of prediction, but use the three measures of variability found in the regression model differently to obtain the same results.

They provide a good way of showing the relationship between these measures of variability.
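These equalities are easy to check numerically. A sketch with made-up numbers (the predicted scores come from a hypothetical fitted line y_hat = 2.2 + 0.6x):

```python
# Hypothetical data: verify that all three routes to R^2 agree.
y_obs = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]     # predictions from y_hat = 2.2 + 0.6x, x = 1..5
y_bar = sum(y_obs) / len(y_obs)

ss_total = sum((y - y_bar) ** 2 for y in y_obs)               # observed vs mean
ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)               # predicted vs mean
ss_res = sum((y - yh) ** 2 for y, yh in zip(y_obs, y_hat))    # observed vs predicted

r2_a = ss_reg / ss_total                 # SSReg / SSTotal
r2_b = (ss_total - ss_res) / ss_total    # (SSTotal - SSres) / SSTotal
r2_c = ss_reg / (ss_reg + ss_res)        # SSReg / (SSReg + SSres)
```

All three give the same value because SSTotal = SSReg + SSres.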

24
Q

What is R?

A

The multiple correlation coefficient (Multiple R) = square root of R squared.

25
Q

What does Multiple R measure in a regression analysis?

A

The extent to which higher predicted scores (Y hat) for the DV are associated with higher observed scores (Y) on the DV.

26
Q

What are the df reg?

A

The number of independent variables

27
Q

What are the df res?

A

df res = n - no. of IVs - 1

28
Q

What are the df total?

A

df total = df reg + df res

29
Q

What theoretical probability distribution is equivalent to R squared as an estimator of the overall strength of prediction in the regression model at a population level?

A

The F distribution

30
Q

What techniques are used to make inferences about the overall strength of prediction in a regression model at a population level?

A

Null Hypothesis significance testing of R squared

31
Q

What is the population parameter that corresponds to R squared?

A

P squared (rho squared)

32
Q

What does the null hypothesis state for R squared at the population level?

A

Ho: P2 (rho squared) = 0

33
Q

What does the alternative hypothesis state when making inferences from R squared to P squared?

A

Ha: P2 (rho squared) is not equal to zero (or is larger than zero)

34
Q

What do sums of squares (SS) measure?

A

SS measures variation (they are squared deviation scores)

35
Q

What are Mean Sums of Squares (MS)?

A

They measure the average variation: a sum of squares divided by its degrees of freedom (an average SS)

36
Q

How are the Mean Sums of Squares (MSres & MSReg) obtained?

A

MSReg = SSReg/df reg

MSres = SSres/df res

37
Q

How is the Tobs calculated for the null hypothesis test on R2?

A

Tobs = MSReg/MSres

38
Q

What theoretical probability distribution is the Tobs for a null hypothesis test on R2 equivalent to?

A

The F distribution

Tobs = F(dfReg, dfres) = MSReg/MSres
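Putting cards 36-38 together, the observed test statistic can be assembled from the sums of squares and degrees of freedom. A sketch with hypothetical numbers:

```python
# Hypothetical values from a fitted simple regression (1 IV, n = 5 cases)
ss_reg, ss_res = 3.6, 2.4
n, n_ivs = 5, 1

df_reg = n_ivs              # df reg = number of IVs
df_res = n - n_ivs - 1      # df res = n - no. of IVs - 1
ms_reg = ss_reg / df_reg    # mean square for the regression
ms_res = ss_res / df_res    # mean square for the residuals
f_obs = ms_reg / ms_res     # compared against the F(df_reg, df_res) distribution
```

The observed value is then compared against the critical value of F(df_reg, df_res) to decide whether to reject the null hypothesis.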

39
Q

What is the shape of the F distribution?

A

Positively skewed

40
Q

What value range can an F statistic have

A

An F statistic must be zero or greater (it can’t be negative, because mean squares are averages of squared deviations); it has no upper bound.

41
Q

Where is the critical region in the F distribution?

A

In the tail (because positively skewed)

42
Q

What other distribution is the F distribution similar to?

A

The Chi Square distribution

43
Q

What is the limitation of using a point estimate in a null hypothesis test on R2?

A

The point estimate can only tell us whether P2 (corresponding to R2) is equal, or not equal, to zero. It can’t provide a range of plausible values for P2.

44
Q

What are the advantages of placing a confidence interval around R2?

A
  1. R2 can only range between 0 and 1, therefore the CI immediately and clearly shows the precision with which R2 is being estimated.
  2. If the lower bound of the CI is not 0, we immediately know that the null hypothesised value of 0 would be rejected.
  3. The CI can indicate extreme bias in R2 (i.e. R2 may not be contained within the CI, indicating extreme bias).
45
Q

What factors influence the precision of a confidence interval on R2?

A
  1. The number of IVs
  2. The sample size
  3. The size of R2
  • fewer IVs = greater precision
  • larger sample size = greater precision
  • larger R2 = greater precision
46
Q

How can a confidence interval for Multiple R (the multiple correlation coefficient) be obtained?

A

By taking the square root of the upper and lower bounds of the CI for R squared, a CI for Multiple R can be obtained.
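A sketch of that transformation (the function name and bound values are illustrative):

```python
import math

def r_ci_from_r2_ci(lower_r2, upper_r2):
    """CI for Multiple R, obtained by square-rooting each bound of a CI on R^2."""
    return math.sqrt(lower_r2), math.sqrt(upper_r2)
```

For example, a CI of (0.04, 0.25) on R squared becomes (0.2, 0.5) for Multiple R.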

47
Q

How is a CI for Multiple R interpreted?

A

The CI for Multiple R provides an estimate of the expected correlation between the observed and predicted scores on the dependent variable at the population level.

48
Q

Is R2 a biased estimator of P2?

A

Yes, but it is also a consistent estimator

49
Q

When can Multiple R be a better measure of the overall strength of prediction than R2?

A

When the R2 is very small, Multiple R can be used to determine if there is a significant correlation between the observed and predicted scores on the DV.

50
Q

Is the unbiased or adjusted CI value more accurate?

A

The unbiased estimate is better than the adjusted (but SPSS doesn’t produce it)

51
Q

What does an R2 value greater than the upper bound of the CI mean?

A

That there is a huge upward bias in the observed R2, and that the population P2 is more likely to lie within a lower interval.

52
Q

What possible causes are there for extreme bias in R2?

A

A large number of IVs and small sample size.

53
Q

How can bias in R2 be reduced?

A

By using a larger sample size and fewer IVs

54
Q

Which estimate of R2 should be used for a smaller sample size?

A

The unbiased R2 is the only one that is definitely an accurate point estimator of P2 (the adjusted R2 has a slight negative bias and the unadjusted R2 is positively biased). However, the unadjusted R2 is the only accurate interval estimator.

55
Q

Which estimator of R2 should be used for a larger sample size?

A

All 3 estimators of R2 are accurate (and therefore all consistent), and all are good interval estimators.

56
Q

What is an unstandardised partial regression coefficient?

A

“b” - indicates the expected change in the scores for the DV for a unit change on the focal IV (while holding constant scores on all other IVs)

57
Q

What are two ways of measuring the strength of prediction of each IV using the partial regression coefficient?

A
  1. The semipartial correlation
  2. The standardised partial regression coefficient

58
Q

What theoretical probability distribution does a hypothesis test of the partial regression coefficient correspond to?

A

The t-distribution

59
Q

What are the df for the t-distribution hypothesis test on the partial regression coefficient?

A

df = n - dfReg - 1

60
Q

How is the Tobs (tobs) calculated for a partial regression coefficient?

A

tobs = (partial regression coefficient - hypothesised population regression coefficient, usually 0) / standard error of the regression coefficient.
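A sketch for the simple-regression case (hypothetical; it assumes the usual standard-error formula SE(b) = sqrt(MSres / sum of squared x deviations) and a hypothesised population coefficient of zero):

```python
import math

def t_for_slope(x, y, a, b, b_null=0.0):
    """t statistic for H0: population slope = b_null, in a simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ms_res = ss_res / (n - 2)   # df res = n - 1 IV - 1
    se_b = math.sqrt(ms_res / sum((xi - x_bar) ** 2 for xi in x))
    return (b - b_null) / se_b
```

With a single IV, the square of this t statistic equals the overall F statistic, which is a useful cross-check.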

61
Q

How is a 95% CI for a partial regression coefficient interpreted?

A

A 95% CI indicates the range of expected change in the DV for a unit change in the focal IV (holding constant all other IVs).

62
Q

What is a semipartial correlation?

A

The Pearson correlation between scores on the DV and the part of the scores on an IV not accounted for by all other IVs in the regression model.

63
Q

What is the notation for the semipartial correlation?

A

sr

64
Q

What is homoscedasticity?

A

Irrespective of what the predicted scores on the dependent variable are, the variability of the residuals is the same.

65
Q

What is heteroscedasticity?

A

Systematic variability in the residual variances according to the predicted values (e.g. low variability with low scores on the DV and high variability with high scores on the DV)

66
Q

What diagnostic tool is used to determine if heteroscedasticity is present (or not)?

A

A scatterplot of the residuals against predicted scores on the DV.

67
Q

Why do we use predicted scores in a scatterplot when testing for heteroscedasticity?

A

Because the predicted scores on the DV represent a linear combination of the scores on all the IVs (so we don’t need to check each IV individually)

68
Q

What is plotted on the X axis of a scatterplot when checking for heteroscedasticity?

A

Standardised predicted scores (Z transformation of Y hat scores)

69
Q

What is plotted on the Y axis of a scatterplot used for checking for heteroscedasticity?

A

The residual scores (can be standardised, studentised, or studentised deleted residuals)

70
Q

What are standardized residuals?

A

Obtained by applying a Z transformation to the raw residuals (mean = 0, SD = 1)
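A sketch of that Z transformation (this divides by the SD of the raw residuals themselves; statistical packages may instead divide by an estimate based on the square root of MSres):

```python
import math

def standardise(residuals):
    """Z-transform raw residuals to mean 0, SD 1."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / n)
    return [(r - mean) / sd for r in residuals]
```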

71
Q

What are studentized residuals?

A

Obtained by dividing the raw residuals by an estimate of their standard error (mean approx. 0, SD approx. 1)

72
Q

What are studentized deleted residuals and how are they obtained?

A

Residuals obtained by repeatedly refitting the same regression model to the sample data with one case left out each time.

The difference between a case’s observed score on the DV and the score predicted by the regression equation fitted without that case is called the deleted residual.

The studentised deleted residual for a case is therefore its deleted residual divided by an estimate of its standard error.
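The leave-one-out idea can be sketched for a simple regression (a hypothetical illustration; it returns the deleted residuals only, without the division by an estimated standard error that produces the studentised version):

```python
def deleted_residuals(x, y):
    """For each case, refit the line without it and return y_i minus that refit's prediction."""
    out = []
    for i in range(len(x)):
        xs = [v for j, v in enumerate(x) if j != i]
        ys = [v for j, v in enumerate(y) if j != i]
        n = len(xs)
        x_bar, y_bar = sum(xs) / n, sum(ys) / n
        b = sum((xj - x_bar) * (yj - y_bar) for xj, yj in zip(xs, ys)) / \
            sum((xj - x_bar) ** 2 for xj in xs)
        a = y_bar - b * x_bar
        out.append(y[i] - (a + b * x[i]))   # deleted residual for case i
    return out
```

Data lying perfectly on a straight line give a deleted residual of zero for every case.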

73
Q

What is a residual outlier?

A

A studentised deleted residual identified on a scatterplot with a value greater than 2.5 to 3 (or less than -2.5 to -3).

Studentised deleted residuals are plotted on the Y axis and predicted scores on the X axis.

74
Q

What is the advantage of using studentised deleted residuals?

A
  1. They are good at picking up outliers and extreme data points, especially when the sample size is small.

A studentised deleted residual with an absolute value of 3 or more is an extreme data point.

75
Q

What does fanning in (or out) indicate in a scatterplot of residuals?

A

Indicative of heteroscedasticity (systematic variation of the residuals as the predicted values increase/decrease).

76
Q

Does sample size affect the ability to detect heteroscedasticity?

A

Yes! In a small sample it is almost impossible to see on a scatterplot unless it is really obvious.

77
Q

What effect can an aberrant or extreme score have on a small sample size?

A

It can dramatically change the results of the regression model.

78
Q

What two diagnostic techniques allow extreme data points to be observed in a regression model?

A
  1. Examine the studentised deleted residuals for extreme scores
  2. Investigate a measure of influential cases on our data using Cook’s d statistic.
79
Q

What size of studentised deleted residual indicates an extreme score?

A

A studentised deleted residual of more than 2.5 or 3, or less than -2.5 or -3.

80
Q

What is Cook’s d?

A

A statistic that is calculated for each data value in the regression model and assesses the influence of each case on the model, when that case has been removed from the model.

81
Q

What is the range of values for Cook’s d?

A

Minimum value of 0, and a large value (e.g. +1 or more) is indicative of an extreme datapoint.
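A sketch of the idea for a simple regression (hypothetical; influence is computed here from leave-one-out refits as the total shift in fitted values, rather than via the usual leverage-based shortcut formula):

```python
def cooks_d(x, y):
    """Cook's d for each case in a simple regression, via leave-one-out refits."""
    def fit(xs, ys):
        n = len(xs)
        xb, yb = sum(xs) / n, sum(ys) / n
        b = sum((xi - xb) * (yi - yb) for xi, yi in zip(xs, ys)) / \
            sum((xi - xb) ** 2 for xi in xs)
        return yb - b * xb, b   # (intercept, slope)

    a, b = fit(x, y)
    y_hat = [a + b * xi for xi in x]
    p = 2   # parameters in the model: intercept + slope
    ms_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / (len(x) - p)
    ds = []
    for i in range(len(x)):
        xs = [v for j, v in enumerate(x) if j != i]
        ys = [v for j, v in enumerate(y) if j != i]
        ai, bi = fit(xs, ys)
        # How far do all the fitted values move when case i is removed?
        shift = sum((yh - (ai + bi * xi)) ** 2 for xi, yh in zip(x, y_hat))
        ds.append(shift / (p * ms_res))
    return ds
```

An influential extreme data point produces a much larger Cook’s d than the other cases.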

82
Q

What value of Cook’s d indicates an extreme data point?

A

A Cook’s d value of +1 or higher.

83
Q

How is non-linearity in a regression model established?

A

Systematic patterning indicating non-linearity should be evident in a scatterplot of the residuals and predicted values.

84
Q

What is the meaning of the intercept parameter (the a intercept)?

A

The expected value on the DV when the scores on all IVs = 0. The intercept always = 0 in a standardised regression equation.

It is only worth interpreting if an expected score of 0 on the IVs is meaningful (otherwise forget it).
