Multiple regression Flashcards

1
Q

what are the assumptions of regression?

A
  • residuals normally distributed (mean of 0 and variance of sigma squared)
  • homoskedasticity (variance of the residuals remains constant no matter the value of x)
  • Residuals not correlated
2
Q

how exactly do we get the regression line? what method is used

A

The line is fit using the method of least squares, where the sum of the squared residuals is minimised.
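
A minimal sketch of least squares for one predictor, using the usual closed-form solution. The data and the names `fit_line` and `sum_of_squares` are made up for illustration.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sum_of_squares(x, y, intercept, slope):
    """Sum of squared residuals for a candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
best = sum_of_squares(x, y, b0, b1)
# Any other line, e.g. one with a slightly steeper slope, does worse.
worse = sum_of_squares(x, y, b0, b1 + 0.1)
```

Here `best` is smaller than `worse`, which is what "minimised" means in practice.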

3
Q

what are we estimating with regression?

A

the population parameters: the intercept and the slope

4
Q

why center a variable?

A

to allow for more meaningful interpretation. E.g., imagine centering age at 46, the mean age of a sample. Then the intercept tells us what the DV is for a 46 year old with all other predictors set to 0.
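
A small sketch of what centering does, assuming a single predictor and made-up ages and scores. Centering leaves the slope unchanged and moves the intercept to the predicted score at the mean age.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up sample whose mean age is 46.
ages = [30, 40, 46, 52, 62]
scores = [5.0, 4.5, 4.0, 3.8, 3.2]

mean_age = sum(ages) / len(ages)
centred_ages = [a - mean_age for a in ages]

raw_intercept, raw_slope = fit_line(ages, scores)              # intercept = prediction at age 0
centred_intercept, centred_slope = fit_line(centred_ages, scores)  # intercept = prediction at age 46
```

With the raw variable the intercept refers to a person aged 0, which is rarely meaningful. After centering it refers to a person of average age.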

5
Q

if you standardise a variable e.g., age, how does this impact the interpretation of the regression line

what now does the slope reflect?

A

After standardising a variable, the slope is interpreted as the change in the outcome associated with a 1 SD change in that variable.

6
Q

if I standardised both the predictor and the outcome variable, what does the coefficient for the X variable reflect?

A

Pearson's correlation coefficient (in simple regression, with a single predictor)
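
A quick check of both standardisation cards, as a sketch on made-up age and hedonism values (all function names here are illustrative). Standardising X alone rescales the slope by SD(X). Standardising both X and Y gives a slope equal to Pearson's r.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Sample standard deviation."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def standardise(values):
    m, s = mean(values), sd(values)
    return [(v - m) / s for v in values]


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data, for illustration only.
age = [22, 35, 41, 58, 63, 70]
hedonism = [4.8, 4.1, 4.3, 3.2, 3.5, 2.9]

raw_slope = slope(age, hedonism)                             # change in Y per year of age
x_only = slope(standardise(age), hedonism)                   # change in Y per 1 SD of age
both = slope(standardise(age), standardise(hedonism))        # change in SDs of Y per 1 SD of age
r = pearson_r(age, hedonism)
```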

7
Q

in simple regression what part of the equation captures the explained and unexplained variance?

A

Systematic part (explained): B0 + B1X
Random part (unexplained): residual (e)

8
Q

describe the form of a regression equation

A

Response = Systematic part + random part

9
Q

in simple regression how do we capture the residual variability?

A

sigma squared

10
Q

to estimate the residual variability using sigma squared - what assumption needs to be made

A

that the residuals are normally distributed with mean 0 and the same variance (sigma squared) at every value of X

11
Q

What is explained variance in Y, unexplained variance in Y and the total variance

A
  • Explained variance = the variance in Y that is explained by X
  • Unexplained variance = the residual variance, i.e. the variance in Y that is not explained by X
  • Total variance = sum of explained and unexplained variance.
12
Q

what is R-squared? What is the formula to calculate this?

A
  • The proportion of variance in Y explained by X
  • Formula: explained variance / total variance = R-squared
13
Q

in simple regression what is R squared the same thing as?

A

The square of the Pearson’s correlation coefficient
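
A sketch tying the last three cards together on made-up data. It works with sums of squares rather than variances (the ratio is the same) and checks that explained / total equals the squared Pearson correlation in simple regression.

```python
import math


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 2.9, 4.2, 4.8, 6.1, 6.6]

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

total = sum((yi - mean_y) ** 2 for yi in y)                      # total variation in Y
explained = sum((pi - mean_y) ** 2 for pi in predicted)          # explained by X
unexplained = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # residual

r_squared = explained / total

# Pearson's r between X and Y, for comparison.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
r = sxy / math.sqrt(sxx * total)
```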

14
Q

what two things affect the standard error

A

> sample size (SE decreases as the sample gets larger)

> amount of variability in X and amount of variance in Y unexplained by X (residual variance)

specifically, SE DECREASES with more variability in X and INCREASES with more residual variance.
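
A sketch of why these factors matter, assuming the standard simple-regression formula SE(b1) = sqrt(residual variance / sum of squared deviations of X). The function name `slope_se` and the numbers are made up for illustration.

```python
import math


def slope_se(residual_variance, x):
    """Standard error of the slope in simple regression."""
    mean_x = sum(x) / len(x)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return math.sqrt(residual_variance / sxx)


base = slope_se(1.0, [1, 2, 3, 4, 5])
more_spread_in_x = slope_se(1.0, [2, 4, 6, 8, 10])        # smaller SE
more_residual_variance = slope_se(4.0, [1, 2, 3, 4, 5])   # larger SE
bigger_sample = slope_se(1.0, [1, 2, 3, 4, 5] * 2)        # smaller SE
```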

15
Q

How do we use the SE to calculate the 95% confidence interval for the slope estimate (B1)?

let's say the SE is 0.001

A

CI = slope +/- (1.96 x 0.001)

16
Q

how do we use the SE to calculate the test statistic (aka the Z or t-ratio)

  • SE of sex is 0.025
  • coefficient is -0.156
A

slope / SE = test statistic

so Z ratio is -0.156 / 0.025 = -6.24

17
Q

we have 2 groups, treatment and control. We want to test whether there is a difference in the means of these groups.

I could use ANOVA or regression. Which is better?

A

Regression as it allows you to control for the effects of other predictors

18
Q

multiple regression with categorical predictor. measuring score of hunger in different countries

Germany, UK, France

UK is the reference country.

HUNGER = B0 + B1(Germany) + B2(France) + e

what exactly do the B1 and B2 slopes reflect?

A
  • B1 = the difference in means Germany vs UK
  • B2 = difference in means France vs UK
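
A sketch of dummy coding with UK as the reference, on made-up scores. The tiny `solve` and `ols` helpers are illustrative stand-ins for a statistics package. They fit the model by solving the normal equations.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up scores: UK mean 3.2, Germany mean 2.7, France mean 3.8.
data = [("UK", 3.0), ("UK", 3.4), ("UK", 3.2),
        ("Germany", 2.5), ("Germany", 2.7), ("Germany", 2.9),
        ("France", 3.6), ("France", 3.8), ("France", 4.0)]

# Columns: constant, Germany dummy, France dummy (UK is the reference).
rows = [[1.0, float(c == "Germany"), float(c == "France")] for c, _ in data]
y = [score for _, score in data]

b0, b1, b2 = ols(rows, y)
```

Here `b0` is the UK mean, `b1` is the Germany minus UK difference and `b2` is the France minus UK difference.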
19
Q

how can we express the null hypothesis for the difference in hedonism scores for Germany vs UK. Then again for France vs UK?

A
  • H0: B1 = 0
  • H0: B2 = 0
20
Q

multiple regression with categorical predictor. measuring score of hedonism in different countries

Germany, UK, France

hedonism = B0 + B1(Germany) + B2(France) + e

what does the regression with dummy coded variables tell us

A

the mean hedonism score in each country (taking UK as the reference, with B1 for Germany and B2 for France):
- UK: B0
- Germany: B0 + B1
- France: B0 + B2

21
Q

multiple regression with more than one explanatory variable.

Y = b0 + b1x1 + b2x2 + e

what value do we expect each predictor to be at the level of the intercept?

A

0. The intercept b0 is the expected value of Y when x1 and x2 are both 0.

22
Q

multiple regression with more than one explanatory variable.

Y = b0 + b1x1 + b2x2 + e

interpret the coefficients B1 and B2

A
  • B1 is the coefficient for variable X1, interpreted as the change in Y for a single unit change in X1 while controlling for X2.
  • Likewise, B2 is the change in Y for a single unit change in X2 while controlling for X1.
23
Q

multiple regression with more than one explanatory variable.

Y = b0 + b1x1 + b2x2 + e

what does the residual mean in this equation?

A

The part of Y that is accounted for by neither X1 nor X2

24
Q

multiple regression with more than one explanatory variable.

Y = b0 + b1x1 + b2x2 + e

what is the null and alternative hypothesis for this equation?

A
  • H0: Bk = 0
  • H1: Bk != 0
25
Q

How can we test for significance in multiple regression?

A

By examining confidence intervals for each parameter or, equivalently, by comparing Z-ratios to the normal distribution and calculating a p-value.

26
Q

let's say we have a linear regression analysis with 1 categorical PV. Is this simple or multiple regression?

A
  • Multiple, because the categorical predictor is dummy coded, so more than one variable enters the equation (e.g. two dummy variables for a three-category predictor).
27
Q

what are the parameters in multiple regression?

A

the intercept and the slopes (one for each predictor)

28
Q

what is the standardized coefficient of the predictor X?

A

This is the slope we expect when both X and Y are standardised before the analysis.

29
Q

what statistic in simple regression is the standardised coefficient for the predictor X equivalent to?

A

Pearson's correlation coefficient

30
Q

interpret the standardised coefficient for predictor X1 in a multiple regression with 2 predictors

A

The change in the DV, in SD units, for a 1 SD increase in X1, while holding X2 constant.

31
Q

what does 1 unit of a standardized variable correspond to?

A

1 SD

32
Q

imagine a multiple regression model of hedonism on age and education, the standardised coefficient for AGE is -0.358. Interpret the relationship between age and hedonism.

A

A 1 SD increase in age predicts a 0.358 SD decrease in hedonism, holding education constant.

33
Q

how do we know if we have a linear relationship between the x and y variable

A

plot them against each other. If there is a curve then the relationship is non-linear and we can try fitting a curve instead, e.g. a quadratic function

34
Q

we have fit a quadratic curve to the relationship between age and hedonism. age is included as both a linear and squared term.

linear age has a negative coefficient and squared age has a positive coefficient.

what does the positive coefficient of the squared term, together with the negative coefficient of the linear term tell us about the relationship between hedonism and age?

A

that the negative relationship flattens out at older ages
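
A sketch of why this pattern means "flattening". In a quadratic model the slope at a given age is the linear coefficient plus 2 × the squared coefficient × age. The coefficients below are hypothetical, chosen only to show the pattern.

```python
def slope_at(age, b_linear, b_squared):
    """Slope of b_linear * age + b_squared * age**2 at a given age."""
    return b_linear + 2 * b_squared * age


# Hypothetical coefficients: negative linear term, positive squared term.
b_linear = -0.08
b_squared = 0.0005

slope_young = slope_at(20, b_linear, b_squared)   # steeper decline
slope_old = slope_at(60, b_linear, b_squared)     # shallower decline

# Age at which the curve stops declining (slope reaches 0).
turning_point = -b_linear / (2 * b_squared)
```

The slope is negative at both ages but closer to 0 at the older age, so the decline in hedonism slows down as age increases.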

35
Q

R squared

A
  • The proportion of variance in Y that is explained by all variables in the model
  • Also reflects the square of the correlation between the predicted and observed Y values in the model
36
Q

R2 vs Adjusted R2

A

R2 never decreases when variables are added to the model, even irrelevant ones, so it is common to use the adjusted R2 instead. This is better because it takes into account how many variables are included in the model: it is a measure of goodness of fit that penalises you for each additional variable. If you add variables, its value will only increase if they explain enough additional variance in Y to outweigh the penalty.

Adjusted R2 therefore gives a more meaningful indication of the value of new predictors.
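
A sketch of the penalty, using the common formula adjusted R2 = 1 - (1 - R2)(n - 1) / (n - k - 1), where n is the sample size and k the number of predictors. The sample size and R2 values below are hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R2 for n observations and k predictors (intercept not counted)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical model with two predictors.
two_predictors = adjusted_r_squared(0.121, n=100, k=2)

# Adding a nearly irrelevant third predictor nudges R2 up slightly,
# but adjusted R2 goes down because of the extra-variable penalty.
three_predictors = adjusted_r_squared(0.122, n=100, k=3)
```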

37
Q

interpret R2 for both simple regression and multiple regression

A

In simple regression – the proportion of variance in Y that is accounted for by X.

In multiple regression it is the variance in Y that is accounted for by all predictors in the model.

38
Q

what does an R2 of 0.121 tell us?

A

That 12.1% of the variation in hedonism scores is accounted for by variation in age and education.
The correlation between the predicted and observed hedonism scores is 0.348. The square of this value is 0.121.

39
Q

What does R2 tell us about the predictability of the model?

A

It tells us the square of the correlation between the predicted values of Y (from the fitted model) and the observed values of Y

40
Q

multicollinearity.
Why is high correlation between predictors bad?

A

The coefficient estimates will be unstable and imprecise (large SE). Solution: either drop one or create a new variable that is a combination of the two.

41
Q

when allowing an interaction term (age *sex) in multiple regression model.

why does the multiplication of age and sex into a new variable help us see interaction effects

A

Including the product term allows the influence of age on hedonism score to differ for men vs women. If its coefficient is significant then we say there is an interaction effect.

42
Q

multiple regression. predicting hedonism with age and gender.

Q: how would the regression model differ for men vs women? Men coded as 0 and women as 1.

Hed(i) = b0 + b1(AGE)i + b2(SEX)i + b3(AGE × SEX)i + e(i)

A

intercept and slope for men (SEX = 0)

  • b0 = intercept
  • b1 = slope for age

intercept and slope for women (SEX = 1)

  • b0 + b2 = intercept, b2 being the difference in intercept for women from men
  • b1 + b3 = slope, b3 being the difference in slope for women from men
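
A sketch on made-up data. It relies on the fact that a model with a sex dummy and an age × sex product term reproduces a separate least-squares line for each group, so the four coefficients can be recovered from two simple fits. All names and values are illustrative.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data: men coded 0, women coded 1.
men_age, men_hed = [20, 30, 40, 50], [4.0, 3.5, 3.0, 2.5]
women_age, women_hed = [20, 30, 40, 50], [4.1, 3.9, 3.7, 3.5]

b0, b1 = fit_line(men_age, men_hed)                        # men's intercept and slope
women_intercept, women_slope = fit_line(women_age, women_hed)

b2 = women_intercept - b0   # difference in intercept, women vs men
b3 = women_slope - b1       # difference in slope, women vs men


def predict(age, sex):
    """Prediction from the interaction model."""
    return b0 + b1 * age + b2 * sex + b3 * age * sex
```

For example, `predict(30, 1)` uses intercept b0 + b2 and slope b1 + b3, which is the women's line.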

43
Q

difference between a one sided and two sided test

A

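one sided:
we reject H0 only if the test statistic falls in one direction, i.e. only if it is greater than (or only if it is less than) a critical value X.XX.

two sided:
we reject H0 if the test statistic falls in either direction, e.g. if it is either greater than 3.33 or less than -3.33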

44
Q

for a test investigating whether the influence of cohort on score depends on social class, what would the null hypothesis for no interaction existing be?

A

That the coefficients for all the interactions are 0
H0: B5= B6 = B7 = 0.

45
Q

measuring the influence of country (UK, Germany, France) on hedonism score.

how can we test whether the 3 countries all have the same slope?

A

include the interaction terms in the model, then compare this to the results of a simpler model without them (nested within).

a significant difference between the two tells us that at least 1 of the countries' slopes differs from the others

46
Q

what test do we use to compare two nested models.

A

F test

So when investigating whether 1 coefficient is equal to 0 , we can use the t statistic. When we want to investigate a group of coefficients, use a nested F test.

47
Q

What is the F statistic?

A

A statistic based on the difference between the R2 values of two nested models, the interaction model and the restricted model.

The change in R2, scaled by the degrees of freedom, is compared to the F distribution.

48
Q

how can we calculate the Wald statistic using the F statistic?

A

Wald = no. of parameters constrained at 0 (interaction coefficients) X F
e.g., for a model with 3 interaction terms and an f statistic of 46.56
3 × 46.56 = 139.68
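
A sketch of both calculations. The nested F uses the standard R2-change form, F = ((R2 full - R2 restricted) / q) / ((1 - R2 full) / (n - k - 1)), with q constrained coefficients, n observations and k predictors in the full model. The function names and the R2, n and k values are hypothetical. Only the 46.56 example comes from the card.

```python
def nested_f(r2_full, r2_restricted, q, n, k_full):
    """F statistic for comparing a full model with a nested restricted model."""
    numerator = (r2_full - r2_restricted) / q
    denominator = (1 - r2_full) / (n - k_full - 1)
    return numerator / denominator


def wald_from_f(f_statistic, q):
    """Wald statistic = number of constrained parameters x F."""
    return q * f_statistic


# The card's example: 3 interaction terms, F = 46.56.
wald = wald_from_f(46.56, q=3)   # 139.68

# Hypothetical nested comparison.
f = nested_f(r2_full=0.30, r2_restricted=0.25, q=3, n=1000, k_full=7)
```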

49
Q

What is the equivalent of comparing the wald statistic and chi squared distribution

A

Comparing the Wald statistic to a chi-squared distribution is equivalent to comparing the F statistic to an F-distribution.