Week 3-Hierarchical Regression Flashcards

1
Q

What does a regression identify?

A

-It identifies whether there are significant associations between one or more predictor variables and an outcome variable

-It does this by fitting a line of best fit to the association between the variables

2
Q

What are the 2 key ingredients for any regression?

A
  1. Amount of variance the model explained (Adjusted R^2)
  2. Significance of individual predictors (Regression Coefficients) (Just because the overall model is significant doesn’t mean the individual predictors will be significant)
3
Q

What is R^2?

A

-This tells us how much of the variance in our dependent variable is explained by our regression model

-The regression model refers to all the predictors considered together

4
Q

How do you calculate R^2?

A

R^2 = SSR/SST OR SSR/(SSE+SSR)

SSR=Variance explained (regression sum of squares)
SSE=Unexplained variance (error sum of squares)
SST=Total variance (total sum of squares)
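
As a rough illustration, the calculation can be sketched in Python for a regression with one predictor. The scores and the function name `r_squared` are made up for this example and are not from the course material.

```python
def r_squared(x, y):
    """Fit y = a + b*x by least squares and return (SSR, SSE, SST, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope (b) and intercept (a) of the line of best fit
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    predicted = [a + b * xi for xi in x]
    ssr = sum((p - mean_y) ** 2 for p in predicted)          # explained
    sse = sum((yi - p) ** 2 for yi, p in zip(y, predicted))  # unexplained
    sst = sum((yi - mean_y) ** 2 for yi in y)                # total
    return ssr, sse, sst, ssr / sst

# Hypothetical questionnaire scores, for illustration only
stress = [1, 2, 3, 4, 5]
anxiety = [2, 4, 5, 4, 5]
ssr, sse, sst, r2 = r_squared(stress, anxiety)
print(round(ssr, 2), round(sse, 2), round(sst, 2), round(r2, 2))
```

With these made-up scores SSR and SSE add up to SST, which is why both versions of the formula give the same R^2.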

5
Q

What range of values can R^2 take and what does it indicate?

A

-Ranges between 0 and 1, where the higher it is the more variance the regression model explains (often referred to as a %)

E.g:
R^2 =.05 means 5% of variance is explained
R^2 =.21 means 21% of variance is explained

6
Q

What is Adjusted R^2?

A

-Very similar to the R^2 statistic but the better one to report

-It is never higher than R^2, and in practice it is always lower

-It penalises R^2 for each predictor added to the model

-This stops dodgy researchers throwing in more variables just to improve the fit of the model
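
The penalty can be sketched with the standard adjusted R^2 formula, 1 - (1 - R^2)(n - 1)/(n - k - 1). The sample size and predictor counts below are hypothetical and only show how the same R^2 shrinks more as predictors are added.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for a model with k predictors fitted to n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical example: the same R^2 of .21 from 100 participants
few = adjusted_r_squared(0.21, n=100, k=2)    # 2 predictors
many = adjusted_r_squared(0.21, n=100, k=10)  # 10 predictors
print(round(few, 3), round(many, 3))
```

Both adjusted values fall below .21, and the 10-predictor model is penalised more heavily than the 2-predictor model.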

7
Q

How are R^2 and Adjusted R^2 assessed?

A

-Their significance is assessed using an ANOVA (analysis of variance)

-Tells us whether the amount of variance explained is statistically significant

8
Q

What is a regression coefficient?

A

-It tells us about the association between each of our IVs and the DV

-It can be positive (positive association) or negative (negative association) like a correlation

-Unlike correlation coefficients it does not range between -1 and 1

-It is a description of an IV-DV association in terms of unit changes

9
Q

What does the regression coefficient number mean?

A

-It tells us how much the DV changes when the IV is increased by one unit (with any other predictors held constant)

For example:
-I measure stress using a questionnaire (IV) and anxiety using a questionnaire (DV) and my regression coefficient is 1.5

-This would mean that for each increase of 1 on the stress questionnaire, scores on the anxiety questionnaire go up by 1.5.

10
Q

What is a Standard Error in relation to Regression Coefficients?

A

-Regression Coefficients always come with a SE

-This is how precise your estimate (regression coefficient) is

-Big SE=not precise
-Small SE=it’s precise

-A big regression coefficient relative to its SE=significant effect

-Indeed the p value is based on the ratio of the regression coefficient to its SE (the t statistic)
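
A small sketch of that ratio, using the hypothetical coefficient of 1.5 with two made-up standard errors. The function name `t_statistic` is an assumption for illustration. Turning t into a p value needs the t distribution with the model's residual degrees of freedom, which is left out here.

```python
def t_statistic(b, se):
    """t value for a regression coefficient: the estimate divided by its SE."""
    return b / se

# Same hypothetical coefficient (1.5), two different levels of precision
t_precise = t_statistic(1.5, 0.3)    # small SE -> large t
t_imprecise = t_statistic(1.5, 1.2)  # big SE -> small t
print(round(t_precise, 2), round(t_imprecise, 2))
```

As a rough guide, with a reasonably large sample an absolute t value above about 2 goes with p < .05, so only the precise estimate would count as significant here.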

11
Q

What are standardised regression coefficients?

A

-β (beta) values: regression coefficients expressed in SD units

-You cannot directly compare unstandardised regression coefficients and say that because one coefficient is bigger it is a bigger effect

-This is because they are expressed in unit changes and the IVs are likely to be measured on different scales

-Standardised coefficients put every IV on the same (SD) scale, so they can be compared

12
Q

How are standardised regression coefficients interpreted?

A

-For every one SD change in the IV, the DV changes by the number of SDs that the standardised regression coefficient indicates

For example:
-β =0.5 means for every one standard deviation increase in the IV the DV increases by 0.5 standard deviations

-β =-0.2 means for every one standard deviation increase in the IV the DV decreases by 0.2 standard deviations
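
One way to see where β comes from is the usual conversion β = b × SD(IV) / SD(DV). The sketch below uses made-up scores, and b = 0.6 is the least-squares slope for exactly these values. The function name `standardised_beta` is an assumption for illustration.

```python
from statistics import stdev

def standardised_beta(b, x, y):
    """Convert an unstandardised coefficient b into a beta (SD units)."""
    return b * stdev(x) / stdev(y)

# Hypothetical questionnaire scores, for illustration only
stress = [1, 2, 3, 4, 5]
anxiety = [2, 4, 5, 4, 5]
b = 0.6  # unstandardised slope for these made-up data
beta = standardised_beta(b, stress, anxiety)
print(round(beta, 3))
```

With a single predictor the resulting β equals the correlation between the two variables, which is a handy sanity check.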

13
Q

What is a stepwise regression?

A

-A data mining method which relies on statistical significance to choose the variables to be included in a final model

-At each step a variable is entered and the t statistic for an effect is produced (forwards, backwards, stepwise/bidirectional)

-Walter and Tiemeier (2009): In 4 leading epidemiologic journals they found that 20% of the articles published in 2008 used stepwise regression.

14
Q

What are some problems with stepwise regression that Frank Harrell critiqued?

A

-The R^2 statistics are inflated (in the sense that the data have been searched for the best-case model)

-The F tests do not have the claimed distribution

-Coefficients for retained variables are inflated (makes them look like better predictors than they are)

-Standard errors are deflated

-Falsely narrow confidence intervals (See Altman and Andersen, 1989)

-It yields p-values that do not have the proper meaning, and properly correcting them is a hard problem (i.e., we don’t know how many variables have been checked)

-It has severe problems in the presence of collinearity

-Often, researchers fail to test the model in new data (and often it doesn’t fit)

-It allows us to not think about the problem: “The data analyst knows more than the computer…failure to use that knowledge produces inadequate data analysis.” (Henderson and Velleman, 1981)

-The statistical tests used are intended to be used to test pre-specified hypotheses

16
Q

What is the difference between a simple/multiple regression and a hierarchical regression?

A

-Simple and multiple regression give us model fit and an R^2 which accounts for all the predictors in our model

-Simple/Multiple regression is a simultaneous model (i.e., all variables are chucked together)

-In hierarchical models, predictors are entered in steps following some strategy (or specified hierarchy) which is dictated in advance by the purpose / logic of the research

17
Q

Why are Hierarchical models dictated in advance?

A

-To become more theory driven in our statistical analysis

-Allows us to adjust for variables

-Partitions our explained variance (i.e., we break it into separate parts to have a richer understanding of our data)

18
Q

What are nuisance variables?

A

Variables that we are not directly interested in but want to adjust for e.g., basic demographics

18
Q

How do we distinguish the steps?

A

-There will be a group of variables that we wish to look at as a distinct set of predictors

-Often people have variables they wish to adjust for in step one e.g., age and gender

-You may wish to put questionnaire measures in one step and behavioural measures in others

-Known predictors (nuisance variables e.g., age) before hypothesised predictors

19
Q

What is R^2-change?

A

-Essentially this analysis tells us the amount of variance that the first block predicts, and then the additional amount of variance that each subsequent block predicts

-It is how much the R^2 changes when we add a new set of predictors

-It is this concept of additional variance that matters: the variance block one has accounted for is ‘removed’ and the next block(s) can only predict residual variance (that’s not already accounted for by earlier blocks)

-If you add up the R^2 changes it gives the model R^2 (the total variance explained by all the variables together). (Note there is no adjusted R^2 change statistic)
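
A minimal sketch of the bookkeeping, assuming we already have the model R^2 after each block from our output. The three R^2 values and the function name `r2_changes` are hypothetical.

```python
def r2_changes(cumulative_r2):
    """Given the model R^2 after each block, return the R^2 change per block."""
    changes = []
    previous = 0.0
    for r2 in cumulative_r2:
        changes.append(r2 - previous)  # variance added by this block
        previous = r2
    return changes

# Hypothetical output: model R^2 after block 1, block 2 and block 3
model_r2 = [0.10, 0.25, 0.31]
changes = r2_changes(model_r2)
print([round(c, 2) for c in changes], round(sum(changes), 2))
```

The changes (.10, .15 and .06 here) sum back to the final model R^2 of .31.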

20
Q

What is F-change?

A

-It’s an F test that enables you to ascertain if the additional amount of variance your block predicts is statistically significant

-It tells you whether the additional variance in your R^2 change is actually significant

-It is reported like any other F statistic, with 2 degrees of freedom values, e.g., F(df1, df2)
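
The F-change value can be sketched with the standard formula for comparing nested models: the R^2 change divided by the number of predictors added, over the unexplained variance divided by the residual degrees of freedom. All numbers below are hypothetical, and the function name `f_change` is an assumption for illustration.

```python
def f_change(r2_old, r2_new, n, k_new, m):
    """F test for the R^2 change when a block of m predictors is added.

    n is the number of cases and k_new is the total number of predictors
    in the model after the block has been added.
    """
    df1 = m
    df2 = n - k_new - 1
    f = ((r2_new - r2_old) / df1) / ((1 - r2_new) / df2)
    return f, df1, df2

# Hypothetical: block 2 adds 2 predictors to a 2-predictor model, n = 100
f, df1, df2 = f_change(0.10, 0.25, n=100, k_new=4, m=2)
print(f"F({df1}, {df2}) = {f:.2f}")
```

This hypothetical block would be reported as F(2, 95) = 9.50, with the p value looked up from the F distribution on those two degrees of freedom.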

21
Q

How do we choose a block?

A

-Groups of variables that we wish to look at as a distinct set of predictors

-Often people have variables they wish to control for in block one e.g., age and sex

-You may wish to put questionnaire measures in one block and behavioural measures in another

There’s no rule about what goes in a block:
-Blocks are used because of a theoretical rationale
-You are interested in the contribution of certain groups of variables to the explained variance so they go in blocks together
-For example, you may hypothesise that after controlling for age, gender and stress, job satisfaction predicts a significant amount of variance in work absence

-As long as it’s not crazy complicated e.g., 10 steps, then it is completely up to you

22
Q

What do the R^2 and F statistic for the full model tell us?

A

-This is the full variance and whether it is significant (as if there were no blocks at all)

-The R^2 changes will add up to the R^2 for the full model

23
Q

What is meant by the simultaneous portions of the model?

A

It refers to the coefficients for all the variables in the regression equation when they are all considered together

24
Q

What is meant by the cumulative portions of the model?

A

It refers to the R^2 change and F change values for the blocks of the model. These tell you how much additional variance each block adds to the model and whether this additional variance is significant

25
Q

How can we check for outliers, leverage and influential cases?

A

-Outliers and influential cases can have a considerable effect on our regression parameters

-We can check outliers using box plots

26
Q

What’s leverage?

A

-Tells us about extreme data points on the X variable (aka predictor)

-The distance between Xi and all other X data points

-Between 0 and 1 (lower = better)

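-Sum of the leverage values = the number of predictors in the model (this holds for centred leverage values; raw hat values sum to the number of predictors + 1 because the intercept counts too)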
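
For a regression with one predictor, leverage can be sketched directly from the X values. The data are made up, with one deliberately extreme X score, and the function name `leverage` is an assumption for illustration. Both the raw hat values and the centred version (raw minus 1/n) are returned, since software differs in which one it reports.

```python
def leverage(x):
    """Raw hat values and centred leverage for a one-predictor regression."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    # Centred leverage: squared distance from the mean of X, scaled by Sxx
    centred = [(xi - mean_x) ** 2 / sxx for xi in x]
    # Raw hat values add 1/n for the intercept
    raw = [1 / n + c for c in centred]
    return raw, centred

# Hypothetical predictor scores; the last case is extreme on X
x = [1, 2, 3, 4, 20]
raw, centred = leverage(x)
print([round(c, 3) for c in centred], round(sum(centred), 2), round(sum(raw), 2))
```

The extreme case takes most of the leverage. The centred values sum to 1 (one predictor) and the raw values sum to 2 (one predictor plus the intercept).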

27
Q

What’s Cook’s Distance?

A

-If any cases have large residuals, they may distort the accuracy of a regression

-Cook’s distance tells us how much the predicted Y values will move on average if the data point is removed (a lot of change indicates a large amount of influence). It also gives a sense of how much more accurate our regression slope would be with an influential case removed (especially if its influence is huge)

-As with VIF, there is no accepted cut off. Different ‘cut offs’ in use:

> 1

> 4/n

> 4/(n – k – 1)

> 3 times larger than the mean Cook’s distance

n=number of participants
k=number of independent variables

-Let’s use 4/(n – k – 1) as our cut off (= 0.043)

-So, any Cook’s distance values greater than this mean we have an influential case
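
A minimal sketch of Cook's distance with the 4/(n - k - 1) cut off, for a regression with one predictor (k = 1). The data are hypothetical, with the last case given a wildly high outcome score, and the function name `cooks_distance` is an assumption for illustration. With more predictors a statistics package would normally do this calculation.

```python
def cooks_distance(x, y):
    """Cook's distance per case for y = a + b*x, plus the 4/(n-k-1) cut off."""
    n = len(x)
    k = 1          # number of predictors
    p = k + 1      # parameters: slope + intercept
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    hat = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]  # raw leverage
    distances = [(e ** 2 / (p * mse)) * h / (1 - h) ** 2
                 for e, h in zip(residuals, hat)]
    cutoff = 4 / (n - k - 1)
    flagged = [i for i, d in enumerate(distances) if d > cutoff]
    return distances, cutoff, flagged

# Hypothetical data: the last case has a far higher outcome than the trend
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
distances, cutoff, flagged = cooks_distance(x, y)
print(cutoff, flagged)
```

With n = 10 and k = 1 the cut off is 0.5, and only the last case (index 9) exceeds it.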