Week 3 (regressions and control) Flashcards

1
Q

what is regression to the mean

A
  • if one sample of a random variable is extreme, the next sample of the same random variable is likely to be less extreme
  • this is because an extreme score is partly due to chance, and chance is unlikely to push the score as far in the same direction the next time it is measured, so scores tend to drift back towards the mean
  • this can make it hard to tell whether an intervention is effective or whether the results are just regression to the mean
  • interventions should therefore be compared with control groups that don't receive the intervention
2
Q

what is a multiple linear regression?

A

a linear regression with multiple predictor variables

3
Q

what is the formula for a multiple linear regression

A
  • the same as for simple linear regression, except that every predictor you include gets its own coefficient
  • therefore Y = the intercept + each predictor variable multiplied by its coefficient

Y = b0 + b1X1 + … + bnXn + e
Y = predicted value of the outcome (dependent) variable
b0 = y-intercept, the value of Y when all predictors are set to 0
b1 = regression coefficient of the first predictor X1
bn = regression coefficient of the last predictor Xn
e = the error (residual) term, the variation the model does not explain
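A minimal sketch of fitting such a model in R with lm(); the data frame df and the column names y, x1 and x2 are hypothetical:

# fit y = b0 + b1*x1 + b2*x2 + e on a hypothetical data frame df
fit <- lm(y ~ x1 + x2, data = df)
summary(fit)   # the Estimate column holds b0, b1 and b2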

4
Q

what is SSt when there are multiple variables?

A

SSt represents the total sum of squared differences between the observed values and the mean value of the outcome variable

5
Q

what is SSr when there are multiple variables?

A

SSr still represents the sum of squared differences between the values of Y predicted by the model and the observed values (the residual sum of squares)

6
Q

what is SSm when there are multiple variables?

A

SSm represents the sum of squared differences between the values of Y predicted by the model and the mean value of the outcome (the improvement due to the model)

7
Q

what is multiple R^2?

A

the square of the correlation between the observed values of Y and the values of Y predicted by the multiple regression model

therefore large values of multiple R^2 represent a large correlation between the predicted and observed values of the outcome

a value of 1 would mean the model fits the data perfectly
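This can be verified directly in R (fit and df are the hypothetical model and data frame from above):

summary(fit)$r.squared     # multiple R^2 as reported by lm()
cor(df$y, fitted(fit))^2   # the same value, computed as a squared correlation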

8
Q

do models with more or fewer variables tend to have a larger R^2?

A

models with more variables have a larger R^2: adding a predictor can never decrease R^2, it can only increase it or leave it unchanged

9
Q

what is the Akaike information criterion (AIC)?

A
  • the problem with multiple R^2 is that the more variables a model has, the larger its R^2 value is
  • AIC is a measure of fit which penalizes the model for having more variables
  • a larger AIC value indicates a worse fit, corrected for the number of variables
  • it only makes sense to compare AIC values between models of the same data, as AIC is relative, not absolute
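In R, AIC() can be called on fitted lm models; the two hypothetical models below differ only in the x2 predictor:

fit1 <- lm(y ~ x1, data = df)
fit2 <- lm(y ~ x1 + x2, data = df)
AIC(fit1)
AIC(fit2)   # the lower AIC wins, after the penalty for the extra predictor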
10
Q

what is hierarchical regression?

A
  • predictors are chosen based on past work, and the experimenter decides the order in which predictors are entered into the model
  • they are entered in order of importance in predicting the outcome
11
Q

what is forced entry regression?

A
  • all predictors are entered into the model simultaneously
  • some believe this is the only appropriate technique for theory testing

12
Q

what is stepwise regression?

A
  • decisions about the order in which predictors are entered into the model are based on a purely mathematical criterion
13
Q

how does R carry out a forward stepwise regression?

A
  • an initial model is defined that contains only the constant
  • the computer then picks the variable that has the biggest simple correlation with the outcome and calculates how much variation this variable explains
  • the model then searches for a second predictor that can explain the biggest part of the remaining variance
  • this gives a measure of how much 'new' variance in the outcome can be explained by each remaining predictor
  • the model always picks next the variable that can explain the largest amount of the remaining variance
  • R has to decide when to stop adding predictors to the model, and it does this based on the AIC criterion
  • variables are only added to the model if they lower the AIC, and when no variable can lower the AIC further, the process stops
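A sketch of this procedure using R's step() function; the data frame df and the predictors x1 to x3 are hypothetical:

null <- lm(y ~ 1, data = df)              # constant-only starting model
full <- lm(y ~ x1 + x2 + x3, data = df)   # largest model considered
step(null, scope = formula(full), direction = "forward")   # adds predictors while AIC keeps falling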

14
Q

how are backward stepwise and 'both' (bidirectional) stepwise regressions different from a forward stepwise regression?

A
  • the forward model adds predictor variables until none can lower the AIC any further
  • the backward model begins by placing all predictors in the model and then removes them, looking to see if the AIC goes down when each variable is removed; this continues until removing any variable causes the AIC to increase
  • the 'both' model goes in both directions: each time a predictor is added to the model, a removal test is made of the least useful predictor
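The matching step() calls, using the same hypothetical null and full models as above:

step(full, direction = "backward")                      # start full, drop terms while AIC falls
step(null, scope = formula(full), direction = "both")   # add terms, re-testing each for removal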
15
Q

why is backward stepwise regression preferable to forward stepwise regression?

A
  • because of suppressor effects
  • these occur when a predictor has an effect, but only when another variable is held constant
  • forward selection is more likely than backward selection to exclude predictors involved in suppressor effects, and so to miss a real predictor
16
Q

what is an all subsets regression?

A
  • all subsets regression tries every combination of variables to see which one gives the best fit
  • the issue with this is that as the number of predictor variables increases, the number of possible combinations increases exponentially
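The leaps package implements this search in R (assuming leaps is installed; df and the predictors are the same hypothetical names as above):

library(leaps)
all_subs <- regsubsets(y ~ x1 + x2 + x3, data = df)   # exhaustive search over subsets
summary(all_subs)                                     # the best model of each size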
17
Q

what are the cons of letting the computer choose predictors automatically in a stepwise regression?

A
  • the computer makes decisions about order based on small statistical differences between variables, which may be at odds with their theoretical importance
  • there is a danger of overfitting or underfitting
  • therefore, if you do run a stepwise regression, it is advisable to cross-validate your model by splitting the data
18
Q

what are the guidelines for choosing the order of variables?

A
  • base your model on what past research tells you
  • include meaningful variables in their order of importance
  • after this initial analysis, repeat the regression but exclude any variables that were statistically redundant the first time
  • try not to include too many predictors; in general, the fewer the better
  • only include predictors that have a decent theoretical grounding
19
Q

how would you include a binary categorical variable in a multiple linear regression?

A
  • code the condition as either 0 or 1
  • so if the variable is treatment vs. no treatment, treatment would be coded as 1 and no treatment as 0
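In R the dummy variable can be created explicitly (the column names condition and treatment are hypothetical):

df$treatment <- ifelse(df$condition == "treatment", 1, 0)   # 0/1 dummy coding
fit <- lm(y ~ treatment + x1, data = df)   # the treatment coefficient is the group difference, adjusted for x1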

20
Q

what are the assumptions of multiple linear regression?

A
  • predictor variables can be quantitative or categorical
  • the outcome variable must be quantitative, continuous and unbounded
  • the quantitative variables must be measured at the interval level
  • the predictors must have some variation in value, so that their variance is not zero
  • there must be no perfect multicollinearity: predictor variables should not correlate too highly with each other
  • there should be no external variable that correlates with the variables included in the model
  • the residuals of the model should be normally distributed with a mean of zero
  • linearity, meaning the relationship being modelled follows a straight line
21
Q

how to assess the fit of a model by looking at residuals?

A
  • 99.9% of standardized residuals should lie between -3.29 and +3.29, so any standardized residual greater than ±3.29 is likely to be an outlier
  • 99% of values should lie between -2.58 and +2.58, so if more than 1% of standardized residuals are outside this range, there is cause for concern
  • 95% should lie between -1.96 and +1.96, so if more than 5% of standardized residuals are outside this range, the model is not likely to be a good fit
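These checks are easy to run in R with rstandard() (fit is the hypothetical lm model used throughout):

z <- rstandard(fit)    # standardized residuals
mean(abs(z) > 1.96)    # should be no more than about 0.05
mean(abs(z) > 2.58)    # should be no more than about 0.01
which(abs(z) > 3.29)   # probable outliers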
22
Q

how to assess for outliers by looking at adjusted predicted values

A
  • we can check whether, if we deleted a certain case, we would obtain different regression coefficients; if so, that case is likely to be an outlier
  • the adjusted predicted value is the predicted value for a case when that case is excluded from the analysis: essentially the computer calculates a new model without the case and uses that model to predict what the case's value should be
  • the difference between the adjusted predicted value and the original predicted value is known as the DFFit
  • alternatively, we can look at the residual between the adjusted predicted value and the observed value; dividing this by the standard error gives a standardized value called the studentized residual. because it is standardized it can be compared across different regression analyses, and it tends to follow Student's t-distribution
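Both quantities have built-in R functions (same hypothetical fit):

dffits(fit)     # DFFit for each case
rstudent(fit)   # studentized residuals, comparable across analyses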
23
Q

how to consider the effect of outliers by looking at Cook's distance?

A
  • Cook's distance is a measure of the overall influence of a single case on the model
  • it is suggested that values greater than 1 may be a cause for concern
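In R (hypothetical fit):

d <- cooks.distance(fit)   # one value per case
which(d > 1)               # cases with enough influence to be a concern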
24
Q

how to assess the influence of outliers by looking at hat values?

A
  • hat values (also called leverage) gauge the influence of the observed value of the outcome variable over the predicted values
  • the average hat value is defined as (k+1)/n
  • k is the number of predictors and n is the number of participants
  • the values range from 0 to 1: 0 means no influence whatsoever and 1 means complete influence
  • if no cases are outliers, we would expect all the hat/leverage values to be near the average value
  • any case with a leverage value greater than twice the average is worth investigating
  • any case with a leverage value greater than 3 times the average should be excluded
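In R (hypothetical fit):

h <- hatvalues(fit)           # leverage for each case
k <- length(coef(fit)) - 1    # number of predictors
n <- nobs(fit)                # number of cases
avg <- (k + 1) / n            # average hat value
which(h > 2 * avg)            # worth investigating
which(h > 3 * avg)            # candidates for exclusion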
25
Q

what is the name of the difference between a regression parameter when all cases are included and a regression parameter when one case is excluded?

A
  • the DFBeta
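R computes these with dfbeta() (hypothetical fit):

dfbeta(fit)   # change in each coefficient when each case in turn is deleted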
26
Q

what assumptions should we check on a regression analysis?

A

VARIABLES

  • all predictor variables must be quantitative or categorical
  • the outcome variable must be quantitative, continuous and unbounded

NON ZERO VARIANCE
-the predictors should have some variation in value (should not have a variance of 0)

NO PERFECT MULTICOLLINEARITY
- There should be no perfect linear relationship between two or more of the predictors

PREDICTORS ARE UNCORRELATED WITH EXTERNAL VARIABLES

HOMOSCEDASTICITY
- at each level of the predictor variables the variance of the residual terms should be constant

INDEPENDENT ERRORS
-for any two observations the residual terms should be uncorrelated

NORMALLY DISTRIBUTED ERRORS
- the residuals of the model are random, normally distributed and with a mean of 0

INDEPENDENCE
- all values of the outcome variable are independent

LINEARITY
- the mean values of the outcome variable are along a straight line
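Several of these checks can be eyeballed at once with base R's diagnostic plots (hypothetical fit):

par(mfrow = c(2, 2))   # show all four plots in one window
plot(fit)              # residuals vs fitted (linearity, homoscedasticity), Q-Q plot (normal errors), leverage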

27
Q

what does it mean when all the assumptions of a regression are met?

A
  • on average the regression model from the sample is the same as the population model
28
Q

what is cross validation of a model?

A

assessing the accuracy of a model across different samples

29
Q

what are two ways of cross validating a model?

A

Adjusted R squared

Data splitting

30
Q

How to use data splitting to cross validate a model?

A

randomly split the dataset

create a regression model for each half and compare the two models
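A minimal sketch in R (hypothetical df):

idx <- sample(nrow(df), floor(nrow(df) / 2))   # random half of the row indices
fit_a <- lm(y ~ x1 + x2, data = df[idx, ])     # model on one half
fit_b <- lm(y ~ x1 + x2, data = df[-idx, ])    # model on the other half
# compare the coefficients and R^2 of fit_a and fit_b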

31
Q

what is the general rule of thumb for sample size in regression?

A
  • you should have 10 or 15 cases of data for each predictor in the model
32
Q

what are three problems associated with collinearity in regressions?

A
  1. Untrustworthy bs
    - as collinearity increases, so does the standard error of the b coefficients
    - this means the bs are more variable across samples and so are less likely to represent the population
  2. It limits the size of R
    - if two predictors are highly correlated, once one has increased the value of R, the second cannot add much, because most of the variance it would explain has already been accounted for by the first
  3. Importance of predictors
    - multicollinearity between predictors makes it difficult to assess the individual importance of a predictor
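A standard collinearity check is the variance inflation factor from the car package (assuming car is installed; hypothetical fit):

library(car)
vif(fit)       # VIF above 10 is a common cause for concern
1 / vif(fit)   # tolerance; values below 0.1 indicate a serious problem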
33
Q

what is an interaction?

A
  • when the effect of one variable differs according to the value of another variable
34
Q

how to quantify the difference between two linear models (e.g. with and without an interaction) to identify which one is better?

A
  • we can use an analysis of variance (ANOVA) to compare the goodness of fit of the two models
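In R, two nested models are compared with anova() (hypothetical models; y ~ x1 * x2 adds the interaction term):

fit_main <- lm(y ~ x1 + x2, data = df)   # main effects only
fit_int  <- lm(y ~ x1 * x2, data = df)   # main effects plus interaction
anova(fit_main, fit_int)                 # a significant F means the interaction improves the fit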