Week 3 (regressions and control) Flashcards

1
Q

what is regression to the mean

A
  • if one sample of a random variable is extreme, the next sample of the same random variable is likely to be less extreme
  • this is because an extreme score is partly due to chance, and chance is unlikely to push the score as far in the same direction the next time it is measured, so scores tend to drift back towards the mean
  • this can make it hard to tell whether an intervention is effective or whether the results are just regression to the mean
  • interventions should therefore be compared with control groups that don't receive the intervention
2
Q

what is a multiple linear regression?

A

a linear regression with multiple predictor variables

3
Q

what is the formula for a multiple linear regression

A
  • the same as for simple linear regression, except that every predictor you include gets its own coefficient
  • therefore Y = the intercept + each predictor variable multiplied by its coefficient

Y = b0 + b1X1 + … + bnXn + e
Y = predicted value of the outcome (dependent) variable
b0 = y-intercept, the value of Y when all predictors are set to 0
b1 = regression coefficient of the first predictor X1
bn = regression coefficient of the last predictor Xn
e = the error (residual) term, the variation the model does not explain
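A minimal sketch of fitting such a model in R with lm(); the data frame df and the column names y, x1 and x2 are hypothetical:

# fit y = b0 + b1*x1 + b2*x2 + e on a hypothetical data frame df
fit <- lm(y ~ x1 + x2, data = df)
summary(fit)   # the Estimate column holds b0, b1 and b2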

4
Q

what is SSt when there are multiple variables?

A

SSt represents the total sum of squared differences between the observed values and the mean value of the outcome variable

5
Q

what is SSr when there are multiple variables?

A

SSr still represents the sum of squared differences between the values of Y predicted by the model and the observed values (the residual sum of squares)

6
Q

what is SSm when there are multiple variables?

A

SSm represents the sum of squared differences between the values of Y predicted by the model and the mean value of the outcome (the improvement due to the model)

7
Q

what is multiple R^2?

A

the square of the correlation between the observed values of Y and the values of Y predicted by the multiple regression model

therefore large values of multiple R^2 represent a large correlation between the predicted and observed values of the outcome

a value of 1 would mean the model fits the data perfectly
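This can be verified directly in R (fit and df are the hypothetical model and data frame from above):

summary(fit)$r.squared     # multiple R^2 as reported by lm()
cor(df$y, fitted(fit))^2   # the same value, computed as a squared correlation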

8
Q

do models with more or fewer variables tend to have a larger R^2?

A

models with more variables have a larger R^2: adding a predictor can never decrease R^2, it can only increase it or leave it unchanged

9
Q

what is the Akaike information criterion (AIC)?

A
  • the problem with multiple R^2 is that the more variables a model has, the larger its R^2 value is
  • AIC is a measure of fit which penalizes the model for having more variables
  • a larger AIC value indicates a worse fit, corrected for the number of variables
  • it only makes sense to compare AIC values between models of the same data, as AIC is relative, not absolute
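In R, AIC() can be called on fitted lm models; the two hypothetical models below differ only in the x2 predictor:

fit1 <- lm(y ~ x1, data = df)
fit2 <- lm(y ~ x1 + x2, data = df)
AIC(fit1)
AIC(fit2)   # the lower AIC wins, after the penalty for the extra predictor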
10
Q

what is hierarchical regression?

A
  • predictors are chosen based on past work, and the experimenter decides the order in which predictors are entered into the model
  • they are entered in order of importance in predicting the outcome
11
Q

what is forced entry regression?

A
  • all predictors are entered into the model simultaneously
  • some believe this is the only appropriate technique for theory testing

12
Q

what is stepwise regression?

A
  • decisions about the order in which predictors are entered into the model are based on a purely mathematical criterion
13
Q

how does R carry out a forward stepwise regression?

A
  • an initial model is defined that contains only the constant
  • the computer then picks the variable that has the biggest simple correlation with the outcome and calculates how much variation this variable explains
  • the model then searches for a second predictor that can explain the biggest part of the remaining variance
  • this gives a measure of how much 'new' variance in the outcome can be explained by each remaining predictor
  • the model always picks next the variable that can explain the largest amount of the remaining variance
  • R has to decide when to stop adding predictors to the model, and it does this based on the AIC criterion
  • variables are only added to the model if they lower the AIC, and when no variable can lower the AIC further, the process stops
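A sketch of this procedure using R's step() function; the data frame df and the predictors x1 to x3 are hypothetical:

null <- lm(y ~ 1, data = df)              # constant-only starting model
full <- lm(y ~ x1 + x2 + x3, data = df)   # largest model considered
step(null, scope = formula(full), direction = "forward")   # adds predictors while AIC keeps falling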

14
Q

how are backward stepwise and 'both' (bidirectional) stepwise regressions different from a forward stepwise regression?

A
  • the forward model adds predictor variables until none can lower the AIC any further
  • the backward model begins by placing all predictors in the model and then removes them, looking to see if the AIC goes down when each variable is removed; this continues until removing any variable causes the AIC to increase
  • the 'both' model goes in both directions: each time a predictor is added to the model, a removal test is made of the least useful predictor
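The matching step() calls, using the same hypothetical null and full models as above:

step(full, direction = "backward")                      # start full, drop terms while AIC falls
step(null, scope = formula(full), direction = "both")   # add terms, re-testing each for removal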
15
Q

why is backward stepwise regression preferable to forward stepwise regression?

A
  • because of suppressor effects
  • these occur when a predictor has an effect, but only when another variable is held constant
  • forward selection is more likely than backward selection to exclude predictors involved in suppressor effects, and so to miss a real predictor
16
Q

what is an all subsets regression?

A
  • all subsets regression tries every combination of variables to see which one gives the best fit
  • the issue with this is that as the number of predictor variables increases, the number of possible combinations increases exponentially
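The leaps package implements this search in R (assuming leaps is installed; df and the predictors are the same hypothetical names as above):

library(leaps)
all_subs <- regsubsets(y ~ x1 + x2 + x3, data = df)   # exhaustive search over subsets
summary(all_subs)                                     # the best model of each size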
17
Q

what are the cons of letting the computer choose predictors automatically in a stepwise regression?

A
  • the computer makes decisions about order based on small statistical differences between variables, which may be at odds with their theoretical importance
  • there is a danger of overfitting or underfitting
  • therefore, if you do run a stepwise regression, it is advisable to cross-validate your model by splitting the data
18
Q

what are the guidelines for choosing the order of variables?

A
  • base your model on what past research tells you
  • include meaningful variables in their order of importance
  • after this initial analysis, repeat the regression but exclude any variables that were statistically redundant the first time
  • try not to include too many predictors; in general, the fewer the better
  • only include predictors that have a decent theoretical grounding
19
Q

how would you include a binary categorical variable in a multiple linear regression?

A
  • code the condition as either 0 or 1
  • so if the variable is treatment vs. no treatment, treatment would be coded as 1 and no treatment as 0
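In R the dummy variable can be created explicitly (the column names condition and treatment are hypothetical):

df$treatment <- ifelse(df$condition == "treatment", 1, 0)   # 0/1 dummy coding
fit <- lm(y ~ treatment + x1, data = df)   # the treatment coefficient is the group difference, adjusted for x1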

20
Q

what are the assumptions of multiple linear regression?

A
  • predictor variables can be quantitative or categorical
  • the outcome variable must be quantitative, continuous and unbounded
  • the quantitative variables must be measured at the interval level
  • the predictors must have some variation in value, so that their variance is not zero
  • there must be no perfect multicollinearity: predictor variables should not correlate too highly with each other
  • there should be no external variable that correlates with the variables included in the model
  • the residuals of the model should be normally distributed with a mean of zero
  • linearity, meaning the relationship being modelled follows a straight line
21
Q

how to assess the fit of a model by looking at residuals?

A
  • 99.9% of standardized residuals should lie between -3.29 and +3.29, so any standardized residual greater than ±3.29 is likely to be an outlier
  • 99% of values should lie between -2.58 and +2.58, so if more than 1% of standardized residuals are outside this range, there is cause for concern
  • 95% should lie between -1.96 and +1.96, so if more than 5% of standardized residuals are outside this range, the model is not likely to be a good fit
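These checks are easy to run in R with rstandard() (fit is the hypothetical lm model used throughout):

z <- rstandard(fit)    # standardized residuals
mean(abs(z) > 1.96)    # should be no more than about 0.05
mean(abs(z) > 2.58)    # should be no more than about 0.01
which(abs(z) > 3.29)   # probable outliers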
22
Q

how to assess for outliers by looking at adjusted predicted values

A
  • we can check whether, if we deleted a certain case, we would obtain different regression coefficients; if so, that case is likely to be an outlier
  • the adjusted predicted value is the predicted value for a case when that case is excluded from the analysis: essentially the computer calculates a new model without the case and uses that model to predict what the case's value should be
  • the difference between the adjusted predicted value and the original predicted value is known as the DFFit
  • alternatively, we can look at the residual between the adjusted predicted value and the observed value; dividing this by the standard error gives a standardized value called the studentized residual. because it is standardized it can be compared across different regression analyses, and it tends to follow Student's t-distribution
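Both quantities have built-in R functions (same hypothetical fit):

dffits(fit)     # DFFit for each case
rstudent(fit)   # studentized residuals, comparable across analyses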
23
Q

how to consider the effect of outliers by looking at Cook's distance?

A
  • Cook's distance is a measure of the overall influence of a single case on the model
  • it is suggested that values greater than 1 may be a cause for concern
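In R (hypothetical fit):

d <- cooks.distance(fit)   # one value per case
which(d > 1)               # cases with enough influence to be a concern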
24
Q

how to assess the influence of outliers by looking at hat values?

A
  • hat values (also called leverage) gauge the influence of the observed value of the outcome variable over the predicted values
  • the average hat value is defined as (k+1)/n
  • k is the number of predictors and n is the number of participants
  • the values range from 0 to 1: 0 means no influence whatsoever and 1 means complete influence
  • if no cases are outliers, we would expect all the hat/leverage values to be near the average value
  • any case with a leverage value greater than twice the average is worth investigating
  • any case with a leverage value greater than 3 times the average should be excluded
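In R (hypothetical fit):

h <- hatvalues(fit)           # leverage for each case
k <- length(coef(fit)) - 1    # number of predictors
n <- nobs(fit)                # number of cases
avg <- (k + 1) / n            # average hat value
which(h > 2 * avg)            # worth investigating
which(h > 3 * avg)            # candidates for exclusion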
25
Q

what is the name of the difference between a regression parameter when all cases are included and a regression parameter when one case is excluded?

A
  • the DFBeta
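R computes these with dfbeta() (hypothetical fit):

dfbeta(fit)   # change in each coefficient when each case in turn is deleted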
26
Q

what assumptions should we check on a regression analysis?

A

VARIABLES

  • all predictor variables must be quantitative or categorical
  • the outcome variable must be quantitative, continuous and unbounded

NON ZERO VARIANCE
-the predictors should have some variation in value (should not have a variance of 0)

NO PERFECT MULTICOLLINEARITY
- There should be no perfect linear relationship between two or more of the predictors

PREDICTORS ARE UNCORRELATED WITH EXTERNAL VARIABLES

HOMOSCEDASTICITY
- at each level of the predictor variables the variance of the residual terms should be constant

INDEPENDENT ERRORS
-for any two observations the residual terms should be uncorrelated

NORMALLY DISTRIBUTED ERRORS
- the residuals of the model are random, normally distributed and with a mean of 0

INDEPENDENCE
- all values of the outcome variable are independent

LINEARITY
- the mean values of the outcome variable are along a straight line
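Several of these checks can be eyeballed at once with base R's diagnostic plots (hypothetical fit):

par(mfrow = c(2, 2))   # show all four plots in one window
plot(fit)              # residuals vs fitted (linearity, homoscedasticity), Q-Q plot (normal errors), leverage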

27
Q

what does it mean when all the assumptions of a regression are met?

A
  • on average the regression model from the sample is the same as the population model
28
Q

what is cross validation of a model?

A

assessing the accuracy of a model across different samples

29
Q

what are two ways of cross validating a model?

A

Adjusted R squared

Data splitting

30
Q

How to use data splitting to cross validate a model?

A

randomly split the dataset

create a regression model for each half and compare the two models
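A minimal sketch in R (hypothetical df):

idx <- sample(nrow(df), floor(nrow(df) / 2))   # random half of the row indices
fit_a <- lm(y ~ x1 + x2, data = df[idx, ])     # model on one half
fit_b <- lm(y ~ x1 + x2, data = df[-idx, ])    # model on the other half
# compare the coefficients and R^2 of fit_a and fit_b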

31
Q

what is the general rule of thumb for sample size in regression?

A
  • you should have 10 or 15 cases of data for each predictor in the model
32
Q

what are three problems associated with collinearity in regressions?

A
  1. Untrustworthy bs
    - as collinearity increases, so does the standard error of the b coefficients
    - this means the bs are more variable across samples and so are less likely to represent the population
  2. It limits the size of R
    - if two predictors are highly correlated, once one has increased the value of R, the second cannot add much, because most of the variance it would explain has already been accounted for by the first
  3. Importance of predictors
    - multicollinearity between predictors makes it difficult to assess the individual importance of a predictor
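A standard collinearity check is the variance inflation factor from the car package (assuming car is installed; hypothetical fit):

library(car)
vif(fit)       # VIF above 10 is a common cause for concern
1 / vif(fit)   # tolerance; values below 0.1 indicate a serious problem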
33
Q

what is an interaction?

A
  • when the effect of one variable differs according to the value of another variable
34
Q

how to quantify the difference between two linear models (e.g. with and without an interaction) to identify which one is better?

A
  • we can use an analysis of variance (ANOVA) to compare the goodness of fit of the two models
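In R, two nested models are compared with anova() (hypothetical models; y ~ x1 * x2 adds the interaction term):

fit_main <- lm(y ~ x1 + x2, data = df)   # main effects only
fit_int  <- lm(y ~ x1 * x2, data = df)   # main effects plus interaction
anova(fit_main, fit_int)                 # a significant F means the interaction improves the fit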