Week 3-Hierarchical Regression Flashcards
What does a regression identify?
-It identifies whether there are significant associations between one or more predictor variables and an outcome variable
-It does this by essentially fitting a line of best fit to the association between the variables
What are the 2 key ingredients for any regression?
- Amount of variance the model explained (Adjusted R^2)
- Significance of individual predictors (Regression Coefficients) (Just because the overall model is significant doesn’t mean every individual predictor will be significant)
What is R^2?
-This tells us how much of the variance in our dependent variable is explained by our regression model
-The regression model refers to all the predictors considered together
How do you calculate R^2?
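R^2 = SSR/SST OR SSR/(SSE+SSR)
SSR=Regression sum of squares (variance explained)
SSE=Error sum of squares (unexplained variance)
SST=Total sum of squares (total variance), where SST=SSR+SSE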
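Worked example: a minimal sketch of the calculation, using made-up scores (the data and variable names are illustrative only, not from the lecture). The predictions come from a least-squares line, which is what makes SSR and SSE add up to SST.

```python
# Hypothetical scores, for illustration only.
x = [1, 2, 3, 4, 5]  # predictor (IV)
y = [2, 4, 5, 4, 5]  # outcome (DV)

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Least-squares line of best fit for a single predictor.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * xi for xi in x]

ssr = sum((p - mean_y) ** 2 for p in predicted)          # explained
sse = sum((yi - p) ** 2 for yi, p in zip(y, predicted))  # unexplained
sst = sum((yi - mean_y) ** 2 for yi in y)                # total

r_squared = ssr / sst
same_thing = ssr / (sse + ssr)  # the second route gives the same value
print(round(r_squared, 3), round(same_thing, 3))
```

For these made-up numbers both routes give 0.6, i.e. 60% of the variance is explained.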
What can the value of R^2 range from and what does it indicate?
-Ranges between 0 and 1, where the higher it is the more of the variance in the DV the regression model explains (often reported as a %)
E.g:
R^2 =.05 means 5% of variance is explained
R^2 =.21 means 21% of variance is explained
What is Adjusted R^2?
-Very similar to the R^2 statistic but a better one to report, as it stops dodgy researchers adding more variables just to make the model look better
-It is lower than R^2 (how much lower depends on the number of predictors and the sample size)
-It penalises R^2 for each predictor added to the model
-This stops people throwing in more variables purely to improve the fit of the model
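Worked example: the notes do not give the formula, so this sketch assumes the standard one, Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the sample size and p is the number of predictors. The function name and the numbers are illustrative.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for n observations and p predictors (standard formula)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Same R^2 and sample size, but more predictors means a bigger penalty.
few = adjusted_r_squared(0.21, n=100, p=2)
many = adjusted_r_squared(0.21, n=100, p=10)
print(round(few, 3), round(many, 3))
```

With R^2 = .21 and n = 100, the adjusted value is about .194 with 2 predictors but about .121 with 10.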
How are R^2 and Adjusted R^2 assessed?
-The significance of the model is assessed using an ANOVA (analysis of variance), which gives an F test
-Tells us whether the amount of variance explained is statistically significant
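Worked example: a sketch of how the ANOVA's F statistic can be obtained from R^2, assuming the usual formula F = (R^2/p) / ((1 - R^2)/(n - p - 1)) with p predictors and n participants. The function name and numbers are illustrative; the p value itself comes from the F distribution with (p, n - p - 1) degrees of freedom, which statistics software reports.

```python
def f_statistic(r_squared, n, p):
    """F statistic for the overall model: n observations, p predictors."""
    df_model = p
    df_error = n - p - 1
    explained_per_df = r_squared / df_model
    unexplained_per_df = (1 - r_squared) / df_error
    return explained_per_df / unexplained_per_df

# Illustrative values: R^2 = .21 from 2 predictors and 100 participants.
f_value = f_statistic(0.21, n=100, p=2)
print(round(f_value, 2))
```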
What is a regression coefficient?
-It tells us the direction and size of the association between each IV and the DV
-It can be positive (positive association) or negative (negative association) like a correlation
-Unlike correlation coefficients it does not range between -1 and 1
-It is a description of an IV-DV association in terms of unit changes
What does the regression coefficient number mean?
-It means how much the DV changes when the IV is increased by one unit (with any other predictors in the model held constant)
For example:
-I measure stress using a questionnaire (IV) and anxiety using a questionnaire (DV) and my regression coefficient is 1.5
-This would mean that for each increase of 1 on the stress questionnaire, scores on the anxiety questionnaire go up by 1.5.
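Worked example: the stress and anxiety case as a small sketch. The coefficient of 1.5 is from the card; the intercept of 10 and the function name are made up, since the notes do not give them.

```python
b = 1.5           # regression coefficient from the example above
intercept = 10.0  # hypothetical intercept, not given in the notes

def predict_anxiety(stress):
    """Predicted anxiety score for a given stress score."""
    return intercept + b * stress

# Two people whose stress scores differ by exactly one unit.
difference = predict_anxiety(21) - predict_anxiety(20)
print(difference)
```

Whatever the intercept is, the predicted anxiety scores differ by 1.5 for a one-unit difference in stress.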
What is a Standard Error in relation to Regression Coefficients?
-Regression Coefficients always come with a SE
-This is how precise your estimate (regression coefficient) is
-Big SE=not precise
-Small SE=it’s precise
-A regression coefficient that is big relative to its SE=significant effect
-Indeed the p value is based on the ratio of the regression coefficient to the SE (the t statistic)
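Worked example: a sketch of the coefficient-to-SE ratio, using the coefficient of 1.5 with two made-up standard errors. As a rough rule of thumb (an assumption that only holds for reasonably large samples), an absolute t value above about 2 corresponds to p < .05.

```python
def t_statistic(b, se):
    """Ratio of a regression coefficient to its standard error."""
    return b / se

# Same coefficient, different precision (standard errors are hypothetical).
precise = t_statistic(1.5, 0.3)    # small SE
imprecise = t_statistic(1.5, 1.2)  # big SE
print(round(precise, 2), round(imprecise, 2))
```

The same coefficient gives t = 5 with the small SE but only t = 1.25 with the big SE.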
What are Standardised Regression Coefficients?
-β (beta) values (regression coefficients expressed in SDs)
-You cannot directly compare unstandardised regression coefficients and say that one coefficient is bigger, therefore it is a bigger effect
-This is because they are expressed in unit changes and the IVs are likely to be measured on different scales
-Standardising puts every IV on the same scale, so the sizes of the effects can be compared
How are Standardised Regression Coefficients interpreted?
-For every one SD increase in the IV, the DV changes by the number of SDs given by the standardised regression coefficient
For example:
-β =0.5 means for every one standard deviation increase in the IV the DV increases by 0.5 standard deviations
-β =-0.2 means for every one standard deviation increase in the IV the DV decreases by 0.2 standard deviations
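Worked example: a sketch of converting an unstandardised coefficient into β, assuming the usual conversion β = b × (SD of the IV / SD of the DV). The scores are made up for illustration.

```python
import statistics

# Hypothetical scores, for illustration only.
x = [1, 2, 3, 4, 5]  # IV
y = [2, 4, 5, 4, 5]  # DV

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Unstandardised coefficient: change in the DV per one-unit change in the IV.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))

# Standardised coefficient: change in the DV (in SDs) per one-SD change in the IV.
beta = b * statistics.stdev(x) / statistics.stdev(y)
print(round(b, 3), round(beta, 3))
```

Here b = 0.6 in raw units, which becomes β of roughly 0.77 in SD units.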
What is a stepwise regression?
-A data mining method which relies on statistical significance to choose the variables to be included in a final model
-At each step a variable is entered or removed and the t statistic for its effect is produced (forwards, backwards, stepwise/bidirectional)
-Walter and Tiemeier (2009): In 4 leading epidemiologic journals they found that 20% of the articles published in 2008 used stepwise regression.
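Worked example: a rough sketch of the forwards version, assuming NumPy is available. The entry rule (absolute t of at least 2.0), the function names and the simulated data are all assumptions for illustration; real software usually enters and removes variables using p-value thresholds.

```python
import numpy as np

def t_stats(design, y):
    """OLS t statistics for each column of a design matrix (intercept included)."""
    n, k = design.shape
    xtx_inv = np.linalg.inv(design.T @ design)
    coefs = xtx_inv @ design.T @ y
    resid = y - design @ coefs
    sigma2 = resid @ resid / (n - k)           # residual variance
    se = np.sqrt(sigma2 * np.diag(xtx_inv))    # standard errors
    return coefs / se

def forward_select(X, y, t_enter=2.0):
    """At each step enter the candidate with the largest |t|; stop when none reaches t_enter."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining:
        best_j, best_t = None, 0.0
        for j in remaining:
            design = np.column_stack([np.ones(n), X[:, selected + [j]]])
            t_j = abs(t_stats(design, y)[-1])  # t for the newly entered variable
            if t_j > best_t:
                best_j, best_t = j, t_j
        if best_j is None or best_t < t_enter:
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Simulated data: only the first of five predictors is truly related to y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(size=200)
chosen = forward_select(X, y)
print(chosen)
```

The true predictor (column 0) should be entered first. Any other column that gets entered is pure noise that happened to pass the threshold by chance.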
What are some problems with stepwise regression that Frank Harrell critiqued?
-The R^2 statistics are inflated (the model is picked because it fits this sample best, so it shows a best-case fit)
-The F tests do not have the claimed distribution
-Coefficients for retained variables are inflated (makes them look like better predictors than they are)
-Standard errors are deflated
-Falsely narrow confidence intervals (See Altman and Andersen, 1989)
-It yields p-values that do not have the proper meaning, and properly correcting them is a difficult problem (i.e., we don’t know how many tests have been checked)
-It has severe problems in the presence of collinearity
-Often, researchers fail to test the model in new data (and often it doesn’t fit)
-It allows us to not think about the problem: “The data analyst knows more than the computer…failure to use that knowledge produces inadequate data analysis.” (Henderson and Velleman, 1981)
-The statistical tests used were intended for testing pre-specified hypotheses, not variables picked out by the data