Week 9 day 2 Flashcards
When do we do a multiple linear regression?
When we want to see how one numeric outcome variable is related to two or more numeric predictor variables at the same time.
What kind of analysis do we need to do if we have more than one numeric predictor variable and the outcome is also numeric?
Multiple regression.
If you have known predictor variables for a given outcome and you do not do a multiple regression, what are you risking?
You risk failing to take into account how those other predictor variables explain the variation in the outcome. In other words, you will end up misinterpreting how a given predictor is correlated with the outcome.
What are the models used to describe a correlation between:
1. one outcome variable and one predictor variable (continuous variables) - data appears to have a linear relationship.
2. one outcome variable and one predictor variable (continuous variables) - data appears to have a non-linear relationship, but is monotonic.
3. one outcome variable and two predictor variables (all numeric) and we don’t include an interaction term.
4. One outcome variable and two predictor variables (all numeric) and there is an interaction term.
- Pearson’s r correlation. - line of best fit for raw data.
- Spearman’s correlation. - line of best fit for ranked data.
- Multiple regression - plane of best fit.
- Multiple regression - curved plane of best fit.
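A minimal sketch of cases 3 and 4 in code, assuming Python with pandas and statsmodels and made-up column names y, x1, x2 (not from the course material):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: one numeric outcome y, two numeric predictors x1 and x2.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 2 + 0.5 * x1 + 1.5 * x2 + 0.8 * x1 * x2 + rng.normal(size=100)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# Case 3: plane of best fit, no interaction term.
additive = smf.ols("y ~ x1 + x2", data=df).fit()

# Case 4: curved plane of best fit; "x1 * x2" expands to x1 + x2 + x1:x2.
interaction = smf.ols("y ~ x1 * x2", data=df).fit()

print(additive.params)     # intercept plus one slope per predictor
print(interaction.params)  # also includes the x1:x2 interaction coefficient
```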
What do interactions tell us?
They tell us that in order to know what one predictor does to the outcome variable we need to know the value of the other predictor.
What is the null hypothesis for a regression analysis?
The null hypothesis in a regression is that there is no relationship between the outcome and the predictor and therefore y=intercept+error. In other words, it doesn’t matter what the value of the predictor variable is, the outcome will remain the same.
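In the same notation, the alternative (full) model with two predictors would be written as: y = intercept + slope1 × predictor1 + slope2 × predictor2 + error. The regression asks whether this model explains the data better than y = intercept + error.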
What is the test stat for regression?
F stat.
How is the F stat for regression calculated? What is it a ratio between?
The ratio between the model sum of squares and the residual sum of squares, each divided by its degrees of freedom (i.e. the ratio of the mean squares).
Model sum of squares is the sum of squared deviations between each fitted value (the model's prediction) and the mean of the outcome variable. (df = number of predictors)
Residual sum of squares is the sum of squared deviations between each data point and the line/plane of best fit. (df = N - number of predictors - 1)
The residuals capture what our model does NOT capture.
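A small self-contained sketch, using plain numpy and made-up numbers, of building the F stat from these two sums of squares for a one-predictor line of best fit:

```python
import numpy as np

# Made-up data: one predictor x and one outcome y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Line of best fit by least squares.
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x

ss_model = np.sum((fitted - y.mean()) ** 2)  # model SS: fitted values vs mean of y (df = 1 predictor)
ss_resid = np.sum((y - fitted) ** 2)         # residual SS: data points vs line of best fit (df = N - 1 - 1)

f_stat = (ss_model / 1) / (ss_resid / (len(y) - 1 - 1))
print(f_stat)
```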
In ANOVA we have a sum of squares between and sum of squares within.
What are their analogues in a regression?
Model sum of squares and residual sum of squares.
How do we determine whether a specific predictor is significant or not?
A t-test is done comparing the coefficient estimated from the line of best fit (divided by its standard error) against the null hypothesis that the coefficient is zero.
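A sketch of those per-coefficient t tests, again assuming statsmodels and made-up data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
df["y"] = 1 + 2 * df["x1"] + rng.normal(size=50)  # x2 has no real effect here

fit = smf.ols("y ~ x1 + x2", data=df).fit()

# Each t value is the estimated coefficient divided by its standard error,
# tested against the null hypothesis that the true coefficient is zero.
print(fit.params / fit.bse)  # matches fit.tvalues
print(fit.pvalues)           # per-coefficient p-values
```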
Why do we have to worry less about doing multiple comparisons in a regression than we do in an ANOVA?
Because we are doing fewer tests, and the tests we are doing are usually theoretically motivated; that is, the only reason we are doing a regression in the first place is that we think there may be some relationship between the predictor(s) and the outcome.
What is the measure of effect size used for regression?
R-squared.
It captures the proportion of the variance in the outcome accounted for by the model (line/plane of best fit).
R-squared = 1 - residual sum of squares / (residual sum of squares + model sum of squares).
When is R-squared 1?
When the residual sum of squares is 0, i.e. the line/plane of best fit passes through every data point and the model explains all of the variance.
When is R-squared 0?
When the model sum of squares is 0, i.e. the model explains none of the variance and its predictions are no better than the mean of the outcome.
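A tiny worked example of the formula with made-up sums of squares:

```python
# Hypothetical sums of squares from some fitted model.
ss_model = 80.0  # variation the model does capture
ss_resid = 20.0  # variation the model does not capture (the residuals)

r_squared = 1 - ss_resid / (ss_resid + ss_model)
print(r_squared)  # 0.8: the model accounts for 80% of the variance in the outcome
```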
When we do a correlation test, like Pearson's r, are we just doing a regression with one predictor?
Yes.
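A quick sketch, assuming numpy and scipy, showing that Pearson's r and a one-predictor regression give the same answer:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=40)
y = 0.7 * x + rng.normal(size=40)

r, p_corr = stats.pearsonr(x, y)
reg = stats.linregress(x, y)  # simple regression with one predictor

print(r, reg.rvalue)       # same correlation coefficient
print(p_corr, reg.pvalue)  # same p-value for the slope being zero
```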
What are standardised coefficients in regression models and when do we use them?
They are the coefficients you get when the variables are converted to z-scores (standard deviation units) before fitting, so each one tells you how many standard deviations the outcome changes per standard deviation change in that predictor. We use them when the predictors are measured on different scales and we want to compare their relative contributions.
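A sketch of getting standardised coefficients by z-scoring before fitting; statsmodels is assumed and the variable names (income, age, happiness) are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "income": rng.normal(50000, 10000, size=80),  # predictor on a large scale
    "age": rng.normal(40, 10, size=80),           # predictor on a small scale
})
df["happiness"] = 0.0001 * df["income"] + 0.05 * df["age"] + rng.normal(size=80)

# Convert every variable to z-scores, then refit: the slopes are now in
# standard-deviation units and can be compared across predictors.
z = (df - df.mean()) / df.std()
raw = smf.ols("happiness ~ income + age", data=df).fit()
std = smf.ols("happiness ~ income + age", data=z).fit()

print(raw.params)  # raw coefficients, hard to compare across different scales
print(std.params)  # standardised (beta) coefficients, directly comparable
```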