Week 2-Multiple regression Flashcards
What is Correlational Research?
-Allows us to establish whether an association exists, but doesn't allow us to establish whether the association is causal (i.e., association is part of causation but does NOT mean causation)
-We can track people over time and try to establish a time-order relationship (i.e., does one variable increase before the other?)
-May give an indication of a possible causal relationship but as the data is observational, we can’t rule out a 3rd variable accounting for this effect
Correlational research is important because it allows us to look at things that what?
-Cannot ethically be looked at in experiments e.g., effects of drug addiction (can’t make someone an addict)
-Cannot feasibly be looked at in experiments due to very small effects (which would require very large samples), cost, or the impossibility of randomisation
-Cannot be put into a classic experimental framework as there are no naturally occurring or logical conditions e.g., effects of age (age can't be manipulated individually, BUT we can see a general representation of a population)
What does correlational research consist of?
1.Exploring big data (e.g., NHS, police data sets)
2.Questionnaires and surveys
3.Secondary data analysis (data people have already collected)
4.Understanding the multivariate world (i.e., context; e.g., you wouldn't drink as much vodka in a lab as you would in Spoons)
5.Predictions
Give examples of a positive and negative association
Positive association: as sociopathy scores increase, liking for Coldplay increases
Negative association: as IQ increases, liking for Coldplay decreases
What’s a strong and weak association?
Strong: the measurements are all near the line of best fit; if you have the value of X, you can estimate the value of Y accurately
Weak: although the slope of the line of best fit is the same, there is a lot of variance around the line; if you have the value of X, your estimate of Y will not be accurate
-Each dot is an observation, i.e., a person in the data
-The line of best fit is the same; it's the spread of the data around it that differs
-The regression coefficient is the slope; the standard error reflects how far observations fall from the slope
How do we explore the association between more than one variable and a DV?
We need to build a regression model.
What do regressions aim to tell us?
1.Whether our model is a ‘good fit’
2.Whether there are significant relationships between a predictor variable(s) and an outcome variable
3.The direction of these relationships
-We can then use this information to make predictions (the model does this by fitting a line of best fit for the association between variables)
-Good fit = how much of the variance in the outcome the predictors pick up
How can we predict Y?
If we know the value of X we can predict Y from the regression slope: Y = (b × X) + a, where b is the regression slope and a is the intercept
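A minimal Python sketch of this prediction equation (the slope and intercept here are made-up illustrative values, not from the lecture):

```python
# Hypothetical slope and intercept (illustrative values only)
b = 0.5   # regression slope: change in Y per one-unit increase in X
a = 2.0   # intercept: predicted Y when X = 0

def predict_y(x):
    """Predicted Y from the simple regression equation Y = (b * X) + a."""
    return b * x + a

print(predict_y(4))  # (0.5 * 4) + 2.0 = 4.0
```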
How is the line selected on the graph?
-It is the line with the lowest sum of squared prediction errors.
-The line drawn is the one for which the sum of the squared differences (the double-ended arrows) is smallest. This is the sum of squared errors (SSE); if you think about it, this is what error is, as it's how far from perfect our line of best fit is.
-The computer's line of best fit is the one where the total (squared) distance of all the points from the line is smallest, giving its sum of squared errors
-How far the dots fall from the line of best fit = error variance
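A sketch of the least-squares fit on made-up data (these values are illustrative, not course data), computing the slope and intercept that minimise SSE and then SSE itself:

```python
# Made-up data: each (x, y) pair is one observation (one person)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b and intercept a that minimise the sum of squared errors
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# SSE: squared distance of each observation from the fitted line
sse = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
print(round(b, 2), round(a, 2), round(sse, 2))  # slope 0.6, intercept 2.2, SSE 2.4
```

Any other line through these points would give an SSE larger than 2.4, which is why this one is "best".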
How do we ascertain the amount of variance our regression model explains?
-The amount of variance explained is the sum of squares for the regression (SSR)
-SSR = the sum of the squared differences between the predicted value for each observation and the mean of the outcome
-Mathematically, the SSR is the sum of the squared differences between the predictions and the mean
-The difference between everyone's predicted value and the mean shows what the model actually predicted
What’s SST?
The total sum of squares, i.e., the total amount of variance in the outcome
SST=SSE+SSR
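The decomposition SST = SSE + SSR can be checked numerically on made-up data (illustrative values; the fitted line used here is the least-squares line for these points):

```python
# Made-up data; the least-squares line for these points is Y = 0.6*X + 2.2
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_y = sum(ys) / len(ys)
preds = [0.6 * x + 2.2 for x in xs]

sst = sum((y - mean_y) ** 2 for y in ys)            # total variance
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))  # unexplained variance
ssr = sum((p - mean_y) ** 2 for p in preds)         # explained variance
print(round(sst, 2), round(sse + ssr, 2))  # both 6.0: SST = SSE + SSR
```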
What are the two types of variances?
-Variance we can explain (SSR)
-Variance we cannot explain (SSE)
-We want more variance explained (SSR) than unexplained (SSE), so that more of the outcome is accounted for
What’s the Coefficient of Determination aka R Squared?
The proportion of total variation (SST) that is explained by the regression (SSR)
R squared = SSR/SST = SSR/(SSE + SSR)
The value of R squared ranges from 0 to 1 (the closer to 1, the more accurate the regression model) and is often reported as a percentage:
.7=70% of variance is accounted for
.05=5% of variance is accounted for
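The ratio is simple enough to sketch directly (the SSR and SSE values below are illustrative, matching a model that explains 60% of the variance):

```python
def r_squared(ssr, sse):
    """Coefficient of determination: proportion of total variance explained."""
    sst = ssr + sse  # SST = SSE + SSR
    return ssr / sst

# Illustrative values: 3.6 explained, 2.4 unexplained
print(round(r_squared(3.6, 2.4), 2))  # 0.6 -> 60% of variance accounted for
```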
What’s adjusted R squared?
-An adjustment based on the number of predictors in the model
-Interpreted the same way, and always lower than R squared (usually the better one to report)
-More predictors = R squared creeps up, so each time a variable is added, adjusted R squared applies a penalty; this is why it's smaller
Why is adjusted R squared useful?
-By adding new predictors, R squared will inevitably increase even if the new predictors have no real impact on the predictive utility of the model
-The adjusted R squared will decrease if variables with little predictive value are added to the model
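A sketch of the standard adjustment formula, adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of cases and k the number of predictors (the R² and n values below are illustrative):

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R squared: penalises R squared for k predictors given n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R squared of .60 with n = 50 cases: the penalty grows with k
print(round(adjusted_r_squared(0.60, 50, 1), 3))  # 0.592 with 1 predictor
print(round(adjusted_r_squared(0.60, 50, 5), 3))  # 0.555 with 5 predictors
```

Note how the same raw R² shrinks more as predictors are added, which is the "punishment" described above.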
What's the ANOVA (analysis of variance)?
It simply tells us whether the proportion of variance in the DV predicted by the IV(s) is significant
R squared and adjusted R squared are used to evaluate model fit
How is the F statistic for a regression calculated?
Mean square of the model (not the sum of squares of the model, SSM) DIVIDED by the mean square of the residual (not the sum of squares of the residual, SSE)
F=MSM/MSR
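The mean squares are the sums of squares divided by their degrees of freedom (k for the model, n − k − 1 for the residual). A sketch with the same illustrative values as above (n = 5 cases, k = 1 predictor):

```python
def f_statistic(ssr, sse, n, k):
    """F = mean square model / mean square residual."""
    msm = ssr / k            # model sum of squares over its df (k predictors)
    msr = sse / (n - k - 1)  # residual sum of squares over its df
    return msm / msr

# Illustrative: SSR = 3.6, SSE = 2.4, n = 5 cases, k = 1 predictor
print(round(f_statistic(3.6, 2.4, 5, 1), 2))  # 4.5
```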
What are the 2 limitations of the overall regression model?
1.Doesn’t tell you information about specific predictors (e.g., 3 predictors accounting for 21% variance, is one 20%? 1%? 0?)
2.The direction of the association between variables is unknown (positive or negative)
-It’s necessary to look at the individual regression coefficients to understand individual predictors
What are regression coefficients? (B/b)
The number of units the DV changes for each one-unit increase in the IV:
-B = .03: for each one-unit increase in the IV, the DV increases by .03 units
-B = -.01: for each one-unit increase in the IV, the DV decreases by .01 units
What’s the standard error?
-How much the regression coefficient estimate varies around the slope (ideally it's small, meaning the regression coefficient is precise)
How do you calculate the t statistic?
B (regression coefficient)/SE (standard error)
-The larger the RC relative to the SE, the larger the t statistic will be and the smaller the p-value calculated for the association
-Small SE = the slope gives a pretty good prediction
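The ratio itself is a one-liner; the coefficient and SE values below are illustrative, showing how a smaller SE inflates t:

```python
def t_statistic(b, se):
    """t for a regression coefficient: estimate divided by its standard error."""
    return b / se

# Same coefficient of .60, but a smaller SE gives a larger t
print(round(t_statistic(0.60, 0.10), 2))  # 6.0 (precise slope, small p)
print(round(t_statistic(0.60, 0.40), 2))  # 1.5 (imprecise slope, larger p)
```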
What are Beta values? (β)
-They explain the association between each IV and DV in terms of standard deviation changes
β = .50 means that for every one standard deviation increase in the IV there is a .50 standard deviation increase in the DV
β = -.50 means that for every one standard deviation increase in the IV there is a .50 standard deviation decrease in the DV
What is the most useful property of the Beta value?
It allows a simple comparison of the strength of the associations between your IVs and the DV. The higher the beta, the stronger the association
(It is notable that a standardised regression coefficient is just a different way of expressing the same information as an unstandardised regression coefficient, so they have exactly the same p-value.)
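One standard way to convert between the two (a sketch; all numbers are illustrative) is β = B × (SDx / SDy), which re-expresses the unstandardised slope in standard deviation units:

```python
def standardise_coefficient(b, sd_x, sd_y):
    """Standardised beta from an unstandardised slope: beta = b * (SD_x / SD_y)."""
    return b * (sd_x / sd_y)

# Illustrative: B = .03 units of DV per IV unit, SD of IV = 10, SD of DV = 2
print(round(standardise_coefficient(0.03, 10, 2), 2))  # 0.15 SDs of DV per SD of IV
```

A small-looking B can correspond to a sizeable β when the IV's scale is wide, which is why betas are the ones to compare across predictors.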
What are the assumptions of a simple and multiple regression?
-Normally distributed (ish) continuous outcome
-Independent data
-Interval/ratio predictors
-Nominal predictors with 2 categories (dichotomous)
-No multicollinearity for multiple regression
-Be careful of influential cases (someone who has a large influence/effect on the slope)
-Observations are independent
-It won't take repeated measures; that would need a linear mixed effects model (don't need to know)