Week 2 - Multiple Regression Flashcards
What is Correlational Research?
-Allows us to establish whether an association exists but does not allow us to establish whether the association is causal (i.e., association is part of causation but does NOT mean there is causation)
-We can track people over time and try to establish a time-order relationship (i.e., does one variable increase before the other variable?)
-May give an indication of a possible causal relationship but as the data is observational, we can’t rule out a 3rd variable accounting for this effect
Why is correlational research important? It allows us to look at things that:
-Cannot ethically be looked at in experiments e.g., effects of drug addiction (can’t make someone an addict)
-Cannot feasibly be looked at in experiments because the effects are very small (so very large samples are needed), because of cost, or because randomisation is impossible
-Cannot be put into a classic experimental framework because there are no naturally occurring or logical conditions, e.g., effects of age (age can't be manipulated for an individual, but we can observe a general representation of the population across ages)
What does correlational research consist of?
1.Exploring big data (e.g., NHS, police data sets)
2.Questionnaires and surveys
3.Secondary data analysis (data people have already collected)
4.Understanding the multivariate world (i.e., context matters, e.g., you wouldn't drink as much vodka in a lab as you would at Spoons)
5.Predictions
Give examples of a positive and negative association
Positive association: As Sociopathy scores increase, Coldplay liking increases
Negative association: As IQ increases, Coldplay liking decreases
What’s a strong and weak association?
Strong: the measurements are all near the line of best fit; if you have the value of X, you can estimate the value of Y accurately
Weak: although the slope of the line of best fit is the same, there is a lot of variance around the line; if you have the value of X, your estimate of Y will not be accurate
-Each dot is an observation, i.e., a person in the data
-The line of best fit can be the same in both cases; what differs is the scatter of observations around it
-The regression coefficient is the slope; the standard error reflects how much that slope estimate would vary from sample to sample
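A minimal Python sketch of this idea (not from the flashcards; scipy and the made-up data are my own choices): fitting a simple regression returns both the slope and its standard error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)            # hypothetical predictor (e.g., sociopathy score)
y = 0.5 * x + rng.normal(size=100)  # hypothetical outcome (e.g., Coldplay liking)

fit = stats.linregress(x, y)
print(fit.slope)    # regression coefficient (slope of the line of best fit)
print(fit.stderr)   # standard error of that slope estimate
```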
How do we explore the association between more than one variable and a DV?
We need to build a regression model.
What do regressions aim to tell us?
1.Whether our model is a ‘good fit’
2.Whether there are significant relationships between a predictor variable(s) and an outcome variable
3.The direction of these relationships
-We can then use this information to make predictions (does this by predicting a line of best fit for the association between variables)
-'Good fit' = how much of the variance in the outcome the predictors account for
How can we predict Y?
If we know the value of X we can predict Y from the regression slope: Y = b × X + a, where b is the regression slope and a is the intercept
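A quick worked sketch of the prediction equation (the numbers are invented for illustration):

```python
b = 0.5    # hypothetical regression slope
a = 2.0    # hypothetical intercept
x = 10     # known value of the predictor X

y_hat = b * x + a   # predicted Y = 0.5 * 10 + 2.0 = 7.0
print(y_hat)
```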
How is the line selected on the graph?
-It is the line with the lowest sum of squared prediction errors.
-The line drawn is the one for which the sum of the squared differences (the double-ended arrows on the plot) is smallest. This is the sum of squared error (SSE); it captures how far from perfect our line of best fit is.
-The computer chooses the line of best fit that keeps all the points as close to it as possible, which minimises the sum of squared errors
-How far the dots fall from the line of best fit = the error variance
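A hedged sketch of the least-squares idea (numpy and the simulated data are my own additions): the fitted line gives a smaller sum of squared errors than any other candidate line.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2.0 * x + 1.0 + rng.normal(size=50)   # made-up data

b, a = np.polyfit(x, y, deg=1)            # least-squares slope and intercept
sse = np.sum((y - (b * x + a)) ** 2)      # sum of squared prediction errors (SSE)

# Nudging the slope away from the least-squares value inflates the error
worse_sse = np.sum((y - ((b + 0.3) * x + a)) ** 2)
print(sse < worse_sse)                    # True
```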
How do we ascertain the amount of variance our regression model explains?
-The amount of variance explained is the sum of squares for the regression (SSR)
-SSR = the sum of the squared differences between the predicted value for each observation and the mean of the outcome variable
-Mathematically, SSR is the sum of the squared differences between the predictions and the mean
-It is the difference between everyone's predicted value and the mean, so it captures what the model actually predicts over and above the mean
What’s SST?
The total sum of squares: all of the variance in the outcome that the model could potentially explain
SST=SSE+SSR
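A minimal check of this decomposition (simulated data; not part of the original flashcards):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.5 * x + rng.normal(size=100)

b, a = np.polyfit(x, y, deg=1)
y_hat = b * x + a

sse = np.sum((y - y_hat) ** 2)         # variance we cannot explain
ssr = np.sum((y_hat - y.mean()) ** 2)  # variance the regression explains
sst = np.sum((y - y.mean()) ** 2)      # total variance

print(np.isclose(sst, sse + ssr))      # True: SST = SSE + SSR
```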
What are the two types of variances?
-Variance we can explain (SSR)
-Variance we cannot explain (SSE)
-We want more variance explained than unexplained, so that the model accounts for as much as possible
What’s the Coefficient of Determination aka R Squared?
The proportion of total variation (SST) that is explained by the regression (SSR)
R squared = SSR/SST = SSR/(SSE + SSR)
The value of R squared ranges from 0 to 1 (with 1 being a perfect model) and is often reported as a percentage
.7=70% of variance is accounted for
.05=5% of variance is accounted for
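A hedged illustration (scipy and the simulated data are assumptions of mine): computing R squared as SSR/SST matches the squared correlation that scipy reports for a simple regression.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(size=200)

fit = stats.linregress(x, y)
y_hat = fit.slope * x + fit.intercept

r_squared = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # SSR / SST
print(round(r_squared, 3), round(fit.rvalue ** 2, 3))  # the two values agree
```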
What’s adjusted R squared?
-An adjustment based on the number of predictors in the model
-Interpreted the same way, and always lower than R squared (it is usually the better one to report)
-More predictors = R squared creeps up, so each time a variable is added, adjusted R squared applies a penalty, which is why it is smaller
Why is adjusted R squared useful?
-By adding new predictors, R squared will inevitably increase even if the new predictors have no real impact on the predictive utility of the model
-Adjusted R squared will decrease considerably if variables with little predictive value are added to the model
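A hedged sketch of this behaviour (statsmodels and the pure-noise predictor are my own additions, not from the flashcards): the adjustment uses adjusted R squared = 1 - (1 - R squared)(n - 1)/(n - p - 1), where n is the sample size and p the number of predictors, so a predictor that adds nothing nudges R squared up but tends to pull adjusted R squared down.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)               # pure-noise predictor, unrelated to y
y = 0.6 * x1 + rng.normal(size=n)

small = sm.OLS(y, sm.add_constant(x1)).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x1, noise]))).fit()

print(small.rsquared, big.rsquared)           # R squared never decreases when a predictor is added
print(small.rsquared_adj, big.rsquared_adj)   # adjusted R squared penalises the useless predictor
```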