Week 2-Multiple regression Flashcards
What is Correlational Research?
-Allows us to establish whether an association exists, but doesn't allow us to establish whether the association is causal (i.e., association is part of causation but does NOT mean causation)
-We can track people over time and try to establish a time-order relationship (i.e., does one variable increase before the other?)
-May give an indication of a possible causal relationship but as the data is observational, we can’t rule out a 3rd variable accounting for this effect
Correlational research is important because it allows us to look at things that what?
-Cannot ethically be looked at in experiments e.g., effects of drug addiction (can’t make someone an addict)
-Cannot feasibly be looked at in experiments due to very small effects (which would require very large samples), cost, or the impossibility of randomisation
-Cannot be put into a classic experimental framework as there are no naturally occurring or logical conditions e.g., effects of age (age can't be manipulated individually, BUT we can see a general representation of a population)
What does correlational research consist of?
1.Exploring big data (e.g., NHS, police data sets)
2.Questionnaires and surveys
3.Secondary data analysis (data people have already collected)
4.Understanding the multivariate world (i.e., context; e.g., you wouldn't drink as much vodka in a lab as you would in Spoons)
5.Predictions
Give examples of a positive and negative association
Positive association: as sociopathy scores increase, liking for Coldplay increases
Negative association: as IQ increases, liking for Coldplay decreases
What’s a strong and weak association?
Strong: the measurements are all near the line of best fit; if you have the value of X, you can estimate the value of Y accurately
Weak: although the slope of the line of best fit is the same, there is a lot of variance around the line; if you have the value of X, your estimate of Y will not be accurate
-Each dot is an observation, i.e., a person in the data
-The line of best fit is the same; it's the spread of the data around it that differs
-The regression coefficient is the slope; the standard error reflects how far observations fall from the slope
How do we explore the association between more than one variable and a DV?
We need to build a regression model.
What do regressions aim to tell us?
1.Whether our model is a ‘good fit’
2.Whether there are significant relationships between a predictor variable(s) and an outcome variable
3.The direction of these relationships
-We can then use this information to make predictions (the model does this by fitting a line of best fit for the association between variables)
-Good fit = how much of the variance in the outcome the predictors pick up
How can we predict Y?
If we know the value of X we can predict Y from the regression slope: Y = (b × X) + a, where b is the regression slope and a is the intercept
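A minimal Python sketch of this prediction equation (the slope and intercept here are made-up illustrative values, not from the lecture):

```python
# Hypothetical slope and intercept (illustrative values only)
b = 0.5   # regression slope: change in Y per one-unit increase in X
a = 2.0   # intercept: predicted Y when X = 0

def predict_y(x):
    """Predicted Y from the simple regression equation Y = (b * X) + a."""
    return b * x + a

print(predict_y(4))  # (0.5 * 4) + 2.0 = 4.0
```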
How is the line selected on the graph?
-It is the line with the lowest sum of squared prediction errors.
-The line drawn is the one for which the sum of the squared differences (the double-ended arrows) is smallest. This is the sum of squared errors (SSE); if you think about it, this is what error is, as it's how far from perfect our line of best fit is.
-The computer's line of best fit is the one where the total (squared) distance of all the points from the line is smallest, giving its sum of squared errors
-How far the dots fall from the line of best fit = error variance
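A sketch of the least-squares fit on made-up data (these values are illustrative, not course data), computing the slope and intercept that minimise SSE and then SSE itself:

```python
# Made-up data: each (x, y) pair is one observation (one person)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b and intercept a that minimise the sum of squared errors
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# SSE: squared distance of each observation from the fitted line
sse = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
print(round(b, 2), round(a, 2), round(sse, 2))  # slope 0.6, intercept 2.2, SSE 2.4
```

Any other line through these points would give an SSE larger than 2.4, which is why this one is "best".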
How do we ascertain the amount of variance our regression model explains?
-The amount of variance explained is the sum of squares for the regression (SSR)
-SSR = the sum of the squared differences between the predicted value for each observation and the mean of the outcome
-Mathematically, the SSR is the sum of the squared differences between the predictions and the mean
-The difference between everyone's predicted value and the mean shows what the model actually predicted
What’s SST?
The total sum of squares, i.e., the total amount of variance in the outcome
SST=SSE+SSR
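The decomposition SST = SSE + SSR can be checked numerically on made-up data (illustrative values; the fitted line used here is the least-squares line for these points):

```python
# Made-up data; the least-squares line for these points is Y = 0.6*X + 2.2
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_y = sum(ys) / len(ys)
preds = [0.6 * x + 2.2 for x in xs]

sst = sum((y - mean_y) ** 2 for y in ys)            # total variance
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))  # unexplained variance
ssr = sum((p - mean_y) ** 2 for p in preds)         # explained variance
print(round(sst, 2), round(sse + ssr, 2))  # both 6.0: SST = SSE + SSR
```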
What are the two types of variances?
-Variance we can explain (SSR)
-Variance we cannot explain (SSE)
-We want more variance explained (SSR) than unexplained (SSE), so that more of the outcome is accounted for
What’s the Coefficient of Determination aka R Squared?
The proportion of total variation (SST) that is explained by the regression (SSR)
R squared = SSR/SST = SSR/(SSE + SSR)
The value of R squared ranges from 0 to 1 (the closer to 1, the more accurate the regression model) and is often reported as a percentage:
.7=70% of variance is accounted for
.05=5% of variance is accounted for
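The ratio is simple enough to sketch directly (the SSR and SSE values below are illustrative, matching a model that explains 60% of the variance):

```python
def r_squared(ssr, sse):
    """Coefficient of determination: proportion of total variance explained."""
    sst = ssr + sse  # SST = SSE + SSR
    return ssr / sst

# Illustrative values: 3.6 explained, 2.4 unexplained
print(round(r_squared(3.6, 2.4), 2))  # 0.6 -> 60% of variance accounted for
```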
What’s adjusted R squared?
-An adjustment based on the number of predictors in the model
-Interpreted the same way, and always lower than R squared (usually the better one to report)
-More predictors = R squared creeps up, so each time a variable is added, adjusted R squared applies a penalty; this is why it's smaller
Why is adjusted R squared useful?
-By adding new predictors, R squared will inevitably increase even if the new predictors have no real impact on the predictive utility of the model
-The adjusted R squared will decrease if variables with little predictive value are added to the model
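A sketch of the standard adjustment formula, adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of cases and k the number of predictors (the R² and n values below are illustrative):

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R squared: penalises R squared for k predictors given n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same R squared of .60 with n = 50 cases: the penalty grows with k
print(round(adjusted_r_squared(0.60, 50, 1), 3))  # 0.592 with 1 predictor
print(round(adjusted_r_squared(0.60, 50, 5), 3))  # 0.555 with 5 predictors
```

Note how the same raw R² shrinks more as predictors are added, which is the "punishment" described above.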
What's the ANOVA (analysis of variance)?
It simply tells us whether the proportion of variance in the DV predicted by the IV(s) is significant
R squared and adjusted R squared are used to evaluate model fit
How is the F statistic for a regression calculated?
Mean square of the model (not the sum of squares of the model, SSM) DIVIDED by the mean square of the residual (not the sum of squares of the residual, SSE)
F=MSM/MSR
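The mean squares are the sums of squares divided by their degrees of freedom (k for the model, n − k − 1 for the residual). A sketch with the same illustrative values as above (n = 5 cases, k = 1 predictor):

```python
def f_statistic(ssr, sse, n, k):
    """F = mean square model / mean square residual."""
    msm = ssr / k            # model sum of squares over its df (k predictors)
    msr = sse / (n - k - 1)  # residual sum of squares over its df
    return msm / msr

# Illustrative: SSR = 3.6, SSE = 2.4, n = 5 cases, k = 1 predictor
print(round(f_statistic(3.6, 2.4, 5, 1), 2))  # 4.5
```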
What are the 2 limitations of the overall regression model?
1.Doesn’t tell you information about specific predictors (e.g., 3 predictors accounting for 21% variance, is one 20%? 1%? 0?)
2.The direction of the association between variables is unknown (positive or negative)
-It’s necessary to look at the individual regression coefficients to understand individual predictors
What are regression coefficients? (B/b)
The number of units the DV changes for each one-unit increase in the IV:
-B = .03: for each one-unit increase in the IV, the DV increases by .03 units
-B = -.01: for each one-unit increase in the IV, the DV decreases by .01 units
What’s the standard error?
-How much the regression coefficient estimate varies around the slope (ideally it's small, meaning the regression coefficient is precise)
How do you calculate the t statistic?
B (regression coefficient)/SE (standard error)
-The larger the RC relative to the SE, the larger the t statistic will be and the smaller the p-value calculated for the association
-Small SE = the slope gives a pretty good prediction
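The ratio itself is a one-liner; the coefficient and SE values below are illustrative, showing how a smaller SE inflates t:

```python
def t_statistic(b, se):
    """t for a regression coefficient: estimate divided by its standard error."""
    return b / se

# Same coefficient of .60, but a smaller SE gives a larger t
print(round(t_statistic(0.60, 0.10), 2))  # 6.0 (precise slope, small p)
print(round(t_statistic(0.60, 0.40), 2))  # 1.5 (imprecise slope, larger p)
```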
What are Beta values? (β)
-They explain the association between each IV and DV in terms of standard deviation changes
β = .50 means that for every one standard deviation increase in the IV there is a .50 standard deviation increase in the DV
β = -.50 means that for every one standard deviation increase in the IV there is a .50 standard deviation decrease in the DV
What is the most useful property of the Beta value?
It allows a simple comparison of the strength of the associations between your IVs and the DV. The higher the beta, the stronger the association
(It is notable that a standardised regression coefficient is just a different way of expressing the same information as an unstandardised regression coefficient, so they have exactly the same p-value.)
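One standard way to convert between the two (a sketch; all numbers are illustrative) is β = B × (SDx / SDy), which re-expresses the unstandardised slope in standard deviation units:

```python
def standardise_coefficient(b, sd_x, sd_y):
    """Standardised beta from an unstandardised slope: beta = b * (SD_x / SD_y)."""
    return b * (sd_x / sd_y)

# Illustrative: B = .03 units of DV per IV unit, SD of IV = 10, SD of DV = 2
print(round(standardise_coefficient(0.03, 10, 2), 2))  # 0.15 SDs of DV per SD of IV
```

A small-looking B can correspond to a sizeable β when the IV's scale is wide, which is why betas are the ones to compare across predictors.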
What are the assumptions of a simple and multiple regression?
-Normally distributed (ish) continuous outcome
-Independent data
-Interval/ratio predictors
-Nominal predictors with 2 categories (dichotomous)
-No multicollinearity for multiple regression
-Be careful of influential cases (someone who has a large influence/effect on the slope)
-Observations are independent
-It won't take repeated measures; that would need a linear mixed effects model (don't need to know)