Correlation & multiple regression Flashcards by Poppy Butler

what is correlation?

An association or dependency between two independently observed variables
use a scatterplot to visualise a correlation

what does Pearson's correlation coefficient do?

tells you how strong the linear relationship is between X and Y
it's a number between -1 and 1
0: there is no linear relationship between them
1.0: a perfect positive linear relationship
-1.0: a perfect negative (inverse) linear relationship
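
A minimal sketch of how r can be computed by hand, assuming small made-up datasets; the function name `pearson_r` is illustrative and not from the cards. The third dataset is U-shaped: Y depends entirely on X, yet r comes out as 0, because r only measures linear association.

```python
import math

def pearson_r(x, y):
    """Pearson's r: co-variation of x and y scaled by their individual variation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations (numerator of the covariance)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # perfect positive linear relationship
print(pearson_r(x, [10, 8, 6, 4, 2]))  # perfect negative linear relationship
print(pearson_r(x, [4, 1, 0, 1, 4]))   # U-shaped relationship, no linear trend
```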

when is covariance greater?

when the values of X and Y are more similar, i.e. when they vary together more closely around their means
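
As a rough illustration with invented numbers, the sketch below computes the sample covariance (n - 1 denominator, an assumption; the cards do not say which form is used) for three Y series that follow X to different degrees.

```python
def covariance(x, y):
    """Sample covariance: summed products of deviations from the means, over n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y_same = [1, 2, 3, 4, 5]      # varies exactly like x
y_mixed = [2, 1, 4, 3, 5]     # roughly follows x
y_opposite = [5, 4, 3, 2, 1]  # moves against x

print(covariance(x, y_same))      # 2.5
print(covariance(x, y_mixed))     # 2.0
print(covariance(x, y_opposite))  # -2.5
```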

when do we conduct a Pearson’s coefficient (r)?

two interval/ratio variables

when do we conduct a Spearman’s rank coefficient?

two ordinal (rank) variables
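
One common way to obtain Spearman's coefficient is to convert both variables to ranks and then compute Pearson's r on the ranks. The sketch below assumes that approach; the helper names `ranks`, `pearson_r` and `spearman_rho` are illustrative, and tied values receive the average of their rank positions.

```python
import math

def ranks(values):
    """Rank values from 1 to n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho as Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Monotonic but non-linear relationship: the ranks agree perfectly
print(spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))  # 1.0
```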

when do we conduct a Kendall’s rank coefficient?

two ordinal (rank) variables, typically with a small sample or many tied ranks

when do we conduct a Phi coefficient?

two true dichotomous variables
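
For two dichotomous variables the data reduce to a 2x2 table of counts, and phi can be computed from the four cells. A minimal sketch with invented counts (the function name and the cell labels a, b, c, d are illustrative):

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 count table [[a, b], [c, d]]."""
    # Difference of the diagonal products, scaled by the row and column totals
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Rows: group A / group B; columns: yes / no (made-up counts)
print(phi_coefficient(20, 5, 5, 20))  # 0.6
```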

when do we conduct a point-biserial coefficient?

one true dichotomous variable and one interval/ratio variable
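
The point-biserial coefficient is numerically the same as Pearson's r when the dichotomous variable is coded 0/1. The sketch below, using invented scores, computes it from the group means and checks it against Pearson's r; the function names are illustrative.

```python
import math

def point_biserial(groups, scores):
    """Point-biserial r from group means; groups must be coded 0 or 1."""
    n = len(scores)
    g1 = [s for g, s in zip(groups, scores) if g == 1]
    g0 = [s for g, s in zip(groups, scores) if g == 0]
    mean_all = sum(scores) / n
    # Standard deviation of all scores (n denominator)
    sd = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    mean_diff = sum(g1) / len(g1) - sum(g0) / len(g0)
    return mean_diff / sd * math.sqrt(len(g1) * len(g0) / n ** 2)

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

groups = [0, 0, 0, 1, 1, 1]
scores = [1, 2, 3, 4, 5, 6]
print(round(point_biserial(groups, scores), 4))  # 0.8783
print(round(pearson_r(groups, scores), 4))       # same value
```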

what is partial correlation?

the correlation between two variables after controlling for one or more other variables; used when information from different variables is overlapping
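
A sketch of the standard first-order partial correlation formula, which removes the overlap with a third variable Z from the correlation between X and Y. The three input correlations below are invented for illustration.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Correlation between X and Y with the influence of Z removed from both."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# X and Y correlate 0.5, but both overlap substantially with Z
print(round(partial_r(0.5, 0.6, 0.7), 3))  # 0.14

# If Z is unrelated to both, nothing is removed
print(partial_r(0.5, 0.0, 0.0))  # 0.5
```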

what is multiple regression?

it describes the relationship between one or more predictor variables (X1, X2 etc) and a single criterion (Y)

linear regression equation

$\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_m X_m$

$\hat{Y}$ = the predicted value of the criterion variable $Y$
$\beta_0$ = the intercept term
$\beta_i$ = the $i$th regression coefficient, indicating how strongly predictor variable $X_i$ can be used to predict $Y$ in the model
$m$ = the number of predictor variables in the model
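
As a small worked instance of the equation, the sketch below evaluates the prediction for one observation. The coefficient and predictor values are invented, and `predict` is an illustrative name.

```python
def predict(beta0, betas, xs):
    """Predicted Y for one observation: intercept plus the weighted sum of predictors."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))

# Two predictors (m = 2): intercept 1.0, coefficients 2.0 and 0.5
print(predict(1.0, [2.0, 0.5], [3.0, 4.0]))  # 1 + 2*3 + 0.5*4 = 9.0
```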

what is $y = ax + b$ equivalent to?

$\hat{Y} = \beta_0 + \beta_1 X_1$
where the slope $a$ corresponds to $\beta_1$ and the y-intercept $b$ corresponds to $\beta_0$

what is the equation for residual error?

$\varepsilon = Y - \hat{Y}$

what is the equation for the variance unexplained?

$SS_R = \sum (Y - \hat{Y})^2$

what is the equation for the variance explained?

$SS_M = \sum (\hat{Y} - \bar{Y})^2$
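
A minimal sketch of the unexplained and explained sums of squares on an invented five-point dataset. It also computes the total sum of squares (here called `ss_t`, a name not used on the cards); for a least-squares fit with an intercept, the explained and unexplained parts add up to the total.

```python
def sums_of_squares(y, y_hat):
    """Return (SS_R, SS_M, SS_T) for observed y and model predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_r = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    ss_m = sum((fi - mean_y) ** 2 for fi in y_hat)          # explained
    ss_t = sum((yi - mean_y) ** 2 for yi in y)              # total
    return ss_r, ss_m, ss_t

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
# Least-squares line for this toy data: Y-hat = 2.2 + 0.6 * X
y_hat = [2.2 + 0.6 * xi for xi in x]

ss_r, ss_m, ss_t = sums_of_squares(y, y_hat)
print(round(ss_r, 6), round(ss_m, 6), round(ss_t, 6))  # 2.4 3.6 6.0
```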

what is prediction error?

the difference between the actual values $Y$ and the predicted values $\hat{Y}$
$\varepsilon = Y - \hat{Y}$

what is the goal of a regression?

to find the best fit between the model and the observations, by adjusting the values of $\beta_i$ until the prediction error is minimised
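
For a single predictor, the coefficients that minimise the summed squared prediction error have a closed form (ordinary least squares). The sketch below assumes that criterion and an invented dataset; it also shows that nudging a coefficient away from the fitted value makes the error larger.

```python
def fit_simple(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

def ss_residual(x, y, beta0, beta1):
    """Summed squared prediction error for a given intercept and slope."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
beta0, beta1 = fit_simple(x, y)
print(round(beta0, 3), round(beta1, 3))                # 2.2 0.6
print(round(ss_residual(x, y, beta0, beta1), 3))       # 2.4
print(round(ss_residual(x, y, beta0, beta1 + 0.1), 3)) # larger error
```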

what is the multiple correlation coefficient (R)?

Correlation between the predicted values $\hat{Y}$ and the observed values $Y$
- usually not calculated directly
- obtained as the square root of the coefficient of determination ($R^2$)

what is the coefficient of determination ($R^2$)?

Proportion of the variance in $Y$ explained by the regression model
This is simply the square of the multiple correlation coefficient
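
A short sketch of both quantities, assuming the predictions come from a least-squares fit with an intercept (so that explained variance over total variance gives the coefficient of determination). The data and the name `r_squared` are invented for illustration.

```python
import math

def r_squared(y, y_hat):
    """Explained sum of squares divided by the total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_m = sum((fi - mean_y) ** 2 for fi in y_hat)
    ss_t = sum((yi - mean_y) ** 2 for yi in y)
    return ss_m / ss_t

y = [2, 4, 5, 4, 5]
# Predictions from the least-squares line 2.2 + 0.6 * X for X = 1..5
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, y_hat)
multiple_r = math.sqrt(r2)  # multiple correlation coefficient
print(round(r2, 3), round(multiple_r, 3))  # 0.6 0.775
```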

F-ratio

the ratio of the variance explained by the model to the residual (unexplained) variance, allowing a statistical test of the overall model
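
A sketch of the usual computation: each sum of squares is divided by its degrees of freedom (m for the model, N - m - 1 for the residuals) before taking the ratio. The input values are invented, and looking up the p-value from the F distribution is left out.

```python
def f_ratio(ss_m, ss_r, n, m):
    """F = mean square of the model / mean square of the residuals."""
    ms_m = ss_m / m            # m = number of predictors
    ms_r = ss_r / (n - m - 1)  # n = number of observations
    return ms_m / ms_r

# Explained SS 3.6, residual SS 2.4, five observations, one predictor
print(round(f_ratio(3.6, 2.4, n=5, m=1), 3))  # 4.5
```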

effect sizes for multiple linear regression (Cohen’s $f^2$)

small effect size = Cohen’s $f^2$ of 0.02
medium effect size = Cohen’s $f^2$ of 0.15
large effect size = Cohen’s $f^2$ of 0.35
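
The sketch below assumes these thresholds refer to Cohen's f squared, the usual convention for regression, which is derived from the coefficient of determination as R^2 / (1 - R^2). The labelling function and the example R^2 values are illustrative.

```python
def cohens_f2(r2):
    """Cohen's f squared from the coefficient of determination."""
    return r2 / (1 - r2)

def effect_label(f2):
    """Label an effect using the conventional 0.02 / 0.15 / 0.35 cut-offs."""
    if f2 >= 0.35:
        return "large"
    if f2 >= 0.15:
        return "medium"
    if f2 >= 0.02:
        return "small"
    return "negligible"

for r2 in (0.13, 0.26):
    f2 = cohens_f2(r2)
    print(r2, round(f2, 3), effect_label(f2))
```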

what is a simultaneous (standard) multiple regression approach?

No a priori model assumed
All predictor variables are fit together

what is a stepwise approach to multiple regression?

No a priori model
Predictor variables are added/removed one at a time, to maximise fit
Generally not a good approach because it capitalises on chance and tends to overfit the data

what is a hierarchical multiple regression approach?

Based on a priori knowledge of variables: we may know a relationship exists for some variables, but are interested in the added explanatory power of a new variable
Several successive regression models are analysed (adding or removing predictor variables)
We can use this to assess how much better one model explains the criterion variable than another ($\Delta R^2$)
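
A sketch of how the change in R^2 between two nested models can be computed and tested with an F-change statistic. The R^2 values, sample size and predictor counts are invented; `k` is the number of predictors added in the larger model.

```python
def r2_change(r2_reduced, r2_full, n, m_full, k):
    """Return (delta R^2, F-change) for a full model that adds k predictors."""
    delta = r2_full - r2_reduced
    # Gain per added predictor, relative to the unexplained variance
    # per residual degree of freedom in the full model
    f_change = (delta / k) / ((1 - r2_full) / (n - m_full - 1))
    return delta, f_change

# Step 1: two known predictors (R^2 = 0.30); step 2: one new predictor (R^2 = 0.40)
delta, f_change = r2_change(0.30, 0.40, n=100, m_full=3, k=1)
print(round(delta, 3), round(f_change, 3))  # 0.1 16.0
```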

what are some factors that affect multiple linear regression?

- Outliers
- Scedasticity
- Singularity & Multicollinearity
- Number of observations / Number of predictors
- Range of values
- Distribution of values

what are outliers?

- points which deviate substantially from most of the others and can have a disproportionate effect on the linear regression fit

what does Cook's distance measure?

the influence of a single observation (e.g. an outlier) on the fitted regression model; values greater than 1 are cause for concern
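
A sketch of Cook's distance for a one-predictor regression, using the standard formula that combines each point's squared residual with its leverage. The dataset is invented so that the last point sits far from the others and breaks their trend.

```python
def cooks_distances(x, y):
    """Cook's distance for each point of a simple (one-predictor) least-squares fit."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    beta0 = mean_y - beta1 * mean_x
    residuals = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage: how far each x lies from the mean of x
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return [e ** 2 * h / (p * mse * (1 - h) ** 2)
            for e, h in zip(residuals, leverage)]

x = [1, 2, 3, 4, 10]
y = [1, 2, 3, 4, 2]  # the last point breaks the trend of the first four
d = cooks_distances(x, y)
print([round(v, 2) for v in d])  # only the last value is far above 1
```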

what is scedasticity?

- refers to the distribution of the residual error (i.e., relative to the predictor variable)
- Homoscedasticity: the spread of the residuals stays relatively constant over the range of the predictor variable
- Heteroscedasticity: the spread of the residuals varies systematically across the range of the predictor variable
- Multiple linear regression assumes homoscedasticity

what is Multicollinearity?

refers to a very high correlation between two or more predictor variables ($r > 0.9$)
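
One simple screening step is to correlate every pair of predictors and flag pairs above the 0.9 cut-off. The sketch below does this with invented data; the variable names are hypothetical, and pairwise checks will not catch a predictor that is a combination of several others.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collinear_pairs(predictors, threshold=0.9):
    """Names of predictor pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        if abs(pearson_r(a, b)) > threshold:
            flagged.append((name_a, name_b))
    return flagged

predictors = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59, 63, 67, 71, 75],  # same measurement in other units
    "age": [30, 25, 40, 22, 35],
}
print(collinear_pairs(predictors))  # [('height_cm', 'height_in')]
```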

what is Singularity?

refers to a redundant variable; typically, this results when one variable is a combination of two or more other variables (e.g., subscores of an intelligence scale)

issues with singularity & multicollinearity

- Logical: Don’t want to measure the same thing twice
- Statistical: Cannot solve regression problem because system is ill-conditioned

how do the number of observations and the number of predictors affect multiple regression?

- Number of observations ($N$) should be high compared to the number of predictor variables ($m$)
- Results become meaningless (impossible to generalise due to overfitting) as $N/m$ decreases

how does range and distribution affect multiple regression?

- Range: Small range (max-min) of the predictor variable restricts statistical power
- Distribution of variables: Data should be normally or uniformly distributed

Always important to plot/visualise your data!
