Correlation & multiple regression Flashcards
what is correlation?
- An association or dependency between two independently observed variables
- use a scatterplot to visualise a correlation
what does Pearson's correlation coefficient do?
- tells you how strong the correlation is between X and Y
- it's a number between -1 and +1
- 0: X and Y are completely independent of each other
- +1.0: they are identical to each other
- -1.0: they are exactly inverse of one another
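A minimal numpy sketch (with made-up data) illustrating the reference points above:

```python
import numpy as np

# Hypothetical example data: e.g., hours studied vs. exam score
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # nearly perfect positive relationship

# Pearson's r: covariance of X and Y divided by the product of their SDs
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to +1: strong positive correlation

# A variable is exactly inversely correlated with its own negation
r_inverse = np.corrcoef(x, -x)[0, 1]
print(round(r_inverse, 3))  # -1.0
```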
when is covariance greater?
when the values of X and Y vary together more closely (i.e., both tend to be above or below their means at the same time)
when do we conduct a Pearson's coefficient (r)?
two interval/ratio variables
when do we conduct a Spearman's rank coefficient?
two ordinal (rank) variables
when do we conduct a Kendall's rank coefficient?
two ordinal (rank) variables (an alternative to Spearman's, often preferred for small samples or many tied ranks)
when do we conduct a Phi coefficient?
two true dichotomy variables
when do we conduct a point-biserial coefficient?
one true dichotomy variable and one interval/ratio variable
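A quick numpy sketch (hypothetical data) of two of the coefficients above: Spearman's rank coefficient is just Pearson's r computed on the ranks, and the point-biserial coefficient is Pearson's r with a 0/1 dichotomy variable (the ranking helper below is a simplified illustration with no tie correction):

```python
import numpy as np

def rank(a):
    # Simplified ranking for illustration (assumes no tied values)
    order = a.argsort()
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(a))
    return ranks.astype(float)

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.0, 4.0, 9.0, 16.0, 25.0])   # monotonic but non-linear

# Spearman's rank coefficient = Pearson's r on the ranks
rho = np.corrcoef(rank(x), rank(y))[0, 1]
print(round(rho, 3))  # perfectly monotonic relationship

# Point-biserial = Pearson's r with one true dichotomy (0/1) variable
group = np.array([0, 0, 0, 1, 1])           # e.g., control vs. treatment
r_pb = np.corrcoef(group, y)[0, 1]
print(round(r_pb, 3))
```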
what is partial correlation?
the correlation between two variables after removing (partialling out) the overlapping information they share with one or more other variables; used when information from different variables overlaps
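One common way to compute a partial correlation is to regress each variable on the control variable and correlate the residuals. A sketch with simulated data (the variable names and the confound structure are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=200)                 # confounding variable
x = z + rng.normal(scale=0.5, size=200)  # X driven partly by Z
y = z + rng.normal(scale=0.5, size=200)  # Y driven partly by Z (not by X)

def residuals(a, b):
    # Residuals of a after regressing it on b (removes b's overlap)
    slope, intercept = np.polyfit(b, a, 1)
    return a - (slope * b + intercept)

r_raw = np.corrcoef(x, y)[0, 1]          # inflated by the shared Z
r_partial = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]
print(round(r_raw, 2), round(r_partial, 2))  # partial r shrinks toward 0
```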
what is multiple regression?
it describes the relationship between one or more predictor variables (X1, X2 etc) and a single criterion (Y)
linear regression equation
Ŷ = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ
Ŷ = the predicted value of the criterion variable Y
β₀ = the intercept term
βᵢ = the ith regression coefficient, indicating how strongly predictor variable Xᵢ can be used to predict Y in the model
n = the number of predictor variables in the model
what is y = ax + b equivalent to?
Ŷ = β₀ + β₁X₁
where a is the slope (β₁) and b is the y-intercept (β₀)
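A minimal numpy sketch of the simple (one-predictor) case, using noise-free made-up data so the fitted coefficients match the true slope and intercept exactly:

```python
import numpy as np

# y = ax + b with a = 2 (slope) and b = 1 (intercept)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# np.polyfit returns [slope, intercept]; in regression notation
# these are beta_1 and beta_0 respectively
beta1, beta0 = np.polyfit(x, y, 1)
print(round(beta1, 3), round(beta0, 3))  # → 2.0 1.0
```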
what is the equation for residual error?
e = Y − Ŷ
what is the equation for the variance unexplained?
SSₑ = Σ(Y − Ŷ)²
what is the equation for the variance explained?
SSᵣ = Σ(Ŷ − Ȳ)²
what is prediction error?
the difference between the actual values Y and the predicted values Ŷ
e = Y − Ŷ
what is the goal of a regression?
to find the best fit between the model and the observations, by adjusting the values of βᵢ until the prediction error is minimised
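This "adjust the βᵢ until the squared prediction error is minimised" goal is what ordinary least squares does. A sketch with simulated data (the true coefficients and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 2))               # two predictor variables
true_beta = np.array([1.5, 0.5, -2.0])    # [beta_0, beta_1, beta_2]
Xd = np.column_stack([np.ones(n), X])     # design matrix with intercept column
y = Xd @ true_beta + rng.normal(scale=0.1, size=n)

# Least squares chooses the betas minimising sum((Y - Y_hat)^2)
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
print(np.round(beta_hat, 1))  # close to the true coefficients
```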
what is multiple correlation coefficient (R)?
Correlation between the predicted values Ŷ and the observed values Y
- cannot be calculated directly
- it is obtained as the square root of the coefficient of determination (R²)
what is the coefficient of determination (R²)?
- Proportion of variance explained by the regression model
- This is simply the square of the multiple correlation coefficient
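A worked numpy example (made-up data) tying together the sums of squares above: for a least-squares fit with an intercept, SSᵣ + SSₑ = SS_total, so R² = SSᵣ/SS_total:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ss_total = np.sum((y - y.mean()) ** 2)   # total sum of squares
ss_e = np.sum((y - y_hat) ** 2)          # unexplained (residual) variance
ss_r = np.sum((y_hat - y.mean()) ** 2)   # explained variance
r_squared = ss_r / ss_total              # coefficient of determination
R = np.sqrt(r_squared)                   # multiple correlation coefficient
print(round(r_squared, 3), round(R, 3))  # → 0.6 0.775
```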
F-ratio
the ratio of explained variance to residual variance, allowing a statistical test of the regression model
effect sizes for multiple linear regression (Cohen's f²)
small effect size = Cohen's f² of 0.02
medium effect size = Cohen's f² of 0.15
large effect size = Cohen's f² of 0.35
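Cohen's f² can be computed directly from R², since f² = R²/(1 − R²). A quick sketch showing which R² values roughly correspond to the conventional thresholds:

```python
import numpy as np

def cohens_f2(r_squared):
    # Cohen's f^2 = R^2 / (1 - R^2)
    return r_squared / (1.0 - r_squared)

# These R^2 values land near the small/medium/large cut-offs
for r2 in (0.02, 0.13, 0.26):
    print(round(cohens_f2(r2), 3))
```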
what is a simultaneous (standard) multiple regression approach?
- No a priori model assumed
- All predictor variables are fit together
what is a stepwise approach to multiple regression?
- No a priori model
- Predictor variables are added/removed one at a time, to maximise fit
- Not a good approach because it will always overfit the data
what is a hierarchical multiple regression approach?
- Based on a priori knowledge of variables β we may know a relationship exists for some variables, but are interested in the added explanatory power of a new variable
- Several subsequent regression models are analysed (adding or removing predictor variables)
- We can use this to assess how much better one model explains the criterion variable than another (ΔR²)
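A numpy sketch of the hierarchical idea with simulated data: fit a base model, add the new predictor, and compare the two R² values (the variable names and coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)                  # predictor already known to matter
x2 = rng.normal(size=n)                  # new predictor of interest
y = 1.0 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)

def r_squared(design, y):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ beta
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

ones = np.ones(n)
r2_base = r_squared(np.column_stack([ones, x1]), y)      # Model 1: x1 only
r2_full = r_squared(np.column_stack([ones, x1, x2]), y)  # Model 2: add x2
delta_r2 = r2_full - r2_base                             # added explanatory power
print(round(delta_r2, 2))  # > 0: x2 explains additional variance
```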
what are some factors that affect multiple linear regression?
- Outliers
- Scedasticity
- Singularity & Multicollinearity
- Number of observations /Number of predictors
- Range of values
- Distribution of values
what are outliers?
- points which deviate substantially from most of the others can have a disproportionate effect on the linear regression fit
what does cookβs distance measure?
the extremity of an outlier; values greater than 1 are cause for concern
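Cook's distance can be computed from the residuals and leverages: Dᵢ = (eᵢ² / (p · MSE)) · hᵢᵢ / (1 − hᵢᵢ)², where p is the number of model parameters and hᵢᵢ is the point's leverage from the hat matrix. A sketch with made-up data containing one gross outlier:

```python
import numpy as np

# Hypothetical data: roughly y = x, with one gross outlier at the end
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 10.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.1, 20.0])

X = np.column_stack([np.ones_like(x), x])   # design matrix (intercept + x)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                            # residuals
p = X.shape[1]                              # number of parameters
mse = np.sum(e ** 2) / (len(y) - p)

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)                              # leverage of each point
cooks_d = (e ** 2 / (p * mse)) * h / (1 - h) ** 2

print(np.round(cooks_d, 2))
print(bool(cooks_d[-1] > 1))  # the outlier exceeds the rule-of-thumb cut-off
```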
what is scedasticity?
- refers to the distribution of the residual error (i.e., relative to the predictor variable)
- Homoscedasticity: residuals stay relatively constant over the range of the predictor variable
- Heteroscedasticity: residuals vary systematically across the range of the predictor variable
- Multiple linear regression assumes homoscedasticity
what is Multicollinearity?
refers to a high similarity between two or more variables (r > 0.9)
what is Singularity
refers to a redundant variable; typically, this results when one variable is a combination of two or more other variables (e.g., subscores of an intelligence scale)
issues with SINGULARITY & MULTICOLLINEARITY
- Logical: Donβt want to measure the same thing twice
- Statistical: Cannot solve regression problem because system is ill-conditioned
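A numpy sketch of both problems using simulated predictors (the variables are made up): multicollinearity shows up as off-diagonal correlations above 0.9, and singularity makes X'X rank-deficient, so the regression system cannot be solved uniquely:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1: multicollinear
x3 = x1 + x2                               # exact combination of others: singular

# Screen predictor pairs for |r| > 0.9 before fitting
X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# Singularity makes X'X rank-deficient (ill-conditioned system)
rank = np.linalg.matrix_rank(X.T @ X)
print(rank, X.shape[1])  # rank < number of predictors → no unique solution
```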
how does the number of observations and number of predictors affect multiple regression?
- Number of observations (n) should be high compared to the number of predictor variables (k)
- Results become meaningless (impossible to generalise due to overfitting) as n/k decreases
how does range and distribution affect multiple regression?
- Range: a small range (max − min) of the predictor variable restricts statistical power
- Distribution of variables: data should be normally or uniformly distributed
Always important to plot/visualise your data!