Correlation & multiple regression Flashcards

1
Q

what is correlation?

A
  • An association or dependency between two independently observed variables
  • use a scatterplot to visualise a correlation
2
Q

what does Pearson's correlation coefficient do?

A
  • tells you how strong the linear relationship is between X and Y
  • it's a number between -1 and 1
  • 0 = no linear relationship between them
  • 1.0 = a perfect positive linear relationship
  • -1.0 = a perfect negative (inverse) linear relationship
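As a sketch in plain Python (hypothetical data), r can be computed directly from its definition — the sum of products of deviations from the means, scaled by the spread of each variable:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # numerator: how the deviations from the means co-vary
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # denominator: the spread of each variable on its own
    den = sqrt(sum((xi - mean_x) ** 2 for xi in x)) * sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return num / den

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2 * xi + 1 for xi in x]))   # perfect positive linear relation: r ≈ 1.0
print(pearson_r(x, [-2 * xi + 1 for xi in x]))  # perfect negative linear relation: r ≈ -1.0
```
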
3
Q

when is covariance greater?

A

when the values of X and Y deviate from their means in the same direction (i.e., the variables vary together more)
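A minimal sketch of the sample covariance (hypothetical data): when Y tracks X closely, the products of deviations all pull in the same direction, so the covariance is larger.

```python
def covariance(x, y):
    """Sample covariance: average product of deviations from the means (n - 1 denominator)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y_similar = [1, 2, 3, 4, 5]      # moves exactly with x
y_loose   = [3, 1, 4, 2, 5]      # only loosely follows x
print(covariance(x, y_similar))  # larger
print(covariance(x, y_loose))    # smaller
```
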

4
Q

when do we conduct a Pearson's coefficient (r)?

A

two interval/ratio variables

5
Q

when do we conduct a Spearman's rank coefficient?

A

two ordinal (rank) variables

6
Q

when do we conduct a Kendall's rank coefficient?

A

two ordinal (rank) variables (an alternative to Spearman's, often preferred when there are many tied ranks)

7
Q

when do we conduct a Phi coefficient?

A

two true dichotomy variables

8
Q

when do we conduct a point-biserial coefficient?

A

one true dichotomy variable and one interval/ratio variable

9
Q

what is partial correlation?

A

the correlation between two variables after removing (partialling out) the variance they share with one or more other, overlapping variables

10
Q

what is multiple regression?

A

it describes the relationship between one or more predictor variables (X₁, X₂, etc.) and a single criterion (Y)

11
Q

linear regression equation

A

Ŷ = β₀ + β₁X₁ + β₂X₂ + … + βₘXₘ

Ŷ = the predicted value of the criterion variable Y
β₀ = the intercept term
βᵢ = the i-th regression coefficient, indicating how strongly predictor variable Xᵢ can be used to predict Y in the model
m = the number of predictor variables in the model
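The equation is just an intercept plus a weighted sum of the predictors; a minimal sketch with hypothetical coefficients:

```python
def predict(beta0, betas, xs):
    """Compute the predicted value: beta0 + beta1*X1 + ... + betam*Xm for one observation."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))

# hypothetical model: intercept 1.0, two predictors with weights 0.5 and 2.0
print(predict(1.0, [0.5, 2.0], [4.0, 3.0]))  # 1.0 + 0.5*4 + 2.0*3 = 9.0
```
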

12
Q

what is y = ax + b equivalent to?

A

Ŷ = β₀ + β₁X₁
where a (the slope) corresponds to β₁ and b (the y-intercept) corresponds to β₀
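A minimal sketch of estimating the slope and intercept by least squares (hypothetical data that lies exactly on y = 2x + 1):

```python
def fit_line(x, y):
    """Least-squares estimates of slope (a = beta1) and intercept (b = beta0) for y = a*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # slope: covariance of x and y over the variance of x
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept

x = [0, 1, 2, 3]
y = [1, 3, 5, 7]          # exactly y = 2x + 1
print(fit_line(x, y))     # slope 2.0, intercept 1.0
```
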

13
Q

what is the equation for residual error?

A

ε = Y − Ŷ

14
Q

what is the equation for the variance unexplained?

A

SS_R = Σ(Y − Ŷ)²
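As a sketch (hypothetical observed values and model predictions), SS_R is just the squared residuals from the previous card, summed:

```python
def ss_residual(y, y_hat):
    """SS_R = sum of (Y - Y_hat)^2: the variance left unexplained by the model."""
    return sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

y     = [1.0, 3.0, 5.0, 7.0]
y_hat = [1.5, 2.5, 5.5, 6.5]  # hypothetical model predictions
print(ss_residual(y, y_hat))  # four residuals of ±0.5, so 4 * 0.25 = 1.0
```
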

15
Q

what is the equation for the variance explained?

A

SS_M = Σ(Ŷ − Ȳ)²

16
Q

what is prediction error?

A

the difference between the actual values 𝑌 and the predicted values 𝑌̂
ε = Y − Ŷ

17
Q

what is the goal of a regression?

A

to find the best fit between the model and the observations, by adjusting the values of βᵢ until the prediction error is minimised

18
Q

what is multiple correlation coefficient (R)?

A

Correlation between the predicted values Ŷ and the observed values Y
- cannot be calculated directly
- has to be calculated as the square root of the coefficient of determination (R²)

19
Q

what is the coefficient of determination (R²)?

A
  • Proportion of variance explained by the regression model
  • This is simply the square of the multiple correlation coefficient
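As a sketch in plain Python (hypothetical data), fitting a simple one-predictor model and computing R² from the two sums of squares, with R recovered as its square root:

```python
from math import sqrt

def fit_and_r2(x, y):
    """Fit a simple OLS line, then compute R^2 = SS_M / (SS_M + SS_R)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    y_hat = [intercept + slope * a for a in x]
    ss_m = sum((yh - my) ** 2 for yh in y_hat)             # variance explained
    ss_r = sum((b - yh) ** 2 for b, yh in zip(y, y_hat))   # variance unexplained
    return ss_m / (ss_m + ss_r)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]
r2 = fit_and_r2(x, y)
print(r2, sqrt(r2))  # R², and R = √R²
```
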
20
Q

F-ratio

A

the ratio of explained variance to residual variance, allowing a statistical test of the overall regression model
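A minimal sketch of the F-ratio from the sums of squares, each divided by its degrees of freedom (the SS values and sample size here are hypothetical):

```python
def f_ratio(ss_m, ss_r, n, m):
    """F = (SS_M / m) / (SS_R / (n - m - 1)): explained vs. residual variance per degree of freedom."""
    return (ss_m / m) / (ss_r / (n - m - 1))

# hypothetical simple regression: m = 1 predictor, n = 5 observations
print(f_ratio(10.0, 4.8, n=5, m=1))  # (10 / 1) / (4.8 / 3) = 6.25
```
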

21
Q

effect sizes for multiple linear regression (Cohen's f²)

A

small effect size = Cohen's f² of 0.02
medium effect size = Cohen's f² of 0.15
large effect size = Cohen's f² of 0.35
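Cohen's f² can be derived directly from the coefficient of determination via f² = R² / (1 − R²); a minimal sketch:

```python
def cohens_f2(r2):
    """Cohen's f^2 = R^2 / (1 - R^2) for a multiple linear regression."""
    return r2 / (1 - r2)

# R² values of roughly 0.02, 0.13 and 0.26 land near the
# small / medium / large benchmarks of 0.02, 0.15 and 0.35
for r2 in (0.02, 0.13, 0.26):
    print(round(cohens_f2(r2), 3))
```
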

22
Q

what is a simultaneous (standard) multiple regression approach?

A
  • No a priori model assumed
  • All predictor variables are fit together
23
Q

what is a stepwise approach to multiple regression?

A
  • No a priori model
  • Predictor variables are added/removed one at a time, to maximize fit
  • Not a good approach, because it tends to overfit the data
24
Q

what is a hierarchical multiple regression approach?

A
  • Based on a priori knowledge of variables – we may know a relationship exists for some variables, but are interested in the added explanatory power of a new variable
  • Several subsequent regression models are analysed (adding or removing predictor variables)
  • We can use this to assess how much better one model explains the criterion variable than another (ΔR²)
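As a sketch (hypothetical data), ΔR² for adding a second predictor can be computed from the pairwise correlations, using the standard two-predictor identity R² = (r_y1² + r_y2² − 2·r_y1·r_y2·r_12) / (1 − r_12²):

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def delta_r2(x1, x2, y):
    """Added explanatory power (delta R^2) when predictor X2 joins a model containing X1."""
    r_y1, r_y2, r_12 = pearson_r(y, x1), pearson_r(y, x2), pearson_r(x1, x2)
    r2_full = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    r2_x1_only = r_y1**2  # R^2 of the simpler model with X1 alone
    return r2_full - r2_x1_only

x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y  = [1, 3, 2, 5, 6]
print(delta_r2(x1, x2, y))  # how much extra variance X2 explains beyond X1
```
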
25
Q

what are some factors that affect multiple linear regression?

A
  • Outliers
  • Scedasticity
  • Singularity & Multicollinearity
  • Number of observations /Number of predictors
  • Range of values
  • Distribution of values
26
Q

what are outliers?

A
  • points which deviate substantially from most of the others can have a disproportionate effect on the linear regression fit
27
Q

what does cook’s distance measure?

A

the extremity of an outlier; values greater than 1 are cause for concern
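A sketch of Cook's distance for a simple (one-predictor) regression, using the standard formula D_i = (e_i² / (p·MSE)) · h_ii / (1 − h_ii)² with leverage h_ii = 1/n + (x_i − x̄)² / Σ(x_j − x̄)²; the data are hypothetical, with one deliberate outlier:

```python
def cooks_distance(x, y):
    """Cook's distance for each observation in a simple linear regression (p = 2 parameters)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - 2)              # residual variance
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]     # h_ii
    p = 2  # parameters estimated: slope and intercept
    return [(e ** 2 / (p * mse)) * h / (1 - h) ** 2 for e, h in zip(resid, leverage)]

x = [1, 2, 3, 4, 5, 10]
y = [2, 4, 6, 8, 10, 2]   # the last point breaks the otherwise perfect y = 2x trend
d = cooks_distance(x, y)
print(d.index(max(d)))    # the outlier dominates; its distance is well above the 1.0 rule of thumb
```
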

28
Q

what is scedasticity?

A
  • refers to the distribution of the residual error (i.e., relative to the predictor variable)
  • Homoscedasticity: residuals stay relatively constant over the range of the predictor variable
  • Heteroscedasticity: residuals vary systematically across the range of the predictor variable
  • Multiple linear regression assumes homoscedasticity
29
Q

what is Multicollinearity?

A

refers to a high similarity between two or more variables (𝑟 > 0.9)
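A minimal screening sketch (hypothetical predictors): compute the pairwise correlations and flag any pair above the r > 0.9 rule of thumb.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def collinear_pairs(predictors, threshold=0.9):
    """Flag predictor pairs whose |r| exceeds the multicollinearity threshold."""
    names = list(predictors)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson_r(predictors[a], predictors[b])) > threshold]

# hypothetical predictors: x2 is nearly a copy of x1
preds = {"x1": [1, 2, 3, 4, 5], "x2": [1.1, 2.0, 3.1, 4.0, 5.1], "x3": [5, 1, 4, 2, 3]}
print(collinear_pairs(preds))  # [('x1', 'x2')]
```
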

30
Q

what is Singularity

A

refers to a redundant variable; typically, this results when one variable is a combination of two or more other variables (e.g., subscores of an intelligence scale)

31
Q

issues with SINGULARITY & MULTICOLLINEARITY

A
  • Logical: Don’t want to measure the same thing twice
  • Statistical: Cannot solve regression problem because system is ill-conditioned
32
Q

how does the number of observations and number of predictors affect multiple regression?

A
  • Number of observations (𝑁) should be high compared to the number of predictor variables (𝑚)
  • Results become meaningless (impossible to generalise due to overfitting) as 𝑁/𝑚 decreases
33
Q

how does range and distribution affect multiple regression?

A
  • Range:
    Small range (max-min) of the predictor variable restricts statistical power
  • Distribution of variables:
    Data should be normally or uniformly distributed
    Always important to plot/visualise your data!