Correlation & Multiple Regression Flashcards

1
Q

What is correlation?

A
  • An association or dependency between two independently observed variables
2
Q

Analysis of correlation: what do the scores mean?

A

r = 0.0 when X and Y are completely (linearly) unrelated to each other

r = +1.0 when they are identical up to a positive linear rescaling (perfect positive relationship)

r = −1.0 when they are exactly inverse to one another (perfect negative relationship)
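
A quick illustration in Python (SciPy; the simulated data is just for demonstration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)

print(stats.pearsonr(x, 2 * x + 3)[0])             # +1.0: perfect positive relationship
print(stats.pearsonr(x, -x)[0])                    # -1.0: exactly inverse
print(stats.pearsonr(x, rng.normal(size=100))[0])  # ~0.0: unrelated
```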

3
Q

What is partial correlation?

A

The correlation between two variables after statistically removing the influence of one or more other variables

i.e., the relationship between X and Y while controlling for Z
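
A minimal sketch of the usual residual-based computation of partial correlation (simulated data; variable names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
x = z + rng.normal(size=n)   # X depends on Z
y = z + rng.normal(size=n)   # Y depends on Z

# Regress X and Y on Z, then correlate the residuals:
res_x = x - np.polyval(np.polyfit(z, x, 1), z)
res_y = y - np.polyval(np.polyfit(z, y, 1), z)
r, p = stats.pearsonr(res_x, res_y)
print(r, p)   # much weaker than the raw correlation of X and Y
```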

4
Q

What is multiple linear regression?

A

Multiple linear regression is conceptually similar to correlation

Major difference: it describes the relationship between one or more predictor variables (X1, X2, etc.) and a single criterion variable (Y)
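
A minimal sketch with statsmodels (predictor names and simulated data are illustrative assumptions, not from the source):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)                        # predictor X1
x2 = rng.normal(size=n)                        # predictor X2
y = 0.5 * x1 + 0.8 * x2 + rng.normal(size=n)   # criterion Y

X = sm.add_constant(np.column_stack([x1, x2])) # adds the intercept term
model = sm.OLS(y, X).fit()
print(model.params)                            # intercept, beta1, beta2
print(model.rsquared)                          # variance explained
```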

5
Q

Higher the beta… (MR)

A

The stronger the relationship between that predictor and the criterion

6
Q

Beta tells us… (MR)

A

How well each predictor (e.g., neuroticism or stress) predicts the criterion (e.g., depression)
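
One common way to obtain directly comparable (standardized) betas is to z-score all variables before fitting; a sketch with illustrative variable names and simulated data:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import zscore

rng = np.random.default_rng(3)
n = 200
neuroticism = rng.normal(size=n)
stress = rng.normal(size=n)
depression = 0.3 * neuroticism + 0.6 * stress + rng.normal(size=n)

# z-scoring predictors and criterion yields standardized betas
X = sm.add_constant(np.column_stack([zscore(neuroticism), zscore(stress)]))
fit = sm.OLS(zscore(depression), X).fit()
print(fit.params[1:])  # standardized beta for neuroticism vs for stress
```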

7
Q

prediction error is…

A

The difference between the actual Y values and the predicted values (Ŷ)

We aim to minimise this

It can be expressed as the residual sum of squares
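
A short sketch computing the residuals and the residual sum of squares for a least-squares line (simulated data):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 2.0 * x + 1.0 + rng.normal(size=50)

b1, b0 = np.polyfit(x, y, 1)    # least-squares slope and intercept
y_hat = b0 + b1 * x             # predicted values
residuals = y - y_hat           # prediction errors
print(np.sum(residuals ** 2))   # residual sum of squares (minimised by the fit)
```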

8
Q

y = ax + b is the same as..

A

Y = β₀ + β₁X₁ (the intercept b becomes β₀ and the slope a becomes β₁)

9
Q

Multiple correlation coefficient (R)

A

The correlation between the predicted values (Ŷ) and the observed values (Y)

10
Q

Coefficient of determination (R²)

A

The proportion of variance explained by the regression model
This is simply the square of the multiple correlation coefficient (R)
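
A quick check in Python that squaring the correlation between Ŷ and Y reproduces R² (simulated data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 100
X = rng.normal(size=(n, 2))
y = X @ np.array([0.5, 0.8]) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
R = np.corrcoef(fit.fittedvalues, y)[0, 1]  # multiple correlation coefficient R
print(R ** 2, fit.rsquared)                 # the two values agree
```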

11
Q

F-Ratio

A

As for ANOVA, we can derive an F-ratio contrasting the proportion of explained variance with the residual variance, allowing a statistical test
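
A sketch of the F-ratio computed from the sums of squares and checked against statsmodels (simulated data; m denotes the number of predictors):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n, m = 100, 2                                   # observations, predictors
X = rng.normal(size=(n, m))
y = X @ np.array([0.5, 0.8]) + rng.normal(size=n)
fit = sm.OLS(y, sm.add_constant(X)).fit()

ssm = np.sum((fit.fittedvalues - y.mean()) ** 2)   # explained variation
ssr = np.sum(fit.resid ** 2)                       # residual variation
F = (ssm / m) / (ssr / (n - m - 1))                # F = MS_model / MS_residual
print(F, fit.fvalue)                               # matches statsmodels' F
```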

12
Q

Assessing goodness-of-fit: sums of squares

A

Total sum of squares (SST): how far all the data points vary from the mean

Residual sum of squares (SSR): the difference between the actual and the predicted values

Model sum of squares (SSM): how much the model's predictions vary from the mean, i.e., the improvement over using the mean as the best guess

13
Q

Equation for the coefficient of determination (R²)

A

R² = SSM / SST
or, equivalently,
R² = 1 − SSR / SST
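
A short verification of the sums-of-squares decomposition and both R² formulas from the previous two cards (simulated data):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=80)
y = 1.5 * x + rng.normal(size=80)
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total SS: data vs the mean
ssr = np.sum((y - y_hat) ** 2)         # residual SS: actual vs predicted
ssm = np.sum((y_hat - y.mean()) ** 2)  # model SS: predictions vs the mean
print(np.isclose(sst, ssm + ssr))      # SST = SSM + SSR
print(ssm / sst, 1 - ssr / sst)        # both give the same R^2
```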

14
Q

Higher F-ratios indicate…?

A

Better models

15
Q

Effect size for MR

A

Cohen’s f2
small = 0.02
medium = 0.15
large = 0.35
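
For the full model, Cohen's f² can be computed from R² as f² = R² / (1 − R²); a minimal sketch (simulated data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 150
X = rng.normal(size=(n, 2))
y = X @ np.array([0.3, 0.4]) + rng.normal(size=n)

r2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared
f2 = r2 / (1 - r2)   # Cohen's f^2 for the full model
print(f2)            # compare with 0.02 (small), 0.15 (medium), 0.35 (large)
```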

16
Q

Multiple regression approaches

A

Simultaneous
Stepwise
Hierarchical

17
Q

Simultaneous (standard) approach

A

No a priori model assumed
All predictor variables are fit together

18
Q

Stepwise approach

A

No a priori model
Predictor variables are added/removed one at a time to maximise fit
Generally not a good approach, because it tends to overfit the data

19
Q

Hierarchical approach

A

Based on a priori knowledge of the variables – we may know a relationship exists for some variables, but are interested in the added explanatory power of a new variable

Several regression models are analysed in sequence (adding or removing predictor variables)

We can use this to assess how much better one model explains the criterion variable than another (ΔR²); e.g., a large ΔR² when stress is added to a model already containing neuroticism shows that stress predicts depression scores over and above neuroticism
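
A hedged sketch of a hierarchical comparison with statsmodels (the neuroticism/stress/depression names and simulated data are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 200
neuroticism = rng.normal(size=n)
stress = 0.4 * neuroticism + rng.normal(size=n)
depression = 0.2 * neuroticism + 0.7 * stress + rng.normal(size=n)

# Step 1: neuroticism only; Step 2: add stress
m1 = sm.OLS(depression, sm.add_constant(neuroticism)).fit()
m2 = sm.OLS(depression,
            sm.add_constant(np.column_stack([neuroticism, stress]))).fit()

delta_r2 = m2.rsquared - m1.rsquared   # added explanatory power of stress
print(delta_r2)
print(m2.compare_f_test(m1))           # F-test of the R^2 change: (F, p, df)
```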

20
Q

Factors affecting multiple linear regression

A
  • Outliers
  • Scedasticity: how the variability of the residuals behaves across the data
  • Singularity & multicollinearity
  • Number of observations / number of predictors
  • Range of values: how much variability there is
  • Distribution of values (e.g., normality)
21
Q

Scedasticity

A
  • Scedasticity refers to the distribution of the residual error (i.e., relative to the predictor variable)
    • Homoscedasticity: residuals stay relatively constant over the range of the predictor variable
    • Heteroscedasticity: residuals vary systematically across the range of the predictor variable
  • Multiple linear regression assumes homoscedasticity
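
One way to test for heteroscedasticity is the Breusch-Pagan test in statsmodels; a sketch on deliberately heteroscedastic simulated data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(10)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2 * x + rng.normal(scale=x, size=n)   # residual spread grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(fit.resid, X)
print(lm_p)   # a small p-value flags heteroscedasticity
```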
22
Q

Singularity and multicollinearity

A
  • Multicollinearity refers to a very high correlation between two or more predictor variables (r > 0.9)
  • Singularity refers to a redundant variable; typically, this results when one variable is a combination of two or more other variables (e.g., subscores of an intelligence scale)
  • Problems with these:
    • Logical: Don’t want to measure the same thing twice
    • Statistical: Cannot solve regression problem because system is ill-conditioned
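
Multicollinearity is often screened with variance inflation factors (VIF); a sketch using statsmodels (the common VIF > 10 rule of thumb is an assumption, not from the source):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(11)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1 (r > 0.9)
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):              # skip the constant column
    print(variance_inflation_factor(X, i))  # very large VIF for x1 and x2
```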
23
Q

Number of observations, number of predictors

A
  • The number of observations (N) should be high relative to the number of predictor variables (m)
    • Results become meaningless (impossible to generalise, due to overfitting) as N/m decreases
  • Rule of thumb (medium effect size):
    • N > 50 + 8m (e.g., with m = 4 predictors, N should exceed 50 + 32 = 82)