Correlations Flashcards
What is the best way to look at residuals?
PP plot
Linear function
Same variable but different units of measurement -> slopes become arbitrary
Would have complete shared variance
What are the differences between partial and semi-partial correlations?
Semi-partial - used to examine the additional predictive value of a predictor, residualises one variable
Partial - used to statistically control other predictors, residualises both variables
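A minimal sketch of the difference, assuming one control variable and made-up data (the names y, x, z and the helper functions are illustrative, not from the notes). The semi-partial correlation residualises only the predictor; the partial correlation residualises both the predictor and the DV.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residualise(target, control):
    """Residuals of target after a simple regression on control."""
    mt, mc = mean(target), mean(control)
    slope = (sum((c - mc) * (t - mt) for c, t in zip(control, target))
             / sum((c - mc) ** 2 for c in control))
    intercept = mt - slope * mc
    return [t - (intercept + slope * c) for t, c in zip(target, control)]

# Made-up illustrative data: y is the DV, x the predictor of interest,
# z the variable being controlled for.
y = [2.0, 4.0, 5.0, 4.0, 7.0, 8.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z = [1.0, 1.0, 2.0, 3.0, 3.0, 5.0]

x_res = residualise(x, z)
y_res = residualise(y, z)

semi_partial = pearson(y, x_res)   # only x is residualised
partial = pearson(y_res, x_res)    # both x and y are residualised
```

The partial correlation is never smaller in absolute size than the semi-partial, because its denominator also removes the variance in y that z explains.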
When do you use multiple regression?
If you want to predict a response variable using many predictors
Can also determine the unique contribution of each predictor when we have more than one predictor
When do you use semi-partial correlation?
If you want to determine how much benefit a predictor gives you on top of several other predictors
When do you use partial correlation?
If you want to examine the strength of a relationship between variables while holding other variables constant
What would you expect if you correlate a z score and percentile of a variable?
Not complete correlation but very close
Perfect Kendall’s Tau
Very highly correlated = collinearity (not linear function though)
Radically changes p value, standard error etc - cannot identify a unique effect of z score
What is multi-collinearity?
Two or more predictors are highly correlated with each other (one can be closely predicted from the others), whether or not they are correlated with the DV
What is VIF?
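Variance inflation factor - collinearity diagnostic
Increases with the correlation among predictors
VIF > 9 is considered problematic (square root of VIF > 3, i.e. standard errors more than tripled)
Calculated by regressing each predictor on the other predictors, ignoring the DV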
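A small sketch of the calculation, assuming the standard definition VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others. With only two predictors that R^2 is just their squared correlation, which is the simplified case shown here; the data and function names are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF for either predictor in a two-predictor model (the DV is not used)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Moderately correlated predictors (r = 0.8)
vif_moderate = vif_two_predictors([1, 2, 3, 4], [1, 3, 2, 4])

# Nearly identical predictors: collinear
vif_collinear = vif_two_predictors(
    [1, 2, 3, 4, 5, 6],
    [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
)

# The square root of VIF is the factor by which the standard error is inflated
se_inflation = math.sqrt(vif_moderate)
```

With more than two predictors the same idea applies, but R^2 has to come from a multiple regression of each predictor on all the others.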
In what circumstances can’t you have a linear relationship?
Between a predictor and a discrete DV
What is correlation?
It’s all about prediction - if there is a relationship between two variables we can use x to estimate y
How can we characterise a relationship?
Strength - how well one variable can predict another
Form - what is the shape of the relationship
Direction (if form is monotone) - is the direction positive or negative
What are the criteria for strength?
There are none - it’s a subjective idea
What is Kendall’s Tau?
A non parametric correlation test
Used when the data set is small and has a large number of tied ranks
How is Kendall’s Tau useful?
Can draw more accurate generalisations with Kendall’s Tau than Spearman’s
Helps us understand strength and direction of monotone relationships
Resistant to outliers
Tau-b is the version used to handle tied ranks
How are movements between points characterised?
Consistent - as you go up in x you go up in y (positive)
Inconsistent - as you go up in x you go down in y (negative)
How do you calculate Tau?
- Calculate the proportion of consistent movements (con/total)
- T = (2 X proportion of consistent movements) - 1
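The two steps above can be sketched as follows, assuming no tied values (ties are counted as not consistent here, which is the situation Tau-b is designed to correct). The function name and data are illustrative.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Tau = (2 x proportion of consistent movements) - 1, assuming no ties."""
    consistent = 0
    total = 0
    # Look at every pair of points and ask whether x and y move the same way
    for (xa, ya), (xb, yb) in combinations(zip(x, y), 2):
        total += 1
        if (xb - xa) * (yb - ya) > 0:
            consistent += 1
    return 2 * consistent / total - 1

# Made-up data: 8 of the 10 movements between points are consistent
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
tau = kendall_tau(x, y)

# Any strictly increasing transformation keeps every movement consistent
tau_monotone = kendall_tau(x, [v ** 3 for v in x])
```

Because only the direction of each movement matters, a curved but strictly increasing relationship still gives a Tau of 1.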
What makes Kendall’s Tau non-parametric?
Slope and intercept aren’t needed so it doesn’t assume a parametric form for the relationship
What is standardisation?
Converting scores into a standard set of units (SDs) to overcome the problem of dependence on the measurement scale
Pearson’s correlation
Coefficient = r -> ranges between -1 and 1 (0 = no linear relationship)
For linear relationships only
Highly sensitive to outliers
Strong when big x standardised scores are paired with big y standardised scores
Positive when positive x standardised scores are paired with positive y standardised scores and vice versa
What is a z-score?
SD score - how many SDs a value lies above or below the mean: z = (x - mean of x) / SDx
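A short sketch linking standardisation to Pearson’s r, assuming the sample SD (n - 1 in the denominator): r is the sum of the paired z-score products divided by n - 1. The data and function names are made up for illustration.

```python
import statistics

def z_scores(values):
    """Convert raw scores to SD units using the sample SD (n - 1)."""
    m = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - m) / sd for v in values]

def pearson_r(x, y):
    """r as the sum of z-score products divided by n - 1."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
r = pearson_r(x, y)

# Changing the units of x (e.g. metres to centimetres) leaves r unchanged,
# because the z-scores are identical
r_rescaled = pearson_r([v * 100 for v in x], y)
```

This is why r is large when big standardised x scores are paired with big standardised y scores, and why it does not depend on the measurement scale.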
How do you compare independent correlations?
Transform the r’s into z values using Fisher’s z transformation
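A sketch of the comparison, assuming two correlations from independent samples and the usual large-sample test: each r is transformed with z = atanh(r), and the difference is divided by the standard error sqrt(1/(n1 - 3) + 1/(n2 - 3)). The function name and example values are illustrative.

```python
import math

def compare_independent_correlations(r1, n1, r2, n2):
    """Return the z statistic and two-sided p value for r1 vs r2."""
    z1, z2 = math.atanh(r1), math.atanh(r2)        # Fisher's z transformation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))    # SE of the difference
    z_stat = (z1 - z2) / se
    # Two-sided p value from the standard normal distribution
    p_two_sided = math.erfc(abs(z_stat) / math.sqrt(2))
    return z_stat, p_two_sided

# Hypothetical example: r = .50 in one sample of 103, r = .00 in another of 103
z_stat, p_value = compare_independent_correlations(0.5, 103, 0.0, 103)
```

The transformation is needed because the sampling distribution of r is skewed, especially near -1 and 1, whereas the transformed values are approximately normal.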
What is the first step in regression?
Units must be unstandardised
How is the regression slope defined?
b1 = (SDy/SDx) X r
How do you find the intercept for regression?
b0 = mean of y - ( b1 X mean of x)
What is the linear regression model?
yi = b0 + b1(xi) + error
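The slope and intercept formulas can be sketched directly, assuming sample SDs and made-up data (the helper names are illustrative).

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """b1 = (SDy / SDx) x r, then b0 = mean of y - (b1 x mean of x)."""
    r = pearson(x, y)
    b1 = (statistics.stdev(y) / statistics.stdev(x)) * r
    b0 = statistics.mean(y) - b1 * statistics.mean(x)
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)

# Predicted values from the model; the error term is what is left over
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]
```

The fitted line always passes through the point (mean of x, mean of y), which is what the intercept formula guarantees.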
What are the assumptions of linear regression?
The true relationship is linear, has intercept b0 and slope b1, and is contaminated by error
All errors are independent - can’t be assumed when observations are correlated with each other, e.g. repeated measurements over time
What are residuals?
Prediction errors
The regression line minimises the sum of squared residuals
Used to diagnose problems with assumptions
What are good residuals?
Don’t show systematic trends
Equally variable
Normally distributed
Don’t have outliers
What is R^2?
The coefficient of determination
Measures the amount of variability in one variable that is shared by another variable
Tells us how close points are to the line - error
Compare how well regression line can predict y compared to mean of y
How do you calculate R^2?
R^2 = 1 - (SS of residuals around the regression line / SS of residuals around the mean of y)
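A worked sketch of the calculation, reading the numerator as the sum of squared residuals around the regression line and the denominator as the sum of squared deviations around the mean of y. The data are made up and the helper name is illustrative.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    return my - b1 * mx, b1

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)

mean_y = sum(y) / len(y)
# Prediction error when using the regression line
ss_line = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
# Prediction error when using the mean of y for everyone
ss_mean = sum((yi - mean_y) ** 2 for yi in y)

r_squared = 1 - ss_line / ss_mean
```

Here the line leaves 2.4 units of squared error against 6 for the mean, so the line removes 60% of the variability in y.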
What is adjusted R^2?
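Corrects R^2 for overfitting
Minimising the sum of squared residuals will give you the best possible line for the sample
Even the true regression line won’t do better in that sample
So sample R^2 is biased upwards -> in practice it can’t be 0, even when the true R^2 is 0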
Use SPSS
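The notes do not give the formula. A sketch assuming the commonly used adjustment, 1 - (1 - R^2)(n - 1) / (n - k - 1), where n is the number of cases and k the number of predictors (the function name is illustrative):

```python
def adjusted_r_squared(r_squared, n, k):
    """Shrink R^2 according to sample size n and number of predictors k."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical values: R^2 = .60 from 5 cases and 1 predictor
adj_small_sample = adjusted_r_squared(0.6, 5, 1)

# The same R^2 from 500 cases is barely adjusted
adj_large_sample = adjusted_r_squared(0.6, 500, 1)
```

The penalty grows as predictors are added and shrinks as the sample gets larger, and unlike R^2 the adjusted value can fall below 0.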
What is the formula for multiple regression?
Yi = B0 + B1x1i + B2x2i + B3x3i + error
What is collinearity?
A limitation of multiple regression
Occurs when predictors are highly correlated - contain essentially the same information
No unique contribution or relationship can be determined
What are the symptoms of collinearity?
Strong predictors are nonetheless non-significant
Large standard errors
Coefficients change radically when new predictors are added
High VIF
R^2 for many predictors together is basically the same as for each separately
How do we control for variables?
We hold them constant by residualising them
What is an interaction?
The effect of one IV differs as a function of another IV - (Geoff)
Product of two predictors and the slope of one variable depends on another (Main one from stats)
What is centering?
Subtracting the mean of x from every x value so that the intercept is the predicted y when x = mean of x
This increases meaningfulness of intercept and reduces collinearity with interactions and polynomial regressions
How do you centre an intercept? (Formula)
Yi = B0 + B1(Xi - mean of X) + error
What changes after centering?
The intercept (SE and Significance)
The slope does NOT change in SE or significance
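A quick sketch of what centering does to the estimates, using made-up data and an illustrative helper. It only shows the coefficients; standard errors and significance are not computed here.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    return my - b1 * mx, b1

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Centre x by subtracting its mean from every value
mean_x = sum(x) / len(x)
x_centred = [xi - mean_x for xi in x]

b0_raw, b1_raw = fit_line(x, y)
b0_centred, b1_centred = fit_line(x_centred, y)
```

The slope is the same in both fits, while the intercept moves from the predicted y at x = 0 to the predicted y at the mean of x, which equals the mean of y.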
What is Dummy Coding?
What is the formula?
Used when you want to include a discrete predictor, e.g. gender (g)
You would code one group 1 and the other 0, which alters the equation when substituted in -> for 1 the B2 term remains and for 0 it disappears
Yi = B0 + B1xi + B2gi + error
What is dummy coding measuring?
As B0 and B2 sum together to make the intercept for the group coded 1, dummy coding is looking to see how much B2 changes the intercept
How do we create an interaction using dummy coding?
Add a new variable that is the product of the dummy code variable (gender) and x, giving the extra term B3(gi X xi)
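A sketch of what the coefficients mean in Yi = B0 + B1xi + B2gi + B3(gi X xi) + error. It relies on the fact that, for this particular model, least squares gives the same two lines as fitting each group separately, so the coefficients can be recovered as differences. The data are made up to lie exactly on two lines, and the helper names are illustrative; standard errors are not obtained this way.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    return my - b1 * mx, b1

# Group coded 0: y = 1 + 2x; group coded 1: y = 3 + 0.5x
x_g0, y_g0 = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
x_g1, y_g1 = [0.0, 1.0, 2.0, 3.0], [3.0, 3.5, 4.0, 4.5]

int0, slope0 = fit_line(x_g0, y_g0)
int1, slope1 = fit_line(x_g1, y_g1)

B0 = int0            # intercept for the group coded 0
B1 = slope0          # slope for the group coded 0
B2 = int1 - int0     # how much the intercept changes for the group coded 1
B3 = slope1 - slope0 # how much the slope changes for the group coded 1

def predict(x, g):
    """Substitute g = 0 or g = 1 into the dummy-coded equation."""
    return B0 + B1 * x + B2 * g + B3 * g * x
```

When g = 0 the B2 and B3 terms disappear; when g = 1 the line becomes (B0 + B2) + (B1 + B3)x, so a non-zero B3 means the slope of x depends on group.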
What are the types of interaction?
Continuous/continuous -> rare (polynomial)
Discrete/continuous
Discrete/discrete -> ANOVA
How do you centre when using dummy codes?
Use contrast codes instead of 0 and 1 - in this case -0.5 and +0.5
How do you centre continuous predictors?
Using mean
What is polynomial regression?
Regression that is still linear in its coefficients but models a non-linear (curved) relationship between x and y
Contains squared terms - which is like an interaction between x and itself - causes slope to change
Causes curves in form
X must be centred
What is a linear combination?
Sum of terms multiplied by constants (doesn’t mean it’s linear)
What is the form of a polynomial?
b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n
What is the letter n in a polynomial?
The order or degree
Highest power
What does increasing parameters cause?
An increase in flexibility which is not always a good thing
What is the benefit of centering x in a polynomial?
Prevents collinearity between x and x squared
Allows it to fit the curve better and identify unique components causing curvature
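A small demonstration of the collinearity point, using made-up x values from 1 to 9 and an illustrative helper. Because these values are symmetric around their mean, the correlation after centering drops all the way to zero; with skewed data it would be reduced rather than removed.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

# Uncentred: x and x squared rise together, so they are nearly collinear
r_raw = pearson(x, [v ** 2 for v in x])

# Centred: subtract the mean first, then square
mean_x = sum(x) / len(x)
x_c = [v - mean_x for v in x]
r_centred = pearson(x_c, [v ** 2 for v in x_c])
```

After centering, the squared term carries only the curvature, so the linear and quadratic components can be identified separately.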
How do you deal with overfitting?
Use adjusted R^2
Replication
Cross validation - split data into parts and fit curve to one part (fit data) then test in other part (hold out data)
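A minimal hold-out sketch of cross validation, assuming a straight-line model, a simple alternating split, and made-up data (all names are illustrative). The line is fitted to the fit data only and then judged on the hold-out data.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    return my - b1 * mx, b1

def mean_squared_error(x, y, b0, b1):
    """Average squared prediction error of the line on the given cases."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# Split into fit data (every other case) and hold-out data (the rest)
x_fit, y_fit = x[::2], y[::2]
x_hold, y_hold = x[1::2], y[1::2]

b0, b1 = fit_line(x_fit, y_fit)
fit_error = mean_squared_error(x_fit, y_fit, b0, b1)
holdout_error = mean_squared_error(x_hold, y_hold, b0, b1)
```

An overfitted model shows a small fit error but a much larger hold-out error; in practice the split would normally be random rather than alternating.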
What are the assumptions of linear regression analysis?
DV is interval scaled
DV is a linear combination of predictors
Observations/errors are independent
Homoscedasticity - errors are equally variable (heteroscedasticity violates this)
Errors are normally distributed
What is the regression technique for a binary DV?
Logistic regression
What is the regression technique for counts with upper limits?
Logistic regression
What is the regression technique for counts without upper limits?
Regression using rates
What is the regression technique used for time to event DVs?
Regression using rates
What is the regression technique used for ordinal DV?
Ordinal regression