Week 12 Flashcards
Regression is?
- More Fiddly than other methods
- Has more assumptions
Why do Linear Regression
- Not looking at differences
- Looking at relationships
- Regression goes further than correlation - Allows us to make predictions
- Produces a model that allows for sophisticated exploration of relationships between variables
In Second Year Stats
- Looked at relationships - Correlation
- Differences Between Groups and Within-Groups
- Used t-tests and ANOVAs
- Variation in Dependent Variable
Correlation
Allows us to estimate direction and strength of a linear relationship
Why do Linear Regression
- How well will a set of variables predict an outcome?
- Which variable in a set of variables is the best predictor of an outcome?
- Does a particular predictor variable predict an outcome if another variable is controlled for?
Predictor Variable
Same as Independent Variable in Regression
Outcome Variable
The same as the Dependent Variable in Regression
What is a Model?
- An approximation to the actual data
- simple summary of data
- Makes data easier to interpret and communicate
- Allows us to predict data
What is a Regression Model
Mathematically Describes the linear relationship
* Y = b(X) + C
* Y = Predicted values of the DV
* b = The slope of the line
* X = Scores on the Predictor (IV)
* C = The Intercept
The Intercept
- Point where the function crosses the y-axis.
- Sometimes the regression model only becomes significant when we remove the intercept, and the regression line reduces to:
- Y = b(X) + error
Standardized beta (β)
- Compares the strength of the effect of each IV on the DV
- The higher the absolute value of the beta coefficient, the stronger the effect
How Does Regression Work?
- The DV is modelled as a linear combination of other variables
- These predictors don't always have to be continuous
- Can have a combination of variables
- Need to find the Line of Best Fit
Line of Best Fit
- Many different lines could be produced by the regression formula
- How do we know which line is best?
- The best line minimises the difference between observed values and the values predicted by the line (see the sketch after this card)
- This is called error
- In regression also called residuals
* Y = b(X) + C + error
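A minimal Python sketch of how the line of best fit is found, using made-up attendance/grades numbers (the lecture itself works in SPSS): the slope b and intercept C are the values that minimise the sum of squared residuals.

```python
import numpy as np

# Hypothetical data: classes attended (X) and grades (Y)
x = np.array([2, 4, 5, 7, 8, 10], dtype=float)
y = np.array([50, 55, 62, 70, 74, 85], dtype=float)

# Least-squares slope and intercept: the values that minimise the
# sum of squared residuals (observed minus predicted)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - b * x.mean()

residuals = y - (b * x + c)
sse = np.sum(residuals ** 2)  # the quantity the line of best fit minimises

print(f"Y = {b:.2f}(X) + {c:.2f}, SSE = {sse:.2f}")
```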
N (Cases): k (Predictors) Ratio
- Assumption about sample size
- Need a certain number of participants to trust the validity of the results
- Simple Linear Regression assumption
- The ratio of the number of cases (N) to the number of predictors (k)
- The more predictors we have, the more cases we need for the study
Checking Linearity
- Checking for linearity requires scatterplots
- Need scatterplots between the DV and each IV
- Looking for evidence of non-linearity
Check for Normality
- Kolmogorov-Smirnov/Shapiro-Wilk: p > .05
- Skewness & Kurtosis: if the z-score is within ±1.96 (|z| < 1.96), it is normal
- Histogram follows a bell curve.
- Detrended Q-Q Plots: Equal amounts of dots above and below the line.
- Normal QQ Plots: Normal if dots hugging the line.
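A quick Python sketch of the numeric checks above on a hypothetical sample (the skewness/kurtosis z-scores use the common approximate standard errors √(6/N) and √(24/N)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scores = rng.normal(loc=70, scale=10, size=60)  # hypothetical sample

# Kolmogorov-Smirnov (against a normal with the sample's mean/SD)
# and Shapiro-Wilk: p > .05 means no significant departure from normality
ks = stats.kstest(scores, "norm", args=(scores.mean(), scores.std(ddof=1)))
sw = stats.shapiro(scores)

# Skewness/kurtosis z-scores: |z| < 1.96 is acceptably normal
z_skew = stats.skew(scores) / np.sqrt(6 / len(scores))
z_kurt = stats.kurtosis(scores) / np.sqrt(24 / len(scores))

print(f"K-S p = {ks.pvalue:.3f}, S-W p = {sw.pvalue:.3f}")
print(f"z skewness = {z_skew:.2f}, z kurtosis = {z_kurt:.2f}")
```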
Check for Univariate Outliers
- Identified on Box & Whisker Plots
- Dots indicate outliers
- Asterisk indicates extreme cases
- Number tells you which case is the issue
Reason Univariate Outliers are Problematic
- Regression Analysis gives formula for a straight line
- A data point that stands outside other data points can change the slope of your straight line
- This makes the line a poor predictor of the value of other data points
How to deal with Outliers
- Check if Outlier is a data entry error and fix it
- Check if outlier is from different population - Justifies removing their data
- Separate outliers and run different analysis
- Run Analysis with and without outliers and report both models
- Winsorization - Change values so they’re not Outliers anymore
- Use transformations or Bootstrapping
Winsorization
- Change the score of an outlier at the low end to the value of the 5th percentile
- Change the score of an outlier at the high end to the value of the 95th percentile
- Slightly problematic because it changes the data
- But it retains the extremeness of the case without removing its data
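A short sketch of winsorization in Python on made-up scores, using scipy's winsorize helper:

```python
import numpy as np
from scipy.stats.mstats import winsorize

# 20 hypothetical scores with one low and one high outlier
scores = np.array([3, 50, 52, 54, 55, 56, 57, 58, 59, 60,
                   61, 62, 63, 64, 65, 66, 67, 68, 70, 120], dtype=float)

# Clip the bottom and top 5% of cases to the next most extreme values:
# the outliers stay the most extreme cases but no longer distort the slope
adjusted = winsorize(scores, limits=(0.05, 0.05))
print(adjusted.min(), adjusted.max())  # 3 -> 50, 120 -> 70
```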
Bootstrapping
- An alternative to transformations for dealing with outliers
- Creates new samples by resampling, with replacement, from your own sample
- Does this repeatedly (typically thousands of times)
- Builds an empirical sampling distribution in which extreme values carry less weight
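A minimal sketch of the resampling idea, assuming a hypothetical sample with one extreme case and bootstrapping the mean:

```python
import numpy as np

rng = np.random.default_rng(7)
scores = np.array([52, 55, 58, 60, 61, 63, 65, 68, 70, 120], dtype=float)

# Resample with replacement from the observed sample, many times,
# to build an empirical sampling distribution of the mean
boot_means = np.array([
    rng.choice(scores, size=len(scores), replace=True).mean()
    for _ in range(5000)
])

# Percentile confidence interval that doesn't lean on normality assumptions
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrapped 95% CI for the mean: [{ci_low:.1f}, {ci_high:.1f}]")
```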
Homoscedasticity
- Means "same scatter" or same variance
- The variance of the residuals is equal across all predicted scores on the outcome variable
Check for Normality, Linearity and Homoscedasticity
- We need the residuals to behave in a certain way
- Residuals are the difference between the predicted scores and the observed scores on the outcome variable
- SPSS generates a histogram and Q-Q plots of the residuals
Dealing with Heteroscedasticity
- Check the residual scatterplot in SPSS
- Check the plot for any patterns
- If the dots are scattered randomly, then we are all good (see the sketch after this card)
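A Python sketch of the same check on simulated data, plotting residuals against predicted values the way SPSS does:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data with a linear relationship and constant error variance
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 80)
y = 2.0 * x + 5 + rng.normal(0, 2, 80)

b, c = np.polyfit(x, y, 1)       # slope and intercept of the fitted line
residuals = y - (b * x + c)

# Residuals vs predicted: a random band around 0 suggests homoscedasticity;
# a funnel or curve suggests heteroscedasticity or non-linearity
plt.scatter(b * x + c, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```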
If Regression Assumptions are violated
- Check the normality of the predictors; if you fix these, heteroscedasticity can disappear
- Use a transformation on the Outcome Variable
- Consider using a different method like Weighted Least Squares Regression
- Use some kind of Non-Linear Regression
Null Hypothesis for Regression
- Slope of Regression line will be equal to 0
- β = 0
Alternative Hypothesis
- Slope of the Regression line will not be Zero
- β ≠ 0
Running Linear Regression
1. Analyse
2. Regression
3. Linear
4. Move your DV into the dependent box.
5. Move your independent variable into the independent box.
6. Ok
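The same analysis can be sketched in Python with statsmodels on hypothetical attendance/grades data (not the lecture's actual dataset); the summary reproduces the pieces the next few cards describe:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data echoing the lecture example: attendance predicting grades
attendance = np.array([2, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=float)
grades = np.array([48, 55, 52, 60, 63, 61, 70, 68, 75, 72], dtype=float)

X = sm.add_constant(attendance)   # adds the intercept (C) to the model
model = sm.OLS(grades, X).fit()

# The summary reports R², adjusted R², the ANOVA F test and p-value,
# and the unstandardised coefficients with their t-tests
print(model.summary())
```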
Linear Regression - R value
- Same as Pearson's correlation (r) in simple regression
- Tells us strength and direction of relationship
Linear Regression R Square Value
- Tells us amount of variance in DV explained by IV
- Proportion of Variance that can be explained by the variable
- 23% of variability in grades explained by attendance in this example
- Known to overestimate the explained variance
Linear Regression - R Square Adjusted
- R² is adjusted downward, so Adjusted R² is slightly smaller than R² (see the sketch after this card)
- Corrects bias of overestimated explained variance
- Useful as Goodness of Fit Statistic
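The standard correction is Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1); a quick sketch using the R² = .23 attendance example, with a made-up sample size of 50:

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Correct R-squared for its positive bias (n = cases, k = predictors)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# R² = .23 from the attendance example, hypothetical 50 cases, 1 predictor
print(f"{adjusted_r_squared(0.23, 50, 1):.3f}")  # ~.214, slightly below .23
```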
Goodness of Fit Statistic
- Determines how well sample data fit a given distribution (e.g. one from a normal population)
- Determines whether a sample is skewed or normal in the actual population
Regression ANOVA
- Uses the df, F value and the p value
- Compares the error of the line of best fit with the error of the baseline model (slope = 0)
- The ANOVA is significant if the model is "better" than the baseline
Unstandardised Coefficient
- The Slope of the Regression Equation
- Amount of change in the Dependent Variable for a one-unit change in an Independent Variable
- This is the b coefficient
e.g. each unit of attendance is associated with a 1.88 unit increase in grades
Coefficient t-tests
- Check if IV is a significant predictor of the DV
- Become more relevant when we start adding more predictors
Standardised Coefficients
- A measure of the effect size
- Useful for multiple Regression
- Important when we have more than one Predictor
- Predictors often measured in different scales
- e.g., IQ points, classes attended, additional study time
Dealing with Multiple Predictors
- Most commonly found in Research Projects
- Allows us to predict the outcome variable from more than one predictor
- Answers how well a combination of predictors predicts the outcome (see the sketch after the equation below)
Y = b1(X1) + b2(X2) + C + error
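A sketch of a two-predictor model in Python on simulated data, showing both the unstandardised b coefficients and the standardised betas obtained by z-scoring everything first (all variable names and values here are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 100
x1 = rng.normal(50, 10, n)          # hypothetical predictor 1
x2 = rng.normal(20, 5, n)           # hypothetical predictor 2, different scale
y = 1.2 * x1 + 0.8 * x2 + 10 + rng.normal(0, 5, n)

# Unstandardised model: Y = b1(X1) + b2(X2) + C + error
X = sm.add_constant(np.column_stack([x1, x2]))
print(sm.OLS(y, X).fit().params)    # [C, b1, b2]

# Standardised betas: z-score everything first so the slopes are comparable
z = lambda a: (a - a.mean()) / a.std(ddof=1)
Z = np.column_stack([z(x1), z(x2)])
print(sm.OLS(z(y), Z).fit().params) # [beta1, beta2], no constant after centring
```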
Univariate Outliers
Outlier on one variable
Multivariate Outlier
Outlier on a combination of variables
Assumptions with Regression
- Normality
- Univariate Outliers
- Multivariate Outliers
- Multicollinearity
- Normality, Linearity & Homoscedasticity of residuals
Multicollinearity
- Two or more IVs highly correlated in a regression
- One IV can be predicted from another IV in the regression model
How to check for Multivariate Outliers
Mahalanobis Distance
Mahalanobis Distance
- Largest value should not be greater than the critical χ² value for df = k at α = .001
- Where k = the number of predictors
- Use same table as Cook’s Distance
- For simplicity use the table below:
- df = 1: χ² = 10.828
- df = 2: χ² = 13.816
- df = 3: χ² = 16.266
- df = 4: χ² = 18.467
- df = 5: χ² = 20.515
- df = 6: χ² = 22.458
Cook's Distance
- Tells you if there are cases that influence the regression line
- Use same table as Mahalanobis Distance
- Rule of thumb: if Cook's D is > 1 you have influential cases
- Dealt with in the same way as Univariate Outliers
Check for Multicollinearity
- Pearson's correlations between IVs
- If r > .85 then there is multicollinearity
- Tolerance: Values < .1 are multicollinear; < .2 warrant a closer look
- VIF: Values > 10 are clearly Multicollinear; > 5 warrant a closer look
- If you find a problem, then remove the offending variable
- If two IVs are so closely related, they are basically the same thing: treat them as one variable (see the VIF sketch after this card)
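A sketch of the Tolerance/VIF check in Python, with one deliberately collinear predictor in simulated data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)  # deliberately collinear with x1
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i, name in enumerate(["x1", "x2", "x3"], start=1):  # skip the constant column
    vif = variance_inflation_factor(X, i)
    print(f"{name}: VIF = {vif:.2f}, Tolerance = {1 / vif:.2f}")
```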
Check for Multivariate Outliers
- Use Residual Statistics Table
- First, we find the critical χ² for a model with 4 predictors: χ² = 18.467
- Check the Mahalanobis Distance table
- Use Mahal. Distance Maximum (13.803 here)
- 13.803 < 18.467 Therefore there are no multivariate outliers.
- Cook's D is < 1, so there are no influential cases
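The same comparison can be sketched in Python on simulated predictor data, computing each case's Mahalanobis distance and the critical χ² for df = 4 at α = .001:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
X = rng.normal(size=(60, 4))          # 60 cases on 4 hypothetical predictors

# Squared Mahalanobis distance of each case from the predictor centroid
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Critical chi-square for df = k predictors at alpha = .001
critical = chi2.ppf(1 - 0.001, df=4)  # 18.467 for k = 4
print(f"Max Mahalanobis D² = {d2.max():.3f} vs critical {critical:.3f}")
print("Multivariate outliers present:", bool(d2.max() > critical))
```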
Interpreting Multiple Regression
- Use the Variables Entered/Removed table
- Tells you how many predictors are in the model (4 in this example)
- Then Model Summary Table
- R is not just Pearson's r anymore
- It is the correlation between the actual scores and the predictions from the regression equation
- R square = proportion of variance in the DV accounted for by the combined predictors
- Again R square Adjusted is a corrected version of R square that accounts for the positive bias.
Interpreting Multiple Regression ANOVA
- Now tests the combination of predictors
- e.g. whether the combination is a significant predictor of GHQ
- The table has the df, the F value, and the p-value.
Interpreting Multiple Regression Coefficients
- Unstandardized coefficient is the slope of the regression
- Shows that each unit increase in one of the independent variables is associated with a b-unit increase in GHQ
- All other IVs are held constant
- Beta values = Standardised regression coefficients
- Allow direct comparison of regression coefficients.
- Displayed in units of standard deviation.
Interpreting Multiple Regression Standardised Coefficients
- t-values and p-values test the significance of the unique contribution of each predictor
- Changes depending on predictors included in the model.
Multiple Regression Tolerance & VIF
- Tolerance: values < .1 are multicollinear; < .2 warrant closer inspection.
- VIF: values > 10 are clearly multicollinear; > 5 warrant closer inspection.
Remove Non-Significant Predictors
- If you have a predictor that is not contributing anything, it makes the model worse
- This changes the numbers slightly
- Only have significant predictors in the model
Applied look at Regression Equation
- Our general form for the regression is:
Y = b1(X1) + b2(X2) + b3(X3) + C + error
- And if we take this equation and substitute in our variables we get:
GHQ = b1(neuroticism) + b2(state-anxiety) + b3(trait-anxiety) + C + error
GHQ = .555(Neuroticism) + .318(state-anxiety) + .471(trait-anxiety) + 13.552 + error
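Plugging scores into this fitted equation gives a prediction; a tiny Python sketch using the lecture's coefficients (the input scores below are made up purely for illustration):

```python
# Fitted equation from the lecture:
# GHQ = .555(Neuroticism) + .318(state-anxiety) + .471(trait-anxiety) + 13.552
def predict_ghq(neuroticism: float, state_anxiety: float, trait_anxiety: float) -> float:
    return (0.555 * neuroticism
            + 0.318 * state_anxiety
            + 0.471 * trait_anxiety
            + 13.552)

print(f"Predicted GHQ: {predict_ghq(12, 35, 40):.2f}")  # 50.18
```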
What is the value for R?
- Correlation between the DV & IVs
- A value greater than 0.4 is taken for further analysis
What does R tell us?
The strength & direction of the relationship
What does the value of R2 Adjusted tell the researcher
- Tells you the percentage of variation explained by only the independent variables that actually affect the dependent variable.