Week 4: Multiple Regression Flashcards
What is the decision tree for multiple regression? - (4)
- Continuous
- Two or more predictors that are continuous
- Multiple regression
- Meets assumptions of parametric tests
simple linear regression
the outcome variable Y is
predicted using the equation of a straight line
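In symbols, the standard form of that straight-line model (the usual textbook notation, not copied from the card itself):

```latex
Y_i = b_0 + b_1 X_i + \varepsilon_i
```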
Multiple regression still uses the same basic equation of …, but the model is more complex
Multiple regression is the same as simple linear regression except that - (2)
for every extra predictor you include, you have to add a coefficient;
so, each predictor variable has its own coefficient, and the outcome variable is predicted from a combination of all the variables multiplied by their respective coefficients plus a residual term
Multiple regression equation
In the multiple regression equation, list all the terms - (5) (the full equation is written out after this list)
- Y is the outcome variable,
- b1 is the coefficient of the first predictor (X1),
- b2 is the coefficient of the second predictor (X2),
- bn is the coefficient of the nth predictor (Xn),
- εi is the difference between the predicted and the observed value of Y for the ith participant.
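Written out in full (standard textbook form; note it also includes the intercept b0, which the term list above does not mention explicitly):

```latex
Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \varepsilon_i
```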
Multiple regression uses the same principle as linear regression in that
we seek to find the linear combination of predictors that correlates maximally with the outcome variable.
Regression is a way of predicting things that you have not measured by predicting
an outcome variable from one or more predictor variables
Regression can be used to produce a
linear model of the relationship between 2 variables
Record company interested in creating a model predicting record sales from advertising budget and plays on the radio per week (airplay)
- Example of its MR plotted: how many variables are measured, and what the vertical, horizontal, and third axes show - (4)
It is a three-dimensional scatterplot, which means there are three axes measuring the values of the three variables.
The vertical axis measures the outcome, which in this case is the number of album sales.
The horizontal axis measures how often the album is played on the radio per week.
The third axis, which we can think of as being directed into the page, measures the advertising budget.
Can’t plot a 3D plot of MR as shown here
for more than 2 predictor (X) variables
The overlap in the diagram is the shared variance, which we call the
covariance
covariance is also referred to as the variance
shared between the predictor and outcome variable.
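For reference, the usual sample formula for the covariance between a predictor X and the outcome Y (standard definition, not taken from the card):

```latex
\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{N}\bigl(X_i - \bar{X}\bigr)\bigl(Y_i - \bar{Y}\bigr)}{N - 1}
```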
What is shown in E?
The variance in Album Sales not shared by the predictors
What is shown in D?
Unique variance shared between Ad Budget and Plays
What is shown in C?
The variance in Album Sales shared by Ad Budget and Plays
What is shown in B?
Unique variance shared between Plays and Album Sales
What is shown in A?
Unique variance shared between Ad Budget and Album Sales
If you have two predictors that overlap and correlate a lot, then it is a … model
bad model that can't uniquely explain the outcome
In Hierarchical regression, we are seeing whether
one model explains significantly more variance than the other
In hierarchical regression predictors are selected based on
past work and the experimenter
decides in which order to enter the predictors into the model
As a general rule for hierarchical regression, - (3)
- known predictors (from other research) should be entered into the model first, in order of their importance in predicting the outcome
- after the known predictors have been entered, the experimenter can add any new predictors into the model
- new predictors can be entered either all in one go, in a stepwise manner, or hierarchically (such that the new predictor suspected to be the most important is entered first)
Example of hierarchical regression in terms of album sales - (2)
The first model allows all the shared variance between Ad budget and Album sales to be accounted for.
The second model then only has the option to explain more variance by the unique contribution from the added predictor Plays on the radio.
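A minimal Python sketch of this two-step hierarchy, using statsmodels rather than SPSS; the file name album_sales.csv and the column names sales, adverts and airplay are hypothetical stand-ins for the album-sales data:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical data file with the album-sales variables
df = pd.read_csv("album_sales.csv")  # assumed columns: sales, adverts, airplay

# Model 1: the known predictor only (advertising budget)
m1 = smf.ols("sales ~ adverts", data=df).fit()

# Model 2: add the new predictor (plays on the radio)
m2 = smf.ols("sales ~ adverts + airplay", data=df).fit()

# Does Model 2 explain significantly more variance than Model 1?
print(m1.rsquared, m2.rsquared)  # R squared at each step
print(anova_lm(m1, m2))          # F-test on the change between the nested models
```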
What is forced entry MR?
method in which all predictors are forced
into the model simultaneously.
Like HR, forced entry MR relies on
good theoretical reasons for including the chosen predictors.
Different from HR, forced entry MR
makes no decision about the order in which variables are entered.
Some researchers believe that about forced entry MR that
this method is the only appropriate method for theory testing because stepwise techniques are influenced by random variation in the data and so rarely give replicable results if the model is retested.
How to do forced entry MR in SPSS? - (4)
Analyse –> Regression –> Linear
Put the outcome in the Dependent box and the predictors (IVs, X) in the Independent(s) box
Can select a range of statistics in the Statistics box and press OK to check the collinearity assumption
Can also click Plots to check the assumptions of homoscedasticity and linearity (see the Python sketch below)
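For comparison, a forced-entry fit (all predictors entered simultaneously) can be sketched in Python with statsmodels; again the file and column names (album_sales.csv, sales, adverts, airplay, attract) are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("album_sales.csv")  # hypothetical data: sales, adverts, airplay, attract

# Forced entry: all predictors go into the model at once
model = smf.ols("sales ~ adverts + airplay + attract", data=df).fit()
print(model.summary())  # coefficients, R squared, F-test, etc.
```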
Why select collinearity diagnostics in the Statistics box for multiple regression? - (2)
This option is for obtaining collinearity statistics such as the
VIF and tolerance
Checking the assumption of no multicollinearity
Multicollinearity exists when there is a
strong correlation between two or more predictors in a regression model.
Multicollinearity poses a problem only for multiple regression because
simple regression requires only one predictor.
Perfect collinearity exists in multiple regression when at least
two predictors are perfectly correlated, i.e., have a correlation coefficient of 1
If there is perfect collinearity in multiple regression between predictors it
becomes impossible
to obtain unique estimates of the regression coefficients because there are an infinite number of combinations of coefficients that would work equally well.
Good news is that perfect collinearity in multiple regression is rare in
real-life data
If two predictors are perfectly correlated in multiple regression then the values of b for each variable are
interchangeable
The bad news is that less than perfect collinearity is virtually
unavoidable
As collinearity increases in multiple regression, there are 3 problems that arise - (3)
- Untrustworthy bs
- Limits the size of R
- Importance of predictors
As collinearity increases, one of the 3 problems that arises is the importance of predictors - (3)
Multicollinearity between predictors makes it difficult
to assess the individual importance of a predictor.
If the predictors are highly correlated, and each accounts for similar variance in the outcome, then how can we know
which of the two variables is important?
Quite simply we can’t tell which variable is important – the model could include either one, interchangeably.
One way of identifying multicollinearity in multiple regression is to scan a
correlation matrix of all of the predictor variables and see if any correlate very highly (by very highly I mean correlations of above .80 or .90)
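A quick way to do this scan outside SPSS is with pandas (the file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("album_sales.csv")                 # hypothetical data file
predictors = df[["adverts", "airplay", "attract"]]  # hypothetical predictor columns

# Correlations among the predictors; values above about .80-.90 flag possible multicollinearity
print(predictors.corr().round(2))
```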
SPSS produces collinearity diagnostics in multiple regression, which are - (2)
the variance inflation factor (VIF) and tolerance
The VIF indicates in multiple regression whether a
predictor has a strong linear relationship with the other predictor(s).
If the VIF statistic is above 10 in multiple regression, there is good reason to worry about
a potential problem of multicollinearity
If the VIF statistic is above 10 or approaching 10 in multiple regression, then what you would want to do is - (2)
look at your variables to see whether all of them need to go into the model
if there is a high correlation between 2 predictors (measuring the same thing), decide whether it is important to include both variables or to take one out and simplify the regression model
Related to the VIF in multiple regression is the tolerance
statistic, which is its
reciprocal (1/VIF) = the inverse of the VIF
For tolerance in multiple regression, a value below 0.2 indicates
an issue with multicollinearity
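A small Python sketch of computing both diagnostics with statsmodels (the data file and column names are hypothetical); tolerance is simply 1/VIF:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("album_sales.csv")  # hypothetical data file
X = sm.add_constant(df[["adverts", "airplay", "attract"]])  # predictors plus intercept

for i, name in enumerate(X.columns):
    if name == "const":
        continue  # skip the intercept column
    vif = variance_inflation_factor(X.values, i)
    # VIF above (or approaching) 10 and tolerance below 0.2 are the usual warning signs
    print(name, "VIF =", round(vif, 2), "tolerance =", round(1 / vif, 3))
```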
In Plots in SPSS for multiple regression, you put - (2)
ZRESID on Y and ZPRED on X
This plots residuals against predicted values to assess homoscedasticity
What is ZPRED in MR? - (2)
(the standardized predicted values of the dependent variable based on the model).
These values are standardized forms of the values predicted by the model.
What is ZRESID in MR? - (2)
(the standardized residuals, or errors).
These values are the standardized differences between the observed data and the values that the model predicts.
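The same diagnostic plot can be sketched in Python by standardizing the fitted values and residuals by hand (file and column names hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

df = pd.read_csv("album_sales.csv")  # hypothetical data file
model = smf.ols("sales ~ adverts + airplay + attract", data=df).fit()

# Rough analogues of SPSS's ZPRED and ZRESID
zpred = (model.fittedvalues - model.fittedvalues.mean()) / model.fittedvalues.std()
zresid = (model.resid - model.resid.mean()) / model.resid.std()

plt.scatter(zpred, zresid)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted values (ZPRED)")
plt.ylabel("Standardized residuals (ZRESID)")
plt.show()  # a random, even spread of points suggests homoscedasticity and linearity hold
```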
SPSS in multiple linear regression gives descriptive output, which is - (2)
- basic means and also a table of correlations between variables
- This is a first opportunity to determine whether there is high correlation between predictors, otherwise known as multicollinearity
The model summary in SPSS captures how much the model or models explain in MR
in terms of variance (R squared), and more importantly how R squared changes between models and whether those changes are significant.
Diagram of model summary
What does R^2 measure in multiple regression?
It is a measure of how much of the variability in the outcome is accounted for by the predictors
The adjusted R^2 in multiple regression gives us an estimate of
the model's fit in the general population
The Durbin-Watson statistic, if specified in multiple regression, tells us whether the - (2)
assumption of independent errors is tenable (values less than 1 or greater than 3 raise alarm bells)
the closer the value is to 2, the better = assumption met
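Outside SPSS, the same statistic can be obtained from the residuals of a fitted model with statsmodels (file and column names hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("album_sales.csv")  # hypothetical data file
model = smf.ols("sales ~ adverts + airplay + attract", data=df).fit()

dw = durbin_watson(model.resid)
print(dw)  # close to 2 = independent errors; below 1 or above 3 raises alarm bells
```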
SPSS output for MR = ANOVA table which performs
F-tests for each model
SPSS output for MR contains ANOVA that tests whether the model is
significantly better at predicting the outcome than using the mean as a 'best guess'
The F-ratio represents the ratio of
improvement in prediction that results from fitting the model, relative to the inaccuracy that still exists in the model
We are told the sum of squares for model (SSM) - MR regression line in output which represents
improvement in prediction resulting from fitting a regression line to the data rather than using the mean as an estimate of the outcome
We are told residual sum of squares (Residual line) in this MR output which represents
total difference between
the model and the observed data
DF for Sum of squares Model for MR regression line is equal to
number of predictors (e.g., 1 for first model, 3 for second)
DF for Sum of Squares Residual for MR is - (2)
Number of observations (N) minus the number of coefficients in the regression model
(e.g., M1 has 2 coefficients - one for the predictor and one for the constant; M2 has 4 - one for each of the 3 predictors and one for the constant)
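As a worked example (assuming, purely for illustration, N = 200 observations), for the second model with 3 predictors:

```latex
\mathrm{df}_M = k = 3, \qquad \mathrm{df}_R = N - (k + 1) = 200 - 4 = 196
```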
The average (mean) sum of squares in the ANOVA table is calculated
for each term (SSM, SSR) by dividing the SS by its df.
How is the F ratio calculated in this ANOVA table?
F-ratio is calculated by dividing the average improvement in prediction by the model (MSM) by the average
difference between the model and the observed data (MSR)
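In symbols (using the SS and df terms defined above):

```latex
MS_M = \frac{SS_M}{\mathrm{df}_M}, \qquad MS_R = \frac{SS_R}{\mathrm{df}_R}, \qquad F = \frac{MS_M}{MS_R}
```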
If the improvement due to fitting the regression model is much greater than the inaccuracy within the model then value of F will be
greater than 1, and SPSS calculates the exact probability (p-value) of obtaining that value of F by chance
What happens if b values are positive in multiple regression?
there is a positive relationship between the predictor and the outcome,
What happens if the b value is negative in multiple regression?
it represents a negative relationship between the predictor and the outcome variable.
What do the b values in this table tell us about the relationships between the predictors and the outcome variable in multiple regression? (3)
They indicate positive relationships, so as advertising budget increases, record sales (the outcome) increase
as plays on the radio increase, so do record sales
as attractiveness of the band increases, so do record sales
The b-values also tell us, in addition to the direction of the relationship (pos/neg), to what degree each
predictor affects the outcome if the effects of all other predictors are held constant.