Week 3 Multiple Regression Flashcards
general multiple regression equation
Yi = (b0 + b1x1i + b2x2i + … + bnxni) + ei
b0 = the Y intercept (the predicted Y when all x scores equal zero)
The b's are all unstandardised regression coefficients.
x1, x2 etc. are the independent variables.
Note this is the unstandardised form of the equation.
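A minimal Python sketch of this prediction equation, with made-up coefficients and scores purely for illustration:

```python
# Sketch of the unstandardised prediction equation Yi = b0 + b1x1 + ... + bnxn
# (made-up coefficients and scores, purely for illustration).
import numpy as np

b0 = 2.0                          # intercept: predicted Y when all x's are 0
b = np.array([0.5, -1.2, 0.8])    # unstandardised b weights for x1, x2, x3
x = np.array([3.0, 1.0, 2.0])     # one person's scores on x1, x2, x3

y_hat = b0 + np.dot(b, x)         # predicted Y for this person
print(y_hat)                      # 2.0 + 1.5 - 1.2 + 1.6 = 3.9
```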
R2
In this instance it is technically multiple R2 = the proportion of variance in the DV predicted by using all of the IV's.
R2=SSM/SST
R is the multiple correlation coefficient; R2 is the coefficient of multiple determination.
It indicates how well the linear regression equation fits the data.
r2 = the coefficient of determination for a single predictor (simple regression).
research uses of multiple regression analysis
1.combined predictive utility
How much variability in the DV can we explain by knowing the scores on all of the predictor variables? Does knowing the predictor scores tell us anything meaningful about the DV, or is it as good as chance?
2.Importance of the IV’s.
Which variable contributes the most to the prediction of the DV?
Do we need all the IV's, or are a few just as good?
- Uniqueness: how much unique (non-overlapping) variance does each variable explain?
- Can we improve the prediction of a DV by adding one or more IV’s to the equation (sequential multiple regression)?
Standard multiple regression versus sequential multiple regression
In standard MR, all IV’s are entered in the equation simultaneously. In sequential MR, the IV’s are added to the equation in specific stages.
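A rough sketch of the difference, using simulated data and a small helper function (fit_r2, our own, not a library routine) to compare R2 when the IV's are entered all at once versus in stages:

```python
# Sketch contrasting standard MR (all IV's at once) with sequential MR
# (IV's added in stages), using simulated data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.6 * x1 + 0.3 * x2 + rng.normal(size=n)

def fit_r2(y, *ivs):
    """Fit OLS with an intercept and return R2."""
    X = np.column_stack([np.ones(len(y))] + list(ivs))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - resid.var() / y.var()

# Standard MR: both IV's entered simultaneously
r2_full = fit_r2(y, x1, x2)

# Sequential MR: x1 entered at step 1, x2 added at step 2
r2_step1 = fit_r2(y, x1)
print(r2_step1, r2_full, r2_full - r2_step1)   # R2 change = x2's added contribution
```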
Evaluating predictive importance of the independent variables
To do this, we need the beta weights. Beta weight = the standardised coefficient. It is a b weight converted via z-scores into standardised form. It is used for comparing the relative IV contributions to the prediction of the DV.
E.g. if beta = .53, then a one standard deviation increase in the IV results in a .53 standard deviation increase in the predicted value of the DV.
As the IV increases by 1 standard deviation, the DV changes by beta × the standard deviation of the DV.
As beta values range from negative infinity to infinity, they cannot be meaningfully squared and CANNOT tell us exactly how much variance an IV uniquely contributes to the DV.
But if an IV's beta value is larger (in absolute terms) than another IV's, the former is contributing more to the prediction of the DV.
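A small sketch of the b-to-beta conversion (beta = b × SD of the IV ÷ SD of the DV), using simulated data where the IV's are on very different raw scales:

```python
# Sketch of converting b weights to beta weights: beta = b * SD(X) / SD(Y).
# The data are simulated, purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(scale=2.0, size=n)     # IV's on different raw scales
x2 = rng.normal(scale=10.0, size=n)
y = 1.5 * x1 + 0.2 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]      # [b0, b1, b2] unstandardised

betas = b[1:] * np.array([x1.std(ddof=1), x2.std(ddof=1)]) / y.std(ddof=1)
print(b[1:])   # raw b weights: not comparable (different units)
print(betas)   # beta weights: comparable relative contributions
```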
Significance testing for evaluating unique predictiveness
For individual IV predictors, the significance test evaluating the unique part of an IV's association with the DV is usually done via a t-test:
t = b weight / standard error of b.
The t-test can be applied using either the b weight or the beta weight. Because the t-test evaluates only the unique predictiveness, the full correlation should always be assessed as well to see whether there is any overlap. When reporting the results of the t-test, the values of r and sr2 should also be reported.
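A sketch of the t-test computed by hand on simulated data (in practice SPSS reports b, the standard error and t directly):

```python
# Sketch of t = b / SE(b) for each predictor, computed by hand with numpy.
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
df = n - X.shape[1]                          # N - k - 1
mse = resid @ resid / df
se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))   # standard errors of b
t = b / se
print(t)      # compare against the t critical value with df = N - k - 1
```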
Multiple R
Multiple R (unsquared) ranges from 0 to 1. It is the correlation between the observed and predicted Y values. If R = 1, the observed Y values all lie exactly on the regression line (perfect prediction); if R = 0, there is no linear relationship.
The size of R alone does not tell us whether it is significant. To test the significance of R,
we need an F ratio, which tests the null hypothesis that no linear relationship exists between the independent and dependent variables.
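A sketch, on simulated data, of multiple R as the correlation between observed and predicted Y, and the F ratio computed from R2, k IV's and N cases:

```python
# Sketch: multiple R as the correlation between observed and predicted Y,
# and the F ratio testing whether R differs significantly from zero.
import numpy as np

rng = np.random.default_rng(3)
N, k = 120, 3
X = rng.normal(size=(N, k))
y = X @ np.array([0.4, 0.0, 0.3]) + rng.normal(size=N)

Xd = np.column_stack([np.ones(N), X])
b = np.linalg.lstsq(Xd, y, rcond=None)[0]
y_hat = Xd @ b

R = np.corrcoef(y, y_hat)[0, 1]              # multiple R (0 to 1)
R2 = R ** 2
F = (R2 / k) / ((1 - R2) / (N - k - 1))      # F with (k, N - k - 1) df
print(R, F)
```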
SST
SST=SSM+SSR
SST = Σ(Y − Ȳ)²
SSM = Σ(Ŷ − Ȳ)²
SSR = Σ(Y − Ŷ)²
(Ȳ = mean of Y, Ŷ = predicted Y)
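A quick sketch verifying the decomposition SST = SSM + SSR (and R2 = SSM/SST) on simulated data:

```python
# Sketch verifying SST = SSM + SSR and R2 = SSM / SST on simulated data.
import numpy as np

rng = np.random.default_rng(4)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([2.0, 0.7, -0.4]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ b
y_bar = y.mean()

SST = np.sum((y - y_bar) ** 2)
SSM = np.sum((y_hat - y_bar) ** 2)
SSR = np.sum((y - y_hat) ** 2)
print(np.isclose(SST, SSM + SSR))   # True
print(SSM / SST)                    # R2
```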
Tolerance and R2A
E.g. we have IV's A, B and C, and DV Y.
R2 tells us how much variance in Y is explained by the combination of A, B and C.
To calculate the tolerance of variable A, run a new multiple regression in which A becomes the DV, with B and C remaining as IV's. The new R2 is R2A, and 1 − R2A = the tolerance of A. This can be repeated for the other IV's.
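A sketch of this tolerance calculation for IV A, using simulated data in which A deliberately overlaps with B and C:

```python
# Sketch of tolerance for IV A: regress A on the other IV's (B, C)
# and take 1 - R2_A. Data are simulated.
import numpy as np

rng = np.random.default_rng(5)
n = 150
B = rng.normal(size=n)
C = rng.normal(size=n)
A = 0.8 * B + 0.3 * C + rng.normal(scale=0.5, size=n)   # A overlaps with B and C

X = np.column_stack([np.ones(n), B, C])
coef = np.linalg.lstsq(X, A, rcond=None)[0]
resid = A - X @ coef
R2_A = 1 - np.sum(resid ** 2) / np.sum((A - A.mean()) ** 2)

tolerance_A = 1 - R2_A     # low tolerance = A is largely redundant with B and C
print(R2_A, tolerance_A)
```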
Adjusted R2
Adjusted R2 is a modified measure of R2. It is necessary to adjust R2 because of a couple of inherent biases:
1.R2 tends to be erroneously inflated when we have a small sample size. As the sample size increases, we become more confident in R2’s accuracy.
and
2. R2 tends to be erroneously inflated with a large number of IV's. (R2 will increase even if we add more IV's to the model which have no correlation at all with the DV!)
Adjusted R2 = 1 − (1 − R2)(N − 1)/(N − k − 1), where N = sample size and k = number of IV's.
If R2 is very close to adjusted R2, it is OK to make a judgement using it, BUT if there is a big difference, R2 is likely to be misleading, in which case BOTH should be reported.
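A tiny sketch of the adjusted R2 formula, with made-up values of R2, N and k to show how the adjustment matters less as the sample grows:

```python
# Sketch of the adjusted R2 formula, with made-up values of R2, N and k.
def adjusted_r2(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2)(N - 1) / (N - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(r2=0.40, n=30, k=5))    # small N, many IV's: noticeable drop
print(adjusted_r2(r2=0.40, n=300, k=5))   # large N: barely changes
```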
dummy variable coding
Multiple regression is designed primarily for continuous data but can handle discrete data if it is converted into dichotomous variables (note the DV must be continuous). If a nominal variable has more than 2 categories, e.g. Muslim, Christian, Jewish, Other, it MUST first be broken down via dummy variable coding.
If there are c categories, we must create c − 1 new variables
(df=categories-1)
eg Muslim vs non Muslim
Jewish vs non Jewish
Christian vs non Christian
(note Other from the original coding is the same as non-Muslim, non-Jewish and non-Christian.)
E.g. if originally Other = 0, Muslim = 1, Jewish = 2 and Christian = 3, then in the dummy-coded version each new variable is coded 0 = no and 1 = yes.
Dummy coding eg:
Original   Muslim   Jewish   Christian
0          0        0        0
1          1        0        0
2          0        1        0
3          0        0        1
3          0        0        1
2          0        1        0        etc.
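A sketch of this dummy coding in pandas (column names are simply the category labels; "Other" is dropped so it serves as the reference category):

```python
# Sketch of dummy coding a 4-category religion variable into c - 1 = 3
# dichotomous (0/1) variables, using pandas.
import pandas as pd

df = pd.DataFrame({"religion": ["Other", "Muslim", "Jewish", "Christian",
                                "Christian", "Jewish"]})

# Drop "Other" explicitly so it becomes the reference category (all zeros).
dummies = pd.get_dummies(df["religion"]).drop(columns="Other").astype(int)
print(dummies)   # columns: Christian, Jewish, Muslim, each coded 0 = no, 1 = yes
```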
rules for sample size in multiple regression
There are 2 general rules of thumb recommended for determining the appropriate sample size in multiple regression:
- when considering the overall multiple correlation;
N ≥ 50 + 8m
m=number of IV’s in the model.
i.e. with 10 predictors, we would require a sample size of 50 + 80 = 130, or more.
- When considering the predictive influence of individual IV’s;
N ≥ 104 + m. With 10 predictors, we need 114 or more cases.
These are general rules of thumb, based on the idea that the IV’s are moderately correlated with the DV. If the correlation is much larger, arguably one could have a slightly smaller sample size, and if the correlations are much smaller, arguably one would need a much larger sample size.
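A trivial sketch of the two rules of thumb as functions (m = number of IV's):

```python
# Sketch of the two rule-of-thumb sample size checks (m = number of IV's).
def min_n_overall(m):
    """Minimum N for testing the overall multiple correlation: 50 + 8m."""
    return 50 + 8 * m

def min_n_individual(m):
    """Minimum N for testing individual predictors: 104 + m."""
    return 104 + m

m = 10
print(min_n_overall(m), min_n_individual(m))   # 130 114
```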
multicollinearity and singularity
Multicollinearity exists where 2 or more IV’s are highly linearly related to each other.
When considering whether we have multicollinearity:
A correlation > .90 between IV's = multicollinearity.
A correlation of 1 = singularity.
A correlation of .70 could still be a strong relationship with a similar informational yield.
Because SPSS finds it difficult to measure each IV's unique contribution if there is strong multicollinearity, SPSS assesses whether or not there is a potential issue by giving values of:
1. Tolerance
and
2. Variance inflation factor (VIF)
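A sketch computing tolerance and VIF (= 1/tolerance) for each IV by hand, on simulated data where two IV's are nearly redundant (the helper function is our own, not SPSS output):

```python
# Sketch of a multicollinearity check: tolerance and VIF (= 1 / tolerance)
# for each IV, computed by regressing that IV on all the others.
import numpy as np

def tolerance_and_vif(X):
    """X: n x k matrix of IV scores. Returns (tolerance, VIF) per IV."""
    n, k = X.shape
    tol = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ coef
        r2_j = 1 - np.sum(resid ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        tol[j] = 1 - r2_j
    return tol, 1 / tol

rng = np.random.default_rng(6)
A = rng.normal(size=200)
B = A + rng.normal(scale=0.1, size=200)     # nearly redundant with A
C = rng.normal(size=200)
tol, vif = tolerance_and_vif(np.column_stack([A, B, C]))
print(tol, vif)   # low tolerance / high VIF flags A and B
```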