L2 Multiple Regression Flashcards
What is multivariate analysis?
Analysing the impact of multiple variables for predicting changes in Y.
What are the two types of multivariate analysis?
Analysis of dependence and analysis of independence
What are two analyses of dependence?
multiple regression and discriminant analysis
What are three analyses of independence?
cluster analysis, principal component analysis, factor analysis
What is the purpose of multiple regression?
It is designed to isolate the effects of each x-variable (predictor) upon the y-variable
What is a coefficient?
A numerical description of an x-variable’s effect on the y-variable
What is a regressor/predictor?
an independent variable
What is the model?
term used to describe the collection of regressors involved in predicting y
Give an example of a multiple regression case and 5 potential regressors?
Plant growth rate (y) is dependent upon 1) temperature, 2) radiation, 3) carbon dioxide, 4) nutrient supply, 5) water
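An example like this can be sketched numerically. Below is a minimal illustration in Python using NumPy's least-squares solver; the "plant growth" data and coefficient values are hypothetical, invented purely for the example:

```python
import numpy as np

# Hypothetical plant-growth data: rows are observations, columns are the
# five regressors (temperature, radiation, CO2, nutrients, water).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
# Invented "true" model: intercept 2.0 plus five partial coefficients.
y = 2.0 + X @ np.array([0.5, 1.0, 0.3, 0.8, 0.2]) + rng.normal(scale=0.1, size=30)

# Add an intercept column and solve the least-squares problem.
X1 = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(coefs.round(2))  # intercept followed by the five partial coefficients
```

Each entry of `coefs` after the intercept isolates one regressor's effect on y while holding the others fixed, which is exactly what multiple regression is for.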
How and WHY does the line of best fit change for multiple regression?
Because there is more than one x-variable, the fit can no longer be drawn as a line in two dimensions; with two regressors the line becomes a plane of best fit (and with more regressors, a hyperplane)
How does the r^2 statistic differ in multiple regression?
it becomes R^2(adj), the adjusted r squared value
What is the adjusted r squared metric and why does it differ from its linear counterpart?
The linear r squared measures how much of Y is explained by a single predictor, whereas the adjusted version measures how well the whole model explains the variance in Y, with a penalty for the number of regressors included.
What happens to the adjusted r squared value if you add more regressors to the model and why?
The adjusted r squared can either increase or decrease. This is because the extra regressors may contribute to the understanding/explanation of Y, or they may not be relevant and so dilute the model’s overall explanation of Y.
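This behaviour follows from the usual adjustment formula, R^2(adj) = 1 - (1 - R^2)(n - 1)/(n - p - 1) for n observations and p regressors. A small sketch (the R^2 figures and sample sizes are hypothetical) showing how adding weak regressors can lower the adjusted value even while raw R^2 rises:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R squared for n observations and p regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Adding five regressors that raise R^2 only from 0.80 to 0.81
# drags the adjusted value down rather than up.
print(adjusted_r2(0.80, 30, 3))  # 3-regressor model
print(adjusted_r2(0.81, 30, 8))  # 8-regressor model, barely better raw fit
```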
What is crucial before we compare adjusted r squared values?
Standardization
Why do we need to standardize in order to compare adjusted r squared values?
because different samples involve different data, scales and models, which makes direct comparison inappropriate
What does scale dependency mean and why is it important?
In multiple regression the regressors entered into the model have different units of measurement, so when you add another regressor the coefficients of the initial ones change, i.e. their scale is dependent upon the others. Importantly, the way they change is not necessarily meaningful, because the variables are measured in different units.
What is an example of scale dependency being important ?
Plant growth rate - water is measured in millilitres while carbon dioxide is usually measured in ppm, so their raw coefficients are not directly comparable.
What technique allows us to overcome the scale dependency?
standardization
What happens to the name of the coefficients as they are standardised?
they transform from partial regression coefficients in to beta regression coefficients
To overcome the different units of the different regressors what happens to the units?
They become z units
What is the t value?
a test of whether a coefficient differs significantly from zero, i.e. whether that regressor contributes to explaining the variance in Y
What is multicollinearity?
When the regressors in a model correlate with each other
Why is multicollinearity a problem for multiple regression?
Because multiple regression seeks to determine the specific effect of each regressor separately upon Y.
If multicollinearity is present, then how would the finding of one regressor impacting Y in a certain way be hampered?
Because that regressor correlates with another regressor, the impact of the first regressor upon Y would have to be shared between the two, and there would be no way of determining which one is the more important regressor. Furthermore, they may both be controlled by an underlying factor.
How do we look for multicollinearity?
using the variance inflation factor
What are the 3 ways we can address multicollinearity?
- Remove one or more of the correlated regressors and then re-run the test
- Where there is a logical relationship between the two variables we can create an interaction term
- If variables are related because of underlying factors then they can be reduced using factor analysis
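The variance inflation factor mentioned above is, for regressor j, 1/(1 - R^2_j), where R^2_j comes from regressing x_j on the other regressors. A minimal NumPy sketch on synthetic data (two nearly collinear variables and one independent one; the data are invented for illustration):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.delete(X, j, axis=1)          # all regressors except x_j
        A = np.column_stack([np.ones(n), others])  # intercept + other regressors
        fitted = A @ np.linalg.lstsq(A, X[:, j], rcond=None)[0]
        resid = X[:, j] - fitted
        r2 = 1 - resid.var() / X[:, j].var()       # R^2 of x_j on the others
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)                  # independent of both
print(vif(np.column_stack([x1, x2, x3])))  # large for x1 and x2, near 1 for x3
```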
What does the f-test enable?
the determination of how taking a regressor out of the model affects the model’s explanation of Y
How does the f-test result work?
Look to see how much the f-test value changes between different models with different regressors included or excluded.
How do you interpret the f-test result?
If F changes a lot then it means the inclusion or exclusion of that regressor has a big impact upon Y. Furthermore, the f-value will also have a p value which indicates the statistical significance of that result.
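Comparing nested models this way is usually done with a partial F-test. A sketch under the standard formula F = ((RSS_reduced - RSS_full)/q) / (RSS_full/(n - p - 1)); the residual sums of squares and sample size here are hypothetical numbers:

```python
from scipy import stats

def partial_f_test(rss_reduced, rss_full, df_extra, n, p_full):
    """F statistic and p value for dropping df_extra regressors
    from a model with p_full regressors fitted to n observations."""
    df_resid = n - p_full - 1
    f = ((rss_reduced - rss_full) / df_extra) / (rss_full / df_resid)
    p = stats.f.sf(f, df_extra, df_resid)  # upper-tail probability
    return f, p

# Hypothetical residual sums of squares from two nested models:
# dropping one regressor raised RSS from 80 to 120 (n = 50, 4 regressors).
f, p = partial_f_test(rss_reduced=120.0, rss_full=80.0, df_extra=1, n=50, p_full=4)
print(f, p)
```

A large F (and small p) means the dropped regressor mattered, matching the interpretation above.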
What are 3 things to remember when we want to undertake a meaningful multiple regression analysis?
- Variable selection - don’t chuck in a load of variables and see what happens; instead select them and their order based on theory and previous findings.
- Do not rely on a single metric to assess, instead use multiple as they all have a specific and useful purpose.
- Parsimony - do not overcomplicate the model, seek coherence and simplicity.
What are the 3 methods for building the structure of the model’s regressors?
- Hierarchical: ordering the regressors individually, based on theoretical considerations with known predictors entered in first and in order of importance
- Step-wise: order based on mathematical reasoning, which means it is usually carried out by computers based on correlation strength (predictors that explain the most are entered first)
- Forced-entry/simultaneous: all regressors (known or unknown) are put into the model simultaneously, but all regressors have been considered beforehand
What is important to remember for the validation of a model?
Cross-validation
What is cross-validation?
Applying the model with the same predictors for a different sample or population
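A minimal sketch of this idea: fit the coefficients on one sample and check R^2 on a second sample generated by the same underlying (hypothetical) model. Both samples and the "true" coefficients are invented for the illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit(X, y):
    """Least-squares coefficients (intercept first)."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def r2(X, y, coefs):
    """R squared of a fitted model evaluated on (possibly new) data."""
    pred = np.column_stack([np.ones(len(X)), X]) @ coefs
    return 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Two samples drawn from the same underlying model.
X_a, X_b = rng.normal(size=(60, 3)), rng.normal(size=(60, 3))
true = np.array([1.0, 0.5, -0.7])
y_a = X_a @ true + rng.normal(scale=0.3, size=60)
y_b = X_b @ true + rng.normal(scale=0.3, size=60)

coefs = fit(X_a, y_a)        # build the model on sample A
print(r2(X_b, y_b, coefs))   # validate it on sample B
```

If the model generalises, the held-out R^2 stays close to the in-sample value; a large drop suggests overfitting to sample A.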
What two things are important to consider about your model once you have produced it?
Does it fit the observed data well (or is it overly affected by a few regressors), and can the model be generalised to other samples
What is the standardization formula?
beta coefficient (standardized multiple regression coefficient) = partial regression coefficient x (standard deviation of the regressor / standard deviation of Y), i.e. beta = b x (s_x / s_y)
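The formula can be sketched in Python. The data here are hypothetical, with two regressors on deliberately different scales so the raw coefficients differ by a factor of about a thousand while the standardized betas come out comparable:

```python
import numpy as np

def beta_coefficients(X, y, partial_coefs):
    """Standardized betas: beta_j = b_j * (s_xj / s_y)."""
    return partial_coefs * X.std(axis=0, ddof=1) / y.std(ddof=1)

rng = np.random.default_rng(3)
# Two regressors on very different scales (think mL vs ppm).
X = rng.normal(size=(40, 2)) * np.array([1.0, 1000.0])
y = X @ np.array([2.0, 0.002]) + rng.normal(scale=0.5, size=40)

A = np.column_stack([np.ones(40), X])
b = np.linalg.lstsq(A, y, rcond=None)[0][1:]  # drop the intercept
print(b)                           # raw partial coefficients, wildly different scales
print(beta_coefficients(X, y, b))  # betas on a comparable z-unit scale
```

Both regressors contribute equally to y by construction, which the betas reveal and the raw coefficients hide.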
How do we determine multicollinearity in SPSS?
Look at the tolerance and VIF values. Ideally we want tolerance to be >0.2 and the VIF to be <5.
What do the tolerance and VIF metrics mean?
Tolerance: e.g. a tolerance of 0.3 means 30% of the variance of that predictor is not accounted for by the other predictors. We want it above 0.2 so we know the predictor is not redundant, i.e. largely explained by the other predictors. VIF is the reciprocal of tolerance (VIF = 1/tolerance), so a VIF above 5 indicates the presence of multicollinearity.
How do we determine heteroscedasticity in SPSS?
We look at the scatterplot, histogram and the p-p plot.
What is heteroscedasticity? Why does it matter?
When the variance of our residuals is not constant across the x-axis. If this occurs then we cannot be as confident in our regression analysis
How do we interpret the scatterplot, histogram and p-p plot in SPSS to determine heteroscedasticity?
Scatterplot = we want to see a random and wide spread for homoscedasticity
Histogram = we want to see a normal distribution (bell-shaped curve)
P-P plot = we want to see a fairly even alignment of points to the line
What is autocorrelation? Why is it a problem?
This is when our residuals are not independent of the residuals around them. Regression assumes that they are independent
How do we determine auto correlation in SPSS?
Look at the Durbin-Watson statistic.
What is the range of Durbin-Watson statistics where we would state there was no autocorrelation?
1.566 - 2.434
What is the range of Durbin-Watson statistics where we would state there is significant positive or negative autocorrelation?
Positive = 0-1.475, Negative = 2.525-4.0
What are the ranges of Durbin-Watson statistics where we would state there is indeterminate autocorrelation?
1.475-1.566 & 2.434-2.525
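The statistic behind these cut-offs is DW = sum((e_t - e_(t-1))^2) / sum(e_t^2), which sits near 2 for independent residuals, falls toward 0 with positive autocorrelation and rises toward 4 with negative. A minimal sketch on synthetic residuals:

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic for a sequence of residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(4)
independent = rng.normal(size=200)             # independent residuals
positive = np.cumsum(rng.normal(size=200))     # random walk: strong positive autocorrelation

print(durbin_watson(independent))  # near 2
print(durbin_watson(positive))     # well below the 1.475 positive-autocorrelation cut-off
```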
How does stepwise regression work in SPSS?
SPSS uses the correlation matrix to find the independent variable that has the largest significant Pearson correlation with Y, then the next largest, and enters them into the model in that order.