3. Regression Flashcards
Regression
Regression is a way of predicting the value of one variable from one or more other variables. It is a hypothetical model of the relationship between the variables, and the model is linear (based on a straight line).
Regression Equation
Y is the outcome (the value you want to predict), b_0 is the y-intercept, and b_1 is the regression coefficient of the predictor (the gradient of the slope, which tells you the direction and strength of the relationship).
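Written out in standard notation, the simple regression equation is:

Y_i = b_0 + b_1 X_i + ε_i

where ε_i is the error (residual) term for case i.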
Multiple Regression
We use it to predict the value of an outcome from multiple predictors (in principle, any number of them). An example of a regression model with two predictors is below.
Multiple Regression Equation
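In standard notation, a model with two predictors takes the form:

Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + ε_i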
Fitting the regression model to the data
The regression model is the model of best fit to the data: the one that minimizes the residual sum of squares (RSS). To calculate the RSS, you calculate the residual for each data point, square each residual, and then add them together.
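A minimal sketch of this in Python (the data values are illustrative), fitting a straight line by ordinary least squares with NumPy and computing the RSS:

```python
import numpy as np

# Illustrative data: one predictor (x) and one outcome (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit Y = b0 + b1*X by ordinary least squares
X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept column
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

# RSS: square each residual, then sum them
residuals = y - (b0 + b1 * x)
rss = np.sum(residuals ** 2)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, RSS = {rss:.3f}")
```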
Regression Model Equation
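Presumably the form intended here writes the observed score as the model's prediction plus its error term:

Y_i = Ŷ_i + ε_i

where Ŷ_i is the value predicted by the regression model and ε_i = Y_i − Ŷ_i is the residual that enters the RSS.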
Goodness of Fit
To test the model, we compare it to a baseline model (predicting every score from the mean of the outcome variable Y). We do this by comparing the sums of squared residuals of the two models and the amount of variance each accounts for. Statistically, we do this by running an ANOVA and looking at the F-ratio: the bigger the F value, the better the regression model predicts relative to the baseline.
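Concretely, the F-ratio compares the average improvement due to the model with the average error in the model (mean squares are used because SSM and SSR are based on different numbers of terms):

F = MS_M / MS_R, where MS_M = SS_M / df_M and MS_R = SS_R / df_R

with df_M = k (the number of predictors) and df_R = N − k − 1.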
Proportion of Variance
We can also look at the proportion of variance accounted for by the regression model by dividing the variance accounted for by the model by the total variance in the data.
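In terms of the sums of squares defined on the next card, this proportion is R²:

R² = SS_M / SS_T

Multiplying by 100 gives the percentage of variance in the outcome explained by the model.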
Theory of Testing a Model
When testing a model, we hope to find that the improvement due to the model (SSM) is large relative to the error in the model (SSR), in terms of the total variance in the data (SST). In regression terms: SST is the total variability between the scores and the mean, SSR is the sum of squared residuals, and SSM is the model variability (the difference between the model and the mean).
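In symbols, with Ȳ the mean of the outcome and Ŷ_i the model's prediction for case i:

SS_T = Σ (Y_i − Ȳ)²,  SS_R = Σ (Y_i − Ŷ_i)²,  SS_M = Σ (Ŷ_i − Ȳ)²,  and SS_T = SS_M + SS_R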
Beta Values
These express the change in the outcome associated with a one-unit change in the predictor. The b value is unit specific (e.g. a £1 increase in the predictor increases the happiness measure by 0.56 units), while the standardized coefficient (beta) is expressed in standard deviations.
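For reference, the standardized coefficient can be computed from the unstandardized one using the standard deviations of the predictor (s_x) and the outcome (s_y):

beta_standardized = b × (s_x / s_y)

so it expresses the change in the outcome, in standard deviations, for a one standard deviation change in the predictor.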
Methods of Multiple Regression
Unless the predictors are completely uncorrelated (which is very rare), the order in which we enter predictors matters. This is because the weights change based on the correlations between the predictors as you enter them.
Forced Entry: This involves entering all predictors simultaneously. The results (a single step) therefore depend on the variables entered into the model, so you must have good reasoning for adding each variable. You would do this when you cannot make predictions about the ordering of the predictors based on theory.
Hierarchical: The experimenter decides the order in which variables are entered. This should be based on theory (the experimenter needs to know what they are doing). Variables are entered in blocks, starting with known predictors and then adding new variables to see whether they add predictive power to the model; a minimal sketch of this is shown after this list.
Stepwise: Predictors are selected by the software using semi-partial correlations, i.e. on a purely mathematical criterion. This is bad practice because there is no theoretical logic behind which predictors are added, and it may include predictors with a minimal statistical contribution that a researcher would otherwise choose not to use. Some people use it exploratorily when they have no idea how the predictors fit together, but this is bad science: you should have an idea.
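A minimal sketch of hierarchical entry in Python using statsmodels (the data frame and variable names are illustrative): fit a block containing the known predictor, add the new predictor in a second block, and test whether it improves the model.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Illustrative data: an outcome, a known predictor, and a new predictor
df = pd.DataFrame({
    "happiness": [3, 5, 4, 6, 7, 5, 8, 6, 9, 7],
    "income":    [1, 2, 2, 3, 4, 3, 5, 4, 6, 5],
    "exercise":  [0, 1, 1, 2, 1, 3, 2, 4, 3, 4],
})

# Block 1: known predictor only
m1 = smf.ols("happiness ~ income", data=df).fit()
# Block 2: add the new predictor
m2 = smf.ols("happiness ~ income + exercise", data=df).fit()

# F-test for the change: does block 2 add predictive power?
print(anova_lm(m1, m2))
print(f"R-squared change: {m2.rsquared - m1.rsquared:.3f}")
```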
Outliers: Standardized Residuals
These tell you about the residual for each case in a standardized format (relative to the mean and standard deviation). About 95% of them should lie within ±2. Values beyond ±3 are a cause for alarm (a potential outlier), but a case should not be removed on this basis alone.
Outliers: Cook's Distance
This measures the influence of a single case on the model as a whole. Absolute values greater than 1 are a cause for concern.
Outliers: Mahalanobis Distance
This is the distance of a case from the mean(s) of the predictor variable(s). Larger values are more problematic, starting at values of about 11 for a small regression (the threshold increases with sample size and number of predictors).
Outliers: DFBeta
This is the difference in a parameter estimate when a specific case is excluded from the model. Values greater than 2 are a cause for concern.
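A minimal sketch showing how the diagnostics above can be obtained in Python with statsmodels (the dataset is simulated for illustration; cut-offs follow the cards):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

# Simulated data: 50 cases, two predictors
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=0.5, size=50)

model = sm.OLS(y, sm.add_constant(X)).fit()
infl = OLSInfluence(model)

std_resid = infl.resid_studentized_internal  # standardized residuals: ~95% within ±2, |value| > 3 is alarming
cooks_d, _ = infl.cooks_distance             # Cook's distance: values > 1 are a concern
dfbetas = infl.dfbetas                       # standardized DFBetas, one column per parameter

# (Squared) Mahalanobis distance of each case from the predictor means,
# the form usually compared against guideline values such as ~11
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
mahal_sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

print("Std. residuals beyond ±3:", np.where(np.abs(std_resid) > 3)[0])
print("Cook's distance > 1:", np.where(cooks_d > 1)[0])
print("Largest Mahalanobis distance:", mahal_sq.max())
```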