3. Regression Flashcards
Regression
Regression is a way of predicting the value of one variable from one or more other variables. It is a hypothetical model of the relationship between the variables, and the model is linear (based on a straight line).
Regression Equation
Y is the outcome (the value you want to predict), b_0 is the y-intercept, and b_1 is the regression coefficient of the predictor (the gradient of the slope, which tells you the direction and strength of the relationship).
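Written out in standard notation, the simple regression equation is:

Y_i = b_0 + b_1 X_i + ε_i

where ε_i is the error (residual) term for case i.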
Multiple Regression
We use it to predict the value of an outcome from multiple predictors (in principle, any number of them). An example of a regression model with two predictors is below.
Multiple Regression Equation
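In standard notation, a model with two predictors takes the form:

Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + ε_i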
Fitting the regression model to the data
The regression model is the model of best fit to the data: the one that minimizes the residual sum of squares (RSS). To calculate the RSS, you calculate the residual for each data point, square each residual, and then add them together.
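A minimal sketch of this in Python (the data values are illustrative), fitting a straight line by ordinary least squares with NumPy and computing the RSS:

```python
import numpy as np

# Illustrative data: one predictor (x) and one outcome (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit Y = b0 + b1*X by ordinary least squares
X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept column
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

# RSS: square each residual, then sum them
residuals = y - (b0 + b1 * x)
rss = np.sum(residuals ** 2)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, RSS = {rss:.3f}")
```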
Regression Model Equation
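Presumably the form intended here writes the observed score as the model's prediction plus its error term:

Y_i = Ŷ_i + ε_i

where Ŷ_i is the value predicted by the regression model and ε_i = Y_i − Ŷ_i is the residual that enters the RSS.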
Goodness of Fit
To test the model, we compare it to a baseline model (predicting every score from the mean of the outcome variable Y). We do this by comparing the sums of squared residuals of the two models and the amount of variance each accounts for. Statistically, we do this by running an ANOVA and looking at the F-ratio: the bigger the F value, the better the regression model predicts relative to the baseline.
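Concretely, the F-ratio compares the average improvement due to the model with the average error in the model (mean squares are used because SSM and SSR are based on different numbers of terms):

F = MS_M / MS_R, where MS_M = SS_M / df_M and MS_R = SS_R / df_R

with df_M = k (the number of predictors) and df_R = N − k − 1.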
Proportion of Variance
We can also look at the proportion of variance accounted for by the regression model by dividing the variance accounted for by the model by the total variance in the data.
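In terms of the sums of squares defined on the next card, this proportion is R²:

R² = SS_M / SS_T

Multiplying by 100 gives the percentage of variance in the outcome explained by the model.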
Theory of Testing a Model
When testing a model, we hope to find that the improvement due to the model (SSM) is large relative to the error in the model (SSR), in terms of the total variance in the data (SST). In regression terms: SST is the total variability between the scores and the mean, SSR is the sum of squared residuals, and SSM is the model variability (the difference between the model and the mean).
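In symbols, with Ȳ the mean of the outcome and Ŷ_i the model's prediction for case i:

SS_T = Σ (Y_i − Ȳ)²,  SS_R = Σ (Y_i − Ŷ_i)²,  SS_M = Σ (Ŷ_i − Ȳ)²,  and SS_T = SS_M + SS_R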
Beta Values
These express the change in the outcome associated with a one-unit change in the predictor. The b value is unit specific (e.g. a £1 increase in the predictor increases the happiness measure by 0.56 units), while the standardized coefficient (beta) is expressed in standard deviations.
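For reference, the standardized coefficient can be computed from the unstandardized one using the standard deviations of the predictor (s_x) and the outcome (s_y):

beta_standardized = b × (s_x / s_y)

so it expresses the change in the outcome, in standard deviations, for a one standard deviation change in the predictor.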
Methods of Multiple Regression
Unless the predictors are completely uncorrelated (which is very rare), the order in which we enter predictors matters. This is because the weights change based on the correlations between the predictors as you enter them.
Forced Entry: This involves entering all predictors simultaneously. The results (a single step) therefore depend on the variables entered into the model, so you must have good reasoning for adding each variable. You would do this when you cannot make predictions about the ordering of the predictors based on theory.
Hierarchical: The experimenter decides the order in which variables are entered. This should be based on theory (the experimenter needs to know what they are doing). Variables are entered in blocks, starting with known predictors and then adding new variables to see whether they add predictive power to the model; a minimal sketch of this is shown after this list.
Stepwise: Predictors are selected by the software using semi-partial correlations, i.e. on a purely mathematical criterion. This is bad practice because there is no theoretical logic behind which predictors are added, and it may include predictors with a minimal statistical contribution that a researcher would otherwise choose not to use. Some people use it exploratorily when they have no idea how the predictors fit together, but this is bad science: you should have an idea.
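A minimal sketch of hierarchical entry in Python using statsmodels (the data frame and variable names are illustrative): fit a block containing the known predictor, add the new predictor in a second block, and test whether it improves the model.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Illustrative data: an outcome, a known predictor, and a new predictor
df = pd.DataFrame({
    "happiness": [3, 5, 4, 6, 7, 5, 8, 6, 9, 7],
    "income":    [1, 2, 2, 3, 4, 3, 5, 4, 6, 5],
    "exercise":  [0, 1, 1, 2, 1, 3, 2, 4, 3, 4],
})

# Block 1: known predictor only
m1 = smf.ols("happiness ~ income", data=df).fit()
# Block 2: add the new predictor
m2 = smf.ols("happiness ~ income + exercise", data=df).fit()

# F-test for the change: does block 2 add predictive power?
print(anova_lm(m1, m2))
print(f"R-squared change: {m2.rsquared - m1.rsquared:.3f}")
```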
Outliers: Standardized Residuals
These tell you about the residual for each case in a standardized format (relative to the mean and standard deviation). About 95% of them should lie within ±2. Values beyond ±3 are a cause for alarm (a potential outlier), but a case should not be removed on this basis alone.
Outliers: Cook's Distance
This measures the influence of a single case on the model as a whole. Absolute values greater than 1 are a cause for concern.
Outliers: Mahalanobis Distance
This is the distance of a case from the mean(s) of the predictor variable(s). Larger values are more problematic, starting at values of about 11 for a small regression (the threshold increases with sample size and number of predictors).
Outliers: DFBeta
This is the difference in a parameter estimate when a specific case is excluded from the model. Values greater than 2 are a cause for concern.
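A minimal sketch showing how the diagnostics above can be obtained in Python with statsmodels (the dataset is simulated for illustration; cut-offs follow the cards):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

# Simulated data: 50 cases, two predictors
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=0.5, size=50)

model = sm.OLS(y, sm.add_constant(X)).fit()
infl = OLSInfluence(model)

std_resid = infl.resid_studentized_internal  # standardized residuals: ~95% within ±2, |value| > 3 is alarming
cooks_d, _ = infl.cooks_distance             # Cook's distance: values > 1 are a concern
dfbetas = infl.dfbetas                       # standardized DFBetas, one column per parameter

# (Squared) Mahalanobis distance of each case from the predictor means,
# the form usually compared against guideline values such as ~11
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
mahal_sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

print("Std. residuals beyond ±3:", np.where(np.abs(std_resid) > 3)[0])
print("Cook's distance > 1:", np.where(cooks_d > 1)[0])
print("Largest Mahalanobis distance:", mahal_sq.max())
```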