3. Regression Flashcards

1
Q

Regression

A

It is a way of predicting the value of one variable from one or more other variables. It is a hypothetical model of the relationship between the variables, and the model is linear (based on a straight line).

2
Q

Regression Equation

A

The simple regression equation is Yi = b0 + b1Xi + εi. Y is the outcome (the value you are trying to predict), b0 is the y-intercept, and b1 is the regression coefficient of the predictor (the gradient of the slope; it tells you the direction and strength of the relationship).

3
Q

Multiple Regression

A

We use it to predict the value of an outcome from multiple predictors (in principle, any number of them). For example, a regression model with two predictors is Yi = b0 + b1X1i + b2X2i + εi.

4
Q

Multiple Regression Equation

A

Yi = b0 + b1X1i + b2X2i + ... + bnXni + εi
5
Q

Fitting the regression model to the data

A

The regression model is the model of best fit to the data. It is found by choosing the model that minimizes the residual sum of squares (RSS). To calculate the RSS, you calculate the residual for each data point, square each residual, and then add them together.
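As a sketch (the data and numbers below are made up for illustration), the best-fitting line and its RSS can be computed with NumPy's least-squares solver:

```python
# Minimal sketch with made-up data: fit a straight line by least squares
# and compute the residual sum of squares (RSS).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor (hypothetical values)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # outcome (hypothetical values)

# Design matrix: a column of ones for the intercept b0, plus the predictor.
X = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - (b0 + b1 * x)   # distance of each point from the fitted line
rss = np.sum(residuals ** 2)    # square each residual, then sum them
```

Any other line through these points would give a larger RSS; that is what "best fit" means here.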

6
Q

Regression Model Equation

A

outcomei = (model) + errori: each observed score is the value predicted by the model plus the residual (error) for that case.
7
Q

Goodness of Fit

A

To test the model, we compare it to the baseline model (predicting the mean of the outcome variable Y for every case). We do this by comparing the sums of squared residuals of the two models, i.e. how much variance each leaves unexplained. Statistically, this is done by running an ANOVA and looking at the F-ratio: the bigger the F value, the better the regression model predicts compared to the baseline.
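This comparison can be sketched numerically (hypothetical data and predictions, not from the card):

```python
# Minimal sketch: F-ratio comparing model improvement (SSM) against
# model error (SSR), each divided by its degrees of freedom.
import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])           # outcome (hypothetical)
y_hat = np.array([2.10, 4.06, 6.02, 7.98, 9.94])  # model's predictions
k = 1                                             # number of predictors
n = len(y)

sst = np.sum((y - y.mean()) ** 2)   # baseline model: predict the mean for everyone
ssr = np.sum((y - y_hat) ** 2)      # error left over in the regression model
ssm = sst - ssr                     # improvement due to the model

f_ratio = (ssm / k) / (ssr / (n - k - 1))   # bigger F = better than baseline
```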

8
Q

Proportion of Variance

A

We can also look at the proportion of variance (R²) accounted for by the regression model, by dividing the variance accounted for by the model by the total variance in the data.
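A sketch of that division, with hypothetical numbers:

```python
# Minimal sketch: R² = variance accounted for by the model / total variance.
import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])           # outcome (hypothetical)
y_hat = np.array([2.10, 4.06, 6.02, 7.98, 9.94])  # model's predictions

sst = np.sum((y - y.mean()) ** 2)     # total variance in the data
ssm = sst - np.sum((y - y_hat) ** 2)  # variance accounted for by the model
r_squared = ssm / sst                 # proportion between 0 and 1
```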

9
Q

Theory of testing a model

A

When testing a model, we hope to find that the improvement due to the model (SSM) is greater than the error in the model (SSR), relative to the total variance in the data (SST). In regression terms: SST is the total variability between the scores and the mean, SSR is the sum of squared residuals, and SSM is the model variability (the difference between the model and the mean), so SST = SSM + SSR.

10
Q

Beta Values

A

These express the change in the outcome associated with a unit change in the predictor. The b value is unit-specific (e.g. a £1 increase raises the happiness measure by 0.56), while the standardized coefficient (beta) is expressed in standard deviations.
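A sketch of the standardizing step (made-up data; in simple regression the standardized beta equals Pearson's r, which makes a handy check):

```python
# Minimal sketch: convert an unstandardized b into a standardized beta
# by rescaling with the standard deviations of predictor and outcome.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor (hypothetical)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # outcome (hypothetical)

# Unstandardized slope: change in y per one unit of x.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Standardized beta: change in y (in SDs) per one SD of x.
beta = b1 * (np.std(x, ddof=1) / np.std(y, ddof=1))
```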

11
Q

Methods of Multiple Regression

A

Unless the predictors are completely uncorrelated (which is very rare), the order in which predictors are entered matters. This is because the regression weights change as predictors are added, based on the correlations between them.

Forced Entry: This involves entering all predictors simultaneously. The results (a single step) depend on the variables entered into the model, so you must have good reasons for including each one. You would use this when you cannot make predictions about the order of the predictors based on theory.

Hierarchical: The experimenter decides the order in which variables are entered. This should be based on theory (the experimenter needs to know what they are doing). Variables are entered in blocks, starting with known predictors and then adding new variables to see whether they add predictive power to the model.

Stepwise: Predictors are selected by the software using semi-partial correlations, i.e. added on purely mathematical criteria. This is bad practice, as there is no logic behind which predictors are added, and it may include predictors with a minimal statistical contribution that a researcher would otherwise choose not to use. Some people use it exploratorily when they have no idea how the predictors fit together, but this is poor science, as you should have an idea.

12
Q

Outliers: Standardized residuals

A

These describe the residuals for each case in a standardized format (relative to the mean and standard deviation). About 95% should lie within ±2. Values beyond ±3 are a cause for alarm (potential outlier), but a case should not be removed on this basis alone.
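A sketch of the standardizing, in its simplest form (made-up numbers):

```python
# Minimal sketch: standardized residuals and a +/-2 screening rule.
import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])           # outcome (hypothetical)
y_hat = np.array([2.10, 4.06, 6.02, 7.98, 9.94])  # model's predictions

residuals = y - y_hat
z_resid = residuals / np.std(residuals, ddof=1)   # residuals in SD units
flagged = np.abs(z_resid) > 2                     # potential outliers
```

(Statistics packages usually also account for each case's leverage when standardizing; this simple version just rescales by the overall standard deviation.)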

13
Q

Outliers: Cook's Distance

A

Measures the influence of a single case on the model as a whole. Values greater than 1 are a cause for concern.
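One way to picture Cook's distance (a sketch with made-up data, where the last case is a deliberate outlier) is to refit the model with each case left out and measure how much the predictions move:

```python
# Minimal sketch: Cook's distance via leave-one-out refitting.
# D_i = sum of squared shifts in predictions / (p * residual variance).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 30.0])   # last case is an outlier

def fit(xv, yv):
    X = np.column_stack([np.ones_like(xv), xv])
    b, *_ = np.linalg.lstsq(X, yv, rcond=None)
    return b

b_full = fit(x, y)
y_hat = b_full[0] + b_full[1] * x
p = 2                                           # parameters: intercept + slope
s2 = np.sum((y - y_hat) ** 2) / (len(x) - p)    # residual variance

cooks = []
for i in range(len(x)):
    keep = np.arange(len(x)) != i
    b_i = fit(x[keep], y[keep])                  # refit without case i
    y_hat_i = b_i[0] + b_i[1] * x
    cooks.append(np.sum((y_hat - y_hat_i) ** 2) / (p * s2))
```

Only the outlying case exceeds the >1 rule of thumb here.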

14
Q

Outliers: Mahalanobis Distance

A

This measures the distance between a case and the mean(s) of the predictor variable(s). Larger values are more problematic, starting at values of about 11 for a small regression.

15
Q

Outliers: DFBeta

A

The difference in a parameter estimate when a specific case is excluded. Absolute values greater than 2 are a cause for concern.

16
Q

Outliers: DFFit

A

The difference in the predicted outcome Y when a specific case is excluded. Deviations from 0 are a cause for concern.

17
Q

Assumptions for Generalizability

A

Outcome variable should be continuous; predictors should be continuous or dichotomous.

Predictors must not have zero variance (a constant predictor cannot explain any variation in the outcome). The model should be linear, and each outcome should come from a different person.

No Multicollinearity

Homoscedasticity

Independence of errors

Normality of errors

18
Q

Assumption: No multicollinearity

A

Multicollinearity exists when predictors are highly correlated with each other. It can be checked with the collinearity statistics: tolerance should be greater than 0.2 and VIF should be less than 10. If violated, it limits the size of R² and makes it hard to assess the importance/distinctiveness of individual predictors.
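A sketch of how VIF can be computed by hand (made-up data where x2 is deliberately almost a copy of x1):

```python
# Minimal sketch: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing
# predictor j on all the other predictors.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)   # highly collinear with x1
x3 = rng.normal(size=100)                    # unrelated predictor

def vif(target, others):
    X = np.column_stack([np.ones(len(target))] + others)
    b, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ b
    r2 = 1 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

vif_x2 = vif(x2, [x1, x3])   # well above 10: multicollinearity problem
vif_x3 = vif(x3, [x1, x2])   # close to 1: no problem
```

Tolerance is simply 1/VIF, so the two statistics carry the same information.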

19
Q

Homoscedasticity

A

Residuals should have the same variance at each level of the predictor(s). You can check this with a scatterplot of residuals against predicted values: the spread of the residuals above and below the model should be roughly even across the range of the model.

20
Q

Independence of Errors

A

For any two data points, the residuals should be uncorrelated. This can be checked with the Durbin-Watson statistic (values close to 2 suggest independent errors).

21
Q

Normality of Errors

A

Residuals should have a mean of 0 and be normally distributed. This can be checked by looking at a histogram of the residuals (look for a bell curve).
