Regression, Simple, Multiple, & Logistic (Field Ch. 7 & 8) Flashcards

1
Q

Adjusted R²

A

a measure of the loss of predictive power or shrinkage in regression. The adjusted R² tells us how much variance in the outcome would be accounted for if the model had been derived from the population from which the sample was taken.
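
A minimal sketch in Python of the shrinkage formula usually attributed to Wherry (the one SPSS reports); n is the sample size, k the number of predictors, and the function name is illustrative:

def adjusted_r_squared(r_squared, n, k):
    # Wherry's shrinkage formula: estimates the variance the model would
    # explain if it had been derived from the population
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(adjusted_r_squared(0.40, n=50, k=3))  # roughly 0.36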

2
Q

Autocorrelation

A

when the residuals of two observations in a regression model are correlated.

3
Q

bi

A

unstandardized regression coefficient. Indicates the strength of the relationship between a given predictor (the ith of many) and the outcome, in the units of measurement of the predictor. It is the change in the outcome associated with a unit change in that predictor.

4
Q

βi

A

standardized regression coefficient. Indicates the strength of the relationship between a given predictor (the ith of many) and the outcome in standardized form. It is the change in the outcome, in standard deviations, associated with a one standard deviation change in the predictor.
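
A minimal sketch of the standard conversion between the unstandardized and standardized coefficients in an ordinary least squares model; the function and variable names are illustrative:

import numpy as np

def standardized_beta(b_i, predictor, outcome):
    # beta_i = b_i * (SD of the predictor / SD of the outcome)
    return b_i * np.std(predictor, ddof=1) / np.std(outcome, ddof=1)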

5
Q

Cook’s distance

A

a measure of the overall influence of a case on a model. Cook and Weisberg (1982) have suggested that values greater than 1 may be cause for concern.
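
A hedged sketch (simulated data, not from the text) of how Cook's distance could be obtained with statsmodels and screened against the cut-off of 1:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))    # intercept plus two predictors
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=100)

model = sm.OLS(y, X).fit()
cooks_d, _ = model.get_influence().cooks_distance  # one distance per case
print(np.where(cooks_d > 1)[0])                    # cases that may be influencing the model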

6
Q

Cross-validation

A

assessing the accuracy of a model across different samples. This is an important step in generalization. In a regression model there are two main methods of cross-validation: adjusted R², or data splitting, in which the data are split randomly into two halves, a regression model is estimated for each half, and the two models are compared.

7
Q

Dummy variables

A

a way of recoding a categorical variable with more than two categories into a series of variables all of which are dichotomous and can take on values of only 0 or 1. There are seven basic steps to create such variables (a minimal code sketch follows below):
(1) Count the number of groups you want to recode and subtract 1.
(2) Create as many new variables as the value you calculated in step 1 (these are your dummy variables).
(3) Choose one of your groups as a baseline (i.e., a group against which all other groups should be compared, such as a control group).
(4) Assign that baseline group values of 0 for all of your dummy variables.
(5) For your first dummy variable, assign the value 1 to the first group that you want to compare against the baseline group (assign all other groups 0 for this variable).
(6) For the second dummy variable, assign the value 1 to the second group that you want to compare against the baseline group (assign all other groups 0 for this variable).
(7) Repeat this process until you run out of dummy variables.
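
A minimal sketch of the same recoding with pandas (the group names are made up), using a control group as the baseline:

import pandas as pd

df = pd.DataFrame({'group': ['control', 'drug_a', 'drug_b', 'control', 'drug_a']})
dummies = pd.get_dummies(df['group'], prefix='group', dtype=int)
dummies = dummies.drop(columns='group_control')   # the baseline group is 0 on every dummy
print(dummies)
# control rows are 0 on both dummies; drug_a rows are 1 on group_drug_a,
# and drug_b rows are 1 on group_drug_b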

8
Q

F-ratio

A

a test statistic with a known probability distribution (the F-distribution). It is the ratio of the average variability in the data that a given model can explain to the average variability unexplained by that same model. It is used to test the overall fit of the model in simple regression and multiple regression, and to test for overall differences between group means in experiments.
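
A minimal worked example (the numbers are invented) of the ratio described above:

ss_model, df_model = 120.0, 2     # variability the model explains, and its degrees of freedom
ss_resid, df_resid = 300.0, 47    # variability the model cannot explain
ms_model = ss_model / df_model    # average explained variability (mean square)
ms_resid = ss_resid / df_resid    # average unexplained variability (mean square)
print(ms_model / ms_resid)        # F-ratio, roughly 9.4 here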

9
Q

Generalization

A

the ability of a statistical model to say something beyond the set of observations that spawned it. If a model generalizes it is assumed that predictions from that model can be applied not just to the sample on which it is based, but to a wider population from which the sample came.

10
Q

Goodness of fit

A

an index of how well a model fits the data from which it was generated. It’s usually based on how well the data predicted by the model correspond to the data that were actually collected.

11
Q

Heteroscedasticity

A

the opposite of homoscedasticity. This occurs when the residuals at each level of the predictor variable(s) have unequal variances. Put another way, at each point along any predictor variable, the spread of residuals is different.

12
Q

Hierarchical regression

A

a method of multiple regression in which the order in which predictors are entered into the regression model is determined by the researcher based on previous research: variables already known to be predictors are entered first, and new variables are entered subsequently.

13
Q

Homoscedasticity

A

an assumption in regression analysis that the residuals at each level of the predictor variable(s) have similar variances. Put another way, at each point along any predictor variable, the spread of residuals should be fairly constant.

14
Q

Independent errors

A

for any two observations in regression the residuals should be uncorrelated (or independent).

15
Q

Mean squares

A

a measure of average variability. For every sum of squares (which measures the total variability) it is possible to create mean squares by dividing by the number of things used to calculate the sum of squares (or some function of it, typically the degrees of freedom).

16
Q

Model sum of squares

A

a measure of the total amount of variability for which a model can account. It is the difference between the total sum of squares and the residual sum of squares.

17
Q

Multicollinearity

A

a situation in which two or more variables are very closely linearly related.
· To check for multicollinearity, use the VIF values from the table labelled Coefficients in the SPSS output.
· If these values are less than 10, then there probably isn’t cause for concern.
· If you take the average of VIF values, and it is not substantially greater than 1, then there’s also no cause for concern.

18
Q

Multiple R

A

the multiple correlation coefficient. It is the correlation between the observed values of an outcome and the values of the outcome predicted by a multiple regression model.

19
Q

Multiple regression

A

an extension of simple regression in which an outcome is predicted by a linear combination of two or more predictor variables. The form of the model is Y = b0 + b1X1 + b2X2 + ... + bnXn + ε, in which the outcome is denoted as Y and each predictor is denoted as X. Each predictor has a regression coefficient b associated with it, and b0 is the value of the outcome when all predictors are zero.
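
A hedged sketch with statsmodels (the variable names and data are invented) of fitting such a model and reading off b0 and the two b coefficients:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
advertising = rng.normal(50, 10, size=200)
airplay = rng.normal(20, 5, size=200)
sales = 10 + 0.8 * advertising + 1.5 * airplay + rng.normal(0, 5, size=200)

X = sm.add_constant(np.column_stack([advertising, airplay]))
fit = sm.OLS(sales, X).fit()
print(fit.params)                      # b0, b for advertising, b for airplay
print(fit.rsquared, fit.rsquared_adj)  # R² and adjusted R² for the model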

20
Q

Ordinary least squares (OLS)

A

a method of regression in which the parameters of the model are estimated using the method of least squares, i.e., by choosing the values that minimize the sum of squared residuals.

21
Q

Outcome variable

A

a variable whose values we are trying to predict from one or more predictor variables.

22
Q

Perfect collinearity

A

exists when at least one predictor in a regression model is a perfect linear combination of the others (the simplest example being two predictors that are perfectly correlated - they have a correlation coefficient of 1).

23
Q

Predicted value

A

the value of an outcome variable based on specific values of the predictor variable or variables being placed into a statistical model.

24
Q

Predictor variable

A

a variable that is used to try to predict values of another variable known as an outcome variable.

25
Q

Residual

A

The difference between the value a model predicts and the value observed in the data on which the model is based. Basically, an error. When the residual is calculated for each observation in a data set the resulting collection is referred to as the residuals.

26
Q

Residual sum of squares

A

a measure of the variability that cannot be explained by the model fitted to the data. It is the total squared deviance between the observations and the values of those observations predicted by whatever model is fitted to the data.

27
Q

Simple regression

A

· Simple regression is a way of predicting values of one variable from another (a minimal sketch in code follows this list).
· We have to assess how well the line fits the data using:
o R², which tells us how much variance is explained by the model compared to how much variance there is to explain in the first place. It is the proportion of variance in the outcome variable that is shared by the predictor variable.
o F, which tells us how much variability the model can explain relative to how much it can't explain (i.e., it's the ratio of how good the model is compared to how bad it is).
o the b-value, which tells us the gradient of the regression line and the strength of the relationship between a predictor and the outcome variable. If it is significant (Sig. < .05 in the SPSS table) then the predictor variable significantly predicts the outcome variable.
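
A minimal sketch (made-up data) of the line itself: the gradient b1 is cov(x, y) / var(x), the intercept b0 is mean(y) - b1 * mean(x), and R² is the squared correlation between predictor and outcome:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 2.0 + 0.6 * x + rng.normal(size=50)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # gradient of the regression line
b0 = y.mean() - b1 * x.mean()                        # intercept
r_squared = np.corrcoef(x, y)[0, 1] ** 2             # variance in y shared with x
print(b0, b1, r_squared)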

28
Q

Standardized residuals

A

the residuals of a model expressed in standard deviation units. Rules of thumb (checked in the sketch below):
· Standardized residuals with an absolute value greater than 3.29 (in practice, 3) are cause for concern, because in an average sample a value this high is unlikely to happen by chance.
· If more than 1% of observations have standardized residuals with an absolute value greater than 2.58 (in practice, 2.5), there is evidence that the level of error within the model is unacceptable (the model is a fairly poor fit of the sample data).
· If more than 5% of observations have standardized residuals with an absolute value greater than 1.96 (or 2 for convenience), there is also evidence that the model is a poor representation of the actual data.
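
A hedged sketch (simulated data) of those cut-off checks; standardized residuals are taken here as the raw residuals divided by the square root of the residual mean square, which is one common definition:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=200)

results = sm.OLS(y, X).fit()
std_resid = results.resid / np.sqrt(results.mse_resid)  # residuals in standard deviation units
for cutoff in (1.96, 2.58, 3.29):
    # compare these proportions to roughly 5%, 1%, and ~0 respectively
    print(cutoff, np.mean(np.abs(std_resid) > cutoff))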

29
Q

Stepwise regression

A

a method of multiple regression in which variables are entered into the model based on a statistical criterion (the semi-partial correlation with the outcome variable). Once a new variable is entered into the model, all variables in the model are assessed to see whether they should be removed.

30
Q

Studentized deleted residual

A

a measure of the influence of a particular case of data. This is a standardized version of the deleted residual.

31
Q

Studentized residuals

A

a variation on standardized residuals. A Studentized residual is an unstandardized residual divided by an estimate of its standard deviation that varies point by point. These residuals have the same properties as the standardized residuals but usually provide a more precise estimate of the error variance of a specific case.

32
Q

Suppressor effects

A

situation where a predictor has a significant effect, but only when another variable is held constant.

33
Q

t-statistic

A

Student's t is a test statistic with a known probability distribution (the t-distribution). In the context of regression it is used to test whether a regression coefficient b is significantly different from zero; in the context of experimental work it is used to test whether the difference between two means is significantly different from zero. See also paired-samples t-test and independent t-test.

34
Q

Tolerance

A

tolerance statistics measure multicollinearity and are simply the reciprocal of the variance inflation factor (1/VIF). Values below 0.1 indicate serious problems, although Menard (1995) suggests that values below 0.2 are worthy of concern.

35
Q

Total sum of squares

A

a measure of the total variability within a set of observations. It is the total squared deviance between each observation and the overall mean of all observations.
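
A minimal sketch (simulated data) of the decomposition running through the last few cards: the total sum of squares splits into the model and residual sums of squares:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 3.0 + 0.7 * x + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
ss_total = np.sum((y - y.mean()) ** 2)  # squared deviations from the overall mean
ss_resid = np.sum(fit.resid ** 2)       # squared deviations from the model's predictions
ss_model = ss_total - ss_resid          # variability the model accounts for
print(ss_total, ss_model + ss_resid)    # the two values should agree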

36
Q

Unstandardized residuals

A

the residuals of a model expressed in the units in which the original outcome variable was measured.

37
Q

Variance inflation factor (VIF)

A

a measure of multicollinearity. The VIF indicates whether a predictor has a strong linear relationship with the other predictor(s). Myers (1990) suggests that a value of 10 is a good value at which to worry. Bowerman and O’Connell (1990) suggest that if the average VIF is greater than 1, then multicollinearity may be biasing the regression model.
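
A hedged sketch with statsmodels (simulated, deliberately correlated predictors) that prints each predictor's VIF and its tolerance (1/VIF):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=200)  # strongly related to x1 on purpose
X = sm.add_constant(np.column_stack([x1, x2]))

for i in (1, 2):                                 # column 0 is the constant, so skip it
    vif = variance_inflation_factor(X, i)
    print(f"predictor {i}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")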

38
Q

SPSS Regression Model Output (Summary)

A

· The fit of the regression model can be assessed using the Model Summary and ANOVA tables from SPSS.
· Look for the R² value, which tells you the proportion of variance explained by the model.
· If you have done a hierarchical regression, assess the improvement of the model at each stage of the analysis by looking at the change in R² and whether this change is significant (look for values less than .05 in the column labelled Sig. F Change).
· The ANOVA also tells us whether the model is a significant fit of the data overall (look for values less than .05 in the column labelled Sig.).
· The individual contribution of variables to the regression model can be found in the Coefficients table from SPSS. If you have done a hierarchical regression, look at the values for the final model. For each predictor variable, you can see whether it has made a significant contribution to predicting the outcome by looking at the column labelled Sig. (values less than .05 are significant).
· The standardized beta values tell you the importance of each predictor (bigger absolute value = more important).

39
Q

Outliers

A

You need to look for cases that might be influencing the regression model:
· Look at standardized residuals and check that no more than 5% of cases have absolute values above 2, and that no more than about 1% have absolute values above 2.5. Any case with a value above about 3 could be an outlier.
· Look in the data editor for the values of Cook’s distance: any value above 1 indicates a case that might be influencing the model.

40
Q

Logistic Regression

A

· Can be used when the outcome variable is categorical.
· In logistic regression we make the same assumptions as in ordinary regression, except the assumption that the relationship between the variables is linear.
· A logistic transformation allows us to express a non-linear relationship in a linear way: it expresses the multiple linear regression equation in logarithmic terms (the logit), thus overcoming the problem of violating the assumption of linearity. The linearity assumption then becomes that each predictor has a linear relationship with the logit of the outcome variable.
· In logistic regression, instead of predicting the value of Y (the outcome variable), we predict the probability of the categorical outcome occurring (e.g., what is the probability that the company is still alive?) given the known values of the predictor variables. A minimal code sketch follows below.
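
A minimal sketch with statsmodels (made-up data and variable names) of predicting the probability of a binary outcome, here "company still alive", from one predictor:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
revenue = rng.normal(size=300)
p_alive = 1 / (1 + np.exp(-(0.5 + 1.2 * revenue)))  # true probabilities used to simulate the outcome
alive = rng.binomial(1, p_alive)                    # categorical (0/1) outcome

fit = sm.Logit(alive, sm.add_constant(revenue)).fit(disp=0)
print(fit.params)                                   # coefficients on the log-odds (logit) scale
print(fit.predict(sm.add_constant(revenue))[:5])    # predicted probabilities of being alive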

41
Q

two main types of logistic regression

A

Binary Logistic Regression – When we are trying to predict membership in one of two categorical outcomes (Dead or Alive)

Multinomial Logistic Regression – When we are trying to predict membership in one of more than two categorical outcomes.