Module 2 Regression Flashcards

1
Q

develop a conceptual understanding of the regression model

A

REGRESSION
Linear regression is a method of predicting a single quantitative outcome variable, usually a quasi-continuous one. If your outcome is true/false, yes/no, dead/alive, or something similarly dichotomous, linear regression is not suitable. If you are looking at an outcome that has only five or six levels—for example, a single rating-scale item—it may also be unsuitable. Linear regression is focused on predicting an outcome (DV) using predictor variables (IVs). It is based on Pearson correlation, one of the oldest analyses in psychology.

OTHER TYPES OF REGRESSION
There are other methods besides linear regression, including:
-logistic regression (used for dichotomous or polytomous outcomes)
-nonlinear regression
-special nonlinear functions
-methods for multiple ‘predictors’ and multiple ‘outcomes’, such as SEM (structural equation modelling) or PLS (partial least squares) regression.

If the shape of a relationship follows a mathematical function, you can apply the inverse of that function. For example, if the relationship follows a square function, taking the square root turns it into a straight line. We can then bring the data into the linear regression framework.

Variables like income and socioeconomic status are often nonlinear: a very small number of people have a large amount of everything, and a moderately large number have much less. There are almost no relationships of this kind that you can’t linearise with a log transform. Remember, if you encounter variables that have these unusual shapes, there is usually a way to bring them into a linear framework.
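The log-transform trick above can be sketched in Python. The data here are simulated and hypothetical (an income-style variable that grows exponentially with a predictor), purely to show the correlation improving after the transform:

```python
import numpy as np

# Hypothetical, simulated data: income grows exponentially with a predictor,
# producing the skewed, nonlinear shape described above.
rng = np.random.default_rng(0)
years_edu = np.linspace(8, 20, 200)                        # assumed predictor
income = np.exp(0.4 * years_edu) * rng.lognormal(0.0, 0.3, 200)

# Pearson correlation before and after the log transform
r_raw = np.corrcoef(years_edu, income)[0, 1]
r_log = np.corrcoef(years_edu, np.log(income))[0, 1]
# the log-transformed relationship is much closer to a straight line,
# so r_log is substantially higher than r_raw
```
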

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

use SPSS to perform a multiple regression analysis

A

2

3
Q

Notes from collaborate session on REGRESSION
interpret output—in particular, explain the meaning of
R2 and the ANOVA table
B and β
zero-order, part and partial correlations

A

In the regression analysis, SPSS gave us 1-tailed significance, BUT we should actually consider significance using a 2-tailed t-test. To get the 2-tailed test, we actually need to run via the
> Correlate menu.
(It may be better to run > Correlate
prior to > Regression.)
A STRONG correlation is 0.5 or greater.
A MODERATE correlation is 0.3 to 0.5.
TABLE: Variables entered/removed
A summary of the predictors added, how many blocks they were added in, and the method of entry. The method of entry is either Enter or Stepwise. Enter (the more usual choice) adds all predictors at once; Stepwise adds the strongest correlate first, then the rest sequentially.
MODEL SUMMARY
Our model (from the lab example) explains 70.2% of the variance in our DV (from R square = 0.702) when we estimate the DV using the regression equation. We are in error by an average of +/- 1.27 (from standard error of the estimate = 1.265).
ANOVA: The 70.2% of variance in our DV that we can explain represents a significant amount of variance, F(4, 75) = 44.22, p < 0.001. The 4 is the regression degrees of freedom, and the 75 is the residual degrees of freedom.
The ANOVA table tests the overall significance of the model.
ADJUSTED R SQUARED: estimates how much variance would be explained if we tested the whole population. R square and adjusted R square become closer in value as N becomes larger (and at N = population they would be the same!).
THE STANDARD ERROR OF THE ESTIMATE: if I use the regression equation on new cases to predict their scores, the predicted score would typically be within +/- the standard error of the actual score.
COEFFICIENTS TABLE: The UNSTANDARDISED coefficient (B) tells me that every time that IV (predictor) goes up by 1, the DV goes up by the amount of that predictor’s unstandardised coefficient. We use the UNSTANDARDISED COEFFICIENTS (B’s) in the regression equation.
BETA tells us that if an IV (predictor) increases by 1 standard unit, the DV goes up by that predictor’s beta (in standard units). Beta lets us see which IV has the strongest influence. Cohen’s guidelines rate beta as:
0.1 = small effect
0.3 = medium effect
0.5 = large effect
Sig < 0.05 means the IV is a significant predictor of the DV; i.e. if Sig > 0.05, the predictor is NOT a significant predictor of the DV.
THE CONFIDENCE INTERVALS relate to the B’s. They give 95% confidence that the true population coefficient falls between the lower and upper bound limits. E.g. B = 0.152: for every 1-unit increase in rating of CBT program efficacy (IV), there is a 0.152 increase in anxiety score, and we are 95% confident that the actual increase in anxiety would fall between 0.075 (lower bound) and 0.230 (upper bound).
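A sketch of how SPSS’s 95% bounds arise from B and its standard error. The standard error of 0.039 is an ASSUMED value chosen to mirror the lab example’s bounds; it is not taken from the actual output:

```python
from scipy import stats

B = 0.152            # unstandardised coefficient from the lab example
se_B = 0.039         # ASSUMED standard error (chosen to reproduce the bounds)
df_resid = 75        # residual degrees of freedom, as in the ANOVA table

t_crit = stats.t.ppf(0.975, df_resid)   # two-tailed 95% critical t
lower = B - t_crit * se_B
upper = B + t_crit * se_B
# lower is about 0.075 and upper about 0.230 -- the interval in the table
```
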
CORRELATIONS. “Zero order” is the same as “Pearson’s bivariate correlation”.
PART (aka semipartial): if we square these values (part squared) we get the unique proportion of DV variance explained by this predictor, minus any overlapping variance. For self-esteem this value is 0.303; if we square it we get 0.092, so we can say that 9.2% of the variance in anxiety is uniquely explained by self-esteem. NOTE that part and part squared are often reported in APA style to 3 decimal places, unlike most values reported to 2, because they are usually very small.
COLLINEARITY STATISTICS: The most intuitive value to read is the tolerance value, but reporting either is fine. The variance inflation factor (VIF) is calculated from tolerance (VIF = 1/tolerance), so they say the same thing in different ways. TOLERANCE = the amount of variance in the predictor unexplained by, or not overlapping with, the other predictors. Ideally we want it to be > 0.2. Here, the self-esteem tolerance value is 0.336, meaning that 33.6% of the variance in this predictor is not explained by the other predictors in the model. The VIF is 2.974, meaning the variance is inflated by a factor of 2.974 compared to zero overlap between the predictors. The matching cut-off for VIF is 5 when we set the tolerance cut-off at 0.20, so our VIF is also below the cut-off and OK. NOTE: Dr J recommends a tolerance cut-off of > 0.20, which is a VIF of 5 (because 1/0.20 = 5), as a more conservative approach than the oft-recommended tolerance cut-off of 0.3 (VIF of about 3).
Tolerance is only about how the IVs may or may not overlap with each other; it is NOT about how much overlap they share with the DV. If tolerance = 0.2, then 20% of the predictor’s variance is not overlapping, i.e. 80% is overlapping (that’s a lot).
A VIF of 5 means the variance of the betas is inflated 5 times compared to predictors that are unrelated.
CHECKING FOR MULTIVARIATE OUTLIERS;
RESIDUALS STATISTICS:
MAHALANOBIS DISTANCE is a measure of multivariate outliers. The larger the value, the more of an outlier a case is — but it is not necessarily impacting the results; i.e. Mahalanobis distance says whether a case is an outlier, not whether it is doing harm. The maximum tells us how far the furthest case is from the centroid. Compare this maximum to a cut-off value (look up a chi-squared table) to see whether it is an outlier. The cut-off depends on the number of predictors: degrees of freedom = number of predictors, using p < 0.001. In this example the cut-off is 18.47, so any case with a Mahalanobis distance > 18.47 is an outlier. Here we have no outliers because the maximum was less than the cut-off. The residuals table only gives the max and min Mahalanobis distance, not all the individual values — but we saved Mahalanobis distance as a variable earlier, so we can check all individual scores if we wish (e.g. if we do have outliers and need to see how many; sorting by value makes this easier).
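The chi-squared cut-off and the distances themselves can be computed directly. The data here are simulated; the df = 4, p < .001 cut-off matches the 18.47 quoted above:

```python
import numpy as np
from scipy.stats import chi2

# cut-off: chi-squared critical value, df = number of predictors, p < .001
cutoff = chi2.ppf(1 - 0.001, df=4)      # about 18.47 for 4 predictors

# squared Mahalanobis distance of each case from the centroid (simulated data)
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 4))            # 80 cases, 4 predictors
diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

outliers = d2 > cutoff                  # flag multivariate outliers
```
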
COOK’S DISTANCE is a measure of how influential an outlier is, i.e. how much impact including that case’s data has on the regression equation. If a case has a Cook’s distance > 1, it is unduly influencing the results. In the table, check that the maximum Cook’s distance is < 1; if so, all is good. If the maximum is > 1, go back to the individual Cook’s distance scores, sort them, and usually remove those cases from the analysis. A GREAT thing to do is run the analysis with the outliers in and write up the APA summary of what the findings mean, then run it again with the outliers removed and write the summary again. If the numbers change a little but the words and the high/medium/low interpretation stay the same, the outliers were not really influencing how the results are interpreted.
NORMALITY OF RESIDUALS (a residual is the difference between a person’s actual score on the dependent variable and what the model predicts). An assumption of the linear regression model is that if we calculate all the residuals and plot them, they should look like a bell-curve normal distribution. When you request a frequency histogram of residuals from SPSS, it will helpfully overlay a curve showing whether it follows a bell curve or not. You also get a P-P plot, where you want the points to fall very close to the line. If it looks more “snakey”, a skew or kurtosis issue is happening. Usually you only need to look at either the histogram or the P-P plot of residuals, as they say the same thing. Further tests can be run if you suspect an issue, e.g. examining the skew or kurtosis figures to see if they differ significantly from normal.
STANDARDISED RESIDUALS BY STANDARDISED PREDICTED SCORES PLOT: we want an even spread of points above and below the zero line; if so, we have met the assumption of homoscedasticity.
CODING: because in the example high scores on “low self-esteem” actually mean low self-esteem, we might prefer to reverse-code so that a high score = high self-esteem, e.g. create a new variable such as selfest_reverse. You need to go through and map each old score to its new score (e.g. old score 0 becomes new score 12, etc.). Even though 6 remains 6, still enter it completely, as otherwise SPSS will give blanks. Reverse coding gives the same correlation value, but the sign (+/-) is reversed — it just makes more sense for the brain. REMEMBER to only use one version of the variable in the analysis.
The REGRESSION EQUATION is usually more theoretical; we are not usually interested in actually predicting scores, but rather in e.g. strength of correlation, amount of variance, ability to predict variance — i.e. is it a good model? We also never use the regression equation to fill in missing data; estimates of missing data are achieved via other methods.
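The quantities in the model summary and ANOVA tables above (R square, adjusted R square, standard error of the estimate, and F with its 4 and 75 degrees of freedom) can all be reproduced from first principles. This is a sketch on simulated data, not the lab dataset:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 80, 4                                   # 80 cases, 4 predictors -> F(4, 75)
X = rng.normal(size=(n, k))
y = X @ np.array([0.8, 0.5, 0.3, 0.1]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)     # the unstandardised B's
ss_res = np.sum((y - Xd @ b) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

r2 = 1 - ss_res / ss_tot                             # R square
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)        # adjusted R square
see = np.sqrt(ss_res / (n - k - 1))                  # std error of the estimate
F = (r2 / k) / ((1 - r2) / (n - k - 1))              # ANOVA F(k, n-k-1)
p = stats.f.sf(F, k, n - k - 1)                      # overall model significance
```
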

4
Q

list the assumptions of regression and explain key diagnostic tests.
(worked example video 2 d)

A

ASSUMPTIONS/DIAGNOSTICS
-Multicollinearity & singularity
Predictors should not be redundant.
-Outliers
No aberrant values (outliers can also skew parameter estimates).
-Normality of residuals
-Linearity of the model
Regression only models linear relationships.
Need to check for non-linear effects.
-Homoscedasticity
Prediction should be equally good at all levels of the DV.
-Independence of errors
Correlated residuals mean there is systematic unexplained variance or confounding.
-Measurement error

These assumptions concern how adequately a linear model can explain the data.

5
Q

positive correlation

A
6
Q

strong negative correlation

A
7
Q

no correlation but a strong relationship

A
8
Q

statistical regression video

A

E.g. a typical linear regression, using the example of a gas bill.
The relationship between $ and MJ used is an example of a perfect linear relationship.
Usually, though, the relationship is less than perfect, and there are often multiple determinants. The regression procedure is basically figuring out all these determinant relationships.
REGRESSION EQUATION
eg phone bill
bill = (line rental) + (# calls) × ($ per call)
Y = A + BX
A is the intercept.
In regression, we are predicting Y from X, so
Y’ = A + BX
Y’ = predicted Y. It is important to specify that it is a prediction because usually the relationship is not perfect.
B is the slope of the line — the regression coefficient — and says how much each X contributes to Y.
X is the predictor value and may actually comprise multiple variables.
Usually in psychology we are most interested in B, but occasionally in A.

9
Q

line of best fit video

A

We have a scatter plot and try to find the line of best fit for the data. Most data points will not lie exactly on the line of best fit; the distance from the line is the error. We try to minimise error as much as possible, but some will always remain.
error = Y - Y’
The regression software will find A and B (of the regression equation Y’ = A + BX) to minimise the sum of squared errors.
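The A and B that minimise the sum of squared errors have closed-form solutions. A sketch with made-up gas-bill-style numbers (they agree with numpy’s own `polyfit`):

```python
import numpy as np

x = np.array([100., 150., 200., 250., 300.])   # hypothetical MJ used
y = np.array([55., 70., 88., 100., 118.])      # hypothetical bill in $

# least-squares solution: B = cov(x, y) / var(x), A = mean(y) - B * mean(x)
B = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
A = y.mean() - B * x.mean()

sse = np.sum((y - (A + B * x)) ** 2)           # the quantity being minimised

# numpy's own least-squares fit gives the same line
B_np, A_np = np.polyfit(x, y, 1)
```
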

10
Q

basic steps of regression

A

Step 1: Prepare data
Make sure your data is clean. Look for missing values, outliers, normality, linearity, etc.
Step 2: Run regression software
Perform the following sub-steps to run the regression software:
enter DV and IVs
choose options.
Step 3: Interpret parameter estimates
Intercept and regression coefficients
R2
ANOVA
Step 4: Check diagnostics
Check for:
Linearity
Normality
Homoscedasticity
Independence of errors
Outliers
Multicollinearity and singularity
Measurement error

11
Q

worked example video part 2

A

OUTPUT: COEFFICIENTS TABLE
Regression coefficients — these values are the B’s for the model equation.
- Unstandardised (in the units of the original measure).
- Standardised (units are Z scores: the change in SDs of the DV for a 1 SD change in the IV). Gives some indication of which variables have the biggest influence on the outcome, and allows comparison of variables that would otherwise be apples and oranges — in Z scores each variable is on an equal footing. Z scores have a mean of zero and a standard deviation of 1.
- Shows t tests for which variables are significant predictors.
- B values in the table are in the original units.
- Beta is B in standardised form.
- Zero-order correlations are equivalent to Pearson’s correlations.
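The claim that beta is B in standardised form can be verified numerically: multiplying B by SD(IV)/SD(DV) gives exactly the coefficient you would get from regressing z-scores. Simulated data, with the two predictors deliberately on very different scales:

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(0, 2.0, 100)
x2 = rng.normal(0, 10.0, 100)      # deliberately on a much larger scale
y = 1.5 * x1 + 0.2 * x2 + rng.normal(size=100)

X = np.column_stack([np.ones(100), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]       # unstandardised B's

# beta = B * (SD of predictor / SD of DV)
beta1 = b[1] * x1.std(ddof=1) / y.std(ddof=1)
beta2 = b[2] * x2.std(ddof=1) / y.std(ddof=1)

# same answer from regressing z-scores directly
z = lambda v: (v - v.mean()) / v.std(ddof=1)
Z = np.column_stack([np.ones(100), z(x1), z(x2)])
bz = np.linalg.lstsq(Z, z(y), rcond=None)[0]
```
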

12
Q

worked example video part 2b
PART OR SEMI PARTIAL CORRELATION

A

The part (semipartial) correlation is useful because its square shows how much each IV uniquely contributes to the DV: part squared, or sr².
Zero-order correlations may differ markedly from part correlations if correlations are shared with other variables.

	Part correlation meanings:
	-a variable with a large part correlation adds a lot of prediction that no other variable can account for.
	-a variable with a small part correlation and a large zero-order correlation is “redundant” (its effect can be accounted for by other variables — but equally it may be able to replace them; it is a theoretical question what the statistics actually mean in the situation).
	-a variable with similar zero-order and part correlations shares little or no overlap with the other predictors.
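Since the square of the part correlation is the predictor’s unique share of DV variance, it can be computed by residualising the predictor on the other IVs. A sketch on simulated overlapping predictors:

```python
import numpy as np

rng = np.random.default_rng(4)
x2 = rng.normal(size=120)
x1 = 0.6 * x2 + rng.normal(size=120)        # x1 overlaps with x2
y = 0.5 * x1 + 0.5 * x2 + rng.normal(size=120)

def residualise(v, others):
    """Return the part of v not explained by the other variables."""
    X = np.column_stack([np.ones(len(v))] + list(others))
    b = np.linalg.lstsq(X, v, rcond=None)[0]
    return v - X @ b

# part (semipartial) correlation: residualise the PREDICTOR only
part_r = np.corrcoef(y, residualise(x1, [x2]))[0, 1]
unique_var = part_r ** 2                    # unique proportion of DV variance

zero_r = np.corrcoef(y, x1)[0, 1]           # zero-order correlation, for contrast
```

Because x1 and x2 overlap, the part correlation comes out well below the zero-order correlation, exactly the “redundancy” pattern described above.
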
13
Q

worked example video part 2c
PARTIAL CORRELATION

A

Partial correlation = the correlation between two variables, with the influence of the other variables removed from both the predictor and the criterion.
Squared partial correlation = pr².
Partial correlation interpretation:
-you need the zero-order or part correlation for context (a large partial correlation may be a big chunk of a small pie)
-it tells you about incremental prediction
-it can help you figure out which variables overlap each other.
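The only mechanical difference from the part correlation is that BOTH variables are residualised. A sketch on simulated data; note |partial| always comes out at least as large as |part|, which is the “big chunk of a small pie” caution above:

```python
import numpy as np

rng = np.random.default_rng(4)
x2 = rng.normal(size=120)
x1 = 0.6 * x2 + rng.normal(size=120)
y = 0.5 * x1 + 0.5 * x2 + rng.normal(size=120)

def residualise(v, others):
    X = np.column_stack([np.ones(len(v))] + list(others))
    b = np.linalg.lstsq(X, v, rcond=None)[0]
    return v - X @ b

x1_res = residualise(x1, [x2])
# part: residualise the predictor only; partial: residualise predictor AND criterion
part_r = np.corrcoef(y, x1_res)[0, 1]
partial_r = np.corrcoef(residualise(y, [x2]), x1_res)[0, 1]
```
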

14
Q

multi collinearity

A

Multicollinearity diagnostics check whether each predictor can be predicted from the other predictors.
If the tolerance is < 0.3, there is a problem.
VIF = variance inflation factor. VIF is the inverse of tolerance:
VIF = 1/tolerance.
If VIF > 3 there is a problem, i.e. a variable has a lot of overlap with the other variables and may actually be redundant.
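Tolerance and VIF come from regressing each predictor on the others. A sketch with simulated, deliberately overlapping predictors (hypothetical variable names):

```python
import numpy as np

rng = np.random.default_rng(5)
x2 = rng.normal(size=100)
x3 = rng.normal(size=100)
x1 = 0.8 * x2 + 0.4 * x3 + rng.normal(scale=0.6, size=100)  # x1 overlaps the others

def tolerance(target, others):
    """1 - R^2 from regressing the target predictor on the other predictors."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    b = np.linalg.lstsq(X, target, rcond=None)[0]
    resid = target - X @ b
    r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1 - r2       # proportion of the predictor NOT explained by the others

tol = tolerance(x1, [x2, x3])
vif = 1 / tol           # VIF is the inverse of tolerance
flag = tol < 0.3        # the cut-off quoted on this card
```
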

15
Q

normality and linearity

A

Normality and linearity: plot the residuals against the predicted values. The points should fall fairly close to the line; substantial deviations suggest non-normality or non-linearity, in which case the model should be revised.

16
Q

heteroscedasticity and linearity

A

A scatterplot of the residuals: they should be evenly spread around the zero-value line.

17
Q

independence

A

Independence
-save the residuals
-compute the correlation of these with the IVs (should be non-significant)
-also check against e.g. case number (there might be sequence effects, data coding issues, entry artifacts, etc.)

18
Q

empirical model building

A

There are several different types of empirical model building. It is often best to try multiple methods, so you can determine which gives the most valuable insight into the parameter you are trying to explain — sometimes how the model is built affects the outcome. Often go with a method whose results overlap with those of other methods.
A. BACKWARD ELIMINATION
Asks: are there any non-significant variables in here? Keep removing non-significant variables until only significant ones are left and the overall equation is significant.
step 1. Put all variables in.
step 2. Remove the variable with the largest non-significant p-value (or by other criteria).
step 3. Repeat step 2 until only significant parameters remain.
B. FORWARD ENTRY
Start “empty” and add the predictor with the highest correlation (and lowest p-value) first. Keep adding until further additions no longer significantly improve the model.
step 1. Start with an “empty” model.
step 2. Add the variable with the smallest significant p-value.
step 3. Repeat step 2 until only non-significant variables remain outside the model.
C. STEPWISE MODELLING
the model may change when variables are either added or removed
step 1. add variable (like forward entry)
step 2. remove any variable which no longer meets criteria (like backward elimination)
step 3. repeat until all variables in model are significant and no external variables meet entry criteria.
Note that sometimes when we add or subtract 1 predictor, it changes the significance of some other predictors. This is why best to go back and forth with adding and subtracting predictors a bit.
D. INFORMATION CRITERIA
Sometimes you need to think about what you are trying to achieve and draw the line when you have achieved it — e.g. it is often better to have 2 variables explaining 50% of the variance than 10 variables explaining 52%.
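Backward elimination (method A) can be sketched as a loop over p-values. This is a from-scratch illustration on simulated data with made-up variable names, not SPSS’s exact algorithm:

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Two-tailed p-values for each column of X (X must include an intercept column)."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    mse = resid @ resid / (n - k)
    se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(b / se), n - k)

def backward_eliminate(X, y, names, alpha=0.05):
    keep = list(range(len(names)))
    while len(keep) > 1:
        Xc = np.column_stack([np.ones(len(y))] + [X[:, j] for j in keep])
        p = ols_pvalues(Xc, y)[1:]          # skip the intercept's p-value
        worst = int(np.argmax(p))
        if p[worst] <= alpha:               # everything left is significant
            break
        keep.pop(worst)                     # drop the least significant predictor
    return [names[j] for j in keep]

rng = np.random.default_rng(6)
X = rng.normal(size=(150, 3))
y = 1.2 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=150)  # x3 is pure noise
model = backward_eliminate(X, y, ["x1", "x2", "x3"])      # x1 and x2 survive
```
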

19
Q

check model fit and stability video

A

CHECK MODEL
At the stage where you have things almost sorted with your model and just need to double-check: a point is impacting your regression solution if the regression parameters change significantly when you remove that point from the data.

Things which may indicate an outlier exerting undue influence:
A) Discrepancy — is the point in line with the regression line, or is it discrepant?
B) Distance — the point is far away from e.g. the other points.
C) Leverage — the distance of a predictor observation from the centroid of that predictor. It has the potential to influence the regression; leverage asks whether there is something unusual about the predictor value. The potential to impact does not necessarily mean that it does.
D) Influence — this is leverage × discrepancy, and it does have a big impact on the regression parameters. Multivariate outliers have a high potential for influence. Influence is not about the possibility of changing the regression — it definitely does change it. Influence is measured by Cook’s D (Cook’s distance), a combination of leverage and residual. It asks: if I took this point out, how much would the model change? Remove a point from the regression, then sum all the changes in the model. This is the most useful diagnostic to run. Running diagnostics basically means checking the model is as good as it can be and has no issues. If Cook’s D is > 1 there is an issue; alternatively, if Cook’s D is > 4/N there is an issue.
Note this will be saved as a new variable (therefore only do it towards the end, as otherwise the data sheet will get very cluttered).
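Cook’s D combines leverage (from the hat matrix) and the residual exactly as described. A from-scratch sketch on simulated data, with one deliberately influential point appended so both cut-offs fire:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=30)
y = 2 * x + rng.normal(scale=0.5, size=30)
x = np.append(x, 6.0)      # high leverage (far from the predictor centroid) ...
y = np.append(y, -12.0)    # ... AND high discrepancy (far from the trend)

n = len(x)
X = np.column_stack([np.ones(n), x])
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverage values (hat diagonal)
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
p = X.shape[1]                                  # number of fitted parameters
s2 = resid @ resid / (n - p)

# Cook's D for each case: (residual part) x (leverage part)
cooks_d = (resid ** 2 / (p * s2)) * (h / (1 - h) ** 2)
flagged = cooks_d > 4 / n                       # the 4/N rule of thumb from above
```

The appended point has by far the largest Cook’s D, comfortably above both the > 1 and > 4/N cut-offs.
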

20
Q

multivariate outlier

A

This shows a flattish but normal distribution for the univariate variables along each axis, for illustration. The points on each curve by themselves would not be outliers — they sit right on the curve — BUT when the variables are plotted against each other as multivariate data, a point becomes an outlier. This is an important concept: without taking this step, you would never appreciate that you have a multivariate outlier.

21
Q

high distance, high leverage, low discrepancy

A

high distance = a fair way away from the other points
high leverage = in line with the regression line
low discrepancy = unlikely to exert much impact on the line, but still double-check anyway

22
Q

moderate distance, low leverage, high discrepancy

A

would not change slope but would drag line fractionally to right

23
Q

high leverage, high discrepancy, high influence

A

this is the most impactful (worst) type of outlier. Will change both the slope and the intercept.

24
Q

looking at outliers

A

LOOKING AT OUTLIERS
-we have already inspected the predictors for outliers
-inspect the residuals for outliers
-see if there are any unusual cases

The regression model calculates some additional measures for each case:
-Mahalanobis distance: detects multivariate outliers; check the critical value from a chi-squared table.
-Cook’s D: measures how influential each case is on the solution.