Chapter 9: Linear Model (Regression) Flashcards
What is the difference between a linear model and correlation?
A linear model differs from a correlation only in that it uses an unstandardized measure of the relationship (b1) and adds a parameter (b0) that tells us the value of the outcome when the predictor is 0.
b0 = intercept & represents the value of the outcome when predictor is 0
Any straight line can be defined by two things:
1) the slope (b1) - the relationship between the predictor and the outcome in unstandardized units
2) the point at which the line crosses the vertical axis of the graph, known as the intercept (b0)
b1 & b0 are parameters known as regression coefficients
b1 represents mean change in the outcome for one unit change in predictor
Regression is used for two things:
1) prediction
2) test theories/explanations
A good model…
1) fits the data better than a model with no predictors
2) should account for an amount of variance that is judged to be of practical and/or scientific significance
3) individual predictors/regression coefficients are significantly different from 0
4) should not have any outliers biasing the model
5) expected to predict the outcomes well in other samples
6) can be generalized to additional samples (cross validation) because assumptions are met
does not prove causation; it only tests whether the data are consistent with a causal hypothesis
once a model is estimated, it can be used for forecasting; the forecast for a given case is called the predicted value
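As a minimal sketch, forecasting with a fitted model just plugs a predictor value into the line; the coefficient values below are made up for illustration.

```python
# Sketch of using a fitted linear model for forecasting; b0 and b1 are illustrative.
b0 = 2.0   # intercept: predicted outcome when the predictor is 0
b1 = 0.5   # slope: mean change in outcome per one-unit change in predictor

def predict(x):
    """Return the model's predicted value for a predictor score x."""
    return b0 + b1 * x

print(predict(0))   # 2.0  (the intercept)
print(predict(4))   # 4.0
```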
Same intercept, different slopes
b0 (intercept) is the same in each but b1 (slope) is different. it looks like three lines coming out of the same point in the graph.
Same slope, different intercept
b1(slope) is the same in each but b0(intercept) is different in each model. it looks like three separate lines going in the same direction, but do not connect at any point.
+ b1 = + relationship w/ outcome; - b1 = - relationship w/ outcome
the slope (b1) tells us what the model looks like (shape); the intercept (b0) locates the model in geometric space.
What is a regression analysis?
What are the two types?
term for fitting a linear model to data and using it to predict values of an outcome (dependent) variable from one or more predictor (independent) variables.
simple regression: one predictor variable in the linear model
multiple regression: several predictors in the linear model
What are residuals?
differences between what the linear model predicts and the observed data
What is the residual sum of squares (SSr)?
gauge of how well a linear model fits the data. it represents the degree of inaccuracy when the best model is fitted to the data. if it is a large number, it means the model is not representative of the data (a lot of error in prediction). if it is a small number, the line is representative of the data.
total amount of error in a model/does not tell us how good the model is
higher with more people
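The definition above can be sketched directly: square each residual and add them up. The data and coefficients below are illustrative only.

```python
# Sketch of the residual sum of squares (SSr); data and coefficients are made up.
def ss_residual(x, y, b0, b1):
    """Sum of squared differences between observed y and the model's predictions."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3]
y = [2.0, 4.5, 5.5]
print(ss_residual(x, y, b0=0.5, b1=1.75))   # 0.375
```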
Ordinary least squares (OLS) regression
method of regression in which the parameters of the model are estimated using the method of least squares
essentially, getting b values that make the sum of the squared residuals (error b/w observed and model) as small as possible
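For one predictor the least-squares b values have a closed form: b1 = cov(x, y) / var(x) and b0 = mean(y) - b1 * mean(x). A minimal sketch with made-up data:

```python
# Sketch of the closed-form OLS solution for a single predictor; data are illustrative.
def ols_fit(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # b1 minimizes the sum of squared residuals
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]        # perfectly linear data: y = 2x
b0, b1 = ols_fit(x, y)
print(b0, b1)               # 0.0 2.0
```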
What happens when there are more predictors?
expanded models with more predictors account for more variance. unless the predictors are completely independent of one another, the b estimates for existing predictors change as additional predictors are added
with 2 predictors, a regression plane must be used to visualize the model
with 3 or more predictors, the model can't be visualized
What is goodness-of-fit?
good models exist on a continuum
assesses how well the model fits the observed data. we do this because even though the model may be the best one available, it can still be a bad fit. usually based on how well the data predicted by the model corresponds to the data actually collected.
done by comparing the complex model against a baseline model to see whether it improves how well we can predict the outcome, then calculating the error. if the complex model is any good, it should have significantly less error than the simple model
simple model is usually the mean of the outcome
R^2 and F Statistic
What is the total sum of squares (SSt)?
represents how good the mean is as a model of the observed outcome scores.
it is a measure of the total variability within a set of observations (total squared deviance between each observation and the overall mean of all observations)
What is the model sum of squares (SSm)?
the improvement in prediction resulting from using the linear model rather than the mean. this difference shows us the reduction in the inaccuracy of the model resulting from fitting the regression model to the data.
if SSm is large, the linear model is very different from using the mean to predict the outcome variable. this implies that the linear model has made a big improvement in predicting the outcome.
if SSm is small, using the linear model is little better than using the mean
SSm = SSt (total sum of squares) - SSr (residual sum of squares)
higher with more predictors
What is R^2?
represents the amount of variance in the outcome explained by the model (SSm) relative to how much variation there was to explain in the first place (SSt).
proportion of variation in the outcome that can be predicted from the model
indicates whether the model is of scientific and/or practical significance
R^2 = SSm / SSt
we want this number to be high - R^2 tells us the overall fit of the regression model
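The whole decomposition (SSt, SSr, SSm = SSt - SSr, and R^2 = SSm/SSt) can be sketched together; the data and coefficients below are illustrative.

```python
# Sketch of the goodness-of-fit decomposition; data and b values are made up.
def fit_stats(x, y, b0, b1):
    my = sum(y) / len(y)
    ss_t = sum((yi - my) ** 2 for yi in y)                          # total variability around the mean
    ss_r = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # error left after fitting the model
    ss_m = ss_t - ss_r                                              # improvement over the mean-only model
    r2 = ss_m / ss_t                                                # proportion of variance explained
    return ss_t, ss_m, ss_r, r2

x = [1, 2, 3]
y = [2.0, 4.5, 5.5]
ss_t, ss_m, ss_r, r2 = fit_stats(x, y, b0=0.5, b1=1.75)
print(ss_t, ss_m, ss_r, r2)
```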
Mean Squares (MSm and MSr)
measure of average variability, based on the number of differences that were added up
MSm = SSm / k (k = number of predictors in the model, the model's df)
MSr = SSr / (N - k - 1) (N = number of observations; k = number of predictors, i.e., the number of b values being estimated besides the intercept)
dividing undoes the biasing effect of # of predictors
What is the F-statistic?
measure of how much the model has improved the prediction of the outcome compared to the level of inaccuracy of the model.
if a model is good MSm will be large and MSr will be small
large F-statistic = greater than 1 (a value of 1 indicates no improvement)
F = MSm/MSr
if associated p-value is less than .05, there is significant improvement in prediction over the baseline model w/ no predictors
as the F gets higher, p gets lower (H0 less tenable)
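The F computation above can be sketched from the sums of squares; the SS values below are illustrative.

```python
# Sketch of F = MSm / MSr; the SSm and SSr inputs are made-up example values.
def f_statistic(ss_m, ss_r, n, k):
    ms_m = ss_m / k            # average improvement per predictor
    ms_r = ss_r / (n - k - 1)  # average residual error
    return ms_m / ms_r

# e.g., SSm = 6.125 and SSr = 0.375 from a model with k = 1 predictor and n = 3 cases
f = f_statistic(6.125, 0.375, n=3, k=1)
print(f)
```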
What can the F statistic tell us about R^2?
F can be used to calculate the significance of R^2 (how different it is from 0)
F = [(N - k - 1)R^2] / [k(1 - R^2)]
N = # of cases/participants, k = number of predictors in the model
if associated p-value is less than .05, R^2 is significantly different from 0
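This formula can be sketched directly; the R^2, N, and k values below are made up.

```python
# Sketch of F computed from R^2: F = (N - k - 1) * R^2 / (k * (1 - R^2)).
def f_from_r2(r2, n, k):
    return ((n - k - 1) * r2) / (k * (1 - r2))

# With R^2 = 0.5, N = 52 cases, and k = 1 predictor:
print(f_from_r2(0.5, 52, 1))   # 50.0
```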
What is a flat model?
model in which the same predicted value arises from all values of the predictor values, and will have b-values of 0 for the predictors
if a variable significantly predicts an outcome, it should have a b value that is different from 0
regression coeff of 0 means a unit change in the predictor results in no change in the predicted value of the outcome and the linear model is flat
What is the t-statistic?
tests whether a b-value is significantly different than 0.
if the test is significant, we might interpret this as supporting a hypothesis that the b (regression coeffs) is significantly different (p < .05) from 0 and that the predictor variable contributes significantly to our ability to estimate values of the outcome, after accounting for the other predictors in the model. Each b gets its own test.
t = (b(observed) - b(expected)) / SEb = b(observed) / SEb
b(expected) = b value we expect to obtain if null is true, so 0.
b(observed) = b we calculate
SEb = how much error we estimate is likely in our b
when the SE is small, even a small deviation from 0 can reflect a significant difference, because b is representative of the majority of possible samples
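The t computation is a one-liner; the b and SE values below are illustrative.

```python
# Sketch of the t-test for a single regression coefficient; inputs are made up.
def t_statistic(b_observed, se_b, b_expected=0.0):
    # under H0 the expected b is 0, so this reduces to b / SE
    return (b_observed - b_expected) / se_b

print(t_statistic(1.75, 0.5))   # 3.5
```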
What is generalization?
ability of a model to be applied to other samples aside from the one which it was based on. if it is not generalizable, then we must restrict conclusions only to the sample used.
What are outliers?
data with extreme value on the outcome variable, Y (large residuals)
case that differs substantially from the main data trend. it can affect the estimates of the regression coefficients. outliers can be assessed by unstandardized, standardized, and studentized residuals. they are NOT always influential
unstandardized residuals
raw differences between the predicted and observed values of the outcome variable
standardized residuals
residuals converted to z scores/represented in SD units
1) if they are greater than 3, then they are cause for concern because a value that high is unlikely to occur
2) if more than 1% of our sample cases have residuals w an absolute value greater than 2.5 there is evidence that the level of error in the model is unacceptable
3) if more than 5% of cases have residuals with an absolute value greater than 2, then the model may be a poor representation of the data
use these because they are easier to interpret
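The three rules of thumb above can be sketched as a simple check; the residual values below are made up (and with only 10 cases the percentage rules are coarse).

```python
# Sketch of the rule-of-thumb checks on standardized residuals; z values are illustrative.
def residual_checks(z_residuals):
    n = len(z_residuals)
    return {
        "any |z| > 3": any(abs(z) > 3 for z in z_residuals),                       # individual cases of concern
        "> 1% with |z| > 2.5": sum(abs(z) > 2.5 for z in z_residuals) / n > 0.01,  # unacceptable error level
        "> 5% with |z| > 2": sum(abs(z) > 2 for z in z_residuals) / n > 0.05,      # possibly poor representation
    }

z = [0.2, -1.1, 0.5, 2.1, -0.3, 0.8, -0.6, 1.4, -2.6, 0.1]
checks = residual_checks(z)
print(checks)
```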
studentized residuals
unstandardized residual divided by an estimate of its SD that varies point by point
have the same properties as standardized, but provide a more precise estimate of the error variance of a special case
adjusted predicted value
predicted value of the outcome for that case from a model in which the case is excluded
estimate the model parameters excluding a particular case and use this new model to predict the outcome for the case that was excluded. if the model is stable then the predicted value of a case should be the same regardless of whether the case was used to estimate the model
deleted residual
difference between the adjusted predicted value and the original observed value
tell us about the influence of cases on the ability of the model to predict that case, but not about how the case influences the whole model
studentized deleted residual
the deleted residual is divided by the standard error to give us this value. it can then be compared across different regression analyses
Leverage
gauges the influence of the observed value of the outcome over the predicted values.
high leverage points
data with an extreme value on a predictor variable (x)
these points are extreme on the x-axis. if it drags the regression line towards it, it is influential
the average leverage value is (k + 1)/n, and all cases should have leverage close to it
cases with 2-3x the average leverage value are concerning and should be investigated
range from 0-1. a value of 1 indicates the case has complete influence over prediction
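The leverage rules of thumb can be sketched as a filter; the hat values below are made up (they sum to k + 1, as leverage values do).

```python
# Sketch of flagging cases whose leverage exceeds 2x the average (k + 1) / n.
def flag_high_leverage(leverages, k, multiplier=2):
    average = (k + 1) / len(leverages)          # expected average leverage
    return [h for h in leverages if h > multiplier * average]

leverages = [0.15, 0.18, 0.22, 0.19, 0.21, 0.17, 0.45, 0.16, 0.14, 0.13]
print(flag_high_leverage(leverages, k=1))       # [0.45]
```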
When is a case influential?
a case is influential if the model parameter estimates change substantially if the case is deleted and the model reestimated
a good model should not be so fragile that 1-2 cases change it a lot
cases will be influential if: they have some combination of being extreme on X (leverage) and extreme on Y (outliers)
influence is what matters most
if conclusions change based on 1-2 data points, the conclusions are said to be fragile
not all influential points have large residuals
can be examined using: Cook’s distance, Difference in Beta (DFBeta), Difference in Fit (DFFit).
Cook’s distance
measure of the overall influence of a case on the model
abs values greater than 1 may be a concern
Difference in Beta (DFBeta)
Standardized
measure of how much the estimates of the b’s change when a case is deleted
abs values greater than 1 may be a concern
Difference in Fit (DFFit)
Standardized
measure of the difference in prediction when a case is deleted
abs values greater than 1 may be a concern
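Deletion-based influence, in the spirit of DFBeta (though unstandardized here), can be sketched by refitting the model without each case and tracking how the slope changes; the data are made up, with one case that is extreme on X and off the trend.

```python
# Sketch of deletion-based influence on the slope; data are illustrative.
def ols_slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)

def slope_changes(x, y):
    full = ols_slope(x, y)
    changes = []
    for i in range(len(x)):
        # refit with case i deleted and record the raw change in b1
        changes.append(ols_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) - full)
    return changes

x = [1, 2, 3, 4, 10]
y = [2, 4, 6, 8, 9]   # the last case is extreme on X and pulls the slope down
changes = slope_changes(x, y)
print(changes)        # the last entry is by far the largest
```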
Consider deleting high leverage points or outliers only when:
they are influential
first check that influential points are not coding error
if it is not coding error: does the case change the conclusions? is it possible to get more observations near that value of X?
if case does change the conclusion: report the results with and without the influential case or restrict your analysis to values of X where the relationship holds
do not use this to drop cases to create desired results (p-hacking)
Assumptions of the linear model
1) additivity and linearity
2) independent errors
3) homoscedasticity
4) normally distributed error
additivity & linearity
outcome variable and predictors are linearly related/can be described by a linear model
do not use a linear model to describe a nonlinear relationship
independent errors
residual terms should be uncorrelated/independent
assumption necessary for CIs and significance tests to be valid
if violated, use robust methods or a multilevel model
homoscedasticity
residuals at each level of the predictor should have the same variance
if violated, CIs and significance tests are invalidated. use weighted least squares regression instead.
normally distributed errors
residuals in the model are random, normally distributed variables with a mean of 0
does not matter with large sample sizes because of the CLT
if violated with a small sample size, use bootstrapped CIs
other assumptions/considerations of the linear model
1) predictors are uncorrelated with external variables
2) Variable type: quantitative or dichotomous predictors, and continuous unbounded criterion
3) no perfect multicollinearity
4) non zero variance
predictors are uncorrelated with external variables
should be no external variables that correlate with any of the variables included in the model
regression results can be biased by an omitted (3rd) variable
if violated, conclusions are unreliable
Variable type: quantitative or dichotomous predictors, and continuous unbounded criterion
all predictor variables must be quantitative or categorical (with two categories), and the outcome must be quantitative, continuous, and unbounded
no perfect multicollinearity
if your model has more than one predictor then there should be no perfect linear relationship between 2 or more of the predictors (predictors should not correlate too highly)
if violated, lead to untrustworthy estimates of the b’s, and SE gets very big
non zero variance
predictors should have some variation in value (not have variance of 0)
cross validation of the model
assessing the accuracy of a model across different samples (how it generalizes to different samples)
two main methods: adjusted R^2 and data splitting
adjusted R^2
tells us how much variance in the predicted outcome would be accounted for if the model had been derived from the population from which the sample was taken (estimates what R^2 would be in a new sample)
when you try to apply a model to a different sample, R^2 from sample 1 to sample 2 will drop, causing a loss of predictive power known as SHRINKAGE.
shrinkage occurs because the process of fitting a model capitalizes on chance
more capitalizing on chance, and thus more shrinkage, occurs when there are more predictors and smaller sample sizes
with large samples, a model will be well estimated and shrinkage will be minimal
regression models are optimized for the sample they were created from
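One common adjusted-R^2 formula (the Wherry-style adjustment most software reports) shrinks R^2 based on the number of predictors k and sample size n; the input values below are illustrative.

```python
# Sketch of Wherry-style adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - k - 1).
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With R^2 = 0.5, n = 101 cases, and k = 4 predictors the estimate shrinks slightly:
print(adjusted_r2(0.5, 101, 4))
```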
data splitting
randomly splitting your sample data, estimating the model in both portions of the data, and comparing the resulting models (e.g., an 80/20 split, with the model estimated on the larger portion)
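The splitting step can be sketched as follows; the data are made up, the seed is fixed only for reproducibility, and in practice each portion would then be used to estimate and compare models.

```python
# Sketch of an 80/20 random data split for cross-validation.
import random

def split_data(pairs, train_fraction=0.8, seed=0):
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]   # (estimation portion, validation portion)

pairs = list(zip(range(10), range(10)))     # made-up (x, y) pairs
train, holdout = split_data(pairs)
print(len(train), len(holdout))             # 8 2
```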
To assume the b’s are normally distributed, we need a large sample. How big does it have to be?
situation specific power analysis is the best way to determine sample size needed.
General guidelines:
- if you expect to find a large effect: 80 people or higher
- if you expect to find a medium effect: 100 people or more if there are 6 or fewer predictors
- if you expect to find a small effect: don’t bother unless you can get a very large sample
Multiple regression: methods of entering predictors into a model
1) forced: all predictors are entered into the model at once. useful for testing theory when there are no established predictors.
2) hierarchical: predictors are added in blocks; established predictors are entered earlier in the process, and new predictors are assessed as a group last. useful for testing theory or the validity of new predictors.
3) stepwise: automatic method in which predictors are added to the model one by one based on partial correlations; the process stops when the removal criterion is met (i.e., the regression coefficient for the added predictor is not significant). useful for exploratory analyses when you have no idea what's going on and want to generate hypotheses. it gives models that can't generalize, and is frowned upon.
- this method is sensitive to sampling variation and the results don't generalize
adding control variables into the model doesn’t purify the analysis and their inclusion can result in inappropriate inferences
Parsimony
- explaining data while being as simple as possible
- accounting for variance in the simplest way
- more predictors account for more variance
- R^2 increases with more predictors added/see if new predictors have value
- change in R^2 from simple model to complex model: if it is significant, then sampling error alone cannot account for the difference
Assessing parsimony
- change in R^2 (significance and magnitude)
- Akaike Information Criterion (AIC): lowest AIC = most parsimonious model, penalizes you for adding predictors
you can compare different models and assess parsimony using the change in R^2 and AIC.
a significant and sizable change in R^2, together with a lower AIC, is an argument for the complex model
What is multicollinearity?
multiple regression
- occurs when the predictor variables themselves are related to each other (highly correlated)
- for ex. trying to predict lawyer salary based on age and experience (hard to tease apart)
- level of multicollinearity varies from none to perfect multicollinearity (continuum). if none, it means that all predictor variables are unrelated. if perfect, can’t fit a regression model because there are infinite models
- mild multicollinearity is not a big deal but high multicollinearity makes it more difficult to estimate the b’s
- high multicollinearity can result in the standard errors of the coeffs being very high. this means more errors in each estimate and a wider sampling distribution (parameters not well estimated)
- parameter estimates change wildly from sample to sample; CI’s will be wide; and harder to find significance
How to diagnose multicollinearity?
- simplest way: examine the correlations among predictors. correlations that are higher than .8 or .9 suggest multicollinearity may be an issue
- examining the correlation matrix will miss more subtle forms of multicollinearity
- we need stats that are specific to detecting multicollinearity. one is the variance inflation factor (VIF) for each predictor: VIF_k = 1 / (1 - R^2_k), where R^2_k comes from regressing predictor k on all the other predictors
- VIF > 10 is a problem
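The VIF formula can be sketched directly; the R^2_k inputs below are made up.

```python
# Sketch of the variance inflation factor: VIF_k = 1 / (1 - R^2_k).
def vif(r2_k):
    return 1 / (1 - r2_k)

print(vif(0.0))   # 1.0 -> predictor unrelated to the others
print(vif(0.9))   # ~10 -> the conventional cause for concern
```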
how to deal with multicollinearity?
1) do nothing
2) get rid of the variables
3) combine the correlated variables
4) use a method that can handle highly correlated variables like partial least squares or principal component analysis
how can the unstandardized simple regression equation be written in standard form?
- variables expressed as z scores
- Zy = r(Zx)
- predicting Y’s z score, input is not the person’s score on X but rather their z score on X
how can the unstandardized multiple regression equation be written in standard form for 2 predictors?
Zy = Beta1(Zx1) + Beta2(Zx2) + …
- the betas take the place of the correlation coefficients because we have to account for the other predictors
- the std multiple regression equation works the same way as the std simple regression equation. Plug in someone’s z score on the X variables to get their predicted z score on the Y variable
- the std betas refer to how many SDs the outcome will change for every SD change in the predictor
- can be extended to include more predictors
std betas help assess relative importance of dif predictors in SD units
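For the simple (one-predictor) case, the standardized relationship Zy = r(Zx) can be sketched with made-up, perfectly correlated data, where the standardized slope equals the correlation.

```python
# Sketch of the standardized simple regression equation Zy = r * Zx; data are illustrative.
import statistics

def z_scores(values):
    m = statistics.mean(values)
    s = statistics.pstdev(values)      # population SD, for simplicity
    return [(v - m) / s for v in values]

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                   # perfectly linear, so r = 1
zx, zy = z_scores(x), z_scores(y)
r = sum(a * b for a, b in zip(zx, zy)) / len(zx)   # correlation = mean of z-score products
print(r)                                # ≈ 1.0 for perfectly correlated data
```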