Chapter 9: Linear Model (Regression) Flashcards
What is the difference between a linear model and correlation?
A linear model differs from a correlation only in that it uses an unstandardized measure of the relationship (b1) and includes a parameter (b0) that tells us the value of the outcome when the predictor is 0.
b0 = intercept & represents the value of the outcome when predictor is 0
Any straight line can be defined by two things:
1) the slope (b1) - the relationship between the predictor and the outcome in unstandardized units
2) the point at which the line crosses the vertical axis of the graph, known as the intercept (b0)
b1 & b0 are parameters known as regression coefficients
b1 represents mean change in the outcome for one unit change in predictor
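The flashcards above can be sketched in code; the coefficients b0 and b1 below are made-up values for illustration, not estimates from real data:

```python
# Hypothetical regression coefficients (assumed values, illustration only)
b0 = 2.0   # intercept: predicted outcome when the predictor is 0
b1 = 0.5   # slope: mean change in the outcome per one-unit change in the predictor

def predict(x):
    """Predicted outcome for a predictor value x: y-hat = b0 + b1*x."""
    return b0 + b1 * x

print(predict(0))   # 2.0 -> the intercept
print(predict(4))   # 4.0 -> 2.0 + 0.5 * 4
```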
Regression is used for two things:
1) prediction
2) test theories/explanations
A good model…
1) fits the data better than a model with no predictors
2) should account for an amount of variance that is judged to be of practical and/or scientific significance
3) individual predictors/regression coefficients are significantly different from 0
4) should not have any outliers biasing the model
5) expected to predict the outcomes well in other samples
6) can be generalized to additional samples (cross validation) because assumptions are met
does not prove causation; tests whether the data are consistent w/ a causal hypothesis
once a model is estimated, it can be used for forecasting; a forecasted score is called a predicted value
Same intercept, different slopes
b0 (intercept) is the same in each but b1 (slope) is different. it looks like three lines coming out of the same point in the graph.
Same slope, different intercept
b1(slope) is the same in each but b0(intercept) is different in each model. it looks like three separate lines going in the same direction, but do not connect at any point.
+ b1 = + relationship w/ outcome; - b1 = - relationship w/ outcome
the slope (b1) tells us what the model looks like (shape); the intercept (b0) locates the model in geometric space.
What is a regression analysis?
What are the two types?
term for fitting a linear model to data and using it to predict values of an outcome (dependent) variable from one or more predictor (independent) variables.
simple regression: one predictor variable in the linear model
multiple regression: several predictors in the linear model
What are residuals?
differences between what the linear model predicts and the observed data
What is the residual sum of squares (SSr)?
gauge of how well a linear model fits the data. it represents the degree of inaccuracy when the best model is fitted to the data. if it is a large number, it means the model is not representative of the data (a lot of error in prediction). if it is a small number, the line is representative of the data.
represents the total amount of error in the model but does not, by itself, tell us how good the model is
higher with more people, because more residuals are being summed
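A minimal sketch with made-up data and an assumed fitted line (y-hat = 2.2 + 0.6x); residuals and SSr come straight from the definitions above:

```python
# Made-up observations and an assumed fitted line (illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

predicted = [b0 + b1 * xi for xi in x]                   # model's predicted values
residuals = [yi - yh for yi, yh in zip(y, predicted)]    # observed - predicted
ss_r = sum(e ** 2 for e in residuals)                    # residual sum of squares
print(round(ss_r, 10))   # 2.4 -> squared error left after fitting the line
```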
Ordinary least squares (OLS) regression
method of regression in which the parameters of the model are estimated using the method of least squares
essentially, getting b values that make the sum of the squared residuals (error b/w observed and model) as small as possible
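For simple regression the least-squares b values have a closed form; a sketch with made-up data (any OLS routine would return the same estimates):

```python
# Made-up data (illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx = sum(x) / len(x)
my = sum(y) / len(y)
# Least-squares slope: sum of cross-deviations over sum of squared x-deviations
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx        # the least-squares line always passes through (mx, my)
print(b0, b1)            # 2.2 0.6 (up to floating-point rounding)
```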
What happens when there are more predictors?
expanded models with more predictors account for even more variance. unless the IVs are completely independent of one another, the b values for existing predictors change as additional predictors are added
w/ 2 predictors, a regression plane must be used to visualize the model.
w/ 3 or more, you can't visualize it.
What is goodness-of-fit?
good models exist on a continuum
assesses how well the model fits the observed data. we do this because even though the model may be the best one available, it can still be a bad fit. usually based on how well the data predicted by the model corresponds to the data actually collected.
done by comparing the complex model against a baseline model to see whether it improves how well we can predict the outcome, then calculating the error. if the complex model is any good, it should have significantly less error than the simple model
simple model is usually the mean of the outcome
R^2 and F Statistic
What is the total sum of squares (SSt)?
represents how good the mean is as a model of the observed outcome scores.
it is a measure of the total variability within a set of observations (total squared deviance between each observation and the overall mean of all observations)
What is the model sum of squares (SSm)?
the improvement in prediction resulting from using the linear model rather than the mean. this difference shows us the reduction in the inaccuracy of the model resulting from fitting the regression model to the data.
if SSm is large, the linear model is very different from using the mean to predict the outcome variable. this implies that the linear model has made a big improvement in predicting the outcome.
if SSm is small, using the linear model is little better than using the mean
SSm = SSt (total sum of squares) - SSr (residual sum of squares)
higher with more predictors
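A sketch of the decomposition with made-up data; the fitted line used here is the OLS line for these points, so the values are consistent with each other:

```python
# Made-up data and its least-squares line y-hat = 2.2 + 0.6*x (illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
my = sum(y) / len(y)

ss_t = sum((yi - my) ** 2 for yi in y)    # total: squared deviations from the mean
predicted = [2.2 + 0.6 * xi for xi in x]
ss_r = sum((yi - yh) ** 2 for yi, yh in zip(y, predicted))   # residual error
ss_m = ss_t - ss_r                        # improvement over the mean-only model
print(ss_t, round(ss_r, 10), round(ss_m, 10))   # 6.0 2.4 3.6
```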
What is R^2?
represents the amount of variance in the outcome explained by the model (SSm) relative to how much variation there was to explain in the first place (SSt).
proportion of variation in the outcome that can be predicted from the model
indicates whether the model is of scientific and/or practical significance
R^2 = SSm / SSt
we want this number to be high - it tells us the overall fit of the regression model
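With hypothetical sums of squares (assumed values of SSm and SSt for a small toy dataset):

```python
# Hypothetical sums of squares (assumed values, illustration only)
ss_m, ss_t = 3.6, 6.0
r_squared = ss_m / ss_t     # proportion of outcome variance explained by the model
print(r_squared)            # 0.6 -> the model explains 60% of the variation
```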
Mean Squares (MSm and MSr)
measure of average variability, based on the number of differences that were added up
MSm = SSm / k (df for the model; k = number of predictors)
MSr = SSr / (N - k - 1) (N = number of observations; k = number of predictors, i.e., the b values being estimated besides b0)
dividing undoes the biasing effect of # of predictors
What is the F-statistic?
measure of how much the model has improved the prediction of the outcome compared to the level of inaccuracy of the model.
if a model is good MSm will be large and MSr will be small
large F statistic = greater than 1 (1 indicates no improvement over the mean-only model)
F = MSm/MSr
if associated p-value is less than .05, there is significant improvement in prediction over the baseline model w/ no predictors
as the F gets higher, p gets lower (H0 less tenable)
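A sketch using hypothetical sums of squares for a simple regression (N = 5 cases, k = 1 predictor; values are made up):

```python
# Hypothetical values (illustration only)
ss_m, ss_r = 3.6, 2.4   # model and residual sums of squares
N, k = 5, 1             # observations and predictors

ms_m = ss_m / k               # average improvement per predictor
ms_r = ss_r / (N - k - 1)     # average residual error per degree of freedom
F = ms_m / ms_r               # well above 1 suggests improvement over the mean-only model
print(round(F, 10))           # 4.5
```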
What can the F statistic tell us about R^2?
F can be used to calculate the significance of R^2 (how different it is from 0)
F = (N - k - 1)R^2 / [k(1 - R^2)]
N = # of cases/participants, k = number of predictors in the model
if associated p-value is less than .05, R^2 is significantly different from 0
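Plugging hypothetical values into the formula above (N, k, and R^2 are made up; with these numbers it matches F = MSm / MSr computed directly):

```python
# Hypothetical values (illustration only)
N, k, r2 = 5, 1, 0.6
F = (N - k - 1) * r2 / (k * (1 - r2))
print(round(F, 10))   # 4.5 -> test this against an F distribution with k, N-k-1 df
```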
What is a flat model?
model in which the same predicted value arises from all values of the predictor values, and will have b-values of 0 for the predictors
if a variable significantly predicts an outcome, it should have a b value that is different from 0
regression coeff of 0 means a unit change in the predictor results in no change in the predicted value of the outcome and the linear model is flat
What is the t-statistic?
tests whether a b-value is significantly different from 0.
if the test is significant, we might interpret this as supporting a hypothesis that the b (regression coeffs) is significantly different (p < .05) from 0 and that the predictor variable contributes significantly to our ability to estimate values of the outcome, after accounting for the other predictors in the model. Each b gets its own test.
t = (b(observed) - b(expected)) / SEb = b(observed) / SEb
b(expected) = b value we expect to obtain if null is true, so 0.
b(observed) = b we calculate
SEb = how much error we estimate is likely in our b
when the SE is small, even a small deviation from 0 can reflect a sig difference because b is representative of the majority of possible samples
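A minimal sketch with a made-up coefficient and standard error (not real estimates):

```python
# Hypothetical estimate and standard error (illustration only)
b_observed = 1.0    # estimated regression coefficient
b_expected = 0.0    # value of b under the null hypothesis
se_b = 0.25         # estimated standard error of b

t = (b_observed - b_expected) / se_b
print(t)            # 4.0 -> b sits 4 standard errors away from 0
```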
What is generalization?
ability of a model to be applied to samples other than the one on which it was based. if it is not generalizable, then we must restrict conclusions to the sample used.
What are outliers?
data with extreme value on the outcome variable, Y (large residuals)
case that differs substantially from the main data trend. it can affect the estimates of the regression coefficients. outliers can be assessed by unstandardized, standardized, and studentized residuals. they are NOT always influential