Exam 1 Flashcards
Why use models?
To understand the relationships between variables
To predict future outcomes
To quantify differences between groups or treatments
Response variable
the variable that you want to understand/model/predict. aka - y, dependent variable
explanatory variables
the variables you know and suspect are related to the response, which you use to find a pattern/model/relationship. aka - x, independent variables, predictor variables, covariates
model
a function that combines explanatory variables mathematically into estimates of the response variable
error
what’s left over; the variability in the response that your model doesn’t capture (error is somewhat of a misnomer; noise may be a better term)
Categorical Data
Values fall into categories (labels or groups) rather than numbers; there may be two or more possible outcomes
Quantitative variables
Numerical
Parameter
Describes entire population
Statistic
Describes sample
The four-step process
- Choose
- Fit
- Assess
- Use
Model Notation
Y = f(X) + e
ybar or xbar
averages
yhat
estimate
Y = ? (Simple Linear Regression)
Beta0 + Beta1*X + e
Yhat = ? (Simple Linear Regression)
Beta0 + Beta1*X
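A minimal Python sketch of fitting this model by least squares (the data values are made up for illustration):

```python
import numpy as np

# Hypothetical toy data: here y is exactly 2 + 3x, so the fit is exact
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x

# Closed-form least-squares estimates for Yhat = Beta0 + Beta1*X
xbar, ybar = x.mean(), y.mean()
beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0_hat = ybar - beta1_hat * xbar

yhat = beta0_hat + beta1_hat * x  # fitted values (no error term)
print(beta0_hat, beta1_hat)  # 2.0 3.0
```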
Naive Model
Mean + Error
Age = Agebar + e
Residuals
The vertical distance from each point to the prediction line
e = y - yhat (observed minus predicted)
Least Squares
Technique that chooses the fitted line so that SSE is minimized
The sum of all squared residuals is at a minimum
SSE
SSE = ∑(yi − yhati)^2
Regression Standard Error
σhat = sqrt(SSE / (n - 2))
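A short Python sketch of computing SSE and the regression standard error (the data here are hypothetical):

```python
import numpy as np

# Hypothetical data with some scatter around a line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit by least squares, then measure what the model misses
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

resid = y - (b0 + b1 * x)                # residuals: observed - predicted
sse = np.sum(resid ** 2)                 # SSE = sum of squared residuals
sigma_hat = np.sqrt(sse / (len(x) - 2))  # regression standard error
```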
Linearity
The scatterplot of y vs. x looks like a straight line; the residual plot shows no curved pattern
Independence
Residuals do not depend on one another (e.g., on time or order); they don’t get bigger or smaller as the plot goes on
Normality of Residuals:
The residuals are distributed symmetrically around zero in a roughly bell-shaped (normal) pattern, with no strong skewness or heavy tails.
Equal Variance of Residuals (homoskedasticity):
Residuals have roughly constant spread across all fitted values (no fanning out).
Standardized Residual
ei / σhat = (yi - yhati) / σhat
If its absolute value is greater than 3, the point is considered an outlier
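One way to compute standardized residuals in Python (the data are invented; the "greater than 3" rule is the rule of thumb from the card above):

```python
import numpy as np

# Hypothetical data where the last point sits far from the line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 40.0])

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

resid = y - (b0 + b1 * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (len(x) - 2))

# Standardized residuals; the rule of thumb flags |value| > 3 as an outlier
std_resid = resid / sigma_hat
```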
Leverage
Points that have extreme x values can have a disproportionate influence on the slope of the regression line
Hypothesis Testing
H0: B1 = 0
HA: B1 ≠ 0
Test Statistic
t = B1hat / SE(B1hat), compared to a t distribution with n - 2 degrees of freedom
Confidence Interval for Slope
B1hat +/- t* · SE(B1hat)
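A Python sketch of the slope test and confidence interval (the data are hypothetical, and the critical value t* = 2.447 is the 95% value for 6 df taken from a t table):

```python
import numpy as np

# Hypothetical data with a clearly positive trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7])

n = len(x)
xbar, ybar = x.mean(), y.mean()
sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - ybar)) / sxx

resid = y - ((ybar - b1 * xbar) + b1 * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))
se_b1 = sigma_hat / np.sqrt(sxx)  # standard error of the slope

t_stat = b1 / se_b1               # test statistic for H0: Beta1 = 0
t_crit = 2.447                    # t* for 95% confidence, n - 2 = 6 df (t table)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # 95% CI for the slope
```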
Coefficient of determination
R^2, How much of the variability is explained by the model
Partitioning variability
ANOVA
(yi - ybar) = (yhati - ybar) + (yi - yhati)
SST
∑(yi - ybar)^2
SSM
∑(yhati - ybar)^2
SST, SSM, SSE Relationship
SST = SSM + SSE
R^2 =
SSM/SST
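A Python sketch of partitioning variability and computing R^2 (hypothetical data):

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
yhat = b0 + b1 * x

sst = np.sum((y - ybar) ** 2)      # total variability
ssm = np.sum((yhat - ybar) ** 2)   # variability explained by the model
sse = np.sum((y - yhat) ** 2)      # variability left over

r_squared = ssm / sst              # SST = SSM + SSE, so 0 <= R^2 <= 1
```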
Confidence Interval
yhat +/- t* · σhat · sqrt(1/n + (x* - xbar)^2 / ∑(xi - xbar)^2) (interval for the mean response at x*)
Prediction Interval
yhat +/- t* · σhat · sqrt(1 + 1/n + (x* - xbar)^2 / ∑(xi - xbar)^2) (interval for a single new response at x*; always wider than the CI)
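A small Python comparison of the two square-root factors (x data and the new value x* are made up); the extra "1 +" inside the prediction-interval factor is why it is always wider:

```python
import numpy as np

# Hypothetical x data and a hypothetical new value x*
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n = len(x)
xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)
x_star = 4.0

se_mean = np.sqrt(1 / n + (x_star - xbar) ** 2 / sxx)      # CI factor
se_pred = np.sqrt(1 + 1 / n + (x_star - xbar) ** 2 / sxx)  # PI factor
# The extra "1 +" makes the prediction interval always wider than the CI
```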
MLR
Y = B0 + B1*X1 + B2*X2 + … + Bp*Xp + e
MLR with categorical data
Parallel slopes model: code the category as a 0/1 indicator variable, so each category gets its own intercept but shares the same slope
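A Python sketch of a parallel slopes fit using a 0/1 indicator (the data and the group effect of 5 are invented for illustration):

```python
import numpy as np

# Hypothetical data: a quantitative x plus a 0/1 indicator for the category
x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
group = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
y = 1.0 + 2.0 * x + 5.0 * group  # two parallel lines, 5 units apart

# Design matrix [1, x, indicator] gives each group its own intercept
# but forces a single common slope (the parallel slopes model)
X = np.column_stack([np.ones_like(x), x, group])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta = [baseline intercept, common slope, shift for the indicator group]
```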
When do we reject H0?
When the p-value < 0.05 (at the 5% significance level)