Linear Regression (Weeks 3-5) Flashcards
Linear regression formula
Y = alpha + beta*X + E
What is Y in Linear Regression?
What is alpha in Linear Regression?
What is Beta in Linear Regression?
What is X in Linear Regression?
What is E in Linear Regression?
Y is dependent var
Alpha is intercept parameter
Beta is regression coefficient
X is explanatory variable
E is the i.i.d. error term, E ~ N(0, sigma^2)
Estimated Linear Regression
Same formula, but with hats on the coefficients (alpha-hat, beta-hat) and with the error E replaced by the residual Ɛ: Y = alpha-hat + beta-hat*X + Ɛ
E vs Ɛ?
E is the i.i.d. error term (captures the model's inherent uncertainty); Ɛ is the residual (the difference between the data and the fitted model). A good model has residuals that behave like E: they contain the uncertainty and whatever the model did not capture
What if its not linear?
Transform the variables, e.g. take the log of Y and/or X, then fit a linear model to the transformed data
How to minimize Ɛ
Minimize the SSR (Sum of Squared Residuals):
1. SSR = Σ(Y - Yhat)^2
2. Differentiate with respect to alpha and beta
3. Set the derivatives to 0 and solve
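Setting the two derivatives to zero gives the usual closed-form least-squares estimates. A minimal Python sketch (numpy assumed; the data here are made up for illustration):

```python
import numpy as np

# illustrative data (not from the course)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# closed-form solutions of d(SSR)/d(beta) = 0 and d(SSR)/d(alpha) = 0
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

y_hat = alpha_hat + beta_hat * x
ssr = np.sum((y - y_hat) ** 2)  # sum of squared residuals at the minimum
print(alpha_hat, beta_hat, ssr)
```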
What is σ̂^2?
σ̂^2 = (Σ(y - yhat)^2) / (n - 2), the estimate of the error variance
Why divide by n - 2 in σ̂^2?
It makes the estimator unbiased: two parameters (alpha and beta) are estimated, so two degrees of freedom are lost
What if σ̂^2 is too large (estimates too uncertain)?
Use the standardized estimators alpha-tilde = (alpha-hat - alpha)/se(alpha-hat) and beta-tilde = (beta-hat - beta)/se(beta-hat), and test the hypotheses for both alpha and beta
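A sketch of the coefficient t-tests in Python (numpy assumed; data and critical value are illustrative; the standard-error formulas are the standard simple-regression ones):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
alpha_hat = y.mean() - beta_hat * x.mean()
resid = y - (alpha_hat + beta_hat * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)  # unbiased variance estimate

# t statistics for H0: beta = 0 and H0: alpha = 0
t_beta = beta_hat / np.sqrt(sigma2_hat / sxx)
t_alpha = alpha_hat / np.sqrt(sigma2_hat * (1 / n + x.mean() ** 2 / sxx))
t_crit = 3.1824  # qt(0.975, n-2) in R, here n - 2 = 3 df
print(t_beta, t_alpha)  # on these data beta is clearly significant, alpha is not
```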
Goodness-of-fit measured using
R^2 = regression SS / Total SS
between 0% to 100%
How to calculate the test statistic for goodness of fit?
Use the F-test: F = Regression SS / (Residual SS / (n - 2)), compared against the F(1, n-2) distribution
R^2 means?
The proportion of the total variability in the data that is explained by the model
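The three sums of squares and R^2 can be computed directly; a Python sketch (numpy assumed, illustrative data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
y_hat = alpha_hat + beta_hat * x

total_ss = np.sum((y - y.mean()) ** 2)           # data vs sample mean
regression_ss = np.sum((y_hat - y.mean()) ** 2)  # model estimate vs sample mean
residual_ss = np.sum((y - y_hat) ** 2)           # data vs model estimate
r2 = regression_ss / total_ss                    # between 0 and 1
print(r2)
```

Note that Total SS = Regression SS + Residual SS, which is why R^2 stays between 0% and 100%.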
Total Sum Of Square means?
Deviations between data and sample mean (total variability)
Regression SS mean?
Deviations between model estimate and sample mean (data variability explained by model)
Residual SS mean?
Deviations between data and model estimate (data variability unexplained by model)
What does Y* mean?
Y* is a new (future) observation of Y, at a new value x*
Prediction interval of Y*
(alpha-hat + beta-hat*x*) ± t_{n-2} * sqrt(σ̂^2) * sqrt(1 + 1/n + (x* - x̄)^2 / (Σx^2 - n*x̄^2)), where x̄ and Σx^2 come from the old data
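The interval can be sketched in Python (numpy assumed; data, x*, and the t critical value are illustrative; note Σx^2 - n*x̄^2 = Σ(x - x̄)^2):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)  # equals sum(x**2) - n * x.mean()**2
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
alpha_hat = y.mean() - beta_hat * x.mean()
sigma2_hat = np.sum((y - (alpha_hat + beta_hat * x)) ** 2) / (n - 2)

x_star = 6.0     # new x at which to predict Y*
t_crit = 3.1824  # qt(0.975, n-2) in R, here n - 2 = 3 df
half_width = t_crit * np.sqrt(sigma2_hat) * np.sqrt(1 + 1/n + (x_star - x.mean())**2 / sxx)
lo = alpha_hat + beta_hat * x_star - half_width
hi = alpha_hat + beta_hat * x_star + half_width
print(lo, hi)  # 95% prediction interval for Y*
```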
residual is .. - …
and good if
Dependent variable - fitted value
Good if the residuals are randomly scattered and approximately normally distributed
model fitting
Estimate alpha, beta, and σ̂^2
Predict with 95% interval!
- model fitting
- goodness of fit
- plot the residuals (the more randomly scattered, the better)
- prediction interval
Multicollinearity is
When two or more explanatory variables are highly correlated, leading to vague, imprecise, and unreliable parameter estimates
Adjusted R^2 vs R^2
As the number of explanatory variables increases, R^2 also increases (it never decreases). However, an overly complex model is not good either, so we introduce the adjusted R^2, which penalises extra variables
Adjusted R^2 formula
Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n = number of observations and k = number of explanatory variables
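A small Python sketch of the formula, showing the penalty for complexity (the R^2, n, and k values are made up for illustration):

```python
# adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
def adjusted_r2(r2, n, k):
    """n observations, k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# adding variables always raises R^2, but adjusted R^2 penalises complexity:
print(adjusted_r2(0.90, 20, 2))
print(adjusted_r2(0.91, 20, 8))  # higher R^2, yet lower adjusted R^2
```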
Forward selection
- Start with a single explanatory variable and check the adjusted R^2
- If adding a variable raises the adjusted R^2, keep it in the model; repeat
Backward selection
- Start with all explanatory variables
- Remove them one by one, checking the adjusted R^2 after each removal
Interaction term
Add an interaction term x1:x2 if the relationship between Y and x1 depends on x2
Categorical variable is
- made up of discrete categories/levels/classes
- qualitative by nature
- handled by setting up dummy variables
- a linear regression can include both continuous and categorical variables
- interaction terms can also be included
Nominal vs Ordinal
Unordered, ordered
How many dummy variables are needed?
For a categorical variable with n levels: n - 1, because one level is used as the baseline
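A Python sketch of the dummy encoding (numpy assumed; the levels and observations are made up; "A" is taken as the baseline):

```python
import numpy as np

# a nominal variable with 3 levels needs 3 - 1 = 2 dummies; "A" is the baseline
levels = ["A", "B", "C"]
obs = ["A", "C", "B", "A", "C"]

# one column per non-baseline level: (is_B, is_C)
dummies = np.array([[1 if o == lev else 0 for lev in levels[1:]] for o in obs])
print(dummies)  # a row of zeros means the baseline level "A"
```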
What can we do to improve the adjusted R?
- Check the stars (whether each variable is significant)
- Combine variables/levels where appropriate
- Do not forget to check for multicollinearity
How do we know which variables (levels) to combine?
Combine levels with similar estimates (neighbouring coefficients)
If we compute the correlation and suspect multicollinearity, what do we test?
Test whether the correlation is significant:
- assume corr(x1, x2) = rho
- t = rho * sqrt((n-2)/(1-rho^2))
- compare with qt(0.975, n-2)
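The same test in Python (numpy assumed; the two strongly correlated series and the hard-coded critical value are illustrative; `qt` in the card is R's t quantile function):

```python
import numpy as np

# two explanatory variables that move almost in lockstep (illustrative data)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.1, 2.3, 2.9, 4.2, 5.1, 5.9])
n = len(x1)

rho = np.corrcoef(x1, x2)[0, 1]
t_stat = rho * np.sqrt((n - 2) / (1 - rho ** 2))
t_crit = 2.7764  # qt(0.975, n-2) in R, here n - 2 = 4 df
print(abs(t_stat) > t_crit)  # significant correlation -> multicollinearity concern
```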