Multiple linear regression Flashcards
disadvantage of doing regressions separately?
ignores the other predictors and any potential synergy (interaction) effect between them, which can lead to misleading results
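A minimal R sketch of the point (assuming the Boston data from the MASS package, with medv as response and lstat/age as hypothetical predictors, none of which come from the card above): the coefficients from separate simple regressions differ from those of the multiple regression, which accounts for both predictors at once.
library(MASS)                                  # Boston data set (assumed example data)
coef(lm(medv ~ lstat, data = Boston))          # lstat alone
coef(lm(medv ~ age, data = Boston))            # age alone
coef(lm(medv ~ lstat + age, data = Boston))    # both predictors together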
RSE?
RSE = sqrt(RSS / (n - p - 1)); the denominator n - p - 1 is the residual degrees of freedom
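A minimal sketch of the RSE formula, assuming a hypothetical fit medv ~ lstat + age on the Boston data (MASS package):
library(MASS)
lm_fit = lm(medv ~ lstat + age, data = Boston)   # hypothetical example model
n = nrow(Boston); p = length(coef(lm_fit)) - 1   # p predictors, n - p - 1 df
rss = sum(residuals(lm_fit)^2)
sqrt(rss / (n - p - 1))                          # manual RSE
sigma(lm_fit)                                    # same value reported by the fitted model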
why does R squared increase when non-zero inputs are added?
RSS always decreases (or at worst stays the same) when another non-zero predictor is added, because the coefficients are chosen by minimising RSS; since R squared = 1 - RSS/TSS, R squared can only increase
what does adjusted R squared do?
adds a penalisation factor to account for the number of predictors included in the model
formula of adjusted R square?
1 - (RSS/(n - p - 1)) / (TSS/(n - 1)), equivalently 1 - ((n - 1)/(n - p - 1)) * RSS/TSS. Always smaller than R squared (and can be negative)
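A sketch of the adjusted R squared formula (same hypothetical Boston fit as above), checked against summary():
library(MASS)
lm_fit = lm(medv ~ lstat + age, data = Boston)   # hypothetical example model
n = nrow(Boston); p = length(coef(lm_fit)) - 1
rss = sum(residuals(lm_fit)^2)
tss = sum((Boston$medv - mean(Boston$medv))^2)
1 - (rss / (n - p - 1)) / (tss / (n - 1))        # manual adjusted R squared
summary(lm_fit)$adj.r.squared                    # value reported by summary()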
what is the null hypothesis H0?
all slope coefficients are zero simultaneously (beta1 = beta2 = ... = betap = 0); the intercept is not part of the test
formula of F stats?
F = ((TSS - RSS)/p) / (RSS/(n - p - 1))
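A sketch of the F-statistic formula (same hypothetical Boston fit), compared with the value summary() reports:
library(MASS)
lm_fit = lm(medv ~ lstat + age, data = Boston)   # hypothetical example model
n = nrow(Boston); p = length(coef(lm_fit)) - 1
rss = sum(residuals(lm_fit)^2)
tss = sum((Boston$medv - mean(Boston$medv))^2)
((tss - rss) / p) / (rss / (n - p - 1))          # manual F statistic
summary(lm_fit)$fstatistic                       # value, numdf, dendf from summary()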
when no relation what is F stat?
approximately 1 (under H0 the F-statistic is expected to be close to 1)
when to reject null hypothesis?
when the p-value of the F-statistic is below the significance level, commonly 0.05
how does forward selection work?
- start with null model with intercept but no predictor
- successively include the most informative variable (lowest RSS, highest R squared)
- stop when a stopping rule is reached (e.g. all included variables have p-value < 0.05); see the step() sketch below
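A hedged forward-selection sketch using step() (the candidate predictors lstat, rm, age and crim are an arbitrary example set; note that step() adds variables by AIC rather than the p-value rule above):
library(MASS)
lm_null = lm(medv ~ 1, data = Boston)                   # intercept-only starting model
step(lm_null, scope = medv ~ lstat + rm + age + crim,   # candidate predictors
     direction = "forward")                             # add most informative first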
how does backward elimination work?
- start with full model with intercept and all predictors
- successively remove the least informative variable (largest p-value; its removal increases RSS the least)
- stop when a stopping rule is reached (e.g. all remaining variables have p-value < 0.05); see the step() sketch below
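A matching backward-elimination sketch with step() (same hypothetical predictors; again step() uses AIC, not p-values, as its stopping rule):
library(MASS)
lm_full = lm(medv ~ lstat + rm + age + crim, data = Boston)   # start from the full model
step(lm_full, direction = "backward")                         # drop least informative first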
how does cross validation work?
- split dataset into training and testing set
- train model using training set
- validate fitted model using testing set
how is validation error rate assessed?
the mean squared error on the testing set: MSE = RSS/n (see the validation-set sketch below)
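A validation-set sketch (assuming Boston, medv ~ lstat): split the data, fit on the training half, and compute the test MSE on the held-out half.
library(MASS)
set.seed(1)
train_id = sample(nrow(Boston), nrow(Boston)/2)            # random 50/50 split
fit = lm(medv ~ lstat, data = Boston, subset = train_id)   # fit on the training set
mean((Boston$medv - predict(fit, Boston))[-train_id]^2)    # test MSE = RSS/n on the test set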
process of leave one out CV?
- fit a model on the training set (obs = n - 1)
- validate the model using the testing set (obs = 1, the held-out observation)
- compute the test MSE for this round
- repeat the steps above n times, holding out each observation once, to obtain n MSEs
- construct the LOOCV estimate as the average of the n MSEs (see the sketch below)
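A manual LOOCV sketch (assuming Boston, medv ~ lstat); cv.glm() further down gives the same estimate without the explicit loop.
library(MASS)
n = nrow(Boston)
errs = rep(0, n)
for (i in 1:n) {
  fit = lm(medv ~ lstat, data = Boston[-i, ])                 # train on n - 1 observations
  pred = predict(fit, newdata = Boston[i, , drop = FALSE])    # predict the held-out row
  errs[i] = (Boston$medv[i] - pred)^2                         # squared error for this round
}
mean(errs)                                                    # LOOCV estimate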
K-fold CV?
- randomly split observations into k groups
- fit a model on the training set (obs = n - n1, where n1 is the fold size)
- validate the model using the testing set (obs = n1, the held-out fold)
- compute the test MSE for this round
- repeat the steps above k times, holding out each fold once, to obtain k MSEs
- construct the k-fold CV estimate as the average of the k MSEs (see the sketch below)
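A manual k-fold sketch (assuming Boston, medv ~ lstat, k = 10); cv.glm(..., K = 10) further down does the same thing.
library(MASS)
set.seed(1)
k = 10
folds = sample(rep(1:k, length.out = nrow(Boston)))              # randomly assign each obs to a fold
cv_mse = rep(0, k)
for (j in 1:k) {
  fit = lm(medv ~ lstat, data = Boston[folds != j, ])            # train on the other k - 1 folds
  test = Boston[folds == j, ]
  cv_mse[j] = mean((test$medv - predict(fit, newdata = test))^2) # test MSE for fold j
}
mean(cv_mse)                                                     # k-fold CV estimate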
lm for multiple linear regression?
lm_fit=lm(y~var1+var2,data=Boston)
summary(lm_fit)
get model with all variables?
lm_fit1=lm(y~. , data=Boston)
remove one or two variables?
lm_fit2=lm(y~. -var1 , data=Boston)
lm_fit3=lm(y~.-var1 -var2, data=Boston)
get null set?
lm_fit4=lm(y~1,data=Boston)
get correlation for all inputs?
round(cor(Boston),2)
how to visualise the pair-wise correlation matrix?
install.packages('corrplot')
library(corrplot)
cor_matrix=round(cor(Boston), 2)
corrplot(cor_matrix, type = "upper", order = 'alphabet',
tl.col = "black", tl.srt = 45, tl.cex = 0.9,
method = "circle")
for two variables that may interact, how to relax the additive assumption?
lm_non_add=lm(y~var1*var2, data=Boston)
scatterplot with linear assumption?
attach(Boston)
plot(var1,y,pch=16,col='gray50')
for polynomial?
lm_nonlinear=lm(y~poly(var1,2),data=Boston)
add line to scatterplot of non linear?
lines(sort(var1),fitted(lm_nonlinear)[order(var1)],lwd=2,col='deeppink3')
find just the coefficient of variable n intercept?
glm_fit=glm(y~x,data=Boston)
coef(glm_fit) #same as lm_fit
find LOOCV estimates?
library(boot) # cv.glm() lives in the boot package
cv_err=cv.glm(Boston,glm_fit)
cv_err$delta[1]
finding CV of different models and lowest MSE?
glm_fit2=glm(y~poly(x,2), data=Boston)
cv_err=cv.glm(Boston,glm_fit2)
OR
USE FOR LOOP
cv_error=rep(0,10)
for (i in 1:10){glm_fit=glm(y~poly(x,i), data=Boston)
cv_error[i]=cv.glm(Boston,glm_fit)$delta[1]}
plotting CV errors?
plot(cv_error,xlab='polynomial',main='test MSE', ylab='',type='b', pch=16)
CV for K fold?
cv_error=rep(0,10)
for (i in 1:10){glm_fit=glm(y~poly(x,i), data=Boston)
cv_error[i]=cv.glm(Boston,glm_fit,K=10)$delta[1]}