Simple linear regression Flashcards
what is linear regression?
a simple approach to supervised learning, used to model the relationship between an input variable (X) and a continuous response variable (Y); simple linear regression uses a single predictor (multiple linear regression uses several)
assumed model?
Y = β0 + β1X + ε
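(A quick way to see the model in action is to simulate data from it and recover the coefficients; a minimal R sketch, all names and values hypothetical:)
set.seed(1)
x <- rnorm(100)              # hypothetical predictor
e <- rnorm(100, sd = 0.5)    # error term
y <- 2 + 3 * x + e           # true beta0 = 2, beta1 = 3
coef(lm(y ~ x))              # estimates should be close to (2, 3)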
distance between observed and predicted values?
residual ei = Yi − Ŷi = Yi − (β̂0 + β̂1Xi)
residual sum squares (RSS)?
the sum of the squared residuals over all data points: RSS = Σ ei² = Σ (Yi − β̂0 − β̂1Xi)²; residuals may be positive or negative, hence the squaring
how to estimate β0 and β1? use least-squares estimation
take the first-order derivatives of RSS with respect to β0 and β1 separately, and set them to 0
estimated β̂1?
β̂1 = cov(x, y) / var(x)
estimated β̂0?
β̂0 = mean(y) − β̂1 · mean(x)
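(These two formulas can be checked directly in R; a minimal sketch, assuming x and y are numeric vectors, e.g. from the simulation above:)
b1 <- cov(x, y) / var(x)       # slope: cov(x,y)/var(x)
b0 <- mean(y) - b1 * mean(x)   # intercept: mean(y) - b1*mean(x)
c(b0, b1)                      # should match coef(lm(y ~ x))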
what is the standard error (SE) an estimate of?
how much the estimates vary under repeated sampling
hypothesis testing for relationship between x and y?
H0: β1 = 0 vs. H1: β1 ≠ 0
t-statistic (to test the null hypothesis)?
t = (β̂1 − 0) / SE(β̂1), with n − 2 degrees of freedom
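(The t-statistic can be reproduced by hand from the coefficient table; a sketch, assuming a fitted model fit <- lm(y ~ x) with the predictor named x:)
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
t   <- (est - 0) / se                                   # matches the "t value" column in summary(fit)
2 * pt(abs(t), df = nobs(fit) - 2, lower.tail = FALSE)  # two-sided p-value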
critical value and confidence interval when n is large?
1.96 and 95%: as n increases, the t-distribution approaches the standard normal distribution, so the 95% critical value approaches 1.96
p-value definition?
the probability, assuming H0 is true, of observing a test statistic at least as extreme as the one observed, i.e. P(|T| ≥ |t|)
calculate confidence interval
β̂1 ± 1.96 · SE(β̂1)
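(The large-n 95% interval can also be computed by hand; a sketch, assuming the same fit <- lm(y ~ x) as above:)
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
est + c(-1.96, 1.96) * se   # compare with confint(fit, level = 0.95)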
when to reject null hypothesis?
when |t| > 1.96 (for large n), we can reject H0 at the 95% confidence level
what does residual standard error (RSE) mean?
RSE measures lack of fit: e.g. if RSE = 3.259, then on average the observed Y deviates from the regression line by 3.259 units
what is R squared for?
measures how well the regression model describes the data, i.e. the proportion of the variation in Y explained by X; e.g. if R² = 0.6119, X explains 61.19% of the variation in Y
how to compute RSE?
RSE = sqrt(RSS / (n − 2))
how to compute R squared?
R² = 1 − RSS/TSS, where TSS = Σ (yi − ȳ)² is the total sum of squares (the total variation in the response y); R² ranges from 0 to 1
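(Both RSE and R squared can be computed by hand and checked against summary(); a sketch, assuming fit <- lm(y ~ x) as above:)
rss <- sum(residuals(fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares
sqrt(rss / (nobs(fit) - 2))    # RSE: matches summary(fit)$sigma
1 - rss / tss                  # R squared: matches summary(fit)$r.squared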
for the 95% CI on the relationship, use β1 or β0?
β1; β0 (the intercept) says nothing about the relationship between X and Y
how to install package MASS
install.packages('MASS')
load MASS?
library(MASS)
load data Boston in MASS?
data(Boston)
documentation for the data set?
?Boston
number of missing values?
sum(is.na(Boston))
number of duplicated values?
sum(duplicated(Boston))
find outliers for both variables?
boxplot.stats(Boston$var1)$out
boxplot.stats(Boston$var2)$out
reduce dataset to subset of the 2 variables?
mydata = subset(Boston, select = c(var1, var2))
scatterplot? var1 being y and var2 being x
plot(var1 ~ var2, data = mydata, main = 'Scatterplot of var1 vs var2', xlab = 'var2', ylab = 'var1', pch = 20, col = 'gray50')
simple linear regression?
lmfit = lm(var1 ~ var2, data = mydata)
summary of lm?
summary(lmfit)
upper and lower range of CI 95%?
confint(lmfit, level = 0.95)
Regression when X is binary
Create a dummy variable that equals one if rm is at or above the sample median
mydata$dummy = ifelse(mydata$rm >= median(mydata$rm), 1, 0)
plot scatterplot with fitted line?
lmfit1 = lm(var1 ~ dummy, data = mydata)
plot(var1 ~ dummy, data = mydata, main = 'Scatterplot of var1 vs dummy', xlab = 'dummy', ylab = 'var1', pch = 20, col = 'gray50')
abline(lmfit1, lwd = 2, col = 'deeppink3')
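(With a binary X, the fitted coefficients have a simple interpretation: the intercept is the mean of var1 in the dummy = 0 group, and the slope is the difference between the two group means. A sketch to verify, using the deck's placeholder names var1 and dummy:)
tapply(mydata$var1, mydata$dummy, mean)   # group means of var1
coef(lmfit1)                              # intercept = mean of group 0; slope = difference in means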