Week 5 Flashcards
Predictive analytics uses historical data to
tell us something about the future
IV vs DV
IV (independent variable): used to predict the DV; plotted on the x-axis
DV (dependent variable): what we are trying to predict from the other variables; plotted on the y-axis
What are we trying to estimate with lm?
B0 and B1, the coefficients (intercept and slope)
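A minimal sketch of how B0 and B1 can be estimated for one IV, using the standard least-squares formulas in plain Python. The function name `fit_line` and the data are made up for illustration; this is not the lm implementation.

```python
def fit_line(xs, ys):
    """Return (b0, b1) that minimize squared error for y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: makes the line pass through the point of means
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical data lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```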
What are we trying to minimize for each data point and for the model as a whole?
The error E for each data point, and the total error for the model
Why use SSE?
Squaring each error magnifies large deviations and keeps positive and negative errors from cancelling out
Issue with SSE, and the fix
It depends on the number of points: more points means a higher SSE
Fix: use RMSE, which is normalized by N and is in the same units as the DV, so if the DV is a price the RMSE is in $
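A small sketch of the SSE issue, assuming made-up prices where every prediction is off by $2. Doubling the number of points doubles SSE, while RMSE stays at $2. The function names are illustrative.

```python
import math


def sse(actual, predicted):
    """Sum of squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error: SSE divided by N, then square-rooted."""
    return math.sqrt(sse(actual, predicted) / len(actual))


# Hypothetical prices in $; each prediction misses by exactly 2
actual_small = [10, 20, 30]
pred_small = [12, 22, 32]

# Same data repeated, so the error per point is unchanged
actual_big = actual_small * 2
pred_big = pred_small * 2

sse_small = sse(actual_small, pred_small)
sse_big = sse(actual_big, pred_big)
rmse_small = rmse(actual_small, pred_small)
rmse_big = rmse(actual_big, pred_big)
print(sse_small, sse_big, rmse_small, rmse_big)
```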
A high R squared means
the model fits the data well and the errors are small, but it does not guarantee the model will work well on unseen data
MAPE vs MAE when the data set has a high average
MAPE is better when the data set's average is high because it reports the error as a %. MAE is in the DV's units, so it gets larger as the values in the data set get larger
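A quick sketch of the difference, assuming two made-up data sets where every prediction is 10% too high. MAE grows with the size of the values, while MAPE is 10% for both. Function names are illustrative.

```python
def mae(actual, predicted):
    """Mean absolute error, in the units of the DV."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, as a %. Assumes no actual value is 0."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(actual)


# Hypothetical data: both sets of predictions are 10% too high
low_actual, low_pred = [10, 20], [11, 22]
high_actual, high_pred = [1000, 2000], [1100, 2200]

mae_low = mae(low_actual, low_pred)
mae_high = mae(high_actual, high_pred)
mape_low = mape(low_actual, low_pred)
mape_high = mape(high_actual, high_pred)
print(mae_low, mae_high, mape_low, mape_high)
```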
What is R squared?
The percentage decrease in SSE: how much the SSE has dropped compared with the error of the baseline model (SST)
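A sketch of that definition as R^2 = 1 - SSE/SST, assuming the baseline model always predicts the mean of the DV. The data and the name `r_squared` are made up for illustration.

```python
def r_squared(actual, predicted):
    """Share of the baseline error (SST) that the model removes."""
    mean_y = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # SST is the SSE of the baseline that always predicts the mean
    sst = sum((a - mean_y) ** 2 for a in actual)
    return 1 - sse / sst


# Hypothetical values: SST = 20 and SSE = 2
actual = [2, 4, 6, 8]
predicted = [3, 4, 6, 7]
r2 = r_squared(actual, predicted)

# The baseline itself scores 0 because its SSE equals SST
r2_baseline = r_squared(actual, [5, 5, 5, 5])
print(r2, r2_baseline)
```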
It is hard to get a model with good accuracy (0.8+) on real data, so what values are good?
0.3
Model R^2 gets better as you add more variables that each have an R^2 above 0, but at a
diminishing rate
Not all variables should be used because
the model will overfit the data
Issue with overfitting
The model will perform badly on unseen data because it has only memorized the old data and does not know the new data
The coefficients are tuned to minimize error on the old data, so predictions on future data will have large errors because the model is overfitted
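A rough illustration of memorizing versus generalizing, under stated assumptions: the data is made up (roughly y = 2x plus noise), the "overfitted" model is a polynomial forced through every training point (Lagrange interpolation, which is the same fit as regressing on x, x^2, ..., x^5), and the comparison model is a simple straight line. All names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares straight line: returns (b0, b1)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def interpolate(xs, ys, x_new):
    """Polynomial through every training point, evaluated at x_new."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x_new - xj) / (xi - xj)
        total += yi * weight
    return total


# Hypothetical old data: roughly y = 2x with some noise
train_x = [0, 1, 2, 3, 4, 5]
train_y = [0.5, 1.6, 4.4, 5.7, 8.3, 9.6]
# Hypothetical unseen point that follows the same trend
new_x, new_y = 6, 12.0

b0, b1 = fit_line(train_x, train_y)

# Error on the old data: the memorizing model looks perfect
line_train_sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(train_x, train_y))
memo_train_sse = sum(
    (y - interpolate(train_x, train_y, x)) ** 2 for x, y in zip(train_x, train_y)
)

# Error on the unseen point: the memorizing model is far off
line_new_error = abs(new_y - (b0 + b1 * new_x))
memo_new_error = abs(new_y - interpolate(train_x, train_y, new_x))
print(line_train_sse, memo_train_sse, line_new_error, memo_new_error)
```

The polynomial has zero error on the old data but misses the new point by much more than the straight line does.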
Significance depends on the confidence level we want. If confidence is 95%, what is the p-value cutoff and what is insignificant?
5% (0.05); a p-value greater than that is not significant
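The rule above as a tiny sketch. The function name `is_significant` is made up, and treating a p-value exactly at the cutoff as significant is an assumption based on "greater means not significant".

```python
def is_significant(p_value, confidence=0.95):
    """A variable is significant if its p-value is not above 1 - confidence."""
    cutoff = 1 - confidence  # 95% confidence gives a cutoff of about 0.05
    return p_value <= cutoff


# Hypothetical p-values
small_p = is_significant(0.03)        # below the 5% cutoff
large_p = is_significant(0.20)        # above the 5% cutoff
strict = is_significant(0.03, 0.99)   # 99% confidence tightens the cutoff to 1%
print(small_p, large_p, strict)
```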
A coefficient (beta) of 0.6 means
if the IV increases by 1 unit, the DV will increase by 0.6 units
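A short check of that reading, assuming a hypothetical intercept of 2.0 (only the 0.6 comes from the card): raising the IV by 1 unit changes the prediction by the coefficient.

```python
b0 = 2.0  # hypothetical intercept, chosen only for illustration
b1 = 0.6  # coefficient (beta) from the card


def predict(iv):
    """Predicted DV for a given IV value."""
    return b0 + b1 * iv


# Increase the IV by exactly 1 unit and compare predictions
change = predict(11) - predict(10)
print(round(change, 6))
```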