Week 5 Flashcards
predictive analytics uses historical data to
tell us something about the future
IV vs DV
IV (independent variable) - used to predict the DV; goes on the x axis
DV (dependent variable) - what we are trying to predict from the other variables; goes on the y axis
what are we trying to predict with lm
B0 and B1, the coefficients (intercept and slope)
what are we trying to minimize for each data point and for the model as a whole
the error E for each point and the total error for the model
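A minimal sketch of what lm is doing for a single IV, written in plain Python rather than R so it runs without any packages. The data and the fit_line helper are made up for illustration; the formulas are the standard least-squares ones for one x variable.

```python
# Hypothetical toy data: x is the IV, y is the DV (exactly y = 1 + 2x).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

def fit_line(x, y):
    """Return (b0, b1) that minimize the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = co-variation of x and y divided by variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

b0, b1 = fit_line(x, y)

# Error E for each point, and the total (squared) error for the model.
errors = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in errors)
print(b0, b1, sse)
```

Because the toy data sits exactly on a line, every individual error is 0 and so is the total.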
Why use SSE
squaring magnifies large deviations and stops positive and negative errors from canceling out
issue with SSE and fix
depends on the number of points: more points = higher SSE
fix: use RMSE, which is normalized by N and is in the same units as the DV, so if the DV is price the RMSE is in $
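A small sketch of the SSE problem in plain Python, with made-up numbers: doubling the number of points doubles SSE even though the typical error has not changed, while RMSE stays put.

```python
import math

def sse(actual, predicted):
    """Sum of squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Root mean squared error: SSE normalized by N, then square-rooted."""
    return math.sqrt(sse(actual, predicted) / len(actual))

# Hypothetical prices in $; every prediction is off by exactly 2.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 32.0]

# Same errors, twice as many points.
actual_big = actual * 2
predicted_big = predicted * 2

sse_small, sse_big = sse(actual, predicted), sse(actual_big, predicted_big)
rmse_small, rmse_big = rmse(actual, predicted), rmse(actual_big, predicted_big)
print(sse_small, sse_big)    # SSE grows with the number of points
print(rmse_small, rmse_big)  # RMSE stays in $ and does not grow
```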
a high R squared means
the model fits the data well and the errors are small, but that is no guarantee it will work well on unseen data
MAPE vs MAE when the data set average is high
MAPE is better when the data set average is high because it reports error as a %. MAE gets larger as the values in the data set get larger.
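A quick sketch of the scale effect in plain Python. The two data sets are invented: the second is the first multiplied by 100, so the predictions are equally good in relative terms.

```python
def mae(actual, predicted):
    """Mean absolute error, in the units of the DV."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, as a % of the actual value."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical low-priced items.
cheap_actual, cheap_pred = [10.0, 20.0], [11.0, 18.0]
# Same items with every value multiplied by 100.
costly_actual, costly_pred = [1000.0, 2000.0], [1100.0, 1800.0]

mae_cheap, mae_costly = mae(cheap_actual, cheap_pred), mae(costly_actual, costly_pred)
mape_cheap, mape_costly = mape(cheap_actual, cheap_pred), mape(costly_actual, costly_pred)
print(mae_cheap, mae_costly)    # MAE is 100x larger on the high-average data
print(mape_cheap, mape_costly)  # MAPE is the same % on both
```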
what is r squared
the percentage decrease in SSE compared to the baseline model's error (SST), i.e. R^2 = 1 - SSE/SST
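A worked example of R^2 as "how much SSE dropped versus the baseline", in plain Python. The numbers are made up; the baseline model is the one that always predicts the average of the DV.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE/SST, where SST is the SSE of the average-only baseline."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst

# Hypothetical DV values and model predictions.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 6.0, 7.0]

# SST = 20 (baseline error), SSE = 2 (model error), so SSE dropped by 90%.
r2 = r_squared(actual, predicted)
print(r2)
```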
it is hard to get a model with a good R^2 (0.8+) on real data, so what values are good
0.3
model R^2 gets better as you add more variables that contribute R^2 above 0, but at a
diminishing rate
not all variables should be used because
the model will overfit the data
issue with overfitting
it will perform badly on unseen data because it has not learned the pattern, it has just memorized the old data
the coefficients are tuned to minimize error on the old data, so predictions on future data will have large errors
significance depends on the confidence level we want. if confidence is 95%, what is the p-value cutoff and what counts as insignificant
5% (0.05); if the p-value is greater than that, the variable is not significant
Coefficient (beta) = 0.6 means
if the IV increases by 1 unit then the DV increases by 0.6 units (with the other variables held constant)
sign of overfitting
Adjusted R squared can increase or decrease. If you add a new variable and adjusted R squared goes down, the model is overfitting
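A sketch of why adjusted R squared can go down, in plain Python. The formula used is the usual textbook one, 1 - (1 - R^2)(n - 1)/(n - k - 1) with n rows and k IVs; the notes do not state it, and the R^2 values below are invented for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 50  # hypothetical number of rows

# Model with 3 IVs.
adj_before = adjusted_r2(0.400, n, k=3)

# Add a 4th IV that barely moves plain R^2 (0.400 -> 0.401).
adj_after = adjusted_r2(0.401, n, k=4)

# Plain R^2 went up slightly, but the penalty for the extra variable
# outweighs the gain, so adjusted R^2 goes down.
print(adj_before, adj_after)
```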
goal is to include only significant variables in regression because
other variables will cause overfitting
what does correlation do
reflects the linear relationship between two variables. It measures the degree to which the two variables are linearly related to each other and is always between -1 and 1
linear regression assumption about correlation
all IVs are independent of each other, so there is no correlation between them
the x variables are independent, not dependent on the other x variables
what is a sign for worry in correlation
beyond -0.6 or 0.6 (a strong correlation between IVs in either direction)
why do we split data
the model may just be minimizing error on the data it has seen rather than making good predictions; splitting lets us see whether it performs well or just overfits. training should be 80%; use the lm function on the training data to build the model
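A minimal sketch of an 80/20 split in plain Python. The row indices and the seed are placeholders; in R the same idea would be applied to the data frame before calling lm on the training part.

```python
import random

# Hypothetical data set: 100 row indices standing in for 100 observations.
rows = list(range(100))

# Shuffle with a fixed seed so the split is reproducible.
rng = random.Random(42)
shuffled = rows[:]
rng.shuffle(shuffled)

# First 80% for training (build the model), last 20% for testing (unseen data).
cut = len(shuffled) * 80 // 100
train = shuffled[:cut]
test = shuffled[cut:]

print(len(train), len(test))
```

The model is fitted on train only; its errors on test show whether it generalizes or just overfits.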
if a coefficient is 0 it means
no impact on the DV from the IV
output of RMSE / MAE
tells us predictions are within # of the actual value on average. The difference is that RMSE gives more weight to larger errors, making it sensitive to outliers or large deviations
MAE used when
you want the average magnitude of errors, regardless of direction. It is less sensitive to outliers and focuses on overall accuracy rather than punishing large deviations
MAPE output tells us
average deviation as a % from the actual price
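A sketch comparing the three metrics on invented prices, in plain Python, to show how one large miss moves RMSE much more than MAE.

```python
import math

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical prices in $.
actual = [100.0, 100.0, 100.0, 100.0]
steady = [98.0, 102.0, 98.0, 102.0]    # every prediction off by 2
outlier = [98.0, 102.0, 98.0, 130.0]   # one prediction off by 30

# With equal-sized errors, MAE and RMSE agree.
mae_steady, rmse_steady = mae(actual, steady), rmse(actual, steady)

# With one large error, RMSE rises much more than MAE.
mae_outlier, rmse_outlier = mae(actual, outlier), rmse(actual, outlier)

# MAPE reads as "predictions are off by this % on average".
mape_steady = mape(actual, steady)

print(mae_steady, rmse_steady, mape_steady)
print(mae_outlier, rmse_outlier)
```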
R^2 is only useful for whom, and what is the solution
analysts; for everyone else use RMSE, MAPE, MAE as they are easier to understand
when r = 0
indicates that there is no linear correlation between x and y. However, it does not necessarily imply that there is no relationship between them.
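A small sketch in plain Python of r = 0 without "no relationship". The pearson_r helper and the data are made up for illustration: y is completely determined by x (y = x squared), yet the linear correlation is 0.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Perfectly linear relationship: r is 1.
r_linear = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])

# Perfect but non-linear relationship (y = x^2 on symmetric x): r is 0.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi ** 2 for xi in x]
r_curve = pearson_r(x, y)

print(r_linear, r_curve)
```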
best metrics for testing data
MAE, RMSE, MAPE
what does a negative R^2 mean
This means the baseline model (predicting the average price for all wines) is doing better than your model, so yours is useless. Build a better model or just use the baseline
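A sketch of how R^2 ends up negative, in plain Python with invented wine prices: the baseline (always predict the average) scores exactly 0, and a model whose SSE is larger than SST scores below 0.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE/SST, where SST is the SSE of the average-only baseline."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst

# Hypothetical wine prices.
actual = [2.0, 4.0, 6.0, 8.0]

# Baseline model: predict the average price (5) for every wine.
baseline = [5.0, 5.0, 5.0, 5.0]

# A bad model that gets the trend backwards.
bad_model = [8.0, 6.0, 4.0, 2.0]

r2_baseline = r_squared(actual, baseline)  # SSE equals SST
r2_bad = r_squared(actual, bad_model)      # SSE (80) is larger than SST (20)
print(r2_baseline, r2_bad)
```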
what is standard error
measure of uncertainty in the estimate of a coefficient
what are residuals in model summary
residuals are the errors (actual minus predicted) left over when you build a model; the summary tells you the distribution of these errors