Trees Flashcards
what is a main issue with logistic regression
a coefficient indicates the effect of a variable, not how a decision is made
decisions we make in real life are:
made in sequential order, like regression trees
main benefit of regression trees
they are easy to understand
in trees we know the direction a variable affects the probability, but can't tell
the size of its impact on the y variable
we split the data using the IVs into
yes-or-no decisions
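A tree of yes-or-no splits can be printed directly. A minimal sketch using scikit-learn (the dataset and `max_depth` choice are illustrative, not from the cards):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree: every internal node is a yes/no question on one IV.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the sequence of yes/no decisions the tree learned.
print(export_text(tree, feature_names=list(iris.feature_names)))
```

Each printed line is one condition (e.g. `petal width (cm) <= 0.80`), and following the branches top to bottom mirrors the sequential decisions the cards describe.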
trees don't assume the model is
linear
adding more splits will
increase accuracy (on the training data)
3 splits in the tree
3 decision tree levels
a terminal node is an
output, not a condition; ex- it tells you the color, red or grey
the root node is
the condition most correlated with the y variable - the most important one sits at the top
we can have 100% accuracy if we keep adding splits with no errors, but the issue is
too many variables, which leads to overfitting
what is the fix for the overfitting issue
set a lower bound on the number of points in each subset
each split divides points into
buckets
if we set minimum bucket size = lower bound, then we
won't split if the number of points in a resulting bucket is less than the minimum bucket size
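The minimum-bucket-size rule can be checked in code. A sketch using scikit-learn, where `min_samples_leaf` plays the role of the minimum bucket size (the dataset and the value 20 are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# min_samples_leaf acts as the minimum bucket size: no split is made
# if it would leave fewer than 20 points in a terminal bucket.
tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

# Verify: every terminal bucket holds at least 20 training points.
leaf_ids = tree.apply(X)                 # which leaf each row falls into
counts = np.bincount(leaf_ids)           # points per node id
leaf_sizes = counts[np.unique(leaf_ids)] # keep only actual leaves
print(leaf_sizes.min())                  # never below 20
```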
buckets only tell you
the size of the split, not the outcome of the bucket
if the bucket size is too large, then
the model is too simple and will have poor accuracy
setting the bucket size equal to all observations is equivalent to
predicting the most frequent outcome for everyone, i.e. the baseline model
in a classification model, use
majority vote
classification uses discrete/categorical variables; what would using continuous variables be
take the average of all values in the bucket instead of the most frequent -> ex: how many cars sold
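The two prediction rules side by side, as a small sketch (the labels and car-sales numbers are made-up illustrative data):

```python
from collections import Counter
from statistics import mean

# Points that landed in the same terminal bucket of a tree.
classes = ["red", "grey", "red", "red", "grey"]  # categorical outcome
cars_sold = [3, 5, 4, 6, 2]                      # continuous outcome

# Classification tree: the bucket predicts by majority vote.
pred_class = Counter(classes).most_common(1)[0][0]
print(pred_class)   # "red" (3 of 5 points)

# Regression tree: the bucket predicts the average value.
pred_value = mean(cars_sold)
print(pred_value)   # 4
```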
regression trees are easy to understand; the issue when using a single tree is
we will observe significantly higher errors
single trees are unstable, meaning
a small change in the data = different trees / different interpretation
how do we fix the instability of single trees
use a random forest
a random forest generates multiple trees instead of 1, then
takes the average to come up with predicted values -> if 5 trees, then we need 5 different data sets
how does a forest build the data set for each tree
randomly select rows with replacement (the same row can be selected more than once)
why do we select rows randomly with replacement
because trees are highly sensitive to small changes, so we change the data slightly for each tree, and taking the average reduces the variance in the predicted value
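Sampling rows with replacement (bootstrapping) can be sketched in a few lines. The toy data and the choice of 5 samples are illustrative assumptions, matching the 5-trees example above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(0, 10, size=n)
y = 2 * X + rng.normal(0, 1, size=n)   # toy data: y ~ 2x + noise

def bootstrap_sample(X, y, rng):
    # Draw n row indices WITH replacement: the same row can appear twice,
    # so each sample is a slightly different version of the data.
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

# One slightly perturbed data set per tree; in a real forest each sample
# grows its own tree, and the trees' predictions are averaged.
samples = [bootstrap_sample(X, y, rng) for _ in range(5)]
print(len(samples))   # 5 data sets for 5 trees
```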
the minimum bucket size in forests is called
node size
another parameter in forests
number of trees
default number of trees in R
500
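The two forest parameters from the cards map onto scikit-learn names (a sketch, not the R `randomForest` call itself; note scikit-learn's own default is 100 trees, not 500, and the dataset here is synthetic):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

forest = RandomForestRegressor(
    n_estimators=500,     # number of trees (matching R's default of 500)
    min_samples_leaf=5,   # node size: the minimum bucket in each tree
    bootstrap=True,       # rows sampled with replacement for each tree
    random_state=0,
).fit(X, y)

print(len(forest.estimators_))   # 500 individual trees were grown
```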
more trees is better because
averaging over more trees reduces variance without adding bias
why would more trees be an issue
computationally difficult
what don't we have to worry about in forests
overfitting