Quiz 5 Flashcards
separating observations into subgroups by creating splits on predictors
trees
what are the types of trees?
classification tree
regression tree
what does CART stand for?
classification and regression trees
nodes that have successors, also called splitting nodes
decision node
nodes with no successors, leaves of the tree, represents partitioning of data by predictors
terminal node
how many terminal nodes are there?
one more than the number of decision nodes (in a binary tree)
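A quick way to check the node-count rule is to count both kinds of nodes in a small binary tree. The nested-dict layout and the split labels below (`income`, `lot_size`) are made up for illustration, not taken from any dataset.

```python
def count_nodes(tree):
    """Return (decision_nodes, terminal_nodes) for a nested-dict binary tree."""
    if "leaf" in tree:                      # terminal node: no successors
        return 0, 1
    d_left, t_left = count_nodes(tree["left"])
    d_right, t_right = count_nodes(tree["right"])
    return 1 + d_left + d_right, t_left + t_right

# Hypothetical toy tree with two decision nodes
toy_tree = {
    "split": "income <= 60",
    "left": {"leaf": "nonowner"},
    "right": {
        "split": "lot_size <= 20",
        "left": {"leaf": "nonowner"},
        "right": {"leaf": "owner"},
    },
}

decision, terminal = count_nodes(toy_tree)
print(decision, terminal)
```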
how are things moved down the tree?
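each record is dropped down the tree, following the split rule at each decision node until it reaches a terminal node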
how are classes assigned?
by majority vote among the training records in the terminal node the record lands in (a regression tree takes their average instead)
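As a small sketch of how a terminal node turns into a prediction: a classification tree takes the most common class among the training records in the node, and a regression tree takes their mean. The function names and sample values are illustrative only.

```python
from collections import Counter
from statistics import mean

def classify_leaf(labels):
    """Majority vote among the training labels in a terminal node."""
    return Counter(labels).most_common(1)[0][0]

def predict_leaf(values):
    """Average of the training outcomes in a terminal node (regression tree)."""
    return mean(values)

# Made-up contents of one terminal node
leaf_labels = ["owner", "owner", "nonowner"]
leaf_values = [10.0, 12.0, 14.0]

leaf_class = classify_leaf(leaf_labels)
leaf_mean = predict_leaf(leaf_values)
print(leaf_class, leaf_mean)
```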
divides up the p-dimensional space of the x variables into non-overlapping multidimensional rectangles, operates on results of prior division
recursive partitioning
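Recursive partitioning can be sketched in a few lines: find the best split of the current rectangle, then apply the same procedure to each half. This is a minimal sketch, not a full CART implementation. It assumes numeric predictors, uses the Gini measure to score splits, tries midpoints between observed values as thresholds, and stops when no split lowers impurity. All names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2                       # midpoint between neighbours
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:                  # keep only strict improvements
                best, best_score = (j, t), score
    return best

def grow(rows, labels):
    """Split the current rectangle, then recurse on each resulting rectangle."""
    split = best_split(rows, labels)
    if split is None:                               # pure, or no useful split left
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left]),
        "right": grow([r for r, _ in right], [y for _, y in right]),
    }

def predict(tree, row):
    """Drop a record down the tree until it reaches a terminal node."""
    while "leaf" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]

# Made-up data: two predictors, two classes
rows = [(1.0, 5.0), (2.0, 4.0), (6.0, 1.0), (7.0, 2.0)]
labels = ["a", "a", "b", "b"]
tree = grow(rows, labels)
print(tree)
print(predict(tree, (1.5, 9.0)), predict(tree, (6.5, 0.0)))
```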
what is a pure rectangle?
a rectangle that contains records of only one class
each split is depicted as a split of a node into two successor nodes
classification tree algorithms
statistical test to assess whether splitting a node improves the purity by a statistically significant amount
chi-squared automatic interaction detection (CHAID)
removing the weakest branches, which hardly reduce the error rate, by successively selecting decision nodes and re-designating them as terminal nodes
pruning the tree
misclassification error + penalty factor for size of tree
cost complexity
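To see how the penalty trades error against size, here is a minimal sketch that scores a few candidate trees. It assumes tree size is measured as the number of terminal nodes and that the penalty factor (called `alpha` here) multiplies that count. The candidate error rates are made up.

```python
def cost_complexity(error_rate, n_leaves, alpha):
    """Misclassification error plus a penalty proportional to tree size."""
    return error_rate + alpha * n_leaves

# (number of terminal nodes, misclassification error): hypothetical values
candidates = [(1, 0.40), (3, 0.20), (5, 0.15), (9, 0.14)]
alpha = 0.02  # assumed penalty factor

scores = {leaves: cost_complexity(err, leaves, alpha) for leaves, err in candidates}
best_size = min(scores, key=scores.get)
print(scores, best_size)
```

With a larger `alpha` the minimum shifts toward smaller trees. With `alpha = 0` the largest tree in this list wins because it has the lowest raw error.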
what does the minimum error tree have?
lowest misclassification error on validation set
smallest tree in the pruning sequence whose validation error is within one standard error of the minimum error tree's
best-pruned tree
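The selection rule can be sketched as follows. The tolerance around the minimum error is estimated here as the binomial standard error of the minimum error rate, sqrt(e(1 - e) / n), which is an assumption about how that tolerance is computed. The pruning sequence and validation-set size are made up.

```python
from math import sqrt

def best_pruned(sequence, n_validation):
    """sequence: list of (n_leaves, validation_error) from the pruning sequence."""
    _, min_err = min(sequence, key=lambda t: t[1])          # minimum error tree
    tolerance = sqrt(min_err * (1 - min_err) / n_validation)
    eligible = [t for t in sequence if t[1] <= min_err + tolerance]
    return min(eligible, key=lambda t: t[0])                # smallest eligible tree

# Hypothetical pruning sequence: (terminal nodes, validation error)
pruning_sequence = [(1, 0.40), (3, 0.22), (5, 0.17), (9, 0.15), (15, 0.16)]
chosen = best_pruned(pruning_sequence, n_validation=200)
print(chosen)
```

Here the minimum error tree has 9 terminal nodes, but the 5-node tree is close enough in error to be preferred.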
fit classification trees to different bootstrap samples and combine them: draw a random subset of observations and predictors, grow a tree, repeat many times, then take a vote among the trees (or average their predictions)
random forest
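The resample, grow, and vote loop can be sketched with the standard library alone. This is a deliberately simplified sketch: each "tree" is a single-split stump, and the random predictor subset is drawn once per tree, whereas full random forest implementations grow deeper trees and usually redraw the subset at every split. All function names, defaults, and data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, features):
    """Best single split (fewest training errors) using only the given features."""
    best = None
    for j in features:
        values = sorted({r[j] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            l_cls = Counter(left).most_common(1)[0][0]
            r_cls = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_cls for y in left) + sum(y != r_cls for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_cls, r_cls)
    if best is None:                       # no split possible: always predict majority
        cls = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), cls, cls)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]     # bootstrap sample of records
        features = rng.sample(range(p), n_features)    # random subset of predictors
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], features))
    return forest

def predict_forest(forest, row):
    """Each tree votes; the most common class wins."""
    votes = [l_cls if row[j] <= t else r_cls for j, t, l_cls, r_cls in forest]
    return Counter(votes).most_common(1)[0][0]

# Made-up, well-separated data
rows = [(1, 1), (2, 2), (1, 2), (2, 1), (8, 8), (9, 9), (8, 9), (9, 8)]
labels = ["a"] * 4 + ["b"] * 4
forest = fit_forest(rows, labels)
print(predict_forest(forest, (1.5, 1.5)), predict_forest(forest, (8.5, 8.5)))
```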
a sequence of trees is fitted so that each tree concentrates on the records misclassified by the previous tree: fit a tree, check how the model performs, give more weight to the incorrectly predicted records, fit the next tree, then combine the trees by a weighted vote/average
boosted trees/gradient boosting
1 minus the sum over all classes k of p_k squared, i.e. $1 - \sum_k p_k^2$
gini measure
what is p of k (p_k)?
percentage of observations in rectangle A that belong to class k
what is the Gini measure of a perfectly pure rectangle?
0
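A short sketch of the Gini computation makes the pure case concrete: when every record in the rectangle belongs to one class, the single proportion is 1 and the measure is 0. The function name and sample labels are illustrative.

```python
from collections import Counter

def gini(labels):
    """1 minus the sum of squared class proportions in a rectangle."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini(["a", "a", "a", "a"])     # one class only
mixed = gini(["a", "a", "b", "b"])    # two classes, evenly split
print(pure, mixed)
```

For two classes, the even split is the worst case and gives 0.5.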
minus the sum over all classes k of p_k times log base 2 of p_k, i.e. $-\sum_k p_k \log_2 p_k$
entropy measure
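The entropy measure can be sketched the same way. Only classes that actually appear in the rectangle are counted, so the log of zero never arises. The function name and sample labels are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Sum over classes of -p_k * log2(p_k) for the classes present."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return sum(-p * log2(p) for p in proportions)

pure_entropy = entropy(["a", "a", "a", "a"])     # one class only
mixed_entropy = entropy(["a", "a", "b", "b"])    # two classes, evenly split
print(pure_entropy, mixed_entropy)
```

Like the Gini measure, entropy is 0 for a pure rectangle. For two evenly split classes it reaches 1.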
percentage of data used for each tree in the random forest
bootstrap percentage
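As a rough sketch, drawing the data for one tree at a given bootstrap percentage might look like this. It assumes the usual bootstrap convention of sampling with replacement, so a record can appear more than once. The function and parameter names are hypothetical.

```python
import random

def bootstrap_sample(records, fraction, rng):
    """Draw round(fraction * n) records with replacement."""
    k = round(fraction * len(records))
    return [rng.choice(records) for _ in range(k)]

records = list(range(10))            # stand-in for 10 training records
rng = random.Random(0)               # seeded for reproducibility
sample = bootstrap_sample(records, fraction=0.5, rng=rng)
print(sample)
```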