decision trees Flashcards
arrange the data into predefined groups
classification
What is the difference between clustering and classification?
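it depends on whether the categories are predefined: classification assigns records to predefined groups, while clustering finds groups that are not predefined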
What are examples of classification?
classifying emails as “legit” or “spam”
What are common classification algorithms?
decision tree analysis
What is the process by which classification works?
1) choose classes and a set of classifying attributes
2) choose a set of records (randomly) for the training set
3) choose a set of records (randomly) for the test set
4) using the training set, create a model to predict the chosen class as a function of the other classifying attributes
5) evaluate the model using the test set
6) classify future records by applying the “best” model
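The six steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the single "number of links" attribute, and the one-threshold model are hypothetical stand-ins, not a real classification algorithm from these notes.

```python
import random

# Step 1: classes are "legit"/"spam"; the only classifying attribute is a
# hypothetical "number of links in the email".
records = [(0, "legit"), (1, "legit"), (2, "legit"), (1, "legit"),
           (7, "spam"), (9, "spam"), (8, "spam"), (6, "spam")]

def split_records(records, test_fraction=0.25, seed=0):
    """Steps 2-3: randomly assign records to a training set and a test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict(threshold, x):
    """Step 6: classify a record by applying the fitted model."""
    return "spam" if x > threshold else "legit"

def fit_threshold(training):
    """Step 4: pick the threshold that makes the fewest training errors."""
    best = None  # (threshold, number of training errors)
    for threshold in sorted({x for x, _ in training}):
        errors = sum(predict(threshold, x) != label for x, label in training)
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best[0]

def accuracy(threshold, test):
    """Step 5: fraction of test records the model classifies correctly."""
    return sum(predict(threshold, x) == label for x, label in test) / len(test)

training, test = split_records(records)
threshold = fit_threshold(training)
print("threshold:", threshold, "test accuracy:", accuracy(threshold, test))
```

The test set is kept out of model fitting so that step 5 measures how the model behaves on records it has not seen.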
What attributes of families made it more likely they rented a certain class of apartment?
To answer this, we can run a classification algorithm over the training data
- the data has known outcomes: we know what class of apartment each family ended up renting
- the outcome can be displayed on a decision tree
The classification of events, outcomes, things, etc.
Decision tree
What is the first question of the decision tree called?
root node
What is the second question of the decision tree called?
split or partition
What is the third or final part of the decision tree called?
terminal node or leaf
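One way to picture the three parts (root node, splits, terminal nodes) is as a small data structure. The sketch below is an assumption-laden illustration: the `Node` class, the `income` and `family_size` attributes, the thresholds, and the apartment class names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # A split node carries an attribute and a threshold; a leaf carries only a label.
    attribute: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # branch taken when value <= threshold
    right: Optional["Node"] = None   # branch taken when value > threshold
    label: Optional[str] = None      # set only on terminal nodes (leaves)

def classify(node, record):
    """Start at the root node and follow splits until a leaf is reached."""
    while node.label is None:
        if record[node.attribute] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label

# Hypothetical tree: the root node asks about income, a second split asks
# about family size, and each terminal node names an apartment class.
root = Node("income", 50000,
            left=Node(label="economy"),
            right=Node("family_size", 3,
                       left=Node(label="standard"),
                       right=Node(label="large")))

print(classify(root, {"income": 60000, "family_size": 4}))  # large
```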
How do you create a decision tree using training data?
find a way to split (partition) the training dataset into smaller sub-groups
How do you decide which attributes and rules to split on at each node (i.e., which partition is better than the others)?
Several algorithms exist. One approach: recursive partitioning
What are the steps of recursive partitioning?
- pick one of the predictor variables, Xi
- pick a value of Xi (say, si) that divides the training data into two portions
- measure how “pure” each of the resulting portions (subgroups) is
- the idea is to pick Xi and si to maximize the purity improvement in one step
- REPEAT the process for each of the subgroups until a predetermined number of subgroups is reached, or until improvements in purity become too small
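The steps above can be sketched as a short recursive function. Several choices here are assumptions, not part of the notes: purity is measured with Gini impurity (one common choice), a depth limit stands in for the "predetermined number of subgroups", and the function names and toy data are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label belongs to the same class."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Try every predictor Xi and value si; keep the lowest weighted impurity."""
    best = None  # (weighted impurity, predictor index i, split value si)
    n = len(rows)
    for i in range(len(rows[0])):
        for s in sorted({row[i] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[i] <= s]
            right = [y for row, y in zip(rows, labels) if row[i] > s]
            if not left or not right:
                continue  # not a real partition into two portions
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, i, s)
    return best

def grow(rows, labels, min_gain=0.01, max_depth=3):
    """Partition recursively until the purity gain is too small or depth runs out."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None or gini(labels) - split[0] < min_gain:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, i, s = split
    left = [(r, y) for r, y in zip(rows, labels) if r[i] <= s]
    right = [(r, y) for r, y in zip(rows, labels) if r[i] > s]
    return {"predictor": i, "value": s,
            "left": grow(*zip(*left), min_gain, max_depth - 1),
            "right": grow(*zip(*right), min_gain, max_depth - 1)}

# Toy data: two predictors per record, two classes.
rows = [(1, 5), (2, 6), (8, 5), (9, 6)]
labels = ["a", "a", "b", "b"]
tree = grow(rows, labels)
print(tree)  # {'predictor': 0, 'value': 2, 'left': 'a', 'right': 'b'}
```

Here the first predictor separates the classes perfectly at the value 2, so one split is enough and both subgroups become leaves.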
What does pure mean?
containing records of mostly one class
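As a rough illustration, purity can be scored as the share of records in a subgroup that belong to its most common class. This particular measure and the `purity` function name are assumptions; other measures are also used in practice.

```python
from collections import Counter

def purity(labels):
    """Share of records in the most common class (1.0 means perfectly pure)."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

print(purity(["spam", "spam", "spam", "legit"]))  # 0.75
print(purity(["legit", "legit"]))                 # 1.0
```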