6 - Decision Trees Flashcards
What are the first four phases of the Data Science Methodology?
- Problem Understanding Phase
- Data Preparation Phase
- Exploratory Data Analysis Phase
- Setup Phase
These phases are essential steps before moving on to the Modeling Phase.
What is the target variable in the adult_ch6_training data set?
Income, a categorical target variable with classes >50K and ≤50K
This represents individuals whose income is greater than $50,000 per year and those with income less than or equal to $50,000 per year.
What are the predictors retained in the adult_ch6_training data set?
- Marital status (categorical predictor)
- Cap_Gains_Losses (numerical predictor)
Marital status includes classes: married, divorced, never-married, separated, and widowed.
What is the structure of a decision tree?
A set of decision nodes connected by branches, terminating in leaf nodes
The root node is at the top, and decisions are made at each node.
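To make the structure concrete, here is a minimal sketch of a tree stored as nested dictionaries and walked from the root down to a leaf. The split rules, the threshold, and the names `toy_tree` and `classify` are invented for illustration; they are not the tree fitted in the chapter.

```python
# Decision nodes hold a test plus a "yes" and a "no" branch.
# Leaf nodes hold only a predicted class under the key "leaf".
# All rules and values below are hypothetical.
toy_tree = {
    "feature": "marital_status",
    "equals": "married",
    "yes": {
        "feature": "cap_gains_losses",
        "threshold": 0.05,
        "yes": {"leaf": "<=50K"},  # value <= threshold
        "no": {"leaf": ">50K"},    # value > threshold
    },
    "no": {"leaf": "<=50K"},
}


def classify(node, record):
    """Follow branches from the root until a leaf node is reached."""
    while "leaf" not in node:
        if "equals" in node:
            # Categorical test: does the record match the category?
            passed = record[node["feature"]] == node["equals"]
        else:
            # Numerical test: is the value at or below the threshold?
            passed = record[node["feature"]] <= node["threshold"]
        node = node["yes"] if passed else node["no"]
    return node["leaf"]


prediction = classify(
    toy_tree, {"marital_status": "married", "cap_gains_losses": 0.2}
)
```

Each record follows exactly one path, so every record ends in exactly one leaf.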
What does the root node in a decision tree indicate?
In the example tree, it indicates that 24% of the records have high income (>50K)
The root node shows the proportion of high-income records and the percentage of all records it contains, which is 100% at the root.
How does the CART algorithm determine the optimal split at a node?
By using the Gini Index to score each candidate split at that node and choosing the split whose child nodes are purest (lowest Gini impurity)
CART produces strictly binary trees with two branches for each decision node.
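As a rough sketch of how a binary split on a numerical predictor could be chosen, the code below scores each candidate threshold by the size-weighted Gini impurity of its two child nodes, where lower is better (equivalently, the largest reduction in impurity). The function names and the toy data are assumptions made for illustration; this is not sklearn's or rpart's actual implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 = pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted average impurity of the two children of a binary split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


def best_threshold(values, labels):
    """Try each midpoint between sorted distinct values; keep the purest split."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = split_impurity(left, right)
        if score < best[1]:
            best = (t, score)
    return best


# Made-up toy data, not the adult data set.
gains = [0, 1, 2, 6, 7, 9]
income = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]
threshold, score = best_threshold(gains, income)
```

On this toy data the split at 4.0 separates the two classes perfectly, so both children have impurity 0.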
What are the two leading algorithms for constructing decision trees?
- CART algorithm
- C5.0 algorithm
Both algorithms choose splits by measuring how pure the resulting nodes are.
What is the purpose of the Gini Index in decision trees?
To measure the goodness of a candidate split at a node
It helps in determining the optimal split that maximizes the purity of leaf nodes.
What Python library is used to implement the CART model?
sklearn
The DecisionTreeClassifier from sklearn is used to build CART decision trees.
How do you convert categorical variables into a dummy variable form in Python?
Using the categorical() command from the statsmodels.tools package
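This is necessary for fitting the CART model in sklearn, which expects numeric inputs. Note that categorical() is deprecated and absent from newer statsmodels releases; pandas.get_dummies() is a common replacement.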
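Whatever helper performs the conversion, dummy form means one 0/1 column per category. The following is a minimal pure-Python sketch of that idea; the helper name `to_dummies` and the sample values are hypothetical, and it does not reproduce the statsmodels function's signature or output layout.

```python
def to_dummies(values):
    """Return (category_names, rows); each row is a list of 0/1 indicators."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Made-up sample of marital status values.
marital = ["married", "divorced", "never-married", "married"]
names, dummies = to_dummies(marital)
# names   -> ['divorced', 'married', 'never-married']
# dummies -> [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Each row contains exactly one 1, marking the category that record belongs to.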
What command is used to fit the CART model in Python?
fit()
This command fits the decision tree model to the input data.
What is the function of the export_graphviz() command in Python?
To export the tree structure in Graphviz DOT format and save it to a specified file
Passing feature_names and class_names labels the nodes of the exported tree, which makes it easier to read.
In R, what command is used to build the CART model?
rpart()
The formula specifies the target and predictors for the model.
What packages need to be installed and loaded in R to build a CART model?
- rpart
- rpart.plot
These packages provide the necessary functions for modeling and visualization.
What is the maximum number of leaf nodes specified in the CART model example?
5
This parameter limits the complexity of the decision tree. In sklearn, it corresponds to the max_leaf_nodes argument of DecisionTreeClassifier.
Fill in the blank: Decision trees seek to create a set of leaf nodes that are as _______ as possible.
pure
Purity means that records in a leaf node share the same classification.
True or False: Decision trees can grow indefinitely without any constraints.
False
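Decision trees stop growing when no further splits can be made, for example when every leaf node is pure or when a limit such as a maximum number of leaf nodes is reached.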
What packages need to be installed and opened to build a CART model?
rpart and rpart.plot
Use install.packages(c("rpart", "rpart.plot")) to install.
What command is used to build a CART model?
rpart()
The command structure is cart01 <- rpart(formula = Income ~ maritalStatus + Cap_Gains_Losses, data = adult_tr, method = "class").
What does the formula input in the rpart() command represent?
Target ~ Predictors
The predictors are separated by plus signs.
How do you plot a CART model in R?
Using the rpart.plot() command
The required input is the name of the saved CART model.
What does the type argument in rpart.plot() control?
It controls the type of plot display
For example, type = 4 draws separate split labels on the left and right branches and labels every node, not just the leaves.
What is the purpose of the predict() command in the context of a CART model?
To obtain classifications for each record in the data set
The command structure is predIncomeCART = predict(object = cart01, newdata = X, type = "class").
What algorithm is an extension of C4.5 for generating decision trees?
C5.0 algorithm
Developed by J. Ross Quinlan.