Data Science using Python and R - 6 Flashcards
What are the first four phases of the Data Science Methodology?
- Problem Understanding Phase
- Data Preparation Phase
- Exploratory Data Analysis Phase
- Setup Phase
These phases are essential steps before moving on to the Modeling Phase.
What is the target variable in the adult_ch6_training data set?
Income, a categorical target variable with classes >50K and ≤50K
This represents individuals whose income is greater than $50,000 per year and those with income less than or equal to $50,000 per year.
What are the predictors retained in the adult_ch6_training data set?
- Marital status (categorical predictor)
- Cap_Gains_Losses (numerical predictor)
Marital status includes classes: married, divorced, never-married, separated, and widowed.
What is the structure of a decision tree?
A set of decision nodes connected by branches, terminating in leaf nodes
The root node is at the top, and decisions are made at each node.
What does the root node in a decision tree indicate?
In the chapter's example tree, it indicates that 24% of the records have high income (>50K)
The root node holds 100% of the records, so it shows the overall proportion of high-income records before any split is made.
How does the CART algorithm determine the optimal split at a node?
By selecting the candidate split with the best Gini Index score, that is, the split whose child nodes have the lowest weighted Gini impurity
CART produces strictly binary trees with two branches for each decision node.
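A minimal sketch of how a Gini-based score for a candidate binary split can be computed. The function names and the labels are made up for illustration and are not taken from the book or from sklearn; the sketch assumes the usual definition of Gini impurity (1 minus the sum of squared class proportions) and ranks splits by how much they reduce it.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted average impurity of the two child nodes of a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Made-up labels for illustration only (not the adult_ch6_training data).
parent = [">50K"] * 2 + ["<=50K"] * 6
split_a = ([">50K", ">50K"], ["<=50K"] * 6)  # both children are pure
split_b = (
    [">50K", "<=50K", "<=50K", "<=50K"],
    [">50K", "<=50K", "<=50K", "<=50K"],
)  # children look just like the parent

for name, (left, right) in {"A": split_a, "B": split_b}.items():
    reduction = gini_impurity(parent) - weighted_gini(left, right)
    print(name, round(reduction, 4))
```

Split A removes all of the parent's impurity, while split B removes none, so a Gini-based chooser would prefer A.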
What are the two leading algorithms for constructing decision trees?
- CART algorithm
- C5.0 algorithm
These algorithms measure leaf node purity to create effective decision trees.
What is the purpose of the Gini Index in decision trees?
To measure the goodness of a candidate split at a node
It helps in determining the optimal split that maximizes the purity of leaf nodes.
What Python library is used to implement the CART model?
sklearn
The DecisionTreeClassifier from sklearn is used to build CART decision trees.
How do you convert categorical variables into a dummy variable form in Python?
Using the categorical() command from the statsmodels.tools package
This is necessary because sklearn's tree models require numeric inputs, so categorical predictors must be dummy-coded before the CART model is fit.
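To make "dummy variable form" concrete, here is a small pure-Python sketch of what the conversion produces: one 0/1 indicator column per class. The helper to_dummies() is hypothetical and exists only for this illustration. In practice a library routine does this work; if categorical() is unavailable in your statsmodels version, pandas.get_dummies() is a commonly used alternative.

```python
def to_dummies(values):
    """Return one 0/1 indicator column per distinct class, keyed by class name."""
    classes = sorted(set(values))
    return {cls: [1 if v == cls else 0 for v in values] for cls in classes}


# Made-up marital-status values for illustration only.
marital_status = ["married", "divorced", "never-married", "married", "widowed"]
dummies = to_dummies(marital_status)

for cls, column in dummies.items():
    print(f"{cls:>14}: {column}")
```

Each record has a 1 in exactly one column, which is the numeric layout a tree model can consume.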
What command is used to fit the CART model in Python?
fit()
This command fits the decision tree model to the input data.
What is the function of the export_graphviz() command in Python?
To obtain the tree structure and save it to a specified file
It enhances the readability of the decision tree by including feature and class names.
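The fit() and export_graphviz() steps can be sketched together as follows. The ten records are made up and are not the book's data; the column names and the single "married" dummy are assumptions chosen for illustration. Passing out_file=None makes export_graphviz() return the DOT text as a string rather than writing it to a file.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Made-up toy records: [married dummy (0/1), Cap_Gains_Losses]. Not the book's data.
X = [[1, 0.0], [1, 5000.0], [0, 0.0], [0, 0.0], [1, 7000.0],
     [0, 3000.0], [1, 0.0], [0, 0.0], [0, 8000.0], [1, 100.0]]
y = ["<=50K", ">50K", "<=50K", "<=50K", ">50K",
     "<=50K", "<=50K", "<=50K", ">50K", "<=50K"]

# Gini criterion with at most five leaf nodes, the settings named in these cards.
cart01 = DecisionTreeClassifier(criterion="gini", max_leaf_nodes=5, random_state=0)
cart01.fit(X, y)

# out_file=None returns the DOT description instead of saving it to disk.
dot_data = export_graphviz(
    cart01,
    out_file=None,
    feature_names=["married", "Cap_Gains_Losses"],
    class_names=list(cart01.classes_),
)

predictions = cart01.predict(X)
print(cart01.get_n_leaves(), "leaf nodes")
print(dot_data[:60])
```

The DOT text can then be rendered with Graphviz; supplying feature_names and class_names is what puts readable labels on the nodes.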
In R, what command is used to build the CART model?
rpart()
The formula specifies the target and predictors for the model.
What packages need to be installed and loaded in R to build a CART model?
- rpart
- rpart.plot
These packages provide the necessary functions for modeling and visualization.
What is the maximum number of leaf nodes specified in the CART model example?
5
This parameter is set to limit the complexity of the decision tree.
Fill in the blank: Decision trees seek to create a set of leaf nodes that are as _______ as possible.
pure
Purity means that records in a leaf node share the same classification.
True or False: Decision trees can grow indefinitely without any constraints.
False
Decision trees stop growing when no further splits can be made.
What packages need to be installed and opened to build a CART model?
rpart and rpart.plot
Use install.packages(c("rpart", "rpart.plot")) to install.
What command is used to build a CART model?
rpart()
The command structure is cart01 <- rpart(formula = Income ~ maritalStatus + Cap_Gains_Losses, data = adult_tr, method = "class").
What does the formula input in the rpart() command represent?
Target ~ Predictors
The predictors are separated by plus signs.
How do you plot a CART model in R?
Using the rpart.plot() command
The required input is the name of the saved CART model.
What does the type argument in rpart.plot() control?
It controls the type of plot display
For example, type = 4 labels branches with specific values.
What is the purpose of the predict() command in the context of a CART model?
To obtain classifications for each record in the data set
The command structure is predIncomeCART = predict(object = cart01, newdata = X, type = "class").
What algorithm is an extension of C4.5 for generating decision trees?
C5.0 algorithm
Developed by J. Ross Quinlan.
What method does the C5.0 algorithm use to select the optimal split?
Information gain or entropy reduction
Unlike CART, C5.0 is not restricted to binary splits.
How is entropy defined in the context of C5.0?
H(X) = -Σ_j p_j log2(p_j)
Where p_j is the probability of the j-th value of variable X.
What does the information gain represent in the C5.0 algorithm?
gain(S) = H(T) - H_S(T)
It is the increase in information produced by partitioning the training data.
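The entropy and information gain formulas above can be checked with a short sketch. The function names and labels are made up for illustration; H_S(T) is assumed to be the size-weighted average entropy of the subsets produced by split S.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(X) = -sum_j p_j * log2(p_j), taken over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, subsets):
    """gain(S) = H(T) - H_S(T), with H_S(T) the size-weighted entropy of the subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted


# Made-up labels for illustration only.
parent = ["hi"] * 4 + ["lo"] * 4
pure_split = [["hi"] * 4, ["lo"] * 4]
mixed_split = [["hi", "hi", "hi", "lo"], ["hi", "lo", "lo", "lo"]]

print(round(entropy(parent), 4))
print(round(information_gain(parent, pure_split), 4))
print(round(information_gain(parent, mixed_split), 4))
```

An evenly mixed two-class node has entropy 1 bit; the pure split recovers all of it, while the mixed split recovers only a small fraction, so the pure split has the larger gain.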
What is a key difference between CART and C5.0 decision trees?
CART trees tend to have more balanced splits due to the Gini Index
C5.0 can send a majority of records to a single node.
How do you build a C5.0 decision tree in Python?
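Using the DecisionTreeClassifier() command with criterion="entropy"
Example: c50_01 = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=5). Note that sklearn still grows a binary tree here; the entropy criterion only mimics the way C5.0 selects splits.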
What is the purpose of the random forests algorithm?
To build a series of decision trees and combine their classifications
It is an example of an ensemble method.
What is the process of building each decision tree in random forests?
Taking a random sample, with replacement, from the original training data set
Each tree is built on a different sample of the data.
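A rough sketch of the sampling step, using only the standard library. Sampling with replacement (a bootstrap sample) is the usual approach in random forests; the record IDs, the seed, and the helper name bootstrap_sample() are made up for illustration.

```python
import random


def bootstrap_sample(records, rng):
    """Draw len(records) records at random, with replacement."""
    return rng.choices(records, k=len(records))


# Made-up record IDs standing in for rows of a training set.
training_ids = list(range(10))
rng = random.Random(42)  # fixed seed so the sketch is repeatable

samples = [bootstrap_sample(training_ids, rng) for _ in range(3)]
for i, sample in enumerate(samples, start=1):
    print(f"tree {i}: {sorted(sample)}")
```

Because the draw is with replacement, a sample typically repeats some records and omits others, which is why each tree sees a somewhat different data set.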
What does each classification in random forests represent?
A vote for that particular target variable value
The final classification is the value with the largest number of votes.
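The voting step can be sketched in a few lines. The votes below are made up, standing in for the classifications that five hypothetical trees gave to three records; majority_vote() is an illustrative helper, not a library function.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class that received the most votes from the individual trees."""
    return Counter(votes).most_common(1)[0][0]


# Made-up votes from five hypothetical trees for three records.
votes_per_record = [
    [">50K", ">50K", "<=50K", ">50K", "<=50K"],
    ["<=50K", "<=50K", "<=50K", ">50K", "<=50K"],
    [">50K", "<=50K", "<=50K", "<=50K", ">50K"],
]

final = [majority_vote(votes) for votes in votes_per_record]
print(final)
```

Using an odd number of trees with two classes, as here, avoids tied votes.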
What command is used to create a random forest in Python?
RandomForestClassifier()
Example: rf01 = RandomForestClassifier(n_estimators = 100, criterion="gini").
What is the input required to build random forests in R?
randomForest() command with the formula and data inputs
Example: rf01 <- randomForest(formula = Income ~ maritalStatus + Cap_Gains_Losses, data = adult_tr, ntree = 100, type = "classification").
What is the function used to create random forests in R?
randomForest()
What does the ‘data’ input in randomForest() specify?
It specifies the data set that the variables in the formula come from.
What does the ‘ntree’ input in randomForest() indicate?
It tells the algorithm how many trees to make.
How many trees are used in the example provided for the random forests model?
100 trees
What does the ‘type’ input set to ‘classification’ specify in randomForest()?
It specifies that the forest is built for classification (a categorical target) rather than regression.
How can you view the classifications made by the random forests algorithm?
Look at the predicted values saved in the fitted model object, rf01$predicted.
What does rf01$predicted return?
A classification for each record in the data set.
What is a decision tree?
A model that uses a tree-like graph of decisions and their possible consequences.
What is the difference between a decision node and a leaf node?
A decision node splits into further nodes, while a leaf node represents an outcome.
Where is the most powerful of all possible splits made in a decision tree?
At the root node.
When do decision trees stop growing?
When they reach a predefined stopping criterion.
How do decision trees work?
By splitting data into subsets based on feature values.
Would CART be a good algorithm to use if we are interested in a trinary categorical predictor?
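Yes, but CART will only split it two ways at a node (for example, one class versus the other two); a three-way split requires an algorithm such as C5.0.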
Which criterion is used by CART to assess which split is optimal?
The Gini Index (Gini impurity). Entropy is the criterion associated with C5.0.
Which concept does the C5.0 algorithm use to select the optimal split?
Information gain.
What are random forests?
An ensemble learning method that constructs multiple decision trees.
How do random forests work?
By aggregating the predictions from multiple decision trees.
Are all the predictor variables candidates to be the ‘best’ split for each node in a tree built by random forests?
No, only a random subset of the predictors is considered at each node.
Are the data sets used to build each tree in random forests the same?
No, each tree is built on a different random sample of the training data.
How does the random forests algorithm give the training data set its final classification?
By majority voting among the predictions of all trees.
Fill in the blank: The package for CART models in R is called _______.
rpart
Fill in the blank: The C5.0 decision trees and rule-based models package in R is called _______.
C50
True or False: The rpart.plot package is used to visualize CART models.
True
True or False: The randomForest package is primarily used for regression tasks only.
False