Data Science using Python and R - 6 Flashcards

Question 1

Q

What are the first four phases of the Data Science Methodology?

Answer

A

Problem Understanding Phase
Data Preparation Phase
Exploratory Data Analysis Phase
Setup Phase

These phases are essential steps before moving on to the Modeling Phase.

Question 2

Q

What is the target variable in the adult_ch6_training data set?

Answer

A

Income, a categorical target variable with classes >50k and ≤50k

This represents individuals whose income is greater than $50,000 per year and those with income less than or equal to $50,000 per year.

Question 3

Q

What are the predictors retained in the adult_ch6_training data set?

Answer

A

Marital status (categorical predictor)
Cap_gains_losses (numerical predictor)

Marital status includes classes: married, divorced, never-married, separated, and widowed.

Question 4

Q

What is the structure of a decision tree?

Answer

A

A set of decision nodes connected by branches, terminating in leaf nodes

The root node is at the top, and decisions are made at each node.

Question 5

Q

What does the root node in a decision tree indicate?

Answer

A

In the chapter's example tree, it indicates that 24% of the records have high income (>50K)

The root node shows the proportion of high-income records and the percentage of all records that the node contains, which is 100% at the root.

Question 6

Q

How does the CART algorithm determine the optimal split at a node?

Answer

A

By choosing the candidate split that minimizes the Gini Index (the impurity of the resulting child nodes) at that node

CART produces strictly binary trees with two branches for each decision node.

Question 7

Q

What are the two leading algorithms for constructing decision trees?

Answer

A

CART algorithm
C5.0 algorithm

These algorithms measure leaf node purity to create effective decision trees.

Question 8

Q

What is the purpose of the Gini Index in decision trees?

Answer

A

To measure the goodness of a candidate split at a node

It helps in determining the optimal split that maximizes the purity of leaf nodes.
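
As a rough illustration of how a candidate split can be scored, the sketch below computes the Gini impurity of a node and the size-weighted impurity of a two-way split. The helper names gini and weighted_gini and the toy labels are assumptions made for illustration, not code from the course; a lower weighted value means purer child nodes.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(left, right):
    """Impurity of a binary split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy node with two records of each class (hypothetical data).
parent = [">50K", ">50K", "<=50K", "<=50K"]

# A split that separates the classes perfectly versus one that does not.
pure_split = weighted_gini([">50K", ">50K"], ["<=50K", "<=50K"])
mixed_split = weighted_gini([">50K", "<=50K"], [">50K", "<=50K"])
```

Under this measure the first split would be preferred, because its child nodes contain a single class each.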

Question 9

Q

What Python library is used to implement the CART model?

Answer

A

sklearn

The DecisionTreeClassifier from sklearn is used to build CART decision trees.

Question 10

Q

How do you convert categorical variables into a dummy variable form in Python?

Answer

A

Using the categorical() command from the statsmodels.tools package

This is necessary for fitting the CART model in sklearn.
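
The categorical() helper may not be present in every statsmodels release, so the sketch below shows one commonly used alternative, pandas.get_dummies(). This is a swapped-in technique rather than the command named on the card, and the toy data frame and the name mar_dummies are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the marital status column (values are illustrative).
adult_tr = pd.DataFrame({
    "Marital status": ["Married", "Divorced", "Never-married", "Married"]
})

# One indicator (dummy) column per category, coded as 0/1 integers.
mar_dummies = pd.get_dummies(adult_tr["Marital status"], dtype=int)
```

Each row of mar_dummies has a 1 in the column for its own category and 0 elsewhere, which is the numeric form that sklearn estimators expect.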

Question 11

Q

What command is used to fit the CART model in Python?

Answer

A

fit()

This command fits the decision tree model to the input data.
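
A minimal sketch of the fitting step, assuming a tiny made-up data set in place of adult_ch6_training. The arrays X and y and their values are hypothetical; only DecisionTreeClassifier, fit() and predict() are real sklearn calls.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy predictors: [married dummy, capital gains/losses]; values are made up.
X = np.array([[1, 5000], [1, 0], [0, 0], [0, 200], [1, 7000], [0, 0]])
y = np.array([">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"])

# criterion="gini" gives CART-style splits; max_leaf_nodes caps the tree size.
cart01 = DecisionTreeClassifier(criterion="gini", max_leaf_nodes=5, random_state=0)
cart01.fit(X, y)

# Classify the same records the tree was trained on.
pred = cart01.predict(X)
```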

Question 12

Q

What is the function of the export_graphviz() command in Python?

Answer

A

To obtain the tree structure and save it to a specified file

It enhances the readability of the decision tree by including feature and class names.
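
The sketch below shows one way export_graphviz() might be called, assuming a small tree fitted on made-up data. Passing out_file=None returns the DOT text as a string instead of writing a file; the feature names and the class labels "high" and "low" are illustrative assumptions.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Toy data: [married dummy, capital gains/losses]; values are made up.
X = [[1, 5000], [1, 0], [0, 0], [0, 200], [1, 7000], [0, 0]]
y = ["high", "low", "low", "low", "high", "low"]
cart01 = DecisionTreeClassifier(max_leaf_nodes=5, random_state=0).fit(X, y)

# out_file=None returns the DOT description as a string.
dot_text = export_graphviz(
    cart01,
    out_file=None,
    feature_names=["Married", "Cap_Gains_Losses"],
    class_names=list(cart01.classes_),
)
```

To save the structure to disk instead, out_file can be given a file name, and the resulting DOT file can then be rendered with Graphviz.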

Question 13

Q

In R, what command is used to build the CART model?

Answer

A

rpart()

The formula specifies the target and predictors for the model.

Question 14

Q

What packages need to be installed and loaded in R to build a CART model?

Answer

A

rpart
rpart.plot

These packages provide the necessary functions for modeling and visualization.

Question 15

Q

What is the maximum number of leaf nodes specified in the CART model example?

Answer

A

5

This parameter is set to limit the complexity of the decision tree.

Question 16

Q

Fill in the blank: Decision trees seek to create a set of leaf nodes that are as _______ as possible.

Answer

A

pure

Purity means that records in a leaf node share the same classification.

Question 17

Q

True or False: Decision trees can grow indefinitely without any constraints.

Answer

A

False

Decision trees stop growing when no further splits can be made.

Question 18

Q

What packages need to be installed and opened to build a CART model?

Answer

A

rpart and rpart.plot

Use install.packages(c("rpart", "rpart.plot")) to install.

Question 19

Q

What command is used to build a CART model?

Answer

A

rpart()

The command structure is cart01 <- rpart(formula = Income ~ maritalStatus + Cap_Gains_Losses, data = adult_tr, method = "class").

Question 20

Q

What does the formula input in the rpart() command represent?

Answer

A

Target ~ Predictors

The predictors are separated by plus signs.

Question 21

Q

How do you plot a CART model in R?

Answer

A

Using the rpart.plot() command

The required input is the name of the saved CART model.

Question 22

Q

What does the type argument in rpart.plot() control?

Answer

A

It controls the type of plot display

For example, type = 4 labels branches with specific values.

Question 23

Q

What is the purpose of the predict() command in the context of a CART model?

Answer

A

To obtain classifications for each record in the data set

The command structure is predIncomeCART = predict(object = cart01, newdata = X, type = "class").

Question 24

Q

What algorithm is an extension of C4.5 for generating decision trees?

Answer

A

C5.0 algorithm

Developed by J. Ross Quinlan.

Question 25

Q

What method does the C5.0 algorithm use to select the optimal split?

Answer

A

Information gain or entropy reduction

Compared to CART, C5.0 is not restricted to binary splits.

Question 26

Q

How is entropy defined in the context of C5.0?

Answer

A

H(X) = -Σ_j p_j log2(p_j)

Where p_j is the probability of the j-th value of variable X, and the sum runs over all values of X.
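
The entropy formula above can be sketched in a few lines. The function name entropy and the example probabilities are assumptions for illustration; zero probabilities are skipped because their contribution is taken to be 0.

```python
import math

def entropy(probs):
    """H(X) = -sum(p_j * log2(p_j)), skipping zero-probability values."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_fair = entropy([0.5, 0.5])   # two equally likely classes
h_pure = entropy([1.0, 0.0])   # every record in a single class
```

Two equally likely classes give the largest entropy for a binary variable (1 bit), while a node containing one class has entropy 0.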

Question 27

Q

What does the information gain represent in the C5.0 algorithm?

Answer

A

gain(S) = H(T) - H_S(T)

It is the increase in information produced by partitioning the training data.
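
As a hedged sketch of the gain formula, the code below takes H_S(T) to be the entropy of each subset weighted by the share of records it receives. The helper names entropy_of and information_gain and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of the class labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """gain(S) = H(T) - H_S(T), with H_S(T) the size-weighted subset entropy."""
    n = len(parent)
    h_split = sum(len(s) / n * entropy_of(s) for s in subsets)
    return entropy_of(parent) - h_split

# Toy training labels T and two candidate partitions (hypothetical data).
T = ["high", "high", "low", "low"]
gain_perfect = information_gain(T, [["high", "high"], ["low", "low"]])
gain_useless = information_gain(T, [["high", "low"], ["high", "low"]])
```

The partition that separates the classes recovers the full bit of entropy, while the partition that leaves both subsets evenly mixed gains nothing.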

Question 28

Q

What is a key difference between CART and C5.0 decision trees?

Answer

A

CART trees tend to have more balanced splits due to the Gini Index

C5.0 can send a majority of records to a single node.

Question 29

Q

How do you build a C5.0 decision tree in Python?

Answer

A

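Using the DecisionTreeClassifier() command with criterion="entropy"

Example: c50_01 = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=5). Note that sklearn does not implement C5.0 itself; this approximates it by using entropy as the splitting criterion.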

Question 30

Q

What is the purpose of the random forests algorithm?

Answer

A

To build a series of decision trees and combine their classifications

It is an example of an ensemble method.

Question 31

Q

What is the process of building each decision tree in random forests?

Answer

A

Taking a random sample from the original training data set

Each tree is built on a different dataset.
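
A minimal sketch of the sampling step, assuming each tree's sample is drawn with replacement (a bootstrap sample), as random forests conventionally do. The record count, tree count and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_records = 10   # size of a toy training set
n_trees = 3      # number of trees to build

# One array of row indices per tree, each drawn with replacement,
# so some records repeat within a sample and others are left out.
samples = [
    rng.choice(n_records, size=n_records, replace=True)
    for _ in range(n_trees)
]
```

Each index array would then select the rows used to grow one tree.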

Question 32

Q

What does each classification in random forests represent?

Answer

A

A vote for that particular target variable value

The final classification is the value with the largest number of votes.
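
The voting step can be sketched as below. The function name majority_vote and the list of tree votes are assumptions for illustration, not part of any library.

```python
from collections import Counter

def majority_vote(votes):
    """Return the class that received the most votes from the trees."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical classifications of one record by five trees.
tree_votes = [">50K", "<=50K", ">50K", ">50K", "<=50K"]
final_class = majority_vote(tree_votes)
```

With an even number of trees a tie is possible; this sketch simply returns the tied class that was counted first.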

Question 33

Q

What command is used to create a random forest in Python?

Answer

A

RandomForestClassifier()

Example: rf01 = RandomForestClassifier(n_estimators = 100, criterion="gini").
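
A short sketch of fitting and using the forest, assuming a tiny made-up data set in place of the course's training data. The arrays X and y are hypothetical, and random_state is added only to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy predictors: [married dummy, capital gains/losses]; values are made up.
X = np.array([[1, 5000], [1, 0], [0, 0], [0, 200], [1, 7000], [0, 0]])
y = np.array([">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"])

# Build 100 trees, each split chosen with the Gini criterion.
rf01 = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0)
rf01.fit(X, y)

# Combined classification for each record.
rf_pred = rf01.predict(X)
```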

Question 34

Q

What is the input required to build random forests in R?

Answer

A

randomForest() command with the formula and data inputs

Example: rf01 <- randomForest(formula = Income ~ maritalStatus + Cap_Gains_Losses, data = adult_tr, ntree = 100, type = "classification").

Question 35

Q

What is the function used to create random forests in R?

Answer

A

randomForest()

Question 36

Q

What does the ‘formula’ input in randomForest() specify?

Answer

A

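It specifies the target variable and the predictors, in the form Target ~ Predictors. The data input specifies where the variables in the formula come from.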

Question 37

Q

What does the ‘ntree’ input in randomForest() indicate?

Answer

A

It tells the algorithm how many trees to make.

Question 38

Q

How many trees are used in the example provided for the random forests model?

Answer

A

100 trees

Question 39

Q

What does the ‘type’ input set to ‘classification’ specify in randomForest()?

Answer

A

It specifies that the data is being classified.

Question 40

Q

How can you view the classifications made by the random forests algorithm?

Answer

A

Look at the predicted values saved under rf01.

Question 41

Q

What does rf01$predicted return?

Answer

A

A classification for each record in the data set.

Question 42

Q

What is a decision tree?

Answer

A

A model that uses a tree-like graph of decisions and their possible consequences.

Question 43

Q

What is the difference between a decision node and a leaf node?

Answer

A

A decision node splits into further nodes, while a leaf node represents an outcome.

Question 44

Q

Where is the most powerful of all possible splits made in a decision tree?

Answer

A

At the root node.

Question 45

Q

When do decision trees stop growing?

Answer

A

When they reach a predefined stopping criterion.

Question 46

Q

How do decision trees work?

Answer

A

By splitting data into subsets based on feature values.

Question 47

Q

Would CART be a good algorithm to use if we are interested in a trinary categorical predictor?

Question 48

Q

Which criterion is used by CART to assess which split is optimal?

Answer

A

The Gini Index (Gini impurity). Entropy is the basis of the information gain criterion used by C5.0.

Question 49

Q

Which concept does the C5.0 algorithm use to select the optimal split?

Answer

A

Information gain.

Question 50

Q

What are random forests?

Answer

A

An ensemble learning method that constructs multiple decision trees.

Question 51

Q

How do random forests work?

Answer

A

By aggregating the predictions from multiple decision trees.

Question 52

Q

Are all the predictor variables candidates to be the ‘best’ split for each node in a tree built by random forests?

Question 53

Q

Are the data sets used to build each tree in random forests the same?

Answer

A

No, each tree is built on a different random sample drawn from the original training data set.

Question 54

Q

How does the random forests algorithm give the training data set its final classification?

Answer

A

By majority voting among the predictions of all trees.

Question 55

Q

Fill in the blank: The package for CART models in R is called _______.

Answer

A

rpart

Question 56

Q

Fill in the blank: The C5.0 decision trees and rule-based models package in R is called _______.

Question 57

Q

True or False: The rpart.plot package is used to visualize CART models.

Answer

A

True

Question 58

Q

True or False: The randomForest package is primarily used for regression tasks only.

Answer

A

False
