Data Science using Python and R - 6 Flashcards

1
Q

What are the first four phases of the Data Science Methodology?

A
  • Problem Understanding Phase
  • Data Preparation Phase
  • Exploratory Data Analysis Phase
  • Setup Phase

These phases are essential steps before moving on to the Modeling Phase.

2
Q

What is the target variable in the adult_ch6_training data set?

A

Income, a categorical target variable with classes >50K and ≤50K

This represents individuals whose income is greater than $50,000 per year and those with income less than or equal to $50,000 per year.

3
Q

What are the predictors retained in the adult_ch6_training data set?

A
  • Marital status (categorical predictor)
  • Cap_gains_losses (numerical predictor)

Marital status includes classes: married, divorced, never-married, separated, and widowed.

4
Q

What is the structure of a decision tree?

A

A set of decision nodes connected by branches, terminating in leaf nodes

The root node is at the top, and decisions are made at each node.
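As a rough illustration of this structure, the sketch below models decision nodes and leaf nodes as two small classes and walks a record from the root down to a leaf. The class names, the feature name, and the 0.05 threshold are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class assigned to every record that reaches this leaf


@dataclass
class DecisionNode:
    feature: str      # record field tested at this node
    threshold: float  # records with value <= threshold follow the left branch
    left: Union["DecisionNode", Leaf]
    right: Union["DecisionNode", Leaf]


def classify(node, record):
    """Follow branches from the root until a leaf node is reached."""
    while isinstance(node, DecisionNode):
        if record[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# A one-split tree: the root is the only decision node (hypothetical threshold).
root = DecisionNode("cap_gains_losses", 0.05, Leaf("<=50K"), Leaf(">50K"))
prediction = classify(root, {"cap_gains_losses": 0.10})
```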

5
Q

What does the root node in a decision tree indicate?

A

In the chapter's example tree, it indicates that 24% of the records have high income (>50K)

The root node shows the proportion of high-income records and the total percentage of records.

6
Q

How does the CART algorithm determine the optimal split at a node?

A

By choosing the candidate split with the best Gini Index measure at that node, that is, the split that most reduces Gini impurity in the resulting child nodes

CART produces strictly binary trees with two branches for each decision node.
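One common way to score a binary split is the weighted Gini impurity of its two child nodes, where lower is better. The sketch below uses that formulation to search the thresholds of a single numeric predictor. The function names and the toy data are hypothetical, and this is not the exact procedure used by sklearn or rpart.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a binary split, weighting each child by its share of records."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try each distinct value (except the largest) as a 'value <= t' split."""
    best = None
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = weighted_gini(left, right)
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Toy data (made up for illustration).
values = [0.0, 0.0, 0.1, 0.4, 0.6, 0.9]
labels = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]
threshold, score = best_threshold(values, labels)
```

Here the split at 0.1 separates the two classes completely, so both child nodes are pure and the weighted impurity is 0.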

7
Q

What are the two leading algorithms for constructing decision trees?

A
  • CART algorithm
  • C5.0 algorithm

These algorithms measure leaf node purity to create effective decision trees.

8
Q

What is the purpose of the Gini Index in decision trees?

A

To measure the goodness of a candidate split at a node

It helps in determining the optimal split that maximizes the purity of leaf nodes.

9
Q

What Python library is used to implement the CART model?

A

sklearn

The DecisionTreeClassifier from sklearn is used to build CART decision trees.

10
Q

How do you convert categorical variables into a dummy variable form in Python?

A

Using the categorical() command from the statsmodels.tools package

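This is necessary for fitting the CART model in sklearn, which expects numeric inputs. Recent statsmodels releases have deprecated categorical(); pandas.get_dummies() is a common alternative.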
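To show what dummy variable form looks like, here is a minimal pure-Python sketch that turns a categorical column into one 0/1 column per class. The one_hot() helper is hypothetical and stands in for the library call. It is not the statsmodels or pandas implementation.

```python
def one_hot(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Toy marital status column (made up for illustration).
marital = ["married", "divorced", "never-married", "married"]
categories, dummies = one_hot(marital)
```

Each row contains exactly one 1, in the column for that record's class.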

11
Q

What command is used to fit the CART model in Python?

A

fit()

This command fits the decision tree model to the input data.

12
Q

What is the function of the export_graphviz() command in Python?

A

To obtain the tree structure and save it to a specified file

It enhances the readability of the decision tree by including feature and class names.

13
Q

In R, what command is used to build the CART model?

A

rpart()

The formula specifies the target and predictors for the model.

14
Q

What packages need to be installed and loaded in R to build a CART model?

A
  • rpart
  • rpart.plot

These packages provide the necessary functions for modeling and visualization.

15
Q

What is the maximum number of leaf nodes specified in the CART model example?

A

5

This parameter is set to limit the complexity of the decision tree.

16
Q

Fill in the blank: Decision trees seek to create a set of leaf nodes that are as _______ as possible.

A

pure

Purity means that records in a leaf node share the same classification.

17
Q

True or False: Decision trees can grow indefinitely without any constraints.

A

False

Decision trees stop growing when no further splits can be made.

18
Q

What packages need to be installed and opened to build a CART model?

A

rpart and rpart.plot

Use install.packages(c("rpart", "rpart.plot")) to install.

19
Q

What command is used to build a CART model?

A

rpart()

The command structure is cart01 <- rpart(formula = Income ~ maritalStatus + Cap_Gains_Losses, data = adult_tr, method = "class").

20
Q

What does the formula input in the rpart() command represent?

A

Target ~ Predictors

The predictors are separated by plus signs.

21
Q

How do you plot a CART model in R?

A

Using the rpart.plot() command

The required input is the name of the saved CART model.

22
Q

What does the type argument in rpart.plot() control?

A

It controls the type of plot display

For example, type = 4 labels branches with specific values.

23
Q

What is the purpose of the predict() command in the context of a CART model?

A

To obtain classifications for each record in the data set

The command structure is predIncomeCART = predict(object = cart01, newdata = X, type = "class").

24
Q

What algorithm is an extension of C4.5 for generating decision trees?

A

C5.0 algorithm

Developed by J. Ross Quinlan.

25
Q

What method does the C5.0 algorithm use to select the optimal split?

A

Information gain or entropy reduction

Unlike CART, C5.0 is not restricted to binary splits.

26
Q

How is entropy defined in the context of C5.0?

A

H(X) = -Σ_j p_j log2(p_j)

where p_j is the probability of the j-th value of variable X.
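The entropy formula can be sketched in a few lines of Python, estimating each p_j as the observed proportion of a class among the labels at a node. The function name and the toy label lists are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """H(X) = -sum_j p_j * log2(p_j), with p_j the proportion of class j."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


# A 50/50 node is maximally uncertain for two classes; a pure node has no uncertainty.
balanced = entropy(["<=50K", ">50K"] * 5)
pure = entropy([">50K"] * 10)
```

For two classes, entropy ranges from 0 bits for a pure node to 1 bit for an evenly mixed node.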

27
Q

What does the information gain represent in the C5.0 algorithm?

A

gain(S) = H(T) - H_S(T)

It is the increase in information produced by partitioning the training data.
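As a worked sketch of gain(S) = H(T) - H_S(T), the code below treats H_S(T) as the weighted average entropy of the subsets produced by a candidate split S. The split may produce more than two subsets, since C5.0 is not limited to binary splits. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of the class labels at one node."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent, children):
    """gain(S) = H(T) - H_S(T), where H_S(T) weights each subset by its size."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy node with 6 low-income and 2 high-income records, split into two subsets.
parent = ["<=50K"] * 6 + [">50K"] * 2
children = [["<=50K"] * 4, ["<=50K"] * 2 + [">50K"] * 2]
gain = information_gain(parent, children)
```

In this toy split the parent entropy is about 0.811 bits and the weighted child entropy is 0.5 bits, so the gain is about 0.311 bits.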

28
Q

What is a key difference between CART and C5.0 decision trees?

A

CART trees tend to have more balanced splits due to the Gini Index

C5.0 can send a majority of records to a single node.

29
Q

How do you build a C5.0 decision tree in Python?

A

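Using the DecisionTreeClassifier() command with criterion="entropy"

Example: c50_01 = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=5). Note that sklearn does not implement C5.0 itself; this fits a binary tree that uses the entropy criterion, which approximates the C5.0 approach.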

30
Q

What is the purpose of the random forests algorithm?

A

To build a series of decision trees and combine their classifications

It is an example of an ensemble method.

31
Q

What is the process of building each decision tree in random forests?

A

Taking a random sample from the original training data set

Each tree is built on a different dataset.
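The sampling step can be sketched as below, assuming the usual random forests convention of drawing each sample with replacement (a bootstrap sample) at the same size as the training set. The helper name, the seed, and the stand-in record IDs are hypothetical.

```python
import random


def bootstrap_sample(records, rng):
    """Draw len(records) records with replacement from the training data."""
    return rng.choices(records, k=len(records))


rng = random.Random(42)      # fixed seed so the sketch is repeatable
training = list(range(10))   # stand-in record IDs for a training data set
samples = [bootstrap_sample(training, rng) for _ in range(3)]
```

Because sampling is with replacement, a given sample usually repeats some records and omits others, which is why each tree sees a different data set.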

32
Q

What does each classification in random forests represent?

A

A vote for that particular target variable value

The final classification is the value with the largest number of votes.
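The voting step for a single record can be sketched as follows. The majority_vote() helper and the list of votes are made up for illustration.

```python
from collections import Counter


def majority_vote(votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


# Classifications of one record from five hypothetical trees.
tree_votes = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
final = majority_vote(tree_votes)
```

Three of the five trees vote for the low-income class, so that class becomes the final classification for this record.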

33
Q

What command is used to create a random forest in Python?

A

RandomForestClassifier()

Example: rf01 = RandomForestClassifier(n_estimators = 100, criterion="gini").

34
Q

What is the input required to build random forests in R?

A

randomForest() command with the formula and data inputs

Example: rf01 <- randomForest(formula = Income ~ maritalStatus + Cap_Gains_Losses, data = adult_tr, ntree = 100, type = "classification").

35
Q

What is the function used to create random forests in R?

A

randomForest()

36
Q

What does the ‘formula’ input in randomForest() specify?

A

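It specifies the target variable and the predictors, in the form Target ~ Predictors. The separate data input specifies where the variables in the formula come from.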

37
Q

What does the ‘ntree’ input in randomForest() indicate?

A

It tells the algorithm how many trees to make.

38
Q

How many trees are used in the example provided for the random forests model?

A

100 (ntree = 100 in R, n_estimators = 100 in Python)

39
Q

What does the ‘type’ input set to ‘classification’ specify in randomForest()?

A

It specifies that the model performs classification rather than regression.

40
Q

How can you view the classifications made by the random forests algorithm?

A

Look at the predicted values saved in the model object, rf01$predicted.

41
Q

What does rf01$predicted return?

A

A classification for each record in the data set.

42
Q

What is a decision tree?

A

A model that uses a tree-like graph of decisions and their possible consequences.

43
Q

What is the difference between a decision node and a leaf node?

A

A decision node splits into further nodes, while a leaf node represents an outcome.

44
Q

Where is the most powerful of all possible splits made in a decision tree?

A

At the root node.

45
Q

When do decision trees stop growing?

A

When they reach a predefined stopping criterion.

46
Q

How do decision trees work?

A

By splitting data into subsets based on feature values.

47
Q

Would CART be a good algorithm to use if we are interested in a trinary categorical predictor?

48
Q

Which criterion is used by CART to assess which split is optimal?

A

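The Gini Index (Gini impurity). Some implementations, such as sklearn's DecisionTreeClassifier, also accept entropy as an alternative criterion.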

49
Q

Which concept does the C5.0 algorithm use to select the optimal split?

A

Information gain.

50
Q

What are random forests?

A

An ensemble learning method that constructs multiple decision trees.

51
Q

How do random forests work?

A

By aggregating the predictions from multiple decision trees.

52
Q

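Are all the predictor variables candidates to be the ‘best’ split for each node in a tree built by random forests?

A

No. At each node, only a random subset of the predictors is considered as candidates for the split.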

53
Q

Are the data sets used to build each tree in random forests the same?

A

No. Each tree is built on a different random sample drawn, with replacement, from the original training data set.

54
Q

How does the random forests algorithm give the training data set its final classification?

A

By majority voting among the predictions of all trees.

55
Q

Fill in the blank: The package for CART models in R is called _______.

A

rpart

56
Q

Fill in the blank: The C5.0 decision trees and rule-based models package in R is called _______.

A

C50

57
Q

True or False: The rpart.plot package is used to visualize CART models.

A

True

58
Q

True or False: The randomForest package is primarily used for regression tasks only.

A

False

The example in this deck uses randomForest() for classification.