Ch9 Tabular Modeling Deep Dive End-of-Chapter Questions Flashcards
Provide 3 words that are used for the possible values of a categorical variable
levels, categories, classes
What is a “dense” layer?
Another name for a linear (fully connected) layer
How do entity embeddings reduce memory usage and speed up neural networks?
An embedding indexes into an embedding matrix directly but has the derivative calculated in a way that is equivalent to doing a matrix multiplication with a one-hot-encoded vector. This reduces memory usage by bypassing the need to store a one-hot-encoded vector and speeds up the model because there’s no need to search through the vector for the occurrence of the number “1”.
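A minimal PyTorch sketch of this equivalence (the vocabulary size of 5 and embedding width of 3 are arbitrary):

```python
import torch
import torch.nn.functional as F

emb = torch.nn.Embedding(5, 3)        # 5 levels, 3-dimensional embeddings
idx = torch.tensor([2])               # a categorical value, encoded as an index

by_lookup = emb(idx)                  # direct indexing: no one-hot vector stored
one_hot = F.one_hot(idx, 5).float()   # the explicit one-hot alternative
by_matmul = one_hot @ emb.weight      # equivalent matrix multiplication

assert torch.allclose(by_lookup, by_matmul)
```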
What kinds of datasets are entity embeddings especially useful for?
Datasets containing categorical variables with a large number of levels
What are the two main families of machine learning algorithms?
- Ensembles of decision trees (i.e., random forests and gradient boosting machines), mainly for structured data
- Multilayered neural networks learned with SGD (i.e., shallow and/or deep learning), mainly for unstructured data (such as audio, images, and natural language)
Summarize what a decision tree algorithm does
A decision tree is made up of a sequence of binary splits. At each split, the data is separated into two groups such that, within each group, the values of the target variable are as similar as possible. To find a split, the tree considers each column, finds the optimal split point for that column, and then chooses the column whose split is best overall. The process repeats recursively within each group created by the previous split, until a stopping criterion (such as a minimum number of samples per leaf) is reached.
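A minimal scikit-learn sketch of this process (the data here is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy regression data: y depends mostly on the first column
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)

# at each node the tree tries every column, finds that column's best
# binary split, keeps the best column overall, and recurses on both groups
tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, y)
print(tree.predict(X[:5]))
```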
Why is a date different from a regular categorical or continuous variable, and how can you preprocess it to allow it to be used in a model?
Dates have meaning that may affect the target variable beyond the order in time, such as whether the data represents a weekday/weekend, a holiday, etc.
fastai's add_datepart function replaces a date column with a set of date metadata columns (see the usage sketch after this list), including:
- saleYear
- saleMonth
- saleWeek
- saleDay
- saleDayofweek
- saleDayofyear
- saleIs_month_end
- saleIs_month_start
- saleIs_quarter_end
- saleIs_quarter_start
- saleIs_year_end
- saleIs_year_start
- saleElapsed
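A usage sketch, assuming fastai is installed and the DataFrame has a saledate column as in the Bulldozers data:

```python
import pandas as pd
from fastai.tabular.all import add_datepart

df = pd.DataFrame({'saledate': pd.to_datetime(['2011-11-16', '2012-01-03'])})
df = add_datepart(df, 'saledate')  # drops saledate, adds the columns above
print(df.columns)
```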
How are squared_error, samples, and value calculated for the decision tree shown below?
Data is from Kaggle's Blue Book for Bulldozers competition. The target variable is SalePrice
value: Represents the average target value (in this case log of SalePrice) for the samples within a node. The top node has the average for the whole dataset.
squared_error: The mean squared error for the samples in a node, i.e. the mean of the squared differences between each sample's label and the node's value (the tree's prediction at that point).
samples: The number of observations in each node. The top node has the size of the whole dataset.
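A small sketch of how these statistics follow from a node's labels (the values here are made up):

```python
import numpy as np

node_labels = np.array([9.2, 9.8, 10.1, 9.5])  # log(SalePrice) of rows in a node

samples = len(node_labels)                     # number of observations in the node
value = node_labels.mean()                     # the node's prediction
squared_error = ((node_labels - value) ** 2).mean()  # MSE against that prediction
print(samples, value, squared_error)
```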
How do we deal with outliers before building a decision tree?
We don't need to: decision trees are robust to outliers, because splits depend only on the ordering of the values in a column, not on their magnitude.
How do we handle categorical variables in a decision tree?
We can use label encoding. It's not necessary to one-hot encode categorical variables for decision tree models, but we do need to make them numeric, as in the sketch below.
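A pandas sketch of label encoding (the column name is hypothetical; fastai's Categorify transform does the same job in a TabularPandas pipeline):

```python
import pandas as pd

df = pd.DataFrame({'ProductSize': ['Large', 'Small', 'Medium', 'Small']})
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize_enc'] = df['ProductSize'].cat.codes  # integer code per level
print(df)
```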
What is bagging?
Bagging involves training many models, each on a different random subset of the data (sampled with replacement, i.e. a bootstrap sample), and averaging their predictions.
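A hand-rolled sketch of bagging with decision trees (bootstrap sampling, i.e. drawing rows with replacement, follows the classic recipe):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=200)

preds = []
for _ in range(10):
    idx = rng.integers(0, len(X), len(X))  # bootstrap sample, with replacement
    preds.append(DecisionTreeRegressor().fit(X[idx], y[idx]).predict(X))

bagged = np.mean(preds, axis=0)            # average the ensemble's predictions
```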
What is the difference between max_samples and max_features when creating a random forest?
max_samples specifies how many rows to sample for training each tree; max_features defines how many columns to consider at each split point.
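For example, with scikit-learn's RandomForestRegressor (the parameter values here are just for illustration):

```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=40,
    max_samples=200_000,  # rows drawn to train each individual tree
    max_features=0.5,     # fraction of columns considered at each split
    min_samples_leaf=5,
)
```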
If you increase n_estimators in a random forest model to a very high value, can that lead to overfitting? Why or why not?
No. Each tree in a random forest is trained independently of the others, so adding more trees only averages more independent estimates; this reduces variance, and the error plateaus rather than increasing.
What is out-of-bag error? (context: random forests)
Out-of-bag (OOB) error measures the error on each row of the training data using only the trees that did not include that row in their training (bootstrap) sample.
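With scikit-learn, passing oob_score=True computes this automatically (synthetic data below; for regressors the reported score is R²):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=500)

m = RandomForestRegressor(n_estimators=40, oob_score=True).fit(X, y)
print(m.oob_score_)           # R^2 computed from out-of-bag predictions
print(m.oob_prediction_[:5])  # per-row predictions from trees that skipped that row
```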
Why might a model’s validation set error be worse than the OOB error? How could you test your hypotheses?
The validation set error might be worse than the OOB error if the validation set contains out-of-domain data.
A way to check for out-of-domain data is to train a random forest to predict whether a row comes from the validation or training set, then inspect feature importance: variables with very high importance are the ones that differ significantly between the training and validation sets (see the sketch below).
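A sketch of this check with synthetic data (column names are made up; in practice df_train and df_valid would be your real feature tables):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
cols = ['YearMade', 'MachineHours', 'saleElapsed']
df_train = pd.DataFrame(rng.normal(size=(100, 3)), columns=cols)
df_valid = df_train.copy()
df_valid['saleElapsed'] += 5  # deliberately shift one column out of domain

X = pd.concat([df_train, df_valid], ignore_index=True)
is_valid = np.array([0] * len(df_train) + [1] * len(df_valid))

m = RandomForestClassifier(n_estimators=40).fit(X, is_valid)
for col, imp in sorted(zip(cols, m.feature_importances_), key=lambda t: -t[1]):
    print(col, round(imp, 3))  # saleElapsed should dominate
```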