Ch9 Tabular Modeling Deep Dive End-of-Chapter Questions Flashcards

1
Q

Provide 3 words that are used for the possible values of a categorical variable

A

levels, categories, classes

2
Q

What is a “dense” layer?

A

= linear layer

3
Q

How do entity embeddings reduce memory usage and speed up neural networks?

A

An embedding indexes into an embedding matrix directly but has the derivative calculated in a way that is equivalent to doing a matrix multiplication with a one-hot-encoded vector. This reduces memory usage by bypassing the need to store a one-hot-encoded vector and speeds up the model because there’s no need to search through the vector for the occurrence of the number “1”.
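A minimal PyTorch sketch of that equivalence (the layer sizes are made up):

```python
import torch

# Hypothetical sizes: 5 category levels, embedding dimension 3
emb = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)

idx = torch.tensor([2])                                        # the category's integer code
one_hot = torch.nn.functional.one_hot(idx, num_classes=5).float()

lookup = emb(idx)              # direct index into the embedding matrix
matmul = one_hot @ emb.weight  # equivalent matrix multiply with a one-hot vector

print(torch.allclose(lookup, matmul))  # True: same result, but the lookup never builds the one-hot vector
```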

4
Q

What kinds of datasets are entity embeddings especially useful for?

A

Datasets containing categorical variables with a large number of levels

5
Q

What are the two main families of machine learning algorithms?

A
  1. Ensembles of decision trees (i.e., random forests and gradient boosting machines), mainly for structured data
  2. Multilayered neural networks learned with SGD (i.e., shallow and/or deep learning), mainly for unstructured data (such as audio, images, and natural language)
6
Q

Summarize what a decision tree algorithm does

A

A decision tree is made up of a number of binary splits. At each split, the data is separated into two groups in a way that maximizes the number of observations having the same value for the target variable within each group, i.e. group members are as similar as possible in terms of the target variable. At each step, the decision tree will look for the optimal split for each column and then choose the column with the best split overall. The process repeats for each group created by the previous split.
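A small scikit-learn sketch of this on made-up data (the feature names and targets are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data: the tree repeatedly picks the binary split that makes each
# group's target values as similar as possible (lowest squared error).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = (X[:, 0] > 5).astype(float) * 3 + X[:, 1] * 0.1

tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, y)
print(export_text(tree, feature_names=["feat0", "feat1"]))  # shows the chosen splits
```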

7
Q

Why is a date different from a regular categorical or continuous variable, and how can you preprocess it to allow it to be used in a model?

A

Dates have meaning that may affect the target variable beyond the order in time, such as whether the data represents a weekday/weekend, a holiday, etc.

fastai’s add_datepart function replaces a date column with a set of date metadata columns, including
- saleYear
- saleMonth
- saleWeek
- saleDay
- saleDayofweek
- saleDayofyear
- saleIs_month_end
- saleIs_month_start
- saleIs_quarter_end
- saleIs_quarter_start
- saleIs_year_end
- saleIs_year_start
- saleElapsed
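A minimal usage sketch, assuming a DataFrame with a saledate column as in the Bulldozers data (the two example rows are made up):

```python
import pandas as pd
from fastai.tabular.all import add_datepart

# Hypothetical DataFrame with a saledate column
df = pd.DataFrame({"saledate": pd.to_datetime(["2011-11-16", "2012-01-05"]),
                   "SalePrice": [66000, 57000]})

df = add_datepart(df, "saledate")  # replaces saledate with the metadata columns above
print(df.columns)
```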

8
Q

How are squared_error, samples, and value calculated for the decision tree shown below?

Data is from Kaggle’s Bulldozer Blue Book competition. The target variable is SalePrice

A

value: Represents the average target value (in this case log of SalePrice) for the samples within a node. The top node has the average for the whole dataset.

squared_error: The mean squared error for the samples in a node. Calculated by using the labels and the average target value (i.e. the tree’s prediction at that point).

samples: The number of observations in each node. The top node has the size of the whole dataset.
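A tiny worked example of those three quantities for one hypothetical node:

```python
import numpy as np

# Targets (log of SalePrice) for the rows that fall into one node
node_y = np.array([10.1, 10.4, 9.8, 10.3])

samples = len(node_y)                            # samples: number of rows in the node
value = node_y.mean()                            # value: the node's prediction (mean target)
squared_error = ((node_y - value) ** 2).mean()   # squared_error: MSE against that prediction
print(samples, value, squared_error)
```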

9
Q

How do we deal with outliers before building a decision tree?

A

We don’t need to. Decision tree models are robust against outliers.

10
Q

How do we handle categorical variables in a decision tree?

A

We can use label encoding. It's not necessary to one-hot-encode categorical variables for decision tree models, but we do need to make them numeric.
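A minimal pandas sketch of label encoding (ProductSize and its levels are used as an example; the rows are made up):

```python
import pandas as pd

df = pd.DataFrame({"ProductSize": ["Small", "Large", "Medium", "Large"]})

# Label encoding: store each level as an integer code; a decision tree can
# split on these numbers directly, no one-hot encoding needed.
df["ProductSize"] = pd.Categorical(df["ProductSize"],
                                   categories=["Small", "Medium", "Large"],
                                   ordered=True)
df["ProductSize_code"] = df["ProductSize"].cat.codes
print(df)
```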

11
Q

What is bagging?

A

Bagging involves taking the average prediction of many models (each trained on a different data subset)
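A sketch of bagging by hand with scikit-learn decision trees, on made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = X.sum(axis=1) + rng.normal(0, 0.5, size=500)

# Bagging by hand: train each tree on a different random subset of rows,
# then average the trees' predictions.
trees = []
for _ in range(20):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

bagged_pred = np.stack([t.predict(X) for t in trees]).mean(axis=0)
```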

12
Q

What is the difference between max_samples and max_features when creating a random forest?

A

max_samples specifies how many rows to sample for training each tree.
max_features defines how many columns to sample at each split point
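A scikit-learn sketch; the specific parameter values are only examples, and X_train/y_train are assumed to exist:

```python
from sklearn.ensemble import RandomForestRegressor

# max_samples: rows drawn to train each tree
# max_features: columns considered at each split (0.5 = half the columns)
model = RandomForestRegressor(n_estimators=40,
                              max_samples=200_000,
                              max_features=0.5,
                              min_samples_leaf=5,
                              n_jobs=-1)
# model.fit(X_train, y_train)
```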

13
Q

If you increase n_estimators in a random forest model to a very high value, can that lead to overfitting? Why or why not?

A

No, because each tree in a random forest is independent of the others

14
Q

What is out-of-bag error? (context: decision trees)

A

Out-of-bag error calculates error on each row of the training data using the trees that did not include that row for training.
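A sketch of getting OOB predictions from scikit-learn (X_train and y_train are assumed to exist):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1)
model.fit(X_train, y_train)

# For each training row, oob_prediction_ averages only the trees that did
# not see that row during training.
oob_rmse = np.sqrt(((model.oob_prediction_ - y_train) ** 2).mean())
```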

15
Q

Why might a model’s validation set error be worse than the OOB error? How could you test your hypotheses?

A

The validation set error might be worse than the OOB error if the validation set contains out-of-domain data.

A way to check for out-of-domain data is to train a random forest model to predict whether a row comes from the validation or the training set. Then check feature importance: variables with very high importance are the ones that differ significantly between the training and validation sets.
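A sketch of that check, assuming df_train and df_valid hold the (numeric) feature columns of the two sets:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Label each row by which set it came from, then try to predict that label.
df_domain = pd.concat([df_train, df_valid])
is_valid = np.array([0] * len(df_train) + [1] * len(df_valid))

clf = RandomForestClassifier(n_estimators=40, n_jobs=-1)
clf.fit(df_domain, is_valid)

# Columns with high importance are the ones that differ between the sets.
importances = pd.Series(clf.feature_importances_, index=df_domain.columns)
print(importances.sort_values(ascending=False).head())
```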

16
Q

How can you use a random forest model to answer the following question?

How confident are we in our predictions using a particular row of data?

A

Calculate the standard deviation of predictions across trees, then check how the standard deviation for that row compares to others. Higher standard deviations mean the trees are giving very different results for a row, while lower standard deviations mean that the trees are giving consistent answers and we can be more confident in the prediction.
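A sketch, assuming model is a fitted RandomForestRegressor and X_valid is the validation feature matrix:

```python
import numpy as np

# One row of predictions per tree, shape (n_trees, n_rows)
preds = np.stack([t.predict(X_valid) for t in model.estimators_])
preds_std = preds.std(axis=0)  # per-row spread: higher = less confident
```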

17
Q

How can you use a random forest model to answer the following question?

For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?

A

Can use the treeinterpreter library to calculate contributions from each variable.
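A sketch of the treeinterpreter call, assuming model is a fitted forest and row is a one-row DataFrame:

```python
from treeinterpreter import treeinterpreter

prediction, bias, contributions = treeinterpreter.predict(model, row.values)
# bias: the dataset-wide average prediction; contributions: how much each
# column pushed this row's prediction up or down relative to that average.
```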

18
Q

How can you use a random forest model to answer the following question?

Which columns are the strongest predictors?

A

Use feature importance
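A sketch, assuming model is a fitted forest and df holds the training features:

```python
import pandas as pd

# Rank columns by the forest's importance scores
fi = pd.DataFrame({"cols": df.columns, "imp": model.feature_importances_})
print(fi.sort_values("imp", ascending=False).head(10))
```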

19
Q

How can you use a random forest model to answer the following question?

How do predictions vary as we vary these columns?

A

Use partial dependence plots
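A sketch using scikit-learn's partial dependence display (assumes a recent scikit-learn; model, X_valid, and the two column names are assumptions):

```python
from sklearn.inspection import PartialDependenceDisplay

# Shows how predictions change as YearMade and ProductSize vary,
# with the other columns left at their observed values.
PartialDependenceDisplay.from_estimator(model, X_valid,
                                        features=["YearMade", "ProductSize"])
```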

20
Q

What’s the purpose of removing unimportant variables?

A

It helps us focus on the important variables to study in depth. Also, in practice, a simpler, more interpretable model is easier to roll out and maintain.

21
Q

What is a good type of plot for showing tree interpreter results?

A

A waterfall chart

22
Q

What is the extrapolation problem for random forests?

A

A random forest can never predict values outside the range of its training data. That's because a random forest doesn't use an equation to make predictions; it just averages the predictions of its trees, and every tree's leaf values come from the training targets.
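A small demonstration of this on made-up data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on x in [0, 50] with y = 2x, then predict beyond that range.
x = np.linspace(0, 50, 200).reshape(-1, 1)
model = RandomForestRegressor(n_estimators=40).fit(x, 2 * x.ravel())

print(model.predict([[45.0], [100.0], [1000.0]]))
# The last two predictions stay near 100 (the training maximum): an average
# of tree predictions can never exceed the targets seen during training.
```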