Ch9 Tabular Modeling Deep Dive End-of-Chapter Questions Flashcards
Provide 3 words that are used for the possible values of a categorical variable
levels, categories, classes
What is a “dense” layer?
Another name for a linear (fully connected) layer
How do entity embeddings reduce memory usage and speed up neural networks?
An embedding indexes into an embedding matrix directly but has the derivative calculated in a way that is equivalent to doing a matrix multiplication with a one-hot-encoded vector. This reduces memory usage by bypassing the need to store a one-hot-encoded vector and speeds up the model because there’s no need to search through the vector for the occurrence of the number “1”.
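A minimal PyTorch sketch of this equivalence (the vocabulary size of 5 and embedding width of 3 are arbitrary):

```python
import torch
import torch.nn.functional as F

emb = torch.nn.Embedding(5, 3)        # 5 levels, 3-dimensional embeddings
idx = torch.tensor([2])               # a categorical value, encoded as an index

by_lookup = emb(idx)                  # direct indexing: no one-hot vector stored
one_hot = F.one_hot(idx, 5).float()   # the explicit one-hot alternative
by_matmul = one_hot @ emb.weight      # equivalent matrix multiplication

assert torch.allclose(by_lookup, by_matmul)
```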
What kinds of datasets are entity embeddings especially useful for?
Datasets containing categorical variables with a large number of levels
What are the two main families of machine learning algorithms?
- Ensembles of decision trees (i.e., random forests and gradient boosting machines), mainly for structured data
- Multilayered neural networks learned with SGD (i.e., shallow and/or deep learning), mainly for unstructured data (such as audio, images, and natural language)
Summarize what a decision tree algorithm does
A decision tree is made up of a sequence of binary splits. At each split, the data is separated into two groups such that, within each group, the values of the target variable are as similar as possible. To find a split, the tree considers each column, finds the optimal split point for that column, and then chooses the column whose split is best overall. The process repeats recursively within each group created by the previous split, until a stopping criterion (such as a minimum number of samples per leaf) is reached.
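A minimal scikit-learn sketch of this process (the data here is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy regression data: y depends mostly on the first column
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)

# at each node the tree tries every column, finds that column's best
# binary split, keeps the best column overall, and recurses on both groups
tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, y)
print(tree.predict(X[:5]))
```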
Why is a date different from a regular categorical or continuous variable, and how can you preprocess it to allow it to be used in a model?
Dates have meaning that may affect the target variable beyond the order in time, such as whether the data represents a weekday/weekend, a holiday, etc.
fastai's add_datepart function replaces a date column with a set of date metadata columns (see the usage sketch after this list), including:
- saleYear
- saleMonth
- saleWeek
- saleDay
- saleDayofweek
- saleDayofyear
- saleIs_month_end
- saleIs_month_start
- saleIs_quarter_end
- saleIs_quarter_start
- saleIs_year_end
- saleIs_year_start
- saleElapsed
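A usage sketch, assuming fastai is installed and the DataFrame has a saledate column as in the Bulldozers data:

```python
import pandas as pd
from fastai.tabular.all import add_datepart

df = pd.DataFrame({'saledate': pd.to_datetime(['2011-11-16', '2012-01-03'])})
df = add_datepart(df, 'saledate')  # drops saledate, adds the columns above
print(df.columns)
```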
How are squared_error, samples, and value calculated for the decision tree shown below?
Data is from Kaggle's Blue Book for Bulldozers competition. The target variable is SalePrice
value: Represents the average target value (in this case log of SalePrice) for the samples within a node. The top node has the average for the whole dataset.
squared_error: The mean squared error for the samples in a node, i.e. the mean of the squared differences between each sample's label and the node's value (the tree's prediction at that point).
samples: The number of observations in each node. The top node has the size of the whole dataset.
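A small sketch of how these statistics follow from a node's labels (the values here are made up):

```python
import numpy as np

node_labels = np.array([9.2, 9.8, 10.1, 9.5])  # log(SalePrice) of rows in a node

samples = len(node_labels)                     # number of observations in the node
value = node_labels.mean()                     # the node's prediction
squared_error = ((node_labels - value) ** 2).mean()  # MSE against that prediction
print(samples, value, squared_error)
```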
How do we deal with outliers before building a decision tree?
We don't need to: decision trees are robust to outliers, because splits depend only on the ordering of the values in a column, not on their magnitude.
How do we handle categorical variables in a decision tree?
We can use label encoding. It's not necessary to one-hot encode categorical variables for decision tree models, but we do need to make them numeric, as in the sketch below.
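A pandas sketch of label encoding (the column name is hypothetical; fastai's Categorify transform does the same job in a TabularPandas pipeline):

```python
import pandas as pd

df = pd.DataFrame({'ProductSize': ['Large', 'Small', 'Medium', 'Small']})
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize_enc'] = df['ProductSize'].cat.codes  # integer code per level
print(df)
```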
What is bagging?
Bagging involves training many models, each on a different random subset of the data (sampled with replacement, i.e. a bootstrap sample), and averaging their predictions.
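A hand-rolled sketch of bagging with decision trees (bootstrap sampling, i.e. drawing rows with replacement, follows the classic recipe):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=200)

preds = []
for _ in range(10):
    idx = rng.integers(0, len(X), len(X))  # bootstrap sample, with replacement
    preds.append(DecisionTreeRegressor().fit(X[idx], y[idx]).predict(X))

bagged = np.mean(preds, axis=0)            # average the ensemble's predictions
```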
What is the difference between max_samples and max_features when creating a random forest?
max_samples specifies how many rows to sample for training each tree; max_features defines how many columns to consider at each split point.
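For example, with scikit-learn's RandomForestRegressor (the parameter values here are just for illustration):

```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=40,
    max_samples=200_000,  # rows drawn to train each individual tree
    max_features=0.5,     # fraction of columns considered at each split
    min_samples_leaf=5,
)
```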
If you increase n_estimators in a random forest model to a very high value, can that lead to overfitting? Why or why not?
No. Each tree in a random forest is trained independently of the others, so adding more trees only averages more independent estimates; this reduces variance, and the error plateaus rather than increasing.
What is out-of-bag error? (context: random forests)
Out-of-bag (OOB) error measures the error on each row of the training data using only the trees that did not include that row in their training (bootstrap) sample.
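With scikit-learn, passing oob_score=True computes this automatically (synthetic data below; for regressors the reported score is R²):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=500)

m = RandomForestRegressor(n_estimators=40, oob_score=True).fit(X, y)
print(m.oob_score_)           # R^2 computed from out-of-bag predictions
print(m.oob_prediction_[:5])  # per-row predictions from trees that skipped that row
```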
Why might a model’s validation set error be worse than the OOB error? How could you test your hypotheses?
The validation set error might be worse than the OOB error if the validation set contains out-of-domain data.
A way to check for out-of-domain data is to train a random forest to predict whether a row comes from the validation or training set, then inspect feature importance: variables with very high importance are the ones that differ significantly between the training and validation sets (see the sketch below).
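A sketch of this check with synthetic data (column names are made up; in practice df_train and df_valid would be your real feature tables):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
cols = ['YearMade', 'MachineHours', 'saleElapsed']
df_train = pd.DataFrame(rng.normal(size=(100, 3)), columns=cols)
df_valid = df_train.copy()
df_valid['saleElapsed'] += 5  # deliberately shift one column out of domain

X = pd.concat([df_train, df_valid], ignore_index=True)
is_valid = np.array([0] * len(df_train) + [1] * len(df_valid))

m = RandomForestClassifier(n_estimators=40).fit(X, is_valid)
for col, imp in sorted(zip(cols, m.feature_importances_), key=lambda t: -t[1]):
    print(col, round(imp, 3))  # saleElapsed should dominate
```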