ML - Preprocessing Flashcards
Name 5 ways we can do Feature Importance Selection:
- Drop Column Importance
- Permutation Importance
- Correlation Matrix / Spearman’s Correlation (Heatmap)
- Comparison to Random Noise Column
- Using a Random Forest (Gini Drop)
What is Drop Column Importance?
We remove one feature and look at the change in the model’s performance. To do this we first get a score from our metric of choice (we can use Cross-Validation for this), and let this be our baseline. We then drop one column at a time, retrain the model, and compare each retrained model’s score (from the same metric) to the baseline score.
Feature’s importance = baseline - score (with that column dropped)
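A minimal sketch in Python, assuming a pandas DataFrame of features, a target vector and any scikit-learn estimator (the random forest, dataset and column names here are just illustrative):

```python
# Drop-column importance: baseline CV score vs. score with each column removed.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

model = RandomForestRegressor(n_estimators=50, random_state=0)
baseline = cross_val_score(model, X, y, cv=5).mean()  # baseline score

importances = {}
for col in X.columns:
    score = cross_val_score(model, X.drop(columns=[col]), y, cv=5).mean()
    importances[col] = baseline - score  # bigger drop => more important feature

print(importances)
```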
What are the pros / cons of using drop-column importance?
Pros:
- relatively simple + intuitive
- model agnostic (can be applied to any model)
- if a feature is dropped, all interactions involving it are dropped with it
Cons:
- You have to train the model multiple times which may be expensive
- If there are collinear features, their importance gets amplified when we drop other features (since there are fewer features overall to carry the same information)
What is Permutation Importance?
Permutation feature importance is defined as the decrease in a model’s score when a single feature’s values are randomly shuffled. Shuffling a feature column breaks its relationship with the target (and with the other values in each row), so the resulting drop in performance measures how much the model relies on that feature.
What are the pros / cons of using permutation importance?
Pros:
- Model Agnostic (can be used on any model)
- The model doesn’t need to be retrained each time (unlike in drop column importance)
Cons:
- If two features are correlated, permuting one still leaves the model access to the same information through the other, so the un-permuted feature keeps an inflated importance
- we have to average across trials since permuting is random
List the 4 steps in the pseudo-algorithm for calculating permutation importance:
- Calculate a baseline score using the metric, trained model, the feature matrix and the target vector
- For each feature in the feature matrix, make a copy of the feature matrix.
- Shuffle that feature’s column in the copy, pass the copy through the trained model to get predictions, and use the metric to calculate the score.
Importance = Baseline – Score
- Repeat N times for statistical stability and take the average importance across trials
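A sketch of the pseudo-algorithm above, assuming a fitted scikit-learn model and a held-out validation set (the model, dataset and N = 10 repeats are illustrative):

```python
# Permutation importance: shuffle one column at a time and measure the score drop.
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(4)])
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
baseline = model.score(X_val, y_val)            # step 1: baseline score (R^2)

rng = np.random.default_rng(0)
importances = {}
for col in X_val.columns:                       # step 2: loop over features
    drops = []
    for _ in range(10):                         # step 4: repeat N times
        X_perm = X_val.copy()
        X_perm[col] = rng.permutation(X_perm[col].values)    # step 3: shuffle the column
        drops.append(baseline - model.score(X_perm, y_val))  # importance = baseline - score
    importances[col] = np.mean(drops)

print(importances)
```

scikit-learn also provides this directly as sklearn.inspection.permutation_importance.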
What is Label Encoding?
*Label Encoding* converts categorical variables to simple numbers through a straightforward 1:1 substitution. One of the cons is that the numbers impose an implicit order on the categories when we might not mean to.
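A minimal sketch with pandas; the column name and values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
df["colour_encoded"], categories = pd.factorize(df["colour"])  # 1:1 substitution
print(df)           # red -> 0, green -> 1, blue -> 2
print(categories)   # the mapping back to the original labels
```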
What is One-Hot Encoding?
We use this when we want to convert a categorical column into binary (1 or 0) columns.
It splits that column into X columns, where X is the number of distinct categories.
Thus, you have to be careful if you have a lot of categories so that you don’t suddenly greatly increase the width/dimension of your data.
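A minimal sketch with pandas.get_dummies (the column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
one_hot = pd.get_dummies(df, columns=["colour"])
print(one_hot)   # one binary column per category: colour_blue, colour_green, colour_red
```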
What is Binary Encoding?
Binary Encoding is when you encode a categorical variable as presence or absence (1 or 0). It differs from OHE in that you usually use it when the column only has one or two categories in total.
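A sketch matching this card’s definition, mapping a two-category column to 1 / 0 (the column name and categories are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"smoker": ["yes", "no", "no", "yes"]})
df["smoker_flag"] = (df["smoker"] == "yes").astype(int)  # presence = 1, absence = 0
print(df)
```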
What is Target Encoding? How does this differ from Label Encoding?
Target encoding is when you replace each category with a statistic of the target for that category, e.g. the min / max / a quantile / the average / the count.
This differs from Label Encoding because each new number still carries some relationship to the underlying values it replaced, rather than being an arbitrary label.
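A sketch of mean target encoding with pandas (column names are illustrative); in practice the statistics should be computed on the training data only, to avoid target leakage:

```python
import pandas as pd

df = pd.DataFrame({"city": ["a", "a", "b", "b", "c"],
                   "price": [10, 14, 30, 34, 50]})   # "price" is the target
means = df.groupby("city")["price"].mean()           # average target per category
df["city_encoded"] = df["city"].map(means)
print(df)   # a -> 12, b -> 32, c -> 50
```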
What is Rank Encoding?
Rank Encoding represents the numerical data by its respective rank or order.
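A minimal sketch with pandas.Series.rank (the column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"income": [42000, 100000, 55000, 55000]})
df["income_rank"] = df["income"].rank(method="dense")  # ties share a rank
print(df)   # 42000 -> 1, 55000 -> 2, 100000 -> 3
```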
What is Frequency Encoding?
Frequency Encoding represents categorical data by the frequency in which that category appears.
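A minimal sketch with pandas (the column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["a", "a", "b", "c", "a"]})
counts = df["city"].value_counts(normalize=True)   # relative frequency of each category
df["city_freq"] = df["city"].map(counts)
print(df)   # a -> 0.6, b -> 0.2, c -> 0.2
```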
How do we use Hashing to encode variables?
Hashing transforms a string of characters into a shorter, fixed-length number.
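A minimal sketch using Python’s hashlib so the hash is stable across runs; the bucket count of 8 is an arbitrary choice for illustration:

```python
import hashlib

def hash_bucket(value: str, n_buckets: int = 8) -> int:
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets   # fixed-length bucket index

for cat in ["red", "green", "blue"]:
    print(cat, hash_bucket(cat))
```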
What is Embedding? Name 3 examples where we do this.
Embedding is when we map some high-dimensional space to a lower-dimensional one.
Examples - Word2Vec, NN layers, PCA
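A minimal sketch using PCA (one of the examples above) to embed 10-dimensional data into 2 dimensions; the data and dimensions are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 10))   # high-dimensional data
embedding = PCA(n_components=2).fit_transform(X)      # lower-dimensional representation
print(embedding.shape)   # (100, 2)
```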
What is Normalization?
Normalization is when you scale a variable to be between 0 and 1.
You do this by subtracting the min, then dividing by the difference between the max and min.
It is used when you have very high / low values or features on very different scales. Shrinking everything to a common range makes gradient descent easier to traverse, since the updates follow the partial derivatives of the Loss with respect to the Weights, and wildly different feature scales distort those gradients.
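A minimal sketch, both by hand and with scikit-learn’s MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[10.0], [15.0], [20.0], [30.0]])
x_manual = (x - x.min()) / (x.max() - x.min())   # subtract min, divide by (max - min)
x_scaled = MinMaxScaler().fit_transform(x)       # same result, column-wise
print(np.allclose(x_manual, x_scaled))           # True
```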