ML - Preprocessing Flashcards

1
Q

Name 5 ways we can do Feature Importance Selection:

A

  1. Drop Column Importance
  2. Permutation Importance
  3. Correlation Matrix / Spearman’s Correlation (Heatmap)
  4. Comparison to Random Noise Column
  5. Using a Random Forest (Gini Drop)
2
Q

What is Drop Column Importance?

A

We remove one feature at a time and look at the change in the model's performance. First we get a baseline score from our metric of choice (we can use cross-validation for this). We then drop one column at a time, retrain the model, and compare each retrained model's score to the baseline.

Feature importance = baseline − score without the column
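
A minimal sketch of the idea, assuming a pandas DataFrame X, a target y, and any scikit-learn style estimator (the function name and defaults here are illustrative, not from the card):

```python
from sklearn.model_selection import cross_val_score

def drop_column_importance(model, X, y, cv=5):
    # Baseline: cross-validated score with all features present.
    baseline = cross_val_score(model, X, y, cv=cv).mean()
    importances = {}
    for col in X.columns:
        # Retrain (via CV) with this one column removed and compare to the baseline.
        score = cross_val_score(model, X.drop(columns=[col]), y, cv=cv).mean()
        importances[col] = baseline - score  # large positive value => important feature
    return importances
```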

3
Q

What are the pros / cons of using drop-column importance?

A

Pros:

  • relatively simple + intuitive
  • model agnostic (can be applied to any model)
  • when a feature is dropped, all of its interactions with other features are removed as well, so its importance reflects those interactions too

Cons:

  • You have to retrain the model once per feature, which can be expensive
  • If there are collinear features, dropping one of them is compensated for by the others, which distorts the importances (the dropped feature can look unimportant while the remaining ones get amplified)
4
Q

What is Permutation Importance?

A

Permutation feature importance is defined as the decrease in a model's score when a single feature's values are randomly shuffled. In other words, it measures the difference in performance when you permute (shuffle) a feature column. The shuffling *breaks* the relationship between that feature and the target (and the rest of the values in each row), so the drop in score reflects how much the model depends on that feature.

5
Q

What are the pros / cons of using permutation importance?

A

Pros:

  • Model Agnostic (can be used on any model)
  • The model doesn’t need to be retrained each time (unlike in drop column importance)

Cons:

  • If two features are correlated, permuting one of them leaves its information available through the other, so the measured importances are distorted
  • We have to average across trials, since the shuffling is random
6
Q

List the 4 steps in the pseudo-algorithm for calculating permutation importance:

A
  1. Calculate a baseline score using the metric, the trained model, the feature matrix, and the target vector.
  2. For each feature in the feature matrix, make a copy of the feature matrix.
  3. Shuffle that feature's column in the copy, pass it through the trained model to get predictions, and use the metric to score them. Importance = baseline − score.
  4. Repeat N times for statistical stability and take the average importance across trials.
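
A minimal sketch of these four steps, assuming an already-fitted scikit-learn style model, a pandas DataFrame X, and a higher-is-better metric such as r2_score (the function name and defaults are made up for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

def permutation_importance_sketch(model, X, y, metric=r2_score, n_repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))              # step 1: baseline score
    importances = {}
    for col in X.columns:                               # step 2: one feature at a time
        drops = []
        for _ in range(n_repeats):                      # step 4: repeat for stability
            X_perm = X.copy()
            X_perm[col] = rng.permutation(X_perm[col].to_numpy())  # step 3: shuffle
            drops.append(baseline - metric(y, model.predict(X_perm)))
        importances[col] = np.mean(drops)               # average importance across trials
    return importances
```

scikit-learn also ships a ready-made version of this idea (sklearn.inspection.permutation_importance).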
7
Q

What is Label Encoding?

A

*Label Encoding* converts categorical values to simple numbers through a 1:1 substitution. One of the cons is that the codes have an implicit order when we might not mean to imply one.
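
A quick illustration with pandas (the column name and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})
# 1:1 substitution of each category with an integer code.
codes, categories = pd.factorize(df["size"])
df["size_encoded"] = codes   # small -> 0, large -> 1, medium -> 2
# Note the implicit (and here meaningless) order: 0 < 1 < 2.
```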

8
Q

What is One-Hot Encoding?

A

We use this when we want to convert a categorical column into binary (1 or 0) indicator variables.

It splits that column into X columns, where X is the number of different categories you have.

Thus, you have to be careful if you have a lot of categories, so that you don't suddenly blow up the width/dimensionality of your data.
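
A quick pandas illustration (the column and values are made up); one column with 3 categories becomes 3 indicator columns:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "red"]})
one_hot = pd.get_dummies(df["colour"], prefix="colour", dtype=int)
print(one_hot)   # columns colour_blue, colour_green, colour_red filled with 0/1
```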

9
Q

What is Binary Encoding?

A

Binary Encoding is when you encode a categorical variable as presence or absence (1 or 0). It differs from one-hot encoding in that you usually use it when a column has only one or two categories in total, so a single 0/1 column is enough.
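
A minimal illustration in the presence/absence sense used on this card (the column name is made up):

```python
import pandas as pd

df = pd.DataFrame({"has_garage": ["yes", "no", "yes"]})
# A single two-category column becomes a single 0/1 column - no extra columns needed.
df["has_garage_encoded"] = (df["has_garage"] == "yes").astype(int)
```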

10
Q

What is Target Encoding? How does this differ from Label Encoding?

A

Target encoding is when you replace each category with a statistic of the target, e.g. the min / max / a quantile / the average / the count of the target values for that category.

This differs from Label Encoding in that each new number still has a meaningful relationship with the underlying data, rather than being an arbitrary substitute.
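
A minimal sketch of mean target encoding with pandas (the data and names are made up); in practice the per-category averages should be computed on the training split only, to avoid target leakage:

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["A", "A", "B", "B", "C"],
    "price": [100, 120, 300, 280, 90],   # the target
})
# Replace each category with the average target value observed for that category.
city_means = df.groupby("city")["price"].mean()   # A: 110, B: 290, C: 90
df["city_encoded"] = df["city"].map(city_means)
```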

11
Q

What is Rank Encoding?

A

Rank Encoding represents the numerical data by its respective rank or order.
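
A one-line pandas illustration (the values are made up):

```python
import pandas as pd

s = pd.Series([180, 150, 210, 150])   # e.g. heights
print(s.rank(method="min"))           # 3.0, 1.0, 4.0, 1.0 - each value replaced by its rank
```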

12
Q

What is Frequency Encoding?

A

Frequency Encoding represents categorical data by the frequency with which that category appears.
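
A quick pandas illustration (the values are made up):

```python
import pandas as pd

s = pd.Series(["cat", "dog", "cat", "bird", "cat"])
freq = s.value_counts()   # cat: 3, dog: 1, bird: 1
encoded = s.map(freq)     # [3, 1, 3, 1, 3]
```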

13
Q

How do we use Hashing to encode variables?

A

Hashing transforms a string of characters into a shorter, fixed-length number, which can then be used (often modulo a chosen number of buckets) as the encoded value.
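
A minimal sketch using the standard library (the bucket count is an arbitrary choice for illustration):

```python
import hashlib

def hash_encode(value, n_buckets=16):
    # Stable hash of the string, folded into a fixed number of buckets.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

print(hash_encode("user_12345"))   # the same string always lands in the same bucket
```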

14
Q

What is Embedding? Name 3 examples where we do this.

A

Embedding is when we map some high dimensional space to a lower dimensional one.

Examples - Word2Vec, NN layers, PCA

15
Q

What is Normalization?

A

Normalization is when you scale a variable to be between 0 and 1.

You do this by subtracting the min, then dividing by the difference between the max and min.

It is used when values span very different or extreme ranges. Decreasing the range makes gradient descent easier to traverse (since gradient descent follows the partial derivatives of the loss with respect to the weights, and badly scaled features distort those steps).
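
A tiny NumPy illustration of min-max scaling (the values are made up):

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])
x_norm = (x - x.min()) / (x.max() - x.min())   # subtract min, divide by (max - min)
print(x_norm)                                  # [0.  0.25  0.625  1. ]
```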

16
Q

What is Regularization? Why do you do it? Name two methods we use.

A

Regularization is used when you would like to reduce overfitting of a model to a data set.

You use regularization to reduce model complexity by adding a penalty (regularization) term to the loss.

Two methods we know of are:

  1. L1 = LASSO, which penalizes the sum of the absolute values of the weights (analogous to MAE, where the ‘A’ is for absolute)
  2. L2 = Ridge, which penalizes the sum of the squared weights (analogous to MSE)
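
A small sketch comparing the two with scikit-learn on synthetic data (the alpha values and dataset parameters are arbitrary choices for illustration); note how the L1 penalty drives some coefficients exactly to 0 while the L2 penalty only shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only a few of the 10 features are truly informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: sum of |weights|
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: sum of squared weights

print("L1 coefficients set to zero:", np.sum(lasso.coef_ == 0))  # typically several
print("L2 coefficients set to zero:", np.sum(ridge.coef_ == 0))  # typically none
```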
17
Q

What is L1 regularization?

Why do we use it?

When we plot the weights, what shape do we get?

What problems does it have?

A

L1 = LASSO, which penalizes the sum of the absolute values of the weights (the ‘A’ in MAE/LASSO is for absolute)

  • It is used to *reduce* the number of features, since some weights are driven exactly to 0
  • When you plot the penalty's constraint region in weight space you get a diamond shape
  • The penalty is non-differentiable at 0 (where those weights end up), which complicates optimization
18
Q

What is L2 regularization?

Why do we use it?

When we plot the weights, what shape do we get?

What problems does it have?

A

L2 = Ridge, which penalizes the sum of the squared weights (analogous to MSE)

  • Used when you want the penalty to hit large weights especially hard (because you square them)
  • When you plot the penalty's constraint region in weight space you get a circle shape
  • A drawback is that it does not produce sparse models: weights shrink toward 0 but never become exactly 0, so no features are dropped
19
Q

What is Standardization?
Why do we use it?
How do we do it?

A
  • Standardization is used to transform data to have mean = 0 and std = 1 (e.g. the z-score)
  • We use it when our data (or different features) come from different distributions.
  • We standardize by subtracting the mean and dividing by the standard deviation.
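
A tiny NumPy illustration of the z-score (the values are made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
z = (x - x.mean()) / x.std()   # subtract the mean, divide by the standard deviation
print(z.mean(), z.std())       # ~0.0 and 1.0
```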
20
Q

Name 4 different ways we can do dimensionality reduction.

A
  1. PCA
  2. PCoA
  3. SVD (Singular Value Decomposition)
  4. Factor Analysis
21
Q

What is PCA?

Why do we do it?

How does it work?

What are the cons?

A
  • PCA = Principal Component Analysis
  • We do this when we have A LOT of variables to consider and would like to reduce the dimensionality of our data
  • In essence, we transform the data so that the direction with the most variation lines up with the axis we call the first “principal component”, and we project the data onto that axis.
  • The con is that the resulting features are no longer interpretable
  • How does it work?
  1. Start from the matrix of independent variables (one column per variable), which captures how they relate to one another.
  2. For each column, subtract the mean of that column from each entry and divide by the standard deviation; call the standardized matrix Z.
  3. Multiply the transpose of Z by Z itself; the result, ZᵀZ, is (proportional to) the covariance matrix.
  4. Calculate the eigenvectors and eigenvalues of this matrix by decomposing ZᵀZ into PDP⁻¹, where P is a matrix of eigenvectors and D is the diagonal matrix with the eigenvalues along the diagonal and zeros everywhere else. Each eigenvalue λᵢ corresponds to the eigenvector in column i of P.
  5. Take the eigenvalues λ₁, λ₂, …, λp and sort them from largest to smallest, sorting the eigenvectors in P accordingly. (For example, if λ₂ is the largest eigenvalue, take the second column of P and place it in the first column position.) Depending on the computing package, this may be done automatically. Call this sorted matrix of eigenvectors P*. Note that these eigenvectors are independent of one another.
  6. Calculate Z* = ZP*. This new matrix, Z*, is a centered/standardized version of X, but now each observation is a combination of the original variables, where the weights are determined by the eigenvector. As a bonus, because the eigenvectors in P* are independent of one another, each column of Z* is also independent of the others! (Because the principal components are orthogonal to one another, they are statistically linearly independent of one another, which is why the columns of Z* are linearly independent.)
  7. Finally, we need to determine how many features to keep versus how many to drop. There are three common methods for this, discussed below:

Method 1: We arbitrarily select how many dimensions we want to keep. Perhaps I want to visually represent things in two dimensions, so I may only keep two features. This is use-case dependent and there isn’t a hard-and-fast rule for how many features I should pick.

Method 2: Calculate the proportion of variance explained (briefly explained below) for each feature, pick a threshold, and add features until you hit that threshold. (For example, if you want to explain 80% of the total variability possibly explained by your model, add features with the largest explained proportion of variance until your proportion of variance explained hits or exceeds 80%.) Because each eigenvalue is roughly the importance of its corresponding eigenvector, the proportion of variance explained is the sum of the eigenvalues of the features you kept divided by the sum of the eigenvalues of all features.

Method 3: This is closely related to Method 2. Calculate the proportion of variance explained for each feature, sort features by proportion of variance explained, and plot the cumulative proportion of variance explained as you keep more features. (This plot is called a scree plot.) One can pick how many features to include by identifying the point where adding a new feature gives a significant drop in additional variance explained relative to the previous feature, and choosing features up until that point. (I call this the “find the elbow” method, as looking at the “bend” or “elbow” in the scree plot shows where the biggest drop in proportion of variance explained occurs.)
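
A compact NumPy sketch of steps 2-6 above (the data is synthetic and the variable names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # raw data: 100 observations, 5 features

# Step 2: centre and standardize each column.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 3: Z^T Z is (proportional to) the covariance matrix.
C = Z.T @ Z

# Step 4: eigenvalues and eigenvectors (eigh, since C is symmetric).
eigvals, eigvecs = np.linalg.eigh(C)

# Step 5: sort eigenvectors by eigenvalue, largest first -> P*.
order = np.argsort(eigvals)[::-1]
eigvals, P_star = eigvals[order], eigvecs[:, order]

# Step 6: project the standardized data onto the principal components.
Z_star = Z @ P_star

# Proportion of variance explained per component, used by Methods 2 and 3.
pve = eigvals / eigvals.sum()
```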

22
Q

What is PCoA?
How does it differ from PCA?

A
  • PCoA = Principal Coordinate Analysis
  • PCoA attempts to represent the distances between samples in a low-dimensional, Euclidean space. In particular, it maximizes the linear correlation between the distances in the distance matrix, and the distances in a space of low dimension (typically, 2 or 3 axes are selected).
  • The first step of a PCoA is the construction of a (dis)similarity matrix. While PCA is based on Euclidean distances, PCoA can handle (dis)similarity matrices calculated from quantitative, semi-quantitative, qualitative, and mixed variables.
  • For abundance data, Bray-Curtis distance is often recommended. You can use Jaccard index for presence/absence data. When the distance metric is Euclidean, PCoA is equivalent to Principal Components Analysis.
  • Although PCoA is based on a (dis)similarity matrix, the solution can be found by eigenanalysis. The interpretation of the results is the same as with PCA.
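
A minimal NumPy/SciPy sketch of PCoA as eigenanalysis of a double-centred (squared) distance matrix; the Euclidean distance is used here only to keep the example short, and the names are made up:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(0).normal(size=(20, 6))
D = squareform(pdist(X))                  # (dis)similarity matrix (here Euclidean)

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n       # centring matrix
B = -0.5 * J @ (D ** 2) @ J               # double-centred squared distances

eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]         # largest eigenvalues first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep 2 axes: coordinates are eigenvectors scaled by the square root of the eigenvalues.
coords = eigvecs[:, :2] * np.sqrt(eigvals[:2])
```

With a Euclidean distance matrix these coordinates match the PCA scores, which is the equivalence mentioned above.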
23
Q

What is SVD?

A
  • One of the best-known and most widely used matrix decomposition methods is the Singular Value Decomposition, or SVD.
  • All matrices have an SVD, which makes it more stable than other methods, such as the eigendecomposition.
  • A = U · Σ · Vᵀ
  • where A is the real m × n matrix that we wish to decompose, U is an m × m matrix, Σ is an m × n diagonal matrix, and Vᵀ is the transpose of an n × n matrix V.
  • The diagonal values in the Σ matrix are known as the singular values of the original matrix A. The columns of U are called the left-singular vectors of A, and the columns of V are called the right-singular vectors of A.
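
A quick NumPy illustration (the matrix is made up):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                 # a real 3 x 2 matrix

U, singular_values, Vt = np.linalg.svd(A)  # A = U . Sigma . V^T
print(U.shape, singular_values, Vt.shape)  # (3, 3), the singular values, (2, 2)

# NumPy returns the singular values as a vector; rebuild the diagonal Sigma
# to check the reconstruction.
Sigma = np.zeros(A.shape)
Sigma[:2, :2] = np.diag(singular_values)
print(np.allclose(A, U @ Sigma @ Vt))      # True
```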