ML - Preprocessing Flashcards

1
Q

Name 5 ways we can do Feature Importance Selection:

A

  1. Drop Column Importance
  2. Permutation Importance
  3. Correlation Matrix / Spearman’s Correlation (Heatmap)
  4. Comparison to Random Noise Column
  5. Using a Random Forest (Gini Drop)
2
Q

What is Drop Column Importance?

A

We remove one feature at a time and look at the change in the model's performance. First we get a baseline score from our metric of choice (we can use cross-validation for this). We then drop one column at a time, retrain the model, and compare each retrained model's score to the baseline.

Feature importance = baseline − score without the column
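
A minimal sketch of the idea, assuming a pandas DataFrame X, a target y, and any scikit-learn style estimator (the function name and defaults here are illustrative, not from the card):

```python
from sklearn.model_selection import cross_val_score

def drop_column_importance(model, X, y, cv=5):
    # Baseline: cross-validated score with all features present.
    baseline = cross_val_score(model, X, y, cv=cv).mean()
    importances = {}
    for col in X.columns:
        # Retrain (via CV) with this one column removed and compare to the baseline.
        score = cross_val_score(model, X.drop(columns=[col]), y, cv=cv).mean()
        importances[col] = baseline - score  # large positive value => important feature
    return importances
```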

3
Q

What are the pros / cons of using drop-column importance?

A

Pros:

  • relatively simple + intuitive
  • model agnostic (can be applied to any model)
  • when a feature is dropped, all of its interactions with other features are removed as well, so its importance reflects those interactions too

Cons:

  • You have to retrain the model once per feature, which can be expensive
  • If there are collinear features, dropping one of them is compensated for by the others, which distorts the importances (the dropped feature can look unimportant while the remaining ones get amplified)
4
Q

What is Permutation Importance?

A

Permutation feature importance is defined as the decrease in a model's score when a single feature's values are randomly shuffled. In other words, it measures the difference in performance when you permute (shuffle) a feature column. The shuffling *breaks* the relationship between that feature and the target (and the rest of the values in each row), so the drop in score reflects how much the model depends on that feature.

5
Q

What are the pros / cons of using permutation importance?

A

Pros:

  • Model Agnostic (can be used on any model)
  • The model doesn’t need to be retrained each time (unlike in drop column importance)

Cons:

  • If two features are correlated, permuting one of them leaves its information available through the other, so the measured importances are distorted
  • We have to average across trials, since the shuffling is random
6
Q

List the 4 steps in the pseudo-algorithm for calculating permutation importance:

A
  1. Calculate a baseline score using the metric, the trained model, the feature matrix, and the target vector.
  2. For each feature in the feature matrix, make a copy of the feature matrix.
  3. Shuffle that feature's column in the copy, pass it through the trained model to get predictions, and use the metric to score them. Importance = baseline − score.
  4. Repeat N times for statistical stability and take the average importance across trials.
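
A minimal sketch of these four steps, assuming an already-fitted scikit-learn style model, a pandas DataFrame X, and a higher-is-better metric such as r2_score (the function name and defaults are made up for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

def permutation_importance_sketch(model, X, y, metric=r2_score, n_repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))              # step 1: baseline score
    importances = {}
    for col in X.columns:                               # step 2: one feature at a time
        drops = []
        for _ in range(n_repeats):                      # step 4: repeat for stability
            X_perm = X.copy()
            X_perm[col] = rng.permutation(X_perm[col].to_numpy())  # step 3: shuffle
            drops.append(baseline - metric(y, model.predict(X_perm)))
        importances[col] = np.mean(drops)               # average importance across trials
    return importances
```

scikit-learn also ships a ready-made version of this idea (sklearn.inspection.permutation_importance).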
7
Q

What is Label Encoding?

A

*Label Encoding* converts categorical values to simple numbers through a 1:1 substitution. One of the cons is that the codes have an implicit order when we might not mean to imply one.
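
A quick illustration with pandas (the column name and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})
# 1:1 substitution of each category with an integer code.
codes, categories = pd.factorize(df["size"])
df["size_encoded"] = codes   # small -> 0, large -> 1, medium -> 2
# Note the implicit (and here meaningless) order: 0 < 1 < 2.
```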

8
Q

What is One-Hot Encoding?

A

We use this when we want to convert a categorical column into binary (1 or 0) indicator variables.

It splits that column into X columns, where X is the number of different categories you have.

Thus, you have to be careful if you have a lot of categories, so that you don't suddenly blow up the width/dimensionality of your data.
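
A quick pandas illustration (the column and values are made up); one column with 3 categories becomes 3 indicator columns:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "red"]})
one_hot = pd.get_dummies(df["colour"], prefix="colour", dtype=int)
print(one_hot)   # columns colour_blue, colour_green, colour_red filled with 0/1
```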

9
Q

What is Binary Encoding?

A

Binary Encoding is when you encode a categorical variable as presence or absence (1 or 0). It differs from one-hot encoding in that you usually use it when a column has only one or two categories in total, so a single 0/1 column is enough.
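
A minimal illustration in the presence/absence sense used on this card (the column name is made up):

```python
import pandas as pd

df = pd.DataFrame({"has_garage": ["yes", "no", "yes"]})
# A single two-category column becomes a single 0/1 column - no extra columns needed.
df["has_garage_encoded"] = (df["has_garage"] == "yes").astype(int)
```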

10
Q

What is Target Encoding? How does this differ from Label Encoding?

A

Target encoding is when you replace each category with a statistic of the target, e.g. the min / max / a quantile / the average / the count of the target values for that category.

This differs from Label Encoding in that each new number still has a meaningful relationship with the underlying data, rather than being an arbitrary substitute.
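
A minimal sketch of mean target encoding with pandas (the data and names are made up); in practice the per-category averages should be computed on the training split only, to avoid target leakage:

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["A", "A", "B", "B", "C"],
    "price": [100, 120, 300, 280, 90],   # the target
})
# Replace each category with the average target value observed for that category.
city_means = df.groupby("city")["price"].mean()   # A: 110, B: 290, C: 90
df["city_encoded"] = df["city"].map(city_means)
```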

11
Q

What is Rank Encoding?

A

Rank Encoding represents the numerical data by its respective rank or order.
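
A one-line pandas illustration (the values are made up):

```python
import pandas as pd

s = pd.Series([180, 150, 210, 150])   # e.g. heights
print(s.rank(method="min"))           # 3.0, 1.0, 4.0, 1.0 - each value replaced by its rank
```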

12
Q

What is Frequency Encoding?

A

Frequency Encoding represents categorical data by the frequency with which that category appears.
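
A quick pandas illustration (the values are made up):

```python
import pandas as pd

s = pd.Series(["cat", "dog", "cat", "bird", "cat"])
freq = s.value_counts()   # cat: 3, dog: 1, bird: 1
encoded = s.map(freq)     # [3, 1, 3, 1, 3]
```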

13
Q

How do we use Hashing to encode variables?

A

Hashing transforms a string of characters into a shorter, fixed-length number, which can then be used (often modulo a chosen number of buckets) as the encoded value.
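
A minimal sketch using the standard library (the bucket count is an arbitrary choice for illustration):

```python
import hashlib

def hash_encode(value, n_buckets=16):
    # Stable hash of the string, folded into a fixed number of buckets.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

print(hash_encode("user_12345"))   # the same string always lands in the same bucket
```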

14
Q

What is Embedding? Name 3 examples where we do this.

A

Embedding is when we map some high dimensional space to a lower dimensional one.

Examples - Word2Vec, NN layers, PCA

15
Q

What is Normalization?

A

Normalization is when you scale a variable to be between 0 and 1.

You do this by subtracting the min, then dividing by the difference between the max and min.

It is used when values span very different or extreme ranges. Decreasing the range makes gradient descent easier to traverse (since gradient descent follows the partial derivatives of the loss with respect to the weights, and badly scaled features distort those steps).
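
A tiny NumPy illustration of min-max scaling (the values are made up):

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])
x_norm = (x - x.min()) / (x.max() - x.min())   # subtract min, divide by (max - min)
print(x_norm)                                  # [0.  0.25  0.625  1. ]
```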

16
Q

What is Regularization? Why do you do it? Name two methods we use.

A

Regularization is used when you would like to reduce overfitting of a model to a data set.

You use regularization to reduce model complexity by adding a penalty (regularization) term to the loss.

Two methods we know of are:

  1. L1 = LASSO, which penalizes the sum of the absolute values of the weights (analogous to MAE, where the ‘A’ is for absolute)
  2. L2 = Ridge, which penalizes the sum of the squared weights (analogous to MSE)
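
A small sketch comparing the two with scikit-learn on synthetic data (the alpha values and dataset parameters are arbitrary choices for illustration); note how the L1 penalty drives some coefficients exactly to 0 while the L2 penalty only shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only a few of the 10 features are truly informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: sum of |weights|
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: sum of squared weights

print("L1 coefficients set to zero:", np.sum(lasso.coef_ == 0))  # typically several
print("L2 coefficients set to zero:", np.sum(ridge.coef_ == 0))  # typically none
```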
17
Q

What is L1 regularization?

Why do we use it?

When we plot the weights, what shape do we get?

What problems does it have?

A

L1 = LASSO, which penalizes the sum of the absolute values of the weights (the ‘A’ in MAE/LASSO is for absolute)

  • It is used to *reduce* the number of features, since some weights are driven exactly to 0
  • When you plot the penalty's constraint region in weight space you get a diamond shape
  • The penalty is non-differentiable at 0 (where those weights end up), which complicates optimization
18
Q

What is L2 regularization?

Why do we use it?

When we plot the weights, what shape do we get?

What problems does it have?

A

L2 = Ridge, which penalizes the sum of the squared weights (analogous to MSE)

  • Used when you want the penalty to hit large weights especially hard (because you square them)
  • When you plot the penalty's constraint region in weight space you get a circle shape
  • A drawback is that it does not produce sparse models: weights shrink toward 0 but never become exactly 0, so no features are dropped
19
Q

What is Standardization?
Why do we use it?
How do we do it?

A
  • Standardization is used to transform data to have mean = 0 and std = 1 (e.g. the z-score)
  • We use it when our data (or different features) come from different distributions.
  • We standardize by subtracting the mean and dividing by the standard deviation.
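
A tiny NumPy illustration of the z-score (the values are made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
z = (x - x.mean()) / x.std()   # subtract the mean, divide by the standard deviation
print(z.mean(), z.std())       # ~0.0 and 1.0
```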
20
Q

Name 4 different ways we can do dimensionality reduction.

A
  1. PCA
  2. PCoA
  3. SVD (Singular Value Decomposition)
  4. Factor Analysis
21
Q

What is PCA?

Why do we do it?

How does it work?

What are the cons?

A
  • PCA = Principal Component Analysis
  • We do this when we have A LOT of variables to consider and would like to reduce the dimensionality of our data
  • In essence, we transform the data so that the direction with the most variation lines up with the axis we call the first “principal component”, and we project the data onto that axis.
  • The con is that the resulting features are no longer interpretable
  • How does it work?
  1. Start from the matrix of independent variables (one column per variable), which captures how they relate to one another.
  2. For each column, subtract the mean of that column from each entry and divide by the standard deviation; call the standardized matrix Z.
  3. Multiply the transpose of Z by Z itself; the result, ZᵀZ, is (proportional to) the covariance matrix.
  4. Calculate the eigenvectors and eigenvalues of this matrix by decomposing ZᵀZ into PDP⁻¹, where P is a matrix of eigenvectors and D is the diagonal matrix with the eigenvalues along the diagonal and zeros everywhere else. Each eigenvalue λᵢ corresponds to the eigenvector in column i of P.
  5. Take the eigenvalues λ₁, λ₂, …, λp and sort them from largest to smallest, sorting the eigenvectors in P accordingly. (For example, if λ₂ is the largest eigenvalue, take the second column of P and place it in the first column position.) Depending on the computing package, this may be done automatically. Call this sorted matrix of eigenvectors P*. Note that these eigenvectors are independent of one another.
  6. Calculate Z* = ZP*. This new matrix, Z*, is a centered/standardized version of X, but now each observation is a combination of the original variables, where the weights are determined by the eigenvector. As a bonus, because the eigenvectors in P* are independent of one another, each column of Z* is also independent of the others! (Because the principal components are orthogonal to one another, they are statistically linearly independent of one another, which is why the columns of Z* are linearly independent.)
  7. Finally, we need to determine how many features to keep versus how many to drop. There are three common methods for this, discussed below:

Method 1: We arbitrarily select how many dimensions we want to keep. Perhaps I want to visually represent things in two dimensions, so I may only keep two features. This is use-case dependent and there isn’t a hard-and-fast rule for how many features I should pick.

Method 2: Calculate the proportion of variance explained (briefly explained below) for each feature, pick a threshold, and add features until you hit that threshold. (For example, if you want to explain 80% of the total variability possibly explained by your model, add features with the largest explained proportion of variance until your proportion of variance explained hits or exceeds 80%.) Because each eigenvalue is roughly the importance of its corresponding eigenvector, the proportion of variance explained is the sum of the eigenvalues of the features you kept divided by the sum of the eigenvalues of all features.

Method 3: This is closely related to Method 2. Calculate the proportion of variance explained for each feature, sort features by proportion of variance explained, and plot the cumulative proportion of variance explained as you keep more features. (This plot is called a scree plot.) One can pick how many features to include by identifying the point where adding a new feature gives a significant drop in additional variance explained relative to the previous feature, and choosing features up until that point. (I call this the “find the elbow” method, as looking at the “bend” or “elbow” in the scree plot shows where the biggest drop in proportion of variance explained occurs.)
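
A compact NumPy sketch of steps 2-6 above (the data is synthetic and the variable names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # raw data: 100 observations, 5 features

# Step 2: centre and standardize each column.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 3: Z^T Z is (proportional to) the covariance matrix.
C = Z.T @ Z

# Step 4: eigenvalues and eigenvectors (eigh, since C is symmetric).
eigvals, eigvecs = np.linalg.eigh(C)

# Step 5: sort eigenvectors by eigenvalue, largest first -> P*.
order = np.argsort(eigvals)[::-1]
eigvals, P_star = eigvals[order], eigvecs[:, order]

# Step 6: project the standardized data onto the principal components.
Z_star = Z @ P_star

# Proportion of variance explained per component, used by Methods 2 and 3.
pve = eigvals / eigvals.sum()
```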

22
Q

What is PCoA?
How does it differ from PCA?

A
  • PCoA = Principal Coordinate Analysis
  • PCoA attempts to represent the distances between samples in a low-dimensional, Euclidean space. In particular, it maximizes the linear correlation between the distances in the distance matrix, and the distances in a space of low dimension (typically, 2 or 3 axes are selected).
  • The first step of a PCoA is the construction of a (dis)similarity matrix. While PCA is based on Euclidean distances, PCoA can handle (dis)similarity matrices calculated from quantitative, semi-quantitative, qualitative, and mixed variables.
  • For abundance data, Bray-Curtis distance is often recommended. You can use Jaccard index for presence/absence data. When the distance metric is Euclidean, PCoA is equivalent to Principal Components Analysis.
  • Although PCoA is based on a (dis)similarity matrix, the solution can be found by eigenanalysis. The interpretation of the results is the same as with PCA.
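
A minimal NumPy/SciPy sketch of PCoA as eigenanalysis of a double-centred (squared) distance matrix; the Euclidean distance is used here only to keep the example short, and the names are made up:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(0).normal(size=(20, 6))
D = squareform(pdist(X))                  # (dis)similarity matrix (here Euclidean)

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n       # centring matrix
B = -0.5 * J @ (D ** 2) @ J               # double-centred squared distances

eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]         # largest eigenvalues first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep 2 axes: coordinates are eigenvectors scaled by the square root of the eigenvalues.
coords = eigvecs[:, :2] * np.sqrt(eigvals[:2])
```

With a Euclidean distance matrix these coordinates match the PCA scores, which is the equivalence mentioned above.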
23
Q

What is SVD?

A
  • One of the best-known and most widely used matrix decomposition methods is the Singular Value Decomposition, or SVD.
  • All matrices have an SVD, which makes it more stable than other methods, such as the eigendecomposition.
  • A = U · Σ · Vᵀ
  • where A is the real m × n matrix that we wish to decompose, U is an m × m matrix, Σ is an m × n diagonal matrix, and Vᵀ is the transpose of an n × n matrix V.
  • The diagonal values in the Σ matrix are known as the singular values of the original matrix A. The columns of U are called the left-singular vectors of A, and the columns of V are called the right-singular vectors of A.
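
A quick NumPy illustration (the matrix is made up):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                 # a real 3 x 2 matrix

U, singular_values, Vt = np.linalg.svd(A)  # A = U . Sigma . V^T
print(U.shape, singular_values, Vt.shape)  # (3, 3), the singular values, (2, 2)

# NumPy returns the singular values as a vector; rebuild the diagonal Sigma
# to check the reconstruction.
Sigma = np.zeros(A.shape)
Sigma[:2, :2] = np.diag(singular_values)
print(np.allclose(A, U @ Sigma @ Vt))      # True
```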