Data Preprocessing Flashcards
What is a pipeline?
A series of steps we take to transform the data from an initial raw form into a clean form that we can feed into our learning algorithms.
What is scaling and when is it useful?
Data scaling is a technique used in data preprocessing to normalize or standardize the range of values of numeric features in a dataset. If a feature has very large values (either positive or negative) it can have a disproportionate impact on the outcome of the learning algorithm. This is especially evident in distance-based algorithms, like KNN and SVMs (dot products). Linear models can also be affected, because regularization penalizes all weights on the same scale.
Data scaling prevents this by bringing all features into the same range. In addition, the learned weights are easier to compare and interpret when the features have similar scales.
Explain standard scaling
Standard scaling works by calculating the mean and standard deviation of each feature. Afterwards the mean is subtracted from the data and the result is divided by the standard deviation. This means that every feature ends up with a mean of exactly 0 and a standard deviation of 1, so all points are centered around 0.
xnew = (x - μ) / σ
We assume that the data is roughly Gaussian distributed, though it does not have to follow this distribution perfectly.
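A minimal sketch with scikit-learn's StandardScaler (the toy array is made up for illustration):
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # made-up toy data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)    # learns mean/std per feature, then scales
print(X_scaled.mean(axis=0))          # ~0 for every feature
print(X_scaled.std(axis=0))           # ~1 for every feature
```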
Explain min-max scaling
All features are scaled between a minimum and a maximum value; often 0 and 1 are used. This kind of scaling only makes sense if the min/max values have meaning in your data (e.g. age).
x_new = (x - x_min) / (x_max - x_min) · (max - min) + min, where [min, max] is the target range (typically [0, 1]).
Min-max scaling is very sensitive to outliers. Large values will squish the rest of the points together.
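A quick sketch with scikit-learn's MinMaxScaler (the age values are made up):
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[18.0], [35.0], [90.0]])        # made-up ages

scaler = MinMaxScaler(feature_range=(0, 1))   # [0, 1] is also the default range
print(scaler.fit_transform(X))                # 18 -> 0.0, 90 -> 1.0
```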
Explain Robust scaling
Robust scaling is similar to standard scaling, but instead of subtracting the mean, the median of each feature is subtracted from the data points. This means that half of the points will lie to the left of 0 and the other half to the right of 0. Afterwards, the data is divided by the interquartile range (the distance between the 25th and 75th percentiles), so the middle half of each feature falls within a unit-length interval. Because it uses the median and quartiles, robust scaling is much less sensitive to outliers than standard or min-max scaling.
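A small sketch with scikit-learn's RobustScaler on made-up data containing an outlier:
```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # made-up data, 1000 is an outlier

scaler = RobustScaler()          # subtracts the median, divides by the IQR
print(scaler.fit_transform(X))   # the outlier no longer dominates the scale
```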
Explain Normalization.
With the L1 norm, normalization makes sure that the (absolute) feature values of each data point sum to 1, i.e. the values of each row add up to 1. This is often useful for very high-dimensional data such as count data, where each feature is the frequency of a word in a particular document (row).
With the L2 norm, the squared feature values of each row sum to 1: Σ xᵢ² = 1. This is useful when calculating distances in high-dimensional data: cosine similarity is often used there, but it is more expensive to compute. If your data is L2-normalized, the Euclidean distance is a monotonic function of the cosine similarity, so ranking by one is equivalent to ranking by the other.
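A short sketch with scikit-learn's Normalizer on made-up rows:
```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0], [1.0, 1.0]])           # made-up count-like data

X_l1 = Normalizer(norm="l1").fit_transform(X)    # each row sums to 1
X_l2 = Normalizer(norm="l2").fit_transform(X)    # each row has unit L2 norm
print(X_l1.sum(axis=1))                          # [1. 1.]
print((X_l2 ** 2).sum(axis=1))                   # [1. 1.]
```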
Explain Maximum Absolute scalers
The maximum absolute (MaxAbs) scaler divides each feature by its maximum absolute value, scaling the data into the range [-1, 1]. It is mostly useful when you have many features but few non-zero values (sparse data), because it keeps zero entries at zero, which is important for efficient storage. It is similar to min-max scaling but without shifting the 0 values.
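A sketch with scikit-learn's MaxAbsScaler on a made-up sparse matrix:
```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

X = csr_matrix([[0.0, 4.0], [0.0, -2.0], [3.0, 0.0]])  # made-up sparse data

X_scaled = MaxAbsScaler().fit_transform(X)   # divides each column by its max |value|
print(X_scaled.toarray())                    # zeros stay zero, values lie in [-1, 1]
```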
What if your data is not normally distributed at all?
Power transformations can be performed in this case. If we know the distribution of the data beforehand, we can transform the data so that it becomes (approximately) normally distributed. For example, we can use a Box-Cox transformation to turn a lognormal or chi-squared distributed dataset into a normally distributed one.
Box-Cox formula with parameter λ:
if λ = 0 → x_new = log(x)
if λ ≠ 0 → x_new = (x^λ - 1) / λ
λ can be learned on the training data.
If your data contains non-positive values, the Yeo-Johnson transformation can be used instead, since Box-Cox requires strictly positive values.
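A sketch with scikit-learn's PowerTransformer on made-up lognormal data:
```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 1))         # made-up skewed, strictly positive data

pt = PowerTransformer(method="box-cox")   # use method="yeo-johnson" for non-positive data
X_gauss = pt.fit_transform(X)             # lambda is estimated from the (training) data
print(pt.lambdas_)                        # the fitted lambda per feature
```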
Explain ordinal encoding, when it comes to encoding categorical features.
In ordinal encoding an integer value is assigned to each category, in the order they are encountered. This is only useful if there exists a natural order in the categories, because the model will consider one category to be 'higher' or 'closer' to another. For example, it is useful when you have categorical values like 'very high', 'high', 'medium', etc.
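A sketch with scikit-learn's OrdinalEncoder, passing a made-up category order explicitly:
```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([["low"], ["medium"], ["very high"], ["high"]])   # made-up ordered categories

enc = OrdinalEncoder(categories=[["low", "medium", "high", "very high"]])
print(enc.fit_transform(X))   # low=0, medium=1, high=2, very high=3
```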
Explain one-hot encoding
One-hot encoding adds a new 0/1 feature for every category, which is 1 (hot) if the sample has that category and 0 otherwise. This technique can explode the number of features when a feature has many different values: for example, 200 different categorical values in a feature create 200 new features, which can create problems for the learning algorithm.
It can also happen that a value is only present in the test set, meaning that the encoder did not see it during fitting and has no column for it. In this case we can, for example, ignore the unknown value (encode it as all zeros) or resample our data.
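A sketch with scikit-learn's OneHotEncoder, using made-up colour categories and handle_unknown="ignore" for values that only appear in the test set:
```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([["red"], ["green"], ["blue"]])   # made-up training categories
X_test = np.array([["green"], ["purple"]])           # "purple" was never seen in training

enc = OneHotEncoder(handle_unknown="ignore")         # unknown categories become all zeros
enc.fit(X_train)
print(enc.transform(X_test).toarray())
```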
Explain target encoding.
Target encoding is a supervised way of encoding categorical values. For each category, it looks at how strongly that category correlates with the positive (target) class 1 versus the negative class 0. Target encoding blends the posterior probability of the target and the prior probability using the logistic function. For the formula go to the slides.
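The exact blending formula is on the slides; as a rough illustration only, a smoothed target encoding can be sketched as below, where the data and the smoothing strength are made up and a simple count-based weight is used instead of the logistic weighting mentioned above:
```python
import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "B", "B", "B", "C"],
                   "y":    [1,   0,   1,   1,   0,   1]})    # made-up data

prior = df["y"].mean()                                       # prior: overall positive rate
stats = df.groupby("city")["y"].agg(["mean", "count"])       # posterior per category
smoothing = 2.0                                              # hypothetical smoothing strength
weight = stats["count"] / (stats["count"] + smoothing)
encoding = weight * stats["mean"] + (1 - weight) * prior     # blend posterior and prior
df["city_encoded"] = df["city"].map(encoding)
print(df)
```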
What is meant with a fit-predict paradigm in relation to transformations?
The transformer should be fit on the training data only. Fitting means recording the statistics needed for the transformation (e.g. the mean and standard deviation for standard scaling). We then transform (e.g. scale) the training data and train the learning model on it. Afterwards we transform the test data with the same fitted transformer and evaluate the model. Additionally, it is important to only scale the input features (X) and not the targets (y).
Fitting and transforming on the whole dataset before splitting causes data leakage, as we already looked at the test data. Therefore model evaluations will be misleading and probably optimistic.
Fitting and scaling the test and train data separately is also wrong, as it distorts the data: the two sets are transformed with different statistics and are no longer comparable. It is best to fit the transformer on the training data and transform both the train and test data with this fitted scaler.
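A minimal sketch of this workflow, using scikit-learn's breast cancer dataset as an arbitrary example:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)   # fit on the training data only
X_train_s = scaler.transform(X_train)    # transform the training data ...
X_test_s = scaler.transform(X_test)      # ... and the test data with the SAME fitted scaler

knn = KNeighborsClassifier().fit(X_train_s, y_train)
print(knn.score(X_test_s, y_test))       # evaluate on the scaled test data
```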
How can data leakage occur in Cross Validation?
When the entire training data is scaled before doing cross-validation, data leakage occurs: the scaler was fit on rows that later act as validation folds, even though those folds should be treated as unseen data. The scaler should instead be fit inside each cross-validation split, on the training folds only.
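A sketch of the leaky pattern (dataset chosen arbitrarily for illustration):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Leaky: the scaler is fit on ALL rows, including the rows that later
# serve as validation folds inside cross_val_score.
X_scaled = StandardScaler().fit_transform(X)
print(cross_val_score(SVC(), X_scaled, y, cv=5))   # optimistically biased estimate
```
Putting the scaler and the model in a pipeline (next card) avoids this.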
Explain how pipelines work in practice.
A pipeline can be created in scikit-learn with the Pipeline object. It chains a number of transformation steps together and ends with a classifier/regressor. The whole pipeline therefore acts like a classifier/regressor itself and can be passed to grid search (GridSearchCV) or cross-validation, which will then take care of fitting the transformers on the training folds only.
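A sketch combining a pipeline with grid search, again on an arbitrary example dataset:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()),   # transformation step(s)
                 ("svm", SVC())])                # final classifier

# The pipeline behaves like a single estimator, so GridSearchCV refits the
# scaler on the training folds of every split automatically (no leakage).
param_grid = {"svm__C": [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```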
Explain the curse of dimensionality.
For every feature (dimension) you add, you need exponentially more data to cover the input space with the same density. With a fixed amount of data the points therefore become increasingly sparse, and distances between points become less informative.