Data Preprocessing Flashcards
What is a pipeline?
A series of steps we take to transform the data from an initial raw form into a clean form that we can feed into our learning algorithms.
What is scaling and when is it useful?
Data scaling is a technique used in data preprocessing to normalize or standardize the range of values of numeric features in a dataset. If a feature has very large values (either positive or negative) it can have a disproportionate impact on the outcome of the learning algorithm. This is especially evident in distance-based algorithms, like KNN and SVMs (dot products). Linear models can also be affected, because regularization penalizes all weights on the same scale.
Data scaling prevents this by bringing all features into the same range. In addition, the learned weights are easier to compare and interpret when the features have similar scales.
Explain standard scaling
Standard scaling works by calculating the mean and standard deviation of each feature. Afterwards the mean is subtracted from the data and the result is divided by the standard deviation. This means that every feature ends up with a mean of exactly 0 and a standard deviation of 1, so all points are centered around 0.
xnew = (x - μ) / σ
We assume that the data is roughly Gaussian distributed, though it does not have to follow this distribution perfectly.
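A minimal sketch with scikit-learn's StandardScaler (the toy array is made up for illustration):
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # made-up toy data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)    # learns mean/std per feature, then scales
print(X_scaled.mean(axis=0))          # ~0 for every feature
print(X_scaled.std(axis=0))           # ~1 for every feature
```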
Explain min-max scaling
All features are scaled between a minimum and a maximum value; often 0 and 1 are used. This kind of scaling only makes sense if the min/max values have meaning in your data (e.g. age).
x_new = (x - x_min) / (x_max - x_min) · (max - min) + min, where [min, max] is the target range (typically [0, 1]).
Min-max scaling is very sensitive to outliers. Large values will squish the rest of the points together.
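A quick sketch with scikit-learn's MinMaxScaler (the age values are made up):
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[18.0], [35.0], [90.0]])        # made-up ages

scaler = MinMaxScaler(feature_range=(0, 1))   # [0, 1] is also the default range
print(scaler.fit_transform(X))                # 18 -> 0.0, 90 -> 1.0
```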
Explain Robust scaling
Robust scaling is similar to standard scaling, but instead of subtracting the mean, the median of each feature is subtracted from the data points. This means that half of the points will lie to the left of 0 and the other half to the right of 0. Afterwards, the data is divided by the interquartile range (the distance between the 25th and 75th percentiles), so the middle half of each feature falls within a unit-length interval. Because it uses the median and quartiles, robust scaling is much less sensitive to outliers than standard or min-max scaling.
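A small sketch with scikit-learn's RobustScaler on made-up data containing an outlier:
```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # made-up data, 1000 is an outlier

scaler = RobustScaler()          # subtracts the median, divides by the IQR
print(scaler.fit_transform(X))   # the outlier no longer dominates the scale
```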
Explain Normalization.
With the L1 norm, normalization makes sure that the (absolute) feature values of each data point sum to 1, i.e. the values of each row add up to 1. This is often useful for very high-dimensional data such as count data, where each feature is the frequency of a word in a particular document (row).
With the L2 norm, the squared feature values of each row sum to 1: Σ xᵢ² = 1. This is useful when calculating distances in high-dimensional data: cosine similarity is often used there, but it is more expensive to compute. If your data is L2-normalized, the Euclidean distance is a monotonic function of the cosine similarity, so ranking by one is equivalent to ranking by the other.
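A short sketch with scikit-learn's Normalizer on made-up rows:
```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0], [1.0, 1.0]])           # made-up count-like data

X_l1 = Normalizer(norm="l1").fit_transform(X)    # each row sums to 1
X_l2 = Normalizer(norm="l2").fit_transform(X)    # each row has unit L2 norm
print(X_l1.sum(axis=1))                          # [1. 1.]
print((X_l2 ** 2).sum(axis=1))                   # [1. 1.]
```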
Explain Maximum Absolute scalers
The maximum absolute (MaxAbs) scaler divides each feature by its maximum absolute value, scaling the data into the range [-1, 1]. It is mostly useful when you have many features but few non-zero values (sparse data), because it keeps zero entries at zero, which is important for efficient storage. It is similar to min-max scaling but without shifting the 0 values.
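A sketch with scikit-learn's MaxAbsScaler on a made-up sparse matrix:
```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

X = csr_matrix([[0.0, 4.0], [0.0, -2.0], [3.0, 0.0]])  # made-up sparse data

X_scaled = MaxAbsScaler().fit_transform(X)   # divides each column by its max |value|
print(X_scaled.toarray())                    # zeros stay zero, values lie in [-1, 1]
```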
What if your data is not normally distributed at all?
Power transformations can be performed in this case. If we know the distribution of the data beforehand, we can transform the data so that it becomes (approximately) normally distributed. For example, we can use a Box-Cox transformation to turn a lognormal or chi-squared distributed dataset into a normally distributed one.
Box-Cox formula with parameter λ:
if λ = 0 → x_new = log(x)
if λ ≠ 0 → x_new = (x^λ - 1) / λ
λ can be learned on the training data.
If your data contains non-positive values, the Yeo-Johnson transformation can be used instead, since Box-Cox requires strictly positive values.
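A sketch with scikit-learn's PowerTransformer on made-up lognormal data:
```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 1))         # made-up skewed, strictly positive data

pt = PowerTransformer(method="box-cox")   # use method="yeo-johnson" for non-positive data
X_gauss = pt.fit_transform(X)             # lambda is estimated from the (training) data
print(pt.lambdas_)                        # the fitted lambda per feature
```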
Explain ordinal encoding, when it comes to encoding categorical features.
In ordinal encoding an integer value is assigned to each category, in the order they are encountered. This is only useful if there exists a natural order in the categories, because the model will consider one category to be 'higher' or 'closer' to another. For example, it is useful when you have categorical values like 'very high', 'high', 'medium', etc.
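A sketch with scikit-learn's OrdinalEncoder, passing a made-up category order explicitly:
```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X = np.array([["low"], ["medium"], ["very high"], ["high"]])   # made-up ordered categories

enc = OrdinalEncoder(categories=[["low", "medium", "high", "very high"]])
print(enc.fit_transform(X))   # low=0, medium=1, high=2, very high=3
```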
Explain one-hot encoding
One-hot encoding adds a new 0/1 feature for every category, which is 1 (hot) if the sample has that category and 0 otherwise. This technique can explode the number of features when a feature has many different values: for example, 200 different categorical values in a feature create 200 new features, which can create problems for the learning algorithm.
It can also happen that a value is only present in the test set, meaning that the encoder did not see it during fitting and has no column for it. In this case we can, for example, ignore the unknown value (encode it as all zeros) or resample our data.
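A sketch with scikit-learn's OneHotEncoder, using made-up colour categories and handle_unknown="ignore" for values that only appear in the test set:
```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([["red"], ["green"], ["blue"]])   # made-up training categories
X_test = np.array([["green"], ["purple"]])           # "purple" was never seen in training

enc = OneHotEncoder(handle_unknown="ignore")         # unknown categories become all zeros
enc.fit(X_train)
print(enc.transform(X_test).toarray())
```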
Explain target encoding.
Target encoding is a supervised way of encoding categorical values. For each category, it looks at how strongly that category correlates with the positive (target) class 1 versus the negative class 0. Target encoding blends the posterior probability of the target and the prior probability using the logistic function. For the formula go to the slides.
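The exact blending formula is on the slides; as a rough illustration only, a smoothed target encoding can be sketched as below, where the data and the smoothing strength are made up and a simple count-based weight is used instead of the logistic weighting mentioned above:
```python
import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "B", "B", "B", "C"],
                   "y":    [1,   0,   1,   1,   0,   1]})    # made-up data

prior = df["y"].mean()                                       # prior: overall positive rate
stats = df.groupby("city")["y"].agg(["mean", "count"])       # posterior per category
smoothing = 2.0                                              # hypothetical smoothing strength
weight = stats["count"] / (stats["count"] + smoothing)
encoding = weight * stats["mean"] + (1 - weight) * prior     # blend posterior and prior
df["city_encoded"] = df["city"].map(encoding)
print(df)
```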
What is meant with a fit-predict paradigm in relation to transformations?
The transformer should be fit on the training data only. Fitting means recording the statistics needed for the transformation (e.g. the mean and standard deviation for standard scaling). We then transform (e.g. scale) the training data and train the learning model on it. Afterwards we transform the test data with the same fitted transformer and evaluate the model. Additionally, it is important to only scale the input features (X) and not the targets (y).
Fitting and transforming on the whole dataset before splitting causes data leakage, as we already looked at the test data. Therefore model evaluations will be misleading and probably optimistic.
Fitting and scaling the test and train data separately is also wrong, as it distorts the data: the two sets are transformed with different statistics and are no longer comparable. It is best to fit the transformer on the training data and transform both the train and test data with this fitted scaler.
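A minimal sketch of this workflow, using scikit-learn's breast cancer dataset as an arbitrary example:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)   # fit on the training data only
X_train_s = scaler.transform(X_train)    # transform the training data ...
X_test_s = scaler.transform(X_test)      # ... and the test data with the SAME fitted scaler

knn = KNeighborsClassifier().fit(X_train_s, y_train)
print(knn.score(X_test_s, y_test))       # evaluate on the scaled test data
```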
How can data leakage occur in Cross Validation?
When the entire training data is scaled before doing cross-validation, data leakage occurs: the scaler was fit on rows that later act as validation folds, even though those folds should be treated as unseen data. The scaler should instead be fit inside each cross-validation split, on the training folds only.
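A sketch of the leaky pattern (dataset chosen arbitrarily for illustration):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Leaky: the scaler is fit on ALL rows, including the rows that later
# serve as validation folds inside cross_val_score.
X_scaled = StandardScaler().fit_transform(X)
print(cross_val_score(SVC(), X_scaled, y, cv=5))   # optimistically biased estimate
```
Putting the scaler and the model in a pipeline (next card) avoids this.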
Explain how pipelines work in practice.
A pipeline can be created in scikit-learn with the Pipeline object. It chains a number of transformation steps together and ends with a classifier/regressor. The whole pipeline therefore acts like a classifier/regressor itself and can be passed to grid search (GridSearchCV) or cross-validation, which will then take care of fitting the transformers on the training folds only.
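A sketch combining a pipeline with grid search, again on an arbitrary example dataset:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()),   # transformation step(s)
                 ("svm", SVC())])                # final classifier

# The pipeline behaves like a single estimator, so GridSearchCV refits the
# scaler on the training folds of every split automatically (no leakage).
param_grid = {"svm__C": [0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```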
Explain the curse of dimensionality.
For every feature (dimension) you add, you need exponentially more data to cover the input space with the same density. With a fixed amount of data the points therefore become increasingly sparse, and distances between points become less informative.