Data pre-processing Flashcards
Possible data pre-processing procedures
Input preprocessing:
1) input centering
2) input normalization
3) input whitening
4) data cleaning
PCA can also be considered a data pre-processing procedure.
Why is data pre-processing used?
Data should be pre-processed to make it better suited for learning, and to achieve better results.
Pre-processing operations must be done without snooping the data!
Input preprocessing: general procedure
Given the input data matrix X ∈ R^(N×d),
input pre-processing consists of finding a standardization transform Φ that gives
z_n = Φ(x_n) for n = 1, ..., N
The final hypothesis will be
h(x) = h~(Φ(x))
where h~ is the hypothesis learned on the transformed data z_n.
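A minimal sketch of this procedure, assuming NumPy and taking centering plus scaling as the example transform. The names `fit_phi`, `h_tilde` and the weights `w` are hypothetical, introduced only for illustration. The parameters of Φ are computed from the training inputs alone, which is one way to respect the no-snooping rule.

```python
import numpy as np

def fit_phi(X_train):
    """Build a transform Phi whose parameters come from the training inputs only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)

    def phi(X):
        # Apply the same, already fixed, parameters to any input.
        return (np.asarray(X, dtype=float) - mean) / std

    return phi

X_train = np.array([[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]])
X_test = np.array([[3.0, 18.0]])

phi = fit_phi(X_train)      # the test point plays no part in fitting Phi
Z_train = phi(X_train)      # z_n = Phi(x_n) for the training inputs
Z_test = phi(X_test)        # the same Phi is reused for new inputs

w = np.array([0.5, -0.5])   # placeholder weights standing in for a learned model

def h_tilde(Z):
    """Hypothetical hypothesis learned in the transformed space."""
    return Z @ w

def h(X):
    """Final hypothesis on raw inputs: transform first, then predict."""
    return h_tilde(phi(X))
```

Computing the mean and standard deviation over training and test inputs together would let the test data influence Φ, which is the kind of snooping the card above warns about.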
Input centering
Goal: to remove any bias from the input.
z_n = x_n - x̄
x̄ = (1/N) Σ_{n=1}^{N} x_n (mean of the data)
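As an illustration, centering can be sketched in NumPy as follows. The function name `center` and the sample matrix are assumptions made for this example; rows are data points and columns are features.

```python
import numpy as np

def center(X):
    """Subtract the per-feature mean from every data point."""
    X = np.asarray(X, dtype=float)
    x_bar = X.mean(axis=0)   # mean of the data, one entry per feature
    return X - x_bar, x_bar

X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
Z, x_bar = center(X)
# x_bar is [3, 6]; every column of Z now has zero mean
```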
Input normalization
Goal: to scale each input feature by its standard deviation, so that every feature has unit variance.
Assuming the input is centered:
z_n = [z_n1 … z_nd]' = [x_n1/σ_1 … x_nd/σ_d]'
where σ_i^2 = (1/N) Σ_{n=1}^{N} x_ni^2, i = 1, ..., d (variance of feature i)
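A short NumPy sketch of normalization, assuming the input has already been centered. The function name `normalize` and the sample data are hypothetical. A feature with zero variance would cause a division by zero and needs separate handling, which this sketch leaves out.

```python
import numpy as np

def normalize(X_centered):
    """Divide each feature of centered data by its standard deviation."""
    X = np.asarray(X_centered, dtype=float)
    # sigma_i^2 = (1/N) * sum_n x_ni^2, valid because the data is centered
    sigma = np.sqrt(np.mean(X ** 2, axis=0))
    return X / sigma, sigma

X_centered = np.array([[-2.0, -4.0], [0.0, 0.0], [2.0, 4.0]])
Z, sigma = normalize(X_centered)
# every column of Z now has unit variance
```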
Input whitening
Goal: to decorrelate the input features (useful when they cannot be assumed to be uncorrelated already). Assuming the input is centered:
z_n = A^(-1/2) x_n
where A is the covariance matrix of the x's, A = (1/N) Σ_{n=1}^{N} x_n x_n'
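The whitening transform can be sketched in NumPy as below. This assumes centered data and a covariance matrix A with full rank, so that A^(-1/2) exists. The inverse square root is computed here through an eigendecomposition, which is one common choice rather than the only one. The function name `whiten` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def whiten(X_centered):
    """Apply z_n = A^(-1/2) x_n to centered data (rows are data points)."""
    X = np.asarray(X_centered, dtype=float)
    N = X.shape[0]
    A = X.T @ X / N                          # covariance matrix of centered data
    eigvals, eigvecs = np.linalg.eigh(A)     # A is symmetric, so eigh applies
    A_inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    # A^(-1/2) is symmetric, so row-wise x_n' A^(-1/2) equals (A^(-1/2) x_n)'
    return X @ A_inv_sqrt, A

# Synthetic correlated data: independent Gaussians mixed by a fixed matrix.
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
X = raw - raw.mean(axis=0)                   # center first

Z, A = whiten(X)
cov_Z = Z.T @ Z / Z.shape[0]                 # identity up to rounding error
```

After whitening, the features are uncorrelated and each has unit variance, so whitening includes the effect of normalization.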
Data cleaning
Goal: to remove outliers
- use simple models
- compute the leverage score of each data point (using validation)
- look at the data
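One possible reading of "use simple models" is to fit a simple model and inspect the points it fits worst. The sketch below assumes a one-dimensional least-squares line and a robust residual threshold. The function name `flag_outliers`, the threshold of 3 and the median-absolute-deviation scale are choices made for this example, not part of the cards. Flagged points are candidates to look at, not automatic deletions.

```python
import numpy as np

def flag_outliers(x, y, threshold=3.0):
    """Flag points whose residual from a least-squares line is unusually large."""
    X = np.column_stack([np.ones_like(x), x])        # bias column plus the feature
    w, *_ = np.linalg.lstsq(X, y, rcond=None)        # simple linear model
    residuals = y - X @ w
    deviation = np.abs(residuals - np.median(residuals))
    # 1.4826 * MAD estimates the standard deviation under Gaussian noise.
    # If the MAD is zero (e.g. a perfect fit), any nonzero deviation is flagged.
    scale = 1.4826 * np.median(deviation)
    return deviation > threshold * scale

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[4] += 25.0                                         # inject one corrupted output
mask = flag_outliers(x, y)                           # boolean mask of suspected outliers
```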
Causes of outliers
- stochastic output noise
- system complexity that is not modelled (deterministic noise)