Data pre-processing Flashcards

Question 1

Q

Possible data pre-processing procedures

Answer

A

Input preprocessing:

1) input centering
2) input normalization
3) input whitening
4) data cleaning

PCA can also be considered a data pre-processing procedure.

Question 2

Q

Why is data pre-processing used?

Answer

A

Data should be pre-processed to make them suited for learning, and to achieve better results.
Pre-processing operations must be done without snooping the data!
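
A minimal sketch of what "no snooping" can look like in practice, assuming NumPy and a synthetic data set (the 80/20 split and all variable names are illustrative, not part of the card): the statistics of the transform are computed on the training portion only and then reused unchanged on the held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic inputs (assumption)

# Set the test points aside BEFORE computing any statistic.
X_train, X_test = X[:80], X[80:]

# The statistic comes from the training portion only...
mean_train = X_train.mean(axis=0)

# ...and the same fixed transform is applied to both portions.
Z_train = X_train - mean_train
Z_test = X_test - mean_train

# Snooping would be using X.mean(axis=0) over all 100 rows,
# which lets the test points influence the transform.
```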

Question 3

Q

Input preprocessing: general procedure

Answer

A

Given the input data matrix X ∈ R^(N×d),
input pre-processing consists of finding a standardization transform Φ that gives
zn = Φ(xn) for every n = 1, …, N

The final hypothesis will be
h(x) = h̃(Φ(x))
where h̃ is the hypothesis learned on the transformed data zn.
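
The composition can be sketched as follows, assuming NumPy, centering as the transform Φ, and least-squares linear regression as the learner. The helper names `fit_centering` and `fit_least_squares` are hypothetical and the data are synthetic; this is an illustration under those assumptions, not a prescribed implementation.

```python
import numpy as np

def fit_centering(X_train):
    """Return a transform phi whose parameters come from X_train only."""
    mean = X_train.mean(axis=0)
    return lambda X: X - mean

def fit_least_squares(Z, y):
    """Learn a linear hypothesis (with bias term) in the transformed z-space."""
    Zb = np.column_stack([np.ones(len(Z)), Z])
    w, *_ = np.linalg.lstsq(Zb, y, rcond=None)
    return lambda Zq: np.column_stack([np.ones(len(Zq)), Zq]) @ w

rng = np.random.default_rng(1)
X_train = rng.normal(10.0, 3.0, size=(50, 2))
y_train = X_train @ np.array([2.0, -1.0]) + 0.5   # noise-free linear target

phi = fit_centering(X_train)                        # the transform
h_tilde = fit_least_squares(phi(X_train), y_train)  # learned on z = phi(x)

def h(X):
    """Final hypothesis in x-space: transform first, then predict."""
    return h_tilde(phi(X))

X_new = np.array([[9.0, 12.0]])
prediction = h(X_new)
```

A new point is never fed to the learned hypothesis directly: it always passes through the same Φ that was fitted on the training data.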

Question 4

Q

Input centering

Answer

A

Goal: to remove any bias from the input.
zn = xn - x̄

x̄ = (1/N) Σ(n=1..N) xn   (mean of the data)
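
A small worked example of centering, assuming NumPy and a toy 3×2 data matrix whose rows are the points xn:

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

x_bar = X.mean(axis=0)   # per-feature mean: [2., 30.]
Z = X - x_bar            # broadcasting subtracts x_bar from every row

# After centering, each feature of Z has zero mean.
```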

Question 5

Q

Input normalization

Answer

A

Goal: to scale each input feature by its standard deviation, so that all features have unit variance.
Assuming the input is already centered:
zn = [zn1 … znd]' = [xn1/σ1 … xnd/σd]'
where
σi² = (1/N) Σ(n=1..N) xni² ,  i = 1, …, d
(variance of the i-th feature)
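
A minimal sketch of normalization, assuming NumPy and the 1/N variance convention used on this card. The function name `normalize` and the toy data are illustrative.

```python
import numpy as np

def normalize(X):
    """Center X, then divide each feature by its standard deviation."""
    X_centered = X - X.mean(axis=0)
    # sigma_i^2 = (1/N) * sum_n x_ni^2, computed on the centered data
    sigma = np.sqrt((X_centered ** 2).mean(axis=0))
    return X_centered / sigma, sigma

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])
Z, sigma = normalize(X)

# Each feature of Z now has zero mean and unit variance.
```

Note that a constant feature has σi = 0 and would cause a division by zero, so such features need to be dropped or handled separately.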

Question 6

Q

Input whitening

Answer

A

Goal: to decorrelate the input features, if it is known that they are correlated.
Assuming the input is already centered:
zn = A^(-1/2) xn
where A is the covariance matrix of the x's
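
A sketch of whitening under stated assumptions: NumPy, the 1/N covariance convention, and A^(-1/2) computed through the eigendecomposition of the symmetric matrix A (one of several valid ways to obtain it). The function name `whiten` and the synthetic correlated data are illustrative.

```python
import numpy as np

def whiten(X):
    """Center X and multiply by A^(-1/2), with A the covariance of the inputs."""
    Xc = X - X.mean(axis=0)
    A = Xc.T @ Xc / len(Xc)                # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(A)   # A is symmetric, so eigh applies
    A_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    # A_inv_sqrt is symmetric, so row n of the result is (A^(-1/2) x_n)'
    return Xc @ A_inv_sqrt

rng = np.random.default_rng(2)
# Mix two independent features to obtain correlated inputs.
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
Z = whiten(X)
cov_Z = Z.T @ Z / len(Z)   # close to the identity matrix
```

If A has a zero (or nearly zero) eigenvalue the inverse square root is not defined, so this sketch assumes a full-rank covariance matrix.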

Question 7

Q

Data cleaning

Answer

A

Goal: to remove outliers

- use simple models
- compute the leverage score of each point (validation)
- look at the data
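
One possible way to combine these ideas, sketched with NumPy: fit a simple linear model, take the leverage of each point as the diagonal of the hat matrix, and use it to obtain leave-one-out validation residuals. Reading "leverage" as the hat-matrix diagonal is an assumption (the card does not define it), and the function name `loo_residuals`, the data, and the planted outlier are all illustrative.

```python
import numpy as np

def loo_residuals(X, y):
    """Leave-one-out residuals of least-squares regression, without refitting."""
    Xb = np.column_stack([np.ones(len(X)), X])
    H = Xb @ np.linalg.pinv(Xb)      # hat matrix: fitted values are H @ y
    leverage = np.diag(H)            # h_nn, the influence of y_n on its own fit
    residuals = y - H @ y
    # For linear least squares, the leave-one-out residual is e_n / (1 - h_nn).
    return residuals / (1.0 - leverage), leverage

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=30)
y[12] += 3.0                         # planted outlier (assumption)

loo, leverage = loo_residuals(x.reshape(-1, 1), y)
suspect = int(np.argmax(np.abs(loo)))   # point with the largest validation error
```

A point flagged this way is only a candidate: it is still worth looking at the data before removing anything.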

Question 8

Q

Causes of outlier data

Answer

A

- Stochastic output noise
- System complexity not modelled (deterministic noise)
