pre-processing Flashcards
1
Q
general steps
A
FeFeSPoT
* transformations (center/scale, skewness, Box-Cox)
* feature extraction
* feature engineering
* predictor selection
* supervised vs unsupervised: supervised methods consider the outcome variable when transforming predictors (e.g., PLS), unsupervised methods do not (e.g., PCA)
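A minimal sketch of the center/scale transformation named above, in plain Python; the data values are made up:

```python
# Centering and scaling (standardizing) one predictor: subtract the
# mean, divide by the sample standard deviation. Illustrative data only.
from statistics import mean, stdev

def center_scale(xs):
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

x = [2.0, 4.0, 6.0, 8.0]
z = center_scale(x)
# After standardizing, z has mean 0 and sample standard deviation 1.
```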
2
Q
feature extraction
A
- aka signal extraction
- identifying and extracting features relevant for a particular problem
- eg, PCA (amounting to dimensionality reduction)
3
Q
one-hot encoding / indicator variables
A
one-hot encoding usually means giving every factor level its own 0/1 column, while indicator (dummy) variables typically leave one level out as the baseline (to avoid collinearity with an intercept)
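A small sketch of the difference, in plain Python; the factor levels and data here are hypothetical:

```python
# One-hot vs. dummy (indicator) encoding of a categorical predictor.
def one_hot(values, levels):
    """One 0/1 column per level: full one-hot encoding."""
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]

def dummy(values, levels):
    """Drop the first level to avoid collinearity with an intercept."""
    return [row[1:] for row in one_hot(values, levels)]

levels = ["red", "green", "blue"]
data = ["green", "red", "blue"]
oh = one_hot(data, levels)  # 3 columns, one per level
dm = dummy(data, levels)    # 2 columns; "red" is the baseline
```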
4
Q
maximum dissimilarity sampling
A
- a way to split test/train sets by ensuring maximal separation between instances in predictor space
- can also be conditioned on a per-class basis for classification problems
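One common greedy version of this idea can be sketched as follows (the point coordinates are made up):

```python
# Greedy maximum dissimilarity sampling: start from a seed point, then
# repeatedly add the candidate whose minimum Euclidean distance to the
# already-selected set is largest (the most dissimilar point).
import math

def max_dissim(points, seed_idx, n_select):
    selected = [seed_idx]
    while len(selected) < n_select:
        best = max(
            (i for i in range(len(points)) if i not in selected),
            key=lambda i: min(math.dist(points[i], points[j])
                              for j in selected),
        )
        selected.append(best)
    return selected

pts = [(0, 0), (0.1, 0.1), (5, 5), (5, 5.1), (10, 0)]
picked = max_dissim(pts, seed_idx=0, n_select=3)
# The near-duplicate of the seed, (0.1, 0.1), is never chosen.
```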
5
Q
resampling types
A
- LOOCV (leave-one-out cross-validation): for n samples, one is held out for testing, and the other n-1 are used for training
- LGOCV (leave-group-out), aka Monte Carlo cross-validation: fixes a train/test ratio, then repeatedly resamples the dataset to create random splits on the fly, over some number of repetitions
- bootstrap
- random test/train split produced by drawing n samples, with replacement, from the n total instances
- some samples will likely never be drawn (due to repeats); these are the “out of bag” samples, and are used for testing
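A bootstrap resample and its out-of-bag set can be sketched in plain Python (the sample size and seed are arbitrary):

```python
# One bootstrap resample: draw n indices with replacement; indices never
# drawn form the out-of-bag (OOB) set. On average, about 1 - 1/e (~63.2%)
# of the instances appear in the resample.
import random

def bootstrap_split(n, rng):
    in_bag = [rng.randrange(n) for _ in range(n)]  # n draws w/ replacement
    oob = sorted(set(range(n)) - set(in_bag))      # never-drawn indices
    return in_bag, oob

rng = random.Random(0)  # fixed seed for a reproducible split
in_bag, oob = bootstrap_split(10, rng)
```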
6
Q
principal component analysis (PCA)
A
- operates in predictor space, using all samples at once
- mutually orthogonal linear combinations of predictors that account for the most possible variance
- finds eigenvectors of the predictors’ covariance matrix (which is inherently symmetric); the covariances are over all samples
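The eigendecomposition route described above can be sketched with NumPy (the data matrix is made up; real predictors would typically be scaled as well as centered):

```python
# PCA via eigendecomposition of the predictors' covariance matrix.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Xc = X - X.mean(axis=0)           # center each predictor
cov = np.cov(Xc, rowvar=False)    # symmetric covariance matrix
vals, vecs = np.linalg.eigh(cov)  # eigh is for symmetric matrices
order = np.argsort(vals)[::-1]    # sort by variance explained, descending
components = vecs[:, order]       # columns are the principal components
scores = Xc @ components          # project samples onto the components
```

The components are mutually orthogonal by construction, and the first score column has the largest variance.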