pre-processing Flashcards
1
Q
general steps
A
FeFeSPoT
* transformations (center/scale, skewness, Box-Cox)
* feature extraction
* feature engineering
* predictor selection
* supervised vs unsupervised: supervised methods consider the outcome variable when transforming predictors (e.g., PLS), unsupervised methods do not (e.g., PCA)
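A minimal sketch of the center/scale transformation named above, in plain Python; the data values are made up:

```python
# Centering and scaling (standardizing) one predictor: subtract the
# mean, divide by the sample standard deviation. Illustrative data only.
from statistics import mean, stdev

def center_scale(xs):
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

x = [2.0, 4.0, 6.0, 8.0]
z = center_scale(x)
# After standardizing, z has mean 0 and sample standard deviation 1.
```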
2
Q
feature extraction
A
- aka signal extraction
- identifying and extracting features relevant for a particular problem
- eg, PCA (amounting to dimensionality reduction)
3
Q
one-hot encoding / indicator variables
A
one-hot encoding usually means giving every factor level its own 0/1 column, while indicator (dummy) variables typically leave one level out as the baseline (to avoid collinearity with an intercept)
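A small sketch of the difference, in plain Python; the factor levels and data here are hypothetical:

```python
# One-hot vs. dummy (indicator) encoding of a categorical predictor.
def one_hot(values, levels):
    """One 0/1 column per level: full one-hot encoding."""
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]

def dummy(values, levels):
    """Drop the first level to avoid collinearity with an intercept."""
    return [row[1:] for row in one_hot(values, levels)]

levels = ["red", "green", "blue"]
data = ["green", "red", "blue"]
oh = one_hot(data, levels)  # 3 columns, one per level
dm = dummy(data, levels)    # 2 columns; "red" is the baseline
```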
4
Q
maximum dissimilarity sampling
A
- a way to split test/train sets by ensuring maximal separation between instances in predictor space
- can also be conditioned on a per-class basis for classification problems
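One common greedy version of this idea can be sketched as follows (the point coordinates are made up):

```python
# Greedy maximum dissimilarity sampling: start from a seed point, then
# repeatedly add the candidate whose minimum Euclidean distance to the
# already-selected set is largest (the most dissimilar point).
import math

def max_dissim(points, seed_idx, n_select):
    selected = [seed_idx]
    while len(selected) < n_select:
        best = max(
            (i for i in range(len(points)) if i not in selected),
            key=lambda i: min(math.dist(points[i], points[j])
                              for j in selected),
        )
        selected.append(best)
    return selected

pts = [(0, 0), (0.1, 0.1), (5, 5), (5, 5.1), (10, 0)]
picked = max_dissim(pts, seed_idx=0, n_select=3)
# The near-duplicate of the seed, (0.1, 0.1), is never chosen.
```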
5
Q
resampling types
A
- LOOCV (leave-one-out cross-validation): for n samples, one is held out for testing, and the other n-1 are used for training
- LGOCV (leave-group-out), aka Monte Carlo cross-validation: fixes a train/test ratio, then repeatedly resamples the dataset to create random splits on the fly, over some number of repetitions
- bootstrap
- random test/train split produced by drawing n samples, with replacement, from the n total instances
- some samples will likely never be drawn (due to repeats); these are the “out of bag” samples, and are used for testing
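A bootstrap resample and its out-of-bag set can be sketched in plain Python (the sample size and seed are arbitrary):

```python
# One bootstrap resample: draw n indices with replacement; indices never
# drawn form the out-of-bag (OOB) set. On average, about 1 - 1/e (~63.2%)
# of the instances appear in the resample.
import random

def bootstrap_split(n, rng):
    in_bag = [rng.randrange(n) for _ in range(n)]  # n draws w/ replacement
    oob = sorted(set(range(n)) - set(in_bag))      # never-drawn indices
    return in_bag, oob

rng = random.Random(0)  # fixed seed for a reproducible split
in_bag, oob = bootstrap_split(10, rng)
```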
6
Q
principal component analysis (PCA)
A
- operates in predictor space, using all samples at once
- mutually orthogonal linear combinations of predictors that account for the most possible variance
- finds eigenvectors of the predictors’ covariance matrix (which is inherently symmetric); the covariances are over all samples
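The eigendecomposition route described above can be sketched with NumPy (the data matrix is made up; real predictors would typically be scaled as well as centered):

```python
# PCA via eigendecomposition of the predictors' covariance matrix.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Xc = X - X.mean(axis=0)           # center each predictor
cov = np.cov(Xc, rowvar=False)    # symmetric covariance matrix
vals, vecs = np.linalg.eigh(cov)  # eigh is for symmetric matrices
order = np.argsort(vals)[::-1]    # sort by variance explained, descending
components = vecs[:, order]       # columns are the principal components
scores = Xc @ components          # project samples onto the components
```

The components are mutually orthogonal by construction, and the first score column has the largest variance.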