Preprocessing Flashcards
1
Q
Supervised learning pipeline
A
- Analysis of the problem
- Collection, analysis and cleaning of data
- Preprocessing and missing values
- Study of correlations among variables
- Feature Selection/Weighting/Learning
- Choice of the predictor and Model Selection
- Test
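A minimal sketch of such a pipeline, assuming scikit-learn (the card does not prescribe a library); the dataset, preprocessing step, predictor, and hyperparameter grid are illustrative choices only:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hold out a test set for the final "Test" step
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Preprocessing + choice of the predictor chained together
pipe = Pipeline([
    ("scale", StandardScaler()),                  # preprocessing
    ("clf", LogisticRegression(max_iter=1000)),   # predictor
])

# Model selection via cross-validated grid search on the training data
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# Final evaluation on the held-out test set
print(grid.score(X_test, y_test))
```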
2
Q
What are some representations for data types?
A
- vectors
- strings
- sets and bags -> set of terms/words
- tensors -> images and videos
- trees and graphs
- compound structures
3
Q
Features representation
A
- can be categorical [symbolic] or quantitative [numeric] features
- categorical
- nominal -> no order among the possible values [e.g. the brand of a car]
- ordinal -> values have an order, but the distance between consecutive values is not guaranteed to be constant [military rank]
- quantitative
- interval -> typically discrete, enumerable values on a scale [review stars]
- ratio -> numeric measurements with a meaningful zero [weight]
4
Q
How are categorical variables encoded?
A
- OneHot Encoding
- a categorical variable can be represented as a boolean vector with as many components as the number of possible values for the variable
- Brand: Fiat [c0], Toyota [c1], Ford[c2]
- Color: White [c3], Black [c4], Red [c5]
- Type: Subcompact [c6], Sports [c7]
- (Toyota, Red, Subcompact) -> [0,1,0,0,0,1,1,0]
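A minimal sketch of the example above, assuming scikit-learn's OneHotEncoder (the card does not prescribe a library); the category order is fixed to match the component indices c0..c7:
```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(
    categories=[["Fiat", "Toyota", "Ford"],   # Brand -> c0..c2
                ["White", "Black", "Red"],    # Color -> c3..c5
                ["Subcompact", "Sports"]],    # Type  -> c6..c7
)
# fit only registers the categories here, since they are given explicitly
encoder.fit([["Fiat", "White", "Subcompact"]])

print(encoder.transform([["Toyota", "Red", "Subcompact"]]).toarray())
# -> [[0. 1. 0. 0. 0. 1. 1. 0.]]
```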
5
Q
How are continuous variables encoded?
A
- more difficult than categorical variables; preprocessing is needed
- features must first be transformed so that they become comparable to one another
- standardization -> centering (zero mean) and variance scaling (unit variance)
- scaling in a range -> mapping each feature into a fixed interval, e.g. [0, 1]
- normalization -> rescaling each instance to unit norm
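A minimal sketch of these transformations, assuming scikit-learn (not named in the card); the toy data is illustrative:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: centering (zero mean) + variance scaling (unit variance), per feature
print(StandardScaler().fit_transform(X))

# Scaling in a range: map each feature into [0, 1]
print(MinMaxScaler().fit_transform(X))

# Normalization: rescale each sample (row) to unit norm
print(Normalizer().fit_transform(X))
```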
6
Q
Feature selection
A
- reduction of the dimensionality of the features
- removing irrelevant or redundant features
- interpretability of the model maintained
- filter methods
- an efficient scoring function, independent of any predictor, determines the usefulness of each feature
- wrapper methods
- a predictor is evaluated on a hold-out sample using different subsets of features [RFE for SVMs]
- embedded methods
- feature selection happens in conjunction with the construction of the model, e.g. by modifying its objective function [regularization, LASSO]
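A minimal sketch of the three families, assuming scikit-learn (not named in the card); the dataset, hyperparameters, and the use of Lasso regression on 0/1 labels are illustrative assumptions:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.linear_model import Lasso

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # put features on comparable scales first

# Filter: score each feature independently of any predictor (ANOVA F-score here)
X_filter = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Wrapper: Recursive Feature Elimination wrapped around a linear SVM (RFE for SVMs)
rfe = RFE(LinearSVC(max_iter=10000), n_features_to_select=10).fit(X, y)
X_wrapper = X[:, rfe.support_]

# Embedded: L1 regularization (LASSO) drives some coefficients exactly to zero;
# here the 0/1 labels are treated as regression targets purely for illustration
lasso = Lasso(alpha=0.05).fit(X, y)
X_embedded = X[:, lasso.coef_ != 0]

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```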
7
Q
Feature extraction
A
- reduction of the dimensionality of the features
- combining features
- interpretability of the model lost
- Principal Component Analysis
- converts a set of instances described by possibly correlated features into corresponding values on a new set of linearly uncorrelated features (the principal components)
- components are orthogonal; the first captures the direction of greatest variance, each subsequent one the greatest remaining variance
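A minimal sketch of PCA, assuming scikit-learn (not named in the card); the dataset and the number of components are illustrative:
```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 original (possibly correlated) features onto 2 orthogonal
# principal components, ordered by the variance they explain.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # fraction of variance captured by each component
```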