Preprocessing Flashcards
1
Q
Supervised learning pipeline
A
- Analysis of the problem
- Collection, analysis and cleaning of data
- Preprocessing and missing values
- Study of correlations among variables
- Feature Selection/Weighting/Learning
- Choice of the predictor and Model Selection
- Test
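A minimal sketch of such a pipeline, assuming scikit-learn (the card does not prescribe a library); the dataset, preprocessing step, predictor, and hyperparameter grid are illustrative choices only:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hold out a test set for the final "Test" step
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Preprocessing + choice of the predictor chained together
pipe = Pipeline([
    ("scale", StandardScaler()),                  # preprocessing
    ("clf", LogisticRegression(max_iter=1000)),   # predictor
])

# Model selection via cross-validated grid search on the training data
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# Final evaluation on the held-out test set
print(grid.score(X_test, y_test))
```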
2
Q
What are some representations for data types?
A
- vectors
- strings
- sets and bags -> set of terms/words
- tensors -> images and videos
- trees and graphs
- compound structures
3
Q
Features representation
A
- can be categorical [symbolic] or quantitative [numeric] features
- categorical
- nominal -> no order among the possible values [e.g. the brand of a car]
- ordinal -> values have an order, but the distance between consecutive values is not guaranteed to be constant [military rank]
- quantitative
- interval -> typically discrete, enumerable values on a scale [review stars]
- ratio -> numeric measurements with a meaningful zero [weight]
4
Q
How are categorical variables encoded?
A
- OneHot Encoding
- a categorical variable can be represented as a boolean vector with as many components as the number of possible values for the variable
- Brand: Fiat [c0], Toyota [c1], Ford[c2]
- Color: White [c3], Black [c4], Red [c5]
- Type: Subcompact [c6], Sports [c7]
- (Toyota, Red, Subcompact) -> [0,1,0,0,0,1,1,0]
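A minimal sketch of the example above, assuming scikit-learn's OneHotEncoder (the card does not prescribe a library); the category order is fixed to match the component indices c0..c7:
```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(
    categories=[["Fiat", "Toyota", "Ford"],   # Brand -> c0..c2
                ["White", "Black", "Red"],    # Color -> c3..c5
                ["Subcompact", "Sports"]],    # Type  -> c6..c7
)
# fit only registers the categories here, since they are given explicitly
encoder.fit([["Fiat", "White", "Subcompact"]])

print(encoder.transform([["Toyota", "Red", "Subcompact"]]).toarray())
# -> [[0. 1. 0. 0. 0. 1. 1. 0.]]
```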
5
Q
How are continuous variables encoded?
A
- more difficult than categorical variables; preprocessing is needed
- features must first be transformed so that they become comparable to one another
- standardization -> centering (zero mean) and variance scaling (unit variance)
- scaling in a range -> mapping each feature into a fixed interval, e.g. [0, 1]
- normalization -> rescaling each instance to unit norm
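A minimal sketch of these transformations, assuming scikit-learn (not named in the card); the toy data is illustrative:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: centering (zero mean) + variance scaling (unit variance), per feature
print(StandardScaler().fit_transform(X))

# Scaling in a range: map each feature into [0, 1]
print(MinMaxScaler().fit_transform(X))

# Normalization: rescale each sample (row) to unit norm
print(Normalizer().fit_transform(X))
```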
6
Q
Feature selection
A
- reduction of the dimensionality of the features
- removing irrelevant or redundant features
- interpretability of the model maintained
- filter methods
- an efficient scoring function, independent of any predictor, determines the usefulness of each feature
- wrapper methods
- a predictor is evaluated on a hold-out sample using different subsets of features [RFE for SVMs]
- embedded methods
- feature selection happens in conjunction with the construction of the model, e.g. by modifying its objective function [regularization, LASSO]
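A minimal sketch of the three families, assuming scikit-learn (not named in the card); the dataset, hyperparameters, and the use of Lasso regression on 0/1 labels are illustrative assumptions:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.linear_model import Lasso

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # put features on comparable scales first

# Filter: score each feature independently of any predictor (ANOVA F-score here)
X_filter = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Wrapper: Recursive Feature Elimination wrapped around a linear SVM (RFE for SVMs)
rfe = RFE(LinearSVC(max_iter=10000), n_features_to_select=10).fit(X, y)
X_wrapper = X[:, rfe.support_]

# Embedded: L1 regularization (LASSO) drives some coefficients exactly to zero;
# here the 0/1 labels are treated as regression targets purely for illustration
lasso = Lasso(alpha=0.05).fit(X, y)
X_embedded = X[:, lasso.coef_ != 0]

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```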
7
Q
Feature extraction
A
- reduction of the dimensionality of the features
- combining features
- interpretability of the model lost
- Principal Component Analysis
- converts a set of instances described by possibly correlated features into corresponding values on a new set of linearly uncorrelated features (the principal components)
- components are orthogonal; the first captures the direction of greatest variance, each subsequent one the greatest remaining variance
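A minimal sketch of PCA, assuming scikit-learn (not named in the card); the dataset and the number of components are illustrative:
```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 original (possibly correlated) features onto 2 orthogonal
# principal components, ordered by the variance they explain.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # fraction of variance captured by each component
```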