Preprocessing Flashcards

1
Q

Supervised learning pipeline

A
  • Analysis of the problem
  • Collection, analysis and cleaning of data
  • Preprocessing and missing values
  • Study of correlations among variables
  • Feature Selection/Weighting/Learning
  • Choice of the predictor and Model Selection
  • Test
2
Q

What are some representations for data types?

A
  • vectors
  • strings
  • sets and bags -> set of terms/words
  • tensors -> images and videos
  • trees and graphs
  • compound structures
3
Q

Features representation

A
  • features can be categorical [symbolic] or quantitative [numeric]
  • categorical
    • nominal -> no order among the possible values [book attributes, car brand]
    • ordinal -> the values have an order, but the distance between consecutive values is not guaranteed to be the same [military rank]
  • quantitative
    • interval -> typically discrete, enumerable values [review stars]
    • ratio -> actual numeric values [weight]
4
Q

How are categorical variables encoded?

A
  • OneHot Encoding
  • categorical variables can be represented in a vector with as many
    components as the number of possible values for the variable
  • boolean vector
    • Brand: Fiat [c0], Toyota [c1], Ford[c2]
    • Color: White [c3], Black [c4], Red [c5]
    • Type: Subcompact [c6], Sports [c7]
    • (Toyota, Red, Subcompact) -> [0,1,0,0,0,1,1,0]
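The car example above can be reproduced with a short sketch (plain Python; the category lists and helper name are assumptions taken from the card, not a fixed API):

```python
# Possible values per attribute, in the component order c0..c7 from the card
CATEGORIES = {
    "brand": ["Fiat", "Toyota", "Ford"],
    "color": ["White", "Black", "Red"],
    "type": ["Subcompact", "Sports"],
}

def one_hot(brand, color, car_type):
    """Concatenate one boolean indicator per possible value of each attribute."""
    vec = []
    for field, value in (("brand", brand), ("color", color), ("type", car_type)):
        vec += [1 if v == value else 0 for v in CATEGORIES[field]]
    return vec

print(one_hot("Toyota", "Red", "Subcompact"))  # [0, 1, 0, 0, 0, 1, 1, 0]
```

The resulting vector has one component per possible value (3 + 3 + 2 = 8 here), with exactly one 1 per attribute.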
5
Q

How are continuous variables encoded?

A
  • more difficult than categorical; preprocessing is needed
  • features must first be transformed to be comparable to one another
    • standardization
      • centering
      • variance scaling
    • scaling in a range
    • normalization
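The transformations listed above can be sketched in a few lines of NumPy (a minimal illustration on a toy feature vector, not a full preprocessing routine):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# centering: subtract the mean -> zero-mean feature
centered = x - x.mean()

# standardization: centering + variance scaling -> zero mean, unit variance
standardized = centered / x.std()

# scaling in a range: min-max scaling maps the values into [0, 1]
in_range = (x - x.min()) / (x.max() - x.min())

# normalization: rescale the vector to unit Euclidean norm
normalized = x / np.linalg.norm(x)

print(standardized.mean(), standardized.std())  # ~0.0 and 1.0
```

After such transformations, features measured on very different scales (e.g. weight in kg vs. price in euros) become comparable.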
6
Q

Feature selection

A
  • reduction of the dimensionality of the features
    • removing irrelevant or redundant features
    • interpretability of the model maintained
  • filter methods
    • an efficient scoring function rates the usefulness of each feature
  • wrapper methods
    • a predictor is evaluated on a hold-out sample using different subsets of features [RFE for SVMs]
  • embedded methods
    • feature selection occurs in conjunction with the creation of the model (e.g. by modifying the objective function) [regularization, LASSO]
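A filter method can be illustrated with a toy scoring function (absolute Pearson correlation with the target is one common choice; the function name and data are assumptions for this sketch):

```python
import numpy as np

def filter_select(X, y, k):
    """Toy filter method: score each feature by the absolute Pearson
    correlation with the target and keep the k highest-scoring ones."""
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 2] + rng.normal(scale=0.1, size=100)  # only feature 2 is informative
print(filter_select(X, y, 1))  # the informative feature (index 2) is selected
```

Unlike wrapper methods, no predictor is trained during scoring, which is what makes filters efficient.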
7
Q

Feature extraction

A
  • reduction of the dimensionality of the features
    • combining features
    • interpretability of the model lost
  • Principal Component Analysis
    • converts a set of instances with possibly correlated features into corresponding values on another set of linearly uncorrelated features
    • the new components are orthogonal, ordered by the direction with most variance
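A minimal PCA sketch via the eigendecomposition of the covariance matrix (one of several equivalent formulations; the function name and random data are assumptions):

```python
import numpy as np

def pca(X, k):
    """Center the data, take the top-k eigenvectors of the covariance
    matrix (the principal directions), and project onto them."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    components = eigvecs[:, ::-1][:, :k]     # top-k directions, most variance first
    return Xc @ components

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
Z = pca(X, 2)
# the projected features are linearly uncorrelated: off-diagonal covariance ~ 0
print(np.cov(Z, rowvar=False).round(6))
```

The projection keeps the k directions of highest variance, which is why interpretability is lost: each new feature is a linear combination of all original ones.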