3. Feature Engineering Flashcards
What is feature engineering?
The process of transforming raw data into useful features for model training.
What are the primary reasons for data transformation?
Data compatibility, e.g., converting string-typed data to numerical data.
Data quality, e.g., converting text to lowercase.
What are the approaches for consistent data preprocessing?
Pretraining data transformation: transform the data before training. Adv: the transformation is performed only once.
Disadv: any change to the transformation requires rerunning it over the whole dataset.
Inside-model data transformation: the transformation is part of the model code, so it is applied identically at training and prediction time.
Adv: the raw data and the transformation logic stay decoupled, so transformations can be changed without regenerating the dataset.
Disadv: the transformation adds to model latency at prediction time.
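A minimal sketch of the two approaches, assuming scikit-learn is available; the arrays, labels, and model choice are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

raw_X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])
raw_y = np.array([0, 0, 1, 1])

# 1) Pretraining transformation: scale the dataset once, store the result,
#    and train on the already-transformed copy.
scaler = StandardScaler().fit(raw_X)
X_transformed = scaler.transform(raw_X)  # done once; must be redone if the logic changes
model_a = LogisticRegression().fit(X_transformed, raw_y)

# 2) Inside-model transformation: the scaler is part of the model object,
#    so the same code runs at training and prediction time (adds latency).
model_b = make_pipeline(StandardScaler(), LogisticRegression()).fit(raw_X, raw_y)
print(model_b.predict(raw_X[:2]))
```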
What is encoding for structured data types?
Categorical data must be converted to numerical form because most models can't handle categorical data directly.
What are the two kinds of transformation that may be needed for integer or floating-point data?
Normalization and bucketing
Why do we need to normalize data with various ranges?
Features with very different ranges slow convergence for models trained with gradient descent.
A very wide range of values in a single feature can lead to NaN errors (numeric overflow) in some models.
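A small sketch of two common normalizations, using NumPy only; the array `x` is made-up example data.

```python
import numpy as np

x = np.array([2.0, 10.0, 300.0, 4000.0, 55000.0])

# Min-max scaling: map values into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: zero mean, unit variance.
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)
print(x_zscore)
```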
What are the two ways of bucketing?
Bucketing transforms numeric data into categorical data.
Buckets with equal-spaced boundaries
Buckets with quantile boundaries
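A sketch of both bucketing strategies with pandas (assumed available); the `ages` values and the bucket count of 4 are made up.

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 67, 80])

# Equal-spaced boundaries: each bucket covers the same value range.
equal_width = pd.cut(ages, bins=4)

# Quantile boundaries: each bucket holds roughly the same number of rows.
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```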
What is label encoding?
Convert text categories to numeric values while preserving their order, e.g., small, medium, big become 1, 2, 3.
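A minimal label-encoding sketch with pandas; the size categories and their order come from the flashcard's example.

```python
import pandas as pd

sizes = pd.Series(["small", "medium", "big", "small", "big"])
order = {"small": 1, "medium": 2, "big": 3}  # the mapping preserves the natural order
encoded = sizes.map(order)
print(encoded.tolist())  # [1, 2, 3, 1, 3]
```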
What is Out of Vocab?
Out of vocabulary (OOV) is a special catch-all category for rare or unseen values, so the ML system doesn't waste training effort on rare outliers.
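A sketch of an OOV bucket with pandas: categories seen fewer than `min_count` times (a made-up threshold) are collapsed into a single "OOV" value.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "blue", "red", "teal", "mauve"])
min_count = 2
counts = colors.value_counts()
vocab = set(counts[counts >= min_count].index)

# Keep frequent categories, replace everything else with the OOV marker.
encoded = colors.where(colors.isin(vocab), other="OOV")
print(encoded.tolist())  # ['red', 'blue', 'red', 'blue', 'red', 'OOV', 'OOV']
```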
What is one-hot encoding?
Create a binary (dummy) variable for each category; used for categorical variables where order does not matter.
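A one-hot encoding sketch with pandas; the `fruit` column is made up.

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "banana", "apple", "cherry"]})

# One binary column per category; no order is implied between the columns.
one_hot = pd.get_dummies(df["fruit"], prefix="fruit")
print(one_hot)
```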
What is feature hashing?
Apply a hash function to the values of a categorical feature and use the hash value (modulo a fixed number of buckets) as the feature index.
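A sketch of the hashing trick; the bucket count of 8 and the md5-based hash are illustrative choices (md5 is used because Python's built-in `hash()` is salted across runs).

```python
import hashlib

def hash_bucket(value: str, num_buckets: int = 8) -> int:
    # Hash the category string into one of a fixed number of buckets,
    # so even unseen values map to a valid index.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

for city in ["london", "paris", "tokyo", "a-brand-new-city"]:
    print(city, "->", hash_bucket(city))
```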
What is hybrid of hashing and vocabulary?
Use a vocabulary for the important categories
Use hashing for the less important categories
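A sketch of the hybrid scheme: frequent or important categories get their own vocabulary index, everything else falls through to hash buckets. The `vocab` contents and the bucket count are illustrative.

```python
import hashlib

vocab = {"london": 0, "paris": 1, "tokyo": 2}  # reserved slots for important values
num_hash_buckets = 5

def encode(value: str) -> int:
    if value in vocab:
        return vocab[value]
    digest = int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)
    return len(vocab) + digest % num_hash_buckets  # hashed ids start after the vocab

for v in ["paris", "springfield", "gotham"]:
    print(v, "->", encode(v))
```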
What is embedding?
An embedding represents a categorical feature as a dense, continuous-valued vector that is typically learned during training.
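A minimal embedding sketch assuming PyTorch is installed; the vocabulary size of 1000 and the 16-dimensional vectors are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

# Map each of 1000 category ids to a learned 16-dimensional vector.
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=16)

category_ids = torch.tensor([3, 42, 999])  # integer-encoded categories
vectors = embedding(category_ids)          # shape: (3, 16), trained with the rest of the model
print(vectors.shape)
```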
What is feature selection?
Select the subset of features that is most useful to the model for predicting the target variable.
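A feature-selection sketch with scikit-learn (assumed available): keep the k features with the strongest univariate relationship to the target; the iris dataset and k=2 are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
```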
What are the benefits of dimensionality reduction?
Reduces noise in the data
Reduces overfitting
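A dimensionality-reduction sketch using PCA from scikit-learn; keeping 2 components on the iris dataset is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)      # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)       # variance captured by each kept component
```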