Feature Engineering Flashcards
DS
What are some naive feature engineering techniques that improve model efficiency?
- Summary Statistics (mean, median, min, max, std) for each GROUP of SIMILAR records; e.g., all male customers aged 32-44 would get their own summary stats.
- Interactions between ratios or features, e.g., var1/var2 or var1*var2
- Summaries of Features; e.g., the number of purchases a customer made in the last 30 days (raw features may be last 10 purchase dates)
- Splitting Feature Information Manually; e.g., whether a customer is taller than 6’ may be a critical piece of information when recommending a car vs. an SUV.
- kNN Features: using records in the training set to produce a “kNN” feature (e.g., the average target value of a record’s k nearest neighbors) that is fed into another model. A pandas sketch of a few of these techniques follows this list.
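A minimal pandas sketch of a few of these techniques, assuming a made-up purchases table with hypothetical customer_id, gender, age, and purchase_amount columns:

```python
import pandas as pd

# Hypothetical raw purchase records (columns and values are illustrative)
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "gender": ["M", "M", "F", "F", "M"],
    "age": [35, 35, 40, 40, 29],
    "purchase_amount": [20.0, 35.0, 15.0, 50.0, 10.0],
})

# Summary statistics for each group of similar records (here: grouped by gender)
df["amt_mean_by_gender"] = df.groupby("gender")["purchase_amount"].transform("mean")
df["amt_std_by_gender"] = df.groupby("gender")["purchase_amount"].transform("std")

# Interaction / ratio between features
df["amount_per_year_of_age"] = df["purchase_amount"] / df["age"]

# Summary of features: number of purchases per customer
df["n_purchases"] = df.groupby("customer_id")["purchase_amount"].transform("count")

# Manually splitting feature information into a flag
df["is_over_32"] = (df["age"] > 32).astype(int)
```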
What are three methods for scaling your data?
- “Normalization” or “Scaling” are general terms that refer to transforming your input data to a new scale (often a linear transformation) such as 0 through 1, -1 through 1, 0 through 10, etc.
- Min-Max: linear transformation of data that maps the minimum value to 0 and the maximum value to 1.
- Standardization: transforms each feature to a normal distribution with mean 0 and standard deviation of 1. Also known as Z-score transformation.
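A minimal scikit-learn sketch of both methods; the array X is made-up illustrative data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 500.0]])  # two features on very different scales

X_minmax = MinMaxScaler().fit_transform(X)      # each column mapped to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # each column to mean 0, std 1
```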
Explain one major drawback to each of the scaling methods.
- Normalization/scaling in general is sensitive to outliers, since a few extreme values will compress most of the data into a narrow range and make values appear extremely close together.
- Min-Max scaling is similarly sensitive to outliers: an extreme minimum or maximum compresses most values and makes them appear close together (see the sketch after this list).
- Standardization (Z-score transformation) rescales to an unbounded interval, which can be problematic for certain algorithms, e.g. some neural networks, that expect input values to be inside a certain range.
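A tiny sketch of the outlier problem, using made-up numbers: one extreme value compresses the rest of the min-max-scaled data into a narrow band near 0.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])      # 1000 is an outlier
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)  # roughly [0, 0.001, 0.002, 0.003, 1]: the non-outlier values are nearly indistinguishable
```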
When should you scale your data and why?
When your algorithm weights each input (e.g., via the gradient descent used by many neural nets) or relies on distance metrics (e.g., kNN), model performance can be improved by normalizing, standardizing, or otherwise scaling your data so that each feature is given relatively equal weight.
Scaling is also important when features are measured in different units; e.g., if feature A is measured in inches, feature B in feet, and feature C in dollars, these features should be scaled so that they are weighted and/or represented equally.
In some cases, predictive performance will not change but perceived feature importance may, e.g. the coefficients in a linear regression.
Note: scaling your data typically does not change performance or feature importance for TREE-based models, since the split points will simply shift to compensate for the rescaled values.
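A rough sketch of why this matters for a distance-based model, assuming two hypothetical features (height in inches, income in dollars):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two customers: [height_in_inches, income_in_dollars] (made-up values)
X = np.array([[60.0, 50000.0],
              [75.0, 50500.0]])

# Unscaled Euclidean distance is dominated by the dollar-valued feature
print(np.linalg.norm(X[0] - X[1]))  # ~500.2

# After standardization, both features contribute comparably
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # ~2.8
```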
Describe basic feature encoding for categorical variables.
Feature encoding involves replacing classes in a categorical variable with new values such as integers or real values; e.g., [‘red’, ‘blue’, ‘green’] could be encoded as [8, 5, 11].
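A minimal sketch of this idea, using the arbitrary illustrative codes from the example above:

```python
colors = ["red", "blue", "green", "blue"]
mapping = {"red": 8, "blue": 5, "green": 11}  # arbitrary integer codes
encoded = [mapping[c] for c in colors]        # [8, 5, 11, 5]
```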
When should you encode your features and why?
You should encode your categorical features so that they may be processed by algorithms, e.g. so that ML algorithms can learn from them.
What are three encoding methods for categorical features?
- Label Encoding (non-ordinal): each category is assigned a numeric value not representing any ordering. e.g., [‘red’, ‘blue’, ‘green’] could be encoded as [8, 5, 11].
- Label Encoding (Ordinal): each category is encoded with a value representing an ordering. e.g. [‘small’, ‘medium’, ‘large’] could be encoded as [1, 2, 3].
- One-Hot Encoding: each category is transformed into its own binary feature, with each record marked 1 or 0 for that category. e.g., colors = [‘red’, ‘blue’, ‘green’] could be encoded as ‘red’ = [1, 0, 0], ‘blue’ = [0, 1, 0], and ‘green’ = [0, 0, 1]. A sketch of all three encodings follows.
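A sketch of the three encodings using pandas and scikit-learn; the 'color' and 'size' columns are made-up examples:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "blue", "green"],
                   "size": ["small", "medium", "large"]})

# Label encoding (non-ordinal): an arbitrary integer per category
df["color_label"] = LabelEncoder().fit_transform(df["color"])

# Label encoding (ordinal): an explicit ordering is preserved
size_order = {"small": 1, "medium": 2, "large": 3}
df["size_ordinal"] = df["size"].map(size_order)

# One-hot encoding: one binary column per category
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)
```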