Feature Engineering Flashcards
When is a value missing not at random (MNAR)?
This is when a value is missing because of the true value itself. For instance, people with high incomes may decline to disclose their income
When is a value missing at random (MAR)?
This is when the reason a value is missing is not the value itself but another observed variable. For example, age values are often missing for respondents of gender A, because people of gender A generally do not like to disclose their age
When is a value missing completely at random (MCAR)?
This is when there is no pattern in when the value is missing. For instance, people forgot to fill in the value in a survey
What are the two ways of dealing with missing values?
- Deletion
- Imputation
What are the types of deletion when dealing with missing values and when do you use which?
- Column deletion (if a large share of the column's values is missing and you are confident the column can be removed)
- Row deletion (if the values are MCAR and the number of affected examples is small, e.g. 0.1%)
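The two deletion strategies can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the toy data, the helper names, and the 0.5 threshold are all assumptions made for the example.

```python
# Toy dataset (hypothetical): None marks a missing value.
rows = [
    {"age": 34, "income": None, "city": "Oslo"},
    {"age": 51, "income": None, "city": "Lima"},
    {"age": None, "income": 42000, "city": "Pune"},
    {"age": 29, "income": None, "city": "Kyiv"},
]


def missing_fraction(rows, column):
    """Share of rows in which `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)


def drop_columns(rows, max_missing=0.5):
    """Column deletion: drop columns whose missing share exceeds max_missing.

    The 0.5 default is an arbitrary choice for this sketch.
    """
    keep = [c for c in rows[0] if missing_fraction(rows, c) <= max_missing]
    return [{c: r[c] for c in keep} for r in rows]


def drop_rows(rows):
    """Row deletion: drop every row that still contains a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


without_sparse_columns = drop_columns(rows)  # "income" is 75% missing, so it goes
cleaned = drop_rows(without_sparse_columns)  # the row with a missing age goes
```

Row deletion is only safe here under the MCAR assumption. If the values are MNAR or MAR, removing rows can bias the remaining data.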
What are the types of imputation when dealing with missing values?
- Default values (e.g. an empty string)
- Mean, median, or mode
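These imputation strategies can be sketched with the standard library. The function name `impute` and the sample values are hypothetical, chosen only for illustration.

```python
from statistics import mean, median, mode


def impute(values, strategy="median", default=""):
    """Replace None entries with a default or with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "default":
        fill = default
    else:
        fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


ages = [20, 30, None, 30, 100]  # hypothetical data with one missing value

by_mean = impute(ages, "mean")      # fills with 45, pulled up by the outlier 100
by_median = impute(ages, "median")  # fills with 30
by_mode = impute(ages, "mode")      # fills with 30
names = impute(["Ann", None], "default")  # fills with the empty string
```

The difference between the mean and median results shows why the median is often preferred when the feature has outliers.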
What is feature scaling?
The process of scaling features so that they have similar ranges
How do you scale features to get them to be in the range [0, 1] given variable x?
x_scaled = (x - min(x)) / (max(x) - min(x))
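A minimal sketch of min-max scaling to [0, 1]. The function name is hypothetical, and the handling of a constant feature is an assumption of this example, since the formula is undefined when max(x) equals min(x).

```python
def min_max_scale(values):
    """Scale values to [0, 1] by subtracting the min and dividing by the range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: a constant feature maps to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([10, 20, 40])  # smallest value -> 0.0, largest -> 1.0
```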
What is standardization and when should it be used in feature scaling?
A process that normalizes features so that they have zero mean and unit variance. It is a good choice when the variables appear to follow a normal distribution. x_standardized = (x - x_mean) / standard_deviation
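Standardization can be sketched as follows. This example assumes the population standard deviation (`statistics.pstdev`); some libraries and textbooks use the sample standard deviation instead. The function name and the data are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift values to zero mean and scale them to unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mu) / sigma for v in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
standardized = standardize(data)
```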
What are two points of attention when applying feature scaling?
- It’s a common source of data leakage
- It often requires global statistics. You need the whole training set to calculate the min, max, or mean, and these statistics are reused at inference time. If the new data has changed significantly compared to the training data, the statistics won’t be useful
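Both points can be illustrated with a scaler that learns its statistics from the training split only. `TrainFittedScaler` is a hypothetical helper written for this sketch, not a library class, and the data is made up.

```python
class TrainFittedScaler:
    """Hypothetical min-max scaler whose statistics come from training data only."""

    def fit(self, train_values):
        # Statistics are computed once, on the training split.
        self.lo = min(train_values)
        self.hi = max(train_values)
        return self

    def transform(self, values):
        # The stored training statistics are reused for any later data.
        return [(v - self.lo) / (self.hi - self.lo) for v in values]


train = [0, 50, 100]
test = [25, 120]  # 120 lies outside the range seen in training

scaler = TrainFittedScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
```

Fitting on the combined train and test data would leak test information into training. Fitting on the training split avoids that, but a test value such as 120 then scales to more than 1, which shows how stale statistics stop being useful once the data shifts.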
What is discretization?
The process of turning a continuous feature into a discrete feature
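A minimal sketch of discretization by bucketing. The boundaries and labels below are hypothetical and would normally be chosen from domain knowledge or from quantiles of the data.

```python
from bisect import bisect_right

# Hypothetical cut points for an income feature.
BOUNDARIES = [35_000, 100_000]
LABELS = ["low", "medium", "high"]


def discretize(value, boundaries=BOUNDARIES, labels=LABELS):
    """Map a continuous value to the label of the bucket it falls into."""
    # bisect_right counts how many boundaries are <= value.
    return labels[bisect_right(boundaries, value)]


buckets = [discretize(v) for v in (20_000, 35_000, 150_000)]
```

With `bisect_right`, a value equal to a boundary falls into the upper bucket, so 35,000 is labeled "medium" here.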
What is the hashing trick?
A hash function is used to generate a hashed value of each category, and the hashed value becomes the index of that category. This solves the problem of not knowing the number of categories in advance. A drawback of hash functions is collisions (two categories mapped to the same index), but in practice the impact on performance is usually small
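The hashing trick can be sketched as follows. The choice of MD5, the bucket count of 16, and the category names are assumptions made for this illustration. A stable hash is used because Python's built-in `hash()` for strings is randomized per process by default.

```python
import hashlib


def hash_bucket(category, num_buckets=16):
    """Map a category string to an index in [0, num_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


known = ["nike", "adidas", "puma"]            # hypothetical brand names
indices = {c: hash_bucket(c) for c in known}

# A category never seen before still gets a valid index, with no vocabulary update.
unseen_index = hash_bucket("brand-launched-yesterday")
```

The number of buckets fixes the size of the feature space up front. A larger value lowers the chance of collisions at the cost of a wider encoding.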
What is feature crossing?
A technique that combines two or more features to generate new features. This is useful for modeling nonlinear relationships between features
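A minimal sketch of crossing two categorical features. The feature names, the values, and the `_x_` separator are hypothetical.

```python
def cross(*features):
    """Combine several feature values into one crossed categorical value."""
    return "_x_".join(str(f) for f in features)


# Hypothetical examples with two features each.
examples = [
    {"marital_status": "single", "children": 0},
    {"marital_status": "married", "children": 2},
]

for example in examples:
    example["marital_status_x_children"] = cross(
        example["marital_status"], example["children"]
    )
```

The crossed feature lets even a linear model learn a separate weight for each combination, for example "married with 2 children", instead of one weight per original feature.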
What is an embedding?
A vector that represents a piece of data. One of the most common uses of embeddings is word embeddings, where it’s possible to represent each word with a vector
What is an embedding space?
The set of all possible embeddings generated by the same algorithm for a type of data. All embedding vectors in the same space are of the same size
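A toy sketch of an embedding space. The three-dimensional vectors below are made up for illustration and are not the output of any real model; real word embeddings are learned and usually have far more dimensions.

```python
import math

# Hypothetical word embeddings: every vector in the space has the same size.
EMBEDDINGS = {
    "king": [0.8, 0.3, 0.1],
    "queen": [0.7, 0.4, 0.1],
    "apple": [0.1, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


related = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"])
unrelated = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["apple"])
```

Because all vectors live in the same space and have the same size, they can be compared directly. In this made-up data, "king" is closer to "queen" than to "apple".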