Feature Engineering Flashcards

1
Q

When is a value missing not at random (MNAR)?

A

This is when the reason a value is missing is the true value itself. For instance, people with high incomes didn’t disclose them precisely because those incomes were high

2
Q

When is a value missing at random (MAR)?

A

This is when the reason a value is missing is not due to the value itself, but due to another observed variable. For example, people of gender A did not disclose their age, because gender A generally does not like to disclose their age

3
Q

When is a value missing completely at random (MCAR)?

A

This is when there’s no pattern to when the value is missing. For instance, people simply forgot to fill in the value in a survey

4
Q

What are the two ways of dealing with missing values?

A
  1. Deletion
  2. Imputation
5
Q

What are the types of deletion when dealing with missing values and when do you use which?

A
  • Column deletion (if a large fraction of the column’s values are missing and you are confident the feature can be dropped)
  • Row deletion (if the values are MCAR and the number of affected examples is small, e.g., 0.1% of the data)
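
A minimal pandas sketch of both strategies (the DataFrame and its columns are illustrative):

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({
      "income": [50_000, np.nan, np.nan, 80_000],
      "age": [25, 32, np.nan, 41],
  })

  # Column deletion: drop a feature whose values are mostly missing.
  df_without_income = df.drop(columns=["income"])

  # Row deletion: drop the few affected rows (sensible only under MCAR).
  df_complete_rows = df.dropna()
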
6
Q

What are the types of imputation when dealing with missing values?

A
  • Default values (e.g., filling missing strings with an empty string)
  • Mean, median, or mode
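
A minimal pandas sketch of both kinds of imputation (the DataFrame and its columns are illustrative):

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({
      "city": ["Berlin", None, "Paris"],
      "age": [25.0, np.nan, 41.0],
  })

  # Default-value imputation for a categorical feature.
  df["city"] = df["city"].fillna("")

  # Mean imputation for a numerical feature (median and mode work the same way).
  df["age"] = df["age"].fillna(df["age"].mean())
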
7
Q

What is feature scaling?

A

Scaling features so that they fall within similar ranges

8
Q

How do you scale a feature x to the range [0, 1]?

A

x_scaled = (x - min(x)) / (max(x) - min(x))
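
A minimal NumPy sketch of this min-max scaling:

  import numpy as np

  x = np.array([2.0, 5.0, 10.0])

  # Map x linearly onto [0, 1].
  x_scaled = (x - x.min()) / (x.max() - x.min())
  print(x_scaled)  # [0.    0.375 1.   ]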

9
Q

What is standardization and when should it be used in feature scaling?

A

A process to rescale features so that they have zero mean and unit variance. It should be applied when the variables seem to follow a normal distribution. x_standardized = (x - mean(x)) / std(x)
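
A minimal NumPy sketch:

  import numpy as np

  x = np.array([2.0, 5.0, 10.0])

  # Zero mean, unit variance (population std, matching the formula above).
  x_standardized = (x - x.mean()) / x.std()
  print(x_standardized.mean(), x_standardized.std())  # ~0.0 1.0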

10
Q

What are two points of attention when applying feature scaling?

A
  • It’s a common source of data leakage (see the sketch after this list)
  • It often requires global statistics: you need the full data to compute the min, max, or mean. If these statistics shift between training and serving, they are no longer useful
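
A minimal sketch of the leakage-safe pattern (scikit-learn’s MinMaxScaler is one common implementation): compute the statistics on the training split only, then reuse them:

  import numpy as np
  from sklearn.preprocessing import MinMaxScaler

  X_train = np.array([[1.0], [5.0], [9.0]])
  X_test = np.array([[4.0], [12.0]])

  scaler = MinMaxScaler()
  X_train_scaled = scaler.fit_transform(X_train)  # min/max come from train only
  X_test_scaled = scaler.transform(X_test)        # reuse train statistics; never fit on test
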
11
Q

What is discretization?

A

The process of turning a continuous feature into a discrete feature
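
A minimal pandas sketch (the bucket boundaries and labels are illustrative):

  import pandas as pd

  ages = pd.Series([3, 15, 27, 45, 70])

  # Bucket a continuous feature into discrete ranges.
  age_groups = pd.cut(
      ages,
      bins=[0, 18, 35, 65, 120],
      labels=["child", "young adult", "adult", "senior"],
  )
  print(age_groups.tolist())  # ['child', 'child', 'young adult', 'adult', 'senior']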

12
Q

What is the hashing trick?

A

A hash function is used to generate a hashed value of each category. This solves the problem of not knowing the number of categories in advance. A known problem with hash functions is collisions, but in practice their impact on performance is insignificant
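
A minimal sketch in plain Python (the bucket count of 1,000 is an arbitrary choice):

  import hashlib

  NUM_BUCKETS = 1_000  # the feature space is fixed up front

  def hash_category(category: str) -> int:
      # hashlib (unlike the built-in hash()) is stable across processes.
      digest = hashlib.md5(category.encode("utf-8")).hexdigest()
      return int(digest, 16) % NUM_BUCKETS

  # Unseen categories still map into the same fixed range.
  print(hash_category("brand_new_category"))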

13
Q

What is feature crossing?

A

A technique to combine two or more features to generate new features. This is useful for modeling nonlinear relationships between features
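
A minimal pandas sketch (the feature names are illustrative):

  import pandas as pd

  df = pd.DataFrame({
      "marital_status": ["single", "married"],
      "num_children": [0, 2],
  })

  # Combine two features into one new categorical feature.
  df["marital_x_children"] = (
      df["marital_status"] + "_" + df["num_children"].astype(str)
  )
  print(df["marital_x_children"].tolist())  # ['single_0', 'married_2']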

14
Q

What is an embedding?

A

A vector that represents a piece of data. One of the most common uses of embeddings is word embeddings, where it’s possible to represent each word with a vector
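
A minimal NumPy sketch of a word-embedding lookup table (the vocabulary and dimension are illustrative; real embedding vectors are learned, not random):

  import numpy as np

  vocab = {"food": 0, "drink": 1, "python": 2}

  # In practice these vectors are learned; random values stand in here.
  rng = np.random.default_rng(0)
  embedding_matrix = rng.normal(size=(len(vocab), 4))  # one 4-d vector per word

  # The word "food" is represented by its row in the matrix.
  food_vector = embedding_matrix[vocab["food"]]
  print(food_vector.shape)  # (4,)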

15
Q

What is an embedding space?

A

The set of all possible embeddings generated by the same algorithm for a type of data. All embedding vectors in the same space are of the same size

16
Q

What is data leakage?

A

Refers to the phenomenon in which a form of the label “leaks” into the set of features used for making predictions, while the same information is not available during inference

17
Q

What is an example of data leakage?

A

When models are found to be picking up on the text font that certain hospitals use to label their CT scans. As a result, fonts from hospitals with more serious caseloads become predictors of the given disease risk

18
Q

What are common causes of data leakage?

A
  • Splitting time-correlated data randomly instead of by time (see the sketch after this list)
  • Scaling before splitting
  • Filling in missing data with statistics from the test split
  • Poor handling of data duplication before splitting
  • Group leakage (group of examples with correlated labels are divided into different splits)
  • Leakage from data generation process
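
A minimal pandas sketch of the first cause and its fix (the column names are illustrative):

  import pandas as pd

  df = pd.DataFrame({
      "timestamp": pd.date_range("2023-01-01", periods=100, freq="D"),
      "feature": range(100),
  })

  # Leaky: a random split lets training examples come from the "future".
  leaky_train = df.sample(frac=0.8, random_state=0)

  # Safer: split by time so the test set lies strictly after the train set.
  df = df.sort_values("timestamp")
  cutoff = int(len(df) * 0.8)
  train, test = df.iloc[:cutoff], df.iloc[cutoff:]
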
19
Q

What are two ways to detect data leakage?

A
  • If a feature has an unusually high correlation with the target
  • If removing a feature causes the model’s performance to deteriorate significantly
20
Q

What are the downsides of having too many features?

A
  • More opportunities for data leakage
  • Can cause overfitting
  • Can increase memory required to serve a model
  • Can increase inference latency
  • Useless features become technical debt (whenever the data pipeline changes, all the affected features need to be adjusted accordingly)
21
Q

What is bagging?

A

Short for bootstrap aggregating. An ensemble method designed to improve both the training stability and accuracy of ML algorithms; it reduces variance and helps avoid overfitting. Bootstraps are created by sampling with replacement, and a separate model is trained on each bootstrap. For classification, the final prediction is decided by majority vote; for regression, it is the average of all models’ predictions
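
A minimal sketch with scikit-learn’s BaggingClassifier on toy data:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import BaggingClassifier

  X, y = make_classification(n_samples=200, random_state=0)

  # 10 base models, each fit on a bootstrap sample; majority vote at predict time.
  bagging = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)
  print(bagging.predict(X[:5]))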

22
Q

What is boosting?

A

A family of iterative ensemble algorithms that convert weak learners into strong ones. Each learner in the ensemble is trained on the same set of samples, but the samples are weighted differently across iterations. As a result, future weak learners focus more on the examples that previous weak learners misclassified
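
A minimal sketch with scikit-learn’s AdaBoostClassifier, one member of this family, on toy data:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import AdaBoostClassifier

  X, y = make_classification(n_samples=200, random_state=0)

  # Each iteration upweights the samples the previous learners misclassified.
  boosting = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
  print(boosting.score(X, y))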

23
Q

What is stacking?

A

Stacking is a type of ensemble where base learners are trained on the training data, then a meta-learner combines their outputs to produce the final predictions. The meta-learner can be as simple as a heuristic: a majority vote (classification) or an average (regression) over all base learners. It can also be another model, such as a logistic or linear regression model
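
A minimal sketch with scikit-learn’s StackingClassifier on toy data:

  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier, StackingClassifier
  from sklearn.linear_model import LogisticRegression
  from sklearn.svm import SVC

  X, y = make_classification(n_samples=200, random_state=0)

  # Base learners feed their outputs into a logistic-regression meta-learner.
  stacking = StackingClassifier(
      estimators=[("rf", RandomForestClassifier(random_state=0)), ("svm", SVC())],
      final_estimator=LogisticRegression(),
  ).fit(X, y)
  print(stacking.score(X, y))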

24
Q

What are some metrics worth tracking for each experiment during its training process?

A
  • The loss curve corresponding to the train split and each of the eval splits
  • The model performance metrics on all non-test splits, such as accuracy, F1, perplexity
  • The log of corresponding sample, prediction, and ground truth label (for ad hoc analytics and sanity checks)
  • The speed of the model (number of steps per second, or number of tokens processed per second)
  • System performance metrics (memory, CPU/GPU usage)
  • The values over time of any (hyper)parameter whose changes can affect your model’s performance