3. Feature Engineering Flashcards

1
Q

What is feature engineering?

A

The process of transforming raw data into features that are useful for model training.

2
Q

What are the primary reasons for data transformation?

A

Data compatibility, e.g., converting string data to numerical data
Data quality, e.g., converting text to lowercase

3
Q

What are the approaches for consistent data preprocessing?

A

Pre-training data transformation: transform the data before training.
Adv: the transformation is performed only once.
Disadv: updating the transformation requires rerunning it over the whole dataset.
In-model data transformation: the transformation is part of the model code.
Adv: decouples data from transformations, so transformations are easy to change.
Disadv: increases model latency.

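A minimal sketch of the in-model approach, assuming TensorFlow/Keras and its Normalization preprocessing layer (the data and model here are made up); the pre-training approach would instead apply the same scaling to the dataset once, before model.fit:

    import numpy as np
    import tensorflow as tf

    raw_features = np.array([[10.0], [200.0], [3000.0]], dtype="float32")
    labels = np.array([0.0, 1.0, 1.0], dtype="float32")

    # In-model transformation: the normalization layer is part of the model graph,
    # so training and serving apply exactly the same transformation.
    normalizer = tf.keras.layers.Normalization()
    normalizer.adapt(raw_features)  # compute mean/variance from the training data

    model = tf.keras.Sequential([
        normalizer,
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(raw_features, labels, epochs=2, verbose=0)
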
4
Q

What is encoding for structured data types?

A

Categorical data must be converted to numerical values, as most models can't handle categorical data directly.

5
Q

What are the two kinds of transformation that may be needed for integer or floating-point data?

A

Normalization and bucketing

6
Q

Why do we need to normalize data with various ranges?

A

Features with widely different ranges slow convergence for models trained with gradient descent.
A wide range of values in a single feature can lead to NaN errors in some models.

7
Q

What are the two ways of bucketing?

A

Bucketing transforms numeric data into categorical data.
Buckets with equal-spaced boundaries
Buckets with quantile boundaries

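A small illustration of both bucketing strategies, assuming pandas as the tooling (pd.cut for equal-spaced boundaries, pd.qcut for quantile boundaries):

    import pandas as pd

    ages = pd.Series([18, 22, 25, 31, 40, 47, 58, 73, 90])

    # Equal-spaced boundaries: every bucket covers the same value range.
    equal_width = pd.cut(ages, bins=3)

    # Quantile boundaries: every bucket holds roughly the same number of rows.
    quantile = pd.qcut(ages, q=3)

    print(equal_width.value_counts())
    print(quantile.value_counts())
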
8
Q

What is label encoding?

A

Convert text categories to numeric values while preserving their order, e.g., small, medium, big to 1, 2, 3

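A minimal label-encoding sketch using scikit-learn's OrdinalEncoder with an explicitly supplied category order (the size data is made up):

    from sklearn.preprocessing import OrdinalEncoder

    sizes = [["small"], ["medium"], ["big"], ["small"]]
    encoder = OrdinalEncoder(categories=[["small", "medium", "big"]])

    # small -> 0, medium -> 1, big -> 2 (add 1 for a 1/2/3 mapping)
    print(encoder.fit_transform(sizes).ravel())
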
9
Q

What is an out-of-vocabulary (OOV) category?

A

It is a special catch-all category for rare values (outliers), so the ML system does not waste time training on them.

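A sketch of an OOV bucket using the Keras StringLookup layer (chosen here for illustration); any value outside the supplied vocabulary falls into the reserved OOV index instead of getting its own category:

    import tensorflow as tf

    vocab = ["red", "green", "blue"]
    lookup = tf.keras.layers.StringLookup(vocabulary=vocab, num_oov_indices=1)

    # "purple" is not in the vocabulary, so it maps to the single OOV bucket (index 0).
    print(lookup(tf.constant(["red", "purple", "blue"])).numpy())
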
10
Q

What is one-hot encoding?

A

Create dummy (0/1) variables for categorical variables where order does not matter.

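A quick one-hot encoding example with pandas get_dummies (an assumed choice; tf.one_hot or scikit-learn's OneHotEncoder would work just as well):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

    # One dummy (0/1) column per category; no ordering is implied.
    print(pd.get_dummies(df, columns=["color"]))
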
11
Q

What is feature hashing?

A

Apply a hash function to a categorical feature and use the hash value as the feature index (bucket).

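A feature-hashing sketch with the Keras Hashing layer (illustrative only): each category is hashed into one of a fixed number of buckets, so no vocabulary needs to be stored, at the cost of occasional collisions:

    import tensorflow as tf

    hashing = tf.keras.layers.Hashing(num_bins=8)

    # Every value, even an unseen one, lands in a valid bucket in [0, 8).
    print(hashing(tf.constant(["shoe", "sock", "hat", "scarf"])).numpy())
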
12
Q

What is a hybrid of hashing and vocabulary?

A

Use vocabulary for important features
Use hashing for the less important features

13
Q

What is embedding?

A

An embedding represents a categorical feature as a dense, continuous-valued vector.

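A minimal embedding sketch: integer category ids (e.g., produced by StringLookup) are mapped to trainable dense vectors by a Keras Embedding layer (the dimensions are illustrative assumptions):

    import tensorflow as tf

    # 1000 possible category ids, each represented by an 8-dimensional dense vector.
    embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=8)

    ids = tf.constant([3, 42, 999])
    print(embedding(ids).shape)  # (3, 8); the vectors are learned during training
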
14
Q

What is feature selection?

A

Select a subset of features that are most useful to a model in order to predict the target variable.

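A small feature-selection sketch with scikit-learn's SelectKBest, which keeps the k features with the strongest univariate relationship to the target (the iris dataset is used purely for illustration):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)

    # Keep the 2 features most related to the target.
    X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
    print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
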
15
Q

What are the benefits of dimensionality reduction?

A

Reduces noise in the data
Reduces the risk of overfitting

16
Q

What are the two ways to achieve dimensionality reduction?

A

Use feature importance (keep only the most important features)
Use combinations of features, e.g., PCA, t-SNE

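A PCA sketch of the second approach (combining features), again using scikit-learn and the iris data as assumptions:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)

    # Project 4 correlated features onto 2 principal components.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape, pca.explained_variance_ratio_)
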
17
Q

What are key outcomes of classification models?

A

True positive: Predict positive class correctly
True negative: Predict negative class correctly
False positive: Predict positive class incorrectly
False negative: Predict negative class incorrectly

18
Q

What is a classification threshold?

A

It is the threshold that separates the positive class from the negative class. By default, it is set at 0.5.

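A small sketch tying the threshold to the four outcomes above, using scikit-learn's confusion_matrix on made-up scores:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([0, 0, 1, 1, 1, 0])
    y_score = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.3])  # predicted probabilities

    threshold = 0.5
    y_pred = (y_score >= threshold).astype(int)

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # raising the threshold trades FPs for FNs
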
19
Q

What is AUC ROC used for?

A

Balanced datasets in classification problems

20
Q

What will happen if you raise and lower the classification threshold?

A

Raise: reduce false positives, increase false negatives, increase precision
Lower: reduce false negatives, increase false positives, increase recall

21
Q

What is AUC PR used for?

A

Imbalanced datasets in classification problems

22
Q

What is AUC ROC?

A

It is a graph showing the performance of a classification model at all classification thresholds.
It plots the true positive rate against the false positive rate.
AUC = 1 means perfect class separation.

23
Q

What is AUC PR?

A

It plots precision against recall at all classification thresholds.
It gives more attention to the minority class.

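Both areas can be computed with scikit-learn; a hedged sketch on the same kind of made-up scores as above (average_precision_score is one common approximation of the area under the PR curve):

    import numpy as np
    from sklearn.metrics import average_precision_score, roc_auc_score

    y_true = np.array([0, 0, 1, 1, 1, 0])
    y_score = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.3])

    print("AUC ROC:", roc_auc_score(y_true, y_score))           # threshold-independent
    print("AUC PR :", average_precision_score(y_true, y_score)) # better for imbalanced data
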
24
Q

What is feature crossing?

A

Multiply two or more features.

25

Q

What are the two ways to use a feature cross?

A

Cross two features: creates a more predictive feature
Cross two or more features: lets the model represent nonlinearity

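A feature-cross sketch (the feature names and values are illustrative assumptions): numeric features can simply be multiplied, while crossed categorical features can be concatenated and then hashed:

    import numpy as np
    import tensorflow as tf

    # Numeric cross: multiply two features to create a synthetic interaction feature.
    latitude_bucket = np.array([1.0, 2.0, 3.0])
    longitude_bucket = np.array([4.0, 5.0, 6.0])
    lat_x_lon = latitude_bucket * longitude_bucket  # lets a linear model capture the interaction

    # Categorical cross: join the values, then hash the combined category into buckets.
    country = tf.constant(["US", "US", "FR"])
    device = tf.constant(["mobile", "desktop", "mobile"])
    crossed = tf.strings.join([country, device], separator="_")
    crossed_bucket = tf.keras.layers.Hashing(num_bins=16)(crossed)
    print(lat_x_lon, crossed_bucket.numpy())
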
26

Q

What is the TensorFlow Data API (tf.data)?

A

It makes the data input pipeline more efficient.

27

Q

What is TensorFlow Transform?

A

The TensorFlow Transform library is part of TensorFlow Extended (TFX). It performs transformations prior to training the model. tf.Transform helps avoid training-serving skew.

28

Q

What are the best practices for an efficient data input pipeline?

A

tf.data.Dataset.interleave: parallelizes data reading
tf.data.Dataset.cache: caches a dataset in memory or on local storage
tf.data.Dataset.prefetch: makes sure preprocessed data is ready before the training step needs it
Vectorize user-defined functions so they operate on a batch of data at a time
Apply interleave, prefetch, and shuffle to reduce memory usage

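A tf.data pipeline sketch combining several of these practices (the file pattern, feature spec, and batch size are assumptions):

    import tensorflow as tf

    files = tf.data.Dataset.list_files("data/train-*.tfrecord")  # hypothetical shards

    feature_spec = {"f": tf.io.FixedLenFeature([], tf.float32)}

    dataset = (
        files.interleave(tf.data.TFRecordDataset,
                         num_parallel_calls=tf.data.AUTOTUNE)   # parallelize reading
        .shuffle(10_000)                                        # randomize example order
        .batch(256)                                             # batch first...
        .map(lambda x: tf.io.parse_example(x, feature_spec),
             num_parallel_calls=tf.data.AUTOTUNE)               # ...so parsing is vectorized
        .cache()                                                # cache the parsed dataset
        .prefetch(tf.data.AUTOTUNE)                             # overlap input prep with training
    )
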
29

Q

What does TensorFlow Transform do?

A

You can create transform pipelines using Cloud Dataflow. The pipeline can:
Analyze training data
Transform training data
Transform evaluation data
Produce metadata
Feed the model
Serve data

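A preprocessing_fn sketch in the tf.Transform style (the feature names are assumptions): the analyze phase computes full-pass statistics such as means and vocabularies, and the transform phase applies them identically to training and serving data:

    import tensorflow_transform as tft

    def preprocessing_fn(inputs):
        # Full-pass analysis results (mean/variance, vocabulary) are baked into the
        # transform graph, so training and serving stay consistent.
        return {
            "income_scaled": tft.scale_to_z_score(inputs["income"]),
            "city_id": tft.compute_and_apply_vocabulary(inputs["city"]),
        }
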
30

Q

What are the steps and libraries used in a TFX pipeline?

A

Data extraction & validation: TFDV (Dataflow)
Data transformation: TF Transform (Dataflow)
Model training & tuning: tf.Estimator & tf.Keras (Vertex AI Training)
Model evaluation & validation: TF Model Analysis (Dataflow)
Model serving for prediction: TF Serving (Vertex AI Prediction)

31

Q

What are the two tools that help with data transformation?

A

Data Fusion: code-free, UI-based managed service for building ETL or ELT pipelines from various sources.
Dataprep: code-free, UI-based serverless tool for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning at any scale.