3. Feature Engineering Flashcards
What is feature engineering?
The process of transforming raw data into features that are useful for model training.
What are the primary reasons for data transformation?
Data compatibility, e.g., converting string data to numerical data.
Data quality, e.g., converting text to lowercase.
What are the approaches for consistent data preprocessing?
Pre-training data transformation: transform the data before training. Adv: the transformation is performed only once.
Disadv: any update requires rerunning the transformation over the whole dataset.
Inside-model data transformation: the transformation is part of the model code.
Adv: the raw data stays decoupled from the transformation, so it is easy to iterate.
Disadv: increases model latency.
What is encoding for structured data types?
Categorical data must be converted to numerical form, as most models can't handle categorical data directly.
What are the two kinds of transformation that may be needed for integer or floating-point data?
Normalization and bucketing
Why do we need to normalize data with various ranges?
Models trained with gradient descent converge slowly when feature ranges vary widely.
A wide range of values in a single feature can produce NaN errors in some models.
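As a sketch of the two most common normalization schemes (min-max scaling and z-score standardization), using plain Python lists with hypothetical values:

```python
from statistics import mean, stdev

def min_max_scale(xs):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Center to mean 0 and scale to unit standard deviation."""
    mu, sigma = mean(xs), stdev(xs)
    return [(x - mu) / sigma for x in xs]

values = [10.0, 20.0, 30.0, 40.0]
print(min_max_scale(values))  # wide-range feature squeezed into [0, 1]
print(z_score(values))        # centered around 0
```

Either scheme puts differently-ranged features on a comparable scale, which helps gradient descent converge.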
What are the two ways of bucketing?
Bucketing transforms numeric data into categorical data.
Buckets with equal-spaced boundaries
Buckets with quantile boundaries
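A minimal sketch contrasting the two bucketing strategies, using a hypothetical value list and Python's standard library `statistics.quantiles`:

```python
from statistics import quantiles

values = [1, 2, 2, 3, 5, 8, 13, 21, 34, 55]

# Equal-spaced boundaries: split the value range into fixed-width bins.
lo, hi, n_bins = min(values), max(values), 4
width = (hi - lo) / n_bins
equal_bins = [min(int((v - lo) // width), n_bins - 1) for v in values]

# Quantile boundaries: each bucket holds roughly the same number of points.
cuts = quantiles(values, n=4)  # three interior cut points for 4 buckets
quantile_bins = [sum(v > c for c in cuts) for v in values]

print(equal_bins)     # skewed data piles up in the first equal-width bin
print(quantile_bins)  # quantile buckets spread the points evenly
```

With skewed data, equal-width buckets leave some bins nearly empty, while quantile buckets stay balanced.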
What is label encoding?
Convert text categories to numbers while preserving their natural order, e.g., small, medium, big → 1, 2, 3
What is Out of Vocab?
It is a special category (out-of-vocabulary, OOV) that collects rare outlier values, so the ML system doesn't waste training capacity on them.
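A sketch combining label encoding with an OOV bucket, assuming a hypothetical ordered vocabulary of sizes:

```python
SIZE_ORDER = ["small", "medium", "big"]  # hypothetical ordered vocabulary
OOV_ID = 0                               # reserved bucket for unseen values
size_to_id = {label: i + 1 for i, label in enumerate(SIZE_ORDER)}

def encode(label):
    # Rare or unseen categories all collapse into the single OOV bucket.
    return size_to_id.get(label, OOV_ID)

print([encode(s) for s in ["big", "small", "giant"]])  # → [3, 1, 0]
```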
What is one-hot encoding?
Create dummy variables for categorical features where the order of categories does not matter.
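A minimal one-hot sketch over a hypothetical unordered color vocabulary:

```python
CATEGORIES = ["red", "green", "blue"]  # hypothetical unordered vocabulary

def one_hot(label):
    # One dummy variable per category; exactly one position is 1.
    return [1 if label == c else 0 for c in CATEGORIES]

print(one_hot("green"))  # → [0, 1, 0]
```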
What is feature hashing?
Apply a hash function to a categorical feature and use the hash value (modulo the number of buckets) as the feature index.
What is hybrid of hashing and vocabulary?
Use vocabulary for important features
Use hashing for the less important features
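A sketch of the hybrid scheme, assuming a hypothetical two-entry vocabulary of "important" cities and a stable MD5-based hash for everything else:

```python
import hashlib

NUM_BUCKETS = 8

def hash_bucket(value, num_buckets=NUM_BUCKETS):
    # Stable hash (unlike built-in hash(), which is salted per process).
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

VOCAB = {"new_york": 0, "london": 1}  # hypothetical "important" categories

def hybrid_encode(value):
    # Vocabulary ids come first; everything else is hashed into later buckets.
    if value in VOCAB:
        return VOCAB[value]
    return len(VOCAB) + hash_bucket(value)

print(hybrid_encode("london"))          # exact id from the vocabulary
print(hybrid_encode("some_rare_city"))  # hashed into one of the shared buckets
```

Hashing keeps the feature space bounded, at the cost of possible collisions among the rare values.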
What is embedding?
An embedding represents a categorical feature as a dense, continuous-valued vector.
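A toy sketch of an embedding lookup table; in a real model the vectors are trainable parameters learned during training, while here they are just randomly initialized to show the lookup:

```python
import random

random.seed(0)
EMBED_DIM = 4
VOCAB = ["red", "green", "blue"]  # hypothetical vocabulary

# Hypothetical table; a real framework would learn these weights.
embedding_table = {
    w: [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for w in VOCAB
}

def embed(word):
    return embedding_table[word]

# Dense 4-dimensional vector instead of a sparse 3-way one-hot.
print(len(embed("red")))
```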
What is feature selection?
Select a subset of features that are most useful to a model in order to predict the target variable.
What are the benefits of dimensionality reduction?
Reduces noise in the data
Reduces overfitting
What are the two ways to achieve dimensionality reduction?
Use feature importance
Use combinations of feature, e.g., PCA, t-SNE
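A sketch of the feature-importance route, using absolute Pearson correlation with the target as a simple importance score; the feature names and values are hypothetical:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical features: f1 tracks the target, f2 is mostly noise.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "f1": [1.1, 2.0, 2.9, 4.2, 5.1],
    "f2": [3.0, -1.0, 2.0, 0.0, 1.5],
}

# Keep the top-k features by absolute correlation with the target.
ranked = sorted(features, key=lambda f: abs(pearson(features[f], target)),
                reverse=True)
print(ranked[:1])  # → ['f1']
```

PCA and t-SNE instead build new features from combinations of the originals rather than selecting a subset.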
What are key outcomes of classification models?
True positive: Predict positive class correctly
True negative: Predict negative class correctly
False positive: Predict positive class incorrectly (an actual negative is flagged)
False negative: Predict negative class incorrectly (an actual positive is missed)
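The four outcomes can be counted directly from paired labels and predictions; a minimal sketch with hypothetical binary labels:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive class)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

print(confusion_counts([1, 0, 1, 0], [1, 1, 0, 0]))  # → (1, 1, 1, 1)
```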
What is classification threshold?
It is the threshold that separates the positive class from the negative class. By default, it is set at 0.5.
What is AUC ROC used for?
Balanced datasets in classification problems
What will happen if you raise and lower the classification threshold?
Raise: reduce false positives, increase false negatives, increase precision
Lower: reduce false negatives, increase false positives, increase recall
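The precision/recall trade-off can be demonstrated by scoring the same hypothetical predictions at a low and a high threshold:

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall for binary labels at a given threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(t and p for t, p in zip(y_true, preds))
    fp = sum((not t) and p for t, p in zip(y_true, preds))
    fn = sum(t and (not p) for t, p in zip(y_true, preds))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2]  # hypothetical model scores

low = precision_recall(y_true, scores, 0.3)   # low threshold: higher recall
high = precision_recall(y_true, scores, 0.8)  # high threshold: higher precision
print(low, high)
```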
What is AUC PR used for?
Imbalanced datasets in classification problems
What is AUC ROC?
It is a graph showing the performance of a classification model at all classification thresholds.
It plots the true positive rate against the false positive rate.
AUC = 1 means perfect class separation.
What is AUC PR?
It plots precision against recall.
It gives more attention to the minority class.
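AUC ROC has a useful probabilistic reading: it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal sketch of that rank-based computation, with hypothetical scores:

```python
def auc_roc(y_true, scores):
    """AUC = P(random positive outranks random negative); ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Every positive outranks every negative: perfect separation.
print(auc_roc([1, 1, 0, 0], [0.9, 0.8, 0.7, 0.1]))  # → 1.0
```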
What is feature crossing?
Multiply two or more features.
What are the two ways to use feature cross?
Cross two features to create a more predictive feature.
Cross two or more features to represent nonlinearity.
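A sketch of both flavors of feature cross, with hypothetical day/time-of-day categories:

```python
# Numeric cross: multiply features to expose a nonlinear interaction.
def numeric_cross(x1, x2):
    return x1 * x2

# Categorical cross: concatenate values into one combined category,
# which can then be one-hot encoded or hashed like any other category.
def categorical_cross(a, b):
    return f"{a}_x_{b}"

print(numeric_cross(3.0, 2.0))              # → 6.0
print(categorical_cross("mon", "morning"))  # → 'mon_x_morning'
```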
What is TensorFlow Data API (tf.data)?
It makes data input pipelines more efficient.
What is TensorFlow Transform?
TensorFlow Transform library is a part of TensorFlow Extended. It performs transformations prior to training the model.
tf.Transform can avoid training-serving skew
What is the best practice to make an efficient data input pipeline?
tf.data.Dataset.interleave: It parallelizes data reading.
tf.data.Dataset.cache: Cache a dataset in memory or local storage.
tf.data.Dataset.prefetch: Make sure preprocessed data is ready before the training step needs it.
Vectorize user-defined map functions so they operate on a batch at a time.
Reduce memory usage when applying the interleave, prefetch, and shuffle transformations.
What does TensorFlow Transform do?
Analyze training data
Transform training data
Transform evaluation data
Produce metadata
Feed the model
Serve data
Transform pipelines can be run at scale with Cloud Dataflow.
What are the steps and libraries used in TFX pipeline?
Data extraction & validation: TFDV (Dataflow)
Data transformation: TF Transform (Dataflow)
Model training & tuning: tf.Estimators & tf.Keras (Vertex AI Training)
Model evaluation & validation: TF Model Analysis (Dataflow)
Model serving for prediction: TF Serving (Vertex AI Prediction)
What are the two tools to help data transformation?
Data Fusion:
Code-free UI-based managed service for ETL or ELT pipelines from various sources.
Dataprep:
Code-free, UI-based serverless tool for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning at any scale.