W2- Machine Learning Data Lifecycle in Production Flashcards
Normalization (e.g. MinMaxScaler) is usually a good choice if you know that the distribution of your data is not ____ .
Gaussian
What are Feature crosses?
They combine multiple features into a single new feature; that is fundamentally what a feature cross is.
A cross encodes non-linearity in the feature space, or encodes the same information in fewer features.
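For illustration, a minimal hand-rolled sketch of a feature cross (the column names here are invented for the example):

```python
import pandas as pd

# Hypothetical ride data: two categorical features we suspect interact.
df = pd.DataFrame({
    "pickup_hour_bucket": ["morning", "evening", "evening"],
    "day_type": ["weekday", "weekday", "weekend"],
})

# A feature cross: combine the two columns into one synthetic feature.
# "evening_x_weekend" can capture an interaction that neither feature
# expresses on its own (non-linearity in the original feature space).
df["hour_x_day"] = df["pickup_hour_bucket"] + "_x_" + df["day_type"]
print(df["hour_x_day"].tolist())
# ['morning_x_weekday', 'evening_x_weekday', 'evening_x_weekend']
```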
You are working on a taxi tip prediction model and your raw dataset has columns for the latitude and longitude of both pickup and dropoff locations. These do not assume a Gaussian distribution.
Is the below feature engineering process useful? Why?
Because the data does not assume a Gaussian distribution, you should normalize these location features using min-max scaling: X_norm = (X - X_min) / (X_max - X_min). This maps the values into the range [0, 1], which can help training converge faster.
It’s not useful.
Min-max normalizing the raw coordinates implies that their magnitudes carry quantitative meaning, which is misleading in this application. For example: with all other features held constant, you cannot say that a bigger tip will be given just because the pickup longitude is 1 degree “greater” than on a previous trip.
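For reference, the min-max formula from the card as a minimal NumPy sketch (the array values are arbitrary); the mechanics are fine, the objection above is about applying them to coordinates:

```python
import numpy as np

x = np.array([12.0, 30.0, 45.0, 60.0])

# X_norm = (X - X_min) / (X_max - X_min): rescales values into [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # [0.     0.375  0.6875 1.    ]
```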
What are 4 of the issues of doing feature engineering at scale?
- Inconsistencies in feature engineering (different code paths for training vs. serving, e.g. Python for training and Java for serving)
- Preprocessing granularity (instance-level vs. full-pass transformations)
- Preprocessing the training dataset (a full pass over a large dataset is expensive)
- Optimizing instance-level transformations (they run on every example, so they can become a bottleneck)
What are the benefits of using tf.transform?
- The emitted TensorFlow graph holds all the necessary constants (like the training-set mean for standard scaling) and transformations
- Focus on data preprocessing only at training time
- Works in-line during both training and serving
- No need for preprocessing code at serving time
- Consistently applied transformations irrespective of deployment platform
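As a minimal sketch (the feature names fare, payment_type, and trip_miles are invented for the example): a preprocessing_fn that mixes full-pass analyzers with instance-level ops, where tf.Transform bakes the analyzer results into the emitted graph as constants:

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Preprocess raw input columns into transformed features."""
    outputs = {}
    # Full-pass analyzer: mean/variance over the training set become
    # constants in the emitted graph, so serving needs no extra code.
    outputs['fare_scaled'] = tft.scale_to_z_score(inputs['fare'])
    # Full-pass analyzer: builds a vocabulary over the training set.
    outputs['payment_type_id'] = tft.compute_and_apply_vocabulary(
        inputs['payment_type'])
    # Instance-level op: runs per example at training and serving time.
    outputs['trip_log'] = tf.math.log1p(inputs['trip_miles'])
    return outputs
```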
With TensorFlow Transform, you can preprocess data using the same code for both training a model and serving inferences in production. True/False
C2-W2-Lab1
True
It provides several utility functions for common preprocessing tasks including creating features that require a full pass over the training dataset.
What are the outputs of Tensorflow Transform?
C2-W2-Lab1
The outputs are the transformed features and a TensorFlow graph which you can use for both training and serving.
Using the same graph for both training and serving can prevent feature skew, since the same transformations are applied in both stages.
What’s the difference between a TensorFlow operation and a TensorFlow Transform analyzer?
C2-W2-Lab1: Create a preprocessing function
Unlike TensorFlow ops, analyzers run only once, typically making a full pass over the entire training dataset. They create tensor constants, which are added to your graph. For example, tft.min computes the minimum of a tensor over the training dataset.
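A small sketch of the contrast (the feature name x is arbitrary): tft.min and tft.max are analyzers that run once over the training data, while the arithmetic around them is an ordinary per-instance TensorFlow op:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    x = inputs['x']
    # Analyzers: one full pass over the training dataset; the results
    # are injected into the graph as constants.
    x_min = tft.min(x)
    x_max = tft.max(x)
    # Ordinary TF ops: evaluated for every example, in training and in
    # serving, using the constants computed above.
    return {'x_minmax': (x - x_min) / (x_max - x_min)}
```

(tf.Transform also ships tft.scale_by_min_max, which packages this exact pattern.)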
What are the main steps of TensorFlow Transform to preprocess input data?
C2-W2-Lab1
Collect raw data
Define metadata
Create a preprocessing function
Generate a constant graph with the required transformations
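A minimal end-to-end sketch of those four steps, using an in-memory toy dataset (the lab follows the same AnalyzeAndTransformDataset flow):

```python
import tempfile

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

# 1. Collect raw data (here, an in-memory toy dataset).
raw_data = [{'x': 1.0}, {'x': 2.0}, {'x': 3.0}]

# 2. Define metadata describing the raw features.
raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec(
        {'x': tf.io.FixedLenFeature([], tf.float32)}))

# 3. Create a preprocessing function.
def preprocessing_fn(inputs):
    return {'x_scaled': tft.scale_to_0_1(inputs['x'])}

# 4. Generate the constant graph with the required transformations.
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

transformed_data, transformed_metadata = transformed_dataset
print(transformed_data)
# [{'x_scaled': 0.0}, {'x_scaled': 0.5}, {'x_scaled': 1.0}]
```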
Like TFDV, TensorFlow Transform also uses ____ for deployment scalability and flexibility.
C2-W2-Lab1: Create a preprocessing function
Apache Beam
What do:
ExampleGen
StatisticsGen
SchemaGen
ExampleValidator
Transform
do in a TFX pipeline?
C2-W2-Lab2
ingest data from a base directory with ExampleGen
compute the statistics of the training data with StatisticsGen
infer a schema with SchemaGen
detect anomalies in the evaluation data with ExampleValidator
preprocess the data into features suitable for model training with Transform
What are the steps of building a data pipeline using TensorFlow Extended (TFX) to prepare features from a dataset?
C2W2-Assignment
- create an InteractiveContext to run TFX components interactively
- use TFX ExampleGen component to split your dataset into training and evaluation datasets
- generate the statistics and the schema of your dataset using TFX StatisticsGen and SchemaGen components
- validate the evaluation dataset statistics using TFX ExampleValidator
- perform feature engineering using the TFX Transform component
We refer to the outputs of pipeline components as artifacts.
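A condensed sketch of that assignment flow (the data path and module file name are placeholders):

```python
from tfx.components import (CsvExampleGen, StatisticsGen, SchemaGen,
                            ExampleValidator, Transform)
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

# Run components interactively (e.g. in a notebook).
context = InteractiveContext()

# Ingest and split the dataset from a base directory.
example_gen = CsvExampleGen(input_base='data/')  # placeholder path
context.run(example_gen)

# Compute statistics over the data.
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
context.run(statistics_gen)

# Infer a schema from the statistics.
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
context.run(schema_gen)

# Validate the evaluation split against the schema.
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
context.run(example_validator)

# Feature engineering; the preprocessing_fn lives in the module file.
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file='transform_module.py')  # placeholder module
context.run(transform)
```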