W2 - Machine Learning Data Lifecycle in Production Flashcards

1
Q

Normalization (e.g., MinMaxScaler) is usually a good choice if you know that the distribution of your data is not ____ .

A

Gaussian
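
For reference, a minimal min-max scaling sketch with scikit-learn's MinMaxScaler; the data values here are made up:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Made-up training data with a single feature column.
    X_train = np.array([[1.0], [5.0], [10.0]])

    # MinMaxScaler defaults to feature_range=(0, 1), i.e.
    # X_norm = (X - X_min) / (X_max - X_min).
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X_train)
    # X_scaled -> [[0.0], [0.444...], [1.0]]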

2
Q

What are Feature crosses?

A

They combine multiple features together into a new feature; that is fundamentally what a feature cross is.

It encodes non-linearity in the feature space, or encodes the same information in fewer features.
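
A minimal sketch in TensorFlow, assuming two hypothetical categorical features (day_of_week and hour_bucket) that are not part of this card: a cross concatenates the values, then hashes the result into buckets so the model sees a single categorical feature.

    import tensorflow as tf

    # Hypothetical categorical features for three example trips.
    day = tf.constant(['Mon', 'Mon', 'Sat'])
    hour = tf.constant(['morning', 'evening', 'evening'])

    # Combine the two features into one crossed feature...
    crossed = tf.strings.join([day, hour], separator='_x_')
    # ...and hash it into a fixed number of buckets so it can be
    # consumed as a single categorical input.
    crossed_ids = tf.strings.to_hash_bucket_fast(crossed, num_buckets=100)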

3
Q

You are working on a taxi tip prediction model, and your raw dataset has columns for the latitude and longitude of both the pickup and dropoff locations. These do not follow a Gaussian distribution.
Is the feature engineering process below useful? Why or why not?

Because the data does not follow a Gaussian distribution, you should normalize these location features using the formula:

X_norm = (X - X_min) / (X_max - X_min)

This puts the values into the range [0, 1], which can help training converge faster.
A

It's not useful.
Normalizing the raw coordinates implies that they carry quantitative meaning, which is misleading in this application. For example, with all other features held constant, you can't say that a bigger tip will be given just because the pickup longitude is 1 degree "greater" than on a previous trip.
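
A small numeric illustration with made-up pickup longitudes (these values are not from any dataset): min-max scaling is monotonic, so the normalized values preserve the same "greater/less" ordering, and that ordering has no meaningful relationship to the tip amount.

    import numpy as np

    # Made-up pickup longitudes for three trips.
    pickup_lon = np.array([-74.01, -73.98, -73.95])
    lon_norm = (pickup_lon - pickup_lon.min()) / (pickup_lon.max() - pickup_lon.min())
    # lon_norm -> [0.0, 0.5, 1.0]: the ordering survives scaling,
    # but "larger longitude" does not imply "larger tip".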

4
Q

What are four issues with doing feature engineering at scale?

A
  1. Inconsistencies in feature engineering (different code paths for training vs. serving, e.g., Python for training vs. Java for serving)
  2. Preprocessing granularity (instance-level vs. full-pass transformations)
  3. Preprocessing the training dataset
  4. Optimizing instance-level transformations

5
Q

What are the benefits of using tf.Transform?

A
  • The emitted TensorFlow graph holds all the necessary constants (like the training-set mean for standard scaling) and transformations
  • You focus on data preprocessing only at training time
  • Works in-line during both training and serving
  • No need for separate preprocessing code at serving time
  • Transformations are applied consistently, irrespective of the deployment platform
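
A minimal preprocessing_fn sketch; the feature names fare and payment_type are illustrative, not from this card. Analyzer results such as the training-set mean and the vocabulary are baked into the emitted graph as constants.

    import tensorflow as tf
    import tensorflow_transform as tft

    def preprocessing_fn(inputs):
        # 'fare' and 'payment_type' are hypothetical feature names.
        outputs = {}
        # Standard scaling: the training-set mean and variance are
        # computed once by analyzers and stored as graph constants.
        outputs['fare_scaled'] = tft.scale_to_z_score(inputs['fare'])
        # Build a vocabulary over the full training set, then map
        # each string to an integer id.
        outputs['payment_type_id'] = tft.compute_and_apply_vocabulary(
            inputs['payment_type'])
        return outputs
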
6
Q

With TensorFlow Transform, you can preprocess data using the same code for both training a model and serving inferences in production. True or False?

C2-W2-Lab1

A

True

It provides several utility functions for common preprocessing tasks including creating features that require a full pass over the training dataset.

7
Q

What are the outputs of TensorFlow Transform?

C2-W2-Lab1

A

The outputs are the transformed features and a TensorFlow graph which you can use for both training and serving.

Using the same graph for both training and serving can prevent feature skew, since the same transformations are applied in both stages.

8
Q

What's the difference between a TensorFlow operation and a TensorFlow Transform analyzer?

C2-W2-Lab1: Create a preprocessing function

A

Unlike TensorFlow ops, analyzers run only once, during training, and typically make a full pass over the entire training dataset. They create tensor constants, which are added to your graph. For example, tft.min computes the minimum of a tensor over the training dataset.
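
A short contrast sketch (the feature name x is arbitrary): inside a preprocessing_fn, a regular TensorFlow op is re-evaluated on every batch it sees, while analyzer results are constants computed once over the training data.

    import tensorflow as tf
    import tensorflow_transform as tft

    def preprocessing_fn(inputs):
        x = inputs['x']
        # TensorFlow op: recomputed for every batch, at training
        # and at serving time.
        x_centered_per_batch = x - tf.reduce_mean(x)
        # Analyzers: tft.min and tft.max make one full pass over the
        # training dataset; their results become graph constants.
        x_scaled = (x - tft.min(x)) / (tft.max(x) - tft.min(x))
        return {
            'x_centered_per_batch': x_centered_per_batch,
            'x_scaled': x_scaled,
        }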

9
Q

What are the main steps for preprocessing input data with TensorFlow Transform?

C2-W2-Lab1

A

Collect raw data
Define metadata
Create a preprocessing function
Generate a constant graph with the required transformations
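
A condensed end-to-end sketch of these steps, closely following TF Transform's standard "simple example"; the feature names x, y, and s are illustrative.

    import tempfile
    import tensorflow as tf
    import tensorflow_transform as tft
    import tensorflow_transform.beam as tft_beam
    from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

    # 1. Collect raw data (toy records).
    raw_data = [
        {'x': 1, 'y': 1, 's': 'hello'},
        {'x': 2, 'y': 2, 's': 'world'},
        {'x': 3, 'y': 3, 's': 'hello'},
    ]

    # 2. Define metadata describing the raw features.
    raw_data_metadata = dataset_metadata.DatasetMetadata(
        schema_utils.schema_from_feature_spec({
            'x': tf.io.FixedLenFeature([], tf.float32),
            'y': tf.io.FixedLenFeature([], tf.float32),
            's': tf.io.FixedLenFeature([], tf.string),
        }))

    # 3. Create a preprocessing function.
    def preprocessing_fn(inputs):
        return {
            'x_centered': inputs['x'] - tft.mean(inputs['x']),
            'y_normalized': tft.scale_to_0_1(inputs['y']),
            's_integerized': tft.compute_and_apply_vocabulary(inputs['s']),
        }

    # 4. Generate the constant graph with the required transformations.
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        transformed_dataset, transform_fn = (
            (raw_data, raw_data_metadata)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))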

10
Q

Like TFDV, TensorFlow Transform also uses ____ for deployment scalability and flexibility.

C2-W2-Lab1: Create a preprocessing function

A

Apache Beam

11
Q

What do:
ExampleGen
StatisticsGen
SchemaGen
ExampleValidator
Transform
do in a TFX pipeline?

C2-W2-Lab2

A

ingest data from a base directory with ExampleGen
compute the statistics of the training data with StatisticsGen
infer a schema with SchemaGen
detect anomalies in the evaluation data with ExampleValidator
preprocess the data into features suitable for model training with Transform

12
Q

What are the steps for building a data pipeline with TensorFlow Extended (TFX) to prepare features from a dataset?

C2W2-Assignment

A
  1. create an InteractiveContext to run TFX components interactively
  2. use the TFX ExampleGen component to split your dataset into training and evaluation datasets
  3. generate the statistics and the schema of your dataset using the TFX StatisticsGen and SchemaGen components
  4. validate the evaluation dataset statistics using the TFX ExampleValidator component
  5. perform feature engineering using the TFX Transform component

We refer to the outputs of pipeline components as artifacts.
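
A condensed sketch of this wiring, assuming a directory of CSV files; _data_root and _module_file (the file defining the preprocessing_fn) are placeholders, not values from the assignment.

    from tfx.components import (CsvExampleGen, StatisticsGen, SchemaGen,
                                ExampleValidator, Transform)
    from tfx.orchestration.experimental.interactive.interactive_context import (
        InteractiveContext)

    # 1. Run TFX components interactively, one at a time.
    context = InteractiveContext()

    # 2. Ingest the CSVs and split them into training/evaluation examples.
    example_gen = CsvExampleGen(input_base=_data_root)  # _data_root: placeholder
    context.run(example_gen)

    # 3. Compute statistics, then infer a schema from them.
    statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    context.run(statistics_gen)
    schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
    context.run(schema_gen)

    # 4. Validate the evaluation split against the inferred schema.
    example_validator = ExampleValidator(
        statistics=statistics_gen.outputs['statistics'],
        schema=schema_gen.outputs['schema'])
    context.run(example_validator)

    # 5. Feature engineering with the Transform component;
    #    _module_file (placeholder) defines the preprocessing_fn.
    transform = Transform(
        examples=example_gen.outputs['examples'],
        schema=schema_gen.outputs['schema'],
        module_file=_module_file)
    context.run(transform)

Each context.run call produces the artifacts that downstream components consume via .outputs.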
