W2 - Machine Learning Data Lifecycle in Production Flashcards

1
Q

Normalization (e.g., MinMaxScaler) is usually a good choice if you know that the distribution of your data is not ____ .

A

Gaussian
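
For reference, a minimal min-max scaling sketch with scikit-learn's MinMaxScaler; the data values here are made up:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Made-up training data with a single feature column.
    X_train = np.array([[1.0], [5.0], [10.0]])

    # MinMaxScaler defaults to feature_range=(0, 1), i.e.
    # X_norm = (X - X_min) / (X_max - X_min).
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X_train)
    # X_scaled -> [[0.0], [0.444...], [1.0]]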

2
Q

What are Feature crosses?

A

They combine multiple features together into a new feature; that is fundamentally what a feature cross is.

It encodes non-linearity in the feature space, or encodes the same information in fewer features.
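
A minimal sketch in TensorFlow, assuming two hypothetical categorical features (day_of_week and hour_bucket) that are not part of this card: a cross concatenates the values, then hashes the result into buckets so the model sees a single categorical feature.

    import tensorflow as tf

    # Hypothetical categorical features for three example trips.
    day = tf.constant(['Mon', 'Mon', 'Sat'])
    hour = tf.constant(['morning', 'evening', 'evening'])

    # Combine the two features into one crossed feature...
    crossed = tf.strings.join([day, hour], separator='_x_')
    # ...and hash it into a fixed number of buckets so it can be
    # consumed as a single categorical input.
    crossed_ids = tf.strings.to_hash_bucket_fast(crossed, num_buckets=100)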

3
Q

You are working on a taxi tip prediction model, and your raw dataset has columns for the latitude and longitude of both the pickup and dropoff locations. These do not follow a Gaussian distribution.
Is the feature engineering process below useful? Why or why not?

Because the data does not follow a Gaussian distribution, you should normalize these location features using the formula:

X_norm = (X - X_min) / (X_max - X_min)

This puts the values into the range [0, 1], which can help training converge faster.
A

It's not useful.
Normalizing the raw coordinates implies that they carry quantitative meaning, which is misleading in this application. For example, with all other features held constant, you can't say that a bigger tip will be given just because the pickup longitude is 1 degree "greater" than on a previous trip.
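
A small numeric illustration with made-up pickup longitudes (these values are not from any dataset): min-max scaling is monotonic, so the normalized values preserve the same "greater/less" ordering, and that ordering has no meaningful relationship to the tip amount.

    import numpy as np

    # Made-up pickup longitudes for three trips.
    pickup_lon = np.array([-74.01, -73.98, -73.95])
    lon_norm = (pickup_lon - pickup_lon.min()) / (pickup_lon.max() - pickup_lon.min())
    # lon_norm -> [0.0, 0.5, 1.0]: the ordering survives scaling,
    # but "larger longitude" does not imply "larger tip".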

4
Q

What are four issues with doing feature engineering at scale?

A
  1. Inconsistencies in feature engineering (different code paths for training vs. serving, e.g., Python for training vs. Java for serving)
  2. Preprocessing granularity (instance-level vs. full-pass transformations)
  3. Preprocessing the training dataset
  4. Optimizing instance-level transformations

5
Q

What are the benefits of using tf.Transform?

A
  • The emitted TensorFlow graph holds all the necessary constants (like the training-set mean for standard scaling) and transformations
  • You focus on data preprocessing only at training time
  • Works in-line during both training and serving
  • No need for separate preprocessing code at serving time
  • Transformations are applied consistently, irrespective of the deployment platform
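
A minimal preprocessing_fn sketch; the feature names fare and payment_type are illustrative, not from this card. Analyzer results such as the training-set mean and the vocabulary are baked into the emitted graph as constants.

    import tensorflow as tf
    import tensorflow_transform as tft

    def preprocessing_fn(inputs):
        # 'fare' and 'payment_type' are hypothetical feature names.
        outputs = {}
        # Standard scaling: the training-set mean and variance are
        # computed once by analyzers and stored as graph constants.
        outputs['fare_scaled'] = tft.scale_to_z_score(inputs['fare'])
        # Build a vocabulary over the full training set, then map
        # each string to an integer id.
        outputs['payment_type_id'] = tft.compute_and_apply_vocabulary(
            inputs['payment_type'])
        return outputs
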
6
Q

With TensorFlow Transform, you can preprocess data using the same code for both training a model and serving inferences in production. True or False?

C2-W2-Lab1

A

True

It provides several utility functions for common preprocessing tasks including creating features that require a full pass over the training dataset.

7
Q

What are the outputs of TensorFlow Transform?

C2-W2-Lab1

A

The outputs are the transformed features and a TensorFlow graph which you can use for both training and serving.

Using the same graph for both training and serving can prevent feature skew, since the same transformations are applied in both stages.

8
Q

What's the difference between a TensorFlow operation and a TensorFlow Transform analyzer?

C2-W2-Lab1: Create a preprocessing function

A

Unlike TensorFlow ops, analyzers run only once, during training, and typically make a full pass over the entire training dataset. They create tensor constants, which are added to your graph. For example, tft.min computes the minimum of a tensor over the training dataset.
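
A short contrast sketch (the feature name x is arbitrary): inside a preprocessing_fn, a regular TensorFlow op is re-evaluated on every batch it sees, while analyzer results are constants computed once over the training data.

    import tensorflow as tf
    import tensorflow_transform as tft

    def preprocessing_fn(inputs):
        x = inputs['x']
        # TensorFlow op: recomputed for every batch, at training
        # and at serving time.
        x_centered_per_batch = x - tf.reduce_mean(x)
        # Analyzers: tft.min and tft.max make one full pass over the
        # training dataset; their results become graph constants.
        x_scaled = (x - tft.min(x)) / (tft.max(x) - tft.min(x))
        return {
            'x_centered_per_batch': x_centered_per_batch,
            'x_scaled': x_scaled,
        }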

9
Q

What are the main steps for preprocessing input data with TensorFlow Transform?

C2-W2-Lab1

A

Collect raw data
Define metadata
Create a preprocessing function
Generate a constant graph with the required transformations
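
A condensed end-to-end sketch of these steps, closely following TF Transform's standard "simple example"; the feature names x, y, and s are illustrative.

    import tempfile
    import tensorflow as tf
    import tensorflow_transform as tft
    import tensorflow_transform.beam as tft_beam
    from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

    # 1. Collect raw data (toy records).
    raw_data = [
        {'x': 1, 'y': 1, 's': 'hello'},
        {'x': 2, 'y': 2, 's': 'world'},
        {'x': 3, 'y': 3, 's': 'hello'},
    ]

    # 2. Define metadata describing the raw features.
    raw_data_metadata = dataset_metadata.DatasetMetadata(
        schema_utils.schema_from_feature_spec({
            'x': tf.io.FixedLenFeature([], tf.float32),
            'y': tf.io.FixedLenFeature([], tf.float32),
            's': tf.io.FixedLenFeature([], tf.string),
        }))

    # 3. Create a preprocessing function.
    def preprocessing_fn(inputs):
        return {
            'x_centered': inputs['x'] - tft.mean(inputs['x']),
            'y_normalized': tft.scale_to_0_1(inputs['y']),
            's_integerized': tft.compute_and_apply_vocabulary(inputs['s']),
        }

    # 4. Generate the constant graph with the required transformations.
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        transformed_dataset, transform_fn = (
            (raw_data, raw_data_metadata)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))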

10
Q

Like TFDV, TensorFlow Transform also uses ____ for deployment scalability and flexibility.

C2-W2-Lab1: Create a preprocessing function

A

Apache Beam

11
Q

What do:
ExampleGen
StatisticsGen
SchemaGen
ExampleValidator
Transform
do in a TFX pipeline?

C2-W2-Lab2

A

ingest data from a base directory with ExampleGen
compute the statistics of the training data with StatisticsGen
infer a schema with SchemaGen
detect anomalies in the evaluation data with ExampleValidator
preprocess the data into features suitable for model training with Transform

12
Q

What are the steps for building a data pipeline with TensorFlow Extended (TFX) to prepare features from a dataset?

C2W2-Assignment

A
  1. create an InteractiveContext to run TFX components interactively
  2. use the TFX ExampleGen component to split your dataset into training and evaluation datasets
  3. generate the statistics and the schema of your dataset using the TFX StatisticsGen and SchemaGen components
  4. validate the evaluation dataset statistics using the TFX ExampleValidator component
  5. perform feature engineering using the TFX Transform component

We refer to the outputs of pipeline components as artifacts.
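
A condensed sketch of this wiring, assuming a directory of CSV files; _data_root and _module_file (the file defining the preprocessing_fn) are placeholders, not values from the assignment.

    from tfx.components import (CsvExampleGen, StatisticsGen, SchemaGen,
                                ExampleValidator, Transform)
    from tfx.orchestration.experimental.interactive.interactive_context import (
        InteractiveContext)

    # 1. Run TFX components interactively, one at a time.
    context = InteractiveContext()

    # 2. Ingest the CSVs and split them into training/evaluation examples.
    example_gen = CsvExampleGen(input_base=_data_root)  # _data_root: placeholder
    context.run(example_gen)

    # 3. Compute statistics, then infer a schema from them.
    statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    context.run(statistics_gen)
    schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
    context.run(schema_gen)

    # 4. Validate the evaluation split against the inferred schema.
    example_validator = ExampleValidator(
        statistics=statistics_gen.outputs['statistics'],
        schema=schema_gen.outputs['schema'])
    context.run(example_validator)

    # 5. Feature engineering with the Transform component;
    #    _module_file (placeholder) defines the preprocessing_fn.
    transform = Transform(
        examples=example_gen.outputs['examples'],
        schema=schema_gen.outputs['schema'],
        module_file=_module_file)
    context.run(transform)

Each context.run call produces the artifacts that downstream components consume via .outputs.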
