All Flashcards
For all non-numeric columns other than TIMESTAMP, BigQuery ML performs a one-hot encoding transformation. This transformation generates a separate feature for each unique value in the column.
True
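The transformation above can be sketched in plain Python. This is a minimal illustration of what one-hot encoding does to a categorical column, not BigQuery ML's internal implementation; the function and column names are made up for the example.

```python
def one_hot_encode(rows, column):
    """Expand a categorical column into one binary feature per unique value."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        features = {f"{column}_{v}": int(row[column] == v) for v in values}
        rest = {k: x for k, x in row.items() if k != column}
        encoded.append({**rest, **features})
    return encoded

rows = [{"fare": 12.5, "payment": "card"}, {"fare": 7.0, "payment": "cash"}]
encoded = one_hot_encode(rows, "payment")
# each row gains payment_card and payment_cash in place of payment
```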
To understand other performance metrics, you can configure Google Cloud Monitoring to track your model's traffic patterns, error rates, latency, and resource utilization. This can help you spot problems with your models and find the right machine type to optimize latency and cost.
True
What is the trade-off between static and dynamic training?
- Static training is simpler to build and test, but the model will probably become stale.
- Dynamic training is harder to build and test, but will adapt to changes. Part of the reason it is harder to build and test is that new data may have all sorts of bugs in it.
What are three potential architectures to explore for dynamic training?
- Cloud Functions,
- App Engine,
- Dataflow.
Describe a general architecture for dynamic training using Cloud Functions.
- a new data file appears in Cloud Storage,
- the Cloud Function is launched.
- the Cloud Function starts the AI Platform training job.
- the AI Platform writes out a new model.
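The glue logic in the steps above can be sketched as a plain-Python stand-in. This is a hedged sketch only: the event shape mimics a Cloud Storage trigger, but `make_training_job`, the job fields, and the region are illustrative, not the actual AI Platform API surface.

```python
def make_training_job(event):
    """Map a 'new file in Cloud Storage' event to a training-job request body."""
    bucket, name = event["bucket"], event["name"]
    return {
        "jobId": "train_" + name.replace("/", "_").replace(".", "_"),
        "trainingInput": {
            "region": "us-central1",
            "args": ["--data", f"gs://{bucket}/{name}"],
        },
    }

# the Cloud Function would build and submit this when a new file lands
job = make_training_job({"bucket": "my-data", "name": "sales/2020.csv"})
```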
Describe a general architecture for dynamic training using App Engine.
- when a user makes a web request from a dashboard to App Engine, an AI Platform training job is launched,
- AI Platform job writes a new model to Cloud Storage
- from there, the statistics of the training job are displayed to the user when the job is complete.
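The request-to-stats flow above can be outlined with stubs. Everything here (`FakeTrainingJob`, `handle_dashboard_request`) is a hypothetical sketch of the control flow, not App Engine or AI Platform code.

```python
class FakeTrainingJob:
    """Stand-in for an AI Platform training job (illustration only)."""
    done = False
    def wait_until_done(self):
        self.done = True          # real code would poll the job's status

def handle_dashboard_request(launch_job, read_stats):
    """Launch a training job, then return its statistics once it completes."""
    job = launch_job()            # kicked off by the user's web request
    job.wait_until_done()         # model is written to Cloud Storage meanwhile
    return read_stats(job)        # stats displayed back on the dashboard

stats = handle_dashboard_request(FakeTrainingJob, lambda job: {"done": job.done})
```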
Describe how a Dataflow pipeline can invoke the model for predictions.
- streaming data is ingested into a Pub/Sub topic from publishers,
- Messages are then aggregated with Dataflow,
- aggregated data is stored in BigQuery.
- AI Platform is launched on the arrival of new data in BigQuery,
- then an updated model is deployed.
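The aggregation step in the pipeline above can be illustrated with a plain-Python stand-in for a fixed-window sum; a real pipeline would use the Apache Beam SDK rather than this toy function.

```python
from collections import defaultdict

def fixed_window_sum(events, window_seconds):
    """Sum (timestamp, value) events within fixed-size time windows."""
    windows = defaultdict(float)
    for ts, value in events:
        windows[ts - ts % window_seconds] += value
    return dict(windows)

events = [(0, 1.0), (30, 2.0), (65, 4.0)]
aggregated = fixed_window_sum(events, 60)  # {0: 3.0, 60: 4.0}
```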
How can latency be improved when serving models?
- Static serving computes the label ahead of time and serves it by looking it up in a table.
- Dynamic serving, in contrast, computes the label on demand.
Describe the space-time trade-off in serving a prediction model.
- Static serving is space-intensive: we store precomputed predictions, which raises storage costs, but in exchange we get low, fixed latency and lower maintenance costs.
- Dynamic serving, in contrast, is compute-intensive: it has lower storage costs but higher maintenance costs and variable latency.
What is Peakedness in a data distribution?
Peakedness in a data distribution is the degree to which data values are concentrated around the mean, or in the case of choosing between model serving approaches, how concentrated the distribution of the prediction workload is.
What is Cardinality in a data distribution?
Cardinality refers to the number of values in a set. In this case, the set is composed of all the possible things we might have to make predictions for.
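Both quantities can be inspected for a prediction workload with a few lines of Python. Using the top request's share of traffic as a crude peakedness proxy is my simplification for the example, not a definition from the source.

```python
from collections import Counter

# A synthetic prediction workload: 100 requests over 4 distinct keys.
requests = ["item_a"] * 90 + ["item_b"] * 8 + ["item_c", "item_d"]
counts = Counter(requests)

cardinality = len(counts)                                 # distinct keys: 4
top_share = counts.most_common(1)[0][1] / len(requests)   # head share: 0.9
```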
When to choose static vs. dynamic model serving?
- When the cardinality is sufficiently low, we can store the entire expected prediction workload.
- When the cardinality is high because the size of the input space is large and the workload is not very peaked, you probably want to use dynamic serving.
- In practice though, a hybrid of static and dynamic is often chosen, where you statically cache some of the predictions while responding on demand for the long tail. This works best when the distribution is sufficiently peaked.
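The hybrid approach can be sketched as a cache over the peaked head of the workload with an on-demand fallback for the tail (a minimal sketch; `make_hybrid_server` and the toy model are illustrative).

```python
def make_hybrid_server(model_fn, hot_keys):
    """Precompute the peaked head of the workload; compute the tail on demand."""
    cache = {key: model_fn(key) for key in hot_keys}  # static part, precomputed
    def serve(key):
        if key in cache:
            return cache[key]        # low, fixed latency for cached keys
        return model_fn(key)         # dynamic fallback for the long tail
    return serve

serve = make_hybrid_server(lambda key: len(key), ["a", "bb"])
```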
What design changes need to be made if you want to build a static serving system?
- you need to change your call to AI Platform from an online prediction job to a batch prediction job.
- you need to make sure that your model accepts and passes through keys as input. These keys allow you to join your requests to predictions at serving time.
- you write the predictions to a data warehouse like BigQuery, and create an API to read from it.
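The key join in the second step can be illustrated in plain Python (in practice this join would be a SQL query against the warehouse; the record shapes here are assumptions for the example).

```python
def join_predictions(requests, predictions):
    """Join serving-time requests to batch predictions via a pass-through key."""
    by_key = {p["key"]: p["prediction"] for p in predictions}
    return [{**r, "prediction": by_key.get(r["key"])} for r in requests]

batch = [{"key": "u1", "prediction": 0.9}, {"key": "u2", "prediction": 0.1}]
joined = join_predictions([{"key": "u2"}], batch)
```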
Explain Extrapolation and Interpolation.
- Extrapolation means to generalize outside the bounds of what we’ve previously seen.
- Interpolation means to generalize within the bounds of what we’ve previously seen.
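For a single numeric feature, the distinction reduces to a bounds check against the training data (a toy formulation; real feature spaces are multi-dimensional):

```python
def prediction_mode(x, train_min, train_max):
    """Classify a request as interpolation or extrapolation vs. training bounds."""
    return "interpolation" if train_min <= x <= train_max else "extrapolation"

prediction_mode(5, 0, 10)   # inside the bounds we trained on
prediction_mode(42, 0, 10)  # outside them, so the model must extrapolate
```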
How can you protect a model from changing distributions?
- be vigilant through monitoring (summary statistics of inputs over time)
- check whether residuals (the differences between your model's predictions and the labels) have changed as a function of your inputs
- emphasize data recency during model training
- retrain the model frequently
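The first two points can be sketched as a simple summary-statistics check. The z-score-style threshold is my own illustrative heuristic, not a method prescribed by the source.

```python
from statistics import mean, stdev

def drifted(train_values, serving_values, z_threshold=3.0):
    """Flag drift when the serving mean sits far outside the training distribution."""
    mu, sigma = mean(train_values), stdev(train_values)
    return abs(mean(serving_values) - mu) > z_threshold * sigma

train = [10.0, 11.0, 9.0, 10.5, 9.5]
# serving inputs near the training mean pass; a large shift trips the flag
```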