Machine Learning in the Enterprise Flashcards
What do you call the process of developing a pipeline for model (re)training?
Model operationalization
What are the different types of inference?
Online prediction - an API for real-time predictions (see the sketch after this list)
Streaming prediction - near real-time event-based predictions
Batch prediction - offline prediction in batches
Embedded prediction - prediction embedded on edge devices such as mobile phones
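For illustration, online prediction with the Vertex AI Python SDK could look roughly like the sketch below; the project, region, endpoint ID, and instance payload are placeholders, not part of the original card.

```python
# Minimal sketch of online (real-time) prediction with the Vertex AI SDK.
# Project, region, endpoint ID, and the instance payload are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint(endpoint_name="1234567890")  # ID of a deployed endpoint
response = endpoint.predict(instances=[{"feature_a": 1.0, "feature_b": "x"}])
print(response.predictions)
```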
What is a Data Catalog?
Data Catalog is a data management service that adds metadata to data coming from various sources such as BigQuery, Cloud Storage, Dataplex, etc. It tags the data and makes it discoverable and understandable to everyone.
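For example, cataloged assets can be searched programmatically; a rough sketch with the Python client, where the project ID and query string are placeholders:

```python
# Hedged sketch: keyword search across Data Catalog entries.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
request = datacatalog_v1.SearchCatalogRequest(
    scope=datacatalog_v1.SearchCatalogRequest.Scope(include_project_ids=["my-project"]),
    query="orders",  # free-text query, e.g. a table or column name
)

for result in client.search_catalog(request=request):
    print(result.relative_resource_name)
```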
What is Dataplex?
Dataplex enables you to centrally manage data coming from various sources such as data lakes, data warehouses, and data marts. It consists of Lakes, Zones, and Assets. By logically organizing data into Lakes and Zones you control who has access to which data and make it easily discoverable; Dataplex also provides data lineage, automatic data quality checks, and automatic metadata extraction.
What is Analytics Hub?
It is a central location where you can publish your own data and subscribe to data from other publishers, then easily ingest and use it in your project. Publishers pay for storing the data, while subscribers pay for the analytics workloads they run on it.
What are the different data preprocessing options on GCP?
BigQuery - recommended for structured data, performing transformations on the data and storing it in a “clean” dataset
Dataflow - recommended for unstructured data
Dataproc - for customers who already have their data preprocessing pipeline in Spark and Hadoop
TensorFlow Extended (TFX) - if you already use it for your model pipelines
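As a sketch of the BigQuery option, a preprocessing query can be run from Python and its result materialized in a "clean" dataset; the project, dataset, and column names below are illustrative:

```python
# Hedged sketch: run a transformation in BigQuery and store the result in a clean dataset.
from google.cloud import bigquery

client = bigquery.Client()

query = """
CREATE OR REPLACE TABLE `my-project.clean_dataset.orders` AS
SELECT
  CAST(order_id AS INT64) AS order_id,
  LOWER(TRIM(country)) AS country,
  SAFE_CAST(amount AS FLOAT64) AS amount
FROM `my-project.raw_dataset.orders`
WHERE amount IS NOT NULL
"""

client.query(query).result()  # wait for the transformation job to finish
```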
What is Dataprep?
A UI on top of Dataflow. It provides visualizations, quality checks, statistics, and easy-to-use data transformations. Its most important feature is recipes, which chain together different transformations (predefined or your own).
What are some optimal and maximal values for batch size?
40-100 as an optimal range and 500 as a maximum
Is it better to start with a small or larger batch size?
Smaller
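In Keras, for instance, the batch size is just a training hyperparameter; a toy sketch where the random data and the value 64 are only illustrative:

```python
# Hedged sketch: start with a smaller batch size and increase it only if
# training stays stable and memory allows. The data here is random dummy data.
import numpy as np
import tensorflow as tf

x_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

model.fit(x_train, y_train, batch_size=64, epochs=2)  # 64 falls in the 40-100 range
```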
What is model parallelism?
When your model is too big to fit on one device, you have to split it across devices or cluster nodes, for example by layers, so that portions of the deep neural network run on different nodes.
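A minimal sketch of manual model parallelism in TensorFlow, assuming two GPUs are visible; the layer sizes and device names are assumptions, not from the original card:

```python
# Hedged sketch: layers are pinned to different devices so that a model too
# large for one device is split across several.
import tensorflow as tf

inputs = tf.keras.Input(shape=(1024,))

with tf.device("/GPU:0"):   # first part of the network on the first GPU
    x = tf.keras.layers.Dense(4096, activation="relu")(inputs)
    x = tf.keras.layers.Dense(4096, activation="relu")(x)

with tf.device("/GPU:1"):   # remaining layers on the second GPU
    x = tf.keras.layers.Dense(4096, activation="relu")(x)
    outputs = tf.keras.layers.Dense(10)(x)

model = tf.keras.Model(inputs, outputs)
```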
How can you run your Python training application in a custom job? What options do you have for running your code?
You can use a pre-built container provided by Google for various frameworks, or use a custom container image.
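A rough sketch of submitting a custom job that runs a single training script in a Google pre-built TensorFlow container; the project, bucket, script path, and image tag are placeholders:

```python
# Hedged sketch: custom training job using a pre-built container.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomTrainingJob(
    display_name="my-training-job",
    script_path="trainer/task.py",  # local path to your training script
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12:latest",
)

job.run(replica_count=1, machine_type="n1-standard-4")
```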
How to work with large datasets that can not fit into memory during training?
Possible approaches are to stream the data or to load it in batches instead of reading it all into memory at once.
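For example, with tf.data the records can be streamed from Cloud Storage and batched lazily; the file pattern and batch size below are placeholders:

```python
# Hedged sketch: stream TFRecord files lazily and batch them on the fly.
import tensorflow as tf

files = tf.data.Dataset.list_files("gs://my-bucket/training-data/*.tfrecord")

dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .shuffle(buffer_size=10_000)   # shuffle within a bounded buffer, not the whole dataset
    .batch(64)                     # records would still need a .map(parse_fn) step
    .prefetch(tf.data.AUTOTUNE)
)
```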
What is the required thing to do when the custom job is completed with model training?
You need to store the trained model in Cloud Storage. After training completes the VM is shut down, and if you don't save the model for later use it will be lost.
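A small sketch of saving the model straight to Cloud Storage at the end of the training script; AIP_MODEL_DIR is set by Vertex AI for custom jobs, while the fallback bucket path and the toy model are placeholders:

```python
# Hedged sketch: persist the trained model to GCS before the training VM is torn down.
import os
import tensorflow as tf

# A trivial model stands in for the real one trained in the custom job.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

model_dir = os.environ.get("AIP_MODEL_DIR", "gs://my-bucket/models/my-model/")
model.save(model_dir)  # writes the SavedModel directly to Cloud Storage
```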
Can you use model artifacts stored in GCS directly in Vertex AI for prediction?
No, you first have to import it into the Vertex AI Model Registry.
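Importing artifacts from GCS into the Model Registry could look roughly like this; the URIs, names, and serving image tag are placeholders:

```python
# Hedged sketch: register model artifacts from GCS in the Vertex AI Model Registry.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="my-model",
    artifact_uri="gs://my-bucket/models/my-model/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest",
)
print(model.resource_name)
```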
What are the different ways of packaging your code for custom training?
- Store your code in a single Python file and use it in combination with a pre-built container (good for prototyping)
- Package your code as a Python source distribution and use it with a pre-built container (see the setup.py sketch after this list)
- Build a custom Docker image and store it in Artifact Registry
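For the source-distribution option, a minimal setup.py could look like the sketch below (the package name, version, and dependencies are illustrative); running `python setup.py sdist` then produces the archive that is uploaded for training:

```python
# Hedged sketch: minimal setup.py for packaging trainer code as a source distribution.
from setuptools import find_packages, setup

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),      # expects e.g. a trainer/ package with __init__.py
    install_requires=["pandas"],   # extra dependencies your training code needs
    description="Training application for a Vertex AI custom job",
)
```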
What is Cloud Storage FUSE?
It is a tool that lets you mount GCS buckets and access their files and folders directly without downloading them. The buckets are available under the “/gcs/” root folder and can be treated like a regular file system.
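With the bucket mounted, reading a file is just ordinary file I/O; a tiny sketch where the bucket and file names are placeholders:

```python
# Hedged sketch: read a CSV from a GCS bucket mounted via Cloud Storage FUSE.
import pandas as pd

# gs://my-bucket/data/train.csv appears as a local path under /gcs/
df = pd.read_csv("/gcs/my-bucket/data/train.csv")
print(df.head())
```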