Machine Learning in the Enterprise Flashcards
What do you call the process of developing a pipeline for model (re)training?
Model operationalization
What are the different types of inference?
Online prediction - API for real-time prediction
Streaming prediction - near real-time event-based predictions
Batch prediction - offline prediction in batches
Embedded prediction - prediction on edge devices such as mobile phones
What is a Data Catalog?
Data Catalog is a data management service that adds metadata to data coming from various sources like BigQuery, Cloud Storage, Dataplex, etc. It tags the data and makes it discoverable and understandable to everyone.
What is Dataplex?
Enables you to centrally manage your data coming from various sources such as data lakes, data warehouses, and data marts. It consists of lakes, zones, and assets. By logically organizing data into lakes and zones you control who has access to which data, and Dataplex makes the data easily discoverable while providing data lineage, automatic data quality checks, and automatic metadata extraction.
What is Analytics Hub?
It is a central location where you can publish your data and subscribe to data from other publishers, then easily ingest and use it in your project. Publishers pay for the storage of the data, while subscribers pay for their analytics workloads.
What are the different data preprocessing options on GCP?
BigQuery - recommended for structured data; perform transformations on the data and store the result in a “clean” dataset (see the sketch after this list)
Dataflow - recommended for unstructured data
Dataproc - for customers who already have their data preprocessing pipelines in Spark and Hadoop
TensorFlow Extended (TFX) - if you already use it for your model pipelines
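For the BigQuery option above, a minimal sketch using the BigQuery Python client to run a cleaning transformation into a separate “clean” dataset (project, dataset, and table names are hypothetical):

```python
from google.cloud import bigquery

# The client picks up the project and credentials from the environment (ADC).
client = bigquery.Client()

# Hypothetical source table and destination "clean" dataset/table.
sql = """
CREATE OR REPLACE TABLE `my_project.clean_dataset.orders` AS
SELECT
  order_id,
  LOWER(TRIM(country)) AS country,
  SAFE_CAST(amount AS FLOAT64) AS amount
FROM `my_project.raw_dataset.orders`
WHERE amount IS NOT NULL
"""

# Run the transformation as a BigQuery job and wait for it to finish.
client.query(sql).result()
```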
What is Dataprep?
A UI on top of Dataflow. It offers visualizations, quality checks, statistics, and easy-to-use data transformations. Its most important feature is recipes, which chain together different types of transformations (predefined or your own).
What are some optimal and maximal values for batch size?
A batch size of 40-100 is typically optimal, with around 500 as a maximum.
Is it better to start with a smaller or a larger batch size?
Smaller
What is model parallelism?
When your model is too big to fit on one device, you split it across devices or cluster nodes, e.g. per layer, so that each node holds only a portion of the deep neural network.
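A minimal sketch of the idea in TensorFlow, manually placing different parts of the network on different GPUs (the two-GPU layout and layer sizes are assumptions for illustration):

```python
import tensorflow as tf

# Hypothetical two-GPU machine: put the first half of the network on GPU 0
# and the second half on GPU 1, so neither device holds the whole model.
inputs = tf.keras.Input(shape=(1024,))

with tf.device("/GPU:0"):
    x = tf.keras.layers.Dense(4096, activation="relu")(inputs)
    x = tf.keras.layers.Dense(4096, activation="relu")(x)

with tf.device("/GPU:1"):
    x = tf.keras.layers.Dense(4096, activation="relu")(x)
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
```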
How can you run your Python training application in a custom job, what options do you have for running your code?
You can use a pre-built container provided by Google for various frameworks, or use a custom container image.
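As a hedged sketch with the google-cloud-aiplatform SDK, a local training script can be packaged into a CustomJob that runs on one of Google's pre-built framework containers (the project, bucket, and container URI below are assumptions; check the docs for current image names):

```python
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

# Wrap trainer/task.py and run it on a pre-built TensorFlow training container.
job = aiplatform.CustomJob.from_local_script(
    display_name="my-custom-training",
    script_path="trainer/task.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12:latest",
    requirements=["pandas"],
    replica_count=1,
    machine_type="n1-standard-4",
)

job.run()
```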
How to work with large datasets that can not fit into memory during training?
Typical approaches are to stream the data or load it in batches rather than reading the whole dataset into memory.
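A minimal sketch of the batching/streaming idea with tf.data, reading sharded TFRecord files straight from GCS instead of loading everything into memory (the bucket path and parsing schema are hypothetical):

```python
import tensorflow as tf

# Hypothetical sharded TFRecords in GCS; tf.data streams them lazily.
files = tf.data.Dataset.list_files("gs://my-bucket/train/*.tfrecord")

def parse(example):
    # Hypothetical schema: a fixed-size float feature vector and an integer label.
    spec = {
        "features": tf.io.FixedLenFeature([32], tf.float32),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(example, spec)
    return parsed["features"], parsed["label"]

dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(64)                      # only one batch is materialized at a time
    .prefetch(tf.data.AUTOTUNE)
)

# model.fit(dataset, epochs=5)  # the model consumes the data batch by batch
```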
What is the required thing to do when the custom job is completed with model training?
You need to store the model in Cloud Storage. After training completes the VM is shut down, and if you don’t save the model for later use it will be deleted.
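A hedged sketch of the end of a training script: Vertex AI custom training exposes the target GCS directory via the AIP_MODEL_DIR environment variable, and the fallback bucket path here is just an illustration:

```python
import os
import tensorflow as tf

# ... build and train the model (toy model shown only so the snippet runs) ...
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])

# AIP_MODEL_DIR points to a GCS location managed by Vertex AI; if it is not
# set (e.g. when running locally), fall back to a hypothetical bucket path.
model_dir = os.environ.get("AIP_MODEL_DIR", "gs://my-bucket/model-output")

# Persist the artifacts before the training VM is shut down and wiped.
model.save(model_dir)
```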
Can you use model artifacts stored in GCS directly in Vertex AI for prediction?
No, you first have to register it in the Vertex AI Model Registry.
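A hedged sketch of registering GCS model artifacts in the Vertex AI Model Registry with Model.upload (the bucket path and serving image URI are assumptions):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Register the artifacts from GCS as a Model resource; only after this step
# can the model be deployed to an endpoint or used for batch prediction.
model = aiplatform.Model.upload(
    display_name="my-model",
    artifact_uri="gs://my-bucket/model-output",
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest"
    ),
)

print(model.resource_name)
```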
What are the different ways of packaging your code for custom training?
- Store your code in a single Python file and use it in combination with a pre-built container (good for prototyping)
- Package your code as a Python source distribution and use it with a pre-built container (see the setup.py sketch after this list)
- Create a custom Docker image and store it in Artifact Registry
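For the source-distribution option, a minimal setup.py sketch (the package name and dependencies are hypothetical); running "python setup.py sdist" produces the archive that the pre-built training container installs and runs:

```python
from setuptools import find_packages, setup

# Hypothetical trainer package containing the training application code.
setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),
    install_requires=["pandas", "gcsfs"],
    description="Training application for a Vertex AI custom job.",
)
```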
What is Cloud Storage FUSE?
It is a tool that mounts GCS buckets so your code can access files/folders directly without downloading them. Buckets are available under the “/gcs/” root folder and can be treated like a regular file system.
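A minimal sketch of what that looks like inside a training container where the bucket is mounted (the bucket and file names are hypothetical):

```python
import pandas as pd

# With Cloud Storage FUSE, gs://my-bucket/data/train.csv appears as a local
# file under /gcs/, so ordinary file I/O works without downloading it first.
df = pd.read_csv("/gcs/my-bucket/data/train.csv")

with open("/gcs/my-bucket/data/config.txt") as f:
    config_text = f.read()
```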
What are the different ways of loading data for custom training?
- Load data from Cloud Storage using Cloud Storage FUSE
- Mount an NFS share
- Use managed datasets
If you need to access other Google resources from a custom training job, what kind of authentication is the recommended way to access them?
The recommended way is to use ADC (Application Default Credentials). Vertex AI automatically configures the Custom Code Service Agent with predefined permissions, but if you need a different set of permissions you can attach a custom service account.
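A hedged sketch of how client code typically picks up those credentials via the google-auth library; inside a Vertex AI custom job ADC resolves to the job's service account without extra configuration:

```python
import google.auth
from google.cloud import storage

# Application Default Credentials: inside a custom training job this resolves
# to the Custom Code Service Agent (or the custom service account you attached).
credentials, project = google.auth.default()

# Most client libraries use ADC implicitly, so passing them is optional.
client = storage.Client(credentials=credentials, project=project)
buckets = list(client.list_buckets())
```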
What do you need to do to read files from GCS in Vertex AI custom jobs?
Nothing; they are automatically accessible under the “/gcs/” path because Cloud Storage FUSE is integrated by default.
What will happen if a VM is shutdown/restarted during model training? How to approach this problem?
If the VM restarts during model training, the training progress is lost and, once it starts up again, training begins from scratch. To avoid this, the rule of thumb is: if your training lasts more than 4 hours, make sure to:
1. Store intermediate training state (checkpoints) to GCS
2. When training starts, first check whether saved progress already exists and resume from it (see the sketch below)
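A minimal sketch of both steps with TensorFlow checkpoints written to GCS (the fallback path is an assumption; Vertex AI also exposes an AIP_CHECKPOINT_DIR environment variable for this purpose):

```python
import os
import tensorflow as tf

# Hypothetical GCS location for intermediate training state.
checkpoint_dir = os.environ.get("AIP_CHECKPOINT_DIR", "gs://my-bucket/checkpoints")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
optimizer = tf.keras.optimizers.Adam()

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(checkpoint, checkpoint_dir, max_to_keep=3)

# Step 2: if the VM was restarted, resume from the latest saved checkpoint.
if manager.latest_checkpoint:
    checkpoint.restore(manager.latest_checkpoint)

# Step 1: inside the training loop, periodically persist progress to GCS, e.g.
# manager.save()
```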
How can you create a custom container image for training?
You can use the autopackaging feature, which in a single command builds a Docker image, pushes it to Container Registry, and creates a CustomJob resource based on that image. Autopackaging does not work for TrainingPipeline and HyperparameterTuningJob. Another option is to write a Dockerfile manually.
What is the difference between Custom Job and Training Pipeline?
A CustomJob is a single execution of your custom training code, while a TrainingPipeline orchestrates multiple steps: custom jobs, hyperparameter tuning jobs, uploading the resulting model artifacts, etc.
What type of search algorithms is Vertex AI Vizier supporting?
Grid - searches every combination of hyperparameters
Random - tries random combinations of parameters
Bayesian - decides which parameters to try based on the results of previous iterations (default; see the sketch below)
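A hedged sketch of launching a hyperparameter tuning job with the google-cloud-aiplatform SDK; leaving search_algorithm unset uses the Bayesian default (the training script, container URI, metric name, and parameter ranges are assumptions):

```python
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

# Hypothetical trial job; the training script is expected to report "accuracy".
custom_job = aiplatform.CustomJob.from_local_script(
    display_name="trial-job",
    script_path="trainer/task.py",
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12:latest",
)

tuning_job = aiplatform.HyperparameterTuningJob(
    display_name="my-tuning-job",
    custom_job=custom_job,
    metric_spec={"accuracy": "maximize"},
    parameter_spec={
        "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
        "batch_size": hpt.DiscreteParameterSpec(values=[32, 64, 128], scale="linear"),
    },
    max_trial_count=20,
    parallel_trial_count=4,
    # search_algorithm="grid" or "random" would override the Bayesian default.
)

tuning_job.run()
```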
What is the baseline for skew and drift in model monitoring?
Skew - the statistical distribution of feature values in the training data
Drift - the statistical distribution of feature values observed in production in the recent past