Machine Learning in the Enterprise Flashcards
What do you call the process of developing a pipeline for model (re)training?
Model operationalization
What are the different types of inference?
Online prediction - an API for real-time predictions (see the sketch after this list)
Streaming prediction - near real-time event-based predictions
Batch prediction - offline prediction in batches
Embedded prediction - prediction embedded on edge devices such as mobile phones
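For illustration, online prediction with the Vertex AI Python SDK could look roughly like the sketch below; the project, region, endpoint ID, and instance payload are placeholders, not part of the original card.

```python
# Minimal sketch of online (real-time) prediction with the Vertex AI SDK.
# Project, region, endpoint ID, and the instance payload are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint(endpoint_name="1234567890")  # ID of a deployed endpoint
response = endpoint.predict(instances=[{"feature_a": 1.0, "feature_b": "x"}])
print(response.predictions)
```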
What is a Data Catalog?
Data Catalog is a data management service that adds metadata to data coming from various sources such as BigQuery, Cloud Storage, Dataplex, etc. It tags the data and makes it discoverable and understandable to everyone.
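For example, cataloged assets can be searched programmatically; a rough sketch with the Python client, where the project ID and query string are placeholders:

```python
# Hedged sketch: keyword search across Data Catalog entries.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()
request = datacatalog_v1.SearchCatalogRequest(
    scope=datacatalog_v1.SearchCatalogRequest.Scope(include_project_ids=["my-project"]),
    query="orders",  # free-text query, e.g. a table or column name
)

for result in client.search_catalog(request=request):
    print(result.relative_resource_name)
```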
What is Dataplex?
Dataplex enables you to centrally manage data coming from various sources such as data lakes, data warehouses, and data marts. It consists of Lakes, Zones, and Assets. By logically organizing data into Lakes and Zones you control who has access to which data and make it easily discoverable; Dataplex also provides data lineage, automatic data quality checks, and automatic metadata extraction.
What is Analytics Hub?
It is a central location where you can publish your own data and subscribe to data from other publishers, then easily ingest and use it in your project. Publishers pay for storing the data, while subscribers pay for the analytics workloads they run on it.
What are the different data preprocessing options on GCP?
BigQuery - recommended for structured data, performing transformations on the data and storing it in a “clean” dataset
Dataflow - recommended for unstructured data
Dataproc - for customers who already have their data preprocessing pipeline in Spark and Hadoop
TensorFlow Extended (TFX) - if you already use it for your model pipelines
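As a sketch of the BigQuery option, a preprocessing query can be run from Python and its result materialized in a "clean" dataset; the project, dataset, and column names below are illustrative:

```python
# Hedged sketch: run a transformation in BigQuery and store the result in a clean dataset.
from google.cloud import bigquery

client = bigquery.Client()

query = """
CREATE OR REPLACE TABLE `my-project.clean_dataset.orders` AS
SELECT
  CAST(order_id AS INT64) AS order_id,
  LOWER(TRIM(country)) AS country,
  SAFE_CAST(amount AS FLOAT64) AS amount
FROM `my-project.raw_dataset.orders`
WHERE amount IS NOT NULL
"""

client.query(query).result()  # wait for the transformation job to finish
```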
What is Dataprep?
A UI on top of Dataflow. It provides visualizations, quality checks, statistics, and easy-to-use data transformations. Its most important feature is recipes, which chain together different transformations (predefined or your own).
What are some optimal and maximal values for batch size?
40-100 as an optimal range and 500 as a maximum
Is it better to start with a small or larger batch size?
Smaller
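In Keras, for instance, the batch size is just a training hyperparameter; a toy sketch where the random data and the value 64 are only illustrative:

```python
# Hedged sketch: start with a smaller batch size and increase it only if
# training stays stable and memory allows. The data here is random dummy data.
import numpy as np
import tensorflow as tf

x_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

model.fit(x_train, y_train, batch_size=64, epochs=2)  # 64 falls in the 40-100 range
```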
What is model parallelism?
When your model is too big to fit on one device, you have to split it across devices or cluster nodes, for example by layers, so that portions of the deep neural network run on different nodes.
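A minimal sketch of manual model parallelism in TensorFlow, assuming two GPUs are visible; the layer sizes and device names are assumptions, not from the original card:

```python
# Hedged sketch: layers are pinned to different devices so that a model too
# large for one device is split across several.
import tensorflow as tf

inputs = tf.keras.Input(shape=(1024,))

with tf.device("/GPU:0"):   # first part of the network on the first GPU
    x = tf.keras.layers.Dense(4096, activation="relu")(inputs)
    x = tf.keras.layers.Dense(4096, activation="relu")(x)

with tf.device("/GPU:1"):   # remaining layers on the second GPU
    x = tf.keras.layers.Dense(4096, activation="relu")(x)
    outputs = tf.keras.layers.Dense(10)(x)

model = tf.keras.Model(inputs, outputs)
```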
How can you run your Python training application in a custom job? What options do you have for running your code?
You can use a pre-built container provided by Google for various frameworks, or use a custom container image.
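A rough sketch of submitting a custom job that runs a single training script in a Google pre-built TensorFlow container; the project, bucket, script path, and image tag are placeholders:

```python
# Hedged sketch: custom training job using a pre-built container.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",
)

job = aiplatform.CustomTrainingJob(
    display_name="my-training-job",
    script_path="trainer/task.py",  # local path to your training script
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12:latest",
)

job.run(replica_count=1, machine_type="n1-standard-4")
```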
How to work with large datasets that can not fit into memory during training?
Possible approaches are to stream the data or to load it in batches instead of reading it all into memory at once.
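For example, with tf.data the records can be streamed from Cloud Storage and batched lazily; the file pattern and batch size below are placeholders:

```python
# Hedged sketch: stream TFRecord files lazily and batch them on the fly.
import tensorflow as tf

files = tf.data.Dataset.list_files("gs://my-bucket/training-data/*.tfrecord")

dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .shuffle(buffer_size=10_000)   # shuffle within a bounded buffer, not the whole dataset
    .batch(64)                     # records would still need a .map(parse_fn) step
    .prefetch(tf.data.AUTOTUNE)
)
```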
What is the required thing to do when the custom job is completed with model training?
You need to store the trained model in Cloud Storage. After training completes the VM is shut down, and if you don't save the model for later use it will be lost.
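A small sketch of saving the model straight to Cloud Storage at the end of the training script; AIP_MODEL_DIR is set by Vertex AI for custom jobs, while the fallback bucket path and the toy model are placeholders:

```python
# Hedged sketch: persist the trained model to GCS before the training VM is torn down.
import os
import tensorflow as tf

# A trivial model stands in for the real one trained in the custom job.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

model_dir = os.environ.get("AIP_MODEL_DIR", "gs://my-bucket/models/my-model/")
model.save(model_dir)  # writes the SavedModel directly to Cloud Storage
```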
Can you use model artifacts stored in GCS directly in Vertex AI for prediction?
No, you first have to import it into the Vertex AI Model Registry.
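Importing artifacts from GCS into the Model Registry could look roughly like this; the URIs, names, and serving image tag are placeholders:

```python
# Hedged sketch: register model artifacts from GCS in the Vertex AI Model Registry.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="my-model",
    artifact_uri="gs://my-bucket/models/my-model/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-12:latest",
)
print(model.resource_name)
```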
What are the different ways of packaging your code for custom training?
- Store your code in a single Python file and use it in combination with a pre-built container (good for prototyping)
- Package your code as a Python source distribution and use it with a pre-built container (see the setup.py sketch after this list)
- Build a custom Docker image and store it in Artifact Registry
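For the source-distribution option, a minimal setup.py could look like the sketch below (the package name, version, and dependencies are illustrative); running `python setup.py sdist` then produces the archive that is uploaded for training:

```python
# Hedged sketch: minimal setup.py for packaging trainer code as a source distribution.
from setuptools import find_packages, setup

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),      # expects e.g. a trainer/ package with __init__.py
    install_requires=["pandas"],   # extra dependencies your training code needs
    description="Training application for a Vertex AI custom job",
)
```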
What is Cloud Storage FUSE?
It is a tool that lets you mount GCS buckets and access their files and folders directly without downloading them. The buckets are available under the “/gcs/” root folder and can be treated like a regular file system.
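With the bucket mounted, reading a file is just ordinary file I/O; a tiny sketch where the bucket and file names are placeholders:

```python
# Hedged sketch: read a CSV from a GCS bucket mounted via Cloud Storage FUSE.
import pandas as pd

# gs://my-bucket/data/train.csv appears as a local path under /gcs/
df = pd.read_csv("/gcs/my-bucket/data/train.csv")
print(df.head())
```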