8. Model Training and Hyperparameter Tuning Flashcards
What does the Google Cloud analytics portfolio include?
Collect: Pub/Sub, Datastream, Data Transfer Service
Process: Dataflow, Dataproc, Data Fusion, Composer, Dataprep
Store: Cloud SQL, Spanner, Bigtable, Firestore, Memorystore
Analyze: BigQuery, BI Engine, BigQuery ML, Data QnA, plus analysis over Cloud Storage and multicloud data
Activate: Vertex AI, Looker, third-party BI tools
What is Pub/Sub?
Pub/Sub is a serverless, scalable service for messaging and real-time analytics. You can stream data from third-party sources directly into BigQuery through Pub/Sub.
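A minimal sketch of publishing a message with the google-cloud-pubsub client library; the project and topic IDs are hypothetical placeholders:

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")  # hypothetical IDs

# publish() is asynchronous and returns a future; result() blocks until
# the Pub/Sub service acknowledges the message.
future = publisher.publish(topic_path, data=b"sensor reading: 42")
print(future.result())  # the server-assigned message ID
```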
What is Datastream?
Datastream is a serverless and easy-to-use change data capture (CDC) and replication service.
It allows you to synchronize data across heterogeneous databases and applications with minimal latency and downtime.
Datastream supports streaming from Oracle and MySQL databases into Cloud Storage.
Datastream is integrated with Dataflow, and it leverages Dataflow templates to load data into BigQuery, Cloud Spanner, and Cloud SQL.
What is BigQuery Data Transfer Service?
You can load data into BigQuery from the following sources: data warehouses such as Teradata and Amazon Redshift
The external cloud storage provider Amazon S3
Google software as a service (SaaS) apps such as Google Ads, as well as Cloud Storage
What is Cloud Dataflow?
Cloud Dataflow is a serverless, fully managed data processing (ETL) service for both streaming and batch data. Dataflow uses Apache Beam.
It allows you to build pipelines, monitor their execution, and transform and analyze data.
It reads data from source Google Cloud data services, processes it, and writes the results to sinks, as in the sketch below.
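A minimal word-count pipeline sketch in Apache Beam's Python SDK; the gs:// paths are hypothetical placeholders, and swapping the default DirectRunner for DataflowRunner (plus project/region/temp-location pipeline options) would run the same code on Dataflow:

```python
import apache_beam as beam

# Runs locally on the DirectRunner unless pipeline options say otherwise.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")  # hypothetical path
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/counts")  # hypothetical path
    )
```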
What is Cloud Data Fusion?
It is a UI-based, no-code ETL tool.
What is Cloud Dataproc?
Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open-source tools and frameworks.
Dataproc lets you do batch processing, querying, streaming, and machine learning.
Dataproc automation helps you create clusters quickly, manage them easily, and turn them off when not in use.
What integrations does Dataproc have with Google Cloud Platform?
BigQuery, Cloud Storage, Cloud Bigtable, Cloud Logging, and Cloud Monitoring.
They provide a complete data platform. You can use Dataproc to do ETL.
Dataproc uses the Hadoop Distributed File System (HDFS) for storage.
What is Cloud Composer?
It is a managed data workflow orchestration service allowing you to author, schedule and monitor pipelines.
It is built on Apache Airflow and pipelines are configured as directed acyclic graphs.
It supports hybrid and multicloud architecture.
It provides end-to-end integration with Google Cloud products.
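A minimal sketch of an Airflow DAG of the kind you would drop into a Composer environment's dags/ folder; the DAG ID, schedule, and task commands are illustrative placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")

    # The >> operator declares an edge of the directed acyclic graph:
    # extract must succeed before load runs.
    extract >> load
```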
What are the Dataproc connectors?
Cloud Storage connector: Run Apache Hadoop or Apache Spark jobs directly on data in Cloud Storage.
BigQuery connector: Enable Spark and Hadoop applications to process data from BigQuery and write data to BigQuery.
BigQuery Spark connector: Supports reading from and writing to BigQuery via Spark DataFrames (see the PySpark sketch after this list).
Cloud Bigtable with Dataproc: Use Bigtable as a source or sink for Dataproc jobs.
Pub/Sub Lite Spark connector: Supports Pub/Sub Lite as an input source for Apache Spark Structured Streaming.
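A minimal sketch of the BigQuery Spark connector from PySpark, as you might run it on a Dataproc cluster (where the connector jar is typically available); the input is a real BigQuery public table, while the output table and temporary GCS bucket are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-connector-demo").getOrCreate()

# Read a public BigQuery table into a Spark DataFrame.
words = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
)

# Aggregate in Spark, then write the result back to BigQuery.
counts = words.groupBy("corpus").sum("word_count")
(
    counts.write.format("bigquery")
    .option("table", "my_project.my_dataset.shakespeare_counts")  # hypothetical
    .option("temporaryGcsBucket", "my-temp-bucket")               # hypothetical
    .save()
)
```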
What is Cloud Dataprep?
It is a UI-based ETL tool for structured and unstructured data for analysis, reporting and machine learning.
What are the different data processing tools best used for?
Dataflow: Unified streaming and batch workloads that need customization
Data Fusion: Managed batch and real-time pipelines from hybrid sources
Dataproc: Lift-and-shift Hadoop workloads from on-premises
Dataprep: Ad hoc analytics
What is the data storage guidance on GCP for machine learning?
Tabular data: BigQuery, BigQuery ML (see the sketch after this list)
Image, video, audio, unstructured data: Cloud Storage
Unstructured data: Vertex Data Labeling
Structured data: Vertex AI Feature Store
For AutoML image, video, text: Vertex AI Managed Datasets
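For the tabular case, BigQuery ML lets you train a model with plain SQL. A minimal sketch issued through the google-cloud-bigquery Python client; the dataset and model names are hypothetical placeholders, while the source table is a real public dataset:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses default project credentials

# Train a linear regression model entirely inside BigQuery.
query = """
CREATE OR REPLACE MODEL `my_dataset.penguin_weight_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['body_mass_g']) AS
SELECT *
FROM `bigquery-public-data.ml_datasets.penguins`
WHERE body_mass_g IS NOT NULL
"""
client.query(query).result()  # blocks until the CREATE MODEL job finishes
```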
You should not store data in …
Block storage, such as Network File System (NFS) shares or VM disks. Also avoid reading data directly from databases such as Cloud SQL.
When should you store data as sharded TFRecord files and Avro files?
Use sharded TFRecord files for TensorFlow and Avro files for other frameworks; a sketch follows.
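A minimal sketch of writing and reading sharded TFRecord files with the tf.io and tf.data APIs; the example protos and the gs:// bucket are hypothetical placeholders:

```python
import tensorflow as tf

# "examples" stands in for your serialized tf.train.Example protos.
examples = [tf.train.Example().SerializeToString() for _ in range(1000)]

num_shards = 10
for shard in range(num_shards):
    path = f"gs://my-bucket/train-{shard:05d}-of-{num_shards:05d}.tfrecord"
    with tf.io.TFRecordWriter(path) as writer:
        for proto in examples[shard::num_shards]:  # every num_shards-th proto
            writer.write(proto)

# Read the shards back in parallel with tf.data.
files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")
dataset = files.interleave(
    tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE
)
```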
How do you improve read and write throughput to Cloud Storage if you have image, video, audio, and unstructured data?
Combine individual files into larger files of at least 100 MB, and keep between 100 and 10,000 shards.
What is TensorFlow I/O?
TensorFlow I/O is an extension library that lets TensorFlow read data formats core TensorFlow does not support, such as Parquet, for TensorFlow training (see the sketch below).
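A minimal sketch, assuming the tensorflow-io package is installed and a train.parquet file (a hypothetical placeholder) exists locally or on GCS:

```python
import tensorflow_io as tfio

# Stream Parquet rows as a tf.data-compatible dataset.
dataset = tfio.IODataset.from_parquet("train.parquet")

for row in dataset.take(1):
    print(row)  # per-row column values keyed by column name
```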
What is Vertex AI Workbench?
You can create Jupyter notebooks to train, tune, and deploy models using Vertex AI Workbench.
What is a user-managed notebook?
You have more control but fewer features.
Custom container
Use one framework at a time (chosen from the supported frameworks)
VPC + other networking and security features
What is a managed notebook?
It comes with more features:
Automatic shutdown
UI integration with Cloud Storage and BigQuery
Automated run
Custom container
Dataproc or Serverless Spark integration
All frameworks preinstalled
VPC support
Why don't you need large hardware to develop code in JupyterLab?
You perform training and prediction through the Vertex AI Training and Prediction APIs/SDK. The SDK launches the training container outside the JupyterLab environment, and it builds a prediction container and hosts it on an endpoint, so the notebook machine only needs enough resources for writing code. A sketch follows.
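A minimal sketch with the Vertex AI Python SDK (google-cloud-aiplatform); the project, bucket, training script, and machine types are hypothetical placeholders, and the container URIs name Google's prebuilt training and prediction images:

```python
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",                     # hypothetical project
    location="us-central1",
    staging_bucket="gs://my-staging-bucket",  # hypothetical bucket
)

# The training script runs in a container on Vertex AI, not on the notebook VM.
job = aiplatform.CustomTrainingJob(
    display_name="demo-training",
    script_path="task.py",  # hypothetical local training script
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-11:latest",
    model_serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-11:latest"
    ),
)
model = job.run(replica_count=1, machine_type="n1-standard-4")

# Vertex AI builds and hosts the prediction container behind a managed endpoint.
endpoint = model.deploy(machine_type="n1-standard-4")
```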