Data Analytics, AI/ML Flashcards
Vertex AI
AutoML
- Train ML models without code
AI Platform
- Train ML models using custom training code
Features:
- Data labeling (human assistance in labeling training data)
- Feature store (repo for ML features)
- Workbench (Jupyter notebook IDE)
Cloud TPU
Tensor Processing Units
Google-designed ASICs (application-specific integrated circuits) for training deep learning models
Dataflow
Serverless, batch and stream data processing service
Develop pipelines with Apache Beam and execute them on Dataflow workers: ETL, batch, and continuous computation (map and reduce)
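The map-and-reduce style of computation mentioned above can be sketched in plain Python. This is only an illustration using the standard library, not the Apache Beam API; the sample lines and the function name `word_count` are assumptions made for the example.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Count words across lines: a map step followed by a reduce step."""
    # Map: split each line into lowercase words.
    mapped = (line.lower().split() for line in lines)
    # Reduce: combine all words into per-word totals.
    return Counter(chain.from_iterable(mapped))


counts = word_count(["batch and stream", "stream processing"])
print(counts["stream"])
```

In a real pipeline the same two stages would be expressed as Beam transforms and run by the Dataflow service.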
Dataproc
Managed service for running Hadoop, Spark, Hive, and Pig jobs/clusters on Compute Engine VMs
Use Spark and Spark SQL for data analysis
Use Spark ML libraries to run classification algorithms
Analyze data stored in Cloud Storage
Cloud Workflows
Serverless orchestration platform that executes services based on YAML or JSON defined workflows
Workflows - combine steps for GCP API services, Cloud Functions, Cloud Run
*NOT for large volume of data or complex sequence of jobs
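A workflow definition is a list of named steps. The sketch below builds a small two-step definition as a Python structure and prints it as JSON. The step names `callService` and `returnBody` and the URL are hypothetical, and the layout follows the Workflows step syntax (`call`, `args`, `result`, `return`) as commonly documented; check the current syntax reference before relying on it.

```python
import json

# Hypothetical endpoint; replace with a real Cloud Function or Cloud Run URL.
SERVICE_URL = "https://example.com/hello"

workflow = [
    # Step 1: call an HTTP service and keep its response.
    {"callService": {
        "call": "http.get",
        "args": {"url": SERVICE_URL},
        "result": "response",
    }},
    # Step 2: return the response body as the workflow result.
    {"returnBody": {"return": "${response.body}"}},
]

definition = json.dumps(workflow, indent=2)
print(definition)
```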
Data Fusion
Fully managed service based on CDAP for building ETL pipelines without code
*Pre-built connectors and transformations
Cloud Composer
Fully managed workflow orchestration service for Apache Airflow DAG workflows.
Open source - supports on prem and multicloud
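A DAG workflow runs each task only after the tasks it depends on. The idea can be sketched with the standard library `graphlib` module (Python 3.9+). This is not the Airflow API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical names).
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# static_order() yields tasks so that dependencies always come first.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

An orchestrator such as Airflow adds scheduling, retries, and monitoring on top of this ordering.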
Dataprep
Visually explore, clean, and prepare structured and unstructured data for analysis
Dataproc
Runs Apache Spark and Hadoop clusters
Data Fusion
Data integration service to build and manage ETL/ELT pipelines
Preconfigured connectors and transformations
Cloud Composer
Workflow orchestration service to author, schedule, and monitor pipelines that span clouds and on prem
Data Catalog
Metadata management
Google Data Studio
Visual analytics, interactive dashboards
Dataform
Develop data workflows in SQL and collaborate with Git.
Schedule data workflows with incremental updates to downstream datasets.
Define data quality checks and get alerts.
**Works with BQ
Ingestion tools (5)
Pub/Sub
Storage Transfer Service
Transfer Appliance
Cloud IoT Core
BQ
Data analytics storage (4)
Cloud Storage
Bigtable
Memorystore
BQ
Services for ingesting data from other clouds (3)
Cloud Data Fusion
Storage Transfer Service
BQ Transfer Service
Services for ingesting data from on prem
Data Fusion + Connector - for low-code, graphical UI
Transfer Appliance or Storage Transfer Service - large volumes
Recommended method for ingesting batch workloads
gsutil or STS to ingest into Cloud Storage
Services to ingest data via streaming
Pub/Sub - global, low latency
BQ - for analytics and reporting
Apache Kafka on prem or other clouds - Kafka to BQ Dataflow template
Service to use to ingest data from multiple sources
Dataflow
Cloud Data Loss Prevention
Service to inspect and transform structured and unstructured data from anywhere in Google Cloud
Classify, mask, tokenize sensitive info.
Scan BQ data
De-identify and re-identify PII in large data sets
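The difference between masking and tokenizing can be sketched in plain Python. This is not the Cloud DLP API: the email pattern, the `SECRET` key, and the function names are assumptions for illustration only.

```python
import hashlib
import hmac
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SECRET = b"demo-key"  # hypothetical key; never hard-code real keys


def mask(text):
    """Replace each email address with a fixed placeholder."""
    return EMAIL.sub("[EMAIL]", text)


def tokenize(text):
    """Replace each email with a deterministic keyed-hash token."""
    def token(match):
        digest = hmac.new(SECRET, match.group().encode(), hashlib.sha256)
        return "tok_" + digest.hexdigest()[:12]
    return EMAIL.sub(token, text)


print(mask("Contact ana@example.com today"))
print(tokenize("Contact ana@example.com today"))
```

The keyed hash used here is one-way, so it cannot be reversed. Re-identification needs a reversible transformation or a stored mapping from token to original value.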
Smart Analytics suite consists of (3)
BQ
Data Studio
Cloud Composer
Ways to send data to Cloud Logging
App Engine - automatically records data to Cloud Logging
Logging agent
Custom logging messages to stdout and stderr
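Custom messages written to stdout can be emitted as one JSON object per line. A minimal sketch, assuming the runtime forwards stdout to Cloud Logging and treats the `severity` and `message` fields as structured log fields; the helper name `log` and the sample values are hypothetical.

```python
import json
import sys


def log(severity, message, **fields):
    """Write one JSON log line to stdout and return it."""
    entry = {"severity": severity, "message": message, **fields}
    line = json.dumps(entry)
    print(line, file=sys.stdout, flush=True)
    return line


line = log("WARNING", "cache miss", key="user:42")
```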
Ways to ingest app data into GCP for analysis
Write data to file - store in Cloud Storage - BQ import function
Write data to database - Cloud SQL, Bigtable, Firestore/Datastore
Stream data via Pub/Sub
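For the file-based route, an application can write records as newline-delimited JSON, a format that BQ can load. A minimal sketch using only the standard library; the field names and the file name `events.json` are hypothetical.

```python
import json
import os
import tempfile

rows = [
    {"user": "ana", "clicks": 3},
    {"user": "bo", "clicks": 5},
]

# Write one JSON object per line (newline-delimited JSON).
path = os.path.join(tempfile.mkdtemp(), "events.json")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to show what would be uploaded.
with open(path, encoding="utf-8") as f:
    contents = f.read()
print(contents)
```

The resulting file would then be copied to a Cloud Storage bucket and imported into BQ from there.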
Datastream
Captures change data from Oracle, MySQL, others
Replicates data using Dataflow templates to create replicated tables in BQ
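Change data capture works by replaying a stream of insert, update, and delete events against a copy of the table. The sketch below shows that idea in plain Python. The event layout (`op`, `id`, `row`) is an assumption for illustration and is not Datastream's actual event format.

```python
def apply_changes(table, events):
    """Apply insert/update/delete events to a dict keyed by primary key."""
    for event in events:
        key = event["id"]
        if event["op"] == "delete":
            table.pop(key, None)
        else:  # "insert" and "update" both upsert the row
            table[key] = event["row"]
    return table


events = [
    {"op": "insert", "id": 1, "row": {"name": "ana"}},
    {"op": "update", "id": 1, "row": {"name": "Ana"}},
    {"op": "insert", "id": 2, "row": {"name": "bo"}},
    {"op": "delete", "id": 2},
]
replica = apply_changes({}, events)
print(replica)
```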
Services for processing data (3)
Dataproc
Dataprep
Dataflow
Data analytics services spanning pipeline (4)
Data Fusion
Data Catalog
Cloud Composer
Datastream
Kubeflow
Libraries and tools for deploying ML workflows on Kubernetes (K8s)
TensorFlow Enterprise
Development environment for ML
AI Platform Prediction
Serverless, autoscaling service to host ML models
Used to serve trained models for online inference