ETL, EL, ELT Flashcards
When should you consider using Dataflow and BigQuery for data quality?
In general: Dataflow and BigQuery are the recommended default for addressing data quality issues.
What are some specific needs that Dataflow and BigQuery may not meet easily?
Strict low-latency and high-throughput requirements; reuse of existing Spark pipelines; need for visual pipeline building
What is Dataproc?
Dataproc is a managed Apache Spark and Hadoop service for batch processing, querying, streaming, and machine learning.
What are the benefits of using Dataproc?
Cost-effective for Hadoop workloads; Autoscaling; Integration with other Google Cloud products
What is Data Fusion?
Data Fusion is a fully managed, cloud-native enterprise data integration service.
What can Data Fusion be used for?
Transformations; Cleanup; Ensuring data consistency; Populating a data warehouse
What is an advantage of Data Fusion for non-programming role users?
Building visual pipelines without waiting for an IT team
What is an advantage of Data Fusion for IT staff?
A flexible API for writing scripts that automate pipeline execution
What are important aspects to consider in ETL regardless of the tool used?
Data Lineage; Metadata and Data Catalog
What does data lineage refer to?
Data lineage refers to the data's origin, the processes it has undergone, and its current condition.
Why is data lineage important?
Understanding data suitability; Troubleshooting; Ensuring trust and regulatory compliance
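A minimal sketch of the three parts of lineage named above (origin, processes undergone, current condition). The class `LineageRecord`, its fields, and the sample file name are hypothetical and for illustration only; this is not a Google Cloud API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineageRecord:
    """Hypothetical lineage record for one dataset (illustration only)."""
    origin: str                                      # where the data came from
    steps: List[str] = field(default_factory=list)   # processes it has undergone
    condition: str = "raw"                           # its current condition

    def apply(self, step: str, new_condition: str) -> None:
        """Record a processing step and the condition it leaves the data in."""
        self.steps.append(step)
        self.condition = new_condition


# Made-up example: a CSV file that is cleaned and then loaded.
record = LineageRecord(origin="orders.csv")
record.apply("drop rows with missing order_id", "cleaned")
record.apply("load into warehouse table", "loaded")
print(record)
```

Keeping the steps in order is what makes such a record useful for troubleshooting: a bad value can be traced back through each process to its source.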
What is the purpose of metadata in ETL?
Helping users discover data and judge whether it is suitable for their purpose
What service on Google Cloud provides data discoverability?
Data Catalog
What is required to make Data Catalog effective for data discoverability?
Adding labels to your resources
What are labels in Google Cloud?
Key-value pairs that help organize resources
What are the benefits of using labels in Google Cloud?
Managing complex resources; enabling a fine-grained breakdown of the Cloud bill; serving as a first step towards a data catalog
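A minimal sketch of how key-value labels help organize resources and break down a bill. The resource names, label keys, costs, and helper functions are all made up for illustration; the code uses plain Python dictionaries and does not call any Google Cloud API.

```python
from collections import defaultdict

# Hypothetical resources; names, labels, and costs are invented for illustration.
resources = [
    {"name": "etl-cluster", "labels": {"team": "data", "env": "prod"}, "cost": 120.0},
    {"name": "staging-bucket", "labels": {"team": "data", "env": "dev"}, "cost": 15.5},
    {"name": "web-vm", "labels": {"team": "web", "env": "prod"}, "cost": 80.0},
]


def find_by_label(items, key, value):
    """Return the names of resources whose label `key` equals `value`."""
    return [item["name"] for item in items if item["labels"].get(key) == value]


def cost_by_label(items, key):
    """Sum cost per value of one label key; unlabeled resources are grouped together."""
    totals = defaultdict(float)
    for item in items:
        totals[item["labels"].get(key, "(unlabeled)")] += item["cost"]
    return dict(totals)


print(find_by_label(resources, "env", "prod"))
print(cost_by_label(resources, "team"))
```

The same labels serve both purposes: filtering narrows down a large set of resources, and grouping by a label key gives a per-team or per-environment view of cost.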
What is Data Catalog?
A fully managed, highly scalable data discovery and metadata management service
What are the features of Data Catalog?
No infrastructure setup or management; enterprise-grade access control; integration with the Cloud Data Loss Prevention (DLP) API