ETL, EL, ELT - Sheet1 (1) Flashcards
When should you consider using Dataflow and BigQuery for data quality?
In general: Dataflow and BigQuery are the recommended tools for addressing data quality issues.
What are some specific needs that Dataflow and BigQuery may not meet easily?
Low Latency and High Throughput; Reuse of Existing Spark Pipelines; Need for Visual Pipeline Building
What is Dataproc?
Dataproc is a managed Apache Spark and Apache Hadoop service for batch processing, querying, streaming, and machine learning.
What are the benefits of using Dataproc?
Cost-effective for Hadoop workloads; Autoscaling; Integration with other Google Cloud products
What is Data Fusion?
Data Fusion is a fully managed, cloud-native enterprise data integration service.
What can Data Fusion be used for?
Transformations; Cleanup; Ensuring data consistency; Populating a data warehouse
What is an advantage of Data Fusion for non-programming role users?
Building visual pipelines without waiting for an IT team
What is an advantage of Data Fusion for IT staff?
Flexible API for creating scripts for automated execution
What are important aspects to consider in ETL regardless of the tool used?
Data Lineage; Metadata and Data Catalog
What does data lineage refer to?
Data lineage refers to where the data came from, the processes it has undergone, and its current condition.
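The three parts of lineage named in this card (origin, processes, current condition) can be pictured as a small record that is updated at each ETL step. This is only a minimal sketch: the `LineageRecord` class, its field names, and the bucket path are hypothetical and do not correspond to any Google Cloud API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineageRecord:
    """Hypothetical lineage record: origin, processing steps, current condition."""
    origin: str
    steps: List[str] = field(default_factory=list)
    condition: str = "raw"

    def apply(self, step: str, new_condition: str) -> None:
        # Record the process the data went through and the condition it is left in.
        self.steps.append(step)
        self.condition = new_condition


# Illustrative source path; not a real bucket.
record = LineageRecord(origin="gs://example-bucket/orders.csv")
record.apply("drop rows with null order_id", "cleaned")
record.apply("load into warehouse table", "loaded")
```

Reading `record.origin`, `record.steps`, and `record.condition` afterwards answers the three lineage questions for this dataset.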
Why is data lineage important?
Understanding data suitability; Troubleshooting; Ensuring trust and regulatory compliance
What is the purpose of metadata in ETL?
Discovery and identification of data suitability
What service on Google Cloud provides data discoverability?
Data Catalog
What is required to make Data Catalog effective for data discoverability?
Adding labels to your resources
What are labels in Google Cloud?
Key-value pairs that help organize resources
What are the benefits of using labels in Google Cloud?
Manage complex resources; Get a fine-grained view of your cloud bill; First step towards a data catalog
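To make the idea of labels as key-value pairs concrete, here is a minimal sketch of how labels can drive both resource lookup and a per-label cost breakdown. It is plain Python over made-up data: the resource names, label keys, costs, and the helper functions `find_by_label` and `cost_by_label` are all assumptions for illustration, not a Google Cloud API.

```python
from collections import defaultdict

# Hypothetical resources; names, labels, and costs are illustrative only.
resources = [
    {"name": "orders-cluster", "labels": {"team": "sales", "env": "prod"}, "cost": 120.0},
    {"name": "staging-bucket", "labels": {"team": "sales", "env": "dev"}, "cost": 15.5},
    {"name": "ml-dataset", "labels": {"team": "research", "env": "prod"}, "cost": 64.5},
]


def find_by_label(items, key, value):
    """Return the names of resources whose labels contain key=value."""
    return [r["name"] for r in items if r["labels"].get(key) == value]


def cost_by_label(items, key):
    """Sum cost per value of the given label key."""
    totals = defaultdict(float)
    for r in items:
        # Resources missing the key are grouped under "unlabeled".
        totals[r["labels"].get(key, "unlabeled")] += r["cost"]
    return dict(totals)
```

For example, `find_by_label(resources, "env", "prod")` selects the production resources, and `cost_by_label(resources, "team")` gives the kind of per-team breakdown that labels make possible on a bill.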
What is Data Catalog?
A fully managed, highly scalable data discovery and metadata management service
What are the features of Data Catalog?
No infrastructure setup or management; Enterprise-grade access control; Integration with Data Loss Prevention API
What is the benefit of Data Catalog’s integration with Data Loss Prevention API?
Discover and classify sensitive data; Aid in data governance
What can be done with Data Catalog?
Search metadata about datasets; Group datasets with tags; Flag columns containing sensitive data
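The three capabilities in this card can be sketched with a toy in-memory catalog. This is an illustration under stated assumptions, not the Data Catalog API: the entries, tag keys, column names, and the `search` and `sensitive_columns` helpers are all hypothetical.

```python
# Hypothetical catalog entries: each has a table name, tags, and per-column
# sensitivity flags (True means the column is flagged as sensitive).
catalog = [
    {"table": "sales.orders", "tags": {"domain": "sales"},
     "columns": {"order_id": False, "email": True}},
    {"table": "sales.refunds", "tags": {"domain": "sales"},
     "columns": {"refund_id": False}},
    {"table": "hr.staff", "tags": {"domain": "hr"},
     "columns": {"ssn": True, "name": True}},
]


def search(entries, term):
    """Return tables whose name or any tag value contains the search term."""
    term = term.lower()
    return [
        e["table"] for e in entries
        if term in e["table"].lower()
        or any(term in str(v).lower() for v in e["tags"].values())
    ]


def sensitive_columns(entries):
    """Map each table to its flagged columns, skipping tables with none."""
    result = {}
    for e in entries:
        flagged = sorted(c for c, is_sensitive in e["columns"].items() if is_sensitive)
        if flagged:
            result[e["table"]] = flagged
    return result
```

Searching by a tag value such as `"hr"` finds `hr.staff` without knowing the table name in advance, which is the point of metadata-driven discovery.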
What is the advantage of using Data Catalog for dataset discovery?
Unified user experience; Quick access to datasets; Eliminate the need to hunt for specific table names
What is the significance of data lineage in ETL?
Understanding data origin, processes, and current condition; Ensuring trust, regulatory compliance, and troubleshooting odd results
What is metadata?
Information about the data that aids in discovery and identification of data suitability
What is the purpose of metadata labels in Data Catalog?
To organize resources and enable better management
What are labels in Data Catalog?
Key-value pairs that help categorize and organize resources
What are the benefits of using labels in Data Catalog?
Simplify resource management; Enable fine-grained cost analysis; Step towards creating a data catalog
What is the role of Data Catalog in data discovery?
Fully managed metadata management service; Provides discoverability and searchability of datasets
What is Data Catalog’s integration with the Data Loss Prevention API?
It allows discovery and classification of sensitive data, aiding in data governance
What are the advantages of using Data Catalog for metadata management?
Searchable metadata for datasets regardless of storage location; Grouping datasets with tags; Flagging columns with sensitive data
What is the benefit of Data Catalog’s unified user experience?
Quick and easy discovery of datasets without the need to search for specific table names
What should be considered when evaluating Dataflow and BigQuery for data quality needs?
Low Latency and High Throughput; Reuse of Existing Spark Pipelines; Need for Visual Pipeline Building
What are the advantages of using Dataproc for data processing?
Managed service for batch processing, querying, streaming, and machine learning; Cost-effective for Hadoop workloads; Autoscaling; Integration with other Google Cloud products
What is the purpose of Data Fusion in ETL processes?
Fully managed, cloud-native enterprise data integration service; Transformation, cleanup, ensuring data consistency, populating a data warehouse
What are the benefits of Data Fusion for non-programming role users?
Visual pipeline building without relying on IT team
What are the benefits of Data Fusion for IT staff?
Flexible API for automated execution
What are the important aspects to keep in mind regardless of the ETL tool used?
Data Lineage; Metadata and Data Catalog
What is the significance of data lineage in ETL processes?
Understanding data origin, processes, and current condition; Trust, troubleshooting, regulatory compliance
What does Data Catalog provide for data discoverability?
Searchable metadata and labeling
What are the benefits of using labels in Data Catalog?
Organize resources; Fine-grained cost analysis; Step towards creating a data catalog
What is Data Catalog?
Managed, scalable data discovery and metadata management service
What are the benefits of Data Catalog’s integration with Data Loss Prevention API?
Discover and classify sensitive data; Aid in data governance
What can be done with Data Catalog?
Search metadata; Group datasets with tags; Flag columns with sensitive data