Data Engineering on GCP Flashcards
Who is a Data Engineer?
Someone who builds Data Pipelines
Why do data engineers build data pipelines?
Because they want to get their data into a place, such as a dashboard, a report, or a machine learning model, from which the business can make data-driven decisions.
What is a Data Lake?
A data lake brings together data from across the enterprise into a single location. You might take data from a relational database or from a spreadsheet and store the raw data in a data lake. One option for this single location to store the raw data is a Cloud Storage bucket.
What are considerations for building a data lake?
- Can the data lake handle all types of data?
- Can it elastically scale?
- What data should reside in Cloud SQL/relational vs storage bucket?
- Does it support high-throughput ingestion?
Why build ETL/ELT pipelines?
Cleaning, formatting, and getting the data ready for insights requires building extract-transform-load (ETL) pipelines. ETL pipelines are usually necessary to ensure data accuracy and quality. The cleaned and transformed data are typically stored not in a data lake, but in a data warehouse.
What’s an example of a transform in ETL?
A global company needs to standardize timestamps on all of its transactions to UTC.
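As a sketch of that transform, here is how one might normalize a region-local timestamp to UTC with the Python standard library (the sample timestamp and time zone are illustrative assumptions, not from the course):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib in Python 3.9+

def to_utc(local_timestamp: str, tz_name: str) -> str:
    """Attach the source region's time zone, then convert to UTC."""
    naive = datetime.fromisoformat(local_timestamp)
    localized = naive.replace(tzinfo=ZoneInfo(tz_name))
    return localized.astimezone(ZoneInfo("UTC")).isoformat()

# A Tokyo transaction at 09:00 local time (UTC+9) is 00:00 UTC.
print(to_utc("2024-03-01T09:00:00", "Asia/Tokyo"))
# → 2024-03-01T00:00:00+00:00
```

In a real pipeline this kind of conversion would run inside the transform step, before loading rows into the warehouse.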
What are the challenges of managing an on-premises data warehouse platform (ETL and databases)?
You have to manage the infrastructure for both the database and the ETL jobs, and much of that server capacity can sit idle and go to waste when resource demand is low.
What is BigQuery?
BigQuery is Google Cloud’s petabyte-scale serverless data warehouse. The BigQuery service replaces the typical hardware setup for a traditional data warehouse.
Datasets are collections of tables that can be divided along business lines or a given analytical domain. Each dataset is tied to a GCP project.
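That project → dataset → table hierarchy shows up in how BigQuery addresses tables: a fully qualified table ID has three dot-separated parts. A minimal sketch (the project, dataset, and table names here are made up):

```python
def parse_table_id(table_id: str) -> dict:
    """Split a fully qualified BigQuery table ID into its three parts."""
    project, dataset, table = table_id.split(".")
    return {"project": project, "dataset": dataset, "table": table}

# Hypothetical table: the 'sales' dataset groups tables for one domain.
parts = parse_table_id("my-project.sales.transactions")
print(parts)
# → {'project': 'my-project', 'dataset': 'sales', 'table': 'transactions'}
```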
Where does a data lake store its data?
A data lake might contain files in Cloud Storage or Google Drive, or transactional data in Cloud Bigtable or Cloud SQL.
How does BigQuery work?
BigQuery allocates storage and query resources dynamically based on your usage patterns. Storage resources are allocated as you consume them and deallocated as you remove data or drop tables. Query resources are allocated according to the query type and complexity. Each query uses some number of slots: units of computation that comprise a certain amount of CPU and RAM.
What are slots in BigQuery?
Slots are units of computation that comprise a certain amount of CPU and RAM.
How does BigQuery control access?
Rather than through SQL GRANT and REVOKE statements, access is controlled through IAM.
What is BigQuery at a high level?
- Uses SQL
- Serverless (no infrastructure to manage)
- Works with tools such as Sheets, Looker, Tableau, Qlik, and Data Studio
- Lays the foundation for AI
- Train TensorFlow and AI Platform models directly against BigQuery data
- BigQuery ML can train ML models through SQL
- BigQuery GIS for geospatial analysis
- BigQuery can analyze events in real time by ingesting 100,000 or more rows per second
- BigQuery can federate queries to Cloud SQL and Cloud Storage
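To make the BigQuery ML point above concrete: models are created with plain SQL rather than a separate ML framework. A sketch that assembles such a statement as a string (the model ID, training table, and label column are hypothetical, and the statement is only built here, not run against BigQuery):

```python
def create_model_sql(model_id: str, model_type: str,
                     source_table: str, label_col: str) -> str:
    """Build a BigQuery ML CREATE MODEL statement as a string."""
    return (
        f"CREATE OR REPLACE MODEL `{model_id}`\n"
        f"OPTIONS(model_type='{model_type}', input_label_cols=['{label_col}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )

sql = create_model_sql(
    "my-project.ml.churn_model",   # hypothetical model ID
    "logistic_reg",                # a model type BigQuery ML supports
    "my-project.sales.training",   # hypothetical training table
    "churned",                     # hypothetical label column
)
print(sql)
```

In practice you would submit this statement through the BigQuery console or a client library; the point of the card is that the whole training step is expressed in SQL.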
What are the downsides of federated querying in BQ?
The data never gets cached, since it is not stored natively inside BigQuery, so queries can be somewhat slower because BigQuery has to authenticate the connection to the external source.
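For reference, federated queries against Cloud SQL use BigQuery's EXTERNAL_QUERY function, which takes a connection ID and the SQL to run on the external database. A sketch that assembles one (the connection ID and inner query are hypothetical; nothing is executed against a real connection):

```python
def federated_query(connection_id: str, external_sql: str) -> str:
    """Wrap a Cloud SQL query in BigQuery's EXTERNAL_QUERY (string only)."""
    inner = external_sql.replace('"', '\\"')  # escape quotes in the inner SQL
    return f'SELECT * FROM EXTERNAL_QUERY("{connection_id}", "{inner}")'

# Hypothetical connection "my-project.us.sales-conn" to a Cloud SQL instance.
q = federated_query("my-project.us.sales-conn", "SELECT id, total FROM orders")
print(q)
```

Each such query opens and authenticates the external connection, which is part of why federated queries run slower than queries over native BigQuery storage.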
What is Cloud SQL?
Cloud SQL is Google Cloud’s fully managed relational database solution, supporting PostgreSQL, MySQL, and SQL Server.