Data Engineering on GCP Flashcards
Who is a Data Engineer?
Someone who builds data pipelines.
Why does the data engineer build data pipelines?
To get the data into a place, such as a dashboard, a report, or a machine learning model, where the business can use it to make data-driven decisions.
What is a Data Lake?
A data lake brings together data from across the enterprise into a single location. You might pull the data from a relational database or a spreadsheet and store it, raw, in the data lake. One option for this single location is a Cloud Storage bucket.
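A minimal sketch of landing raw data in a Cloud Storage bucket with the Python client library (the bucket and file names here are hypothetical):

```python
# Assumes: pip install google-cloud-storage, plus application-default credentials.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-data-lake-bucket")  # hypothetical bucket name

# Land a raw CSV export from a relational database or spreadsheet as-is.
blob = bucket.blob("raw/sales/2021-01-transactions.csv")
blob.upload_from_filename("transactions.csv")
print(f"Stored raw file at gs://{bucket.name}/{blob.name}")
```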
What are considerations for building a data lake?
- Can the data lake handle all types of data?
- Can it elastically scale?
- What data should reside in Cloud SQL/relational storage vs. a storage bucket?
- Does it support high-throughput ingestion?
Why build ETL/ELT pipelines?
Cleaning, formatting, and getting the data ready for insights requires building extract-transform-load (ETL) pipelines; in the ELT variant, the transformation step runs after loading, inside the warehouse. These pipelines are usually necessary to ensure data accuracy and quality. The cleaned and transformed data are typically stored not in a data lake, but in a data warehouse.
What’s an example of a transform in ETL?
A global company needs to standardize all transaction timestamps to UTC.
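A minimal sketch of such a transform in Python, assuming each transaction arrives with a naive local timestamp plus its source time zone:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

def to_utc(local_ts: str, source_tz: str) -> datetime:
    """Attach the source time zone to a naive local timestamp, then convert to UTC."""
    naive = datetime.fromisoformat(local_ts)
    return naive.replace(tzinfo=ZoneInfo(source_tz)).astimezone(ZoneInfo("UTC"))

# A Tokyo transaction and a New York transaction, normalized onto one timeline.
print(to_utc("2021-03-01 09:30:00", "Asia/Tokyo"))        # 2021-03-01 00:30:00+00:00
print(to_utc("2021-03-01 09:30:00", "America/New_York"))  # 2021-03-01 14:30:00+00:00
```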
What are the challenges of managing an on-premises DW platform (ETL and databases)?
You have to manage the infrastructure for the databases and the ETL yourself, and resources on those servers are potentially wasted when demand is low.
What is BigQuery?
BigQuery is Google Cloud's petabyte-scale serverless data warehouse. The BigQuery service replaces the typical hardware setup for a traditional data warehouse. Datasets are collections of tables that can be divided along business lines or a given analytical domain. Each dataset is tied to a GCP project.
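A minimal sketch of creating a dataset with the BigQuery Python client (project and dataset IDs are hypothetical); note how the dataset is qualified by its project:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # hypothetical project

# A dataset is addressed as <project>.<dataset> -- it is tied to one GCP project.
dataset = bigquery.Dataset("my-analytics-project.sales_analytics")
dataset.location = "US"
dataset = client.create_dataset(dataset, exists_ok=True)
print(f"Dataset {dataset.full_dataset_id} ready")
```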
Where does a data lake store its data?
A data lake might contain files in Cloud Storage or Google Drive, or transactional data in Cloud Bigtable or Cloud SQL.
How does BigQuery work?
BigQuery allocates storage and query resources dynamically based on your usage patterns. Storage resources are allocated as you consume them and deallocated as you remove data or drop tables. Query resources are allocated according to the query type and complexity. Each query uses some number of slots: units of computation that comprise a certain amount of CPU and RAM.
What are slots in BigQuery?
Slots are units of computation that comprise a certain amount of CPU and RAM.
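You can see slot consumption for a finished job through the Python client; a small sketch against a public dataset (the slot figure will vary run to run):

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.query(
    "SELECT name, SUM(number) AS total "
    "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
    "GROUP BY name ORDER BY total DESC LIMIT 5"
)
job.result()  # wait for the query to finish

# slot_millis: total slot-milliseconds this query consumed across all stages.
print(f"Slot-milliseconds used: {job.slot_millis}")
print(f"Bytes processed: {job.total_bytes_processed}")
```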
How does BigQuery control access?
Rather than through SQL GRANT and REVOKE statements, access is controlled through IAM.
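A minimal sketch of granting read access on a dataset via the Python client (the user and dataset here are hypothetical); the same can be done with IAM roles in the console or gcloud:

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-analytics-project.sales_analytics")  # hypothetical

# Append an IAM-style access entry instead of running a SQL GRANT.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(role="READER", entity_type="userByEmail",
                         entity_id="analyst@example.com")
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```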
What is BigQuery at a high level?
- Uses SQL
- Serverless (no managing infrastructure)
- Works with tools such as Sheets, Looker, Tableau, Qlik, and Data Studio
- Lays the foundation for AI
- Train TensorFlow and AI Platform models directly in BQ
- BigQuery ML can train ML models through SQL (see the sketch after this list)
- BigQuery GIS for geospatial analysis
- BQ can analyze events in real time by ingesting 100,000 or more rows per second
- BQ can federate queries to Cloud SQL and Cloud Storage
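A minimal BigQuery ML sketch, training and evaluating a model entirely in SQL (the dataset, table, and column names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# CREATE MODEL is standard BigQuery ML SQL.
client.query("""
    CREATE OR REPLACE MODEL `sales_analytics.purchase_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['purchased']) AS
    SELECT country, pageviews, purchased
    FROM `sales_analytics.web_sessions`
""").result()

# Evaluate the trained model, again in plain SQL.
for row in client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `sales_analytics.purchase_model`)"
).result():
    print(dict(row))
```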
What are the downsides of federated querying in BQ?
The data never gets cached, since it is not stored natively inside BQ, and queries can be a bit slower because BQ has to authenticate the external connections.
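A sketch of a federated query over raw files in Cloud Storage (bucket and path are hypothetical); note the results are not cached the way native-table queries are:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Query CSVs in Cloud Storage directly, without loading them into BQ first.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-data-lake-bucket/raw/sales/*.csv"]
external_config.autodetect = True

job_config = bigquery.QueryJobConfig(
    table_definitions={"raw_sales": external_config}
)
rows = client.query("SELECT COUNT(*) AS n FROM raw_sales",
                    job_config=job_config).result()
print(list(rows))
```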
What is Cloud SQL?
Cloud SQL is Google Cloud's fully managed relational database solution. It supports PostgreSQL, MySQL, and SQL Server.
Differences between a traditional database and a data warehouse
A traditional DB is optimized for transactions and for returning small answer sets to the client: very fast, tactical queries over row-based storage, with referential integrity enforced. BQ uses column-based storage, which favors scanning and aggregating large volumes of data.
What is BI Engine?
BI Engine is a fast, in-memory analysis service that is built directly into BigQuery and available to speed up your business intelligence applications. It is built on top of the same BigQuery storage and compute architecture and serves as a fast, in-memory, intelligent caching service that maintains state.
How do you monitor your ecosystem?
One popular way to monitor the health of your ecosystem is to use the built-in Stackdriver (now Cloud Monitoring) monitoring of all resources on Google Cloud Platform. You can set up alerts and notifications for metrics like query count or bytes of data processed, so that you can better track usage, performance, and cost. You can even use Stackdriver to surface Cloud Audit Logs to view actual query job information and look at granular details about which queries were run and who ran them.
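A sketch of reading the BigQuery query-count metric with the Cloud Monitoring Python client (the project ID is hypothetical):

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-analytics-project"  # hypothetical project

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    start_time={"seconds": now - 3600},  # look back one hour
    end_time={"seconds": now},
)

# Pull the BigQuery query-count metric to track usage over time.
for series in client.list_time_series(
    name=project_name,
    filter='metric.type = "bigquery.googleapis.com/query/count"',
    interval=interval,
    view=monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
):
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)
```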
What is used for data governance on GCP?
Cloud Data Catalog and the Data Loss Prevention API
Purpose of Data Catalog
Data Catalog makes all the metadata about your datasets available for users to search. You can group datasets together with tags and flag certain columns as sensitive. Data Catalog provides a single, unified user experience to discover those datasets quickly.
What is the Data Loss Prevention (DLP) API?
The DLP API helps you better understand and manage sensitive data. It provides fast, scalable classification and redaction of sensitive data elements like credit card numbers, names, Social Security numbers, selected international identifier numbers, phone numbers, GCP credentials, etc.
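A minimal sketch of inspecting a string for sensitive elements with the DLP Python client (the project ID and sample text are hypothetical):

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-analytics-project"  # hypothetical project

item = {"value": "My card is 4111 1111 1111 1111, call 555-0100."}
inspect_config = {
    "info_types": [{"name": "CREDIT_CARD_NUMBER"}, {"name": "PHONE_NUMBER"}],
    "include_quote": True,  # return the matched text with each finding
}

response = dlp.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood, finding.quote)
```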
What is Cloud Composer?
Orchestrates data engineering workflows. For example, a CSV file landing in Cloud Storage can trigger a data processing workflow that places the CSV data into BQ.
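A minimal Airflow DAG sketch of the kind Cloud Composer runs, loading landed CSVs into BigQuery (the bucket, table, and schedule are hypothetical):

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="csv_to_bq",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Load any CSVs under the raw/sales/ prefix into a BigQuery table.
    load_csv = GCSToBigQueryOperator(
        task_id="load_csv_to_bq",
        bucket="my-data-lake-bucket",
        source_objects=["raw/sales/*.csv"],
        destination_project_dataset_table="my-analytics-project.sales_analytics.transactions",
        source_format="CSV",
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )
```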