Data Engineering on GCP Flashcards
Who is a Data Engineer?
Someone who builds data pipelines.
Why does the data engineer build data pipelines?
To get the data into a place, such as a dashboard, a report, or a machine learning model, where the business can use it to make data-driven decisions.
What is a Data Lake?
A data lake brings together data from across the enterprise into a single location. You might pull the data from a relational database or a spreadsheet and store it, raw, in the data lake. One option for this single location is a Cloud Storage bucket.
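A minimal sketch of landing raw data in a Cloud Storage bucket with the Python client library (the bucket and file names here are hypothetical):

```python
# Assumes: pip install google-cloud-storage, plus application-default credentials.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-data-lake-bucket")  # hypothetical bucket name

# Land a raw CSV export from a relational database or spreadsheet as-is.
blob = bucket.blob("raw/sales/2021-01-transactions.csv")
blob.upload_from_filename("transactions.csv")
print(f"Stored raw file at gs://{bucket.name}/{blob.name}")
```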
What are considerations for building a data lake?
- Can the data lake handle all types of data?
- Can it elastically scale?
- What data should reside in Cloud SQL/relational storage vs. a storage bucket?
- Does it support high-throughput ingestion?
Why build ETL/ELT pipelines?
Cleaning, formatting, and getting the data ready for insights requires building extract-transform-load (ETL) pipelines; in the ELT variant, the transformation step runs after loading, inside the warehouse. These pipelines are usually necessary to ensure data accuracy and quality. The cleaned and transformed data are typically stored not in a data lake, but in a data warehouse.
What’s an example of a transform in ETL?
A global company needs to standardize all transaction timestamps to UTC.
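A minimal sketch of such a transform in Python, assuming each transaction arrives with a naive local timestamp plus its source time zone:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

def to_utc(local_ts: str, source_tz: str) -> datetime:
    """Attach the source time zone to a naive local timestamp, then convert to UTC."""
    naive = datetime.fromisoformat(local_ts)
    return naive.replace(tzinfo=ZoneInfo(source_tz)).astimezone(ZoneInfo("UTC"))

# A Tokyo transaction and a New York transaction, normalized onto one timeline.
print(to_utc("2021-03-01 09:30:00", "Asia/Tokyo"))        # 2021-03-01 00:30:00+00:00
print(to_utc("2021-03-01 09:30:00", "America/New_York"))  # 2021-03-01 14:30:00+00:00
```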
What are the challenges of managing an on-premises DW platform (ETL and databases)?
You have to manage the infrastructure for the databases and the ETL yourself, and resources on those servers are potentially wasted when demand is low.
What is BigQuery?
BigQuery is Google Cloud's petabyte-scale serverless data warehouse. The BigQuery service replaces the typical hardware setup for a traditional data warehouse. Datasets are collections of tables that can be divided along business lines or a given analytical domain. Each dataset is tied to a GCP project.
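A minimal sketch of creating a dataset with the BigQuery Python client (project and dataset IDs are hypothetical); note how the dataset is qualified by its project:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # hypothetical project

# A dataset is addressed as <project>.<dataset> -- it is tied to one GCP project.
dataset = bigquery.Dataset("my-analytics-project.sales_analytics")
dataset.location = "US"
dataset = client.create_dataset(dataset, exists_ok=True)
print(f"Dataset {dataset.full_dataset_id} ready")
```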
Where does a data lake store its data?
A data lake might contain files in Cloud Storage or Google Drive, or transactional data in Cloud Bigtable or Cloud SQL.
How does BigQuery work?
BigQuery allocates storage and query resources dynamically based on your usage patterns. Storage resources are allocated as you consume them and deallocated as you remove data or drop tables. Query resources are allocated according to the query type and complexity. Each query uses some number of slots: units of computation that comprise a certain amount of CPU and RAM.
What are slots in BigQuery?
Slots are units of computation that comprise a certain amount of CPU and RAM.
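You can see slot consumption for a finished job through the Python client; a small sketch against a public dataset (the slot figure will vary run to run):

```python
from google.cloud import bigquery

client = bigquery.Client()
job = client.query(
    "SELECT name, SUM(number) AS total "
    "FROM `bigquery-public-data.usa_names.usa_1910_2013` "
    "GROUP BY name ORDER BY total DESC LIMIT 5"
)
job.result()  # wait for the query to finish

# slot_millis: total slot-milliseconds this query consumed across all stages.
print(f"Slot-milliseconds used: {job.slot_millis}")
print(f"Bytes processed: {job.total_bytes_processed}")
```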
How does BigQuery control access?
Rather than through SQL GRANT and REVOKE statements, access is controlled through IAM.
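A minimal sketch of granting read access on a dataset via the Python client (the user and dataset here are hypothetical); the same can be done with IAM roles in the console or gcloud:

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-analytics-project.sales_analytics")  # hypothetical

# Append an IAM-style access entry instead of running a SQL GRANT.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(role="READER", entity_type="userByEmail",
                         entity_id="analyst@example.com")
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```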
What is BigQuery at a high level?
- Uses SQL
- Serverless (no managing infrastructure)
- Works with tools such as Sheets, Looker, Tableau, Qlik, and Data Studio
- Lays the foundation for AI
- Train TensorFlow and AI Platform models directly in BQ
- BigQuery ML can train ML models through SQL (see the sketch after this list)
- BigQuery GIS for geospatial analysis
- BQ can analyze events in real time by ingesting 100,000 or more rows per second
- BQ can federate queries to Cloud SQL and Cloud Storage
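A minimal BigQuery ML sketch, training and evaluating a model entirely in SQL (the dataset, table, and column names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# CREATE MODEL is standard BigQuery ML SQL.
client.query("""
    CREATE OR REPLACE MODEL `sales_analytics.purchase_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['purchased']) AS
    SELECT country, pageviews, purchased
    FROM `sales_analytics.web_sessions`
""").result()

# Evaluate the trained model, again in plain SQL.
for row in client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `sales_analytics.purchase_model`)"
).result():
    print(dict(row))
```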
What are the downsides of federated querying in BQ?
The data never gets cached, since it is not stored natively inside BQ, and queries can be a bit slower because BQ has to authenticate the external connections.
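A sketch of a federated query over raw files in Cloud Storage (bucket and path are hypothetical); note the results are not cached the way native-table queries are:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Query CSVs in Cloud Storage directly, without loading them into BQ first.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-data-lake-bucket/raw/sales/*.csv"]
external_config.autodetect = True

job_config = bigquery.QueryJobConfig(
    table_definitions={"raw_sales": external_config}
)
rows = client.query("SELECT COUNT(*) AS n FROM raw_sales",
                    job_config=job_config).result()
print(list(rows))
```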
What is Cloud SQL?
Cloud SQL is Google Cloud's fully managed relational database solution. It supports PostgreSQL, MySQL, and SQL Server.
Differences between a traditional database and a data warehouse
A traditional DB is optimized for transactions and for returning small answer sets to the client: very fast, tactical queries over row-based storage, with referential integrity enforced. BQ uses column-based storage, which favors scanning and aggregating large volumes of data.
What is BI Engine?
BI Engine is a fast, in-memory analysis service that is built directly into BigQuery and available to speed up your business intelligence applications. It is built on top of the same BigQuery storage and compute architecture and serves as a fast, in-memory, intelligent caching service that maintains state.
How do you monitor your ecosystem?
One popular way to monitor the health of your ecosystem is to use the built-in Stackdriver (now Cloud Monitoring) monitoring of all resources on Google Cloud Platform. You can set up alerts and notifications for metrics like query count or bytes of data processed, so that you can better track usage, performance, and cost. You can even use Stackdriver to surface Cloud Audit Logs to view actual query job information and look at granular details about which queries were run and who ran them.
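A sketch of reading the BigQuery query-count metric with the Cloud Monitoring Python client (the project ID is hypothetical):

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-analytics-project"  # hypothetical project

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    start_time={"seconds": now - 3600},  # look back one hour
    end_time={"seconds": now},
)

# Pull the BigQuery query-count metric to track usage over time.
for series in client.list_time_series(
    name=project_name,
    filter='metric.type = "bigquery.googleapis.com/query/count"',
    interval=interval,
    view=monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
):
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)
```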
What is used for data governance on GCP?
Cloud Data Catalog and the Data Loss Prevention API
Purpose of Data Catalog
Data Catalog makes all the metadata about your datasets available for users to search. You can group datasets together with tags and flag certain columns as sensitive. Data Catalog provides a single, unified user experience to discover those datasets quickly.
What is the Data Loss Prevention (DLP) API?
The DLP API helps you better understand and manage sensitive data. It provides fast, scalable classification and redaction of sensitive data elements like credit card numbers, names, Social Security numbers, selected international identifier numbers, phone numbers, GCP credentials, etc.
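A minimal sketch of inspecting a string for sensitive elements with the DLP Python client (the project ID and sample text are hypothetical):

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-analytics-project"  # hypothetical project

item = {"value": "My card is 4111 1111 1111 1111, call 555-0100."}
inspect_config = {
    "info_types": [{"name": "CREDIT_CARD_NUMBER"}, {"name": "PHONE_NUMBER"}],
    "include_quote": True,  # return the matched text with each finding
}

response = dlp.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood, finding.quote)
```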
What is Cloud Composer?
Orchestrates data engineering workflows. For example, a CSV file landing in Cloud Storage can trigger a data processing workflow that places the CSV data into BQ.
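A minimal Airflow DAG sketch of the kind Cloud Composer runs, loading landed CSVs into BigQuery (the bucket, table, and schedule are hypothetical):

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="csv_to_bq",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Load any CSVs under the raw/sales/ prefix into a BigQuery table.
    load_csv = GCSToBigQueryOperator(
        task_id="load_csv_to_bq",
        bucket="my-data-lake-bucket",
        source_objects=["raw/sales/*.csv"],
        destination_project_dataset_table="my-analytics-project.sales_analytics.transactions",
        source_format="CSV",
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )
```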