Data Engineering on GCP Flashcards

1
Q

Who is a Data Engineer?

A

Someone who builds Data Pipelines

2
Q

Why does a data engineer build data pipelines?

A

Because they want to get data into a place, such as a dashboard, a report, or a machine learning model, from which the business can make data-driven decisions.

3
Q

What is a Data Lake?

A

A data lake brings together data from across the enterprise into a single location. For example, you might pull data from a relational database or a spreadsheet and store the raw data in the lake. One option for this single location to store the raw data is a Cloud Storage bucket.

4
Q

What are considerations for building a data lake?

A
  1. Can the data lake handle all types of data?
  2. Can it scale elastically?
  3. Which data should reside in Cloud SQL (relational) versus a storage bucket?
  4. Does it support high-throughput ingestion?
5
Q

Why build ETL/ELT pipelines?

A

Cleaning, formatting, and getting the data ready for insights requires building extract-transform-load (ETL) pipelines. ETL pipelines are usually necessary to ensure data accuracy and quality. The cleaned and transformed data is typically stored not in a data lake, but in a data warehouse.
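As a minimal sketch of the extract-transform-load idea (the data and field names are illustrative, and a real pipeline would load into a warehouse rather than print):

```python
import csv
import io

# Extract: raw CSV as it might land in a data lake (illustrative data).
raw = "order_id,amount\n 1 ,19.99\n2,\n3,5.00\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: trim whitespace, drop rows missing an amount, cast types.
clean = [
    {"order_id": int(r["order_id"].strip()), "amount": float(r["amount"])}
    for r in rows
    if r["amount"].strip()
]

# Load: in a real pipeline this would write to the warehouse (e.g. BigQuery).
print(clean)  # [{'order_id': 1, 'amount': 19.99}, {'order_id': 3, 'amount': 5.0}]
```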

6
Q

What’s an example of a transform in ETL?

A

A global company needs to standardize the timestamps on all transactions to UTC.
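A sketch of that transform using Python's standard library (the timezone and timestamp are illustrative):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# A transaction recorded in New York local time (illustrative data).
local_ts = datetime(2023, 6, 1, 9, 30, tzinfo=ZoneInfo("America/New_York"))

# Standardize to UTC so transactions from every region are comparable.
utc_ts = local_ts.astimezone(ZoneInfo("UTC"))
print(utc_ts.isoformat())  # 2023-06-01T13:30:00+00:00
```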

7
Q

What are the challenges of managing an on-premises data warehouse platform (ETL and databases)?

A

You have to manage the infrastructure for both the database and the ETL yourself, with the potential for much wasted resource on those servers when demand is low.

8
Q

What is BigQuery?

A

BigQuery is Google Cloud's petabyte-scale, serverless data warehouse.

The BigQuery service replaces the typical hardware setup of a traditional data warehouse.

Datasets are collections of tables that can be divided along business lines or a given analytical domain. Each dataset is tied to a GCP project.
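That project/dataset/table hierarchy shows up directly in how tables are addressed in queries; a hedged example (all names are illustrative):

```sql
-- Tables are addressed as `project.dataset.table` (names illustrative).
SELECT station_id, COUNT(*) AS trips
FROM `my-project.my_dataset.bike_trips`
GROUP BY station_id;
```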

9
Q

Where does a data lake store its data?

A

A data lake might contain files in Cloud Storage or Google Drive, or transactional data in Cloud Bigtable or Cloud SQL.

10
Q

How does BigQuery work?

A

BigQuery allocates storage and query resources dynamically based on your usage patterns. Storage resources are allocated as you consume them, and deallocated as you remove data or drop tables. Query resources are allocated according to the query type and complexity. Each query uses some number of what are called slots: units of computation that comprise a certain amount of CPU and RAM.

11
Q

What are slots in BigQuery?

A

Slots are units of computation that comprise a certain amount of CPU and RAM.

12
Q

How does BigQuery control access?

A

Rather than through SQL GRANT and REVOKE statements, access is controlled through Cloud IAM.

13
Q

What is BigQuery at a high level?

A
  1. Uses SQL
  2. Serverless (no managing infrastructure)
  3. Works with tools such as Sheets, Looker, Tableau, Qlik, and Data Studio
  4. Lays the foundation for AI
  5. Can train TensorFlow and AI Platform models directly on BQ data
  6. BigQuery ML can train ML models through SQL
  7. BigQuery GIS for geospatial analysis
  8. BQ can analyze events in real time by ingesting 100,000 or more rows per second
  9. BQ can federate queries to Cloud SQL and Cloud Storage
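For instance, point 6, training a model purely in SQL with BigQuery ML, looks roughly like this (dataset, table, and column names are illustrative):

```sql
-- Train a linear regression model entirely in SQL (names illustrative).
CREATE OR REPLACE MODEL `my_dataset.fare_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['fare']) AS
SELECT trip_miles, trip_minutes, fare
FROM `my_dataset.taxi_trips`;

-- Then get predictions with ML.PREDICT.
SELECT *
FROM ML.PREDICT(MODEL `my_dataset.fare_model`,
  (SELECT trip_miles, trip_minutes FROM `my_dataset.new_trips`));
```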
14
Q

What are the downsides of federated querying in BQ?

A

The data never gets cached, since it is not stored natively inside BQ. Queries can also be a bit slower because BQ has to authenticate the external connections.
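A hedged sketch of what a federated query against Cloud SQL looks like, using `EXTERNAL_QUERY` (the connection ID and inner query are illustrative):

```sql
-- Federated query: BQ pushes the inner query to Cloud SQL
-- over the named connection (connection ID illustrative).
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT customer_id, created_at FROM customers;');
```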

15
Q

What is cloud SQL?

A

Cloud SQL is Google Cloud's fully managed relational database solution.

It supports PostgreSQL, MySQL, and SQL Server.

16
Q

What are the differences between a traditional database and a data warehouse?

A

A traditional DB is optimized for transactions and for returning small answer sets to the client: very fast, tactical queries. It uses row-based storage and also enforces referential integrity.

BQ, by contrast, uses column-based storage.
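A toy sketch of the difference (illustrative data): row storage keeps each record together, which suits transactional point lookups, while columnar storage keeps each column together, which suits scanning one column across many rows.

```python
# The same three records, laid out two ways (toy data).
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 20.0},
    {"id": 3, "amount": 30.0},
]

# Columnar layout: one contiguous list per column.
columns = {
    "id": [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# OLTP-style point lookup touches one whole row.
record = rows[1]                # {'id': 2, 'amount': 20.0}

# OLAP-style aggregate scans only the 'amount' column and
# never reads 'id' -- the columnar advantage.
total = sum(columns["amount"])  # 60.0
print(record, total)
```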

17
Q

What is BI engine?

A

BI Engine is a fast, in-memory analysis service built directly into BigQuery and available to speed up your business intelligence applications.

BI Engine is built on top of the same BigQuery storage and compute architecture, and serves as a fast, in-memory, intelligent caching service that maintains state.

18
Q

How do you monitor your ecosystem?

A

One popular way to monitor the health of your ecosystem is to use the built-in Stackdriver monitoring of all resources on Google Cloud Platform.

You can set up alerts and notifications for metrics like query count or bytes of data processed, so that you can better track usage, performance, and cost.

You can even use Stackdriver to create Cloud Audit Logs to view actual query job information and look at granular details about which queries were run and who ran them.

19
Q

What is used for data governance on GCP?

A

Cloud Data Catalog and the Data Loss Prevention API

20
Q

Purpose of Data Catalog

A

Data Catalog makes all the metadata about your datasets available for users to search. You can group datasets together with tags and flag certain columns as sensitive.

Data Catalog can provide a single, unified user experience to discover those datasets quickly.

21
Q

What is Data Loss Prevention or DLP API?

A

This helps you better understand and manage sensitive data. It provides fast, scalable classification and redaction of sensitive data elements, like credit card numbers, names, Social Security numbers, selected international identifier numbers, phone numbers, GCP credentials, etc.
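This is NOT the DLP API, just a toy regex sketch of the redaction idea; the real service uses managed infoType detectors rather than hand-written patterns:

```python
import re

# Toy pattern for a US Social Security number (illustrative only;
# the DLP API's detectors are far more robust than this regex).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

text = "Customer SSN: 123-45-6789, phone on file."
redacted = SSN.sub("[SSN_REDACTED]", text)
print(redacted)  # Customer SSN: [SSN_REDACTED], phone on file.
```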

22
Q

What is Cloud Composer?

A

It orchestrates data engineering workflows. For example, a CSV file landing in Cloud Storage can trigger a data processing workflow that loads the CSV data into BQ.