Intro to Data Lakes Flashcards

Question 1

Q

What is a Data Lake?

Answer

A

A data lake generally describes a place where you can securely store various types of data at any scale for processing and analytics. Data lakes typically drive data analytics, data science and ML, and batch or streaming pipelines.

Question 2

Q

Compare a data lake to a data warehouse

Answer

A

A data lake is a capture of every aspect of your business operation. The data is stored in its natural/raw format, usually as object blobs or files.

In contrast, a data warehouse:
● Is loaded only when its use is defined
● Holds processed/organized/transformed data
● Provides faster insights
● Holds current/historical data for reporting
● Tends to have a consistent schema shared across applications

Question 3

Q

When would one use EL (Extract & Load)?

Answer

A

When the data is already in a form the cloud product can ingest directly and no other transformations are needed. For example, an Avro file can be easily ingested by BigQuery (BQ). Use EL when the data can be imported as is.

Question 4

Q

When would one use ETL (Extract, Transform & Load)?

Answer

A

Use ETL when the data is not in the final form you want, so it has to be transformed before it is loaded into the cloud product.

By transforming the data before loading it into the cloud, you might be able to greatly reduce the network bandwidth you need, because the data that gets loaded is reduced in size upfront.
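
The bandwidth point can be illustrated with a small sketch. The records, field names, and helper functions below are hypothetical, and the "upload" is only simulated by measuring the serialized size of the data before and after the transform step.

```python
import json

# Hypothetical raw records extracted from an on-premises source.
raw_records = [
    {"user_id": 1, "event": "click", "debug_trace": "x" * 200, "amount": 0},
    {"user_id": 1, "event": "purchase", "debug_trace": "x" * 200, "amount": 30},
    {"user_id": 2, "event": "purchase", "debug_trace": "x" * 200, "amount": 12},
    {"user_id": 2, "event": "click", "debug_trace": "x" * 200, "amount": 0},
]


def transform(records):
    """Keep only purchases and drop the bulky field the analysis does not need."""
    return [
        {"user_id": r["user_id"], "amount": r["amount"]}
        for r in records
        if r["event"] == "purchase"
    ]


def payload_size(records):
    """Size in bytes of the newline-delimited JSON that would cross the network."""
    return sum(len(json.dumps(r).encode("utf-8")) + 1 for r in records)


transformed = transform(raw_records)   # the T step happens before the L step
raw_bytes = payload_size(raw_records)  # what EL or ELT would send
etl_bytes = payload_size(transformed)  # what ETL sends
print(f"raw upload: {raw_bytes} bytes, after transform: {etl_bytes} bytes")
```

In a real pipeline the transform would typically run in an intermediate service rather than in a local script; the sketch only shows why less data crosses the network.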

Question 5

Q

What is an example of ELT?

Answer

A

Extract data from on-premises, load it into a cloud product, and then transform it there.

ELT allows raw data to be loaded directly into the target and then transformed there. A very common example is BigQuery: you can use SQL to transform the raw data that has been loaded into BigQuery and simply write the result to a new table.
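
A minimal sketch of the same load-then-transform pattern, using Python's built-in sqlite3 module purely as a local stand-in for BigQuery. The table names, columns, and sample rows are hypothetical.

```python
import sqlite3

# sqlite3 is only a local stand-in for the target system (BigQuery in the card).
conn = sqlite3.connect(":memory:")

# Load: raw data goes into the target untouched (hypothetical table and columns).
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, country TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "us", 1250), (2, "US", 800), (3, "de", 4300)],
)

# Transform: SQL run inside the target writes the cleaned result to a new table.
conn.execute(
    """
    CREATE TABLE orders_by_country AS
    SELECT UPPER(country) AS country,
           SUM(amount_cents) / 100.0 AS total_amount
    FROM raw_orders
    GROUP BY UPPER(country)
    """
)

rows = conn.execute(
    "SELECT country, total_amount FROM orders_by_country ORDER BY country"
).fetchall()
print(rows)
```

The raw table is left in place, so the same raw data can be transformed again later for a different purpose.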

Question 6

Q

What comprises Cloud Storage?

Answer

A

Buckets and Objects

Question 7

Q

What are buckets in Cloud Storage?

Answer

A

Buckets are containers that hold objects. Objects exist inside buckets and not apart from them.

Question 8

Q

What are Multi-Region Buckets?

Answer

A

Multi-region buckets are buckets that are replicated across regions.

Question 9

Q

How are region buckets replicated?

Answer

A

In a single-region bucket, as you might expect, the objects are replicated across zones within that one region.

When an object is retrieved, it is served from the replica closest to the requester, which is how low latency is achieved.

Question 10

Q

How is high-throughput achieved with buckets?

Answer

A

Multiple requesters can retrieve objects at the same time from different replicas, which is how high throughput is achieved.

Question 11

Q

How can you manage costs with Cloud Storage?

Answer

A

Move objects that have not been accessed in 30 days to the Nearline storage class, or after 90 days to the Coldline storage class, to help optimize your costs.
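
Cloud Storage can express this kind of policy as lifecycle rules on a bucket. The sketch below builds such a configuration as JSON. The function name and its defaults are hypothetical, and the rules use the `age` condition, which is measured from object creation rather than from last access, so it only approximates the policy on this card.

```python
import json


def build_lifecycle_config(nearline_after_days=30, coldline_after_days=90):
    """Build a lifecycle configuration with two SetStorageClass rules."""

    def rule(storage_class, age_days):
        return {
            "action": {"type": "SetStorageClass", "storageClass": storage_class},
            # "age" counts days since the object was created, not since it
            # was last accessed, so this approximates the card's policy.
            "condition": {"age": age_days},
        }

    return {
        "rule": [
            rule("NEARLINE", nearline_after_days),
            rule("COLDLINE", coldline_after_days),
        ]
    }


config = build_lifecycle_config()
print(json.dumps(config, indent=2))
```

The printed JSON would still need to be applied to a bucket with the Cloud Storage tooling; that step is left out here.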

Question 12

Q

What is Cloud Spanner?

Answer

A

Cloud Spanner is the database to use if you require a globally distributed database. You'd want a globally distributed database if your database will see updates from applications running in different geographic regions.

Question 13

Q

If you need horizontal read and write scaling with a SQL database, ___________

Answer

A

consider using Cloud Spanner.

(13 cards)