Intro to Data Lakes Flashcards
What is a Data Lake?
A data lake generally describes a place where you can securely store various types of data at any scale for processing and analytics. Data lakes typically drive data analytics, data science and ML, and batch or streaming pipelines.
Compare a data lake to a data warehouse
A data lake is a capture of every aspect of your business operation. The data is stored in its natural/raw format, usually as object blobs or files.
A data warehouse, in contrast, is loaded only when its use is defined. Its data is processed, organized, and transformed; it provides faster insights; it holds current and historical data for reporting; and it tends to have a consistent schema shared across applications.
When would one use EL (Extract & Load)?
When you have data that is readily ingestible by the cloud product. For example, an Avro file can be easily ingested by BigQuery. Use EL when the data can be imported as is, with no other transformations needed.
When to use ETL?
Use ETL when the data must be transformed before it is loaded into the cloud product, for example because the raw data is not in the final form you want. By transforming the data before loading it into the cloud, you may also greatly reduce the network bandwidth you need, because the data that gets loaded is smaller.
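The size argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, column names, and the `transform` and `to_csv_bytes` helpers are all hypothetical, and CSV byte counts stand in for the data that would cross the network.

```python
import csv
import io


def to_csv_bytes(rows, fieldnames):
    """Serialize rows to CSV and return the encoded bytes."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


# Hypothetical raw records extracted from an on-premises source.
raw_rows = [
    {
        "store": "A" if i % 2 == 0 else "B",
        "sku": f"sku-{i}",
        "amount": i % 7,
        "debug_note": "x" * 50,  # a column nobody needs downstream
    }
    for i in range(1000)
]


def transform(rows):
    """Aggregate amounts per store and drop the columns not needed downstream."""
    totals = {}
    for row in rows:
        totals[row["store"]] = totals.get(row["store"], 0) + row["amount"]
    return [{"store": s, "total_amount": t} for s, t in sorted(totals.items())]


# Size of the payload if the raw data were loaded as is (EL).
raw_bytes = to_csv_bytes(raw_rows, ["store", "sku", "amount", "debug_note"])

# Size of the payload if the data is transformed first (ETL).
transformed_rows = transform(raw_rows)
transformed_bytes = to_csv_bytes(transformed_rows, ["store", "total_amount"])

print(f"raw: {len(raw_bytes)} bytes, transformed: {len(transformed_bytes)} bytes")
```

The saving depends entirely on the transformation. An aggregation like this one shrinks the payload a lot, while a simple type conversion may not shrink it at all.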
What is an example of ELT?
Extract data from on-premises, load it into a cloud product, then do the transformation there. ELT allows raw data to be loaded directly into the target and then transformed in place. A very common example is in BigQuery: you can use SQL to transform the raw data that has been loaded into BigQuery and simply write the result to a new table.
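A minimal sketch of the load-then-transform pattern, assuming Python's built-in `sqlite3` as a local stand-in for the target (it is not BigQuery, and BigQuery's SQL dialect differs in places). The table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the target system.
conn = sqlite3.connect(":memory:")

# Load: the raw data lands in the target unchanged.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 300, "cancelled"), (3, 4999, "paid")],
)

# Transform: SQL running inside the target writes the result to a new table.
conn.execute(
    """
    CREATE TABLE paid_orders AS
    SELECT order_id, amount_cents / 100.0 AS amount_dollars
    FROM raw_orders
    WHERE status = 'paid'
    """
)

paid_orders = conn.execute(
    "SELECT order_id, amount_dollars FROM paid_orders ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(paid_orders, raw_count)
```

The raw table is left untouched, so a different transformation can be run against it later without extracting the data again.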
What comprises Cloud Storage?
Buckets and Objects
What are buckets in Cloud Storage?
Buckets are containers that hold objects. Objects exist inside those buckets and not apart from them.
What are Multi-Region Buckets?
Multi-region buckets are buckets whose objects are replicated across regions.
How are single-region buckets replicated?
In a single-region bucket, as you might expect, the objects are replicated across zones within that one region. When an object is retrieved, it is served from the replica closest to the requester, which is how low latency is achieved.
How is high-throughput achieved with buckets?
Multiple requesters can retrieve objects at the same time from different replicas, which is how high throughput is achieved.
How can you manage costs with Cloud Storage?
Move objects that have not been accessed in 30 days to the Nearline storage class, or after 90 days to the Coldline storage class, to help optimize your costs.
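The 30-day and 90-day thresholds above can be written as a small decision function. This is a sketch of the rule only, not a call into the Cloud Storage API: `suggest_storage_class` is a hypothetical helper, and treating `STANDARD` as the class for recently accessed objects is an assumption.

```python
def suggest_storage_class(days_since_last_access: int) -> str:
    """Map days since last access to a storage class using the 30/90-day rule."""
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    if days_since_last_access >= 90:
        return "COLDLINE"
    if days_since_last_access >= 30:
        return "NEARLINE"
    # Assumption: objects accessed within the last 30 days stay in STANDARD.
    return "STANDARD"


for days in (5, 45, 120):
    print(days, suggest_storage_class(days))
```

In practice this kind of rule would be configured on the bucket rather than run by hand, so the function is only meant to make the thresholds concrete.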
What is Cloud Spanner?
If you require a globally distributed database, you could use Cloud Spanner. You would want a globally distributed database if your database will see updates from applications running in different geographic regions.
If you need horizontal read and write scaling with a SQL database, ___________
consider using Cloud Spanner.