Delta Lake Flashcards
The evolution of data management…
Data Warehouses Late 80s
Data Lake 2011
Cloud Data Platform 2020
While warehouses excel in handling structured data, most enterprises have to deal with unstructured, semi-structured, and data with high variety, velocity, and volume. Data warehouses are not suited for many of these use cases, and they are certainly not the most cost-efficient.
While suitable for storing data, data lakes lack some critical features. Data lakes do not support ACID transactions, do not enforce data quality, and their lack of consistency/isolation makes it almost impossible to mix appends and reads, and batch and stream jobs.
What is Delta Lake?
Delta Lake is a storage solution specifically designed to work with Apache Spark and is read from and written to using Apache Spark.
Elements of Delta Lake
Delta tables
Delta optimization engine
Delta Lake storage layer
Delta architecture design pattern
Apache Spark
A powerful paradigm in modern data storage and processing is the separation of compute and storage.
In this system, Apache Spark loads and performs computation on the data. It does not handle permanent storage. Apache Spark works with Delta Lake, the first storage solution specifically designed to do so.
The benefit of the separation of computation and storage:
Ex. Cloud Data Platform
makes it easier to allocate resources elastically.
A data lake is a NoSQL system meaning that
it can support structured as well as semi-structured and unstructured data.
Enterprise Decision Support System Architectures
Inmon Architecture
on-premises data warehouse system
cloud-based data warehouse system
Cloud Data Platform
Delta Architecture
With the Delta architecture, multiple tables of data are kept in the same data lake. We write batch and stream data to the same table. Data is written to a series of progressively cleaner and more refined tables.