Data Engineering Flashcards
What is Apache Spark?
A distributed processing system used for big data workloads
When to use Spark?
When dealing with big data or when dealing with resource-heavy queries
What is Apache Delta Lake?
Data layer on top of an existing data lake that brings both reliability and performance to data lakes
What are some key features of Delta Lake?
- ACID transactionality
- Data versioning
- Schema enforcement and evolution
- Audit history
- Parquet format
- Compatible with Spark API
- Unifies streaming and batch data processing
What is a data lake?
A system or repository of data stored in its natural/raw format
What is a data warehouse?
A repository for business data. Only highly structured and unified data lives in a data warehouse to support specific business intelligence and analytics needs
What is the difference between a database and a data lake?
A database stores the current data required to power an application. A data lake stores current and historical data for one or more systems in its raw form for the purpose of analyzing the data
What is a transaction in the context of databases and data storage systems?
Any operation that is treated as a single unit of work, which either completes fully or does not complete at all, and leaves the storage system in a consistent state
What are the four key properties that define a transaction?
- Atomicity
- Consistency
- Isolation
- Durability
What is atomicity in the context of ACID transactions?
Each statement in a transaction (read, write, update, delete) is treated as a single unit. Either the entire statement is executed, or none of it is executed. This property prevents data loss and corruption from occurring
What is consistency in the context of ACID transactions?
Ensures that transactions only make changes to tables in predefined, predictable ways. Ensures corruption or errors in data do not create unintended consequences for the integrity of the table
What is isolation in the context of ACID transactions?
When multiple users are reading and writing from the same table all at once, isolation of their transactions ensures that the concurrent transactions don’t interfere with or affect one another
What is durability in the context of ACID transactions?
Ensures that changes to your data made by successfully executed transactions will be saved, even in the event of system failure
What is OLAP?
Online Analytical Processing. A system for performing multidimensional analysis at high speeds on large volumes of data. Typically, this data is from a warehouse.
What is OLTP
Online Transactional Processing. It enables the real-time execution of large numbers of database transactions by large numbers of people, such as in ATMs and in reservation systems
What is an OLAP cube?
A multidimensional array of data
What is Apache Parquet?
A column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk
What is HDFS?
Hadoop Distributed File System. A distributed file system that provides high-throughput access to application data