Data Engineering Flashcards

1
Q

What is Apache Spark?

A

A distributed processing system used for big data workloads

2
Q

When to use Spark?

A

When dealing with big data or when dealing with resource-heavy queries

3
Q

What is Apache Delta Lake?

A

An open-source storage layer that sits on top of an existing data lake and brings both reliability and performance to it

4
Q

What are some key features of Delta Lake?

A
  • ACID transactions
  • Data versioning
  • Schema enforcement and evolution
  • Audit history
  • Parquet format
  • Compatible with Spark API
  • Unifies streaming and batch data processing
5
Q

What is a data lake?

A

A system or repository of data stored in its natural/raw format

6
Q

What is a data warehouse?

A

A repository for business data. Only highly structured and unified data lives in a data warehouse to support specific business intelligence and analytics needs

7
Q

What is the difference between a database and a data lake?

A

A database stores the current data required to power an application. A data lake stores current and historical data for one or more systems in its raw form for the purpose of analyzing the data

8
Q

What is a transaction in the context of databases and data storage systems?

A

Any operation that is treated as a single unit of work, which either completes fully or does not complete at all, and leaves the storage system in a consistent state

9
Q

What are the four key properties that define a transaction?

A
  • Atomicity
  • Consistency
  • Isolation
  • Durability
10
Q

What is atomicity in the context of ACID transactions?

A

All statements in a transaction (reads, writes, updates, deletes) are treated as a single unit: either the entire transaction is executed, or none of it is. This prevents partial updates, and with them data loss and corruption

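The all-or-nothing behavior is easy to see with SQLite from Python's standard library. This is a toy sketch, not tied to any system in the cards: a simulated failure mid-transfer rolls back both writes, never just one.

```python
import sqlite3

# Atomicity sketch: an in-memory SQLite database with two account balances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 100)")
conn.commit()

try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
        raise RuntimeError("simulated crash mid-transaction")
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # both balances unchanged -- neither write survived
```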
11
Q

What is consistency in the context of ACID transactions?

A

Ensures that transactions only make changes to tables in predefined, predictable ways. Ensures corruption or errors in data do not create unintended consequences for the integrity of the table

12
Q

What is isolation in the context of ACID transactions?

A

When multiple users are reading and writing from the same table all at once, isolation of their transactions ensures that the concurrent transactions don’t interfere with or affect one another

13
Q

What is durability in the context of ACID transactions?

A

Ensures that changes to your data made by successfully executed transactions will be saved, even in the event of system failure

14
Q

What is OLAP?

A

Online Analytical Processing. A system for performing multidimensional analysis at high speeds on large volumes of data. Typically, this data is from a warehouse.

15
Q

What is OLTP?

A

Online Transactional Processing. It enables the real-time execution of large numbers of database transactions by large numbers of people, such as in ATMs and in reservation systems

16
Q

What is an OLAP cube?

A

A multidimensional array of data

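A cube can be sketched in plain Python as facts keyed by their dimensions, with "roll-up" queries summing the measure along chosen dimensions. The regions, products, and numbers here are made up for illustration.

```python
from collections import defaultdict

# Toy OLAP cube: sales facts with three dimensions (region, product, quarter).
facts = [
    ("EU", "widget", "Q1", 10),
    ("EU", "widget", "Q2", 15),
    ("EU", "gadget", "Q1", 7),
    ("US", "widget", "Q1", 20),
    ("US", "gadget", "Q2", 5),
]

def roll_up(dims):
    """Sum sales grouped by dimension indices (0=region, 1=product, 2=quarter)."""
    out = defaultdict(int)
    for *key, sales in facts:
        out[tuple(key[d] for d in dims)] += sales
    return dict(out)

print(roll_up([0]))     # per-region totals: {('EU',): 32, ('US',): 25}
print(roll_up([1, 2]))  # product x quarter slice of the cube
```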
17
Q

What is Apache Parquet?

A

A column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk

18
Q

What is HDFS?

A

Hadoop Distributed File System. A distributed file system that provides high-throughput access to application data

19
Q

How is HDFS structured?

A

It has a master/slave architecture. An HDFS cluster consists of a single NameNode and a number of DataNodes.

20
Q

What is a NameNode in HDFS?

A

A master server that manages the file system namespace and regulates access to files by clients. It executes operations like opening, closing, and renaming files and directories. It also determines the mapping of data blocks to DataNodes

21
Q

What is a DataNode in HDFS?

A

DataNodes manage storage attached to the nodes they run on. They serve read and write requests from the file system's clients, and perform data block creation, deletion, and replication upon instruction from the NameNode

22
Q

How does HDFS store data?

A

Files are divided into blocks and each block is stored on a DataNode. The NameNode distributes replicas of these data blocks across the cluster and tells the client application which DataNodes hold the blocks it needs
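The split-replicate-track idea can be sketched in plain Python (the names, block size, and placement policy here are illustrative, not HDFS APIs):

```python
# Toy HDFS-style storage: split a file into fixed-size blocks, place
# replicas of each block on several "datanodes", and record the
# block -> datanode mapping in a "namenode" dictionary.
BLOCK_SIZE = 4          # bytes per block (the real HDFS default is 128 MB)
REPLICATION = 2
DATANODES = ["dn1", "dn2", "dn3"]

def store(data: bytes):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    namenode = {}                        # block id -> datanodes holding a replica
    storage = {dn: {} for dn in DATANODES}
    for block_id, block in enumerate(blocks):
        # simple round-robin replica placement
        replicas = [DATANODES[(block_id + r) % len(DATANODES)] for r in range(REPLICATION)]
        namenode[block_id] = replicas
        for dn in replicas:
            storage[dn][block_id] = block
    return namenode, storage

namenode, storage = store(b"hello world!")
print(namenode)  # 12 bytes -> 3 blocks, each replicated on 2 datanodes
```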

23
Q

What is Apache Hive?

A

Data warehouse software designed to read, write, and manage large datasets residing in distributed storage such as HDFS, using SQL

24
Q

What is the difference between Hive and Spark?

A

Hive is a SQL-like query tool used to analyze structured data in the Hadoop ecosystem, while Spark is an in-memory data processing engine that can be used for batch/stream processing, ML, and interactive SQL

25
Q

What is Apache Hadoop YARN?

A

YARN stands for Yet Another Resource Negotiator. It manages resources and schedules jobs in a Hadoop cluster

26
Q

What is the difference between HDFS and AWS S3?

A

HDFS is designed to run on on-premises hardware or in a private cloud, while S3 is a cloud-based service

27
Q

What is the difference between HDFS and MySQL?

A

HDFS is a distributed file system that is designed to process large amounts of unstructured or semi-structured data, while MySQL is a relational database that is designed to store and manage structured data in a tabular format

28
Q

What is DBFS?

A

DBFS (Databricks File System) is a distributed file system that is built on top of cloud storage, such as Amazon S3 or Microsoft Azure Blob Storage. It abstracts away the details of the underlying storage system and provides a consistent interface for accessing data

29
Q

What are the differences between DBFS and HDFS?

A
  • DBFS is built on top of cloud storage, while HDFS is designed to run on a cluster of commodity hardware.
  • HDFS stores data on the same nodes that process it, which improves performance by reducing the need to transfer data over the network. DBFS does not have this capability and relies on the underlying cloud storage system
  • HDFS manages its own data replication for fault tolerance. DBFS relies on the underlying cloud storage system to handle replication
30
Q

What components does the core architecture of Apache Spark contain?

A
  • Driver
  • Executor
  • Clusters
  • Cluster manager
  • RDDs (Resilient Distributed Datasets)
  • DAG (Directed Acyclic Graph)
31
Q

What is a driver in Apache Spark?

A

The process that controls the execution of a Spark application. It is responsible for scheduling tasks, managing memory, and interacting with the cluster manager to acquire resources for that application

32
Q

What are executors in Apache Spark?

A

Processes that run on worker nodes and execute tasks assigned by the driver. Each executor is responsible for executing a set of tasks on a subset of the data

33
Q

What is a cluster in Apache Spark?

A

A group of compute resources (CPUs, GPUs) that are used to run Spark applications. Clusters can be created and scaled up or down as needed, allowing users to allocate the appropriate amount of resources for their applications

34
Q

What is the cluster manager in Apache Spark?

A

It is responsible for managing the resources in a cluster and allocating them to applications as needed. Spark supports multiple cluster managers, including Apache Mesos, Kubernetes, and Apache Hadoop YARN

35
Q

What are RDDs in Apache Spark?

A

RDD (Resilient Distributed Datasets) are the fundamental data structure in Spark. They are immutable collections of data that are distributed across a cluster and can be processed in parallel. RDDs are fault-tolerant and can be recovered if a node fails, making them resilient to failures in the cluster
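The partition-and-process-in-parallel idea behind RDDs can be sketched with plain Python (no Spark required; the two-partition split and the sum-of-squares job are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# A "dataset" split into partitions, each transformed independently and
# in parallel, with the partial results combined at the end -- the core
# execution pattern of an RDD, minus distribution and fault tolerance.
data = list(range(1, 11))
partitions = [data[0:5], data[5:10]]

def process_partition(part):
    # a map + local reduce on one partition, as one executor would run it
    return sum(x * x for x in part)

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(process_partition, partitions))

total = sum(partials)   # final reduce over the partial results
print(total)            # sum of squares 1..10 = 385
```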

36
Q

What is a DAG in Apache Spark?

A

A graph representation of a Spark application’s execution plan. It shows the dependencies between different stages of the application and helps the driver optimize the execution of the application

37
Q

What is the difference between Apache Hadoop YARN and Apache Mesos?

A

YARN is a good choice for organizations that are looking to manage resources in a Hadoop cluster and to run Hadoop-based workloads, while Mesos is a more general-purpose resource management platform that is suitable for a wide range of workloads

38
Q

What is shuffling in Apache Spark?

A

A process in Apache Spark that is used to redistribute data between executors and nodes in a cluster. During shuffling, Spark uses a combination of in-memory storage and external storage to store and exchange data

39
Q

In which situations does shuffling occur in Apache Spark?

A
  • Partitioning data
  • Joining data
  • Aggregation
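A shuffle can be sketched as hash-partitioning key/value records so that all records with the same key land in the same output partition, as needed before a group-by or join. This is plain Python, not Spark's implementation:

```python
from collections import defaultdict

# Records start out spread across input partitions; the "shuffle"
# redistributes them by hashing each key to an output partition.
input_partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5), ("a", 6)],
]
NUM_OUTPUT = 2

def shuffle(parts, num_output):
    out = [defaultdict(list) for _ in range(num_output)]
    for part in parts:                    # each "mapper" hashes its keys
        for key, value in part:
            out[hash(key) % num_output][key].append(value)
    return out

output = shuffle(input_partitions, NUM_OUTPUT)
# every key's values are now co-located in a single partition,
# so per-key aggregation can run locally on that partition
grouped = {k: sorted(v) for p in output for k, v in p.items()}
print(grouped)  # e.g. {'a': [1, 3, 6], 'b': [2, 4], 'c': [5]}
```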
40
Q

What is Apache Airflow?

A

A platform to programmatically author, schedule, and monitor workflows. It allows users to define workflows as DAGs of tasks, with the ability to specify dependencies between tasks, set up retries and failure handling, and define execution schedules
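The DAG-of-tasks idea can be sketched with the standard library's graphlib. This is not Airflow's API, and the task names are hypothetical; it just shows how dependencies determine a valid execution order:

```python
from graphlib import TopologicalSorter

# A workflow as a DAG: each task maps to the tasks it depends on.
# A scheduler may only run a task after all its upstream tasks finish.
dag = {
    "extract": [],
    "transform": ["extract"],
    "quality_check": ["transform"],
    "load": ["transform"],
    "notify": ["load", "quality_check"],
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # every task appears after all of its dependencies
```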

41
Q

Where is Apache Airflow often used?

A

In data engineering and data science pipelines, where workflows may involve the transfer and transformation of data, the training and deployment of machine learning models, and the execution of custom scripts and programs. It is also commonly used to automate various types of ETL (extract, transform, load) processes, as well as to orchestrate the execution of distributed applications and infrastructure

42
Q

What are the main components of Apache Airflow?

A
  • DAGs
  • Operators
  • Executors
  • Scheduler
  • Web Server
  • Metadata Database
43
Q

What are operators in Apache Airflow?

A

The building blocks of DAGs. They represent a single task that needs to be executed as part of a workflow. For example, running a SQL query, transferring data between systems, and executing a Python function

44
Q

What are executors in Apache Airflow?

A

They are responsible for executing tasks defined in a DAG. There are several types, such as the SequentialExecutor (runs tasks one at a time), LocalExecutor (runs tasks concurrently on a single machine), and CeleryExecutor (runs tasks concurrently across workers using a distributed task queue)

45
Q

When are row-major formats and column-major formats better?

A

Row-major formats are better when you have to do a lot of writes, whereas column-major ones are better when you have to do a lot of column-based reads
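The trade-off can be sketched in plain Python with a made-up three-column table stored both ways:

```python
# Row-major: each record is stored together. Column-major: each column
# is stored together, as in Parquet or an OLAP column store.
row_major = [
    {"id": 1, "name": "a", "price": 10.0},
    {"id": 2, "name": "b", "price": 12.5},
]
column_major = {
    "id": [1, 2],
    "name": ["a", "b"],
    "price": [10.0, 12.5],
}

# Write-heavy: the row layout appends one record in one place...
row_major.append({"id": 3, "name": "c", "price": 9.0})
# ...while the columnar layout must touch every column per record.
for col, val in {"id": 3, "name": "c", "price": 9.0}.items():
    column_major[col].append(val)

# Read-heavy on one column: the columnar layout scans a single
# contiguous list instead of visiting every record.
avg_price = sum(column_major["price"]) / len(column_major["price"])
print(avg_price)  # (10.0 + 12.5 + 9.0) / 3 = 10.5
```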