Data Engineering Flashcards

1
Q

What is Apache Spark?

A

A distributed processing system used for big data workloads

2
Q

When to use Spark?

A

When dealing with big data or when dealing with resource-heavy queries

3
Q

What is Apache Delta Lake?

A

An open-source storage layer that sits on top of an existing data lake and brings both reliability and performance to it

4
Q

What are some key features of Delta Lake?

A
  • ACID transactions
  • Data versioning
  • Schema enforcement and evolution
  • Audit history
  • Parquet format
  • Compatible with Spark API
  • Unifies streaming and batch data processing
5
Q

What is a data lake?

A

A system or repository of data stored in its natural/raw format

6
Q

What is a data warehouse?

A

A repository for business data. Only highly structured and unified data lives in a data warehouse to support specific business intelligence and analytics needs

7
Q

What is the difference between a database and a data lake?

A

A database stores the current data required to power an application. A data lake stores current and historical data for one or more systems in its raw form for the purpose of analyzing the data

8
Q

What is a transaction in the context of databases and data storage systems?

A

Any operation that is treated as a single unit of work, which either completes fully or does not complete at all, and leaves the storage system in a consistent state

9
Q

What are the four key properties that define a transaction?

A
  • Atomicity
  • Consistency
  • Isolation
  • Durability
10
Q

What is atomicity in the context of ACID transactions?

A

All statements in a transaction (reads, writes, updates, deletes) are treated as a single unit: either the entire transaction is executed, or none of it is. This prevents partial updates, and with them data loss and corruption

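The all-or-nothing behavior is easy to see with SQLite from Python's standard library. This is a toy sketch, not tied to any system in the cards: a simulated failure mid-transfer rolls back both writes, never just one.

```python
import sqlite3

# Atomicity sketch: an in-memory SQLite database with two account balances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 100)")
conn.commit()

try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
        raise RuntimeError("simulated crash mid-transaction")
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # both balances unchanged -- neither write survived
```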
11
Q

What is consistency in the context of ACID transactions?

A

Ensures that transactions only make changes to tables in predefined, predictable ways. Ensures corruption or errors in data do not create unintended consequences for the integrity of the table

12
Q

What is isolation in the context of ACID transactions?

A

When multiple users are reading and writing from the same table all at once, isolation of their transactions ensures that the concurrent transactions don’t interfere with or affect one another

13
Q

What is durability in the context of ACID transactions?

A

Ensures that changes to your data made by successfully executed transactions will be saved, even in the event of system failure

14
Q

What is OLAP?

A

Online Analytical Processing. A system for performing multidimensional analysis at high speeds on large volumes of data. Typically, this data is from a warehouse.

15
Q

What is OLTP?

A

Online Transactional Processing. It enables the real-time execution of large numbers of database transactions by large numbers of people, such as in ATMs and in reservation systems

16
Q

What is an OLAP cube?

A

A multidimensional array of data

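A cube can be sketched in plain Python as facts keyed by their dimensions, with "roll-up" queries summing the measure along chosen dimensions. The regions, products, and numbers here are made up for illustration.

```python
from collections import defaultdict

# Toy OLAP cube: sales facts with three dimensions (region, product, quarter).
facts = [
    ("EU", "widget", "Q1", 10),
    ("EU", "widget", "Q2", 15),
    ("EU", "gadget", "Q1", 7),
    ("US", "widget", "Q1", 20),
    ("US", "gadget", "Q2", 5),
]

def roll_up(dims):
    """Sum sales grouped by dimension indices (0=region, 1=product, 2=quarter)."""
    out = defaultdict(int)
    for *key, sales in facts:
        out[tuple(key[d] for d in dims)] += sales
    return dict(out)

print(roll_up([0]))     # per-region totals: {('EU',): 32, ('US',): 25}
print(roll_up([1, 2]))  # product x quarter slice of the cube
```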
17
Q

What is Apache Parquet?

A

A column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk

18
Q

What is HDFS?

A

Hadoop Distributed File System. A distributed file system that provides high-throughput access to application data

19
Q

How is HDFS structured?

A

It has a master/slave architecture. An HDFS cluster consists of a single NameNode and a number of DataNodes.

20
Q

What is a NameNode in HDFS?

A

A master server that manages the file system namespace and regulates access to files by clients. It executes operations like opening, closing, and renaming files and directories. It also determines the mapping of data blocks to DataNodes

21
Q

What is a DataNode in HDFS?

A

DataNodes manage storage attached to the nodes they run on. They serve read and write requests from the file system's clients, and perform data block creation, deletion, and replication upon instruction from the NameNode

22
Q

How does HDFS store data?

A

Files are divided into blocks and each block is stored on a DataNode. The NameNode distributes replicas of these data blocks across the cluster and tells the client application which DataNodes hold the blocks it needs
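The split-replicate-track idea can be sketched in plain Python (the names, block size, and placement policy here are illustrative, not HDFS APIs):

```python
# Toy HDFS-style storage: split a file into fixed-size blocks, place
# replicas of each block on several "datanodes", and record the
# block -> datanode mapping in a "namenode" dictionary.
BLOCK_SIZE = 4          # bytes per block (the real HDFS default is 128 MB)
REPLICATION = 2
DATANODES = ["dn1", "dn2", "dn3"]

def store(data: bytes):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    namenode = {}                        # block id -> datanodes holding a replica
    storage = {dn: {} for dn in DATANODES}
    for block_id, block in enumerate(blocks):
        # simple round-robin replica placement
        replicas = [DATANODES[(block_id + r) % len(DATANODES)] for r in range(REPLICATION)]
        namenode[block_id] = replicas
        for dn in replicas:
            storage[dn][block_id] = block
    return namenode, storage

namenode, storage = store(b"hello world!")
print(namenode)  # 12 bytes -> 3 blocks, each replicated on 2 datanodes
```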

23
Q

What is Apache Hive?

A

Data warehouse software designed to read, write, and manage large datasets residing in distributed storage such as HDFS, using SQL

24
Q

What is the difference between Hive and Spark?

A

Hive is a SQL-like query tool used to analyze structured data in the Hadoop ecosystem, while Spark is an in-memory data processing engine that can be used for batch/stream processing, ML, and interactive SQL

25
Q

What is Apache Hadoop YARN?

A

YARN stands for Yet Another Resource Negotiator. It manages resources and schedules jobs in a Hadoop cluster

26
Q

What is the difference between HDFS and AWS S3?

A

HDFS is designed to run on on-premises hardware or in a private cloud, while S3 is a cloud-based service

27
Q

What is the difference between HDFS and MySQL?

A

HDFS is a distributed file system that is designed to process large amounts of unstructured or semi-structured data, while MySQL is a relational database that is designed to store and manage structured data in a tabular format

28
Q

What is DBFS?

A

DBFS (Databricks File System) is a distributed file system that is built on top of cloud storage, such as Amazon S3 or Microsoft Azure Blob Storage. It abstracts away the details of the underlying storage system and provides a consistent interface for accessing data

29
Q

What are the differences between DBFS and HDFS?

A
  • DBFS is built on top of cloud storage, while HDFS is designed to run on a cluster of commodity hardware.
  • HDFS stores data on the same nodes that process it, which improves performance by reducing the need to transfer data over the network. DBFS does not have this capability and relies on the underlying cloud storage system
  • HDFS manages its own data replication for fault tolerance. DBFS relies on the underlying cloud storage system to handle replication
30
Q

What components does the core architecture of Apache Spark contain?

A
  • Driver
  • Executor
  • Clusters
  • Cluster manager
  • RDDs (Resilient Distributed Datasets)
  • DAG (Directed Acyclic Graph)
31
Q

What is a driver in Apache Spark?

A

The process that controls the execution of a Spark application. It is responsible for scheduling tasks, managing memory, and interacting with the cluster manager to acquire resources for that application

32
Q

What are executors in Apache Spark?

A

Processes that run on worker nodes and execute tasks assigned by the driver. Each executor is responsible for executing a set of tasks on a subset of the data

33
Q

What is a cluster in Apache Spark?

A

A group of compute resources (CPUs, GPUs) that are used to run Spark applications. Clusters can be created and scaled up or down as needed, allowing users to allocate the appropriate amount of resources for their applications

34
Q

What is the cluster manager in Apache Spark?

A

It is responsible for managing the resources in a cluster and allocating them to applications as needed. Spark supports multiple cluster managers, including Apache Mesos, Kubernetes, and Apache Hadoop YARN

35
Q

What are RDDs in Apache Spark?

A

RDD (Resilient Distributed Datasets) are the fundamental data structure in Spark. They are immutable collections of data that are distributed across a cluster and can be processed in parallel. RDDs are fault-tolerant and can be recovered if a node fails, making them resilient to failures in the cluster
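The partition-and-process-in-parallel idea behind RDDs can be sketched with plain Python (no Spark required; the two-partition split and the sum-of-squares job are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# A "dataset" split into partitions, each transformed independently and
# in parallel, with the partial results combined at the end -- the core
# execution pattern of an RDD, minus distribution and fault tolerance.
data = list(range(1, 11))
partitions = [data[0:5], data[5:10]]

def process_partition(part):
    # a map + local reduce on one partition, as one executor would run it
    return sum(x * x for x in part)

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(process_partition, partitions))

total = sum(partials)   # final reduce over the partial results
print(total)            # sum of squares 1..10 = 385
```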

36
Q

What is a DAG in Apache Spark?

A

A graph representation of a Spark application’s execution plan. It shows the dependencies between different stages of the application and helps the driver optimize the execution of the application

37
Q

What is the difference between Apache Hadoop YARN and Apache Mesos?

A

YARN is a good choice for organizations that are looking to manage resources in a Hadoop cluster and to run Hadoop-based workloads, while Mesos is a more general-purpose resource management platform that is suitable for a wide range of workloads

38
Q

What is shuffling in Apache Spark?

A

A process in Apache Spark that is used to redistribute data between executors and nodes in a cluster. During shuffling, Spark uses a combination of in-memory storage and external storage to store and exchange data

39
Q

In which situations does shuffling occur in Apache Spark?

A
  • Partitioning data
  • Joining data
  • Aggregation
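A shuffle can be sketched as hash-partitioning key/value records so that all records with the same key land in the same output partition, as needed before a group-by or join. This is plain Python, not Spark's implementation:

```python
from collections import defaultdict

# Records start out spread across input partitions; the "shuffle"
# redistributes them by hashing each key to an output partition.
input_partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5), ("a", 6)],
]
NUM_OUTPUT = 2

def shuffle(parts, num_output):
    out = [defaultdict(list) for _ in range(num_output)]
    for part in parts:                    # each "mapper" hashes its keys
        for key, value in part:
            out[hash(key) % num_output][key].append(value)
    return out

output = shuffle(input_partitions, NUM_OUTPUT)
# every key's values are now co-located in a single partition,
# so per-key aggregation can run locally on that partition
grouped = {k: sorted(v) for p in output for k, v in p.items()}
print(grouped)  # e.g. {'a': [1, 3, 6], 'b': [2, 4], 'c': [5]}
```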
40
Q

What is Apache Airflow?

A

A platform to programmatically author, schedule, and monitor workflows. It allows users to define workflows as DAGs of tasks, with the ability to specify dependencies between tasks, set up retries and failure handling, and define execution schedules
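The DAG-of-tasks idea can be sketched with the standard library's graphlib. This is not Airflow's API, and the task names are hypothetical; it just shows how dependencies determine a valid execution order:

```python
from graphlib import TopologicalSorter

# A workflow as a DAG: each task maps to the tasks it depends on.
# A scheduler may only run a task after all its upstream tasks finish.
dag = {
    "extract": [],
    "transform": ["extract"],
    "quality_check": ["transform"],
    "load": ["transform"],
    "notify": ["load", "quality_check"],
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # every task appears after all of its dependencies
```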

41
Q

Where is Apache Airflow often used?

A

In data engineering and data science pipelines, where workflows may involve the transfer and transformation of data, the training and deployment of machine learning models, and the execution of custom scripts and programs. It is also commonly used to automate various types of ETL (extract, transform, load) processes, as well as to orchestrate the execution of distributed applications and infrastructure

42
Q

What are the main components of Apache Airflow?

A
  • DAGs
  • Operators
  • Executors
  • Scheduler
  • Web Server
  • Metadata Database
43
Q

What are operators in Apache Airflow?

A

The building blocks of DAGs. They represent a single task that needs to be executed as part of a workflow. For example, running a SQL query, transferring data between systems, and executing a Python function

44
Q

What are executors in Apache Airflow?

A

They are responsible for executing tasks defined in a DAG. There are several types, such as the SequentialExecutor (runs tasks one at a time), LocalExecutor (runs tasks concurrently on a single machine), and CeleryExecutor (runs tasks concurrently across workers using a distributed task queue)

45
Q

When are row-major formats and column-major formats better?

A

Row-major formats are better when you have to do a lot of writes, whereas column-major ones are better when you have to do a lot of column-based reads
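The trade-off can be sketched in plain Python with a made-up three-column table stored both ways:

```python
# Row-major: each record is stored together. Column-major: each column
# is stored together, as in Parquet or an OLAP column store.
row_major = [
    {"id": 1, "name": "a", "price": 10.0},
    {"id": 2, "name": "b", "price": 12.5},
]
column_major = {
    "id": [1, 2],
    "name": ["a", "b"],
    "price": [10.0, 12.5],
}

# Write-heavy: the row layout appends one record in one place...
row_major.append({"id": 3, "name": "c", "price": 9.0})
# ...while the columnar layout must touch every column per record.
for col, val in {"id": 3, "name": "c", "price": 9.0}.items():
    column_major[col].append(val)

# Read-heavy on one column: the columnar layout scans a single
# contiguous list instead of visiting every record.
avg_price = sum(column_major["price"]) / len(column_major["price"])
print(avg_price)  # (10.0 + 12.5 + 9.0) / 3 = 10.5
```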