ARCHITECTURE Flashcards
What is Apache Spark?
Apache Spark is a distributed computing platform used primarily for large-scale data processing and analytics.
Key Features / Advantages of Apache Spark
- Speed: It offers in-memory processing capabilities that make it much faster than traditional disk-based processing systems like Hadoop MapReduce.
- Unified Analytics Engine: Spark provides a unified analytics engine that supports a range of data processing workloads, including batch processing, interactive queries, real-time stream processing, machine learning, and graph processing.
- Ease of Use and Developer Productivity: Spark's high-level APIs, such as the DataFrame and Dataset APIs for structured data processing, let developers write complex data processing pipelines with fewer lines of code (see the sketch after this list).
- Fault Tolerance and Resilience: Spark provides built-in fault tolerance mechanisms, such as lineage tracking and RDD (Resilient Distributed Dataset) abstraction, which allow it to recover from failures.
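As a brief illustration of the ease-of-use point, here is a minimal PySpark sketch (the dataset and column names are made up) that computes an aggregate with the DataFrame API; Spark plans and distributes the work behind these few declarative lines.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# A tiny in-memory dataset standing in for a real sales table.
sales = spark.createDataFrame(
    [("EU", 120.0), ("US", 80.0), ("EU", 45.0), ("APAC", 200.0)],
    ["region", "amount"],
)

# Revenue per region in a few declarative lines.
(sales
 .filter(F.col("amount") > 0)
 .groupBy("region")
 .agg(F.sum("amount").alias("total_revenue"))
 .show())

spark.stop()
```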
Explain the Spark ecosystem.
The Spark ecosystem can be divided into three layers.
- Storage and Cluster Manager: Spark does not come with a built-in storage system or cluster manager; external plugins are used instead.
Plugins for the cluster manager include Apache YARN, Mesos, and Kubernetes.
Plugins for storage include HDFS, Amazon S3, Google Cloud Storage, and the Cassandra File System.
- Spark Core: It has two main components.
Spark Compute Engine: Provides basic functionality such as memory management, task scheduling, fault recovery, and interaction with the cluster manager and storage.
Spark Core APIs: Used for processing the data. They include structured APIs such as the DataFrame and Dataset APIs, and the unstructured (low-level) RDD API.
- Libraries and DSLs: Outside Spark Core, there are four sets of libraries and packages.
Spark SQL: allows structured data to be processed with SQL queries (see the example below).
Spark Streaming: Consume and process data streams.
MLlib: Machine learning library
GraphX: Graph algorithms.
These libraries/packages directly depend on Spark Core APIs to achieve distributed processing.
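For instance, Spark SQL lets the same engine be driven with SQL queries over a registered view. A minimal sketch (the table and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Build a small in-memory DataFrame and expose it to Spark SQL as a view.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)],
    ["name", "age"],
)
people.createOrReplaceTempView("people")

# The SQL query is planned and executed by the same Spark Core engine.
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```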
What is a Spark application?
A Spark application is created whenever a Spark job is submitted.
Upon submission, the driver program for the Spark application initializes and starts running. The driver program is responsible for orchestrating the execution of tasks on the cluster.
The driver program begins executing the user-defined transformations and actions specified in the Spark job. It breaks down the computation into tasks and schedules them for execution on the cluster.
The driver program also monitors the execution of these tasks and tracks their completion.
Upon completion of the tasks from the Spark job, the Spark application terminates, and the driver program exits.
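A minimal sketch of that lifecycle from the driver's point of view (the application name and numbers are arbitrary):

```python
from pyspark.sql import SparkSession

# Creating the SparkSession starts the driver side of the application;
# executors are requested from the cluster manager per the configuration.
spark = SparkSession.builder.appName("lifecycle-demo").getOrCreate()

# Transformations are recorded by the driver but not executed yet.
nums = spark.range(1_000_000)
evens = nums.filter("id % 2 = 0")

# The action below makes the driver break the work into tasks, schedule them
# on executors, monitor them, and collect the final count back.
print(evens.count())

# Stopping the session ends the application and releases its executors.
spark.stop()
```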
What is a DAG and how is it used?
DAG - Directed Acyclic Graph
When a Spark job is submitted, it generates a Spark application. This application is responsible for performing the user-defined operations (transformations and actions). For the driver program to execute these tasks, it needs a plan; that is where the DAG comes in.
The DAG lays out a plan for executing the tasks. After the initial DAG is built, Spark applies a series of optimization techniques, such as rule-based optimizations, cost-based optimizations, and query optimization strategies, to optimize the logical plan and minimize computational overhead.
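The plan Spark builds can be inspected with explain(). A small sketch (the column name is made up) that prints the parsed, analyzed, and optimized logical plans plus the physical plan derived from the chain of transformations:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

df = spark.range(100).withColumn("bucket", F.col("id") % 10)
result = df.filter(F.col("id") > 50).groupBy("bucket").count()

# Prints the logical plans (parsed, analyzed, optimized) and the physical
# plan that Spark derived from the chain of transformations above.
result.explain(True)

spark.stop()
```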
What is the Spark execution model?
Spark follows a master-slave architecture. In Spark terminology, the master is the driver and the slaves are the executors.
Once a Spark job is submitted, the associated Spark application is assigned one driver process and multiple executor processes.
The driver process will be present in the master node and the executor processes will be in the worker nodes.
Each spark application has its own exclusive set of driver and executor processes.
The Spark driver will assign a part of the data and a set of code to executors. The executor is responsible for executing the assigned code on the given data. They keep the output with them and report the status back to the driver.
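How many executors an application gets, and how large they are, is controlled through configuration. A sketch assuming a cluster manager such as YARN (the values are illustrative, not recommendations):

```python
from pyspark.sql import SparkSession

# One driver for this application; the settings below ask the cluster
# manager for 4 executors, each with 2 cores and 4 GB of memory.
spark = (SparkSession.builder
         .appName("execution-model-demo")
         .config("spark.executor.instances", "4")
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "4g")
         .getOrCreate())

# The driver splits this job into tasks; each executor runs its tasks on its
# slice of the data and reports status and results back to the driver.
print(spark.range(10_000_000).selectExpr("sum(id)").collect())

spark.stop()
```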
What are the Spark Execution Modes?
The executors are always going to run on the cluster machines. There is no exception for executors. However, you have the flexibility to start the driver on your local machine or as a process on the cluster.
There are three execution modes:
Client Mode - Start the driver on your local machine
Cluster Mode - Start the driver on the cluster
Local Mode - Start everything in a single local JVM.
If we use an interactive client such as a Jupyter notebook to submit Spark jobs, the job runs in client mode.
If we use the spark-submit utility, we can run the job in either client or cluster mode.
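A sketch of how the mode is typically chosen: local mode can be set directly in code through the master URL, while client vs. cluster mode is normally picked at submission time with spark-submit's --deploy-mode flag (the cluster and file names below are hypothetical).

```python
from pyspark.sql import SparkSession

# Local mode: driver and executors all run inside this single JVM, using
# every available core ("local[*]"); handy for development and testing.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-mode-demo")
         .getOrCreate())

print(spark.range(100).count())
spark.stop()

# Client vs. cluster mode is usually chosen when submitting rather than in
# code, e.g. (shown here only as comments):
#   spark-submit --master yarn --deploy-mode client  my_job.py
#   spark-submit --master yarn --deploy-mode cluster my_job.py
```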
What are the execution methods/ways to submit a Spark job?
- Interactive clients: spark-shell, notebooks. This execution method can only be used with the client execution mode.
- Submit job: spark-submit is the universally accepted method. Alternatives include REST APIs offered by specific vendors. This method can be used with both client and cluster execution modes.
What is the difference between transformations and actions?
- Transformations
Transformations are operations that produce new RDDs, DataFrames, or Datasets by applying functions to existing datasets.
Transformations are evaluated lazily, meaning they are only executed when an action is called on the resulting RDD, DataFrame, or Dataset.
Transformations are immutable, meaning they do not modify the original dataset. Instead, they create new datasets with the desired transformations applied.
- Actions
Actions are operations that trigger the execution of transformations and return results (values rather than new RDDs, DataFrames, or Datasets) to the driver program or write them to external storage.
Actions initiate the execution of the Spark job by triggering the execution of the preceding transformations in the DAG lineage.
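A small sketch of the split (the data is made up): filter and select are transformations that only build up the plan, while count and collect are actions that actually run it and return plain Python values to the driver.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-vs-action").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "tag"])

# Transformations: return new DataFrames; nothing is executed yet.
filtered = df.filter(F.col("id") > 1)
projected = filtered.select("tag")

# Actions: trigger execution of the whole lineage and return results.
print(projected.count())    # a Python int
print(projected.collect())  # a list of Row objects on the driver

spark.stop()
```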
Types of Transformations?
- Narrow Transformations
Narrow transformations are operations where each input partition contributes to at most one output partition.
They are performed independently on each partition of the parent RDD, resulting in a one-to-one mapping between input and output partitions.
Examples of narrow transformations include map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, and sample.
- Wide Transformations
Wide transformations, also known as shuffle transformations, are operations that require data shuffling across partitions.
They involve data exchange and aggregation across multiple partitions of the parent RDD, resulting in a many-to-many mapping between input and output partitions.
Examples of wide transformations include groupByKey, reduceByKey, aggregateByKey, sortByKey, join, cogroup, distinct, and repartition.
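A sketch contrasting the two on an RDD (the word list is made up): map and filter are narrow and stay within their partitions, while reduceByKey is wide and shuffles records with the same key across partitions, creating a new stage in the DAG.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "scala", "spark", "python", "spark"])

# Narrow transformations: each input partition feeds at most one output partition.
pairs = words.map(lambda w: (w, 1)).filter(lambda kv: len(kv[0]) >= 5)

# Wide transformation: reduceByKey shuffles records by key across partitions.
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())
spark.stop()
```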
What is the need for lazy evaluation in spark?
Lazy evaluation allows Spark to construct an optimized execution plan (DAG) by deferring the execution of transformations until the final result is needed.
By delaying the execution of transformations, Spark reduces the overhead associated with intermediate computations, storing intermediate results and data movement.
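A sketch of the effect (the threshold values are arbitrary): the two filters below are written as separate transformations, but because nothing runs until the action, the optimizer can merge them into a single predicate in the optimized logical plan, which explain() makes visible.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

df = spark.range(1_000)

# Two separate transformations; no work happens on these lines.
step1 = df.filter(F.col("id") > 100)
step2 = step1.filter(F.col("id") < 900)

# Only now is the plan built and optimized; the optimized logical plan shows
# the two filters combined, and then execution begins.
step2.explain(True)
print(step2.count())

spark.stop()
```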
Where do the results from the worker nodes get stored?
How is fault tolerance achieved in Spark?
What are RDDs?
Discuss memory management in Spark.