Spark Flashcards
What is Spark?
It is a distributed engine for large-scale data processing. It is memory-oriented (it keeps data in RAM), which can make it up to 100x faster than MapReduce. It provides easy-to-use APIs for Python, Scala, Java, R, and SQL.
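A minimal PySpark sketch of the entry point and a simple in-memory DataFrame computation (the app name and sample data are illustrative):

    from pyspark.sql import SparkSession

    # Entry point for the DataFrame/SQL APIs (app name is illustrative).
    spark = SparkSession.builder.appName("flashcards-demo").getOrCreate()

    # Build a small DataFrame and aggregate it in memory.
    df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
    df.groupBy("key").sum("value").show()

    spark.stop()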
How is fault tolerance achieved in Spark?
Through RDDs (Resilient Distributed Datasets): each RDD tracks its lineage in a DAG, so lost partitions can be recomputed, and checkpoints can also be used for recovery.
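A minimal sketch of lineage and checkpointing (the checkpoint directory is an assumed local path; in production it would be reliable storage such as HDFS):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fault-tolerance-demo").getOrCreate()
    sc = spark.sparkContext
    sc.setCheckpointDir("/tmp/spark-checkpoints")  # assumed path

    rdd = sc.parallelize(range(10)).map(lambda x: x * 2)
    print(rdd.toDebugString())  # the lineage Spark uses to recompute lost partitions

    rdd.checkpoint()  # persist to the checkpoint dir and truncate the lineage
    rdd.count()       # an action triggers the actual checkpoint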
Explain the Spark Ecosystem
Spark Core: the underlying general execution engine for the Spark platform that all other functionality is built on top of.
Spark SQL: a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
Spark Streaming: enables powerful interactive and analytical applications across both streaming and historical data.
MLlib: a scalable machine learning library that delivers both high-quality algorithms and blazing speed.
GraphX: a graph computation engine built on top of Spark that enables users to interactively build, transform, and reason about graph-structured data at scale.
What are RDDs in Spark?
Stands for Resilient Distributed Datasets. RDDs are the fundamental data structure of Spark: immutable (read-only) collections of objects of varying types that are computed on the different nodes of a given cluster.
RDDs are split into partitions, which can be processed on different nodes of a cluster.
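A minimal sketch of creating RDDs (the input path is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    # From a local collection: Spark splits it into partitions.
    numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)
    print(numbers.getNumPartitions())  # 2

    # From external storage (path is illustrative):
    # lines = sc.textFile("hdfs:///data/input.txt")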
What are the features of Spark RDDs?
Lazy Evaluation: all transformations are lazy, which means they do not compute their results immediately; they are only recorded in the DAG. The computation runs when an action is called (see the sketch after this list).
In-Memory Computation: data is kept in RAM, which speeds up processing.
Fault Tolerance: since RDDs keep track of their data lineage, Spark can rebuild lost data by recomputing the lost partitions on other nodes in the cluster.
Immutability: data is safer to share across processes.
Partitioning: since the data typically cannot fit on a single node, it is partitioned across multiple nodes. Spark does this partitioning automatically and distributes the partitions across the nodes of the cluster.
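A minimal sketch of lazy evaluation: transformations only extend the DAG, and nothing runs until an action is called.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("lazy-demo").getOrCreate().sparkContext

    # Nothing executes here: these transformations are only recorded in the DAG.
    evens = sc.parallelize(range(100)).filter(lambda x: x % 2 == 0)
    doubled = evens.map(lambda x: x * 2)

    # The whole pipeline runs only when an action is called.
    print(doubled.count())  # 50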
What are Spark’s Transformations and Actions?
A transformation is a function that produces a new RDD from an existing RDD. There are narrow transformations (map, flatMap, filter) and wide transformations (groupByKey, join, cartesian).
An action is an operation that returns a value to the driver (or writes to storage) after running a computation on an RDD. For example: count, collect, take, top, saveAsTextFile.
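A minimal word-count sketch showing both kinds of operations (reduceByKey stands in here as the wide, shuffle-inducing transformation):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("transform-action-demo").getOrCreate().sparkContext

    words = sc.parallelize(["spark", "flashcards", "spark"])

    # Narrow transformation: each output partition depends on one input partition.
    pairs = words.map(lambda w: (w, 1))

    # Wide transformation: requires a shuffle across partitions.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Action: returns the result to the driver and triggers the computation.
    print(counts.collect())  # e.g. [('spark', 2), ('flashcards', 1)]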
What are Spark’s Shared Variables?
Variables that can be shared across all nodes. There are two types of Shared Variables:
Accumulators: counters or sums that can be reliably updated in parallel processing. Workers can add to an accumulator but cannot read its contents; only the driver program can read the accumulated value.
Broadcast variables: intended as read-only reference data for the workers. A broadcast variable is cached on each machine instead of being shipped with every task.
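A minimal sketch combining both kinds of shared variables (the lookup data is illustrative):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("shared-vars-demo").getOrCreate().sparkContext

    lookup = sc.broadcast({"a": 1, "b": 2})  # read-only data cached on every executor
    missing = sc.accumulator(0)              # workers add to it; only the driver reads it

    def to_value(key):
        if key not in lookup.value:
            missing.add(1)
            return 0
        return lookup.value[key]

    total = sc.parallelize(["a", "b", "c"]).map(to_value).sum()  # action runs the job

    print(total)          # 3
    print(missing.value)  # 1, readable only on the driver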
What’s the difference between RDD, DataFrame and Dataset?
RDD (2011)
- Distributed collection of JVM objects
- Functional operators (map, filter)
DataFrame (2013)
- Distributed collection of Row objects
- Data organized into named columns. It is equivalent to a table in a relational database or a data frame in R/Python
- Catalyst query optimization
Dataset (2015)
- Internally rows, externally typed JVM objects
- Type safe
- Slower than DataFrames
- Strongly typed collection
- Only available in Scala and Java
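A minimal PySpark sketch contrasting the RDD and DataFrame APIs (Datasets are Scala/Java only, so they are not shown; names and data are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("api-demo").getOrCreate()
    sc = spark.sparkContext

    # RDD: functional operators over arbitrary objects, no built-in optimizer.
    rdd = sc.parallelize([("alice", 34), ("bob", 45)])
    adults_rdd = rdd.filter(lambda row: row[1] > 40)

    # DataFrame: named columns, optimized by Catalyst.
    df = spark.createDataFrame(rdd, ["name", "age"])
    adults_df = df.filter(df.age > 40)

    print(adults_rdd.collect())  # [('bob', 45)]
    adults_df.show()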
What’s the difference between cache() and persist() in Spark?
Both save results so that they can be reused later without recomputation.
cache() uses the default storage level, while persist() allows more control over how and where the data is saved (memory, disk, serialization) via a StorageLevel.
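A minimal sketch of both calls (the data is illustrative):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("persist-demo").getOrCreate()

    df_cached = spark.range(1_000_000)
    df_cached.cache()                                    # default storage level

    df_persisted = spark.range(1_000_000)
    df_persisted.persist(StorageLevel.MEMORY_AND_DISK)   # explicit storage level

    df_cached.count()      # actions materialize the stored data
    df_persisted.count()

    df_cached.unpersist()
    df_persisted.unpersist()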
What’s the difference between coalesce() and repartition() in Spark?
repartition() is used to increase or decrease the number of partitions of an RDD, DataFrame, or Dataset, whereas coalesce() is used only to decrease the number of partitions in an efficient way. repartition() is an expensive operation because it shuffles the data.
coalesce() merges existing partitions to minimize the amount of data that is shuffled, whereas repartition() does a full shuffle.
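A minimal sketch (partition counts are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()

    df = spark.range(1_000_000)
    print(df.rdd.getNumPartitions())        # whatever the local default is

    # repartition() can increase or decrease the count; it does a full shuffle.
    df_more = df.repartition(16)
    print(df_more.rdd.getNumPartitions())   # 16

    # coalesce() only decreases the count, merging existing partitions
    # to minimize shuffling.
    df_fewer = df_more.coalesce(4)
    print(df_fewer.rdd.getNumPartitions())  # 4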
What is Spark Streaming?
It is a scalable and fault-tolerant stream processing engine that lets you process live data streams ingested from many sources, such as Kafka or TCP sockets, and push the results out to filesystems, databases, or live dashboards.
Spark Streaming is used to stream real-time data from various sources, such as Twitter, and perform powerful analytics to help businesses.
The fundamental stream unit is the DStream, which is basically a series of RDDs used to process the real-time data.
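A minimal DStream word-count sketch (the host/port are illustrative and assume a text source such as nc -lk 9999):

    from pyspark.sql import SparkSession
    from pyspark.streaming import StreamingContext

    sc = SparkSession.builder.appName("dstream-demo").getOrCreate().sparkContext
    ssc = StreamingContext(sc, batchDuration=5)      # 5-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)  # each batch arrives as an RDD
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()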
What is Spark Structured Streaming?
It is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. Instead of the RDDs used by Spark Streaming, it uses DataFrames and Datasets, which means we can easily apply the same SQL and DataFrame queries to streaming data.
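A minimal Structured Streaming word-count sketch (the socket host/port are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

    # A streaming DataFrame: an unbounded table of incoming lines.
    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    # The usual DataFrame operations apply directly to the stream.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .start())
    query.awaitTermination()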
What’s the difference between Spark Streaming and Spark Structured Streaming?
Both use micro-batching, which divides the stream into small batch jobs (the batch size can range from seconds to hours). However, Spark Streaming uses DStreams (collections of RDDs) whereas Spark Structured Streaming uses DataFrames/Datasets.
In Spark Structured Streaming, data is treated as an unbounded table to which new rows are appended. It also adds the new concept of Continuous Processing, a new streaming execution mode.
What are Spark components?
Spark Driver: the process that clients use to submit applications. It converts the user program into tasks and plans the execution of those tasks.
Spark Executors: the processes that perform the tasks assigned by the Spark Driver.
Cluster Manager: responsible for maintaining a cluster of machines and allocating cluster resources to Spark jobs. It can be Standalone, Mesos, YARN, or Kubernetes.
Spark Master: lives inside the Cluster Manager; it requests resources from the cluster and makes them available to the Spark driver.
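A hedged sketch of requesting resources from the cluster manager when building a session (the master URL and all resource values are illustrative; in practice they are often passed via spark-submit instead):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("resource-demo")
             .master("yarn")                       # cluster manager (illustrative)
             .config("spark.driver.memory", "2g")
             .config("spark.executor.instances", "4")
             .config("spark.executor.cores", "2")
             .config("spark.executor.memory", "4g")
             .getOrCreate())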
Explain how Spark runs applications
- Driver: converts the user program into tasks and plans the execution of those tasks.
- Resource Manager: allocates resources (containers on worker nodes) to the application.
- Worker Nodes: host containers; within each container an Executor runs the tasks.