Spark Flashcards
What is Spark?
It is a distributed engine for large-scale data processing. It is memory-oriented (it keeps data in RAM), which can make it up to 100x faster than MapReduce. It provides easy-to-use APIs for Python, Scala, Java, R, and SQL.
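A minimal PySpark sketch of the entry point and a simple in-memory DataFrame computation (the app name and sample data are illustrative):

    from pyspark.sql import SparkSession

    # Entry point for the DataFrame/SQL APIs (app name is illustrative).
    spark = SparkSession.builder.appName("flashcards-demo").getOrCreate()

    # Build a small DataFrame and aggregate it in memory.
    df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
    df.groupBy("key").sum("value").show()

    spark.stop()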
How is fault tolerance achieved in Spark?
Through RDDs (Resilient Distributed Datasets): each RDD tracks its lineage in a DAG, so lost partitions can be recomputed, and checkpoints can also be used for recovery.
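A minimal sketch of lineage and checkpointing (the checkpoint directory is an assumed local path; in production it would be reliable storage such as HDFS):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fault-tolerance-demo").getOrCreate()
    sc = spark.sparkContext
    sc.setCheckpointDir("/tmp/spark-checkpoints")  # assumed path

    rdd = sc.parallelize(range(10)).map(lambda x: x * 2)
    print(rdd.toDebugString())  # the lineage Spark uses to recompute lost partitions

    rdd.checkpoint()  # persist to the checkpoint dir and truncate the lineage
    rdd.count()       # an action triggers the actual checkpoint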
Explain the Spark Ecosystem
Spark Core: the underlying general execution engine for the Spark platform that all other functionality is built on top of.
Spark SQL: a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
Spark Streaming: enables powerful interactive and analytical applications across both streaming and historical data.
MLlib: a scalable machine learning library that delivers both high-quality algorithms and blazing speed.
GraphX: a graph computation engine built on top of Spark that enables users to interactively build, transform, and reason about graph-structured data at scale.
What are RDDs in Spark?
Stands for Resilient Distributed Datasets. RDDs are the fundamental data structure of Spark: immutable (read-only) collections of objects of varying types that are computed on the different nodes of a given cluster.
RDDs are split into partitions, which can be processed on different nodes of a cluster.
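A minimal sketch of creating RDDs (the input path is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    # From a local collection: Spark splits it into partitions.
    numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)
    print(numbers.getNumPartitions())  # 2

    # From external storage (path is illustrative):
    # lines = sc.textFile("hdfs:///data/input.txt")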
What are the features of Spark RDDs?
Lazy Evaluation: all transformations are lazy, which means they do not compute their results immediately; they are only recorded in the DAG. The computation runs when an action is called (see the sketch after this list).
In-Memory Computation: data is kept in RAM, which speeds up processing.
Fault Tolerance: since RDDs keep track of their data lineage, Spark can rebuild lost data by recomputing the lost partitions on other nodes in the cluster.
Immutability: data is safer to share across processes.
Partitioning: since the data typically cannot fit on a single node, it is partitioned across multiple nodes. Spark does this partitioning automatically and distributes the partitions across the nodes of the cluster.
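A minimal sketch of lazy evaluation: transformations only extend the DAG, and nothing runs until an action is called.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("lazy-demo").getOrCreate().sparkContext

    # Nothing executes here: these transformations are only recorded in the DAG.
    evens = sc.parallelize(range(100)).filter(lambda x: x % 2 == 0)
    doubled = evens.map(lambda x: x * 2)

    # The whole pipeline runs only when an action is called.
    print(doubled.count())  # 50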
What are Spark’s Transformations and Actions?
A transformation is a function that produces a new RDD from an existing RDD. There are narrow transformations (map, flatMap, filter) and wide transformations (groupByKey, join, cartesian).
An action is an operation that returns a value to the driver (or writes to storage) after running a computation on an RDD. For example: count, collect, take, top, saveAsTextFile.
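A minimal word-count sketch showing both kinds of operations (reduceByKey stands in here as the wide, shuffle-inducing transformation):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("transform-action-demo").getOrCreate().sparkContext

    words = sc.parallelize(["spark", "flashcards", "spark"])

    # Narrow transformation: each output partition depends on one input partition.
    pairs = words.map(lambda w: (w, 1))

    # Wide transformation: requires a shuffle across partitions.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Action: returns the result to the driver and triggers the computation.
    print(counts.collect())  # e.g. [('spark', 2), ('flashcards', 1)]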
What are Spark’s Shared Variables?
Variables that can be shared across all nodes. There are two types of Shared Variables:
Accumulators: counters or sums that can be reliably updated in parallel processing. Workers can add to an accumulator but cannot read its contents; only the driver program can read the accumulated value.
Broadcast variables: intended as read-only reference data for the workers. A broadcast variable is cached on each machine instead of being shipped with every task.
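A minimal sketch combining both kinds of shared variables (the lookup data is illustrative):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.appName("shared-vars-demo").getOrCreate().sparkContext

    lookup = sc.broadcast({"a": 1, "b": 2})  # read-only data cached on every executor
    missing = sc.accumulator(0)              # workers add to it; only the driver reads it

    def to_value(key):
        if key not in lookup.value:
            missing.add(1)
            return 0
        return lookup.value[key]

    total = sc.parallelize(["a", "b", "c"]).map(to_value).sum()  # action runs the job

    print(total)          # 3
    print(missing.value)  # 1, readable only on the driver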
What’s the difference between RDD, DataFrame and Dataset?
RDD (2011)
- Distributed collection of JVM objects
- Functional operators (map, filter)
DataFrame (2013)
- Distributed collection of Row objects
- Data organized into named columns. It is equivalent to a table in a relational database or a data frame in R/Python
- Catalyst query optimization
Dataset (2015)
- Internally rows, externally typed JVM objects
- Type safe
- Slower than DataFrames
- Strongly typed collection
- Only available in Scala and Java
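A minimal PySpark sketch contrasting the RDD and DataFrame APIs (Datasets are Scala/Java only, so they are not shown; names and data are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("api-demo").getOrCreate()
    sc = spark.sparkContext

    # RDD: functional operators over arbitrary objects, no built-in optimizer.
    rdd = sc.parallelize([("alice", 34), ("bob", 45)])
    adults_rdd = rdd.filter(lambda row: row[1] > 40)

    # DataFrame: named columns, optimized by Catalyst.
    df = spark.createDataFrame(rdd, ["name", "age"])
    adults_df = df.filter(df.age > 40)

    print(adults_rdd.collect())  # [('bob', 45)]
    adults_df.show()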
What’s the difference between cache() and persist() in Spark?
Both save results so that they can be reused later without recomputation.
cache() uses the default storage level, while persist() allows more control over how and where the data is saved (memory, disk, serialization) via a StorageLevel.
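A minimal sketch of both calls (the data is illustrative):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("persist-demo").getOrCreate()

    df_cached = spark.range(1_000_000)
    df_cached.cache()                                    # default storage level

    df_persisted = spark.range(1_000_000)
    df_persisted.persist(StorageLevel.MEMORY_AND_DISK)   # explicit storage level

    df_cached.count()      # actions materialize the stored data
    df_persisted.count()

    df_cached.unpersist()
    df_persisted.unpersist()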
What’s the difference between coalesce() and repartition() in Spark?
repartition() is used to increase or decrease the number of partitions of an RDD, DataFrame, or Dataset, whereas coalesce() is used only to decrease the number of partitions in an efficient way. repartition() is an expensive operation because it shuffles the data.
coalesce() merges existing partitions to minimize the amount of data that is shuffled, whereas repartition() does a full shuffle.
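A minimal sketch (partition counts are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()

    df = spark.range(1_000_000)
    print(df.rdd.getNumPartitions())        # whatever the local default is

    # repartition() can increase or decrease the count; it does a full shuffle.
    df_more = df.repartition(16)
    print(df_more.rdd.getNumPartitions())   # 16

    # coalesce() only decreases the count, merging existing partitions
    # to minimize shuffling.
    df_fewer = df_more.coalesce(4)
    print(df_fewer.rdd.getNumPartitions())  # 4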
What is Spark Streaming?
It is a scalable and fault-tolerant stream processing engine that lets you process live data streams ingested from many sources, such as Kafka or TCP sockets, and push the results out to filesystems, databases, or live dashboards.
Spark Streaming is used to stream real-time data from various sources, such as Twitter, and perform powerful analytics to help businesses.
The fundamental stream unit is the DStream, which is basically a series of RDDs used to process the real-time data.
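A minimal DStream word-count sketch (the host/port are illustrative and assume a text source such as nc -lk 9999):

    from pyspark.sql import SparkSession
    from pyspark.streaming import StreamingContext

    sc = SparkSession.builder.appName("dstream-demo").getOrCreate().sparkContext
    ssc = StreamingContext(sc, batchDuration=5)      # 5-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)  # each batch arrives as an RDD
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()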
What is Spark Structured Streaming?
It is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. Instead of the RDDs used by Spark Streaming, it uses DataFrames and Datasets, which means we can easily apply the same SQL and DataFrame queries to streaming data.
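A minimal Structured Streaming word-count sketch (the socket host/port are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

    # A streaming DataFrame: an unbounded table of incoming lines.
    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    # The usual DataFrame operations apply directly to the stream.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .start())
    query.awaitTermination()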
What’s the difference between Spark Streaming and Spark Structured Streaming?
Both use micro-batching, which divides the stream into small batch jobs (the batch size can range from seconds to hours). However, Spark Streaming uses DStreams (collections of RDDs) whereas Spark Structured Streaming uses DataFrames/Datasets.
In Spark Structured Streaming, data is treated as an unbounded table to which new rows are appended. It also adds the new concept of Continuous Processing, a new streaming execution mode.
What are Spark components?
Spark Driver: the process that clients use to submit applications. It converts the user program into tasks and plans the execution of those tasks.
Spark Executors: the processes that perform the tasks assigned by the Spark Driver.
Cluster Manager: responsible for maintaining a cluster of machines and allocating cluster resources to Spark jobs. It can be Standalone, Mesos, YARN, or Kubernetes.
Spark Master: lives inside the Cluster Manager; it requests resources from the cluster and makes them available to the Spark driver.
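A hedged sketch of requesting resources from the cluster manager when building a session (the master URL and all resource values are illustrative; in practice they are often passed via spark-submit instead):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("resource-demo")
             .master("yarn")                       # cluster manager (illustrative)
             .config("spark.driver.memory", "2g")
             .config("spark.executor.instances", "4")
             .config("spark.executor.cores", "2")
             .config("spark.executor.memory", "4g")
             .getOrCreate())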
Explain how Spark runs applications
- Driver: converts the user program into tasks and plans the execution of those tasks.
- Resource Manager: allocates resources (containers on worker nodes) to the application.
- Worker Nodes: host containers; within each container an Executor runs the tasks.