Spark Flashcards
For iterative map/reduce implementations, what negatively affects performance? (2)
- Data needs to be loaded from disk on every iteration
- Results are saved to HDFS, with multiple replications
What is Apache Spark?
an open-source, distributed framework designed for big data processing and analytics that takes advantage of in-memory processing
How is data handled with in-memory processing? (2)
- Data is loaded in memory before computation
- Kept in memory during successive steps
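A minimal PySpark sketch of this idea (the local master, app name, and data are illustrative assumptions): the RDD is cached after the first computation so successive steps reuse the in-memory copy instead of re-reading from disk.

```python
from pyspark.sql import SparkSession

# Assumption: a local SparkSession purely for illustration.
spark = SparkSession.builder.master("local[*]").appName("in-memory-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))
squares = numbers.map(lambda x: x * x).cache()  # mark the RDD to be kept in memory

print(squares.count())  # first action: computes the RDD and fills the cache
print(squares.sum())    # successive steps reuse the in-memory copy, no recomputation

spark.stop()
```

persist() offers finer control than cache(), letting you choose a storage level such as memory only or memory and disk.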
In what circumstances is in-memory processing suitable? (4)
- when new data arrives at a fast pace (streams)
- for real-time analytics and exploratory tasks
- when iterative access is required
- when multiple jobs need the same data
What is an in-memory processing approach?
the technique of storing and manipulating data that is stored in the computer’s main memory (RAM), as opposed to traditional disk-based storage
Where do Spark components run?
In Java virtual machines
What is the Spark cluster manager?
The cluster manager (Spark’s standalone cluster manager, YARN, which is mainly suited for Hadoop clusters, or Mesos, which is more generic) keeps track of the resources available in the cluster
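A sketch of how the choice of cluster manager appears in code: it is the master URL used when building the SparkSession (the host names and ports below are placeholders).

```python
from pyspark.sql import SparkSession

# Placeholder hosts/ports; pick ONE master URL depending on the cluster manager.
builder = SparkSession.builder.appName("cluster-manager-demo")

spark = builder.master("spark://master-host:7077").getOrCreate()   # Spark standalone
# spark = builder.master("yarn").getOrCreate()                     # YARN (Hadoop clusters)
# spark = builder.master("mesos://master-host:5050").getOrCreate() # Mesos (more generic)
```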
What do Spark applications consist of? (2)
- Driver process
- (A set of) Executor processes
How are Spark applications run? (5)
- The user submits the Spark application to the cluster manager (typically via spark-submit)
- The application’s driver process is run on one of the nodes in the cluster
- The driver program breaks down the application into tasks which are then distributed to executor processes running on worker nodes in the cluster
- Executors execute the tasks assigned to them and the results from each task are sent back to the driver program for aggregation
- Once all tasks are completed, the Spark application terminates, and the final results are returned to the user or saved to an external storage system
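As a sketch of that lifecycle (file name, path, and data are made up), the driver program below would typically be handed to the cluster manager with spark-submit; the transformations are broken into tasks executed by the executors, and the action brings the aggregated result back to the driver.

```python
# wordcount_app.py -- hypothetical application, submitted for example with:
#   spark-submit --master yarn wordcount_app.py
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-demo").getOrCreate()  # driver process
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/input.txt")        # placeholder input path
counts = (lines.flatMap(lambda line: line.split())   # tasks run on the executors
               .map(lambda word: (word, 1))
               .reduceByKey(add))

for word, n in counts.take(10):                      # action: results sent back to the driver
    print(word, n)

spark.stop()                                         # application terminates
```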
What is an RDD?
a Resilient Distributed Dataset: a partitioned collection of records
How do you control Spark applications?
through a driver process called the SparkSession
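For example (a minimal sketch; the application name is arbitrary), the SparkSession is obtained in the driver and is the entry point to the rest of the API:

```python
from pyspark.sql import SparkSession

# The SparkSession is the entry point in the driver process.
spark = SparkSession.builder.appName("my-driver").getOrCreate()
sc = spark.sparkContext  # lower-level entry point used to create RDDs
```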
How are RDDs created? (2)
- by reading data from an external storage system (for instance HDFS, HBase, Amazon S3, Cassandra, …)
- from an existing collection in the driver program
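Both creation paths in a short PySpark sketch (the HDFS path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-creation-demo").getOrCreate()
sc = spark.sparkContext

# 1. From an external storage system (placeholder HDFS path)
logs = sc.textFile("hdfs:///data/logs/*.txt")

# 2. From an existing collection in the driver program
nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=4)  # distributed over 4 partitions
```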
What are the two types of Spark operation?
- Transformation
- Action
What is a transformation?
A lazy operation to build an RDD from another RDD
What is an action? (3)
An operation that triggers computation on an RDD and returns a result to the driver or writes it to external storage (e.g. HDFS)
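A small sketch of the difference (local master and output path are placeholders): transformations only build up the RDD lineage lazily, and nothing actually runs until an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(100))

# Transformations: lazy, they only describe how to build new RDDs
evens   = nums.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# Actions: trigger the actual computation
print(doubled.count())                     # result returned to the driver
doubled.saveAsTextFile("hdfs:///tmp/out")  # or written to external storage (placeholder path)

spark.stop()
```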