Spark Flashcards

1
Q

For iterative map/reduce implementations, what negatively affects performance? (2)

A
  1. Data needs to be loaded from disk on every iteration
  2. Results are saved to HDFS with multiple replicas
2
Q

What is Apache Spark?

A

an open-source, distributed framework designed for big data processing and analytics that takes advantage of in-memory processing

3
Q

How is data handled with in-memory processing? (2)

A
  1. Data is loaded in memory before computation
  2. Kept in memory during successive steps
4
Q

In what circumstances is in-memory processing suitable? (4)

A
  1. when new data arrives at a fast pace (streams)
  2. for real-time analytics and exploratory tasks
  3. when iterative access is required
  4. when multiple jobs need the same data
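
For the iterative-access and shared-data cases, a minimal PySpark sketch (the path and loop body are placeholders) of keeping data in memory with cache():

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("iterative-demo").getOrCreate()

  # Load once and ask Spark to keep the partitions in memory
  data = spark.sparkContext.textFile("hdfs:///data/points.txt").cache()

  # Each iteration reuses the cached RDD instead of re-reading from disk
  for i in range(10):
      hits = data.filter(lambda line: str(i) in line).count()
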
5
Q

What is an in-memory processing approach?

A

the technique of storing and manipulating data in the computer’s main memory (RAM), as opposed to traditional disk-based storage

6
Q

Where do Spark components run?

A

In Java Virtual Machines (JVMs)

7
Q

What is the Spark cluster manager?

A

The component that keeps track of the resources available in the cluster. It can be Spark’s standalone cluster manager, YARN (mainly suited for Hadoop clusters), or Mesos (more generic)
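
In code, the choice of cluster manager typically shows up as the master URL (a sketch; hostnames and ports are placeholders):

  from pyspark.sql import SparkSession

  # The master URL selects the cluster manager (pick one):
  #   "spark://master-host:7077"  - Spark's standalone cluster manager
  #   "yarn"                      - YARN, on a Hadoop cluster
  #   "mesos://master-host:5050"  - Mesos
  spark = SparkSession.builder.master("spark://master-host:7077").getOrCreate()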

8
Q

What do Spark applications consist of? (2)

A
  1. Driver process
  2. (A set of) Executor processes
9
Q

How are Spark applications run? (5)

A
  1. User submits the Spark application to the cluster manager (via the application)
  2. The application’s driver process is run on one of the nodes in the cluster
  3. The driver program breaks down the application into tasks which are then distributed to executor processes running on worker nodes in the cluster
  4. Executors execute the tasks assigned to them and the results from each task are sent back to the driver program for aggregation
  5. Once all tasks are completed, the Spark application terminates, and the final results are returned to the user or saved to an external storage system
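
A minimal driver sketch of steps 3-4 (the dataset and numbers are stand-ins): the map work runs as tasks on the executors, and the action sends each task's result back to the driver for aggregation.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("lifecycle-demo").getOrCreate()

  # The driver splits this job into tasks, roughly one per partition
  rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

  # Executors run the map tasks; reduce sends partial results back
  # to the driver, which aggregates them into the final answer
  total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)

  spark.stop()  # the application terminates once the work is done
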
10
Q

What is an RDD?

A

a Resilient Distributed Dataset: a partitioned collection of records

11
Q

How do you control Spark applications?

A

through a driver process called the SparkSession
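
In PySpark this looks like the following (a sketch; the application name is arbitrary):

  from pyspark.sql import SparkSession

  # The SparkSession is the handle through which the driver
  # process, and hence the whole application, is controlled
  spark = (SparkSession.builder
           .appName("my-app")
           .master("local[*]")
           .getOrCreate())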

12
Q

How are RDDs created? (2)

A
  1. by reading data from an external storage system (for instance HDFS, HBase, Amazon S3, Cassandra, …)
  2. from an existing collection in the driver program.
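
Both creation paths in PySpark (a sketch; the HDFS path is a placeholder):

  from pyspark.sql import SparkSession

  sc = SparkSession.builder.getOrCreate().sparkContext

  # 1. From an external storage system
  lines = sc.textFile("hdfs:///data/input.txt")

  # 2. From an existing collection in the driver program
  numbers = sc.parallelize([1, 2, 3, 4, 5])
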
13
Q

What are the two types of Spark operation?

A
  1. Transformation
  2. Action
14
Q

What is a transformation?

A

A lazy operation to build an RDD from another RDD

15
Q

What is an action?

A

An operation that takes an RDD and returns a result to the driver or writes it to HDFS

16
Q

Describe how operations are evaluated lazily in Spark (3)

A
  1. Transformations are only executed when they are needed
  2. Only the invocation of an action will trigger the execution chain
  3. Enables building the actual execution plan to optimise the data flow
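
A self-contained sketch of the last three cards (the file path is a placeholder): the transformations only record the lineage, and the final action triggers the whole chain.

  from pyspark.sql import SparkSession

  sc = SparkSession.builder.getOrCreate().sparkContext

  rdd = sc.textFile("hdfs:///data/app.log")            # lazy
  errors = rdd.filter(lambda line: "ERROR" in line)    # transformation: lazy
  words = errors.map(lambda line: line.split()[0])     # transformation: lazy

  # Only this action triggers execution; Spark first builds the
  # actual execution plan for the whole chain and optimises it
  n = words.count()
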
17
Q

When is it necessary to use the low-level API? (3)

A
  1. Control over physical data across a cluster is needed
  2. Some very specific functionality is needed
  3. There is legacy code using RDDs
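
As a sketch of case 1 (the partition count is arbitrary), the low-level API lets you dictate how records are physically partitioned:

  from pyspark.sql import SparkSession

  sc = SparkSession.builder.getOrCreate().sparkContext

  pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])

  # Explicit control over physical data layout across the cluster:
  # choose the number of partitions and hash-partition by key
  partitioned = pairs.partitionBy(8)
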
18
Q

Describe the “micro-batch” approach

A

Accumulates small batches of input data and then processes them in parallel
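
A hedged Structured Streaming sketch of the idea (host and port are placeholders): rows that arrive during each trigger interval are processed together as one small batch.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Accumulate lines arriving on a socket
  lines = (spark.readStream.format("socket")
           .option("host", "localhost")
           .option("port", 9999)
           .load())

  # Process whatever accumulated, as one micro-batch every 2 seconds
  query = (lines.writeStream
           .trigger(processingTime="2 seconds")
           .outputMode("append")
           .format("console")
           .start())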

19
Q

What are Spark low-level APIs?

A

APIs to write applications that operate on RDDs directly

20
Q

What are the three execution modes?

A
  1. Local
  2. Client
  3. Cluster
21
Q

Which is the most common mode for executing Spark programs?

A

Cluster mode

22
Q

What is the difference between client and cluster mode?

A

In cluster mode, the cluster manager launches the driver process and the executor processes on worker nodes inside the cluster, meaning the cluster manager is responsible for maintaining all Spark application-related processes.
In client mode, the driver process runs on the machine that submits the application, meaning the client machine is responsible for maintaining the Spark driver process while the cluster manager maintains the executor processes.
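
When submitting with spark-submit, the mode is selected with --deploy-mode (a sketch; the master and application file are placeholders):

  # Cluster mode: the driver is launched on a worker node inside the cluster
  spark-submit --master yarn --deploy-mode cluster my_app.py

  # Client mode: the driver runs on the machine that submits the application
  spark-submit --master yarn --deploy-mode client my_app.py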

23
Q

What is a narrow operation?

A

operations that are applied to each record independently, so no data needs to move between partitions (e.g. map, flatMap)

24
Q

What is a wide operation?

A

operations that involve records from multiple partitions, requiring a shuffle across the cluster (and are therefore costly), e.g. groupByKey, join
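
A small sketch contrasting the two (the data is a stand-in): map touches each record where it sits, while reduceByKey must bring together records with the same key from different partitions, causing a shuffle.

  from pyspark.sql import SparkSession

  sc = SparkSession.builder.getOrCreate().sparkContext

  pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)], numSlices=2)

  # Narrow: applied to each record independently, no data movement
  upper = pairs.map(lambda kv: (kv[0].upper(), kv[1]))

  # Wide: records with the same key must meet in one partition (shuffle)
  counts = upper.reduceByKey(lambda a, b: a + b)
  print(counts.collect())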