Spark Flashcards
For iterative map/reduce implementations, what negatively affects performance? (2)
- Data needs to be loaded from disk on every iteration
- Results are saved to HDFS, with multiple replications
What is Apache Spark?
an open-source, distributed framework designed for big data processing and analytics that takes advantage of in-memory processing
How is data handled with in-memory processing? (2)
- Data is loaded in memory before computation
- Kept in memory during successive steps
In what circumstances is in-memory processing suitable? (4)
- when new data arrives at a fast pace (streams)
- for real-time analytics and exploratory tasks
- when iterative access is required
- when multiple jobs need the same data
What is an in-memory processing approach?
the technique of storing and manipulating data that is stored in the computer’s main memory (RAM), as opposed to traditional disk-based storage
Where do Spark components run?
In Java virtual machines
What is the Spark cluster manager?
The cluster manager (Spark’s standalone cluster manager, YARN, which is mainly suited for Hadoop, or Mesos, which is more generic) keeps track of the resources available
What do Spark applications consist of? (2)
- Driver process
- (A set of) Executor processes
How are Spark applications run? (5)
- User submits the Spark application to the cluster manager (e.g. via spark-submit)
- The application’s driver process is run on one of the nodes in the cluster
- The driver program breaks down the application into tasks which are then distributed to executor processes running on worker nodes in the cluster
- Executors execute the tasks assigned to them and the results from each task are sent back to the driver program for aggregation
- Once all tasks are completed, the Spark application terminates, and the final results are returned to the user or saved to an external storage system
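The five steps above can be sketched in plain Python. This is only an illustration of the driver/executor pattern, not Spark's actual API: `make_tasks` and `run_task` are made-up names, and the thread pool stands in for executor processes on worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(100))
n_executors = 4

def make_tasks(records, n):
    """Driver side: break the job down into tasks (one per partition)."""
    size = (len(records) + n - 1) // n
    return [records[i:i + size] for i in range(0, len(records), size)]

def run_task(partition):
    """Executor side: compute a partial result for one partition."""
    return sum(x * x for x in partition)

tasks = make_tasks(data, n_executors)
with ThreadPoolExecutor(max_workers=n_executors) as pool:
    partials = list(pool.map(run_task, tasks))   # executors work in parallel

total = sum(partials)   # driver aggregates the partial results
```

In real Spark the "executors" are separate JVM processes on worker nodes and the cluster manager decides where they run; the driver/aggregate shape is the same.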
What is an RDD?
a partitioned collection of records
How do you control Spark applications?
through a driver process called the SparkSession
How are RDDs created? (2)
- by reading data from an external storage system (for instance HDFS, HBase, Amazon S3, Cassandra, …)
- from an existing collection in the driver program.
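A plain-Python sketch of the second route: the `parallelize` function below is a hypothetical stand-in showing how a driver-side collection gets split into partitions, not PySpark's implementation.

```python
def parallelize(collection, num_partitions):
    """Hypothetical stand-in for SparkContext.parallelize:
    split a local collection into roughly equal partitions."""
    parts = [[] for _ in range(num_partitions)]
    for i, record in enumerate(collection):
        parts[i % num_partitions].append(record)
    return parts

rdd = parallelize(range(10), 3)   # 3 partitions, 10 records total
```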
What are the two types of Spark operation?
- Transformation
- Action
What is a transformation?
A lazy operation to build an RDD from another RDD
What is an action?
An operation that computes a result from an RDD and returns it to the driver or writes it to external storage (e.g. HDFS)
Describe how operations are evaluated lazily in Spark (3)
- Transformations are only executed when they are needed
- Only the invocation of an action will trigger the execution chain
- Enables building the actual execution plan to optimise the data flow
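The lazy-evaluation idea can be mimicked in plain Python. `LazyRDD` is a toy class, not Spark: `map` and `filter` only record the operation, and nothing executes until the action `collect` is invoked.

```python
class LazyRDD:
    """Toy illustration of lazy evaluation (not Spark's RDD class)."""
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []        # recorded, but not yet executed

    def map(self, f):               # transformation: lazy
        return LazyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):            # transformation: lazy
        return LazyRDD(self.data, self.ops + [("filter", p)])

    def collect(self):              # action: triggers the whole chain
        out = iter(self.data)
        for kind, f in self.ops:
            out = map(f, out) if kind == "map" else filter(f, out)
        return list(out)

rdd = LazyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No computation has happened yet; only collect() runs the chain.
result = rdd.collect()
```

Because the full chain of operations is known before anything runs, an engine like Spark can reorder and fuse steps into an optimised execution plan.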
When is it necessary to use low-level API? (3)
- Control over physical data across a cluster is needed
- Some very specific functionality is needed
- There is legacy code using RDDs
Describe the “micro-batch” approach
It accumulates small batches of input data and then processes them in parallel
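A minimal sketch of the micro-batch idea in plain Python (not Spark Streaming's API): group an incoming stream into small fixed-size batches, then process each batch as a unit.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Accumulate items from a stream into small batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each batch would be processed as one parallel job; here we just sum it.
processed = [sum(batch) for batch in micro_batches(range(10), 3)]
```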
What are Spark low-level APIs?
APIs to write applications that operate on RDDs directly
What are the three execution modes?
- Local
- Client
- Cluster
Which is the most common mode for executing Spark programs?
Cluster mode
What is the difference between client and cluster mode?
In cluster mode, the cluster manager launches the driver process and executor processes on worker nodes inside the cluster, meaning the cluster manager is responsible for maintaining all Spark Application–related processes
In client mode, the driver process runs on the machine that submits the application, meaning that the client machine is responsible for maintaining the Spark driver process, while the cluster manager maintains the executor processes
What is a narrow operation?
An operation that is applied to each record independently, e.g. map, flatMap
What is a wide operation?
An operation that involves records from multiple partitions and therefore requires a shuffle, which is costly, e.g. groupByKey, join
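The narrow/wide distinction can be sketched with plain Python lists standing in for partitions (an illustration, not Spark code): a narrow operation touches each partition independently, while a wide one, such as grouping by key, must move records across partition boundaries (a shuffle).

```python
from collections import defaultdict

partitions = [[1, 2, 3], [4, 5, 6]]

# Narrow: each output partition depends on exactly one input partition.
narrow = [[x * 10 for x in part] for part in partitions]

# Wide (e.g. a groupBy on key): records from every partition must be
# redistributed by key, crossing partition boundaries.
shuffled = defaultdict(list)
for part in partitions:
    for x in part:
        shuffled[x % 2].append(x)   # key = parity
wide = dict(shuffled)
```

The shuffle in the wide case is what makes such operations costly in a real cluster: it implies moving data between machines.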