Spark Flashcards
For iterative map/reduce implementations, what negatively affects performance? (2)
- Data needs to be loaded from disk on every iteration
- Results are saved to HDFS, with multiple replications
What is Apache Spark?
an open-source, distributed framework designed for big data processing and analytics that takes advantage of in-memory processing
How is data handled with in-memory processing? (2)
- Data is loaded in memory before computation
- Kept in memory during successive steps
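A minimal PySpark sketch of this idea (the local master, app name, and data are illustrative assumptions): the RDD is cached after the first computation so successive steps reuse the in-memory copy instead of re-reading from disk.

```python
from pyspark.sql import SparkSession

# Assumption: a local SparkSession purely for illustration.
spark = SparkSession.builder.master("local[*]").appName("in-memory-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))
squares = numbers.map(lambda x: x * x).cache()  # mark the RDD to be kept in memory

print(squares.count())  # first action: computes the RDD and fills the cache
print(squares.sum())    # successive steps reuse the in-memory copy, no recomputation

spark.stop()
```

persist() offers finer control than cache(), letting you choose a storage level such as memory only or memory and disk.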
In what circumstances is in-memory processing suitable? (4)
- when new data arrives at a fast pace (streams)
- for real-time analytics and exploratory tasks
- when iterative access is required
- when multiple jobs need the same data
What is an in-memory processing approach?
the technique of storing and manipulating data that is stored in the computer’s main memory (RAM), as opposed to traditional disk-based storage
Where do Spark components run?
In Java virtual machines
What is the Spark cluster manager?
The cluster manager (Spark’s standalone cluster manager, YARN, which is mainly suited for Hadoop clusters, or Mesos, which is more generic) keeps track of the resources available in the cluster
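A sketch of how the choice of cluster manager appears in code: it is the master URL used when building the SparkSession (the host names and ports below are placeholders).

```python
from pyspark.sql import SparkSession

# Placeholder hosts/ports; pick ONE master URL depending on the cluster manager.
builder = SparkSession.builder.appName("cluster-manager-demo")

spark = builder.master("spark://master-host:7077").getOrCreate()   # Spark standalone
# spark = builder.master("yarn").getOrCreate()                     # YARN (Hadoop clusters)
# spark = builder.master("mesos://master-host:5050").getOrCreate() # Mesos (more generic)
```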
What do Spark applications consist of? (2)
- Driver process
- (A set of) Executor processes
How are Spark applications run? (5)
- The user submits the Spark application to the cluster manager (typically via spark-submit)
- The application’s driver process is run on one of the nodes in the cluster
- The driver program breaks down the application into tasks which are then distributed to executor processes running on worker nodes in the cluster
- Executors execute the tasks assigned to them and the results from each task are sent back to the driver program for aggregation
- Once all tasks are completed, the Spark application terminates, and the final results are returned to the user or saved to an external storage system
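As a sketch of that lifecycle (file name, path, and data are made up), the driver program below would typically be handed to the cluster manager with spark-submit; the transformations are broken into tasks executed by the executors, and the action brings the aggregated result back to the driver.

```python
# wordcount_app.py -- hypothetical application, submitted for example with:
#   spark-submit --master yarn wordcount_app.py
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-demo").getOrCreate()  # driver process
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/input.txt")        # placeholder input path
counts = (lines.flatMap(lambda line: line.split())   # tasks run on the executors
               .map(lambda word: (word, 1))
               .reduceByKey(add))

for word, n in counts.take(10):                      # action: results sent back to the driver
    print(word, n)

spark.stop()                                         # application terminates
```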
What is an RDD?
a Resilient Distributed Dataset: a partitioned collection of records
How do you control Spark applications?
through a driver process called the SparkSession
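For example (a minimal sketch; the application name is arbitrary), the SparkSession is obtained in the driver and is the entry point to the rest of the API:

```python
from pyspark.sql import SparkSession

# The SparkSession is the entry point in the driver process.
spark = SparkSession.builder.appName("my-driver").getOrCreate()
sc = spark.sparkContext  # lower-level entry point used to create RDDs
```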
How are RDDs created? (2)
- by reading data from an external storage system (for instance HDFS, HBase, Amazon S3, Cassandra, …)
- from an existing collection in the driver program
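Both creation paths in a short PySpark sketch (the HDFS path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-creation-demo").getOrCreate()
sc = spark.sparkContext

# 1. From an external storage system (placeholder HDFS path)
logs = sc.textFile("hdfs:///data/logs/*.txt")

# 2. From an existing collection in the driver program
nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=4)  # distributed over 4 partitions
```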
What are the two types of Spark operation?
- Transformation
- Action
What is a transformation?
A lazy operation to build an RDD from another RDD
What is an action? (3)
An operation that triggers computation on an RDD and returns a result to the driver or writes it to external storage (e.g. HDFS)
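A small sketch of the difference (local master and output path are placeholders): transformations only build up the RDD lineage lazily, and nothing actually runs until an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(100))

# Transformations: lazy, they only describe how to build new RDDs
evens   = nums.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# Actions: trigger the actual computation
print(doubled.count())                     # result returned to the driver
doubled.saveAsTextFile("hdfs:///tmp/out")  # or written to external storage (placeholder path)

spark.stop()
```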