Week 6 - Apache Spark Flashcards
What does RDD stand for?
Resilient Distributed Dataset
Is an RDD read-only?
Yes
RDDs can only be created from which two sources?
1) Data in stable storage
2) Other RDDs
An RDD is a restricted form of distributed shared ____?
Memory system (the cached dataset acts as shared memory)
What pieces of the dataset does an RDD contain?
A set of partitions, which are atomic pieces of the dataset
An RDD contains dependencies on?
Parent RDDs (kept for fault tolerance)
How does an RDD compute its dataset?
With a function that computes the dataset from its parent RDDs (this supports fault tolerance)
It also keeps metadata about its partitioning scheme and data placement
An RDD is a read-only and ____?
Partitioned collection of records
Two important features of RDDs and Apache Spark
1) Fault Tolerance
2) Lazy Evaluation
Describe RDD fault tolerance
It is achieved through lineage: a lost partition is recomputed from its parent RDDs rather than restored from a replica
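The lineage idea can be sketched with a toy model in plain Scala. This is not Spark's real class hierarchy: `ToyRDD`, `Source`, `Mapped`, and `Filtered` are hypothetical names, and stable storage is modelled as an in-memory `Vector`. Each derived dataset stores only its parent and the function applied to it, so its records can be rebuilt at any time.

```scala
// Toy model of lineage. All names are hypothetical, not Spark classes.
object LineageSketch {
  sealed trait ToyRDD[A] {
    // Rebuild the records by walking the lineage back to stable storage.
    def compute(): Vector[A]
  }

  // Base case: data in stable storage (modelled as an in-memory Vector).
  final case class Source[A](stable: Vector[A]) extends ToyRDD[A] {
    def compute(): Vector[A] = stable
  }

  // Derived cases: each one remembers its parent and the function applied to it.
  final case class Mapped[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
    def compute(): Vector[B] = parent.compute().map(f)
  }

  final case class Filtered[A](parent: ToyRDD[A], p: A => Boolean) extends ToyRDD[A] {
    def compute(): Vector[A] = parent.compute().filter(p)
  }

  val source: ToyRDD[Int] = Source(Vector(1, 2, 3, 4, 5))
  val evens: ToyRDD[Int] = Filtered(source, (n: Int) => n % 2 == 0)
  val scaled: ToyRDD[Int] = Mapped(evens, (n: Int) => n * 10)

  // `scaled` holds no records of its own, only lineage. If a computed result
  // is lost, calling compute() again rebuilds the same records.
  val firstRun: Vector[Int] = scaled.compute()
  val afterLoss: Vector[Int] = scaled.compute()
}
```

In this sketch the second call to `compute()` stands in for recovery after a lost partition: nothing is copied back, the records are derived again from the parents.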
Describe RDD lazy evaluation
An RDD is not materialized until an action runs, that is, a reduce-like job that produces meaningful output. Persist only marks the RDD to be kept once it has been computed.
What two classes of operations can you perform on RDDs?
1) Transformations
2) Actions
RDD Transformations
Build RDDs through operations on other RDDs
1) map, filter, join
2) lazy operations
RDD Actions
1) count, collect, save
2) trigger execution
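The split between lazy transformations and actions that trigger execution can be imitated with a Scala collection view. This is only an analogy using the standard library, not Spark itself. The counter `evaluated` is a hypothetical helper added to show when the mapping function actually runs.

```scala
// Analogy only: Scala views stand in for lazy RDD transformations.
object LazySketch {
  var evaluated = 0 // counts how many records the map function has touched

  val records = Vector(1, 2, 3, 4)

  // "Transformations": the view only records what to do; nothing runs yet.
  val doubled = records.view.map { n => evaluated += 1; n * 2 }
  val large = doubled.filter(_ > 4)

  // No record has been processed at this point.
  val touchedBeforeAction = evaluated

  // "Action": forcing the view into a concrete Vector runs the whole pipeline.
  val result = large.toVector
  val touchedAfterAction = evaluated
}
```

Here `toVector` plays the role of an action such as collect: until it is called, `touchedBeforeAction` stays at zero even though two transformations have been defined.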
HDFS is?
1) The Hadoop Distributed File System; Spark can read text files from it
2) A distributed file system
3) Its files can contain text such as log files with error messages
How do you find errors in HDFS files?
file.filter(_.contains("ERROR"))
DAG Scheduler
Partitions the DAG into efficient stages (think narrow and wide dependencies)
Narrow Dependencies
Transformation where each output partition needs input from only one parent partition (very little communication)
1) map
2) union
Wide Dependencies
Each output partition depends on multiple parent partitions, so data is needed from other partitions
1) groupByKey
2) join with inputs that are not partitioned the same way
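A small plain-Scala sketch of the difference, using nested Vectors as stand-ins for partitions. The data, `numOutputPartitions`, and the hash-based `targetPartition` function are all assumptions made for illustration, not Spark's actual partitioner.

```scala
// Partitions are modelled as nested Vectors. All names here are hypothetical.
object DependencySketch {
  type Partition = Vector[(String, Int)]

  // Two made-up input partitions of (key, value) records.
  val partitions: Vector[Partition] = Vector(
    Vector("a" -> 1, "b" -> 2),
    Vector("a" -> 3, "c" -> 4)
  )

  // Narrow dependency: map works on each partition independently, so output
  // partition i is built from input partition i alone.
  val mapped: Vector[Partition] =
    partitions.map(_.map { case (k, v) => (k, v * 10) })

  // Wide dependency: grouping by key must collect the records for a key from
  // every input partition, which is the job of a shuffle.
  val numOutputPartitions = 2
  def targetPartition(key: String): Int =
    math.abs(key.hashCode) % numOutputPartitions

  val grouped: Vector[Map[String, Vector[Int]]] =
    Vector.tabulate(numOutputPartitions) { out =>
      partitions.flatten
        .filter { case (k, _) => targetPartition(k) == out }
        .groupBy(_._1)
        .map { case (k, kvs) => k -> kvs.map(_._2) }
    }
}
```

In the grouped result, the values for key "a" come from both input partitions, which is the communication a narrow transformation like map never needs.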
Should wide dependencies come early or late in the DAG?
Late (less data is left to move by then)
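A rough illustration of why late is cheaper, again in plain Scala with made-up records. The threshold of 100 and the record values are arbitrary assumptions. The size of the collection entering `groupBy` is used as a stand-in for the amount of data a shuffle would move.

```scala
// Made-up records. groupBy stands in for the wide (shuffle) step.
object ShuffleLateSketch {
  val records: Vector[(String, Int)] =
    Vector("a" -> 1, "b" -> 200, "a" -> 300, "c" -> 4, "b" -> 5)

  // Group values by key, as a wide transformation would.
  def group(rs: Vector[(String, Int)]): Map[String, Vector[Int]] =
    rs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Wide step first: every record enters the shuffle, filtering happens after.
  val movedIfWideFirst: Int = records.size
  val wideFirst: Map[String, Vector[Int]] =
    group(records)
      .map { case (k, vs) => k -> vs.filter(_ >= 100) }
      .filter { case (_, vs) => vs.nonEmpty }

  // Narrow filter first: only the surviving records enter the shuffle.
  val filtered: Vector[(String, Int)] = records.filter(_._2 >= 100)
  val movedIfWideLast: Int = filtered.size
  val wideLast: Map[String, Vector[Int]] = group(filtered)
}
```

Both orderings give the same grouped result here, but the second moves fewer records through the wide step.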