Spark Flashcards
Apache Spark
Apache Spark provides in-memory, fault-tolerant distributed data processing.
Key ideas
Spark programs comprise multiple chained data transformations, expressed in a high-level functional programming model.
Spark defines a distributed collection data structure: the Resilient Distributed Dataset (RDD).
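The chained-transformation style can be sketched in plain Python (this is a conceptual illustration, not the Spark API): each step derives a new collection, and a final aggregation forces evaluation.

```python
# Conceptual sketch (plain Python, not the Spark API): chaining
# transformations in a functional style, as Spark programs do with RDDs.
data = [1, 2, 3, 4, 5]

# Each step produces a new collection; nothing is mutated in place.
squared = map(lambda x: x * x, data)           # transformation 1
evens = filter(lambda x: x % 2 == 0, squared)  # transformation 2
total = sum(evens)                             # "action" that forces evaluation

print(total)  # 4 + 16 = 20
```

In Spark the same shape would be `rdd.map(...).filter(...).reduce(...)`, with each transformation evaluated lazily across partitions.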
RDD
RDDs are immutable, partitioned collections:
* logically, an RDD is an immutable collection of data tuples;
* physically, it is distributed (partitioned) across many nodes;
* upon a failure (or cascade of failures), lost RDD partitions can be recreated automatically and efficiently from their dependencies (the lineage).
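The recovery-from-lineage idea can be sketched with a toy class (a hypothetical `MiniRDD`, invented here for illustration): each derived dataset records its parent and the transformation that produced it, so a lost partition can be recomputed rather than restored from a replica.

```python
# Conceptual sketch (plain Python): recreating a lost partition from
# lineage. MiniRDD is a hypothetical class, not part of Spark's API.
class MiniRDD:
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists (one per node)
        self.parent = parent          # dependency in the lineage graph
        self.fn = fn                  # transformation that produced this RDD

    def map(self, fn):
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return MiniRDD(new_parts, parent=self, fn=fn)

    def recover(self, i):
        # Recompute partition i from the parent's data and the recorded fn.
        self.partitions[i] = [self.fn(x) for x in self.parent.partitions[i]]

base = MiniRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None   # simulate losing a partition on one node
doubled.recover(1)             # rebuild it from the lineage
print(doubled.partitions)      # [[2, 4], [6, 8]]
```

Real Spark tracks the full dependency graph per partition, so only the lost partitions are recomputed, not the whole dataset.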
Spark DataFrames
- DataFrames are distributed collections of data that are grouped into named columns.
- DataFrames can be seen as RDDs with a schema that names the fields of the underlying tuples.
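The "tuples plus a schema" view can be sketched as follows (plain Python; the `schema` list is illustrative, not Spark's schema representation):

```python
# Conceptual sketch (plain Python): a DataFrame seen as tuples plus a
# schema that names the fields of each tuple.
rows = [("alice", 34), ("bob", 29)]   # the underlying tuples
schema = ["name", "age"]              # the named columns

# With the schema, fields can be addressed by name instead of position:
records = [dict(zip(schema, row)) for row in rows]
ages = [r["age"] for r in records]
print(ages)  # [34, 29]
```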
Spark SQL
- SQL for specifying computations
Spark SQL Architecture
Programs using SQL/DataFrames are translated into Spark programs.
Programs are optimized to execute efficiently, based on techniques used in database systems.
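One classic database technique such an optimizer applies is predicate pushdown: moving a filter earlier in the plan so fewer rows flow through later steps. A minimal sketch (the plan representation and `optimize`/`run` helpers are invented here for illustration):

```python
# Conceptual sketch (plain Python): predicate pushdown on a tiny query
# plan. The plan format is invented for illustration, not Spark's.
plan = [("project", ["name"]),
        ("filter", lambda r: r["name"] != "bob")]

def optimize(plan):
    # Push filters ahead of projections (predicate pushdown).
    filters = [op for op in plan if op[0] == "filter"]
    others = [op for op in plan if op[0] != "filter"]
    return filters + others

def run(plan, rows):
    for kind, arg in plan:
        if kind == "filter":
            rows = [r for r in rows if arg(r)]
        elif kind == "project":
            rows = [{k: r[k] for k in arg} for r in rows]
    return rows

rows = [{"name": "alice", "age": 34}, {"name": "bob", "age": 29}]
# Both plans produce the same answer, but the optimized plan filters
# rows out before the projection does any work.
print(run(optimize(plan), rows))  # [{'name': 'alice'}]
```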
Libraries for advanced analytics
Spark also ships libraries implementing advanced analytics algorithms, such as graph processing and machine learning.