Spark Flashcards
Apache Spark
Apache Spark provides in-memory, fault-tolerant distributed data processing.
Key ideas
Spark programs comprise multiple chained data transformations, expressed in a high-level functional programming model.
Spark defines a distributed collection data structure: the Resilient Distributed Dataset (RDD).
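The chained-transformation style can be sketched in plain Python (this is a conceptual illustration, not the Spark API): each step derives a new collection, and a final aggregation forces evaluation.

```python
# Conceptual sketch (plain Python, not the Spark API): chaining
# transformations in a functional style, as Spark programs do with RDDs.
data = [1, 2, 3, 4, 5]

# Each step produces a new collection; nothing is mutated in place.
squared = map(lambda x: x * x, data)           # transformation 1
evens = filter(lambda x: x % 2 == 0, squared)  # transformation 2
total = sum(evens)                             # "action" that forces evaluation

print(total)  # 4 + 16 = 20
```

In Spark the same shape would be `rdd.map(...).filter(...).reduce(...)`, with each transformation evaluated lazily across partitions.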
RDD
RDDs are immutable, partitioned collections:
* logically, an RDD is an immutable collection of data tuples;
* physically, it is distributed (partitioned) across many nodes;
* upon a failure (or cascade of failures), lost RDD partitions can be recreated automatically and efficiently from their dependencies (the lineage).
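The recovery-from-lineage idea can be sketched with a toy class (a hypothetical `MiniRDD`, invented here for illustration): each derived dataset records its parent and the transformation that produced it, so a lost partition can be recomputed rather than restored from a replica.

```python
# Conceptual sketch (plain Python): recreating a lost partition from
# lineage. MiniRDD is a hypothetical class, not part of Spark's API.
class MiniRDD:
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists (one per node)
        self.parent = parent          # dependency in the lineage graph
        self.fn = fn                  # transformation that produced this RDD

    def map(self, fn):
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return MiniRDD(new_parts, parent=self, fn=fn)

    def recover(self, i):
        # Recompute partition i from the parent's data and the recorded fn.
        self.partitions[i] = [self.fn(x) for x in self.parent.partitions[i]]

base = MiniRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None   # simulate losing a partition on one node
doubled.recover(1)             # rebuild it from the lineage
print(doubled.partitions)      # [[2, 4], [6, 8]]
```

Real Spark tracks the full dependency graph per partition, so only the lost partitions are recomputed, not the whole dataset.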
Spark DataFrames
- DataFrames are distributed collections of data that are grouped into named columns.
- DataFrames can be seen as RDDs with a schema that names the fields of the underlying tuples.
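The "tuples plus a schema" view can be sketched as follows (plain Python; the `schema` list is illustrative, not Spark's schema representation):

```python
# Conceptual sketch (plain Python): a DataFrame seen as tuples plus a
# schema that names the fields of each tuple.
rows = [("alice", 34), ("bob", 29)]   # the underlying tuples
schema = ["name", "age"]              # the named columns

# With the schema, fields can be addressed by name instead of position:
records = [dict(zip(schema, row)) for row in rows]
ages = [r["age"] for r in records]
print(ages)  # [34, 29]
```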
Spark SQL
- SQL for specifying computations
Spark SQL Architecture
Programs using SQL/DataFrames are translated into Spark programs.
Programs are optimized to execute efficiently, based on techniques used in database systems.
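One classic database technique such an optimizer applies is predicate pushdown: moving a filter earlier in the plan so fewer rows flow through later steps. A minimal sketch (the plan representation and `optimize`/`run` helpers are invented here for illustration):

```python
# Conceptual sketch (plain Python): predicate pushdown on a tiny query
# plan. The plan format is invented for illustration, not Spark's.
plan = [("project", ["name"]),
        ("filter", lambda r: r["name"] != "bob")]

def optimize(plan):
    # Push filters ahead of projections (predicate pushdown).
    filters = [op for op in plan if op[0] == "filter"]
    others = [op for op in plan if op[0] != "filter"]
    return filters + others

def run(plan, rows):
    for kind, arg in plan:
        if kind == "filter":
            rows = [r for r in rows if arg(r)]
        elif kind == "project":
            rows = [{k: r[k] for k in arg} for r in rows]
    return rows

rows = [{"name": "alice", "age": 34}, {"name": "bob", "age": 29}]
# Both plans produce the same answer, but the optimized plan filters
# rows out before the projection does any work.
print(run(optimize(plan), rows))  # [{'name': 'alice'}]
```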
Libraries for advanced analytics
Spark also ships libraries implementing advanced analytics algorithms, such as graph processing and machine learning.