Spark Flashcards

1
Q

Apache Spark

A

Apache Spark provides in-memory, fault-tolerant distributed data processing.

2
Q

Key ideas

A

* Spark programs consist of multiple chained data transformations, written in a
high-level functional programming model.
* Spark defines a distributed collection data structure: the
Resilient Distributed Dataset (RDD).
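
To make "chained transformations in a functional model" concrete, here is a minimal sketch in plain Python, not Spark itself. The data and variable names are hypothetical; in PySpark the analogous steps would be calls such as `rdd.map`, `rdd.filter`, and `rdd.reduce` on a distributed collection.

```python
from functools import reduce

# Hypothetical sample data; in Spark this would be a distributed collection.
readings = [3, 8, 15, 4, 23, 42]

# Each step takes a collection and yields a new one; the input is never modified.
doubled = map(lambda x: x * 2, readings)       # transformation 1: double each value
large = filter(lambda x: x > 10, doubled)      # transformation 2: keep values above 10
total = reduce(lambda a, b: a + b, large, 0)   # final aggregation: sum what is left

print(total)  # 176
```

Because every step is a pure function of its input, the chain can be reordered, split across machines, or re-run after a failure without changing the result.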

3
Q

RDD

A

RDDs are immutable, distributed collections:
* logically, an RDD is an immutable collection of data tuples;
* physically, it is distributed (partitioned) across many nodes;
* upon a failure (or cascade of failures), lost RDD data can be recreated automatically
and efficiently from its dependencies, i.e. the transformations that produced it.
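
The three properties above can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not Spark's implementation: `ToyRDD`, `lose`, and `compute` are hypothetical names, everything runs in one process, and the source data is assumed to live in durable storage that never fails.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, partitioned, rebuilt from dependencies."""

    def __init__(self, partitions=None, parent=None, fn=None):
        # Source data is frozen into tuples so it cannot be mutated afterwards.
        self._source = (
            tuple(tuple(p) for p in partitions) if partitions is not None else None
        )
        self._parent = parent   # dependency: the dataset this one is derived from
        self._fn = fn           # per-partition function applied to the parent
        self._cache = {}        # partition index -> materialized tuple

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def map(self, f):
        # Returns a NEW dataset; self is left untouched.
        return ToyRDD(parent=self, fn=lambda part: tuple(f(x) for x in part))

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: tuple(x for x in part if pred(x)))

    def compute(self, i):
        # Materialize partition i, recomputing from the parent if it is missing.
        if i not in self._cache:
            if self._parent is None:
                self._cache[i] = self._source[i]
            else:
                self._cache[i] = self._fn(self._parent.compute(i))
        return self._cache[i]

    def lose(self, i):
        # Simulate a node failure that drops one materialized partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before = even_squares.collect()
even_squares.lose(1)             # partition 1 is gone...
after = even_squares.collect()   # ...and is rebuilt from its dependencies

print(before, after)  # [4, 16, 36] [4, 16, 36]
```

Only the lost partition is recomputed, and only by replaying the recorded transformations on its parent partition; no replica of the derived data is needed.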

4
Q

Spark DataFrames

A
  • DataFrames are distributed collections of data grouped into named
    columns.
  • DataFrames can be seen as RDDs with a schema that names the fields of the
    underlying tuples.
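
The idea of "tuples plus a schema that names their fields" can be sketched in plain Python with `namedtuple`. This is an analogy only, not Spark's DataFrame API; the rows and the `Person` schema are hypothetical.

```python
from collections import namedtuple

# Hypothetical rows stored as bare tuples (the "RDD view").
rows = [("alice", 34, "NL"), ("bob", 27, "DE"), ("carol", 41, "NL")]

# Without a schema, fields can only be addressed by position.
ages_by_position = [r[1] for r in rows]

# A schema names the fields of the same underlying tuples (the "DataFrame view").
Person = namedtuple("Person", ["name", "age", "country"])
people = [Person(*r) for r in rows]

ages_by_name = [p.age for p in people]
dutch_names = [p.name for p in people if p.country == "NL"]

print(ages_by_name)  # [34, 27, 41]
print(dutch_names)   # ['alice', 'carol']
```

The data is unchanged; the schema only adds names, which is what makes column-based queries possible.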
5
Q

Spark SQL

A
  • Spark SQL lets you specify computations on structured data as SQL queries.
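
As a rough illustration of specifying a computation in SQL, the sketch below uses Python's built-in `sqlite3` module as a stand-in engine; it is not Spark SQL. The table and its contents are hypothetical. In PySpark, a similar query could be passed to `spark.sql(...)` after registering a DataFrame as a temporary view.

```python
import sqlite3

# In-memory database standing in for a distributed table (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("alice", 34, "NL"), ("bob", 27, "DE"), ("carol", 41, "NL")],
)

# The query states WHAT to compute; the engine decides HOW to execute it.
query = """
    SELECT country, AVG(age) AS avg_age
    FROM people
    GROUP BY country
    ORDER BY country
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)  # [('DE', 27.0), ('NL', 37.5)]
```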
6
Q

Spark SQL architecture

A

Programs written with SQL or DataFrames are translated into Spark programs.

These programs are optimized to execute efficiently, using techniques from
database systems.

Libraries provide advanced analytics algorithms, such as graph processing and
machine learning.
