Spark and Python for Big Data Flashcards
When to use Spark?
When it no longer makes sense to fit all your data in RAM or on a single machine
Local system vs distributed system characteristics?
Distributed:
- leverages the power of many machines
- easier to scale: just add more machines, unlike a local system
- fault tolerance: if one machine fails, the whole system can still carry on
What is Hadoop?
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.
What is HDFS?
Hadoop Distributed File System: lets users work with very large datasets by distributing files across many machines (blocks are duplicated for fault tolerance)
What is MapReduce?
A programming paradigm that splits calculations across the machines of the cluster in two steps: map, a local calculation on each machine, and reduce, which combines those partial results into a single result.
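A minimal pure-Python sketch of the idea on a single machine (the sample data is illustrative):

    from functools import reduce

    lines = ["spark is fast", "spark is easy"]

    # Map step: each line is processed independently into (word, 1) pairs
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Reduce step: the partial results are combined into one final answer
    def combine(counts, pair):
        word, n = pair
        counts[word] = counts.get(word, 0) + n
        return counts

    print(reduce(combine, mapped, {}))  # {'spark': 2, 'is': 2, 'fast': 1, 'easy': 1}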
What are the trackers involved in MapReduce?
The Job Tracker (on the master node) sends code to run on the Task Trackers; the Task Trackers allocate CPU and memory for the tasks AND monitor the tasks on the worker nodes
What is Spark?
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Its two big selling points: EASE OF USE and SPEED
How does Spark relate to MapReduce?
Spark is a flexible alternative to MapReduce
Spark vs MapReduce? (NOT Hadoop vs Spark)
- MapReduce requires files to be stored in HDFS; Spark does NOT and can read a wide variety of data sources (see the sketch below)
- Spark can perform operations up to 100x faster than MapReduce (Spark keeps most of the data in RAM after each transformation, and can spill over to disk if memory fills up)
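A hedged sketch of that flexibility: reading plain local files with no HDFS involved (the paths are illustrative, not real data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sources").getOrCreate()

    # None of these paths point at HDFS; Spark reads them directly
    df_csv = spark.read.csv("file:///tmp/people.csv", header=True, inferSchema=True)
    df_json = spark.read.json("file:///tmp/events.json")
    df_parquet = spark.read.parquet("file:///tmp/metrics.parquet")

    spark.stop()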
What is an RDD?
Resilient Distributed Dataset (the idea at the core of Spark)
What are the 4 main features of RDDs? (DFPS)
- Distributed collection of data
- Fault tolerance
- Parallel operation
- Ability to use many data sources (not limited to HDFS)
What are the 3 main characteristics of RDDs? (ILC)
immutable, lazily evaluated, and cacheable
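A minimal PySpark sketch of those three traits, assuming a local Spark install (the data and app name are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "ilc-demo")

    rdd = sc.parallelize(range(10))

    # Immutable: map() does not change rdd, it returns a NEW RDD
    squared = rdd.map(lambda x: x ** 2)

    # Cacheable: keep the result in memory once it is first computed
    squared.cache()

    # Lazily evaluated: nothing above actually ran until this action
    print(squared.count())  # 10

    sc.stop()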
What are the two types of Spark operations?
- transformations: a recipe to follow
- actions: perform the recipe and return something
Because Spark is lazy, with a large dataset you don't want to go ahead and calculate all the transformations until you are sure you want to PERFORM them (with an action); see the sketch below
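A minimal sketch of the recipe/perform split, assuming a local Spark install (the data is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lazy-demo")

    nums = sc.parallelize([1, 2, 3, 4, 5])

    # Transformations: just build up the recipe, no work happens yet
    evens = nums.filter(lambda x: x % 2 == 0)
    doubled = evens.map(lambda x: x * 2)

    # Action: Spark now actually runs the recipe and returns the result
    print(doubled.collect())  # [4, 8]

    sc.stop()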
RDD vs DataFrame?
Since Spark 2.0 the syntax has been moving toward DataFrames, but RDDs are still at the core
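A hedged sketch of the DataFrame syntax with the RDD still underneath (the column names and data are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-demo").getOrCreate()

    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

    # DataFrame syntax (the standard since Spark 2.0)
    df.filter(df["id"] > 1).show()

    # The RDD core is still accessible underneath
    print(df.rdd.take(1))  # [Row(id=1, name='alice')]

    spark.stop()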
What language is Spark written in?
Scala, so the Scala API gets the latest features first. Scala itself runs on the JVM; new features sometimes take one extra release cycle to reach Python, and another to reach R