Spark and Python for Big Data Flashcards

1
Q

When to use Spark?

A

When your data no longer fits in RAM or on a single machine

2
Q

Local system vs Distributed system characteristics

A

Distributed:
leverages the power of many machines
easier to scale: just add new machines, unlike a local system
fault tolerance: if one machine fails, the whole system can still carry on

3
Q

What is Hadoop?

A

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.

4
Q

What is HDFS?

A

Hadoop Distributed File System: allows users to work with very large datasets by distributing files across the machines of a cluster

5
Q

What is MapReduce?

A

A programming paradigm that splits a calculation across the machines of a cluster in two steps: map, a computation performed locally on each machine, and reduce, which combines the local results into a single result.
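The two steps can be sketched in plain Python (a toy word count, not Hadoop itself — the chunks stand in for data held on different machines):

```python
from collections import Counter
from functools import reduce

# Each chunk plays the role of data stored on one machine of the cluster.
chunks = [
    "spark makes big data easy",
    "big data needs big clusters",
]

# Map step: each "machine" counts the words in its own chunk locally.
partial_counts = [Counter(chunk.split()) for chunk in chunks]

# Reduce step: combine the per-chunk counts into a single result.
total = reduce(lambda a, b: a + b, partial_counts)
print(total["big"])  # 3
```

In a real cluster the map step runs in parallel on the worker nodes, and only the small partial results travel over the network to be reduced.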

6
Q

What are the trackers involved in MapReduce?

A

The Job Tracker (on the master node) sends code to run on the Task Trackers; the Task Trackers allocate CPU and memory for the tasks AND monitor the tasks on the worker nodes.

7
Q

What is Spark?

A

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Its two key selling points: EASE OF USE and SPEED

8
Q

How does Spark relate to MapReduce?

A

Spark is a flexible alternative to MapReduce

9
Q

Spark vs MapReduce? (NOT Hadoop vs Spark)

A
  • MapReduce requires files to be stored in HDFS; Spark does NOT, and can read a wide variety of data formats
  • Spark can perform operations up to 100x faster than MapReduce (Spark keeps most of the data in RAM after each transformation, and can spill over to disk if the memory fills up)
10
Q

What is an RDD?

A

Resilient Distributed Dataset (idea at the core of Spark)

11
Q

What are the 4 main features of RDDs? (DFPS)

A
  • Distributed collection of data
  • Fault tolerance
  • Parallel operation
  • Ability to use many data sources (not limited to HDFS)
12
Q

What are the 3 main characteristics of RDDs? (ILC)

A

immutable, lazily evaluated, and cacheable
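A minimal sketch of what those three properties mean, in plain Python (ToyRDD is a made-up class for illustration, not Spark's API):

```python
class ToyRDD:
    """Toy illustration: immutable, lazily evaluated, cacheable."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops          # the lineage: a recipe of transformations
        self._cache = None

    def map(self, fn):
        # Immutable: returns a NEW ToyRDD; self is never modified.
        return ToyRDD(self._data, self._ops + (fn,))

    def cache(self):
        # Cacheable: materialize once, reuse on later compute() calls.
        self._cache = self.compute()
        return self

    def compute(self):
        # Lazy: nothing actually ran until this point.
        if self._cache is not None:
            return self._cache
        values = list(self._data)
        for fn in self._ops:
            values = [fn(v) for v in values]
        return values


rdd = ToyRDD([1, 2, 3])
doubled = rdd.map(lambda x: x * 2)   # new object, nothing computed yet
print(doubled.compute())             # [2, 4, 6]
print(rdd.compute())                 # [1, 2, 3]  (original untouched)
```

Real RDDs work the same way in spirit: transformations only extend the lineage, and Spark uses that lineage to recompute lost partitions, which is where the fault tolerance comes from.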

13
Q

What are the two types of Spark operations?

A
  • transformations: a recipe to follow
  • actions: perform the recipe and return something

Because Spark is lazily evaluated, with a large dataset you don’t want to go on computing all the transformations until you are sure you want to PERFORM them (with an action).
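Python generators give a rough feel for this "recipe vs. perform" split without needing Spark itself (a plain-Python analogy, not the PySpark API):

```python
numbers = range(1_000_000)

# "Transformations": generator expressions build a recipe; nothing runs yet.
doubled = (n * 2 for n in numbers)
evens = (n for n in doubled if n % 4 == 0)

# Only an "action" forces the work — like .collect() or .count() in Spark.
first_five = [next(evens) for _ in range(5)]
print(first_five)  # [0, 4, 8, 12, 16]
```

Note that only five elements were ever computed, even though the recipe was declared over a million numbers — the same reason Spark can skip work it never needs.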

14
Q

RDD vs DataFrame?

A

Since Spark 2.0, the syntax has been moving toward DataFrames, but RDDs remain at the core.

15
Q

What language is Spark written in?

A

Scala, so the Scala API gets the latest features first. Scala runs on the JVM (Java); new features sometimes take one extra release cycle to reach Python, and another to reach R.

16
Q

Can Spark be run on a single machine?

A

Yes, in local mode, but realistically Spark is run on a cluster, e.g. on AWS or Google Cloud. These clusters will almost always be Linux-based (real world).