Spark and Python for Big Data Flashcards
When to use Spark?
When it no longer makes sense to fit all your data in RAM or on a single machine
Local system vs distributed system characteristics?
Distributed:
- leverages the power of many machines
- easier to scale: just add more machines, unlike a local system
- fault tolerance: if one machine fails, the whole system can still carry on
What is Hadoop?
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.
What is HDFS?
Hadoop Distributed File System: lets users work with very large datasets by distributing files across many machines (blocks are duplicated for fault tolerance)
What is MapReduce?
A programming paradigm that splits calculations across the machines of the cluster in two steps: map, a local calculation on each machine, and reduce, which combines those partial results into a single result.
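A minimal pure-Python sketch of the idea on a single machine (the sample data is illustrative):

    from functools import reduce

    lines = ["spark is fast", "spark is easy"]

    # Map step: each line is processed independently into (word, 1) pairs
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Reduce step: the partial results are combined into one final answer
    def combine(counts, pair):
        word, n = pair
        counts[word] = counts.get(word, 0) + n
        return counts

    print(reduce(combine, mapped, {}))  # {'spark': 2, 'is': 2, 'fast': 1, 'easy': 1}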
What are the trackers involved in MapReduce?
The Job Tracker (on the master node) sends code to run on the Task Trackers; the Task Trackers allocate CPU and memory for the tasks AND monitor the tasks on the worker nodes
What is Spark?
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Its two big selling points: EASE OF USE and SPEED
How does Spark relate to MapReduce?
Spark is a flexible alternative to MapReduce
Spark vs MapReduce? (NOT Hadoop vs Spark)
- MapReduce requires files to be stored in HDFS; Spark does NOT and can read a wide variety of data sources (see the sketch below)
- Spark can perform operations up to 100x faster than MapReduce (Spark keeps most of the data in RAM after each transformation, and can spill over to disk if memory fills up)
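A hedged sketch of that flexibility: reading plain local files with no HDFS involved (the paths are illustrative, not real data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sources").getOrCreate()

    # None of these paths point at HDFS; Spark reads them directly
    df_csv = spark.read.csv("file:///tmp/people.csv", header=True, inferSchema=True)
    df_json = spark.read.json("file:///tmp/events.json")
    df_parquet = spark.read.parquet("file:///tmp/metrics.parquet")

    spark.stop()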
What is an RDD?
Resilient Distributed Dataset (the idea at the core of Spark)
What are the 4 main features of RDDs? (DFPS)
- Distributed collection of data
- Fault tolerance
- Parallel operation
- Ability to use many data sources (not limited to HDFS)
What are the 3 main characteristics of RDDs? (ILC)
immutable, lazily evaluated, and cacheable
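A minimal PySpark sketch of those three traits, assuming a local Spark install (the data and app name are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "ilc-demo")

    rdd = sc.parallelize(range(10))

    # Immutable: map() does not change rdd, it returns a NEW RDD
    squared = rdd.map(lambda x: x ** 2)

    # Cacheable: keep the result in memory once it is first computed
    squared.cache()

    # Lazily evaluated: nothing above actually ran until this action
    print(squared.count())  # 10

    sc.stop()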
What are the two types of Spark operations?
- transformations: a recipe to follow
- actions: perform the recipe and return something
Because Spark is lazy, with a large dataset you don't want to go ahead and calculate all the transformations until you are sure you want to PERFORM them (with an action); see the sketch below
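A minimal sketch of the recipe/perform split, assuming a local Spark install (the data is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lazy-demo")

    nums = sc.parallelize([1, 2, 3, 4, 5])

    # Transformations: just build up the recipe, no work happens yet
    evens = nums.filter(lambda x: x % 2 == 0)
    doubled = evens.map(lambda x: x * 2)

    # Action: Spark now actually runs the recipe and returns the result
    print(doubled.collect())  # [4, 8]

    sc.stop()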
RDD vs DataFrame?
Since Spark 2.0 the syntax has been moving toward DataFrames, but RDDs are still at the core
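A hedged sketch of the DataFrame syntax with the RDD still underneath (the column names and data are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-demo").getOrCreate()

    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

    # DataFrame syntax (the standard since Spark 2.0)
    df.filter(df["id"] > 1).show()

    # The RDD core is still accessible underneath
    print(df.rdd.take(1))  # [Row(id=1, name='alice')]

    spark.stop()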
What language is Spark written in?
Scala, so the Scala API gets the latest features first. Scala itself runs on the JVM; new features sometimes take one extra release cycle to reach Python, and another to reach R