Lecture 7: Apache Spark Flashcards

1
Q

What is Apache Hadoop?

A
  • An ecosystem of tools for processing “Big Data”
  • Open-source project
  • Part of the Apache Software Foundation
  • Distributed framework in Java
2
Q

What is HDFS not a good fit for?

A
  • Low-latency access
  • Lots of small files
  • Multiple writers, arbitrary file modifications
3
Q

RDDs: Resilient Distributed Datasets

A

Distributed collection of objects that can be cached in memory across cluster nodes.
Characteristics: immutable, resilient, distributed, lazily evaluated, cacheable/persistent, fault-tolerant
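
A minimal PySpark sketch of these properties (the local master URL, app name, and data are illustrative, not from the card):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")  # illustrative local cluster

# Distributed: the collection is partitioned across worker nodes
rdd = sc.parallelize(range(1, 1001))

# Immutable + lazily evaluated: map() returns a new RDD; nothing runs yet
squares = rdd.map(lambda x: x * x)

# Cacheable: keep the computed partitions in memory for reuse
squares.cache()

# Only an action forces evaluation; the recorded lineage makes the
# RDD fault-tolerant (lost partitions can be recomputed)
print(squares.sum())  # 333833500
```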

4
Q

Spark actions

A
  • Return a value to the driver program after running a computation on the dataset (reduce is an action)
  • Result in a DAG of operations
  • The DAG is compiled into stages; each stage is executed as a series of tasks (see the sketch below)
  • Task: the fundamental unit of work
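
A small illustration of an action triggering the DAG (the data and app name are made up for the example):

```python
from pyspark import SparkContext

sc = SparkContext("local", "action-demo")

# Transformations: only recorded as lineage in the DAG, nothing executes yet
squares = sc.parallelize(range(1, 5)).map(lambda x: x * x)

# Action: Spark compiles the DAG into stages, runs them as tasks,
# and returns the result to the driver program
total = squares.reduce(lambda a, b: a + b)
print(total)  # 1 + 4 + 9 + 16 = 30
```
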
5
Q

Spark transformations

A

Transformations create a new dataset from an existing one (map and filter are transformations)
  • Return pointers to a new RDD
  • Transformations are lazy (not computed immediately); they are only computed when an action requires a result
  • Transformed RDDs get recomputed when actions run on them
  • An RDD can be persisted in memory or on disk (see the sketch below)
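
A sketch of this laziness in PySpark (variable names and data are illustrative):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "transform-demo")
nums = sc.parallelize([1, 2, 3, 4, 5, 6])

evens = nums.filter(lambda x: x % 2 == 0)  # lazy: returns a new RDD pointer
doubled = evens.map(lambda x: x * 2)       # still nothing computed

# Persist in memory, spilling to disk if needed, so later actions reuse it
doubled.persist(StorageLevel.MEMORY_AND_DISK)

print(doubled.collect())  # action triggers the whole chain: [4, 8, 12]
```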

6
Q

Action examples

A

collect(), count(), countByValue(), take(num), top(num), reduce(func)
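
The same actions on a toy RDD (the SparkContext setup and data are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local", "actions-demo")
rdd = sc.parallelize([3, 1, 2, 3])

rdd.collect()                   # [3, 1, 2, 3]  (all elements to the driver)
rdd.count()                     # 4
rdd.countByValue()              # {3: 2, 1: 1, 2: 1}
rdd.take(2)                     # [3, 1]  (first num elements)
rdd.top(2)                      # [3, 3]  (largest num elements)
rdd.reduce(lambda a, b: a + b)  # 9
```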

7
Q

ML in Hadoop vs Spark

A
  • In Hadoop, each ML iteration translates to a single MapReduce job
  • Each of these jobs needs to store its data in HDFS, which leads to significant overhead
  • Keeping state across jobs is not directly available in MapReduce
  • Constant trade-off between quality of results and performance

Spark
  • Spark is the first general-purpose big data processing engine built for ML from day one
  • The initial design of Spark was driven by ML optimization:
    - caching: for running over the same data multiple times
    - accumulators: to keep state across multiple iterations in memory
    - good support for CPU-intensive tasks through laziness
    - two main operations: fit and transform (see the sketch below)
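
A minimal fit/transform sketch using Spark's ML pipeline API (the toy DataFrame and column names are assumptions for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.master("local").appName("ml-demo").getOrCreate()

df = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.0)], ["x", "y"])

# transform: turn raw columns into a feature vector
features = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

# fit: run the iterative training on the cluster, returning a fitted model
model = LinearRegression(featuresCol="features", labelCol="y").fit(features)

# transform again: apply the fitted model to produce predictions
model.transform(features).select("x", "y", "prediction").show()
```
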

8
Q

Spark vs Hadoop: Data processing models

A

Data Processing Models: Hadoop is designed for batch processing using the MapReduce model, ideal for large-scale data tasks that can tolerate high latency. Spark, on the other hand, supports both batch and real-time stream processing with its Resilient Distributed Datasets (RDDs), allowing for faster in-memory processing.

Processing Speed: Hadoop’s disk I/O for reading and writing data during batch processing makes it slower, particularly for iterative tasks. Spark’s in-memory model significantly reduces disk I/O, making it up to 100 times faster for certain workloads.

Ease of Use: Hadoop requires developers to use the complex MapReduce paradigm and manage HDFS configurations. Spark offers a user-friendly API, supports multiple programming languages (Scala, Java, Python, R), and provides high-level libraries for various data tasks, making it easier to use.

Data Processing Paradigms: Hadoop excels at parallelizable, large-scale data processing tasks like log analysis and ETL. Spark, however, supports a broader range of paradigms, efficiently handling real-time processing, machine learning, graph processing, and interactive SQL queries.

Integration with Existing Ecosystems: Hadoop has a well-established ecosystem with seamless integration with tools like Hive, Pig, and HBase. Spark integrates with the Hadoop ecosystem using HDFS for data storage and offers flexible deployment options, including standalone clusters, Mesos, or Apache YARN.
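
As an illustration of that integration, reading a file stored in HDFS from PySpark is a one-liner (the namenode address and path below are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Hypothetical HDFS location; Spark reads it like any other data source
logs = spark.read.text("hdfs://namenode:9000/logs/access.log")
print(logs.count())
```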
