Lecture 7: Apache Spark Flashcards

1
Q

What is Apache Hadoop?

A
  • An ecosystem of tools for processing “Big Data”
  • Open-source project
  • Part of the Apache Software Foundation
  • Distributed framework in Java
2
Q

What is HDFS not a good fit for?

A
  • Low-latency access
  • Lots of small files
  • Multiple writers, arbitrary file modifications
3
Q

RDDs: Resilient Distributed Datasets

A

Distributed collection of objects that can be cached in memory across cluster nodes.
Characteristics: immutable, resilient, distributed, lazily evaluated, cacheable/persistent, fault-tolerant
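
A minimal PySpark sketch of these properties (the local master URL, app name, and data are illustrative, not from the card):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")  # illustrative local cluster

# Distributed: the collection is partitioned across worker nodes
rdd = sc.parallelize(range(1, 1001))

# Immutable + lazily evaluated: map() returns a new RDD; nothing runs yet
squares = rdd.map(lambda x: x * x)

# Cacheable: keep the computed partitions in memory for reuse
squares.cache()

# Only an action forces evaluation; the recorded lineage makes the
# RDD fault-tolerant (lost partitions can be recomputed)
print(squares.sum())  # 333833500
```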

4
Q

Spark actions

A
  • Return a value to the driver program after running a computation on the dataset (reduce is an action)
  • Result in a DAG of operations
  • The DAG is compiled into stages; each stage is executed as a series of tasks (see the sketch below)
  • Task: the fundamental unit of work
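
A small illustration of an action triggering the DAG (the data and app name are made up for the example):

```python
from pyspark import SparkContext

sc = SparkContext("local", "action-demo")

# Transformations: only recorded as lineage in the DAG, nothing executes yet
squares = sc.parallelize(range(1, 5)).map(lambda x: x * x)

# Action: Spark compiles the DAG into stages, runs them as tasks,
# and returns the result to the driver program
total = squares.reduce(lambda a, b: a + b)
print(total)  # 1 + 4 + 9 + 16 = 30
```
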
5
Q

Spark transformations

A

Transformations create a new dataset from an existing one (map and filter are transformations)
  • Return pointers to a new RDD
  • Transformations are lazy (not computed immediately); they are only computed when an action requires a result
  • Transformed RDDs get recomputed when actions run on them
  • An RDD can be persisted in memory or on disk (see the sketch below)
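
A sketch of this laziness in PySpark (variable names and data are illustrative):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "transform-demo")
nums = sc.parallelize([1, 2, 3, 4, 5, 6])

evens = nums.filter(lambda x: x % 2 == 0)  # lazy: returns a new RDD pointer
doubled = evens.map(lambda x: x * 2)       # still nothing computed

# Persist in memory, spilling to disk if needed, so later actions reuse it
doubled.persist(StorageLevel.MEMORY_AND_DISK)

print(doubled.collect())  # action triggers the whole chain: [4, 8, 12]
```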

6
Q

Action examples

A

collect(), count(), countByValue(), take(num), top(num), reduce(func)
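
The same actions on a toy RDD (the SparkContext setup and data are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local", "actions-demo")
rdd = sc.parallelize([3, 1, 2, 3])

rdd.collect()                   # [3, 1, 2, 3]  (all elements to the driver)
rdd.count()                     # 4
rdd.countByValue()              # {3: 2, 1: 1, 2: 1}
rdd.take(2)                     # [3, 1]  (first num elements)
rdd.top(2)                      # [3, 3]  (largest num elements)
rdd.reduce(lambda a, b: a + b)  # 9
```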

7
Q

ML in Hadoop vs Spark

A
  • In Hadoop, each ML iteration translates to a single MapReduce job
  • Each of these jobs needs to store its data in HDFS, which leads to significant overhead
  • Keeping state across jobs is not directly available in MapReduce
  • Constant trade-off between quality of results and performance

Spark
  • Spark is the first general-purpose big data processing engine built for ML from day one
  • The initial design of Spark was driven by ML optimization:
    - caching: for running over the same data multiple times
    - accumulators: to keep state across multiple iterations in memory
    - good support for CPU-intensive tasks through laziness
    - two main operations: fit and transform (see the sketch below)
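
A minimal fit/transform sketch using Spark's ML pipeline API (the toy DataFrame and column names are assumptions for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.master("local").appName("ml-demo").getOrCreate()

df = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.0)], ["x", "y"])

# transform: turn raw columns into a feature vector
features = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

# fit: run the iterative training on the cluster, returning a fitted model
model = LinearRegression(featuresCol="features", labelCol="y").fit(features)

# transform again: apply the fitted model to produce predictions
model.transform(features).select("x", "y", "prediction").show()
```
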

8
Q

Spark vs Hadoop: Data processing models

A

Data Processing Models: Hadoop is designed for batch processing using the MapReduce model, ideal for large-scale data tasks that can tolerate high latency. Spark, on the other hand, supports both batch and real-time stream processing with its Resilient Distributed Datasets (RDDs), allowing for faster in-memory processing.

Processing Speed: Hadoop’s disk I/O for reading and writing data during batch processing makes it slower, particularly for iterative tasks. Spark’s in-memory model significantly reduces disk I/O, making it up to 100 times faster for certain workloads.

Ease of Use: Hadoop requires developers to use the complex MapReduce paradigm and manage HDFS configurations. Spark offers a user-friendly API, supports multiple programming languages (Scala, Java, Python, R), and provides high-level libraries for various data tasks, making it easier to use.

Data Processing Paradigms: Hadoop excels at parallelizable, large-scale data processing tasks like log analysis and ETL. Spark, however, supports a broader range of paradigms, efficiently handling real-time processing, machine learning, graph processing, and interactive SQL queries.

Integration with Existing Ecosystems: Hadoop has a well-established ecosystem with seamless integration with tools like Hive, Pig, and HBase. Spark integrates with the Hadoop ecosystem using HDFS for data storage and offers flexible deployment options, including standalone clusters, Mesos, or Apache YARN.
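
As an illustration of that integration, reading a file stored in HDFS from PySpark is a one-liner (the namenode address and path below are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Hypothetical HDFS location; Spark reads it like any other data source
logs = spark.read.text("hdfs://namenode:9000/logs/access.log")
print(logs.count())
```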
