Introduction to SPARK Flashcards
1
Q
What is SPARK?
A
Apache Spark is a fast (in-memory), general-purpose computing engine for large-scale data processing on a cluster.
2
Q
About SPARK
A
- Large-scale data processing engine
- General purpose
- Runs on a Hadoop cluster and uses storage in HDFS
3
Q
Supports a wide range of workloads
A
- Machine learning
- Business intelligence
- Streaming
- Batch Processing
4
Q
Distributed Data Processing
Framework
A
- Processing
- Hadoop MapReduce
- Spark API
- Resource Management
- YARN (recommended by Cloudera/Databricks)
- Storage
- HDFS
5
Q
Big Data Processing with SPARK
A
- Hadoop is based on two key concepts
- Distribute data when the data is stored
- Run computation where the data is
- Spark adds
- Process data in memory for faster execution
- Execution plan to organize work
- High-level API
sc.textFile(file) \
  .flatMap(lambda s: s.split()) \
  .map(lambda w: (w, 1)) \
  .reduceByKey(lambda v1, v2: v1 + v2) \
  .saveAsTextFile(output)
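The pipeline above can be traced in plain Python (a sketch of the same flatMap → map → reduceByKey data flow; it shows what each step produces, not Spark's actual distributed execution):

```python
from collections import defaultdict

lines = ["to be or not", "to be"]

# flatMap: split each line into words, flattening the results into one list
words = [w for s in lines for w in s.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each distinct word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```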
6
Q
Spark and Hadoop
A
Spark was created to complement, not replace, Hadoop
7
Q
HDFS Basic Concepts
A
- HDFS is a file system written in Java
- Based on Google’s GFS
- Sits on top of a native filesystem
- Such as ext3, ext4, or xfs
- Provides redundant storage for massive amounts of data
- Using readily available, industry-standard computers
8
Q
HDFS Basic Concept 2
A
- HDFS performs best with a ‘modest’ number of large files
- Millions, rather than billions, of files
- Each file typically 100MB or more
- Files in HDFS are ‘write once’
- No random writes to files are allowed
- HDFS is optimized for large, streaming reads of files
- Rather than random reads
9
Q
What is YARN?
A
- YARN = Yet Another Resource Negotiator
- YARN is the Hadoop processing layer that contains:
- A resource manager
- A job scheduler
- YARN allows multiple data processing engines to run on a single Hadoop cluster
- Batch programs (e.g. SPARK, MapReduce)
- Interactive SQL (e.g. Impala)
- Advanced analytics (e.g. Spark, Impala)
- Streaming (e.g. Spark Streaming)
10
Q
YARN Daemons
A
- Resource Manager (RM)
- Runs on master node
- Global resource scheduler
- Arbitrates system resources between competing applications
- Has a pluggable scheduler to support different algorithms (capacity, fair scheduler, etc)
- Node Manager (NM)
- Runs on slave nodes
- Communicates with RM
11
Q
Running an application on YARN
A
- Containers
- Created by the RM upon request
- Allocate a certain amount of resources (memory, CPU) on a slave node
- Applications run in one or more containers
- Application Master (AM)
- One per application
- Framework/application specific
- Runs in a container
- Requests more containers to run application tasks
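Submitting a Spark application to a YARN cluster might look like the sketch below (the script name, paths, and resource sizes are placeholder assumptions):

```shell
# Run the driver in a YARN container (cluster mode); the AM requests
# executor containers from the RM, and NMs launch them on slave nodes.
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 4 \
    --executor-memory 2g \
    --executor-cores 2 \
    wordcount.py hdfs:///user/data/input hdfs:///user/data/output
```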
12
Q
Advantages of Spark
A
- Distributed processing and cluster computing
- Application processes are distributed across a cluster of worker nodes
- Works with distributed storage
- Supports data locality
- Data in memory
- Configurable persistence for efficient iteration
- High-level programming framework
- Programmers can focus on logic, not plumbing
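The "data in memory" advantage can be illustrated in plain Python: without caching, a result that is used twice gets recomputed twice; caching keeps the intermediate result in memory for reuse. (In Spark itself this is `rdd.cache()` / `rdd.persist()`; the names below are illustrative, not Spark's API.)

```python
# Counter tracks how many times the expensive step actually runs
compute_calls = 0

def expensive_transform(data):
    global compute_calls
    compute_calls += 1
    return [x * x for x in data]

data = range(5)

# Without caching: two downstream uses recompute the transform twice
total = sum(expensive_transform(data))
count = len(expensive_transform(data))
assert compute_calls == 2

# With caching: compute once, then reuse the in-memory result
cached = expensive_transform(data)   # analogous to rdd.cache()
total = sum(cached)
count = len(cached)
assert compute_calls == 3  # only one additional computation for both uses
```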
13
Q
Essential Points
A
- Traditional large-scale computing involved complex processing on small amounts of data.
- Exponential growth in data drove development of distributed computing.
- Distributed computing is difficult!
- Spark addresses big data distributed computing challenges.
- Bring the computation to the data
- Fault tolerance
- Scalability
- Hides the ‘plumbing’ so developers can focus on the data
- Caches data in memory