Introduction to Spark Flashcards

1
Q

What is Spark?

A

Apache Spark is a fast, in-memory, general-purpose computing engine for large-scale data processing on a cluster.
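
A minimal sketch, in PySpark (the API used later in this deck), of what "computing engine" means here; the app name and toy dataset are illustrative assumptions, not from the source:

from pyspark import SparkContext

# Hypothetical minimal driver program: distribute a small dataset across the
# cluster and run a computation on it in memory.
sc = SparkContext(appName="IntroExample")
numbers = sc.parallelize(range(1, 1001))        # distributed dataset (RDD)
total = numbers.map(lambda x: x * x).sum()      # the work runs on the cluster
print(total)
sc.stop()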

2
Q

About Spark

A
  • Large-scale data processing engine
  • General purpose
  • Runs on a Hadoop cluster and uses HDFS for storage
3
Q

Supports a wide range of workloads

A
  • Machine learning
  • Business intelligence
  • Streaming (see the sketch after this list)
  • Batch Processing
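
To illustrate one of these workloads, a minimal Spark Streaming word-count sketch follows; the socket source (localhost:9999), the 10-second batch interval, and the existing SparkContext sc are assumptions for the example:

from pyspark.streaming import StreamingContext

# Hypothetical sketch: count words arriving on a text socket in 10-second batches.
ssc = StreamingContext(sc, 10)                      # 'sc' is an existing SparkContext
lines = ssc.socketTextStream("localhost", 9999)     # placeholder streaming source
counts = lines.flatMap(lambda s: s.split()) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()                                     # print each batch's word counts
ssc.start()
ssc.awaitTermination()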
4
Q

Distributed Data Processing Framework

A
  • Processing
    • Hadoop MapReduce
    • Spark API
  • Resource Management
    • YARN (recommended by Cloudera/Databricks; see the sketch after this list)
  • Storage
    • HDFS
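
A sketch of how these three layers meet in one application: the Spark API does the processing, YARN manages the resources, and HDFS holds the data. The app name and HDFS path are placeholders, and the master is normally supplied via spark-submit rather than hard-coded; this assumes the Hadoop client configuration is available:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("StackExample")    # processing layer: the Spark API (placeholder name)
        .setMaster("yarn"))            # resource management layer: YARN
sc = SparkContext(conf=conf)

data = sc.textFile("hdfs:///user/example/input.txt")    # storage layer: HDFS (placeholder path)
print(data.count())
sc.stop()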
5
Q

Big Data Processing with Spark

A
  • Hadoop is based on two key concepts
    • Distribute the data when it is stored
    • Run the computation where the data is
  • Spark adds
    • In-memory data processing for faster execution
    • An execution plan to organize the work
    • A high-level API (used in the word-count example below)

# Word count with the high-level API: read a file, split each line into words,
# pair each word with a count of one, sum the counts per word, save the result.
sc.textFile(file) \
  .flatMap(lambda s: s.split()) \
  .map(lambda w: (w, 1)) \
  .reduceByKey(lambda v1, v2: v1 + v2) \
  .saveAsTextFile(output)

6
Q

Spark and Hadoop

A

Spark was created to complement, not replace, Hadoop

7
Q

HDFS Basic Concepts

A
  • HDFS is a file system written in Java
    • Based on Google’s GFS
  • Sits on top of a native filesystem
    • Such as ext3, ext4, or xfs
  • Provides redundant storage for massive amounts of data
    • Using readily available, industry-standard computers
8
Q

HDFS Basic Concepts 2

A
  • HDFS performs best with a ‘modest’ number of large files
    • Millions, rather than billions, of files
    • Each file typically 100MB or more
  • Files in HDFS are ‘write once’
    • No random writes to files are allowed
  • HDFS is optimized for large, streaming reads of files
    • Rather than random reads
9
Q

What is YARN?

A
  • YARN = Yet Another Resource Negotiator
  • YARN is Hadoop’s processing layer, which contains:
    • A resource manager
    • A job scheduler
  • YARN allows multiple data processing engines to run on a single Hadoop cluster
    • Batch programs (e.g. Spark, MapReduce)
    • Interactive SQL (e.g. Impala)
    • Advanced analytics (e.g. Spark, Impala)
    • Streaming (e.g. Spark Streaming)
10
Q

YARN Daemons

A
  • Resource Manager (RM)
    • Runs on master node
    • Global resource scheduler
    • Arbitrates system resources between competing applications
    • Has a pluggable scheduler to support different algorithms (capacity scheduler, fair scheduler, etc.)
  • Node Manager (NM)
    • Runs on slave nodes
    • Communicates with RM
11
Q

Running an application on YARN

A
  • Containers
    • Created by the RM upon request
    • Allocate a certain amount of resources (memory, CPU) on a slave node (see the sizing sketch after this list)
    • Applications run in one or more containers
  • Application Master (AM)
    • One per application
    • Framework/application specific
    • Runs in a container
    • Requests more containers to run application tasks
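
A hedged sketch of how executor settings translate into YARN container requests: each executor runs in a container, and the Application Master asks the Resource Manager for containers of this size. The app name and the numbers are arbitrary examples, not values from the source:

from pyspark import SparkConf, SparkContext

# Each executor runs inside a YARN container; these settings determine how many
# containers the AM requests and how much memory/CPU each one gets.
conf = (SparkConf()
        .setAppName("YarnContainersExample")      # placeholder app name
        .set("spark.executor.instances", "4")     # number of executor containers
        .set("spark.executor.memory", "2g")       # memory per executor container
        .set("spark.executor.cores", "2"))        # CPU cores per executor container
sc = SparkContext(conf=conf)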
12
Q

Advantages of Spark

A
  • Distributed processing and cluster computing
    • Application processes are distributed across a cluster of worker nodes
  • Works with distributed storage
    • Supports data locality
  • Data in memory (see the sketch after this list)
    • Configurable persistence for efficient iteration
  • High-level programming framework
    • Programmers can focus on logic, not plumbing
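
A minimal sketch of the in-memory persistence point above, assuming an existing SparkContext sc; the input path is a placeholder:

from pyspark import StorageLevel

logs = sc.textFile("hdfs:///user/example/logs")     # placeholder input path
errors = logs.filter(lambda line: "ERROR" in line)
errors.persist(StorageLevel.MEMORY_ONLY)            # configurable persistence level

print(errors.count())   # first action computes the data and caches it in memory
print(errors.count())   # later actions (e.g. iterative passes) reuse the cached copy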
13
Q

Essential Points

A
  • Traditional large-scale computing involved complex processing on small amounts of data.
  • Exponential growth in data drove development of distributed computing.
  • Distributed computing is difficult!
  • Spark addresses big data distributed computing challenges.
    • Bring the computation to the data
    • Fault tolerance
    • Scalability
    • Hides the ‘plumbing’ so developers can focus on the data
    • Caches data in memory