Introduction to Spark Flashcards

1
Q

What is Spark?

A

Apache Spark is a fast, in-memory, general-purpose computing engine for large-scale data processing on a cluster.
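
A minimal sketch, in PySpark (the API used later in this deck), of what "computing engine" means here; the app name and toy dataset are illustrative assumptions, not from the source:

from pyspark import SparkContext

# Hypothetical minimal driver program: distribute a small dataset across the
# cluster and run a computation on it in memory.
sc = SparkContext(appName="IntroExample")
numbers = sc.parallelize(range(1, 1001))        # distributed dataset (RDD)
total = numbers.map(lambda x: x * x).sum()      # the work runs on the cluster
print(total)
sc.stop()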

2
Q

About Spark

A
  • Large-scale data processing engine
  • General purpose
  • Runs on a Hadoop cluster and uses HDFS for storage
3
Q

Supports a wide range of workloads

A
  • Machine learning
  • Business intelligence
  • Streaming (see the sketch after this list)
  • Batch Processing
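
To illustrate one of these workloads, a minimal Spark Streaming word-count sketch follows; the socket source (localhost:9999), the 10-second batch interval, and the existing SparkContext sc are assumptions for the example:

from pyspark.streaming import StreamingContext

# Hypothetical sketch: count words arriving on a text socket in 10-second batches.
ssc = StreamingContext(sc, 10)                      # 'sc' is an existing SparkContext
lines = ssc.socketTextStream("localhost", 9999)     # placeholder streaming source
counts = lines.flatMap(lambda s: s.split()) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()                                     # print each batch's word counts
ssc.start()
ssc.awaitTermination()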
4
Q

Distributed Data Processing Framework

A
  • Processing
    • Hadoop MapReduce
    • Spark API
  • Resource Management
    • YARN (recommended by Cloudera/Databricks; see the sketch after this list)
  • Storage
    • HDFS
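
A sketch of how these three layers meet in one application: the Spark API does the processing, YARN manages the resources, and HDFS holds the data. The app name and HDFS path are placeholders, and the master is normally supplied via spark-submit rather than hard-coded; this assumes the Hadoop client configuration is available:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("StackExample")    # processing layer: the Spark API (placeholder name)
        .setMaster("yarn"))            # resource management layer: YARN
sc = SparkContext(conf=conf)

data = sc.textFile("hdfs:///user/example/input.txt")    # storage layer: HDFS (placeholder path)
print(data.count())
sc.stop()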
5
Q

Big Data Processing with Spark

A
  • Hadoop is based on two key concepts
    • Distribute the data when it is stored
    • Run the computation where the data is
  • Spark adds
    • In-memory data processing for faster execution
    • An execution plan to organize the work
    • A high-level API (used in the word-count example below)

# Word count with the high-level API: read a file, split each line into words,
# pair each word with a count of one, sum the counts per word, save the result.
sc.textFile(file) \
  .flatMap(lambda s: s.split()) \
  .map(lambda w: (w, 1)) \
  .reduceByKey(lambda v1, v2: v1 + v2) \
  .saveAsTextFile(output)

6
Q

Spark and Hadoop

A

Spark was created to complement, not replace, Hadoop

7
Q

HDFS Basic Concepts

A
  • HDFS is a file system written in Java
    • Based on Google’s GFS
  • Sits on top of a native filesystem
    • Such as ext3, ext4, or xfs
  • Provides redundant storage for massive amounts of data
    • Using readily available, industry-standard computers
8
Q

HDFS Basic Concepts 2

A
  • HDFS performs best with a ‘modest’ number of large files
    • Millions, rather than billions, of files
    • Each file typically 100MB or more
  • Files in HDFS are ‘write once’
    • No random writes to files are allowed
  • HDFS is optimized for large, streaming reads of files
    • Rather than random reads
9
Q

What is YARN?

A
  • YARN = Yet Another Resource Negotiator
  • YARN is Hadoop’s processing layer, which contains:
    • A resource manager
    • A job scheduler
  • YARN allows multiple data processing engines to run on a single Hadoop cluster
    • Batch programs (e.g. Spark, MapReduce)
    • Interactive SQL (e.g. Impala)
    • Advanced analytics (e.g. Spark, Impala)
    • Streaming (e.g. Spark Streaming)
10
Q

YARN Daemons

A
  • Resource Manager (RM)
    • Runs on master node
    • Global resource scheduler
    • Arbitrates system resources between competing applications
    • Has a pluggable scheduler to support different algorithms (capacity scheduler, fair scheduler, etc.)
  • Node Manager (NM)
    • Runs on slave nodes
    • Communicates with RM
11
Q

Running an application on YARN

A
  • Containers
    • Created by the RM upon request
    • Allocate a certain amount of resources (memory, CPU) on a slave node (see the sizing sketch after this list)
    • Applications run in one or more containers
  • Application Master (AM)
    • One per application
    • Framework/application specific
    • Runs in a container
    • Requests more containers to run application tasks
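
A hedged sketch of how executor settings translate into YARN container requests: each executor runs in a container, and the Application Master asks the Resource Manager for containers of this size. The app name and the numbers are arbitrary examples, not values from the source:

from pyspark import SparkConf, SparkContext

# Each executor runs inside a YARN container; these settings determine how many
# containers the AM requests and how much memory/CPU each one gets.
conf = (SparkConf()
        .setAppName("YarnContainersExample")      # placeholder app name
        .set("spark.executor.instances", "4")     # number of executor containers
        .set("spark.executor.memory", "2g")       # memory per executor container
        .set("spark.executor.cores", "2"))        # CPU cores per executor container
sc = SparkContext(conf=conf)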
12
Q

Advantages of Spark

A
  • Distributed processing and cluster computing
    • Application processes are distributed across a cluster of worker nodes
  • Works with distributed storage
    • Supports data locality
  • Data in memory (see the sketch after this list)
    • Configurable persistence for efficient iteration
  • High-level programming framework
    • Programmers can focus on logic, not plumbing
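
A minimal sketch of the in-memory persistence point above, assuming an existing SparkContext sc; the input path is a placeholder:

from pyspark import StorageLevel

logs = sc.textFile("hdfs:///user/example/logs")     # placeholder input path
errors = logs.filter(lambda line: "ERROR" in line)
errors.persist(StorageLevel.MEMORY_ONLY)            # configurable persistence level

print(errors.count())   # first action computes the data and caches it in memory
print(errors.count())   # later actions (e.g. iterative passes) reuse the cached copy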
13
Q

Essential Points

A
  • Traditional large-scale computing involved complex processing on small amounts of data.
  • Exponential growth in data drove development of distributed computing.
  • Distributed computing is difficult!
  • Spark addresses big data distributed computing challenges.
    • Bring the computation to the data
    • Fault tolerance
    • Scalability
    • Hides the ‘plumbing’ so developers can focus on the data
    • Caches data in memory