Week 10: Big Data Framework Flashcards
What is scaling up?
- Using more powerful processors and more memory
- Data architecture does not significantly change
Is scaling up a short or long term fix?
Short term fix
What is scaling out?
- Adding servers for parallel computing
- Using lots of small machines in a cluster
Is scaling out a short or long term fix?
Long term, as more servers can be added when needed
When is scaling up a good option?
- when the data has strong internal cross-references
- when there is a need for transactional integrity
Name a batch only framework
Apache Hadoop
Name 2 stream only frameworks
- Apache Storm
- Apache Samza
Name 2 hybrid frameworks
- Apache Spark
- Apache Flink
Datasets in batch processing are typically:
- bounded: finite collection of data
- persistent: data is backed by permanent storage
- large: often the only practical option for very large datasets
Batch processing is well suited for
- calculations where access to all data is required, e.g. averages or totals (see the sketch below)
- tasks that require large volumes of data
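A minimal PySpark sketch of a batch-style computation, assuming a hypothetical HDFS input path: the whole bounded dataset is read from storage, and a total and average are computed in one job.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "BatchAverage")

# Bounded, persistent dataset read in full from storage (hypothetical path)
values = sc.textFile("hdfs:///data/measurements.txt").map(float)

# Totals and averages need access to all of the data, which suits batch processing
total = values.sum()
average = total / values.count()
print("total:", total, "average:", average)

sc.stop()
```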
When is batch processing not appropriate?
When processing time is important
Batch processing involves
- operating over a large, static dataset
- returning results at a later time, once the computation is complete
Stream processing systems
- compute over data as it enters the system
Datasets in stream processing are considered:
unbounded (data arrives continuously, with no defined end)
In stream processing, what does the total dataset refer to?
The total amount of data that has entered the system so far
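A minimal Spark Structured Streaming sketch of computing over data as it enters the system, assuming a hypothetical socket source on localhost:9999: the "complete" output mode maintains a running word count over all data that has entered the system so far, i.e. over the unbounded total dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Unbounded source: each line arriving on the socket becomes a new row (hypothetical host/port)
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Running word count over all data that has entered the system so far
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# "complete" mode re-emits the full result table after each micro-batch
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```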
Hybrid frameworks attempt to offer
a general solution for data processing
For Apache Hadoop describe the following:
- Execution Model
- Latency
- Programming Language
- Fault Tolerance
- Execution Model: batch processing using disk storage
- Latency: high latency
- Programming Language: Java
- Fault Tolerance: replication
For Apache Spark describe the following:
- Execution Model
- Latency
- Programming Language
- Execution Model: batch and stream processing using memory or disk storage
- Latency: low latency for small micro-batch sizes
- Programming Language: Scala, Python, Java, R
Apache Spark: benefits of in-memory processing
- Runs up to 100x faster than traditional MapReduce when data fits in memory
- Runs up to 10x faster than traditional MapReduce when using disk
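A minimal PySpark sketch of the in-memory benefit, assuming a hypothetical log file on HDFS: caching an RDD lets later actions reuse the in-memory partitions instead of re-reading and re-deserialising the data from disk.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "InMemoryDemo")

logs = sc.textFile("hdfs:///data/app.log")                    # hypothetical path
errors = logs.filter(lambda line: "ERROR" in line).cache()    # keep this RDD in memory

# First action reads from disk and materialises the cached partitions
print(errors.count())

# Later actions reuse the in-memory data instead of re-reading from disk
print(errors.filter(lambda line: "timeout" in line).count())

sc.stop()
```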
Data sharing is slow in MapReduce due to which 3 factors?
- replication
- serialization
- disk IO
Ways to create RDDs (2)
- parallelizing an existing collection in your driver program
- referencing a dataset in an external storage system such as HDFS (see the sketch below)
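A minimal PySpark sketch showing both ways of creating an RDD; the HDFS path is a hypothetical placeholder.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "CreateRDDs")

# 1. Parallelize an existing collection in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2. Reference a dataset in an external storage system such as HDFS (hypothetical path)
lines = sc.textFile("hdfs:///data/input.txt")

print(numbers.count(), lines.count())

sc.stop()
```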