Week 10: Big Data Framework Flashcards

1
Q

What is scaling up?

A
  • Using more powerful processors and more memory
  • Data architecture does not significantly change
2
Q

Is scaling up a short or long term fix?

A

Short term fix

3
Q

What is scaling out?

A

Adding servers for parallel computing

Using lots of small machines in a cluster
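
As a rough illustration of scaling out (not any framework's API), the sketch below splits a dataset into partitions and sums each partition on its own worker thread, with the threads standing in for the small machines of a cluster. The names `partition` and `parallel_sum` are hypothetical and exist only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into at most n_parts roughly equal slices."""
    size = max(1, -(-len(data) // n_parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, n_workers=4):
    """Sum each partition on its own worker, then combine the partial sums."""
    parts = partition(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, parts))
    return sum(partial_sums)
```

Raising `n_workers` here plays the role of adding servers: the work per worker shrinks while the combining step stays the same.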

4
Q

Is scaling out a short or long term fix?

A

Long term, as more servers can be added when needed

5
Q

When is scaling up a good option?

A
  • strong internal cross-references
  • need for transactional integrity
6
Q

Name a batch only framework

A

Apache Hadoop

7
Q

Name 2 stream only frameworks

A
  • Apache Storm
  • Apache Samza
8
Q

Name 2 hybrid frameworks

A
  • Apache Spark
  • Apache Flink
9
Q

Datasets in batch processing are typically:

A
  • bounded: finite collection of data
  • persistent: data is backed by permanent storage
  • large: batch processing is often the only practical option for very large datasets
10
Q

Batch processing is well suited for

A
  • calculations where access to all data is required (eg: averages / totals)
  • tasks that require large volumes of data
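
A minimal sketch of why an average suits batch processing: the function below cannot return until it has seen every record in the bounded dataset. The name `batch_average` is hypothetical.

```python
def batch_average(records):
    """Average over a bounded dataset; needs every record before returning."""
    records = list(records)  # the whole batch must be available up front
    if not records:
        raise ValueError("empty batch")
    return sum(records) / len(records)
```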
11
Q

Batch processing is not appropriate when?

A

Processing time is important

12
Q

Batch processing involves

A
  • operating over a large, static dataset
  • returning results at a later time, once computation is complete
13
Q

Stream processing systems

A
  • compute over data as it enters the system
14
Q

Datasets in stream processing are considered:

A

unbounded

15
Q

In stream processing, what does the total dataset refer to?

A

the total amount of data that has entered the system so far
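
To make the contrast with batch processing concrete, here is a small sketch of computing over data as it enters the system. After each record it reports a count, total, and mean for everything seen so far, which is the "total dataset" at that moment. The name `running_totals` is hypothetical.

```python
def running_totals(stream):
    """Yield (count, total, mean) after each record arrives."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield count, total, total / count
```

Because it is a generator, it never needs the stream to end, so it also works on unbounded input.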

16
Q

Hybrid frameworks attempt to offer

A

a general solution for data processing

17
Q

For Apache Hadoop describe the following:

  1. Execution Model
  2. Latency
  3. Programming Language
  4. Fault Tolerance
A
  1. Batch processing using disk storage
  2. High latency
  3. Java
  4. Replication
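
The replication idea can be sketched in a few lines. This is a simplified model, not HDFS's real placement policy (which is rack-aware): each block is copied to several distinct nodes round-robin, so a block stays readable while at least one holder is alive. The names `place_blocks` and `readable_blocks` are hypothetical; the default of 3 copies matches HDFS's default replication factor.

```python
def place_blocks(blocks, nodes, replicas=3):
    """Assign each block to `replicas` distinct nodes, round-robin.

    Assumes replicas <= len(nodes) so the copies land on different nodes.
    """
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replicas)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]
```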
18
Q

For Apache Spark describe the following:

  1. Execution Model
  2. Latency
  3. Programming Language
A
  1. Batch and stream processing using memory or disk storage
  2. Low latency when the micro-batch size is small
  3. Scala, Python, Java, R
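
The micro-batch idea can be sketched in plain Python (this is not Spark's API). The hypothetical `micro_batches` helper groups an incoming stream into small lists that are each processed as a tiny batch. As a simplifying assumption it batches by record count, whereas Spark's streaming engine batches by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an incoming stream into lists of at most batch_size records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

A smaller `batch_size` means each record waits for fewer neighbours before it is processed, which is where the lower latency comes from.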
19
Q

What are the benefits of in-memory processing in Apache Spark?

A

Runs up to 100x faster than traditional MapReduce when processing in memory

Runs up to 10x faster than traditional MapReduce even when it uses disk

20
Q

Data sharing is slow in MapReduce due to (3 reasons)

A
  1. replication
  2. serialization
  3. disk IO
21
Q

Name 2 ways to create RDDs

A
  1. parallelizing an existing collection in your driver program
  2. referencing a dataset in an external storage system (e.g. HDFS)