Week 10: Big Data Framework Flashcards
What is scaling up?
- Using more powerful processors and more memory
- Data architecture does not significantly change
Is scaling up a short or long term fix?
Short term fix
What is scaling out?
- Adding servers for parallel computing
- Using lots of small machines in a cluster
Is scaling out a short or long term fix?
Long term, as more servers can be added when needed
When is scaling up a good option?
- when the data has strong internal cross-references
- when there is a need for transactional integrity
Name a batch only framework
Apache Hadoop
Name 2 stream only frameworks
- Apache Storm
- Apache Samza
Name 2 hybrid frameworks
- Apache Spark
- Apache Flink
Datasets in batch processing are typically:
- bounded: finite collection of data
- persistent: data is backed by permanent storage
- large: often the only practical option for very large datasets
Batch processing is well suited for
- calculations where access to all data is required, e.g. averages or totals (see the sketch below)
- tasks that require large volumes of data
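A minimal PySpark sketch of a batch-style computation, assuming a hypothetical HDFS input path: the whole bounded dataset is read from storage, and a total and average are computed in one job.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "BatchAverage")

# Bounded, persistent dataset read in full from storage (hypothetical path)
values = sc.textFile("hdfs:///data/measurements.txt").map(float)

# Totals and averages need access to all of the data, which suits batch processing
total = values.sum()
average = total / values.count()
print("total:", total, "average:", average)

sc.stop()
```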
When is batch processing not appropriate?
When processing time is important
Batch processing involves
- operating over a large, static dataset
- returning results at a later time, once the computation is complete
Stream processing systems
- compute over data as it enters the system
Datasets in stream processing are considered:
unbounded (data arrives continuously, with no defined end)
In stream processing, what does the total dataset refer to?
The total amount of data that has entered the system so far
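A minimal Spark Structured Streaming sketch of computing over data as it enters the system, assuming a hypothetical socket source on localhost:9999: the "complete" output mode maintains a running word count over all data that has entered the system so far, i.e. over the unbounded total dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Unbounded source: each line arriving on the socket becomes a new row (hypothetical host/port)
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Running word count over all data that has entered the system so far
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# "complete" mode re-emits the full result table after each micro-batch
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```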
Hybrid frameworks attempt to offer
a general solution for data processing
For Apache Hadoop describe the following:
- Execution Model
- Latency
- Programming Language
- Fault Tolerance
- Execution Model: batch processing using disk storage
- Latency: high latency
- Programming Language: Java
- Fault Tolerance: replication
For Apache Spark describe the following:
- Execution Model
- Latency
- Programming Language
- Execution Model: batch and stream processing using memory or disk storage
- Latency: low latency for small micro-batch sizes
- Programming Language: Scala, Python, Java, R
Apache Spark: benefits of in-memory processing
- Runs up to 100x faster than traditional MapReduce when data fits in memory
- Runs up to 10x faster than traditional MapReduce when using disk
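A minimal PySpark sketch of the in-memory benefit, assuming a hypothetical log file on HDFS: caching an RDD lets later actions reuse the in-memory partitions instead of re-reading and re-deserialising the data from disk.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "InMemoryDemo")

logs = sc.textFile("hdfs:///data/app.log")                    # hypothetical path
errors = logs.filter(lambda line: "ERROR" in line).cache()    # keep this RDD in memory

# First action reads from disk and materialises the cached partitions
print(errors.count())

# Later actions reuse the in-memory data instead of re-reading from disk
print(errors.filter(lambda line: "timeout" in line).count())

sc.stop()
```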
Data sharing is slow in MapReduce due to which 3 factors?
- replication
- serialization
- disk IO
Ways to create RDDs (2)
- parallelizing an existing collection in your driver program
- referencing a dataset in an external storage system such as HDFS (see the sketch below)
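A minimal PySpark sketch showing both ways of creating an RDD; the HDFS path is a hypothetical placeholder.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "CreateRDDs")

# 1. Parallelize an existing collection in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2. Reference a dataset in an external storage system such as HDFS (hypothetical path)
lines = sc.textFile("hdfs:///data/input.txt")

print(numbers.count(), lines.count())

sc.stop()
```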