Week 10: Big Data Framework Flashcards
What is scaling up?
- Upgrading the existing machine with more powerful processors and more memory
- Data architecture does not significantly change
Is scaling up a short-term or long-term fix?
Short-term fix
What is scaling out?
- Adding servers for parallel computing
- Using lots of small machines in a cluster
Is scaling out a short-term or long-term fix?
Long-term, as more servers can be added when needed
When is scaling up a good option?
- the data has strong internal cross-references
- there is a need for transactional integrity
Name a batch-only framework
Apache Hadoop
Name 2 stream-only frameworks
- Apache Storm
- Apache Samza
Name 2 hybrid frameworks
- Apache Spark
- Apache Flink
Datasets in batch processing are typically:
- bounded: finite collection of data
- persistent: data is backed by permanent storage
- large: often the only practical option for very large datasets
Batch processing is well suited for
- calculations where access to all data is required (e.g. averages or totals; see the sketch below)
- tasks that require large volumes of data
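A minimal PySpark sketch of the average case above, assuming a hypothetical sales.csv file with an amount column (both names are illustrative): the average cannot be produced until every row has been read, which is why it suits batch processing.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

# Start a Spark session (assumes PySpark is available locally).
spark = SparkSession.builder.appName("batch-average").getOrCreate()

# Bounded, persistent dataset: a hypothetical CSV file with an "amount" column.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# The average needs access to all rows, so the whole dataset is read
# before a single result is returned.
sales.agg(avg("amount").alias("average_amount")).show()

spark.stop()
```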
Batch processing is not appropriate when?
Processing time is important (results are needed quickly)
Batch processing involves
- operating over a large, static dataset
- returning results at a later time, once the computation is complete
Stream processing systems
- compute over data as it enters the system
Datasets in stream processing are considered:
Unbounded (new data keeps entering the system)
In stream processing, what does the total dataset refer to?
The total amount of data that has entered the system so far
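A minimal plain-Python sketch of that idea, using a hypothetical event_stream() generator as the unbounded source: each event is processed as it arrives, and the running totals only ever cover the data seen so far.

```python
import random
import time

def event_stream():
    """Hypothetical unbounded source: yields one event at a time, forever."""
    while True:
        yield random.randint(1, 100)
        time.sleep(0.1)

# Stream processing: compute over each item as it enters the system.
# At any moment the "total dataset" is just what has arrived so far.
count = 0
running_total = 0
for value in event_stream():
    count += 1
    running_total += value
    print(f"events so far: {count}, running total: {running_total}")
    if count >= 10:  # stop the demo; a real stream has no defined end
        break
```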