Chapter 6 Flashcards
Parallel data processing
simultaneous execution of multiple sub-tasks that
collectively comprise a larger task. The goal is to reduce the execution time by dividing a
single larger task into multiple smaller tasks that run concurrently
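A minimal sketch of this idea in Python, assuming the "larger task" is summing a large list of numbers; the chunking helper, worker count, and input data are invented for illustration only:

```python
from concurrent.futures import ProcessPoolExecutor

def sub_task(chunk):
    # Each sub-task works on its own slice of the data.
    return sum(chunk)

def parallel_sum(numbers, workers=4):
    # Divide the single larger task into multiple smaller tasks...
    chunk_size = max(1, len(numbers) // workers)
    chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]
    # ...and run the sub-tasks concurrently on separate processes.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sub_task, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_sums)

if __name__ == "__main__":
    print(parallel_sum(list(range(1_000_000))))
```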
Distributed Data Processing
closely related to parallel data processing in that the same "divide-and-conquer" principle is applied;
however, distributed data processing is
always achieved through physically separate machines that are networked together as a cluster
Hadoop
an open-source framework for large-scale data storage and data processing that is compatible with commodity hardware
processing workload in Big Data
defined as the amount and nature of data that is
processed within a certain amount of time
Workloads are usually divided into two types:
- batch
- transactional
Batch processing
involves processing data in batches
and usually imposes delays, which in turn results in high-latency responses. Batch
workloads typically involve large quantities of data with sequential reads/writes and
comprise groups of read or write queries.
Queries can be complex and involve multiple joins. Example: OLAP systems
Transactional processing
also known as online processing; data is processed interactively without delay,
resulting in low-latency responses. Transactional workloads involve small amounts of
data with random reads and writes. Example: OLTP and operational systems
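A small illustration of the two workload types, assuming a hypothetical SQLite database with orders and customers tables; the schema and queries are invented purely for contrast, not taken from any particular system:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")
conn.execute("INSERT INTO customers (id, region) VALUES (1, 'EU')")

# Transactional (OLTP-style) workload: a small amount of data,
# a random write, low-latency response expected.
conn.execute("INSERT INTO orders (customer_id, amount) VALUES (?, ?)", (1, 9.99))
conn.commit()

# Batch (OLAP-style) workload: scans large quantities of data sequentially,
# uses a join and aggregation, and can tolerate a high-latency response.
report = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
""").fetchall()
print(report)
```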
clusters
provide support to create horizontally scalable storage solutions
benefit of clusters
- comprised of low-cost commodity nodes that collectively provide increased processing capacity
- they provide inherent redundancy and fault tolerance, as they consist of physically separate nodes
MapReduce
a batch-oriented processing framework that divides a big problem into a collection of
smaller problems that can each be solved quickly; the individual results are then
combined into the final result
Map tasks
- map
- combine (optional)
- partition
Reduce tasks
- shuffle and sort
- reduce
(all five stages are illustrated in the sketch below)
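A single-process sketch of these stages for a word-count job, written in plain Python purely for illustration (a real MapReduce engine such as Hadoop distributes each stage across the cluster); the function names, toy input splits, and reducer count are all invented:

```python
from collections import defaultdict

# --- Map tasks ---

def map_fn(record):
    # map: emit an intermediate (key, value) pair per word.
    for word in record.split():
        yield (word.lower(), 1)

def combine_fn(pairs):
    # combine (optional): pre-aggregate values locally to cut shuffle traffic.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return local.items()

def partition_fn(key, num_reducers):
    # partition: decide which reducer receives each key.
    return hash(key) % num_reducers

# --- Reduce tasks ---

def shuffle_and_sort(partitioned_pairs):
    # shuffle and sort: group all values belonging to the same key.
    grouped = defaultdict(list)
    for key, value in partitioned_pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_fn(key, values):
    # reduce: produce one output value per key.
    return key, sum(values)

# Toy driver over two "splits" of input data.
splits = ["big data big ideas", "data beats opinions"]
num_reducers = 2
partitions = [[] for _ in range(num_reducers)]

for split in splits:
    mapped = list(map_fn(split))                 # each split treated as one record here
    for key, value in combine_fn(mapped):
        partitions[partition_fn(key, num_reducers)].append((key, value))

for reducer_input in partitions:
    for key, values in shuffle_and_sort(reducer_input):
        print(reduce_fn(key, values))
```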
map stage
during which the dataset file is divided
into multiple smaller splits. Each split is parsed into its constituent records as key-value
pairs. The key is usually the ordinal position of the record, and the value is the actual
record.
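A small sketch of that splitting step, assuming a line-oriented text file; the file name, split size, and helper name are placeholders for illustration:

```python
def read_splits(path, records_per_split=1000):
    """Divide a dataset file into splits of (key, value) pairs.

    The key is the record's ordinal position in the file and the value is the
    record itself, matching the description above.
    """
    split, position = [], 0
    with open(path) as f:
        for line in f:
            split.append((position, line.rstrip("\n")))  # (key, value) pair
            position += 1
            if len(split) == records_per_split:
                yield split          # hand a full split to a map task
                split = []
    if split:
        yield split                  # final, possibly smaller, split

# Example usage (assumes a local text file named "dataset.txt"):
# for split in read_splits("dataset.txt"):
#     for key, value in split:
#         ...  # feed each (key, value) pair to the map function
```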
Processing in Realtime Mode
data is processed in-memory as it is captured before being persisted to the disk.
- Response time generally ranges from a sub-second to under a minute.
- Realtime mode addresses the velocity characteristic of Big Data datasets.
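A toy sketch of realtime-mode processing, assuming events arrive on an in-process queue; the event source, the in-memory aggregate, and the persistence file are all invented for illustration:

```python
import json
import queue

events = queue.Queue()          # stand-in for a live event stream
running_totals = {}             # in-memory state, updated as events arrive

def process_realtime(persist_path="totals.json"):
    """Process each event in memory as it is captured, then persist to disk."""
    while True:
        event = events.get()
        if event is None:       # sentinel: stream finished
            break
        # Low-latency, in-memory update happens immediately per event.
        sensor = event["sensor"]
        running_totals[sensor] = running_totals.get(sensor, 0) + event["value"]
    # Persistence to disk happens after (or periodically alongside) processing.
    with open(persist_path, "w") as f:
        json.dump(running_totals, f)

# Example usage with a few synthetic events:
for i in range(5):
    events.put({"sensor": "s1", "value": i})
events.put(None)
process_realtime()
print(running_totals)
```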
Whereas the CAP theorem is primarily related to _____, the SCV principle is related to ______
distributed data storage; distributed data processing.