Chapter 6 Flashcards
Parallel data processing
simultaneous execution of multiple sub-tasks that
collectively comprise a larger task. The goal is to reduce the execution time by dividing a
single larger task into multiple smaller tasks that run concurrently
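The divide-and-conquer idea can be sketched in a few lines of Python (an illustrative example only; thread workers stand in for the parallel processors):

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # one sub-task of the larger "square everything" task
    return n * n

numbers = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    # the larger task is divided into sub-tasks that run concurrently
    results = list(pool.map(square, numbers))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```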
Distributed Data Processing
closely related to parallel data processing in that the same principle of “divide-and-conquer” is applied
distributed data processing is
always achieved through physically separate machines that are networked together as a cluster
Hadoop
an open-source framework for large-scale data storage and data processing that is compatible with commodity hardware
processing workload in Big Data
defined as the amount and nature of data that is
processed within a certain amount of time
Workloads are usually divided into two types:
- batch
- transactional
Batch processing
involves processing data in batches and usually imposes delays, which in turn results in high-latency responses. Batch workloads typically involve large quantities of data with sequential reads/writes and comprise groups of read or write queries.
Queries can be complex and involve multiple joins. Typical of OLAP systems.
Transactional workloads involve small amounts of data with random reads and writes.
Transactional processing
also known as online processing
data is processed interactively without delay, resulting in low-latency responses.
Typical of OLTP and operational systems
clusters
provide support to create horizontally scalable storage solutions
benefit of clusters
- composed of low-cost commodity nodes that collectively provide increased processing capacity
- they provide inherent redundancy and fault tolerance, as they consist of physically separate nodes
MapReduce
It divides a big problem into a collection of
smaller problems that can each be solved quickly
Map tasks
- map
- combine (optional)
- partition
Reduce tasks
- shuffle and sort
- reduce
map stage
during which the dataset file is divided
into multiple smaller splits. Each split is parsed into its constituent records as a key-value
pair. The key is usually the ordinal position of the record, and the value is the actual
record.
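The split-and-parse step above can be sketched as follows (a hypothetical in-memory illustration, not the Hadoop API):

```python
# one split of the dataset file
split = "big data\ndivide and conquer\nmap reduce"

# parse the split into key-value records:
# key = ordinal position of the record, value = the record itself
records = list(enumerate(split.splitlines()))
print(records)  # [(0, 'big data'), (1, 'divide and conquer'), (2, 'map reduce')]
```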
Processing in Realtime Mode
data is processed in-memory as it is captured, before being persisted to disk.
- Response time generally ranges from sub-second to under a minute.
- Realtime mode addresses the velocity characteristic of Big Data datasets.
Whereas the CAP theorem is primarily related to _____, the SCV principle is related to ______
distributed data storage; distributed data processing.
Speed
Speed refers to how quickly the data can be processed once it is generated.
In realtime analytics, data is processed faster than in batch analytics.
Consistency
Consistency refers to the accuracy and the precision of the results.
Results are deemed accurate if they are close to the correct value and precise if close to each other.
Volume
Volume refers to the amount of data that can be processed. Big Data's velocity characteristic results in fast-growing datasets, leading to huge volumes of data that need to be processed in a distributed manner.
Event Stream Processing (ESP)
incoming stream of events, generally from a single source and ordered by time, is continuously analyzed. The analysis can occur via simple queries or the
application of algorithms that are mostly formula-based
Complex Event Processing (CEP)
a number of realtime events often coming from disparate sources and arriving at different time intervals are analyzed simultaneously for the detection of
patterns and initiation of action
cannot use map reduce with realtime processing because
MapReduce cannot process data incrementally and can only process complete datasets.
combine step
the combiner acts as a local reducer, merging the values of key-value pairs that share the same key within a single mapper's output, which reduces the amount of data sent over the network
partition
During the partition stage, if more than one reducer is involved, a partitioner divides the output from the mapper or combiner into partitions, one per reducer instance.
shuffle and sort
shuffle - output from all partitioners is copied across the network to the nodes running the reduce tasks
sort - the MapReduce engine automatically groups and sorts the key-value pairs by key, so that the output contains a sorted list of all input keys, with values for the same key appearing together
Reduce
the reducer will either further summarize its input or emit the output without making any changes
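The map, combine, partition, shuffle-and-sort, and reduce stages can be sketched end to end with a word count. This is a hypothetical single-process illustration of the data flow, not the Hadoop API; the names map_fn, combine, partition, and reduce_fn are made up for the sketch:

```python
from collections import defaultdict
import zlib

NUM_REDUCERS = 2

def map_fn(key, record):
    # map: emit a (word, 1) pair for every word in the record
    return [(word, 1) for word in record.split()]

def combine(pairs):
    # combine (optional): a local reduce on one mapper's output
    acc = defaultdict(int)
    for k, v in pairs:
        acc[k] += v
    return list(acc.items())

def partition(key, num_reducers):
    # partition: the same key always lands on the same reducer
    return zlib.crc32(key.encode()) % num_reducers

def reduce_fn(key, values):
    # reduce: further summarize the input for one key
    return key, sum(values)

splits = ["big data big", "data big"]

# map + combine run on each split
mapped = [combine(map_fn(i, s)) for i, s in enumerate(splits)]

# shuffle: copy each pair to its reducer's bucket
buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for pairs in mapped:
    for k, v in pairs:
        buckets[partition(k, NUM_REDUCERS)][k].append(v)

# sort + reduce: within each reducer, keys are processed in sorted order
output = {}
for bucket in buckets:
    for k in sorted(bucket):
        key, total = reduce_fn(k, bucket[k])
        output[key] = total

print(output)
```

Running the sketch yields the combined counts {"big": 3, "data": 2}, with the combiner having already collapsed the duplicate "big" pairs inside the first split before the shuffle.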
Task Parallelism
parallelization of data processing
by dividing a task into sub-tasks and running each sub-task on a separate processor,
generally on a separate node in a cluster
Data Parallelism
parallelization of data processing
by dividing a dataset into multiple datasets and processing each sub-dataset in
parallel
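The difference from task parallelism can be sketched in Python: in data parallelism the same operation runs over each sub-dataset (illustrative only; thread workers stand in for cluster nodes):

```python
from concurrent.futures import ThreadPoolExecutor

def subtotal(chunk):
    # the same operation is applied to every sub-dataset
    return sum(chunk)

dataset = list(range(100))
# divide the dataset into four sub-datasets
chunks = [dataset[i:i + 25] for i in range(0, 100, 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # process each sub-dataset in parallel
    partials = list(pool.map(subtotal, chunks))

total = sum(partials)
print(total)  # 4950
```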