Chapter 6 Flashcards
Parallel data processing
simultaneous execution of multiple sub-tasks that
collectively comprise a larger task. The goal is to reduce the execution time by dividing a
single larger task into multiple smaller tasks that run concurrently
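The divide-and-conquer idea can be sketched in a few lines of Python (an illustrative example only; thread workers stand in for the parallel processors):

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # one sub-task of the larger "square everything" task
    return n * n

numbers = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    # the larger task is divided into sub-tasks that run concurrently
    results = list(pool.map(square, numbers))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```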
Distributed Data Processing
closely related to parallel data processing in that the same principle of “divide-and-conquer” is applied
distributed data processing is
always achieved through physically separate machines that are networked together as a cluster
Hadoop
an open-source framework for large-scale data storage and data processing that is compatible with commodity hardware
processing workload in Big Data
defined as the amount and nature of data that is
processed within a certain amount of time
Workloads are usually divided into two types:
- batch
- transactional
Batch processing
involves processing data in batches and usually imposes delays, which in turn results in high-latency responses. Batch workloads typically involve large quantities of data with sequential reads/writes and comprise groups of read or write queries.
Queries can be complex and involve multiple joins. Typical of OLAP systems.
Transactional workloads involve small amounts of data with random reads and writes.
Transactional processing
also known as online processing
data is processed interactively without delay, resulting in low-latency responses.
Typical of OLTP and operational systems
clusters
provide support to create horizontally scalable storage solutions
benefit of clusters
- composed of low-cost commodity nodes that collectively provide increased processing capacity
- they provide inherent redundancy and fault tolerance, as they consist of physically separate nodes
MapReduce
It divides a big problem into a collection of
smaller problems that can each be solved quickly
Map tasks
- map
- combine (optional)
- partition
Reduce tasks
- shuffle and sort
- reduce
map stage
during which the dataset file is divided
into multiple smaller splits. Each split is parsed into its constituent records as a key-value
pair. The key is usually the ordinal position of the record, and the value is the actual
record.
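The split-and-parse step above can be sketched as follows (a hypothetical in-memory illustration, not the Hadoop API):

```python
# one split of the dataset file
split = "big data\ndivide and conquer\nmap reduce"

# parse the split into key-value records:
# key = ordinal position of the record, value = the record itself
records = list(enumerate(split.splitlines()))
print(records)  # [(0, 'big data'), (1, 'divide and conquer'), (2, 'map reduce')]
```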
Processing in Realtime Mode
data is processed in-memory as it is captured, before being persisted to disk.
- Response time generally ranges from sub-second to under a minute.
- Realtime mode addresses the velocity characteristic of Big Data datasets.
Whereas the CAP theorem is primarily related to _____, the SCV principle is related to ______
distributed data storage; distributed data processing.
Speed
Speed refers to how quickly the data can be processed once it is generated.
In realtime analytics, data is processed faster than in batch analytics.
Consistency
Consistency refers to the accuracy and the precision of the results.
Results are deemed accurate if they are close to the correct value and precise if close to each other.
Volume
Volume refers to the amount of data that can be processed. Big Data's velocity characteristic results in fast-growing datasets, leading to huge volumes of data that need to be processed in a distributed manner.
Event Stream Processing (ESP)
incoming stream of events, generally from a single source and ordered by time, is continuously analyzed. The analysis can occur via simple queries or the
application of algorithms that are mostly formula-based
Complex Event Processing (CEP)
a number of realtime events often coming from disparate sources and arriving at different time intervals are analyzed simultaneously for the detection of
patterns and initiation of action
cannot use map reduce with realtime processing because
MapReduce cannot process data incrementally and can only process complete datasets.
combine step
the combiner acts as a local reducer, merging the values of key-value pairs that share the same key within a single mapper's output, which reduces the amount of data sent over the network
partition
During the partition stage, if more than one reducer is involved, a partitioner divides the output from the mapper or combiner into partitions, one per reducer instance.
shuffle and sort
shuffle - output from all partitioners is copied across the network to the nodes running the reduce tasks
sort - the MapReduce engine automatically groups and sorts the key-value pairs by key, so that the output contains a sorted list of all input keys, with values for the same key appearing together
Reduce
the reducer will either further summarize its input or emit the output without making any changes
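The map, combine, partition, shuffle-and-sort, and reduce stages can be sketched end to end with a word count. This is a hypothetical single-process illustration of the data flow, not the Hadoop API; the names map_fn, combine, partition, and reduce_fn are made up for the sketch:

```python
from collections import defaultdict
import zlib

NUM_REDUCERS = 2

def map_fn(key, record):
    # map: emit a (word, 1) pair for every word in the record
    return [(word, 1) for word in record.split()]

def combine(pairs):
    # combine (optional): a local reduce on one mapper's output
    acc = defaultdict(int)
    for k, v in pairs:
        acc[k] += v
    return list(acc.items())

def partition(key, num_reducers):
    # partition: the same key always lands on the same reducer
    return zlib.crc32(key.encode()) % num_reducers

def reduce_fn(key, values):
    # reduce: further summarize the input for one key
    return key, sum(values)

splits = ["big data big", "data big"]

# map + combine run on each split
mapped = [combine(map_fn(i, s)) for i, s in enumerate(splits)]

# shuffle: copy each pair to its reducer's bucket
buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for pairs in mapped:
    for k, v in pairs:
        buckets[partition(k, NUM_REDUCERS)][k].append(v)

# sort + reduce: within each reducer, keys are processed in sorted order
output = {}
for bucket in buckets:
    for k in sorted(bucket):
        key, total = reduce_fn(k, bucket[k])
        output[key] = total

print(output)
```

Running the sketch yields the combined counts {"big": 3, "data": 2}, with the combiner having already collapsed the duplicate "big" pairs inside the first split before the shuffle.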
Task Parallelism
parallelization of data processing
by dividing a task into sub-tasks and running each sub-task on a separate processor,
generally on a separate node in a cluster
Data Parallelism
parallelization of data processing
by dividing a dataset into multiple datasets and processing each sub-dataset in
parallel
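The difference from task parallelism can be sketched in Python: in data parallelism the same operation runs over each sub-dataset (illustrative only; thread workers stand in for cluster nodes):

```python
from concurrent.futures import ThreadPoolExecutor

def subtotal(chunk):
    # the same operation is applied to every sub-dataset
    return sum(chunk)

dataset = list(range(100))
# divide the dataset into four sub-datasets
chunks = [dataset[i:i + 25] for i in range(0, 100, 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # process each sub-dataset in parallel
    partials = list(pool.map(subtotal, chunks))

total = sum(partials)
print(total)  # 4950
```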