Chapter 6 Flashcards

1
Q

Parallel data processing

A

The simultaneous execution of multiple sub-tasks that
collectively make up a larger task. The goal is to reduce execution time by dividing a
single larger task into multiple smaller tasks that run concurrently.

2
Q

Distributed Data Processing

A

Closely related to parallel data processing in that the same “divide-and-conquer” principle is applied.
However, distributed data processing is
always achieved through physically separate machines that are networked together as a cluster.

3
Q

Hadoop

A

An open-source framework for large-scale data storage and data processing that runs on commodity hardware.

4
Q

processing workload in Big Data

A

Defined as the amount and nature of data that is processed within a certain amount of time.

5
Q

Workloads are usually divided into two types

A
  • batch
  • transactional

6
Q

Batch processing

A

Involves processing data in batches
and usually imposes delays, which in turn result in high-latency responses. Batch
workloads typically involve large quantities of data with sequential reads and writes, and
consist of groups of read or write queries.
Queries can be complex and involve multiple joins. Typical of OLAP systems.
By contrast, transactional workloads involve small amounts of data
with random reads and writes.

7
Q

Transactional processing

A

Also known as online processing.
Data is processed interactively without delay,
resulting in low-latency responses.
Typical of OLTP and operational systems.

8
Q

clusters

A

Provide the support needed to create horizontally scalable storage solutions.

9
Q

benefit of clusters

A

  • They are composed of low-cost commodity nodes that collectively provide increased processing capacity.
  • They provide inherent redundancy and fault tolerance, as they consist of physically separate nodes.

10
Q

MapReduce

A

It divides a big problem into a collection of smaller problems that can each be solved quickly.

11
Q

Map tasks

A
  • map
  • combine (optional)
  • partition
12
Q

Reduce tasks

A
  • shuffle and sort
  • reduce

13
Q

map stage

A

The stage during which the dataset file is divided
into multiple smaller splits. Each split is parsed into its constituent records, each represented as a key-value
pair. The key is usually the ordinal position of the record, and the value is the actual
record.
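
The split-and-parse behavior of the map stage can be sketched in plain Python. This is a minimal illustration, not Hadoop's actual input handling: the function names `split_dataset` and `parse_split`, the split size, and the sample records are all hypothetical.

```python
def split_dataset(records, split_size):
    """Divide the dataset (a list of records) into smaller splits."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def parse_split(split, start_position):
    """Parse one split into (key, value) pairs.

    The key is the ordinal position of the record in the whole dataset,
    and the value is the record itself.
    """
    return [(start_position + offset, record) for offset, record in enumerate(split)]


# Hypothetical sample dataset: one record per entry.
dataset = ["apple banana", "banana cherry", "apple", "cherry cherry"]
split_size = 2

splits = split_dataset(dataset, split_size)
pairs_per_split = [
    parse_split(split, start_position=index * split_size)
    for index, split in enumerate(splits)
]
```

Each entry of `pairs_per_split` is what a single mapper would receive in this simplified picture.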

14
Q

Processing in Realtime Mode

A

Data is processed in-memory as it is captured, before being persisted to disk.

  • Response time generally ranges from sub-second to under a minute.
  • Realtime mode addresses the velocity characteristic of Big Data datasets.
15
Q

Whereas the CAP theorem is primarily related to _____, the SCV principle is related to ______

A

distributed data storage; distributed data processing.

16
Q

Speed

A

Speed refers to how quickly the data can be processed once it is generated.
In realtime analytics, data is processed faster than in batch
analytics.

17
Q

Consistency

A

Consistency refers to the accuracy and the precision of the results.
Results are deemed accurate if they are close to the correct value and precise if close to each other.
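
The difference between accuracy and precision can be made concrete with a small numeric sketch. The two measures used here (distance of the mean from the correct value, and population standard deviation) are assumptions chosen for illustration, and the sample results are made up.

```python
from statistics import mean, pstdev


def accuracy_error(results, correct_value):
    """Distance of the average result from the correct value (smaller = more accurate)."""
    return abs(mean(results) - correct_value)


def precision_spread(results):
    """Spread of the results around their own mean (smaller = more precise)."""
    return pstdev(results)


correct_value = 100.0
run_a = [99.0, 101.0, 100.0, 100.0]   # accurate, with a small spread
run_b = [110.0, 110.0, 110.0, 110.0]  # perfectly precise, but not accurate
```

In this sketch `run_b` has zero spread yet sits far from the correct value, which shows that precise results are not necessarily accurate.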

18
Q

Volume

A

Volume refers to the amount of data that can be processed. Big Data’s velocity characteristic results in fast-growing datasets, leading to huge volumes of
data that need to be processed in a distributed manner.

19
Q

Event Stream Processing (ESP)

A

An incoming stream of events, generally from a single source and ordered by time, is continuously analyzed. The analysis can occur via simple queries or the
application of algorithms that are mostly formula-based.
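
A formula-based analysis over a single time-ordered stream might look like the following sketch, which keeps a running average that is updated as each event arrives. The event layout `(timestamp, reading)` and the sensor values are hypothetical.

```python
def running_average(events):
    """Yield (timestamp, running average) after each event in a time-ordered stream."""
    total = 0.0
    count = 0
    for timestamp, reading in events:
        total += reading
        count += 1
        yield timestamp, total / count


# Hypothetical stream from a single sensor, ordered by timestamp.
sensor_events = [(1, 20.0), (2, 22.0), (3, 27.0)]
averages = list(running_average(iter(sensor_events)))
```

Because `running_average` is a generator, it never needs the full stream in memory; it emits an updated result per event.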

20
Q

Complex Event Processing (CEP)

A

A number of realtime events, often coming from disparate sources and arriving at different time intervals, are analyzed simultaneously for the detection of
patterns and initiation of action.
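
As a rough sketch of pattern detection across disparate sources, the example below merges two event streams by timestamp and flags a failed login followed by a payment from the same user within a time window. The pattern, the event layouts, and the name `detect_pattern` are all hypothetical; real CEP engines offer far richer pattern languages.

```python
import heapq


def detect_pattern(login_events, payment_events, window_seconds):
    """Flag users whose failed login is followed by a payment within the window."""
    # Merge the two time-ordered sources into one time-ordered stream.
    merged = heapq.merge(
        ((t, "login_failed", user) for t, user in login_events),
        ((t, "payment", user) for t, user in payment_events),
    )
    last_failed_login = {}
    alerts = []
    for timestamp, kind, user in merged:
        if kind == "login_failed":
            last_failed_login[user] = timestamp
        elif user in last_failed_login:
            if timestamp - last_failed_login[user] <= window_seconds:
                alerts.append((user, timestamp))  # the "action" is just recording an alert
    return alerts


# Hypothetical events as (timestamp in seconds, user), each source sorted by time.
login_events = [(10, "alice"), (50, "bob")]
payment_events = [(30, "alice"), (200, "bob")]
alerts = detect_pattern(login_events, payment_events, window_seconds=60)
```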

21
Q

MapReduce is not suitable for realtime processing because

A

MapReduce cannot process data incrementally and can only process complete datasets.

22
Q

combine step

A

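A combiner is essentially a reducer function that locally groups a mapper's output on the same node as the mapper, combining values that share the same key before they are sent across the network.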
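
As a rough illustration of the combine step, the sketch below sums word counts that share a key before they leave the mapper's node. The word-count job, the name `combine`, and the sample pairs are assumptions made for the example.

```python
from collections import defaultdict


def combine(mapper_output):
    """Locally sum the values that share a key, as a combiner would on the mapper's node."""
    totals = defaultdict(int)
    for key, value in mapper_output:
        totals[key] += value
    return sorted(totals.items())


# Hypothetical output of one mapper in a word-count job.
mapper_output = [("apple", 1), ("banana", 1), ("apple", 1), ("apple", 1)]
combined = combine(mapper_output)
```

Four pairs shrink to two here, which is the point of combining: less data has to travel to the reducers.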

23
Q

partition

A

During the partition stage, if more than one reducer is involved, a partitioner divides the
output from the mapper or combiner
into partitions between reducer instances.
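
One common way to divide pairs among reducers is to hash the key and take the result modulo the number of reducers, so every pair with the same key lands in the same partition. The sketch below assumes that hash-based scheme; the function name `partition` and the sample pairs are hypothetical.

```python
import zlib


def partition(pairs, num_reducers):
    """Assign each (key, value) pair to one of num_reducers partitions."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index].append((key, value))
    return partitions


# Hypothetical mapper or combiner output.
pairs = [("apple", 3), ("banana", 1), ("cherry", 2), ("apple", 1)]
partitions = partition(pairs, num_reducers=2)
```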

24
Q

shuffling and sort

A

  • Shuffling: output from all partitioners is copied across the network to the nodes running the reduce task.
  • Sort: the MapReduce engine automatically groups and sorts the key-value pairs according
    to the keys, so that the output contains a sorted list of all input keys and their values, with
    the same keys appearing together.
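
The effect of shuffle and sort on the data arriving at one reducer can be sketched as follows. Network copying is simulated by simply iterating over lists; the function name and sample pairs are hypothetical.

```python
from collections import defaultdict


def shuffle_and_sort(partitioner_outputs):
    """Gather the pairs destined for one reducer, then group and sort them by key."""
    grouped = defaultdict(list)
    for output in partitioner_outputs:   # "shuffle": collect pairs from every mapper node
        for key, value in output:
            grouped[key].append(value)
    return sorted(grouped.items())       # "sort": keys in order, values grouped per key


# Hypothetical partitions sent to the same reducer by two mapper nodes.
outputs_for_reducer = [
    [("banana", 1), ("apple", 3)],   # from mapper node 1
    [("apple", 1), ("cherry", 2)],   # from mapper node 2
]
reducer_input = shuffle_and_sort(outputs_for_reducer)
```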

25
Q

Reduce

A

The reducer will either further summarize its input or emit the output without making any changes.
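
Both reducer behaviors can be shown side by side. In this sketch the summarizing reducer totals the values per key, while the identity reducer passes every pair through unchanged; the function names and input are hypothetical.

```python
def sum_reducer(key, values):
    """Further summarize the input: emit one total per key."""
    return (key, sum(values))


def identity_reducer(key, values):
    """Emit the input unchanged, one pair per value."""
    return [(key, value) for value in values]


# Hypothetical grouped and sorted input, as a reducer would receive it.
reducer_input = [("apple", [3, 1]), ("banana", [1])]

summarized = [sum_reducer(key, values) for key, values in reducer_input]
unchanged = [
    pair
    for key, values in reducer_input
    for pair in identity_reducer(key, values)
]
```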

26
Q

Task Parallelism

A

The parallelization of data processing
by dividing a task into sub-tasks and running each sub-task on a separate processor,
generally on a separate node in a cluster.
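
A minimal sketch of task parallelism: two different sub-tasks run concurrently over the same data. Threads inside one process stand in for separate processors or cluster nodes here, which is a simplifying assumption; the sub-task functions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_total(values):
    """Sub-task 1: sum the values."""
    return sum(values)


def compute_maximum(values):
    """Sub-task 2: find the largest value."""
    return max(values)


data = [4, 8, 15, 16, 23, 42]

# Each sub-task is a different function, submitted to its own worker.
with ThreadPoolExecutor(max_workers=2) as executor:
    total_future = executor.submit(compute_total, data)
    maximum_future = executor.submit(compute_maximum, data)
    total = total_future.result()
    maximum = maximum_future.result()
```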

27
Q

Data Parallelism

A

The parallelization of data processing
by dividing a dataset into multiple sub-datasets and processing each sub-dataset in
parallel.
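
A minimal sketch of data parallelism: the same function is applied to each sub-dataset concurrently and the partial results are merged. As a simplifying assumption, threads in one process stand in for separate nodes; the helper names and the dataset are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(dataset, num_chunks):
    """Divide the dataset into at most num_chunks sub-datasets of similar size."""
    size = -(-len(dataset) // num_chunks)  # ceiling division
    return [dataset[i:i + size] for i in range(0, len(dataset), size)]


def sum_of_squares(sub_dataset):
    """The single function applied to every sub-dataset."""
    return sum(x * x for x in sub_dataset)


dataset = list(range(1, 11))
sub_datasets = chunk(dataset, num_chunks=3)

# The same function runs on every sub-dataset in parallel.
with ThreadPoolExecutor(max_workers=3) as executor:
    partial_results = list(executor.map(sum_of_squares, sub_datasets))

result = sum(partial_results)
```

The contrast with task parallelism is that here the function stays the same while the data differs per worker.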