Module 1 Flashcards

1
Q

Name the four V’s of Big Data and explain what they represent

A

Velocity, Volume, Variety and Veracity.
These are features of a data set that help characterize big data.

Big data is traditionally defined as having high velocity, volume and variety; veracity is sometimes added as a fourth V.

2
Q

Define Big Data

A

Massive data sets (volume) that cannot be maintained using traditional data processing techniques. They grow at an incredible rate (velocity) and are complex in nature as they can store structured, semi-structured or unstructured data (variety).

Key characteristics of big data include substantial volume, high velocity, and diverse variety.

3
Q

Explain Variety from the 3 V’s of Big Data

A

Big data is complex and has high variety, meaning it can contain structured, semi-structured or unstructured data (it is heterogeneous).

This is a consequence of the ever-growing number of data sources available (e.g. processes, sensors, mobile equipment, people, etc.).

4
Q

Explain Volume from the 3 V’s of Big Data

A

Big data, by definition, involves massive and ever-growing amounts of data.

5
Q

Explain Velocity from the 3 V’s of Big Data

A

A Big Data system needs to be able to ingest and process data continuously at a high rate, sometimes in real time or near real time.

6
Q

Explain Veracity, which sometimes is said to be the fourth V of Big Data

A

Because we rely on such massive amounts of data, we need to ensure that the data is as trustworthy as possible. Good-quality data is accurate, complete and unambiguous.
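
As a rough illustration of checking veracity, the sketch below filters records for completeness and plausibility. The field names, the sample records and the 0 to 120 age rule are all assumptions made up for this example, not part of any standard.

```python
def is_complete(record, required_fields):
    """A record is complete when every required field has a non-empty value."""
    return all(record.get(field) not in (None, "") for field in required_fields)


def is_plausible(record):
    """Hypothetical accuracy rule: an age must be an integer between 0 and 120."""
    age = record.get("age")
    return isinstance(age, int) and 0 <= age <= 120


# Made-up sample records: one clean, one incomplete, one implausible.
records = [
    {"id": 1, "name": "Ana", "age": 34},
    {"id": 2, "name": "", "age": 41},
    {"id": 3, "name": "Li", "age": 230},
]

required = ["id", "name", "age"]
trusted = [r for r in records if is_complete(r, required) and is_plausible(r)]
trusted_ids = [r["id"] for r in trusted]
```

Only the first record passes both checks; real pipelines would typically log or quarantine the rejected records rather than silently dropping them.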

7
Q

Describe the Big Data life cycle

A
8
Q

Explain the difference between Linear and Parallel processing when solving a problem

A

In Linear Processing, the solution to a problem is broken down into a set of instructions that are executed sequentially, one after another.
In Parallel Processing, the solution is broken down into a set of instructions, each of which is assigned to a specific node in the cluster so that they can run at the same time.
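
A minimal sketch of the contrast, assuming a made-up `process_chunk` task and using threads on one machine as a stand-in for the nodes of a cluster:

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """Hypothetical unit of work: sum the values in one chunk of data."""
    return sum(chunk)


def linear(chunks):
    # A single worker handles every chunk, one after another.
    return [process_chunk(chunk) for chunk in chunks]


def parallel(chunks, workers=4):
    # Each chunk is handed to a worker; the threads play the role of nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        return list(pool.map(process_chunk, chunks))


chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
linear_result = linear(chunks)
parallel_result = parallel(chunks)
```

Both approaches produce the same partial sums; the parallel version simply spreads the chunks across several workers.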

9
Q

What is a node?
What is a cluster and its purpose?

A

A node is an individual computer or server that has compute and storage capacity.

A cluster is a collection of interlinked nodes; its purpose is to allow parallel processing.

10
Q

Explain the difference between scaling up (vertical) and out (horizontal)

A

Scaling up (vertically) means increasing the compute and/or storage capacity of a single node. Because of the three V’s of big data, scaling up might not be a sustainable solution.

Scaling out (horizontally) means adding more nodes to the cluster, which ultimately increases the cluster’s storage and/or compute capacity.
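
The difference can be sketched with a toy model of a cluster. The `Node` class, its fields and the capacity figures are hypothetical and only serve to show where the extra capacity ends up:

```python
from dataclasses import dataclass


@dataclass
class Node:
    cpu_cores: int
    storage_tb: int


def total_capacity(cluster):
    """Return (total cores, total storage in TB) across all nodes."""
    return (
        sum(node.cpu_cores for node in cluster),
        sum(node.storage_tb for node in cluster),
    )


def scale_up(cluster, index, extra_cores, extra_tb):
    # Vertical scaling: one existing node gets bigger; node count is unchanged.
    node = cluster[index]
    upgraded = Node(node.cpu_cores + extra_cores, node.storage_tb + extra_tb)
    return cluster[:index] + [upgraded] + cluster[index + 1:]


def scale_out(cluster, new_node):
    # Horizontal scaling: a new node joins the cluster.
    return cluster + [new_node]


cluster = [Node(8, 4), Node(8, 4)]
scaled_up = scale_up(cluster, 0, extra_cores=8, extra_tb=4)
scaled_out = scale_out(cluster, Node(8, 4))
```

In this toy case both options reach the same total capacity, but scaling up is bounded by how large a single machine can get, while scaling out keeps working as long as nodes can be added.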

11
Q

How is Parallel Processing superior to Linear Processing in case of errors during Big Data problems?

A

If an error occurs during the calculation, linear processing requires the whole set of instructions to be executed again, while in parallel processing only the instructions assigned to the node that failed need to be executed again.
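
The idea of rerunning only the failed piece can be sketched as follows. The partition names and the simulated failure are invented for illustration, and the partitions are processed in a plain loop to keep the retry logic easy to follow:

```python
def run_partition(partition, attempt):
    """Hypothetical task: partition 'b' fails on its first attempt."""
    name, values = partition
    if name == "b" and attempt == 0:
        raise RuntimeError("simulated node failure")
    return sum(values)


def run_all(partitions, max_attempts=2):
    results = {}
    attempts = {}
    pending = list(partitions)
    for attempt in range(max_attempts):
        failed = []
        for partition in pending:
            name = partition[0]
            attempts[name] = attempts.get(name, 0) + 1
            try:
                results[name] = run_partition(partition, attempt)
            except RuntimeError:
                failed.append(partition)
        # Only the partitions that failed are carried into the next round.
        pending = failed
        if not pending:
            break
    return results, attempts


partitions = [("a", [1, 2]), ("b", [3, 4]), ("c", [5, 6])]
results, attempts = run_all(partitions)
```

Partitions `a` and `c` run once each; only `b` is executed a second time, which is the saving compared with restarting the whole job.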

12
Q

How do Linear and Parallel processing compare in terms of node requirements and flexibility?

A

Per-node storage and compute requirements are lower in parallel processing, because the task has been broken down into a set of smaller instructions instead of everything being done on one node.

Parallel processing is also more flexible, as nodes can be added to or removed from the cluster depending on the complexity of the task.

13
Q

What are embarrassingly parallel calculations?

A

Workloads that can easily be divided and run independently. If one workload fails, it has no impact on the other workloads and is easily rerun.

14
Q

Scaling in the context of big data refers to ____

A

… adding more computing resources to handle increased data volume and processing demands

15
Q

What are the types of data associated with Big Data?

A
  • structured
  • semi-structured
  • unstructured
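
To make the three types concrete, here is a small sketch using made-up sample data: a CSV table with a fixed set of columns (structured), a JSON record with nested, self-describing fields (semi-structured), and free text with no predefined fields (unstructured).

```python
import csv
import io
import json

# Structured: every row follows the same fixed columns.
structured = "id,name,age\n1,Ana,34\n2,Raj,41\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: fields are labelled, but may be nested or optional.
semi_structured = '{"id": 3, "name": "Li", "tags": ["vip"], "address": {"city": "Oslo"}}'
record = json.loads(semi_structured)

# Unstructured: free text; any structure has to be derived by analysis.
unstructured = "Customer called about a late delivery and asked for a refund."
word_count = len(unstructured.split())
```

The structured and semi-structured samples can be queried by field name straight away, whereas the unstructured text only yields something as crude as a word count without further processing.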
16
Q

Why is parallel processing important in big data?

A

It reduces processing times.

17
Q

What are the differences between structured, unstructured and semi-structured data?

A