Module 1 Flashcards
Name the four V’s of Big Data and explain what they represent
Velocity, Volume, Variety and Veracity.
They are features of a data set that help characterize big data.
Big data is traditionally defined as having high velocity, volume and variety.
Define Big Data
Massive data sets (volume) that cannot be maintained using traditional data processing techniques. They grow at an incredible rate (velocity) and are complex in nature as they can store structured, semi-structured or unstructured data (variety).
Key characteristics of big data include substantial volume, high velocity, and diverse variety.
Explain Variety from the 3 V’s of Big Data
Big data is complex and has high variety, meaning it can contain structured, semi-structured or unstructured (heterogeneous) data.
This is a consequence of the ever-growing number of data sources available (e.g. processes, sensors, mobile devices, people, etc.).
Explain Volume from the 3 V’s of Big Data
Big data, by definition, involves massive amounts of data that keep growing.
Explain Velocity from the 3 V’s of Big Data
Big data systems need to ingest and process data continuously at a high rate, sometimes in real time or near real time.
Explain Veracity, which sometimes is said to be the fourth V of Big Data
Because we rely on such massive amounts of data, we need to ensure that the data is as trustworthy as possible. Good-quality data is accurate, complete and unambiguous.
Describe the Big Data life cycle
Explain the difference between Linear and Parallel processing when solving a problem
In linear processing, the solution is broken down into a set of instructions that are executed sequentially, one after another.
In parallel processing, the solution is broken down into sets of instructions, each assigned to a specific node in the cluster, so that they can be executed at the same time.
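The difference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: worker threads stand in for the nodes of a cluster, and the names partial_sum, split, linear_sum and parallel_sum are hypothetical helpers invented for this card.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Work done by one (simulated) node: sum the squares of its chunk."""
    return sum(x * x for x in chunk)


def linear_sum(data):
    """Linear processing: one worker walks through all the data in sequence."""
    return partial_sum(data)


def split(data, n_nodes):
    """Split data into at most n_nodes chunks of roughly equal size."""
    size = max(1, -(-len(data) // n_nodes))  # ceiling division, never 0
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, n_nodes=4):
    """Parallel processing: each chunk goes to its own worker, then the
    partial results are combined. Threads only simulate cluster nodes here."""
    chunks = split(data, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        return sum(pool.map(partial_sum, chunks))
```

Both functions return the same answer; only the way the work is divided changes.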
What is a node?
What is a cluster and its purpose?
A node is an individual computer or server that has its own compute and storage capacity.
A cluster is a collection of interlinked nodes that work together, which makes parallel processing possible.
Explain the difference between scaling up (vertical) and out (horizontal)
Scaling up (vertically) means increasing the compute and/or storage capacity of a single node. Because of the three V’s of big data, scaling up might not be a sustainable solution.
Scaling out (horizontally) means adding more nodes to the cluster, which increases the cluster’s overall storage and/or compute capacity.
How is Parallel Processing superior to Linear Processing in case of errors during Big Data problems?
If an error occurs during a calculation, linear processing requires the whole set of instructions to be executed again, while in parallel processing only the instructions assigned to the node that failed need to be executed again.
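A minimal sketch of that idea, assuming each chunk of work is independent of the others. The function run_with_retry and its max_attempts parameter are hypothetical names used for illustration, not part of any real framework.

```python
def run_with_retry(chunks, work, max_attempts=3):
    """Run work(chunk) for every chunk, re-running only the chunks that failed.

    Returns the results in chunk order, plus how many attempts each chunk took.
    """
    results = {}
    attempts = {}
    pending = list(range(len(chunks)))
    while pending:
        still_failing = []
        for i in pending:
            attempts[i] = attempts.get(i, 0) + 1
            try:
                results[i] = work(chunks[i])
            except Exception:
                if attempts[i] >= max_attempts:
                    raise  # give up on a chunk that keeps failing
                still_failing.append(i)
        # Chunks that already succeeded are never executed again.
        pending = still_failing
    return [results[i] for i in range(len(chunks))], attempts
```

In a linear run, the same failure would mean starting the whole calculation over from the first instruction.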
How do Linear and Parallel processing compare in terms of node requirements and flexibility?
Per-node storage and compute requirements are lower in parallel processing, because the task is broken down into smaller sets of instructions instead of everything being done on one node.
Parallel processing is also more flexible, as nodes can be added to or removed from the cluster depending on the complexity of the task.
What are embarrassingly parallel calculations?
Workloads that can easily be divided and run independently. If one workload fails, it has no impact on the other workloads and is easily rerun.
Scaling in the context of big data refers to ____
… adding more computing resources to handle increased data volume and processing demands
What are the types of data associated with Big Data?
- structured
- semi-structured
- unstructured
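A small illustration of the three types, using made-up sample data (the names and values below are assumptions invented for this card):

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema (e.g. a CSV table).
structured = "id,name,age\n1,Ada,36\n2,Alan,41\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing keys, but records may differ in shape (e.g. JSON).
semi_structured = '[{"id": 1, "name": "Ada", "tags": ["math"]}, {"id": 2, "name": "Alan"}]'
records = json.loads(semi_structured)

# Unstructured: no predefined schema at all (e.g. free text, images, audio).
unstructured = "Ada wrote the first program; Alan formalised computation."
words = unstructured.split()
```

The CSV rows all share the same columns, the second JSON record lacks the "tags" key that the first one has, and the free text has to be interpreted before any fields can be extracted from it.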