Big Data Flashcards
Elements that make up Big Data
- Velocity.
- Volume.
- Variety.
- Veracity.
- Value.
Velocity
is the speed at which data accumulates; data is generated very quickly and never stops.
Volume
is the scale of the data or the increase in the amount of data stored.
Variety
is the diversity of the data: data can be structured, semi-structured, or unstructured. Variety also refers to the different sources the data comes from, such as machines, people, and processes.
Veracity
is the quality and origin of the data and its conformity to facts and accuracy. With the large amounts of data obtained and accumulated, data needs to be assessed for whether it is real, accurate, and reliable.
Value
is our ability and need to turn data into value. Value can be profit, a medical benefit, or a social benefit.
Big Data Processing Tools
- Hadoop.
- Hive.
- Spark.
Hadoop
It is a Java-based, open-source framework that allows distributed storage and processing of large datasets across clusters of computers.
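As a rough illustration of the processing side, the sketch below walks through the word-count pattern that Hadoop's MapReduce model distributes across a cluster. Here the map, shuffle, and reduce stages simply run locally in one Python process, and the sample input lines are made up.

```python
# A minimal, self-contained sketch of the MapReduce word-count pattern that
# Hadoop distributes across a cluster. The map, shuffle, and reduce stages
# run locally in one process purely for illustration.
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    # Shuffle: group all values by key (Hadoop does this between nodes).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

if __name__ == "__main__":
    sample = ["big data never stops", "big data keeps growing"]
    print(reduce_phase(shuffle_phase(map_phase(sample))))
    # {'big': 2, 'data': 2, 'never': 1, 'stops': 1, 'keeps': 1, 'growing': 1}
```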
Hadoop Benefits
- You can incorporate emerging data formats, such as streaming audio, video, and social media, along with structured, semi-structured, and unstructured data.
- Provides stakeholders with real-time access to the data.
- Optimizes and streamlines costs in your enterprise data warehouse by consolidating data across the organization and moving “cold” data (data that is not frequently used) to a Hadoop-based system.
Hadoop Distributed File System (HDFS)
Storage system for big data.
Hadoop Distributed File System (HDFS) capacities
- HDFS provides reliable big data storage by partitioning files over multiple nodes.
- It splits large files across multiple computers, allowing parallel access to them (see the sketch after this list).
- It replicates (copies) file blocks on different nodes to prevent data loss.
- Fast recovery from hardware failures: HDFS can detect faults and recover automatically.
- Access to streaming data, such as video.
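A minimal sketch of reading a file that HDFS has split into blocks, using Spark (covered below) as the client. A running cluster is assumed, and the namenode host, port, and file path are hypothetical placeholders.

```python
# A minimal sketch of reading a file stored on HDFS, assuming a running
# cluster and a Spark installation; the namenode host, port, and file path
# are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read-sketch").getOrCreate()

# Spark asks HDFS for the block locations and reads the blocks in parallel;
# replication means the read still succeeds if one datanode is down.
logs = spark.read.text("hdfs://namenode:9000/data/server_logs.txt")

print(logs.count())  # number of lines across all blocks
spark.stop()
```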
Hive
It is open-source data warehouse software for reading, writing, and managing large datasets that are stored directly in Hadoop or in other data storage systems.
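Hive tables are typically queried with a SQL-like language (HiveQL). Below is a minimal sketch of running such a query through Spark, assuming the session is configured with Hive support; the database and table names (sales.orders) and the columns are hypothetical.

```python
# A minimal sketch of querying a Hive table, assuming a Spark installation
# configured to talk to the Hive metastore; the database, table, and column
# names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-query-sketch")
    .enableHiveSupport()        # lets Spark read tables registered in Hive
    .getOrCreate()
)

# A SQL-style query over data that physically lives on HDFS (or another store).
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM sales.orders
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
""")
top_customers.show()
spark.stop()
```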
Spark capacities
- It has in-memory processing, which increases the speed of computations (illustrated in the sketch below).
- It has interfaces for major programming languages such as Java, Python, R, and SQL.
- It can access data in a large variety of data sources, including HDFS and Hive.
- It can process streaming data quickly.
- It can do complex analytics in real time.
Spark
A general-purpose data processing engine designed to extract and process large volumes of data for a wide range of applications in real time.
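A minimal sketch of Spark's in-memory DataFrame processing, assuming a Spark installation; the CSV path and column names are hypothetical. Caching the dataset keeps it in cluster memory so several computations can reuse it without re-reading from storage.

```python
# A minimal sketch of Spark's in-memory processing, assuming a Spark
# installation; the CSV path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

events = spark.read.csv(
    "hdfs://namenode:9000/data/events.csv", header=True, inferSchema=True
)

# cache() keeps the dataset in memory, so the two aggregations below
# reuse it instead of re-reading the file from storage each time.
events.cache()

events.groupBy("event_type").count().show()
events.groupBy("country").agg(F.avg("duration").alias("avg_duration")).show()

spark.stop()
```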