Big Data Flashcards
What are the 5Vs of big data?
Veracity, Velocity, Volume, Variety, Volatility
What is veracity in big data?
Data quality and origin.
What is velocity in big data?
Data being generated extremely fast.
What is volume in big data?
Vast amounts of data.
What is variety in big data?
Data comes from different sources, in different forms (structured and unstructured).
What is volatility in big data?
Data points that swing sharply between very high and very low values.
Is RDBMS going away? Why or why not?
No. Organizations store different types of data in different systems, each chosen for its strengths (polyglot coexistence).
What is Hadoop?
Java-based framework for distributing and processing very large data sets across clusters of computers
What are some important components of Hadoop?
HDFS and MapReduce.
What is HDFS?
Hadoop Distributed File System.
- Low-level distributed file system for storing data
- Designed to run on commodity (cheap) hardware.
What is MapReduce?
Programming model that supports processing large data sets
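The model can be sketched as a word count in plain Python. This is a single-process illustration only, not Hadoop's Java API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input
counts = word_count(["big data is big", "data is everywhere"])
```

In a real cluster the map and reduce calls would run in parallel on different machines; here they run one after another to show the data flow.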
Assumptions of HDFS?
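- High volume: large blocks (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2 and later)
- Write once, read many: files are rarely changed once written, which simplifies concurrency control
- Streaming access: files are read sequentially in batch rather than by random access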
- Fault tolerant: replicate data across different nodes
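The block-size and replication assumptions can be made concrete with a little arithmetic. The sketch below assumes a replication factor of 3 and a simple round-robin placement. Real HDFS placement is rack-aware and more involved, and the node names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted in the card above
REPLICATION_FACTOR = 3   # assumed for illustration; configurable in HDFS


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified stand-in for HDFS placement; requires replication <= len(nodes).
    """
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


blocks = block_count(200)  # a 200 MB file
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
```

Because every block lives on several nodes, losing any single node in this sketch still leaves at least two copies of each block.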
Types of HDFS nodes?
- Data node
- Name node
- Client node
What is HDFS data node?
- Stores the actual file data
- Handles block creation, deletion, and replication
- Sends heartbeats to the name node
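Heartbeats let the name node notice when a data node stops responding. A minimal sketch of that bookkeeping follows. The class name, the 30-second timeout, and the node IDs are assumptions for illustration and do not reflect HDFS's actual intervals or code.

```python
HEARTBEAT_TIMEOUT_S = 30  # hypothetical timeout chosen for this example


class HeartbeatMonitor:
    """Tracks the last heartbeat time reported by each data node."""

    def __init__(self, timeout_s=HEARTBEAT_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node id -> time of most recent heartbeat

    def heartbeat(self, node_id, now):
        """Record that `node_id` reported in at time `now` (seconds)."""
        self.last_seen[node_id] = now

    def dead_nodes(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=25)     # dn1 keeps reporting, dn2 goes silent
stale = monitor.dead_nodes(now=40)   # only dn2 has exceeded the timeout
```

Timestamps are passed in explicitly so the example runs without waiting. A real monitor would read a clock and would then re-replicate the blocks held by the silent node.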
What is HDFS name node?
- One per cluster
- Contains file system metadata (file name, block numbers, replication factor)