Question 5 Flashcards
What is Big Data?
Big Data refers to datasets whose volume, velocity, and variety make them difficult to manage with traditional relational database management systems. Working with Big Data means storing and processing datasets that exhibit these 3 V’s. It lets businesses capture and track continuous data streams, enabling real-time processing and insights.
What are the 3 V’s of Big Data?
- Volume
- Velocity
- Variety
What is Volume in Big Data?
Volume: Refers to the vast amount of data generated. It can be handled through:
- Scaling Up: Upgrading existing systems to handle larger loads.
- Scaling Out: Distributing the load across multiple servers when a single server’s capacity is exceeded.
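One common way to scale out is to partition records across servers by hashing their keys. The sketch below is only an illustration of that idea; the server names and the `pick_server` helper are hypothetical and not tied to any particular product.

```python
import hashlib

def pick_server(key: str, servers: list[str]) -> str:
    """Map a record key to one server using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# Hypothetical server names, used only for illustration.
servers = ["server-a", "server-b", "server-c"]

# Group a handful of keys by the server each one is assigned to.
placement: dict[str, list[str]] = {}
for key in ["user:1", "user:2", "user:3", "user:4", "user:5", "user:6"]:
    placement.setdefault(pick_server(key, servers), []).append(key)

print(placement)
```

The same key always maps to the same server, so reads can find the data that writes placed. Note that simple modulo hashing reshuffles most keys when a server is added or removed.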
What is Velocity in Big Data?
Velocity: The speed at which data is generated and must be processed.
- Stream Processing: Analysing data in real-time as it flows into the system.
- Feedback Loop Processing: Analysing data to produce actionable insights immediately.
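As a rough sketch of stream processing, the generator below updates a result each time a new value arrives instead of waiting for the full dataset. The readings and the `running_average` name are assumptions made for the example.

```python
from typing import Iterable, Iterator

def running_average(stream: Iterable[float]) -> Iterator[float]:
    """Yield the updated average after each new reading arrives."""
    total = 0.0
    count = 0
    for value in stream:
        total += value
        count += 1
        yield total / count

# Hypothetical sensor readings standing in for a live data stream.
readings = [10.0, 20.0, 30.0]
averages = list(running_average(readings))
print(averages)  # [10.0, 15.0, 20.0]
```

Because each average is available as soon as its reading is processed, a feedback loop could act on it immediately, for example by raising an alert when the average crosses a threshold.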
What is Variety in Big Data?
Variety: The different types of data (structured and unstructured) that need to be stored.
- Structured Data: Fits into a predefined model (e.g., relational databases).
- Unstructured Data: Does not fit into a predefined model (e.g., text, images).
What is NoSQL?
A family of non-relational database technologies developed to address Big Data challenges.
How does NoSQL differ from a Relational Model in Key-value Stores?
NoSQL Structure:
- Key-value pairs, each identified by a unique key
Differences from Relational:
- Schema-less, simple queries, highly scalable
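The key-value model can be sketched with a plain Python dictionary: values are opaque to the store, and the only query is a lookup by key. The `put` and `get` helpers and the sample keys are hypothetical.

```python
# An in-memory stand-in for a key-value store.
store: dict[str, object] = {}

def put(key: str, value: object) -> None:
    """Store any value under a unique key; no schema is enforced."""
    store[key] = value

def get(key: str, default: object = None) -> object:
    """Look a value up by its key, returning a default if it is absent."""
    return store.get(key, default)

# Values under different keys can have completely different shapes.
put("session:42", {"user": "alice", "cart": ["book"]})
put("counter:visits", 1024)

print(get("counter:visits"))  # 1024
```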
How does NoSQL differ from a Relational Model in Document Databases?
NoSQL Structure:
- Documents (e.g. JSON)
Differences from Relational:
- Flexible schemas, content-based querying, data stored together.
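To illustrate flexible schemas and content-based querying, the sketch below keeps two JSON documents with different fields in one collection and filters them by field value. The documents and the `find` helper are made up for the example and do not represent any specific database API.

```python
import json

# Two documents in the same collection with different fields.
documents = [
    json.loads('{"_id": 1, "name": "Alice", "email": "alice@example.com"}'),
    json.loads('{"_id": 2, "name": "Bob", "tags": ["admin", "ops"]}'),
]

def find(docs: list[dict], field: str, value: object) -> list[dict]:
    """Return documents whose given field equals the given value."""
    return [doc for doc in docs if doc.get(field) == value]

matches = find(documents, "name", "Bob")
print(matches[0]["_id"])  # 2
```

Each document carries all of its own data, so a single lookup returns the whole record without a join.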
How does NoSQL differ from a Relational Model in Column-family Stores?
NoSQL Structure:
- Data stored in columns grouped into column families
Differences from Relational:
- Optimized for specific queries, varied column schemas, highly available and scalable.
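A column-family layout can be approximated with nested dictionaries: row key, then column family, then columns. This is a simplified sketch under that assumption; the table contents and the `read_family` helper are hypothetical.

```python
# row key -> column family -> column name -> value
table = {
    "user:1": {
        "profile": {"name": "Alice", "city": "Oslo"},
        "activity": {"last_login": "2024-01-01"},
    },
    # Rows need not share the same columns or even the same families.
    "user:2": {
        "profile": {"name": "Bob"},
    },
}

def read_family(data: dict, row_key: str, family: str) -> dict:
    """Fetch only one column family for a row, skipping the others."""
    return data.get(row_key, {}).get(family, {})

print(read_family(table, "user:1", "profile"))
```

Reading one family at a time is what makes this layout suit queries that touch only a subset of a row's columns.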
How does NoSQL differ from a Relational Model in Graph Databases?
NoSQL Structure:
- Nodes and edges
Differences from Relational:
- Direct relationship modelling, efficient traversal, adaptable schemas
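The sketch below models nodes and edges as an adjacency list and follows relationships with a breadth-first traversal. The names in the graph and the `reachable` helper are invented for illustration.

```python
from collections import deque

# Each node maps to the nodes it has an outgoing edge to.
edges = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}

def reachable(graph: dict[str, list[str]], start: str) -> list[str]:
    """Return every node reachable from start, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order

print(reachable(edges, "alice"))  # ['alice', 'bob', 'carol', 'dave']
```

A relational design would typically answer the same question with repeated self-joins, whereas here each hop is a direct lookup of a node's edges.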
What is the Hadoop framework?
A Java-based framework designed for the distributed storage and processing of large data sets across clusters of computers.
Explain the 2 core components of the Hadoop Framework.
- Hadoop Distributed File System (HDFS): A distributed file system that stores data across many machines and provides high throughput access.
- MapReduce: A programming model that processes large data sets in parallel across a distributed cluster.
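The MapReduce model can be sketched in a single process with the classic word-count example. This is plain Python rather than Hadoop's actual Java API, and the function names are assumptions; in a real cluster the map and reduce steps would run in parallel on different machines.

```python
from collections import defaultdict

def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(counts: list[int]) -> int:
    """Reduce: combine all values for one key into a single result."""
    return sum(counts)

lines = ["big data big", "data streams"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = {word: reduce_phase(vals) for word, vals in shuffle(pairs).items()}
print(word_counts)  # {'big': 2, 'data': 2, 'streams': 1}
```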
What are the major components of the Hadoop ecosystem?
- Hive: A data warehousing solution that uses SQL-like queries.
- Pig: A platform whose scripting language, Pig Latin, is compiled into MapReduce jobs.
- HBase: A NoSQL database that runs on top of HDFS.
How does data storage differ from data processing?
Data storage focuses on how data is organized and saved, which involves structures, formats, and systems. Data processing focuses on how data is accessed, manipulated, and transformed through queries, transactions, and analytics.