Big Data and NoSQL Flashcards
What is data storage?
Data storage focuses on how data is organized and saved, involving structures, formats, and systems.
What is data processing?
Data processing emphasizes how data is accessed, manipulated, and transformed through queries, transactions, and analytics.
Explain the role of Big Data in modern business
Big data: Refers to datasets characterized by high volume, velocity, and variety (the "3 Vs"), which make them unsuitable for traditional relational database management systems. Big data allows businesses to capture and track continuous data streams, enabling real-time processing and insights.
Characteristics of Big Data (the "3 Vs")
- Volume
- Velocity
- Variety
What is volume?
- Volume: Refers to the vast amount of data generated. It can be handled through:
* Scaling Up: Upgrading existing systems to handle larger loads.
* Scaling Out: Distributing the load across multiple servers when a single server’s capacity is exceeded.
What is velocity?
- Velocity: The speed at which data is generated and must be processed.
* Stream Processing: Analysing data in real-time as it flows into the system (see the sketch below).
* Feedback Loop Processing: Analysing data to produce actionable insights immediately.
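As a rough illustration of stream processing, here is a minimal Python sketch that updates aggregates as each event arrives instead of waiting for a batch. The event source, names, and data are invented for the example.

```python
# Minimal stream-processing sketch: maintain running aggregates per key
# as events arrive, rather than processing a stored batch later.
from collections import defaultdict

def event_stream():
    """Stand-in for a real-time feed (e.g., a message queue consumer)."""
    yield ("sensor-1", 21.5)
    yield ("sensor-2", 19.0)
    yield ("sensor-1", 22.1)

running_totals = defaultdict(float)
counts = defaultdict(int)

for sensor_id, reading in event_stream():
    # Update state incrementally as each event flows in.
    running_totals[sensor_id] += reading
    counts[sensor_id] += 1
    print(sensor_id, "running average:", running_totals[sensor_id] / counts[sensor_id])
```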
What is variety?
- Variety: The different types of data (structured and unstructured) that need to be stored.
* Structured Data: Fits into a predefined model (e.g., relational databases).
* Unstructured Data: Does not fit into a predefined model (e.g., text, images).
What are the core components of Hadoop?
- Hadoop Distributed File System (HDFS): A distributed file system that stores data across many machines and provides high-throughput access.
- MapReduce: A programming model that processes large data sets in parallel across a distributed cluster.
What is the Hadoop framework?
Hadoop: A Java-based framework designed for the distributed storage and processing of large data sets across clusters of computers.
HDFS Key Features
- High Volume: Uses large default block sizes (64 MB in early Hadoop versions, 128 MB in later ones), suited to storing very large files.
- Write-Once, Read-Many: Simplifies concurrency and improves throughput.
- Fault Tolerance: Data is replicated across different devices to ensure availability in case of failures.
MapReduce Framework:
- Map Function: Processes input data into key-value pairs.
- Reduce Function: Aggregates and summarizes the results of the map function (see the sketch below).
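The word-count example below is a minimal sketch of the map and reduce steps written in plain Python rather than the Hadoop Java API; function names such as map_fn and reduce_fn are illustrative, not part of any framework.

```python
# Word-count sketch of the MapReduce idea in plain Python.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: emit (key, value) pairs, here (word, 1) for each word in a line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce: aggregate all values that share the same key.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]

# Shuffle/sort step: collect and group intermediate pairs by key.
pairs = sorted(kv for line in lines for kv in map_fn(line))
results = [reduce_fn(word, [count for _, count in group])
           for word, group in groupby(pairs, key=itemgetter(0))]
print(results)  # e.g. [('brown', 1), ('dog', 1), ..., ('the', 2)]
```

In Hadoop itself, the shuffle/sort step and the distribution of map and reduce tasks across the cluster are handled by the framework; only the map and reduce logic is written by the developer.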
Identify the major components of the Hadoop ecosystem
Hadoop Ecosystem: A set of related tools and applications that complement Hadoop.
Examples of the Hadoop ecosystem
- Hive: A data warehousing solution that uses SQL-like queries.
- Pig: A scripting language for creating MapReduce jobs.
- HBase: A NoSQL database that runs on top of HDFS.
Four major approaches to the NoSQL data model and how they differ from the relational model
- Key-Value Stores: Schema-less, simple queries, highly scalable.
- Document Databases: Flexible schemas, content-based querying, data stored together.
- Column-Family Stores: Optimized for specific queries, varied column schemas, highly available and scalable.
- Graph Databases: Direct relationship modeling, efficient traversal, adaptable schemas (see the sketch below).
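As a rough comparison, the sketch below shapes the same user record under each of the four models using plain Python structures; no specific NoSQL product or API is assumed, and all keys and values are made up for illustration.

```python
# Sketch contrasting how the same record might be shaped in the four
# NoSQL data models, using plain Python structures only.

# Key-value store: an opaque value looked up by a single key.
kv_store = {"user:42": '{"name": "Ada", "city": "London"}'}

# Document database: a self-contained document with a flexible schema;
# fields can vary from one document to the next.
documents = [
    {"_id": 42, "name": "Ada", "city": "London", "orders": [101, 102]},
    {"_id": 43, "name": "Grace"},  # no 'city' or 'orders' field needed
]

# Column-family store (conceptually): rows keyed by a row key, each row
# holding its own set of column -> value pairs.
column_family = {
    "row-42": {"profile:name": "Ada", "profile:city": "London"},
    "row-43": {"profile:name": "Grace"},
}

# Graph database: nodes plus explicit relationships for fast traversal.
nodes = {42: {"name": "Ada"}, 43: {"name": "Grace"}}
edges = [(42, "FOLLOWS", 43)]

print(kv_store["user:42"])
print(documents[1])
print(column_family["row-42"]["profile:name"])
print(edges)
```

Unlike the relational model, none of these structures require a fixed, shared schema or joins across normalized tables; each model trades that generality for scalability or for efficiency on a particular access pattern.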
Describe the characteristics of NewSQL databases
NewSQL databases: Aim to bridge the gap between traditional relational databases (RDBMS) and NoSQL databases. RDBMS are essential for supporting ACID-compliant transactions in everyday business operations, while NoSQL databases focus on managing large volumes of user-generated and machine-generated data across distributed systems; NewSQL systems aim to offer NoSQL-style scalability while retaining SQL and ACID transaction guarantees.