Introduction - Mabel Flashcards
What is data independence?
The separation of logical schema (how data is structured) from physical schema (how data is stored).
Why is data independence important?
It ensures that physical changes (e.g., hardware upgrades or a new storage layout) do not affect how applications logically interact with the data.
Who introduced the concept of data independence?
Edgar F. Codd, in his 1970 paper introducing the relational model.
What is normalization?
The process of organizing data to reduce redundancy and improve integrity.
What is denormalization?
Combining data into fewer tables to reduce the need for complex joins, enhancing performance and scalability.
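To make this concrete, here is a minimal Python sketch (table names and values are invented): the normalized form stores each customer once and references it by ID, while the denormalized form copies customer fields into every order row to avoid the join.

```python
# Normalized: customer data stored once, referenced by ID (no redundancy).
customers = {1: {"name": "Ada", "city": "Zurich"}}
orders = [{"order_id": 10, "customer_id": 1, "item": "book"},
          {"order_id": 11, "customer_id": 1, "item": "pen"}]

# Reading an order's customer city requires a "join" (lookup by ID).
city = customers[orders[0]["customer_id"]]["city"]

# Denormalized: customer data copied into every order row.
# Reads need no join, but "Zurich" is now stored redundantly.
orders_denorm = [
    {"order_id": 10, "name": "Ada", "city": "Zurich", "item": "book"},
    {"order_id": 11, "name": "Ada", "city": "Zurich", "item": "pen"},
]
city = orders_denorm[0]["city"]  # direct read, no lookup
```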
What are the three Vs of Big Data?
Volume, Variety, and Velocity.
What are examples of NoSQL technologies?
Key-Value Stores, Triple Stores, Column Stores, Document Stores.
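As a toy illustration of the simplest of these models, the sketch below is a hypothetical in-memory key-value store in Python (not a real library); its entire interface is put/get/delete by key, which is what real key-value stores expose at their core.

```python
class KeyValueStore:
    """Toy in-memory key-value store: the whole API is put/get/delete."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))  # {'name': 'Ada'}
```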
What is the purpose of normalization in traditional databases?
To prioritize data integrity and minimize redundancy.
Why is normalization often relaxed in Big Data systems?
To prioritize performance and scalability over strict data integrity.
How is velocity defined in Big Data?
By capacity (how much data can be stored), throughput (speed of data transfer), and latency (time delay in data availability).
How has velocity changed from 1956 to 2024?
Capacity increased by 23 billion times, throughput by 20,800 times, and latency improved by 400 times.
What is the timeline of storage systems?
1960s: file systems; 1970s: relational databases; 1980s: object databases; 2000s: NoSQL systems.
What is a data model?
A framework defining how data is structured, organized, and stored.
What are the fundamental shapes of data?
Tables, Trees, Graphs, Cubes, Text.
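A quick Python sketch of how the first three shapes typically look in code (the data is invented for illustration):

```python
# Table: rows that all share the same columns.
table = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Alan"}]

# Tree: nested structure with a single root (e.g., JSON/XML documents).
tree = {"person": {"name": "Ada", "address": {"city": "Zurich"}}}

# Graph: nodes plus edges, here as an adjacency list.
graph = {"Ada": ["Alan"], "Alan": ["Ada", "Grace"], "Grace": []}
```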
What are the units for capacity in data velocity?
Megabytes per cubic centimeter (MB/cm³), i.e., storage density.
What are the units for throughput in data velocity?
Megabytes per second (MB/s).
What are the units for latency in data velocity?
Seconds.
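Combining these three metrics, a back-of-the-envelope calculation in Python (all numbers invented) shows how latency and throughput together determine the time to read a dataset:

```python
size_mb = 1000.0         # data to read: 1 GB = 1000 MB
throughput_mb_s = 100.0  # transfer speed in MB/s
latency_s = 0.005        # delay before the first byte arrives

# Total time = startup delay + transfer time.
total_s = latency_s + size_mb / throughput_mb_s
print(f"{total_s:.3f} s")  # 10.005 s
```

For large transfers, throughput dominates; for many small requests, latency does.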
What is the difference between data, information, and knowledge?
Data: Raw facts; Information: Processed data with context (e.g., averages); Knowledge: Interpreted information combined with experience and insights.
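A tiny Python example makes the data-to-information step concrete (readings invented): the raw measurements are data; their average, placed in context, is information.

```python
readings = [18.2, 19.1, 20.4, 19.8]          # data: raw facts
avg = sum(readings) / len(readings)           # information: processed data
print(f"Average temperature: {avg:.1f} °C")   # context gives it meaning
```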
What are the key design principles of Big Data systems?
Learn from the past; keep the design simple; modularize the architecture; homogeneity in the large; heterogeneity in the small; separate metadata from data; shard the data; replicate the data.
What are some examples of Big Data technologies?
S3, HDFS, XML, HBase, OLAP, Neo4j, Hadoop MapReduce, Spark, MongoDB.
What are some key relational algebra operations?
Selection (σ), projection (π), union (∪), difference (−), Cartesian product (×), and rename (ρ).
What is the purpose of the rename (ρ) operator in relational algebra?
To rename columns in a relation, similar to the AS clause in SQL.
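Here is a minimal Python sketch of these operators, modeling relations as lists of dicts or sets of tuples (toy data; a real engine would optimize these heavily):

```python
employees = [{"name": "Ada", "dept": "R&D"}, {"name": "Alan", "dept": "IT"}]

# Selection (σ): keep only rows satisfying a predicate.
selected = [r for r in employees if r["dept"] == "R&D"]

# Projection (π): keep only some columns.
projected = [{"name": r["name"]} for r in employees]

# Rename (ρ): change a column name, like SQL's AS clause.
renamed = [{"employee": r["name"], "dept": r["dept"]} for r in employees]

# Union (∪) and difference (−) on relations as sets of tuples.
a = {("Ada",), ("Alan",)}
b = {("Alan",), ("Grace",)}
print(a | b)  # union
print(a - b)  # difference
```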
How much data was stored digitally worldwide as of 2021?
Close to 100 zettabytes (ZB).
What prefixes are used for measuring data sizes?
Pico (10⁻¹²), Nano (10⁻⁹), Micro (10⁻⁶), Milli (10⁻³), Kilo (10³), Mega (10⁶), Giga (10⁹), Tera (10¹²), Peta (10¹⁵), Exa (10¹⁸), Zetta (10²¹), Yotta (10²⁴).
What new prefixes were introduced for extremely large data sizes?
Ronna (10²⁷) and Quetta (10³⁰).
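As a worked example in Python, converting the roughly 100 ZB stored worldwide (the 2021 figure above) into bytes, using decimal rather than binary prefixes:

```python
ZB = 10**21                      # one zettabyte, decimal prefix
total_bytes = 100 * ZB           # ~100 ZB stored worldwide (2021)
print(f"{total_bytes:e} bytes")  # 1.000000e+23 bytes
```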
What is the object data storage model?
A model that stores data as self-contained objects (data plus metadata) identified by keys rather than file paths (e.g., Amazon S3, Azure Blob Storage).
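A short sketch with boto3, the AWS SDK for Python; the bucket name and key are placeholders, and the calls assume valid AWS credentials are already configured:

```python
import boto3

s3 = boto3.client("s3")

# Objects are addressed by (bucket, key); there is no real directory
# hierarchy, only keys that happen to look like paths.
s3.put_object(Bucket="example-bucket", Key="data/report.txt",
              Body=b"hello object storage")

obj = s3.get_object(Bucket="example-bucket", Key="data/report.txt")
print(obj["Body"].read())  # b'hello object storage'
```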
What technologies use distributed file systems?
HDFS (the Hadoop Distributed File System).
How are graphs used in Big Data?
As a data model for highly connected data, expressed as nodes and edges; typically Neo4j, queried with Cypher.
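A hedged sketch using the official neo4j Python driver (the URI, credentials, and node labels are placeholders, and a running local Neo4j instance is assumed):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

with driver.session() as session:
    # Cypher: create two nodes and a relationship between them.
    session.run("CREATE (:Person {name: 'Ada'})-[:KNOWS]->"
                "(:Person {name: 'Alan'})")
    # Pattern matching follows edges instead of joining tables.
    result = session.run("MATCH (p:Person)-[:KNOWS]->(q) "
                         "RETURN p.name, q.name")
    for record in result:
        print(record["p.name"], "knows", record["q.name"])

driver.close()
```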
What is the key advantage of separating metadata from data in Big Data systems?
It simplifies data management and improves scalability.
Why is modularizing the architecture important in Big Data?
It allows for easier updates, scalability, and system management.
What is denormalization’s primary trade-off?
Faster reads (fewer joins) at the cost of redundancy and the risk of update anomalies.
How do Big Data technologies address scalability?
Through distributed systems, replication, and sharding.
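A minimal Python sketch of hash-based sharding with replication (the shard count and replication factor are arbitrary illustration values):

```python
import hashlib

NUM_SHARDS = 4
REPLICATION_FACTOR = 2

def shards_for(key: str) -> list[int]:
    """Pick a primary shard by hashing the key, then place replicas
    on the following shards (wrapping around)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    primary = h % NUM_SHARDS
    return [(primary + i) % NUM_SHARDS for i in range(REPLICATION_FACTOR)]

print(shards_for("user:42"))  # e.g., [3, 0]: primary shard plus a replica
```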
Why is heterogeneity in the small important in Big Data systems?
It allows for diverse data processing and storage solutions at localized levels.
What does the acronym OLAP stand for, and what is its purpose?
Online Analytical Processing; used for analyzing multidimensional data (e.g., cubes).
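A small cube-style aggregation with pandas (sales figures invented): pivot_table sums a measure along two dimensions, which is the essence of an OLAP roll-up.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["EU", "US", "EU", "US"],
    "revenue": [100, 150, 120, 180],
})

# Aggregate the 'revenue' measure along the 'year' and 'region' dimensions.
cube = sales.pivot_table(values="revenue", index="year",
                         columns="region", aggfunc="sum")
print(cube)
```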
What is DAG-based distributed query processing?
A model in which a query is compiled into a directed acyclic graph (DAG) of operators; used in systems like Apache Spark to plan and execute large-scale data transformations and computations.
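A PySpark sketch (assumes a local Spark installation): each chained transformation below is recorded lazily as a node in the DAG, and nothing executes until the collect() action.

```python
from pyspark import SparkContext

sc = SparkContext("local", "dag-example")

# Each transformation adds a node to the DAG; nothing runs yet.
rdd = (sc.parallelize(["a b", "b c", "a c"])
         .flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda x, y: x + y))

# collect() is an action: Spark now schedules and executes the DAG.
print(rdd.collect())  # e.g., [('a', 2), ('b', 2), ('c', 2)]

sc.stop()
```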
What are the main advantages of NoSQL over traditional relational databases?
Scalability, flexibility in data models, and better performance for unstructured data.
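A pymongo sketch of that schema flexibility (assumes a MongoDB server on localhost; database, collection, and field names are invented): documents with different shapes coexist in one collection.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["demo"]["users"]

# No fixed schema: documents in the same collection can differ in shape.
users.insert_one({"name": "Ada", "languages": ["Python", "SQL"]})
users.insert_one({"name": "Alan", "office": {"building": "ML", "room": 42}})

print(users.find_one({"name": "Ada"}))
```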
What is YARN, and what role does it play in Big Data?
YARN (Yet Another Resource Negotiator) schedules and allocates cluster resources (CPU, memory) to applications in the Hadoop ecosystem.
What is MapReduce, and why is it significant?
A programming model for processing large datasets in parallel across distributed clusters.
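The canonical illustration is word count; this pure-Python sketch mimics the map, shuffle, and reduce phases on a single machine, where a real cluster would run them in parallel across nodes:

```python
from collections import defaultdict

documents = ["big data is big", "data is everywhere"]

# Map phase: turn each input record into (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each key's values independently.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```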