Introduction - Mabel Flashcards

1
Q

What is data independence?

A

The separation of logical schema (how data is structured) from physical schema (how data is stored).

2
Q

Why is data independence important?

A

It lets the physical storage layer change (e.g., hardware upgrades, new indexes, or different file layouts) without affecting the queries and applications written against the logical schema.
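
As a rough illustration of data independence, the sketch below runs the same logical query before and after a purely physical change (adding an index). The table, column, and index names are made up for this example, and SQLite stands in for any relational database.

```python
import sqlite3

# Hypothetical schema used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [("Ada", "London"), ("Edgar", "Portland"), ("Grace", "London")],
)

# The logical query: it mentions tables and columns, never storage details.
query = "SELECT name FROM users WHERE city = ? ORDER BY name"

before = conn.execute(query, ("London",)).fetchall()

# Physical change: add an index. The logical schema and the query are untouched.
conn.execute("CREATE INDEX idx_users_city ON users (city)")

after = conn.execute(query, ("London",)).fetchall()

print(before == after)  # the answer does not depend on the physical layout
```

The index may change how the rows are found, but not which rows come back.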

3
Q

Who introduced the concept of data independence?

A

Edgar F. Codd, in his 1970 paper "A Relational Model of Data for Large Shared Data Banks."

4
Q

What is normalization?

A

The process of organizing data to reduce redundancy and improve integrity.

5
Q

What is denormalization?

A

Combining data into fewer tables to reduce the need for complex joins, enhancing performance and scalability.
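
The contrast between a normalized and a denormalized layout can be sketched with SQLite. All table and column names here are invented for the example; the point is only that the wide table answers the same question without a join, at the price of repeating the customer name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized: each customer name is stored exactly once.
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    item TEXT
);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (10, 1, 'disk'), (11, 1, 'tape'), (12, 2, 'disk');

-- Denormalized: one wide table, customer name copied into every order row.
CREATE TABLE orders_wide AS
    SELECT o.id AS order_id, c.name AS customer_name, o.item AS item
    FROM orders o JOIN customers c ON c.id = o.customer_id;
""")

# Normalized read path needs a join.
normalized = conn.execute(
    "SELECT c.name, o.item FROM orders o "
    "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
).fetchall()

# Denormalized read path is a single-table scan.
denormalized = conn.execute(
    "SELECT customer_name, item FROM orders_wide ORDER BY order_id"
).fetchall()

# Redundancy: 'Ada' now appears once per order instead of once overall.
ada_copies = conn.execute(
    "SELECT COUNT(*) FROM orders_wide WHERE customer_name = 'Ada'"
).fetchone()[0]
```

Renaming a customer in the wide table means updating every one of their order rows, which is the integrity risk that normalization avoids.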

6
Q

What are the three Vs of Big Data?

A

Volume, Variety, and Velocity.

7
Q

What are examples of NoSQL technologies?

A

Key-value stores, triple stores, column stores, and document stores.

8
Q

What is the purpose of normalization in traditional databases?

A

To prioritize data integrity and minimize redundancy.

9
Q

Why is normalization often relaxed in Big Data systems?

A

To prioritize performance and scalability over strict data integrity.

10
Q

How is velocity defined in Big Data?

A

By capacity (how much data can be stored), throughput (speed of data transfer), and latency (time delay in data availability).

11
Q

How has velocity changed from 1956 to 2024?

A

Capacity increased by a factor of 23 billion, throughput by a factor of 20,800, and latency improved by a factor of 400.

12
Q

What is the timeline of storage systems?

A

1960s: file systems; 1970s: relational databases; 1980s: object era; 2000s: NoSQL era.

13
Q

What is a data model?

A

A framework defining how data is structured, organized, and stored.

14
Q

What are the fundamental shapes of data?

A

Tables, Trees, Graphs, Cubes, Text.

15
Q

What are the units for capacity in data velocity?

A

Megabytes per cubic centimeter (MB/cm³).

16
Q

What are the units for throughput in data velocity?

A

Megabytes per second (MB/s).

17
Q

What are the units for latency in data velocity?

A

Units of time, typically milliseconds (ms).

18
Q

What is the difference between data, information, and knowledge?

A

Data: Raw facts; Information: Processed data with context (e.g., averages); Knowledge: Interpreted information combined with experience and insights.

19
Q

What are the 10 principles of Big Data?

A

They include: learn from the past; keep the design simple; modularize the architecture; homogeneity in the large; heterogeneity in the small; separate metadata from data; shard the data; replicate the data.

20
Q

What are some examples of Big Data technologies?

A

S3, HDFS, XML, HBase, OLAP, Neo4j, Hadoop MapReduce, Spark, MongoDB.

21
Q

What are some key relational algebra operations?

A

Select, Project, Union, Difference, Cartesian Product, Rename.

22
Q

What is the purpose of the rename (ρ) operator in relational algebra?

A

To rename a relation or its attributes (columns), similar to the AS clause in SQL.
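
The six operations can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how a database engine implements them: a relation is a list of dicts, rows are deduplicated to mimic set semantics, and the function names and sample data are invented for the example.

```python
from itertools import product

def _dedupe(rows):
    """Relations are sets: drop duplicate rows, keeping first-seen order."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def select(rel, predicate):        # sigma: keep rows matching a condition
    return [r for r in rel if predicate(r)]

def project(rel, attrs):           # pi: keep only the named columns
    return _dedupe({a: r[a] for a in attrs} for r in rel)

def union(r1, r2):                 # rows in either relation
    return _dedupe(r1 + r2)

def difference(r1, r2):            # rows in r1 that are not in r2
    return [r for r in r1 if r not in r2]

def cartesian_product(r1, r2):     # every pairing; column names must not clash
    return [{**a, **b} for a, b in product(r1, r2)]

def rename(rel, mapping):          # rho: rename columns, e.g. {"name": "employee"}
    return [{mapping.get(k, k): v for k, v in r.items()} for r in rel]

employees = [
    {"name": "Ada", "dept": "R&D"},
    {"name": "Grace", "dept": "Ops"},
    {"name": "Edgar", "dept": "R&D"},
]
floors = [{"floor": 1}, {"floor": 2}]

research = select(employees, lambda r: r["dept"] == "R&D")
depts = project(employees, ["dept"])
renamed = rename(employees, {"name": "employee"})
pairs = cartesian_product(employees, floors)
```

In SQL terms, `rename(employees, {"name": "employee"})` corresponds roughly to `SELECT name AS employee, dept FROM employees`.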

23
Q

How much data was stored digitally worldwide as of 2021?

A

Close to 100 zettabytes (ZB).

24
Q

What prefixes are used for measuring data sizes?

A

Pico (10⁻¹²), Nano (10⁻⁹), Micro (10⁻⁶), Milli (10⁻³), Kilo (10³), Mega (10⁶), Giga (10⁹), Tera (10¹²), Peta (10¹⁵), Exa (10¹⁸), Zetta (10²¹), Yotta (10²⁴).

25
Q

What new prefixes were introduced for extremely large data sizes?

A

Ronna (10²⁷) and Quetta (10³⁰).
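
To make the large prefixes concrete, here is a small sketch that formats a byte count with decimal (powers of 1000) prefixes up to quetta. The function name `humanize` is made up for this example.

```python
# Decimal prefixes from kilo (10^3) up to quetta (10^30), in steps of 1000.
PREFIXES = ["", "k", "M", "G", "T", "P", "E", "Z", "Y", "R", "Q"]

def humanize(num_bytes: int) -> str:
    """Format a non-negative byte count using the largest fitting prefix."""
    exponent = 0
    while num_bytes >= 1000 ** (exponent + 1) and exponent < len(PREFIXES) - 1:
        exponent += 1
    return f"{num_bytes / 1000 ** exponent:g} {PREFIXES[exponent]}B"

print(humanize(1500))      # 1.5 kB
print(humanize(10 ** 23))  # 100 ZB
print(humanize(10 ** 27))  # 1 RB
```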

26
Q

What is the object data storage model?

A

A method of storing data as objects, each identified by a key within a flat namespace such as a bucket (e.g., Amazon S3, Azure Blob Storage).

27
Q

What technologies use distributed file systems?

28
Q

How are graphs used in Big Data?

A

As a model for relationships, typically using Neo4j and Cypher.

29
Q

What is the key advantage of separating metadata from data in Big Data systems?

A

It simplifies data management and improves scalability.

30
Q

Why is modularizing the architecture important in Big Data?

A

It allows for easier updates, scalability, and system management.

31
Q

What is denormalization’s primary trade-off?

A

Increased performance at the cost of potential redundancy.

32
Q

How do Big Data technologies address scalability?

A

Through distributed systems, replication, and sharding.
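
Sharding and replication can be sketched as a key-to-node mapping. This is a minimal illustration under simple assumptions (a fixed number of shards, hash-modulo placement, replicas on the next shards in order); real systems use more elaborate schemes, and the function names are hypothetical.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the key (stable across runs, unlike hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def replicas_for(key: str, num_shards: int, replication_factor: int) -> list[int]:
    """Place copies on the primary shard and the shards that follow it."""
    if replication_factor > num_shards:
        raise ValueError("cannot place more replicas than there are shards")
    first = shard_for(key, num_shards)
    return [(first + i) % num_shards for i in range(replication_factor)]

# Each key lands on 3 of the 5 shards; losing one shard loses no data.
placement = {k: replicas_for(k, num_shards=5, replication_factor=3)
             for k in ["user:1", "user:2", "user:3"]}
```

Sharding spreads the data so that each machine holds only part of it; replication keeps several copies so that a machine failure does not make a shard unavailable.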

33
Q

Why is heterogeneity in the small important in Big Data systems?

A

It allows for diverse data processing and storage solutions at localized levels.

34
Q

What does the acronym OLAP stand for, and what is its purpose?

A

Online Analytical Processing; used for analyzing multidimensional data (e.g., cubes).
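
A cube query aggregates a measure along a chosen set of dimensions. The sketch below shows a roll-up over a tiny in-memory fact table; the data, the dimension names, and the `roll_up` helper are all invented for the example and are not tied to any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact table: three dimensions (year, region, product), one measure.
sales = [
    {"year": 2023, "region": "EU", "product": "disk", "amount": 10},
    {"year": 2023, "region": "US", "product": "disk", "amount": 20},
    {"year": 2024, "region": "EU", "product": "tape", "amount": 5},
    {"year": 2024, "region": "EU", "product": "disk", "amount": 15},
]

def roll_up(facts, dimensions, measure):
    """Sum the measure over every combination of the kept dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        totals[tuple(fact[d] for d in dimensions)] += fact[measure]
    return dict(totals)

by_year = roll_up(sales, ["year"], "amount")
by_year_region = roll_up(sales, ["year", "region"], "amount")
```

Dropping a dimension from the list (here, going from year and region to year alone) is the roll-up; adding one back is the drill-down.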

35
Q

What is DAG-based distributed query processing?

A

A processing model, used in systems like Apache Spark, in which a query is compiled into a directed acyclic graph (DAG) of transformations that is then scheduled and executed across a cluster.
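
The core idea, running each stage only after the stages it depends on, can be sketched on a single machine with the standard-library `graphlib` module (Python 3.9+). This is not Spark's API; the stage names and the `run_dag` helper are made up for illustration.

```python
from graphlib import TopologicalSorter

# Each stage maps to (function, names of the upstream stages it consumes).
stages = {
    "load":   (lambda: [1, 2, 3, 4, 5, 6], []),
    "evens":  (lambda xs: [x for x in xs if x % 2 == 0], ["load"]),
    "square": (lambda xs: [x * x for x in xs], ["load"]),
    "total":  (lambda evens, squares: sum(evens) + sum(squares), ["evens", "square"]),
}

def run_dag(stages):
    """Execute stages in an order where every dependency runs first."""
    graph = {name: deps for name, (_, deps) in stages.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        fn, deps = stages[name]
        results[name] = fn(*(results[d] for d in deps))
    return results

results = run_dag(stages)
print(results["total"])  # 103
```

Here `evens` and `square` do not depend on each other, so a distributed engine would be free to run them in parallel; this sketch simply runs them one after the other.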

36
Q

What are the main advantages of NoSQL over traditional relational databases?

A

Scalability, flexibility in data models, and better performance for unstructured data.

37
Q

What is YARN, and what role does it play in Big Data?

A

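YARN (Yet Another Resource Negotiator) is the resource manager of the Hadoop ecosystem: it allocates cluster resources such as CPU and memory and schedules applications onto them.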

38
Q

What is MapReduce, and why is it significant?

A

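A programming model for processing large datasets in parallel across distributed clusters. It is significant because developers only write the map and reduce functions, while the framework handles distribution, scheduling, and fault tolerance.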
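
The model can be sketched on one machine with the classic word-count example. This is a teaching sketch, not Hadoop's API: the three phases run in a single process, and all function names are invented for the example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

counts = word_count(["big data", "Big clusters"])
print(counts)  # {'big': 2, 'data': 1, 'clusters': 1}
```

In a real cluster, each document (or block of a file) would be mapped on a different machine, and the shuffle would move pairs over the network so that all values for a key reach the same reducer.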