L.12 Flashcards
“Big Data” and Map/Reduce
What is “Big Data”?
The tremendous growth in the amount of data generated over the last decade, driven by social media, mobile computing, sensors, IoT, and more.
What are some examples of Big Data sources?
Social media, mobile computing, sensors, Internet-of-Things (IoT), communication networks, and satellite imagery.
Give an example of Big Data from social media.
Netflix has data on over 150 million subscribers, including what they watch, timestamps, and screenshots.
What is the Internet of Things (IoT)?
A trend of connecting all kinds of devices to the internet, such as smart fridges, cameras, and agricultural sensors.
How many IoT devices were connected to the internet by 2024?
Approximately 19 billion devices, producing around 73 Zettabytes of data.
What are the three main data sources in IoT?
- Curated content
- User-generated content
- Machine-generated signals
What are the 5Vs of Big Data?
- Volume - Massive amounts of data
- Velocity - Data incoming at high speed
- Variety - Different types of data
- Veracity - Accuracy and trustworthiness
- Value - Making data useful
What are solutions to handle Big Data challenges?
- Sharding
- Eventual consistency
- Map/Reduce
- Stream processing
- Edge computing
What is Map/Reduce?
A programming model designed to process large, distributed datasets efficiently.
Who developed Map/Reduce and when?
Google researchers in 2004.
What was the first major use case of Map/Reduce?
Rebuilding Google’s index from a large database of websites.
What is the fundamental data structure in Map/Reduce?
Key/Value pairs.
What are the two main functions in a Map/Reduce program?
- Map function - Extracts and emits key/value pairs
- Reduce function - Processes and combines values for each key
Is Map/Reduce a general query language like SQL?
No, it is mainly used for aggregate queries over large datasets.
How is Map/Reduce implemented in MongoDB?
Using JavaScript functions that operate on JSON documents.
What happened to Map/Reduce in MongoDB 5.0?
It was deprecated in favor of the aggregation pipeline.
What does the emit function do in Map/Reduce?
It produces key/value pairs from the mapping function.
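To make the role of emit concrete, here is a minimal single-machine sketch in plain JavaScript. It is not MongoDB's engine: the `orders` array, the `runMapReduce` driver, and passing `emit` as a parameter (MongoDB provides it as a global inside map) are all assumptions made for illustration.

```javascript
// Hypothetical in-memory stand-in for an "orders" collection.
const orders = [
  { cust_id: "A123", amount: 500 },
  { cust_id: "A123", amount: 250 },
  { cust_id: "B212", amount: 200 },
];

// Illustrative driver: runs map over every document, groups the emitted
// pairs by key, then calls reduce once per key.
function runMapReduce(docs, map, reduce) {
  const groups = new Map(); // key -> array of emitted values

  // emit() records one key/value pair per call.
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };

  // Map phase: `this` is bound to the current document, as in MongoDB.
  for (const doc of docs) map.call(doc, emit);

  // Reduce phase: combine all values collected for each key.
  const result = {};
  for (const [key, values] of groups) result[key] = reduce(key, values);
  return result;
}

const mapFn = function (emit) {
  emit(this.cust_id, this.amount);
};
const reduceFn = (key, amounts) => amounts.reduce((a, b) => a + b, 0);

const totals = runMapReduce(orders, mapFn, reduceFn);
console.log(totals); // { A123: 750, B212: 200 }
```

Each call to emit adds one value to the list for its key, and reduce only ever sees the values that share a key.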
What is an example of a map function in MongoDB?
var map = function() {
  // "this" is the document currently being processed
  emit(this.cust_id, this.amount);
};
What is an example of a reduce function in MongoDB?
var reduce = function(key, amounts) {
  // amounts holds every value emitted for this key
  return Array.sum(amounts);
};
How do you start a Map/Reduce query in MongoDB?
db.orders.mapReduce(
  map, reduce,
  {
    query: { status: "A" },
    out: "total_orders"
  }
)
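Since Map/Reduce is deprecated in MongoDB 5.0, the same query can be sketched as an aggregation pipeline. This is a rough equivalent under the assumption that the orders documents carry the `status`, `cust_id`, and `amount` fields used above. The output field name `value` is chosen here to resemble mapReduce output and is not required by MongoDB.

```javascript
// Aggregation pipeline sketch mirroring the mapReduce call above.
const pipeline = [
  { $match: { status: "A" } },                                 // same filter as `query`
  { $group: { _id: "$cust_id", value: { $sum: "$amount" } } }, // map + reduce in one stage
  { $out: "total_orders" },                                    // same target as `out`
];

// In mongosh, against an existing orders collection:
//   db.orders.aggregate(pipeline)
```

The $group stage plays the role of both functions: `_id` is the emitted key and `$sum` is the reduction.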
What does the finalize function do in MongoDB’s Map/Reduce?
It runs once per key after the reduce step and allows additional processing of the reduced value.
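A typical use of finalize is computing an average, which reduce itself cannot do safely. The sketch below runs in plain JavaScript without a database. The names `reduceFn` and `finalizeFn` and the sample values are assumptions for illustration.

```javascript
// Reduce combines partial { count, total } objects. Its output has the
// same shape as its inputs, so it can safely be re-applied to its own results.
function reduceFn(key, values) {
  return values.reduce(
    (acc, v) => ({ count: acc.count + v.count, total: acc.total + v.total }),
    { count: 0, total: 0 }
  );
}

// Finalize receives the fully reduced value for a key and derives the average.
function finalizeFn(key, reduced) {
  return { ...reduced, avg: reduced.total / reduced.count };
}

// Values a map function might have emitted for customer "A123" (hypothetical).
const emitted = [
  { count: 1, total: 500 },
  { count: 1, total: 250 },
];

const reduced = reduceFn("A123", emitted);
const finalized = finalizeFn("A123", reduced);
console.log(finalized); // { count: 2, total: 750, avg: 375 }
```

In MongoDB, a function like this would be passed through the `finalize` option of mapReduce.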
What are the formal rules of a reduce function?
- Commutative: reduce(key, [A, B]) == reduce(key, [B, A])
- Idempotent: reduce(key, [reduce(key, valuesArray)]) == reduce(key, valuesArray)
- Associative: reduce(key, [C, reduce(key, [A, B])]) == reduce(key, [C, A, B])
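These rules can be spot-checked on sample values. The sketch below uses a hypothetical helper, `checkReduceRules`, to compare a sum reducer with a naive average reducer. Passing on one sample does not prove a rule holds in general, but a single failure is enough to show a reducer is unsafe.

```javascript
const sumReduce = (key, values) => values.reduce((a, b) => a + b, 0);
const avgReduce = (key, values) =>
  values.reduce((a, b) => a + b, 0) / values.length;

// Evaluate the three rules for one reducer on three sample values.
function checkReduceRules(reduce, [A, B, C]) {
  const key = "k";
  return {
    commutative: reduce(key, [A, B]) === reduce(key, [B, A]),
    idempotent:
      reduce(key, [reduce(key, [A, B, C])]) === reduce(key, [A, B, C]),
    associative:
      reduce(key, [C, reduce(key, [A, B])]) === reduce(key, [C, A, B]),
  };
}

const sumCheck = checkReduceRules(sumReduce, [1, 2, 6]);
const avgCheck = checkReduceRules(avgReduce, [1, 2, 6]);

console.log(sumCheck); // all three rules hold
console.log(avgCheck); // associative is false: avg([6, 1.5]) is 3.75, avg([6, 1, 2]) is 3
```

The average reducer fails because a partial average forgets how many values it summarized. Carrying a count and a total instead restores associativity.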
Why is Map/Reduce well-suited for distributed databases?
- Fault-tolerant: Failed tasks can be retried.
- Parallelizable: Mapping and reducing can run in parallel.
- Sharding-ready: Reduce tasks can first run per shard, then be combined.
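The sharding point can be illustrated with a small sketch: reduce each shard's pairs locally, then reduce the partial results again. The two-shard layout, the `reducePairs` helper, and the sample pairs are assumptions for illustration, not how MongoDB schedules the work internally.

```javascript
// Emitted key/value pairs as they might sit on two hypothetical shards.
const shards = [
  [{ key: "A123", value: 500 }, { key: "B212", value: 200 }],
  [{ key: "A123", value: 250 }],
];

const reduceFn = (key, values) => values.reduce((a, b) => a + b, 0);

// Group pairs by key and reduce each group to a single pair.
function reducePairs(pairs, reduce) {
  const groups = new Map();
  for (const { key, value } of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups].map(([key, values]) => ({ key, value: reduce(key, values) }));
}

// Step 1: every shard reduces its own pairs (this step could run in parallel).
const partials = shards.map((pairs) => reducePairs(pairs, reduceFn));

// Step 2: the partial results are reduced once more to get the final totals.
const combined = reducePairs(partials.flat(), reduceFn);

// Reference: reducing all pairs in one pass gives the same answer.
const singlePass = reducePairs(shards.flat(), reduceFn);

console.log(combined); // [ { key: 'A123', value: 750 }, { key: 'B212', value: 200 } ]
```

The two-step result matches the single pass only because the reducer is commutative and associative and accepts its own output as input.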
What are the key takeaways about Map/Reduce?
- It is a distributed processing model.
- It requires well-defined map and reduce functions.
- Reduce functions must follow mathematical properties to avoid errors.