L.12 Flashcards

“Big Data” and Map/Reduce

1
Q

What is “Big Data”?

A

Data generated at a scale that has grown tremendously over the last decade, driven by social media, mobile computing, sensors, the Internet of Things (IoT), and more.

2
Q

What are some examples of Big Data sources?

A

Social media, mobile computing, sensors, Internet-of-Things (IoT), communication networks, and satellite imagery.

3
Q

Give an example of Big Data from social media.

A

Netflix has data on over 150 million subscribers, including what they watch, timestamps, and screenshots.

4
Q

What is the Internet of Things (IoT)?

A

A trend of connecting all kinds of devices to the internet, such as smart fridges, cameras, and agricultural sensors.

5
Q

How many IoT devices were connected to the internet by 2024?

A

Approximately 19 billion devices, producing around 73 Zettabytes of data.

6
Q

What are the three main data sources in IoT?

A
  1. Curated content
  2. User-generated content
  3. Machine-generated signals
7
Q

What are the 5Vs of Big Data?

A
  1. Volume - Massive amounts of data
  2. Velocity - Data incoming at high speed
  3. Variety - Different types of data
  4. Veracity - Accuracy and trustworthiness
  5. Value - Making data useful
8
Q

What are solutions to handle Big Data challenges?

A

- Sharding
- Eventual consistency
- Map/Reduce
- Stream processing
- Edge computing

9
Q

What is Map/Reduce?

A

A programming model designed to process large, distributed datasets efficiently.

10
Q

Who developed Map/Reduce and when?

A

Google researchers in 2004.

11
Q

What was the first major use case of Map/Reduce?

A

Rebuilding Google’s index from a large database of websites.

12
Q

What is the fundamental data structure in Map/Reduce?

A

Key/Value pairs.

13
Q

What are the two main functions in a Map/Reduce program?

A
  1. Map function - Extracts and emits key/value pairs
  2. Reduce function - Processes and combines values for each key
14
Q

Is Map/Reduce a general query language like SQL?

A

No, it is mainly used for aggregate queries over large datasets.

15
Q

How is Map/Reduce implemented in MongoDB?

A

Using JavaScript functions that operate on JSON documents.

16
Q

What happened to Map/Reduce in MongoDB 5.0?

A

It was deprecated in favor of the aggregation pipeline.

17
Q

What does the emit function do in Map/Reduce?

A

It produces key/value pairs from the mapping function.

18
Q

What is an example of a map function in MongoDB?

A

var map = function() {
  // `this` is the document currently being mapped
  emit(this.cust_id, this.amount);
};

19
Q

What is an example of a reduce function in MongoDB?

A

var reduce = function(key, amounts) {
  // `amounts` holds all values emitted for `key`
  return Array.sum(amounts);
};

20
Q

How do you start a Map/Reduce query in MongoDB?

A

db.orders.mapReduce(
  map, reduce,
  {
    query: { status: "A" },
    out: "total_orders"
  }
)
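
To see what this call does without a MongoDB server, the same steps can be imitated in plain JavaScript. This is only a sketch: the sample `orders` array and the helper `simulateMapReduce` are made up for illustration and are not MongoDB APIs, and `emit` is passed in as a parameter because plain JavaScript has no built-in `emit()`.

```javascript
// Hypothetical sample data standing in for the `orders` collection.
const orders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" },
];

// Same logic as the map and reduce cards, with `emit` supplied explicitly.
const map = (doc, emit) => emit(doc.cust_id, doc.amount);
const reduce = (key, amounts) => amounts.reduce((sum, a) => sum + a, 0);

// Illustrative helper (not a MongoDB API): filter, map, group by key, reduce.
function simulateMapReduce(docs, mapFn, reduceFn, matches) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.filter(matches).forEach((doc) => mapFn(doc, emit));
  return [...groups].map(([key, values]) => ({
    _id: key,
    value: reduceFn(key, values),
  }));
}

// Equivalent of query: { status: "A" }
const totalOrders = simulateMapReduce(orders, map, reduce, (d) => d.status === "A");
console.log(totalOrders);
```

The order with status "D" is filtered out before mapping, so it never contributes to a total.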

21
Q

What does the finalize function do in MongoDB’s Map/Reduce?

A

It runs once per key after all reducing is finished and allows additional processing of the reduced value.

22
Q

What are the formal rules of a reduce function?

A
  1. Commutative:
    reduce(key, [A, B]) == reduce(key, [B, A])
  2. Idempotent:
    reduce(key, [reduce(key, valuesArray)]) == reduce(key, valuesArray)
  3. Associative:
    reduce(key, [C, reduce(key, [A, B])]) == reduce(key, [C, A, B])
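
The three rules can be checked mechanically for a concrete reduce function. A small sketch in plain JavaScript, using a summing reduce and arbitrary sample values chosen for illustration; passing for one set of values suggests, but does not prove, that the property holds in general.

```javascript
// Candidate reduce function: sums the values for a key.
const reduce = (key, values) => values.reduce((sum, v) => sum + v, 0);

const [A, B, C] = [5, 7, 11]; // arbitrary sample values
const key = "k";

// 1. Commutative: the order of the values does not matter.
const commutative = reduce(key, [A, B]) === reduce(key, [B, A]);

// 2. Idempotent: re-reducing an already reduced value changes nothing.
const idempotent =
  reduce(key, [reduce(key, [A, B, C])]) === reduce(key, [A, B, C]);

// 3. Associative: partial results can be mixed with raw values.
const associative =
  reduce(key, [C, reduce(key, [A, B])]) === reduce(key, [C, A, B]);

console.log({ commutative, idempotent, associative });
```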
23
Q

Why is Map/Reduce well-suited for distributed databases?

A

Fault-tolerant: Failed tasks can be retried.

Parallelizable: Mapping and reducing can run in parallel.

Sharding-ready: Reduce tasks can first run per shard, then be combined.
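
The sharding-ready point can be illustrated with a short plain-JavaScript sketch. The shard contents and the key "A123" are hypothetical; the idea is only that reducing per shard and then reducing the partial results gives the same answer as reducing everything in one place, provided the reduce function follows the formal rules.

```javascript
const reduce = (key, values) => values.reduce((sum, v) => sum + v, 0);

// Hypothetical emitted values for one key, spread across three shards.
const shards = [[500, 250], [200], [100, 50, 25]];

// Step 1: each shard reduces its own values (these could run in parallel).
const partials = shards.map((values) => reduce("A123", values));

// Step 2: the partial results are reduced again into one final value.
const combined = reduce("A123", partials);

// Reference: reducing all values in one place.
const direct = reduce("A123", shards.flat());

console.log(partials, combined, direct);
```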

24
Q

What are the key takeaways about Map/Reduce?

A

It is a distributed processing model.

It requires well-defined map and reduce functions.

Reduce functions must follow mathematical properties to avoid errors.

25
How does Map/Reduce handle large-scale data efficiently?
By distributing tasks across multiple machines and processing data in parallel.
26
What types of problems is Map/Reduce best suited for?
Large-scale data aggregation, log processing, data transformation, and statistical analysis.
27
How does MongoDB group emitted key-value pairs before reducing?
It collects all values with the same key into an array.
28
What is the advantage of pre-selecting documents with the query option in MongoDB's Map/Reduce?
It improves performance by processing only relevant data.
29
What is an example of using finalize in MongoDB?
var finalize = function(key, reducedVal) {
  // derive the average from the reduced totals
  reducedVal.avg = reducedVal.qty / reducedVal.count;
  return reducedVal;
};
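
A runnable plain-JavaScript sketch of how such a finalize step fits after reduce. The emitted values and the key "abc" are made up for illustration, and the functions are called directly here rather than by MongoDB.

```javascript
// Hypothetical emitted values for one product: each carries a quantity and a count.
const emitted = [
  { qty: 5, count: 1 },
  { qty: 10, count: 1 },
  { qty: 3, count: 1 },
];

// Reduce adds up both fields, so its output has the same shape as its input
// and can safely be reduced again.
const reduce = (key, values) =>
  values.reduce(
    (acc, v) => ({ qty: acc.qty + v.qty, count: acc.count + v.count }),
    { qty: 0, count: 0 }
  );

// Finalize runs on the reduced value and derives the average.
const finalize = (key, reducedVal) => {
  reducedVal.avg = reducedVal.qty / reducedVal.count;
  return reducedVal;
};

const result = finalize("abc", reduce("abc", emitted));
console.log(result);
```

Computing the average in finalize rather than in reduce keeps reduce safe to apply more than once.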
30
Why must a reduce function be idempotent in Map/Reduce?
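Because reduce may be called more than once for the same key, with the output of an earlier call fed back in as input. Idempotence ensures the result stays the same no matter how often that happens.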
31
What is an example of a Map/Reduce query that calculates total sales per product in MongoDB?
db.sales.mapReduce(
  map, reduce,
  {
    out: "total_sales",
    query: { date: { $gt: new Date("2025-01-01") } },
    finalize: finalize
  }
)
32
Why was Map/Reduce deprecated in MongoDB 5.0?
Because the aggregation pipeline provided a more efficient and optimized way to perform similar queries.
33
What is a real-world example of using Map/Reduce?
Calculating the most-watched movies on Netflix by aggregating viewing data across all users.
34
What happens if a reduce function in MongoDB returns a non-idempotent result?
The result may be inconsistent or incorrect when reduce is applied multiple times.
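
A common way to hit this problem is counting with the array length. The plain-JavaScript sketch below uses hypothetical function names (`badCount`, `goodCount`) and simulates a second reduce pass by hand.

```javascript
// Flawed reduce: counts occurrences by looking at the array length.
const badCount = (key, values) => values.length;

// Safer variant: map emits 1 per occurrence and reduce sums the values.
const goodCount = (key, values) => values.reduce((sum, v) => sum + v, 0);

const ones = [1, 1, 1, 1]; // four emitted values for one key

// Simulate a second reduce pass over an already reduced value.
const badTwice = badCount("k", [badCount("k", ones)]);    // the count collapses
const goodTwice = goodCount("k", [goodCount("k", ones)]); // the count is preserved

console.log({ badTwice, goodTwice });
```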
35
What is the difference between Map/Reduce and SQL?
SQL is a declarative language for structured data queries, while Map/Reduce is a programming model for parallel, distributed data processing.
36
How does Map/Reduce support fault tolerance?
By allowing failed tasks to be retried independently without affecting the entire job.
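
The retry idea can be sketched in plain JavaScript. The helper `runWithRetry`, the chunked input, and the simulated failure are all assumptions made for illustration and do not mirror any particular framework.

```javascript
// Illustrative helper: run one task, retrying only that task if it throws.
function runWithRetry(task, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return task(attempt);
    } catch (err) {
      if (attempt === maxAttempts) throw err; // give up after the last attempt
    }
  }
}

// Three hypothetical input chunks, each summed by its own task.
const chunks = [[1, 2], [3, 4], [5, 6]];
const sum = (xs) => xs.reduce((s, x) => s + x, 0);

const partialSums = chunks.map((chunk, i) =>
  runWithRetry((attempt) => {
    // Simulate a worker crash on the first attempt of the second chunk only.
    if (i === 1 && attempt === 1) throw new Error("simulated worker failure");
    return sum(chunk);
  }, 3)
);

const total = sum(partialSums);
console.log(partialSums, total);
```

Only the failed chunk is re-run; the results of the other chunks are reused as they are.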
37
What is the role of sharding in Map/Reduce?
It enables parallel processing by distributing data across multiple nodes.
38
What are the steps in a typical Map/Reduce workflow?
  1. Read input data
  2. Apply the map function to generate key-value pairs
  3. Shuffle and sort key-value pairs
  4. Apply the reduce function to aggregate values
  5. Store the final output
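
The five steps can be traced in a single-machine plain-JavaScript sketch. The log lines are invented sample data, and a real system would spread steps 2 to 4 across many machines.

```javascript
// 1) Read input data (hypothetical log lines, made up for this sketch).
const input = [
  "GET /home 200",
  "GET /cart 500",
  "POST /cart 200",
  "GET /home 404",
];

// 2) Map: emit one [key, value] pair per line, keyed by HTTP status code.
const pairs = input.map((line) => [line.split(" ")[2], 1]);

// 3) Shuffle and sort: group values by key, then order the keys.
const groups = new Map();
for (const [key, value] of pairs) {
  groups.set(key, [...(groups.get(key) ?? []), value]);
}
const sortedKeys = [...groups.keys()].sort();

// 4) Reduce: aggregate the values of each key.
const reduce = (key, values) => values.reduce((sum, v) => sum + v, 0);

// 5) Store the final output (here: a plain object).
const output = {};
for (const key of sortedKeys) output[key] = reduce(key, groups.get(key));

console.log(output);
```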
39
Why is JSON a good fit for Map/Reduce?
Because JSON documents naturally use key-value structures, which align well with Map/Reduce processing.
40
What are some alternatives to Map/Reduce for big data processing?
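Apache Spark, Apache Flink, and MongoDB’s aggregation pipeline. Hadoop is often mentioned alongside them, but its classic processing engine is itself a Map/Reduce implementation.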
41
How does edge computing help with Big Data processing?
By processing data closer to the source, reducing latency and bandwidth usage.
42
What is an example of using Map/Reduce for analyzing customer orders?
Counting how many times each product was ordered and computing the average order quantity.
43
Why is Map/Reduce useful in analyzing log files?
Because it can efficiently process and aggregate large-scale unstructured log data.
44
How does Map/Reduce handle unstructured data?
By converting it into key-value pairs that can be processed and aggregated.
45
How can machine learning benefit from Map/Reduce?
By parallelizing computations such as clustering, recommendation systems, and classification tasks.
46
What are the three steps in Map/Reduce's "shuffle and sort" phase?
- Grouping values by key
- Sorting values within each key group
- Passing grouped values to the reduce function
47
What is a "composite key" in Map/Reduce?
A key consisting of multiple fields used to enable more granular aggregation.
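
A minimal plain-JavaScript sketch of aggregating by a composite key. The sales records and field names are hypothetical; the key is serialized to a string here only so that it can serve as a `Map` key.

```javascript
// Hypothetical sales records, made up for illustration.
const sales = [
  { product: "pen", region: "EU", qty: 3 },
  { product: "pen", region: "US", qty: 5 },
  { product: "pen", region: "EU", qty: 2 },
  { product: "ink", region: "EU", qty: 1 },
];

// Composite key: product and region combined into one key.
const totals = new Map();
for (const s of sales) {
  const key = JSON.stringify({ product: s.product, region: s.region });
  totals.set(key, (totals.get(key) ?? 0) + s.qty);
}

const penEU = totals.get(JSON.stringify({ product: "pen", region: "EU" }));
console.log(Object.fromEntries(totals));
```

Keying by product alone would merge the EU and US figures; the composite key keeps them apart.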
48
How does Map/Reduce compare to streaming processing?
Map/Reduce is batch-oriented, while streaming processing handles real-time data flows.
49
What are some disadvantages of Map/Reduce?
- High latency for batch jobs
- Complex debugging
- Not optimized for real-time analytics
50
How does Google use Map/Reduce?
To process large-scale web data, such as indexing the internet for search.