Week 4: Big Data Programming in the Cloud - MapReduce, Hadoop, Spark, HDFS and Big Data Distros Flashcards
What are the properties of Reduce Functions?
- Reduce functions take as input a key and the list of values associated with that key (grouped by the framework) and combine these values into a smaller set of output values, often a single summary value per key.
- They are designed to work on all data items that share the same key. This grouping allows reduce functions to compute global summaries (such as totals or averages) based on all occurrences of that key.
- While map functions transform individual items, reduce functions finalize the computation by aggregating intermediate results, such as summing counts in the word count example.
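For the word count case mentioned above, a minimal Python sketch of such a reduce function (the name and generator-style signature are illustrative assumptions, not any particular framework's API):

```python
def reduce_word_count(key, values):
    # key: a word; values: the counts emitted by the mappers for that word.
    # The framework has already grouped all values for this key together.
    yield key, sum(values)
```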
What are the properties of Map Functions?
- Map functions work on a single input data item at a time (typically a key–value pair). They transform each individual piece of data into one or more intermediate key–value pairs.
- Since each data item is processed independently, map functions can be executed in parallel across many nodes. This statelessness (with respect to other data items) makes them ideal for distributed computing.
- The map phase is responsible only for the transformation of data, leaving aggregation or summarization (e.g., counting, summing) to be handled later by the reduce function.
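A matching map function sketch for word count, again with an assumed framework-agnostic signature; each call handles exactly one record and shares no state with other calls:

```python
def map_word_count(offset, line):
    # offset: e.g. the position of the line in the file (unused here);
    # line: one record of input text.
    for word in line.split():
        yield word, 1
```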
What is the primary motivation for using the MapReduce Paradigm?
- MapReduce was designed to simplify processing huge datasets that are distributed over clusters of commodity machines. It abstracts away the complexities involved in parallelizing tasks across thousands of machines.
- The framework automatically handles machine failures, retries tasks, and manages load balancing so that individual node failures do not disrupt the overall computation.
- By dividing the task into two simple functions (map and reduce), programmers can focus on the specific computation while the framework handles data distribution, scheduling, and low-level I/O management.
What are examples of MapReduce functions?
Word Count:
Counting how many times each distinct word appears in a large text file or corpus. This classic example demonstrates the basic principles: mapping each word to a count of 1 and reducing by summing these counts.
Image Smoothing:
Applying filtering or smoothing operations to large images by dividing the image into parts, processing each part independently, and then combining the results. This demonstrates the paradigm’s applicability beyond text processing.
Pi Estimation:
Distributing the work of calculating an approximation of Pi across many nodes. Each node performs a calculation on a subset of data, and the results are combined in the reduce phase to compute the final estimate.
PageRank:
Implementing the PageRank algorithm for ranking web pages. This example uses multiple MapReduce jobs to iteratively compute the importance scores of pages based on link structure.
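As a rough illustration of the PageRank item above, here is a hedged Python sketch of a single iteration; the representation of a page's state and the omission of dangling pages and rank normalisation are simplifying assumptions:

```python
DAMPING = 0.85  # assumed damping factor

def map_pagerank(page, state):
    rank, outlinks = state
    # Pass the link structure through so the next iteration still has it.
    yield page, ("links", outlinks)
    # Distribute the page's current rank evenly across its outgoing links.
    for target in outlinks:
        yield target, ("rank", rank / len(outlinks))

def reduce_pagerank(page, values):
    outlinks, incoming = [], 0.0
    for kind, payload in values:
        if kind == "links":
            outlinks = payload
        else:
            incoming += payload
    # Standard PageRank update with the damping factor.
    yield page, ((1 - DAMPING) + DAMPING * incoming, outlinks)
```

Running this map/reduce pair repeatedly, feeding each job's output into the next, corresponds to the "multiple MapReduce jobs" the card describes.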
Why are distinct words counted in the reduce phase in the word count example?
Because no single mapper sees every occurrence of a word: each mapper only emits (word, 1) pairs for the words in its own chunk. Only after the framework groups all intermediate pairs by key does one function receive every occurrence of a given word, so the totals can only be computed in the reduce phase. This separation of concerns also allows the map phase to run quickly and in parallel, while the reduce phase performs the necessary aggregation.
What is Hadoop?
Hadoop is an open‐source distributed computing framework designed to store and process huge volumes of data across clusters of commodity hardware.
Describe execution initialization
When a user submits a job, the job tracker (the master) receives the job and splits the large input files into manageable chunks (blocks of typically 64 MB); an input format class then extracts individual records from each chunk.
The master then assigns these chunks to task trackers (worker nodes), initiating the map tasks.
How does Hadoop process large datasets using MapReduce?
Data is split into large fixed-size chunks (typically 64 or 128 MB) and stored on a distributed file system (HDFS).
Map functions process these chunks in parallel, converting input records (e.g., lines of text) into intermediate key–value pairs.
The framework then shuffles and groups these intermediate pairs by key using a partition function.
Reduce functions aggregate or further process the grouped data and write the final output back to HDFS.
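One common way to run such a job is Hadoop Streaming, where the mapper and reducer are ordinary scripts that read from stdin and write tab-separated key–value pairs to stdout. The word count scripts below are a minimal sketch (the file names are arbitrary):

```python
#!/usr/bin/env python3
# mapper.py: reads raw input lines and emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: Hadoop Streaming delivers the pairs sorted by key, so all
# counts for a word arrive consecutively and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```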
Describe partition functions
As mappers process input data, they emit intermediate key–value pairs.
A partition function (often based on hashing) determines which reducer will receive each key, ensuring that data for the same key is grouped together for processing.
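A minimal sketch of such a partition function (the signature is an assumption; in Hadoop itself the default HashPartitioner plays this role):

```python
def partition(key, num_reducers):
    # All pairs sharing a key map to the same reducer index. A production
    # partitioner must use a hash that is stable across processes.
    return hash(key) % num_reducers
```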
Describe the overall flow of Hadoop
Job Submission: A client submits a MapReduce job to the job tracker.
Data Splitting & Task Assignment: The job tracker splits input files (stored in HDFS) into chunks and assigns map tasks to available task trackers, ideally placing tasks where data is locally stored.
Mapping Phase: Each map task reads its assigned data chunk and applies the map function to each record, generating intermediate key–value pairs.
Shuffling Phase: Once mapping is done, the system shuffles and sorts the intermediate data so that all values with the same key are grouped.
Reducing Phase: Reducers fetch the grouped data, process it via the reduce function, and generate output.
Storage of Results: The final processed data is written back to HDFS for later use.
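The flow above can be imitated in a few lines of single-process Python; this toy driver only illustrates the phases and is not how Hadoop is implemented:

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    # Mapping phase: apply the map function to every input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffling phase: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reducing phase: aggregate each group and collect the final output.
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output
```

Driving it with the word count map and reduce sketches from the earlier cards, e.g. run_job([(0, "to be or not to be")], map_word_count, reduce_word_count), yields pairs such as ("to", 2) and ("be", 2).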
Describe how mappers are executed in Hadoop MapReduce
Execution: Each mapper runs on a worker node and processes one chunk of data.
Input Handling: It uses an input format to break its data chunk into individual records.
Processing: The map function is applied to each record, generating intermediate key–value pairs.
Interim Storage: These outputs are initially kept in memory and then periodically flushed to disk if they exceed memory limits.
Describe how reducers are executed in Hadoop MapReduce
Data Retrieval: Once the mapping phase completes, reducers pull the intermediate data from the mappers over the network.
Grouping & Sorting: Reducers sort the data by key and group all values associated with the same key.
Processing: The reduce function is applied to each group, aggregating or transforming the data as needed.
Final Output: The results are then written back to HDFS, making them available for further processing or analysis.
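The grouping-and-sorting step can be pictured with a short sketch: sort the fetched pairs by key, then hand each key and its list of values to the reduce function (itertools.groupby only merges consecutive equal keys, which is why the sort comes first):

```python
from itertools import groupby
from operator import itemgetter

def group_by_key(pairs):
    # pairs: the intermediate (key, value) pairs fetched from the mappers.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]
```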
What are data pipelines, and what steps do they usually have?
Data Pipelines are systems that automate the flow of data from raw sources to a structured, usable format. They typically involve steps such as:
Data Extraction: Collecting raw data (e.g., web logs, ad clicks).
Data Transformation: Cleaning, aggregating, or converting data into a usable format.
Data Loading: Storing the processed data into databases, warehouses, or analytics systems.
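A toy end-to-end pipeline with these three steps might look as follows; every file name, field name, and format here is an assumption made purely for illustration:

```python
import csv
import json

def extract(log_path):
    # Extraction: read raw click events from a JSON-lines log file.
    with open(log_path) as f:
        return [json.loads(line) for line in f]

def transform(events):
    # Transformation: drop malformed events and normalise the kept fields.
    return [
        {"user": e["user_id"], "ad": e["ad_id"], "ts": int(e["timestamp"])}
        for e in events
        if "user_id" in e and "ad_id" in e and "timestamp" in e
    ]

def load(rows, out_path):
    # Loading: write the cleaned rows to a CSV file for downstream analytics.
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user", "ad", "ts"])
        writer.writeheader()
        writer.writerows(rows)
```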
What are the benefits of data pipelines?
Reliability & Speed: They reliably handle large-scale and high-velocity data, ensuring that data is organized and processed quickly for analytics and reporting.
Business Value: By transforming raw events into structured data, pipelines enable targeted content, ad analytics, and real-time decision making.
Scalability: They support continuous growth in data volume and variety, which is crucial for companies handling billions of transactions daily.
What were the benefits of Hadoop?
Scalability: Hadoop’s distributed architecture allows companies to store and process petabytes of data on clusters of inexpensive, commodity hardware—far beyond the capacity of legacy systems (which typically scaled to around 100 TB).
Cost-Effectiveness: Lower storage costs due to commodity hardware and efficient use of resources.
Fault Tolerance: HDFS replicates data across multiple nodes, ensuring data availability even when individual disks or nodes fail.
Unified Processing Framework: A single framework (MapReduce) supports a variety of jobs, reducing the need to maintain specialized clusters for different tasks.
Improved Data Accessibility: By centralizing data storage and processing, Hadoop made data available to more teams within the organization, spurring innovative use cases and analyses.
How does Yahoo use Hadoop?
Process Massive Data Volumes: Handle billions of transactions and hundreds of terabytes of data per day by processing data in five-minute batches.
Enable Diverse Analytics: Transform raw log files, ad impressions, and user interactions into structured data sets for personalized content, targeted advertising analytics, and operational monitoring.
Improve Data Availability: Extend data retention (e.g., storing logs for 40–60 days) and allow more internal teams to access and analyze data directly, fueling innovation across the company.
Support Real-Time Feedback: Quickly process data to provide near-real-time insights that are critical for applications such as ad campaign budgeting and operational decisions.
What are the key challenges of managing big data pipelines at Yahoo?
Resource Contention: Operating in a multi-tenant environment means that different teams share the same hardware resources, which can lead to contention and affect performance.
Legacy System Integration: Migrating data from older proprietary systems to Hadoop required building new data paths and educating teams on the HDFS access patterns.
Complexity in Handling Heterogeneous Data: Managing various data types (structured, semi-structured, and unstructured) and ensuring consistent processing across different platforms.
Latency Management: Balancing the need for rapid data processing (e.g., five-minute batch processing) with the complexity of data transformation pipelines.
What are the key benefits of managing big data pipelines at Yahoo?
Scalability & Efficiency: Hadoop’s architecture allows Yahoo to scale processing to hundreds of terabytes per day using cost-effective hardware.
Enhanced Data Accessibility: With data stored in HDFS, more teams can directly access and build on the data, which has led to new use cases and innovation.
Unified Platform: The single processing framework (MapReduce) and the ability to use common tools (e.g., Pig Latin) simplify the management of diverse data jobs.
Business Impact: Faster data processing enables timely insights (often within 30–90 minutes) that directly improve operational decision-making, such as dynamically managing ad campaign budgets.
What are the properties of Map Functions?
Individual Data Processing:
Each map function operates on a single key-value pair (or data item). For example, in a word count task, it takes a line of text and emits a key-value pair for every word (typically (word, 1)).
Transformation and Independence:
The map function transforms the input data independently. It is stateless regarding other data items, making it highly parallelizable.
Intermediate Output:
It produces a list of intermediate key-value pairs that are later grouped by keys.
What are the properties of Reduce functions?
Aggregation Over Groups:
The reduce function processes a group of values that share the same key. It aggregates these values, for example by summing the counts for a word in a word count program.
Group-Based Processing:
After the map phase, the framework groups all intermediate pairs by key. The reduce function then combines these values (often resulting in a single output per key, though it can be more flexible).
Final Output:
It produces the final aggregated result for each key.
What are some MapReduce Examples?
Word Count:
- Map: Processes each line of text and emits (word, 1) for each word.
- Reduce: Groups the key-value pairs by word and sums the counts to produce the total occurrences.
Image Smoothing:
- Map: Could process individual pixels or regions, applying a smoothing filter.
- Reduce: Aggregates the processed pixel values to create a smoother image.
PageRank:
Implements the iterative calculation of page rankings across a network of web pages by distributing the computation.
Pi Estimation:
Uses Monte Carlo methods where the map phase generates random points, and the reduce phase aggregates the results to estimate the value of π.
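A hedged sketch of the Monte Carlo approach just described; the per-task sample count and the use of a single reduce key are assumptions:

```python
import random

def map_pi(task_id, num_samples):
    # Throw num_samples random darts at the unit square and count how many
    # land inside the quarter circle of radius 1.
    inside = sum(
        1 for _ in range(num_samples)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    yield "pi", (inside, num_samples)

def reduce_pi(key, partial_counts):
    # Combine the per-task tallies: pi is approximately 4 * (inside / total).
    inside = sum(i for i, _ in partial_counts)
    total = sum(n for _, n in partial_counts)
    yield key, 4.0 * inside / total
```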
What is the primary motivation for using the MapReduce paradigm?
Simplified Distributed Processing:
MapReduce abstracts the complexity of writing parallel, distributed code. Instead of managing communication, fault tolerance, and load balancing manually, you only write two simple functions—map and reduce. The framework handles the rest.
Fault Tolerance and Scalability:
Designed to work on clusters with thousands of machines, MapReduce automatically deals with hardware failures and retries, making it ideal for processing terabytes of data on commodity machines.
Efficiency Through Data Locality:
The paradigm pushes computation to where the data resides, minimizing network bottlenecks and maximizing processing efficiency.
Why count distinct words in the reduce phase (word count example)?
By separating the emission and aggregation tasks, MapReduce leverages its built-in grouping mechanism to efficiently count words in parallel.
What is HDFS?
HDFS (Hadoop Distributed File System) is the primary storage engine for Hadoop. It’s designed to handle very large files using a distributed, fault‐tolerant architecture.