Week 4: Big Data Programming in the Cloud - MapReduce, Hadoop, Spark, HDFS and Big Data Distros Flashcards

1
Q

What are the properties of Reduce Functions?

A
  • Reduce functions take as input a key and the list of values associated with that key (grouped by the framework) and combine these values into a smaller set of output values, often a single summary value per key.
  • They are designed to work on all data items that share the same key. This grouping allows reduce functions to compute global summaries (such as totals or averages) based on all occurrences of that key.
  • While map functions transform individual items, reduce functions finalize the computation by aggregating intermediate results, such as summing counts in the word count example.
1
Q

What are the properties of Map Functions?

A
  • Map functions work on a single input data item at a time (typically a key–value pair). They transform each individual piece of data into one or more intermediate key–value pairs.
  • Since each data item is processed independently, map functions can be executed in parallel across many nodes. This statelessness (with respect to other data items) makes them ideal for distributed computing.
  • The map phase is responsible only for the transformation of data, leaving aggregation or summarization (e.g., counting, summing) to be handled later by the reduce function.
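The division of labor between the two phases can be sketched in plain Python. This is a toy, single-process model of the framework's grouping step, not Hadoop's actual API; `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names used only for illustration:

```python
from collections import defaultdict

def map_fn(line):
    # Map: process a single input record independently,
    # emitting an intermediate (word, 1) pair for every word.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: combine all values that share a key into one summary value.
    return (key, sum(values))

def run_mapreduce(records):
    # Map phase: each record is processed independently (parallelizable).
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Shuffle: group intermediate pairs by key (the framework does this in Hadoop).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: aggregate each group into a final result.
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = run_mapreduce(["the cat sat", "the cat ran"])
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```

Note that the mapper never counts anything; it only emits pairs, leaving all aggregation to the reducer after the framework has grouped by key.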
2
Q

What is the primary motivation for using the MapReduce Paradigm?

A
  • MapReduce was designed to simplify processing huge datasets that are distributed over clusters of commodity machines. It abstracts away the complexities involved in parallelizing tasks across thousands of machines.
  • The framework automatically handles machine failures, retries tasks, and manages load balancing so that individual node failures do not disrupt the overall computation.
  • By dividing the task into two simple functions (map and reduce), programmers can focus on the specific computation while the framework handles data distribution, scheduling, and low-level I/O management.
3
Q

What are examples of MapReduce functions?

A

Word Count:
Counting how many times each distinct word appears in a large text file or corpus. This classic example demonstrates the basic principles: mapping each word to a count of 1 and reducing by summing these counts.

Image Smoothing:
Applying filtering or smoothing operations to large images by dividing the image into parts, processing each part independently, and then combining the results. This demonstrates the paradigm’s applicability beyond text processing.

Pi Estimation:
Distributing the work of calculating an approximation of Pi across many nodes. Each node performs a calculation on a subset of data, and the results are combined in the reduce phase to compute the final estimate.

PageRank:
Implementing the PageRank algorithm for ranking web pages. This example uses multiple MapReduce jobs to iteratively compute the importance scores of pages based on link structure.
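The Pi estimation example above can be sketched as a toy Monte Carlo computation in plain Python. Each "map task" counts how many random points in the unit square land inside the quarter circle, and the "reduce" step sums those counts into the final estimate; `map_pi` and `reduce_pi` are hypothetical names, not a real framework API:

```python
import random

def map_pi(num_samples, seed):
    # One "map task": sample points in the unit square and count
    # how many fall inside the quarter circle of radius 1.
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

def reduce_pi(counts, samples_per_task):
    # The "reduce" step: sum the per-task counts and form the estimate
    # pi ~= 4 * (points inside circle) / (total points).
    total = len(counts) * samples_per_task
    return 4.0 * sum(counts) / total

counts = [map_pi(10_000, seed) for seed in range(8)]  # 8 independent map tasks
pi_estimate = reduce_pi(counts, 10_000)
```

Because each map task depends only on its own seed, the tasks could run on separate nodes with no coordination until the final reduce.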

4
Q

Why are distinct words counted in the reduce phase in the word count example?

A

Each mapper sees only its own chunk of the input, so no single mapper can know a word's global count. The framework's shuffle step groups all (word, 1) pairs for a given word together, so the reduce phase is the first point at which every occurrence of that word is visible and can be summed. This separation of concerns allows the map phase to run quickly and in parallel, while the reduce phase performs the necessary aggregation.

5
Q

What is Hadoop?

A

Hadoop is an open‐source distributed computing framework designed to store and process huge volumes of data across clusters of commodity hardware.

6
Q

Describe execution initialization

A

When a user submits a job, the job tracker (the master) receives it and splits the large input files into manageable chunks (typically 64 MB); an input format class extracts individual records from each chunk.
The master then assigns these chunks to task trackers (worker nodes), initiating the map tasks.

7
Q

How does Hadoop process large datasets using Map Reduce?

A

Data is split into large fixed-size chunks (typically 64 or 128 MB) and stored on a distributed file system (HDFS).
Map functions process these chunks in parallel, converting input records (e.g., lines of text) into intermediate key–value pairs.
The framework then shuffles and groups these intermediate pairs by key using a partition function.
Reduce functions aggregate or further process the grouped data and write the final output back to HDFS.

8
Q

Describe partition functions

A

As mappers process input data, they emit intermediate key–value pairs.
A partition function (often based on hashing) determines which reducer will receive each key, ensuring that data for the same key is grouped together for processing.
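A hash-based partition function can be sketched as follows. This mimics the behavior of Hadoop's default HashPartitioner (hash of the key modulo the number of reducers), but the Python code itself is only an illustrative stand-in:

```python
import hashlib

def partition(key, num_reducers):
    # Deterministic hash of the key, modulo the reducer count.
    # Every occurrence of the same key maps to the same reducer,
    # so that reducer sees all values for the key.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_reducers

# All copies of a key agree on a destination reducer:
dest = partition("cat", 4)
```

A stable hash (rather than Python's per-process `hash`) is used here so the mapping is reproducible across runs, which matters when mappers on different machines must agree on where each key goes.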

9
Q

Describe the overall flow of Hadoop

A

Job Submission: A client submits a MapReduce job to the job tracker.
Data Splitting & Task Assignment: The job tracker splits input files (stored in HDFS) into chunks and assigns map tasks to available task trackers, ideally placing tasks where data is locally stored.
Mapping Phase: Each map task reads its assigned data chunk, applies the map function repeatedly to generate intermediate key–value pairs.
Shuffling Phase: Once mapping is done, the system shuffles and sorts the intermediate data so that all values with the same key are grouped.
Reducing Phase: Reducers fetch the grouped data, process it via the reduce function, and generate output.
Storage of Results: The final processed data is written back to HDFS for later use.

10
Q

Describe how Mappers are executed in Hadoop Map-Reduce

A

Execution: Each mapper runs on a worker node and processes one chunk of data.
Input Handling: It uses an input format to break its data chunk into individual records.
Processing: The map function is applied to each record, generating intermediate key–value pairs.
Interim Storage: These outputs are initially kept in memory and then periodically flushed to disk if they exceed memory limits.
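The interim-storage behavior can be modeled with a toy spill buffer: pairs accumulate in memory and are flushed to "disk" (a list standing in for spill files) once a limit is exceeded. This is a simplification; real Hadoop mappers use a sorted in-memory buffer whose size and spill threshold are configurable:

```python
class SpillBuffer:
    # Toy model of a mapper's output buffer with spill-to-disk.
    def __init__(self, limit):
        self.limit = limit
        self.buffer = []   # in-memory intermediate pairs
        self.spills = []   # each entry models one on-disk spill file

    def emit(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.limit:
            self.spill()

    def spill(self):
        # Hadoop sorts each spill by key before writing it out.
        if self.buffer:
            self.spills.append(sorted(self.buffer))
            self.buffer = []

buf = SpillBuffer(limit=3)
for pair in [("b", 1), ("a", 1), ("c", 1), ("a", 1)]:
    buf.emit(*pair)
buf.spill()  # final flush at the end of the map task
# buf.spills == [[("a", 1), ("b", 1), ("c", 1)], [("a", 1)]]
```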

11
Q

Describe how “Reducers” are executed in Hadoop Map-Reduce

A

Data Retrieval: Once the mapping phase completes, reducers pull the intermediate data from the mappers over the network.
Grouping & Sorting: Reducers sort the data by key and group all values associated with the same key.
Processing: The reduce function is applied to each group, aggregating or transforming the data as needed.
Final Output: The results are then written back to HDFS, making them available for further processing or analysis.

12
Q

What are data pipelines, and what steps do they usually have?

A

Data Pipelines are systems that automate the flow of data from raw sources to a structured, usable format. They typically involve steps such as:

Data Extraction: Collecting raw data (e.g., web logs, ad clicks).
Data Transformation: Cleaning, aggregating, or converting data into a usable format.
Data Loading: Storing the processed data into databases, warehouses, or analytics systems.

13
Q

What are the benefits of data pipelines?

A

Reliability & Speed: They reliably handle large-scale and high-velocity data, ensuring that data is organized and processed quickly for analytics and reporting.
Business Value: By transforming raw events into structured data, pipelines enable targeted content, ad analytics, and real-time decision making.
Scalability: They support continuous growth in data volume and variety, which is crucial for companies handling billions of transactions daily.

14
Q

What were the benefits of Hadoop?

A

Scalability: Hadoop’s distributed architecture allows companies to store and process petabytes of data on clusters of inexpensive, commodity hardware—far beyond the capacity of legacy systems (which typically scaled to around 100 TB).
Cost-Effectiveness: Lower storage costs due to commodity hardware and efficient use of resources.
Fault Tolerance: HDFS replicates data across multiple nodes, ensuring data availability even when individual disks or nodes fail.
Unified Processing Framework: A single framework (MapReduce) supports a variety of jobs, reducing the need to maintain specialized clusters for different tasks.
Improved Data Accessibility: By centralizing data storage and processing, Hadoop made data available to more teams within the organization, spurring innovative use cases and analyses.

15
Q

How does Yahoo use Hadoop?

A

Process Massive Data Volumes: Handle billions of transactions and hundreds of terabytes of data per day by processing data in five-minute batches.
Enable Diverse Analytics: Transform raw log files, ad impressions, and user interactions into structured data sets for personalized content, targeted advertising analytics, and operational monitoring.
Improve Data Availability: Extend data retention (e.g., storing logs for 40–60 days) and allow more internal teams to access and analyze data directly, fueling innovation across the company.
Support Real-Time Feedback: Quickly process data to provide near-real-time insights that are critical for applications such as ad campaign budgeting and operational decisions.

16
Q

What are the key challenges of managing big data pipelines at Yahoo?

A

Resource Contention: Operating in a multi-tenant environment means that different teams share the same hardware resources, which can lead to contention and affect performance.
Legacy System Integration: Migrating data from older proprietary systems to Hadoop required building new data paths and educating teams on the HDFS access patterns.
Complexity in Handling Heterogeneous Data: Managing various data types (structured, semi-structured, and unstructured) and ensuring consistent processing across different platforms.
Latency Management: Balancing the need for rapid data processing (e.g., five-minute batch processing) with the complexity of data transformation pipelines.

17
Q

What are the key benefits of managing big data pipelines at Yahoo?

A

Scalability & Efficiency: Hadoop’s architecture allows Yahoo to scale processing to hundreds of terabytes per day using cost-effective hardware.
Enhanced Data Accessibility: With data stored in HDFS, more teams can directly access and build on the data, which has led to new use cases and innovation.
Unified Platform: The single processing framework (MapReduce) and the ability to use common tools (e.g., Pig Latin) simplify the management of diverse data jobs.
Business Impact: Faster data processing enables timely insights (often within 30–90 minutes) that directly improve operational decision-making, such as dynamically managing ad campaign budgets.

18
Q

What are the properties of Map Functions?

A

Individual Data Processing:
Each map function operates on a single key-value pair (or data item). For example, in a word count task, it takes a line of text and emits a key-value pair for every word (typically (word, 1)).
Transformation and Independence:
The map function transforms the input data independently. It is stateless regarding other data items, making it highly parallelizable.
Intermediate Output:
It produces a list of intermediate key-value pairs that are later grouped by keys.

19
Q

What are the properties of Reduce functions?

A

Aggregation Over Groups:
The reduce function processes a group of values that share the same key. It aggregates these values, for example by summing the counts for a word in a word count program.
Group-Based Processing:
After the map phase, the framework groups all intermediate pairs by key. The reduce function then combines these values (often resulting in a single output per key, though it can be more flexible).
Final Output:
It produces the final aggregated result for each key.

20
Q

What are some MapReduce Examples?

A

Word Count:
- Map: Processes each line of text and emits (word, 1) for each word.
- Reduce: Groups the key-value pairs by word and sums the counts to produce the total occurrences.
Image Smoothing:
- Map: Could process individual pixels or regions, applying a smoothing filter.
- Reduce: Aggregates the processed pixel values to create a smoother image.
PageRank:
Implements the iterative calculation of page rankings across a network of web pages by distributing the computation.
Pi Estimation:
Uses Monte Carlo methods where the map phase generates random points, and the reduce phase aggregates the results to estimate the value of π.

21
Q

What is the primary motivation for using the map reduce paradigm?

A

Simplified Distributed Processing:
MapReduce abstracts the complexity of writing parallel, distributed code. Instead of managing communication, fault tolerance, and load balancing manually, you only write two simple functions—map and reduce. The framework handles the rest.
Fault Tolerance and Scalability:
Designed to work on clusters with thousands of machines, MapReduce automatically deals with hardware failures and retries, making it ideal for processing terabytes of data on commodity machines.
Efficiency Through Data Locality:
The paradigm pushes computation to where the data resides, minimizing network bottlenecks and maximizing processing efficiency.

22
Q

Why are distinct words counted in the reduce phase (word count example)?

A

By separating the emission and aggregation tasks, MapReduce leverages its built-in grouping mechanism to efficiently count words in parallel.

23
Q

What is HDFS?

A

HDFS (Hadoop Distributed File System) is the primary storage engine for Hadoop. It’s designed to handle very large files using a distributed, fault‐tolerant architecture.

24
Q

What are the benefits of HDFS?

A

Fault Tolerance: Automatically handles disk, node, and rack failures by replicating data.
Scalability: Enables parallel processing over thousands of nodes, increasing throughput as more disks and CPUs are added.
Simplified Management: Abstracts the details of individual machines, so applications need not manage the underlying storage hardware.
25
Q

How does HDFS ensure data storage persistence?

A

Files are split into blocks (typically 16–64 MB each) and multiple replicas of each block are stored on different DataNodes. This replication, across nodes and even across different racks, ensures that data remains available even if hardware fails.
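The split-and-replicate scheme can be sketched as follows. This toy placement is round-robin for simplicity; real HDFS placement is rack-aware, and `place_blocks` is a hypothetical name, not an HDFS API:

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    # Split a file into fixed-size blocks and assign each block's
    # replicas to distinct DataNodes (round-robin; real HDFS
    # placement additionally accounts for rack topology).
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
layout = place_blocks(file_size=200 * 2**20,  # 200 MB file
                      block_size=64 * 2**20,  # 64 MB blocks
                      datanodes=nodes)
# 200 MB / 64 MB -> 4 blocks, each with 3 replicas on distinct nodes
```

Losing any single node still leaves two replicas of every block, which is what makes individual disk or node failures survivable.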
26
Q

What are the common usage patterns in HDFS?

A

Large Files: Ideal for storing very large files (hundreds of gigabytes to terabytes).
Append-Heavy Workloads: Files are rarely updated in place; instead, they are typically read sequentially and appended to.
Batch Processing: Often used in big data applications (like MapReduce) where high throughput is more critical than random-access speed.
27
Q

Which file operations is HDFS optimized for?

A

Sequential Reads and Writes: HDFS is designed for high-throughput access to data, making sequential operations (both reading and writing) very efficient.
Appends: Since updating huge files in place is inefficient, HDFS supports appending data, which is common in log processing and other batch tasks.
28
Q

What is the function of the DataNode servers in HDFS?

A

Storage: DataNodes store the actual data blocks of files.
Reporting: They send heartbeats and block reports to the master (NameNode) to help maintain the overall state of the file system.
Parallelism: By distributing file chunks across many DataNodes, HDFS achieves parallel data processing.
Local I/O: Each DataNode handles read and write operations on its local file system, facilitating efficient data transfer during processing.
29
Q

Describe the HDFS architecture.

A

NameNode (Master):
- Manages the file system namespace and metadata (such as the directory structure and block mapping).
- Oversees the replication strategy and failure recovery by coordinating with DataNodes.
DataNodes (Slaves):
- Store file data in fixed-size blocks and execute read/write requests.
- Communicate regularly with the NameNode via heartbeats and block reports.
Client Library:
- Provides APIs (primarily Java, with support for Python and C) for applications to interact with HDFS.
- The client first contacts the NameNode to retrieve metadata, then accesses DataNodes directly for file I/O.
30
Q

Describe replication pipelining in HDFS.

A

When writing data, the client sends a file's data in small chunks (e.g., 4 KB) to the first DataNode. That DataNode immediately forwards each chunk to the next DataNode, forming a chain (or pipeline) for replication. This method offloads replication work from the client and ensures efficient data duplication across nodes.
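The chain-forwarding behavior can be modeled in a few lines. This is a toy, single-process sketch; `receive` is a hypothetical function, and real pipelining is asynchronous network I/O with acknowledgements flowing back up the chain:

```python
def receive(node_index, chunk, pipeline, storage):
    # A DataNode stores the chunk locally, then immediately forwards
    # it to the next node in the pipeline (if any). The client only
    # ever talks to the first node.
    node = pipeline[node_index]
    storage[node].append(chunk)
    if node_index + 1 < len(pipeline):
        receive(node_index + 1, chunk, pipeline, storage)

pipeline = ["dn1", "dn2", "dn3"]
storage = {n: [] for n in pipeline}
for chunk in [b"chunk-0", b"chunk-1"]:
    receive(0, chunk, pipeline, storage)  # client sends to dn1 only
# all three nodes now hold identical copies of every chunk
```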
31
Q

Describe staging in HDFS.

A

Upon a file creation request, the client first writes data to a temporary local file. The data is then transmitted to the NameNode, which allocates the proper DataNodes for storage. This staged process manages the file write without immediately burdening the NameNode, ensuring smoother performance.
32
Q

What APIs are available when using HDFS?

A

Java API: The primary interface for HDFS, allowing applications to perform file operations.
Other Language Bindings: Python interfaces and C wrappers are also available, broadening its accessibility.
33
Q

What user interfaces are available when using HDFS?

A

Command-Line Tools: The hadoop dfs command provides familiar file system commands (e.g., mkdir, ls) executed in a distributed manner.
HTTP Interfaces: HDFS can be browsed over HTTP, offering a web-based file exploration option.
Integration with Other Tools: Interfaces exist for tools like Pig, enabling further data processing capabilities.
34
Q

What is the Primary Motivation for Spark?

A

Spark was designed to overcome the limitations of traditional MapReduce frameworks (like Hadoop) for iterative algorithms and interactive data exploration.
MapReduce writes intermediate results to disk between iterations, incurring significant overhead.
Spark's ability to cache data in memory makes these repetitive and interactive tasks much faster and more efficient.
35
Q

What are the Main Advantages of Spark Over Hadoop?

A

In-Memory Computation: Spark caches data (via RDDs) in memory, which greatly speeds up iterative and interactive algorithms by avoiding repeated disk reads.
Efficient Iterative Processing: Since many algorithms (e.g., machine learning, graph processing) iterate over the same data, Spark's ability to reuse cached data cuts down on processing time.
Interactive Shell: Spark provides an interactive shell (based on a modified Scala interpreter) that lets users experiment with data transformations and actions in real time.
Flexible Programming Model: Its API based on transformations (e.g., map, filter) and actions (e.g., count, collect) allows for more expressive data processing pipelines.
Extended Ecosystem: Spark supports additional libraries and frameworks for various tasks, expanding its use cases beyond what Hadoop was originally designed for.
36
Q

What are Actions in Spark?

A

Actions are operations that trigger the execution of the transformation pipeline and produce a result (e.g., count, collect, or save). When an action is executed, Spark distributes the work across the cluster and returns the computed result to the driver program.
37
Q

What are Transformations in Spark?

A

Transformations are operations that create a new RDD from an existing one (such as map, filter, or join). Transformations are lazy: they build up a series of operations but do not immediately compute results. The actual processing is deferred until an action is called.
38
Q

How do RDDs provide fault tolerance in Spark?

A

RDDs (Resilient Distributed Datasets) achieve fault tolerance by maintaining lineage information. Instead of replicating data across nodes, each RDD keeps a record (a dependency graph) of all the transformations applied to create it from the original data source. If a node fails or data is lost, Spark can recompute the lost partitions by re-applying these transformations.
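Lazy transformations, actions, and lineage-based recomputation can be illustrated with a toy RDD class. This is not the real PySpark API; `ToyRDD` and its methods are hypothetical names that model the idea only:

```python
class ToyRDD:
    # Toy model of an RDD: it stores *how* to produce its data
    # (its lineage), not the data itself.
    def __init__(self, source, parent=None, transform=None):
        self.source = source        # base data (only for the root RDD)
        self.parent = parent        # lineage: the RDD this was derived from
        self.transform = transform  # the function applied to the parent

    def map(self, fn):
        # Transformation: lazy, just records a new lineage step.
        return ToyRDD(None, parent=self,
                      transform=lambda data: [fn(x) for x in data])

    def filter(self, pred):
        return ToyRDD(None, parent=self,
                      transform=lambda data: [x for x in data if pred(x)])

    def collect(self):
        # Action: walk the lineage back to the source and compute.
        # A lost partition could be rebuilt the same way, by replaying
        # the recorded transformations from the original data.
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.collect())

    def count(self):
        return len(self.collect())

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has been computed yet; the action below triggers the pipeline.
evens = rdd.collect()   # [0, 4, 16, 36, 64]
```

Because the lineage graph fully describes how to rebuild the data, Spark can recover from a lost node without keeping replicated copies of intermediate results.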
39
Q

List some frameworks built on Spark for graph processing, SQL-like queries, and machine learning.

A

GraphX: A framework for graph processing, extending Spark's capabilities to work efficiently with graph-structured data.
Spark SQL (formerly Shark/Hive on Spark): Enables SQL-like querying and provides a bridge between relational data and Spark's distributed processing.
MLlib: Spark's machine learning library, offering a range of scalable machine learning algorithms.