Big Data & Cloud Module I Flashcards
What is erasure coding (EC)?
Erasure coding is a feature of Hadoop 3 that protects data by dividing it into fragments, which are expanded, encoded with redundant data, and stored across different locations. If a drive fails or data becomes corrupted, the data can be reconstructed from the fragments stored on the other drives.
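A minimal toy sketch of the idea, using a single XOR parity fragment to stand in for the redundant data; real HDFS erasure coding uses Reed-Solomon policies such as RS(6,3), which tolerate several simultaneous failures:

```python
from functools import reduce
from operator import xor

# Three data fragments plus one XOR parity fragment (toy example).
data_fragments = [b"\x01\x02", b"\x0a\x0b", b"\x10\x20"]
parity = bytes(reduce(xor, column) for column in zip(*data_fragments))

# Simulate losing fragment 1 and rebuild it from the survivors + parity.
lost = 1
survivors = [f for i, f in enumerate(data_fragments) if i != lost]
rebuilt = bytes(reduce(xor, column) for column in zip(*survivors, parity))

assert rebuilt == data_fragments[lost]
```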
What is Cluster Topology?
By cluster topology we mean the type and state of each node in the cluster and the relations between them.
In order to make choices such as where to place block replicas, Hadoop must know the cluster topology.
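Rack awareness is one concrete way Hadoop learns the topology: a user-supplied topology script maps host addresses to rack paths. A minimal sketch, with a made-up host-to-rack table (a real script would typically derive racks from IP subnets or an inventory file):

```python
#!/usr/bin/env python3
# Hypothetical rack-awareness topology script: Hadoop passes host names or
# IPs as arguments and expects one rack path per argument on stdout. It is
# wired in through the net.topology.script.file.name property.
import sys

# Made-up host-to-rack mapping, for illustration only.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

for host in sys.argv[1:]:
    print(RACKS.get(host, "/default-rack"))
```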
When HDFS might NOT be a good fit?
- When there is a need for low-latency data access
- When we are dealing with lots of small files
What are the main file formats on Hadoop?
We can have traditional file formats (like text, CSV, and XML) and Hadoop-specific file formats, which are either row-oriented or column-oriented. The difference is that row-oriented formats store data by grouping all the attributes of a record together, while column-oriented formats store data by grouping the values of each attribute together.
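A toy illustration of the two layouts for the same three records (not an actual on-disk format):

```python
records = [
    {"id": 1, "name": "ada",   "age": 36},
    {"id": 2, "name": "grace", "age": 45},
    {"id": 3, "name": "alan",  "age": 41},
]

# Row-oriented: all attributes of one record are stored together.
row_layout = [(r["id"], r["name"], r["age"]) for r in records]
# [(1, 'ada', 36), (2, 'grace', 45), (3, 'alan', 41)]

# Column-oriented: all values of one attribute are stored together, so a
# query that touches only "age" never has to read the ids or names.
column_layout = {
    "id":   [r["id"] for r in records],
    "name": [r["name"] for r in records],
    "age":  [r["age"] for r in records],
}
```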
What is Parquet storage?
Parquet storage is a columnar file format used in big data processing systems like Hadoop. It organizes and stores data in a highly optimized manner, grouping values of each column together for efficient compression and retrieval. This allows for faster data scanning and processing, especially when dealing with large datasets, as only the required columns are read from disk, reducing I/O operations and improving performance.
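A minimal sketch with pandas, assuming the optional pyarrow engine is installed and using an arbitrary file name; note that only the requested column is read back from disk:

```python
import pandas as pd

df = pd.DataFrame({
    "user":  ["ada", "grace", "alan"],
    "page":  ["/home", "/docs", "/home"],
    "bytes": [512, 2048, 1024],
})
df.to_parquet("visits.parquet")

# Thanks to the columnar layout, only the "bytes" column is read from disk.
bytes_only = pd.read_parquet("visits.parquet", columns=["bytes"])
print(bytes_only)
```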
What is YARN?
YARN (Yet Another Resource Negotiator) is the preferred resource negotiator in Hadoop. It manages the resources of the cluster and schedules tasks to the different workers. YARN provides two different daemons:
- The Resource Manager. It is formed by a Scheduler, which allocates resources to the applications, and by the Application Manager, which accepts job submissions
- The Node Manager, a per-node agent. The NM is responsible for containers and monitors their resource usage, reporting it to the RM
Explain the concept of data locality
Data locality is the practice of moving the computation to the node where the data resides. By bringing computation close to the data, data locality reduces network communication and disk I/O, resulting in faster and more efficient data processing.
What is MapReduce?
MapReduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster
Explain from what parts a MapReduce program is composed and how it works
A MapReduce program is composed of two functions: a map function and a reduce function.
The map function performs filtering and sorting, and the reduce function performs a summary operation.
Together they form the MapReduce system, which orchestrates the process by marshalling the distributed servers, running the tasks in parallel and managing the communication between them.
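A toy, single-process word count that shows the shape of the two functions; the real framework runs many map and reduce tasks in parallel and handles the shuffle between them:

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit an intermediate (word, 1) pair for every word.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: summary operation over all values of one key.
    return word, sum(counts)

lines = ["big data is big", "data is data"]

# "Shuffle": group the intermediate values by key.
groups = defaultdict(list)
for line in lines:
    for word, count in map_fn(line):
        groups[word].append(count)

result = dict(reduce_fn(w, c) for w, c in groups.items())
print(result)  # {'big': 2, 'data': 3, 'is': 2}
```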
Define what combiners are and what is their utility
Combiners in MapReduce are mini-reducers that operate locally on the output of the map phase. They help to optimize data transfer by reducing the amount of data sent across the network. Combiners perform partial aggregation on the intermediate key-value pairs generated by the map phase, thereby reducing the volume of data that needs to be transferred to the reducers. They are used to improve overall performance and reduce network congestion in MapReduce jobs.
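A sketch of the saving, reusing the word-count idea: the combiner runs the same summing logic locally on one map task's output before anything crosses the network:

```python
from collections import Counter

# Intermediate pairs produced by a single map task.
map_output = [("data", 1), ("big", 1), ("data", 1), ("data", 1)]

# Combiner: a local "mini-reduce" over that task's output.
combined = Counter()
for word, count in map_output:
    combined[word] += count

print(sorted(combined.items()))  # [('big', 1), ('data', 3)]
# Two pairs instead of four are shuffled to the reducers; the final sums
# are unchanged because addition is associative and commutative.
```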
Talk about Partitioning in MapReduce
Partitioning in MapReduce is the process of dividing the intermediate key-value pairs generated by the map phase into separate groups or partitions. Each partition corresponds to a specific reducer task. The goal of partitioning is to ensure that all key-value pairs with the same key are sent to the same reducer, enabling efficient and accurate data processing. Partitioning allows for parallelism in the reduce phase by enabling multiple reducers to work on different subsets of data concurrently. It helps in load balancing and ensures that data is distributed evenly across reducers, improving the overall performance and scalability of MapReduce jobs.
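A sketch of a hash-style partitioner, the usual default (Hadoop's HashPartitioner does the equivalent on the Java side): the key's hash modulo the number of reducers picks the destination, so equal keys always land on the same reducer:

```python
NUM_REDUCERS = 3

def partition(key, num_reducers=NUM_REDUCERS):
    # The same key always hashes to the same reducer within a run.
    return hash(key) % num_reducers

for key in ["data", "big", "is", "data"]:
    print(f"{key!r} -> reducer {partition(key)}")
```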
How does MapReduce cope with Failure?
In MapReduce there are several ways to cope with the failure of a task or a worker:
- the Application Master sets each failed task back to idle and reassigns it to a worker when one becomes available
MapReduce was designed around low-end commodity servers, so it is quite resilient to failure.
What are some MapReduce Algorithms?
MapReduce is a framework, i.e. a set of building blocks that are combined into a wider product.
Each programmer needs to fit their solution into the MapReduce paradigm. Some common algorithm patterns are:
- Filtering algorithms: find the lines or tuples with particular characteristics (a small sketch follows this list)
- Summarization algorithms: count, for example, the number of requests to each subdomain
- Join: combine different inputs on some shared values, often used to perform pre-aggregations
- Sort: sort the inputs
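As referenced above, a minimal sketch of a filtering algorithm in MapReduce terms, with a made-up log format: the map phase emits only the records that match a predicate, and the reduce phase can simply be the identity:

```python
def map_fn(log_line):
    # Map: emit only the records that satisfy the predicate.
    host, status, size = log_line.split()
    if status == "500":
        yield host, log_line

logs = [
    "web01 200 1024",
    "web02 500 512",
    "web01 500 2048",
]

errors = [pair for line in logs for pair in map_fn(line)]
print(errors)  # [('web02', 'web02 500 512'), ('web01', 'web01 500 2048')]
```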
What is Apache Spark?
Apache Spark is a data processing framework that can quickly perform processing tasks over very large datasets, and can also distribute data processing across multiple computers
Why we went from MapReduce to Spark?
Because MapReduce has some limitations given the recent changes in technology:
- MapReduce is more attuned to batch processing
- MapReduce is a strict paradigm
- New hardware capabilities are not exploited by MapReduce
- It is too complex
Explain the Main Structure of Spark
Spark is built around two major components: the RDD and the DAG. RDD stands for Resilient Distributed Dataset, while DAG stands for Directed Acyclic Graph.
Explain RDD
Resilient Distributed Dataset.
It is the primary data structure in Spark: a reliable, memory-efficient solution that speeds up processing by keeping data in RDDs in memory.
RDDs are immutable, distributed collections of objects that are automatically rebuilt on failure.
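A minimal PySpark sketch, assuming a local Spark installation ("local[*]" just means run on all local cores); transformations only extend the DAG, and nothing runs until an action is called:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# An immutable, partitioned collection of objects.
numbers = sc.parallelize(range(1, 1_000_001), numSlices=8)

# Transformations are only recorded in the lineage (DAG)...
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# ...until an action forces execution; lost partitions can be recomputed
# from this lineage if a worker fails.
print(evens.sum())

sc.stop()
```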