Big Data & Cloud Module I Flashcards
What is erasure coding (EC)?
Erasure coding is a feature of Hadoop 3 that protects data by dividing it into fragments, which are expanded and encoded with redundant parity data and stored across different locations. If a drive fails or data becomes corrupted, the original data can be reconstructed from the fragments stored on the other drives.
What is Cluster Topology?
By cluster topology we mean the type and state of each node in the cluster and the relations between them.
In order to make its placement and scheduling choices, Hadoop must know the cluster topology.
When HDFS might NOT be a good fit?
- When there is a need for low-latency data access
- When we are dealing with lots of small files
What are the main file formats on Hadoop?
We can have traditional file formats (like text, CSV, XML) and Hadoop-specific file formats, namely row-oriented and column-oriented formats. The difference is that row-oriented file formats store data by grouping all the attributes of a record together, while column-oriented file formats store data by grouping the values of each attribute (column) together.
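As a rough illustration (plain Python, not an actual Hadoop file format), here is the same set of records laid out row-wise and column-wise:

```python
# Toy sketch: the same records stored row-oriented vs column-oriented.
records = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob",   "age": 25},
    {"id": 3, "name": "carol", "age": 41},
]

# Row-oriented: all attributes of each record are kept together.
row_layout = [(r["id"], r["name"], r["age"]) for r in records]

# Column-oriented: all values of each attribute are kept together,
# which compresses well and lets a query read only the columns it needs.
column_layout = {
    "id":   [r["id"] for r in records],
    "name": [r["name"] for r in records],
    "age":  [r["age"] for r in records],
}

print(row_layout)
print(column_layout["age"])  # reading one column touches one list only
```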
What is Parquet storage?
Parquet storage is a columnar file format used in big data processing systems like Hadoop. It organizes and stores data in a highly optimized manner, grouping values of each column together for efficient compression and retrieval. This allows for faster data scanning and processing, especially when dealing with large datasets, as only the required columns are read from disk, reducing I/O operations and improving performance.
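A minimal PySpark sketch of this idea, assuming a local Spark installation and a hypothetical `people.parquet` output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 30), (2, "bob", 25)],
    ["id", "name", "age"],
)

# Write the DataFrame as Parquet (columnar, compressed).
df.write.mode("overwrite").parquet("people.parquet")

# Read back only one column: Parquet lets Spark skip the other
# columns on disk (column pruning), reducing I/O.
ages = spark.read.parquet("people.parquet").select("age")
ages.show()
```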
What is YARN?
YARN (Yet Another Resource Negotiator) is the preferred resource negotiator in Hadoop. It manages the cluster's resources and schedules tasks to the different workers. YARN provides two different daemons:
- The Resource Manager: formed by a Scheduler, which allocates resources to the applications, and by the Applications Manager, which accepts job submissions
- The Node Manager: a per-node agent. The NM is responsible for containers and monitors their resource usage, reporting it to the RM
Explain the concept of data locality
Data locality is the practice of moving the computation to the node where the data resides. By bringing computation close to the data, data locality reduces network communication and disk I/O, resulting in faster and more efficient data processing.
What is MapReduce?
MapReduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster
Explain from what parts a MapReduce program is composed and how it works
A MapReduce program is composed of two functions: a map function and a reduce function.
The map function performs filtering and sorting, and the reduce function performs a summary operation.
Together they form the MapReduce system, which orchestrates the process by marshalling the distributed servers, running the tasks in parallel and managing communications.
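A classic example is word count. The following is a plain-Python sketch of the idea (not actual Hadoop code), with the shuffle step simulated locally:

```python
from collections import defaultdict

# Map: emit a (word, 1) pair for every word in a line.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce: sum all the counts emitted for the same key.
def reduce_fn(word, counts):
    return (word, sum(counts))

lines = ["big data on hadoop", "big data on spark"]

# Shuffle/sort step simulated locally: group values by key.
groups = defaultdict(list)
for line in lines:
    for word, one in map_fn(line):
        groups[word].append(one)

result = [reduce_fn(word, counts) for word, counts in groups.items()]
print(result)  # [('big', 2), ('data', 2), ('on', 2), ('hadoop', 1), ('spark', 1)]
```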
Define what combiners are and what is their utility
Combiners in MapReduce are mini-reducers that operate locally on the output of the map phase. They help to optimize data transfer by reducing the amount of data sent across the network. Combiners perform partial aggregation on the intermediate key-value pairs generated by the map phase, thereby reducing the volume of data that needs to be transferred to the reducers. They are used to improve overall performance and reduce network congestion in MapReduce jobs.
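Continuing the word-count sketch above, a combiner would pre-aggregate each mapper's own output before it leaves the node (again plain Python, not the Hadoop API):

```python
from collections import Counter

# Output of a single mapper before the combiner: many (word, 1) pairs.
mapper_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]

# Combiner: a local, partial reduce on the mapper's own output.
combined = list(Counter(word for word, _ in mapper_output).items())

print(combined)  # [('big', 3), ('data', 1)] -- fewer pairs cross the network
```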
Talk about Partitioning in MapReduce
Partitioning in MapReduce is the process of dividing the intermediate key-value pairs generated by the map phase into separate groups or partitions. Each partition corresponds to a specific reducer task. The goal of partitioning is to ensure that all key-value pairs with the same key are sent to the same reducer, enabling efficient and accurate data processing. Partitioning allows for parallelism in the reduce phase by enabling multiple reducers to work on different subsets of data concurrently. It helps in load balancing and ensures that data is distributed evenly across reducers, improving the overall performance and scalability of MapReduce jobs.
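Hadoop's default partitioner hashes the key modulo the number of reducers; a minimal Python sketch of that idea:

```python
# Sketch of hash partitioning: every pair with the same key
# is routed to the same reducer partition.
def partition(key, num_reducers):
    return hash(key) % num_reducers

pairs = [("big", 3), ("data", 2), ("spark", 1), ("big", 5)]
num_reducers = 2

for key, value in pairs:
    print(key, "-> reducer", partition(key, num_reducers))
```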
How does MapReduce cope with Failure?
In MapReduce there are several mechanisms to cope with the failure of a MapReduce job:
- the master (the Application Master when running on YARN) sets each failed task back to idle and reassigns it to a worker when one becomes available
MapReduce was designed around low-end commodity servers, so it is quite resilient to failures.
What are some MapReduce Algorithms?
MapReduce is a framework: a set of tools on top of which a wider product is built.
Each programmer needs to fit their solution into the MapReduce paradigm. Some common classes of algorithms are:
- Filtering algorithms: find the lines or tuples with particular characteristics
- Summarization algorithms: for example, count the number of requests to each subdomain
- Join: combine different inputs on some shared values, e.g. to perform pre-aggregations
- Sort: sort the inputs
What is Apache Spark?
Apache Spark is a data processing framework that can quickly perform processing tasks over very large datasets, and can also distribute data processing across multiple computers
Why we went from MapReduce to Spark?
Because MapReduce has some limitations given the recent changes in technology:
- MapReduce is more attuned to batch processing
- MapReduce is a strict paradigm
- New hardware capabilities are not exploited by MapReduce
- It is too complex for many use cases
Explain the Main Structure of Spark
Spark is built around two major components: RDDs and the DAG. RDD stands for Resilient Distributed Dataset, while DAG stands for Directed Acyclic Graph.
Explain RDD
Resilient Distributed Dataset.
It is the primary data structure in Spark. It is a reliable and memory-efficient abstraction, and it speeds up processing by keeping data in memory as RDDs.
RDDs are immutable collections of objects; they are automatically rebuilt on failure, and they are distributed across the nodes of the cluster.
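A small PySpark example, assuming a local installation of Spark:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Create an RDD from a local collection; Spark splits it into partitions.
numbers = sc.parallelize(range(10))

# Transformations return new immutable RDDs; the originals never change.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

print(evens.collect())  # an action: materializes the result on the driver
```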
Explain DAG
A DAG is a collection of individual tasks, represented as nodes, connected by edges that define dependencies between the tasks. The graph is directed, meaning that the edges have a direction that indicates the flow of data between tasks. It is also acyclic, meaning that there are no cycles or loops in the graph.
When you perform data processing operations in Spark, such as transformations and actions on RDDs (Resilient Distributed Datasets) or DataFrames, Spark automatically builds a DAG to represent the computation plan. The DAG captures the logical flow of operations that need to be executed to produce the desired output.
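For example (a sketch assuming `sc` is an existing SparkContext), the transformations below only extend the DAG; nothing runs until the action at the end:

```python
# Assuming `sc` is an existing SparkContext.
rdd = sc.parallelize(["a", "b", "a", "c"])

# These transformations only extend the DAG; no data is processed yet.
pairs = rdd.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# toDebugString() shows the lineage (the DAG) recorded for this RDD.
print(counts.toDebugString().decode())

# The action finally triggers execution of the whole plan.
print(counts.collect())
```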
Explain the Spark Architecture
Spark uses a master/slave architecture with one central coordinator (DRIVER) and many distributed workers (EXECUTORS).
Cluster Manager: is responsible for assigning and managing cluster resources (it can be the Spark standalone manager or YARN)
Executor: executes tasks
Driver Program: converts user programs into tasks
Deployment types in Spark
- Cluster mode: the driver process runs directly on a node in the cluster
- Client mode: the driver runs on a machine that does not belong to the cluster
- Local mode: the driver and executors run on the same machine
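As a small sketch, local mode can be requested directly when building the session (cluster vs. client mode is normally selected at submission time with `spark-submit --deploy-mode`):

```python
from pyspark.sql import SparkSession

# Local mode: driver and executors run on this machine, using as many
# worker threads as there are cores ("local[*]").
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("deployment-demo")
    .getOrCreate()
)

print(spark.sparkContext.master)  # local[*]
spark.stop()
```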
What is RDD partitioning?
RDDs are collections of data so big that they need to be partitioned across different nodes. Spark automatically partitions RDDs and distributes the partitions across the different nodes.
If not specified, Spark sets the number of partitions automatically. If there are too many, there is excessive overhead; if there are too few, some cores will not be used.
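A quick PySpark check, assuming `sc` is an existing SparkContext:

```python
# Assuming `sc` is an existing SparkContext.
rdd = sc.parallelize(range(100), numSlices=8)   # ask for 8 partitions
print(rdd.getNumPartitions())                   # 8

# Too few partitions leave cores idle; repartition() redistributes the
# data (at the cost of a shuffle), while coalesce() can shrink without one.
wider = rdd.repartition(16)
narrower = wider.coalesce(4)
print(wider.getNumPartitions(), narrower.getNumPartitions())  # 16 4
```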
Explain Shuffling in Spark
Shuffling is the mechanism used to re-distribute data across partitions. It’s necessary to compute some operations. It is complex and costly
The main shuffling techniques are:
- Hash shuffle: each map task creates a file for every reducer
- Hash shuffle (evolved): each executor holds a pool of files that map tasks reuse
- Sort shuffle (the default): each mapper keeps its output in memory and spills to disk if necessary
- Tungsten sort: an evolution of sort shuffle that works directly on serialized records
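For instance (a sketch assuming `sc` is an existing SparkContext), a key-based aggregation forces a shuffle because values with the same key may live in different partitions:

```python
# Assuming `sc` is an existing SparkContext.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], 4)

# mapValues() is a narrow transformation: no data moves between partitions.
doubled = pairs.mapValues(lambda v: v * 2)

# reduceByKey() is a wide transformation: matching keys must be brought
# together, so Spark shuffles data across partitions (and the network).
totals = doubled.reduceByKey(lambda a, b: a + b)
print(totals.collect())  # [('a', 8), ('b', 12)] in some order
```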
How to configure a Cluster?
- Resource configuration: CPU and memory. Every executor in an application has a fixed number of cores and a fixed heap size (part of which is used as cache)
- CPU tuning: setting the number of executors and the number of cores per executor
- Memory tuning: setting the amount of memory per executor
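These settings are usually passed as Spark configuration properties; a sketch with placeholder values (not tuning recommendations):

```python
from pyspark.sql import SparkSession

# Example resource settings (placeholder values, not recommendations).
spark = (
    SparkSession.builder
    .appName("resource-config-demo")
    .config("spark.executor.instances", "4")   # number of executors
    .config("spark.executor.cores", "2")       # cores per executor
    .config("spark.executor.memory", "4g")     # heap size per executor
    .getOrCreate()
)
```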
Shared Variables in Spark
Shared variables in Spark are special variables that can be shared and used across multiple tasks in a distributed computing environment.
There are two types of shared variables in Spark:
Broadcast Variables: These are read-only variables that are broadcast to all the worker nodes in a cluster. They allow the workers to access the variable's value efficiently without sending a copy of the variable with each task. Broadcast variables are useful when a large dataset or a large lookup table needs to be shared among all the tasks.
Accumulators: These are variables that are used to accumulate values from the tasks running on worker nodes back to the driver program. Accumulators are typically used for aggregating values or collecting statistics from the tasks. They are designed to be only “added” to and provide a convenient way to gather information from distributed tasks without relying on explicit data shuffling.
By using shared variables, Spark avoids unnecessary data transfer and duplication, which can improve the performance and efficiency of distributed computations.
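A short PySpark illustration of both kinds, assuming `sc` is an existing SparkContext:

```python
# Assuming `sc` is an existing SparkContext.

# Broadcast variable: a read-only lookup table shipped once per worker.
country_names = sc.broadcast({"IT": "Italy", "FR": "France"})

# Accumulator: tasks only add to it; the driver reads the total.
unknown = sc.accumulator(0)

def resolve(code):
    if code in country_names.value:
        return country_names.value[code]
    unknown.add(1)
    return "unknown"

codes = sc.parallelize(["IT", "FR", "XX", "IT"])
print(codes.map(resolve).collect())  # the action runs the tasks
print(unknown.value)                 # 1, read back on the driver
```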