Big Data & Cloud Module I Flashcards
What is erasure encoding (EC)?
Erasure coding is a feature of Hadoop 3 that protects data by dividing it into fragments, which are expanded and encoded with redundant data and stored across different locations (drives/nodes). If a drive fails or data becomes corrupted, the original data can be reconstructed from the fragments stored on the other drives.
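A toy sketch of the idea using simple XOR parity (Hadoop actually uses Reed-Solomon codes such as RS(6,3); the fragments and values here are made up):

```python
# Toy illustration of erasure coding: two data fragments plus one parity
# fragment, each stored on a different drive.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

frag1 = b"hello wo"
frag2 = b"rld!!!!!"
parity = xor_bytes(frag1, frag2)

# If the drive holding frag2 fails, it can be rebuilt from frag1 and the parity.
recovered = xor_bytes(frag1, parity)
assert recovered == frag2
```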
What is Cluster Topology?
By cluster topology we mean the type and state of each node in the cluster and the relations between them.
In order to make its placement and scheduling choices, Hadoop must know the cluster topology.
When HDFS might NOT be a good fit?
- When there is a need for low-latency data access
- When we are dealing with lots of small files
What are the main file formats on Hadoop?
We can have traditional file formats (like text, CSV, XML) and Hadoop-specific file formats, namely row-oriented and column-oriented formats. The difference is that row-oriented file formats store data by grouping all the attributes of a record together, while column-oriented file formats store data by grouping the values of each attribute (column) together.
What is Parquet storage?
Parquet storage is a columnar file format used in big data processing systems like Hadoop. It organizes and stores data in a highly optimized manner, grouping values of each column together for efficient compression and retrieval. This allows for faster data scanning and processing, especially when dealing with large datasets, as only the required columns are read from disk, reducing I/O operations and improving performance.
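A minimal PySpark sketch of writing and reading Parquet (the path, column names and values are made up, and a local Spark session is assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34, "IT"), ("bob", 45, "HR")],
    ["name", "age", "dept"],
)
df.write.mode("overwrite").parquet("/tmp/people.parquet")

# Because Parquet is columnar, selecting only two columns means only those
# column chunks need to be read from disk.
spark.read.parquet("/tmp/people.parquet").select("name", "age").show()
```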
What is YARN?
YARN is the preferred resource negotiator in Hadoop. It manages the cluster's resources and schedules tasks on the different workers. YARN provides two daemons:
- The Resource Manager. It is formed by a Scheduler, which allocates resources to the applications, and by the Application Manager, which accepts job submissions
- The Node Manager, a per-node slave. The NM is responsible for containers and monitors the containers' resource usage, reporting it to the RM
Explain the concept of data locality
Data locality is the practice of moving the computation to the node where the data resides. By bringing computation close to the data, data locality reduces network communication and disk I/O, resulting in faster and more efficient data processing.
What is MapReduce?
MapReduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster
Explain from what parts a MapReduce program is composed and how it works
A MapReduce program is composed of two functions: a map function and a reduce function.
The map function performs filtering and sorting, and the reduce function performs a summary operation.
Together they run on the MapReduce system, which orchestrates the process by marshalling the distributed servers, running the tasks in parallel and managing communications.
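The canonical word-count example, written as plain Python generator functions just to illustrate the map/reduce contract (a real job would plug these into Hadoop's API; the framework groups intermediate pairs by key between the two phases):

```python
def map_fn(_, line):
    # Emit an intermediate (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Summarize all values associated with the same key.
    yield word, sum(counts)
```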
Define what combiners are and what is their utility
Combiners in MapReduce are mini-reducers that operate locally on the output of the map phase. They help to optimize data transfer by reducing the amount of data sent across the network. Combiners perform partial aggregation on the intermediate key-value pairs generated by the map phase, thereby reducing the volume of data that needs to be transferred to the reducers. They are used to improve overall performance and reduce network congestion in MapReduce jobs.
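Continuing the hypothetical word-count sketch above: because summation is associative and commutative, the reducer logic can also serve as a combiner that runs on the mapper node before anything crosses the network:

```python
def combine_fn(word, local_counts):
    # Partial aggregation on the map side; the reducer later sums the
    # partial sums, which yields the same final result.
    yield word, sum(local_counts)
```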
Talk about Partitioning in MapReduce
Partitioning in MapReduce is the process of dividing the intermediate key-value pairs generated by the map phase into separate groups or partitions. Each partition corresponds to a specific reducer task. The goal of partitioning is to ensure that all key-value pairs with the same key are sent to the same reducer, enabling efficient and accurate data processing. Partitioning allows for parallelism in the reduce phase by enabling multiple reducers to work on different subsets of data concurrently. It helps in load balancing and ensures that data is distributed evenly across reducers, improving the overall performance and scalability of MapReduce jobs.
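A sketch of a default-style hash partitioner (Python's built-in hash is only stable within a single run; a real partitioner would use a deterministic hash):

```python
def partition(key, num_reducers):
    # All pairs with the same key map to the same reducer index.
    return hash(key) % num_reducers

# e.g. with 4 reducers, every ("spark", 1) pair lands in the same partition.
assert partition("spark", 4) == partition("spark", 4)
```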
How does MapReduce cope with Failure?
In MapReduce there are several mechanisms to cope with the failure of a task or worker:
- the Application Master sets each failed task back to idle and reassigns it to a worker when one becomes available
MapReduce was designed around low-end commodity servers, so it is built to be resilient to failures
What are some MapReduce Algorithms?
MapReduce is a framework, i.e. a set of building blocks used to create a wider product.
Each programmer needs to fit their solution into the MapReduce paradigm. Some recurring classes of algorithms are:
- Filtering algorithms: find the lines/tuples with particular characteristics
- Summarization algorithms: for example, count the number of requests to each subdomain
- Join: combines different inputs on some shared values, often used to perform pre-aggregations
- Sort: sort the inputs
What is Apache Spark?
Apache Spark is a data processing framework that can quickly perform processing tasks over very large datasets, and can also distribute data processing across multiple computers
Why we went from MapReduce to Spark?
Because MapReduce has some limitations, given the recent changes in technology:
- MapReduce is more attuned to batch processing
- MapReduce is a strict paradigm
- New hardware capabilities are not exploited by MapReduce
- It is too complex for many use cases
Explain the Main Structure of Spark
Spark is built around two major components: the RDD and the DAG. RDD stands for Resilient Distributed Dataset, while DAG stands for Directed Acyclic Graph.
Explain RDD
Resilient Distributed Dataset.
It is the primary data structure in Spark. It is a reliable and memory-efficient abstraction, and it speeds up processing by keeping data in memory across operations.
RDDs are immutable, distributed collections of objects; they are automatically rebuilt on failure.
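A minimal PySpark sketch (local session assumed); note that each transformation returns a new RDD instead of modifying the original:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

numbers = sc.parallelize(range(10))          # distributed, immutable collection
evens = numbers.filter(lambda x: x % 2 == 0) # new RDD, numbers is unchanged
squares = evens.map(lambda x: x * x)

print(squares.collect())                     # [0, 4, 16, 36, 64]
```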
Explain DAG
A DAG is a collection of individual tasks, represented as nodes, connected by edges that define dependencies between the tasks. The graph is directed, meaning that the edges have a direction that indicates the flow of data between tasks. It is also acyclic, meaning that there are no cycles or loops in the graph.
When you perform data processing operations in Spark, such as transformations and actions on RDDs (Resilient Distributed Datasets) or DataFrames, Spark automatically builds a DAG to represent the computation plan. The DAG captures the logical flow of operations that need to be executed to produce the desired output.
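A sketch of the lazy DAG construction (the input path is hypothetical; a local SparkContext is assumed):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "dag-demo")

# Transformations only extend the DAG; no job runs yet.
lines = sc.textFile("/tmp/input.txt")               # hypothetical input path
pairs = lines.flatMap(str.split).map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)      # still nothing executed

print(counts.toDebugString().decode())              # inspect the RDD lineage
print(counts.count())                               # action: triggers the DAG
```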
Explain the Spark Architecture
Spark uses a master/slave architecture with one central coordinator (DRIVER) and many distributed workers (EXECUTORS).
Cluster Manager: responsible for assigning and managing cluster resources (it can be the Spark standalone manager or YARN)
Executor: executes tasks
Driver Program: converts user programs into tasks
Deployment types in Spark
- Cluster mode: the driver process runs directly on a node in the cluster
- Client mode: the driver runs on a machine that does not belong to the cluster
- Local mode: the driver and executors run on the same machine
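A small sketch for local mode: it can be selected directly from code, while cluster and client mode are normally chosen when submitting the application (e.g. via spark-submit's --deploy-mode option) rather than from inside the program:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")        # driver and executors on the same machine
    .appName("local-mode-demo")
    .getOrCreate()
)
```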
What is RDD partitioning?
RDDs are collections of data that can be so big that they need to be partitioned across different nodes. Spark automatically partitions RDDs and distributes the partitions across the nodes.
If not specified, Spark sets the number of partitions automatically. If there are too many partitions, there is excessive overhead; if there are too few, some cores will not be used.
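A sketch of inspecting and changing partitioning in PySpark (partition counts are arbitrary example values):

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "partition-demo")

rdd = sc.parallelize(range(1000), numSlices=8)   # explicit partition count
print(rdd.getNumPartitions())                    # 8

wider = rdd.repartition(16)   # full shuffle to increase the partition count
narrower = rdd.coalesce(2)    # merges partitions, avoids a full shuffle
```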
Explain Shuffling in Spark
Shuffling is the mechanism used to re-distribute data across partitions. It is necessary to compute some operations (e.g. aggregations and joins by key), and it is complex and costly.
The main shuffling techniques are:
- Hash shuffle: each map task creates a file for every reducer
- Hash shuffle (evolved): each executor holds a pool of files that are reused
- Sort shuffle (the default): each mapper keeps its output in memory, spilling to disk if necessary
- Tungsten sort: an evolution of sort shuffle that works directly on serialized records
How to configure a Cluster?
- Resource configuration: CPU and memory. Every executor in an application has a fixed number of cores and a fixed heap size (part of which is used for caching)
- CPU tuning: setting the number of executors and the number of cores per executor
- Memory tuning: setting the amount of memory per executor
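A sketch of setting these knobs programmatically (the values are made up; spark.executor.instances is mainly honoured on YARN/Kubernetes and ignored in local mode):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuning-demo")
    .config("spark.executor.instances", "4")   # number of executors
    .config("spark.executor.cores", "2")       # cores per executor
    .config("spark.executor.memory", "4g")     # heap size per executor
    .getOrCreate()
)
```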
Shared Variables in Spark
Shared variables in Spark are special variables that can be shared and used across multiple tasks in a distributed computing environment.
There are two types of shared variables in Spark:
Broadcast Variables: These are read-only variables that are broadcasted to all the worker nodes in a cluster. It allows the workers to access the variable’s value efficiently without sending a copy of the variable with each task. Broadcast variables are useful when a large dataset or a large lookup table needs to be shared among all the tasks.
Accumulators: These are variables that are used to accumulate values from the tasks running on worker nodes back to the driver program. Accumulators are typically used for aggregating values or collecting statistics from the tasks. They are designed to be only “added” to and provide a convenient way to gather information from distributed tasks without relying on explicit data shuffling.
By using shared variables, Spark avoids unnecessary data transfer and duplication, which can improve the performance and efficiency of distributed computations.
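A minimal PySpark sketch of both kinds of shared variable (the lookup table and input codes are made up):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "shared-vars-demo")

lookup = sc.broadcast({"it": "Italy", "fr": "France"})   # read-only on workers
errors = sc.accumulator(0)                               # tasks can only add to it

def resolve(code):
    if code not in lookup.value:
        errors.add(1)          # update flows back to the driver
        return None
    return lookup.value[code]

countries = sc.parallelize(["it", "fr", "xx"]).map(resolve)
print(countries.collect())   # ['Italy', 'France', None]
print(errors.value)          # 1, read on the driver after the action runs
```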
What is SQL-on-Hadoop?
It is a class of applications that combine SQL-style querying with newer Hadoop elements. SQL-on-Hadoop allows a wider group of enterprise developers and business analysts to work on Hadoop on commodity hardware.
What are the SQL options on Hadoop?
- Batch SQL: SQL-like queries translated into MapReduce or Spark jobs. Tools: Apache Hive, Apache Spark
- Interactive SQL: interactive queries that enable traditional BI and analytics. Tools: Impala, Apache Drill
- Operational SQL: OLTP workloads and apps that operate over smaller datasets with inserts, updates and deletes. Tools: Apache HBase, NoSQL databases
Talk about Apache Hive
Apache Hive is a distributed, fault-tolerant data warehouse that enables analytics at a massive scale.
Hive is built on top of and closely integrated with Hadoop. It gives the ability to query large datasets with an SQL-like interface.
Best use: batch jobs over large quantities of append-only data. It is not meant for real-time queries.
Talk about Spark SQL
Spark SQL is a Spark Module for structured data processing. It provides a programming abstraction called DataFrames and can act as a distributed SQL query engine.
It enables unmodified Hive queries to run up to 100x faster
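A minimal sketch of the two equivalent ways to query the same data (table, column names and values are made up; a local session is assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")

# The same query via the DataFrame API and via plain SQL.
df.filter(df.age > 30).select("name").show()
spark.sql("SELECT name FROM people WHERE age > 30").show()
```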
What is AQE? Adaptive Query Execution
Adaptive Query Execution is one of the major features of Spark 3.0. It re-optimizes and adjusts query plans based on runtime statistics collected during the execution of the query.
The main idea is that the query plan is not final, and additional optimizations may be applied at runtime.
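A short sketch of toggling AQE explicitly (it is enabled by default in recent Spark 3.x releases):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-demo").getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```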
What is NoSQL?
NoSQL is an approach to database design that enables the storage and querying of data outside the traditional structures found in relational databases.
NoSQL supports several data models, gives freedom from joins (which are either not supported or discouraged) and freedom from rigid schemas.
NoSQL databases are flexible and scalable systems that excel at handling large amounts of unstructured and semi-structured data, such as social media posts, sensor data and multimedia content. They offer advantages such as horizontal scalability, flexibility, high performance and availability, and they can efficiently store and process diverse data formats, including JSON, XML, key-value pairs and document-oriented data.
What are NoSQL data models?
- Graph: each DB contains one or more graphs, and each graph contains vertices and arcs.
Vertices represent real world entities
Arcs represent relationships between entities
- Key Value: each DB contains one or more collections (tables): Each one contains a list of key value pairs
- Document: each DB contains one or more collections, and each collection contains a list of documents
- Wide column: a key-value model in which the value is a set of columns, so each row can have a different set of columns
- Row-oriented / column-oriented: data is laid out on disk by row or by column, respectively
What is sharding data and why is it good in NoSQL?
Database sharding is the process of storing a large database across multiple machines. One of the strengths of NoSQL is its scale-out capability.
A good sharding strategy is fundamental to optimize performance
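A toy sketch of hash-based sharding (shard count and key format are made up; real systems typically add consistent hashing or virtual nodes to ease re-balancing):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Hash the key deterministically and route it to one of the shards.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user:42"))   # always the same shard for the same key
```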
What are the two types of replications that we can do?
Replication means that the data is copied to several nodes. It improves the robustness of the system: if a node fails, the replicas prevent data loss. The two main types are master-slave replication, where one primary node receives the writes and propagates them to the replicas, and peer-to-peer replication, where every replica can accept reads and writes.
What are ACID properties and differences in consistency between RDBMS and NoSQL?
Data consistency means that the user sees a consistent view of the data. ACID means four properties are guaranteed:
Atomicity: the transaction is indivisible
Consistency: the transaction leaves the DB in a consistent state
Isolation: the transaction is independent from the others
Durability: the DBMS protects the DB from failures
In RDBMSs, consistency is guaranteed to the detriment of speed and availability. NoSQL systems are mainly based on PA/EL, i.e. they favour availability and low latency (speed) over consistency.
What is polyglot persistence?
It means using multiple data storage technologies for the different data storage needs across an application. Using a single DBMS to handle everything usually leads to inefficient solutions, since each activity has its own requirements. One size fits all does not work anymore.
What is data streaming?
Streaming data is data that is continuously generated by thousands of data sources.
Big data is not only about batch analysis, but also about analyzing data streams. The key requirements are:
- latency: the lower the better
- workload balancing
In our context, a system for data streaming is a type of data processing engine designed with infinite datasets in mind
What are the classification real time systems?
Hard: latency of nanoseconds, microseconds
Firm: latency of milliseconds, seconds
Soft: latency of seconds, minutes
What are the data stream models?
Time Series Model: items arrive in order and each new item updates a new component of the underlying signal
Cash Register Model: updates arrive in arbitrary order and are increments only
Turnstile Model: updates arrive in arbitrary order and can be both increments and decrements
What is the difference between stream time and event time?
There is a potentially significant difference between the time at which an event occurs and the time at which an event enters the streaming system
Stream time: defined by the system as the event enters the pipeline
Event time: carried by the event itself
What is random sampling?
It is a common technique in data streaming: instead of processing every element, we keep a random sample of the stream and perform (approximate) real-time statistical analysis on it, for example on a stream of watched videos.
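One standard way to keep such a sample (not named in the card, but a common choice) is reservoir sampling, which keeps a uniform random sample of fixed size k from a stream of unknown length using O(k) memory:

```python
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```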
What’s the count distinct problem and how do you solve it?
The count-distinct problem consists of estimating the number of distinct elements in a data stream. Counting them exactly requires memory proportional to the number of distinct elements, so bit-pattern-based algorithms (such as Flajolet-Martin and its descendants) are used to solve the problem approximately and more efficiently.
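A toy sketch of the Flajolet-Martin idea: hash every item and track the maximum number of trailing zero bits seen; 2^max_zeros estimates the number of distinct items (real implementations average many such estimators to reduce variance):

```python
import hashlib

def trailing_zeros(n: int) -> int:
    if n == 0:
        return 32          # cap at the hash width
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def estimate_distinct(stream):
    max_zeros = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
        max_zeros = max(max_zeros, trailing_zeros(h))
    return 2 ** max_zeros

print(estimate_distinct([1, 2, 3, 2, 1, 4, 5, 5]))   # rough estimate of 5
```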
What is the membership problem and how do you solve it?
The membership problem involves determining whether a given element is a member of a set or not. In other words, it checks if an item exists in a collection of items.
We can rely on an old data structure: the Bloom filter (conceived by Burton Howard Bloom in 1970). A Bloom filter can return false positives, but no false negatives: if it says that an element hasn't been seen, then it really hasn't been seen. A Bloom filter is an array of m bits, with m > n, where n is the number of elements inserted.
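A minimal Bloom filter sketch with m bits and k hash functions (the sizes are arbitrary example values):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k positions by salting one hash function with the index i.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))   # True
print(bf.might_contain("bob"))     # almost certainly False; never a false negative
```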
What is the frequency problem and how do you solve it?
The frequency problem consists of estimating how many times each element appears in a stream. The most common algorithm for this problem is the Count-Min Sketch, so called because you first compute a series of approximate counts (one per hash function) and then keep the minimum of those.
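A minimal Count-Min Sketch: d rows of w counters, one salted hash per row; a query returns the minimum of the d counts, which may overestimate but never underestimates (the sizes are arbitrary example values):

```python
import hashlib

class CountMinSketch:
    def __init__(self, w=1000, d=5):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _index(self, row, item):
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(row, item)] for row in range(self.d))

cms = CountMinSketch()
for word in ["spark", "hadoop", "spark", "hive", "spark"]:
    cms.add(word)
print(cms.estimate("spark"))   # 3 (or slightly more, never less)
```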