Big Data & Cloud Module I Flashcards

1
Q

What is erasure encoding (EC)?

A

Erasure coding is a feature introduced in Hadoop 3 that protects data by dividing it into fragments, expanding and encoding them with redundant parity data, and storing the pieces across different locations. If a drive fails or data becomes corrupted, the original data can be reconstructed from the fragments stored on the other drives.
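
As a hedged illustration of the idea (not the actual Reed-Solomon codes HDFS uses), here is a minimal sketch of single-parity erasure coding with XOR: any one lost fragment can be rebuilt from the surviving fragments plus the parity.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Split a block into data fragments and compute one XOR parity fragment.
data_fragments = [b"frag-one", b"frag-two", b"frag-thr"]
parity = data_fragments[0]
for frag in data_fragments[1:]:
    parity = xor_bytes(parity, frag)

# Simulate losing fragment 1 and rebuilding it from the survivors + parity.
lost_index = 1
survivors = [f for i, f in enumerate(data_fragments) if i != lost_index]
rebuilt = parity
for frag in survivors:
    rebuilt = xor_bytes(rebuilt, frag)

assert rebuilt == data_fragments[lost_index]
```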

2
Q

What is Cluster Topology?

A

By cluster topology we mean the type and state of each node in the cluster and the relations between them (for example, which rack a node belongs to).
In order to make its placement and scheduling choices, Hadoop must know the cluster topology.
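
Hadoop learns the topology through rack awareness; one common mechanism is a user-supplied topology script (configured via net.topology.script.file.name) that maps each node address to a rack path. A minimal sketch, where the static mapping and rack labels are assumptions for illustration:

```python
#!/usr/bin/env python3
# Hypothetical topology script: Hadoop passes one or more node addresses
# as arguments and expects one rack path per node on stdout.
import sys

# Assumed static mapping for illustration; real scripts usually consult
# an inventory database or derive the rack from the IP subnet.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

for node in sys.argv[1:]:
    print(RACKS.get(node, "/default-rack"))
```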

3
Q

When might HDFS NOT be a good fit?

A
  • When there is a need for low-latency data access
  • When we are dealing with lots of small files
4
Q

What are the main file formats on Hadoop?

A

We can have traditional file formats (like text, CSV, XML) and Hadoop-specific file formats, which fall into row-oriented and column-oriented formats. The difference is that row-oriented file formats store data by grouping all attributes of a record together, while column-oriented file formats store data by grouping the values of each attribute together.
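
A minimal sketch of the two layouts, using made-up records purely for illustration:

```python
records = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob",   "age": 25},
]

# Row-oriented: all attributes of each record are stored together.
row_layout = [(r["id"], r["name"], r["age"]) for r in records]
# [(1, 'alice', 30), (2, 'bob', 25)]

# Column-oriented: the values of each attribute are stored together.
column_layout = {
    "id":   [r["id"] for r in records],
    "name": [r["name"] for r in records],
    "age":  [r["age"] for r in records],
}
# {'id': [1, 2], 'name': ['alice', 'bob'], 'age': [30, 25]}
```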

5
Q

What is Parquet storage?

A

Parquet storage is a columnar file format used in big data processing systems like Hadoop. It organizes and stores data in a highly optimized manner, grouping values of each column together for efficient compression and retrieval. This allows for faster data scanning and processing, especially when dealing with large datasets, as only the required columns are read from disk, reducing I/O operations and improving performance.
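
A small sketch of the column-pruning benefit, assuming pandas with a Parquet engine such as pyarrow is available (file name and columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["IT", "FR", "DE"],
    "clicks":  [10, 7, 42],
})
df.to_parquet("events.parquet")  # stored column by column, compressed

# Only the requested columns are read from disk, reducing I/O.
subset = pd.read_parquet("events.parquet", columns=["user_id", "clicks"])
print(subset)
```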

6
Q

What is YARN?

A

YARN (Yet Another Resource Negotiator) is the resource negotiator in Hadoop. It manages the cluster's resources and schedules tasks on the different workers. YARN provides two different daemons:

  1. The Resource Manager (RM): formed by a Scheduler, which allocates resources to the applications, and an Applications Manager, which accepts job submissions
  2. The Node Manager (NM): a per-node daemon. The NM is responsible for containers and monitors the containers' resource usage, reporting it to the RM
7
Q

Explain the concept of data locality

A

Data locality is the practice of moving the computation to the node where the data resides. By bringing computation close to the data, data locality reduces network communication and disk I/O, resulting in faster and more efficient data processing.

8
Q

What is MapReduce?

A

MapReduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster

9
Q

Explain what parts a MapReduce program is composed of and how it works

A

A MapReduce program is composed of two functions: a map function and a reduce function.

The map function performs filtering and sorting, and the reduce function performs a summary operation.

Together they run on the MapReduce system, which orchestrates the processing by marshalling the distributed servers, running tasks in parallel, and managing communications.
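
As a hedged sketch of the model (plain Python, outside any actual Hadoop cluster), the classic word-count job expressed as a map and a reduce function:

```python
from collections import defaultdict

def map_fn(line):
    # Emit an intermediate (key, value) pair for every word.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Summarize all values that share the same key.
    return word, sum(counts)

lines = ["big data is big", "data is everywhere"]

# Map phase: group intermediate values by key.
intermediate = defaultdict(list)
for line in lines:
    for word, count in map_fn(line):
        intermediate[word].append(count)

# Reduce phase: one summary per key.
result = dict(reduce_fn(w, c) for w, c in intermediate.items())
print(result)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```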

10
Q

Define what combiners are and what is their utility

A

Combiners in MapReduce are mini-reducers that operate locally on the output of the map phase. They help to optimize data transfer by reducing the amount of data sent across the network. Combiners perform partial aggregation on the intermediate key-value pairs generated by the map phase, thereby reducing the volume of data that needs to be transferred to the reducers. They are used to improve overall performance and reduce network congestion in MapReduce jobs.
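
A minimal sketch of the idea: the combiner applies the same aggregation locally on each mapper's output before anything crosses the network (plain Python, for illustration only):

```python
from collections import Counter

def map_fn(line):
    for word in line.split():
        yield word, 1

def combine(mapper_output):
    # Runs on the mapper's node: partial aggregation of its own pairs.
    partial = Counter()
    for word, count in mapper_output:
        partial[word] += count
    return list(partial.items())

# Without a combiner the mapper ships one pair per word occurrence;
# with it, it ships one pair per distinct word seen on that node.
local_pairs = list(map_fn("to be or not to be"))
print(len(local_pairs))           # 6 pairs shipped without a combiner
print(len(combine(local_pairs)))  # 4 pairs shipped with a combiner
```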

11
Q

Talk about Partitioning in MapReduce

A

Partitioning in MapReduce is the process of dividing the intermediate key-value pairs generated by the map phase into separate groups or partitions. Each partition corresponds to a specific reducer task. The goal of partitioning is to ensure that all key-value pairs with the same key are sent to the same reducer, enabling efficient and accurate data processing. Partitioning allows for parallelism in the reduce phase by enabling multiple reducers to work on different subsets of data concurrently. It helps in load balancing and ensures that data is distributed evenly across reducers, improving the overall performance and scalability of MapReduce jobs.
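
A sketch of the default hash-partitioning rule (Hadoop's actual HashPartitioner works on Java hash codes; this plain-Python version is just illustrative):

```python
def partition(key, num_reducers):
    # All pairs with the same key map to the same reducer index.
    return hash(key) % num_reducers

num_reducers = 3
pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]

partitions = {i: [] for i in range(num_reducers)}
for key, value in pairs:
    partitions[partition(key, num_reducers)].append((key, value))

# Both ("apple", 1) pairs always land in the same partition.
print(partitions)
```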

12
Q

How does MapReduce cope with Failure?

A

In MapReduce there are several ways to cope with the failure of a MapReduce job:

  • the Application Master sets each failed task back to idle and reassigns it to a worker when one becomes available

MapReduce was designed around low-end commodity servers, so it's pretty resilient by construction.

13
Q

What are some MapReduce Algorithms?

A

MapReduce is a framework: a set of building blocks that programmers compose into a wider product.

Each programmer needs to fit their solution into the MapReduce paradigm. Some common algorithm patterns are:

  1. Filtering algorithms: find the lines or tuples with particular characteristics
  2. Summarization algorithms: for example, count the number of requests to each subdomain
  3. Join: combine different inputs on some shared values, often used for pre-aggregations (see the sketch after this list)
  4. Sort: sort the inputs
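
A hedged sketch of a reduce-side join in plain Python: mappers tag each record with its source, and the reducer combines records that share the join key (table names and fields are made up for illustration):

```python
from collections import defaultdict

users  = [(1, "alice"), (2, "bob")]             # (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "mug")]  # (user_id, item)

# Map phase: emit (join_key, tagged_record) from both inputs.
intermediate = defaultdict(list)
for user_id, name in users:
    intermediate[user_id].append(("user", name))
for user_id, item in orders:
    intermediate[user_id].append(("order", item))

# Reduce phase: for each key, pair every user record with its orders.
joined = []
for user_id, records in intermediate.items():
    names = [v for tag, v in records if tag == "user"]
    items = [v for tag, v in records if tag == "order"]
    joined.extend((user_id, n, i) for n in names for i in items)

print(joined)  # [(1, 'alice', 'book'), (1, 'alice', 'pen'), (2, 'bob', 'mug')]
```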
14
Q

What is Apache Spark?

A

Apache Spark is a data processing framework that can quickly perform processing tasks over very large datasets, and can also distribute data processing across multiple computers

15
Q

Why did we move from MapReduce to Spark?

A

Because MapReduce has some limitations, given recent changes in technology:

  • MapReduce is more attuned to batch processing
  • MapReduce is a strict paradigm
  • New hardware capabilities are not exploited by MapReduce
  • It is too complex for many tasks
16
Q

Explain the Main Structure of Spark

A

Spark is built around two major components: the RDD and the DAG. RDD stands for Resilient Distributed Dataset, while DAG stands for Directed Acyclic Graph.

17
Q

Explain RDD

A

Resilient Distributed Dataset.

It's the primary data structure in Spark: a reliable, memory-efficient abstraction that speeds up processing by keeping data in memory across operations.
RDDs are immutable, distributed collections of objects; they are automatically rebuilt on failure.
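
A minimal PySpark sketch, assuming a local Spark installation, of creating and transforming an RDD:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# An immutable, partitioned collection distributed across the workers.
numbers = sc.parallelize(range(1, 11))

squares_of_evens = (numbers
                    .filter(lambda x: x % 2 == 0)  # transformation (lazy)
                    .map(lambda x: x * x))         # transformation (lazy)

print(squares_of_evens.collect())  # action: [4, 16, 36, 64, 100]
sc.stop()
```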

18
Q

Explain DAG

A

A DAG is a collection of individual tasks, represented as nodes, connected by edges that define dependencies between the tasks. The graph is directed, meaning that the edges have a direction that indicates the flow of data between tasks. It is also acyclic, meaning that there are no cycles or loops in the graph.

When you perform data processing operations in Spark, such as transformations and actions on RDDs (Resilient Distributed Datasets) or DataFrames, Spark automatically builds a DAG to represent the computation plan. The DAG captures the logical flow of operations that need to be executed to produce the desired output.
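
A short sketch of how the DAG shows up in practice: transformations only extend the plan, and nothing runs until an action is called (same hypothetical local setup as above; the data is made up):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "dag-demo")

logs = sc.parallelize([
    "INFO start", "ERROR disk full", "INFO ok", "ERROR timeout",
])

# Each transformation adds a node to the DAG; no computation happens yet.
errors = logs.filter(lambda line: line.startswith("ERROR"))
words  = errors.map(lambda line: line.split()[1])

# The action triggers Spark to turn the DAG into stages and tasks.
print(words.collect())  # ['disk', 'timeout']
sc.stop()
```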

19
Q

Explain the Spark Architecture

A

Spark uses a master/slave architecture with one central coordinator (DRIVER) and many distributed workers (EXECUTORS).

Cluster Manager: responsible for assigning and managing cluster resources (can be the Spark standalone manager or YARN)

Executor: executes tasks

Driver Program: converts user programs into tasks

20
Q

Deployment types in Spark

A
  1. Cluster mode: the driver process runs directly on a node in the cluster
  2. Client mode: the driver runs on a machine that does not belong to the cluster
  3. Local mode: the driver and executors run on the same machine
21
Q

What is RDD partitioning?

A

RDDs are collections of data so big that they need to be partitioned across different nodes. Spark automatically partitions RDDs and distributes the partitions across different nodes.

If not specified, Spark sets the number of partitions automatically. If there are too many, there is excessive overhead; if there are too few, some cores will not be used.
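
A small PySpark sketch (same hypothetical local setup) of inspecting and changing the number of partitions:

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "partitioning-demo")

# Explicitly request 8 partitions instead of the automatic default.
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.getNumPartitions())   # 8

# Too few partitions leave cores idle; too many add scheduling overhead.
coarser = rdd.coalesce(2)       # shrink without a full shuffle
finer   = rdd.repartition(16)   # grow (triggers a shuffle)
print(coarser.getNumPartitions(), finer.getNumPartitions())  # 2 16
sc.stop()
```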

22
Q

Explain Shuffling in Spark

A

Shuffling is the mechanism used to redistribute data across partitions. It is necessary to compute some operations (such as grouping or joining by key), and it is complex and costly (a sketch of an operation that forces a shuffle follows the list below).

The main shuffle implementations are:

  1. Hash shuffle: each mapper creates a file for every reducer
  2. Consolidated hash shuffle (an evolution of hash): each executor holds a pool of files
  3. Sort shuffle (the default): each mapper keeps its output in memory and spills to disk if necessary
  4. Tungsten sort: an evolution of sort shuffle that works directly on serialized records
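
A short PySpark sketch of an operation that triggers a shuffle: reduceByKey must bring together all values with the same key, which generally live in different partitions (same hypothetical local setup):

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "shuffle-demo")

pairs = sc.parallelize(
    [("it", 1), ("fr", 1), ("it", 1), ("de", 1), ("fr", 1)], numSlices=4
)

# reduceByKey aggregates locally first (like a combiner), then shuffles
# the partial sums so each key ends up in exactly one partition.
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('it', 2), ('fr', 2), ('de', 1)]
sc.stop()
```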
23
Q

How to configure a Cluster?

A
  1. Resource configuration: CPU and memory. Every executor in an application has a fixed number of cores and a fixed heap size
  2. CPU tuning: setting the number of executors and the cores per executor
  3. Memory tuning: setting the amount of memory per executor (see the configuration sketch after this list)
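
A hedged sketch of these knobs expressed as the standard spark.executor.* configuration properties (the values themselves are arbitrary illustration choices):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("cluster-config-demo")
        .set("spark.executor.instances", "4")      # number of executors
        .set("spark.executor.cores", "2")          # cores per executor
        .set("spark.executor.memory", "4g")        # heap size per executor
        .set("spark.executor.memoryOverhead", "512m"))

sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.executor.memory"))   # 4g
sc.stop()
```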
24
Q

Shared Variables in Spark

A

Shared variables in Spark are special variables that can be shared and used across multiple tasks in a distributed computing environment.

There are two types of shared variables in Spark:

Broadcast Variables: These are read-only variables that are broadcast to all the worker nodes in a cluster. This allows the workers to access the variable's value efficiently without sending a copy of the variable with each task. Broadcast variables are useful when a large dataset or a large lookup table needs to be shared among all the tasks.

Accumulators: These are variables that are used to accumulate values from the tasks running on worker nodes back to the driver program. Accumulators are typically used for aggregating values or collecting statistics from the tasks. They are designed to be only “added” to and provide a convenient way to gather information from distributed tasks without relying on explicit data shuffling.

By using shared variables, Spark avoids unnecessary data transfer and duplication, which can improve the performance and efficiency of distributed computations.
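
A minimal PySpark sketch of both kinds of shared variable (same hypothetical local setup; the lookup table is made up):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "shared-vars-demo")

# Broadcast variable: a read-only lookup table shipped once per worker.
country_names = sc.broadcast({"IT": "Italy", "FR": "France"})

# Accumulator: write-only from the tasks, readable on the driver.
unknown_codes = sc.accumulator(0)

def resolve(code):
    name = country_names.value.get(code)
    if name is None:
        unknown_codes.add(1)   # tallied back on the driver
    return name or "unknown"

codes = sc.parallelize(["IT", "FR", "XX", "IT"])
print(codes.map(resolve).collect())  # ['Italy', 'France', 'unknown', 'Italy']
print(unknown_codes.value)           # 1
sc.stop()
```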

25
Q

What is SQL-on-Hadoop?

A

It's a class of applications that combine SQL-style querying with the newer Hadoop components. SQL-on-Hadoop allows a wider group of enterprise developers and business analysts to work on Hadoop on commodity hardware.

26
Q

What are the SQL options on Hadoop?

A
  1. Batch SQL: SQL-like queries translated to MapReduce or Spark jobs. Tools: Apache Hive, Apache Spark
  2. Interactive SQL: interactive queries to enable traditional BI and analytics. Tools: Impala, Apache Drill
  3. Operational SQL: OLTP workloads and apps that operate over smaller datasets with inserts, updates and deletes. Tools: Apache HBase and other NoSQL stores
27
Q

Talk about Apache Hive

A

Apache Hive is a distributed, fault-tolerant data warehouse that enables analytics at massive scale.

Hive is built on and closely integrated with Hadoop. It gives the ability to query large datasets with an SQL-like interface.

Best use: batch jobs over large quantities of append-only data. Not suitable for real-time queries.

28
Q

Talk about Spark SQL

A

Spark SQL is a Spark Module for structured data processing. It provides a programming abstraction called DataFrames and can act as a distributed SQL query engine.

It enables unmodified Hive queries to run up to 100x faster
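
A minimal PySpark sketch of the DataFrame abstraction and the SQL engine side by side (local setup and data are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("sparksql-demo")
         .getOrCreate())

df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)

# DataFrame API and SQL are two views of the same engine.
df.filter(df.age > 30).show()

df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```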

29
Q

What is AQE? Adaptive Query Execution

A

Adaptive Query Execution is one of the headline features of Spark 3.0. It re-optimizes and adjusts query plans based on runtime statistics collected during the execution of the query.

The main idea is that the query plan is not final, and additional optimizations may be applied while the query runs.
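
A short sketch of enabling the feature through its configuration flag (in Spark 3.x the property is spark.sql.adaptive.enabled):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("aqe-demo")
         .config("spark.sql.adaptive.enabled", "true")
         .getOrCreate())

print(spark.conf.get("spark.sql.adaptive.enabled"))  # true
spark.stop()
```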

30
Q

What is NoSQL?

A

NoSQL is an approach to database design that enables the storage and querying of data outside the traditional structures found in relational databases.

NoSQL supports several data models, gives freedom from joins (which are not supported or are discouraged), and freedom from rigid schemas.

NoSQL databases are flexible, scalable systems that excel at handling large amounts of unstructured and semi-structured data, such as social media posts, sensor data, and multimedia content. They offer scalability (in particular horizontal, scale-out scalability), flexibility, high performance, and availability, and they can efficiently store and process diverse data formats, including JSON, XML, key-value pairs, and document-oriented data.

31
Q

What are NoSQL data models?

A
  1. Graph: each DB contains one or more graphs, and each graph contains vertices and arcs. Vertices represent real-world entities; arcs represent relationships between entities
  2. Key-value: each DB contains one or more collections (tables); each collection contains a list of key-value pairs
  3. Document: each DB contains one or more collections, and each collection contains a list of documents (see the sketch after this list)
  4. Wide column: a key-value model in which each key maps to a set of named columns
  5. Row-oriented and column-oriented layouts
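
A rough sketch, purely as an assumption for illustration, of how the same entity might look in the key-value, document, and wide-column styles:

```python
# Key-value: the value is opaque to the database.
kv_store = {"user:42": b'{"name": "alice", "city": "Rome"}'}

# Document: the value is a structured, queryable document (often JSON).
document_store = {
    "users": [
        {"_id": 42, "name": "alice", "city": "Rome",
         "orders": [{"item": "book", "qty": 2}]},
    ]
}

# Wide column: each row key maps to a sparse set of named columns.
wide_column_store = {
    "users": {
        "row:42": {"profile:name": "alice", "profile:city": "Rome",
                   "stats:logins": 17},
    }
}
```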
32
Q

What is sharding data and why is it good in NoSQL?

A

Database sharding is the process of storing a large database across multiple machines. One of the strengths of NoSQL is its scale-out capability.

A good sharding strategy is fundamental to optimizing performance.

33
Q

What are the two types of replications that we can do?

A

Replication means that the data is copied on several nodes, which improves the robustness of the system: if a node fails, replicas prevent data loss. The two common approaches are master-slave replication, where one node holds the authoritative copy and the replicas serve reads, and peer-to-peer replication, where all nodes hold equal copies and can accept both reads and writes.

34
Q

What are ACID properties and differences in consistency between RDBMS and NoSQL?

A

Data consistency means that the user sees a consistent view of the data. ACID means four properties are guaranteed:

Atomicity: the transaction is indivisible

Consistency: the transaction leaves the DB in a consistent state

Isolation: the transaction is independent of the others

Durability: the DBMS protects the DB from failures

In an RDBMS, consistency is guaranteed to the detriment of speed and availability. NoSQL systems are mainly PA/EL (in PACELC terms), i.e., focused on availability and low latency.

35
Q

What is polyglot persistence?

A

It means using multiple data storage technologies for the varying data storage needs across an application. Using a single DBMS to handle everything usually leads to inefficient solutions: each activity has its own requirements, and one size fits all does not work anymore.

36
Q

What is data streaming?

A

Streaming data is data that is continuously generated by thousands of data sources.

Big data is not only about batch analysis, but also about analyzing data streams. The main concerns for a streaming system are:

  • latency: the lower the better
  • workload balancing

In our context, a system for data streaming is a type of data processing engine designed with infinite datasets in mind

37
Q

What are the classifications of real-time systems?

A

Hard: latency of nanoseconds, microseconds

Firm: latency of milliseconds, seconds

Soft: latency of seconds, minutes

38
Q

What are the data stream models?

A

Time series model: the stream presents the values of the signal in increasing order of their index, so each element is observed once

Cash register model: each arriving item is a positive increment to one of the underlying counters

Turnstile model: arriving items may both increment and decrement the underlying counters

39
Q

What is the difference between stream time and event time?

A

There is a potentially significant difference between the time at which an event occurs and the time at which an event enters the streaming system

Stream time: defined by the system as the event enters the pipeline

Event time: carried by the event itself

40
Q

What is random sampling?

A

It's a common technique in data streaming: keep a representative random subset of the stream, so that real-time statistical analysis (for example, of watched videos) can be performed without storing the whole stream.
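
One standard way to do this is reservoir sampling (named here as an assumption about which technique the course intends): it keeps a uniform random sample of fixed size k from a stream of unknown length.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i is kept with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```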

41
Q

What’s the count distinct problem and how do you solve it?

A

The count-distinct problem is estimating the number of distinct elements seen so far in a stream. Counting them exactly requires memory proportional to the number of distinct elements, so it is usually solved with bit-pattern-based algorithms (such as Flajolet-Martin and its descendants), which estimate the count far more efficiently.
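
A hedged sketch of the Flajolet-Martin idea (a single hash function and no averaging, so the estimate is very rough): hash every item and track the maximum number of trailing zero bits seen; 2 to that power estimates the number of distinct items.

```python
import hashlib

def trailing_zeros(n):
    if n == 0:
        return 32
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def fm_estimate(stream):
    max_zeros = 0
    for item in stream:
        # 32-bit hash of the item.
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")
        max_zeros = max(max_zeros, trailing_zeros(h))
    return 2 ** max_zeros

# 500 distinct values, each repeated twice: repetitions hash identically,
# so the estimate depends only on the distinct items.
stream = list(range(500)) * 2
print(fm_estimate(stream))  # a rough power-of-two estimate of 500
```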

42
Q

What is the membership problem and how do you solve it?

A

The membership problem involves determining whether a given element is a member of a set or not. In other words, it checks if an item exists in a collection of items.

We can rely on an old data structure: Bloom filters (Conceived by Burton Howard Bloom in 1970) A Bloom filter can return false positives, but no false negatives. If it says that “it hasn’t been seen”, then it hasn’t been seen. A Bloom filter is an array of m bits, with m > n
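
A minimal Bloom-filter sketch (the hash construction and sizes are arbitrary illustration choices):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m = m                 # number of bits
        self.k = k                 # number of hash functions
        self.bits = [False] * m

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))    # True
print(bf.might_contain("mallory"))  # almost certainly False
```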

43
Q

What is the frequency problem and how do you solve it?

A

The frequency problem is estimating how many times each element appears in a stream, without keeping an exact counter per element. The most common algorithm for this problem is the Count-Min Sketch: you first maintain a series of approximated counts (several small hashed counter arrays, each inflated only by collisions), then at query time you keep the minimum of those counts.
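
A minimal Count-Min Sketch implementation sketch (width and depth are arbitrary illustration values):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counts, so the minimum is the best guess.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["spark", "hadoop", "spark", "hive", "spark"]:
    cms.add(word)
print(cms.estimate("spark"))   # >= 3, usually exactly 3
print(cms.estimate("impala"))  # >= 0, usually 0
```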