Big Data Refresher Flashcards

1
Q

What is Spark?

A

Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers.

2
Q

What is Hadoop?

A

Hadoop is an open-source framework that utilizes a network of clustered computers to store and process large datasets.

3
Q

What is Hive?

A

Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL, built on top of Apache Hadoop.

4
Q

What are the core components of Spark?

A

Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX, SparkR

5
Q

What are the core components of Hadoop?

A

HDFS, YARN, MapReduce.

6
Q

What is HDFS?

A

HDFS stands for Hadoop Distributed File System, and it is the storage component of Hadoop. It is responsible for storing large structured and unstructured datasets across the various nodes. It consists of two core components: the NameNode and the DataNode. The NameNode is the primary (master) node and holds the metadata about the data. The DataNodes are where the actual data is stored; they read, write, and replicate the data blocks.

7
Q

What is YARN?

A

YARN stands for Yet Another Resource Negotiator. It is the resource management component of Hadoop. YARN consists of three components: the ResourceManager, the NodeManager, and the ApplicationMaster. The ResourceManager is in charge of allocating resources to all the applications in the system. The NodeManager is responsible for containers and monitors their resource usage, such as CPU, memory, and disk. The ApplicationMaster works as an interface between the ResourceManager and the NodeManager, negotiating resources as the application requires.

8
Q

What is MapReduce?

A

MapReduce is the processing component of Hadoop. MapReduce makes use of two functions, map() and reduce(). map() sorts and filters the data, organizing it into groups, and produces key-value pairs that are later processed by reduce().
reduce() does the summarization by aggregating the mapped data. In short, reduce() takes the output generated by map() as its input and combines those tuples into a smaller set of tuples.

9
Q

What are the characteristics of HDFS?

A

Fault tolerant - the Hadoop framework divides data into blocks and then creates multiple copies of each block on different machines in the cluster.
Scalable - whenever requirements grow you can scale the cluster. Two scalability mechanisms are available in HDFS: vertical and horizontal scaling.
High availability - in unfavorable situations such as a node failure, a user can still access their data from other nodes, because duplicate copies of the blocks are present on the other nodes in the HDFS cluster.

10
Q

How is Apache Spark different from MapReduce?

A
  1. Spark processes data both in (near) real time and in batches, whereas MapReduce only does batch processing.
  2. Spark can be up to 100 times faster than MapReduce for in-memory workloads.
  3. Spark keeps intermediate data in RAM, whereas MapReduce writes intermediate data to disk.
11
Q

How does Spark run its applications with the help of its architecture?

A

Spark applications run as independent processes that are coordinated by the SparkSession object in the driver program. The resource manager or cluster manager assigns tasks to the worker nodes with one task per partition. Iterative algorithms apply operations repeatedly to the data so they can benefit from caching datasets across iterations. A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Finally, the results are sent back to the driver application or can be saved to the disk.

12
Q

What are RDDs?

A

An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of elements of your data, partitioned across the nodes in your cluster so that it can be operated on in parallel.
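
For illustration, a minimal sketch (the app name and data are made up; a local session is created just for this example) that builds an RDD from a local collection and operates on it in parallel:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session used only for this sketch
val spark = SparkSession.builder().appName("rdd-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Distribute a local collection across 4 partitions as an RDD
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 4)

// Each partition is processed in parallel
println(rdd.map(_ * 2).collect().mkString(", "))  // 2, 4, 6, 8, 10
```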

13
Q

What is lazy evaluation in Spark?

A

When Spark operates on any dataset, it remembers the instructions. For example, when a transformation is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action, which aids in optimizing the overall data processing workflow.
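
A small sketch of this behavior, assuming an existing SparkContext `sc` (as in the RDD example above):

```scala
// Transformations only record the lineage; nothing runs yet
val nums  = sc.parallelize(1 to 1000000)
val evens = nums.map(_ * 2).filter(_ % 4 == 0)

// The pipeline is evaluated only when an action is called
println(evens.count())
```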

14
Q

What is a Parquet file and what are its advantages?

A

Parquet is a columnar storage file format that is used to store large datasets efficiently. Some of its advantages are that it lets you fetch only the specific columns you need, consumes less space, uses type-specific encoding, and limits I/O operations.
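
A hedged sketch of writing and reading Parquet with Spark, assuming a SparkSession `spark`; the path and column names are hypothetical:

```scala
import spark.implicits._  // enables toDF on local collections

val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// Write to Parquet, then read back only the column that is needed
people.write.mode("overwrite").parquet("/tmp/people.parquet")
spark.read.parquet("/tmp/people.parquet").select("name").show()
```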

15
Q

What is Shuffling in Spark?

A

Shuffling is the process of redistributing data across partitions.

16
Q

What is the use of coalesce in Spark?

A

The coalesce method is used to reduce the number of partitions in a DataFrame.
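
For example (assuming a SparkSession `spark`; the DataFrame contents are immaterial):

```scala
val df = spark.range(0, 1000).toDF("id")

println(df.rdd.getNumPartitions)              // the default parallelism of the session
println(df.coalesce(1).rdd.getNumPartitions)  // 1, without a full shuffle
```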

17
Q

What are the various functionalities supported by Spark Core?

A

Spark Core is the engine for parallel and distributed processing of large datasets. Some of the functionalities include scheduling and monitoring jobs, memory management, fault recovery, and task dispatching.

18
Q

How do you convert an RDD into a DataFrame?

A

Use the toDF() function.

Use SparkSession.createDataFrame().
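
A minimal sketch of both options, assuming a SparkSession `spark`; the column names are made up:

```scala
import spark.implicits._  // required for toDF on an RDD

val rdd = spark.sparkContext.parallelize(Seq(("alice", 30), ("bob", 25)))

val df1 = rdd.toDF("name", "age")                         // option 1
val df2 = spark.createDataFrame(rdd).toDF("name", "age")  // option 2
df1.show()
df2.show()
```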

19
Q

What are transformations and actions?

A

Transformations are operations that are performed on an RDD to create a new RDD containing the results (e.g. map, filter, join, union).
Actions are operations that return a value after running a computation on an RDD (e.g. min, max, count, collect).
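
For instance, assuming a SparkContext `sc`:

```scala
val words = sc.parallelize(Seq("spark", "hadoop", "hive", "spark"))

// Transformations lazily describe new RDDs
val lengths  = words.map(_.length)
val longOnes = lengths.filter(_ > 4)

// Actions trigger the computation and return values to the driver
println(longOnes.count())
println(words.collect().mkString(", "))
```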

20
Q

What is a broadcast variable?

A

Broadcast variables are read-only shared variables that are cached on and available to all nodes in the cluster. Using broadcast variables can improve performance by reducing the amount of network traffic and data serialization required to execute your Spark application: because the variables are cached on all the nodes, we do not need to send the data to each node every time it is used.
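
A hedged sketch, assuming a SparkContext `sc`; the lookup table is hypothetical:

```scala
// The small map is shipped once to every executor instead of with every task
val countryNames = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))

val codes = sc.parallelize(Seq("US", "DE", "US"))
val resolved = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
println(resolved.collect().mkString(", "))
```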

21
Q

What are accumulators?

A

Spark Accumulators are shared variables which are only “added” through an associative and commutative operation and are used to perform counter or sum operations.
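
For example, a counter of unparseable records (assuming a SparkContext `sc`; the data is made up):

```scala
val badRecords = sc.longAccumulator("badRecords")

val parsed = sc.parallelize(Seq("1", "2", "oops", "4")).flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}

parsed.count()             // accumulators are only updated once an action runs
println(badRecords.value)  // 1
```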

22
Q

What are some of the features of Apache Spark?

A

High processing speed, in-memory computation, fault-tolerance, stream processing in real-time, multiple language support.

23
Q

What is client mode?

A

Client mode is when the Spark driver component runs on the machine from which the Spark job is submitted. The main disadvantage of this mode is that if that machine fails, the entire job fails. This mode is not preferred in production environments.

24
Q

What is cluster mode?

A

Cluster mode is when the Spark driver component does not run on the machine from which the Spark job was submitted. Instead, the job launches the driver component within the cluster as part of the ApplicationMaster sub-process. This mode has a dedicated cluster manager for allocating the resources required for the job to run.

25
Q

What is repartition?

A

Repartition can increase or decrease the number of data partitions. It performs a full shuffle, as opposed to the partial shuffle used by coalesce, which makes it potentially slower and more expensive.
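
Side by side with coalesce (assuming a SparkSession `spark`; the DataFrame is built from a hypothetical range):

```scala
val df = spark.range(0, 1000000).toDF("id")

val wide   = df.repartition(200)  // full shuffle; can increase or decrease partitions
val narrow = df.coalesce(10)      // narrow dependency; can only decrease partitions

println(wide.rdd.getNumPartitions)    // 200
println(narrow.rdd.getNumPartitions)  // at most the original count, here 10
```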

26
Q

What is DAG?

A

DAG stands for Directed Acyclic Graph: a graph with a finite number of vertices and edges and no directed cycles. Each edge is directed from one vertex to another in a sequential manner. In Spark, the vertices refer to RDDs and the edges represent the operations to be performed on those RDDs.

27
Q

What is Spark Streaming and how is it implemented?

A

It is a Spark API extension that supports stream processing of data from different sources. Data from sources like Kafka and Flume is processed and pushed to various destinations like databases, dashboards, machine learning APIs, or file systems. Classic Spark Streaming is implemented on top of DStreams, which divide the incoming stream into small micro-batches that are then processed by the Spark engine.
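
A classic DStream word-count sketch, assuming a SparkContext `sc` and a text stream arriving on a hypothetical localhost socket:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batches of 5 seconds
val ssc = new StreamingContext(sc, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()
```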

28
Q

What are DataSets?

A

A Dataset is an immutable distributed collection of data, similar to a DataFrame, with the difference being that it is strongly typed.

Datasets have the following features:

Optimized query feature: Spark Datasets provide optimized queries using the Tungsten and Catalyst Query Optimizer frameworks. The Catalyst Query Optimizer represents and manipulates a data-flow graph (a graph of expressions and relational operators), while Tungsten improves the execution speed of Spark jobs by optimizing for the hardware architecture of the Spark execution platform.
Compile-time analysis: Datasets allow syntax and type errors to be caught at compile time, which is not possible with DataFrames or regular SQL queries.

29
Q

What are DataFrames?

A

DataFrames are distributed collections of data organized into columns, similar to tables in a relational database.

30
Q

What are worker nodes in Spark?

A

A worker node is a node that runs the application code in the cluster; it is the slave node. The master node assigns work, and the worker nodes actually perform the assigned tasks. Worker nodes process the data stored on them and report their resources to the master.

31
Q

What is partitioning in Hive?

A

Partitioning allows you to organize a large table into smaller parts based on the values of one or more columns. This helps reduce query latency because queries scan only the relevant partitions and their corresponding datasets.
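
A minimal HiveQL-style sketch, issued here through spark.sql and assuming Hive support is enabled; the table and column names are hypothetical:

```scala
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales (item STRING, amount DOUBLE)
  PARTITIONED BY (sale_date STRING)
""")

// A filter on the partition column lets Hive scan only the matching partitions
spark.sql("SELECT SUM(amount) FROM sales WHERE sale_date = '2024-01-01'").show()
```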

32
Q

What is bucketing in Hive?

A

Bucketing is the process of hashing the values in a column into several user-defined buckets, which helps avoid over-partitioning. Bucketing optimizes sampling and shortens query response time.

33
Q

What is a case class?

A

A Scala case class is like a regular class, except that it is well suited to modeling immutable data. It is also useful in pattern matching.
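
A small sketch (the class and fields are made up):

```scala
case class Person(name: String, age: Int)

val p = Person("Alice", 30)  // immutable: p.age cannot be reassigned

// Case classes work naturally with pattern matching
val label = p match {
  case Person(_, age) if age >= 18 => "adult"
  case _                           => "minor"
}
println(label)  // adult
```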

34
Q

What are some optimization techniques in Spark?

A

Using DataFrames over RDDs, because the Catalyst optimizer creates a query plan that results in better performance.
Using broadcast variables to store small data locally on the nodes.
Using cache and persist to keep a dataset in memory.
Using repartition or coalesce to maintain parallelism.

35
Q

What are the different persistence levels in Spark?

A
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER (serialized)
MEMORY_AND_DISK_SER (serialized)
DISK_ONLY
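
A hedged sketch of choosing a level explicitly, assuming a SparkContext `sc`:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100000).map(_ * 2)

rdd.persist(StorageLevel.MEMORY_AND_DISK)  // cache() is shorthand for MEMORY_ONLY on RDDs
rdd.count()      // the first action materializes and stores the partitions
rdd.unpersist()  // release the cached blocks when no longer needed
```
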
36
Q

What are the Spark driver and the Spark executor?

A

The Spark driver is where the main method of our program runs. It executes the user code and creates the SparkSession, which is responsible for creating RDDs, DataFrames, and Datasets and for performing transformations and actions. Spark executors reside on the worker nodes; each runs individual tasks and returns the results to the driver.

37
Q

What is AWS and what services does AWS offer?

A

AWS stands for Amazon Web Services, a cloud computing platform that offers services such as database storage options, computing power, content delivery, and networking.
Examples of these services are EC2 (Elastic Compute Cloud), which provides virtual machines that act as servers on which you can deploy applications;
S3 (Amazon Simple Storage Service), an object storage service;
and EMR (Elastic MapReduce), a managed cluster platform.

38
Q

What is the ETL process?

A

Extract, Transform, Load.
It is a data integration process in which you extract data from multiple sources, transform the data, and finally load it into a data warehouse system.