Big Data Refresher Flashcards

1
Q

What is Spark?

A

Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers.

2
Q

What is Hadoop?

A

Hadoop is an open-source framework that utilizes a network of clustered computers to store and process large datasets.

3
Q

What is Hive?

A

Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL, built on top of Apache Hadoop.

4
Q

What are the core components of Spark?

A

Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX, SparkR

5
Q

What are the core components of Hadoop?

A

HDFS, YARN, MapReduce.

6
Q

What is HDFS?

A

HDFS stands for Hadoop Distributed File System, and it is the storage component of Hadoop. It is responsible for storing large structured and unstructured datasets across the various nodes. It consists of two core components: the NameNode and the DataNode. The NameNode is the primary (master) node and holds the metadata about the data. The DataNodes are where the actual data is stored; they read, write, and replicate the data blocks.

7
Q

What is YARN?

A

YARN stands for Yet Another Resource Negotiator. It is the resource management component of Hadoop. YARN consists of three components: the ResourceManager, the NodeManager, and the ApplicationMaster. The ResourceManager is in charge of allocating resources to all the applications in the system. The NodeManager is responsible for containers and monitors their resource usage, such as CPU, memory, and disk. The ApplicationMaster works as an interface between the ResourceManager and the NodeManager, negotiating resources as the application requires.

8
Q

What is MapReduce?

A

MapReduce is the processing component of Hadoop. MapReduce makes use of two functions, map() and reduce(). map() sorts and filters the data, organizing it into groups, and produces key-value pairs that are later processed by reduce().
reduce() does the summarization by aggregating the mapped data. In short, reduce() takes the output generated by map() as its input and combines those tuples into a smaller set of tuples.

9
Q

What are the characteristics of HDFS?

A

Fault tolerant - the Hadoop framework divides data into blocks and then creates multiple copies of each block on different machines in the cluster.
Scalable - whenever requirements grow you can scale the cluster. Two scalability mechanisms are available in HDFS: vertical and horizontal scaling.
High availability - in unfavorable situations such as a node failure, a user can still access their data from other nodes, because duplicate copies of the blocks are present on the other nodes in the HDFS cluster.

10
Q

How is Apache Spark different from MapReduce?

A
  1. Spark processes data both in (near) real time and in batches, whereas MapReduce only does batch processing.
  2. Spark can be up to 100 times faster than MapReduce for in-memory workloads.
  3. Spark keeps intermediate data in RAM, whereas MapReduce writes intermediate data to disk.
11
Q

How does Spark run its applications with the help of its architecture?

A

Spark applications run as independent processes that are coordinated by the SparkSession object in the driver program. The resource manager or cluster manager assigns tasks to the worker nodes with one task per partition. Iterative algorithms apply operations repeatedly to the data so they can benefit from caching datasets across iterations. A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Finally, the results are sent back to the driver application or can be saved to the disk.

12
Q

What are RDDs?

A

An RDD (Resilient Distributed Dataset) is an immutable, distributed collection of elements of your data, partitioned across the nodes in your cluster so that it can be operated on in parallel.
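
For illustration, a minimal sketch (the app name and data are made up; a local session is created just for this example) that builds an RDD from a local collection and operates on it in parallel:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session used only for this sketch
val spark = SparkSession.builder().appName("rdd-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Distribute a local collection across 4 partitions as an RDD
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 4)

// Each partition is processed in parallel
println(rdd.map(_ * 2).collect().mkString(", "))  // 2, 4, 6, 8, 10
```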

13
Q

What is lazy evaluation in Spark?

A

When Spark operates on any dataset, it remembers the instructions. For example, when a transformation is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action, which aids in optimizing the overall data processing workflow.
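
A small sketch of this behavior, assuming an existing SparkContext `sc` (as in the RDD example above):

```scala
// Transformations only record the lineage; nothing runs yet
val nums  = sc.parallelize(1 to 1000000)
val evens = nums.map(_ * 2).filter(_ % 4 == 0)

// The pipeline is evaluated only when an action is called
println(evens.count())
```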

14
Q

What is a Parquet file and what are its advantages?

A

Parquet is a columnar storage file format that is used to store large datasets efficiently. Some of its advantages are that it lets you fetch only the specific columns you need, consumes less space, uses type-specific encoding, and limits I/O operations.
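
A hedged sketch of writing and reading Parquet with Spark, assuming a SparkSession `spark`; the path and column names are hypothetical:

```scala
import spark.implicits._  // enables toDF on local collections

val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

// Write to Parquet, then read back only the column that is needed
people.write.mode("overwrite").parquet("/tmp/people.parquet")
spark.read.parquet("/tmp/people.parquet").select("name").show()
```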

15
Q

What is Shuffling in Spark?

A

Shuffling is the process of redistributing data across partitions.

16
Q

What is the use of coalesce in Spark?

A

The coalesce method is used to reduce the number of partitions in a DataFrame.
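
For example (assuming a SparkSession `spark`; the DataFrame contents are immaterial):

```scala
val df = spark.range(0, 1000).toDF("id")

println(df.rdd.getNumPartitions)              // the default parallelism of the session
println(df.coalesce(1).rdd.getNumPartitions)  // 1, without a full shuffle
```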

17
Q

What are the various functionalities supported by Spark Core?

A

Spark Core is the engine for parallel and distributed processing of large datasets. Some of the functionalities include scheduling and monitoring jobs, memory management, fault recovery, and task dispatching.

18
Q

How do you convert an RDD into a DataFrame?

A

Use the toDF() function.

Use SparkSession.createDataFrame().
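
A minimal sketch of both options, assuming a SparkSession `spark`; the column names are made up:

```scala
import spark.implicits._  // required for toDF on an RDD

val rdd = spark.sparkContext.parallelize(Seq(("alice", 30), ("bob", 25)))

val df1 = rdd.toDF("name", "age")                         // option 1
val df2 = spark.createDataFrame(rdd).toDF("name", "age")  // option 2
df1.show()
df2.show()
```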

19
Q

What are transformations and actions?

A

Transformations are operations that are performed on an RDD to create a new RDD containing the results (e.g. map, filter, join, union).
Actions are operations that return a value after running a computation on an RDD (e.g. min, max, count, collect).
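
For instance, assuming a SparkContext `sc`:

```scala
val words = sc.parallelize(Seq("spark", "hadoop", "hive", "spark"))

// Transformations lazily describe new RDDs
val lengths  = words.map(_.length)
val longOnes = lengths.filter(_ > 4)

// Actions trigger the computation and return values to the driver
println(longOnes.count())
println(words.collect().mkString(", "))
```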

20
Q

What is a broadcast variable?

A

Broadcast variables are read-only shared variables that are cached on and available to all nodes in the cluster. Using broadcast variables can improve performance by reducing the amount of network traffic and data serialization required to execute your Spark application: because the variables are cached on all the nodes, we do not need to send the data to each node every time it is used.
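
A hedged sketch, assuming a SparkContext `sc`; the lookup table is hypothetical:

```scala
// The small map is shipped once to every executor instead of with every task
val countryNames = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))

val codes = sc.parallelize(Seq("US", "DE", "US"))
val resolved = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
println(resolved.collect().mkString(", "))
```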

21
Q

What are accumulators?

A

Spark Accumulators are shared variables which are only “added” through an associative and commutative operation and are used to perform counter or sum operations.
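
For example, a counter of unparseable records (assuming a SparkContext `sc`; the data is made up):

```scala
val badRecords = sc.longAccumulator("badRecords")

val parsed = sc.parallelize(Seq("1", "2", "oops", "4")).flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}

parsed.count()             // accumulators are only updated once an action runs
println(badRecords.value)  // 1
```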

22
Q

What are some of the features of Apache Spark?

A

High processing speed, in-memory computation, fault-tolerance, stream processing in real-time, multiple language support.

23
Q

What is client mode?

A

Client mode is when the Spark driver component runs on the machine from which the Spark job is submitted. The main disadvantage of this mode is that if that machine fails, the entire job fails. This mode is not preferred in production environments.

24
Q

What is cluster mode?

A

Cluster mode is when the Spark driver component does not run on the machine from which the Spark job was submitted. Instead, the job launches the driver component within the cluster as part of the ApplicationMaster sub-process. This mode has a dedicated cluster manager for allocating the resources required for the job to run.

25
Q

What is repartition?

A

Repartition can increase or decrease the number of data partitions. It performs a full shuffle, as opposed to the partial shuffle used by coalesce, which makes it potentially slower and more expensive.
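
Side by side with coalesce (assuming a SparkSession `spark`; the DataFrame is built from a hypothetical range):

```scala
val df = spark.range(0, 1000000).toDF("id")

val wide   = df.repartition(200)  // full shuffle; can increase or decrease partitions
val narrow = df.coalesce(10)      // narrow dependency; can only decrease partitions

println(wide.rdd.getNumPartitions)    // 200
println(narrow.rdd.getNumPartitions)  // at most the original count, here 10
```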

26
Q

What is DAG?

A

DAG stands for Directed Acyclic Graph: a graph with a finite number of vertices and edges and no directed cycles. Each edge is directed from one vertex to another in a sequential manner. In Spark, the vertices refer to RDDs and the edges represent the operations to be performed on those RDDs.

27
Q

What is Spark Streaming and how is it implemented?

A

It is a Spark API extension that supports stream processing of data from different sources. Data from sources like Kafka and Flume is processed and pushed to various destinations like databases, dashboards, machine learning APIs, or file systems. Classic Spark Streaming is implemented on top of DStreams, which divide the incoming stream into small micro-batches that are then processed by the Spark engine.
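
A classic DStream word-count sketch, assuming a SparkContext `sc` and a text stream arriving on a hypothetical localhost socket:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batches of 5 seconds
val ssc = new StreamingContext(sc, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()
```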

28
Q

What are DataSets?

A

A Dataset is an immutable distributed collection of data, similar to a DataFrame, with the difference being that it is strongly typed.

Datasets have the following features:

Optimized query feature: Spark Datasets provide optimized queries using the Tungsten and Catalyst Query Optimizer frameworks. The Catalyst Query Optimizer represents and manipulates a data-flow graph (a graph of expressions and relational operators), while Tungsten improves the execution speed of Spark jobs by optimizing for the hardware architecture of the Spark execution platform.
Compile-time analysis: Datasets allow syntax and type errors to be caught at compile time, which is not possible with DataFrames or regular SQL queries.

29
Q

What are DataFrames?

A

DataFrames are distributed collections of data organized into columns, similar to tables in a relational database.

30
Q

What are worker nodes in Spark?

A

A worker node is a node that runs the application code in the cluster; it is the slave node. The master node assigns work, and the worker nodes actually perform the assigned tasks. Worker nodes process the data stored on them and report their resources to the master.

31
Q

What is partitioning in Hive?

A

Partitioning allows you to organize a large table into smaller parts based on the values of one or more columns. This helps reduce query latency because queries scan only the relevant partitions and their corresponding datasets.
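
A minimal HiveQL-style sketch, issued here through spark.sql and assuming Hive support is enabled; the table and column names are hypothetical:

```scala
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales (item STRING, amount DOUBLE)
  PARTITIONED BY (sale_date STRING)
""")

// A filter on the partition column lets Hive scan only the matching partitions
spark.sql("SELECT SUM(amount) FROM sales WHERE sale_date = '2024-01-01'").show()
```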

32
Q

What is bucketing in Hive?

A

Bucketing is the process of hashing the values in a column into several user-defined buckets, which helps avoid over-partitioning. Bucketing optimizes sampling and shortens query response time.

33
Q

What is a case class?

A

A Scala case class is like a regular class, except that it is well suited to modeling immutable data. It is also useful in pattern matching.
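
A small sketch (the class and fields are made up):

```scala
case class Person(name: String, age: Int)

val p = Person("Alice", 30)  // immutable: p.age cannot be reassigned

// Case classes work naturally with pattern matching
val label = p match {
  case Person(_, age) if age >= 18 => "adult"
  case _                           => "minor"
}
println(label)  // adult
```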

34
Q

What are some optimization techniques in Spark?

A

Using DataFrames over RDDs, because the Catalyst optimizer creates a query plan that results in better performance.
Using broadcast variables to store small data locally on the nodes.
Using cache and persist to keep a dataset in memory.
Using repartition or coalesce to maintain parallelism.

35
Q

What are the different persistence levels in Spark?

A
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER (serialized)
MEMORY_AND_DISK_SER (serialized)
DISK_ONLY
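
A hedged sketch of choosing a level explicitly, assuming a SparkContext `sc`:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100000).map(_ * 2)

rdd.persist(StorageLevel.MEMORY_AND_DISK)  // cache() is shorthand for MEMORY_ONLY on RDDs
rdd.count()      // the first action materializes and stores the partitions
rdd.unpersist()  // release the cached blocks when no longer needed
```
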
36
Q

What are the Spark driver and the Spark executor?

A

The Spark driver is where the main method of our program runs. It executes the user code and creates the SparkSession, which is responsible for creating RDDs, DataFrames, and Datasets and for performing transformations and actions. Spark executors reside on the worker nodes; each runs individual tasks and returns the results to the driver.

37
Q

What is AWS and what services does AWS offer?

A

AWS stands for Amazon Web Services, a cloud computing platform that offers services such as database storage options, computing power, content delivery, and networking.
Examples of these services are EC2 (Elastic Compute Cloud), which provides virtual machines that act as servers on which you can deploy applications;
S3 (Amazon Simple Storage Service), an object storage service;
and EMR (Elastic MapReduce), a managed cluster platform.

38
Q

What is the ETL process?

A

Extract, Transform, Load.
It is a data integration process in which you extract data from multiple sources, transform the data, and finally load it into a data warehouse system.