05 Spark DF Internals Flashcards
Difference between Hadoop & Spark in terms of intermediate result storage
Hadoop (MapReduce) stores intermediate results on disk.
Spark keeps intermediate results in memory instead of writing them to disk.
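A minimal PySpark sketch of the in-memory idea (app name and sizes are arbitrary): an intermediate result can be cached in memory with `cache()` and reused across actions instead of being rewritten to disk between steps.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intermediate-results").getOrCreate()

df = spark.range(1_000_000)                 # base DataFrame
evens = df.filter(df.id % 2 == 0)           # intermediate result
evens.cache()                               # keep it in memory, not on disk

print(evens.count())                        # first action materializes the cache
print(evens.filter(evens.id > 10).count())  # second action reuses the cached data
```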
Full Form of RDD
Resilient Distributed Dataset
Use of RDD
RDDs are the low-level data structure on which transformations (and actions) are applied.
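A small sketch of working with an RDD directly (the values are arbitrary): a transformation such as `map` returns a new RDD, and an action such as `collect` brings the results back to the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])   # a Resilient Distributed Dataset
doubled = rdd.map(lambda x: x * 2)      # transformation: builds a new RDD
print(doubled.collect())                # action: [2, 4, 6, 8, 10]
```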
What are the steps before a query is executed
query -> Spark SQL engine -> Catalyst optimizer -> RDD -> execute
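A sketch of how to look at this pipeline from code (the view name `numbers` is made up): `explain(True)` prints the logical plans produced by the Catalyst optimizer and the physical plan that ultimately runs on RDDs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-plan").getOrCreate()

spark.range(100).createOrReplaceTempView("numbers")
df = spark.sql("SELECT id FROM numbers WHERE id > 10")

# Prints the parsed, analyzed and optimized logical plans (Catalyst)
# plus the physical plan that is executed as RDD operations.
df.explain(True)
```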
Relation between spark dataframe, table & RDD
Spark tables and DataFrames are logical views over RDDs: every table/DataFrame operation is eventually compiled down to RDD operations.
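A quick way to see this relationship (a sketch; the range size is arbitrary): every DataFrame exposes the RDD of `Row` objects that backs it via `.rdd`.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-to-rdd").getOrCreate()

df = spark.range(10)       # a DataFrame (logical view)
rdd = df.rdd               # the underlying RDD of Row objects
print(rdd.take(3))         # [Row(id=0), Row(id=1), Row(id=2)]
```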
Properties of RDD
- An RDD is partitioned: the input file is split into multiple smaller partitions, which are stored and processed across the cluster (see the sketch below).
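A sketch of inspecting and changing the partitioning (the sizes and partition counts are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions").getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=8)   # data split into 8 partitions
print(df.rdd.getNumPartitions())                  # 8

df4 = df.repartition(4)                           # redistribute into 4 partitions
print(df4.rdd.getNumPartitions())                 # 4
```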
What are transformations
They are operations applied to a DataFrame that produce a new DataFrame; they are evaluated lazily and only run when an action is called.
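A minimal sketch of that behaviour (the range size is arbitrary): `filter` returns a new DataFrame, but nothing executes until the `count` action.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations").getOrCreate()

df = spark.range(100)
evens = df.filter(df.id % 2 == 0)   # transformation: returns a NEW DataFrame,
                                    # nothing has executed yet (lazy)
print(evens.count())                # the action triggers execution: prints 50
```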
Types of transformations
Narrow and wide transformations
Define Narrow Transformation
When the file is read it is converted to an RDD and split into multiple partitions. For a transformation such as `where`/`filter`, the operation runs on each partition independently and the output is produced per partition.
There is no shuffle in this.
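A sketch (partition count arbitrary): `where` is narrow, so the physical plan contains no Exchange (shuffle) node.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow").getOrCreate()

df = spark.range(0, 1000, numPartitions=4)

# Each of the 4 partitions is filtered independently; no data moves
# between partitions, so there is no shuffle.
narrow = df.where(df.id < 100)
narrow.explain()   # the physical plan contains no Exchange node
```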
Define wide transformation
When the file is read it is converted to an RDD and split into multiple partitions. For a transformation such as `groupBy`, each partition is processed and then the partial results from all partitions must be combined, which causes a shuffle.
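A matching sketch (the bucket expression is arbitrary): `groupBy` is wide, so the physical plan contains an Exchange (shuffle) node.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wide").getOrCreate()

df = spark.range(0, 1000, numPartitions=4)

# Rows with the same key must be brought together from all partitions,
# which forces a shuffle (Exchange) in the plan.
wide = df.groupBy((df.id % 10).alias("bucket")).agg(F.count("*").alias("cnt"))
wide.explain()     # look for the Exchange node in the physical plan
```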
How does Spark divide its work (execution hierarchy)
job -> stages -> tasks
how are jobs decided
- Jobs are decided from the dependencies (transformations) leading up to each action.
- Each action creates exactly one job (see the sketch below).
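A sketch (the filter expression is arbitrary): two actions on the same DataFrame produce two separate jobs, which are visible in the Spark UI (default port 4040).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jobs").getOrCreate()

df = spark.range(1000).filter("id % 2 = 0")

df.count()      # action 1 -> job 1
df.collect()    # action 2 -> job 2
# Each action shows up as its own job in the Spark UI.
```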
how are jobs executed
- Spark creates a logical query plan for each job.
- The job is broken into stages at each wide transformation.
- Each wide transformation introduces a stage boundary, splitting the work into two stages: one before the shuffle and one after it (see the sketch below).
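A sketch of spotting the stage boundary (the grouping key is arbitrary): the Exchange node in the plan marks the shuffle, and in the Spark UI the job for `collect()` shows two stages, one on each side of it.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stages").getOrCreate()

df = spark.range(0, 1000, numPartitions=4)
agg = df.groupBy((df.id % 10).alias("bucket")).agg(F.count("*").alias("cnt"))

agg.explain()   # the Exchange node is the stage boundary
agg.collect()   # in the Spark UI this job runs as 2 stages (before/after shuffle)
```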
features of stages
- The last operation of a stage is either a wide transformation or an action.
- Dependent stages cannot run in parallel: a stage that consumes shuffle output must wait for the previous stage to finish.
features of a task
- If the data has x partitions, the transformation is applied to each partition, so x tasks are created.
- The output of each task is stored in an exchange buffer (used during the shuffle).
- The number of tasks in a stage depends on the number of partitions.
- Tasks run in parallel (see the sketch below).
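A sketch of the tasks-per-partition relationship (partition counts are arbitrary; adaptive execution is disabled here only so the post-shuffle partition count stays predictable):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tasks").getOrCreate()

# Assumption: turn off adaptive query execution and lower the shuffle
# partitions from the default of 200 so the counts below are predictable.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "8")

df = spark.range(0, 1_000_000, numPartitions=4)
counts = df.groupBy((df.id % 100).alias("k")).count()

print(df.rdd.getNumPartitions())       # 4 -> 4 tasks in the first stage
print(counts.rdd.getNumPartitions())   # 8 -> 8 tasks in the post-shuffle stage
```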