05 Spark DF Intervals Flashcards

1
Q

Difference between Hadoop & Spark in terms of intermediate result storage

A

Hadoop (MapReduce) stores intermediate results on disk.
Spark keeps intermediate results in memory instead of writing them to disk.

2
Q

Full Form of RDD

A

Resilient Distributed Dataset

3
Q

Use of RDD

A

RDDs are used to apply transformations on distributed data.

4
Q

What are the steps before execution of a query

A

query –> Spark SQL engine –> Catalyst optimizer –> RDD –> execute

5
Q

Relation between spark dataframe, table & RDD

A

Spark tables and DataFrames are logical views over RDDs: every table/DataFrame operation is eventually converted to RDD operations.

6
Q

Properties of RDD

A
  1. An RDD partitions the data: to store a large file, it divides the file into multiple smaller partitions and stores them across the cluster.
7
Q

What are transformations

A

They are operations applied to a DataFrame that produce a new DataFrame.

8
Q

Types of transformations

A

Narrow and wide transformations

9
Q

Define Narrow Transformation

A

While executing, the file is converted to an RDD, which breaks it into multiple partitions. For a transformation such as `where`/`filter`, the operation is performed independently on each partition and the output is generated directly.
There is no shuffle in this.

10
Q

Define wide transformation

A

While executing, the file is converted to an RDD, which breaks it into multiple partitions. For a transformation such as `group by`, the operation is performed on each partition, but the outputs from all partitions must then be combined, which causes a shuffle.

11
Q

How does Spark divide its work hierarchy (job)

A

job –> stages –> tasks

12
Q

How are jobs decided

A
  1. Jobs are decided based on the dependencies between the operations on a table.
  2. Each action results in exactly one job.
13
Q

How are jobs executed

A
  1. Spark creates a logical query plan for each job.
  2. The plan is broken into stages at each wide transformation.
  3. Each wide transformation therefore introduces a stage boundary, splitting the work into two stages.
14
Q

Features of stages

A
  1. The last operation of a stage is either a wide transformation or an action.
  2. Stages cannot run in parallel; each stage depends on the output of the previous one.
15
Q

Features of a task

A
  1. If the data has x partitions, the transformation is applied to each partition, so x tasks are created.
  2. The output of each task is stored in an exchange buffer.
  3. The number of tasks depends on the number of partitions.
  4. Tasks run in parallel.
16
Q

What is a shuffle operation

A

Movement of data from the write exchange buffer of one stage to the read exchange buffer of the next stage (i.e., the sharing of data between stages).

17
Q

Number of stages in query planning

A

There are 4 stages

18
Q

What are the 4 stages of query planning

A
  1. Analysis - the SQL engine checks the catalog to verify that the referenced tables and columns exist.
  2. Logical optimisation - Spark creates an optimised logical plan based on a set of optimisation rules.
  3. Physical planning - Spark calculates the cost of running the query in different ways and chooses the cheapest physical plan.
  4. Code generation - using the selected physical plan, Spark generates RDD code and executes the query.