05 Spark DF Internals Flashcards
Difference between Hadoop & Spark in terms of intermediate result storage
Hadoop (MapReduce) stores intermediate results on disk.
Spark keeps intermediate results in memory instead of writing them to disk.
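A minimal PySpark sketch of the in-memory idea (app name and sizes are arbitrary): an intermediate result can be cached in memory with `cache()` and reused across actions instead of being rewritten to disk between steps.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intermediate-results").getOrCreate()

df = spark.range(1_000_000)                 # base DataFrame
evens = df.filter(df.id % 2 == 0)           # intermediate result
evens.cache()                               # keep it in memory, not on disk

print(evens.count())                        # first action materializes the cache
print(evens.filter(evens.id > 10).count())  # second action reuses the cached data
```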
Full Form of RDD
Resilient Distributed Dataset
Use of RDD
RDDs are the low-level data structure on which transformations (and actions) are applied.
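A small sketch of working with an RDD directly (the values are arbitrary): a transformation such as `map` returns a new RDD, and an action such as `collect` brings the results back to the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])   # a Resilient Distributed Dataset
doubled = rdd.map(lambda x: x * 2)      # transformation: builds a new RDD
print(doubled.collect())                # action: [2, 4, 6, 8, 10]
```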
What are the steps before a query is executed
query -> Spark SQL engine -> Catalyst optimizer -> RDD -> execute
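A sketch of how to look at this pipeline from code (the view name `numbers` is made up): `explain(True)` prints the logical plans produced by the Catalyst optimizer and the physical plan that ultimately runs on RDDs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-plan").getOrCreate()

spark.range(100).createOrReplaceTempView("numbers")
df = spark.sql("SELECT id FROM numbers WHERE id > 10")

# Prints the parsed, analyzed and optimized logical plans (Catalyst)
# plus the physical plan that is executed as RDD operations.
df.explain(True)
```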
Relation between spark dataframe, table & RDD
Spark tables and DataFrames are logical views over RDDs: every table/DataFrame operation is eventually compiled down to RDD operations.
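A quick way to see this relationship (a sketch; the range size is arbitrary): every DataFrame exposes the RDD of `Row` objects that backs it via `.rdd`.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-to-rdd").getOrCreate()

df = spark.range(10)       # a DataFrame (logical view)
rdd = df.rdd               # the underlying RDD of Row objects
print(rdd.take(3))         # [Row(id=0), Row(id=1), Row(id=2)]
```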
Properties of RDD
- An RDD is partitioned: the input file is split into multiple smaller partitions, which are stored and processed across the cluster (see the sketch below).
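A sketch of inspecting and changing the partitioning (the sizes and partition counts are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions").getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=8)   # data split into 8 partitions
print(df.rdd.getNumPartitions())                  # 8

df4 = df.repartition(4)                           # redistribute into 4 partitions
print(df4.rdd.getNumPartitions())                 # 4
```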
What are transformations
They are operations applied to a DataFrame that produce a new DataFrame; they are evaluated lazily and only run when an action is called.
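A minimal sketch of that behaviour (the range size is arbitrary): `filter` returns a new DataFrame, but nothing executes until the `count` action.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations").getOrCreate()

df = spark.range(100)
evens = df.filter(df.id % 2 == 0)   # transformation: returns a NEW DataFrame,
                                    # nothing has executed yet (lazy)
print(evens.count())                # the action triggers execution: prints 50
```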
Types of transformations
Narrow and wide transformations
Define Narrow Transformation
When the file is read it is converted to an RDD and split into multiple partitions. For a transformation such as `where`/`filter`, the operation runs on each partition independently and the output is produced per partition.
There is no shuffle in this.
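A sketch (partition count arbitrary): `where` is narrow, so the physical plan contains no Exchange (shuffle) node.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow").getOrCreate()

df = spark.range(0, 1000, numPartitions=4)

# Each of the 4 partitions is filtered independently; no data moves
# between partitions, so there is no shuffle.
narrow = df.where(df.id < 100)
narrow.explain()   # the physical plan contains no Exchange node
```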
Define wide transformation
When the file is read it is converted to an RDD and split into multiple partitions. For a transformation such as `groupBy`, each partition is processed and then the partial results from all partitions must be combined, which causes a shuffle.
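A matching sketch (the bucket expression is arbitrary): `groupBy` is wide, so the physical plan contains an Exchange (shuffle) node.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wide").getOrCreate()

df = spark.range(0, 1000, numPartitions=4)

# Rows with the same key must be brought together from all partitions,
# which forces a shuffle (Exchange) in the plan.
wide = df.groupBy((df.id % 10).alias("bucket")).agg(F.count("*").alias("cnt"))
wide.explain()     # look for the Exchange node in the physical plan
```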
How does Spark divide its work (execution hierarchy)
job -> stages -> tasks
how are jobs decided
- Jobs are decided from the dependencies (transformations) leading up to each action.
- Each action creates exactly one job (see the sketch below).
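A sketch (the filter expression is arbitrary): two actions on the same DataFrame produce two separate jobs, which are visible in the Spark UI (default port 4040).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jobs").getOrCreate()

df = spark.range(1000).filter("id % 2 = 0")

df.count()      # action 1 -> job 1
df.collect()    # action 2 -> job 2
# Each action shows up as its own job in the Spark UI.
```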
how are jobs executed
- Spark creates a logical query plan for each job.
- The job is broken into stages at each wide transformation.
- Each wide transformation introduces a stage boundary, splitting the work into two stages: one before the shuffle and one after it (see the sketch below).
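A sketch of spotting the stage boundary (the grouping key is arbitrary): the Exchange node in the plan marks the shuffle, and in the Spark UI the job for `collect()` shows two stages, one on each side of it.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stages").getOrCreate()

df = spark.range(0, 1000, numPartitions=4)
agg = df.groupBy((df.id % 10).alias("bucket")).agg(F.count("*").alias("cnt"))

agg.explain()   # the Exchange node is the stage boundary
agg.collect()   # in the Spark UI this job runs as 2 stages (before/after shuffle)
```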
features of stages
- The last operation of a stage is either a wide transformation or an action.
- Dependent stages cannot run in parallel: a stage that consumes shuffle output must wait for the previous stage to finish.
features of a task
- If the data has x partitions, the transformation is applied to each partition, so x tasks are created.
- The output of each task is stored in an exchange buffer (used during the shuffle).
- The number of tasks in a stage depends on the number of partitions.
- Tasks run in parallel (see the sketch below).
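A sketch of the tasks-per-partition relationship (partition counts are arbitrary; adaptive execution is disabled here only so the post-shuffle partition count stays predictable):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tasks").getOrCreate()

# Assumption: turn off adaptive query execution and lower the shuffle
# partitions from the default of 200 so the counts below are predictable.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "8")

df = spark.range(0, 1_000_000, numPartitions=4)
counts = df.groupBy((df.id % 100).alias("k")).count()

print(df.rdd.getNumPartitions())       # 4 -> 4 tasks in the first stage
print(counts.rdd.getNumPartitions())   # 8 -> 8 tasks in the post-shuffle stage
```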