Architecture Flashcards
A dataframe is immutable. True/False
True
How are changes tracked on dataframes
The initial state is immutable and kept on each node; modifications are expressed as transformations, which are shared with each node
How can you see the lineage of a data frame
.explain("formatted")
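A minimal PySpark sketch (the DataFrame here is just an illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.range(10).withColumn("doubled", col("id") * 2)
df.explain("formatted")  # "formatted" mode (Spark 3.0+) splits the plan into an overview and details
```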
What triggers a transformation on a dataframe
An action
A transformation where one partition results in one output partition is called what
Narrow Transformation or Narrow Dependency
In the parsed logical plan and the analyzed logical plan, which uses the catalog
analyzed
How many CPU cores per partition
1
T/F Cluster Manager is a component of a Spark App
False
Where is the driver in deploy-mode cluster
On a node inside the cluster. The Cluster Manager is responsible for maintaining the cluster and executor nodes
Where is the driver in deploy-mode client
On a node not in the cluster
Is there a performance difference between writing SQL queries and DataFrame code
NO
What kind of programming model is Spark
Functional - Same inputs lead to the same outputs; transformations are constant
When you perform a shuffle, Spark outputs how many partitions
200 (the default value of spark.sql.shuffle.partitions)
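A small PySpark sketch of where the 200 shows up (the bucketing expression is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200" unless overridden

df = spark.range(1000)
counts = df.groupBy((df.id % 5).alias("bucket")).count()  # groupBy triggers a shuffle
print(counts.rdd.getNumPartitions())  # 200 by default (AQE may coalesce this at runtime)
```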
What is schema inference
Take the best guess at what the schema of our data frame should be
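A sketch of triggering schema inference when reading a file; the CSV path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "/tmp/people.csv" is a hypothetical path.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")  # sample the file and guess each column's type
      .csv("/tmp/people.csv"))
df.printSchema()
```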
What port does the Spark UI run on
4040
What type of transformation is aggregation
wide
What type of transformation is filter
Narrow
What are the 3 kinds of actions
- View data in the console
- Collect data to native objects in the respective language
- Write to output data sources
.count() is an example of a what
an action
What is predicate pushdown
Automatically pushing filters down to the data source, so less data is read into Spark
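A sketch of how you might observe this; the Parquet path and column name are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# "/tmp/events.parquet" and the "year" column are made up for illustration.
df = spark.read.parquet("/tmp/events.parquet")
df.filter(col("year") == 2020).explain(True)
# The physical plan lists the filter under "PushedFilters",
# meaning the Parquet reader itself skips non-matching data.
```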
What is lazy evaluation
Spark will wait until the very last moment to execute the graph of computation instructions
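A small PySpark illustration (the column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)
evens = df.filter(df.id % 2 == 0)          # transformation: nothing runs yet
doubled = evens.selectExpr("id * 2 AS x")  # still nothing runs
print(doubled.count())                     # action: the whole plan executes now
```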
Shuffles will perform filters and then…
Write the results to disk
What is pipelining
On narrow transformations, a chain of operations (such as filters) is performed in memory, without writing intermediate results to disk
A wide dependency is
Input partitions contributing to many output partitions
What is a narrow dependency
Each input partition will contribute to only one output partition
What are the 2 types of transformations
Narrow dependencies and wide dependencies
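A quick PySpark illustration of each (the expressions are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(100)
narrow = df.filter(df.id > 10)                     # narrow: one input partition -> one output partition
wide = df.groupBy((df.id % 3).alias("k")).count()  # wide: input partitions feed many output partitions (shuffle)
```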
Spark will not act on transformations until
an action is called
Are core data structures mutable or immutable
immutable
With Dataframes, you have to manipulate partitions manually
False
If you have one partition and many executors, what parallelism do you have
1
What is a partition
A collection of rows that sit on one physical machine
To allow every executor to perform in parallel, Spark breaks the data into
Partitions
What is a dataframe
Structured API that represents a table of data with rows and columns
How many spark sessions can you have across a Spark App
1
You control your SparkApp through a driver process called
SparkSession
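A minimal sketch of creating one (the app name is hypothetical):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("my-app")    # hypothetical name
         .master("local[*]")   # local mode: driver and executors share one machine
         .getOrCreate())
print(spark.version)
```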
What are Spark’s Language APIS
Scala, Java, R, Python, SQL
What is the point of the cluster manager
Keep track of resources available
What is local mode
Driver and Executor live on the same machine
What are the 3 core cluster managers
Spark’s Standalone Manager
YARN
Mesos
The driver process is responsible for what 3 things
Maintaining info about the Spark App
Respond to the user program and input
Analyze, distribute and schedule work across executors
Which process runs your main() function
driver
A spark app consists of what two processes
Driver
Executor
Executors are responsible for what two things
Executing code assigned to it
Reporting state of the computation back to the driver node
At which stage do the first set of optimizations take place?
Logical Optimization
When using DataFrame.persist() data on disk is always serialized. T/F
True
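One way to see this in PySpark (sketch only; the DataFrame is arbitrary):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)
df.persist(StorageLevel.DISK_ONLY)  # data stored on disk is kept serialized
df.count()        # an action materializes the cached data
df.unpersist()    # note: the method is unpersist(); there is no uncache()
```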
Spark dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed partitions. Which property needs to be enabled to achieve this?
spark.sql.adaptive.skewJoin.enabled
The goal of Dynamic Partition Pruning (DPP) is to allow you to read only as much data as you need. Which property needs to be set in order to use this functionality?
spark.sql.optimizer.dynamicPartitionPruning.enabled
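A sketch of setting both properties in PySpark (skew-join handling also assumes AQE itself is enabled):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Skew-join handling is part of AQE, so AQE must be on as well.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Dynamic Partition Pruning (on by default in Spark 3.x).
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
```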
The DataFrame class does not have an uncache() operation T/F
True
What are worker nodes
Worker nodes are the nodes of a cluster that perform computations
For text files, we can only have one column of a dataframe we want to write T/F
True
How do you specify a left outer join
left_outer
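A tiny PySpark example (the data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

people = spark.createDataFrame([(1, "Ann"), (2, "Bob")], ["id", "name"])
orders = spark.createDataFrame([(1, 9.99)], ["id", "amount"])

# "left_outer" (alias: "left") keeps every row from the left DataFrame.
people.join(orders, on="id", how="left_outer").show()
```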
A job is
A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect).
What is the relationship between an executor and a worker
An executor is a Java Virtual Machine (JVM) running on a worker node.
How are global temp views addressed
spark.read.table("global_temp.<view name>")
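A sketch, assuming a view named my_view:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.range(5).createOrReplaceGlobalTempView("my_view")

# Global temp views live in the reserved global_temp database and are
# visible to every SparkSession in the application.
spark.read.table("global_temp.my_view").show()
```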
When is a data frame writer treated as a global external/unmanaged table
Spark manages the metadata, while you control the data location. As soon as you add the 'path' option in the DataFrame writer, it will be treated as a global external/unmanaged table. When you drop the table, only the metadata gets dropped. A global unmanaged/external table is available across all clusters.
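A sketch; the path and table name are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adding the 'path' option makes the table external/unmanaged:
# dropping it later removes only the metadata, not the files.
(spark.range(10)
 .write
 .option("path", "/tmp/unmanaged_demo")  # hypothetical location
 .saveAsTable("demo_table"))
```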
What are the possible strategies to decrease garbage collection time?
Persist objects in serialized form
Create fewer objects
Increase Java heap space size
Which property is used to scale up and down dynamically based on an application's current number of pending tasks in a Spark cluster?
Dynamic allocation (spark.dynamicAllocation.enabled)
If spark is running in client mode, where is the driver located
on the client machine that submitted the application
What causes a stage boundary
a shuffle
What function will avoid a shuffle if the new partitions are known to be less than the existing partitions
.coalesce(lesser number)
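A quick PySpark illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000).repartition(8)
print(df.coalesce(2).rdd.getNumPartitions())   # 2: collapsed without a full shuffle
print(df.coalesce(16).rdd.getNumPartitions())  # still 8: coalesce cannot increase partitions
```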
When will a broadcast join be forced
By default spark.sql.autoBroadcastJoinThreshold = 10MB; tables below this threshold are broadcast automatically, while anything above it will not be. A broadcast join can be forced explicitly with the broadcast() hint.
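A sketch of forcing a broadcast with the broadcast() hint (the data is made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large = spark.range(1_000_000).withColumnRenamed("id", "key")
small = spark.range(100).withColumnRenamed("id", "key")

# broadcast() forces a broadcast join regardless of the size threshold.
large.join(broadcast(small), on="key").explain()  # look for BroadcastHashJoin
```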
What command can we use to get the number of partitions of a DataFrame named df?
df.rdd.getNumPartitions()
Lay out the Catalyst Optimizer steps
SQL Query / DataFrame
→ Unresolved Logical Plan
→ (analysis, using the Catalog)
→ Logical Plan
→ (logical optimization)
→ Optimized Logical Plan
→ (physical planning)
→ Physical Plans
→ (cost model)
→ Selected Physical Plan
→ (code generation)
→ RDDs
What is dynamic allocation?
If you are running multiple Spark Applications on the same cluster, Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload.
What is required to turn on dynamic allocation
- Set spark.dynamicAllocation.enabled to true
- Set up an external shuffle service on each worker node in the same cluster and set spark.shuffle.service.enabled to true in your application
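A sketch of the configuration in PySpark; in practice these are usually passed via spark-submit --conf, since they must be set before the application starts:

```python
from pyspark.sql import SparkSession

# Illustration only: both settings must be in place at application startup.
spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         .getOrCreate())
```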
What is the purpose of the external shuffle service
The purpose of the external shuffle service is to allow executors to be removed without deleting shuffle files written by them.
What is the default file format for output
Parquet
is .25 an acceptable input for a fraction
no
What does adaptive query execution (AQE) allow you to do?
AQE attempts to do the following at runtime:
- Reduce the number of reducers in the shuffle stage by decreasing the number of shuffle partitions.
- Optimize the physical execution plan of the query, for example by converting a SortMergeJoin into a BroadcastHashJoin where appropriate.
- Handle data skew during a join.
What can be done with Spark catalyst optimizer
- Dynamically convert physical plans to RDDs.
- Dynamically reorganize query orders.
- Dynamically select physical plans based on cost.
What is an equivalent code block to:
df.filter(col("count") < 2)
df.where("count < 2")
What is the purpose of a cluster manager
The cluster manager allocates resources to the Spark Applications and maintains the executor process in client mode
What is the idea behind dynamic partition pruning in Spark
skip over data you do not need in the results of the query
Will spark’s garbage collector clean up persisted objects
Yes, in least-recently-used (LRU) fashion
The Dataset API is not available in Python T/F
True
A viable way to improve Spark’s performance when dealing with large amounts of data, given that there is only a single application running on the cluster
increase the values for spark.default.parallelism and spark.sql.shuffle.partitions
Dynamically injecting scan filters for join operations to limit the amount of data to be considered in a query is part of what
Dynamic Partition Pruning
What is a Stage
A stage represents a group of tasks that can be executed together to perform the same operation on multiple executors in parallel.
A stage is a combination of transformations which does not cause any shuffling of data across nodes.
Spark starts a new stage when shuffling is required in the job.
How many executors is a task sent to
1
What is a task
Each task is a combination of blocks of data and a set of transformations that will run on a single executor.
What is a possibility if the number of partitions is too small
If the number is too small it will reduce concurrency and possibly cause data skewing.
If there are too many partitions…
there will be a mismatch between task scheduling and task execution: scheduling overhead can outweigh the actual work done per task.
What is coalesce
Collapses partitions on the same worker to avoid shuffling.
What are some examples of transformations
select, sum, groupBy, orderBy, filter, limit
What are examples of an action
show
count
collect
save
Coalesce cannot be used to increase the number of partitions T/F
True
Is printSchema considered an action
No
Is first considered an action
Yes
When choosing a storage level, what denotes serialized
SER
A driver
runs your main() function
assigns work to be done in parallel
maintains information about the Spark Application
What happens at a stage boundary in spark
data is written to disk by tasks in the parent stages and then fetched over the network by tasks in the child stage
is foreach() an action
yes
is limit() considered an action
no
In cluster mode the driver will be put onto a worker node t/f
true