Architecture Flashcards
A DataFrame is immutable. True/False
True
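How are changes tracked on DataFrames
The source data is never modified. Each transformation returns a new DataFrame whose lineage (the recorded plan of transformations) captures the change, and that plan is shared with the nodes that execute it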
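The idea in the card above can be sketched with a toy model in plain Python. This is not Spark code: `ToyFrame` and its methods are hypothetical names invented for illustration. It only shows how an immutable object can track changes by returning a new object that carries a longer lineage.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: fixed source rows plus a lineage."""
    source: Tuple[int, ...]                          # never modified
    lineage: Tuple[Tuple[str, Callable], ...] = ()   # recorded transformations

    def filter(self, pred):
        # Return a NEW frame; the original keeps its own (shorter) lineage.
        return ToyFrame(self.source, self.lineage + (("filter", pred),))

    def map(self, fn):
        return ToyFrame(self.source, self.lineage + (("map", fn),))

    def collect(self):
        # Replay the recorded lineage against the untouched source rows.
        rows = list(self.source)
        for kind, fn in self.lineage:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows


base = ToyFrame((1, 2, 3, 4))
evens_doubled = base.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
```

Here `base` still has an empty lineage after `evens_doubled` is built, which mirrors the "initial state is unchangeable" part of the answer.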
How can you see the lineage of a DataFrame
.explain("formatted")
What triggers execution of the transformations on a DataFrame
An action
A transformation where each input partition results in only one output partition is called what
Narrow Transformation or Narrow Dependency
Between the parsed logical plan and the analyzed logical plan, which one uses the catalog
The analyzed logical plan
How many CPU cores work on each partition
1 (one task per partition, and by default one core per task)
True/False: The Cluster Manager is a component of a Spark App
False
Where is the driver in deploy-mode cluster
On a node inside the cluster. The Cluster Manager is responsible for maintaining the cluster and executor nodes
Where is the driver in deploy-mode client
On the client machine that submitted the application, which is not one of the cluster's nodes
Is there a performance difference between writing SQL Queries or DataFrame Code
No. Both compile to the same underlying plan
What kind of programming model is Spark
Functional: the same inputs always produce the same outputs as long as the transformations stay constant
When you perform a shuffle, Spark outputs how many partitions
200 by default (set by spark.sql.shuffle.partitions)
What is schema inference
Spark takes its best guess at what the schema of the DataFrame should be by reading a little of the data
What port does the Spark UI run on
4040 (by default, on the driver)
What type of transformation is aggregation
Wide
What type of transformation is filter
Narrow
What are the 3 kinds of actions
- View data in the console
- Collect data to native objects in the respective language
- Write to output data sources
.count() is an example of what
An action
What is predicate pushdown
Spark automatically pushes a filter down toward the data source, so less data has to be read and processed
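A rough way to picture predicate pushdown is a plain-Python toy model. `ToySource`, `scan`, and `pushed_filter` are hypothetical names, not Spark APIs. The sketch compares how many rows a source hands over when the filter runs after the scan versus inside it.

```python
class ToySource:
    """Hypothetical data source that counts the rows it hands to the engine."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.rows_returned = 0

    def scan(self, pushed_filter=None):
        # When a filter is pushed down, the source applies it before returning rows.
        out = [r for r in self.rows if pushed_filter is None or pushed_filter(r)]
        self.rows_returned += len(out)
        return out


def is_big(row):
    return row > 90


# Without pushdown: read everything, then filter in the engine.
no_pushdown = ToySource(range(100))
result_a = [r for r in no_pushdown.scan() if is_big(r)]

# With pushdown: the source filters first and returns only matching rows.
with_pushdown = ToySource(range(100))
result_b = with_pushdown.scan(pushed_filter=is_big)
```

Both paths produce the same result; only the number of rows leaving the source differs.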
What is lazy evaluation
Spark will wait until the very last moment to execute the graph of computation instructions
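Lazy evaluation can be illustrated without Spark at all, since Python's built-in `map` is itself lazy. In this minimal sketch (the names `calls` and `traced_double` are made up for the example), building the pipeline does no work, and consuming it plays the role of an action.

```python
calls = []  # records every row that actually gets processed


def traced_double(x):
    calls.append(x)
    return x * 2


# "Transformation": builds a lazy pipeline; no row is processed yet.
lazy = map(traced_double, range(5))
work_before_action = len(calls)

# "Action": consuming the pipeline forces the computation.
result = list(lazy)
work_after_action = len(calls)
```

`work_before_action` stays at zero because nothing runs until the result is requested.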
When Spark performs a shuffle, it will…
Write the results to disk
What is pipelining
With narrow transformations, chained operations such as multiple filters are all performed in memory
A wide dependency is
Input partitions contributing to many output partitions
What is a narrow dependency
Each input partition contributes to only one output partition
What are the 2 type of transformations
Narrow dependencies and wide dependencies
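The difference between the two kinds of transformation can be sketched with lists standing in for partitions. This is a plain-Python toy, not Spark code, and `narrow_filter`, `wide_group_by`, and `num_out` are hypothetical names chosen for the example.

```python
from collections import defaultdict

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]


def narrow_filter(parts, pred):
    # Narrow: each input partition produces exactly one output partition,
    # and no row moves between partitions.
    return [[row for row in part if pred(row)] for part in parts]


def wide_group_by(parts, key_fn, num_out):
    # Wide: rows from any input partition may land in any output partition,
    # chosen here by hashing the key (a stand-in for a shuffle).
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for row in part:
            key = key_fn(row)
            out[hash(key) % num_out][key].append(row)
    return [dict(d) for d in out]


filtered = narrow_filter(partitions, lambda r: r % 2 == 0)
grouped = wide_group_by(partitions, lambda r: r % 2, num_out=2)
```

In `filtered`, the three output partitions line up one-to-one with the inputs. In `grouped`, every input partition contributes rows to both output partitions.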
Spark will not act on transformations until
an action is called
Core data structures are mutable or immutable
Immutable
True/False: With DataFrames, you have to manipulate partitions manually
False
If you have one partition and many executors, what parallelism do you have
1
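As a rule of thumb, the card above amounts to taking the smaller of "partitions available" and "task slots available". The helper below is a simplified sketch under that assumption; `effective_parallelism` and its parameters are hypothetical names, and real scheduling involves more factors.

```python
def effective_parallelism(num_partitions, num_executors, cores_per_executor=1):
    """Simplified model: one task per partition, one core per task."""
    task_slots = num_executors * cores_per_executor
    return min(num_partitions, task_slots)


one_partition_many_executors = effective_parallelism(1, 1000)
many_partitions_one_executor = effective_parallelism(1000, 1)
```

Either side can be the bottleneck: a single partition keeps all but one executor idle, and a single one-core executor can only work on one partition at a time.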
What is a partition
A collection of rows that sit on one physical machine
To allow every executor to perform in parallel, Spark breaks the data into
Partitions
What is a dataframe
Structured API that represents a table of data with rows and columns
How many spark sessions can you have across a Spark App
1
You control your Spark App through a driver process called
SparkSession
What are Spark’s language APIs
Scala, Java, R, Python, SQL
What is the point of the cluster manager
Keep track of resources available
What is local mode
Driver and Executor live on the same machine