General Flashcards
How do you cache a DataFrame
.persist
.cache.count
The action doesn't have to be count, but it must be one that touches every record (caching is lazy, so nothing is stored until an action runs)
How can you select where your cache is stored
.persist(storage level)
The default storage level for persist and cache on a DataFrame is
MEMORY_AND_DISK
How do you uncache data
.unpersist() (no follow-up action such as count is needed)
How can you determine the storage level of your DataFrame
.storageLevel
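The caching cards above can be tried together in a short spark-shell style sketch. The local SparkSession, the `df` name, and the generated data are assumptions made for illustration; only `persist`, `count`, `storageLevel`, and `unpersist` come from the cards.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("cache-sketch").getOrCreate()

// Hypothetical DataFrame used only for this example.
val df = spark.range(0, 1000).toDF("id")

// Choose where the cache lives by passing a storage level to persist.
df.persist(StorageLevel.MEMORY_ONLY)

// Caching is lazy: an action that touches every record materializes it.
df.count()

// Inspect the storage level currently assigned to the DataFrame.
println(df.storageLevel)

// Drop the cached data; no follow-up action is required.
df.unpersist()
println(df.storageLevel)
```

Printing the storage level before and after `unpersist` is a quick way to confirm whether a DataFrame is still marked as cached.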
How do you register a function as a DataFrame function
val function_udf = udf(stringConcat(_:ParamType…):ReturnType)
How do you register a function as a SQL function
spark.udf.register("new_function_name", function signature)
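As a rough illustration of the two registration cards above, the sketch below defines a hypothetical two-argument `stringConcat` and exposes it both as a DataFrame function and as a SQL function. The function body, the SQL name `string_concat`, the column names, and the `pairs` view are all assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
import spark.implicits._

// Hypothetical function to wrap; the cards do not specify its body.
def stringConcat(a: String, b: String): String = a + "_" + b

// DataFrame function: usable with Column arguments in select, withColumn, etc.
val stringConcatUdf = udf(stringConcat(_: String, _: String): String)

// SQL function: usable by name inside spark.sql(...) queries.
spark.udf.register("string_concat", stringConcat(_: String, _: String): String)

val df = Seq(("a", "b"), ("c", "d")).toDF("prefix", "suffix")

// Call the UDF through the DataFrame API.
df.select(stringConcatUdf(col("prefix"), col("suffix")).as("joined")).show()

// Call the registered function from SQL via a temporary view.
df.createOrReplaceTempView("pairs")
spark.sql("SELECT string_concat(prefix, suffix) AS joined FROM pairs").show()
```

The value returned by `udf(...)` is only visible to the DataFrame API, while `spark.udf.register` makes the name available to SQL text, which is why the two cards are separate.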
How can you create a table from a DataFrame
.write.saveAsTable("table_name")
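A minimal sketch of the card above, assuming a local session with the default catalog. The `mode("overwrite")` call is an addition so the example can be re-run; it is not part of the card.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("table-sketch").getOrCreate()

// Hypothetical DataFrame used only for this example.
val df = spark.range(0, 5).toDF("id")

// Write the DataFrame as a managed table named "table_name".
// Assumption: overwrite mode, so repeated runs replace the table.
df.write.mode("overwrite").saveAsTable("table_name")

// The table can now be read back by name or queried with SQL.
spark.table("table_name").show()
spark.sql("SELECT COUNT(*) AS n FROM table_name").show()
```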
How can you set the number of partitions for a shuffle
spark.conf.set("spark.sql.shuffle.partitions", 50)
How do you get the number of partitions of a given DataFrame
.rdd.getNumPartitions
How do you repartition a DataFrame
.repartition(2)
How can you reduce the number of partitions by merging partitions that already sit on the same node
.coalesce(2)
Which causes a full shuffle: repartition or coalesce
repartition
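The partitioning cards above can be exercised in one place. This is a sketch under assumed values: the local session, the `df`, `wide`, and `narrow` names, and the partition counts 8 and 2 are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[4]").appName("partition-sketch").getOrCreate()

// Number of partitions Spark uses after a shuffle (joins, aggregations).
spark.conf.set("spark.sql.shuffle.partitions", 50)

// Hypothetical DataFrame used only for this example.
val df = spark.range(0, 10000).toDF("id")

// Current partition count of the DataFrame.
println(df.rdd.getNumPartitions)

// repartition redistributes rows across the cluster with a full shuffle.
val wide = df.repartition(8)
println(wide.rdd.getNumPartitions)

// coalesce merges existing partitions and avoids a full shuffle,
// so it can only lower the partition count.
val narrow = wide.coalesce(2)
println(narrow.rdd.getNumPartitions)
```

A common rule of thumb is to use `coalesce` when shrinking the partition count and `repartition` when growing it or when the data needs to be rebalanced.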
How do you enable adaptive query execution
spark.conf.set("spark.sql.adaptive.enabled", true)
What are the elements of the Apache Spark execution hierarchy
Jobs, stages, and tasks
Adaptive Query Execution re-optimizes the query plan in the middle of the query execution based on accurate runtime statistics T/F
True
With AQE, logical optimization and physical planning are removed T/F
False
What does spark.sql.autoBroadcastJoinThreshold do
Configures the maximum size in bytes of a table that will be broadcast to all worker nodes when performing a join
How do you turn off dynamic partition coalescing
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", false)
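The AQE and broadcast settings from the cards above are ordinary runtime configuration keys, so they can be set and read back in the same way. The sketch below assumes a local session, and the 10 MB threshold is only an example value.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("aqe-sketch").getOrCreate()

// Turn adaptive query execution on.
spark.conf.set("spark.sql.adaptive.enabled", true)

// Keep AQE on but stop it from coalescing small shuffle partitions.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", false)

// Example value: broadcast tables up to 10 MB in joins.
// Setting this key to -1 disables automatic broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

// Values are stored as strings and can be read back for a sanity check.
println(spark.conf.get("spark.sql.adaptive.enabled"))
println(spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled"))
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
```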
What allows you to control how deeply nested types are printed in a schema
.printSchema(1) (the argument is the maximum nesting level to print)
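To see what the level argument changes, the sketch below builds a small nested DataFrame and prints its schema twice. The column names and the sample row are assumptions; the `printSchema(level)` overload is available in Spark 3.0 and later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("schema-sketch").getOrCreate()
import spark.implicits._

// Hypothetical flat data, then nested by packing two columns into a struct.
val flat = Seq(("Ada", "London", "N1")).toDF("name", "city", "zip")
val nested = flat.select(col("name"), struct(col("city"), col("zip")).as("address"))

// Full schema: shows the fields inside the address struct.
nested.printSchema()

// Level 1: shows only the top-level columns, hiding the struct's fields.
nested.printSchema(1)
```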
How do you set infer schema
.option("inferSchema", true)
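A small sketch of schema inference on a CSV read. The temporary file, its contents, and the added `header` option are assumptions so the example is self-contained; only the `inferSchema` option comes from the card.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("infer-sketch").getOrCreate()

// Hypothetical CSV written to a temp file so there is something to read.
val path = Files.createTempFile("people", ".csv")
Files.write(path, "name,age\nAda,36\nLinus,54\n".getBytes("UTF-8"))

// With inferSchema, Spark scans the data to pick column types
// instead of reading every column as a string.
val df = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv(path.toString)

df.printSchema()
```

Inference costs an extra pass over the data, so an explicit schema is often preferred for large inputs.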
How do you expose a DataFrame as a table or view for SQL queries
.createOrReplaceTempView("view_name")
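As a quick illustration of the card above, the sketch below registers a DataFrame as a temporary view and queries it with SQL. The view name `people` and the sample rows are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("view-sketch").getOrCreate()
import spark.implicits._

// Hypothetical DataFrame used only for this example.
val df = Seq(("Ada", 36), ("Linus", 54)).toDF("name", "age")

// Register the DataFrame under a name that SQL queries can reference.
df.createOrReplaceTempView("people")

// The view behaves like a table inside this session.
val older = spark.sql("SELECT name FROM people WHERE age > 40")
older.show()
```

A temporary view is scoped to the session and stores no data of its own, unlike a table written with `saveAsTable`.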