General Flashcards
how do you cache a dataframe
.persist
.cache.count
the action doesn’t have to be count, but it has to touch every single record so that every partition actually gets cached
How can you select where your cache is stored
.persist(storageLevel), e.g. .persist(StorageLevel.DISK_ONLY)
what is the default storage level for persist and cache
MEMORY_AND_DISK (for DataFrames; RDDs default to MEMORY_ONLY)
how do you uncache data
.unpersist()
unlike cache, unpersist is not lazy, so no follow-up action is needed
How can you determine the storage level of your data frame
.storageLevel
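The caching cards above can be tried out with a minimal sketch like the one below. It assumes a local SparkSession; the object name `CacheSketch` and the `id` column are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical wrapper object, used only for this illustration.
object CacheSketch {
  // Returns the storage level after cache(), after persist(DISK_ONLY), and after unpersist().
  def run(): (StorageLevel, StorageLevel, StorageLevel) = {
    // Assumption: a local session is good enough for a demo.
    val spark = SparkSession.builder().master("local[*]").appName("cache-sketch").getOrCreate()
    val df = spark.range(0, 1000).toDF("id")

    df.cache()   // lazy: only marks the DataFrame for caching
    df.count()   // an action that scans every row, so every partition is materialized
    val afterCache = df.storageLevel

    df.unpersist(blocking = true)       // eager: drops the cached blocks right away
    df.persist(StorageLevel.DISK_ONLY)  // choose where the cache is stored
    df.count()
    val afterPersist = df.storageLevel

    df.unpersist(blocking = true)
    val afterUnpersist = df.storageLevel

    (afterCache, afterPersist, afterUnpersist)
  }
}
```

On a DataFrame, the three values should be MEMORY_AND_DISK, DISK_ONLY, and NONE, in that order.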
How do you register a function as a dataframe function
val function_udf = udf(stringConcat(_:ParamType…):ReturnType)
how do you register a function as SQL Function
spark.udf.register("new_function_name", function signature)
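A rough sketch of both registration styles, assuming a local SparkSession. The two-argument `stringConcat` function, the SQL name `string_concat`, the view name `pairs`, and the column names are all hypothetical stand-ins for the placeholders on the cards.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical wrapper object, used only for this illustration.
object UdfSketch {
  // Hypothetical plain Scala function to be exposed to Spark.
  val stringConcat: (String, String) => String = (a, b) => a + b

  def run(): (String, String) = {
    val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("foo", "bar")).toDF("a", "b")

    // DataFrame function: wrap the Scala function with udf(...)
    val stringConcatUdf = udf(stringConcat)
    val viaDataFrame = df.select(stringConcatUdf(col("a"), col("b"))).as[String].head()

    // SQL function: register it under a name that SQL queries can call
    spark.udf.register("string_concat", stringConcat)
    df.createOrReplaceTempView("pairs")
    val viaSql = spark.sql("SELECT string_concat(a, b) FROM pairs").as[String].head()

    (viaDataFrame, viaSql)
  }
}
```

A UDF created with `udf(...)` is only usable from the DataFrame API; `spark.udf.register` is what makes the name visible to SQL.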
how can you create a table from a dataframe
.write.saveAsTable("table_name")
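A small sketch of saving a DataFrame as a table and reading it back, assuming a local SparkSession without Hive support. The table name `ids_sketch` and the temporary warehouse directory are assumptions made so the demo does not touch an existing warehouse.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only for this illustration.
object TableSketch {
  def run(): Long = {
    // Assumption: keep the managed table in a throwaway warehouse directory.
    val warehouse = Files.createTempDirectory("warehouse-sketch").toString
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("table-sketch")
      .config("spark.sql.warehouse.dir", warehouse)
      .getOrCreate()

    val df = spark.range(0, 5).toDF("id")

    // Creates (or overwrites) a managed table named ids_sketch
    df.write.mode("overwrite").saveAsTable("ids_sketch")

    // The table can now be read by name from the catalog
    spark.table("ids_sketch").count()
  }
}
```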
how can you set the number of partitions for a shuffle
spark.conf.set("spark.sql.shuffle.partitions", 50)
how do you get the number of partitions available in a given dataframe
.rdd.getNumPartitions
how do you repartition a dataframe
.repartition(2)
how can you reduce the number of partitions without a full shuffle
.coalesce(2)
which causes a shuffle, repartition or coalesce
repartition (coalesce only merges existing partitions, so it can reduce the count but not increase it)
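The partition cards above can be checked with a sketch along these lines, assuming a local SparkSession. The object name, the partition counts 8 and 2, and the `bucket` column are illustrative. Adaptive query execution is turned off here only so that the post-shuffle partition count is predictable.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical wrapper object, used only for this illustration.
object PartitionSketch {
  // Returns the partition counts after repartition, after coalesce, and after a shuffle.
  def run(): (Int, Int, Int) = {
    val spark = SparkSession.builder().master("local[*]").appName("partition-sketch").getOrCreate()

    // Assumption: disable AQE so it does not merge small shuffle partitions.
    spark.conf.set("spark.sql.adaptive.enabled", false)
    spark.conf.set("spark.sql.shuffle.partitions", 50)

    val df = spark.range(0, 1000).toDF("id")

    val repartitioned = df.repartition(8)      // full shuffle into 8 partitions
    val coalesced = repartitioned.coalesce(2)  // merges partitions, no full shuffle

    // groupBy forces a shuffle, which uses spark.sql.shuffle.partitions
    val shuffled = df.groupBy((col("id") % 10).as("bucket")).count()

    (repartitioned.rdd.getNumPartitions,
     coalesced.rdd.getNumPartitions,
     shuffled.rdd.getNumPartitions)
  }
}
```

Calling `explain()` on `repartitioned` and `coalesced` is one way to see the difference: the repartitioned plan contains an Exchange step, the coalesced one does not add another.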
How do you enable adaptive query execution
spark.conf.set("spark.sql.adaptive.enabled", true)
What are the elements of an Apache Spark Execution Hierarchy
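Jobs, Stages, and Tasks (an action triggers a job, a job is split into stages at shuffle boundaries, and each stage runs one task per partition)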