Information Flashcards
1
Q
- drop(): takes one or more column names (str) or a Column object; see the sketch below.
A
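A minimal sketch (storesDF and its column names are assumptions for illustration):
  from pyspark.sql.functions import col
  storesDF.drop("sqft")                          # drop a single column by name
  storesDF.drop("sqft", "customerSatisfaction")  # drop several columns by name
  storesDF.drop(col("sqft"))                     # drop by Column object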
2
Q
- split is a function: storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[0])
A
3
Q
- cast is a Column method: col("storeId").cast(StringType())
A
4
Q
- explode is a function: storesDF.withColumn("productCategories", explode(col("productCategories")))
A
5
Q
- regexp_replace is a function: storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", ""))
A
6
Q
- na.fill (alias fillna): storesDF.na.fill(value = 30000, subset = "sqft"). value can be an int, float, string, bool, or a dict of column name to value; subset can be a string, list, or tuple of column names. A sketch follows below.
A
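A minimal sketch (storesDF and its columns are assumptions for illustration):
  storesDF.na.fill(value = 30000, subset = "sqft")                    # single column
  storesDF.na.fill(value = 30000, subset = ["sqft", "storeRevenue"])  # list of columns
  storesDF.na.fill({"sqft": 30000, "storeCategory": "UNKNOWN"})       # dict form; subset is ignored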
7
Q
- dropDuplicates (or its alias drop_duplicates): DataFrame.drop_duplicates(subset = None) or storesDF.dropDuplicates(subset = ["id"]). subset should be a list or tuple of column names.
A
8
Q
- approxCountDistinct (or approx_count_distinct) is used with .agg: storesDF.agg(approx_count_distinct(col("division"), 0.15).alias("divisionDistinct")).
The second argument is the maximum allowed relative standard deviation; it defaults to 0.05 if not specified.
A
9
Q
- mean is a function that can be used with .agg: storesDF.agg(mean(col("sqft")).alias("sqftMean")). It takes a Column or a column-name string; df.agg() is shorthand for df.groupBy().agg().
A
10
Q
- describe is a DataFrame method that takes column names as strings or a list: df.describe(["age", "size"]).show() or df.describe("age", "size").show()
A
11
Q
- orderBy is a DataFrame method that takes a str, list, or Column, with an ascending parameter (bool or list of bool, defaults to True): df.orderBy(["age", "name"], ascending=[False, False])
A
12
Q
- sample is a DataFrame method that takes withReplacement, fraction, and seed parameters: storesDF.sample(withReplacement = False, fraction = 0.10, seed = 123)
A
13
Q
- printSchema(level: Optional[int] = None): the level parameter specifies how many levels to print for nested schemas; see the sketch below.
A
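A minimal sketch with an assumed nested schema, showing the effect of level:
  from pyspark.sql.types import StructType, StructField, StringType, IntegerType
  schema = StructType([
      StructField("storeId", StringType()),
      StructField("address", StructType([
          StructField("city", StringType()),
          StructField("zip", IntegerType()),
      ])),
  ])
  df = spark.createDataFrame([], schema)
  df.printSchema()           # prints every nesting level
  df.printSchema(level = 1)  # prints only the top-level fields (newer Spark versions)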
14
Q
- for a SQL UDF: spark.udf.register("function_name", udf_function, returnType) (returnType is optional); the function is then called in SQL by function_name. A sketch follows below.
A
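A minimal sketch (the function, view, and column names are assumptions for illustration):
  from pyspark.sql.types import IntegerType

  def assess_performance(score):
      return 1 if score is not None and score > 80 else 0

  spark.udf.register("assess_performance", assess_performance, IntegerType())
  storesDF.createOrReplaceTempView("stores")
  spark.sql("SELECT storeId, assess_performance(customerSatisfaction) AS result FROM stores").show()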
15
Q
- for a Python UDF: assessPerformanceUDF = udf(assessPerformance, IntegerType()); then storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))
A
16
Q
- to create a DataFrame from a list: spark.createDataFrame(years, IntegerType()); see the sketch below.
A
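A minimal sketch, assuming a plain Python list of integers:
  from pyspark.sql.types import IntegerType
  years = [2019, 2020, 2021, 2022]
  yearsDF = spark.createDataFrame(years, IntegerType())  # one column, named "value" by default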
17
Q
- .cache() defaults to MEMORY_AND_DISK and takes no parameters
A
18
Q
- .persist() also defaults to MEMORY_AND_DISK but takes a storageLevel parameter: df.persist(storageLevel = StorageLevel.DISK_ONLY), with StorageLevel imported from pyspark.
A
19
Q
- spark.sql.adaptive.coalescePartitions.enabled configures whether DataFrame partitions that do not meet a minimum size threshold are automatically coalesced into larger partitions during a shuffle; see the sketch below.
A
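A minimal sketch of setting the related adaptive-execution options at runtime:
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")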
20
Q
- spark.sql.shuffle.partitions is used to adjust the number of partitions used in wide transformations like join(). spark.conf.set("spark.sql.shuffle.partitions", "32")
A
21
Q
- from unix time to a formatted date string: storesDF.withColumn("openDateString", from_unixtime(col("openDate"), "EEE, MMM d, yyyy h:mm a"))
A
22
Q
- from unix time to day of year: (storesDF.withColumn("openTimestamp", col("openDate").cast("Timestamp")).withColumn("dayOfYear", dayofyear(col("openTimestamp"))))
A
23
Q
- for joins: joinedDF = storesDF.join(other = employeesDF, on = "storeId", how = "inner"); how can be inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, or left_anti.
A
24
Q
- for the join's on parameter: with df = a.join(b, on = ["column1", "column2"]), PySpark sees a list of column names and automatically matches column1 in DataFrame a with column1 in b, and column2 in a with column2 in b. With df = a.join(b, on = [col("column1"), col("column2")]) you are passing column expressions instead of simple column names; PySpark sees each col("column1") as an individual expression with no context tying it to a or b, so it does not know which DataFrame it refers to, which causes an ambiguity error. A sketch follows below.
A
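A minimal sketch contrasting the forms (DataFrames a and b sharing column1 and column2 are assumed):
  from pyspark.sql.functions import col
  df = a.join(b, on = ["column1", "column2"], how = "inner")   # names resolve on both sides
  # df = a.join(b, on = [col("column1"), col("column2")])      # ambiguous: expressions are not tied to a or b
  df = a.join(b, on = [a["column1"] == b["column1"],           # explicit, unambiguous join conditions
                       a["column2"] == b["column2"]])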
25
Q
- outer joins are performed with join() and how = "outer"; there is no separate outer() function. See the sketch below.
A
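A minimal sketch (storesDF, employeesDF, and storeId are assumptions for illustration):
  joinedDF = storesDF.join(employeesDF, on = "storeId", how = "outer")  # "full" and "full_outer" are equivalent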
26
Q
- for cross joins use: df.crossJoin(df2.select("height"))
A
27
Q
- broadcast joins: storesDF.join(broadcast(employeesDF), "storeId"), with broadcast imported from pyspark.sql.functions.
A
28
Q
- for unions: storesDF.union(acquiredStoresDF) does a position-wise union between the DataFrames, while storesDF.unionByName(acquiredStoresDF, allowMissingColumns = False) resolves columns by name rather than by position.
A
29
Q
- to write partitioned parquet: storesDF.write.mode("overwrite").partitionBy("division").parquet(filePath)
A
30
Q
- to write json: storesDF.write.json(filePath); to read it back with a schema: spark.read.schema(schema).format("json").load(filePath)
A