Information Flashcards

1
Q
  • drop() takes Column objects or column names (str).
A
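  A minimal sketch of both call styles, assuming storesDF has columns managerName and division (illustrative names):

    from pyspark.sql.functions import col

    # drop by column name (one or more names can be passed)
    noManagerDF = storesDF.drop("managerName")

    # drop by Column object
    noDivisionDF = storesDF.drop(col("division"))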
2
Q
  • split is a function: storesDF.withColumn("storeValueCategory", split(col("storeCategory"), "_")[0])
A
3
Q
  • cast is a Column method: col("storeId").cast(StringType())
A
4
Q
  • explode is a function: storesDF.withColumn("productCategories", explode(col("productCategories")))
A
5
Q
  • regexp_replace is a function: storesDF.withColumn("storeDescription", regexp_replace(col("storeDescription"), "^Description: ", ""))
A
6
Q
  • na.fill: storesDF.na.fill(value = 30000, subset = "sqft"). value can be an int, float, string, bool, or a dict of column names to values; subset can be a string, tuple, or list of column names.
A
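  A small sketch of na.fill(), assuming storesDF has a nullable numeric column sqft and a string column division (illustrative):

    # fill nulls in a single column with a scalar
    filledDF = storesDF.na.fill(value = 30000, subset = "sqft")

    # or pass a dict mapping column names to fill values (subset is ignored when a dict is used)
    filledDF2 = storesDF.na.fill({"sqft": 30000, "division": "UNKNOWN"})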
7
Q
  • dropDuplicates, or its alias drop_duplicates: DataFrame.drop_duplicates(subset = None) or storesDF.dropDuplicates(subset = ["id"]). subset should be a list or tuple of column names.
A
8
Q
  • approxCountDistinct or approx_count_distinct is used with .agg: storesDF.agg(approx_count_distinct(col("division"), 0.15).alias("divisionDistinct"));
    the second argument is the maximum estimation error (rsd), which defaults to 0.05 if not specified.
A
9
Q
  • mean is a function that can be used with .agg: storesDF.agg(mean(col("sqft")).alias("sqftMean")). It takes a Column or a column name; df.agg() is shorthand for df.groupBy().agg().
A
10
Q
  • describe is a function that takes column names as strings or a list: df.describe(['age', 'size']).show() or df.describe('age', 'size').show()
A
11
Q
  • orderBy is a function that takes a str, list, or Column, with an ascending parameter (bool or list of bool) that defaults to True: df.orderBy(["age", "name"], ascending = [False, False])
A
12
Q
  • sample is a function that takes the parameters withReplacement, fraction, and seed: storesDF.sample(withReplacement = False, fraction = 0.10, seed = 123)
A
13
Q
  • printSchema(level: Optional[int] = None): the level parameter specifies how many levels to print for nested schemas.
A
14
Q
  • for a SQL UDF: spark.udf.register("function_name", udf_function, returnType) (returnType is optional); the function is then called by function_name in SQL.
A
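  A hedged end-to-end sketch of registering and calling a SQL UDF; the function assessPerformance, the registered name, and the threshold are illustrative:

    from pyspark.sql.types import IntegerType

    def assessPerformance(satisfaction):
        # plain Python logic applied per row
        return 1 if satisfaction is not None and satisfaction >= 50 else 0

    # register for use in SQL under the name "assess_performance"
    spark.udf.register("assess_performance", assessPerformance, IntegerType())

    storesDF.createOrReplaceTempView("stores")
    resultDF = spark.sql("SELECT storeId, assess_performance(customerSatisfaction) AS result FROM stores")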
15
Q
  • for a Python UDF: assessPerformanceUDF = udf(assessPerformance, IntegerType()); storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))
A
16
Q
  • to create a DataFrame from a list: spark.createDataFrame(years, IntegerType())
A
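  A minimal sketch, assuming years is a plain Python list of integers (illustrative data):

    from pyspark.sql.types import IntegerType

    years = [2019, 2020, 2021, 2022]

    # creates a single-column DataFrame (column name "value") from the list
    yearsDF = spark.createDataFrame(years, IntegerType())
    yearsDF.show()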
17
Q
  • .cache() uses MEMORY_AND_DISK by default and takes no parameters
A
18
Q
  • .persist() uses MEMORY_AND_DISK by default but takes storageLevel as a parameter: df.persist(storageLevel = StorageLevel.DISK_ONLY)
A
19
Q
  • spark.sql.adaptive.coalescePartitions.enabled is used to configure whether DataFrame partitions that do not meet a minimum size threshold are automatically coalesced into larger partitions during a shuffle.
A
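  A sketch of how this property could be set at runtime; adaptive query execution must also be enabled, and the values shown are illustrative:

    # enable adaptive query execution and post-shuffle partition coalescing
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")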
20
Q
  • spark.sql.shuffle.partitions is used to adjust the number of partitions used in wide transformations like join(): spark.conf.set("spark.sql.shuffle.partitions", "32")
A
21
Q
  • from unix time to a formatted date string: storesDF.withColumn("openDateString", from_unixtime(col("openDate"), "EEE, MMM d, yyyy h:mm a"))
A
22
Q
  • from unix time to day of year: storesDF.withColumn("openTimestamp", col("openDate").cast("Timestamp")).withColumn("dayOfYear", dayofyear(col("openTimestamp")))
A
23
Q
  • for joins: joinedDF = storesDF.join(other = employeesDF, on = "storeId", how = "inner"); how can be inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti, or left_anti.
A
24
Q
  • for joins, when using df = a.join(b, on = ["column1", "column2"]), PySpark sees a list of column names and automatically matches column1 in DataFrame a with column1 in DataFrame b, and column2 in a with column2 in b. With df = a.join(b, on = [col("column1"), col("column2")]) you are passing column expressions instead of simple column names; PySpark sees each col("column1") as an individual expression with no direct link to DataFrame a or b, so it cannot tell which DataFrame col("column1") refers to, which causes an ambiguity error.
A
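  A sketch contrasting the two call styles, assuming DataFrames a and b that both contain column1 and column2 (illustrative):

    # unambiguous: column names are matched in both DataFrames and deduplicated in the result
    df = a.join(b, on = ["column1", "column2"], how = "inner")

    # if column expressions are needed, qualify them with the source DataFrame to avoid ambiguity
    df2 = a.join(b, on = [a["column1"] == b["column1"], a["column2"] == b["column2"]], how = "inner")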
25
Q
  • outer joins are performed with join() itself (how = "outer"); there is no separate outer() function
A
26
Q
  • for cross joins use: df.crossJoin(df2.select("height"))
A
27
Q
  • broadcast joins: storesDF.join(broadcast(employeesDF), "storeId")
A
28
Q
  • for unions, storesDF.union(acquiredStoresDF) does a position-wise union between the DataFrames, while storesDF.unionByName(acquiredStoresDF, allowMissingColumns = False) does the union resolving columns by name rather than by position.
A
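  A sketch of the difference, assuming storesDF and acquiredStoresDF share the same columns but possibly in a different order (illustrative):

    # position-wise: column 1 is unioned with column 1, column 2 with column 2, etc.
    combinedByPosition = storesDF.union(acquiredStoresDF)

    # name-wise: columns are matched by name; missing columns raise an error unless allowMissingColumns = True
    combinedByName = storesDF.unionByName(acquiredStoresDF, allowMissingColumns = False)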
29
Q
  • to write partitioned parquet: storesDF.write.mode("overwrite").partitionBy("division").parquet(filePath)
A
30
Q
  • to write JSON: storesDF.write.json(filePath); to read it back: spark.read.schema(schema).format("json").load(filePath)
A
31
Q
  • to read parquet: spark.read.load(path = filePath, schema = None, format = "parquet")
    * to read CSV: spark.read.schema(schema).csv(filePath)
A
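  A hedged sketch of the generic and format-specific readers; filePath and schema are assumed to be defined elsewhere, and the header option is illustrative:

    # generic loader with an explicit format
    parquetDF = spark.read.load(path = filePath, schema = None, format = "parquet")

    # format-specific CSV reader with a user-supplied schema
    csvDF = spark.read.schema(schema).option("header", "true").csv(filePath)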
32
Q
  • MEMORY_AND_DISK_2 stores as much data as possible in memory on two cluster nodes, while storing any data that does not fit in memory on disk to be read in when needed
A
33
Q
  • spark.sql.autoBroadcastJoinThreshold is used to configure the maximum size of an automatically broadcasted DataFrame when performing a join
A
34
Q
  • the spark.sql.adaptive.skewJoin.enabled Spark property is used to configure whether skewed partitions are automatically detected and subdivided into smaller partitions when joining two DataFrames together
A
35
Q
  • storesDF.drop() is used to return a new DataFrame from DataFrame storesDF without columns that are specified by name
A
36
Q
  • using a filter: storesDF.filter(col("sqft") <= 25000), storesDF.filter((col("sqft") <= 25000) | (col("customerSatisfaction") >= 30))
A
37
Q
  • storesDF.withColumn("modality", lit("PHYSICAL")): column modality is the constant string "PHYSICAL"
A
38
Q
  • To rename columns: storesDF.withColumnRenamed("division", "state").withColumnRenamed("managerName", "managerFullName")
A
39
Q
  • na.fill(value = "hello", subset = ["col_name1", "col_name2"]) or its alias .fillna(); subset can be a str, tuple, or list
A
40
Q
  • DataFrame.dropDuplicates(), DataFrame.distinct() and DataFrame.drop_duplicates() are used to return a DataFrame with no duplicate rows
A
41
Q

  • storesDF.agg(approx_count_distinct(col("division")).alias("divisionDistinct")) returns a DataFrame where column divisionDistinct is the approximate number of distinct values in column division

A
42
Q
  • both df.head(3) and df.take(3) return the first 3 rows as a list; df.head() with no argument returns a single Row.
A
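  A small sketch of the return types (DataFrame name illustrative):

    rows = storesDF.take(3)      # list of Row objects
    same = storesDF.head(3)      # also a list of Row objects
    first_row = storesDF.head()  # a single Row (or None if the DataFrame is empty)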
43
Q
  • filter() and where() will both return a new DataFrame only containing rows that meet a specified logical condition
A
44
Q
  • difference between coalesce and repartition, DataFrame.repartition(n) will split a DataFrame into n number of new partitions with data distributed evenly. DataFrame.coalesce(n) will more quickly combine the existing partitions of a DataFrame but might result in an uneven distribution of data across the new partitions.
A
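  A sketch of both calls; the partition counts are illustrative:

    # full shuffle: data is redistributed evenly across 8 partitions
    evenDF = storesDF.repartition(8)

    # no full shuffle: existing partitions are merged down to 4, possibly unevenly
    mergedDF = storesDF.coalesce(4)

    print(evenDF.rdd.getNumPartitions(), mergedDF.rdd.getNumPartitions())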
45
Q
  • DataFrame.select() is classified as a transformation.
A
46
Q
  • absolute value: storesDF.withColumn("customerSatisfactionAbs", abs(col("customerSatisfaction")))
A
47
Q
  • The Spark driver is responsible for scheduling the execution of data by various worker nodes in cluster mode.
A
48
Q
  • A shuffle induces a stage boundary
A
49
Q
  • A partition is a collection of rows of data that fit on a single machine in a cluster.
A
50
Q
  • A Stage identifies multiple narrow operations that are executed in sequence. In short, a stage groups together narrow transformations that can be executed in sequence without data movement, and each stage boundary is created when a wide transformation that requires a shuffle occurs.
A
51
Q
  • Spark execution/deployment modes include: Client mode, Cluster mode, Local mode.
A
52
Q
  • A failed driver node will cause a Spark job to fail.
A
53
Q
  • The MEMORY_ONLY storage level will store as much data as possible in memory and will recompute any data that does not fit in memory as it is called. The MEMORY_AND_DISK storage level will store as much data as possible in memory, will store any data that does not fit in memory on disk, and will read it as it is called.
A
54
Q
  • to return the first two characters of a column, use storesDF.withColumn("division", col("division").substr(1, 2)). The position is 1-based, not zero-based. We can also use the function substring: df.select(substring(df.s, 1, 2).alias('s'))
A
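  A sketch of both the Column method and the function form, assuming a string column division (illustrative):

    from pyspark.sql.functions import col, substring

    # Column method: start position 1 (1-based), length 2
    shortDF = storesDF.withColumn("divisionShort", col("division").substr(1, 2))

    # equivalent function form
    shortDF2 = storesDF.select(substring(col("division"), 1, 2).alias("divisionShort"))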
55
Q
  • to drop rows that have a null/NaN value in any column: df.na.drop(how = "any"); to drop rows only if all columns are null: df.na.drop(how = "all")
A
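  A sketch of both modes (DataFrame name illustrative):

    # drop a row if ANY column is null
    anyDroppedDF = storesDF.na.drop(how = "any")

    # drop a row only if ALL columns are null
    allDroppedDF = storesDF.na.drop(how = "all")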
56
Q
  • to extract the value for column sqft from the first row of DataFrame storesDF: storesDF.first().sqft
A
57
Q
  • to create a temp view and query it: storesDF.createOrReplaceTempView("stores"), then spark.sql("SELECT storeId, managerName FROM stores")
A
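  A minimal end-to-end sketch (column names are the ones used in the card):

    storesDF.createOrReplaceTempView("stores")

    managersDF = spark.sql("SELECT storeId, managerName FROM stores")
    managersDF.show()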
58
Q
  • Job, Stage, Task are the units of work performed by Spark from largest to smallest.
A
59
Q
  • groupBy takes list, str or Column.
A
60
Q
  • UDF using PySpark instead of SQL: assessPerformanceUDF = udf(assessPerformance, IntegerType()); storesDF.withColumn("result", assessPerformanceUDF(col("customerSatisfaction")))
A
61
Q
  • Executors are processing engine instances for performing data computations which run on a worker node.
A
62
Q
  • Garbage collection is important because Spark jobs will fail or run slowly if memory is not available for new objects to be created.
A
63
Q
  • To find the number of characters in each row: storesDF.withColumn("managerNameLength", length(col("managerName")))
A
64
Q
  • if you have a Spark application running with 1 driver and 1 worker node and the worker node fails, Spark will ensure completion because worker nodes are fault-tolerant.
A
65
Q

Worker nodes are machines that host the executors responsible for the execution of tasks.

A
66
Q

A task is a combination of a block of data and a set of transformations that will run on a single executor.

A
67
Q

A Stage is a group of tasks that can be executed in parallel to compute the same set of operations on potentially multiple machines

A
68
Q

A shuffle is the process by which data is compared across partitions.

A
69
Q

Transformations are business logic operations that do not induce execution while actions
are execution triggers focused on returning results

A
70
Q

The df.select() is always classified as a narrow transformation

A
71
Q

Spark’s execution/deployment mode determines where the driver and executors are
physically located when a Spark application is run

A
72
Q

An out-of-memory error occurs when either the driver or an executor does not have enough
memory to collect or process the data allocated to it.

A
73
Q

A broadcast variable is entirely cached on each worker node so it doesn’t need to be
shipped or shuffled between nodes with each stage.

A
74
Q

Spark DataFrames are built on top of RDDs (Resilient Distributed Datasets).

A
75
Q

lower is a function that converts characters to lower case: storesDF.withColumn("storeCategory", lower(col("storeCategory")))

A
76
Q

Sum of the values in column sqft in DataFrame storesDF grouped by the distinct values in column division:
storesDF.groupBy("division").agg(sum(col("sqft")))

A
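  A minimal sketch, assuming storesDF has columns division and sqft (as in the card); the alias is illustrative:

    # note: importing sum shadows Python's built-in sum in this scope
    from pyspark.sql.functions import col, sum

    # one row per distinct division with the summed square footage
    sqftByDivisionDF = storesDF.groupBy("division").agg(sum(col("sqft")).alias("totalSqft"))
    sqftByDivisionDF.show()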