Cleaning Data with PySpark Flashcards
How do you import different data types in Spark?
from pyspark.sql.types import IntegerType, StringType  # plus whatever other types you need
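A minimal schema sketch using these types (the file name and column names are assumptions; spark is an existing SparkSession):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

people_schema = StructType([
    StructField('name', StringType(), nullable=False),
    StructField('age', IntegerType(), nullable=False)
])
people_df = spark.read.csv('people.csv', schema=people_schema)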
What is the main reason for PySpark using immutability and lazy processing?
Spark takes advantage of data immutability to efficiently share / create new data representations throughout the cluster, and lazy processing lets it plan and optimize the full chain of transformations before executing any of them.
What is the Parquet Format?
Parquet is a compressed columnar data format developed for use in any Hadoop-based system. This includes Spark, Hadoop, Apache Impala, and so forth. The Parquet format is structured with data accessible in chunks, allowing efficient read / write operations without processing the entire file.
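For instance, a minimal read / write sketch (the paths and df are assumptions):
df = spark.read.parquet('input.parquet')
df.write.parquet('output.parquet', mode='overwrite')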
If we want to use sql language in a pyspark dataframe, what method should we call first?
dataframe.createOrReplaceTempView('custom_table_name')
then we can run SQL through the SparkSession with
spark.sql('SQL QUERY')
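A minimal end-to-end sketch (voter_df and the TITLE column are assumptions):
voter_df.createOrReplaceTempView('voters')
titles_df = spark.sql('SELECT TITLE, COUNT(*) AS total FROM voters GROUP BY TITLE')
titles_df.show()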
What is a DataFrame in pyspark?
Made up of rows and columns
Immutable
Uses transformations to deal with data
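A quick illustration of immutability (df and its age column are assumptions): a transformation returns a new DataFrame and leaves the original untouched:
df2 = df.withColumn('age_plus_one', df.age + 1)  # df itself is unchanged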
What are the primary functions to filter data on pyspark?
.filter() and .where() (they are aliases of each other)
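Since the two are aliases, these calls are equivalent (voter_df and TITLE are assumptions):
voter_df.filter(voter_df.TITLE == 'Mayor')   # Column expression
voter_df.where("TITLE == 'Mayor'")           # SQL expression string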
What is the function of pyspark.sql.functions.split?
How would you use it to split a column with entries like "john williams" on whitespace?
Splits str around matches of the given pattern.
F.split('col_name', pattern=r'\s+')  # split on whitespace
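A minimal sketch (voter_df and VOTER_NAME are assumptions):
import pyspark.sql.functions as F
voter_df = voter_df.withColumn('splits', F.split(voter_df.VOTER_NAME, r'\s+'))
# 'splits' is now an array column, e.g. ['john', 'williams']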
What is the role of Column.getItem(key)?
Column.getItem(key)
An expression that gets an item at position ordinal out of a list, or gets an item by key out of a dict.
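Continuing the split sketch above, getItem pulls the pieces out of the array column (the column names are assumptions):
voter_df = voter_df.withColumn('first_name', voter_df.splits.getItem(0))
voter_df = voter_df.withColumn('last_name', voter_df.splits.getItem(1))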
What are the 2 primary Conditional DataFrame column operations in pyspark?
F.when and .otherwise (.otherwise is not a standalone function; it is chained onto the Column that F.when returns)
Give an example using F.when and .otherwise in PySpark.
voter_df = voter_df.withColumn('random_val',
    F.when(voter_df.TITLE == 'Councilmember', F.rand())
     .when(voter_df.TITLE == 'Mayor', 2)
     .otherwise(0))
How do you generate a random value with pyspark?
pyspark.sql.functions.rand()  # uniformly distributed value in [0.0, 1.0)
What is a UDF in pyspark?
User-defined functions: regular Python functions wrapped with pyspark.sql.functions.udf so they can be applied to DataFrame columns like built-in functions.
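A minimal sketch of defining and applying a UDF (the function and column names are assumptions):
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def first_initial(name):
    return name[0] if name else None

first_initial_udf = F.udf(first_initial, StringType())
voter_df = voter_df.withColumn('first_initial', first_initial_udf(voter_df.first_name))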
What is lazy processing?
Transformation operations are lazy; a transformation is more like a recipe than a command. It defines what should be done to a DataFrame rather than actually doing it, and nothing runs until an action (such as .count() or .show()) is called.
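For example (df and its value column are assumptions):
df2 = df.withColumn('doubled', df.value * 2)  # lazy: nothing executes yet
df3 = df2.filter(df2.doubled > 10)            # still only building the plan
df3.show()                                    # action: Spark now runs the plan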
What is the role of F.monotonically_increasing_id?
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
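Typical usage is adding an ID column (voter_df is an assumption):
import pyspark.sql.functions as F
voter_df = voter_df.withColumn('ROW_ID', F.monotonically_increasing_id())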