Cleaning Data with PySpark Flashcards
How do you import different data types in Spark?
from pyspark.sql.types import IntegerType, StringType  # plus whatever other types you need
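A minimal schema sketch using these types (the file name and column names are assumptions; spark is an existing SparkSession):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

people_schema = StructType([
    StructField('name', StringType(), nullable=False),
    StructField('age', IntegerType(), nullable=False)
])
people_df = spark.read.csv('people.csv', schema=people_schema)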
What is the main reason for PySpark using immutability and lazy processing?
Spark takes advantage of data immutability to efficiently share / create new data representations throughout the cluster, and lazy processing lets it plan and optimize the full chain of transformations before executing any of them.
What is the Parquet Format?
Parquet is a compressed columnar data format developed for use in any Hadoop-based system. This includes Spark, Hadoop, Apache Impala, and so forth. The Parquet format is structured with data accessible in chunks, allowing efficient read / write operations without processing the entire file.
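For instance, a minimal read / write sketch (the paths and df are assumptions):
df = spark.read.parquet('input.parquet')
df.write.parquet('output.parquet', mode='overwrite')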
If we want to use sql language in a pyspark dataframe, what method should we call first?
dataframe.createOrReplaceTempView('custom_table_name')
then we can run SQL through the SparkSession with
spark.sql('SQL QUERY')
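A minimal end-to-end sketch (voter_df and the TITLE column are assumptions):
voter_df.createOrReplaceTempView('voters')
titles_df = spark.sql('SELECT TITLE, COUNT(*) AS total FROM voters GROUP BY TITLE')
titles_df.show()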
What is a DataFrame in pyspark?
Made up of rows and columns
Immutable
Uses transformations to deal with data
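A quick illustration of immutability (df and its age column are assumptions): a transformation returns a new DataFrame and leaves the original untouched:
df2 = df.withColumn('age_plus_one', df.age + 1)  # df itself is unchanged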
What are the primary functions to filter data on pyspark?
.filter() and .where() (they are aliases of each other)
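Since the two are aliases, these calls are equivalent (voter_df and TITLE are assumptions):
voter_df.filter(voter_df.TITLE == 'Mayor')   # Column expression
voter_df.where("TITLE == 'Mayor'")           # SQL expression string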
What is the function of pyspark.sql.functions.split?
How would you use it to split a column with entries like "john williams" on whitespace?
Splits str around matches of the given pattern.
F.split('col_name', pattern=r'\s+')  # split on whitespace
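A minimal sketch (voter_df and VOTER_NAME are assumptions):
import pyspark.sql.functions as F
voter_df = voter_df.withColumn('splits', F.split(voter_df.VOTER_NAME, r'\s+'))
# 'splits' is now an array column, e.g. ['john', 'williams']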
What is the role of Column.getItem(key)?
Column.getItem(key)
An expression that gets an item at position ordinal out of a list, or gets an item by key out of a dict.
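Continuing the split sketch above, getItem pulls the pieces out of the array column (the column names are assumptions):
voter_df = voter_df.withColumn('first_name', voter_df.splits.getItem(0))
voter_df = voter_df.withColumn('last_name', voter_df.splits.getItem(1))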
What are the 2 primary Conditional DataFrame column operations in pyspark?
F.when and .otherwise (.otherwise is not a standalone function; it is chained onto the Column that F.when returns)
Give an example using F.when and .otherwise in PySpark.
voter_df = voter_df.withColumn('random_val',
    F.when(voter_df.TITLE == 'Councilmember', F.rand())
     .when(voter_df.TITLE == 'Mayor', 2)
     .otherwise(0))
How do you generate a random value with pyspark?
pyspark.sql.functions.rand()  # uniformly distributed value in [0.0, 1.0)
What is a UDF in pyspark?
User-defined functions: regular Python functions wrapped with pyspark.sql.functions.udf so they can be applied to DataFrame columns like built-in functions.
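A minimal sketch of defining and applying a UDF (the function and column names are assumptions):
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def first_initial(name):
    return name[0] if name else None

first_initial_udf = F.udf(first_initial, StringType())
voter_df = voter_df.withColumn('first_initial', first_initial_udf(voter_df.first_name))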
What is lazy processing?
Transformation operations are lazy; a transformation is more like a recipe than a command. It defines what should be done to a DataFrame rather than actually doing it, and nothing runs until an action (such as .count() or .show()) is called.
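For example (df and its value column are assumptions):
df2 = df.withColumn('doubled', df.value * 2)  # lazy: nothing executes yet
df3 = df2.filter(df2.doubled > 10)            # still only building the plan
df3.show()                                    # action: Spark now runs the plan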
What is the role of F.monotonically_increasing_id?
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
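Typical usage is adding an ID column (voter_df is an assumption):
import pyspark.sql.functions as F
voter_df = voter_df.withColumn('ROW_ID', F.monotonically_increasing_id())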