Transformation Functions Flashcards
Access the File System
%fs ls
%fs ls /some_folder_name
dbutils.fs.ls("")
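For example (a minimal sketch, assuming a Databricks notebook where dbutils is available; the path is just an illustration):
files = dbutils.fs.ls("/databricks-datasets")   # returns a list of FileInfo objects
for f in files:
    print(f.name, f.size)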
Create Table/DataFrame
- from %sql
- from spark.sql()
- if you have an RDD, list, or pandas dataframe
- return a table as a dataframe
- create temporary view from Dataframe
- CREATE TABLE IF NOT EXISTS
- df = spark.sql("SELECT * FROM table_name")
- spark.createDataFrame()
- spark.table()
- df.createOrReplaceTempView("")
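A minimal sketch tying these together (the "people" name and sample rows are hypothetical; the CREATE TABLE line assumes a writable metastore):
df = spark.createDataFrame([("Ada", 36), ("Grace", 45)], ["name", "age"])     # from a list
df.createOrReplaceTempView("people")                                          # temporary view from a DataFrame
spark.sql("CREATE TABLE IF NOT EXISTS people_copy AS SELECT * FROM people")
df2 = spark.sql("SELECT * FROM people")   # query result as a DataFrame
df3 = spark.table("people")               # table/view returned as a DataFrame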
Select/Filter
SQL & Python/PySpark
SELECT * FROM table WHERE column = value
df.filter(df.column == value)
df.filter("column = 'value'")
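For example (a sketch; the table and "status" column are hypothetical, and .where() is an alias for .filter()):
spark.sql("SELECT * FROM events WHERE status = 'active'")   # SQL
df.filter(df.status == "active")                            # column expression
df.filter("status = 'active'")                              # SQL-style string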
SQL GROUP BY in PySpark
df.groupBy()
DataFrame Reader Function
just a parquet reading example
spark.read.parquet("filepath")
DataFrame Writer Function
example
df.write.option() \
.format() \
.mode() \
.parquet(outPath)
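A filled-in sketch of the same pattern (outPath is assumed to be a writable location such as "/tmp/products_parquet"):
(df.write
    .mode("overwrite")                  # error / overwrite / append / ignore
    .option("compression", "snappy")    # writer option (compression codec here)
    .parquet(outPath))                  # .parquet() implies .format("parquet")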
Accessing Column in PySpark
3 different ways
df["column_name"]
df.column_name
col("column_name")   # from pyspark.sql.functions
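The third form needs an import (a minimal sketch; "price" is a hypothetical column):
from pyspark.sql.functions import col

df.select(df["price"], df.price, col("price") * 2)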
Column Operators
and/or, math, equality
&, | : and, or
*, +, <, >= : math and comparison operators
==, != : equal, not equal
Column Methods
(4 examples)
(think about sorting, changing data types, changing column names, NULL)
.alias() : column alias
.cast(), .astype(): cast column to different data type
.isNull(), .isNotNull(): is null, not null
.asc(), .desc(): ascending/descending
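A sketch combining the operators and methods above (the "price" and "quantity" columns are hypothetical):
from pyspark.sql.functions import col

result = (df
    .filter((col("price") > 10) & col("quantity").isNotNull())   # >, &, .isNotNull()
    .select((col("price") * col("quantity")).alias("total"))     # math + .alias()
    .orderBy(col("total").desc()))                               # .desc()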
read in a file
what are the 5 components of the method chain?
df = (spark.read
    .option("sep", ",")           # the separator
    .option("header", True)       # first row = header
    .option("inferSchema", True)  # infer the schema
    .csv(productsCsvPath))        # path to the CSV file
define the schema
(function used and components)
(read with user defined schema, what changes?)
StructType([
    StructField("col_name", data_type, True),
    StructField("full_name", LongType(), True),
    StructField("email", StringType(), True)
])
-----------------------------
.option("inferSchema", True) changes to
.schema(userDefinedSchema)
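A sketch of the full read with a user-defined schema (productsCsvPath and the column names are hypothetical):
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

userDefinedSchema = StructType([
    StructField("name", StringType(), True),
    StructField("price", DoubleType(), True)
])

df = (spark.read
    .option("sep", ",")
    .option("header", True)
    .schema(userDefinedSchema)   # replaces .option("inferSchema", True)
    .csv(productsCsvPath))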
DataFrame Action Methods
(4)
(remember viewing data, calculations, summaries, collections)
.show(), .head(), .take(): display n num. rows
.count(): count the number of rows
.describe(), .summary(): basic stats.
.collect(): array of all rows in the DataFrame
Python Row Methods
(5)
(think about indexing, accessing fields, returning values in another format, and count)
.index(): return first index of a value
.count(): number of occurrences of a value
.asDict(): return the row as a dictionary
row.key: access fields like attributes
row["key"]: access fields like dictionary values
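For example (values are hypothetical; Row behaves like a tuple, so .index() and .count() come from tuple):
from pyspark.sql import Row

row = Row(name="Ada", email="ada@example.com")
row.name             # attribute-style access
row["email"]         # dictionary-style access
row.asDict()         # {'name': 'Ada', 'email': 'ada@example.com'}
row.index("Ada")     # 0 : first index of the value
row.count("Ada")     # 1 : number of occurrences of the value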
Aggregation methods (5)
(what method usually precedes them?)
(think about common calculations)
groupBy() usually precedes them
.agg()
.sum()
.avg(), .mean()
.count()
.max(), .min()
.pivot()
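A sketch with hypothetical "region" and "amount" columns:
import pyspark.sql.functions as F

df.groupBy("region").count()
df.groupBy("region").agg(F.sum("amount").alias("total"),
                         F.avg("amount").alias("avg_amount"))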
Some of the PySpark Built-In Aggregate Functions
(9)
(remember sum. stats. & summaries of data subsets)
Remember these are functions, not methods, so you embed them inside a method (e.g. .agg()) or another function; do not chain them.
approx_count_distinct(): approx. num. of distinct items
collect_list(): return list obj. w/ dups.
stddev_samp(): sample standard deviation
var_pop(): pop. variance
collect_set(col): returns a set of objects with duplicate elements eliminated. # important one to remember because not a collect function or method
sumDistinct(): sum of distinct values in a column
avg(), mean(): average value
corr(): correlation between two columns
max(): maximum value
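A sketch showing the embed-inside-.agg() pattern (the "customer_id" and "item" columns are hypothetical):
import pyspark.sql.functions as F

df.groupBy("customer_id").agg(
    F.approx_count_distinct("item").alias("approx_items"),
    F.collect_set("item").alias("unique_items"),    # duplicates removed
    F.collect_list("item").alias("all_items")       # duplicates kept
)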
PySpark Built-In Math Functions (4)
think about largest value in a column, log transformations, rounding, square root
ceil(): ceiling of a given column
log()
round()
sqrt()
Date Time Functions (6)
date_format: convert date/time to string
add_months: returns date numMonths after startDate
dayofweek: day of the week as integer
from_unixtime: convert num. of sec. from unix epoch to string
minute: extracts minutes as integer
unix_timestamp: converts time string to Unix timestamp
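A sketch assuming a hypothetical timestamp column "ts":
import pyspark.sql.functions as F

df.select(
    F.date_format("ts", "yyyy-MM-dd").alias("day"),
    F.dayofweek("ts").alias("day_of_week"),
    F.minute("ts").alias("minute"),
    F.unix_timestamp("ts").alias("epoch_seconds")
)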
String Functions (5)
remember regular expressions, SQL functions
translate: replaces the characters given in the search string with the corresponding replacement characters specified in the function
regexp_replace: replaces substrings that match a regular expression
regexp_extract / ltrim: extracts groups from a string that match a Java regular expression / trims leading whitespace
lower, upper: converts a string to lowercase/uppercase
split: splits string based on a pattern
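A sketch with a hypothetical "email" column:
import pyspark.sql.functions as F

df.select(
    F.lower("email").alias("email_lower"),
    F.regexp_replace("email", r"@.*$", "").alias("local_part"),
    F.split("email", "@").alias("parts")
)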
Collection Functions (4)
functions that deal with arrays
- array_contains(col, value): returns true if the array column contains the value, false otherwise
- explode(col): creates a new row for each element in a given array or map column
- slice(x, start, length): returns an array containing all the elements in x from index start with specified length. Creates a subset of an array with a start position and number of elements.
x - the array to be sliced
- element_at(col, extraction): returns the element of the array at the given index (extraction) if col is an array; indexing is 1-based
extraction - index (number) to check for in the array
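A sketch with a hypothetical array column "tags":
import pyspark.sql.functions as F

df.select(
    F.array_contains("tags", "sale").alias("has_sale"),
    F.slice("tags", 1, 2).alias("first_two"),     # start is 1-based, length 2
    F.element_at("tags", 1).alias("first_tag")    # index is 1-based
)
df.select(F.explode("tags").alias("tag"))         # one row per array element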
Properties of UDFs (3)
- can't be optimized by the Catalyst optimizer
- the function must be serialized and sent to the executors
- additional overhead from the Python interpreter running the Python UDF on the executors
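A sketch of defining and using a Python UDF (the function and column names are hypothetical):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def first_letter(s):
    return s[0] if s else None

first_letter_udf = udf(first_letter, StringType())      # serialized and sent to executors
df.select(first_letter_udf(col("email")).alias("initial"))

spark.udf.register("first_letter", first_letter, StringType())   # also usable from SQL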