Spark Dataframe commands Flashcards

1
Q

Describe a dataframe in your own words

A

A DataFrame is a distributed collection of data organized into named columns, much like a table with rows and columns in a relational database
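A minimal sketch, assuming a SparkSession named spark already exists (see the session-creation card later in this deck); the sample data and column names are illustrative:

df = spark.createDataFrame([(1, "A"), (2, "B")], ["id", "name"])  # two rows, two named columns
df.show()  # renders the dataframe as a small table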

2
Q

How do you read a table from Hive into a Spark dataframe with a select statement - Spark 2.x

A

spark.sql("SELECT * FROM db.mytable")

3
Q

How do you read a table from Hive into a Spark dataframe without a select statement - Spark 2.x

A

spark.table("db.myTable")

4
Q

How to display a dataframe

A

df.show()

5
Q

How to display exactly 100 rows of a dataframe

A

df.show(100)

6
Q

Why do we pass True/False in show

A

This is the truncate parameter. show(truncate=True), the default, truncates long column values to 20 characters; show(truncate=False) displays the full column contents
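A quick sketch of the difference, assuming spark is available; the long value is illustrative:

df = spark.createDataFrame([("a-very-long-string-value-here",)], ["col1"])
df.show()                # truncate defaults to True: the value is cut at 20 characters
df.show(truncate=False)  # the full column contents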

7
Q

Select specific columns

A

df.select('col1', 'col2', 'col3')

8
Q

Can I pass a list of columns within the select statement

A

Yes. df.select(['col1', 'col2', 'col3'])

9
Q

How do I change the column name without using withColumnRenamed?

A

df.selectExpr("col1", "col2 as test_column")

10
Q

How do I pull specific rows from a dataframe - for example, rows where a certain column is true

A

df.filter("col1 = True")
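Equivalently, with a column expression instead of a SQL string (col1 is an illustrative column name; func is the functions import from the next card):

df.filter(df.col1 == True)
df.filter(func.col("col1") == True)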

11
Q

import statement to import functions

A

from pyspark.sql import functions as func

12
Q

Get the total number of records in a dataframe

A

df.count()

13
Q

How do I get the count of distinct values in a column?

A

df.dropDuplicates(["col1"]).count() - dropDuplicates takes a list of column names. Equivalently: df.select("col1").distinct().count()

14
Q

What is the difference between df.dropDuplicates() and df.dropDuplicates(["col1"])

A

dropDuplicates() drops rows that are duplicated across all columns, while dropDuplicates(["col1"]) drops rows with duplicate values in just that one column, keeping the first occurrence
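A small sketch, assuming spark; the data is illustrative:

df = spark.createDataFrame([(1, "x"), (1, "x"), (1, "y")], ["id", "val"])
df.dropDuplicates().count()        # 2 - only the fully identical row is dropped
df.dropDuplicates(["id"]).count()  # 1 - one row kept per distinct id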

15
Q

How do I see the schema of a dataframe

A

df.printSchema()

16
Q

How do I see the column names along with the datatypes

A

df.printSchema() (df.dtypes also returns the column names with their datatypes, as a Python list of tuples)

17
Q

How do I retrieve the columns to a python list

A

df.columns

18
Q

df.columns() - Is this correct and what will be the output

A

No. columns is a property, not a method, so the parentheses shouldn't be present; df.columns() throws a TypeError because the returned list is not callable

19
Q

How do I drop a column from a dataframe?

A

df.drop("col1")

20
Q

df.drop(["col1","col2"]) - Is this correct and why

A

No. Unlike select, drop does not accept a list and this throws a TypeError; pass the names as separate arguments instead: df.drop("col1", "col2") (or unpack a list with df.drop(*["col1", "col2"]))

21
Q

Groupby syntax with count

A

df.groupBy("col1").agg(func.count("col2"))
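The aggregated column can also be given a readable name with alias; a sketch assuming the func import from the earlier card:

df.groupBy("col1").agg(func.count("col2").alias("cnt"))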

22
Q

Order the rows in a dataframe on a certain column.

A

df.orderBy(func.asc("col1"))

23
Q

Groupby on multiple columns syntax with count

A

df.groupBy("col1", "col2").agg(func.count("col2"))

24
Q

Order the rows in a dataframe on multiple columns.

A

df.orderBy(func.asc("col1"), func.desc("col2"))

25
Q

case expression general syntax

A

case when col1 = 'Y' then 'True' when col1 = 'N' then 'False' else 'NA' end
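To run such an expression against a dataframe, wrap it in selectExpr or func.expr; a sketch with an illustrative column col1:

df.selectExpr("case when col1 = 'Y' then 'True' when col1 = 'N' then 'False' else 'NA' end as flag")
df.withColumn("flag", func.expr("case when col1 = 'Y' then 'True' when col1 = 'N' then 'False' else 'NA' end"))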

26
Q

drop multiple columns

A

df.drop("col1", "col2")

27
Q

drop duplicate values in multiple columns

A

df.dropDuplicates(["col1", "col2"])

28
Q

Create a new column in the dataframe. The new column is a flag that is True or False: if a column value is > 100 then True, else False

A

df.withColumn("flag", func.expr("case when col1 > 100 then true else false end"))

29
Q

I have a dataframe with some records. I need to flag all the records as 'True' before I proceed further. How do I do that

A

df.withColumn("flag", func.lit(True))

30
Q

Rename a column

A

df.withColumnRenamed("old_col_name", "new_col_name")

31
Q

Two ways to rename a column

A

withColumnRenamed, selectExpr
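Side by side, both renaming col2 to test_column (column names are illustrative):

df.withColumnRenamed("col2", "test_column")   # keeps all other columns
df.selectExpr("col1", "col2 as test_column")  # keeps only the listed columns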

32
Q

Ways to create dataframe

A
  1. Reading Hive tables
  2. Reading CSV or JSON files
  3. Creating a dataframe from a Python list
  4. Creating a dataframe from an RDD
33
Q

How to read csv files into dataframe?

A
  1. df = spark.read.csv("file.csv")
  2. df = spark.read.format("csv").load("file.csv")

34
Q

Column names for this dataframe - df = spark.read.format("csv").load("file.csv")

A

_c0, _c1, _c2…

35
Q

How to load the header row when reading a csv file

A
df2 = spark.read.option("header",True).csv("file.csv")
df2 = spark.read.options(header = 'True').csv("file.csv")
36
Q

PySpark reads all columns as a ________ data type by default

A

string
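A sketch of the effect, assuming a file.csv with numeric columns; inferSchema (covered a few cards below) overrides the default:

spark.read.csv("file.csv").printSchema()                                # every column: string
spark.read.option("inferSchema", True).csv("file.csv").printSchema()    # columns get inferred types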

37
Q

Read multiple csv files into a single dataframe

A

df = spark.read.csv(["path1", "path2", "path3"])

38
Q

Read all CSV files from a directory into DataFrame

A

df = spark.read.csv("folder_path")

39
Q

Specify a specific delimiter while reading csv

A
df3 = spark.read.option("delimiter", ",").csv("test.csv")
df3 = spark.read.options(delimiter=',').csv("test.csv")
40
Q

How to change the default datatype read by spark from a csv

A
df3 = spark.read.option("inferSchema", True).csv("test.csv")
df3 = spark.read.options(inferSchema='True').csv("test.csv")
41
Q

Set both delimiter and inferSchema

A

df3 = spark.read.option("delimiter", ",").option("inferSchema", True).csv("test.csv")

df3 = spark.read.options(inferSchema='True', delimiter='|').csv("test.csv")

42
Q

how to import datatypes

A

from pyspark.sql.types import *

43
Q

Read with a custom schema - I don't want the default string schema, and I don't want inferSchema either; I want to set my own datatypes

A

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('id', IntegerType(), True)
])

df = spark.read.format("csv").option("header", True).schema(schema).load("file.csv")

df = spark.read.option("header", True).schema(schema).csv("file.csv")

44
Q

Write a dataframe to a csv file with a header

A

df.write.format("csv").option("header", True).save("demo.csv")
df.write.option("header", True).csv("demo.csv")

45
Q

Modes while saving a dataframe as a file

A
  1. overwrite – overwrites the existing file.
  2. append – adds the data to the existing file.
  3. ignore – skips the write operation when the file already exists.
  4. error – the default; returns an error when the file already exists.
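Each mode plugs into the same write call; a sketch with illustrative paths (append appears on the next card):

df.write.mode("overwrite").option("header", True).csv("demo.csv")  # replace existing output
df.write.mode("ignore").csv("demo.csv")                            # skip silently if output exists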
46
Q

append mode

A

df.write.mode("append").option("header", True).csv("demo.csv")

47
Q

create spark session

A

spark = SparkSession \
    .builder \
    .appName("App1") \
    .getOrCreate()

48
Q

Check the type of variable

A

type(df)

49
Q

Read a json file

A
  1. df = spark.read.json("file.json")
  2. df = spark.read.format("json").load("file.json")

50
Q

import statement to import SparkSession

A

from pyspark.sql import SparkSession

51
Q
[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR",
  "State": "PR"
},
{
  "RecordNumber": 10,
  "Zipcode": 709,
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
}]

Read the multiline json records

A

df = spark.read.options(multiLine='True').json("file.json")

52
Q

Read multiple json files

A

df = spark.read.json(["json path1", "json path2", "json path3"])

53
Q

Read all json files in a directory

A

df = spark.read.json("files/*.json")

54
Q

Pass custom schema for each of the columns for json

A

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True)
])

df = spark.read.schema(schema).json("file.json")

55
Q

Write a dataframe as json file

A

df.write.json("file.json")

56
Q

Write a dataframe as json file - append mode

A

df.write.mode("append").json("file.json")

57
Q

Create a dataframe using parallelize

A

from pyspark.sql import Row

dept = [Row("A", 10),
        Row("B", 20),
        Row("C", 30)]

rdd = spark.sparkContext.parallelize(dept)
df = rdd.toDF(["col1", "col2"])
58
Q

Create dataframe from list without using parallelize

A
dept = [("A",10),
("B",20),
("C",30)]
col_names= ("col_1_name", "col_2_name")
df = spark.createDataFrame(data = dept, schema = col_names)
59
Q

I have two tables. One table has id and location. The second table has all the ids that are assigned a parking space. I need an output report of the id, location and whether a parking space is allocated - is_parking_allocated (Y/N)

A

A left outer join: left-join table 1 to table 2 on id, then derive is_parking_allocated as 'Y' when the join matched and 'N' otherwise
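A sketch of that report, assuming dataframes named emp (id, location) and parking (id), plus the func import; all names are illustrative:

result = (emp.alias("a")
    .join(parking.alias("b"), func.col("a.id") == func.col("b.id"), "left_outer")
    .withColumn("is_parking_allocated", func.when(func.col("b.id").isNull(), "N").otherwise("Y"))
    .select("a.id", "a.location", "is_parking_allocated"))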

60
Q

I have two tables. Table 1 - ids, location; Table 2 - ids, salary. Output: ids, location and salary

A

A join on the id column, selecting location from table 1 and salary from table 2

61
Q

I have two tables. Table 1 - parts and price; Table 2 - only the parts purchased in the last 2 months. Output: parts, price and a flag indicating whether the part was purchased in the last 2 months - flag_2_months

A

A left outer join from table 1 to table 2 on the part column, deriving flag_2_months from whether the join matched, as in the parking example above

62
Q

Declare udf

A

from pyspark.sql import functions as func

def split_str(s):
    return s.split("_")[1]

split_str_udf = func.udf(split_str)  # udf registration

df1 = df.withColumn("last_name", split_str_udf("full_name"))

63
Q

Read avro

A

df = spark.read.format("avro").load("avro_file_path")

64
Q

save avro

A

df.write.format("avro").save("avro_file_path")

65
Q

left outer join df1 with df2 with alias; join on id present in both tables and select two columns, one from each table

A

df1.alias("a").join(df2.alias("b"), df1.id == df2.id, "left_outer").select("a.col1", "b.col2")

66
Q

inner join df1 with df2 with alias; join on id present in both tables and select two columns, one from each table

A

df1.alias("a").join(df2.alias("b"), df1.id == df2.id, "inner").select("a.col1", "b.col2")

67
Q

right outer join df2 with df1 with alias; join on id present in both tables and select two columns, one from each table

A

df2.alias("a").join(df1.alias("b"), df2.id == df1.id, "right_outer").select("b.col1", "a.col2")