Spark Dataframe commands Flashcards
Describe a dataframe in your own words
Dataframe is like a table with rows and columns
How do you read a table from Hive into a Spark dataframe with a select statement - Spark 2.6
spark.sql("Select * from db.mytable")
How do you read a table from Hive into a Spark dataframe without a select statement - Spark 2.6
spark.table("db.myTable")
How to display a dataframe
df.show()
How to display exactly 100 rows of a dataframe
df.show(100)
Why do we pass True/False in show
The second argument of show is truncate. With truncate=True (the default) long column values are cut off at 20 characters; with truncate=False the full column contents are displayed.
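For example, a minimal sketch (assuming a dataframe df already exists):
df.show(100, truncate=False)  # 100 rows, full column contents
df.show(100, True)            # 100 rows, values longer than 20 characters are cut off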
Select specific columns
df.select('col1', 'col2', 'col3')
Can I pass a list of columns within the select statement
Yes. df.select(['col1', 'col2', 'col3'])
How do I change the column name without using withColumnRenamed?
df.selectExpr("col1", "col2 as test_column")
How do I pull specific rows from a dataframe - For example where a certain column in my dataframe is true
df.filter("col1 = True")
import statement to import functions
from pyspark.sql import functions as func
Get the total number of records in a dataframe
df.count()
How do I get the count of distinct values in a column?
df.dropDuplicates(["col1"]).count()
What is the difference between df.dropDuplicates() and df.dropDuplicates(“col1”)
dropDuplicates() drops rows that are duplicates across all columns, while dropDuplicates(["col1"]) drops rows that have duplicate values in col1 only, keeping the first row for each distinct value.
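For example, a minimal sketch (assuming df has columns col1 and col2):
df.dropDuplicates()           # keep rows that are unique across all columns
df.dropDuplicates(["col1"])   # keep one row per distinct col1 value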
How do I see the schema of a dataframe
df.printSchema()
How do I see the column names along with the datatypes
df.printSchema(), or df.dtypes for a list of (column name, datatype) tuples
How do I retrieve the columns to a python list
df.columns
df.columns() - Is this correct and what will be the output
No. columns is an attribute, not a method, so the parentheses should not be present; df.columns() throws a TypeError.
How do I drop a column from a dataframe?
df.drop("col1")
df.drop(["col1", "col2"]) - Is this correct and why
No. drop() takes column names as separate arguments, not a list, so this throws an error. Use df.drop("col1", "col2"), or unpack a list with df.drop(*cols).
Groupby syntax with count
df.groupBy("col1").agg(func.count("col2"))
Order the rows in a dataframe on a certain column.
df.orderBy(func.asc("col1"))
Groupby on multiple columns syntax with count
df.groupBy("col1", "col2").agg(func.count("col2"))
Order the rows in a dataframe on multiple columns.
df.orderBy(func.asc("col1"), func.desc("col2"))
case expression general syntax
case when col1 = 'Y' then 'True' when col1 = 'N' then 'False' else 'NA' end
drop multiple columns
df.drop("col1", "col2")
drop duplicate values in multiple columns
df.dropDuplicates(["col1", "col2"])
Create a new column in the dataframe. The new column is a flag that has true or false. If a column value is > 100 then True else false
df.withColumn("flag", func.expr("case when col1 > 100 then True else False end"))
I have a dataframe with some records. I need to flag all the records as ‘True’ before I proceed further. How do I do that
df.withColumn("flag", func.lit(True))
Rename a column
df.withColumnRenamed("old_col_name", "new_col_name")
Two ways to rename a column
withColumnRenamed, selectExpr
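For example, a minimal sketch (assuming df has columns old_col and other_col):
df.withColumnRenamed("old_col", "new_col")          # keeps all other columns as-is
df.selectExpr("old_col as new_col", "other_col")    # returns only the listed expressions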
Ways to create dataframe
- Reading hive tables
- Reading CSV or JSON files
- Create dataframe from list
- Create dataframe from rdd
How to read csv files into dataframe?
- df = spark.read.csv("file.csv")
- df = spark.read.format("csv").load("file.csv")
Column names for this dataframe - df = spark.read.format("csv").load("file.csv")
_c0, _c1, _c2…
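If needed, the default names can be replaced right after the read; a minimal sketch, assuming the file has exactly three columns:
df = spark.read.format("csv").load("file.csv").toDF("col1", "col2", "col3")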
How to load header for csv read command
df2 = spark.read.option("header", True).csv("file.csv")
df2 = spark.read.options(header='True').csv("file.csv")
PySpark reads all columns as a ________ data type by default
string
Read multiple csv files into a single dataframe
df = spark.read.csv(["path1", "path2", "path3"])
Read all CSV files from a directory into DataFrame
df = spark.read.csv("folder path")
Specify a specific delimiter while reading csv
df3 = spark.read.option("delimiter", ",").csv("test.csv")
df3 = spark.read.options(delimiter=',').csv("test.csv")
How to have Spark infer the column datatypes instead of the default string when reading a csv
df3 = spark.read.option("inferSchema", True).csv("test.csv")
df3 = spark.read.options(inferSchema='True').csv("test.csv")
Set both delimiter and inferschema
df3 = spark.read.option("delimiter", ",").option("inferSchema", True).csv("test.csv")
df3 = spark.read.options(inferSchema='True', delimiter='|').csv("test.csv")
how to import datatypes
from pyspark.sql.types import *
Read with a custom schema - I don't want the default string schema and I don't want inferSchema; I want to specify my own datatypes
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('id', IntegerType(), True)
])
df = spark.read.format("csv").option("header", True).schema(schema).load("file.csv")
df = spark.read.option("header", True).schema(schema).csv("file.csv")
Write a dataframe to csv file with no header
df.write.format("csv").option("header", False).save("demo.csv")
df.write.option("header", False).csv("demo.csv")
(header is False by default for writes, so the option can also be omitted)
Modes while saving a dataframe as a file
- overwrite - overwrites the existing file.
- append - adds the data to the existing file.
- ignore - ignores the write operation when the file already exists.
- error - the default; returns an error when the file already exists.
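For example, a minimal sketch of setting a mode explicitly (overwrite shown here):
df.write.mode("overwrite").option("header", True).csv("demo.csv")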
append mode
df.write.mode("append").option("header", True).csv("demo.csv")
create spark session
spark = SparkSession \
    .builder \
    .appName("App1") \
    .getOrCreate()
Check the type of variable
type(df)
Read a json file
- df = spark.read.json("file.json")
- df = spark.read.format("json").load("file.json")
import statement to import SparkSession
from pyspark.sql import SparkSession
[{ "RecordNumber": 2, "Zipcode": 704, "ZipCodeType": "STANDARD", "City": "PASEO COSTA DEL SUR", "State": "PR" }, { "RecordNumber": 10, "Zipcode": 709, "ZipCodeType": "STANDARD", "City": "BDA SAN LUIS", "State": "PR" }]
Read the multiline json records
df = spark.read.options(multiLine="True").json("file.json")
Read multiple json files
df = spark.read.json(["json path1", "json path2", "json path3"])
Read all json files in a directory
df = spark.read.json("files/*.json")
Pass custom schema for each of the columns for json
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])
df = spark.read.schema(schema).json("file.json")
Write a dataframe as json file
df.write.json("file.json")
Write a dataframe as json file - append mode
df.write.mode("append").json("file.json")
Create a dataframe using parallelize
from pyspark.sql import Row
dept = [Row("A", 10),
Row("B", 20),
Row("C", 30)]
rdd = spark.sparkContext.parallelize(dept)
df = rdd.toDF(["col1", "col2"])
Create dataframe from list without using parallelize
dept = [("A", 10), ("B", 20), ("C", 30)]
col_names = ("col_1_name", "col_2_name")
df = spark.createDataFrame(data=dept, schema=col_names)
I have two tables. One table has id and location. The second table has all the ids who are assigned a parking space. I need an output report of the id, location and whether a parking space is allocated - is_parking_allocated (Y/N)
join
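A minimal sketch, assuming hypothetical dataframes emp (id, location) and parking (id), and the functions import from earlier (func):
report = emp.alias("e").join(parking.alias("p"), emp.id == parking.id, "left_outer") \
    .withColumn("is_parking_allocated", func.when(func.col("p.id").isNotNull(), "Y").otherwise("N")) \
    .select("e.id", "e.location", "is_parking_allocated")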
I have two tables. Table 1 - ids, location; Table 2 - ids, salary. Output: ids, location and salary
join
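A minimal sketch, assuming hypothetical dataframes loc_df (id, location) and sal_df (id, salary):
report = loc_df.alias("a").join(sal_df.alias("b"), loc_df.id == sal_df.id, "inner") \
    .select("a.id", "a.location", "b.salary")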
I have two tables. Table 1 - parts and price; Table 2 - only the parts purchased in the last 2 months. Output: parts, price and a flag for whether the part was purchased in the last 2 months - flag_2_months
join
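A minimal sketch, assuming hypothetical dataframes parts_df (part, price) and recent_df (part), and the functions import from earlier (func):
report = parts_df.alias("a").join(recent_df.alias("b"), parts_df.part == recent_df.part, "left_outer") \
    .withColumn("flag_2_months", func.when(func.col("b.part").isNotNull(), True).otherwise(False)) \
    .select("a.part", "a.price", "flag_2_months")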
Declare udf
from pyspark.sql import functions as func
def split_str(s):
    return s.split("_")[1]
split_str_udf = func.udf(split_str)  # udf registration
df1 = df.withColumn("last_name", split_str_udf("full_name"))
Read avro
df = spark.read.format("avro").load("avro_file_path")
save avro
df.write.format("avro").save("avro_file_path")
left outer join df1 with df2 with alias; join on id present in both tables and select two columns, one from each table
df1.alias("a").join(df2.alias("b"), df1.id == df2.id, "left_outer").select("a.col1", "b.col2")
inner join df1 with df2 with alias; join on id present in both tables and select two columns, one from each table
df1.alias("a").join(df2.alias("b"), df1.id == df2.id, "inner").select("a.col1", "b.col2")
right outer join df2 with df1 with alias; join on id present in both tables and select two columns, one from each table
df2.alias("a").join(df1.alias("b"), df2.id == df1.id, "right_outer").select("b.col1", "a.col2")