Spark Flashcards
Tell me about Yourself
Hello everyone, I am Sonali Sahu. I am currently working as a Data Engineer I at TCS, with 3 years of overall experience, and I am leading a team of 4 members in my project. The tech stack I have worked on is Python, SQL, PySpark, and Hive, with Azure Synapse Analytics for analytics, Azure Data Factory for ETL orchestration, and Azure Blob Storage / Data Lake for storage.
How can we print bad records of a csv?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CSVReadWithBadRecords").getOrCreate()
# Set the path for bad records
bad_records_path = "hdfs://your_hdfs_path/bad_records"
# Read the CSV file with the badRecordsPath option
df = spark.read.option("badRecordsPath", bad_records_path).csv("your_csv_file.csv")
How are head() and take() different in Spark?
head() is available only on DataFrames, while take() works on both RDDs and DataFrames.
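A minimal sketch of the difference, assuming a SparkSession named spark already exists (the column names are illustrative):
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
rdd = df.rdd
print(df.head(1))    # head() works on DataFrames and returns a list of Rows
print(df.take(1))    # take() also works on DataFrames
print(rdd.take(1))   # ...and on RDDs; rdd.head() does not exist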
How is data defined in spark?
data = [(1, 'Sample Name'), (2, 'Sample Name2')]
which means a list of tuples, where the length of each tuple equals the number of columns in the schema.
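For example, a DataFrame can be built from such a list of tuples plus a matching column list (the column names below are illustrative):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SampleData").getOrCreate()
data = [(1, 'Sample Name'), (2, 'Sample Name2')]   # each tuple is one row
columns = ["id", "name"]                           # one entry per tuple element
df = spark.createDataFrame(data, schema=columns)
df.show()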
In which scenario can we get Driver OOM?
- collect(), which pulls the data from all partitions onto the driver node.
- broadcast join, where a relatively small table is broadcast to all partitions; if that table is larger than the memory available on the driver (which must collect it first), the driver runs out of memory.
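A rough illustration of the two risky patterns, assuming DataFrames named df, big_df, and small_df exist (the limit value is just a placeholder):
# collect() pulls every partition onto the driver; prefer a bounded alternative on big data
rows = df.limit(100).collect()
# rows = df.collect()              # can OOM the driver on a large DataFrame
from pyspark.sql.functions import broadcast
# broadcast() first collects small_df on the driver, then ships it to every executor;
# if small_df is larger than the driver's free memory, the driver can OOM
joined = big_df.join(broadcast(small_df), "id")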
In which scenario can we get executor OOM?
- Big Partition (Data Skewness)
- Yarn Memory Overhead
- High Concurrency
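These can be mitigated largely through configuration; a hedged sketch using real Spark properties (the values are placeholders, not recommendations):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("ExecutorOOMTuning")
         .config("spark.executor.memory", "8g")                  # bigger executor heap
         .config("spark.executor.memoryOverhead", "2g")          # extra off-heap room for YARN overhead
         .config("spark.sql.shuffle.partitions", "400")          # more, smaller shuffle partitions for high concurrency
         .config("spark.sql.adaptive.skewJoin.enabled", "true")  # let AQE split skewed partitions
         .getOrCreate())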
Most Important Questions
What is AQE in spark?
Spark 1.x - Introduced Catalyst Optimizer
Spark 2.x - Introduced Cost-Based Optimizer
Spark 3.0 - Introduced Adaptive Query Execution
conf command:
spark.conf.set("spark.sql.adaptive.enabled", True)
The Spark optimizer selects the most efficient physical plan.
The cost-based optimizer relies on statistics that were collected earlier; if the data has changed significantly since they were last collected, there can be a large gap between the statistics and reality, and in that scenario the cost-based model may not perform well.
With AQE, suppose there are 10 stages in the application: once the 1st stage completes and its result is returned to the cluster, AQE re-optimizes the physical plan for the remaining stages, so it works with accurate runtime statistics.
Features of AQE:
1. Dynamically coalescing shuffle partitions
2. Dynamically Switching join strategies
3. Dynamically optimizing skew joins.
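These behaviours can be toggled individually; a short sketch of the relevant Spark 3.x settings (defaults may vary by version):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # 1. coalesce shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # 3. optimize skew joins
# 2. join-strategy switching happens automatically once AQE is enabled,
#    using the runtime statistics gathered after each stage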
What are the Join Strategies in Spark?
Apache Spark provides several join strategies to perform efficient join operations on DataFrames or Datasets. The choice of join strategy depends on the characteristics of the data, the size of the tables, and the available system resources. Here are some common join strategies in Spark:
Shuffle Hash Join:
- Shuffle hash join is used when one or both of the DataFrames being joined are large enough to require shuffling the data, and one side is small enough to be hashed within each partition (note that sort-merge join, not shuffle hash join, is Spark's default strategy for large equi-joins).
- It redistributes and partitions the data based on the join keys and then performs a local hash join on each partition.
Broadcast Join:
- Broadcast join is used when one of the DataFrames is small enough to fit in memory on each worker node. This smaller DataFrame is broadcast to all worker nodes, and the join is performed locally.
- Broadcast joins are much more efficient for smaller tables and can significantly reduce the amount of shuffling.
Sort-Merge Join:
- Sort-merge join shuffles both DataFrames on the join keys, sorts each partition, and then merges the sorted sides; it is Spark's default strategy for joining two large DataFrames on equality conditions.
- It is particularly efficient when the data is already sorted or bucketed on the join keys, since the expensive sort step can be skipped.
Bucketed Join:
- Bucketed join is a join strategy for tables that have been pre-partitioned into buckets on the join keys, so rows with the same join key always land in the corresponding bucket.
- This strategy can significantly reduce data shuffling, as it leverages the bucketing information to co-locate matching buckets on the same worker node.
Cartesian Join:
- Cartesian join is the most resource-intensive join strategy. It combines every row from the first DataFrame with every row from the second DataFrame, resulting in a Cartesian product.
- This strategy should be used with caution, as it can lead to a large number of output records.
Broadcast Hash Join:
- Broadcast hash join is the physical form of a broadcast join: the smaller DataFrame is collected, built into a hash table, and broadcast to all worker nodes, where each partition of the larger DataFrame probes it locally.
Map-side Join:
- Map-side join is a term from the Hadoop MapReduce world for a join performed entirely on the map side, without a shuffle; in Spark, a broadcast join plays the equivalent role.
The choice of join strategy should be based on the size of the DataFrames, the available system resources, and the specific characteristics of the data. Efficient use of join strategies can significantly impact the performance of your Spark applications, so it’s essential to select the right strategy for your use case. Additionally, Spark’s query optimizer may also make decisions on the join strategy based on cost and data statistics.
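For example, Spark can be nudged toward a broadcast join with the broadcast() hint; a minimal sketch assuming DataFrames named large_df and small_df:
from pyspark.sql.functions import broadcast
# small_df is shipped to every executor, avoiding a shuffle of large_df
joined = large_df.join(broadcast(small_df), on="customer_id", how="inner")
# The auto-broadcast threshold (10 MB by default) can also be tuned:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)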
What is checkpointing in Spark?
Checkpointing saves an intermediate result of a computation to reliable storage so that the application can recover from that point if it fails, instead of recomputing the full lineage.
Checkpoint breaks the lineage, whereas cache/persist do not.
Checkpoint writes the data to storage, while cache/persist keep it in memory/disk.
Checkpointed data is still available after the application session is closed, while cached/persisted data is not.
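A minimal sketch, assuming an existing SparkSession and an HDFS-accessible checkpoint directory:
# Checkpointed data goes to reliable storage and the lineage is truncated
spark.sparkContext.setCheckpointDir("hdfs://your_hdfs_path/checkpoints")
df_checkpointed = df.checkpoint()   # eager by default; breaks the lineage
# cache/persist only keep the data in memory/disk for the current application
df_cached = df.cache()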
What is Serializer in Spark?
Serialization refers to converting objects into a stream of bytes (and back) in an optimal way, so they can be transferred between nodes over the network or stored in a file/memory buffer.
2 types:
Java Serialization (Default)
Kryo Serialization (Recommended by Spark)
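Kryo is enabled through the serializer config; a short sketch (the buffer value is a placeholder):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("KryoExample")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.kryoserializer.buffer.max", "512m")   # optional buffer tuning
         .getOrCreate())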
Spark writer example using partitionBy
df.write.option("header", True) \
    .partitionBy("State", "City") \
    .mode("overwrite") \
    .saveAsTable("partitioned_table")
Spark writer example using bucketBy
df.write.option("header", True) \
    .bucketBy(4, "State", "City") \
    .mode("overwrite") \
    .saveAsTable("bucketed_table")
What is spark submit command ?
spark-submit \
--class your.main.class \
--master <master-url> \
--deploy-mode <deploy-mode> \
--executor-memory <executor-memory> \
--total-executor-cores <total-cores> \
--conf spark.executor.extraJavaOptions=<extra-java-options> \
--conf spark.driver.memory=<driver-memory> \
--conf spark.driver.cores=<driver-cores> \
--conf spark.sql.shuffle.partitions=<shuffle-partitions> \
--conf spark.default.parallelism=<default-parallelism> \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.compress=true \
--conf spark.eventLog.dir=hdfs://<hdfs-path>/event-log \
your-application.jar
Most Frequently Asked
How do you optimize spark jobs?
- Broadcast Join
- Caching
- Shuffle Partition
- Filter before shuffling
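A short sketch tying these together, assuming DataFrames named orders_df and customers_df (dates, keys, and values are placeholders):
from pyspark.sql.functions import broadcast, col
# Filter before shuffling so less data moves across the network
orders = orders_df.filter(col("order_date") >= "2023-01-01")
# Broadcast the small dimension table instead of shuffling it
enriched = orders.join(broadcast(customers_df), "customer_id")
# Cache a DataFrame that is reused by several downstream actions
enriched.cache()
# Right-size the number of shuffle partitions for the data volume
spark.conf.set("spark.sql.shuffle.partitions", "200")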
What is sharding?
Sharding is a database partitioning technique used in data engineering and database management to horizontally partition a database into smaller, more manageable pieces called “shards.” Each shard is a self-contained database that stores a subset of the data. Sharding is typically used to address scalability and performance challenges associated with very large databases.
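A toy illustration of hash-based shard routing in plain Python (the shard count and key are arbitrary):
import hashlib
def shard_for_key(key: str, num_shards: int = 4) -> int:
    # Stable hash of the shard key, mapped onto one of num_shards databases
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % num_shards
print(shard_for_key("customer_123"))   # the same key always routes to the same shard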
Most challenging scenario in your Project?
The most challenging scenario in a Data Engineering project with respect to data size is often dealing with extremely large volumes of data, especially when it exceeds the capacity of a single machine or storage system. This requires implementing distributed data processing and storage solutions, such as data partitioning, sharding, and optimizing data transfer and transformation pipelines to handle Big Data efficiently.
What file formats have you worked with?
CSV, Parquet