ETL with Spark SQL and Python (Test Qs) Flashcards
You were asked to create a table that can store the data below. orderTime is a timestamp, but the finance team, when they query this data, normally prefers the orderTime in date format. You would like to create a calculated column that converts the orderTime column from timestamp to date and stores it. Fill in the blank to complete the DDL.
orderId | orderTime              | units
1       | 01-01-2020 09:10:24AM  | 100
2       | 01-01-2020 10:30:30AM  | 10
CREATE TABLE orders (
orderId int,
orderTime timestamp,
orderdate date _____________________________________________ ,
units int)
Choose one:
AS DEFAULT (CAST(orderTime as DATE))
GENERATED DEFAULT AS (CAST(orderTime as DATE))
GENERATED ALWAYS AS (CAST(orderTime as DATE))
AS (CAST(orderTime as DATE))
Delta lake does not support calculated columns, values should be inserted into the tables as a parameter.
GENERATED ALWAYS AS (CAST(orderTime as DATE))
Delta Lake supports generated columns, a special type of column whose values are automatically generated based on a user-specified function over other columns in the Delta table. When you write to a table with generated columns and you do not explicitly provide values for them, Delta Lake automatically computes the values.
Note: Databricks also supports partitioning using generated columns.
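A minimal sketch of the completed DDL, run from a Python notebook via spark.sql (assumes a Databricks/Delta Lake environment):

# Generated column: orderdate is always computed from orderTime on write
spark.sql("""
    CREATE TABLE orders (
        orderId int,
        orderTime timestamp,
        orderdate date GENERATED ALWAYS AS (CAST(orderTime AS DATE)),
        units int
    )
""")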
Which one of the following is not a Databricks lakehouse object?
Tables
Functions
Stored Procedures
Views
Catalog
Database/Schemas
Stored Procedures
What type of table is created when you create a Delta table with the below command?
CREATE TABLE transactions USING DELTA LOCATION "DBFS:/mnt/bronze/transactions"
Managed Delta table
Managed Table
Temp Table
External Table
Delta Lake Table
Anytime a table is created using the LOCATION keyword, it is considered an external table. Below is the syntax:
Syntax
CREATE TABLE table_name ( column column_data_type ... ) USING format LOCATION "dbfs:/"
format -> DELTA, JSON, CSV, PARQUET, TEXT
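A minimal sketch (the path matches the question; behavior assumes a Databricks environment) showing the external table creation from a Python notebook and one way to confirm the table type:

# Creating a table with LOCATION makes it external: dropping it will not delete the files
spark.sql("""
    CREATE TABLE transactions
    USING DELTA
    LOCATION 'dbfs:/mnt/bronze/transactions'
""")
# The Type field in the output shows EXTERNAL for tables created this way
spark.sql("DESCRIBE EXTENDED transactions").show(truncate=False)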
Which of the following commands can be used to drop a managed Delta table and the underlying files in the storage?
DROP TABLE table_name
DROP TABLE table_name CASCADE
DROP TABLE table_name INCLUDE_FILES
DROP TABLE table and run VACUUM command
Use the DROP TABLE table_name command and manually delete files using the command dbutils.fs.rm("/path", True)
The answer is DROP TABLE table_name.
When a managed table is dropped, the table definition is dropped from the metastore, and everything including data, metadata, and history is also dropped from storage.
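A minimal sketch (the table name is hypothetical) showing how you might confirm a table is managed before dropping it:

# For a managed table, DESCRIBE EXTENDED reports Type = MANAGED
spark.sql("DESCRIBE EXTENDED my_managed_table").show(truncate=False)  # hypothetical table name
# Dropping a managed table removes the metastore entry and the underlying data files
spark.sql("DROP TABLE my_managed_table")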
Which of the following is the correct statement for a session scoped temporary view?
Temporary views are stored in memory
Temporary views are lost once the notebook is detached and re-attached
Temporary views can still be accessed even if the notebook is detached and attached
Temporary views can still be accessed even if the cluster is restarted
Temporary views are created in the local_temp database
The answer is Temporary views are lost once the notebook is detached and re-attached
There are two types of temporary views that can be created: session scoped and global.
A local/session scoped temporary view is only available within a Spark session, so another notebook in the same cluster can not access it. If a notebook is detached and re-attached, the local temporary view is lost.
A global temporary view is available to all the notebooks in the cluster; if the cluster restarts, the global temporary view is lost.
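A minimal sketch (the source table and view name are illustrative) of creating and querying a session-scoped temporary view from Python:

# The view is registered only in the current SparkSession, so other notebooks cannot see it
df = spark.read.table("orders")            # hypothetical source table
df.createOrReplaceTempView("orders_tmp")
spark.sql("SELECT COUNT(*) FROM orders_tmp").show()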
Which of the following is correct for the global temporary view?
global temporary views cannot be accessed once the notebook is detached and attached
global temporary views can be accessed across many clusters
global temporary views can still be accessed even if the notebook is detached and attached
global temporary views can still be accessed even if the cluster is restarted
global temporary views are created in a database called temp database
The answer is global temporary views can still be accessed even if the notebook is detached and re-attached
There are two types of temporary views that can be created: local and global.
· A local temporary view is only available within a Spark session, so another notebook in the same cluster can not access it. If a notebook is detached and re-attached, the local temporary view is lost.
· A global temporary view is available to all the notebooks in the cluster; even if the notebook is detached and re-attached it can still be accessed, but if the cluster is restarted the global temporary view is lost.
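A minimal sketch (names are illustrative) of a global temporary view; note it is registered under the global_temp database and must be queried with that prefix:

# Global temporary views are visible to all notebooks on the cluster until the cluster restarts
df = spark.read.table("orders")                    # hypothetical source table
df.createOrReplaceGlobalTempView("orders_gtmp")
spark.sql("SELECT COUNT(*) FROM global_temp.orders_gtmp").show()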
You are currently working on reloading the customer_sales table using the below query
INSERT OVERWRITE customer_sales
SELECT * FROM customers c
INNER JOIN sales_monthly s on s.customer_id = c.customer_id
After you ran the above command, the Marketing team quickly wanted to review the old data that was in the table. How does INSERT OVERWRITE impact the data in the customer_sales table if you want to see the previous version of the data prior to running the above statement?
Overwrites the data in the table and all historical versions of the data; you can not time travel to previous versions.
Overwrites the current version of the data but preserves all historical versions of the data, you can time travel to previous versions.
Overwrites the current version of the data but clears all historical versions of the data, so you can not time travel to previous versions
Appends the data to the current version, you can time travel to previous versions
By default, overwrites the data and schema, you cannot perform time travel.
The answer is, INSERT OVERWRITE Overwrites the current version of the data but preserves all historical versions of the data, you can time travel to previous versions.
INSERT OVERWRITE customer_sales
SELECT * FROM customers c
INNER JOIN sales s on s.customer_id = c.customer_id
Let's just assume that this is the second time you are running the above statement. You can still query the prior version of the data using time travel: any DML/DDL operation except DROP TABLE creates new Parquet files, so you can still access the previous versions of the data.
SQL syntax for time travel
SELECT * FROM table_name VERSION AS OF [version number]
with the customer_sales example
SELECT * FROM customer_sales VERSION AS OF 1 -- previous version
SELECT * FROM customer_sales VERSION AS OF 2 -- current version
You can see all historical changes on the table using DESCRIBE HISTORY table_name.
Note: the main difference between INSERT OVERWRITE and CREATE OR REPLACE TABLE ... AS SELECT (CRAS) is that CRAS can modify the schema of the table, i.e. it can add new columns or change the data types of existing columns. By default, INSERT OVERWRITE only overwrites the data.
INSERT OVERWRITE can also be used to update the schema when spark.databricks.delta.schema.autoMerge.enabled is set to true; if this option is not enabled and there is a schema mismatch, the INSERT OVERWRITE command will fail.
Any DML/DDL operation(except DROP TABLE) on the Delta table preserves the historical version of the data.
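A minimal sketch (the version number depends on the table's actual history) of inspecting history and time traveling from a Python notebook:

# List the table history to find the version written before the INSERT OVERWRITE
spark.sql("DESCRIBE HISTORY customer_sales").show(truncate=False)
# Query an earlier version of the table (the version number shown is illustrative)
spark.sql("SELECT * FROM customer_sales VERSION AS OF 1").show()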
Which of the following SQL statement can be used to query a table by eliminating duplicate rows from the query results?
SELECT DISTINCT * FROM table_name
SELECT DISTINCT * FROM table_name HAVING COUNT (*) > 1
SELECT DISTINCT_ROWS * FROM table_name
SELECT * FROM table_name GROUP BY * HAVING COUNT (*) > 1
SELECT * FROM table_name GROUP BY * HAVING COUNT (*) < 1
The answer is SELECT DISTINCT * FROM table_name
Which of the below SQL statements can be used to create a SQL UDF to convert Celsius to Fahrenheit and vice versa? You need to pass two parameters to this function: the actual temperature, and a one-letter flag, F or C, that identifies whether it needs to be converted to Fahrenheit or Celsius.
select udf_convert(60,'C') will result in 15.5
select udf_convert(10,'F') will result in 50
Select one:
CREATE UDF FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURNS DOUBLE
RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END
CREATE UDF FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END
CREATE FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END
CREATE FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURNS DOUBLE
RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END
CREATE USER FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURNS DOUBLE
RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END
CREATE FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURNS DOUBLE
RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END
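A minimal sketch (assumes a Databricks/Spark SQL environment that supports SQL UDFs) creating the function from a Python notebook and checking it against the expected results in the question:

# RETURNS declares the output type; RETURN holds the conversion expression
spark.sql("""
    CREATE FUNCTION udf_convert(temp DOUBLE, measure STRING)
    RETURNS DOUBLE
    RETURN CASE WHEN measure == 'F' THEN (temp * 9/5) + 32
                ELSE (temp - 32) * 5/9
           END
""")
# Expected roughly 15.5 (60 F to C) and 50.0 (10 C to F)
spark.sql("SELECT udf_convert(60, 'C'), udf_convert(10, 'F')").show()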
You are trying to calculate the total sales made by all the employees by parsing a complex struct data type that stores employee and sales data. How would you approach this in SQL?
Table definition:
batchId INT, performance ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>, insertDate TIMESTAMP
Sample data of performance column
[
  { "employeeId": 1234,
    "sales": 10000 },
  { "employeeId": 3232,
    "sales": 30000 }
]
Calculate total sales made by all the employees?
Sample data with create table syntax for the data:
create or replace table sales as
select 1 as batchId,
from_json('[{ "employeeId":1234, "sales": 10000 },{ "employeeId":3232, "sales": 30000 }]',
'ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>') as performance,
current_timestamp() as insertDate
union all
select 2 as batchId,
from_json('[{ "employeeId":1235, "sales": 10500 },{ "employeeId":3233, "sales": 32000 }]',
'ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>') as performance,
current_timestamp() as insertDate
select one:
WITH CTE as (SELECT EXPLODE (performance) FROM table_name)
SELECT SUM (performance.sales) FROM CTE
select aggregate(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y)
as total_sales from sales
WITH CTE as (SELECT FLATTEN (performance) FROM table_name)
SELECT SUM (performance.sales) FROM CTE
SELECT SUM(SLICE(performance, sales)) FROM employee
select reduce(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y)
as total_sales from sales
The answer is
select aggregate(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y)
as total_sales from sales
A nested struct can be queried using dot notation: performance.sales gives you access to all the sales values in the performance column.
Note: performance:sales (with a colon) would be wrong here; the colon syntax is only used when referring to JSON strings, while here we are dealing with a struct data type. For the exam, please make sure to understand whether you are dealing with JSON data or struct data.
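A minimal sketch running the answer from a Python notebook, assuming the sample sales table from the create or replace table statement above has been created:

# performance.sales extracts an array of sales values per row; collect_list + flatten
# builds one array across all rows, and aggregate folds it into a single total
# (82,500 for the sample data above)
spark.sql("""
    SELECT aggregate(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y) AS total_sales
    FROM sales
""").show()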
Which of the following statements can be used, in Python, to test that the number of rows in the table is equal to 10?
row_count = spark.sql("select count(*) from table").collect()[0][0]
assert (row_count == 10, "Row count did not match")
assert row_count == 10, "Row count did not match"
assert if ( row_count == 10, "Row count did not match")
assert if row_count == 10, "Row count did not match"
assert row_count = 10, "Row count did not match"
The answer is assert row_count == 10, "Row count did not match"
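A minimal sketch of the full check (the table name is hypothetical). Note that the parenthesized variant from the options builds a non-empty tuple, which is always truthy, so that assert would never fail:

# Count rows and assert against the expected value
row_count = spark.sql("SELECT COUNT(*) FROM my_table").collect()[0][0]  # hypothetical table name
assert row_count == 10, "Row count did not match"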
How do you handle failures gracefully when writing code in PySpark? Fill in the blanks to complete the below statement.
_____
spark.read.table("table_name").select("column").write.mode("append").saveAsTable("new_table_name")
_____
print(f"query failed")
try: failure:
try: catch:
try: except:
try: fail:
try: error:
The answer is try: and except:
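A minimal sketch of the completed statement (table and column names are the placeholders from the question); in practice you would usually catch a specific exception, e.g. except Exception as e, and log the error:

try:
    # Any failure in the read/transform/write lands in the except block
    spark.read.table("table_name").select("column").write.mode("append").saveAsTable("new_table_name")
except:
    print(f"query failed")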
You are working on a process to query a table based on batch date, and the batch date is an input parameter that is expected to change every time the program runs. What is the best way to parameterize the query so that it runs without manually changing the batch date?
Create a notebook parameter for the batch date, assign the value to a Python variable, and use a Spark DataFrame to filter the data based on the Python variable
Create a dynamic view that can calculate the batch date automatically and use the view to query the data
There is no way to combine a Python variable and Spark code
Manually edit the code every time to change the batch date
Store the batch date in the Spark config and use a Spark DataFrame to filter the data based on the Spark config
The answer is: create a notebook parameter for the batch date, assign the value to a Python variable, and use a Spark DataFrame to filter the data based on the Python variable.
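A minimal sketch (widget, table, and column names are all hypothetical) of parameterizing the batch date with a Databricks notebook widget:

# Create a notebook parameter (widget) and read its value into a Python variable
dbutils.widgets.text("batch_date", "2020-01-01")   # hypothetical default value
batch_date = dbutils.widgets.get("batch_date")

# Use the Python variable to filter the DataFrame instead of hard-coding the date
df = spark.read.table("transactions")              # hypothetical table name
df.filter(df.batch_date == batch_date).show()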