Databricks Flashcards

1
Q

What are the 2 main components of Databricks?

A

The control plane: stores notebook commands and workspace configurations.
The data plane: hosts compute resources (clusters).

2
Q

What are the 3 different Databricks services?

A

Data Science and Engineering workspace
Databricks SQL
Databricks Machine Learning

3
Q

What is a cluster?

A

A set of compute resources on which you run data engineering and data science workloads, either as a set of commands in a notebook or as an automated job.

4
Q

What are the 2 cluster types?

A

All-purpose clusters: run interactive notebooks.
Job clusters: run automated jobs.

5
Q

How long does Databricks retain cluster configuration information?

A

30 days.
Pin a cluster to keep an all-purpose cluster's configuration beyond 30 days.

6
Q

What are the three cluster modes?

A

Standard clusters: process large amounts of data with Apache Spark.
Single Node clusters: jobs that use small amounts of data or non-distributed workloads, such as single-node machine learning libraries.
High Concurrency clusters: for groups of users who need to share resources or run ad-hoc jobs. Administrators usually create High Concurrency clusters, and Databricks recommends enabling autoscaling for them.

7
Q

To ensure that all data at rest is
encrypted for all storage types, including shuffle data that is stored temporarily on your cluster’s local
disks:

A

You can enable local disk encryption.

8
Q

To reduce cluster start time

A

You can attach the cluster to a predefined pool of idle instances for the driver and worker nodes.

9
Q

Which magic command do you use to run a notebook from
another notebook?

A

%run ../Includes/Classroom-Setup-1.2

10
Q

What is Databricks utilities and how can you use it to list out
directories of files from Python cells?

A

Databricks Utilities (dbutils) are helper utilities available in notebooks (for example for the file system, secrets, and widgets). To list a directory from a Python cell:

display(dbutils.fs.ls("/databricks-datasets"))

11
Q

What function should you use when you have tabular data
returned by a Python cell?

A

display()
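
A minimal sketch (assumes the students table created in a later card already exists):

display(spark.table("students"))  # renders the DataFrame as an interactive table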

12
Q

What is Databricks Repos?

A

provides repository-level
integration with Git providers, allowing you to work in an environment that is backed by revision control
using Git

13
Q

What is the definition of a Delta Lake?

A

Delta Lake is the technology at the heart of the Databricks Lakehouse platform. It is an open-source technology that enables building a data lakehouse on top of existing storage systems.

14
Q

How does Delta Lake address the data lake pain points to ensure reliable, ready-to-go data?

A

ACID Transactions – Delta Lake adds ACID transactions to data lakes. ACID stands for atomicity,
consistency, isolation, and durability

15
Q

Describe how Delta Lake brings ACID transactions to object
storage

A

By adding ACID transactions on top of object storage, Delta Lake addresses these pain points:
Difficult to append data
Difficult to modify existing data
Jobs failing midway
Real-time operations are not easy
Costly to keep historical data versions

16
Q

Is Delta Lake the default for all tables created in Databricks?

A

Yes, Delta Lake is the default format for all tables created in Databricks.

17
Q

What data objects are in the Databricks Lakehouse?

A

Catalog, database, table, view, function

18
Q

What is a metastore?

A

A metastore contains all of the metadata that defines data objects in the lakehouse. Options include:
Unity Catalog
Hive metastore
External metastore

19
Q

What is a catalog?

A

The highest abstraction (or coarsest grain) in the Databricks Lakehouse relational model; it is the first part of the three-level namespace:
catalog_name.database_name.table_name

20
Q

What is a Delta Lake table?

A

A Delta Lake table stores data as a directory of files on cloud object storage and registers table metadata to the metastore within a catalog and schema.

21
Q

What is the syntax to create a Delta Table?

A

CREATE TABLE students
(id INT, name STRING, value DOUBLE);
CREATE TABLE IF NOT EXISTS students
(id INT, name STRING, value DOUBLE)

22
Q

What is the syntax to insert data?

A

INSERT INTO students
VALUES
(4, "Ted", 4.7),
(5, "Tiffany", 5.5),
(6, "Vini", 6.3)

23
Q

What is the syntax to update particular records of a table?

A

UPDATE students
SET value = value + 1
WHERE name LIKE "T%"

24
Q

What is the syntax to delete particular records of a table?

A

DELETE FROM students
WHERE value > 6

25
Q

What is the syntax for merge and what are the benefits of
using merge?

A

MERGE INTO table_a a
USING table_b b
ON a.col_name = b.col_name
WHEN MATCHED AND b.col = X
THEN UPDATE SET *
WHEN MATCHED AND a.col = Y
THEN DELETE
WHEN NOT MATCHED AND b.col = Z
THEN INSERT *

Benefit: updates, inserts, and deletes against the target are completed as a single atomic transaction, and extra conditions can be added to the matching logic.

26
Q

Deduplication case (Python syntax)

A

(deltaTable.alias("t")
    .merge(historicalUpdates.alias("s"), "t.loan_id = s.loan_id")
    .whenNotMatchedInsertAll()
    .execute())

27
Q

What is the syntax to delete a table?

A

DROP TABLE students

28
Q

What is Hive?

A

Hive is a data warehouse system built on top of Apache Hadoop, an open-source framework used to efficiently store and process large datasets.

29
Q

What are the two commands to see metadata about a table?

A

DESCRIBE EXTENDED students
DESCRIBE DETAIL students

30
Q

What is the syntax to display the Delta Lake files?

A

%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students"))

DESCRIBE DETAIL students

31
Q

Describe the Delta Lake files, their format and directory
structure

A

A Delta Lake table is stored as a directory of Parquet data files plus a _delta_log subdirectory containing JSON transaction logs:

%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students/_delta_log"))

32
Q

What does the query engine do using the transaction logs
when we query a Delta Lake table?

A

The query engine uses the transaction logs to resolve all the data files that are valid in the current version, and ignores all other data files.

You can look at a particular transaction log and see whether records were inserted / updated / deleted:
%python
display(spark.sql(f"SELECT * FROM json.`{DA.paths.user_db}/students/_delta_log/00000000000000000007.json`"))

33
Q

What commands do you use to compact small files and index
tables?

A

OPTIMIZE command allows you to combine files toward an optimal size:

OPTIMIZE events
OPTIMIZE events WHERE date >= '2017-01-01'
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

34
Q

How do you review a history of table transactions?

A

Because all changes to the Delta Lake table are stored in the transaction log:
DESCRIBE HISTORY students

35
Q

How do you query and roll back to a previous table version?

A

SELECT *
FROM students VERSION AS OF 3

36
Q

How to roll back after deletion?

A

RESTORE TABLE students TO VERSION AS OF 8

37
Q

What command do you use to clean up stale data files and
what are the consequences of using this command?

A

SET spark.databricks.delta.retentionDurationCheck.enabled = false;
SET spark.databricks.delta.vacuum.logging.enabled = true;
VACUUM students RETAIN 0 HOURS DRY RUN

Consequence: VACUUM permanently removes data files that are no longer referenced by the current table version, so you lose the ability to time travel to versions older than the retention period.

38
Q

What is the advantage of using Delta cache?

A

It accelerates repeated reads by caching copies of remote data files on the cluster nodes' local storage, so subsequent queries avoid re-reading the same data from cloud object storage.
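
A minimal sketch of enabling the cache from a Python cell, assuming the spark.databricks.io.cache.enabled configuration key applies to your runtime:

# Enable the Delta/disk cache for this cluster session (configuration key assumed)
spark.conf.set("spark.databricks.io.cache.enabled", "true")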

39
Q

What is the syntax to create a database with default location
(no location specified)?

A

CREATE DATABASE IF NOT EXISTS db_name_default_location;

40
Q

What is the syntax to create a database with specified
location?

A

CREATE DATABASE IF NOT EXISTS db_name_custom_location LOCATION 'path/db_name_custom_location.db';

41
Q

How do you get metadata information of a database? Where
are the databases located (difference between default vs custom
location)

A

DESCRIBE DATABASE EXTENDED db_name;

With the default location, the database directory is created under the default Hive warehouse directory (dbfs:/user/hive/warehouse/<db_name>.db); with a custom location, it is created at the path given in the LOCATION clause.

42
Q

What’s the best practice when creating databases?

A

Declare a LOCATION for a given database.

43
Q

What is the syntax for creating a table in a database with default location and inserting data?

A

USE db_name_default_location;

CREATE OR REPLACE TABLE managed_table_in_db_with_default_location (width INT, length
INT, height INT);
INSERT INTO managed_table_in_db_with_default_location
VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_default_location;

44
Q

What is the syntax for a table in a database with custom location?

A

USE db_name_custom_location;

CREATE OR REPLACE TABLE managed_table_in_db_with_custom_location (width INT, length
INT, height INT);
INSERT INTO managed_table_in_db_with_custom_location
VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_custom_location;

45
Q

Where are managed tables located in a database?

A

Managed tables are created in the LOCATION of the database they are registered to.

46
Q

How can you find a table's location?

A

DESCRIBE EXTENDED managed_table_in_db;

47
Q

What is the syntax to create an external table?

A

CREATE OR REPLACE TABLE external_table LOCATION 'path/external_table' AS
SELECT * FROM temp_delays;
SELECT * FROM external_table;

Python:
df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")

48
Q

What happens when you drop a managed table?

A

The table's directory and its log and data files are deleted; only the database directory remains.
Managed tables:
DROP TABLE managed_table_in_db_with_default_location;
DROP TABLE managed_table_in_db_with_custom_location;

49
Q

What happens when you drop an unmanaged (external) table?

A

The table definition no longer exists in the metastore, but the underlying data remains intact.
DROP TABLE external_table;

We still have access to the underlying data files:
%python
tbl_path = f"{DA.paths.working_dir}/external_table"
files = dbutils.fs.ls(tbl_path)
display(files)

50
Q

What is the command to drop the database and its underlying tables and views?

A

Default location:
DROP DATABASE db_name_default_location CASCADE;
Custom location:
DROP DATABASE db_name_custom_location CASCADE;

51
Q

How can you show a list of tables and views?

A

SHOW TABLES;

52
Q

What is the difference between Views, Temp Views & Global Temp Views?

A

View: persisted across multiple sessions, just like a table. It stores only query logic and does not process or write any data until queried.
Temp view: not persisted across multiple sessions; scoped to the current session (query level).
Global temp view: scoped to the cluster level and registered to a separate global_temp database. Prefer views with appropriate table ACLs over global temporary views.

53
Q

What is the syntax for view?

A

CREATE VIEW view_delays_abq_lax AS
SELECT *
FROM external_table
WHERE origin = 'ABQ' AND destination = 'LAX';
SELECT * FROM view_delays_abq_lax;

54
Q

What is the syntax for temp view?

A

CREATE TEMPORARY VIEW temp_view_delays_gt_120
AS SELECT * FROM external_table WHERE delay > 120 ORDER BY delay ASC;
SELECT * FROM temp_view_delays_gt_120;

55
Q

What is syntax for global temp view?

A

CREATE GLOBAL TEMPORARY VIEW global_temp_view_dist_gt_1000
AS SELECT * FROM external_table WHERE distance > 1000;
SELECT * FROM global_temp.global_temp_view_dist_gt_1000;

Show tables for global temp views
SHOW TABLES IN global_temp;

56
Q

What is the syntax to select from global temp views?

A

SELECT * FROM global_temp.name_of_the_global_temp_view;

57
Q

What is the syntax for defining a CTE in a subquery?

A

SELECT
  max(total_delay) AS `Longest Delay (in minutes)`
FROM
  (WITH delayed_flights(total_delay) AS (
    SELECT delay
    FROM external_table)
  SELECT *
  FROM delayed_flights
);

58
Q

What is the syntax for defining a CTE in a subquery expression?

A

SELECT
  (WITH distinct_origins AS (
    SELECT DISTINCT origin FROM external_table
  )
  SELECT count(origin) AS `Number of Distinct Origins`
  FROM distinct_origins)
AS `Number of Different Origin Airports`;

59
Q

What is the syntax for defining a CTE in a CREATE VIEW statement?

A

CREATE OR REPLACE VIEW BOS_LAX
AS WITH origin_destination(origin_airport, destination_airport)
AS (SELECT origin, destination FROM external_table)
SELECT * FROM origin_destination
WHERE origin_airport = 'BOS' AND destination_airport = 'LAX';

SELECT count(origin_airport) AS `Number of Delayed Flights from BOS to LAX`
FROM BOS_LAX;

60
Q

How do you query data from a single file?

A

SELECT * FROM file_format.`/path/to/file`
SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka/001.json`

61
Q

How do you query a directory of files?

A

SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka`

62
Q

How do you create references to files?

A

CREATE OR REPLACE TEMP VIEW events_temp_view
AS SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka/`;
SELECT * FROM events_temp_view

63
Q

How do you extract text files as raw strings?

A

SELECT * FROM text.`${da.paths.datasets}/raw/events-kafka/`

64
Q

How do you extract the raw bytes and metadata of a file?

A

SELECT * FROM binaryFile.`${da.paths.datasets}/raw/events-kafka/`

65
Q

Explain why executing a direct query against CSV files rarely returns the desired result.

A

The header row can be extracted as a table row, all columns can be loaded as a single column, or nested data in a column can end up truncated.
SELECT * FROM csv.`${da.paths.working_dir}/sales-csv`

66
Q

Describe the syntax required to extract data from most formats against external sources.

A

CREATE TABLE table_identifier (col_name1 col_type1, ...)
USING data_source
OPTIONS (key1 = "val1", key2 = "val2", ...)
LOCATION = path

67
Q

Using Spark SQL DDL to create a table against an external CSV source

A

CREATE TABLE sales_csv
(order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER,
 purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
OPTIONS (
  header = "true",
  delimiter = "|"
)
LOCATION "${da.paths.working_dir}/sales-csv"

68
Q

What happens to the data, metadata and options during table declaration for these external sources?

A

All the metadata and options passed during table declaration will be persisted to the metastore.

69
Q

What is the syntax to show all of the metadata associated with the table definition?

A

DESCRIBE EXTENDED sales_csv

70
Q

How can you manually refresh the cache of your data?

A

REFRESH TABLE sales_csv

71
Q

What is the syntax to extract data from SQL Databases?

A

CREATE TABLE
USING JDBC
OPTIONS (
  url = "jdbc:{databaseServerType}://{jdbcHostname}:{jdbcPort}",
  dbtable = "{jdbcDatabase}.table",
  user = "{jdbcUsername}",
  password = "{jdbcPassword}"
)

DROP TABLE IF EXISTS users_jdbc;
CREATE TABLE users_jdbc
USING JDBC
OPTIONS (
  url = "jdbc:sqlite:/${da.username}_ecommerce.db",
  dbtable = "users"
)
SELECT * FROM users_jdbc

72
Q

Explain the two basic approaches that Spark uses to interact with external SQL databases and their limits.

A

You can move the entire source table(s) to Databricks and then execute the logic on the currently active cluster, but this incurs significant overhead because of network transfer latency.
You can push down the query to the external SQL database and only transfer the results back to Databricks, but this also incurs significant overhead because the query logic is executed in source systems that are not optimized for big data queries.
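
A rough Python sketch of the push-down approach: Spark's JDBC reader accepts a subquery in place of a table name, so only the aggregated result crosses the network. The connection URL, table, and query below are placeholders.

pushed_down_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlite:/path/to/ecommerce.db")  # placeholder connection string
    .option("dbtable", "(SELECT state, count(*) AS cnt FROM users GROUP BY state) AS users_by_state")
    .load())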

73
Q

What is a CTAS statement and what is the syntax?

A

CTAS (CREATE TABLE AS SELECT) statements create and populate a table using data retrieved from an input query:

CREATE OR REPLACE TABLE sales AS
SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-historical/`;
DESCRIBE EXTENDED sales;

74
Q

Do CTAS support manual schema declaration?

A

No. CTAS statements automatically infer schema information from the query results and do not support manual schema declaration.

75
Q

What is the syntax to overcome the limitation when trying to ingest data from CSV files?

A

CREATE OR REPLACE TEMP VIEW sales_tmp_vw
(order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER,
 purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
OPTIONS (
  path = "${da.paths.datasets}/raw/sales-csv",
  header = "true",
  delimiter = "|"
);
CREATE TABLE sales_delta AS
SELECT * FROM sales_tmp_vw;
SELECT * FROM sales_delta

76
Q

How do you filter and rename columns from existing tables during table creation?

A

CREATE OR REPLACE TABLE purchases AS
SELECT order_id AS id, transaction_timestamp, purchase_revenue_in_usd AS price
FROM sales;
SELECT * FROM purchases

77
Q

What is a generated column and how do you declare schemas with generated columns?

A

A generated column is a special column whose values are automatically computed based on a user-specified function over other columns in the Delta table.

CREATE OR REPLACE TABLE purchase_dates (
  id STRING,
  transaction_timestamp STRING,
  price STRING,
  date DATE GENERATED ALWAYS AS (
    cast(cast(transaction_timestamp/1e6 AS TIMESTAMP) AS DATE))
    COMMENT "generated based on transactions_timestamp column")

78
Q

Which built-in Spark SQL commands are useful for file ingestion (for the select clause)?

A

current_timestamp() records the timestamp when the logic is executed;
input_file_name() records the source data file for each record in the table

79
Q

What are the three options when creating tables?

A

A COMMENT is added to allow for easier discovery of table contents.
A LOCATION is specified, which will result in an external (rather than managed) table.
The table is PARTITIONED BY a date column.

CREATE OR REPLACE TABLE users_pii
COMMENT "Contains PII"
LOCATION "${da.paths.working_dir}/tmp/users_pii"
PARTITIONED BY (first_touch_date)
AS
SELECT *,
  cast(cast(user_first_touch_timestamp/1e6 AS TIMESTAMP) AS DATE) first_touch_date,
  current_timestamp() updated,
  input_file_name() source_file
FROM parquet.`${da.paths.datasets}/raw/users-historical/`;
SELECT * FROM users_pii;
SELECT * FROM users_pii;

80
Q

As a best practice, should you default to partitioned tables for most use cases when working with Delta Lake?

A

You should default to non-partitioned tables for most use cases when working with Delta Lake.

81
Q

What are the two options to copy Delta Lake tables and what are the use cases?

A

DEEP CLONE fully copies data and metadata from a source table to a target; useful for a complete, independent copy (backup), though copying can take a while for large datasets.
CREATE OR REPLACE TABLE purchases_clone
DEEP CLONE purchases

SHALLOW CLONE only copies the Delta transaction logs, so no data is moved; useful for quickly creating a copy of a table to test changes without risking the current table.
CREATE OR REPLACE TABLE purchases_shallow_clone
SHALLOW CLONE purchases

82
Q

What are the multiple benefits of overwriting tables instead of deleting and recreating tables?

A

It is much faster, because it does not require listing or deleting the old data files individually.
The old version of the table still exists and can easily be retrieved using Time Travel.
It is an atomic operation: concurrent queries can still read the table while it is being overwritten.
If overwriting the table fails, the table remains in its previous state.

83
Q

What are the two easy methods to accomplish complete overwrites?

A

CREATE OR REPLACE TABLE: completely redefines the contents of the target table.
CREATE OR REPLACE TABLE events AS
SELECT * FROM parquet.`${da.paths.datasets}/raw/events-historical`

INSERT OVERWRITE: can only overwrite an existing table and will fail if we try to change the schema.
INSERT OVERWRITE sales
SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-historical/`

84
Q

What is the syntax to atomically append new rows to an existing Delta table? Is the command idempotent?

A

INSERT INTO sales
SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-30m`

No, the command is not idempotent: re-executing it will append the same records again and create duplicates.

85
Q

What is the syntax for the the MERGE SQL operation and the benefits of using merge?

A

MERGE INTO target a
USING source b
ON {merge_condition}
WHEN MATCHED THEN {matched_action}
WHEN NOT MATCHED THEN {not_matched_action}

Benefits:
1. Updates, inserts, and deletes are completed as a single transaction.
2. Multiple conditions can be added in addition to matching fields.
3. It provides extensive options for implementing custom logic.

86
Q

How can you use merge for deduplication?

A

MERGE INTO events a
USING events_update b
ON a.user_id = b.user_id AND a.event_timestamp = b.event_timestamp
WHEN NOT MATCHED AND b.traffic_source = 'email' THEN
INSERT *

87
Q

What is the syntax to have an idempotent option to incrementally ingest data from external systems?

A

COPY INTO sales
FROM "${da.paths.datasets}/raw/sales-30m"
FILEFORMAT = PARQUET

88
Q

How is COPY INTO different than Auto Loader?

A

COPY INTO is focused on batch execution by a SQL analyst (idempotent, incremental batch ingestion).
Auto Loader requires Structured Streaming and is better suited for continuously ingesting files as they arrive.

89
Q

What is the syntax to count null values?

A

SELECT * FROM table_name WHERE col_name IS NULL
SELECT count_if(col_name IS NULL) AS new_col_name FROM table_name

90
Q

What is the syntax to count for distinct values in a table for a specific column?

A

SELECT COUNT(DISTINCT(col_1, col_2)) FROM table_name WHERE col_1 IS NOT NULL

91
Q

What is the syntax to deal with binary-encoded JSON values in a human readable format?

A

CREATE OR REPLACE TEMP VIEW events_strings AS
SELECT string(key), string(value)
FROM events_raw;
SELECT * FROM events_strings

92
Q

What is the syntax to parse JSON
objects into struct types with Spark SQL?

A

CREATE OR REPLACE TEMP VIEW parsed_events AS
SELECT from_json(value, schema_of_json('{insert_example_schema_here}')) AS json
FROM events_strings;
SELECT * FROM parsed_events

93
Q

Once a JSON string is unpacked to a struct type, what is the syntax to flatten the fields into columns?

A

CREATE OR REPLACE TEMP VIEW new_events_final AS
SELECT json.*
FROM parsed_events;
SELECT * FROM new_events_final

94
Q

What is the syntax for exploding arrays of structs?

A

SELECT user_id, event_timestamp, event_name, explode(items) AS item
FROM events

The explode function lets us put each element of an array on its own row.

95
Q

What is the syntax to collect arrays?

A

SELECT user_id,
collect_set(event_name) AS event_history,
array_distinct(flatten(collect_set(items.item_id))) AS cart_history
FROM events
GROUP BY user_id

The collect_set function can collect unique values for a field, including fields within arrays.
The flatten function allows multiple arrays to be combined into a single array.
The array_distinct function removes duplicate elements from an array.

96
Q

What is the syntax for a semi-join?

A

SELECT columns
FROM table_1
WHERE EXISTS (
SELECT values
FROM table_2
WHERE table_2.column = table_1.column);

97
Q

What is the syntax for FILTER ?

A

FILTER (items, i -> i.item_id LIKE "%K") AS king_items

FILTER: the name of the higher-order function.
items: the name of the input array.
i: the name of the iterator variable. You choose this name and then use it in the lambda function; it iterates over the array, cycling each value into the function one at a time.
->: indicates the start of the lambda function.
i.item_id LIKE "%K": the function body. Each value is checked to see whether it ends with the capital letter K; if it does, it is filtered into the new column, king_items.

98
Q

What is the syntax for EXISTS?

A

EXISTS (categories, c -> c = "Company Blog") companyFlag

Say we want to flag all blog posts with "Company Blog" in the categories field. We can use the EXISTS function to mark which entries include that category.

99
Q

What is the syntax for TRANSFORM ?

A

TRANSFORM(king_items, k -> CAST(k.item_revenue_in_usd * 100 AS INT)) AS item_revenues

We extract each item's revenue value, multiply it by 100, and cast the result to an integer.

100
Q

What is the syntax for REDUCE ?

A

REDUCE(co2_level, 0, (c, acc) -> c + acc, acc -> (acc div size(co2_level)))

REDUCE takes the input array, a starting buffer value (0), a merge function applied to each element, and a finish function applied to the final buffer (here, computing the average CO2 level).

101
Q

What is the syntax to define and register SQL UDFs? How do
you then apply that function to the data?

A

CREATE OR REPLACE FUNCTION function_name(param TYPE)
RETURNS type_to_be_returned RETURN function_itself
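
A hedged example of defining a SQL UDF and then applying it to data, run from a Python cell via spark.sql; the function name and the students table are illustrative only:

# Define a simple SQL UDF that upper-cases its input and appends "!!!"
spark.sql("""
CREATE OR REPLACE FUNCTION yelling(text STRING)
RETURNS STRING
RETURN concat(upper(text), "!!!")
""")

# Apply the UDF like any built-in SQL function
display(spark.sql("SELECT yelling(name) FROM students"))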

102
Q

What are SQL UDFs governed by?

A

by the same Access Control Lists (ACLs)
as databases, tables, or views

103
Q

What permissions must a user have on the function to use a SQL UDF? Describe their scoping.

A

The user must have USAGE and SELECT permissions on the function to use it.
SQL UDFs persist between execution environments (notebooks, DBSQL queries, jobs) and are scoped to the database in which they were registered.

104
Q

What is the benefit of using SQL UDFs?

A

allow a handful of users to define the complex logic needed for
common reporting and analytic queries

105
Q

What is the syntax to turn SQL queries into Python strings?

A

print("""
SELECT *
FROM table_name
""")

106
Q

What is the syntax to execute SQL from a Python cell?

A

spark.sql("SELECT * FROM table_name")

107
Q

What function in python do you call to render a query the way it would appear in a normal SQL notebook?

A

display(spark.sql("SELECT * FROM table_name"))

108
Q

What is the syntax to define a function in Python?

A

def return_new_string(string_arg):
    return "The string passed to this function was " + string_arg

109
Q

What is the syntax for f-strings?

A

f"I can substitute {my_string} here"

110
Q

How can f-strings be used for SQL queries?

A

table_name = "users"
filter_clause = "WHERE state = 'CA'"
query = f"""
SELECT *
FROM {table_name}
{filter_clause}
"""
print(query)

111
Q

What is the syntax for if / else clauses wrapped in a function?

A

def foods_i_like(food):
    if food == "beans":
        print(f"I love {food}")
    elif food == "potatoes":
        print(f"My favorite vegetable is {food}")
    elif food != "beef":
        print(f"Do you have any good recipes for {food}?")
    else:
        print(f"I don't eat {food}")

112
Q

What are assert statements and what is the syntax?

A

Example asserting that the number 2 is an integer:
assert type(2) == int

113
Q

Why do we use try / except statements and what is the syntax?

A

try / except provides robust error handling: when a non-numeric string is passed, an informative message is printed instead of an error being raised.
def try_int(num_string):
    try:
        int(num_string)
        result = f"{num_string} is a number."
    except:
        result = f"{num_string} is not a number!"
    print(result)

114
Q

What is the downside of using try / except statements?

A

An error will not be raised when one occurs; implementing logic that suppresses errors can lead to code silently failing.

115
Q

What is the syntax for try / except statements where you return an informative error message?

A

def three_times(number):
    try:
        return int(number) * 3
    except ValueError as e:
        print(f"You passed the string variable '{number}'.\n")
        print(f"Try passing an integer instead.")
        return None

116
Q

How do you apply these concepts to execute SQL logic on Databricks, for example to avoid a SQL injection attack?

A

Wrapping the query in a function with a simple if clause allows us to execute arbitrary SQL queries, optionally display the results, and always return the resultant DataFrame.
def simple_query_function(query, preview=True):
    query_result = spark.sql(query)
    if preview:
        display(query_result)
    return query_result

result = simple_query_function(query)

117
Q

What is the purpose of Auto Loader?

A

It provides a way for data teams to load raw data from cloud object stores at lower cost and latency.

It allows you to continuously ingest data into Delta Lake.

Use it as a general best practice when ingesting data from cloud object storage.

118
Q

What are the 4 arguments using Auto Loader with automatic schema inference and evolution?

A

data_source: Auto Loader will detect new files as they arrive in this location and queue them for ingestion; passed to the .load() method.

source_format: While the format for all Auto Loader queries is cloudFiles, the format of the source data should always be specified for the cloudFiles.format option.

table_name: Spark Structured Streaming supports writing directly to Delta Lake tables by passing a table name as a string to the .table() method. Note that you can either append to an existing table or create a new table.

checkpoint_directory: This argument is passed to the checkpointLocation and cloudFiles.schemaLocation options. Checkpoints keep track of streaming progress, while the schema location tracks updates to the fields in the source dataset.
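
A minimal Python sketch, assuming the four placeholder arguments above, of the generic Auto Loader pattern (the function and parameter names are illustrative, not a fixed API):

def autoload_to_table(data_source, source_format, table_name, checkpoint_directory):
    # Read new files incrementally with Auto Loader and stream them into a Delta table
    query = (spark.readStream
                  .format("cloudFiles")
                  .option("cloudFiles.format", source_format)
                  .option("cloudFiles.schemaLocation", checkpoint_directory)
                  .load(data_source)
                  .writeStream
                  .option("checkpointLocation", checkpoint_directory)
                  .option("mergeSchema", "true")
                  .table(table_name))
    return query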

119
Q

What is the benefit of Auto Loader compared to structured streaming?

A

With cloudFiles.schemaLocation, Auto Loader will infer the schema, whereas traditional Structured Streaming will not; Auto Loader scans the first gigabytes of data and infers the schema for you.

120
Q

What keyword indicates that you’re using Auto Loader rather than a traditional stream for ingesting?

A

cloudFiles

121
Q

What is the _rescued_data column?

A

A column added by Auto Loader to capture any data that might be malformed and would not otherwise fit into the table.

122
Q

What is the data type encoded by Auto Loader for fields in a text-based file format?

A

STRING type

123
Q

Historically, what were the two inefficient ways to land new data?

A

Reprocess all records in a source directory to
calculate current results.
Implement custom logic to identify new data that’s arrived since the last time a table was updated.

124
Q

How do you track the ingestion progress?

A

%sql
DESCRIBE HISTORY target_table

125
Q

What is Spark Structured Streaming?

A

Extends the functionality of Apache Spark to allow for simplified configuration
and bookkeeping when processing incremental datasets.

Allows users to interact with ever-growing data sources as if they were just a static table of records, by treating infinite data as a table.

126
Q

What is the syntax to read a stream?

A

(spark.readStream
    .table("bronze")
    .createOrReplaceTempView("streaming_tmp_vw"))

127
Q

How can you transform streaming data?

A

%sql
SELECT device_id, count(device_id) AS total_recordings
FROM streaming_tmp_vw
GROUP BY device_id

128
Q

What are the 3 most important settings when writing a stream to Delta Lake tables?

A

Checkpointing with checkpointLocation
Output mode: .outputMode("append") or .outputMode("complete")
Trigger intervals: .trigger(...)

129
Q

What is the syntax to load data from a streaming temp view back to a DataFrame, and then query the table that we wrote out to?

A

(spark.table("device_counts_tmp_vw")
    .writeStream
    .option("checkpointLocation", f"{DA.paths.checkpoints}/silver")
    .outputMode("complete")
    .trigger(availableNow=True)
    .table("device_counts")
    .awaitTermination()  # optional: blocks execution of the next cell until the incremental batch write has succeeded
)

130
Q

what is a bronze table?

A

A bronze table contains raw data ingested from various sources (JSON files, RDBMS data, IoT data, to name a few examples). Data is appended incrementally and grows over time.
We're interested in retaining the full unprocessed history of each dataset in an efficient storage format, which gives us the ability to recreate any state of a given data system.

131
Q

What is a silver table?

A

A silver table provides a more refined view of our data. We can join fields from various bronze tables to enrich streaming records, or update account statuses based on recent activity. The silver layer may contain many pipelines and silver tables, providing various views of a given dataset. The goal is for the silver layer to become the validated single source of truth for our data; this is what the data lake could have been: correct schema and deduplicated records, but no aggregations for business users yet.

132
Q

What is a gold table?

A

highly refined and aggregated data. Data that has been transformed to knowledge. Updates to these tables will be completed as part of regularly scheduled production workloads, which helps control costs and allows SLAs for data freshness to be established. Gold tables provide business level aggregates often used for reporting and dashboarding. This would include aggregations such as daily active website users, weekly sales per store, or gross revenue per quarter by department. The end outputs are actionable insights, dashboards and reports of business metrics. Gold tables will often be stored in a separate storage container to help avoid cloud limits on data requests. In general, because aggregations, joins and filtering are being handled before data is written to the golden layer, query performance on data in the gold tables should be exceptional.

133
Q

Bronze: what additional metadata could you add for enhanced discoverability?

A

Examples: the source file names being ingested, and a record of the time when the data was originally processed.

134
Q

How can you configure a read on a raw JSON source using Auto Loader with schema inference?

A

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaHints", "time DOUBLE")
    .option("cloudFiles.schemaLocation", f"{DA.paths.checkpoints}/bronze")
    .load(DA.paths.data_landing_location)
    .createOrReplaceTempView("recordings_raw_temp"))

135
Q

What happens with the ACID guarantees that Delta Lake brings to your data when you choose to merge this data with other data sources?

A

Delta Lake ensures that only fully successful commits are reflected in your tables. If you choose to merge these data with other data sources, be aware of how those sources version data and what sort of consistency guarantees they have.

136
Q

Describe what happens at the silver level, when we enrich our data.

A

We join the recordings data with the PII to add patient names, parse the recording time to the format 'yyyy-MM-dd HH:mm:ss' so it is human-readable, and perform a quality check by excluding heart rates that are <= 0.

(spark.readStream
    .table("bronze")
    .createOrReplaceTempView("bronze_tmp"))

%sql
CREATE OR REPLACE TEMPORARY VIEW recordings_w_pii AS (
  SELECT device_id, a.mrn, b.name,
         cast(from_unixtime(time, 'yyyy-MM-dd HH:mm:ss') AS timestamp) time,
         heartrate
  FROM bronze_tmp a
  INNER JOIN pii b
  ON a.mrn = b.mrn
  WHERE heartrate > 0)

(spark.table("recordings_w_pii")
    .writeStream
    .format("delta")
    .option("checkpointLocation", f"{DA.paths.checkpoints}/recordings_enriched")
    .outputMode("append")
    .table("recordings_enriched"))

%sql
SELECT COUNT(*) FROM recordings_enriched

137
Q

Describe what happens at the Gold level.

A

We read a stream of data from recordings_enriched and write another stream to create an aggregate
gold table of daily averages for each patient.
(spark.readStream
    .table("recordings_enriched")
    .createOrReplaceTempView("recordings_enriched_temp"))

%sql
CREATE OR REPLACE TEMP VIEW patient_avg AS (
  SELECT mrn, name, mean(heartrate) avg_heartrate, date_trunc("DD", time) date
  FROM recordings_enriched_temp
  GROUP BY mrn, name, date_trunc("DD", time))

138
Q

What is .trigger(availableNow=True) and when is it used?

A

It gives us the ability to continue using the strengths of Structured Streaming while triggering the job once to process all available data in micro-batches.

(spark.table("patient_avg")
    .writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", f"{DA.paths.checkpoints}/daily_avg")
    .trigger(availableNow=True)  # you want the benefits of streaming but as a single batch
    .table("daily_patient_avg"))

139
Q

What are the important considerations for complete output mode with Delta?

A

When using complete output mode, we rewrite the entire state of our table each time our logic runs.
While this is ideal for calculating aggregates, we cannot read a stream from this directory, as Structured Streaming assumes data is only being appended.

140
Q

What is the syntax for declaring a bronze layer table using Auto Loader and DLT?

A

CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_raw
COMMENT "The raw sales orders, ingested from /databricks-datasets."
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/sales_orders/", "json",
  map("cloudFiles.inferColumnTypes", "true"))

141
Q

When scheduling a Job, what are the two options to configure the cluster where the task runs?

A

New Job Cluster
Existing All-Purpose Clusters

142
Q

How can you view runs for a Job and the details of the runs?

A

To view details of the run, including the start time, duration, and status, hover over the bar in the Job Runs row.
To view details of each task, including the start time, duration, cluster, and status, hover over the cell for that task

143
Q

What are the recommendations for cluster configuration for specific job types?

A

Make sure to use shared job clusters.
Choose the correct cluster type for your job.

144
Q

What is the data explorer, how do you access it and what does it allow you to do?

A

The Data Explorer, accessed from the Data tab in the Databricks SQL workspace, allows users and admins to navigate databases, tables, and views; explore data schema, metadata, and history; and set and modify permissions of relational entities.

145
Q

What is Unity catalog?

A

Unity Catalog is Databricks' unified governance solution for data and AI on the Lakehouse. Among other things, it shows where data came from, who created it and when, how it has been modified over time, how it's being used, and more.

146
Q

What are the four key functional areas for data governance?

A

Data Access Control: Who has access to what?

Data Access Audit: understand who accessed what, when, and what they did (the compliance aspect).

Data Lineage: Which data objects feed downstream data objects - if you make a change to an upstream
table, how does that affect downstream and vice versa.

Data Discovery: Important to find your data and see what actually exists.

147
Q

What are the default permissions for users and admins in DBSQL?

A

Admins will have the ability to view all objects registered to the metastore and will be able to control permissions for other users in the workspace.
Users will have no permissions on anything registered to the metastore, other than objects that they create in DBSQL;
before users can create any databases, tables, or views, they must have create and usage privileges specifically granted to them.
Permissions will be set using Groups that have been configured by an administrator.
Access Control Lists (ACLs) are used to control permissions.

148
Q

What are the 6 objects for which Databricks allows you to configure permissions?

A

CATALOG controls access to the entire data catalog.

DATABASE controls access to a database.

TABLE controls access to a managed or external table.

VIEW controls access to SQL views.

FUNCTION controls access to a named function.

ANY FILE controls access to the underlying filesystem. Users granted access to ANY FILE can bypass the restrictions put on the catalog, databases, tables, and views by reading from
the file system directly.

149
Q

For each object owner, describe what they can grant privileges for.

A

Databricks administrator: All objects in the catalog and the underlying filesystem.

Catalog owner: All objects in the catalog.

Database owner: All objects in the database.

Table owner: Only the table (similar options for views and functions).

150
Q

Describe all the privileges that can be configured in Data Explorer.

A

ALL PRIVILEGES: gives all privileges (is translated into all the below privileges).

SELECT: gives read access to an object.

MODIFY: gives ability to add, delete, and modify data to or from an object.

READ_METADATA: gives ability to view an object and its metadata.

USAGE: does not give any abilities, but is an additional requirement to perform any action
on a database object.

CREATE: gives ability to create an object (for example, a table in a database)

151
Q

What is the command to generate a new database and grant permissions to all users in the DBSQL query editor?

A

To enable the ability to create databases and tables in the default catalog using Databricks SQL:
GRANT usage, create ON CATALOG hive_metastore TO users

To confirm this has run successfully:
SHOW GRANT ON CATALOG hive_metastore

152
Q

In pyspark dataframe API, what should we use to load data from from temp view back to dataframe?

A

spark.table()
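
A minimal sketch, reusing the device_counts_tmp_vw temp view from an earlier card:

df = spark.table("device_counts_tmp_vw")  # loads the registered temp view back as a DataFrame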
