Databricks Flashcards

1
Q

What are the 2 main components of Databricks?

A

The control plane: stores notebook commands and workspace configurations.
The data plane: hosts compute resources (clusters).

2
Q

What are the 3 different Databricks services?

A

Data Science and Engineering workspace
Databricks SQL
Databricks Machine Learning

3
Q

What is a cluster?

A

A set of compute resources on which you run data engineering and data science workloads, either as a set of commands in a notebook or as an automated job.

4
Q

What are the 2 cluster types?

A

All-purpose clusters: run interactive notebooks.
Job clusters: run automated jobs.

5
Q

How long does Databricks retain cluster configuration information?

A

30 days.
Pin a cluster to keep an all-purpose cluster's configuration beyond 30 days.

6
Q

What are the three cluster modes?

A

Standard clusters: process large amounts of data with Apache Spark.
Single Node clusters: jobs that use small amounts of data or non-distributed workloads, such as single-node machine learning libraries.
High Concurrency clusters: for groups of users who need to share resources or run ad-hoc jobs. Administrators usually create High Concurrency clusters, and Databricks recommends enabling autoscaling for them.

7
Q

To ensure that all data at rest is
encrypted for all storage types, including shuffle data that is stored temporarily on your cluster’s local
disks:

A

You can enable local disk encryption.

8
Q

To reduce cluster start time

A

You can attach the cluster to a predefined pool of idle instances for the driver and worker nodes.

9
Q

Which magic command do you use to run a notebook from
another notebook?

A

%run ../Includes/Classroom-Setup-1.2

10
Q

What is Databricks utilities and how can you use it to list out
directories of files from Python cells?

A

Databricks Utilities (dbutils) are helper utilities available in notebooks (for example for the file system, secrets, and widgets). To list a directory from a Python cell:

display(dbutils.fs.ls("/databricks-datasets"))

11
Q

What function should you use when you have tabular data
returned by a Python cell?

A

display()
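
A minimal sketch (assumes the students table created in a later card already exists):

display(spark.table("students"))  # renders the DataFrame as an interactive table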

12
Q

What is Databricks Repos?

A

provides repository-level
integration with Git providers, allowing you to work in an environment that is backed by revision control
using Git

13
Q

What is the definition of a Delta Lake?

A

Delta Lake is the technology at the heart of the Databricks Lakehouse platform. It is an open-source technology that enables building a data lakehouse on top of existing storage systems.

14
Q

How does Delta Lake address the data lake pain points to ensure reliable, ready-to-go data?

A

ACID Transactions – Delta Lake adds ACID transactions to data lakes. ACID stands for atomicity,
consistency, isolation, and durability

15
Q

Describe how Delta Lake brings ACID transactions to object
storage

A

By adding ACID transactions on top of object storage, Delta Lake addresses these pain points:
Difficult to append data
Difficult to modify existing data
Jobs failing midway
Real-time operations are not easy
Costly to keep historical data versions

16
Q

Is Delta Lake the default for all tables created in Databricks?

A

Yes, Delta Lake is the default format for all tables created in Databricks.

17
Q

What data objects are in the Databricks Lakehouse?

A

Catalog, database, table, view, function

18
Q

What is a metastore?

A

A metastore contains all of the metadata that defines data objects in the lakehouse. Options include:
Unity Catalog
Hive metastore
External metastore

19
Q

What is a catalog?

A

The highest abstraction (or coarsest grain) in the Databricks Lakehouse relational model; it is the first part of the three-level namespace:
catalog_name.database_name.table_name

20
Q

What is a Delta Lake table?

A

A Delta Lake table stores data as a directory of files on cloud object storage and registers table metadata to the metastore within a catalog and schema.

21
Q

What is the syntax to create a Delta Table?

A

CREATE TABLE students
(id INT, name STRING, value DOUBLE);
CREATE TABLE IF NOT EXISTS students
(id INT, name STRING, value DOUBLE)

22
Q

What is the syntax to insert data?

A

INSERT INTO students
VALUES
(4, "Ted", 4.7),
(5, "Tiffany", 5.5),
(6, "Vini", 6.3)

23
Q

What is the syntax to update particular records of a table?

A

UPDATE students
SET value = value + 1
WHERE name LIKE "T%"

24
Q

What is the syntax to delete particular records of a table?

A

DELETE FROM students
WHERE value > 6

25
Q

What is the syntax for merge and what are the benefits of
using merge?

A

MERGE INTO table_a a
USING table_b b
ON a.col_name = b.col_name
WHEN MATCHED AND b.col = X
THEN UPDATE SET *
WHEN MATCHED AND a.col = Y
THEN DELETE
WHEN NOT MATCHED AND b.col = Z
THEN INSERT *

Benefit: updates, inserts, and deletes against the target are completed as a single atomic transaction, and extra conditions can be added to the matching logic.

26
Q

Deduplication case (Python syntax)

A

(deltaTable.alias("t")
    .merge(historicalUpdates.alias("s"), "t.loan_id = s.loan_id")
    .whenNotMatchedInsertAll()
    .execute())

27
Q

What is the syntax to delete a table?

A

DROP TABLE students

28
Q

What is Hive?

A

Hive is a data warehouse system built on top of Apache Hadoop, an open-source framework used to efficiently store and process large datasets.

29
Q

What are the two commands to see metadata about a table?

A

DESCRIBE EXTENDED students
DESCRIBE DETAIL students

30
Q

What is the syntax to display the Delta Lake files?

A

%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students"))

DESCRIBE DETAIL students

31
Q

Describe the Delta Lake files, their format and directory
structure

A

A Delta Lake table is stored as a directory of Parquet data files plus a _delta_log subdirectory containing JSON transaction logs:

%python
display(dbutils.fs.ls(f"{DA.paths.user_db}/students/_delta_log"))

32
Q

What does the query engine do using the transaction logs
when we query a Delta Lake table?

A

The query engine uses the transaction logs to resolve all the data files that are valid in the current version, and ignores all other data files.

You can look at a particular transaction log and see whether records were inserted / updated / deleted:
%python
display(spark.sql(f"SELECT * FROM json.`{DA.paths.user_db}/students/_delta_log/00000000000000000007.json`"))

33
Q

What commands do you use to compact small files and index
tables?

A

OPTIMIZE command allows you to combine files toward an optimal size:

OPTIMIZE events
OPTIMIZE events WHERE date >= '2017-01-01'
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

34
Q

How do you review a history of table transactions?

A

Because all changes to the Delta Lake table are stored in the transaction log:
DESCRIBE HISTORY students

35
Q

How do you query and roll back to a previous table version?

A

SELECT *
FROM students VERSION AS OF 3

36
Q

How to roll back after deletion?

A

RESTORE TABLE students TO VERSION AS OF 8

37
Q

What command do you use to clean up stale data files and
what are the consequences of using this command?

A

SET spark.databricks.delta.retentionDurationCheck.enabled = false;
SET spark.databricks.delta.vacuum.logging.enabled = true;
VACUUM students RETAIN 0 HOURS DRY RUN

Consequence: VACUUM permanently removes data files that are no longer referenced by the current table version, so you lose the ability to time travel to versions older than the retention period.

38
Q

What is the advantage of using Delta cache?

A

It accelerates repeated reads by caching copies of remote data files on the cluster nodes' local storage, so subsequent queries avoid re-reading the same data from cloud object storage.
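
A minimal sketch of enabling the cache from a Python cell, assuming the spark.databricks.io.cache.enabled configuration key applies to your runtime:

# Enable the Delta/disk cache for this cluster session (configuration key assumed)
spark.conf.set("spark.databricks.io.cache.enabled", "true")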

39
Q

What is the syntax to create a database with default location
(no location specified)?

A

CREATE DATABASE IF NOT EXISTS db_name_default_location;

40
Q

What is the syntax to create a database with specified
location?

A

CREATE DATABASE IF NOT EXISTS db_name_custom_location LOCATION 'path/db_name_custom_location.db';

41
Q

How do you get metadata information of a database? Where
are the databases located (difference between default vs custom
location)

A

DESCRIBE DATABASE EXTENDED db_name;

With the default location, the database directory is created under the default Hive warehouse directory (dbfs:/user/hive/warehouse/<db_name>.db); with a custom location, it is created at the path given in the LOCATION clause.

42
Q

What’s the best practice when creating databases?

A

Declare a LOCATION for a given database.

43
Q

What is the syntax for creating a table in a database with default location and inserting data?

A

USE db_name_default_location;

CREATE OR REPLACE TABLE managed_table_in_db_with_default_location (width INT, length
INT, height INT);
INSERT INTO managed_table_in_db_with_default_location
VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_default_location;

44
Q

What is the syntax for a table in a database with custom location?

A

USE db_name_custom_location;

CREATE OR REPLACE TABLE managed_table_in_db_with_custom_location (width INT, length
INT, height INT);
INSERT INTO managed_table_in_db_with_custom_location
VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_custom_location;

45
Q

Where are managed tables located in a database?

A

Managed tables are created in the LOCATION of the database they are registered to.

46
Q

How can you find a table's location?

A

DESCRIBE EXTENDED managed_table_in_db;

47
Q

What is the syntax to create an external table?

A

CREATE OR REPLACE TABLE external_table LOCATION 'path/external_table' AS
SELECT * FROM temp_delays;
SELECT * FROM external_table;

Python:
df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")

48
Q

What happens when you drop a managed table?

A

The table's directory and its log and data files are deleted; only the database directory remains.
Managed tables:
DROP TABLE managed_table_in_db_with_default_location;
DROP TABLE managed_table_in_db_with_custom_location;

49
Q

What happens when you drop an unmanaged (external) table?

A

The table definition no longer exists in the metastore, but the underlying data remains intact.
DROP TABLE external_table;

We still have access to the underlying data files:
%python
tbl_path = f"{DA.paths.working_dir}/external_table"
files = dbutils.fs.ls(tbl_path)
display(files)

50
Q

What is the command to drop the database and its underlying tables and views?

A

Default location:
DROP DATABASE db_name_default_location CASCADE;
Custom location:
DROP DATABASE db_name_custom_location CASCADE;

51
Q

How can you show a list of tables and views?

A

SHOW TABLES;

52
Q

What is the difference between Views, Temp Views & Global Temp Views?

A

View: persisted across multiple sessions, just like a table. It stores only query logic and does not process or write any data until queried.
Temp view: not persisted across multiple sessions; scoped to the current session (query level).
Global temp view: scoped to the cluster level and registered to a separate global_temp database. Prefer views with appropriate table ACLs over global temporary views.

53
Q

What is the syntax for view?

A

CREATE VIEW view_delays_abq_lax AS
SELECT *
FROM external_table
WHERE origin = 'ABQ' AND destination = 'LAX';
SELECT * FROM view_delays_abq_lax;

54
Q

What is the syntax for temp view?

A

CREATE TEMPORARY VIEW temp_view_delays_gt_120
AS SELECT * FROM external_table WHERE delay > 120 ORDER BY delay ASC;
SELECT * FROM temp_view_delays_gt_120;

55
Q

What is syntax for global temp view?

A

CREATE GLOBAL TEMPORARY VIEW global_temp_view_dist_gt_1000
AS SELECT * FROM external_table WHERE distance > 1000;
SELECT * FROM global_temp.global_temp_view_dist_gt_1000;

Show tables for global temp views
SHOW TABLES IN global_temp;

56
Q

What is the syntax to select from global temp views?

A

SELECT * FROM global_temp.name_of_the_global_temp_view;

57
Q

What is the syntax for defining a CTE in a subquery?

A

SELECT
  max(total_delay) AS `Longest Delay (in minutes)`
FROM
  (WITH delayed_flights(total_delay) AS (
    SELECT delay
    FROM external_table)
  SELECT *
  FROM delayed_flights
);

58
Q

What is the syntax for defining a CTE in a subquery expression?

A

SELECT
  (WITH distinct_origins AS (
    SELECT DISTINCT origin FROM external_table
  )
  SELECT count(origin) AS `Number of Distinct Origins`
  FROM distinct_origins)
AS `Number of Different Origin Airports`;

59
Q

What is the syntax for defining a CTE in a CREATE VIEW statement?

A

CREATE OR REPLACE VIEW BOS_LAX
AS WITH origin_destination(origin_airport, destination_airport)
AS (SELECT origin, destination FROM external_table)
SELECT * FROM origin_destination
WHERE origin_airport = 'BOS' AND destination_airport = 'LAX';

SELECT count(origin_airport) AS `Number of Delayed Flights from BOS to LAX`
FROM BOS_LAX;

60
Q

How do you query data from a single file?

A

SELECT * FROM file_format.`/path/to/file`
SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka/001.json`

61
Q

How do you query a directory of files?

A

SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka`

62
Q

How do you create references to files?

A

CREATE OR REPLACE TEMP VIEW events_temp_view
AS SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka/`;
SELECT * FROM events_temp_view

63
Q

How do you extract text files as raw strings?

A

SELECT * FROM text.`${da.paths.datasets}/raw/events-kafka/`

64
Q

How do you extract the raw bytes and metadata of a file?

A

SELECT * FROM binaryFile.`${da.paths.datasets}/raw/events-kafka/`

65
Q

Explain why executing a direct query against CSV files rarely returns the desired result.

A

The header row can be extracted as a table row, all columns can be loaded as a single column, or nested data in a column can end up truncated.
SELECT * FROM csv.`${da.paths.working_dir}/sales-csv`

66
Q

Describe the syntax required to extract data from most formats against external sources.

A

CREATE TABLE table_identifier (col_name1 col_type1, ...)
USING data_source
OPTIONS (key1 = "val1", key2 = "val2", ...)
LOCATION = path

67
Q

Using Spark SQL DDL to create a table against an external CSV source

A

CREATE TABLE sales_csv
(order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER,
 purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
OPTIONS (
  header = "true",
  delimiter = "|"
)
LOCATION "${da.paths.working_dir}/sales-csv"

68
Q

What happens to the data, metadata and options during table declaration for these external sources?

A

All the metadata and options passed during table declaration will be persisted to the metastore.

69
Q

What is the syntax to show all of the metadata associated with the table definition?

A

DESCRIBE EXTENDED sales_csv

70
Q

How can you manually refresh the cache of your data?

A

REFRESH TABLE sales_csv

71
Q

What is the syntax to extract data from SQL Databases?

A

CREATE TABLE
USING JDBC
OPTIONS (
  url = "jdbc:{databaseServerType}://{jdbcHostname}:{jdbcPort}",
  dbtable = "{jdbcDatabase}.table",
  user = "{jdbcUsername}",
  password = "{jdbcPassword}"
)

DROP TABLE IF EXISTS users_jdbc;
CREATE TABLE users_jdbc
USING JDBC
OPTIONS (
  url = "jdbc:sqlite:/${da.username}_ecommerce.db",
  dbtable = "users"
)
SELECT * FROM users_jdbc

72
Q

Explain the two basic approaches that Spark uses to interact with external SQL databases and their limits.

A

You can move the entire source table(s) to Databricks and then execute the logic on the currently active cluster, but this incurs significant overhead because of network transfer latency.
You can push down the query to the external SQL database and only transfer the results back to Databricks, but this also incurs significant overhead because the query logic is executed in source systems that are not optimized for big data queries.
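
A rough Python sketch of the push-down approach: Spark's JDBC reader accepts a subquery in place of a table name, so only the aggregated result crosses the network. The connection URL, table, and query below are placeholders.

pushed_down_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlite:/path/to/ecommerce.db")  # placeholder connection string
    .option("dbtable", "(SELECT state, count(*) AS cnt FROM users GROUP BY state) AS users_by_state")
    .load())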

73
Q

What is a CTAS statement and what is the syntax?

A

CTAS (CREATE TABLE AS SELECT) statements create and populate a table using data retrieved from an input query:

CREATE OR REPLACE TABLE sales AS
SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-historical/`;
DESCRIBE EXTENDED sales;

74
Q

Do CTAS support manual schema declaration?

A

No. CTAS statements automatically infer schema information from the query results and do not support manual schema declaration.

75
Q

What is the syntax to overcome the limitation when trying to ingest data from CSV files?

A

CREATE OR REPLACE TEMP VIEW sales_tmp_vw
(order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER,
 purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
OPTIONS (
  path = "${da.paths.datasets}/raw/sales-csv",
  header = "true",
  delimiter = "|"
);
CREATE TABLE sales_delta AS
SELECT * FROM sales_tmp_vw;
SELECT * FROM sales_delta

76
Q

How do you filter and rename columns from existing tables during table creation?

A

CREATE OR REPLACE TABLE purchases AS
SELECT order_id AS id, transaction_timestamp, purchase_revenue_in_usd AS price
FROM sales;
SELECT * FROM purchases

77
Q

What is a generated column and how do you declare schemas with generated columns?

A

A generated column is a special column whose values are automatically computed based on a user-specified function over other columns in the Delta table.

CREATE OR REPLACE TABLE purchase_dates (
  id STRING,
  transaction_timestamp STRING,
  price STRING,
  date DATE GENERATED ALWAYS AS (
    cast(cast(transaction_timestamp/1e6 AS TIMESTAMP) AS DATE))
    COMMENT "generated based on transactions_timestamp column")

78
Q

Which built-in Spark SQL commands are useful for file ingestion (for the select clause)?

A

current_timestamp() records the timestamp when the logic is executed;
input_file_name() records the source data file for each record in the table

79
Q

What are the three options when creating tables?

A

A COMMENT is added to allow for easier discovery of table contents.
A LOCATION is specified, which will result in an external (rather than managed) table.
The table is PARTITIONED BY a date column.

CREATE OR REPLACE TABLE users_pii
COMMENT "Contains PII"
LOCATION "${da.paths.working_dir}/tmp/users_pii"
PARTITIONED BY (first_touch_date)
AS
SELECT *,
  cast(cast(user_first_touch_timestamp/1e6 AS TIMESTAMP) AS DATE) first_touch_date,
  current_timestamp() updated,
  input_file_name() source_file
FROM parquet.`${da.paths.datasets}/raw/users-historical/`;
SELECT * FROM users_pii;
SELECT * FROM users_pii;

80
Q

As a best practice, should you default to partitioned tables for most use cases when working with Delta Lake?

A

You should default to non-partitioned tables for most use cases when working with Delta Lake.

81
Q

What are the two options to copy Delta Lake tables and what are the use cases?

A

DEEP CLONE fully copies data and metadata from a source table to a target; useful for a complete, independent copy (backup), though copying can take a while for large datasets.
CREATE OR REPLACE TABLE purchases_clone
DEEP CLONE purchases

SHALLOW CLONE only copies the Delta transaction logs, so no data is moved; useful for quickly creating a copy of a table to test changes without risking the current table.
CREATE OR REPLACE TABLE purchases_shallow_clone
SHALLOW CLONE purchases

82
Q

What are the multiple benefits of overwriting tables instead of deleting and recreating tables?

A

It is much faster, because it does not require listing or deleting the old data files individually.
The old version of the table still exists and can easily be retrieved using Time Travel.
It is an atomic operation: concurrent queries can still read the table while it is being overwritten.
If overwriting the table fails, the table remains in its previous state.

83
Q

What are the two easy methods to accomplish complete overwrites?

A

CREATE OR REPLACE TABLE: completely redefines the contents of the target table.
CREATE OR REPLACE TABLE events AS
SELECT * FROM parquet.`${da.paths.datasets}/raw/events-historical`

INSERT OVERWRITE: can only overwrite an existing table and will fail if we try to change the schema.
INSERT OVERWRITE sales
SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-historical/`

84
Q

What is the syntax to atomically append new rows to an existing Delta table? Is the command idempotent?

A

INSERT INTO sales
SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-30m`

No, the command is not idempotent: re-executing it will append the same records again and create duplicates.

85
Q

What is the syntax for the the MERGE SQL operation and the benefits of using merge?

A

MERGE INTO target a
USING source b
ON {merge_condition}
WHEN MATCHED THEN {matched_action}
WHEN NOT MATCHED THEN {not_matched_action}

Benefits:
1. Updates, inserts, and deletes are completed as a single transaction.
2. Multiple conditions can be added in addition to matching fields.
3. It provides extensive options for implementing custom logic.

86
Q

How can you use merge for deduplication?

A

MERGE INTO events a
USING events_update b
ON a.user_id = b.user_id AND a.event_timestamp = b.event_timestamp
WHEN NOT MATCHED AND b.traffic_source = 'email' THEN
INSERT *

87
Q

What is the syntax to have an idempotent option to incrementally ingest data from external systems?

A

COPY INTO sales
FROM "${da.paths.datasets}/raw/sales-30m"
FILEFORMAT = PARQUET

88
Q

How is COPY INTO different than Auto Loader?

A

COPY INTO is focused on batch execution by a SQL analyst (idempotent, incremental batch ingestion).
Auto Loader requires Structured Streaming and is better suited for continuously ingesting files as they arrive.

89
Q

What is the syntax to count null values?

A

SELECT * FROM table_name WHERE col_name IS NULL
SELECT count_if(col_name IS NULL) AS new_col_name FROM table_name

90
Q

What is the syntax to count for distinct values in a table for a specific column?

A

SELECT COUNT(DISTINCT(col_1, col_2)) FROM table_name WHERE col_1 IS NOT NULL

91
Q

What is the syntax to deal with binary-encoded JSON values in a human readable format?

A

CREATE OR REPLACE TEMP VIEW events_strings AS
SELECT string(key), string(value)
FROM events_raw;
SELECT * FROM events_strings

92
Q

What is the syntax to parse JSON
objects into struct types with Spark SQL?

A

CREATE OR REPLACE TEMP VIEW parsed_events AS
SELECT from_json(value, schema_of_json('{insert_example_schema_here}')) AS json
FROM events_strings;
SELECT * FROM parsed_events

93
Q

Once a JSON string is unpacked to a struct type, what is the syntax to flatten the fields into columns?

A

CREATE OR REPLACE TEMP VIEW new_events_final AS
SELECT json.*
FROM parsed_events;
SELECT * FROM new_events_final

94
Q

What is the syntax for exploding arrays of structs?

A

SELECT user_id, event_timestamp, event_name, explode(items) AS item
FROM events

The explode function lets us put each element of an array on its own row.

95
Q

What is the syntax to collect arrays?

A

SELECT user_id,
collect_set(event_name) AS event_history,
array_distinct(flatten(collect_set(items.item_id))) AS cart_history
FROM events
GROUP BY user_id

The collect_set function can collect unique values for a field, including fields within arrays.
The flatten function allows multiple arrays to be combined into a single array.
The array_distinct function removes duplicate elements from an array.

96
Q

What is the syntax for a semi-join?

A

SELECT columns
FROM table_1
WHERE EXISTS (
SELECT values
FROM table_2
WHERE table_2.column = table_1.column);

97
Q

What is the syntax for FILTER ?

A

FILTER (items, i -> i.item_id LIKE "%K") AS king_items

FILTER: the name of the higher-order function.
items: the name of the input array.
i: the name of the iterator variable. You choose this name and then use it in the lambda function; it iterates over the array, cycling each value into the function one at a time.
->: indicates the start of the lambda function.
i.item_id LIKE "%K": the function body. Each value is checked to see whether it ends with the capital letter K; if it does, it is filtered into the new column, king_items.

98
Q

What is the syntax for EXISTS?

A

EXISTS (categories, c -> c = "Company Blog") companyFlag

Say we want to flag all blog posts with "Company Blog" in the categories field. We can use the EXISTS function to mark which entries include that category.

99
Q

What is the syntax for TRANSFORM ?

A

TRANSFORM(king_items, k -> CAST(k.item_revenue_in_usd * 100 AS INT)) AS item_revenues

We extract each item's revenue value, multiply it by 100, and cast the result to an integer.

100
Q

What is the syntax for REDUCE ?

A

REDUCE(co2_level, 0, (c, acc) -> c + acc, acc -> (acc div size(co2_level)))

REDUCE takes the input array, a starting buffer value (0), a merge function applied to each element, and a finish function applied to the final buffer (here, computing the average CO2 level).

101
Q

What is the syntax to define and register SQL UDFs? How do
you then apply that function to the data?

A

CREATE OR REPLACE FUNCTION function_name(param TYPE)
RETURNS type_to_be_returned RETURN function_itself
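
A hedged example of defining a SQL UDF and then applying it to data, run from a Python cell via spark.sql; the function name and the students table are illustrative only:

# Define a simple SQL UDF that upper-cases its input and appends "!!!"
spark.sql("""
CREATE OR REPLACE FUNCTION yelling(text STRING)
RETURNS STRING
RETURN concat(upper(text), "!!!")
""")

# Apply the UDF like any built-in SQL function
display(spark.sql("SELECT yelling(name) FROM students"))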

102
Q

What are SQL UDFs governed by?

A

by the same Access Control Lists (ACLs)
as databases, tables, or views

103
Q

What permissions must a user have on the function to use a SQL UDF? Describe their scoping.

A

The user must have USAGE and SELECT permissions on the function to use it.
SQL UDFs persist between execution environments (notebooks, DBSQL queries, jobs) and are scoped to the database in which they were registered.

104
Q

What is the benefit of using SQL UDFs?

A

allow a handful of users to define the complex logic needed for
common reporting and analytic queries

105
Q

What is the syntax to turn SQL queries into Python strings?

A

print("""
SELECT *
FROM table_name
""")

106
Q

What is the syntax to execute SQL from a Python cell?

A

spark.sql("SELECT * FROM table_name")

107
Q

What function in python do you call to render a query the way it would appear in a normal SQL notebook?

A

display(spark.sql("SELECT * FROM table_name"))

108
Q

What is the syntax to define a function in Python?

A

def return_new_string(string_arg):
    return "The string passed to this function was " + string_arg

109
Q

What is the syntax for f-strings?

A

f"I can substitute {my_string} here"

110
Q

How can f-strings be used for SQL queries?

A

table_name = "users"
filter_clause = "WHERE state = 'CA'"
query = f"""
SELECT *
FROM {table_name}
{filter_clause}
"""
print(query)

111
Q

What is the syntax for if / else clauses wrapped in a function?

A

def foods_i_like(food):
    if food == "beans":
        print(f"I love {food}")
    elif food == "potatoes":
        print(f"My favorite vegetable is {food}")
    elif food != "beef":
        print(f"Do you have any good recipes for {food}?")
    else:
        print(f"I don't eat {food}")

112
Q

What are assert statements and what is the syntax?

A

Example asserting that the number 2 is an integer:
assert type(2) == int

113
Q

Why do we use try / except statements and what is the syntax?

A

try / except provides robust error handling: when a non-numeric string is passed, an informative message is printed instead of an error being raised.
def try_int(num_string):
    try:
        int(num_string)
        result = f"{num_string} is a number."
    except:
        result = f"{num_string} is not a number!"
    print(result)

114
Q

What is the downside of using try / except statements?

A

An error will not be raised when one occurs; implementing logic that suppresses errors can lead to code silently failing.

115
Q

What is the syntax for try / except statements where you return an informative error message?

A

def three_times(number):
    try:
        return int(number) * 3
    except ValueError as e:
        print(f"You passed the string variable '{number}'.\n")
        print(f"Try passing an integer instead.")
        return None

116
Q

How do you apply these concepts to execute SQL logic on Databricks, for example to avoid a SQL injection attack?

A

Wrapping the query in a function with a simple if clause allows us to execute arbitrary SQL queries, optionally display the results, and always return the resultant DataFrame.
def simple_query_function(query, preview=True):
    query_result = spark.sql(query)
    if preview:
        display(query_result)
    return query_result

result = simple_query_function(query)

117
Q

What is the purpose of Auto Loader?

A

It provides a way for data teams to load raw data from cloud object stores at lower cost and latency.

It allows you to continuously ingest data into Delta Lake.

Use it as a general best practice when ingesting data from cloud object storage.

118
Q

What are the 4 arguments using Auto Loader with automatic schema inference and evolution?

A

data_source: Auto Loader will detect new files as they arrive in this location and queue them for ingestion; passed to the .load() method.

source_format: While the format for all Auto Loader queries is cloudFiles, the format of the source data should always be specified for the cloudFiles.format option.

table_name: Spark Structured Streaming supports writing directly to Delta Lake tables by passing a table name as a string to the .table() method. Note that you can either append to an existing table or create a new table.

checkpoint_directory: This argument is passed to the checkpointLocation and cloudFiles.schemaLocation options. Checkpoints keep track of streaming progress, while the schema location tracks updates to the fields in the source dataset.
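
A minimal Python sketch, assuming the four placeholder arguments above, of the generic Auto Loader pattern (the function and parameter names are illustrative, not a fixed API):

def autoload_to_table(data_source, source_format, table_name, checkpoint_directory):
    # Read new files incrementally with Auto Loader and stream them into a Delta table
    query = (spark.readStream
                  .format("cloudFiles")
                  .option("cloudFiles.format", source_format)
                  .option("cloudFiles.schemaLocation", checkpoint_directory)
                  .load(data_source)
                  .writeStream
                  .option("checkpointLocation", checkpoint_directory)
                  .option("mergeSchema", "true")
                  .table(table_name))
    return query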

119
Q

What is the benefit of Auto Loader compared to structured streaming?

A

With cloudFiles.schemaLocation, Auto Loader will infer the schema, whereas traditional Structured Streaming will not; Auto Loader scans the first gigabytes of data and infers the schema for you.

120
Q

What keyword indicates that you’re using Auto Loader rather than a traditional stream for ingesting?

A

cloudFiles

121
Q

What is the _rescued_data column?

A

A column added by Auto Loader to capture any data that might be malformed and would not otherwise fit into the table.

122
Q

What is the data type encoded by Auto Loader for fields in a text-based file format?

A

STRING type

123
Q

Historically, what were the two inefficient ways to land new data?

A

Reprocess all records in a source directory to
calculate current results.
Implement custom logic to identify new data that’s arrived since the last time a table was updated.

124
Q

How do you track the ingestion progress?

A

%sql
DESCRIBE HISTORY target_table

125
Q

What is Spark Structured Streaming?

A

Extends the functionality of Apache Spark to allow for simplified configuration
and bookkeeping when processing incremental datasets.

Allows users to interact with ever-growing data sources as if they were just a static table of records, by treating infinite data as a table.

126
Q

What is the syntax to read a stream?

A

(spark.readStream
    .table("bronze")
    .createOrReplaceTempView("streaming_tmp_vw"))

127
Q

How can you transform streaming data?

A

%sql
SELECT device_id, count(device_id) AS total_recordings
FROM streaming_tmp_vw
GROUP BY device_id

128
Q

What are the 3 most important settings when writing a stream to Delta Lake tables?

A

Checkpointing with checkpointLocation
Output mode: .outputMode("append") or .outputMode("complete")
Trigger intervals: .trigger(...)

129
Q

What is the syntax to load data from a streaming temp view back to a DataFrame, and then query the table that we wrote out to?

A

(spark.table("device_counts_tmp_vw")
    .writeStream
    .option("checkpointLocation", f"{DA.paths.checkpoints}/silver")
    .outputMode("complete")
    .trigger(availableNow=True)
    .table("device_counts")
    .awaitTermination()  # optional: blocks execution of the next cell until the incremental batch write has succeeded
)

130
Q

what is a bronze table?

A

A bronze table contains raw data ingested from various sources (JSON files, RDBMS data, IoT data, to name a few examples). Data is appended incrementally and grows over time.
We're interested in retaining the full unprocessed history of each dataset in an efficient storage format, which gives us the ability to recreate any state of a given data system.

131
Q

What is a silver table?

A

A silver table provides a more refined view of our data. We can join fields from various bronze tables to enrich streaming records, or update account statuses based on recent activity. The silver layer may contain many pipelines and silver tables, providing various views of a given dataset. The goal is for the silver layer to become the validated single source of truth for our data; this is what the data lake could have been: correct schema and deduplicated records, but no aggregations for business users yet.

132
Q

What is a gold table?

A

highly refined and aggregated data. Data that has been transformed to knowledge. Updates to these tables will be completed as part of regularly scheduled production workloads, which helps control costs and allows SLAs for data freshness to be established. Gold tables provide business level aggregates often used for reporting and dashboarding. This would include aggregations such as daily active website users, weekly sales per store, or gross revenue per quarter by department. The end outputs are actionable insights, dashboards and reports of business metrics. Gold tables will often be stored in a separate storage container to help avoid cloud limits on data requests. In general, because aggregations, joins and filtering are being handled before data is written to the golden layer, query performance on data in the gold tables should be exceptional.

133
Q

Bronze: what additional metadata could you add for enhanced discoverability?

A

Examples: the source file names being ingested, and a record of the time when the data was originally processed.

134
Q

How can you configure a read on a raw JSON source using Auto Loader with schema inference?

A

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaHints", "time DOUBLE")
    .option("cloudFiles.schemaLocation", f"{DA.paths.checkpoints}/bronze")
    .load(DA.paths.data_landing_location)
    .createOrReplaceTempView("recordings_raw_temp"))

135
Q

What happens with the ACID guarantees that Delta Lake brings to your data when you choose to merge this data with other data sources?

A

Delta Lake ensures that only fully successful commits are reflected in your tables. If you choose to merge these data with other data sources, be aware of how those sources version data and what sort of consistency guarantees they have.

136
Q

Describe what happens at the silver level, when we enrich our data.

A

We join the recordings data with the PII to add patient names, parse the recording time to the format 'yyyy-MM-dd HH:mm:ss' so it is human-readable, and perform a quality check by excluding heart rates that are <= 0.

(spark.readStream
    .table("bronze")
    .createOrReplaceTempView("bronze_tmp"))

%sql
CREATE OR REPLACE TEMPORARY VIEW recordings_w_pii AS (
  SELECT device_id, a.mrn, b.name,
         cast(from_unixtime(time, 'yyyy-MM-dd HH:mm:ss') AS timestamp) time,
         heartrate
  FROM bronze_tmp a
  INNER JOIN pii b
  ON a.mrn = b.mrn
  WHERE heartrate > 0)

(spark.table("recordings_w_pii")
    .writeStream
    .format("delta")
    .option("checkpointLocation", f"{DA.paths.checkpoints}/recordings_enriched")
    .outputMode("append")
    .table("recordings_enriched"))

%sql
SELECT COUNT(*) FROM recordings_enriched

137
Q

Describe what happens at the Gold level.

A

We read a stream of data from recordings_enriched and write another stream to create an aggregate
gold table of daily averages for each patient.
(spark.readStream
    .table("recordings_enriched")
    .createOrReplaceTempView("recordings_enriched_temp"))

%sql
CREATE OR REPLACE TEMP VIEW patient_avg AS (
  SELECT mrn, name, mean(heartrate) avg_heartrate, date_trunc("DD", time) date
  FROM recordings_enriched_temp
  GROUP BY mrn, name, date_trunc("DD", time))

138
Q

What is .trigger(availableNow=True) and when is it used?

A

It gives us the ability to continue using the strengths of Structured Streaming while triggering the job once to process all available data in micro-batches.

(spark.table("patient_avg")
    .writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", f"{DA.paths.checkpoints}/daily_avg")
    .trigger(availableNow=True)  # you want the benefits of streaming but as a single batch
    .table("daily_patient_avg"))

139
Q

What are the important considerations for complete output mode with Delta?

A

When using complete output mode, we rewrite the entire state of our table each time our logic runs.
While this is ideal for calculating aggregates, we cannot read a stream from this directory, as Structured Streaming assumes data is only being appended.

140
Q

What is the syntax for declaring a bronze layer table using Auto Loader and DLT?

A

CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_raw
COMMENT "The raw sales orders, ingested from /databricks-datasets."
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/sales_orders/", "json",
  map("cloudFiles.inferColumnTypes", "true"))

141
Q

When scheduling a Job, what are the two options to configure the cluster where the task runs?

A

New Job Cluster
Existing All-Purpose Clusters

142
Q

How can you view runs for a Job and the details of the runs?

A

To view details of the run, including the start time, duration, and status, hover over the bar in the Job Runs row.
To view details of each task, including the start time, duration, cluster, and status, hover over the cell for that task

143
Q

What are the recommendations for cluster configuration for specific job types?

A

Make sure to use shared job clusters.
Choose the correct cluster type for your job.

144
Q

What is the data explorer, how do you access it and what does it allow you to do?

A

The Data Explorer, accessed from the Data tab in the Databricks SQL workspace, allows users and admins to navigate databases, tables, and views; explore data schema, metadata, and history; and set and modify permissions of relational entities.

145
Q

What is Unity catalog?

A

Unity Catalog is Databricks' unified governance solution for data and AI on the Lakehouse. Among other things, it shows where data came from, who created it and when, how it has been modified over time, how it's being used, and more.

146
Q

What are the four key functional areas for data governance?

A

Data Access Control: Who has access to what?

Data Access Audit: understand who accessed what, when, and what they did (the compliance aspect).

Data Lineage: Which data objects feed downstream data objects - if you make a change to an upstream
table, how does that affect downstream and vice versa.

Data Discovery: Important to find your data and see what actually exists.

147
Q

What are the default permissions for users and admins in DBSQL?

A

Admins will have the ability to view all objects registered to the metastore and will be able to control permissions for other users in the workspace.
Users will have no permissions on anything registered to the metastore, other than objects that they create in DBSQL;
before users can create any databases, tables, or views, they must have create and usage privileges specifically granted to them.
Permissions will be set using Groups that have been configured by an administrator.
Access Control Lists (ACLs) are used to control permissions.

148
Q

What are the 6 objects for which Databricks allows you to configure permissions?

A

CATALOG controls access to the entire data catalog.

DATABASE controls access to a database.

TABLE controls access to a managed or external table.

VIEW controls access to SQL views.

FUNCTION controls access to a named function.

ANY FILE controls access to the underlying filesystem. Users granted access to ANY FILE can bypass the restrictions put on the catalog, databases, tables, and views by reading from
the file system directly.

149
Q

For each object owner, describe what they can grant privileges for.

A

Databricks administrator: All objects in the catalog and the underlying filesystem.

Catalog owner: All objects in the catalog.

Database owner: All objects in the database.

Table owner: Only the table (similar options for views and functions).

150
Q

Describe all the privileges that can be configured in Data Explorer.

A

ALL PRIVILEGES: gives all privileges (is translated into all the below privileges).

SELECT: gives read access to an object.

MODIFY: gives ability to add, delete, and modify data to or from an object.

READ_METADATA: gives ability to view an object and its metadata.

USAGE: does not give any abilities, but is an additional requirement to perform any action
on a database object.

CREATE: gives ability to create an object (for example, a table in a database)

151
Q

What is the command to generate a new database and grant permissions to all users in the DBSQL query editor?

A

To enable the ability to create databases and tables in the default catalog using Databricks SQL:
GRANT usage, create ON CATALOG hive_metastore TO users

To confirm this has run successfully:
SHOW GRANT ON CATALOG hive_metastore

152
Q

In pyspark dataframe API, what should we use to load data from from temp view back to dataframe?

A

spark.table()
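
A minimal sketch, reusing the device_counts_tmp_vw temp view from an earlier card:

df = spark.table("device_counts_tmp_vw")  # loads the registered temp view back as a DataFrame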
