Databricks Flashcards
What are the 2 main components of Databricks?
The control plan: stores notebooks commands, workspace configurations.
The data plane: hosts compute resources
(clusters)
What are the 3 different Databricks service?
Data science and engineering workspace
SQL
Machine learning
What is a cluster?
A set of compute resources on which you run data engineer, data science workloads, which are run as a set of commands on notebook or as a job.
What are the 2 cluster types?
All purpose clusters: use interactive notebooks.
Job clusters: run automated jobs
. How long does Databricks retain cluster configuration information?
30 days
Pin cluster to keep all purpose cluster after 30 days.
1.6. What are the three cluster modes?
Standard clusters: large amounts of data with Apache Spark.
Single Node clusters: jobs that use small amounts of data or non-distributed
workloads such as single-node machine learning libraries.
High Concurrency clusters: groups of users who need to share resources or run ad-hoc
jobs. Administrators usually create High Concurrency clusters.
Databricks recommends enabling
autoscaling for High Concurrency clusters.
To ensure that all data at rest is
encrypted for all storage types, including shuffle data that is stored temporarily on your cluster’s local
disks:
You can enable local disk encryption.
To reduce cluster start time
you can attach a cluster to a predefined pool of idle instances, for the driver
and worker nodes
Which magic command do you use to run a notebook from
another notebook?
%run ../Includes/Classroom-Setup-1.2
What is Databricks utilities and how can you use it to list out
directories of files from Python cells?
display(dbutils.fs.ls(“/databricks-datasets”))
What function should you use when you have tabular data
returned by a Python cell?
display()
What is Databricks Repos?
provides repository-level
integration with Git providers, allowing you to work in an environment that is backed by revision control
using Git
What is the definition of a Delta Lake?
technology at the heart of the Databricks Lakehouse platform. It is an open source
technology that enables building a data lakehouse on top of existing storage systems.
. How does Delta Lake address the data lake pain points to
ensure reliable, ready-to-go data?
ACID Transactions – Delta Lake adds ACID transactions to data lakes. ACID stands for atomicity,
consistency, isolation, and durability
Describe how Delta Lake brings ACID transactions to object
storage
Difficult to append data
Difficult to modify existing data
Jobs failing mid way
Real time operations are not easy
Costly to keep historical data versions
Is Delta Lake the default for all tables created in Databricks?
Yes, Delta Lakes is the default for all tables created in Databricks
What data objects are in the Databricks Lakehouse?
Catalog, database, table, view, function
What is a metastore?
contains all of the metadata that defines data objects in the lakehouse:
Including:
Unity catalog
Hive metastore
external metastore
What is a catalog?
the highest abstraction (or coarsest grain) in the Databricks Lakehouse relational model.
catalog_name.database_name.table_name
What is a Delta Lake table?
stores data as a directory of files on
cloud object storage and registers table metadata to the metastore within a catalog and schema
What is the syntax to create a Delta Table?
CREATE TABLE students
(id INT, name STRING, value DOUBLE);
CREATE TABLE IF NOT EXISTS students
(id INT, name STRING, value DOUBLE)
What is the syntax to insert data?
INSERT INTO students
VALUES
(4, “Ted”, 4.7),
(5, “Tiffany”, 5.5),
(6, “Vini”, 6.3)
What is the syntax to update particular records of a table?
UPDATE students
SET value = value + 1
WHERE name LIKE “T%”
What is the syntax to delete particular records of a table?
DELETE FROM students
WHERE value > 6
What is the syntax for merge and what are the benefits of
using merge?
MERGE INTO table_a a
USING table_b b
ON a.col_name=b.col_name
WHEN MATCHED AND b.col = X
THEN UPDATE SET *
WHEN MATCHED AND a.col = Y
THEN DELETE
WHEN NOT MATCHED AND b.col = Z
THEN INSERT *
Deduplication case (python syntax)
(deltaTable
.alias(“t”)
.merge(historicalUpdates.alias(“s”), “t.loan_id = s.loan_id”)
.whenNotMatchedInsertAll()
.execute())
What is the syntax to delete a table?
DROP TABLE students
What is Hive?
is built on top of Apache Hadoop, which is an open-source framework used to efficiently store and
process large datasets
What are the two commands to see metadata about a table?
DESCRIBE EXTENDED students
DESCRIBE DETAIL students
What is the syntax to display the Delta Lake files?
%python
display(dbutils.fs.ls(f”{DA.paths.user_db}/students”))
DESCRIBE DETAIL students
Describe the Delta Lake files, their format and directory
structure
%python
display(dbutils.fs.ls(f”{DA.paths.user_db}/students/_delta_log”))
What does the query engine do using the transaction logs
when we query a Delta Lake table?
resolve all the files that are valid in the current
version, and ignores all other data files.
You can look at a particular transaction log and see if records were inserted / updated / deleted.
%python
display(spark.sql(f”SELECT * FROM
json.{DA.paths.user_db}/students/_delta_log/00000000000000000007.json
”))
What commands do you use to compact small files and index
tables?
OPTIMIZE command allows you to combine files toward an optimal size:
OPTIMIZE events
OPTIMIZE events WHERE date >= ‘2017-01-01’
OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)
How do you review a history of table transactions?
Because all changes to the Delta Lake table are stored in the transaction log:
DESCRIBE HISTORY students
. How do you query and roll back to previous table version?
SELECT *
FROM students VERSION AS OF 3
How to roll back after deletion?
RESTORE TABLE students TO VERSION AS OF 8
What command do you use to clean up stale data files and
what are the consequences of using this command?
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
SET spark.databricks.delta.vacuum.logging.enabled = true;
VACUUM students RETAIN 0 HOURS DRY RUN
What is the advantage of using Delta cache?
optimize performance
What is the syntax to create a database with default location
(no location specified)?
CREATE DATABASE IF NOT EXISTS db_name_default_location;
What is the syntax to create a database with specified
location?
CREATE DATABASE IF NOT EXISTS db_name_custom_location LOCATION
‘path/db_name_custom_location.db’;
How do you get metadata information of a database? Where
are the databases located (difference between default vs custom
location)
DESCRIBE DATABASE EXTENDED db_name;
What’s the best practice when creating databases?
declare a location for a given database
What is the syntax for creating a table in a database with default location and inserting data?
USE db_name_default_location;
CREATE OR REPLACE TABLE managed_table_in_db_with_default_location (width INT, length
INT, height INT);
INSERT INTO managed_table_in_db_with_default_location
VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_default_location;
What is the syntax for a table in a database with custom location?
USE db_name_custom_location;
CREATE OR REPLACE TABLE managed_table_in_db_with_custom_location (width INT, length
INT, height INT);
INSERT INTO managed_table_in_db_with_custom_location
VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_custom_location;
Where are managed tables located in a database?
in the LOCATION of the database it is registered to
How can you find their location?
DESCRIBE EXTENDED managed_table_in_db;
What is the syntax to create an external table?
CREATE OR REPLACE TABLE external_table LOCATION ‘path/external_table’ AS
SELECT * FROM temp_delays;
SELECT * FROM external_table;
Python:
df.write.option(“path”, “/path/to/empty/directory”).saveAsTable(“table_name”)
What happens when you drop tables ?(Managed tables)
the table’s directory and its log and data files will be deleted,
only the database directory remains.
Managed tables:
DROP TABLE managed_table_in_db_with_default_location;
DROP TABLE managed_table_in_db_with_custom_location;
What happens when you drop tables a unmanaged table?
The table definition no longer exists in the metastore, but the underlying data
remain intact.
DROP TABLE external_table;
We still have access to the underlying data files:
%python
tbl_path = f”{DA.paths.working_dir}/external_table”
files = dbutils.fs.ls(tbl_path)
display(files)
What is the command to drop the database and its underlying tables and views?
Default location:
DROP DATABASE db_name_default_location CASCADE;
Customer location:
DROP DATABASE db_name_custom_location CASCADE;
How can you show a list of tables and views?
SHOW TABLES;
What is the difference between Views, Temp Views & Global Temp Views?
View: Persisted across multiple sessions, just like a table. No process or write data.
Temp View: not persisted across multiple sessions, is scoped to the query level
Global temp view: are scoped to the cluster level, registered to a separate database. Use views with appropriate table ACLs instead of global temporary views.
What is the syntax for view?
CREATE VIEW view_delays_abq_lax AS
SELECT *
FROM external_table
WHERE origin = ‘ABQ’ AND destination = ‘LAX’;
SELECT * FROM view_delays_abq_lax;
What is the syntax for temp view?
CREATE TEMPORARY VIEW temp_view_delays_gt_120
AS SELECT * FROM external_table WHERE delay > 120 ORDER BY delay ASC;
SELECT * FROM temp_view_delays_gt_120;
What is syntax for global temp view?
CREATE GLOBAL TEMPORARY VIEW global_temp_view_dist_gt_1000
AS SELECT * FROM external_table WHERE distance > 1000;
SELECT * FROM global_temp.global_temp_view_dist_gt_1000;
Show tables for global temp views
SHOW TABLES IN global_temp;
What is the syntax to select from global temp views?
SELECT * FROM global_temp.name_of_the_global_temp_view;
What is the syntax for defining a CTE in a subquery?
SELECT
max(total_delay) AS Longest Delay (in minutes)
FROM
(WITH delayed_flights(total_delay) AS (
SELECT
delay
FROM external_table)
SELECT
*
FROM
delayed_flights
);
What is the syntax for defining a CTE in a subquery expression?
SELECT
(
WITH distinct_origins AS (
SELECT DISTINCT origin FROM external_table
)
SELECT
count(origin) AS Number of Distinct Origins
FROM distinct_origins)
AS Number of Different Origin Airports
;
What is the syntax for defining a CTE in a CREATE VIEW statement?
CREATE OR REPLACE VIEW BOS_LAX
AS WITH origin_destination(origin_airport, destination_airport)
AS (SELECT origin, destination FROM external_table)
SELECT * FROM origin_destination
WHERE origin_airport = ‘BOS’ AND destination_airport = ‘LAX’;
SELECT count(origin_airport) AS Number of Delayed Flights from BOS to LAX
FROM
BOS_LAX;
How do you query data from a single file?
SELECT * FROM file_format./path/to/file
SELECT * FROM json.${da.paths.datasets}/raw/events-kafka/001.json
. How do you query a directory of files?
SELECT * FROM json.${da.paths.datasets}/raw/events-kafka