General Flashcards

1
Q

In the UI, where can you go to facilitate granting select access to a user

A

Data Explorer

2
Q

What is Delta Lake

A

An open-source storage format, like Parquet, with additional capabilities that can provide reliability, security, and performance

3
Q

How is the data organized in storage when managing a delta table

A

All of the data is stored as one or more Parquet files; the transaction log is stored as one or more JSON files; and each transaction creates a new data file and a new log file

4
Q

What is the underlying technology that makes auto loader work

A

Structured Streaming

5
Q

When should you use Auto Loader vs COPY INTO

A
  • You want to load from a location that contains files in the order of millions or higher. Auto Loader can discover files more efficiently
  • Auto Loader supports file notification
  • Data schema evolves frequently
6
Q

When loading with Auto Loader, how do you deal with an evolving schema

A

mergeSchema will infer the schema across multiple files and merge the schema of each file
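
As a plain-Python illustration of what schema merging means (a sketch of the semantics only, not Spark or Auto Loader code), merging amounts to unioning the column sets seen across files:

```python
# Sketch of mergeSchema semantics: each file contributes a mapping of
# column name -> type, and the merged schema is the union of all columns,
# keeping the first type seen per column.
def merge_schemas(file_schemas):
    merged = {}
    for schema in file_schemas:
        for column, dtype in schema.items():
            merged.setdefault(column, dtype)
    return merged

files = [
    {"id": "int", "name": "string"},
    {"id": "int", "email": "string"},  # a new column appears in a later file
]
print(merge_schemas(files))  # {'id': 'int', 'name': 'string', 'email': 'string'}
```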

7
Q

How can you use merge to deduplicate upon writing

A

MERGE INTO target USING source
ON target.key = source.key
WHEN NOT MATCHED THEN INSERT *
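
A plain-Python sketch of what this MERGE does (semantics only, not Spark code): only source rows whose key has no match in the target are inserted, so re-running the write does not create duplicates:

```python
def merge_insert_not_matched(target, source, key):
    """Sketch of MERGE ... WHEN NOT MATCHED THEN INSERT *:
    insert a source row only when its key is absent from the target."""
    existing = {row[key] for row in target}
    for row in source:
        if row[key] not in existing:
            target.append(row)
            existing.add(row[key])  # also dedupes within the source batch
    return target

target = [{"key": 1, "val": "a"}]
source = [{"key": 1, "val": "a"}, {"key": 2, "val": "b"}, {"key": 2, "val": "b"}]
merge_insert_not_matched(target, source, "key")
# target now holds keys 1 and 2 exactly once
```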

8
Q

How do you use merge to delete all target rows that have no matches in the source table?

A

MERGE INTO target USING source
ON target.key = source.key
WHEN NOT MATCHED BY SOURCE THEN DELETE
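
A plain-Python sketch of this clause (semantics only, not Spark code): every target row whose key does not appear in the source is deleted:

```python
def merge_delete_not_matched_by_source(target, source, key):
    """Sketch of MERGE ... WHEN NOT MATCHED BY SOURCE THEN DELETE:
    keep only target rows whose key appears in the source."""
    source_keys = {row[key] for row in source}
    return [row for row in target if row[key] in source_keys]

target = [{"key": 1}, {"key": 2}, {"key": 3}]
source = [{"key": 2}]
merge_delete_not_matched_by_source(target, source, "key")  # keeps only key 2
```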

9
Q

Where can you include timeouts in jobs

A

in the task

10
Q

How can you automate an alert

A

By having it on a refresh schedule

11
Q

How do you grant read capability to a table

A

GRANT SELECT, USAGE ON TABLE customers TO `some@email.com`

12
Q

What type of constraint keeps the bad records and adds them to the target dataset

A

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01')

13
Q

What type of constraint drops bad records

A

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW

14
Q

What type of constraint fails when there is a bad record

A

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE

15
Q

When creating a db what is the default location of the database

A

dbfs:/user/hive/warehouse

16
Q

How do you create an external table

A

CREATE TABLE transactions (id INT, desc STRING) USING DELTA LOCATION '/mnt/delta/transactions'

17
Q

When you drop an external DELTA table using the SQL Command DROP TABLE table_name, how does it impact metadata(delta log, history), and data stored in the storage?

A

Drops the table from the metastore, but keeps the metadata (delta log, history) and the data in storage

18
Q

What is the best way to query external csv files located on DBFS Storage to inspect the data using SQL?

A

SELECT * FROM csv.`dbfs:/location/csv_files/`

19
Q

How do you create a table from csv with headers, delimiter of | in SQL

A

CREATE TABLE customer (customerId INT, customerName STRING)
USING CSV
OPTIONS (header = "true", delimiter = "|")

20
Q

How can you do a distinct count that ignores null values

A

COUNT(DISTINCT column_name)
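
A plain-Python sketch of that semantics (not Spark code): a distinct count skips NULLs (None here) before counting unique values:

```python
def count_distinct(values):
    """Sketch of COUNT(DISTINCT col): NULLs (None) are ignored."""
    return len({v for v in values if v is not None})

count_distinct([1, 1, 2, None, None])  # 2
```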

21
Q

What command combines two datasets while eliminating duplicates

A

UNION (by default UNION removes duplicates; UNION ALL keeps them)

22
Q

What operations are not supported on streaming data

A

Multiple streaming aggregations
Limit and take the first N
Distinct
Deduplication (without watermarking)
Sorting without aggregation

23
Q

What are some examples of operations allowed on streaming data

A

Sum()
Max()
Count()

24
Q

What technique does structured streaming use to ensure recovery of failures during stream processing?

A

Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.

25
Q

What is an example of creating a generated data column?

A

GENERATED ALWAYS AS (CAST(orderTime AS DATE))
26
Q

Which plane hosts jobs/pipelines and queries?

A

Control Plane
27
Q

T/F: Databricks lakehouse does not support stored procedures

A

True
28
Q

What keyword means you are creating an external table?

A

LOCATION. For example: CREATE TABLE table_name (column column_data_type…) USING format LOCATION "dbfs:/"
29
Q

What is an example of a UDF?

A

CREATE FUNCTION udf_convert(temp DOUBLE, measure STRING) RETURNS DOUBLE RETURN CASE WHEN measure = 'F' THEN (temp * 9/5) + 32 ELSE (temp - 32) * 5/9 END
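
The same logic in plain Python, assuming as the card does that measure 'F' means the result should be in Fahrenheit:

```python
def udf_convert(temp, measure):
    """Mirror of the SQL UDF: 'F' converts Celsius -> Fahrenheit,
    anything else converts Fahrenheit -> Celsius."""
    if measure == "F":
        return temp * 9 / 5 + 32
    return (temp - 32) * 5 / 9

udf_convert(100, "F")  # 212.0
udf_convert(212, "C")  # 100.0
```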
30
Q

What is a command that will create a view on top of a delta stream?

A

spark.readStream.format("delta").table("sales").createOrReplaceTempView("streaming_vw")
31
Q

How does structured streaming achieve end-to-end fault tolerance?

A

First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval. Next, the streaming sinks are designed to be idempotent, so replaying a batch after a failure does not produce duplicates.
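
A plain-Python sketch of the two ingredients (illustrative only, not Spark internals): a checkpoint remembers the last committed batch, and an idempotent sink makes replaying a batch harmless:

```python
# Sketch: checkpoint + idempotent sink = end-to-end fault tolerance.
class IdempotentSink:
    def __init__(self):
        self.rows = {}

    def write(self, batch_id, rows):
        # Writing the same batch id twice leaves the sink unchanged.
        self.rows[batch_id] = rows

checkpoint = {"last_batch": -1}
sink = IdempotentSink()

def process(batch_id, rows):
    sink.write(batch_id, rows)        # may be replayed after a crash
    checkpoint["last_batch"] = batch_id

process(0, ["a", "b"])
process(0, ["a", "b"])  # replay after a simulated failure: no duplicates
process(1, ["c"])
```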
32
Q

What are the three types of expectations with Delta Live Tables?

A

Keep - records that violate the expectation are added to the target dataset along with valid records: CONSTRAINT valid_timestamp EXPECT (timestamp > '2012-01-01')
Drop - drop invalid records: CONSTRAINT valid_current_page EXPECT (current_page_id IS NOT NULL AND current_page_title IS NOT NULL) ON VIOLATION DROP ROW
Fail - halt execution immediately when a record fails validation: CONSTRAINT valid_count EXPECT (count > 0) ON VIOLATION FAIL UPDATE
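
A plain-Python sketch of the three policies (illustrative only, not DLT code):

```python
def apply_expectation(records, predicate, on_violation="keep"):
    """Sketch of DLT expectation semantics: 'keep' retains bad records,
    'drop' filters them out, 'fail' raises on the first violation."""
    if on_violation == "fail":
        for r in records:
            if not predicate(r):
                raise ValueError(f"expectation violated: {r}")
        return list(records)
    if on_violation == "drop":
        return [r for r in records if predicate(r)]
    return list(records)  # 'keep': retain everything, valid or not

rows = [{"count": 3}, {"count": 0}]
apply_expectation(rows, lambda r: r["count"] > 0, "drop")  # [{'count': 3}]
```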
33
Q

When building a DLT pipeline you have two options to create live tables. What is the main difference between CREATE STREAMING LIVE TABLE and CREATE LIVE TABLE?

A

CREATE STREAMING LIVE TABLE is used when working with streaming data sources and incremental data
34
Q

Where can you see the job history in the UI?

A

In the Jobs UI, select the job you are interested in; under Runs you can see current active runs and the last 60 days of historical runs
35
Q

What are the different ways you can schedule a job in the Databricks workspace?

A

Cron, on demand
36
Q

On a SQL endpoint (SQL warehouse) single cluster, how many queries can run at a time?

A

10
37
Q

Which section in the UI can be used to manage permissions and grants to tables?

A

Data Explorer
38
Q

What are two examples of privileges not available in Unity Catalog?

A

DELETE and UPDATE
39
Q

How do you transfer a table's ownership?

A

ALTER TABLE table_name OWNER TO `group`