General Flashcards
In the UI, where can you go to facilitate granting select access to a user
Data Explorer
What is Delta Lake
an open source storage format, like parquet, with additional capabilities that can provide reliability, security, and performance
How is the data organized in storage when managing a delta table
All of the data is broken down into one or more parquet files, log files are broken down to one or more json files, and each transaction creates a new data file and log file
What is the underlying technology that makes auto loader work
Structured Streaming
When should you use AutoLoader vs Copy Into
- You want to load from a location that contains files in the order of millions or higher. Auto Loader can discover files more efficiently
- Auto Loader supports file notification
- Data schema evolves frequently
When loading with Auto loader, how do you deal with an evolving schema
mergeSchema will infer the schema across multiple files and merge the schema of each file
How can you use merge to deduplicate upon writing
MERGE INTO target USING source
ON target.key = source.key
WHEN NOT MATCHED THEN INSERT *
How do you use merge to delete all target rows that have no matches in the source table?
MERGE INTO target USING source
ON target.key = source.key
WHEN NOT MATCHED BY SOURCE THEN DELETE
Where can you include timeouts in jobs
in the task
How can you automate an alert
By having it on a refresh schedule
How do you grant read capability to a table
GRANT SELECT, USAGE ON TABLE customers to some@Email.com
What type of constraint keeps the bad records and adds them to the target dataset
CONSTRAINT valid_timestamp EXPECT (timestamp > ‘2020-01-01’)
What type of constraint drops bad records
CONSTRAINT valid_timestamp EXPECT (timestamp > ‘2020-01-01’) ON VIOLATION DROP ROW
What type of constraint fails when there is a bad record
CONSTRAINT valid_timestamp EXPECT (timestamp > ‘2020-01-01’) ON VIOLATION FAIL UPDATE
When creating a db what is the default location of the database
dbfs:/user/hive/warehouse
How do you create an external table
Answer is CREATE TABLE transactions (id int, desc string) USING DELTA LOCATION ‘/mnt/delta/transactions’
When you drop an external DELTA table using the SQL Command DROP TABLE table_name, how does it impact metadata(delta log, history), and data stored in the storage?
Drops table from metastore, but keeps metadata (delta log, history, and data in storage
What is the best way to query external csv files located on DBFS Storage to inspect the data using SQL?
SELECT * FROM CSV. ‘dbfs:/location/csv_files/’
How do you create a table from csv with headers, delimiter of | in SQL
CREATE TABLE customer(customerId, customerName)
USING CSV
OPTIONS(header=”true”, delimiter=”|”)
How can you do a distinct count that ignores null values
Count(DISTINCT *)
What command combines two datasets while eliminating duplicates
UNION
What options are not allowed on streaming data
Select *
Multiple streaming aggregations
Limit and take the first N
Distinct
Deduplication
Sorting WITHOUT aggregation
What are some examples of options allowed on streaming data
Sum()
Max()
Count()
What technique does structured streaming use to ensure recovery of failures during stream processing?
Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.