General Flashcards
Syntax for Generated Column
GENERATED ALWAYS AS (CAST(orderTime as DATE))
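Example: a minimal sketch of where the clause sits in a full table definition, wrapped in Python so it can be passed to spark.sql(). The table name orders and the columns orderId and orderDate are made up for illustration; only orderTime and the GENERATED ALWAYS AS clause come from the card.

```python
# Hypothetical table and column names; only the GENERATED ALWAYS AS clause is from the card.
GENERATED_COLUMN_DDL = """
CREATE TABLE orders (
  orderId   BIGINT,
  orderTime TIMESTAMP,
  orderDate DATE GENERATED ALWAYS AS (CAST(orderTime AS DATE))
) USING DELTA
""".strip()


def create_orders_table(spark):
    """Run the DDL on an existing SparkSession (e.g. `spark` in a Databricks notebook)."""
    return spark.sql(GENERATED_COLUMN_DDL)
```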
What is the main difference between AUTO LOADER and COPY INTO?
Auto Loader supports both directory listing and file notification, but COPY INTO only supports directory listing.
Why does AUTO LOADER require schema location?
Schema location is used to store schema inferred by AUTO LOADER
Explanation: On later runs, Auto Loader reuses the last known schema stored at this location instead of inferring it again, so it starts faster.
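Example: a rough sketch of passing a schema location to Auto Loader. The function names and the paths you pass in are placeholders; the option keys cloudFiles.format and cloudFiles.schemaLocation are the standard Auto Loader settings. The reader itself only works on Databricks, so spark is taken as an argument rather than created here.

```python
def auto_loader_options(schema_location, file_format="json"):
    """Build the Auto Loader options (helper name is illustrative)."""
    return {
        "cloudFiles.format": file_format,
        # Auto Loader persists the inferred schema here and reuses it on later runs.
        "cloudFiles.schemaLocation": schema_location,
    }


def read_with_auto_loader(spark, source_path, schema_location):
    """Return a streaming DataFrame; `spark` is a Databricks SparkSession."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(schema_location))
        .load(source_path)
    )
```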
You are designing a data model that works for both machine learning using images and Batch ETL/ELT workloads. Which of the following features of data lakehouse can help you meet the needs of both workloads?
Data lakehouse can store unstructured data and support ACID transactions
Where does Databricks architecture host jobs/pipelines and queries?
Control Plane
Databricks Repos can implement what CI/CD operations?
Pull the latest version of code into production folder
Explanation: Not things like pull requests and reviews; those are handled by the Git provider (e.g. GitHub).
What’s the syntax for creating or overwriting an existing Delta table?
CREATE OR REPLACE TABLE
Explanation: When you create a table in Databricks, it is stored in Delta format by default.
When a managed table is dropped, what happens to the data, metadata, and history?
They are also dropped from storage
When a notebook is detached and re-attached, what happens to session-scoped temporary views?
They are lost
When a notebook is detached and re-attached, what happens to global temporary views?
They can still be accessed
Use colon (:) syntax in queries to access subfields in ____ and use period (.) syntax in queries to access subfields in ____
JSON strings
Struct types
Assert syntax
assert row_count == 10, "Error message"
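Example: a small sketch of the assert statement inside a helper. The function name check_row_count and the message text are made up for illustration.

```python
def check_row_count(row_count, expected=10):
    """Raise AssertionError with a readable message when the count is off."""
    assert row_count == expected, f"Expected {expected} rows, got {row_count}"
    return True


check_row_count(10)    # passes silently
# check_row_count(3)   # would raise AssertionError: Expected 10 rows, got 3
```

Note that Python skips assert statements when run with the -O flag, so they suit tests and sanity checks better than production validation.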
Python error handling
try: except:
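Example: a minimal try/except sketch. The function safe_divide is just an illustration; the point is to catch a specific exception type instead of using a bare except.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError as err:
        # Catch the specific exception rather than using a bare `except:`.
        print(f"Could not divide: {err}")
        return None
```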
Python Spark syntax to create a view on top of a Delta stream (a stream on a Delta table)?
spark.readStream.table("sales").createOrReplaceTempView("streaming_vw")
You are building a data pipeline and notice that the data source has a lot of data quality issues. You need to monitor data quality and enforce it as part of the data ingestion process. Which of the following tools can be used to address this problem?
Delta Live Tables
Explanation: Delta Live Tables expectations can be used to identify and quarantine bad data. All of the data quality metrics are stored in the event log, which can be analyzed and monitored later.
What are the different ways you can schedule a job in Databricks workspace?
Immediate, CRON, Continuous, when new files arrive
Databricks SQL queries are running slowly. All the queries are running in parallel and using a SQL endpoint (SQL Warehouse) with a single cluster. What can you do to improve the performance/response times of the queries?
Increase the maximum bound of the SQL endpoint’s scaling range.
Explanation: This question tests whether you know how to scale a SQL endpoint (SQL Warehouse). Look for cue words that tell you whether the queries run sequentially or concurrently. If they run sequentially, scale up (increase the cluster size, from 2X-Small up to 4X-Large). If they run concurrently or there are more users, scale out (add more clusters).
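Example: the rule of thumb above written as a tiny decision helper. This is only a memory aid, not a Databricks API; the function name and the wording of the returned strings are made up.

```python
def scaling_advice(queries_run_concurrently):
    """Map the cue from the question to the scaling direction (memory aid only)."""
    if queries_run_concurrently:
        # Many parallel queries or users: add clusters to the warehouse.
        return "scale out: raise the maximum number of clusters"
    # Queries run one after another: make each cluster bigger.
    return "scale up: choose a larger cluster size"
```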
What does the Auto Stop feature do?
It automatically terminates the cluster when you are not using it
Unity Catalog simplifies managing multiple workspaces by storing and managing permissions and ACLs at the _______ level
Account
What section in the UI can be used to manage permissions and grants to tables?
Data Explorer
What is not a privilege in Unity Catalog?
DELETE
Explanation: DELETE and UPDATE permissions do not exist; you have to use MODIFY, which provides both update and delete permissions.
Also: Table ACL privilege types are different from Unity Catalog privilege types, so read the question carefully.
Syntax for transferring ownership of a table to a group
ALTER TABLE table_name OWNER TO `group`
What is the array function that takes an input column and returns a unique list of values in an array?
collect_set()
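Example: a plain-Python model of what collect_set() computes per group, so it runs without Spark. The helper name and sample rows are made up; in PySpark the real call would look like df.groupBy("key").agg(F.collect_set("value")).

```python
from collections import defaultdict


def collect_set_by_key(rows):
    """Group (key, value) pairs and keep each value once per key, like collect_set()."""
    grouped = defaultdict(set)
    for key, value in rows:
        grouped[key].add(value)
    # Spark does not guarantee element order; sorting here just makes the result stable.
    return {key: sorted(values) for key, values in grouped.items()}


rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3)]
print(collect_set_by_key(rows))  # {'a': [1, 2], 'b': [3]}
```

By contrast, collect_list() would keep the duplicate 1 for key "a".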
What is the default location where Spark stores user databases?
dbfs:/user/hive/warehouse
When can INSERT OVERWRITE update the schema?
When spark.databricks.delta.schema.autoMerge.enabled is set to true
Which of these is NOT a valid messaging option for job notifications? (SMS, Email, PagerDuty, Messaging Webhook, SES, SNS)
SMS
Databricks Web Application is hosted in the Control Plane or Data Plane?
Control Plane
Notebooks and Jobs are hosted in the Control Plane or Data Plane?
Control Plane
What are the output modes (set with outputMode, not the trigger) when writing to a streaming table?
Append and Complete
What command can be used to write data into a Delta table while avoiding the writing of duplicate records?
MERGE
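Example: a sketch of an insert-only MERGE, which adds only the source rows whose key is not already in the target. The helper name and the table and key names in the usage line are hypothetical; the statement would be run with spark.sql().

```python
def insert_only_merge_sql(target, source, key):
    """Build a MERGE that inserts only rows whose key is missing from the target."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical table and key names.
print(insert_only_merge_sql("sales", "sales_updates", "order_id"))
```

This only guards against rows that already exist in the target; duplicates inside the source itself still need to be removed before the merge.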
What tool provides Data Access control, Access Audit, Data Lineage, and Data discovery?
Unity Catalog
What is the default trigger interval for structured streaming queries?
half a second
How do you establish a trigger to micro-batch data every 5 seconds?
.trigger(processingTime="5 seconds")
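Example: a sketch of where the trigger fits in a streaming write. The helper names, the target table, and the checkpoint path are placeholders; stream_df is assumed to be a streaming DataFrame created elsewhere (e.g. with spark.readStream).

```python
def processing_time_trigger(seconds):
    """Keyword arguments for .trigger() with a fixed micro-batch interval."""
    return {"processingTime": f"{seconds} seconds"}


def start_micro_batch_write(stream_df, target_table, checkpoint_path, seconds=5):
    """Start a streaming write that fires a micro-batch every `seconds` seconds."""
    return (
        stream_df.writeStream
        .trigger(**processing_time_trigger(seconds))
        .option("checkpointLocation", checkpoint_path)
        .toTable(target_table)
    )
```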
How do you return a GroupedData object?
DataFrame.groupBy()
How would you describe a database named db_hr?
DESCRIBE DATABASE db_hr;