General Flashcards
Syntax for Generated Column
GENERATED ALWAYS AS (CAST(orderTime AS DATE))
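For context, a generated column is declared as part of a column definition inside CREATE TABLE. A minimal SQL sketch (table and column names are illustrative):

```sql
CREATE TABLE orders (
  orderId   BIGINT,
  orderTime TIMESTAMP,
  -- Value is computed automatically from orderTime on write
  orderDate DATE GENERATED ALWAYS AS (CAST(orderTime AS DATE))
);
```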
What is the main difference between AUTO LOADER and COPY INTO?
Auto Loader supports both directory listing and file notification, while COPY INTO only supports directory listing.
Why does AUTO LOADER require schema location?
The schema location is used to store the schema inferred by Auto Loader.
Explanation: Subsequent Auto Loader runs are faster because Auto Loader reuses the last known schema from the schema location instead of re-inferring it on every run.
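A minimal PySpark sketch of an Auto Loader stream with a schema location (requires a Databricks cluster where `spark` is the active SparkSession; paths and table name are illustrative):

```python
# cloudFiles is the Auto Loader source format on Databricks.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Inferred schema is persisted here and reused on later runs
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
    .load("/mnt/raw/orders")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .table("bronze_orders"))
```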
You are designing a data model that works for both machine learning using images and Batch ETL/ELT workloads. Which of the following features of data lakehouse can help you meet the needs of both workloads?
Data lakehouse can store unstructured data and support ACID transactions
Where does Databricks architecture host jobs/pipelines and queries?
Control Plane
Databricks Repos can implement what CI/CD operations?
Pull the latest version of code into production folder
Explanation: Not operations like pull requests and code reviews; those are handled by the Git provider (e.g., GitHub).
What’s the syntax for creating or overwriting an existing delta table?
CREATE OR REPLACE TABLE
Explanation: When you create a table in Databricks, it is stored in Delta format by default.
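A minimal SQL sketch (table and column names are illustrative):

```sql
-- Replaces the table definition and data if the table already exists;
-- unlike DROP + CREATE, the Delta table's history is preserved.
CREATE OR REPLACE TABLE sales (
  id     BIGINT,
  amount DOUBLE
);
```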
When a managed table is dropped, what happens to the data, metadata, and history?
They are also dropped from storage
When a notebook is detached and re-attached, what happens to session-scoped temporary views?
They are lost
When a notebook is detached and re-attached, what happens to global temporary views?
They can still be accessed
Use colon (:) syntax in queries to access subfields in ____ and use period (.) syntax in queries to access subfields in ____
JSON strings
Struct types
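A short SQL sketch contrasting the two (table and column names are illustrative):

```sql
-- Colon syntax: extract a field from a column holding a JSON string
SELECT raw:customer_id FROM events;

-- Period syntax: access a field of a STRUCT-typed column
SELECT profile.email FROM users;
```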
Assert syntax
assert row_count == 10, "Error message"
Python error handling
try: except:
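A minimal Python sketch combining both patterns, with the assertion caught by a try/except block (function and variable names are illustrative):

```python
def check_row_count(row_count, expected=10):
    """Raise AssertionError if the count does not match the expectation."""
    assert row_count == expected, f"Expected {expected} rows, got {row_count}"

try:
    check_row_count(9)
except AssertionError as e:
    # The assertion's message is available on the exception
    message = str(e)

print(message)  # prints the assertion's error message
```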
Python Spark Syntax to create a view on top of the delta stream(stream on delta table)?
spark.readStream.table("sales").createOrReplaceTempView("streaming_vw")
You are asked to build a data pipeline and notice that the data source has many data quality issues. You need to monitor and enforce data quality as part of the ingestion process. Which of the following tools can be used to address this problem?
Delta Live Tables
Explanation: Delta Live Tables expectations can be used to identify and quarantine bad data; all data quality metrics are stored in the event log, which can later be used for analysis and monitoring.
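A minimal sketch of DLT expectations (runs only inside a Delta Live Tables pipeline on Databricks; source table and expectation names are illustrative):

```python
import dlt

@dlt.table(comment="Orders with basic quality checks applied.")
# Drop rows that violate the expectation
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
# Keep violating rows, but record the metric in the event log
@dlt.expect("positive_amount", "amount > 0")
def clean_orders():
    return spark.read.table("raw_orders")
```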