Production Pipelines (Test Qs) Flashcards
The data engineering team noticed that one of the jobs fails randomly because it runs on spot instances. Which feature of Jobs/Tasks can be used to address this issue so the job is more stable when using spot instances?
Use Databricks REST API to monitor and restart the job
Add a second task with a check condition to re-run the first task if it fails
Restart the job cluster; the job automatically restarts
Add a retry policy
The answer is, Add a retry policy to the task
Tasks in Jobs support a retry policy, which can be used to automatically retry a failed task; when using spot instances it is common to lose executors or even the driver, so retries make the job more stable. A sketch of setting a retry policy is shown below.
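A minimal sketch of how a retry policy might be set on a task through the Jobs API 2.1 (Python with the requests library); the workspace URL, token, job name, notebook path, and cluster id are placeholders, not values from the question.

    import requests

    HOST = "https://<workspace-url>"        # placeholder workspace URL
    TOKEN = "<personal-access-token>"       # placeholder access token

    # Task-level retry settings: retry up to 3 times with a 2-minute wait
    # between attempts, so occasional spot-instance evictions do not fail the job.
    payload = {
        "name": "nightly-etl",              # hypothetical job name
        "tasks": [
            {
                "task_key": "ingest",
                "notebook_task": {"notebook_path": "/Repos/etl/ingest"},  # placeholder path
                "existing_cluster_id": "<cluster-id>",
                "max_retries": 3,
                "min_retry_interval_millis": 120_000,
                "retry_on_timeout": False,
            }
        ],
    }

    resp = requests.post(
        f"{HOST}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
    )
    resp.raise_for_status()
    print(resp.json())  # expected to contain the new job_id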
A particular job seems to be performing slower and slower over time, and the team thinks this started happening when a recent production change was implemented. You were asked to take a look at the job history and see if you can identify trends and the root cause. Where in the workspace UI can you perform this analysis?
Under the Jobs UI select the job cluster; under the Spark UI select the application job logs, then you can access the last 60 days of historical runs
Under the Jobs UI select the job you are interested in; under Runs you can see the current active runs and the last 60 days of historical runs
Historical job runs can only be accessed via the REST API
Under Workspace logs, select job logs, select the job you want to monitor, and view the last 60 days of historical runs
The answer is,
Under the Jobs UI select the job you are interested in; under Runs you can see the current active runs and the last 60 days of historical runs
What are the different ways you can schedule a job in Databricks workspace?
Once, Continuous
Cron, File notification from Cloud object storage
Cron, On-Demand runs
On-Demand runs, File notification from Cloud object storage
Continuous, Incremental
The answer is, Cron, On-Demand runs
Jobs support running immediately (on demand) or on a schedule defined using CRON syntax, as in the sketch below.
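A minimal sketch of both scheduling options using the Jobs API 2.1 (Python, requests); the cron expression, job id, and credentials are placeholder values chosen for illustration.

    import requests

    HOST = "https://<workspace-url>"       # placeholder workspace URL
    TOKEN = "<personal-access-token>"      # placeholder access token
    HEADERS = {"Authorization": f"Bearer {TOKEN}"}

    # 1) CRON schedule: run an existing job every day at 02:00 UTC (Quartz syntax).
    requests.post(
        f"{HOST}/api/2.1/jobs/update",
        headers=HEADERS,
        json={
            "job_id": 123,                  # placeholder job id
            "new_settings": {
                "schedule": {
                    "quartz_cron_expression": "0 0 2 * * ?",
                    "timezone_id": "UTC",
                    "pause_status": "UNPAUSED",
                }
            },
        },
    ).raise_for_status()

    # 2) On-demand run: trigger the same job immediately.
    requests.post(
        f"{HOST}/api/2.1/jobs/run-now",
        headers=HEADERS,
        json={"job_id": 123},
    ).raise_for_status()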
You have noticed that Databricks SQL queries are running slow. You are asked to look into the reason why queries are running slow and identify steps to improve the performance. When you looked at the issue you noticed that all the queries are running in parallel and using a SQL endpoint (SQL warehouse) with a single cluster. Which of the following steps can be taken to improve the performance/response times of the queries?
*Please note Databricks recently renamed SQL endpoint to SQL warehouse.
They can increase the cluster size of the SQL endpoint from 2X-Small up to 4X-Large
They can turn on the Auto Stop feature for the SQL endpoint
They can turn on the Serverless feature for the SQL endpoint
They can increase the maximum bound of the SQL endpoint’s scaling range
The answer is, They can increase the maximum bound of the SQL endpoint's scaling range. When you increase the max scaling range, more clusters are added, so queries can start running on the available clusters instead of waiting in the queue; see below for more explanation.
The question tests your ability to scale a SQL endpoint (SQL warehouse); look for cue words and determine whether the queries are running sequentially or concurrently. If the queries are running sequentially, scale up (increase the cluster size from 2X-Small toward 4X-Large); if the queries are running concurrently or with more users, scale out (add more clusters).
SQL endpoint (SQL warehouse) overview (please read all of the points below):
A SQL Warehouse should have at least one cluster
A cluster comprises one driver node and one or many worker nodes
The number of worker nodes in a cluster is determined by the size of the cluster (2X-Small -> 1 worker, X-Small -> 2 workers, ... up to 4X-Large -> 128 workers); increasing the size is called scale up
A single cluster, irrespective of its size (2X-Small to 4X-Large), can only run 10 queries at any given time. If a user submits 20 queries all at once to a warehouse with a 3X-Large cluster and cluster scaling (min 1, max 1), 10 queries start running and the remaining 10 wait in a queue until those 10 finish.
Increasing the warehouse cluster size can improve the performance of a single query; for example, a query that runs in 1 minute on a 2X-Small warehouse may run in 30 seconds on an X-Small warehouse, because 2X-Small has 1 worker node and X-Small has 2 worker nodes, so the query's tasks are spread over more workers and finish faster (note: this is an idealized example; query performance depends on many factors and does not always scale linearly).
A warehouse can have more than one cluster; this is called scale out. If a warehouse is configured with an X-Small cluster size and cluster scaling (min 1, max 2), Databricks spins up an additional cluster when it detects queries waiting in the queue. For example, if a user submits 20 queries, 10 start running and the rest are held in the queue; Databricks automatically starts the second cluster and redirects the 10 queued queries to it.
A single query will not span more than one cluster; once a query is submitted to a cluster it remains on that cluster until execution finishes, irrespective of how many clusters are available to scale. A sketch of configuring the scaling knobs is shown below.
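A minimal sketch of how the scale-up and scale-out knobs map onto warehouse settings, using the SQL Warehouses REST API (Python, requests); the warehouse name, sizes, bounds, and credentials are illustrative placeholders.

    import requests

    HOST = "https://<workspace-url>"       # placeholder workspace URL
    TOKEN = "<personal-access-token>"      # placeholder access token

    # cluster_size controls scale up (bigger cluster, faster single query);
    # min/max_num_clusters controls scale out (more clusters, more concurrent queries).
    resp = requests.post(
        f"{HOST}/api/2.0/sql/warehouses",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "name": "analytics-warehouse",  # hypothetical name
            "cluster_size": "X-Small",      # scale up: 2X-Small, X-Small, ... 4X-Large
            "min_num_clusters": 1,          # scale-out lower bound
            "max_num_clusters": 3,          # scale-out upper bound
            "auto_stop_mins": 30,
        },
    )
    resp.raise_for_status()
    print(resp.json())  # expected to contain the new warehouse id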
You worked with the data analyst team to set up a SQL endpoint (SQL warehouse) so they can easily query and analyze data in the gold layer, but once they started consuming the SQL endpoint (SQL warehouse) you noticed that during peak hours, as the number of users increases, queries take longer to finish. Which of the following steps can be taken to resolve the issue?
*Please note Databricks recently renamed SQL endpoint to SQL warehouse.
They can increase the cluster size of the SQL endpoint from 2X-Small up to 4X-Large
They can turn on the Auto Stop feature for the SQL endpoint
They can turn on the Serverless feature for the SQL endpoint
They can increase the maximum bound of the SQL endpoint’s scaling range
The answer is, They can increase the maximum bound of the SQL endpoint's scaling range. When you increase the maximum bound you can add more clusters to the warehouse, which can then run the queries waiting in the queue; see the Scale out explanation in the previous question, and the sketch below for how the maximum bound can be raised.
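A minimal sketch of raising the maximum bound on an existing warehouse through the SQL Warehouses REST API (Python, requests); the warehouse id, new bound, and credentials are placeholders, and it assumes the edit endpoint accepts a partial settings payload.

    import requests

    HOST = "https://<workspace-url>"         # placeholder workspace URL
    TOKEN = "<personal-access-token>"        # placeholder access token
    WAREHOUSE_ID = "<warehouse-id>"          # placeholder warehouse id

    # Raise only the upper bound of the scaling range so extra clusters are added
    # during peak hours and queued queries get picked up instead of waiting.
    resp = requests.post(
        f"{HOST}/api/2.0/sql/warehouses/{WAREHOUSE_ID}/edit",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"max_num_clusters": 5},        # illustrative new upper bound
    )
    resp.raise_for_status()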