Databricks Data Science & Engineering Workspace (Databricks Q) Flashcards
Which of the following resources reside in the data plane of a Databricks deployment? Select one response.
Notebooks
Job scheduler
Cluster manager
Databricks File System (DBFS)
Web application
Databricks File System (DBFS)
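For context on the answer above: DBFS is backed by object storage in the customer's cloud account (the data plane) and can be browsed directly from a notebook. A minimal sketch, assuming a workspace where the built-in /databricks-datasets sample path is available:

    # Minimal sketch: listing files stored in DBFS from a notebook cell.
    # `dbutils` is available automatically in Databricks notebooks; the path
    # below assumes the standard /databricks-datasets sample mount exists.
    files = dbutils.fs.ls("/databricks-datasets")
    for f in files[:5]:
        print(f.path, f.size)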
Which of the following cluster configuration options can be customized at the time of cluster creation? Select all that apply.
Cluster mode
Databricks Runtime Version
Restart policy
Access permissions
Maximum number of worker nodes
Cluster mode
Databricks Runtime Version
Maximum number of worker nodes
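For context, the options above map onto fields of the cluster specification that are fixed when the cluster is created. A minimal sketch using the Clusters API, with placeholder workspace URL, token, node type, and runtime version (all illustrative assumptions, not recommendations):

    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
    TOKEN = "<personal-access-token>"                         # placeholder

    # Illustrative create payload: the runtime version and worker limits are
    # set here, at creation time.
    cluster_spec = {
        "cluster_name": "etl-dev",
        "spark_version": "13.3.x-scala2.12",                  # Databricks Runtime version
        "node_type_id": "i3.xlarge",                          # example AWS node type
        "autoscale": {"min_workers": 2, "max_workers": 8},    # maximum number of worker nodes
    }

    resp = requests.post(
        f"{HOST}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=cluster_spec,
    )
    print(resp.json())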
A data engineer wants to stop running a cluster without losing the cluster’s configuration. The data engineer is not an administrator.
Which of the following actions can the data engineer take to satisfy their requirements and why? Select one response.
Terminate the cluster; clusters are retained for 30 days after they are terminated.
Delete the cluster; clusters are retained for 30 days after they are deleted.
Edit the cluster; clusters can be saved as templates in the cluster configuration page before they are deleted.
Delete the cluster; clusters are retained for 60 days after they are deleted.
Detach the cluster; clusters are retained for 70 days after they are detached from a notebook.
Terminate the cluster; clusters are retained for 30 days after they are terminated.
A data engineering team is working on a shared repository. Each member of the team has cloned the target repository and is working in a separate branch.
Which of the following is considered best practice for the team members to commit their changes to the centralized repository? Select one response.
The data engineers can each sync their changes with the main branch from the Git terminal, which will automatically commit their changes.
The data engineers can each run a job based on their branch in the Production folder of the shared repository so the changes can be merged into the main branch.
The data engineers can each create a pull request to be reviewed by other members of the team before merging the code changes into the main branch.
The data engineers can each call the Databricks Repos API to submit the code changes for review before they are merged into the main branch.
The data engineers can each commit their changes to the main branch using an automated pipeline after a thorough code review by other members of the team.
The data engineers can each create a pull request to be reviewed by other members of the team before merging the code changes into the main branch.
A data engineer is creating a multi-node cluster.
Which of the following statements describes how workloads will be distributed across this cluster? Select one response.
Workloads are distributed across available memory by the executor.
Workloads are distributed across available worker nodes by the driver node.
Workloads are distributed across available driver nodes by the worker node.
Workloads are distributed across available worker nodes by the executor.
Workloads are distributed across available compute resources by the executor.
Workloads are distributed across available worker nodes by the driver node.
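A minimal PySpark sketch of this driver/worker split: the code below runs on the driver, which splits the work into tasks, while executors on the worker nodes process the partitions in parallel.

    from pyspark.sql import SparkSession

    # On Databricks, `spark` is predefined; getOrCreate() keeps the sketch self-contained.
    spark = SparkSession.builder.getOrCreate()

    # The driver plans 8 tasks (one per partition); executors on worker nodes run them.
    df = spark.range(0, 1_000_000, numPartitions=8)
    total = df.selectExpr("sum(id) AS total").collect()[0]["total"]  # result returns to the driver
    print(total)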
Which of the following statements describe how to clear the execution state of a notebook? Select two responses.
Detach and reattach the notebook to a cluster.
Perform a Clean operation from the terminal.
Perform a Clean operation from the driver logs.
Perform a Clear State operation from the Spark UI.
Use the Clear State option from the Run dropdown menu.
Detach and reattach the notebook to a cluster.
Use the Clear State option from the Run dropdown menu.
Which of the following resources reside in the control plane of a Databricks deployment? Select two responses.
Job scheduler
Job configurations
JDBC and SQL data sources
Notebook commands
Databricks File System (DBFS)
Job scheduler
Notebook commands
Three data engineers are collaborating on a project using a Databricks Repo. They are working on the same notebook at separate times of the day.
Which of the following is considered best practice for collaborating in this way? Select one response.
The engineers can each work in their own branch for development to avoid interfering with each other.
The engineers can each design, develop, and trigger their own Git automation pipeline.
The engineers can each create their own Databricks Repo for development and merge changes into a main repository for production.
The engineers can use a separate internet-hosting service to develop their code in a single repository before merging their changes into a Databricks Repo.
The engineers can set up an alert schedule to notify them when changes have been made to their code.
The engineers can each work in their own branch for development to avoid interfering with each other.
A data engineer is working on an ETL pipeline. The notebook depends on several utility functions, and the engineer wants to break them out into simpler, reusable components.
Which of the following approaches accomplishes this? Select one response.
Create a separate notebook for the utility commands and use the %run magic command in the original notebook to run the notebook with the utility commands.
Create a separate notebook for the utility commands and use an import statement at the beginning of the original notebook to reference the notebook with the utility commands.
Create a separate task for the utility commands and make the notebook dependent on the task from the original notebook’s Directed Acyclic Graph (DAG).
Create a pipeline for the utility commands and run the pipeline from within the original notebook using the %md magic command.
Create a separate job for the utility commands and run the job from within the original notebook using the %cmd magic command.
Create a separate notebook for the utility commands and use the %run magic command in the original notebook to run the notebook with the utility commands.
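A minimal sketch of the %run pattern described above (the notebook path and helper name are hypothetical):

    # Cell in the utility notebook (e.g. ./etl_utils) -- shared helper definitions:
    def clean_column_names(df):
        """Normalize column names so downstream cells can rely on snake_case."""
        return df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])

    # Cell in the original ETL notebook -- the magic command must be the only
    # content of its cell; it executes ./etl_utils and pulls its definitions
    # into the current notebook's scope:
    # %run ./etl_utils

    # Later cells can then call the helper directly, e.g.:
    # df_clean = clean_column_names(df_raw)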
A data engineer is having trouble locating a dashboard named samples. They know that the dashboard was created in 2022 by one of their colleagues.
Which of the following steps can the data engineer take to find the dashboard? Select one response.
They can use the search feature and filter their search by data object, date last modified, and owner.
They can run DESCRIBE HISTORY '2022-01-01'; within a Databricks notebook, which will list the names of any data objects created after that timestamp.
They can query the event log of the cluster that the dashboard was created on.
They can run DESCRIBE LOCATION samples; within a Databricks notebook, which will list the locations of any dashboards with the same name.
They can run DESCRIBE DETAIL samples; within a Databricks notebook, which will list the locations of any dashboards with the same name.
They can use the search feature and filter their search by data object, date last modified, and owner.
A data engineer is trying to merge their development branch into the main branch for a data project’s repository.
Which of the following is a correct argument for why it is advantageous for the data engineering team to use Databricks Repos to manage their notebooks? Select one response.
Databricks Repos allows integrations with popular tools such as Tableau, Looker, Power BI, and RStudio.
Databricks Repos provides a centralized, immutable history that cannot be manipulated by users.
Databricks Repos uses one common security model to access each individual notebook, or a collection of notebooks, and experiments.
Databricks Repos REST API enables the integration of data projects into CI/CD pipelines.
Databricks Repos provides access to available data sets and data sources, on-premises or in the cloud.
Databricks Repos REST API enables the integration of data projects into CI/CD pipelines.
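A minimal sketch of how that CI/CD integration can look, assuming placeholder workspace URL, token, and repo ID: after a pull request is merged, a pipeline step calls the Repos API to move the production repo to the head of the main branch.

    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder
    TOKEN = "<personal-access-token>"                         # placeholder
    REPO_ID = "<repo-id>"                                     # placeholder (see GET /api/2.0/repos)

    # Update the Databricks Repo to the latest commit on main after a merge.
    resp = requests.patch(
        f"{HOST}/api/2.0/repos/{REPO_ID}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"branch": "main"},
    )
    resp.raise_for_status()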
Due to the platform administrator’s policies, a data engineer needs to use a single cluster to process one very large batch of files for an ETL workload. The workload is automated, and the cluster will only be used by one workload at a time. The engineer’s organization wants them to minimize costs wherever possible.
Which of the following cluster configurations can the team use to satisfy their requirements? Select one response.
High concurrency all-purpose cluster
Multi node job cluster
Single node job cluster
Single node all-purpose cluster
Multi node all-purpose cluster
Multi node job cluster
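For context, a job cluster is declared inside the job definition itself, so it is created for the run and terminated automatically when the run finishes. A minimal sketch of such a Jobs API payload, with hypothetical names and sizes:

    # Illustrative job definition: "new_cluster" makes this a multi node job cluster
    # that exists only for the duration of the automated run (values are assumptions).
    job_spec = {
        "name": "nightly-etl",
        "tasks": [
            {
                "task_key": "ingest",
                "notebook_task": {"notebook_path": "/Repos/prod/etl/ingest"},
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "num_workers": 8,   # multiple workers for the very large batch
                },
            }
        ],
    }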
Two data engineers are collaborating on one notebook in the same repository. Each is worried that if they work on the notebook at different times, they might overwrite changes that the other has made to the code within the notebook.
Which of the following explains why collaborating in Databricks Notebooks prevents these problems from occurring? Select one response.
Databricks Notebooks enforces serializable isolation levels, so the data engineers will never see inconsistencies in their data.
Databricks Notebooks are integrated into CI/CD pipelines by default, so the data engineers can work in separate branches without overwriting the other’s work.
Databricks Notebooks supports alerts and audit logs for easy monitoring and troubleshooting, so the data engineers will be alerted when changes are made to their code.
Databricks Notebooks supports real-time co-authoring, so the data engineers can work on the same notebook in real-time while tracking changes with detailed revision history.
Databricks Notebooks automatically handles schema variations to prevent insertion of bad records during ingestion, so the data engineers will be prevented from overwriting data that does not match the table’s schema.
Databricks Notebooks supports real-time co-authoring, so the data engineers can work on the same notebook in real-time while tracking changes with detailed revision history.
Which of the following describes the advantages of the bronze layer of the multi-hop, medallion data architecture? Select one response.
The bronze layer brings data from different sources into an enterprise view, enabling self-service analytics and advanced analytics.
The bronze layer provides a historical archive of the source data, enabling lineage and auditability without rereading the data from the source system.
The bronze layer powers reporting and uses de-normalized, read-optimized data models with a minimal number of joins.
None of these responses correctly describe the advantages of the bronze layer in this data architecture.
The bronze layer applies business rules and complex transformations for write-performant data models.
The bronze layer provides a historical archive of the source data, enabling lineage and auditability without rereading the data from the source system.
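A minimal sketch of a bronze-layer ingestion step under these assumptions (the landing path and table name are hypothetical): raw records are appended to a Delta table as-is, with ingestion metadata kept so history can be audited without rereading the source system.

    from pyspark.sql import functions as F

    # `spark` is predefined in Databricks notebooks.
    raw = spark.read.format("json").load("/mnt/raw/orders/")   # hypothetical landing path

    bronze = (raw
              .withColumn("_ingested_at", F.current_timestamp())
              .withColumn("_source_file", F.input_file_name()))

    # Append with minimal transformation; the Delta table keeps the historical archive.
    bronze.write.format("delta").mode("append").saveAsTable("bronze_orders")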
A data engineer needs the results of a query contained in the third cell of their notebook. Another engineer has verified that the query itself runs correctly. However, when the data engineer runs the cell individually, they notice an error.
Which of the following steps can the data engineer take to ensure the query runs without error? Select two responses.
The data engineer can run the notebook cells in order starting from the first command.
The data engineer can clear all cell outputs before re-executing the cell individually.
The data engineer can choose “Run all above” from the dropdown menu within the cell.
The data engineer can clear the execution state before re-executing the cell individually.
The data engineer can choose “Run all below” from the dropdown menu within the cell.
The data engineer can run the notebook cells in order starting from the first command.
The data engineer can choose “Run all above” from the dropdown menu within the cell.