Databricks Data Analyst Flashcards
Describe the key audience and side audiences for Databricks SQL.
The primary audience is data analysts; data scientists and data engineers are side audiences
Describe that a variety of users can view and run Databricks SQL dashboards as stakeholders.
Users can view and run dashboards without having access to anything else in the platform. Basically like viewers in Tableau
Describe the 4 benefits of using Databricks SQL for in-Lakehouse platform data processing.
1) Allows users to query and analyze data stored in data lakes and warehouses using SQL.
2) Built on Databricks Lakehouse platform, provides unified platform for data engineering, science, and analytics.
3) Provides high performance query engine optimized for big data workloads
4) Provides a range of tools and features for data processing, including visualization, transformation, and ML
Describe how to complete a basic Databricks SQL query.
Go to query editor, select warehouse, and run a SELECT statement
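A minimal sketch of such a query (catalog, table, and column names are placeholders):
SELECT region, SUM(amount) AS total_sales
FROM my_catalog.my_schema.orders
GROUP BY region
ORDER BY total_sales DESC;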
Identify the information displayed in the schema browser from the Query Editor page.
Schema browser allows you to view all data objects, such as databases, tables, columns, and data types. Used to explore the structure of data
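The same structural information can also be pulled with SQL; a quick sketch (schema and table names are placeholders):
SHOW TABLES IN my_schema;           -- list tables in a schema
DESCRIBE TABLE my_schema.my_table;  -- columns and data types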
Identify Databricks SQL dashboards as a place to display the results of multiple queries at once.
Think of each query as a data source, multiple can be used in one dashboard
Describe how to complete a basic Databricks SQL dashboard.
Select dashboard on toolbar, add tiles, select query and visualization for each tile
Describe how dashboards can be configured to automatically refresh.
From the dashboard, click the Schedule button at the top; subscriptions can be set up here too. Note that query and dashboard refresh schedules are set independently
Describe the purpose of Databricks SQL endpoints/warehouses.
1) provide general compute resources for queries, visualizations and dashboards 2) provide a way to separate compute resources for SQL workloads from other workloads 3) Serverless, Pro, Classic
Identify Serverless Databricks SQL endpoint/warehouses as a quick-starting option.
Designed to be easy to set up and use; compute is managed by Databricks, so the warehouse starts in seconds with no cluster to provision
Describe the trade-off between cluster size and cost for Databricks SQL endpoints/warehouses.
Large clusters handle more concurrent queries and larger workloads, but cost more to run
Identify Partner Connect as a tool for implementing simple integrations with a number of other data products.
1) Provides a simpler alternative to manual connections by provisioning Azure Databricks resources on your behalf and passing details on to partners 2) Creates a trial account if you don’t already have an account with the partner
Describe how to connect Databricks SQL to ingestion tools like Fivetran.
1) Select Partner Connect 2) Click the partner 3) Enter connection info 4) Complete sign-in on the partner website in a new tab 5) Follow setup instructions and set the destination 6) Databricks creates a SQL warehouse (endpoint) and a table to receive the data
Identify the need to be set up with a partner to use it for Partner Connect.
Must have a license with partner in order to use it in Databricks
Identify small-file upload as a solution for importing small text files like lookup tables and quick data integrations.
Good for CSVs; the file is uploaded through the UI and written to a table. The upload is manual, so it isn't suited to automated refreshes of local files
Import from object storage using Databricks SQL.
Load files from a cloud storage path (S3, ADLS, GCS) into a table, e.g. with COPY INTO or by creating a table over an external location; requires access to the storage location
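A rough sketch using COPY INTO, assuming the target Delta table already exists and the storage path and credentials are set up (all names are placeholders):
COPY INTO my_catalog.my_schema.sales_raw
FROM 's3://my-bucket/landing/sales/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');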
Identify that Databricks SQL can ingest directories of files when the files are the same type.
Databricks reads all the files and combines them into a single table if they have the same structure
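A hedged sketch using the read_files table function, assuming it is available in your workspace (path and names are placeholders):
CREATE TABLE my_schema.lookups_combined AS
SELECT * FROM read_files('s3://my-bucket/lookups/', format => 'csv', header => true);
All CSV files under the directory are read and combined into one table.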
Describe how to connect Databricks SQL to visualization tools like Tableau, Power BI, and Looker.
1) Navigate to the clusters tab 2) In advanced options, select the JDBC/ODBC tab 3) Follow instructions to download the driver for the viz tool 4) Configure the tool with the driver
Identify Databricks SQL as a complementary tool for BI partner tool workflows.
Take advantage of the scalability and performance of the Databricks platform while keeping the familiar interface and functionality of the BI tool
Describe the medallion architecture as a sequential data organization and pipeline system of progressively cleaner data.
bronze = raw data staging; silver = data warehouse; gold = published data sources
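A toy sketch of the flow, assuming bronze/silver/gold schemas (all names, columns, and cleansing logic are made up):
-- bronze: raw data as ingested
CREATE TABLE bronze.orders_raw AS
SELECT * FROM read_files('s3://my-bucket/orders/', format => 'json');
-- silver: cleaned and conformed
CREATE TABLE silver.orders AS
SELECT order_id, CAST(order_ts AS TIMESTAMP) AS order_ts, amount
FROM bronze.orders_raw
WHERE order_id IS NOT NULL;
-- gold: aggregated, ready for dashboards
CREATE TABLE gold.daily_sales AS
SELECT DATE(order_ts) AS order_date, SUM(amount) AS total_sales
FROM silver.orders
GROUP BY DATE(order_ts);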
Identify the gold layer as the most common layer for data analysts using Databricks SQL.
like published Tableau data sources
Describe the cautions and benefits of working with streaming data.
Similar to live data sources. Benefits: real-time insights, faster decision making, ability to respond quickly. Cautions: managing the volume of data, ensuring quality and consistency, need for specialized expertise
Identify that the Lakehouse allows the mixing of batch and streaming workloads.
You can have both extracts and live data in your environment, allowing you to build real-time applications while also running traditional batch processing for historical analysis
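A rough sketch of mixing both, assuming streaming tables are enabled for Databricks SQL in your workspace (names, columns, and path are placeholders):
-- streaming: continuously ingests new files as they arrive
CREATE OR REFRESH STREAMING TABLE events_stream AS
SELECT * FROM STREAM read_files('s3://my-bucket/events/', format => 'json');
-- batch: a scheduled aggregate over the same data for historical analysis
CREATE OR REPLACE TABLE events_daily AS
SELECT DATE(event_ts) AS event_date, COUNT(*) AS event_count
FROM events_stream
GROUP BY DATE(event_ts);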
Describe Delta Lake as a tool for managing data files.
Supports ACID transactions (Atomicity, Consistency, Isolation, Durability); highly scalable; provides tools like VACUUM (removes files no longer referenced by the table) and OPTIMIZE (compacts small files and optimizes the data layout)
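A quick sketch of both commands (table and column names are placeholders):
OPTIMIZE my_schema.sales ZORDER BY (customer_id);  -- compact small files and co-locate data by customer_id
VACUUM my_schema.sales RETAIN 168 HOURS;           -- remove files no longer referenced (default retention is 7 days)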