Databricks Data Analyst Flashcards

1
Q

Describe the key audience and side audiences for Databricks SQL.

A

The primary audience is data analysts; data scientists and data engineers are side audiences

2
Q

Describe that a variety of users can view and run Databricks SQL dashboards as stakeholders.

A

Users can view and run dashboards without having access to anything else in the platform. Basically like viewers in Tableau

3
Q

Describe the 4 benefits of using Databricks SQL for in-Lakehouse platform data processing.

A

1) Lets users query and analyze data stored in data lakes and warehouses using SQL.
2) Built on the Databricks Lakehouse platform, so it provides a unified platform for data engineering, data science, and analytics.
3) Provides a high-performance query engine optimized for big data workloads.
4) Provides a range of tools and features for data processing, including visualization, transformation, and ML.

4
Q

Describe how to complete a basic Databricks SQL query.

A

Go to the Query Editor, select a SQL warehouse, and run a SELECT statement (see the sketch below)
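
A minimal sketch, assuming a hypothetical table named sales:

  -- run in the Query Editor against a running SQL warehouse
  SELECT region, SUM(amount) AS total_amount
  FROM sales                -- hypothetical table
  WHERE amount > 0
  GROUP BY region;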

5
Q

Identify the information displayed in the schema browser from the Query Editor page.

A

The schema browser lets you view all data objects, such as databases, tables, columns, and data types. Used to explore the structure of your data

6
Q

Identify Databricks SQL dashboards as a place to display the results of multiple queries at once.

A

Think of each query as a data source; multiple queries can feed a single dashboard

7
Q

Describe how to complete a basic Databricks SQL dashboard.

A

Select Dashboards in the sidebar, add tiles, and choose a query and visualization for each tile

8
Q

Describe how dashboards can be configured to automatically refresh.

A

From the dashboard, click the Schedule button at the top; you can set up subscriptions there too. Queries and dashboards must be scheduled independently

9
Q

Describe the purpose of Databricks SQL endpoints/warehouses.

A

1) Provide the compute resources for queries, visualizations, and dashboards 2) Provide a way to separate compute for SQL workloads from other workloads 3) Come in three types: Serverless, Pro, and Classic

10
Q

Identify Serverless Databricks SQL endpoint/warehouses as a quick-starting option.

A

Designed to be easy to set up and use; compute is managed by Databricks and starts in seconds, making it the quick-starting option

11
Q

Describe the trade-off between cluster size and cost for Databricks SQL endpoints/warehouses.

A

Larger clusters handle more concurrent queries and larger workloads, but cost more to run

12
Q

Identify Partner Connect as a tool for implementing simple integrations with a number of other data products.

A

1) Provides a simpler alternative to manual connections by provisioning Azure Databricks resources on your behalf and passing the details on to the partner 2) Creates a trial account with the partner if you don’t already have one

13
Q

Describe how to connect Databricks SQL to ingestion tools like Fivetran.

A

1) Open Partner Connect 2) Click the partner tile 3) Enter connection info 4) Complete sign-in on the partner website in a new tab 5) Follow the setup instructions and set the destination 6) Create the endpoint/warehouse and table to receive the data

14
Q

Identify the need to be set up with a partner to use it for Partner Connect.

A

You must have a license (or account) with the partner in order to use it in Databricks

15
Q

Identify small-file upload as a solution for importing small text files like lookup tables and quick data integrations.

A

Good for CSVs. Open question: is there a way to automate refreshes of local files?

16
Q

Import from object storage using Databricks SQL.

A

Review object storage. One common approach is loading files from a cloud storage path into a Delta table with COPY INTO; see the sketch below
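
A hedged sketch of COPY INTO from object storage into an existing Delta table (the bucket path and table name are hypothetical):

  COPY INTO raw_events
  FROM 's3://my-bucket/raw/events/'   -- hypothetical cloud storage path
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true');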

17
Q

Identify that Databricks SQL can ingest directories of files when the files are the same type.

A

Databricks reads all the files in the directory and combines them into a single table when they share the same structure

18
Q

Describe how to connect Databricks SQL to visualization tools like Tableau, Power BI, and Looker.

A

1) Navigate to the clusters tab 2) In Advanced Options, select the JDBC/ODBC tab 3) Follow the instructions to download the driver for your viz tool 4) Configure the tool with the driver

19
Q

Identify Databricks SQL as a complementary tool for BI partner tool workflows.

A

Take advantage of the scalability and performance of the Databricks platform while keeping the familiar interface and functionality of the BI tool

20
Q

Describe the medallion architecture as a sequential data organization and pipeline system of progressively cleaner data.

A

bronze = raw data staging
silver = data warehouse
gold = published data sources

21
Q

Identify the gold layer as the most common layer for data analysts using Databricks SQL.

A

like published Tableau data sources

22
Q

Describe the cautions and benefits of working with streaming data.

A

Similar to live data sources.
Benefits: real-time insights, faster decision making, ability to respond quickly.
Cautions: managing the volume of data, ensuring quality and consistency, need for specialized expertise.

23
Q

Identify that the Lakehouse allows the mixing of batch and streaming workloads

A

You can have both extracts and live data in your environment; this allows you to build real-time applications while also running traditional batch processing for historical analysis

24
Q

Describe Delta Lake as a tool for managing data files.

A

Supports ACID transactions (Atomicity, Consistency, Isolation, Durability); highly scalable; provides maintenance tools like VACUUM (removes unused files from the table directory) and OPTIMIZE (compacts files to optimize the layout of table data). See the sketch below
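
A small sketch of both commands (the table name events is hypothetical):

  OPTIMIZE events ZORDER BY (event_date);  -- compact small files and co-locate data by column
  VACUUM events RETAIN 168 HOURS;          -- remove files no longer referenced; default retention is 7 days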

25
Q

Describe that Delta Lake manages table metadata.

A

1) Supports schema evolution, meaning you can modify a table's schema over time without having to rewrite the whole table. 2) Supports managing table properties, such as the location of the table data and the format of the data files
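
A hedged sketch of both (table, column, and property values are hypothetical):

  ALTER TABLE events ADD COLUMNS (device_type STRING);   -- schema evolution: add a column in place
  ALTER TABLE events SET TBLPROPERTIES ('delta.logRetentionDuration' = '30 days');   -- manage a table property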

26
Q

Identify that Delta Lake tables maintain history for a period of time.

A

Each table modification creates a new version. Use VERSION AS OF (or TIMESTAMP AS OF) to query old versions, and RESTORE TABLE ... TO VERSION AS OF to revert changes; querying the table at a specific point in time is "time travel"
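
A sketch of the syntax (the table name events is hypothetical):

  DESCRIBE HISTORY events;                            -- list versions and the operations that created them
  SELECT * FROM events VERSION AS OF 5;               -- query an old version
  SELECT * FROM events TIMESTAMP AS OF '2024-01-01';  -- time travel to a point in time
  RESTORE TABLE events TO VERSION AS OF 5;            -- revert the table to that version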

27
Q

Describe the benefits of Delta Lake within the Lakehouse.

A

1) ACID transactions 2) Scalable metadata handling 3) Efficient query processing 4) Schema evolution 5) Unified platform. See https://docs.databricks.com/en/introduction/delta-comparison.html

28
Q

Describe persistence and scope of tables on Databricks.

A

1) Global tables are available across all clusters in a workspace and can be accessed by all users. 2) Cluster-scoped tables are available only within a specific cluster and are not visible to other clusters or users. 3) Notebook-scoped tables are available only within a specific notebook and are not visible to other notebooks or users. Persisting tables in a storage format allows them to be stored on disk and accessed more efficiently

29
Q

Compare and contrast the behavior of managed and unmanaged tables.

A

Managed tables: Databricks manages both the metadata and the underlying data files, so dropping the table deletes the data; easy to create (ideal for small to medium sized datasets) and optimized for performance. Unmanaged (external) tables: Databricks manages only the metadata and the data lives at a location you control, so dropping the table leaves the files in place; more flexible and suited to larger or shared datasets, but require more manual management

30
Q

Identify whether a table is managed or unmanaged.

A

Check the info under the Details tab in Data Explorer, or run DESCRIBE EXTENDED in the SQL editor and look at the Type field (MANAGED or EXTERNAL)
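
For example (the table name is hypothetical):

  DESCRIBE EXTENDED events;
  -- in the output, the Type row reads MANAGED or EXTERNAL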

31
Q

Explain how the LOCATION keyword changes the default location of database contents.

A

Use LOCATION in CREATE SCHEMA / CREATE TABLE to specify a path for the contents instead of the default metastore location; a table created with an explicit LOCATION is unmanaged (external)
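
A hedged sketch (paths and names are hypothetical):

  CREATE SCHEMA sales_db LOCATION 's3://my-bucket/sales_db/';
  CREATE TABLE sales_db.orders (id INT, amount DOUBLE)
  LOCATION 's3://my-bucket/orders/';   -- explicit LOCATION makes this an external table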

32
Q

Use Databricks to create, use, and drop databases, tables, and views.

A

Standard SQL commands. You can also set the current database with a dropdown above the SQL editor, so you don't need a USE statement
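
A minimal end-to-end sketch (all names hypothetical):

  CREATE DATABASE IF NOT EXISTS demo_db;
  USE demo_db;                              -- or pick demo_db from the dropdown instead
  CREATE TABLE orders (id INT, amount DOUBLE);
  CREATE VIEW big_orders AS SELECT * FROM orders WHERE amount > 100;
  DROP VIEW big_orders;
  DROP TABLE orders;
  DROP DATABASE demo_db;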

33
Q

Describe the persistence of data in a view and a temp view

A

Views don't store data; they provide a saved way to access and query the underlying data, which stays persisted in the underlying tables, so changes to a table flow through to the view. Temp views are not persisted to the metastore, so they are only available during the session that creates them
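
A sketch of the two (table and view names hypothetical):

  CREATE VIEW recent_orders AS            -- persisted in the metastore, visible across sessions
  SELECT * FROM orders WHERE order_date >= '2024-01-01';

  CREATE TEMPORARY VIEW my_orders AS      -- exists only for the current session
  SELECT * FROM orders;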

34
Q

Compare and contrast views and temp views.

A

see above

35
Q

Explore, preview, and secure data using Data Explorer.

A

Select a table or view and preview with Preview tab, Secure data with permissions tab, query data with query tab, visualize with visualize tab

36
Q

Use Databricks to create, drop, and rename tables.

A

standard SQL commands

37
Q

Identify the table owner using Data Explorer.

A

Details tab in Data Explorer, or in SQL you can use ALTER TABLE ... OWNER TO ...
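
For example (table name and principal are hypothetical):

  ALTER TABLE orders OWNER TO `analysts@example.com`;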

38
Q

Change access rights to a table using Data Explorer.

A

permissions tab

39
Q

Describe the responsibilities of a table owner.

A

Create and manage table, ensure data quality, grant and revoke access, monitor usage and performance, ensure compliance

40
Q

Identify organization-specific considerations of PII data Section

A

Databricks, especially with the integration of Delta Lake, provides mechanisms for handling PII, such as fine-grained access control. This allows for specific permissions on sensitive data fields, ensuring that only authorized users can access PII.

41
Q

Identify a query that retrieves data from the database with specific conditions

A

SELECT statement with WHERE clause
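
For example (table and column names hypothetical):

  SELECT order_id, amount
  FROM orders
  WHERE region = 'WEST' AND amount > 100;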

42
Q

Compare and contrast MERGE INTO, INSERT TABLE, and COPY INTO.

A

MERGE INTO: merges data from a source into a target table based on specified conditions. INSERT INTO: inserts new rows into a table. COPY INTO: loads data from files into a Delta table. MERGE INTO is suitable for updating existing records and inserting new ones, while INSERT INTO is used only for adding new records, and COPY INTO is used for loading data from files
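
A minimal MERGE INTO sketch (table names hypothetical):

  MERGE INTO customers AS t
  USING updates AS s
  ON t.customer_id = s.customer_id
  WHEN MATCHED THEN UPDATE SET *          -- update existing records
  WHEN NOT MATCHED THEN INSERT *;         -- insert new records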

43
Q

Use cube and roll-up to aggregate a data table.

A

ROLLUP: aggregates a data cube hierarchically by progressively reducing dimensions. CUBE: aggregates a data cube by creating all possible combinations of dimensions. See the sketch below
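
A sketch of both (table and column names hypothetical):

  SELECT region, product, SUM(amount) AS total
  FROM sales
  GROUP BY ROLLUP (region, product);    -- groupings: (region, product), (region), ()

  SELECT region, product, SUM(amount) AS total
  FROM sales
  GROUP BY CUBE (region, product);      -- adds (product) as well: every combination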

44
Q

Compare and contrast roll-up and cube.

A

ROLLUP generates hierarchical aggregations starting from the leftmost column in the GROUP BY clause, while CUBE generates aggregations for every possible combination of the grouped columns

45
Q

Use windowing to aggregate time data.

A

SUM(...) OVER (PARTITION BY ...) to aggregate data over specific time intervals; see the sketch below
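
A sketch (table and columns hypothetical):

  SELECT order_date, amount,
         SUM(amount) OVER (PARTITION BY date_trunc('MONTH', order_date)) AS monthly_total,
         SUM(amount) OVER (ORDER BY order_date
                           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)     AS rolling_sum
  FROM orders;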

46
Q

Identify a benefit of having ANSI SQL as the standard in the Lakehouse.

A

Works with other relational databases: portability, interoperability, familiarity, and standardization

47
Q

Identify, access, and clean silver-level data.

A

Bronze is raw data staging, silver is like a data warehouse, gold is like published Tableau data sources. Identify the data you need, access it (obtaining the necessary permissions and credentials), then clean it

48
Q

Utilize query history and caching to reduce development time and query latency.

A

Query history lets you find and reuse previous queries; caching stores the results of a query so repeated runs return faster

49
Q

Optimize performance using higher-order Spark SQL functions.

A

Higher-order functions take other functions (lambdas) as input parameters. They optimize performance by 1) simplifying code 2) reducing the amount of data that needs to be filtered 3) working with user-defined functions (e.g., transform() to apply a UDF to each element of an array) 4) working with window/aggregate functions (e.g., collect_list() to collect a column into an array, then percentile_approx() to calculate an approximate percentile of the values). See the sketch below
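
A sketch of a few built-in higher-order functions (the inline array stands in for a real column):

  SELECT transform(prices, p -> p * 1.1)            AS with_tax,   -- apply a lambda to each element
         filter(prices, p -> p > 100)               AS expensive,  -- keep only matching elements
         aggregate(prices, 0D, (acc, p) -> acc + p) AS total       -- fold the array to one value
  FROM (SELECT array(50D, 120D, 300D) AS prices);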

50
Q

Create and apply UDFs in common scaling scenarios

A

https://docs.databricks.com/en/sql/language-manual/sql-ref-functions-udf-aggregate.html
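
Beyond the linked docs, a minimal SQL UDF sketch with CREATE FUNCTION (the function name and logic are hypothetical):

  CREATE OR REPLACE FUNCTION to_fahrenheit(c DOUBLE)
    RETURNS DOUBLE
    RETURN c * 9 / 5 + 32;

  SELECT to_fahrenheit(20);   -- 68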

51
Q

Explain how visualization formatting changes the reception of a visualization

A

Formatting can determine where people focus their attention

52
Q

Describe how to add visual appeal through formatting

A

Layout, colors, font sizes, etc.

53
Q

Identify that customizable tables can be used as visualizations within Databricks SQL.

A

They give more user flexibility within crosstabs than Tableau does

54
Q

Describe how query parameters change the output of underlying queries within a dashboard

A

The parameter value is substituted into the underlying query, typically as part of a WHERE clause; see the sketch below
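
For example (table and parameter name hypothetical; {{ }} is the Databricks SQL parameter syntax):

  SELECT order_id, amount
  FROM orders
  WHERE region = {{ region }};   -- the dashboard parameter widget supplies the value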

55
Q

Identify the behavior of a dashboard parameter

A

affects the entire dashboard

56
Q

Identify the use of the “Query Based Dropdown List” as a way to create a query parameter from the distinct output of a different query.

A

It dynamically generates a dropdown list based on the distinct output of a separate query, allowing users to select values as query parameters. Creates a parameter dropdown like we use in Tableau

57
Q

Identify that users without permission to all queries, databases, and endpoints can easily refresh a dashboard using the owner’s credentials.

A

'Run as viewer' enhances data security by adhering to each viewer's permissions (pro), but may limit data visibility (con). 'Run as owner' ensures consistent data visibility across users (pro), but might pose security risks if the owner has broader data access (con)

58
Q

Identify what happens if a refresh rate is less than the Warehouse’s “Auto Stop”

A

The dashboard will continue to refresh at the set interval, so the SQL warehouse keeps running and never auto-stops (each refresh resets the Auto Stop timer)

59
Q

Describe key moments of statistical distributions.

A

The first moment is the expected value (mean), the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis
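
In symbols: mean = E[X]; variance = E[(X − μ)²]; skewness = E[(X − μ)³]/σ³; kurtosis = E[(X − μ)⁴]/σ⁴.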

60
Q

Describe data enhancement as a common analytics application.

A

Enriching a data source by adding in other sources of data

61
Q

Identify a scenario in which data enhancement would be beneficial.

A

Adding weather data to sales metrics

62
Q

Describe the blending of data between two source applications.

A

Not the same as Tableau's data blending; here it just refers to joining data from the two source applications together