Databricks Data Analyst (extra notes) Flashcards

1
Q

In the context of statistics, what are key moments of a statistical distribution?

A

Key moments of a statistical distribution include:
Mean (first moment): measures central tendency
Variance (second moment): measures dispersion
Skewness (third moment): indicates asymmetry
Kurtosis (fourth moment): describes the 'tailedness' of the distribution
These moments are crucial for describing the characteristics of a distribution.

Understanding these moments is essential for comprehensively describing and analyzing the behavior of data in a statistical context.
https://www.analyticsvidhya.com/blog/2022/01/moments-a-must-known-statistical-concept-for-data-science/
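
For reference, the standard definitions can be written as follows (the third and fourth moments are shown in their standardized forms):

\mu = \mathbb{E}[X]                              % mean (first raw moment)
\sigma^2 = \mathbb{E}[(X - \mu)^2]               % variance (second central moment)
\gamma_1 = \mathbb{E}[(X - \mu)^3] / \sigma^3    % skewness (standardized third moment)
\kappa = \mathbb{E}[(X - \mu)^4] / \sigma^4      % kurtosis (standardized fourth moment)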

2
Q

What is the correct method to rename a table in Databricks?

A

Use the ALTER TABLE RENAME TO command.

This command changes the table’s name in the metadata without affecting the actual data stored in the table. It’s a straightforward and efficient method for renaming tables within Databricks environments.
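
A minimal sketch, using a hypothetical table name:

ALTER TABLE sales_raw RENAME TO sales_bronze;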

3
Q

What is the typical use case for MERGE INTO command?

A

MERGE INTO updates existing records and inserts new records, depending on whether a match is found in the target table. It is used for upsert-style operations where you need to update existing rows or insert new ones based on a matching condition.
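
A sketch of a typical upsert, assuming hypothetical customers and customer_updates tables that share customer_id and email columns:

MERGE INTO customers AS target
USING customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.email = source.email
WHEN NOT MATCHED THEN
  INSERT (customer_id, email) VALUES (source.customer_id, source.email);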

4
Q

What is the typical use case for INSERT INTO command?

A

INSERT INTO adds new rows to a table; it does not update existing records. It is the straightforward choice for appending new data.
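
A minimal sketch, using hypothetical table and column names:

INSERT INTO customers (customer_id, email) VALUES (1001, 'jane@example.com');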

5
Q

What is the typical use case for COPY INTO command?

A

COPY INTO is a specialized Databricks command for loading data into a table from files in external sources, such as cloud object storage. It is designed for file-based ingestion and is useful when importing data from outside the platform.

6
Q

How do you use the CUBE function?

A

The CUBE function in Databricks is used in SQL queries to generate a result set that includes all possible combinations of aggregations for the specified group of columns.
The CUBE expression is written in the GROUP BY clause. For example, GROUP BY CUBE (region, product) calculates the total sales amount for each combination of region and product, the subtotals for each region and for each product, and the grand total. Rows with NULL values in the grouping columns represent subtotals and grand totals.
The row where both region and product are NULL represents the grand total of all sales.
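
A sketch of the query described above, assuming a hypothetical sales table with region, product, and sales_amount columns:

SELECT region, product, SUM(sales_amount) AS total_sales
FROM sales
GROUP BY CUBE (region, product);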

7
Q

What does the ROLLUP function do?

A

The ROLLUP function generates a hierarchical aggregation, which creates subtotals that roll up from the most detailed level to the grand total. When applied to the ‘Department’ and ‘Employee’ columns, the ROLLUP function produces the following aggregations:

Aggregation by Department and Employee (detailed level)
Subtotal by Department (aggregated over all employees within each department)
Grand total (aggregated over all departments and employees)
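
A sketch of such a query, assuming a hypothetical employees table with department, employee, and salary columns:

SELECT department, employee, SUM(salary) AS total_salary
FROM employees
GROUP BY ROLLUP (department, employee);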

8
Q

When importing data from an Amazon S3 bucket into a Databricks environment using Databricks SQL, which SQL command is typically used to perform this operation?

A

The COPY INTO command is often used for this purpose, as it is well-suited for incremental and bulk data loading in Databricks SQL. Specify the table name, the file path, and the file format (plus optional details such as schema, headers, and delimiters).
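
A sketch of such a load, assuming a hypothetical target table and S3 path:

COPY INTO sales_bronze
FROM 's3://my-bucket/landing/sales/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'delimiter' = ',', 'inferSchema' = 'true');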

9
Q

What does ACID stand for?

A

ACID (Atomicity, Consistency, Isolation, Durability). Delta Lake on Databricks provides ACID transactions to ensure data reliability and integrity.

10
Q

What does the A in ACID stand for?

A

Atomicity ensures that a series of operations within a transaction are all completed successfully or none are. When a write operation (such as an insert, update, or delete) is performed on a Delta Lake table, atomicity ensures that these operations are fully completed or not applied at all. This is critical for maintaining the consistency and reliability of data in large-scale data processing systems.

11
Q

What does the C in ACID stand for?

A

Consistency ensures the database remains in a valid state, maintaining data integrity. With consistency, any data operation (such as an insert, update, or delete) will only commit if it does not violate any data integrity constraints. This guarantees that all committed transactions maintain the database's valid state, thereby ensuring data consistency across large-scale data processing environments. Example:
consider a table with a constraint ensuring that the balance column cannot be negative. If a transaction attempts to deduct an amount from an account that would result in a negative balance, consistency ensures that the transaction fails and the database remains in a valid state.

12
Q

What does the I stand for in ACID?

A

Isolation ensures that concurrent transactions do not interfere with each other, providing a consistent view of the data to each transaction. For instance, when multiple users or processes are writing to a Delta Lake table simultaneously, isolation ensures that each transaction sees a consistent view of the data and the changes are applied correctly without conflicts.

13
Q

What does the D stand for in ACID?

A

Durability ensures that once a transaction is committed, the changes are permanently stored in the Delta Log and the underlying storage system and will survive system failures. This is crucial in large-scale data processing environments where system failures can occur. Consider a banking system where a transaction deducts $100 from Account A and adds $100 to Account B. Once this transaction is committed, durability guarantees that even if the system crashes immediately after the commit, the changes will be preserved and reflected in the database once it is back online.

14
Q

What are the benefits of working with streaming data?

A

Streaming data allows for real-time analytics and decision-making. One of the primary benefits of working with streaming data in a Databricks SQL environment is the ability to perform real-time analytics. This feature is particularly useful in scenarios where immediate data analysis is crucial, such as monitoring financial transactions, tracking website activity, or analyzing sensor data in real-time.

15
Q

What are areas of caution when working with streaming data?

A

The continuous flow of data in a streaming context poses a challenge in maintaining data consistency. Unlike batch processing, where data is processed in chunks and consistency can be more easily managed, streaming data is susceptible to anomalies and inconsistencies due to its nature. This requires robust error handling and data validation mechanisms to ensure data integrity. Working with streaming data can also lead to increased resource consumption because it requires continuous processing. This constant demand can strain system resources and potentially increase operational costs. Additionally, the need for real-time processing capabilities can necessitate more powerful and expensive infrastructure, leading to higher costs and resource allocation challenges.

16
Q

What feature of Databricks contributes to a decrease in query execution time after several runs of a complex query on a large dataset?

A

Caching of intermediate data and results from previous query executions.

When a query is run multiple times, Databricks stores the results and intermediate data in cache. Subsequent executions of the same query or those with similar computations can leverage this cached data, leading to significantly reduced query latency. This efficiency gain is particularly notable in complex queries on large datasets, where accessing cached results avoids the need for repeated data processing and computation.

17
Q

You need to apply a custom scaling function to normalize transaction amounts for analysis.

You decide to create a UDF (User-Defined Function) in Python.

Which of the following approaches correctly illustrates the creation and application of a UDF for this purpose?

A

Define a Python function normalize(amount) and register it as a UDF, then use SELECT normalize(TransactionAmount) FROM Transactions;.

In this scenario, a UDF is created to apply a specific normalization function to the TransactionAmount column. This approach is efficient and scalable in large data environments, like Databricks, as it allows for complex, customized data transformations that are not available through standard SQL or built-in functions.
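
The card describes registering the function from Python. As an alternative sketch entirely in SQL, assuming Unity Catalog Python UDF support is available and using placeholder scaling constants:

CREATE OR REPLACE FUNCTION normalize(amount DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON
AS $$
# placeholder min-max style scaling; real bounds would come from the data
return (amount - 0.0) / 1000.0
$$;

SELECT normalize(TransactionAmount) FROM Transactions;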

18
Q

You notice inconsistencies in customer names due to variations in casing and spacing (e.g., 'John Doe', 'john doe', ' John Doe ').

What would be an appropriate SQL query to standardize these customer names in the silver table CustomerData?

A

SELECT DISTINCT TRIM(UPPER(customer_name)) FROM CustomerData;

19
Q

How do you set the location of a table in Databricks when creating or altering it?

A

By using the CREATE TABLE … LOCATION
or ALTER TABLE … SET LOCATION command.

In Databricks, the location of a table can be set or changed using SQL commands. During table creation, the location is specified using the CREATE TABLE … LOCATION syntax. For existing tables, the location can be altered using the ALTER TABLE … SET LOCATION command. This functionality allows for flexibility in data management and storage optimization within Databricks.
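
A sketch of both commands, using hypothetical table names and storage paths:

CREATE TABLE sales_external (id INT, amount DOUBLE)
LOCATION 's3://my-bucket/tables/sales_external/';

ALTER TABLE sales_external SET LOCATION 's3://my-bucket/tables/sales_external_v2/';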

20
Q

What are the primary advantages of using a Serverless Databricks SQL endpoint/warehouse?

A
  1. Instant compute access
  2. Cost and capacity optimization (matches capacity to usage, avoiding overprovisioning and idle capacity)
  3. Easy integration with BI tools
  4. Auto-restart feature
  5. Secure and managed configuration

21
Q

What is the minimum permission a user needs to configure a refresh schedule on a Databricks SQL Dashboard?

A

A dashboard’s owner and users with the Can Edit permission can configure a dashboard to automatically refresh on a schedule.

22
Q

Describe the 5 levels of access controls (permissions) for Dashboards

A

1 No Permission
2 Can View (dashboard only, not the queries, cannot refresh)
3 Can Run (refresh dashboard but cannot view queries)
4 Can Edit
5 Can Manage (when the sharing credential setting is Run As Owner, only the owner can effectively hold this level; other users granted Can Manage are downgraded to Can Run. When the setting is Run As Viewer, users cannot view objects on the dashboard unless they have permissions on the underlying catalog and schema.)

23
Q

Describe Photon

A

Photon is a Databricks proprietary tool. It is an optimized query engine designed to accelerate SQL queries on large-scale data.

24
Q

Describe Workflows

A

Workflows is a Databricks proprietary tool used for orchestrating and scheduling data workflows.

25
Q

Describe Unity Catalog

A

Unity Catalog is a Databricks proprietary tool. It is a metadata service that provides a unified view of data across multiple sources.

26
Q

What is Apache Spark?

A

Apache Spark is an open-source distributed computing system. It is a widely adopted open-source framework for big data processing.

While Databricks provides a managed Spark environment as part of its platform, Apache Spark itself is not proprietary to Databricks.