Databricks Data Analyst (extra notes) Flashcards
In the context of statistics, what are key moments of a statistical distribution?
Key moments of a statistical distribution include the:
mean (first moment), which measures the central tendency
variance (second moment), which measures the dispersion
skewness (third moment), which indicates the asymmetry
kurtosis (fourth moment), which describes the ‘tailedness’ of the distribution
Understanding these moments is essential for comprehensively describing and analyzing the behavior of data in a statistical context.
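As a quick refresher (standard definitions, not taken from the card above): the k-th central moment of a random variable X with mean μ and standard deviation σ is E[(X − μ)^k]; skewness is the standardized third moment, μ3 / σ^3, and kurtosis is the standardized fourth moment, μ4 / σ^4.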
https://www.analyticsvidhya.com/blog/2022/01/moments-a-must-known-statistical-concept-for-data-science/
What is the correct method to rename a table in Databricks?
Use the ALTER TABLE RENAME TO command.
This command changes the table’s name in the metadata without affecting the actual data stored in the table. It’s a straightforward and efficient method for renaming tables within Databricks environments.
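A minimal sketch, assuming hypothetical table names:
  ALTER TABLE sales_raw RENAME TO sales_bronze;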
What is the typical use case for MERGE INTO command?
MERGE INTO is suited to upserts and other complex operations: based on a matching condition, it updates existing records in the target table when a match is found and inserts new records when no match is found.
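A minimal upsert sketch, assuming a hypothetical target table customers and a staging source customers_updates that both have id and email columns:
  MERGE INTO customers AS t
  USING customers_updates AS s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.email = s.email
  WHEN NOT MATCHED THEN INSERT (id, email) VALUES (s.id, s.email);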
What is the typical use case for INSERT INTO command?
INSERT INTO adds new rows to a table and does not update existing records; it is the straightforward choice for appending new data.
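A minimal sketch, assuming a hypothetical sales table with region, product, and amount columns:
  INSERT INTO sales (region, product, amount) VALUES ('EMEA', 'Widget', 100.00);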
What is the typical use case for COPY INTO command?
COPY INTO is a specialized command in Databricks for loading data into a table from files in external sources, such as cloud object storage or a file system; it is specifically designed for importing file-based data from outside the table.
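A minimal sketch, assuming JSON files staged at a hypothetical mounted path:
  COPY INTO events
  FROM '/mnt/raw/events/'
  FILEFORMAT = JSON;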
How to use the CUBE function?
The CUBE function in Databricks is used in SQL queries to generate a result set that includes all possible combinations of aggregations for the specified group of columns.
The CUBE function must be the only grouping expression in the GROUP BY clause. For example, to calculate the total sales amount for each combination of region and product, as well as the subtotals for each region and each product, and the grand total, use GROUP BY CUBE (region, product). In the result set, rows with NULL values represent subtotals and grand totals.
The row where both region and product are NULL represents the grand total of all sales.
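A sketch of the example described above, assuming a hypothetical sales table with region, product, and amount columns:
  SELECT region, product, SUM(amount) AS total_sales
  FROM sales
  GROUP BY CUBE (region, product);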
What does the ROLLUP function do?
The ROLLUP function generates a hierarchical aggregation, which creates subtotals that roll up from the most detailed level to the grand total. When applied to the ‘Department’ and ‘Employee’ columns, the ROLLUP function produces the following aggregations:
Aggregation by Department and Employee (detailed level)
Subtotal by Department (aggregated over all employees within each department)
Grand total (aggregated over all departments and employees)
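A minimal sketch of this example, assuming a hypothetical staff table with department, employee, and salary columns:
  SELECT department, employee, SUM(salary) AS total_salary
  FROM staff
  GROUP BY ROLLUP (department, employee);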
When importing data from an Amazon S3 bucket into a Databricks environment using Databricks SQL, which SQL command is typically used to perform this operation?
The COPY INTO command is often used for this purpose, as it is well-suited for incremental and bulk data loading in Databricks SQL. Specify the table name, the file path, and the file format (and optional details such as schema, headers, and delimiters).
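A hedged sketch, assuming a hypothetical target table, bucket path, and CSV layout:
  COPY INTO sales
  FROM 's3://my-bucket/exports/sales/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true', 'delimiter' = ',');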
What does ACID stand for?
ACID (Atomicity, Consistency, Isolation, Durability). Delta Lake on Databricks provides ACID transactions to ensure data reliability and integrity.
What does the A in ACID stand for?
Atomicity ensures that a series of operations within a transaction are all completed successfully or none are. When a write operation (such as an insert, update, or delete) is performed on a Delta Lake table, atomicity ensures that these operations are fully completed or not applied at all. This is critical for maintaining the consistency and reliability of data in large-scale data processing systems.
What does the C in ACID stand for?
Consistency ensures the database remains in a valid state, maintaining data integrity. With consistency, any data operation (such as an insert, update, or delete) will only commit if it does not violate any data integrity constraints. This guarantees that all committed transactions maintain the database’s valid state, thereby ensuring data consistency across large-scale data processing environments. Example:
A table has a constraint ensuring that the balance column cannot be negative. If a transaction attempts to deduct an amount from an account that would result in a negative balance, consistency ensures that the transaction fails and the database remains in a valid state.
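A sketch of how such a rule could be enforced on a Delta table, assuming a hypothetical accounts table and constraint name:
  ALTER TABLE accounts ADD CONSTRAINT non_negative_balance CHECK (balance >= 0);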
What does the I stand for in ACID?
Isolation ensures that concurrent transactions do not interfere with each other, providing a consistent view of the data to each transaction. For instance, when multiple users or processes are writing to a Delta Lake table simultaneously, isolation ensures that each transaction sees a consistent view of the data and the changes are applied correctly without conflicts.
What does the D stand for in ACID?
Durability ensures that once a transaction is committed, the changes are permanently stored in the Delta Log and the underlying storage system and will survive system failures. This is crucial in large-scale data processing environments where system failures can occur. Consider a banking system where a transaction deducts $100 from Account A and adds $100 to Account B. Once this transaction is committed, durability guarantees that even if the system crashes immediately after the commit, the changes will be preserved and reflected in the database once it is back online.
What are the benefits of working with streaming data?
Streaming data allows for real-time analytics and decision-making. One of the primary benefits of working with streaming data in a Databricks SQL environment is the ability to perform real-time analytics. This feature is particularly useful in scenarios where immediate data analysis is crucial, such as monitoring financial transactions, tracking website activity, or analyzing sensor data in real-time.
What are areas of caution when working with streaming data?
The continuous flow of data in a streaming context poses a challenge in maintaining data consistency. Unlike batch processing, where data is processed in chunks and consistency can be more easily managed, streaming data is susceptible to anomalies and inconsistencies by its nature, so robust error handling and data validation mechanisms are required to ensure data integrity.
Working with streaming data can also increase resource consumption because it requires continuous processing. This constant demand can strain system resources and raise operational costs, and the need for real-time processing capabilities can necessitate more powerful and expensive infrastructure, leading to higher costs and resource allocation challenges.