1. Introduction to Database Systems Flashcards

1
Q

Can you describe the lifecycle of a query in a database system, from creation to execution?

A
  • Creation: The query is crafted in a language such as SQL.
  • Parsing: It’s then transformed into an Abstract Syntax Tree (AST).
  • Optimization: Since a query can be written in many ways that produce the same result, optimization is crucial. For example, applying a selection to a Cartesian product is costlier than performing the equivalent join, because the join folds the selection condition into the operation itself, keeping the intermediate result's cardinality (size) small. Finding the most efficient execution plan is hard because the number of potential plans grows exponentially.
  • Plan Caching: To save time analyzing execution plans, frequently used or time-sensitive operations are stored in a plan cache for reuse.
  • Plan Development without Cache: When not using cached plans, a new plan is formulated based on intermediate operation cardinalities. Estimating these cardinalities requires data statistics like synopses, histograms, and summaries, leading to the creation of a physical execution plan.
  • Execution: This physical execution plan (also known as the access plan) is then executed by the database engine to produce the desired result.
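The tail end of this lifecycle can be observed directly from Python: a minimal sketch using the standard sqlite3 module, where `EXPLAIN QUERY PLAN` asks SQLite for the access plan it would execute (the `customers` table here is invented for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'New York')")

# Ask the optimizer for its chosen access plan without executing the query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT city FROM customers WHERE id = 1"
).fetchall()
for row in plan:
    print(row)  # e.g. a SEARCH using the INTEGER PRIMARY KEY index
```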
2
Q

What is a “Prepare Statement” in databases, and what are its benefits?

A

Definition: A prepare statement is the pre-compilation of an SQL query, useful for executing similar queries multiple times with varying parameters.
Process: It involves parsing, compiling, and optimizing the query once, without executing it, saving the plan for future use.

Benefits:
- Performance: Speeds up execution by skipping repeat parsing and compilation.
- Consistency: Ensures uniform execution plans.
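A minimal sketch of the idea with Python's standard sqlite3 module, which keeps an internal cache of compiled statements: the same parameterized SQL text is compiled once and reused with different parameters (table and data are invented for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

# The same parameterized SQL text is compiled once and reused from
# sqlite3's internal statement cache; only the parameters change.
insert = "INSERT INTO users VALUES (?, ?)"
for params in [(1, "Ada"), (2, "Bo"), (3, "Cy")]:
    conn.execute(insert, params)

row = conn.execute("SELECT name FROM users WHERE id = ?", (2,)).fetchone()
print(row)  # ('Bo',)
```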

3
Q

What are OLAP queries?

A

OLAP queries are Online Analytical Processing queries, or analytical queries for short.

Nature: Analytical queries are designed for data analysis and business intelligence, involving complex operations like aggregating large data sets for sums, averages, and statistical analyses.
Execution Time: These queries can be time-consuming, taking from an hour to several days, due to the need to scan large datasets and perform complex calculations.
Use Case: They are commonly used in data warehousing for identifying trends, patterns, and aiding in strategic decision-making.
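A toy example of the analytical shape with Python's sqlite3 module: scan a whole (here tiny) table, then aggregate (the `sales` table is invented for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 5.0), ("north", 7.5)],
)

# A typical analytical query: read everything, then aggregate per group.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('north', 17.5), ('south', 5.0)]
```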

4
Q

What are OLTP queries?

A

OLTP queries are Online Transaction Processing queries, or transactional queries for short.

Nature: Transactional queries handle transactional data, typically being simple, involving small data volumes, and focusing on speed and reliability.
Execution Time: These queries are executed rapidly, often within milliseconds, emphasizing efficiency and data integrity.
Use Case: Widely used in systems where quick response and data consistency are vital, such as in banking, e-commerce, and other transaction-heavy applications.
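A toy transactional example with Python's sqlite3 module: a transfer between two accounts executed atomically, committed on success and rolled back as a unit on failure (the table and amounts are invented for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

# A typical transactional unit: a few small writes, atomic as a whole.
try:
    with conn:  # commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # the whole transfer would be rolled back, never half of it

balances = [r[0] for r in conn.execute("SELECT balance FROM accounts ORDER BY id")]
print(balances)  # [70.0, 80.0]
```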

5
Q

What is a Logical Plan in the context of database query processing?

A

Operation-Focused: Involves operations like selection, projection, and join, but does not specify the execution methods (e.g., type of join algorithm).
Independence from Physical Data: Concerns itself with what operations are to be performed, rather than how data is physically stored or accessed.
Optimization Stage: At this stage, query optimizers rearrange or combine operations for efficiency, based on data’s logical relationships.

6
Q

What are the key elements of a Physical Execution Plan in database systems?

A

Algorithm Specification: Specifies exact algorithms for operations like joins, selections, or projections at each node of the plan’s tree structure.
Access Path Definition: Details how data is physically retrieved, including methods like direct file access or index usage.
Data Flow Mechanism: Describes how data is transmitted between nodes, including strategies like pushing data down the tree or pulling it up.
Execution Order: Dictates the sequence of node execution, optimized for efficiency based on data dependencies and chosen algorithms, with either a top-down (push) or bottom-up (pull) approach.
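The pull (bottom-up) data-flow mechanism can be sketched with Python generators, where each operator pulls the next row from its child on demand (a toy Volcano-style model, not any specific engine's implementation).

```python
# Toy Volcano-style (pull) execution: each operator is a generator that
# pulls rows from its child only when its parent asks for the next row.
def scan(table):
    yield from table

def select(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    for row in child:
        yield tuple(row[c] for c in columns)

customers = [(1, "Ada", "New York"), (2, "Bo", "Boston")]
plan = project(select(scan(customers), lambda r: r[2] == "New York"), [1])
print(list(plan))  # [('Ada',)]
```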

7
Q

How do Logical and Physical Plans in database query processing differ, particularly in their focus and optimization strategies?

A

Logical Plan:
- Focus: Concentrates on ‘what’ operations (like selection, join) are needed.
- Optimization: Rearranges or combines operations based on data’s logical relationships.
Physical Plan:
- Focus: Details ‘how’ to execute operations (specific algorithms, data access paths).
- Optimization: Tailors execution order and data flow for efficiency, considering data dependencies and chosen methods.

8
Q

What is the difference between Declarative and Imperative queries?

A

Declarative Queries:
- Focus: On ‘what’ data is needed, not how to retrieve it.
- Example: SQL, where you specify the desired data (e.g., "SELECT name FROM customers WHERE city = 'New York'") without detailing the retrieval process.
- Advantage: Simplifies queries; database system handles complexity.

Imperative Queries:
- Focus: On ‘how’ to fetch data, detailing exact steps.
- Example: Using loops in programming languages (like Java, Python) to process data.
- Advantage: Provides more control, useful for specific, complex data processing needs.
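The contrast can be made concrete in a few lines of Python, computing the card's example query both ways (the data is invented for illustration).

```python
import sqlite3

customers = [("Ada", "New York"), ("Bo", "Boston"), ("Cy", "New York")]

# Declarative: say WHAT you want; the engine decides how to fetch it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)
declarative = [r[0] for r in conn.execute(
    "SELECT name FROM customers WHERE city = 'New York'")]

# Imperative: spell out HOW, step by step.
imperative = []
for name, city in customers:
    if city == "New York":
        imperative.append(name)

print(declarative == imperative)  # True
```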

9
Q

What is the challenge in navigating the plan space in database query optimization, and what is the solution?

A

Challenge: The query optimizer must choose from a vast array of potential execution plans.
Solution: Optimize within time and resource constraints to find the most efficient plan.

10
Q

How are the challenges of enumerating through the plan space addressed in database query optimization?

A

Challenge: Efficiently searching through the large plan space.
Solution: Use dynamic programming to reduce search complexity to a manageable level.
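A toy sketch of the dynamic-programming idea, in the spirit of Selinger-style join enumeration: solve each subset of tables once instead of enumerating every join order. The cardinalities and the cost model here are invented stand-ins, not a real optimizer's.

```python
import math
from itertools import combinations

# Assumed per-table cardinalities and a stand-in cost model: joining a
# set of tables costs the product of their cardinalities.
card = {"A": 100, "B": 10, "C": 1000}

def best_plan(tables):
    # best[subset] = (cheapest cost, plan string): each subset is solved
    # once, so we enumerate O(2^n) subsets instead of all n! join orders.
    best = {frozenset([t]): (0, t) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            for last in subset:
                sub_cost, sub_plan = best[subset - {last}]
                cost = sub_cost + math.prod(card[t] for t in subset)
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, f"({sub_plan} JOIN {last})")
    return best[frozenset(tables)]

cost, plan = best_plan(["A", "B", "C"])
print(cost, plan)  # the cheapest plan joins the small tables A and B first
```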

11
Q

What is the cost model development challenge in database query optimization, and how is it solved?

A

Challenge: Accurately estimating the cost of operations like hash joins in terms of memory, time, and/or CPU cycles.
Solution: Develop sophisticated cost models for accurate resource prediction.

12
Q

How does database query optimization tackle the challenge of cardinality estimates?

A

Challenge: Estimating the number of rows a query operation will return.
Solution: Employ statistical methods and data sampling for educated guesses on data distribution and sizes.
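A minimal sketch of sampling-based cardinality estimation in Python (the data, predicate, and sample size are invented for illustration).

```python
import random

random.seed(42)  # deterministic data for illustration
table = [random.randint(0, 9) for _ in range(100_000)]

# Estimate how many rows satisfy `value < 3` from a 1% sample.
sample = random.sample(table, 1_000)
selectivity = sum(1 for v in sample if v < 3) / len(sample)
estimate = selectivity * len(table)

actual = sum(1 for v in table if v < 3)
print(estimate, actual)  # the estimate lands close to the true count
```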

13
Q

What are the optimization opportunities in database query processing?

A

Challenge: Identifying areas to make queries more efficient.
Solution: The optimizer rearranges, simplifies, or optimizes queries for better performance.

14
Q

What is the approach to enhancing evaluation and execution in database query processing?

A

Challenge: Speeding up specific operations.
Solution: Techniques like indexing, query rewriting, and optimizing execution algorithms.

15
Q

How are storage and data encoding challenges addressed in database systems?

A

Challenge: Deciding on efficient data storage and encoding methods.
Solution: Choose appropriate structures (like B-trees, columnar storage) for improved data access and manipulation.
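A toy illustration of the row-versus-column trade-off in plain Python (the data is invented for illustration).

```python
# Row-oriented: whole tuples stored together, good for OLTP point lookups.
rows = [(1, "Ada", 30), (2, "Bo", 25), (3, "Cy", 35)]

# Column-oriented: one array per attribute; an aggregate touches only
# the column it needs, not every field of every row.
columns = {
    "id": [1, 2, 3],
    "name": ["Ada", "Bo", "Cy"],
    "age": [30, 25, 35],
}

avg_age = sum(columns["age"]) / len(columns["age"])
print(avg_age)  # 30.0
```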

16
Q

What are the 7 challenges in Database Query Processing and Optimization?

A
  1. Large Plan Space
  2. Enumeration through Plan Space
  3. Developing a Cost Model
  4. Estimating the cardinality of operations
  5. Identifying areas where queries can be made more efficient
  6. Making certain operations execute faster
  7. Deciding how to store data efficiently in various structures and encodings
17
Q

How is the cost of a database query plan estimated?

A

Node-Level Cost Estimation: Estimate each node’s operation cost based on input size, involving resource calculations like CPU time and memory usage.
Result Size and Properties Assessment: For each node, estimate output size and properties (e.g., data sortedness), impacting subsequent operation performance.
Combining Local Estimates for Overall Cost: Aggregate local estimates from all nodes to derive overall cost, considering how each node’s performance affects others in terms of data flow and transformation.
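The bottom-up aggregation of local estimates can be sketched as a recursion over a toy plan tree (the operators, cardinalities, and per-row costs are invented for illustration).

```python
# Toy plan node: (operator, estimated output rows, cost per output row, children).
plan = ("hash_join", 1_000, 2.0, [
    ("scan_orders", 10_000, 1.0, []),
    ("scan_items", 50_000, 1.0, []),
])

def total_cost(node):
    # Combine local estimates bottom-up: a node's cost is its own work
    # plus the cost of producing its inputs.
    _op, rows, per_row, children = node
    return rows * per_row + sum(total_cost(c) for c in children)

print(total_cost(plan))  # 62000.0
```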

18
Q

What is important to understand about error propagation in cost estimation for database query plans?

A

Nature of Estimation: Cost estimates are approximations and inherently prone to error.
Error Propagation: Errors in estimation can exponentially propagate through the plan, affecting overall accuracy.
Positional Impact: Operations closer to the actual data typically have smaller errors. Those further up the tree are more prone to compounded inaccuracies.
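A line of arithmetic shows the compounding: if each operator misestimates cardinality by an assumed factor q, the error at depth d grows like q to the power d.

```python
# Assumed per-operator misestimate factor; real errors vary per operator.
q = 2.0
for depth in range(1, 6):
    print(depth, q ** depth)  # 2, 4, 8, 16, 32: exponential growth with depth
```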

19
Q

What is the primary goal of cost estimation in database query planning?

A

Main Objective: To avoid selecting the worst-case execution plans.

20
Q

What is a Reduction Factor in database queries, and how does it influence query optimization?

A
  • Definition: The reduction factor (rf) measures the fraction of a dataset's tuples that satisfy a condition C, indicating how much the condition shrinks the dataset.
  • Calculation: The rf for condition C is the number of tuples satisfying C divided by the total number of tuples. For compound conditions (C1, C2, …, Cm), it is the product of the individual rfs.
  • Role in Optimization: Helps estimate query result size, which is crucial for choosing efficient execution plans. For instance, combining conditions C1 and C2 with rfs of 0.2 and 0.5 gives a combined rf of 0.2 * 0.5 = 0.1, reducing the dataset to 10% of its size.
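The card's numbers, worked in Python (the base-table size of 10,000 tuples is an assumption added for illustration).

```python
# Reduction factors from the example above: rf(C1) = 0.2, rf(C2) = 0.5.
rf_c1, rf_c2 = 0.2, 0.5
combined_rf = rf_c1 * rf_c2   # 0.1 for the conjunction C1 AND C2

total_tuples = 10_000         # assumed base-table size
estimated_rows = combined_rf * total_tuples
print(estimated_rows)         # about 1000 rows survive: 10% of the dataset
```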