3: Sampling, FO overview, storage Flashcards
What is the general process for using sampling as an estimation technique?
Role: Sampling is used in scenarios where histograms may not be applicable or sufficient. It allows for dealing with various types of queries, including complex ones.
Process:
- Mini-Query Execution: A small-scale version of the query is run on a sample of the dataset (population).
- Sampling Scheme: This involves a set of rules for how the sample is drawn from the population. The method of sampling directly impacts the estimation process and its accuracy.
- Estimator Application: The results obtained from the sample query are then scaled up using an estimator to approximate the result for the entire dataset.
Usage: Sampling is particularly useful for queries involving aggregations like COUNT, SUM, AVERAGE, etc., where direct computation on the entire dataset might be impractical.
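A minimal sketch of this process in Python, assuming uniform row-level sampling without replacement; the table, predicate, and `estimate_count` helper are invented for illustration, and the scale-up factor N/n is the simplest estimator for COUNT-style aggregates:
```python
import random

def estimate_count(table, predicate, sample_size):
    """Estimate how many rows of `table` satisfy `predicate` by running
    the 'mini-query' on a uniform sample and scaling the result up."""
    n = min(sample_size, len(table))
    sample = random.sample(table, n)                    # sampling scheme: uniform, without replacement
    hits = sum(1 for row in sample if predicate(row))   # mini-query execution on the sample
    return hits * len(table) / n                        # estimator: scale up by N / n

# Hypothetical usage: estimate how many orders exceed 100 in value.
orders = [{"id": i, "value": random.uniform(0, 200)} for i in range(100_000)]
print(estimate_count(orders, lambda r: r["value"] > 100, sample_size=1_000))
```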
Briefly explain the Horvitz-Thompson Estimator, how it works and how it is calculated.
Basic Concept: The estimator adjusts for the unequal probabilities of selection in the sample. It ensures that the estimation is unbiased, meaning it accurately represents the whole population over multiple samples.
How It Works:
- For each item in the sample, take its inclusion probability, i.e., the probability that this item was selected into the sample.
- Divide each item’s value by its inclusion probability (equivalently, multiply by the inverse of that probability).
- Sum up these weighted values to get the estimate of the population total.
Formula:
- Suppose you have a sample of items, say item1, item2, …, itemn.
- Let P(item) be the inclusion probability of each item, i.e., the probability of that item being selected into the sample.
- The Horvitz-Thompson estimator is calculated as:
Sum of (Value of each item / P(item)) over all items in the sample.
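A small sketch of this calculation in Python, assuming each sampled item comes with its value and its known inclusion probability; the concrete values and probabilities below are made up for illustration:
```python
def horvitz_thompson_total(sampled_items):
    """sampled_items: list of (value, inclusion_probability) pairs.
    Each value is divided by its inclusion probability, then summed,
    giving an unbiased estimate of the population total."""
    return sum(value / p for value, p in sampled_items)

# Illustrative sample: items drawn with unequal inclusion probabilities.
sample = [(10.0, 0.01), (4.0, 0.05), (7.0, 0.02)]
print(horvitz_thompson_total(sample))  # 10/0.01 + 4/0.05 + 7/0.02 = 1430.0
```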
Explain the concept of the Central Limit Theorem and how it applies to sampling-based estimations in database optimization. What is assumed about the distribution of the underlying data?
Concept of CLT: The Central Limit Theorem helps estimate the accuracy of a sampling-based estimate Y. It applies when the population size N and the sample size n are sufficiently large.
Application in Sampling:
- When drawing a sample of size n from a distribution with mean μ and variance σ², the CLT can be used to understand the behavior of the sample average. Notably, no particular shape is assumed for the underlying distribution; it only needs a finite mean and variance.
- As n increases, the sample average will converge to the true mean μ, according to the law of large numbers.
Distribution of Y/N:
- For large n and N, the distribution of Y/N approximates a normal distribution with mean μ and variance σ²/n.
- About 95% of the mass of this normal distribution is within two standard deviations of its mean.
Note: Make sure to go through some examples on how to calculate this. One example is given in the slides
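As a rough illustration (a sketch, not the exact slide example): with a sample mean and sample variance in hand, roughly 95% of the mass lies within two standard deviations of the mean, giving an approximate confidence interval for μ. The population and sample sizes below are invented.
```python
import random
import statistics

def clt_interval(sample):
    """Approximate 95% confidence interval for the population mean,
    using the CLT: sample mean +/- 2 * sqrt(sample variance / n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    var = statistics.variance(sample)     # sample variance (estimates sigma^2)
    half_width = 2 * (var / n) ** 0.5     # two standard deviations of the sample mean
    return mean - half_width, mean + half_width

# Illustrative data: the underlying distribution's shape does not matter,
# only that its mean and variance are finite.
population = [random.expovariate(1 / 50) for _ in range(100_000)]
sample = random.sample(population, 1_000)
print(clt_interval(sample))
```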
What are the advantages of sampling as an estimation technique?
Immediacy: Sampling requires no pre-construction, allowing for immediate use without prior setup.
Pervasiveness: This method is widely supported across modern database engines, making it a universally applicable approach.
Adaptivity: Samples can be drawn on the fly over the current data, providing flexibility in handling various data types and structures.
Ease of Implementation: Compared to other methods, sampling is generally easier to implement, requiring less complex algorithms and data structures.
What are the disadvantages of sampling as an estimation technique?
Not Suitable for Highly Selective Queries: Highly selective queries require a large sample size to be representative; accuracy drops sharply when only a small fraction of the data satisfies the predicate.
Expensive in Terms of Evaluation:
1. Sampling, particularly row-level sampling, can be much slower than other estimation methods like histograms.
2. High I/O Cost: Sampling often involves more extensive data reading operations, increasing the input/output cost.
Sensitive to Skew and Outliers: Sampling can be adversely affected by data skew and outliers, leading to less accurate estimations.
Limitations in Applicability: There are several classes of queries for which sampling-based estimators do not provide reliable results, limiting their applicability.
Explain the main components of a Relational Data Model
Universe of Atomic Values (U): U is a universe of atomic values that may appear in the database.
Relation Schema: A relation schema is defined by a name R and a finite set of attribute names attributes(R) = {A1, …, Ak}. This is akin to the structure of a table, including the table name and its header.
Fact over Relation Schema: A fact over a relation schema R of arity k is expressed as R(a1,…,ak), where a1,…,ak are elements of U. Alternatively, a tuple over R can be seen as a function mapping from attributes(R) to U, indicating the types of values the attributes can hold (like integers, varchars, floats, etc.). This resembles a row in a table.
Instance of Relation Schema: An instance of a relation schema R is a finite set of facts or tuples over R, representing the data filled in the table structure.
Database Schema (Signature): A database schema, also known as a “signature”, is a finite set of relation schemas, denoted as D.
Instance of Database Schema: An instance of a database schema D comprises a set of relation instances, one for each relation schema contained in D.
Provide an example of a database schema.
Consider a database schema consisting of four relation schemas:
- author(authorID, name, birthdate, language)
- book(bookID, title, authorId, publisher, language, year)
- store(storeID, address, phone)
- sells(storeID, bookID)
Here, author, book, store, and sells are relation schemas.
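As a sketch tying this example back to the formal definitions, the same schema can be written down as plain Python structures (relation schemas as attribute lists, instances as sets of facts); the concrete facts below are invented for illustration:
```python
# Database schema D: a finite set of relation schemas (name -> attribute names).
schema = {
    "author": ("authorID", "name", "birthdate", "language"),
    "book":   ("bookID", "title", "authorId", "publisher", "language", "year"),
    "store":  ("storeID", "address", "phone"),
    "sells":  ("storeID", "bookID"),
}

# An instance of D: one relation instance (a finite set of facts) per relation schema.
# Each fact R(a1, ..., ak) is a tuple of atomic values with arity k = len(attributes(R)).
instance = {
    "author": {(1, "Astrid Lindgren", "1907-11-14", "Swedish")},
    "book":   {(10, "Pippi Longstocking", 1, "R&S", "Swedish", 1945)},
    "store":  {(100, "Main Street 1", "555-0100")},
    "sells":  {(100, 10)},
}

# Sanity check: every fact's arity matches its relation schema.
assert all(len(fact) == len(schema[r]) for r, facts in instance.items() for fact in facts)
```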
What’s the relationship between the expressiveness of a query language and the complexity of its evaluation?
Inverse Relationship: The expressiveness of a query language is inversely related to its evaluation complexity. Highly expressive languages like SQL are more complex to evaluate.
What categories of query languages are there, and how are they ranked based on expressiveness?
- First-Order Logic (FO): Corresponds to the full Tuple Relational Calculus (TRC).
- Conjunctive Queries (Conj): The fragment of TRC using only existential quantification and conjunction.
  - Corresponds to basic SELECT-FROM-WHERE in SQL.
  - Equates to the {σ, π, ×} fragment of Relational Algebra.
  - Matches single positive Datalog rules.
- Acyclic Conjunctive Queries (AConj):
  - Queries that admit a join tree, i.e., have no cycles in their join structure.
  - Represent a subset of conjunctive queries.
How are queries expressed in terms of complexity? What are the complexity spaces associated with the FO, Conj and Aconj queries?
Combined Complexity (depends on both the query size |Q| and the database size |D|):
- FO: PSPACE-complete.
- Conj: NP-complete.
- AConj: LOGCFL-complete.
Data Complexity (depends on the database size |D| alone):
- FO: Logspace.
- Conj: Logspace.
- AConj: Linear time.
Explain the concept of Homomorphism and how it applies to database queries.
Definition: In the context of query evaluation, a homomorphism is a function that maps the variables of the query to values in a database instance I. It is essentially about finding an assignment for each variable such that every condition (atom) of the query is matched by an actual fact in the database.
It’s a way to validate whether a particular combination of variable assignments makes the query true in the context of the given database.
Example
Suppose you have a query that looks for pairs of people where one is the parent of the other. In Datalog, this might look like:
- Query: parentChild(Parent, Child) <- parent(Parent, X), child(X, Child).
Here, a homomorphism would be a function that maps the variables Parent, X, and Child to actual people in your database such that the parent and child relationships are correctly represented according to the data.
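A naive sketch of homomorphism-based evaluation in Python for the rule above: it enumerates candidate assignments for Parent, X, and Child over the active domain and keeps those under which both body atoms map to facts of the instance. The facts below are invented for illustration.
```python
from itertools import product

# Illustrative instance: parent(p, x) and child(x, c) facts.
parent = {("alice", "bob"), ("bob", "carol")}
child = {("bob", "carol"), ("carol", "dave")}

# Active domain: all values occurring in the instance.
domain = {v for fact in parent | child for v in fact}

# A homomorphism maps the variables (Parent, X, Child) to domain values
# such that both body atoms become facts of the instance.
answers = {
    (p, c)
    for p, x, c in product(domain, repeat=3)
    if (p, x) in parent and (x, c) in child
}
print(answers)  # {('alice', 'carol'), ('bob', 'dave')}
```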
What are the different levels of memory hierarchy used in database systems, and what are their key characteristics?
CPU Cache:
- Characteristics: Fastest, least capacity, rapid access to frequently used data.
- Usage: Optimizes speed by storing frequently accessed data.
Secondary Caches (e.g., L2/L3 caches):
- Characteristics: Balance between speed and capacity, located between CPU cache and main memory.
- Usage: Acts as an intermediary storage to optimize data retrieval.
Random Access Memory (RAM):
- Characteristics: Main memory, faster than disk storage, volatile.
- Usage: Temporary storage for active operations and data processing.
Solid-State Drives (SSD):
- Characteristics: Non-volatile, faster than HDDs, more expensive per GB.
- Usage: Provides quicker data access and storage, used for frequently accessed data.
Hard Disk Drives (HDD):
- Characteristics: Secondary storage, high capacity, lower speed and cost.
- Usage: Used for bulk storage where speed is less critical.
Tape Storage:
- Characteristics: Tertiary storage, used for archival and backup, slowest access time.
- Usage: Ideal for long-term data storage and backups.
How do the characteristics of different storage levels in a database system relate to each other, and what is their collective impact on database query execution?
Capacity vs. Speed:
- There’s a notable relationship between access time and storage capacity. The fastest storage levels, like the CPU cache, have minimal capacity but provide the quickest access. Conversely, larger storage solutions like HDDs offer more capacity but at slower speeds.
Collective Impact on Query Execution:
- The database engine must strategically use these storage levels, considering their speed and capacity, to optimize data retrieval and storage operations. For instance, frequently accessed data might be kept in faster storage like RAM or SSDs, while less frequently accessed data is stored in HDDs or tape storage.
- The ultimate goal is to minimize I/O operations, which are the most time-consuming aspect of query execution. By effectively leveraging the memory hierarchy, a database system can significantly reduce query execution times.
How are database elements mapped onto secondary memory, and what is the significance of block sizes in this context?
Mapping of Database Elements:
- Tuples to Records: Individual tuples (rows) in a database are mapped to records within the storage system. This mapping facilitates the organization and retrieval of row-based data.
- Relations to Files: Entire relations (or tables) in the database are mapped to files in the storage system. This association helps in managing and accessing complete tables.
- Files to Pages: The files are further organized into collections of pages (or disk blocks), which are the basic units of storage on a disk. This structure optimizes the storage and retrieval process.
Block Sizes:
- Characteristic: Block (or page) sizes typically range from 4 KB to 64 KB.
- Significance: These sizes are chosen to balance the efficient management of disk space and the optimization of I/O operations. Larger block sizes can reduce the number of I/O operations needed for data retrieval and storage, but they must be managed to avoid wasting disk space with partially filled blocks.
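A small back-of-the-envelope sketch of the tuples → records → pages mapping, assuming records do not span pages and using invented record and page sizes; it shows how the block size determines how many pages (and hence I/Os for a full scan) a relation occupies:
```python
import math

def pages_needed(num_tuples, record_bytes, page_bytes):
    """Assuming records do not span pages: each page holds
    floor(page_bytes / record_bytes) records (the blocking factor),
    so the file needs ceil(num_tuples / blocking_factor) pages."""
    blocking_factor = page_bytes // record_bytes
    return math.ceil(num_tuples / blocking_factor)

# Illustration: 1,000,000 tuples of 200 bytes each, under different page sizes.
for page_kb in (4, 8, 16, 64):
    print(page_kb, "KB pages:", pages_needed(1_000_000, 200, page_kb * 1024))
```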
What is the role of the buffer manager in data access within a database system, and how does it operate?
Role in Data Access:
- The buffer manager is tasked with bringing data from disk into main memory (a level of the hierarchy closer to the CPU) so it can be processed.
Buffer Pool in Main Memory:
- Located in the main memory, the buffer pool comprises chunks of memory known as buffer frames. These frames are sized to align with disk-page sizes, ensuring compatibility between main memory and disk storage.
- Buffer frames can be dynamically filled with data from disk or emptied when their contents are no longer required, providing flexibility in data management.
Buffer Manager’s Operation:
- Data Request Handling:
- Upon a data request, the buffer manager first checks if the required page is already in the buffer pool. If it is, the page is immediately made available for processing.
- If the page is not in the pool, the buffer manager follows a specific policy (like Least Recently Used - LRU, or First In First Out - FIFO) to discard an existing page. It then reads the needed page from the disk into the buffer.
- This operation positions the buffer pool as an intermediate layer that mediates data access between the CPU and disk storage, influencing the efficiency of data processing and I/O operations in the database system.
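A toy sketch of this request-handling logic under an LRU replacement policy; the pool size and the page-reading callback are assumptions for illustration, not any particular engine’s API:
```python
from collections import OrderedDict

class BufferManager:
    def __init__(self, num_frames, read_page_from_disk):
        self.frames = OrderedDict()               # page_id -> page contents, kept in LRU order
        self.num_frames = num_frames
        self.read_page_from_disk = read_page_from_disk  # assumed disk-read callback

    def get_page(self, page_id):
        if page_id in self.frames:                # hit: page already in the buffer pool
            self.frames.move_to_end(page_id)      # mark as most recently used
            return self.frames[page_id]
        if len(self.frames) >= self.num_frames:   # pool full: evict per LRU policy
            self.frames.popitem(last=False)       # discard the least recently used frame
        page = self.read_page_from_disk(page_id)  # miss: read the needed page from disk
        self.frames[page_id] = page
        return page

# Illustrative usage with a fake "disk".
bm = BufferManager(num_frames=3, read_page_from_disk=lambda pid: f"contents of page {pid}")
for pid in (1, 2, 3, 1, 4):                       # page 1 stays cached; page 2 gets evicted
    bm.get_page(pid)
print(list(bm.frames))                            # [3, 1, 4]
```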