3: Sampling, FO overview, storage Flashcards

1
Q

What is the general process for using sampling as an estimation technique?

A

Role: Sampling is used in scenarios where histograms may not be applicable or sufficient. It allows for dealing with various types of queries, including complex ones.

Process:
- Mini-Query Execution: A small-scale version of the query is run on a sample drawn from the dataset (the population).
- Sampling Scheme: This involves a set of rules for how the sample is drawn from the population. The method of sampling directly impacts the estimation process and its accuracy.
- Estimator Application: The results obtained from the sample query are then scaled up using an estimator to approximate the result for the entire dataset.

Usage: Sampling is particularly useful for queries involving aggregations like COUNT, SUM, AVERAGE, etc., where direct computation on the entire dataset might be impractical.
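
As a minimal illustration of this process (not from the slides; the table, predicate, and sizes below are invented), the sketch runs a selection predicate on a uniform random sample and scales the hit count by N/n to estimate the full COUNT:

```python
import random

def estimate_count(table, predicate, sample_size, seed=0):
    """Estimate how many rows of `table` satisfy `predicate`, from a uniform sample."""
    rng = random.Random(seed)
    N, n = len(table), sample_size
    sample = rng.sample(table, n)                        # the "mini-query" input
    hits = sum(1 for row in sample if predicate(row))    # run the small-scale query
    return hits * N / n                                  # estimator: each sampled row stands for N/n rows

# Hypothetical data: rows are (id, price); estimate COUNT(*) WHERE price > 50.
rng = random.Random(1)
table = [(i, rng.uniform(0, 100)) for i in range(100_000)]
print(round(estimate_count(table, lambda row: row[1] > 50, sample_size=1_000)))
```

The same scheme carries over to SUM by adding up the sampled values instead of counting hits.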

2
Q

Briefly explain the Horvitz-Thompson Estimator, how it works and how it is calculated.

A

Basic Concept: The estimator adjusts for unequal probabilities of selection in the sample. It is unbiased: its expected value over repeated samples equals the true population total.

How It Works:
- For each item in the sample, determine its probability of being selected into the sample, i.e., its inclusion probability.
- Weight each item’s value by the inverse of that inclusion probability (divide the value by the probability).
- Sum these weighted values to obtain the estimate of the population total.

Formula:
- Suppose the sample contains items item1, item2, …, itemn with values y1, …, yn.
- Let πi denote the inclusion probability of itemi (its probability of being selected into the sample).
- The Horvitz-Thompson estimate of the population total is:
Ŷ = (y1 / π1) + (y2 / π2) + … + (yn / πn), i.e., the sum of (value / inclusion probability) over all items in the sample.
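
A small sketch of this calculation under an assumed Bernoulli/Poisson-style sampling scheme (each item is included independently with its own inclusion probability); the population values and probabilities are invented for illustration:

```python
import random

def horvitz_thompson(sampled_values, inclusion_probs):
    """Horvitz-Thompson estimate of the population total: sum of y_i / pi_i over the sample."""
    return sum(y / p for y, p in zip(sampled_values, inclusion_probs))

rng = random.Random(42)
population = [rng.randint(1, 100) for _ in range(10_000)]
probs = [0.01 + 0.04 * v / 100 for v in population]        # unequal pi_i, roughly 0.01 to 0.05

# Draw the sample: item i is included independently with probability pi_i.
sample = [(v, p) for v, p in zip(population, probs) if rng.random() < p]
estimate = horvitz_thompson([v for v, _ in sample], [p for _, p in sample])
print(f"true total = {sum(population)}, HT estimate = {estimate:.0f}")
```

Dividing by the inclusion probability gives extra weight to items that were unlikely to be sampled, which is what makes the estimator unbiased.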

3
Q

Explain the concept of the Central Limit Theorem and how it applies to sampling-based estimations in database optimization. What is assumed about the distribution of the underlying data?

A

Concept of CLT: The Central Limit Theorem helps estimate the accuracy of a sampling-based estimate Y. It applies when the population size N and the sample size n are sufficiently large.

Application in Sampling:
- When drawing a sample of size n from a distribution with mean μ and variance σ², the CLT can be used to understand the behavior of the sample average.
- As n increases, the sample average will converge to the true mean μ, according to the law of large numbers.

Distribution of Y/N:
- For large n and N, the distribution of Y/N approximates a normal distribution with mean μ and variance σ²/n.
- About 95% of the mass of this normal distribution lies within two standard deviations of its mean.

Assumption about the underlying data:
- The underlying data is not assumed to be normally distributed; the CLT only requires that the sampled values come from a distribution with finite mean μ and finite variance σ², and that the sample (and population) are sufficiently large.

Note: Make sure to go through some examples of how to calculate this. One example is given in the slides.
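
As a hedged illustration (not the slide example; the data below is synthetic and deliberately non-normal), the sketch estimates a population mean from a sample and attaches a roughly 95% interval of two standard errors, following the CLT:

```python
import random
import statistics

rng = random.Random(7)
population = [rng.expovariate(1 / 50) for _ in range(200_000)]   # skewed data, true mean about 50
N, n = len(population), 2_000

sample = rng.sample(population, n)
mean_est = statistics.fmean(sample)               # estimate of mu
stderr = statistics.stdev(sample) / n ** 0.5      # CLT: sample mean ~ Normal(mu, sigma^2 / n)

# About 95% of the normal distribution's mass lies within two standard deviations of its mean.
print(f"estimate {mean_est:.2f} ± {2 * stderr:.2f}, true mean {statistics.fmean(population):.2f}")
```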

4
Q

What are the advantages of sampling as an estimation technique?

A

Immediacy: Sampling requires no pre-construction, allowing for immediate use without prior setup.
Pervasiveness: This method is widely supported across modern database engines, making it a universally applicable approach.
Adaptivity: Samples can be drawn and adapted on the fly, providing flexibility in handling various data types and query structures.
Ease of Implementation: Compared to other methods, sampling is generally easier to implement, requiring less complex algorithms and data structures.

5
Q

What are the disadvantages of sampling as an estimation technique?

A

Not Suitable for Highly Selective Queries: When a query is highly selective, few (or no) qualifying tuples show up in a sample, so a very large sample is needed to be representative; accuracy therefore drops sharply for highly selective queries.

Expensive in Terms of Evaluation:
1. Sampling, particularly row-level sampling, can be much slower than other estimation methods like histograms.
2. High I/O Cost: Sampling often involves more extensive data reading operations, increasing the input/output cost.

Sensitive to Skew and Outliers: Sampling can be adversely affected by data skew and outliers, leading to less accurate estimations.

Limitations in Applicability: There are several classes of queries for which sampling-based estimators do not provide reliable results, limiting the technique’s applicability.

6
Q

Explain the main components of a Relational Data Model

A

Universe of Atomic Values (U): U is a universe of atomic values that can be used in the database.

Relation Schema: A relation schema is defined by a name R and a finite set of attribute names attributes(R) = {A1, …, Ak}. This is akin to the structure of a table: the table name and its header.

Fact over Relation Schema: A fact over a relation schema R of arity k is written R(a1, …, ak), where a1, …, ak are elements of U. Alternatively, a tuple over R can be seen as a function from attributes(R) to U, assigning each attribute a value from its domain (integers, varchars, floats, etc.). This resembles a row in a table.

Instance of Relation Schema: An instance of a relation schema R is a finite set of facts or tuples over R, representing the data filled in the table structure.

Database Schema (Signature): A database schema, also known as a “signature”, is a finite set of relation schemas, denoted as D.

Instance of Database Schema: An instance of a database schema D comprises a set of relation instances, one for each relation schema contained in D.
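
As a toy rendering of these definitions (the schema is the author relation from the example later in this deck; the facts are invented): a relation schema is a name plus attribute names, a fact is a tuple of values from U, and instances are finite sets of facts.

```python
# Relation schema: a name R together with its finite set of attribute names attributes(R).
author_schema = ("author", ("authorID", "name", "birthdate", "language"))

# An instance of the schema: a finite set of facts author(a1, ..., ak) with values from U.
author_instance = {
    ("a1", "Jane Doe", "1970-01-01", "English"),
    ("a2", "Juan Pérez", "1982-05-17", "Spanish"),
}

# A database schema D is a finite set of relation schemas; a database instance
# assigns one relation instance to each schema in D.
db_schema = {author_schema}
db_instance = {"author": author_instance}
print(f"{len(author_instance)} facts over {author_schema[0]}")
```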

7
Q

Provide an example of a database schema.

A

Consider a database schema consisting of four relation schemas:
- author(authorID, name, birthdate, language)
- book(bookID, title, authorId, publisher, language, year)
- store(storeID, address, phone)
- sells(storeID, bookID)

Here author, book, store, and sells are relation schemas; the names in parentheses are their attribute names.

8
Q

What’s the relationship between the expressiveness of a query language and the complexity of its evaluation?

A

Inverse Relationship: The expressiveness of a query language trades off against the efficiency of its evaluation: the more expressive the language, the higher its evaluation complexity. Highly expressive languages like SQL (full first-order logic) are more complex to evaluate, while restricted fragments such as conjunctive queries can be evaluated more efficiently.

9
Q

What categories of query languages are there, and how are they ranked based on expressiveness?

A
Ranked from most to least expressive:
  1. First-Order Logic (FO): Corresponds to the full Tuple Relational Calculus (TRC).
  2. Conjunctive Queries (Conj): TRC restricted to existential quantification and conjunction.
    - Corresponds to basic SELECT-FROM-WHERE in SQL.
    - Equates to the {σ, π, ×} fragment of Relational Algebra.
    - Matches single positive Datalog rules.
  3. Acyclic Conjunctive Queries (AConj):
    - Conjunctive queries that admit a join tree (no cycles).
    - A subset of conjunctive queries.

10
Q

How is the complexity of query evaluation measured? What complexity classes are associated with FO, Conj, and AConj queries?

A

Combined Complexity (depends on both the query size |Q| and the database size |D|):
- FO: PSPACE-complete.
- Conj: NP-complete.
- AConj: LOGCFL-complete.

Data Complexity (depends on the database size |D| alone):
- FO: in LOGSPACE.
- Conj: in LOGSPACE.
- AConj: linear time.

11
Q

Explain the concept of Homomorphism and how it applies to database queries.

A

Definition: In query evaluation, a homomorphism is a function that maps the variables of a query to values in a database instance I. It is essentially a mapping of the query’s variables such that every atom of the query becomes an actual fact in the database, i.e., the query conditions are satisfied.

It’s a way to validate whether a particular combination of variable assignments makes the query true in the context of the given database.

Example
Suppose you have a query that looks for pairs of people where one is the parent of the other. In Datalog, this might look like:
- Query: parentChild(Parent, Child) <- parent(Parent, X), child(X, Child).

Here, a homomorphism would be a function that maps the variables Parent, X, and Child to actual people in your database such that the parent and child relationships are correctly represented according to the data.
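
A brute-force sketch of this idea for the rule above, assuming two small hypothetical relations parent and child stored as sets of facts; every assignment of {Parent, X, Child} under which both body atoms become facts of the database is a homomorphism:

```python
# Hypothetical database instance I: facts of the two relations in the rule's body.
parent = {("ann", "bob"), ("bob", "carol")}
child = {("bob", "dave"), ("carol", "erin")}

# Enumerate homomorphisms: assignments of Parent, X, Child to values such that
# parent(Parent, X) and child(X, Child) are both facts in I.
homomorphisms = [
    {"Parent": p, "X": x, "Child": c}
    for (p, x) in parent
    for (x2, c) in child
    if x == x2
]

# The answers to parentChild(Parent, Child) are the projections of these mappings.
print([(h["Parent"], h["Child"]) for h in homomorphisms])
```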

12
Q

What are the different levels of memory hierarchy used in database systems, and what are their key characteristics?

A

CPU Cache:
- Characteristics: Fastest, least capacity, rapid access to frequently used data.
- Usage: Optimizes speed by storing frequently accessed data.

Secondary Caches:
- Characteristics: Balance between speed and capacity, located between CPU cache and main memory.
- Usage: Acts as an intermediary storage to optimize data retrieval.

Random Access Memory (RAM):
- Characteristics: Main memory, faster than disk storage, volatile.
- Usage: Temporary storage for active operations and data processing.

Solid-State Drives (SSD):
- Characteristics: Non-volatile, faster than HDDs, more expensive per GB.
- Usage: Provides quicker data access and storage, used for frequently accessed data.

Hard Disk Drives (HDD):
- Characteristics: Secondary storage, high capacity, lower speed and cost.
- Usage: Used for bulk storage where speed is less critical.

Tape Storage:
- Characteristics: Tertiary storage, used for archival and backup, slowest access time.
- Usage: Ideal for long-term data storage and backups.

13
Q

How do the characteristics of different storage levels in a database system relate to each other, and what is their collective impact on database query execution?

A

Capacity vs. Speed:
- There’s a notable relationship between access time and storage capacity. The fastest storage levels, like the CPU cache, have minimal capacity but provide the quickest access. Conversely, larger storage solutions like HDDs offer more capacity but at slower speeds.

Collective Impact on Query Execution:
- The database engine must strategically use these storage levels, considering their speed and capacity, to optimize data retrieval and storage operations. For instance, frequently accessed data might be kept in faster storage like RAM or SSDs, while less frequently accessed data is stored in HDDs or tape storage.
- The ultimate goal is to minimize I/O operations, which are the most time-consuming aspect of query execution. By effectively leveraging the memory hierarchy, a database system can significantly reduce query execution times.

14
Q

How are database elements mapped onto secondary memory, and what is the significance of block sizes in this context?

A

Mapping of Database Elements:
- Tuples to Records: Individual tuples (rows) in a database are mapped to records within the storage system. This mapping facilitates the organization and retrieval of row-based data.
- Relations to Files: Entire relations (or tables) in the database are mapped to files in the storage system. This association helps in managing and accessing complete tables.
- Files to Pages: The files are further organized into collections of pages (or disk blocks), which are the basic units of storage on a disk. This structure optimizes the storage and retrieval process.

Block Sizes:
- Characteristic: Block (or page) sizes typically range from about 4 KB to 64 KB.
- Significance: These sizes are chosen to balance the efficient management of disk space and the optimization of I/O operations. Larger block sizes can reduce the number of I/O operations needed for data retrieval and storage, but they must be managed to avoid wasting disk space with partially filled blocks.
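
A back-of-the-envelope sketch (the page size, record size, and record count are assumptions, not from the slides) of how the block size determines the blocking factor, the number of pages a relation occupies, and the space lost to partially filled blocks when records do not span block boundaries:

```python
import math

block_size = 8 * 1024        # assumed 8 KB pages
record_size = 300            # assumed fixed-length records of 300 bytes
num_records = 1_000_000

records_per_block = block_size // record_size                  # blocking factor (records don't span blocks)
blocks_needed = math.ceil(num_records / records_per_block)     # pages the relation occupies
wasted_per_block = block_size - records_per_block * record_size

print(f"{records_per_block} records/block, {blocks_needed} blocks, "
      f"{wasted_per_block} bytes unused per block")
```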

15
Q

What is the role of the buffer manager in data access within a database system, and how does it operate?

A

Role in Data Access:
- The buffer manager is tasked with bringing data into the lowest level of memory (closest to the CPU) for processing.

Buffer Pool in Main Memory:
- Located in the main memory, the buffer pool comprises chunks of memory known as buffer frames. These frames are sized to align with disk-page sizes, ensuring compatibility between main memory and disk storage.
- Buffer frames can be dynamically filled with data from disk or emptied when their contents are no longer required, providing flexibility in data management.

Buffer Manager’s Operation:
- Data Request Handling:
- Upon a data request, the buffer manager first checks if the required page is already in the buffer pool. If it is, the page is immediately made available for processing.
- If the page is not in the pool, the buffer manager follows a specific policy (like Least Recently Used - LRU, or First In First Out - FIFO) to discard an existing page. It then reads the needed page from the disk into the buffer.
- This operation positions the buffer pool as an intermediate layer that mediates data access between the CPU and disk storage, influencing the efficiency of data processing and I/O operations in the database system.
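
A minimal sketch of this request-handling loop with an LRU replacement policy; the pool size and the disk-read callback are placeholders, and real buffer managers additionally handle pinning, dirty pages, and write-back:

```python
from collections import OrderedDict

class BufferPool:
    def __init__(self, num_frames, read_page_from_disk):
        self.frames = OrderedDict()               # page_id -> page contents, kept in LRU order
        self.num_frames = num_frames
        self.read_page_from_disk = read_page_from_disk

    def get_page(self, page_id):
        if page_id in self.frames:                # hit: page already buffered
            self.frames.move_to_end(page_id)      # mark as most recently used
            return self.frames[page_id]
        if len(self.frames) >= self.num_frames:   # pool full: evict the least recently used frame
            self.frames.popitem(last=False)
        page = self.read_page_from_disk(page_id)  # miss: fetch the page from disk
        self.frames[page_id] = page
        return page

# Toy usage: "disk" is just a dict of page contents.
disk = {i: f"page-{i}" for i in range(10)}
pool = BufferPool(num_frames=3, read_page_from_disk=disk.__getitem__)
for pid in [1, 2, 3, 1, 4, 2]:                    # page 1 stays hot; page 3 ends up evicted
    pool.get_page(pid)
print(list(pool.frames))                          # frames currently buffered (LRU -> MRU)
```

Swapping the eviction line for a plain FIFO queue would implement the other policy mentioned above.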

16
Q

What are the key parameters of the I/O model, and why is minimizing I/O operations critical for database performance?

A

Key Parameters of the I/O Model:
- B: The number of blocks in a file.
- D: Average time to read a block from disk (typically around 15 milliseconds).
- R: Number of records per block.
- C: Average time to process a record (around 100 nanoseconds).

I/O Dominance in Database Performance:
- Given these typical values, a database can perform approximately 150,000 record operations in the CPU for each block access from the disk.
- This significant disparity between disk access time and in-CPU processing time makes I/O the dominating factor in database performance. It underscores the importance of strategies aimed at reducing disk accesses (I/O) to enhance overall database efficiency.
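
Plugging the card's typical values into the model (B and R are assumed here) shows both the roughly 150,000 ratio and how the cost of a full file scan is dominated by the disk term B·D rather than the CPU term B·R·C:

```python
D = 15e-3      # average time to read one block from disk (~15 ms)
C = 100e-9     # average time to process one record (~100 ns)
B = 10_000     # assumed number of blocks in the file
R = 100        # assumed records per block

print(f"record operations per block access: {D / C:,.0f}")   # ~150,000

io_time = B * D          # disk component of a full scan
cpu_time = B * R * C     # CPU component of a full scan
print(f"scan: I/O {io_time:.1f} s vs CPU {cpu_time:.3f} s ({io_time / cpu_time:,.0f}x)")
```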

17
Q

How are records organized in database storage, and what are the differences between fixed-length and variable-length records?

A

File Organization:
- In a database, a file is logically organized as a sequence of records, which are then systematically mapped onto disk blocks. This structure facilitates the organized storage and efficient retrieval of data.

Record Types:
Fixed-Length Records:
- Characteristics: Each record has a uniform size, akin to tuples in a relation.
- Impact: Simplifies storage and retrieval operations because the size of each record is predictable and consistent.

Variable-Length Records:
- Characteristics: Sizes of these records can vary to accommodate data types like strings, which might not have a fixed size.
- Impact: Requires additional mechanisms (like pointers or special markers) to handle the variability in record size. These mechanisms are essential to efficiently manage and access data with varying sizes within the database.
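
A hedged sketch of the difference using Python's struct module: the fixed-length layout always packs to the same size, while the variable-length record here uses a length prefix for its string field (one possible mechanism among several):

```python
import struct

# Fixed-length record: (id: 4-byte int, price: 8-byte float, code: exactly 10 bytes).
FIXED = struct.Struct("<i d 10s")
rec = FIXED.pack(42, 19.99, b"ABCDEFGHIJ")
print(len(rec), FIXED.size)            # every record has the same, predictable size

# Variable-length record: a 2-byte length prefix followed by the string's bytes.
def pack_variable(rec_id: int, name: str) -> bytes:
    name_bytes = name.encode("utf-8")
    return struct.pack("<i H", rec_id, len(name_bytes)) + name_bytes

print(len(pack_variable(1, "Ada")), len(pack_variable(2, "Grace Hopper")))  # sizes differ
```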

18
Q

What is the structure of a page in database storage, and how does it handle tuples and fields, especially in the context of fixed-length and variable-length records?

A

Basic Unit of I/O - Page:
- A page is the basic unit of I/O in a database system, representing a specific number of bytes on the disk.

Tuples and Fields in a Page:
- Tuples within a page consist of various fields, each corresponding to an attribute of the tuple. The space reserved for each field depends on its data type.
- For instance, a string attribute with a maximum length of 10 characters will have reserved space to accommodate up to 10 characters.

Page Header and Directory:
- Header: Each page includes a header, functioning as a directory to organize the records within the page.

- Fixed-Length Records: If the page stores fixed-length records, the directory can efficiently manage the records using fixed offsets, as each record occupies a uniform amount of space.
- Variable-Length Records: For variable-length records, the directory structure becomes more complex. This complexity arises due to the variability in record sizes, requiring more sophisticated mechanisms for indexing and accessing the records within the page.
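
A compact sketch of the "header as directory" idea for variable-length records, often called a slotted page (exact layouts vary by system, and this one omits free-space management): the header keeps one (offset, length) slot per record, and records are reached only through their slots.

```python
class SlottedPage:
    def __init__(self):
        self.slots = []          # header/directory: one (offset, length) entry per record
        self.data = bytearray()  # page body holding the record bytes

    def insert(self, record: bytes) -> int:
        self.slots.append((len(self.data), len(record)))
        self.data += record
        return len(self.slots) - 1            # slot number identifies the record on this page

    def read(self, slot_no: int) -> bytes:
        offset, length = self.slots[slot_no]
        return bytes(self.data[offset:offset + length])

page = SlottedPage()
rid = page.insert(b"variable-length record")
page.insert(b"short")
print(page.read(rid))
```
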
19
Q

How are records organized within a database file?

A

Sequential Organization:
- Records are ordered based on certain fields. This organization facilitates ordered access and efficient search operations, as records follow a predictable sequence.

Heap Organization:
- An unordered structure where records are placed in any available space. This can be managed using a linked-list approach, which is flexible but may lead to inefficiencies in data retrieval.

Hash File Organization:
- Organized based on a hash function applied to one or more fields of the records. This method is efficient for direct access operations, where records are quickly retrieved based on their hashed value.
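
A tiny sketch of hash file organization (the bucket count and the choice of storeID as the hash field are assumptions): each record goes into the bucket determined by hashing the organizing field, so an equality lookup inspects only that bucket.

```python
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]          # each bucket stands in for a page (or chain of pages)

def bucket_of(key: str) -> int:
    return hash(key) % NUM_BUCKETS                   # hash function applied to the organizing field

def insert(record):                                  # record = (storeID, address, phone)
    buckets[bucket_of(record[0])].append(record)

def lookup(store_id: str):
    return [r for r in buckets[bucket_of(store_id)] if r[0] == store_id]

insert(("s1", "Main St 1", "555-0100"))
insert(("s2", "Oak Ave 9", "555-0101"))
print(lookup("s2"))                                  # only one bucket is examined
```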

20
Q

What are the advantages of sorted data in databases?

A

Efficiency in Search Operations:
- Sorted data significantly enhances the efficiency of search operations. By having data in a sorted order, the database can employ more efficient search algorithms, like binary search, which drastically reduce the number of comparisons needed to find a record.

Reduced Complexity:
- Sorting reduces the complexity of search queries and enhances overall data retrieval performance. It simplifies range queries and enables faster data access.
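
A short sketch with Python's bisect module (synthetic keys) showing both points: binary search over sorted keys needs only about log2(n) comparisons, and a range query becomes a contiguous slice of the sorted sequence.

```python
import bisect
import math

keys = list(range(0, 1_000_000, 2))        # sorted keys (even numbers only)

# Point lookup via binary search: O(log n) comparisons instead of a full scan.
target = 123_456
i = bisect.bisect_left(keys, target)
found = i < len(keys) and keys[i] == target
print(found, f"~{math.ceil(math.log2(len(keys)))} comparisons vs up to {len(keys):,} for a scan")

# Range query: all keys in [lo, hi] form one contiguous slice of the sorted data.
lo, hi = 100, 120
print(keys[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)])
```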