Approximations Flashcards
Provide two important parameters of probabilistic bounds
Epsilon (ϵ): Epsilon represents the error bound on the estimate. With probability at least 1-δ, the computed answer deviates from the actual answer by at most ϵ (interpreted as an absolute or relative error, depending on the sketch). In other words, a smaller epsilon value means a more accurate estimate, while a larger epsilon value means a less accurate estimate.
(Depending on the sketch, the error bound may be proportional to a global quantity such as the total input size rather than to the actual answer, in which case the relative error becomes smaller as the actual value increases.)
Delta (δ): Delta represents the confidence level of the estimate. With a probability of at least 1-δ, the computed answer will be within the error bounds specified by epsilon. In other words, a smaller delta value indicates a higher confidence level in the accuracy of the estimate, while a larger delta value represents a lower confidence level.
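Example: with ϵ = 0.01 (as a relative error) and δ = 0.05, the guarantee says that in at least 95% of cases the estimate is within 1% of the true answer.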
Summarize the difference between Equi-width and Equi-depth histograms.
Equi-width: Divide the distinct values of your dataset into equally sized buckets in order, e.g. for width = 4, [1..8] -> [[1..4], [5..8]]. If we're interested in frequencies, we store the total frequency of the whole bucket. The information stored per bucket is the start value, the end value, and the frequency. We lose information, as we cannot tell anything about the internal distribution of the values contained in a bucket. (See the code sketch below.)
Equi-depth: Instead of grouping on the number of distinct values in the dataset, we group by the total weight (frequency) of the values, so that all buckets are equally large. Each bucket holds a fixed total weight, but if one value has a high frequency, then that bucket will consist largely of that single value.
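A minimal sketch of both constructions in Python (the toy dataset, the bucket width of 4, and the choice of two equi-depth buckets are illustrative assumptions):

values = sorted([1, 1, 2, 2, 2, 3, 5, 5, 6, 7, 8, 8, 8, 8])

# Equi-width: buckets cover equal value ranges; store (start, end, frequency).
width = 4
equi_width = []
for start in range(min(values), max(values) + 1, width):
    end = start + width - 1
    freq = sum(1 for v in values if start <= v <= end)
    equi_width.append((start, end, freq))
print(equi_width)   # [(1, 4, 6), (5, 8, 8)]

# Equi-depth: buckets hold (roughly) equal counts; boundaries adapt to the data.
num_buckets = 2
per_bucket = len(values) // num_buckets
equi_depth = []
for i in range(0, len(values), per_bucket):
    chunk = values[i:i + per_bucket]
    equi_depth.append((chunk[0], chunk[-1], len(chunk)))
print(equi_depth)   # [(1, 5, 7), (5, 8, 7)]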
Is the equi-width or equi-depth histogram more accurate? Which can be used for guaranteeing accuracy?
Equi-depth is generally more accurate and is the one that can be used to guarantee accuracy, since every bucket carries (roughly) the same total weight, which bounds the error any single bucket can contribute. Equi-width cannot do that, because the estimate you will be given is the size of the bucket over the length of the bucket: for example, if the size is 16 (the total frequency is 16) and the length is 2 (there are 2 distinct values in this bucket), each value is estimated at 8, and the maximum error is only bounded by the size of the bucket, which can be arbitrarily large.
How can histograms be used to estimate the join cardinality? What is the issue with this?
When you have two columns of data, the join cardinality can be estimated (inaccurately) by calculating a histogram over each column. Then, sum up the element-wise products of matching buckets to get an estimate of the join cardinality.
The problem is that we have no accuracy guarantees as we lose information about the distribution of the samples when creating histograms.
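A rough sketch of this estimate in Python (the equi-width bucketing with width 4 and the toy columns are illustrative assumptions):

from collections import Counter

def histogram(column, bucket_width):
    # Equi-width histogram: bucket index -> frequency.
    return Counter(value // bucket_width for value in column)

def estimate_join_cardinality(col_a, col_b, bucket_width=4):
    ha, hb = histogram(col_a, bucket_width), histogram(col_b, bucket_width)
    # Sum of element-wise products over the buckets both histograms share.
    return sum(ha[bucket] * hb[bucket] for bucket in ha.keys() & hb.keys())

a = [1, 2, 2, 5, 7]
b = [2, 2, 3, 8]
print(estimate_join_cardinality(a, b))   # 9, while the true equi-join size is only 4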
Why do multi-dimensional histograms not save a lot of space? Give an example.
When adding more dimensions to the histogram, it becomes more sparse: the more dimensions there are, the less probable it is that a given combination of bucket values is actually populated by a data point.
An example:
We have three columns: a price column, a product column and a quantity column. Each product has a price, and the quantity specifies how many of these products were bought in a single order. We can then create a 2-D histogram to see the distribution of quantities per order for products in a specific price range.
              Quantity: 1-2   Quantity: 3-4   Quantity: 5-6
Price: 0-10         1               1               0
Price: 11-20        2               0               0
Price: 21-30        1               0               0
In this case the number of cells grows exponentially with the number of dimensions, while most cells remain empty, which reduces the benefit of using histograms for this purpose.
Why is it difficult to construct histograms in a streaming fashion? Give an argument for both equi-width and equi-depth histograms.
Equi-depth: We need to know the distribution a-priori. We cannot set appropriate boundaries if we do not know anything about the distribution of the data. If the boundaries do not capture the distribution of values, the histogram is less useful.
Equi-width: We need to know the domain (range of values), e.g. the minimum and maximum salary. Equi-width histograms divide the data into equally sized intervals (buckets) based on the range of values, so to set the bucket boundaries and cover all possible values we need to know the min and max of the data up front.
Provide three characteristics of sketches
- They compute approximate answers
- They are logarithmic in space (sublinear in the number of items)
- They have probabilistic guarantees on the quality of the approximate answer
Provide the steps you take when designing your own sketch
- Define the functionality (e.g. support containment queries)
- Define the input domain. What should the sketch accept? What is it we want to summarize? (e.g. integers: [0..2^32-1])
- Define the output domain. (e.g. map it to 8 bits) (sketch)
- Define the mapping function from the input to the output domain (e.g. hash function h(x) = |(3x + 2) % 8|)
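A minimal code version of this example design (input domain, output domain, and hash function come from the steps above; the test value 42 is illustrative):

# Input domain: integers, output domain: 8 bits, mapping: h(x) = |(3x + 2) % 8|
def h(x: int) -> int:
    return abs((3 * x + 2) % 8)

sketch = [0] * 8        # the 8-bit output domain is the sketch itself
x = 42                  # some input from the domain [0..2^32-1]
sketch[h(x)] = 1        # h(42) = 128 % 8 = 0
print(sketch)           # [1, 0, 0, 0, 0, 0, 0, 0]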
Name four important properties of a hash function
- It should evenly distribute items to the output domain
- This should be done independent of the previously hashed elements
- If using multiple hash functions for the same task, make sure they are pairwise independent, namely they don’t use the same components. Example of pairwise dependent hash functions:
h1(x) = |(7x + 11) % 17| % 8
h2(x) = |(3x + 11) % 13| % 8
- Make sure to use large prime numbers
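The two example functions written out in Python (they share the "+ 11" component, which is why they are treated as pairwise dependent above):

def h1(x: int) -> int:
    return abs((7 * x + 11) % 17) % 8

def h2(x: int) -> int:
    return abs((3 * x + 11) % 13) % 8

print([h1(i) for i in range(5)])   # [3, 1, 0, 7, 5]
print([h2(i) for i in range(5)])   # [3, 1, 4, 7, 2]  -- same index as h1 for x = 0, 1 and 3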
How do you create a very simple sketch for a containment query, and how do you query it?
Create:
1. Turn your input into an integer
2. Initialize a bitvector with size m
3. hash your input so that it maps to an index in [0, m)
4. Set the bit at the index to 1
5. If there is already a 1 there, do nothing
Query:
1. hash your input value
2. check if the bit in the bitvector at the hash index is set to 1
Better methods are Bloom filters or Cuckoo filters
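A minimal version of the create/query steps (the hash is a variant of the toy one from the design steps above, with the modulus replaced by an arbitrary m = 16):

class ContainmentSketch:
    # Single hash function over a bit vector: answers "have I possibly seen x?"

    def __init__(self, m: int = 16):
        self.m = m
        self.bits = [0] * m

    def _hash(self, x: int) -> int:
        return abs((3 * x + 2) % self.m)    # maps the input to an index in [0, m)

    def insert(self, x: int) -> None:
        self.bits[self._hash(x)] = 1        # setting an already-set bit changes nothing

    def query(self, x: int) -> bool:
        return self.bits[self._hash(x)] == 1

s = ContainmentSketch()
s.insert(7)
print(s.query(7))    # True
print(s.query(8))    # False here, unless the queried value collides with an inserted one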
How do you calculate the space complexity for a simple containment query sketch?
O(m) = O(n / ln(1/(1 - Pr[collision])))
- where n is the number of inputs
- and Pr[collision] is the probability of a collision.
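Worked example: for n = 1,000 inserted items and an acceptable collision probability of 50%, m = 1000 / ln(1/(1 - 0.5)) = 1000 / ln 2 ≈ 1,443 bits.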
What is meant by composability and idempotency?
Composability: a sketch is composable if it can be built in a distributed fashion and combined later.
Idempotency: Repeated insertions of the same element do not change the synopsis.
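For a bit-vector sketch like the one above, both properties are easy to see in code (the merge helper and the toy bit patterns are illustrative):

def merge(sketch_a, sketch_b):
    # Composability: sketches built independently (e.g. on two machines)
    # combine with a bitwise OR per position.
    return [a | b for a, b in zip(sketch_a, sketch_b)]

left  = [0, 0, 0, 1, 0, 0, 0, 0]   # sketch built over one part of the data
right = [0, 0, 0, 0, 0, 1, 0, 0]   # sketch built over the rest of the data
print(merge(left, right))          # [0, 0, 0, 1, 0, 1, 0, 0]

# Idempotency: re-inserting an element sets a bit that is already 1,
# so the synopsis stays exactly the same.
bits = [0] * 8
bits[3] = 1                        # first insertion of some element x
bits[3] = 1                        # repeated insertion of x -- no change
print(bits)                        # [0, 0, 0, 1, 0, 0, 0, 0]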
Explain the benefits of repetition in approximations.
Let’s say we have a randomized algorithm with error probability p (and correctness probability 1-p)
If we repeat this algorithm k times independently, the probability that all k repetitions fail is p * p * ... * p = p^k, so the correctness probability evaluates to 1 - p^k.
The correctness probability increases with the number of independent trials.
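Worked example: with error probability p = 0.2 per run, three independent repetitions all fail only with probability 0.2^3 = 0.008, so the correctness probability rises from 0.8 to 1 - 0.008 = 0.992.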
Explain the main components of Bloom filters and how they can be used for a containment query.
- A bit array of length m, with all bits initially set to 0
- k pairwise independent hash functions
When we observe a new value, we hash it with each hash function to obtain k indices and set the corresponding bits of the bit array to 1. To query whether a value is contained in the set, we hash it with each hash function and check whether the bits at all resulting indices are 1 in the bit vector. If any of them is not 1, we return false; otherwise we return true.
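A small illustrative implementation (the bit-array size m = 16, k = 2 and the seeded SHA-256 construction of the hash functions are assumptions, not part of the flashcard):

import hashlib

class BloomFilter:
    def __init__(self, m: int = 16, k: int = 2):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, value: str):
        # k hash functions simulated by prefixing the value with the function's number.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, value: str) -> None:
        for idx in self._indices(value):
            self.bits[idx] = 1

    def query(self, value: str) -> bool:
        # False means definitely not inserted; True means possibly inserted.
        return all(self.bits[idx] == 1 for idx in self._indices(value))

bf = BloomFilter()
bf.insert("apple")
print(bf.query("apple"))   # True
print(bf.query("pear"))    # False, or a false positive if its bits happen to collide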
What is the probability of a false positive in a bloom filter?
Pr[FP] = (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k
For a fixed bit-array size m and number of hash functions k, the FP probability of a Bloom filter grows with the number of inserted items n, because the fill factor 1 - e^(-kn/m) approaches 1 exponentially fast in n.
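Worked example: with m/n = 10 bits per inserted item and k = 7 hash functions, Pr[FP] ≈ (1 - e^(-0.7))^7 ≈ 0.008, i.e. under 1%.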