OLAP Flashcards
The formula for the range-sum algorithm
Sum(i=0:x) Sum (i=0:y) “Aij”
The range sum (or prefix sum) of A(1,1) = A(0,0) + A(0,1) + A(1,0) + A(1,1)
Range sum of a middling block like A(2:3, 2:3) in prefix sum data cube
A(3,3) - A(0,2) - A(0,3) + A(0,1)
- find the full sum of the bottom right corner of the region
- subtract out non-included regions
- add back one unit of the region that was double counted
Blocked Range-Sum Algorithm
partition the original data cube into blocks of size bxb
Then for the bottom right elements of each block only compute the range-sum like normal based on the original array.
Remove all other data from the data cube
(The idea of this approach is to reduce the space complexity of holding the data, at the cost of a little time)
What is the motivation behind the dynamic data cube?
where d = # of dimensions and n is the size of each dimension
For the original array:
query complexity = O(n**d)
update complexity = O(1)
For the prefix-sum array:
query complexity = O(1)
update complexity = O(n**d)
Is some compromise in which query and update complexity are both log(n) possible? YES!
how to transform the original array into dynamic data cube?
partition original array into k x k rectangles (k can be any size)
in each rectangle we are only going to keep kd - (k-1)d numbers. We only keep the bottom row and the rightmost column.
step 1 - Compute “original DDC” Since the bottom right element is a part of both the rightmost column and bottom row, you must decide to treat it as one or the other. It doesn’t matter which one you pick
step 2 - compute values. First, compute the “original values in the rightmost column as the value of the prefix sum of their corresponding row. Example 2nd element in rightmost column is the prefix-sum of the 2nd row
compute the values of the bottom row as the prefix-sum of the corresponding column.
step 3 - remove all the other values that are not in the bottom row or rightmost column.
step 4 - if you treated the bottom right element as part of the column, then first compute the prefix-sum of the column, if you treated it as part of the row, compute the prefix-sum of the row first. Then compute the prefix-sum of the other entity that you didn’t compute first.
Structure of dynamic data cube
The original array is transformed at different levels of partitioning
root depth - partition original array into k x k rectangles. Do DDC transform on each k x k rectangle
depth 1 - partition original array into k x k rectangles. Then partition each k x k rectangle into k/2 x k/2 rectangles. Do DDC transform on each k/2 x k/2 rectangle
depth 2 - partition original array into k x k rectangles. Then partition each k x k rectangle into k/2 x k/2 rectangles. Then partition each k/2 x k/2 rectangle into k/4 x k/4 rectangles. Do DDC transform on each k/4 x k/4 rectangle
create more levels and repeat DDC transform until the rectangles are 1 x 1 (aka [k/(#dpeth)**2] = 1)
The levels should be arranged in a tree format. The root depth DDC transform is the root. Then the partitioning of each of the 4 root level rectangles produces one child of the root. This repeats for all levels.
How to query DDC?
1) determine the area that needs to be summed
2) draw in the root depth partitioning rectangles
3) based on the root depth DDC, determine if any sections of the area can be evaluated by the values in any of the depth level rectangles
(remember that the values represent the entire prefix-sum of the data cube up until that point. For example, DDC(2,3) represents the sum of all elements A(0,0) through A(2,3)
4) for any portion(s) of the query area that cannot be evaluated from the root depth rectangles, partition those rectangles into depth 1 rectangles.
5) repeat step 3) and 4) until the entire query region can be evaluated
Updating a data cube. Deriving the DDC at depth i-1, from the DDC at depth i (assuming depth i is not the leaf level)
Re-compute the "original rows" as follows: for j = n..1 => A(j, n) - A(j-1, n) = sum(row(j)) and sum(row(0)) = A(0,n)
Re-compute the "original columns as follows: for j = n..1 => A(n ,j) - A(n, j-1) = sum(column(j)) and sum(column(0)) = A(n,0)
Repeat for all rectangles (normally 4) in the tree node
Use EITHER the value of the last row OR the last column in the combined data cube BUT NOT BOTH.
Compute prefix-sum like we did for DDC transform treating the bottom right element as EITHER a row or a column like before
What is B**c tree and what is the motivation?
a B**c tree is a cumulative B tree
in the DDC update, updating the top left value of an overlay box at depth i, will cause all the values to the right and down from that position in the overlay box at depth i-1
For the following, (1) = top-left, (2) = top-right, (3) = bottom-left, (4) = bottom-right
Example:
Depth i: (assume these are boxes of size 1x1
b1 b2
b3 b4
at Depth i-1:
a2 => made up of {b1, b2, b3, b4}
if we update b1, then all the values in a2 and a4 DDC will be changed.
SOOOOO it’s possible that a DDC update could cause us to have to update all n values in the DDC making the update cost O(n)
The entire point of the DDC was to find a middle ground in query cost and update cost.
Constructing a B**c tree
We are concerned about the bottom row of a given overlay box (not including the element in the nth column)
The leaves of the B**c tree represent the sums of the columns in the overlay box. For example, the 2nd leaf from the left represents the sum of column 1. Think back to the “original values” we generated in our DDC transform algorithm before we did the prefix-sum, that’s what we want.
Remember a B**c tree IS A B-tree with modifications:
mod: root contains 1 key
same as B-tree: a node with k keys has k+1 children
same as B-tree: a node has at least ceil(m/2) -1 keys and at most m-1 keys
mod: leaf keys refer to the index of that leaf row-sum value in the original array
mod: internal nodes (including the root) are indexed from 2 to n
mod: internal nodes (including the root) maintain a value called STS (subtree sum). STS equals the cumulative sum of all leaf values in the key’s left subtree.
The leaves are assigned to parents as follows(with k+1 leaves we have k keys in the parent node):
- the leaves must but indexed in the order they appear in the DDC
- the first k leaves become left children of the first k keys. The (k+1)th leaf, becomes the right child of the kth key.
How to query a B**c tree?
1) determine the index of the value in the DDC you want to find the range-sum for (we will call this “index”)
2) maintain a variable called sum initialized to 0.
3) look at the root of the B**c tree, if index >= root key, add STS to sum, move to the right subtree node. Else leave sum the same and move to the left subtree node.
4) compare index to each node key. If index >= node key, add node key STS to sum. Else do nothing.
5) Go to the leaf corresponding to value index. Add that value to sum.
How to update B**c tree?
1) determine the index of the row-sum array to be updated
2) store the current value that entry in old_value
3) update the entry. The updated entry will be called new_value
4) go to leaf index that has been updated, change its value to new_value
5) compute the value_diff = new_value - old_value
6) starting from updated leaf traverse up the tree
7) if the updated leaf affects the STS (is in the left subtree of the key being examined), add value_ff to STS value for the key. If not, do nothing.
8) continue up the tree
9) repeated (7) and (8) until the root key has been examined
DDC complexity of query and update WITHOUT a B**c tree
query = O(log(n)) update = O(n) in worst case
DDC complexity of query and update WITH B**c tree
query = O(log^d(n)) update = O(log^d(n))
d = # of dimensions
Complexity of query and update SRPS
when n is a perfect square:
query = 4
update = 2 sqrt(n) - 2
otherwise
query = 4
update = 2sqrt(n)