8. Iterators and Joins Flashcards
Describe a join in terms of tuples operations
take two relations, R and S, and create one new relation out of their matches on the join condition – that is, for each record r_i in R, find all records s_j in S that match the join condition we have specified and write < r_i , s_j > as a new row in the output (all the fields of r followed by all the fields of s).
Simple Nested Loop Join - describe
Analyze cost
take each record in R, search for all its matches in S, and then we yield each match
Cost: [R]+|R|[S], where [R] is the number of pages in R and |R| is the number of records in R
i.e. scan R once ([R]) and then scan the whole S for each record of R
Page Nested Loop Join - describe, analyze cost
for a page of R, take all the records and match them against each record in S, and do this for every page of R.
Cost: [R] + [R][S]
[R] - number of pages in R
i.e. Scan all pages of R ([R]) and then for each page of R, scan all pages of S.
Chunk Nested Loop Join - describe, analyze cost
B - number of pages in a memory buffer
We can have B-2 pages of R and 1 page of S in a buffer at the same time (the last 1 page is used as an output buffer). So we can join R by such chunks instead of pages or records.
For such a chunk, take all the records and match them against each record in S, and do this for every chunk.
Cost: [R] + ([R] / B−2) * [S]
[R] - number of pages in R
i.e. scan all pages of R, then for each chunk (there are [R] / B−2 of them) scan all pages of S.
Index Nested Loop Join - describe, analyze cost
if we have an index on S that is on the appropriate field (i.e. the field we are joining on), it can be very fast to look up matches of r_i in S.
Cost: [R] + |R|∗(cost to look up matching records in S).
[R] - # pages in R
|R| - # records in R
Naive Hash Join - describe, analyze cost
B - number of pages in a memory buffer
construct a hash table that is B-2 pages big on the records of R, fit it into memory, and then read in each record of S and look it up in R’s hash table to see if we can find any matches on it.
Cost: [R] + [S] I/Os
However, it relies on R <= B - 2, which is very often not true
Grace Hash Join - describe
the main drawback
B - number of pages in the in-memory buffer.
- repeatedly hash R and S into B-1 buffers so that we can get partitions that are ≤ B−2 pages big
- Use naive hash join on partitions
Drawback: sensitive to key skew. Key skew is when we try to hash but many of the keys go into the same bucket.
Sort-Merge Join - describe, the average cost
- sort R and S first
- we begin at the start of R and S and advance one or the other until we get to a match
- let’s assume we’ve gotten to a match. We mark this spot in S as marked(S) and check each subsequent record in S until we find something that is not a match.
- Now, go to the next record in R and go back to the marked spot in S and begin again at step 1. Idea: Because R and S are sorted, any match for any future record of R cannot be before the marked spot in S.
Average cost: cost to sort R + cost to sort S + ([R] + [S])