Hash Tables Flashcards
Direct Addressing
keys are just integers -> allocate an array as large as the key domain -> direct random access. lookup, insert, remove in O(1)
Problem: huge waste of memory if key domain is larger than the set of keys stored
best approach for small and dense key ranges
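A minimal sketch of direct addressing, assuming integer keys in a small dense range (the class name and API are hypothetical):

```python
# Direct addressing: one array slot per possible key.
# Assumes keys are integers in [0, domain_size).
class DirectAddressTable:
    def __init__(self, domain_size):
        # wastes memory if the domain is much larger than the stored key set
        self.slots = [None] * domain_size

    def insert(self, key, value):
        self.slots[key] = value   # O(1) random access

    def lookup(self, key):
        return self.slots[key]    # O(1)

    def remove(self, key):
        self.slots[key] = None    # O(1)
```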
Hash function and hash table, desired properties
Hash function h maps keys from a domain U to positions in a hash table of size m
Desired properties: fast hash computation, small probability of two keys mapping to the same hash value (a collision)
how to handle collision?
With chaining
each position in hash table has linked list of entries
how large should m (table size) be?
chain length must be constant for average constant-time access -> m must be proportional to n.
m > f * n
f: fill factor parameter, typically close to 1
if final size of hash table is not known, how to handle it?
it must grow dynamically (rehashing)
1. create new, larger table
2. iterate over entries in old table, and insert into new hash table
how large should new table be?
to keep amortized insertion time constant, must grow exponentially by a factor g > 1
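The two rehashing steps above can be sketched with a toy chaining table (class name, initial size, and growth trigger are assumptions; Python's built-in hash() stands in for the hash function):

```python
# Chaining table that grows by factor g = 2 when the fill factor reaches 1,
# keeping amortized insertion time constant.
class ChainedTable:
    def __init__(self, m=8):
        self.m = m
        self.n = 0
        self.buckets = [[] for _ in range(m)]

    def insert(self, key, value):
        if self.n >= self.m:          # fill factor f = 1 reached
            self._rehash(self.m * 2)  # exponential growth, g = 2
        self.buckets[hash(key) % self.m].append((key, value))
        self.n += 1

    def _rehash(self, new_m):
        # 1. create new, larger table; 2. re-insert every old entry
        old = self.buckets
        self.m = new_m
        self.buckets = [[] for _ in range(new_m)]
        for bucket in old:
            for key, value in bucket:
                self.buckets[hash(key) % new_m].append((key, value))

    def lookup(self, key):
        for k, v in self.buckets[hash(key) % self.m]:
            if k == key:
                return v
        return None
```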
Smooth growth without rehashing
linear hashing or extendible hashing
what makes a good hash function?
hash function should look random but be deterministic
Modulo prime hash function/fast modulo
modulo by a prime uses integer division, which is expensive
division by a constant can be replaced by multiplication with a precomputed magic value (multiplicative inverse)
Alternative to prime hashing, trick to avoid modulo or magic numbers
Powers of two
most real-world hash functions compute a 32 bit or 64 bit hash
to be used in a hash table of size m, must reduce it with h(x) % m
trick:
restrict m to powers of two (implies g = 2)
modulo becomes a fast bit mask: h(x) & (m - 1)
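A quick check of the mask trick (the hash value is an arbitrary 64-bit constant):

```python
# With m a power of two, h % m equals h & (m - 1): the mask keeps exactly
# the low log2(m) bits, which is what modulo by 2^k computes.
m = 1 << 10              # m = 1024, a power of two (implies g = 2)
h = 0x9E3779B97F4A7C15   # an arbitrary 64-bit hash value
assert h % m == h & (m - 1)
```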
randomized properties of hash functions
Universal hashing: two distinct keys collide with probability 1/m or less
Pairwise independence: probability that two distinct keys take some particular pair of hash values is 1/m^2
K-independence: probability that any k distinct keys take some particular hash values is 1/m^k
Tabulation hashing, adv and disadv
break up a key of r * t bits into t small fixed-size chunks of r bits each
for each chunk, look up in its own table of 2^r random entries
XOR lookup results
example with 32 bit key x and 8 bit chunks:
h(x) = t[0][x0] XOR t[1][x1] XOR t[2][x2] XOR t[3][x3]
Adv: 3-independent, could be implemented efficiently in hardware
disadv: consumes CPU cache, random state grows with key size
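The 32-bit example above can be sketched directly (random seed and 64-bit table entries are arbitrary choices):

```python
import random

# Tabulation hashing for 32-bit keys with 8-bit chunks (r = 8, t = 4):
# four tables of 2^8 random 64-bit entries, XORed together.
random.seed(42)  # fixed seed so the hash is reproducible in this sketch
T = [[random.getrandbits(64) for _ in range(256)] for _ in range(4)]

def tab_hash(x):
    # h(x) = t[0][x0] XOR t[1][x1] XOR t[2][x2] XOR t[3][x3]
    return (T[0][x & 0xFF] ^
            T[1][(x >> 8) & 0xFF] ^
            T[2][(x >> 16) & 0xFF] ^
            T[3][(x >> 24) & 0xFF])
```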
Open Addressing
Chaining needs at least two dependent memory accesses for a hit
storing entries directly in hash table
how to handle collisions in open addressing?
Open addressing with linear probing
on collision, insert into next free position
pos = (pos + 1) % m
if m is power of 2 => cheaper
lookup must check entries until free entry is found
fill factor f must be below 1 for good performance
at high fill factors, OA leads to long probe sequence lengths (clustering effect)
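A toy version of open addressing with linear probing, assuming a power-of-two m so wrap-around is a mask (no growth; the caller must keep the fill factor below 1):

```python
# Open addressing with linear probing: entries live directly in the table,
# collisions go to the next free position.
class LinearProbingTable:
    EMPTY = object()   # sentinel for free slots

    def __init__(self, m=16):
        self.m = m
        self.keys = [self.EMPTY] * m
        self.vals = [None] * m

    def insert(self, key, value):
        pos = hash(key) & (self.m - 1)
        # probe until a free slot (or the same key, to overwrite)
        while self.keys[pos] is not self.EMPTY and self.keys[pos] != key:
            pos = (pos + 1) & (self.m - 1)
        self.keys[pos] = key
        self.vals[pos] = value

    def lookup(self, key):
        pos = hash(key) & (self.m - 1)
        while self.keys[pos] is not self.EMPTY:  # stop at first free entry
            if self.keys[pos] == key:
                return self.vals[pos]
            pos = (pos + 1) & (self.m - 1)
        return None
```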
what is a way to get better space utilization?
have fixed-size groups of entries (e.g. each group can hold four 16-byte entries)
typically a group has size of cache line
may require a counter for each group
SIMD can be used to speed up comparison with several elements (compare your key against all keys in a group with one instruction)
Open Addressing plus Chaining
store first elem in array
chain on collision
Chaining vs Open Addressing
different size entries
growth: rehashing operation faster with OA
chaining: lots of cache misses as items are spread in memory
duplicate keys: chain length will grow but no problem
high fill factors: chaining is better, no exploding probe sequence lengths, but has pointer overhead; OA doesn't need pointers
Entry copying:
large entries -> if the hash table grows, OA must copy everything
in chaining, entries are moved from the old hash table to the new one by relinking, no copying
linear probing leads to clusters of entries and therefore long probe sequences. What is a solution?
Robin Hood hashing
re-order elements during insertion based on their probe sequence length
(so that elements with long probe sequence length swap with elem that has a shorter one)
similar performance as linear probing for small fill factors
works better than linear probing at larger fill factors
What is probe sequence length?
Distance of key from its preferred location
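Robin Hood insertion can be sketched on top of linear probing (class name and power-of-two size are assumptions); during probing, the incoming entry steals the slot of any resident with a shorter probe sequence length:

```python
# Robin Hood hashing: linear probing, but re-order entries on insert so
# that no resident is "richer" (closer to its preferred slot) than the
# entry probing past it.
class RobinHoodTable:
    EMPTY = object()

    def __init__(self, m=16):
        self.m = m
        self.keys = [self.EMPTY] * m
        self.vals = [None] * m

    def _psl(self, key, pos):
        # probe sequence length: distance from the key's preferred slot
        return (pos - (hash(key) & (self.m - 1))) & (self.m - 1)

    def insert(self, key, value):
        pos = hash(key) & (self.m - 1)
        psl = 0
        while self.keys[pos] is not self.EMPTY:
            if self.keys[pos] == key:
                self.vals[pos] = value
                return
            # steal the slot if the resident has a shorter PSL,
            # then continue inserting the displaced entry
            if self._psl(self.keys[pos], pos) < psl:
                key, self.keys[pos] = self.keys[pos], key
                value, self.vals[pos] = self.vals[pos], value
                psl = self._psl(key, pos)
            pos = (pos + 1) & (self.m - 1)
            psl += 1
        self.keys[pos] = key
        self.vals[pos] = value

    def lookup(self, key):
        pos = hash(key) & (self.m - 1)
        while self.keys[pos] is not self.EMPTY:
            if self.keys[pos] == key:
                return self.vals[pos]
            pos = (pos + 1) & (self.m - 1)
        return None
```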
Cuckoo Hashing
Open addressing approach that guarantees O(1) lookup and remove performance in the worst case
idea: two hash tables with different hash functions
lookup: check exactly 2 positions
insert: pick any of the 2 positions, if 1 free, put it there. If both spots occupied, try to rearrange keys
rearrangement may fail due to cycle: rebuild using different hash function
only works for low fill factors
can be fast: the memory accesses are independent
cuckoo hashing: insert(z)
z hashes exactly to the locations of x and y. We have to put it in one of those locations -> kick out y, as y also has another location that may be free (each key has two possible spots)
if the other spot for y is also occupied -> continue, kick out the key occupying it -> this can run into a cycle
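A sketch of the scheme with two tables and two hash functions (the multiply-by-odd-seed hash is a toy stand-in, and the cycle guard of 2m kicks is an arbitrary choice):

```python
import random

# Cuckoo hashing: every key lives in one of exactly two positions, one per
# table, so lookup and remove check at most two independent locations.
class CuckooTable:
    def __init__(self, m=64):
        self.m = m
        self.t = [[None] * m, [None] * m]   # two tables
        self._new_seeds()

    def _new_seeds(self):
        # two toy hash functions: multiply by a random odd seed, then mod m
        self.seeds = (random.getrandbits(32) | 1, random.getrandbits(32) | 1)

    def _h(self, i, key):
        return (hash(key) * self.seeds[i]) % self.m

    def lookup(self, key):
        for i in (0, 1):                    # check exactly 2 positions
            slot = self.t[i][self._h(i, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def insert(self, key, value):
        entry = (key, value)
        i = 0
        for _ in range(2 * self.m):         # cycle guard
            pos = self._h(i, entry[0])
            entry, self.t[i][pos] = self.t[i][pos], entry  # kick out occupant
            if entry is None:
                return
            i = 1 - i                       # displaced key tries its other table
        self._rebuild(entry)                # cycle -> new hash functions

    def _rebuild(self, pending):
        entries = [e for tab in self.t for e in tab if e is not None]
        entries.append(pending)
        self.t = [[None] * self.m, [None] * self.m]
        self._new_seeds()
        for k, v in entries:
            self.insert(k, v)
```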
Static perfect hashing
FKS hashing: O(1) worst case lookup time and amortized O(1) insertion time
bulk creation, two levels: (looks like a tree)
1. 2(n-1) partitions (n: number of elements)
2. each partition is a hash table of size r^2 for r entries in partition, on collision pick another hash function
problem: large space consumption, two dependent cache misses
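The two-level construction can be sketched for a static key set (the top-level size 2n, the 4n space bound, and the hash family are this sketch's choices, not the exact FKS constants):

```python
import random

# FKS-style static perfect hashing: level 1 hashes into O(n) buckets;
# a bucket with r entries gets a level-2 table of size r^2 whose hash
# function is retried until it is collision-free.
P = (1 << 61) - 1   # large prime for the (a*x + b) universal hash family

def make_hash(m):
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

def build_fks(keys):
    n = len(keys)
    top_m = 2 * n
    while True:
        h1 = make_hash(top_m)
        buckets = [[] for _ in range(top_m)]
        for k in keys:
            buckets[h1(k)].append(k)
        # accept the top-level hash if second-level space stays linear
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    tables = []
    for b in buckets:
        m2 = len(b) ** 2
        while True:   # on collision pick another hash function
            h2 = make_hash(m2) if m2 else (lambda x: 0)
            slots = [None] * m2
            ok = True
            for k in b:
                i = h2(k)
                if slots[i] is not None:
                    ok = False
                    break
                slots[i] = k
            if ok:
                break
        tables.append((h2, slots))
    return h1, tables

def fks_contains(struct, key):
    h1, tables = struct      # two dependent lookups: tree-like, two levels
    h2, slots = tables[h1(key)]
    return bool(slots) and slots[h2(key)] == key
```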
Concise Hash Table: idea, construction
static hash table optimized for space
idea: open addressing with linear probing, but compress away empty slots
Construction: three phases, assume unique keys
1. hash key, go to that bit and set it. if bit already set -> collision, go to next bit
2. compute prefix popcounts for bitmap
3. hash each key again and insert into array using prefix popcount
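The three phases can be sketched over a plain bit array (Python's hash() and the table size are stand-ins; keys are assumed unique, and phase 3 iterates in the same order as phase 1 so the probe slots match):

```python
# Concise Hash Table sketch: linear probing over a bitmap, with the empty
# slots compressed away via prefix popcounts.
def build_cht(keys, m):
    # Phase 1: set one bit per key, probing past already-set bits
    bitmap = [0] * m
    for k in keys:
        pos = hash(k) % m
        while bitmap[pos]:          # bit set -> collision, go to next bit
            pos = (pos + 1) % m
        bitmap[pos] = 1
    # Phase 2: prefix popcounts (set bits strictly before position i)
    prefix = [0] * m
    for i in range(1, m):
        prefix[i] = prefix[i - 1] + bitmap[i - 1]
    # Phase 3: hash each key again and place it into the dense array;
    # same insertion order as phase 1 reproduces the slot assignment
    array = [None] * len(keys)
    placed = [False] * m
    for k in keys:
        pos = hash(k) % m
        while placed[pos]:
            pos = (pos + 1) % m
        placed[pos] = True
        array[prefix[pos]] = k
    return bitmap, prefix, array

def cht_contains(cht, key, m):
    bitmap, prefix, array = cht
    pos = hash(key) % m
    while bitmap[pos]:              # probe until a free bit is found
        if array[prefix[pos]] == key:
            return True
        pos = (pos + 1) % m
    return False
```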
should hashes be stored or not?
hash functions can have collisions. for correctness, the table must store full keys and compare them on lookup
the hash can always be recomputed from the key -> no need to store the hash
but if you store it: speeds up rehashing (if you have cache misses, it could be faster to compare against the hash than against the key), can speed up negative comparison for complex keys
Table growth: cache locality
in power of two tables, doubling means we use one more hash bit
two possibilities:
index from least significant hash bits: bucket 01 may become 001 or 101 (new bit is prepended)
index from most significant hash bits: bucket 01 may become 010 or 011 (new bit is appended)
MSB: front bits stay the same, you get more resolution-> access pattern is better
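The two possibilities can be demonstrated on an 8-bit toy hash (the hash width and sample value are arbitrary):

```python
# Doubling a power-of-two table adds one index bit. Taking index bits from
# the least significant end of the hash prepends the new bit (bucket 0b01
# scatters to 0b001 or 0b101); taking them from the most significant end
# appends it (0b01 splits into the adjacent buckets 0b010 and 0b011).
HASH_BITS = 8

def idx_lsb(h, bits):
    return h & ((1 << bits) - 1)      # index = low bits of the hash

def idx_msb(h, bits):
    return h >> (HASH_BITS - bits)    # index = high bits of the hash

h = 0b01101101
assert idx_lsb(h, 2) == 0b01 and idx_lsb(h, 3) == 0b101  # new MSB prepended
assert idx_msb(h, 2) == 0b01 and idx_msb(h, 3) == 0b011  # new LSB appended
```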