Hash Tables Flashcards
Direct Addressing
keys are just integers -> allocate an array as large as the key domain -> each key indexes its own slot directly. lookup, insert, remove in O(1)
Problem: huge waste of memory if the key domain is much larger than the set of keys actually stored
best approach for small and dense key ranges
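A minimal C++ sketch of direct addressing (type and method names are made up for illustration):

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Direct-address table sketch: keys are integers in [0, domain_size),
// so each key indexes its own slot directly.
template <typename V>
class DirectAddressTable {
public:
    explicit DirectAddressTable(std::size_t domain_size) : slots_(domain_size) {}

    void insert(std::size_t key, V value) { slots_[key] = std::move(value); }  // O(1)
    void remove(std::size_t key) { slots_[key].reset(); }                      // O(1)
    std::optional<V> lookup(std::size_t key) const { return slots_[key]; }     // O(1)

private:
    // one slot per possible key -> wastes memory when the stored keys are sparse
    std::vector<std::optional<V>> slots_;
};
```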
Hash function and hash table, desired properties
Hash function h maps keys from domain U to a position in a hash table of size m
Desired properties: fast hash computation, small probability of two keys mapping to the same hash value (a collision)
how to handle collision?
With chaining
each position in hash table has linked list of entries
how large should m be (to keep chains short)?
average chain length n/m must stay constant for constant expected access time -> m must be proportional to n
m > f * n
f: fill factor parameter, typically close to 1
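Sketch of chaining in C++ (names are illustrative, not from the card); each bucket holds a linked list of entries picked by h(key) % m:

```cpp
#include <cstddef>
#include <cstdint>
#include <forward_list>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Chained hash table sketch: each bucket is a linked list of (key, value) entries.
struct ChainedTable {
    struct Entry { std::uint64_t key; std::string value; };

    explicit ChainedTable(std::size_t m) : buckets(m) {}

    void insert(std::uint64_t key, std::string value) {
        buckets[bucket_of(key)].push_front({key, std::move(value)});
    }

    std::optional<std::string> lookup(std::uint64_t key) const {
        for (const Entry& e : buckets[bucket_of(key)])   // walk the chain
            if (e.key == key) return e.value;
        return std::nullopt;
    }

    std::size_t bucket_of(std::uint64_t key) const {
        return std::hash<std::uint64_t>{}(key) % buckets.size();
    }

    // m buckets; with m > f * n the average chain length n/m stays constant
    std::vector<std::forward_list<Entry>> buckets;
};
```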
if final size of hash table is not known, how to handle it?
it must grow dynamically (rehashing)
1. create new, larger table
2. iterate over entries in old table, and insert into new hash table
how large should new table be?
to keep amortized insertion time constant, must grow exponentially by a factor g > 1
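Rehashing sketch reusing the ChainedTable above (g = 2 assumed here; n and f as in the earlier cards):

```cpp
// Grow-by-rehashing sketch: call once n (the number of stored entries) exceeds f * m.
void maybe_grow(ChainedTable& table, std::size_t n, double f = 1.0) {
    if (n <= f * table.buckets.size()) return;        // still within the fill factor
    ChainedTable bigger(2 * table.buckets.size());    // 1. new, larger table (growth factor g = 2)
    for (const auto& chain : table.buckets)
        for (const auto& e : chain)
            bigger.insert(e.key, e.value);            // 2. reinsert every old entry
    table = std::move(bigger);
}
```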
Smooth growth without rehashing
linear hashing or extendible hashing
what makes a good hash function?
hash function should look random but be deterministic
Modulo prime hash function/fast modulo
modulo uses integer division, which is expensive
division by a constant (the prime) can be replaced by multiplication with a precomputed magic value (multiplicative inverse trick)
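Sketch of the idea (the prime 65537 and the function name are just examples): when the prime table size is a compile-time constant, compilers apply the magic-value trick automatically, so no division instruction is emitted.

```cpp
#include <cstdint>

// Fixed prime table size known at compile time (65537 chosen only as an example).
constexpr std::uint64_t kPrimeSlots = 65537;

std::uint64_t slot_of(std::uint64_t hash) {
    // The compiler replaces this division by a multiplication with a
    // precomputed magic constant plus shifts.
    return hash % kPrimeSlots;
}
```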
Alternative to prime hashing, trick to avoid modulo or magic numbers
Powers of two
most real-world hash functions compute a 32 bit or 64 bit hash
to be used in a hash table of size m, one must reduce it to a position: h(x) % m
trick:
restrict m to powers of two (implies g = 2)
the modulo becomes a fast bit mask: h(x) & (m - 1)
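Minimal sketch of the mask trick (hypothetical helper name), valid only when m is a power of two:

```cpp
#include <cstdint>

// m must be a power of two; then hash % m equals hash & (m - 1).
std::uint64_t slot_of(std::uint64_t hash, std::uint64_t m) {
    return hash & (m - 1);   // single AND instead of a division
}
```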
randomized properties of hash functions
Universal hashing: two distinct keys collide with probability 1/m or less
Pairwise independence: probability that two distinct keys map to any particular pair of hash values is 1/m^2
k-independence: probability that any k distinct keys map to any particular sequence of k hash values is 1/m^k
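For reference, a classic universal family (Carter-Wegman style, ((a*x + b) mod p) mod m); the concrete prime and the names are illustrative:

```cpp
#include <cstdint>
#include <random>

// Universal hash sketch: h_{a,b}(x) = ((a*x + b) mod p) mod m,
// with p a prime larger than any key, a and b chosen at random.
struct UniversalHash {
    static constexpr std::uint64_t p = (1ULL << 31) - 1;  // Mersenne prime 2^31 - 1
    std::uint64_t a, b, m;

    UniversalHash(std::uint64_t table_size, std::mt19937_64& rng)
        : a(std::uniform_int_distribution<std::uint64_t>(1, p - 1)(rng)),
          b(std::uniform_int_distribution<std::uint64_t>(0, p - 1)(rng)),
          m(table_size) {}

    // valid for keys x < p; two distinct keys collide with probability <= 1/m
    std::uint64_t operator()(std::uint64_t x) const { return ((a * x + b) % p) % m; }
};
```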
Tabulation hashing, adv and disadv
break up a key of r * t bits into t fixed-size chunks of r bits
for each chunk, look up a random value in its own table of 2^r entries (t tables in total, filled with random bits)
XOR the lookup results together
example with a 32 bit key x and 8 bit chunks (x0..x3 are the four bytes of x):
h(x) = t[0][x0] XOR t[1][x1] XOR t[2][x2] XOR t[3][x3]
Adv: 3-independent, could be implemented efficiently in hardware
disadv: consumes CPU cache, random state grows with key size
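C++ sketch matching the 32-bit example above (names are illustrative; tables are filled with random bits):

```cpp
#include <array>
#include <cstdint>
#include <random>

// Tabulation hashing for 32-bit keys split into four 8-bit chunks.
struct TabulationHash {
    std::array<std::array<std::uint32_t, 256>, 4> t;  // 4 tables, 2^8 random entries each

    explicit TabulationHash(std::mt19937& rng) {
        for (auto& table : t)
            for (auto& entry : table) entry = rng();  // random state
    }

    std::uint32_t operator()(std::uint32_t x) const {
        return t[0][x & 0xFF] ^ t[1][(x >> 8) & 0xFF] ^
               t[2][(x >> 16) & 0xFF] ^ t[3][(x >> 24) & 0xFF];
    }
};
```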
Open Addressing
Chaining needs at least two dependent memory accesses per hit (table slot, then list node)
store entries directly in the hash table
how to handle collisions in open addressing?
Open addressing with linear probing
on collision, insert into next free position
pos = (pos + 1) % m
if m is a power of 2 => the modulo is a cheap bit mask
lookup must probe entries until the key or a free slot is found
fill factor f must be below 1 for good performance
at high fill factors, open addressing leads to long probe sequences (clustering effect)
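Linear-probing sketch in C++ (names are illustrative; m is assumed to be a power of two, the fill factor below 1, and deletions are left out to keep it short):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Open addressing with linear probing; empty slots are empty optionals.
struct LinearProbingTable {
    explicit LinearProbingTable(std::size_t m) : slots(m) {}  // m must be a power of two

    void insert(std::uint64_t key, std::uint64_t value) {
        std::size_t pos = std::hash<std::uint64_t>{}(key) & (slots.size() - 1);
        while (slots[pos] && slots[pos]->first != key)
            pos = (pos + 1) & (slots.size() - 1);        // probe next slot on collision
        slots[pos].emplace(key, value);
    }

    std::optional<std::uint64_t> lookup(std::uint64_t key) const {
        std::size_t pos = std::hash<std::uint64_t>{}(key) & (slots.size() - 1);
        while (slots[pos]) {                             // probe until a free slot
            if (slots[pos]->first == key) return slots[pos]->second;
            pos = (pos + 1) & (slots.size() - 1);
        }
        return std::nullopt;
    }

    std::vector<std::optional<std::pair<std::uint64_t, std::uint64_t>>> slots;
};
```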
what is a way to get better space utilization?
have fixed-size groups of entries (e.g. each group can have four 16-byte entries)
typically a group has the size of a cache line (e.g. 64 bytes)
may require a counter for each group
SIMD can be used to speed up the comparison (compare your key against all keys in a group with one instruction)
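Sketch of a SIMD group probe using SSE2 (the one-byte-tag layout is one common variant, e.g. Swiss tables, not necessarily the exact layout from the card):

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Each group stores 16 one-byte tags derived from the hashes of its entries;
// one compare instruction checks the probe tag against all 16 at once.
std::uint32_t match_mask(const std::uint8_t* group_tags, std::uint8_t probe_tag) {
    __m128i tags  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(group_tags));
    __m128i probe = _mm_set1_epi8(static_cast<char>(probe_tag));   // broadcast the tag
    __m128i eq    = _mm_cmpeq_epi8(tags, probe);                   // byte-wise compare
    return static_cast<std::uint32_t>(_mm_movemask_epi8(eq));      // bit i set => slot i matches
}
```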