Hash Tables Flashcards
Direct Addressing
keys are just integers -> allocate an array as large as the key domain -> direct random access. lookup, insert, remove in O(1)
Problem: huge waste of memory if key domain is larger than the set of keys stored
best approach for small and dense key ranges
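A minimal sketch of direct addressing, assuming integer keys in a small dense range (the class name and API are hypothetical):

```python
# Direct addressing: one array slot per possible key.
# Assumes keys are integers in [0, domain_size).
class DirectAddressTable:
    def __init__(self, domain_size):
        # wastes memory if the domain is much larger than the stored key set
        self.slots = [None] * domain_size

    def insert(self, key, value):
        self.slots[key] = value   # O(1) random access

    def lookup(self, key):
        return self.slots[key]    # O(1)

    def remove(self, key):
        self.slots[key] = None    # O(1)
```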
Hash function and hash table, desired properties
Hash function h maps keys from a domain U to positions in a hash table of size m
Desired properties: fast hash computation, small probability of two keys mapping to the same hash value (a collision)
how to handle collision?
With chaining
each position in hash table has linked list of entries
how large should m (table size) be?
chain length must be constant for average constant-time access -> m must be proportional to n.
m > f * n
f: fill factor parameter, typically close to 1
if final size of hash table is not known, how to handle it?
it must grow dynamically (rehashing)
1. create new, larger table
2. iterate over entries in old table, and insert into new hash table
how large should new table be?
to keep amortized insertion time constant, must grow exponentially by a factor g > 1
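The two rehashing steps above can be sketched with a toy chaining table (class name, initial size, and growth trigger are assumptions; Python's built-in hash() stands in for the hash function):

```python
# Chaining table that grows by factor g = 2 when the fill factor reaches 1,
# keeping amortized insertion time constant.
class ChainedTable:
    def __init__(self, m=8):
        self.m = m
        self.n = 0
        self.buckets = [[] for _ in range(m)]

    def insert(self, key, value):
        if self.n >= self.m:          # fill factor f = 1 reached
            self._rehash(self.m * 2)  # exponential growth, g = 2
        self.buckets[hash(key) % self.m].append((key, value))
        self.n += 1

    def _rehash(self, new_m):
        # 1. create new, larger table; 2. re-insert every old entry
        old = self.buckets
        self.m = new_m
        self.buckets = [[] for _ in range(new_m)]
        for bucket in old:
            for key, value in bucket:
                self.buckets[hash(key) % new_m].append((key, value))

    def lookup(self, key):
        for k, v in self.buckets[hash(key) % self.m]:
            if k == key:
                return v
        return None
```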
Smooth growth without rehashing
linear hashing or extendible hashing
what makes a good hash function?
hash function should look random but be deterministic
Modulo prime hash function/fast modulo
modulo by a prime uses integer division, which is expensive
division by a constant can be replaced by multiplication with a precomputed magic value (multiplicative inverse)
Alternative to prime hashing, trick to avoid modulo or magic numbers
Powers of two
most real-world hash functions compute a 32 bit or 64 bit hash
to be used in a hash table of size m, must reduce it with h(x) % m
trick:
restrict m to powers of two (implies g = 2)
modulo becomes a fast bit mask: h(x) & (m - 1)
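A quick check of the mask trick (the hash value is an arbitrary 64-bit constant):

```python
# With m a power of two, h % m equals h & (m - 1): the mask keeps exactly
# the low log2(m) bits, which is what modulo by 2^k computes.
m = 1 << 10              # m = 1024, a power of two (implies g = 2)
h = 0x9E3779B97F4A7C15   # an arbitrary 64-bit hash value
assert h % m == h & (m - 1)
```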
randomized properties of hash functions
Universal hashing: two distinct keys collide with probability 1/m or less
Pairwise independence: probability that two distinct keys take some particular pair of hash values is 1/m^2
K-independence: probability that any k distinct keys take some particular hash values is 1/m^k
Tabulation hashing, adv and disadv
break up a key of r * t bits into t small fixed-size chunks of r bits each
for each chunk, look up in its own table of 2^r random entries
XOR lookup results
example with 32 bit key x and 8 bit chunks:
h(x) = t[0][x0] XOR t[1][x1] XOR t[2][x2] XOR t[3][x3]
Adv: 3-independent, could be implemented efficiently in hardware
disadv: consumes CPU cache, random state grows with key size
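The 32-bit example above can be sketched directly (random seed and 64-bit table entries are arbitrary choices):

```python
import random

# Tabulation hashing for 32-bit keys with 8-bit chunks (r = 8, t = 4):
# four tables of 2^8 random 64-bit entries, XORed together.
random.seed(42)  # fixed seed so the hash is reproducible in this sketch
T = [[random.getrandbits(64) for _ in range(256)] for _ in range(4)]

def tab_hash(x):
    # h(x) = t[0][x0] XOR t[1][x1] XOR t[2][x2] XOR t[3][x3]
    return (T[0][x & 0xFF] ^
            T[1][(x >> 8) & 0xFF] ^
            T[2][(x >> 16) & 0xFF] ^
            T[3][(x >> 24) & 0xFF])
```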
Open Addressing
Chaining needs at least two dependent memory accesses for a hit
storing entries directly in hash table
how to handle collisions in open addressing?
Open addressing with linear probing
on collision, insert into next free position
pos = (pos + 1) % m
if m is power of 2 => cheaper
lookup must check entries until free entry is found
fill factor f must be below 1 for good performance
at high fill factors, OA leads to long probe sequence lengths (clustering effect)
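A toy version of open addressing with linear probing, assuming a power-of-two m so wrap-around is a mask (no growth; the caller must keep the fill factor below 1):

```python
# Open addressing with linear probing: entries live directly in the table,
# collisions go to the next free position.
class LinearProbingTable:
    EMPTY = object()   # sentinel for free slots

    def __init__(self, m=16):
        self.m = m
        self.keys = [self.EMPTY] * m
        self.vals = [None] * m

    def insert(self, key, value):
        pos = hash(key) & (self.m - 1)
        # probe until a free slot (or the same key, to overwrite)
        while self.keys[pos] is not self.EMPTY and self.keys[pos] != key:
            pos = (pos + 1) & (self.m - 1)
        self.keys[pos] = key
        self.vals[pos] = value

    def lookup(self, key):
        pos = hash(key) & (self.m - 1)
        while self.keys[pos] is not self.EMPTY:  # stop at first free entry
            if self.keys[pos] == key:
                return self.vals[pos]
            pos = (pos + 1) & (self.m - 1)
        return None
```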
what is a way to get better space utilization?
have fixed-size groups of entries (e.g. each group can hold four 16-byte entries)
typically a group has size of cache line
may require a counter for each group
SIMD can be used to speed up comparison with several elements (compare your key against all keys in a group with one instruction)
Open Addressing plus Chaining
store first elem in array
chain on collision
Chaining vs Open Addressing
different size entries
growth: rehashing operation faster with OA
chaining: lots of cache misses as items are spread in memory
duplicate keys: chain length will grow but no problem
high fill factors: chaining is better, no exploding probe sequence lengths, but has pointer overhead; OA doesn't need pointers
Entry copying:
large entries -> if the hash table grows, OA must copy everything
in chaining, entries are moved from the old hash table to the new one by relinking, no copying
linear probing leads to clusters of entries and therefore long probe sequences. What is a solution?
Robin Hood hashing
re-order elements during insertion based on their probe sequence length
(so that elements with long probe sequence length swap with elem that has a shorter one)
similar performance as linear probing for small fill factors
works better than linear probing at larger fill factors
What is probe sequence length?
Distance of key from its preferred location
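Robin Hood insertion can be sketched on top of linear probing (class name and power-of-two size are assumptions); during probing, the incoming entry steals the slot of any resident with a shorter probe sequence length:

```python
# Robin Hood hashing: linear probing, but re-order entries on insert so
# that no resident is "richer" (closer to its preferred slot) than the
# entry probing past it.
class RobinHoodTable:
    EMPTY = object()

    def __init__(self, m=16):
        self.m = m
        self.keys = [self.EMPTY] * m
        self.vals = [None] * m

    def _psl(self, key, pos):
        # probe sequence length: distance from the key's preferred slot
        return (pos - (hash(key) & (self.m - 1))) & (self.m - 1)

    def insert(self, key, value):
        pos = hash(key) & (self.m - 1)
        psl = 0
        while self.keys[pos] is not self.EMPTY:
            if self.keys[pos] == key:
                self.vals[pos] = value
                return
            # steal the slot if the resident has a shorter PSL,
            # then continue inserting the displaced entry
            if self._psl(self.keys[pos], pos) < psl:
                key, self.keys[pos] = self.keys[pos], key
                value, self.vals[pos] = self.vals[pos], value
                psl = self._psl(key, pos)
            pos = (pos + 1) & (self.m - 1)
            psl += 1
        self.keys[pos] = key
        self.vals[pos] = value

    def lookup(self, key):
        pos = hash(key) & (self.m - 1)
        while self.keys[pos] is not self.EMPTY:
            if self.keys[pos] == key:
                return self.vals[pos]
            pos = (pos + 1) & (self.m - 1)
        return None
```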
Cuckoo Hashing
Open addressing approach that guarantees O(1) lookup and remove performance in the worst case
idea: two hash tables with different hash functions
lookup: check exactly 2 positions
insert: pick any of the 2 positions, if 1 free, put it there. If both spots occupied, try to rearrange keys
rearrangement may fail due to cycle: rebuild using different hash function
only works for low fill factors
can be fast: the memory accesses are independent
cuckoo hashing: insert(z)
z hashes exactly to the locations of x and y. We have to put it in one of those locations -> kick out y, as y also has another location that may be free (each key has two possible spots)
if the other spot for y is also occupied -> continue, kick out the key occupying it -> this can run into a cycle
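A sketch of the scheme with two tables and two hash functions (the multiply-by-odd-seed hash is a toy stand-in, and the cycle guard of 2m kicks is an arbitrary choice):

```python
import random

# Cuckoo hashing: every key lives in one of exactly two positions, one per
# table, so lookup and remove check at most two independent locations.
class CuckooTable:
    def __init__(self, m=64):
        self.m = m
        self.t = [[None] * m, [None] * m]   # two tables
        self._new_seeds()

    def _new_seeds(self):
        # two toy hash functions: multiply by a random odd seed, then mod m
        self.seeds = (random.getrandbits(32) | 1, random.getrandbits(32) | 1)

    def _h(self, i, key):
        return (hash(key) * self.seeds[i]) % self.m

    def lookup(self, key):
        for i in (0, 1):                    # check exactly 2 positions
            slot = self.t[i][self._h(i, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def insert(self, key, value):
        entry = (key, value)
        i = 0
        for _ in range(2 * self.m):         # cycle guard
            pos = self._h(i, entry[0])
            entry, self.t[i][pos] = self.t[i][pos], entry  # kick out occupant
            if entry is None:
                return
            i = 1 - i                       # displaced key tries its other table
        self._rebuild(entry)                # cycle -> new hash functions

    def _rebuild(self, pending):
        entries = [e for tab in self.t for e in tab if e is not None]
        entries.append(pending)
        self.t = [[None] * self.m, [None] * self.m]
        self._new_seeds()
        for k, v in entries:
            self.insert(k, v)
```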
Static perfect hashing
FKS hashing: O(1) worst case lookup time and amortized O(1) insertion time
bulk creation, two levels: (looks like a tree)
1. 2(n-1) partitions (n: number of elements)
2. each partition is a hash table of size r^2 for r entries in partition, on collision pick another hash function
problem: large space consumption, two dependent cache misses
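The two-level construction can be sketched for a static key set (the top-level size 2n, the 4n space bound, and the hash family are this sketch's choices, not the exact FKS constants):

```python
import random

# FKS-style static perfect hashing: level 1 hashes into O(n) buckets;
# a bucket with r entries gets a level-2 table of size r^2 whose hash
# function is retried until it is collision-free.
P = (1 << 61) - 1   # large prime for the (a*x + b) universal hash family

def make_hash(m):
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

def build_fks(keys):
    n = len(keys)
    top_m = 2 * n
    while True:
        h1 = make_hash(top_m)
        buckets = [[] for _ in range(top_m)]
        for k in keys:
            buckets[h1(k)].append(k)
        # accept the top-level hash if second-level space stays linear
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    tables = []
    for b in buckets:
        m2 = len(b) ** 2
        while True:   # on collision pick another hash function
            h2 = make_hash(m2) if m2 else (lambda x: 0)
            slots = [None] * m2
            ok = True
            for k in b:
                i = h2(k)
                if slots[i] is not None:
                    ok = False
                    break
                slots[i] = k
            if ok:
                break
        tables.append((h2, slots))
    return h1, tables

def fks_contains(struct, key):
    h1, tables = struct      # two dependent lookups: tree-like, two levels
    h2, slots = tables[h1(key)]
    return bool(slots) and slots[h2(key)] == key
```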
Concise Hash Table: idea, construction
static hash table optimized for space
idea: open addressing with linear probing, but compress away empty slots
Construction: three phases, assume unique keys
1. hash key, go to that bit and set it. if bit already set -> collision, go to next bit
2. compute prefix popcounts for bitmap
3. hash each key again and insert into array using prefix popcount
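The three phases can be sketched over a plain bit array (Python's hash() and the table size are stand-ins; keys are assumed unique, and phase 3 iterates in the same order as phase 1 so the probe slots match):

```python
# Concise Hash Table sketch: linear probing over a bitmap, with the empty
# slots compressed away via prefix popcounts.
def build_cht(keys, m):
    # Phase 1: set one bit per key, probing past already-set bits
    bitmap = [0] * m
    for k in keys:
        pos = hash(k) % m
        while bitmap[pos]:          # bit set -> collision, go to next bit
            pos = (pos + 1) % m
        bitmap[pos] = 1
    # Phase 2: prefix popcounts (set bits strictly before position i)
    prefix = [0] * m
    for i in range(1, m):
        prefix[i] = prefix[i - 1] + bitmap[i - 1]
    # Phase 3: hash each key again and place it into the dense array;
    # same insertion order as phase 1 reproduces the slot assignment
    array = [None] * len(keys)
    placed = [False] * m
    for k in keys:
        pos = hash(k) % m
        while placed[pos]:
            pos = (pos + 1) % m
        placed[pos] = True
        array[prefix[pos]] = k
    return bitmap, prefix, array

def cht_contains(cht, key, m):
    bitmap, prefix, array = cht
    pos = hash(key) % m
    while bitmap[pos]:              # probe until a free bit is found
        if array[prefix[pos]] == key:
            return True
        pos = (pos + 1) % m
    return False
```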
should hashes be stored or not?
hash functions can have collisions. for correctness, the table must store full keys and compare them on lookup
the hash can always be recomputed from the key -> no need to store the hash
but if you store it: speeds up rehashing (if you have cache misses, it could be faster to compare against the hash than against the key), can speed up negative comparison for complex keys
Table growth: cache locality
in power of two tables, doubling means we use one more hash bit
two possibilities:
index from least significant hash bits: bucket 01 may become 001 or 101 (new bit is prepended)
index from most significant hash bits: bucket 01 may become 010 or 011 (new bit is appended)
MSB: front bits stay the same, you get more resolution-> access pattern is better
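The two possibilities can be demonstrated on an 8-bit toy hash (the hash width and sample value are arbitrary):

```python
# Doubling a power-of-two table adds one index bit. Taking index bits from
# the least significant end of the hash prepends the new bit (bucket 0b01
# scatters to 0b001 or 0b101); taking them from the most significant end
# appends it (0b01 splits into the adjacent buckets 0b010 and 0b011).
HASH_BITS = 8

def idx_lsb(h, bits):
    return h & ((1 << bits) - 1)      # index = low bits of the hash

def idx_msb(h, bits):
    return h >> (HASH_BITS - bits)    # index = high bits of the hash

h = 0b01101101
assert idx_lsb(h, 2) == 0b01 and idx_lsb(h, 3) == 0b101  # new MSB prepended
assert idx_msb(h, 2) == 0b01 and idx_msb(h, 3) == 0b011  # new LSB appended
```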