Hash maps Flashcards
Why is it difficult to make an on-disk hashmap perform well?
it requires a lot of random access I/O, it is expensive to grow when it becomes full, and hash collisions require “fiddly” logic [DDIA p 75]
Why is it relevant to know it is difficult to implement hashmaps on disk?
…
Cuckoo hashing vs chained hashing performance
A study by Zukowski et al.[13] has shown that cuckoo hashing is much faster than chained hashing for small, cache-resident hash tables on modern processors. Kenneth Ross[14] has shown bucketized versions of cuckoo hashing (variants that use buckets that contain more than one key) to be faster than conventional methods also for large hash tables, when space utilization is high.
List the two techniques for avoiding rehashing latency spikes
incremental/background resizing (implemented, for example, in Redis and in PauselessHashMap) and extendible hashing.
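A minimal sketch of the incremental-resizing idea (hypothetical names, not Redis's actual code): keep the old and new bucket arrays side by side and migrate a small number of buckets on each operation, so no single insert pays the whole rehash cost.

# Hypothetical sketch of incremental resizing over a chained hash table.
class IncrementalHashMap:
    def __init__(self, capacity=8):
        self.new = [[] for _ in range(capacity)]   # active table
        self.old = None                            # table being drained during a resize
        self.migrate_pos = 0                       # next old bucket to move
        self.size = 0

    def _bucket(self, table, key):
        return table[hash(key) % len(table)]

    def _step_migration(self):
        # Move one old bucket into the new table per operation.
        if self.old is not None:
            for k, v in self.old[self.migrate_pos]:
                self._bucket(self.new, k).append((k, v))
            self.old[self.migrate_pos] = []
            self.migrate_pos += 1
            if self.migrate_pos == len(self.old):
                self.old = None                    # migration finished

    def put(self, key, value):
        self._step_migration()
        if self.old is not None:                   # drop any stale copy left in the old table
            ob = self._bucket(self.old, key)
            before = len(ob)
            ob[:] = [(k, v) for k, v in ob if k != key]
            self.size -= before - len(ob)
        bucket = self._bucket(self.new, key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        if self.old is None and self.size > len(self.new):   # kick off a gradual resize
            self.old, self.new = self.new, [[] for _ in range(2 * len(self.new))]
            self.migrate_pos = 0

    def get(self, key):
        self._step_migration()
        for table in (self.new, self.old):
            if table is not None:
                for k, v in self._bucket(table, key):
                    if k == key:
                        return v
        raise KeyError(key)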
Why does MySQL use B-trees for indexes rather than hash tables, even though access time for B-trees is O(log n) as opposed to O(1) for hash tables?
1) In a hash table you can only access elements by their primary key. That lookup is faster than with a tree algorithm (O(1) instead of O(log n)), but you cannot select ranges (everything between x and y). Tree algorithms support range queries in O(log n), whereas a hash index can force a full table scan, O(n). 2) Also, the constant overhead of hash indexes is usually bigger (which doesn't show up in big-O notation, but it still exists). 3) Also, tree algorithms are usually easier to maintain, grow with the data, scale, etc. (rehashing when the hash table reaches a certain occupancy can take a long time, days on very large datasets).
Much of the variation in hashing algorithms comes from what?
comes from how they handle collisions (multiple keys that map to the same array index)
List two open addressing hash tables
SwissTable and F14 (FB)
Hashmaps in a nutshell
Make a big array
Get a magic function that converts elements into integers (hashing)
Store elements in the array at the index specified by their magic integer
Do something interesting if two elements get the same index (hash conflict)
In theory, insert, search, and delete should be O(1). In practice that is hard to achieve, at least with a low constant factor
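A toy version of the above in Python, assuming separate chaining (a per-bucket list) as the "something interesting" for collisions; names are illustrative, not any real library's API:

class ToyHashMap:
    def __init__(self, capacity=16):
        self.buckets = [[] for _ in range(capacity)]   # the big array

    def _index(self, key):
        return hash(key) % len(self.buckets)           # the "magic integer"

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                               # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))                    # hash conflict: chain it in the bucket

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

m = ToyHashMap()
m.put("a", 1)
m.put("b", 2)
print(m.get("a"))   # 1

With a fixed capacity this stays O(1) only while the load factor is low; a real table also resizes (see the rehashing cards above).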
A perfect hash function results in what for hash tables?
No collisions
When to use a perfect hash function?
in situations where there is a frequently queried large set, S, which is seldom updated. This is because any modification of the set S may cause the hash function to no longer be perfect for the modified set.
Dynamic perfect hashing
Solutions which update the hash function any time the set is modified (so that it stays perfect), but these methods are relatively complicated to implement.
What are some simple implementations of order-preserving minimal perfect hash functions with constant access time?
use an (ordinary) perfect hash function or cuckoo hashing to store a lookup table of the positions of each key
First guess, I think: e.g. given a perfect hash function ph,

def initialize(A):                     # build the order-preserving lookup table for data set A
    positions = {}
    for i in range(len(A)):            # A is the array of records, in the desired key order
        positions[ph(A[i].key)] = i    # store each key's position in A
    return positions

def getHashCode(positions, key):
    return positions[ph(key)]

This will be order preserving with respect to the order of keys in A. At first I thought the underlying map could not have collisions, because ph is a perfect hash function and perfect hash functions have no collisions for the data set (IIRC), but actually it may IMO, because the underlying map will probably use its own non-perfect hash function internally.
Minimal perfect hash function
a perfect hash function that maps n keys to n consecutive integers – usually the numbers from 0 to n − 1 or from 1 to n. A more formal way of expressing this is: Let j and k be elements of some finite set S. Then F is a minimal perfect hash function if and only if F(j) = F(k) implies j = k (injectivity) and there exists an integer a such that the range of F is a..a + |S| − 1
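A toy way to see the definition in code (my sketch, not a standard construction): brute-force a seed until a simple string hash becomes injective onto 0..n − 1 for a fixed key set. Real MPHF constructions (CHD, BBHash, etc.) are much cleverer.

def make_mphf(keys, max_seed=1_000_000):
    # Find a seed so that h(., seed) maps the fixed key set injectively onto 0..n-1,
    # i.e. a minimal perfect hash function for exactly these keys.
    n = len(keys)
    def h(key, seed):
        acc = seed
        for c in key:
            acc = (acc * 31 + ord(c)) & 0xFFFFFFFF
        return acc % n
    for seed in range(max_seed):
        if len({h(k, seed) for k in keys}) == n:       # no collisions, range is 0..n-1
            return lambda k, s=seed: h(k, s)
    raise RuntimeError("no suitable seed found")

f = make_mphf(["apple", "banana", "cherry", "date"])
print(sorted(f(k) for k in ["apple", "banana", "cherry", "date"]))   # [0, 1, 2, 3]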
Perfect hash function
In computer science, a perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions.
What are the three factors to consider in the analysis of hash table algorithms?
1) non-memory-related CPU cost of a key search or an insertion 2) memory (several subfactors) 3) variability in efficiency along different factors
Open vs closed address table fundamental difference
Open: Collisions are dealt with by searching for another empty bucket within the hash table array itself.
Closed: A key is always stored in the bucket it’s hashed to. Collisions are dealt with using separate data structures on a per-bucket basis. This of course means there is a data structure or reference to a data structure in each bucket
List three techniques for collision resolution in closed address tables
Separate chaining using linked lists
Separate chaining using dynamic arrays
Using self-balancing binary search trees
List seven collision resolution techniques for open address tables
Linear probing
Quadratic probing
Double hashing
Hopscotch hashing
Robin Hood hashing
Cuckoo hashing
2-choice hashing
List three benefits of open addressing
1)Predictable memory usage
No allocation of new nodes when keys are inserted
2)Less memory overhead
No next pointers
3)Memory locality
A linear memory layout provides better cache characteristics
How does linear probing lookup work?
Hash the key to get the hash code (i.e. the index into the array). See if the key exists at that index. If not, and another key occupies that index, go to the next index, and so on. If no key exists at an index (the slot is empty), do not continue searching and return that the key was not found in the hash table
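A minimal sketch of that lookup, assuming the table is a flat Python list of (key, value) slots where None means empty:

def lp_get(slots, key):
    # Linear probing lookup: start at the hashed index and walk forward until
    # we find the key or hit an empty slot.
    n = len(slots)
    i = hash(key) % n
    for _ in range(n):                 # at most one full pass around the table
        slot = slots[i]
        if slot is None:               # empty slot: the key cannot be further along
            raise KeyError(key)
        if slot[0] == key:
            return slot[1]
        i = (i + 1) % n                # occupied by a different key: keep probing
    raise KeyError(key)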
How does linear probing deletion work?
Rather than removing the key/value pair from the index (bucket) it's found at, you have to mark that bucket as deleted. The deleted bucket is now called a "tombstone"
Otherwise, during a future search, a lookup chain might be broken and the lookup might report that nothing was found (because it hit an empty bucket), when really the key was a few buckets further down the chain
(Pretty sure this is only relevant for when 2+ keys hash to the same code)
Are tombstone buckets forever unusable?
No, they can still be overwritten during insertion. It's just that during insertion you still have to keep searching until you reach an empty bucket, to verify there are no duplicates of the key you are trying to insert. If the key does not already exist in the table, you can then insert it at the tombstone position
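A sketch covering this card and the previous one, assuming the same flat list-of-slots layout as in the lookup sketch above; TOMBSTONE is just a sentinel I'm inventing for illustration:

TOMBSTONE = object()                      # sentinel marking a deleted slot

def lp_delete(slots, key):
    n = len(slots)
    i = hash(key) % n
    for _ in range(n):
        slot = slots[i]
        if slot is None:                  # truly empty slot: the key is not present
            raise KeyError(key)
        if slot is not TOMBSTONE and slot[0] == key:
            slots[i] = TOMBSTONE          # mark deleted instead of emptying the slot
            return
        i = (i + 1) % n
    raise KeyError(key)

def lp_put(slots, key, value):
    n = len(slots)
    i = hash(key) % n
    first_tombstone = None
    for _ in range(n):
        slot = slots[i]
        if slot is None:                  # end of the probe chain: key is not a duplicate
            target = i if first_tombstone is None else first_tombstone
            slots[target] = (key, value)  # reuse a tombstone slot if we passed one
            return
        if slot is TOMBSTONE:
            if first_tombstone is None:
                first_tombstone = i       # remember it, but keep probing for duplicates
        elif slot[0] == key:
            slots[i] = (key, value)       # key already present: overwrite in place
            return
        i = (i + 1) % n
    if first_tombstone is not None:       # no empty slots left, but a tombstone is reusable
        slots[first_tombstone] = (key, value)
        return
    raise RuntimeError("table is full")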
Cuckoo hashing definition
a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.
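A sketch of the two-table variant, assuming two hash functions derived by salting Python's hash(); names are illustrative:

class CuckooTable:
    # Each key lives in one of two candidate slots, one per table. Lookup checks
    # exactly two slots; insert may kick ("cuckoo") an older key to its other slot.
    def __init__(self, capacity=16):
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which, key):
        return hash((which, key)) % len(self.tables[which])   # two salted hash functions

    def get(self, key):
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        raise KeyError(key)            # worst case: exactly two probes

    def put(self, key, value, max_kicks=32):
        for which in (0, 1):           # overwrite if the key is already present
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(max_kicks):
            i = self._slot(which, entry[0])
            entry, self.tables[which][i] = self.tables[which][i], entry   # displace occupant
            if entry is None:
                return
            which = 1 - which          # reinsert the displaced key into the other table
        raise RuntimeError("too many kicks; a real implementation would rehash or grow")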