Hash maps Flashcards

1
Q

Why is it difficult to make an on-disk hashmap perform well?

A

it requires a lot of random-access I/O, it is expensive to grow when the table becomes full, and hash collisions require “fiddly” logic [DDIA p. 75]

2
Q

Why is it relevant to know it is difficult to implement hashmaps on disk?

A

3
Q

Cuckoo hashing vs chained hashing performance

A

A study by Zukowski et al.[13] has shown that cuckoo hashing is much faster than chained hashing for small, cache-resident hash tables on modern processors. Kenneth Ross[14] has shown bucketized versions of cuckoo hashing (variants that use buckets containing more than one key) to be faster than conventional methods for large hash tables as well, when space utilization is high.

4
Q

List the two techniques for avoiding rehashing latency spikes

A

incremental/background resizing (implemented, for example, in Redis and in PauselessHashMap) and extendible hashing.
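
The incremental-resizing idea, as a hypothetical Java sketch (class and field names are made up, and only the migration step is shown, not a full map; not Redis's actual code):

import java.util.List;

class IncrementalResize {
    List<List<String>> oldBuckets;   // non-null only while a resize is in progress
    List<List<String>> newBuckets;   // e.g. twice as many buckets as the old table
    int cursor = 0;                  // index of the next old bucket to migrate

    // Called at the start of every get/put. During a resize, lookups must check
    // newBuckets first and fall back to oldBuckets for not-yet-migrated keys.
    void migrateStep() {
        if (oldBuckets == null) return;                        // no resize in progress
        List<String> bucket = oldBuckets.get(cursor);
        for (String key : bucket) {                            // rehash just this one bucket
            int i = Math.floorMod(key.hashCode(), newBuckets.size());
            newBuckets.get(i).add(key);
        }
        bucket.clear();
        if (++cursor == oldBuckets.size()) oldBuckets = null;  // migration complete
    }
}

Because only one old bucket is rehashed per operation, no single call ever pays for rehashing the entire table, which is exactly the latency spike this technique avoids.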

5
Q

Why does MySQL use B-trees for indexes rather than hash tables, even though access time for B-trees is O(log n) as opposed to O(1) for hash tables?

A

1) In a hash table you can only access elements by their primary key. That is faster than with a tree algorithm (O(1) instead of O(log n)), but you cannot select ranges (everything between x and y). Tree algorithms support range queries in O(log n), whereas a hash index can force a full table scan, O(n). 2) The constant overhead of hash indexes is usually bigger (invisible in big-O notation, but it still exists). 3) Tree algorithms are usually easier to maintain, grow with the data, and scale; rehashing when the hash table reaches a certain occupancy can take a long time (days on truly huge datasets).

6
Q

Much of the variation in hashing algorithms comes from what?

A

comes from how they handle collisions (multiple keys that map to the same array index)

7
Q

List two open addressing hash tables

A

SwissTable (Google) and F14 (Facebook)

8
Q

Hashmaps in a nutshell

A

Make a big array
Get a magic function that converts elements into integers (hashing)
Store elements in the array at the index specified by their magic integer
Do something interesting if two elements get the same index (hash conflict)

In theory, insert, search, and delete should all be O(1). In practice that is hard to achieve, at least with a low constant factor.
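
A minimal Java sketch of that recipe, using separate chaining for collisions (hypothetical code, no resizing, for illustration only):

import java.util.ArrayList;
import java.util.List;

class TinyHashMap<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final List<List<Entry<K, V>>> buckets;   // the "big array"

    TinyHashMap(int capacity) {
        buckets = new ArrayList<>(capacity);
        for (int i = 0; i < capacity; i++) buckets.add(new ArrayList<>());
    }

    // The "magic function": convert the key into an array index.
    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), buckets.size());
    }

    void put(K key, V value) {
        List<Entry<K, V>> bucket = buckets.get(indexFor(key));
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) { e.value = value; return; }   // key already present: overwrite
        }
        bucket.add(new Entry<>(key, value));   // collision handling: chain inside the bucket
    }

    V get(K key) {
        for (Entry<K, V> e : buckets.get(indexFor(key))) {
            if (e.key.equals(key)) return e.value;
        }
        return null;   // not found
    }
}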

9
Q

A perfect hash function results in what for hash tables?

A

No collisions

10
Q

When to use a perfect hash function?

A

in situations where there is a frequently queried large set, S, which is seldom updated. This is because any modification of the set S may cause the hash function to no longer be perfect for the modified set.

11
Q

Dynamic perfect hashing

A

Solutions which update the hash function any time the set is modified (so that it is perfect again), but these methods are relatively complicated to implement.

12
Q

What are some simple implementations of order-preserving minimal perfect hash functions with constant access time?

A

use an (ordinary) perfect hash function or cuckoo hashing to store a lookup table of the positions of each key

First guess, I think: e.g., given a perfect hash function ph for the key set A (sketch; ph itself is not defined here):

int[] map;                      // sized to cover the range of ph

void initialize(String[] A) {   // build the OPMPHF for a particular data set A
    for (int i = 0; i < A.length; i++) {
        map[ph(A[i])] = i;      // ph is perfect for A, so no two keys of A share a slot
    }
}

int getHashCode(String key) {
    return map[ph(key)];        // the position of key in A
}

This will be order preserving with respect to the order of keys in A. Does the underlying lookup table have collisions? I don’t think so, because ph is a perfect hash function and perfect hash functions have no collisions for the data set; but if map were itself an ordinary hash table (with a non-perfect hash function) rather than an array, that table could still have internal collisions.

13
Q

Minimal perfect hash function

A

a perfect hash function that maps n keys to n consecutive integers – usually the numbers from 0 to n − 1 or from 1 to n. A more formal way of expressing this is: Let j and k be elements of some finite set S. Then F is a minimal perfect hash function if and only if F(j) = F(k) implies j = k (injectivity) and there exists an integer a such that the range of F is a..a + |S| − 1

14
Q

Perfect hash function

A

In computer science, a perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions.

15
Q

What are the three factors to consider in the analysis of hash table algorithms?

A

1) Non-memory-related CPU cost of a key search or an insertion
2) Memory (several subfactors)
3) Variability in efficiency along different factors

16
Q

Open vs closed address table fundamental difference

A

Open: collisions are dealt with by searching for another empty bucket within the hash table array itself.

Closed: a key is always stored in the bucket it hashes to, and collisions are dealt with using a separate data structure on a per-bucket basis. This of course means each bucket holds a data structure, or a reference to one.

17
Q

List three techniques for collision resolution in closed address tables

A

Separate chaining using linked lists
Separate chaining using dynamic arrays
Using self-balancing binary search trees

18
Q

List seven collision resolution techniques for open address tables

A

Linear Probing
Quadratic Probing
Double hashing
Hopscotch hashing
Robin Hood hashing
Cuckoo hashing
2-Choice hashing

19
Q

List three benefits of open addressing

A

1) Predictable memory usage: no allocation of new nodes when keys are inserted
2) Less memory overhead: no next pointers
3) Memory locality: a linear memory layout provides better cache characteristics

20
Q

How does linear probing lookup work?

A

Hash the key to get its index into the array. Check whether the key is stored at that index. If it is not, but another key occupies that slot, go to the next index, and so on. If you reach a slot with no key in it, stop searching and report that the key was not found in the hash table.
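
A hypothetical Java sketch of that lookup (the table is just an array of keys, null meaning an empty slot):

class LinearProbing {
    static int lookup(String[] table, String key) {
        int n = table.length;
        int i = Math.floorMod(key.hashCode(), n);      // home slot for this key
        for (int probes = 0; probes < n; probes++) {
            if (table[i] == null) return -1;           // empty slot: the key cannot be further on
            if (table[i].equals(key)) return i;        // found it
            i = (i + 1) % n;                           // occupied by another key: try the next slot
        }
        return -1;                                     // probed every slot without finding it
    }
}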

21
Q

How does linear probing deletion work?

A

Rather than removing the key/value from the bucket it’s found at, you have to mark that bucket as deleted. The deleted bucket is now called a “tombstone”.

Otherwise a future lookup’s probe chain might be broken: the search would stop at the now-empty bucket and report nothing found, when really the key was a few buckets further down the chain.

(Pretty sure this is only relevant when 2+ keys hash to, or probe through, the same bucket)

22
Q

Are tombstone buckets forever unusable?

A

No, they can still be overwritten during insertion. It is just that during insertion you still have to keep probing until you reach a truly empty bucket, to verify there is no existing copy of the key you are trying to insert. If the key does not already exist in the table, you can then insert it at the tombstone position.
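
A hypothetical Java sketch of deletion and insertion with tombstones in a linear-probing table (same array-of-keys layout as the lookup sketch above; names are made up):

class LinearProbingWithTombstones {
    static final String TOMBSTONE = new String("<deleted>");   // sentinel, compared by identity (==)

    static void delete(String[] table, String key) {
        int n = table.length;
        int i = Math.floorMod(key.hashCode(), n);
        for (int probes = 0; probes < n && table[i] != null; probes++) {
            if (table[i] != TOMBSTONE && table[i].equals(key)) {
                table[i] = TOMBSTONE;                  // mark as deleted, don't break the probe chain
                return;
            }
            i = (i + 1) % n;
        }
    }

    static void insert(String[] table, String key) {
        int n = table.length;
        int i = Math.floorMod(key.hashCode(), n);
        int firstTombstone = -1;                       // reusable slot seen along the way
        for (int probes = 0; probes < n; probes++) {
            if (table[i] == null) {                    // truly empty: the key is definitely absent
                table[firstTombstone >= 0 ? firstTombstone : i] = key;
                return;
            }
            if (table[i] == TOMBSTONE) {
                if (firstTombstone < 0) firstTombstone = i;   // remember it, but keep probing
            } else if (table[i].equals(key)) {
                return;                                // already present, nothing to insert
            }
            i = (i + 1) % n;
        }
        if (firstTombstone >= 0) table[firstTombstone] = key; // no empty slot left, reuse a tombstone
        // otherwise the table is completely full; a real implementation would resize here
    }
}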

23
Q

cuckoo hashing definition

A

a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.

24
Q

What is the speed of cuckoo hashing?

A

In practice, cuckoo hashing is about 20–30% slower than linear probing, which is the fastest of the common approaches.

25
Q

Why is cuckoo hashing 20-30% slower than linear probing?

A

The reason is that cuckoo hashing often causes two cache misses per search, to check the two locations where a key might be stored, while linear probing usually causes only one cache miss per search. However, because of its worst case guarantees on search time, cuckoo hashing can still be valuable when real-time response rates are required.

26
Q

List two hash functions good for hash table hashing

A

Facebook xxHash (2012) and Google FarmHash (2014)

27
Q

List 3 dynamic hashing schemes

A

Chained hashing, extendible hashing, and linear hashing

28
Q

What do you want to use for indexing tables and why?

A

Usually not hash tables, because of:

  • lack of ordering in widely used hash schemes
  • lack of locality of reference = more disk seeks (I assume when you are accessing data over a wide range)

A B+Tree is probably what you want.

29
Q

Perfect hashing vs minimal perfect hashing

A

A perfect hash function is one that maps N keys to the range [1,R] without having any collisions. A minimal perfect hash function has a range of [1,N]. We say that the hash is minimal because it outputs the minimum range possible. The hash is perfect because we do not have to resolve any collisions.

30
Q

Java HashMap default load factor and default capacity

A

0.75 and 16

31
Q

Order preserving hash function

A

A minimal perfect hash function F is order preserving if keys are given in some order a1, a2, …, an and for any keys aj and ak, j < k implies F(aj) < F(ak).

32
Q

Fox & Chen space and time complexity to find an OPMPHF

A

Space and time that is on average linear in terms of the number of keys involved

33
Q

Java Hashtable vs HashMap

A

All the methods of Hashtable are synchronized, so only one thread can execute any of them at a time.

When using non-synchronized constructs like HashMap, you must build thread-safety features into your code to prevent consistency errors. HashMap is faster.

34
Q

order-preserving minimal perfect hash functions are used to do what?

A

retrieve the position of a key in a given list of keys

35
Q

Why is the constant factor for put and get for hash maps larger than that of B-trees?

A

36
Q

Cuckoo hashing high level implementation of delete

A

Removal is done simply by clearing the bucket storing the key. Again, worst-case complexity is O(1).

As opposed to other open addressing schemes, there are no probe chains to preserve and no need for deleted markers, or so-called tombstones (cf. the section on removal in Hash Tables: Open Addressing).

37
Q

Double hashing high level implementation

A

With double hashing, another hash function, h2, is used to determine the size of the steps in the search sequence. If h2(key) = j, the search sequence starting in bucket i proceeds as follows (all indices taken modulo the table length):

i + 1 × j
i + 2 × j
i + 3 × j
(If j happens to evaluate to a multiple of the array length, 1 is used instead.)

This approach is worse than the previous two (linear and quadratic probing) regarding memory locality and cache performance, but avoids both primary and secondary clustering.
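
A hypothetical Java sketch of a double-hashing lookup (the second hash here is just an example mix of the key; a real implementation would use a genuinely independent hash function):

class DoubleHashing {
    static int lookup(String[] table, String key) {
        int n = table.length;                            // works best when n is prime, assumes n >= 2
        int i = Math.floorMod(key.hashCode(), n);        // h1: start bucket
        int j = 1 + Math.floorMod(h2(key), n - 1);       // h2: step size in 1..n-1, never a multiple of n
        for (int probes = 0; probes < n; probes++) {
            if (table[i] == null) return -1;             // empty slot: key not present
            if (table[i].equals(key)) return i;
            i = (i + j) % n;                             // step forward by j, modulo the table size
        }
        return -1;
    }

    // A second, different mix of the key (example only, derived from hashCode()).
    static int h2(String key) {
        return key.hashCode() * 0x9E3779B1;              // golden-ratio scramble, just for the sketch
    }
}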

38
Q

When would cuckoo hashing be better than linear probing?

A

Linear probing is on average 20-30% faster than cuckoo hashing (cuckoo hashing incurs on average two cache misses per search as opposed to linear probing’s one), but cuckoo hashing has worst-case constant lookup time, so it is more suitable when a guaranteed fast (real-time) response is required.

39
Q

What is the load factor of a hashmap?

A

The load factor represents how full the HashMap may get (entries relative to capacity) before its capacity is doubled

40
Q

What is capacity in Java HashMap?

A

the number of buckets in the HashMap

41
Q

What is the basic idea of cuckoo hashing?

A

to resolve collisions by using two hash functions instead of only one. This provides two possible locations in the hash table for each key

42
Q

Because of primary clustering, a linear probing hashtable with a constant load factor will have what undesired effect?

A

it will have some clusters of logarithmic length, and searching for a key within such a cluster will take logarithmic time

43
Q

Both types of clustering may be reduced by what?

A

by using a higher-quality hash function, or by using a hashing method such as double hashing that is less susceptible to clustering

44
Q

What is primary clustering?

A

is one of two major failure modes of open addressing based hash tables, especially those using linear probing. It occurs after a hash collision causes two of the records in the hash table to hash to the same position, and causes one of the records to be moved to the next location in its probe sequence. Once this happens, the cluster formed by this pair of records is more likely to grow by the addition of even more colliding records, regardless of whether the new records hash to the same location as the first two. This phenomenon causes searches for keys within the cluster to be longer.[1]

45
Q

Load factor for most open addressing schemes

A

50%

46
Q

Open addressing is aka

A

Closed hashing

47
Q

What is double hashing?

A

uses one hash value as an index into the table and then repeatedly steps forward an interval until the desired value is located, an empty location is reached, or the entire table has been searched; but this interval is set by a second, independent hash function

48
Q

How likely should a collision be in double hashing?

A

1/|T|^2 where |T| is the number of buckets. (I don’t understand why, though. Presumably it is because two keys only follow the same probe sequence if both hash functions map them to the same values, each of which happens with probability 1/|T|, giving (1/|T|) × (1/|T|) = 1/|T|^2.)

49
Q

Double hashing expected number of probes for a search

A

1/(1-a), where a is the load factor, assuming the hash table is kept at that fixed load factor.

So a load factor of 0.8 vs 0.5 would mean an expected 5 probes instead of 2 (1/(1-0.8) = 5, 1/(1-0.5) = 2).

50
Q

Difference between the linear probing/quadratic probing and double hashing

A

With double hashing, the search interval (used when the bucket the hash yields is already occupied) depends on the key itself, so values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of clustering. With linear and quadratic probing, by contrast, the probe sequence depends only on the starting bucket.

51
Q

In open addressing, as the hash table approaches maximum capacity, search time approaches what?

A

Linear time. (I assume this presumes an imperfect hash function, which I’d imagine pretty much all are.)

52
Q

Heuristic to limit the load factor for the double hashing scheme?

A

75%. Eventually rehashing to a larger table will be necessary, as with all other open addressing schemes.

53
Q

How are the two hash functions used in cuckoo hashing’s insertion process?

A

each key has two locations it can be stored in (two possible buckets per key): h1(k) and h2(k).

If we try to insert k into bucket h1(k) but h1(k) is already occupied by k2, we move k2 to bucket h2(k2) (but of course first moving the key/value pair k3 that was at h2(k2) to h2(k3), and so on, unless bucket h2(k2) was empty). If there are too many iterations of this “kicking the cuckoo bird out of the nest” process, then the hash functions are rehashed (I think the article simply means choosing new hash functions and rehashing, not actually expanding the table. Could be both, though. Not sure; should read a proper textbook and not just Wikipedia.)
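
A hypothetical Java sketch of that insertion loop, using the common two-table variant with one slot per bucket (it assumes the key is not already present, and h2 is just an example mix derived from hashCode, not a truly independent hash function):

class CuckooTable {
    String[] t1, t2;      // one slot per bucket; null means empty
    int maxKicks = 32;    // give up and rehash after this many displacements

    CuckooTable(int capacity) {
        t1 = new String[capacity];
        t2 = new String[capacity];
    }

    int h1(String k) { return Math.floorMod(k.hashCode(), t1.length); }
    int h2(String k) { return Math.floorMod(k.hashCode() * 0x9E3779B1, t2.length); }

    void insert(String key) {
        String k = key;
        for (int kicks = 0; kicks < maxKicks; kicks++) {
            int i = h1(k);
            if (t1[i] == null) { t1[i] = k; return; }   // first choice is free
            String evicted = t1[i];                     // otherwise kick the occupant out
            t1[i] = k;
            k = evicted;

            int j = h2(k);
            if (t2[j] == null) { t2[j] = k; return; }   // evicted key's second choice is free
            evicted = t2[j];                            // kick again and keep going
            t2[j] = k;
            k = evicted;
        }
        rehash(k);   // too many kicks: rebuild with new hash functions (not shown here)
    }

    void rehash(String pendingKey) {
        throw new UnsupportedOperationException("choose new hash functions and reinsert everything");
    }
}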

54
Q

Cuckoo hashing insertion time complexity

A

in expected constant time,[1] even considering the possibility of having to rebuild the table, as long as the number of keys is kept below half of the capacity of the hash table, i.e., the load factor is below 50%.

55
Q

CHITC

A

Cuckoo hashing insertion time complexity (an acronym we made to make flashcards more easily)

56
Q

CHITC proof

A
57
Q

What is a simple alternative to perfect hashing?

A

Cuckoo hashing. This also allows for dynamic updates to the set. (I think the point of this question is that cuckoo hashing, like perfect hashing, has worst-case constant lookup time.)

58
Q

in mathematical terms a perfect hash function is a what?

A

an injective function

59
Q

What is an injective function in non-formal terms?

A

A function where no two input values yield the same output value