Hash Tables Flashcards
Hashing
- A technique used for performing insertions, deletions, and searches on constant average time
- Tree operations that require any ordering information among the elements are not supported efficiently
Hashing Data Structure
Hash Table (generalization of ordinary arrays) is used to implement hashing.
Hash Table Components
Key and entry
How to find an entry?
- To find an entry in the table, you only need to know the content of one of the fields (not all of them). This field is called key
- Ideally, a key uniquely identifies an entry
Direct Adress table
-Direct-address tables are ordinary arrays
- Facilitate direct addressing
- Element whose key is k is obtained by indexing into the kth position of the array
- It is applicable when we can afford to allocate an array with one position for every possible key
- Insert/Delete/Search can be implemented to take O(1) time
Disadvantage of direct-address table
The disadvantage is that we could end up with a big array that has a lot of empty space. This wastes lots of memory. The bigger we scale, the worse this problem gets.
Cons direct-address table
- If U is large, then a direct-address table T of size |U| is impractical or impossible (considering memory)
- ->The range of keys determines the size of T
-The set K could be so small relative to U that most of the allocated space is wasted
Direct-Address table vs. Hash Table
-Direct-address table
o Element with key k is stored in slot k
-Hash table
o Use hash function h(k) to compute the slot where k will be inserted
Hash table big idea
- We can make the array smaller, and use the hash function to select which slot
we put our values into. For example the hash function = modulus function - Basically, we are compressing the domain of our keys so that they fit into the array
Hash table size
-The size of the hash table is proportional to |K|
o We lose the direct-addressing ability
o We need to define functions that map keys to slots of the hash table
Hash function definition
-A hash function h transforms a key k into a number that is used as an index in an array to locate the desired location (“bucket” or “slot”)
o An element with key k hashes to slot h(k) 11.2 Hash tables o h(k) is the hash value of key k
Hash function pros
oSimple to compute
oAny two distinct keys get different cells
oThis is impossible, thus we seek a hash function that distributes the keys evenly among the
cells
Hash table components
- The ideal hash table is an array of some fixed size m, containing items (key and additional entry)
- Each key k is mapped (using a hash function) into some number in the range 0 to m – 1 and places in the appropriate cell
Hash table big O for operations
-Insertion of a new entry is O(1)
-Lookup for data is O(1) on average
oStatistically unlikely that O(n) will occur
Issues with hashing
- Decide what to do when two keys hash to the same value (collision)
- Choose a good hash function
- Choose the size of the table
Hashing Division method
-Maps key k into one of m slots by taking the remainder of k divided by m o h(k) = k % m
Collision
-A hash function maps most of the keys onto unique integers, but maps some
other keys onto the same integer
- Multiple keys can hash to the same slot: collisions are unavoidable
- Design a good hash function will minimize the number of collisions that occur, but they will still happen, so we need to be able to deal with them
Basic collision resolution policies
- –Chaining
- > Store all elements that hash to the same slot in a linked list
- > Store a pointer to the head of the linked list in the hash table slot
- –Open Addressing
- > All elements stored in hash table itself
- > When collision occur, use a systematic procedure to store elements in free slots of the table
Basic collision resolution policies
- –Chaining
- > Store all elements that hash to the same slot in a linked list
- > Store a pointer to the head of the linked list in the hash table slot
- –Open Addressing (linear, quadratic, double hashing)
- > All elements stored in hash table itself
- > When collision occur, use a systematic procedure to store elements in free slots of the table (linear, quadratic, double hashing)
Chaining Insert Pseudocode and running time
CHAINED-HASH-INSERT(T,x)
insert x at the head of list T[h(x.key)]
Worst case: O(1)
Chaining search pseudocode and running time
CHAINED-HASHSEARCH(T,k)
search for an element with key k in list T[h(k)]
Worst case: proportional to the length of the list
Chaining delete pseudocode and running time
CHAINED-HASH-DELETE(T,x)
delete x from the list T[h(x.key)]
Worst-case running time for deletion is O(1) if list is doubly linked
When is chaining used?
- Chaining is used when we cannot predict the number of records in advance, thus the choice of the size of the table depends on available memory
- Typically it’s chosen relatively small, so no large area of contiguous memory is used; but large enough so that the lists are short for more efficient search
Open Addressing
- Idea: Store all elements in the hash table itself. That is, each slot contains either an element or NIL
- Advantages: Avoids pointers
Open Addressing methods
- linear probing
- quadratic probing
- double hashing
Open Addressing linear probing hash function
If collision occurs, next probes (slots examined during a key insertion or searching) are performed following the formula:
-> h(k)=(hash(k)+ f (i)) mod m
o h(k) is the index in the table indicating where to insert k o hash(k) is the auxiliary hash function o f(i) is the collision resolution function with i = 0 ... m – 1 o m the size of the hash table
Open Addressing linear probing hash function
If collision occurs, next probes (slots examined during a key insertion or searching) are performed following the formula:
-> h(k)=(hash(k)+ f (i)) mod m
o h(k) is the index in the table indicating where to insert k o hash(k) is the auxiliary hash function o f(i) is the collision resolution function with i = 0 ... m – 1 o m the size of the hash table
Open addressing insert pseudocode
§ Insert: Examine slot h(k)
o If slot is empty, insert element
o If slot is full, try another slot, …, until an open slot is found (probing)
ü Wrapping around to the beginning if we hit the end of the table
Insert(T,k) I←0 repeat j ← h(k,i) if T[j] = NIL then T[j] ← k return j else i←i+1 until i = m
error “hash table overflow”
Open addressing search pseudocode
§ Search: Examine slot h(k)
o If slot h(k) contains key k, search successful
o If the slot h(k) contains NIL, search unsuccessful
o If slot h(k) contains a key that is not k, keep probing (by following the same sequence of probes when inserting the element) until we either find key k or we find a slot holding NIL
Search(T,k) i←0 repeat j ← h(k,i) if T[j] = k then return j i←i+1 until T[j] = NIL or i = m return NIL
Open addressing deletion
-If we delete a key from slot h(k) we cannot mark the slot NIL.
Why?
o We need to:
-> Mark the slot with a special value
->Modify SEARCH so it keeps on looking when it sees the special value
-> INSERT would treat the slot with a special value as empty slot, so a new value can be added
Quadratic probing
§ Similar to linear probing
§ … but instead of using a linear function to advance in the table in search of an empty slot, we use a quadratic function to compute the next index in the table to be probed
h(k)=(hash(k)+i^2) mod m
i = 0…m−1
§ The idea is to skip regions in the table with possible clusters
Quadratic probing disadvantage
Secondary Clustering
Linear probing disadvantage
Primary Clustering
§ We can see clusters form (blocks of adjacent buckets all occupied)
§ Thus increasing the time it takes to do insert and search for any key that hashes to a buckets in that cluster
Double Hashing
- Purpose same as in quadratic probing: to overcome the disadvantage of clustering
- Instead of examining each successive entry following a collided position, we use a second hash function to get a fixed increment for the “probe” sequence
§ hash1 gives the initial probe. hash2 gives the remaining probes
§ There are a couple of requirements for the second function:
o It must never evaluate to 0
o Must make sure that all buckets can be probed
h(k)=(hash1(k)+i * hash2 (k)) mod m
i = 0…m−1
Popular second hash function
hash2 (k) = R − (k mod R) , where R is a prime number that is smaller than the size of the table