DSA - Hash Tables Flashcards
What is a Hash Function?
hash(key) computes the location from a KEY
Does the data type of the key matter when we use the hash function?
NO
How Does a Hash Table Work?
Take the key, compute a number from it (without looking at any other keys or records), and use that number as an index into the array.
This gets us immediately, in 1 step, to where we want to be.
Where do you store KEY?
Store it in a Linked List / Tree / Hash Table
Describe assignment in a Hash Table?
- It is O(1) time complexity
- Very memory inefficient
==> If we need an array of size 10^7 but only use 500 slots, the rest of the array is still allocated, hence very inefficient.
Define Hash/Data Collision:
2 different keys whose hash values (e.g. the remainder mod the array size) map to the same index.
- Increasing the array length may decrease the likelihood of collisions (Does Not Fully Remove Them)
- Requires a robust & reliable strategy to resolve collisions
What Does a Hash Table Consist of:
- An array arr for storing the values
- A hash function hash(KEY) => turns a key into an index of the array
- A mechanism for dealing with collisions
What are the 2 Types of solutions to hash Collisions?
- Chaining Strategy
- Open Addressing Strategy
What does the Chaining Strategy do?
Uses another data structure at each position of the hash table.
- The chaining strategy covered here is Direct Chaining
- Uses a Linked List to store the values with the same hash (code)
What does the Open Addressing Strategy do?
GIVE THE 2 EXAMPLES?
Finds a different address in the same array for the colliding value to be stored.
(THE ADDRESS MUST BE UNUSED or OPEN)
- The Two Strategies are Linear Probing & Double Hashing
How to Insert with Direct Chaining?
- Always check if the key we are inserting is already in the LL at position hash(key)
- If it isn't, insert the key at the BEGINNING of the list
// INSERTING AT THE BEGINNING IS NOT REQUIRED, BUT IN PRACTICE IT IS IDEAL, AS A NEWLY INSERTED KEY IS MORE LIKELY TO BE ACCESSED SOON
TAKES O(n) in the worst case = may need to go through all elements already stored in the hash table
How to Delete with Direct Chaining?
To delete(key), we delete the key from the LL stored at position hash(key)
Takes O(n) in the worst case
How to do Lookup in Direct Chaining?
lookup(key) returns TRUE/FALSE, depending on whether the key is stored in the list at position hash(key)
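The three direct-chaining operations above can be sketched in Python (a minimal illustration; the class and method names are my own, and plain Python lists stand in for the linked lists):

```python
# Minimal direct-chaining hash table sketch (assumed names, not from the source).
# Each array slot holds a Python list acting as the chain.

class ChainedHashTable:
    def __init__(self, size=8):
        self.T = size
        self.arr = [[] for _ in range(self.T)]

    def _hash(self, key):
        return hash(key) % self.T       # hash(key) -> index into arr

    def insert(self, key):
        chain = self.arr[self._hash(key)]
        if key not in chain:            # first check the chain for the key
            chain.insert(0, key)        # insert at the BEGINNING of the list

    def delete(self, key):
        chain = self.arr[self._hash(key)]
        if key in chain:
            chain.remove(key)

    def lookup(self, key):
        return key in self.arr[self._hash(key)]
```

The membership checks walk the whole chain, which is where the O(chain length) cost of each operation comes from.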
What is a Bad Hash Function?
-All entries have the same hash value
- 1 LARGE Linked List (LL)
Why is a Bad Hash Function BAD?
It forces a LARGE linear search through that one long list
What makes a GOOD Hash Function?
- Uniformly distributes keys among positions
- A random key must have the same probability of being stored at every position
- The location given by hash(key) then has an expected number of entries stored there equal to n/T.
What is the Load Factor?
AVG. NO. of entries stored on a Location
n/T
n = total number of stored entries
T = size of hash table
REPRESENTS HOW FULL THE HASH TABLE IS
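As a quick illustration with made-up numbers (500 entries, table size 2000), the load factor is just n divided by T:

```python
# Illustrative numbers, not from the source: 500 entries in a table of size 2000.
n, T = 500, 2000
load_factor = n / T   # average number of entries stored per location
assert load_factor == 0.25   # the table is one quarter "full"
```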
What is an Unsuccessful lookup of a Key?
- The key is not in the table
- The location hash(key) stores n/T entries on AVG.
(We have to TRAVERSE them all: a LINEAR SEARCH over n/T entries)
What is a Successful Lookup for a key?
- The location hash(key) stores k = n/T entries on AVG
- On AVG, a linear search in a LL of k elements takes
(1/k)(1+2+…+k) = (1/k)·k(k+1)/2 = (k+1)/2 comparisons
- ASSUME a MAX load factor λ, that is, n/T <= λ
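The average-comparisons formula above can be checked numerically (a small sanity check, not part of the source):

```python
# Average comparisons in a successful linear search over a chain of k elements:
# the i-th element costs i comparisons, so the mean is (1+2+...+k)/k = (k+1)/2.
for k in range(1, 20):
    avg = sum(range(1, k + 1)) / k
    assert avg == (k + 1) / 2
```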
What is AVG case time complexity of unsuccessful lookup?
n/T <= λ comparisons => O(1)
What is AVG case time complexity of successful lookup?
(1/2)(1+n/T) <= (1/2)(1+λ) comparisons => O(1)
(λ is a constant)
Time Complexity of insert(key)?
- Same as an unsuccessful lookup: O(1)
- First check if the key is already stored in the table
- Otherwise, insert the key at the BEGINNING of the list stored at hash(key)
Time Complexity of delete(key)?
Same as Successful Lookup
O(1)
What 2 Assumptions are made for Direct Chaining?
1) We have a good hash function
- Expected length of the chains is n/T
2) We assume a maximal load factor
- n/T <= λ for a constant λ (MAKE SURE THE LOAD FACTOR IS BOUNDED BY SOME λ)
Assuming those 2 conditions, it can be shown that the operations of the Hash Table are ALL O(1).
A good hash function depends on the distribution of the data.
- THE CONSTANT TIME COMPLEXITY IS AMORTIZED ONLY
Disadvantages of the Chaining Strategy?
1) To avoid lots of hash collisions, we need lots of unused space
- Good performance requires maintaining a low load factor
{between 0.25 and 0.75}
- At 0.25 you are using an array 4x the number of entries you are going to store in the hash table
(A LOT OF UNUSED SPACE)
2) LLs require a lot of allocation (ALLOCATING MEMORY), which is slow
- Every insert needs to allocate memory
- Not only are we using space in the array, but allocating memory is an expensive operation to call.
What is the Idea of Insertion in LINEAR PROBING?
If the primary position hash(key) is occupied, search for the 1st available position to the right of it.
IF WE REACH THE END, WRAP AROUND
How to compute the Fallback Positions For LINEAR PROBING?
- Increment the value of hash(key) by 1 and take the result MOD the size of the array. Repeat until you find a FREE SPACE
- E.g => (hash(key) + 1) mod T
T = size of the array
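A sketch of the fallback-position sequence for linear probing (the function name is mine):

```python
def probe_positions(h, T):
    """Yield the primary position h, then (h+1) mod T, (h+2) mod T, ... (wraps around)."""
    for i in range(T):
        yield (h + i) % T

# With T = 7 and primary position 5, probing wraps around past the end:
assert list(probe_positions(5, 7)) == [5, 6, 0, 1, 2, 3, 4]
```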
Idea of Deleting using Linear Probing?
1- Find the key stored in the table:
- Start from the primary position hash(key) and go right until the key or an empty position is found.
2- If the key is stored in the table, replace it with a marker value (TOMBSTONE = #)
A later insert probes until it finds a # or an empty space and places the new value there
Issue With Deleting in LINEAR PROBING?
If we deleted a value involved in a collision by simply emptying its slot, a later search starting at that hash position would stop at the empty slot and report FALSE.
That is wrong: the colliding value had been moved to the next free space to the right, so without a tombstone we would fail to find it.
Linear Probing Searching?
- Start from Primary Position hash(key)
- Search for key to the right.
- Skip over # TOMBSTONES
- If we reach Empty Position, key is not in table
Linear Probing Inserting
(MORE ACCURATE)
- Search for hash(key) as in Searching
- NOTE the first Tombstone position we find
- If we find the key, signal an ERROR
- If we reach an empty position, the key is not in the table
- Insert the key into the 1st Tombstone found, if any
- Otherwise, insert into the empty position found
WE CAN'T INSERT UNTIL WE KNOW THAT THE KEY IS NOT ALREADY IN THE HASH TABLE.
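The insert procedure above, with the duplicate-key check and tombstone reuse, might look like this (a sketch; `EMPTY`, `TOMB` and `lp_insert` are my names, the array stores bare keys, and keys are assumed to be neither `None` nor `"#"`):

```python
EMPTY, TOMB = None, "#"   # assumed sentinel values for free and deleted slots

def lp_insert(arr, key, h):
    """Insert key whose primary position is h; arr slots hold a key, EMPTY, or TOMB."""
    T = len(arr)
    first_tomb = None
    for i in range(T):
        pos = (h + i) % T
        if arr[pos] == key:
            raise KeyError("key already in table")   # found the key -> signal error
        if arr[pos] == TOMB and first_tomb is None:
            first_tomb = pos                         # note the first tombstone
        if arr[pos] is EMPTY:                        # key is not in the table
            arr[first_tomb if first_tomb is not None else pos] = key
            return
    if first_tomb is not None:                       # no empty slot, but a tombstone
        arr[first_tomb] = key
    else:
        raise MemoryError("table is full")
```

Note the search must run until an empty slot (or the key) is found before writing, which is why the first tombstone is only remembered, not used immediately.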
Time Complexity of Linear Probing?
Insert, Search & Delete are ALL O(1) (amortized, assuming a good hash function and a bounded load factor)
Primary Cluster in LP ?
Clusters caused by entries with the same hash code.
Secondary Clusters in LP?
When the Collision handling strategy causes different entries to check the same sequence of locations when they collide
LP Problem with Clustering?
Clusters tend to get bigger and bigger, even with a small load factor.
Lots of clusters mean more probing, and thus longer lookups.
USE DOUBLE HASHING TO MAKE CLUSTERING LESS LIKELY TO OCCUR
What is Double Hashing?
Use a Primary & a Secondary hash function.
Primary Hash: the landing place in our array (the position in the array we go to first)
- IF THE SPACE IS EMPTY OR # WE CAN USE IT
Secondary Hash: add its result to the primary position as an offset to find a free space; repeat until we find one.
REDUCES PROBLEMS WITH SECONDARY CLUSTERS, SO FEWER COLLISIONS
For different key values, the offset will be different
Difference Between Linear Probing and Double Hashing?
- The difference is that in DOUBLE HASHING the offset is different for every key value, due to the secondary hash function
ALL OPERATIONS WORK THE SAME WAY FOR BOTH
Insertion Double Hashing?
Try the primary position hash1(key) first and, if it fails, try the fallback positions:
1) (hash1(key) + 1*hash2(key)) MOD T
2) (hash1(key) + 2*hash2(key)) MOD T
3) (hash1(key) + 3*hash2(key)) MOD T
REPEAT UNTIL WE FIND AN EMPTY SPACE
T = size of the ARRAY
Linear Probing Fallback Calculation?
(hash(key) + i) mod T
i=1,2,3…
Double Hashing Fallback Calculation?
(hash1(key) + i*hash2(key)) mod T
i = 1,2,3,…
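A sketch of the double-hashing probe sequence, using assumed example values for hash1(key), hash2(key) and T:

```python
def dh_positions(h1, h2, T):
    """Probe sequence (h1 + i*h2) mod T for i = 0, 1, 2, ..., T-1."""
    return [(h1 + i * h2) % T for i in range(T)]

# With T = 7, hash1(key) = 3 and hash2(key) = 2 (illustrative values),
# the sequence visits every position exactly once:
assert dh_positions(3, 2, 7) == [3, 5, 0, 2, 4, 6, 1]
```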
Problem with Double Hashing?
Can have Short Cycles
What are short cycles in Double Hashing?
A short cycle is a loop where, after a few probes, we get back to an already visited position.
WE CAN'T FIT ANY MORE VALUES, even though free spaces may remain (e.g. the probe sequence might only ever visit even indices, leaving the odd ones unreachable)
How to Fix Short Cycles?
Make the table size T & hash2(key) COPRIME.
2 SOLUTIONS:
1) T is PRIME
- If the array size is prime, T and hash2(key) won't share a common divisor
- But this is difficult when growing the array
2) T = 2^k and hash2(key) is ODD
- Test if hash2(key) is ODD; if EVEN, add 1
- This just sets the least significant bit so it is ODD
- Works because an odd hash2 and T = 2^k share no common divisor
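A small demonstration of a short cycle and the least-significant-bit fix (the numbers are illustrative):

```python
T = 8                                   # T = 2^3
# An even hash2 value (4) shares a divisor with T, so the probe
# sequence cycles between just two positions:
bad = {(3 + i * 4) % T for i in range(T)}
assert bad == {3, 7}                    # only 2 of the 8 positions are visited

# Fix: force hash2 to be odd by setting its least significant bit.
h2 = 4 | 1                              # 4 -> 5, which is coprime with 8
good = {(3 + i * h2) % T for i in range(T)}
assert good == set(range(T))            # every position is reachable
```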
What is the problem with a large common divisor?
The larger the common divisor of hash2(key) and T, the more positions the probe sequence skips over and never reaches.
What does Coprime mean?
a & b are coprime if no number other than 1 divides both a & b.
Prime numbers are the numbers divisible only by 1 and themselves.
For 2nd Solution for short cycles, why is the array of size 2^k?
- Doubling a size of 2^k keeps it of the form 2^k
- Less fragmentation; however, we must keep hash2(key) coprime with the array length
DOUBLING THE SIZE OF THE ARRAY WHEN IT GETS FULL ENSURES T = 2^k
What happens when table is Full?
The Hash Table counts as full when n/T > λ.
The array need not be literally full: n/T exceeding λ just means the load factor bound is violated.
What does n/T > λ mean?
If true, allocate a new table twice the size and move the values into the new table.
The Idea of Rehashing?
When the table becomes full after an insertion, allocate a new table twice the size and INSERT all elements from the old table into it.
WE CAN CACHE THE HASH FUNCTION VALUES, SO IT IS QUICKER TO PLACE ENTRIES INTO THE NEWLY ALLOCATED ARRAY
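The rehashing step might be sketched like this for a direct-chaining table (the function name and the use of Python's built-in `hash` as the hash function are my assumptions):

```python
def rehash(arr, T):
    """Allocate a table of size 2*T and re-insert every entry from arr."""
    new_T = 2 * T
    new_arr = [[] for _ in range(new_T)]
    for chain in arr:                    # move all elements from the old table
        for key in chain:
            new_arr[hash(key) % new_T].insert(0, key)
    return new_arr, new_T
```

The full pass over all n entries is what makes the occasional insert cost O(n), while doubling keeps the amortized cost O(1).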
Consequence of Insert?
The worst case time complexity is O(n), when REHASHING occurs
The AMORTIZED TIME COMPLEXITY is O(1)
What do we need to do with the hash function when doubling array size?
Need to change hash function
Usually:
- bigHash(key) mod T
(bigHash COMPUTES a big hash code)
AFTER DOUBLING THE SIZE:
bigHash(key) mod (2*T)
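For illustration, Python's built-in `hash` can stand in for bigHash (an assumption, not the source's function):

```python
def index_for(key, T):
    """Reduce a big hash code to an index in an array of size T."""
    return hash(key) % T   # hash() plays the role of bigHash here

T = 16
assert 0 <= index_for("example", T) < T
# After doubling, the same big hash code is simply reduced mod 2*T:
assert 0 <= index_for("example", 2 * T) < 2 * T
```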
What is a Hash Table?
An ADT whose implementation consists of an array arr, a primary hash function hash1(key) {and maybe hash2(key)}, and a collision-resolution mechanism
Comparing Hash Table to Trees?
AVL:
- Requires keys to be comparable; the operations are O(log n) in the BEST, AVG & WORST CASE
HASH:
- Requires a good hash function; operations have O(1) AMORTIZED time complexity FOR ALL OPERATIONS
Double Hashing Pros over the other 2 Hash Table Methods?
Better performance advantages:
- Clustering in LP is worse than in DH
- DC needs to allocate memory, which has a large constant cost
THESE PROS AFFECT THE CONSTANT FACTOR, NOT THE ASYMPTOTIC COMPLEXITY
What happens if load factor drops below min threshold?
RARELY done, as it doesn't speed up performance.
Rehash into a hash table 1/2 the size.
SAVES MEMORY
What happens if the number of Tombstone values exceeds a threshold value?
Rehash without doubling the size
- We may get a hash table 1/2 the size
- CONSEQUENTLY delete is O(1) AMORTIZED time complexity