HashTableSet Flashcards
Map ADT
has functions: size(), get(k), put(k,v), remove(k), find(k)
Hash Table (unordered_map)
The size of a hash table is proportional to the number of keys (not the magnitude of the keys). The key is the input to a hash function, which generates a hash code (output); the hash code determines the index where the element is stored.
Hash Function
Takes in a key and outputs a hash code used to generate the index value for insertion into a table/map. Must be fast.
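A minimal sketch of such a hash function, assuming string keys and a polynomial rolling hash (the multiplier 31 is a conventional, illustrative choice; the function names are hypothetical):

```cpp
#include <cstddef>
#include <string>

// Polynomial rolling hash: a common, fast string hash.
std::size_t simple_hash(const std::string& key) {
    std::size_t h = 0;
    for (char c : key) {
        h = h * 31 + static_cast<unsigned char>(c);
    }
    return h;
}

// Map the hash code onto a table index.
std::size_t index_for(const std::string& key, std::size_t table_size) {
    return simple_hash(key) % table_size;
}
```

The modulo step at the end is what turns an arbitrary hash code into a valid table index.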
Hash Table Concerns
Hash codes are not unique: multiple keys can generate the same hash code, resulting in collisions. Handling collisions is the main focus of maintaining optimal
runtime in the Hash Table data structure.
What is a possible ADT using Red-Black Tree DS?
Red-Black Tree can support a (C++) Set ADT
What is a possible ADT using Linked List DS?
Linked List can support a (C++) List ADT
What is a possible ADT using ArrayList DS?
ArrayList can support a (C++) Vector ADT
Set ADT
has functions: size(), insert(x) [no duplicates], remove(x), find(x)
Lexicon vs. Numbers
Strings are a sequence of characters; numbers are a single entity
Red-Black Tree Advantages
Faster Insert than AVL: RBT insertion traverses the tree once instead of twice.
While an RBT is slightly taller than an AVL tree, finding is still guaranteed O(log n)
Red-Black Tree Disadvantages
Slower to Find (than AVL): RBTs are (slightly) taller than AVL trees
Red-Black Tree Insert Algorithm
While not at a leaf: 1. Move down the tree toward the destination; whenever a node has two red children, recolor and rotate as necessary to fix it. 2. Insert the node. 3. Do final rotations to fix the tree. (No need to go back up the tree.)
Runtime: O(log n)
Can any node have 2 red children?
A black node can have 2 red children. Inserting a new (red) node beneath such a pair would violate an RBT rule, so recoloring (the red children become black and the black parent becomes red) and rotation as necessary are applied; this ensures that inserting into this scenario still yields a valid tree.
A red node can never have a red child.
What DS can implement the Set ADT?
Array List, Linked List, Binary Search Tree, Randomized Search Tree, AVL Tree, Red-Black Tree
What DS can implement the Map ADT?
Array List, Linked List, Binary Search Tree, Randomized Search Tree, AVL Tree, Red-Black Tree
True or False: Set ADT can be modified to relatively simply implement Map ADT
True
True or False: Map ADT can be modified to relatively simply implement Set ADT
True
True or False: Implementing Set ADT using RBT has better worst case Big-O runtime complexity for finding and inserting elements than implementing the Set ADT using Linked List
True
Worst Case Time Complexity of Insert to Hash Table
O(n)
Average Case Time Complexity of Insert to Hash Table
O(1)
Average Case Time Complexity of Find to Hash Table
O(1)
Worst Case Time Complexity of Find to Hash Table
O(n)
If all keys were even (or all odd), what might happen?
Some hash values would never be used, leaving empty indices (e.g., all-even keys mod an even table size can only land on even indices)
Hash Functions Properties
Input is an object x
Output is an integer representation of x
(necessary) Property of Equality: if x = y, then h(x) must equal h(y)
(not necessary, but desirable) Property of Inequality: if x is not equal to y, it would be nice if h(x) were not equal to h(y)
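The two properties can be illustrated with a deliberately weak (hypothetical) hash: it satisfies the required Property of Equality, but fails the desirable Property of Inequality badly, since all same-length strings collide:

```cpp
#include <cstddef>
#include <string>

// A legal but weak hash: equal strings always get equal codes,
// but many unequal strings collide (every same-length pair).
std::size_t length_hash(const std::string& s) {
    return s.size();
}
```

A hash like this still works correctly in a hash table; it just degrades performance, because every collision must be resolved.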
UnSorted Array List and Linked List, what is worst case time complexity of FIND?
O(n)
Randomized Search Tree and well-structured Skip List, average time complexity of FIND?
O(log n)
Sorted Array List and Binary Search Tree, what is worst case time complexity of FIND?
O(log n)
Hashing is the: (definitions)
transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string.
one way to enable security during the process of message transmission when the message is intended for a particular recipient only… Hashing is also a method of sorting key values in a database table in an efficient manner.
involves applying a hashing algorithm to a data item, known as the hashing key, to create a hash value.
Bloom Filter
is a SPACE-EFFICIENT data structure used to check whether an item is in a set.
The trade-off for space efficiency is that the answer is probabilistic rather than deterministic: either a TRUE NEGATIVE ("definitely not in the set") or "might be in the set"
Strength of Bloom Filter
Space-efficient: O(1) space, regardless of number of items inserted.
Fast: insert and lookup operations are both O(1) time
Weaknesses of Bloom Filter
Probabilistic: Bloom filters can only definitively identify true negatives. They cannot identify true positives; a positive only means "might be there"
Limited interface: only supports insert and lookup. Cannot iterate through the items in the set or delete them.
Implementation of Bloom Filter
Bitmap default to 0. Insertion changes value to 1
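A minimal sketch of this bitmap implementation, assuming a fixed 1024-bit table and two hash functions improvised by salting std::hash (the class name, size, and salt are illustrative):

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>

// Minimal Bloom filter: M bits, two hash functions.
class BloomFilter {
    static const std::size_t M = 1024;  // number of bits in the bitmap
    std::bitset<M> bits;                // all bits default to 0

    std::size_t h1(const std::string& k) const {
        return std::hash<std::string>{}(k) % M;
    }
    std::size_t h2(const std::string& k) const {
        return std::hash<std::string>{}(k + "#salt") % M;
    }

public:
    // Insertion changes the bits at both hash positions to 1.
    void insert(const std::string& k) {
        bits.set(h1(k));
        bits.set(h2(k));
    }

    // false => definitely NOT in the set (true negative)
    // true  => MIGHT be in the set (possible false positive)
    bool might_contain(const std::string& k) const {
        return bits.test(h1(k)) && bits.test(h2(k));
    }
};
```

Because bits are only ever set, never cleared, a lookup can report a false positive but never a false negative.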
General Course Thesis:
Design data structures that
trade per-operation efficiency for
overall efficiency
Collision Resolution Strategy: Linear Probing
Collision strategy: if a key maps to an index that is already occupied, simply shift over and try the next available index.
Linear probing is a closed hashing collision resolution strategy (i.e., the actual key is always stored at some address within the bounds of our Hash Table).
Closed hashing is also called open addressing: the destination is not entirely determined by the hash, since if a slot is occupied we shift over some number of slots. Keys are OPEN to moving.
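The shift-over behavior can be sketched as follows, assuming integer keys and a table that is never allowed to fill (a real table would track its load factor and resize):

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Linear-probing insert/find over an array of optional slots.
class LinearProbeSet {
    std::vector<std::optional<int>> slots;

public:
    explicit LinearProbeSet(std::size_t m) : slots(m) {}

    void insert(int key) {
        std::size_t i = static_cast<std::size_t>(key) % slots.size();
        // On collision, shift to the next index (wrapping around).
        // This deterministic step is what causes clumping.
        while (slots[i].has_value() && *slots[i] != key) {
            i = (i + 1) % slots.size();
        }
        slots[i] = key;
    }

    bool find(int key) const {
        std::size_t i = static_cast<std::size_t>(key) % slots.size();
        // Probe until we hit the key or an empty slot.
        while (slots[i].has_value()) {
            if (*slots[i] == key) return true;
            i = (i + 1) % slots.size();
        }
        return false;
    }
};
```

Note this sketch omits remove(), which is exactly where open addressing gets tricky (hence the "delete" flags discussed below).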
A weakness of linear probing is the "clumping" of inserts around each other as colliding keys shift to the next index. Probabilistically, this clumping grows faster than the straightforward M/N occupancy probability would suggest.
Simply put, the likelihood of further collisions is greater with this resolution strategy. It is somewhat counterintuitive.
With the initial indexing (probe, then shift on collision), all slots are equally likely to be filled; but upon collision, we deterministically increase the probability of filling certain slots, namely the slots that extend a clump.
We need a “new” approach that is deterministic in searching but does not increase probability of collision with each insertion in the manner that linear probing does. We need to distribute the insertion probability over the entire array/table/etc
Double hashing outperforms linear probing
However, while not facing problem of collision when inserting, we still have the issue with remove()
We still have to worry about reinserting elements into the Hash Table periodically to clean up the “delete” flags
In all the open-addressing methods discussed so far, the probability of future collisions increases each time an inserting key faces a collision
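A double-hashing probe sequence can be sketched as follows; because the step size h2(k) depends on the key, keys that collide at h1 follow different probe paths, which reduces clumping (the helper name is illustrative):

```cpp
#include <cstddef>

// i-th probe index for double hashing: (h1 + i * h2) mod m.
// h2 must never be 0, or the sequence would not move; a common
// trick is to define h2(k) = 1 + (hash(k) % (m - 1)).
std::size_t probe(std::size_t h1, std::size_t h2, std::size_t attempt,
                  std::size_t m) {
    return (h1 + attempt * h2) % m;
}
```

For example, two keys with the same h1 but different h2 values collide only on the very first probe and then diverge.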
Separate Chaining: each Hash Table slot provides a pointer to the data, not the data itself
Motivated by worst case run time increasing towards O(n)
Each index of the Hash Table stores a pointer to a linked list
1. Look up the index from hashing H_1(k)
2. Follow the pointer to the linkedList at arr[index]
3. Check if k is in the linkedList:
3a. If not, insert k into the linkedList at arr[index] (this is closed addressing)
3b. If it is, the element is already in the table.
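The steps above can be sketched with std::list chains (the class and names are illustrative):

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Separate chaining: each slot holds a linked list (chain) of
// every key that hashed to that index.
class ChainedSet {
    std::vector<std::list<int>> table;

public:
    explicit ChainedSet(std::size_t m) : table(m) {}

    // Returns false if the key was already present (step 3b).
    bool insert(int key) {
        auto& chain = table[static_cast<std::size_t>(key) % table.size()];
        for (int k : chain)
            if (k == key) return false;  // already in the table
        chain.push_back(key);            // step 3a
        return true;
    }

    bool find(int key) const {
        const auto& chain =
            table[static_cast<std::size_t>(key) % table.size()];
        for (int k : chain)
            if (k == key) return true;
        return false;
    }
};
```

Unlike open addressing, remove() here is simple: just unlink the node from its chain.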
Separate Chaining is often considered OPEN HASHING collision resolution strategy
Open hashing and closed addressing are often used interchangeably
With Separate chaining and a doubly linked list behind each index, what is the expected worst case runtime?
O(n)
What is the compromise in Separate Chaining used in a HashTable?
Memory: we are storing pointers,
and the chained memory is "disconnected" from the original hashTable.
This distance causes incremental issues with poor cache performance (the computer requires time to traverse to the data and send it back through a bus or other system outside the CPU).
Cuckoo Hashing
Most aggressive Collision Resolution Strategy
Uses two hash functions H_1(k) and H_2(k) , where one key can generate two different locations
As both tables begin to fill up, the probability of collisions increases. However, unlike the other collision resolution strategies we discussed in which a key k could end up trying every single Hash Table slot until the entire Hash Table was full (e.g. Linear Probing, Double Hashing, Random Hashing), a key k in Cuckoo Hashing only has two different locations that it can map to (index1 = H_1(k) and index2 = H_2(k)).
Unless Cuckoo Hashing has a limit as to how many times it attempts to move a key, the algorithm will never stop.
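A sketch of cuckoo insertion where a fixed displacement limit stands in for cycle detection (the hash functions here are deliberately simple and illustrative; real implementations use stronger, independent hashes):

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Cuckoo hashing with two tables and two hash functions.
class CuckooSet {
    std::vector<std::optional<int>> t1, t2;

    std::size_t h1(int k) const {
        return static_cast<std::size_t>(k) % t1.size();
    }
    std::size_t h2(int k) const {
        return (static_cast<std::size_t>(k) / t1.size()) % t2.size();
    }

public:
    explicit CuckooSet(std::size_t m) : t1(m), t2(m) {}

    bool insert(int key) {
        if (find(key)) return true;
        int k = key;
        for (int attempt = 0; attempt < 32; ++attempt) {
            // Place k in table 1; evict any occupant ("kick the cuckoo").
            std::optional<int> evicted = std::exchange(t1[h1(k)], k);
            if (!evicted) return true;
            k = *evicted;
            // The evicted key moves to its slot in table 2.
            evicted = std::exchange(t2[h2(k)], k);
            if (!evicted) return true;
            k = *evicted;
        }
        return false;  // too many displacements: give up (would rehash)
    }

    // Worst-case O(1): a key can only be in one of two slots.
    bool find(int key) const {
        return t1[h1(key)] == key || t2[h2(key)] == key;
    }
};
```

The constant-time find() and delete follow directly from the two-slot restriction: there are only two places a key could ever be.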
Rehashing is generally done by introducing two new hash functions and reinserting all the elements.
It is important to make sure that the second hash function used returns different indices for keys that originally hashed to the same index. This is because, if a key collides with another key in the first Hash Table, we want to make sure that it will not collide with the same key again in the second Hash Table.
By making the sacrifice of only allowing each key to hash to strictly two different locations (thereby potentially causing the cycle in the first place), we end up getting a reward of a worst-case constant time complexity for two of our major operations.
Find(): O(1)
Delete: O(1)
In Cuckoo Hashing, each key is not able to hash to more than 2 locations.
KEY aspect/strength of CUCKOO Hashing
If you ever cycle (come back to index in table 1), then you quit
MAP ADT allows the mapping of keys to their corresponding values
MAP ADT is considered an associative array: it associates each key with its corresponding data.
MAP ADT
put(key, value) get(key) remove(key) size() isEmpty()
MAP ADT can be implemented with binary tree or with a HASHTABLE
Implementing MAP ADT with a Hash Table we call it now a HashMAP
We add two additional functions:
hash_function(key)
key_equality(key1, key2)
Essentially, in a hash map keys must be hashable and must have an equality test to check for uniqueness.
If a designer wants to hash a custom object, they need to overload the hash and equality member functions
Bloom Filters address achieving fast runtime when determining True Negatives, at the cost of added False Positives, which isn't that "bad"
Uses an underlying probabilistic DATA STRUCTURE
Sacrifice precision for memory efficiency
Memory uses booleans instead of integer data, so literally 1 bit per space/index
Benefit of a Bloom Filter: as m (the number of buckets/spaces in the table) increases, the probability of a False Positive decreases.
In creating a bloom filter we should do our best to estimate how many elements the user may need, or how many elements they plan to insert.
We try to pick a relatively small epsilon we can tolerate. A smaller epsilon means we will implement a larger table to minimize false positives, but that means using more memory.
Determine how many hash functions to use: k = -log_2(EPSILON)
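Using the standard Bloom filter sizing formulas, m = -n * ln(eps) / (ln 2)^2 bits for n expected insertions, and k = -log_2(eps) hash functions, the parameters can be computed as (function names are illustrative):

```cpp
#include <cmath>
#include <cstddef>

// Number of bits for n expected insertions and tolerated
// false-positive rate eps:  m = -n * ln(eps) / (ln 2)^2
std::size_t bloom_bits(std::size_t n, double eps) {
    return static_cast<std::size_t>(
        std::ceil(-static_cast<double>(n) * std::log(eps) /
                  (std::log(2.0) * std::log(2.0))));
}

// Number of hash functions:  k = -log_2(eps), rounded up
std::size_t bloom_hashes(double eps) {
    return static_cast<std::size_t>(std::ceil(-std::log2(eps)));
}
```

For example, tolerating a 1% false-positive rate over 1000 insertions calls for roughly 9.6 bits per element and 7 hash functions.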
Count-Min Sketch can only report the maximum possible count of X (an upper bound). It cannot report an exact value.
Benefit is that it provides important valuable data without adding more memory consumption.
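A minimal Count-Min Sketch illustrating why the reported count is an upper bound: collisions can only inflate counters, never shrink them, and taking the minimum over rows gives the tightest of the inflated estimates. (Salting std::hash per row is an illustrative stand-in for d independent hash functions.)

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Count-Min Sketch: d rows of w counters, one hash per row.
class CountMinSketch {
    std::size_t w, d;
    std::vector<std::vector<std::uint64_t>> rows;

    std::size_t col(std::size_t row, const std::string& k) const {
        return std::hash<std::string>{}(k + "#" + std::to_string(row)) % w;
    }

public:
    CountMinSketch(std::size_t width, std::size_t depth)
        : w(width), d(depth),
          rows(depth, std::vector<std::uint64_t>(width, 0)) {}

    // Increment one counter per row.
    void add(const std::string& k) {
        for (std::size_t r = 0; r < d; ++r) ++rows[r][col(r, k)];
    }

    // Upper bound on how many times k was added: the minimum over rows.
    std::uint64_t count(const std::string& k) const {
        std::uint64_t best = UINT64_MAX;
        for (std::size_t r = 0; r < d; ++r)
            best = std::min(best, rows[r][col(r, k)]);
        return best;
    }
};
```

The memory used is fixed (w * d counters) no matter how many items are streamed through, which is the "no added memory consumption" benefit.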