ProbabilisticDataStructure Flashcards
What are Probabilistic Data Structures used for?
To handle large amounts of data by providing approximate answers that are sufficient for many applications
What is hashing?
A technique that maps data (keys) to a limited addressing space (slots in a hash table)
What happens in a hash collision?
Multiple keys are mapped to the same slot in the hash table
What are the three main collision resolution techniques?
- Chaining 2. Open Addressing 3. Cuckoo Hashing
How does chaining resolve collisions?
Elements mapping to the same slot are stored in a linked list
How does Cuckoo Hashing work?
Uses two hash functions to give each element two possible positions in the table. If a position is occupied, the existing element is moved to its alternative position
What is the main purpose of a Bloom Filter?
To verify if an element is present in a set
What are the possible responses of a Bloom Filter?
‘Definitely no’ if at least one bit is not set, ‘probably yes’ if all bits are set
What is the key advantage of Cuckoo Filter over Bloom Filter?
Cuckoo Filter allows element deletion
What is Count-min Sketch used for?
To serve as an approximate frequency table for elements in a data stream
How does Count-min Sketch estimate frequency?
Takes the minimum value among the counters corresponding to the element across different rows
What is the purpose of HyperLogLog?
To estimate the cardinality (number of distinct elements) of a set in a data stream
What is the key observation behind HyperLogLog?
The maximum length of leading zeros in the binary representation of a hash correlates with the number of distinct elements
What is the main advantage of HyperLogLog?
High precision with very low memory usage
How does HyperLogLog handle bad hash values?
By dividing the stream into substreams and averaging values, reducing the impact of one bad hash