DBMS - External Hashing Flashcards
Basic/Old Idea of Hashing
Decide on hash key, size of hash file, hash function. Hash the source data and place it in the file of RRN location indexed by the hash function.
To search for a record, hash its key and directly access record in hash file.
Hash address collision
hash addresses can collide
Hash key space
shouldnt be much larger than hash file address space
Hashing allows us to retrieve
a record on the basis of key value throughout use of a hashing function.
Hashing Uses and Issues
- Routine converts key value to int that corresponds to an address location in a hash file where record’s key values and pointers are kept
-Locations generated appear to be random
- With hashing, no of keys could be mapped into same address/location (collision)
- Another issue is exhaustion of pre-booked space
External hashing
Deals with hashing as a file indexing structure
Internal Hashing
Deals with main memory structures
If keyi > keyj
It is not necessary that h(keyi)>(keyj) unless h() is monotonic
Perfect hash function
Has no duplicate hashes
Minimal perfect hash function
has as many keys as hashes
Hash Function Examples
- Represent key in numerical form
- Prime Division
- Truncation
- Folding
Truncation
If key is 123456789 then truncation (of last 3 digits) will return an address of 789
Folding
Works by splitting the key into two and operating on these, so 1234 and 456 will add to an address of 1690
Prime Division
Say n is our file address size, divide key by a prime just greater than n (skewed dist)
Represent Key in numerical form
convert alphanumerics to numbers and apply function to a key to produce address
Hashing Collision Management
Caused because actual hash key distribution is not uniform. Size of hash index address space range i.e size of file.
Conflict Resolution
Overflow Area and Placement Management. If cells in a home address are taken then remaining synonyms are spanned over overflow area.
We can also use double hashing or a linked list to use multiple keys in a hash.
Home address
Has fixed size. Promising solution is to address extendable address space and hashing function
Operations over an external hashed file structure
- Create hash file (incl. overflow area) with initial no. of buckets, loading ratios etc,
- Insert records in bucket
- Compute address (key value) in memory
- Get from address into bucket
- Put an address into bucket
All these ops are very fast, unless overflow occurs, they are direct. 70-90% of loading can still operate with low level of collisions.
Hash structure characteristics
- Space: Pre-booking of space seems counter productive. Optimal system performance at 66% capacity.
- Operation’s Time: Searching, inserts. Sequential traversal not implicitly supported.
- Parameters: Bucket space, hash function distribution, hash key dist. in indexed dataset.
Advantages of Hashing
Very Fast
Disadvantages of Hashing
Hash files cannot be read out in record-sequence since they are stored in address-sequence.
Hash files grow substantially in size without redesign of hashing routine or address size sequence increase
Improvements
- Extensible hashing with better overflow management
- Multidimensional hashing
Making Hashing Disk Friendly
- Address space is to represent disk addressing, use RRN
- Space at address is called the bucket, can be a single disk block or a bunch of blocks
- Hash address collisions possible even with large buckets and low loading
- If h() is monotonic, we can sequential scan
- With static addressing we cant have a scalable index, but with extendible addressing we can.
RRN
Relative Record Number
Making Hashing Dynamic
- h() that grows with key address space
- hash file that grows and shrinks according to demand.
Solution for hashing
Extendible Hashing
Extendible Hashing, Basic Technique
Introduce search structure that is
1) additional to hash file
2) search structure entries are all based on (or part) of key entry’s hash
How to define extendible hashing function
- Select a hashing function ex. prime, folding
- Start with a bucket b0
- Eventually b0 overflows
- We may need more space
What if we need more space in extensible hashing
Bucket that overflows must be split in 2 by increasing local bucket depth by 1. Directory must be adjusted, first double its entries, then adjust global depth.
What to look for in hashing function
Ensure it can return a huge range. Compute hashing on key value and return a left string of the hash address generated.
What happens when the bucket eventually overflows in extendible hashing
Create new bucket b1 and share index entries with b0. Create a directory structure with two bit pattern entries 0 and 1, with each directory entry having a pointer to the respective bucket.
Also mark depth of directory as the max of all local depths. This is called the global depth.
Hash indexes can only handle
Simple equality comparisons.
Query Planner
consider using a hash index whenever an indexed column is involved in a comparison using the = operator
Command to create a hash index in postgres
CREATE INDEX name ON table USING hash(column);
Cases to use external hashing
group by queries, joins, distinct keywords, indexing on attributes.
Where is internal hashing used (ie. DBMS)
Intermediate IO ops, General use, Function name look up, Expressions look-up