Cost Estimation of Selection Algorithms Flashcards
Describe measures of query cost
Cost is generally measured as the total time it takes to answer a query. Several factors contribute to query costs eg disk accesses, CPU, and network communication
Discuss disk access as measure of query cost
Disk access is the most predominant measure of query cost.
It is calculated by
no of disk accessesdisk access cost
no of blocks readaverage block-read-cost
no of blocks written*average block-read cost
Compare the cost to read and write a block
The cost to write a block is higher than the cost to read a block since once a block is written it must be read back for confirmation
What things will our cost formulae ignore for simplicity
The cost formulae will just have the number if block transfers form the disk as the cost measure
It will ignore CPU costs and Cost of writing output to disk.
It will also ignore the difference between sequential and random IO
Relate cost and the size of the buffer
-An increased size of memory buffer reduces the number of disk accesses required hence reducing the cost.
-The Amount of real memory available to buffer available changes with time based on the concurrent processes, however, we often use the worst case when calculating costs
What is a file scan
Search algorithms that locate and retrieve records that fulfil a selection condition
Describe Algorithm S1, explaining how it works, and its approximate costs
Also known as linear search
Searches every file and tests all records to see if they fulfill the search condition
The cost of S1 is br (which represents the total number of blocks containing relation r)
If the search condition is on a key attribute then the cost is (br/2) since the search will stop once the record is found
What are the benefits of S1
It can be performed regardless of :
the presence of indexes
type of selection condition
ordering of records in the file
Do example on 17
*
Describe S2
Also known as binary search
Applicable only when the condition is an equality condition
*See notes for cost estimate
But basically:
cost= log2(br)+ (number of matches for equality condition/br)-1
All decimals are rounded up
*Do questions in a document called ADQOQ in downloads p6
Describe S3a
Used when equality condition is on a key attribute with a primary index.
Index is used to locate the record
Cost-> Number of index levels accessed + 1
Describe S3b
Used when equality condition is on an attribute with a hash key.
*Retrieves a single record
The hashing algorithm is used to locate the address of the record.
If there is no overflow, the cost of S3b is 1.
If there is overflow, additional accesses may be necessary
Describe S4
Uses a primary index to access multiple records.
- Used when a <, <=, > or >= comparison is used on an attribute with a primary index
-The primary index is first used to locate the record satisfying the equality condition then the result is either the records before or after it it depending,
-It assumes the records are sorted.
Assuming uniform distribution, then we would expect half the tuples to satisfy the inequality, so
the estimated cost is:
Number of levels accessed + number of blocks/2
Describe S5
Uses the clustering index to locate multiple records
-If the condition is on an attribute A that is not the primary key but has a secondary clustering index, then the index can be used to locate the required records.
-See notes 22 for cost
-But basically, it is the number of levels accessed+ the number of blocks determined to contain the records with the cluster index
-To get the second part its records with the index/bfr
(Similar to last part of S2)
Describe S6
Uses a secondary (B+ tree index)
-If the index is on a key attribute, then it is used to retrieve a single attribute and if it is not on a key attribute, then it retrieves multiple values
The cost depends on whether or not it is on a key attribute.
If it is the cost is
number of levels accessed +1 (like S3a)
If it is not the cost is
number if levels accessed + number of records with index (very costly)