L11: Indexing (2019) Flashcards
Input/Output Operations:
Costs of Searching over Sorted Records:
- Simple Scan
- Binary Search
Simple Scan:
Cost = O(N)
Binary Search:
Cost = O(lgN)
DBMS Buffer:
Basic Idea
Database has it’s own I/O Buffer.
Separate from OS Buffer.
The Buffer is handled by a Buffer Manager
Motivation:
- DB knows more about access patterns
- Recovery and Logging require the ability to flush to disk
DBMS Buffer:
Buffer Manager
Buffer Manager
- Handles supporting operations for the DB’s I/O Buffer
- Primarily handles and executes the Replacement Policy
- Finds which pages to flush/release when the buffer fails
- A DBMS typically implements its own buffer management routines, separate from the OS
Simplified File System:
Basic Components
Files consist of pages, pages consist of blocks
Block
The basic “chunk” of memory on the disk
Page
- A fixed size array of memory
- One or more disk Blocks
- Interface:
- Write to an entry( called a Slot), or set to “None”
File
- Variable Length list of Pages
- Interface:
- create/open/close; next_page(); etc…
Why are Sort Algorithms
important in a Database?
- Data requested in Sorted Order is very common in Databases
- Quick Sort is awful for large amounts of data
- Sorting is useful for eliminating duplicate copies in a collection of records
- Sorting is the first step in Bulk Loading, and B+ Tree Index
- Sort-Merge Join algorithm involves Sorting
Sorting Big Files:
Basic Process
- Split into chunks small enough to sort in memory
- Chunks are called Runs
- Merge pairs(or groups) of runs using the External Merge Algorithm
- Keep merging the resulting runs until left with one sorted file
- Each time merging is called a Pass
External Merge Sort Algorithm:
Example:
3 Buffer Pages
6 Page File
Steps:
- Split the file ( 2 sets of 3 pages)
- Sort Externally
- Load First page, f1 , in buffer (Run # 1), and sort
- Repeat for Second page, f2 , in buffer (Run #2)
- Run External Merge Sort Algorithm to put original file back together
External Merge Sorting Algorithm:
Calculating I/O Cost
Two Main Cost Factors:
*assume 1 read(R), 1 Write(W) per page
Cost of Loading, Cloading
Cloading = (#runs)*(#pages)*(1R + 1W) = 2RP
Cost of Merging, Cmerging
Cmerging = (#runs)*(#pages)*(1R + 1W) = 2RP
Total I/O Cost = Cloading + CMerging
General I/O Cost of
Splitting N-page file
into N single page runs
and sorting
First Pass:
Merge N/2 pairs of runs, each with length of 1 page
Second Pass:
Merge N/4 Pairs of runs, each with length of 2 pages
Generalized:
- For N pages, we perform lgN passes.
- Also perform 1 pass for initial split and sort.
- Each pass involves reading/writing:
- 2N I/O operations
Total I/O Cost = 2N*(lgN + 1)
Benefits of
Using Indexes
- Much faster searching than simple scan or binary search
- Faster insertions by avoiding shifting operations
- Search quickly among multiple attributes
- Indexes are critical across all DBMS types
Index:
Overview
Index:
A data structure mapping search keys to sets of rows in a database table.
- An index on a file speeds up selections on Search Key fields
- A Search Key can be any subset of fields
- Completely separate from Keys of a relation
- Provides efficient lookup and retrieval by search key values
- An index can store either the full rows it points to
- Called a Primary Index
- Or pointers to those rows
- Called a Secondary Index
Major Operations
on an Index
-
Search
- Quickly find all records which meet some conditions on the Search Key attribute
-
Insert/Remove entries
- Including Bulk Load and Bulk Delete
What does the term
“Covering” mean
wrt an Index?
If an Index contains all the needed attributes for a specific query,
the Index is said to Cover, or be Covering, for that query.
Basically, the query can be answered by using the Index alone.
High Level Categories
of
Index Types
(Data Structures)
-
B-Trees (B+ Trees)
- Very good for range queries, sorted data
- Old databases used B-trees
- Modern databases use B+ trees
-
Hash Tables
- Variants of this basic structure to deal with I/O
- Called Linear or Extensible Hashing
*Both these data structures are “IO aware”
Buffer Operations
-
Read( Page )
- Read a page from disk
- Place in buffer if not already in buffer
-
Flush( Page )
- Evict page from the buffer
- write to disk
-
Release( Page )
- Evict page from the buffer without writing to disk