2. Disk and Files Flashcards
How databases use memory and disks? Why?
Whenever a database processes data, that data must exist in memory. Accessing this data is relatively fast, but once the data becomes very large, it becomes impossible to fit all of it within memory. Disks are used to cheaply store all of a database’s data, but they incur a large cost whenever data is accessed or new data is written.
Basic units of data for databases and disks
The basic unit of data for relational databases is a record (row). These records are organized into relations (tables) and can be modified, deleted, searched, or created in memory.
The basic unit of data for disk is a page, the smallest unit of transfer from disk to memory and vice versa
How relational databases are represented on a disk?
In order to represent relational databases in a format compatible with disk, each relation is stored in its own file and its records are organized into pages in the file.
How relations are organized on disk?
4 main decisions to make?
Based on what are they made?
Based on the relation’s schema and access pattern, the database will determine:
(1) type of file used, (2) how pages are organized in the file, (3) how records are organized on each page, (4) and how each record is formatted.
2 main types of database files
Heap Files and Sorted Files
What’s meant by “one I/O” in databases?
1 I/O is equivalent to 1 page read from disk or 1 page write to disk
What is a heap file?
A heap file is a file type with no particular ordering of pages or of the records on pages
2 main implementations of a heap file
Linked List
Page Directory
Linked List Heap File - describe the structure
In the linked list implementation, each data page contains records, a free space tracker, and pointers (byte offsets) to the next and previous page. There is one header page that acts as the start of the file and separates the data pages into full pages and free pages.
When free data pages become full, they are moved from the free space portion to the front of the full pages portion of the linked list.
Page Directory Heap File - describe the structure
Linked list for header pages.
Each header page contains a pointer (byte offset) to the next header page, and its entries contain both a pointer to a data page and the amount of free space left within that data page.
The main advantage of page directories over linked lists for heap files?
Fast inserts.
Needs to read only header pages to find a page with free space.
Heap files vs. sorted files - the main trade-off
Heap files - fast insert
Sorted files - fast search
What is a sorted file?
A sorted file is a file type where pages are ordered and records within each page are sorted by key(s).
How sorted files are commonly implemented?
These files are implemented using Page Directories and enforce an ordering upon data pages based on how records are sorted.
The average cost of insert and search for sorted files
Searching through sorted files takes logN I/Os where N = # of pages since binary search can be used to find the page containing the record. Meanwhile, insertion, in the average case, takes logN + N I/Os since binary search is needed to find the page to write to and that inserted record could potentially cause all later records to be pushed back by one. On average, N / 2 pages will need to be pushed back, and this involves a read and a write IO for each of those pages, which results in the N I/Os term.