2. Disk and Files Flashcards
How databases use memory and disks? Why?
Whenever a database processes data, that data must exist in memory. Accessing this data is relatively fast, but once the data becomes very large, it becomes impossible to fit all of it within memory. Disks are used to cheaply store all of a database’s data, but they incur a large cost whenever data is accessed or new data is written.
Basic units of data for databases and disks
The basic unit of data for relational databases is a record (row). These records are organized into relations (tables) and can be modified, deleted, searched, or created in memory.
The basic unit of data for disk is a page, the smallest unit of transfer from disk to memory and vice versa
How relational databases are represented on a disk?
In order to represent relational databases in a format compatible with disk, each relation is stored in its own file and its records are organized into pages in the file.
How relations are organized on disk?
4 main decisions to make?
Based on what are they made?
Based on the relation’s schema and access pattern, the database will determine:
(1) type of file used, (2) how pages are organized in the file, (3) how records are organized on each page, (4) and how each record is formatted.
2 main types of database files
Heap Files and Sorted Files
What’s meant by “one I/O” in databases?
1 I/O is equivalent to 1 page read from disk or 1 page write to disk
What is a heap file?
A heap file is a file type with no particular ordering of pages or of the records on pages
2 main implementations of a heap file
Linked List
Page Directory
Linked List Heap File - describe the structure
In the linked list implementation, each data page contains records, a free space tracker, and pointers (byte offsets) to the next and previous page. There is one header page that acts as the start of the file and separates the data pages into full pages and free pages.
When free data pages become full, they are moved from the free space portion to the front of the full pages portion of the linked list.

Page Directory Heap File - describe the structure
Linked list for header pages.
Each header page contains a pointer (byte offset) to the next header page, and its entries contain both a pointer to a data page and the amount of free space left within that data page.

The main advantage of page directories over linked lists for heap files?
Fast inserts.
Needs to read only header pages to find a page with free space.
Heap files vs. sorted files - the main trade-off
Heap files - fast insert
Sorted files - fast search
What is a sorted file?
A sorted file is a file type where pages are ordered and records within each page are sorted by key(s).
How sorted files are commonly implemented?
These files are implemented using Page Directories and enforce an ordering upon data pages based on how records are sorted.
The average cost of insert and search for sorted files
Searching through sorted files takes logN I/Os where N = # of pages since binary search can be used to find the page containing the record. Meanwhile, insertion, in the average case, takes logN + N I/Os since binary search is needed to find the page to write to and that inserted record could potentially cause all later records to be pushed back by one. On average, N / 2 pages will need to be pushed back, and this involves a read and a write IO for each of those pages, which results in the N I/Os term.
2 main record types
Fixed Length Records (FLR) and Variable Length Records (VLR)
FLRs only contain fixed length fields (integer, boolean, date, etc.)
VLRs contain both fixed length and variable length fields (eg. varchar),
How Variable Length Records store their fields?
VLRs store all fixed length fields before variable length fields and use a record header that contains pointers to the end of the variable length fields.

Record ID - purpose, components
every record can be uniquely identified by its record id -
[page #, record # on page].
Packed pages with Fixed Length Records - describe
Pages containing FLRs always use page headers to store the number of records currently on the page.
If the page is packed, there are no gaps between records. This makes insertion easy as we can calculate the next available position within the page using the # of existing records and the length of each record. Once this value is calculated, we insert the record at the computed offset. Deletion is slightly trickier as it requires moving all records after the deleted record towards the top of the page by one position to keep the page packed.
Unpacked pages with Fixed Length Records - describe
Pages containing FLRs always use page headers to store the number of records currently on the page.
If the page is unpacked, the page header typically stores an additional bitmap that breaks the page into slots and tracks which slots are open or taken. Using the bitmap, insertion involves finding the first open bit, setting the new record in the corresponding slot, and then setting the bit for that slot. With deletion, we clear the deleted record’s corresponding bit so that future inserts can overwrite that slot.

Pages with Variable Length Records - describe
Each page uses a page footer that maintains a slot directory tracking slot count, a free space pointer, and entries. The footer starts from the bottom of the page rather than the top so that the slot directory has room to grow when records are inserted.
Each entry in the slot directory consists of a [record pointer, record length] pair.
insert - at the free space pointer, a new [pointer, length] pair is set in any available null entry (or a new one if no nulls)
delete - set both the record pointer and record length to null.
Re-grouping/packing depends on the implementation.

Layered organization of DBMS - describe all layers
