Data Storage Flashcards
1
Q
Alternative File Organizations
A
- Heap files: Best when typical access is a full file scan
- Sorted files: Best for retrieval in an order, or for retrieving a ‘range’
- Log-structured files: Best for very fast insertions/deletions/updates
2
Q
Heap (Unordered) Files
A
- Simplest file structure
-> contains records in no particular order
-> Need to be able to scan, search based on rid - As file grows and shrinks, disk pages are allocated and de-allocated
-> Need to manage free space
3
Q
Heap File Implemented Using Lists
A
- <Heap file name, header page id> stored somewhere
- Each page contains 2 pointers plus data (pointer to full pages + pointer to pages with free space)
- Manage free pages using free list
-> Problem: What if most pages have some space?
4
Q
Heap File Using a Page Directory
A
- The directory is a collection of pages
-> linked list implementation is just one alternative - The entry for a page can include the number of free bytes on the page
-> much smaller than linked list of all HF pages
5
Q
Log-structured files
A
- Instead of storing tuples in pages, the DBMS only appends log records. Blocks are never modified.
- For workloads with many small files, a traditional file system needs many small synchronous random writes, whereas a log-structured file system does few large asynchronous sequential transfers.
- The system appends log records to the files of how the database was modified:
-> Inserts: Store the entire tuple (types of log record)
-> Deletes: Mark tuple as deleted
-> Updates: Store delta of just the attributes that were modified - DBMS needs to address two issues
-> How to retrieve tuples from logs
-> How to manage disk space with ever-growing logs
6
Q
Reading from log-structured files
A
- DBMS scans log backwards, and “recreates” the tuple
- Build indexes to allow jumps in the log
- Periodically compact the log
7
Q
Pros/Cons of log-structured files
A
- Pros:
-> High performance for inserts, deletes and updates
-> Ultra-fast recovery from failures
-> Good for SSD as writes are naturally leveled - Cons:
-> Unpredictable performance in sequential reads
-> Need a lot of free space
-> Affects garbage collection (need for compaction)
-> Data can be lost if written but not checkpointed
8
Q
N-ary Storage Model
A
- Page = collection of slots
- Each slot stores one record
-> Record identifier: <page_id, slot_number>
-> Option 2: <uniq> -> <page_id, slot_number></uniq> - Page format should support fast searching, inserting, deleting
- Page format depends on record format
-> fixed-length
-> variable-length
9
Q
Record Formats: Fixed-Length
A
- Schema is stored in system catalog
-> Number of fields is fixed for all records of a table
-> Domain is fixed for all records of a table - Each field has fixed length
- Finding i-th field is done via arithmetic
10
Q
Record Format: Variable-Length
A
- Array of field offsets is typically superior
-> Direct access to fields
-> Clean way of handling NULL values - Maintain slot directory with <record offset, record length> pairs
-> records can move on page without changing rid
-> useful for freely moving fixed-length records
11
Q
Variable-Length Records: Issues
A
- If a field grows and no longer fits?
-> shift all subsequent fields - If a record no longer fits in page?
-> Move a record to another page after modification - What if record size > page size?
-> Limit allowed record size
12
Q
Column Store (DSM) Pros/Cons
A
- Pros:
-> Saves IO by bringing only the relevant attributes
-> (Very) memory-compressing columns is typically easier - Cons:
-> Writes more expensive
-> Need tuple stitching at some point (tuple reconstruction)
-> Indexed selection with low selectivities
-> Queries that require all or most of the attributes
13
Q
Compression
A
- Lossless compression
- IO reduction implies less CPU wait time
-> Introduces additional CPU load on otherwise idle CPU - Run-length encoding
- Bit-vector encoding
-> Useful when we have categorical data & useful when a few distinct values
-> One bit vector for each distinct value
-> Vector-length = # distinct elements - Dictionary encoding
-> replace long values with integers - Frequency partitioning
-> Reorganize each column to reduce entropy at each page
14
Q
Partition Attributes Across (PAX)
A
- Decompose a slotted-page internally in mini-pages per attribute
-> cache-friendly
-> compatible with slotted-pages
-> Retain NSM I/O pattern
-> Brings only relevant attributes to cache