Data Storage Flashcards

Question 1

Q

Alternative File Organizations

Answer

A

Heap files: Best when typical access is a full file scan
Sorted files: Best for retrieval in an order, or for retrieving a ‘range’
Log-structured files: Best for very fast insertions/deletions/updates

Question 2

Q

Heap (Unordered) Files

Answer

A

Simplest file structure
-> contains records in no particular order
-> Need to be able to scan, search based on rid
As file grows and shrinks, disk pages are allocated and de-allocated
-> Need to manage free space

Question 3

Q

Heap File Implemented Using Lists

Answer

A

<Heap file name, header page id> stored somewhere
Each page contains 2 pointers plus data (pointer to full pages + pointer to pages with free space)
Manage free pages using free list
-> Problem: What if most pages have some space?

Question 4

Q

Heap File Using a Page Directory

Answer

A

The directory is a collection of pages
-> linked list implementation is just one alternative
The entry for a page can include the number of free bytes on the page
-> much smaller than linked list of all HF pages

Question 5

Q

Log-structured files

Answer

A

Instead of storing tuples in pages, the DBMS only appends log records. Blocks are never modified.
For workloads with many small files, a traditional file system needs many small synchronous random writes, whereas a log-structured file system does few large asynchronous sequential transfers.
The system appends log records to the files of how the database was modified:
-> Inserts: Store the entire tuple (types of log record)
-> Deletes: Mark tuple as deleted
-> Updates: Store delta of just the attributes that were modified
DBMS needs to address two issues
-> How to retrieve tuples from logs
-> How to manage disk space with ever-growing logs

Question 6

Q

Reading from log-structured files

Answer

A

DBMS scans log backwards, and “recreates” the tuple
Build indexes to allow jumps in the log
Periodically compact the log

Question 7

Q

Pros/Cons of log-structured files

Answer

A

Pros:
-> High performance for inserts, deletes and updates
-> Ultra-fast recovery from failures
-> Good for SSD as writes are naturally leveled
Cons:
-> Unpredictable performance in sequential reads
-> Need a lot of free space
-> Affects garbage collection (need for compaction)
-> Data can be lost if written but not checkpointed

Question 8

Q

N-ary Storage Model

Answer

A

Page = collection of slots
Each slot stores one record
-> Record identifier: <page_id, slot_number>
-> Option 2: <uniq> -> <page_id, slot_number></uniq>
Page format should support fast searching, inserting, deleting
Page format depends on record format
-> fixed-length
-> variable-length

Question 9

Q

Record Formats: Fixed-Length

Answer

A

Schema is stored in system catalog
-> Number of fields is fixed for all records of a table
-> Domain is fixed for all records of a table
Each field has fixed length
Finding i-th field is done via arithmetic

Question 10

Q

Record Format: Variable-Length

Answer

A

Array of field offsets is typically superior
-> Direct access to fields
-> Clean way of handling NULL values
Maintain slot directory with <record offset, record length> pairs
-> records can move on page without changing rid
-> useful for freely moving fixed-length records

Question 11

Q

Variable-Length Records: Issues

Answer

A

If a field grows and no longer fits?
-> shift all subsequent fields
If a record no longer fits in page?
-> Move a record to another page after modification
What if record size > page size?
-> Limit allowed record size

Question 12

Q

Column Store (DSM) Pros/Cons

Answer

A

Pros:
-> Saves IO by bringing only the relevant attributes
-> (Very) memory-compressing columns is typically easier
Cons:
-> Writes more expensive
-> Need tuple stitching at some point (tuple reconstruction)
-> Indexed selection with low selectivities
-> Queries that require all or most of the attributes

Question 13

Q

Compression

Answer

A

Lossless compression
IO reduction implies less CPU wait time
-> Introduces additional CPU load on otherwise idle CPU
Run-length encoding
Bit-vector encoding
-> Useful when we have categorical data & useful when a few distinct values
-> One bit vector for each distinct value
-> Vector-length = # distinct elements
Dictionary encoding
-> replace long values with integers
Frequency partitioning
-> Reorganize each column to reduce entropy at each page

Question 14

Q

Partition Attributes Across (PAX)

Answer

A

Decompose a slotted-page internally in mini-pages per attribute
-> cache-friendly
-> compatible with slotted-pages
-> Retain NSM I/O pattern
-> Brings only relevant attributes to cache

Data Storage Flashcards

(14 cards)