Data Storage Flashcards

1
Q

Alternative File Organizations

A
  • Heap files: Best when typical access is a full file scan
  • Sorted files: Best for retrieval in an order, or for retrieving a ‘range’
  • Log-structured files: Best for very fast insertions/deletions/updates
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Heap (Unordered) Files

A
  • Simplest file structure
    -> contains records in no particular order
    -> Need to be able to scan, search based on rid
  • As file grows and shrinks, disk pages are allocated and de-allocated
    -> Need to manage free space
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Heap File Implemented Using Lists

A
  • <Heap file name, header page id> stored somewhere
  • Each page contains 2 pointers plus data (pointer to full pages + pointer to pages with free space)
  • Manage free pages using free list
    -> Problem: What if most pages have some space?
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Heap File Using a Page Directory

A
  • The directory is a collection of pages
    -> linked list implementation is just one alternative
  • The entry for a page can include the number of free bytes on the page
    -> much smaller than linked list of all HF pages
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Log-structured files

A
  • Instead of storing tuples in pages, the DBMS only appends log records. Blocks are never modified.
  • For workloads with many small files, a traditional file system needs many small synchronous random writes, whereas a log-structured file system does few large asynchronous sequential transfers.
  • The system appends log records to the files of how the database was modified:
    -> Inserts: Store the entire tuple (types of log record)
    -> Deletes: Mark tuple as deleted
    -> Updates: Store delta of just the attributes that were modified
  • DBMS needs to address two issues
    -> How to retrieve tuples from logs
    -> How to manage disk space with ever-growing logs
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Reading from log-structured files

A
  • DBMS scans log backwards, and “recreates” the tuple
  • Build indexes to allow jumps in the log
  • Periodically compact the log
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Pros/Cons of log-structured files

A
  • Pros:
    -> High performance for inserts, deletes and updates
    -> Ultra-fast recovery from failures
    -> Good for SSD as writes are naturally leveled
  • Cons:
    -> Unpredictable performance in sequential reads
    -> Need a lot of free space
    -> Affects garbage collection (need for compaction)
    -> Data can be lost if written but not checkpointed
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

N-ary Storage Model

A
  • Page = collection of slots
  • Each slot stores one record
    -> Record identifier: <page_id, slot_number>
    -> Option 2: <uniq> -> <page_id, slot_number></uniq>
  • Page format should support fast searching, inserting, deleting
  • Page format depends on record format
    -> fixed-length
    -> variable-length
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Record Formats: Fixed-Length

A
  • Schema is stored in system catalog
    -> Number of fields is fixed for all records of a table
    -> Domain is fixed for all records of a table
  • Each field has fixed length
  • Finding i-th field is done via arithmetic
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Record Format: Variable-Length

A
  • Array of field offsets is typically superior
    -> Direct access to fields
    -> Clean way of handling NULL values
  • Maintain slot directory with <record offset, record length> pairs
    -> records can move on page without changing rid
    -> useful for freely moving fixed-length records
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Variable-Length Records: Issues

A
  • If a field grows and no longer fits?
    -> shift all subsequent fields
  • If a record no longer fits in page?
    -> Move a record to another page after modification
  • What if record size > page size?
    -> Limit allowed record size
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Column Store (DSM) Pros/Cons

A
  • Pros:
    -> Saves IO by bringing only the relevant attributes
    -> (Very) memory-compressing columns is typically easier
  • Cons:
    -> Writes more expensive
    -> Need tuple stitching at some point (tuple reconstruction)
    -> Indexed selection with low selectivities
    -> Queries that require all or most of the attributes
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Compression

A
  • Lossless compression
  • IO reduction implies less CPU wait time
    -> Introduces additional CPU load on otherwise idle CPU
  • Run-length encoding
  • Bit-vector encoding
    -> Useful when we have categorical data & useful when a few distinct values
    -> One bit vector for each distinct value
    -> Vector-length = # distinct elements
  • Dictionary encoding
    -> replace long values with integers
  • Frequency partitioning
    -> Reorganize each column to reduce entropy at each page
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Partition Attributes Across (PAX)

A
  • Decompose a slotted-page internally in mini-pages per attribute
    -> cache-friendly
    -> compatible with slotted-pages
    -> Retain NSM I/O pattern
    -> Brings only relevant attributes to cache
How well did you know this?
1
Not at all
2
3
4
5
Perfectly