Chapter 3 - Storage & Retrieval Flashcards

1
Q

What is a database index?

A

A database index is a data structure used by the database engine or order to improve the speed of retrieval operations.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the disadvantages of database indexing?

A

Databse indexing comes at the cost of additional writes and storage space needed to maintain the data structure.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

How does hash table indexing work for database indexing?

A

Every key stored in the database is mapped to a byte offset in the datafile. This key-bytePosition mapping is stored in memory. This prevents the need of searching through the datafile to find the key and it’s data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

How are segmentation and compaction used in append-only datafiles?

A

Old sections of the datafile are split and compacted, deleting duplicate keys and only keeping the most recent update for each key. For hash table indexing, each segment requires it’s own hash table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is a tombstone record?

A

A delete record for a key in a datafile, when segments are merged the tombstones tells the merging process to discard previous versions of the key

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are the advantages of append-only data files compared to updating records in place?

A
  • Appending and segment merging are sequential write operations which are generally faster than random writes on hard drives
  • Concurrency/crash recovery is easier to handle (why?)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are the disadvantages of hash table indexing?

A
  • The hash table must be stored in memory to be efficient but for large number of keys this may not be possible
  • Storing on disk requires a large amount of random I/O writes which are slow of magnetic disk hard drives
  • Range queries are not efficient, have to look up each key individually in the hashmaps
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

How do SSTables work?

A

Sorted String Table - how the DATA in the data file is stored. Sequences of key-value pairs in the datafile are sorted by key. A hashtable is used in order to store the byte position of some records. When we query for a record, we find the byte position of two keys that the key-value pair must lie between and search through that section of the datafile.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How does the LSM-tree algorithm work using SSTables?

A
  • Writes are added to an in memory balanced tree data structure (memtable)
  • When the memtable gets too big, it is written to the disk as an SSTable file
  • Writes continue to a new memtable instance
  • Reads try to find the key-value pair from memtable, then most recent segment, etc.
  • From time to time, a compaction is performaned on in-disk SST Files
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

How do LSM-tree indexing avoid the memtable being lost crashes?

A

A separate log is written on disk with every write to the memtable. Once the memtable is written to disk the log can be discarded.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What strategies exist to determine the order and timing of SST compaction?

A

Size tiered compaction: Newer and smaller SSTable are merged into older and larger SSTables.
Level compaction: SSTables are divided into levels, no overlapping SSTables are on the same level so a retrieval for a key only needs to look at one SSTable per level

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

How do B-Trees work?

A

How the DATA in the datafile is stored. B-trees break the database down into fixed blocks or pages. Pages have keys and reference to other pages except leaf pages where the key value pairs are stored. One page is designated as the root of the tree which is where the retrieval process starts from and may be in memory.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is the branching factor of a B-Tree?

A

The number of child pages per node.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

How does a B-Tree remain balanced?

A

If there isn’t enough free space on a page to accomodate a new key, it is split into two half full pages and the parent is update to account for the new subdivision of pages

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

How are B-Trees made reliable/fault tolerant?

A

With a write-ahead log. An append-only file to which every B-tree modification is written before it is added to the tree itself.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

How do B-Trees deal with concurrent access?

A

With latches, lightweight locks, to protect the trees in-memory data structures from other threads during the write process

17
Q

Compare read and write speeds for LSM-Trees and B-Trees.

A

As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees
are thought to be faster for reads. Why? (in other cards)

18
Q

Why are LSM-trees typically faster than B-trees for writes?

A

LSM-trees have lower write amplification (writes causing more writes), LSM-trees put the data in the memtable and may only perform more writes if the memtable is written to disk on that operation and compaction occurs.

B-trees must write every piece of data twice: once to the write-ahead log and once to the tree itself, rewriting an entire page and possibly the parent pages if a split is required.

19
Q

What are some advantages of B-Trees over LSM-Trees?

A

LSM-trees may need to search over the memtable and SSTables at various points in the compaction process to find a key. B-Trees simply need to traverse the B-Tree requiring at most log(n) reads.

Compaction may not be able to keep up with the write throughput which means that the SSTables would grow indefinitley.
Requests may need to wait for compaction affecting higher percentile response times.