DDIA Ch 3 - Storage & Retrieval Flashcards

1
Q

There is a big difference between storage engines that are

A

optimized for transactional workloads and those optimized for analytics

2
Q

List the two types of storage engines we are studying first

A

log-structured storage engines and page-oriented storage engines (such as B-trees)

3
Q

For writes, it’s hard to beat the performance of what?

A

appending to a file, because that’s the simplest possible write operation

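A minimal Python sketch of this idea (the file name and comma-separated record format are illustrative, not from the book):

```python
# The simplest possible "database": every write is one append to a file.
def db_set(key, value, path="database.log"):
    with open(path, "a") as f:
        f.write(f"{key},{value}\n")  # assumes keys contain no commas

def db_get(key, path="database.log"):
    # Without an index, a read scans the whole file; the LAST match wins,
    # because later appends supersede earlier values for the same key.
    result = None
    with open(path) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result

db_set("greeting", "hello")
db_set("greeting", "hello again")  # overwrite = another append
print(db_get("greeting"))          # -> "hello again"
```

Writes are cheap appends; reads degrade linearly with file size, which is what the indexes in the following cards address.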
4
Q

our first tradeoff in storage systems

A

well-chosen indexes speed up read queries, but every index slows down writes.

5
Q

what type of index is a useful building block for more complex indexes?

A

indexes for key-value data.

6
Q

Bitcask

A

the default storage engine in Riak

7
Q

Bitcask offers what?

A

high-performance reads and writes for key-value data, subject to the requirement that all the keys fit in the available RAM (since the hash map is kept entirely in memory)

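A toy Python sketch of the idea (class and method names are mine, not Bitcask’s actual API): the hash map lives entirely in RAM and maps every key to a byte offset in an append-only data file, so a read costs at most one disk seek.

```python
import os

class HashIndexStore:
    """Toy Bitcask-style store: append-only data file + in-memory hash map.

    The hash map holds an entry for EVERY key, which is why all keys must
    fit in RAM; values stay on disk and are fetched with a single seek.
    """

    def __init__(self, path="data.log"):
        self.index = {}              # key -> byte offset of the latest record
        self.f = open(path, "a+b")   # append-only writes, random-access reads

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()       # where this record will start
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset     # point the hash map at the newest value

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)          # at most one disk seek...
        line = self.f.readline().decode()
        _, _, value = line.rstrip("\n").partition(",")
        return value
```

Overwriting a key appends a new record and repoints the hash map entry; the old record becomes dead space, which the compaction cards below deal with.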
8
Q

how long does one disk seek take?

A

4 - 15 ms [https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics#:~:text=The%20fastest%20high%2Dend%20server,to%2Dtrack%20and%20full%20stroke.]

9
Q

OLTP stands for

A

Online Transaction Processing

10
Q

OLAP stands for

A

Online Analytical Processing

11
Q

Disk bandwidth

A

the total number of bytes transferred, divided by the total time between the first request for service and the completion of the last transfer.

12
Q

The bottleneck for OLTP vs OLAP

A
OLTP = disk seek time
OLAP = disk bandwidth
13
Q

Size of OLTP systems vs OLAP systems

A
OLTP = GB to TB
OLAP = TB to PB
14
Q

Apache Lucene

A

Apache Lucene is a free and open-source search engine software library, originally written completely in Java by Doug Cutting

15
Q

A storage engine like Bitcask is well suited to what kind of workload?

A

Situations where the value for each key is updated frequently. E.g. the key might be the URL of a cat video, and the value might be the number of times it has been played, incremented every time someone hits the play button.

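A hypothetical play-counter built on top of the HashIndexStore sketch under card 7 (illustrative, not real Bitcask calls); the workload is just repeated overwrites of the same small set of keys:

```python
db = HashIndexStore("plays.log")

def record_play(url):
    # Every play overwrites the value for an existing key:
    # one append to the log plus one hash map update.
    count = int(db.get(url) or 0)
    db.set(url, str(count + 1))

for _ in range(3):
    record_play("https://example.com/cat-video")
print(db.get("https://example.com/cat-video"))  # -> "3"
```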
16
Q

Bitcask reads and writes are high-performance as long as what?

A

All the keys fit in the available RAM. The values can use more space than there is available memory, since they can be loaded from disk with just one disk seek (and if that part of the file is already in the filesystem cache, a read doesn’t require any disk I/O at all).

17
Q

A hash table index has what limitations?

A

1) The hash table must fit in memory, so if you have a very large number of keys, you’re out of luck. In principle you could keep a hash map on disk, but it is difficult to make an on-disk hash map perform well.
2) Range queries are not efficient. E.g. you cannot easily scan over all the keys between kitty00000 and kitty99999; you’d have to look up each key individually in the hash maps (see the sketch below).
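A small Python illustration of limitation 2 (the kitty key range is from the card; everything else is hypothetical):

```python
import bisect

# A sparse toy index over keys kitty00000 .. kitty99999.
index = {f"kitty{i:05d}": i for i in range(0, 100_000, 7)}

# Hash map: keys are unordered, so a "range query" degenerates into
# probing every possible key individually (100,000 lookups, mostly misses).
hits = [f"kitty{i:05d}" for i in range(100_000) if f"kitty{i:05d}" in index]

# A sorted structure finds the range boundaries in O(log n) and then
# scans only the keys that actually fall inside the range.
sorted_keys = sorted(index)
lo = bisect.bisect_left(sorted_keys, "kitty00000")
hi = bisect.bisect_right(sorted_keys, "kitty99999")
assert sorted_keys[lo:hi] == hits
```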

18
Q

For our Bitcask-esque hash table index implementation, how do we avoid eventually running out of disk space? After all, we continually append to a file, so eventually the file will outgrow the disk.

A

1) Break the append-only log into segments, closing a segment file when it reaches a certain size and making subsequent writes to a new segment file.
2) Perform compaction on these segments and merge multiple segments together (a sketch follows below).
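A minimal sketch of step 2 in Python (file format as in the earlier sketches; compaction means keeping only the most recent value for each key):

```python
def compact_and_merge(segment_paths, out_path):
    """Merge several frozen segments into one, keeping only the most
    recent value for each key.

    `segment_paths` must be ordered oldest -> newest, so that values
    from later segments overwrite earlier ones in the dict.
    """
    latest = {}
    for path in segment_paths:
        with open(path) as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition(",")
                latest[key] = value       # newer segments win
    with open(out_path, "w") as f:
        for key, value in latest.items():
            f.write(f"{key},{value}\n")
```

Once the merged segment is written, reads switch over to it and the old segment files can be deleted, reclaiming the space taken by overwritten keys.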

19
Q

How do we perform lookups in our compact-and-merge, segment-based, append-only hash table index?

A

Check the most recent segment’s hash map. If the key is not present, we check the second-most-recent segment, and so on. (The merging process keeps the number of segments small, so lookups don’t need to check many hash maps.)
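A sketch of that lookup in Python, assuming each segment keeps its own in-memory hash map of key -> byte offset as in the Bitcask sketch above:

```python
def find(key, segments):
    """Look up `key` across segments, newest-first.

    `segments` is a list of (hash_map, data_file) pairs ordered oldest ->
    newest; each hash_map maps key -> byte offset, and each data_file is
    an open binary-mode file of b"key,value\n" records.
    """
    for hash_map, data_file in reversed(segments):  # newest segment first
        offset = hash_map.get(key)
        if offset is not None:                  # present in this segment
            data_file.seek(offset)              # one seek to the record
            line = data_file.readline().decode()
            _, _, value = line.rstrip("\n").partition(",")
            return value
    return None  # key is absent from every segment
```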

20
Q

An append-only log for a hash table index seems wasteful at first glance. Why not update the file in place, overwriting the old value with the new one?

A

1) Appending and segment merging are sequential write operations, which are generally much faster than random writes (especially on magnetic spinning-disk hard drives).
2) Concurrency and crash recovery are much simpler if segment files are append-only or immutable. For example, you don’t have to worry about a crash happening while a value was being overwritten, leaving you with a file containing part of the old and part of the new value spliced together.
3) Merging old segments avoids the problem of data files getting fragmented over time (with in-place updates, stale and live values end up interleaved across the file).

21
Q

List five implementation-detail categories relevant to our compaction-and-merge append-only log

A

File format, deleting records, crash recovery, partially written records, and concurrency control