DDIA Ch 3 - Storage & Retrieval Flashcards

1
Q

There is a big difference between storage engines that are

A

optimized for transactional workloads and those optimized for analytics

2
Q

List the two types of storage engines we are studying first

A

log-structured storage engines and page-oriented storage engines (such as B-trees)

3
Q

For writes, it’s hard to beat the performance of what?

A

appending to a file, because that’s the simplest possible write operation

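A minimal Python sketch of this idea (the file name and comma-separated record format are illustrative, not from the book):

```python
# The simplest possible "database": every write is one append to a file.
def db_set(key, value, path="database.log"):
    with open(path, "a") as f:
        f.write(f"{key},{value}\n")  # assumes keys contain no commas

def db_get(key, path="database.log"):
    # Without an index, a read scans the whole file; the LAST match wins,
    # because later appends supersede earlier values for the same key.
    result = None
    with open(path) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result

db_set("greeting", "hello")
db_set("greeting", "hello again")  # overwrite = another append
print(db_get("greeting"))          # -> "hello again"
```

Writes are cheap appends; reads degrade linearly with file size, which is what the indexes in the following cards address.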
4
Q

our first tradeoff in storage systems

A

well-chosen indexes speed up read queries, but every index slows down writes.

5
Q

what type of index is a useful building block for more complex indexes?

A

indexes for key-value data.

6
Q

Bitcask

A

the default storage engine in Riak

7
Q

Bitcask offers what?

A

high-performance reads and writes for key-value data, subject to the requirement that all the keys fit in the available RAM (since the hash map is kept entirely in memory)

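A toy Python sketch of the idea (class and method names are mine, not Bitcask’s actual API): the hash map lives entirely in RAM and maps every key to a byte offset in an append-only data file, so a read costs at most one disk seek.

```python
import os

class HashIndexStore:
    """Toy Bitcask-style store: append-only data file + in-memory hash map.

    The hash map holds an entry for EVERY key, which is why all keys must
    fit in RAM; values stay on disk and are fetched with a single seek.
    """

    def __init__(self, path="data.log"):
        self.index = {}              # key -> byte offset of the latest record
        self.f = open(path, "a+b")   # append-only writes, random-access reads

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()       # where this record will start
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset     # point the hash map at the newest value

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)          # at most one disk seek...
        line = self.f.readline().decode()
        _, _, value = line.rstrip("\n").partition(",")
        return value
```

Overwriting a key appends a new record and repoints the hash map entry; the old record becomes dead space, which the compaction cards below deal with.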
8
Q

how long does one disk seek take?

A

4 - 15 ms [https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics#:~:text=The%20fastest%20high%2Dend%20server,to%2Dtrack%20and%20full%20stroke.]

9
Q

OLTP stands for

A

Online Transaction Processing

10
Q

OLAP stands for

A

Online Analytical Processing

11
Q

Disk bandwidth

A

the total number of bytes transferred, divided by the total time between the first request for service and the completion of the last transfer.

12
Q

The bottleneck for OLTP vs OLAP

A
OLTP = disk seek time
OLAP = disk bandwidth
13
Q

Size of OLTP systems vs OLAP systems

A
OLTP = GB to TB
OLAP = TB to PB
14
Q

Apache Lucene

A

Apache Lucene is a free and open-source search engine software library, originally written completely in Java by Doug Cutting

15
Q

A storage engine like Bitcask is well suited to what kind of workload?

A

Situations where the value for each key is updated frequently. E.g. the key might be the URL of a cat video, and the value might be the number of times it has been played, incremented every time someone hits the play button.

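A hypothetical play-counter built on top of the HashIndexStore sketch under card 7 (illustrative, not real Bitcask calls); the workload is just repeated overwrites of the same small set of keys:

```python
db = HashIndexStore("plays.log")

def record_play(url):
    # Every play overwrites the value for an existing key:
    # one append to the log plus one hash map update.
    count = int(db.get(url) or 0)
    db.set(url, str(count + 1))

for _ in range(3):
    record_play("https://example.com/cat-video")
print(db.get("https://example.com/cat-video"))  # -> "3"
```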
16
Q

Bitcask reads and writes are high-performance as long as what?

A

All the keys fit in the available RAM. The values can use more space than there is available memory, since they can be loaded from disk with just one disk seek (and if that part of the file is already in the filesystem cache, a read doesn’t require any disk I/O at all).

17
Q

A hash table index has what limitations?

A

1) The hash table must fit in memory, so if you have a very large number of keys, you’re out of luck. In principle you could keep a hash map on disk, but it is difficult to make an on-disk hash map perform well.
2) Range queries are not efficient. E.g. you cannot easily scan over all the keys between kitty00000 and kitty99999; you’d have to look up each key individually in the hash maps (see the sketch below).
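A small Python illustration of limitation 2 (the kitty key range is from the card; everything else is hypothetical):

```python
import bisect

# A sparse toy index over keys kitty00000 .. kitty99999.
index = {f"kitty{i:05d}": i for i in range(0, 100_000, 7)}

# Hash map: keys are unordered, so a "range query" degenerates into
# probing every possible key individually (100,000 lookups, mostly misses).
hits = [f"kitty{i:05d}" for i in range(100_000) if f"kitty{i:05d}" in index]

# A sorted structure finds the range boundaries in O(log n) and then
# scans only the keys that actually fall inside the range.
sorted_keys = sorted(index)
lo = bisect.bisect_left(sorted_keys, "kitty00000")
hi = bisect.bisect_right(sorted_keys, "kitty99999")
assert sorted_keys[lo:hi] == hits
```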

18
Q

For our Bitcask-esque hash table index implementation, how do we avoid eventually running out of disk space? After all, we continually append to a file, so eventually the file will outgrow the disk.

A

1) Break the append-only log into segments, closing a segment file when it reaches a certain size and making subsequent writes to a new segment file.
2) Perform compaction on these segments and merge multiple segments together (a sketch follows below).
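A minimal sketch of step 2 in Python (file format as in the earlier sketches; compaction means keeping only the most recent value for each key):

```python
def compact_and_merge(segment_paths, out_path):
    """Merge several frozen segments into one, keeping only the most
    recent value for each key.

    `segment_paths` must be ordered oldest -> newest, so that values
    from later segments overwrite earlier ones in the dict.
    """
    latest = {}
    for path in segment_paths:
        with open(path) as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition(",")
                latest[key] = value       # newer segments win
    with open(out_path, "w") as f:
        for key, value in latest.items():
            f.write(f"{key},{value}\n")
```

Once the merged segment is written, reads switch over to it and the old segment files can be deleted, reclaiming the space taken by overwritten keys.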

19
Q

How do we perform lookups in our compact-and-merge, segment-based, append-only hash table index?

A

Check the most recent segment’s hash map. If the key is not present, we check the second-most-recent segment, and so on. (The merging process keeps the number of segments small, so lookups don’t need to check many hash maps.)
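A sketch of that lookup in Python, assuming each segment keeps its own in-memory hash map of key -> byte offset as in the Bitcask sketch above:

```python
def find(key, segments):
    """Look up `key` across segments, newest-first.

    `segments` is a list of (hash_map, data_file) pairs ordered oldest ->
    newest; each hash_map maps key -> byte offset, and each data_file is
    an open binary-mode file of b"key,value\n" records.
    """
    for hash_map, data_file in reversed(segments):  # newest segment first
        offset = hash_map.get(key)
        if offset is not None:                  # present in this segment
            data_file.seek(offset)              # one seek to the record
            line = data_file.readline().decode()
            _, _, value = line.rstrip("\n").partition(",")
            return value
    return None  # key is absent from every segment
```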

20
Q

An append-only log for a hash table index seems wasteful at first glance. Why not update the file in place, overwriting the old value with the new one?

A

1) Appending and segment merging are sequential write operations, which are generally much faster than random writes (especially on magnetic spinning-disk hard drives).
2) Concurrency and crash recovery are much simpler if segment files are append-only or immutable. For example, you don’t have to worry about a crash happening while a value was being overwritten, leaving you with a file containing part of the old and part of the new value spliced together.
3) Merging old segments avoids the problem of data files getting fragmented over time (with in-place updates, stale and live values end up interleaved across the file).

21
Q

List five implementation-detail categories relevant to our compaction-and-merge append-only log

A

File format, deleting records, crash recovery, partially written records, and concurrency control