Big Data Lecture 06 Wide Column Stores Flashcards

1
Q

What are the issues with RDBMS (relation database management systems) and how to solve them?

A

<ul><li>We hit the limit of storage of one machine --&gt; scale up!</li><li>It is too slow --&gt; cluster, replicate: scale out! (Difficult, very high maintainance costs.)</li><li>Solution: HBase - distributed database system!</li></ul>

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

How do wide-column stores compare to other methods? What is the only downside?

A

It marks all the columns green in this!<br></br><img></img><br></br>Small size per item, at around 10MB for optimal performance.<br></br><br></br>*Random access means being able to access any data as we wish without reading everything sequentially.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Who was the pioneer of Wide-Column Stores?

A

Google with its ‘Big Table’.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is the motivation for making huge tables? How does it improve on RBDMS? Why don’t we replace RBDMS with it?

A

What is related stays together (stuff is denormalized), this is because the join operations in query time are very expensive.<br></br><br></br>RDBMS is good for updating stuff, which we cannot do with WCS.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How is each row identified?

A

It has a unique row ID, by which the records are sorted.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are column families?

A

Residual of tables that were joined in the database, they are somehow realated, and they are stored together.<br></br><br></br>They must be pre-specified, but we can later add columns within the family.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are the possible types of values in WCS?

A

The values are not typed, they are just byte objects. Some utility functions for reading the data are implemented.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the optimal size of an object in WCS?

A

<= 10 MB per cell, but it can be anything (text, image, webpage, JSON…)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What queries can be executed?

A

<ul><li>get (per rowID),</li><li>put (inserting per rowID, can also overwrite, can be partial, e.g. per column family),</li><li>scan (linearly scan certain rows and columns),</li><li>delete (per rowID).</li></ul>

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What do we refer to as key-value model in general?

A

Any sort of hash map setting where the key points to the value (e.g. using hashtable).<br></br>

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is column oriented storage?

A

Columns of a database are physically stored on the disc together.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What are example of Wide Column Stores databases?

A

Google’s Big Table,<br></br>Apache HBase,<br></br>Cassandra.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

How are tables stored in WCS?

A

In regions: cut of both rows (min inclusive and max exclusive) and columns. 

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What manages files underneath the WCS?

A

HDFS!

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is the architecture of HBase?

A

We have HMaster node (does DDL operations), which rules over processes on RegionServers (does DML operations), these all can be running on the same machine. Each RegionServer is an HDFS client.<br></br><br></br><img></img>

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is DDL and DML?

A

Data Definition Language (e.g. create table with specified columns),<br></br>Data Manipulation Language (e.g. add this row to this column).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

How are regions saved?

A

HMaster assigns them to RegionServers.<br></br><img></img><br></br>If they are too big, they will be split, nodes are also rebalanced if they are too full, and if node fails, HMaster creates a new one.

18
Q

How are the data persistently stored?

A

In an HFile, which is a sorted list of key-value pairs, ordered by:<br></br><ul><li>row,</li><li>column,</li><li>version (in reverse).</li></ul><div><img></img><br></br></div><div><img></img><br></br></div>

19
Q

How are different users of WCS allowed to modify the same data? Relation to CAP theorem?

A

They are not, data is locked until the users stops modifying.<br></br><br></br>The data has linear versioning, there are no DAGs.<br></br><br></br>The system is not fully available always to everyone, but it is consistent.

20
Q

How are nodes managed in WCS?

A

Using ZooKeeper, distriubuted system that manages all the heartbeats, and locks on files (and race conditions).

21
Q

What makes WCS data access so efficient?

A

Each RegionServer is a HDFS client, it has a replica of its own data, so it does not have to request it, it can get it fast in the local system.<br></br><br></br>If RegionServer is assigned responsibility, but it does not have the data, it is fine as over a longer time, the data will be replicated onto this node, which will speed up the system again.

22
Q

How is each value of the table stored in WCS, what are its parts?

A

It is yet another key value model!<br></br><img></img><br></br>Each record has its own key, and value, those are first (for memory efficiency) preceeded by key length (32 bits) and value length (32 bits).

23
Q

How does Gamma code work?

A

The number of 1s terminated by a 0 tell us how many bits to read, and then pre-pend 1 (all non-zero numbers start with 1 in binary).<br></br><img></img> 

24
Q

How are data accessed in an HFile?

A

Using this beautiful long key, which stores all the information about the record (row, column family, column, versioning time-stamp and deleting marker).<br></br><img></img><br></br>Note, there is no column-qualifier lenght, because the whole key has a fixed length, so it can be inferred.<br></br><br></br>This key refers to the records.<br></br><img></img>

25
Q

What are HBlocks?

A

The number of KeyValues read at once, it has a size of 64kB.

26
Q

What is a KeyValue?

A

Smallest unit of physical storage, indexed by row, column, version and sharded by regions and column families for storage.

27
Q

How do we look up a key in HFile?

A

We look at the index, which tells us approximatelly where the key will be, and then we search there.<br></br><img></img>

28
Q

What are the levels of physical storage in WCS?

A

<img></img>

29
Q

How is new data added to WCS?

A

New data is not written, but saved in MemStore. It it store in a tree, nicely sorted (insertion O(log N)), so it can be quickly written onto a disc in linear time.

30
Q

How do we make the MemStore robust to failure?

A

There is a write-ahead log, called HLog, (we save everything we write before we process it), so we can replay it in case of failure.<br></br><br></br>These are stored in HDFS, one region per server.

31
Q

When is MemStore flushed?

A

<ul><li>Maximum size for one store is reached,</li><li>overall memstore size is reached,</li><li>the write-ahead log is full.</li></ul>

32
Q

What structures are used in RBDMS and WCS for optimization of search?

A

<ul><li>RDBMS: B+-trees - they minimize seek-time bound by indexing the data.</li><li>WCS: Log-Structrure-Merge trees - optimize by size for throughput.</li></ul>

33
Q

How is data compacted using LSM tree?

A

Like the 2048 game!<br></br><br></br>The MemStore dumps a file, then it always merges whenever we have the file of the same size, so it goes up and up and up.<br></br><br></br>These files are stored at different level, either memory, disc or on further nodes.

34
Q

How does client communicate with the WCS system?

A

<ul><li>To create/update/delete a table: talk to HMaster,</li><li>to know which server to talk to: talk to Meta table (cache that locally),</li><li>to get the data, talk to RegionServer.</li></ul>

35
Q

What APIs are supported?

A

REST and Java.

36
Q

How is WCS optimized? (3)

A

<ol><li>Cache - pre-save HBlocks using standard caching criteria (do not use when reading all the data from a certain block, use batch processing instead, also do not use if doing random access),</li><li>key ranges (know for certain that something is inside the file or not),</li><li>Bloom filters, tell you with certainty that something is not in there, but only maybe if it is in there.</li></ol>

37
Q

How does WCS know what is where?

A

It has a meta table, called hbase:meta, separate from Hmaster, that stored in HDFS what is in which table. If Hmaster is lost, it is just relected.

38
Q

What does lazy splitting mean in HDFS?

A

We update the meta table, but we only split the HBlocks actually in memory later on, we keep them together, but we read them as if they were separate.

39
Q

How does HFile compaction synergize with locality?

A

Small blocks might get distributed in HDFS, but whenever we flush, we bring the data back to the parent node, and we can happily keep shortcircuiting.

40
Q

What is Spanner?

A

WCS but with bigger spans,<br></br><ul><li>uses tablets instead of regions (discontinuous ranges row ranges),</li><li>it uses universe master to run other smaller HBase subsystems.</li></ul>

41
Q

HBase runs on top of HDFS, how so it still fast?

A
  1. It uses MemStore and Cache to load frequently used values fast,<br></br>2. it shortcircuits DataNodes with HDFS.