HBase Architecture Flashcards

1
Q

What are the four basic events that can potentially destroy data locality?

A

1 - The HBase balancer decides to move a region to balance data sizes across RegionServers.
2 - A RegionServer dies. All its regions need to be relocated to another server.
3 - A table is disabled and re-enabled.
4 - A cluster is stopped and restarted.

2
Q

What metric measures HFile data locality?

A

HFile locality index = ( Total number of HDFS blocks that can be retrieved locally by the region server ) / ( Total number of HDFS blocks for all HFiles )
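
The ratio above can be computed as in this minimal sketch. The function name and the zero-block fallback are assumptions made for illustration; they are not part of HBase.

```python
def hfile_locality_index(local_blocks: int, total_blocks: int) -> float:
    """Fraction of HFile HDFS blocks the RegionServer can read locally."""
    if total_blocks == 0:
        return 0.0  # assumption: report 0.0 when there are no blocks at all
    return local_blocks / total_blocks


print(hfile_locality_index(750, 1000))  # 0.75
```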

3
Q

What is the hierarchy of objects on a RegionServer?

A

Table (HBase table)
Region (Regions for the table)
Store (Store per ColumnFamily for each Region for the table)
MemStore (MemStore for each Store for each Region for the table)
StoreFile (StoreFiles for each Store for each Region for the table)
Block (Blocks within a StoreFile within a Store for each Region for the table)

4
Q

How are regions assigned to RegionServers when HBase starts?

A

1 - The Master invokes the AssignmentManager upon startup.
2 - The AssignmentManager looks at the existing region assignments in META.
3 - If the region assignment is still valid (i.e., if the RegionServer is still online) then the assignment is kept.
4 - If the assignment is invalid, then the LoadBalancerFactory is invoked to assign the region. The DefaultLoadBalancer will randomly assign the region to a RegionServer.
5 - META is updated with the RegionServer assignment (if needed) and the RegionServer start codes (start time of the RegionServer process) upon region opening by the RegionServer.
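
Steps 2 to 4 can be sketched as follows. This is a simplified simulation, not HBase code: the function name and the dict standing in for META are hypothetical.

```python
import random


def assign_regions(meta, online_servers, rng=random):
    """Return a region -> server map following the startup steps above.

    meta: dict of region name -> server recorded in META (hypothetical shape).
    online_servers: list of RegionServers that are currently online.
    """
    online = set(online_servers)
    assignments = {}
    for region, server in meta.items():
        if server in online:
            # Step 3: the recorded assignment is still valid, so keep it.
            assignments[region] = server
        else:
            # Step 4: the assignment is invalid, so pick a server at random.
            assignments[region] = rng.choice(online_servers)
    return assignments


meta = {"r1": "rs1", "r2": "rs2", "r3": "rs3"}
result = assign_regions(meta, ["rs1", "rs3"])
print(result["r1"], result["r3"])  # rs1 rs3 (r2 lands on rs1 or rs3)
```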

5
Q

How are regions assigned to RegionServers when a RegionServer fails?

A

1 - The regions immediately become unavailable because the RegionServer is down.
2 - The Master will detect that the RegionServer has failed.
3 - The region assignments will be considered invalid and the regions will be reassigned, just as in the startup sequence.

6
Q

What is a Store?

A

A Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region.

7
Q

What is the MemStore?

A

The MemStore holds in-memory modifications to the Store. Modifications are KeyValues. When asked to flush, the current MemStore is moved to a snapshot and cleared. HBase continues to serve edits out of the new MemStore and the backing snapshot until the flusher reports that the flush succeeded. At this point the snapshot is let go.
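
The flush handoff can be sketched with two plain dicts. The class and method names below are hypothetical and only illustrate the idea; they are not HBase's API.

```python
class MemStoreSketch:
    def __init__(self):
        self.active = {}    # receives new edits
        self.snapshot = {}  # edits currently being flushed

    def put(self, key, value):
        self.active[key] = value

    def get(self, key):
        # Serve from the new MemStore first, then from the backing snapshot.
        if key in self.active:
            return self.active[key]
        return self.snapshot.get(key)

    def start_flush(self):
        # Move the current contents to the snapshot and clear the active map.
        self.snapshot, self.active = self.active, {}
        return self.snapshot

    def flush_succeeded(self):
        # The flusher reported success, so the snapshot is let go.
        self.snapshot = {}


store = MemStoreSketch()
store.put("row1", "v1")
store.start_flush()
store.put("row2", "v2")
print(store.get("row1"), store.get("row2"))  # v1 v2
store.flush_succeeded()
print(store.get("row1"))  # None (the edit now lives in a StoreFile)
```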

8
Q

What is a StoreFile (HFile)?

A

StoreFiles are where your data lives: they are the immutable, sorted files (HFiles) written to HDFS when a MemStore is flushed.

9
Q

What are Blocks?

A

StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis. Compression happens at the block level within StoreFiles.

10
Q

What is contained in ROOT?

A

-ROOT- keeps track of where the .META. table is.

Key format: .META. region key (.META.,,1)

Contains:

info:regioninfo (serialized HRegionInfo instance of .META.)
info:server (server:port of the RegionServer holding .META.)
info:serverstartcode (start-time of the RegionServer process holding .META.)

11
Q

What is contained in META?

A

The .META. table keeps a list of all regions in the system.

Key format: Region key of the format ([table],[region start key],[region id])
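
A minimal sketch of building a key in the format above. The helper name is hypothetical, and it follows only the three-part format stated on this card.

```python
def meta_row_key(table: str, start_key: str, region_id: int) -> str:
    """Join table, region start key, and region id with commas."""
    return f"{table},{start_key},{region_id}"


# The first region of a table has an empty start key.
print(meta_row_key("Blogs", "", 1))  # Blogs,,1
```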

12
Q

What is Block Cache?

A

The Block Cache is an LRU (least recently used) cache that contains three levels of block priority to allow for scan-resistance and in-memory ColumnFamilies:

  • Single access priority
  • Multi access priority
  • In-memory access priority
13
Q

Besides your data what else is stored in the Block Cache?

A
  • Catalog tables: -ROOT- and .META.
  • HFile indexes
  • Keys
  • Bloom Filters
14
Q

In what order are an HBase table's contents sorted?

A

Row key, column family, column qualifier, and timestamp (timestamps in descending order, so the newest version sorts first).
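
A small sketch of this ordering using tuples. It assumes ASCII keys, so Python's string comparison matches HBase's byte-wise comparison, and it negates the timestamp because HBase sorts newer versions first. The sample cells are made up.

```python
# Each cell is (row key, column family, column qualifier, timestamp).
cells = [
    ("row2", "cf", "a", 100),
    ("row1", "cf", "b", 100),
    ("row1", "cf", "a", 100),
    ("row1", "cf", "a", 200),
]

# Ascending on row, family, and qualifier; descending on timestamp.
ordered = sorted(cells, key=lambda c: (c[0], c[1], c[2], -c[3]))

for cell in ordered:
    print(cell)
```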

15
Q

Does disabling block caching improve scan performance when you perform a full table scan of your data?

A

Yes. When you disable block caching, you free up memory for other operations. With a full table scan, you cannot take advantage of block caching anyway because your entire table won’t fit into the cache.

16
Q

What is the hbase shell syntax for creating a table called Blogs with a column family called Content?

A

create 'Blogs', 'Content'

17
Q

What is in a store?

A

A MemStore and zero or more StoreFiles (HFiles). The BlockCache is not part of a Store; it is shared at the RegionServer level.

18
Q

You want to do mostly table scans on your data. In order to improve performance you increase your block size. Why does increasing block size improve scan performance?

A

Increasing the block size means that fewer block indexes need to be read from disk, thereby increasing scan performance.

19
Q

What is the default block size?

A

64 KB

20
Q

What unit of measure is TTL (time to live) saved in?

A

Seconds

21
Q

The cells in a given row have versions that range from 150 to 575. You execute a delete on the row and specify the value 650 for the version. What is the outcome?

A

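The entire row is deleted. A delete with a timestamp removes all versions at or before that timestamp, and 650 is later than every version in the row.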
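
The outcome can be checked with a tiny simulation. It assumes a row delete at a given timestamp masks every version at or before that timestamp; the function name is hypothetical.

```python
def surviving_versions(versions, delete_ts):
    """Return the version timestamps still visible after a delete at delete_ts."""
    return [ts for ts in versions if ts > delete_ts]


print(surviving_versions([150, 300, 575], 650))  # []
print(surviving_versions([150, 300, 575], 300))  # [575]
```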

22
Q

You have configured HBase to store a maximum number of two versions. You have inserted five versions of your data. At what point are the older versions removed?

A

The writes continue to insert new data and the older versions are removed at major compaction.
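
A simplified model of this behaviour: writes only append, and trimming to the configured maximum happens at major compaction. The class below is a hypothetical illustration, not HBase code.

```python
class VersionedCell:
    def __init__(self, max_versions):
        self.max_versions = max_versions
        self.stored = []  # version timestamps physically present on disk

    def put(self, ts):
        # Writes never remove older versions.
        self.stored.append(ts)

    def major_compact(self):
        # Keep only the newest max_versions timestamps.
        self.stored = sorted(self.stored, reverse=True)[: self.max_versions]


cell = VersionedCell(max_versions=2)
for ts in [1, 2, 3, 4, 5]:
    cell.put(ts)
print(len(cell.stored))  # 5
cell.major_compact()
print(cell.stored)  # [5, 4]
```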

23
Q

What operation is your client attempting to complete if it is querying ZooKeeper to find the location of the HMaster?

A

A metadata change.

The client will ask ZooKeeper to find the location of the HMaster when you need to change metadata, such as creating a table, adding a column family to an existing table, deleting a table, etc.

24
Q

Your client application is writing to a daughter region. Which HBase operation recently occurred?

A

A region split

25
Q

You have a table with 25 TB of data, 50 RegionServers and a region size of 256 MB. You want to continue with puts to widely dispersed row keys in your table. What should you do to increase write performance for this situation?

A

Increase the number of RegionServers
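
The arithmetic behind this answer, as a rough sketch. It assumes binary units, regions that are all full at 256 MB, and an even spread across servers.

```python
TABLE_SIZE_MB = 25 * 1024 * 1024  # 25 TB expressed in MB (binary units)
REGION_SIZE_MB = 256
REGION_SERVERS = 50

regions = TABLE_SIZE_MB // REGION_SIZE_MB
regions_per_server = regions // REGION_SERVERS

print(regions, regions_per_server)  # 102400 2048
```

With about 2,048 regions per server, widely dispersed puts touch many regions on each server at once, so adding RegionServers spreads that write load.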

26
Q

Your client application is writing data to a Region. By default, where is the data saved first?

A

WAL

27
Q

Where is the WAL in HDFS?

A

/hbase/.logs, with one subdirectory per RegionServer

28
Q

In what order is data written to the HLog or WAL?

A

In order of writes

29
Q

At what level is the HLog / WAL?

A

Region Server - All regions on a region server share a single instance.

30
Q

What is the minimum number of regions that a table can have?

A

One

31
Q

When writing a row, does your client need to contact the HMaster?

A

No