HBase Architecture Flashcards
What are the four basic events that can potentially destroy data locality?
1 - The HBase balancer decides to move a region to balance data sizes across RegionServers.
2 - A RegionServer dies. All its regions need to be relocated to another server.
3 - A table is disabled and re-enabled.
4 - A cluster is stopped and restarted.
What metric measures HFile data locality?
HFile locality index = ( Total number of HDFS blocks that can be retrieved locally by the region server ) / ( Total number of HDFS blocks for all HFiles )
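As a quick worked example, the ratio above can be computed as follows. This is only a sketch of the arithmetic: the function name and the block counts are made up for illustration and are not an HBase API.

```python
def locality_index(local_blocks: int, total_blocks: int) -> float:
    """Fraction of HFile blocks the RegionServer can read from its local DataNode."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    if not 0 <= local_blocks <= total_blocks:
        raise ValueError("local_blocks must be between 0 and total_blocks")
    return local_blocks / total_blocks


# Hypothetical numbers: 750 of 1000 HDFS blocks are local.
print(locality_index(750, 1000))  # 0.75
```

A value of 1.0 means every block is local; the value drops after any of the four events listed earlier.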
What is the hierarchy of objects on a RegionServer?
Table (HBase table)
Region (Regions for the table)
Store (Store per ColumnFamily for each Region for the table)
MemStore (MemStore for each Store for each Region for the table)
StoreFile (StoreFiles for each Store for each Region for the table)
Block (Blocks within a StoreFile within a Store for each Region for the table)
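One way to picture the containment above is as nested data structures. The class names below are hypothetical and chosen for this sketch; they are not the HBase classes themselves.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StoreFile:
    # A StoreFile is made up of blocks.
    blocks: List[bytes] = field(default_factory=list)


@dataclass
class Store:
    # One MemStore and zero or more StoreFiles per Store.
    memstore: Dict[bytes, bytes] = field(default_factory=dict)
    store_files: List[StoreFile] = field(default_factory=list)


@dataclass
class Region:
    # One Store per column family, keyed by column family name.
    stores: Dict[str, Store] = field(default_factory=dict)


@dataclass
class Table:
    regions: List[Region] = field(default_factory=list)


# A table with one region and two column families has two Stores.
table = Table(regions=[Region(stores={"Content": Store(), "Meta": Store()})])
print(len(table.regions[0].stores))  # 2
```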
How are regions assigned to RegionServers when HBase starts?
1 - The Master invokes the AssignmentManager upon startup.
2 - The AssignmentManager looks at the existing region assignments in META.
3 - If the region assignment is still valid (i.e., if the RegionServer is still online) then the assignment is kept.
4 - If the assignment is invalid, then the LoadBalancerFactory is invoked to assign the region. The DefaultLoadBalancer will randomly assign the region to a RegionServer.
5 - META is updated with the RegionServer assignment (if needed) and the RegionServer start codes (start time of the RegionServer process) upon region opening by the RegionServer.
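The startup sequence above can be sketched roughly as follows. This is a toy model, not HBase code: META is represented as a plain dict from region name to server name, and the function name, region names, and server names are all invented for the example.

```python
import random


def assign_regions(meta, online_servers, rng=None):
    """Keep valid assignments; randomly reassign regions whose server is offline."""
    rng = rng or random.Random()
    candidates = sorted(online_servers)
    if not candidates:
        raise ValueError("no RegionServers are online")
    new_meta = {}
    for region, server in meta.items():
        if server in online_servers:
            # Step 3: the old assignment is still valid, so keep it.
            new_meta[region] = server
        else:
            # Step 4: invalid assignment, pick a RegionServer at random.
            new_meta[region] = rng.choice(candidates)
    # Step 5: the returned dict stands in for the updated META.
    return new_meta


meta = {"r1": "rs1", "r2": "rs2"}
print(assign_regions(meta, {"rs1", "rs3"}))
```

Here r1 stays on rs1, while r2 moves to whichever online server the random choice picks.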
How are regions assigned to RegionServers when a RegionServer fails?
1 - The regions immediately become unavailable because the RegionServer is down.
2 - The Master will detect that the RegionServer has failed.
3 - The region assignments are considered invalid, and the regions are re-assigned just as in the startup sequence.
What is a Store?
A Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region.
What is the MemStore?
The MemStore holds in-memory modifications to the Store. Modifications are KeyValues. When asked to flush, the current MemStore is moved to a snapshot and cleared. HBase continues to serve edits out of the new MemStore and the backing snapshot until the flusher reports that the flush succeeded. At this point the snapshot is discarded.
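The flush hand-off described above can be sketched as a small class. This is a simplified model with hypothetical names; it leaves out the StoreFile that the flush writes, so a flushed key is no longer visible in the sketch once the snapshot is dropped.

```python
class MemStoreSketch:
    def __init__(self):
        self.active = {}    # receives new edits
        self.snapshot = {}  # edits currently being flushed

    def put(self, key, value):
        self.active[key] = value

    def get(self, key):
        # Reads check the active MemStore first, then the snapshot.
        if key in self.active:
            return self.active[key]
        return self.snapshot.get(key)

    def begin_flush(self):
        # Move the current contents to the snapshot and start a fresh MemStore.
        self.snapshot = self.active
        self.active = {}
        return dict(self.snapshot)  # what the flusher would write out

    def flush_succeeded(self):
        # The flusher reported success, so the snapshot can be let go.
        self.snapshot = {}


m = MemStoreSketch()
m.put("a", 1)
m.begin_flush()
m.put("b", 2)
print(m.get("a"), m.get("b"))  # 1 2 (both served while the flush is in progress)
```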
What is a StoreFile (HFile)?
StoreFiles are where your data lives.
What are Blocks?
StoreFiles are composed of blocks. The blocksize is configured on a per-ColumnFamily basis. Compression happens at the block level within StoreFiles.
What is contained in ROOT?
-ROOT- keeps track of where the .META. table is.
Key format: .META. region key (.META.,,1)
Contains:
info:regioninfo (serialized HRegionInfo instance of .META.)
info:server (server:port of the RegionServer holding .META.)
info:serverstartcode (start-time of the RegionServer process holding .META.)
What is contained in META?
The .META. table keeps a list of all regions in the system.
Key format: Region key of the format ([table],[region start key],[region id])
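To make the key format concrete, here is a small sketch that builds a key in the shape given above. The helper name, the table name, and the region ids are illustrative assumptions only.

```python
def meta_row_key(table: str, start_key: str, region_id: int) -> str:
    """Build a key of the form [table],[region start key],[region id]."""
    return f"{table},{start_key},{region_id}"


# The first region of a table has an empty start key.
print(meta_row_key("Blogs", "", 1))        # Blogs,,1
print(meta_row_key("Blogs", "row500", 2))  # Blogs,row500,2
```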
What is Block Cache?
The Block Cache is an LRU (least recently used) cache that contains three levels of block priority to allow for scan-resistance and in-memory ColumnFamilies:
- Single access priority
- Multi access priority
- In-memory access priority
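The sketch below illustrates why separate priorities give scan resistance. It is a deliberate simplification under stated assumptions: capacity is counted in blocks rather than bytes, and eviction always takes the least recently used block from the lowest non-empty priority, whereas a real cache would balance the priorities by their share of the cache. All names are hypothetical.

```python
from collections import OrderedDict

SINGLE, MULTI, IN_MEMORY = "single", "multi", "in_memory"


class BlockCacheSketch:
    def __init__(self, capacity: int):
        self.capacity = capacity  # maximum number of cached blocks
        self.buckets = {p: OrderedDict() for p in (SINGLE, MULTI, IN_MEMORY)}

    def __len__(self):
        return sum(len(bucket) for bucket in self.buckets.values())

    def put(self, key, block, in_memory=False):
        for bucket in self.buckets.values():
            bucket.pop(key, None)
        # New blocks start at single-access priority unless flagged in-memory.
        self.buckets[IN_MEMORY if in_memory else SINGLE][key] = block
        while len(self) > self.capacity:
            self._evict_one()

    def get(self, key):
        if key in self.buckets[SINGLE]:
            # A repeat access promotes the block to multi-access priority.
            block = self.buckets[SINGLE].pop(key)
            self.buckets[MULTI][key] = block
            return block
        for priority in (MULTI, IN_MEMORY):
            if key in self.buckets[priority]:
                self.buckets[priority].move_to_end(key)
                return self.buckets[priority][key]
        return None

    def _evict_one(self):
        # Evict the least recently used block from the lowest non-empty priority.
        for priority in (SINGLE, MULTI, IN_MEMORY):
            if self.buckets[priority]:
                self.buckets[priority].popitem(last=False)
                return


cache = BlockCacheSketch(capacity=3)
cache.put("hot", b"x")
cache.get("hot")  # second access: promoted to multi
for key in ("scan1", "scan2", "scan3"):
    cache.put(key, b"y")  # a scan touches each block once
print(cache.get("hot") is not None)  # True: the scan did not push it out
```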
Besides your data what else is stored in the Block Cache?
- Catalog tables: -ROOT- and .META.
- HFile indexes
- Keys
- Bloom Filters
In what order are the contents of an HBase table sorted?
By row key, column family, column qualifier, and timestamp. The timestamp is sorted in descending order, so the newest version comes first.
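A small sketch of that ordering, using tuples of (row key, column family, column qualifier, timestamp). The sample cells are invented, and Python string comparison stands in for HBase's byte-wise comparison, which gives the same result for these ASCII keys. Timestamps are negated so that newer versions sort first.

```python
cells = [
    ("row2", "cf", "a", 100),
    ("row1", "cf", "b", 100),
    ("row1", "cf", "a", 100),
    ("row1", "cf", "a", 200),
]


def sort_key(cell):
    row, family, qualifier, timestamp = cell
    # Ascending on row, family, qualifier; descending on timestamp.
    return (row, family, qualifier, -timestamp)


sorted_cells = sorted(cells, key=sort_key)
for cell in sorted_cells:
    print(cell)
# ('row1', 'cf', 'a', 200)
# ('row1', 'cf', 'a', 100)
# ('row1', 'cf', 'b', 100)
# ('row2', 'cf', 'a', 100)
```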
Does disabling block caching improve scan performance when you perform a full table scan of your data?
Yes. When you disable block caching, you free up memory for other operations. With a full table scan, you cannot take advantage of block caching anyway because your entire table won’t fit into the cache.
What is the hbase shell syntax for creating a table called Blogs with a column family called Content?
create 'Blogs', 'Content'
What is in a store?
A MemStore and zero or more StoreFiles (HFiles). The BlockCache is not part of a Store; it is shared across the RegionServer.
You want to do mostly table scans on your data. In order to improve performance you increase your block size. Why does increasing block size improve scan performance?
Increasing the block size means that there are fewer blocks per StoreFile, so fewer block index entries need to be read from disk, which improves scan performance.
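A rough back-of-the-envelope calculation shows the effect. It assumes one index entry per block and blocks that are exactly the configured size, and it ignores compression; the 1 GiB file size is a made-up example.

```python
def block_count(hfile_bytes: int, block_bytes: int) -> int:
    """Approximate number of blocks (and index entries) in one HFile."""
    if hfile_bytes < 0 or block_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    return -(-hfile_bytes // block_bytes)  # ceiling division


ONE_GIB = 1024 ** 3
print(block_count(ONE_GIB, 64 * 1024))   # 16384
print(block_count(ONE_GIB, 256 * 1024))  # 4096
```

Quadrupling the block size cuts the number of index entries to a quarter in this idealized case. The trade-off is that random reads must load larger blocks.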
What is the default block size?
64 KB
What unit of measure is TTL (time to live) saved in?
Seconds
The cells in a given row have versions that range from 150 to 575. You execute a delete on the row and specify the value 650 for the version. What is the outcome?
The entire row is deleted
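The reason is that a delete with a version (timestamp) covers every cell whose version is less than or equal to that value. The sketch below models only the visible result; in HBase itself the delete writes a tombstone and the cells are physically removed later, at major compaction. The cell tuples and the function name are illustrative.

```python
def apply_row_delete(cells, delete_ts):
    """Return the cells still visible after a row delete at delete_ts.

    Each cell is a (qualifier, version, value) tuple.
    """
    return [cell for cell in cells if cell[1] > delete_ts]


row = [("title", 150, "v1"), ("body", 300, "v1"), ("title", 575, "v2")]

print(apply_row_delete(row, 650))  # []: every version is <= 650
print(apply_row_delete(row, 300))  # [('title', 575, 'v2')]
```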
You have configured HBase to store a maximum number of two versions. You have inserted five versions of your data. At what point are the older versions removed?
The writes continue to insert new data and the older versions are removed at major compaction.
What operation is your client attempting to complete if it is querying ZooKeeper to find the location of the HMaster?
A metadata change.
The client asks ZooKeeper for the location of the HMaster when it needs to change metadata, such as creating a table, adding a column family to an existing table, or deleting a table.
Your client application is writing to a daughter region. Which HBase operation recently occurred?
A region split