Lecture 12: Column Stores Flashcards
HDFS vs. HBase
HDFS is built for scans over big files; it is not good for record lookup, incremental addition of small batches, or updates.
HBase is designed for fast record lookup, record-level insertion, and updates (done by creating new versions of values).
HBase lacks certain relational features like joins but excels in scalability and handling large datasets with dynamic schema.
* Use it for: when you need random write, random read, or both; when you need to do many thousands of operations per second on multiple TB of data; when your access patterns are well-known and simple (a minimal client sketch follows below)
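As a minimal sketch of that random-read/random-write workload, using the standard HBase Java client (the table name `users`, column family `info`, and row key `user42` are made-up examples):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Random write: insert or overwrite a single cell, addressed by row key.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("user42@example.com"));
            table.put(put);

            // Random read: fetch one row by key (no scan over whole files as with HDFS).
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```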
HBase
Distributed column-oriented data store.
* Built on: Hadoop Distributed File System (HDFS).
* Purpose: Provides storage for Hadoop’s distributed computing.
* Data Organization: Tables, rows, columns, and cells (table cells = intersections of rows and columns).
* Usage: Suitable for very large datasets and big (batch) jobs.
HBase data model
Structure: essentially a map (key-value pairs)
Tables: maps of maps; rows are maps with keys (columns) and values (byte arrays)
Columns: grouped into column families; column names (qualifiers) are dynamic and stored together with each cell
Dynamic columns: different cells can have different columns, allowing for schema flexibility
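To illustrate the map-of-maps view and dynamic columns, a hedged sketch (the table `profiles`, family `attr`, and the already opened Connection `conn` are assumptions for illustration): two rows of the same table carry different qualifiers within one column family.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Assumes an open Connection `conn` and a table 'profiles' with column family 'attr'.
Table table = conn.getTable(TableName.valueOf("profiles"));

// Row "alice": qualifiers 'age' and 'city' under family 'attr'; values are byte arrays.
Put alice = new Put(Bytes.toBytes("alice"));
alice.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("age"), Bytes.toBytes("30"));
alice.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("city"), Bytes.toBytes("Utrecht"));
table.put(alice);

// Row "bob": a completely different qualifier in the same family; no schema change
// is needed because column names live inside the cells, not in a fixed schema.
Put bob = new Put(Bytes.toBytes("bob"));
bob.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("nickname"), Bytes.toBytes("bobby"));
table.put(bob);
```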
Versioning
HBase keeps 3 versions per cell by default; each cell can have multiple versions indexed by timestamp.
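A sketch of how versions might be written and read back (table/family names are hypothetical; `setMaxVersions` is the classic call, newer client versions offer `readVersions` instead):

```java
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Assumes an open Table whose column family 'info' is configured to keep >= 2 versions.
static void versionDemo(Table table) throws Exception {
    byte[] row = Bytes.toBytes("user42");
    byte[] fam = Bytes.toBytes("info");
    byte[] qual = Bytes.toBytes("email");

    // Two writes to the same cell with explicit timestamps create two versions.
    table.put(new Put(row).addColumn(fam, qual, 1L, Bytes.toBytes("old@example.com")));
    table.put(new Put(row).addColumn(fam, qual, 2L, Bytes.toBytes("new@example.com")));

    // Ask for up to 3 versions (the default maximum kept per cell).
    Get get = new Get(row);
    get.setMaxVersions(3);
    Result result = table.get(get);

    // Versions come back newest first, each carrying its own timestamp.
    List<Cell> versions = result.getColumnCells(fam, qual);
    for (Cell cell : versions) {
        System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
    }
}
```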
HBase Physical Model
- Storage: Each column family is stored in its own set of files (store files / HFiles).
- Indexing: Multi-level index over values keyed by row key, column family, column name, and timestamp (pictured as nested maps below).
- Regions: HTables are partitioned horizontally into regions, similar to HDFS blocks.
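Conceptually the index can be pictured as nested sorted maps over row key, column family, column qualifier, and timestamp. This is only a sketch of the logical layout, not HBase's actual on-disk structures:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Logical view only: rowKey -> columnFamily -> qualifier -> timestamp -> value.
NavigableMap<String,                       // row key: kept sorted; regions are key ranges
    NavigableMap<String,                   // column family: stored in its own files
        NavigableMap<String,               // column qualifier (dynamic per cell)
            NavigableMap<Long, byte[]>>>>  // timestamp -> cell value (versions)
    logicalTable = new TreeMap<>();
```

The client API mirrors this shape: for a single row, Result.getMap() returns a family -> qualifier -> timestamp -> value nested map.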
HBase Components
- HBaseMaster: Single master coordinating the cluster.
- HRegionServer: Manages data regions, handles read/write requests.
- HBase Client: Interacts with HBase.
- ZooKeeper: Manages registration and coordination of HBase components (see the connection sketch below).
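Clients do not route reads and writes through the HMaster; they use ZooKeeper to locate the region servers holding the data. A hedged connection sketch (hostnames are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

Configuration conf = HBaseConfiguration.create();
// The client only needs the ZooKeeper quorum; from ZooKeeper it learns where the
// meta table lives and, from there, which HRegionServer serves each region.
conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
conf.set("hbase.zookeeper.property.clientPort", "2181");

try (Connection conn = ConnectionFactory.createConnection(conf)) {
    // Tables are obtained from the Connection; reads/writes go directly to region servers.
}
```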
Operations and Features
- Deletion: Mark cells, columns, or entire column families as deleted.
- Joins: Not natively supported; handled at the application layer using scan() and get() (see the sketch after this list).
- Logging: Operations are written to a log and periodically flushed to disk.
- Bloom Filters: Used to improve search efficiency by skipping (compressed) store files that cannot contain the requested key.
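Because joins are left to the application, a typical pattern is to scan() one table and do point get()s into another; a delete only marks data, which is removed physically during compaction. A sketch under assumed table and family names (`orders` with family `o`, `users` with family `info`):

```java
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Assumes open Tables: 'orders' (family "o", qualifier "userId") and 'users' (family "info").
static void joinAndDelete(Table orders, Table users) throws Exception {
    byte[] o = Bytes.toBytes("o");
    byte[] info = Bytes.toBytes("info");

    // "Join" at the application layer: scan orders, then get() the matching user row.
    try (ResultScanner scanner = orders.getScanner(new Scan().addFamily(o))) {
        for (Result order : scanner) {
            byte[] userId = order.getValue(o, Bytes.toBytes("userId"));
            Result user = users.get(new Get(userId).addFamily(info));
            // ... combine fields from `order` and `user` here ...
        }
    }

    // Deletion marks cells as deleted; they disappear physically at the next major compaction.
    Delete d = new Delete(Bytes.toBytes("user42"));
    d.addColumns(info, Bytes.toBytes("email")); // all versions of this column
    users.delete(d);
}
```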
HBase deployment
- Master Node: Coordinates operations. Runs NameNode, SecondaryNameNode, HMaster, JobTracker, and ZooKeeper (the proverbial basket full of eggs).
- Slave Nodes: Perform data storage and processing tasks. Each runs RegionServer, DataNode, and TaskTracker (typically 5+ slaves with the HBase, HDFS, and MapReduce slave processes).