College 12: Column Stores Flashcards

1
Q

HDFS vs. HBase

A

HDFS: built for scanning over big files; poorly suited to record lookup, incremental addition of small batches, and updates.
HBase: designed for fast record lookup, record-level insertion, and updates (done by creating new versions of values).
HBase lacks certain relational features such as joins, but it scales well and handles large datasets with a dynamic schema.
* Use it when: you need random writes, random reads, or both; you need many thousands of operations per second on multiple TB; your access patterns are well known and simple.

2
Q

HBase

A

Distributed column-oriented data store.
* Built on: Hadoop Distributed File System (HDFS).
* Purpose: Provides storage for Hadoop’s distributed computing.
* Data Organization: Tables, rows, columns, and cells (table cells = intersections of rows and columns).
* Usage: Suitable for big jobs.

3
Q

HBase data model

A

Structure: essentially a map (key-value pairs)
Tables: maps of maps; rows are maps with keys (columns) and values (byte arrays)
Columns: grouped into families; column names are dynamic and are stored with each cell
Dynamic columns: different rows can have different columns, allowing for schema flexibility
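
The "map of maps" idea can be sketched with plain Python dictionaries. This is only a toy model of the logical layout, not the HBase client API; the names `table`, `put`, and `get` and the sample rows are made up for illustration.

```python
# Toy model of the logical layout: row key -> "family:qualifier" -> bytes.
# Plain dictionaries only; this is NOT the HBase client API.
table = {}

def put(row_key, family, qualifier, value):
    """Store a byte-array value under row -> family:qualifier."""
    row = table.setdefault(row_key, {})
    row[f"{family}:{qualifier}"] = value

def get(row_key):
    """Return the map of columns for one row (empty if the row is absent)."""
    return table.get(row_key, {})

# Two rows in the same column family with different column names,
# which is what "dynamic columns" means here.
put("user1", "info", "name", b"Ada")
put("user1", "info", "email", b"ada@example.org")
put("user2", "info", "name", b"Bob")
put("user2", "info", "phone", b"555-0100")

print(sorted(get("user1")))  # ['info:email', 'info:name']
print(sorted(get("user2")))  # ['info:name', 'info:phone']
```

Only the column family (`info`) would be fixed up front; the column names inside it vary per row.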

4
Q

Versioning

A

Each cell can hold multiple versions of its value, indexed by timestamp. 3 versions are kept by default (HBase 0.96 and later default to 1; the limit is configurable per column family).
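
A minimal sketch of timestamp-indexed versions, assuming a limit of 3 as stated on this card. The class `VersionedCell` is hypothetical, and it prunes old versions on every write for simplicity; real HBase removes excess versions later, during compaction.

```python
MAX_VERSIONS = 3  # the default given on this card; assumed here for illustration

class VersionedCell:
    """Toy cell that keeps at most `max_versions` values, keyed by timestamp."""

    def __init__(self, max_versions=MAX_VERSIONS):
        self.max_versions = max_versions
        self.versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        self.versions[timestamp] = value
        # Drop everything older than the newest `max_versions` timestamps.
        for ts in sorted(self.versions, reverse=True)[self.max_versions:]:
            del self.versions[ts]

    def get(self, timestamp=None):
        """Newest value at or before `timestamp` (newest overall if None)."""
        candidates = [ts for ts in self.versions
                      if timestamp is None or ts <= timestamp]
        return self.versions[max(candidates)] if candidates else None

cell = VersionedCell()
for ts, value in [(1, b"a"), (2, b"b"), (3, b"c"), (4, b"d")]:
    cell.put(ts, value)

print(cell.get())    # b'd'  (latest version)
print(cell.get(2))   # b'b'  (version as of timestamp 2)
print(cell.get(1))   # None  (timestamp 1 was pruned as the 4th-oldest)
```

An update never overwrites in place: it adds a newer version, and reads pick the newest one unless an older timestamp is requested.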

5
Q

HBase Physical Model

A
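  • Storage: Each column family is stored in its own set of files (called HTables in the lecture; HBase itself names these files HFiles).
  • Indexing: Values are located through a multi-level index on row key, column family, column name, and timestamp.
  • Regions: Tables are partitioned horizontally into regions (contiguous row-key ranges), which play a role similar to HDFS blocks.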
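
Horizontal partitioning into regions can be sketched as a lookup over sorted start keys. The region boundaries and names below are hypothetical, and the sketch assumes each region covers a contiguous, sorted range of row keys.

```python
import bisect

# Hypothetical regions, each covering [start_key, next_start_key).
# The first region starts at the empty key so every row key maps somewhere.
region_start_keys = ["", "g", "n", "t"]
region_names = ["region-0", "region-1", "region-2", "region-3"]

def find_region(row_key):
    """Return the region whose key range contains `row_key`."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position owns the key.
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_names[idx]

print(find_region("apple"))  # region-0
print(find_region("g"))      # region-1 (start keys are inclusive)
print(find_region("zebra"))  # region-3
```

Because rows are kept sorted by key, locating the region for a record is a binary search over the boundaries rather than a scan over the data.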
6
Q

HBase Components

A
  • HBaseMaster: Single master coordinating the cluster.
  • HRegionServer: Manages data regions, handles read/write requests.
  • HBase Client: Interacts with HBase.
  • ZooKeeper: Handles registration and coordination of the HBase components.
7
Q

Operations and Features

A
  • Deletion: Mark cells, columns, or entire column families as deleted.
  • Joins: Not natively supported; implemented at the application layer on top of scan() and get().
  • Logging: Operations are logged and periodically flushed.
  • Bloom Filters: Used to speed up lookups by ruling out stored files that cannot contain the requested key.
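
The Bloom filter idea from the list above can be sketched in a few lines. This is a generic textbook Bloom filter, not HBase's implementation; the class name, bit-array size, and hash construction are assumptions chosen for readability.

```python
import hashlib

class BloomFilter:
    """Generic Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive `num_hashes` bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

# One filter per stored file: a read can skip the file when the answer is False.
file_filter = BloomFilter()
for row_key in ["row1", "row2", "row3"]:
    file_filter.add(row_key)

print(file_filter.might_contain("row2"))  # True
```

A `False` answer is always reliable, so the file can be skipped without touching disk; a `True` answer still requires reading the file to confirm.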
8
Q

HBase deployment

A
  • Master Node: Coordinates operations.
    Master node -> NameNode, SecondaryNameNode, HMaster, JobTracker, ZooKeeper (the proverbial basket full of eggs: everything critical sits on one machine)
  • Slave Nodes: Perform data storage and processing tasks.
    Slave nodes -> RegionServer, DataNode, TaskTracker (5+ slaves, each running the HBase, HDFS, and MR slave processes)