6_Bigtable Flashcards
1
Q
What is Cloud Bigtable?
- Managed wide-column NoSQL database
- High throughput
- Low latency (in milliseconds)
- Scalability
- High availability
- Does not support SQL-like queries
- Uses a single row key as the only index
Used for:
- High throughput analytics
- Huge datasets
Use cases:
- Financial data - stock prices
- IoT data
- Marketing data - purchase histories
A
2
Q
Cluster
- Not no-ops
- Must configure nodes
- The entire Bigtable deployment is called an ‘instance’
- An instance contains all of its clusters and nodes
- Nodes grouped into clusters
- 1 or more clusters per instance
- Auto-scaling storage
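- A minimal sketch of creating a production instance with one 3-node SSD cluster using the Python client (google-cloud-bigtable); the project, instance, cluster and zone IDs below are placeholders:

from google.cloud import bigtable
from google.cloud.bigtable import enums

# Admin client is required for instance and cluster management.
client = bigtable.Client(project="my-project", admin=True)

# The instance is the container for all clusters and nodes.
instance = client.instance(
    "my-instance",
    display_name="My Bigtable instance",
    instance_type=enums.Instance.Type.PRODUCTION,
)

# One cluster, pinned to a single zone, with 3 nodes.
cluster = instance.cluster(
    "my-cluster",
    location_id="us-central1-a",
    serve_nodes=3,
    default_storage_type=enums.StorageType.SSD,
)

# Creation returns a long-running operation; wait for it to finish.
operation = instance.create(clusters=[cluster])
operation.result(timeout=300)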
A
3
Q
Data Storage
Blocks of contiguous rows are:
- Sharded into tablets
- Stored in Google Colossus
Splitting, merging and rebalancing happen automatically
- Storage and compute are separate, so a node going down may affect performance, but not data integrity
- Nodes only store metadata (pointers to the data in Colossus)
A
4
Q
Instance Types
- Development
- Low cost, single node
- No replication or SLA
- Production
- 1+ clusters
- 3+ nodes per cluster
- Replication available, throughput guarantee
- Can upgrade a development instance to a production instance; the upgrade is permanent.
A
5
Q
Storage Types
- SSD
- Almost always the right choice
- Fastest and most predictable option
- 6ms latency for 99% of reads and writes
- Each node can process 2.5 TB SSD data
- HDD
- Each node can process 8 TB HDD data
- Throughput is limited
- Row reads are 5% the speed of SSD reads
- Use if storing at least 10 TB of infrequently-accessed data with no latency sensitivity
Note: Changing the disk type requires creating a new instance (export the data to Cloud Storage first, then import it into the new instance)
A
6
Q
Application Profiles
- Custom application-specific settings for handling incoming connections.
- Good practice to create one profile for each individual application.
- Make it easier to view per-application connection metrics
- Single or multi-cluster routing
- Single-cluster routing: the application routes to a single cluster. A manual failover (manually updating the profile) is required if the cluster fails.
- Use case: Web application and batch job route traffic to different cluster based on application profile
- Multi-cluster routing: automatic failover to the nearest available cluster, but running multiple replicated clusters costs more.
- Single-cluster routing is required for single-row transactions (atomic updates to single rows).
- Single-row transactions cannot be guaranteed with multi-cluster routing, because writes are not strongly consistent across clusters.
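- A hedged sketch of defining one profile per application with the Python admin client (profile, instance and cluster IDs are placeholders):

from google.cloud import bigtable
from google.cloud.bigtable import enums

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")

# Single-cluster routing: traffic is pinned to one cluster; single-row
# transactional writes can only be allowed on this kind of profile.
batch_profile = instance.app_profile(
    "batch-jobs",
    routing_policy_type=enums.RoutingPolicyType.SINGLE,
    description="Batch jobs pinned to one cluster",
    cluster_id="my-cluster-1",
    allow_transactional_writes=True,
)
batch_profile.create()

# Multi-cluster routing: Bigtable fails over automatically to the
# nearest available cluster.
web_profile = instance.app_profile(
    "web-app",
    routing_policy_type=enums.RoutingPolicyType.ANY,
    description="Web app with automatic failover",
)
web_profile.create()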
A
7
Q
Bigtable Configuration
- Instances can run up to four clusters
- Clusters exist in a single zone
- Up to 30 nodes per project (default quota; can be raised)
- Maximum of 1000 tables per instance
A
8
Q
Access Control
- IAM predefined roles:
- Admin, User, Reader, Viewer
- Applied at project, instance or table level to:
- Restrict access or administration
- Restrict reads and writes
- Restrict development or production access
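- A sketch of granting a predefined role at the instance level with the Python client (the member email and IDs are placeholders):

from google.cloud import bigtable
from google.cloud.bigtable.policy import Policy, BIGTABLE_READER_ROLE

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")

# Read the current IAM policy, grant the Reader role to a user,
# and write the policy back to the instance.
policy = instance.get_iam_policy()
policy[BIGTABLE_READER_ROLE] = [Policy.user("analyst@example.com")]
instance.set_iam_policy(policy)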
A
9
Q
Data Storage Model
- One big table
- A table can have thousands of columns and billions of rows
- Table is sharded across tablets
- Table components:
- Row Key (first column)
- Columns grouped into column families
- Only the row key is indexed
- Design of row key is essential for performance
- Atomic operations are allowed, but only on one row at a time
- Empty cells don’t consume any space in the database
- Individual cells should be no larger than 10 MB (including all versions and timestamps)
- A row should be under 100 MB in size (hard limit: 256 MB)
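- A small sketch of a single-row write with the Python client; all mutations on the row commit atomically, but only within that one row (project, instance and table IDs are placeholders):

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("vehicles")

# Every set_cell on this DirectRow is applied atomically by commit(),
# but atomicity never spans more than this single row.
row = table.direct_row(b"NYMT#86#20210101")
row.set_cell("stats", b"speed", b"42")
row.set_cell("stats", b"heading", b"NE")
row.commit()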
A
10
Q
Timestamp and Garbage Collection
- Each cell can have multiple versions, identified by:
- Server-recorded timestamps
- Or sequential numbers
- Expiry policies define garbage collection:
- Expire based on age
- Expire based on number of versions
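- A sketch of defining an expiry policy when creating a table with the Python client (IDs and limits are placeholders); here cells are garbage-collected once they are older than 30 days or beyond the two most recent versions:

import datetime
from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="my-project", admin=True)
table = client.instance("my-instance").table("sensor-data")

# Expire based on age OR on the number of versions, whichever applies first.
gc_rule = column_family.GCRuleUnion(rules=[
    column_family.MaxAgeGCRule(datetime.timedelta(days=30)),
    column_family.MaxVersionsGCRule(2),
])

# Create the table with one column family governed by that garbage-collection rule.
table.create(column_families={"readings": gc_rule})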
A
11
Q
Field Promotion
- Move fields from column data into the row key.
- Allows for better row keys when we know which fields we will query on.
- Better query performance: we don’t have to scan as much data within the table
- We can use a prefix filter, e.g. scan ‘vehicles’, {ROWPREFIXFILTER => ‘NYMT#86#’}
- Never put a timestamp at the start of the key; it makes it impossible for Bigtable to balance the load across the cluster
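- The same prefix scan expressed as a sketch with the Python client (IDs are placeholders); the end key is the prefix with its last byte incremented, so only rows starting with ‘NYMT#86#’ are read:

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("vehicles")

# Scan only the key range covered by the promoted-field prefix 'NYMT#86#'.
rows = table.read_rows(start_key=b"NYMT#86#", end_key=b"NYMT#86$")
for row in rows:
    print(row.row_key, row.cells)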
A
12
Q
Designing Row Keys
- Row keys are the only indexed item
- Lexicographic sorting (A-Z)
- Related entities should be in adjacent rows for more efficient reads
- Queries use a row key or a row key prefix
- Row key or prefix should be sufficient for a search
- Balanced access patterns enable linear scaling of performance
- Good row keys spread/distribute load evenly over multiple nodes
- Reverse domain names (com.linuxacademy.support)
- String identifiers (mattu)
- Timestamps (reversed, NOT at the front or as the only identifier), and only as part of a broader row key design
- Row keys to avoid:
- Domain names
- Sequential numbers
- Frequently updated identifiers
- Hashed values
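- A small sketch of composing a row key along these lines; the fields, delimiter and reversal constant are illustrative, not required by Bigtable:

# Stable, well-distributed fields first; the (reversed) timestamp only at
# the end, so rows still sort lexicographically and load stays balanced.
MAX_TS = 10_000_000_000  # illustrative constant for reversing epoch seconds

def make_row_key(reverse_domain: str, user_id: str, epoch_seconds: int) -> bytes:
    reversed_ts = MAX_TS - epoch_seconds  # newest events sort first
    return f"{reverse_domain}#{user_id}#{reversed_ts}".encode("utf-8")

# e.g. b'com.linuxacademy.support#mattu#8366284800'
key = make_row_key("com.linuxacademy.support", "mattu", 1633715200)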
A
13
Q
Time Series Data
- Use tall and narrow tables (one event per row) rather than short and wide tables
- Use rows instead of versioned cells
- Logically separate tables
- Don’t reinvent the wheel: OpenTSDB
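- A sketch of the tall-and-narrow pattern with the Python client: every event becomes its own row instead of another cell version (IDs and field names are placeholders):

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("metrics")

def write_event(sensor_id: str, epoch_seconds: int, temperature: float) -> None:
    # One row per measurement, keyed by sensor plus reversed timestamp.
    row_key = f"{sensor_id}#{10_000_000_000 - epoch_seconds}".encode("utf-8")
    row = table.direct_row(row_key)
    row.set_cell("m", b"temp", str(temperature).encode("utf-8"))
    row.commit()

write_event("sensor-42", 1633715200, 21.5)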
A
14
Q
Avoid Hotspots
- Hotspots = when load is concentrated on one node instead of being distributed across the cluster.
- Field promotion
- Salting (prefix the row key with a calculated element; see the sketch below)
- Key Visualizer:
- Tool that helps you analyze your Cloud Bigtable usage patterns. It generates visual reports for your tables that break down your usage based on the row keys that you access.
- Helps you complete the following tasks:
- Check whether your reads or writes are creating hotspots on specific rows
- Find rows that contain too much data
- Look at whether your access patterns are balanced across all of the rows in a table
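- A sketch of salting a timestamp-led key (the bucket count and hash choice are illustrative); the salt spreads sequential writes over several key ranges, at the cost of fanning reads out across every bucket:

import hashlib

NUM_SALT_BUCKETS = 8  # illustrative; often chosen close to the node count

def salted_row_key(epoch_seconds: int, event_id: str) -> bytes:
    # Prefix a deterministic bucket so sequential timestamps don't all
    # land on the same tablet and create a hotspot.
    bucket = int(hashlib.sha1(event_id.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
    return f"{bucket}#{epoch_seconds}#{event_id}".encode("utf-8")

keys = [salted_row_key(1633715200, f"event-{i}") for i in range(3)]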
A
15
Q
Causes of slower performance
- The table’s schema is not designed correctly.
- The rows in your Cloud Bigtable table contain large amounts of data.
- The rows in your Cloud Bigtable table contain a very large number of cells.
- The Cloud Bigtable cluster doesn’t have enough nodes.
- The Cloud Bigtable cluster was scaled up or scaled down recently.
- The Cloud Bigtable cluster uses HDD disks.
- There are issues with the network connection.
A