6_Bigtable Flashcards

1
Q

What is Cloud Bigtable?

  • Managed wide-column NoSQL database
  • High throughput
  • Low latency (in milliseconds)
  • Scalability
  • High availability
  • Does not support SQL-like queries
  • Uses a single row key (no secondary indexes)

Used for:

  • High throughput analytics
  • Huge datasets

Use cases:

  • Financial data - stock prices
  • IoT data
  • Marketing data - purchase histories
A
2
Q

Cluster

  • Not no-ops
    • Must configure nodes
  • An entire Bigtable deployment is called an ‘instance’
    • Comprises all nodes and clusters
  • Nodes grouped into clusters
    • 1 or more clusters per instance
  • Auto-scaling storage
A
3
Q

Data Storage

Blocks of contiguous rows are:

  • Sharded into tablets
  • Stored in Google Colossus

Splitting, merging and rebalancing happen automatically

  • Storage and compute are separate, so a node going down may affect performance, but not data integrity
  • Nodes only store pointers to storage as metadata
A
4
Q

Instance Types

  • Development
    • Low cost, single node
    • No replication or SLA
  • Production
    • 1+ clusters
    • 3+ nodes per cluster
    • Replication available, throughput guarantee
  • Can upgrade the development instance to a production instance. Upgrading a development instance is permanent.
A
5
Q

Storage Types

  • SSD
    • Almost always the right choice
    • Fastest and most predictable option
    • 6ms latency for 99% of reads and writes
    • Each node can process 2.5 TB SSD data
  • HDD
    • Each node can process 8 TB HDD data
    • Throughput is limited
    • Row reads are about 5% of the speed of SSD reads
    • Use if storing at least 10 TB of infrequently-accessed data with no latency sensitivity

Note: Changing disk type requires new instance (export to Cloud Storage first)

A
6
Q

Application Profiles

  • Custom application-specific settings for handling incoming connections.
    • Good practice: create one profile per individual application.
    • Helps when viewing connection metrics.
  • Single or multi-cluster routing
    • Single-cluster routing: the application routes to a single cluster. If that cluster fails, you must fail over manually (by updating the profile).
      • Use case: a web application and a batch job route traffic to different clusters based on their application profiles
    • Multi-cluster routing: failover to the next available cluster is automatic, but this is a more expensive option.
  • Single-cluster routing is required for single-row transactions (atomic updates to single rows).
    • Single-row transactions are not strongly consistent under multi-cluster routing.
A
7
Q

Bigtable Configuration

  • Instances can run up to four clusters
  • Clusters exist in a single zone
  • Up to 30 nodes per project
  • Maximum of 1000 tables per instance
A
8
Q

Access Control

  • IAM predefined roles:
    • Admin, User, Reader, Viewer
  • Applied at project, instance or table level to:
    • Restrict access or administration
    • Restrict reads and writes
    • Restrict development or production access

https://cloud.google.com/bigtable/docs/access-control

A
9
Q

Data Storage Model

  • One big table
  • Table can be thousands of columns/billions of rows
  • Table is sharded across tablets
  • Table components:
    • Row Key (first column)
    • Columns grouped into column families
  • Only the row key is indexed
    • Design of the row key is essential for performance
  • Atomic operations are supported, but only on a single row at a time
  • Empty cells don’t consume any space in the database
  • Individual cells should be no larger than 10 MB (including versions and timestamps)
  • A row should be under 100 MB in size (though rows up to 256 MB are possible)
A
10
Q

Timestamp and Garbage Collection

  • Each cell has multiple versions defined by:
    • Server-recorded timestamps
    • Or sequential numbers
  • Expiry policies define garbage collection:
    • Expire based on age
    • Expire based on number of versions
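The two expiry policies above can be modeled in plain Python. This is only an illustrative sketch of how Bigtable-style garbage collection prunes a column's cell versions (the function name and the rule that both limits apply together are assumptions for this example, not the real client API):

```python
def collect_garbage(cells, max_versions=None, max_age_seconds=None, now=0):
    """Simulate garbage collection on one column's cells.

    cells: list of (timestamp_seconds, value) pairs, in any order.
    A cell survives only if it is among the newest `max_versions`
    AND is no older than `max_age_seconds` (whichever limits are set).
    """
    # Newest first, the order in which versions are returned
    ordered = sorted(cells, key=lambda c: c[0], reverse=True)
    kept = []
    for i, (ts, value) in enumerate(ordered):
        if max_versions is not None and i >= max_versions:
            continue  # too many versions: expire
        if max_age_seconds is not None and now - ts > max_age_seconds:
            continue  # too old: expire
        kept.append((ts, value))
    return kept

cells = [(100, 'a'), (200, 'b'), (300, 'c')]
# keep newest 2 versions → [(300, 'c'), (200, 'b')]
# keep cells at most 150s old (now=300) → [(300, 'c'), (200, 'b')]
```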
A
11
Q

Field Promotion

  • Move fields from column data into the row key.
  • Allows for better row keys when we know the data at query time.
    • Better query performance: we don’t have to scan as much data within the table
    • We can use a prefix filter, e.g. scan ‘vehicles’, {ROWPREFIXFILTER => ‘NYMT#86#’}
    • Never put a timestamp at the start of the key; it would make it impossible for Bigtable to balance the load across the cluster
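A minimal sketch of field promotion, using a hypothetical vehicle-telemetry schema that matches the ‘NYMT#86#’ prefix above (the field names are assumptions for illustration):

```python
def make_row_key(fleet, vehicle_id, event_ts):
    """Promote the fleet and vehicle fields into the row key, so a
    prefix scan like ROWPREFIXFILTER => 'NYMT#86#' targets one vehicle.
    The timestamp goes last, never first, to keep writes spread out."""
    return f"{fleet}#{vehicle_id}#{event_ts}"

key = make_row_key("NYMT", 86, 1700000000)
# → "NYMT#86#1700000000", which starts with the scannable prefix "NYMT#86#"
```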
A
12
Q

Designing Row Keys

  • Row keys are the only indexed item
    • Lexicographic sorting (A-Z)
    • Related entities should be in adjacent rows for more efficient reads
  • Queries use a row key or a row key prefix
    • Row key or prefix should be sufficient for a search
  • Balanced access patterns enable linear scaling of performance
  • Good row keys spread/distribute load evenly over multiple nodes
    • Reverse domain names (com.linuxacademy.support)
    • String identifiers (mattu)
    • Timestamps (reverse, NOT at front /or only identifier), only as part of a bigger row key design
  • Row keys to avoid:
    • Domain names
    • Sequential numbers
    • Frequently updated identifiers
    • Hashed values
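Because row keys sort lexicographically, a *reversed* timestamp placed after a distributing identifier makes the newest events sort first. A sketch (the padding width and ceiling constant are illustrative choices, not a prescribed format):

```python
MAX_TS = 2**63 - 1  # any fixed ceiling larger than all timestamps works

def reverse_ts_key(device_id, ts):
    """Reversed timestamp, zero-padded so string order equals numeric
    order, placed AFTER the identifier — never at the front of the key."""
    return f"{device_id}#{MAX_TS - ts:019d}"

keys = sorted(reverse_ts_key("sensor-1", t) for t in (100, 300, 200))
# plain lexicographic order now yields the newest event (ts=300) first
```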
A
13
Q

Time Series Data

  • Use tall and narrow tables (one event per row) rather than short and wide ones
  • Use rows instead of versioned cells
  • Logically separate tables
  • Don’t reinvent the wheel: use OpenTSDB
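The tall-and-narrow rule can be sketched as follows: each reading becomes its own row rather than another versioned cell in a shared row (the metric/device naming scheme here is an assumption for illustration):

```python
def event_rows(metric, device_id, readings):
    """Tall-and-narrow layout: one (row_key, value) pair per event,
    instead of stacking versioned cells under a single row key."""
    return [(f"{metric}#{device_id}#{ts}", value) for ts, value in readings]

rows = event_rows("temp", "dev42", [(1, 20.5), (2, 21.0)])
# two independent rows: 'temp#dev42#1' and 'temp#dev42#2'
```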
A
14
Q

Avoid Hotspots

  • Hotspots occur when load concentrates on one node instead of being distributed across the cluster.
  • Remedies:
    • Field promotion
    • Salting (prefixing the row key with a calculated element)
  • Key Visualizer:
    • Tool that helps you analyze your Cloud Bigtable usage patterns. It generates visual reports for your tables that break down your usage based on the row keys that you access.
    • Help you complete the following tasks:
      • Check whether your reads or writes are creating hotspots on specific rows
      • Find rows that contain too much data
      • Look at whether your access patterns are balanced across all of the rows in a table

https://cloud.google.com/bigtable/docs/keyvis-overview
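Salting can be sketched as a deterministic hash-derived prefix; the salt count below is an illustrative choice (readers must then fan out a scan over every salt prefix):

```python
import hashlib

NUM_SALTS = 8  # illustrative; roughly match the number of nodes

def salted_key(row_key):
    """Prefix a deterministic salt so otherwise-sequential keys spread
    across NUM_SALTS separate key ranges instead of one hot range."""
    salt = int(hashlib.md5(row_key.encode()).hexdigest(), 16) % NUM_SALTS
    return f"{salt}#{row_key}"

# Sequential keys no longer land in one contiguous range:
prefixes = {salted_key(f"event-{i}").split("#")[0] for i in range(100)}
```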

A
15
Q

Causes of slower performance

  • The table’s schema is not designed correctly.
  • The rows in your Cloud Bigtable table contain large amounts of data.
  • The rows in your Cloud Bigtable table contain a very large number of cells.
  • The Cloud Bigtable cluster doesn’t have enough nodes.
  • The Cloud Bigtable cluster was scaled up or scaled down recently.
  • The Cloud Bigtable cluster uses HDD disks.
  • There are issues with the network connection.
A
16
Q

Autoscaling

  • Stackdriver metrics can be used for programmatic scaling
  • Client libraries query metrics
  • Update cluster node counts via the API
  • Rebalancing of tablets takes time; performance may not improve for ~20 minutes
  • Adding nodes to a cluster doesn’t fix a bad schema
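The metrics-driven loop above reduces to a small decision function. This is a sketch under assumed parameters (the 70% CPU target, the 3–30 node bounds, and the function itself are illustrative, not a Google-provided API); the caller would apply the result through the admin API and wait out the ~20-minute rebalancing before re-evaluating:

```python
import math

def target_node_count(current_nodes, cpu_utilization,
                      target_utilization=0.7, min_nodes=3, max_nodes=30):
    """Scale the node count so average CPU moves back toward the target,
    clamped to the cluster's allowed size range."""
    desired = math.ceil(current_nodes * cpu_utilization / target_utilization)
    return max(min_nodes, min(max_nodes, desired))

# At 90% CPU, a 3-node cluster should grow to 4 nodes.
```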
A
17
Q

Replication

  • Adding additional clusters automatically starts replication, i.e. data synchronization.
  • Replication is eventually consistent.
  • Replication improves read throughput but does not affect write throughput
  • Replication is used for:
A