GCP Professional Data Engineer Cert Flashcards
Relational Databases
Data organized into tables with relationships between them
Google Cloud SQL: managed SQL instances, so there is little to set up. Multiple database engines such as MySQL. Scalability and availability: vertically scales to 64 cores. Secure: connect via the Cloud SQL Proxy or SSL/TLS, or use private IPs. Also provides maintenance windows, automated backups, and point-in-time recovery
Importing MySQL data: InnoDB mysqldump export/import, CSV import, or external replica promotion (requires binary log retention)
PostgreSQL instances are another option: automated maintenance and high availability, though some PostgreSQL features and extensions are unsupported
Import PostgreSQL commands: SQL dump export/import, CSV import
Cloud Firestore
- Fully managed NoSQL document store: serverless, with autoscaling
- Realtime database with mobile SDKs, Android and iOS client libraries, and frameworks for popular programming languages
- Strong scalability and consistency-horizontal autoscaling
Multiple documents are bundled into a collection
Documents can contain subcollections (e.g. a user's messages)
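A minimal sketch of the document/collection model using the google-cloud-firestore Python client; the "users" and "messages" collection names are just placeholders, and credentials/project are assumed to come from the environment.
```python
# Minimal Firestore sketch; "users"/"messages" are hypothetical collection names.
from google.cloud import firestore

db = firestore.Client()

# A document lives inside a collection.
user_ref = db.collection("users").document("alice")
user_ref.set({"name": "Alice", "signup": firestore.SERVER_TIMESTAMP})

# Documents can hold subcollections, e.g. a user's messages.
user_ref.collection("messages").add({"text": "hello", "read": False})
```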
Cloud Spanner
- Managed SQL -compliant DB-SQL schemas and queries with ACID transactions
- Horizontally scalable: Strong consistency across rows, regions from 1 to 1,000s of nodes
- Highly available-automatic global replication, no planned downtime and a 99.999% SLA
High Cost
CAP Theorem
Consistency: every read reflects the most recent write, according to specific rules. Availability: the system is always available to serve queries. Partition Tolerance: the system must tolerate failures and the loss of communication between partitions
A distributed system can only guarantee two of the three at once
Spanner is strongly consistent and highly available; when forced to choose, it favors consistency over availability. It runs on Google's global private network and offers five 9s of availability
Cloud Spanner Architecture
An instance is an allocation of resources: an instance configuration (regional or multi-regional) and an initial number of nodes
Regional configuration: a region contains one or more zones, and each zone holds a replica of the database. For each instance you specify a node count (starting at 1); each node is a virtual machine that provides compute for every replica. Increasing the node count adds machines for more computing power, while the number of replicas stays the same.
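A minimal query sketch with the google-cloud-spanner Python client; the instance ID, database ID, and table are placeholders.
```python
# Minimal Cloud Spanner read sketch; instance/database/table names are placeholders.
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("test-instance")
database = instance.database("test-db")

# Reads execute against a strongly consistent snapshot, regardless of
# which replica serves the request.
with database.snapshot() as snapshot:
    for row in snapshot.execute_sql("SELECT SingerId, FirstName FROM Singers"):
        print(row)
```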
Cloud Memorystore
In memory database
1. Fully managed Redis Instance-provisioning, replication, failover-fully automated
2. Basic tier: an efficient cache that can withstand a cold restart and a full data flush
3. Standard tier: adds cross-zone replication and automatic failover
Benefits: no need to provision your own VMs, scale instances with minimal impact, private IPs and IAM, automatic replication and failover
Creating an Instance: version 3.2 or 4, choose service tier and region, memory capacity 1-300 GB (determines network throughput), add configuration parameters
Connecting to Instances: Compute Engine, Kubernetes Engine, App Engine, Cloud Function(server-less VPC connector)
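As a rough sketch, connecting from a client on the same VPC looks like plain Redis; the open-source redis-py client is assumed, and the host IP is a placeholder for the instance's private IP.
```python
# Minimal Memorystore (Redis) sketch; 10.0.0.3 is a placeholder private IP.
import redis

r = redis.Redis(host="10.0.0.3", port=6379)

# Session-cache style usage: store a value with a one-hour TTL, then read it back.
r.set("session:user42", "logged-in", ex=3600)
print(r.get("session:user42"))
```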
Import and Export: Export to RDB backup (beta): admin operations not permitted during export, may increase latency, RDB file written to Cloud Storage
Import from RDB backup: overwrites all current instance data; the instance is unavailable during the import process
Use Cases: Redis is commonly used as a session cache (logins, shopping carts), as a message queue that enables loosely-coupled services, or as an advanced Pub/Sub message broker
Comparing Storage Options
- Is this structured or unstructured data? Structured: SQL data, NoSQL data, analytics data, keys and values. Unstructured: binary blobs, videos, images, proprietary files. For unstructured data, choose Cloud Storage.
- Is the data going to be used for analytics? Low latency vs warehouse. Low latency: petabyte scale, single-key rows, time series or IoT data: choose Cloud Bigtable. Warehouse: petabyte scale, analytics warehouse, SQL queries: choose BigQuery.
- Is this relational data? Horizontal scaling vs vertical scaling. Horizontal scaling: ANSI SQL, global replication, high availability and consistency; it's expensive, but worth it if the client can afford it (most financial institutions would probably use this): choose Cloud Spanner. Vertical scaling: MySQL or PostgreSQL, managed service, high availability: choose Cloud SQL.
- Is the data non-relational? NoSQL vs key/value. NoSQL: fully managed document database, strong consistency, mobile SDKs and offline data: choose Cloud Firestore. Key/value: managed Redis instances with standard Redis behavior: choose Cloud Memorystore.
Streaming
Continuous collection of data, near real time analytics, windows and micro batches
Batch
Data gathered with a defined time window, large volumes of data, data from legacy systems
No-SQL
Anything that is not SQL: key/value stores, JSON document stores; tools like MongoDB and Cassandra
SQL
Row-based tabular data; relational: tables connect to other tables through queries
On-Line Analytical Processing (OLAP)
Low volume of long running queries
Aggregated historical data-purchasing analytics
On-Line Transactional Processing (OLTP)
High volume of short transactions, high integrity, sql
Modifies the database
Defines Big Data
- Volume: Scale of information being handled by data processing systems
- Velocity: Speed at which data is being processed, ingested, analyzed, and visualized
- Variety: The diversity of data sources, formats, and quality.
Map Reduce
A programming model-Map and Reduce functions
Distributed Implementation
Created at Google to solve the problem of processing very large datasets across many machines
Map Function
Takes an input pair and produces a set of intermediate key/value pairs
Reduce Function
Merges intermediate values associated with the same intermediate key, forms a smaller set of values
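A toy, single-machine word-count sketch of the model; a real MapReduce implementation distributes the map, shuffle, and reduce steps across workers.
```python
# Toy word count illustrating the Map and Reduce functions (not distributed).
from collections import defaultdict

def map_fn(document):
    # Map: emit intermediate (key, value) pairs for one input.
    for word in document.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce: merge all values that share the same intermediate key.
    return key, sum(values)

documents = ["the quick brown fox", "the lazy brown dog"]

intermediate = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        intermediate[key].append(value)   # the "shuffle": group by key

print([reduce_fn(k, v) for k, v in intermediate.items()])
```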
The model standardized the framework; the implementation abstracts away the distributed computing details: parallelizing and executing, partitioning, scheduling, and fault tolerance
Splits all the jobs to small chunks
Master and worker cluster model
Failed worker jobs reassigned
Worker files buffered to local disk
Partitioned output files
Hadoop and HDFS
Named after a toy elephant; inspired by the Google File System; originated in the Apache Nutch project; became its own sub-project in 2006
Modules:
- Hadoop Common: the base module with shared utilities and startup scripts
- Hadoop Distributed File System (HDFS): a distributed, fault-tolerant file system that runs on commodity hardware as part of a Hadoop cluster
- Hadoop YARN: handles resource management tasks like job scheduling and monitoring for Hadoop jobs
- Hadoop MapReduce: Hadoop's own implementation of the MapReduce model, which includes libraries for map and reduce functions, partitioning, reduction, and custom job configuration parameters
HDFS Architecture-can help with Cloud Dataproc
One server runs the Name Node, and the Name Node holds the filesystem metadata
Other servers run Data Nodes, which store very large files across the cluster as a series of blocks
Nodes are grouped into racks, and rack awareness is used to design the shortest network path possible
The client asks the Name Node for block locations, then reads data from multiple Data Nodes across racks
Data is replicated across servers/clusters for fault tolerance
The YARN architecture is similar: one server runs the Resource Manager, and each worker server runs a Node Manager. The client sends jobs to the Resource Manager; on individual workers, the Node Manager process handles local resources, requests tasks from the master, and returns the results
Apache Pig-A high level framework for running MapReduce jobs on Hadoop clusters
Platform for analyzing large datasets
Pig Latin defines analytics jobs: merging, filtering, and transformation-high level, with SQL-like simplicity
Good for ETL jobs since it has a procedural data flow
And it is an abstraction for MapReduce
Apache Pig compiles these instructions into MapReduce jobs, which are then sent to Hadoop for parallel processing across the cluster
Apache Spark
MapReduce's linear flow of data was an issue: reading from disk, mapping across the data, reducing the results, and writing back to disk
Apache Spark: a general-purpose cluster-computing framework that allows concurrent computational jobs to be run across massive datasets
It uses resilient distributed datasets (RDDs), read-only multisets of data, with the working set held as a form of distributed shared memory
Spark Modules
Spark SQL-structured data in Spark stored in the DataFrame abstraction, with programmatic querying via the DataFrames API
Spark Streaming-streaming data ingestion in addition to batch processing-very small batches
MLLib-machine learning library, machine learning algorithms-classification, regression, decision trees
GraphX-iterative graph computation
Supports languages: Python, Java, Scala, R, SQL
MUST have 2 things: a cluster manager (e.g. YARN or Kubernetes) and a distributed storage system (e.g. HDFS, Apache HBase, Cassandra)
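A minimal PySpark word-count sketch; it assumes pyspark is installed, and the gs:// input path is a placeholder (on Dataproc the GCS connector makes such paths readable).
```python
# Minimal PySpark sketch; the gs:// path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("gs://my-bucket/input.txt")  # RDD from distributed storage
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)                         # in-memory aggregation
)
print(counts.take(10))
spark.stop()
```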
Hadoop vs Spark
Hadoop: Slow disk storage, high latency, slow, reliable batch processing
Spark: Fast memory storage, low latency, stream processing, 100x faster in-memory, 10x faster on disk, more expensive
Apache Kafka
Publish/subscribe to streams of records
Like a message bus but for data
High throughput and low latency-ingesting millions of events per second from devices
Ex: Handling >800 Billion messages a day at LinkedIn
Four main APIs in Kafka: Producer-allows an app to stream records to a Kafka topic. Consumer-allows an app to subscribe to one or more topics and process the stream of records contained within. Streams-allows an application to act as a stream processor itself, transforming data and sending it back to Kafka. Connector-provides reusable connectors that link Kafka topics to existing external systems
Kafka vs Pub/Sub
Kafka: Guaranteed message ordering, tunable message retention, polling(Pull) subscriptions only, unmanaged
Pub/Sub: No message ordering guaranteed, 7 day maximum message retention, pull or push subscription, managed
Pub Sub Intro
Message Bus takes care of all messages between devices
Pub/Sub splits it in different topics-anything can publish a message to a topic or choose to receive a message from a topic.
Information from users/apps are published to a topic
Topics are handled by the message bus, which introduces resilience-Pub/Sub acts as a shock absorber
Cloud Pub/Sub: global messaging and event ingestion, serverless and fully managed, 500 million messages per second, 1TB/s of data
Pub/Sub Great Features-Multiple publisher/subscriber patterns, at least once delivery, real time or batch, integrates with Cloud Dataflow
Use Case Distributing Workloads Pub/Sub
Queue up a large number of tasks in a Pub/Sub topic and distribute them amongst multiple workers, such as Compute Engine instances
Asynchronous Workflows Pub/Sub
Controls the order of events: an order can be sent into a topic, consumed by a worker system such as invoicing, then passed into a queue for the next system (such as packaging and posting) to consume
Distributing Event Notifications Pub/Sub
A system sets up new users when they register with your service: a registration publishes a message, and the system is notified to set the user up
Distributed Logging: logs can be sent to a Pub/Sub topic to be consumed by multiple subscribers, such as a monitoring system and an analytics database for later querying
Device Data Streaming Pub/Sub
Hundreds of thousands or more internet-connected devices can stream their data into Pub/Sub topics so it can be consumed on demand by your analytics systems, or transformed through Dataflow first
One-to One Pub/Sub
There is a Publisher, Topic, and then a Subscriber
Publisher sends messages to the topic in Pub/Sub
The subscriber receives the messages and reads them through their own subscription
Many to many Pub/Sub
Just like the one-to-one pattern but this has multiple topics
Publishing Messages
Create a message containing your data (a base64-encoded JSON payload of 10MB or less), then send the payload as a request to the Pub/Sub API, specifying the topic the message should be published on
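A minimal publish sketch with the google-cloud-pubsub Python client; the project and topic IDs are placeholders. The client library handles the base64 encoding required by the API, so you pass raw bytes.
```python
# Minimal Pub/Sub publish sketch; "my-project"/"my-topic" are placeholders.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "my-topic")

payload = json.dumps({"user": "alice", "action": "signup"}).encode("utf-8")

future = publisher.publish(topic_path, data=payload)  # payload must be bytes, 10MB or less
print(future.result())  # message ID once the publish succeeds
```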
Receiving Messages
Create a subscription to a topic; subscriptions are always associated with a single topic. Pull is the default delivery method: make ad hoc pull requests to the Pub/Sub API, specifying your subscription, and acknowledge each message you receive or it will be redelivered. The push delivery method sends messages to an endpoint; the endpoint must be HTTPS with a valid SSL certificate and must accept POST requests
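A minimal pull-subscription sketch; the project and subscription IDs are placeholders. Each message must be acknowledged or it will be redelivered.
```python
# Minimal Pub/Sub pull sketch; "my-project"/"my-sub" are placeholders.
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-sub")

def callback(message):
    print("Received:", message.data)
    message.ack()  # acknowledge, or the message will be redelivered

# Streaming pull: the client keeps pulling and invokes the callback per message.
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)
except TimeoutError:
    streaming_pull.cancel()
```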
Integrations Pub/Sub
Client libraries for popular languages like Python, C#, Go, Java, Node, PHP, and Ruby. Cloud Dataflow is supported, and you can use the Apache Beam SDK to read messages individually or in batches. Cloud Functions and Cloud Run are also supported. Pub/Sub is the foundation of Cloud IoT Core, which sends and receives messages from connected devices
Developing for Pub/Sub: Local Pub/Sub emulator-Google Cloud SDK and Java Runtime Environment 7+
Advanced Pub/Sub Topics
At Least Once Delivery: Each message is delivered at least once for every subscription
Undelivered Messages: deleted after the message retention duration-default 7 days-can’t be longer
Messages published before a subscription is created will not be delivered to that subscription
Subscriptions expire after 31 days of inactivity-new subscriptions with same name have no relationship to the previous subscription
Other Features
Seeking Feature: set retain acked messages to true so the subscription retains messages sent to the topic-by default messages are retained for a maximum of 7 days. You can then tell the subscription to seek to a specific point in the timeline-basically rewinding the clock to receive past messages. You can also seek to a future timestamp
Snapshots: useful if you are deploying new code. You can save a snapshot ahead of time to capture the current state of the subscription, preserving all unacknowledged and future messages
Ordering Messages: you may not receive messages in the right order-use timestamps when final order matters, or consider an alternative for transactional ordering (for example via a SQL query)
Resource Locations: Messages stored in nearest region, message storage policies allow you to control this, additional egress fees may apply
Access Control Pub/Sub
Use service accounts for authorization, grant per-topic or per-subscription permission, grant limited access to publish or consume messages
Exam Tips Pub/Sub
Think about where you can decouple data-Pub/Sub is a shock absorber, receives data globally and it can be consumed by other components at their own pace
Where can you use Pub/Sub for events. It can add event logic to a stack and it can pass events through one system to another
Be aware of Pub/Sub limitations-message data must be 10MB or less, beware of expired messages and unused subscriptions
Look for Apache Kafka in use cases, if this comes up, Pub/Sub can be a good option
Keep an eye out for Cloud IoT as a solution
Google Cloud Tasks-get familiar with it
Browse the reference architectures-Smart Analytics references
What is Dataflow
Fully managed, serverless tool; uses the open-source Apache Beam SDK; supports expressive SQL, Java, and Python APIs; real-time and batch processing; integrates with the rest of the GCP stack
Beam's unified development model allows us to reuse code across streaming and batch pipelines
Sources: Cloud Pub/Sub, BigQuery, and Cloud Storage, it can be external to GCP like Kafka
Common Sinks: Cloud Storage, BigQuery, and Bigtable; Cloud Machine Learning can be applied to sink data
Dataflow Process
You have a Pipeline, Source, and Sink
The Pipeline takes data from the Source, processes it, and places it into the Sink
Apache Beam connectors allow you to connect to the Source and the Sink, so you can read input data and write your output data to the Sink
Common Dataflow Sources
Cloud Pub/Sub, BigQuery, and Cloud Storage, it can be external to GCP like Kafka
Common Dataflow Sinks
Cloud Storage, BigQuery, and Bigtable; Cloud Machine Learning can be applied to sink data
Driver
The program you write using the Apache Beam SDK (Java or Python); it defines your pipeline
Pipeline: full set of transformations that your data undergoes from initial ingestion to final output
Driver goes to the runner
Runner
Software that manages the execution of your pipeline-a translator for the backend execution framework. It can also manage local execution of driver programs for testing and debugging
PCollections
Used in pipelines to represent data as it is transformed within the pipeline; a PCollection represents a multi-element dataset. They can represent both batch and streaming data.
Data coming from a fixed source, the dataset=Bounded, treated like a batch
Continuously updating source, dataset=Unbounded(Stream)
The PCollection is usually from reading an external source
Transform usually represents a step in your pipeline, transforms use PCollections as inputs and outputs, each transform takes one or more PCollections as inputs and generates 0 or more output PCollections
Pipeline Development Cycle
- You have to Design your pipeline first-input and output methods, structure and transformations
- Then you Create it-instantiating a pipeline object, implementing the transformations that were identified
- Testing: debugging a failed pipeline execution on a remote system is hard, so do local unit testing where possible
Considerations
- Start with the location of your data
- Input data structure and format
- Transformation objectives
- Output data structure and location
Pipeline Structures
Basic: Input linear to output
Branching: two different transforms are applied to a single PCollection, resulting in two different PCollections
Branching can also be conducted on a Transform
Pipeline branches can also be merged; you need to merge all branches of your pipeline at some point, through a Flatten or Join transform
Pipelines can also have multiple sources and they can be independently transformed
DAG
Dataflow Pipelines represent a Directed Acyclic Graph or DAG-a graph with a finite number of vertices and edges-no directed cycles
Pipeline Creation
- Create an Object
- Create a PCollection using read or create transform
- Apply multiple transforms as required
- Write out final PCollection
- Execute the pipeline using the pipeline runner (see the sketch below)
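A minimal Apache Beam (Python SDK) sketch following those steps; the gs:// paths are placeholders, and the runner comes from the pipeline options (DirectRunner locally, DataflowRunner on GCP).
```python
# Minimal Beam pipeline sketch; bucket paths are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:                     # 1. pipeline object
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input.txt")    # 2. initial PCollection
        | "Split" >> beam.FlatMap(lambda line: line.split())            # 3. transforms
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output")       # 4. write final PCollection
    )
# 5. leaving the `with` block runs the pipeline on the configured runner
```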
ParDo
generic parallel processing transform: can take an element from PCollection1 and transform it to PCollection2, can output 1, none, or multiple output elements from a single input element
User-defined function(UDF)
user written code that describes the operation to apply to each element of the input PCollection
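A minimal sketch of a user-defined DoFn applied with ParDo; `lines` stands in for an existing PCollection of strings.
```python
# Minimal ParDo/DoFn sketch; `lines` is a hypothetical PCollection of strings.
import apache_beam as beam

class ExtractWords(beam.DoFn):
    def process(self, element):
        # A DoFn may emit zero, one, or many outputs per input element.
        for word in element.split():
            yield word

# words = lines | "ExtractWords" >> beam.ParDo(ExtractWords())
```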
Aggregation Transformation
The process of computing a single value from multiple input elements-typically applied to all the elements within a window
Characteristics of PCollections
- Any data type-but all elements must be of the same type
- Don’t support random access
- Immutable or unchanging
- Boundedness-no limit to the number of elements a PCollection can contain-can be Bounded-finite number of elements or Unbounded-does not have an upper limit
- Timestamp is associated with every element of a PCollection-initially assigned by the source that results in the creation of the PCollection
Core Beam Transforms
- ParDo-generic parallel processing transform
- GroupByKey-processes collection of key value pairs, collects all values associated with a unique key
- CoGroupByKey-used when combining multiple PCollections-performs a relational join of two or more key value PCollections where they have the same key type
- Combine-requires you to provide a function that defines the logic for combining elements; it has to be associative and commutative-sum, min, max
- Flatten- merges multiple input PCollections into a single logical PCollection
- Partition-provides the logic that determines how the elements of the PCollection are split up (several core transforms are combined in the sketch after this list)
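A minimal sketch combining several of these core transforms on small in-memory PCollections created with beam.Create; the fruit data is made up for illustration.
```python
# Minimal sketch of Flatten, GroupByKey, and Combine (per key).
import apache_beam as beam

with beam.Pipeline() as p:
    sales_a = p | "CreateA" >> beam.Create([("apples", 2), ("pears", 1)])
    sales_b = p | "CreateB" >> beam.Create([("apples", 5), ("plums", 3)])

    merged = (sales_a, sales_b) | beam.Flatten()   # merge branches into one PCollection
    grouped = merged | beam.GroupByKey()           # ('apples', [2, 5]), ('pears', [1]), ...
    totals = merged | beam.CombinePerKey(sum)      # associative/commutative combine fn
    totals | beam.Map(print)
```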
Event time Dataflow
Event time is when the data element actually occurred, determined by the timestamp on the data element itself; processing time refers to the times at which the element is processed as it transits your pipeline
Windowing
Applied to a PCollection; subdivides the elements of a PCollection according to their timestamps. Do this to allow grouping or aggregating operations over unbounded collections-it groups elements into finite windows
Fixed Window
- Fixed-simplest, constant non overlapping time interval
Sliding Window
- Sliding-represent time intervals- but it can overlap, and an element can belong to more than one window-useful to take running averages of data
Per Sessions
a different session window is created in a stream when there is an interruption in the flow of events which exceeds a certain time period, apply on a per key basis-useful for irregularly distributed data with respect to time
Single global
the default-everything else goes into a single global window unless a window transform is applied
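A minimal sketch of applying each window type (durations in seconds); the tiny timestamped PCollection is made up for illustration.
```python
# Minimal windowing sketch with a made-up timestamped PCollection.
import apache_beam as beam
from apache_beam.transforms.window import (
    FixedWindows, SlidingWindows, Sessions, GlobalWindows, TimestampedValue)

with beam.Pipeline() as p:
    events = (
        p
        | beam.Create([("click", 10), ("click", 70), ("click", 700)])   # (value, event time)
        | beam.Map(lambda e: TimestampedValue(e[0], e[1]))
    )
    fixed = events | "Fixed" >> beam.WindowInto(FixedWindows(60))              # 1-minute windows
    sliding = events | "Sliding" >> beam.WindowInto(SlidingWindows(300, 60))   # 5-min windows every minute
    sessions = events | "Sessions" >> beam.WindowInto(Sessions(600))           # 10-min gap closes a session
    global_w = events | "Global" >> beam.WindowInto(GlobalWindows())           # everything in one window
```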
Watermark
The system's notion of when all the data for a certain window can be expected to have arrived. Data is late when the watermark has moved past the end of the window and further data elements arrive with a timestamp inside that window
Triggers
- Event-time triggers, based on the event timestamps (e.g. AfterWatermark; see the sketch after this list)
- Processing time
- Data driven-when data in a particular window meets a certain criterion
- Composite-combine other triggers in different ways
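A rough sketch of attaching a trigger when windowing, using names from the Beam Python SDK; the fixed one-minute windows, the 30-second early firing, and the sample data are all assumptions for illustration.
```python
# Minimal trigger sketch: fire early on processing time, then at the watermark.
import apache_beam as beam
from apache_beam.transforms.trigger import AfterWatermark, AfterProcessingTime, AccumulationMode
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
    events = (
        p
        | beam.Create([("user1", 5), ("user2", 65)])          # (value, event-time seconds)
        | beam.Map(lambda e: TimestampedValue(e[0], e[1]))
    )
    windowed = events | beam.WindowInto(
        FixedWindows(60),
        trigger=AfterWatermark(early=AfterProcessingTime(30)),  # early firings + event-time firing
        accumulation_mode=AccumulationMode.ACCUMULATING,         # re-firings include earlier results
    )
```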
Pipeline Access
Run Cloud Dataflow pipelines
1. Can be run locally
2. Submit pipeline to GCP Dataflow managed service
GCP service accounts
1. Cloud Dataflow service-uses Dataflow service account
2. Worker instances-Controller service account
Cloud Dataflow Managed Service
- The pipeline gets submitted to the GCP Dataflow Service
- The Dataflow will create a Job
- The Job creates managers and workers to carry out various tasks
- For the execution, the workers need files/resources from Cloud Storage
- The Job can be monitored with the Cloud Dataflow Monitoring Interface or the Cloud Dataflow Command-line Interface
Cloud Dataflow Service Account
- Automatically created when Cloud Dataflow project is created
- Manipulates job resources
- Assumes the Cloud Dataflow service agent role
- Has Read/Write Access to project resources
Controller Service Account-used by the workers (Compute Engine instances)
- Compute Engine instances-execute pipeline operations
- Runs metadata operations that don't run on local clients or Compute Engine workers, e.g. determining the size of a file in Cloud Storage
- A user-managed controller service account lets workers use resources with fine-grained access control
Security Mechanisms
- Submission of the pipeline-users have to have the right permissions
- Evaluation of the pipeline-data is encrypted and not persisted beyond the evaluation of the pipeline; communication between workers happens over a private network, subject to the project's permissions and firewalls; you can specify the region and zone
- Accessing telemetry or metrics-encrypted at rest-controlled by project’s permissions
- You can also use Cloud Dataflow IAM roles
Regional Endpoints in Dataflow
- Manages metadata about Cloud Dataflow jobs
- Controls Cloud Dataflow workers
- Automatically selects best zone
Good reasons for regional endpoints
1. Security and compliance
2. Data locality
3. Resiliency
Machine Learning with Cloud Dataflow
- Handles data extraction from Cloud Storage
- Data Preprocessing in Apache Beam pipeline through Cloud Dataflow, TensorFlow API used to normalize some values between 0 and 1, the Beam partition transform is used to split the data set into the training data set and the evaluation data set
- TensorFlow is used to train a model locally on your machine or through Cloud Machine Learning-doesn’t use Cloud Dataflow
- Predictions-Cloud Dataflow reads data from Pub/Sub, obtains predictions, and writes the results into another Pub/Sub topic
Benefits of Dataflow
You can use customer-managed encryption keys
Batch pipelines can be processed in a cost-effective manner with Flexible Resource Scheduling (FlexRS)-uses advance scheduling, the Cloud Dataflow Shuffle service, and preemptible VMs
Cloud Dataflow is a great target for MapReduce migrations-on-premises MapReduce jobs can be rebuilt on Cloud Dataflow
Cloud Dataflow with Pub/Sub Seek-replay and reprocess previously acknowledged messages-especially in bulk
Cloud Dataflow SQL
- Develop and run Cloud Dataflow jobs from the BigQuery web UI
- Cloud Dataflow SQL (ZetaSQL variant) integrates with Apache Beam SQL
Apache Beam SQL-Query bounded and unbounded PCollections, Query is converted to a SQL transform
Cloud Dataflow SQL-Utilise existing SQL skills, join streams with BigQuery tables, query streams or static datasets, write output to BigQuery for analysis and visualization
Dataflow Exam Tips
Beam and Dataflow are the preferred solution for data processing pipelines, especially for streaming data
Pipeline: represents the complete set of stages required to read data perform any transformations and write data
PCollection: represents a multi-element dataset that is processed by the Pipeline
ParDo: core parallel processing function of Apache Beam which can transform elements of an input PCollection into an output PCollection.
DoFn: template you use to create user-defined functions that are referenced by a ParDo
Sources-where data is read from
Sinks-where data is written to
Window: allows streaming data to be grouped into finite collections according to time or session-based windows
Watermark: indicates when Dataflow expects all data in a window to have arrived; data arriving past the watermark is considered late
Dataflow is normally the preferred solution for data ingestion pipelines
Cloud Composer is sometimes used for ad hoc orchestration/provide manual control of Dataflow pipelines themselves
What is Dataproc?
A managed cluster service for Hadoop and Apache Spark
Managed is preferable because it is low cost and you can control which clusters to grow and which clusters to turn off
Dataproc Architecture
Master: Dataproc creates a master node running the YARN Resource Manager and the HDFS Name Node
It also creates the Worker Nodes
Nodes come pre-installed with Hadoop, Apache Spark, Zookeeper, Hive, Pig, Tez, and other tools like Jupyter Notebooks and the GCS connector
Storage and configuration are handled by Dataproc
Dataproc benefits
- Cluster actions complete in ~90 seconds
- Pay per second, with a 1-minute minimum
- Scale up/down or turn off at will
Using Dataproc
You can submit Hadoop/Spark jobs; enable autoscaling if necessary to cope with the load of the job; output to GCP services like Google Cloud Storage, BigQuery, and Bigtable; and monitor with Stackdriver-fully integrated logging and monitoring for job performance and output
Cluster Location
Regional: isolate resources used for Dataproc into one region, like us-east1 or europe-west1
Global: Resources not isolated to a single region-can place cluster in any zone worldwide
Single Node Cluster
A single VM that runs both the master and worker processes-can't autoscale
Standard Cluster
Has a Master VM that runs the YARN Resource Manager and the HDFS Name Node, and two or more Worker Nodes that each run a YARN Node Manager and an HDFS Data Node-the disk is customizable. There are also Pre-emptible Workers, which sometimes help with large jobs but can't provide storage for HDFS
High Availability Cluster
You have three Masters with YARN and HDFS configured to run in high availability mode-no interruptions
Submitting Jobs
- Gcloud command line
- GCP Console
- Dataproc API
- SSH to Master Node
Monitoring and Logging
- Use Stackdriver Monitoring to monitor cluster health
- Cluster/yarn/allocated_memory_percentage
- Cluster/hdfs/storage_utilization
- Cluster/hdfs/unhealthy_blocks
Custom Clusters
You can customize the Dataproc default image: Google provides a generation script, you apply the customization script you have written to add custom packages, and the resulting custom image is stored in Google Cloud
You can also have:
Custom cluster properties-so you can change the values
You can add initialization actions that are custom to the cluster-scripts loaded to a Cloud Storage Bucket-mostly for Staging binaries
You can also add custom Java/Scala dependencies-saves you from precompiling them into your jobs
Autoscaling in Dataproc
Huge Bonus: you can create lightweight clusters and have them automatically scale up to the demands of the job-written in YAML, has configuration numbers for primary workers and secondary workers
When to not use Autoscaling
- When data is stored on cluster HDFS
- When running Spark Structured Streaming
- To scale idle clusters down to zero
- When using YARN node labels
Workflow Templates
Written in YAML; can specify multiple jobs with different configs and parameters to be run in succession
Workflow Templates have to be created, then instantiated with GCloud-you can send jobs to a new cluster each time or to an existing cluster
Advanced Compute Features Dataproc
- Local SSDs-faster runtimes
- GPUs to nodes-for machine learning
Cloud Storage Connector
- Use GCS instead of HDFS
- Cheaper than persistent disk
- High availability and durability
- Decouple storage from cluster lifecycle
Exam Tips
Know when to choose Dataproc: Quickly migrating Hadoop and Spark workloads into Google Cloud Platform
Understand the benefits of Dataproc over a self-managed Hadoop or Spark cluster: ease of scaling, being able to use Cloud Storage instead of HDFS, and the connectors to other GCP services like BigQuery and Bigtable
Know Cluster Options: When to pick standard vs high availability, autoscaling and ephemeral
Get to know the open-source Big Data ecosystem-Hadoop, Spark, Zookeeper, Hive, Tez, and Jupyter
Know when to choose Dataflow-sometimes it is the preferred product for big data ingesting, like in streaming workloads and it implements the Apache Beam SDK
Bigtable Concepts
Managed wide-column NoSQL database-series of key value pairs where the values are split into columns
Has very high throughput-around 10,000 reads per second per node
Also has low latency-around 6 milliseconds
Scales linearly
Out of the box high availability-cross cluster replication
Developed internally by Google and was used for Google Earth, Finance, and Web Indexing
HBase was created as the open-source implementation of the Bigtable model and was adopted as a top-level Apache project; Cloud Bigtable supports the Apache HBase library for Java
Cloud Bigtable
- Has a ROW KEY as the only index
- Then it can be attached to columns
- The columns can be grouped by families
- The empty values don’t take up any space since it’s a sparse db
- Scaled to thousands of columns and billions of rows
Important Bigtable Features
- Blocks of contiguous rows are sharded into tablets
- Tablets are chunks of sorted rows-put together they form a complete table-managed by nodes in your cluster
- Tablet data is stored in Google Colossus, which is how cluster sizes can scale without moving the underlying data
- Splitting, merging, and rebalancing happen automatically
Bigtable Scenarios
Suited well for financial, marketing data and transactional data
Also good for time series data and data from IoT devices
Good for streaming data and machine learning applications
Bigtable Architecture
You create an instance
You have an instance type, storage type, and app profiles-which describe parameters for incoming connections
To connect, you use an instance ID and an application profile
Inside the instance, you have clusters
Inside the clusters you have nodes which are workhorses of Bigtable
The flexibility of Data Storage comes from separating our cluster nodes and storing data in Colossus
Nodes control tablets and a tablet can’t be shared by more than one node
Instance Types
- Production-1+ clusters, 3+ Nodes per cluster
- Development-a single-node cluster for development work; a development instance can't use replication, doesn't have an SLA, and is a cheaper option
SSD Storage Type
Almost always the right choice: the fastest and most predictable option, 6ms latency for 99% of reads and writes, and each node can process up to 2.5 TB of SSD data
HDD Storage Type
Each node can process up to 8 TB of HDD data, but throughput is limited and row reads are only about 5% of the speed of SSD reads. Only suitable when storing at least 10 TB of infrequently-accessed data with no latency sensitivity-otherwise you may end up spending more money on cluster nodes
Application Profiles
- Custom application specific settings for handling incoming connections
- Single or multi-cluster routing
- In single-cluster routing: it will route to a single cluster that you define, even if you have multiple clusters in the instance
- Multi-cluster routing: routes to the nearest available cluster; if that cluster is unavailable, traffic goes to the next cluster
- You have to ask if data needs single row transactions-then you have to have single routing
Bigtable Configuration
- Instances can run up to four clusters
- Clusters exist in a single zone
- Up to 30 nodes per project
- Maximum of 1,000 tables per instance
Bigtable Access Control
- Cloud IAM roles
- Applied at project or instance level to:
- Restrict access or administration.
- Restrict reads and writes
- Restrict development instances or production access
Data Storage Model
Only ROW KEYs can be indexed
Column families allow us to grab only what we need
Column names are called column qualifiers
You can write new data values and the old ones aren’t overwritten
You can control how many versions are stored and for how long-it's configurable with detailed granularity; values are stored as arrays of bytes
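A minimal write/read sketch with the google-cloud-bigtable Python client; the project, instance, table, and "readings" column family are placeholders.
```python
# Minimal Bigtable sketch; all IDs and the "readings" column family are placeholders.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("sensor-data")

# Write one row, addressed only by its row key; values are bytes stored in
# column-family:qualifier cells (new writes add versions, they don't overwrite).
row = table.direct_row(b"sensor42#20240101")
row.set_cell("readings", "temp", b"21.5")
row.commit()

# Read the row back by its key, the only index.
result = table.read_row(b"sensor42#20240101")
print(result.cells["readings"][b"temp"][0].value)
```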
Alternative Options to Bigtable
- Need SQL Support OLTP: Cloud SQL
- Need Interactive Queries OLAP and cheaper: BigQuery
- Need structured NoSQL Documents: Cloud Firestore
- Need In-memory Key/Value Pairs: Memorystore
- Need Realtime Database: Firebase
Important Bigtable Info
Rows are sorted lexicographically by row key-design of the row key is very important
Atomic operations are by row only-be careful when updating
Sparse table system-doesn’t hurt to have a lot of columns/families even if they don’t apply to every entity
Row sizing: individual values should be no larger than 10MB, and the total row (not including the key) should be under 100MB