3_Pub/Sub Flashcards
What is Cloud Pub/Sub?
- Global-scale messaging buffer/coupler.
- Serverless, NoOps (fully managed), global availability, auto-scaling.
- Decouples senders and receivers.
- Real-time or batch.
- 500 million messages per second
- 1TB/s of data
- Message size limit: 10MB
Push and Pull
- Pub/Sub can either push messages to subscribers, or subscribers can pull messages from Pub/Sub (default).
- Push = lower latency, more real-time.
- Push subscribers must be Webhook Endpoints that accept POST over HTTPS.
- Pull is ideal for large volumes of messages, and uses batch delivery.
- Pull is preferred when efficient, high-throughput message processing is required.
- In push delivery, one message per request is sent.
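- Pulled messages must be acknowledged explicitly by the subscriber; in push delivery, a success HTTP response from the endpoint serves as the acknowledgement.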
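The push bullets above can be made concrete with a rough sketch of what a webhook endpoint does with the POST body. Push requests wrap the message in a JSON envelope whose `data` field is base64-encoded. The function name `decode_push_envelope` and the sample project, subscription, and attribute values are illustrative assumptions. A real endpoint would also return a success HTTP status to acknowledge the message.

```python
import base64
import json


def decode_push_envelope(body: bytes) -> dict:
    """Extract the payload and metadata from a Pub/Sub push request body."""
    envelope = json.loads(body)
    message = envelope["message"]
    # The payload arrives base64-encoded in the "data" field.
    payload = base64.b64decode(message.get("data", "")).decode("utf-8")
    return {
        "message_id": message.get("messageId"),
        "attributes": message.get("attributes", {}),
        "payload": payload,
        "subscription": envelope.get("subscription"),
    }


# Hypothetical request body, shaped like a single push delivery.
sample_body = json.dumps({
    "message": {
        "data": base64.b64encode(b"hello").decode("ascii"),
        "messageId": "123",
        "attributes": {"origin": "demo"},
    },
    "subscription": "projects/my-project/subscriptions/my-sub",
}).encode("utf-8")

decoded = decode_push_envelope(sample_body)
print(decoded["payload"])
```

Because push sends one message per request, the handler only ever has to decode a single envelope.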
At Least Once Delivery
- Each message is delivered at least once for every subscription.
- Unacknowledged messages are deleted after the message retention duration (configurable from 10 minutes to 7 days; the default is 7 days).
- Messages published before a subscription is created will not be delivered to that subscription.
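Since a message may arrive more than once, subscribers are usually written to be idempotent. The sketch below shows one possible approach: remembering message IDs that were already processed. The class name `IdempotentHandler` is hypothetical, and the in-memory set is an assumption for illustration only; a real subscriber would need a durable, bounded store.

```python
from typing import Callable


class IdempotentHandler:
    """Run `process` at most once per message ID (in-memory sketch)."""

    def __init__(self, process: Callable[[str], None]) -> None:
        self._process = process
        self._seen_ids = set()  # assumption: IDs fit in memory

    def handle(self, message_id: str, payload: str) -> bool:
        """Process a message once; return False for a duplicate delivery."""
        if message_id in self._seen_ids:
            return False
        self._process(payload)
        # Record the ID only after processing succeeded.
        self._seen_ids.add(message_id)
        return True


processed = []
handler = IdempotentHandler(processed.append)
# "m1" is delivered twice, as at-least-once delivery allows.
results = [
    handler.handle("m1", "a"),
    handler.handle("m1", "a"),
    handler.handle("m2", "b"),
]
print(results, processed)
```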
Connecting Kafka to GCP
Does Pub/Sub replace Kafka?
- Not always
- Hybrid workloads:
- Interact with existing tools and frameworks
- Don’t need the global reach or scaling capabilities of Pub/Sub
- Can use both: Kafka for on-premises and Pub/Sub for GCP in same data pipeline
How do we connect Kafka to GCP?
Overview on Connectors:
- Open-source plugins that connect Kafka to GCP
- Kafka Connect: one optional “connector service”
- Kafka Connect is the pluggable, declarative data integration framework for Kafka. It connects data sinks and sources to Kafka, letting the rest of the ecosystem do what it does so well with topics full of events.
- Connectors exist to connect Kafka directly to Pub/Sub, Dataflow and BigQuery (among others)
- Ex: KafkaIO for Beam/Dataflow
Additional Terms
- Source connector: an upstream connector that streams data from an external system into Kafka
- Sink connector: a downstream connector that streams data from Kafka to an external system
Subscription Lifecycle
- Subscriptions expire after 31 days of inactivity.
- New subscriptions with the same name have no relationship to the previous subscription.
- A snapshot of the subscription is the easiest way to safeguard against a bad application deployment, because it provides point-in-time recovery. If the previous version of the application needs to be redeployed, the subscription can be rolled back to the point in time of the snapshot, and all subsequent messages will be reprocessed.
Monitor Pub/Sub
Cloud Monitoring lets you create monitoring dashboards and alerting policies, or access the metrics programmatically.
Monitoring the backlog
To ensure that your subscribers are keeping up with the flow of messages, create a dashboard that shows the following backlog metrics, aggregated by resource, for all your subscriptions:
- subscription/num_undelivered_messages to see the number of unacknowledged messages
- subscription/oldest_unacked_message_age to see the age of the oldest unacknowledged message in the subscription’s backlog
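When the two backlog metrics are accessed programmatically, they are typically selected with a Cloud Monitoring filter string. The sketch below only builds such filters. The helper name `backlog_filter` and the subscription ID `my-sub` are hypothetical, and the filter layout reflects my understanding of the Monitoring filter syntax rather than an authoritative reference.

```python
BACKLOG_METRICS = (
    "subscription/num_undelivered_messages",
    "subscription/oldest_unacked_message_age",
)


def backlog_filter(metric: str, subscription_id: str) -> str:
    """Build a Cloud Monitoring filter for one Pub/Sub subscription metric."""
    return (
        f'metric.type = "pubsub.googleapis.com/{metric}" '
        f'AND resource.type = "pubsub_subscription" '
        f'AND resource.labels.subscription_id = "{subscription_id}"'
    )


# One filter per backlog metric for a hypothetical subscription.
filters = [backlog_filter(metric, "my-sub") for metric in BACKLOG_METRICS]
for f in filters:
    print(f)
```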
Monitoring ack deadline expiration
In order to reduce end-to-end latency of message delivery, Pub/Sub allows subscriber clients a limited amount of time to acknowledge a given message (known as the “ack deadline”) before re-delivering the message. If your subscribers take too long to acknowledge messages, the messages will be re-delivered, resulting in the subscribers seeing duplicate messages. This can happen for a number of reasons:
- Your subscribers are under-provisioned (you need more threads or machines).
- Each message takes longer to process than the message acknowledgement deadline. Google Cloud Client Libraries generally extend the deadline for individual messages up to a configurable maximum. However, a maximum extension deadline is also in effect for the libraries.
- Some messages consistently crash the client.
Excessive ack deadline expiration rates can result in costly inefficiencies in your system. You pay for every redelivery and for attempting to process each message repeatedly.
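To illustrate how processing time, the ack deadline, and the client's maximum extension interact, here is a deliberately simplified toy model. It is not how the client libraries are implemented: it assumes each message gets either the base deadline or the extension cap, whichever is larger, and counts the messages that would exceed it. The function name and all numbers are made up for the example.

```python
from typing import Iterable


def count_expired_deadlines(
    processing_seconds: Iterable[float],
    ack_deadline_seconds: float,
    max_extension_seconds: float = 0.0,
) -> int:
    """Count messages whose processing outlasts the allowed ack window."""
    # Toy assumption: the client extends the deadline up to the cap,
    # so the allowed window is the larger of the two values.
    allowed = max(ack_deadline_seconds, max_extension_seconds)
    return sum(1 for t in processing_seconds if t > allowed)


# Hypothetical per-message processing times, in seconds.
times = [2, 8, 15, 70]

without_extension = count_expired_deadlines(times, ack_deadline_seconds=10)
with_extension = count_expired_deadlines(
    times, ack_deadline_seconds=10, max_extension_seconds=60
)
print(without_extension, with_extension)
```

In this model, extending the deadline rescues the 15-second message, but the 70-second one still expires and would be redelivered.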
Replaying and purging messages
The Pub/Sub subscriber data APIs, such as pull, provide limited access to message data. Normally, acknowledged messages are inaccessible to subscribers of a given subscription. In addition, subscriber clients must process every message in a subscription even if only a subset is needed.
The Seek feature extends subscriber functionality by allowing you to alter the acknowledgement state of messages in bulk. For example, you can replay previously acknowledged messages or purge messages in bulk. In addition, you can copy the state of one subscription to another by using seek in combination with a Snapshot.
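The effect of Seek on acknowledgement state can be sketched with a small in-memory model. This is not the Pub/Sub API. `ToySubscription` and its methods are hypothetical names, publish times are plain integers, and the model assumes acknowledged messages are still retained so that they can be replayed.

```python
from dataclasses import dataclass, field


@dataclass
class ToySubscription:
    # Maps publish time -> payload; a stand-in for retained messages.
    messages: dict = field(default_factory=dict)
    acked: set = field(default_factory=set)

    def ack(self, publish_time: int) -> None:
        self.acked.add(publish_time)

    def backlog(self) -> list:
        """Payloads of unacknowledged messages, oldest first."""
        return [self.messages[t] for t in sorted(self.messages)
                if t not in self.acked]

    def seek_to_time(self, timestamp: int) -> None:
        """Mark messages published before `timestamp` acked, the rest unacked."""
        self.acked = {t for t in self.messages if t < timestamp}

    def snapshot(self) -> frozenset:
        """Capture the current acknowledgement state."""
        return frozenset(self.acked)

    def seek_to_snapshot(self, snap: frozenset) -> None:
        """Restore the acknowledgement state captured by a snapshot."""
        self.acked = set(snap)


sub = ToySubscription(messages={1: "a", 2: "b", 3: "c"})
sub.ack(1)
sub.ack(2)
snap = sub.snapshot()        # state: messages 1 and 2 acked
sub.ack(3)

sub.seek_to_time(2)          # replay: everything from time 2 onward
replayed = sub.backlog()
sub.seek_to_time(4)          # purge: seek past the newest message
purged = sub.backlog()
sub.seek_to_snapshot(snap)   # roll back to the snapshot state
restored = sub.backlog()
print(replayed, purged, restored)
```

Seeking to a past time replays messages, seeking to a time after the newest message purges the backlog, and seeking to a snapshot restores the state captured when it was taken.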