Chapter 6, Stream-Processing Patterns Flashcards

1
Q

What is a stream?

A

A stream can be defined as a continuous sequence of events ordered by time. The stream consists of a name and version that uniquely identify it, such as StockStream 1.0. All events in a stream have a common message format and structure. For example, StockStream has a JSON format and contains symbol, price, and volume in its structure. Having a consistent format and structure allows events in the stream to be processed in an automated manner, using stream-processing systems. The stream version provides a way to safely modify the structure and evolve the stream over time.

330

2
Q

What Is Stream Processing?

A

Stream processing is performing operations on events in motion. It can be as simple as a stateless service consuming events and transforming their format, or as complex as storing and processing stateful data in memory with low latency and reliability.
In contrast to simple event processing, stream processing supports use cases in which events need to be handled in the order they are generated. Stream-processing patterns can also remember and use previous events when making a decision. For example, detecting whether a stock price has continuously increased over the last five minutes requires remembering previous events and processing them in order, in real time.

330

3
Q

What are Streaming Data Processing Patterns and what do they focus on?

A

Streaming data processing patterns focus on how we can generate useful output by processing real-time events through transformation, filtering, aggregation, and detecting meaningful sequences of events. These capabilities enable cloud native applications to process events on the fly with low latency.
A key performance consideration is avoiding heavy use of persistent data stores. In a cloud native application, the round-trip time of accessing the data store, and the potential for contention, can add significant processing latency. Some use cases do require persistent storage, but as a general rule of thumb it should be avoided.

331

4
Q

Describe the Transformation Pattern

A

The Transformation pattern helps transform events from an event source and publish them to another system with a different format, structure, or protocol.

332

5
Q

How does the Transformation Pattern work?

A

This pattern maps the data of one event to another.
These transformations are often achieved purely with the information contained in the incoming event. But at times these transformations need other patterns, such as the Windowed Aggregation pattern.

332 Figure 6-1. XML-to-JSON transformation

For example, say we are to publish weather events to a third-party system that expects the events in JSON format with a particular structure (Figure 6-1). The relevant data from the incoming event can be extracted and mapped to the new event format. We can achieve this by using JSON and XML libraries, or by using a graphical interface or SQL-based data-mapping approaches provided by stream-processing technologies.
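
As a concrete illustration, here is a minimal Java sketch of such a transformation, assuming the org.json library is available and a hypothetical weather-event XML layout; a production system would more likely use the mapping tools of a stream-processing technology:

```java
import org.json.JSONObject;
import org.json.XML;

public class WeatherTransformer {
    // Transforms an XML weather event into the JSON structure that a
    // (hypothetical) third-party system expects.
    public static String transform(String xmlEvent) {
        JSONObject source = XML.toJSONObject(xmlEvent).getJSONObject("weather");
        JSONObject target = new JSONObject();
        // Extract only the relevant data and map it to the new format.
        target.put("sensorId", source.get("id"));
        target.put("temperature", source.getDouble("temp"));
        target.put("humidity", source.getDouble("humidity"));
        return target.toString();
    }

    public static void main(String[] args) {
        String xml = "<weather><id>S1</id><temp>23.4</temp><humidity>0.71</humidity></weather>";
        System.out.println(transform(xml));
    }
}
```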

6
Q

What are some related patterns to the Transformation Pattern?

A

The Transformation pattern can be combined with other stream data processing patterns, as data transformations can be required for incorporating results of those patterns, such as enriching events with aggregated data.

335

7
Q

Describe the Filters and Thresholds Pattern

A

Sometimes we need to filter events based on given conditions, or allow only events with values that fit within a given threshold range. The Filters and Thresholds pattern is useful for extracting only the relevant events we need.

336

8
Q

How does the Filters and Thresholds Pattern work?

A

Users provide conditions that match against the incoming events. These conditions can include exact string matches, substring matches, regular expressions, or threshold ranges when it comes to numeric values with comparison operations such as <, <=, >, >=, and ==. Often more than a single condition is required, so those conditions are consolidated by using the AND, OR, and NOT logical operations and parentheses to generate more-complex filter conditions.
This pattern extracts and processes the relevant data from the input event stream by using data-mapping techniques.

336 Figure 6-3. Filtering car events based on brand and year

If we are processing a real-time stream of car sales and are interested in only 2010 or newer Toyota vehicles, we can define a filtering condition as shown in Figure 6-3 to emit only events that satisfy the condition.
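
A minimal Java sketch of this condition, using Predicate composition over an illustrative CarSale record (the names are assumptions, not from the book):

```java
import java.util.List;
import java.util.function.Predicate;

public class CarSaleFilter {
    // A simple car-sale event; the fields mirror the example in Figure 6-3.
    record CarSale(String brand, int year) {}

    public static void main(String[] args) {
        // Consolidate two conditions with a logical AND:
        // brand == "Toyota" AND year >= 2010.
        Predicate<CarSale> isToyota = e -> e.brand().equals("Toyota");
        Predicate<CarSale> from2010 = e -> e.year() >= 2010;
        Predicate<CarSale> condition = isToyota.and(from2010);

        List<CarSale> sales = List.of(
                new CarSale("Toyota", 2015),
                new CarSale("Honda", 2018),
                new CarSale("Toyota", 2008));

        // Emit only the events that satisfy the condition.
        sales.stream().filter(condition).forEach(System.out::println);
    }
}
```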

9
Q

What are some related patterns to the Filters and Thresholds Pattern?

A

The Filters and Thresholds pattern can be applied with all the other stream data processing patterns, as we often need to filter events for those patterns (for example, to aggregate only a particular type of event).

338

10
Q

Describe the Windowed Aggregation Pattern

A

The Windowed Aggregation pattern enables us to analyze a collection of events based on a condition. Here, aggregation analysis can include operations like summation, minimum, maximum, average, standard deviation, and count, and the window defines the collection of events used for aggregation.
These windows can be based on time or event count, such as the last five minutes or the last 100 events. These windows may also have behaviors such as sliding or batching, defining when events are added to and removed from the window.
This pattern enables us to aggregate data on the fly and make time-critical business decisions within milliseconds.
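
A minimal Java sketch of a length-sliding window that keeps the last three events and recomputes the average as each event arrives (a stand-in for what stream-processing systems provide as a built-in window):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingWindowAverage {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    public SlidingWindowAverage(int size) { this.size = size; }

    // Adds an event to the window, expires the oldest one if the
    // window is full, and returns the current aggregation.
    public double add(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst();
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingWindowAverage avg = new SlidingWindowAverage(3);
        for (double price : new double[]{10, 12, 11, 15}) {
            System.out.printf("price=%.1f avg=%.2f%n", price, avg.add(price));
        }
    }
}
```

Note that the aggregation state lives entirely in memory, which is what makes the pattern fast and also what makes the reliability considerations in the following cards important.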

338

11
Q

What are some of the most common windows?

A
  • Length sliding
  • Length batch
  • Time sliding
  • Time batch

339

12
Q

How is the Windowed Aggregation Pattern used in practice?

A

The Windowed Aggregation pattern is stateful, meaning it stores data related to the events in memory. Therefore, for noncritical use cases that can tolerate data loss, such as monitoring, we can implement this pattern in any cloud native application. But when the use case requires reliable event processing, we need to combine this pattern with reliability patterns.

343

13
Q

What are some related patterns to the Windowed Aggregation Pattern?

A
  • Transformation pattern
    Appropriately maps the aggregation to the output.
  • Reliability patterns
    Help make the window and aggregation state survive system failures.
  • Sequential Convoy pattern
    Allows aggregations to be performed in parallel based on shard keys. This not only helps scale aggregation processing, but also allows us to aggregate different types of events in isolation and produce aggregations per event type.
  • Service Orchestration pattern
    Splits the events by different shard keys for processing. This pattern is described in Chapter 3.
  • Stream Join pattern
    Aggregates results from different shards.

347

14
Q

Describe the Stream Join Pattern

A

The Stream Join pattern resembles the join of SQL tables and enables us to join events from multiple streams with different schemas.

348

15
Q

How does the Stream Join Pattern work?

A

This pattern works by defining a condition that identifies the joining events. The condition picks attributes from each joining event stream and specifies how they should be matched. This can be a simple equality condition, such as joining events from all streams that have the same ID, or something more complex. The join should also define a buffer that determines how long events wait for corresponding events to arrive from other event streams. This buffer period can be common across all streams or can vary among them. Most stream-processing systems define this buffer period via windows.
Finally, as in the Windowed Aggregation pattern, it is important for this pattern to use the Transformation pattern to map the joining events and their attributes to the output.

348 Figure 6-5. Stream Join based on events that have arrived during the last minute
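
A minimal Java sketch of a join across two streams with a one-minute buffer, keyed by a shared ID (the stream names and payloads are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class StreamJoin {
    record Buffered(String payload, long arrivalMillis) {}

    private static final long BUFFER_MILLIS = 60_000; // 1-minute buffer
    private final Map<String, Buffered> left = new HashMap<>();
    private final Map<String, Buffered> right = new HashMap<>();

    // An event on the left stream joins a buffered right event with the
    // same key if it arrived within the buffer period, or waits otherwise.
    public void onLeft(String key, String payload, long now) {
        Buffered match = right.remove(key);
        if (match != null && now - match.arrivalMillis() <= BUFFER_MILLIS) {
            System.out.println("joined " + key + ": " + payload + " + " + match.payload());
        } else {
            left.put(key, new Buffered(payload, now));
        }
    }

    public void onRight(String key, String payload, long now) {
        Buffered match = left.remove(key);
        if (match != null && now - match.arrivalMillis() <= BUFFER_MILLIS) {
            System.out.println("joined " + key + ": " + match.payload() + " + " + payload);
        } else {
            right.put(key, new Buffered(payload, now));
        }
    }

    public static void main(String[] args) {
        StreamJoin join = new StreamJoin();
        join.onLeft("order-1", "payment", 0);
        join.onRight("order-1", "shipment", 30_000);  // within the buffer: joins
        join.onRight("order-2", "shipment", 40_000);
        join.onLeft("order-2", "payment", 200_000);   // too late: no join emitted
    }
}
```

A real implementation would also evict expired entries from the buffers; that bookkeeping is omitted here for brevity.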

16
Q

How is the Stream Join Pattern used in practice?

A

The Stream Join pattern is stateful, as it buffers events for the join. Like the Windowed Aggregation pattern, this one can be implemented in any cloud native application as long as the use case is not business critical and can tolerate event loss. But when event loss is not acceptable, this pattern should be applied along with reliability patterns, so the application can withstand system failures and restarts without event loss.

349

17
Q

What are some patterns related to the Stream Join Pattern?

A
  • Transformation pattern
    Appropriately maps joining event attributes to build the output.
  • Reliability patterns
    Help the join state survive system failures.
  • Sequential Convoy pattern
    Scales joins by performing them in parallel, with related joining events falling into the same shard.

352

18
Q

Describe the Temporal Event Ordering Pattern

A

The Temporal Event Ordering pattern is unique to stream processing. It detects interesting complex event occurrences by identifying patterns based on the order in which events arrive. The pattern can also detect the occurrence and nonoccurrence of incidents based on events emitted by various systems.

352

19
Q

How does the Temporal Event Ordering Pattern work?

A

This pattern works on the concept of nondeterministic finite-state machines: the application state changes based on the input event and the current application state. The possible state transitions can be represented as a state graph that traverses from one state to another until it reaches either a success or fail state. Upon reaching the success state, the user is notified, as it means the expected events have occurred in order.
This pattern can also be used to identify sequences of events that are immediately followed by one another or scattered randomly among other events. We can also use this to detect the nonoccurrence of events by combining state transitions with time-outs.
Use cases such as stock monitoring most often require the event sequence to be detected repeatedly. To achieve this, a new state machine instance should be initiated upon each event arrival that triggers the initial state of the state machine.

352 Figure 6-7. Using the Temporal Event Ordering pattern to detect a continuous stock price increase followed by a single drop
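
A minimal Java sketch of the state machine behind Figure 6-7, detecting at least two consecutive price rises followed by a single drop (the thresholds and names are illustrative):

```java
public class RiseThenDropDetector {
    private enum State { START, TRACKING, MATCHED }

    private State state = State.START;
    private double lastPrice;
    private int rises = 0;

    // Each incoming price moves the state machine forward; reaching
    // MATCHED means the expected events occurred in order.
    public void onPrice(double price) {
        switch (state) {
            case START -> state = State.TRACKING;
            case TRACKING -> {
                if (price > lastPrice) {
                    rises++;
                } else if (rises >= 2) {
                    state = State.MATCHED; // success state: notify the user
                    System.out.println("rise-rise-drop detected at " + price);
                } else {
                    rises = 0; // fail state: restart matching
                }
            }
            case MATCHED -> { /* a real system starts a new instance per arrival */ }
        }
        lastPrice = price;
    }

    public static void main(String[] args) {
        RiseThenDropDetector detector = new RiseThenDropDetector();
        for (double p : new double[]{100, 101, 103, 102}) detector.onPrice(p);
    }
}
```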

20
Q

How is the Temporal Event Ordering Pattern used in practice?

A

Like the Windowed Aggregation and Stream Join patterns, this pattern should be combined with reliability patterns to prevent data loss during system failures and restarts. Furthermore, as event arrival order is critical for the success of this pattern, we recommend using patterns like Buffered Event Ordering to guarantee the ordering of events before processing them.

354

21
Q

What are some patterns related to the Temporal Event Ordering Pattern?

A
  • Transformation pattern
    Appropriately maps the matched events in the sequence to generate a meaningful output.
  • Reliability patterns
    Helps state machines survive system failures.
  • Sequential Convoy pattern
    Scales sequence matching by performing it in parallel by allowing relevant events to fall into the same shard.
  • Buffered Event Ordering pattern
    Orders events based on event-generation time to facilitate correct behavior of this pattern.

356

22
Q

Describe the Machine Learner Pattern

A

We can use machine learning models in real time to generate predictions and automate decision making. Prebuilt machine learning models produce predictions without updating themselves based on new input events, while online machine learning models produce predictions and continuously learn from new incoming events. Both approaches make our cloud native application much more intelligent.

357

23
Q

How does the Machine Learner Pattern work?

A

We can generate predictions in cloud native applications in two ways: by executing prebuilt machine learning models and by using online machine learning models.

357

24
Q

How is the Machine Learner Pattern used in practice?

A

Machine learning has now become an integral part of many applications, and cloud native applications should also be well equipped to incorporate them. One common way of integrating machine learning models is to deploy them as individual microservices and make service calls. Alternatively, machine learning models can be embedded into the applications, which can continuously produce predictions based on incoming events. Some scenarios using this pattern are described next.
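
A minimal Java sketch of the "model as a microservice" style, posting each event to a hypothetical model-serving endpoint (the URL, path, and response shape are assumptions; real deployments might expose TensorFlow Serving or KServe behind such a URL):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FraudScorer {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    // Hypothetical model-serving endpoint.
    private static final URI MODEL_URI = URI.create("http://fraud-model:8080/predict");

    // Sends the event to the model service and returns its prediction.
    public static String score(String transactionJson) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(MODEL_URI)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(transactionJson))
                .build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body(); // e.g., {"fraudProbability": 0.87}
    }

    public static void main(String[] args) throws Exception {
        System.out.println(score("{\"amount\": 4200, \"currency\": \"USD\"}"));
    }
}
```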

359

25
Q

What are some patterns related to the Machine Learner Pattern?

A
  • Transformation pattern
    Appropriately maps the predictions of the machine learning model to generate a meaningful output.
  • Reliability patterns
    Store and restore online machine learning algorithm state.

362

26
Q

When to use the Transformation pattern?

A

To transform the event format, structure, or protocol.
To add or remove partial data to or from the event.
When third-party systems do not support the current event format.

362

27
Q

When not to use the Transformation pattern?

A

The consuming system has the ability to understand the event.

362

28
Q

What are the benefits of using the Transformation pattern?

A

Allows incompatible systems to communicate with one another.
Reduces event size by containing only relevant information.

362

29
Q

When to use the Filters and Thresholds pattern?

A

Only a subset of events is relevant for processing.

362

30
Q

When not to use the Filters and Thresholds pattern?

A

All events are needed for decision making.

362

31
Q

What are the benefits of using the Filters and Thresholds pattern?

A

Reduces the load on the system by selecting only events that can produce the most value to the use case.

362

32
Q

When to use the Windowed Aggregation pattern?

A

To aggregate events over time or length.
To perform operations such as summation, minimum, maximum, average, standard deviation, and count on the events.

362

33
Q

When not to use the Windowed Aggregation pattern?

A

For operations that cannot be performed with fixed memory, such as detecting the median of the events.
High accuracy is needed without the use of reliability patterns.

362

34
Q

What are the benefits of using the Windowed Aggregation pattern?

A

Reduces the load on the system by aggregating events.
Provides data summary to better understand the behavior as a whole.

362

35
Q

When to use the Stream Join pattern?

A

To join events from two or more event streams.
To collect events that were previously split to parallelize processing.

362

36
Q

When not to use the Stream Join pattern?

A

Joining events do not arrive in relatively close proximity.
High accuracy is needed without the use of reliability patterns.

362

37
Q

What are the benefits of using the Stream Join pattern?

A

Allows events to be correlated.
Enables synchronous processing of events.

362

38
Q

When to use the Temporal Event Ordering pattern?

A

To detect the sequence of event occurrences.
To detect the nonoccurrence of events.

362

39
Q

When not to use the Temporal Event Ordering pattern?

A

Event sequencing cannot be defined as a finite-state machine.
High accuracy is needed without the use of reliability patterns.
Incoming events arrive out of order.

362

40
Q

What are the benefits of using the Temporal Event Ordering pattern?

A

Allows detecting complex conditions based on event arrival order.

362

41
Q

When to use the Machine Learner pattern?

A

To perform predictions in real time.
To perform classification, clustering, or regression analysis on the events.

362

42
Q

When not to use the Machine Learner pattern?

A

We cannot use a model to accurately predict the values.
Historical data is not available for building machine learning models.

362

43
Q

What are the benefits of using the Machine Learner pattern?

A

Automates decision making.
Provides reasonable estimates.

362

44
Q

Describe Scaling and Performance Optimization Patterns

A

Cloud native applications that perform stream processing have unique scalability and performance requirements. For instance, these applications require event ordering to be maintained while processing events. Furthermore, as most of these applications have in-memory state, they also need a strategy to scale so they can process more events without compromising their accuracy.

364

45
Q

Describe Sequential Convoy Pattern

A

The Sequential Convoy pattern scales cloud native stream-processing applications by separating events into various categories and processing them in parallel. It also preserves event ordering, so that events can be recombined later in their original order.

364

46
Q

How does the Sequential Convoy Pattern work?

A

As the name suggests, this pattern sees events as items moving along a conveyor belt. It groups the events into categories based on their characteristics and processes them in parallel.
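
A minimal Java sketch of the idea: events are assigned to shards by key, and each shard is a single-threaded worker, so per-key ordering is preserved while shards run in parallel (the shard count and keys are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SequentialConvoy {
    private final ExecutorService[] shards;

    public SequentialConvoy(int shardCount) {
        shards = new ExecutorService[shardCount];
        for (int i = 0; i < shardCount; i++) {
            // One thread per shard keeps events within a shard in order.
            shards[i] = Executors.newSingleThreadExecutor();
        }
    }

    // The same key always maps to the same shard, so events for one
    // category are processed sequentially; categories run in parallel.
    public void submit(String shardKey, Runnable work) {
        int shard = Math.floorMod(shardKey.hashCode(), shards.length);
        shards[shard].submit(work);
    }

    public void shutdown() {
        for (ExecutorService shard : shards) shard.shutdown();
    }

    public static void main(String[] args) {
        SequentialConvoy convoy = new SequentialConvoy(4);
        for (String symbol : new String[]{"AAPL", "GOOG", "AAPL"}) {
            convoy.submit(symbol, () -> System.out.println("processing " + symbol));
        }
        convoy.shutdown();
    }
}
```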

364

47
Q

How is the Sequential Convoy Pattern used in practice?

A

This pattern is used for scaling event processing so we can process more events with cloud native applications that have limited memory capacity, and for partitioning events so that each substream is processed differently. Let’s look at how this pattern can be used in various scenarios.

366

48
Q

What are some related patterns to the Sequential Convoy Pattern?

A
  • Producer-Consumer and Publisher-Subscriber patterns
    Can be used as the base for building the Sequential Convoy pattern. These patterns are covered in Chapter 5.
  • Buffered Event Ordering pattern
    Provides an alternative way to order events while joining events from multiple event streams together. This pattern is covered next.
  • Periodic Snapshot State Persistence pattern
    Stores substream application states and supports scalability. This pattern is covered later in this chapter.

370

49
Q

Describe the Buffered Event Ordering Pattern

A

Network delays and connection retries can cause events to get out of order. The Buffered Event Ordering pattern allows us to reorder events before processing them downstream. We can order events based on time or on the order they are generated.

371

50
Q

How does the Buffered Event Ordering Pattern work?

A

For events to be ordered, they must have an incremental value by which to order them. This value can be a sequence number or a timestamp, for example. Sequence numbers will continuously increase, and we can guarantee that each event in a stream will have a unique number. But with a timestamp, we cannot guarantee that all events will have unique values, because multiple events can be generated in the same millisecond.

371 Figure 6-17. Ordering events based on sequence number

Figure 6-17 illustrates the use of sequence numbers. If we have most recently received the event with sequence number 7 and we now receive the event with sequence number 8, we can immediately send it for processing, because we know that 8 follows 7. But if after 8 we get 10, we know we are missing an event, so we cannot send 10 for processing. Instead, we use a time-out (of 30 seconds, for instance) to wait for the missing event. If the missing event 9 arrives in time, we send it followed by event 10. But if it does not arrive in time, we have to send event 10 for processing without event 9.
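
A minimal Java sketch of such a reordering buffer keyed by sequence number (the time-out handling described above is noted but omitted for brevity):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class EventReorderer {
    record Event(long sequence, String payload) {}

    private final PriorityQueue<Event> buffer =
            new PriorityQueue<>(Comparator.comparingLong(Event::sequence));
    private long nextExpected = 1;

    // Buffers out-of-order events and releases them only while the head
    // of the buffer carries the next expected sequence number. A real
    // implementation would also flush after a time-out (e.g., 30 seconds).
    public void receive(Event event) {
        buffer.add(event);
        while (!buffer.isEmpty() && buffer.peek().sequence() == nextExpected) {
            System.out.println("processing " + buffer.poll().payload());
            nextExpected++;
        }
    }

    public static void main(String[] args) {
        EventReorderer reorderer = new EventReorderer();
        reorderer.receive(new Event(1, "e1")); // processed immediately
        reorderer.receive(new Event(3, "e3")); // buffered: 2 is still missing
        reorderer.receive(new Event(2, "e2")); // releases e2, then e3
    }
}
```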

51
Q

How is the Buffered Event Ordering Pattern used in practice?

A

This pattern can be deployed in front of any use case that needs ordered events, as long as the events have attributes that can be used for ordering.

372

52
Q

What are some related patterns to the Buffered Event Ordering Pattern?

A
  • Temporal Event Ordering and Windowed Aggregation patterns
    These patterns can benefit from the Buffered Event Ordering pattern, as they require events to be ordered to produce more-accurate results.
  • Reliability patterns
    For storing and retrieving events that are waiting in the buffer for ordering, during system failure and restart.

374

53
Q

Describe the Course Correction Pattern

A

The Course Correction pattern attempts to report its analysis of events as soon as possible, and then correct its analysis and report again once missing (or late) events arrive. This produces an early analysis with low latency rather than a more accurate analysis with higher latency.

375

54
Q

How does the Course Correction Pattern work?

A

This pattern should be combined with patterns like Windowed Aggregation or Temporal Event Ordering. Rather than waiting for all events to arrive, we send aggregation and event sequence detection as soon as we have a result. The results of the aggregation and sequence detection are an early estimate and may not be accurate. Later, when we receive missing events, we send updated results.
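
A minimal Java sketch for a windowed count: each window emits an early estimate as events arrive and re-emits a corrected result when a late event for an already-reported window shows up (the window IDs are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class CourseCorrectedCount {
    private final Map<Long, Integer> countsByWindow = new HashMap<>();

    // Emits a result for the event's window on every arrival; downstream
    // consumers must be able to handle these repeated, updated results.
    public void onEvent(long windowId) {
        int count = countsByWindow.merge(windowId, 1, Integer::sum);
        System.out.println("window " + windowId + " count=" + count
                + (count == 1 ? " (early estimate)" : " (corrected)"));
    }

    public static void main(String[] args) {
        CourseCorrectedCount counter = new CourseCorrectedCount();
        counter.onEvent(1); // early estimate for window 1
        counter.onEvent(2); // early estimate for window 2
        counter.onEvent(1); // late event: window 1 is corrected
    }
}
```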

375

55
Q

How is the Course Correction Pattern used in practice?

A

This pattern should be used only when we need events in order, have a requirement for low latency, and can cope with inaccurate early estimates. Let’s consider some example scenarios to understand this in more detail.

375

56
Q

What are some related patterns to the Course Correction Pattern?

A
  • Reliability patterns
    For storing application state holding previous events and previously emitted outputs.
  • Buffered Event Ordering pattern
    Can be used instead of this pattern when the use case does not support course correction.
  • Temporal Event Ordering and Windowed Aggregation patterns
    These patterns can benefit from the Course Correction pattern as they can use course correction to correct their early estimates.

378

57
Q

Describe the Watermark Pattern

A

The Watermark pattern is useful for periodically aligning stream processing across multiple microservices within a cloud native application that are connected in a mesh-like structure via event streams. This alignment will help determine whether all microservices have processed all arrived events before a given event, which is commonly referred to as the watermark event. We can use this pattern to sync multiple microservices without using system times.

379

58
Q

How does the Watermark Pattern work?

A

For watermarks to work, a watermark generator should generate a watermark event periodically and send it through all the external inputs of the cloud native application. This event should be considered special, and microservices should pass it through to their dependent systems. We also need to be sure that each intermediate microservice that consumes this event can resend it in the same position among the sequence of events it has received and processed, and not before or after other events.
When the input systems are time synchronized, we can make those systems independently generate the watermark events at given intervals, such as once every minute or every five minutes.

379 Figure 6-19. Generating watermark events and synchronizing events based on them

When the microservice receives a watermark event in a stream (such as the watermark event with sequence number 6 in this example), it should stop processing further events from that stream and process only events from the other streams that have not yet delivered the corresponding watermark event (such as Event B of the second stream). When all corresponding watermark events have arrived on all streams, the microservice passes the watermark event to all its dependents and continues processing events from all input streams until the next watermark event arrives. Repeating this process ensures that event processing is synchronized at each watermark event.
When the preceding options are not possible, we can also make the input sources poll a global counter to fetch the next watermark sequence number and emit it periodically along with the events. In this case, we should make sure that watermark events arriving on multiple streams are processed sequentially. If we find a sequence number out of sync, we should halt processing events from that stream until we receive a watermark event with a lower sequence number on another stream.
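
A minimal Java sketch of the alignment step for one watermark across multiple input streams (the stream names are illustrative; the sketch only tracks whether each stream may continue processing):

```java
import java.util.HashSet;
import java.util.Set;

public class WatermarkAligner {
    private final Set<String> pausedStreams = new HashSet<>();
    private final int streamCount;

    public WatermarkAligner(int streamCount) { this.streamCount = streamCount; }

    // A stream may process events only while it has not yet delivered
    // the current watermark.
    public boolean canProcess(String stream) {
        return !pausedStreams.contains(stream);
    }

    // When all streams have delivered the watermark, forward it to
    // dependents and let every stream resume.
    public void onWatermark(String stream) {
        pausedStreams.add(stream);
        if (pausedStreams.size() == streamCount) {
            System.out.println("watermark aligned; forwarding and resuming");
            pausedStreams.clear();
        } else {
            System.out.println(stream + " paused at watermark");
        }
    }

    public static void main(String[] args) {
        WatermarkAligner aligner = new WatermarkAligner(2);
        aligner.onWatermark("stream-1"); // stream-1 pauses
        System.out.println("stream-2 running: " + aligner.canProcess("stream-2"));
        aligner.onWatermark("stream-2"); // aligned: both resume
    }
}
```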

59
Q

How is the Watermark Pattern used in practice?

A

This pattern can be used when multiple source systems are not synchronized on time, or when network latency or processing time can affect event arrival time. In this case, events in one stream can arrive earlier, while events from other streams can arrive later, and this can cause issues when analyzing events across multiple streams. This pattern is ideal for synchronizing the event processing periodically to reduce errors.

381

60
Q

What are some patterns related to the Watermark Pattern?

A
  • Buffered Event Ordering and Course Correction patterns
    Can be used instead of this pattern when event arrival times are not affected by network latency or other processing delays by the systems.
  • Temporal Event Ordering and Windowed Aggregation patterns
    These patterns can benefit from the Watermark pattern as they require events to be ordered to produce correct results.
  • Periodic Snapshot State Persistence pattern
    The Watermark pattern can be a prerequisite for Periodic Snapshot State Persistence to perform state snapshots in a synchronized manner among multiple streams.

384

61
Q

When to use the Sequential Convoy pattern?

A

To scale stream-processing applications.
To partition streams so each stream can be used for various use cases.
To allow processing events in parallel and regroup them based on the original order.

384

62
Q

When not to use the Sequential Convoy pattern?

A

Streaming applications have enough capacity to process the events.

384

63
Q

What are the benefits of using the Sequential Convoy pattern?

A

Supports scalability of stream processing.
Preserves event ordering when events are processed in parallel.

384

64
Q

When to use the Buffered Event Ordering pattern?

A

To order events based on timestamp or sequence number.
To order events that are already out of order and published via a single event stream.

384

65
Q

When not to use the Buffered Event Ordering pattern?

A

To group events from multiple ordered event streams.
We need true ordering of events that are generated from distributed sources.
Reliability patterns cannot be applied to the application.

384

66
Q

What are the benefits of using the Buffered Event Ordering pattern?

A

Can be applied in front of any application that needs events in order.

384

67
Q

When to use the Course Correction pattern?

A

To correct previously produced results.
To produce early aggregation estimates.
To guess the event-sequence order and correct the decision later.

384

68
Q

When not to use the Course Correction pattern?

A

The dependent downstream applications cannot handle continuous event updates.

384

69
Q

What are the benefits of using the Course Correction pattern?

A

Allows us to produce early estimates and correct them as we have more data.

384

70
Q

When to use the Watermark pattern?

A

To perform aggregation operations on event streams that are out of sync.
To order events that are generated by distributed systems.

384

71
Q

When not to use the Watermark pattern?

A

We cannot inject watermark events closer to the event sources.
Intermediate systems cannot bypass watermark events.
Network bandwidth is a concern.

384

72
Q

What are the benefits of using the Watermark pattern?

A

Periodically synchronizes events across multiple streams.
Helps overcome network and processing latency added by intermediate systems.

384

73
Q

Describe the Replay Pattern

A

By using the Replay pattern, the state of a microservice can be restored by replaying the events it has processed in the past, especially when its state depends only on recent events.

396

74
Q

How does the Replay Pattern work?

A

This pattern works by resending past events to the system after a failure or restart. The number of old events that need to be resent depends on the use case. For example, if the microservice is aggregating events over the past three minutes, then after a failure, resending the events that arrived during the last three minutes is sufficient.

386 Figure 6-22. Deferring acknowledgment until after outputs are generated

When the stateful microservice stores its state periodically, we can identify the last successfully processed event from the latest snapshot and replay all events that arrived after it.
To re-create the state of a system, the source of the data should retain the events even after they are retrieved by the microservice. We cannot use standard message brokers with their automatic event acknowledgment feature, because events are deleted from the message broker as we read them, unless we use queues or durable subscriptions and defer the acknowledgment of consumed events. As shown in Figure 6-22, we can delay sending acknowledgments to the queues until the microservice has finished processing and has cleared its state, or until it persists its state in durable storage.
We can use this pattern with microservices that consume events from log-based message brokers, such as Kafka and NATS, because these brokers will not delete the events when they are delivered to the microservice, and the microservices can request events to be played back from the last sequence number they have successfully processed. We can also use this pattern when events are read from persistent data stores like RDBMS databases, NoSQL stores, or filesystems.
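
A minimal Java sketch of replay against Kafka: after a restart, the consumer seeks back to the offset recorded alongside the last snapshot and reprocesses everything after it (the topic, partition, and offset values are illustrative):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "stock-aggregator");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        long lastProcessedOffset = 41_523L; // loaded from the latest snapshot
        TopicPartition partition = new TopicPartition("stock-events", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(List.of(partition));
            // Replay from the event right after the last processed one.
            consumer.seek(partition, lastProcessedOffset + 1);
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println("reprocessing offset " + record.offset());
            }
        }
    }
}
```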

75
Q

How is the Replay Pattern used in practice?

A

This pattern can be used to restore an application’s state by replaying the lost events due to system failure or restart. Let’s look at how this pattern could be used in a few scenarios.

388

76
Q

What are some related patterns to the Replay Pattern?

A
  • Publisher-Subscriber pattern
    Can be used to establish durable subscriptions with event sources so they can be replayed during failure. This pattern is covered in Chapter 5.
  • Periodic Snapshot State Persistence pattern
    This can be used in conjunction with the Replay pattern to restore application state and reduce the time to bring the application back online. This pattern is covered next.

389

77
Q

Describe the Periodic Snapshot State Persistence Pattern

A

Persisting the application state upon processing each incoming event is not feasible, as this introduces extremely high latency to cloud native applications due to the round-trip time of accessing state. The Periodic Snapshot State Persistence pattern allows us to persist the application state in a periodic manner so that we can restore the state reliably after system restarts or failures.

390

78
Q

How does the Periodic Snapshot State Persistence Pattern work?

A

This pattern periodically makes a copy of its current state and persists that to a durable store between processing events. For this to work, we should ensure that the microservices can read and write state to persistent storage.
To ensure that events are not lost during failures and to guarantee at-least-once event delivery, we must use message brokers to retrieve events. When using a log-based message broker like Kafka, we should store the event sequence number with the snapshot (Figure 6-23). With this approach, upon a restart, we reload the last stored snapshot and request the message broker to deliver events from the stored event sequence number.

390

When using standard message brokers like ActiveMQ, we should acknowledge the processed messages only when storing the snapshot. This way, we can ensure that when the microservice is restarted the message broker sends all unacknowledged events.
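
A minimal Java sketch of the snapshotting side, persisting the in-memory state together with the last processed event sequence number (the file path, field names, and interval are illustrative; a real deployment would more likely use a database or object store):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class Snapshotter {
    record Snapshot(double lastPrice, long lastDipTime, long lastOffset) {}

    private static final int SNAPSHOT_INTERVAL = 1_000; // events between snapshots
    private static final Path SNAPSHOT_FILE = Path.of("snapshot.dat");
    private long processed = 0;

    public void maybeSnapshot(Snapshot state) throws IOException {
        if (++processed % SNAPSHOT_INTERVAL == 0) {
            // Write to a temp file first, then move it into place, so a
            // crash mid-write never corrupts the last good snapshot.
            Path tmp = Path.of("snapshot.tmp");
            Files.writeString(tmp, state.lastPrice() + ","
                    + state.lastDipTime() + "," + state.lastOffset());
            Files.move(tmp, SNAPSHOT_FILE,
                    StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
        }
    }

    // On restart: reload the snapshot, then replay events after lastOffset.
    public static Snapshot restore() throws IOException {
        String[] parts = Files.readString(SNAPSHOT_FILE).split(",");
        return new Snapshot(Double.parseDouble(parts[0]),
                Long.parseLong(parts[1]), Long.parseLong(parts[2]));
    }
}
```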

79
Q

How is the Periodic Snapshot State Persistence Pattern used in practice?

A

This pattern can be used to store state when microservices process data in memory, and when their state cannot be persisted after every event.

394

Let’s say we want our application to detect if a stock price has continuously risen over the last 10 minutes. We need to keep track of only the last time we saw a dip in the stock price, and continuously check if that time is now older than 10 minutes. If so, we will alert the user.
If we are retrieving the events from a Kafka topic, we also need to remember the last-processed event’s sequence number along with the current stock price, and the last time we saw the stock price dip. These three represent the state of the microservice. To recover the microservice from failure, we need to persist all three values to a database or similar storage, in a periodic manner. During the recovery process, the microservice can restore the last stored snapshot and replay the Kafka events from the last stored event sequence number to ensure that the system state is preserved across system failures.

80
Q

What are some things to consider when using the Periodic Snapshot State Persistence Pattern?

A

We should use this pattern only when we are processing critical data that cannot be lost on system failure, because the pattern introduces significant operational overhead that is not worthwhile if data loss is acceptable.

In some situations, the state itself is quite large and requires significant time to store and retrieve. To mitigate this, use incremental snapshots: store only the delta between the current state and the last snapshot, and then replay the incremental snapshots to re-create the system state.

We do not recommend making the snapshot interval overly short, as this introduces overhead without significant benefit. But don't make the interval overly long either, as this leads to not only writing and reading bigger snapshots (which takes longer), but also replaying more events on application restoration (which can increase the time for applications to become live again).

395

Example 1) For example, if our processing window is small (say, one minute), a system failure affects results only for as long as the system is down. To restore state, we can use the Replay pattern to reprocess the events from the lost period. But if our application state contains data from the previous day, we need to replay events from the previous day to re-create the state, which may not be feasible given the processing time for that quantity of events. In such cases, we advise using the Periodic Snapshot State Persistence pattern.

Example 2) For example, when using a time window of five minutes with one-minute time shifts for aggregation, store snapshots every minute with only the changes that happened during the last minute. When there is a failure, we load the last five snapshots to re-create the state of the five-minute window.

81
Q

What are some related patterns to the Periodic Snapshot State Persistence Pattern?

A
  • Temporal Event Ordering and Windowed Aggregation patterns
    These patterns can benefit from the Periodic Snapshot State Persistence pattern, as they require state to be stored to achieve reliability.
  • Replay pattern
    Re-creates state by replaying the events that arrived after the last snapshot, which is loaded during application recovery.
  • Watermark pattern
    This can be used to synchronize state snapshots across multiple microservices.

396

82
Q

Describe the Two-Node Failover Pattern

A

Low-latency microservices do not have the luxury of taking a couple of minutes after failure to restart and restore their states. For these microservices, it is operationally superior to run a redundant microservice to allow failover. We can run such a microservice by using the Two-Node Failover pattern.

397

83
Q

How does the Two-Node Failover Pattern work?

A

This pattern focuses on running a parallel backup microservice. When microservices are deployed, they perform a leader election; we can use systems such as ZooKeeper or native cloud services to designate one microservice as primary and the other as secondary.

397 Figure 6-26. Running microservices as primary and secondary to enable failover
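
A minimal Java sketch of the leader-election step using ZooKeeper via Apache Curator (the connection string and latch path are illustrative); both nodes run the same code, and the one that wins the latch acts as primary:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class FailoverNode {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        try (LeaderLatch latch = new LeaderLatch(client, "/stream-processor/leader")) {
            latch.start();
            // Blocks until this node is elected primary; until then it
            // runs as the secondary, consuming events but not publishing.
            latch.await();
            System.out.println("elected primary; publishing output");
            // ... consume events and publish results while leader ...
        }
    }
}
```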

84
Q

How is the Two-Node Failover Pattern used in practice?

A

This pattern can be used when degradation of latency and system downtime cannot be tolerated. Say we retrieve stock order events from NATS, process the number of stock bids and asks in real time, and publish them to stockbrokers so they can instantly identify changes in market trends. Using the Two-Node Failover pattern, we can not only switch to the secondary node instantly upon detecting that the primary has failed, but also ensure that no events are dropped, because the secondary publishes only the data that the primary has not.

398

85
Q

What are some things to consider when using the Two-Node Failover Pattern?

A

Use this pattern only when low latency is the main requirement. Otherwise, use other patterns such as Periodic Snapshot State Persistence. This pattern is complex to implement, and the architectural complexity is not worthwhile if we can tolerate downtime during system failure.

This pattern requires both microservices to have robust connectivity, as the primary needs to publish its output to the secondary. Furthermore, a risk of network partitioning between the primary and secondary exists. In this case, we require a third system to function as the leadership elector; otherwise, both microservices could become primary in parallel and send outputs downstream.

We should also be mindful that both microservices can fail simultaneously; then the system would become unavailable and we’d lose their state. As a mitigation, we can adopt the Periodic Snapshot State Persistence pattern or Replay pattern to allow for state restoration in such a scenario.

399

86
Q

What are some related patterns to the Two-Node Failover Pattern?

A
  • Periodic Snapshot State Persistence and Replay patterns
    These patterns, covered in this chapter, can be used with the Two-Node Failover pattern to restore the state of a microservice when it restarts as the secondary after a failure.
  • Sequential Convoy pattern
    Used if we need to scale the stream-processing application. This pattern is covered in this chapter.
  • Publisher-Subscriber pattern
    To allow both primary and secondary microservices to consume the same events. Chapter 5 details this pattern.

400

87
Q

When to use the Replay pattern?

A

The system state contains only recent events.
To restore state only when there is access to the previously processed events.
To process data from persistent data stores, filesystems, and log-based message brokers.

400

88
Q

When not to use the Replay pattern?

A

We cannot guarantee that previously processed data can be accessed again.
Dependent systems cannot process duplicate events.
Systems cannot take time to re-create their state.
The system state needs to contain events that span over a long period.

400

89
Q

What are some benefits of using the Replay pattern?

A

Allows re-creating state without storing large snapshots.

400

90
Q

When to use the Periodic Snapshot State Persistence pattern?

A

The system state needs to contain events that span over a long period.
To restore state only when there is access to the previously processed events.
To process data from persistent data stores, filesystems, and log-based message brokers.

400

91
Q

When not to use the Periodic Snapshot State Persistence pattern?

A

The system state contains only recent events.
We cannot guarantee that previously processed events can be accessed again.
Systems cannot take time to re-create their state.

400

92
Q

What are some benefits of using the Periodic Snapshot State Persistence pattern?

A

Allows re-creating state faster.
Supports larger and long-running system states.
Supports dependent applications that cannot process duplicate events.

400

93
Q

When to use the Two-Node Failover pattern?

A

We cannot take time to restore the application after failure.
The system state needs to contain events that span over a long period.
To restore state only when there is access to the previously processed events.

400

94
Q

When not to use the Two-Node Failover pattern?

A

We cannot guarantee that previously processed events can be accessed again.
Systems can take time to re-create their state.

400

95
Q

What are some benefits of using the Two-Node Failover pattern?

A

Supports low-latency and highly available stream processing.
Supports dependent applications that cannot process duplicate events.

400

96
Q

When to use Esper?

A

To embed into cloud native applications.
To support transformations, filtering and thresholds, windowed aggregations, joins, and temporal event ordering.

406

97
Q

When not to use Esper?

A

To run as a standalone application.
To run machine learning models.
Built-in reliability is required.

406

98
Q

Describe Esper

A

Esper is a complex event-processing library released under the GPL v2 license. It can be used to implement stream-processing logic in Java and .NET-based microservice applications. It supports stream-processing constructs including transformations, filtering, thresholds, windowed aggregations, joins, and temporal event ordering.

Esper can be used to reduce the complexity of an application, as we can offload most of the processing logic to it. We can model events as Java or .NET objects and pass them to Esper for processing, and then subscribe to it to receive outputs. It also supports a query language to configure the stream-processing logic. We recommend using Esper for implementing stream-processing logic within our microservices or serverless functions.

402

99
Q

When to use Siddhi?

A

To embed into cloud native applications.
To run as a standalone cloud native application.
To support transformations, filtering and thresholds, windowed aggregations, joins, temporal event ordering, and machine learning.

406

100
Q

When not to use Siddhi?

A

High scalability is needed.

406

101
Q

Describe Siddhi

A

Siddhi is a Java-based stream-processing library, released under Apache License v2, that can also run as a standalone microservice. As a library, Siddhi (like Esper) can be embedded into microservices to implement stream-processing logic. It allows stream-processing logic to be defined via the Siddhi Query Language and supports stream-processing constructs including transformations, filtering, thresholds, windowed aggregations, joins, temporal event ordering, and machine learning. We recommend using Siddhi for implementing stream-processing logic within microservices or serverless functions.

We also recommend using Siddhi if you want to run stream processing as a standalone microservice supporting all the reliability patterns, including Periodic Snapshot State Persistence and Two-Node Failover. We can use the Siddhi Query Language to configure the sources from which Siddhi consumes events, its processing logic, and where it publishes its output, and then deploy that to Kubernetes.

402

102
Q

When to use ksqlDB?

A

Kafka is used in the infrastructure.
To support transformations, filtering and thresholds, windowed aggregations, and joins.
To build materialized views from the input events.

406

103
Q

When not to use ksqlDB?

A

Kafka is not used in the infrastructure.
Temporal event ordering and machine learning are needed.

406

104
Q

Describe ksqlDB

A

ksqlDB is a stream-processing and database system built on Kafka. It works only in environments where Kafka is used as the broker for distributing events. We can define rules in ksqlDB to retrieve events from a Kafka stream, and then process and publish them. It supports stream-processing constructs such as transformations, filtering, thresholds, windowed aggregations, and joins. It also provides a feature to build materialized views from the input events, which cloud native applications can query on demand, much as they would a relational database. We recommend using ksqlDB when Kafka is the message broker for the cloud native application and when you need to query event logs via materialized views.

403

105
Q

When to use Apache Spark?

A

Support for both stream and batch processing is needed.
To support transformations, filtering and thresholds, windowed aggregations, joins, and machine learning.

406

106
Q

When not to use Apache Spark?

A

A lightweight system is needed for stream processing.
Temporal event ordering is needed.
To embed into cloud native applications.

406

107
Q

Describe Apache Spark

A

Spark is a big-data and stream-processing platform released under Apache License v2. It can run on Apache Mesos, Hadoop YARN, and Kubernetes. Though it is strong in batch processing, it can also support stream-processing constructs such as transformations, filtering, thresholds, windowed aggregations, joins, and machine learning.

It supports both queries and a structured programming approach, allowing users to program in Java, Scala, or Python for both stream and batch processing. It achieves reliable processing by periodically checkpointing data to durable storage. Spark is an ideal choice when use cases are mainly oriented toward batch processing but have some streaming requirements.

403

108
Q

When to use Apache Flink?

A

To support transformations, filtering and thresholds, windowed aggregations, joins, and temporal event ordering, along with graph processing.
For high scalability and availability requirements.

406

109
Q

When not to use Apache Flink?

A

A lightweight system is needed for stream processing.
To embed into cloud native applications.

406

110
Q

Describe Apache Flink

A

Flink is a fully fledged stream-processing platform released under Apache License v2. It can run on platforms like Kubernetes, Knative, and AWS Lambda. It supports stream-processing constructs such as transformations, filtering, thresholds, windowed aggregations, joins, and temporal event ordering, along with graph processing. It also supports exactly-once semantics and achieves reliable data processing by using watermarks and snapshots, which it stores in systems such as S3, GCS, and HDFS.

Flink supports a simple query language for defining stream-processing logic, the Table API for declarative data processing, and the DataStream and stream-processing APIs in Java for more-granular configuration. We recommend using Flink for large-scale stream-processing use cases that have high scalability and availability requirements.

404

111
Q

When to use Amazon Kinesis?

A

To run Flink-based stream processing in AWS.
To support transformations, filtering and thresholds, windowed aggregations, joins, and temporal event ordering, along with graph processing.

406

112
Q

When not to use Amazon Kinesis?

A

Other cloud providers are selected.
To embed into cloud native applications.

406

113
Q

Describe Amazon Kinesis

A

Kinesis is a fully managed, scalable stream-processing offering from AWS. It supports SQL-based and Flink-based data processing in the cloud and allows users to build their own cloud native applications and run them on AWS Lambda or Amazon EC2. With its SQL mode, it can support transformations, filtering, thresholds, windowed aggregations, and joins.

With Flink, it supports all standard stream-processing constructs. In addition to streaming events, it can also stream video content. We recommend using Kinesis if AWS is the hosting environment for your cloud native application.

404

114
Q

When to use Azure Stream Analytics?

A

To support transformations, filtering and thresholds, windowed aggregations, joins, temporal event ordering, and machine learning.
To run stream-processing queries in the cloud and on edge nodes.

406

115
Q

When not to use Azure Stream Analytics?

A

Other cloud providers are selected.
To embed into cloud native applications.

406

116
Q

Describe Azure Stream Analytics

A

Azure Stream Analytics is a fully managed scalable streaming analytics platform offered by Microsoft. It supports defining stream-processing logic by using SQL queries and a graphical user interface. It supports stream-processing constructs such as transformations, filtering, thresholds, windowed aggregations, joins, temporal event ordering, and machine learning.

It also supports hybrid architectures, running stream-processing queries in the cloud and on edge nodes. We recommend using Azure Stream Analytics if Azure is your hosting environment.

405

117
Q

When to use Google Dataflow?

A

To support transformations, filtering and thresholds, windowed aggregations, joins, temporal event ordering, and machine learning.
To support portable stream-processing logic that can also run on on-premises stream-processing systems.

406

118
Q

When not to use Google Dataflow?

A

Other cloud providers are selected.
To embed into cloud native applications.

406

119
Q

Describe Google Dataflow

A

Google Dataflow is a fully managed, scalable stream-processing platform offered by Google. It supports defining stream-processing logic by using the Apache Beam SDK, SQL queries, and a graphical interface. It supports stream-processing constructs such as transformations, filtering, thresholds, windowed aggregations, joins, temporal event ordering, and machine learning.

With its Apache Beam SDK, it allows developers to deploy stream-processing logic into on-premises stream-processing systems such as Apache Flink. We recommend using Dataflow if Google Cloud is your hosting environment.

405