Data Intensive Ch11 - Stream Processing Flashcards

Question

Event Sourcing

Answer 1

DDD idea with close parallels to the CDC concept. Like CDC, ES stores all changes to app state as a log of change events. The biggest difference is that the two ideas apply at different levels of abstraction.
CDC
- app uses the DB in a mutable way (create, update, delete records)
- the log of changes is extracted from a low-level representation such as the replication log. This ensures the order of writes in the log matches the order in which they were actually written
- the app using the DB can be unaware of CDC
ES
- app logic is built on immutable events written to an event log
- event store is append-only (updates and deletes are discouraged or prohibited)
- events are designed to reflect things that happened at the APP level rather than low-level state changes
ES makes the app easier to
- evolve over time
- debug (seeing all the events helps in understanding what happened)
- guard against bugs: no mutable state. Storing an event "subscription cancelled" is clearer than following side effects like "subscription entry was deleted from the table and a notification was sent to SNS"
- extend: new side effects (like sending an email) can be chained off existing events
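A minimal sketch of the ES idea above, assuming an in-memory list stands in for the event store. The names `Event`, `EventStore`, `active_subscribers` and `send_goodbye_email` are made up for illustration and do not come from any library.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Event:
    """Immutable app-level fact, e.g. 'subscription_cancelled'."""
    kind: str
    user_id: str


@dataclass
class EventStore:
    """Hypothetical append-only store: no update or delete methods."""
    _log: list[Event] = field(default_factory=list)
    _handlers: list[Callable[[Event], None]] = field(default_factory=list)

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self._handlers.append(handler)

    def append(self, event: Event) -> None:
        self._log.append(event)
        for handler in self._handlers:  # side effects chained off events
            handler(event)

    def events(self) -> tuple[Event, ...]:
        return tuple(self._log)


def active_subscribers(events) -> set[str]:
    """Derive current state by replaying every event in order."""
    active: set[str] = set()
    for e in events:
        if e.kind == "subscription_created":
            active.add(e.user_id)
        elif e.kind == "subscription_cancelled":
            active.discard(e.user_id)
    return active


emails: list[str] = []


def send_goodbye_email(event: Event) -> None:
    # Stand-in for a real side effect; it only records what would be sent.
    if event.kind == "subscription_cancelled":
        emails.append(f"bye {event.user_id}")


store = EventStore()
store.subscribe(send_goodbye_email)
store.append(Event("subscription_created", "alice"))
store.append(Event("subscription_created", "bob"))
store.append(Event("subscription_cancelled", "alice"))
state = active_subscribers(store.events())
```

The email handler was added without touching the code that writes events, which is the "chain new side effects" point.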

Answer 2

The mutable state of a DB and an append-only log of immutable events are two sides of the same coin.
Every state is the result of the events that caused the changes.
Storing the changelog durably makes the state reproducible.
It's easier to reason about the flow of data if both
- the current state
- the changelog of events
are considered.
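One way to picture "state is the result of events" is as a fold over the changelog. This is only a sketch: the `(key, value)` event shape and the `apply_change` helper are assumptions, with `None` standing for a delete.

```python
from functools import reduce


def apply_change(state: dict, event: tuple) -> dict:
    """Return a new state with one change applied; the old state is untouched."""
    key, value = event
    new_state = dict(state)
    if value is None:          # None is used here as a delete marker
        new_state.pop(key, None)
    else:
        new_state[key] = value
    return new_state


changelog = [("x", 1), ("y", 2), ("x", 3), ("y", None)]

# Current state = fold over the whole changelog.
current = reduce(apply_change, changelog, {})

# Any historical state can be reproduced by folding over a prefix.
after_two_events = reduce(apply_change, changelog[:2], {})
```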

Answer 3

Idea: accountants and financial bookkeeping
Append-only ledger describing each transaction
If a mistake is made, past entries are not erased or changed. Instead a compensating transaction is issued.
Immutable events give:
-> AUDITABILITY
Useful not only when strictly required by regulations, but also for diagnosing and recovering from data problems in a running system
-> USER ACTIVITY tracking
We can track not only the items in the basket that were ordered but also any removals
-> Deriving different read views from the same log of events at any time
This separates how data is written from how it is read (just derive a new view from the events), so there is no need to design a perfect schema up front
-> Concurrency control
Deriving current state from the event log simplifies it. Instead of a user action changing data in several places, the event is a self-contained description of that action, so saving it atomically is a single write. Then, if the data is partitioned the same way in both the app state and the event log, a single-threaded consumer can read the messages to update the app state. Events in a partition are guaranteed to be processed serially, in order.
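The ledger idea can be sketched as below: a wrong entry is never edited, a reversal is appended, and two different read views are derived from the same entries. The tuple layout and the helpers `post`, `balances` and `audit_trail` are illustrative assumptions; amounts are whole currency units to avoid float rounding.

```python
from collections import defaultdict

# Append-only ledger of (entry_id, account, amount, note).
ledger: list[tuple[int, str, int, str]] = []


def post(account: str, amount: int, note: str) -> int:
    """Append a new entry; existing entries are never modified."""
    entry_id = len(ledger) + 1
    ledger.append((entry_id, account, amount, note))
    return entry_id


def balances(entries) -> dict[str, int]:
    """Read view 1: current balance per account."""
    totals: dict[str, int] = defaultdict(int)
    for _, account, amount, _ in entries:
        totals[account] += amount
    return dict(totals)


def audit_trail(entries, account: str) -> list[str]:
    """Read view 2: full history for one account, mistakes included."""
    return [note for _, acc, _, note in entries if acc == account]


post("alice", 100, "deposit")
wrong = post("alice", 500, "deposit (typo, should have been 50)")
post("alice", -500, f"reversal of entry {wrong}")  # compensating transaction
post("alice", 50, "deposit")

alice_balance = balances(ledger)["alice"]
alice_history = audit_trail(ledger, "alice")
```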

Answer 4

Fallacy: data must be written in the same form as it will be queried. CQRS says hello.
Normalization vs denormalization becomes less of an issue if we can always translate the write-optimized event log into any read-optimized view.

Answer 5

Event log consumers are usually async
An async-updated read view is prone to the read-your-own-writes problem
Making the read-view update synchronous is tricky but possible with either:
- keeping the event log and the read view in the same storage system
- using a distributed transaction
- using linearizable storage built on total order broadcast
Keeping the entire immutable history of all changes forever may be infeasible. If data changes rarely it's fine. Otherwise good log compaction and GC are mandatory.
Laws and regulations like GDPR: data may be required to be deleted from the system, and then rewriting history is required. Examples: excision in Datomic, shunning in Fossil (true deletion of data is hard due to how file systems and disks work)
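As a rough illustration of key-based log compaction (the kind log-based brokers offer), the sketch below keeps only the latest record per key. It assumes each record fully supersedes earlier records with the same key, which holds for CDC-style changelogs but not for arbitrary event-sourced logs. The `compact` function and the use of `None` as a tombstone are assumptions for this example.

```python
def compact(log: list[tuple[str, object]]) -> list[tuple[str, object]]:
    """Keep the most recent record per key; drop keys whose last record is a tombstone."""
    latest_offset: dict[str, int] = {}
    for offset, (key, _) in enumerate(log):
        latest_offset[key] = offset          # later offsets overwrite earlier ones
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset and value is not None
    ]


log = [("a", 1), ("b", 2), ("a", 3), ("b", None), ("c", 4)]
compacted = compact(log)
```

After compaction the log size depends on the number of live keys rather than on the number of writes.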

Answer 6

1. Write to a derived view: search index, DB, cache etc.
2. Push events to users as push notifications, emails, or live dashboards for monitoring, like fraud detection systems and military tracking systems
3. Produce a new stream which may join several

Answer 7

-> CEP - Complex event processing
Approach for analyzing streams
For applications that need to search for certain event patterns
Consumer of an input stream + internal state machine matching the given patterns
-> Stream analytics
Similar to CEP, but instead of finding specific event sequences it computes aggregations and metrics over a large number of events
Rolling averages etc.
Usually computed over fixed time intervals called windows
-> Maintaining materialized views
-> Pattern matching on individual events
Saved searches with alerts (otomoto cars)
-> Message passing and RPC p468 ?
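The "consumer + internal state machine" description of CEP can be sketched with a tiny per-user state machine. The pattern (three consecutive failed logins) and the name `detect_repeated_failures` are made up for illustration.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def detect_repeated_failures(
    events: Iterable[tuple[str, str]], threshold: int = 3
) -> Iterator[str]:
    """Yield a user id when `threshold` consecutive 'login_failed' events are seen."""
    streak: dict[str, int] = defaultdict(int)   # per-user state of the matcher
    for user, kind in events:
        if kind == "login_failed":
            streak[user] += 1
            if streak[user] == threshold:
                yield user
        else:
            streak[user] = 0                     # any other event resets the pattern


events = [
    ("u1", "login_failed"),
    ("u2", "login_failed"),
    ("u1", "login_failed"),
    ("u1", "login_ok"),
    ("u2", "login_failed"),
    ("u2", "login_failed"),
]
alerts = list(detect_repeated_failures(events))
```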

Answer 8

For windowing, processors often rely on their local clock to know when a window is ready. This usually works fine unless there is a significant delay between event generation and processing. There can be many reasons for such a delay:
- queueing (the processor does not keep up with the incoming data)
- network faults
- message broker is overloaded
- events are delayed at the source (offline-enabled, untrusted clients like mobile devices)
- stream consumer is restarted or taken down (bug fixing)
- old data is REPLAYED
Basically event time vs processing time confusion.
Star Wars issue - events reach the broker out of order (like the release order of the Star Wars movies: 4-6, 1-3, 7-9). Also related to message delays. Systems A and B participate in a business process and both submit events to consumer C. A handles the request first, then calls B. A sends its event but it gets stuck in the network. B sends its event and it reaches the broker. C sees B's event first and A's second.
Spikes of messages processed after a restart - monitors for traffic spikes could trigger if only processing time is taken into account, when in fact there was no unusual request rate.
Knowing when a time window is complete - straggler events.
The issue pertains to batch processing too but is less severe and less noticeable: batch processing works on historical data (known scope) and uses record timestamps, while stream processing sees the data as it comes.

Answer 9

A straggler is an event that arrives late, after its window has been declared finished.
Dealing with stragglers:
- ignore them (usually a small percentage of all events, low impact). The drop rate can be monitored for safety
- publish a correction (updated value with stragglers included). The previous output may need to be retracted
- use special messages for finishing windows on the consumer side ("from now on there will be no more messages with a timestamp earlier than t"). Downside: each producer must be tracked individually by the consumer, and adding new ones is harder
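A sketch of the first strategy (ignore stragglers, but count them). It assumes a tumbling window and a simple heuristic for declaring a window finished: the window closes once the largest event time seen so far is `allowed_lateness` past its end. That heuristic and the class `TumblingCounter` are assumptions for this example, not something the notes prescribe.

```python
from collections import defaultdict


class TumblingCounter:
    """Counts events per tumbling window and drops events for closed windows."""

    def __init__(self, size: int, allowed_lateness: int) -> None:
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.open: dict[int, int] = defaultdict(int)   # window start -> count
        self.closed: dict[int, int] = {}               # finished windows
        self.dropped = 0                               # monitor this rate
        self.max_event_time = 0

    def on_event(self, event_time: int) -> None:
        start = event_time - event_time % self.size
        if start in self.closed:
            self.dropped += 1          # straggler: window already finished
            return
        self.open[start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness
        for s in sorted(self.open):    # sorted() copies keys, so pop is safe
            if s + self.size <= watermark:
                self.closed[s] = self.open.pop(s)


counter = TumblingCounter(size=60, allowed_lateness=10)
for t in [5, 30, 65, 59, 75, 20]:      # event times in seconds, out of order
    counter.on_event(t)
```

The event at time 59 arrives late but is still accepted because its window is open, while the event at time 20 arrives after the window closed and is counted as dropped.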

Answer 10

Events can be buffered, for example on an offline mobile device. Users can also deliberately change the time on their devices.
Adjustment using 3 timestamps:
- t1: time of the event (device clock)
- t2: time at which the event was sent to the server (device clock)
- t3: time at which the event was received by the server (server clock)
t3 - t2 estimates the offset between the device clock and the server clock (assuming negligible network delay).
Add the offset to t1 to estimate the true event time.
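The three-timestamp adjustment is plain arithmetic, sketched here with a hypothetical helper `estimate_event_time`. It assumes all timestamps are in seconds and that network delay is negligible compared to the required precision.

```python
def estimate_event_time(t1_device_event: float,
                        t2_device_send: float,
                        t3_server_receive: float) -> float:
    """Estimate when the event really happened, on the server's clock."""
    offset = t3_server_receive - t2_device_send   # device-vs-server clock offset
    return t1_device_event + offset


# Device clock runs 100 s behind the server. The event was buffered offline
# for 600 s before being sent.
estimated = estimate_event_time(
    t1_device_event=1_000.0,
    t2_device_send=1_600.0,
    t3_server_receive=1_700.0,
)
```

With these numbers the offset is 100 s, so the event is placed at 1100 on the server clock even though it only arrived at 1700.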

Answer 11

-> Tumbling
Fixed length
Each event belongs to exactly one window
[1:31, 1:32), [1:32, 1:33)
-> Hopping
Also fixed length
Windows can overlap (for smoothing etc.)
Can be built from tumbling windows over a short interval, aggregated over the hopping window length
1:30-1:35, 1:31-1:36, 1:32-1:37
-> Sliding
Contains all events that are at most a given interval apart
-> Session
No fixed duration
Groups events of the same user that are close in time
E.g. assuming the session expires after 30 minutes of inactivity, all events before that belong to the same window
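A sketch of how events could be assigned to tumbling, hopping and session windows, assuming integer timestamps in seconds. The function names are illustrative; real stream processors expose their own windowing APIs.

```python
def tumbling_window(ts: int, size: int) -> tuple[int, int]:
    """The single [start, end) window of length `size` containing ts."""
    start = ts - ts % size
    return (start, start + size)


def hopping_windows(ts: int, size: int, hop: int) -> list[tuple[int, int]]:
    """All [start, end) windows of length `size`, starting every `hop`, containing ts."""
    windows = []
    start = ts - ts % hop                 # latest window start at or before ts
    while start > ts - size:              # window still covers ts
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)


def session_windows(timestamps: list[int], gap: int) -> list[list[int]]:
    """Group one user's events; a new session starts after `gap` of inactivity."""
    sessions: list[list[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


one_minute = tumbling_window(95, size=60)
five_minute_hops = hopping_windows(1000, size=300, hop=60)
sessions = session_windows([0, 10, 100, 105, 500], gap=30)
```

With a 5-minute window and a 1-minute hop, every event lands in five overlapping windows, which is where the smoothing effect comes from.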

Answer 12

Example: detect recent trends in searched-for URLs
Each search query logs an event containing the query + result URLs
Each click on one of the resulting URLs logs another event
To compute the click-through rate, join searches and clicked URLs by session id.
Users may abandon a search, or click after an arbitrary delay -> a time-limited window is required, e.g. join a click with a search only if they are at most one hour apart
Note from book: embedding the details of the search in the click event is not the same as joining the events. It tells nothing about cases where the user did not click any of the results.
The processor needs to maintain state: for example, all events that occurred in the last hour, indexed by session id.
On a new event the processor checks the index of the other event type (searches for a click, clicks for a search).
If there is a matching search event and click event - emit a "search clicked" event.
If a search event expires without a match - emit a "no search results clicked" event.
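A simplified sketch of this stream-stream join. It assumes events arrive in timestamp order, so only a search index is kept (a click never arrives before its search); a fuller version would also index clicks. The event tuple layout and the name `join_searches_with_clicks` are assumptions.

```python
from typing import Iterable, Iterator


def join_searches_with_clicks(
    events: Iterable[tuple[int, str, str, str]], window: int
) -> Iterator[tuple]:
    """events: (timestamp, kind, session_id, payload), ordered by timestamp."""
    searches: dict[str, dict] = {}     # session_id -> pending search state
    for ts, kind, session_id, payload in events:
        # Expire searches that fell out of the join window.
        expired = [sid for sid, s in searches.items() if ts - s["ts"] > window]
        for sid in expired:
            s = searches.pop(sid)
            if not s["clicked"]:
                yield ("search_not_clicked", sid, s["query"])
        if kind == "search":
            searches[session_id] = {"ts": ts, "query": payload, "clicked": False}
        elif kind == "click" and session_id in searches:
            searches[session_id]["clicked"] = True
            yield ("search_clicked", session_id, searches[session_id]["query"], payload)


events = [
    (0, "search", "s1", "ddia"),
    (10, "search", "s2", "kafka"),
    (20, "click", "s1", "example.com/a"),
    (4000, "search", "s3", "flink"),
]
joined = list(join_searches_with_clicks(events, window=3600))
```

Searches still inside the window when the input ends stay pending, which mirrors the fact that a real stream never ends.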

Answer 13

Example from the batch chapter: user details + user activity events.
The stream processor looks at one activity event at a time
On each event it looks up the user ID in the DB
Querying a remote DB adds latency
Alternatively load a copy of the DB into the stream processor (avoids the network round-trip) -> index on local disk or in-memory hash table (like a map-side join)
Problem - a batch job uses a snapshot at a fixed point in time, but stream processing is long-running
The local copy of the DB must be kept in sync... CDC can be used for that
Then it becomes similar to a plain stream-stream join. So basically: load a DB snapshot, then join the activity stream with the CDC stream?
A stream-table join is very similar to a stream-stream join. The biggest difference is that for the table changelog stream the join uses a window reaching back to the beginning of time (a conceptually infinite window), with newer versions of records overwriting older ones. For the stream input the join might not maintain a window at all.
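A sketch of the stream-table join with a local table kept in sync by a changelog. It assumes the profile changelog and the activity events are already merged into one ordered input; the record shapes and the name `enrich` are illustrative.

```python
from typing import Iterable, Iterator, Optional


def enrich(stream: Iterable[tuple[str, str, object]]) -> Iterator[dict]:
    """stream items: ("profile", user_id, profile_or_None) or ("activity", user_id, action)."""
    users: dict[str, object] = {}          # local copy of the table
    for kind, user_id, value in stream:
        if kind == "profile":              # CDC record: newer version overwrites older
            if value is None:
                users.pop(user_id, None)   # None marks a deleted row
            else:
                users[user_id] = value
        else:                              # activity event: local lookup, no network call
            profile: Optional[object] = users.get(user_id)
            yield {"user_id": user_id, "action": value, "profile": profile}


stream = [
    ("profile", "u1", {"name": "Ann"}),
    ("activity", "u1", "login"),
    ("profile", "u1", {"name": "Anna"}),
    ("activity", "u1", "purchase"),
    ("activity", "u2", "login"),
]
enriched = list(enrich(stream))
```

Only the table side keeps state (the "infinite window"), while each activity event is joined once and forgotten.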

Answer 14

Twitter example
Viewing a user's timeline - too expensive to grab recent tweets from all followed people and merge them on every read
Instead - a timeline cache per user. Tweets are written, as they are sent, to each follower's "inbox" cache
On delete - the tweet is removed from all timeline caches
When a user follows a new user u2, u2's recent tweets must be added to the timeline (and all of u2's tweets removed on unfollow)
Inputs: a stream of tweets (add/delete) and a stream of following relationships (follow/unfollow)
The stream processor maintains a DB with the set of followers per user (it must know which timelines need to be updated on a new tweet)
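The timeline maintenance can be sketched as four event handlers over in-memory dicts. The class `Timelines` is hypothetical, and for brevity a follow copies all of the author's tweets rather than only the recent ones.

```python
from collections import defaultdict


class Timelines:
    """Maintains a per-user timeline cache from tweet and follow streams."""

    def __init__(self) -> None:
        self.followers: dict[str, set[str]] = defaultdict(set)      # author -> followers
        self.tweets: dict[str, dict[int, str]] = defaultdict(dict)  # author -> tweets
        self.timeline: dict[str, dict[int, str]] = defaultdict(dict)

    def on_tweet(self, author: str, tweet_id: int, text: str) -> None:
        self.tweets[author][tweet_id] = text
        for follower in self.followers[author]:      # fan out to each inbox
            self.timeline[follower][tweet_id] = text

    def on_delete(self, author: str, tweet_id: int) -> None:
        self.tweets[author].pop(tweet_id, None)
        for follower in self.followers[author]:
            self.timeline[follower].pop(tweet_id, None)

    def on_follow(self, follower: str, author: str) -> None:
        self.followers[author].add(follower)
        self.timeline[follower].update(self.tweets[author])

    def on_unfollow(self, follower: str, author: str) -> None:
        self.followers[author].discard(follower)
        for tweet_id in self.tweets[author]:
            self.timeline[follower].pop(tweet_id, None)


t = Timelines()
t.on_tweet("bob", 1, "hi")
t.on_follow("alice", "bob")
t.on_tweet("bob", 2, "yo")
t.on_delete("bob", 1)
after_delete = dict(t.timeline["alice"])
t.on_unfollow("alice", "bob")
after_unfollow = dict(t.timeline["alice"])
```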

Answer 15

All types of joins require the processor to maintain some state
The state is built from one join input and queried when a message arrives on the other stream
If events on different streams happen close in time, in which order are they processed?
When a user updates their profile, which events will be joined with the old and which with the new profile data?
When state changes over time, which point in time is used for the join?
If the order is undetermined then the join is non-deterministic
This is the data warehouse problem of the slowly changing dimension (SCD)
Fix: use a unique id for each particular version of the joined record
Downside - log compaction is not possible, all versions of records must be retained
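A small sketch of the versioned-id fix, using tax rates as the slowly changing dimension. Each sale carries the id of the rate version that applied when it happened, so the join result does not depend on processing order. The version ids, the rates and the helper `with_tax` are invented for the example.

```python
# Every version of the dimension record is retained (no compaction possible).
tax_rates = {"v1": 0.20, "v2": 0.23}

# Each event references the exact version it must be joined with.
sales = [
    ("order-1", 100.0, "v1"),
    ("order-2", 100.0, "v2"),
    ("order-3", 50.0, "v1"),   # processed late, still joined with v1
]


def with_tax(sales, rates):
    """Join each sale with the tax-rate version it names."""
    return [
        (order_id, round(amount * (1 + rates[version]), 2))
        for order_id, amount, version in sales
    ]


totals = with_tax(sales, tax_rates)
totals_reordered = with_tax(list(reversed(sales)), tax_rates)
```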

Answer 16

BP - input is immutable -> transparent retry is possible: if a task of the job fails, just rerun it and discard the failed output
Output is made visible in HDFS only after the job completes, so all is good
Output is the same as if nothing had ever gone wrong
It appears that every record was processed EXACTLY ONCE: none skipped, none processed more than once
EXACTLY ONCE (more accurately: effectively once) semantics
In SP it's less straightforward. Waiting for the task to finish before making output visible is not an option, as the stream is INFINITE and the task never finishes
Solutions:
-> Microbatching + checkpointing
-> Atomic commit

Answer 17

Microbatching: split the stream into small blocks, each treated as a mini batch job
Used by Spark Streaming
Batch size is usually around one second of data (smaller batches incur scheduling + coordination penalties, larger ones mean a longer delay before output is visible)
Gives a tumbling window out of the box
Checkpointing: Apache Flink generates periodic rolling checkpoints of state and writes them to durable storage. On crash - restart from the last checkpoint and discard any output generated between the checkpoint and the crash
Checkpoints are triggered by barriers in the message stream (similar to boundaries between microbatches but without enforcing any particular window size)
Within the stream processing framework, the microbatching/checkpointing approach provides the same exactly-once semantics as batch processing
But as soon as output leaves the processor (side effects like sending an email or writing to a DB), the framework can no longer discard the output of a failed batch, so on retry the side effect happens twice or more.
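A toy sketch of checkpoint-and-restart, assuming a replayable input log and a word-count state. The function `run`, its `crash_at` parameter and the fixed checkpoint interval are inventions for illustration and say nothing about how Flink or Spark implement this internally.

```python
from typing import Optional


def run(log: list[str], state: dict, start_offset: int,
        checkpoint_every: int, crash_at: Optional[int] = None):
    """Process log[start_offset:]; return (last checkpoint, finished?)."""
    checkpoint = (start_offset, dict(state))
    for offset in range(start_offset, len(log)):
        if crash_at is not None and offset == crash_at:
            # Simulated crash: work done since the last checkpoint is discarded.
            return checkpoint, False
        word = log[offset]
        state[word] = state.get(word, 0) + 1
        if (offset + 1) % checkpoint_every == 0:
            checkpoint = (offset + 1, dict(state))   # durable snapshot stand-in
    return (len(log), dict(state)), True


log = ["a", "b", "a", "c", "a", "b"]

# First attempt crashes at offset 5; only the checkpoint at offset 4 survives.
(saved_offset, saved_state), finished = run(log, {}, 0, checkpoint_every=2, crash_at=5)

# Restart from the checkpoint and replay the input from the saved offset.
(final_offset, final_state), finished_after_retry = run(
    log, dict(saved_state), saved_offset, checkpoint_every=2
)

# Reference run with no failure at all.
(_, clean_state), _ = run(log, {}, 0, checkpoint_every=2)
```

The retried run ends in the same state as the clean run, which is the effectively-once guarantee inside the framework; it does not cover side effects that already left the processor.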

Answer 18

To preserve the illusion of exactly-once, side effects must take effect only if processing is successful. Either all of them happen atomically or none do.
TODO: re-read p477, 478
Basically a distributed transaction in a restricted environment (which makes an efficient distributed transaction possible)
Idempotence:
Goal - discard the partial output of a failed task so that we can retry
Each consumed message has an offset
Provide it when writing a value to the external DB, so the DB can check whether the update has already been applied
Assumes that messages are replayed in the same order when a task is restarted (a log-based message broker guarantees that within a partition), that processing is deterministic, and that no other node can update the value concurrently.
Fencing tokens may be required on failover (to guard against a node that is thought to be dead but is actually alive)
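The offset-based idempotence trick can be sketched as follows. `IdempotentStore` is a hypothetical stand-in for the external DB; the sketch assumes the value and the last applied offset are written together atomically and that a single writer replays messages in offset order.

```python
class IdempotentStore:
    """Applies each message at most once by remembering the last applied offset."""

    def __init__(self) -> None:
        self.value = 0
        self.last_offset = -1

    def apply(self, offset: int, delta: int) -> bool:
        if offset <= self.last_offset:
            return False               # already applied before the crash: skip
        # In a real DB these two writes would be one atomic update.
        self.value += delta
        self.last_offset = offset
        return True


messages = [(0, 5), (1, 7), (2, -2)]   # (offset, delta) from one partition
store = IdempotentStore()

# The task processes two messages, then crashes before committing its offset.
for offset, delta in messages[:2]:
    store.apply(offset, delta)

# After restart the whole partition is replayed from the beginning.
applied_on_replay = [store.apply(offset, delta) for offset, delta in messages]
```

Replaying the first two messages is harmless, so the retry produces the same value as a run without failure.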

Answer 19

If a stream processor requires state (windowed aggregations, or any tables and indexes used for joins), that state must be recoverable after a failure
Keeping the state in a remote DB and replicating it is one option, but querying a remote DB for each message is slow
Keeping the state locally and replicating it periodically is the alternative
A retried task can then read the replicated state and resume processing without data loss
Example: Flink captures snapshots of operator state and writes them to durable storage, e.g. HDFS (p479)
Sometimes replicating the state is not needed because it can be rebuilt from the input streams. If the aggregation window is short, just replay the messages from that window

(43 cards)