Data Intensive Ch12 - Future of data systems Flashcards
What the final chapter is about
Previous chapters described how things ARE at present; the final chapter discusses how things SHOULD be
Goal of the book was to…
…explore how to create apps and systems that are
- RELIABLE
- SCALABLE
- MAINTAINABLE
Recurring theme in the book
For any given problem there are several solutions, each with its own pros, cons, and trade-offs:
Storage engines - log based, B-tree based, column-oriented
Replication - single leader, multi leader, leaderless
The goal of data integration is…
…to make sure the data ends up in the right form in all the right places
E.g. maintaining a search index by consuming the database's change log (CDC / total order broadcast)
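A minimal sketch of this in Python, assuming a hypothetical totally ordered change log of upsert/delete events and a toy in-memory index standing in for a real search server:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    op: str                     # "upsert" or "delete"
    doc_id: str
    doc: Optional[dict] = None  # document payload for upserts

class SearchIndex:
    """Toy in-memory index standing in for e.g. Elasticsearch."""
    def __init__(self) -> None:
        self.docs: dict = {}

    def upsert(self, doc_id: str, doc: dict) -> None:
        self.docs[doc_id] = doc

    def delete(self, doc_id: str) -> None:
        self.docs.pop(doc_id, None)

def apply_change_log(index: SearchIndex, log: list) -> None:
    # The log is totally ordered, so replaying it in order is deterministic:
    # every consumer that sees the same log converges to the same index state.
    for event in log:
        if event.op == "upsert":
            index.upsert(event.doc_id, event.doc)
        else:  # "delete"
            index.delete(event.doc_id)
```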
Why is asynchrony what makes systems based on event logs robust?
Allows a fault in one part of the system to be contained locally.
With synchronous distributed transactions, all participants must abort if any one of them fails, so failures are amplified and spread to the rest of the system.
Batch and stream processing vs maintaining derived state
Stream allows changes in the input to be reflected in derived views with LOW DELAY
Batch allows LARGE AMOUNTS of historical accumulated data to be reprocessed in order to derive new views onto an existing dataset
This supports evolvability of the system (a new feature can easily derive the view it requires)
Without the ability to fully reprocess the data, schema evolution is limited to adding a new optional field to a record or a new type of record, for both schema-on-write and schema-on-read.
With reprocessing it is possible to restructure the dataset into whatever model serves the new requirements best (sketched below).
Users can be gradually routed to the new view
Old view still exists and can be switched back to.
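A compressed Python sketch of reprocessing (event fields are invented for illustration): the same immutable event log is replayed through a new derivation function to build a restructured view, while the old view keeps serving traffic:

```python
# The immutable event log: the source of truth that can always be replayed.
events = [
    {"user_id": 1, "name": "Ada Lovelace"},
    {"user_id": 2, "name": "Alan Turing"},
]

def old_view(log):
    # v1 schema: full name stored as a single field
    return {e["user_id"]: {"name": e["name"]} for e in log}

def new_view(log):
    # v2 schema: restructured into first/last name, a change that a mere
    # "add an optional field" migration could not express
    def split(name):
        first, _, last = name.partition(" ")
        return {"first_name": first, "last_name": last}
    return {e["user_id"]: split(e["name"]) for e in log}

v1 = old_view(events)  # still served, and available for rollback
v2 = new_view(events)  # users can be routed here gradually
```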
Schema migrations on railways
In early-19th-century England there were various competing standards for the gauge (the distance between the two rails)
Trains had to be built for one particular gauge and were incompatible with the others
Once a single standard was decided on in 1846, existing tracks of other gauges had to be converted.
Shutting down an entire train line for months or more was not an option.
The solution was to convert the track to dual gauge by adding a third rail, which was done gradually.
Eventually the nonstandard rail could be removed.
Nevertheless, it is an expensive undertaking, and non-standard gauges are still in use (e.g. BART in San Francisco vs. the rest of the USA)
Lambda architecture
Running a batch and a stream processing system in parallel, e.g. Hadoop MapReduce and Storm
Stream processor consumes the events and quickly produces APPROXIMATE update of the read view.
Batch processor later consumes the same set of events and produces a corrected version of the derived view.
Basically: stream = fast but approximate, batch = slow but exact (sketched after the cons below).
Cons:
- the same logic is duplicated in two places
- the two processors produce separate outputs that need to be merged (complex if the view is derived using joins or sessionization, or if the output is not a time series, etc.)
- running a full reprocess is expensive on large datasets; incremental batches can be used, but this raises timing problems such as stragglers and how to handle windows that cross batch boundaries.
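A minimal single-threaded Python sketch of the lambda pattern for a per-page view counter (names invented, concurrency and cutover races ignored): the stream layer keeps a fast running count, the batch layer periodically recomputes exact counts from the full log, and reads merge the two:

```python
from collections import Counter

event_log = []             # full immutable history, input to the batch layer
batch_counts = Counter()   # exact counts, recomputed periodically
stream_counts = Counter()  # fast (possibly approximate) counts since the last batch run

def on_event(page: str) -> None:
    event_log.append(page)
    stream_counts[page] += 1  # speed layer: update the read view immediately

def run_batch() -> None:
    global batch_counts, stream_counts
    batch_counts = Counter(event_log)  # batch layer: slow, exact recomputation over all history
    stream_counts = Counter()          # stream layer now only covers events after this run

def read(page: str) -> int:
    # merge step: exact base from the batch layer plus the recent stream delta
    return batch_counts[page] + stream_counts[page]
```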
Recent work: unifying batch and stream processing in one system. This requires capabilities like:
- the ability to replay historical events through the same processing engine (e.g. via log-based message brokers)
- exactly-once semantics for stream processors
- tools for windowing by EVENT time, not processing time (processing time is meaningless when reprocessing historical events)
What’s wrong with using processing time for windowing when processing events in a stream processor?
Replaying makes processing time meaningless: windowing by processing time assumes an event is processed shortly after it occurs, but a replay can take place long after that.
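A minimal Python sketch of event-time tumbling windows (the (event_time, payload) event shape is invented): events are bucketed by the timestamp embedded in the event, so replaying old events long after the fact still lands them in their original windows:

```python
from collections import defaultdict

WINDOW = 60  # tumbling window size in seconds

def window_start(event_time: float) -> float:
    # Bucket by the event's own timestamp, not by the wall clock at processing time.
    return event_time - (event_time % WINDOW)

def count_by_window(events):
    """events: iterable of (event_time, payload) tuples."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        counts[window_start(event_time)] += 1
    return dict(counts)

# Replayed historical events still fall into their original windows:
print(count_by_window([(100.0, "a"), (110.0, "b"), (200.0, "c")]))
# -> {60.0: 2, 180.0: 1}
```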