E10 Flashcards
The main idea of the Lambda Architecture is…
To build Big Data systems as a series of layers
Each layer satisfies a subset of the desired properties and builds upon the functionality provided by the layers beneath it
Batch Layer
- Manages the master dataset – an immutable, append-only set of raw data
- Pre-computes arbitrary query functions – called batch views
- Runs in a loop and continuously recomputes the batch views from scratch
- Very simple to use and understand
- Scales by adding new machines
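The recompute-from-scratch idea can be sketched in a few lines. This is a minimal illustration, not a real batch framework: the event records, `master_dataset`, and `compute_batch_view` are hypothetical names chosen for the example, and a production system would run the same kind of function over a distributed store.

```python
from collections import Counter

# Hypothetical master dataset: an append-only list of raw pageview events.
master_dataset = [
    {"url": "/home", "ts": 1},
    {"url": "/about", "ts": 2},
    {"url": "/home", "ts": 3},
]

def compute_batch_view(records):
    """Recompute the pageviews-per-URL view from scratch over all records."""
    return dict(Counter(r["url"] for r in records))

# First iteration of the batch loop.
batch_view = compute_batch_view(master_dataset)

# New data is appended, never updated in place; the next iteration
# simply recomputes the whole view from the full dataset.
master_dataset.append({"url": "/about", "ts": 4})
batch_view = compute_batch_view(master_dataset)
```

Because the view is always a pure function of all the data, a bug in `compute_batch_view` can be fixed and the view rebuilt without repairing any stored state.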
Speed Layer
- Accommodates all requests that are subject to low-latency requirements
- Does incremental computation instead of the recomputation done in the batch layer
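As a rough contrast with batch recomputation, the sketch below updates a view one event at a time. The names `realtime_view` and `on_new_event` are assumptions made for this example; a real speed layer would typically sit behind a stream processor and a database that supports random writes.

```python
from collections import defaultdict

# Hypothetical realtime view: pageview counts for recent events only.
realtime_view = defaultdict(int)

def on_new_event(event):
    """Incrementally update the view; no earlier events are re-read."""
    realtime_view[event["url"]] += 1

# Simulate a small stream of events arriving after the last batch run.
for event in [{"url": "/home"}, {"url": "/home"}, {"url": "/contact"}]:
    on_new_event(event)
```

Each update touches only the affected key, which is what keeps latency low, at the cost of maintaining mutable state.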
Speed Layer Goal
To have updated information on what happened since the last batch view was generated
Serving Layer Goal
To merge views created by the batch layer with views created by the speed layer
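For a simple additive metric, merging can be as small as summing the two views. The sketch below assumes both views are plain dictionaries of counts and that the realtime view covers only events newer than the batch view; `query_pageviews` is a hypothetical helper, and non-additive metrics would need a different merge function.

```python
def query_pageviews(url, batch_view, realtime_view):
    """Answer a query by combining the batch view with the realtime view."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

# Hypothetical views: the batch view covers history up to the last batch run,
# the realtime view covers only what has arrived since then.
batch_view = {"/home": 100, "/about": 40}
realtime_view = {"/home": 3, "/contact": 1}

total_home = query_pageviews("/home", batch_view, realtime_view)
total_contact = query_pageviews("/contact", batch_view, realtime_view)
```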
Difference Between Batch and Speed Layer
One big difference is that the speed layer only looks at recent data, whereas the batch layer looks at all the data at once
Serving Layer
- Indexes batch views so that they can be queried with low latency
- The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it
- When new batch views are available, the serving layer automatically swaps those in so that more up-to-date results are available
- It does not need to support specific record updates
  - This is a very important point, as random writes cause most of the complexity in databases
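The points above can be sketched as a toy in-memory store. `ServingLayer`, `load_batch_view`, and `get` are hypothetical names for this illustration; a real serving layer is a distributed database, but the shape is the same: whole views are swapped in, and there is no method for updating a single record.

```python
class ServingLayer:
    """Toy read-only store: whole views are swapped in, records are never updated."""

    def __init__(self):
        self._view = {}

    def load_batch_view(self, new_view):
        # Copy the new view, then swap it in with a single reference rebind.
        self._view = dict(new_view)

    def get(self, key, default=None):
        # Random read by key against the currently loaded view.
        return self._view.get(key, default)

serving = ServingLayer()
serving.load_batch_view({"/home": 100})
before = serving.get("/home")

# A newer batch view becomes available and replaces the old one wholesale.
serving.load_batch_view({"/home": 120, "/about": 45})
after = serving.get("/home")
```

Leaving out a per-record write path is the design choice that keeps this layer simple.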
Storing data in raw format has many advantages:
- Data is always true (or correct): every record remains correct once written, so there is no need to go back and rewrite existing records; you simply append new data
- You can always go back to the data and perform queries you did not anticipate when building the system
Data stored in raw format should be
- Immutable
- Kept forever
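A small sketch of what immutable, kept-forever raw data might look like in practice. The `Fact` record and both query helpers are assumptions made for this example: each change is stored as a new timestamped fact rather than an overwrite, so a question nobody anticipated (what was the value at time t?) can still be answered later.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Fact:
    """Hypothetical raw record; frozen so it cannot be modified after creation."""
    user_id: int
    field: str
    value: str
    timestamp: int

# Append-only master dataset: a change of location is a new fact, not an update.
master_dataset = []
master_dataset.append(Fact(1, "location", "Lisbon", 100))
master_dataset.append(Fact(1, "location", "Porto", 200))

def value_as_of(facts, user_id, field, ts):
    """Return the latest value recorded at or before ts, or None."""
    matching = [f for f in facts
                if f.user_id == user_id and f.field == field and f.timestamp <= ts]
    return max(matching, key=lambda f: f.timestamp).value if matching else None

def current_value(facts, user_id, field):
    """Return the most recently recorded value, or None."""
    return value_as_of(facts, user_id, field, float("inf"))

now = current_value(master_dataset, 1, "location")
earlier = value_as_of(master_dataset, 1, "location", 150)
```

Attempting to assign to a field of an existing `Fact` raises `FrozenInstanceError`, which mirrors the rule that stored records are never rewritten.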