Session 5.2 Flashcards
Desired Properties of a Big Data System
1 Robustness and fault tolerance
2 Low latency
3 Minimal Maintenance
4 Ad hoc queries
Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem
- There is no single tool that provides a complete solution
- Instead, you have to use a variety of tools and techniques to build a complete Big Data system
The main idea of the Lambda Architecture is to…
build Big Data systems as a series of layers
Each layer satisfies a subset of the desired properties and builds upon the functionality provided by the layers beneath it
Storing data in raw format has many advantages:
- Data is always true (or correct): existing records never need to be rewritten; you simply append new data
- You can always go back to the data and perform queries you did not anticipate when building the system
Data stored in raw format should be
- Immutable
- Kept forever
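A minimal sketch of what "immutable and kept forever" can look like in code. The names `Fact`, `RawStore`, and `current_city` are hypothetical, chosen only for illustration: records are frozen, the store only ever appends, and new questions are answered by running a function over the full history.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)  # frozen=True makes each record immutable
class Fact:
    user: str
    field: str
    value: str
    timestamp: int


class RawStore:
    """Hypothetical append-only store: facts are added, never changed or removed."""

    def __init__(self) -> None:
        self._facts: List[Fact] = []

    def append(self, fact: Fact) -> None:
        self._facts.append(fact)

    def query(self, fn: Callable[[Tuple[Fact, ...]], object]) -> object:
        # Any function can be run over the whole history, including
        # questions nobody anticipated when the store was built.
        return fn(tuple(self._facts))


def current_city(facts: Tuple[Fact, ...]) -> dict:
    """Derive each user's latest city from the raw facts."""
    latest = {}
    for fact in sorted(facts, key=lambda f: f.timestamp):
        if fact.field == "city":
            latest[fact.user] = fact.value
    return latest


store = RawStore()
store.append(Fact("alice", "city", "Paris", 1))
store.append(Fact("alice", "city", "Berlin", 2))  # a move is a new fact, not an update

print(store.query(current_city))
print(store.query(lambda facts: [f.value for f in facts if f.user == "alice"]))
```

The second query (every city alice has lived in) is only possible because the older fact was kept rather than overwritten.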
Distributed file systems are quite similar to the file systems of your computer, except they spread their storage across a cluster of computers
- They scale by adding more machines to the cluster
- They are designed for fault tolerance: if you lose one machine, all your files and data are still accessible
The operations you can do with a distributed filesystem are often…
more limited than what you can do with a regular filesystem
For instance, you may not be able to write to the middle of a file or even modify a file at all after creation
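A toy, in-memory sketch of the two ideas above: each file is copied to several machines so it survives the loss of one, and files cannot be modified after creation. `ToyDFS` and its placement rule are assumptions made for illustration; no real distributed filesystem is implemented this way.

```python
import zlib


class ToyDFS:
    """Toy model only: replicated, write-once files held in memory."""

    def __init__(self, machines, replication=2):
        self.machines = list(machines)
        self.replication = replication
        self.storage = {m: {} for m in self.machines}  # machine -> {path: bytes}
        self.down = set()  # machines that are currently unavailable

    def _replicas(self, path):
        # Deterministic placement: start at a hashed position and take
        # the next `replication` machines around the ring.
        start = zlib.crc32(path.encode()) % len(self.machines)
        return [self.machines[(start + i) % len(self.machines)]
                for i in range(self.replication)]

    def create(self, path, data):
        # Write-once: an existing file can be neither overwritten nor edited.
        if any(path in files for files in self.storage.values()):
            raise FileExistsError(f"{path} already exists and cannot be modified")
        for machine in self._replicas(path):
            self.storage[machine][path] = data

    def read(self, path):
        # Any live replica can serve the read.
        for machine in self._replicas(path):
            if machine not in self.down and path in self.storage[machine]:
                return self.storage[machine][path]
        raise FileNotFoundError(path)


dfs = ToyDFS(["node-a", "node-b", "node-c"], replication=2)
dfs.create("/logs/day1", b"click,view")

# Lose one of the machines holding the file; the other copy still serves it.
dfs.down.add(dfs._replicas("/logs/day1")[0])
print(dfs.read("/logs/day1"))
```

With two copies per file, any single machine can fail without making a file unreadable; tolerating more failures means a higher replication factor.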
Google needed a good distributed file system, why?
Redundant storage of massive amounts of data on cheap and unreliable computers
Why did Google not use an existing file system?
- Google’s problems were different from anyone else’s: different workload and design priorities
- Google File System is designed for Google apps and workloads
- Google apps are designed for Google File System
Google File System - Assumptions
- High component failure rates
- “Modest” number of HUGE files
- Files are write-once, mostly appended to
- Large streaming reads
Bigtable is…
a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
The Split-Apply-Combine Approach
Split
Break up a big problem into manageable pieces
The Split-Apply-Combine Approach
Apply
Operate on each piece independently
The Split-Apply-Combine Approach
Combine
Put all the pieces back together
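The three steps can be sketched in plain Python. The sales records and the helper names `split`, `apply`, and `combine` are made up for illustration; the example totals an amount per region.

```python
from collections import defaultdict

# Hypothetical input: (region, sale amount) pairs
records = [("north", 120), ("south", 80), ("north", 30), ("east", 50), ("south", 20)]


def split(rows):
    """Split: break the dataset into one piece per region."""
    groups = defaultdict(list)
    for region, amount in rows:
        groups[region].append(amount)
    return dict(groups)


def apply(groups, fn):
    """Apply: run fn on each piece independently of the others."""
    return {key: fn(values) for key, values in groups.items()}


def combine(results):
    """Combine: put the per-piece results back together as one sorted list."""
    return sorted(results.items())


totals = combine(apply(split(records), sum))
print(totals)  # [('east', 50), ('north', 150), ('south', 100)]
```

Because each piece is processed independently in the apply step, the pieces could be handed to different machines without changing the result.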
The MapReduce model consists of two main stages
Map
input data is split into discrete chunks to be processed
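A small sketch of the Map stage only, using word counting as an assumed example. The helpers `chunk` and `map_chunk` are hypothetical names, and the chunks are processed in a simple loop here, whereas a real MapReduce system would distribute them across machines.

```python
def chunk(lines, size):
    """Split the input into discrete chunks of at most `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_chunk(lines):
    """Map function: emit a (word, 1) pair for every word in one chunk."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


lines = ["big data big systems", "data is immutable", "big clusters"]
chunks = chunk(lines, 2)

# Each chunk is mapped without looking at any other chunk.
mapped = [list(map_chunk(c)) for c in chunks]
for pairs in mapped:
    print(pairs)
```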