Session 5.2 Flashcards
Desired Properties of a Big Data System
1 Robustness and fault tolerance
2 Low latency
3 Minimal Maintenance
4 Ad hoc queries
Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem
- There is no single tool that provides a complete solution
- Instead, you have to use a variety of tools and techniques to build a complete Big Data system
The main idea of the Lambda Architecture is to…
build Big Data systems as a series of layers
Each layer satisfies a subset of the desired properties and builds upon the functionality provided by the layers above it
Storing data in raw format has many advantages:
- Data is always true (or correct): all records are always correct; no need to go back and re-write existing records; you can simply append new data
- You can always go back to the data and perform queries you did not anticipate when building the system
Data should be stored in raw format, should be
- Immutable
2. Kept forever
Distributed file systems are quite similar to the file systems of your computer, except they spread their storage across a cluster of computers
- They scale by adding more machines to the cluster
- Designed so that you have fault tolerance when a machine goes down, meaning that if you lose one machine, all your files and data will still be accessible
The operations you can do with a distributed filesystem are often…
more limited than you can do with a regular filesystem
For instance, you may not be able to write to the middle of a file or even modify a file at all after creation
Google needed a good distributed file system, why?
Redundant storage of massive amounts of data on cheap and unreliable computers
Why did Google not use an existing file system?
- Google’s problems were different from anyone else’s
- > Different workload and design priorities
- Google File System is designed for Google apps and workloads
- Google apps are designed for Google File System
Google File System - Assumptions
- High component failure rates
- “Modest” number of HUGE files
- Files are write-once, mostly appended to
- Large streaming reads
Bigtable is…
a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
The Split-Apply-Combine Approach
Split
Break up a big problem into manageable pieces
The Split-Apply-Combine Approach
Apply
Operate on each piece independently
The Split-Apply-Combine Approach
Combine
Put all the pieces back together
The MapReduce model consists of two main stages
Map
input data is split into discrete chunks to be processed
The MapReduce model consists of two main stages
Reduce
output of the map phase is aggregated to produce the desired result
MapReduce
The simple nature of the programming model lends itself to…
efficient and large-scale implementations across thousands of cheap nodes (computers).
- efficient
- large-scale implementations
Key benefits of MapReduce
- Simplicity
- Scalability
- Speed
- Recovery
- Minimal data motion
Key benefits of MapReduce
Simplicity
Developers can write applications in their language of choice, such as Java, C++ or Python
Key benefits of MapReduce
Scalability
MapReduce can very large amounts of data, stored in HDFS on one cluster
Key benefits of MapReduce
Speed
Parallel processing means that MapReduce can take problems that used to take days to solve and solve them in hours or minutes
Limitations of MapReduce
- MapReduce is designed specifically for batch processing
2. Low level framework (hard to use)
New tools have been developed to simplify the use of MapReduce
- Apache HIVE (similar to SQL)
- Apache Pig (script language)