Session 5 Flashcards
You decide to build a simple web analytics application to better understand the behavior of your users.
Requirements:
R1 - The application should track the number of views for any video you tell it to track
-> The application’s web server gets a message every time a tracked video is watched
R2 - Additionally, the application should be able to tell you at any point what the top 100 videos are by number of views
Desired Properties of a Big Data System
- Robustness and fault tolerance
- Low latency
- Minimal maintenance
- Ad hoc queries
Batch Layer
- Manages the master dataset – an immutable, append-only set of raw data
- Pre-computes arbitrary query functions – called batch views
- Runs in a loop and continuously recomputes the batch views from scratch
- Very simple to use and understand
- Scales by adding new machines.
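A minimal sketch of a batch view recomputation for R1/R2 (the event fields, function names, and sample data below are illustrative, not from the source):

```python
# Minimal batch-layer sketch: the master dataset is assumed to be a list of raw
# "view" events like {"video_id": ..., "ts": ...}. The batch view
# (video_id -> total views) is recomputed from scratch on every pass.
from collections import Counter
from typing import Iterable

def compute_batch_view(master_dataset: Iterable[dict]) -> Counter:
    """Recompute the view-count batch view from the full, immutable dataset."""
    view_counts = Counter()
    for event in master_dataset:
        view_counts[event["video_id"]] += 1
    return view_counts

# In a real system this runs in a loop; here we do a single pass over toy data.
master_dataset = [
    {"video_id": "v1", "ts": 1}, {"video_id": "v2", "ts": 2}, {"video_id": "v1", "ts": 3},
]
batch_view = compute_batch_view(master_dataset)
print(batch_view.most_common(100))  # top 100 videos by views (requirement R2)
```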
Speed Layer
- Accommodates all requests that are subject to low latency requirements
- Its goal is to ensure new data is represented in query functions as quickly as needed for the application requirements
- Similar to the batch layer in that it produces views based on data it receives
-> One big difference is that the speed layer only looks at recent data, whereas the batch layer looks at all the data at once
- Does incremental computation instead of the recomputation done in the batch layer
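A matching sketch of the speed layer's incremental update, using the same illustrative event shape as above:

```python
# Minimal speed-layer sketch: instead of recomputing from scratch, the realtime
# view is updated incrementally as each new event arrives, covering only the
# recent data not yet absorbed by the batch layer.
from collections import Counter

realtime_view = Counter()

def handle_new_event(event: dict) -> None:
    """Fold one freshly received view event into the realtime view."""
    realtime_view[event["video_id"]] += 1

handle_new_event({"video_id": "v3", "ts": 4})
handle_new_event({"video_id": "v1", "ts": 5})
print(realtime_view)  # new data is reflected immediately, not after the next batch run
```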
Serving Layer
- Indexes batch views so that they can be queried with low latency
- The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it
- When new batch views are available, the serving layer automatically swaps those in so that more up-to-date results are available
- It does not need to support random writes to individual records
-> This is a very important point, as random writes cause most of the complexity in databases
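A toy sketch of a serving layer that bulk-loads batch views and merges them with the speed layer's realtime view at query time (in-memory dicts stand in for the distributed database; all names and sample data are illustrative):

```python
# Toy serving-layer sketch: new batch views are swapped in wholesale (bulk load,
# no per-record updates), and queries do random reads over the loaded view,
# merged with the speed layer's realtime view.
from collections import Counter

class ServingLayer:
    def __init__(self):
        self._batch_view = Counter()

    def swap_in(self, new_batch_view: Counter) -> None:
        """Replace the whole batch view in one step; no random writes needed."""
        self._batch_view = new_batch_view

    def top_videos(self, realtime_view: Counter, n: int = 100):
        """Merge the indexed batch view with recent data and return the top n."""
        merged = self._batch_view + realtime_view
        return merged.most_common(n)

batch_view = Counter({"v1": 2, "v2": 1})      # produced by the batch layer
realtime_view = Counter({"v3": 1, "v1": 1})   # produced by the speed layer
serving = ServingLayer()
serving.swap_in(batch_view)
print(serving.top_videos(realtime_view, n=100))
```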
The Hadoop Distributed File System (HDFS) is the open-source alternative to the Google File System
- Commodity hardware
- Tolerant to failure
(batch layer)
How do distributed file systems work?
- All files are broken into blocks (usually 64 to 256 MB)
- These blocks are replicated (typically 3 copies) among the HDFS servers (datanodes)
- The namenode provides a lookup service for clients accessing the data and ensures the nodes are correctly replicated across the cluster
- Broken into blocks
- Blocks are replicated
- Lookup service
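A toy model of block splitting, replication, and namenode lookup (this is not the real HDFS API; block size, replication factor, and node names are illustrative):

```python
# Toy HDFS-style placement: files are split into fixed-size blocks, each block
# is replicated to several datanodes, and a "namenode" table lets clients look
# up which nodes hold each block.
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB blocks (HDFS typically uses 64-256 MB)
REPLICATION = 3                 # typical replication factor

datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
node_cycle = itertools.cycle(datanodes)
namenode = {}                   # (filename, block index) -> list of datanodes

def put_file(filename: str, size_bytes: int) -> None:
    """Register a file: split it into blocks and assign replicas round-robin."""
    n_blocks = -(-size_bytes // BLOCK_SIZE)   # ceiling division
    for i in range(n_blocks):
        namenode[(filename, i)] = [next(node_cycle) for _ in range(REPLICATION)]

put_file("views.log", 200 * 1024 * 1024)      # a 200 MB file -> 4 blocks
print(namenode[("views.log", 0)])             # lookup: which datanodes hold block 0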
Hadoop MapReduce
- Hadoop MapReduce is an open-source implementation of MapReduce, a distributed computing paradigm originally pioneered by Google
- Used to process data in the batch layer
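A single-process sketch of the MapReduce flow applied to view counting (not Hadoop's actual API; map_fn, reduce_fn, and the sample log are illustrative):

```python
# Sketch of the MapReduce flow: map emits (key, value) pairs, the framework
# groups them by key (shuffle), and reduce folds each group into a result.
# Here the job counts views per video.
from collections import defaultdict

def map_fn(line: str):
    video_id = line.strip()
    yield (video_id, 1)                 # one pair per recorded view

def reduce_fn(video_id: str, counts):
    return (video_id, sum(counts))      # total views for this video

raw_log = ["v1", "v2", "v1", "v3", "v1"]   # one video id per log line

# Map phase
pairs = [pair for line in raw_log for pair in map_fn(line)]
# Shuffle phase: group emitted values by key
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)
# Reduce phase
print([reduce_fn(k, v) for k, v in groups.items()])   # [('v1', 3), ('v2', 1), ('v3', 1)]
```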
The Split-Apply-Combine Approach
Each “Apply” operation can be performed independently of other “Apply” operations
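A sketch of split-apply-combine on a toy dataset; the thread pool is only there to make the point that each apply runs independently and could be sent to a different machine (all names and data are illustrative):

```python
# Split-apply-combine sketch: split records into groups, apply a function to
# each group independently (no shared state between applies), then combine the
# per-group results into one output.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

records = [("v1", 12.0), ("v2", 3.5), ("v1", 8.0), ("v2", 1.5)]  # (video_id, minutes watched)

# Split: partition the records by video id
groups = defaultdict(list)
for video_id, minutes in records:
    groups[video_id].append(minutes)

# Apply: each group is processed on its own, here concurrently
def total_minutes(minutes_list):
    return sum(minutes_list)

with ThreadPoolExecutor() as pool:
    results = pool.map(total_minutes, groups.values())

# Combine: stitch the independent results back together
print(dict(zip(groups.keys(), results)))   # {'v1': 20.0, 'v2': 5.0}
```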
Big Data infrastructure as a service
Elastic clouds
- Elastic clouds allow you to rent hardware on demand rather than own your own hardware in your own location.
- Elastic clouds let you increase or decrease the size of your cluster nearly instantaneously, so if you have a big job you want to run, you can allocate the hardware temporarily.
- Elastic clouds dramatically simplify system administration. They also provide additional storage and hardware allocation options that can significantly drive down the price of your infrastructure.
- Rent hardware on demand
- Increase or decrease the size of your cluster nearly instantaneously
- Simplify system administration
Examples of elastic cloud providers
- Microsoft Azure
- Amazon Web Services
- Digital Ocean
Many machine learning tools can be used on top of this infrastructure (elastic clouds)
- Amazon ML
- Microsoft Azure ML
- H2O (from 0xdata)
All of these tools implement state-of-the-art ML algorithms out of the box. Some of them are also extensible, i.e., you can implement your own algorithms
A Client / Server Model
1 Clients connect to a server over the Internet
2 Clients send a request
3 The server issues a response
4 Clients display the response
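A minimal illustration of this request/response cycle using Python's standard library (the port, path, and in-memory counter are illustrative; a real analytics server would need persistence and scale-out):

```python
# Minimal client/server sketch: the server records a view for the video named
# in the request path and returns the current count; the client sends one
# request and displays the response.
import threading
import urllib.request
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

views = Counter()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        video_id = self.path.strip("/")       # 2. client sends a request
        views[video_id] += 1
        body = f"{video_id}: {views[video_id]} views".encode()
        self.send_response(200)               # 3. server issues a response
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

server = HTTPServer(("localhost", 8080), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# 1. client connects to the server; 4. client displays the response
with urllib.request.urlopen("http://localhost:8080/v1") as resp:
    print(resp.read().decode())               # e.g. "v1: 1 views"
server.shutdown()
```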