Lesson 6: Distributed Storage Flashcards
What is Moore’s Law?
The number of transistors in an integrated circuit doubles about every two years
Why was there a switch from faster to parallel execution?
Paradigm shift: single-core speedups ran into physical limits (the speed of light, atomic boundaries, limited 3D layering), so execution moved to parallel hardware
What is Hadoop?
- Framework that allows us to do reliable, scalable distributed computing and data storage
- It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware
- Apache top-level project
- Open-source
What do we need to write for MapReduce?
- Mapper: application code
- Partitioner: sends data to the correct Reducer machine
- Sort: groups input from different Mappers by key
- Reducer: application code (see the sketch below)
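A minimal word-count sketch of the two pieces of application code, written as Hadoop Streaming-style Python scripts that read from stdin (the file names and the word-count task are illustrative, not from the card):

```python
# mapper.py -- illustrative word-count Mapper (Hadoop Streaming style)
import sys

for line in sys.stdin:
    for word in line.split():
        # emit one <word, 1> pair per word; Partitioner + Sort route and group these by key
        print(f"{word}\t1")
```

```python
# reducer.py -- illustrative word-count Reducer; input arrives already sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")   # emit the finished key
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```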
How do we store large files with the Hadoop Distributed File System (HDFS)?
- We have a file
- HDFS splits it into blocks
- HDFS keeps 3 copies (replicas) of each block
- HDFS distributes and stores these blocks on DataNodes
- The NameNode tracks which blocks live on which DataNodes
- Sometimes a DataNode dies → not a problem
- The NameNode tells other DataNodes to copy the affected blocks, restoring 3x replication (see the sketch below)
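A back-of-the-envelope sketch of the splitting and replication, assuming the common default block size of 128 MB and a replication factor of 3 (the block size is an assumption, not stated on the card):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # assumed default HDFS block size (128 MB)
REPLICATION = 3                  # replication factor from the card

def hdfs_footprint(file_size_bytes):
    """Estimate how many blocks a file becomes and how many block copies HDFS stores."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, blocks * REPLICATION

# a 1 GB file -> 8 blocks, 24 block copies spread across the DataNodes
print(hdfs_footprint(1024 * 1024 * 1024))
```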
What are the components of the Hadoop Architecture? (overview)
- MapReduce Framework: implements the MapReduce paradigm
- Cluster: the host machines (nodes)
- HDFS Federation: provides logical distributed storage
- YARN Infrastructure: assigns cluster resources
What is the infrastructure of YARN?
Resource Manager (1/cluster): assigns cluster resources to applications
Node Manager (many/cluster): monitors its node
App Master (1/application): manages a single application (e.g., a MapReduce job)
Container: runs a single task (map, reduce, …); see the toy sketch below
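A toy sketch of that division of labour, purely illustrative (none of the names below are real YARN APIs): the single Resource Manager grants containers on nodes, and each granted container runs one task for an App Master.

```python
# Toy model only: illustrates the YARN roles, not real YARN code.
nodes = {"node1": 4, "node2": 4}   # NodeManagers report free container slots per node

def resource_manager_allocate(requests):
    """The single ResourceManager grants container requests against the whole cluster."""
    granted = []
    for app, wanted in requests:
        for _ in range(wanted):
            node = max(nodes, key=nodes.get)   # pick the node with the most free slots
            if nodes[node] == 0:
                break                          # cluster is full
            nodes[node] -= 1
            granted.append((app, node))        # the App Master runs one task here
    return granted

# one App Master (a MapReduce job) asks for 5 containers (map/reduce tasks)
print(resource_manager_allocate([("mapreduce-job-1", 5)]))
```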
What are the shortcomings of MapReduce?
- Forces your data processing into Map and Reduce steps → other workflows are missing
- Based on “Acyclic Data Flow” from disk to disk (HDFS)
→ not efficient for iterative tasks
- Only suited for batch processing (no interactivity, no streaming data)
What should be counted when calculating algorithmic complexity?
- page faults
- cache misses
- memory accesses
- disk accesses (swap space)
- …
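As an illustrative example (the numbers and the sequential-scan scenario are assumptions, not from the card), counting disk accesses instead of CPU operations can change the cost picture completely:

```python
import math

def scan_disk_accesses(n_records, records_per_block):
    """Disk blocks read by a sequential scan; these dominate the real running time."""
    return math.ceil(n_records / records_per_block)

# 10 million records, 1,000 records per disk block:
# only ~10,000 disk accesses, even though the CPU still touches all 10 million records
print(scan_disk_accesses(10_000_000, 1_000))
```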
What is Google MapReduce?
A framework for processing LOADS of data
-> framework’s job: fault tolerance, scaling & coordination
-> programmer’s job: write the program in MapReduce form
What are the 2 main components of Hadoop?
HDFS - big data storage
MapReduce - big data processing
How can you tell that the Hadoop Architecture is inspired by LISP (list processing)?
Functional programming:
* Immutable data
* Pure functions (no side effects): map, reduce
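A small Python analogy of that functional style (an illustration, not Hadoop code): a pure map step and a pure reduce step over immutable input, with no side effects.

```python
from functools import reduce

words = ("to", "be", "or", "not", "to", "be")        # immutable input

# map: each word -> (word, 1); a pure function with no side effects
pairs = list(map(lambda w: (w, 1), words))

def combine(counts, pair):
    """Pure reduce step: returns a new dict instead of mutating the old one."""
    word, n = pair
    return {**counts, word: counts.get(word, 0) + n}

print(reduce(combine, pairs, {}))                    # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```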
What is the difference between a job and a task tracker in the Hadoop Architecture?
Job tracker: in charge of managing the resources of the cluster
-> first point of contact when a client submits a process
-> one per cluster
Task tracker: executes the actual processing tasks
-> usually tied to one or more specific DataNodes
What are the 3 functions in Google MapReduce? (2 primary, one optional)
- Map (primary)
- Reduce (primary)
- Shuffle (optional; see the sketch below)
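A small sketch of what the Shuffle step does between Map and Reduce: group all mapper output pairs by key so that one Reducer sees all values for a key together (an illustration, not Hadoop internals):

```python
from collections import defaultdict

map_output = [("be", 1), ("to", 1), ("be", 1), ("or", 1), ("to", 1)]

# shuffle: group every value under its key before handing the groups to the Reducers
shuffled = defaultdict(list)
for key, value in map_output:
    shuffled[key].append(value)

print(dict(shuffled))   # {'be': [1, 1], 'to': [1, 1], 'or': [1]}
```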
How does the Map function work, of Google’s MapReduce?
Maps each <key, value> pair of the input list onto 0, 1, or more pairs of type <key2, value2> in the output list
-> mapping to 0 output elements = filtering
-> mapping to more than 1 output element = distribution (see the sketch below)
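A toy sketch of the 0 / 1 / many behaviour (the stop-word task and all names are illustrative): the mapper below drops stop words (filtering) and emits several output pairs for one input pair (distribution).

```python
STOP_WORDS = {"the", "a", "an"}   # illustrative stop-word list

def my_map(key, value):
    """Map one <line number, line text> pair onto 0, 1, or more <word, 1> pairs."""
    out = []
    for word in value.lower().split():
        if word in STOP_WORDS:
            continue              # contributes 0 output pairs -> filtering
        out.append((word, 1))     # one input pair yields many output pairs -> distribution
    return out

print(my_map(1, "The quick brown fox"))   # [('quick', 1), ('brown', 1), ('fox', 1)]
```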