Lesson 6: Distributed Storage Flashcards
What is Moore’s Law?
The number of transistors in an integrated circuit doubles about every two years
Why was there a switch from faster to parallel execution?
Paradigm shift: single-core clock speeds hit physical limits
-> speed of light, atomic boundaries, limited 3D layering
What is Hadoop?
- Framework that allows us to do reliable, scalable distributed computing and data storage
- It is a flexible and highly available architecture for large scale computation and data processing on a network of commodity hardware
- Apache top level project
- Open-source
What do we need to write for MapReduce?
- Mapper: application code
- Partitioner: sends data to the correct Reducer machine
- Sort: groups input from different Mappers by key
- Reducer: application code
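A minimal, self-contained Python sketch of these four stages for a word count (data is made up; in real Hadoop only the Mapper and Reducer are user code, while partitioning and sorting are done by the framework):

```python
from collections import defaultdict

def mapper(line):                        # application code
    for word in line.split():
        yield word, 1

def partition(key, num_reducers):        # framework: choose the Reducer machine
    return hash(key) % num_reducers

def reducer(key, values):                # application code
    yield key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
NUM_REDUCERS = 2

# Map phase: every input record produces 0..n intermediate <key, value> pairs
mapped = [kv for line in lines for kv in mapper(line)]

# Partition + sort phase: group pairs by key on the reducer they are sent to
buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for key, value in mapped:
    buckets[partition(key, NUM_REDUCERS)][key].append(value)

# Reduce phase: each reducer aggregates the values for its keys
for bucket in buckets:
    for key in sorted(bucket):
        print(next(reducer(key, bucket[key])))   # e.g. ('the', 2)
```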
How do we store large files with the Hadoop Distributed File System (HDFS)?
- We have a file
- HDFS splits it into blocks
- HDFS keeps 3 copies (replicas) of each block
- HDFS stores these blocks on DataNodes
- HDFS distributes the blocks across the DataNodes
- The NameNode tracks which blocks live on which DataNodes
- Sometimes a DataNode dies → not a problem
- The NameNode tells other DataNodes to copy the affected blocks, back to 3x replication
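A conceptual Python sketch (not real HDFS code) of the idea: a file is split into fixed-size blocks and each block is assigned to 3 DataNodes. Block size, node names and the placement policy here are assumptions for illustration only:

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024            # default HDFS block size (128 MB)
REPLICATION = 3
datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # hypothetical node names

def plan_blocks(file_size):
    """Return a NameNode-style mapping: block index -> DataNodes holding a replica."""
    num_blocks = -(-file_size // BLOCK_SIZE)       # ceiling division
    rotation = itertools.cycle(datanodes)
    return {block: [next(rotation) for _ in range(REPLICATION)]
            for block in range(num_blocks)}

# A 300 MB file becomes 3 blocks, each replicated on 3 different DataNodes
print(plan_blocks(file_size=300 * 1024 * 1024))
```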
What are the components of the Hadoop Architecture? (overview)
- MapReduce Framework: implements the MapReduce paradigm
- Cluster: the host machines (nodes)
- HDFS Federation: provides logical distributed storage
- YARN Infrastructure: assigns resources
What is the infrastructure of YARN?
Resource Manager (1/cluster): assigns cluster resources to applications
Node Manager (many/cluster): monitors its node
App Master (1/application): manages a single application (e.g. a MapReduce job)
Container: runs a single task (map, reduce, …)
What are the shortcomings of MapReduce?
- Forces your data processing into MAP and REDUCE → other workflows missing
- Based on “Acyclic Data Flow” from Disk to Disk (HDFS)
→ not efficient for iterative tasks
- Only suited for batch processing (no interactivity, no streaming data)
What should be counted when calculating algorithmic complexity?
- page faults
- cache misses
- memory accesses
- disk accesses (swap space)
- …
What is Google MapReduce?
A framework for processing LOADS of data
-> framework’s job: fault tolerance, scaling & coordination
-> programmer’s job: write program in MapReduce Form
What are the 2 main components of Hadoop?
HDFS - big data storage
MapReduce - big data processing
How can you tell that the Hadoop Architecture is inspired by LISP (list processing)?
Functional programming:
* Immutable data
* Pure functions (no side effects): map, reduce
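The same idea sketched in plain Python: immutable data plus the pure functions map and reduce (word lengths are just an example):

```python
from functools import reduce

words = ("hadoop", "spark", "hdfs")                  # immutable tuple
lengths = map(len, words)                            # map: pure, no side effects
total = reduce(lambda acc, n: acc + n, lengths, 0)   # reduce: fold values into one result
print(total)                                         # 6 + 5 + 4 = 15
```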
What is the difference between a job and a task tracker in the Hadoop Architecture?
Job tracker: in charge of managing the resources of the cluster
-> first point of contact when a client submits a process
-> one per cluster
Task tracker: does the actual processing
-> usually tied to one or more specific DataNodes
What are the 3 functions in Google MapReduce? (2 primary, one optional)
- Map
- Reduce
- Shuffle
How does the Map function work, of Google’s MapReduce?
Maps each <key, value> pair of the input list onto 0, 1, or more pairs of type <key2, value2> in the output list
-> mapping to 0 output elements = filtering
-> mapping to more than 1 output element = distribution
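A small Python illustration of both cases, using made-up map functions and data:

```python
def map_filter(key, value):
    # emits 0 output pairs for irrelevant records -> filtering
    if value >= 0:
        yield key, value

def map_distribute(doc_id, text):
    # emits more than 1 output pair per input record -> distribution
    for word in text.split():
        yield word, doc_id

print(list(map_filter("t1", -5)))                  # []
print(list(map_distribute("doc1", "big data")))    # [('big', 'doc1'), ('data', 'doc1')]
```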
How does the Reduce function work, of Google’s MapReduce?
[summarizing]
Combines the <key, value> pairs of the input list into an aggregate output value
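A tiny Python illustration with assumed data, summing all values that share a key:

```python
def reduce_sum(key, values):
    # all values emitted for this key are combined into one aggregate result
    return key, sum(values)

print(reduce_sum("hadoop", [1, 1, 1]))   # ('hadoop', 3)
```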
What does the Shuffle function do, in Google’s MapReduce?
[consolidating relevant records]
* Helps pipeline the computation
* Channels each partial result to the right (most appropriate) Reduce node
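A rough Python sketch of the idea: route each intermediate pair to a reduce node by hashing its key, so all values for one key end up on the same reducer (the pair data and reducer count are assumptions):

```python
from collections import defaultdict

NUM_REDUCERS = 4
pairs = [("spark", 1), ("hadoop", 1), ("spark", 1)]   # intermediate map output

reduce_inputs = defaultdict(list)
for key, value in pairs:
    # same key -> same hash -> same reduce node
    reduce_inputs[hash(key) % NUM_REDUCERS].append((key, value))

print(dict(reduce_inputs))
```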
What is YARN short for?
Yet Another Resource Negotiator
Explain the Hadoop eco-system.
Hadoop provides good functionality on its own, but its real power comes out when it is combined with other technologies.
(Ex.: Pig, Hive, Kafka)
What is Apache Spark?
- works on top of Hadoop, HDFS
- supports many workflows beyond Map and Reduce
- in-memory caching of data
- in-memory data sharing
- supports data analysis, machine learning, graphs,…
- allows development in multiple languages
- can read/write a range of data sources and formats
What are RDDs?
Resilient Distributed Datasets
-> immutable distributed collection of objects
-> fault tolerant
-> used in every Spark component
How do you create new RDDs?
By using transformations
-> from storage
-> from other RDDs
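A minimal PySpark sketch of both routes (the HDFS path and app name are hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

rdd_from_collection = sc.parallelize([1, 2, 3, 4])          # from a local collection
rdd_from_storage = sc.textFile("hdfs:///data/input.txt")    # from storage (path is hypothetical)
rdd_transformed = rdd_from_collection.map(lambda x: x * 2)  # from another RDD via a transformation

print(rdd_transformed.collect())                            # [2, 4, 6, 8]
sc.stop()
```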
What are DataFrames?
A way to organize the data into named columns,
similar to a table in a relational database
-> immutable once constructed
-> enable distributed computations
How can you construct DataFrames?
- read from file(s)
- transform an existing DataFrame
- parallelize a Python collection (list)
- apply transformations and actions
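A minimal PySpark sketch of these construction routes (the file path, column names and values are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

df_from_file = spark.read.json("hdfs:///data/people.json")   # read from file(s)
df_from_list = spark.createDataFrame(                        # parallelize a Python collection
    [("Ada", 36), ("Grace", 45)], ["name", "age"])
df_transformed = df_from_list.filter(df_from_list.age > 40)  # transform an existing DataFrame

df_transformed.show()                                        # an action triggers the computation
spark.stop()
```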
Compare RDDs with DataFrames.
- RDDs provide a low level interface into Spark
- DataFrames have a schema
- DataFrames are cached and optimized by Spark
- DataFrames are built on top of the RDDs and the core Spark API
- DataFrames are highly optimized and are faster
How and why would we use directed acyclic graphs?
How:
Nodes are RDDs, arrows are transformations
Why:
* track dependencies between RDDs (lineage)
* the program is easy to reason about for both humans and computers
* enables improvements via sequential access to data & predictive processing
What is the difference between a narrow and a wide transformation?
Narrow: everything needed can be found in the same partition
-> the elements required to compute one output partition live in a single partition of the parent RDD (e.g. map)
Wide: data must be fetched from multiple partitions
-> the elements required to compute one output partition may live in many partitions of the parent RDD (e.g. groupByKey)
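A short PySpark sketch contrasting the two (data is made up): map stays within each partition, while groupByKey shuffles data between partitions:

```python
from pyspark import SparkContext

sc = SparkContext(appName="narrow-vs-wide")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=2)

narrow = pairs.map(lambda kv: (kv[0], kv[1] * 10))   # narrow: no data moves between partitions
wide = pairs.groupByKey()                            # wide: shuffles data across partitions

print(narrow.collect())                              # [('a', 10), ('b', 20), ('a', 30)]
print([(k, list(v)) for k, v in wide.collect()])     # e.g. [('a', [1, 3]), ('b', [2])]
sc.stop()
```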
When should you not use Spark?
- for many simple use cases, Apache MapReduce and Hive might be a more appropriate choice
- Spark was not designed as a multi-user environment
- Spark users are required to know whether the memory they have is sufficient for a dataset
- adding more users adds complications, since the users will have to coordinate memory usage to run code