Lesson 6: Distributed Storage Flashcards
What is Moore’s Law?
The number of transistors in an integrated circuit doubles about every two years
Why was there a switch from faster to parallel execution?
Paradigm shift: single-core clock speeds hit physical limits
-> speed of light, atomic boundaries, limited 3D layering
What is Hadoop?
- Framework that allows us to do reliable, scalable distributed computing and data storage
- It is a flexible and highly available architecture for large scale computation and data processing on a network of commodity hardware
- Apache top level project
- Open-source
What do we need to write for MapReduce?
- Mapper: application code
- Partitioner: sends data to the correct Reducer machine
- Sort: groups input from different Mappers by key
- Reducer: application code
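A minimal, self-contained Python sketch of these four stages for a word count (data is made up; in real Hadoop only the Mapper and Reducer are user code, while partitioning and sorting are done by the framework):

```python
from collections import defaultdict

def mapper(line):                        # application code
    for word in line.split():
        yield word, 1

def partition(key, num_reducers):        # framework: choose the Reducer machine
    return hash(key) % num_reducers

def reducer(key, values):                # application code
    yield key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
NUM_REDUCERS = 2

# Map phase: every input record produces 0..n intermediate <key, value> pairs
mapped = [kv for line in lines for kv in mapper(line)]

# Partition + sort phase: group pairs by key on the reducer they are sent to
buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for key, value in mapped:
    buckets[partition(key, NUM_REDUCERS)][key].append(value)

# Reduce phase: each reducer aggregates the values for its keys
for bucket in buckets:
    for key in sorted(bucket):
        print(next(reducer(key, bucket[key])))   # e.g. ('the', 2)
```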
How do we store large files with the Hadoop Distributed File System (HDFS)?
- We have a file
- HDFS splits it into blocks
- HDFS keeps 3 copies (replicas) of each block
- HDFS stores these blocks on DataNodes
- HDFS distributes the blocks across the DataNodes
- The NameNode tracks which blocks live on which DataNodes
- Sometimes a DataNode dies → not a problem
- The NameNode tells other DataNodes to copy the affected blocks, back to 3x replication
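A conceptual Python sketch (not real HDFS code) of the idea: a file is split into fixed-size blocks and each block is assigned to 3 DataNodes. Block size, node names and the placement policy here are assumptions for illustration only:

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024            # default HDFS block size (128 MB)
REPLICATION = 3
datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # hypothetical node names

def plan_blocks(file_size):
    """Return a NameNode-style mapping: block index -> DataNodes holding a replica."""
    num_blocks = -(-file_size // BLOCK_SIZE)       # ceiling division
    rotation = itertools.cycle(datanodes)
    return {block: [next(rotation) for _ in range(REPLICATION)]
            for block in range(num_blocks)}

# A 300 MB file becomes 3 blocks, each replicated on 3 different DataNodes
print(plan_blocks(file_size=300 * 1024 * 1024))
```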
What are the components of the Hadoop Architecture? (overview)
- MapReduce Framework: implements the MapReduce paradigm
- Cluster: the host machines (nodes)
- HDFS Federation: provides logical distributed storage
- YARN Infrastructure: assigns resources
What is the infrastructure of YARN?
Resource Manager (1/cluster): assigns cluster resources to applications
Node Manager (many/cluster): monitors its node
App Master (1/application): manages a single application (e.g. a MapReduce job)
Container: runs a single task (map, reduce, …)
What are the shortcomings of MapReduce?
- Forces your data processing into MAP and REDUCE → other workflows missing
- Based on “Acyclic Data Flow” from Disk to Disk (HDFS)
→ not efficient for iterative tasks
- Only suited for batch processing (no interactivity, no streaming data)
What should be counted when calculating algorithmic complexity?
- page faults
- cache misses
- memory accesses
- disk accesses (swap space)
- …
What is Google MapReduce?
A framework for processing LOADS of data
-> framework’s job: fault tolerance, scaling & coordination
-> programmer’s job: write program in MapReduce Form
What are the 2 main components of Hadoop?
HDFS - big data storage
MapReduce - big data processing
How can you tell that the Hadoop Architecture is inspired by LISP (list processing)?
Functional programming:
* Immutable data
* Pure functions (no side effects): map, reduce
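The same idea sketched in plain Python: immutable data plus the pure functions map and reduce (word lengths are just an example):

```python
from functools import reduce

words = ("hadoop", "spark", "hdfs")                  # immutable tuple
lengths = map(len, words)                            # map: pure, no side effects
total = reduce(lambda acc, n: acc + n, lengths, 0)   # reduce: fold values into one result
print(total)                                         # 6 + 5 + 4 = 15
```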
What is the difference between a job and a task tracker in the Hadoop Architecture?
Job tracker: in charge of managing the resources of the cluster
-> first point of contact when a client submits a process
-> one per cluster
Task tracker: does the actual processing
-> usually tied to one or more specific DataNodes
What are the 3 functions in Google MapReduce? (2 primary, one optional)
- Map
- Reduce
- Shuffle
How does the Map function work, of Google’s MapReduce?
Maps each <key, value> pair of the input list onto 0, 1, or more pairs of type <key2, value2> in the output list
-> mapping to 0 output elements = filtering
-> mapping to more than 1 output element = distribution
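A small Python illustration of both cases, using made-up map functions and data:

```python
def map_filter(key, value):
    # emits 0 output pairs for irrelevant records -> filtering
    if value >= 0:
        yield key, value

def map_distribute(doc_id, text):
    # emits more than 1 output pair per input record -> distribution
    for word in text.split():
        yield word, doc_id

print(list(map_filter("t1", -5)))                  # []
print(list(map_distribute("doc1", "big data")))    # [('big', 'doc1'), ('data', 'doc1')]
```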
How does the Reduce function work, of Google’s MapReduce?
[summarizing]
Combines the <key, value> pairs of the input list into an aggregate output value
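A tiny Python illustration with assumed data, summing all values that share a key:

```python
def reduce_sum(key, values):
    # all values emitted for this key are combined into one aggregate result
    return key, sum(values)

print(reduce_sum("hadoop", [1, 1, 1]))   # ('hadoop', 3)
```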
What does the Shuffle function do, in Google’s MapReduce?
[consolidating relevant records]
* Helps pipeline the computation
* Channels each partial result to the right (most appropriate) Reduce node
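A rough Python sketch of the idea: route each intermediate pair to a reduce node by hashing its key, so all values for one key end up on the same reducer (the pair data and reducer count are assumptions):

```python
from collections import defaultdict

NUM_REDUCERS = 4
pairs = [("spark", 1), ("hadoop", 1), ("spark", 1)]   # intermediate map output

reduce_inputs = defaultdict(list)
for key, value in pairs:
    # same key -> same hash -> same reduce node
    reduce_inputs[hash(key) % NUM_REDUCERS].append((key, value))

print(dict(reduce_inputs))
```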
What is YARN short for?
Yet Another Resource Negotiator
Explain the Hadoop eco-system.
Hadoop provides good functionality on its own, but its real power comes out when it is combined with other technologies.
(Ex.: Pig, Hive, Kafka)
What is Apache Spark?
- works on top of Hadoop, HDFS
- supports many workflows beyond Map and Reduce
- in-memory caching of data
- in-memory data sharing
- supports data analysis, machine learning, graphs,…
- allows development in multiple languages
- can read/write a range of data sources and formats
What are RDDs?
Resilient Distributed Datasets
-> immutable distributed collection of objects
-> fault tolerant
-> used in every Spark component
How do you create new RDDs?
By using transformations
-> from storage
-> from other RDDs
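A minimal PySpark sketch of both routes (the HDFS path and app name are hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")

rdd_from_collection = sc.parallelize([1, 2, 3, 4])          # from a local collection
rdd_from_storage = sc.textFile("hdfs:///data/input.txt")    # from storage (path is hypothetical)
rdd_transformed = rdd_from_collection.map(lambda x: x * 2)  # from another RDD via a transformation

print(rdd_transformed.collect())                            # [2, 4, 6, 8]
sc.stop()
```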
What are DataFrames?
A way to organize the data into named columns,
similar to a table in a relational database
-> immutable once constructed
-> enable distributed computations
How can you construct DataFrames?
- read from file(s)
- transform an existing DataFrame
- parallelize a Python collection (list)
- apply transformations and actions
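A minimal PySpark sketch of these construction routes (the file path, column names and values are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

df_from_file = spark.read.json("hdfs:///data/people.json")   # read from file(s)
df_from_list = spark.createDataFrame(                        # parallelize a Python collection
    [("Ada", 36), ("Grace", 45)], ["name", "age"])
df_transformed = df_from_list.filter(df_from_list.age > 40)  # transform an existing DataFrame

df_transformed.show()                                        # an action triggers the computation
spark.stop()
```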
Compare RDDs with DataFrames.
- RDDs provide a low level interface into Spark
- DataFrames have a schema
- DataFrames are cached and optimized by Spark
- DataFrames are built on top of the RDDs and the core Spark API
- DataFrames are highly optimized and are faster
How and why would we use directed acyclic graphs?
How:
Nodes are RDDs, arrows are transformations
Why:
* track dependencies between RDDs (lineage)
* the program is easy to reason about for both humans and computers
* enables improvements via sequential access to data & predictive processing
What is the difference between a narrow and a wide transformation?
Narrow: everything needed can be found in the same partition
-> the elements required to compute one output partition live in a single partition of the parent RDD (e.g. map)
Wide: data must be fetched from multiple partitions
-> the elements required to compute one output partition may live in many partitions of the parent RDD (e.g. groupByKey)
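A short PySpark sketch contrasting the two (data is made up): map stays within each partition, while groupByKey shuffles data between partitions:

```python
from pyspark import SparkContext

sc = SparkContext(appName="narrow-vs-wide")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=2)

narrow = pairs.map(lambda kv: (kv[0], kv[1] * 10))   # narrow: no data moves between partitions
wide = pairs.groupByKey()                            # wide: shuffles data across partitions

print(narrow.collect())                              # [('a', 10), ('b', 20), ('a', 30)]
print([(k, list(v)) for k, v in wide.collect()])     # e.g. [('a', [1, 3]), ('b', [2])]
sc.stop()
```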
When should you not use Spark?
- for many simple use cases, Apache MapReduce and Hive might be a more appropriate choice
- Spark was not designed as a multi-user environment
- Spark users are required to know whether the memory they have is sufficient for a dataset
- adding more users adds complications, since the users will have to coordinate memory usage to run code