Chapters 9 & 10 Knowledge Testers Flashcards

1
Q

Can you explain how YARN works, how it can be used to improve MapReduce, and how it supports other technologies like Spark?

A

The original MapReduce gave the JobTracker too much responsibility: resource management, scheduling, monitoring, and fault tolerance. This made it not very scalable (limits of about 4,000 nodes and 40,000 tasks) and slowed the system down. YARN (Yet Another Resource Negotiator) handles the management of CPU and memory resources in the cluster, reducing the bottleneck, and provides support for allocating resources to any application, not just MapReduce.

2
Q

Can you quickly describe the YARN components?

A

Centralized architecture: a ResourceManager (main node) and NodeManagers (workers). The ResourceManager assigns one of the containers to act as the ApplicationMaster, which will run the application. The ApplicationMaster can ask for more containers in order to run jobs. This allows multi-tenancy: several applications can run concurrently on the same cluster.

3
Q

Do you know what a ResourceManager does?

A

Manages the cluster's resources: memory, CPU, disk I/O, network I/O.
An ApplicationMaster will request, say, 10 containers with 2 cores and 16 GB of RAM each; the ResourceManager can then allocate these.

4
Q

Do you know what a NodeManager does?

A

Manages the containers running on its worker node, monitors their resource usage, and sends periodic heartbeats to the ResourceManager to give a sign of life.

5
Q

Do you know what and where a Container is?

A

Containers are bundles of resources on a node: e.g. 10 containers that each have 2 cores and 16 GB of RAM. ApplicationMasters request them from the ResourceManager. A container can hold several map slots.

6
Q

Do you know what an ApplicationMaster is and does?

A

Receives jobs and talks with the ResourceManager to say what resources it needs for its job; it then coordinates running the job in the containers it is granted.

7
Q

Can you list the main resources that are managed in a cluster?

A

Disk storage, memory, CPU, network bandwidth

8
Q

Can you explain, in simpler words, what the added value of YARN is? Can you explain why it is an improvement over the first version of MapReduce, which took care of resource management on its own and had issues with this?

A

Originally, MapReduce's JobTracker was doing too much, creating capacity limits and bottlenecks. YARN now takes over some of the JobTracker's duties, which speeds up MapReduce. YARN balances the allocation of the four different resources across applications.

9
Q

Can you explain how Spark is more powerful than MapReduce on a data model level?

A

Spark has higher-level APIs and querying (MapReduce is low-level). A single Spark job can express a whole DAG of transformations, equivalent to chaining multiple MapReduce jobs.

10
Q

Can you explain what a Resilient Distributed Dataset (RDD) is?

A

Intermediate data: the nodes in the DAG. Resilient: kept in memory, but can be recomputed from their lineage if needed. Distributed: partitioned and spread over multiple machines. RDDs are homogeneous collections of anything (not necessarily key-value pairs).

11
Q

Do you know the difference between an action and a transformation?

A

Transformation: RDD -> RDD, such as map, filter, reduceByKey, joins, unions, and other relational-algebra operations.
Action: a final step whose output is persistent: writing an RDD to local disk, a DFS, a database, or the user's screen.
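
The distinction can be sketched in plain Python (no Spark assumed; the function names are illustrative stand-ins, not the Spark API): transformations are lazy, and only an action forces evaluation.

```python
# Generators stand in for RDDs: building the pipeline does no work yet.

def rdd_map(data, f):
    # "Transformation": returns a lazy generator; nothing is computed here.
    return (f(x) for x in data)

def rdd_filter(data, pred):
    # Also lazy: just wraps the input in another generator.
    return (x for x in data if pred(x))

def collect(data):
    # "Action": forces evaluation and materializes the result.
    return list(data)

pipeline = rdd_filter(rdd_map(range(5), lambda x: x * 2), lambda x: x > 2)
result = collect(pipeline)  # work happens only now -> [4, 6, 8]
```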
12
Q

Do you know the main actions and transformations available in Spark?

A

Transformations:
Unary (single RDD input): filter, map, flatMap
Binary (two RDDs as input): union, intersection, subtract
Pair (for RDDs of key-value pairs): keys; reduceByKey (with an operator): (k,v1),(k,v2) -> (k, v1 op v2); groupByKey: (k,v1),(k,v2) -> (k,[v1,v2]); sortByKey; mapValues; join; subtractByKey

Actions:
collect: downloads all values of an RDD onto the client machine and outputs them as a list
count: computes the total number of values in the input RDD (returns an integer)
countByValue: counts the occurrences of each distinct value
take: list of the first n values
top: list of the n largest values
takeSample: list of n random values
reduce: given a binary operator under which the data type is closed, invokes the operator on all values
saveAsTextFile, saveAsObjectFile: write the RDD to storage
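
The pair transformations above can be mimicked in plain Python (no cluster; these helpers only illustrate the semantics and are not the Spark API):

```python
from collections import defaultdict

def reduce_by_key(pairs, op):
    # reduceByKey semantics: (k, v1), (k, v2) -> (k, op(v1, v2)).
    acc = {}
    for k, v in pairs:
        acc[k] = op(acc[k], v) if k in acc else v
    return sorted(acc.items())

def group_by_key(pairs):
    # groupByKey semantics: (k, v1), (k, v2) -> (k, [v1, v2]).
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())

pairs = [("a", 1), ("b", 2), ("a", 3)]
reduced = reduce_by_key(pairs, lambda x, y: x + y)  # [("a", 4), ("b", 2)]
grouped = group_by_key(pairs)                       # [("a", [1, 3]), ("b", [2])]
```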

13
Q

Can you describe how transformations run physically? Can you explain the similarities/differences with MapReduce on the physical level?

A

Two kinds:
Narrow dependency: each output partition depends on a single input partition; comparable to map and easily parallelizable, with one task per partition of the input. A chain of narrow-dependency transformations is executed as a single unit called a stage (like a phase in MapReduce).
Wide dependency: requires shuffling data over the network to ensure each location has all the data its partition needs. At the physical level, a job executes as a sequence of stages, and a shuffle begins each new stage. Typically this is a linear succession of stages: the stage DAG is a partial order, and execution linearizes it.
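
The stage-cutting rule can be sketched in a few lines of Python (the transform names and the plan representation are illustrative assumptions): a chain of narrow transforms fuses into one stage, and every wide (shuffle) dependency starts a new one.

```python
# A plan is a list of (transform_name, is_wide) pairs in execution order.

def split_into_stages(plan):
    stages, current = [], []
    for name, is_wide in plan:
        if is_wide and current:
            stages.append(current)  # shuffle boundary: close the stage
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

plan = [("map", False), ("filter", False),
        ("reduceByKey", True), ("mapValues", False),
        ("sortByKey", True)]
stages = split_into_stages(plan)
# -> [["map", "filter"], ["reduceByKey", "mapValues"], ["sortByKey"]]
```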

14
Q

Can you explain when and how a series of transformations can be optimized by keeping the same set of machines with no network communication between the transformations?

A

Pin (persist) intermediate RDDs so they stay in memory on the same machines, and pre-partition data that should be together; both reduce shuffling over the network.
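
A sketch of why pre-partitioning helps (pure Python; the hash partitioner below is a simplified stand-in for Spark's): if both datasets are partitioned by the same function, records with equal keys land in partitions with the same index, so joining them needs no further network communication.

```python
NUM_PARTITIONS = 4

def partition_of(key):
    # Same partitioner for both datasets: equal keys -> same partition index.
    return hash(key) % NUM_PARTITIONS

def partition(pairs):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for k, v in pairs:
        parts[partition_of(k)].append((k, v))
    return parts

left = partition([("a", 1), ("b", 2)])
right = partition([("a", 10), ("b", 20)])

# The records for key "a" sit at the same partition index in both datasets,
# so a join on "a" is purely local: no data moves across the network.
left_idx = next(i for i, p in enumerate(left) if ("a", 1) in p)
right_idx = next(i for i, p in enumerate(right) if ("a", 10) in p)
```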

15
Q

Can you explain what a stage is? Can you relate it to transformations? To tasks? To jobs?

A

A transformation groups the work vertically; a task is a horizontal split of that work (one task per partition). A stage is a chain of narrow-dependency transformations executed together, and a job is made of a sequence of stages.

16
Q

Do you understand why keeping an RDD persisted can be useful and can improve performance?

A

It reduces recomputing of intermediate values: a persisted RDD stays in memory and can be reused by later actions instead of being rebuilt from its lineage.

17
Q

Can you explain how controlling the way data is partitioned can make execution faster because we can influence stages?

A

It reduces the need for shuffling: if data that must be brought together is already co-partitioned, a wide transformation can run without moving data across the network, so fewer shuffle boundaries (and thus fewer stages) are needed.

18
Q

Can you explain what a DataFrame is? What does it improve in Spark?

A

A DataFrame is semi-structured: it is denormalized but follows a schema that keeps it manageable, sitting between fully homogeneous and fully heterogeneous. A collection can be homogeneous but not atomic (nested values), or it can be heterogeneous with every value treated as a string, which makes it homogeneous again in a weaker sense.
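
The idea can be sketched in plain Python (the row representation and helper names are assumptions for illustration): rows follow a fixed schema, but a column's values may be nested rather than atomic; coercing everything to strings makes columns homogeneous again at the user's expense.

```python
import json

schema = {"name", "scores"}  # fixed set of columns: the "structured" part

rows = [
    {"name": "ada", "scores": [9, 8]},  # nested, non-atomic value
    {"name": "bob", "scores": [7]},
]

def conforms(row):
    # Every row must have exactly the schema's columns.
    return set(row) == schema

def stringify(row):
    # Treat every value as a string: the column becomes homogeneous,
    # but the user must now parse the structure back out.
    return {k: json.dumps(v) for k, v in row.items()}

flat = [stringify(r) for r in rows]  # flat[0]["scores"] == "[9, 8]"
```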

19
Q

Can you explain what the limitations of DataFrames are? What can they do well? What can they do less well?

A

They can be nested, but not too deeply: excessive nesting is hard for the user and ruins data independence. A DataFrame can convert JSON objects to strings, but that again pushes work onto the user. They are good for heterogeneous data, and they can have schemas, which makes them more structured and is helpful for the user. They offer domain integrity (everything can be treated as a string) and relational integrity, but not atomic integrity.
