Midterm Flashcards

1
Q

What are the data types associated with Big Data?

A

Structured (tabular data), semi-structured (XML files), and unstructured (text, audio,
video, images) data are all associated with Big Data.

2
Q

Which statement best describes small data?

A

Small data is available in limited quantities that humans can easily interpret with little or no digital processing.

3
Q

Which of the following capabilities are quantifiable advantages of distributed processing?

A
  • You can add and remove execution nodes as and when required, significantly reducing infrastructure costs.
  • Since problem instructions are executed on separate execution nodes, memory and processing requirements are low even while processing large volumes of data.
  • Parallel processing can process Big Data in a fraction of the time compared to linear processing.
  • Parallel processing handles errors logically, without impacting other nodes.
4
Q

Which of these statements describes Big Data?

A
  • Data is generated in huge volumes and can be structured, semi-structured, or unstructured.
  • Big data arrives continuously at enormous speed from multiple sources.
  • Big data is mostly located in storage within enterprises and data centers.
5
Q

Which of the following capabilities are quantifiable advantages of parallel processing?

A

Parallel processing can process Big Data in a fraction of the time compared to linear processing.

6
Q

What is vertical scaling and horizontal scaling?

A
  • Vertical scaling improves the current system: a bigger computer.
  • Horizontal scaling adds more systems: more computers.
7
Q

What is Big data? Why does it matter?

A
  • Everything we do increasingly leaves a digital trace (or data) that we can collect and analyze to become smarter. Big Data refers to this entire process, not just the data itself.
  • It matters because it’s the future. Data is being collected all around us all the time and we can use it to improve our lives.
8
Q

Which of the following statements about Hadoop are true?

A
  • A Hadoop cluster is a collection of computers working together at the same time to perform tasks.
  • Hadoop allows for running applications on clusters.
  • It processes massive amounts of data in distributed file systems that are linked together.
  • It is a set of open-source programs and procedures that can be used as the framework for Big Data operations.
9
Q

MapReduce is a programming model used in Hadoop for processing Big Data. It is also a processing technique for what?

A

Distributed computing

10
Q

Which of the following key features of HDFS ensure against data loss?

A

Replication

11
Q

What are the components of a Hadoop 1 Architecture (before 2014)?

A

HDFS and MapReduce

12
Q

All of the following accurately describe Hadoop, except:

A
  • Open source
  • Java based
  • Real Time
  • Distributed Computing Approach

Answer:
- Real Time

13
Q

Which of the following is a component of Hadoop?

A
  • YARN
  • HDFS
  • MapReduce
14
Q

Namenode keeps metadata in?

A

HDFS

15
Q

Which of the following is a data processing engine for Hadoop Framework?

A

MapReduce

16
Q

In which language can you code in Hadoop?

A

Java

17
Q

Hadoop can be deployed on commodity servers, which provides low-cost processing and storage of huge volumes of unstructured data.

A

True

18
Q

Which of the following manages the resources among all the applications running in a Hadoop cluster?

A

YARN

19
Q

What are the main Hadoop components in Hadoop 2 and Hadoop 3? What functions do they perform?

A
  • YARN - Cluster Management
  • HDFS - Manages the storage of data
  • MapReduce - Framework to process data.
20
Q

What are the modes in which Hadoop can run? What are the differences among them?

A
  • Standalone - Hadoop runs with a single node. HDFS and YARN do not run.
  • Pseudo-Distributed - Hadoop runs on a single machine with separate JVMs for its processes. All 3 components run in this mode.
  • Full-Distributed - a full-fledged setup with Hadoop running on a cluster of machines. The cluster can be made up of Linux servers or a cloud service like AWS/Azure.
21
Q

What happens if the block on Hadoop HDFS is corrupted?

A

Each block is replicated, and the replicas are stored on different data nodes. The replica locations are also stored in the name node, so a corrupted block can be recovered from a healthy replica.

22
Q

What is the difference between NameNode and DataNode in Hadoop?

A
  • NameNode is the master node. It stores the directory structure and the metadata for all the files.
  • DataNodes are the other nodes; the data is physically stored on these nodes.
23
Q

Although the Hadoop Framework is implemented in Java, MapReduce applications need not be written in which of the following?

A
  • Python
  • Java
  • None of the Above
  • C++

Answer:
- None of the Above

24
Q

The number of maps is usually driven by the total size of ____?

A

The inputs (the number of map tasks is correlated with the input size)

25
Q

The minimum amount of data that HDFS can read or write is called a ____?

A

Block

26
Q

HDFS works in a ____ fashion?

A

Master-worker

27
Q

_____ is a utility which allows users to create and run jobs with any executables as the mapper and/or the reducer.

A

Hadoop Streaming

28
Q

Point out the wrong statement:

A
  • The outputs of the map-tasks go directly to the local File System.
  • The MapReduce framework does not sort the map-outputs before sending them to the reduce tasks. (it does sort)
  • It is legal to set the number of reduce-tasks to zero if no reduction is desired.

Answer:
- The MapReduce framework does not sort the map-outputs before sending them to the reduce tasks.

29
Q

What is Hadoop streaming API?

A

The Hadoop Streaming API allows other languages to be used to write Hadoop programs. Standard input and output are used to communicate with the program. Running a streaming job requires invoking the Hadoop streaming JAR file.
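As a local sketch (not Hadoop's actual API), a streaming-style word-count mapper and reducer could look like this in Python; the `sorted()` call stands in for the framework's shuffle/sort, and the input lines are made up for illustration:

```python
def mapper(lines):
    # A streaming mapper would print "word\t1" to stdout for every word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    # Hadoop sorts mapper output by key before the reducer sees it,
    # so identical words arrive on consecutive lines and can be tallied.
    current, count = None, 0
    for pair in sorted_pairs:
        word, n = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

if __name__ == "__main__":
    # Local dry run; on a cluster these would be separate mapper/reducer
    # scripts passed to the Hadoop streaming JAR.
    mapped = sorted(mapper(["big data big cluster", "big data"]))
    print(list(reducer(mapped)))  # ['big\t3', 'cluster\t1', 'data\t2']
```

On a real cluster the two functions would live in separate executable scripts named on the streaming JAR's command line, communicating only via stdin/stdout as the card describes.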

30
Q

Benefits of YARN:

A
  • Scalability - YARN can manage thousands of nodes and clusters.
  • Compatibility - YARN supports existing MapReduce applications without disruption. It is compatible with Hadoop 1.0.
  • Cluster utilization - resources are allocated dynamically, so the cluster is used more efficiently.
  • Multi-tenancy - it allows multiple engines to access the same cluster.
31
Q

What are disadvantages of MapReduce programming model?

A
  • It has more lines of code, since it is low-level programming.
    - More development effort is involved.
  • It is not flexible: a job has one or more mappers and zero or more reducers, and can only be expressed within the MapReduce framework.
32
Q

Shuffle/sort

A

Always happens

33
Q

Sort/merge

A

Always happens. The sort step takes the output from each node, combines it, and puts the keys in alphabetical order; the merge step tallies each word. The reduce step then puts all the tallies together.
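The sort/merge tally described above can be simulated in a few lines of Python (hypothetical per-node mapper outputs; `groupby` stands in for the merge):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical mapper outputs from two nodes.
node1 = [("data", 1), ("big", 1), ("data", 1)]
node2 = [("big", 1), ("hadoop", 1)]

# Sort: merge all partitions and order alphabetically by key.
merged = sorted(node1 + node2, key=itemgetter(0))

# Merge/reduce: tally the counts for each word.
totals = {word: sum(n for _, n in group)
          for word, group in groupby(merged, key=itemgetter(0))}
print(totals)  # {'big': 2, 'data': 2, 'hadoop': 1}
```

`groupby` only groups adjacent equal keys, which is exactly why the sort must happen first, as the card notes.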

34
Q

What happens if the block on Hadoop HDFS is corrupted?

A

If a block is corrupted, HDFS relies on the replicas stored on other data nodes; these replicas prevent a corrupted block from failing the complete program. Furthermore, a bigger replication factor can be set to make storage more failure-resistant.

35
Q

Hadoop has a default strategy for storing 3 replicas:

A
  • The first replica is placed on one rack; the second replica is on a second rack; the third replica is on the same rack as the second but on a different node.
  • It gives a good balance between redundancy and bandwidth.
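A toy illustration of this placement rule (assumed rack and node names; not HDFS's actual placement code):

```python
def place_replicas(racks, local_rack):
    # racks: dict mapping rack name -> list of node names.
    # First replica on the writer's (local) rack.
    first = racks[local_rack][0]
    # Second replica on a different rack.
    other_rack = next(r for r in racks if r != local_rack)
    second = racks[other_rack][0]
    # Third replica on that same second rack, but a different node.
    third = next(n for n in racks[other_rack] if n != second)
    return [first, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas(racks, "rack1"))  # ['n1', 'n3', 'n4']
```

Two racks hold all three copies, so one rack failure never loses the block, yet only one copy crosses the inter-rack link during the write - the redundancy/bandwidth balance the card mentions.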
36
Q

The name node is….

A

the heart of HDFS. It stores all the metadata of files, such as name, owner, and permissions, and it knows which data nodes the blocks of a file and their replicas are stored on. If the name node fails, all the files are lost because there is no way to reconstruct them.

37
Q

Hadoop is normally deployed on a group of machines called clusters.

A
  • Each machine in the cluster is a node.
  • One of those nodes is the master node; it manages the overall file system.
  • The name node stores the directory structure and the metadata for all files.
  • The other nodes are called data nodes; the data is physically stored on these nodes.
  • The default block size is 128 MB.
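With that 128 MB default, the number of blocks for a file is a simple ceiling division (a quick arithmetic sketch; the last block may be smaller than the rest):

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def num_blocks(file_size_mb):
    # A file is split into ceil(size / block_size) blocks.
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(num_blocks(300))  # 3 blocks: 128 + 128 + 44 MB
print(num_blocks(128))  # 1 block
```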
38
Q

There are 3 scheduling policies available:

A
  • FIFO scheduler
  • Capacity scheduler
  • Fair scheduler
39
Q

FIFO Scheduler

A

FIFO scheduler - resources are allocated on First In First Out basis
- Requests made first are satisfied first before resources are allocated to others.
- On a cluster shared by many applications, FIFO will cause huge wait times. For this reason, the FIFO scheduler is rarely used.
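The wait-time problem can be seen in a toy simulation (hypothetical job names and durations; one job runs to completion at a time, in arrival order):

```python
from collections import deque

# (job name, duration) pairs, in submission order.
jobs = deque([("long_job", 100), ("short_job", 1)])

clock, finish_times = 0, {}
while jobs:
    name, duration = jobs.popleft()  # first in, first out
    clock += duration
    finish_times[name] = clock

print(finish_times)  # {'long_job': 100, 'short_job': 101}
```

A one-unit job finishes at time 101 simply because a long job arrived first - the starvation that the capacity and fair schedulers are designed to avoid.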

40
Q

Capacity Scheduler

A

Capacity scheduler - capacity is distributed to different queues.
- Each queue is allocated a share of the cluster resources
- Jobs can be submitted to a specific queue. Within a queue, FIFO scheduling is followed
- Allows small jobs to complete without getting stuck due to larger ones. The cluster might be under-utilized since capacity is reserved for queues.

41
Q

Fair Scheduler

A

Fair scheduler - resources are proportionally allocated among all running jobs.
- There is no wait time: each job gets a proportional share as soon as it is submitted, and shares are rebalanced as jobs arrive and finish.

42
Q

How do MapReduce, YARN, and HDFS work together?

A
  1. Users define map and reduce tasks (or other tasks) to run on the cluster.
  2. A job is triggered on the cluster.
  3. YARN figures out where and how to run the job, and the result is stored in HDFS.
43
Q

YARN performs these 3 steps using 2 services:

A
  • Resource manager - there is 1 resource manager per Hadoop cluster (which can consist of hundreds or thousands of computing nodes).
  • Node manager - a node manager runs on each data node in the cluster (all the data nodes).
44
Q

Resource Manager

A
  • The resource manager service runs on a single node - usually the same node as the HDFS name node.
  • It launches tasks submitted to YARN and arbitrates the available resources on the cluster among competing applications.
  • It optimizes cluster utilization based on constraints such as capacity guarantees, fairness, and SLAs.
  • It has a pluggable scheduler policy which determines which application runs first.
45
Q

Node manager

A
  • There is one node manager per node.
  • It is responsible for launching and managing containers on that node.
  • It coordinates with the resource manager in order to perform its tasks.
  • It monitors resources and logs and tracks the health of the node - everything related to the one node it is in charge of.
46
Q

A job is submitted to YARN

A
  • YARN has one resource manager and a node manager running on each computing node. A job is first submitted to the Resource Manager.
  • The RM schedules the job based on the constraints specified and the capacity available.
  • The RM then finds a Node Manager on one of the nodes to launch an Application Master process.
47
Q

Application Master Process

A

A process running within a container. It negotiates resources from the RM and works with the node managers to execute and monitor containers. Moving the responsibility for monitoring and executing tasks to the application master is what allowed YARN to scale. There is one application master per application, and it performs its task with the resources allocated to its containers; the number of application masters depends on how many jobs are running in the cluster.

48
Q

The location constraint

A
  • Data locality.
  • To conserve network bandwidth, YARN tries to perform the computation on the same node where the data resides. If no resources are available on that data node, YARN first waits some time for resources to free up. If that fails, the container is requested on the same rack as the concerned data node.