Linux Commands Flashcards

1
Q

5V’s of big data

A
Volume - the amount of data that exists
Velocity - how quickly data is generated and moved
Variety - the diversity of data types
Veracity - the quality and accuracy of data
Value - the worth that can be derived from the data
2
Q

Hadoop cluster

A

A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets. Hadoop clusters consist of a network of connected master and slave nodes built from high-availability, low-cost commodity hardware.

3
Q

Definition of Hadoop

A

Hadoop is an open-source framework that uses a network of many computers to solve problems involving massive amounts of data and computation.

Hadoop is basically three things: a file system (the Hadoop Distributed File System, HDFS), a computation framework (MapReduce), and a management bridge (YARN, Yet Another Resource Negotiator). HDFS lets you store huge amounts of data in a distributed (faster read/write access) and redundant (better availability) manner, and MapReduce lets you process that data in a distributed and parallel manner, though MapReduce is not limited to HDFS. Being a file system, HDFS lacks random read/write capability; it is good for sequential data access. This is where HBase comes into the picture: it is a NoSQL database that runs on top of your Hadoop cluster and provides random, real-time read/write access to your data.
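
A minimal end-to-end sketch of the store-then-process flow (the file names here are hypothetical, and the examples JAR name varies by distribution):

hdfs dfs -mkdir -p /user/adam/in           # create an HDFS directory
hdfs dfs -put ~/data.txt /user/adam/in     # store a local file in HDFS
hadoop jar hadoop-mapreduce-examples.jar wordcount /user/adam/in /user/adam/out
hdfs dfs -cat /user/adam/out/part-r-00000  # read the job's output sequentially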

4
Q

What is Hive?

A

Hive is an application that runs on top of the Hadoop framework and provides a SQL-like interface for processing and querying data. Hive queries are written in HQL (Hive Query Language), which has much the same structure as SQL in an RDBMS.
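
A minimal sketch of that SQL-like interface, assuming a hypothetical Hive table named employees already exists:

hive -e "SELECT department, COUNT(*) FROM employees GROUP BY department;"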

5
Q

What is HBase?

A

HBase is a column-oriented non-relational database management system that runs on top of Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases.
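
A minimal sketch of that random read/write access using the HBase shell (the table and column family names are hypothetical):

echo "create 'users', 'info'" | hbase shell                    # table with one column family
echo "put 'users', 'row1', 'info:name', 'Adam'" | hbase shell  # random write
echo "get 'users', 'row1'" | hbase shell                       # random real-time read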

6
Q

What is Pig?

A

Pig is a high-level platform that lets you create programs that run on Apache Hadoop. Its language is called Pig Latin. The results of Pig are always stored in HDFS.
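
A minimal Pig Latin sketch (the input path and field name are hypothetical); the result lands in HDFS:

pig -e "lines = LOAD '/user/adam/in/data.txt' AS (line:chararray); STORE lines INTO '/user/adam/pig_out';"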

7
Q

What is YARN?

A

YARN is an Apache Hadoop technology and stands for Yet Another Resource Negotiator. YARN is the resource-managing component of Hadoop and consists of a ResourceManager, NodeManagers, and ApplicationMasters.

8
Q

df command

A

The df command (short for disk free) is used to display information about file systems: their total space and available space.
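
For example (-h prints sizes in human-readable units):

df -h        # all mounted file systems
df -h /home  # only the file system containing /home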

9
Q

man command

A

The man command in Linux is used to display the user manual of any command that we can run in the terminal.
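
For example:

man ls       # manual page for ls
man -k disk  # search manual page descriptions for "disk"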

10
Q

fdisk

A

fdisk (format disk) lets you create and manipulate disk partitions. Disk partitioning allows your system to run as if it were actually multiple independent systems, even though it is all on the same hardware.
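
For example (the device name is illustrative; partitioning requires root):

sudo fdisk -l        # list the partition tables of all disks
sudo fdisk /dev/sdb  # interactively partition /dev/sdb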

11
Q

sfdisk

A

sfdisk reads and writes partition tables, but is not interactive like fdisk or cfdisk (it reads input from a file or stdin)
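
A common scripted use is dumping one disk's partition table and replaying it onto another (device names are illustrative):

sudo sfdisk -d /dev/sda > sda.dump  # dump the partition table to a file
sudo sfdisk /dev/sdb < sda.dump     # write the same layout to another disk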

12
Q

cfdisk

A

cfdisk provides basic partitioning functionality with a friendly text-based user interface.
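
For example (the device name is illustrative):

sudo cfdisk /dev/sda  # open the menu-driven partition editor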

13
Q

lsblk

A

lsblk is used to display details about block devices; these block devices (except RAM disks) are basically the files that represent devices connected to the PC.
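
For example:

lsblk     # tree of disks and their partitions, with sizes and mount points
lsblk -f  # also show file system type, label, and UUID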

14
Q

What was the hadoop explosion?

A

Different tools were needed for different tasks, so Hadoop inspired a lot of other technologies that were built alongside and on top of it, like Spark, Hive, and Kafka.

15
Q

What’s the CDH?

A

Cloudera Distribution of Hadoop: an open-source Apache Hadoop distribution.

16
Q

What are some differences between hard disk space and RAM?

A

RAM is short-term memory; it is where programs, outputs, and the inputs to the commands running on the processor are stored.
The hard disk is long-term memory, the only part of the computer that preserves anything when you shut down. It stores files.
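
For example, free reports RAM while df reports disk space:

free -h  # total/used/available RAM
df -h    # total/used/available disk space per file system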

17
Q

What is a VM?

A

A virtual machine is a program on a computer that works as if it were a separate computer inside the main computer. It is a simple way to run more than one operating system on the same computer.

18
Q
Know basic file manipulation commands
ls -al
cd
pwd
mkdir
touch
nano
man 
less 
cat
mv
cp
rm
history
A

ls - displays the contents of a directory (with -al: all entries, including hidden ones, in long format)
cd - change directory
pwd - print working directory: prints the file path you are currently in
mkdir - makes a new directory
touch - creates a file or changes/modifies its timestamps
nano - command-line text editor
man - see the manual for a command
less - view the contents of a file one page at a time
cat - concatenates files and prints their contents to stdout
mv - moves a file from source to destination
cp - copies the specified file
rm - removes a file
history - shows the history of commands
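
A minimal example session tying these together (file names are hypothetical):

mkdir demo && cd demo    # make a directory and enter it
touch notes.txt          # create an empty file
nano notes.txt           # edit it
cat notes.txt            # print its contents
cp notes.txt backup.txt  # copy it
mv backup.txt old.txt    # rename (move) the copy
ls -al                   # list everything, including hidden files
rm old.txt               # remove the copy
history                  # review the commands we just ran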

19
Q

What’s the differences between an absolute and relative path?

A

The difference between an absolute and a relative path is that an absolute path specifies the location from the root directory, whereas a relative path is relative to the current directory.
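
For example, both of these refer to the same (hypothetical) file when the current directory is /home/adam:

cat /home/adam/docs/notes.txt  # absolute: starts from the root /
cat docs/notes.txt             # relative: starts from the current directory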

20
Q

How do permissions work in Unix?

A

There are 3 user types on a Linux system: owner, group, and all (other) users.
Linux divides the file permissions into read, write, and execute, denoted by r, w, and x.
The permissions on a file can be changed by the 'chmod' command.
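
For example, ls -l shows each file's permissions; a hypothetical line:

-rwxr-xr-- 1 adam devs 512 Jan 1 12:00 script.sh

Reading left to right: the owner adam has rwx, the group devs has r-x, and all other users have r--.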

21
Q

What are users, what are groups?

A

Groups can contain multiple users; all users belonging to a group have the same Linux group permissions for a file.
A user or account of a system is uniquely identified by a number called the UID (unique identification number).
A root or super user can access all the files, while a normal user has limited access to files.
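
For example, the id command shows the current user's UID, primary group, and group memberships:

id      # e.g. uid=1000(adam) gid=1000(adam) groups=1000(adam),27(sudo)
groups  # just the group names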

22
Q

How does the chmod command change file permissions?

A

The chmod command can be used to explicitly assign privileges to the owner, the group, and other users. This can be accomplished either using octal number format, i.e. 777 for all privileges for everyone, or through letter (symbolic) format, i.e. u+rwx (user/owner), g+rwx (group), o+rwx (others).
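
For example (the file name is hypothetical):

chmod 777 script.sh      # octal: rwx for owner, group, and others
chmod u+x script.sh      # symbolic: add execute for the owner
chmod g-w,o-r script.sh  # remove write from the group and read from others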

23
Q

What is a package manager? What package manager do we have on Ubuntu?

A

Package managers download applications or parts of applications for you, installing and managing dependencies automatically. On Ubuntu we'll use APT (Advanced Package Tool).
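
For example (the package name is illustrative; installing requires root):

sudo apt update                 # refresh the package index
sudo apt install openjdk-8-jdk  # install a package plus its dependencies
sudo apt remove openjdk-8-jdk   # uninstall it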

24
Q

What is ssh?

A

SSH, also known as Secure Shell or Secure Socket Shell, is a network protocol that gives users, particularly system administrators, a secure way to access a computer over an unsecured network.
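
For example (the user and host names are hypothetical):

ssh adam@cluster-node-01          # log in to a remote machine
ssh adam@cluster-node-01 'df -h'  # or run a single command remotely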

25
Q

Explain difference between Mapper and Reducer?

A

The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. The Reducer is likewise generic; it receives each key together with the grouped values produced by the mappers (after shuffle and sort) and emits the final output.

Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package.

Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).

26
Q

What are some examples of structured and unstructured data?

A

Structured data is highly-organized and formatted in a way so it’s easily searchable in relational databases (dates, phone numbers, ssn, addresses, etc). Unstructured data has no pre-defined format or organization, making it much more difficult to collect, process, and analyze (text files, reports, images, video files, etc).

27
Q

What is a daemon?

A

A daemon is just a long-running process that runs in the background. Hadoop has five such daemons, namely NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker. Each daemon runs separately in its own JVM.
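
Assuming a JDK is installed, the jps tool lists the JVM processes (and thus the Hadoop daemons) running on a machine:

jps  # might print NameNode, DataNode, SecondaryNameNode, etc., depending on the node and Hadoop version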

28
Q

What is data locality and why is it important?

A

In Hadoop, data locality is the practice of moving the computation close to where the actual data resides on a node, instead of moving large data to the computation. This minimizes network congestion and increases the overall throughput of the system.

29
Q

What is the default number of replications for each block? How are these replications typically distributed across the cluster?

A

Three. Replication information and other metadata are stored on the NameNode, and the NameNode makes all decisions about where data/replicas will be stored on the cluster. Each block of a file is replicated across the cluster; with the default rack-aware policy, one replica is written to the local node and the other two are placed on two different nodes of a remote rack.
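
A hedged sketch for inspecting and changing replication (the path is hypothetical):

hdfs fsck /user/adam/data.txt -files -blocks  # show a file's blocks and replication
hdfs dfs -setrep -w 2 /user/adam/data.txt     # change a file's replication factor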

30
Q

What is the job of the NameNode? What about the DataNode?

A

The NameNode, also known as the master daemon, stores the metadata of HDFS, which is the directory tree of all files in the system. It also tracks where file blocks are kept across the cluster.

The DataNodes are responsible for storing the actual data in HDFS. They perform operations like creation, replication, and deletion of data blocks.
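
For example, this asks the NameNode to report cluster state, including each DataNode's capacity and usage:

hdfs dfsadmin -report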

31
Q

How many namenodes exist in a cluster?

A

Typically 1 NameNode exists in a cluster (a Standby NameNode may be added for failover, and HDFS Federation allows multiple independent NameNodes).

32
Q

How are datanodes fault tolerant?

A

For DataNodes, their fault tolerance is handled by the NameNode. DNs send heartbeats to the NN, so when a DN goes down, it stops sending those heartbeats, and the NN knows to make new replicas of all the data stored on the downed DN.

33
Q

How does a standby namenode make the namenode fault tolerant?

A

This is a daemon that runs on another machine and follows the same steps as the NameNode while they are occurring in real time: it receives the information in the EditLog and keeps its own FSImage. Then, if the real NameNode fails, the Standby NameNode steps in and becomes the new NameNode. This is called failover. This is the best option, but it requires more resources.

34
Q

What purpose does a secondary namenode serve?

A

The Secondary NameNode takes the edit logs from the primary NameNode at regular intervals and merges them into the fsimage.

Once it has the updated fsimage, it copies the fsimage back to the NameNode.

So, whenever the NameNode restarts, it will use this fsimage and the startup time will be reduced accordingly.

It does not provide failover, which is the job of the Standby NameNode.

35
Q

How might we scale a HDFS cluster past a few thousand machines?

A

HDFS Federation, with multiple NameNodes, can be used if you need tens of thousands of machines.

36
Q

In a typical hadoop cluster, what’s the relationship between HDFS data nodes and Yarn node managers?

A

Typically one of each runs per worker machine as the worker daemons. Node managers manage bundles of resources called containers running on their machine and report the status back to the ResourceManager.
We submit jobs to the ResourceManager; tasks are the individual pieces jobs are broken up into, and tasks are what run inside containers. Because DataNodes share machines with node managers, the map and reduce tasks can run on the nodes that hold the data.

37
Q

When does the combine phase run and where does each combine task run?

A

The Combiner performs a partial reduction of each map task's output, running on the same node as that map task, before the shuffle and sort phase.

The output of the combiner is then sent over the network to the actual reduce task as input.

38
Q

Know the input and output of the shuffle + sort phase?

A

The shuffle and sort phase takes the output from the mappers and groups and orders all values by their associated keys before passing them to the reducer, making the data easier to parse.

39
Q

What does the nodemanager do?

A

Node managers manage bundles of resources called containers running on their machine and report the status back to the ResourceManager.
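
For example, this asks the ResourceManager to list the NodeManagers and their container counts:

yarn node -list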

40
Q

What does the resourcemanager do?

A

The ResourceManager manages the resource allocation in the cluster.
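
For example, applications (jobs) submitted to the ResourceManager can be listed and inspected:

yarn application -list                     # running applications
yarn application -status <application-id>  # details for one application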

41
Q

Which responsibilities does the scheduler have?

A

The scheduler is responsible for allocating resources (containers) across the cluster based on requests.

42
Q

What about the applications manager?

A

The applications manager accepts job submissions and creates the ApplicationMaster for each submitted job.

43
Q

What is the applications master and how many of them are there per job?

A

There is 1 per job (managed by the ApplicationsManager).
ApplicationMasters run in containers on the cluster and are responsible for communicating with the scheduler to obtain the resources their jobs need. This allows the ApplicationsManager to be ultimately responsible for job completion, while offloading most of the work to ApplicationMasters running on worker nodes.

44
Q

What is a container in Yarn?

A

Containers are bundles of resources; tasks are what run inside them.
Tasks run in containers across the cluster, and the scheduler allocates those containers across the cluster based on requests.
ApplicationMasters themselves also run in containers on the cluster.

45
Q

How do we interact with distributed file system?

A

Through the FS shell commands (hdfs dfs). There is also jar, which runs a JAR file: users can bundle their MapReduce code in a JAR file and execute it using this command.
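
For example (the JAR, class, and paths are hypothetical):

hdfs dfs -ls /user/adam                 # an FS shell command
hadoop jar myjob.jar MyDriver /in /out  # run bundled MapReduce code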

46
Q

What do the following commands do?
hdfs dfs -get /user/adam/myfile ~
hdfs dfs -put ~/coolfile /user/adam/

A

1) gets a file from HDFS and puts it into our local system

2) puts a file from our local environment into HDFS

47
Q

What is/was Unix? Why is Ubuntu a Unix-like operating system?

A

Unix is an operating system originally developed at Bell Labs in the 1970s. Ubuntu is Unix-like because it is built on Linux, which follows Unix's design and interfaces without being derived from the original Unix code.

48
Q

What is Big Data?

A

Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently.

49
Q

What are the 4 types of data?

A

Unstructured data - data that has no inherent structure and is usually stored as different types of files (PDFs, images, video).

Quasi-structured - textual data with erratic formats that can be formatted with effort and software tools.

Semi-structured - textual data files with an apparent pattern, enabling analysis (spreadsheets and XML files).

Structured - data having a defined data model, format, and structure (databases).

50
Q

What is HDFS?

A

HDFS stands for Hadoop Distributed File System, and it is the storage system of the Hadoop framework. It stores data in blocks of 128 MB (by default) and reads data sequentially in a single seek operation.
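
A hedged check of the configured block size (dfs.blocksize is the standard Hadoop 2+ key):

hdfs getconf -confKey dfs.blocksize  # prints e.g. 134217728 (128 MB)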

51
Q

What are the characteristics of HDFS?

A
Fault tolerant
Scalable
Rack-aware
Support for heterogeneous clusters
Built for large datasets