Hadoop Flashcards

Learn primary terms in the Hadoop framework

1
Q

What is the NameNode?

A

The term used in Hadoop's HDFS layer for the master node. The NameNode hosts the ResourceManager / JobTracker daemon and, working as part of YARN, keeps track of which DataNodes (slave nodes) have available resources in HDFS.
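
A quick way to see the NameNode's view of the cluster from the command line (a minimal sketch; assumes a running cluster with the hdfs client configured):

hdfs dfsadmin -report   # asks the NameNode for its cluster view: capacity, live DataNodes, etc.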

2
Q

What are the 5 pillars of Hadoop?

A

1) Data Management
2) Data Access
3) Data Governance and Integration
4) Security
5) Operations

3
Q

When should you not use the Hadoop framework?

A

❑ Low-latency data access: quick access to small parts of data.
❑ Multiple data modifications: Hadoop is a better fit when we are primarily reading data rather than writing it.
❑ Lots of small files: Hadoop is a better fit in scenarios where we have few but large files.

4
Q

Where is the JobTracker stored?

A

On the NameNode.

5
Q

What does the master node hold?

A

The NameNode (HDFS) and the ResourceManager (YARN/MapReduce).

6
Q

Where is YARN located?

A

Yet Another Resource Negotiator (YARN) is located on the NameNode (the master node).

7
Q

What are the largest challenges (per the PowerPoint) facing the big data space?

A

❑ Lack of skilled staff

❑ Data governance issues – With so much data available, it becomes even more critical to have a framework in place for deciding what data belongs in the system. However, just 30% of the companies surveyed by TDWI said that data governance teams were heavily involved in Big Data management.

❑ Organizational readiness – As with business intelligence, successfully analyzing Big Data takes more than just installing software and other tools. The entire organization needs to be on the same page, and there must be a clearly articulated strategy built around actual business goals.

8
Q

What are the 7 Hadoop file formats?

A

  1. Text Files (CSV, TSV, …)
  2. JSON Records
  3. Sequence Files
  4. Avro Files
  5. RC Files
  6. ORC Files
  7. Parquet Files

9
Q

What is YARN?

A

A framework for job scheduling and cluster resource management. It is the resource management layer of Hadoop; MapReduce, the processing layer, runs on top of it.
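
For example, the standard yarn CLI can show what the ResourceManager is tracking (a sketch; assumes a running YARN cluster):

yarn node -list          # NodeManagers registered with the ResourceManager
yarn application -list   # applications currently scheduled on the cluster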

10
Q

What is MapReduce? Is it the storage or processing layer of Hadoop?

A

A YARN-based system for parallel processing of large data sets. It is the data processing layer of Hadoop.
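
A classic way to see MapReduce in action is the wordcount example job that ships with Hadoop (a sketch; the jar path and the /input and /output paths are assumptions that vary by installation):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
hdfs dfs -cat /output/part-r-00000   # reducer output: one (word, count) pair per line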

11
Q

What is the Hadoop HDFS get syntax?

A

hdfs dfs -get [-crc] <source> <local destination>

❑ Hadoop HDFS get Command Description

This HDFS command copies the file or directory in HDFS identified by the source to the local file system path identified by the local destination, retrieving all files that match the source path entered by the user. (The related -getmerge command is the one that concatenates the matched files into one single, merged file at the local destination.)

❑ Hadoop HDFS get Command Example:
hdfs dfs -get /user/dataflair/dir2/sample /home/dataflair/Desktop
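
For comparison, a -getmerge sketch (the paths are hypothetical):

hdfs dfs -getmerge /user/dataflair/dir2 /home/dataflair/Desktop/merged.txt   # concatenates every file under dir2 into one local file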

12
Q

What are the read/write file commands in Hadoop?

A

hdfs dfs -text {file_name}   # prints the file in text form, decoding compressed or sequence files
hdfs dfs -cat /hadoop/test   # cat command: prints the raw file contents
hdfs dfs -appendToFile {local source} {destination}   # appends a local file to an existing HDFS file
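
A short round trip with these commands (a sketch; the paths are hypothetical):

echo "extra line" > /tmp/extra.txt                   # create a small local file
hdfs dfs -appendToFile /tmp/extra.txt /hadoop/test   # append it to the HDFS file
hdfs dfs -cat /hadoop/test                           # print the updated contents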

13
Q

How do you copy a file from the local file system into HDFS, and from HDFS back to local?

A

Local to HDFS:
hdfs dfs -copyFromLocal {local source} {HDFS destination path}
hdfs dfs -put {local source} {HDFS destination}

HDFS to local:
hdfs dfs -copyToLocal {HDFS source path} {local destination path}
hdfs dfs -get {HDFS source} {local destination}
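
Concretely (a sketch; the file and directory names are hypothetical):

hdfs dfs -put /home/dataflair/sample.txt /user/dataflair/       # local -> HDFS
hdfs dfs -get /user/dataflair/sample.txt /home/dataflair/copy   # HDFS -> local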

14
Q

Create a directory in a specified HDFS location. This command does not fail even if the directory already exists.

A

hdfs dfs -mkdir -p {destination, e.g. /hadoop2}
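
For example (the path is hypothetical):

hdfs dfs -mkdir -p /hadoop2/dir1/dir2   # creates the whole path; succeeds even if it already exists
hdfs dfs -ls /hadoop2/dir1              # verify the directory was created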

15
Q

What are the three stages of MapReduce, and in what order do they run?

A

A MapReduce program executes in three stages: the map stage, the shuffle stage, and the reduce stage (traced in the sketch after the list).

  1. Map stage: The mapper's job is to process the input data. Generally the input data is in the form of a file or directory stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line, and the mapper processes the data and creates several small chunks of data.
  2. Shuffle stage: The intermediate mapper output is split, sorted, and grouped by key before being handed to the reducers.
  3. Reduce stage: The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
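
A worked micro-example of the three stages for word counting (a sketch with made-up input):

Input (two lines):  "cat dog" and "cat bird"
Map output:         (cat,1) (dog,1) (cat,1) (bird,1)
After shuffle:      cat -> [1,1], dog -> [1], bird -> [1]
Reduce output:      (cat,2), (dog,1), (bird,1)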