Hadoop Flashcards
Learn primary terms in the Hadoop framework
What is the NameNode?
The master daemon of HDFS in the Hadoop framework. It holds the file system metadata and tracks which DataNodes (slave nodes) store which blocks.
The ResourceManager (YARN) / JobTracker (Hadoop 1) is a separate daemon that often runs on the same master node; it tracks which slave nodes have resources available for jobs.
What are the 5 pillars of Hadoop?
1) Data Management
2) Data Access
3) Data Governance and Integration
4) Security
5) Operations
When should you not use the Hadoop framework?
❑ Low-latency data access: quick access to small parts of data.
❑ Multiple data modifications: Hadoop is a better fit only if we are primarily concerned with reading data, not writing it.
❑ Lots of small files: Hadoop is a better fit in scenarios where we have few but large files.
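Why small files hurt can be sketched with rough arithmetic. The snippet below is only an illustration and rests on two assumptions that are not from the flashcards: a 128 MB HDFS block size, and a commonly quoted rule of thumb of about 150 bytes of NameNode memory per file or block object. The function name namenode_bytes is hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # assumption: 128 MB HDFS block size
BYTES_PER_OBJECT = 150       # assumption: rule-of-thumb NameNode cost per file/block object

def namenode_bytes(num_files, file_size):
    """Rough NameNode memory estimate: one object per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# The same 1 TiB of data, stored two different ways.
few_large = namenode_bytes(num_files=1024, file_size=1024**3)      # 1,024 files of 1 GiB
many_small = namenode_bytes(num_files=1024**2, file_size=1024**2)  # 1,048,576 files of 1 MiB

print(f"few large files : {few_large:,} bytes of metadata")
print(f"many small files: {many_small:,} bytes of metadata")
```

Under these assumptions the same volume of data costs the NameNode roughly 200 times more metadata when it is split into small files.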
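Where does the JobTracker run?
On the master node, typically the same machine as the NameNode
What does the Master node hold?
NameNode (HDFS) and ResourceManager (YARN; the JobTracker in Hadoop 1 MapReduce)
Where is YARN located?
The ResourceManager of Yet Another Resource Negotiator (YARN) runs on the master node, often alongside the NameNode; the NodeManagers run on the slave nodes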
What are the largest challenges (per the PowerPoint) facing the big data space?
❑ Lack of skilled staff
❑ Data governance issues – With so much data available, it becomes even more critical to have a framework in place for deciding what data belongs in the system. However, just 30% of the companies surveyed by TDWI said that data governance teams were heavily involved in Big Data management.
❑ Organizational readiness – As with business intelligence, successfully analyzing Big Data takes more than just installing software and other tools. The entire organization needs to be on the same page, and there must be a clearly articulated strategy built around actual business goals.
What are the 7 Hadoop file formats?
- Text Files (CSV, TSV …)
- JSON Records
- Sequence Files
- Avro Files
- RC Files
- ORC Files
- Parquet Files
What is YARN?
A framework for job scheduling and cluster resource management. It is the resource management layer of Hadoop.
What is MapReduce? Is it the storage or processing layer of Hadoop?
A YARN-based system for parallel processing of large data sets. It is the data processing layer of Hadoop.
What is the Hadoop HDFS get syntax?
get [-crc] <src> <localdst>
❑ Hadoop HDFS get Command Description
This HDFS fs command copies the file or directory in HDFS identified by the source to the local file system path identified by the local destination. It retrieves all files that match the source path entered by the user and copies each of them to the local destination. Merging them into one single local file is the job of the separate getmerge command, not get.
❑ Hadoop HDFS get Command Example:
hdfs dfs -get /user/dataflair/dir2/sample /home/dataflair/Desktop
What are the read/write file commands in Hadoop?
hdfs dfs -text {file_name} #prints the file in text format
hdfs dfs -cat /hadoop/test #cat command
hdfs dfs -appendToFile {source} {destination} #appends the local source file to the destination file in HDFS
How do you copy files between the local file system and HDFS?
hdfs dfs -copyFromLocal {source} {new destination path} #local to HDFS
hdfs dfs -get {source} {new destination} #HDFS to local
hdfs dfs -copyToLocal {source path} {new destination path} #HDFS to local
hdfs dfs -put {source} {new destination} #local to HDFS
Create a directory in a specified HDFS location. This command does not fail even if the directory already exists.
hdfs dfs -mkdir -p {destination e.g.: '/hadoop2'}
What are the three stages of MapReduce? What order do they go in?
A MapReduce program executes in three stages, in this order: map stage, shuffle stage, and reduce stage.
- Map stage: The mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
- Reduce stage (intermediate splitting followed by reducing): This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
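The order of the three stages can be sketched as a word count in plain Python. This is a minimal in-memory illustration, not Hadoop's actual API: the function names map_stage, shuffle_stage, and reduce_stage and the sample input lines are hypothetical, and a real job would read from and write to HDFS across many nodes.

```python
from collections import defaultdict

def map_stage(lines):
    """Map: read the input line by line and emit (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_stage(pairs):
    """Shuffle: group every value emitted for the same key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_stage(groups):
    """Reduce: collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical sample input standing in for a file stored in HDFS.
lines = ["the quick fox", "the lazy dog"]
word_counts = reduce_stage(shuffle_stage(map_stage(lines)))
print(word_counts)  # {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Each stage consumes only the output of the stage before it, which is why the order map, shuffle, reduce cannot be rearranged.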