Top 50 Data Architecture Questions Flashcards

1
Q

What is big data?

A

Big Data is a term associated with complex and large datasets. A relational database cannot handle big data, which is why special tools and methods are used to perform operations on such vast collections of data. Big data enables companies to understand their business better and helps them derive meaningful information from the unstructured and raw data collected on a regular basis. Big data also allows companies to make better, data-backed business decisions.

2
Q

What are the five V’s of Big Data?

A

Volume – The amount of data, which is growing at a high rate; data volumes are typically measured in petabytes.
Velocity – The rate at which data grows. Social media plays a major role in the velocity of growing data.
Variety – The different types and formats of data, such as text, audio, and video.
Veracity – The uncertainty of the available data. Veracity arises because the high volume of data brings incompleteness and inconsistency.
Value – Turning data into value. By turning the big data they access into value, businesses can generate revenue.

3
Q

Tell us how big data and Hadoop are related to each other.

A

Big data and Hadoop are near-synonymous terms. With the rise of big data, Hadoop, a framework that specializes in big data operations, also became popular. Professionals can use the framework to analyze big data and help businesses make informed decisions.

4
Q

How is big data analysis helpful in increasing business revenue?

A

Through sentiment analysis, predictive analytics, and prescriptive analytics, big data analysis helps businesses turn their data into insights and, ultimately, revenue.

5
Q

Define respective components of HDFS and YARN

A

The two main components of HDFS are:

NameNode – This is the master node; it maintains and processes the metadata for the data blocks within HDFS.
DataNode (slave node) – This node stores the actual data blocks and handles read/write processing as directed by the NameNode.

The two main components of YARN are:

ResourceManager – This component receives processing requests and allocates them to the respective NodeManagers depending on processing needs.
NodeManager – It executes tasks on each individual node.

6
Q

Why is Hadoop used for Big Data Analytics?

A

Since data analysis has become one of the key parameters of business, enterprises are dealing with massive amounts of structured, unstructured, and semi-structured data. Analyzing unstructured data is quite difficult, and this is where Hadoop plays a major role with its capabilities for:

Storage
Processing
Data collection
Moreover, Hadoop is open source and runs on commodity hardware. Hence it is a cost-effective solution for businesses.

7
Q

What is fsck?

A

fsck stands for File System Check. It is a command used by HDFS to check for inconsistencies and problems in files. For example, if there are any missing blocks in a file, HDFS is notified through this command.
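A typical invocation checks a path and reports files, blocks, and their locations; the path below is only illustrative:

$ hdfs fsck /user/data -files -blocks -locations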

8
Q

What are the main differences between NAS (Network-attached storage) and HDFS?

A

HDFS runs on a cluster of machines, while NAS runs on an individual machine. HDFS replicates data blocks across the machines of the cluster, so data redundancy is inherent to HDFS. NAS, by contrast, uses no such replication protocol, so the chances of data redundancy are much lower.
Data is stored as data blocks on the local drives of the cluster machines in the case of HDFS. In the case of NAS, it is stored on dedicated hardware.

9
Q

What is the Command to format the NameNode?

A

$ hdfs namenode -format

10
Q

Do you have any Big Data experience? If so, please share it with us.

A

There is no specific answer to this question, as it is subjective and depends on your previous experience. By asking it in a big data interview, the interviewer wants to understand your previous experience and evaluate whether you fit the project requirements.

So, how will you approach the question? If you have previous experience, start with your duties in your past position and slowly add details to the conversation. Tell them about your contributions that made the project successful. This question is generally the second or third question asked in an interview, and later questions build on it, so answer it carefully. You should also take care not to go overboard with a single aspect of your previous job. Keep it simple and to the point.

11
Q

Do you prefer good data or good models? Why?

A

They are not mutually exclusive: good models depend on good data, so neither can simply be chosen over the other.

12
Q

Will you optimize algorithms or code to make them run faster?

A

They are not mutually exclusive: in practice you optimize both the algorithm and the code, depending on where the larger speed-up lies.

13
Q

How do you approach data preparation?

A

As you already know, data preparation is required to obtain the necessary data, which can then be used for modeling purposes. You should convey this message to the interviewer. You should also emphasize the type of model you are going to use and the reasons behind choosing that particular model. Last but not least, you should discuss important data preparation terms such as transforming variables, outlier values, unstructured data, identifying gaps, and others.

14
Q

How would you transform unstructured data into structured data?

A

Unstructured data is very common in big data. It should be transformed into structured data to ensure proper data analysis. You can start answering the question by briefly differentiating between the two forms. Once done, discuss the methods you use to transform one into the other. You might also share a real-world situation where you did this. If you have recently graduated, you can share information related to your academic projects.

15
Q

Which hardware configuration is most beneficial for Hadoop jobs?

A

Dual-processor or dual-core machines with 4 to 8 GB of RAM and ECC memory are ideal for running Hadoop operations. However, the hardware configuration varies based on the project-specific workflow and process flow, and needs to be customized accordingly.

16
Q

What happens when two users try to access the same file in the HDFS?

A

HDFS supports exclusive writes only. Hence, only the first user will receive the grant (write lease) for file access, and the second user's request will be rejected.

17
Q

How to recover a NameNode when it is down?

A

Use the FsImage, which is the file system metadata replica, to start a new NameNode.
Configure the DataNodes and the clients so that they acknowledge the newly started NameNode.
Once the new NameNode has finished loading the last checkpoint FsImage and has received enough block reports from the DataNodes, it will start serving clients.

18
Q

What do you understand by Rack Awareness in Hadoop?

A

It is an algorithm applied by the NameNode to decide how blocks and their replicas are placed. Depending on rack definitions, replicas are placed so that network traffic between DataNodes on different racks is minimized. For example, with a replication factor of 3, two copies are placed on one rack and the third copy on a separate rack.
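Rack definitions are usually supplied to Hadoop through a topology script referenced in core-site.xml; the property name below is the Hadoop 2 form and the script path is illustrative:

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>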

19
Q

What is the difference between “HDFS Block” and “Input Split”?

A

An HDFS Block is the physical division of the data into blocks for storage and processing in HDFS.

An Input Split is the logical division of the data that the MapReduce framework assigns to an individual mapper for the map operation.

20
Q

Explain the different modes in which Hadoop run.

A

Standalone (Local) Mode – By default, Hadoop runs in local mode, i.e., on a single, non-distributed node. This mode uses the local file system to perform input and output operations. It does not use HDFS, so it is mainly used for debugging, and no custom changes to the configuration files are needed.
Pseudo-Distributed Mode – Hadoop runs on a single node just as in standalone mode, but each daemon runs in a separate Java process. Since all the daemons run on a single node, the same node acts as both master and slave (see the configuration sketch after this list).
Fully Distributed Mode – All the daemons run on separate individual nodes, forming a multi-node cluster, with different nodes serving as master and slaves.
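As a minimal sketch, pseudo-distributed mode is typically enabled by pointing fs.defaultFS at a local HDFS instance in core-site.xml and setting the replication factor to 1 in hdfs-site.xml; the port shown is only a common convention:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>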

21
Q

What are the Hadoop components?

A

HDFS, YARN, and MapReduce

22
Q

What are the configuration parameters in a “MapReduce” program?

A

Input locations of Jobs in the distributed file system
Output location of Jobs in the distributed file system
The input format of data
The output format of data
The class which contains the map function
The class which contains the reduce function
JAR file which contains the mapper, reducer and the driver classes

23
Q

What is a block in HDFS and what is its default size in Hadoop 1 and Hadoop 2? Can we change the block size?

A

Blocks are the smallest continuous units of data storage on a hard drive. In HDFS, blocks are distributed across the Hadoop cluster.

The default block size in Hadoop 1 is: 64 MB
The default block size in Hadoop 2 is: 128 MB
Yes, we can change the block size using the dfs.block.size parameter (dfs.blocksize in Hadoop 2) located in the hdfs-site.xml file.
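For example, a 256 MB block size could be configured in hdfs-site.xml as follows (the value is illustrative; dfs.blocksize is the non-deprecated property name in Hadoop 2):

<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>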

24
Q

Explain JobTracker in Hadoop

A

JobTracker is a JVM process in Hadoop used to submit and track MapReduce jobs.

JobTracker performs the following activities in Hadoop, in sequence:

JobTracker receives the jobs that a client application submits.
JobTracker communicates with the NameNode to determine the location of the data.
JobTracker allocates TaskTracker nodes based on available slots.
It submits the work to the allocated TaskTracker nodes.
JobTracker monitors the TaskTracker nodes.
When a task fails, JobTracker is notified and decides how to reallocate the task.

25
Q

What are the different configuration files in Hadoop?

A

core-site.xml – This configuration file contains the Hadoop core configuration settings, for example, the I/O settings common to MapReduce and HDFS. It specifies the hostname and port of the default file system.

mapred-site.xml – This configuration file specifies a framework name for MapReduce by setting mapreduce.framework.name

hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It also specifies default block permission and replication checking on HDFS.

yarn-site.xml – This configuration file specifies configuration settings for ResourceManager and NodeManager.
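For instance, mapred-site.xml commonly contains an entry like the following to run MapReduce on YARN:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>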

26
Q

How can you achieve security in Hadoop?

A

Security in Hadoop is achieved with Kerberos, which involves three steps:

Authentication – The first step is authentication of the client against the authentication server, which then provides a time-stamped TGT (Ticket-Granting Ticket) to the client.
Authorization – In this step, the client uses the received TGT to request a service ticket from the TGS (Ticket-Granting Server).
Service Request – This is the final step to achieve security in Hadoop. The client uses the service ticket to authenticate itself to the server.
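As a sketch, on a Kerberos-secured cluster a user typically obtains the TGT with kinit before issuing HDFS commands; the principal and path below are illustrative:

$ kinit analyst@EXAMPLE.COM
$ hdfs dfs -ls /user/analyst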

27
Q

What is the syntax you use to run a MapReduce program?

A

$ hadoop jar <jar_file>.jar <main_class> /input_path /output_path
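A concrete example, where the JAR name, main class, and paths are purely illustrative:

$ hadoop jar wordcount.jar WordCount /user/data/input /user/data/output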

28
Q

What are the different file permissions in HDFS for files or directory levels?

A
There are three user levels in HDFS:
Owner
Group
Others

For each user level, there are three permissions:
read (r)
write (w)
execute (x)
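Permissions can be inspected and changed with the standard HDFS shell commands; the path and mode below are illustrative:

$ hdfs dfs -ls /user/data
$ hdfs dfs -chmod 750 /user/data
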
29
Q

How to restart all the daemons in Hadoop?

A

To restart all the daemons, you must first stop all of them. The Hadoop installation directory contains an sbin directory that stores the scripts used to stop and start daemons in Hadoop.

Use the /sbin/stop-all.sh command to stop all the daemons and then use the /sbin/start-all.sh command to start them all again.
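In recent Hadoop versions the all-in-one scripts are deprecated in favour of the per-service scripts, which can be used instead:

$ ./sbin/stop-dfs.sh
$ ./sbin/stop-yarn.sh
$ ./sbin/start-dfs.sh
$ ./sbin/start-yarn.sh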

30
Q

What is the use of jps command in Hadoop?

A

The jps command is used to check whether the Hadoop daemons are running properly. It lists all the Java daemons running on a machine, i.e., DataNode, NameNode, NodeManager, ResourceManager, etc.
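Illustrative output on a healthy single-node setup (process IDs will differ):

4821 NameNode
4933 DataNode
5102 SecondaryNameNode
5267 ResourceManager
5378 NodeManager
5590 Jps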

31
Q

Explain the process that overwrites the replication factors in HDFS.

A

There are two methods to overwrite the replication factor in HDFS:

Method 1: On File Basis

In this method, the replication factor is changed on a per-file basis using the Hadoop FS shell. The command used for this is:

$ hadoop fs -setrep -w 2 /my/test_file

Here, test_file is the file whose replication factor will be set to 2.

Method 2: On Directory Basis

In this method, the replication factor is changed on a directory basis, i.e., the replication factor for all the files under a given directory is modified.

$ hadoop fs -setrep -w 5 /my/test_dir

Here, test_dir is the name of the directory; the replication factor for the directory and all the files in it will be set to 5.
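To verify the change, hadoop fs -ls can be used; for a file, the second column of its output shows the current replication factor (the path is illustrative):

$ hadoop fs -ls /my/test_file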

32
Q

What will happen with a NameNode that doesn’t have any data?

A

A NameNode without any data does not exist in Hadoop. If there is a NameNode, it will contain some data; otherwise it will not exist.

33
Q

Explain NameNode recovery process.

A

The NameNode recovery process involves the following steps to get the Hadoop cluster running:

In the first step of the recovery process, a new NameNode is started using the file system metadata replica (FsImage).
The next step is to configure the DataNodes and clients so that they acknowledge the new NameNode.
In the final step, the new NameNode starts serving clients once it has finished loading the last checkpoint FsImage and has received enough block reports from the DataNodes.
Note: Don't forget to mention that this NameNode recovery process consumes a lot of time on large Hadoop clusters, which makes routine maintenance difficult. For this reason, the HDFS high-availability architecture is recommended.

34
Q

How is the Hadoop CLASSPATH essential to start or stop Hadoop daemons?

A

CLASSPATH includes the necessary directories that contain the jar files needed to start or stop Hadoop daemons. Hence, setting the CLASSPATH is essential to start or stop Hadoop daemons.

However, setting up the CLASSPATH manually every time is not standard practice. Usually, the CLASSPATH is set inside the /etc/hadoop/hadoop-env.sh file, so it is loaded automatically whenever Hadoop runs.
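A sketch of the kind of entry typically found in hadoop-env.sh, with an illustrative path for additional jars:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/hadoop/extra-libs/*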

35
Q

Why is HDFS only suitable for large data sets and not the correct tool to use for many small files?

A

This is due to the performance limitations of the NameNode. The NameNode keeps the metadata for every file, directory, and block in HDFS in memory, so a huge number of small files consumes a disproportionate amount of NameNode memory relative to the data actually stored. HDFS is therefore optimized for a smaller number of large files rather than for many small ones.

36
Q

Why do we need Data Locality in Hadoop? Explain.

A

Datasets in HDFS are stored as blocks on the DataNodes of the Hadoop cluster. During the execution of a MapReduce job, each Mapper processes a block (Input Split). If the data does not reside on the node where the Mapper is running, it has to be copied over the network from the DataNode that holds it to the Mapper's node.

If a MapReduce job has more than 100 Mappers and each one tries to copy data from another DataNode in the cluster simultaneously, serious network congestion results, which is a big performance issue for the overall system. Hence, keeping the computation close to the data is an effective and cost-effective solution, technically termed data locality in Hadoop. It helps to increase the overall throughput of the system. Data locality comes in three categories:
Data local – The data and the Mapper reside on the same node. This is the closest proximity of data and the most preferred scenario.
Rack local – The Mapper and the data reside on the same rack but on different DataNodes.
Different rack – The Mapper and the data reside on different racks.

37
Q

If DFS can handle a large volume of data, why do we need the Hadoop framework?

A

Hadoop is not only for storing large data but also for processing it. Though a DFS (Distributed File System) can also store the data, it lacks the following features:

It is not fault tolerant
Data movement over a network depends on bandwidth.

38
Q

What is SequenceFileInputFormat?

A

Hadoop uses a specific file format known as a sequence file, which stores data as serialized key-value pairs. SequenceFileInputFormat is an input format for reading sequence files.