Top 50 Data Architecture Questions Flashcards

1
Q

What is big data?

A

Big Data is a term associated with complex and large datasets. A relational database cannot handle big data, which is why special tools and methods are used to perform operations on such vast collections of data. Big data enables companies to understand their business better and helps them derive meaningful information from the unstructured and raw data collected on a regular basis. Big data also allows companies to make better, data-backed business decisions.

2
Q

What are the five V’s of Big Data?

A

Volume – The amount of data, which is growing at a high rate; data volumes are typically measured in petabytes.
Velocity – The rate at which data grows. Social media plays a major role in the velocity of growing data.
Variety – The different types and formats of data, such as text, audio, and video.
Veracity – The uncertainty of the available data. Veracity arises because the high volume of data brings incompleteness and inconsistency.
Value – Turning data into value. By turning the big data they access into value, businesses can generate revenue.

3
Q

Tell us how big data and Hadoop are related to each other.

A

Big data and Hadoop are near-synonymous terms. With the rise of big data, Hadoop, a framework that specializes in big data operations, also became popular. Professionals can use the framework to analyze big data and help businesses make informed decisions.

4
Q

How is big data analysis helpful in increasing business revenue?

A

Through sentiment analysis, predictive analytics, and prescriptive analytics, big data analysis helps businesses turn their data into insights and, ultimately, revenue.

5
Q

Define respective components of HDFS and YARN

A

The two main components of HDFS are:

NameNode – This is the master node; it maintains and processes the metadata for the data blocks within HDFS.
DataNode (slave node) – This node stores the actual data blocks and handles read/write processing as directed by the NameNode.

The two main components of YARN are:

ResourceManager – This component receives processing requests and allocates them to the respective NodeManagers depending on processing needs.
NodeManager – It executes tasks on each individual node.

6
Q

Why is Hadoop used for Big Data Analytics?

A

Since data analysis has become one of the key parameters of business, enterprises are dealing with massive amounts of structured, unstructured, and semi-structured data. Analyzing unstructured data is quite difficult, and this is where Hadoop plays a major role with its capabilities for:

Storage
Processing
Data collection
Moreover, Hadoop is open source and runs on commodity hardware. Hence it is a cost-effective solution for businesses.

7
Q

What is fsck?

A

fsck stands for File System Check. It is a command used by HDFS to check for inconsistencies and problems in files. For example, if there are any missing blocks in a file, HDFS is notified through this command.
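A typical invocation checks a path and reports files, blocks, and their locations; the path below is only illustrative:

$ hdfs fsck /user/data -files -blocks -locations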

8
Q

What are the main differences between NAS (Network-attached storage) and HDFS?

A

HDFS runs on a cluster of machines, while NAS runs on an individual machine. HDFS replicates data blocks across the machines of the cluster, so data redundancy is inherent to HDFS. NAS, by contrast, uses no such replication protocol, so the chances of data redundancy are much lower.
Data is stored as data blocks on the local drives of the cluster machines in the case of HDFS. In the case of NAS, it is stored on dedicated hardware.

9
Q

What is the Command to format the NameNode?

A

$ hdfs namenode -format

10
Q

Do you have any Big Data experience? If so, please share it with us.

A

There is no specific answer to this question, as it is subjective and depends on your previous experience. By asking it in a big data interview, the interviewer wants to understand your previous experience and evaluate whether you fit the project requirements.

So, how will you approach the question? If you have previous experience, start with your duties in your past position and slowly add details to the conversation. Tell them about your contributions that made the project successful. This question is generally the second or third question asked in an interview, and later questions build on it, so answer it carefully. You should also take care not to go overboard with a single aspect of your previous job. Keep it simple and to the point.

11
Q

Do you prefer good data or good models? Why?

A

They are not mutually exclusive: good models depend on good data, so neither can simply be chosen over the other.

12
Q

Will you optimize algorithms or code to make them run faster?

A

They are not mutually exclusive: in practice you optimize both the algorithm and the code, depending on where the larger speed-up lies.

13
Q

How do you approach data preparation?

A

As you already know, data preparation is required to obtain the necessary data, which can then be used for modeling purposes. You should convey this message to the interviewer. You should also emphasize the type of model you are going to use and the reasons behind choosing that particular model. Last but not least, you should discuss important data preparation terms such as transforming variables, outlier values, unstructured data, identifying gaps, and others.

14
Q

How would you transform unstructured data into structured data?

A

Unstructured data is very common in big data. It should be transformed into structured data to ensure proper data analysis. You can start answering the question by briefly differentiating between the two forms. Once done, discuss the methods you use to transform one into the other. You might also share a real-world situation where you did this. If you have recently graduated, you can share information related to your academic projects.

15
Q

Which hardware configuration is most beneficial for Hadoop jobs?

A

Dual-processor or dual-core machines with 4 to 8 GB of RAM and ECC memory are ideal for running Hadoop operations. However, the hardware configuration varies based on the project-specific workflow and process flow, and needs to be customized accordingly.

16
Q

What happens when two users try to access the same file in the HDFS?

A

HDFS supports exclusive writes only. Hence, only the first user will receive the grant (write lease) for file access, and the second user's request will be rejected.

17
Q

How to recover a NameNode when it is down?

A

Use the FsImage, which is the file system metadata replica, to start a new NameNode.
Configure the DataNodes and the clients so that they acknowledge the newly started NameNode.
Once the new NameNode has finished loading the last checkpoint FsImage and has received enough block reports from the DataNodes, it will start serving clients.

18
Q

What do you understand by Rack Awareness in Hadoop?

A

It is an algorithm applied by the NameNode to decide how blocks and their replicas are placed. Depending on rack definitions, replicas are placed so that network traffic between DataNodes on different racks is minimized. For example, with a replication factor of 3, two copies are placed on one rack and the third copy on a separate rack.
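Rack definitions are usually supplied to Hadoop through a topology script referenced in core-site.xml; the property name below is the Hadoop 2 form and the script path is illustrative:

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>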

19
Q

What is the difference between “HDFS Block” and “Input Split”?

A

An HDFS Block is the physical division of the data into blocks for storage and processing in HDFS.

An Input Split is the logical division of the data that the MapReduce framework assigns to an individual mapper for the map operation.

20
Q

Explain the different modes in which Hadoop run.

A

Standalone (Local) Mode – By default, Hadoop runs in local mode, i.e., on a single, non-distributed node. This mode uses the local file system to perform input and output operations. It does not use HDFS, so it is mainly used for debugging, and no custom changes to the configuration files are needed.
Pseudo-Distributed Mode – Hadoop runs on a single node just as in standalone mode, but each daemon runs in a separate Java process. Since all the daemons run on a single node, the same node acts as both master and slave (see the configuration sketch after this list).
Fully Distributed Mode – All the daemons run on separate individual nodes, forming a multi-node cluster, with different nodes serving as master and slaves.
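As a minimal sketch, pseudo-distributed mode is typically enabled by pointing fs.defaultFS at a local HDFS instance in core-site.xml and setting the replication factor to 1 in hdfs-site.xml; the port shown is only a common convention:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>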

21
Q

What are the Hadoop components?

A

HDFS, YARN, and MapReduce

22
Q

What are the configuration parameters in a “MapReduce” program?

A

Input locations of Jobs in the distributed file system
Output location of Jobs in the distributed file system
The input format of data
The output format of data
The class which contains the map function
The class which contains the reduce function
JAR file which contains the mapper, reducer and the driver classes

23
Q

What is a block in HDFS and what is its default size in Hadoop 1 and Hadoop 2? Can we change the block size?

A

Blocks are the smallest continuous units of data storage on a hard drive. In HDFS, blocks are distributed across the Hadoop cluster.

The default block size in Hadoop 1 is: 64 MB
The default block size in Hadoop 2 is: 128 MB
Yes, we can change the block size using the dfs.block.size parameter (dfs.blocksize in Hadoop 2) located in the hdfs-site.xml file.
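For example, a 256 MB block size could be configured in hdfs-site.xml as follows (the value is illustrative; dfs.blocksize is the non-deprecated property name in Hadoop 2):

<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>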

24
Q

Explain JobTracker in Hadoop

A

JobTracker is a JVM process in Hadoop used to submit and track MapReduce jobs.

JobTracker performs the following activities in Hadoop, in sequence:

JobTracker receives the jobs that a client application submits.
JobTracker communicates with the NameNode to determine the location of the data.
JobTracker allocates TaskTracker nodes based on available slots.
It submits the work to the allocated TaskTracker nodes.
JobTracker monitors the TaskTracker nodes.
When a task fails, JobTracker is notified and decides how to reallocate the task.

25
Q

What are the different configuration files in Hadoop?

A

core-site.xml – This configuration file contains the Hadoop core configuration settings, for example, the I/O settings common to MapReduce and HDFS. It specifies the hostname and port of the default file system.

mapred-site.xml – This configuration file specifies a framework name for MapReduce by setting mapreduce.framework.name

hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It also specifies default block permission and replication checking on HDFS.

yarn-site.xml – This configuration file specifies configuration settings for ResourceManager and NodeManager.
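For instance, mapred-site.xml commonly contains an entry like the following to run MapReduce on YARN:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>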

26
Q

How can you achieve security in Hadoop?

A

Security in Hadoop is achieved with Kerberos, which involves three steps:

Authentication – The first step is authentication of the client against the authentication server, which then provides a time-stamped TGT (Ticket-Granting Ticket) to the client.
Authorization – In this step, the client uses the received TGT to request a service ticket from the TGS (Ticket-Granting Server).
Service Request – This is the final step to achieve security in Hadoop. The client uses the service ticket to authenticate itself to the server.
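As a sketch, on a Kerberos-secured cluster a user typically obtains the TGT with kinit before issuing HDFS commands; the principal and path below are illustrative:

$ kinit analyst@EXAMPLE.COM
$ hdfs dfs -ls /user/analyst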

27
Q

What is the syntax you use to run a MapReduce program?

A

$ hadoop jar <jar_file>.jar <main_class> /input_path /output_path
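A concrete example, where the JAR name, main class, and paths are purely illustrative:

$ hadoop jar wordcount.jar WordCount /user/data/input /user/data/output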

28
Q

What are the different file permissions in HDFS for files or directory levels?

A
There are three user levels in HDFS:
Owner
Group
Others

For each user level, there are three permissions:
read (r)
write (w)
execute (x)
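Permissions can be inspected and changed with the standard HDFS shell commands; the path and mode below are illustrative:

$ hdfs dfs -ls /user/data
$ hdfs dfs -chmod 750 /user/data
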
29
Q

How to restart all the daemons in Hadoop?

A

To restart all the daemons, you must first stop all of them. The Hadoop installation directory contains an sbin directory that stores the scripts used to stop and start daemons in Hadoop.

Use the /sbin/stop-all.sh command to stop all the daemons and then use the /sbin/start-all.sh command to start them all again.
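In recent Hadoop versions the all-in-one scripts are deprecated in favour of the per-service scripts, which can be used instead:

$ ./sbin/stop-dfs.sh
$ ./sbin/stop-yarn.sh
$ ./sbin/start-dfs.sh
$ ./sbin/start-yarn.sh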

30
Q

What is the use of jps command in Hadoop?

A

The jps command is used to check whether the Hadoop daemons are running properly. It lists all the Java daemons running on a machine, i.e., DataNode, NameNode, NodeManager, ResourceManager, etc.
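Illustrative output on a healthy single-node setup (process IDs will differ):

4821 NameNode
4933 DataNode
5102 SecondaryNameNode
5267 ResourceManager
5378 NodeManager
5590 Jps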

31
Q

Explain the process that overwrites the replication factors in HDFS.

A

There are two methods to overwrite the replication factor in HDFS:

Method 1: On File Basis

In this method, the replication factor is changed on a per-file basis using the Hadoop FS shell. The command used for this is:

$ hadoop fs -setrep -w 2 /my/test_file

Here, test_file is the file whose replication factor will be set to 2.

Method 2: On Directory Basis

In this method, the replication factor is changed on a directory basis, i.e., the replication factor for all the files under a given directory is modified.

$ hadoop fs -setrep -w 5 /my/test_dir

Here, test_dir is the name of the directory; the replication factor for the directory and all the files in it will be set to 5.
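To verify the change, hadoop fs -ls can be used; for a file, the second column of its output shows the current replication factor (the path is illustrative):

$ hadoop fs -ls /my/test_file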

32
Q

What will happen with a NameNode that doesn’t have any data?

A

A NameNode without any data does not exist in Hadoop. If there is a NameNode, it will contain some data; otherwise it will not exist.

33
Q

Explain NameNode recovery process.

A

The NameNode recovery process involves the following steps to get the Hadoop cluster running:

In the first step of the recovery process, a new NameNode is started using the file system metadata replica (FsImage).
The next step is to configure the DataNodes and clients so that they acknowledge the new NameNode.
In the final step, the new NameNode starts serving clients once it has finished loading the last checkpoint FsImage and has received enough block reports from the DataNodes.
Note: Don't forget to mention that this NameNode recovery process consumes a lot of time on large Hadoop clusters, which makes routine maintenance difficult. For this reason, the HDFS high-availability architecture is recommended.

34
Q

How is the Hadoop CLASSPATH essential to start or stop Hadoop daemons?

A

CLASSPATH includes the necessary directories that contain the jar files needed to start or stop Hadoop daemons. Hence, setting the CLASSPATH is essential to start or stop Hadoop daemons.

However, setting up the CLASSPATH manually every time is not standard practice. Usually, the CLASSPATH is set inside the /etc/hadoop/hadoop-env.sh file, so it is loaded automatically whenever Hadoop runs.
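A sketch of the kind of entry typically found in hadoop-env.sh, with an illustrative path for additional jars:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/hadoop/extra-libs/*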

35
Q

Why is HDFS only suitable for large data sets and not the correct tool to use for many small files?

A

This is due to the performance limitations of the NameNode. The NameNode keeps the metadata for every file, directory, and block in HDFS in memory, so a huge number of small files consumes a disproportionate amount of NameNode memory relative to the data actually stored. HDFS is therefore optimized for a smaller number of large files rather than for many small ones.

36
Q

Why do we need Data Locality in Hadoop? Explain.

A

Datasets in HDFS are stored as blocks on the DataNodes of the Hadoop cluster. During the execution of a MapReduce job, each Mapper processes a block (Input Split). If the data does not reside on the node where the Mapper is running, it has to be copied over the network from the DataNode that holds it to the Mapper's node.

If a MapReduce job has more than 100 Mappers and each one tries to copy data from another DataNode in the cluster simultaneously, serious network congestion results, which is a big performance issue for the overall system. Hence, keeping the computation close to the data is an effective and cost-effective solution, technically termed data locality in Hadoop. It helps to increase the overall throughput of the system. Data locality comes in three categories:
Data local – The data and the Mapper reside on the same node. This is the closest proximity of data and the most preferred scenario.
Rack local – The Mapper and the data reside on the same rack but on different DataNodes.
Different rack – The Mapper and the data reside on different racks.

37
Q

If DFS can handle a large volume of data, why do we need the Hadoop framework?

A

Hadoop is not only for storing large data but also for processing it. Though a DFS (Distributed File System) can also store the data, it lacks the following features:

It is not fault tolerant
Data movement over a network depends on bandwidth.

38
Q

What is SequenceFileInputFormat?

A

Hadoop uses a specific file format known as a sequence file, which stores data as serialized key-value pairs. SequenceFileInputFormat is an input format for reading sequence files.