Top 50 Data Architecture Questions Flashcards
What is big data?
Big Data is a term associated with complex and large datasets. A relational database cannot handle big data, and that’s why special tools and methods are used to perform operations on a vast collection of data. Big data enables companies to understand their business better and helps them derive meaningful information from the unstructured and raw data collected on a regular basis. Big data also allows the companies to take better business decisions backed by data.
What are the five V’s of Big Data?
Volume – Volume represents the volume i.e. amount of data that is growing at a high rate i.e. data volume in Petabytes
Velocity – Velocity is the rate at which data grows. Social media contributes a major role in the velocity of growing data.
Variety – Variety refers to the different data types i.e. various data formats like text, audios, videos, etc.
Veracity – Veracity refers to the uncertainty of available data. Veracity arises due to the high volume of data that brings incompleteness and inconsistency.
Value –Value refers to turning data into value. By turning accessed big data into values, businesses may generate revenue.
Tell us how big data and Hadoop are related to each other.
Big data and Hadoop are almost synonyms terms. With the rise of big data, Hadoop, a framework that specializes in big data operations also became popular. The framework can be used by professionals to analyze big data and help businesses to make decisions.
How is big data analysis helpful in increasing business revenue?
sentimental analysis, predictive analytics, prescriptive analytics
Define respective components of HDFS and YARN
NameNode – This is the master node for processing metadata information for data blocks within the HDFS
DataNode/Slave node – This is the node which acts as slave node to store the data, for processing and use by the NameNode
The two main components of YARN are–
ResourceManager– This component receives processing requests and accordingly allocates to respective NodeManagers depending on processing needs.
NodeManager– It executes tasks on each single Data Node
Why is Hadoop used for Big Data Analytics?
Answer: Since data analysis has become one of the key parameters of business, hence, enterprises are dealing with massive amount of structured, unstructured and semi-structured data. Analyzing unstructured data is quite difficult where Hadoop takes major part with its capabilities of
Storage
Processing
Data collection
Moreover, Hadoop is open source and runs on commodity hardware. Hence it is a cost-benefit solution for businesses.
What is fsck?
fsck stands for File System Check. It is a command used by HDFS. This command is used to check inconsistencies and if there is any problem in the file. For example, if there are any missing blocks for a file, HDFS gets notified through this command.
What are the main differences between NAS (Network-attached storage) and HDFS?
HDFS runs on a cluster of machines while NAS runs on an individual machine. Hence, data redundancy is a common issue in HDFS. On the contrary, the replication protocol is different in case of NAS. Thus the chances of data redundancy are much less.
Data is stored as data blocks in local drives in case of HDFS. In case of NAS, it is stored in dedicated hardware.
What is the Command to format the NameNode?
$ hdfs namenode -format
Do you have any Big Data experience? If so, please share it with us.
There is no specific answer to the question as it is a subjective question and the answer depends on your previous experience. Asking this question during a big data interview, the interviewer wants to understand your previous experience and is also trying to evaluate if you are fit for the project requirement.
So, how will you approach the question? If you have previous experience, start with your duties in your past position and slowly add details to the conversation. Tell them about your contributions that made the project successful. This question is generally, the 2nd or 3rd question asked in an interview. The later questions are based on this question, so answer it carefully. You should also take care not to go overboard with a single aspect of your previous job. Keep it simple and to the point.
Do you prefer good data or good models? Why?
Not mutually exclusive
Will you optimize algorithms or code to make them run faster?
Not mutually exclusive
How do you approach data preparation?
As you already know, data preparation is required to get necessary data which can then further be used for modeling purposes. You should convey this message to the interviewer. You should also emphasize the type of model you are going to use and reasons behind choosing that particular model. Last, but not the least, you should also discuss important data preparation terms such as transforming variables, outlier values, unstructured data, identifying gaps, and others.
How would you transform unstructured data into structured data?
Unstructured data is very common in big data. The unstructured data should be transformed into structured data to ensure proper data analysis. You can start answering the question by briefly differentiating between the two. Once done, you can now discuss the methods you use to transform one form to another. You might also share the real-world situation where you did it. If you have recently been graduated, then you can share information related to your academic projects.
Which hardware configuration is most beneficial for Hadoop jobs?
Dual processors or core machines with a configuration of 4 / 8 GB RAM and ECC memory is ideal for running Hadoop operations. However, the hardware configuration varies based on the project-specific workflow and process flow and need customization accordingly.