Revature Hadoop 2 Flashcards
Explain the significance of Mapper[LongWritable, Text, Text, IntWritable] and Reducer[Text, IntWritable, Text, IntWritable]
The Mapper reads input key-value pairs (LongWritable, Text) and outputs intermediate key-value pairs (Text, IntWritable). The Reducer takes these intermediate key-value pairs and outputs final results (Text, IntWritable).
What needs to be true about the types contained in the above generics?
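Every key and value type must be a Hadoop Writable, and key types must also be WritableComparable so they can be sorted. The Mapper's input types must match what the InputFormat produces, and the Mapper's output types (Text, IntWritable) must match the Reducer's input types.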
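A minimal sketch of how the four type parameters chain together, written as a plain-Python word count rather than against the real Hadoop API. The names map_fn, reduce_fn, and run_job are hypothetical, and Python int and str stand in for LongWritable/IntWritable and Text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: (LongWritable offset, Text line) -> (Text word, IntWritable 1)."""
    for word in line.split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Reducer: (Text word, [IntWritable]) -> (Text word, IntWritable total)."""
    yield word, sum(counts)


def run_job(lines: List[str]) -> Dict[str, int]:
    """Simulate map -> shuffle (group by key) -> reduce in one process."""
    groups: Dict[str, List[int]] = defaultdict(list)
    offset = 0
    for line in lines:
        for key, value in map_fn(offset, line):
            groups[key].append(value)  # mapper output feeds reducer input
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline byte

    results: Dict[str, int] = {}
    for key in sorted(groups):  # keys reach the reducer in sorted order
        for out_key, out_value in reduce_fn(key, groups[key]):
            results[out_key] = out_value
    return results
```

The value type emitted by map_fn is the value type consumed by reduce_fn; if the two disagreed, the job would fail at the shuffle boundary.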
What are the 3 Vs of big data?
Volume, Velocity, Variety.
What are some examples of structured data? Unstructured data?
Structured: Relational databases, CSV files. Unstructured: Images, videos, emails.
What is a daemon?
A background process that runs continuously to provide services or perform tasks (e.g., NameNode, DataNode in Hadoop).
What is data locality and why is it important?
Data locality means running the computation on, or as close as possible to, the node that already stores the data, rather than moving the data to the computation. It matters because shipping code is far cheaper than transferring large blocks across the network, so it cuts network traffic and improves job throughput.
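A simplified sketch of a locality-aware scheduling decision. The function pick_node and its labels are hypothetical, and a real scheduler weighs many more factors; the sketch only shows the preference order: a node holding a replica first, then a node on the same rack as a replica, then any free node.

```python
from typing import Dict, List, Optional, Tuple


def pick_node(
    replica_nodes: List[str],
    free_nodes: List[str],
    rack_of: Dict[str, str],
) -> Tuple[Optional[str], str]:
    """Choose where to run a task that reads one block."""
    replicas = set(replica_nodes)
    replica_racks = {rack_of[n] for n in replicas}

    # Best case: a free node already stores a replica of the block.
    for node in free_nodes:
        if node in replicas:
            return node, "node-local"

    # Next best: a free node on the same rack as some replica.
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"

    # Otherwise the block must cross racks to reach the task.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"
```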
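How many blocks does a 200MB file occupy in HDFS, assuming the default block size for Hadoop v2+?
2 blocks. The default block size is 128MB, so the file is split into one full 128MB block and one 72MB block; the last block only takes up as much space as the data it holds.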
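The block count can be checked with a few lines of arithmetic. This is a sketch only: split_into_blocks is a hypothetical helper, not an HDFS API.

```python
from typing import List

MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = 128 * MB) -> List[int]:
    """Return the size in bytes of each block a file would be split into."""
    if file_size <= 0:
        return []
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the final block is only as big as its data
    return sizes


blocks = split_into_blocks(200 * MB)
print(len(blocks), [size // MB for size in blocks])
```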
What is the default number of replicas for each block?
3
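How are these replicas typically distributed across the cluster? What is rack awareness?
Replicas are stored on different nodes and on more than one rack for fault tolerance. With the default placement policy and a replication factor of 3, the first replica goes on the writer's node (if the writer is a DataNode), the second on a node in a different rack, and the third on a different node in that same remote rack. Rack awareness is the NameNode's knowledge of which rack each DataNode belongs to, which lets it place replicas so that a single rack failure cannot destroy every copy of a block.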
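A deterministic sketch of rack-aware placement for three replicas: local node, then a node on another rack, then a second node on that other rack. It assumes the writer is a DataNode and that some other rack has at least two nodes. The function place_replicas is hypothetical, and real HDFS chooses among eligible nodes randomly and considers load.

```python
from typing import Dict, List


def place_replicas(topology: Dict[str, List[str]], writer_node: str) -> List[str]:
    """Pick three nodes for a block, given a rack -> nodes mapping."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Replica 1: the node the client is writing from.
    first = writer_node

    # Replica 2: a node on a different rack (first eligible rack, sorted).
    remote_rack = next(
        rack for rack in sorted(topology)
        if rack != local_rack and len(topology[rack]) >= 2
    )
    remote_nodes = sorted(topology[remote_rack])
    second = remote_nodes[0]

    # Replica 3: a different node on the same rack as replica 2.
    third = remote_nodes[1]

    return [first, second, third]
```

With this layout the block survives the loss of either rack, while only one of the two pipeline hops crosses racks.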
What is the job of the NameNode? What about the DataNode?
NameNode manages metadata and file system structure. DataNode stores actual data blocks and performs read/write operations.
How many NameNodes exist on a cluster?
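One active NameNode per namespace. High-availability setups add a standby NameNode (Hadoop 3 allows more than one standby), and HDFS Federation runs several independent NameNodes, each with its own namespace.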
How are DataNodes fault tolerant?
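Each block is replicated on multiple DataNodes. DataNodes send regular heartbeats to the NameNode; when one stops responding, the NameNode schedules new copies of its blocks on other DataNodes to restore the replication factor.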
How does a Standby NameNode make the NameNode fault tolerant?
It keeps a synchronized copy of the active NameNode's metadata, typically by reading a shared edit log stored on JournalNodes, and can take over if the active NameNode fails.
What purpose does a Secondary NameNode serve?
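It periodically merges the NameNode's fsimage (the metadata snapshot) with the edit log to produce a new checkpoint, which keeps the edit log small and shortens NameNode restart time. Despite its name, it is not a standby and cannot take over for a failed NameNode.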
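The checkpoint merge can be illustrated with a toy model in which the fsimage is a dict from path to block list and the edit log is a list of operations. The checkpoint function and the three operation names are hypothetical; the real fsimage and edit log are binary formats with many more operation types.

```python
from typing import Dict, List, Tuple


def checkpoint(
    fsimage: Dict[str, List[str]],
    edit_log: List[tuple],
) -> Tuple[Dict[str, List[str]], List[tuple]]:
    """Replay the edit log onto a copy of the fsimage.

    Returns the new fsimage and an empty edit log.
    """
    merged = dict(fsimage)  # leave the old snapshot untouched
    for op, path, *args in edit_log:
        if op == "create":
            merged[path] = args[0]  # args[0] is the file's block list
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[args[0]] = merged.pop(path)  # args[0] is the new path
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return merged, []
```

After a checkpoint, a restarting NameNode loads the merged snapshot and replays only the edits written since, which is why restart time drops.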
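How might we scale an HDFS cluster past a few thousand machines?
HDFS Federation: multiple independent NameNodes, each managing its own namespace, share the same pool of DataNodes. This removes the single NameNode as the metadata bottleneck for large clusters with tens of thousands of machines.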
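One way to picture a federated namespace is a client-side mount table that routes each path to the NameNode owning the longest matching prefix. The sketch below is only an illustration of that idea: resolve_namenode, the mount table contents, and the NameNode names are all hypothetical.

```python
from typing import Dict


def resolve_namenode(path: str, mount_table: Dict[str, str]) -> str:
    """Return the NameNode responsible for a path (longest prefix wins)."""
    best = None
    for prefix in mount_table:
        # Match whole path components only, so "/data" does not match "/database".
        is_match = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if is_match and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is None:
        raise KeyError(f"no NameNode is mounted for {path}")
    return mount_table[best]


# Hypothetical layout: three NameNodes, each owning part of the namespace.
mount_table = {"/user": "nn1", "/data": "nn2", "/data/logs": "nn3"}
```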