Module 2 Flashcards
What types of data can Hadoop process?
Structured (e.g. tabular data), semi-structured (e.g. JSON, XML) and unstructured (e.g. images and videos)
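A minimal sketch of the structured vs. semi-structured distinction, using a CSV row (fixed schema) and a JSON record (self-describing, flexible fields) as illustrative examples:

```python
import csv
import io
import json

# Structured: fixed schema, tabular layout (every row has the same columns)
structured = list(csv.DictReader(io.StringIO("id,name\n1,Ada\n")))

# Semi-structured: self-describing keys, fields can vary per record
semi = json.loads('{"id": 1, "name": "Ada", "tags": ["admin"]}')

print(structured[0]["name"])  # Ada
print(semi["tags"])           # ['admin']
```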
What are the other three modules that constitute the Apache Hadoop project besides Hadoop Common?
- Hadoop Distributed File System (HDFS)
- MapReduce
- YARN (Yet Another Resource Negotiator)
List some shortcomings of Hadoop
Hadoop is not suitable:
* when low-latency access to data is needed (e.g. trading, online gaming, VoIP)
* when instructions cannot be parallelized
* when there are dependencies within the data, i.e. one record must be processed after another
* for processing transaction operations (OLTP)
* for processing lots of small files
* for intensive calculations with small data
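A toy Python sketch of why the parallelization points above matter: an order-independent aggregation (a sum) can be split across chunks and, in principle, across nodes, while a computation where each record depends on the previous one (here, a running balance, chosen as an illustrative example) must stay sequential:

```python
# Parallelizable: each chunk can be summed independently, then combined
def parallel_sum(chunks):
    return sum(sum(chunk) for chunk in chunks)  # partial sums could run on separate nodes

# Not parallelizable: each output depends on the previous record's result
def running_balance(transactions, start=0):
    balances = []
    balance = start
    for amount in transactions:
        balance += amount  # record N needs the result of record N-1
        balances.append(balance)
    return balances

print(parallel_sum([[1, 2], [3, 4]]))   # 10
print(running_balance([100, -30, 50]))  # [100, 70, 120]
```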
Explain briefly the purpose of HDFS, YARN and MapReduce
HDFS: Handles storage of big data
YARN: allocates cluster resources (RAM and CPU) for processing data in batch, stream, interactive or graph workloads
MapReduce: processing unit that breaks data down into smaller chunks and processes them in parallel
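The MapReduce idea can be sketched with a plain-Python word count, a hypothetical single-machine example (a real job would run on Java mappers and reducers distributed across the cluster): map emits key/value pairs per chunk, shuffle groups values by key, and reduce aggregates each group:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) pairs for each word in a chunk of text."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key across all mapper outputs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}

# The input is split into chunks, much as HDFS splits files into blocks;
# each chunk could be mapped in parallel on a different node.
chunks = ["big data big", "data processing"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 2, 'processing': 1}
```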
What is the Hadoop ecosystem?
The ecosystem is a collection of libraries and software, in addition to the original Hadoop core, created to overcome shortcomings of Hadoop or to support the processing of big data.
These libraries and tools are intended to work together across the stages of big data processing (ingest, store, process and access).
What are the main stages of big data processing?
- Ingest data
- Store data
- Process and analyze data
- Access data