Module 2 Flashcards
What types of data can Hadoop process?
Structured (e.g. tabular data), semi-structured (e.g. JSON, XML) and unstructured (e.g. images and videos)
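A minimal sketch of the structured vs. semi-structured distinction, using a CSV row (fixed schema) and a JSON record (self-describing, flexible fields) as illustrative examples:

```python
import csv
import io
import json

# Structured: fixed schema, tabular layout (every row has the same columns)
structured = list(csv.DictReader(io.StringIO("id,name\n1,Ada\n")))

# Semi-structured: self-describing keys, fields can vary per record
semi = json.loads('{"id": 1, "name": "Ada", "tags": ["admin"]}')

print(structured[0]["name"])  # Ada
print(semi["tags"])           # ['admin']
```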
What are the other three modules that constitute the Apache Hadoop project besides Hadoop Common?
- Hadoop Distributed File System (HDFS)
- MapReduce
- YARN (Yet Another Resource Negotiator)
List some shortcomings of Hadoop
Hadoop is not suitable:
* when low-latency access to data is needed (e.g. trading, online gaming, VoIP)
* when instructions cannot be parallelized
* when there are dependencies within the data, i.e. one record must be processed after another
* for processing transaction operations (OLTP)
* for processing lots of small files
* for intensive calculations with small data
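A toy Python sketch of why the parallelization points above matter: an order-independent aggregation (a sum) can be split across chunks and, in principle, across nodes, while a computation where each record depends on the previous one (here, a running balance, chosen as an illustrative example) must stay sequential:

```python
# Parallelizable: each chunk can be summed independently, then combined
def parallel_sum(chunks):
    return sum(sum(chunk) for chunk in chunks)  # partial sums could run on separate nodes

# Not parallelizable: each output depends on the previous record's result
def running_balance(transactions, start=0):
    balances = []
    balance = start
    for amount in transactions:
        balance += amount  # record N needs the result of record N-1
        balances.append(balance)
    return balances

print(parallel_sum([[1, 2], [3, 4]]))   # 10
print(running_balance([100, -30, 50]))  # [100, 70, 120]
```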
Explain briefly the purpose of HDFS, YARN and MapReduce
HDFS: Handles storage of big data
YARN: allocates cluster resources (RAM and CPU) for processing data in batch, stream, interactive or graph workloads
MapReduce: processing unit that breaks data down into smaller chunks and processes them in parallel
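The MapReduce idea can be sketched with a plain-Python word count, a hypothetical single-machine example (a real job would run on Java mappers and reducers distributed across the cluster): map emits key/value pairs per chunk, shuffle groups values by key, and reduce aggregates each group:

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit (word, 1) pairs for each word in a chunk of text."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key across all mapper outputs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}

# The input is split into chunks, much as HDFS splits files into blocks;
# each chunk could be mapped in parallel on a different node.
chunks = ["big data big", "data processing"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 2, 'processing': 1}
```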
What is the Hadoop ecosystem?
The ecosystem is a collection of libraries and software, in addition to the original Hadoop core, created to overcome shortcomings of Hadoop or to support the processing of big data.
These libraries and tools are intended to work together across the stages of big data processing (ingest, store, process and access).
What are the main stages of big data processing?
- Ingest data
- Store data
- Process and analyze data
- Access data