Module 2 Flashcards

1
Q

What types of data can Hadoop process?

A

Structured (e.g. tabular data), semi-structured (e.g. JSON or XML files), and unstructured (e.g. images and videos)

2
Q

What are the other three modules that constitute the Apache Hadoop project besides Hadoop Common?

A
  • Hadoop Distributed File System (HDFS)
  • MapReduce
  • YARN (Yet Another Resource Negotiator)
3
Q

List some shortcomings of Hadoop

A

Hadoop is not suitable:
* when low-latency access to data is needed (e.g. trading, online gaming, VoIP)
* when instructions cannot be parallelized
* when there are dependencies within the data, i.e. one record must be processed after another
* for processing transactional operations (OLTP)
* for processing large numbers of small files
* for intensive calculations on small amounts of data

4
Q

Explain briefly the purpose of HDFS, YARN and MapReduce

A

HDFS: handles the distributed storage of big data
YARN: allocates RAM and CPU for processing data in batch, stream, interactive or graph workloads
MapReduce: the processing unit that breaks data into smaller chunks and processes them in parallel
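
To make the MapReduce part concrete, below is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. It is not part of the original card; the class names WordCount, TokenizerMapper and IntSumReducer are illustrative. The mapper emits a (word, 1) pair for every word it reads, and the reducer sums those pairs per word; Hadoop runs many mapper and reducer tasks in parallel across the cluster.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: splits each input line into words and emits (word, 1)
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a JAR, such a job would typically be submitted with hadoop jar wordcount.jar WordCount /input /output, where the JAR name and the two HDFS paths are placeholders.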

5
Q

What is the Hadoop ecosystem?

A

The ecosystem is a collection of libraries and tools, beyond the original Hadoop core, created to overcome Hadoop's shortcomings or to support the processing of big data.

These libraries and tools are intended to work together across the stages of big data processing (ingest, store, process and access).

6
Q

What are the main stages of big data processing?

A
  1. Ingest data
  2. Store data
  3. Process and analyze data
  4. Access data