Module 2: The Hadoop Ecosystem Flashcards

1
Q

Motivations behind Hadoop

A

Storage challenge, hardware failure, and correctly combining parts of a dataset for analysis

2
Q

Storage challenge and solution

A

Storage capacities of hard drives have increased massively, but the rate at which data can be read from them has not kept up (since 1990, storage capacity has grown roughly 730x, while access speed has grown only about 23x). This means it takes a long time to read all the data on a single drive, and writing is even slower.

Solution: read from multiple disks in parallel! This hugely decreases the read time. A possible criticism is that this leads to poor utilization (you need many hard drives instead of just one). The answer is to store multiple datasets across the same set of drives and provide a shared environment so that everyone can reach them. This is fine because not everyone runs their analyses at the same time.

3
Q

Hardware failure and solution

A

With an increasing number of hard drives or compute nodes, the risk of one of them failing increases (and becomes fairly likely), resulting in data loss.

Solution: data replication, storing redundant copies

4
Q

How can we correctly combine parts of a dataset for analysis?

A

MapReduce! It abstracts disk reads and writes into computation over sets of keys and values, with built-in reliability.

5
Q

Hadoop

A

Provides a reliable, scalable platform for storage and analysis

It differs from other systems. Regarding access, traditional systems support both interactive and batch access, while MapReduce is batch only.

Regarding scaling, traditional systems scale nonlinearly, while MapReduce scales linearly.

6
Q

MapReduce

A

Allows for ad-hoc queries against big datasets with a reasonable response time, a transformative ability.

Provides a programming model that simplifies parallel computing over HDFS.
Runs on top of YARN.
- Google initially used MapReduce for indexing websites.

Abstracts away the intricacies of parallel programming on large clusters, so you only need to decompose your problem into a map and a reduce step.

Based on functional programming: a map and a reduce phase.

Step 1: Map phase
generate key-value pairs

Step 2: Sort and shuffle (a step, but already implemented by the framework)
all pairs with the same key are moved to the same node

Step 3: Reduce phase
aggregate the values for each key

You perform these steps in order to obtain data-parallel scalability, and in many cases you can even have parallelization within each step.
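
To make the three phases concrete, here is a minimal word-count sketch in plain Python (no Hadoop involved; all names are illustrative) that mimics the map, shuffle, and reduce steps:

from collections import defaultdict

def map_phase(line):
    # Map: emit one (word, 1) key-value pair per word in the input record.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Sort and shuffle: group all values that share a key
    # (in real MapReduce the framework does this between the phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the values for one key.
    return key, sum(values)

records = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}

Note that each record can be mapped independently and each key can be reduced independently, which is exactly where the data-parallel scalability comes from.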

7
Q

Grid computing

A

Distributes jobs across a cluster of machines with a shared filesystem and connected by a storage area network (SAN).

It works well for compute-intensive jobs but struggles when nodes need to access large data volumes: network bandwidth becomes the bottleneck and compute nodes sit idle.

8
Q

Grid computing in Hadoop

A

Hadoop moves computation to data, also known as data locality.

Hadoop operates at a high level of abstraction, i.e. programmers can ignore the mechanics of data flow and potential failed processes (the framework is aware of failing nodes).

9
Q

Volunteer computing

A

A form of grid computing where volunteers donate CPU time on their personal computers to solve big problems that they care about. This approach makes sense because we rarely use the full CPU power of our machines (they are often idle).

So basically, problems are decomposed into work units and sent to computers around the world for analysis, and volunteers donate their CPU time.

10
Q

Example projects that use volunteer computing

A
  • SETI@home: analyzes radio telescope data for signs of intelligent life
  • Folding@home: seeks to understand protein folding and how it relates to disease
  • the Great Internet Mersenne Prime Search: searches for large prime numbers
11
Q

Volunteer computing in Hadoop vs other systems

A

Hadoop runs on trusted dedicated hardware in a single data center with big bandwidth and data locality.

Other systems: volunteer computing runs on untrusted machines on the internet, with variable connection speeds and no data locality.

12
Q

Goals of Hadoop

A

Enable scalability (probably most important):
for storing large volumes of data on affordable commodity hardware. It should also be possible to scale out at any time to handle increasing amounts of data and computation (scaling out is simple).

Handle fault tolerance:
Failures happen, especially with large clusters, so you need to be prepared. Hadoop provides fault tolerance through data replication, i.e. storing redundant copies of the data. Replicas should ideally be placed on different racks, in case an entire rack goes down.

Optimized for heterogeneous data:
This is in contrast to RDBMSs, which are optimized for highly structured, tabular data. Big data tends to come in many different shapes and forms (e.g. text, stream, tabular). Hadoop deals with several data types by having knowledge of their structure (e.g. it is important not to split text data arbitrarily). There are also various projects in the Hadoop ecosystem dedicated to specific data types (e.g. projects for handling streaming data or graph data).

Facilitate a shared environment:
Again, this concerns resource utilization. You want to allow multiple jobs to run in parallel on the cluster (you don't want lots of nodes sitting idle). Data replication therefore not only provides fault tolerance but also allows simultaneous access to the same data for different jobs.

Provide value:
Hadoop should provide value for your organization. It provides an ecosystem that includes many open source projects, so you can choose a fitting module. Hadoop is also supported by a large and active community and is continually improved. It is free to use, easy to use, and easy to find support for.

13
Q

How to organize all the open source projects of Hadoop?

A

One example: a stack or layer diagram.
At the lower levels you have modules that deal with data storage and the scheduling of jobs across the cluster, while at the higher levels you have modules for increasingly interactive applications. Some modules span both levels, and some are more specialized.

14
Q

HDFS

A

Hadoop Distributed File System. Core component of ecosystem. Provides customized reading for handling a variety of data types.
- the file format is specified when you read and write to HDFS, so it can determine how to split and distribute the data across the cluster

But Hadoop has an abstract notion of a filesystem, of which HDFS is just one possible implementation. Which one to use can be configured by the user, but with large data volumes, choose a distributed one like HDFS.

15
Q

HDFS key capabilities

A
  • Scalability: can store massive datasets. Uses partitioning.
  • Reliability: using data replication
16
Q

HDFS key components

A
  • NameNode for metadata: the master node, used for coordinating operations on the cluster. Typically there is one NameNode per cluster.
  • DataNode for block storage: the slave nodes. They listen to the NameNode for instructions to create and delete blocks and to perform replication. Typically there is one DataNode per machine. (See the toy sketch below.)
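
The split of responsibilities can be illustrated with a toy, purely hypothetical Python model (not real HDFS code; the class names, block size, and placement policy are made up): the NameNode only tracks which DataNodes hold which blocks, while the DataNodes hold the block contents.

import itertools

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> raw bytes (block storage)

class NameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.metadata = {}  # filename -> list of (block_id, [datanode names])
        self._ids = itertools.count()

    def write(self, filename, data, block_size=4):
        # Split the file into blocks and place each block on `replication` DataNodes.
        placements = []
        for i in range(0, len(data), block_size):
            block_id = next(self._ids)
            targets = [self.datanodes[(block_id + j) % len(self.datanodes)]
                       for j in range(self.replication)]
            for dn in targets:
                dn.blocks[block_id] = data[i:i + block_size]
            placements.append((block_id, [dn.name for dn in targets]))
        self.metadata[filename] = placements

nodes = [DataNode(f"dn{i}") for i in range(4)]
nn = NameNode(nodes, replication=3)
nn.write("logs.txt", b"abcdefghij")
print(nn.metadata["logs.txt"])  # each block is listed with the 3 DataNodes holding a copy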
17
Q

YARN

A

Yet Another Resource Negotiator. Used for cluster resource management. It works on top of HDFS, interacts with applications, and schedules resources for their use in the cluster.

  • able to run multiple applications over HDFS
  • with YARN, resources are used very efficiently; it allocates resources to jobs in an efficient manner
  • allows you to go beyond MapReduce

YARN allowed Hadoop to truly evolve into an ecosystem. It allows many applications to be easily integrated and to run on top of HDFS.

YARN has a Resource Manager (RM), typically one per cluster, which allocates the resources.

It also has Node Managers (NM), typically one per node, which launch and monitor containers. A container then executes an application-specific process with a constrained set of resources (memory, CPU, etc.).

18
Q

But how exactly should resources be allocated by YARN to applications?

A

YARN uses a scheduler that allocates resources to applications according to a defined policy. This is a difficult task and there is no single best policy, so YARN provides a choice of schedulers, which can also be configured.

19
Q

YARN Schedulers

A

FIFO: applications are queued and run in order of submission. All resources are given to the job at the head of the queue. A simple policy, but not suitable for a shared cluster, because other jobs may have to wait a long time before they run.

Capacity scheduler: maintains a separate queue for smaller jobs, so most of the resources are dedicated to large jobs while some are reserved for small jobs (so they complete quickly). This, however, leads to worse overall cluster utilization: when one of the queues has no jobs, part of the cluster sits idle (which goes against our goal).

Fair scheduler: dynamically balances resources. When one job is launched, it gets all the resources. When a second job is launched, they share the resources equally until one of them finishes. This gives high cluster utilization as well as timely completion of small jobs. You can configure what fairness should be based on (i.e. which resource).

20
Q

higher level programming models

A

On top of MapReduce there are higher-level programming models intended to simplify running MapReduce-like jobs, because writing raw MapReduce is not always straightforward.

Pig: data flow scripting
- Pig Latin is used to define data transformations, which are translated into MapReduce jobs
- Good for exploring large datasets; supports joins

Hive: framework for data warehousing on top of Hadoop

  • Provides an SQL-like query language (HiveQL)
  • Integrates well with business intelligence tools
21
Q

Module for graph analytics

A

On top of YARN. E.g. Giraph, for efficient processing of large-scale graphs.

22
Q

Modules for real time, in memory processing

A

On top of YARN. E.g. Storm, Spark, Flink.

23
Q

NoSQL for Specialized Data Models

A

Can be fitted into the ecosystem but can often also run in isolation from HDFS.
E.g. Cassandra, MongoDB, and HBase (which is made to run on top of HDFS).

24
Q

ZooKeeper

A

Manager of Hadoop Modules. Provides tools for building distributed applications that can safely handle partial failures.

centralized management system for:

  • synchronization
  • configuration
  • providing high availability
25
Q

Open-source projects

A

Free to use, with a big community. You can mix and match different components of the ecosystem to best fit your needs.

26
Q

Pros of Hadoop (when to use)

A

Future anticipated data growth (easy scaling)
Long term availability of data (fault tolerance)
Many platforms over a single data store
High volume, high variety

27
Q

Cons of Hadoop

when not to use

A

Small datasets
Task-level parallelism is not supported (running different tasks on the same data at the same time, as opposed to data-level parallelism, i.e. running the same task on different subsets of the data)
Advanced algorithms that cannot be reduced to map and reduce phases
When using Hadoop would mean replacing your existing infrastructure; this should be evaluated first
When you need random data access (HDFS stores relatively large blocks)
Advanced analytical queries, latency-sensitive tasks, security of sensitive data

28
Q

Programming model for big data

A

A programming model is a set of abstractions of the infrastructure that forms a model of computation. It provides a way to interact with the distributed filesystem without having to know all the details of how it works, making it easier for developers to build applications.

The point of a programming model specifically for big data is to simplify parallel programming.

E.g. MapReduce.

29
Q

Requirements for Big Data programming models

A

Provide programmability on top of distributed file systems:
should enable programmability of operations within the distributed filesystem
it should allow for writing programs on top of the distributed file system
facilitate handling of potential issues

Support Big Data operations:
should support data partitioning
for fast data access
for distribution of computation to nodes
for scheduling of parallel tasks

Handle fault tolerance:
Reliability to handle hardware failure
data replication
file recovery

Enable scaling out:
by adding more compute nodes or racks whenever needed

Be optimized for specific data types:
structured or unstructured
document, stream, graph
should support operations over several of these types, not just one

30
Q

Simplify parallel programming

A

Parallel programming requires a lot of knowledge, e.g. of various synchronization mechanisms; it has a steep learning curve and is error prone.

31
Q

Hadoop streaming

A

Provides an API to MapReduce that allows you to write the map and reduce functions in languages other than Java.
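
As an illustration, a word-count job with Hadoop Streaming might use a mapper and reducer written in Python like the sketch below (hypothetical file names; the framework feeds records on stdin and expects tab-separated key-value lines on stdout, and the reducer receives its input sorted by key):

#!/usr/bin/env python3
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")

#!/usr/bin/env python3
# reducer.py: input arrives sorted by key, so equal keys are adjacent
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")

The job is then typically launched by passing these scripts to the hadoop-streaming jar with the -mapper, -reducer, -input and -output options (the exact jar path depends on your installation).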

32
Q

MapReduce on HDFS

A

Hadoop runs a MapReduce job by dividing it into tasks: map tasks and reduce tasks.
Tasks are scheduled using YARN and run on nodes in the cluster. If a task fails, it is rescheduled by the YARN framework.

Hadoop divides the input into fixed-size splits:
one map task per split, which runs the map function for each record in the split
trade-off between small splits (good load balancing) and many splits (more overhead)
rule of thumb: a good split size = the size of an HDFS block (see the small split-count sketch at the end of this answer)

Hadoop aims to achieve data locality optimization:
by running the map task on the node where the data resides
so there is no need to use valuable cluster bandwidth
optimal split size = block size, because then the map task can run without any data transfer (a block is the largest input size guaranteed to be stored on a single node)

Possible cases:
Case 1: Run map task on node where data resides
Case 2: All nodes with replicas running other map tasks; scheduler looks for free slot on same rack
Case 3: If not possible, use an off-rack node (requires inter-rack network transfer)

Map tasks write their output to local disk, not to HDFS. Why?
it is only intermediate output, processed by the reduce tasks
it is thrown away once the job is done
HDFS storage with replication would be overkill
if the node running the map task fails, Hadoop simply reruns it on another node

Reduce tasks cannot benefit from data locality:
the input to a (single) reduce task is typically the output from all/multiple mappers
sorted map outputs are transferred across the network to the node running the reduce task, where they are merged and passed to the user-defined reduce function
reduce output is typically stored in HDFS for reliability
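
A small back-of-the-envelope sketch of the split count (assuming the common 128 MB default block size; the actual value depends on your configuration):

def num_map_tasks(file_size_mb, split_size_mb=128):
    # One map task per split; with split size = HDFS block size,
    # each map task can usually read its split from a local replica.
    return -(-file_size_mb // split_size_mb)  # ceiling division

print(num_map_tasks(10_000))      # a 10 GB input -> 79 map tasks
print(num_map_tasks(10_000, 64))  # smaller splits -> 157 tasks: better load balancing, more overhead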

33
Q

Tuning MapReduce jobs

A

How many mappers do you use? How long do the mappers run for? As a rule of thumb, each mapper should run for around one minute.
How many reducers?

In some cases it is better to have several MapReduce jobs.

34
Q

Granularity of MapReduce jobs

A

Better to have more and simpler stages, because they lead to more portable and maintainable mappers and reducers.

A mapper often performs input format parsing, projection (selecting relevant fields), and filtering (removing records).
these can be split into distinct mappers
they can be chained together using the ChainMapper library class (see the sketch below)
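
ChainMapper itself is a Java API; purely as a conceptual sketch (all names below are made up), the same "more, simpler stages" idea can be expressed by composing small per-record functions into one mapper:

import json

def parse(record):   # input format parsing
    return json.loads(record)

def project(event):  # projection: keep only the relevant fields
    return {"user": event["user"], "action": event["action"]}

def keep(event):     # filtering: drop uninteresting records
    return event["action"] != "heartbeat"

def chain(*stages):
    # Compose per-record stages into a single mapper-like function.
    def mapper(record):
        value = record
        for stage in stages:
            value = stage(value)
        return value
    return mapper

records = ['{"user": "ada", "action": "login"}',
           '{"user": "bob", "action": "heartbeat"}']
mapper = chain(parse, project)
for event in (mapper(r) for r in records):
    if keep(event):
        print(event)  # {'user': 'ada', 'action': 'login'}

Each stage stays small, testable, and reusable on its own, which is the point of favoring more and simpler stages.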

35
Q

When to use MapReduce

A

It excels at batch analysis that can be decomposed into independent data-parallel tasks.

36
Q

When not to use MapReduce

A

Frequently changing data: it would be slow since it reads the entire input dataset each time

Dependent tasks: computations with dependencies cannot be expressed with MapReduce

Interactive analysis: does not return any results until the entire process is finished

Pitfalls: some tasks are complicated to code, absence of schema and index, difficulty debugging the code, and some tasks are very expensive

37
Q

Apache Pig

A

Use cases: batch analysis, data mining

Goal: reduce development time
Nested data model (much like MongoDB)
query execution
user-defined functions
analytic queries over text files

Procedural language ⇒ developer has control over execution plan (can speed up performance)

Schema is defined at query-time

38
Q

Pig-relation

A

A relation is a bag (more specifically, an outer bag)
A bag is a collection of tuples
A tuple is an ordered set of fields
a field is a piece of data

When we say STORE or DUMP, we trigger execution. (A small sketch of the nested data model follows below.)
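
Not Pig Latin, just a small Python analogy of the nested data model (the data is made up): a relation is an outer bag, a bag is a collection of tuples, a tuple is an ordered set of fields, and a field may itself be nested.

# A relation modeled as an outer bag (here: a list) of tuples.
students = [
    ("alice", 21, [("db", 5), ("ml", 4)]),  # the third field is itself an inner bag of tuples
    ("bob",   19, [("db", 3)]),
]

# In Pig, DUMP or STORE triggers execution; here we just print the relation.
for row in students:
    print(row)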

39
Q

Apache Hive

A

Use cases: Batch analysis, reporting

A data warehouse infrastructure built on top of Hadoop, providing data summarization, querying, and analysis.

Hive provides:
structure
extraction, transformation and load (ETL)
access to different storage
query execution via mapreduce

Key building principles: SQL is a familiar language, performance and extensibility

Data units: db, tables, partitions, buckets

3 levels in DB: table ⇒ partitions ⇒ buckets

Connection to RDBMS
Hive stores its metadata in an RDBMS (Pig does not keep metadata)
the metastore acts as a system catalogue for Hive
it stores all information about the tables, their partitions, their schemas, etc.
without the system catalogue, it is not possible to impose a structure on Hadoop files

Schema is known at creation time (like RDB)

A table in Hive corresponds to an HDFS directory in Hadoop.
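
Purely as an illustration of how the data units typically map onto HDFS paths (the warehouse root and all names below are made up; the exact layout depends on configuration):

# database -> directory, table -> directory, partition -> key=value subdirectory, bucket -> file
paths = [
    "/user/hive/warehouse/sales.db",
    "/user/hive/warehouse/sales.db/orders",
    "/user/hive/warehouse/sales.db/orders/year=2024/month=01",
    "/user/hive/warehouse/sales.db/orders/year=2024/month=01/000000_0",
]
for p in paths:
    print(p)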

40
Q

DDL

A

Data definition language

41
Q

DML

A

Data manipulation language

42
Q

Apache Drill

A

Use cases: interactive analysis, business intelligence

Much faster (lower latency) than the other two (Pig and Hive).

43
Q

Solution for MapReduce pitfalls

A

Put an abstraction layer on top of MapReduce that hides complexity and adds optimization.

Key benefits of this are:
common design patterns as keywords
data flow analysis (one script can map to multiple MapReduce jobs)
avoids Java-level errors
can also be run interactively (which speeds up development time, since we can test only part of the program)