Module 2: The Hadoop Ecosystem Flashcards

1
Q

Motivations behind Hadoop

A

Storage challenge, Hardware failure, Correctly combining parts of dataset for analysis

2
Q

Storage challenge and solution

A

Storage capacities of hard drives have increased massively, but the rate at which data can be read from them has not kept up (since 1990, storage capacity has grown about 730x while access speed has grown only about 23x). This means it takes a long time to read all the data on a single drive, and writing is even slower.

Solution: read from multiple disks in parallel! This hugely decreases the read time. A possible criticism is that it leads to poor utilization, since you now need many hard drives instead of just one. The answer is to store multiple datasets across the same set of servers and provide a shared environment in which everyone can reach them; this works because not everyone runs analyses all the time.
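A rough back-of-the-envelope sketch of this argument in Python; the drive figures below are illustrative assumptions, not values from the card:

```
# Illustrative assumption: a 1 TB drive read sequentially at ~100 MB/s.
capacity_mb = 1_000_000        # 1 TB expressed in MB
read_speed_mb_s = 100          # sequential read speed in MB/s

single_drive_hours = capacity_mb / read_speed_mb_s / 3600
print(f"One drive: {single_drive_hours:.1f} hours")                     # ~2.8 hours

# Spread the same data over 100 drives and read them in parallel.
drives = 100
parallel_minutes = capacity_mb / drives / read_speed_mb_s / 60
print(f"{drives} drives in parallel: {parallel_minutes:.1f} minutes")   # ~1.7 minutes
```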

3
Q

Hardware failure and solution

A

With an increasing number of hard drives or compute nodes, the risk of one of them failing increases (and becomes fairly likely), resulting in data loss.

Solution: data replication, storing redundant copies

4
Q

How can we correctly combine parts of dataset for analysis?

A

MapReduce! It abstracts disk reads and writes into a computation over sets of keys and values, with built-in reliability.

5
Q

Hadoop

A

Provides a reliable, scalable platform for storage and analysis

Differs from other systems. Regarding access, traditional systems support interactive and batch access, while MapReduce supports batch only.

Regarding scaling, traditional systems scale nonlinearly, while MapReduce scales linearly.

6
Q

MapReduce

A

Allows for ad-hoc queries against big datasets with a reasonable response time, a transformative ability.

Provides a programming model that simplifies parallel computing over HDFS.
Runs on top of YARN.
- Google initially used MapReduce for indexing websites.

Abstracts away the intricacies of parallel programming on large clusters, so you only need to decompose your problem into a map step and a reduce step.

Based on functional programming: a map phase and a reduce phase.

Step 1: Map phase
generates key-value pairs

Step 2: Sort and shuffle (a real step, but already implemented by the framework)
moves all pairs with the same key to the same node

Step 3: Reduce phase
aggregates the values for each key

You perform these steps to obtain data-parallel scalability, and in many cases you can even parallelize within each step.
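A minimal word-count sketch of the three steps in plain Python (no Hadoop involved; the explicit shuffle function stands in for the sort-and-shuffle step that the framework normally performs for you):

```
from itertools import groupby
from operator import itemgetter

# Step 1: map phase - emit (key, value) pairs, here (word, 1)
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Step 2: sort and shuffle - group all pairs with the same key together
# (in Hadoop this is done by the framework, not by user code)
def shuffle(pairs):
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

# Step 3: reduce phase - combine all values for each key
def reduce_phase(grouped):
    for key, values in grouped:
        yield key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
print(dict(reduce_phase(shuffle(map_phase(lines)))))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```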

7
Q

Grid computing

A

Distributes jobs across a cluster of machines that share a filesystem and are connected by a storage area network (SAN).

It works well for compute-intensive jobs but struggles when nodes need to access large data volumes; there, network bandwidth is the bottleneck and compute nodes become idle.

8
Q

Grid computing in Hadoop

A

Hadoop moves computation to data, also known as data locality.

Hadoop operates at a high level of abstraction, i.e. programmers can ignore the mechanics of data flow and potential failed processes (the framework is aware of failing nodes).

9
Q

Volunteer computing

A

A form of grid computing in which volunteers donate CPU time on their personal computers to solve big problems that they care about. The problems are decomposed into work units and sent to computers around the world for analysis, to be solved eventually. This approach makes sense because we rarely use the full CPU power of our machines (they are often idle).

10
Q

Example projects that use volunteer computing

A
  • SETI@home: analysis of radio telescope data for signs of intelligent life
  • Folding@home: to understand protein folding and how it relates to disease
  • the Great Internet Mersenne Prime Search: to search for large prime numbers
11
Q

Volunteer computing in Hadoop vs other systems

A

Hadoop runs on trusted, dedicated hardware in a single data center with high bandwidth and data locality.

Other systems: volunteer computing runs on untrusted machines on the internet, with variable connection speeds and no data locality.

12
Q

Goals of Hadoop

A

Enable scalability (probably the most important):
store large volumes of data on affordable commodity hardware. You should also be able to scale out at any time to handle increasing amounts of data and computation (and doing so is simple).

Handle fault tolerance:
failures happen, especially with large clusters, so you need to be prepared. Hadoop provides tolerance through data replication, i.e. storing redundant copies of the data. Ideally this is done on different racks, in case an entire rack goes down.

Optimized for heterogeneous data:
this is in contrast to RDBMSs, which are optimized for highly structured, tabular data. Big data tends to come in many shapes and forms (e.g. text, stream, tabular). Hadoop deals with several data types by having knowledge of their structure (e.g. it is important not to split text data arbitrarily). There are also various projects in the Hadoop ecosystem dedicated to specific data types (e.g. projects for handling streaming data or graph data).

Facilitate a shared environment:
again, this concerns resource utilization. You want to allow multiple jobs to run in parallel on the cluster (you don't want lots of nodes sitting idle). Data replication therefore not only provides fault tolerance but also allows simultaneous access to the same data by different jobs.

Provide value:
it should provide value for your organization. Hadoop offers an ecosystem of many open-source projects, so you can choose a fitting module. Hadoop is also supported by a large and active community and is continually being improved. It is free to use, and it is easy to use and to find support for.

13
Q

How can all the open-source projects of Hadoop be organized?

A

One example: a stack or layer diagram.
At the lower levels you have modules that deal with data storage and the scheduling of jobs across the cluster, while at the higher levels you have modules for increasingly interactive applications. Some do both, and some are more specialized.

14
Q

HDFS

A

Hadoop Distributed File System. A core component of the ecosystem. Provides customized reading for handling a variety of data types.
- the file format is specified when you read from and write to HDFS, so HDFS can determine how to split and distribute the data across the cluster

But Hadoop has an abstract notion of filesystems, of which HDFS is just one possible implementation. Which one to use can be configured by the user, but with large data volumes you should choose a distributed one like HDFS.

15
Q

HDFS key capabilities

A
  • Scalability: can store massive datasets. Uses partitioning.
  • Reliability: using data replication
16
Q

HDFS key component

A
  • NameNode for metadata: the master node; it coordinates operations on the cluster. Typically there is one NameNode per cluster.
  • DataNode for block storage: the slave nodes; they listen to the NameNode for instructions to create and delete blocks and to perform replication. Typically there is one DataNode per machine (see the sketch below).
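A toy sketch of the NameNode's bookkeeping, assuming a 128 MB block size and a replication factor of 3 (common defaults, both configurable). It only illustrates how a file is split into blocks and each block is assigned to several DataNodes; real HDFS placement is rack-aware rather than round-robin:

```
import itertools
import math

BLOCK_SIZE_MB = 128   # common HDFS default block size (configurable)
REPLICATION = 3       # common default replication factor (configurable)

def place_blocks(file_size_mb, datanodes):
    """Split a file into blocks and assign each block to REPLICATION DataNodes."""
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    ring = itertools.cycle(datanodes)          # naive round-robin placement
    return {block_id: [next(ring) for _ in range(REPLICATION)]
            for block_id in range(n_blocks)}

# A 600 MB file over five DataNodes -> 5 blocks, each stored on 3 nodes
print(place_blocks(600, ["dn1", "dn2", "dn3", "dn4", "dn5"]))
```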
17
Q

YARN

A

Yet Another Resource Negotiator. Used for cluster resource management. Works on top of HDFS, interacts with applications, and schedules resources for their use in the cluster.

  • able to run multiple applications over HDFS
  • uses cluster resources very efficiently, allocating them to jobs in an efficient manner
  • allows you to go beyond MapReduce

YARN allowed Hadoop to truly evolve into an ecosystem. It allows many applications to be easily integrated and to run on top of HDFS.

YARN has a Resource Manager (RM), typically one per cluster, which allocates the resources.

Node Managers (NM), typically one per node, launch and monitor containers. A container then executes an application-specific process with a constrained set of resources (memory, CPU, etc.).

18
Q

But how exactly should resources be allocated by YARN to applications?

A

YARN uses a scheduler that allocates resources to applications according to a defined policy. It is a difficult task and there is no single best policy, so YARN provides a choice of schedulers, which can also be configured.

19
Q

YARN Schedulers

A

FIFO: applications are placed in a single queue and run in order of submission, each getting the full resources of the cluster. A simple policy, but not suitable for a shared cluster, because you may have to wait a long time for your job to be run.

Capacity scheduler: maintains a separate queue for smaller jobs, dedicating most of the resources to large jobs but reserving some for small jobs so that they complete quickly. This, however, leads to worse overall cluster utilization: when one of the queues has no jobs, part of the cluster sits idle (which goes against our goal).

Fair scheduler: dynamically balances resources. When one job is running, it gets all the resources; when a second job is launched, they share the resources equally until one of them finishes. This gives high cluster utilization as well as timely completion of small jobs. You can configure which resource fairness is based on.
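A toy simulation of the fair-sharing behaviour described above (purely illustrative; this is not how YARN itself is implemented). In a real cluster the scheduler is selected through YARN configuration, e.g. the yarn.resourcemanager.scheduler.class property.

```
# Toy fair scheduler: every running job gets an equal share of the cluster,
# recomputed whenever a job starts or finishes.
def fair_shares(running_jobs, total_containers=100):
    if not running_jobs:
        return {}
    share = total_containers // len(running_jobs)
    return {job: share for job in running_jobs}

print(fair_shares(["job-1"]))           # {'job-1': 100} - a lone job gets everything
print(fair_shares(["job-1", "job-2"]))  # {'job-1': 50, 'job-2': 50} - equal split
print(fair_shares(["job-2"]))           # {'job-2': 100} - job-1 done, job-2 takes over
```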

20
Q

higher level programming models

A

On top of MapReduce there are higher-level programming models intended to simplify running MapReduce-like jobs, because writing MapReduce directly is not always straightforward.

Pig: data flow scripting
  • Pig Latin is used to define data transformations, which are translated into MapReduce jobs
  • good for exploring large datasets; supports joins

Hive: framework for data warehousing on top of Hadoop

  • Provides an SQL-like query language (HiveQL)
  • Integrates well with business intelligence tools
21
Q

Module for graph analytics

A

On top of YARN. E.g. Giraph, for efficient processing of large-scale graphs.

22
Q

Modules for real time, in memory processing

A

On top of YARN. E.g. Storm, Spark, Flink.

23
Q

NoSQL for Specialized Data Models

A

Can be fitted into the ecosystem but can often also be run in isolation from HDFS.
E.g. Cassandra, MongoDB, HBase (designed to run on top of HDFS).

24
Q

ZooKeeper

A

Manager of Hadoop Modules. Provides tools for building distributed applications that can safely handle partial failures.

centralized management system for:

  • synchronization
  • configuration
  • providing high availability
25
Open-source projects
Free to use, with a big community. You can mix and match different components of the ecosystem to best fit your needs.
26
Pros of Hadoop (when to use)
  • anticipated future data growth (easy scaling)
  • long-term availability of data (fault tolerance)
  • many platforms over a single data store
  • high volume, high variety
27
Cons of Hadoop | when not to use
  • small datasets
  • task-level parallelism is not supported (running different tasks on the same data at the same time, as opposed to data-level parallelism, i.e. running the same task on different subsets of the data)
  • advanced algorithms that cannot be reduced to map and reduce phases (which using Hadoop would require)
  • as a replacement for your existing infrastructure (this should be evaluated first)
  • when you need random data access (HDFS stores relatively large files/blocks)
  • advanced analytical queries, latency-sensitive tasks, security of sensitive data
28
Programming model for big data
A programming model is a set of abstractions of the infrastructure that forms a model of computation. It provides a way to interact with the distributed filesystem without having to know all the details of how it works, which makes it easier for developers to build applications. The point of a programming model specifically for big data is to simplify parallel programming, e.g. MapReduce.
29
Requirements for Big Data programming models
Provide programmability on top of distributed file systems:
  • should enable programmability of operations within the distributed filesystem
  • should allow for writing programs on top of the distributed file system
  • should facilitate handling of potential issues

Support Big Data operations:
  • should support data partitioning for fast data access
  • distribution of computation to nodes
  • scheduling of parallel tasks

Handle fault tolerance:
  • reliability to handle hardware failure
  • data replication
  • file recovery

Enable scaling out:
  • by adding more compute nodes or racks whenever needed

Be optimized for specific data types:
  • structured or not (document, stream, graph)
  • should support operations over some of these types, not just one
30
Simplify parallel programming
Parallel programming requires a lot of knowledge of various synchronization mechanisms; it has a steep learning curve and is error prone.
31
Hadoop streaming
Provides an API to MapReduce that allows map and reduce functions to be written in languages other than Java.
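A minimal Streaming-style word count in Python as a sketch (Streaming passes records to the mapper and reducer on stdin and expects tab-separated key/value lines on stdout; the file names here are just examples):

```
#!/usr/bin/env python3
# mapper.py - emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```
#!/usr/bin/env python3
# reducer.py - sum the counts per word; input arrives sorted by key
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

Such scripts are submitted with the hadoop-streaming jar, e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input in -output out (the exact jar path depends on the installation).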
32
MapReduce on HDFS
Hadoop runs a MapReduce job by dividing it into map and reduce tasks. Tasks are scheduled using YARN and run on nodes in the cluster; if a task fails, it is rescheduled by the YARN framework.

Hadoop divides the input into fixed-size splits, with one map task per split, which runs the map function for each record in the split. There is a trade-off between small splits (good load balancing) and the number of splits (overhead); the rule of thumb is that a good split size equals the size of an HDFS block.

Hadoop aims for data locality optimization: run the map task on the node where the data resides, so there is no need to use valuable cluster bandwidth. The optimal split size equals the block size, because then the map task can run without any data transfer (it is the largest input guaranteed to be stored on a single node). Possible cases:
  • Case 1: run the map task on a node where the data resides
  • Case 2: all nodes holding replicas are running other map tasks; the scheduler looks for a free slot on the same rack
  • Case 3: if that is not possible, use an off-rack node (requires inter-rack network transfer)

Map tasks write their output to local disk, not to HDFS. Why? It is only intermediate output, processed by the reduce tasks and thrown away once the job is done, so HDFS storage with replication would be overkill. If the node running a map task fails, Hadoop reruns the task on another node.

Reduce tasks cannot benefit from data locality: the input to a (single) reduce task is typically the output of all or multiple mappers. The sorted map outputs are transferred across the network to the node running the reduce task, merged, and passed to the user-defined reduce function. Reduce output is typically stored in HDFS for reliability.
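A quick numeric illustration of the split-size rule of thumb; 128 MB is a common HDFS block-size default, and the file size is a made-up figure:

```
import math

block_size_mb = 128       # common HDFS default block size
file_size_mb = 10_000     # ~10 GB input file (made-up figure)

# One map task per split; with split size == block size:
num_map_tasks = math.ceil(file_size_mb / block_size_mb)
print(num_map_tasks)      # 79 map tasks, each ideally reading a local block
```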
33
Tuning MapReduce jobs
How many mappers do you use, and how long do they run? A mapper should run for around one minute. How many reducers? In some cases it is better to have several MapReduce jobs.
34
Granularity of MapReduce jobs
It is better to have more, simpler stages because they lead to more portable and maintainable mappers and reducers. A mapper often performs input format parsing, projection (selecting relevant fields), and filtering (removing records). These can be split into distinct mappers and chained together using the ChainMapper library class.
35
When to use MapReduce
It excels at batch analysis that can be decomposed into independent data-parallel tasks.
36
When not to use MapReduce
  • Frequently changing data: it would be slow, since it reads the entire input dataset each time
  • Dependent tasks: computations with dependencies cannot be expressed with MapReduce
  • Interactive analysis: it does not return any results until the entire process is finished

Pitfalls: some tasks are complicated to code, absence of schema and index, difficulties debugging the code, some tasks are very expensive
37
Apache Pig
Use cases: batch analysis, data mining

Goal: reduce development time
  • nested data model (much like MongoDB)
  • query execution
  • user-defined functions
  • analytic queries over text files

Procedural language ⇒ the developer has control over the execution plan (can speed up performance)
Schema is defined at query time
38
Pig-relation
  • A relation is a bag (more specifically, an outer bag)
  • A bag is a collection of tuples
  • A tuple is an ordered set of fields
  • A field is a piece of data
When we say STORE or DUMP, we trigger the execution (see the sketch below).
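A rough Python analogue of this data model, just to make the nesting concrete (Pig itself expresses this in Pig Latin; the field values are made up):

```
# field    -> a piece of data
# tuple    -> an ordered set of fields
# bag      -> a collection of tuples
# relation -> an (outer) bag
relation = [                              # the relation is a bag ...
    ("alice", 25, ["pig", "hive"]),       # ... of tuples; a field can itself be nested
    ("bob",   31, ["spark"]),
]
for name, age, tools in relation:
    print(name, age, tools)
```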
39
Apache Hive
Use cases: batch analysis, reporting

A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

Hive provides:
  • structure
  • extraction, transformation, and load (ETL)
  • access to different storage
  • query execution via MapReduce

Key building principles: SQL is a familiar language, performance, and extensibility

Data units: databases, tables, partitions, buckets; 3 levels in a DB: table ⇒ partitions ⇒ buckets

Connection to RDBMS:
  • Hive stores its metadata in an RDBMS (Pig doesn't have metadata)
  • the metastore acts as a system catalogue for Hive; it stores all information about the tables, their partitions, their schemas, etc.
  • without the system catalogue, it is not possible to impose a structure on Hadoop files

Schema is known at creation time (like an RDB). A table in Hive is an HDFS directory in Hadoop.
40
DDL
Data definition language
41
DML
Data manipulation language
42
Apache drill
Use cases: interactive analysis, business intelligence. Much faster than the other two (Pig and Hive).
43
Solution for MapReduce pitfalls
Put an abstraction layer on top of MapReduce that hides complexity and adds optimization. Key benefits:
  • common design patterns as keywords
  • data flow analysis (one script can map to multiple MapReduce jobs)
  • avoids Java-level errors
It can also be run interactively, which speeds up development time since we can test only part of the program.