Module 2: The Hadoop Ecosystem Flashcards
Motivations behind Hadoop
Storage challenge, hardware failure, and correctly combining parts of the dataset for analysis
Storage challenge and solution
Storage capacities of hard drives have increased massively, but the rate at which data can be read from them has not kept up (since 1990, storage capacity has grown roughly 730x while access speed has grown only about 23x). This means it takes a long time to read all the data on a single drive, and writing is even slower; for example, reading a full 1 TB drive at roughly 100 MB/s takes on the order of two and a half hours.
Solution: we can read from multiple disks in parallel! This hugely decreases the read time. A possible criticism is that it leads to poor utilization, since you use many hard drives instead of just one. The answer is to store several datasets across the same drives and treat the cluster as a shared environment that everyone can reach; this works because users rarely run their analyses at the same time.
Hardware failure and solution
With an increasing number of hard drives or compute nodes, the risk of one failing increases (and becomes fairly likely), resulting in data loss.
Solution: data replication, storing redundant copies
How can we correctly combine parts of dataset for analysis?
MapReduce! It abstracts disk reads and writes into a computation over sets of keys and values, with built-in reliability.
Hadoop
Provides a reliable, scalable platform for storage and analysis
It differs from other systems. Regarding access, traditional systems offer both interactive and batch access, while MapReduce is batch only.
Regarding scaling, traditional systems scale nonlinearly, while MapReduce scales linearly.
MapReduce
Allows for ad-hoc queries against big datasets with a reasonable response time, which is a transformative ability.
Provides a programming model that simplifies parallel computing over HDFS.
Runs on top of YARN.
- Google initially used MapReduce for indexing websites.
Abstracts away the intricacies of parallel programming on large clusters, so you only need to decompose your problem into a map step and a reduce step.
Based on functional programming, with a map phase and a reduce phase.
Step 1: Map phase
Generate key-value pairs.
Step 2: Sort and shuffle (a real step, but already implemented by the framework)
All pairs with the same key are moved to the same node.
Step 3: Reduce phase
Performing these steps gives data-parallel scalability, and in many cases each step can itself be parallelized (see the sketch below).
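As an illustration, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the word-count task and the class names are illustrative assumptions, not taken from the course material.

```java
// Minimal word-count sketch using the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: emit a (word, 1) key-value pair for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // key-value pair out of the mapper
      }
    }
  }

  // Reduce phase: the framework has already grouped all values for a key
  // (the sort-and-shuffle step), so we only sum the counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```

Note that the sort-and-shuffle step appears nowhere in this code: the framework groups the mapper's (word, 1) pairs by key before handing them to the reducer, which is exactly the part of the model that comes built in.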
Grid computing
Distributes jobs across a cluster of machines with a shared filesystem and connected by a storage area network (SAN).
It works well for compute-intensive jobs, but struggles when nodes need to access large data volumes: network bandwidth becomes the bottleneck and compute nodes sit idle.
Grid computing vs Hadoop
Hadoop moves computation to data, also known as data locality.
Hadoop operates at a high level of abstraction, i.e. programmers can ignore the mechanics of data flow and of potentially failing processes (the framework is aware of failing nodes).
Volunteer computing
A form of grid computing, but where volunteers donate CPU time on their personal computers to solve big, important problems that they care about. The problems are decomposed into work units and sent to computers around the world for analysis, to be solved eventually. This approach makes sense since we rarely use the full CPU power of our machines (they are often idle).
Example projects that use volunteer computing
- SETI@home: analysis of radio telescope data for signs of intelligent life
- Folding@home: to understand protein folding and how it relates to disease
- the Great Internet Mersenne Prime Search: to search for large prime numbers
Volunteer computing vs Hadoop
Hadoop runs on trusted, dedicated hardware in a single data center with high bandwidth and data locality.
Volunteer computing, by contrast, runs on untrusted machines across the internet, with variable connection speeds and no data locality.
Goals of Hadoop
Enable scalability: (probably the most important)
Store large volumes of data on affordable commodity hardware, and be able to scale out at any time, in a simple way, to handle increasing amounts of data and computation.
Handle fault tolerance:
Failures happen, especially with large clusters, so you need to be prepared. Hadoop provides fault tolerance through data replication, i.e. storing redundant copies of the data, ideally on different racks in case an entire rack goes down.
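As a small, hedged illustration: the replication factor is exposed as the HDFS configuration property dfs.replication (3 by default) and can also be set per file; the file path in this Java sketch is hypothetical.

```java
// Sketch: controlling the HDFS replication factor from a client.
// dfs.replication is the standard property; the file path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 3);   // keep three redundant copies of each block

    FileSystem fs = FileSystem.get(conf);
    // The factor can also be changed for an existing file.
    fs.setReplication(new Path("/data/example.txt"), (short) 3);
  }
}
```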
Optimized for Heterogeneous Data:
This is in contrast to RDBMSs, which are optimized for highly structured, tabular data. But big data tends to come in many different shapes and forms (e.g. text, streams, tabular). Hadoop deals with several data types by having knowledge of their structure (e.g. it is important not to split text data arbitrarily). There are also various projects in the Hadoop ecosystem dedicated to particular data types (e.g. projects for handling streaming data or graph data).
Facilitate a shared environment:
Again, this concerns resource utilization: you want to allow multiple jobs to run in parallel on the cluster (you don't want lots of nodes sitting idle). So data replication not only provides fault tolerance, it also allows different jobs simultaneous access to the same set of data.
Provide value:
It should provide value for your organization. Hadoop provides an ecosystem of many open-source projects, so you can choose a module that fits your needs. Hadoop is also supported by a large and active community and is continually improved. It is free to use, easy to use, and easy to find support for.
How can all the open-source projects of Hadoop be organized?
One example: a stack or layer diagram.
At the lower levels you have modules that deal with data storage and the scheduling of jobs across the cluster, while at the higher levels you have modules that are increasingly interactive applications. Some modules span both levels, and some are more specialized.
HDFS
Hadoop Distributed File System. A core component of the ecosystem. Provides customized reading for handling a variety of data types.
- The file format is specified when you read from and write to HDFS, so Hadoop can determine how to split and distribute the data across the cluster.
Hadoop has an abstract notion of a filesystem, of which HDFS is just one possible implementation; which one to use can be configured by the user. With large data volumes, choose a distributed one like HDFS.
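A small sketch of that abstraction, assuming the standard Java FileSystem API (org.apache.hadoop.fs): the concrete filesystem is resolved from the URI scheme and the configuration, so the same code works against HDFS or the local filesystem. The hostname, port, and path are hypothetical.

```java
// Sketch: reading a file through Hadoop's abstract FileSystem API.
// The concrete implementation (HDFS, local FS, ...) is picked from the URI scheme.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();                       // loads core-site.xml etc.
    Path path = new Path("hdfs://namenode:8020/data/example.txt");  // hypothetical location
    FileSystem fs = FileSystem.get(path.toUri(), conf);             // resolves to the HDFS implementation

    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```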
HDFS key capabilities
- Scalability: can store massive datasets. Uses partitioning.
- Reliability: using data replication
HDFS key component
- NameNode for metadata: the master node, which coordinates operations on the cluster. Typically there is one NameNode per cluster.
- DataNode for block storage: the worker (slave) nodes. They listen to the NameNode for instructions to create and delete blocks and to replicate them. Typically there is one DataNode per machine.
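A hedged sketch of what the NameNode's metadata looks like from the client side: asking which DataNodes hold each block of a file (this block-to-node mapping is also what makes data locality possible). The path is hypothetical.

```java
// Sketch: listing which DataNodes hold each block of a file.
// The block-to-node mapping is metadata served by the NameNode.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/data/example.txt");        // hypothetical path
    FileStatus status = fs.getFileStatus(path);

    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      // Print the byte offset of each block and the hosts (DataNodes) storing it.
      System.out.println(block.getOffset() + " -> " + String.join(",", block.getHosts()));
    }
  }
}
```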
YARN
Yet Another Resource Negotiator. Used for cluster resource management. It works on top of HDFS, interacts with applications, and schedules resources for their use in the cluster.
- able to run multiple applications over HDFS.
- With YARN you can use cluster resources very efficiently: it allocates resources to jobs as they are needed.
- allows you to go beyond MapReduce
YARN allowed Hadoop to truly evolve into an ecosystem. It allows many applications to be easily integrated and to run on top of HDFS.
YARN has a ResourceManager (RM), typically one per cluster, which allocates the resources.
It also has NodeManagers (NM), typically one per node, which launch and monitor containers. A container executes an application-specific process with a constrained set of resources (memory, CPU, etc.).
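To tie the pieces together, a hedged driver sketch that submits the word-count job from the earlier MapReduce sketch; on a YARN cluster the ResourceManager then allocates containers for the map and reduce tasks. The input and output paths are hypothetical.

```java
// Sketch: a driver that submits the WordCount job from the earlier sketch.
// On a YARN cluster, the ResourceManager allocates containers for the tasks.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path("/data/input"));    // hypothetical input dir
    FileOutputFormat.setOutputPath(job, new Path("/data/output")); // must not already exist

    // waitForCompletion submits the job and blocks until the cluster has run it.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```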