HDFS Flashcards

1
Q

What is Hadoop?

A

Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. The key innovation, and one which still stands the test of time, was to distribute the datasets across many machines and to split up any computations on that data into many independent, “shared-nothing” chunks, each of which could be run on the same machines storing the data. Although existing technologies could be run on multiple servers, they typically relied heavily on communication between the distributed components, which leads to diminishing returns as the parallelism increases. Hadoop is an open-source implementation of these techniques. At its core, it offers a distributed filesystem (HDFS) and a means of running processes across a cluster of servers (YARN).

2
Q

What is a cluster?

A

In the simplest sense, a cluster is just a bunch of servers grouped together to provide one or more functions, such as storage or computation. To users of a cluster, it is generally unimportant which individual machine or set of machines within the cluster performs a computation, stores the data, or performs some other service. By contrast, architects and administrators need to understand the cluster in detail. Usually we divide a cluster up into two classes of machine: master and worker. Worker machines are where the real work happens—these machines store data, perform computations, offer services like lookups and searches, and more. Master machines are responsible for coordination, maintaining metadata about the data and services running on the worker machines, and ensuring the services keep running in the event of worker failures. Typically, there are two or three master machines for redundancy and a much larger number of workers. A cluster is scaled up by adding more workers and, when the cluster gets large enough, extra masters.

3
Q

What is HDFS?

A

The Hadoop Distributed File System (HDFS) is the scalable, fault-tolerant, distributed filesystem for Hadoop, designed around the original use case of analytics over large-scale datasets.

- When storing data, HDFS breaks a file up into blocks of configurable size, usually something like 128 MiB, and stores a replica of each block on multiple servers for resilience and data parallelism. Each worker node in the cluster runs a daemon called a DataNode, which accepts new blocks and persists them to its local disks.
- Clients wishing to store blocks must first communicate with the NameNode to be given a list of DataNodes on which to write each block. The client writes to the first DataNode, which in turn streams the data to the next DataNode, and so on in a pipeline. When providing the list of DataNodes for the pipeline, the NameNode takes into account a number of things, including the available space on each DataNode and the location of the node—its rack locality. The NameNode ensures resilience against node and rack failures by placing each block on at least two different racks.
- When reading data, the client asks the NameNode for a list of DataNodes containing the blocks for the files it needs. The client then reads the data directly from the DataNodes, preferring replicas that are local or close in network terms.
- Hadoop abstracts much of this detail from the client. The API a client uses is actually a Hadoop-compatible filesystem, of which HDFS is just one implementation; other implementations include cloud-based object stores like Amazon S3.
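
As a concrete illustration, here is a minimal sketch of writing and reading a file through the Hadoop FileSystem API (the NameNode address and path are hypothetical); the client code is the same whether the underlying implementation is HDFS or another Hadoop-compatible filesystem:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/data/example.txt");

        // Write: the client asks the NameNode for a DataNode pipeline,
        // then streams block data to the first DataNode.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client fetches block locations from the NameNode,
        // then reads directly from the closest DataNode replicas.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```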

4
Q

YARN

A

YARN (Yet Another Resource Negotiator) is the framework Hadoop uses to allocate compute resources and run processes across the cluster. YARN runs a daemon on each worker node, called a NodeManager, which reports in to a master process, called the ResourceManager. Each NodeManager tells the ResourceManager how much compute resource (in the form of virtual cores, or vcores) and how much memory is available on its node. Resources are parceled out to applications running on the cluster in the form of containers, each of which has a defined resource demand—say, 10 containers each with 4 vcores and 8 GB of RAM. The NodeManagers are responsible for starting and monitoring containers on their local nodes and for killing them if they exceed their stated resource allocations.

YARN itself does not perform any computation but rather is a framework for launching such applications distributed across a cluster. YARN provides a suite of APIs for creating these applications; we cover two such implementations, MapReduce and Apache Spark. An application that needs to run computations on the cluster must first ask the ResourceManager for a single container in which to run its own coordination process, called the ApplicationMaster (AM). Despite its name, the AM actually runs on one of the worker machines. ApplicationMasters of different applications will run on different worker machines, thereby ensuring that a failure of a single worker machine will affect only a subset of the applications running on the cluster. Once the AM is running, it requests additional containers from the ResourceManager to run its actual computation.
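
To make the container negotiation concrete, here is a hedged, minimal sketch of what an ApplicationMaster does with YARN's AMRMClient API (the hostname and resource sizes are illustrative); a real AM would also launch work in the granted containers and handle completion callbacks, which are omitted here:

```java
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SimpleApplicationMaster {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
        amClient.init(conf);
        amClient.start();

        // Register this AM with the ResourceManager (host and tracking URL are illustrative).
        amClient.registerApplicationMaster("am-host.example.com", 0, "");

        // Ask for 10 containers, each with 8 GB of RAM and 4 vcores.
        Resource capability = Resource.newInstance(8192, 4);
        Priority priority = Priority.newInstance(0);
        for (int i = 0; i < 10; i++) {
            amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
        }

        // Poll the ResourceManager; allocated containers arrive incrementally.
        AllocateResponse response = amClient.allocate(0.0f);
        for (Container container : response.getAllocatedContainers()) {
            System.out.println("Got container " + container.getId() + " on " + container.getNodeId());
        }

        // A real AM would start work in the containers via NMClient and
        // eventually call unregisterApplicationMaster before exiting.
    }
}
```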

5
Q

Apache ZooKeeper

A

The problem of consensus is an important topic in computer science. When an application is distributed across many nodes, a key concern is getting these disparate components to agree on the values of some shared parameters. For example, for frameworks with multiple master processes, agreeing on which process should be the active master and which should be in standby is critical to their correct operation.

Apache ZooKeeper is the resilient, distributed configuration service for the Hadoop ecosystem. Within ZooKeeper, configuration data is stored and accessed in a filesystem-like tree of nodes, called znodes, each of which can hold data and be the parent of zero or more child nodes. Clients open a connection to a single ZooKeeper server to create, read, update and delete the znodes.
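
A minimal sketch of working with znodes via the standard ZooKeeper Java client (the connection string, znode path, and data are hypothetical; a production client would use a richer Watcher and retry logic):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper server (address is illustrative).
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 30000, event -> {});

        String path = "/active-master";

        // Create a znode holding a small piece of configuration data.
        zk.create(path, "master-1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read it back; Stat receives the znode's metadata (version, timestamps, ...).
        Stat stat = new Stat();
        byte[] data = zk.getData(path, false, stat);
        System.out.println("Active master: " + new String(data));

        // Update and delete, passing -1 to skip the optimistic version check.
        zk.setData(path, "master-2".getBytes(), -1);
        zk.delete(path, -1);

        zk.close();
    }
}
```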

6
Q

Apache Hive Metastore

A

The Hive Metastore curates information about the structured datasets (as opposed to unstructured binary data) that reside in Hadoop and organizes them into a logical hierarchy of databases, tables, and views. Hive tables have defined schemas, which are specified during table creation. These tables support most of the common data types that you know from the relational database world. The underlying data in the storage engine is expected to match this schema, but for HDFS this is checked only at runtime, a concept commonly referred to as schema on read. Hive tables can be defined for data in a number of storage engines, including Apache HBase and Apache Kudu, but by far the most common location is HDFS.
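
As an illustration of schema on read, here is a hedged sketch of defining a Hive table over files that already sit in HDFS (the table name, columns, and path are hypothetical), issued through the standard HiveServer2 JDBC driver:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DefineHiveTable {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver and connect to HiveServer2 (host is illustrative).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {

            // The schema is recorded in the Metastore; the delimited files under
            // LOCATION are only checked against it when the table is queried.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (" +
                "  ts TIMESTAMP, user_id BIGINT, url STRING" +
                ") ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
                "STORED AS TEXTFILE " +
                "LOCATION '/data/raw/web_logs'");
        }
    }
}
```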

7
Q

Hadoop MapReduce

A

MapReduce is the original application for which Hadoop was built and is a Java-based implementation of the blueprint laid out in Google’s MapReduce paper. Originally, it was a standalone framework running on the cluster, but it was subsequently ported to YARN as the Hadoop project evolved to support more applications and use cases. Although superseded by newer engines, such as Apache Spark and Apache Flink, it is still worth understanding, given that many higher-level frameworks compile their inputs into MapReduce jobs for execution. These include:

  • Apache Hive - a SQL interface to Hadoop MapReduce.
  • Apache Sqoop - a command-line interface application for transferring data between relational databases and Hadoop.
  • Apache Oozie - a server-based workflow scheduling system to manage Hadoop jobs.
  • Apache Pig - a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to what SQL does for relational database management systems.

Essentially, MapReduce divides a computation into three sequential stages: map, shuffle, and reduce.

In the map phase, the relevant data is read from HDFS and processed in parallel by multiple independent map tasks. These tasks should ideally run wherever the data is located—usually we aim for one map task per HDFS block. The user defines a map() function (in code) that processes each record in the file and produces key-value outputs ready for the next phase.

In the shuffle phase, the map outputs are fetched by MapReduce and shipped across the network to form input to the reduce tasks.

A user-defined reduce() function receives all the values for a key in turn and aggregates or combines them into fewer values which summarize the inputs.
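
To ground the three stages, here is a minimal sketch of the classic word-count job using the Hadoop MapReduce Java API (the input and output paths are hypothetical):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel, ideally one task per HDFS block,
    // and emits (word, 1) pairs for every word in its input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: after the shuffle, receives all counts for a word
    // and sums them into a single output record.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```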

8
Q

Amazon EMR

A

Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) tool for big data processing and analysis. Amazon EMR is based on Apache Hadoop, a Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

9
Q

Apache Spark

A

Apache Spark is a distributed computation framework, with an emphasis on efficiency and usability, which supports both batch and streaming computations. Instead of the user having to express the necessary data manipulations in terms of pure map() and reduce() functions as in MapReduce, Spark exposes a rich API of common operations, such as filtering, joining, grouping, and aggregations, directly on Datasets, which comprise rows all adhering to a particular type or schema. As well as using API methods, users can submit operations directly using a SQL-style dialect (hence the general name of this set of features, Spark SQL), removing much of the requirement to compose pipelines programmatically. The dataset sources and sinks can be batch-driven, using HDFS or Kudu, or processed as a stream to and from Kafka.

A key feature of operations on Datasets is that the processing graphs are run through a standard query optimizer before execution, very similar to those found in relational databases or in massively parallel processing query engines. This optimizer can rearrange, combine, and prune the processing graph to obtain the most efficient execution pipeline.

One of the principal design goals for Spark was to take full advantage of the memory on worker nodes, which is available in increasing quantities on commodity servers. The ability to store and retrieve data from main memory at speeds that are orders of magnitude faster than those of spinning disks makes certain workloads radically more efficient. Distributed machine learning workloads in particular, which often operate on the same datasets in an iterative fashion, can see huge benefits in runtimes over the equivalent MapReduce execution.

Spark is an extraordinarily powerful framework for data processing and is often (rightly) the de facto choice when creating new batch processing, machine learning, and streaming use cases.
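
A minimal sketch of the Dataset API and Spark SQL from Java (the path, column names, and view name are hypothetical); both forms go through the same query optimizer before execution:

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-sketch")
                .getOrCreate();

        // Read a (hypothetical) Parquet dataset into a Dataset of rows with a schema.
        Dataset<Row> events = spark.read().parquet("/data/events");

        // Dataset API: filter, group, and aggregate without writing map()/reduce().
        Dataset<Row> counts = events
                .filter(col("status").equalTo("ERROR"))
                .groupBy(col("service"))
                .count();
        counts.show();

        // Equivalent Spark SQL: register a view and query it with a SQL-style dialect.
        events.createOrReplaceTempView("events");
        spark.sql("SELECT service, COUNT(*) AS errors "
                + "FROM events WHERE status = 'ERROR' GROUP BY service").show();

        spark.stop();
    }
}
```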

10
Q

Apache Hive

A

Apache Hive is the original data warehousing technology for Hadoop. It was developed at Facebook and was the first to offer a SQL-like language, called HiveQL, to allow analysts to query structured data stored in HDFS without having to first compile and deploy code. Hive supports common SQL query concepts, like table joins, unions, subqueries, and views. At a high level, Hive parses a user query, optimizes it, and compiles it into one or more chained batch computations, which it runs on the cluster. Typically these computations are executed as MapReduce jobs, but Hive can also use Apache Tez and Spark as its backing execution engine.
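
A brief sketch, assuming a HiveServer2 endpoint (the hostname, tables, and columns are hypothetical), of how an analyst's HiveQL query is submitted over JDBC; Hive compiles it into one or more batch jobs (MapReduce, Tez, or Spark) behind the scenes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement();
             // A join plus aggregation; Hive plans and runs this as batch jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                 "SELECT u.country, COUNT(*) AS visits " +
                 "FROM web_logs l JOIN users u ON l.user_id = u.id " +
                 "GROUP BY u.country ORDER BY visits DESC")) {
            while (rs.next()) {
                System.out.println(rs.getString("country") + "\t" + rs.getLong("visits"));
            }
        }
    }
}
```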

11
Q

Apache Impala

A

Apache Impala is a massively parallel processing (MPP) engine designed to support fast, interactive SQL queries on massive datasets in Hadoop or cloud storage. Its key design goal is to enable multiple concurrent, ad hoc, reporting-style queries covering terabytes of data to complete within a few seconds. It is squarely aimed at supporting analysts who wish to execute their own SQL queries, directly or via UI-driven business intelligence (BI) tools.

12
Q

Storage Engines

A

The original storage engine in the Hadoop ecosystem is HDFS, which excels at storing large amounts of append-only data to be accessed in sequential scans. But what about other access patterns, such as random record retrieval and updates? What about document search? Many workloads deal with large and varied datasets but are not analytical in nature. To cater to these different use cases, a few projects have been developed or adapted for use with Hadoop.

13
Q

Apache HBase

A

The desire by some early web companies to store tens of billions to trillions of records and to allow their efficient retrieval and update led to the development of Apache HBase—a semi-structured, random-access key-value store using HDFS as its persistent store. As with many of the Hadoop projects, the original blueprint for the framework came from a paper published by Google describing its system Bigtable. Essentially, HBase provides a means by which a random-access read/write workload (which is very inefficient for HDFS) is converted to sequential I/O (which HDFS excels at).
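
A minimal sketch of random-access reads and writes with the HBase Java client (the table, column family, and row key are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Cluster location is normally resolved via hbase-site.xml and ZooKeeper.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_profiles"))) {

            byte[] family = Bytes.toBytes("info");

            // Random-access write: update a single cell for one row key.
            Put put = new Put(Bytes.toBytes("user#42"));
            put.addColumn(family, Bytes.toBytes("email"), Bytes.toBytes("user42@example.com"));
            table.put(put);

            // Random-access read: fetch that row back by key.
            Result result = table.get(new Get(Bytes.toBytes("user#42")));
            byte[] email = result.getValue(family, Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```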

14
Q

Cassandra

A

Apache Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra is optimized for write-heavy workloads.

15
Q

Accumulo

A

Apache Accumulo is a highly scalable, sorted, distributed key-value store based on Google’s Bigtable. It is a system built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift.

16
Q

Apache Kudu

A

With HDFS you get efficient high-throughput analytics, and with HBase or Accumulo you get fast random access, but you need two different storage engines to satisfy both needs. The creators of Kudu set out to create a storage and query engine that could satisfy both access patterns (random access and sequential scans) and efficiently allow updates to existing data. Naturally, to allow this, some performance trade-offs are inevitable, but the aim is to get close to the performance levels of each of the native technologies—that is, to service random-access reads within tens of milliseconds and to perform file scans at hundreds of MiB/s.

17
Q

Apache Solr

A

Sometimes SQL is not enough. Some applications need the ability to perform more flexible searches on unstructured or semi-structured data. Many use cases, such as log search, document vaults, and cybersecurity analysis, can involve retrieving data via free-text search, fuzzy search, faceted search, phoneme matching, synonym matching, geospatial search, and more. For these requirements, often termed enterprise search, we need the ability to automatically process, analyze, index, and query billions of documents and hundreds of terabytes of data.

To support its search capabilities, Solr uses inverted indexes supplied by Apache Lucene, which are simply maps of terms to a list of matching documents. Terms can be words, stems, ranges, numbers, coordinates, and more. Documents contain fields, which define the type of terms found in them. Fields may be split into individual tokens and indexed separately. The fields a document contains are defined in a schema. The indexing process and storage structure allow for quick ranked document retrieval, and a number of advanced query parsers can perform exact matching, fuzzy matching, regular expression matching, and more. For a given query, an index searcher retrieves documents that match the query’s predicates. The documents are scored and, optionally, sorted according to certain criteria; by default, the documents with the highest score are returned first.
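
A brief sketch of indexing and querying documents with the SolrJ client (the Solr URL, collection name, and fields are hypothetical):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrExample {
    public static void main(String[] args) throws Exception {
        // Point the client at a (hypothetical) collection named "logs".
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://solr.example.com:8983/solr/logs").build()) {

            // Index a document; its fields must match the collection's schema.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "evt-1");
            doc.addField("message", "connection refused by upstream service");
            doc.addField("level", "ERROR");
            solr.add(doc);
            solr.commit();

            // Free-text query with a filter; results come back ranked by score.
            SolrQuery query = new SolrQuery("message:\"connection refused\"");
            query.addFilterQuery("level:ERROR");
            query.setRows(10);
            QueryResponse response = solr.query(query);
            for (SolrDocument hit : response.getResults()) {
                System.out.println(hit.getFieldValue("id"));
            }
        }
    }
}
```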

18
Q

Apache Kafka

A

One of the primary drivers behind a cluster is to have a single platform that can store and process data from a multitude of sources. The sources of data within an enterprise are many and varied: web logs, machine logs, business events, transactional data, text documents, images, and more. This data arrives via a multitude of modes, including push-based, pull-based, batches, and streams, and in a wide range of protocols: HTTP, SCP/SFTP, JDBC, AMQP, JMS, and more. Apache Kafka is a distributed publish-subscribe messaging system, built around a durable, partitioned log, that acts as a single integration point through which these diverse sources can feed the rest of the platform.
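
A minimal sketch, assuming a Kafka broker at a hypothetical address and a hypothetical topic, of an application pushing events onto Kafka with the standard Java producer client:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address is illustrative; clients discover the rest of the cluster from it.
        props.put("bootstrap.servers", "kafka1.example.com:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish a web-log event onto a (hypothetical) "web-logs" topic,
            // keyed by user so that events for one user land in the same partition.
            producer.send(new ProducerRecord<>("web-logs", "user-42",
                    "GET /index.html 200"));
            producer.flush();
        }
    }
}
```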

19
Q

Apache Druid

A

Druid is a distributed, column-oriented, real-time analytics data store, optimized for read-heavy workloads, that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte-sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations.

20
Q

Google Bigquery

A

BigQuery is Google’s serverless, highly scalable, low-cost enterprise data warehouse designed to make all your data analysts productive. Because there is no infrastructure to manage, you can focus on analyzing data to find meaningful insights using familiar SQL, and you don’t need a database administrator.

21
Q

Presto

A

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

22
Q

Data plane

A
23
Q

control plane

A