Module 2: The Hadoop Ecosystem Flashcards

1
Q

Motivations behind Hadoop

A

Storage challenge, Hardware failure, Correctly combining parts of dataset for analysis

2
Q

Storage challenge and solution

A

Storage capacities of hard drives have increased massively, but the rate at which data can be read from them has not kept up (since 1990, storage capacity has grown about 730x while access speed has grown only about 23x). This means it takes a long time to read all the data on a single drive, and writing is even slower.

Solution: read from multiple disks in parallel! This hugely decreases the read time. A possible criticism is that it leads to poor utilization, since you now need many hard drives instead of just one. The answer is to store multiple datasets across the same set of servers and provide a shared environment in which everyone can reach them; this works because not everyone runs analyses all the time.
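A rough back-of-the-envelope sketch of this argument in Python; the drive figures below are illustrative assumptions, not values from the card:

```
# Illustrative assumption: a 1 TB drive read sequentially at ~100 MB/s.
capacity_mb = 1_000_000        # 1 TB expressed in MB
read_speed_mb_s = 100          # sequential read speed in MB/s

single_drive_hours = capacity_mb / read_speed_mb_s / 3600
print(f"One drive: {single_drive_hours:.1f} hours")                     # ~2.8 hours

# Spread the same data over 100 drives and read them in parallel.
drives = 100
parallel_minutes = capacity_mb / drives / read_speed_mb_s / 60
print(f"{drives} drives in parallel: {parallel_minutes:.1f} minutes")   # ~1.7 minutes
```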

3
Q

Hardware failure and solution

A

With an increasing number of hard drives or compute nodes, the risk of one of them failing increases (and becomes fairly likely), resulting in data loss.

Solution: data replication, storing redundant copies

4
Q

How can we correctly combine parts of dataset for analysis?

A

MapReduce! It abstracts disk reads and writes into a computation over sets of keys and values, with built-in reliability.

5
Q

Hadoop

A

Provides a reliable, scalable platform for storage and analysis

Differs from other systems. Regarding access, traditional systems support interactive and batch access, while MapReduce supports batch only.

Regarding scaling, traditional systems scale nonlinearly, while MapReduce scales linearly.

6
Q

MapReduce

A

Allows for ad-hoc queries against big datasets with a reasonable response time, a transformative ability.

Provides a programming model that simplifies parallel computing over HDFS.
Runs on top of YARN.
- Google initially used MapReduce for indexing websites.

Abstracts away the intricacies of parallel programming on large clusters, so you only need to decompose your problem into a map step and a reduce step.

Based on functional programming: a map phase and a reduce phase.

Step 1: Map phase
generates key-value pairs

Step 2: Sort and shuffle (a real step, but already implemented by the framework)
moves all pairs with the same key to the same node

Step 3: Reduce phase
aggregates the values for each key

You perform these steps to obtain data-parallel scalability, and in many cases you can even parallelize within each step.
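A minimal word-count sketch of the three steps in plain Python (no Hadoop involved; the explicit shuffle function stands in for the sort-and-shuffle step that the framework normally performs for you):

```
from itertools import groupby
from operator import itemgetter

# Step 1: map phase - emit (key, value) pairs, here (word, 1)
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Step 2: sort and shuffle - group all pairs with the same key together
# (in Hadoop this is done by the framework, not by user code)
def shuffle(pairs):
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

# Step 3: reduce phase - combine all values for each key
def reduce_phase(grouped):
    for key, values in grouped:
        yield key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
print(dict(reduce_phase(shuffle(map_phase(lines)))))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```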

7
Q

Grid computing

A

Distributes jobs across a cluster of machines that share a filesystem and are connected by a storage area network (SAN).

It works well for compute-intensive jobs but struggles when nodes need to access large data volumes; there, network bandwidth is the bottleneck and compute nodes become idle.

8
Q

Grid computing in Hadoop

A

Hadoop moves computation to data, also known as data locality.

Hadoop operates at a high level of abstraction, i.e. programmers can ignore the mechanics of data flow and potential failed processes (the framework is aware of failing nodes).

9
Q

Volunteer computing

A

A form of grid computing in which volunteers donate CPU time on their personal computers to solve big problems that they care about. The problems are decomposed into work units and sent to computers around the world for analysis, to be solved eventually. This approach makes sense because we rarely use the full CPU power of our machines (they are often idle).

10
Q

Example projects that use volunteer computing

A
  • SETI@home: analysis of radio telescope data for signs of intelligent life
  • Folding@home: to understand protein folding and how it relates to disease
  • the Great Internet Mersenne Prime Search: to search for large prime numbers
11
Q

Volunteer computing in Hadoop vs other systems

A

Hadoop runs on trusted, dedicated hardware in a single data center with high bandwidth and data locality.

Other systems: volunteer computing runs on untrusted machines on the internet, with variable connection speeds and no data locality.

12
Q

Goals of Hadoop

A

Enable scalability (probably the most important):
store large volumes of data on affordable commodity hardware. You should also be able to scale out at any time to handle increasing amounts of data and computation (and doing so is simple).

Handle fault tolerance:
failures happen, especially with large clusters, so you need to be prepared. Hadoop provides tolerance through data replication, i.e. storing redundant copies of the data. Ideally this is done on different racks, in case an entire rack goes down.

Optimized for heterogeneous data:
this is in contrast to RDBMSs, which are optimized for highly structured, tabular data. Big data tends to come in many shapes and forms (e.g. text, stream, tabular). Hadoop deals with several data types by having knowledge of their structure (e.g. it is important not to split text data arbitrarily). There are also various projects in the Hadoop ecosystem dedicated to specific data types (e.g. projects for handling streaming data or graph data).

Facilitate a shared environment:
again, this concerns resource utilization. You want to allow multiple jobs to run in parallel on the cluster (you don't want lots of nodes sitting idle). Data replication therefore not only provides fault tolerance but also allows simultaneous access to the same data by different jobs.

Provide value:
it should provide value for your organization. Hadoop offers an ecosystem of many open-source projects, so you can choose a fitting module. Hadoop is also supported by a large and active community and is continually being improved. It is free to use, and it is easy to use and to find support for.

13
Q

How can all the open-source projects of Hadoop be organized?

A

One example: a stack or layer diagram.
At the lower levels you have modules that deal with data storage and the scheduling of jobs across the cluster, while at the higher levels you have modules for increasingly interactive applications. Some do both, and some are more specialized.

14
Q

HDFS

A

Hadoop Distributed File System. A core component of the ecosystem. Provides customized reading for handling a variety of data types.
- the file format is specified when you read from and write to HDFS, so HDFS can determine how to split and distribute the data across the cluster

But Hadoop has an abstract notion of filesystems, of which HDFS is just one possible implementation. Which one to use can be configured by the user, but with large data volumes you should choose a distributed one like HDFS.

15
Q

HDFS key capabilities

A
  • Scalability: can store massive datasets. Uses partitioning.
  • Reliability: using data replication
16
Q

HDFS key component

A
  • NameNode for metadata: the master node; it coordinates operations on the cluster. Typically there is one NameNode per cluster.
  • DataNode for block storage: the slave nodes; they listen to the NameNode for instructions to create and delete blocks and to perform replication. Typically there is one DataNode per machine (see the sketch below).
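A toy sketch of the NameNode's bookkeeping, assuming a 128 MB block size and a replication factor of 3 (common defaults, both configurable). It only illustrates how a file is split into blocks and each block is assigned to several DataNodes; real HDFS placement is rack-aware rather than round-robin:

```
import itertools
import math

BLOCK_SIZE_MB = 128   # common HDFS default block size (configurable)
REPLICATION = 3       # common default replication factor (configurable)

def place_blocks(file_size_mb, datanodes):
    """Split a file into blocks and assign each block to REPLICATION DataNodes."""
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    ring = itertools.cycle(datanodes)          # naive round-robin placement
    return {block_id: [next(ring) for _ in range(REPLICATION)]
            for block_id in range(n_blocks)}

# A 600 MB file over five DataNodes -> 5 blocks, each stored on 3 nodes
print(place_blocks(600, ["dn1", "dn2", "dn3", "dn4", "dn5"]))
```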
17
Q

YARN

A

Yet Another Resource Negotiator. Used for cluster resource management. Works on top of HDFS, interacts with applications, and schedules resources for their use in the cluster.

  • able to run multiple applications over HDFS
  • uses cluster resources very efficiently, allocating them to jobs in an efficient manner
  • allows you to go beyond MapReduce

YARN allowed Hadoop to truly evolve into an ecosystem. It allows many applications to be easily integrated and to run on top of HDFS.

YARN has a Resource Manager (RM), typically one per cluster, which allocates the resources.

Node Managers (NM), typically one per node, launch and monitor containers. A container then executes an application-specific process with a constrained set of resources (memory, CPU, etc.).

18
Q

But how exactly should resources be allocated by YARN to applications?

A

YARN uses a scheduler that allocates resources to applications according to a defined policy. It is a difficult task and there is no single best policy, so YARN provides a choice of schedulers, which can also be configured.

19
Q

YARN Schedulers

A

FIFO: applications are placed in a single queue and run in order of submission, each getting the full resources of the cluster. A simple policy, but not suitable for a shared cluster, because you may have to wait a long time for your job to be run.

Capacity scheduler: maintains a separate queue for smaller jobs, dedicating most of the resources to large jobs but reserving some for small jobs so that they complete quickly. This, however, leads to worse overall cluster utilization: when one of the queues has no jobs, part of the cluster sits idle (which goes against our goal).

Fair scheduler: dynamically balances resources. When one job is running, it gets all the resources; when a second job is launched, they share the resources equally until one of them finishes. This gives high cluster utilization as well as timely completion of small jobs. You can configure which resource fairness is based on.
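A toy simulation of the fair-sharing behaviour described above (purely illustrative; this is not how YARN itself is implemented). In a real cluster the scheduler is selected through YARN configuration, e.g. the yarn.resourcemanager.scheduler.class property.

```
# Toy fair scheduler: every running job gets an equal share of the cluster,
# recomputed whenever a job starts or finishes.
def fair_shares(running_jobs, total_containers=100):
    if not running_jobs:
        return {}
    share = total_containers // len(running_jobs)
    return {job: share for job in running_jobs}

print(fair_shares(["job-1"]))           # {'job-1': 100} - a lone job gets everything
print(fair_shares(["job-1", "job-2"]))  # {'job-1': 50, 'job-2': 50} - equal split
print(fair_shares(["job-2"]))           # {'job-2': 100} - job-1 done, job-2 takes over
```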

20
Q

higher level programming models

A

On top of MapReduce there are higher-level programming models intended to simplify running MapReduce-like jobs, because writing MapReduce directly is not always straightforward.

Pig: data flow scripting
  • Pig Latin is used to define data transformations, which are translated into MapReduce jobs
  • good for exploring large datasets; supports joins

Hive: framework for data warehousing on top of Hadoop

  • Provides an SQL-like query language (HiveQL)
  • Integrates well with business intelligence tools
21
Q

Module for graph analytics

A

On top of YARN. E.g. Giraph, for efficient processing of large-scale graphs.

22
Q

Modules for real time, in memory processing

A

On top of YARN. E.g. Storm, Spark, Flink.

23
Q

NoSQL for Specialized Data Models

A

Can be fitted into the ecosystem but can often also be run in isolation from HDFS.
E.g. Cassandra, MongoDB, HBase (designed to run on top of HDFS).

24
Q

ZooKeeper

A

Manager of Hadoop Modules. Provides tools for building distributed applications that can safely handle partial failures.

centralized management system for:

  • synchronization
  • configuration
  • providing high availability
25
Open-source projects
Free to use, with a big community. You can mix and match different components of the ecosystem to best fit your needs.
26
Pros of Hadoop (when to use)
  • anticipated future data growth (easy scaling)
  • long-term availability of data (fault tolerance)
  • many platforms over a single data store
  • high volume, high variety
27
Cons of Hadoop | when not to use
  • small datasets
  • task-level parallelism is not supported (running different tasks on the same data at the same time, as opposed to data-level parallelism, i.e. running the same task on different subsets of the data)
  • advanced algorithms that cannot be reduced to map and reduce phases (which using Hadoop would require)
  • as a replacement for your existing infrastructure (this should be evaluated first)
  • when you need random data access (HDFS stores relatively large files/blocks)
  • advanced analytical queries, latency-sensitive tasks, security of sensitive data
28
Programming model for big data
A programming model is a set of abstractions of the infrastructure that forms a model of computation. It provides a way to interact with the distributed filesystem without having to know all the details of how it works, which makes it easier for developers to build applications. The point of a programming model specifically for big data is to simplify parallel programming, e.g. MapReduce.
29
Requirements for Big Data programming models
Provide programmability on top of distributed file systems:
  • should enable programmability of operations within the distributed filesystem
  • should allow for writing programs on top of the distributed file system
  • should facilitate handling of potential issues

Support Big Data operations:
  • should support data partitioning for fast data access
  • distribution of computation to nodes
  • scheduling of parallel tasks

Handle fault tolerance:
  • reliability to handle hardware failure
  • data replication
  • file recovery

Enable scaling out:
  • by adding more compute nodes or racks whenever needed

Be optimized for specific data types:
  • structured or not (document, stream, graph)
  • should support operations over some of these types, not just one
30
Simplify parallel programming
Parallel programming requires a lot of knowledge of various synchronization mechanisms; it has a steep learning curve and is error prone.
31
Hadoop streaming
Provides an API to MapReduce that allows map and reduce functions to be written in languages other than Java.
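A minimal Streaming-style word count in Python as a sketch (Streaming passes records to the mapper and reducer on stdin and expects tab-separated key/value lines on stdout; the file names here are just examples):

```
#!/usr/bin/env python3
# mapper.py - emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```
#!/usr/bin/env python3
# reducer.py - sum the counts per word; input arrives sorted by key
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

Such scripts are submitted with the hadoop-streaming jar, e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input in -output out (the exact jar path depends on the installation).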
32
MapReduce on HDFS
Hadoop runs a MapReduce job by dividing it into map and reduce tasks. Tasks are scheduled using YARN and run on nodes in the cluster; if a task fails, it is rescheduled by the YARN framework.

Hadoop divides the input into fixed-size splits, with one map task per split, which runs the map function for each record in the split. There is a trade-off between small splits (good load balancing) and the number of splits (overhead); the rule of thumb is that a good split size equals the size of an HDFS block.

Hadoop aims for data locality optimization: run the map task on the node where the data resides, so there is no need to use valuable cluster bandwidth. The optimal split size equals the block size, because then the map task can run without any data transfer (it is the largest input guaranteed to be stored on a single node). Possible cases:
  • Case 1: run the map task on a node where the data resides
  • Case 2: all nodes holding replicas are running other map tasks; the scheduler looks for a free slot on the same rack
  • Case 3: if that is not possible, use an off-rack node (requires inter-rack network transfer)

Map tasks write their output to local disk, not to HDFS. Why? It is only intermediate output, processed by the reduce tasks and thrown away once the job is done, so HDFS storage with replication would be overkill. If the node running a map task fails, Hadoop reruns the task on another node.

Reduce tasks cannot benefit from data locality: the input to a (single) reduce task is typically the output of all or multiple mappers. The sorted map outputs are transferred across the network to the node running the reduce task, merged, and passed to the user-defined reduce function. Reduce output is typically stored in HDFS for reliability.
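A quick numeric illustration of the split-size rule of thumb; 128 MB is a common HDFS block-size default, and the file size is a made-up figure:

```
import math

block_size_mb = 128       # common HDFS default block size
file_size_mb = 10_000     # ~10 GB input file (made-up figure)

# One map task per split; with split size == block size:
num_map_tasks = math.ceil(file_size_mb / block_size_mb)
print(num_map_tasks)      # 79 map tasks, each ideally reading a local block
```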
33
Tuning MapReduce jobs
How many mappers do you use, and how long do they run? A mapper should run for around one minute. How many reducers? In some cases it is better to have several MapReduce jobs.
34
Granularity of MapReduce jobs
It is better to have more, simpler stages because they lead to more portable and maintainable mappers and reducers. A mapper often performs input format parsing, projection (selecting relevant fields), and filtering (removing records). These can be split into distinct mappers and chained together using the ChainMapper library class.
35
When to use MapReduce
It excels at batch analysis that can be decomposed into independent data-parallel tasks.
36
When not to use MapReduce
  • Frequently changing data: it would be slow, since it reads the entire input dataset each time
  • Dependent tasks: computations with dependencies cannot be expressed with MapReduce
  • Interactive analysis: it does not return any results until the entire process is finished

Pitfalls: some tasks are complicated to code, absence of schema and index, difficulties debugging the code, some tasks are very expensive
37
Apache Pig
Use cases: batch analysis, data mining

Goal: reduce development time
  • nested data model (much like MongoDB)
  • query execution
  • user-defined functions
  • analytic queries over text files

Procedural language ⇒ the developer has control over the execution plan (can speed up performance)
Schema is defined at query time
38
Pig-relation
  • A relation is a bag (more specifically, an outer bag)
  • A bag is a collection of tuples
  • A tuple is an ordered set of fields
  • A field is a piece of data
When we say STORE or DUMP, we trigger the execution (see the sketch below).
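A rough Python analogue of this data model, just to make the nesting concrete (Pig itself expresses this in Pig Latin; the field values are made up):

```
# field    -> a piece of data
# tuple    -> an ordered set of fields
# bag      -> a collection of tuples
# relation -> an (outer) bag
relation = [                              # the relation is a bag ...
    ("alice", 25, ["pig", "hive"]),       # ... of tuples; a field can itself be nested
    ("bob",   31, ["spark"]),
]
for name, age, tools in relation:
    print(name, age, tools)
```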
39
Apache Hive
Use cases: batch analysis, reporting

A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

Hive provides:
  • structure
  • extraction, transformation, and load (ETL)
  • access to different storage
  • query execution via MapReduce

Key building principles: SQL is a familiar language, performance, and extensibility

Data units: databases, tables, partitions, buckets; 3 levels in a DB: table ⇒ partitions ⇒ buckets

Connection to RDBMS:
  • Hive stores its metadata in an RDBMS (Pig doesn't have metadata)
  • the metastore acts as a system catalogue for Hive; it stores all information about the tables, their partitions, their schemas, etc.
  • without the system catalogue, it is not possible to impose a structure on Hadoop files

Schema is known at creation time (like an RDB). A table in Hive is an HDFS directory in Hadoop.
40
DDL
Data definition language
41
DML
Data manipulation language
42
Apache drill
Use cases: interactive analysis, business intelligence. Much faster than the other two (Pig and Hive).
43
Solution for MapReduce pitfalls
Put an abstraction layer on top of MapReduce that hides complexity and adds optimization. Key benefits:
  • common design patterns as keywords
  • data flow analysis (one script can map to multiple MapReduce jobs)
  • avoids Java-level errors
It can also be run interactively, which speeds up development time since we can test only part of the program.