Weeks 7 - 11 Flashcards

1
Q

What is a Distributed File System (DFS)? What are its characteristics?

A
  • a distributed implementation of a file system, spread over multiple autonomous computers
  • can exist in any system that has servers (source of files) and clients (accessing servers/files)
2
Q

Describe the Network File System (NFS)? What are the advantages of using it?

A
  • distributed file system protocol
  • allows a user on a client computer to access files over a computer network (much like local storage access)
  • client-server architecture application where a user can view, store and update the files on a remote computer
  • made up of a server, sharing to a network of clients.

Advantages:

  • easy sharing of data across clients
  • centralised administration (backup done on multiple servers instead of many clients)
  • security (server behind firewall)
  • transparent access to remote files
3
Q

How does the Network File System architecture work?

A
  • a virtual file system (VFS) in the OS acts as an interface between the system-call layer and all files in network nodes
  • the VFS is middleware used to decide the destination of a client request, passing calls/requests either to a local file system or to the NFS client
  • a VFS is available in most OSes as the interface to different local and distributed file systems
  • VFS is located on both the client and server
4
Q

What is a Clustered File System (CFS)? Why use it?

A

CFS is a cluster of servers that work together to provide high performance service to their clients. To clients of CFS, the cluster is transparent.
- uses a metadata service (master) to direct and organise storage

Why?

  • for bigger scale of data storage
  • scalability and availability
  • resiliency and load balancing for large volume of client requests
  • conceptually similar to Kubernetes (a cluster of machines presented to clients as one service)
5
Q

How does a clustered file system (CFS) work in terms of storage?

A

Data is divided into segments/chunks/blocks that are stored across data nodes; files are striped for parallel access.

This creates resiliency at the block level: if one server goes down, only a segment of the data is affected, not the whole dataset. This is necessary because of the huge volume of data in use, which is tracked with metadata.

The metadata server acts as the master. It is used to assign tasks from the client service to the appropriate storage server.

6
Q

What is Google File System (GFS)?

A
  • A scalable, distributed file system
  • uses large clusters of commodity hardware
  • commodity hardware is cheap hardware that fails often, so the system is designed for high failure tolerance
  • horizontally scalable via commodity hardware nodes
  • designed to deal with big data and its metadata
  • designed to allow multiple users to write (append) to one file at the same time - availability
  • response time for individual read and write is not critical, throughput is prioritised

must support two types of operations: reads and writes

  • writing here mostly means appending and adding data to existing chunks/blocks
  • differs from a usual DFS, which generally doesn’t allow multiple users to write to the same file at the same time like GFS does
7
Q

Why not use NFS for big data storage?

A

NFS is comparatively unreliable at this scale, with the potential for data loss, as opposed to CFS and other DFS designs that make data replicas for reliability.

8
Q

Talk through the GFS Architecture Read Operation

A
  • application request goes through client interaction
  • client sends chunk index and file name to master node
  • master looks up request in big table for IP and locations of each of the chunks, returns these values to client where client caches this information
  • client identifies closest location chunk
  • client requests the data from the chunkserver and reads it directly from that server - a direct read between client and chunkserver

Note:
To reduce the workload on the master node, chunks that are repeatedly read are cached on the client: the cache holds the chunk ID and chunkserver location so the data can be accessed directly, without master interaction, in future. This improves the performance and speed of the system.
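A toy, self-contained Scala sketch of this read flow; the table contents, addresses, and helper names are invented for illustration and are not a real GFS API:

object GfsReadSketch extends App {
  // "master" lookup table: (file name, chunk index) -> replica chunkserver addresses
  val masterTable = Map(("logfile", 0) -> Seq("10.0.0.1", "10.0.0.2", "10.0.0.3"))
  // client-side cache, filled after the first lookup so later reads skip the master
  val clientCache = scala.collection.mutable.Map.empty[(String, Int), Seq[String]]

  def read(file: String, chunkIndex: Int): String = {
    val replicas = clientCache.getOrElseUpdate((file, chunkIndex), masterTable((file, chunkIndex)))
    val closest = replicas.head // stand-in for "pick the closest chunkserver"
    s"<chunk $chunkIndex of $file read directly from chunkserver $closest>"
  }

  println(read("logfile", 0)) // first read consults the master table
  println(read("logfile", 0)) // repeat read is served from the client cache
}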

9
Q

Key differences between DFS, NFS and CFS?

A
  • DFS is the umbrella term and includes both NFS and CFS (data is stored across nodes in different locations)
  • both CFS and NFS use a network for communication, but NFS has limited storage, reliability, and scalability
  • in comparison to NFS, CFS provides highly available, scalable storage capabilities and high resiliency features
10
Q

GFS design overview?

A

Design overview:

  • files stored as fixed-size chunks (64 MB) on separate servers/nodes
  • as commodity servers have a high failure rate, replication across nodes is crucial for resiliency
  • the replication factor (for reliability) is 3 by default and can be set manually
  • single master - centralised management
  • meta-data store
11
Q

Describe the Google File System (GFS) Master components? Pros/cons?

A
  • maintains all system metadata
  • periodically communicates with chunkservers through heartbeat messages
  • a single master is a single point of failure, but the master state is replicated across multiple machines
  • log and check points are replicated on multiple machines
12
Q

Describe the write process in GFS?

A
  • Client requests from the master node the details (IP and location) of the chunkserver that holds the current lease on the requested chunk, plus its replicas
  • master returns the locations and IPs of all chunk replicas, prioritising them by factors such as closeness to the client and how much disk space each server is using. The master also appoints the primary replica (the lease holder) in this step.
  • client locates the closest server and pushes the chunk data to it (not always the primary); that chunkserver passes the data along to all replicas
  • the written data won’t be stored on disk immediately; instead it sits in a cache on each chunkserver
  • once the data is held in the local cache of every replica, the client sends a request to the primary replica server to commit the data across all disks
  • the primary server organises the commit of all replicas onto the hard drives
  • once the primary server receives confirmation that the data has been successfully stored on the disks, it sends a confirmation status back to the client
13
Q

Name the three major types of metadata in GFS. What is this metadata used for in GFS?

A
  • The file and chunk namespaces
  • mapping from files to chunks
  • locations of each chunk’s replicas

All are kept in the master’s memory and used as a lookup table for client requests to locate chunks. This metadata describes the data held in the chunkservers.

14
Q

What is the GFS master’s operation log used for?

A
  • kept on the master node as a persistent record of critical metadata changes
  • consists of the namespace and file-to-chunk mappings
  • replicated on remote machines
15
Q

What is Hadoop Distributed File System (HDFS)?

A
  • is the file system of Apache Hadoop
  • key processing function is the MapReduce model
  • open-source software
  • used to solve big data problems
16
Q

What is the HDFS MapReduce programming model used for/what does it do?

A

Hadoop splits files into large blocks (by default double the size of GFS chunks) and distributes them across nodes in a cluster. It then ships the packaged processing code to the nodes that hold the data, taking advantage of data locality.
Blocks = GFS Chunks

17
Q

Key differences and similarities between GFS and HDFS?

A

Similarities:

  • both designed to support very big data
  • provide two types of read (large streaming + small random)
  • focus on throughput rather than the speed/latency of individual outputs
  • both name/master nodes use in-memory storage of metadata

Differences:

  • once written, files are seldom modified in HDFS
  • (know) HDFS only allows one user accessing a file with write permissions at any one time, as opposed to GFS’ lease system allowing multiple writes at once (which are ordered).
  • GFS runs only on the Linux platform; HDFS is available on Mac, Linux and Windows
  • (know) GFS is C, C++ environment; HDFS is Java
  • (know) HDFS is open-source and free
  • architecture terminology differences (GFS Master node, chunk server, and chunks of data; HDFS namenode, datanodes, and data blocks)
18
Q

Describe the HDFS NameNode?

A
  • equivalent to Master Node in GFS
  • represents files and directories on the NameNode as inodes
  • maintains namespace tree and mapping of file blocks to Datanodes
  • when writing data, NameNode nominates DataNodes to host replicas
  • keeps Meta-Data in RAM
  • responsible for replication, load balancing, maintenance, heartbeat responses, etc.
19
Q

What does the HDFS Meta-Data component involve?

A

Stored in the NameNode.
fsimage:
- contains the entire filesystem namespace at the latest checkpoint
- blocks’ information of the file (location, timestamp, etc.)
- folder information (ownership, access, etc.)
- stored as an image file in the NameNode’s local file system

editlog:

  • contains all the recent modifications made to the file system on the most recent fsImage
  • create/update/delete requests from the client
20
Q

Describe the workflow of the HDFS Checkpoint Node

A
  • Secondary NameNode
  • regularly queries the primary NameNode for the fsimage and editlogs
  • keeps edit logs to enable rollback to a previous state
  1. The primary NameNode stops writing to the editlog and copies the edits and fsimage to the secondary NameNode
  2. All new edits after that are fed into edits.new
  3. The copied edits and fsimage are merged on the secondary NameNode
  4. Copy the merged fsimage.ckpt back to primary NameNode and use it as the new fsimage
  5. Finally, editlogs file gets smaller and fsimage gets updated
21
Q

HDFS component - DataNode

A
  • equivalent to the GFS chunk server
  • each block replica is stored as two files: the data itself and the block’s meta-data, which is reported to the NameNode along with the replica blocks
  • when started up, it verifies its namespace ID and software version with the NameNode
  • its internal storage ID is an identifier of the DataNode within the cluster and never changes - each node has its own local disk storage
  • sends heartbeats (every 3 sec) to NameNode to communicate status
  • receives maintenance commands from the NameNode indirectly (replicate blocks to other nodes, remove local block replicas, re-register, shut down, send block report, etc.)
22
Q

List the HDFS Components?

A
  • NameNode, which contains (Meta-Data (fsimage, editlog))
  • Checkpoint Node (secondary NameNode) (copies editlogs and fsimages)
  • DataNode (contains data + block meta-data, has namespace ID, software version, storage ID identifier) (sends heartbeats to NameNode, receives maintenance commands from NameNode)
  • Client (Reading, Writing operations)
23
Q

Describe the workflow of the HDFS Client Read operation

A
  1. first asks the NameNode for the list of DataNodes that host replicas of the file’s blocks
  2. contacts a DataNode directly and requests the transfer

24
Q

Describe the workflow of the HDFS Client Write operation

A
  1. first asks NameNode to choose DataNodes to host replicas of the first block of the file
  2. organises a pipeline from node-to-node and sends the client data to first DataNode
  3. requests new DataNodes to be chosen to host replicas of the next block as well as a new pipeline
  4. each choice of DataNodes is likely to be different
25
Q

What is the HDFS block placement policy?

A

When creating a new block, the policy is as follows:

  • No more than one replica is placed at one node
  • No more than two replicas are placed in the same rack when the number of replicas is less than twice the number of racks.
26
Q

What is Hadoop? Core components? Base components?

A
  • open-source software platform
  • for distributed storage and distributed processing of very large data sets (structured and unstructured) on computer clusters (highly scalable and available)
  • built from commodity hardware (fault tolerant)

Core components:

  • storage: HDFS
  • computation model: MapReduce programming model

The Apache Hadoop framework and its ecosystem include the following modules:

  • Hadoop YARN: cluster resource manager
  • Hadoop Hive: data warehouse that supports a high-level, SQL-like query language
  • HBase: NoSQL database
  • Hadoop Common: contains libraries and utilities needed by other Hadoop modules
27
Q

Why use Hadoop?

A
  • need to process multi petabyte (large-scale) data sets
  • data may not have strict schema
  • need common infrastructure and horizontal scalability
  • very large Distributed File System
  • need to consider node failure and ensure system resilience
  • reliable
  • fault-tolerant
28
Q

What is Hadoop MapReduce?

What are its components?

A
  • MapReduce is the data processing component of Hadoop
  • program transforms lists of input data elements into lists of output data elements (key-value pairs)
  • map and reduce tasks occur asynchronously
  • a software framework for easily processing vast amounts of data in parallel on large clusters of commodity hardware
  • similar to the component design of the Docker Swarm, Kubernetes etc.

Components:
Clients (interaction with the end user)
- user jobs are submitted to the JobTracker via the client
- users can view job running status through the client interface

JobTracker:

  • monitors resources and coordinate jobs
  • schedules jobs submitted by clients
  • monitor health of all the TaskTrackers
  • manage node failures and transfer jobs accordingly

TaskTracker:

  • periodically sends heartbeats with job execution status to the JobTracker
  • receives and executes commands from the JobTracker

Task:

  • includes: map task, and reduce task
  • initiated by TaskTracker
29
Q

What is shuffling in MapReduce?

A
  • MapReduce first splits the input data, then maps each element to a key-value pair
  • the shuffle process is needed to transfer the mapped data to the reducers
  • shuffling merges the data so that each key has all of its values in one place
  • reducing then aggregates the shuffled values per key (e.g. totalling the tallies produced by the shuffle)
  • merge all results for final output

input –> splitting –> mapping –> shuffling –> reducing –> final result merge
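A sketch of that pipeline in plain Scala collections (not the actual Hadoop API), where groupBy plays the role of the shuffle; the input lines are made-up sample data:

object WordCountSketch extends App {
  val splits = List("deer bear river", "car car river", "deer car bear") // splitting: one line per split
  val mapped = splits.flatMap(_.split(" ")).map(word => (word, 1)) // mapping: emit (key, 1) pairs
  val shuffled = mapped.groupBy(_._1) // shuffling: all values for a key end up together
  val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) } // reducing: aggregate per key
  println(reduced) // e.g. Map(deer -> 2, bear -> 2, river -> 2, car -> 3)
}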

30
Q

What is the combiner used for in the MapReduce function, if used?

A
  • it is a mini-reducer
  • performs local aggregation on the mappers output, to minimise the transfer between mapper and reducer
  • improves overall performance

Note: be careful when using a combiner, as it can change the final result if its logic doesn’t compose correctly with the reducer’s (e.g. non-associative operations such as averages)
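A plain-Scala sketch of this local (map-side) aggregation with invented data; it is safe here only because addition is associative, which is exactly the kind of condition the note above warns about:

object CombinerSketch extends App {
  val mapperOutputs = List( // (word, 1) pairs produced by two separate mappers
    List(("car", 1), ("car", 1), ("bear", 1)),
    List(("car", 1), ("deer", 1)))
  // combiner: aggregate locally on each mapper before anything is shuffled to the reducer
  val combined = mapperOutputs.map(_.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) })
  // reducer: merge the already-compacted partial counts
  val result = combined.flatten.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
  println(result) // e.g. Map(car -> 3, bear -> 1, deer -> 1)
}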

31
Q

Compare Hadoop Hive and Pig

A

Similarities:

  • both high level languages
  • work on top of Map-Reduce framework
  • use underlying HDFS and map-reduce

Differences:

  • language (Pig is procedural, Hive is declarative)
  • work type (Pig is more suited to ad-hoc analysis, like streaming search logs; Hive is a reporting tool, e.g. weekly reports)
  • users (Pig: researchers, programmers; Hive: business analysts)
  • Hive = structured data, Pig = semistructured data
  • Hive works on server side of cluster, Pig works on client side of a cluster
32
Q

What is Hadoop Hive?

Components?

A
  • supports analysis of large datasets stored in Hadoop’s HDFS and compatible file systems
  • provides SQL-like query language called HiveQL
  • transparently converts queries to MapReduce
  • HiveQL has full ACID properties
  • data warehouse best suited to OLAP rather than OLTP

Components:

  • organised into tables, partitions, and buckets
  • tables are like relational tables where data is stored
  • tables can be broken into partitions, which determine how data is distributed among subdirectories
  • data in each partition can be further divided into buckets
  • bucketing is based on a hash function
  • each bucket is stored as a file in partition directory found in HDFS
33
Q

What is OLAP and OLTP?

A

OLTP:

  • Online transaction processing
  • class of information systems that facilitate and manage transaction-oriented applications
  • typically data entry and retrieval transaction processing
  • concurrently used by many users
  • frequent updates
  • HBase is a NoSQL database for OLTP

OLAP:

  • online analytical processing
  • class of systems designed to respond to multi-dimensional analytical queries
  • many rows and columns of data - complex queries
  • responses do not need to be prompt
  • Hive is a data warehouse suitable for OLAP
34
Q

What is Apache Pig?

A
  • high-level platform for creating programs that run on Hadoop
  • the language is called Pig Latin
  • can execute its jobs in MapReduce or Apache Spark
  • can be extended into different languages
  • no loops or conditions
36
Q

What is Apache Spark? Characteristics?

A
  • open-source distributed, cluster-computing framework
  • achieves high performance in terms of processing speed, as data is processed mainly in the memory (cache) of worker nodes to prevent unnecessary I/O operations on disks
  • the high volume of I/O operations on disks is a downfall of Hadoop MapReduce

Characteristics:

  • ease of use (many available operations to build parallel apps; highly accessible with supported languages and an interactive mode)
  • iterative processing (suitable for machine learning algorithms; a directed acyclic graph execution model allows running in parallel; data can be loaded and queried repeatedly; data abstractions for structured and unstructured data sets (RDDs, DataFrames))
37
Q

What is a Resilient Distributed Dataset (RDD)?

What are the three types of jobs for RDDs?

A
  • a fundamental data structure of Spark
  • an RDD is a read-only (immutable) distributed collection of objects/elements
  • distributed: each RDD dataset is divided into logical partitions, computed by many worker nodes in the cluster
  • resilient: an RDD can be self-recovered (recomputed) in case of failure
  • datasets: JSON file, CSV file, text file, etc.
  • data manipulation is heavily used on RDDs in Spark

Jobs:
- creating (new), transforming (modifying existing RDDs), action (compute a result).
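A minimal spark-shell sketch of the three kinds of work, assuming sc is the SparkContext that spark-shell provides:

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5)) // creating: a new RDD from a collection
val doubled = nums.map(_ * 2) // transforming: returns a new RDD, nothing runs yet
val total = doubled.reduce(_ + _) // action: triggers the computation, returns 30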

38
Q

What does parallelize mean in Spark programming?

A

Spark automatically distributes the data contained in RDDs across the cluster and parallelises the operations you perform on them.
When a task is distributed in Spark, the data being operated on is split across different nodes in the cluster, and the tasks are performed concurrently.

  • can only parallelize an existing in-memory collection; external data such as a text file must be loaded (e.g. with textFile) to create a new RDD - see the example below
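For example, in spark-shell (sc is the predefined SparkContext; the file path is just a placeholder):

val fromCollection = sc.parallelize(List("a", "b", "c")) // parallelize an existing in-memory collection
val fromFile = sc.textFile("hdfs:///some/input.txt") // external data is loaded with textFile, not parallelize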
39
Q

What is the RDD Partition Policy?

A
  • by default, the number of partitions equals the number of CPU cores in the cluster (see below)
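A quick way to check this in spark-shell (actual defaults depend on the cluster manager and configuration):

sc.defaultParallelism // typically the total number of CPU cores available to the cluster
val rdd = sc.parallelize(1 to 100) // uses the default number of partitions
rdd.getNumPartitions // inspect how many partitions were created
sc.parallelize(1 to 100, 8).getNumPartitions // the partition count can also be set explicitly (8)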
40
Q

What are Transformations (RDD Operation)? Name some common transformations.

A
  • Transformations are operations on RDDs that return a new RDD.
  • Computed lazily (only evaluated when an action needs them)

Common transformations:

  • Map
  • filter
  • flatMap
  • union (and other set-theory operations)
  • sortByKey
  • join
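A spark-shell sketch showing that transformations only define new RDDs (assuming sc is the shell’s SparkContext):

val lines = sc.parallelize(Seq("spark is fast", "hadoop is scalable"))
val words = lines.flatMap(_.split(" ")) // transformation: nothing executes yet
val short = words.filter(_.length <= 5) // still lazy: just another RDD definition
short.collect() // only this action forces the transformations above to run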
41
Q

What are Actions (RDD operation)? Name some common Action operations.

A
  • trigger job execution that forces the evaluation of all the transformations
  • must return a final value
  • the results of actions are returned to the driver or stored in an external storage system
  • brings the laziness of RDDs into motion (runs the previously requested transformations)

Common Actions:

  • count
  • take (collects a number of elements from the RDD)
  • collect
  • reduce
  • first
  • foreach
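A few of these actions in spark-shell (sc is the shell’s SparkContext; the numbers are sample data):

val nums = sc.parallelize(Seq(5, 3, 8, 1))
nums.count() // 4: number of elements
nums.take(2) // Array(5, 3): the first two elements
nums.reduce(_ + _) // 17: aggregate to a single value returned to the driver
nums.first() // 5
nums.foreach(println) // runs on the executors, so output may appear in executor logs rather than on the driver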
42
Q

Difference between reduce and ReduceByKey?

A

Reduce must pull the entire dataset down into a single location because it reduces it to one final value (it is an action).
ReduceByKey produces one value for each key; since it can be run on each machine locally first, the result can remain an RDD and have further transformations applied to it (it is a transformation). It is generally used on key/value pairs.
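For example, in spark-shell:

val nums = sc.parallelize(Seq(1, 2, 3, 4))
nums.reduce(_ + _) // action: a single value (10) returned to the driver
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect() // transformation then action: one value per key, e.g. Array((a,4), (b,2))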

43
Q

How to create pair RDDs?

A

use the map() method:

map(x => (x,1))

44
Q

Paired RDDs transformation examples?

A
  • reduceByKey
  • groupByKey
  • keys
  • values
  • mapValues
  • sortByKey
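Small spark-shell examples of these transformations (the sales data is made up):

val sales = sc.parallelize(Seq(("tv", 2), ("radio", 1), ("tv", 3)))
sales.reduceByKey(_ + _) // (tv,5), (radio,1)
sales.groupByKey() // (tv, [2, 3]), (radio, [1])
sales.keys // tv, radio, tv
sales.values // 2, 1, 3
sales.mapValues(_ * 10) // (tv,20), (radio,10), (tv,30)
sales.sortByKey() // ordered by key: radio before tv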
45
Q

Paired RDDs Action examples?

A
  • countByKey()
  • collectAsMap()
  • lookup(key)
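Continuing the same made-up sales pair RDD in spark-shell:

val sales = sc.parallelize(Seq(("tv", 2), ("radio", 1), ("tv", 3)))
sales.countByKey() // Map(tv -> 2, radio -> 1): number of pairs per key
sales.collectAsMap() // a Map with one value kept per key (e.g. tv -> 3)
sales.lookup("tv") // Seq(2, 3): all values for that key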
46
Q

What is a RDD Lineage Graph?

A
  • due to the lazy nature of RDDs, dependencies between RDDs are logged in a lineage graph
  • regarded as a logical execution plan of RDD transformations (as they need to be done in a specific order when triggered by the next action)
  • this plan is run through an optimiser when an action is executed
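The recorded lineage can be inspected with toDebugString in spark-shell (the file path is a placeholder):

val words = sc.textFile("input.txt").flatMap(_.split(" "))
val counts = words.map(x => (x, 1)).reduceByKey(_ + _)
println(counts.toDebugString) // prints the chain of parent RDDs, i.e. the lineage / logical plan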
47
Q

Describe the following terms found in spark: Job, stages, tasks.

A

Job: a piece of code which reads some input from HDFS or local storage, performs some computation on the data, and writes some output. Read, write, etc.

Stages: jobs are divided into stages. E.g. map or reduce stages. Stages are divided based on computational boundaries. Each stage is further divided into tasks based on the number of partitions in the RDD.

Tasks: each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor. Can have many tasks in one stage. The smallest unit of work for Spark.

48
Q

What is SparkContext?

A
  • the main entry point to Spark functionality
  • responsible for calculating the dependencies between RDDs and building DAGs

49
Q

What are Narrow and Wide Dependencies? Examples?

A

The two types of transformation dependencies. Note: parent = the input RDD, child = the RDD output by the transformation

Narrow transformation (dependencies):

  • each partition of the parent RDD is used by at most one partition of the child RDD
  • allows for pipelined execution on one cluster node
  • failure recovery is more efficient as only the lost parent partitions need to be recomputed - if a child partition is lost, its parent can be used to recover it
  • example: map, flatMap, filter, sample, union, etc.

Wide transformation (dependencies):

  • many-to-many relationship between child partitions and parent partitions
  • multiple child partitions may depend on one parent partition
  • a complete re-computation is needed if some partition is lost from all the ancestors
  • example: groupByKey, reduceByKey
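For example, in spark-shell:

val nums = sc.parallelize(1 to 10, 4) // 4 partitions
val narrow = nums.map(_ * 2).filter(_ > 5) // narrow: each output partition depends on a single input partition
val wide = narrow.map(n => (n % 3, n)).reduceByKey(_ + _) // wide: a shuffle redistributes values for each key
wide.collect() // triggers both stages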
50
Q

What is a Directed Acyclic Graph (DAG)? When is it used?

A
  • a set of vertices and edges
  • vertices represent the RDDs
  • edges represent the operations to be applied to the RDDs (from earliest to latest applied)
  • DAG is a finite directed graph with no directed cycles.
  • DAG operations can do better global optimisation than other systems like MapReduce.

When an action is called, the created DAG is submitted to the DAG scheduler, which further splits the graph into the stages of the job. The task scheduler then launches the tasks of each stage specified in the DAG via the cluster manager.

51
Q

If a DAG is a single stage operation, what kind of transformation is it?

A

Narrow.

Eg: mapping, filtering, union.

52
Q

If a DAG is a multiple stage operation that can trigger a shuffle, what kind of transformation is it?

A

Wide transformation.

Eg. reduceByKey

53
Q

What are the steps of how Spark works?

A
  1. Create RDD object using Scala interpreter
  2. SparkContext is responsible for calculating the dependencies between RDDs and building DAGs
  3. After an Action operator is called: the DAG scheduler decomposes the DAG graph by pipelining operators together, creating stages. Each stage contains multiple tasks, depending on the partitions of the input data.
  4. Task scheduler launches tasks to distribute across the worker nodes via cluster manager. Task scheduler doesn’t know about dependencies among stages.
  5. Worker executes tasks, with a new JVM started per job.