Weeks 7 - 11 Flashcards

1
Q

What is a Distributed File System (DFS)? What are its characteristics?

A
  • a distributed implementation of a file system, spread over multiple autonomous computers
  • can exist in any system that has servers (source of files) and clients (accessing servers/files)
2
Q

Describe the Network File System (NFS)? What are the advantages of using it?

A
  • distributed file system protocol
  • allows a user on a client computer to access files over a computer network (much like local storage access)
  • client-server architecture application where a user can view, store and update the files on a remote computer
  • made up of a server, sharing to a network of clients.

Advantages:

  • easy sharing of data across clients
  • centralised administration (backup done on multiple servers instead of many clients)
  • security (server behind firewall)
  • transparent access to remote files
3
Q

How does the Network File System architecture work?

A
  • a virtual file system (VFS) in the OS acts as an interface between the system-call layer and all files in network nodes
  • the VFS is middleware used to decide the destination of a client request, passing calls/requests either to a local file system or to the NFS client
  • a VFS is available in most OSes as the interface to different local and distributed file systems
  • VFS is located on both the client and server
4
Q

What is a Clustered File System (CFS)? Why use it?

A

CFS is a cluster of servers that work together to provide high performance service to their clients. To clients of CFS, the cluster is transparent.
- uses a metadata service (master) to direct and organise storage

Why?

  • for bigger scale of data storage
  • scalability and availability
  • resiliency and load balancing for large volume of client requests
  • conceptually similar to Kubernetes (a cluster of machines presented to clients as one service)
5
Q

How does a clustered file system (CFS) work in terms of storage?

A

Data is divided into segments/chunks/blocks that are stored across data nodes; files are striped for parallel access.

This creates resiliency at the block level: if one server goes down, only a segment of the data is affected, not the whole dataset. This is necessary because of the huge volume of data in use, which is tracked with metadata.

The metadata server acts as the master. It is used to assign tasks from the client service to the appropriate storage server.

6
Q

What is Google File System (GFS)?

A
  • A scalable, distributed file system
  • uses large clusters of commodity hardware
  • commodity hardware is cheap hardware that fails often, so the system is designed for high failure tolerance
  • horizontally scalable via commodity hardware nodes
  • designed to deal with big data and its metadata
  • designed to allow multiple users to write (append) to one file at the same time - availability
  • response time for individual read and write is not critical, throughput is prioritised

must support two types of operations: reads and writes

  • writing here mostly means appending and adding data to existing chunks/blocks
  • differs from a usual DFS, which generally doesn’t allow multiple users to write to the same file at the same time like GFS does
7
Q

Why not use NFS for big data storage?

A

NFS is comparatively unreliable at this scale, with the potential for data loss, as opposed to CFS and other DFS designs that make data replicas for reliability.

8
Q

Talk through the GFS Architecture Read Operation

A
  • application request goes through client interaction
  • client sends chunk index and file name to master node
  • master looks up request in big table for IP and locations of each of the chunks, returns these values to client where client caches this information
  • client identifies closest location chunk
  • client requests the data from the chunkserver and reads it directly from that server - a direct read between client and chunkserver

Note:
To reduce the workload on the master node, chunks that are repeatedly read are cached on the client: the cache holds the chunk ID and chunkserver location so the data can be accessed directly, without master interaction, in future. This improves the performance and speed of the system.
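A toy, self-contained Scala sketch of this read flow; the table contents, addresses, and helper names are invented for illustration and are not a real GFS API:

object GfsReadSketch extends App {
  // "master" lookup table: (file name, chunk index) -> replica chunkserver addresses
  val masterTable = Map(("logfile", 0) -> Seq("10.0.0.1", "10.0.0.2", "10.0.0.3"))
  // client-side cache, filled after the first lookup so later reads skip the master
  val clientCache = scala.collection.mutable.Map.empty[(String, Int), Seq[String]]

  def read(file: String, chunkIndex: Int): String = {
    val replicas = clientCache.getOrElseUpdate((file, chunkIndex), masterTable((file, chunkIndex)))
    val closest = replicas.head // stand-in for "pick the closest chunkserver"
    s"<chunk $chunkIndex of $file read directly from chunkserver $closest>"
  }

  println(read("logfile", 0)) // first read consults the master table
  println(read("logfile", 0)) // repeat read is served from the client cache
}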

9
Q

Key differences between DFS, NFS and CFS?

A
  • DFS is the umbrella term and includes both NFS and CFS (data is stored across nodes in different locations)
  • both CFS and NFS use a network for communication, but NFS has limited storage, reliability, and scalability
  • in comparison to NFS, CFS provides highly available, scalable storage capabilities and high resiliency features
10
Q

GFS design overview?

A

Design overview:

  • files stored as fixed-size chunks (64 MB) on separate servers/nodes
  • as commodity servers have a high failure rate, replication across nodes is crucial for resiliency
  • the replication factor (for reliability) is 3 by default and can be set manually
  • single master - centralised management
  • meta-data store
11
Q

Describe the Google File System (GFS) Master components? Pros/cons?

A
  • maintains all system metadata
  • periodically communicates with chunkservers through heartbeat messages
  • a single master is a single point of failure, but the master state is replicated across multiple machines
  • log and check points are replicated on multiple machines
12
Q

Describe the write process in GFS?

A
  • Client requests from the master node the details (IP and location) of the chunkserver that holds the current lease on the requested chunk, plus its replicas
  • master returns the locations and IPs of all chunk replicas, prioritising them by factors such as closeness to the client and how much disk space each server is using. The master also appoints the primary replica (the lease holder) in this step.
  • client locates the closest server and pushes the chunk data to it (not always the primary); that chunkserver passes the data along to all replicas
  • the written data won’t be stored on disk immediately; instead it sits in a cache on each chunkserver
  • once the data is held in the local cache of every replica, the client sends a request to the primary replica server to commit the data across all disks
  • the primary server organises the commit of all replicas onto the hard drives
  • once the primary server receives confirmation that the data has been successfully stored on the disks, it sends a confirmation status back to the client
13
Q

Name the three major types of metadata in GFS. What is this metadata used for in GFS?

A
  • The file and chunk namespaces
  • mapping from files to chunks
  • locations of each chunk’s replicas

All are kept in the master’s memory and used as a lookup table for client requests to locate chunks. This metadata describes the data held in the chunkservers.

14
Q

What is the GFS master’s operation log used for?

A
  • kept on the master node as a persistent record of critical metadata changes
  • consists of the namespace and file-to-chunk mappings
  • replicated on remote machines
15
Q

What is Hadoop Distributed File System (HDFS)?

A
  • is the file system of Apache Hadoop
  • key processing function is the MapReduce model
  • open-source software
  • used to solve big data problems
16
Q

What is the HDFS MapReduce programming model used for/what does it do?

A

Hadoop splits files into large blocks (by default double the size of GFS chunks) and distributes them across nodes in a cluster. It then ships the packaged processing code to the nodes that hold the data, taking advantage of data locality.
Blocks = GFS Chunks

17
Q

Key differences and similarities between GFS and HDFS?

A

Similarities:

  • both designed to support very big data
  • provide two types of read (large streaming + small random)
  • focus on throughput rather than the speed/latency of individual outputs
  • both name/master nodes use in-memory storage of metadata

Differences:

  • once written, files are seldom modified in HDFS
  • (know) HDFS only allows one user accessing a file with write permissions at any one time, as opposed to GFS’ lease system allowing multiple writes at once (which are ordered).
  • GFS runs only on the Linux platform; HDFS is available on Mac, Linux and Windows
  • (know) GFS is C, C++ environment; HDFS is Java
  • (know) HDFS is open-source and free
  • architecture terminology differences (GFS Master node, chunk server, and chunks of data; HDFS namenode, datanodes, and data blocks)
18
Q

Describe the HDFS NameNode?

A
  • equivalent to Master Node in GFS
  • represents files and directories on the NameNode as inodes
  • maintains namespace tree and mapping of file blocks to Datanodes
  • when writing data, NameNode nominates DataNodes to host replicas
  • keeps Meta-Data in RAM
  • responsible for replication, load balancing, maintenance, heartbeat responses, etc.
19
Q

What does the HDFS Meta-Data component involve?

A

Stored in the NameNode.
fsimage:
- contains the entire filesystem namespace at the latest checkpoint
- blocks’ information of the file (location, timestamp, etc.)
- folder information (ownership, access, etc.)
- stored as an image file in the NameNode’s local file system

editlog:

  • contains all the recent modifications made to the file system on the most recent fsImage
  • create/update/delete requests from the client
20
Q

Describe the workflow of the HDFS Checkpoint Node

A
  • Secondary NameNode
  • regularly queries the primary NameNode for the fsimage and editlogs
  • keeps edit logs to enable rollback to a previous state
  1. The primary NameNode stops writing to the editlog and copies the edits and fsimage to the secondary NameNode
  2. All new edits after that are fed into edits.new
  3. The copied edits and fsimage are merged on the secondary NameNode
  4. Copy the merged fsimage.ckpt back to primary NameNode and use it as the new fsimage
  5. Finally, editlogs file gets smaller and fsimage gets updated
21
Q

HDFS component - DataNode

A
  • equivalent to the GFS chunk server
  • each block replica is stored as two files: the data itself and the block’s meta-data, which is reported to the NameNode along with the replica blocks
  • when started up, it verifies its namespace ID and software version with the NameNode
  • its internal storage ID is an identifier of the DataNode within the cluster and never changes - each node has its own local disk storage
  • sends heartbeats (every 3 sec) to NameNode to communicate status
  • receives maintenance commands from the NameNode indirectly (replicate blocks to other nodes, remove local block replicas, re-register, shut down, send block report, etc.)
22
Q

List the HDFS Components?

A
  • NameNode, which contains (Meta-Data (fsimage, editlog))
  • Checkpoint Node (secondary NameNode) (copies editlogs and fsimages)
  • DataNode (contains data + block meta-data, has namespace ID, software version, storage ID identifier) (sends heartbeats to NameNode, receives maintenance commands from NameNode)
  • Client (Reading, Writing operations)
23
Q

Describe the workflow of the HDFS Client Read operation

A
  1. first asks the NameNode for the list of DataNodes that host replicas of the file’s blocks
  2. contacts a DataNode directly and requests the transfer

24
Q

Describe the workflow of the HDFS Client Write operation

A
  1. first asks NameNode to choose DataNodes to host replicas of the first block of the file
  2. organises a pipeline from node-to-node and sends the client data to first DataNode
  3. requests new DataNodes to be chosen to host replicas of the next block as well as a new pipeline
  4. each choice of DataNodes is likely to be different
25
Q

What is the HDFS block placement policy?

A

When creating a new block, the policy is as follows:

  • No more than one replica is placed at one node
  • No more than two replicas are placed in the same rack when the number of replicas is less than twice the number of racks.
26
Q

What is Hadoop? Core components? Base components?

A
  • open-source software platform
  • for distributed storage and distributed processing of very large data sets (structured and unstructured) on computer clusters (highly scalable and available)
  • built from commodity hardware (fault tolerant)

Core components:

  • storage: HDFS
  • computation model: MapReduce programming model

The Apache Hadoop framework and its ecosystem include the following modules:

  • Hadoop YARN: cluster resource manager
  • Hadoop Hive: data warehouse that supports a high-level, SQL-like query language
  • HBase: NoSQL database
  • Hadoop Common: contains libraries and utilities needed by other Hadoop modules
27
Q

Why use Hadoop?

A
  • need to process multi petabyte (large-scale) data sets
  • data may not have strict schema
  • need common infrastructure and horizontal scalability
  • very large Distributed File System
  • need to consider node failure and ensure system resilience
  • reliable
  • fault-tolerant
28
Q

What is Hadoop MapReduce?

What are its components?

A
  • MapReduce is the data processing component of Hadoop
  • program transforms lists of input data elements into lists of output data elements (key-value pairs)
  • map and reduce tasks occur asynchronously
  • a software framework for easily processing vast amounts of data in parallel on large clusters of commodity hardware
  • similar to the component design of the Docker Swarm, Kubernetes etc.

Components:
Clients (interaction with the end user)
- user jobs are submitted to the JobTracker via the client
- users can view job running status through the client interface

JobTracker:

  • monitors resources and coordinate jobs
  • schedules jobs submitted by clients
  • monitor health of all the TaskTrackers
  • manage node failures and transfer jobs accordingly

TaskTracker:

  • periodically sends heartbeats with job execution status to the JobTracker
  • receives and executes commands from the JobTracker

Task:

  • includes: map task, and reduce task
  • initiated by TaskTracker
29
Q

What is shuffling in MapReduce?

A
  • MapReduce first splits the input data, then maps each element to a key-value pair
  • the shuffle process is needed to transfer the mapped data to the reducers
  • shuffling merges the data so that each key has all of its values in one place
  • reducing then aggregates the shuffled values per key (e.g. totalling the tallies produced by the shuffle)
  • merge all results for final output

input –> splitting –> mapping –> shuffling –> reducing –> final result merge
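A sketch of that pipeline in plain Scala collections (not the actual Hadoop API), where groupBy plays the role of the shuffle; the input lines are made-up sample data:

object WordCountSketch extends App {
  val splits = List("deer bear river", "car car river", "deer car bear") // splitting: one line per split
  val mapped = splits.flatMap(_.split(" ")).map(word => (word, 1)) // mapping: emit (key, 1) pairs
  val shuffled = mapped.groupBy(_._1) // shuffling: all values for a key end up together
  val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) } // reducing: aggregate per key
  println(reduced) // e.g. Map(deer -> 2, bear -> 2, river -> 2, car -> 3)
}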

30
Q

What is the combiner used for in the MapReduce function, if used?

A
  • it is a mini-reducer
  • performs local aggregation on the mappers output, to minimise the transfer between mapper and reducer
  • improves overall performance

Note: be careful when using a combiner, as it can change the final result if its logic doesn’t compose correctly with the reducer’s (e.g. non-associative operations such as averages)
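A plain-Scala sketch of this local (map-side) aggregation with invented data; it is safe here only because addition is associative, which is exactly the kind of condition the note above warns about:

object CombinerSketch extends App {
  val mapperOutputs = List( // (word, 1) pairs produced by two separate mappers
    List(("car", 1), ("car", 1), ("bear", 1)),
    List(("car", 1), ("deer", 1)))
  // combiner: aggregate locally on each mapper before anything is shuffled to the reducer
  val combined = mapperOutputs.map(_.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) })
  // reducer: merge the already-compacted partial counts
  val result = combined.flatten.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
  println(result) // e.g. Map(car -> 3, bear -> 1, deer -> 1)
}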

31
Q

Compare Hadoop Hive and Pig

A

Similarities:

  • both high level languages
  • work on top of Map-Reduce framework
  • use underlying HDFS and map-reduce

Differences:

  • language (Pig is procedural, Hive is declarative)
  • work type (Pig is more suited to ad-hoc analysis, like streaming search logs; Hive is a reporting tool, e.g. weekly reports)
  • users (Pig: researchers, programmers; Hive: business analysts)
  • Hive = structured data, Pig = semistructured data
  • Hive works on server side of cluster, Pig works on client side of a cluster
32
Q

What is Hadoop Hive?

Components?

A
  • supports analysis of large datasets stored in Hadoop’s HDFS and compatible file systems
  • provides SQL-like query language called HiveQL
  • transparently converts queries to MapReduce
  • HiveQL has full ACID properties
  • data warehouse best suited to OLAP rather than OLTP

Components:

  • organised into tables, partitions, and buckets
  • tables are like relational tables where data is stored
  • tables can be broken into partitions, which determine how data is distributed among subdirectories
  • data in each partition can be further divided into buckets
  • bucketing is based on a hash function
  • each bucket is stored as a file in partition directory found in HDFS
33
Q

What is OLAP and OLTP?

A

OLTP:

  • Online transaction processing
  • class of information systems that facilitate and manage transaction-oriented applications
  • typically data entry and retrieval transaction processing
  • concurrently used by many users
  • frequent updates
  • HBase is a NoSQL database for OLTP

OLAP:

  • online analytical processing
  • class of systems designed to respond to multi-dimensional analytical queries
  • many rows and columns of data - complex queries
  • responses do not need to be prompt
  • Hive is a data warehouse suitable for OLAP
34
Q

What is Apache Pig?

A
  • high-level platform for creating programs that run on Hadoop
  • the language is called Pig Latin
  • can execute its jobs in MapReduce or Apache Spark
  • can be extended into different languages
  • no loops or conditions
36
Q

What is Apache Spark? Characteristics?

A
  • open-source distributed, cluster-computing framework
  • achieves high performance in terms of processing speed, as data is processed mainly in the memory (cache) of worker nodes to prevent unnecessary I/O operations on disks
  • the high volume of I/O operations on disks is a downfall of Hadoop MapReduce

Characteristics:

  • ease of use (many available operations to build parallel apps; highly accessible with supported languages and an interactive mode)
  • iterative processing (suitable for machine learning algorithms; a directed acyclic graph execution model allows running in parallel; data can be loaded and queried repeatedly; data abstractions for structured and unstructured data sets (RDDs, DataFrames))
37
Q

What is a Resilient Distributed Dataset (RDD)?

What are the three types of jobs for RDDs?

A
  • a fundamental data structure of Spark
  • an RDD is a read-only (immutable) distributed collection of objects/elements
  • distributed: each RDD dataset is divided into logical partitions, computed by many worker nodes in the cluster
  • resilient: an RDD can be self-recovered (recomputed) in case of failure
  • datasets: JSON file, CSV file, text file, etc.
  • data manipulation is heavily used on RDDs in Spark

Jobs:
- creating (new), transforming (modifying existing RDDs), action (compute a result).
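A minimal spark-shell sketch of the three kinds of work, assuming sc is the SparkContext that spark-shell provides:

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5)) // creating: a new RDD from a collection
val doubled = nums.map(_ * 2) // transforming: returns a new RDD, nothing runs yet
val total = doubled.reduce(_ + _) // action: triggers the computation, returns 30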

38
Q

What does parallelize mean in Spark programming?

A

Spark automatically distributes the data contained in RDDs across the cluster and parallelises the operations you perform on them.
When a task is distributed in Spark, the data being operated on is split across different nodes in the cluster, and the tasks are performed concurrently.

  • can only parallelize an existing in-memory collection; external data such as a text file must be loaded (e.g. with textFile) to create a new RDD - see the example below
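For example, in spark-shell (sc is the predefined SparkContext; the file path is just a placeholder):

val fromCollection = sc.parallelize(List("a", "b", "c")) // parallelize an existing in-memory collection
val fromFile = sc.textFile("hdfs:///some/input.txt") // external data is loaded with textFile, not parallelize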
39
Q

What is the RDD Partition Policy?

A
  • by default, the number of partitions equals the number of CPU cores in the cluster (see below)
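A quick way to check this in spark-shell (actual defaults depend on the cluster manager and configuration):

sc.defaultParallelism // typically the total number of CPU cores available to the cluster
val rdd = sc.parallelize(1 to 100) // uses the default number of partitions
rdd.getNumPartitions // inspect how many partitions were created
sc.parallelize(1 to 100, 8).getNumPartitions // the partition count can also be set explicitly (8)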
40
Q

What are Transformations (RDD Operation)? Name some common transformations.

A
  • Transformations are operations on RDDs that return a new RDD.
  • Computed lazily (only evaluated when an action needs them)

Common transformations:

  • Map
  • filter
  • flatMap
  • union (and other set-theory operations)
  • sortByKey
  • join
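A spark-shell sketch showing that transformations only define new RDDs (assuming sc is the shell’s SparkContext):

val lines = sc.parallelize(Seq("spark is fast", "hadoop is scalable"))
val words = lines.flatMap(_.split(" ")) // transformation: nothing executes yet
val short = words.filter(_.length <= 5) // still lazy: just another RDD definition
short.collect() // only this action forces the transformations above to run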
41
Q

What are Actions (RDD operation)? Name some common Action operations.

A
  • trigger job execution that forces the evaluation of all the transformations
  • must return a final value
  • the results of actions are returned to the driver or stored in an external storage system
  • brings the laziness of RDDs into motion (runs the previously requested transformations)

Common Actions:

  • count
  • take (collects a number of elements from the RDD)
  • collect
  • reduce
  • first
  • foreach
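A few of these actions in spark-shell (sc is the shell’s SparkContext; the numbers are sample data):

val nums = sc.parallelize(Seq(5, 3, 8, 1))
nums.count() // 4: number of elements
nums.take(2) // Array(5, 3): the first two elements
nums.reduce(_ + _) // 17: aggregate to a single value returned to the driver
nums.first() // 5
nums.foreach(println) // runs on the executors, so output may appear in executor logs rather than on the driver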
42
Q

Difference between reduce and ReduceByKey?

A

Reduce must pull the entire dataset down into a single location because it reduces it to one final value (it is an action).
ReduceByKey produces one value for each key; since it can be run on each machine locally first, the result can remain an RDD and have further transformations applied to it (it is a transformation). It is generally used on key/value pairs.
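For example, in spark-shell:

val nums = sc.parallelize(Seq(1, 2, 3, 4))
nums.reduce(_ + _) // action: a single value (10) returned to the driver
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect() // transformation then action: one value per key, e.g. Array((a,4), (b,2))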

43
Q

How to create pair RDDs?

A

use the map() method:

map(x => (x,1))

44
Q

Paired RDDs transformation examples?

A
  • reduceByKey
  • groupByKey
  • keys
  • values
  • mapValues
  • sortByKey
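Small spark-shell examples of these transformations (the sales data is made up):

val sales = sc.parallelize(Seq(("tv", 2), ("radio", 1), ("tv", 3)))
sales.reduceByKey(_ + _) // (tv,5), (radio,1)
sales.groupByKey() // (tv, [2, 3]), (radio, [1])
sales.keys // tv, radio, tv
sales.values // 2, 1, 3
sales.mapValues(_ * 10) // (tv,20), (radio,10), (tv,30)
sales.sortByKey() // ordered by key: radio before tv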
45
Q

Paired RDDs Action examples?

A
  • countByKey()
  • collectAsMap()
  • lookup(key)
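Continuing the same made-up sales pair RDD in spark-shell:

val sales = sc.parallelize(Seq(("tv", 2), ("radio", 1), ("tv", 3)))
sales.countByKey() // Map(tv -> 2, radio -> 1): number of pairs per key
sales.collectAsMap() // a Map with one value kept per key (e.g. tv -> 3)
sales.lookup("tv") // Seq(2, 3): all values for that key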
46
Q

What is a RDD Lineage Graph?

A
  • due to the lazy nature of RDDs, dependencies between RDDs are logged in a lineage graph
  • regarded as a logical execution plan of RDD transformations (as they need to be done in a specific order when triggered by the next action)
  • this plan is run through an optimiser when an action is executed
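The recorded lineage can be inspected with toDebugString in spark-shell (the file path is a placeholder):

val words = sc.textFile("input.txt").flatMap(_.split(" "))
val counts = words.map(x => (x, 1)).reduceByKey(_ + _)
println(counts.toDebugString) // prints the chain of parent RDDs, i.e. the lineage / logical plan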
47
Q

Describe the following terms found in spark: Job, stages, tasks.

A

Job: a piece of code which reads some input from HDFS or local storage, performs some computation on the data, and writes some output. Read, write, etc.

Stages: jobs are divided into stages. E.g. map or reduce stages. Stages are divided based on computational boundaries. Each stage is further divided into tasks based on the number of partitions in the RDD.

Tasks: each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor. Can have many tasks in one stage. The smallest unit of work for Spark.

48
Q

What is SparkContext?

A
  • the main entry point to Spark functionality
  • responsible for calculating the dependencies between RDDs and building DAGs

49
Q

What are Narrow and Wide Dependencies? Examples?

A

The two types of transformation dependencies. Note: parent = the input RDD, child = the RDD output by the transformation

Narrow transformation (dependencies):

  • each partition of the parent RDD is used by at most one partition of the child RDD
  • allows for pipelined execution on one cluster node
  • failure recovery is more efficient as only the lost parent partitions need to be recomputed - if a child partition is lost, its parent can be used to recover it
  • example: map, flatMap, filter, sample, union, etc.

Wide transformation (dependencies):

  • many-to-many relationship between child partitions and parent partitions
  • multiple child partitions may depend on one parent partition
  • a complete re-computation is needed if some partition is lost from all the ancestors
  • example: groupByKey, reduceByKey
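For example, in spark-shell:

val nums = sc.parallelize(1 to 10, 4) // 4 partitions
val narrow = nums.map(_ * 2).filter(_ > 5) // narrow: each output partition depends on a single input partition
val wide = narrow.map(n => (n % 3, n)).reduceByKey(_ + _) // wide: a shuffle redistributes values for each key
wide.collect() // triggers both stages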
50
Q

What is a Directed Acyclic Graph (DAG)? When is it used?

A
  • a set of vertices and edges
  • vertices represent the RDDs
  • edges represent the operations to be applied to the RDDs (from earliest to latest applied)
  • DAG is a finite directed graph with no directed cycles.
  • DAG operations can do better global optimisation than other systems like MapReduce.

When an action is called, the created DAG is submitted to the DAG scheduler, which further splits the graph into the stages of the job. The task scheduler then launches the tasks of each stage specified in the DAG via the cluster manager.

51
Q

If a DAG is a single stage operation, what kind of transformation is it?

A

Narrow.

Eg: mapping, filtering, union.

52
Q

If a DAG is a multiple stage operation that can trigger a shuffle, what kind of transformation is it?

A

Wide transformation.

Eg. reduceByKey

53
Q

What are the steps of how Spark works?

A
  1. Create RDD object using Scala interpreter
  2. SparkContext is responsible for calculating the dependencies between RDDs and building DAGs
  3. After an Action operator is called: the DAG scheduler decomposes the DAG graph by pipelining operators together, creating stages. Each stage contains multiple tasks, depending on the partitions of the input data.
  4. Task scheduler launches tasks to distribute across the worker nodes via cluster manager. Task scheduler doesn’t know about dependencies among stages.
  5. Worker executes tasks, with a new JVM started per job.