Weeks 7 - 11 Flashcards
What is a Distributed File System (DFS)? What are its characteristics?
- a distributed implementation of a file system, spread over multiple autonomous computers
- can exist in any system that has servers (source of files) and clients (accessing servers/files)
Describe the Network File System (NFS)? What are the advantages of using it?
- distributed file system protocol
- allows a user on a client computer to access files over a computer network (much like local storage access)
- client-server architecture application where a user can view, store and update the files on a remote computer
- made up of a server sharing files to a network of clients
Advantages:
- easy sharing of data across clients
- centralised administration (backups are done on the servers instead of on many individual clients)
- security (server behind firewall)
- transparent access to remote files
How does the Network File System architecture work?
- the virtual file system (VFS) is an OS layer acting as an interface between the system-call layer and all files, whether local or on network nodes
- the VFS acts as middleware that decides the destination of each client request, passing calls/requests either to a local file system or to the NFS client
- the VFS is available in most operating systems as an interface to different local and distributed file systems
- the VFS is located on both the client and the server (a minimal dispatch sketch follows below)
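A minimal sketch of that dispatch role, assuming a single read call; every class and path here is invented for illustration and is not a real OS or NFS API.

```java
// Hypothetical sketch of the VFS routing a read either to the local file
// system or to the NFS client, based on where the path is mounted.
interface FileSystemBackend {
    byte[] read(String path);
}

class LocalFileSystem implements FileSystemBackend {
    public byte[] read(String path) {
        // ... read the file from the local disk ...
        return new byte[0];
    }
}

class NfsClient implements FileSystemBackend {
    public byte[] read(String path) {
        // ... forward the request over the network to the NFS server ...
        return new byte[0];
    }
}

class VirtualFileSystem {
    // The VFS sits between the system-call layer and the backends: it inspects
    // the path (e.g. its mount point) and picks the backend that should serve it.
    byte[] read(String path) {
        FileSystemBackend backend =
            path.startsWith("/mnt/nfs/") ? new NfsClient() : new LocalFileSystem();
        return backend.read(path);
    }
}
```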
What is a Clustered File System (CFS)? Why use it?
CFS is a cluster of servers that work together to provide a high-performance service to their clients. To the clients of a CFS, the cluster is transparent.
- uses a metadata service (master) to direct and organise storage
Why?
- for bigger scale of data storage
- scalability and availability
- resiliency and load balancing for large volume of client requests
- conceptually similar to Kubernetes (a cluster of machines coordinated by a master and presented as a single service)
How does a clustered file system (CFS) work in terms of storage?
Data is divided into segments (chunks/blocks) and stored across the data nodes; files are striped so they can be accessed in parallel.
This creates resiliency at the block level: if one server goes down, only a segment of the data is unavailable, not the whole dataset. This is necessary because the volume of data in use is far too large for a single server.
The metadata service acts as the master; it directs requests from clients to the data nodes holding the relevant blocks. A minimal striping sketch follows below.
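A minimal striping sketch, assuming a tiny chunk size and plain round-robin placement purely for illustration (real systems use chunk sizes in the tens of MB and smarter placement):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: split a file's bytes into fixed-size chunks and place
// each chunk on a data node round-robin, so the chunks can be read in parallel.
class StripingSketch {
    static final int CHUNK_SIZE = 8; // tiny for illustration only

    public static void main(String[] args) {
        byte[] file = "hello clustered file system".getBytes();
        List<String> dataNodes = Arrays.asList("node-1", "node-2", "node-3");

        // 1. Divide the file into fixed-size segments.
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < file.length; off += CHUNK_SIZE) {
            chunks.add(Arrays.copyOfRange(file, off, Math.min(off + CHUNK_SIZE, file.length)));
        }

        // 2. Stripe the segments across the data nodes: chunk i goes to node (i mod N).
        for (int i = 0; i < chunks.size(); i++) {
            String node = dataNodes.get(i % dataNodes.size());
            System.out.printf("chunk %d (%d bytes) -> %s%n", i, chunks.get(i).length, node);
        }
    }
}
```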
What is Google File System (GFS)?
- A scalable, distributed file system
- uses large clusters of commodity hardware
- commodity hardware is cheap and fails relatively often, so the system is designed with high failure tolerance
- horizontally scalable via commodity hardware nodes
- designed to deal with big data (very large files and their metadata)
- designed to allow multiple users to write (append) to one file at the same time - availability
- response time for individual read and write is not critical, throughput is prioritised
- must support two types of operations: reads and writes
- writing here mostly means appending and adding data to existing chunks/blocks
- differs from the usual DFS, which generally does not allow multiple users to write to the same file at the same time the way GFS does
Why not use NFS for big data storage?
Unreliable at that scale, with the potential for data loss, as opposed to CFS and GFS, which make data replicas for reliability.
Talk through the GFS Architecture Read Operation
- application request goes through client interaction
- client sends chunk index and file name to master node
- master looks up the request in its metadata table to find the IPs and locations of each of the chunks, and returns these values to the client, which caches them
- client identifies the closest chunk location
- client requests the data from that chunkserver and reads it directly from the server (a direct read between client and chunkserver)
Note:
To reduce the workload on the master node, the client caches the chunk ID and chunkserver location of chunks it reads repeatedly, so they can be accessed directly without master interaction in future. This improves the performance and speed of the system. A hedged sketch of the read path follows below.
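GFS has no public API, so every type, method, and field in this sketch is invented purely to mirror the steps above:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the GFS read path: ask the master once, cache the
// answer, then read directly from the closest chunkserver.
class GfsReadSketch {
    record ChunkLocation(long chunkHandle, List<String> chunkserverAddresses) {}

    interface Master {
        // Client sends (file name, chunk index); master returns handle + replica locations.
        ChunkLocation lookup(String fileName, long chunkIndex);
    }

    interface Chunkserver {
        byte[] read(long chunkHandle, long offset, int length);
    }

    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB chunks

    // Client-side cache so repeated reads skip the master.
    static final Map<String, ChunkLocation> cache = new HashMap<>();

    static byte[] read(Master master, Map<String, Chunkserver> servers,
                       String fileName, long byteOffset, int length) {
        long chunkIndex = byteOffset / CHUNK_SIZE;            // which chunk holds the offset
        String cacheKey = fileName + "#" + chunkIndex;

        ChunkLocation loc = cache.computeIfAbsent(cacheKey,
                k -> master.lookup(fileName, chunkIndex));     // ask the master only on a cache miss

        // Pick the "closest" replica (here simply the first) and read directly from it;
        // the master is not involved in the data transfer.
        Chunkserver closest = servers.get(loc.chunkserverAddresses().get(0));
        return closest.read(loc.chunkHandle(), byteOffset % CHUNK_SIZE, length);
    }
}
```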
Key differences between DFS, NFS and CFS?
- DFS is the umbrella term and includes both NFS and CFS (both store data across nodes in different locations)
- CFS and NFS both use the network for communication, but NFS has limited storage, reliability, and scalability
- in comparison to NFS, CFS provides highly available, scalable storage capabilities and high resiliency features
GFS design overview?
Design overview:
- files stored as fixed-size chunks (64 MB) on separate servers/nodes
- as commodity servers have a high failure rate, replicating chunks across nodes is crucial for resiliency
- the replication factor is 3 by default and can be set manually
- single master - centralised management
- meta-data store
Describe the GFS Master components. Pros/cons?
- maintains all systems metadata
- periodically communicates with chunk servers through heartbeat messages
- one master is a single point of failure, but replication of the master's state is kept across multiple machines
- log and check points are replicated on multiple machines
Describe the write process in GFS?
- client requests from the master node the chunkserver details (IP and location) of the server holding the current lease on the requested chunk, plus its replicas
- master returns the locations and IPs of all chunk replicas, prioritised by ease of client access and by the least amount of disk space in use; the master also appoints the primary replica in this step
- client locates the closest server and sends the chunk data (the write) to it (not always the primary); this chunkserver passes the data along to all the other replicas
- the write data is not stored on disk immediately; instead it is held in a cache on each chunkserver
- once the data sits in the local cache on all chunkservers, the client sends a request to the primary replica server to commit the data across all disks
- the primary server organises the commit of all replicas onto their hard drives
- once the primary server receives confirmation that the data has been successfully stored on the disks, it sends a confirmation status back to the client (a sketch of this flow follows below)
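A hedged sketch of this two-phase write, again with invented names (data is cached on every replica first, then the primary orders the commit):

```java
import java.util.List;

// Hypothetical sketch of the GFS write path described above; not a real GFS API.
class GfsWriteSketch {
    interface Replica {
        void cacheData(long chunkHandle, byte[] data);   // phase 1: buffer in memory
        void commit(long chunkHandle);                   // phase 2: write buffer to disk
        boolean confirmCommitted(long chunkHandle);
    }

    static boolean write(long chunkHandle, byte[] data,
                         Replica primary, List<Replica> secondaries) {
        // 1. Push the data into every replica's cache (in reality it is pipelined
        //    along a chain of chunkservers rather than sent to each one directly).
        primary.cacheData(chunkHandle, data);
        for (Replica r : secondaries) {
            r.cacheData(chunkHandle, data);
        }

        // 2. Ask the primary to commit; the primary orders the write and
        //    forwards the commit to all secondaries.
        primary.commit(chunkHandle);

        // 3. Success is reported only once every replica has persisted the data.
        boolean allCommitted = primary.confirmCommitted(chunkHandle);
        for (Replica r : secondaries) {
            allCommitted &= r.confirmCommitted(chunkHandle);
        }
        return allCommitted; // confirmation returned to the client
    }
}
```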
Name the three major types of metadata in GFS. What is this metadata used for in GFS?
- The file and chunk namespaces
- mapping from files to chunks
- the locations of each chunk's replicas
All of this is kept in the master's memory and used as a lookup table for client requests to locate chunks. This metadata describes the data held on the chunkservers. A sketch of these structures follows below.
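A minimal sketch of those three structures as they might look in the master's memory; the field types and the lookup method are assumptions for illustration, not the real GFS data structures.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the three kinds of metadata the master keeps in memory.
class MasterMetadataSketch {
    // 1. File and chunk namespaces: the directory tree of file names.
    Map<String, List<String>> namespace = new HashMap<>();      // directory path -> entries

    // 2. Mapping from files to chunks: which chunk handles make up each file.
    Map<String, List<Long>> fileToChunks = new HashMap<>();     // file name -> ordered chunk handles

    // 3. Locations of each chunk's replicas: which chunkservers hold each chunk.
    Map<Long, List<String>> chunkToReplicas = new HashMap<>();  // chunk handle -> chunkserver addresses

    // A client lookup (file name, chunk index) resolves to replica locations.
    List<String> locate(String fileName, int chunkIndex) {
        long handle = fileToChunks.get(fileName).get(chunkIndex);
        return chunkToReplicas.get(handle);
    }
}
```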
What is the GFS master's operation log used for?
- kept on the master node
- consists of the namespace and the file-to-chunk mappings (a historical record of metadata changes)
- replicated on remote machines
- replayed to recover the master's state after a failure
What is Hadoop Distributed File System (HDFS)?
- is the file system of Apache Hadoop
- key processing function is the MapReduce model
- open-source software
- used to solve big data problems
What is the HDFS MapReduce programming model used for/what does it do?
Hadoop splits files into large blocks (double the size of GFS chunks) and distributes them across the nodes in a cluster. MapReduce then transfers the packaged code to the nodes that hold the data, taking advantage of data locality: the map tasks run where the data already is.
Blocks = GFS chunks. The canonical word-count job below illustrates the model.
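This follows the standard Hadoop MapReduce tutorial's word count: map tasks are scheduled on the nodes holding each block and emit (word, 1) pairs, and reduce tasks sum the counts. The input and output paths come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // emit (word, 1) for every token in the block
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();           // sum all counts for this word
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```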
Key differences and similarities between GFS and HDFS?
Similarities:
- both designed to support very big data
- provide two types of read (large streaming + small random)
- focus on throughput rather than the speed of individual operations
- both name/master nodes use in-memory storage of metadata
Differences:
- once written, files are seldom modified in HDFS
- (know) HDFS only allows one user with write permission to access a file at any one time, as opposed to GFS' lease system, which allows multiple concurrent writes (ordered by the primary)
- GFS runs only on Linux; HDFS is available on Mac, Linux and Windows
- (know) GFS is C, C++ environment; HDFS is Java
- (know) HDFS is open-source and free
- architecture terminology differences (GFS Master node, chunk server, and chunks of data; HDFS namenode, datanodes, and data blocks)
Describe the HDFS NameNode?
- equivalent to Master Node in GFS
- represents files and directories as inodes on the NameNode
- maintains the namespace tree and the mapping of file blocks to DataNodes
- when writing data, NameNode nominates DataNodes to host replicas
- keeps Meta-Data in RAM
- responsible for replication, load balancing, maintenance, heartbeat responses, etc. (a client-side read example that relies on the NameNode for block locations follows below)
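Reading a file through the real HDFS client API (org.apache.hadoop.fs): the client asks the NameNode only for metadata and block locations, then streams the bytes directly from the DataNodes. The NameNode address and file path below are placeholders.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // The URI points the client at the NameNode, which serves only metadata.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

    Path file = new Path("/data/example.txt");
    try (FSDataInputStream in = fs.open(file)) {
      // open() asks the NameNode for the block locations; the bytes themselves
      // are then streamed directly from the DataNodes holding the replicas.
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
    fs.close();
  }
}
```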
What does the HDFS Meta-Data component involve?
Stored in the NameNode.
fsimage:
- contains the entire filesystem namespace at the latest checkpoint
- block information for each file (block IDs, timestamps, etc.)
- folder information (ownership, access, etc.)
- stored as an image file in the NameNode’s local file system
editlog:
- contains all the recent modifications made to the file system since the most recent fsimage
- create/update/delete requests from the client
Describe the workflow of the HDFS Checkpoint Node
- Secondary NameNode
- regularly queries the primary NameNode for the fsimage and editlogs
- keeps edit logs to enable rollback to a previous state
- the primary NameNode stops writing to the current editlog and copies the edits and fsimage to the secondary NameNode
- all new edits after that point are written into edits.new
- the copied edits and fsimage are merged on the secondary NameNode
- the merged fsimage.ckpt is copied back to the primary NameNode and used as the new fsimage
- finally, the editlog file gets smaller and the fsimage is brought up to date (sketched below)
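A minimal sketch of that sequence, modelling the fsimage as a map and the editlog as a list of edits; the real HDFS checkpoint is far more involved, and this only mirrors the steps above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the checkpoint workflow (not the real HDFS classes).
class CheckpointSketch {
    // fsimage modelled as path -> file size; editlog as a list of edits
    // such as {"CREATE", "/a", "0"} or {"DELETE", "/a"}.
    static Map<String, Long> fsimage = new HashMap<>();
    static List<String[]> editlog = new ArrayList<>();

    static void checkpoint() {
        // 1. The primary NameNode stops writing to the current editlog;
        //    all new edits after this point go into a fresh edits.new log.
        List<String[]> frozenEdits = editlog;
        List<String[]> editsNew = new ArrayList<>();

        // 2. The secondary NameNode copies the fsimage and the frozen edits,
        //    then merges them into fsimage.ckpt by replaying each edit.
        Map<String, Long> merged = new HashMap<>(fsimage);
        for (String[] edit : frozenEdits) {
            switch (edit[0]) {
                case "CREATE", "UPDATE" -> merged.put(edit[1], Long.parseLong(edit[2]));
                case "DELETE" -> merged.remove(edit[1]);
            }
        }

        // 3. fsimage.ckpt is copied back to the primary and becomes the new fsimage;
        //    edits.new becomes the editlog, which is now much smaller.
        fsimage = merged;
        editlog = editsNew;
    }
}
```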
HDFS component - DataNode
- equivalent to the GFS chunk server
- each block replica is stored as two local files: the data itself and the block's metadata; block reports of the replicas held are sent to the NameNode
- when started up, it verifies its namespace ID and software version with the NameNode
- the internal storage ID is an identifier of the DataNode within the cluster and never changes; each node has its own local disk storage
- sends heartbeats (every 3 sec) to NameNode to communicate status
- receives maintenance commands from the NameNode indirectly, piggybacked on heartbeat replies (replicate blocks to other nodes, remove local block replicas, re-register, shut down, send a block report, etc.); a sketch of this loop follows below
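A hypothetical sketch of that loop (not the real HDFS classes): the heartbeat carries the DataNode's status, and the NameNode's reply carries any maintenance commands to execute.

```java
import java.util.List;

// Hypothetical sketch of the DataNode heartbeat loop described above.
class HeartbeatSketch {
    interface NameNode {
        // The heartbeat carries the DataNode's status; the reply carries commands.
        List<String> heartbeat(String storageId, long capacity, long used, int activeTransfers);
    }

    static void run(NameNode nameNode, String storageId) throws InterruptedException {
        while (true) {
            List<String> commands =
                nameNode.heartbeat(storageId, totalCapacity(), usedSpace(), activeTransfers());

            // Carry out whatever the NameNode asked for in its reply.
            for (String command : commands) {
                switch (command) {
                    case "REPLICATE" -> replicateBlocksToOtherNodes();
                    case "DELETE"    -> removeLocalBlockReplicas();
                    case "REGISTER"  -> reRegisterWithNameNode();
                    case "SHUTDOWN"  -> { return; }
                }
            }
            Thread.sleep(3_000); // default heartbeat interval: 3 seconds
        }
    }

    // Placeholders for local state and actions.
    static long totalCapacity() { return 0; }
    static long usedSpace() { return 0; }
    static int activeTransfers() { return 0; }
    static void replicateBlocksToOtherNodes() {}
    static void removeLocalBlockReplicas() {}
    static void reRegisterWithNameNode() {}
}
```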