Module 6 - Distributed File Systems Flashcards
Distributed file systems allow ________ to access _______ systems on ________ servers
applications
file
remote
There are two ways for a client to access a file in a DFS (distributed file system). What are their names?
- Remote access model
- Upload/download model
What is the remote access model for a DFS (distributed file system)?
- The file always lives on the server
- Anytime a client wants to read or write, it needs to issue an RPC (or a request)
What is the upload/download model for a DFS (distributed file system)?
- The file is managed by the server, but the server transfers a copy of the file to the client
- Once the file is received, the client locally performs reads and writes on it
- When the client is done accessing the file, the new version of the file is then transferred back to the server
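The difference between the two access models can be seen by counting requests. The following is a minimal sketch, not NFS itself: `FileServer`, its methods, and the file path are hypothetical names made up for illustration.

```python
class FileServer:
    """Toy in-memory file server that counts the requests it receives."""

    def __init__(self):
        self.files = {}      # path -> bytearray
        self.requests = 0    # number of client-to-server requests

    def write(self, path, offset, data):
        self.requests += 1
        buf = self.files.setdefault(path, bytearray())
        buf[offset:offset + len(data)] = data

    def download(self, path):
        self.requests += 1
        return bytearray(self.files.get(path, b""))

    def upload(self, path, content):
        self.requests += 1
        self.files[path] = bytearray(content)


def remote_access_edit(server, path, edits):
    """Remote access model: every write is its own request to the server."""
    for offset, data in edits:
        server.write(path, offset, data)


def upload_download_edit(server, path, edits):
    """Upload/download model: fetch a copy, edit locally, send it back once."""
    local = server.download(path)
    for offset, data in edits:
        local[offset:offset + len(data)] = data
    server.upload(path, local)


edits = [(0, b"H"), (1, b"E"), (2, b"L"), (3, b"L"), (4, b"O")]

remote = FileServer()
remote.files["/a.txt"] = bytearray(b"hello")
remote_access_edit(remote, "/a.txt", edits)

updown = FileServer()
updown.files["/a.txt"] = bytearray(b"hello")
upload_download_edit(updown, "/a.txt", edits)

print(remote.requests, updown.requests)
```

Both servers end up with the same file contents; the remote access run issues one request per edit, while the upload/download run issues one download and one upload regardless of how many edits happen in between.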
What does NFS stand for? When was it created, and by which organization? Does ecelinux use NFS?
Network File System
Created by Sun Microsystems
in 1984
yes, ecelinux uses NFSv4
In NFS, does the server have an RPC stub? or the client? or both?
both the client and the server have their own RPC stubs
What is the step-by-step process of the client trying to access a chunk of a file on the remote server?
- Client makes a system call to the kernel, and specifies the path of the file which it is trying to access
- Turns into a request which passes through the VFS (virtual file system) client layer, and the NFS client
- The NFS client uses its RPC client stub to make a call to the RPC server stub, which triggers execution in the NFS server program
- The NFS server program makes a call to the VFS server layer which fetches the file from the server’s file system
- The fetched file is returned to the client through the RPC stubs and then propagated back to the client’s VFS
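The layering in the steps above can be sketched as a chain of function calls. This is only an illustration: the function names are invented, a direct function call stands in for the network hop, and JSON stands in for the real RPC wire encoding.

```python
import json

trace = []  # records the order in which layers are entered

SERVER_FS = {"/export/notes.txt": "distributed file systems"}  # hypothetical data


def server_vfs_read(path, offset, length):
    trace.append("server VFS")
    return SERVER_FS[path][offset:offset + length]


def nfs_server_read(path, offset, length):
    trace.append("NFS server")
    return server_vfs_read(path, offset, length)


def rpc_server_stub(message):
    trace.append("RPC server stub")
    call = json.loads(message)               # unmarshal the request
    result = nfs_server_read(*call["args"])
    return json.dumps({"result": result})    # marshal the reply


def rpc_client_stub(*args):
    trace.append("RPC client stub")
    message = json.dumps({"proc": "READ", "args": args})  # marshal the request
    reply = rpc_server_stub(message)         # stands in for the network hop
    return json.loads(reply)["result"]


def nfs_client_read(path, offset, length):
    trace.append("NFS client")
    return rpc_client_stub(path, offset, length)


def vfs_read(path, offset, length):
    """Entry point reached by the application's system call."""
    trace.append("client VFS")
    return nfs_client_read(path, offset, length)


data = vfs_read("/export/notes.txt", 0, 11)
print(data)
print(" -> ".join(trace))
```

The application only ever calls `vfs_read`; everything below that layer is hidden from it, which is why a remote file can be accessed with the same system calls as a local one.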
NFS supports client-side caching. What is the motivation behind this?
What are the caches used for in NFS?
- Caching reduces communication between the client and the server
- The cache is used to hold UPDATES to a file
Whenever caches are used in NFS, the cache holds modifications that have been made to a specific file.
When are file modifications propagated to server after sitting in the cache?
What issue arises if this took place in a distributed NFS with replication?
- File modifications are flushed back to the server whenever the client closes the file
- In a distributed NFS, this could lead to inconsistencies in files across replicas
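A minimal sketch of flush-on-close caching, assuming a hypothetical `Server` and `CachingClient` (neither is a real NFS interface): updates stay in the client cache and reach the server as a single write when the file is closed.

```python
class Server:
    def __init__(self):
        self.files = {}
        self.writes_received = 0

    def store(self, path, content):
        self.writes_received += 1
        self.files[path] = content


class CachingClient:
    """Client that buffers updates locally and flushes them on close()."""

    def __init__(self, server):
        self.server = server
        self.cache = {}   # path -> locally modified content

    def open(self, path):
        self.cache[path] = self.server.files.get(path, "")

    def append(self, path, text):
        self.cache[path] += text      # touches only the local cache

    def close(self, path):
        # One flush carries every buffered update back to the server.
        self.server.store(path, self.cache.pop(path))


server = Server()
client = CachingClient(server)
client.open("/log.txt")
for word in ["a", "b", "c"]:
    client.append("/log.txt", word)
writes_before_close = server.writes_received
client.close("/log.txt")
print(writes_before_close, server.writes_received, server.files["/log.txt"])
```

Until `close` runs, the server copy is stale, which is exactly the window in which replicas of the file could diverge.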
Describe the “delegation of authority” mechanism used in NFS for upload/download of files
What is the purpose of this delegation? What does it mean in the context of two clients trying to access the file?
- Client asks server for the file
- Server delegates authority of the file to the client
- Server recalls delegation
- Client returns the file
Delegation in step 2 ensures that only one client can modify a specific file at a time (since it holds the authority granted by the server). Other clients cannot access it until the server recalls the delegation
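The four delegation steps can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that a conflicting request from a second client is what prompts the server to recall the delegation.

```python
class DelegatingServer:
    def __init__(self, content=""):
        self.content = content
        self.holder = None          # client currently holding the delegation

    def request_file(self, client):
        # Assumption: a request from a different client triggers a recall.
        if self.holder is not None and self.holder is not client:
            self.recall()
        self.holder = client        # step 2: delegate authority
        return self.content

    def recall(self):
        # Steps 3 and 4: recall the delegation; the holder returns its copy.
        self.content = self.holder.give_back()
        self.holder = None


class Client:
    def __init__(self, name):
        self.name = name
        self.copy = None

    def acquire(self, server):
        self.copy = server.request_file(self)   # step 1: ask for the file

    def edit(self, text):
        self.copy += text           # local edit, no server contact

    def give_back(self):
        returned, self.copy = self.copy, None
        return returned


server = DelegatingServer("v0")
alice, bob = Client("alice"), Client("bob")
alice.acquire(server)
alice.edit("+alice")
bob.acquire(server)     # triggers a recall of alice's delegation
print(server.content, bob.copy, alice.copy)
```

At any moment at most one client holds the delegation, so the second client only ever sees the file after the first client's changes have been returned.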
NFS uses RPCs internally. One optimization added to NFS was compound procedures.
What does a file read between a client and a server in NFS look like with and without compound procedures?
How is the latency decreased with compound procedures?
In NFS, whenever the client makes a request to the server to read a file, it first has to perform a LOOKUP, and then perform a READ on the file
This mechanism without compound procedures:
- client makes a LOOKUP network request, gets a response
- client makes a READ network request, and then gets another response
With compound procedures:
- client makes a LOOKUP and a READ call in the same network request
Therefore, with compound procedures there is only one network request, but without it, there are two. Thus, the latency is reduced
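The round-trip count can be illustrated with a toy server. `ToyNfsServer`, its `call` and `compound` methods, and the integer file handles are assumptions of this sketch, not the NFS wire protocol.

```python
class ToyNfsServer:
    def __init__(self, files):
        self.files = files            # path -> bytes
        self.handles = {}             # file handle -> path
        self.round_trips = 0
        self.ops = {"LOOKUP": self._lookup, "READ": self._read}

    def _lookup(self, path):
        handle = len(self.handles) + 1    # hypothetical file-handle scheme
        self.handles[handle] = path
        return handle

    def _read(self, handle):
        return self.files[self.handles[handle]]

    def call(self, op, arg):
        """One operation per network request."""
        self.round_trips += 1
        return self.ops[op](arg)

    def compound(self, ops, first_arg):
        """Several operations in one request; each result feeds the next."""
        self.round_trips += 1
        result = first_arg
        for op in ops:
            result = self.ops[op](result)
        return result


server = ToyNfsServer({"/a": b"data"})

handle = server.call("LOOKUP", "/a")
plain = server.call("READ", handle)
plain_trips = server.round_trips

server.round_trips = 0
combined = server.compound(["LOOKUP", "READ"], "/a")
compound_trips = server.round_trips

print(plain_trips, compound_trips)
```

If one round trip costs roughly r of network latency, the plain version pays about 2r before the data arrives, and the compound version pays about r.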
Usually, an NFS server exports only a part of its local file system to the remote client.
What does a client typically do to its local file structure to integrate this part from the server?
The client imports this segment, and adds this portion of the server’s file system to its local file system
The remote file segment is mounted onto the client under a certain path
Suppose a client imports a directory which contains a subdirectory which was imported from another remote host.
How does this client access the nested directory in terms of imports?
If a client imports a directory from server A which contains another imported directory from server B, then the client will import the nested directory DIRECTLY from server B
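Mounting and nested imports can be sketched with a client-side mount table. The table contents, server names, and the `resolve` helper are hypothetical; the point is that the client holds one entry per import and picks the longest matching mount point, so the nested directory is reached directly at server B.

```python
# Client-side mount table: mount point -> (server, exported directory).
MOUNTS = {
    "/remote/vu": ("serverA", "/export/users"),
    "/remote/vu/proj": ("serverB", "/export/proj"),   # nested import, mounted by the client itself
}


def resolve(mount_table, path):
    """Map a client path to (server, remote_path) via the longest matching mount point."""
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/")
        if path == mount_point or path.startswith(prefix + "/"):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        return ("local", path)       # not under any mount point
    server, export = mount_table[best]
    suffix = path[len(best.rstrip("/")):]
    return (server, export.rstrip("/") + suffix)


print(resolve(MOUNTS, "/remote/vu/steen/mbox"))
print(resolve(MOUNTS, "/remote/vu/proj/plan.txt"))
print(resolve(MOUNTS, "/home/me/file"))
```

Server A never forwards requests for the nested directory; the client's own table sends them to server B.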
A large scale DFS may distribute files across multiple servers in order to manage very large files.
What are the two ways of doing this?
- Make all chunks of each file reside on a single server (chunks of a file are not partitioned across servers)
- Split the chunks of a file across numerous servers (just like sharding in databases)
In a large scale DFS which distributes files across numerous servers by storing files in chunks (just like sharding in databases), how can this result in improved throughput?
In the case where the server is the bottleneck of the system, partitioning files allows the load to be balanced across numerous servers, thereby improving throughput
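A small sketch of the two placement options and their effect on load. The function names, the toy hash, and round-robin striping are illustrative assumptions; real systems use their own placement policies.

```python
from collections import Counter


def place_whole_file(file_name, num_chunks, servers):
    """Option 1: every chunk of a file lives on one server, chosen per file."""
    index = sum(file_name.encode()) % len(servers)   # deterministic toy hash
    return [servers[index]] * num_chunks


def place_striped(num_chunks, servers):
    """Option 2: consecutive chunks go to different servers (round-robin)."""
    return [servers[i % len(servers)] for i in range(num_chunks)]


servers = ["s0", "s1", "s2"]
whole = place_whole_file("big.dat", 6, servers)
striped = place_striped(6, servers)

# Chunks each server must serve when a client reads the whole file.
whole_load = Counter(whole)
striped_load = Counter(striped)
print(dict(whole_load))
print(dict(striped_load))
```

With whole-file placement one server serves all six chunks; with striping each of the three servers serves two, so the chunks can be fetched in parallel.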
In the Google File System (GFS), describe the following:
- The master node
- The chunk servers
- The underlying file system existing in each chunk server
- Master node stores metadata about the files (size, path, access rights) and chunks, and serves it to the client
- The chunk servers store chunks (which may be replicas) of the overall file system, with no metadata
- The underlying file system on each chunk server is a Linux file system
Why does GFS (Google file system) distribute the files across numerous chunk servers?
What other famous file system is an open-source implementation of GFS?
Distribution of the files across numerous chunk servers provides fault tolerance in software
HDFS (Hadoop distributed file system) is an open-source implementation of GFS
What is a Google file system (GFS) made up of?
- Master node
- GFS client
- Collection of chunk servers
In GFS (Google file system), the master node’s metadata about chunks are ______ in main memory and ______ are logged to local storage
cached
updates
How does the master node in GFS (Google file system) keep the meta-data consistent with the state of the chunk servers?
The master periodically polls the chunk servers to keep the meta-data consistent
In GFS (Google File System), what are the steps for the client to read data from a file?
- Client sends the file name and chunk index to the master
- The master responds with the contact address of a chunk server that holds this chunk
- The client then pulls data directly from a chunk server, bypassing the master
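The read path can be sketched as below. `Master`, `ChunkServer`, `gfs_read`, the chunk handles, and the tiny `CHUNK_SIZE` are all assumptions chosen to keep the example small, and the read is assumed not to cross a chunk boundary.

```python
CHUNK_SIZE = 64  # bytes, a toy value for this sketch


class Master:
    """Holds only metadata: which chunk lives where."""

    def __init__(self):
        self.chunk_table = {}   # (file_name, chunk_index) -> (chunk_handle, [server names])
        self.lookups = 0

    def locate(self, file_name, chunk_index):
        self.lookups += 1
        return self.chunk_table[(file_name, chunk_index)]


class ChunkServer:
    def __init__(self):
        self.chunks = {}        # chunk_handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


def gfs_read(master, chunk_servers, file_name, byte_offset, length):
    chunk_index = byte_offset // CHUNK_SIZE                     # step 1: file name + chunk index
    handle, locations = master.locate(file_name, chunk_index)   # step 2: master returns location
    server = chunk_servers[locations[0]]                        # step 3: contact chunk server directly
    return server.read(handle, byte_offset % CHUNK_SIZE, length)


master = Master()
cs1, cs2 = ChunkServer(), ChunkServer()
cs1.chunks["h0"] = b"A" * CHUNK_SIZE
cs2.chunks["h1"] = b"B" * CHUNK_SIZE
master.chunk_table[("/data/big", 0)] = ("h0", ["cs1"])
master.chunk_table[("/data/big", 1)] = ("h1", ["cs2"])

result = gfs_read(master, {"cs1": cs1, "cs2": cs2}, "/data/big", 70, 4)
print(result, master.lookups)
```

The master handles one small metadata lookup; the file bytes themselves never pass through it.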
What is the step-by-step mechanism by which GFS (Google File System) updates data in a given file?
- A client contacts the nearest chunk server holding the data, and pushes its updates to that server
- This server will push the update to the next closest server which is holding the data (secondary), and so on, in a pipelined fashion until all replicas receive the data
- The primary chunk server assigns a sequence number to the update operation and passes it on to the secondary chunk servers (bypassing master)
- Primary replica informs client that the update is complete
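The update steps can be sketched with a data-push phase followed by an ordering phase. The `Replica` and `Primary` classes, the `data_id` labels, and the explicit `commit` call are hypothetical simplifications; failure handling and acknowledgements from secondaries are left out.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = {}    # data_id -> bytes pushed but not yet applied
        self.log = []       # (sequence_number, data) in applied order

    def push(self, data_id, data, chain):
        """Buffer the data, then forward it to the next replica in the chain."""
        self.buffer[data_id] = data
        if chain:
            chain[0].push(data_id, data, chain[1:])

    def apply(self, data_id, seq):
        self.log.append((seq, self.buffer.pop(data_id)))


class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_seq = 0

    def commit(self, data_id):
        """Assign a sequence number and have every replica apply in that order."""
        seq = self.next_seq
        self.next_seq += 1
        self.apply(data_id, seq)
        for secondary in self.secondaries:
            secondary.apply(data_id, seq)
        return "ok"         # reply to the client: update complete


s1, s2 = Replica("s1"), Replica("s2")
primary = Primary("p", [s1, s2])

# The client's nearest replica is s1, so data flows s1 -> primary -> s2.
s1.push("w1", b"x", [primary, s2])
s1.push("w2", b"y", [primary, s2])

# The primary, not the push order, decides the order of application.
primary.commit("w2")
status = primary.commit("w1")
print(primary.log, s1.log == primary.log, s2.log == primary.log)
```

Separating the data push from the sequence-number assignment lets the bulk data take the cheapest network path while all replicas still apply updates in one agreed order.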
In a centralized sharing setting, what are the semantics of file sharing? (two points)
Under what condition can these same semantics be achieved in a DFS?
Centralized file sharing semantics:
- Operations are strictly ordered in time
- Application can ALWAYS read its own writes
These semantics can be achieved in a DFS as long as there is only one file server and the files are not cached
When a cached file is modified in a DFS, it is ________ but ________ to propagate the changes _______ to the file server. Instead, they are made after the file is closed
possible
impractical
immediately
In file sharing semantics, let’s use the session semantics method.
- When a client makes a modification to a file in a DFS (without closing it), what is the visibility of this modification?
- When do the changes get propagated to the other clients viewing the files?
- Which party determines the final version of the file?
- The modifications are only visible to the process that modified that file
- The modifications are only made visible to other clients when the file is closed
- The final version of the file is determined by the last client that closes that file
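Session semantics can be sketched with two clients that each work on a private copy. `SessionServer` and `Session` are hypothetical names; the sketch only models visibility and the "last close wins" rule.

```python
class SessionServer:
    def __init__(self, content):
        self.content = content


class Session:
    """One client's open file: a private copy taken at open time."""

    def __init__(self, server):
        self.server = server
        self.local = server.content

    def write(self, text):
        self.local = text             # visible only to this session

    def read(self):
        return self.local

    def close(self):
        self.server.content = self.local   # changes become visible on close


server = SessionServer("v0")
a = Session(server)
b = Session(server)

a.write("from A")
b_sees = b.read()                 # b still sees the version it opened
server_during = server.content    # the server copy is unchanged too

b.write("from B")
b.close()
a.close()                         # a closes last, so a's version wins
print(b_sees, server_during, server.content)
```

Note that b's update is silently overwritten: the final contents depend only on which client closes last, not on who wrote last.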
The semantics of file sharing in a DFS can be defined in numerous ways
- What does NFS use?
- What about HDFS?
- NFS uses session semantics
- HDFS uses immutable files but supports an append function so that logs can be made
What does UNIX file sharing semantics describe?
Every operation on a file is instantly visible to all processes
What does session semantics describe?
No changes are visible to other processes until the file is closed
What does immutable file sharing semantics describe?
No updates are possible. Makes it very simple for sharing and replication
What do transaction file sharing semantics describe?
All changes occur atomically