Midterm Flashcards
(Extensibility vs. Virtualization) Give the differences and similarities between the goals of extensibility (a la SPIN/Exokernel/L3) and the goals of virtualization (a la Xen/VMware ESX server).
Similarities
- Customization of system services commensurate with application needs
- Allowing policies for resource allocation to be dictated by the customized system service, by providing a thin layer above the hardware with just the mechanisms
Differences
- Customization granularity is an entire OS in the case of virtualization, while it is any desired granularity (e.g., individual subsystems of the OS) in the case of SPIN/Exokernel/L3
- Striving to avoid penalties for border crossing between system services and the kernel is super important in Exo/SPIN/L3, while in virtualization the isolation/integrity/independence of the address spaces of the hypervisor and the OSes (above the hypervisor) is of paramount importance
Give the differences and similarities between the mechanisms in Exokernel for extensibility and those in Xen for virtualization.
- Similarities
- Both Exo and Xen catch program discontinuities and pass them up to the OS above
- Both Exo and Xen have a way of identifying the OS that is currently executing so that they can correctly do the upcall to the OS that has to deal with the program discontinuity.
- Differences
- Exo allows a library OS to download arbitrary code into the kernel, while Xen has a structured way of communicating with the library OSes via the I/O ring data structure
- Exo does hardware resource allocation to the library OSes at a very fine granularity (e.g., an individual TLB entry), while Xen does it at a much coarser granularity (e.g., batching page table updates)
- Exo’s default CPU scheduling uses a linear vector of “time slots”, while Xen uses proportional-share and fair-share schedulers and accurately accounts for the time each guest OS is scheduled on the CPU
Give two plausible reasons that could make system calls more expensive than simple procedure calls.
- Cost of changing hardware address spaces (updating or switching the TLB)
- Cost of change in locality (CPU affinity, cold cache)
Consider a byte-addressable 32-bit architecture. The virtual address space is 2^32 bytes. The TLB supports entries tagged with an address space ID. We are building a microkernel-based design of an operating system for this architecture, in which every subsystem will be implemented as a process above the kernel. Suggest what design decisions you will take to make sure that the performance will be as good as a monolithic OS.
- Provide mechanisms in the kernel for generating unique IDs.
- Give unique address space IDs to each subsystem so that the AS-tagged TLB can be exploited
- No need to flush the TLB when we move from one subsystem to another
- Use kernel threads and do “hand-off” scheduling to implement protected procedure calls between the subsystems
- Provide efficient memory sharing mechanisms across subsystems (L3 mechanisms of map/grant/flush) so that copying overheads during protected procedure calls are avoided
- Provide a low-overhead mechanism in the kernel for catching program discontinuities (external interrupts) and packaging and delivering them like a protected procedure call to the subsystem that has to deal with the discontinuity
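A minimal sketch of how these decisions fit together (all names are hypothetical, not L3's actual interface): a protected procedure call that exploits the AS-tagged TLB and hand-off scheduling, with arguments shared via map/grant rather than copied.

    typedef unsigned int asid_t;

    struct subsystem {
        asid_t asid;                    /* unique address-space ID from the kernel */
        void (*entry_point)(void *msg); /* entry point registered for incoming calls */
    };

    /* Switch address spaces WITHOUT flushing the TLB: entries stay valid
       because each is tagged with its subsystem's ASID. */
    static void switch_address_space(asid_t target)
    {
        /* write target into the CPU's ASID register (architecture-specific) */
    }

    /* Hand-off scheduling: the caller's kernel thread is donated directly
       to the callee, avoiding a scheduler round-trip; msg is shared via
       L3-style map/grant rather than copied. */
    void protected_procedure_call(struct subsystem *callee, void *msg)
    {
        switch_address_space(callee->asid);
        callee->entry_point(msg);  /* conceptual: the real transfer is in assembly */
    }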
Why is a shadow page table needed?
- To support the guest OS’s illusion of a contiguous physical address space
- Since the guest OS does not have access to the hardware page table, by keeping the mapping from guest VPNs to MPNs in the hypervisor, the address translation can happen at hardware speed.
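As a hedged illustration (flat array page tables and all names here are hypothetical), the S-PT is just the composition of the guest's VPN-to-PPN mapping with the hypervisor's PPN-to-MPN mapping, precomputed so the hardware can use it directly:

    #define NO_MAPPING (~0u)

    unsigned int guest_pt[1 << 20];   /* guest's view: VPN -> PPN */
    unsigned int pmap[1 << 20];       /* hypervisor's map: PPN -> MPN */
    unsigned int shadow_pt[1 << 20];  /* S-PT walked by hardware: VPN -> MPN */

    void update_shadow(unsigned int vpn)
    {
        unsigned int ppn = guest_pt[vpn];
        if (ppn != NO_MAPPING)
            shadow_pt[vpn] = pmap[ppn];  /* composed VPN -> MPN, at hardware speed */
    }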
(full virtualization) How many shadow page tables are there?
There is one S-PT per guest OS currently running
(full virtualization) How big is the shadow page table?
It is proportional to the number of processes currently running in that guest OS
(full virtualization) The guest OS is servicing a page fault. It finds a free PPN and decides to map the VPN of the faulting process to this PPN. List the steps taken from here on by the guest and the hypervisor that ensure that the process will not fault on this VPN again.
- To put this mapping into the page table, the guest OS has to execute a privileged instruction.
- This will result in a trap into the hypervisor, which will emulate the required action (VPN-PPN mapping) in the guest OS’s page table data structure.
- Further, the hypervisor will install a direct VPN-to-MPN mapping (the MPN corresponding to the PPN) into the S-PT data structure for this guest OS and into the TLB
- Since the S-PT is the “hardware page table” in an architecture such as Intel’s, the process will not page fault the next time it runs on the CPU
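A sketch of the trap handler behind the trap-and-emulate steps above; the struct layout and function names are illustrative, not any real hypervisor's code:

    struct guest {
        unsigned int *guest_pt;   /* guest's PT data structure: VPN -> PPN */
        unsigned int *pmap;       /* hypervisor's per-guest map: PPN -> MPN */
        unsigned int *shadow_pt;  /* S-PT walked by hardware: VPN -> MPN */
    };

    void tlb_install(unsigned int vpn, unsigned int mpn);  /* hypothetical */

    /* Invoked when the guest's privileged "install PTE" instruction traps. */
    void emulate_pte_install(struct guest *g, unsigned int vpn, unsigned int ppn)
    {
        g->guest_pt[vpn] = ppn;           /* emulate the guest's intended update */
        unsigned int mpn = g->pmap[ppn];  /* the MPN backing this PPN */
        g->shadow_pt[vpn] = mpn;          /* direct VPN -> MPN in the S-PT */
        tlb_install(vpn, mpn);            /* prime the TLB so no re-fault occurs */
    }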
Guest OS decides to context switch from the currently running process to another. List the steps taken by the guest and the hypervisor until the new process is running on the processor.
- Guest OS executes the privileged instruction for changing the PTBR to point to the chosen process (say P2). Results in a trap to the hypervisor. (+2)
- From the PPN of the PT for P2, the hypervisor will know the offset into the S-PT for that guest-OS where the PT resides in machine memory. (+2)
- Hypervisor installs the MPN thus located as the PT for P2 into PTBR. (+2)
- Once the other formalities associated with context switching are complete (which also need hypervisor intervention), such as saving the volatile state of the currently running process into the associated PCB and loading the volatile state of P2 from its PCB into the processor, the process P2 can start executing.
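Continuing the illustrative struct guest from the previous sketch, the PTBR change itself reduces to one lookup plus one privileged write inside the hypervisor:

    void set_hardware_ptbr(unsigned int mpn);   /* hypothetical: writes the CR3-like register */

    /* Invoked when the guest's privileged "load PTBR" instruction traps. */
    void emulate_set_ptbr(struct guest *g, unsigned int pt_ppn)
    {
        unsigned int pt_mpn = g->pmap[pt_ppn];  /* locate P2's PT in machine memory */
        set_hardware_ptbr(pt_mpn);              /* hardware now walks P2's S-PT */
    }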
The guest OS is running multiple processes within it. The guest OS itself appears as a “process” so far as the hypervisor is concerned. How is each of the processes inside the guest OS guaranteed memory isolation/integrity/independence from one another?
Upon process creation, guest-OS assigns a distinct PT data structure for the newly created process (+2)
As part of creating a memory footprint for the process, the guest-OS creates VPN to PPN mappings in the PT by executing privileged operations (+2)
These privileged operations result in traps to the hypervisor which emulates them on behalf of the guest-OS (both populating the PT data structure for the newly created process in the guest-OS as well as the S-PT data structure in the hypervisor). (+2)
The distinct PT data structure for EACH process within the guest-OS thus gives the isolation/integrity/independence guarantees for the processes from one another.
An application process starts up in a guest OS running on top of Xen. List the steps taken by the guest OS and Xen to set up the page table for this process. What is the mechanism by which the guest OS schedules this process to run on the processor? You can assume any reasonable hypervisor calls without worrying about the exact syntax. State your assumptions.
- Guest OS allocates a physical page frame from its own reserve as the page table (PT) data structure for the newly created process
- Guest OS registers this page frame with Xen as a page table by using a hypercall.
- Guest OS updates batched VPN to PPN mappings into this page table data structure via the hypervisor by using a single hypercall.
- When it wants to schedule this process, the guest OS issues a hypervisor call to change the PTBR to point to the PT for this process
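A guest-OS-side sketch of these steps, loosely modeled on Xen's paravirtual MMU interface (the real hypercalls are HYPERVISOR_mmu_update and HYPERVISOR_mmuext_op); the wrapper and helper names below are hypothetical:

    struct mmu_update { unsigned long ptr; unsigned long val; };

    void hypercall_pin_page_table(unsigned long pfn);            /* hypothetical wrapper */
    void hypercall_mmu_update(struct mmu_update *r, int count);  /* hypothetical wrapper */
    void hypercall_new_baseptr(unsigned long pfn);               /* hypothetical wrapper */
    int  fill_initial_mappings(struct mmu_update *r);            /* hypothetical; fills <= 128 entries */

    void setup_process_page_table(unsigned long pt_pfn)
    {
        /* pt_pfn was allocated by the guest from its own reserve */
        hypercall_pin_page_table(pt_pfn);    /* register the frame with Xen as a PT */

        struct mmu_update reqs[128];         /* batch of VPN -> PPN updates */
        int n = fill_initial_mappings(reqs);
        hypercall_mmu_update(reqs, n);       /* ONE hypercall for the whole batch */
    }

    void schedule_process(unsigned long pt_pfn)
    {
        hypercall_new_baseptr(pt_pfn);       /* ask Xen to switch the PTBR */
    }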
One of the techniques for efficient use of memory which is a critical resource for performance is to dynamically adjust the memory allocated to a VM by taking away some or all of the “idle memory” (unused memory) from a VM. This is referred to in the literature as “taxing” the VM proportional to the amount of idle memory it has. Why is a 100% tax rate (i.e. taking away ALL the idle memory from a VM) not a good idea?
Any sudden increase in the working set size of the VM will result in poor performance of that VM. This might violate the service agreement for that VM.
VM1 and VM2 are hosted on top of VMware ESX server. VM1 has a page fault at VPN = 100. The page is being brought into MPN = 4000. The contents hash for this page matches another MPN = 8000, which belongs to VM2 at VPN = 200. Give the steps taken by VMware ESX server including the data structures used to possibly optimize the memory usage by using VM oblivious page sharing.
- ESX server has a hash table data structure in which each entry keeps the content hash of a page, the MPN holding its contents, and either a reference count (for a shared frame) or a back-reference hint identifying the VM and PPN (for a hint frame). (+1)
- As the content hash for the faulting page of VM1 matched an entry in the hash table (MPN=8000), now perform full comparison of this page (MPN=4000) and the matched page (MPN= 8000). (+1)
- If comparison fails, then create a new hint frame entry for this new page in the hash table. (+1)
- If these two pages are identical then mark both the corresponding PPNs to point to the same MPN (say MPN=8000) in their respective shadow page tables. (+1)
- Mark these page table entries as “copy on write”. (+1)
- Increase the reference count by 1 in the hash table for MPN=8000, and free up machine page MPN=4000 (+1)
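A sketch of this flow; the frame_entry fields follow the ESX paper's description, but every function name below is hypothetical:

    struct frame_entry {
        unsigned long hash;      /* content hash of the page */
        unsigned int  mpn;       /* machine page holding the contents */
        int           shared;    /* shared frame (refcounted) vs. hint frame */
        int           refcount;  /* number of PPNs mapped to this MPN */
        /* hint frames also record a back-reference: (VM, PPN) */
    };

    struct vm;  /* opaque guest handle */

    unsigned long hash_page_contents(unsigned int mpn);          /* hypothetical */
    struct frame_entry *lookup(unsigned long hash);              /* hypothetical */
    int  full_compare(unsigned int mpn_a, unsigned int mpn_b);   /* hypothetical */
    void map_ppn_to_mpn(struct vm *v, unsigned int ppn,
                        unsigned int mpn, int cow);              /* hypothetical */
    void insert_hint(unsigned long h, unsigned int mpn,
                     struct vm *v, unsigned int ppn);            /* hypothetical */
    void free_machine_page(unsigned int mpn);                    /* hypothetical */

    void try_share(struct vm *vm1, unsigned int ppn, unsigned int new_mpn)
    {
        unsigned long h = hash_page_contents(new_mpn);
        struct frame_entry *e = lookup(h);
        if (e && full_compare(e->mpn, new_mpn)) {
            /* identical contents: back both PPNs with e->mpn, copy-on-write */
            map_ppn_to_mpn(vm1, ppn, e->mpn, /*cow=*/1);
            e->shared = 1;
            e->refcount++;
            free_machine_page(new_mpn);      /* reclaim the duplicate frame */
        } else {
            insert_hint(h, new_mpn, vm1, ppn);  /* remember as a hint frame */
        }
    }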
True or false: It is impossible to support the sequential consistency memory model in a multiprocessor that has caches associated with each processor. (An answer without any justification gets zero credit.)
False. (+1)
What is needed is a mechanism to get exclusive access to a memory location before writing to it in the local cache, and to ensure that the memory location is purged from (or updated in) the peer caches.
For example, in an SMP, a hardware solution would be for the writing processor to acquire the bus and indicate to its peers that it is writing this memory location so that the peers can take appropriate actions in their local caches (invalidate/update if that memory location exists in the cache)
(Anderson’s queue lock)
Imagine a 100-processor cache coherent multiprocessor. The wordsize of the processor is 32 bits, which is the unit of memory access by the processor’s instruction set.
The cache blocksize is 16 bytes. We would like to implement the Anderson’s queue-based lock algorithm. Recall that in this algorithm, each processor marks its unique spin position in a “flags” array (associated with each lock) so that it can locally spin on that variable without disrupting the work of other processors until its turn comes to acquire the lock.
How much space is needed for each flags array associated with a lock?
To ensure there is no false sharing, we need to allocate each spin variable in a distinct cache line. Space for one spin variable = cache blocksize = 16 bytes
(+2)
We will assume that the maximum number of threads in an application cannot exceed the number of processors. Thus in the worst case we need 100 distinct spin locations in the flags array.
(no points deducted for not stating this assumption)
So total space needed in the flags array for each lock variable = 100*16 = 1600 bytes
(+2)
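A C sketch of the layout this computation implies, padding each spin variable to its own 16-byte cache line (the lock/unlock outlines are illustrative, not a tuned implementation):

    #include <stdatomic.h>

    #define NPROC     100
    #define LINE_SIZE 16

    struct flag_slot {
        volatile int has_lock;              /* the 4-byte spin variable */
        char pad[LINE_SIZE - sizeof(int)];  /* pad slot to a full cache line */
    };

    /* Initialize with flags[0].has_lock = 1, all others 0, next_slot = 0. */
    struct anderson_lock {
        struct flag_slot flags[NPROC];      /* 100 * 16 = 1600 bytes */
        atomic_uint next_slot;              /* fetch-and-increment to enqueue */
    };

    unsigned int anderson_acquire(struct anderson_lock *l)
    {
        unsigned int me = atomic_fetch_add(&l->next_slot, 1) % NPROC;
        while (!l->flags[me].has_lock)
            ;                               /* local spin on own cache line */
        return me;                          /* remember slot for release */
    }

    void anderson_release(struct anderson_lock *l, unsigned int me)
    {
        l->flags[me].has_lock = 0;                /* reset own slot */
        l->flags[(me + 1) % NPROC].has_lock = 1;  /* hand lock to successor */
    }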
(MCS lock)
Recall that the MCS lock algorithm implements a queue of lock requestors using a linked list.
Let’s say the state of a lock is as follows: T1 is the current lock holder and there is no other request in the queue yet; and a “new” request for the lock is just starting to appear.
What could happen given the above state?
T1 could think there is no one else waiting for the lock and set L to nil, resulting in a livelock (“new” process will be spinning forever waiting for the lock).
Recall that the MCS lock algorithm implements a queue of lock requestors using a linked list.
Let’s say the state of a lock is as follows: T1 is the current lock holder and there is no other request in the queue yet; and a “new” request for the lock is just starting to appear.
What should happen? (assume no other lock requestors appear on the scene)
T1 should recognize that there is a new request forming for the lock, and wait for its next pointer to point to new, so that it can signal it upon releasing the lock.
The MCS algorithm assumes the presence of an atomic instruction:
compare-and-swap(a, b, c), whose semantics is as follows:
If a == b, then set a = c and return 1;
Else do nothing and return 0;
How does MCS lock algorithm use this instruction to make sure that the right thing happens?
In MCS, at lock release the releasing processor T1 does the following:

    if (compare-and-swap(L, T1, nil) == 0) {
        /* CAS failed: the "new" request is forming; spin awaiting it
           to set T1's next pointer */
        while (T1->next == nil)
            ;                                                        (+3)
        T1->next->got_it = 1;  /* signal the next process it has the lock */  (+1)
    }
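For completeness, a compact C11 sketch of the full MCS acquire/release consistent with the semantics above; the qnode layout is illustrative, with got_it matching the flashcard's field name:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct qnode {
        struct qnode *_Atomic next;
        atomic_bool got_it;
    };

    struct qnode *_Atomic L = NULL;   /* tail of the queue (the lock word) */

    void mcs_acquire(struct qnode *me)
    {
        atomic_store(&me->next, NULL);
        atomic_store(&me->got_it, false);
        struct qnode *pred = atomic_exchange(&L, me);  /* fetch-and-store */
        if (pred != NULL) {                 /* queue was non-empty: must wait */
            atomic_store(&pred->next, me);  /* link ourselves in */
            while (!atomic_load(&me->got_it))
                ;                           /* local spin */
        }
    }

    void mcs_release(struct qnode *me)
    {
        struct qnode *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* no visible successor: try to swing L from me to NULL */
            struct qnode *expected = me;
            if (atomic_compare_exchange_strong(&L, &expected, NULL))
                return;                     /* truly no successor */
            /* CAS failed: a "new" requestor is joining; wait for the link */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&succ->got_it, true);  /* hand over the lock */
    }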
(barrier algorithms)
Imagine a message-passing multiprocessor with point-to-point communication links between any two processors. Between tournament and dissemination algorithms, which would you expect to do better for barrier synchronization? Why?
Dissemination would do better.
At each round, the dissemination barrier has O(N) communications, but they can all go on in parallel. The number of rounds of communication in dissemination is ceil(log(N)).
While all the communication can go in parallel in tournament barrier as well, the algorithm traverses the tree twice (arrival + wakeup), resulting in a total of 2 * log(N) rounds of communication.
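A toy illustration of the dissemination rounds (in round k, processor i signals processor (i + 2^k) mod N); for N = 8 this prints ceil(log2(8)) = 3 rounds, versus 2 * 3 = 6 for the tournament barrier's arrival-plus-wakeup traversals:

    #include <stdio.h>

    int main(void)
    {
        int N = 8;
        for (int k = 0; (1 << k) < N; k++) {        /* ceil(log2(N)) rounds */
            printf("round %d:", k);
            for (int i = 0; i < N; i++)
                printf(" %d->%d", i, (i + (1 << k)) % N);
            printf("\n");
        }
        return 0;
    }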
Recall that light-weight RPC (LRPC) is for cross-domain calls within a single host without going across the network. In a simple implementation of this paradigm, it is understandable that there has to be a copy from the client’s address space into the kernel address space, and from the kernel address space to the server’s address space before executing the server procedure. What are the reasons for the two additional copies (one in the client’s address space, and one in the server’s address space)?
The kernel has no knowledge of the semantics of the call. Therefore, before going to the kernel, on the client-side we have to serialize the arguments of the call into a contiguous sequence of bytes in the client’s address space before the kernel copies it into its kernel buffers.
Similarly, on the server side, we have to unmarshal the contiguous sequence of bytes received from the kernel into the actual arguments of the call, as expected by the server procedure, in the server’s address space.
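A sketch of the client-side copy being described; all names are illustrative, and note that actual LRPC removes this copy with shared argument stacks (A-stacks):

    #include <stddef.h>
    #include <string.h>

    struct rpc_buffer { char bytes[1024]; size_t used; };

    static void marshal(struct rpc_buffer *buf, const void *arg, size_t len)
    {
        memcpy(buf->bytes + buf->used, arg, len);  /* pack args contiguously */
        buf->used += len;
    }

    /* hypothetical client stub for a call like: int add(int a, int b) */
    void client_stub_add(struct rpc_buffer *buf, int a, int b)
    {
        buf->used = 0;
        marshal(buf, &a, sizeof a);
        marshal(buf, &b, sizeof b);
        /* trap to the kernel; the kernel copies buf into a kernel buffer,
           then into the server's address space, where the server stub
           unmarshals the bytes back into actual arguments */
    }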
(multiprocessor scheduling)
Recall that the “limited minimum intervening” thread scheduling policy uses affinity information for “k” processors, where the “k” processors are a subset of the processors that this thread ran on during its lifetime, for the scheduler to make scheduling decisions. Sketch the data structure for the TCB (assume k = 4) in this scheduling regime (you need to show only the fields that are to be used for scheduling purposes).
    struct affine_type {
        int processor_id;   /* processor number */
        int affinity;       /* number of intervening threads */
    };                                                            (+2)

    struct TCB_type {
        /* ... other unrelated info ... */
        struct affine_type affinity[4];  /* top-4 affinity info for this thread */
    };
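A hypothetical helper showing how the scheduler might consult this TCB: look up the intervening-thread count for a candidate processor, treating a processor absent from the top-4 list as having no affinity. A scheduler would then favor the ready thread with the smallest such count for the processor in question.

    #include <limits.h>

    int intervening_count(struct TCB_type *tcb, int cpu)
    {
        for (int i = 0; i < 4; i++)
            if (tcb->affinity[i].processor_id == cpu)
                return tcb->affinity[i].affinity;
        return INT_MAX;  /* not in the top 4: assume a cold cache */
    }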
Parallel System Case Studies
(Tornado)
Tornado suggests using “existence guarantees” with reference counts instead of hierarchical locking to avoid serialization.
Using page fault service as a concrete example, discuss how this helps with reducing the pitfalls of serialization of page fault service in a multi-threaded process.
Imagine two threads T1 and T2 of the same process executing in the multiprocessor, sharing the same representation of the “process” object. Both of them experience page faults SIMULTANEOUSLY. Assume T1’s page fault is contained in Region 2 and T2’s page fault is contained in Region 1. The page fault service will only update the respective region objects. Therefore, the “process” object need not be locked. But to ensure that some other subsystem (say a load balancer) does not remove the “process” object, the page fault service can increment the ref-count on the “process” object on the forward path through the object hierarchy, and decrement the ref-count on the reverse path (once the service is complete), thus avoiding the serialization of the page fault service for T1 and T2.
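A sketch of the existence guarantee itself; the names are illustrative, not Tornado's actual clustered-object API:

    #include <stdatomic.h>

    struct process_object {
        atomic_int refcount;   /* existence guarantee: object is not freed
                                  while refcount > 0 */
        /* ... region list and other clustered-object state ... */
    };

    void page_fault_service(struct process_object *p, unsigned long addr)
    {
        atomic_fetch_add(&p->refcount, 1);   /* forward path: pin the object */
        /* locate the region containing addr; lock and update ONLY that
           region object, so T1 and T2 faulting in different regions
           proceed in parallel without locking the process object */
        atomic_fetch_sub(&p->refcount, 1);   /* reverse path: unpin */
    }

    /* A destroyer (e.g., the load balancer) frees the process object only
       after the refcount drains to zero. */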
Using the “process” object as a concrete example, identify situations where existence guarantees may not be sufficient and you may have to lock an object.
The “process” object is the equivalent of a context block. Every representation of this object is shared by multiple threads that are to be scheduled on the cores of a single processor or multiple processors (depending on the details of the architecture). The “process” data structure has information pertaining to the currently executing threads. If a thread is context switched out, then the scheduler subsystem would need to save the volatile state of the thread, and re-adjust the ready-queue, etc. All of these actions have to be performed atomically which would require locking the process object.
(full points if a convincing reason is given; need not be elaborate as above;
-2 if “atomically” changing multiple fields of the process object is not mentioned)
A process is currently executing on the processor. The process makes a
system call. List the steps involved in getting this system call serviced
by the operating system that this process belongs to. You should be precise
in mentioning the data structures involved in Exokernel that make this
service possible.
System Call traps to Exokernel. (+2)
Exokernel identifies the library OS responsible for handling this system
call using the Time Slice Vector (+2)
Exokernel uses the PE (Processor Environment) data structure associated with
this library OS to get the “system call context”, which is the entry point
registered with Exokernel by the library OS for system calls. (+2)
Exokernel “upcalls” the library OS using this entry point. (+2)
The library OS services the system call (+1)
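A sketch of this dispatch path; the field and function names are illustrative, not Exokernel's actual code:

    #define NUM_SLOTS 64   /* illustrative size of the time slice vector */

    struct pe {                                   /* Processor Environment, one per library OS */
        void (*syscall_context)(int syscall_no);  /* entry point registered for system calls */
        /* ... exception, interrupt, and addressing contexts ... */
    };

    struct pe *time_slice_vector[NUM_SLOTS];      /* slot -> library OS scheduled in it */

    void exokernel_syscall_trap(int current_slot, int syscall_no)
    {
        struct pe *pe = time_slice_vector[current_slot];  /* identify the running libOS */
        pe->syscall_context(syscall_no);                  /* "upcall" into the library OS */
    }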
Explain how SPIN makes OS service extensions as cheap as a procedure call.
SPIN implements each service extension as a Modula-3 object: interactions to
and from this object are compile-time checked and runtime verified. Thus
each object is a logical protection domain. (+2)
SPIN and the service extensions are co-located in the same hardware address
space. This means that OS service extensions do not require border crossing,
and can be as cheap as a procedure call. (+2)
Give two reasons in the form of concise bullets, explaining why SPIN’s
vision of OS extensibility purely via Language-enforced protection checks
may be impractical for building real operating systems.
Modifying architectural features (e.g., hardware registers in the CPU, memory
mapped registers in device controllers) may necessitate a service extension
to step out of its logical protection domain (i.e., Modula-3 compiler-enforced
access checks). (+2)
A significant chunk (upwards of 50%) of the OS code base (e.g., device
drivers) is from third-party OEM vendors and it is difficult if not
impossible to force everyone to use a specific programming language. (+2)
What is the main assertion of L3 microkernel?
Microkernel-based design of an OS need not be performance deficient. (+1)
With the right abstractions in the microkernel and architecture-specific
implementation of the abstractions, a microkernel-based OS can be as performant
as a monolithic kernel.
Why is the assertion of L3 at odds with the premise of SPIN/Exokernel?
SPIN/Exokernel used Mach as an exemplar for microkernel-based OS structure
whose design focus was on portability (+1)
On the other hand, L3 argues that the primary goal of a microkernel should be
performance not portability. (+1)
Consider a 64-bit paged-segmented architecture. The virtual address space
is 2^64 bytes. The TLB does not support address space id, so on an address
space switch, the TLB has to be flushed. The architecture has two segment
registers:
LB: lower bound for a segment
UB: upper bound for a segment