Midterm Flashcards

1
Q

(Extensibility vs. Virtualization) Give the differences and similarities between the goals of extensibility (a la SPIN/Exokernel/L3) with the goals of virtualization (a la Xen/VMware ESX server).

A

Similarities

  • Customization of system services commensurate with application needs
  • Allowing policies for resource allocation to be dictated by the customized system service, by providing a thin layer above the hardware with just the mechanisms

Differences

  • Customization granularity is an entire OS in the case of virtualization, while it is any desired granularity (e.g., individual subsystems of the OS) in the case of SPIN/Exokernel/L3
  • Striving to avoid penalties for border crossings between system services and the kernel is paramount in Exokernel/SPIN/L3, while in virtualization the isolation/integrity/independence of the address spaces of the hypervisor and the OSes above it is of paramount importance
2
Q

Give the differences and similarities between the mechanisms in Exokernel for extensibility with those in Xen for virtualization.

A
Similarities

  • Both Exo and Xen catch program discontinuities and pass them up to the OS above
  • Both Exo and Xen have a way of identifying the OS that is currently executing so that they can correctly do the upcall to the OS that has to deal with the program discontinuity.

Differences

  • Exo allows a library OS to download arbitrary code into the kernel, while Xen has a structured way of communicating with the guest OSes via the I/O ring data structure
  • Exo does hardware resource allocation to the library OSes at a very fine granularity (e.g., an individual TLB entry), while Xen does it at a much coarser granularity (e.g., batching page table updates)
  • Exo’s default CPU scheduling uses a linear vector of “time slots”, while Xen uses proportional-share and fair-share schedulers and accurately accounts for the time each guest is scheduled to run on the CPU
3
Q

Give two plausible reasons that could make system calls more expensive than simple procedure calls.

A
  • Cost of changing hardware address spaces (updating the PTBR, flushing or switching the TLB)
  • Cost of change in locality (CPU affinity, cold cache)
4
Q

Consider a byte-addressable 32-bit architecture. The virtual address space is 2^32 bytes. The TLB supports entries tagged with an address space ID. We are building a microkernel-based design of an operating system for this architecture, in which every subsystem will be implemented as a process above the kernel. Suggest what design decisions you will take to make sure that the performance will be as good as a monolithic OS.

A
  • Provide mechanisms in the kernel for generating unique IDs.
  • Give unique address space IDs to each subsystem so that the AS-tagged TLB can be exploited
  • No need to flush the TLB when we move from one subsystem to another
  • Use kernel threads and do “hand-off” scheduling between subsystems to implement protected procedure calls between the subsystems
  • Provide efficient memory sharing mechanisms across subsystems (L3 mechanisms of map/grant/flush) so that copying overheads during protected procedure calls are avoided
  • Provide low overhead mechanism in the kernel for catching program discontinuities (external interrupt) and packaging and delivering it like a protected procedure call to the subsystem that has to deal with the discontinuity
5
Q

Why is a shadow page table needed?

A
  • To support the guest OS’s illusion of a contiguous physical address space
  • Since the guest OS does not have access to the hardware page table, by keeping the mapping from guest VPNs to MPNs in the hypervisor, the address translation can happen at hardware speed.
6
Q

(full virtualization) How many shadow page tables are there?

A

There is one S-PT per guest OS currently running

7
Q

(full virtualization) How big is the shadow page table?

A

It is proportional to the number of processes currently running in that guest OS

8
Q

(full virtualization) Guest OS is servicing a page fault. It finds a free PPN and decides to map the faulting process’s VPN to this PPN. List the steps taken from here on by the guest and the hypervisor that ensure the process will not fault on this VPN again.

A
  1. To put this mapping into the page table, the guest OS has to execute a privileged instruction.
  2. This will result in a trap into the hypervisor, which will emulate the required action (VPN to PPN mapping) in the guest OS’s page table data structure.
  3. Further, the hypervisor will install a direct mapping from VPN to MPN (the MPN corresponding to the PPN) into the S-PT data structure for this guest OS and into the TLB.
  4. Since the S-PT is the “hardware page table” in an architecture such as Intel’s, the process will not page fault the next time it runs on the CPU. (A sketch of this trap-and-emulate path follows.)
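A minimal sketch of the trap-and-emulate path described above (types and helper names are hypothetical, not actual hypervisor code):

/* Hypervisor handler invoked when the guest's privileged PT update traps.
   vpn and ppn are decoded from the instruction the guest attempted. */
void emulate_guest_pt_update(struct guest *g, unsigned long vpn, unsigned long ppn)
{
    guest_pt_set(g, vpn, ppn);               /* steps 1-2: emulate the guest's intended VPN -> PPN entry */
    unsigned long mpn = g->ppn_to_mpn[ppn];  /* step 3: translate the guest PPN to its backing machine page */
    shadow_pt_set(g->spt, vpn, mpn);         /* step 3: install VPN -> MPN in the shadow page table,
                                                the table the hardware actually walks */
    tlb_insert(vpn, mpn);                    /* step 4: optionally prime the TLB so the access succeeds on resume */
}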
9
Q

Guest OS decides to context switch from the currently running process to another. List the steps taken by the guest and the hypervisor until the new process is running on the processor.

A
  1. Guest OS executes the privileged instruction for changing the PTBR to point to the chosen process (say P2). Results in a trap to the hypervisor. (+2)
  2. From the PPN of the PT for P2, the hypervisor will know the offset into the S-PT for that guest-OS where the PT resides in machine memory. (+2)
  3. Hypervisor installs the MPN thus located as the PT for P2 into PTBR. (+2)
  4. Once other formalities are complete associated with context switching (which also needs hypervisor intervention) such as saving the volatile state of the currently running process into the associated PCB, and loading the volatile state of P2 from its PCB into the processor, the process P2 can start executing.
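A minimal sketch of the PTBR-switch emulation described above (hypothetical names; how the hypervisor locates P2's page table in the S-PT depends on its bookkeeping):

/* Hypervisor handler for the guest's trapped "set PTBR" instruction */
void emulate_set_ptbr(struct guest *g, unsigned long guest_pt_ppn)
{
    unsigned long mpn = spt_lookup(g, guest_pt_ppn);  /* find where P2's PT lives in machine memory */
    hardware_set_ptbr(mpn);                           /* point the real PTBR at that machine page */
}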
10
Q

The guest OS is running multiple processes within it. The guest OS itself appears as a “process” so far as the hypervisor is concerned. How is each of the processes inside the guest OS guaranteed memory isolation/integrity/independence from one another?

A

 Upon process creation, guest-OS assigns a distinct PT data structure for the newly created process (+2)
 As part of creating a memory footprint for the process, the guest-OS creates VPN to PPN mappings in the PT by executing privileged operations (+2)
 These privileged operations result in traps to the hypervisor which emulates them on behalf of the guest-OS (both populating the PT data structure for the newly created process in the guest-OS as well as the S-PT data structure in the hypervisor). (+2)
 The distinct PT data structure for EACH process within the guest-OS thus gives the isolation/integrity/independence guarantees for the processes from one another.

11
Q

An application process starts up in a guest OS running on top of Xen. List the steps taken by the guest OS and Xen to set up the page table for this process. What is the mechanism by which the guest OS schedules this process to run on the processor? You can assume any reasonable hypervisor calls without worrying about the exact syntax. State your assumptions.

A
  1. Guest OS allocates a physical page frame from its own reserve as the page table (PT) data structure for the newly created process
  2. Guest OS registers this page frame with Xen as a page table by using a hypercall.
  3. Guest OS updates batched VPN to PPN mappings into this page table data structure via the hypervisor by using a single hypercall.
  4. When it wants to schedule this process, the Guest OS issues a hypervisor call to change the PTBR corresponding to the PT for this process
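A rough sketch of these four steps using made-up hypercall names (the question allows assuming any reasonable hypervisor calls; these are not Xen's exact APIs):

/* Guest OS side, paravirtualized: setting up and scheduling the new process */
pt_frame = alloc_frame_from_guest_reserve();             /* 1. frame for the new page table */
hyp_register_page_table(pt_frame);                       /* 2. register it with Xen via a hypercall */
hyp_mmu_update_batch(pt_frame, vpn_ppn_updates, n);      /* 3. batched VPN -> PPN updates in one hypercall */
hyp_set_page_table_base(pt_frame);                       /* 4. change the PTBR, in effect scheduling the process */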
12
Q

One of the techniques for efficient use of memory which is a critical resource for performance is to dynamically adjust the memory allocated to a VM by taking away some or all of the “idle memory” (unused memory) from a VM. This is referred to in the literature as “taxing” the VM proportional to the amount of idle memory it has. Why is a 100% tax rate (i.e. taking away ALL the idle memory from a VM) not a good idea?

A

Any sudden increase in the working set size of the VM will result in poor performance of that VM. This might violate the service agreement for that VM.

13
Q

VM1 and VM2 are hosted on top of VMware ESX server. VM1 has a page fault at VPN = 100. The page is being brought into MPN = 4000. The contents hash for this page matches another MPN = 8000, which belongs to VM2 at VPN = 200. Give the steps taken by VMware ESX server including the data structures used to possibly optimize the memory usage by using VM oblivious page sharing.

A
  1. ESX server has a hash table data structure in which each entry keeps the content hash of a page, the MPN it refers to, and a reference count (or a hint marking for a page that is not yet shared). (+1)
  2. As the content hash for the faulting page of VM1 matches an entry in the hash table (MPN = 8000), now perform a full byte-by-byte comparison of this page (MPN = 4000) and the matched page (MPN = 8000). (+1)
  3. If the comparison fails, then create a new hint frame entry for this new page in the hash table. (+1)
  4. If the two pages are identical, then mark both the corresponding PPNs to point to the same MPN (say MPN = 8000) in their respective shadow page tables. (+1)
  5. Mark these page table entries as “copy on write”. (+1)
  6. Increase the reference count by 1 in the hash table for MPN = 8000, and free up machine page MPN = 4000. (+1) (A condensed sketch follows.)
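A condensed sketch of that flow (function and field names are illustrative, not the actual ESX implementation):

/* VM1's faulting page (VPN 100, backed by guest PPN ppn1) has just been read into MPN 4000 */
h = content_hash(machine_page(4000));
entry = hash_table_lookup(h);                    /* matches the entry describing MPN 8000 */
if (entry != NULL && full_compare(4000, entry->mpn)) {
    /* contents identical: share MPN 8000 and mark both mappings copy-on-write */
    set_ppn_to_mpn(VM1, ppn1, entry->mpn);       /* ppn1 backs VM1's VPN 100 */
    set_ppn_to_mpn(VM2, ppn2, entry->mpn);       /* ppn2 backs VM2's VPN 200 */
    mark_copy_on_write(VM1, ppn1);
    mark_copy_on_write(VM2, ppn2);
    entry->refcount++;
    free_machine_page(4000);
} else {
    /* no usable match (or contents differ): record a hint entry for MPN 4000 */
    hash_table_insert_hint(h, 4000, VM1, 100);
}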
14
Q

It is impossible to support sequential consistency memory model in a multiprocessor that has caches associated with each processor. (An answer without any justification gets zero credit.)

A

False. (+1)
 What is needed is a mechanism to get exclusive access to a memory location before writing to it in the local cache, and to ensure that the memory location is purged from the peer caches.
 For example, in an SMP, a hardware solution would be for the writing processor to acquire the bus and indicate to its peers that it is writing this memory location so that the peers can take appropriate actions in their local caches (invalidate/update if that memory location exists in the cache)

15
Q

(Anderson’s queue lock)
Imagine a 100-processor cache coherent multiprocessor. The wordsize of the processor is 32 bits, which is the unit of memory access by the processor’s instruction set.
The cache blocksize is 16 bytes. We would like to implement Anderson’s queue-based lock algorithm. Recall that in this algorithm, each processor marks its unique spin position in a “flags” array (associated with each lock) so that it can locally spin on that variable without disrupting the work of other processors until its turn comes to acquire the lock.
How much space is needed for each flags array associated with a lock?

A

 To ensure there is no false sharing, we need to allocate each spin variable in a distinct cache line. Space for one spin variable = cache blocksize = 16 bytes
(+2)
 We will assume that the maximum number of threads in an application cannot exceed the number of processors. Thus in the worst case we need 100 distinct spin locations in the flags array.
(no points deducted for not stating this assumption)
 So total space needed in the flags array for each lock variable = 100*16 = 1600 bytes
(+2)
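A self-contained sketch of this sizing under the stated assumptions (100 processors, 16-byte cache blocks):

#include <stdio.h>

#define NUM_PROCS       100
#define CACHE_LINE_SIZE 16

/* Each spin location gets its own cache line to avoid false sharing. */
struct flag_slot {
    volatile int must_wait;                     /* the word a waiting processor spins on */
    char pad[CACHE_LINE_SIZE - sizeof(int)];    /* pad the slot out to a full cache line */
};

struct flag_slot flags[NUM_PROCS];              /* one slot per potential lock requester */

int main(void)
{
    printf("flags array size per lock: %zu bytes\n", sizeof(flags)); /* 100 * 16 = 1600 */
    return 0;
}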

16
Q

(MCS lock)
Recall that the MCS lock algorithm implements a queue of lock requestors using a linked list.
Let’s say the state of a lock is as follows: T1 is the current lock holder and there is no other request in the queue yet; and a “new” request for the lock is just starting to appear.

What could happen given the above state?

A

T1 could think there is no one else waiting for the lock and set L to nil, resulting in a livelock (“new” process will be spinning forever waiting for the lock).

17
Q

Recall that the MCS lock algorithm implements a queue of lock requestors using a linked list.
Let’s say the state of a lock is as follows: T1 is the current lock holder and there is no other request in the queue yet; and a “new” request for the lock is just starting to appear.

What should happen? (assume no other lock requestors appear on the scene)

A

T1 should recognize that there is a new request forming for the lock, and wait for its next pointer to point to new, so that it can signal it upon releasing the lock.

18
Q

The MCS algorithm assumes the presence of an atomic instruction:
compare-and-swap(a, b, c), whose semantics is as follows:
If a == b, then set a = c and return 1;
Else do nothing and return 0;
How does MCS lock algorithm use this instruction to make sure that the right thing happens?

A
In MCS, at lock release the releasing processor T1 does the following:
if (compare-and-swap(L, T1, nil) == 0) {
    /* CAS failed: a new requestor exists; spin until it links itself behind T1 */
    while (T1->next == nil)
        ; /* spin */
    T1->next->got_it = 1; /* signal the successor that it now has the lock */ (+1)
} (+3)
/* else: no one was waiting, and L has been atomically set back to nil */
19
Q

(barrier algorithms)
Imagine a message-passing multiprocessor with point to point communication links between any two processors. Between tournament and dissemination algorithms, which would you expect to do better for barrier synchronization? Why?

A

 Dissemination would do better.
 At each round dissemination barrier has O(N) communications. But they can all go on in parallel. The number of rounds of communication in dissemination is ceil(log(N)).
 While all the communication can go in parallel in tournament barrier as well, the algorithm traverses the tree twice (arrival + wakeup), resulting in a total of 2 * log(N) rounds of communication.
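A concrete count, assuming N = 64 processors: dissemination needs ceil(log2 64) = 6 rounds, with all 64 messages of a round in flight in parallel over the point-to-point links; the tournament barrier needs log2 64 = 6 rounds for arrival plus 6 rounds for wakeup, i.e., 12 rounds of communication in total.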

20
Q

Recall that light-weight RPC (LRPC) is for cross-domain calls within a single host without going across the network. In a simple implementation of this paradigm, it is understandable that there has to be a copy from the client’s address space into the kernel address space, and from the kernel address space to the server’s address space before executing the server procedure. What are the reasons for the two additional copies (one in the client’s address space, and one in the server’s address space)?

A

 The kernel has no knowledge of the semantics of the call. Therefore, before going to the kernel, on the client-side we have to serialize the arguments of the call into a contiguous sequence of bytes in the client’s address space before the kernel copies it into its kernel buffers.
 Similarly, on the server side, we have to populate the contiguous sequence of bytes received from the kernel into the server’s address space into the actual arguments of the call as expected by the server procedure.

21
Q

(multiprocessor scheduling)
Recall that “limited minimum intervening” thread scheduling policy uses affinity information for “k” processors, where “k” is a subset of the processors that this thread ran on during its lifetime for the scheduler to make scheduling decisions. Sketch the data structure for the TCB (assume k = 4) in this scheduling regime (you need to show only the fields that are to be used for scheduling purposes).

A
struct affine_type {
    int processor_id; /* processor number */
    int affinity;     /* number of intervening threads since this thread last ran there */
}; (+2)

struct TCB_type {
    /* ... other unrelated info ... */
    struct affine_type affinity[4]; /* top 4 affinity entries for this thread */
};
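A small usage sketch showing how a scheduler might consult these fields (pick_processor is a hypothetical helper, not from the paper; it chooses the remembered processor with the fewest intervening threads):

/* Among the k = 4 remembered processors, pick the one with the smallest
   number of intervening threads since this thread last ran there. */
int pick_processor(struct TCB_type *tcb)
{
    int best = tcb->affinity[0].processor_id;
    int fewest = tcb->affinity[0].affinity;
    for (int i = 1; i < 4; i++) {
        if (tcb->affinity[i].affinity < fewest) {
            fewest = tcb->affinity[i].affinity;
            best = tcb->affinity[i].processor_id;
        }
    }
    return best;
}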
22
Q

Parallel System Case Studies
6. (6 mins, 10 points) (Tornado)
Tornado suggests using “existence guarantees” with reference counts instead of hierarchical locking to avoid serialization.

Using page fault service as a concrete example, discuss how this helps with reducing the pitfalls of serialization of page fault service in a multi-threaded process.

A

Imagine two threads T1 and T2 of the same process executing on the multiprocessor, sharing the same representation of the “process” object. Both of them experience page faults simultaneously. Assume T1’s page fault is contained in Region 2 and T2’s page fault is contained in Region 1. The page fault service will only update the respective region objects. Therefore, the “process” object need not be locked. But to ensure that some other subsystem (say a load balancer) does not remove the “process” object, the page fault service can increment the ref-count on the “process” object on the forward (downward) path through the object hierarchy, and decrement the ref-count on the reverse path (once the service is complete), thus avoiding serialization of the page fault service for T1 and T2.
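A minimal sketch of the existence-guarantee idea (names such as atomic_inc and find_region are illustrative, not actual Tornado interfaces):

/* Page-fault path: pin the process object with a reference count instead of
   holding its lock, then do all the real work on the covering region object. */
void handle_page_fault(struct process *p, unsigned long fault_addr)
{
    atomic_inc(&p->refcount);                        /* existence guarantee: p cannot be torn down */
    struct region *r = find_region(p, fault_addr);   /* only the region's own lock is taken */
    region_service_fault(r, fault_addr);             /* T1 and T2 proceed in parallel in different regions */
    atomic_dec(&p->refcount);                        /* drop the existence guarantee */
}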

23
Q

Using the “process” object as a concrete example, identify situations where existence guarantees may not be sufficient and you may have to lock an object.

A

The “process” object is the equivalent of a context block. Every representation of this object is shared by multiple threads that are to be scheduled on the cores of a single processor or multiple processors (depending on the details of the architecture). The “process” data structure has information pertaining to the currently executing threads. If a thread is context switched out, then the scheduler subsystem would need to save the volatile state of the thread, and re-adjust the ready-queue, etc. All of these actions have to be performed atomically which would require locking the process object.
(full points if a convincing reason is given; need not be elaborate as above;
-2 “atomically” changing multiple fields of the process object not mentioned)

24
Q

A process is currently executing on the processor. The process makes a
system call. List the steps involved in getting this system call serviced
by the operating system that this process belongs to. You should be precise
in mentioning the data structures involved in Exokernel that makes this
service possible.

A

 System Call traps to Exokernel. (+2)
 Exokernel identifies the library OS responsible for handling this system
call using Time Slice Vector (+2)
 Exokernel uses the PE (Processor Environment) data structure associated with
this library OS to get the “system call context”, which is the entry point
registered with Exokernel by the library OS for system calls. (+2)
 Exokernel “upcalls” the library OS using this entry point. (+2)
 The library OS services the system call (+1)

25
Q

Explain how SPIN makes OS service extensions as cheap as a procedure call.

A

 SPIN implements each service extension as a Modula-3 object: interactions to
and from this object are compile time checked and runtime verified. Thus
each object is a logical protection domain. (+2)
 SPIN and the service extensions are co-located in the same hardware address
space. This means that OS service extensions do not require border crossing,
and can be as cheap as a procedure call. (+2)

26
Q

Give two reasons in the form of concise bullets, explaining why SPIN’s
vision of OS extensibility purely via Language-enforced protection checks
may be impractical for building real operating systems.

A

 Modifying architectural features (e.g., hardware registers in the CPU, memory
mapped registers in device controllers) may necessitate a service extension
to step out of their logical protection domains (i.e., Modula-3 compiler-enforced
access checks). (+2)
 A significant chunk (upwards of 50%) of the OS code base (e.g., device
drivers) is from third-party OEM vendors and it is difficult if not
impossible to force everyone to use a specific programming language. (+2)

27
Q

What is the main assertion of L3 microkernel?

A

 Microkernel-based design of an OS need not be performance deficient. (+1)
 With the right abstractions in the microkernel and architecture-specific
implementation of the abstractions, microkernel-based OS can be as performant
as a monolithic kernel.

28
Q

Why is the assertion of L3 at odds with the premise of SPIN/Exokernel?

A

 SPIN/Exokernel used Mach as an exemplar of microkernel-based OS structure
whose design focus was on portability (+1)
 On the other hand, L3 argues that the primary goal of a microkernel should be
performance not portability. (+1)

29
Q

Consider a 64-bit paged-segmented architecture. The virtual address space
is 2^64 bytes. The TLB does not support address space id, so on an address
space switch, the TLB has to be flushed. The architecture has two segment
registers:

A

 LB: lower bound for a segment

 UB: upper bound for a segment

30
Q
There are three protection domains each requiring 32 MiB of
address space (Note: Mi = 2^20). How would you recommend implementing
the 3 protection domains that reduces border crossing overheads among
these domains?
A
All three protection domains can be packed into one hardware address space.
 Each protection domain takes up 2^25 B (32 MiB).
LB, UB (range) for each protection domain:
0 , (2^25 - 1)
2^25 , (2^26 - 1)
2^26 , (3 * 2^25 - 1)
31
Q

Explain with succinct bullets, what happens upon a call

from one protection domain to another.

A

 UB and LB hardware registers are changed to correspond to the called
protection domain.
 Context switch is made to transfer control to the entry point in the called
protection domain.
 The architecture ensures that virtual addresses generated by the called
domain are within the bounds of legal addresses for that domain.
 There is no need to flush the TLB on context switch from one protection
domain to another.

32
Q

The hypervisor gets an interrupt from the disk. How does it determine which
virtual machine it has to deliver the interrupt?

A

The interrupt is a result of a request that originated from some specific
VM. The hypervisor tags each request dispatched to the disk controller with
the id of the requesting VM. Using this information, the interrupt is
delivered to the appropriate VM.

33
Q
The hypervisor receives a packet arrival interrupt from the network
interface card (NIC). How does it determine which virtual machine it has to
deliver the interrupt?
A

Every packet from the network will have the MAC address of the NIC to which the
packet is destined. The MAC addresses are uniquely associated with the VMs. Based
on the MAC addresses associated with the VMs and the destination MAC address in the
packet header, the packet arrival interrupt is delivered to the appropriate VM.
[Note: As an aside, NAT protocol on your home router connecting several home
devices to the ISP works quite similarly.]
(+2 if MAC address of NICs associated with the VMs to direct intrpt mentioned)

34
Q

A virtualized setting uses ballooning to cater to the dynamic memory needs
of VMs. Imagine 4 VMs currently executing on top of the hypervisor. VM1
experiences memory pressure and requests the hypervisor for 100 MB of
additional memory. The hypervisor has no machine memory available currently
to satisfy the request. List the steps taken by the hypervisor in trying to
give the requested memory to VM1 using ballooning.

A

 Hypervisor keeps information on the memory allocated and actively used by
each of the VMs. (+1)
 This allows the hypervisor to decide the amount of memory to be taken from
each of the other VMs to meet VM1’s request. (+1)
 It communicates the amount of memory to be acquired from each VM to the
balloon driver in that VM. (+2)
 The balloon drivers go to work in the respective VMs and return the released
machine pages to the hypervisor. (+2)
 It gives the requested memory to the needy VM (VM1 in this case).

35
Q

One of the techniques for efficient use of memory which is a critical
resource for performance is to dynamically adjust the memory allocated to a
VM by taking away some or all of the “idle memory” (unused memory) from a
VM. This is referred to in the literature as “taxing” the VM proportional to
the amount of idle memory it has. Why is a 100% tax rate (i.e. taking away
ALL the idle memory from a VM) not a good idea?

A

Because any sudden increase in the working set size of the VM will result in poor
performance for that VM potentially violating the SLA for that VM.(+2 if working set size increase mentioned)

36
Q

Using a concrete example (e.g., a disk driver), show how copying memory
buffers between guest VMs and the hypervisor is avoided in a para
virtualized setting?

A

• I/O ring data structure shared between
hypervisor and guest VM (+1)
• Each slot in the I/O ring is a descriptor for
a unique request from the guest VM or a
unique response from the hypervisor (+1)
• Address pointer to the physical memory
page corresponding to a data buffer in the
guest OS is placed in the descriptor (+2)
• The physical memory page is pinned for the
duration of the transfer (+2)
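A minimal sketch of one descriptor slot in such a shared I/O ring (field names are illustrative, not Xen's exact ABI):

/* One slot in the I/O ring shared between a guest and the hypervisor/driver domain */
struct io_ring_descriptor {
    unsigned long request_id;   /* lets the guest match a later response to this request */
    unsigned long buffer_pfn;   /* guest physical page holding the data buffer (pinned for the transfer) */
    unsigned int  offset;       /* offset of the data within that page */
    unsigned int  length;       /* number of bytes to transfer */
    int           operation;    /* e.g., disk read or write */
    int           status;       /* filled in by the responder */
};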

37
Q

In a fully-virtualized setting, Shadow Page Table (S-PT) is a data structure
maintained by the hypervisor. Answer the following questions pertaining to
S-PT.
(i) (2 point) How many S-PT data structures are there?

A

One per guest OS currently running. (all or nothing)

38
Q

In a fully-virtualized setting, Shadow Page Table (S-PT) is a data structure
maintained by the hypervisor. Answer the following questions pertaining to
S-PT.How big is the S-PT data structure?

A

Proportional to the number of processes in that guest OS. (all or nothing)

39
Q

In a fully-virtualized setting, Shadow Page Table (S-PT) is a data structure
maintained by the hypervisor. Answer the following questions pertaining to
S-PT. What does an entry in the S-PT contain?

A
In principle it is a mapping from PPN to MPN. (+1)
However, since S-PT is the “real” hardware page table used by the architecture for
address translation (VPN->MPN), the hypervisor keeps the VPN -> MPN mapping as each
entry in the data structure. (+1)
40
Q

In a fully-virtualized setting, Shadow Page Table (S-PT) is a data structure
maintained by the hypervisor. Answer the following questions pertaining to
S-PT. The currently running OS, switches from one process (say
P1) to another process (P2). List the sequence of steps before P2 starts
running on the processor.

A

 Guest OS executes the privileged instruction for changing the PTBR to point
to the PT for P2. Results in a trap to the hypervisor. (+2)
 From the PPN of the PT for P2, the hypervisor will know the offset into the
S-PT for that VM where the PT resides in machine memory. (+2)
 Hypervisor installs the MPN thus located as the PT for P2 into PTBR. (+2)
 Once other formalities are complete associated with context switching (which
also needs hypervisor intervention) such as saving the volatile state of P1
into its PCB, and loading the volatile state of P2 from its PCB into the
processor, the process P2 can start executing.

41
Q

Sequential consistency memory model makes sense only in a non-cache coherent
(NCC) shared memory multiprocessor

A

False. (+1)
Sequential consistency memory model is a contract between software and hardware.
(+2)
It is required for the programmer to reason about the correctness of the software.
(+1)
Cache coherence is only a mechanism for implementing the memory consistency model.
It can be implemented in hardware or software. (+1)

42
Q

(MCS lock)
Recall that the MCS lock algorithm implements a queue of lock requestors
using a linked list. The MCS algorithm uses an atomic fetch-and-store(X,Y)
primitive in its implementation. The primitive returns the old value of X,
and stores Y in X. Assume the current state of a lock is as shown below:
(L)->(curr)->nil where curr is running.

(4 points) Assume two threads T1 and T2 make a lock request
simultaneously for the same lock. What sequence of actions would have
brought the data structures to the intermediate state shown below from
the current state?

Intermediate state: curr is running and holds the lock; T1 and T2 have each allocated a queue node; L points to T1’s node; neither curr’s node nor T2’s node is linked to a successor yet.

A

Though T1 and T2 are making their lock requests simultaneously, their
attempts at queuing themselves behind the current lock holder (curr)
will get serialized through the atomic fetch-and-store operation.
(+2)
 In the state shown above, T1 has definitely done a fetch-and-store, so the
lock variable L is pointing to it as the latest lock requestor. (+1)
 As for T2, the thread has allocated its queue node data structure, but
there are two possibilities with respect to where it will be in the
lock queue (either one will get full credit): (+1)
o (Possibility 1) T2 may have done a fetch-and-store prior to T1.
o (Possibility 2) T2 is yet to do its fetch-and-store.

43
Q

What does each of T1 and T2 know about the state of the
data structures at this point of time?

Intermediate state: curr is running and holds the lock; T1 and T2 have each allocated a queue node; L points to T1’s node; neither curr’s node nor T2’s node is linked to a successor yet.

A

Possibility 1:
 T2 knows its predecessor is “curr” (+1)
 T1 knows its predecessor is “T2” (+1)
Possibility 2:
 T2 does not know anything about the queue associated with L (+1)
 T1 knows its predecessor is “curr” (+1)

44
Q

What sequence of subsequent actions will ensure the correct
formation of the waiting queue of lock requestors behind the current lock holder?

Intermediate state: curr is running and holds the lock; T1 and T2 have each allocated a queue node; L points to T1’s node; neither curr’s node nor T2’s node is linked to a successor yet.

A

Possibility 1:
 T2 will set the next pointer in “curr” to point to T2. (+1)
 T1 will set the next pointer in “T2” to point to T1. (+1)
Possibility 2:
 T1 will set the next pointer in “curr” to point to T1. (+1)
 T2 will do a fetch-and-store on L; this will result in two things: (+1)
 o T2 will learn that its predecessor is T1, and will therefore set the next pointer in “T1” to point to T2.
 o L will now point to T2.
(A sketch of the acquisition path that performs these steps follows.)
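A minimal sketch of the MCS acquisition path that produces the queue states discussed above (assuming an atomic fetch_and_store primitive; names are illustrative):

struct qnode {
    struct qnode *next;   /* successor in the waiting queue, NULL if none yet */
    int got_it;           /* set by the predecessor when the lock is handed over */
};

void mcs_acquire(struct qnode **L, struct qnode *me)
{
    me->next = NULL;
    me->got_it = 0;
    /* Atomically make me the tail of the queue and obtain the old tail;
       simultaneous requesters (T1, T2) are serialized right here. */
    struct qnode *pred = fetch_and_store(L, me);
    if (pred != NULL) {
        pred->next = me;          /* link behind the predecessor (curr, T1, or T2) */
        while (!me->got_it)
            ;                     /* spin locally until the predecessor signals */
    }
    /* pred == NULL means the lock was free and I now hold it */
}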

45
Q

The MCS barrier uses a 4-ary arrival tree where the nodes
are the processors and the links point to children. What will the
arrival tree look like for 16 processors labeled P0-P15? Use P0 as the
root of the tree. You can either draw the tree or give your answer by
showing the children for each of the 16 processors below:

A
P0: children P1, P2, P3, P4
P1: children P5, P6, P7, P8
P2: children P9, P10, P11, P12
P3: children P13, P14, P15
P4 through P15: no children
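This layout follows from numbering the tree breadth-first: the children of processor i are processors 4i+1 through 4i+4, when they exist. A tiny self-contained sketch that prints the arrival tree for 16 processors:

#include <stdio.h>

int main(void)
{
    const int N = 16;                        /* processors P0..P15 */
    for (int i = 0; i < N; i++) {
        printf("P%d:", i);
        for (int c = 4 * i + 1; c <= 4 * i + 4 && c < N; c++)
            printf(" P%d", c);               /* children of Pi in the 4-ary arrival tree */
        printf("\n");
    }
    return 0;
}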
46
Q

What is the reason for such a construction of the arrival

tree? (MCS 4-ary)

A

 Unique and static location for each processor to signal barrier
completion
 Spinning on a statically allocated local word-length variable by
packing data for four processors reduces bus contention
 4-ary tree construction shows the best performance on the Sequent Symmetry
used in the experiments in the MCS paper

47
Q

Recall that light-weight RPC (LRPC) is for cross-domain calls within a
single host without going across the network. The kernel allocates A-stack
in physical memory and maps this into the virtual address space of the
client and the server. It also allocates an E-stack that is visible only to
the server. What is the purpose of the E-stack?

A

 By procedure calling convention, the server procedure expects the actual
parameters to be in a stack in its address space. E-stack is provided for
this purpose.
 The arguments placed in the A-stack by the client stub are copied into the E-stack
by the server stub. Once this is done, the server procedure can
execute as it would in a normal procedure call using the E-stack.

48
Q

A processor has 2 cores and each core is 4-way multithreaded. The last
level cache of the processor is 32 MB. The OS has the following pool of
ready to run threads:
Pool1: 8 threads each having a working set of 1 MB (medium priority)
Pool2: 3 threads each having a working set of 4 MB (highest priority)
Pool3: 4 threads each having a working set of 8 MB (medium priority)
The OS can choose any subset of the threads from the above pools to schedule
on the cores. Which threads should be scheduled that will make full use of
the available parallelism and the cache capacity while respecting the thread
priority levels?

A
 Pool 1 - 3 threads (3MB)
 Pool 2 - 3 threads (12MB)
 Pool 3 - 2 threads (16MB)
Total cache used: 31 MB
(-1 if all Pool 2 threads not scheduled)
(-3 if only Pool 1 threads scheduled)
49
Q

Each core of the multiprocessor is 4-way hardware
multithreaded. The workload managed by the OS
consists of multiple multithreaded processes. You are
designing a “thread scheduler” as a clustered
object. Give the design considerations for this
object. Discuss with concise bullets the
representation you will suggest for such an
object from each core (singleton, partitioned,
fully replicated).

A

 Use one representation of the “thread scheduler” object for each processor
core.
 Each representation has its own local queue of threads.
 No need for locking the queue for access since each representation is unique
to each core
 Each local queue is populated with at most 8 threads since each processor is
4-way hardware multithreaded (this is just a heuristic to balance interactive
threads with compute-bound threads). The threads in a local queue could be a
mix of threads from single-threaded processes and multi-threaded processes.
 Each local queue is populated with the threads of the same process (up to a
max of 4) when possible.
 If a process has less than 16 threads, then the threads are placed in the
local queues of the 4 cores in a single NUMA node.
 If a process has more than 16 threads, then the threads are split into
multiple local queues so that the threads of the same process can be co-scheduled
on the different nodes (often referred to as gang scheduling)
 Implement entry point in “thread scheduler” object for peer representations
of the same object to call each other for work stealing.
(This is a bit open ended so we have to be lenient in grading.
Give full credit if their answer explains one of the following
(a) one representation per processor core
OR
(b) one representation per NUMA node (shared by the 4 cores)
AND
a good reasoning to back the choice)
(-2) if not one of (a) or (b)
(-3) if no reasoning
(-2) if incomplete reasoning

50
Q

For the above multiprocessor, we are implementing a DRAM object to manage physical
memory. Give the design considerations for this object. Discuss with
concise bullets the representation you will suggest for such an object from
each core (singleton, partitioned, fully replicated).

A

 One representation of the DRAM object for each NUMA node (i.e., shared by
all the 4 cores of that node). Each representation manages the DRAM at that node for memory
allocation and reclamation. (+3)
 Core- and thread-sensitive allocation of physical frames to ward off false
sharing among threads running on different cores of the NUMA node.
(+2)
(-2) if no mention of allocation for different cores
(-2) if NUMA-ness does not figure in choice of representation

51
Q

Enumerate the four different types of program discontinuities that can occur
during the execution of a process with a short sentence explaining what each
one of the discontinuities is.

A
  • Exception - program generated arithmetic errors (e.g., divide by zero) [1 point]
  • Trap - Syscall [1 point]
  • Page faults [1 point]
  • External interrupt - I/O [1 point]
52
Q

SPIN relies on programming language support for the implementation of
protection domains. Discuss one pro and one con of this approach (two short
bullet points should suffice).

A
  • Pro: No cheating possible (e.g. void*-casting in C) with Modula-3, generic interface
    provides entry point, implementation hidden, allows for checking for the correct pointer
    at compile-time, allows for protection domains without incurring border crossing costs
    due to shared hardware address space [2 points]
    Note: if no mention of protection domains, [only 1 point]
  • Con: What about drivers/etc. that need H/W-access, need to step outside of language
    protection, also Modula-3 potentially not too popular, rewriting necessary [2 points]
    Note: if not mentioning drivers or H/W-access, [only 1 point]
53
Q

(Exokernel)
A library OS implements a paged virtual memory on top of Exokernel. An
application running in this OS encounters a page fault. List the steps from
the time the page fault occurs to the resumption of this application. Make
any reasonable assumptions to complete this problem (stating the
assumptions).

A
  • Fielding by exokernel [2 points]
  • exokernel notifies Library OS by calling the entry point in the PE data structure for this
    library OS [2 points]
  • Library OS allocates page frame running page replacement algorithm if necessary to
    find a free page frame from the set of page frames it already has [2 points]
  • Library OS calls exokernel through the API to install translation (faulting page, allocated
    page frame) in the TLB, presenting its capability (an encrypted key) [2 points]
54
Q

(Liedtke’s L3 Microkernel)
There are four myths that discourage microkernel-based design of an OS: (1)
kernel-user switches (border crossings) are expensive; (2) hardware address
space switching is expensive; (3) thread switches are expensive; (4)
locality loss in the caches can be expensive when protection domains are
implemented using hardware address spaces. How does Liedtke debunk these
myths? (Four bullet points - one for each myth - should suffice)

A
  • (1): Proof by construction, 123 processor cycles (incl TLB + cache misses) [2 points]
  • (2): Not necessarily, exploit H/W features, e.g. segment registers to pack small
    protection domains in the same hardware space + explicit costs for large protection
    domains much smaller than implicit costs (cache no longer warm etc.) [2 points]
  • (3): Shown by construction that this is not true, competitive to SPIN and Exokernel [2
    points]
  • (4): This was true in Mach due to its focus on portability, but if focus on performance
    this can be overcome; tailor the code to use architecture-specific features (e.g., segment registers
    on x86) [2 points]
55
Q

What is the distinction between “physical page” and “machine page” in a
virtualized setting?

A

 Physical page is the illusory view of the physical memory from the Guest OS MMU.
Physical pages are deemed contiguous from the Guest OS point of view. [1 point]
 Machine page is the view of the physical memory from the Hypervisor. This refers to the
REAL hardware physical memory. The actual “physical memory” given to a specific
guest OS maps to some portion of the real machine memory. [1 point]

56
Q

(Xen)
Recall that in a para virtualized setting, the guest operating system
manages its own allocated physical memory. (Note: you don’t have to worry
about being EXACT in terms of Xen’s APIs in answering this question; we are
only looking for your general understanding of how para virtualization
works)

A new process starts up in a para-virtualized library OS executing on
top of the Xen hypervisor. List the interaction between the library OS and
Xen to establish a distinct protection domain for the process to execute in.

A

The distinct protection domain mentioned in the question refers to the page-table for the new
process. Following are the steps:
● Library OS (Linux) allocates memory from its own reserve for a new page table. [1 point]
● Library OS registers this memory with the Hypervisor (Xen) as a page table by using a
Hypercall. [1 point]
● Library OS makes updates to this page table (virtual page, physical page mapping)
through hypervisor via batches of updates in a single hypercall. [2 points]
● Library OS changes the active page table via the hypervisor thus in effect “scheduling”
the new process to run on the processor. [1 point]

57
Q

(Xen)
Recall that in a para virtualized setting, the guest operating system
manages its own allocated physical memory. (Note: you don’t have to worry
about being EXACT in terms of Xen’s APIs in answering this question; we are
only looking for your general understanding of how para virtualization
works)

The library OS is “up called” by Xen to deal with a page fault incurred
by its currently running process. How does the library OS make the process
runnable again?

A

When a page-fault occurs, the hypervisor catches it, and asynchronously invokes the
corresponding registered handler. Following are the steps:
Inside the hypervisor:
● Xen detects the address which caused the page-fault. For example, the faulting virtual
address may be in a specific architectural register [1 point]
● This register value is copied into a suitable space in the shared data-structure between
the hypervisor and library OS. Then the hypervisor does the up-call, which activates the
registered page-fault handler. [1 point]
Inside the library OS page-fault handler:
● The library OS page fault handler allocates a physical page frame (from its pool of free,
i.e., unused pre-allocated physical memory kept in a free-list). [1 point]
● If necessary the library OS may run page replacement algorithm to free up some page
frames if its pool of free memory falls below a threshold. [1 point]
● If necessary the contents of the faulting page will be paged in from the disk. [1 point]
● Note that paging in the faulting page from the disk would necessitate additional
interactions with the hypervisor to schedule an I/O request using appropriate I/O rings. [1
point]
● Once the faulting page I/O is complete, the library OS will establish the mapping in the
page table for the process by making a hypervisor call. [1 point]

58
Q

In an Intel-like architecture, the CPU has to do address translation on
every memory access by accessing the TLB and possibly the page table. In a
non-virtualized set up, the page table is set up by the OS and is used by
the hardware for address translation through the page table base register
(PTBR) set by the OS upon scheduling a process. Recall that in a fully
virtualized setting, the page table for a process is an internal data
structure maintained by the library OS.
(i) When the library OS schedules a process to run, how does the CPU learn
where the page table is for doing address translation on every memory
access?

A

To schedule a process the library OS will do the following:
- library OS has a distinct PT data structure for each process. [1 point]
- dispatching this process to run on the processor involves setting the PTBR (a privileged
operation) [1 point]
- The library OS will try to execute this privileged operation [1 point]
- This will result in a trap into the hypervisor [1 point]
- The hypervisor will “emulate” this action by setting the PTBR [1 point]
Henceforth, the CPU will implicitly use the memory area pointed to by the PTBR as the page
table

59
Q

In an Intel-like architecture, the CPU has to do address translation on
every memory access by accessing the TLB and possibly the page table. In a
non-virtualized set up, the page table is set up by the OS and is used by
the hardware for address translation through the page table base register
(PTBR) set by the OS upon scheduling a process. Recall that in a fully
virtualized setting, the page table for a process is an internal data
structure maintained by the library OS.

(ii) When the library OS services a page fault and updates the page table
for a given process, how is this mapping conveyed to the CPU so that it can
do the address translation correctly for future access to this page by this
process?

A

● Page fault service involves finding a free physical page frame to map the faulting virtual
page to this allocated physical page frame. [1 point]
● To establish this mapping, the library OS has to update the page table or the TLB
depending on the specifics of the processor architecture assumed by the fully virtualized
library OS. Both of these are privileged operations which will result in a trap when the
library OS tries to execute either of them. [1 point]
● Hypervisor will catch the trap and “emulate” the intended PT/TLB update by the library
OS’s into the library OS’s illusion of PT/TLB. [1 point]
● More importantly, the hypervisor has a mapping of what machine page (MPN) this
physical page of the guest OS refers to in its shadow page table. [1 point]
● Hypervisor will establish a direct mapping between the virtual page and the
corresponding machine page by entering the mapping into the hardware page table (the
shadow page table) or the TLB depending on the specifics of the processor architecture.

60
Q

In a shared memory multiprocessor in which the hardware is enforcing cache
coherence, why is a “memory consistency model” necessary?

A

Memory consistency model serves as a contract between software and hardware to allow the
programmer to reason about program behavior.
(2 points if they mention hardware/software contract)
(-1 point for lack of specificity - for full credit, need to mention ordering or sequence)

61
Q

All the variables in the execution shown below are in the
shared memory of a cache coherent NUMA multiprocessor. The
multiprocessor implements sequential consistency. All variables are
initialized to 0 before execution starts.
Execution on processor P1:  x = 20  followed by  y = 30
Execution on processor P2:  w = y + x  followed by  z = x + 10
Which of the following final values are impossible with the above execution?
(circle your choices; +2 for correct choice; -1 for incorrect choice)
(i) w = 0; z = 10;
(ii) w = 0; z = 30;
(iii) w = 20; z = 30
(iv) w = 30; z = 10
(v) w = 30; z = 30
(vi) w = 50; z = 30

A

iv and v. (Under sequential consistency, w = 30 would require P2 to read y = 30 but x = 0, contradicting P1’s program order in which x = 20 is written before y = 30.)

62
Q

On an Non-Cache-Coherent (NCC) NUMA machine with per-processor caches, would
you advocate using the following lock algorithm? Explain why or why not.
LOCK(L):
back: while (L == LOCKED); //spin
if (Test_and_Set(L) == LOCKED) goto back;
UNLOCK(L):
L = UNLOCKED

A

No. Since there is no cache coherence, the lock release will never be seen by the waiting processors, which keep spinning on a stale cached copy of L.
(1 point for saying “No”. 3 points for the correct reason).

63
Q

In a large-scale invalidation-based cache coherent shared memory
multiprocessor with rich connectivity among the nodes (i.e., the
interconnection network is not a shared bus), the tournament barrier is
likely to perform better than the MCS barrier. To jog your memory, MCS
algorithm uses a 4-ary arrival and binary wakeup tree. The tournament
barrier uses a binary tree for both arrival and wakeup.

A

True. [1 point]
Tournament algorithm can exploit the rich connectivity for peer-peer signaling among the
processors in each round (going up and down the tree). [3 points]
[1 point partial credit for any of the following: there is parallel communication, or messages
aren’t serialized, or log2(n) vs log4(n), or otherwise thinking about critical path/tree depth]

64
Q

Light-weight RPC (LRPC) is for cross-domain calls within a single host
without going across the network. A specific LRPC call has totally 128
bytes of arguments to be passed to the server procedure, and 4 bytes of
return value to be returned to the client. Assuming no programming language
level support,
(i) What is the total copy overhead in bytes for the call-return with the
LRPC system? Explain why.

A
Client → A-Stack ⇒ 128
A-Stack → Server E-Stack ⇒ 128
Server E-Stack → A-Stack ⇒ 4
A-Stack → Client ⇒ 4
Total = 264 [2 points]
Partial credit (1 point) if the answer says 132 with correct logic
65
Q

(LRPC) (ii) How much of the copy overhead is due to copying into kernel buffers?

A

0 bytes

66
Q

In a multiprocessor, when a thread becomes runnable again after a blocking
system call completes, conventional wisdom suggests running the thread on
the last processor it ran on before the system call.

(i) What is the downside to this idea?

A

Cache pollution by other threads run after the last time this thread ran on a particular
processor. [2 points]
[Partial credit 1 point: for other plausible answers e.g., “poor load balancing”]

67
Q

In a multiprocessor, when a thread becomes runnable again after a blocking
system call completes, conventional wisdom suggests running the thread on
the last processor it ran on before the system call.

(ii) What do you think are important considerations in choosing the “right”
processor to schedule this thread?

A

● Cache pollution by threads run after the last time a particular thread ran on a processor
(1.5 points)
● Cache pollution by threads that will be run after a thread is scheduled to run on a
particular processor (queue length) (1.5 points)

68
Q

Given the following configuration for a chip multiprocessor:
 4 cores
 Each core is 4-way hardware multithreaded
 512 MB LLC (Last Level Cache on the chip)
Given the following pool of threads ready to run:
 Group 1: T1-T8 each requiring 8 MB of cache
 Group 2: T9-T16 each requiring 16 MB of cache
 Group 3: T17-24 each requiring 32 MB of cache
 Group 4: T25-32 each requiring 64 MB of cache
Which threads should be run by the scheduler that ensures all the hardware
threads are fully utilized and maximizes LLC utilization?

A

32 software threads
64 + 128 + 256 + 512 = 960 MB required (cumulative cache requirement)
One of many possible feasible schedules that uses all the hardware threads and uses all the
available LLC:
 T25 – T28 on Core 1 = 4 hardware threads & 256 MB
 T17 – T20 on Core 2 = 4 hardware threads & 128 MB
 T9 – T12 on Core 3 = 4 hardware threads & 64 MB
 T13 – T16 on Core 4 = 4 hardware threads & 64 MB
(Full credit for any feasible schedule that shows complete understanding of hardware threads
and LLC utilization)
[Partial credit: -1 if just one thread is chosen wrong (this is normally if they found the optimal
schedule that’s less than 512 MB, instead of less than or equal to 512 MB); -2 for each set of up
to 4 threads chosen incorrectly]

69
Q

Parallel System Case Studies
5. (7 mins, 15 points)
(a) (5 points)
An application process starts up on Tornado multiprocessor OS. What are the
steps that need to happen before the process actually starts to run?

A

● Clustered process objects created one representation per processor on which this
process has threads. [2 points]
● Clustered region objects created commensurate with the partitioning of the address
space. [2 points]
● Clustered FCM objects created to support the region objects [1 point]

70
Q

A process running on top of Tornado frees up some portion of its memory
space (say using a facility such as free()). What does Tornado have to do
to cleanup under the cover?

A

● Locate the clustered region object that corresponds to this freed address range. This
region object has the piece of the page table that corresponds to the address range
being freed. [2 points]
● Identify the replicas of this region object on all the processors and fix up the mapping of
these address ranges in the replicated copies of the page table entries corresponding to
the memory being freed. [3 points]

71
Q

(c) (5 points) (Answer True/False with justification)
The “region” concept in Tornado and the “address range” concept in Corey are
one and the same.

A

● False [1 point]
● Justification:
○ Region is invisible to the application; it is a structuring mechanism inside the
kernel to increase concurrency for page fault handling of a multithreaded
application since each region object manages a portion of the process address
space. [2 points]
○ Address range is a mechanism provided to the applications by the kernel for
threads of an application to selectively share parts of the process address space.
This reduces contention for updating page tables and allows the kernel to reduce
the amount of management work to be done by the kernel using the hints from the
application (e.g., reduce TLB consistency overhead). [2 points]

72
Q

(a) (5 points)
Assume that messages arrive out-of-order in a distributed system. Does this
violate the “happened before” relation? Justify your answer with examples.

A

● No. [1 point]
● Justification:
○ The happened-before relation is only concerned with the sending and receipt of
messages ONE at a time. Therefore, messages arriving out of order does not
violate the relationship [2 points]
○ [2 points] for a reasonable example.

73
Q

Recall that in the distributed mutual exclusion algorithm of Lamport, every
node acknowledges an incoming lock request message from a peer node by
sending an ACK message. One suggested way to reduce the message complexity
of the algorithm is to defer ACKs to lock requests. Using an example, show
under what conditions a node can decide to defer sending an ACK to an
incoming lock request.

A

If the node is holding the lock, then it can defer sending ACK.
Or if the incoming lock request’s timestamp is larger than the timestamp of its own lock
request, the node can defer sending ACK. [3 points]
A reasonable example [2 points]. For instance, node A has an outstanding lock request with timestamp 5; a request from node B arrives with timestamp 8; A defers the ACK because its own earlier request has precedence, and its eventual release message serves in place of the ACK.