Midterm Flashcards

1
Q

(Extensibility vs. Virtualization) Give the differences and similarities between the goals of extensibility (a la SPIN/Exokernel/L3) with the goals of virtualization (a la Xen/VMware ESX server).

A

Similarities

  • Customization of system services commensurate with application needs
  • Allowing policies for resource allocation to be dictated by the customized system service, by providing a thin layer above the hardware with just the mechanisms

Differences

  • Customization granularity is an entire OS in the case of virtualization, while it is any desired granularity (e.g., individual subsystems of the OS) in the case of SPIN/Exokernel/L3
  • Striving to avoid penalties for border crossings between system services and the kernel is paramount in Exokernel/SPIN/L3, while in virtualization the isolation/integrity/independence of the address spaces of the hypervisor and the OSes above it is of paramount importance
2
Q

Give the differences and similarities between the mechanisms in Exokernel for extensibility with those in Xen for virtualization.

A
Similarities

  • Both Exo and Xen catch program discontinuities and pass them up to the OS above
  • Both Exo and Xen have a way of identifying the OS that is currently executing so that they can correctly do the upcall to the OS that has to deal with the program discontinuity.

Differences

  • Exo allows a library OS to download arbitrary code into the kernel, while Xen has a structured way of communicating with the guest OSes via the I/O ring data structure
  • Exo does hardware resource allocation to the library OSes at a very fine granularity (e.g., an individual TLB entry), while Xen does it at a much coarser granularity (e.g., batching page table updates)
  • Exo’s default CPU scheduling uses a linear vector of “time slots”, while Xen uses proportional-share and fair-share schedulers and accurately accounts for the time each guest is scheduled to run on the CPU
3
Q

Give two plausible reasons that could make system calls more expensive than simple procedure calls.

A
  • Cost of changing hardware address spaces (updating the PTBR, flushing or switching the TLB)
  • Cost of change in locality (CPU affinity, cold cache)
4
Q

Consider a byte-addressable 32-bit architecture. The virtual address space is 2^32 bytes. The TLB supports entries tagged with an address space ID. We are building a microkernel-based design of an operating system for this architecture, in which every subsystem will be implemented as a process above the kernel. Suggest what design decisions you will take to make sure that the performance will be as good as a monolithic OS.

A
  • Provide mechanisms in the kernel for generating unique IDs.
  • Give unique address space IDs to each subsystem so that the AS-tagged TLB can be exploited
  • No need to flush the TLB when we move from one subsystem to another
  • Use kernel threads and do “hand-off” scheduling between subsystems to implement protected procedure calls between the subsystems
  • Provide efficient memory sharing mechanisms across subsystems (L3 mechanisms of map/grant/flush) so that copying overheads during protected procedure calls are avoided
  • Provide low overhead mechanism in the kernel for catching program discontinuities (external interrupt) and packaging and delivering it like a protected procedure call to the subsystem that has to deal with the discontinuity
5
Q

Why is a shadow page table needed?

A
  • To support the guest OS’s illusion of a contiguous physical address space
  • Since the guest OS does not have access to the hardware page table, by keeping the mapping from guest VPNs to MPNs in the hypervisor, the address translation can happen at hardware speed.
6
Q

(full virtualization) How many shadow page tables are there?

A

There is one S-PT per guest OS currently running

7
Q

(full virtualization) How big is the shadow page table?

A

It is proportional to the number of processes currently running in that guest OS

8
Q

(full virtualization) Guest OS is servicing a page fault. It finds a free PPN and decides to map the faulting process’s VPN to this PPN. List the steps taken from here on by the guest and the hypervisor that ensure the process will not fault on this VPN again.

A
  1. To put this mapping into the page table, the guest OS has to execute a privileged instruction.
  2. This will result in a trap into the hypervisor, which will emulate the required action (VPN to PPN mapping) in the guest OS’s page table data structure.
  3. Further, the hypervisor will install a direct mapping from VPN to MPN (the MPN corresponding to the PPN) into the S-PT data structure for this guest OS and into the TLB.
  4. Since the S-PT is the “hardware page table” in an architecture such as Intel’s, the process will not page fault the next time it runs on the CPU. (A sketch of this trap-and-emulate path follows.)
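A minimal sketch of the trap-and-emulate path described above (types and helper names are hypothetical, not actual hypervisor code):

/* Hypervisor handler invoked when the guest's privileged PT update traps.
   vpn and ppn are decoded from the instruction the guest attempted. */
void emulate_guest_pt_update(struct guest *g, unsigned long vpn, unsigned long ppn)
{
    guest_pt_set(g, vpn, ppn);               /* steps 1-2: emulate the guest's intended VPN -> PPN entry */
    unsigned long mpn = g->ppn_to_mpn[ppn];  /* step 3: translate the guest PPN to its backing machine page */
    shadow_pt_set(g->spt, vpn, mpn);         /* step 3: install VPN -> MPN in the shadow page table,
                                                the table the hardware actually walks */
    tlb_insert(vpn, mpn);                    /* step 4: optionally prime the TLB so the access succeeds on resume */
}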
9
Q

Guest OS decides to context switch from the currently running process to another. List the steps taken by the guest and the hypervisor until the new process is running on the processor.

A
  1. Guest OS executes the privileged instruction for changing the PTBR to point to the chosen process (say P2). Results in a trap to the hypervisor. (+2)
  2. From the PPN of the PT for P2, the hypervisor will know the offset into the S-PT for that guest-OS where the PT resides in machine memory. (+2)
  3. Hypervisor installs the MPN thus located as the PT for P2 into PTBR. (+2)
  4. Once other formalities are complete associated with context switching (which also needs hypervisor intervention) such as saving the volatile state of the currently running process into the associated PCB, and loading the volatile state of P2 from its PCB into the processor, the process P2 can start executing.
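A minimal sketch of the PTBR-switch emulation described above (hypothetical names; how the hypervisor locates P2's page table in the S-PT depends on its bookkeeping):

/* Hypervisor handler for the guest's trapped "set PTBR" instruction */
void emulate_set_ptbr(struct guest *g, unsigned long guest_pt_ppn)
{
    unsigned long mpn = spt_lookup(g, guest_pt_ppn);  /* find where P2's PT lives in machine memory */
    hardware_set_ptbr(mpn);                           /* point the real PTBR at that machine page */
}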
10
Q

The guest OS is running multiple processes within it. The guest OS itself appears as a “process” so far as the hypervisor is concerned. How is each of the processes inside the guest OS guaranteed memory isolation/integrity/independence from one another?

A

 Upon process creation, guest-OS assigns a distinct PT data structure for the newly created process (+2)
 As part of creating a memory footprint for the process, the guest-OS creates VPN to PPN mappings in the PT by executing privileged operations (+2)
 These privileged operations result in traps to the hypervisor which emulates them on behalf of the guest-OS (both populating the PT data structure for the newly created process in the guest-OS as well as the S-PT data structure in the hypervisor). (+2)
 The distinct PT data structure for EACH process within the guest-OS thus gives the isolation/integrity/independence guarantees for the processes from one another.

11
Q

An application process starts up in a guest OS running on top of Xen. List the steps taken by the guest OS and Xen to set up the page table for this process. What is the mechanism by which the guest OS schedules this process to run on the processor? You can assume any reasonable hypervisor calls without worrying about the exact syntax. State your assumptions.

A
  1. Guest OS allocates a physical page frame from its own reserve as the page table (PT) data structure for the newly created process
  2. Guest OS registers this page frame with Xen as a page table by using a hypercall.
  3. Guest OS updates batched VPN to PPN mappings into this page table data structure via the hypervisor by using a single hypercall.
  4. When it wants to schedule this process, the Guest OS issues a hypervisor call to change the PTBR corresponding to the PT for this process
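A rough sketch of these four steps using made-up hypercall names (the question allows assuming any reasonable hypervisor calls; these are not Xen's exact APIs):

/* Guest OS side, paravirtualized: setting up and scheduling the new process */
pt_frame = alloc_frame_from_guest_reserve();             /* 1. frame for the new page table */
hyp_register_page_table(pt_frame);                       /* 2. register it with Xen via a hypercall */
hyp_mmu_update_batch(pt_frame, vpn_ppn_updates, n);      /* 3. batched VPN -> PPN updates in one hypercall */
hyp_set_page_table_base(pt_frame);                       /* 4. change the PTBR, in effect scheduling the process */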
12
Q

One of the techniques for efficient use of memory which is a critical resource for performance is to dynamically adjust the memory allocated to a VM by taking away some or all of the “idle memory” (unused memory) from a VM. This is referred to in the literature as “taxing” the VM proportional to the amount of idle memory it has. Why is a 100% tax rate (i.e. taking away ALL the idle memory from a VM) not a good idea?

A

Any sudden increase in the working set size of the VM will result in poor performance of that VM. This might violate the service agreement for that VM.

13
Q

VM1 and VM2 are hosted on top of VMware ESX server. VM1 has a page fault at VPN = 100. The page is being brought into MPN = 4000. The contents hash for this page matches another MPN = 8000, which belongs to VM2 at VPN = 200. Give the steps taken by VMware ESX server including the data structures used to possibly optimize the memory usage by using VM oblivious page sharing.

A
  1. ESX server has a hash table data structure in which each entry keeps the content hash of a page, the MPN it refers to, and a reference count (or a hint marking for a page that is not yet shared). (+1)
  2. As the content hash for the faulting page of VM1 matches an entry in the hash table (MPN = 8000), now perform a full byte-by-byte comparison of this page (MPN = 4000) and the matched page (MPN = 8000). (+1)
  3. If the comparison fails, then create a new hint frame entry for this new page in the hash table. (+1)
  4. If the two pages are identical, then mark both the corresponding PPNs to point to the same MPN (say MPN = 8000) in their respective shadow page tables. (+1)
  5. Mark these page table entries as “copy on write”. (+1)
  6. Increase the reference count by 1 in the hash table for MPN = 8000, and free up machine page MPN = 4000. (+1) (A condensed sketch follows.)
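A condensed sketch of that flow (function and field names are illustrative, not the actual ESX implementation):

/* VM1's faulting page (VPN 100, backed by guest PPN ppn1) has just been read into MPN 4000 */
h = content_hash(machine_page(4000));
entry = hash_table_lookup(h);                    /* matches the entry describing MPN 8000 */
if (entry != NULL && full_compare(4000, entry->mpn)) {
    /* contents identical: share MPN 8000 and mark both mappings copy-on-write */
    set_ppn_to_mpn(VM1, ppn1, entry->mpn);       /* ppn1 backs VM1's VPN 100 */
    set_ppn_to_mpn(VM2, ppn2, entry->mpn);       /* ppn2 backs VM2's VPN 200 */
    mark_copy_on_write(VM1, ppn1);
    mark_copy_on_write(VM2, ppn2);
    entry->refcount++;
    free_machine_page(4000);
} else {
    /* no usable match (or contents differ): record a hint entry for MPN 4000 */
    hash_table_insert_hint(h, 4000, VM1, 100);
}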
14
Q

It is impossible to support sequential consistency memory model in a multiprocessor that has caches associated with each processor. (An answer without any justification gets zero credit.)

A

False. (+1)
 What is needed is a mechanism to get exclusive access to a memory location before writing to it in the local cache, and to ensure that the memory location is purged from the peer caches.
 For example, in an SMP, a hardware solution would be for the writing processor to acquire the bus and indicate to its peers that it is writing this memory location so that the peers can take appropriate actions in their local caches (invalidate/update if that memory location exists in the cache)

15
Q

(Anderson’s queue lock)
Imagine a 100-processor cache coherent multiprocessor. The wordsize of the processor is 32 bits, which is the unit of memory access by the processor’s instruction set.
The cache blocksize is 16 bytes. We would like to implement Anderson’s queue-based lock algorithm. Recall that in this algorithm, each processor marks its unique spin position in a “flags” array (associated with each lock) so that it can locally spin on that variable without disrupting the work of other processors until its turn comes to acquire the lock.
How much space is needed for each flags array associated with a lock?

A

 To ensure there is no false sharing, we need to allocate each spin variable in a distinct cache line. Space for one spin variable = cache blocksize = 16 bytes
(+2)
 We will assume that the maximum number of threads in an application cannot exceed the number of processors. Thus in the worst case we need 100 distinct spin locations in the flags array.
(no points deducted for not stating this assumption)
 So total space needed in the flags array for each lock variable = 100*16 = 1600 bytes
(+2)
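A self-contained sketch of this sizing under the stated assumptions (100 processors, 16-byte cache blocks):

#include <stdio.h>

#define NUM_PROCS       100
#define CACHE_LINE_SIZE 16

/* Each spin location gets its own cache line to avoid false sharing. */
struct flag_slot {
    volatile int must_wait;                     /* the word a waiting processor spins on */
    char pad[CACHE_LINE_SIZE - sizeof(int)];    /* pad the slot out to a full cache line */
};

struct flag_slot flags[NUM_PROCS];              /* one slot per potential lock requester */

int main(void)
{
    printf("flags array size per lock: %zu bytes\n", sizeof(flags)); /* 100 * 16 = 1600 */
    return 0;
}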

16
Q

(MCS lock)
Recall that the MCS lock algorithm implements a queue of lock requestors using a linked list.
Let’s say the state of a lock is as follows: T1 is the current lock holder and there is no other request in the queue yet; and a “new” request for the lock is just starting to appear.

What could happen given the above state?

A

T1 could think there is no one else waiting for the lock and set L to nil, resulting in a livelock (“new” process will be spinning forever waiting for the lock).

17
Q

Recall that the MCS lock algorithm implements a queue of lock requestors using a linked list.
Let’s say the state of a lock is as follows: T1 is the current lock holder and there is no other request in the queue yet; and a “new” request for the lock is just starting to appear.

What should happen? (assume no other lock requestors appear on the scene)

A

T1 should recognize that there is a new request forming for the lock, and wait for its next pointer to point to new, so that it can signal it upon releasing the lock.

18
Q

The MCS algorithm assumes the presence of an atomic instruction:
compare-and-swap(a, b, c), whose semantics is as follows:
If a == b, then set a = c and return 1;
Else do nothing and return 0;
How does MCS lock algorithm use this instruction to make sure that the right thing happens?

A
In MCS, at lock release the releasing processor T1 does the following:
if (compare-and-swap(L, T1, nil) == 0) {
    /* CAS failed: a new requestor exists; spin until it links itself behind T1 */
    while (T1->next == nil)
        ; /* spin */
    T1->next->got_it = 1; /* signal the successor that it now has the lock */ (+1)
} (+3)
/* else: no one was waiting, and L has been atomically set back to nil */
19
Q

(barrier algorithms)
Imagine a message-passing multiprocessor with point to point communication links between any two processors. Between tournament and dissemination algorithms, which would you expect to do better for barrier synchronization? Why?

A

 Dissemination would do better.
 At each round dissemination barrier has O(N) communications. But they can all go on in parallel. The number of rounds of communication in dissemination is ceil(log(N)).
 While all the communication can go in parallel in tournament barrier as well, the algorithm traverses the tree twice (arrival + wakeup), resulting in a total of 2 * log(N) rounds of communication.
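A concrete count, assuming N = 64 processors: dissemination needs ceil(log2 64) = 6 rounds, with all 64 messages of a round in flight in parallel over the point-to-point links; the tournament barrier needs log2 64 = 6 rounds for arrival plus 6 rounds for wakeup, i.e., 12 rounds of communication in total.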

20
Q

Recall that light-weight RPC (LRPC) is for cross-domain calls within a single host without going across the network. In a simple implementation of this paradigm, it is understandable that there has to be a copy from the client’s address space into the kernel address space, and from the kernel address space to the server’s address space before executing the server procedure. What are the reasons for the two additional copies (one in the client’s address space, and one in the server’s address space)?

A

 The kernel has no knowledge of the semantics of the call. Therefore, before going to the kernel, on the client-side we have to serialize the arguments of the call into a contiguous sequence of bytes in the client’s address space before the kernel copies it into its kernel buffers.
 Similarly, on the server side, we have to populate the contiguous sequence of bytes received from the kernel into the server’s address space into the actual arguments of the call as expected by the server procedure.

21
Q

(multiprocessor scheduling)
Recall that “limited minimum intervening” thread scheduling policy uses affinity information for “k” processors, where “k” is a subset of the processors that this thread ran on during its lifetime for the scheduler to make scheduling decisions. Sketch the data structure for the TCB (assume k = 4) in this scheduling regime (you need to show only the fields that are to be used for scheduling purposes).

A
struct affine_type {
    int processor_id; /* processor number */
    int affinity;     /* number of intervening threads since this thread last ran there */
}; (+2)

struct TCB_type {
    /* ... other unrelated info ... */
    struct affine_type affinity[4]; /* top 4 affinity entries for this thread */
};
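A small usage sketch showing how a scheduler might consult these fields (pick_processor is a hypothetical helper, not from the paper; it chooses the remembered processor with the fewest intervening threads):

/* Among the k = 4 remembered processors, pick the one with the smallest
   number of intervening threads since this thread last ran there. */
int pick_processor(struct TCB_type *tcb)
{
    int best = tcb->affinity[0].processor_id;
    int fewest = tcb->affinity[0].affinity;
    for (int i = 1; i < 4; i++) {
        if (tcb->affinity[i].affinity < fewest) {
            fewest = tcb->affinity[i].affinity;
            best = tcb->affinity[i].processor_id;
        }
    }
    return best;
}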
22
Q

Parallel System Case Studies
6. (6 mins, 10 points) (Tornado)
Tornado suggests using “existence guarantees” with reference counts instead of hierarchical locking to avoid serialization.

Using page fault service as a concrete example, discuss how this helps with reducing the pitfalls of serialization of page fault service in a multi-threaded process.

A

Imagine two threads T1 and T2 of the same process executing on the multiprocessor, sharing the same representation of the “process” object. Both of them experience page faults simultaneously. Assume T1’s page fault is contained in Region 2 and T2’s page fault is contained in Region 1. The page fault service will only update the respective region objects. Therefore, the “process” object need not be locked. But to ensure that some other subsystem (say a load balancer) does not remove the “process” object, the page fault service can increment the ref-count on the “process” object on the forward (downward) path through the object hierarchy, and decrement the ref-count on the reverse path (once the service is complete), thus avoiding serialization of the page fault service for T1 and T2.
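A minimal sketch of the existence-guarantee idea (names such as atomic_inc and find_region are illustrative, not actual Tornado interfaces):

/* Page-fault path: pin the process object with a reference count instead of
   holding its lock, then do all the real work on the covering region object. */
void handle_page_fault(struct process *p, unsigned long fault_addr)
{
    atomic_inc(&p->refcount);                        /* existence guarantee: p cannot be torn down */
    struct region *r = find_region(p, fault_addr);   /* only the region's own lock is taken */
    region_service_fault(r, fault_addr);             /* T1 and T2 proceed in parallel in different regions */
    atomic_dec(&p->refcount);                        /* drop the existence guarantee */
}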

23
Q

Using the “process” object as a concrete example, identify situations where existence guarantees may not be sufficient and you may have to lock an object.

A

The “process” object is the equivalent of a context block. Every representation of this object is shared by multiple threads that are to be scheduled on the cores of a single processor or multiple processors (depending on the details of the architecture). The “process” data structure has information pertaining to the currently executing threads. If a thread is context switched out, then the scheduler subsystem would need to save the volatile state of the thread, and re-adjust the ready-queue, etc. All of these actions have to be performed atomically which would require locking the process object.
(full points if a convincing reason is given; need not be elaborate as above;
-2 “atomically” changing multiple fields of the process object not mentioned)

24
Q

A process is currently executing on the processor. The process makes a
system call. List the steps involved in getting this system call serviced
by the operating system that this process belongs to. You should be precise
in mentioning the data structures involved in Exokernel that makes this
service possible.

A

 System Call traps to Exokernel. (+2)
 Exokernel identifies the library OS responsible for handling this system
call using Time Slice Vector (+2)
 Exokernel uses the PE (Processor Environment) data structure associated with
this library OS to get the “system call context”, which is the entry point
registered with Exokernel by the library OS for system calls. (+2)
 Exokernel “upcalls” the library OS using this entry point. (+2)
 The library OS services the system call (+1)

25
Q

Explain how SPIN makes OS service extensions as cheap as a procedure call.

A

 SPIN implements each service extension as a Modula-3 object: interactions to
and from this object are compile time checked and runtime verified. Thus
each object is a logical protection domain. (+2)
 SPIN and the service extensions are co-located in the same hardware address
space. This means that OS service extensions do not require border crossing,
and can be as cheap as a procedure call. (+2)

26
Q

Give two reasons in the form of concise bullets, explaining why SPIN’s
vision of OS extensibility purely via Language-enforced protection checks
may be impractical for building real operating systems.

A

 Modifying architectural features (e.g., hardware registers in the CPU, memory
mapped registers in device controllers) may necessitate a service extension
to step out of their logical protection domains (i.e., Modula-3 compiler-enforced
access checks). (+2)
 A significant chunk (upwards of 50%) of the OS code base (e.g., device
drivers) is from third-party OEM vendors and it is difficult if not
impossible to force everyone to use a specific programming language. (+2)

27
Q

What is the main assertion of L3 microkernel?

A

 Microkernel-based design of an OS need not be performance deficient. (+1)
 With the right abstractions in the microkernel and architecture-specific
implementation of the abstractions, microkernel-based OS can be as performant
as a monolithic kernel.

28
Q

Why is the assertion of L3 at odds with the premise of SPIN/Exokernel?

A

 SPIN/Exokernel used Mach as an exemplar of microkernel-based OS structure
whose design focus was on portability (+1)
 On the other hand, L3 argues that the primary goal of a microkernel should be
performance not portability. (+1)

29
Q

Consider a 64-bit paged-segmented architecture. The virtual address space
is 2^64 bytes. The TLB does not support address space id, so on an address
space switch, the TLB has to be flushed. The architecture has two segment
registers:

A

 LB: lower bound for a segment

 UB: upper bound for a segment

30
Q
There are three protection domains each requiring 32 MiB of
address space (Note: Mi = 2^20). How would you recommend implementing
the 3 protection domains that reduces border crossing overheads among
these domains?
A
All three protection domains can be packed into one hardware address space.
 Each protection domain takes up 2^25 B (32 MiB).
LB, UB (range) for each protection domain:
0 , (2^25 - 1)
2^25 , (2^26 - 1)
2^26 , (3 * 2^25 - 1)
31
Q

Explain with succinct bullets, what happens upon a call

from one protection domain to another.

A

 UB and LB hardware registers are changed to correspond to the called
protection domain.
 Context switch is made to transfer control to the entry point in the called
protection domain.
 The architecture ensures that virtual addresses generated by the called
domain are within the bounds of legal addresses for that domain.
 There is no need to flush the TLB on context switch from one protection
domain to another.

32
Q

The hypervisor gets an interrupt from the disk. How does it determine which
virtual machine it has to deliver the interrupt?

A

The interrupt is a result of a request that originated from some specific
VM. The hypervisor tags each request dispatched to the disk controller with
the id of the requesting VM. Using this information, the interrupt is
delivered to the appropriate VM.

33
Q
The hypervisor receives a packet arrival interrupt from the network
interface card (NIC). How does it determine which virtual machine it has to
deliver the interrupt?
A

Every packet from the network will have the MAC address of the NIC to which the
packet is destined. The MAC addresses are uniquely associated with the VMs. Based
on the MAC addresses associated with the VMs and the destination MAC address in the
packet header, the packet arrival interrupt is delivered to the appropriate VM.
[Note: As an aside, NAT protocol on your home router connecting several home
devices to the ISP works quite similarly.]
(+2 if MAC address of NICs associated with the VMs to direct intrpt mentioned)

34
Q

A virtualized setting uses ballooning to cater to the dynamic memory needs
of VMs. Imagine 4 VMs currently executing on top of the hypervisor. VM1
experiences memory pressure and requests the hypervisor for 100 MB of
additional memory. The hypervisor has no machine memory available currently
to satisfy the request. List the steps taken by the hypervisor in trying to
give the requested memory to VM1 using ballooning.

A

 Hypervisor keeps information on the memory allocated and actively used by
each of the VMs. (+1)
 This allows the hypervisor to decide the amount of memory to be taken from
each of the other VMs to meet VM1’s request. (+1)
 It communicates the amount of memory to be acquired from each VM to the
balloon driver in that VM. (+2)
 The balloon drivers go to work in the respective VMs and return the released
machine pages to the hypervisor. (+2)
 It gives the requested memory to the needy VM (VM1 in this case).

35
Q

One of the techniques for efficient use of memory which is a critical
resource for performance is to dynamically adjust the memory allocated to a
VM by taking away some or all of the “idle memory” (unused memory) from a
VM. This is referred to in the literature as “taxing” the VM proportional to
the amount of idle memory it has. Why is a 100% tax rate (i.e. taking away
ALL the idle memory from a VM) not a good idea?

A

Because any sudden increase in the working set size of the VM will result in poor
performance for that VM potentially violating the SLA for that VM.(+2 if working set size increase mentioned)

36
Q

Using a concrete example (e.g., a disk driver), show how copying memory
buffers between guest VMs and the hypervisor is avoided in a para
virtualized setting?

A

• I/O ring data structure shared between
hypervisor and guest VM (+1)
• Each slot in the I/O ring is a descriptor for
a unique request from the guest VM or a
unique response from the hypervisor (+1)
• Address pointer to the physical memory
page corresponding to a data buffer in the
guest OS is placed in the descriptor (+2)
• The physical memory page is pinned for the
duration of the transfer (+2)
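A minimal sketch of one descriptor slot in such a shared I/O ring (field names are illustrative, not Xen's exact ABI):

/* One slot in the I/O ring shared between a guest and the hypervisor/driver domain */
struct io_ring_descriptor {
    unsigned long request_id;   /* lets the guest match a later response to this request */
    unsigned long buffer_pfn;   /* guest physical page holding the data buffer (pinned for the transfer) */
    unsigned int  offset;       /* offset of the data within that page */
    unsigned int  length;       /* number of bytes to transfer */
    int           operation;    /* e.g., disk read or write */
    int           status;       /* filled in by the responder */
};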

37
Q

In a fully-virtualized setting, Shadow Page Table (S-PT) is a data structure
maintained by the hypervisor. Answer the following questions pertaining to
S-PT.
(i) (2 point) How many S-PT data structures are there?

A

One per guest OS currently running. (all or nothing)

38
Q

In a fully-virtualized setting, Shadow Page Table (S-PT) is a data structure
maintained by the hypervisor. Answer the following questions pertaining to
S-PT.How big is the S-PT data structure?

A

Proportional to the number of processes in that guest OS. (all or nothing)

39
Q

In a fully-virtualized setting, Shadow Page Table (S-PT) is a data structure
maintained by the hypervisor. Answer the following questions pertaining to
S-PT. What does an entry in the S-PT contain?

A
In principle it is a mapping from PPN to MPN. (+1)
However, since S-PT is the “real” hardware page table used by the architecture for
address translation (VPN->MPN), the hypervisor keeps the VPN -> MPN mapping as each
entry in the data structure. (+1)
40
Q

In a fully-virtualized setting, Shadow Page Table (S-PT) is a data structure
maintained by the hypervisor. Answer the following questions pertaining to
S-PT. The currently running OS, switches from one process (say
P1) to another process (P2). List the sequence of steps before P2 starts
running on the processor.

A

 Guest OS executes the privileged instruction for changing the PTBR to point
to the PT for P2. Results in a trap to the hypervisor. (+2)
 From the PPN of the PT for P2, the hypervisor will know the offset into the
S-PT for that VM where the PT resides in machine memory. (+2)
 Hypervisor installs the MPN thus located as the PT for P2 into PTBR. (+2)
 Once other formalities are complete associated with context switching (which
also needs hypervisor intervention) such as saving the volatile state of P1
into its PCB, and loading the volatile state of P2 from its PCB into the
processor, the process P2 can start executing.

41
Q

Sequential consistency memory model makes sense only in a non-cache coherent
(NCC) shared memory multiprocessor

A

False. (+1)
Sequential consistency memory model is a contract between software and hardware.
(+2)
It is required for the programmer to reason about the correctness of the software.
(+1)
Cache coherence is only a mechanism for implementing the memory consistency model.
It can be implemented in hardware or software. (+1)

42
Q

(MCS lock)
Recall that the MCS lock algorithm implements a queue of lock requestors
using a linked list. The MCS algorithm uses an atomic fetch-and-store(X,Y)
primitive in its implementation. The primitive returns the old value of X,
and stores Y in X. Assume the current state of a lock is as shown below:
(L)->(curr)->nil where curr is running.

(4 points) Assume two threads T1 and T2 make a lock request
simultaneously for the same lock. What sequence of actions would have
brought the data structures to the intermediate state shown below from
the current state?

Intermediate state: curr is running and holds the lock; T1 and T2 have each allocated a queue node; L points to T1’s node; neither curr’s node nor T2’s node is linked to a successor yet.

A

Though T1 and T2 are making their lock requests simultaneously, their
attempts at queuing themselves behind the current lock holder (curr)
will get serialized through the atomic fetch-and-store operation.
(+2)
 In the state shown above, T1 has definitely done a fetch-and-store, so the
lock variable L is pointing to it as the latest lock requestor. (+1)
 As for T2, the thread has allocated its queue node data structure, but
there are two possibilities with respect to where it will be in the
lock queue (either one will get full credit): (+1)
o (Possibility 1) T2 may have done a fetch-and-store prior to T1.
o (Possibility 2) T2 is yet to do its fetch-and-store.

43
Q

What does each of T1 and T2 know about the state of the
data structures at this point of time?

Intermediate state: curr is running and holds the lock; T1 and T2 have each allocated a queue node; L points to T1’s node; neither curr’s node nor T2’s node is linked to a successor yet.

A

Possibility 1:
 T2 knows its predecessor is “curr” (+1)
 T1 knows its predecessor is “T2” (+1)
Possibility 2:
 T2 does not know anything about the queue associated with L (+1)
 T1 knows its predecessor is “curr” (+1)

44
Q

What sequence of subsequent actions will ensure the correct
formation of the waiting queue of lock requestors behind the current lock holder?

Intermediate state: curr is running and holds the lock; T1 and T2 have each allocated a queue node; L points to T1’s node; neither curr’s node nor T2’s node is linked to a successor yet.

A

Possibility 1:
 T2 will set the next pointer in “curr” to point to T2. (+1)
 T1 will set the next pointer in “T2” to point to T1. (+1)
Possibility 2:
 T1 will set the next pointer in “curr” to point to T1. (+1)
 T2 will do a fetch-and-store on L; this will result in two things: (+1)
 o T2 will learn that its predecessor is T1, and will therefore set the next pointer in “T1” to point to T2.
 o L will now point to T2.
(A sketch of the acquisition path that performs these steps follows.)
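A minimal sketch of the MCS acquisition path that produces the queue states discussed above (assuming an atomic fetch_and_store primitive; names are illustrative):

struct qnode {
    struct qnode *next;   /* successor in the waiting queue, NULL if none yet */
    int got_it;           /* set by the predecessor when the lock is handed over */
};

void mcs_acquire(struct qnode **L, struct qnode *me)
{
    me->next = NULL;
    me->got_it = 0;
    /* Atomically make me the tail of the queue and obtain the old tail;
       simultaneous requesters (T1, T2) are serialized right here. */
    struct qnode *pred = fetch_and_store(L, me);
    if (pred != NULL) {
        pred->next = me;          /* link behind the predecessor (curr, T1, or T2) */
        while (!me->got_it)
            ;                     /* spin locally until the predecessor signals */
    }
    /* pred == NULL means the lock was free and I now hold it */
}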

45
Q

The MCS barrier uses a 4-ary arrival tree where the nodes
are the processors and the links point to children. What will the
arrival tree look like for 16 processors labeled P0-P15? Use P0 as the
root of the tree. You can either draw the tree or give your answer by
showing the children for each of the 16 processors below:

A
P0: children P1, P2, P3, P4
P1: children P5, P6, P7, P8
P2: children P9, P10, P11, P12
P3: children P13, P14, P15
P4 through P15: no children
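This layout follows from numbering the tree breadth-first: the children of processor i are processors 4i+1 through 4i+4, when they exist. A tiny self-contained sketch that prints the arrival tree for 16 processors:

#include <stdio.h>

int main(void)
{
    const int N = 16;                        /* processors P0..P15 */
    for (int i = 0; i < N; i++) {
        printf("P%d:", i);
        for (int c = 4 * i + 1; c <= 4 * i + 4 && c < N; c++)
            printf(" P%d", c);               /* children of Pi in the 4-ary arrival tree */
        printf("\n");
    }
    return 0;
}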
46
Q

What is the reason for such a construction of the arrival

tree? (MCS 4-ary)

A

 Unique and static location for each processor to signal barrier
completion
 Spinning on a statically allocated local word-length variable by
packing data for four processors reduces bus contention
 4-ary tree construction shows the best performance on the Sequent Symmetry
used in the experiments in the MCS paper

47
Q

Recall that light-weight RPC (LRPC) is for cross-domain calls within a
single host without going across the network. The kernel allocates A-stack
in physical memory and maps this into the virtual address space of the
client and the server. It also allocates an E-stack that is visible only to
the server. What is the purpose of the E-stack?

A

 By procedure calling convention, the server procedure expects the actual
parameters to be in a stack in its address space. E-stack is provided for
this purpose.
 The arguments placed in the A-stack by the client stub are copied into the E-stack
by the server stub. Once this is done, the server procedure can
execute as it would in a normal procedure call using the E-stack.

48
Q

A processor has 2 cores and each core is 4-way multithreaded. The last
level cache of the processor is 32 MB. The OS has the following pool of
ready to run threads:
Pool1: 8 threads each having a working set of 1 MB (medium priority)
Pool2: 3 threads each having a working set of 4 MB (highest priority)
Pool3: 4 threads each having a working set of 8 MB (medium priority)
The OS can choose any subset of the threads from the above pools to schedule
on the cores. Which threads should be scheduled that will make full use of
the available parallelism and the cache capacity while respecting the thread
priority levels?

A
 Pool 1 - 3 threads (3MB)
 Pool 2 - 3 threads (12MB)
 Pool 3 - 2 threads (16MB)
Total cache used: 31 MB
(-1 if all Pool 2 threads not scheduled)
(-3 if only Pool 1 threads scheduled)
49
Q

Each core of the multiprocessor is 4-way hardware
multithreaded. The workload managed by the OS
consists of multiple multithreaded processes. You are
designing a “thread scheduler” as a clustered
object. Give the design considerations for this
object. Discuss with concise bullets the
representation you will suggest for such an
object from each core (singleton, partitioned,
fully replicated).

A

 Use one representation of the “thread scheduler” object for each processor
core.
 Each representation has its own local queue of threads.
 No need for locking the queue for access since each representation is unique
to each core
 Each local queue is populated with at most 8 threads since each processor is
4-way hardware multithreaded (this is just a heuristic to balance interactive
threads with compute-bound threads). The threads in a local queue could be a
mix of threads from single-threaded processes and multi-threaded processes.
 Each local queue is populated with the threads of the same process (up to a
max of 4) when possible.
 If a process has less than 16 threads, then the threads are placed in the
local queues of the 4 cores in a single NUMA node.
 If a process has more than 16 threads, then the threads are split into
multiple local queues so that the threads of the same process can be co-scheduled
on the different nodes (often referred to as gang scheduling)
 Implement entry point in “thread scheduler” object for peer representations
of the same object to call each other for work stealing.
(This is a bit open ended so we have to be lenient in grading.
Give full credit if their answer explains one of the following
(a) one representation per processor core
OR
(b) one representation per NUMA node (shared by the 4 cores)
AND
a good reasoning to back the choice)
(-2) if not one of (a) or (b)
(-3) if no reasoning
(-2) if incomplete reasoning

50
Q

For the above multiprocessor, we are implementing a DRAM object to manage physical
memory. Give the design considerations for this object. Discuss with
concise bullets the representation you will suggest for such an object from
each core (singleton, partitioned, fully replicated).

A

 One representation of the DRAM object for each NUMA node (i.e., shared by
all the 4 cores of that node). Each representation manages the DRAM at that node for memory
allocation and reclamation. (+3)
 Core- and thread-sensitive allocation of physical frames to ward off false
sharing among threads running on different cores of the NUMA node.
(+2)
(-2) if no mention of allocation for different cores
(-2) if NUMA-ness does not figure in choice of representation

51
Q

Enumerate the four different types of program discontinuities that can occur
during the execution of a process with a short sentence explaining what each
one of the discontinuities is.

A
  • Exception - program generated arithmetic errors (e.g., divide by zero) [1 point]
  • Trap - Syscall [1 point]
  • Page faults [1 point]
  • External interrupt - I/O [1 point]
52
Q

SPIN relies on programming language support for the implementation of
protection domains. Discuss one pro and one con of this approach (two short
bullet points should suffice).

A
  • Pro: No cheating possible (e.g. void*-casting in C) with Modula-3, generic interface
    provides entry point, implementation hidden, allows for checking for the correct pointer
    at compile-time, allows for protection domains without incurring border crossing costs
    due to shared hardware address space [2 points]
    Note: if no mention of protection domains, [only 1 point]
  • Con: What about drivers/etc. that need H/W-access, need to step outside of language
    protection, also Modula-3 potentially not too popular, rewriting necessary [2 points]
    Note: if not mentioning drivers or H/W-access, [only 1 point]
53
Q

(Exokernel)
A library OS implements a paged virtual memory on top of Exokernel. An
application running in this OS encounters a page fault. List the steps from
the time the page fault occurs to the resumption of this application. Make
any reasonable assumptions to complete this problem (stating the
assumptions).

A
  • Fielding by exokernel [2 points]
  • exokernel notifies Library OS by calling the entry point in the PE data structure for this
    library OS [2 points]
  • Library OS allocates page frame running page replacement algorithm if necessary to
    find a free page frame from the set of page frames it already has [2 points]
  • Library OS calls exokernel through the API to install translation (faulting page, allocated
    page frame) in the TLB, presenting its capability (an encrypted key) [2 points]
54
Q

(Liedtke’s L3 Microkernel)
There are four myths that discourage microkernel-based design of an OS: (1)
kernel-user switches (border crossings) are expensive; (2) hardware address
space switching is expensive; (3) thread switches are expensive; (4)
locality loss in the caches can be expensive when protection domains are
implemented using hardware address spaces. How does Liedtke debunk these
myths? (Four bullet points - one for each myth - should suffice)

A
  • (1): Proof by construction, 123 processor cycles (incl TLB + cache misses) [2 points]
  • (2): Not necessarily, exploit H/W features, e.g. segment registers to pack small
    protection domains in the same hardware space + explicit costs for large protection
    domains much smaller than implicit costs (cache no longer warm etc.) [2 points]
  • (3): Shown by construction that this is not true, competitive to SPIN and Exokernel [2
    points]
  • (4): This was true in Mach due to its focus on portability, but if focus on performance
    this can be overcome; tailor the code to use architecture-specific features (e.g., segment registers
    on x86) [2 points]
55
Q

What is the distinction between “physical page” and “machine page” in a
virtualized setting?

A

 Physical page is the illusory view of the physical memory from the Guest OS MMU.
Physical pages are deemed contiguous from the Guest OS point of view. [1 point]
 Machine page is the view of the physical memory from the Hypervisor. This refers to the
REAL hardware physical memory. The actual “physical memory” given to a specific
guest OS maps to some portion of the real machine memory. [1 point]

56
Q

(Xen)
Recall that in a para virtualized setting, the guest operating system
manages its own allocated physical memory. (Note: you don’t have to worry
about being EXACT in terms of Xen’s APIs in answering this question; we are
only looking for your general understanding of how para virtualization
works)

A new process starts up in a para-virtualized library OS executing on
top of the Xen hypervisor. List the interaction between the library OS and
Xen to establish a distinct protection domain for the process to execute in.

A

The distinct protection domain mentioned in the question refers to the page-table for the new
process. Following are the steps:
● Library OS (Linux) allocates memory from its own reserve for a new page table. [1 point]
● Library OS registers this memory with the Hypervisor (Xen) as a page table by using a
Hypercall. [1 point]
● Library OS makes updates to this page table (virtual page, physical page mapping)
through hypervisor via batches of updates in a single hypercall. [2 points]
● Library OS changes the active page table via the hypervisor thus in effect “scheduling”
the new process to run on the processor. [1 point]

57
Q

(Xen)
Recall that in a para virtualized setting, the guest operating system
manages its own allocated physical memory. (Note: you don’t have to worry
about being EXACT in terms of Xen’s APIs in answering this question; we are
only looking for your general understanding of how para virtualization
works)

The library OS is “up called” by Xen to deal with a page fault incurred
by its currently running process. How does the library OS make the process
runnable again?

A

When a page-fault occurs, the hypervisor catches it, and asynchronously invokes the
corresponding registered handler. Following are the steps:
Inside the hypervisor:
● Xen detects the address which caused the page-fault. For example, the faulting virtual
address may be in a specific architectural register [1 point]
● This register value is copied into a suitable space in the shared data-structure between
the hypervisor and library OS. Then the hypervisor does the up-call, which activates the
registered page-fault handler. [1 point]
Inside the library OS page-fault handler:
● The library OS page fault handler allocates a physical page frame (from its pool of free,
i.e., unused pre-allocated physical memory kept in a free-list). [1 point]
● If necessary the library OS may run page replacement algorithm to free up some page
frames if its pool of free memory falls below a threshold. [1 point]
● If necessary the contents of the faulting page will be paged in from the disk. [1 point]
● Note that paging in the faulting page from the disk would necessitate additional
interactions with the hypervisor to schedule an I/O request using appropriate I/O rings. [1
point]
● Once the faulting page I/O is complete, the library OS will establish the mapping in the
page table for the process by making a hypervisor call. [1 point]

58
Q

In an Intel-like architecture, the CPU has to do address translation on
every memory access by accessing the TLB and possibly the page table. In a
non-virtualized set up, the page table is set up by the OS and is used by
the hardware for address translation through the page table base register
(PTBR) set by the OS upon scheduling a process. Recall that in a fully
virtualized setting, the page table for a process is an internal data
structure maintained by the library OS.
(i) When the library OS schedules a process to run, how does the CPU learn
where the page table is for doing address translation on every memory
access?

A

To schedule a process the library OS will do the following:
- library OS has a distinct PT data structure for each process. [1 point]
- dispatching this process to run on the processor involves setting the PTBR (a privileged
operation) [1 point]
- The library OS will try to execute this privileged operation [1 point]
- This will result in a trap into the hypervisor [1 point]
- The hypervisor will “emulate” this action by setting the PTBR [1 point]
Henceforth, the CPU will implicitly use the memory area pointed to by the PTBR as the page
table

59
Q

In an Intel-like architecture, the CPU has to do address translation on
every memory access by accessing the TLB and possibly the page table. In a
non-virtualized set up, the page table is set up by the OS and is used by
the hardware for address translation through the page table base register
(PTBR) set by the OS upon scheduling a process. Recall that in a fully
virtualized setting, the page table for a process is an internal data
structure maintained by the library OS.

(ii) When the library OS services a page fault and updates the page table
for a given process, how is this mapping conveyed to the CPU so that it can
do the address translation correctly for future access to this page by this
process?

A

● Page fault service involves finding a free physical page frame to map the faulting virtual
page to this allocated physical page frame. [1 point]
● To establish this mapping, the library OS has to update the page table or the TLB
depending on the specifics of the processor architecture assumed by the fully virtualized
library OS. Both of these are privileged operations which will result in a trap when the
library OS tries to execute either of them. [1 point]
● Hypervisor will catch the trap and “emulate” the intended PT/TLB update by the library
OS’s into the library OS’s illusion of PT/TLB. [1 point]
● More importantly, the hypervisor has a mapping of what machine page (MPN) this
physical page of the guest OS refers to in its shadow page table. [1 point]
● Hypervisor will establish a direct mapping between the virtual page and the
corresponding machine page by entering the mapping into the hardware page table (the
shadow page table) or the TLB depending on the specifics of the processor architecture.

60
Q

In a shared memory multiprocessor in which the hardware is enforcing cache
coherence, why is a “memory consistency model” necessary?

A

Memory consistency model serves as a contract between software and hardware to allow the
programmer to reason about program behavior.
(2 points if they mention hardware/software contract)
(-1 point for lack of specificity - for full credit, need to mention ordering or sequence)

61
Q

All the variables in the execution shown below are in the
shared memory of a cache coherent NUMA multiprocessor. The
multiprocessor implements sequential consistency. All variables are
initialized to 0 before execution starts.
Execution on processor P1:  x = 20  followed by  y = 30
Execution on processor P2:  w = y + x  followed by  z = x + 10
Which of the following final values are impossible with the above execution?
(circle your choices; +2 for correct choice; -1 for incorrect choice)
(i) w = 0; z = 10;
(ii) w = 0; z = 30;
(iii) w = 20; z = 30
(iv) w = 30; z = 10
(v) w = 30; z = 30
(vi) w = 50; z = 30

A

iv and v. (Under sequential consistency, w = 30 would require P2 to read y = 30 but x = 0, contradicting P1’s program order in which x = 20 is written before y = 30.)

62
Q

On an Non-Cache-Coherent (NCC) NUMA machine with per-processor caches, would
you advocate using the following lock algorithm? Explain why or why not.
LOCK(L):
back: while (L == LOCKED); //spin
if (Test_and_Set(L) == LOCKED) goto back;
UNLOCK(L):
L = UNLOCKED

A

No. Since there is no cache coherence, the lock release will never be seen by the waiting processors, which keep spinning on a stale cached copy of L.
(1 point for saying “No”. 3 points for the correct reason).

63
Q

In a large-scale invalidation-based cache coherent shared memory
multiprocessor with rich connectivity among the nodes (i.e., the
interconnection network is not a shared bus), the tournament barrier is
likely to perform better than the MCS barrier. To jog your memory, MCS
algorithm uses a 4-ary arrival and binary wakeup tree. The tournament
barrier uses a binary tree for both arrival and wakeup.

A

True. [1 point]
Tournament algorithm can exploit the rich connectivity for peer-peer signaling among the
processors in each round (going up and down the tree). [3 points]
[1 point partial credit for any of the following: there is parallel communication, or messages
aren’t serialized, or log2(n) vs log4(n), or otherwise thinking about critical path/tree depth]

64
Q

Light-weight RPC (LRPC) is for cross-domain calls within a single host
without going across the network. A specific LRPC call has totally 128
bytes of arguments to be passed to the server procedure, and 4 bytes of
return value to be returned to the client. Assuming no programming language
level support,
(i) What is the total copy overhead in bytes for the call-return with the
LRPC system? Explain why.

A
Client → A-Stack ⇒ 128
A-Stack → Server E-Stack ⇒ 128
Server E-Stack → A-Stack ⇒ 4
A-Stack → Client ⇒ 4
Total = 264 [2 points]
Partial credit (1 point) if the answer says 132 with correct logic
65
Q

(LRPC) (ii) How much of the copy overhead is due to copying into kernel buffers?

A

0 bytes

66
Q

In a multiprocessor, when a thread becomes runnable again after a blocking
system call completes, conventional wisdom suggests running the thread on
the last processor it ran on before the system call.

(i) What is the downside to this idea?

A

Cache pollution by other threads run after the last time this thread ran on a particular
processor. [2 points]
[Partial credit 1 point: for other plausible answers e.g., “poor load balancing”]

67
Q

In a multiprocessor, when a thread becomes runnable again after a blocking
system call completes, conventional wisdom suggests running the thread on
the last processor it ran on before the system call.

(ii) What do you think are important considerations in choosing the “right”
processor to schedule this thread?

A

● Cache pollution by threads run after the last time a particular thread ran on a processor
(1.5 points)
● Cache pollution by threads that will be run after a thread is scheduled to run on a
particular processor (queue length) (1.5 points)

68
Q

Given the following configuration for a chip multiprocessor:
 4 cores
 Each core is 4-way hardware multithreaded
 512 MB LLC (Last Level Cache on the chip)
Given the following pool of threads ready to run:
 Group 1: T1-T8 each requiring 8 MB of cache
 Group 2: T9-T16 each requiring 16 MB of cache
 Group 3: T17-24 each requiring 32 MB of cache
 Group 4: T25-32 each requiring 64 MB of cache
Which threads should be run by the scheduler that ensures all the hardware
threads are fully utilized and maximizes LLC utilization?

A

32 software threads
64 + 128 + 256 + 512 = 960 MB required (cumulative cache requirement)
One of many possible feasible schedules that uses all the hardware threads and uses all the
available LLC:
 T25 – T28 on Core 1 = 4 hardware threads & 256 MB
 T17 – T20 on Core 2 = 4 hardware threads & 128 MB
 T9 – T12 on Core 3 = 4 hardware threads & 64 MB
 T13 – T16 on Core 4 = 4 hardware threads & 64 MB
(Full credit for any feasible schedule that shows complete understanding of hardware threads
and LLC utilization)
[Partial credit: -1 if just one thread is chosen wrong (this is normally if they found the optimal
schedule that’s less than 512 MB, instead of less than or equal to 512 MB); -2 for each set of up
to 4 threads chosen incorrectly]

69
Q

Parallel System Case Studies
5. (7 mins, 15 points)
(a) (5 points)
An application process starts up on Tornado multiprocessor OS. What are the
steps that need to happen before the process actually starts to run?

A

● Clustered process objects created one representation per processor on which this
process has threads. [2 points]
● Clustered region objects created commensurate with the partitioning of the address
space. [2 points]
● Clustered FCM objects created to support the region objects [1 point]

70
Q

A process running on top of Tornado frees up some portion of its memory
space (say using a facility such as free()). What does Tornado have to do
to cleanup under the cover?

A

● Locate the clustered region object that corresponds to this freed address range. This
region object has the piece of the page table that corresponds to the address range
being freed. [2 points]
● Identify the replicas of this region object on all the processors and fix up the mapping of
these address ranges in the replicated copies of the page table entries corresponding to
the memory being freed. [3 points]

71
Q

(c) (5 points) (Answer True/False with justification)
The “region” concept in Tornado and the “address range” concept in Corey are
one and the same.

A

● False [1 point]
● Justification:
○ Region is invisible to the application; it is a structuring mechanism inside the
kernel to increase concurrency for page fault handling of a multithreaded
application since each region object manages a portion of the process address
space. [2 points]
○ Address range is a mechanism provided to the applications by the kernel for
threads of an application to selectively share parts of the process address space.
This reduces contention for updating page tables and allows the kernel to reduce
the amount of management work to be done by the kernel using the hints from the
application (e.g., reduce TLB consistency overhead). [2 points]

72
Q

(a) (5 points)
Assume that messages arrive out-of-order in a distributed system. Does this
violate the “happened before” relation? Justify your answer with examples.

A

● No. [1 point]
● Justification:
○ The happened-before relation is only concerned with the sending and receipt of
messages ONE at a time. Therefore, messages arriving out of order does not
violate the relationship [2 points]
○ [2 points] for a reasonable example.

73
Q

Recall that in the distributed mutual exclusion algorithm of Lamport, every
node acknowledges an incoming lock request message from a peer node by
sending an ACK message. One suggested way to reduce the message complexity
of the algorithm is to defer ACKs to lock requests. Using an example, show
under what conditions a node can decide to defer sending an ACK to an
incoming lock request.

A

If the node is holding the lock, then it can defer sending ACK.
Or if the incoming lock request’s timestamp is larger than the timestamp of its own lock
request, the node can defer sending ACK. [3 points]
A reasonable example [2 points]. For instance, node A has an outstanding lock request with timestamp 5; a request from node B arrives with timestamp 8; A defers the ACK because its own earlier request has precedence, and its eventual release message serves in place of the ACK.