Final - Provided Review Questions Flashcards
How does scheduling work? What are the basic steps and data structures involved in scheduling a thread on the CPU?
scheduler dispatches tasks from runqueue
•scheduling policy/algorithm decides which task to schedule
Tasks enter the runqueue after:
• creation (fork)
• interrupt
• I/O completion
• expiration of time slice
Scheduler Runs when:
• CPU becomes idle
• Timeslice expires
• New task becomes available
When dispatching a thread:
• context switch to selected thread
• enter user mode
• set PC to next instruction of scheduled task
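A toy sketch of the runqueue/dispatch idea (FIFO policy, made-up types; a real scheduler would additionally context switch and return to user mode):

```c
#include <stddef.h>

/* Toy illustration only: a FIFO runqueue and one dispatch step. */
struct task { int id; struct task *next; };
struct runqueue { struct task *head, *tail; };

/* Tasks are enqueued after creation (fork), an interrupt, I/O
   completion, or timeslice expiration. */
void rq_enqueue(struct runqueue *rq, struct task *t) {
    t->next = NULL;
    if (rq->tail) rq->tail->next = t; else rq->head = t;
    rq->tail = t;
}

/* The "policy" here is plain FIFO: dispatch the task at the head.
   The real dispatch path would then context switch to it, enter
   user mode, and set the PC to the task's next instruction. */
struct task *rq_dispatch(struct runqueue *rq) {
    struct task *t = rq->head;
    if (t) { rq->head = t->next; if (!rq->head) rq->tail = NULL; }
    return t;   /* NULL => no runnable task, CPU idles */
}
```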
Do you understand the tradeoffs associated with the frequency of preemption and scheduling/what types of workloads benefit from frequent vs. infrequent intervention of the scheduler (short vs. long timeslices)?
- For CPU-bound tasks, longer timeslices are better because there are fewer context switches (less overhead). Keeps CPU utilization and throughput high.
- For I/O-bound tasks, shorter timeslices are better: I/O ops get issued sooner, and user-perceived performance improves.
Can you work through a scenario describing some workload mix (few threads, their compute and I/O phases) and for a given scheduling discipline compute various metrics like average time to completion, system throughput, wait time of the tasks…
Throughput = tasks / sec
• total # tasks / (timestamp when all tasks are done)
Avg Completion Time
• sum of times to complete each job / # tasks completed
• (if all tasks arrive at t = 0, this is just the sum of the completion timestamps)
Avg Wait Time
• sum of the timestamps at which each task first started running / # tasks (assuming all tasks arrive at t = 0)
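Worked example (assuming FCFS, three tasks each needing 10 s of CPU, all arriving at t = 0, so they finish at t = 10, 20, 30):
• Throughput = 3 tasks / 30 s = 0.1 tasks/s
• Avg completion time = (10 + 20 + 30) / 3 = 20 s
• Avg wait time = (0 + 10 + 20) / 3 = 10 s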
Can you contrast the MLFQ with the O(1) scheduler? Do you understand what were the problems with the O(1) scheduler which led to the CFS?
O(1) scheduler:
• Two arrays (runqueues):
• Active Array and Expired Array
• Active is used to pick the next task to run
• If a task is preempted and doesn’t finish its timeslice, it is put back in the active array with the remaining amount of its timeslice.
• Once its timeslice expires, it is added to the expired array.
• Once active array is empty, pointers are swapped and expired array becomes active array
The O(1) scheduler was replaced because it hurt the performance of interactive tasks: once on the expired list, a task wouldn’t run again until the active array was exhausted. It also made no fairness guarantees.
CFS - only for non real-time tasks
• Red-Black tree (self balancing tree - all paths to leaves approximately same length)
• ordered by virtual runtime (vruntime - time spent on cpu)
• Always schedules task with least amount of vruntime on the cpu (generally leftmost task)
• periodically update current vruntime
• then compare current vruntime to leftmost node
• schedule whichever is lower
•if swapping to another task, move old task into proper place in tree
• vruntime progresses faster for lower-priority tasks
• vruntime progresses slower for higher-priority tasks
• Uses the same tree for all priorities
- selecting a task => O(1) time
- adding a task => O(log n)
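A rough sketch of the vruntime accounting described above (simplified weights; not the kernel’s actual code):

```c
#include <stdint.h>

/* Simplified: higher weight = higher priority. Actual runtime is scaled
   by NICE_0_WEIGHT / weight, so vruntime advances slower for
   high-priority tasks and faster for low-priority ones, which keeps
   high-priority tasks toward the left of the red-black tree. */
#define NICE_0_WEIGHT 1024u

struct sched_entity { uint64_t vruntime; uint32_t weight; };

void update_vruntime(struct sched_entity *se, uint64_t delta_exec_ns) {
    se->vruntime += delta_exec_ns * NICE_0_WEIGHT / se->weight;
}
```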
What are some performance counters that can be useful in identifying the workload properties (compute vs. memory bound) and the ability of the scheduler to maximize the system throughput.
Use a new metric, CPI (cycles per instruction), to estimate the job type (compute vs. memory bound) and adjust scheduling accordingly. Historic sleep-time info doesn’t work because memory-bound tasks don’t sleep while waiting on memory.
Sources of such information:
• hardware counters – cache misses, IPC, power + energy data
• software interfaces & tools – oprofile, Linux perf
What happens when a process tries to modify a page that’s write protected/how does COW work?
Tries to modify a page that’s write protected (and COW = copy on write):
• When a new process is created (forked), it gets a new virtual address space whose pages point to the same physical locations as the parent’s. This avoids copying EVERYTHING when that may not be needed.
• Those shared pages are then marked write protected.
• If either process tries to write to a write-protected page, the processor traps into the OS, which determines it’s a (COW) page fault, copies that page for the writing process, and updates its VA => PA mapping.
• This allows us to only copy what is necessary instead of the entire process. Saves space and time - like a lazy copy.
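A user-level illustration of the effect (runnable sketch; the kernel-side page copy happens transparently on the child’s first write):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    char *buf = malloc(4096);
    strcpy(buf, "parent data");

    pid_t pid = fork();                /* child shares the parent's pages, marked read-only */
    if (pid == 0) {
        strcpy(buf, "child data");     /* write traps; the kernel copies just this page */
        printf("child sees:  %s\n", buf);
        exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", buf);  /* still "parent data" */
    return 0;
}
```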
How do we deal with the fact that processes address more memory than physically available? What’s demand paging? How does page replacement work?
Processes won’t use all their memory all the time.
We can save physical page frames out to secondary storage.
Demand Paging:
• pages are swapped in/out between main memory and a swap partition (disk)
Page Replacement:
• if present bit is 0 (not in memory):
• traps into OS
• determines it’s a page fault
•Submits memory request to pull it from disk
• puts it into free location in physical memory (DRAM)
• Updates page table with new free location (VA is unchanged).
• program counter is reset so the instruction re-issues the memory access; this time the page will be present
How does address translation work? What’s the role of the TLB?
Address Translation:
• Virtual Address is mapped to a PA via the page table.
•VA consists of N bit offset (LSBs), then further chunks which represent indexes into the page table, page directory, table of directories, etc.
• Offset is ultimately used to get to the memory location inside the physical page.
TLB - cache of valid VA to PA translations. If a translation is present in the cache, there’s no need to walk the page table in memory.
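A small example of splitting a 32-bit VA into page number and offset (assuming 4 kB pages and a single-level table for simplicity):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                       /* 4 kB pages -> 12 offset bits */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

int main(void) {
    uint32_t va     = 0x00403a10;           /* example virtual address */
    uint32_t vpn    = va >> PAGE_SHIFT;     /* index into the page table */
    uint32_t offset = va & (PAGE_SIZE - 1); /* offset inside the physical page */

    /* with a page table: PA = (PFN looked up via vpn) << PAGE_SHIFT | offset */
    printf("VPN = 0x%x, offset = 0x%x\n", (unsigned)vpn, (unsigned)offset);
    return 0;
}
```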
Do you understand the relationships between the size of an address, the size of the address space, the size of a page, the size of the page table…
- address size: 2^32 or 2^64 (architecture)
- address space size / page size = # virtual pages (e.g., 2^32 / page size)
- page size often 4kB (8kB, 2MB, 4MB, 1GB also options)
Example:
• address size = 2^32
• page size = 4kB
• (2^32 / 2^12) entries * 4 B per entry = size of page table in memory = 4MB, PER PROCESS!
• If 64bit architecture, that number becomes enormous
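For comparison, a hypothetical flat table over a full 64-bit address space with 4 kB pages and 8 B entries: 2^64 / 2^12 = 2^52 entries * 8 B = 2^55 B = 32 PB per process - which is why flat page tables aren’t used there.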
Do you understand the benefits of hierarchical page tables? For a given address format, can you workout the sizes of the page table structures in different layers?
Benefits:
• Don’t need page table entries for the ENTIRE virtual address space - inner tables are allocated only for regions actually in use.
Format:
• [P1 | P2 | Offset] - offset concept remains the same, but each P# is an offset into each inner table.
• 2^(# offset bits) = page size
• 2^(Most inner P) is the number of pages per page table
• 2^(next level outward P) is the number of page tables
• Then directories, etc…
Tradeoffs:
+ smaller inner page tables mean finer granularity of coverage
- more memory access required for translation (increase translation latency)
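Worked example (assuming a 32-bit address split as [P1 = 10 | P2 = 10 | offset = 12]):
• page size = 2^12 = 4 kB
• pages per inner page table = 2^10 = 1024, so each inner table covers 2^10 * 2^12 = 4 MB of the address space
• number of inner page tables the outer directory can point to = 2^10 = 1024
• a process actually using only ~8 MB needs the outer directory plus about 2 inner tables, not entries for the whole 4 GB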
For processes to share memory, what does the OS need to do? Do they use the same virtual addresses to access the same memory?
The OS sets aside a region of physical memory and maps it into each process’s virtual address space. The virtual addresses in each process do not have to be the same.
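For example, with POSIX shared memory (the name "/demo_shm" and the size are arbitrary):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* the OS backs this object with physical pages; every process that
       maps it gets its own virtual addresses for the same memory */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    strcpy(p, "hello from one process");     /* visible to any other process that maps it */
    printf("mapped at %p\n", (void *)p);     /* this address may differ per process */

    munmap(p, 4096);
    close(fd);
    shm_unlink("/demo_shm");
    return 0;
}
```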
What are the tradeoffs between message-based vs. shared-memory-based communication?
Tradeoffs:
• Message based is simple and has an established interface, and is not very error prone. But you must copy the data from user space into the kernel and then back into the second process’s user space, and you must context switch 4 times to do it.
• Shared memory requires heavy initial setup costs - mapping the shared memory into each process - but incurs no overhead beyond that. The space is shared between the two processes. It is much more error prone for developers, however, because there are no pre-determined signaling mechanisms.
What are different ways you can implement synchronization between different processes (think what kinds of options you had in Project 3).
- Shared Memory
- Mutexes + condition variables
- semaphores
- Message Queues
- Sockets
To implement a synchronization mechanism, at the lowest level you need to rely on a hardware atomic instruction. Why? What are some examples?
Because you need an operation that will both test the variable and then set it as a single operation, otherwise you could get race conditions. If not atomic, then multiple processors could see a lock was free and ‘acquire’ the lock and move into the critical section.
If it’s an atomic instruction, only one processor can ‘test and set’ the variable at a time.
Examples of atomic instructions: test_and_set, read_and_increment, compare_and_swap. (Spinlocks and test_and_test_and_set are lock constructions built on top of these.)
Note: On SMP systems, atomic instructions go to the memory instead of the cache so they can be ordered/synchronized. Therefore, they take much longer and generate coherence traffic regardless of change.
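For example, a simple spinlock built on a test-and-set style atomic, using C11 atomics:

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

/* atomically read the old value and set the flag in one step;
   if the old value was already set, someone else holds the lock */
void spin_lock(void)   { while (atomic_flag_test_and_set(&lock)) { /* spin */ } }
void spin_unlock(void) { atomic_flag_clear(&lock); }
```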
Why are spinlocks useful? Would you use a spinlock in every place where you’re currently using a mutex?
Spinlocks are useful when the critical section is very short. This avoids having to context switch and check conditions/mutexes, etc. (many cycles of work when the critical section may be really short).
No - really depends on what the critical section looks like.
It’s also important to note that spin-waiting can cause other processes to slow by consuming communication bandwidth (Anderson)
Do you understand why is it useful to have more powerful synchronization constructs, like reader-writer locks or monitors? What about them makes them more powerful than using spinlocks, or mutexes and condition variables?
They are less error prone and more expressive. Monitors can have enter/exit code that is implicitly executed for you so you don’t forget to do it, which prevents difficult-to-trace errors. They provide synchronization constructs whose granularity better matches the type of critical section being protected. Example: reader/writer locks are specific to the types of access they protect.
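For example, pthread reader/writer locks let many readers proceed concurrently while writers get exclusive access:

```c
#include <pthread.h>

static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
static int shared_counter;

int read_counter(void) {                  /* many readers may hold this at once */
    pthread_rwlock_rdlock(&rwlock);
    int v = shared_counter;
    pthread_rwlock_unlock(&rwlock);
    return v;
}

void increment_counter(void) {            /* writers exclude readers and other writers */
    pthread_rwlock_wrlock(&rwlock);
    shared_counter++;
    pthread_rwlock_unlock(&rwlock);
}
```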
Which lock types are best under low and high load (test_and_set, queuing, test_and_test_and_set, etc)?
Ultimately:
• High load:
  • queueing lock is BEST
  • static delay better than dynamic
• Low load:
  • test_and_test_and_set is best
  • queueing lock is worst
What are the basic differences in using programmed I/O vs. DMA support?
Programmed I/O:
• No additional HW support needed; interacts directly with the device’s command and data registers; takes multiple CPU store instructions; better for small data transfers.
DMA:
• Relies on DMA Controller; Less store instructions but DMA configuration is more complex and can take more overall CPU cycles; better for large data transfers.
So for a 1500 B packet with 8-byte device registers:
• DMA: 1 CPU store to request the packet transmission + 1 DMA configuration operation
• Programmed I/O: 1 CPU store for the command + 1500 B / 8 B ≈ 188 stores to the data register
DMA configuration is more complex (it can take more CPU cycles than a store), so for smaller transfers programmed I/O can still take fewer operations overall than DMA.
DMA controller requires data to be in memory buffer until transfer is complete - they are pinned - non-swappable
Do you understand the relationship between the various data structures (block sizes, addressing scheme, etc.) and the total size of the files or the file system that can be supported on a system?
Data Structures and Relationships:
• inodes point to file metadata + all the blocks that represent the file
• inodes have:
  • direct blocks
  • single indirect
  • double indirect
  • and so on...
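Worked example (assuming, e.g., 1 kB blocks, 4 B block pointers - so 256 pointers per block - and 12 direct pointers in the inode):
• direct blocks: 12 * 1 kB = 12 kB
• single indirect: 256 * 1 kB = 256 kB
• double indirect: 256 * 256 * 1 kB = 64 MB
• triple indirect: 256^3 * 1 kB = 16 GB
So block size, pointer size, and levels of indirection together bound the maximum file size.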
For the virtual file system stack, we mention several optimizations that can reduce the overheads associated with accessing the physical device. Do you understand how each of these optimizations changes how or how much we need to access the device?
- buffer/caching - reduce # accesses
- in main memory
- read/write from cache
- periodically flush to disk (fsync)
- I/O Scheduling - reduce disk head movement
- maximize sequential vs random access
- write block 25, write block 17 => write 17, 25
- prefetching - increase cache hits
- leverages locality
- eg, read block 17 => read also 18, 19
- journaling/logging - reduce random access
- “describe” write in log: block, offset, value …
- periodically apply update to proper disk locations
What’s hosted vs. bare-metal virtualization? What’s paravirtualization, why is it useful?
Hosted:
• Host OS has a VMM module that invokes drivers as needed
• Host owns all hardware
• KVM (Kernel Based VM)
Bare-Metal:
• A hypervisor (vmm) controls resource allocation to all of the VM’s present and controls access to the underlying hardware.
• Has a privileged service VM to deal with devices and device access
• ESX, Xen
Paravirtualization:
• A modified version of the OS so that it’s aware it is being virtualized.
• Makes explicit calls to hypervisor (hypercalls)
What were the problems with virtualizing x86? How does protection of x86 used to work and how does it work now?
Certain sensitive instructions would not trap into the hypervisor and instead failed silently, so the hypervisor never knew they had been issued.
Protection:
• Old Way:
• Only 4 rings, no support for root/non-root
• 17 instructions that were sensitive but not privileged, so they did not cause a trap and instead failed silently
• e.g., setting the interrupt enable/disable bit in a privileged register failed silently when issued from the ring 1 guest OS, so the VMM didn’t know the OS was trying to set it, and the OS assumed it had worked
- New Way:
- Rewrite binary of guest VM so it never issues those 17 instructions (dynamic binary translation)
- dynamically inspect code block. if bad instructions not found, execute at HW speeds, if found, emulate desired behavior (possibly avoiding trap)
- cache translated block to improve efficiency, only analyze kernel code
What is passthrough vs. a split-device model?
Passthrough:
• VMM-level driver configures device access permissions
+ guest VM has exclusive access to a device, and is only one who can access it and they have direct access.
- device sharing is difficult - no concurrent sharing
- VMM must have exact type of device as what VM expects (since it’s got direct access, very specific)
- VM migration is tricky b/c it must have same hardware and state at destination
Hypervisor-Direct:
• VMM intercepts all device accesses
• emulates device operation
+ device independent
- latency of device operations
- exposed to device driver ecosystem complexities/bugs in hypervisor
Split-Device:
•access control split between front-end (guest) and back end (service VM or host)
•back end is regular device driver that host/linux etc would use
• front end driver MUST be modified. Basically a wrapper that creates messages that are sent to the service VM instead of issuing requests directly to the hardware.
• limited to paravirtualized guests that can install special front end device drivers
+ eliminate emulation overhead
+ allow for better management of shared devices - b/c all requests are arbitrated by the service VM
What are the various design points that have to be sorted out in implementing an RPC runtime (e.g., binding process, failure semantics, interface specification… )? What are some of the options and associated tradeoffs?
RPC Requirements:
•Has synchronous call semantics
• error handling (type checking, packet bytes interpretation)
•cross machine conversion (endian-ness)
•higher level protocol (access control, fault tolerance, different transport protocols)
RPC Structure:
• Client calls fn that looks like a normal function
• This fn actually calls a stub that creates message and sends the request to the server
• server stub receives msg and deserializes/unpacks it
• server stub calls actual implementation and then process goes in reverse to return result
RPC Steps:
•register - server ‘registers’ procedures, arg types, location, etc so it can be discovered by clients
• binding - find and ‘binds’ to desired server that implements desired function
• client makes RPC
• client stub ‘marshals’ args (serialized into buffer)
• send - client sends message to server
• receive - server receives msg, passes to server stub
• stub ‘unmarshals’ args (extract args and create data structs)
• actual call: server stub calls the local implementation
• result: server computes results and follows steps backwards
RPC Design Points:
• Binding - how to find the server
• IDL - how to talk to the server and package data
• Pointers - disallow vs serialize pointed data
• Partial failures - special error notifications
What’s specifically done in Sun RPC for these design points – you should easily understand this from your project?
Binding - per-machine registry daemon (portmapper); the client talks to the registry on the target machine to find the server.
IDL - XDR (language agnostic specification and encoding)
Pointers - allowed and serialized
Partial Failures - retries; return as much info as possible
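A minimal client-side sketch (SQUARE_PROG, SQUARE_VERS, square_1 and square.h stand in for whatever rpcgen generated from your .x file - they are hypothetical names, not your actual project’s):

```c
#include <rpc/rpc.h>
#include <stdio.h>
#include "square.h"   /* assumed rpcgen output: SQUARE_PROG, SQUARE_VERS, and
                         the client stub  int *square_1(int *argp, CLIENT *clnt); */

int main(void) {
    /* binding: contact the registry (portmapper) on the target machine */
    CLIENT *clnt = clnt_create("localhost", SQUARE_PROG, SQUARE_VERS, "tcp");
    if (clnt == NULL) { clnt_pcreateerror("localhost"); return 1; }

    int arg = 5;
    int *res = square_1(&arg, clnt);        /* stub marshals the arg, sends, waits for the reply */
    if (res == NULL)
        clnt_perror(clnt, "call failed");   /* partial failure surfaces as an error */
    else
        printf("square(5) = %d\n", *res);

    clnt_destroy(clnt);
    return 0;
}
```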