Final Flashcards
How does scheduling work? What are the basic steps and data structures involved in scheduling a thread on the CPU?
The scheduler dispatches a task from the run queue based on its policy/algorithm (e.g., FIFO, SJF, priority).
The run queue's data structure depends on the scheduling policy (they are tightly coupled): a FIFO policy needs only a basic queue, but other policies may require a tree or a multi-level structure.
The scheduler runs when:
- the CPU becomes idle
- a timeslice expires
- a new task becomes available
When dispatching a thread:
- context switch to selected thread
- enter user mode
- set program counter to next instruction of scheduled task
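A minimal sketch of the dispatch path, assuming a FIFO policy (the names task_t, run_queue_t, and context_switch are illustrative, not from the lecture):

```c
#include <stdlib.h>

/* Hypothetical task control block: just what dispatch needs. */
typedef struct task {
    void        *context;   /* saved registers, stack pointer, PC */
    struct task *next;
} task_t;

/* FIFO run queue: dequeue from head, enqueue at tail. */
typedef struct {
    task_t *head, *tail;
} run_queue_t;

static task_t *pick_next(run_queue_t *rq) {
    task_t *t = rq->head;               /* FIFO: oldest runnable task */
    if (t) {
        rq->head = t->next;
        if (!rq->head) rq->tail = NULL;
    }
    return t;
}

/* Arch-specific: restores registers, drops to user mode, and resumes
 * at the task's saved program counter. */
extern void context_switch(void *context);

static void schedule(run_queue_t *rq) {
    task_t *next = pick_next(rq);
    if (next)
        context_switch(next->context);
}
```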
P3L1
What are the overheads associated with scheduling? Do you understand the tradeoffs associated with the frequency of preemption and scheduling/what types of workloads benefit from frequent vs. infrequent intervention of the scheduler (short vs. long time slices)?
The main overhead is context switching.
Tradeoffs of (more frequent) context switching:
+ reduces wait time
+ I/O operations issued earlier
- drops throughput
- increases average completion time
For CPU-bound tasks, longer timeslices are better: fewer context switches means less overhead, which keeps CPU utilization and throughput high.
For I/O-bound tasks, shorter timeslices are better: I/O operations are issued sooner, and user-perceived performance is better.
P3L1
Can you work through a scenario describing some workload mix (few threads, their compute and I/O phases) and for a given scheduling discipline compute various metrics like average time to completion, system throughput, wait time of the tasks…
Throughput
jobs completed / time to complete all jobs
Average Completion Time
sum of times to complete each job / jobs completed
Average Wait Time
sum of time each job spent waiting / jobs completed
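A worked example, assuming FIFO scheduling and three tasks T1, T2, T3 that all arrive at time 0 with execution times of 1s, 10s, and 1s (no I/O):
- T1 completes at t=1, T2 at t=11, T3 at t=12
- Throughput = 3 jobs / 12s = 0.25 jobs/s
- Average completion time = (1 + 11 + 12) / 3 = 8s
- Average wait time = (0 + 1 + 11) / 3 = 4s
(A shortest-job-first order, T1, T3, T2, would cut average completion time to (1 + 2 + 12) / 3 = 5s.)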
P3L1
Do you understand the motivation behind the multi-level feedback queue, why different queues have different time slices, how do threads move between these queues… Can you contrast this with the O(1) scheduler? Do you understand what were the problems with the O(1) scheduler which led to the CFS?
The multi-level feedback queue lets us find an appropriate timeslice for a thread based on how it behaves. If a task uses its entire timeslice, we move it to a queue with a longer timeslice; if it yields early, we keep it in (or move it back to) a queue with a shorter timeslice. This means we don't have to track and compute historic behavior; we just run tasks and let their behavior dictate where they belong.
The O(1) scheduler uses priorities to determine the order in which tasks run, with timeslice lengths that depend on priority; however, once a task uses up its timeslice it moves to the expired list and cannot run again until every active task has used up its timeslice. This led to unacceptable jitter as software became more interactive.
The Completely Fair Scheduler (CFS) uses a self-balancing (red-black) tree ordered by vruntime (virtual time spent on the CPU). vruntime progresses faster for lower-priority tasks and slower for higher-priority tasks, and CFS always schedules the task with the lowest vruntime.
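A minimal sketch of the vruntime bookkeeping, assuming an illustrative weight scheme (real CFS keeps entities in a red-black tree and derives weights from nice values; the names here are made up):

```c
/* vruntime accounting, sketched: actual CPU time is scaled by the
 * task's weight, so low-priority tasks accumulate vruntime faster
 * and get picked less often. */
typedef struct {
    unsigned long long vruntime; /* weighted time on CPU, in ns */
    unsigned int       weight;   /* higher = higher priority */
} sched_entity_t;

#define BASE_WEIGHT 1024  /* weight of a default-priority task (assumed) */

static void account_runtime(sched_entity_t *se, unsigned long long delta_ns) {
    /* High weight => vruntime grows slowly; low weight => quickly. */
    se->vruntime += delta_ns * BASE_WEIGHT / se->weight;
}

/* Scheduling decision: always run the entity with the smallest
 * vruntime. A linear scan here; CFS's red-black tree makes the
 * same pick the leftmost node, with O(log n) insertion. */
static sched_entity_t *pick_min_vruntime(sched_entity_t *tasks, int n) {
    sched_entity_t *min = &tasks[0];
    for (int i = 1; i < n; i++)
        if (tasks[i].vruntime < min->vruntime)
            min = &tasks[i];
    return min;
}
```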
P3L1
Thinking about Fedorova’s paper on scheduling for chip multiprocessors, what’s the goal of the scheduler she’s arguing for? What are some performance counters that can be useful in identifying the workload properties (compute vs. memory bound) and the ability of the scheduler to maximize the system throughput?
Fedorova argues for a scheduler that maximizes CPU utilization in hardware-multithreaded environments (chip multiprocessors).
Generally, we use instructions per cycle (IPC) to determine utilization. Tasks that are CPU-bound typically have IPCs that are close to 1, while tasks that are memory-bound typically have IPCs that are close to 0.
Fedorova suggests that hardware incorporate a new metric, cycles per instruction (CPI), which is the reciprocal of IPC.
CPU-bound tasks would still have a CPI of close to 1, while memory-bound tasks would have a CPI much higher than 1.
The IPC performance counter is helpful in determining whether a task is compute-bound or memory-bound. In addition, performance counters that track cache misses can help: with many cache misses, a scheduler can assume a task makes many memory accesses and is therefore memory-bound; with few cache misses, it can assume the task is more compute-bound.
P3L1
How does the OS map the memory allocated to a process to the underlying physical memory?
The OS maps the memory allocated to a process to the underlying physical memory via a page table (or segments): the virtual page number is used to look up the corresponding page frame number, and the offset then indexes into that frame to find the exact physical location.
P3L2
How do we deal with the fact that processes address more memory than is physically available?
We deal with the fact that processes address more memory than physically available by swapping data out of physical memory (e.g., DRAM) to secondary storage (e.g., on disk). We determine which data to swap using an algorithm (e.g., Least Recently Used, which uses the access bit).
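A sketch of one common access-bit-based approximation of LRU, the "second chance"/clock algorithm (frame_t and pick_victim are illustrative names):

```c
#include <stdbool.h>

/* Hypothetical page frame descriptor. */
typedef struct {
    bool accessed;   /* hardware sets this bit on any reference */
    /* ... mapping info ... */
} frame_t;

/* Sweep frames like a clock hand: a set access bit means "recently
 * used", so clear it and give the page a second chance; a clear bit
 * means the page hasn't been touched since the last sweep: evict it. */
static int pick_victim(frame_t *frames, int nframes, int *hand) {
    for (;;) {
        frame_t *f = &frames[*hand];
        if (f->accessed) {
            f->accessed = false;            /* second chance */
        } else {
            int victim = *hand;
            *hand = (*hand + 1) % nframes;
            return victim;                  /* swap this page out */
        }
        *hand = (*hand + 1) % nframes;
    }
}
```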
P3L2
How does address translation work? What’s the role of the TLB?
In order to translate a virtual address to a physical address we have to look it up in a page table. Page tables can be flat or hierarchical. Hierarchical page tables save on space by not requiring an entry for every single virtual address (used or unused) but they require more lookups (outer page table to inner page tables).
A virtual address has two parts with a flat page table, or three (or more, for deeper hierarchies) with a hierarchical one. The last part is always the offset, which indexes into the physical page frame. The first part(s) index into the page table(s) to find the page frame number, which tells us where the frame starts in physical memory.
The role of the Translation Lookaside Buffer is to cache translations between virtual addresses and physical addresses so we can save on lookups. This reduces the latency that’s created by hierarchical page tables, which save on space but require more lookups.
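A worked example, assuming a flat page table, 32-bit virtual addresses, and 4KB pages:
- 4KB pages mean a 12-bit offset; the remaining 20 bits are the virtual page number (VPN), so the flat table has 2^20 entries
- For virtual address 0x00403ABC: VPN = 0x00403, offset = 0xABC
- If the page table entry for VPN 0x00403 holds page frame number 0x1F, the physical address is (0x1F << 12) | 0xABC = 0x1FABC
- The TLB caches the VPN-to-PFN mapping so the next access to that page skips the table walk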
P3L2
Do you understand the relationships between the size of an address, the size of the address space, the size of a page, the size of the page table?
The address size determines the size of the address space: an n-bit address can name 2^n bytes. The page size determines the offset bits (log2 of the page size) and therefore the number of pages: address space size / page size. A flat page table needs one entry per virtual page, so its size is (number of pages) x (entry size).
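A worked example, assuming a 32-bit address, 4KB pages, and 4-byte page table entries:
- address space = 2^32 bytes = 4GB
- number of pages = 2^32 / 2^12 = 2^20
- flat page table size = 2^20 entries x 4 bytes = 4MB per process
P3L2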
Do you understand the benefits of hierarchical page tables? For a given address format, can you workout the sizes of the page table structures in different layers?
Hierarchical page tables save space because inner tables are allocated only for the regions of the address space a process actually uses; a flat table needs an entry for every virtual page, used or not. Given an address format, each layer's bit-field width determines its table sizes: a field of k bits means 2^k entries per table at that layer, and the offset field fixes the page size.
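A worked example, assuming a 32-bit address split as 10 bits (outer) | 10 bits (inner) | 12 bits (offset), with 4-byte entries:
- 12 offset bits give 4KB pages; each inner table has 2^10 entries covering 2^10 x 4KB = 4MB of address space
- the outer table has 2^10 entries, each pointing to one inner table
- a process using only 8MB of its address space needs the outer table plus two inner tables: 3 x 4KB = 12KB, versus 4MB for the flat table
P3L2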
For processes to share memory, what does the OS need to do? Do they use the same virtual addresses to access the same memory?
The OS needs to create the shared memory segment and map it into the virtual address space of both processes.
The virtual address of the shared memory segment is not guaranteed to be the same in each process; each process may map the segment at a different virtual address.
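A minimal sketch using POSIX shared memory (the name "/example_shm" and the size are illustrative):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/example_shm"  /* illustrative name */
#define SHM_SIZE 4096

int main(void) {
    /* Create (or open) the named shared memory object... */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
    if (fd == -1) return 1;
    if (ftruncate(fd, SHM_SIZE) == -1) return 1;

    /* ...and map it into this process's virtual address space.
     * Another process that shm_open()s the same name and mmap()s it
     * sees the same physical pages, but `addr` may differ there. */
    void *addr = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) return 1;

    /* ... use addr ... */
    munmap(addr, SHM_SIZE);
    close(fd);
    shm_unlink(SHM_NAME);  /* remove the name once done */
    return 0;
}
```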
P3L3
For processes to communicate using a shared memory-based communication channel, do they still have to copy data from one location to another?
No; processes using shared-memory-based communication don’t have to copy data from one location to another, as long as the data is allocated at virtual addresses that belong to the shared memory region.
P3L3
What are different ways you can implement synchronization between different processes (think what kinds of options you had in Project 3)?
- mutex & condition variable
- semaphore
- message queue
- socket
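A sketch of the first option: a mutex and condition variable usable across processes must live in the shared segment itself and be initialized with the PTHREAD_PROCESS_SHARED attribute (shm_region_t is an illustrative layout):

```c
#include <pthread.h>

/* Hypothetical layout of a shared segment: the mutex and condition
 * variable live in the shared memory so both processes see them. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             data;
} shm_region_t;

/* Run once, by the process that creates the segment. */
static void init_region(shm_region_t *r) {
    pthread_mutexattr_t mattr;
    pthread_condattr_t  cattr;

    pthread_mutexattr_init(&mattr);
    pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&r->lock, &mattr);

    pthread_condattr_init(&cattr);
    pthread_condattr_setpshared(&cattr, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&r->ready, &cattr);
}
```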
P3L3
To implement a synchronization mechanism, at the lowest level you need to rely on a hardware atomic instruction. Why? What are some examples?
When multiple CPUs and/or multiple threads access a lock, the check and the set of the lock’s value can interleave (meaning more than one processor could see the lock as free and “acquire” it) unless the check-and-set happens as a single atomic operation. When an operation is atomic, only one CPU at a time can perform it.
Examples:
- test_and_set
- read_and_increment
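A minimal spinlock built on an atomic test-and-set, here via C11's atomic_flag (which maps to the hardware's atomic exchange/test-and-set instruction):

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void) {
    /* Atomically set the flag and return its old value: if the old
     * value was clear, we got the lock; otherwise keep spinning. */
    while (atomic_flag_test_and_set(&lock))
        ;   /* spin */
}

void spin_unlock(void) {
    atomic_flag_clear(&lock);
}
```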
P3L4
Why are spinlocks useful? Would you use a spinlock in every place where you’re currently using a mutex?
A spinlock is useful when the time a thread would spend spinning is less than the time it takes to context switch.
Spinlocks create more contention than a regular mutex, so use them only when critical sections are short.
P3L4
Do you understand why is it useful to have more powerful synchronization constructs, like reader-writer locks or monitors? What about them makes them more powerful than using spinlocks, or mutexes and condition variables?
More powerful synchronization constructs like reader-writer locks or monitors are useful because they handle more for the programmer (e.g., lock/unlock ordering, signaling), which reduces common errors.
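A sketch with a pthreads reader-writer lock: the construct manages reader counting and writer exclusion internally, instead of the programmer hand-rolling it with a mutex, counter, and condition variable (shared_counter is illustrative):

```c
#include <pthread.h>

pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
int shared_counter;

int read_value(void) {
    pthread_rwlock_rdlock(&rwlock);   /* many readers may hold this */
    int v = shared_counter;
    pthread_rwlock_unlock(&rwlock);
    return v;
}

void write_value(int v) {
    pthread_rwlock_wrlock(&rwlock);   /* writers get exclusive access */
    shared_counter = v;
    pthread_rwlock_unlock(&rwlock);
}
```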
P3L4
Can you work through the evolution of the spinlock implementations described in the Anderson paper, from basic test-and-set to the queuing lock? Do you understand what issue with an earlier implementation is addressed with a subsequent spinlock implementation?
- test_and_set spinlock
  atomic operation in the while loop, which causes contention and is disastrous on a cache-coherent architecture with write-invalidation
- test_and_test_and_set ("spin on read") spinlock
  atomic operation still in the while loop, but only executed if the cached value says the lock is free, which reduces contention; still bad on a cache-coherent architecture with write-invalidation, because every release invalidates all cached copies and triggers a burst of atomic attempts
  + improved latency
  + improved delay
  -/+ contention: NCC = same, CC+WU = O(n), CC+WI = O(n^2)
- test_and_test_and_set spinlock + delay
  same as above, but a delay after the lock is released spreads out the re-reads and acquire attempts, which helps on a cache-coherent architecture with write-invalidation because everyone isn’t going for the lock at the same time
  + improved latency
  + improved contention
  - worsened delay
- queueing (Anderson) spinlock
  one atomic operation (read_and_increment) when the thread requests the lock; subsequent spinning happens on a private per-thread variable, which reduces contention and avoids the problems of a cache-coherent architecture with write-invalidation
  + improved delay
  + improved contention
  - worsened latency
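A sketch of the array-based queueing lock, assuming a fixed bound on concurrent contenders; memory-ordering details and cache-line padding of the flags are elided for brevity:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NTHREADS 8   /* works while at most NTHREADS threads contend */

typedef struct {
    /* flags[i] == true means "slot i may take the lock". */
    volatile bool flags[NTHREADS];
    atomic_uint   queue_last;       /* next slot to hand out */
} qlock_t;

void qlock_init(qlock_t *l) {
    for (int i = 0; i < NTHREADS; i++) l->flags[i] = false;
    l->flags[0] = true;             /* first arrival gets the lock */
    atomic_init(&l->queue_last, 0);
}

unsigned qlock_lock(qlock_t *l) {
    /* One atomic read_and_increment per acquisition; after that each
     * thread spins on its own slot, so a release touches one waiter. */
    unsigned me = atomic_fetch_add(&l->queue_last, 1) % NTHREADS;
    while (!l->flags[me])
        ;   /* spin on a private location */
    return me;
}

void qlock_unlock(qlock_t *l, unsigned me) {
    l->flags[me] = false;                    /* reset my slot */
    l->flags[(me + 1) % NTHREADS] = true;    /* signal next in line */
}
```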
P3L4
What are the steps in sending a command to a device (say packet, or file block)? What are the steps in receiving something from a device?
Sending:
- Send data (user process)
- Form packet (kernel)
- Write Tx request record (driver)
- Perform transmission (device)
Receiving:
- Device retrieves data and passes it to the driver
- Driver passes it to the kernel
- Kernel passes it to the user process
P3L5
For block storage devices, do you understand the basic virtual file system stack, the purpose of the different entities?
file: the file’s data blocks on disk (may be non-contiguous)
inode: per-file index that tracks all of the file’s blocks; the inode itself resides on disk in some block (one contiguous index structure)
superblock: overall map of the disk’s blocks (inode blocks, data blocks, free blocks)
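A simplified sketch of an on-disk inode with direct and indirect block pointers (field names and counts are illustrative; ext2-style layouts use 12 direct pointers):

```c
#include <stdint.h>

#define NDIRECT 12   /* illustrative, borrowed from ext2-style layouts */

/* On-disk inode, simplified: a contiguous index of where the file's
 * (possibly scattered) data blocks live on disk. */
struct inode {
    uint32_t size;               /* file size in bytes */
    uint32_t direct[NDIRECT];    /* block #s of the first data blocks */
    uint32_t single_indirect;    /* block # of a block full of block #s */
    uint32_t double_indirect;    /* block # of a block of indirect blocks */
};
```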
P3L5