Final Flashcards
How does scheduling work? What are the basic steps and data structures involved in scheduling a thread on the CPU?
An OS scheduler is responsible for selecting a task (process or thread) and having it run on a CPU. Scheduling occurs when the CPU becomes idle; the scheduler then chooses a task from the ready queue. How a task is selected depends on the scheduling policy/algorithm (e.g., FIFO, priority, time slicing). The run queue is the data structure that implements the scheduling mechanism, and its design is tightly coupled with the scheduling algorithm.
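A minimal sketch of the idea, assuming a FIFO policy (the class and task names are illustrative, not from any real kernel):

```python
from collections import deque

# Hypothetical FIFO scheduler: the run queue (a deque here) is the
# mechanism, and the pick-next rule is the policy.
class FifoScheduler:
    def __init__(self):
        self.run_queue = deque()

    def make_ready(self, task):
        # A task that becomes runnable is appended to the run queue.
        self.run_queue.append(task)

    def pick_next(self):
        # Called when the CPU becomes idle; FIFO = take the head.
        return self.run_queue.popleft() if self.run_queue else None

sched = FifoScheduler()
sched.make_ready("T1")
sched.make_ready("T2")
print(sched.pick_next())  # -> T1
```

A priority or timeslicing policy would swap the deque for a different structure (e.g., per-priority queues), which is why the run queue and the algorithm are tightly coupled.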
What are the overheads associated with scheduling?
Primarily the time spent context switching tasks: saving the state of the current task and restoring the state of the next one.
Do you understand the tradeoffs associated with the frequency of preemption and scheduling/what types of workloads benefit from frequent vs. infrequent intervention of the scheduler (short vs. long timeslices)?
CPU-bound tasks benefit most from long timeslices: this limits context-switching overhead and keeps CPU utilization and throughput high. I/O-bound tasks prefer shorter timeslices: this keeps both CPU and device utilization high, and in most cases an I/O-bound task will yield for an I/O operation before its timeslice expires. Shorter timeslices also give the user the perception of a more responsive system.
Can you work through a scenario describing some workload mix (few threads, their compute and I/O phases) and for a given scheduling discipline compute various metrics like average time to completion, system throughput, wait time of the tasks…
- Compute phases: pay attention to the timeslice
- I/O phases: pay attention to the frequency of I/O operations
- Throughput: # tasks / total time to complete them
- Avg. completion time: sum of each task's completion time (measured from t=0) / # tasks
- Avg. wait time: sum of each task's wait time before it first runs / # tasks
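The formulas above can be sketched as a small helper; the workload below is hypothetical (a non-preemptive schedule where all tasks arrive at t=0):

```python
def metrics(run_times):
    """Metrics for a non-preemptive schedule, all tasks arriving at t=0.
    run_times: execution times, in the order the tasks are scheduled."""
    completion, wait, t = [], [], 0
    for rt in run_times:
        wait.append(t)            # waited from t=0 until first scheduled
        t += rt
        completion.append(t)      # finished at time t
    n = len(run_times)
    return n / t, sum(completion) / n, sum(wait) / n

# Hypothetical FIFO order with run times 1s, 10s, 1s:
print(metrics([1, 10, 1]))  # (0.25, 8.0, 4.0)
```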
Do you understand the motivation behind the multi-level feedback queue, why different queues have different timeslices, how do threads move between these queues… Can you contrast this with the O(1) scheduler?
A multi-level feedback queue covers the needs of both CPU- and I/O-bound tasks in a single structure without having to know the task type in advance. The different queues carry different timeslice values, each benefiting CPU- or I/O-bound tasks to varying degrees (individually and mixed); this is the motivation for having multiple queues with different timeslices. Tasks enter at the top-most (shortest-timeslice) queue. If a task yields before its timeslice expires, it stays at that level; if it uses the entire timeslice, it gets pushed down to a lower level. Tasks get a priority boost when releasing the CPU for I/O. The O(1) scheduler is similar in that it maps timeslice values to priority levels. The biggest differences are that it relies on just two arrays of queues (active and expired), and its feedback is based on sleep time (waiting/idling): longer sleep implies an I/O-intensive task and earns a priority boost (-5), while little sleep implies a CPU-intensive task and lowers priority (+5).
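The movement rules can be sketched as follows; the number of levels and the timeslice values are illustrative only:

```python
from collections import deque

# Hypothetical MLFQ: level 0 has the shortest timeslice. Expiring a
# timeslice demotes a task; yielding for I/O boosts it back up.
TIMESLICES = [1, 10, 100]  # ms per level (made-up values)
queues = [deque() for _ in TIMESLICES]

def admit(task):
    queues[0].append(task)                        # enter at the top-most queue

def on_timeslice_expired(task, level):
    new_level = min(level + 1, len(queues) - 1)   # push to a lower level
    queues[new_level].append(task)

def on_io_yield(task, level):
    new_level = max(level - 1, 0)                 # boost after releasing CPU for I/O
    queues[new_level].append(task)

admit("A")                           # new task enters level 0
t = queues[0].popleft()
on_timeslice_expired(t, 0)           # used its whole slice -> demoted
print([list(q) for q in queues])     # [[], ['A'], []]
```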
Do you understand what were the problems with the O(1) scheduler which led to the CFS?
The problem was that tasks in the active queue had to wait until all active tasks had exhausted their timeslices before the swap with the expired queue occurred. This created too much jitter for applications that needed realtime-like performance, such as Skype. CFS instead uses a balanced (red-black) tree as its runqueue, ordered by the CPU time each task has received, which allows the scheduler to update a task's position frequently and always pick the task that has run the least.
Thinking about Fedorova’s paper on scheduling for chip multi processors, what’s the goal of the scheduler she’s arguing for?
To better handle resource contention in order to deliver better application performance.
What are some performance counters that can be useful in identifying the workload properties (compute vs. memory bound) and the ability of the scheduler to maximize the system throughput?
Traditionally, we look at instructions per cycle (IPC): compute-bound tasks have an IPC close to 1, while memory-bound tasks are closer to 0. Fedorova proposed using cycles per instruction (CPI): compute-bound tasks have a CPI close to 1, while memory-bound tasks have a CPI much higher than 1. To maximize system throughput, she proposed that each core run a mixed-CPI workload, which leads to a well-utilized processor pipeline and high overall IPC. A similar-CPI workload on a core leads to resource contention while wasting cycles on other cores.
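A toy sketch of how a scheduler might use CPI samples; the CPI values and the 2.0 threshold are made-up illustrations, not from the paper:

```python
# Hypothetical CPI readings from hardware performance counters.
samples = {"T1": 1.1, "T2": 9.8, "T3": 1.0, "T4": 11.2}

def classify(cpi):
    # Illustrative threshold: near 1 => compute-bound, much higher => memory-bound.
    return "compute-bound" if cpi < 2.0 else "memory-bound"

for task, cpi in samples.items():
    print(task, classify(cpi))

# A scheduler following this policy would pair one compute-bound and one
# memory-bound task per core, so memory stalls of one task leave pipeline
# resources free for the other.
```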
Linux Scheduler Quiz: What was the main reason the Linux O(1) scheduler was replaced by the CFS scheduler?
A) Scheduling a task under high loads took an unpredictable amount of time.
B) Low-priority tasks could wait indefinitely and starve
C) Interactive tasks could wait unpredictable amounts of time to be scheduled
C
Shortest Job Performance Quiz:
Assume SJF is used to schedule tasks T1, T2, T3. Also, make the following assumptions:
- scheduler does not preempt tasks
- known execution times: T1=1s, T2=10s, T3=1s
- all arrive at the same time, t=0

Calculate the throughput, avg. completion time, and avg. wait time.
SJF runs the shortest jobs first: T1 (1s), T3 (1s), then T2 (10s).
Throughput: 3/12 = 0.25 tasks/s
Avg. completion time: (1 + 2 + 12)/3 = 5s
Avg. wait time: (0 + 1 + 2)/3 = 1s
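The quiz numbers can be checked mechanically (same assumptions: non-preemptive, all tasks arrive at t=0):

```python
# SJF order: T1 (1s), T3 (1s), T2 (10s)
run_times = [1, 1, 10]
completion, wait, t = [], [], 0
for rt in run_times:
    wait.append(t)           # time waited before first run
    t += rt
    completion.append(t)     # time from t=0 to finish

print(len(run_times) / t)                 # throughput: 3/12 = 0.25
print(sum(completion) / len(completion))  # avg completion: (1+2+12)/3 = 5.0
print(sum(wait) / len(wait))              # avg wait: (0+1+2)/3 = 1.0
```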
How do we deal with the fact that processes address more memory than physically available? What’s demand paging?
Because a process's virtual address space can be larger than the available physical memory, we rely on page swapping and demand paging. Processes will rarely need the full theoretical amount of virtual memory, so with demand paging, pages are brought into physical memory only when referenced and swapped out to a swap partition (e.g., disk or flash device) when memory is needed.
How does page replacement work?
When a page is referenced and is not present in memory, the MMU will see that the page table entry has the present bit set to 0, raise a fault, and trap to the OS. The OS determines whether the page was swapped out to disk and, if so, issues the disk access required to bring the page back in. To decide which pages should be swapped out, the OS uses history-based predictions such as Least Recently Used (LRU), which relies on the access bit in the page table entry.
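A minimal sketch of LRU replacement over a fixed pool of frames; the frame count and reference string are hypothetical:

```python
from collections import OrderedDict

# Toy LRU page replacement: an OrderedDict tracks recency of use.
NUM_FRAMES = 3
frames = OrderedDict()          # page -> contents; insertion order = recency

def access(page):
    if page in frames:
        frames.move_to_end(page)                 # mark most recently used
        return "hit"
    if len(frames) >= NUM_FRAMES:
        victim, _ = frames.popitem(last=False)   # evict least recently used
        # ...here the OS would write `victim` out to swap if it is dirty...
    frames[page] = "data"
    return "fault"                               # trap to OS, bring page in

for p in [1, 2, 3, 1, 4]:
    print(p, access(p))   # faults on 1, 2, 3; hit on 1; fault on 4 (evicts 2)
```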
What happens when a process tries to modify a page that’s write protected/how does COW work?
The Copy-On-Write (COW) mechanism avoids unnecessary copying of a process's virtual address space. On process creation, the new process's virtual address space is mapped to the original process's pages, and the shared memory is write-protected. For reads, this saves both memory and copy time. When a write is requested by either process, the MMU issues a page fault; the OS then copies only the affected page(s) and updates the page tables of each process. The copy cost is paid only on demand, and only if necessary.
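A toy model of the fault-and-copy step, assuming made-up dictionary structures in place of real kernel page tables:

```python
# One shared, write-protected page mapped by a "parent" and a "child".
page = {"data": bytearray(b"hello"), "refs": 2, "write_protected": True}
page_table = {"parent": page, "child": page}

def write(proc, offset, value):
    pg = page_table[proc]
    if pg["write_protected"] and pg["refs"] > 1:
        # Simulated page fault: copy the page for the writer only,
        # then fix up the writer's mapping to point at the private copy.
        private = {"data": bytearray(pg["data"]), "refs": 1,
                   "write_protected": False}
        pg["refs"] -= 1
        page_table[proc] = private
        pg = private
    pg["data"][offset] = value

write("child", 0, ord("H"))   # child pays the copy cost on its first write
print(page_table["parent"]["data"], page_table["child"]["data"])
```

The parent still sees `hello` while the child sees `Hello`: the copy happened only for the page that was actually written.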
How does address translation work?
Every CPU package is equipped with a Memory Management Unit (MMU). The CPU issues virtual addresses to the MMU, and the MMU is responsible for translating them into physical addresses (or generating a fault). Registers are used to store pointers to the active page table or, for segmentation, details such as the segment base, size, and number.
What’s the role of the TLB?
The Translation Lookaside Buffer (TLB) is a cache of valid virtual-to-physical address translations that is accessed by the MMU. The TLB is used to speed up address translation.
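A sketch of the TLB fast path in front of a page-table walk; the page size, tables, and mappings below are all hypothetical:

```python
PAGE_SIZE = 4096
page_table = {0: 7, 1: 3, 2: 9}   # made-up VPN -> PFN mappings
tlb = {}                          # small cache of recent translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                 # TLB hit: skip the page-table walk
        pfn = tlb[vpn]
    else:                          # TLB miss: walk the page table
        pfn = page_table[vpn]      # an unmapped VPN would fault here
        tlb[vpn] = pfn             # cache the translation for next time
    return pfn * PAGE_SIZE + offset

print(hex(translate(4100)))  # VPN 1, offset 4 -> PFN 3 -> 0x3004
```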
How does the OS map the memory allocated to a process to the underlying physical memory?
The OS creates a page table per process. The page table contains entries that map a page of virtual memory to a page frame in physical memory. Physical memory is allocated only when it is needed, so the mapping isn't established until the process first attempts to access the page.
What happens when a process tries to access a page not present in physical memory?
The page table entry's valid/present bit will be 0. When the process accesses the page, the hardware MMU (memory management unit) sees the bit is 0, raises a fault, and traps to the OS. The OS assumes control and decides what to do; for example, if the page was swapped to disk, the OS pulls the page back from disk, updates the page table, and restarts the faulting instruction.
What happens when a process tries to access a page that hasn’t been allocated to it?
The page table will show via the valid bit that physical memory hasn't been allocated. When the process accesses the page, the hardware MMU (memory management unit) sees the valid bit is 0, raises a fault, and traps to the OS. The OS assumes control and decides what to do: if the access is permitted, the OS allocates physical memory and establishes the mapping before the process continues.
Do you understand the relationships between the size of an address, the size of the address space, the size of a page, the size of the page table…
For an address size of X bits (e.g., a 64-bit architecture), a process can address 2^X bytes of virtual address space (2^64). For a page size of 2^Y bytes (4 KB = 2^12 bytes), a page table needs 2^(X-Y) entries (2^(64-12) = 2^52). Since each entry is X bits (X/8 bytes), the page table size is 2^(X-Y) * X/8 bytes (2^52 * 8 bytes = 2^55 bytes = 32 PB) per process!
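The arithmetic above, spelled out (using the same 64-bit/4 KB example):

```python
X = 64                                 # address size in bits
Y = 12                                 # page offset bits (4 KB pages)

entries = 2 ** (X - Y)                 # 2^52 page-table entries
entry_bytes = X // 8                   # each entry is X bits = 8 bytes
table_bytes = entries * entry_bytes    # 2^52 * 8 = 2^55 bytes

print(table_bytes == 2 ** 55)          # True
print(table_bytes / 2 ** 50)           # 32.0 (i.e., 32 PB per process)
```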
Do you understand the benefits of hierarchical page tables? For a given address format, can you workout the sizes of the page table structures in different layers?
Multi-level page tables don't require a page table entry for every virtual address; only regions of the address space that are actually used need inner page tables. As a result, we have smaller internal page tables/directories and more granular coverage of the address space. The downside is that address translation requires more steps (one lookup per level). Consider a 32-bit virtual address broken into a 12-bit segment, a 10-bit segment, and another 10-bit segment. The first segment indexes the directory, which can point to 2^12 inner page tables. Each inner page table maintains 2^10 entries, and each entry addresses a page of 2^10 bytes. So one inner page table covers 2^10 x 2^10 = 2^20 bytes (1 MB) of the virtual address space.
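The 12/10/10 split arithmetic, worked out explicitly:

```python
P1, P2, OFFSET = 12, 10, 10    # bits: outer index, inner index, page offset

outer_tables  = 2 ** P1        # 2^12 inner page tables reachable from the directory
inner_entries = 2 ** P2        # 2^10 entries per inner page table
page_bytes    = 2 ** OFFSET    # 2^10 bytes addressable per entry

region = inner_entries * page_bytes   # VA region one inner page table covers
print(region == 2 ** 20)              # True: 1 MB per inner page table
```

Only the inner page tables for 1 MB regions the process actually touches need to exist, which is where the space savings over a flat table come from.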