Final Flashcards
What kind of scheduling policies can a scheduler have?
- Assign tasks immediately: simple, first come first serve (FCFS)
- Simple tasks first: maximize throughput, Shortest Job First (SJF)
- Assign Complex Tasks First: max system utilization (CPU, devices, memory)
When does the scheduler run?
- When CPU becomes idle
- new task becomes ready (could have higher priority task in queue)
- timeslice expires (timeout)
Design of run queue and scheduling algorithm are tightly coupled!
Describe priority scheduling
- Threads are scheduled by their priority, highest priority going first.
- This can be implemented with priority queues for each priority level (or you could use an ordered tree)
- Could lead to starvation of low-priority tasks
- This is prevented through priority aging. The longer the task is in the queue, the higher its priority. So it will eventually be run as its priority increases.
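A minimal sketch in C of one way to implement this (hypothetical names, not from the lecture): an array of FIFO queues, one per priority level, plus an aging pass that promotes tasks that have waited too long.

    #include <stddef.h>

    #define NUM_PRIOS 4          /* priority 0 = highest */
    #define AGING_THRESHOLD 10   /* ticks a task may wait before promotion */

    struct task {
        struct task *next;
        int wait_ticks;          /* how long it has sat in its queue */
    };

    struct task *runqueue[NUM_PRIOS];   /* one FIFO list per priority level */

    /* Pick the first task from the highest-priority non-empty queue. */
    struct task *pick_next(void) {
        for (int p = 0; p < NUM_PRIOS; p++) {
            if (runqueue[p]) {
                struct task *t = runqueue[p];
                runqueue[p] = t->next;
                return t;
            }
        }
        return NULL;   /* nothing runnable */
    }

    /* Aging: periodically promote tasks that have waited too long so
       low-priority tasks cannot starve forever. */
    void age_tasks(void) {
        for (int p = 1; p < NUM_PRIOS; p++) {
            struct task **pp = &runqueue[p];
            while (*pp) {
                struct task *t = *pp;
                if (++t->wait_ticks >= AGING_THRESHOLD) {
                    *pp = t->next;              /* unlink from this level */
                    t->wait_ticks = 0;
                    t->next = runqueue[p - 1];  /* push onto the next level up */
                    runqueue[p - 1] = t;
                } else {
                    pp = &t->next;
                }
            }
        }
    }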
What is priority inversion?
When a low-priority thread acquires a lock and a higher-priority thread then needs that lock, the high-priority thread is stuck waiting for the low-priority one, so the low-priority thread can effectively run (and even complete its critical section) ahead of the higher-priority thread; their priorities are inverted.
A solution is to temporarily boost the priority of the thread holding the mutex to the level of the higher-priority waiter so it runs until it releases the mutex, then lower it back to its original priority. This requires knowing which thread currently owns the mutex.
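POSIX exposes exactly this idea for pthread mutexes. A minimal sketch, assuming the platform supports the priority-inheritance protocol (error checking omitted):

    #include <pthread.h>

    pthread_mutex_t lock;

    void init_pi_mutex(void) {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        /* A higher-priority waiter temporarily boosts the current owner's
           priority until the owner unlocks the mutex. */
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }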
What is run to completion scheduling?
Once a task is scheduled, it keeps the CPU until it completes (or voluntarily yields); it is not preempted to run another task.
Describe round robin scheduling
Go through tasks in round-robin order. If a task yields (e.g., for I/O), switch to the next task and round-robin among the remaining tasks until the I/O is complete.
Can also do round robin with interleaving which is known as time slicing.
Describe Timeslicing
timeslice = max amount of uninterrupted time to give a task (aka, time quantum)
task may run for less time than the timeslice (e.g., it has to wait on I/O so it switches out, or a higher-priority task becomes runnable)
The ideal time slice length is different for I/O bound vs CPU bound tasks.
What are the costs and benefits of timeslicing?
+ short tasks finish sooner
+ more responsive
+ lengthy I/O ops initiated sooner
- interrupt, schedule, context switch = overhead for each switch
Can minimize these overheads by keeping the timeslice length much greater than (>>) the context switch time
What are the ideal timeslice lengths for cpu bound and I/O bound workloads?
CPU Bound:
- Longer timeslices lead to better throughput and lower avg. completion times, though higher avg wait time.
- Overall, user cares about throughput and how long it takes to complete. Wait time is hidden b/c there is no I/O.
I/O Bound:
- Using a smaller timeslice is better.
- I/O requests are able to get kicked off earlier
- Keeps both the CPU and I/O devices busier more often
- better user-perceived performance
How do you calculate cpu utilization?
[cpu_running_time/(cpu_running_time + context_switching_overheads)] * 100
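For example (illustrative numbers, not from the lecture): with a 10 ms timeslice and 0.1 ms of context-switch overhead per switch, utilization ≈ 10 / (10 + 0.1) ≈ 99%; with a 1 ms timeslice it drops to 1 / (1 + 0.1) ≈ 91%.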
Describe the multi-level feedback queue (MLFQ).
Have three run queues, but they are treated as a single queue data structure:
1) TS = 8ms, most I/O intensive, highest priority
2) TS = 16ms, medium I/O intensive (mix I/O and CPU)
3) TS = infinity (CPU intensive, lowest priority, FCFS)
+ get timeslicing benefits for I/O bound tasks
+ avoid timeslicing overhead for CPU bound tasks
Ultimately this becomes like a decision making pipeline
- start in 8ms queue, if it yields before timeslice is up, good choice, keep it here!
- if timeslice expires, move it to next queue. Repeat the previous step.
- If timeslice expired again, put it in bottom queue.
Can always be bumped back up if they start yielding before timeslice expires.
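A rough sketch in C of the demote/promote decision (hypothetical names; real implementations differ):

    #define LEVELS 3   /* timeslices: 8 ms, 16 ms, 'infinity' (FCFS) */

    struct task { int level; };   /* 0 = highest priority, shortest timeslice */

    /* Called when a task comes off the CPU. */
    void mlfq_update(struct task *t, int used_full_timeslice) {
        if (used_full_timeslice) {
            if (t->level < LEVELS - 1)
                t->level++;       /* too CPU-hungry for this level: demote */
        } else {
            if (t->level > 0)
                t->level--;       /* yielded early (likely I/O bound): promote */
        }
    }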
Describe the linux O(1) scheduler
- 2 arrays of run queues (active and expired), each with one queue per priority level (140 levels: 0-99 real-time, 100-139 timesharing)
- Timeslice value depends on priority: higher-priority tasks get longer timeslices
- Feedback based on sleep time: tasks that sleep more (interactive) get a priority boost, CPU-hogging tasks get their priority lowered
- O(1) because adding a task and selecting the next task both take constant time (a bitmap of non-empty queues gives the highest-priority runnable task)
- When a task uses up its timeslice it moves to the expired array; once the active array is empty, the two arrays are swapped
More @ P3L1: 18 Linux O(1)
Describe the Completely Fair Scheduler (CFS)
Scheduler for all non-real time tasks
- Runqueue = Red-Black Tree
- Tasks ordered by virtual runtime (vruntime = time spent on the CPU)
- Tasks toward the left of the tree have less vruntime, so they need to be scheduled sooner
- Run the task with the smallest vruntime until its vruntime exceeds the next smallest vruntime (checked periodically)
- Once its vruntime has moved past the next smallest one, insert the current task back into the tree at its new position and run the task that now has the smallest vruntime
- For low priority tasks, vruntime progresses faster (artificially)
- For high priority tasks, vruntime progresses slower (artificially)
- Same tree used for all levels
- Task selection = O(1)
- Add task = O(log n)
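A rough sketch of the vruntime bookkeeping (simplified; the constant and field names are illustrative, not the actual kernel code):

    /* Charge a task for delta_ns of CPU time.  Higher-weight (higher-priority)
       tasks accumulate vruntime more slowly, lower-weight tasks faster, so the
       red-black tree naturally favors higher-priority tasks. */
    #define NICE_0_WEIGHT 1024

    struct task { unsigned long long vruntime; unsigned int weight; };

    void update_vruntime(struct task *t, unsigned long long delta_ns) {
        t->vruntime += delta_ns * NICE_0_WEIGHT / t->weight;
    }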
What are the problems with the O(1) scheduler?
- poor performance for interactive tasks
- the expired list is not executed until all tasks in the active list have used up their timeslices
- fairness is not enforced in any way
Why is it important to try and schedule threads back onto the same CPU?
Cache-Affinity - likely to have everything you need in the cache so you will be more efficient.
If you swap CPUs, then you’ll have a cold cache.
What is NUMA scheduling?
Non-Uniform Memory Access
- A CPU may have faster access to one memory bank than another.
- NUMA-aware scheduling means the scheduler knows this, so it tries to keep tasks on the CPU closer to where that task's memory lies.
What is hyperthreading?
- Multiple hardware supported execution threads
- This means there are multiple sets of registers needed for an execution context
- still 1 CPU but much faster context switches because all it has to do is switch the registers being used by the CPU (not switch what’s in the registers, but which registers are being used)
Can even hide memory access latency (hundreds of cycles), because a context switch to the other hyperthreaded execution context takes on the order of cycles. So it can make sense to switch between the hyperthreaded execution contexts even during a memory access.
Why is it good to mix CPU and Memory bound tasks together on a single CPU?
Because you can get better overall utilization.
If both threads are CPU bound, only one can actually make progress at a time (G = cycle where the thread makes progress, X = cycle where it doesn't):
T1: GXGXGX
T2: XGXGXG
If both threads are Mem bound, lots of wasted cycles:
T1: GXXXGXXXG
T2: XGXXXGXXX
If mixed:
T1: GXGGGXGGGX (CPU Bound)
T2: XGXXXGXXXG (Mem Bound)
How can the CPU scheduler make informed decisions about scheduling?
By use of hardware counters - there are typically several and they are different for different architectures.
They help estimate what kind of resources a task needs (CPU, memory).
E.g., the number of cache hits/misses from a hardware counter helps guesstimate whether a workload is CPU bound or memory bound.
Why can’t CPI be used as a scheduling technique?
Because real workloads may not have enough variation in CPI to be indicative of the workload category. They could have similar averages with high or low standard deviations, which means it's not so cut and dried.
What is a TLB?
Translation Look Aside Buffer
- small cache that holds valid Virtual Address to Physical Address mappings
- Used because the VA to PA translation basically has to be done for every memory access
What does the hardware memory management unit (MMU) do?
- Translates VA to PA addresses
- Reports faults (illegal access, permissions)
- Registers:
- Pointers to page table
- base and limit size, number of segments
- Cache - TLB
- Translation
- actual PA generation done in hardware
Describe how page tables work.
- VA Pages -> Page Table Map -> DRAM
- Virtual pages and physical (DRAM) page frames are the same size
- There is a page table for each process
- So on context switch, must switch to the right page table
- Virtual Address = [VPN | offset]
- VPN is offset into page table map
- Page Table Map = [PFN | valid bit]
- Physical Address = [PFN | offset]
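A minimal sketch of a flat-table lookup for a 32-bit address with 4 KB pages (hypothetical structures, simplified):

    #include <stdint.h>

    #define PAGE_SHIFT 12                     /* 4 KB pages -> 12 offset bits */
    #define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

    struct pte { unsigned int pfn : 20; unsigned int valid : 1; };

    struct pte page_table[1u << 20];          /* one entry per virtual page */

    uint32_t translate(uint32_t va) {
        uint32_t vpn    = va >> PAGE_SHIFT;   /* index into the page table  */
        uint32_t offset = va & OFFSET_MASK;   /* unchanged by translation   */
        if (!page_table[vpn].valid)
            return 0;                         /* real hardware would trap to the OS */
        return ((uint32_t)page_table[vpn].pfn << PAGE_SHIFT) | offset;
    }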
What is allocation on first touch?
DRAM space for a page is only actually allocated the first time that page is referenced. Declaring it in the virtual address space is fine, but it won't be backed by physical memory until you do some kind of access to it.
What flags exist for a page table entry?
- P-resent (valid/invalid)
- D-irty (written to)
- A-ccessed (whether it has been accessed for read or write)
- protection bits -> r, w, etc
How are page table sizes calculated?
- 32 bit system
- Page Table Entry (PTE) = 4 bytes (32 bit)
- Number of pages = 2^32 / Page Size
- A common page size is 4kB
- (2^32 / 2^12) * 4B = 4MB
We don’t use a flat page table anymore, though.
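To see why (illustrative, ignoring that real 64-bit CPUs use fewer virtual address bits): a flat table for a full 64-bit address space with 8-byte PTEs would need (2^64 / 2^12) * 8 B = 2^55 B, i.e. tens of petabytes, per process.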
What are Hierarchical Page Tables?
- Outer Level: Page Table Directory
- elements are pointers to page tables
- Internal page tables exist only for valid virtual memory regions
- Now Virtual address looks like:
[outer page table index | internal page table index | offset]
- Could have additional layers using the same philosophy
- on malloc, new internal page table may be allocated
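A sketch of extracting the indices for the classic two-level 10/10/12 split on a 32-bit address (10-bit outer index, 10-bit internal index, 12-bit offset):

    #include <stdint.h>

    void split_va(uint32_t va, uint32_t *outer, uint32_t *inner, uint32_t *offset) {
        *outer  = (va >> 22) & 0x3FF;   /* index into the page table directory */
        *inner  = (va >> 12) & 0x3FF;   /* index into the internal page table  */
        *offset =  va        & 0xFFF;   /* offset within the 4 KB page         */
    }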
P3L2: 8 Multi Level Page Tables
What are the pros/cons of a Hierarchical Page Table?
+ Reduced page table size
+ 'Gaps' in the virtual address space that span an entire internal page table mean that inner table doesn't need to be allocated at all
- More lookups required for the translation process since there are more levels
Describe inverted page tables
P3L2: 11 Inverted Page Tables
Describe segmentation
P3L2: 12 Segmentation
What are the pros/cons of large pages?
+ Fewer page table entries, smaller page tables, more TLB hits
- internal fragmentation => wastes memory
- page size is calculated as 2^[# offset bits]
What are some challenges with memory allocation?
- external fragmentation
- enough ‘free’ pages exist, but they are not contiguous so they can’t be allocated
- When possible, the allocator will free memory to create more contiguous zones.
What is the buddy allocator?
- Subdivides the area in half over and over until it finds a memory region that is the size of the request
+ Aggregation works well and is fast
- Internal fragmentation can still exist because this only allocates by powers of 2
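For example (illustrative): to satisfy an 8 KB request from a free 64 KB region, the allocator splits 64 KB -> 32 + 32, then 32 -> 16 + 16, then 16 -> 8 + 8, and hands out one 8 KB chunk. On free, a chunk can be coalesced with its 'buddy' (the other half it was split from), whose address differs by exactly one bit, which is why aggregation is fast.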
What is the slab allocator?
Builds custom object caches on top of slabs [contiguously allocated physical memory].
Each slab is built for a specific size object, so freeing objects in slab is fine b/c they can be replaced by objects of the same size easily.
+ Internal fragmentation avoided
+ External fragmentation not really an issue
Linux uses a combination of both Slab and buddy Allocator.
Describe Demand Paging
When the valid bit is set to 0 in the page table (i.e., the page is not in memory), a request to access it becomes a trap and control passes to the OS. The OS makes a call to retrieve that page from the backing store (an I/O request). The page is placed in DRAM, the page table entry is updated with the new PA corresponding to that VA, and the valid bit is set to 1. Then control is passed back to the thread and it executes the same memory request again.
When should pages be swapped out of physical memory onto disk and which pages should be swapped out?
When:
- When memory usage is above a threshold (high watermark)
- When CPU usage is below a threshold (low watermark)
Which:
- pages that won’t be used
- LRU: uses the access bit to tell if it has been accessed
- pages that don’t need to be written out
- dirty bit to track if modified
- won’t swap non-swappable pages
Linux:
* ‘Second Chance’ variation of LRU
What is Copy on Write (COW)?
When a new process is created, instead of copying the whole original address space, we map the new process's VAs to the original pages and write-protect them. If only reads occur, this is fine; we only need one copy of the information. If a write occurs, a trap is generated, the data is copied to a new PA, and the VA mapping is updated. This ensures we only pay the copy cost when data actually has to be modified.
What is checkpointing?
- Periodically save process state during execution
- Write protect entire PA and copy everything once
- Copy diffs of ‘dirtied’ pages for incremental checkpoints
- rebuild images of process from multiple diffs, or in background aggregate diffs into a single one
More @ P3L2: 22 Failure Management Checkpoint
What kinds of message passing IPCs are there?
Pipes - bytestream from one process to another
Message Queues - just what it sounds like
Sockets - send(), recv(), socket()
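A minimal pipe sketch in C between a parent and child (error checking omitted):

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        char buf[32];

        pipe(fds);                 /* fds[0] = read end, fds[1] = write end */
        if (fork() == 0) {         /* child: writes into the pipe */
            close(fds[0]);
            write(fds[1], "hello", 6);
            close(fds[1]);
            _exit(0);
        }
        close(fds[1]);             /* parent: reads the bytestream */
        read(fds[0], buf, sizeof(buf));
        printf("parent got: %s\n", buf);
        close(fds[0]);
        return 0;
    }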
Describe Shared Memory IPC
The OS maps the same physical addresses into the virtual address spaces of two processes; the VAs need not be the same in each process.
+ System calls only for the setup phase
+ potentially reduced data copies (but not eliminated)
- must use explicit synchronization
- must have some kind of communication protocol
- shared buffer management is the programmer's responsibility
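A minimal POSIX shared memory sketch in C (one side shown; the name "/demo_shm" is arbitrary, error checking and synchronization omitted, may need -lrt on older Linux):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        const size_t size = 4096;

        /* Both processes open the same named object; syscalls only for setup. */
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, size);

        /* Map it; each process may get a different virtual address. */
        char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        strcpy(buf, "hello from one process");

        munmap(buf, size);
        close(fd);
        /* shm_unlink("/demo_shm") once both sides are done */
        return 0;
    }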
Describe Copy vs Map in message based IPC vs Shared Memory IPC.
Copy
* CPU cycles to copy data to / from port for each message
Map
* CPU cycles to map memory into address space
* CPU to copy data to channel
+ memory map is costly, but only needs to be done once. If used many times -> good payoff
+ Can still perform well for one time use, especially with large data transfers
In Windows, they support Local Procedure Calls (LPCs) which means if memory to be copied is small, it would use a messages based IPC and if it’s large, it would use shared memory (copy once, then map that to shared space for both processes).
What are spin locks?
Like a mutex, except that instead of relinquishing the CPU when it finds the lock held, the thread just sits and checks over and over until it either acquires the lock or is preempted.
The most basic synchronization construct.
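A minimal spinlock sketch using C11 atomics:

    #include <stdatomic.h>

    typedef struct { atomic_flag locked; } spinlock_t;

    spinlock_t lock = { ATOMIC_FLAG_INIT };

    void spin_lock(spinlock_t *l) {
        /* Keep testing-and-setting until we observe the flag was clear. */
        while (atomic_flag_test_and_set(&l->locked))
            ;   /* spin: burn CPU instead of blocking */
    }

    void spin_unlock(spinlock_t *l) {
        atomic_flag_clear(&l->locked);
    }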