Final Flashcards
How does scheduling work? What are the basic steps and data structures involved in scheduling a thread on the CPU?
The scheduler dispatches a task from the run queue based on the scheduling policy/algorithm (e.g., FIFO, SJF, LJF, etc.).
The data structure is tightly coupled to the scheduling policy. For FIFO only a basic queue is necessary, but other policies may require a tree or a multi-level structure.
The scheduler runs when:
- the CPU becomes idle
- a timeslice expires
- a new task becomes available
When dispatching a thread:
- context switch to selected thread
- enter user mode
- set program counter to next instruction of scheduled task
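A minimal sketch (not from the lecture) of the simplest case, a FIFO run queue, where dispatching just means dequeuing the head; all names here are illustrative:

```c
#include <stddef.h>

/* Illustrative task and FIFO run queue; a real scheduler also saves/restores
 * registers, stack pointer, and program counter in the task context. */
typedef struct task {
    struct task *next;
    void *context;                 /* saved CPU state used for the context switch */
} task_t;

typedef struct {
    task_t *head, *tail;
} run_queue_t;

/* FIFO policy: the next task to dispatch is simply the oldest runnable one. */
task_t *pick_next(run_queue_t *rq) {
    task_t *t = rq->head;
    if (t) {
        rq->head = t->next;
        if (rq->head == NULL)
            rq->tail = NULL;
    }
    return t;   /* the dispatcher then context-switches to t and enters user mode */
}
```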
P3L1
What are the overheads associated with scheduling? Do you understand the tradeoffs associated with the frequency of preemption and scheduling/what types of workloads benefit from frequent vs. infrequent intervention of the scheduler (short vs. long time slices)?
The main overhead is context switching.
Tradeoffs of more frequent context switching:
+ reduces wait time
+ I/O operations are issued earlier
- drops throughput
- increases average completion time
For CPU-bound tasks, longer timeslices are better because there are fewer context switches (less overhead), which keeps CPU utilization and throughput high.
For I/O-bound tasks, shorter timeslices are better: I/O operations are issued sooner and user-perceived performance improves.
P3L1
Can you work through a scenario describing some workload mix (few threads, their compute and I/O phases) and for a given scheduling discipline compute various metrics like average time to completion, system throughput, wait time of the tasks…
Throughput
jobs completed / time to complete all jobs
Average Completion Time
sum of times to complete each job / jobs completed
Average Wait Time
sum of time each job spent waiting / jobs completed
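A quick worked example (hypothetical workload, all tasks arrive at t=0 and run to completion in FIFO order): T1 = 1s, T2 = 10s, T3 = 1s.
- Throughput = 3 jobs / 12s = 0.25 jobs/s
- Average completion time = (1 + 11 + 12) / 3 = 8s
- Average wait time = (0 + 1 + 11) / 3 = 4s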
P3L1
Do you understand the motivation behind the multi-level feedback queue, why different queues have different time slices, how do threads move between these queues… Can you contrast this with the O(1) scheduler? Do you understand what were the problems with the O(1) scheduler which led to the CFS?
The multi-level feedback queue lets us find an appropriate timeslice for a thread based on how it behaves. If a task uses its entire timeslice, we move it to a queue with a longer timeslice; if it yields before the timeslice expires, we keep it at that level (and a task that starts yielding frequently can be pushed back up to a queue with a shorter timeslice). We don't have to compute anything from historic behavior; we just run tasks and let their behavior dictate where they end up.
The O(1) scheduler uses priorities to determine the order in which tasks run, with timeslice lengths that depend on priority. However, a task that uses up its timeslice moves to the expired array and cannot run again until all the other active tasks have used up their timeslices. This caused unacceptable jitter as applications became more interactive.
The Completely Fair Scheduler keeps runnable tasks in a self-balancing (red-black) tree ordered by vruntime (time spent on the CPU). vruntime progresses faster for lower-priority tasks and slower for higher-priority tasks, and the scheduler always picks the task with the lowest vruntime.
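A rough sketch of the vruntime idea (illustrative only, not the actual kernel code; the weight scaling and constants are assumptions):

```c
/* Each task accumulates virtual runtime; higher-priority (higher-weight)
 * tasks accumulate it more slowly, so they get picked more often. */
struct cfs_task {
    unsigned long vruntime;   /* virtual time spent on the CPU */
    unsigned long weight;     /* derived from the task's priority/nice value */
};

#define BASE_WEIGHT 1024      /* illustrative reference weight */

void account_runtime(struct cfs_task *t, unsigned long ran_ns) {
    t->vruntime += ran_ns * BASE_WEIGHT / t->weight;
}
/* The scheduler keeps tasks in a red-black tree ordered by vruntime and
 * always picks the task with the smallest vruntime (the leftmost node). */
```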
P3L1
Thinking about Fedorova’s paper on scheduling for chip multi processors, what’s the goal of the scheduler she’s arguing for? What are some performance counters that can be useful in identifying the workload properties (compute vs. memory bound) and the ability of the scheduler to maximize the system throughput.
Fedorova argues for a scheduler that maximizes CPU utilization in hardware-multithreaded environments.
Generally, we use instructions per cycle (IPC) to determine utilization. Tasks that are CPU-bound typically have IPCs that are close to 1, while tasks that are memory-bound typically have IPCs that are close to 0.
Fedorova suggests that hardware incorporate a new metric, cycles per instruction (CPI), which is the reciprocal of IPC.
CPU-bound tasks would still have a CPI of close to 1, while memory-bound tasks would have a CPI much higher than 1.
The IPC performance counter is helpful in determining if a task is compute-bound or memory-bound. In addition, performance counters that track cache misses can be helpful. With high cache misses, a scheduler can assume a task is making lots of memory access and therefore is memory-bound. With few cache misses, a scheduler can assume a process is more compute-bound.
P3L1
How does the OS map the memory allocated to a process to the underlying physical memory?
The OS maps the memory allocated to a process to the underlying physical memory via a page table (or segments). The page table lets us look up the corresponding page frame number in physical memory; the offset then indexes into that frame to find the exact location.
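A minimal sketch of the lookup, assuming a flat page table and 4 KB pages (the table here is just an array of page frame numbers):

```c
#include <stdint.h>

#define PAGE_SHIFT 12                              /* 4 KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

uintptr_t translate(uintptr_t vaddr, const uintptr_t *page_table) {
    uintptr_t vpn    = vaddr >> PAGE_SHIFT;        /* virtual page number */
    uintptr_t offset = vaddr & (PAGE_SIZE - 1);    /* offset within the page */
    uintptr_t pfn    = page_table[vpn];            /* page frame number */
    return (pfn << PAGE_SHIFT) | offset;           /* physical address */
}
```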
P3L2
How do we deal with the fact that processes address more memory than is physically available?
We deal with the fact that processes address more memory than physically available by swapping data out of physical memory (e.g., DRAM) to secondary storage (e.g., on disk). We determine which data to swap using an algorithm (e.g., Least Recently Used, which uses the access bit).
P3L2
How does address translation work? What’s the role of the TLB?
In order to translate a virtual address to a physical address we have to look it up in a page table. Page tables can be flat or hierarchical. Hierarchical page tables save on space by not requiring an entry for every single virtual address (used or unused) but they require more lookups (outer page table to inner page tables).
A virtual address has two parts if the page table is flat (virtual page number + offset), or more parts if it is hierarchical (one index per level + offset). The last part is always the offset, which indexes into the physical page frame. The first part(s) index into the page table(s) to find the page frame number, which tells us where that frame starts in physical memory.
The role of the Translation Lookaside Buffer is to cache translations between virtual addresses and physical addresses so we can save on lookups. This reduces the latency that’s created by hierarchical page tables, which save on space but require more lookups.
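As a sketch of the hierarchical case, assuming a 32-bit virtual address split into a 10-bit outer index, a 10-bit inner index, and a 12-bit offset (a common textbook split, not a specific hardware format):

```c
#include <stdint.h>

uintptr_t translate_two_level(uint32_t vaddr, uintptr_t **outer_table) {
    uint32_t p1     = (vaddr >> 22) & 0x3FF;   /* index into the outer page table */
    uint32_t p2     = (vaddr >> 12) & 0x3FF;   /* index into the inner page table */
    uint32_t offset =  vaddr        & 0xFFF;   /* offset within the 4 KB page */
    uintptr_t *inner = outer_table[p1];        /* inner tables exist only for used regions */
    uintptr_t pfn    = inner[p2];              /* two memory accesses unless the TLB hits */
    return (pfn << 12) | offset;
}
```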
P3L2
Do you understand the relationships between the size of an address, the size of the address space, the size of a page, the size of the page table?
TODO (P3L2)
Do you understand the benefits of hierarchical page tables? For a given address format, can you workout the sizes of the page table structures in different layers?
TODO (P3L2)
For processes to share memory, what does the OS need to do? Do they use the same virtual addresses to access the same memory?
The OS needs to create the shared memory segment and map it into the virtual address space of both processes.
The virtual address of the shared memory segment will NOT be the same for each process.
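A sketch using POSIX shared memory (the segment name and size are made up); each process that calls this maps the same physical segment but usually gets a different virtual address (error handling omitted):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *attach_segment(void) {
    int fd = shm_open("/demo_segment", O_CREAT | O_RDWR, 0600);  /* create/open segment */
    ftruncate(fd, 4096);                                         /* size it */
    void *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);                        /* map into this process */
    close(fd);                                                   /* mapping stays valid */
    return addr;   /* usually a different virtual address in each process */
}
```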
P3L3
For processes to communicate using a shared memory-based communication channel, do they still have to copy data from one location to another?
No, processes using shared memory-based communication don’t have to copy data from one location to another as long as it’s allocated from a virtual address that belongs to the shared memory region.
P3L3
What are different ways you can implement synchronization between different processes (think what kinds of options you had in Project 3).
- mutex & condition variable
- semaphore
- message queue
- socket
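As one hedged example of the options above: a POSIX unnamed semaphore placed inside the shared memory region (pshared = 1 makes it usable across processes); the names are illustrative:

```c
#include <semaphore.h>

/* sem must itself live inside the shared memory region */
void init_lock(sem_t *sem) {
    sem_init(sem, /*pshared=*/1, /*value=*/1);   /* binary semaphore used as a lock */
}

/* In each process:
 *   sem_wait(sem);   ...critical section on shared data...   sem_post(sem);
 */
```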
P3L3
To implement a synchronization mechanism, at the lowest level you need to rely on a hardware atomic instruction. Why? What are some examples?
When we have multiple CPUs and/or multiple threads accessing a lock, it’s possible for check&set on the value of a lock to be interleaved (meaning more than one processor could see the lock is free and acquire it) unless the check/set is an atomic operation. When an operation is atomic then only one CPU at a time can perform it.
Examples:
- test_and_set
- read_and_increment
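A minimal sketch of a spinlock built on an atomic test-and-set, here via a GCC/Clang builtin (the function names are illustrative):

```c
void spin_lock(volatile int *lock) {
    /* __sync_lock_test_and_set atomically writes 1 and returns the old value;
     * if the old value was already 1, someone else holds the lock, so spin. */
    while (__sync_lock_test_and_set(lock, 1))
        ;
}

void spin_unlock(volatile int *lock) {
    __sync_lock_release(lock);   /* atomically clears the lock back to 0 */
}
```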
P3L4
Why are spinlocks useful? Would you use a spinlock in every place where you’re currently using a mutex?
A spinlock is useful when the time a thread would spend spinning is less than the time it takes to context switch.
Spinlocks create more contention than a regular mutex so use them only when critical sections are short.
P3L4
Do you understand why is it useful to have more powerful synchronization constructs, like reader-writer locks or monitors? What about them makes them more powerful than using spinlocks, or mutexes and condition variables?
More powerful synchronization constructs like reader-writer locks or monitors are useful because they handle more for the programmer (e.g., lock/unlock, signaling), which reduces common errors.
P3L4
Can you work through the evolution of the spinlock implementations described in the Anderson paper, from basic test-and-set to the queuing lock? Do you understand what issue with an earlier implementation is addressed with a subsequent spinlock implementation?
- test_and_set spinlock
The atomic operation is in the while loop, which causes contention and is disastrous on a cache-coherent architecture with write-invalidation.
- test_and_test_and_set ("spin on read") spinlock
The atomic operation is still in the while loop, but it is only attempted when the cached value says the lock is free, which reduces contention. Still bad on a cache-coherent architecture with write-invalidation, since every release invalidates all cached copies and triggers a burst of memory accesses.
+ improved latency
+ improved delay
-/+ contention: NCC = same, CC + write-update = O(n), CC + write-invalidate = O(n^2)
- test_and_test_and_set spinlock + delay
Same as above, but a delay (after the lock is released, or after each reference) spreads out the attempts so all waiters don't try to acquire the lock at the same time, which helps on a cache-coherent architecture with write-invalidation.
+ improved latency
+ improved contention
- worsened delay
- queueing (Anderson) spinlock
The atomic operation (read_and_increment) runs once, when the thread requests the lock; subsequent spinning happens on a per-thread variable, which reduces contention and avoids the write-invalidation pile-up.
+ improved delay
+ improved contention
- worsened latency
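A sketch of the test-and-test-and-set variant with a delay, roughly following the "delay after the lock is released" idea (the delay length and helper are placeholders, not from the paper):

```c
#include <unistd.h>

static void delay(void) { usleep(100); }      /* placeholder static delay */

void spin_lock_ttas_delay(volatile int *lock) {
    while (1) {
        while (*lock)                          /* spin on the cached value: no atomics, */
            ;                                  /* no coherence traffic while it's busy  */
        delay();                               /* back off so waiters don't all pile on */
        if (__sync_lock_test_and_set(lock, 1) == 0)   /* one atomic attempt */
            return;
    }
}
```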
P3L4
What are the steps in sending a command to a device (say packet, or file block)? What are the steps in receiving something from a device?
Sending:
- Send data (user process)
- Form packet (kernel)
- Write Tx request record (driver)
- Perform transmission (device)
Receiving:
Receiving something back works in the reverse order (device retrieves and passes to driver, driver passes to kernel, kernel passes to user process)
For block storage devices, do you understand the basic virtual file system stack, the purpose of the different entities?
file: data blocks on disk (not necessarily contiguous)
inode: per-file index that tracks the file's blocks; the inode itself resides on disk in some block (contiguous)
superblock: overall map of the disk's blocks (inode blocks, data blocks, free blocks)
P3L5
For the virtual file system stack, we mention several optimizations that can reduce the overheads associated with accessing the physical device. Do you understand how each of these optimizations changes how or how much we need to access the device?
caching/buffering - reduces # of disk accesses
I/O scheduling - reduces disk head movement
prefetching - reduces # of disk accesses
journaling / logging - reduces random access
P3L5
What is virtualization? What’s the history behind it?
Virtualization allows concurrent execution of multiple operating systems (and their applications) on the same physical machine.
P3L6
What were the problems with virtualizing x86? How does protection of x86 used to work and how does it work now? How were/are the virtualization problems on x86 fixed?
x86 pre 2005
- 4 rings, no root/non-root modes yet
- hypervisor in ring 0, guest OS in ring 1
BUT: 17 privileged instructions do not trap; they fail silently!
e.g., the interrupt enable/disable bit lives in a privileged register; POPF/PUSHF instructions that access it from ring 1 fail silently
hypervisor doesn’t know, so it doesn’t try to change settings
OS doesn’t know, so it assumes change was successful
This was fixed in two different ways:
- binary translation (dynamically rewrite the guest binary so it avoids the instructions that fail silently)
- paravirtualization (modify the guest OS so it makes explicit calls, hypercalls, to the hypervisor)
Since 2005, hardware support (AMD-V / Intel VT) adds root/non-root modes, so privileged instructions executed by the guest trap to the hypervisor.
P3L6
How does device virtualization work? What's a passthrough vs. a split-device model?
The pass-through model allows a guest VM to directly access the hardware.
+ fast
- reduces portability as it requires the exact hardware devices that the OS expects
In the split-device model, device access is split between a front-end driver in the guest VM (device API) and a back-end driver in the service VM (or host). It requires modified guest drivers, so it's limited to paravirtualized guests.
+ eliminates device emulation overhead
+ guest VMs are device-independent
+ allows better management of shared devices
P3L6
What’s the motivation for RPC? What are the various design points that have to be sorted out in implementing an RPC runtime (e.g., binding process, failure semantics, interface specification… )? What are some of the options and associated tradeoffs?
TODO (P4L1)
What’s specifically done in Sun RPC for these design points – you should easily understand this from your project?
TODO (P4L1)
What’s marshalling/unmarshalling? How does an RPC runtime serialize and deserialize complex variable size data structures? What’s specifically done in Sun RPC/XDR?
TODO (P4L1)
What are some of the design options in implementing a distributed file service?
- Upload/download model
Client downloads the file, makes changes, uploads it back to the server.
+ simple
+ local reads/writes at the client
- entire file is downloaded/uploaded even for small accesses
- server gives up control
- True remote file access model
Every access to the remote file goes through the server.
+ file access is centralized so maintaining consistency is simple
- every file operation pays a network latency cost
- limits scalability b/c everything must go through the server
- A compromise
Clients can cache blocks/portions of files and perform some operations locally, but they interact with the server often enough that it keeps insight into accesses and can maintain consistency.
P4L2
The Sprite caching paper motivates its design based on empirical data about how users access and share files. Do you understand how the empirical data translated in specific design decisions? Do you understand what type of data structures were needed at the servers’ and at the clients’ side to support the operation of the Sprite system (i.e., what kind of information did they need to keep track of, what kinds of fields did they need to include for their per-file/per-client/per-server data structures).
TODO (P4L2)
When sharing state, what are the tradeoffs associated with the sharing granularity?
Cache Line Granularity : Too fine. Coherence traffic overwhelms any benefit
Variable Granularity: More sense for programmer, still high overhead
Page Granularity: Makes more sense to the OS. Must be aware of false sharing: two processes use different, unrelated portions of the same page, but the whole page is treated as shared.
Object Granularity: Requires language runtime support. OS doesn’t care. Not as generalizable
P4L3
For distributed state management systems (think distributed shared memory) what are the basic mechanisms needed to maintain consistency – e.g., do you know why it is useful to use ‘home nodes’, why do we differentiate between a global index structure to find the home nodes and local index structures used by the home nodes to track information about the portion of the state they are responsible for.
A “home node” keeps state about a page and is responsible for cache coherence for that page, unless another node is doing most of the work on the page. In that case that node can become the “owner” and take over responsibility for the state and drive coherence (which saves constantly going back to the home node). The owner can change over time; the home node never changes, and it keeps track of who the current owner is.
The global index structure keeps track of which manager is responsible for each page. The global index structure is replicated across all nodes. The local index structure can be partitioned such that it is only available on its node rather than all nodes.
P4L3
Do you have some ideas how would you go about implementing a distributed shared memory system?
- each node contributes part of MM pages to DSM
- need local caches for performance (latency)
- all nodes responsible for part of distributed memory
- home node manages access and tracks page ownership
- explicit replication possible for load balancing, performance, or reliability
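A rough sketch of the metadata such a system might keep (all names and fields are illustrative, not from a specific implementation):

```c
#define MAX_NODES 16

/* Replicated on every node: which node manages ("is home for") each page. */
struct global_index_entry {
    int page_id;
    int manager_node;
};

/* Kept only on the manager/home node, for the pages it is responsible for. */
struct local_index_entry {
    int page_id;
    int owner_node;                          /* node currently writing the page */
    int copyset[MAX_NODES];                  /* nodes holding read copies to invalidate */
    enum { PG_INVALID, PG_SHARED, PG_EXCLUSIVE } state;
};
```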
P4L3
What’s a consistency model? What are the different guarantees that change in the different models we mentioned – strict, sequential, causal, weak… Can you work through a hypothetical execution example and determine whether the behavior is consistent with respect to a particular consistency model?
A consistency model is a guarantee about how state changes will behave (access ordering, visibility / propagation)
Strict
- Updates visible everywhere immediately. In distributed systems, latency and message reorder/loss make this impossible to guarantee.
Sequential
- Updates from different processes may be arbitrarily interleaved
- All processes see the same interleaving (though may not reflect issue order)
- Operations from the same process will always appear in order they were issued
Causal
- Causally related writes maintain order
- Writes from a single processor maintain order
- No order guarantees for “concurrent” (not causally related) writes
Weak
- All updates prior to a sync will be visible
- No guarantee what happens in between
When managing large-scale distributed systems and services, what are the pros and cons with adopting a homogeneous vs. a heterogeneous design?
Homogeneous
+ simple front-end (doesn’t have to keep track of node specialization)
+ easy to scale. just add more nodes
- unable to take advantage of locality & caching
Heterogeneous
- complex front-end (has to keep track of node specialization)
- complex to scale: which nodes to add? can get hot spots & bottlenecks
+ able to take advantage of locality & caching
P4L4
Do you understand the history and motivation behind cloud computing, and basic models of cloud offerings?
Infrastructure as a Service
- The provider offers virtualized infrastructure (compute instances, storage, network, e.g., Amazon EC2); you manage the OS and everything above it
Platform as a Service
- The provider offers an environment/APIs for developing and deploying applications (e.g., Google App Engine); you manage only your application
Software as a Service
- The provider runs the complete application (e.g., Gmail); you just use it - the OS, virtualization, etc. are handled by the provider
P4L4
Do you understand what about the cloud scales make it practical?
Do you understand what about the cloud scales make failures unavoidable?
Practical Scalability:
• economies of scale - you buy a single set of hardware, then you can virtualize the servers and the network.
• You can host multiple tenants in a single location, on a single machine even.
• You can easily provision new systems as needed, or bring them down
Failures:
Cloud services integrate many components, each with their own percentage of failure. These failures compound and create a system that is more likely to fail the more it scales. This means developers have to expect failures and write robust code that handles these failures.
P4L4
What’s demand paging?
When a process accesses a virtual memory address that is not present in physical memory (presence bit indicates this) the hardware MMU generates a fault, which traps into the kernel. The OS then determines where the page is (e.g., on disk), issues an I/O operation to bring it into physical memory. The OS will then determine a free frame where the page can be placed and use the page frame number to update the page table entry that corresponds to the virtual address of the page. The OS then resets the process’s program counter so the access request is made again, but this time succeeds.
P3L2
When does page replacement happen? How is it determined when page replacement should happen?
Page replacement happens when memory usage is above a certain threshold and CPU usage is below a certain threshold.
A page replacement algorithm determines which pages should be swapped out (e.g., Least Recently Used uses history-based prediction to swap out the page that was accessed the longest time ago as indicated by the access bit)
P3L2
What happens when a process tries to access a page not present in physical memory? What happens when a process tries to access a page that hasn’t been allocated to it?
When a process tries to access a page not present in physical memory the MMU throws an exception, which traps into the kernel. The OS then determines the exception is due to a page fault, and ascertains the reason for the page fault (e.g., page not present in physical memory vs. inadequate permission). If the fault is due to the page not being present in physical memory, it determines where the page is (e.g., on disk), issues an I/O request, finds a page frame for the data, updates the page table entry, resets the process’s program counter, and gives control back to the process so it can proceed.
When a process tries to access a page that hasn’t been allocated to it, the MMU again throws an exception, and the OS determines it’s a page fault due to a permission error. In that case the OS signals the process (e.g., SIGSEGV on Linux), which typically terminates it.
P3L2
What happens when a process tries to modify a page that’s write protected/how does COW work?
When a new process is spawned, its entire address space would need to be copied, but much of it is static, so to save space and time the OS initially points the child’s virtual addresses to the same physical memory as the parent and write-protects those pages. If/when either process attempts to write, only then is the affected page copied to its own memory location. This saves space and time if the processes only perform reads.
P3L2
What are the costs associated with copying vs. (re-/m)mapping?
Copying requires crossing the user/kernel boundary. Costs CPU cycles.
Mapping requires mapping the virtual address to the physical address of the shared memory region. Costs CPU cycles.
Mapping happens once and can be reused many times so it can have a good payoff. Also, mapping can have a good payoff even if only used once if a large quantity of data is involved.
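A small sketch of the "map once, reuse many times" point: a shared anonymous mapping is set up once before fork(), and afterwards parent and child exchange data through it without per-message copying (error handling omitted):

```c
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* one-time setup cost: create the shared mapping */
    char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (fork() == 0) {                     /* child writes directly into the region */
        buf[0] = 'x';
        _exit(0);
    }
    wait(NULL);
    printf("parent sees: %c\n", buf[0]);   /* no copy across the user/kernel boundary */
    return 0;
}
```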
P3L3
What are the tradeoffs between message-based vs. shared-memory-based communication?
In message-based IPC the OS sets up the channel, determines the protocol, and handles synchronization. It’s simple, but it’s expensive b/c every message requires a system call (crossing the user/kernel boundary).
In shared memory-based IPC the OS sets up the channel, and everything else is up to the programmer. It’s more complex for the programmer, but it can be cheaper b/c the OS is out of the way.
P3L3
What are the basic differences in using programmed I/O vs. DMA support?
PIO
- doesn’t require special hardware support
- CPU interacts with command and data registers
- requires many CPU store instructions
- better for small transfers
DMA
- does require special hardware support (DMA controller)
- CPU interacts with command registers and DMA controller
- requires one CPU store instruction
- better for large transfers
P3L5
Do you understand the relationship between the various data structures (block sizes, addressing scheme, etc.) and the total size of the files or the file system that can be supported on a system?
An inode holds the index of all blocks for a file on disk.
It contains:
- metadata
- pointers to blocks (12 direct, 1 single indirect, 1 double indirect, 1 triple indirect)
A single indirect pointer points to a block of pointers that point to data blocks.
A double indirect pointer points to a block of pointers that point to blocks of pointers that point to data blocks.
And so on for the triple indirect…
This allows the inode data structure to remain small while supporting large file sizes.
To determine the max file size supported, first find the number of pointers per block:
ptrs per block = block size / size of a block pointer
then:
max file size = (12 direct ptrs + ptrs per block + (ptrs per block)^2 + (ptrs per block)^3) x block size
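A worked example (assuming 1 KB blocks and 4-byte block pointers, so 256 pointers per block): max file size = (12 + 256 + 256^2 + 256^3) x 1 KB, which is roughly 16 GB.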
P3L5
What’s hosted vs. bare-metal virtualization?
- Bare-metal or Hypervisor-based (type 1)
VMM (hypervisor) manages all hardware resources and supports execution of VMs
a privileged service VM deals with devices (and other configuration and management tasks)
- Hosted (type 2)
host OS owns all hardware; a special VMM module provides hardware interfaces to VMs and deals with VM context switching
P3L6
What’s paravirtualization? Why is it useful?
Paravirtualization is when the guest OS is modified so that it knows it’s virtualized (i.e., it makes explicit calls, hypercalls, to the hypervisor). It’s useful because the hypervisor no longer has to catch or translate problematic instructions; the guest explicitly asks for the services it needs, which reduces overhead.
P3L6
What are the tradeoffs associated with a stateless vs. stateful design?
A stateless design means the server doesn’t keep any info about what’s happening.
+ resilient (on failure just restart)
+ no resources used to maintain state
- unable to support caching (need state for that)
- every request must be self-contained (more bits transferred to describe request)
A stateful design means the server keeps information about what’s cached / accessed
+ supports caching, locks, and incremental operations
- needs checkpointing to recover from failure
- needs resources to maintain state
P4L2
What are the tradeoffs (benefits and costs) associated with using techniques such as caching, replication, partitioning, in the implementation of a distributed service (think distributed file service).
Caching
+ Clients can locally perform operations on cached state (e.g., open/read/write)
- a coherence mechanism is required to keep the cached portions of files consistent with the server’s representation
Replication
+ load balancing
+ availability
+ fault tolerance
- writes become more complex (synchronously write to all replicas, or write to one and then propagate to the others)
- replicas must be reconciled (e.g., voting)
- scalability (machines only get so large)
Partitioning
+ availability (vs. a single-server design)
+ scalability with file system size
+ single-file writes are simple
- on failure, a portion of the data is lost
- load balancing is harder; if not balanced, hot spots are possible
P4L2
Do you understand some of the enabling technologies that make cloud offerings broadly useful?
- Virtualization
- Resource provisioning (scheduling)
- Big Data processing & storage
- Software defined slices of resources (networking, storage, data centers)
- Monitoring => real time log processing
P4L4