Final Flashcards

1
Q

How does scheduling work? What are the basic steps and data structures involved in scheduling a thread on the CPU?

A

The scheduler dispatches a task from the run queue based on the scheduling policy/algorithm (e.g., FIFO, SJF, LJF, etc.).

The run queue data structure depends on (is tightly coupled with) the scheduling policy. For FIFO, a basic queue is enough, but other policies may require a tree or a multi-level structure.

The scheduler runs when:

  • the CPU becomes idle
  • a timeslice expires
  • a new task becomes available

When dispatching a thread:

  • context switch to selected thread
  • enter user mode
  • set program counter to next instruction of scheduled task
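
A minimal sketch of the FIFO case in C (the task_t/run_queue_t types are my own illustration, not course code):

```c
/* Minimal FIFO run queue sketch; a real scheduler tracks far more
 * state (priorities, timeslices, CPU affinity, ...). */
#include <stddef.h>

typedef struct task {
    int          id;
    struct task *next;
} task_t;

typedef struct {
    task_t *head;   /* next task to dispatch */
    task_t *tail;   /* most recently enqueued task */
} run_queue_t;

/* Called when a task becomes runnable. */
void rq_enqueue(run_queue_t *rq, task_t *t) {
    t->next = NULL;
    if (rq->tail) rq->tail->next = t;
    else          rq->head = t;
    rq->tail = t;
}

/* Called when the CPU becomes idle or a timeslice expires:
 * FIFO just pops the head; the caller then context switches to it. */
task_t *rq_dispatch(run_queue_t *rq) {
    task_t *t = rq->head;
    if (t) {
        rq->head = t->next;
        if (rq->head == NULL) rq->tail = NULL;
    }
    return t;
}
```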

P3L1

2
Q

What are the overheads associated with scheduling? Do you understand the tradeoffs associated with the frequency of preemption and scheduling/what types of workloads benefit from frequent vs. infrequent intervention of the scheduler (short vs. long time slices)?

A

The main overhead is context switching.

Tradeoffs of more frequent preemption (shorter timeslices):
+ reduces wait time
+ I/O operations are issued earlier
- drops throughput
- increases average completion time

For CPU-bound tasks, longer timeslices are better because there are fewer context switches (less overhead), which keeps CPU utilization and throughput high.

For I/O-bound tasks, shorter timeslices are better: I/O operations are issued sooner, and user-perceived performance (responsiveness) improves.
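
A rough back-of-the-envelope illustration (hypothetical numbers): if a context switch costs about 1 ms, then with a 10 ms timeslice only 10 / (10 + 1) ≈ 91% of CPU time is useful work, whereas with a 100 ms timeslice it is 100 / (100 + 1) ≈ 99%. The longer timeslice wins on utilization and throughput, but an I/O-bound task may wait up to 100 ms before it gets to issue its next request.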

P3L1

3
Q

Can you work through a scenario describing some workload mix (few threads, their compute and I/O phases) and for a given scheduling discipline compute various metrics like average time to completion, system throughput, wait time of the tasks…

A

Throughput
jobs completed / time to complete all jobs

Average Completion Time
sum of times to complete each job / jobs completed

Average Wait Time
sum of time each job spent waiting / jobs completed
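
A worked example with made-up numbers: three CPU-bound jobs of 1 s, 10 s, and 1 s arrive at t = 0 and run FIFO in that order, finishing at t = 1, 11, and 12.

throughput = 3 jobs / 12 s = 0.25 jobs/s
average completion time = (1 + 11 + 12) / 3 = 8 s
average wait time = (0 + 1 + 11) / 3 = 4 s

Running the 10 s job last (SJF order) would finish the jobs at t = 1, 2, and 12, improving average completion time to 5 s and average wait time to 1 s, while throughput stays the same.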

P3L1

4
Q

Do you understand the motivation behind the multi-level feedback queue, why different queues have different time slices, how do threads move between these queues… Can you contrast this with the O(1) scheduler? Do you understand what were the problems with the O(1) scheduler which led to the CFS?

A

The multi-level feedback queue lets us find an appropriate timeslice for a thread based on how it behaves. If a task uses its entire timeslice, we move it to a queue with a longer timeslice; if it yields early, we keep it in (or move it to) a queue with a shorter timeslice. This means we don't have to track or compute historic behavior; we just run tasks and let their observed behavior dictate where they belong.

The O(1) scheduler uses priorities to determine the order in which tasks run, with different timeslice lengths depending on priority. However, once a task used up its timeslice it was moved to the expired list and did not run again until every active task had consumed its timeslice. This led to unacceptable jitter as software became more interactive.

The Completely Fair Scheduler uses a self-balancing (red-black) tree ordered by vruntime (virtual time spent on the CPU). vruntime progresses faster for lower-priority tasks and slower for higher-priority tasks, and the scheduler always picks the task with the lowest vruntime.
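
A sketch of the CFS accounting idea only (not the kernel's actual code; the weight values are made up):

```c
/* vruntime accounting: actual CPU time is scaled by the task's
 * weight, and the task with the smallest vruntime (the leftmost
 * node of the red-black tree) is picked to run next. */
typedef struct {
    unsigned long long vruntime;  /* virtual runtime, in ns */
    unsigned int       weight;    /* higher weight = higher priority */
} sched_entity_t;

#define DEFAULT_WEIGHT 1024u      /* hypothetical default-priority weight */

/* Charge delta_ns of real CPU time to the task. Heavier (higher
 * priority) tasks accumulate vruntime more slowly, so they are
 * selected again sooner; lighter tasks accumulate it faster. */
void account_runtime(sched_entity_t *se, unsigned long long delta_ns) {
    se->vruntime += delta_ns * DEFAULT_WEIGHT / se->weight;
}
```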

P3L1

5
Q

Thinking about Fedorova’s paper on scheduling for chip multi processors, what’s the goal of the scheduler she’s arguing for? What are some performance counters that can be useful in identifying the workload properties (compute vs. memory bound) and the ability of the scheduler to maximize the system throughput.

A

Fedorova argues for a scheduler that maximizes CPU utilization in hardware multithreaded environments.

Generally, we use instructions per cycle (IPC) to determine utilization. Tasks that are CPU-bound typically have IPCs that are close to 1, while tasks that are memory-bound typically have IPCs that are close to 0.

Fedorova suggests that hardware incorporate a new metric, cycles per instruction (CPI), which is the reciprocal of IPC.

CPU-bound tasks would still have a CPI of close to 1, while memory-bound tasks would have a CPI much higher than 1.

The IPC performance counter is helpful in determining whether a task is compute-bound or memory-bound. In addition, performance counters that track cache misses can be helpful: with many cache misses, a scheduler can assume a task is making lots of memory accesses and is therefore memory-bound; with few cache misses, it can assume the task is more compute-bound.

P3L1

6
Q

How does the OS map the memory allocated to a process to the underlying physical memory?

A

The OS maps the memory allocated to a process to the underlying physical memory via a page table (or segments), which lets us look up the corresponding page frame number in physical memory and then index into that frame to find the exact location.

P3L2

7
Q

How do we deal with the fact that processes address more memory than is physically available?

A

We deal with the fact that processes address more memory than physically available by swapping data out of physical memory (e.g., DRAM) to secondary storage (e.g., on disk). We determine which data to swap using an algorithm (e.g., Least Recently Used, which uses the access bit).

P3L2

8
Q

How does address translation work? What’s the role of the TLB?

A

In order to translate a virtual address to a physical address we have to look it up in a page table. Page tables can be flat or hierarchical. Hierarchical page tables save on space by not requiring an entry for every single virtual address (used or unused) but they require more lookups (outer page table to inner page tables).

A virtual address has two parts with a flat page table, or more with a hierarchical one. The last part is always the offset, which we use to index into the physical page frame. The first part(s) are used to index into the page table(s) to find the page frame number, which tells us where the frame starts in physical memory.

The role of the Translation Lookaside Buffer is to cache translations between virtual addresses and physical addresses so we can save on lookups. This reduces the latency that’s created by hierarchical page tables, which save on space but require more lookups.
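
A toy single-level translation in C, assuming a hypothetical 32-bit virtual address with 4 KB pages (12 offset bits, 20 VPN bits):

```c
#include <stdint.h>

#define PAGE_SHIFT  12
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

typedef struct {
    uint32_t pfn;        /* page frame number */
    unsigned valid : 1;  /* is the translation present? */
} pte_t;

/* page_table would have 2^20 entries in this flat layout */
uint32_t translate(const pte_t *page_table, uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;  /* index into the page table */
    uint32_t offset = vaddr & OFFSET_MASK;  /* index into the frame      */

    if (!page_table[vpn].valid)
        return (uint32_t)-1;  /* hardware would raise a page fault here */

    return (page_table[vpn].pfn << PAGE_SHIFT) | offset;
}
```

A TLB simply caches recent vpn → pfn mappings, so this walk (or the multiple walks of a hierarchical table) is skipped on a hit.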

P3L2

9
Q

Do you understand the relationships between the size of an address, the size of the address space, the size of a page, the size of the page table?

A

TODO (P3L2)

10
Q

Do you understand the benefits of hierarchical page tables? For a given address format, can you work out the sizes of the page table structures in the different layers?

A

TODO (P3L2)

11
Q

For processes to share memory, what does the OS need to do? Do they use the same virtual addresses to access the same memory?

A

The OS needs to create the shared memory segment and map it into the virtual address space of both processes.

The virtual address of the shared memory segment will NOT be the same for each process.
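
A minimal sketch of how this is exposed to a process, using POSIX shared memory (the name "/demo_shm" and the 4096-byte size are arbitrary; error handling omitted):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* create (or open) the named segment and size it */
    int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);

    /* map it into this process's virtual address space */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* a second process that shm_open()s and mmap()s "/demo_shm" sees
     * the same physical pages, almost certainly at a different
     * virtual address */
    p[0] = 'x';

    munmap(p, 4096);
    close(fd);
    shm_unlink("/demo_shm");
    return 0;
}
```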

P3L3

12
Q

For processes to communicate using a shared memory-based communication channel, do they still have to copy data from one location to another?

A

No, processes using shared memory-based communication don’t have to copy data from one location to another as long as it’s allocated from a virtual address that belongs to the shared memory region.

P3L3

13
Q

What are different ways you can implement synchronization between different processes (think what kinds of options you had in Project 3)?

A
  • mutex & condition variable
  • semaphore
  • message queue
  • socket
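
A sketch of the first option: a mutex and condition variable made usable across processes by placing them in shared memory and marking them PTHREAD_PROCESS_SHARED (the shared_sync_t struct is my own illustration):

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             data_available;
} shared_sync_t;

/* Run once, on memory that both processes have mapped. */
void init_shared_sync(shared_sync_t *s) {
    pthread_mutexattr_t ma;
    pthread_condattr_t  ca;

    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&s->lock, &ma);

    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&s->ready, &ca);

    s->data_available = 0;
}

/* Consumer: block until the producer sets the flag and signals. */
void wait_for_data(shared_sync_t *s) {
    pthread_mutex_lock(&s->lock);
    while (!s->data_available)
        pthread_cond_wait(&s->ready, &s->lock);
    pthread_mutex_unlock(&s->lock);
}
```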

P3L3

14
Q

To implement a synchronization mechanism, at the lowest level you need to rely on a hardware atomic instruction. Why? What are some examples?

A

When we have multiple CPUs and/or multiple threads accessing a lock, the check and the set of the lock value can be interleaved (so more than one processor could see the lock as free and acquire it) unless the check-and-set is a single atomic operation. When an operation is atomic, the hardware guarantees the read-modify-write completes indivisibly, so only one CPU at a time can succeed.

Examples:

  • test_and_set
  • read_and_increment
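
A sketch of a spinlock built on an atomic test-and-set, here expressed with the GCC/Clang __atomic builtins (one possible primitive; hardware provides the underlying instruction):

```c
typedef struct { volatile unsigned char held; } spinlock_t;

void spin_lock(spinlock_t *l) {
    /* __atomic_test_and_set atomically writes 1 and returns the old
     * value: we own the lock only if it was previously 0 (free). */
    while (__atomic_test_and_set(&l->held, __ATOMIC_ACQUIRE))
        ;  /* spin */
}

void spin_unlock(spinlock_t *l) {
    __atomic_clear(&l->held, __ATOMIC_RELEASE);
}
```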

P3L4

15
Q

Why are spinlocks useful? Would you use a spinlock in every place where you’re currently using a mutex?

A

A spinlock is useful when the time a thread would spend spinning is less than the time it takes to context switch.

A spinning thread burns CPU cycles (and creates memory/coherence traffic), so use spinlocks only when critical sections are short and the lock holder never blocks.

P3L4

16
Q

Do you understand why it is useful to have more powerful synchronization constructs, like reader-writer locks or monitors? What about them makes them more powerful than using spinlocks, or mutexes and condition variables?

A

More powerful synchronization constructs like reader-writer locks or monitors are useful because they handle more for the programmer (e.g., lock/unlock pairing, signaling, distinguishing read from write access), which reduces common errors such as forgetting to unlock or signaling the wrong condition.
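
A small reader-writer lock example using the POSIX API, just to show what the construct handles for you (concurrent readers, exclusive writers):

```c
#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int counter;

int read_counter(void) {
    int v;
    pthread_rwlock_rdlock(&rw);   /* shared: many readers at once */
    v = counter;
    pthread_rwlock_unlock(&rw);
    return v;
}

void bump_counter(void) {
    pthread_rwlock_wrlock(&rw);   /* exclusive: blocks readers and writers */
    counter++;
    pthread_rwlock_unlock(&rw);
}
```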

P3L4

17
Q

Can you work through the evolution of the spinlock implementations described in the Anderson paper, from basic test-and-set to the queuing lock? Do you understand what issue with an earlier implementation is addressed with a subsequent spinlock implementation?

A
  1. test_and_set spinlock
    The atomic operation is in the while loop, so every spin generates contention, which is disastrous in a cache-coherent architecture with write-invalidation.
  2. test_and_test_and_set ("spin on read") spinlock
    The atomic operation is still in the while loop, but it's only attempted when the cached value says the lock is free, which reduces contention. Still bad in a cache-coherent architecture with write-invalidation, because every release invalidates all cached copies and all spinners retry the atomic at once.

+ improved latency
+ improved delay
-/+ contention: NCC = same, CC + write-update = O(n), CC + write-invalidate = O(n^2)

  3. test_and_test_and_set spinlock + delay
    Same as above, but a delay (after the lock is released, or after each lock reference) spreads out the retries so everyone isn't hammering the lock at the same moment, which helps in a cache-coherent architecture with write-invalidation.

+ improved latency
+ improved contention
- worsened delay

  4. queueing (Anderson) spinlock
    One atomic operation (read_and_increment) when the thread requests the lock assigns it a private slot to spin on; subsequent spinning happens on that per-thread variable, which reduces contention and avoids the write-invalidation storm.

+ improved delay
+ improved contention
- worsened latency
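
A sketch of step 2, the test_and_test_and_set ("spin on read") refinement, again using the __atomic builtins as a stand-in for the hardware primitive:

```c
typedef struct { volatile unsigned char held; } spinlock_t;

void spin_lock(spinlock_t *l) {
    for (;;) {
        /* test: a plain read that hits the local cache while the
         * lock is busy, so spinning generates no coherence traffic */
        while (l->held)
            ;
        /* test_and_set: only attempt the atomic once the lock
         * looked free; we may still lose the race and loop again */
        if (!__atomic_test_and_set(&l->held, __ATOMIC_ACQUIRE))
            return;
    }
}

void spin_unlock(spinlock_t *l) {
    __atomic_clear(&l->held, __ATOMIC_RELEASE);
}
```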

P3L4

18
Q

What are the steps in sending a command to a device (say packet, or file block)? What are the steps in receiving something from a device?

A

Sending:

  1. Send data (user process)
  2. Form packet (kernel)
  3. Write Tx request record (driver)
  4. Perform transmission (device)

Receiving:
Receiving something back works in the reverse order (device retrieves and passes to driver, driver passes to kernel, kernel passes to user process)

19
Q

For block storage devices, do you understand the basic virtual file system stack, the purpose of the different entities?

A

file: data blocks on disk (not necessarily contiguous)
inode: tracks a file's blocks; the inode itself resides on disk in some block (contiguous)
superblock: overall map of the disk's blocks (inode blocks, data blocks, free blocks)

P3L5

20
Q

For the virtual file system stack, we mention several optimizations that can reduce the overheads associated with accessing the physical device. Do you understand how each of these optimizations changes how or how much we need to access the device?

A

caching/buffering - reduces # of disk accesses

I/O scheduling - reduces disk head movement

prefetching - reduces # of disk accesses

journaling / logging - reduces random access

P3L5

21
Q

What is virtualization? What’s the history behind it?

A

Virtualization allows concurrent execution of multiple operating systems (and their applications) on the same physical machine.

P3L6

22
Q

What were the problems with virtualizing x86? How does protection of x86 used to work and how does it work now? How were/are the virtualization problems on x86 fixed?

A

x86 pre 2005

  • 4 rings, no root/non-root modes yet
  • hypervisor in ring 0, guest OS in ring 1

BUT: 17 privileged instructions do not trap; they fail silently!

e.g., the interrupt enable/disable bit lives in a privileged register; POPF/PUSHF instructions that access it from ring 1 fail silently

hypervisor doesn’t know, so it doesn’t try to change settings

OS doesn’t know, so it assumes change was successful

This was fixed in two different ways:

  1. binary translation (dynamically rewrite the guest binary so it never issues the instructions that fail silently; the guest OS itself stays unmodified)
  2. paravirtualization (modify the guest OS so it makes explicit hypercalls to the hypervisor instead of issuing those instructions)

Since 2005, hardware support (Intel VT-x / AMD-V) has added root/non-root modes so the problematic instructions trap to the hypervisor.

P3L6

23
Q

How does device virtualization work? What's a passthrough vs. a split-device model?

A

The pass-through model allows a guest VM to directly access the hardware.
+ fast
- reduces portability as it requires the exact hardware devices that the OS expects

The split-device model splits device access between a front-end driver in the guest VM (the device API) and a back-end driver in the service VM (or host). It requires modified guest drivers, so it's limited to paravirtualized guests.
+ eliminates device-emulation overhead
+ guest is device-independent (it only sees the generic front end)
+ allows better management of shared devices
- slower than passthrough, since requests go through the service VM

P3L6

24
Q

What’s the motivation for RPC? What are the various design points that have to be sorted out in implementing an RPC runtime (e.g., binding process, failure semantics, interface specification… )? What are some of the options and associated tradeoffs?

A

TODO (P4L1)

25
Q

What’s specifically done in Sun RPC for these design points – you should easily understand this from your project?

A

TODO (P4L1)

26
Q

What’s marshalling/unmarshalling? How does an RPC runtime serialize and deserialize complex variable size data structures? What’s specifically done in Sun RPC/XDR?

A

TODO (P4L1)

27
Q

What are some of the design options in implementing a distributed file service?

A
  1. Upload/download model
    Client downloads file, makes changes, uploads it to server.
    + simple
    + local reads/writes at client
    - entire file download/upload even for small accesses
    - server gives up control
  2. True remote file access model
    Every access to remote file goes through server
    + file access is centralized so maintaining consistency is simple
    - every file operation pays a network latency cost
    - limits scalability b/c everything must go through server
  3. A compromise
    Clients cache portions (blocks) of files and perform some operations locally; the server stays involved enough to keep copies consistent.
    + lower latency for data cached at the client
    - server and clients must maintain extra state to coordinate

P4L2

28
Q

The Sprite caching paper motivates its design based on empirical data about how users access and share files. Do you understand how the empirical data translated into specific design decisions? Do you understand what type of data structures were needed at the servers' and the clients' side to support the operation of the Sprite system (i.e., what kind of information did they need to keep track of, what kinds of fields did they need to include for their per-file/per-client/per-server data structures)?

A

TODO (P4L2)

29
Q

When sharing state, what are the tradeoffs associated with the sharing granularity?

A

Cache Line Granularity : Too fine. Coherence traffic overwhelms any benefit

Variable Granularity: More sense for programmer, still high overhead

Page Granularity: Makes more sense to the OS. Must be aware of false sharing: two processes use different, unrelated portions of the same page, yet the whole page still generates coherence traffic between them.

Object Granularity: Requires language runtime support. OS doesn’t care. Not as generalizable

P4L3

30
Q

For distributed state management systems (think distributed shared memory), what are the basic mechanisms needed to maintain consistency? E.g., do you understand why it is useful to use 'home nodes', and why we differentiate between a global index structure used to find the home nodes and local index structures used by the home nodes to track information about the portion of the state they are responsible for?

A

A “home node” keeps state about a page and is responsible for cache coherence for that page, unless another node is operating on the page a lot. In that case, the other node can become the “owner” and take over responsibility for the state and for driving coherence (which saves constantly talking to the home node). The owner can change over time, but the home node never does; the home node keeps track of who the current owner is.

The global index structure keeps track of which manager (home node) is responsible for each page, and it is replicated across all nodes. The local index structures can be partitioned: each home node keeps the detailed per-page state only for the pages it is responsible for, rather than replicating it everywhere.

P4L3

31
Q

Do you have some ideas about how you would go about implementing a distributed shared memory system?

A
  • each node contributes part of its main memory pages to the DSM
  • need local caches for performance (latency)
  • all nodes responsible for part of distributed memory
  • home node manages access and tracks page ownership
  • explicit replication possible for load balancing, performance, or reliability

P4L3

32
Q

What’s a consistency model? What are the different guarantees that change in the different models we mentioned – strict, sequential, causal, weak… Can you work through a hypothetical execution example and determine whether the behavior is consistent with respect to a particular consistency model?

A

A consistency model is a guarantee about how state changes will behave (access ordering, visibility / propagation)

Strict
- Updates visible everywhere immediately. In distributed systems, latency and message reorder/loss make this impossible to guarantee.

Sequential

  • Updates from different processes may be arbitrarily interleaved
  • All processes see the same interleaving (though may not reflect issue order)
  • Operations from the same process will always appear in order they were issued

Causal

  • Causally related writes maintain order
  • Writes from a single processor maintain order
  • “Concurrent” writes no guarantees about order

Weak

  • All updates prior to a sync will be visible
  • No guarantee what happens in between
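
A hypothetical execution (my own example): P1 writes x = 1 and then y = 1; P2 later reads y == 1 and then reads x == 0. Under strict and sequential consistency this is not allowed, since P1's writes must appear to everyone in program order. Under causal consistency it is also disallowed, because the two writes come from the same processor and are therefore ordered. Under weak consistency it is allowed, as long as P2 performed no synchronization operation between the reads.

P4L3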
33
Q

When managing large-scale distributed systems and services, what are the pros and cons with adopting a homogeneous vs. a heterogeneous design?

A

Homogeneous
+ simple front-end (doesn’t have to keep track of node specialization)
+ easy to scale. just add more nodes
- unable to take advantage of locality & caching

Heterogeneous
- complex front-end (has to keep track of node specialization)
- more complex to scale: which nodes do you add? hot spots & bottlenecks are possible
+ able to take advantage of locality & caching

P4L4

34
Q

Do you understand the history and motivation behind cloud computing, and basic models of cloud offerings?

A

Infrastructure as a Service
- The provider supplies virtualized infrastructure (compute, storage, networking); you manage the OS and everything above it (e.g., Amazon EC2)

Platform as a Service
- The provider also manages the OS and runtime environment; you manage your application and its data (e.g., Google App Engine)

Software as a Service
- The provider runs the complete application; you just use the software (e.g., Gmail)

P4L4

35
Q

Do you understand what it is about cloud scales that makes cloud computing practical?

Do you understand what it is about cloud scales that makes failures unavoidable?

A

Practical Scalability:
• economies of scale - you buy a single set of hardware, then you can virtualize the servers and the network.
• You can host multiple tenants in a single location, on a single machine even.
• You can easily provision new systems as needed, or bring them down

Failures:
Cloud services integrate many components, each with their own percentage of failure. These failures compound and create a system that is more likely to fail the more it scales. This means developers have to expect failures and write robust code that handles these failures.

P4L4

36
Q

What’s demand paging?

A

When a process accesses a virtual memory address whose page is not present in physical memory (the present bit in the page table entry indicates this), the hardware MMU generates a fault, which traps into the kernel. The OS determines where the page is (e.g., on disk), finds a free physical frame for it, and issues an I/O operation to bring the page into that frame. It then updates the page table entry corresponding to the faulting virtual address with the new page frame number. Finally, the OS resets the process's program counter to the faulting instruction so the access is retried, and this time it succeeds.

P3L2

37
Q

When does page replacement happen? How is it determined when page replacement should happen?

A

Page replacement happens when memory usage is above a certain threshold and CPU usage is below a certain threshold.

A page replacement algorithm determines which pages should be swapped out (e.g., Least Recently Used uses history-based prediction to swap out the page that was accessed the longest time ago as indicated by the access bit)

P3L2

38
Q

What happens when a process tries to access a page not present in physical memory? What happens when a process tries to access a page that hasn’t been allocated to it?

A

When a process tries to access a page not present in physical memory, the MMU raises an exception, which traps into the kernel. The OS determines the exception is a page fault and ascertains the reason for it (e.g., page not present in physical memory vs. inadequate permissions). If the fault is due to the page not being present, it determines where the page is (e.g., on disk), issues an I/O request, finds a free page frame for the data, updates the page table entry, resets the process's program counter, and gives control back to the process so it can retry the access.

When a process tries to access a page that hasn't been allocated to it, the MMU again raises an exception, and the OS determines it's an illegal access. It then delivers a signal to the process (e.g., SIGSEGV), which typically terminates it.

P3L2

39
Q

What happens when a process tries to modify a page that’s write protected/how does COW work?

A

When a new process is spawned, its parent's entire address space would normally need to be copied, but much of it is static. To save space and time, the OS initially maps the child's virtual addresses to the same physical pages as the parent's and write-protects those pages. If and when either process attempts to write, only then is the affected page copied to its own memory location (copy-on-write). This saves space and time whenever the processes mostly read the shared data.

P3L2

40
Q

What are the costs associated with copying vs. (re-/m)mapping?

A

Copying requires crossing the user/kernel boundary. Costs CPU cycles.

Mapping requires mapping the virtual address to the physical address of the shared memory region. Costs CPU cycles.

Mapping happens once and can be reused many times so it can have a good payoff. Also, mapping can have a good payoff even if only used once if a large quantity of data is involved.

P3L3

41
Q

What are the tradeoffs between message-based vs. shared-memory-based communication?

A

In message-based IPC the OS sets up the channel, determines the protocol, and handles synchronization. It’s simple, but it’s expensive b/c every message requires a system call (crossing the user/kernel boundary).

In shared memory-based IPC the OS sets up the channel, and everything else is up to the programmer. It’s more complex for the programmer, but it can be cheaper b/c the OS is out of the way.

P3L3

42
Q

What are the basic differences in using programmed I/O vs. DMA support?

A

PIO

  • doesn’t require special hardware support
  • CPU interacts with command and data registers
  • requires many CPU store instructions
  • better for small transfers

DMA

  • does require special hardware support (DMA controller)
  • CPU interacts with command registers and DMA controller
  • requires one CPU store instruction
  • better for large transfers

P3L5

43
Q

Do you understand the relationship between the various data structures (block sizes, addressing scheme, etc.) and the total size of the files or the file system that can be supported on a system?

A

An inode holds the index of all blocks for a file on disk.

It contains:

  • metadata
  • pointers to blocks (12 direct, 1 single indirect, 1 double indirect, 1 triple indirect)

A single indirect pointer points to a block of pointers that point to data blocks.

A double indirect pointer points to a block of pointers that point to blocks of pointers that point to data blocks.

And so on for the triple indirect…

This allows the inode data structure to remain small while supporting large file sizes.

To determine the maximum file size supported, first find the number of pointers per block:

num ptrs per block = block size / size of a block pointer

then:

max file size = (12 direct ptrs + num ptrs per block + (num ptrs per block)^2 + (num ptrs per block)^3) x block size
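
A worked example with hypothetical values: with 1 KB blocks and 4-byte block pointers, num ptrs per block = 1024 / 4 = 256, so

max file size ≈ (12 + 256 + 256^2 + 256^3) x 1 KB
             = (12 + 256 + 65,536 + 16,777,216) x 1 KB
             ≈ 16 GB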

P3L5

44
Q

What’s hosted vs. bare-metal virtualization?

A
  1. Bare-metal or hypervisor-based (type 1)
    The VMM (hypervisor) manages all hardware resources and supports execution of VMs; a privileged service VM deals with devices (and other configuration and management tasks).

  2. Hosted (type 2)
    The host OS owns all hardware; a special VMM module provides hardware interfaces to VMs and deals with VM context switching.

P3L6

45
Q

What’s paravirtualization? Why is it useful?

A

Paravirtualization is when the guest OS is modified so that it knows it's virtualized (i.e., it makes explicit hypercalls to the hypervisor instead of issuing instructions that would need to be trapped or would fail silently). It's useful because it avoids the overheads of trapping/emulating or binary-translating those instructions.

P3L6

46
Q

What are the tradeoffs associated with a stateless vs. stateful design?

A

A stateless design means the server doesn’t keep any info about what’s happening.

+ resilient (on failure just restart)
+ no resources used to maintain state
- unable to support caching (need state for that)
- every request must be self-contained (more bits transferred to describe request)

A stateful design means the server keeps information about what’s cached / accessed

+ supports caching, locks, and incremental operations

- needs checkpointing to recover from failure
- needs resources to maintain state

P4L2

47
Q

What are the tradeoffs (benefits and costs) associated with using techniques such as caching, replication, partitioning, in the implementation of a distributed service (think distributed file service).

A

Caching
+ Clients can locally perform operations on cached state (e.g., open/read/write)
- a coherence mechanism is required to keep the cached portions of files consistent with the server's representation

Replication
+ load balancing
+ availability
+ fault tolerance
- writes become more complex (synchronously write to all replicas, or write to one and then propagate to the others)
- replicas must be reconciled (e.g., voting)
- scalability (machines only get so large)

Partitioning
+ availability vs. a single-server design
+ scalability with file system size
+ single-file writes are simple
- on failure, lose portion of data
- load balancing harder; if not balanced, then hot spots possible

P4L2

48
Q

Do you understand some of the enabling technologies that make cloud offerings broadly useful?

A
  • Virtualization
  • Resource provisioning (scheduling)
  • Big Data processing & storage
  • Software defined slices of resources (networking, storage, data centers)
  • Monitoring => real time log processing

P4L4