0. Midterm Questions Flashcards

1
Q

What is GWA’s special talent?

A

Haircut detection

2
Q

Which of the following is a requirement of a critical section? [progress, concurrency, mutual inclusion, idleness]

A

Progress

3
Q

Intra-process (within) communication is easier than interprocess (between) communication. True or false?

A

True.
Threads of a single process share an address space, while a process's address space is generally accessible only to that process and not to others.

4
Q

Which of the following requires communication with the operating system? [Switching between two threads. Inverting a matrix. Recursion. Creating a new process.]

A

Creating a new process

5
Q

Which of the following is NOT an example of an operating system mechanism?
[A context switch;
Using timer interrupts to stop a running thread;
Maintaining the running, ready, and waiting queues;
Choosing a thread to run at random]

A

Choosing a thread to run at random (this is a policy).

6
Q
The Rotating Staircase Deadline Scheduler is most similar to which other scheduling algorithm?
[Lottery scheduling;
Multi-level feedback queues;
Round-Robin;
Random]
A

Multi-level feedback queues

7
Q
What would probably be stored in a page table entry?
[the physical memory address;
the virtual memory address;
the process ID;
the file name]
A

The physical memory address

8
Q

Address translation allows the kernel to implement what abstraction?
[Files, Threads, Processes, Address spaces]

A

Address spaces

9
Q

Con Kolivas was particularly interested in improving what aspect of Linux scheduling?
[Overhead; Throughput; Interactive performance; Awesomeness]

A

Interactive performance

10
Q

Which is probably NOT a privileged operation?
[Changing the interrupt mask; Loading an entry into the TLB; Modifying the exception handlers; Adding two registers and placing the result in a third register]

A

Adding two registers and placing the result in a third register

11
Q

Consider a 32-bit system with 4K pages that uses multi-level page tables as described in class:

  1. 10 bits for the first-level index.
  2. 10 bits for the second level index.
  3. a 12-bit offset

Assume that page table entries are 32 bits and so can be stored directly in the second-level table.

If a process has 10 contiguous code pages (including global variables), 4 contiguous heap pages, and a single 1-page stack, what is the minimum number of pages required for the process’s page table?
What is the maximum?
In both cases, briefly explain your answer.

A

Each first- or second-level table consumes one page, since it has 1024 (2^10) 4-byte entries. So at minimum, this process’s page table requires two pages: one for the top-level table and a second for a single second-level table, which is enough if the code, heap, and stack are close enough together (within the same 4 MB region) to all lie within the same second-level table. (This layout is not common.)

The maximum can be thought of as follows: one page for the top-level table plus three pages for the second-level tables, one covering the 10 contiguous code pages, one covering the 4 contiguous heap pages, and a third covering the single stack page. The true maximum is six, though, since the code segment and the heap may each straddle two second-level tables if they happen to cross a 4 MB boundary in the address space.
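
To make the 10/10/12 split concrete, here is a minimal C sketch (not from the exam; the example address is arbitrary) showing how a 32-bit virtual address breaks into the two table indices and the page offset:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t va = 0x00401ABC;              /* arbitrary example virtual address */
    uint32_t top    = (va >> 22) & 0x3FF;  /* bits 31-22: first-level index     */
    uint32_t second = (va >> 12) & 0x3FF;  /* bits 21-12: second-level index    */
    uint32_t offset = va & 0xFFF;          /* bits 11-0:  offset within page    */
    printf("top=%u second=%u offset=0x%03x\n",
           (unsigned)top, (unsigned)second, (unsigned)offset);
    return 0;
}

Each table holds 2^10 = 1024 four-byte entries, which is exactly 4 KB, so each first- or second-level table fills one page.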

12
Q

Identify one separation between OS mechanism and policy that we have discussed this semester. Describe the mechanism and one policy.

A

One example is the separation between the ability to stop and start threads (mechanism) and scheduling (policy). The mechanism is preemptive context switches using the timer to give the kernel a chance to run regularly and interrupt the running thread if needed. One policy is to schedule threads at random.

Another example is the separation between the virtual address translations performed by the TLB or MMU (mechanism) and the mapping between virtual and physical memory set by the kernel and running processes (policy). The mechanism is the TLB’s ability to cache translations and map virtual addresses to physical addresses. One policy is having a 2GB virtual address space where the code starts at 0x10000, the heap starts at 0x10000000 and grows upward, and the stack starts at 0x80000000 and grows down. But any potential address space layout would earn you credit here.

13
Q

Identify three system calls that allocate new virtual addresses (i.e., allow the process to use them) and provide a brief description of each.

A
  1. exec( ): loads content from the ELF file and sets up virtual addresses, mainly those that point to the process’s code and any statically-declared variables
  2. fork( ): copies the parent’s entire address space, including all the code pages, the entire heap, and one (or more) thread stacks, allowing the child to use all of these virtual addresses, which will (eventually) have to point to private physical addresses
  3. sbrk( ): extends the “break point”, or the top of the process’s heap, allocating new virtual addresses that point to physical memory. Not usually called by processes directly, but instead by malloc libraries when their subpage allocator runs out of space.
  4. mmap( ): asks the kernel to map a range of virtual addresses onto a region of a file (see the sketch after this list)
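
As a rough illustration of items 3 and 4 (a sketch, not exam material; the file name is made up and error checking is omitted):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    void *old_break = sbrk(4096);                 /* grow the heap by one page      */
    int fd = open("/tmp/example.dat", O_RDONLY);  /* hypothetical file              */
    void *mapped = mmap(NULL, 4096, PROT_READ,    /* map one page of the file into  */
                        MAP_PRIVATE, fd, 0);      /* this process's address space   */
    (void)old_break; (void)mapped;                /* new virtual addresses now usable */
    munmap(mapped, 4096);
    close(fd);
    return 0;
}
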
14
Q

Given a simple MLFQ approach that rewards any thread that sleeps or yields before its quantum expires, first describe a way that a computationally-intensive thread can take advantage of a weakness in this approach to remain in one of the top queues. Second, propose a modification to MLFQ that addresses this problem.

A

The problem is that if a thread can determine how long its quantum is, even approximately, it can arrange to call yield right before its quantum expires. As a result, it has done almost as much computation as it would have normally, but it is not demoted to a lower-priority queue as it should be.

The fix is simple. Instead of looking at whether a thread blocks or yields before its quantum ends, consider the percentage of its quantum it has used before it sleeps or yields. Either establish a stronger threshold for not being demoted (say, it can’t have used more than 10% of the quantum), or use the percentage itself to make a more nuanced decision (if it used less than 10% maintain its level, if between 10% and 50% demote one level, if over 50% demote two levels, and so on).

Another way to fix this problem is simply to not reward threads that yield and only consider those that block. However, this isn’t quite sufficient, since a thread may be able to block on something that it knows will return immediately - like a read from /dev/random - or arrange to have a second thread within the same process release it. Looking at the overall quantum usage is a safer bet.
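
A minimal sketch of the percentage-based rule described above (the thresholds and the priority representation are invented for illustration; lower numbers mean lower-priority queues):

int next_priority(int current, unsigned used_ticks, unsigned quantum_ticks) {
    unsigned pct = (100 * used_ticks) / quantum_ticks;
    int next;
    if (pct <= 10)      next = current;      /* used little of the quantum: keep level */
    else if (pct <= 50) next = current - 1;  /* moderate use: demote one level         */
    else                next = current - 2;  /* used most of the quantum: demote two   */
    return next < 0 ? 0 : next;              /* clamp at the lowest queue              */
}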

15
Q

Long question: We have seen semaphores, spin and sleep locks, condition variables, and reader-writer locks. However, many other useful synchronization primitives exist.

First, describe one additional synchronization primitive. Provide a complete interface for it in C pseudo-code, and describe how to implement it.

Second, provide two different use cases for your new synchronization primitive. Feel free to use pseudo-code as well as English here.

A

See 2016 midterm solution sheet.

16
Q

All of the following are critical section requirements EXCEPT
[mutual exclusion, concurrency, progress, performance]

A

Concurrency

17
Q

All of the following are private to each process thread EXCEPT
[stack; file handles; registers]

A

File handles

18
Q

What action does NOT require the kernel?

[Switching between two threads; Reading from a file; Creating a new process; Altering a virtual address mapping]

A

Switching between two threads

Is this true? Yes: as card 34 notes, threads can be implemented without kernel support, so a user-level thread library can switch between its threads by saving and restoring registers and swapping stacks entirely in user space, without involving the kernel.

19
Q

What part of the address space is NOT initialized using information from the ELF file?
[Code segment; The heap; Uninitialized static data; Initialized static data]

A

The heap

20
Q

Which of the following is NOT a part of making a system call?
[Arranging the arguments in registers or on the stack; Loading the system call number into a register; Generating a software interrupt; Allocating memory in the heap using malloc]

A

Allocating memory in the heap using malloc

21
Q

What information would probably not be stored in a page table entry?
[The location on disk; Read, write, or execute permissions; The process ID; The physical memory address]

A

The process ID

22
Q

Paging using fixed-size pages can suffer from internal fragmentation (true or false)

A

True

23
Q

One way to eliminate deadlock when acquiring multiple locks is to eliminate cycles. Describe another way to avoid deadlock and how to implement it.

A

There isn’t much you can do about protected access to shared resources, which requires waiting - either passive or active. In some cases adding resources to eliminate unnecessary sharing could be an option, and we’ll accept that. (Eliminating sleeping isn’t a solution, since replacing deadlocks with livelocks isn’t a great idea.) But the other two conditions are potential solutions:

  1. ADD RESOURCES, eliminating the need for sharing. As an example from the dining philosophers problem, adding chopsticks eliminates the need to share. We’ll accept this answer if you were quite specific about this not working in all circumstances, since in certain cases global resources are required for correctness.
  2. ALLOW RESOURCE PREEMPTION, creating a way for the operating system to forcibly take a resource from a thread that is holding it. One way to implement this is to detect a deadlock by identifying a circular dependency in the wait queue and then interrupt one of the threads holding the locks causing the cycle. This would require some way for the thread to register a handler each time it requests a lock that would be called if that lock was force dropped.
  3. DISALLOW MULTIPLE INDEPENDENT REQUESTS, preventing a thread from holding some resources while requesting others. You could implement this by checking each time a thread requests a lock to make sure that it doesn’t hold any other locks (a rough sketch follows below). Enforcing this without any new lock acquisition mechanisms would require programmers to change their code to introduce new locks covering these cases. Alternatively, you could create a new locking mechanism allowing multiple locks to be acquired simultaneously.

(See stoplight problem?)
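
A minimal sketch of the check from option 3, assuming OS/161-style lock_acquire()/lock_release() and a hypothetical per-thread count of held locks:

#include <assert.h>

struct lock;                          /* provided by the kernel's locking code */
void lock_acquire(struct lock *);
void lock_release(struct lock *);

struct thread {
    int locks_held;                   /* hypothetical: locks this thread currently holds */
};

void checked_lock_acquire(struct thread *t, struct lock *l) {
    assert(t->locks_held == 0);       /* refuse to request a lock while holding one */
    lock_acquire(l);
    t->locks_held++;
}

void checked_lock_release(struct thread *t, struct lock *l) {
    lock_release(l);
    t->locks_held--;
}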

24
Q

Describe two changes to the OS virtual memory system that you would need or might want to make to accommodate 48-bit virtual and physical addresses. For each, describe how it would affect the computation or memory overhead of virtual to physical translation.

A

Answer options:

  1. Page table entry size. You clearly need to do something about your page table entries, since whatever you did to bit-pack for 32-bit virtual addresses is going to change with 48-bit virtual addresses. Almost inevitably, you’re going to end up storing more state.
  2. Address space size. You don’t necessarily have to change the address space size, since you can simply fit more content from a 32-bit-wide address space into a system with 48-bit virtual addresses. You may want to change the breakpoint between userspace and the kernel, however, since there is no point wasting any of the 4 GB-wide virtual address space on the kernel at this point. Everything from 0x0 to 0xFFFFFFFF should be used for the process, with kernel addresses outside of that range. This doesn’t necessarily have any direct effect on the overheads, although it may affect the size of other data structures (PTEs or page tables).
  3. Larger pages. You may really want to increase the 4K page size at this point, with the concordant tradeoff between TLB faults and spatial locality. Larger pages may reduce the number of translations required.
  4. Different page table structures. It could be time to add a third level to the existing two-level page tables, particularly if you decide to expand the address space size. The goal would be to reduce the amount of space required for the page tables for common address space layout patterns.
  5. MMU changes. Clearly the MMU and TLB need to be modified to help map 48-bit virtual addresses to 48-bit physical addresses. No effect on the kernel overheads here, but any speculation about the effect on hardware (wider addresses might imply fewer entries) would be accepted.
25
Q

Describe how a scheduling algorithm might improve resource allocation by observing processes communicate using pipes. What is a tradeoff involved in this decision?

A

Assume I have the following shell pipeline: foo | bar. Due to the pipe between them, before each run we know that bar cannot make much progress until foo runs and begins to fill the pipe. So we could just run foo to completion, buffer all of its output in the pipe, and then run bar to completion. This would minimize the context switches between them.

However, the tradeoff here is that this requires a large amount of memory - or some kind of storage - to buffer the pipe contents. Switching between the processes can help reduce memory overhead by keeping the pipe size smaller. Ideally, we could run them on separate cores, but we might still need to schedule them carefully so that bar has enough work to do when it runs to justify the context switch, while foo doesn’t fill the buffer too quickly.
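
A minimal sketch of the foo | bar situation, collapsed into one program for illustration (not how the shell actually sets it up): the parent plays foo and writes into the pipe, the child plays bar and reads from it. Once the kernel’s pipe buffer fills, write() blocks until the reader drains it, which is exactly what forces the scheduler to interleave the two.

#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fds[2];
    char buf[4096] = {0};               /* contents don't matter for the illustration */
    pipe(fds);
    if (fork() == 0) {                  /* child: "bar", the consumer                 */
        close(fds[1]);
        while (read(fds[0], buf, sizeof(buf)) > 0) { /* consume and discard */ }
        _exit(0);
    }
    close(fds[0]);                      /* parent: "foo", the producer                */
    for (int i = 0; i < 1000; i++) {
        write(fds[1], buf, sizeof(buf)); /* blocks whenever the pipe buffer is full   */
    }
    close(fds[1]);
    wait(NULL);
    return 0;
}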

26
Q

Some systems provide separate exception handlers for TLB-faults and all other kinds of exceptions, allowing the operating system to handle them separately. First, describe why this might be a useful feature for the hardware to provide. Second, suggest one way that the operating system might take advantage of this differentiation.

A

The reason to have separate exceptions is to allow the OS to establish separate code paths - this is kind of given away by the second part of the question. This is particularly important given that TLB exceptions are extremely common and, if you improve performance on this hot code path, then the system will run faster overall.

One way that the OS could take advantage of this differentiation is by designing a TLB handler that doesn’t require saving all of the state required by the system call path. In general, system calls and hardware interrupts can branch off into very long code paths, meaning that it is reasonable to save all of the registers up front. However, the TLB code path is extremely specific, and so might be able to avoid saving as much state - at least initially. If it branches off into a page fault, then more state might need to be saved, but that could be done at that point.

27
Q

KNOW HOW TO DO BYTE-ADDRESSABLE MEMORY TRANSLATIONS USING TOP-LEVEL PAGE TABLES (TLT) AND SECOND-LEVEL PAGE TABLES (SLT).

A

See 2015 Midterm Q7

28
Q

Long Q:
INTERFACE SIZE TRADEOFFS. Linux and other UNIX variants provide a fairly thin system call interface, consisting of roughly 300 system calls. You are now familiar with a subset of these calls, particularly those you implemented for ASST2. In contrast, the Windows kernel interface contains approximately 3000.
First, describe the tradeoff between small and large system call interfaces. What is good about a thin interface? How about a thick interface? Second, provide three examples illustrating the problems with the thin system call interface you are familiar with. For each, describe how adding additional system calls would help, and what their interface would be.
Finally, for one of your new system calls describe the operating system changes that would be required to implement it.

A

See 2015 midterm, page 12

29
Q

Long:
CONCURRENT SYSTEM CALLS. As core counts have increased, the OS has become a source of potential bottlenecks for multithreaded applications. One source of poor scaling behavior is when two system calls cannot be executed concurrently, and in some cases redesigning the system call interface can improve concurrency and performance on multicore systems.

Consider the code below and a multithreaded application where, on an N core machine, N threads run openclose. First, describe the OS performance bottleneck that this code might encounter on a multicore system. Note that open is implemented to return the lowest available file descriptor, but very few apps rely on this behavior. Be sure to describe the source of the bottleneck clearly. Second, describe how to alter the system call interface to remove this particular performance bottleneck.

void openclose(void) {
    /* repeatedly open and close a file; the path here is just a placeholder */
    while (1) {
        close(open("somefile", O_RDONLY));
    }
}
A

See 2015 midterm page 14.
Quick notes:
Bottleneck is the linear search for the lowest available file descriptor in the shared per-process file table. Change the interface so that open doesn’t have to return the lowest-available FD. This allows us to search for a free FD starting at a random position (not 0 every time) and allows us to apply something called read-lock-read optimization (?)
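
A rough sketch contrasting the two search strategies (the table layout and NFD are invented for illustration; locking is omitted):

#define NFD 1024

/* POSIX-style behavior: always scan from 0, so every open() in every thread
 * contends over the same low slots of the shared table. */
int alloc_fd_lowest(char used[NFD]) {
    for (int fd = 0; fd < NFD; fd++) {
        if (!used[fd]) { used[fd] = 1; return fd; }
    }
    return -1;
}

/* Relaxed behavior: start the scan at a caller-supplied (random or per-thread)
 * position, so concurrent open() calls usually touch different parts of the
 * table and contend far less. */
int alloc_fd_any(char used[NFD], int start) {
    for (int i = 0; i < NFD; i++) {
        int fd = (start + i) % NFD;
        if (!used[fd]) { used[fd] = 1; return fd; }
    }
    return -1;
}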

30
Q

Which of the following instructions can cause an exception?

[addiu; lw; syscall; All of them]

A

All of them

31
Q

What interface does something have to support to look like memory?
[lock() and unlock(); malloc() and free(); fork() and exec(); load and store]

A

Load and store

32
Q

Acquiring a lock will never create a synchronization problem (true or false)

A

False

33
Q

Which of the following is NOT a requirement for deadlock?
[Multiple independent resource requests; A linear dependency graph; Protected access to shared resources; No resource preemption]

A

A linear dependency graph

34
Q

Threads can be implemented without kernel support (true or false)?

A

True

35
Q

Which of the following is the simplest scheduling algorithm to implement?
[Rotating staircase; Multi-level feedback queue; Round-robin; Random]

A

Random

36
Q

Which of the following is NOT an example of an operating system mechanism?
[A context switch; Loading a virtual address into the TLB; Locks and semaphores; Prioritizing interactive threads]

A

Prioritizing interactive threads

37
Q

All of the following are a good fit for the address space abstraction EXCEPT
[virtual-to-physical address translation; sbrk; the TLB; base-and-bounds address translation]

A

base-and-bounds address translation

38
Q

We have discussed several cases where operating systems provide a useful illusion to processes. Name one, describe why it is useful, and briefly explain how it is provided, identifying any hardware cooperation required.

A
  1. CONCURRENCY. On single-core systems concurrency is an illusion, since there are never really two threads running at once. This is useful since it provides users with the illusion that multiple things are happening at once - music is playing while the web page is updating while they are typing into the terminal. Concurrency is provided by rapidly switching between threads, fast enough to defeat human perceptual limitations. This requires the ability to perform context switches between threads, stopping one’s use of the processor to allow another to continue, and then returning the first to exactly where it was stopped.
  2. ATOMICITY. When we discussed synchronization we identified cases where we wanted to make multiple actions that actually occurred separately happen all at once, or atomically. An example is a modification to a shared data structure that requires multiple operations that should all happen at once - imagine adding an entry to a synchronized linked list, which requires updating multiple pointers (see the sketch after this list). Atomicity is more than useful, and in certain cases (like our example) is actually required for correctness. One way we provide the illusion of atomicity is by locking shared data structures, which forces other users to sleep while we make our modifications, allowing them to all look atomic.
  3. ADDRESS SPACES. Address spaces comprise multiple illusions: that processes have access to a large (1) amount of private (2) contiguous (3) memory (4). In reality, (1) the size of the address space may be larger than the amount of available physical memory. In addition, the same physical memory may be temporally multiplexed between processes, making it not entirely private (2). And it is certainly not contiguous, with any virtual page mapping to any physical page (3), and may not even be memory at all (4) if the page has been swapped to disk or the virtual address is set up to point to a file by mmap. But the address space abstraction is useful because it simplifies how processes deal with memory. They can lay out their memory the same way each time and logically separate regions, such as the stack and heap, in ways that allow them to grow dynamically. At some level all of these illusions are provided by translating virtual addresses into physical addresses, with the extra level of indirection allowing the OS to move memory around and reuse it for other purposes as long as the virtual addresses obey the memory interface.
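
A small example of the atomicity illusion from item 2, assuming an OS/161-style lock (the list and lock declarations below are illustrative):

struct lock;                          /* provided by the kernel's locking code */
void lock_acquire(struct lock *);
void lock_release(struct lock *);

struct node {
    int value;
    struct node *next;
};

static struct node *head;
static struct lock *list_lock;

void list_push(struct node *n) {
    lock_acquire(list_lock);
    n->next = head;                   /* update 1: not atomic on its own            */
    head = n;                         /* update 2: not atomic on its own            */
    lock_release(list_lock);          /* together they look atomic to every other   */
}                                     /* thread that also acquires list_lock        */
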
39
Q

So far we have discussed memory as being uniform, meaning that from the perspective of the core (or cores) the memory access time is constant for each byte of physical memory. On some systems, however, this assumption fails and access time varies with location; we refer to these systems as having a NUMA (non-uniform memory access) design.

Consider a NUMA system where each core has access to a small amount of relatively fast physical memory and a larger amount of relatively slow physical memory. Assume that memory management hardware can map process virtual addresses to either part of physical memory. How does this design complicate the address space abstraction? At a high level, describe a way to make use of this NUMA property. (One of the systems design principles we have discussed may come in handy.)

A

The complication is that NUMA challenges the address space assumption that all memory is uniform. Now, some parts of each process’s address space will be faster than others, and if care is not taken these faster parts may change over time as pages move to different regions of physical memory.

The simplest approach to using the faster memory is to treat it as a cache for the slower memory, leveraging our design principle of viewing the machine as a series of caches. The operating system should try to put frequently-used pages in the fast memory and less-frequently used pages in the slow memory, which also requires some way of tracking page usage.
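
A very rough sketch of that policy (all names, thresholds, and the access-count bookkeeping are invented; how usage is actually tracked depends on the hardware):

#define HOT_THRESHOLD 64              /* arbitrary cutoff for this sketch */

struct page_info {
    unsigned accesses;                /* accesses observed since the last scan */
    int in_fast_region;               /* currently placed in fast memory?      */
};

void rebalance(struct page_info *pages, int npages) {
    for (int i = 0; i < npages; i++) {
        if (pages[i].accesses >= HOT_THRESHOLD && !pages[i].in_fast_region) {
            /* here the kernel would copy the page into fast memory and
             * update its translation; this sketch just records the decision */
            pages[i].in_fast_region = 1;
        } else if (pages[i].accesses < HOT_THRESHOLD && pages[i].in_fast_region) {
            /* and here it would evict the page back to slow memory */
            pages[i].in_fast_region = 0;
        }
        pages[i].accesses = 0;        /* start counting again for the next interval */
    }
}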

40
Q

Describe two scheduling algorithms, only one of which can be from the “know-nothing” group. For each, provide a one sentence explanation and describe one pro or con of the approach.

A
  1. Random. (This is from the know-nothing group.) Choose a thread at random. Pros: simple and serves as a good comparison point. Cons: too simple and unlikely to produce good performance.
  2. Round-Robin. (This is from the know-nothing group.) Simply establish an ordering between threads and run them in that order. Pros: also simple. Cons: also unlikely to produce good performance, doesn’t reward interactivity, doesn’t incorporate priorities, etc.
  3. Shortest-Job First. (This is from the know-it-all group.) Order jobs by how long they take to complete and run them until they do. Pros: minimizes average waiting time. Cons: can’t be implemented, also doesn’t time-share between tasks.
  4. Shortest-Remaining Time First. (This is from the know-it-all group.) Order jobs by how long they will run before they block and run them in that order. Pros: also does a good job at reducing waiting time and improving interactive performance. Cons: can’t be implemented, and may starve long-running non-interactive threads.
  5. Multi-Level Feedback Queues. Maintain a list of priority queues. Jobs run round-robin (or random) and always from the top queue first. If a job runs to the end of its quantum, it may be demoted to a lower priority level; if it completes early, it may be promoted to a higher priority level. Pros: rewards interactive use. Cons: may starve long-running background tasks, and all solutions to this issue are somewhat ugly.
  6. Rotating Staircase Deadline Scheduler. I’m not going to try to describe this briefly; instead, refer to the lecture notes. Pros are that it can provide guarantees about interactive response time. The only con that I can think of is that it never made it into mainline Linux, but I’m sure that there are others.
41
Q

BE ABLE TO DO LOAD, STORE, FETCH (LOAD AND EXECUTE) TASKS USING VIRTUAL PAGE NUMBERS, PHYSICAL PAGE NUMBERS, PERMISSIONS, AND VIRTUAL ADDRESSES.

A

SEE 2014 MIDTERM number 7

42
Q

Long:
Interactivity Detection on Smartphones. Smartphones are the fastest growing computing platform, with most now running platforms built on top of operating systems with roots in desktop computers such as Linux (Android) and Windows. Just as we discussed in class, performing effective thread scheduling on mobile devices can have a big impact on performance. Mobile smartphones, however, have some important differences from older computing devices that impact the scheduling process, particularly when considering interactivity.

First, describe the interactivity detection problem and explain why this information is generally important to thread scheduling. Continue by presenting two reasons why interactivity detection could be even more important on mobile smartphones. You will want to consider patterns of use, changes in the environment caused by mobility, and constraints specific to smartphones.

Second, consider how the differences between smartphones and traditional devices affect the interactivity detection problem and propose a new approach to interactivity detection that responds to these differences. You will want to think about what is different about how users interact with mobile phones, as well as the different features on these devices compared to laptops and desktops.

A

Schedulers face the challenge of determining when users are waiting, in some form or another, for a task to finish: to animate a cursor or draw a character glyph in response to input, as one example, or to render a page that has been retrieved from the Internet as another. Because users are aware of the performance of interactive tasks, but not aware of the performance of others, schedulers may want to allocate these interactive processes more resources or provide them prioritized access to the processor.

Smartphones are a unique computing device that is both (a) always on and (b) battery powered. Desktops may be left on but are not battery powered, while laptops are battery-powered but usually shut down when not in use and moved around. These two features combine to produce a device that has both an important resource limitation (energy) and the potential for a large amount of background usage. This makes it even more important that smartphone schedulers identify interactive tasks and ensure that non-interactive work does not drain the device’s battery. Adding to this challenge is the fact that smartphones face a constantly-changing environment caused by mobility: one minute they have a connection over a high-speed and energy-efficient Wifi network, the next they are forced to use a slower and more power-hungry 3G mobile data network. These changes create the potential for non-interactive background tasks to do even more damage to device lifetimes if they are allowed to use the smartphone inefficiently.

There were several different ways to incorporate the unique features and interaction patterns of smartphones into interactivity detection within the smartphone thread scheduler. Two important aspects of smartphones that you may have noticed and utilized:

• Unlike computing devices utilizing windowing environments allowing multiple apps to be present on the screen at one time, the limited size of smartphone displays means that users are usually only interacting with a single app at one time. While this doesn’t eliminate the interactivity detection problem entirely (apps may have multiple threads, some performing interaction, others doing background jobs), it does reduce the problem to determining which of the threads of the foreground app are interactive and which are not. In addition, when the screen is off you can probably assume that nothing interactive is going on, although you can’t shut down work altogether because apps deliver notifications to users by performing background tasks and may be doing other useful things to optimize the foreground experience. But if the user isn’t looking at the device they probably aren’t waiting on it!

• Smartphones also feature a number of sensors that may simplify the interactivity detection problem. Onboard cameras may be able to determine whether the user is looking at the device and use that to help guess if they are waiting on an action to complete. Orientation sensors may detect movement associated with gameplay and trigger additional performance for that highly-interactive set of apps.

That said, since the point of this question was to get you to think creatively we will definitely accept other reasonable answers. If you really have a great idea about how to solve this problem, why not come join the blue Systems Research Group and try your ideas out for real on the hundreds of smartphones deployed as part of PHONELAB?

43
Q

Long:
Jumbo Pages. While operating system pages have traditionally been 4K, some modern operating systems support “jumbo” pages as large as 64K. Based on your excellent and fast implementation of virtual memory for CSE 421 ASST3 you are hired as a kernel developer for the new Lindows operating system company. Unfortunately, your boss didn’t take CSE 421 and derives most of his understanding of operating systems from the movie “Her”. At present, Lindows does not support jumbo pages, but once your boss hears about them he is desperate to include them in Lindows version 0.0.0.2. He asks you for help.

First, explain how and in what cases 64K pages would improve or degrade OS performance. What information about virtual memory use could help the OS decide whether to locate content on a jumbo or regular-sized page? Second, explain how, in certain cases, you can implement jumbo-page-like functionality on top of an existing system that supports 4K pages without changing the underlying memory management hardware. What MMU features are required for this to work? Which benefits of jumbo pages are preserved or lost by your approach?

A

First, how and in what cases do 64K pages improve or degrade performance? This is related to the discussion we had in class regarding page size. A straightforward benefit of 64K pages is that they allow the TLB to map more memory with the same number of entries, potentially leading to fewer TLB misses and associated TLB faults. Since 64K pages are larger than 4K ones, the question reduces to: in what cases do larger pages help and hurt? They help when there is sufficient spatial locality in memory accesses within the contents of the 64K page and they hurt when there is not.

One way to think about it is: once I touch one byte of memory on the 64K page, what is the likelihood I will touch bytes on the 15 other 4K pages inside the 64K page? If it’s high, I might as well make room for the whole page as soon as I touch any byte inside of it; if it’s low, I’m going to be moving a lot of memory around without any gain over locating the byte on a standard 4K page. So spatial locality is what the OS would like to know. Maybe the best way of finding out would be to ask the process to tell me directly, so if it has a data structure that spans multiple 4K pages it would be better stored on a 64K page. (In addition, I might be able to tell from accesses on groups of 4K pages that they really all belong to one larger page, but it’s not clear if the content can be relocated at this point.)

So how do we implement pseudo-64K pages without actual hardware support, i.e. on 4K underlying pages? Easy: whenever we have a TLB fault within a 64K region identified as a 64K page we load entries (and page contents) for not only the 4K page that caused the fault but for all of the other 15 pages within the 64K page. In a way, we can be even more flexible than real 64K pages, since we could load N extra pages on each side of the faulting page (for example, although 2N + 1 is hard to make even). Clearly this requires a software-managed TLB to do in all cases, since a hardware-managed TLB will load the page translation when it is in memory and not alert the OS. (You could still implement the same behavior on page faults, but that’s not exhaustive enough to cover all cases.)

The benefits of 64K pages that we preserve are fewer TLB faults, since all 15 extra 4K pages on the 64K page are now loaded in the TLB and into memory. Assuming sufficient spatial locality, this is a good thing! The benefit that is lost is that the amount of memory that the TLB can map is still limited by the underlying 4K page size.
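
A minimal sketch of the software-managed-TLB trick described above: on a TLB fault inside a region marked as a 64K “jumbo” page, preload translations for all sixteen underlying 4K pages. The two helpers below are stand-ins for whatever the real kernel provides (OS/161’s actual TLB and page-table interfaces differ):

#define PAGE_SIZE  4096
#define JUMBO_SIZE (16 * PAGE_SIZE)

unsigned lookup_page_table(unsigned vpage);        /* hypothetical: virtual page base -> physical page base */
void     tlb_load(unsigned vpage, unsigned ppage); /* hypothetical: insert one translation into the TLB     */

void jumbo_tlb_fault(unsigned faultaddr) {
    unsigned base = faultaddr & ~(JUMBO_SIZE - 1); /* start of the surrounding 64K region */
    for (int i = 0; i < 16; i++) {
        unsigned vpage = base + i * PAGE_SIZE;
        tlb_load(vpage, lookup_page_table(vpage)); /* load one 4K translation */
    }
}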