Chapter 10 Flashcards

1
Q

The ability to execute a program that is only partially in memory would
confer many benefits

A

* A program would no longer be constrained by the amount of physical
memory that is available. Users would be able to write programs for an
extremely large virtual address space, simplifying the programming task.
* Because each program could take less physical memory, more programs
could be run at the same time, with a corresponding increase in CPU
utilization and throughput but with no increase in response time or
turnaround time.
* Less I/O would be needed to load or swap portions of programs into
memory, so each program would run faster.

2
Q

Virtual memory

A

involves the separation of logical memory as perceived
by developers from physical memory. This separation allows an extremely
large virtual memory to be provided for programmers when only a smaller
physical memory is available (Figure 10.1). Virtual memory makes the task of
programming much easier, because the programmer no longer needs to worry
about the amount of physical memory available; she can concentrate instead
on programming the problem that is to be solved.

3
Q

The virtual address space

A

of a process refers to the logical (or virtual) view
of how a process is stored in memory. Typically, this view is that a process
begins at a certain logical address—say, address 0—and exists in contiguous
memory, as shown in Figure 10.2. Recall from Chapter 9, though, that in fact
physical memory is organized in page frames and that the physical page
frames assigned to a process may not be contiguous. It is up to the memory management unit (MMU) to map logical pages to physical page frames in
memory.
Note in Figure 10.2 that we allow the heap to grow upward in memory as
it is used for dynamic memory allocation. Similarly, we allow for the stack to
grow downward in memory through successive function calls. The large blank
space (or hole) between the heap and the stack is part of the virtual address
space but will require actual physical pages only if the heap or stack grows.
Virtual address spaces that include holes are known as sparse address spaces.
Using a sparse address space is beneficial because the holes can be filled as the
stack or heap segments grow or if we wish to dynamically link libraries (or
possibly other shared objects) during program execution.

4
Q

In addition to separating logical memory from physical memory, virtual
memory allows files and memory to be shared by two or more processes
through page sharing (Section 9.3.4). This leads to the following benefits:

A
  • System libraries such as the standard C library can be shared by several
    processes through mapping of the shared object into a virtual address
    space. Although each process considers the libraries to be part of its virtual
    address space, the actual pages where the libraries reside in physical
    memory are shared by all the processes (Figure 10.3). Typically, a library is
    mapped read-only into the space of each process that is linked with it.
  • Similarly, processes can share memory. Recall from Chapter 3 that two
    or more processes can communicate through the use of shared memory.
    Virtual memory allows one process to create a region of memory that it can
    share with another process. Processes sharing this region consider it part
    of their virtual address space, yet the actual physical pages of memory are
    shared, much as is illustrated in Figure 10.3.
  • Pages can be shared during process creation with the fork() system call,
    thus speeding up process creation.

We further explore these—and other—benefits of virtual memory later in
this chapter. First, though, we discuss implementing virtual memory through
demand paging.
5
Q

Demand Paging

A

Consider how an executable program might be loaded from secondary storage
into memory. One option is to load the entire program in physical memory
at program execution time. However, a problem with this approach is that we may not initially need the entire program in memory. Suppose a program
starts with a list of available options from which the user is to select. Loading
the entire program into memory results in loading the executable code for all
options, regardless of whether or not an option is ultimately selected by the
user.
An alternative strategy is to load pages only as they are needed. This technique
is known as demand paging and is commonly used in virtual memory
systems. With demand-paged virtual memory, pages are loaded only when
they are demanded during program execution. Pages that are never accessed
are thus never loaded into physical memory. A demand-paging system is similar
to a paging system with swapping (Section 9.5.2) where processes reside in
secondary memory (usually an HDD or NVM device). Demand paging explains
one of the primary benefits of virtual memory—by loading only the portions
of programs that are needed, memory is used more efficiently.

6
Q

Basic Concepts

A

The general concept behind demand paging, as mentioned, is to load a page in
memory only when it is needed. As a result, while a process is executing, some
pages will be in memory, and some will be in secondary storage. Thus, we need
some form of hardware support to distinguish between the two. The valid–
invalid bit scheme described in Section 9.3.3 can be used for this purpose. This time, however, when the bit is set to “valid,” the associated page is both legal
and in memory. If the bit is set to “invalid,” the page either is not valid (that
is, not in the logical address space of the process) or is valid but is currently in
secondary storage. The page-table entry for a page that is brought into memory
is set as usual, but the page-table entry for a page that is not currently in
memory is simply marked invalid. This situation is depicted in Figure 10.4.
(Notice that marking a page invalid will have no effect if the process never
attempts to access that page.)
But what happens if the process tries to access a page that was not brought
into memory? Access to a page marked invalid causes a page fault. The paging
hardware, in translating the address through the page table, will notice that the
invalid bit is set, causing a trap to the operating system. This trap is the result
of the operating system’s failure to bring the desired page into memory. The
procedure for handling this page fault is straightforward (Figure 10.5):
1. We check an internal table (usually kept with the process control block)
for this process to determine whether the reference was a valid or an
invalid memory access.
2. If the reference was invalid, we terminate the process. If it was valid but
we have not yet brought in that page, we now page it in.
3. We find a free frame (by taking one from the free-frame list, for example).
4. We schedule a secondary storage operation to read the desired page into
the newly allocated frame.
5. When the storage read is complete, we modify the internal table kept with
the process and the page table to indicate that the page is now in memory.
6. We restart the instruction that was interrupted by the trap. The process
can now access the page as though it had always been in memory.
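The sequence above can be mocked up entirely in user code. The following is a
minimal sketch in C that models the page table, backing store, and free-frame
list as plain arrays; names such as handle_fault() and access() are illustrative
only, not a real kernel interface, and replacement (covered later in this
chapter) is not modeled.

    /* Toy model of steps 1-6 of the page-fault procedure. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NPAGES  8                      /* pages in the logical address space */
    #define NFRAMES 8                      /* enough frames: no replacement here */

    struct pte { bool valid; int frame; }; /* one page-table entry               */

    static struct pte page_table[NPAGES];
    static int backing_store[NPAGES];      /* page contents on "disk"            */
    static int frames[NFRAMES];            /* simulated physical memory          */
    static int next_free = 0;              /* head of the free-frame list        */

    static void handle_fault(int page)     /* steps 3-5: page in a valid page    */
    {
        int frame = next_free++;           /* step 3: take a free frame          */
        frames[frame] = backing_store[page]; /* step 4: read the page in         */
        page_table[page].valid = true;     /* step 5: mark the page resident     */
        page_table[page].frame = frame;
    }

    static int access(int page)            /* one memory reference               */
    {
        if (page < 0 || page >= NPAGES) {  /* step 2: invalid reference          */
            fprintf(stderr, "illegal access to page %d\n", page);
            return -1;
        }
        if (!page_table[page].valid) {     /* valid-invalid bit is 0: trap       */
            printf("page fault on page %d\n", page);
            handle_fault(page);
        }
        return frames[page_table[page].frame]; /* step 6: retry the access       */
    }

    int main(void)
    {
        for (int p = 0; p < NPAGES; p++)
            backing_store[p] = 100 + p;    /* fill the "disk" image              */
        printf("read %d\n", access(3));    /* faults, then returns 103           */
        printf("read %d\n", access(3));    /* now resident: no fault             */
        return 0;
    }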
In the extreme case, we can start executing a process with no pages in
memory. When the operating system sets the instruction pointer to the first
instruction of the process, which is on a non-memory-resident page, the process
immediately faults for the page. After this page is brought into memory, the
process continues to execute, faulting as necessary until every page that it
needs is in memory. At that point, it can execute with no more faults. This
scheme is pure demand paging: never bring a page into memory until it is
required.
Theoretically, some programs could access several new pages of memory
with each instruction execution (one page for the instruction and many for
data), possibly causing multiple page faults per instruction. This situation
would result in unacceptable system performance. Fortunately, analysis of
running processes shows that this behavior is exceedingly unlikely. Programs
tend to have locality of reference, described in Section 10.6.1, which results in
reasonable performance from demand paging.
The hardware to support demand paging is the same as the hardware for
paging and swapping:
* Page table. This table has the ability to mark an entry invalid through a
valid–invalid bit or a special value of protection bits.
* Secondary memory. This memory holds those pages that are not present
in main memory. The secondary memory is usually a high-speed disk or
NVM device. It is known as the swap device, and the section of storage
used for this purpose is known as swap space. Swap-space allocation is
discussed in Chapter 11.
A crucial requirement for demand paging is the ability to restart any
instruction after a page fault. Because we save the state (registers, condition
code, instruction counter) of the interrupted process when the page fault
occurs, we must be able to restart the process in exactly the same place and
state, except that the desired page is now in memory and is accessible. In most
cases, this requirement is easy to meet. A page fault may occur at any memory
reference. If the page fault occurs on the instruction fetch, we can restart by
fetching the instruction again. If a page fault occurs while we are fetching an
operand, we must fetch and decode the instruction again and then fetch the
operand.
As a worst-case example, consider a three-address instruction such as ADD
the content of A to B, placing the result in C. These are the steps to execute this
instruction:
1. Fetch and decode the instruction (ADD).
2. Fetch A.
3. Fetch B.
4. Add A and B.
5. Store the sum in C.
If we fault when we try to store in C (because C is in a page not currently
in memory), we will have to get the desired page, bring it in, correct the
page table, and restart the instruction. The restart will require fetching the
instruction again, decoding it again, fetching the two operands again, and
then adding again. However, there is not much repeated work (less than one
complete instruction), and the repetition is necessary only when a page fault
occurs.
The major difficulty arises when one instruction may modify several different
locations. For example, consider the IBM System 360/370 MVC (move
character) instruction, which can move up to 256 bytes from one location to
another (possibly overlapping) location. If either block (source or destination)
straddles a page boundary, a page fault might occur after the move is partially
done. In addition, if the source and destination blocks overlap, the source
block may have been modified, in which case we cannot simply restart the
instruction.
This problem can be solved in two different ways. In one solution, the
microcode computes and attempts to access both ends of both blocks. If a page
fault is going to occur, it will happen at this step, before anything is modified.
The move can then take place; we know that no page fault can occur, since all
the relevant pages are in memory. The other solution uses temporary registers
to hold the values of overwritten locations. If there is a page fault, all the old
values are written back into memory before the trap occurs. This action restores
memory to its state before the instruction was started, so that the instruction
can be repeated.
This is by no means the only architectural problem resulting from adding
paging to an existing architecture to allow demand paging, but it illustrates
some of the difficulties involved. Paging is added between the CPU and the
memory in a computer system. It should be entirely transparent to a process.
Thus, people often assume that paging can be added to any system. Although
this assumption is true for a non-demand-paging environment, where a page
fault represents a fatal error, it is not true where a page fault means only that
an additional page must be brought into memory and the process restarted.

7
Q

Free-Frame List

A

When a page fault occurs, the operating system must bring the desired page
from secondary storage into main memory. To resolve page faults, most operating
systems maintain a free-frame list, a pool of free frames for satisfying
such requests (Figure 10.6). (Free frames must also be allocated when the stack
or heap segments from a process expand.) Operating systems typically allocate
free frames using a technique known as zero-fill-on-demand. Zero-fill-on-demand
frames are “zeroed-out” before being allocated, thus erasing their
previous contents. (Consider the potential security implications of not clearing
out the contents of a frame before reassigning it.)
When a system starts up, all available memory is placed on the free-frame
list. As free frames are requested (for example, through demand paging), the
size of the free-frame list shrinks. At some point, the list either falls to zero or
falls below a certain threshold, at which point it must be repopulated. We cover
strategies for both of these situations in Section 10.4.

8
Q

Performance of Demand Paging

A

Demand paging can significantly affect the performance of a computer system.
To see why, let’s compute the effective access time for a demand-paged memory.
Assume the memory-access time, denoted ma, is 10 nanoseconds. As long
as we have no page faults, the effective access time is equal to the memory
access time. If, however, a page fault occurs, we must first read the relevant
page from secondary storage and then access the desired word.
Let p be the probability of a page fault (0 ≤ p ≤ 1). We would expect p to
be close to zero—that is, we would expect to have only a few page faults. The
effective access time is then
effective access time = (1 − p) × ma + p × page fault time.
To compute the effective access time, we must know how much time is
needed to service a page fault. A page fault causes the following sequence to
occur:
1. Trap to the operating system.
2. Save the registers and process state.
3. Determine that the interrupt was a page fault.
4. Check that the page reference was legal, and determine the location of the
page in secondary storage.
5. Issue a read from the storage to a free frame:
a. Wait in a queue until the read request is serviced.
b. Wait for the device seek and/or latency time.
c. Begin the transfer of the page to a free frame.
6. While waiting, allocate the CPU core to some other process.
7. Receive an interrupt from the storage I/O subsystem (I/O completed).
8. Save the registers and process state for the other process (if step 6 is
executed).
9. Determine that the interrupt was from the secondary storage device.
10. Correct the page table and other tables to show that the desired page is
now in memory.
11. Wait for the CPU core to be allocated to this process again.
12. Restore the registers, process state, and new page table, and then resume
the interrupted instruction.
Not all of these steps are necessary in every case. For example, we are assuming
that, in step 6, the CPU is allocated to another process while the I/O occurs.
This arrangement allows multiprogramming to maintain CPU utilization but
requires additional time to resume the page-fault service routine when the I/O
transfer is complete.
In any case, there are three major task components of the page-fault service
time:
1. Service the page-fault interrupt.
2. Read in the page.
3. Restart the process.
The first and third tasks can be reduced, with careful coding, to several
hundred instructions. These tasks may take from 1 to 100 microseconds each.
Let’s consider the case of HDDs being used as the paging device. The page-switch
time will probably be close to 8 milliseconds. (A typical hard disk has
an average latency of 3 milliseconds, a seek of 5 milliseconds, and a transfer
time of 0.05 milliseconds. Thus, the total paging time is about 8 milliseconds,
including hardware and software time.) Remember also that we are looking at
only the device-service time. If a queue of processes is waiting for the device,
we have to add queuing time as we wait for the paging device to be free to
service our request, increasing even more the time to page in.
With an average page-fault service time of 8 milliseconds and a memory-access
time of 200 nanoseconds, the effective access time in nanoseconds is
effective access time = (1 − p) × (200) + p × (8 milliseconds)
= (1 − p) × 200 + p × 8,000,000
= 200 + 7,999,800 × p.
We see, then, that the effective access time is directly proportional to the
page-fault rate. If one access out of 1,000 causes a page fault, the effective access
time is 8.2 microseconds. The computer will be slowed down by a factor of 40
because of demand paging! If we want performance degradation to be less than
10 percent, we need to keep the probability of page faults at the following level:
220 > 200 + 7,999,800 × p,
20 > 7,999,800 × p,
p < 0.0000025.
That is, to keep the slowdown due to paging at a reasonable level, we can allow
fewer than one memory access out of 399,990 to page-fault. In sum, it is important
to keep the page-fault rate low in a demand-paging system. Otherwise,
the effective access time increases, slowing process execution dramatically.
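The arithmetic above is easy to check in code. This sketch plugs in the same
assumed figures (a 200-nanosecond memory access and an 8-millisecond page-fault
service time) and also solves for the largest fault probability that keeps the
slowdown under 10 percent.

    /* Effective access time for demand paging, using the figures assumed above. */
    #include <stdio.h>

    int main(void)
    {
        const double ma    = 200.0;       /* memory-access time, ns               */
        const double fault = 8000000.0;   /* page-fault service time, ns (8 ms)   */

        double p   = 1.0 / 1000.0;        /* one fault per 1,000 accesses         */
        double eat = (1.0 - p) * ma + p * fault;
        printf("p = %g -> EAT = %.1f ns (slowdown of about %.0fx)\n",
               p, eat, eat / ma);         /* about 8,200 ns, roughly 40x slower   */

        /* Largest p with degradation under 10 percent, i.e., EAT < 220 ns.       */
        double p_max = (220.0 - ma) / (fault - ma);
        printf("for <10%% slowdown, p < %.7f (about 1 fault in %.0f accesses)\n",
               p_max, 1.0 / p_max);
        return 0;
    }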
An additional aspect of demand paging is the handling and overall use of
swap space. I/O to swap space is generally faster than that to the file system. It
is faster because swap space is allocated in much larger blocks, and file lookups
and indirect allocation methods are not used (Chapter 11). One option for the system to gain better paging throughput is by copying an entire file image into
the swap space at process startup and then performing demand paging from
the swap space. The obvious disadvantage of this approach is the copying of
the file image at program start-up. A second option—and one practiced by
several operating systems, including Linux and Windows—is to demand-page
from the file system initially but to write the pages to swap space as they are
replaced. This approach will ensure that only needed pages are read from the
file system but that all subsequent paging is done from swap space.
Some systems attempt to limit the amount of swap space used through
demand paging of binary executable files. Demand pages for such files are
brought directly from the file system. However, when page replacement is
called for, these frames can simply be overwritten (because they are never
modified), and the pages can be read in from the file system again if needed.
Using this approach, the file system itself serves as the backing store. However,
swap space must still be used for pages not associated with a file (known as
anonymous memory); these pages include the stack and heap for a process.
This method appears to be a good compromise and is used in several systems,
including Linux and BSD UNIX.
As described in Section 9.5.3, mobile operating systems typically do not
support swapping. Instead, these systems demand-page from the file system
and reclaim read-only pages (such as code) from applications if memory
becomes constrained. Such data can be demand-paged from the file system if
it is later needed. Under iOS, anonymous memory pages are never reclaimed
from an application unless the application is terminated or explicitly releases
the memory. In Section 10.7, we cover compressed memory, a commonly used
alternative to swapping in mobile systems.

9
Q

Copy-on-Write

A

In Section 10.2, we illustrated how a process can start quickly by demand-paging
in the page containing the first instruction. However, process creation
using the fork() system call may initially bypass the need for demand paging
by using a technique similar to page sharing (covered in Section 9.3.4). This
technique provides rapid process creation and minimizes the number of new
pages that must be allocated to the newly created process.
Recall that the fork() system call creates a child process that is a duplicate
of its parent. Traditionally, fork() worked by creating a copy of the parent’s
address space for the child, duplicating the pages belonging to the parent.
However, considering that many child processes invoke the exec() system call
immediately after creation, the copying of the parent’s address space may be
unnecessary. Instead, we can use a technique known as copy-on-write, which
works by allowing the parent and child processes initially to share the same
pages. These shared pages are marked as copy-on-write pages, meaning that
if either process writes to a shared page, a copy of the shared page is created.
Copy-on-write is illustrated in Figures 10.7 and 10.8, which show the contents
of the physical memory before and after process 1 modifies page C.
For example, assume that the child process attempts to modify a page
containing portions of the stack, with the pages set to be copy-on-write. The
operating system will obtain a frame from the free-frame list and create a copy
of this page, mapping it to the address space of the child process. The child
process will then modify its copied page and not the page belonging to the
parent process. Obviously, when the copy-on-write technique is used, only the
pages that are modified by either process are copied; all unmodified pages
can be shared by the parent and child processes. Note, too, that only pages
that can be modified need be marked as copy-on-write. Pages that cannot
be modified (pages containing executable code) can be shared by the parent
and child. Copy-on-write is a common technique used by several operating
systems, including Windows, Linux, and macOS.
Several versions of UNIX (including Linux, macOS, and BSD UNIX) provide
a variation of the fork() system call—vfork() (for virtual memory fork)—
that operates differently from fork() with copy-on-write. With vfork(), the
parent process is suspended, and the child process uses the address space of
the parent. Because vfork() does not use copy-on-write, if the child process
changes any pages of the parent’s address space, the altered pages will be
visible to the parent once it resumes. Therefore, vfork() must be used with
caution to ensure that the child process does not modify the address space of
the parent. vfork() is intended to be used when the child process calls exec()
immediately after creation. Because no copying of pages takes place, vfork() is an extremely efficient method of process creation and is sometimes used to
implement UNIX command-line shell interfaces.
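The effect of copy-on-write can be observed from an ordinary user program. The
sketch below assumes a POSIX system whose fork() is implemented with
copy-on-write (as in Linux, BSD UNIX, and macOS): the child's write to a shared
data page triggers a copy, so the parent's value is unchanged.

    /* After fork(), parent and child initially share the page holding 'value'.
       The child's store causes a copy-on-write fault, giving the child its own
       private copy of that page; the parent still sees the original value.     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int value = 42;                        /* lives in a copy-on-write data page */

    int main(void)
    {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid == 0) {                    /* child                              */
            value = 99;                    /* write -> private copy of the page  */
            printf("child:  value = %d\n", value);   /* prints 99                */
            _exit(0);
        }
        wait(NULL);                        /* parent waits for the child         */
        printf("parent: value = %d\n", value);       /* still prints 42          */
        return 0;
    }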

10
Q

Page Replacement

A

In our earlier discussion of the page-fault rate, we assumed that each page
faults at most once, when it is first referenced. This representation is not strictly
accurate, however. If a process of ten pages actually uses only half of them, then
demand paging saves the I/O necessary to load the five pages that are never
used. We could also increase our degree of multiprogramming by running
twice as many processes. Thus, if we had forty frames, we could run eight
processes, rather than the four that could run if each required ten frames (five
of which were never used).
If we increase our degree of multiprogramming, we are over-allocating
memory. If we run six processes, each of which is ten pages in size but actually
uses only five pages, we have higher CPU utilization and throughput, with
ten frames to spare. It is possible, however, that each of these processes, for
a particular data set, may suddenly try to use all ten of its pages, resulting in a
need for sixty frames when only forty are available.
Further, consider that system memory is not used only for holding program
pages. Buffers for I/O also consume a considerable amount of memory. This use
can increase the strain on memory-placement algorithms. Deciding how much
memory to allocate to I/O and how much to program pages is a significant
challenge. Some systems allocate a fixed percentage of memory for I/O buffers,
whereas others allow both processes and the I/O subsystem to compete for all
system memory. Section 14.6 discusses the integrated relationship between I/O
buffers and virtual memory techniques.
Over-allocation of memory manifests itself as follows. While a process is
executing, a page fault occurs. The operating system determines where the
desired page is residing on secondary storage but then finds that there are
no free frames on the free-frame list; all memory is in use. This situation is
illustrated in Figure 10.9, where the fact that there are no free frames is depicted
by a question mark.
The operating system has several options at this point. It could terminate
the process. However, demand paging is the operating system’s attempt to
improve the computer system’s utilization and throughput. Users should not
be aware that their processes are running on a paged system—paging should
be logically transparent to the user. So this option is not the best choice.
The operating system could instead use standard swapping and swap
out a process, freeing all its frames and reducing the level of multiprogramming.
However, as discussed in Section 9.5, standard swapping is no longer
used by most operating systems due to the overhead of copying entire processes
between memory and swap space. Most operating systems now combine
swapping pages with page replacement, a technique we describe in detail
in the remainder of this section.

11
Q

Basic Page Replacement

A

Page replacement takes the following approach. If no frame is free, we find one
that is not currently being used and free it. We can free a frame by writing its contents to swap space and changing the page table (and all other tables) to
indicate that the page is no longer in memory (Figure 10.10). We can now use
the freed frame to hold the page for which the process faulted. We modify the
page-fault service routine to include page replacement:
1. Find the location of the desired page on secondary storage.
2. Find a free frame:
a. If there is a free frame, use it.
b. If there is no free frame, use a page-replacement algorithm to select
a victim frame.
c. Write the victim frame to secondary storage (if necessary); change
the page and frame tables accordingly.
3. Read the desired page into the newly freed frame; change the page and
frame tables.
4. Continue the process from where the page fault occurred.
Notice that, if no frames are free, two page transfers (one for the page-out
and one for the page-in) are required. This situation effectively doubles the
page-fault service time and increases the effective access time accordingly.
We can reduce this overhead by using a modify bit (or dirty bit). When this
scheme is used, each page or frame has a modify bit associated with it in the
hardware. The modify bit for a page is set by the hardware whenever any byte
in the page is written into, indicating that the page has been modified. When
we select a page for replacement, we examine its modify bit. If the bit is set, we know that the page has been modified since it was read in from secondary
storage. In this case, we must write the page to storage. If the modify bit is not
set, however, the page has not been modified since it was read into memory. In
this case, we need not write the memory page to storage: it is already there. This
technique also applies to read-only pages (for example, pages of binary code).
Such pages cannot be modified; thus, they may be discarded when desired.
This scheme can significantly reduce the time required to service a page fault,
since it reduces I/O time by one-half if the page has not been modified.
Page replacement is basic to demand paging. It completes the separation
between logical memory and physical memory. With this mechanism, an enormous
virtual memory can be provided for programmers on a smaller physical
memory. With no demand paging, logical addresses are mapped into physical
addresses, and the two sets of addresses can be different. All the pages of a
process still must be in physical memory, however. With demand paging, the
size of the logical address space is no longer constrained by physical memory.
If we have a process of twenty pages, we can execute it in ten frames simply by
using demand paging and using a replacement algorithm to find a free frame
whenever necessary. If a page that has been modified is to be replaced, its
contents are copied to secondary storage. A later reference to that page will
cause a page fault. At that time, the page will be brought back into memory,
perhaps replacing some other page in the process.
We must solve two major problems to implement demand paging: we must
develop a frame-allocation algorithm and a page-replacement algorithm.
That is, if we have multiple processes in memory, we must decide how many
frames to allocate to each process; and when page replacement is required,
we must select the frames that are to be replaced. Designing appropriate algorithms
to solve these problems is an important task, because secondary storage I/O is so expensive. Even slight improvements in demand-paging methods
yield large gains in system performance.
There are many different page-replacement algorithms. Every operating
system probably has its own replacement scheme. How do we select a particular
replacement algorithm? In general, we want the one with the lowest
page-fault rate.
We evaluate an algorithm by running it on a particular string of memory
references and computing the number of page faults. The string of memory
references is called a reference string. We can generate reference strings artificially
(by using a random-number generator, for example), or we can trace
a given system and record the address of each memory reference. The latter
choice produces a large amount of data (on the order of 1 million addresses
per second). To reduce the amount of data, we use two facts.
First, for a given page size (and the page size is generally fixed by the hardware
or system), we need to consider only the page number, rather than the
entire address. Second, if we have a reference to a page p, then any references
to page p that immediately follow will never cause a page fault. Page p will
be in memory after the first reference, so the immediately following references
will not fault.
For example, if we trace a particular process, we might record the following
address sequence:
0100, 0432, 0101, 0612, 0102, 0103, 0104, 0101, 0611, 0102, 0103,
0104, 0101, 0610, 0102, 0103, 0104, 0101, 0609, 0102, 0105
At 100 bytes per page, this sequence is reduced to the following reference
string:
1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1
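The reduction is mechanical. This sketch reproduces it from the raw trace above:
each address is divided by the 100-byte page size, and immediately repeated page
numbers are dropped because they cannot fault.

    /* Reduce the address trace above to a reference string. */
    #include <stdio.h>

    int main(void)
    {
        const int page_size = 100;
        int trace[] = { 100, 432, 101, 612, 102, 103, 104, 101, 611, 102, 103,
                        104, 101, 610, 102, 103, 104, 101, 609, 102, 105 };
        int n = sizeof trace / sizeof trace[0];
        int last = -1;

        for (int i = 0; i < n; i++) {
            int page = trace[i] / page_size;
            if (page != last) {            /* consecutive repeats never fault    */
                printf("%d ", page);
                last = page;
            }
        }
        printf("\n");                      /* prints: 1 4 1 6 1 6 1 6 1 6 1      */
        return 0;
    }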
To determine the number of page faults for a particular reference string and
page-replacement algorithm, we also need to know the number of page frames
available. Obviously, as the number of frames available increases, the number
of page faults decreases. For the reference string considered previously, for
example, if we had three or more frames, we would have only three faults—
one fault for the first reference to each page. In contrast, with only one frame
available, we would have a replacement with every reference, resulting in
eleven faults. In general, we expect a curve such as that in Figure 10.11. As the
number of frames increases, the number of page faults drops to some minimal
level. Of course, adding physical memory increases the number of frames.
We next illustrate several page-replacement algorithms. In doing so, we
use the reference string
7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
for a memory with three frames.

12
Q

FIFO Page Replacement

A

The simplest page-replacement algorithm is a first-in, first-out (FIFO) algorithm.
A FIFO replacement algorithm associates with each page the time when
that page was brought into memory. When a page must be replaced, the oldest
page is chosen. Notice that it is not strictly necessary to record the time when
a page is brought in. We can create a FIFO queue to hold all pages in memory. We
replace the page at the head of the queue. When a page is brought into memory,
we insert it at the tail of the queue.
For our example reference string, our three frames are initially empty. The
first three references (7, 0, 1) cause page faults and are brought into these empty
frames. The next reference (2) replaces page 7, because page 7 was brought in
first. Since 0 is the next reference and 0 is already in memory, we have no fault
for this reference. The first reference to 3 results in replacement of page 0, since
it is now first in line. Because of this replacement, the next reference, to 0, will
fault. Page 1 is then replaced by page 0. This process continues as shown in
Figure 10.12. Every time a fault occurs, we show which pages are in our three
frames. There are fifteen faults altogether.
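The count is easy to verify by simulation. This sketch runs FIFO replacement
over the same reference string with three frames; replacing the slot indicated
by a rotating index is equivalent to replacing the page at the head of the FIFO
queue.

    /* FIFO page replacement on the sample reference string, three frames. */
    #include <stdio.h>

    #define NFRAMES 3

    int main(void)
    {
        int ref[] = { 7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1 };
        int n = sizeof ref / sizeof ref[0];
        int frames[NFRAMES];
        int used = 0, oldest = 0, faults = 0;

        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int f = 0; f < used; f++)
                if (frames[f] == ref[i]) { hit = 1; break; }
            if (hit)
                continue;
            faults++;
            if (used < NFRAMES) {
                frames[used++] = ref[i];          /* fill an empty frame         */
            } else {
                frames[oldest] = ref[i];          /* replace the oldest page     */
                oldest = (oldest + 1) % NFRAMES;
            }
        }
        printf("FIFO faults: %d\n", faults);      /* prints: FIFO faults: 15     */
        return 0;
    }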
The FIFO page-replacement algorithm is easy to understand and program.
However, its performance is not always good. On the one hand, the page
replaced may be an initialization module that was used a long time ago and is
no longer needed. On the other hand, it could contain a heavily used variable
that was initialized early and is in constant use.
Notice that, even if we select for replacement a page that is in active use,
everything still works correctly. After we replace an active page with a new one, a fault occurs almost immediately to retrieve the active page. Some other
page must be replaced to bring the active page back into memory. Thus, a bad
replacement choice increases the page-fault rate and slows process execution.
It does not, however, cause incorrect execution.
To illustrate the problems that are possible with a FIFO page-replacement
algorithm, consider the following reference string:
1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Figure 10.13 shows the curve of page faults for this reference string versus the
number of available frames. Notice that the number of faults for four frames
(ten) is greater than the number of faults for three frames (nine)! This most
unexpected result is known as Belady’s anomaly: for some page-replacement
algorithms, the page-fault rate may increase as the number of allocated frames
increases. We would expect that giving more memory to a process would
improve its performance. In some early research, investigators noticed that this
assumption was not always true. Belady’s anomaly was discovered as a result.

13
Q

Optimal Page Replacement

A

One result of the discovery of Belady’s anomaly was the search for an optimal
page-replacement algorithm—the algorithm that has the lowest page-fault
rate of all algorithms and will never suffer from Belady’s anomaly. Such an
algorithm does exist and has been called OPT or MIN. It is simply this:
Replace the page that will not be used for the longest period of time.
Use of this page-replacement algorithm guarantees the lowest possible page-fault
rate for a fixed number of frames.
For example, on our sample reference string, the optimal page-replacement
algorithm would yield nine page faults, as shown in Figure 10.14. The first three
references cause faults that fill the three empty frames. The reference to page
2 replaces page 7, because page 7 will not be used until reference 18, whereas
page 0 will be used at 5, and page 1 at 14. The reference to page 3 replaces
page 1, as page 1 will be the last of the three pages in memory to be referenced
again. With only nine page faults, optimal replacement is much better than
a FIFO algorithm, which results in fifteen faults. (If we ignore the first three,
which all algorithms must suffer, then optimal replacement is twice as good as
FIFO replacement.) In fact, no replacement algorithm can process this reference
string in three frames with fewer than nine faults.
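Because the entire reference string is known ahead of time, OPT can be simulated
offline. This sketch evicts the resident page whose next use lies furthest in
the future and reports nine faults for the same string and three frames.

    /* Optimal (OPT) page replacement on the sample reference string. */
    #include <stdio.h>

    #define NFRAMES 3

    static int next_use(const int *ref, int n, int from, int page)
    {
        for (int i = from; i < n; i++)
            if (ref[i] == page)
                return i;
        return n;                                 /* never referenced again      */
    }

    int main(void)
    {
        int ref[] = { 7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1 };
        int n = sizeof ref / sizeof ref[0];
        int frames[NFRAMES];
        int used = 0, faults = 0;

        for (int i = 0; i < n; i++) {
            int hit = 0;
            for (int f = 0; f < used; f++)
                if (frames[f] == ref[i]) { hit = 1; break; }
            if (hit)
                continue;
            faults++;
            if (used < NFRAMES) {
                frames[used++] = ref[i];
                continue;
            }
            int victim = 0, farthest = -1;        /* page used furthest ahead    */
            for (int f = 0; f < NFRAMES; f++) {
                int when = next_use(ref, n, i + 1, frames[f]);
                if (when > farthest) { farthest = when; victim = f; }
            }
            frames[victim] = ref[i];
        }
        printf("OPT faults: %d\n", faults);       /* prints: OPT faults: 9       */
        return 0;
    }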
Unfortunately, the optimal page-replacement algorithm is difficult to
implement, because it requires future knowledge of the reference string.
(We encountered a similar situation with the SJF CPU-scheduling algorithm in
Section 5.3.2.) As a result, the optimal algorithm is used mainly for comparison
studies. For instance, it may be useful to know that, although a new algorithm
is not optimal, it is within 12.3 percent of optimal at worst and within 4.7
percent on average.

14
Q

LRU Page Replacement

A

If the optimal algorithm is not feasible, perhaps an approximation of the optimal
algorithm is possible. The key distinction between the FIFO and OPT algorithms
(other than looking backward versus forward in time) is that the FIFO
algorithm uses the time when a page was brought into memory, whereas the
OPT algorithm uses the time when a page is to be used. If we use the recent past
as an approximation of the near future, then we can replace the page that has
not been used for the longest period of time. This approach is the least recently
used (LRU) algorithm.
LRU replacement associates with each page the time of that page’s last use.
When a page must be replaced, LRU chooses the page that has not been used
for the longest period of time. We can think of this strategy as the optimal
page-replacement algorithm looking backward in time, rather than forward.
(Strangely, if we let SR be the reverse of a reference string S, then the page-fault
rate for the OPT algorithm on S is the same as the page-fault rate for the OPT
algorithm on SR. Similarly, the page-fault rate for the LRU algorithm on S is the
same as the page-fault rate for the LRU algorithm on SR.)
The result of applying LRU replacement to our example reference string is
shown in Figure 10.15. The LRU algorithm produces twelve faults. Notice that
the first five faults are the same as those for optimal replacement. When the
reference to page 4 occurs, however, LRU replacement sees that, of the three
frames in memory, page 2 was used least recently. Thus, the LRU algorithm
replaces page 2, not knowing that page 2 is about to be used. When it then faults
for page 2, the LRU algorithm replaces page 3, since it is now the least recently
used of the three pages in memory. Despite these problems, LRU replacement
with twelve faults is much better than FIFO replacement with fifteen.
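This result can also be verified by simulation. The sketch below records a time
of last use for each frame (essentially the counter implementation described
below) and reports twelve faults for the same string and three frames.

    /* LRU page replacement on the sample reference string, three frames. */
    #include <stdio.h>

    #define NFRAMES 3

    int main(void)
    {
        int ref[] = { 7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1 };
        int n = sizeof ref / sizeof ref[0];
        int frames[NFRAMES], last_use[NFRAMES];
        int used = 0, faults = 0;

        for (int i = 0; i < n; i++) {
            int hit = -1;
            for (int f = 0; f < used; f++)
                if (frames[f] == ref[i]) { hit = f; break; }
            if (hit >= 0) {                       /* update time of last use     */
                last_use[hit] = i;
                continue;
            }
            faults++;
            if (used < NFRAMES) {
                frames[used] = ref[i];
                last_use[used] = i;
                used++;
            } else {
                int victim = 0;                   /* least recently used frame   */
                for (int f = 1; f < NFRAMES; f++)
                    if (last_use[f] < last_use[victim]) victim = f;
                frames[victim] = ref[i];
                last_use[victim] = i;
            }
        }
        printf("LRU faults: %d\n", faults);       /* prints: LRU faults: 12      */
        return 0;
    }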
The LRU policy is often used as a page-replacement algorithm and is considered
to be good. The major problem is how to implement LRU replacement.
An LRU page-replacement algorithm may require substantial hardware assistance.
The problem is to determine an order for the frames defined by the time
of last use. Two implementations are feasible:
* Counters. In the simplest case, we associate with each page-table entry a
time-of-use field and add to the CPU a logical clock or counter. The clock is
incremented for every memory reference. Whenever a reference to a page
is made, the contents of the clock register are copied to the time-of-use
field in the page-table entry for that page. In this way, we always have
the “time” of the last reference to each page. We replace the page with the
smallest time value. This scheme requires a search of the page table to find
the LRU page and a write to memory (to the time-of-use field in the page
table) for each memory access. The times must also be maintained when
page tables are changed (due to CPU scheduling). Overflow of the clock
must be considered.
* Stack. Another approach to implementing LRU replacement is to keep a
stack of page numbers. Whenever a page is referenced, it is removed from
the stack and put on the top. In this way, the most recently used page is
always at the top of the stack, and the least recently used page is always
at the bottom (Figure 10.16). Because entries must be removed from the
middle of the stack, it is best to implement this approach by using a doubly
linked list with a head pointer and a tail pointer. Removing a page and
putting it on the top of the stack then requires changing six pointers at
worst. Each update is a little more expensive, but there is no search for
a replacement; the tail pointer points to the bottom of the stack, which is
the LRU page. This approach is particularly appropriate for software or
microcode implementations of LRU replacement.
Like optimal replacement, LRU replacement does not suffer from Belady’s
anomaly. Both belong to a class of page-replacement algorithms, called stack
algorithms, that can never exhibit Belady’s anomaly. A stack algorithm is an
algorithm for which it can be shown that the set of pages in memory for n
frames is always a subset of the set of pages that would be in memory with n
+ 1 frames. For LRU replacement, the set of pages in memory would be the n
most recently referenced pages. If the number of frames is increased, these n
pages will still be the most recently referenced and so will still be in memory.
Note that neither implementation of LRU would be conceivable without
hardware assistance beyond the standard TLB registers. The updating of the
clock fields or stack must be done for every memory reference. If we were
to use an interrupt for every reference to allow software to update such data
structures, it would slow every memory reference by a factor of at least ten,
hence slowing every process by a factor of ten. Few systems could tolerate that
level of overhead for memory management.

15
Q

LRU-Approximation Page Replacement

A

Not many computer systems provide sufficient hardware support for true LRU
page replacement. In fact, some systems provide no hardware support, and
other page-replacement algorithms (such as a FIFO algorithm) must be used.
Many systems provide some help, however, in the form of a reference bit. The
reference bit for a page is set by the hardware whenever that page is referenced
(either a read or a write to any byte in the page). Reference bits are associated
with each entry in the page table.
Initially, all bits are cleared (to 0) by the operating system. As a process
executes, the bit associated with each page referenced is set (to 1) by the
hardware. After some time, we can determine which pages have been used and
which have not been used by examining the reference bits, although we do not
know the order of use. This information is the basis for many page-replacement
algorithms that approximate LRU replacement.

16
Q

Additional-Reference-Bits Algorithm

A

We can gain additional ordering information by recording the reference bits at
regular intervals. We can keep an 8-bit byte for each page in a table in memory.
At regular intervals (say, every 100 milliseconds), a timer interrupt transfers
control to the operating system. The operating system shifts the reference bit
for each page into the high-order bit of its 8-bit byte, shifting the other bits right
by 1 bit and discarding the low-order bit. These 8-bit shift registers contain the
history of page use for the last eight time periods. If the shift register contains
00000000, for example, then the page has not been used for eight time periods.
A page that is used at least once in each period has a shift register value of
11111111. A page with a history register value of 11000100 has been used more
recently than one with a value of 01110111. If we interpret these 8-bit bytes as
unsigned integers, the page with the lowest number is the LRU page, and it can
be replaced. Notice that the numbers are not guaranteed to be unique, however.
We can either replace (swap out) all pages with the smallest value or use the
FIFO method to choose among them.
The number of bits of history included in the shift register can be varied,
of course, and is selected (depending on the hardware available) to make the
updating as fast as possible. In the extreme case, the number can be reduced to
zero, leaving only the reference bit itself. This algorithm is called the second-
chance page-replacement algorithm.
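A sketch of the aging mechanism described above, with the shift step written out
explicitly; the pages, the reference pattern, and the three-interval run are
illustrative only.

    /* Additional-reference-bits (aging) approximation of LRU: at each timer
       interrupt, shift each page's reference bit into the high-order bit of
       its 8-bit history byte; the smallest byte marks the LRU candidate.    */
    #include <stdio.h>
    #include <stdint.h>

    #define NPAGES 4

    static uint8_t history[NPAGES];        /* 8 periods of reference history     */
    static int     ref_bit[NPAGES];        /* set by "hardware" on each use      */

    static void timer_tick(void)           /* runs every ~100 milliseconds       */
    {
        for (int p = 0; p < NPAGES; p++) {
            history[p] = (uint8_t)((ref_bit[p] << 7) | (history[p] >> 1));
            ref_bit[p] = 0;                /* clear for the next period          */
        }
    }

    int main(void)
    {
        /* Three periods: page 0 is used in every period, page 2 only once.      */
        ref_bit[0] = 1; ref_bit[2] = 1; timer_tick();
        ref_bit[0] = 1;                 timer_tick();
        ref_bit[0] = 1;                 timer_tick();

        for (int p = 0; p < NPAGES; p++)
            printf("page %d history = 0x%02x\n", p, history[p]);
        /* page 0: 0xe0 (11100000), page 2: 0x20 (00100000), pages 1 and 3: 0x00.
           The lowest value is the best replacement candidate.                   */
        return 0;
    }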

17
Q

Second-Chance Algorithm

A

The basic algorithm of second-chance replacement is a FIFO replacement algo-
rithm. When a page has been selected, however, we inspect its reference bit. If
the value is 0, we proceed to replace this page; but if the reference bit is set to
1, we give the page a second chance and move on to select the next FIFO page.
When a page gets a second chance, its reference bit is cleared, and its arrival
time is reset to the current time. Thus, a page that is given a second chance
will not be replaced until all other pages have been replaced (or given second
chances). In addition, if a page is used often enough to keep its reference bit
set, it will never be replaced.
One way to implement the second-chance algorithm (sometimes referred
to as the clock algorithm) is as a circular queue. A pointer (that is, a hand on
the clock) indicates which page is to be replaced next. When a frame is needed,
the pointer advances until it finds a page with a 0 reference bit. As it advances,
it clears the reference bits (Figure 10.17). Once a victim page is found, the page
is replaced, and the new page is inserted in the circular queue in that position.
Notice that, in the worst case, when all bits are set, the pointer cycles through
the whole queue, giving each page a second chance. It clears all the reference
bits before selecting the next page for replacement. Second-chance replacement
degenerates to FIFO replacement if all bits are set.
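A sketch of the clock scan; the number of frames, the resident pages, and the
reference bits are chosen only for illustration.

    /* Second-chance (clock) victim selection: advance the hand, clearing
       reference bits, until a frame with reference bit 0 is found.          */
    #include <stdio.h>

    #define NFRAMES 4

    static int page_in[NFRAMES] = { 7, 0, 1, 2 };  /* resident pages             */
    static int ref_bit[NFRAMES] = { 1, 0, 1, 1 };  /* set by hardware on use     */
    static int hand = 0;                           /* the clock hand             */

    static int select_victim(void)
    {
        for (;;) {
            if (ref_bit[hand] == 0) {              /* no second chance left      */
                int victim = hand;
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            ref_bit[hand] = 0;                     /* give a second chance       */
            hand = (hand + 1) % NFRAMES;
        }
    }

    int main(void)
    {
        int v = select_victim();
        printf("replace frame %d (page %d)\n", v, page_in[v]);
        /* Frame 0 (page 7) is skipped and its bit cleared; frame 1 (page 0),
           whose reference bit was already 0, becomes the victim.                */
        return 0;
    }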

18
Q

Enhanced Second-Chance Algorithm

A

We can enhance the second-chance algorithm by considering the reference bit
and the modify bit (described in Section 10.4.1) as an ordered pair. With these
two bits, we have the following four possible classes:
1. (0, 0) neither recently used nor modified —best page to replace
2. (0, 1) not recently used but modified —not quite as good, because the page
will need to be written out before replacement
3. (1, 0) recently used but clean—probably will be used again soon
4. (1, 1) recently used and modified—probably will be used again soon, and
the page will need to be written out to secondary storage before it can
be replaced
Each page is in one of these four classes. When page replacement is called
for, we use the same scheme as in the clock algorithm; but instead of examining
whether the page to which we are pointing has the reference bit set to 1,
we examine the class to which that page belongs. We replace the first page
encountered in the lowest nonempty class. Notice that we may have to scan the
circular queue several times before we find a page to be replaced. The major
difference between this algorithm and the simpler clock algorithm is that here
we give preference to those pages that have been modified in order to reduce
the number of I/Os required.
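A sketch of the class-based selection; the bit settings are illustrative, and a
real implementation would cycle a clock hand (possibly over several passes)
rather than scan from frame 0.

    /* Enhanced second-chance: classify each frame by its (reference, modify)
       pair and replace the first frame found in the lowest nonempty class.  */
    #include <stdio.h>

    #define NFRAMES 4

    static int ref_bit[NFRAMES] = { 1, 1, 0, 1 };
    static int mod_bit[NFRAMES] = { 0, 1, 1, 1 };

    static int class_of(int f)             /* 0:(0,0)  1:(0,1)  2:(1,0)  3:(1,1) */
    {
        return (ref_bit[f] << 1) | mod_bit[f];
    }

    static int select_victim(void)
    {
        for (int c = 0; c <= 3; c++)       /* lowest class first                 */
            for (int f = 0; f < NFRAMES; f++)
                if (class_of(f) == c)
                    return f;
        return -1;                         /* unreachable when NFRAMES > 0       */
    }

    int main(void)
    {
        int v = select_victim();
        printf("victim: frame %d, class (%d,%d)\n", v, ref_bit[v], mod_bit[v]);
        /* Frame 2 is (0,1): not recently used but modified, which is the
           lowest nonempty class here because no frame is (0,0).                 */
        return 0;
    }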

19
Q

Counting-Based Page Replacement

A

There are many other algorithms that can be used for page replacement. For
example, we can keep a counter of the number of references that have been
made to each page and develop the following two schemes.
* The least frequently used (LFU) page-replacement algorithm requires that
the page with the smallest count be replaced. The reason for this selection is
that an actively used page should have a large reference count. A problem
arises, however, when a page is used heavily during the initial phase of a process but then is never used again. Since it was used heavily, it has
a large count and remains in memory even though it is no longer needed.
One solution is to shift the counts right by 1 bit at regular intervals, forming
an exponentially decaying average usage count.
* The most frequently used (MFU) page-replacement algorithm is based
on the argument that the page with the smallest count was probably just
brought in and has yet to be used.
As you might expect, neither MFU nor LFU replacement is common. The imple-
mentation of these algorithms is expensive, and they do not approximate OPT
replacement well.
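The decaying-count variant of LFU mentioned above fits in a few lines; the page
counts, the amount of ongoing use, and the number of intervals are illustrative
only.

    /* LFU with exponential decay: shift every reference count right by one
       bit at each regular interval so that old activity gradually fades.    */
    #include <stdio.h>

    #define NPAGES 3

    static unsigned count[NPAGES] = { 120, 4, 0 }; /* accumulated references     */

    static void decay(void)                /* run once per interval              */
    {
        for (int p = 0; p < NPAGES; p++)
            count[p] >>= 1;
    }

    int main(void)
    {
        /* Page 0 was heavily used in the past but is now idle; page 1 keeps
           receiving a few references in every interval.                         */
        for (int interval = 0; interval < 6; interval++) {
            decay();
            count[1] += 4;                 /* ongoing light use of page 1        */
        }
        for (int p = 0; p < NPAGES; p++)
            printf("page %d count = %u\n", p, count[p]);
        /* Page 0's stale count (120) has decayed below page 1's, so LFU would
           now keep page 1 and evict page 0 or page 2 first.                     */
        return 0;
    }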

20
Q

Page-Buffering Algorithms

A

Other procedures are often used in addition to a specific page-replacement
algorithm. For example, systems commonly keep a pool of free frames. When
a page fault occurs, a victim frame is chosen as before. However, the desired
page is read into a free frame from the pool before the victim is written out. This
procedure allows the process to restart as soon as possible, without waiting for
the victim page to be written out. When the victim is later written out, its frame
is added to the free-frame pool.
An expansion of this idea is to maintain a list of modified pages. Whenever
the paging device is idle, a modified page is selected and is written to secondary
storage. Its modify bit is then reset. This scheme increases the probability that
a page will be clean when it is selected for replacement and will not need to be
written out.
Another modification is to keep a pool of free frames but to remember
which page was in each frame. Since the frame contents are not modified when
a frame is written to secondary storage, the old page can be reused directly from
the free-frame pool if it is needed before that frame is reused. No I/O is needed
in this case. When a page fault occurs, we first check whether the desired page
is in the free-frame pool. If it is not, we must select a free frame and read the
desired page into it.
Some versions of the UNIX system use this method in conjunction with
the second-chance algorithm. It can be a useful augmentation to any page-
replacement algorithm, to reduce the penalty incurred if the wrong victim page
is selected. We describe these —and other—modifications in Section 10.5.3.

21
Q

Applications and Page Replacement

A

In certain cases, applications accessing data through the operating system’s vir-
tual memory perform worse than if the operating system provided no buffer-
ing at all. A typical example is a database, which provides its own memory
management and I/O buffering. Applications like this understand their mem-
ory use and storage use better than does an operating system that is implement-
ing algorithms for general-purpose use. Furthermore, if the operating system
is buffering I/O and the application is doing so as well, then twice the memory
is being used for a set of I/O.
In another example, data warehouses frequently perform massive sequen-
tial storage reads, followed by computations and writes. The LRU algorithm would be removing old pages and preserving new ones, while the applica-
tion would more likely be reading older pages than newer ones (as it starts its
sequential reads again). Here, MFU would actually be more efficient than LRU.
Because of such problems, some operating systems give special programs
the ability to use a secondary storage partition as a large sequential array of
logical blocks, without any file-system data structures. This array is some-
times called the raw disk, and I/O to this array is termed raw I/O. Raw I/O
bypasses all the file-system services, such as file I/O demand paging, file
locking, prefetching, space allocation, file names, and directories. Note that
although certain applications are more efficient when implementing their own
special-purpose storage services on a raw partition, most applications perform
better when they use the regular file-system services.

22
Q

Allocation of Frames

A

We turn next to the issue of allocation. How do we allocate the fixed amount of
free memory among the various processes? If we have 93 free frames and two
processes, how many frames does each process get?
Consider a simple case of a system with 128 frames. The operating system
may take 35, leaving 93 frames for the user process. Under pure demand
paging, all 93 frames would initially be put on the free-frame list. When a user
process started execution, it would generate a sequence of page faults. The first
93 page faults would all get free frames from the free-frame list. When the
free-frame list was exhausted, a page-replacement algorithm would be used
to select one of the 93 in-memory pages to be replaced with the 94th, and so
on. When the process terminated, the 93 frames would once again be placed
on the free-frame list.
There are many variations on this simple strategy. We can require that the
operating system allocate all its buffer and table space from the free-frame list.
When this space is not in use by the operating system, it can be used to support
user paging. We can try to keep three free frames reserved on the free-frame list
at all times. Thus, when a page fault occurs, there is a free frame available to
page into. While the page swap is taking place, a replacement can be selected,
which is then written to the storage device as the user process continues to
execute. Other variants are also possible, but the basic strategy is clear: the
user process is allocated any free frame.

23
Q

Minimum Number of Frames

A

Our strategies for the allocation of frames are constrained in various ways.
We cannot, for example, allocate more than the total number of available
frames (unless there is page sharing). We must also allocate at least a minimum
number of frames. Here, we look more closely at the latter requirement.
One reason for allocating at least a minimum number of frames involves
performance. Obviously, as the number of frames allocated to each process
decreases, the page-fault rate increases, slowing process execution. In addi-
tion, remember that, when a page fault occurs before an executing instruction
is complete, the instruction must be restarted. Consequently, we must have
enough frames to hold all the different pages that any single instruction can
reference. For example, consider a machine in which all memory-reference instruc-
tions may reference only one memory address. In this case, we need at least one
frame for the instruction and one frame for the memory reference. In addition,
if one-level indirect addressing is allowed (for example, a load instruction on
frame 16 can refer to an address on frame 0, which is an indirect reference to
frame 23), then paging requires at least three frames per process. (Think about
what might happen if a process had only two frames.)
The minimum number of frames is defined by the computer architecture.
For example, if the move instruction for a given architecture includes more
than one word for some addressing modes, the instruction itself may straddle
two frames. In addition, if each of its two operands may be indirect references,
a total of six frames are required. As another example, the move instruction
for Intel 32- and 64-bit architectures allows data to move only from register to
register and between registers and memory; it does not allow direct memory-
to-memory movement, thereby limiting the required minimum number of
frames for a process.
Whereas the minimum number of frames per process is defined by the
architecture, the maximum number is defined by the amount of available
physical memory. In between, we are still left with significant choice in frame
allocation.

24
Q

Allocation Algorithms

A

The easiest way to split m frames among n processes is to give everyone an
equal share, m/n frames (ignoring frames needed by the operating system for
the moment). For instance, if there are 93 frames and 5 processes, each process
will get 18 frames. The 3 leftover frames can be used as a free-frame buffer pool.
This scheme is called equal allocation.
An alternative is to recognize that various processes will need differing
amounts of memory. Consider a system with a 1-KB frame size. If a small
student process of 10 KB and an interactive database of 127 KB are the only
two processes running in a system with 62 free frames, it does not make much
sense to give each process 31 frames. The student process does not need more
than 10 frames, so the other 21 are, strictly speaking, wasted.
To solve this problem, we can use proportional allocation, in which we
allocate available memory to each process according to its size. Let the size of
the virtual memory for process pi be si , and define
S = ∑ si.
Then, if the total number of available frames is m, we allocate ai frames to
process pi , where ai is approximately
ai = si /S × m.
Of course, we must adjust each ai to be an integer that is greater than the
minimum number of frames required by the instruction set, with a sum not
exceeding m.
With proportional allocation, we would split 62 frames between two processes, one of 10 pages and one of 127 pages, by allocating 4 frames and 57
frames, respectively, since
10/137 × 62 ≈ 4 and
127/137 × 62 ≈ 57.
In this way, both processes share the available frames according to their
“needs,” rather than equally.
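
As a rough illustration (not from the text), here is a small C sketch of proportional allocation for the example above. The architectural minimum of 2 frames and the handling of leftover frames are assumptions made only for the sketch.

#include <stdio.h>

#define MIN_FRAMES 2                 /* assumed architectural minimum */

/* Compute ai = si/S * m for each process, floored at MIN_FRAMES. */
static void proportional_allocation(const int size[], int alloc[], int n, int m)
{
    long S = 0;
    for (int i = 0; i < n; i++)
        S += size[i];

    int used = 0;
    for (int i = 0; i < n; i++) {
        int a = (int)((long)size[i] * m / S);   /* integer truncation of si/S * m */
        if (a < MIN_FRAMES)
            a = MIN_FRAMES;
        alloc[i] = a;
        used += a;
    }
    /* Any leftover frames (here, 62 - 4 - 57 = 1) can serve as a free-frame pool. */
    printf("leftover frames: %d\n", m - used);
}

int main(void)
{
    int size[]  = { 10, 127 };   /* the 10-page and 127-page processes from the example */
    int alloc[2];

    proportional_allocation(size, alloc, 2, 62);
    for (int i = 0; i < 2; i++)
        printf("process %d: %d frames\n", i, alloc[i]);
    return 0;
}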
In both equal and proportional allocation, of course, the allocation may
vary according to the multiprogramming level. If the multiprogramming level
is increased, each process will lose some frames to provide the memory needed
for the new process. Conversely, if the multiprogramming level decreases, the
frames that were allocated to the departed process can be spread over the
remaining processes.
Notice that, with either equal or proportional allocation, a high-priority
process is treated the same as a low-priority process. By its definition, however,
we may want to give the high-priority process more memory to speed its
execution, to the detriment of low-priority processes. One solution is to use
a proportional allocation scheme wherein the ratio of frames depends not on
the relative sizes of processes but rather on the priorities of processes or on a
combination of size and priority.
Next, we focus on one possible strategy that we can use to implement a
global page-replacement policy. With this approach, we satisfy all memory
requests from the free-frame list, but rather than waiting for the list to drop to
zero before we begin selecting pages for replacement, we trigger page replace-
ment when the list falls below a certain threshold. This strategy attempts to
ensure there is always sufficient free memory to satisfy new requests.
Such a strategy is depicted in Figure 10.18. The strategy’s purpose is to
keep the amount of free memory above a minimum threshold. When it drops below this threshold, a kernel routine is triggered that begins reclaiming pages
from all processes in the system (typically excluding the kernel). Such kernel
routines are often known as reapers, and they may apply any of the page-
replacement algorithms covered in Section 10.4. When the amount of free
memory reaches the maximum threshold, the reaper routine is suspended, only
to resume once free memory again falls below the minimum threshold.
In Figure 10.18, we see that at point a the amount of free memory drops
below the minimum threshold, and the kernel begins reclaiming pages and
adding them to the free-frame list. It continues until the maximum threshold is
reached (point b). Over time, there are additional requests for memory, and at
point c the amount of free memory again falls below the minimum threshold.
Page reclamation resumes, only to be suspended when the amount of free
memory reaches the maximum threshold (point d). This process continues as
long as the system is running.
As mentioned above, the kernel reaper routine may adopt any page-
replacement algorithm, but typically it uses some form of LRU approximation.
Consider what may happen, though, if the reaper routine is unable to keep
the amount of free memory above the minimum threshold. Under these circum-
stances, the reaper routine may begin to reclaim pages more aggressively. For
example, perhaps it will suspend the second-chance algorithm and use pure
FIFO. Another, more extreme, example occurs in Linux; when the amount of
free memory falls to very low levels, a routine known as the out-of-memory
(OOM) killer selects a process to terminate, thereby freeing its memory. How
does Linux determine which process to terminate? Each process has what is
known as an OOM score, with a higher score increasing the likelihood that the
process could be terminated by the OOM killer routine. OOM scores are calcu-
lated according to the percentage of memory a process is using—the higher
the percentage, the higher the OOM score. (OOM scores can be viewed in the
/proc file system; for example, the score for the process with pid 2500 appears
in /proc/2500/oom_score.)
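
As a small illustration, the score could also be read programmatically from the /proc file system; pid 2500 is just the example value used above.

#include <stdio.h>

int main(void)
{
    /* Read the OOM score of the example process (pid 2500) from /proc. */
    FILE *f = fopen("/proc/2500/oom_score", "r");
    if (f != NULL) {
        int score;
        if (fscanf(f, "%d", &score) == 1)
            printf("OOM score: %d\n", score);
        fclose(f);
    }
    return 0;
}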
In general, not only can reaper routines vary how aggressively they reclaim
memory, but the values of the minimum and maximum thresholds can be
varied as well. These values can be set to default values, but some systems
may allow a system administrator to configure them based on the amount of
physical memory in the system.
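
To make the threshold mechanism concrete, here is a minimal user-space simulation of a reaper loop. It is a sketch, not kernel code: the threshold values and the reclaim/allocate helpers are invented for illustration.

#include <stdio.h>

#define MIN_THRESHOLD 1024   /* assumed: start reclaiming below this many free frames */
#define MAX_THRESHOLD 4096   /* assumed: stop reclaiming at this many free frames */

static long free_frames = 2048;          /* simulated free-frame count */

/* Stand-in for an LRU-approximation reclaim of a single page. */
static void reclaim_one_page(void) { free_frames++; }

/* Stand-in for a process allocating one frame from the free list. */
static void allocate_one_frame(void) { free_frames--; }

/* If the free list has dropped below the minimum threshold, reclaim pages
   until the maximum threshold is reached; otherwise stay suspended. */
static void maybe_run_reaper(void)
{
    if (free_frames >= MIN_THRESHOLD)
        return;
    printf("reaper: free frames fell to %ld, reclaiming...\n", free_frames);
    while (free_frames < MAX_THRESHOLD)
        reclaim_one_page();
    printf("reaper: suspended again with %ld free frames\n", free_frames);
}

int main(void)
{
    for (int i = 0; i < 5000; i++) {     /* simulate a burst of memory requests */
        allocate_one_frame();
        maybe_run_reaper();
    }
    return 0;
}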

25
Q

Global versus Local Allocation

A

Another important factor in the way frames are allocated to the various pro-
cesses is page replacement. With multiple processes competing for frames, we
can classify page-replacement algorithms into two broad categories: global
replacement and local replacement. Global replacement allows a process to
select a replacement frame from the set of all frames, even if that frame is
currently allocated to some other process; that is, one process can take a frame
from another. Local replacement requires that each process select from only its
own set of allocated frames.
For example, consider an allocation scheme wherein we allow high-
priority processes to select frames from low-priority processes for replacement.
A process can select a replacement from among its own frames or the frames
of any lower-priority process. This approach allows a high-priority process to
increase its frame allocation at the expense of a low-priority process. Whereas
with a local replacement strategy, the number of frames allocated to a process
does not change, with global replacement, a process may happen to select
only frames allocated to other processes, thus increasing the number of frames
allocated to it (assuming that other processes do not choose its frames for
replacement).
One problem with a global replacement algorithm is that the set of pages
in memory for a process depends not only on the paging behavior of that pro-
cess, but also on the paging behavior of other processes. Therefore, the same
process may perform quite differently (for example, taking 0.5 seconds for one
execution and 4.3 seconds for the next execution) because of totally external
circumstances. Such is not the case with a local replacement algorithm. Under
local replacement, the set of pages in memory for a process is affected by the
paging behavior of only that process. Local replacement might hinder a pro-
cess, however, by not making available to it other, less used pages of memory.
Thus, global replacement generally results in greater system throughput. It is
therefore the more commonly used method.

26
Q

MAJOR AND MINOR PAGE FAULTS

A

As described in Section 10.2.1, a page fault occurs when a page does not
have a valid mapping in the address space of a process. Operating systems
generally distinguish between two types of page faults: major and minor
faults. (Windows refers to major and minor faults as hard and soft faults,
respectively.) A major page fault occurs when a page is referenced and the
page is not in memory. Servicing a major page fault requires reading the
desired page from the backing store into a free frame and updating the page
table. Demand paging typically generates an initially high rate of major page
faults.
Minor page faults occur when a process does not have a logical mapping
to a page, yet that page is in memory. Minor faults can occur for one of two
reasons. First, a process may reference a shared library that is in memory, but
the process does not have a mapping to it in its page table. In this instance,
it is only necessary to update the page table to refer to the existing page in
memory. A second cause of minor faults occurs when a page is reclaimed
from a process and placed on the free-frame list, but the page has not yet
been zeroed out and allocated to another process. When this kind of fault
occurs, the frame is removed from the free-frame list and reassigned to the
process. As might be expected, resolving a minor page fault is typically much
less time consuming than resolving a major page fault.
You can observe the number of major and minor page faults in a Linux
system using the command ps -eo min_flt,maj_flt,cmd, which outputs
the number of minor and major page faults, as well as the command that
launched the process. An example output of this ps command appears below:
MINFL   MAJFL  CMD
186509  32     /usr/lib/systemd/systemd-logind
76822   9      /usr/sbin/sshd -D
1937    0      vim 10.tex
699     14     /sbin/auditd -n
Here, it is interesting to note that, for most commands, the number of major
page faults is generally quite low, whereas the number of minor faults is much
higher. This indicates that Linux processes likely take significant advantage
of shared libraries as, once a library is loaded in memory, subsequent page
faults are only minor faults.
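
Besides ps, a process can inspect its own fault counts through getrusage(2), whose ru_minflt and ru_majflt fields report minor and major faults. A minimal sketch:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage ru;

    /* Query the resource usage of the calling process itself. */
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("minor faults: %ld, major faults: %ld\n",
               ru.ru_minflt, ru.ru_majflt);
    return 0;
}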

27
Q

Non-Uniform Memory Access

A

Thus far in our coverage of virtual memory, we have assumed that all main
memory is created equal—or at least that it is accessed equally. On non-
uniform memory access (NUMA) systems with multiple CPUs (Section 1.3.2),
that is not the case. On these systems, a given CPU can access some sections of
main memory faster than it can access others. These performance differences
are caused by how CPUs and memory are interconnected in the system. Such a
system is made up of multiple CPUs, each with its own local memory (Figure
10.19). The CPUs are organized using a shared system interconnect, and as
you might expect, a CPU can access its local memory faster than memory local
to another CPU. NUMA systems are without exception slower than systems in
which all accesses to main memory are treated equally. However, as described
in Section 1.3.2, NUMA systems can accommodate more CPUs and therefore
achieve greater levels of throughput and parallelism. Managing which page frames are stored at which locations can signifi-
cantly affect performance in NUMA systems. If we treat memory as uniform
in such a system, CPUs may wait significantly longer for memory access than
if we modify memory allocation algorithms to take NUMA into account. We
described some of these modifications in Section 5.5.4. Their goal is to have
memory frames allocated “as close as possible” to the CPU on which the process
is running. (The definition of close is “with minimum latency,” which typically
means on the same system board as the CPU). Thus, when a process incurs a
page fault, a NUMA-aware virtual memory system will allocate that process a
frame as close as possible to the CPU on which the process is running.
To take NUMA into account, the scheduler must track the last CPU on
which each process ran. If the scheduler tries to schedule each process onto
its previous CPU, and the virtual memory system tries to allocate frames for
the process close to the CPU on which it is being scheduled, then improved
cache hits and decreased memory access times will result.
The picture is more complicated once threads are added. For example, a
process with many running threads may end up with those threads scheduled
on many different system boards. How should the memory be allocated in this
case?
As we discussed in Section 5.7.1, Linux manages this situation by having
the kernel identify a hierarchy of scheduling domains. The Linux CFS scheduler
does not allow threads to migrate across different domains and thus incur
memory access penalties. Linux also has a separate free-frame list for each
NUMA node, thereby ensuring that a thread will be allocated memory from the
node on which it is running. Solaris solves the problem similarly by creating
lgroups (for “locality groups”) in the kernel. Each lgroup gathers together
CPUs and memory, and each CPU in that group can access any memory in
the group within a defined latency interval. In addition, there is a hierarchy
of lgroups based on the amount of latency between the groups, similar to the
hierarchy of scheduling domains in Linux. Solaris tries to schedule all threads
of a process and allocate all memory of a process within an lgroup. If that is
not possible, it picks nearby lgroups for the rest of the resources needed. This
practice minimizes overall memory latency and maximizes CPU cache hit rates.
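
As a hedged sketch of NUMA-aware allocation from user space (assuming a Linux system with libnuma installed; link with -lnuma), a thread can ask which node its current CPU belongs to and allocate memory from that node:

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>     /* sched_getcpu */
#include <numa.h>      /* numa_available, numa_node_of_cpu, numa_alloc_onnode */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);

    /* 1 MB allocated from the local node, keeping accesses "as close as possible". */
    size_t len = 1024 * 1024;
    void *buf = numa_alloc_onnode(len, node);
    if (buf != NULL) {
        printf("allocated %zu bytes on node %d (cpu %d)\n", len, node, cpu);
        numa_free(buf, len);
    }
    return 0;
}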

28
Q

Thrashing

A

Consider what occurs if a process does not have “enough” frames—that is, it
does not have the minimum number of frames it needs to support pages in the
working set. The process will quickly page-fault. At this point, it must replace
some page. However, since all its pages are in active use, it must replace a page
that will be needed again right away. Consequently, it quickly faults again, and
again, and again, replacing pages that it must bring back in immediately.
This high paging activity is called thrashing. A process is thrashing if it
is spending more time paging than executing. As you might expect, thrashing
results in severe performance problems.

29
Q

Cause of Thrashing

A

Consider the following scenario, which is based on the actual behavior of early
paging systems. The operating system monitors CPU utilization. If CPU utilization is too low, we increase the degree of multiprogramming by introducing
a new process to the system. A global page-replacement algorithm is used;
it replaces pages without regard to the process to which they belong. Now
suppose that a process enters a new phase in its execution and needs more
frames. It starts faulting and taking frames away from other processes. These
processes need those pages, however, and so they also fault, taking frames from
other processes. These faulting processes must use the paging device to swap
pages in and out. As they queue up for the paging device, the ready queue
empties. As processes wait for the paging device, CPU utilization decreases.
The CPU scheduler sees the decreasing CPU utilization and increases the
degree of multiprogramming as a result. The new process tries to get started by
taking frames from running processes, causing more page faults and a longer
queue for the paging device. As a result, CPU utilization drops even further,
and the CPU scheduler tries to increase the degree of multiprogramming even
more. Thrashing has occurred, and system throughput plunges. The page-
fault rate increases tremendously. As a result, the effective memory-access time
increases. No work is getting done, because the processes are spending all their
time paging.
This phenomenon is illustrated in Figure 10.20, in which CPU utilization
is plotted against the degree of multiprogramming. As the degree of multi-
programming increases, CPU utilization also increases, although more slowly,
until a maximum is reached. If the degree of multiprogramming is increased
further, thrashing sets in, and CPU utilization drops sharply. At this point, to
increase CPU utilization and stop thrashing, we must decrease the degree of
multiprogramming.
We can limit the effects of thrashing by using a local replacement algo-
rithm (or priority replacement algorithm). As mentioned earlier, local replace-
ment requires that each process select from only its own set of allocated frames.
Thus, if one process starts thrashing, it cannot steal frames from another pro-
cess and cause the latter to thrash as well. However, the problem is not entirely
solved. If processes are thrashing, they will be in the queue for the paging
device most of the time. The average service time for a page fault will increase because of the longer average queue for the paging device. Thus, the effective
access time will increase even for a process that is not thrashing.
To prevent thrashing, we must provide a process with as many frames as
it needs. But how do we know how many frames it “needs”? One strategy
starts by looking at how many frames a process is actually using. This approach
defines the locality model of process execution.
The locality model states that, as a process executes, it moves from locality
to locality. A locality is a set of pages that are actively used together. A running
program is generally composed of several different localities, which may over-
lap. For example, when a function is called, it defines a new locality. In this locality, memory references are made to the instructions of the function call, its
local variables, and a subset of the global variables. When we exit the function,
the process leaves this locality, since the local variables and instructions of the
function are no longer in active use. We may return to this locality later.
Figure 10.21 illustrates the concept of locality and how a process’s
locality changes over time. At time (a), the locality is the set of pages
{18, 19, 20, 21, 22, 23, 24, 29, 30, 33}. At time (b), the locality changes to
{18, 19, 20, 24, 25, 26, 27, 28, 29, 31, 32, 33}. Notice the overlap, as some pages
(for example, 18, 19, and 20) are part of both localities.
Thus, we see that localities are defined by the program structure and its
data structures. The locality model states that all programs will exhibit this
basic memory reference structure. Note that the locality model is the unstated
principle behind the caching discussions so far in this book. If accesses to any
types of data were random rather than patterned, caching would be useless.
Suppose we allocate enough frames to a process to accommodate its cur-
rent locality. It will fault for the pages in its locality until all these pages are
in memory; then, it will not fault again until it changes localities. If we do
not allocate enough frames to accommodate the size of the current locality,
the process will thrash, since it cannot keep in memory all the pages that it is
actively using.

30
Q

Working-Set Model

A

The working-set model is based on the assumption of locality. This model
uses a parameter, Δ, to define the working-set window. The idea is to examine
the most recent Δ page references. The set of pages in the most recent Δ page
references is the working set (Figure 10.22). If a page is in active use, it will be in
the working set. If it is no longer being used, it will drop from the working set
Δ time units after its last reference. Thus, the working set is an approximation
of the program’s locality.
For example, given the sequence of memory references shown in Figure
10.22, if Δ = 10 memory references, then the working set at time t1 is {1, 2, 5,
6, 7}. By time t2 , the working set has changed to {3, 4}.
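
A minimal sketch of this definition in C: scan the last Δ references and count the distinct pages. The reference string below is illustrative rather than the exact string of Figure 10.22, but it reproduces the two working sets of the example.

#include <stdio.h>
#include <stdbool.h>

#define DELTA 10

/* Number of distinct pages among refs[t-DELTA .. t-1]. */
static int working_set_size(const int refs[], int t)
{
    bool in_ws[64] = { false };   /* assume page numbers < 64 for the sketch */
    int wss = 0;
    int start = (t >= DELTA) ? t - DELTA : 0;
    for (int i = start; i < t; i++) {
        if (!in_ws[refs[i]]) {
            in_ws[refs[i]] = true;
            wss++;
        }
    }
    return wss;
}

int main(void)
{
    int refs[] = { 1, 2, 5, 6, 7, 7, 7, 7, 5, 1,    /* locality around {1,2,5,6,7} */
                   3, 4, 3, 4, 4, 4, 3, 4, 3, 4 };  /* locality around {3,4}       */
    int n = sizeof(refs) / sizeof(refs[0]);

    for (int t = DELTA; t <= n; t += DELTA)
        printf("WSS at t=%d: %d\n", t, working_set_size(refs, t));
    return 0;
}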
The accuracy of the working set depends on the selection of Δ. If Δ is too
small, it will not encompass the entire locality; if Δ is too large, it may overlap
several localities. In the extreme, if Δ is infinite, the working set is the set of
pages touched during the process execution.
The most important property of the working set, then, is its size. If we
compute the working-set size, WSSi, for each process in the system, we can
then consider that
D = ∑ WSSi,
where D is the total demand for frames. Each process is actively using the pages
in its working set. Thus, process i needs WSSi frames. If the total demand is
greater than the total number of available frames (D > m), thrashing will occur,
because some processes will not have enough frames.
Once Δ has been selected, use of the working-set model is simple. The
operating system monitors the working set of each process and allocates to
that working set enough frames to provide it with its working-set size. If there
are enough extra frames, another process can be initiated. If the sum of the
working-set sizes increases, exceeding the total number of available frames,
the operating system selects a process to suspend. The process’s pages are
written out (swapped), and its frames are reallocated to other processes. The
suspended process can be restarted later.
This working-set strategy prevents thrashing while keeping the degree of
multiprogramming as high as possible. Thus, it optimizes CPU utilization. The
difficulty with the working-set model is keeping track of the working set. The working-set window is a moving window. At each memory reference, a new
reference appears at one end, and the oldest reference drops off the other end.
A page is in the working set if it is referenced anywhere in the working-set
window.
We can approximate the working-set model with a fixed-interval timer
interrupt and a reference bit. For example, assume that Δ equals 10,000 ref-
erences and that we can cause a timer interrupt every 5,000 references. When
we get a timer interrupt, we copy and clear the reference-bit values for each
page. Thus, if a page fault occurs, we can examine the current reference bit
and two in-memory bits to determine whether a page was used within the last
10,000 to 15,000 references. If it was used, at least one of these bits will be on.
If it has not been used, these bits will be off. Pages with at least one bit on will
be considered to be in the working set.
Note that this arrangement is not entirely accurate, because we cannot tell
where, within an interval of 5,000, a reference occurred. We can reduce the
uncertainty by increasing the number of history bits and the frequency of inter-
rupts (for example, 10 bits and interrupts every 1,000 references). However, the
cost to service these more frequent interrupts will be correspondingly higher.
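
The interrupt-and-reference-bit approximation can be sketched in a few lines. This is a user-space simulation with assumed values (8 pages, 2 history bits); in a real system the reference bits live in the hardware page-table entries.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NPAGES       8
#define HISTORY_BITS 2          /* two in-memory bits, as in the example above */

static bool    reference_bit[NPAGES];   /* stand-in for the hardware reference bits */
static uint8_t history[NPAGES];         /* shifted-in history, one entry per page   */

/* Simulated memory reference: real hardware sets the reference bit itself. */
static void touch(int page) { reference_bit[page] = true; }

/* Periodic timer interrupt (e.g., every 5,000 references):
   copy each reference bit into the page's history, then clear it. */
static void on_timer_interrupt(void)
{
    for (int p = 0; p < NPAGES; p++) {
        history[p] = (uint8_t)(((history[p] << 1) | (reference_bit[p] ? 1 : 0))
                               & ((1u << HISTORY_BITS) - 1));
        reference_bit[p] = false;
    }
}

/* A page is treated as in the working set if it was referenced in any of
   the remembered intervals or in the current (not yet copied) interval. */
static bool in_working_set(int page)
{
    return history[page] != 0 || reference_bit[page];
}

int main(void)
{
    touch(1); touch(3);
    on_timer_interrupt();       /* interval 1: pages 1 and 3 referenced */
    touch(3);
    on_timer_interrupt();       /* interval 2: only page 3 referenced   */
    for (int p = 0; p < NPAGES; p++)
        if (in_working_set(p))
            printf("page %d is in the working set\n", p);
    return 0;
}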

31
Q

WORKING SETS AND PAGE-FAULT RATES

A

There is a direct relationship between the working set of a process and its
page-fault rate. Typically, as shown in Figure 10.22, the working set of a
process changes over time as references to data and code sections move from
one locality to another. Assuming there is sufficient memory to store the
working set of a process (that is, the process is not thrashing), the page-fault
rate of the process will transition between peaks and valleys over time. A peak in the page-fault rate occurs when we begin demand-paging a new
locality. However, once the working set of this new locality is in memory, the
page-fault rate falls. When the process moves to a new working set, the page-
fault rate rises toward a peak once again, returning to a lower rate once the
new working set is loaded into memory. The span of time between the start
of one peak and the start of the next peak represents the transition from one
working set to another.

32
Q

Page-Fault Frequency

A

The working-set model is successful, and knowledge of the working set can
be useful for prepaging (Section 10.9.1), but it seems a clumsy way to control
thrashing. A strategy that uses the page-fault frequency (PFF) takes a more
direct approach.
The specific problem is how to prevent thrashing. Thrashing has a high
page-fault rate. Thus, we want to control the page-fault rate. When it is too
high, we know that the process needs more frames. Conversely, if the page-
fault rate is too low, then the process may have too many frames. We can
establish upper and lower bounds on the desired page-fault rate (Figure 10.23).
If the actual page-fault rate exceeds the upper limit, we allocate the process another frame. If the page-fault rate falls below the lower limit, we remove a
frame from the process. Thus, we can directly measure and control the page-
fault rate to prevent thrashing.
As with the working-set strategy, we may have to swap out a process. If the
page-fault rate increases and no free frames are available, we must select some
process and swap it out to backing store. The freed frames are then distributed
to processes with high page-fault rates.
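
A minimal sketch of the PFF control decision, with assumed upper and lower bounds expressed as faults per reference; a real system would also track a global free-frame pool and swap out a victim when no free frames remain.

#include <stdio.h>

#define PFF_UPPER 0.10    /* assumed: faults per reference, upper bound */
#define PFF_LOWER 0.01    /* assumed: faults per reference, lower bound */

struct proc {
    int  frames;          /* current frame allocation */
    long faults;          /* page faults in the last measurement interval */
    long references;      /* memory references in the last measurement interval */
};

/* Adjust a process's allocation from its measured page-fault frequency.
   Returns the change in frames (+1, 0, or -1) so a caller could update a
   global free-frame pool and decide when a process must be swapped out. */
static int pff_adjust(struct proc *p)
{
    double rate = p->references > 0 ? (double)p->faults / p->references : 0.0;

    if (rate > PFF_UPPER) {                     /* too many faults: needs more frames */
        p->frames++;
        return +1;
    }
    if (rate < PFF_LOWER && p->frames > 1) {    /* very few faults: frames to spare   */
        p->frames--;
        return -1;
    }
    return 0;                                   /* within bounds: leave allocation alone */
}

int main(void)
{
    struct proc thrashing = { .frames = 4, .faults = 500, .references = 1000 };
    struct proc idle      = { .frames = 8, .faults = 1,   .references = 1000 };

    pff_adjust(&thrashing);
    pff_adjust(&idle);
    printf("thrashing process now has %d frames, idle process %d\n",
           thrashing.frames, idle.frames);
    return 0;
}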

33
Q

Current Practice

A

Practically speaking, thrashing and the resulting swapping have a disagreeably
high impact on performance. The current best practice in implementing a
computer system is to include enough physical memory, whenever possible,
to avoid thrashing and swapping. From smartphones through large servers,
providing enough memory to keep all working sets in memory concurrently,
except under extreme conditions, provides the best user experience.